1. 17 Oct, 2023 1 commit
    • netlink: Correct offload_xstats size · 503930f8
      Christoph Paasch authored
      rtnl_offload_xstats_get_size_hw_s_info_one() conditionalizes the
      size-computation for IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED based on whether
      or not the device has offload_xstats enabled.
      
      However, rtnl_offload_xstats_fill_hw_s_info_one() is adding the u8 for
      that field unconditionally.
      
      syzkaller triggered a WARNING in rtnl_stats_get due to this:
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 754 at net/core/rtnetlink.c:5982 rtnl_stats_get+0x2f4/0x300
      Modules linked in:
      CPU: 0 PID: 754 Comm: syz-executor148 Not tainted 6.6.0-rc2-g331b78eb12af #45
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      RIP: 0010:rtnl_stats_get+0x2f4/0x300 net/core/rtnetlink.c:5982
      Code: ff ff 89 ee e8 7d 72 50 ff 83 fd a6 74 17 e8 33 6e 50 ff 4c 89 ef be 02 00 00 00 e8 86 00 fa ff e9 7b fe ff ff e8 1c 6e 50 ff <0f> 0b eb e5 e8 73 79 7b 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90
      RSP: 0018:ffffc900006837c0 EFLAGS: 00010293
      RAX: ffffffff81cf7f24 RBX: ffff8881015d9000 RCX: ffff888101815a00
      RDX: 0000000000000000 RSI: 00000000ffffffa6 RDI: 00000000ffffffa6
      RBP: 00000000ffffffa6 R08: ffffffff81cf7f03 R09: 0000000000000001
      R10: ffff888101ba47b9 R11: ffff888101815a00 R12: ffff8881017dae00
      R13: ffff8881017dad00 R14: ffffc90000683ab8 R15: ffffffff83c1f740
      FS:  00007fbc22dbc740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000046 CR3: 000000010264e003 CR4: 0000000000170ef0
      Call Trace:
       <TASK>
       rtnetlink_rcv_msg+0x677/0x710 net/core/rtnetlink.c:6480
       netlink_rcv_skb+0xea/0x1c0 net/netlink/af_netlink.c:2545
       netlink_unicast+0x430/0x500 net/netlink/af_netlink.c:1342
       netlink_sendmsg+0x4fc/0x620 net/netlink/af_netlink.c:1910
       sock_sendmsg+0xa8/0xd0 net/socket.c:730
       ____sys_sendmsg+0x22a/0x320 net/socket.c:2541
       ___sys_sendmsg+0x143/0x190 net/socket.c:2595
       __x64_sys_sendmsg+0xd8/0x150 net/socket.c:2624
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x6e/0xd8
      RIP: 0033:0x7fbc22e8d6a9
      Code: 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 37 0d 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffc4320e778 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004007d0 RCX: 00007fbc22e8d6a9
      RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000004007d0
      R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc4320e898
      R13: 00007ffc4320e8a8 R14: 00000000004004a0 R15: 00007fbc22fa5a80
       </TASK>
      ---[ end trace 0000000000000000 ]---
      
      This did not happen prior to commit bf9f1baa ("net: add dedicated
      kmem_cache for typical/small skb->head") because the skb was always
      large enough.
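
The mismatch described above can be sketched in userspace C. This is only an illustration, not the kernel code: the `sketch_` names are hypothetical stand-ins, and the model assumes the usual netlink layout of a 4-byte attribute header with 4-byte alignment. It shows a size estimate that counts the u8 attribute only when offload xstats are enabled, next to a fill path that always writes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed netlink attribute header length (hypothetical stand-in). */
#define SKETCH_NLA_HDRLEN 4u

/* Bytes one attribute with `payload` bytes of data occupies, 4-byte aligned. */
size_t sketch_nla_total_size(size_t payload)
{
    return (SKETCH_NLA_HDRLEN + payload + 3) & ~(size_t)3;
}

/* Size estimate mirroring the bug: the u8 is counted only when enabled. */
size_t sketch_size_conditional(bool xstats_enabled)
{
    return xstats_enabled ? sketch_nla_total_size(1) : 0;
}

/* Size estimate that always counts the u8, matching the fill path. */
size_t sketch_size_unconditional(void)
{
    return sketch_nla_total_size(1);
}

/* Fill path: the u8 attribute is written regardless of the enabled state. */
size_t sketch_fill_bytes(void)
{
    return sketch_nla_total_size(1);
}

/* Analogue of the kernel WARNING: the estimate must cover what gets filled. */
void sketch_require_room(size_t estimate)
{
    assert(estimate >= sketch_fill_bytes());
}
```

With xstats disabled, the conditional estimate is 8 bytes short of what the fill path writes, which only becomes visible once the allocated buffer has no slack.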
      
      Fixes: 0e7788fd ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats")
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Link: https://lore.kernel.org/r/20231013041448.8229-1-cpaasch@apple.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 16 Oct, 2023 1 commit
  3. 15 Oct, 2023 7 commits
    • Merge branch 'ovs-selftests' · 883f0dc0
      David S. Miller authored
      From: Aaron Conole <aconole@redhat.com>
      To: netdev@vger.kernel.org
      Cc: dev@openvswitch.org, linux-kselftest@vger.kernel.org,
      	linux-kernel@vger.kernel.org, Pravin B Shelar <pshelar@ovn.org>,
      	"David S. Miller" <davem@davemloft.net>,
      	Eric Dumazet <edumazet@google.com>,
      	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
      	Adrian Moreno <amorenoz@redhat.com>,
      	Eelco Chaudron <echaudro@redhat.com>,
      	shuah@kernel.org
      Subject: [PATCH net v2 0/4] selftests: openvswitch: Minor fixes for some systems
      Date: Wed, 11 Oct 2023 15:49:35 -0400
      Message-ID: <20231011194939.704565-1-aconole@redhat.com>
      
      A number of corner cases were caught when trying to run the selftests on
      older systems.  Missed skip conditions, some error cases, and outdated
      python setups would all report failures but the issue would actually be
      related to some other condition rather than the selftest suite.
      
      Address these individual cases.
      ====================
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Fix the ct_tuple for v4 · 8eff0e06
      Aaron Conole authored
      The ct_tuple v4 data structure decode / encode routines were using
      the v6 IP address decode and relying on default encode. This could
      cause exceptions during encode / decode depending on how a ct4
      tuple would appear in a netlink message.
      
      Caught during code review.
      
      Fixes: e52b07aa ("selftests: openvswitch: add flow dump support")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Skip drop testing on older kernels · 76035fd1
      Aaron Conole authored
      Kernels that don't have support for openvswitch drop reasons also
      won't have the drop counter reasons, so we should skip the test
      completely.  It previously wasn't possible to build a test case
      for this without polluting the datapath, so we introduce a mechanism
      to clear all the flows from a datapath allowing us to test for
      explicit drop actions, and then clear the flows to build the
      original test case.
      
      Fixes: 42420291 ("selftests: openvswitch: add explicit drop testcase")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Catch cases where the tests are killed · af846afa
      Aaron Conole authored
      In case of a fatal signal or an early abort, at least clean up the
      current test case.
      
      Fixes: 25f16c87 ("selftests: add openvswitch selftest suite")
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: Add version check for pyroute2 · 92e37f20
      Aaron Conole authored
      Paolo Abeni reports that on some systems the pyroute2 version isn't
      new enough to run the test suite.  Ensure that we support a minimum
      version of 0.6 for all cases (which does include the existing ones).
      The 0.6.1 version was released in May of 2021, so should be
      propagated to most installations at this point.
      
      The alternative that Paolo proposed was to only skip when the
      add-flow is being run.  This would be okay for most cases, except
      if a future test case is added that needs to do flow dump without
      an associated add (just guessing).  In that case, it could also be
      broken and we would need additional skip logic anyway.  Just draw
      a line in the sand now.
      
      Fixes: 25f16c87 ("selftests: add openvswitch selftest suite")
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Closes: https://lore.kernel.org/lkml/8470c431e0930d2ea204a9363a60937289b7fdbe.camel@redhat.com/
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation · fc8b2a61
      Willem de Bruijn authored
      Syzbot reported two new paths to hit an internal WARNING using the
      new virtio gso type VIRTIO_NET_HDR_GSO_UDP_L4.
      
          RIP: 0010:skb_checksum_help+0x4a2/0x600 net/core/dev.c:3260
          skb len=64521 gso_size=344
      and
      
          RIP: 0010:skb_warn_bad_offload+0x118/0x240 net/core/dev.c:3262
      
      Older virtio types have historically had loose restrictions, leading
      to many entirely impractical fuzzer generated packets causing
      problems deep in the kernel stack. Ideally, we would have had strict
      validation for all types from the start.
      
      New virtio types can have tighter validation. Limit UDP GSO packets
      inserted via virtio to the same limits imposed by the UDP_SEGMENT
      socket interface:
      
      1. must use checksum offload
      2. checksum offload matches UDP header
      3. no more segments than UDP_MAX_SEGMENTS
      4. UDP GSO does not take modifier flags, notably SKB_GSO_TCP_ECN
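
The four limits listed above can be sketched as a single validation function. This is a userspace illustration only: the struct, the flag values, and the segment limit of 64 are assumptions made for the sketch, not the kernel's definitions. The checksum offset of 6 is the position of the checksum field in a UDP header. Treating the reported skb length as payload length is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for this sketch; the values are assumptions. */
#define SKETCH_F_NEEDS_CSUM     0x1u  /* checksum offload requested */
#define SKETCH_GSO_UDP_L4       0x1u  /* base UDP GSO type */
#define SKETCH_GSO_TCP_ECN      0x2u  /* example modifier flag */
#define SKETCH_UDP_CSUM_OFFSET  6u    /* checksum field offset in a UDP header */
#define SKETCH_UDP_MAX_SEGMENTS 64u   /* assumed segment limit */

/* Hypothetical summary of the fields the checks look at. */
struct sketch_pkt {
    uint32_t flags;
    uint32_t gso_type;
    uint32_t csum_offset;
    uint32_t payload_len;
    uint32_t gso_size;
};

/* Returns true only if the packet satisfies all four limits. */
bool sketch_udp_gso_ok(const struct sketch_pkt *p)
{
    /* 1. must use checksum offload */
    if (!(p->flags & SKETCH_F_NEEDS_CSUM))
        return false;
    /* 2. checksum offload matches the UDP header */
    if (p->csum_offset != SKETCH_UDP_CSUM_OFFSET)
        return false;
    /* 3. no more segments than the maximum */
    if (p->payload_len > (uint64_t)p->gso_size * SKETCH_UDP_MAX_SEGMENTS)
        return false;
    /* 4. no modifier flags on top of the base type */
    if (p->gso_type != SKETCH_GSO_UDP_L4)
        return false;
    return true;
}
```

Under these assumptions, the reported packet with len=64521 and gso_size=344 fails the third check, since 344 * 64 = 22016 bytes is far less than 64521.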
      
      Fixes: 860b7f27 ("linux/virtio_net.h: Support USO offload in vnet header.")
      Reported-by: syzbot+01cdbc31e9c0ae9b33ac@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/netdev/0000000000005039270605eb0b7f@google.com/
      Reported-by: syzbot+c99d835ff081ca30f986@syzkaller.appspotmail.com
      Closes: https://lore.kernel.org/netdev/0000000000005426680605eb0b9f@google.com/
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: fix LL2 RX buffer allocation · 2f3389c7
      Manish Chopra authored
      The driver allocates the LL2 rx buffers from the kmalloc() area and
      constructs the skb using slab_build_skb().

      The allocation size appears to overlook both the skb_shared_info size
      and the device placement padding bytes, which results in the panic
      below when doing skb_put() for a standard MTU sized frame.
      
      skbuff: skb_over_panic: text:ffffffffc0b0225f len:1514 put:1514
      head:ff3dabceaf39c000 data:ff3dabceaf39c042 tail:0x62c end:0x566
      dev:<NULL>
      …
      skb_panic+0x48/0x4a
      skb_put.cold+0x10/0x10
      qed_ll2b_complete_rx_packet+0x14f/0x260 [qed]
      qed_ll2_rxq_handle_completion.constprop.0+0x169/0x200 [qed]
      qed_ll2_rxq_completion+0xba/0x320 [qed]
      qed_int_sp_dpc+0x1a7/0x1e0 [qed]
      
      This patch fixes this by accounting for the skb_shared_info and device
      placement padding size bytes when allocating the buffers.
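
The numbers in the skb_over_panic line can be checked with a small userspace model. This is a sketch under assumptions: `struct sketch_buf` and the two helpers are hypothetical, and offsets are taken relative to the buffer head, as the tail and end values in the log appear to be. The data offset 0x42 comes from the difference between the head and data pointers in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Offsets relative to the buffer head (hypothetical model of an skb). */
struct sketch_buf {
    size_t data;  /* where the frame starts, after any padding in front */
    size_t tail;  /* end of the data written so far */
    size_t end;   /* end of the usable area */
};

/* Advances tail by `len` if it fits before `end`; returns false otherwise. */
bool sketch_put(struct sketch_buf *b, size_t len)
{
    if (b->tail + len > b->end)
        return false;  /* the kernel would report skb_over_panic here */
    b->tail += len;
    return true;
}

/* Usable area needed: the padding in front plus the frame itself. */
size_t sketch_required_end(size_t pad, size_t frame_len)
{
    return pad + frame_len;
}
```

With data at 0x42 and end at 0x566 (1382), a 1514-byte put would move tail to 0x62c (1580), which is 198 bytes past the end. That matches the tail and end values in the log.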
      
      Cc: David S. Miller <davem@davemloft.net>
      Fixes: 0a7fb11c ("qed: Add Light L2 support")
      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 14 Oct, 2023 9 commits
  5. 13 Oct, 2023 9 commits
  6. 12 Oct, 2023 13 commits
    • Merge tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · e8c127b0
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from CAN and BPF.
      
        We have a regression in TC currently under investigation, otherwise
        the things that stand out most are probably the TCP and AF_PACKET
        fixes, with both issues coming from 6.5.
      
        Previous releases - regressions:
      
         - af_packet: fix fortified memcpy() without flex array.
      
         - tcp: fix crashes trying to free half-baked MTU probes
      
         - xdp: fix zero-size allocation warning in xskq_create()
      
         - can: sja1000: always restart the tx queue after an overrun
      
         - eth: mlx5e: again mutually exclude RX-FCS and RX-port-timestamp
      
         - eth: nfp: avoid rmmod nfp crash issues
      
         - eth: octeontx2-pf: fix page pool frag allocation warning
      
        Previous releases - always broken:
      
         - mctp: perform route lookups under a RCU read-side lock
      
         - bpf: s390: fix clobbering the caller's backchain in the trampoline
      
         - phy: lynx-28g: cancel the CDR check work item on the remove path
      
         - dsa: qca8k: fix qca8k driver for Turris 1.x
      
         - eth: ravb: fix use-after-free issue in ravb_tx_timeout_work()
      
         - eth: ixgbe: fix crash with empty VF macvlan list"
      
      * tag 'net-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (54 commits)
        rswitch: Fix imbalance phy_power_off() calling
        rswitch: Fix renesas_eth_sw_remove() implementation
        octeontx2-pf: Fix page pool frag allocation warning
        nfc: nci: assert requested protocol is valid
        af_packet: Fix fortified memcpy() without flex array.
        net: tcp: fix crashes trying to free half-baked MTU probes
        net/smc: Fix pos miscalculation in statistics
        nfp: flower: avoid rmmod nfp crash issues
        net: usb: dm9601: fix uninitialized variable use in dm9601_mdio_read
        ethtool: Fix mod state of verbose no_mask bitset
        net: nfc: fix races in nfc_llcp_sock_get() and nfc_llcp_sock_get_sn()
        mctp: perform route lookups under a RCU read-side lock
        net: skbuff: fix kernel-doc typos
        s390/bpf: Fix unwinding past the trampoline
        s390/bpf: Fix clobbering the caller's backchain in the trampoline
        net/mlx5e: Again mutually exclude RX-FCS and RX-port-timestamp
        net/smc: Fix dependency of SMC on ISM
        ixgbe: fix crash with empty VF macvlan list
        net/mlx5e: macsec: use update_pn flag instead of PN comparation
        net: phy: mscc: macsec: reject PN update requests
        ...
    • Merge tag 'soc-fixes-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 9a5a1494
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "AngeloGioacchino Del Regno is stepping in as co-maintainer for the
        MediaTek SoC platform and starts by sending some dts fixes for the
        mt8195 platform that had been pending for a while.
      
        On the ixp4xx platform, Krzysztof Halasa steps down as co-maintainer,
        reflecting that Linus Walleij has been handling this on his own for
        the past few years.
      
        Generic RISC-V kernels are now marked as incompatible with the RZ/Five
        platform that requires custom hacks both for managing its DMA bounce
        buffers and for addressing low virtual memory.
      
       Finally, there is one bugfix for the AMDTEE firmware driver to prevent
       a use-after-free bug"
      
      * tag 'soc-fixes-6.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        IXP4xx MAINTAINERS entries
        arm64: dts: mediatek: mt8195: Set DSU PMU status to fail
        arm64: dts: mediatek: fix t-phy unit name
        arm64: dts: mediatek: mt8195-demo: update and reorder reserved memory regions
        arm64: dts: mediatek: mt8195-demo: fix the memory size to 8GB
        MAINTAINERS: Add Angelo as MediaTek SoC co-maintainer
        soc: renesas: Make ARCH_R9A07G043 (riscv version) depend on NONPORTABLE
        tee: amdtee: fix use-after-free vulnerability in amdtee_close_session
    • Merge tag 'pmdomain-v6.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm · 9b1ad4ba
      Linus Torvalds authored
      Pull pmdomain fix from Ulf Hansson:
      
       - imx: scu-pd: Correct the DMA2 channel
      
      * tag 'pmdomain-v6.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
        pmdomain: imx: scu-pd: correct DMA2 channel
    • net/mlx5e: Fix VF representors reporting zero counters to "ip -s" command · 80f12414
      Amir Tzin authored
      Although the vf_vport entry of struct mlx5e_stats is never updated, its
      values are mistakenly copied to the caller structure in the VF
      representor .ndo_get_stats64 callback mlx5e_rep_get_stats(). Remove the
      redundant entry and use the updated one, rep_stats, instead.
      
      Fixes: 64b68e36 ("net/mlx5: Refactor and expand rep vport stat group")
      Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
      Signed-off-by: Amir Tzin <amirtz@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Don't offload internal port if filter device is out device · 06b4eac9
      Jianbo Liu authored
      In the cited commit, if the routing device is ovs internal port, the
      out device is set to uplink, and packets go out after encapsulation.
      
      If filter device is uplink, it can trigger the following syndrome:
      mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 3966): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xcdb051), err(-22)
      
      Fix this issue by not offloading internal port if filter device is out
      device. In this case, packets are not forwarded to the root table to
      be processed, the termination table is used instead to forward them
      from uplink to uplink.
      
      Fixes: 100ad4e2 ("net/mlx5e: Offload internal port as encap route device")
      Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
      Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: Take RTNL lock before triggering netdev notifiers · c51c6734
      Lama Kayal authored
      Hold RTNL lock when calling xdp_set_features() with a registered netdev,
      as the call triggers the netdev notifiers. This could happen when
      switching from nic profile to uplink representor for example.
      
      Similar logic, which fixed a similar scenario, was previously introduced
      in commit 72cc6549 ("net/mlx5e: Take RTNL lock when needed before
      calling xdp_set_features()").
      
      This fixes the following assertion and warning call trace:
      
      RTNL: assertion failed at net/core/dev.c (1961)
      WARNING: CPU: 13 PID: 2529 at net/core/dev.c:1961
      call_netdevice_notifiers_info+0x7c/0x80
      Modules linked in: rpcrdma rdma_ucm ib_iser libiscsi
      scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib
      ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink
      nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5
      auth_rpcgss oid_registry overlay mlx5_core zram zsmalloc fuse
      CPU: 13 PID: 2529 Comm: devlink Not tainted
      6.5.0_for_upstream_min_debug_2023_09_07_20_04 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
      rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      RIP: 0010:call_netdevice_notifiers_info+0x7c/0x80
      Code: 8f ff 80 3d 77 0d 16 01 00 75 c5 ba a9 07 00 00 48
      c7 c6 c4 bb 0d 82 48 c7 c7 18 c8 06 82 c6 05 5b 0d 16 01 01 e8 44 f6 8c
      ff <0f> 0b eb a2 0f 1f 44 00 00 55 48 89 e5 41 54 48 83 e4 f0 48 83 ec
      RSP: 0018:ffff88819930f7f0 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffffffff8309f740 RCX: 0000000000000027
      RDX: ffff88885fb5b5c8 RSI: 0000000000000001 RDI: ffff88885fb5b5c0
      RBP: 0000000000000028 R08: ffff88887ffabaa8 R09: 0000000000000003
      R10: ffff88887fecbac0 R11: ffff88887ff7bac0 R12: ffff88819930f810
      R13: ffff88810b7fea40 R14: ffff8881154e8fd8 R15: ffff888107e881a0
      FS:  00007f3ad248f800(0000) GS:ffff88885fb40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000563b85f164e0 CR3: 0000000113b5c006 CR4: 0000000000370ea0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ? __warn+0x79/0x120
       ? call_netdevice_notifiers_info+0x7c/0x80
       ? report_bug+0x17c/0x190
       ? handle_bug+0x3c/0x60
       ? exc_invalid_op+0x14/0x70
       ? asm_exc_invalid_op+0x16/0x20
       ? call_netdevice_notifiers_info+0x7c/0x80
       call_netdevice_notifiers+0x2e/0x50
       mlx5e_set_xdp_feature+0x21/0x50 [mlx5_core]
       mlx5e_build_rep_params+0x97/0x130 [mlx5_core]
       mlx5e_init_ul_rep+0x9f/0x100 [mlx5_core]
       mlx5e_netdev_init_profile+0x76/0x110 [mlx5_core]
       mlx5e_netdev_attach_profile+0x1f/0x90 [mlx5_core]
       mlx5e_netdev_change_profile+0x92/0x160 [mlx5_core]
       mlx5e_vport_rep_load+0x329/0x4a0 [mlx5_core]
       mlx5_esw_offloads_rep_load+0x9e/0xf0 [mlx5_core]
       esw_offloads_enable+0x4bc/0xe90 [mlx5_core]
       mlx5_eswitch_enable_locked+0x3c8/0x570 [mlx5_core]
       ? kmalloc_trace+0x25/0x80
       mlx5_devlink_eswitch_mode_set+0x224/0x680 [mlx5_core]
       ? devlink_get_from_attrs_lock+0x9e/0x110
       devlink_nl_cmd_eswitch_set_doit+0x60/0xe0
       genl_family_rcv_msg_doit+0xd0/0x120
       genl_rcv_msg+0x180/0x2b0
       ? devlink_get_from_attrs_lock+0x110/0x110
       ? devlink_nl_cmd_eswitch_get_doit+0x290/0x290
       ? devlink_pernet_pre_exit+0xf0/0xf0
       ? genl_family_rcv_msg_dumpit+0xf0/0xf0
       netlink_rcv_skb+0x54/0x100
       genl_rcv+0x24/0x40
       netlink_unicast+0x1fc/0x2c0
       netlink_sendmsg+0x232/0x4a0
       sock_sendmsg+0x38/0x60
       ? _copy_from_user+0x2a/0x60
       __sys_sendto+0x110/0x160
       ? handle_mm_fault+0x161/0x260
       ? do_user_addr_fault+0x276/0x620
       __x64_sys_sendto+0x20/0x30
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f3ad231340a
      Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3
      0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f
      05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
      RSP: 002b:00007ffd70aad4b8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000c36b00 RCX: 00007f3ad231340a
      RDX: 0000000000000038 RSI: 0000000000c36b00 RDI: 0000000000000003
      RBP: 0000000000c36910 R08: 00007f3ad2625200 R09: 000000000000000c
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
       </TASK>
      ---[ end trace 0000000000000000 ]---
      ------------[ cut here ]------------
      
      Fixes: 4d5ab0ad ("net/mlx5e: take into account device reconfiguration for xdp_features flag")
      Signed-off-by: Lama Kayal <lkayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: XDP, Fix XDP_REDIRECT mpwqe page fragment leaks on shutdown · aaab619c
      Dragos Tatulea authored
      When mlx5e_xdp_xmit is called without the XDP_XMIT_FLUSH set it is
      possible that it leaves a mpwqe session open. That is ok during runtime:
      the session will be closed on the next call to mlx5e_xdp_xmit. But
      having a mpwqe session still open at XDP sq close time is problematic:
      the pc counter is not updated before flushing the contents of the
      xdpi_fifo. This results in leaking page fragments.
      
      The fix is to always close the mpwqe session at the end of
      mlx5e_xdp_xmit, regardless of the XDP_XMIT_FLUSH flag being set or not.
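
The difference between the two behaviours can be sketched with a toy state machine. This is a hypothetical userspace model, not the driver code: `struct sketch_sq`, the flag value, and the idea that the producer counter advances once per closed session are all simplifying assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_XMIT_FLUSH 0x1u  /* stand-in for XDP_XMIT_FLUSH */

/* Hypothetical send queue state. */
struct sketch_sq {
    bool session_open;  /* a multi-packet WQE session is in progress */
    unsigned int pc;    /* producer counter, advanced when a session closes */
};

/* Closes an open session and accounts for it in the producer counter. */
static void sketch_close_session(struct sketch_sq *sq)
{
    if (sq->session_open) {
        sq->session_open = false;
        sq->pc++;
    }
}

/* Behaviour described as problematic: close only when FLUSH is set. */
void sketch_xmit_old(struct sketch_sq *sq, unsigned int flags)
{
    sq->session_open = true;
    if (flags & SKETCH_XMIT_FLUSH)
        sketch_close_session(sq);
}

/* Behaviour described by the fix: always close before returning. */
void sketch_xmit_fixed(struct sketch_sq *sq, unsigned int flags)
{
    (void)flags;  /* the flag no longer decides whether the session closes */
    sq->session_open = true;
    sketch_close_session(sq);
}
```

In this model, a call without the flush flag leaves the old variant with an open session and a stale counter, which is the state that becomes a problem if the queue is closed before the next xmit call.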
      
      Fixes: 5e0d2eef ("net/mlx5e: XDP, Support Enhanced Multi-Packet TX WQE")
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: RX, Fix page_pool allocation failure recovery for legacy rq · ef9369e9
      Dragos Tatulea authored
      When a page allocation fails during refill in mlx5e_refill_rx_wqes, the
      page will be released again on the next refill call. This triggers the
      page_pool negative page fragment count warning below:
      
       [  338.326070] WARNING: CPU: 4 PID: 0 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
        ...
       [  338.328993] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329094] Call Trace:
       [  338.329097]  <IRQ>
       [  338.329100]  ? __warn+0x7d/0x120
       [  338.329105]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329173]  ? report_bug+0x155/0x180
       [  338.329179]  ? handle_bug+0x3c/0x60
       [  338.329183]  ? exc_invalid_op+0x13/0x60
       [  338.329187]  ? asm_exc_invalid_op+0x16/0x20
       [  338.329192]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [  338.329259]  mlx5e_post_rx_wqes+0x210/0x5a0 [mlx5_core]
       [  338.329327]  ? mlx5e_poll_rx_cq+0x88/0x6f0 [mlx5_core]
       [  338.329394]  mlx5e_napi_poll+0x127/0x6b0 [mlx5_core]
       [  338.329461]  __napi_poll+0x25/0x1a0
       [  338.329465]  net_rx_action+0x28a/0x300
       [  338.329468]  __do_softirq+0xcd/0x279
       [  338.329473]  irq_exit_rcu+0x6a/0x90
       [  338.329477]  common_interrupt+0x82/0xa0
       [  338.329482]  </IRQ>
      
      This patch fixes the legacy rq case by releasing all allocated fragments
      and then setting the skip flag on all released fragments. It is
      important to note that the number of released fragments will be higher
      than the number of allocated fragments when an allocation error occurs.
      
      Fixes: 3f93f829 ("net/mlx5e: RX, Defer page release in legacy rq for better recycling")
      Tested-by: Chris Mason <clm@fb.com>
      Reported-by: Chris Mason <clm@fb.com>
      Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5e: RX, Fix page_pool allocation failure recovery for striding rq · be43b748
      Dragos Tatulea authored
      When a page allocation fails during refill in mlx5e_post_rx_mpwqes, the
      page will be released again on the next refill call. This triggers the
      page_pool negative page fragment count warning below:
      
       [ 2436.447717] WARNING: CPU: 1 PID: 2419 at include/net/page_pool/helpers.h:130 mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       ...
       [ 2436.447895] RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.447991] Call Trace:
       [ 2436.447975]  mlx5e_post_rx_mpwqes+0x1d5/0xcf0 [mlx5_core]
       [ 2436.447994]  <IRQ>
       [ 2436.447996]  ? __warn+0x7d/0x120
       [ 2436.448009]  ? mlx5e_handle_rx_cqe_mpwrq+0x109/0x1d0 [mlx5_core]
       [ 2436.448002]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.448044]  ? mlx5e_poll_rx_cq+0x87/0x6e0 [mlx5_core]
       [ 2436.448061]  ? report_bug+0x155/0x180
       [ 2436.448065]  ? handle_bug+0x36/0x70
       [ 2436.448067]  ? exc_invalid_op+0x13/0x60
       [ 2436.448070]  ? asm_exc_invalid_op+0x16/0x20
       [ 2436.448079]  mlx5e_napi_poll+0x122/0x6b0 [mlx5_core]
       [ 2436.448077]  ? mlx5e_page_release_fragmented.isra.0+0x42/0x50 [mlx5_core]
       [ 2436.448113]  ? generic_exec_single+0x35/0x100
       [ 2436.448117]  __napi_poll+0x25/0x1a0
       [ 2436.448120]  net_rx_action+0x28a/0x300
       [ 2436.448122]  __do_softirq+0xcd/0x279
       [ 2436.448126]  irq_exit_rcu+0x6a/0x90
       [ 2436.448128]  sysvec_apic_timer_interrupt+0x6e/0x90
       [ 2436.448130]  </IRQ>
      
      This patch fixes the striding rq case by setting the skip flag on all
      the wqe pages that were expected to have new pages allocated.
      
      Fixes: 4c2a1323 ("net/mlx5e: RX, Defer page release in striding rq for better recycling")
      Tested-by: Chris Mason <clm@fb.com>
      Reported-by: Chris Mason <clm@fb.com>
      Closes: https://lore.kernel.org/netdev/117FF31A-7BE0-4050-B2BB-E41F224FF72F@meta.com
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Handle fw tracer change ownership event based on MTRC · 92fd3963
      Maher Sanalla authored
      Currently, whenever fw issues a change ownership event, the PF that owns
      the fw tracer drops its ownership directly and the other PFs try to pick
      up the ownership via what MTRC register suggests.
      
      In some cases, driver releases the ownership of the tracer and reacquires
      it later on. Whenever the driver releases ownership of the tracer, fw
      issues a change ownership event. This event can be delayed and come after
      driver has reacquired ownership of the tracer. Thus the late event will
      trigger the tracer owner PF to release the ownership again and lead to a
      scenario where no PF is owning the tracer.
      
      To prevent the scenario described above, when handling a change
      ownership event, do not drop ownership of the tracer directly. Instead,
      read the fw MTRC register to retrieve the up-to-date owner of the tracer
      and set it accordingly at the driver level.
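
The late-event scenario can be sketched with a toy model. Everything here is hypothetical: the register is reduced to an integer owner id passed in by the caller, and `struct sketch_tracer` only tracks whether this PF believes it owns the tracer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical driver-level view of tracer ownership for one PF. */
struct sketch_tracer {
    int self_id;  /* id of this PF */
    bool owner;   /* does the driver think this PF owns the tracer? */
};

/* Old handling: the owner drops ownership as soon as an event arrives,
 * while non-owners follow what the register reports. */
void sketch_event_old(struct sketch_tracer *t, int reg_owner)
{
    if (t->owner)
        t->owner = false;
    else
        t->owner = (reg_owner == t->self_id);
}

/* Handling described by the fix: every PF follows the register value. */
void sketch_event_fixed(struct sketch_tracer *t, int reg_owner)
{
    t->owner = (reg_owner == t->self_id);
}
```

If PF 0 has already reacquired the tracer when a delayed event arrives, the register still names PF 0 as owner. The old handling drops ownership anyway and leaves no owner, while the register-based handling keeps it.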
      
      Fixes: f53aaa31 ("net/mlx5: FW tracer, implement tracer logic")
      Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
    • net/mlx5: Bridge, fix peer entry ageing in LAG mode · 7a3ce807
      Vlad Buslov authored
      With current implementation in single FDB LAG mode all packets are
      processed by eswitch 0 rules. As such, 'peer' FDB entries receive the
      packets for rules of other eswitches and are responsible for updating the
      main entry by sending SWITCHDEV_FDB_ADD_TO_BRIDGE notification from their
      background update wq task. However, this introduces a race condition when
      non-zero eswitch instance decides to delete a FDB entry, sends
      SWITCHDEV_FDB_DEL_TO_BRIDGE notification, but another eswitch's update task
      refreshes the same entry concurrently while its async delete work is still
      pending on the workqueue. In such a case, another SWITCHDEV_FDB_ADD_TO_BRIDGE
      event may be generated and entry will remain stuck in FDB marked as
      'offloaded' since no more SWITCHDEV_FDB_DEL_TO_BRIDGE notifications are
      sent for deleting the peer entries.
      
      Fix the issue by synchronously marking deleted entries with
      MLX5_ESW_BRIDGE_FLAG_DELETED flag and skipping them in background update
      job.
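
The idea of marking entries as deleted and skipping them in the update job can be sketched as follows. This is a single-threaded illustration with hypothetical names; it does not model the locking or the workqueue, only the flag check.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_FLAG_DELETED 0x1u  /* stand-in for MLX5_ESW_BRIDGE_FLAG_DELETED */

/* Hypothetical FDB entry with a flag word and a refresh counter. */
struct sketch_fdb_entry {
    unsigned int flags;
    unsigned int refreshes;
};

/* Delete path: mark the entry right away, before any deferred cleanup. */
void sketch_mark_deleted(struct sketch_fdb_entry *e)
{
    e->flags |= SKETCH_FLAG_DELETED;
}

/* Background update: refreshes live entries and skips deleted ones.
 * Returns the number of entries that were refreshed. */
unsigned int sketch_update(struct sketch_fdb_entry *entries, size_t n)
{
    unsigned int refreshed = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].flags & SKETCH_FLAG_DELETED)
            continue;  /* deletion is pending, do not re-add the entry */
        entries[i].refreshes++;
        refreshed++;
    }
    return refreshed;
}
```

Because the flag is set synchronously on the delete path, an update pass that runs while the deferred delete is still pending no longer refreshes the entry.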
      Signed-off-by: default avatarVlad Buslov <vladbu@nvidia.com>
      Reviewed-by: default avatarJianbo Liu <jianbol@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
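
The "mark as deleted, then skip in the background job" idea from the bridge fix above can be sketched in plain userspace C. This is only an illustrative model under stated assumptions: `struct fdb_entry`, `ENTRY_FLAG_DELETED`, `entry_mark_deleted()` and `update_pass()` are hypothetical names, not the mlx5 driver's code, and the sketch is single-threaded, so it leaves out the locking a real driver would need.

```c
#include <stddef.h>

/* Hypothetical stand-in for MLX5_ESW_BRIDGE_FLAG_DELETED. */
#define ENTRY_FLAG_DELETED (1u << 0)

/* Hypothetical, simplified FDB entry; not the driver's real layout. */
struct fdb_entry {
	unsigned int flags;
	unsigned long lastuse;   /* last time traffic hit the entry */
	unsigned long refreshed; /* lastuse value at the last refresh */
};

/*
 * Delete path: set the flag synchronously, before any asynchronous
 * cleanup work is queued, so later update passes can see it.
 */
void entry_mark_deleted(struct fdb_entry *e)
{
	e->flags |= ENTRY_FLAG_DELETED;
}

/*
 * Background update pass: returns how many entries would trigger a
 * refresh ("add to bridge") notification. Entries already marked as
 * deleted are skipped, so they cannot be resurrected.
 */
size_t update_pass(struct fdb_entry *entries, size_t n)
{
	size_t sent = 0;

	for (size_t i = 0; i < n; i++) {
		struct fdb_entry *e = &entries[i];

		if (e->flags & ENTRY_FLAG_DELETED)
			continue;
		if (e->lastuse > e->refreshed) {
			e->refreshed = e->lastuse;
			sent++;
		}
	}
	return sent;
}
```

In this model, an entry that saw traffic but was marked deleted before the update pass runs produces no refresh, which is the property the fix relies on.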
    • Shay Drory's avatar
      net/mlx5: E-switch, register event handler before arming the event · 7624e58a
      Shay Drory authored
      Currently, mlx5 registers the event handler for the vport context change
      event some time after arming the event. This can lead to missing an
      event, which will result in wrong rules in the FDB.
      Hence, register the event handler before arming the event.
      
      This solution is valid since FW sends the vport context change event
      only on vports which SW armed, and SW arms a vport when enabling
      it, which is done after the FDB has been created.
      
      Fixes: 6933a937 ("net/mlx5: E-Switch, Use async events chain")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Mark Bloch <mbloch@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
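
The ordering argument in the E-switch fix above (register the handler first, arm second) can be illustrated with a small userspace model. Everything here is an assumption made for illustration: `struct event_source`, `source_register()`, `source_arm()`, `source_fire()` and `count_event()` are hypothetical names, and the model simply drops an event that fires while no handler is registered.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*event_handler_t)(void *ctx);

/* Hypothetical event source; not the mlx5 notifier chain. */
struct event_source {
	event_handler_t handler; /* NULL until a handler is registered */
	void *ctx;
	bool armed;              /* events are raised only when armed */
	unsigned int missed;     /* events raised with no handler present */
};

void source_register(struct event_source *s, event_handler_t h, void *ctx)
{
	s->handler = h;
	s->ctx = ctx;
}

void source_arm(struct event_source *s)
{
	s->armed = true;
}

/* Models the firmware raising an event; a no-op if the source is unarmed. */
void source_fire(struct event_source *s)
{
	if (!s->armed)
		return;
	if (s->handler)
		s->handler(s->ctx);
	else
		s->missed++; /* nobody is listening: the event is lost */
}

/* Example handler: counts delivered events in an unsigned int. */
void count_event(void *ctx)
{
	(*(unsigned int *)ctx)++;
}
```

Calling `source_register()` before `source_arm()` means any event raised after arming reaches the handler. Arming first leaves a window in which `source_fire()` only increments `missed`.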
    • Shay Drory's avatar
      net/mlx5: Perform DMA operations in the right locations · 8698cb92
      Shay Drory authored
      The cited patch changed the mlx5 driver so that during probe DMA
      operations were performed before pci_enable_device(), and during
      teardown DMA operations were performed after pci_disable_device().
      DMA operations require PCI to be enabled. Hence, the above leads to
      the following oops on PPC systems[1].
      
      On s390x systems, as reported by Niklas Schnelle, this is a problem
      because mlx5_pci_init() is where the DMA and coherent masks are set, but
      mlx5_cmd_init() already does a dma_alloc_coherent(). Thus a DMA
      allocation is done during probe before the correct mask is set. This
      causes probe to fail initialization of the cmdif SW structs on s390x
      after that is converted to the common dma-iommu code. This is because on
      s390x DMA addresses below 4 GiB are reserved on current machines and,
      unlike the old s390x-specific DMA API implementation, common code
      enforces DMA masks.
      
      Fix it by performing the DMA operations during probe after
      pci_enable_device() and after the dma mask is set,
      and during teardown before pci_disable_device().
      
      [1]
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
      Modules linked in: xt_MASQUERADE nf_conntrack_netlink
      nfnetlink xfrm_user iptable_nat xt_addrtype xt_conntrack nf_nat
      nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 netconsole rpcsec_gss_krb5
      auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser ib_umad
      rdma_cm ib_ipoib iw_cm libiscsi scsi_transport_iscsi ib_cm ib_uverbs
      ib_core mlx5_core(-) ptp pps_core fuse vmx_crypto crc32c_vpmsum [last
      unloaded: mlx5_ib]
      CPU: 1 PID: 8937 Comm: modprobe Not tainted 6.5.0-rc3_for_upstream_min_debug_2023_07_31_16_02 #1
      Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
      NIP:  c000000000423388 LR: c0000000001e733c CTR: c0000000001e4720
      REGS: c0000000055636d0 TRAP: 0380   Not tainted (6.5.0-rc3_for_upstream_min_debug_2023_07_31_16_02)
      MSR:  8000000000009033  CR: 24008884  XER: 20040000
      CFAR: c0000000001e7338 IRQMASK: 0
      NIP [c000000000423388] __free_pages+0x28/0x160
      LR [c0000000001e733c] dma_direct_free+0xac/0x190
      Call Trace:
      [c000000005563970] [5deadbeef0000100] 0x5deadbeef0000100 (unreliable)
      [c0000000055639b0] [c0000000003d46cc] kfree+0x7c/0x150
      [c000000005563a40] [c0000000001e47c8] dma_free_attrs+0xa8/0x1a0
      [c000000005563aa0] [c008000000d0064c] mlx5_cmd_cleanup+0xa4/0x100 [mlx5_core]
      [c000000005563ad0] [c008000000cf629c] mlx5_mdev_uninit+0xf4/0x140 [mlx5_core]
      [c000000005563b00] [c008000000cf6448] remove_one+0x160/0x1d0 [mlx5_core]
      [c000000005563b40] [c000000000958540] pci_device_remove+0x60/0x110
      [c000000005563b80] [c000000000a35e80] device_remove+0x70/0xd0
      [c000000005563bb0] [c000000000a37a38] device_release_driver_internal+0x2a8/0x330
      [c000000005563c00] [c000000000a37b8c] driver_detach+0x8c/0x160
      [c000000005563c40] [c000000000a35350] bus_remove_driver+0x90/0x110
      [c000000005563c80] [c000000000a38948] driver_unregister+0x48/0x90
      [c000000005563cf0] [c000000000957e38] pci_unregister_driver+0x38/0x150
      [c000000005563d40] [c008000000eb6140] mlx5_cleanup+0x38/0x90 [mlx5_core]
      
      Fixes: 06cd555f ("net/mlx5: split mlx5_cmd_init() to probe and reload routines")
      Signed-off-by: Shay Drory <shayd@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
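
The probe and teardown ordering described in the DMA fix above can be modelled in userspace C. This is a rough sketch, not driver code: `struct fake_dev` and all `fake_*()` functions are hypothetical stand-ins, and the model assumes that a DMA allocation simply fails unless the device is enabled and its mask is set.

```c
#include <stdbool.h>

/* Hypothetical device state; the real driver tracks far more. */
struct fake_dev {
	bool pci_enabled;
	bool dma_mask_set;
	bool cmd_buf_allocated;
};

/* Stands in for enabling the PCI device and setting the DMA masks. */
int fake_pci_init(struct fake_dev *d)
{
	d->pci_enabled = true;
	d->dma_mask_set = true;
	return 0;
}

void fake_pci_close(struct fake_dev *d)
{
	d->pci_enabled = false;
	d->dma_mask_set = false;
}

/* Stands in for a DMA allocation: needs an enabled device and a mask. */
int fake_cmd_init(struct fake_dev *d)
{
	if (!d->pci_enabled || !d->dma_mask_set)
		return -1;
	d->cmd_buf_allocated = true;
	return 0;
}

/* Stands in for a DMA free: needs the device to still be enabled. */
int fake_cmd_cleanup(struct fake_dev *d)
{
	if (!d->pci_enabled)
		return -1;
	d->cmd_buf_allocated = false;
	return 0;
}

/* Probe: enable PCI and set the mask first, then do DMA work. */
int fake_probe(struct fake_dev *d)
{
	int err = fake_pci_init(d);

	if (err)
		return err;
	err = fake_cmd_init(d);
	if (err)
		fake_pci_close(d);
	return err;
}

/* Teardown: undo DMA work first, then disable PCI. */
int fake_remove(struct fake_dev *d)
{
	int err = fake_cmd_cleanup(d);

	fake_pci_close(d);
	return err;
}
```

Teardown mirrors probe in reverse, so every DMA operation happens while the device is enabled. In the model, calling `fake_cmd_init()` before `fake_pci_init()` fails, which corresponds to the ordering the commit removes.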