1. 23 Mar, 2023 11 commits
  2. 22 Mar, 2023 9 commits
    • Ido Schimmel's avatar
      mlxsw: spectrum_fid: Fix incorrect local port type · bb765a74
      Ido Schimmel authored
      Local port is a 10-bit number, but it was mistakenly stored in a u8,
      resulting in firmware errors when using a netdev corresponding to a
      local port higher than 255.
      
      Fix by storing the local port in u16, as is done in the rest of the
      code.
      
      Fixes: bf73904f ("mlxsw: Add support for 802.1Q FID family")
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Reviewed-by: default avatarDanielle Ratson <danieller@nvidia.com>
      Signed-off-by: default avatarPetr Machata <petrm@nvidia.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/eace1f9d96545ab8a2775db857cb7e291a9b166b.1679398549.git.petrm@nvidia.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      bb765a74
    • Zhang Changzhong's avatar
      net/sonic: use dma_mapping_error() for error check · 4107b874
      Zhang Changzhong authored
      The DMA address returned by dma_map_single() should be checked with
      dma_mapping_error(). Fix it accordingly.
      
      Fixes: efcce839 ("[PATCH] macsonic/jazzsonic network drivers update")
      Signed-off-by: default avatarZhang Changzhong <zhangchangzhong@huawei.com>
      Tested-by: default avatarStan Johnson <userm57@yahoo.com>
      Signed-off-by: default avatarFinn Thain <fthain@linux-m68k.org>
      Reviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/6645a4b5c1e364312103f48b7b36783b94e197a2.1679370343.git.fthain@linux-m68k.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4107b874
    • Jakub Kicinski's avatar
      Merge branch 'fix-trainwreck-with-ocelot-switch-statistics-counters' · 82463d9d
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Fix trainwreck with Ocelot switch statistics counters
      
      While testing the patch set for preemptible traffic classes with some
      controlled traffic and measuring counter deltas:
      https://lore.kernel.org/netdev/20230220122343.1156614-1-vladimir.oltean@nxp.com/
      
      I noticed that in the output of "ethtool -S swp0 --groups eth-mac
      eth-phy eth-ctrl rmon -- --src emac | grep -v ': 0'", the TX counters
      were off. Quickly I realized that their values were permutated by 1
      compared to their names, and that for example
      tx-rmon-etherStatsPkts64to64Octets was incrementing when
      tx-rmon-etherStatsPkts65to127Octets should have.
      
      Initially I suspected something having to do with the bulk reading
      logic, and indeed I found a bug there (fixed as 1/3), but that was not
      the source of the problems. Instead it revealed other problems.
      
      While dumping the regions created by the driver on my switch, I figured
      out that it sees a discontinuity which shouldn't have existed between
      reg 0x278 and reg 0x280.
      
      Discontinuity between last reg 0x0 and new reg 0x0, creating new region
      Discontinuity between last reg 0x108 and new reg 0x200, creating new region
      Discontinuity between last reg 0x278 and new reg 0x280, creating new region
      Discontinuity between last reg 0x2b0 and new reg 0x400, creating new region
      region of 67 contiguous counters starting with SYS:STAT:CNT[0x000]
      region of 31 contiguous counters starting with SYS:STAT:CNT[0x080]
      region of 13 contiguous counters starting with SYS:STAT:CNT[0x0a0]
      region of 18 contiguous counters starting with SYS:STAT:CNT[0x100]
      
      That is where TX_MM_HOLD should have been, and that was the bug, since
      it was missing. After adding it, the regions look like this and the
      off-by-one issue is resolved:
      
      Discontinuity between last reg 0x000000 and new reg 0x000000, creating new region
      Discontinuity between last reg 0x000108 and new reg 0x000200, creating new region
      Discontinuity between last reg 0x0002b0 and new reg 0x000400, creating new region
      region of 67 contiguous counters starting with SYS:STAT:CNT[0x000]
      region of 45 contiguous counters starting with SYS:STAT:CNT[0x080]
      region of 18 contiguous counters starting with SYS:STAT:CNT[0x100]
      
      However, as I am thinking out loud, it should have not reported the
      other counters as off by one even when skipping TX_MM_HOLD... after all,
      on Ocelot/Seville, there are more counters which need to be skipped.
      
      Which is when I investigated and noticed the bug solved in 2/3.
      I've validated that both on native VSC9959 (which uses
      ocelot_mm_stats_layout) as well as by faking the other switches by
      making VSC9959 use the plain ocelot_stats_layout.
      
      To summarize: on all Ocelot switches, the TX counters and drop counters
      are completely broken. The RX counters are mostly fine.
      
      With this occasion, I have collected more cleanup patches in this area,
      which I'm going to submit after the net -> net-next merge.
      ====================
      
      Link: https://lore.kernel.org/r/20230321010325.897817-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      82463d9d
    • Vladimir Oltean's avatar
      net: mscc: ocelot: add TX_MM_HOLD to ocelot_mm_stats_layout · 5291099e
      Vladimir Oltean authored
      The lack of a definition for this counter is what initially prompted me
      to investigate a problem which really manifested itself as the previous
      change, "net: mscc: ocelot: fix transfer from region->buf to ocelot->stats".
      
      When TX_MM_HOLD is defined in enum ocelot_stat but not in struct
      ocelot_stat_layout ocelot_mm_stats_layout, this creates a hole, which
      due to the aforementioned bug, makes all counters following TX_MM_HOLD
      be recorded off by one compared to their correct position. So for
      example, a non-zero TX_PMAC_OCTETS would be reported as TX_MERGE_FRAGMENTS,
      TX_PMAC_UNICAST would be reported as TX_PMAC_OCTETS, TX_PMAC_64 would be
      reported as TX_PMAC_PAUSE, etc etc. This is because the size of the hole
      (1) is much smaller than the size of the region, so the phenomenon where
      the stats are off-by-one, rather than lost, prevails.
      
      However, the phenomenon where stats are lost can be seen too, for
      example with DROP_LOCAL, which is at the beginning of its own region
      (offset 0x000400 vs the previous 0x0002b0 constitutes a discontinuity).
      This is also reported as off by one and saved to TX_PMAC_1527_MAX, but
      that counter is not reported to the unstructured "ethtool -S", as
      opposed to DROP_LOCAL which is (as "drop_local").
      
      Fixes: ab3f97a9 ("net: mscc: ocelot: export ethtool MAC Merge stats for Felix VSC9959")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5291099e
    • Vladimir Oltean's avatar
      net: mscc: ocelot: fix transfer from region->buf to ocelot->stats · 17dfd210
      Vladimir Oltean authored
      To understand the problem, we need some definitions.
      
      The driver is aware of multiple counters (enum ocelot_stat), yet not all
      switches supported by the driver implement all counters. There are 2
      statistics layouts: ocelot_stats_layout and ocelot_mm_stats_layout, the
      latter having 36 counters more than the former.
      
      ocelot->stats[] is not a compact array, i.e. there are elements within
      it which are not going to be populated for ocelot_stats_layout. On the
      other hand, ocelot->stats[] is easily indexable, for example "tx_octets"
      for port 3 can be found at ocelot->stats[3 * OCELOT_NUM_STATS +
      OCELOT_STAT_TX_OCTETS], and that is why we keep it sparse.
      
      Regions, as created by ocelot_prepare_stats_regions(), are compact
      (every element from region->buf will correspond to a counter that is
      present in this switch's layout) but are not easily indexable.
      
      Let's define holes as the ranges of values of enum ocelot_stat for which
      ocelot_stats_layout doesn't have a "reg" defined. For example, there is
      a hole between OCELOT_STAT_RX_GREEN_PRIO_7 and OCELOT_STAT_TX_OCTETS
      which is of 23 elements that are only present on ocelot_mm_stats_layout,
      and as such, they are also present in enum ocelot_stat. Let's define the
      left extremity of the hole - the last enum ocelot_stat still defined -
      as A (in this case OCELOT_STAT_RX_GREEN_PRIO_7) and the right extremity -
      the first enum ocelot_stat that is defined after a series of undefined
      ones - as B (in this case OCELOT_STAT_TX_OCTETS).
      
      There is a bug in the procedure which transfers stats from region->buf[]
      to ocelot->stats[].
      
      For each hole in the ocelot_stats_layout, the logic transfers the stats
      starting with enum ocelot_stat B to ocelot->stats[] index A + 1. So all
      stats after a hole are saved to a position which is off by B - A + 1
      elements.
      
      This causes 2 kinds of issues:
      (a) counters which shouldn't increment increment
      (b) counters which should increment don't
      
      Holes in the ocelot_stat_layout automatically imply the end of a region
      and the beginning of a new one; however the reverse is not necessarily
      true. For example, for ocelot_mm_stat_layout, there could be multiple
      regions (which indicate discontinuities in register addresses) while
      there is no hole (which indicates discontinuities in enum ocelot_stat
      values).
      
      In the example above, the stats from the second region->buf[] are not
      transferred to ocelot->stats starting with index
      "port * OCELOT_NUM_STATS + OCELOT_STAT_TX_OCTETS" as they should, but
      rather, starting with element
      "port * OCELOT_NUM_STATS + OCELOT_STAT_RX_GREEN_PRIO_7 + 1".
      
      That stats[] array element is not reported to user space for switches
      that use ocelot_stat_layout, and that is how issue (b) occurs.
      
      However, if the length of the second region is larger than the hole,
      then some stats will start to be transferred to the ocelot->stats[]
      indices which *are* reported to user space, but those indices contain
      wrong values (corresponding to unexpected counters). This is how issue
      (a) occurs.
      
      The procedure, as it was introduced in commit d87b1c08 ("net: mscc:
      ocelot: use bulk reads for stats"), was not buggy, because there were no
      holes in the struct ocelot_stat_layout instances at that time. The
      problem is that when those holes were introduced, the function was not
      updated to take them into consideration.
      
      To update the procedure, we need to know, for each region, which enum
      ocelot_stat corresponds to its region->base. We have no way of deducing
      that based on the contents of struct ocelot_stats_region, so we need to
      add this information.
      
      Fixes: ab3f97a9 ("net: mscc: ocelot: export ethtool MAC Merge stats for Felix VSC9959")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      17dfd210
    • Vladimir Oltean's avatar
      net: mscc: ocelot: fix stats region batching · 6acc72a4
      Vladimir Oltean authored
      The blamed commit changed struct ocelot_stat_layout :: "u32 offset" to
      "u32 reg".
      
      However, "u32 reg" is not quite a register address, but an enum
      ocelot_reg, which in itself encodes an enum ocelot_target target in the
      upper bits, and an index into the ocelot->map[target][] array in the
      lower bits.
      
      So, whereas the previous code comparison between stats_layout[i].offset
      and last + 1 was correct (because those "offsets" at the time were
      32-bit relative addresses), the new code, comparing layout[i].reg to
      last + 4 is not correct, because the "reg" here is an enum/index, not an
      actual register address.
      
      What we want to compare are indeed register addresses, but to do that,
      we need to actually go through the same motions as
      __ocelot_bulk_read_ix() itself.
      
      With this bug, all statistics counters are deemed by
      ocelot_prepare_stats_regions() as constituting their own region.
      (Truncated) log on VSC9959 (Felix) below (prints added by me):
      
      Before:
      
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x000]
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x001]
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x002]
      ...
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x041]
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x042]
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x080]
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x081]
      ...
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x0ac]
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x100]
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x101]
      ...
      region of 1 contiguous counters starting with SYS:STAT:CNT[0x111]
      
      After:
      
      region of 67 contiguous counters starting with SYS:STAT:CNT[0x000]
      region of 45 contiguous counters starting with SYS:STAT:CNT[0x080]
      region of 18 contiguous counters starting with SYS:STAT:CNT[0x100]
      
      Since commit d87b1c08 ("net: mscc: ocelot: use bulk reads for
      stats") intended bulking as a performance improvement, and since now,
      with trivial-sized regions, performance is even worse than without
      bulking at all, this could easily qualify as a performance regression.
      
      Fixes: d4c36765 ("net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: default avatarColin Foster <colin.foster@in-advantage.com>
      Tested-by: default avatarColin Foster <colin.foster@in-advantage.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6acc72a4
    • Eric Dumazet's avatar
      erspan: do not use skb_mac_header() in ndo_start_xmit() · 8e50ed77
      Eric Dumazet authored
      Drivers should not assume skb_mac_header(skb) == skb->data in their
      ndo_start_xmit().
      
      Use skb_network_offset() and skb_transport_offset() which
      better describe what is needed in erspan_fb_xmit() and
      ip6erspan_tunnel_xmit()
      
      syzbot reported:
      WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 skb_mac_header include/linux/skbuff.h:2873 [inline]
      WARNING: CPU: 0 PID: 5083 at include/linux/skbuff.h:2873 ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962
      Modules linked in:
      CPU: 0 PID: 5083 Comm: syz-executor406 Not tainted 6.3.0-rc2-syzkaller-00866-gd4671cb9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
      RIP: 0010:skb_mac_header include/linux/skbuff.h:2873 [inline]
      RIP: 0010:ip6erspan_tunnel_xmit+0x1d9c/0x2d90 net/ipv6/ip6_gre.c:962
      Code: 04 02 41 01 de 84 c0 74 08 3c 03 0f 8e 1c 0a 00 00 45 89 b4 24 c8 00 00 00 c6 85 77 fe ff ff 01 e9 33 e7 ff ff e8 b4 27 a1 f8 <0f> 0b e9 b6 e7 ff ff e8 a8 27 a1 f8 49 8d bf f0 0c 00 00 48 b8 00
      RSP: 0018:ffffc90003b2f830 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 000000000000ffff RCX: 0000000000000000
      RDX: ffff888021273a80 RSI: ffffffff88e1bd4c RDI: 0000000000000003
      RBP: ffffc90003b2f9d8 R08: 0000000000000003 R09: 000000000000ffff
      R10: 000000000000ffff R11: 0000000000000000 R12: ffff88802b28da00
      R13: 00000000000000d0 R14: ffff88807e25b6d0 R15: ffff888023408000
      FS: 0000555556a61300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055e5b11eb6e8 CR3: 0000000027c1b000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      <TASK>
      __netdev_start_xmit include/linux/netdevice.h:4900 [inline]
      netdev_start_xmit include/linux/netdevice.h:4914 [inline]
      __dev_direct_xmit+0x504/0x730 net/core/dev.c:4300
      dev_direct_xmit include/linux/netdevice.h:3088 [inline]
      packet_xmit+0x20a/0x390 net/packet/af_packet.c:285
      packet_snd net/packet/af_packet.c:3075 [inline]
      packet_sendmsg+0x31a0/0x5150 net/packet/af_packet.c:3107
      sock_sendmsg_nosec net/socket.c:724 [inline]
      sock_sendmsg+0xde/0x190 net/socket.c:747
      __sys_sendto+0x23a/0x340 net/socket.c:2142
      __do_sys_sendto net/socket.c:2154 [inline]
      __se_sys_sendto net/socket.c:2150 [inline]
      __x64_sys_sendto+0xe1/0x1b0 net/socket.c:2150
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f123aaa1039
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007ffc15d12058 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f123aaa1039
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
      RBP: 0000000000000000 R08: 0000000020000040 R09: 0000000000000014
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f123aa648c0
      R13: 431bde82d7b634db R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: 1baf5ebf ("erspan: auto detect truncated packets.")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230320163427.8096-1-edumazet@google.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      8e50ed77
    • Li Zetao's avatar
      atm: idt77252: fix kmemleak when rmmod idt77252 · 4fe3c885
      Li Zetao authored
      There are memory leaks reported by kmemleak:
      
        unreferenced object 0xffff888106500800 (size 128):
          comm "modprobe", pid 1017, jiffies 4297787785 (age 67.152s)
          hex dump (first 32 bytes):
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
            00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          backtrace:
            [<00000000970ce626>] __kmem_cache_alloc_node+0x20c/0x380
            [<00000000fb5f78d9>] kmalloc_trace+0x2f/0xb0
            [<000000000e947e2a>] idt77252_init_one+0x2847/0x3c90 [idt77252]
            [<000000006efb048e>] local_pci_probe+0xeb/0x1a0
          ...
      
        unreferenced object 0xffff888106500b00 (size 128):
          comm "modprobe", pid 1017, jiffies 4297787785 (age 67.152s)
          hex dump (first 32 bytes):
            00 20 3d 01 80 88 ff ff 00 20 3d 01 80 88 ff ff  . =...... =.....
            f0 23 3d 01 80 88 ff ff 00 20 3d 01 00 00 00 00  .#=...... =.....
          backtrace:
            [<00000000970ce626>] __kmem_cache_alloc_node+0x20c/0x380
            [<00000000fb5f78d9>] kmalloc_trace+0x2f/0xb0
            [<00000000f451c5be>] alloc_scq.constprop.0+0x4a/0x400 [idt77252]
            [<00000000e6313849>] idt77252_init_one+0x28cf/0x3c90 [idt77252]
      
      The root cause is traced to the vc_maps which alloced in open_card_oam()
      are not freed in close_card_oam(). The vc_maps are used to record
      open connections, so when close a vc_map in close_card_oam(), the memory
      should be freed. Moreover, the ubr0 is not closed when close a idt77252
      device, leading to the memory leak of vc_map and scq_info.
      
      Fix them by adding kfree in close_card_oam() and implementing new
      close_card_ubr0() to close ubr0.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarLi Zetao <lizetao1@huawei.com>
      Reviewed-by: default avatarFrancois Romieu <romieu@fr.zoreil.com>
      Link: https://lore.kernel.org/r/20230320143318.2644630-1-lizetao1@huawei.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4fe3c885
    • Álvaro Fernández Rojas's avatar
      net: dsa: tag_brcm: legacy: fix daisy-chained switches · 032a9540
      Álvaro Fernández Rojas authored
      When BCM63xx internal switches are connected to switches with a 4-byte
      Broadcom tag, it does not identify the packet as VLAN tagged, so it adds one
      based on its PVID (which is likely 0).
      Right now, the packet is received by the BCM63xx internal switch and the 6-byte
      tag is properly processed. The next step would to decode the corresponding
      4-byte tag. However, the internal switch adds an invalid VLAN tag after the
      6-byte tag and the 4-byte tag handling fails.
      In order to fix this we need to remove the invalid VLAN tag after the 6-byte
      tag before passing it to the 4-byte tag decoding.
      
      Fixes: 964dbf18 ("net: dsa: tag_brcm: add support for legacy tags")
      Signed-off-by: default avatarÁlvaro Fernández Rojas <noltari@gmail.com>
      Reviewed-by: default avatarMichal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Reviewed-by: default avatarFlorian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20230319095540.239064-1-noltari@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      032a9540
  3. 21 Mar, 2023 13 commits
    • Dan Carpenter's avatar
      net/mlx5: E-Switch, Fix an Oops in error handling code · 640fcdbc
      Dan Carpenter authored
      The error handling dereferences "vport".  There is nothing we can do if
      it is an error pointer except returning the error code.
      
      Fixes: 133dcfc5 ("net/mlx5: E-Switch, Alloc and free unique metadata for match")
      Signed-off-by: default avatarDan Carpenter <error27@gmail.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      640fcdbc
    • Maher Sanalla's avatar
      net/mlx5: Read the TC mapping of all priorities on ETS query · 44d55318
      Maher Sanalla authored
      When ETS configurations are queried by the user to get the mapping
      assignment between packet priority and traffic class, only priorities up
      to maximum TCs are queried from QTCT register in FW to retrieve their
      assigned TC, leaving the rest of the priorities mapped to the default
      TC #0 which might be misleading.
      
      Fix by querying the TC mapping of all priorities on each ETS query,
      regardless of the maximum number of TCs configured in FW.
      
      Fixes: 820c2c5e ("net/mlx5e: Read ETS settings directly from firmware")
      Signed-off-by: default avatarMaher Sanalla <msanalla@nvidia.com>
      Reviewed-by: default avatarMoshe Shemesh <moshe@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      44d55318
    • Emeel Hakim's avatar
      net/mlx5e: Overcome slow response for first macsec ASO WQE · 7e3fce82
      Emeel Hakim authored
      First ASO WQE poll causes a cache miss in hardware hence the resut is
      delayed. It causes to the situation where such WQE is polled earlier
      than it is needed.
      
      Add logic to retry ASO CQ polling operation.
      
      Fixes: 739cfa34 ("net/mlx5: Make ASO poll CQ usable in atomic context") 
      Signed-off-by: default avatarEmeel Hakim <ehakim@nvidia.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Reviewed-by: default avatarRaed Salem <raeds@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      7e3fce82
    • Roy Novich's avatar
      net/mlx5e: Initialize link speed to zero · 6e9d51b1
      Roy Novich authored
      mlx5e_port_max_linkspeed does not guarantee value assignment for speed.
      Avoid cases where link_speed might be used uninitialized. In case
      mlx5e_port_max_linkspeed fails, a default link speed of 50000 will be
      used for the calculations.
      
      Fixes: 3f6d08d1 ("net/mlx5e: Add RSS support for hairpin")
      Signed-off-by: default avatarRoy Novich <royno@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Reviewed-by: default avatarAya Levin <ayal@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      6e9d51b1
    • Lama Kayal's avatar
      net/mlx5: Fix steering rules cleanup · 922f56e9
      Lama Kayal authored
      vport's mc, uc and multicast rules are not deleted in teardown path when
      EEH happens. Since the vport's promisc settings(uc, mc and all) in
      firmware are reset after EEH, mlx5 driver will try to delete the above
      rules in the initialization path. This cause kernel crash because these
      software rules are no longer valid.
      
      Fix by nullifying these rules right after delete to avoid accessing any dangling
      pointers.
      
      Call Trace:
      __list_del_entry_valid+0xcc/0x100 (unreliable)
      tree_put_node+0xf4/0x1b0 [mlx5_core]
      tree_remove_node+0x30/0x70 [mlx5_core]
      mlx5_del_flow_rules+0x14c/0x1f0 [mlx5_core]
      esw_apply_vport_rx_mode+0x10c/0x200 [mlx5_core]
      esw_update_vport_rx_mode+0xb4/0x180 [mlx5_core]
      esw_vport_change_handle_locked+0x1ec/0x230 [mlx5_core]
      esw_enable_vport+0x130/0x260 [mlx5_core]
      mlx5_eswitch_enable_sriov+0x2a0/0x2f0 [mlx5_core]
      mlx5_device_enable_sriov+0x74/0x440 [mlx5_core]
      mlx5_load_one+0x114c/0x1550 [mlx5_core]
      mlx5_pci_resume+0x68/0xf0 [mlx5_core]
      eeh_report_resume+0x1a4/0x230
      eeh_pe_dev_traverse+0x98/0x170
      eeh_handle_normal_event+0x3e4/0x640
      eeh_handle_event+0x4c/0x370
      eeh_event_handler+0x14c/0x210
      kthread+0x168/0x1b0
      ret_from_kernel_thread+0x5c/0x84
      
      Fixes: a35f71f2 ("net/mlx5: E-Switch, Implement promiscuous rx modes vf request handling")
      Signed-off-by: default avatarHuy Nguyen <huyn@mellanox.com>
      Signed-off-by: default avatarLama Kayal <lkayal@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Reviewed-by: default avatarMaor Dickman <maord@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      922f56e9
    • Gavin Li's avatar
      net/mlx5e: Block entering switchdev mode with ns inconsistency · 662404b2
      Gavin Li authored
      Upon entering switchdev mode, VF/SF representors are spawned in the
      devlink instance's net namespace, whereas the PF net device transforms
      into the uplink representor, remaining in the net namespace the PF net
      device was in. Therefore, if a PF net device's namespace is different from
      its parent devlink net namespace, entering switchdev mode can create an
      illegal situation where all representors sharing the same core device
      are NOT in the same net namespace.
      
      To avoid this issue, block entering switchdev mode for devices whose child
      netdev net namespace has diverged from the parent devlink's.
      
      Fixes: 7768d197 ("net/mlx5: E-Switch, Add control for encapsulation")
      Signed-off-by: default avatarGavin Li <gavinl@nvidia.com>
      Reviewed-by: default avatarGavi Teitz <gavi@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      662404b2
    • Gavin Li's avatar
      net/mlx5e: Set uplink rep as NETNS_LOCAL · c83172b0
      Gavin Li authored
      Previously, NETNS_LOCAL was not set for uplink representors, inconsistent
      with VF representors, and allowed the uplink representor to be moved
      between net namespaces and separated from the VF representors it shares
      the core device with. Such usage would break the isolation model of
      namespaces, as devices in different namespaces would have access to
      shared memory.
      
      To solve this issue, set NETNS_LOCAL for uplink representors if eswitch is
      in switchdev mode.
      
      Fixes: 7a9fb35e ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode")
      Signed-off-by: default avatarGavin Li <gavinl@nvidia.com>
      Reviewed-by: default avatarGavi Teitz <gavi@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      c83172b0
    • Radoslaw Tyl's avatar
      i40e: fix flow director packet filter programming · c672297b
      Radoslaw Tyl authored
      Initialize to zero structures to build a valid
      Tx Packet used for the filter programming.
      
      Fixes: a9219b33 ("i40e: VLAN field for flow director")
      Signed-off-by: default avatarRadoslaw Tyl <radoslawx.tyl@intel.com>
      Reviewed-by: default avatarMichal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel)
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      c672297b
    • Stefan Assmann's avatar
      iavf: fix hang on reboot with ice · 4e264be9
      Stefan Assmann authored
      When a system with E810 with existing VFs gets rebooted the following
      hang may be observed.
      
       Pid 1 is hung in iavf_remove(), part of a network driver:
       PID: 1        TASK: ffff965400e5a340  CPU: 24   COMMAND: "systemd-shutdow"
        #0 [ffffaad04005fa50] __schedule at ffffffff8b3239cb
        #1 [ffffaad04005fae8] schedule at ffffffff8b323e2d
        #2 [ffffaad04005fb00] schedule_hrtimeout_range_clock at ffffffff8b32cebc
        #3 [ffffaad04005fb80] usleep_range_state at ffffffff8b32c930
        #4 [ffffaad04005fbb0] iavf_remove at ffffffffc12b9b4c [iavf]
        #5 [ffffaad04005fbf0] pci_device_remove at ffffffff8add7513
        #6 [ffffaad04005fc10] device_release_driver_internal at ffffffff8af08baa
        #7 [ffffaad04005fc40] pci_stop_bus_device at ffffffff8adcc5fc
        #8 [ffffaad04005fc60] pci_stop_and_remove_bus_device at ffffffff8adcc81e
        #9 [ffffaad04005fc70] pci_iov_remove_virtfn at ffffffff8adf9429
       #10 [ffffaad04005fca8] sriov_disable at ffffffff8adf98e4
       #11 [ffffaad04005fcc8] ice_free_vfs at ffffffffc04bb2c8 [ice]
       #12 [ffffaad04005fd10] ice_remove at ffffffffc04778fe [ice]
       #13 [ffffaad04005fd38] ice_shutdown at ffffffffc0477946 [ice]
       #14 [ffffaad04005fd50] pci_device_shutdown at ffffffff8add58f1
       #15 [ffffaad04005fd70] device_shutdown at ffffffff8af05386
       #16 [ffffaad04005fd98] kernel_restart at ffffffff8a92a870
       #17 [ffffaad04005fda8] __do_sys_reboot at ffffffff8a92abd6
       #18 [ffffaad04005fee0] do_syscall_64 at ffffffff8b317159
       #19 [ffffaad04005ff08] __context_tracking_enter at ffffffff8b31b6fc
       #20 [ffffaad04005ff18] syscall_exit_to_user_mode at ffffffff8b31b50d
       #21 [ffffaad04005ff28] do_syscall_64 at ffffffff8b317169
       #22 [ffffaad04005ff50] entry_SYSCALL_64_after_hwframe at ffffffff8b40009b
           RIP: 00007f1baa5c13d7  RSP: 00007fffbcc55a98  RFLAGS: 00000202
           RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f1baa5c13d7
           RDX: 0000000001234567  RSI: 0000000028121969  RDI: 00000000fee1dead
           RBP: 00007fffbcc55ca0   R8: 0000000000000000   R9: 00007fffbcc54e90
           R10: 00007fffbcc55050  R11: 0000000000000202  R12: 0000000000000005
           R13: 0000000000000000  R14: 00007fffbcc55af0  R15: 0000000000000000
           ORIG_RAX: 00000000000000a9  CS: 0033  SS: 002b
      
      During reboot all drivers PM shutdown callbacks are invoked.
      In iavf_shutdown() the adapter state is changed to __IAVF_REMOVE.
      In ice_shutdown() the call chain above is executed, which at some point
      calls iavf_remove(). However iavf_remove() expects the VF to be in one
      of the states __IAVF_RUNNING, __IAVF_DOWN or __IAVF_INIT_FAILED. If
      that's not the case it sleeps forever.
      So if iavf_shutdown() gets invoked before iavf_remove() the system will
      hang indefinitely because the adapter is already in state __IAVF_REMOVE.
      
      Fix this by returning from iavf_remove() if the state is __IAVF_REMOVE,
      as we already went through iavf_shutdown().
      
      Fixes: 97457801 ("iavf: Add waiting so the port is initialized in remove")
      Fixes: a8417330 ("iavf: Fix race condition between iavf_shutdown and iavf_remove")
      Reported-by: default avatarMarius Cornea <mcornea@redhat.com>
      Signed-off-by: default avatarStefan Assmann <sassmann@kpanic.de>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Tested-by: default avatarRafal Romanowski <rafal.romanowski@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      4e264be9
    • Michal Swiatkowski's avatar
      ice: remove filters only if VSI is deleted · 7d46c0e6
      Michal Swiatkowski authored
      Filters shouldn't be removed in VSI rebuild path. Removing them on PF
      VSI results in no rule for PF MAC after changing for example queues
      amount.
      
      Remove all filters only in the VSI remove flow. As unload should also
      cause the filter to be removed introduce, a new function ice_stop_eth().
      It will unroll ice_start_eth(), so remove filters and close VSI.
      
      Fixes: 6624e780 ("ice: split ice_vsi_setup into smaller functions")
      Signed-off-by: default avatarMichal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel)
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      7d46c0e6
    • Michal Swiatkowski's avatar
      ice: check if VF exists before mode check · 83b49e7f
      Michal Swiatkowski authored
      Setting trust on VF should return EINVAL when there is no VF. Move
      checking for switchdev mode after checking if VF exists.
      
      Fixes: c54d209c ("ice: Wait for VF to be reset/ready before configuration")
      Signed-off-by: default avatarMichal Swiatkowski <michal.swiatkowski@intel.com>
      Signed-off-by: default avatarKalyan Kodamagula <kalyan.kodamagula@intel.com>
      Tested-by: default avatarSujai Buvaneswaran <sujai.buvaneswaran@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      83b49e7f
    • Piotr Raczynski's avatar
      ice: fix rx buffers handling for flow director packets · 387d42ae
      Piotr Raczynski authored
      Adding flow director filters stopped working correctly after
      commit 2fba7dc5 ("ice: Add support for XDP multi-buffer
      on Rx side"). As a result, only first flow director filter
      can be added, adding next filter leads to NULL pointer
      dereference attached below.
      
      Rx buffer handling and reallocation logic has been optimized,
      however flow director specific traffic was not accounted for.
      As a result driver handled those packets incorrectly since new
      logic was based on ice_rx_ring::first_desc which was not set
      in this case.
      
      Fix this by setting struct ice_rx_ring::first_desc to next_to_clean
      for flow director received packets.
      
      [  438.544867] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [  438.551840] #PF: supervisor read access in kernel mode
      [  438.556978] #PF: error_code(0x0000) - not-present page
      [  438.562115] PGD 7c953b2067 P4D 0
      [  438.565436] Oops: 0000 [#1] PREEMPT SMP NOPTI
      [  438.569794] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 6.2.0-net-bug #1
      [  438.577531] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022
      [  438.588470] RIP: 0010:ice_clean_rx_irq+0x2b9/0xf20 [ice]
      [  438.593860] Code: 45 89 f7 e9 ac 00 00 00 8b 4d 78 41 31 4e 10 41 09 d5 4d 85 f6 0f 84 82 00 00 00 49 8b 4e 08 41 8b 76
      1c 65 8b 3d 47 36 4a 3f <48> 8b 11 48 c1 ea 36 39 d7 0f 85 a6 00 00 00 f6 41 08 02 0f 85 9c
      [  438.612605] RSP: 0018:ff8c732640003ec8 EFLAGS: 00010082
      [  438.617831] RAX: 0000000000000800 RBX: 00000000000007ff RCX: 0000000000000000
      [  438.624957] RDX: 0000000000000800 RSI: 0000000000000000 RDI: 0000000000000000
      [  438.632089] RBP: ff4ed275a2158200 R08: 00000000ffffffff R09: 0000000000000020
      [  438.639222] R10: 0000000000000000 R11: 0000000000000020 R12: 0000000000001000
      [  438.646356] R13: 0000000000000000 R14: ff4ed275d0daffe0 R15: 0000000000000000
      [  438.653485] FS:  0000000000000000(0000) GS:ff4ed2738fa00000(0000) knlGS:0000000000000000
      [  438.661563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  438.667310] CR2: 0000000000000000 CR3: 0000007c9f0d6006 CR4: 0000000000771ef0
      [  438.674444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  438.681573] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  438.688697] PKRU: 55555554
      [  438.691404] Call Trace:
      [  438.693857]  <IRQ>
      [  438.695877]  ? profile_tick+0x17/0x80
      [  438.699542]  ice_msix_clean_ctrl_vsi+0x24/0x50 [ice]
      [  438.702571] ice 0000:b1:00.0: VF 1: ctrl_vsi irq timeout
      [  438.704542]  __handle_irq_event_percpu+0x43/0x1a0
      [  438.704549]  handle_irq_event+0x34/0x70
      [  438.704554]  handle_edge_irq+0x9f/0x240
      [  438.709901] iavf 0000:b1:01.1: Failed to add Flow Director filter with status: 6
      [  438.714571]  __common_interrupt+0x63/0x100
      [  438.714580]  common_interrupt+0xb4/0xd0
      [  438.718424] iavf 0000:b1:01.1: Rule ID: 127 dst_ip: 0.0.0.0 src_ip 0.0.0.0 UDP: dst_port 4 src_port 0
      [  438.722255]  </IRQ>
      [  438.722257]  <TASK>
      [  438.722257]  asm_common_interrupt+0x22/0x40
      [  438.722262] RIP: 0010:cpuidle_enter_state+0xc8/0x430
      [  438.722267] Code: 6e e9 25 ff e8 f9 ef ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 d7 f1 24 ff 45
      84 ff 0f 85 57 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
      [  438.722269] RSP: 0018:ffffffff86003e50 EFLAGS: 00000246
      [  438.784108] RAX: ff4ed2738fa00000 RBX: ffbe72a64fc01020 RCX: 0000000000000000
      [  438.791234] RDX: 0000000000000000 RSI: ffffffff858d84de RDI: ffffffff85893641
      [  438.798365] RBP: 0000000000000002 R08: 0000000000000002 R09: 000000003158af9d
      [  438.805490] R10: 0000000000000008 R11: 0000000000000354 R12: ffffffff862365a0
      [  438.812622] R13: 000000661b472a87 R14: 0000000000000002 R15: 0000000000000000
      [  438.819757]  cpuidle_enter+0x29/0x40
      [  438.823333]  do_idle+0x1b6/0x230
      [  438.826566]  cpu_startup_entry+0x19/0x20
      [  438.830492]  rest_init+0xcb/0xd0
      [  438.833717]  arch_call_rest_init+0xa/0x30
      [  438.837731]  start_kernel+0x776/0xb70
      [  438.841396]  secondary_startup_64_no_verify+0xe5/0xeb
      [  438.846449]  </TASK>
      
      Fixes: 2fba7dc5 ("ice: Add support for XDP multi-buffer on Rx side")
      Signed-off-by: default avatarPiotr Raczynski <piotr.raczynski@intel.com>
      Acked-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel)
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      387d42ae
    • Jiasheng Jiang's avatar
      octeontx2-vf: Add missing free for alloc_percpu · f038f391
      Jiasheng Jiang authored
      Add the free_percpu for the allocated "vf->hw.lmt_info" in order to avoid
      memory leak, same as the "pf->hw.lmt_info" in
      `drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c`.
      
      Fixes: 5c051207 ("octeontx2-pf: cn10k: Use runtime allocated LMTLINE region")
      Signed-off-by: default avatarJiasheng Jiang <jiasheng@iscas.ac.cn>
      Reviewed-by: default avatarMichal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Acked-by: default avatarGeethasowjanya Akula <gakula@marvell.com>
      Link: https://lore.kernel.org/r/20230317064337.18198-1-jiasheng@iscas.ac.cnSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      f038f391
  4. 20 Mar, 2023 7 commits
    • David S. Miller's avatar
      Merge branch 'ps3_gelic_net-fixes' · f36fa558
      David S. Miller authored
      Geoff Levand says:
      
      ====================
      net/ps3_gelic_net: DMA related fixes
      
      v9: Make rx_skb_size local to gelic_descr_prepare_rx.
      v8: Add more cpu_to_be32 calls.
      v7: Remove all cleanups, sync to spider net.
      v6: Reworked and cleaned up patches.
      v5: Some additional patch cleanups.
      v4: More patch cleanups.
      v3: Cleaned up patches as requested.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f36fa558
    • Geoff Levand's avatar
      net/ps3_gelic_net: Use dma_mapping_error · bebe933d
      Geoff Levand authored
      The current Gelic Etherenet driver was checking the return value of its
      dma_map_single call, and not using the dma_mapping_error() routine.
      
      Fixes runtime problems like these:
      
        DMA-API: ps3_gelic_driver sb_05: device driver failed to check map error
        WARNING: CPU: 0 PID: 0 at kernel/dma/debug.c:1027 .check_unmap+0x888/0x8dc
      
      Fixes: 02c18891 ("ps3: gigabit ethernet driver for PS3, take3")
      Reviewed-by: default avatarAlexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: default avatarGeoff Levand <geoff@infradead.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bebe933d
    • Geoff Levand's avatar
      net/ps3_gelic_net: Fix RX sk_buff length · 19b3bb51
      Geoff Levand authored
      The Gelic Ethernet device needs to have the RX sk_buffs aligned to
      GELIC_NET_RXBUF_ALIGN, and also the length of the RX sk_buffs must
      be a multiple of GELIC_NET_RXBUF_ALIGN.
      
      The current Gelic Ethernet driver was not allocating sk_buffs large
      enough to allow for this alignment.
      
      Also, correct the maximum and minimum MTU sizes, and add a new
      preprocessor macro for the maximum frame size, GELIC_NET_MAX_FRAME.
      
      Fixes various randomly occurring runtime network errors.
      
      Fixes: 02c18891 ("ps3: gigabit ethernet driver for PS3, take3")
      Signed-off-by: default avatarGeoff Levand <geoff@infradead.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      19b3bb51
    • Tom Rix's avatar
      usb: plusb: remove unused pl_clear_QuickLink_features function · 7d722c98
      Tom Rix authored
      clang with W=1 reports
      drivers/net/usb/plusb.c:65:1: error:
        unused function 'pl_clear_QuickLink_features' [-Werror,-Wunused-function]
      pl_clear_QuickLink_features(struct usbnet *dev, int val)
      ^
      This static function is not used, so remove it.
      Signed-off-by: default avatarTom Rix <trix@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7d722c98
    • Szymon Heidrich's avatar
      net: usb: lan78xx: Limit packet length to skb->len · 7f247f5a
      Szymon Heidrich authored
      Packet length retrieved from descriptor may be larger than
      the actual socket buffer length. In such case the cloned
      skb passed up the network stack will leak kernel memory contents.
      
      Additionally prevent integer underflow when size is less than
      ETH_FCS_LEN.
      
      Fixes: 55d7de9d ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
      Signed-off-by: default avatarSzymon Heidrich <szymon.heidrich@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7f247f5a
    • Zheng Wang's avatar
      net: qcom/emac: Fix use after free bug in emac_remove due to race condition · 6b6bc5b8
      Zheng Wang authored
      In emac_probe, &adpt->work_thread is bound with
      emac_work_thread. Then it will be started by timeout
      handler emac_tx_timeout or a IRQ handler emac_isr.
      
      If we remove the driver which will call emac_remove
        to make cleanup, there may be a unfinished work.
      
      The possible sequence is as follows:
      
      Fix it by finishing the work before cleanup in the emac_remove
      and disable timeout response.
      
      CPU0                  CPU1
      
                          |emac_work_thread
      emac_remove         |
      free_netdev         |
      kfree(netdev);      |
                          |emac_reinit_locked
                          |emac_mac_down
                          |//use netdev
      Fixes: b9b17deb ("net: emac: emac gigabit ethernet controller driver")
      Signed-off-by: default avatarZheng Wang <zyytlz.wz@163.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6b6bc5b8
    • Vladimir Oltean's avatar
      net: dsa: report rx_bytes unadjusted for ETH_HLEN · a8eff035
      Vladimir Oltean authored
      We collect the software statistics counters for RX bytes (reported to
      /proc/net/dev and to ethtool -S $dev | grep 'rx_bytes: ") at a time when
      skb->len has already been adjusted by the eth_type_trans() ->
      skb_pull_inline(skb, ETH_HLEN) call to exclude the L2 header.
      
      This means that when connecting 2 DSA interfaces back to back and
      sending 1 packet with length 100, the sending interface will report
      tx_bytes as incrementing by 100, and the receiving interface will report
      rx_bytes as incrementing by 86.
      
      Since accounting for that in scripts is quirky and is something that
      would be DSA-specific behavior (requiring users to know that they are
      running on a DSA interface in the first place), the proposal is that we
      treat it as a bug and fix it.
      
      This design bug has always existed in DSA, according to my analysis:
      commit 91da11f8 ("net: Distributed Switch Architecture protocol
      support") also updates skb->dev->stats.rx_bytes += skb->len after the
      eth_type_trans() call. Technically, prior to Florian's commit
      a86d8bec ("net: dsa: Factor bottom tag receive functions"), each and
      every vendor-specific tagging protocol driver open-coded the same bug,
      until the buggy code was consolidated into something resembling what can
      be seen now. So each and every driver should have its own Fixes: tag,
      because of their different histories until the convergence point.
      I'm not going to do that, for the sake of simplicity, but just blame the
      oldest appearance of buggy code.
      
      There are 2 ways to fix the problem. One is the obvious way, and the
      other is how I ended up doing it. Obvious would have been to move
      dev_sw_netstats_rx_add() one line above eth_type_trans(), and below
      skb_push(skb, ETH_HLEN). But DSA processing is not as simple as that.
      We count the bytes after removing everything DSA-related from the
      packet, to emulate what the packet's length was, on the wire, when the
      user port received it.
      
      When eth_type_trans() executes, dsa_untag_bridge_pvid() has not run yet,
      so in case the switch driver requests this behavior - commit
      412a1526 ("net: dsa: untag the bridge pvid from rx skbs") has the
      details - the obvious variant of the fix wouldn't have worked, because
      the positioning there would have also counted the not-yet-stripped VLAN
      header length, something which is absent from the packet as seen on the
      wire (there it may be untagged, whereas software will see it as
      PVID-tagged).
      
      Fixes: f613ed66 ("net: dsa: Add support for 64-bit statistics")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a8eff035