1. 21 Aug, 2024 4 commits
  2. 20 Aug, 2024 13 commits
    • net: dsa: mv88e6xxx: Fix out-of-bound access · 528876d8
      Joseph Huang authored
      If an ATU violation was caused by a CPU Load operation, the SPID could
      be larger than DSA_MAX_PORTS (the size of mv88e6xxx_chip.ports[] array).
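A minimal sketch of the kind of bounds check this implies, assuming simplified stand-in types. The struct names, the counter field, and the DSA_MAX_PORTS value below are illustrative and are not taken from the driver or from the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel constant; the value here is only illustrative. */
#define DSA_MAX_PORTS 12

/* Hypothetical, simplified versions of the driver structures. */
struct port_stats {
	unsigned long atu_violations;
};

struct chip_model {
	struct port_stats ports[DSA_MAX_PORTS];
};

/*
 * Count a violation only when spid indexes a real entry in ports[].
 * An SPID reported for a CPU Load operation may be out of range,
 * so it is rejected instead of being used as an array index.
 */
static bool count_atu_violation(struct chip_model *chip, uint8_t spid)
{
	if (spid >= DSA_MAX_PORTS)
		return false;

	chip->ports[spid].atu_violations++;
	return true;
}
```

The point of the sketch is only that the index is validated against the array size before it is used.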
      
      Fixes: 75c05a74 ("net: dsa: mv88e6xxx: Fix counting of ATU violations")
      Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://patch.msgid.link/20240819235251.1331763-1-Joseph.Huang@garmin.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: microchip: fix PTP config failure when using multiple ports · 6efea513
      Martin Whitaker authored
      When performing the port_hwtstamp_set operation, ptp_schedule_worker()
      will be called if hardware timestamping is enabled on any of the ports.
      When using multiple ports for PTP, port_hwtstamp_set is executed for
      each port. When called for the first time ptp_schedule_worker() returns
      0. On subsequent calls it returns 1, indicating the worker is already
      scheduled. Currently the ksz driver treats 1 as an error and fails to
      complete the port_hwtstamp_set operation, thus leaving the timestamping
      configuration for those ports unchanged.
      
      This patch fixes this by ignoring the ptp_schedule_worker() return
      value.
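The behaviour described above can be modelled in a few lines. This is a sketch under stated assumptions: schedule_worker_model() is a hypothetical stand-in that only mimics the return values described in the message (0 on the first call, 1 when the worker is already scheduled), and the two hwtstamp functions are illustrative, not the ksz driver code.

```c
#include <assert.h>
#include <stdbool.h>

static bool worker_pending;

/* Hypothetical stand-in: returns 0 when newly queued, 1 when already queued. */
static int schedule_worker_model(void)
{
	int was_pending = worker_pending;

	worker_pending = true;
	return was_pending;
}

/* Old behaviour: a return value of 1 is mistaken for an error. */
static int hwtstamp_set_buggy(bool *port_configured)
{
	int ret = schedule_worker_model();

	if (ret)
		return ret; /* the port's configuration is left unchanged */

	*port_configured = true;
	return 0;
}

/* Fixed behaviour: the return value is ignored. */
static int hwtstamp_set_fixed(bool *port_configured)
{
	(void)schedule_worker_model();
	*port_configured = true;
	return 0;
}
```

With the buggy variant, only the first port ends up configured; with the fixed variant every port does.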
      
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/7aae307a-35ca-4209-a850-7b2749d40f90@martin-whitaker.me.uk
      Fixes: bb01ad30 ("net: dsa: microchip: ptp: manipulating absolute time using ptp hw clock")
      Signed-off-by: Martin Whitaker <foss@martin-whitaker.me.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Acked-by: Arun Ramadoss <arun.ramadoss@microchip.com>
      Link: https://patch.msgid.link/20240817094141.3332-1-foss@martin-whitaker.me.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • igb: cope with large MAX_SKB_FRAGS · 8aba27c4
      Paolo Abeni authored
      Sabrina reports that the igb driver does not cope well with large
      MAX_SKB_FRAGS values: setting MAX_SKB_FRAGS to 45 causes payload
      corruption on TX.
      
      An easy reproducer is to run ssh to connect to the machine.  With
      MAX_SKB_FRAGS=17 it works, with MAX_SKB_FRAGS=45 it fails.  This was
      originally reported in
      https://bugzilla.redhat.com/show_bug.cgi?id=2265320
      
      The root cause of the issue is that the driver does not properly take
      into account the (possibly large) shared info size when selecting the
      ring layout, and will try to fit two packets inside the same 4K page
      even when the 1st fraglist will trample over the 2nd head.
      
      Address the issue by checking if 2K buffers are insufficient.
      
      Fixes: 3948b059 ("net: introduce a config option to tweak MAX_SKB_FRAGS")
      Reported-by: Jan Tluka <jtluka@redhat.com>
      Reported-by: Jirka Hladky <jhladky@redhat.com>
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Tested-by: Sabrina Dubroca <sd@queasysnail.net>
      Tested-by: Corinna Vinschen <vinschen@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Corinna Vinschen <vinschen@redhat.com>
      Link: https://patch.msgid.link/20240816152034.1453285-1-vinschen@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • cxgb4: add forgotten u64 ivlan cast before shift · 80a1e7b8
      Nikolay Kuratov authored
      The cast is done everywhere else in the cxgb4 code, e.g. in
      is_filter_exact_match(). There is no reason it should not be done here.
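To illustrate why the cast matters, here is a small standalone sketch. The function names and the example values are made up for illustration; only the idea of widening to 64 bits before shifting comes from the commit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Without the cast the shift is evaluated in 32 bits, so bits shifted
 * past bit 31 are lost before the result is widened to 64 bits.
 * (A shift count of 32 or more would be undefined behaviour here.)
 */
static uint64_t pack_without_cast(uint32_t ivlan, unsigned int shift)
{
	return ivlan << shift;
}

/* With the cast the shift is evaluated in 64 bits and nothing is lost. */
static uint64_t pack_with_cast(uint32_t ivlan, unsigned int shift)
{
	return (uint64_t)ivlan << shift;
}
```

For example, with ivlan = 0xFFFF and shift = 24 the uncast version yields 0xFF000000, while the cast version yields 0xFFFF000000.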
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE
      Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru>
      Cc: stable@vger.kernel.org
      Fixes: 12b276fb ("cxgb4: add support to create hash filters")
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Link: https://patch.msgid.link/20240819075408.92378-1-kniv@yandex-team.ru
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dpaa2-switch: Fix error checking in dpaa2_switch_seed_bp() · c50e7475
      Dan Carpenter authored
      The dpaa2_switch_add_bufs() function returns the number of bufs that it
      was able to add.  It returns BUFS_PER_CMD (7) for complete success or a
      smaller number if there are not enough pages available.  However, the
      error checking is looking at the total number of bufs instead of the
      number which were added on this iteration.  Thus the error checking
      only works correctly for the first iteration through the loop and
      subsequent iterations are always counted as a success.
      
      Fix this by checking only the bufs added in the current iteration.
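The per-iteration check can be sketched as follows. This is a simplified model, not the driver code: add_bufs_model() is a hypothetical allocator that hands out buffers from a counter, and the -1 error value is a placeholder.

```c
#include <assert.h>

#define BUFS_PER_CMD 7

/* Hypothetical model: hands out up to BUFS_PER_CMD buffers from a pool. */
static int add_bufs_model(int *pool)
{
	int n = *pool < BUFS_PER_CMD ? *pool : BUFS_PER_CMD;

	*pool -= n;
	return n;
}

/* Returns 0 on success, -1 if any single iteration came up short. */
static int seed_pool(int *pool, int num_bufs)
{
	int i, count;

	for (i = 0; i < num_bufs; i += BUFS_PER_CMD) {
		/*
		 * Check the count from this iteration only. Accumulating
		 * with "count += ..." would let every iteration after the
		 * first pass the check even when nothing was added.
		 */
		count = add_bufs_model(pool);
		if (count < BUFS_PER_CMD)
			return -1;
	}
	return 0;
}
```

With a pool of 10 buffers and a request for 14, the second iteration only gets 3 buffers and the shortfall is reported.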
      
      Fixes: 0b1b7137 ("staging: dpaa2-switch: handle Rx path on control interface")
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Link: https://patch.msgid.link/eec27f30-b43f-42b6-b8ee-04a6f83423b6@stanley.mountain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'bonding-fix-xfrm-offload-bugs' · 7565c39d
      Paolo Abeni authored
      Nikolay Aleksandrov says:
      
      ====================
      bonding: fix xfrm offload bugs
      
      I noticed these problems while reviewing a bond xfrm patch recently.
      The fixes are straight-forward, please review carefully the last one
      because it has side-effects. This set has passed bond's selftests
      and my custom bond stress tests which crash without these fixes.
      
      Note the first patch is not critical, but it simplifies the next fix.
      ====================
      
      Link: https://patch.msgid.link/20240816114813.326645-1-razor@blackwall.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • bonding: fix xfrm state handling when clearing active slave · c4c5c5d2
      Nikolay Aleksandrov authored
      If the active slave is cleared manually the xfrm state is not flushed.
      This leads to xfrm add/del imbalance and adding the same state multiple
      times. For example, when the device cannot handle any more states we get:
       [ 1169.884811] bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
      because it's filled with the same state after multiple active slave
      clearings. This change also has a few nice side effects: user-space
      gets a notification for the change, the old device gets its mac address
      and promisc/mcast adjusted properly.
      
      Fixes: 18cb261a ("bonding: support hardware encryption offload to slaves")
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • bonding: fix xfrm real_dev null pointer dereference · f8cde980
      Nikolay Aleksandrov authored
      We shouldn't set real_dev to NULL because packets can be in transit and
      xfrm might call xdo_dev_offload_ok() in parallel. All callbacks assume
      real_dev is set.
      
       Example trace:
       kernel: BUG: unable to handle page fault for address: 0000000000001030
       kernel: bond0: (slave eni0np1): making interface the new active one
       kernel: #PF: supervisor write access in kernel mode
       kernel: #PF: error_code(0x0002) - not-present page
       kernel: PGD 0 P4D 0
       kernel: Oops: 0002 [#1] PREEMPT SMP
       kernel: CPU: 4 PID: 2237 Comm: ping Not tainted 6.7.7+ #12
       kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
       kernel: RIP: 0010:nsim_ipsec_offload_ok+0xc/0x20 [netdevsim]
       kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
       kernel: Code: e0 0f 0b 48 83 7f 38 00 74 de 0f 0b 48 8b 47 08 48 8b 37 48 8b 78 40 e9 b2 e5 9a d7 66 90 0f 1f 44 00 00 48 8b 86 80 02 00 00 <83> 80 30 10 00 00 01 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f
       kernel: bond0: (slave eni0np1): making interface the new active one
       kernel: RSP: 0018:ffffabde81553b98 EFLAGS: 00010246
       kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
       kernel:
       kernel: RAX: 0000000000000000 RBX: ffff9eb404e74900 RCX: ffff9eb403d97c60
       kernel: RDX: ffffffffc090de10 RSI: ffff9eb404e74900 RDI: ffff9eb3c5de9e00
       kernel: RBP: ffff9eb3c0a42000 R08: 0000000000000010 R09: 0000000000000014
       kernel: R10: 7974203030303030 R11: 3030303030303030 R12: 0000000000000000
       kernel: R13: ffff9eb3c5de9e00 R14: ffffabde81553cc8 R15: ffff9eb404c53000
       kernel: FS:  00007f2a77a3ad00(0000) GS:ffff9eb43bd00000(0000) knlGS:0000000000000000
       kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       kernel: CR2: 0000000000001030 CR3: 00000001122ab000 CR4: 0000000000350ef0
       kernel: bond0: (slave eni0np1): making interface the new active one
       kernel: Call Trace:
       kernel:  <TASK>
       kernel:  ? __die+0x1f/0x60
       kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
       kernel:  ? page_fault_oops+0x142/0x4c0
       kernel:  ? do_user_addr_fault+0x65/0x670
       kernel:  ? kvm_read_and_reset_apf_flags+0x3b/0x50
       kernel: bond0: (slave eni0np1): making interface the new active one
       kernel:  ? exc_page_fault+0x7b/0x180
       kernel:  ? asm_exc_page_fault+0x22/0x30
       kernel:  ? nsim_bpf_uninit+0x50/0x50 [netdevsim]
       kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
       kernel:  ? nsim_ipsec_offload_ok+0xc/0x20 [netdevsim]
       kernel: bond0: (slave eni0np1): making interface the new active one
       kernel:  bond_ipsec_offload_ok+0x7b/0x90 [bonding]
       kernel:  xfrm_output+0x61/0x3b0
       kernel: bond0: (slave eni0np1): bond_ipsec_add_sa_all: failed to add SA
       kernel:  ip_push_pending_frames+0x56/0x80
      
      Fixes: 18cb261a ("bonding: support hardware encryption offload to slaves")
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • bonding: fix null pointer deref in bond_ipsec_offload_ok · 95c90e4a
      Nikolay Aleksandrov authored
      We must check if there is an active slave before dereferencing the pointer.
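A minimal sketch of the check, assuming simplified stand-in types. The struct and field names are hypothetical and do not match the bonding driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the bonding structures. */
struct slave_model {
	bool offload_capable;
};

struct bond_model {
	struct slave_model *curr_active; /* NULL when no slave is active */
};

/* Report "no offload" instead of dereferencing a missing active slave. */
static bool offload_ok_model(const struct bond_model *bond)
{
	const struct slave_model *slave = bond->curr_active;

	if (!slave)
		return false;

	return slave->offload_capable;
}
```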
      
      Fixes: 18cb261a ("bonding: support hardware encryption offload to slaves")
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • bonding: fix bond_ipsec_offload_ok return type · fc59b9a5
      Nikolay Aleksandrov authored
      Fix the return type which should be bool.
      
      Fixes: 955b785e ("bonding: fix suspicious RCU usage in bond_ipsec_offload_ok()")
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • ip6_tunnel: Fix broken GRO · 4b3e33fc
      Thomas Bogendoerfer authored
      GRO code checks for matching layer 2 headers to decide whether a packet
      belongs to the same flow. Because ip6 tunnel sets dev->hard_header_len,
      this check fails in cases where it shouldn't. To fix this, don't set
      hard_header_len, but use needed_headroom like ipv4/ip_tunnel.c does.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
      Link: https://patch.msgid.link/20240815151419.109864-1-tbogendoerfer@suse.de
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • kcm: Serialise kcm_sendmsg() for the same socket. · 807067bf
      Kuniyuki Iwashima authored
      syzkaller reported UAF in kcm_release(). [0]
      
      The scenario is
      
        1. Thread A builds a skb with MSG_MORE and sets kcm->seq_skb.
      
        2. Thread A resumes building skb from kcm->seq_skb but is blocked
           by sk_stream_wait_memory()
      
        3. Thread B calls sendmsg() concurrently, finishes building kcm->seq_skb
           and puts the skb to the write queue
      
        4. Thread A faces an error and finally frees skb that is already in the
           write queue
      
        5. kcm_release() does double-free the skb in the write queue
      
      When a thread is building a MSG_MORE skb, another thread must not touch it.
      
      Let's add a per-sk mutex and serialise kcm_sendmsg().
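The idea of a per-socket lock held for the whole send can be sketched in userspace with a pthread mutex. This is only an analogy under stated assumptions: the struct, its fields, and sendmsg_model() are hypothetical and do not reflect the kcm data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model of a socket with an in-progress MSG_MORE message. */
struct sock_model {
	pthread_mutex_t send_lock; /* per-socket lock, held for a whole send */
	size_t partial_len;        /* bytes in the message still being built */
	size_t queued_msgs;        /* completed messages on the write queue */
};

static void sock_model_init(struct sock_model *sk)
{
	pthread_mutex_init(&sk->send_lock, NULL);
	sk->partial_len = 0;
	sk->queued_msgs = 0;
}

/*
 * The lock covers both extending the partial message and queueing it,
 * so a second sender cannot finish or free a message that the first
 * sender is still building.
 */
static void sendmsg_model(struct sock_model *sk, size_t len, int more)
{
	pthread_mutex_lock(&sk->send_lock);
	sk->partial_len += len;
	if (!more) {
		sk->queued_msgs++;
		sk->partial_len = 0;
	}
	pthread_mutex_unlock(&sk->send_lock);
}
```

The lock is per socket rather than global, so senders on different sockets do not contend with each other.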
      
      [0]:
      BUG: KASAN: slab-use-after-free in __skb_unlink include/linux/skbuff.h:2366 [inline]
      BUG: KASAN: slab-use-after-free in __skb_dequeue include/linux/skbuff.h:2385 [inline]
      BUG: KASAN: slab-use-after-free in __skb_queue_purge_reason include/linux/skbuff.h:3175 [inline]
      BUG: KASAN: slab-use-after-free in __skb_queue_purge include/linux/skbuff.h:3181 [inline]
      BUG: KASAN: slab-use-after-free in kcm_release+0x170/0x4c8 net/kcm/kcmsock.c:1691
      Read of size 8 at addr ffff0000ced0fc80 by task syz-executor329/6167
      
      CPU: 1 PID: 6167 Comm: syz-executor329 Tainted: G    B              6.8.0-rc5-syzkaller-g9abbc24128bc #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Call trace:
       dump_backtrace+0x1b8/0x1e4 arch/arm64/kernel/stacktrace.c:291
       show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:298
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xd0/0x124 lib/dump_stack.c:106
       print_address_description mm/kasan/report.c:377 [inline]
       print_report+0x178/0x518 mm/kasan/report.c:488
       kasan_report+0xd8/0x138 mm/kasan/report.c:601
       __asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381
       __skb_unlink include/linux/skbuff.h:2366 [inline]
       __skb_dequeue include/linux/skbuff.h:2385 [inline]
       __skb_queue_purge_reason include/linux/skbuff.h:3175 [inline]
       __skb_queue_purge include/linux/skbuff.h:3181 [inline]
       kcm_release+0x170/0x4c8 net/kcm/kcmsock.c:1691
       __sock_release net/socket.c:659 [inline]
       sock_close+0xa4/0x1e8 net/socket.c:1421
       __fput+0x30c/0x738 fs/file_table.c:376
       ____fput+0x20/0x30 fs/file_table.c:404
       task_work_run+0x230/0x2e0 kernel/task_work.c:180
       exit_task_work include/linux/task_work.h:38 [inline]
       do_exit+0x618/0x1f64 kernel/exit.c:871
       do_group_exit+0x194/0x22c kernel/exit.c:1020
       get_signal+0x1500/0x15ec kernel/signal.c:2893
       do_signal+0x23c/0x3b44 arch/arm64/kernel/signal.c:1249
       do_notify_resume+0x74/0x1f4 arch/arm64/kernel/entry-common.c:148
       exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline]
       exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline]
       el0_svc+0xac/0x168 arch/arm64/kernel/entry-common.c:713
       el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
       el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
      
      Allocated by task 6166:
       kasan_save_stack mm/kasan/common.c:47 [inline]
       kasan_save_track+0x40/0x78 mm/kasan/common.c:68
       kasan_save_alloc_info+0x70/0x84 mm/kasan/generic.c:626
       unpoison_slab_object mm/kasan/common.c:314 [inline]
       __kasan_slab_alloc+0x74/0x8c mm/kasan/common.c:340
       kasan_slab_alloc include/linux/kasan.h:201 [inline]
       slab_post_alloc_hook mm/slub.c:3813 [inline]
       slab_alloc_node mm/slub.c:3860 [inline]
       kmem_cache_alloc_node+0x204/0x4c0 mm/slub.c:3903
       __alloc_skb+0x19c/0x3d8 net/core/skbuff.c:641
       alloc_skb include/linux/skbuff.h:1296 [inline]
       kcm_sendmsg+0x1d3c/0x2124 net/kcm/kcmsock.c:783
       sock_sendmsg_nosec net/socket.c:730 [inline]
       __sock_sendmsg net/socket.c:745 [inline]
       sock_sendmsg+0x220/0x2c0 net/socket.c:768
       splice_to_socket+0x7cc/0xd58 fs/splice.c:889
       do_splice_from fs/splice.c:941 [inline]
       direct_splice_actor+0xec/0x1d8 fs/splice.c:1164
       splice_direct_to_actor+0x438/0xa0c fs/splice.c:1108
       do_splice_direct_actor fs/splice.c:1207 [inline]
       do_splice_direct+0x1e4/0x304 fs/splice.c:1233
       do_sendfile+0x460/0xb3c fs/read_write.c:1295
       __do_sys_sendfile64 fs/read_write.c:1362 [inline]
       __se_sys_sendfile64 fs/read_write.c:1348 [inline]
       __arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1348
       __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
       invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
       el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
       do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
       el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712
       el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
       el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
      
      Freed by task 6167:
       kasan_save_stack mm/kasan/common.c:47 [inline]
       kasan_save_track+0x40/0x78 mm/kasan/common.c:68
       kasan_save_free_info+0x5c/0x74 mm/kasan/generic.c:640
       poison_slab_object+0x124/0x18c mm/kasan/common.c:241
       __kasan_slab_free+0x3c/0x78 mm/kasan/common.c:257
       kasan_slab_free include/linux/kasan.h:184 [inline]
       slab_free_hook mm/slub.c:2121 [inline]
       slab_free mm/slub.c:4299 [inline]
       kmem_cache_free+0x15c/0x3d4 mm/slub.c:4363
       kfree_skbmem+0x10c/0x19c
       __kfree_skb net/core/skbuff.c:1109 [inline]
       kfree_skb_reason+0x240/0x6f4 net/core/skbuff.c:1144
       kfree_skb include/linux/skbuff.h:1244 [inline]
       kcm_release+0x104/0x4c8 net/kcm/kcmsock.c:1685
       __sock_release net/socket.c:659 [inline]
       sock_close+0xa4/0x1e8 net/socket.c:1421
       __fput+0x30c/0x738 fs/file_table.c:376
       ____fput+0x20/0x30 fs/file_table.c:404
       task_work_run+0x230/0x2e0 kernel/task_work.c:180
       exit_task_work include/linux/task_work.h:38 [inline]
       do_exit+0x618/0x1f64 kernel/exit.c:871
       do_group_exit+0x194/0x22c kernel/exit.c:1020
       get_signal+0x1500/0x15ec kernel/signal.c:2893
       do_signal+0x23c/0x3b44 arch/arm64/kernel/signal.c:1249
       do_notify_resume+0x74/0x1f4 arch/arm64/kernel/entry-common.c:148
       exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline]
       exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline]
       el0_svc+0xac/0x168 arch/arm64/kernel/entry-common.c:713
       el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
       el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
      
      The buggy address belongs to the object at ffff0000ced0fc80
       which belongs to the cache skbuff_head_cache of size 240
      The buggy address is located 0 bytes inside of
       freed 240-byte region [ffff0000ced0fc80, ffff0000ced0fd70)
      
      The buggy address belongs to the physical page:
      page:00000000d35f4ae4 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10ed0f
      flags: 0x5ffc00000000800(slab|node=0|zone=2|lastcpupid=0x7ff)
      page_type: 0xffffffff()
      raw: 05ffc00000000800 ffff0000c1cbf640 fffffdffc3423100 dead000000000004
      raw: 0000000000000000 00000000000c000c 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff0000ced0fb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff0000ced0fc00: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
      >ffff0000ced0fc80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                         ^
       ffff0000ced0fd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc
       ffff0000ced0fd80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Reported-by: syzbot+b72d86aa5df17ce74c60@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=b72d86aa5df17ce74c60
      Tested-by: syzbot+b72d86aa5df17ce74c60@syzkaller.appspotmail.com
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://patch.msgid.link/20240815220437.69511-1-kuniyu@amazon.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mctp: test: Use correct skb for route input check · ce335db0
      Jeremy Kerr authored
      In the MCTP route input test, we're routing one skb, then (when delivery
      is expected) checking the resulting routed skb.
      
      However, we're currently checking the original skb length, rather than
      the routed skb. Check the routed skb instead; the original will have
      been freed at this point.
      
      Fixes: 8892c049 ("mctp: Add route input to socket tests")
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/kernel-janitors/4ad204f0-94cf-46c5-bdab-49592addf315@kili.mountain/
      Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Link: https://patch.msgid.link/20240816-mctp-kunit-skb-fix-v1-1-3c367ac89c27@codeconstruct.com.au
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 19 Aug, 2024 4 commits
  4. 17 Aug, 2024 2 commits
  5. 16 Aug, 2024 17 commits
    • Merge branch 'mlx5-misc-fixes-2024-08-15' · 0373d712
      Jakub Kicinski authored
      Tariq Toukan says:
      
      ====================
      mlx5 misc fixes 2024-08-15
      
      This patchset provides misc bug fixes from the team to the mlx5 driver.
      ====================
      
      Link: https://patch.msgid.link/20240815071611.2211873-1-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5: Fix IPsec RoCE MPV trace call · 607e1df7
      Patrisious Haddad authored
      Prevent the call trace below from happening, by not allowing IPsec
      creation over a slave, if master device doesn't support IPsec.
      
      WARNING: CPU: 44 PID: 16136 at kernel/locking/rwsem.c:240 down_read+0x75/0x94
      Modules linked in: esp4_offload esp4 act_mirred act_vlan cls_flower sch_ingress mlx5_vdpa vringh vhost_iotlb vdpa mst_pciconf(OE) nfsv3 nfs_acl nfs lockd grace fscache netfs xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill cuse fuse rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel sha1_ssse3 dell_smbios ib_uverbs aesni_intel crypto_simd dcdbas wmi_bmof dell_wmi_descriptor cryptd pcspkr ib_core acpi_ipmi sp5100_tco ccp i2c_piix4 ipmi_si ptdma k10temp ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect mlx5_core sysimgblt fb_sys_fops cec
       ahci libahci mlxfw drm pci_hyperv_intf libata tg3 sha256_ssse3 tls megaraid_sas i2c_algo_bit psample wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mst_pci]
      CPU: 44 PID: 16136 Comm: kworker/44:3 Kdump: loaded Tainted: GOE 5.15.0-20240509.el8uek.uek7_u3_update_v6.6_ipsec_bf.x86_64 #2
      Hardware name: Dell Inc. PowerEdge R7525/074H08, BIOS 2.0.3 01/15/2021
      Workqueue: events xfrm_state_gc_task
      RIP: 0010:down_read+0x75/0x94
      Code: 00 48 8b 45 08 65 48 8b 14 25 80 fc 01 00 83 e0 02 48 09 d0 48 83 c8 01 48 89 45 08 5d 31 c0 89 c2 89 c6 89 c7 e9 cb 88 3b 00 <0f> 0b 48 8b 45 08 a8 01 74 b2 a8 02 75 ae 48 89 c2 48 83 ca 02 f0
      RSP: 0018:ffffb26387773da8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffffa08b658af900 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: ff886bc5e1366f2f RDI: 0000000000000000
      RBP: ffffa08b658af940 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0a9bfb31540
      R13: ffffa0a9bfb37900 R14: 0000000000000000 R15: ffffa0a9bfb37905
      FS:  0000000000000000(0000) GS:ffffa0a9bfb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055a45ed814e8 CR3: 000000109038a000 CR4: 0000000000350ee0
      Call Trace:
       <TASK>
       ? show_trace_log_lvl+0x1d6/0x2f9
       ? show_trace_log_lvl+0x1d6/0x2f9
       ? mlx5_devcom_for_each_peer_begin+0x29/0x60 [mlx5_core]
       ? down_read+0x75/0x94
       ? __warn+0x80/0x113
       ? down_read+0x75/0x94
       ? report_bug+0xa4/0x11d
       ? handle_bug+0x35/0x8b
       ? exc_invalid_op+0x14/0x75
       ? asm_exc_invalid_op+0x16/0x1b
       ? down_read+0x75/0x94
       ? down_read+0xe/0x94
       mlx5_devcom_for_each_peer_begin+0x29/0x60 [mlx5_core]
       mlx5_ipsec_fs_roce_tx_destroy+0xb1/0x130 [mlx5_core]
       tx_destroy+0x1b/0xc0 [mlx5_core]
       tx_ft_put+0x53/0xc0 [mlx5_core]
       mlx5e_xfrm_free_state+0x45/0x90 [mlx5_core]
       ___xfrm_state_destroy+0x10f/0x1a2
       xfrm_state_gc_task+0x81/0xa9
       process_one_work+0x1f1/0x3c6
       worker_thread+0x53/0x3e4
       ? process_one_work.cold+0x46/0x3c
       kthread+0x127/0x144
       ? set_kthread_struct+0x60/0x52
       ret_from_fork+0x22/0x2d
       </TASK>
      ---[ end trace 5ef7896144d398e1 ]---
      
      Fixes: dfbd229a ("net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic")
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://patch.msgid.link/20240815071611.2211873-5-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: XPS, Fix oversight of Multi-PF Netdev changes · a07e953d
      Carolina Jubran authored
      The offending commit overlooked the Multi-PF Netdev changes.
      
      Revert mlx5e_set_default_xps_cpumasks to incorporate Multi-PF Netdev
      changes.
      
      Fixes: bcee0937 ("net/mlx5e: Modifying channels number and updating TX queues")
      Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://patch.msgid.link/20240815071611.2211873-4-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: SHAMPO, Release in progress headers · 94e52193
      Dragos Tatulea authored
      The change in the fixes tag cleaned up too much: it removed the part
      that was releasing header pages that were posted via UMR but haven't
      been acknowledged yet on the ICOSQ.
      
      This patch corrects this omission by setting the bits between pi and ci
      to on when shutting down a queue with SHAMPO. To be consistent with the
      Striding RQ code, this action is done in mlx5e_free_rx_missing_descs().
      
      Fixes: e839ac9a ("net/mlx5e: SHAMPO, Simplify header page release in teardown")
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://patch.msgid.link/20240815071611.2211873-3-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: SHAMPO, Fix page leak · f232de7c
      Dragos Tatulea authored
      When SHAMPO is used, a receive queue currently almost always leaks one
      page on shutdown.
      
      A page has MLX5E_SHAMPO_WQ_HEADER_PER_PAGE (8) headers. These headers
      are tracked in the SHAMPO bitmap. Each page is released when the last
      header index in the group is processed. During header allocation, there
      can be leftovers from a page that will be used in a subsequent
      allocation. This is normally fine, except for the following scenario
      (simplified a bit):
      
      1) Allocate N new page fragments, showing only the relevant last 4
         fragments:
      
          0: new page
          1: new page
          2: new page
          3: new page
          4: page from previous allocation
          5: page from previous allocation
          6: page from previous allocation
          7: page from previous allocation
      
      2) NAPI processes header indices 4-7 because they are the oldest
         allocated. Bit 7 will be set to 0.
      
      3) Receive queue shutdown occurs. All the remaining bits are being
         iterated on to release the pages. But the page assigned to header
         indices 0-3 will not be freed due to what happened in step 2.
      
      This patch fixes the issue by making sure that on allocation, header
      fragments are always allocated in groups of
      MLX5E_SHAMPO_WQ_HEADER_PER_PAGE so that there is never a partial page
      left over between allocations.
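One way to picture "always allocate in whole groups" is to round a request to a multiple of the per-page header count. This is a sketch under assumptions: the message does not say whether the driver rounds up or down, so rounding down is chosen here purely for illustration, and the helper names are hypothetical.

```c
#include <assert.h>

/* Mirrors MLX5E_SHAMPO_WQ_HEADER_PER_PAGE (8) as quoted in the text. */
#define HEADERS_PER_PAGE 8

/* Round a requested header count down to a whole number of pages. */
static unsigned int whole_page_headers(unsigned int requested)
{
	return requested - (requested % HEADERS_PER_PAGE);
}

/* Headers that would be deferred to a later allocation (assumption). */
static unsigned int deferred_headers(unsigned int requested)
{
	return requested % HEADERS_PER_PAGE;
}
```

Because every allocation then covers complete pages, the last header index of each page is always part of the same allocation as the first, which is the property the leak scenario above violates.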
      
      A more appropriate fix would be a refactoring of
      mlx5e_alloc_rx_hd_mpwqe() and mlx5e_build_shampo_hd_umr(). But this
      refactoring is too big for net. It will be targeted for net-next.
      
      Fixes: e839ac9a ("net/mlx5e: SHAMPO, Simplify header page release in teardown")
      Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://patch.msgid.link/20240815071611.2211873-2-tariqt@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'vln-ocelot-fixes' · 3d93a144
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      VLAN fixes for Ocelot driver
      
      This is a collection of patches I've gathered over the past several
      months.
      
      Patches 1-6/14 are supporting patches for selftests.
      
      Patch 9/14 fixes PTP TX from a VLAN upper of a VLAN-aware bridge port
      when using the "ocelot-8021q" tagging protocol. Patch 7/14 is its
      supporting selftest.
      
      Patch 10/14 fixes the QoS class used by PTP in the same case as above.
      It is hard to quantify - there is no selftest.
      
      Patch 11/14 fixes potential data corruption during PTP TX in the same
      case as above. Again, there is no selftest.
      
      Patch 13/14 fixes RX in the same case as above - 8021q upper of a
      VLAN-aware bridge port, with the "ocelot-8021q" tagging protocol. Patch
      12/14 is a supporting patch for this in the DSA core, and 7/14 is also
      its selftest.
      
      Patch 14/14 ensures that VLAN-aware bridges offloaded to Ocelot only
      react to the ETH_P_8021Q TPID, and treat absolutely everything else as
      VLAN-untagged, including ETH_P_8021AD. Patch 8/14 is the supporting
      selftest.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d93a144
    • Vladimir Oltean's avatar
      net: mscc: ocelot: treat 802.1ad tagged traffic as 802.1Q-untagged · 36dd1141
      Vladimir Oltean authored
      I was revisiting the topic of 802.1ad treatment in the Ocelot switch [0]
      and realized that not only is its basic VLAN classification pipeline
      improper for offloading vlan_protocol 802.1ad bridges, but also improper
      for offloading regular 802.1Q bridges already.
      
      Namely, 802.1ad-tagged traffic should be treated as VLAN-untagged by
      bridged ports, but this switch treats it as if it was 802.1Q-tagged with
      the same VID as in the 802.1ad header. This is markedly different to
      what the Linux bridge expects; see the "other_tpid()" function in
      tools/testing/selftests/net/forwarding/bridge_vlan_aware.sh.
      
      An idea came to me that the VCAP IS1 TCAM is more powerful than I'm
      giving it credit for, and that it actually overwrites the classified VID
      before the VLAN Table lookup takes place. In other words, it can be
      used even to save a packet from being dropped on ingress due to VLAN
      membership.
      
      Add a sophisticated TCAM rule hardcoded into the driver to force the
      switch to behave like a Linux bridge with vlan_filtering 1 vlan_protocol
      802.1Q.
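The behavior this rule is meant to enforce can be sketched in user-space C as follows. This is a simplified model of what a vlan_protocol 802.1Q bridge expects, not the driver's TCAM programming; classify_vid() and its parameters are hypothetical names chosen for illustration.

```c
#include <stdint.h>

#define ETH_P_8021Q  0x8100	/* 802.1Q C-tag TPID */
#define ETH_P_8021AD 0x88A8	/* 802.1ad S-tag TPID */

/* Return the VID that a vlan_protocol 802.1Q bridge would classify a
 * frame to. tpid is the EtherType found right after the source MAC, tci
 * is the 16-bit field following it (only meaningful for a VLAN TPID),
 * and pvid is the port's PVID.
 */
uint16_t classify_vid(uint16_t tpid, uint16_t tci, uint16_t pvid)
{
	uint16_t vid = tci & 0x0FFF;

	/* Only the bridge's own vlan_protocol counts as a VLAN tag.
	 * VID 0 is a priority tag and falls back to the PVID as well.
	 */
	if (tpid == ETH_P_8021Q && vid != 0)
		return vid;

	/* 802.1ad and every other EtherType: treated as untagged. */
	return pvid;
}
```

With a PVID of 1, an 802.1Q frame carrying VID 700 is classified to VLAN 700, while an 802.1ad frame carrying the same VID is classified to VLAN 1.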
      
      Regarding the lifetime of the filter: eventually the bridge will
      disappear, and vlan_filtering on the port will be restored to 0 for
      standalone mode. Then the filter will be deleted.
      
      [0]: https://lore.kernel.org/netdev/20201009122947.nvhye4hvcha3tljh@skbuf/
      
      Fixes: 7142529f ("net: mscc: ocelot: add VLAN filtering")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36dd1141
    • Vladimir Oltean's avatar
      net: dsa: felix: fix VLAN tag loss on CPU reception with ocelot-8021q · f1288fd7
      Vladimir Oltean authored
      There is a major design bug with ocelot-8021q, which is that it expects
      more of the hardware than the hardware can actually do. The short
      summary of the issue is that when a port is under a VLAN-aware bridge
      and we use this tagging protocol, VLAN upper interfaces of this port do
      not see RX traffic.
      
      We use VCAP ES0 (egress rewriter) rules towards the tag_8021q CPU port
      to encapsulate packets with an outer tag, later stripped by software,
      that depends on the source user port. We do this so that packets can be
      identified in ocelot_rcv(). To be precise, we create rules with
      push_outer_tag = OCELOT_ES0_TAG and push_inner_tag = 0.
      
      With this configuration, we expect the switch to keep the inner tag
      configuration as found in the packet (if it was untagged on user port
      ingress, keep it untagged, otherwise preserve the VLAN tag unmodified
      as the inner tag towards the tag_8021q CPU port). But this is not what
      happens.
      
      Instead, table "Tagging Combinations" from the user manual suggests
      that when the ES0 action is "PUSH_OUTER_TAG=1 and PUSH_INNER_TAG=0",
      there will be "no inner tag". Experimentation further clarifies what
      this means.
      
      It appears that this "inner tag" which is not pushed into the packet on
      its egress towards the CPU is none other than the classified VLAN.
      
      When the ingress user port is standalone or under a VLAN-unaware bridge,
      the classified VLAN is a discardable quantity: it is a fixed value - the
      result of ocelot_vlan_unaware_pvid()'s configuration, and actually
      independent of the VID from any 802.1Q header that may be in the frame.
      It is actually preferable to discard the "inner tag" in this case.
      
      The problem is when the ingress port is under a VLAN-aware bridge.
      Then, the classified VLAN is taken from the frame's 802.1Q header, with
      a fallback on the bridge port's PVID. It would be very good to not
      discard the "inner tag" here, because if we do, we break communication
      with any 8021q VLAN uppers that the port might have. These have a
      processing path outside the bridge.
      
      There seems to be nothing else we can do except to change the
      configuration for VCAP ES0 rules, to actually push the inner VLAN into
      the frame. There are 2 options for that: the first is to push a fixed
      value specified in the rule, and the second is to push the arithmetic
      sum of a fixed value and the classified VLAN. We choose the second
      option, with the fixed value set to 0. Thus, what is pushed in the inner
      tag is just the classified VLAN.
      
      From there, we need to perform software untagging, in the receive path,
      of stuff that was untagged on the wire.
      
      Fixes: 7c83a7c5 ("net: dsa: add a second tagger for Ocelot switches based on tag_8021q")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1288fd7
    • Vladimir Oltean's avatar
      net: dsa: provide a software untagging function on RX for VLAN-aware bridges · 93e4649e
      Vladimir Oltean authored
      Through code analysis, I realized that the ds->untag_bridge_pvid logic
      is contradictory - see the newly added FIXME above the kernel-doc for
      dsa_software_untag_vlan_unaware_bridge().
      
      Moreover, for the Felix driver, I need something very similar, but which
      is actually _not_ contradictory: untag the bridge PVID on RX, but for
      VLAN-aware bridges. The existing logic does it for VLAN-unaware bridges.
      
      Since I don't want to change the functionality of drivers which were
      supposedly properly tested with the ds->untag_bridge_pvid flag, I have
      introduced a new one: ds->untag_vlan_aware_bridge_pvid, and I have
      refactored the DSA reception code into a common path for both flags.
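A minimal user-space sketch of the untagging decision described above, assuming a simplified frame and port model. The struct layouts and the function name are hypothetical and are not the kernel's data structures or the helper added by this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a received frame's outermost tag. */
struct rx_frame {
	bool tagged;		/* a VLAN tag is present */
	uint16_t tpid;		/* TPID of that tag */
	uint16_t vid;		/* VID of that tag */
};

/* Hypothetical, simplified view of the bridge state seen by a port. */
struct bridge_port_state {
	bool vlan_filtering;	/* bridge is VLAN-aware */
	uint16_t vlan_proto;	/* bridge vlan_protocol */
	uint16_t pvid;		/* port PVID, 0 if none is configured */
};

/* Strip the tag only when the bridge is VLAN-aware and the tag carries
 * the port's PVID in the bridge's own VLAN protocol.
 * Returns true if the frame was untagged.
 */
bool untag_vlan_aware_bridge_pvid(struct rx_frame *frame,
				  const struct bridge_port_state *port)
{
	if (!port->vlan_filtering)
		return false;
	if (!frame->tagged || frame->tpid != port->vlan_proto)
		return false;
	if (port->pvid == 0 || frame->vid != port->pvid)
		return false;

	frame->tagged = false;
	frame->vid = 0;
	return true;
}
```

In this model, a frame tagged with any VID other than the PVID keeps its tag, so VLAN uppers of the port still see it.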
      
      TODO: both flags should be unified under a single ds->software_vlan_untag,
      which users of both current flags should set. This is not something that
      can be carried out right away. It needs very careful examination of all
      drivers which make use of this functionality, since some of them
      actually get this wrong in the first place.
      
      For example, commit 9130c2d3 ("net: dsa: microchip: ksz8795: Use
      software untagging on CPU port") uses this in a driver which has
      ds->configure_vlan_while_not_filtering = true. The latter mechanism has
      been known for many years to be broken by design:
      https://lore.kernel.org/netdev/CABumfLzJmXDN_W-8Z=p9KyKUVi_HhS7o_poBkeKHS2BkAiyYpw@mail.gmail.com/
      and we have the situation of 2 bugs canceling each other. There is no
      private VLAN, and the port follows the PVID of the VLAN-unaware bridge.
      So, it's kinda ok for that driver to use the ds->untag_bridge_pvid
      mechanism, in a broken way.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93e4649e
    • Vladimir Oltean's avatar
      net: mscc: ocelot: serialize access to the injection/extraction groups · c5e12ac3
      Vladimir Oltean authored
      As explained by Horatiu Vultur in commit 603ead96 ("net: sparx5: Add
      spinlock for frame transmission from CPU") which is for a similar
      hardware design, multiple CPUs can simultaneously perform injection
      or extraction. There are only 2 register groups for injection and 2
      for extraction, and the driver only uses one of each. So we'd better
      serialize access using spin locks, otherwise frame corruption is
      possible.
      
      Note that unlike in sparx5, FDMA in ocelot does not have this issue
      because struct ocelot_fdma_tx_ring already contains an xmit_lock.
      
      I guess this is mostly a problem for NXP LS1028A, as that is dual core.
      I don't think VSC7514 is. So I'm blaming the commit where LS1028A (aka
      the felix DSA driver) started using register-based packet injection and
      extraction.
      
      Fixes: 0a6f17c6 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5e12ac3
    • Vladimir Oltean's avatar
      net: mscc: ocelot: fix QoS class for injected packets with "ocelot-8021q" · e1b9e802
      Vladimir Oltean authored
      There are 2 distinct code paths (listed below) in the source code which
      set up an injection header for Ocelot(-like) switches. Code path (2)
      does not set the QoS class and source port correctly. The improper QoS
      classification is especially a problem for the "ocelot-8021q"
      alternative DSA tagging protocol, because we support tc-taprio and each
      packet needs to be scheduled precisely through its time slot. This
      includes PTP, which is normally assigned to a traffic class other than
      0, but would be sent through TC 0 nonetheless.
      
      The code paths are:
      
      (1) ocelot_xmit_common() from net/dsa/tag_ocelot.c - called only by the
          standard "ocelot" DSA tagging protocol which uses NPI-based
          injection - sets up bit fields in the tag manually to account for
          a small difference (destination port offset) between Ocelot and
      Seville. Namely, ocelot_ifh_set_dest() is omitted from
          ocelot_xmit_common(), because there's also seville_ifh_set_dest().
      
      (2) ocelot_ifh_set_basic(), called by:
          - ocelot_fdma_prepare_skb() for FDMA transmission of the ocelot
            switchdev driver
          - ocelot_port_xmit() -> ocelot_port_inject_frame() for
            register-based transmission of the ocelot switchdev driver
          - felix_port_deferred_xmit() -> ocelot_port_inject_frame() for the
            DSA tagger ocelot-8021q when it must transmit PTP frames (also
            through register-based injection).
          sets the bit fields according to its own logic.
      
      The problem is that (2) doesn't call ocelot_ifh_set_qos_class().
      Copying that logic from ocelot_xmit_common() fixes that.
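The commit text does not spell out the copied logic. As an assumption, the sketch below models a QoS class that is looked up from the packet priority through a priority-to-traffic-class map when traffic classes are configured, and taken from the priority directly otherwise. The struct, the map size and the function name are hypothetical.

```c
#include <stdint.h>

#define PRIO_MAP_SIZE 16	/* assumed number of priority levels */

/* Hypothetical stand-in for a net device's traffic class setup. */
struct tc_config {
	unsigned int num_tc;			/* 0 = no classes set up */
	uint8_t prio_tc_map[PRIO_MAP_SIZE];	/* priority -> class */
};

/* Pick the QoS class to write into the injection header. */
uint32_t pick_qos_class(const struct tc_config *cfg, uint32_t priority)
{
	if (cfg->num_tc)
		return cfg->prio_tc_map[priority % PRIO_MAP_SIZE];

	return priority;
}
```

Under this model, a PTP packet sent with priority 7 on a port whose map assigns priority 7 to class 5 is injected with QoS class 5 rather than 0.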
      
      Unfortunately, although desirable, it is not easily possible to
      de-duplicate code paths (1) and (2), and make net/dsa/tag_ocelot.c
      directly call ocelot_ifh_set_basic(), because of the ocelot/seville
      difference. This is the "minimal" fix with some logic duplicated (but
      at least more consolidated).
      
      Fixes: 0a6f17c6 ("net: dsa: tag_ocelot_8021q: add support for PTP timestamping")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1b9e802
    • Vladimir Oltean's avatar
      net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and register injection · 67c3ca2c
      Vladimir Oltean authored
      Problem description
      -------------------
      
      On an NXP LS1028A (felix DSA driver) with the following configuration:
      
      - ocelot-8021q tagging protocol
      - VLAN-aware bridge (with STP) spanning at least swp0 and swp1
      - 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700
      - ptp4l on swp0.700 and swp1.700
      
      we see that the ptp4l instances do not see each other's traffic,
      and they all go to the grand master state due to the
      ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition.
      
      Jumping to the conclusion for the impatient
      -------------------------------------------
      
      There is a zero-day bug in the ocelot switchdev driver in the way it
      handles VLAN-tagged packet injection. The correct logic already exists in
      the source code, in function ocelot_xmit_get_vlan_info() added by commit
      5ca721c5 ("net: dsa: tag_ocelot: set the classified VLAN during xmit").
      But it is used only for normal NPI-based injection with the DSA "ocelot"
      tagging protocol. The other injection code paths (register-based and
      FDMA-based) roll their own wrong logic. This affects and was noticed on
      the DSA "ocelot-8021q" protocol because it uses register-based injection.
      
      By moving ocelot_xmit_get_vlan_info() to a place that's common for both
      the DSA tagger and the ocelot switch library, it can also be called from
      ocelot_port_inject_frame() in ocelot.c.
      
      We need to touch the lines with ocelot_ifh_port_set()'s prototype
      anyway, so let's rename it to something clearer regarding what it does,
      and add a kernel-doc. ocelot_ifh_set_basic() should do.
      
      Investigation notes
      -------------------
      
      Debugging reveals that PTP event frames (i.e. those carrying timestamps,
      like Sync) injected into swp0.700 (but also swp1.700) hit the wire
      with two VLAN tags:
      
      00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc
                                                    ~~~~~~~~~~~
      00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00
                ~~~~~~~~~~~
      00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03
      00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00
      00000040: 00 00
      
      The second (unexpected) VLAN tag makes felix_check_xtr_pkt() ->
      ptp_classify_raw() fail to see these as PTP packets at the link
      partner's receiving end, and return PTP_CLASS_NONE (because the BPF
      classifier is not written to expect 2 VLAN tags).
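The double tag can be confirmed by walking the first bytes of the dumped frame. The following is a user-space sketch with hypothetical names; it is not the kernel's PTP classifier, it only counts consecutive 802.1Q tags and reports the EtherType that follows them.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_8021Q 0x8100

/* First 22 bytes of the frame dumped above. */
const uint8_t sync_frame[] = {
	0x01, 0x1b, 0x19, 0x00, 0x00, 0x00,	/* destination MAC */
	0x00, 0x01, 0x02, 0x03, 0x04, 0x05,	/* source MAC */
	0x81, 0x00, 0x02, 0xbc,			/* 802.1Q tag, VID 700 */
	0x81, 0x00, 0x02, 0xbc,			/* second, unexpected tag */
	0x88, 0xf7,				/* EtherType: PTP */
};

/* Count consecutive 802.1Q tags after the MAC addresses and store the
 * EtherType that follows them. Returns -1 if the frame is truncated.
 */
int count_vlan_tags(const uint8_t *frame, size_t len, uint16_t *ethertype)
{
	size_t off = 12;	/* skip destination + source MAC */
	int tags = 0;

	for (;;) {
		uint16_t type;

		if (off + 2 > len)
			return -1;

		type = (uint16_t)(frame[off] << 8 | frame[off + 1]);
		if (type != ETH_P_8021Q) {
			*ethertype = type;
			return tags;
		}

		if (off + 4 > len)
			return -1;

		off += 4;	/* skip TPID + TCI */
		tags++;
	}
}
```

Calling count_vlan_tags(sync_frame, sizeof(sync_frame), &type) finds two tags ahead of the PTP EtherType 0x88f7, where a single one was expected.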
      
      Packets have 2 VLAN tags because the transmission code treats VLAN
      incorrectly.
      
      Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX
      feature. Therefore, at xmit time, all VLANs should be in the skb head,
      and none should be in the hwaccel area. This is done by:
      
      static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
      					  netdev_features_t features)
      {
      	if (skb_vlan_tag_present(skb) &&
      	    !vlan_hw_offload_capable(features, skb->vlan_proto))
      		skb = __vlan_hwaccel_push_inside(skb);
      	return skb;
      }
      
      But ocelot_port_inject_frame() handles things incorrectly:
      
      	ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb));
      
      void ocelot_ifh_port_set(void *ifh, int port, u32 rew_op, u32 vlan_tag)
      {
      	(...)
      	if (vlan_tag)
      		ocelot_ifh_set_vlan_tci(ifh, vlan_tag);
      	(...)
      }
      
      After pushing the tag inside the skb head, __vlan_hwaccel_push_inside()
      clears the hwaccel tag by calling:
      
      static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
      {
      	skb->vlan_present = 0;
      }
      
      which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get().
      This means that ocelot, when it calls skb_vlan_tag_get(), sees
      (and uses) a residual skb->vlan_tci, while the same VLAN tag is
      _already_ in the skb head.
      
      The trivial fix for double VLAN headers is to replace the content of
      ocelot_ifh_port_set() with:
      
      	if (skb_vlan_tag_present(skb))
      		ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb));
      
      but this would not be correct either, because, as mentioned,
      vlan_hw_offload_capable() is false for us, so we'd be inserting dead
      code and we'd always transmit packets with VID=0 in the injection frame
      header.
      
      I can't actually test the ocelot switchdev driver and rely exclusively
      on code inspection, but I don't think traffic from 8021q uppers has ever
      been injected properly, and not double-tagged. Thus I'm blaming the
      introduction of VLAN fields in the injection header - early driver code.
      
      As hinted at in the early conclusion, what we _want_ to happen for
      VLAN transmission was already described once in commit 5ca721c5
      ("net: dsa: tag_ocelot: set the classified VLAN during xmit").
      
      ocelot_xmit_get_vlan_info() intends to ensure that if the port through
      which we're transmitting is under a VLAN-aware bridge, the outer VLAN
      tag from the skb head is stripped from there and inserted into the
      injection frame header (so that the packet is processed in hardware
      through that actual VLAN). And in all other cases, the packet is sent
      with VID=0 in the injection frame header, since the port is VLAN-unaware
      and has logic to strip this VID on egress (making it invisible to the
      wire).
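The intended behavior described above can be modeled in user space on a raw frame buffer. This is a sketch under the assumptions stated in the text (strip the outer 802.1Q tag only for ports under a VLAN-aware bridge, use VID=0 in all other cases); the function name and signature are hypothetical and differ from ocelot_xmit_get_vlan_info().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN    6
#define VLAN_HLEN   4
#define ETH_P_8021Q 0x8100

/* Return the TCI to place in the injection frame header. When the port
 * is under a VLAN-aware bridge and the frame starts with an 802.1Q tag,
 * that tag is removed from the buffer and *len shrinks by VLAN_HLEN.
 * In all other cases the frame is left alone and 0 is returned.
 */
uint16_t xmit_get_vlan_tci(uint8_t *frame, size_t *len, bool vlan_aware)
{
	size_t off = 2 * ETH_ALEN;	/* tag sits after both MACs */
	uint16_t tpid, tci;

	if (!vlan_aware || *len < off + VLAN_HLEN)
		return 0;

	tpid = (uint16_t)(frame[off] << 8 | frame[off + 1]);
	if (tpid != ETH_P_8021Q)
		return 0;

	tci = (uint16_t)(frame[off + 2] << 8 | frame[off + 3]);

	/* Close the 4-byte gap left by the removed tag. */
	memmove(frame + off, frame + off + VLAN_HLEN,
		*len - off - VLAN_HLEN);
	*len -= VLAN_HLEN;

	return tci;
}
```

For a frame from swp0.700 on a VLAN-aware bridge port, this returns a TCI with VID 700 and leaves the PTP EtherType directly after the source MAC, so the tag exists only in the injection header.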
      
      Fixes: 08d02364 ("net: mscc: fix the injection header")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67c3ca2c
    • Vladimir Oltean's avatar
      selftests: net: bridge_vlan_aware: test that other TPIDs are seen as untagged · e29b82ef
      Vladimir Oltean authored
      The bridge VLAN implementation w.r.t. VLAN protocol is described in
      merge commit 1a0b20b2 ("Merge branch 'bridge-next'"). We are only
      sensitive to those VLAN tags whose TPID is equal to the bridge's
      vlan_protocol. Thus, an 802.1ad VLAN should be treated as 802.1Q-untagged.
      
      Add 3 tests which validate that:
      - 802.1ad-tagged traffic is learned into the PVID of an 802.1Q-aware
        bridge
      - Double-tagged traffic is forwarded when just the PVID of the port is
        present in the VLAN group of the ports
      - Double-tagged traffic is not forwarded when the PVID of the port is
        absent from the VLAN group of the ports
      
      The test passes with both veth and ocelot.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e29b82ef
    • Vladimir Oltean's avatar
      selftests: net: local_termination: add PTP frames to the mix · 23797950
      Vladimir Oltean authored
      A breakage in the felix DSA driver shows we do not have enough test
      coverage. More generally, PTP traffic is sufficiently special that
      drivers are likely to treat it differently.
      
      This is not meant to be a full PTP test, it just makes sure that PTP
      packets sent to the different addresses corresponding to their profiles
      are received correctly. The local_termination selftest seemed like the
      most appropriate place for this addition.
      
      PTP RX/TX in some cases makes no sense (over a bridge) and this is why
      $skip_ptp exists. In others - PTP over a bridge port - the IP stack
      needs convincing through the available bridge netfilter hooks to leave
      the PTP packets alone, so that the bridge rx_handler does not steal
      them. It is safe to assume that users have that figured out already.
      This is a driver level test, and by using tcpdump, all that extra setup
      is out of scope here.
      
      send_non_ip() was an unfinished idea; written but never used.
      Replace it with a more generic send_raw(), and send 3 PTP packet types
      times 3 transports.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      23797950
    • Vladimir Oltean's avatar
      selftests: net: local_termination: don't use xfail_on_veth() · 9aa3749c
      Vladimir Oltean authored
      xfail_on_veth() for this test is an incorrect approximation which gives
      false positives and false negatives.
      
      When local_termination fails with "reception succeeded, but should have failed",
      it is because the DUT ($h2) accepts packets even when not configured as
      promiscuous. This is not something specific to veth; even the bridge
      behaves that way, but this is not captured by the xfail_on_veth test.
      
      The IFF_UNICAST_FLT flag is not explicitly exported to user space, but
      it can somewhat be determined from the interface's behavior. We have to
      create a macvlan upper with a different MAC address. This forces a
      dev_uc_add() call in the kernel. When the unicast filtering list is
      not empty, but the device doesn't support IFF_UNICAST_FLT,
      __dev_set_rx_mode() force-enables promiscuity on the interface, to
      ensure correct behavior (that the requested address is received).
      
      We can monitor the change in the promiscuity flag and infer from it
      whether the device supports unicast filtering.
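The inference can be illustrated with a small user-space model of the rule described above. The struct and function names are hypothetical; the model only mirrors the described behavior (force promiscuity when the unicast list is non-empty and the device cannot filter) and is not the kernel code.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a network device. */
struct fake_netdev {
	bool supports_unicast_flt;	/* models IFF_UNICAST_FLT */
	unsigned int uc_count;		/* secondary unicast addresses */
	bool uc_promisc;		/* promiscuity was force-enabled */
	unsigned int promiscuity;	/* promiscuity reference count */
};

/* Models the rule: without unicast filtering support, a non-empty
 * unicast list forces promiscuity on, and an empty one releases it.
 */
void fake_set_rx_mode(struct fake_netdev *dev)
{
	if (dev->supports_unicast_flt)
		return;

	if (dev->uc_count > 0 && !dev->uc_promisc) {
		dev->promiscuity++;
		dev->uc_promisc = true;
	} else if (dev->uc_count == 0 && dev->uc_promisc) {
		dev->promiscuity--;
		dev->uc_promisc = false;
	}
}

/* Add an address (as a macvlan upper with a different MAC would),
 * watch the promiscuity count, then undo the change.
 * Returns true if the device appears to support unicast filtering.
 */
bool infer_unicast_flt(struct fake_netdev *dev)
{
	unsigned int before = dev->promiscuity;
	bool changed;

	dev->uc_count++;
	fake_set_rx_mode(dev);
	changed = dev->promiscuity != before;

	dev->uc_count--;
	fake_set_rx_mode(dev);

	return !changed;
}
```

A device whose promiscuity count rises while the extra address is present is reported as lacking unicast filtering, and the probe leaves the count as it found it.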
      
      There is no equivalent thing for allmulti, unfortunately. We never know
      what's hiding behind a device which has allmulti=off. Whether it will
      actually perform RX multicast filtering of unknown traffic is a strong
      "maybe". The bridge driver, for example, completely ignores the flag.
      We'll have to keep the xfail behavior, but instead of XFAIL on just
      veth, always XFAIL.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9aa3749c
    • Vladimir Oltean's avatar
      selftests: net: local_termination: introduce new tests which capture VLAN behavior · 5fea8bb0
      Vladimir Oltean authored
      Add more coverage to the local termination selftest as follows:
      - 8021q upper of $h2
      - 8021q upper of $h2, where $h2 is a port of a VLAN-unaware bridge
      - 8021q upper of $h2, where $h2 is a port of a VLAN-aware bridge
      - 8021q upper of VLAN-unaware br0, which is the upper of $h2
      - 8021q upper of VLAN-aware br0, which is the upper of $h2
      
      The cases with traffic sent through the VLAN upper of a VLAN-aware
      bridge port will be especially relevant once we start transmitting PTP
      packets as an additional kind of traffic.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fea8bb0
    • Vladimir Oltean's avatar
      selftests: net: local_termination: add one more test for VLAN-aware bridges · 5b8e7418
      Vladimir Oltean authored
      The current bridge() test is for packet reception on a VLAN-unaware
      bridge. Some things are different enough with VLAN-aware bridges that
      it's worth renaming this test to vlan_unaware_bridge(), and adding a
      new vlan_aware_bridge() test.
      
      The two will share the same implementation: bridge() becomes a common
      function, which receives $vlan_filtering as an argument. Rename it to
      test_bridge() at the same time, because just bridge() pollutes the
      global namespace and we cannot invoke the binary with the same name from
      the iproute2 package currently.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b8e7418