1. 28 Feb, 2024 2 commits
    • Florian Westphal's avatar
      netfilter: bridge: confirm multicast packets before passing them up the stack · 62e7151a
      Florian Westphal authored
      conntrack nf_confirm logic cannot handle cloned skbs referencing
      the same nf_conn entry, which will happen for multicast (broadcast)
      frames on bridges.
      
       Example:
          macvlan0
             |
            br0
           /  \
        ethX    ethY
      
       ethX (or Y) receives a L2 multicast or broadcast packet containing
       an IP packet, flow is not yet in conntrack table.
      
       1. skb passes through bridge and fake-ip (br_netfilter)Prerouting.
          -> skb->_nfct now references a unconfirmed entry
       2. skb is broad/mcast packet. bridge now passes clones out on each bridge
          interface.
       3. skb gets passed up the stack.
       4. In macvlan case, macvlan driver retains clone(s) of the mcast skb
          and schedules a work queue to send them out on the lower devices.
      
          The clone skb->_nfct is not a copy, it is the same entry as the
          original skb.  The macvlan rx handler then returns RX_HANDLER_PASS.
       5. Normal conntrack hooks (in NF_INET_LOCAL_IN) confirm the orig skb.
      
      The Macvlan broadcast worker and normal confirm path will race.
      
      This race will not happen if step 2 already confirmed a clone. In that
      case later steps perform skb_clone() with skb->_nfct already confirmed (in
      hash table).  This works fine.
      
      But such confirmation won't happen when eb/ip/nftables rules dropped the
      packets before they reached the nf_confirm step in postrouting.
      
      Pablo points out that nf_conntrack_bridge doesn't allow use of stateful
      nat, so we can safely discard the nf_conn entry and let inet call
      conntrack again.
      
      This doesn't work for bridge netfilter: skb could have a nat
      transformation. Also bridge nf prevents re-invocation of inet prerouting
      via 'sabotage_in' hook.
      
      Work around this problem by explicit confirmation of the entry at LOCAL_IN
      time, before upper layer has a chance to clone the unconfirmed entry.
      
      The downside is that this disables NAT and conntrack helpers.
      
      Alternative fix would be to add locking to all code parts that deal with
      unconfirmed packets, but even if that could be done in a sane way this
      opens up other problems, for example:
      
      -m physdev --physdev-out eth0 -j SNAT --snat-to 1.2.3.4
      -m physdev --physdev-out eth1 -j SNAT --snat-to 1.2.3.5
      
      For multicast case, only one of such conflicting mappings will be
      created, conntrack only handles 1:1 NAT mappings.
      
      Users should set create a setup that explicitly marks such traffic
      NOTRACK (conntrack bypass) to avoid this, but we cannot auto-bypass
      them, ruleset might have accept rules for untracked traffic already,
      so user-visible behaviour would change.
      Suggested-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217777Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      62e7151a
    • Ignat Korchagin's avatar
      netfilter: nf_tables: allow NFPROTO_INET in nft_(match/target)_validate() · 7e0f122c
      Ignat Korchagin authored
      Commit d0009eff ("netfilter: nf_tables: validate NFPROTO_* family") added
      some validation of NFPROTO_* families in the nft_compat module, but it broke
      the ability to use legacy iptables modules in dual-stack nftables.
      
      While with legacy iptables one had to independently manage IPv4 and IPv6
      tables, with nftables it is possible to have dual-stack tables sharing the
      rules. Moreover, it was possible to use rules based on legacy iptables
      match/target modules in dual-stack nftables.
      
      As an example, the program from [2] creates an INET dual-stack family table
      using an xt_bpf based rule, which looks like the following (the actual output
      was generated with a patched nft tool as the current nft tool does not parse
      dual stack tables with legacy match rules, so consider it for illustrative
      purposes only):
      
      table inet testfw {
        chain input {
          type filter hook prerouting priority filter; policy accept;
          bytecode counter packets 0 bytes 0 accept
        }
      }
      
      After d0009eff ("netfilter: nf_tables: validate NFPROTO_* family") we get
      EOPNOTSUPP for the above program.
      
      Fix this by allowing NFPROTO_INET for nft_(match/target)_validate(), but also
      restrict the functions to classic iptables hooks.
      
      Changes in v3:
        * clarify that upstream nft will not display such configuration properly and
          that the output was generated with a patched nft tool
        * remove example program from commit description and link to it instead
        * no code changes otherwise
      
      Changes in v2:
        * restrict nft_(match/target)_validate() to classic iptables hooks
        * rewrite example program to use unmodified libnftnl
      
      Fixes: d0009eff ("netfilter: nf_tables: validate NFPROTO_* family")
      Link: https://lore.kernel.org/all/Zc1PfoWN38UuFJRI@calendula/T/#mc947262582c90fec044c7a3398cc92fac7afea72 [1]
      Link: https://lore.kernel.org/all/20240220145509.53357-1-ignat@cloudflare.com/ [2]
      Reported-by: default avatarJordan Griege <jgriege@cloudflare.com>
      Signed-off-by: default avatarIgnat Korchagin <ignat@cloudflare.com>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      7e0f122c
  2. 22 Feb, 2024 17 commits
  3. 21 Feb, 2024 18 commits
    • Florian Westphal's avatar
      netfilter: nf_tables: use kzalloc for hook allocation · 195e5f88
      Florian Westphal authored
      KMSAN reports unitialized variable when registering the hook,
         reg->hook_ops_type == NF_HOOK_OP_BPF)
              ~~~~~~~~~~~ undefined
      
      This is a small structure, just use kzalloc to make sure this
      won't happen again when new fields get added to nf_hook_ops.
      
      Fixes: 7b4b2fa3 ("netfilter: annotate nf_tables base hook ops")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      195e5f88
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: register hooks last when adding new chain/flowtable · d472e985
      Pablo Neira Ayuso authored
      Register hooks last when adding chain/flowtable to ensure that packets do
      not walk over datastructure that is being released in the error path
      without waiting for the rcu grace period.
      
      Fixes: 91c7b38d ("netfilter: nf_tables: use new transaction infrastructure to handle chain")
      Fixes: 3b49e2e9 ("netfilter: nf_tables: add flow table netlink frontend")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      d472e985
    • Pablo Neira Ayuso's avatar
      netfilter: nft_flow_offload: release dst in case direct xmit path is used · 8762785f
      Pablo Neira Ayuso authored
      Direct xmit does not use it since it calls dev_queue_xmit() to send
      packets, hence it calls dst_release().
      
      kmemleak reports:
      
      unreferenced object 0xffff88814f440900 (size 184):
        comm "softirq", pid 0, jiffies 4294951896
        hex dump (first 32 bytes):
          00 60 5b 04 81 88 ff ff 00 e6 e8 82 ff ff ff ff  .`[.............
          21 0b 50 82 ff ff ff ff 00 00 00 00 00 00 00 00  !.P.............
        backtrace (crc cb2bf5d6):
          [<000000003ee17107>] kmem_cache_alloc+0x286/0x340
          [<0000000021a5de2c>] dst_alloc+0x43/0xb0
          [<00000000f0671159>] rt_dst_alloc+0x2e/0x190
          [<00000000fe5092c9>] __mkroute_output+0x244/0x980
          [<000000005fb96fb0>] ip_route_output_flow+0xc0/0x160
          [<0000000045367433>] nf_ip_route+0xf/0x30
          [<0000000085da1d8e>] nf_route+0x2d/0x60
          [<00000000d1ecd1cb>] nft_flow_route+0x171/0x6a0 [nft_flow_offload]
          [<00000000d9b2fb60>] nft_flow_offload_eval+0x4e8/0x700 [nft_flow_offload]
          [<000000009f447dbb>] expr_call_ops_eval+0x53/0x330 [nf_tables]
          [<00000000072e1be6>] nft_do_chain+0x17c/0x840 [nf_tables]
          [<00000000d0551029>] nft_do_chain_inet+0xa1/0x210 [nf_tables]
          [<0000000097c9d5c6>] nf_hook_slow+0x5b/0x160
          [<0000000005eccab1>] ip_forward+0x8b6/0x9b0
          [<00000000553a269b>] ip_rcv+0x221/0x230
          [<00000000412872e5>] __netif_receive_skb_one_core+0xfe/0x110
      
      Fixes: fa502c86 ("netfilter: flowtable: simplify route logic")
      Reported-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      8762785f
    • Pablo Neira Ayuso's avatar
      netfilter: nft_flow_offload: reset dst in route object after setting up flow · 9e0f0430
      Pablo Neira Ayuso authored
      dst is transferred to the flow object, route object does not own it
      anymore.  Reset dst in route object, otherwise if flow_offload_add()
      fails, error path releases dst twice, leading to a refcount underflow.
      
      Fixes: a3c90f7a ("netfilter: nf_tables: flow offload expression")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      9e0f0430
    • Florian Westphal's avatar
      netfilter: nf_tables: set dormant flag on hook register failure · bccebf64
      Florian Westphal authored
      We need to set the dormant flag again if we fail to register
      the hooks.
      
      During memory pressure hook registration can fail and we end up
      with a table marked as active but no registered hooks.
      
      On table/base chain deletion, nf_tables will attempt to unregister
      the hook again which yields a warn splat from the nftables core.
      
      Reported-and-tested-by: syzbot+de4025c006ec68ac56fc@syzkaller.appspotmail.com
      Fixes: 179d9ba5 ("netfilter: nf_tables: fix table flag updates")
      Signed-off-by: default avatarFlorian Westphal <fw@strlen.de>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      bccebf64
    • Jakub Kicinski's avatar
      Merge branch 'tls-fixes-for-record-type-handling-with-peek' · f76d5f65
      Jakub Kicinski authored
      Sabrina Dubroca says:
      
      ====================
      tls: fixes for record type handling with PEEK
      
      There are multiple bugs in tls_sw_recvmsg's handling of record types
      when MSG_PEEK flag is used, which can lead to incorrectly merging two
      records:
       - consecutive non-DATA records shouldn't be merged, even if they're
         the same type (partly handled by the test at the end of the main
         loop)
       - records of the same type (even DATA) shouldn't be merged if one
         record of a different type comes in between
      ====================
      
      Link: https://lore.kernel.org/r/cover.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      f76d5f65
    • Sabrina Dubroca's avatar
      selftests: tls: add test for peeking past a record of a different type · 2bf61726
      Sabrina Dubroca authored
      If we queue 3 records:
       - record 1, type DATA
       - record 2, some other type
       - record 3, type DATA
      the current code can look past the 2nd record and merge the 2 data
      records.
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/4623550f8617c239581030c13402d3262f2bd14f.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      2bf61726
    • Sabrina Dubroca's avatar
      selftests: tls: add test for merging of same-type control messages · 7b2a4c2a
      Sabrina Dubroca authored
      Two consecutive control messages of the same type should never be
      merged into one large received blob of data.
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/018f1633d5471684c65def5fe390de3b15c3d683.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      7b2a4c2a
    • Sabrina Dubroca's avatar
      tls: don't skip over different type records from the rx_list · ec823bf3
      Sabrina Dubroca authored
      If we queue 3 records:
       - record 1, type DATA
       - record 2, some other type
       - record 3, type DATA
      and do a recv(PEEK), the rx_list will contain the first two records.
      
      The next large recv will walk through the rx_list and copy data from
      record 1, then stop because record 2 is a different type. Since we
      haven't filled up our buffer, we will process the next available
      record. It's also DATA, so we can merge it with the current read.
      
      We shouldn't do that, since there was a record in between that we
      ignored.
      
      Add a flag to let process_rx_list inform tls_sw_recvmsg that it had
      more data available.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/f00c0c0afa080c60f016df1471158c1caf983c34.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      ec823bf3
    • Sabrina Dubroca's avatar
      tls: stop recv() if initial process_rx_list gave us non-DATA · fdfbaec5
      Sabrina Dubroca authored
      If we have a non-DATA record on the rx_list and another record of the
      same type still on the queue, we will end up merging them:
       - process_rx_list copies the non-DATA record
       - we start the loop and process the first available record since it's
         of the same type
       - we break out of the loop since the record was not DATA
      
      Just check the record type and jump to the end in case process_rx_list
      did some work.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/bd31449e43bd4b6ff546f5c51cf958c31c511deb.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fdfbaec5
    • Sabrina Dubroca's avatar
      tls: break out of main loop when PEEK gets a non-data record · 10f41d07
      Sabrina Dubroca authored
      PEEK needs to leave decrypted records on the rx_list so that we can
      receive them later on, so it jumps back into the async code that
      queues the skb. Unfortunately that makes us skip the
      TLS_RECORD_TYPE_DATA check at the bottom of the main loop, so if two
      records of the same (non-DATA) type are queued, we end up merging
      them.
      
      Add the same record type check, and make it unlikely to not penalize
      the async fastpath. Async decrypt only applies to data record, so this
      check is only needed for PEEK.
      
      process_rx_list also has similar issues.
      
      Fixes: 692d7b5d ("tls: Fix recvmsg() to be able to peek across multiple records")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Link: https://lore.kernel.org/r/3df2eef4fdae720c55e69472b5bea668772b45a2.1708007371.git.sd@queasysnail.netSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      10f41d07
    • Vasiliy Kovalev's avatar
      gtp: fix use-after-free and null-ptr-deref in gtp_genl_dump_pdp() · 136cfaca
      Vasiliy Kovalev authored
      The gtp_net_ops pernet operations structure for the subsystem must be
      registered before registering the generic netlink family.
      
      Syzkaller hit 'general protection fault in gtp_genl_dump_pdp' bug:
      
      general protection fault, probably for non-canonical address
      0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN NOPTI
      KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
      CPU: 1 PID: 5826 Comm: gtp Not tainted 6.8.0-rc3-std-def-alt1 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-alt1 04/01/2014
      RIP: 0010:gtp_genl_dump_pdp+0x1be/0x800 [gtp]
      Code: c6 89 c6 e8 64 e9 86 df 58 45 85 f6 0f 85 4e 04 00 00 e8 c5 ee 86
            df 48 8b 54 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80>
            3c 02 00 0f 85 de 05 00 00 48 8b 44 24 18 4c 8b 30 4c 39 f0 74
      RSP: 0018:ffff888014107220 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff88800fcda588 R14: 0000000000000001 R15: 0000000000000000
      FS:  00007f1be4eb05c0(0000) GS:ffff88806ce80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1be4e766cf CR3: 000000000c33e000 CR4: 0000000000750ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? show_regs+0x90/0xa0
       ? die_addr+0x50/0xd0
       ? exc_general_protection+0x148/0x220
       ? asm_exc_general_protection+0x22/0x30
       ? gtp_genl_dump_pdp+0x1be/0x800 [gtp]
       ? __alloc_skb+0x1dd/0x350
       ? __pfx___alloc_skb+0x10/0x10
       genl_dumpit+0x11d/0x230
       netlink_dump+0x5b9/0xce0
       ? lockdep_hardirqs_on_prepare+0x253/0x430
       ? __pfx_netlink_dump+0x10/0x10
       ? kasan_save_track+0x10/0x40
       ? __kasan_kmalloc+0x9b/0xa0
       ? genl_start+0x675/0x970
       __netlink_dump_start+0x6fc/0x9f0
       genl_family_rcv_msg_dumpit+0x1bb/0x2d0
       ? __pfx_genl_family_rcv_msg_dumpit+0x10/0x10
       ? genl_op_from_small+0x2a/0x440
       ? cap_capable+0x1d0/0x240
       ? __pfx_genl_start+0x10/0x10
       ? __pfx_genl_dumpit+0x10/0x10
       ? __pfx_genl_done+0x10/0x10
       ? security_capable+0x9d/0xe0
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarVasiliy Kovalev <kovalev@altlinux.org>
      Fixes: 459aa660 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)")
      Link: https://lore.kernel.org/r/20240214162733.34214-1-kovalev@altlinux.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      136cfaca
    • Shigeru Yoshida's avatar
      bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready() · 4cd12c60
      Shigeru Yoshida authored
      syzbot reported the following NULL pointer dereference issue [1]:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        [...]
        RIP: 0010:0x0
        [...]
        Call Trace:
         <TASK>
         sk_psock_verdict_data_ready+0x232/0x340 net/core/skmsg.c:1230
         unix_stream_sendmsg+0x9b4/0x1230 net/unix/af_unix.c:2293
         sock_sendmsg_nosec net/socket.c:730 [inline]
         __sock_sendmsg+0x221/0x270 net/socket.c:745
         ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584
         ___sys_sendmsg net/socket.c:2638 [inline]
         __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667
         do_syscall_64+0xf9/0x240
         entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      If sk_psock_verdict_data_ready() and sk_psock_stop_verdict() are called
      concurrently, psock->saved_data_ready can be NULL, causing the above issue.
      
      This patch fixes this issue by calling the appropriate data ready function
      using the sk_psock_data_ready() helper and protecting it from concurrency
      with sk->sk_callback_lock.
      
      Fixes: 6df7f764 ("bpf, sockmap: Wake up polling after data copy")
      Reported-by: syzbot+fd7b34375c1c8ce29c93@syzkaller.appspotmail.com
      Signed-off-by: default avatarShigeru Yoshida <syoshida@redhat.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: syzbot+fd7b34375c1c8ce29c93@syzkaller.appspotmail.com
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Closes: https://syzkaller.appspot.com/bug?extid=fd7b34375c1c8ce29c93 [1]
      Link: https://lore.kernel.org/bpf/20240218150933.6004-1-syoshida@redhat.com
      4cd12c60
    • Simon Horman's avatar
      MAINTAINERS: Add framer headers to NETWORKING [GENERAL] · 14dec56f
      Simon Horman authored
      The cited commit [1] added framer support under drivers/net/wan,
      which is covered by NETWORKING [GENERAL]. And it is implied
      that framer-provider.h and framer.h, which were also added
      buy the same patch, are also maintained as part of NETWORKING [GENERAL].
      
      Make this explicit by adding these files to the corresponding
      section in MAINTAINERS.
      
      [1] 82c944d0 ("net: wan: Add framer framework support")
      Signed-off-by: default avatarSimon Horman <horms@kernel.org>
      Reviewed-by: default avatarHerve Codina <herve.codina@bootlin.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      14dec56f
    • Kuniyuki Iwashima's avatar
      af_unix: Drop oob_skb ref before purging queue in GC. · aa82ac51
      Kuniyuki Iwashima authored
      syzbot reported another task hung in __unix_gc().  [0]
      
      The current while loop assumes that all of the left candidates
      have oob_skb and calling kfree_skb(oob_skb) releases the remaining
      candidates.
      
      However, I missed a case that oob_skb has self-referencing fd and
      another fd and the latter sk is placed before the former in the
      candidate list.  Then, the while loop never proceeds, resulting
      the task hung.
      
      __unix_gc() has the same loop just before purging the collected skb,
      so we can call kfree_skb(oob_skb) there and let __skb_queue_purge()
      release all inflight sockets.
      
      [0]:
      Sending NMI from CPU 0 to CPUs 1:
      NMI backtrace for cpu 1
      CPU: 1 PID: 2784 Comm: kworker/u4:8 Not tainted 6.8.0-rc4-syzkaller-01028-g71b605d3 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Workqueue: events_unbound __unix_gc
      RIP: 0010:__sanitizer_cov_trace_pc+0x0/0x70 kernel/kcov.c:200
      Code: 89 fb e8 23 00 00 00 48 8b 3d 84 f5 1a 0c 48 89 de 5b e9 43 26 57 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <f3> 0f 1e fa 48 8b 04 24 65 48 8b 0d 90 52 70 7e 65 8b 15 91 52 70
      RSP: 0018:ffffc9000a17fa78 EFLAGS: 00000287
      RAX: ffffffff8a0a6108 RBX: ffff88802b6c2640 RCX: ffff88802c0b3b80
      RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
      RBP: ffffc9000a17fbf0 R08: ffffffff89383f1d R09: 1ffff1100ee5ff84
      R10: dffffc0000000000 R11: ffffed100ee5ff85 R12: 1ffff110056d84ee
      R13: ffffc9000a17fae0 R14: 0000000000000000 R15: ffffffff8f47b840
      FS:  0000000000000000(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffef5687ff8 CR3: 0000000029b34000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <NMI>
       </NMI>
       <TASK>
       __unix_gc+0xe69/0xf40 net/unix/garbage.c:343
       process_one_work kernel/workqueue.c:2633 [inline]
       process_scheduled_works+0x913/0x1420 kernel/workqueue.c:2706
       worker_thread+0xa5f/0x1000 kernel/workqueue.c:2787
       kthread+0x2ef/0x390 kernel/kthread.c:388
       ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:242
       </TASK>
      
      Reported-and-tested-by: syzbot+ecab4d36f920c3574bf9@syzkaller.appspotmail.com
      Closes: https://syzkaller.appspot.com/bug?extid=ecab4d36f920c3574bf9
      Fixes: 25236c91 ("af_unix: Fix task hung while purging oob_skb in GC.")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      aa82ac51
    • Alex Elder's avatar
      net: ipa: don't overrun IPA suspend interrupt registers · d80f8e96
      Alex Elder authored
      In newer hardware, IPA supports more than 32 endpoints.  Some
      registers--such as IPA interrupt registers--represent endpoints
      as bits in a 4-byte register, and such registers are repeated as
      needed to represent endpoints beyond the first 32.
      
      In ipa_interrupt_suspend_clear_all(), we clear all pending IPA
      suspend interrupts by reading all status register(s) and writing
      corresponding registers to clear interrupt conditions.
      
      Unfortunately the number of registers to read/write is calculated
      incorrectly, and as a result we access *many* more registers than
      intended.  This bug occurs only when the IPA hardware signals a
      SUSPEND interrupt, which happens when a packet is received for an
      endpoint (or its underlying GSI channel) that is suspended.  This
      situation is difficult to reproduce, but possible.
      
      Fix this by correctly computing the number of interrupt registers to
      read and write.  This is the only place in the code where registers
      that map endpoints or channels this way perform this calculation.
      
      Fixes: f298ba78 ("net: ipa: add a parameter to suspend registers")
      Signed-off-by: default avatarAlex Elder <elder@linaro.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d80f8e96
    • Eric Dumazet's avatar
      net: implement lockless setsockopt(SO_PEEK_OFF) · 56667da7
      Eric Dumazet authored
      syzbot reported a lockdep violation [1] involving af_unix
      support of SO_PEEK_OFF.
      
      Since SO_PEEK_OFF is inherently not thread safe (it uses a per-socket
      sk_peek_off field), there is really no point to enforce a pointless
      thread safety in the kernel.
      
      After this patch :
      
      - setsockopt(SO_PEEK_OFF) no longer acquires the socket lock.
      
      - skb_consume_udp() no longer has to acquire the socket lock.
      
      - af_unix no longer needs a special version of sk_set_peek_off(),
        because it does not lock u->iolock anymore.
      
      As a followup, we could replace prot->set_peek_off to be a boolean
      and avoid an indirect call, since we always use sk_set_peek_off().
      
      [1]
      
      WARNING: possible circular locking dependency detected
      6.8.0-rc4-syzkaller-00267-g0f1dd5e9 #0 Not tainted
      
      syz-executor.2/30025 is trying to acquire lock:
       ffff8880765e7d80 (&u->iolock){+.+.}-{3:3}, at: unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
      
      but task is already holding lock:
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
       ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (sk_lock-AF_UNIX){+.+.}-{0:0}:
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              lock_sock_nested+0x48/0x100 net/core/sock.c:3524
              lock_sock include/net/sock.h:1691 [inline]
              __unix_dgram_recvmsg+0x1275/0x12c0 net/unix/af_unix.c:2415
              sock_recvmsg_nosec+0x18e/0x1d0 net/socket.c:1046
              ____sys_recvmsg+0x3c0/0x470 net/socket.c:2801
              ___sys_recvmsg net/socket.c:2845 [inline]
              do_recvmmsg+0x474/0xae0 net/socket.c:2939
              __sys_recvmmsg net/socket.c:3018 [inline]
              __do_sys_recvmmsg net/socket.c:3041 [inline]
              __se_sys_recvmmsg net/socket.c:3034 [inline]
              __x64_sys_recvmmsg+0x199/0x250 net/socket.c:3034
             do_syscall_64+0xf9/0x240
             entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      -> #0 (&u->iolock){+.+.}-{3:3}:
              check_prev_add kernel/locking/lockdep.c:3134 [inline]
              check_prevs_add kernel/locking/lockdep.c:3253 [inline]
              validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
              __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
              lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
              __mutex_lock_common kernel/locking/mutex.c:608 [inline]
              __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
              unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
             sk_setsockopt+0x207e/0x3360
              do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
              __sys_setsockopt+0x1ad/0x250 net/socket.c:2334
              __do_sys_setsockopt net/socket.c:2343 [inline]
              __se_sys_setsockopt net/socket.c:2340 [inline]
              __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
             do_syscall_64+0xf9/0x240
             entry_SYSCALL_64_after_hwframe+0x6f/0x77
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(sk_lock-AF_UNIX);
                                     lock(&u->iolock);
                                     lock(sk_lock-AF_UNIX);
        lock(&u->iolock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor.2/30025:
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1691 [inline]
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sockopt_lock_sock net/core/sock.c:1060 [inline]
        #0: ffff8880765e7930 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: sk_setsockopt+0xe52/0x3360 net/core/sock.c:1193
      
      stack backtrace:
      CPU: 0 PID: 30025 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00267-g0f1dd5e9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
      Call Trace:
       <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
        check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2187
        check_prev_add kernel/locking/lockdep.c:3134 [inline]
        check_prevs_add kernel/locking/lockdep.c:3253 [inline]
        validate_chain+0x18ca/0x58e0 kernel/locking/lockdep.c:3869
        __lock_acquire+0x1345/0x1fd0 kernel/locking/lockdep.c:5137
        lock_acquire+0x1e3/0x530 kernel/locking/lockdep.c:5754
        __mutex_lock_common kernel/locking/mutex.c:608 [inline]
        __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
        unix_set_peek_off+0x26/0xa0 net/unix/af_unix.c:789
       sk_setsockopt+0x207e/0x3360
        do_sock_setsockopt+0x2fb/0x720 net/socket.c:2307
        __sys_setsockopt+0x1ad/0x250 net/socket.c:2334
        __do_sys_setsockopt net/socket.c:2343 [inline]
        __se_sys_setsockopt net/socket.c:2340 [inline]
        __x64_sys_setsockopt+0xb5/0xd0 net/socket.c:2340
       do_syscall_64+0xf9/0x240
       entry_SYSCALL_64_after_hwframe+0x6f/0x77
      RIP: 0033:0x7f78a1c7dda9
      Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f78a0fde0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 00007f78a1dac050 RCX: 00007f78a1c7dda9
      RDX: 000000000000002a RSI: 0000000000000001 RDI: 0000000000000006
      RBP: 00007f78a1cca47a R08: 0000000000000004 R09: 0000000000000000
      R10: 0000000020000180 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000006e R14: 00007f78a1dac050 R15: 00007ffe5cd81ae8
      
      Fixes: 859051dd ("bpf: Implement cgroup sockaddr hooks for unix sockets")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
      Cc: Daan De Meyer <daan.j.demeyer@gmail.com>
      Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
      Cc: Martin KaFai Lau <martin.lau@kernel.org>
      Cc: David Ahern <dsahern@kernel.org>
      Reviewed-by: default avatarWillem de Bruijn <willemb@google.com>
      Reviewed-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      56667da7
    • Subbaraya Sundeep's avatar
      octeontx2-af: Consider the action set by PF · 3b1ae9b7
      Subbaraya Sundeep authored
      AF reserves MCAM entries for each PF, VF present in the
      system and populates the entry with DMAC and action with
      default RSS so that basic packet I/O works. Since PF/VF is
      not aware of the RSS action installed by AF, AF only fixup
      the actions of the rules installed by PF/VF with corresponding
      default RSS action. This worked well for rules installed by
      PF/VF for features like RX VLAN offload and DMAC filters but
      rules involving action like drop/forward to queue are also
      getting modified by AF. Hence fix it by setting the default
      RSS action only if requested by PF/VF.
      
      Fixes: 967db352 ("octeontx2-af: add support for multicast/promisc packet replication feature")
      Signed-off-by: default avatarSubbaraya Sundeep <sbhatta@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3b1ae9b7
  4. 20 Feb, 2024 3 commits
    • Jakub Kicinski's avatar
      docs: netdev: update the link to the CI repo · 23f9c2c0
      Jakub Kicinski authored
      Netronome graciously transferred the original NIPA repo
      to our new netdev umbrella org. Link to that instead of
      my private fork.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20240216161945.2208842-1-kuba@kernel.orgSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      23f9c2c0
    • Kuniyuki Iwashima's avatar
      arp: Prevent overflow in arp_req_get(). · a7d60277
      Kuniyuki Iwashima authored
      syzkaller reported an overflown write in arp_req_get(). [0]
      
      When ioctl(SIOCGARP) is issued, arp_req_get() looks up an neighbour
      entry and copies neigh->ha to struct arpreq.arp_ha.sa_data.
      
      The arp_ha here is struct sockaddr, not struct sockaddr_storage, so
      the sa_data buffer is just 14 bytes.
      
      In the splat below, 2 bytes are overflown to the next int field,
      arp_flags.  We initialise the field just after the memcpy(), so it's
      not a problem.
      
      However, when dev->addr_len is greater than 22 (e.g. MAX_ADDR_LEN),
      arp_netmask is overwritten, which could be set as htonl(0xFFFFFFFFUL)
      in arp_ioctl() before calling arp_req_get().
      
      To avoid the overflow, let's limit the max length of memcpy().
      
      Note that commit b5f0de6d ("net: dev: Convert sa_data to flexible
      array in struct sockaddr") just silenced syzkaller.
      
      [0]:
      memcpy: detected field-spanning write (size 16) of single field "r->arp_ha.sa_data" at net/ipv4/arp.c:1128 (size 14)
      WARNING: CPU: 0 PID: 144638 at net/ipv4/arp.c:1128 arp_req_get+0x411/0x4a0 net/ipv4/arp.c:1128
      Modules linked in:
      CPU: 0 PID: 144638 Comm: syz-executor.4 Not tainted 6.1.74 #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
      RIP: 0010:arp_req_get+0x411/0x4a0 net/ipv4/arp.c:1128
      Code: fd ff ff e8 41 42 de fb b9 0e 00 00 00 4c 89 fe 48 c7 c2 20 6d ab 87 48 c7 c7 80 6d ab 87 c6 05 25 af 72 04 01 e8 5f 8d ad fb <0f> 0b e9 6c fd ff ff e8 13 42 de fb be 03 00 00 00 4c 89 e7 e8 a6
      RSP: 0018:ffffc900050b7998 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: ffff88803a815000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff8641a44a RDI: 0000000000000001
      RBP: ffffc900050b7a98 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 203a7970636d656d R12: ffff888039c54000
      R13: 1ffff92000a16f37 R14: ffff88803a815084 R15: 0000000000000010
      FS:  00007f172bf306c0(0000) GS:ffff88805aa00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f172b3569f0 CR3: 0000000057f12005 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       arp_ioctl+0x33f/0x4b0 net/ipv4/arp.c:1261
       inet_ioctl+0x314/0x3a0 net/ipv4/af_inet.c:981
       sock_do_ioctl+0xdf/0x260 net/socket.c:1204
       sock_ioctl+0x3ef/0x650 net/socket.c:1321
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:870 [inline]
       __se_sys_ioctl fs/ioctl.c:856 [inline]
       __x64_sys_ioctl+0x18e/0x220 fs/ioctl.c:856
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x37/0x90 arch/x86/entry/common.c:81
       entry_SYSCALL_64_after_hwframe+0x64/0xce
      RIP: 0033:0x7f172b262b8d
      Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f172bf300b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00007f172b3abf80 RCX: 00007f172b262b8d
      RDX: 0000000020000000 RSI: 0000000000008954 RDI: 0000000000000003
      RBP: 00007f172b2d3493 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 000000000000000b R14: 00007f172b3abf80 R15: 00007f172bf10000
       </TASK>
      Reported-by: default avatarsyzkaller <syzkaller@googlegroups.com>
      Reported-by: default avatarBjoern Doebel <doebel@amazon.de>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20240215230516.31330-1-kuniyu@amazon.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      a7d60277
    • Vasiliy Kovalev's avatar
      devlink: fix possible use-after-free and memory leaks in devlink_init() · def689fc
      Vasiliy Kovalev authored
      The pernet operations structure for the subsystem must be registered
      before registering the generic netlink family.
      
      Make an unregister in case of unsuccessful registration.
      
      Fixes: 687125b5 ("devlink: split out core code")
      Signed-off-by: default avatarVasiliy Kovalev <kovalev@altlinux.org>
      Link: https://lore.kernel.org/r/20240215203400.29976-1-kovalev@altlinux.orgSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      def689fc