1. 23 May, 2023 1 commit
    • John Fastabend's avatar
      bpf, sockmap: Pass skb ownership through read_skb · 78fa0d61
      John Fastabend authored
      The read_skb hook calls consume_skb() now, but this means that if the
      recv_actor program wants to use the skb it needs to inc the ref cnt
      so that the consume_skb() doesn't kfree the sk_buff.
      
      This is problematic because in some error cases under memory pressure
      we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
      Then we get this,
      
       skb_linearize()
         __pskb_pull_tail()
           pskb_expand_head()
             BUG_ON(skb_shared(skb))
      
      Because we incremented users refcnt from sk_psock_verdict_recv() we
      hit the bug on with refcnt > 1 and trip it.
      
      To fix lets simply pass ownership of the sk_buff through the skb_read
      call. Then we can drop the consume from read_skb handlers and assume
      the verdict recv does any required kfree.
      
      Bug found while testing in our CI which runs in VMs that hit memory
      constraints rather regularly. William tested TCP read_skb handlers.
      
      [  106.536188] ------------[ cut here ]------------
      [  106.536197] kernel BUG at net/core/skbuff.c:1693!
      [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
      [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
      [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
      [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
      [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
      [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
      [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
      [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
      [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
      [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
      [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
      [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  106.542255] Call Trace:
      [  106.542383]  <IRQ>
      [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
      [  106.542681]  skb_ensure_writable+0x85/0xa0
      [  106.542882]  sk_skb_pull_data+0x18/0x20
      [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
      [  106.543536]  ? migrate_disable+0x66/0x80
      [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
      [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
      [  106.544561]  tcp_read_skb+0x7b/0x120
      [  106.544740]  tcp_data_queue+0x904/0xee0
      [  106.544931]  tcp_rcv_established+0x212/0x7c0
      [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
      [  106.545326]  tcp_v4_rcv+0xe70/0xf60
      [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
      [  106.545744]  ip_local_deliver_finish+0xa7/0x150
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Reported-by: default avatarWilliam Findlay <will@isovalent.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
      78fa0d61
  2. 22 May, 2023 1 commit
  3. 19 May, 2023 1 commit
    • Will Deacon's avatar
      bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields · 0613d8ca
      Will Deacon authored
      A narrow load from a 64-bit context field results in a 64-bit load
      followed potentially by a 64-bit right-shift and then a bitwise AND
      operation to extract the relevant data.
      
      In the case of a 32-bit access, an immediate mask of 0xffffffff is used
      to construct a 64-bit BPP_AND operation which then sign-extends the mask
      value and effectively acts as a glorified no-op. For example:
      
      0:	61 10 00 00 00 00 00 00	r0 = *(u32 *)(r1 + 0)
      
      results in the following code generation for a 64-bit field:
      
      	ldr	x7, [x7]	// 64-bit load
      	mov	x10, #0xffffffffffffffff
      	and	x7, x7, x10
      
      Fix the mask generation so that narrow loads always perform a 32-bit AND
      operation:
      
      	ldr	x7, [x7]	// 64-bit load
      	mov	w10, #0xffffffff
      	and	w7, w7, w10
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
      Cc: Andrey Ignatov <rdna@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Fixes: 31fd8581 ("bpf: permits narrower load from bpf program context fields")
      Signed-off-by: default avatarWill Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      0613d8ca
  4. 16 May, 2023 1 commit
  5. 15 May, 2023 1 commit
  6. 14 May, 2023 1 commit
  7. 13 May, 2023 10 commits
  8. 12 May, 2023 15 commits
  9. 11 May, 2023 8 commits
    • Linus Torvalds's avatar
      Merge tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 6e27831b
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from netfilter.
      
        Current release - regressions:
      
         - mtk_eth_soc: fix NULL pointer dereference
      
        Previous releases - regressions:
      
         - core:
            - skb_partial_csum_set() fix against transport header magic value
            - fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
            - annotate sk->sk_err write from do_recvmmsg()
            - add vlan_get_protocol_and_depth() helper
      
         - netlink: annotate accesses to nlk->cb_running
      
         - netfilter: always release netdev hooks from notifier
      
        Previous releases - always broken:
      
         - core: deal with most data-races in sk_wait_event()
      
         - netfilter: fix possible bug_on with enable_hooks=1
      
         - eth: bonding: fix send_peer_notif overflow
      
         - eth: xpcs: fix incorrect number of interfaces
      
         - eth: ipvlan: fix out-of-bounds caused by unclear skb->cb
      
         - eth: stmmac: Initialize MAC_ONEUS_TIC_COUNTER register"
      
      * tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits)
        af_unix: Fix data races around sk->sk_shutdown.
        af_unix: Fix a data race of sk->sk_receive_queue->qlen.
        net: datagram: fix data-races in datagram_poll()
        net: mscc: ocelot: fix stat counter register values
        ipvlan:Fix out-of-bounds caused by unclear skb->cb
        docs: networking: fix x25-iface.rst heading & index order
        gve: Remove the code of clearing PBA bit
        tcp: add annotations around sk->sk_shutdown accesses
        net: add vlan_get_protocol_and_depth() helper
        net: pcs: xpcs: fix incorrect number of interfaces
        net: deal with most data-races in sk_wait_event()
        net: annotate sk->sk_err write from do_recvmmsg()
        netlink: annotate accesses to nlk->cb_running
        kselftest: bonding: add num_grat_arp test
        selftests: forwarding: lib: add netns support for tc rule handle stats get
        Documentation: bonding: fix the doc of peer_notif_delay
        bonding: fix send_peer_notif overflow
        net: ethernet: mtk_eth_soc: fix NULL pointer dereference
        selftests: nft_flowtable.sh: check ingress/egress chain too
        selftests: nft_flowtable.sh: monitor result file sizes
        ...
      6e27831b
    • Linus Torvalds's avatar
      Merge tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 691e1eee
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
      
       - fix some unused-variable warning in mtk-mdp3
      
       - ignore unused suspend operations in nxp
      
       - some driver fixes in rcar-vin
      
      * tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        media: platform: mtk-mdp3: work around unused-variable warning
        media: nxp: ignore unused suspend operations
        media: rcar-vin: Select correct interrupt mode for V4L2_FIELD_ALTERNATE
        media: rcar-vin: Fix NV12 size alignment
        media: rcar-vin: Gen3 can not scale NV12
      691e1eee
    • Jakub Kicinski's avatar
      Merge tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · cceac926
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix UAF when releasing netnamespace, from Florian Westphal.
      
      2) Fix possible BUG_ON when nf_conntrack is enabled with enable_hooks,
         from Florian Westphal.
      
      3) Fixes for nft_flowtable.sh selftest, from Boris Sukholitko.
      
      4) Extend nft_flowtable.sh selftest to cover integration with
         ingress/egress hooks, from Florian Westphal.
      
      * tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        selftests: nft_flowtable.sh: check ingress/egress chain too
        selftests: nft_flowtable.sh: monitor result file sizes
        selftests: nft_flowtable.sh: wait for specific nc pids
        selftests: nft_flowtable.sh: no need for ps -x option
        selftests: nft_flowtable.sh: use /proc for pid checking
        netfilter: conntrack: fix possible bug_on with enable_hooks=1
        netfilter: nf_tables: always release netdev hooks from notifier
      ====================
      
      Link: https://lore.kernel.org/r/20230510083313.152961-1-pablo@netfilter.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      cceac926
    • Jakub Kicinski's avatar
      Merge branch 'af_unix-fix-two-data-races-reported-by-kcsan' · 33dcee99
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix two data races reported by KCSAN.
      
      KCSAN reported data races around these two fields for AF_UNIX sockets.
      
        * sk->sk_receive_queue->qlen
        * sk->sk_shutdown
      
      Let's annotate them properly.
      ====================
      
      Link: https://lore.kernel.org/r/20230510003456.42357-1-kuniyu@amazon.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      33dcee99
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data races around sk->sk_shutdown. · e1d09c2c
      Kuniyuki Iwashima authored
      KCSAN found a data race around sk->sk_shutdown where unix_release_sock()
      and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll()
      and unix_dgram_poll() read it locklessly.
      
      We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE().
      
      BUG: KCSAN: data-race in unix_poll / unix_release_sock
      
      write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0:
       unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
       unix_release+0x59/0x80 net/unix/af_unix.c:1042
       __sock_release+0x7d/0x170 net/socket.c:653
       sock_close+0x19/0x30 net/socket.c:1397
       __fput+0x179/0x5e0 fs/file_table.c:321
       ____fput+0x15/0x20 fs/file_table.c:349
       task_work_run+0x116/0x1a0 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
       __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
       syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
       do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1:
       unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170
       sock_poll+0xcf/0x2b0 net/socket.c:1385
       vfs_poll include/linux/poll.h:88 [inline]
       ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855
       ep_send_events fs/eventpoll.c:1694 [inline]
       ep_poll fs/eventpoll.c:1823 [inline]
       do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258
       __do_sys_epoll_wait fs/eventpoll.c:2270 [inline]
       __se_sys_epoll_wait fs/eventpoll.c:2265 [inline]
       __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      value changed: 0x00 -> 0x03
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 3c73419c ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e1d09c2c
    • Kuniyuki Iwashima's avatar
      af_unix: Fix a data race of sk->sk_receive_queue->qlen. · 679ed006
      Kuniyuki Iwashima authored
      KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg()
      updates qlen under the queue lock and sendmsg() checks qlen under
      unix_state_sock(), not the queue lock, so the reader side needs
      READ_ONCE().
      
      BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer
      
      write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0:
       __skb_unlink include/linux/skbuff.h:2347 [inline]
       __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197
       __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263
       __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452
       unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549
       sock_recvmsg_nosec net/socket.c:1019 [inline]
       ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720
       ___sys_recvmsg+0xc8/0x150 net/socket.c:2764
       do_recvmmsg+0x182/0x560 net/socket.c:2858
       __sys_recvmmsg net/socket.c:2937 [inline]
       __do_sys_recvmmsg net/socket.c:2960 [inline]
       __se_sys_recvmmsg net/socket.c:2953 [inline]
       __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1:
       skb_queue_len include/linux/skbuff.h:2127 [inline]
       unix_recvq_full net/unix/af_unix.c:229 [inline]
       unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445
       unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048
       sock_sendmsg_nosec net/socket.c:724 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:747
       ____sys_sendmsg+0x20e/0x620 net/socket.c:2503
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2557
       __sys_sendmmsg+0x11d/0x370 net/socket.c:2643
       __do_sys_sendmmsg net/socket.c:2672 [inline]
       __se_sys_sendmmsg net/socket.c:2669 [inline]
       __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      value changed: 0x0000000b -> 0x00000001
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      679ed006
    • Eric Dumazet's avatar
      net: datagram: fix data-races in datagram_poll() · 5bca1d08
      Eric Dumazet authored
      datagram_poll() runs locklessly, we should add READ_ONCE()
      annotations while reading sk->sk_err, sk->sk_shutdown and sk->sk_state.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230509173131.3263780-1-edumazet@google.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5bca1d08
    • Linus Torvalds's avatar
      MAINTAINERS: re-sort all entries and fields · 80e62bc8
      Linus Torvalds authored
      It's been a few years since we've sorted this thing, and the end result
      is that we've added MAINTAINERS entries in the wrong order, and a number
      of entries have their fields in non-canonical order too.
      
      So roll this boulder up the hill one more time by re-running
      
         ./scripts/parse-maintainers.pl --order
      
      on it.
      
      This file ends up being fairly painful for merge conflicts even
      normally, since unlike almost all other kernel files it's one of those
      "everybody touches the same thing", and re-ordering all entries is only
      going to make that worse.  But the alternative is to never do it at all,
      and just let it all rot..
      
      The rc2 week is likely the quietest and least painful time to do this.
      Requested-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Requested-by: Joe Perches <joe@perches.com>	# "Please use --order"
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80e62bc8
  10. 10 May, 2023 1 commit