1. 23 May, 2023 2 commits
    • John Fastabend's avatar
      bpf, sockmap: Convert schedule_work into delayed_work · 29173d07
      John Fastabend authored
      Sk_buffs are fed into sockmap verdict programs either from a strparser
      (when the user might want to decide how framing of skb is done by attaching
      another parser program) or directly through tcp_read_sock. The
      tcp_read_sock is the preferred method for performance when the BPF logic is
      a stream parser.
      
      The flow for Cilium's common use case with a stream parser is,
      
       tcp_read_sock()
        sk_psock_verdict_recv
          ret = bpf_prog_run_pin_on_cpu()
          sk_psock_verdict_apply(sock, skb, ret)
           // if system is under memory pressure or app is slow we may
           // need to queue skb. Do this queuing through ingress_skb and
           // then kick timer to wake up handler
           skb_queue_tail(ingress_skb, skb)
           schedule_work(work);
      
      The work queue is wired up to sk_psock_backlog(). This will then walk the
      ingress_skb skb list that holds our sk_buffs that could not be handled,
      but should be OK to run at some later point. However, its possible that
      the workqueue doing this work still hits an error when sending the skb.
      When this happens the skbuff is requeued on a temporary 'state' struct
      kept with the workqueue. This is necessary because its possible to
      partially send an skbuff before hitting an error and we need to know how
      and where to restart when the workqueue runs next.
      
      Now for the trouble, we don't rekick the workqueue. This can cause a
      stall where the skbuff we just cached on the state variable might never
      be sent. This happens when its the last packet in a flow and no further
      packets come along that would cause the system to kick the workqueue from
      that side.
      
      To fix we could do simple schedule_work(), but while under memory pressure
      it makes sense to back off some instead of continue to retry repeatedly. So
      instead to fix convert schedule_work to schedule_delayed_work and add
      backoff logic to reschedule from backlog queue on errors. Its not obvious
      though what a good backoff is so use '1'.
      
      To test we observed some flakes whil running NGINX compliance test with
      sockmap we attributed these failed test to this bug and subsequent issue.
      
      >From on list discussion. This commit
      
       bec21719("skmsg: Schedule psock work if the cached skb exists on the psock")
      
      was intended to address similar race, but had a couple cases it missed.
      Most obvious it only accounted for receiving traffic on the local socket
      so if redirecting into another socket we could still get an sk_buff stuck
      here. Next it missed the case where copied=0 in the recv() handler and
      then we wouldn't kick the scheduler. Also its sub-optimal to require
      userspace to kick the internal mechanisms of sockmap to wake it up and
      copy data to user. It results in an extra syscall and requires the app
      to actual handle the EAGAIN correctly.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
      29173d07
    • John Fastabend's avatar
      bpf, sockmap: Pass skb ownership through read_skb · 78fa0d61
      John Fastabend authored
      The read_skb hook calls consume_skb() now, but this means that if the
      recv_actor program wants to use the skb it needs to inc the ref cnt
      so that the consume_skb() doesn't kfree the sk_buff.
      
      This is problematic because in some error cases under memory pressure
      we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
      Then we get this,
      
       skb_linearize()
         __pskb_pull_tail()
           pskb_expand_head()
             BUG_ON(skb_shared(skb))
      
      Because we incremented users refcnt from sk_psock_verdict_recv() we
      hit the bug on with refcnt > 1 and trip it.
      
      To fix lets simply pass ownership of the sk_buff through the skb_read
      call. Then we can drop the consume from read_skb handlers and assume
      the verdict recv does any required kfree.
      
      Bug found while testing in our CI which runs in VMs that hit memory
      constraints rather regularly. William tested TCP read_skb handlers.
      
      [  106.536188] ------------[ cut here ]------------
      [  106.536197] kernel BUG at net/core/skbuff.c:1693!
      [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
      [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
      [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
      [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
      [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
      [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
      [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
      [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
      [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
      [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
      [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
      [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  106.542255] Call Trace:
      [  106.542383]  <IRQ>
      [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
      [  106.542681]  skb_ensure_writable+0x85/0xa0
      [  106.542882]  sk_skb_pull_data+0x18/0x20
      [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
      [  106.543536]  ? migrate_disable+0x66/0x80
      [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
      [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
      [  106.544561]  tcp_read_skb+0x7b/0x120
      [  106.544740]  tcp_data_queue+0x904/0xee0
      [  106.544931]  tcp_rcv_established+0x212/0x7c0
      [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
      [  106.545326]  tcp_v4_rcv+0xe70/0xf60
      [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
      [  106.545744]  ip_local_deliver_finish+0xa7/0x150
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Reported-by: default avatarWilliam Findlay <will@isovalent.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
      78fa0d61
  2. 22 May, 2023 1 commit
  3. 19 May, 2023 1 commit
    • Will Deacon's avatar
      bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields · 0613d8ca
      Will Deacon authored
      A narrow load from a 64-bit context field results in a 64-bit load
      followed potentially by a 64-bit right-shift and then a bitwise AND
      operation to extract the relevant data.
      
      In the case of a 32-bit access, an immediate mask of 0xffffffff is used
      to construct a 64-bit BPP_AND operation which then sign-extends the mask
      value and effectively acts as a glorified no-op. For example:
      
      0:	61 10 00 00 00 00 00 00	r0 = *(u32 *)(r1 + 0)
      
      results in the following code generation for a 64-bit field:
      
      	ldr	x7, [x7]	// 64-bit load
      	mov	x10, #0xffffffffffffffff
      	and	x7, x7, x10
      
      Fix the mask generation so that narrow loads always perform a 32-bit AND
      operation:
      
      	ldr	x7, [x7]	// 64-bit load
      	mov	w10, #0xffffffff
      	and	w7, w7, w10
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
      Cc: Andrey Ignatov <rdna@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Fixes: 31fd8581 ("bpf: permits narrower load from bpf program context fields")
      Signed-off-by: default avatarWill Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      0613d8ca
  4. 16 May, 2023 1 commit
  5. 15 May, 2023 1 commit
  6. 14 May, 2023 1 commit
  7. 13 May, 2023 10 commits
  8. 12 May, 2023 15 commits
  9. 11 May, 2023 8 commits
    • Linus Torvalds's avatar
      Merge tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 6e27831b
      Linus Torvalds authored
      Pull networking fixes from Paolo Abeni:
       "Including fixes from netfilter.
      
        Current release - regressions:
      
         - mtk_eth_soc: fix NULL pointer dereference
      
        Previous releases - regressions:
      
         - core:
            - skb_partial_csum_set() fix against transport header magic value
            - fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
            - annotate sk->sk_err write from do_recvmmsg()
            - add vlan_get_protocol_and_depth() helper
      
         - netlink: annotate accesses to nlk->cb_running
      
         - netfilter: always release netdev hooks from notifier
      
        Previous releases - always broken:
      
         - core: deal with most data-races in sk_wait_event()
      
         - netfilter: fix possible bug_on with enable_hooks=1
      
         - eth: bonding: fix send_peer_notif overflow
      
         - eth: xpcs: fix incorrect number of interfaces
      
         - eth: ipvlan: fix out-of-bounds caused by unclear skb->cb
      
         - eth: stmmac: Initialize MAC_ONEUS_TIC_COUNTER register"
      
      * tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits)
        af_unix: Fix data races around sk->sk_shutdown.
        af_unix: Fix a data race of sk->sk_receive_queue->qlen.
        net: datagram: fix data-races in datagram_poll()
        net: mscc: ocelot: fix stat counter register values
        ipvlan:Fix out-of-bounds caused by unclear skb->cb
        docs: networking: fix x25-iface.rst heading & index order
        gve: Remove the code of clearing PBA bit
        tcp: add annotations around sk->sk_shutdown accesses
        net: add vlan_get_protocol_and_depth() helper
        net: pcs: xpcs: fix incorrect number of interfaces
        net: deal with most data-races in sk_wait_event()
        net: annotate sk->sk_err write from do_recvmmsg()
        netlink: annotate accesses to nlk->cb_running
        kselftest: bonding: add num_grat_arp test
        selftests: forwarding: lib: add netns support for tc rule handle stats get
        Documentation: bonding: fix the doc of peer_notif_delay
        bonding: fix send_peer_notif overflow
        net: ethernet: mtk_eth_soc: fix NULL pointer dereference
        selftests: nft_flowtable.sh: check ingress/egress chain too
        selftests: nft_flowtable.sh: monitor result file sizes
        ...
      6e27831b
    • Linus Torvalds's avatar
      Merge tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media · 691e1eee
      Linus Torvalds authored
      Pull media fixes from Mauro Carvalho Chehab:
      
       - fix some unused-variable warning in mtk-mdp3
      
       - ignore unused suspend operations in nxp
      
       - some driver fixes in rcar-vin
      
      * tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
        media: platform: mtk-mdp3: work around unused-variable warning
        media: nxp: ignore unused suspend operations
        media: rcar-vin: Select correct interrupt mode for V4L2_FIELD_ALTERNATE
        media: rcar-vin: Fix NV12 size alignment
        media: rcar-vin: Gen3 can not scale NV12
      691e1eee
    • Jakub Kicinski's avatar
      Merge tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · cceac926
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix UAF when releasing netnamespace, from Florian Westphal.
      
      2) Fix possible BUG_ON when nf_conntrack is enabled with enable_hooks,
         from Florian Westphal.
      
      3) Fixes for nft_flowtable.sh selftest, from Boris Sukholitko.
      
      4) Extend nft_flowtable.sh selftest to cover integration with
         ingress/egress hooks, from Florian Westphal.
      
      * tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        selftests: nft_flowtable.sh: check ingress/egress chain too
        selftests: nft_flowtable.sh: monitor result file sizes
        selftests: nft_flowtable.sh: wait for specific nc pids
        selftests: nft_flowtable.sh: no need for ps -x option
        selftests: nft_flowtable.sh: use /proc for pid checking
        netfilter: conntrack: fix possible bug_on with enable_hooks=1
        netfilter: nf_tables: always release netdev hooks from notifier
      ====================
      
      Link: https://lore.kernel.org/r/20230510083313.152961-1-pablo@netfilter.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      cceac926
    • Jakub Kicinski's avatar
      Merge branch 'af_unix-fix-two-data-races-reported-by-kcsan' · 33dcee99
      Jakub Kicinski authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix two data races reported by KCSAN.
      
      KCSAN reported data races around these two fields for AF_UNIX sockets.
      
        * sk->sk_receive_queue->qlen
        * sk->sk_shutdown
      
      Let's annotate them properly.
      ====================
      
      Link: https://lore.kernel.org/r/20230510003456.42357-1-kuniyu@amazon.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      33dcee99
    • Kuniyuki Iwashima's avatar
      af_unix: Fix data races around sk->sk_shutdown. · e1d09c2c
      Kuniyuki Iwashima authored
      KCSAN found a data race around sk->sk_shutdown where unix_release_sock()
      and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll()
      and unix_dgram_poll() read it locklessly.
      
      We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE().
      
      BUG: KCSAN: data-race in unix_poll / unix_release_sock
      
      write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0:
       unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
       unix_release+0x59/0x80 net/unix/af_unix.c:1042
       __sock_release+0x7d/0x170 net/socket.c:653
       sock_close+0x19/0x30 net/socket.c:1397
       __fput+0x179/0x5e0 fs/file_table.c:321
       ____fput+0x15/0x20 fs/file_table.c:349
       task_work_run+0x116/0x1a0 kernel/task_work.c:179
       resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
       exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
       __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
       syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
       do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1:
       unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170
       sock_poll+0xcf/0x2b0 net/socket.c:1385
       vfs_poll include/linux/poll.h:88 [inline]
       ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855
       ep_send_events fs/eventpoll.c:1694 [inline]
       ep_poll fs/eventpoll.c:1823 [inline]
       do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258
       __do_sys_epoll_wait fs/eventpoll.c:2270 [inline]
       __se_sys_epoll_wait fs/eventpoll.c:2265 [inline]
       __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      value changed: 0x00 -> 0x03
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 3c73419c ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e1d09c2c
    • Kuniyuki Iwashima's avatar
      af_unix: Fix a data race of sk->sk_receive_queue->qlen. · 679ed006
      Kuniyuki Iwashima authored
      KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg()
      updates qlen under the queue lock and sendmsg() checks qlen under
      unix_state_sock(), not the queue lock, so the reader side needs
      READ_ONCE().
      
      BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer
      
      write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0:
       __skb_unlink include/linux/skbuff.h:2347 [inline]
       __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197
       __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263
       __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452
       unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549
       sock_recvmsg_nosec net/socket.c:1019 [inline]
       ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720
       ___sys_recvmsg+0xc8/0x150 net/socket.c:2764
       do_recvmmsg+0x182/0x560 net/socket.c:2858
       __sys_recvmmsg net/socket.c:2937 [inline]
       __do_sys_recvmmsg net/socket.c:2960 [inline]
       __se_sys_recvmmsg net/socket.c:2953 [inline]
       __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1:
       skb_queue_len include/linux/skbuff.h:2127 [inline]
       unix_recvq_full net/unix/af_unix.c:229 [inline]
       unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445
       unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048
       sock_sendmsg_nosec net/socket.c:724 [inline]
       sock_sendmsg+0x148/0x160 net/socket.c:747
       ____sys_sendmsg+0x20e/0x620 net/socket.c:2503
       ___sys_sendmsg+0xc6/0x140 net/socket.c:2557
       __sys_sendmmsg+0x11d/0x370 net/socket.c:2643
       __do_sys_sendmmsg net/socket.c:2672 [inline]
       __se_sys_sendmmsg net/socket.c:2669 [inline]
       __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
      value changed: 0x0000000b -> 0x00000001
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarMichal Kubiak <michal.kubiak@intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      679ed006
    • Eric Dumazet's avatar
      net: datagram: fix data-races in datagram_poll() · 5bca1d08
      Eric Dumazet authored
      datagram_poll() runs locklessly, we should add READ_ONCE()
      annotations while reading sk->sk_err, sk->sk_shutdown and sk->sk_state.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reviewed-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20230509173131.3263780-1-edumazet@google.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5bca1d08
    • Linus Torvalds's avatar
      MAINTAINERS: re-sort all entries and fields · 80e62bc8
      Linus Torvalds authored
      It's been a few years since we've sorted this thing, and the end result
      is that we've added MAINTAINERS entries in the wrong order, and a number
      of entries have their fields in non-canonical order too.
      
      So roll this boulder up the hill one more time by re-running
      
         ./scripts/parse-maintainers.pl --order
      
      on it.
      
      This file ends up being fairly painful for merge conflicts even
      normally, since unlike almost all other kernel files it's one of those
      "everybody touches the same thing", and re-ordering all entries is only
      going to make that worse.  But the alternative is to never do it at all,
      and just let it all rot..
      
      The rc2 week is likely the quietest and least painful time to do this.
      Requested-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Requested-by: Joe Perches <joe@perches.com>	# "Please use --order"
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80e62bc8