1. 25 Oct, 2022 7 commits
    • Jakub Kicinski's avatar
      Merge branch 'mptcp-fixes-for-6-1' · fe1fd0cc
      Jakub Kicinski authored
      Mat Martineau says:
      
      ====================
      mptcp: Fixes for 6.1
      
      Patch 1 fixes an issue with assigning subflow IDs in cases where an
      incoming MP_JOIN is processed before accept() completes on the MPTCP
      socket.
      
      Patches 2 and 3 fix a deadlock issue with fastopen code (new for 6.1) at
      connection time.
      ====================
      
      Link: https://lore.kernel.org/r/20221021225856.88119-1-mathew.j.martineau@linux.intel.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fe1fd0cc
    • Paolo Abeni's avatar
      mptcp: fix abba deadlock on fastopen · fa9e5746
      Paolo Abeni authored
      Our CI reported lockdep splat in the fastopen code:
       ======================================================
       WARNING: possible circular locking dependency detected
       6.0.0.mptcp_f5e8bfe9878d+ #1558 Not tainted
       ------------------------------------------------------
       packetdrill/1071 is trying to acquire lock:
       ffff8881bd198140 (sk_lock-AF_INET){+.+.}-{0:0}, at: inet_wait_for_connect+0x19c/0x310
      
       but task is already holding lock:
       ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (k-sk_lock-AF_INET){+.+.}-{0:0}:
              __lock_acquire+0xb6d/0x1860
              lock_acquire+0x1d8/0x620
              lock_sock_nested+0x37/0xd0
              inet_stream_connect+0x3f/0xa0
              mptcp_connect+0x411/0x800
              __inet_stream_connect+0x3ab/0x800
              mptcp_stream_connect+0xac/0x110
              __sys_connect+0x101/0x130
              __x64_sys_connect+0x6e/0xb0
              do_syscall_64+0x59/0x90
              entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       -> #0 (sk_lock-AF_INET){+.+.}-{0:0}:
              check_prev_add+0x15e/0x2110
              validate_chain+0xace/0xdf0
              __lock_acquire+0xb6d/0x1860
              lock_acquire+0x1d8/0x620
              lock_sock_nested+0x37/0xd0
              inet_wait_for_connect+0x19c/0x310
              __inet_stream_connect+0x26c/0x800
              tcp_sendmsg_fastopen+0x341/0x650
              mptcp_sendmsg+0x109d/0x1740
              sock_sendmsg+0xe1/0x120
              __sys_sendto+0x1c7/0x2a0
              __x64_sys_sendto+0xdc/0x1b0
              do_syscall_64+0x59/0x90
              entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(k-sk_lock-AF_INET);
                                      lock(sk_lock-AF_INET);
                                      lock(k-sk_lock-AF_INET);
         lock(sk_lock-AF_INET);
      
        *** DEADLOCK ***
      
       1 lock held by packetdrill/1071:
        #0: ffff8881b8346540 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0xfdf/0x1740
       ======================================================
      
      The problem is caused by the blocking inet_wait_for_connect() releasing
      and re-acquiring the msk socket lock while the subflow socket lock is
      still held and the MPTCP socket requires that the msk socket lock must
      be acquired before the subflow socket lock.
      
      Address the issue always invoking tcp_sendmsg_fastopen() in an
      unblocking manner, and later eventually complete the blocking
      __inet_stream_connect() as needed.
      
      Fixes: d98a82a6 ("mptcp: handle defer connect in mptcp_sendmsg")
      Reviewed-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Reviewed-by: default avatarMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fa9e5746
    • Paolo Abeni's avatar
      mptcp: factor out mptcp_connect() · 54f1944e
      Paolo Abeni authored
      The current MPTCP connect implementation duplicates a bit of inet
      code and does not use nor provide a struct proto->connect callback,
      which in turn will not fit the upcoming fastopen implementation.
      
      Refactor such implementation to use the common helper, moving the
      MPTCP-specific bits into mptcp_connect(). Additionally, avoid an
      indirect call to the subflow connect callback.
      
      Note that the fastopen call-path invokes mptcp_connect() while already
      holding the subflow socket lock. Explicitly keep track of such path
      via a new MPTCP-level flag and handle the locking accordingly.
      
      Additionally, track the connect flags in a new msk field to allow
      propagating them to the subflow inet_stream_connect call.
      
      Fixes: d98a82a6 ("mptcp: handle defer connect in mptcp_sendmsg")
      Reviewed-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      54f1944e
    • Paolo Abeni's avatar
      mptcp: set msk local address earlier · e72e4032
      Paolo Abeni authored
      The mptcp_pm_nl_get_local_id() code assumes that the msk local address
      is available at that point. For passive sockets, we initialize such
      address at accept() time.
      
      Depending on the running configuration and the user-space timing, a
      passive MPJ subflow can join the msk socket before accept() completes.
      
      In such case, the PM assigns a wrong local id to the MPJ subflow
      and later PM netlink operations will end-up touching the wrong/unexpected
      subflow.
      
      All the above causes sporadic self-tests failures, especially when
      the host is heavy loaded.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/308
      Fixes: 01cacb00 ("mptcp: add netlink-based PM")
      Fixes: d045b9eb ("mptcp: introduce implicit endpoints")
      Reviewed-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e72e4032
    • Horatiu Vultur's avatar
      net: lan966x: Stop replacing tx dcbs and dcbs_buf when changing MTU · 4a4b6848
      Horatiu Vultur authored
      When a frame is sent using FDMA, the skb is mapped and then the mapped
      address is given to an tx dcb that is different than the last used tx
      dcb. Once the HW finish with this frame, it would generate an interrupt
      and then the dcb can be reused and memory can be freed. For each dcb
      there is an dcb buf that contains some meta-data(is used by PTP, is
      it free). There is 1 to 1 relationship between dcb and dcb_buf.
      The following issue was observed. That sometimes after changing the MTU
      to allocate new tx dcbs and dcbs_buf, two frames were not
      transmitted. The frames were not transmitted because when reloading the
      tx dcbs, it was always presuming to use the first dcb but that was not
      always happening. Because it could be that the last tx dcb used before
      changing MTU was first dcb and then when it tried to get the next dcb it
      would take dcb 1 instead of 0. Because it is supposed to take a
      different dcb than the last used one. This can be fixed simply by
      changing tx->last_in_use to -1 when the fdma is disabled to reload the
      new dcb and dcbs_buff.
      But there could be a different issue. For example, right after the frame
      is sent, the MTU is changed. Now all the dcbs and dcbs_buf will be
      cleared. And now get the interrupt from HW that it finished with the
      frame. So when we try to clear the skb, it is not possible because we
      lost all the dcbs_buf.
      The solution here is to stop replacing the tx dcbs and dcbs_buf when
      changing MTU because the TX doesn't care what is the MTU size, it is
      only the RX that needs this information.
      
      Fixes: 2ea1cbac ("net: lan966x: Update FDMA to change MTU.")
      Signed-off-by: default avatarHoratiu Vultur <horatiu.vultur@microchip.com>
      Link: https://lore.kernel.org/r/20221021090711.3749009-1-horatiu.vultur@microchip.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4a4b6848
    • Jakub Kicinski's avatar
      genetlink: piggy back on resv_op to default to a reject policy · 4fa86555
      Jakub Kicinski authored
      To keep backward compatibility we used to leave attribute parsing
      to the family if no policy is specified. This becomes tedious as
      we move to more strict validation. Families must define reject all
      policies if they don't want any attributes accepted.
      
      Piggy back on the resv_start_op field as the switchover point.
      AFAICT only ethtool has added new commands since the resv_start_op
      was defined, and it has per-op policies so this should be a no-op.
      
      Nonetheless the patch should still go into v6.1 for consistency.
      
      Link: https://lore.kernel.org/all/20221019125745.3f2e7659@kernel.org/
      Link: https://lore.kernel.org/r/20221021193532.1511293-1-kuba@kernel.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4fa86555
    • Xin Long's avatar
      ethtool: eeprom: fix null-deref on genl_info in dump · 9d9effca
      Xin Long authored
      The similar fix as commit 46cdedf2 ("ethtool: pse-pd: fix null-deref on
      genl_info in dump") is also needed for ethtool eeprom.
      
      Fixes: c781ff12 ("ethtool: Allow network drivers to dump arbitrary EEPROM data")
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Link: https://lore.kernel.org/r/5575919a2efc74cd9ad64021880afc3805c54166.1666362167.git.lucien.xin@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      9d9effca
  2. 24 Oct, 2022 18 commits
    • Linus Torvalds's avatar
      Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 337a0a0b
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf.
      
        The net-memcg fix stands out, the rest is very run-off-the-mill. Maybe
        I'm biased.
      
        Current release - regressions:
      
         - eth: fman: re-expose location of the MAC address to userspace,
           apparently some udev scripts depended on the exact value
      
        Current release - new code bugs:
      
         - bpf:
             - wait for busy refill_work when destroying bpf memory allocator
             - allow bpf_user_ringbuf_drain() callbacks to return 1
             - fix dispatcher patchable function entry to 5 bytes nop
      
        Previous releases - regressions:
      
         - net-memcg: avoid stalls when under memory pressure
      
         - tcp: fix indefinite deferral of RTO with SACK reneging
      
         - tipc: fix a null-ptr-deref in tipc_topsrv_accept
      
         - eth: macb: specify PHY PM management done by MAC
      
         - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
      
        Previous releases - always broken:
      
         - eth: amd-xgbe: SFP fixes and compatibility improvements
      
        Misc:
      
         - docs: netdev: offer performance feedback to contributors"
      
      * tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
        net-memcg: avoid stalls when under memory pressure
        tcp: fix indefinite deferral of RTO with SACK reneging
        tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
        net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
        net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
        docs: netdev: offer performance feedback to contributors
        kcm: annotate data-races around kcm->rx_wait
        kcm: annotate data-races around kcm->rx_psock
        net: fman: Use physical address for userspace interfaces
        net/mlx5e: Cleanup MACsec uninitialization routine
        atlantic: fix deadlock at aq_nic_stop
        nfp: only clean `sp_indiff` when application firmware is unloaded
        amd-xgbe: add the bit rate quirk for Molex cables
        amd-xgbe: fix the SFP compliance codes check for DAC cables
        amd-xgbe: enable PLL_CTL for fixed PHY modes only
        amd-xgbe: use enums for mailbox cmd and sub_cmds
        amd-xgbe: Yellow carp devices do not need rrc
        bpf: Use __llist_del_all() whenever possbile during memory draining
        bpf: Wait for busy refill_work when destroying bpf memory allocator
        MAINTAINERS: add keyword match on PTP
        ...
      337a0a0b
    • Linus Torvalds's avatar
      Merge tag 'rcu-urgent.2022.10.20a' of... · f6602a97
      Linus Torvalds authored
      Merge tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
      
      Pull RCU fix from Paul McKenney:
       "Fix a regression caused by commit bf95b2bc ("rcu: Switch polled
        grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
        interrupts enabled after an early-boot call to synchronize_rcu().
      
        Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
        to properly interact with polled grace periods, but the code did not
        take into account the possibility of synchronize_rcu() being invoked
        from the portion of the boot sequence during which interrupts are
        disabled.
      
        This commit therefore switches the lock acquisition and release from
        irq to irqsave/irqrestore"
      
      * tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
        rcu: Keep synchronize_rcu() from enabling irqs in early boot
      f6602a97
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of... · 2a91e897
      Linus Torvalds authored
      Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull KUnit fixes from Shuah Khan:
       "One single fix to update alloc_string_stream() callers to check for
        IS_ERR() instead of NULL to be in sync with alloc_string_stream()
        returning an ERR_PTR()"
      
      * tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: update NULL vs IS_ERR() tests
      2a91e897
    • Linus Torvalds's avatar
      Merge tag 'linux-kselftest-fixes-6.1-rc3' of... · 21c92498
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
      
       - futex, intel_pstate, kexec build fixes
      
       - ftrace dynamic_events dependency check fix
      
       - memory-hotplug fix to remove redundant warning from test report
      
      * tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests/ftrace: fix dynamic_events dependency check
        selftests/memory-hotplug: Remove the redundant warning information
        selftests/kexec: fix build for ARCH=x86_64
        selftests/intel_pstate: fix build for ARCH=x86_64
        selftests/futex: fix build for clang
      21c92498
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 74d5b415
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
      
       - Fix typos in UART1 and MMC in the Ingenic driver
      
       - A really well researched glitch bug fix to the Qualcomm driver that
         was tracked down and fixed by Dough Anderson from Chromium. Hats off
         for this one!
      
       - Revert two patches on the Xilinx ZynqMP driver: this needs a proper
         solution making use of firmware version information to adapt to
         different firmware releases
      
       - Fix interrupt triggers in the Ocelot driver
      
      * tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: ocelot: Fix incorrect trigger of the interrupt.
        Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
        Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
        pinctrl: qcom: Avoid glitching lines when we first mux to output
        pinctrl: Ingenic: JZ4755 bug fixes
      74d5b415
    • Jakub Kicinski's avatar
      net-memcg: avoid stalls when under memory pressure · 720ca52b
      Jakub Kicinski authored
      As Shakeel explains the commit under Fixes had the unintended
      side-effect of no longer pre-loading the cached memory allowance.
      Even tho we previously dropped the first packet received when
      over memory limit - the consecutive ones would get thru by using
      the cache. The charging was happening in batches of 128kB, so
      we'd let in 128kB (truesize) worth of packets per one drop.
      
      After the change we no longer force charge, there will be no
      cache filling side effects. This causes significant drops and
      connection stalls for workloads which use a lot of page cache,
      since we can't reclaim page cache under GFP_NOWAIT.
      
      Some of the latency can be recovered by improving SACK reneg
      handling but nowhere near enough to get back to the pre-5.15
      performance (the application I'm experimenting with still
      sees 5-10x worst latency).
      
      Apply the suggested workaround of using GFP_ATOMIC. We will now
      be more permissive than previously as we'll drop _no_ packets
      in softirq when under pressure. But I can't think of any good
      and simple way to address that within networking.
      
      Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/Suggested-by: default avatarShakeel Butt <shakeelb@google.com>
      Fixes: 4b1327be ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
      Acked-by: default avatarShakeel Butt <shakeelb@google.com>
      Acked-by: default avatarRoman Gushchin <roman.gushchin@linux.dev>
      Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      720ca52b
    • Neal Cardwell's avatar
      tcp: fix indefinite deferral of RTO with SACK reneging · 3d2af9cc
      Neal Cardwell authored
      This commit fixes a bug that can cause a TCP data sender to repeatedly
      defer RTOs when encountering SACK reneging.
      
      The bug is that when we're in fast recovery in a scenario with SACK
      reneging, every time we get an ACK we call tcp_check_sack_reneging()
      and it can note the apparent SACK reneging and rearm the RTO timer for
      srtt/2 into the future. In some SACK reneging scenarios that can
      happen repeatedly until the receive window fills up, at which point
      the sender can't send any more, the ACKs stop arriving, and the RTO
      fires at srtt/2 after the last ACK. But that can take far too long
      (O(10 secs)), since the connection is stuck in fast recovery with a
      low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
      available.
      
      This fix changes the logic in tcp_check_sack_reneging() to only rearm
      the RTO timer if data is cumulatively ACKed, indicating forward
      progress. This avoids this kind of nearly infinite loop of RTO timer
      re-arming. In addition, this meets the goals of
      tcp_check_sack_reneging() in handling Windows TCP behavior that looks
      temporarily like SACK reneging but is not really.
      
      Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
      and provided critical packet traces that enabled root-causing this
      issue. Also, many thanks to Jakub Kicinski for testing this fix.
      
      Fixes: 5ae344c9 ("tcp: reduce spurious retransmits due to transient SACK reneging")
      Reported-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reported-by: default avatarNeil Spring <ntspring@fb.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Tested-by: default avatarJakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3d2af9cc
    • Jakub Kicinski's avatar
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · e28c4445
      Jakub Kicinski authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2022-10-23
      
      We've added 7 non-merge commits during the last 18 day(s) which contain
      a total of 8 files changed, 69 insertions(+), 5 deletions(-).
      
      The main changes are:
      
      1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.
      
      2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.
      
      3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.
      
      4) Prevent decl_tag from being referenced in func_proto, from Stanislav.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf: Use __llist_del_all() whenever possbile during memory draining
        bpf: Wait for busy refill_work when destroying bpf memory allocator
        bpf: Fix dispatcher patchable function entry to 5 bytes nop
        bpf: prevent decl_tag from being referenced in func_proto
        selftests/bpf: Add reproducer for decl_tag in func_proto return type
        selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
        bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
      ====================
      
      Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e28c4445
    • Lu Wei's avatar
      tcp: fix a signed-integer-overflow bug in tcp_add_backlog() · ec791d81
      Lu Wei authored
      The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and
      in tcp_add_backlog(), the variable limit is caculated by adding
      sk_rcvbuf, sk_sndbuf and 64 * 1024, it may exceed the max value
      of int and overflow. This patch reduces the limit budget by
      halving the sndbuf to solve this issue since ACK packets are much
      smaller than the payload.
      
      Fixes: c9c33212 ("tcp: add tcp_add_backlog()")
      Signed-off-by: default avatarLu Wei <luwei32@huawei.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ec791d81
    • Zhang Changzhong's avatar
      net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY · 9c1eaa27
      Zhang Changzhong authored
      The ndo_start_xmit() method must not free skb when returning
      NETDEV_TX_BUSY, since caller is going to requeue freed skb.
      
      Fixes: 504d4721 ("MIPS: Lantiq: Add ethernet driver")
      Signed-off-by: default avatarZhang Changzhong <zhangchangzhong@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9c1eaa27
    • Zhengchao Shao's avatar
      net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed · d266935a
      Zhengchao Shao authored
      When the ops_init() interface is invoked to initialize the net, but
      ops->init() fails, data is released. However, the ptr pointer in
      net->gen is invalid. In this case, when nfqnl_nf_hook_drop() is invoked
      to release the net, invalid address access occurs.
      
      The process is as follows:
      setup_net()
      	ops_init()
      		data = kzalloc(...)   ---> alloc "data"
      		net_assign_generic()  ---> assign "date" to ptr in net->gen
      		...
      		ops->init()           ---> failed
      		...
      		kfree(data);          ---> ptr in net->gen is invalid
      	...
      	ops_exit_list()
      		...
      		nfqnl_nf_hook_drop()
      			*q = nfnl_queue_pernet(net) ---> q is invalid
      
      The following is the Call Trace information:
      BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280
      Read of size 8 at addr ffff88810396b240 by task ip/15855
      Call Trace:
      <TASK>
      dump_stack_lvl+0x8e/0xd1
      print_report+0x155/0x454
      kasan_report+0xba/0x1f0
      nfqnl_nf_hook_drop+0x264/0x280
      nf_queue_nf_hook_drop+0x8b/0x1b0
      __nf_unregister_net_hook+0x1ae/0x5a0
      nf_unregister_net_hooks+0xde/0x130
      ops_exit_list+0xb0/0x170
      setup_net+0x7ac/0xbd0
      copy_net_ns+0x2e6/0x6b0
      create_new_namespaces+0x382/0xa50
      unshare_nsproxy_namespaces+0xa6/0x1c0
      ksys_unshare+0x3a4/0x7e0
      __x64_sys_unshare+0x2d/0x40
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      </TASK>
      
      Allocated by task 15855:
      kasan_save_stack+0x1e/0x40
      kasan_set_track+0x21/0x30
      __kasan_kmalloc+0xa1/0xb0
      __kmalloc+0x49/0xb0
      ops_init+0xe7/0x410
      setup_net+0x5aa/0xbd0
      copy_net_ns+0x2e6/0x6b0
      create_new_namespaces+0x382/0xa50
      unshare_nsproxy_namespaces+0xa6/0x1c0
      ksys_unshare+0x3a4/0x7e0
      __x64_sys_unshare+0x2d/0x40
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Freed by task 15855:
      kasan_save_stack+0x1e/0x40
      kasan_set_track+0x21/0x30
      kasan_save_free_info+0x2a/0x40
      ____kasan_slab_free+0x155/0x1b0
      slab_free_freelist_hook+0x11b/0x220
      __kmem_cache_free+0xa4/0x360
      ops_init+0xb9/0x410
      setup_net+0x5aa/0xbd0
      copy_net_ns+0x2e6/0x6b0
      create_new_namespaces+0x382/0xa50
      unshare_nsproxy_namespaces+0xa6/0x1c0
      ksys_unshare+0x3a4/0x7e0
      __x64_sys_unshare+0x2d/0x40
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: f875bae0 ("net: Automatically allocate per namespace data.")
      Signed-off-by: default avatarZhengchao Shao <shaozhengchao@huawei.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d266935a
    • Jakub Kicinski's avatar
      docs: netdev: offer performance feedback to contributors · c5884ef4
      Jakub Kicinski authored
      Some of us gotten used to producing large quantities of peer feedback
      at work, every 3 or 6 months. Extending the same courtesy to community
      members seems like a logical step. It may be hard for some folks to
      get validation of how important their work is internally, especially
      at smaller companies which don't employ many kernel experts.
      
      The concept of "peer feedback" may be a hyperscaler / silicon valley
      thing so YMMV. Hopefully we can build more context as we go.
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c5884ef4
    • David S. Miller's avatar
      Merge branch 'kcm-data-races' · 931ae86f
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      kcm: annotate data-races
      
      This series address two different syzbot reports for KCM.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      931ae86f
    • Eric Dumazet's avatar
      kcm: annotate data-races around kcm->rx_wait · 0c745b51
      Eric Dumazet authored
      kcm->rx_psock can be read locklessly in kcm_rfree().
      Annotate the read and writes accordingly.
      
      syzbot reported:
      
      BUG: KCSAN: data-race in kcm_rcv_strparser / kcm_rfree
      
      write to 0xffff88810784e3d0 of 1 bytes by task 1823 on cpu 1:
      reserve_rx_kcm net/kcm/kcmsock.c:283 [inline]
      kcm_rcv_strparser+0x250/0x3a0 net/kcm/kcmsock.c:363
      __strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
      strp_recv+0x6d/0x80 net/strparser/strparser.c:335
      tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
      strp_read_sock net/strparser/strparser.c:358 [inline]
      do_strp_work net/strparser/strparser.c:406 [inline]
      strp_work+0xe8/0x180 net/strparser/strparser.c:415
      process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
      worker_thread+0x618/0xa70 kernel/workqueue.c:2436
      kthread+0x1a9/0x1e0 kernel/kthread.c:376
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
      
      read to 0xffff88810784e3d0 of 1 bytes by task 17869 on cpu 0:
      kcm_rfree+0x121/0x220 net/kcm/kcmsock.c:181
      skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
      skb_release_all net/core/skbuff.c:852 [inline]
      __kfree_skb net/core/skbuff.c:868 [inline]
      kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
      kfree_skb include/linux/skbuff.h:1216 [inline]
      kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
      ____sys_recvmsg+0x16c/0x2e0
      ___sys_recvmsg net/socket.c:2743 [inline]
      do_recvmmsg+0x2f1/0x710 net/socket.c:2837
      __sys_recvmmsg net/socket.c:2916 [inline]
      __do_sys_recvmmsg net/socket.c:2939 [inline]
      __se_sys_recvmmsg net/socket.c:2932 [inline]
      __x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x01 -> 0x00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 17869 Comm: syz-executor.2 Not tainted 6.1.0-rc1-syzkaller-00010-gbb1a1146-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0c745b51
    • Eric Dumazet's avatar
      kcm: annotate data-races around kcm->rx_psock · 15e4dabd
      Eric Dumazet authored
      kcm->rx_psock can be read locklessly in kcm_rfree().
      Annotate the read and writes accordingly.
      
      We do the same for kcm->rx_wait in the following patch.
      
      syzbot reported:
      BUG: KCSAN: data-race in kcm_rfree / unreserve_rx_kcm
      
      write to 0xffff888123d827b8 of 8 bytes by task 2758 on cpu 1:
      unreserve_rx_kcm+0x72/0x1f0 net/kcm/kcmsock.c:313
      kcm_rcv_strparser+0x2b5/0x3a0 net/kcm/kcmsock.c:373
      __strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
      strp_recv+0x6d/0x80 net/strparser/strparser.c:335
      tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
      strp_read_sock net/strparser/strparser.c:358 [inline]
      do_strp_work net/strparser/strparser.c:406 [inline]
      strp_work+0xe8/0x180 net/strparser/strparser.c:415
      process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
      worker_thread+0x618/0xa70 kernel/workqueue.c:2436
      kthread+0x1a9/0x1e0 kernel/kthread.c:376
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
      
      read to 0xffff888123d827b8 of 8 bytes by task 5859 on cpu 0:
      kcm_rfree+0x14c/0x220 net/kcm/kcmsock.c:181
      skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
      skb_release_all net/core/skbuff.c:852 [inline]
      __kfree_skb net/core/skbuff.c:868 [inline]
      kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
      kfree_skb include/linux/skbuff.h:1216 [inline]
      kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
      ____sys_recvmsg+0x16c/0x2e0
      ___sys_recvmsg net/socket.c:2743 [inline]
      do_recvmmsg+0x2f1/0x710 net/socket.c:2837
      __sys_recvmmsg net/socket.c:2916 [inline]
      __do_sys_recvmmsg net/socket.c:2939 [inline]
      __se_sys_recvmmsg net/socket.c:2932 [inline]
      __x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0xffff88812971ce00 -> 0x0000000000000000
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 5859 Comm: syz-executor.3 Not tainted 6.0.0-syzkaller-12189-g19d17ab7-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      15e4dabd
    • Sean Anderson's avatar
      net: fman: Use physical address for userspace interfaces · c99f0f7e
      Sean Anderson authored
      Before 262f2b78 ("net: fman: Map the base address once"), the
      physical address of the MAC was exposed to userspace in two places: via
      sysfs and via SIOCGIFMAP. While this is not best practice, it is an
      external ABI which is in use by userspace software.
      
      The aforementioned commit inadvertently modified these addresses and
      made them virtual. This constitutes and ABI break.  Additionally, it
      leaks the kernel's memory layout to userspace. Partially revert that
      commit, reintroducing the resource back into struct mac_device, while
      keeping the intended changes (the rework of the address mapping).
      
      Fixes: 262f2b78 ("net: fman: Map the base address once")
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarSean Anderson <sean.anderson@seco.com>
      Acked-by: default avatarMadalin Bucur <madalin.bucur@oss.nxp.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c99f0f7e
    • Leon Romanovsky's avatar
      net/mlx5e: Cleanup MACsec uninitialization routine · f8127476
      Leon Romanovsky authored
      The mlx5e_macsec_cleanup() routine has NULL pointer dereferencing if mlx5
      device doesn't support MACsec (priv->macsec will be NULL).
      
      While at it delete comment line, assignment and extra blank lines, so fix
      everything in one patch.
      
      Fixes: 1f53da67 ("net/mlx5e: Create advanced steering operation (ASO) object for MACsec")
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarLeon Romanovsky <leonro@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f8127476
    • Íñigo Huguet's avatar
      atlantic: fix deadlock at aq_nic_stop · 6960d133
      Íñigo Huguet authored
      NIC is stopped with rtnl_lock held, and during the stop it cancels the
      'service_task' work and free irqs.
      
      However, if CONFIG_MACSEC is set, rtnl_lock is acquired both from
      aq_nic_service_task and aq_linkstate_threaded_isr. Then a deadlock
      happens if aq_nic_stop tries to cancel/disable them when they've already
      started their execution.
      
      As the deadlock is caused by rtnl_lock, it causes many other processes
      to stall, not only atlantic related stuff.
      
      Fix it by introducing a mutex that protects each NIC's macsec related
      data, and locking it instead of the rtnl_lock from the service task and
      the threaded IRQ.
      
      Before this patch, all macsec data was protected with rtnl_lock, but
      maybe not all of it needs to be protected. With this new mutex, further
      efforts can be made to limit the protected data only to that which
      requires it. However, probably it doesn't worth it because all macsec's
      data accesses are infrequent, and almost all are done from macsec_ops
      or ethtool callbacks, called holding rtnl_lock, so macsec_mutex won't
      never be much contended.
      
      The issue appeared repeteadly attaching and deattaching the NIC to a
      bond interface. Doing that after this patch I cannot reproduce the bug.
      
      Fixes: 62c1c2e6 ("net: atlantic: MACSec offload skeleton")
      Reported-by: default avatarLi Liang <liali@redhat.com>
      Suggested-by: default avatarAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: default avatarÍñigo Huguet <ihuguet@redhat.com>
      Reviewed-by: default avatarIgor Russkikh <irusskikh@marvell.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6960d133
  3. 23 Oct, 2022 9 commits
  4. 22 Oct, 2022 6 commits