1. 25 Oct, 2022 4 commits
  2. 24 Oct, 2022 36 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 96917bb3
      Jakub Kicinski authored
      include/linux/net.h
        a5ef058d ("net: introduce and use custom sockopt socket flag")
        e993ffe3 ("net: flag sockets supporting msghdr originated zerocopy")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      96917bb3
    • Merge tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 337a0a0b
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from bpf.
      
        The net-memcg fix stands out, the rest is very run-of-the-mill. Maybe
        I'm biased.
      
        Current release - regressions:
      
         - eth: fman: re-expose location of the MAC address to userspace,
           apparently some udev scripts depended on the exact value
      
        Current release - new code bugs:
      
         - bpf:
             - wait for busy refill_work when destroying bpf memory allocator
             - allow bpf_user_ringbuf_drain() callbacks to return 1
             - fix dispatcher patchable function entry to 5 bytes nop
      
        Previous releases - regressions:
      
         - net-memcg: avoid stalls when under memory pressure
      
         - tcp: fix indefinite deferral of RTO with SACK reneging
      
         - tipc: fix a null-ptr-deref in tipc_topsrv_accept
      
         - eth: macb: specify PHY PM management done by MAC
      
         - tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
      
        Previous releases - always broken:
      
         - eth: amd-xgbe: SFP fixes and compatibility improvements
      
        Misc:
      
         - docs: netdev: offer performance feedback to contributors"
      
      * tag 'net-6.1-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
        net-memcg: avoid stalls when under memory pressure
        tcp: fix indefinite deferral of RTO with SACK reneging
        tcp: fix a signed-integer-overflow bug in tcp_add_backlog()
        net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY
        net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed
        docs: netdev: offer performance feedback to contributors
        kcm: annotate data-races around kcm->rx_wait
        kcm: annotate data-races around kcm->rx_psock
        net: fman: Use physical address for userspace interfaces
        net/mlx5e: Cleanup MACsec uninitialization routine
        atlantic: fix deadlock at aq_nic_stop
        nfp: only clean `sp_indiff` when application firmware is unloaded
        amd-xgbe: add the bit rate quirk for Molex cables
        amd-xgbe: fix the SFP compliance codes check for DAC cables
        amd-xgbe: enable PLL_CTL for fixed PHY modes only
        amd-xgbe: use enums for mailbox cmd and sub_cmds
        amd-xgbe: Yellow carp devices do not need rrc
        bpf: Use __llist_del_all() whenever possbile during memory draining
        bpf: Wait for busy refill_work when destroying bpf memory allocator
        MAINTAINERS: add keyword match on PTP
        ...
      337a0a0b
    • Merge tag 'rcu-urgent.2022.10.20a' of... · f6602a97
      Linus Torvalds authored
      Merge tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
      
      Pull RCU fix from Paul McKenney:
       "Fix a regression caused by commit bf95b2bc ("rcu: Switch polled
        grace-period APIs to ->gp_seq_polled"), which could incorrectly leave
        interrupts enabled after an early-boot call to synchronize_rcu().
      
        Such synchronize_rcu() calls must acquire leaf rcu_node locks in order
        to properly interact with polled grace periods, but the code did not
        take into account the possibility of synchronize_rcu() being invoked
        from the portion of the boot sequence during which interrupts are
        disabled.
      
        This commit therefore switches the lock acquisition and release from
        irq to irqsave/irqrestore"
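
The difference between the two locking flavours can be sketched in user space. This is a minimal model, not kernel code: the `model_*` helpers and the `irqs_enabled` flag are hypothetical stand-ins for the real lock primitives and the CPU interrupt state.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt-enable flag; a stand-in for real CPU state. */
static bool irqs_enabled;

/* Model of the _irq pairing: unlock re-enables interrupts unconditionally. */
static void model_lock_irq(void)   { irqs_enabled = false; }
static void model_unlock_irq(void) { irqs_enabled = true; }

/* Model of irqsave/irqrestore: unlock restores whatever state was saved. */
static bool model_lock_irqsave(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

static void model_unlock_irqrestore(bool flags) { irqs_enabled = flags; }

/* Returns the interrupt state left behind by the _irq variant. */
bool irq_state_after_irq_variant(bool enabled_on_entry)
{
	irqs_enabled = enabled_on_entry;
	model_lock_irq();
	/* ... critical section ... */
	model_unlock_irq();
	return irqs_enabled;
}

/* Returns the interrupt state left behind by the irqsave variant. */
bool irq_state_after_irqsave_variant(bool enabled_on_entry)
{
	bool flags;

	irqs_enabled = enabled_on_entry;
	flags = model_lock_irqsave();
	/* ... critical section ... */
	model_unlock_irqrestore(flags);
	return irqs_enabled;
}

/* Early-boot case: interrupts are disabled when the function is entered. */
void demo_early_boot(void)
{
	assert(irq_state_after_irq_variant(false) == true);      /* the regression */
	assert(irq_state_after_irqsave_variant(false) == false); /* state preserved */
}
```

In this model the `_irq` variant always leaves interrupts enabled, which is harmless when they were enabled on entry and wrong when they were not.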
      
      * tag 'rcu-urgent.2022.10.20a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
        rcu: Keep synchronize_rcu() from enabling irqs in early boot
      f6602a97
    • Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of... · 2a91e897
      Linus Torvalds authored
      Merge tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull KUnit fixes from Shuah Khan:
       "One single fix to update alloc_string_stream() callers to check for
        IS_ERR() instead of NULL to be in sync with alloc_string_stream()
        returning an ERR_PTR()"
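
Why a NULL test misses an ERR_PTR() return can be shown with a small user-space sketch. The `ERR_PTR()` and `IS_ERR()` helpers below are simplified re-implementations for illustration, and `demo_alloc_stream()` is a hypothetical allocator, not the KUnit function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* User-space stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(intptr_t error) { return (void *)error; }

static inline bool IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct demo_stream { int unused; };

/* Hypothetical allocator that reports failure as an error pointer, never NULL. */
struct demo_stream *demo_alloc_stream(bool simulate_failure)
{
	struct demo_stream *s;

	if (simulate_failure)
		return ERR_PTR(-ENOMEM);
	s = calloc(1, sizeof(*s));
	return s ? s : ERR_PTR(-ENOMEM);
}

/* Old-style caller: tests for NULL, so an error pointer slips through. */
bool null_check_catches_failure(bool simulate_failure)
{
	struct demo_stream *s = demo_alloc_stream(simulate_failure);
	bool caught = (s == NULL);

	if (!IS_ERR(s))
		free(s);
	return caught;
}

/* Updated caller: tests with IS_ERR(), matching the allocator's contract. */
bool is_err_check_catches_failure(bool simulate_failure)
{
	struct demo_stream *s = demo_alloc_stream(simulate_failure);
	bool caught = IS_ERR(s);

	if (!caught)
		free(s);
	return caught;
}

void demo_checks(void)
{
	assert(!null_check_catches_failure(true));  /* failure goes unnoticed */
	assert(is_err_check_catches_failure(true)); /* failure is detected */
}
```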
      
      * tag 'linux-kselftest-kunit-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        kunit: update NULL vs IS_ERR() tests
      2a91e897
    • Merge tag 'linux-kselftest-fixes-6.1-rc3' of... · 21c92498
      Linus Torvalds authored
      Merge tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
      
      Pull Kselftest fixes from Shuah Khan:
      
       - futex, intel_pstate, kexec build fixes
      
       - ftrace dynamic_events dependency check fix
      
       - memory-hotplug fix to remove redundant warning from test report
      
      * tag 'linux-kselftest-fixes-6.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
        selftests/ftrace: fix dynamic_events dependency check
        selftests/memory-hotplug: Remove the redundant warning information
        selftests/kexec: fix build for ARCH=x86_64
        selftests/intel_pstate: fix build for ARCH=x86_64
        selftests/futex: fix build for clang
      21c92498
    • Merge tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 74d5b415
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
      
       - Fix typos in UART1 and MMC in the Ingenic driver
      
       - A really well researched glitch bug fix to the Qualcomm driver that
         was tracked down and fixed by Doug Anderson from Chromium. Hats off
         for this one!
      
       - Revert two patches on the Xilinx ZynqMP driver: this needs a proper
         solution making use of firmware version information to adapt to
         different firmware releases
      
       - Fix interrupt triggers in the Ocelot driver
      
      * tag 'pinctrl-v6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: ocelot: Fix incorrect trigger of the interrupt.
        Revert "dt-bindings: pinctrl-zynqmp: Add output-enable configuration"
        Revert "pinctrl: pinctrl-zynqmp: Add support for output-enable and bias-high-impedance"
        pinctrl: qcom: Avoid glitching lines when we first mux to output
        pinctrl: Ingenic: JZ4755 bug fixes
      74d5b415
    • net-memcg: avoid stalls when under memory pressure · 720ca52b
      Jakub Kicinski authored
      As Shakeel explains the commit under Fixes had the unintended
      side-effect of no longer pre-loading the cached memory allowance.
      Even tho we previously dropped the first packet received when
      over memory limit - the consecutive ones would get thru by using
      the cache. The charging was happening in batches of 128kB, so
      we'd let in 128kB (truesize) worth of packets per one drop.
      
      After the change we no longer force charge, there will be no
      cache filling side effects. This causes significant drops and
      connection stalls for workloads which use a lot of page cache,
      since we can't reclaim page cache under GFP_NOWAIT.
      
      Some of the latency can be recovered by improving SACK reneg
      handling but nowhere near enough to get back to the pre-5.15
      performance (the application I'm experimenting with still
      sees 5-10x worse latency).
      
      Apply the suggested workaround of using GFP_ATOMIC. We will now
      be more permissive than previously as we'll drop _no_ packets
      in softirq when under pressure. But I can't think of any good
      and simple way to address that within networking.
      
      Link: https://lore.kernel.org/all/20221012163300.795e7b86@kernel.org/
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Fixes: 4b1327be ("net-memcg: pass in gfp_t mask to mem_cgroup_charge_skmem()")
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Link: https://lore.kernel.org/r/20221021160304.1362511-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      720ca52b
    • tcp: fix indefinite deferral of RTO with SACK reneging · 3d2af9cc
      Neal Cardwell authored
      This commit fixes a bug that can cause a TCP data sender to repeatedly
      defer RTOs when encountering SACK reneging.
      
      The bug is that when we're in fast recovery in a scenario with SACK
      reneging, every time we get an ACK we call tcp_check_sack_reneging()
      and it can note the apparent SACK reneging and rearm the RTO timer for
      srtt/2 into the future. In some SACK reneging scenarios that can
      happen repeatedly until the receive window fills up, at which point
      the sender can't send any more, the ACKs stop arriving, and the RTO
      fires at srtt/2 after the last ACK. But that can take far too long
      (O(10 secs)), since the connection is stuck in fast recovery with a
      low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
      available.
      
      This fix changes the logic in tcp_check_sack_reneging() to only rearm
      the RTO timer if data is cumulatively ACKed, indicating forward
      progress. This avoids this kind of nearly infinite loop of RTO timer
      re-arming. In addition, this meets the goals of
      tcp_check_sack_reneging() in handling Windows TCP behavior that looks
      temporarily like SACK reneging but is not really.
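
The rearming rule described above can be illustrated with a toy model. It is a sketch under stated assumptions: `struct demo_conn`, both rearm helpers, the 10 ms ACK spacing and the 100 ms srtt are all hypothetical and do not mirror the real tcp_check_sack_reneging() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy connection state: only the pending RTO deadline is modelled. */
struct demo_conn {
	unsigned long rto_deadline_ms;
};

/* Old behaviour: every ACK showing apparent reneging pushes the RTO out. */
void rearm_on_every_ack(struct demo_conn *c, unsigned long now_ms,
			unsigned long srtt_ms)
{
	c->rto_deadline_ms = now_ms + srtt_ms / 2;
}

/* New behaviour: rearm only when the ACK cumulatively acknowledges data. */
bool rearm_on_forward_progress(struct demo_conn *c, bool cumulatively_acked,
			       unsigned long now_ms, unsigned long srtt_ms)
{
	if (!cumulatively_acked)
		return false;	/* no forward progress: keep the old deadline */
	c->rto_deadline_ms = now_ms + srtt_ms / 2;
	return true;
}

/*
 * Feed n_acks ACKs, 10 ms apart, none of which advances the cumulative
 * ACK point, and return the RTO deadline that is pending afterwards.
 */
unsigned long deadline_after_stalled_acks(bool fixed, int n_acks)
{
	struct demo_conn c = { .rto_deadline_ms = 50 };
	unsigned long now_ms = 0;
	int i;

	assert(n_acks >= 0);
	for (i = 0; i < n_acks; i++) {
		now_ms += 10;
		if (fixed)
			rearm_on_forward_progress(&c, false, now_ms, 100);
		else
			rearm_on_every_ack(&c, now_ms, 100);
	}
	return c.rto_deadline_ms;
}
```

With the old rule the deadline keeps moving ahead of the clock for as long as ACKs arrive; with the new rule it stays where it was, so the timer can fire.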
      
      Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
      and provided critical packet traces that enabled root-causing this
      issue. Also, many thanks to Jakub Kicinski for testing this fix.
      
      Fixes: 5ae344c9 ("tcp: reduce spurious retransmits due to transient SACK reneging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Reported-by: Neil Spring <ntspring@fb.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Tested-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3d2af9cc
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · e28c4445
      Jakub Kicinski authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf 2022-10-23
      
      We've added 7 non-merge commits during the last 18 day(s) which contain
      a total of 8 files changed, 69 insertions(+), 5 deletions(-).
      
      The main changes are:
      
      1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.
      
      2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.
      
      3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.
      
      4) Prevent decl_tag from being referenced in func_proto, from Stanislav.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        bpf: Use __llist_del_all() whenever possbile during memory draining
        bpf: Wait for busy refill_work when destroying bpf memory allocator
        bpf: Fix dispatcher patchable function entry to 5 bytes nop
        bpf: prevent decl_tag from being referenced in func_proto
        selftests/bpf: Add reproducer for decl_tag in func_proto return type
        selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
        bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
      ====================
      
      Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e28c4445
    • Merge branch 'ptp-ocxp-Oroli-ART-CARD' · 86d6f77a
      David S. Miller authored
      Vadim Fedorenko says:
      
      ====================
      ptp: ocp: add support for Orolia ART-CARD
      
      Orolia has created an alternative open source TimeCard. The card's
      hardware provides functions similar to those of the OCP card, which is
      why support is added to the current driver.
      
      The first patch in the series changes how information about serial
      ports is stored and is mostly preparation.
      
      Patches 2 to 4 introduce the actual hardware support.
      
      The last patch removes the fallback from the devlink flashing interface
      to protect against flashing the wrong image. This matters now that two
      different boards are supported and a wrong image can easily ruin the
      hardware.
      
      v2:
        Address comments from Jonathan Lemon
      
      v3:
        Fix issue reported by kernel test robot <lkp@intel.com>
      
      v4:
        Fix clang build issue
      
      v5:
        Fix warnings and per-patch build errors
      
      v6:
        Fix more style issues
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86d6f77a
    • ptp: ocp: remove flash image header check fallback · c1fd463d
      Vadim Fedorenko authored
      Previously there was a fallback mode to flash a firmware image without
      a proper header. But now we have different supported vendors, and
      flashing the wrong image could destroy the hardware. Remove the fallback
      mode and force the header check. Both vendors have published firmware
      images with headers.
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1fd463d
    • ptp: ocp: expose config and temperature for ART card · ee6439aa
      Vadim Fedorenko authored
      Orolia card has disciplining configuration and temperature table
      stored in EEPROM. This patch exposes them as binary attributes to
      have read and write access.
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee6439aa
    • ptp: ocp: add serial port of mRO50 MAC on ART card · 9c44a7ac
      Vadim Fedorenko authored
      The ART card provides an interface to access the serial port of the
      miniature atomic clock found on the card. Add support for this device
      and configure it during the init phase.
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c44a7ac
    • ptp: ocp: add Orolia timecard support · 69dbe107
      Vadim Fedorenko authored
      This brings in the Orolia timecard support from the GitHub repository.
      The card uses different drivers to provide access to i2c EEPROM and
      firmware SPI flash. And it also has a bit different EEPROM map, but
      other parts of the code are the same and could be reused.
      Co-developed-by: Charles Parent <charles.parent@orolia2s.com>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      69dbe107
    • ptp: ocp: upgrade serial line information · 895ac5a5
      Vadim Fedorenko authored
      Introduce structure to hold serial port line number and the baud rate
      it supports.
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Vadim Fedorenko <vadfed@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      895ac5a5
    • tcp: fix a signed-integer-overflow bug in tcp_add_backlog() · ec791d81
      Lu Wei authored
      The type of sk_rcvbuf and sk_sndbuf in struct sock is int, and
      in tcp_add_backlog() the variable limit is calculated by adding
      sk_rcvbuf, sk_sndbuf and 64 * 1024, so it may exceed the maximum
      value of int and overflow. This patch reduces the limit budget by
      halving the sndbuf to solve this issue, since ACK packets are much
      smaller than the payload.
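
The arithmetic can be checked with a short user-space sketch. The function names are hypothetical, and accumulating the halved sum in an unsigned 32-bit variable is an assumption made here for illustration; the commit message above only states that sndbuf is halved.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Would the original sum rcvbuf + sndbuf + 64 KiB exceed INT_MAX?
 * The check is done in 64 bits so that the sketch itself never overflows.
 */
bool old_limit_overflows_int(int rcvbuf, int sndbuf)
{
	int64_t wide = (int64_t)rcvbuf + (int64_t)sndbuf + 64 * 1024;

	return wide > INT_MAX;
}

/*
 * Reduced budget: rcvbuf + sndbuf / 2 + 64 KiB, accumulated as uint32_t
 * (an assumption of this sketch). For non-negative inputs the largest
 * possible result still fits in 32 unsigned bits.
 */
uint32_t backlog_limit_halved(int rcvbuf, int sndbuf)
{
	uint32_t limit;

	assert(rcvbuf >= 0 && sndbuf >= 0);
	limit = (uint32_t)rcvbuf + (uint32_t)(sndbuf >> 1);
	limit += 64 * 1024;
	return limit;
}
```

With both buffers at INT_MAX the original sum is far beyond what an int can hold, while the halved budget comes to 3221291006, which is below UINT32_MAX.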
      
      Fixes: c9c33212 ("tcp: add tcp_add_backlog()")
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec791d81
    • net: skb: move skb_pp_recycle() to skbuff.c · 4727bab4
      Yunsheng Lin authored
      skb_pp_recycle() is only used by skb_free_head() in
      skbuff.c, so move it to skbuff.c.
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4727bab4
    • net: lantiq_etop: don't free skb when returning NETDEV_TX_BUSY · 9c1eaa27
      Zhang Changzhong authored
      The ndo_start_xmit() method must not free skb when returning
      NETDEV_TX_BUSY, since caller is going to requeue freed skb.
      
      Fixes: 504d4721 ("MIPS: Lantiq: Add ethernet driver")
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c1eaa27
    • ibmveth: Always stop tx queues during close · 127b7218
      Nick Child authored
      netif_stop_all_queues must be called before calling H_FREE_LOGICAL_LAN.
      As a result, we can remove the pool_config field from the ibmveth
      adapter structure.
      
      Some device configuration changes call ibmveth_close in order to free
      the current resources held by the device. These functions then make
      their changes and call ibmveth_open to reallocate and reserve resources
      for the device.
      
      Prior to this commit, the flag pool_config was used to tell ibmveth_close
      that it should not halt the transmit queue. pool_config was introduced in
      commit 860f242e ("[PATCH] ibmveth change buffer pools dynamically")
      to avoid interrupting the tx flow when making rx config changes. Since
      then, other commits adopted this approach, even if making tx config
      changes.
      
      The issue with this approach was that the hypervisor freed all of
      the device's control structures after the hcall H_FREE_LOGICAL_LAN
      was performed but the transmit queues were never stopped. So the higher
      layers in the network stack would continue transmission but any
      H_SEND_LOGICAL_LAN hcall would fail with H_PARAMETER until the
      hypervisor's structures for the device were allocated with the
      H_REGISTER_LOGICAL_LAN hcall in ibmveth_open. This resulted in
      no real networking harm but did cause several of these error
      messages to be logged: "h_send_logical_lan failed with rc=-4"
      
      So, instead of trying to keep the transmit queues alive during network
      configuration changes, just stop the queues, make necessary changes then
      restart the queues.
      Signed-off-by: Nick Child <nnac123@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      127b7218
    • net: remove useless parameter of __sock_cmsg_send · 233baf9a
      xu xin authored
      The parameter 'msg' has never been used by __sock_cmsg_send, so we can remove it
      safely.
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: xu xin <xu.xin16@zte.com.cn>
      Reviewed-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      233baf9a
    • net: fec: Add support for periodic output signal of PPS · 350749b9
      Wei Fang authored
      This patch adds support for configuring the periodic output
      signal of PPS, so that the PPS can be output at a specified
      time and period.
      Developers and testers can use the command "echo <channel>
      <start.sec> <start.nsec> <period.sec> <period.nsec> >
      /sys/class/ptp/ptp0/period" to specify the time and period
      of the PPS output signal.
      Note that the channel can only be set to 0. In addition, the
      start time must be later than the current PTP clock time, so
      users can first run "phc_ctl /dev/ptp0 -- get" to read the
      current PTP clock time.
      Signed-off-by: Wei Fang <wei.fang@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      350749b9
    • net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed · d266935a
      Zhengchao Shao authored
      When the ops_init() interface is invoked to initialize the net, but
      ops->init() fails, data is released. However, the ptr pointer in
      net->gen is invalid. In this case, when nfqnl_nf_hook_drop() is invoked
      to release the net, invalid address access occurs.
      
      The process is as follows:
      setup_net()
      	ops_init()
      		data = kzalloc(...)   ---> alloc "data"
      		net_assign_generic()  ---> assign "data" to ptr in net->gen
      		...
      		ops->init()           ---> failed
      		...
      		kfree(data);          ---> ptr in net->gen is invalid
      	...
      	ops_exit_list()
      		...
      		nfqnl_nf_hook_drop()
      			*q = nfnl_queue_pernet(net) ---> q is invalid
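
The failure path in the flow above can be modelled in user space. This is a hedged sketch: `struct demo_net`, `demo_ops_init()` and the other `demo_*` names are hypothetical, and clearing the slot after freeing is one plausible remedy assumed here; the commit message itself does not spell out how the fix is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct net with a single per-namespace data slot. */
struct demo_net {
	void *gen_slot;
};

typedef int (*demo_init_fn)(struct demo_net *net);

int demo_init_fails(struct demo_net *net) { (void)net; return -1; }
int demo_init_ok(struct demo_net *net)    { (void)net; return 0; }

/*
 * Allocate per-namespace data, publish it in the slot, then run init.
 * On failure the data is freed AND the slot is cleared, so later
 * teardown code cannot follow a dangling pointer.
 */
int demo_ops_init(struct demo_net *net, size_t size, demo_init_fn init)
{
	void *data;
	int err;

	assert(net != NULL);
	data = calloc(1, size);
	if (!data)
		return -1;
	net->gen_slot = data;		/* models net_assign_generic() */
	err = init(net);
	if (!err)
		return 0;
	free(data);
	net->gen_slot = NULL;		/* without this line the slot dangles */
	return err;
}

/* After a failed init the slot must be empty. */
bool slot_cleared_after_failed_init(void)
{
	struct demo_net net = { .gen_slot = NULL };

	return demo_ops_init(&net, 64, demo_init_fails) != 0 &&
	       net.gen_slot == NULL;
}

/* After a successful init the slot holds live data. */
bool slot_set_after_successful_init(void)
{
	struct demo_net net = { .gen_slot = NULL };
	bool ok = demo_ops_init(&net, 64, demo_init_ok) == 0 &&
		  net.gen_slot != NULL;

	free(net.gen_slot);
	return ok;
}
```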
      
      The following is the Call Trace information:
      BUG: KASAN: use-after-free in nfqnl_nf_hook_drop+0x264/0x280
      Read of size 8 at addr ffff88810396b240 by task ip/15855
      Call Trace:
      <TASK>
      dump_stack_lvl+0x8e/0xd1
      print_report+0x155/0x454
      kasan_report+0xba/0x1f0
      nfqnl_nf_hook_drop+0x264/0x280
      nf_queue_nf_hook_drop+0x8b/0x1b0
      __nf_unregister_net_hook+0x1ae/0x5a0
      nf_unregister_net_hooks+0xde/0x130
      ops_exit_list+0xb0/0x170
      setup_net+0x7ac/0xbd0
      copy_net_ns+0x2e6/0x6b0
      create_new_namespaces+0x382/0xa50
      unshare_nsproxy_namespaces+0xa6/0x1c0
      ksys_unshare+0x3a4/0x7e0
      __x64_sys_unshare+0x2d/0x40
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      </TASK>
      
      Allocated by task 15855:
      kasan_save_stack+0x1e/0x40
      kasan_set_track+0x21/0x30
      __kasan_kmalloc+0xa1/0xb0
      __kmalloc+0x49/0xb0
      ops_init+0xe7/0x410
      setup_net+0x5aa/0xbd0
      copy_net_ns+0x2e6/0x6b0
      create_new_namespaces+0x382/0xa50
      unshare_nsproxy_namespaces+0xa6/0x1c0
      ksys_unshare+0x3a4/0x7e0
      __x64_sys_unshare+0x2d/0x40
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Freed by task 15855:
      kasan_save_stack+0x1e/0x40
      kasan_set_track+0x21/0x30
      kasan_save_free_info+0x2a/0x40
      ____kasan_slab_free+0x155/0x1b0
      slab_free_freelist_hook+0x11b/0x220
      __kmem_cache_free+0xa4/0x360
      ops_init+0xb9/0x410
      setup_net+0x5aa/0xbd0
      copy_net_ns+0x2e6/0x6b0
      create_new_namespaces+0x382/0xa50
      unshare_nsproxy_namespaces+0xa6/0x1c0
      ksys_unshare+0x3a4/0x7e0
      __x64_sys_unshare+0x2d/0x40
      do_syscall_64+0x35/0x80
      entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: f875bae0 ("net: Automatically allocate per namespace data.")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d266935a
    • net: add a refcount tracker for kernel sockets · 0cafd77d
      Eric Dumazet authored
      Commit ffa84b5f ("net: add netns refcount tracker to struct sock")
      added a tracker to sockets, but did not track kernel sockets.
      
      We still have syzbot reports hinting about netns being destroyed
      while some kernel TCP sockets had not been dismantled.
      
      This patch tracks kernel sockets, and adds a ref_tracker_dir_print()
      call to net_free() right before the netns is freed.
      
      Normally, each layer is responsible for properly releasing its
      kernel sockets before last call to net_free().
      
      This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cafd77d
    • docs: netdev: offer performance feedback to contributors · c5884ef4
      Jakub Kicinski authored
      Some of us have gotten used to producing large quantities of peer feedback
      at work, every 3 or 6 months. Extending the same courtesy to community
      members seems like a logical step. It may be hard for some folks to
      get validation of how important their work is internally, especially
      at smaller companies which don't employ many kernel experts.
      
      The concept of "peer feedback" may be a hyperscaler / silicon valley
      thing so YMMV. Hopefully we can build more context as we go.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c5884ef4
    • Merge branch 'kcm-data-races' · 931ae86f
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      kcm: annotate data-races
      
      This series addresses two different syzbot reports for KCM.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      931ae86f
    • kcm: annotate data-races around kcm->rx_wait · 0c745b51
      Eric Dumazet authored
      kcm->rx_wait can be read locklessly in kcm_rfree().
      Annotate the read and writes accordingly.
      
      syzbot reported:
      
      BUG: KCSAN: data-race in kcm_rcv_strparser / kcm_rfree
      
      write to 0xffff88810784e3d0 of 1 bytes by task 1823 on cpu 1:
      reserve_rx_kcm net/kcm/kcmsock.c:283 [inline]
      kcm_rcv_strparser+0x250/0x3a0 net/kcm/kcmsock.c:363
      __strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
      strp_recv+0x6d/0x80 net/strparser/strparser.c:335
      tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
      strp_read_sock net/strparser/strparser.c:358 [inline]
      do_strp_work net/strparser/strparser.c:406 [inline]
      strp_work+0xe8/0x180 net/strparser/strparser.c:415
      process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
      worker_thread+0x618/0xa70 kernel/workqueue.c:2436
      kthread+0x1a9/0x1e0 kernel/kthread.c:376
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
      
      read to 0xffff88810784e3d0 of 1 bytes by task 17869 on cpu 0:
      kcm_rfree+0x121/0x220 net/kcm/kcmsock.c:181
      skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
      skb_release_all net/core/skbuff.c:852 [inline]
      __kfree_skb net/core/skbuff.c:868 [inline]
      kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
      kfree_skb include/linux/skbuff.h:1216 [inline]
      kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
      ____sys_recvmsg+0x16c/0x2e0
      ___sys_recvmsg net/socket.c:2743 [inline]
      do_recvmmsg+0x2f1/0x710 net/socket.c:2837
      __sys_recvmmsg net/socket.c:2916 [inline]
      __do_sys_recvmmsg net/socket.c:2939 [inline]
      __se_sys_recvmmsg net/socket.c:2932 [inline]
      __x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x01 -> 0x00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 17869 Comm: syz-executor.2 Not tainted 6.1.0-rc1-syzkaller-00010-gbb1a1146-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c745b51
    • kcm: annotate data-races around kcm->rx_psock · 15e4dabd
      Eric Dumazet authored
      kcm->rx_psock can be read locklessly in kcm_rfree().
      Annotate the read and writes accordingly.
      
      We do the same for kcm->rx_wait in the following patch.
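
What annotating a lockless read and its writes looks like can be sketched in user space. The macros below are simplified versions that rely on the GCC/Clang `__typeof__` extension, and `struct demo_kcm` with its helpers is a hypothetical model rather than the kcmsock.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user-space versions of the kernel's marking macros. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Toy model holding the two fields named in these commits. */
struct demo_kcm {
	void *rx_psock;
	bool rx_wait;
};

/* Writer side (normally under a lock): every store is marked. */
void demo_reserve(struct demo_kcm *kcm, void *psock)
{
	assert(kcm != NULL);
	WRITE_ONCE(kcm->rx_wait, false);
	WRITE_ONCE(kcm->rx_psock, psock);
}

void demo_unreserve(struct demo_kcm *kcm)
{
	assert(kcm != NULL);
	WRITE_ONCE(kcm->rx_psock, NULL);
}

/* Lockless reader: every load is marked, so each field is read exactly once. */
bool demo_reader_sees_idle(struct demo_kcm *kcm)
{
	return !READ_ONCE(kcm->rx_wait) && READ_ONCE(kcm->rx_psock) == NULL;
}

/* Single-threaded walk through reserve and unreserve. */
bool demo_roundtrip(void)
{
	struct demo_kcm kcm = { .rx_psock = NULL, .rx_wait = false };
	int psock_placeholder = 0;
	bool busy, idle;

	demo_reserve(&kcm, &psock_placeholder);
	busy = !demo_reader_sees_idle(&kcm);
	demo_unreserve(&kcm);
	idle = demo_reader_sees_idle(&kcm);
	return busy && idle;
}
```

The markings do not add locking; they tell the compiler and tools such as KCSAN that the concurrent access is intentional.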
      
      syzbot reported:
      BUG: KCSAN: data-race in kcm_rfree / unreserve_rx_kcm
      
      write to 0xffff888123d827b8 of 8 bytes by task 2758 on cpu 1:
      unreserve_rx_kcm+0x72/0x1f0 net/kcm/kcmsock.c:313
      kcm_rcv_strparser+0x2b5/0x3a0 net/kcm/kcmsock.c:373
      __strp_recv+0x64c/0xd20 net/strparser/strparser.c:301
      strp_recv+0x6d/0x80 net/strparser/strparser.c:335
      tcp_read_sock+0x13e/0x5a0 net/ipv4/tcp.c:1703
      strp_read_sock net/strparser/strparser.c:358 [inline]
      do_strp_work net/strparser/strparser.c:406 [inline]
      strp_work+0xe8/0x180 net/strparser/strparser.c:415
      process_one_work+0x3d3/0x720 kernel/workqueue.c:2289
      worker_thread+0x618/0xa70 kernel/workqueue.c:2436
      kthread+0x1a9/0x1e0 kernel/kthread.c:376
      ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
      
      read to 0xffff888123d827b8 of 8 bytes by task 5859 on cpu 0:
      kcm_rfree+0x14c/0x220 net/kcm/kcmsock.c:181
      skb_release_head_state+0x8e/0x160 net/core/skbuff.c:841
      skb_release_all net/core/skbuff.c:852 [inline]
      __kfree_skb net/core/skbuff.c:868 [inline]
      kfree_skb_reason+0x5c/0x260 net/core/skbuff.c:891
      kfree_skb include/linux/skbuff.h:1216 [inline]
      kcm_recvmsg+0x226/0x2b0 net/kcm/kcmsock.c:1161
      ____sys_recvmsg+0x16c/0x2e0
      ___sys_recvmsg net/socket.c:2743 [inline]
      do_recvmmsg+0x2f1/0x710 net/socket.c:2837
      __sys_recvmmsg net/socket.c:2916 [inline]
      __do_sys_recvmmsg net/socket.c:2939 [inline]
      __se_sys_recvmmsg net/socket.c:2932 [inline]
      __x64_sys_recvmmsg+0xde/0x160 net/socket.c:2932
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0xffff88812971ce00 -> 0x0000000000000000
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 5859 Comm: syz-executor.3 Not tainted 6.0.0-syzkaller-12189-g19d17ab7-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
      
      Fixes: ab7ac4eb ("kcm: Kernel Connection Multiplexor module")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15e4dabd
    • David S. Miller's avatar
      Merge branch 'udp-false-sharing' · b29e0dec
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      udp: avoid false sharing on receive
      
      Under high UDP load, the BH processing and the user-space receiver can
      run on different cores.
      
      The UDP implementation goes to great lengths to avoid false sharing in
      the receive path, but recent changes to the struct sock layout moved
      the sk_forward_alloc and the sk_rcvbuf fields onto the same cacheline:
      
              /* --- cacheline 4 boundary (256 bytes) --- */
                      struct sk_buff *   tail;
              } sk_backlog;
              int                        sk_forward_alloc;
              unsigned int               sk_reserved_mem;
              unsigned int               sk_ll_usec;
              unsigned int               sk_napi_id;
              int                        sk_rcvbuf;
      
      sk_forward_alloc is updated by the BH, while sk_rcvbuf is accessed by
      udp_recvmsg(), causing false sharing.
      
      A possible solution would be to re-order the struct sock fields to avoid
      the false sharing. Such a change could be invalidated by future layout
      changes and could have negative side effects on other workloads.
      
      Instead this series uses a different approach, touching only the UDP
      socket layout.
      
      The first patch generalizes the custom setsockopt infrastructure, to
      allow UDP to track the buffer size, and the second patch addresses the
      issue, copying the relevant buffer information into an already hot
      cacheline.
      
      Overall the above gives a 10% peak throughput increase under UDP flood.
      
      v1 -> v2:
       - introduce and use a common helper to initialize the UDP v4/v6 sockets
         (Kuniyuki)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b29e0dec
    • Paolo Abeni's avatar
      udp: track the forward memory release threshold in an hot cacheline · 8a3854c7
      Paolo Abeni authored
      When the receiver process and the BH run on different cores,
      udp_rmem_release() experiences a cache miss while accessing sk_rcvbuf,
      as the latter shares the same cacheline with sk_forward_alloc, written
      by the BH.
      
      With this patch, UDP tracks the rcvbuf value and its update via custom
      SOL_SOCKET socket options, and copies the forward memory threshold value
      used by udp_rmem_release() in a different cacheline, already accessed by
      the above function and uncontended.
      
      Since the UDP socket init operation has grown a bit, factor out the common
      code between v4 and v6 in a shared helper.
      
      Overall the above gives a 10% peak throughput increase under UDP flood.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a3854c7
    • Paolo Abeni's avatar
      net: introduce and use custom sockopt socket flag · a5ef058d
      Paolo Abeni authored
      We will soon introduce custom setsockopt for UDP sockets, too.
      Instead of doing even more complex arbitrary checks inside
      sock_use_custom_sol_socket(), add a new socket flag and set it
      for the relevant socket types (currently only MPTCP).
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5ef058d
    • Sean Anderson's avatar
      net: fman: Use physical address for userspace interfaces · c99f0f7e
      Sean Anderson authored
      Before 262f2b78 ("net: fman: Map the base address once"), the
      physical address of the MAC was exposed to userspace in two places: via
      sysfs and via SIOCGIFMAP. While this is not best practice, it is an
      external ABI which is in use by userspace software.
      
      The aforementioned commit inadvertently modified these addresses and
      made them virtual. This constitutes an ABI break. Additionally, it
      leaks the kernel's memory layout to userspace. Partially revert that
      commit, reintroducing the resource back into struct mac_device, while
      keeping the intended changes (the rework of the address mapping).
      
      Fixes: 262f2b78 ("net: fman: Map the base address once")
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Sean Anderson <sean.anderson@seco.com>
      Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c99f0f7e
    • David S. Miller's avatar
      Merge branch 'net-800Gbps-support' · ea5ed0f0
      David S. Miller authored
      Petr Machata says:
      
      ====================
      net: Add support for 800Gbps speed
      
      Amit Cohen <amcohen@nvidia.com> writes:
      
      The next Nvidia Spectrum ASIC will support 800Gbps speed.
      The IEEE 802 LAN/MAN Standards Committee already published standards for
      800Gbps, see the last update [1] and the list of approved changes [2].
      
      As a first phase, add support for 800Gbps over 8 lanes (100Gbps/lane).
      In the future, 800Gbps over 4 lanes can also be supported.
      
      Extend ethtool to support the relevant PMDs and extend mlxsw and bonding
      drivers to support 800Gbps.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ea5ed0f0
    • Amit Cohen's avatar
      bonding: 3ad: Add support for 800G speed · 41305d37
      Amit Cohen authored
      Add support for 800Gbps speed to allow using 3ad mode with 800G devices.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41305d37
    • Amit Cohen's avatar
      mlxsw: Add support for 800Gbps link modes · cceef209
      Amit Cohen authored
      Add support for 800Gbps speed, link modes of 100Gbps per lane.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cceef209
    • Amit Cohen's avatar
      ethtool: Add support for 800Gbps link modes · 404c7678
      Amit Cohen authored
      Add support for 800Gbps speed, link modes of 100Gbps per lane.
      As mentioned on slide 21 of the IEEE documentation [1], all adopted 802.3df
      copper and optical PMD baselines using 100G/lane will be supported.
      
      Add the relevant PMDs which are mentioned on slide 5 of the IEEE
      documentation [1] and were approved in 10-2022 [2]:
      BP - KR8
      Cu Cable - CR8
      MMF 50m - VR8
      MMF 100m - SR8
      SMF 500m - DR8
      SMF 2km - DR8-2
      
      [1]: https://www.ieee802.org/3/df/public/22_10/22_1004/shrikhande_3df_01a_221004.pdf
      [2]: https://ieee802.org/3/df/KeyMotions_3df_221005.pdf
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      404c7678
    • David S. Miller's avatar
      Merge branch 'sparx5-IS2-VCAP' · c1aa0a90
      David S. Miller authored
      Steen Hegelund says:
      
      ====================
      Add support for Sparx5 IS2 VCAP
      
      This provides initial support for the Sparx5 VCAP functionality via the
      'tc' traffic control userspace tool and its flower filter.
      
      Overview:
      =========
      
      The supported flower filter keys and actions are:
      
      - source and destination MAC address keys
      - trap action
      - pass action
      
      The supported Sparx5 VCAPs are: IS2 (see below for more info)
      
      The VCAP (Versatile Content-Aware Processor) feature is essentially a TCAM
      with rules consisting of:
      
      - Programmable key fields
      - Programmable action fields
      - A counter (which may be only one bit wide)
      
      Besides this each VCAP has:
      
      - A number of independent lookups
      - A keyset configuration typically per port per lookup
      
      VCAPs are used in many of the TSN features such as PSFP, PTP, FRER as well
      as the general shaping, policing and access control, so it is an important
      building block for these advanced features.
      
      Functionality:
      ==============
      
      When a frame is passed to a VCAP the VCAP will generate a set of keys
      (keyset) based on the traffic type.  If there is a rule created with this
      keyset in the VCAP and the values of the keys match the values in the
      keyset of the frame, the rule is said to match and the actions in the rule
      will be executed and the rule counter will be incremented.  No more rules
      will be examined in this VCAP lookup.
      
      If there is no match in the current lookup the frame will be matched
      against the next lookup (some VCAPs do the processing of the lookups in
      parallel).
      
      The Sparx5 SoC has 6 different VCAP types:
      
      - IS0: Ingress Stage 0 (AKA CLM) mostly handles classification
      - IS2: Ingress Stage 2 mostly handles access control
      - IP6PFX: IPv6 prefix: Provides tables for IPV6 address management
      - LPM: Longest Prefix Match for IP guarding and routing
      - ES0: Egress Stage 0 is mostly used for CPU copying and multicast handling
      - ES2: Egress Stage 2 is known as the rewriter and mostly updates tags
      
      Design:
      =======
      
      The VCAP implementation provides switchcore independent handling of rules
      and supports:
      
      - Creating and deleting rules
      - Updating and getting rules
      
      The platform specific API implementation as well as the platform specific
      model of the VCAP instances are attached to the VCAP API and a client can
      then access rules via the API in a platform independent way, with the
      limitations that each VCAP has in terms of its supported keys and actions.
      
      The VCAP model is generated from information delivered by the designers of
      the VCAP hardware.
      
      Here is an illustration of this:
      
        +------------------+     +------------------+
        | TC flower filter |     | PTP client       |
        | for Sparx5       |     | for Sparx5       |
        +-------------\----+     +---------/--------+
                       \                  /
                        \                /
                         \              /
                          \            /
                           \          /
                       +----v--------v----+
                       |     VCAP API     |
                       +---------|--------+
                                 |
                                 |
                                 |
                                 |
                       +---------v--------+
                       |   VCAP control   |
                       |   instance       |
                       +----/--------|----+
                           /         |
                          /          |
                         /           |
                        /            |
        +--------------v---+    +----v-------------+
        |   Sparx5 VCAP    |    | Sparx5 VCAP API  |
        |   model          |    | Implementation   |
        +------------------+    +---------|--------+
                                          |
                                          |
                                          |
                                          |
                                +---------v--------+
                                | Sparx5 VCAP HW   |
                                +------------------+
      
      Delivery:
      =========
      
      For now only the IS2 is supported but later the IS0, ES0 and ES2 will be
      added. There are currently no plans to support the IP6PFX and the LPM
      VCAPs.
      
      The IS2 VCAP has 4 lookups and they are accessible with a TC chain id:
      
      - chain 8000000: IS2 Lookup 0
      - chain 8100000: IS2 Lookup 1
      - chain 8200000: IS2 Lookup 2
      - chain 8300000: IS2 Lookup 3
      
      These lookups are executed in parallel by the IS2 VCAP but the actions are
      executed in series (the datasheet explains what happens if actions
      overlap).
      
      The functionality of TC flower as well as TC matchall filters will be
      expanded in later submissions as well as the number of VCAPs supported.
      
      This is the current plan:
      
      - add support for more TC flower filter keys and extend the Sparx5 port
        keyset configuration
      - support for TC protocol all
      - debugfs support for inspecting rules
      - TC flower filter statistics
      - Sparx5 IS0 VCAP support and more TC keys and actions to support this
      - add TC policer and drop action support (depends on the Sparx5 QoS support
        upstreamed separately)
      - Sparx5 ES0 VCAP support and more TC actions to support this
      - TC flower template support
      - TC matchall filter support for mirroring and policing ports
      - TC flower filter mirror action support
      - Sparx5 ES2 VCAP support
      
      The LAN966x switchcore, as well as future Microchip switches, will also
      be updated to use the VCAP API.
      The LAN966x has 3 VCAPs (IS1, IS2 and ES0) and a slightly different keyset
      and actionset portfolio than Sparx5.
      
      Version History:
      ================
      v3      Moved the sparx5_tc_flower_set_exterr function to the VCAP API and
              renamed it.
              Moved the sparx5_netbytes_copy function to the VCAP_API and renamed
              it (thanks Horatiu Vultur).
              Fixed indentation in the vcap_write_rule function.
              Added a comment mentioning the typegroup table terminator in the
              vcap_iter_skip_tg function.
      
      v2      Made the KUNIT test model a superset of the real model to fix a
              kernel robot build error.
      
      v1      Initial version
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1aa0a90