1. 26 Apr, 2023 25 commits
  2. 25 Apr, 2023 15 commits
    • net: phy: marvell-88x2222: remove unnecessary (void*) conversions · 28b17f62
      wuych authored
      Pointer variables of void * type do not require type cast.
      Signed-off-by: wuych <yunchuan@nfschina.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. · 50749f2d
      Kuniyuki Iwashima authored
      syzkaller reported [0] memory leaks of a UDP socket and ZEROCOPY
      skbs.  We can reproduce the problem with this sequence:
      
        sk = socket(AF_INET, SOCK_DGRAM, 0)
        sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
        sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
        sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
        sk.close()
      
      sendmsg() calls msg_zerocopy_alloc(), which allocates an skb, sets
      skb->cb->ubuf.refcnt to 1, and calls sock_hold().  Here, struct
      ubuf_info_msgzc indirectly holds a refcnt of the socket.  When the
      skb is sent, __skb_tstamp_tx() clones it and puts the clone into
      the socket's error queue with the TX timestamp.
      
      When the original skb is received locally, skb_copy_ubufs() calls
      skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
      This additional count is decremented while freeing the skb, but struct
      ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
      not called.
      
      The last refcnt is not released unless we retrieve the TX timestamped
      skb by recvmsg().  Since we clear the error queue in inet_sock_destruct()
      after the socket's refcnt reaches 0, there is a circular dependency.
      If we close() the socket holding such skbs, we never call sock_put()
      and leak the count, sk, and skb.
      
      TCP has the same problem, and commit e0c8bccd ("net: stream:
      purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
      by calling skb_queue_purge() during close().  However, there is a
      small chance that an skb queued in a qdisc or device could be put
      into the error queue after the skb_queue_purge() call.
      
      To remove the circular dependency, the skb cloned in
      __skb_tstamp_tx() should not hold a reference to the ubuf, but
      skb_clone() does not call skb_copy_ubufs() for a zerocopy skb.  So,
      we need to call skb_orphan_frags_rx() on the cloned skb so that
      skb_copy_ubufs() is called.
      
      [0]:
      BUG: memory leak
      unreferenced object 0xffff88800c6d2d00 (size 1152):
        comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00  ................
          02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
        backtrace:
          [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
          [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
          [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
          [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
          [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
          [<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
          [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
          [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
          [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
          [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
          [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
          [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
          [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
          [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      BUG: memory leak
      unreferenced object 0xffff888017633a00 (size 240):
        comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff  .........-m.....
        backtrace:
          [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
          [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
          [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
          [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
          [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
          [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
          [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
          [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
          [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
          [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
          [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
          [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
          [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
          [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
          [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
          [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
          [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
      Fixes: b5947e5d ("udp: msg_zerocopy")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
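The circular dependency described in the tcp/udp commit above (50749f2d) can be illustrated with a toy reference-count model. This is only a sketch, not kernel code: the classes `Sock`, `Ubuf`, and `Skb` and the function `send_and_close()` are hypothetical, and the model collapses the kernel's shared-data accounting into a single counter per object. It shows why a queued timestamp clone that still references the ubuf keeps the socket alive after close(), and why a clone without that reference does not.

```python
class Sock:
    """Toy socket: the error queue is purged only when refcnt reaches 0."""

    def __init__(self):
        self.refcnt = 1            # reference held by the open file descriptor
        self.error_queue = []      # skbs carrying TX timestamps
        self.destructed = False

    def hold(self):
        self.refcnt += 1

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            # Models the destructor purging the error queue.
            for skb in list(self.error_queue):
                skb.free()
            self.error_queue.clear()
            self.destructed = True


class Ubuf:
    """Toy zerocopy ubuf: pins the socket until its own count drops to 0."""

    def __init__(self, sk):
        self.refcnt = 1
        self.sk = sk
        sk.hold()

    def get(self):
        self.refcnt += 1

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.sk.put()          # models the completion callback


class Skb:
    def __init__(self, ubuf=None):
        self.ubuf = ubuf

    def free(self):
        if self.ubuf is not None:
            self.ubuf.put()
            self.ubuf = None


def send_and_close(orphan_clone):
    """Return True if the socket is destructed after close()."""
    sk = Sock()
    ubuf = Ubuf(sk)
    skb = Skb(ubuf)

    # TX timestamping queues a clone on the socket's error queue.
    if orphan_clone:
        clone = Skb(None)          # clone no longer references the ubuf
    else:
        ubuf.get()
        clone = Skb(ubuf)          # clone keeps the ubuf (and the socket) alive
    sk.error_queue.append(clone)

    skb.free()                     # original skb is transmitted and freed
    sk.put()                       # close() drops the descriptor's reference
    return sk.destructed


print("clone holds ubuf, destructed:", send_and_close(orphan_clone=False))
print("clone orphaned,   destructed:", send_and_close(orphan_clone=True))
```

In the first case the socket can only be freed by purging its error queue, and the error queue is only purged once the socket is freed, so neither ever happens.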
    • net: amd: Fix link leak when verifying config failed · d325c34d
      Gencen Gan authored
      After failing to verify the configuration, the function returns
      directly without releasing the link, which may cause a memory leak.
      
      Paolo Abeni thinks that the whole code of this driver is quite
      "suboptimal" and looks unmaintained for at least ~15 years, so he
      suggests that we could simply remove the whole driver. Please
      take this into consideration.
      
      Simon Horman suggests that the fix label should be set to
      "Linux-2.6.12-rc2" considering that the problem has existed
      since the driver was introduced and the commit above doesn't
      seem to exist in net/net-next.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Gan Gecen <gangecen@hust.edu.cn>
      Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell: Fix inconsistent indenting in led_blink_set · 4774ad84
      Christian Marangi authored
      Fix inconsistent indenting in m88e1318_led_blink_set reported by the
      kernel test robot, probably caused by an if condition that was dropped
      in a later revision of the same code.
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202304240007.0VEX8QYG-lkp@intel.com/
      Fixes: ea9e8648 ("net: phy: marvell: Implement led_blink_set()")
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20230423172800.3470-1-ansuelsmth@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • lan966x: Don't use xdp_frame when action is XDP_TX · 700f11eb
      Horatiu Vultur authored
      When the action of an xdp program was XDP_TX, lan966x was creating
      an xdp_frame and using it to send the frame back. But it is also
      possible to send the frame back without an xdp_frame, because
      it can be sent back using the page.
      Once the frame is transmitted, it is then possible to call
      page_pool_recycle_direct directly, as lan966x is using page pools.
      This saves some CPU usage on this path, which results in a higher
      number of transmitted frames. Below are the statistics:
      Frame size:    Improvement:
      64             ~8%
      256            ~11%
      512            ~8%
      1000           ~0%
      1500           ~0%
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Link: https://lore.kernel.org/r/20230422142344.3630602-1-horatiu.vultur@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · ee3392ed
      Jakub Kicinski authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2023-04-24
      
      We've added 5 non-merge commits during the last 3 day(s) which contain
      a total of 7 files changed, 87 insertions(+), 44 deletions(-).
      
      The main changes are:
      
      1) Workaround for bpf iter selftest due to lack of subprog support
         in precision tracking, from Andrii.
      
      2) Disable bpf_refcount_acquire kfunc until races are fixed, from Dave.
      
      3) One more test_verifier test converted from asm macro to asm in C,
         from Eduard.
      
      4) Fix build with NETFILTER=y INET=n config, from Florian.
      
      5) Add __rcu_read_{lock,unlock} into deny list, from Yafang.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
        selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests
        bpf: Add __rcu_read_{lock,unlock} into btf id deny list
        bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed
        selftests/bpf: verifier/prevent_map_lookup converted to inline assembly
        bpf: fix link failure with NETFILTER=y INET=n
      ====================
      
      Link: https://lore.kernel.org/r/20230425005648.86714-1-alexei.starovoitov@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'tsnep-xdp-socket-zero-copy-support' · 9610a8dc
      Jakub Kicinski authored
      Gerhard Engleder says:
      
      ====================
      tsnep: XDP socket zero-copy support
      
      Implement XDP socket zero-copy support for the tsnep driver. I tried to
      follow existing drivers like igc as far as possible. But one main
      difference is that tsnep does not need any reconfiguration for XDP BPF
      program setup. So I decided to keep this behavior whether or not an XSK
      pool is used. As a result, tsnep starts using the XSK pool even
      if no XDP BPF program is available.
      
      Another difference is that I tried to prevent potentially failing
      allocations during XSK pool setup. E.g. both memory models for page pool
      and XSK pool are registered all the time. Thus, XSK pool setup cannot
      leave queues in a non-working state.
      
      Some prework is done to reduce the last two XSK commits to actual XSK
      changes.
      ====================
      
      Link: https://lore.kernel.org/r/20230421194656.48063-1-gerhard@engleder-embedded.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tsnep: Add XDP socket zero-copy TX support · cd275c23
      Gerhard Engleder authored
      Send and complete XSK pool frames within TX NAPI context. NAPI context
      is triggered by ndo_xsk_wakeup.
      
      Test results with A53 1.2GHz:
      
      xdpsock txonly copy mode, 64 byte frames:
                         pps            pkts           1.00
      tx                 284,409        11,398,144
      Two CPUs with 100% and 10% utilization.
      
      xdpsock txonly zero-copy mode, 64 byte frames:
                         pps            pkts           1.00
      tx                 511,929        5,890,368
      Two CPUs with 100% and 1% utilization.
      
      xdpsock l2fwd copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 248,985        7,315,885
      tx                 248,921        7,315,885
      Two CPUs with 100% and 10% utilization.
      
      xdpsock l2fwd zero-copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 254,735        3,039,456
      tx                 254,735        3,039,456
      Two CPUs with 100% and 4% utilization.
      
      Packet rate increases and CPU utilization is reduced in both cases.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
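The relative gains behind the tsnep TX test results above (cd275c23) can be worked out from the reported pps figures. The sketch below uses only the numbers quoted in the commit message; the helper name `gain_percent()` is hypothetical, and the l2fwd comparison uses the tx rows.

```python
def gain_percent(before_pps, after_pps):
    """Relative change in packet rate, in percent."""
    return (after_pps - before_pps) / before_pps * 100.0


# pps values quoted in the commit message (copy mode vs. zero-copy mode)
txonly = gain_percent(284_409, 511_929)
l2fwd_tx = gain_percent(248_921, 254_735)

print(f"txonly: {txonly:+.1f}%")    # prints "txonly: +80.0%"
print(f"l2fwd:  {l2fwd_tx:+.1f}%")  # prints "l2fwd:  +2.3%"
```

The txonly case benefits far more than l2fwd, whose packet rate rises only about 2%; the reported CPU utilization of the second core drops in both cases.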
    • tsnep: Add XDP socket zero-copy RX support · 3fc23339
      Gerhard Engleder authored
      Add support for XSK zero-copy to RX path. The setup of the XSK pool can
      be done at runtime. If the netdev is running, then the queue must be
      disabled and enabled during reconfiguration. This can be done easily
      with functions introduced in previous commits.
      
      A more important property is that, if the netdev is running, then the
      setup of the XSK pool shall not stop the netdev in case of errors. A
      broken netdev after a failed XSK pool setup is bad behavior. Therefore,
      the allocation and setup of resources during XSK pool setup is done only
      before any queue is disabled. Additionally, freeing and later allocation
      of resources is eliminated in some cases. Page pool entries are kept for
      later use. Two memory models are registered in parallel. As a result,
      the XSK pool setup cannot fail during queue reconfiguration.
      
      In contrast to other drivers, XSK pool setup and XDP BPF program setup
      are separate actions. XSK pool setup can be done without any XDP BPF
      program. The XDP BPF program can be added, removed or changed without
      any reconfiguration of the XSK pool.
      
      Test results with A53 1.2GHz:
      
      xdpsock rxdrop copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 856,054        10,625,775
      Two CPUs, both with 100% utilization.
      
      xdpsock rxdrop zero-copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 889,388        4,615,284
      Two CPUs with 100% and 20% utilization.
      
      Packet rate increases and CPU utilization is reduced.
      
      100% CPU load seems to be the base load. This load is consumed by ksoftirqd
      just for dropping the generated packets without xdpsock running.
      
      Using batch API reduced CPU utilization slightly, but measurements are
      not stable enough to provide meaningful numbers.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
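The tsnep RX commit above (3fc23339) describes doing every allocation that can fail before any queue is disabled, so that a failed XSK pool setup never leaves the netdev broken. A minimal sketch of that ordering follows. All names here (`Queue`, `setup_xsk_pool()`, the `allocate` callback) are hypothetical and stand in for driver internals that the commit message does not show.

```python
class Queue:
    """Toy RX queue with an attached buffer pool."""

    def __init__(self):
        self.enabled = True
        self.pool = "page_pool"


def setup_xsk_pool(queue, allocate):
    """Switch the queue to a new pool without risking a dead queue.

    Step 1 may fail, but runs while the queue is still untouched.
    Step 2 only swaps already-prepared state, so it cannot fail.
    """
    # Step 1: allocate and prepare everything up front (may raise).
    new_pool = allocate()

    # Step 2: reconfigure the queue with the prepared resources.
    queue.enabled = False
    queue.pool = new_pool
    queue.enabled = True


def failing_allocation():
    raise MemoryError("simulated allocation failure")


q = Queue()
try:
    setup_xsk_pool(q, failing_allocation)
except MemoryError:
    pass
# The queue was never disabled, so it keeps running with its old pool.
print(q.enabled, q.pool)

setup_xsk_pool(q, lambda: "xsk_pool")
print(q.enabled, q.pool)
```

The design choice is the ordering: because nothing after the queue is disabled can fail, there is no error path that has to restore a half-reconfigured queue.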
    • tsnep: Move skb receive action to separate function · c2d64697
      Gerhard Engleder authored
      The function tsnep_rx_poll() is already pretty long and the skb receive
      action can be reused for XSK zero-copy support. Move the page-based skb
      receive to a separate function.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tsnep: Add functions for queue enable/disable · 2ea0a282
      Gerhard Engleder authored
      Move queue enable and disable code to separate functions. This way the
      activation and deactivation of the queues are defined actions, which can
      be used in future execution paths.
      
      These functions will be used for the queue reconfiguration at runtime,
      which is necessary for XSK zero-copy support.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tsnep: Rework TX/RX queue initialization · 33b0ee02
      Gerhard Engleder authored
      Make initialization of TX and RX queues less dynamic by moving some
      initialization from netdev open/close to device probing.
      
      Additionally, move some initialization code to separate functions to
      enable future use in other execution paths.
      
      This is done as preparation for queue reconfiguration at runtime, which
      is necessary for XSK zero-copy support.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tsnep: Replace modulo operation with mask · 42fb2962
      Gerhard Engleder authored
      The TX/RX ring size is static and a power of 2, which enables the
      compiler to optimize the modulo operation into a mask operation. Make
      this optimization explicit in the code and don't rely on the compiler.
      
      CPU utilisation during high packet rate has not changed. So no
      performance improvement has been measured. But it is best practice to
      avoid modulo operations.
      Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
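The equivalence that the tsnep modulo commit above (42fb2962) relies on can be checked directly: for a ring size that is a power of two, `index % size` equals `index & (size - 1)`. The sketch below is illustrative only; `RING_SIZE` is a hypothetical value, not the driver's actual ring size.

```python
RING_SIZE = 256                 # hypothetical; must be a power of two
RING_MASK = RING_SIZE - 1


def next_index_mod(index):
    """Advance a ring index using the modulo operation."""
    return (index + 1) % RING_SIZE


def next_index_mask(index):
    """Advance a ring index using a bit mask instead."""
    return (index + 1) & RING_MASK


# A power of two has exactly one bit set, so size & (size - 1) is zero.
assert RING_SIZE & RING_MASK == 0

# Both variants wrap around identically for every index in the ring.
for i in range(RING_SIZE):
    assert next_index_mod(i) == next_index_mask(i)

# The trick does not hold for other sizes: with size 100 the mask gives 96.
print((99 + 1) % 100, (99 + 1) & 99)   # prints "0 96"
```

This is why the optimization is only valid while the ring size stays a power of two.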
    • net: phy: dp83867: Add led_brightness_set support · 938f65ad
      Alexander Stein authored
      Up to 4 LEDs can be attached to the PHY; add support for setting
      their brightness manually.
      Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20230424134625.303957-1-alexander.stein@ew.tq-group.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: phy: Fix reading LED reg property · aed8fdad
      Alexander Stein authored
      'reg' is always encoded in 32 bits, thus it has to be read using the
      function with the corresponding bit width.
      
      Fixes: 01e5b728 ("net: phy: Add a binding for PHY LEDs")
      Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20230424141648.317944-1-alexander.stein@ew.tq-group.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
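The width mismatch fixed by the LED 'reg' commit above (aed8fdad) can be illustrated with a small model. It assumes the usual devicetree encoding, where a 'reg' cell is stored as a 32-bit big-endian value. The helpers `encode_reg()`, `read_u8()`, and `read_u32()` are hypothetical stand-ins, not kernel functions; the commit message does not name the functions it changed.

```python
import struct


def encode_reg(index):
    """Encode a 'reg' value as one 32-bit big-endian cell."""
    return struct.pack(">I", index)


def read_u8(prop):
    """Read the property as if it were a single byte (the wrong width)."""
    return prop[0]


def read_u32(prop):
    """Read the property as a 32-bit big-endian cell (the matching width)."""
    return struct.unpack(">I", prop[:4])[0]


prop = encode_reg(2)            # e.g. an LED with reg = <2>
print(prop.hex())               # prints "00000002"
print(read_u8(prop))            # prints "0": only the most significant byte
print(read_u32(prop))           # prints "2"
```

With a byte-wide read, every small index collapses to 0, because the significant byte is the last of the four.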