1. 27 Apr, 2023 5 commits
    • NFSD: Clean up xattr memory allocation flags · 22b620ec
      Chuck Lever authored
      Tetsuo Handa points out:
      > Since GFP_KERNEL is "GFP_NOFS | __GFP_FS", usage like
      > "GFP_KERNEL | GFP_NOFS" does not make sense.
      
      The original intent was to hold the inode lock while estimating
      the buffer requirements for the requested information. Frank van
      der Linden, the author of NFSD's xattr code, says:
      
      > ... you need inode_lock to get an atomic view of an xattr. Since
      > both nfsd_getxattr and nfsd_listxattr do the standard trick of
      > querying the xattr length with a NULL buf argument (just getting
      > the length back), allocating the right buffer size, and then
      > querying again, they need to hold the inode lock to avoid having
      > the xattr changed from under them while doing that.
      >
      > From that then flows the requirement that GFP_FS could cause
      > problems while holding i_rwsem, so I added GFP_NOFS.
      
      However, Dave Chinner states:
      > You can do GFP_KERNEL allocations holding the i_rwsem just fine.
      > All that it requires is the caller holds a reference to the
      > inode ...
      
      Since these code paths acquire a dentry, they do indeed hold a
      reference. It is therefore safe to use GFP_KERNEL for these memory
      allocations. In particular, that's what this code is already doing;
      but now the C source code looks sane too.
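
The flag arithmetic behind Tetsuo Handa's observation can be checked with a small sketch. The bit values below are illustrative placeholders, not the kernel's real constants; only the composition GFP_KERNEL = GFP_NOFS | __GFP_FS is taken from the quote above.

```python
# Placeholder bit values chosen for illustration; the real kernel
# constants differ and change between kernel versions.
GFP_FS_BIT = 0x80                   # stands in for __GFP_FS
GFP_NOFS = 0x400 | 0x40             # stands in for "reclaim and I/O, no FS recursion"
GFP_KERNEL = GFP_NOFS | GFP_FS_BIT  # composition quoted in the commit message


def allows_fs_recursion(flags):
    """Return True when the mask lets the allocator re-enter the filesystem."""
    return bool(flags & GFP_FS_BIT)


# OR-ing GFP_NOFS into GFP_KERNEL cannot clear the FS bit.
combined = GFP_KERNEL | GFP_NOFS
print(combined == GFP_KERNEL, allows_fs_recursion(combined))
```

Because OR can only set bits, the combined mask behaves exactly like GFP_KERNEL, which is why the cleanup does not change behavior.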
      
      At a later time we can revisit in order to remove the inode lock in
      favor of simply retrying if the estimated buffer size is too small.
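
The retry approach mentioned above could look roughly like the following sketch. It is not NFSD code: `query` is a hypothetical stand-in for a getxattr-style call that returns the current length when asked with size 0 and fails with ERANGE when the supplied size is too small, and `FakeAttr` exists only to simulate a value that grows between the two calls.

```python
import errno


def read_attr(query):
    """Probe the length, then fetch; retry if the value grew in between."""
    while True:
        needed = query(0)          # size 0: ask only for the current length
        if needed == 0:
            return b""
        try:
            return query(needed)   # fetch with the estimated size
        except OSError as exc:
            if exc.errno != errno.ERANGE:
                raise              # unrelated failure: do not retry


class FakeAttr:
    """Hypothetical attribute whose value changes from call to call."""

    def __init__(self, values):
        self._values = list(values)
        self.calls = 0

    def __call__(self, size):
        value = self._values[min(self.calls, len(self._values) - 1)]
        self.calls += 1
        if size == 0:
            return len(value)
        if size < len(value):
            raise OSError(errno.ERANGE, "buffer too small")
        return value


attr = FakeAttr([b"ab", b"abcdef"])
print(read_attr(attr), attr.calls)
```

The loop trades the lock for an occasional extra round trip: a concurrent change is detected through ERANGE instead of being prevented.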
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop · 147abcac
      Dai Ngo authored
      The following request sequence to the same file causes the NFS client and
      server to get into an infinite loop with COMMIT and NFS4ERR_DELAY:
      
      OPEN
      REMOVE
      WRITE
      COMMIT
      
      The problem is reported by the recall11, recall12, recall14, recall20,
      recall22, recall40, recall42, recall48, and recall50 tests of the nfstest
      suite.
      
      This patch restores the handling of the race between
      nfsd_file_do_acquire and unlink to what it was prior to the regression.
      
      Fixes: ac3a2585 ("nfsd: rework refcounting in filecache")
      Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • SUNRPC: Clear rq_xid when receiving a new RPC Call · 695bc1f3
      Chuck Lever authored
      This is an eye-catcher for tracepoints that record the XID: it means
      svc_rqst() has not received a full RPC Call with an XID yet.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • SUNRPC: Recognize control messages in server-side TCP socket code · 5e052dda
      Chuck Lever authored
      To support kTLS, the server-side TCP socket receive path needs to
      watch for CMSGs.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • SUNRPC: Be even lazier about releasing pages · 6a0cdf56
      Chuck Lever authored
      A single RPC transaction that touches only a couple of pages means
      rq_pvec will not be even close to full in svc_xprt_release(). This is
      a common case.
      
      Instead, just leave the pages in rq_pvec until it is completely
      full. This improves the efficiency of the batch release mechanism
      on workloads that involve small RPC messages.
      
      The rq_pvec is also fully emptied just before thread exit.
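
The batching idea described above can be sketched as follows. This is a simplified model, not the SUNRPC implementation: `LazyBatch`, its `capacity`, and the `release` callback are hypothetical names used only for illustration.

```python
class LazyBatch:
    """Collect items and release them only once the batch is full."""

    def __init__(self, capacity, release):
        self.capacity = capacity
        self.release = release      # called with a list of collected items
        self.items = []

    def add(self, item):
        # Release only when there is no room left, not after every request.
        if len(self.items) == self.capacity:
            self.flush()
        self.items.append(item)

    def flush(self):
        # Also called once at "thread exit" to empty a partial batch.
        if self.items:
            self.release(self.items)
            self.items = []


released = []
batch = LazyBatch(capacity=4, release=released.append)
for rpc in range(3):                # three small requests, two pages each
    batch.add(f"rpc{rpc}-page0")
    batch.add(f"rpc{rpc}-page1")
batch.flush()                       # final drain before exit
print([len(group) for group in released])
```

With small requests, releasing after each one would produce many tiny batches; deferring until the container is full keeps each release call close to the maximum batch size.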
      Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
  2. 26 Apr, 2023 26 commits
  3. 25 Apr, 2023 9 commits
    • net: phy: marvell-88x2222: remove unnecessary (void*) conversions · 28b17f62
      wuych authored
      Pointer variables of void * type do not require a type cast.
      Signed-off-by: wuych <yunchuan@nfschina.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp. · 50749f2d
      Kuniyuki Iwashima authored
      syzkaller reported [0] memory leaks of a UDP socket and ZEROCOPY
      skbs.  We can reproduce the problem with these sequences:
      
        sk = socket(AF_INET, SOCK_DGRAM, 0)
        sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
        sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
        sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
        sk.close()
      
      sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
      skb->cb->ubuf.refcnt to 1, and calls sock_hold().  Here, struct
      ubuf_info_msgzc indirectly holds a refcnt of the socket.  When the
      skb is sent, __skb_tstamp_tx() clones it and puts the clone into
      the socket's error queue with the TX timestamp.
      
      When the original skb is received locally, skb_copy_ubufs() calls
      skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
      This additional count is decremented while freeing the skb, but struct
      ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
      not called.
      
      The last refcnt is not released unless we retrieve the TX timestamped
      skb by recvmsg().  Since we clear the error queue in inet_sock_destruct()
      after the socket's refcnt reaches 0, there is a circular dependency.
      If we close() the socket holding such skbs, we never call sock_put()
      and leak the count, sk, and skb.
      
      TCP has the same problem, and commit e0c8bccd ("net: stream:
      purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
      by calling skb_queue_purge() during close().  However, there is a
      small chance that skb queued in a qdisc or device could be put
      into the error queue after the skb_queue_purge() call.
      
      In __skb_tstamp_tx(), the cloned skb should not have a reference
      to the ubuf to remove the circular dependency, but skb_clone() does
      not call skb_copy_ubufs() for zerocopy skb.  So, we need to call
      skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs().
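
The circular dependency described above can be illustrated with a toy reference-counting model. It is a deliberate simplification and does not mirror the kernel's exact accounting: the classes `Sock`, `Ubuf`, and `Skb` and the `clone_keeps_ubuf` switch are hypothetical, and the only behavior modeled is that the error queue is purged when the socket's count reaches zero.

```python
class Sock:
    def __init__(self):
        self.refcnt = 1             # reference owned by the open descriptor
        self.error_queue = []
        self.destroyed = False

    def hold(self):
        self.refcnt += 1

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            # The error queue is only purged here, at destruction time.
            for skb in self.error_queue:
                skb.free()
            self.error_queue.clear()
            self.destroyed = True


class Ubuf:
    def __init__(self, sk):
        self.sk = sk
        self.refcnt = 1
        sk.hold()                   # the ubuf pins the socket

    def get(self):
        self.refcnt += 1

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.sk.put()           # completion callback drops the socket ref


class Skb:
    def __init__(self, ubuf=None):
        self.ubuf = ubuf

    def free(self):
        if self.ubuf is not None:
            self.ubuf.put()
            self.ubuf = None


def send_and_close(clone_keeps_ubuf):
    sk = Sock()
    ubuf = Ubuf(sk)
    skb = Skb(ubuf)
    if clone_keeps_ubuf:
        ubuf.get()
        clone = Skb(ubuf)           # timestamp clone still references the ubuf
    else:
        clone = Skb(None)           # clone detached from the ubuf (the fix)
    sk.error_queue.append(clone)
    skb.free()                      # original skb is consumed
    sk.put()                        # close() drops the descriptor reference
    return sk.destroyed


print(send_and_close(True), send_and_close(False))
```

In the first case the queued clone keeps the ubuf alive, the ubuf keeps the socket alive, and the socket is the only thing that would ever free the clone, so nothing is released. Detaching the clone from the ubuf breaks the cycle.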
      
      [0]:
      BUG: memory leak
      unreferenced object 0xffff88800c6d2d00 (size 1152):
        comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00  ................
          02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
        backtrace:
          [<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
          [<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
          [<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
          [<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
          [<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
          [<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
          [<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
          [<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
          [<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
          [<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
          [<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
          [<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
          [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
          [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      BUG: memory leak
      unreferenced object 0xffff888017633a00 (size 240):
        comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff  .........-m.....
        backtrace:
          [<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
          [<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
          [<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
          [<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
          [<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
          [<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
          [<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
          [<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
          [<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
          [<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
          [<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
          [<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
          [<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
          [<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
          [<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
          [<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          [<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
          [<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY")
      Fixes: b5947e5d ("udp: msg_zerocopy")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: amd: Fix link leak when verifying config failed · d325c34d
      Gencen Gan authored
      After failing to verify the configuration, the code returns directly
      without releasing the link, which may cause a memory leak.
      
      Paolo Abeni thinks that the whole code of this driver is quite
      "suboptimal" and looks unmaintained since at least ~15y, so he
      suggests that we could simply remove the whole driver. Please
      take it into consideration.
      
      Simon Horman suggests that the fix label should be set to
      "Linux-2.6.12-rc2" considering that the problem has existed
      since the driver was introduced and the commit above doesn't
      seem to exist in net/net-next.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Gan Gecen <gangecen@hust.edu.cn>
      Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell: Fix inconsistent indenting in led_blink_set · 4774ad84
      Christian Marangi authored
      Fix inconsistent indenting in m88e1318_led_blink_set reported by the
      kernel test robot, probably caused by an if condition that was dropped
      in a later revision of the same code.
      Reported-by: kernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202304240007.0VEX8QYG-lkp@intel.com/
      Fixes: ea9e8648 ("net: phy: marvell: Implement led_blink_set()")
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20230423172800.3470-1-ansuelsmth@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • lan966x: Don't use xdp_frame when action is XDP_TX · 700f11eb
      Horatiu Vultur authored
      When the action of an XDP program was XDP_TX, lan966x created an
      xdp_frame and used it to send the frame back. But it is also
      possible to send the frame back without an xdp_frame, because
      it can be sent back using the page.
      Once the frame is transmitted, it is possible to use
      page_pool_recycle_direct directly, as lan966x is using page pools.
      This saves some CPU usage on this path, which results in a higher
      number of transmitted frames. Below are the statistics:
      Frame size:    Improvement:
      64                ~8%
      256              ~11%
      512               ~8%
      1000              ~0%
      1500              ~0%
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
      Link: https://lore.kernel.org/r/20230422142344.3630602-1-horatiu.vultur@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · ee3392ed
      Jakub Kicinski authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2023-04-24
      
      We've added 5 non-merge commits during the last 3 day(s) which contain
      a total of 7 files changed, 87 insertions(+), 44 deletions(-).
      
      The main changes are:
      
      1) Workaround for bpf iter selftest due to lack of subprog support
         in precision tracking, from Andrii.
      
      2) Disable bpf_refcount_acquire kfunc until races are fixed, from Dave.
      
      3) One more test_verifier test converted from asm macro to asm in C,
         from Eduard.
      
      4) Fix build with NETFILTER=y INET=n config, from Florian.
      
      5) Add __rcu_read_{lock,unlock} into deny list, from Yafang.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
        selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests
        bpf: Add __rcu_read_{lock,unlock} into btf id deny list
        bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed
        selftests/bpf: verifier/prevent_map_lookup converted to inline assembly
        bpf: fix link failure with NETFILTER=y INET=n
      ====================
      
      Link: https://lore.kernel.org/r/20230425005648.86714-1-alexei.starovoitov@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'tsnep-xdp-socket-zero-copy-support' · 9610a8dc
      Jakub Kicinski authored
      Gerhard Engleder says:
      
      ====================
      tsnep: XDP socket zero-copy support
      
      Implement XDP socket zero-copy support for tsnep driver. I tried to
      follow existing drivers like igc as far as possible. But one main
      difference is that tsnep does not need any reconfiguration for XDP BPF
      program setup. So I decided to keep this behavior no matter if an XSK
      pool is used or not. As a result, tsnep starts using the XSK pool even
      if no XDP BPF program is available.
      
      Another difference is that I tried to prevent potentially failing
      allocations during XSK pool setup. E.g. both memory models for page pool
      and XSK pool are registered all the time. Thus, XSK pool setup cannot
      end up with non-working queues.
      
      Some prework is done to reduce the last two XSK commits to actual XSK
      changes.
      ====================
      
      Link: https://lore.kernel.org/r/20230421194656.48063-1-gerhard@engleder-embedded.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tsnep: Add XDP socket zero-copy TX support · cd275c23
      Gerhard Engleder authored
      Send and complete XSK pool frames within TX NAPI context. NAPI context
      is triggered by ndo_xsk_wakeup.
      
      Test results with A53 1.2GHz:
      
      xdpsock txonly copy mode, 64 byte frames:
                         pps            pkts           1.00
      tx                 284,409        11,398,144
      Two CPUs with 100% and 10% utilization.
      
      xdpsock txonly zero-copy mode, 64 byte frames:
                         pps            pkts           1.00
      tx                 511,929        5,890,368
      Two CPUs with 100% and 1% utilization.
      
      xdpsock l2fwd copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 248,985        7,315,885
      tx                 248,921        7,315,885
      Two CPUs with 100% and 10% utilization.
      
      xdpsock l2fwd zero-copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 254,735        3,039,456
      tx                 254,735        3,039,456
      Two CPUs with 100% and 4% utilization.
      
      Packet rate increases and CPU utilization is reduced in both cases.
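
The relative packet-rate gains implied by the figures above can be computed with a short sketch. The pps values are copied from the test results; the helper name `percent_gain` is introduced here only for illustration.

```python
def percent_gain(baseline_pps, new_pps):
    """Relative change of new_pps over baseline_pps, in percent."""
    return (new_pps - baseline_pps) / baseline_pps * 100.0


# (copy mode pps, zero-copy mode pps) for the TX side of each test
results = {
    "txonly": (284_409, 511_929),
    "l2fwd tx": (248_921, 254_735),
}

for name, (copy_pps, zc_pps) in results.items():
    print(f"{name}: {percent_gain(copy_pps, zc_pps):+.1f}%")
```

By this calculation txonly gains about 80% while l2fwd gains about 2.3%, so in the l2fwd case most of the benefit shows up as reduced CPU utilization rather than packet rate.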
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tsnep: Add XDP socket zero-copy RX support · 3fc23339
      Gerhard Engleder authored
      Add support for XSK zero-copy to RX path. The setup of the XSK pool can
      be done at runtime. If the netdev is running, then the queue must be
      disabled and enabled during reconfiguration. This can be done easily
      with functions introduced in previous commits.
      
      A more important property is that, if the netdev is running, then the
      setup of the XSK pool shall not stop the netdev in case of errors. A
      broken netdev after a failed XSK pool setup is bad behavior. Therefore,
      the allocation and setup of resources during XSK pool setup is done only
      before any queue is disabled. Additionally, freeing and later allocation
      of resources is eliminated in some cases. Page pool entries are kept for
      later use. Two memory models are registered in parallel. As a result,
      the XSK pool setup cannot fail during queue reconfiguration.
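
The ordering argument above (do everything that can fail before touching the running queue) can be sketched in a few lines. This is a generic illustration, not tsnep driver code: `Queue`, `reconfigure`, and the `allocate` callback are hypothetical.

```python
class Queue:
    """Hypothetical running queue with some attached resources."""

    def __init__(self):
        self.enabled = True
        self.resources = "page-pool"


def reconfigure(queue, allocate):
    # Step 1: everything that may fail happens while the queue still runs.
    new_resources = allocate()
    # Step 2: the disable/swap/enable sequence contains no failing step.
    queue.enabled = False
    queue.resources = new_resources
    queue.enabled = True


def failing_allocation():
    raise MemoryError("simulated allocation failure")


queue = Queue()
try:
    reconfigure(queue, failing_allocation)
except MemoryError:
    pass
print(queue.enabled, queue.resources)    # queue is untouched after the failure

reconfigure(queue, lambda: "xsk-pool")
print(queue.enabled, queue.resources)
```

A failed setup therefore leaves the device in its previous working state instead of a half-configured one.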
      
      In contrast to other drivers, XSK pool setup and XDP BPF program setup
      are separate actions. XSK pool setup can be done without any XDP BPF
      program. The XDP BPF program can be added, removed or changed without
      any reconfiguration of the XSK pool.
      
      Test results with A53 1.2GHz:
      
      xdpsock rxdrop copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 856,054        10,625,775
      Two CPUs, both with 100% utilization.
      
      xdpsock rxdrop zero-copy mode, 64 byte frames:
                         pps            pkts           1.00
      rx                 889,388        4,615,284
      Two CPUs with 100% and 20% utilization.
      
      Packet rate increases and CPU utilization is reduced.
      
      100% CPU load seems to be the base load. This load is consumed by ksoftirqd
      just for dropping the generated packets without xdpsock running.
      
      Using batch API reduced CPU utilization slightly, but measurements are
      not stable enough to provide meaningful numbers.
      Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>