1. 21 Jun, 2023 12 commits
  2. 20 Jun, 2023 13 commits
  3. 19 Jun, 2023 6 commits
  4. 18 Jun, 2023 7 commits
    • Merge tag 'mlx5-updates-2023-06-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 9a94d764
      David S. Miller authored
      mlx5-updates-2023-06-16
      
      1) Added a new event handler to firmware sync reset, which is used to
         support firmware sync reset flow on smart NIC. Adding this new stage to
         the flow enables the firmware to ensure host PFs unload before ECPFs
         unload, to avoid race of PFs recovery.
      
      2) Debugfs for mlx5 eswitch bridge offloads
      
      3) Added two new counters for vport stats
      
      4) Minor Fixups and cleanups for net-next branch
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gro: move the tc_ext comparison to a helper · 2dc6af8b
      Jakub Kicinski authored
      The double ifdefs (one for the variable declaration and
      one around the code) are quite aesthetically displeasing.
      Factor this code out into a helper for easier wrapping.
      
      This will become even more ugly when another skb ext
      comparison is added in the future.
      
      The resulting machine code looks the same; the compiler
      seems to try to use %rax more and some blocks move around,
      but I haven't spotted more than minor differences.
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
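The refactoring described in this commit message can be illustrated with a small userspace sketch. This is not the kernel's code: `HAVE_TC_EXT`, `struct pkt`, `struct tc_ext`, and `cmp_tc_ext()` are hypothetical stand-ins chosen for illustration. The point is that the conditional compilation lives in one helper, so the caller needs no `#if` around either a variable declaration or the comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical config switch standing in for the kernel's config checks. */
#define HAVE_TC_EXT 1

/* Hypothetical, simplified stand-ins for the extension and the packet. */
struct tc_ext { unsigned int chain; };
struct pkt { struct tc_ext *ext; };

/*
 * Fold the extension comparison into 'diffs'. A non-zero result means the
 * two packets differ. All conditional compilation is confined to this helper.
 */
static unsigned long cmp_tc_ext(const struct pkt *a, const struct pkt *b,
                                unsigned long diffs)
{
    assert(a != NULL && b != NULL);
#if HAVE_TC_EXT
    const struct tc_ext *ea = a->ext;
    const struct tc_ext *eb = b->ext;

    if (diffs)
        return diffs;                      /* already known to differ */

    diffs |= (unsigned long)((!!ea) ^ (!!eb)); /* only one side has an ext */
    if (!diffs && ea)
        diffs |= ea->chain ^ eb->chain;    /* both present: compare chains */
#endif
    return diffs;
}
```

A caller would then write something like `diffs = cmp_tc_ext(a, b, diffs);` with no preprocessor guards of its own, and a second extension comparison could be added as another helper of the same shape.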
    • net: phy: at803x: Use devm_regulator_get_enable_optional() · 988e8d90
      Christophe JAILLET authored
      Use devm_regulator_get_enable_optional() instead of hand writing it. It
      saves some lines of code.
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: phy: gpy2xx: more precise description · 264879fd
      Michael Walle authored
      Mention that the interrupt line is just asserted for a random period of
      time, not the entire time.
      Suggested-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Michael Walle <mwalle@kernel.org>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: also use netdev_hold() in ip6_route_check_nh() · 3515440d
      Eric Dumazet authored
      In the blamed commit, we missed the fact that ip6_validate_gw()
      could change dev under us from ip6_route_check_nh().
      
      In this fix, I use GFP_ATOMIC in order to not pass too many additional
      arguments to ip6_validate_gw() and ip6_route_check_nh() only
      for a rarely used debug feature.
      
      syzbot reported:
      
      refcount_t: decrement hit 0; leaking memory.
      WARNING: CPU: 0 PID: 5006 at lib/refcount.c:31 refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31
      Modules linked in:
      CPU: 0 PID: 5006 Comm: syz-executor403 Not tainted 6.4.0-rc5-syzkaller-01229-g97c5209b #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
      RIP: 0010:refcount_warn_saturate+0x1d7/0x1f0 lib/refcount.c:31
      Code: 05 fb 8e 51 0a 01 e8 98 95 38 fd 0f 0b e9 d3 fe ff ff e8 ac d9 70 fd 48 c7 c7 00 d3 a6 8a c6 05 d8 8e 51 0a 01 e8 79 95 38 fd <0f> 0b e9 b4 fe ff ff 48 89 ef e8 1a d7 c3 fd e9 5c fe ff ff 0f 1f
      RSP: 0018:ffffc900039df6b8 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888026d71dc0 RSI: ffffffff814c03b7 RDI: 0000000000000001
      RBP: ffff888146a505fc R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 1ffff9200073bedc
      R13: 00000000ffffffef R14: ffff888146a505fc R15: ffff8880284eb5a8
      FS: 0000555556c88300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004585c0 CR3: 000000002b1b1000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
      <TASK>
      __refcount_dec include/linux/refcount.h:344 [inline]
      refcount_dec include/linux/refcount.h:359 [inline]
      ref_tracker_free+0x539/0x820 lib/ref_tracker.c:236
      netdev_tracker_free include/linux/netdevice.h:4097 [inline]
      netdev_put include/linux/netdevice.h:4114 [inline]
      netdev_put include/linux/netdevice.h:4110 [inline]
      fib6_nh_init+0xb96/0x1bd0 net/ipv6/route.c:3624
      ip6_route_info_create+0x10f3/0x1980 net/ipv6/route.c:3791
      ip6_route_add+0x28/0x150 net/ipv6/route.c:3835
      ipv6_route_ioctl+0x3fc/0x570 net/ipv6/route.c:4459
      inet6_ioctl+0x246/0x290 net/ipv6/af_inet6.c:569
      sock_do_ioctl+0xcc/0x230 net/socket.c:1189
      sock_ioctl+0x1f8/0x680 net/socket.c:1306
      vfs_ioctl fs/ioctl.c:51 [inline]
      __do_sys_ioctl fs/ioctl.c:870 [inline]
      __se_sys_ioctl fs/ioctl.c:856 [inline]
      __x64_sys_ioctl+0x197/0x210 fs/ioctl.c:856
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      
      Fixes: 70f7457a ("net: create device lookup API with reference tracking")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: David Ahern <dsahern@kernel.org>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • crypto: Fix af_alg_sendmsg(MSG_SPLICE_PAGES) sglist limit · 43804992
      David Howells authored
      When af_alg_sendmsg() calls extract_iter_to_sg(), it passes MAX_SGL_ENTS as
      the maximum number of elements that may be written to, but some of the
      elements may already have been used (as recorded in sgl->cur), so
      extract_iter_to_sg() may end up overrunning the scatterlist.
      
      Fix this to limit the number of elements to "MAX_SGL_ENTS - sgl->cur".
      
      Note: It probably makes sense in future to alter the behaviour of
      extract_iter_to_sg() to stop if "sgtable->nents >= sg_max" instead, but
      this is a smaller fix for now.
      
      The bug causes errors looking something like:
      
      BUG: KASAN: slab-out-of-bounds in sg_assign_page include/linux/scatterlist.h:109 [inline]
      BUG: KASAN: slab-out-of-bounds in sg_set_page include/linux/scatterlist.h:139 [inline]
      BUG: KASAN: slab-out-of-bounds in extract_bvec_to_sg lib/scatterlist.c:1183 [inline]
      BUG: KASAN: slab-out-of-bounds in extract_iter_to_sg lib/scatterlist.c:1352 [inline]
      BUG: KASAN: slab-out-of-bounds in extract_iter_to_sg+0x17a6/0x1960 lib/scatterlist.c:1339
      
      Fixes: bf63e250 ("crypto: af_alg: Support MSG_SPLICE_PAGES")
      Reported-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/r/000000000000b2585a05fdeb8379@google.com/
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: syzbot+6efc50cc1f8d718d6cb7@syzkaller.appspotmail.com
      cc: Herbert Xu <herbert@gondor.apana.org.au>
      cc: "David S. Miller" <davem@davemloft.net>
      cc: Eric Dumazet <edumazet@google.com>
      cc: Jakub Kicinski <kuba@kernel.org>
      cc: Paolo Abeni <pabeni@redhat.com>
      cc: Jens Axboe <axboe@kernel.dk>
      cc: Matthew Wilcox <willy@infradead.org>
      cc: linux-crypto@vger.kernel.org
      cc: netdev@vger.kernel.org
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
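The bound described in this fix can be sketched in plain C. This is a simplified model under stated assumptions, not the kernel's implementation: `MAX_ENTS`, `struct sgl_model`, `fill_entries()`, and `append_entries()` are hypothetical stand-ins for MAX_SGL_ENTS, the scatterlist bookkeeping, extract_iter_to_sg(), and the caller. The filler only knows the limit it is given, so the caller has to pass the remaining room rather than the total capacity.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTS 8 /* hypothetical stand-in for MAX_SGL_ENTS */

/* Simplified model of a partially filled scatterlist. */
struct sgl_model {
    int ents[MAX_ENTS];
    size_t cur; /* number of entries already in use */
};

/*
 * Stand-in for the extraction routine: writes at most sg_max entries
 * starting at 'out' and returns how many it wrote. It trusts sg_max.
 */
static size_t fill_entries(int *out, size_t sg_max,
                           const int *src, size_t nsrc)
{
    size_t n = nsrc < sg_max ? nsrc : sg_max;

    for (size_t i = 0; i < n; i++)
        out[i] = src[i];
    return n;
}

/* Append to the list, limiting the filler to the room that is left. */
static size_t append_entries(struct sgl_model *sgl,
                             const int *src, size_t nsrc)
{
    assert(sgl->cur <= MAX_ENTS);

    /* Passing MAX_ENTS here instead would overrun once cur > 0. */
    size_t room = MAX_ENTS - sgl->cur;
    size_t n = fill_entries(sgl->ents + sgl->cur, room, src, nsrc);

    sgl->cur += n;
    return n;
}
```

With an 8-entry list, appending 6 entries twice stores 6 and then only 2; the second call is truncated instead of writing past the end of the array.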
    • tcp: Use per-vma locking for receive zerocopy · 7a7f0946
      Arjun Roy authored
      Per-VMA locking allows us to lock a struct vm_area_struct without
      taking the process-wide mmap lock in read mode.
      
      Consider a process workload where the mmap lock is taken constantly in
      write mode. In this scenario, all zerocopy receives are periodically
      blocked during that period of time - though in principle, the memory
      ranges being used by TCP are not touched by the operations that need
      the mmap write lock. This results in performance degradation.
      
      Now consider another workload where the mmap lock is never taken in
      write mode, but there are many TCP connections using receive zerocopy
      that are concurrently receiving. These connections all take the mmap
      lock in read mode, but this does induce a lot of contention and atomic
      ops for this process-wide lock. This results in additional CPU
      overhead caused by contending on the cache line for this lock.
      
      However, with per-vma locking, both of these problems can be avoided.
      
      As a test, I ran an RPC-style request/response workload with 4KB
      payloads and receive zerocopy enabled, with 100 simultaneous TCP
      connections. I measured perf cycles within the
      find_tcp_vma/mmap_read_lock/mmap_read_unlock codepath, with and
      without per-vma locking enabled.
      
      When using process-wide mmap semaphore read locking, about 1% of
      measured perf cycles were within this path. With per-VMA locking, this
      value dropped to about 0.45%.
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 17 Jun, 2023 2 commits
    • tcp: enforce receive buffer memory limits by allowing the tcp window to shrink · b650d953
      mfreemon@cloudflare.com authored
      Under certain circumstances, the tcp receive buffer memory limit
      set by autotuning (sk_rcvbuf) is increased due to incoming data
      packets as a result of the window not closing when it should be.
      This can result in the receive buffer growing all the way up to
      tcp_rmem[2], even for tcp sessions with a low BDP.
      
      To reproduce:  Connect a TCP session with the receiver doing
      nothing and the sender sending small packets (an infinite loop
      of socket send() with 4 bytes of payload with a sleep of 1 ms
      in between each send()).  This will cause the tcp receive buffer
      to grow all the way up to tcp_rmem[2].
      
      As a result, a host can have individual tcp sessions with receive
      buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
      limits, causing the host to go into tcp memory pressure mode.
      
      The fundamental issue is the relationship between the granularity
      of the window scaling factor and the number of bytes ACKed back
      to the sender.  This problem has previously been identified in
      RFC 7323, appendix F [1].
      
      The Linux kernel currently adheres to never shrinking the window.
      
      In addition to the overallocation of memory mentioned above, the
      current behavior is functionally incorrect, because once tcp_rmem[2]
      is reached when no remediations remain (i.e. tcp collapse fails to
      free up any more memory and there are no packets to prune from the
      out-of-order queue), the receiver will drop in-window packets
      resulting in retransmissions and an eventual timeout of the tcp
      session.  A receive buffer full condition should instead result
      in a zero window and an indefinite wait.
      
      In practice, this problem is largely hidden for most flows.  It
      is not applicable to mice flows.  Elephant flows can send data
      fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
      triggering a zero window.
      
      But this problem does show up for other types of flows.  Examples
      are websockets and other types of flows that send small amounts of
      data spaced apart slightly in time.  In these cases, we directly
      encounter the problem described in [1].
      
      RFC 7323, section 2.4 [2], says there are instances when a retracted
      window can be offered, and that TCP implementations MUST ensure
      that they handle a shrinking window, as specified in RFC 1122,
      section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
      management have made clear that a sender must accept a shrunk window
      from the receiver, including RFC 793 [4] and RFC 1323 [5].
      
      This patch implements the functionality to shrink the tcp window
      when necessary to keep the right edge within the memory limit set by
      autotuning (sk_rcvbuf).  This new functionality is enabled with
      the new sysctl: net.ipv4.tcp_shrink_window
      
      Additional information can be found at:
      https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
      
      [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
      [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
      [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
      [4] https://www.rfc-editor.org/rfc/rfc793
      [5] https://www.rfc-editor.org/rfc/rfc1323

      Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
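The interaction between window-scale granularity and small ACKed amounts can be shown with a toy model. This is a deliberately simplified sketch, not the kernel's window selection code: `struct rx_model`, `select_window()`, and `simulate_small_sends()` are hypothetical names, memory is counted in payload bytes only, and the application never reads. It assumes a scale shift of 7 (128-byte granularity), a 1280-byte limit, and 4-byte sends as in the reproduction steps above.

```c
#include <assert.h>
#include <stdint.h>

/* Toy receiver state; all fields are illustrative assumptions. */
struct rx_model {
    uint32_t rcv_nxt;    /* next expected sequence number */
    uint32_t right_edge; /* rcv_nxt + last advertised window */
    uint32_t queued;     /* bytes held because the app never reads */
    uint32_t rcvbuf;     /* memory limit, in payload bytes */
    unsigned int wscale; /* window scale shift */
};

static uint32_t align_down(uint32_t x, unsigned int wscale)
{
    return x & ~((UINT32_C(1) << wscale) - 1);
}

static uint32_t align_up(uint32_t x, unsigned int wscale)
{
    uint32_t g = UINT32_C(1) << wscale;

    return (x + g - 1) & ~(g - 1);
}

/* Pick the window to advertise and update the right edge. */
static uint32_t select_window(struct rx_model *m, int allow_shrink)
{
    assert(m->wscale <= 14); /* largest shift permitted by RFC 7323 */

    uint32_t free_space = m->rcvbuf > m->queued ? m->rcvbuf - m->queued : 0;
    uint32_t new_win = align_down(free_space, m->wscale);
    uint32_t cur_win = m->right_edge - m->rcv_nxt;

    /* Never-shrink policy: keep the old edge, rounded up to the granularity. */
    if (new_win < cur_win && !allow_shrink)
        new_win = align_up(cur_win, m->wscale);

    m->right_edge = m->rcv_nxt + new_win;
    return new_win;
}

/* Feed 4-byte segments until the window closes; return bytes queued. */
static uint32_t simulate_small_sends(int allow_shrink, int packets)
{
    struct rx_model m = { .rcv_nxt = 0, .right_edge = 1280, .queued = 0,
                          .rcvbuf = 1280, .wscale = 7 };
    const uint32_t payload = 4;

    for (int i = 0; i < packets; i++) {
        if (m.right_edge - m.rcv_nxt < payload)
            break; /* window too small: the sender has to wait */
        m.rcv_nxt += payload;
        m.queued += payload;
        select_window(&m, allow_shrink);
    }
    return m.queued;
}
```

In this model, rounding the remaining window up to the 128-byte granularity moves the right edge forward by 4 bytes on every ACK, so with the never-shrink policy 1000 sends queue 4000 bytes against a 1280-byte limit. With shrinking allowed, the window closes after 1156 bytes and the limit holds.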
    • devlink: report devlink_port_type_warn source device · a52305a8
      Petr Oros authored
      devlink_port_type_warn is scheduled for a devlink port and warns
      when the port type is not set. However, the warning does not make it
      easy to find out which device (driver) has a devlink port with no type set.
      
      [ 3709.975552] Type was not set for devlink port.
      [ 3709.975579] WARNING: CPU: 1 PID: 13092 at net/devlink/leftover.c:6775 devlink_port_type_warn+0x11/0x20
      [ 3709.993967] Modules linked in: openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfnetlink bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs vhost_net vhost vhost_iotlb tap tun bridge stp llc qrtr intel_rapl_msr intel_rapl_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal mlx5_ib intel_powerclamp coretemp dell_wmi ledtrig_audio sparse_keymap ipmi_ssif kvm_intel ib_uverbs rfkill ib_core video kvm iTCO_wdt acpi_ipmi intel_vsec irqbypass ipmi_si iTCO_vendor_support dcdbas ipmi_devintf mei_me ipmi_msghandler rapl mei intel_cstate isst_if_mmio isst_if_mbox_pci dell_smbios intel_uncore isst_if_common i2c_i801 dell_wmi_descriptor wmi_bmof i2c_smbus intel_pch_thermal pcspkr acpi_power_meter xfs libcrc32c sd_mod sg nvme_tcp mgag200 i2c_algo_bit nvme_fabrics drm_shmem_helper drm_kms_helper nvme syscopyarea ahci sysfillrect sysimgblt nvme_core fb_sys_fops crct10dif_pclmul libahci mlx5_core sfc crc32_pclmul nvme_common drm
      [ 3709.994030]  crc32c_intel mtd t10_pi mlxfw libata tg3 mdio megaraid_sas psample ghash_clmulni_intel pci_hyperv_intf wmi dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse
      [ 3710.108431] CPU: 1 PID: 13092 Comm: kworker/1:1 Kdump: loaded Not tainted 5.14.0-319.el9.x86_64 #1
      [ 3710.108435] Hardware name: Dell Inc. PowerEdge R750/0PJ80M, BIOS 1.8.2 09/14/2022
      [ 3710.108437] Workqueue: events devlink_port_type_warn
      [ 3710.108440] RIP: 0010:devlink_port_type_warn+0x11/0x20
      [ 3710.108443] Code: 84 76 fe ff ff 48 c7 03 20 0e 1a ad 31 c0 e9 96 fd ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 48 c7 c7 18 24 4e ad e8 ef 71 62 ff <0f> 0b c3 cc cc cc cc 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f6 87
      [ 3710.108445] RSP: 0018:ff3b6d2e8b3c7e90 EFLAGS: 00010282
      [ 3710.108447] RAX: 0000000000000000 RBX: ff366d6580127080 RCX: 0000000000000027
      [ 3710.108448] RDX: 0000000000000027 RSI: 00000000ffff86de RDI: ff366d753f41f8c8
      [ 3710.108449] RBP: ff366d658ff5a0c0 R08: ff366d753f41f8c0 R09: ff3b6d2e8b3c7e18
      [ 3710.108450] R10: 0000000000000001 R11: 0000000000000023 R12: ff366d753f430600
      [ 3710.108451] R13: ff366d753f436900 R14: 0000000000000000 R15: ff366d753f436905
      [ 3710.108452] FS:  0000000000000000(0000) GS:ff366d753f400000(0000) knlGS:0000000000000000
      [ 3710.108453] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3710.108454] CR2: 00007f1c57bc74e0 CR3: 000000111d26a001 CR4: 0000000000773ee0
      [ 3710.108456] PKRU: 55555554
      [ 3710.108457] Call Trace:
      [ 3710.108458]  <TASK>
      [ 3710.108459]  process_one_work+0x1e2/0x3b0
      [ 3710.108466]  ? rescuer_thread+0x390/0x390
      [ 3710.108468]  worker_thread+0x50/0x3a0
      [ 3710.108471]  ? rescuer_thread+0x390/0x390
      [ 3710.108473]  kthread+0xdd/0x100
      [ 3710.108477]  ? kthread_complete_and_exit+0x20/0x20
      [ 3710.108479]  ret_from_fork+0x1f/0x30
      [ 3710.108485]  </TASK>
      [ 3710.108486] ---[ end trace 1b4b23cd0c65d6a0 ]---
      
      After patch:
      [  402.473064] ice 0000:41:00.0: Type was not set for devlink port.
      [  402.473064] ice 0000:41:00.1: Type was not set for devlink port.
      Signed-off-by: Petr Oros <poros@redhat.com>
      Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20230615095447.8259-1-poros@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>