1. 17 Apr, 2023 12 commits
    • Merge branch 'mptcp-subflow-init' · 28f610d0
      David S. Miller authored
      Matthieu Baerts says:
      
      ====================
      mptcp: refactor first subflow init
      
      This series refactors the initialisation of the first subflow of a
      listen socket. The first subflow allocation is no longer done at the
      initialisation of the socket but later, when the connection request is
      received or when requested by the userspace.
      
      This is needed not just because Paolo likes to refactor things but
      because this simplifies the code and makes the behaviour more consistent
      with the rest. Also, this is a prerequisite for future patches adding
      proper support of SELinux/LSM labels with MPTCP and accept(2).
      
      In [1], Ondrej Mosnacek explained they discovered the (userspace-facing)
      sockets returned by accept(2) when using MPTCP always end up with the
      label representing the kernel (typically system_u:system_r:kernel_t:s0),
      while it would make more sense to inherit the context from the parent
      socket (the one that is passed to accept(2)).
      
      Before being able to properly support that on SELinux/LSM side, patches
      2-3/5 prepare the code to simplify the patch 4/5 moving the allocation.
      
      Patch 1/5 is a small clean-up seen while working on the series and patch
      5/5 is a small improvement when closing unaccepted sockets.
      
      [1] https://lore.kernel.org/netdev/CAFqZXNs2LF-OoQBUiiSEyranJUXkPLcCfBkMkwFeM6qEwMKCTw@mail.gmail.com/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: fastclose msk when cleaning unaccepted sockets · 8d547809
      Paolo Abeni authored
      When cleaning up unaccepted mptcp sockets still lying inside
      the listener queue at listener close time, such sockets will
      go through a regular close, waiting for a timeout before
      shutting down the subflows.
      
      There is no need to keep the kernel resources in use for
      such a possibly long time: short-circuit to fast-close.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: move first subflow allocation at mpc access time · ddb1a072
      Paolo Abeni authored
      In the long run this will simplify the mptcp code and will
      allow for more consistent behavior. Move the first subflow
      allocation out of the sock->init ops into the __mptcp_nmpc_socket()
      helper.
      
      Since the first subflow creation can now happen after the first
      setsockopt() we additionally need to invoke mptcp_sockopt_sync()
      on it.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
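The lazy allocation described in this commit can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: `MptcpSockModel`, `pending_sockopts` and the dict-based subflow are hypothetical, and `nmpc_socket()` merely stands in for the role the message gives to `__mptcp_nmpc_socket()` and `mptcp_sockopt_sync()`.

```python
class MptcpSockModel:
    """Toy model of an MPTCP socket whose first subflow is created lazily."""

    def __init__(self):
        # Nothing is allocated at socket init time any more.
        self.first_subflow = None
        self.pending_sockopts = {}

    def setsockopt(self, name, value):
        # Always record the option on the MPTCP-level socket.
        self.pending_sockopts[name] = value
        # Propagate right away only if the subflow already exists.
        if self.first_subflow is not None:
            self.first_subflow[name] = value

    def nmpc_socket(self):
        # Hypothetical stand-in for __mptcp_nmpc_socket(): allocate the
        # first subflow on first access, then replay the options that
        # were set earlier (the "sync" step).
        if self.first_subflow is None:
            self.first_subflow = {}
            self.first_subflow.update(self.pending_sockopts)
        return self.first_subflow
```

In this model an option set before the first access is not lost: it is replayed onto the subflow when the subflow is finally created.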
    • mptcp: move fastopen subflow check inside mptcp_sendmsg_fastopen() · a2702a07
      Paolo Abeni authored
      So that we can avoid a bunch of checks in the fastpath. Additionally we
      can specialize such checks according to the specific fastopen method
      (defer_connect vs MSG_FASTOPEN).
      
      The latter bits will simplify the next patches.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: avoid unneeded __mptcp_nmpc_socket() usage · 61761231
      Paolo Abeni authored
      In a few spots, the mptcp code invokes the __mptcp_nmpc_socket() helper
      multiple times under the same socket lock scope. Additionally, in such
      places, the socket status ensures that there is no MP capable handshake
      running.
      
      Under the above condition we can replace the later __mptcp_nmpc_socket()
      helper invocation with direct access to the msk->subflow pointer and
      better document such access is not supposed to fail with WARN().
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: drop unneeded argument · 7a486c44
      Paolo Abeni authored
      After commit 3a236aef ("mptcp: refactor passive socket initialization"),
      every mptcp_pm_fully_established() call is always invoked with a
      GFP_ATOMIC argument. We can then drop it.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-updates-2023-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 0475135f
      David S. Miller authored
      mlx5-updates-2023-04-14
      
      Yevgeny Kliteynik Says:
      =======================
      
      SW Steering: Support pattern/args modify_header actions
      
      The following patch series adds support for a new pattern/arguments type
      of modify_header actions.
      
      Starting with ConnectX-6 DX, we use a new design of modify_header FW object.
      The current modify_header object allows only a limited number of
      these FW objects, which means that we are limited in the number of offloaded
      flows that require modify_header action.
      
      The new approach comprises two types of objects: pattern and argument.
      Pattern holds header modification templates, later used with corresponding
      argument object to create complete header modification actions.
      The pattern indicates which headers are modified, while the arguments
      provide the specific values.
      Therefore a single pattern can be used with different arguments in different
      flows, enabling offloading of a large number of modify_header flows.
      
       - Patch 1, 2: Add ICM pool for modify-header-pattern objects and implement
         patterns cache, allowing patterns reuse for different flows
       - Patch 3: Allow for chunk allocation separately for STEv0 and STEv1
       - Patch 4: Read related device capabilities
       - Patch 5: Add create/destroy functions for the new general object type
       - Patch 6: Add support for writing modify header argument to ICM
       - Patch 7, 8: Some required fixes to support pattern/arg - separate read
         buffer from the write buffer and fix QP continuous allocation
       - Patch 9: Add pool for modify header arg objects
       - Patch 10, 11, 12: Implement MODIFY_HEADER and TNL_L3_TO_L2 actions with
         the new patterns/args design
       - Patch 13: Optimization - set modify header action of size 1 directly on
         the STE instead of separate pattern/args combination
       - Patch 14: Adjust debug dump for patterns/args
       - Patch 15: Enable patterns and arguments for supporting devices
      
      =======================
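The pattern/argument split described in this series can be sketched as a toy model. This is an illustration only: `PatternCache`, `make_action` and the field names are hypothetical and say nothing about the actual mlx5 data structures or firmware objects. It only shows how one pattern (which fields are modified) can be shared by flows that carry different argument values.

```python
class PatternCache:
    """Toy cache: one pattern object per distinct list of modified fields."""

    def __init__(self):
        self._patterns = {}

    def get(self, fields):
        key = tuple(fields)
        # Reuse an existing pattern when the same fields are modified.
        if key not in self._patterns:
            self._patterns[key] = {"id": len(self._patterns), "fields": key}
        return self._patterns[key]

    def __len__(self):
        return len(self._patterns)


def make_action(cache, modifications):
    """Split [(field, value), ...] into a shared pattern and per-flow args."""
    fields = [field for field, _ in modifications]
    values = [value for _, value in modifications]
    return {"pattern": cache.get(fields), "args": values}


cache = PatternCache()
flow_a = make_action(cache, [("ipv4.ttl", 63), ("eth.dmac", "aa:bb:cc:00:00:01")])
flow_b = make_action(cache, [("ipv4.ttl", 62), ("eth.dmac", "aa:bb:cc:00:00:02")])
```

Both flows end up pointing at the same pattern entry while keeping their own argument lists, which is why the number of patterns need not grow with the number of flows.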
    • Merge branch 'ovs-selftests' · e2174b03
      David S. Miller authored
      Aaron Conole says:
      
      ====================
      selftests: openvswitch: add support for testing upcall interface
      
      The existing selftest suite for openvswitch will work for regression
      testing the datapath feature bits, but won't test things like adding
      interfaces, or the upcall interface.  Here, we add some additional
      test facilities.
      
      First, extend the ovs-dpctl.py python module to support the OVS_FLOW
      and OVS_PACKET netlink families, with some associated messages.  These
      can be extended over time, but the initial support is for more well
      known cases (output, userspace, and CT).
      
      Next, extend the test suite to test upcalls by adding a datapath,
      monitoring the upcall socket associated with the datapath, and then
      dumping any upcalls that are received.  Compare with expected ARP
      upcall via arping.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: add support for upcall testing · 9feac87b
      Aaron Conole authored
      The upcall socket interface can be exercised now to make sure that
      future feature adjustments to the field can maintain backwards
      compatibility.
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: add flow dump support · e52b07aa
      Aaron Conole authored
      Add a basic set of fields to print in a 'dpflow' format.  This will be
      used by future commits to check for flow fields after parsing, as
      well as verifying the flow fields pushed into the kernel from
      userspace.
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: openvswitch: add interface support · 74cc26f4
      Aaron Conole authored
      Includes an associated test to generate netns and connect
      interfaces, with the option to include packet tracing.
      
      This will be used in the future when flow support is added
      for additional test cases.
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: micrel: Fix PTP_PF_PEROUT for lan8841 · c6d6ef3e
      Horatiu Vultur authored
      If the 1PPS output was enabled and then lan8841 was configured to be a
      follower, the target clock which is used to generate the 1PPS was not
      configured correctly. The problem was that each adjustment of the
      time also changed the nanosecond part of the target clock, so the
      initial nanosecond part of the target clock was lost.
      The issue can be observed when both the leader and the follower are
      generating 1PPS: their PPS are not aligned even if the time is
      aligned.
      The fix consists of not modifying the nanosecond part of the target
      clock when adjusting the time. In this way the 1PPS also gets aligned.
      
      Fixes: e4ed8ba0 ("net: phy: micrel: Add support for PTP_PF_PEROUT for lan8841")
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
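The effect described in this fix can be shown with a toy calculation. This is a sketch under assumptions, not the driver logic: the target clock is modelled as a `(seconds, nanoseconds)` pair, both function names are hypothetical, and how the real driver derives the new seconds value is not stated in the commit message.

```python
NSEC_PER_SEC = 1_000_000_000


def adjust_target_old(target, delta_ns):
    """Shift the whole target by the adjustment, so the nanoseconds drift."""
    sec, nsec = target
    total = sec * NSEC_PER_SEC + nsec + delta_ns
    return divmod(total, NSEC_PER_SEC)


def adjust_target_fixed(target, delta_ns):
    """Shift only the seconds part and keep the nanosecond part untouched."""
    sec, nsec = target
    new_sec = (sec * NSEC_PER_SEC + delta_ns) // NSEC_PER_SEC
    return (new_sec, nsec)
```

With a target on a second boundary and a time step of 2.3 s, the first variant leaves the target 300 ms off the boundary, while the second keeps the nanosecond part at 0, which is the property the message says is needed for the PPS edges to line up.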
  2. 15 Apr, 2023 4 commits
    • Merge branch 'page_pool-allow-caching-from-safely-localized-napi' · e61caf04
      Jakub Kicinski authored
      Jakub Kicinski says:
      
      ====================
      page_pool: allow caching from safely localized NAPI
      
      I went back to the explicit "are we in NAPI method", mostly
      because I don't like having both around :( (even tho I maintain
      that in_softirq() && !in_hardirq() is as safe, as softirqs do
      not nest).
      
      Still returning the skbs to a CPU, tho, not to the NAPI instance.
      I reckon we could create a small refcounted struct per NAPI instance
      which would allow sockets and other users to hold a persistent
      and safe reference. But that's a bigger change, and I get 90+%
      recycling thru the cache with just these patches (for RR and
      streaming tests with 100% CPU use it's almost 100%).
      
      Some numbers for streaming test with 100% CPU use (from previous version,
      but really they perform the same):
      
                               HW-GRO                 page=page
                            before       after      before       after
      recycle:
      cached:                    0   138669686           0   150197505
      cache_full:                0      223391           0       74582
      ring:              138551933     9997191   149299454           0
      ring_full:                 0         488        3154      127590
      released_refcnt:           0           0           0           0

      alloc:
      fast:              136491361   148615710   146969587   150322859
      slow:                   1772        1799         144         105
      slow_high_order:           0           0           0           0
      empty:                  1772        1799         144         105
      refill:              2165245      156302     2332880        2128
      waive:                     0           0           0           0
      
      v1: https://lore.kernel.org/all/20230411201800.596103-1-kuba@kernel.org/
      rfcv2: https://lore.kernel.org/all/20230405232100.103392-1-kuba@kernel.org/
      ====================
      
      Link: https://lore.kernel.org/r/20230413042605.895677-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bnxt: hook NAPIs to page pools · 294e39e0
      Jakub Kicinski authored
      bnxt has 1:1 mapping of page pools and NAPIs, so it's safe
      to hook them up together.
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • page_pool: allow caching from safely localized NAPI · 8c48eea3
      Jakub Kicinski authored
      Recent patches to mlx5 mentioned a regression when moving from
      driver local page pool to only using the generic page pool code.
      Page pool has two recycling paths (1) direct one, which runs in
      safe NAPI context (basically consumer context, so producing
      can be lockless); and (2) via a ptr_ring, which takes a spin
      lock because the freeing can happen from any CPU; producer
      and consumer may run concurrently.
      
      Since the page pool code was added, Eric introduced a revised version
      of deferred skb freeing. TCP skbs are now usually returned to the CPU
      which allocated them, and freed in softirq context. This places the
      freeing (producing of pages back to the pool) enticingly close to
      the allocation (consumer).
      
      If we can prove that we're freeing in the same softirq context in which
      the consumer NAPI will run - lockless use of the cache is perfectly fine,
      no need for the lock.
      
      Let drivers link the page pool to a NAPI instance. If the NAPI instance
      is scheduled on the same CPU on which we're freeing - place the pages
      in the direct cache.
      
      With that and patched bnxt (XDP enabled to engage the page pool, sigh,
      bnxt really needs page pool work :() I see a 2.6% perf boost with
      a TCP stream test (app on a different physical core than softirq).
      
      The CPU use of relevant functions decreases as expected:
      
        page_pool_refill_alloc_cache   1.17% -> 0%
        _raw_spin_lock                 2.41% -> 0.98%
      
      Only consider lockless path to be safe when NAPI is scheduled
      - in practice this should cover majority if not all of steady state
      workloads. It's usually the NAPI kicking in that causes the skb flush.
      
      The main case we'll miss out on is when application runs on the same
      CPU as NAPI. In that case we don't use the deferred skb free path.
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
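The recycling decision described in this commit can be modelled in user space roughly as follows. This is a sketch under assumptions and not the page pool implementation: `NapiModel`, `PagePoolModel`, `scheduled_cpu` and `cache_size` are hypothetical names, a `deque` guarded by a `threading.Lock` stands in for the ptr_ring, and falling back to the ring when the direct cache is full is an assumption of the model.

```python
import threading
from collections import deque


class NapiModel:
    """Toy NAPI instance; scheduled_cpu is None while it is not scheduled."""

    def __init__(self):
        self.scheduled_cpu = None


class PagePoolModel:
    """Toy page pool with a lockless direct cache and a locked ring."""

    def __init__(self, napi=None, cache_size=2):
        self.napi = napi              # optional link set up by the driver
        self.cache_size = cache_size
        self.direct_cache = []        # consumer-side cache, used without a lock
        self.ring = deque()           # stands in for the ptr_ring
        self.ring_lock = threading.Lock()

    def _napi_local(self, current_cpu, napi_safe):
        # Lockless use is only considered safe when the caller is in a
        # normal NAPI context and the linked NAPI is scheduled on this CPU.
        napi = self.napi
        return (napi_safe
                and napi is not None
                and napi.scheduled_cpu is not None
                and napi.scheduled_cpu == current_cpu)

    def recycle(self, page, current_cpu, napi_safe):
        if (self._napi_local(current_cpu, napi_safe)
                and len(self.direct_cache) < self.cache_size):
            self.direct_cache.append(page)
            return "cached"
        # Any other case takes the lock, since any CPU may be freeing.
        with self.ring_lock:
            self.ring.append(page)
        return "ring"
```

A pool without a linked NAPI instance, or one whose NAPI is scheduled on a different CPU, always takes the locked ring path in this model.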
    • net: skb: plumb napi state thru skb freeing paths · b07a2d97
      Jakub Kicinski authored
      We maintain a NAPI-local cache of skbs which is fed by napi_consume_skb().
      Going forward we will also try to cache head and data pages.
      Plumb the "are we in a normal NAPI context" information thru
      deeper into the freeing path, up to skb_release_data() and
      skb_free_head()/skb_pp_recycle(). The "not normal NAPI context"
      comes from netpoll which passes budget of 0 to try to reap
      the Tx completions but not perform any Rx.
      
      Use "bool napi_safe" rather than bare "int budget",
      the further we get from NAPI the more confusing the budget
      argument may seem (particularly whether 0 or MAX is the
      correct value to pass in when not in NAPI).
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
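The budget-to-boolean conversion described in this commit can be sketched as below. This is a toy model only: both function names and the returned strings are hypothetical, and the only fact taken from the message is that a budget of 0 (as passed by netpoll) means "not a normal NAPI context".

```python
def skb_free_path(napi_safe):
    """Pick the (modelled) freeing path from the boolean flag."""
    return "napi-local cache" if napi_safe else "regular free"


def napi_consume_skb_model(budget):
    # Convert the integer budget into an explicit flag once, at the NAPI
    # boundary, so deeper helpers never have to interpret a budget value.
    napi_safe = budget != 0
    return skb_free_path(napi_safe)
```

Callers outside NAPI would pass `False` directly in this model, which avoids the question of whether 0 or a maximum budget is the right value to pass.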
  3. 14 Apr, 2023 24 commits