1. 27 Apr, 2022 1 commit
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Eric Dumazet authored
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb
      frees outside of the critical section where the socket lock is held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      the skb payload has been consumed, meaning that the BH handler has no
      chance to pick up the skb before the recvmsg() thread does. This issue
      is more visible with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if BH handler picks the skbs, they are still picked
      from the cpu on which user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
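      
      The flow described above can be sketched as a toy user-space model.
      This is not the kernel code: the SoftnetData class, the
      KICK_THRESHOLD value and the function names are hypothetical, and a
      boolean flag stands in for the IPI that raises NET_RX_SOFTIRQ.

```python
from collections import deque


class SoftnetData:
    """Toy stand-in for one cpu's per-cpu network state (hypothetical model)."""

    def __init__(self, cpu):
        self.cpu = cpu
        self.defer_list = deque()     # skbs queued to be freed on this cpu
        self.softirq_pending = False  # models NET_RX_SOFTIRQ raised via IPI
        self.freed = []               # ids of skbs freed on this cpu


KICK_THRESHOLD = 4  # hypothetical backlog size that triggers the IPI


def defer_free(skb, softnet, current_cpu):
    """Hand an skb back to the cpu that allocated it."""
    owner = softnet[skb["alloc_cpu"]]
    if owner.cpu == current_cpu:
        owner.freed.append(skb["id"])  # already on the right cpu: free now
        return
    owner.defer_list.append(skb)       # normal case: queue, no further action
    if len(owner.defer_list) >= KICK_THRESHOLD and not owner.softirq_pending:
        owner.softirq_pending = True   # stands in for the IPI


def drain_defer_list(sd):
    """Models the drain step that runs after incoming packets are processed."""
    while sd.defer_list:
        sd.freed.append(sd.defer_list.popleft()["id"])
    sd.softirq_pending = False
```

      In this model every skb ends up freed on the cpu recorded in its
      alloc_cpu field, whichever cpu ran the consumer.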
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      the page recycling strategy used by the NIC driver (its page pool
      capacity being too small compared to the number of skbs/pages held in
      socket receive queues).
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing in this particular test.
      These conditions can also happen in more general production workloads.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 26 Apr, 2022 4 commits
  3. 25 Apr, 2022 19 commits
  4. 23 Apr, 2022 16 commits
    • Merge branch 'dsa-selftests' · cfc1d91a
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      DSA selftests
      
      When working on complex new features or reworks it becomes increasingly
      difficult to ensure there aren't regressions being introduced, and
      therefore it would be nice if we could go over the functionality we
      already have and write some tests for it.
      
      Verbally I know from Tobias Waldekranz that he has been working on some
      selftests for DSA, yet I have never seen them, so here I am adding some
      tests I have written which have been useful for me. The list is by no
      means complete (it only covers elementary functionality), but it's still
      good to have as a starting point. I also borrowed some refactoring
      changes from Joachim Wiberg that he submitted for his "net: bridge:
      forwarding of unknown IPv4/IPv6/MAC BUM traffic" series, but not the
      entirety of his selftests. I now think that his selftests have some
      overlap with bridge_vlan_unaware.sh and bridge_vlan_aware.sh and they
      should be more tightly integrated with each other - yet I didn't do that
      either :). Another issue I had with his selftests was that they jumped
      straight ahead to configure brport flags on br0 (a radical new idea
      still at RFC status) while we have bigger problems, and we don't have
      nearly enough coverage for the *existing* functionality.
      
      One idea introduced here which I haven't seen before is the symlinking
      of relevant forwarding selftests to the selftests/drivers/net/<my-driver>/
      folder, plus a forwarding.config file. I think there's some value in
      having things structured this way, since the forwarding dir has so many
      selftests that aren't relevant to DSA that it is a bit difficult to find
      the ones that are.
      
      While searching for applications that I could use for multicast testing
      (not my domain of interest/knowledge really), I found Joachim Wiberg's
      mtools, mcjoin and omping, and I tried them all with various degrees of
      success. In particular, I was going to use mcjoin, but I faced some
      issues getting IPv6 multicast traffic to work in a VRF, and I bothered
      David Ahern about it here:
      https://lore.kernel.org/netdev/97eaffb8-2125-834e-641f-c99c097b6ee2@gmail.com/t/
      It seems that the problem is that this application should use
      SO_BINDTODEVICE, yet it doesn't.
      
      So I ended up patching the bare-bones mtools (msend, mreceive) forked by
      Joachim from the University of Virginia's Multimedia Networks Group to
      include IPv6 support, and to use SO_BINDTODEVICE. This is what I'm using
      now for IPv6.
      
      Note that mausezahn doesn't appear to do a particularly good job of
      supporting IPv6 really, and I needed a program to emit the actual
      IP_ADD_MEMBERSHIP calls, for dev_mc_add(), so I could test RX filtering.
      Crafting the IGMP/MLD reports by hand doesn't really do the trick.
      While extremely bare-bones, the mreceive application now seems to do
      what I need it to.
      
      Feedback appreciated, it is very likely that I could have done things in
      a better way.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: drivers: dsa: add a subset of forwarding selftests · 07c8a2dd
      Vladimir Oltean authored
      This adds an initial subset of forwarding selftests which I considered
      to be relevant for DSA drivers, along with a forwarding.config that
      makes it easier to run them (disables veth pair creation, makes sure MAC
      addresses are unique and stable).
      
      The intention is to request driver writers to run these selftests during
      review and make sure that the tests pass, or at least that the problems
      are known.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add a test for local_termination.sh · 90b9566a
      Vladimir Oltean authored
      This tests the capability of switch ports to filter out undesired
      traffic. Different drivers are expected to have different capabilities
      here (so some may fail and some may pass), yet the test still has some
      value, for example to check for regressions.
      
      There are 2 kinds of failures, one is when a packet which should have
      been accepted isn't (and that should be fixed), and the other "failure"
      (as reported by the test) is when a packet could have been filtered out
      (for being unnecessary) yet it was received.
      
      The bridge driver fares particularly badly at this test:
      
      TEST: br0: Unicast IPv4 to primary MAC address                      [ OK ]
      TEST: br0: Unicast IPv4 to macvlan MAC address                      [ OK ]
      TEST: br0: Unicast IPv4 to unknown MAC address                      [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Unicast IPv4 to unknown MAC address, promisc             [ OK ]
      TEST: br0: Unicast IPv4 to unknown MAC address, allmulti            [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv4 to joined group                           [ OK ]
      TEST: br0: Multicast IPv4 to unknown group                          [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv4 to unknown group, promisc                 [ OK ]
      TEST: br0: Multicast IPv4 to unknown group, allmulti                [ OK ]
      TEST: br0: Multicast IPv6 to joined group                           [ OK ]
      TEST: br0: Multicast IPv6 to unknown group                          [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv6 to unknown group, promisc                 [ OK ]
      TEST: br0: Multicast IPv6 to unknown group, allmulti                [ OK ]
      
      mainly because it does not implement IFF_UNICAST_FLT. Yet I still think
      having the test (with the failures) is useful in case somebody wants to
      tackle that problem in the future, to make an easy before-and-after
      comparison.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add a no_forwarding.sh test · 476a4f05
      Vladimir Oltean authored
      Bombard a standalone switch port with various kinds of traffic to ensure
      it is really standalone and doesn't leak packets to other switch ports.
      Also check for switch ports in different bridges, and switch ports in a
      VLAN-aware bridge but having different pvids. No forwarding should take
      place in either case.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add helper for retrieving IPv6 link-local address of interface · a5114df6
      Vladimir Oltean authored
      Pinging an IPv6 link-local multicast address selects the link-local
      unicast address of the interface as source, and we'd like to monitor for
      that in tcpdump.
      
      Add a helper to the forwarding library which retrieves the link-local
      IPv6 address of an interface, to make that task easier.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add helpers for IP multicast group joins/leaves · f23cddc7
      Vladimir Oltean authored
      Extend the forwarding library with calls to some small C programs which
      join an IP multicast group and send some packets to it. Both IPv4 and
      IPv6 groups are supported. Use cases range from testing IGMP/MLD
      snooping, to RX filtering, to multicast routing.
      
      Testing multicast traffic using msend/mreceive is intended to be done
      using tcpdump.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: multiple instances in tcpdump helper · 6182c5c5
      Joachim Wiberg authored
      Extend tcpdump_start() and its companion helpers to handle multiple
      instances.  Useful when observing bridge operation, e.g., unicast
      learning/flooding, and any case of multicast distribution (to these
      ports but not that one ...).
      
      This means the interface argument is now a mandatory argument to all
      tcpdump_*() functions, hence the changes to the ocelot flower test.
      Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add TCPDUMP_EXTRA_FLAGS to lib.sh · fe32dffd
      Joachim Wiberg authored
      For some use-cases we may want to change the tcpdump flags used in
      tcpdump_start().  For instance, observing interfaces without the PROMISC
      flag, e.g. to see what's really being forwarded to the bridge interface.
      Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add option to run tests with stable MAC addresses · b343734e
      Vladimir Oltean authored
      By default, DSA switch ports inherit their MAC address from the DSA
      master.
      
      This works well for practical situations, but some selftests like
      bridge_vlan_unaware.sh loop back 2 standalone DSA ports with 2 bridged
      DSA ports, and require the bridge to forward packets between the
      standalone ports.
      
      Due to the bridge seeing that the MAC DA it needs to forward is present
      as a local FDB entry (it coincides with the MAC address of the bridge
      ports), the test packets are not forwarded, but terminated locally on
      br0. In turn, this makes the ping and ping6 tests fail.
      
      Address this by introducing an option to have stable MAC addresses.
      When mac_addr_prepare is called, the current addresses of the netifs are
      saved and replaced with 00:01:02:03:04:${netif number}. Then when
      mac_addr_restore is called at the end of the test, the original MAC
      addresses are restored. This ensures that the MAC addresses are unique,
      which makes the test pass even for DSA ports.
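      
      The save/replace/restore cycle can be modelled roughly as follows.
      This is only an illustrative sketch: the real helpers are shell
      functions in the forwarding library, a dict stands in for the
      netifs, and numbering from 1 with a two-digit hex last byte is an
      assumption about how ${netif number} is formatted.

```python
def stable_mac(index):
    """Build 00:01:02:03:04:<index>; the hex formatting is an assumption."""
    if not 0 <= index <= 0xFF:
        raise ValueError("netif index must fit in one byte")
    return "00:01:02:03:04:{:02x}".format(index)


def mac_addr_prepare(netifs):
    """Replace every address with a stable one and return the originals.

    `netifs` maps interface name -> MAC address and is modified in place.
    """
    saved = dict(netifs)
    for index, name in enumerate(list(netifs), start=1):
        netifs[name] = stable_mac(index)
    return saved


def mac_addr_restore(netifs, saved):
    """Put back the addresses recorded by mac_addr_prepare()."""
    netifs.update(saved)
```

      With two DSA ports that both inherited the master's address,
      mac_addr_prepare() leaves them with distinct addresses ending in
      :01 and :02, so the bridge no longer sees the test traffic as
      locally destined.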
      
      The usage model is for the behavior to be opt-in via STABLE_MAC_ADDRS,
      which DSA should set to true; all others behave as before. By hooking
      the calls to mac_addr_prepare and mac_addr_restore within the forwarding
      lib itself, we do not need to patch each individual selftest; the only
      requirement is that pre_cleanup is called.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mptcp-tcp-fallback' · 988998ac
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: TCP fallback for established connections
      
      RFC 8684 allows some MPTCP connections to fall back to regular TCP when
      the MPTCP DSS checksum detects middlebox interference, there is only a
      single subflow, and there is no unacknowledged out-of-sequence
      data. When this condition is detected, the stack sends an MPTCP DSS
      option with an "infinite mapping" to signal that a fallback is
      happening, and the peers will stop sending MPTCP options in their TCP
      headers. The Linux MPTCP stack has not yet supported this type of
      fallback, instead closing the connection when the MPTCP checksum fails.
      
      This series adds support for fallback to regular TCP in a more limited
      scenario, for only MPTCP connections that have never connected
      additional subflows or transmitted out-of-sequence data. The selftests
      are also updated to check new MIBs that track infinite mappings.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: add infinite map mibs check · 8bd03be3
      Geliang Tang authored
      This patch adds a function chk_infi_nr() to check the mibs for the
      infinite mapping. Invoke it in chk_join_nr() when validate_checksum
      is set.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: dump infinite_map field in mptcp_dump_mpext · d9fdd02d
      Geliang Tang authored
      In trace event class mptcp_dump_mpext, dump the newly added infinite_map
      field of struct mptcp_ext too.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add mib for infinite map sending · 104125b8
      Geliang Tang authored
      This patch adds a new mib named MPTCP_MIB_INFINITEMAPTX and increments
      it when an infinite mapping has been sent out.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: infinite mapping receiving · f8d4bcac
      Geliang Tang authored
      This patch adds the infinite mapping receiving logic. When the infinite
      mapping is received, set the map_data_len of the subflow to 0.
      
      In subflow_check_data_avail(), only reset the subflow when the map_data_len
      of the subflow is non-zero.
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: infinite mapping sending · 1e39e5a3
      Geliang Tang authored
      This patch adds the infinite mapping sending logic.
      
      Add a new flag send_infinite_map in struct mptcp_subflow_context. Set
      it true when a single contiguous subflow is in use and the
      allow_infinite_fallback flag is true in mptcp_pm_mp_fail_received().
      
      In mptcp_sendmsg_frag(), if this flag is true, call the new function
      mptcp_update_infinite_map() to set the infinite mapping.
      
      Add a new flag infinite_map in struct mptcp_ext, set it true in
      mptcp_update_infinite_map(), and check this flag in a new helper
      mptcp_check_infinite_map().
      
      In mptcp_update_infinite_map(), set data_len to 0, and clear the
      send_infinite_map flag, then do fallback.
      
      In mptcp_established_options(), use the helper mptcp_check_infinite_map()
      so that the infinite mapping DSS can still be sent out in fallback mode.
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: track and update contiguous data status · 0530020a
      Geliang Tang authored
      This patch adds a new member allow_infinite_fallback in mptcp_sock,
      which is initialized to 'true' when the connection begins and is set
      to 'false' on any retransmit or successful MP_JOIN. Only do infinite
      mapping fallback if there is a single subflow AND there have been no
      retransmissions AND there have never been any MP_JOINs.
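      
      The rule above amounts to a one-way latch. A minimal model, assuming
      a hypothetical MptcpSockModel class with a plain subflow counter
      (the kernel structure and its call sites look different):

```python
class MptcpSockModel:
    """Toy model of the allow_infinite_fallback latch (hypothetical class)."""

    def __init__(self):
        self.allow_infinite_fallback = True  # true when the connection begins
        self.subflows = 1                    # the initial subflow

    def on_retransmit(self):
        # Any retransmit clears the flag for the rest of the connection.
        self.allow_infinite_fallback = False

    def on_mp_join_success(self):
        # A successful MP_JOIN adds a subflow and clears the flag for good.
        self.subflows += 1
        self.allow_infinite_fallback = False

    def on_subflow_closed(self):
        # Dropping back to one subflow does not set the flag again.
        self.subflows -= 1

    def can_do_infinite_fallback(self):
        return self.subflows == 1 and self.allow_infinite_fallback
```

      The flag is never set back to true, so a connection that once had a
      second subflow stays ineligible even after that subflow is closed.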
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>