1. 27 Apr, 2022 5 commits
    • net: remove comments that mention obsolete __SLOW_DOWN_IO · e39f63fe
      Bjorn Helgaas authored
      The only remaining definitions of __SLOW_DOWN_IO (for alpha and ia64) do
      nothing, and the only mentions in networking are in comments.  Remove these
      mentions.
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: wan: atp: remove unused eeprom_delay() · dac173db
      Bjorn Helgaas authored
      atp.h is included only by atp.c, which does not use eeprom_delay().  Remove
      the unused definition.
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: tls: fix async vs NIC crypto offload · c706b2b5
      Jakub Kicinski authored
      When NIC takes care of crypto (or the record has already
      been decrypted) we forget to update darg->async. ->async
      is supposed to mean whether record is async capable on
      input and whether record has been queued for async crypto
      on output.
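
The in-out meaning of ->async described above can be illustrated with a small userspace sketch. This is not the kernel code: `struct decrypt_args`, `struct record`, `decrypt_record()` and all field names are hypothetical stand-ins, chosen only to show that the output value has to be written on every path, including the one where the NIC (or an earlier pass) already decrypted the record.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the decrypt arguments mentioned in the commit. */
struct decrypt_args {
	bool async; /* in: caller allows async crypto; out: record was queued async */
};

/* Hypothetical record state; field names are illustrative only. */
struct record {
	bool decrypted;     /* NIC offload or an earlier pass already decrypted it */
	bool backend_async; /* software crypto would complete asynchronously */
};

/* Returns 0 on success and updates darg->async on every path. */
static int decrypt_record(const struct record *rec, struct decrypt_args *darg)
{
	assert(rec && darg);

	if (rec->decrypted) {
		/* Nothing was queued, so the output value must be false. */
		darg->async = false;
		return 0;
	}

	/* Queue asynchronously only if the caller allowed it on input. */
	darg->async = darg->async && rec->backend_async;
	return 0;
}
```

In this model, forgetting the assignment in the already-decrypted branch would leave the caller's input value in place, so the caller would wrongly believe the record had been queued for async crypto.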
      Reported-by: Gal Pressman <gal@nvidia.com>
      Fixes: 3547a1f9 ("tls: rx: use async as an in-out argument")
      Tested-by: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mt753x: fix pcs conversion regression · fae46308
      Russell King (Oracle) authored
      Daniel Golle reports that the conversion of mt753x to phylink PCS caused
      an oops as below.
      
      The problem is with the placement of the PCS initialisation, which
      occurs after mt7531_setup() has been called. However, buried in this
      function is a call to set up the CPU port, which requires the PCS
      structure to be already set up.
      
      Fix this by changing the initialisation order.
      
      Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
      Mem abort info:
        ESR = 0x96000005
        EC = 0x25: DABT (current EL), IL = 32 bits
        SET = 0, FnV = 0
        EA = 0, S1PTW = 0
        FSC = 0x05: level 1 translation fault
      Data abort info:
        ISV = 0, ISS = 0x00000005
        CM = 0, WnR = 0
      user pgtable: 4k pages, 39-bit VAs, pgdp=0000000046057000
      [0000000000000020] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
      Internal error: Oops: 96000005 [#1] SMP
      Modules linked in:
      CPU: 0 PID: 32 Comm: kworker/u4:1 Tainted: G S 5.18.0-rc3-next-20220422+ #0
      Hardware name: Bananapi BPI-R64 (DT)
      Workqueue: events_unbound deferred_probe_work_func
      pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      pc : mt7531_cpu_port_config+0xcc/0x1b0
      lr : mt7531_cpu_port_config+0xc0/0x1b0
      sp : ffffffc008d5b980
      x29: ffffffc008d5b990 x28: ffffff80060562c8 x27: 00000000f805633b
      x26: ffffff80001a8880 x25: 00000000000009c4 x24: 0000000000000016
      x23: ffffff8005eb6470 x22: 0000000000003600 x21: ffffff8006948080
      x20: 0000000000000000 x19: 0000000000000006 x18: 0000000000000000
      x17: 0000000000000001 x16: 0000000000000001 x15: 02963607fcee069e
      x14: 0000000000000000 x13: 0000000000000030 x12: 0101010101010101
      x11: ffffffc037302000 x10: 0000000000000870 x9 : ffffffc008d5b800
      x8 : ffffff800028f950 x7 : 0000000000000001 x6 : 00000000662b3000
      x5 : 00000000000002f0 x4 : 0000000000000000 x3 : ffffff800028f080
      x2 : 0000000000000000 x1 : ffffff800028f080 x0 : 0000000000000000
      Call trace:
       mt7531_cpu_port_config+0xcc/0x1b0
       mt753x_cpu_port_enable+0x24/0x1f0
       mt7531_setup+0x49c/0x5c0
       mt753x_setup+0x20/0x31c
       dsa_register_switch+0x8bc/0x1020
       mt7530_probe+0x118/0x200
       mdio_probe+0x30/0x64
       really_probe.part.0+0x98/0x280
       __driver_probe_device+0x94/0x140
       driver_probe_device+0x40/0x114
       __device_attach_driver+0xb0/0x10c
       bus_for_each_drv+0x64/0xa0
       __device_attach+0xa8/0x16c
       device_initial_probe+0x10/0x20
       bus_probe_device+0x94/0x9c
       deferred_probe_work_func+0x80/0xb4
       process_one_work+0x200/0x3a0
       worker_thread+0x260/0x4c0
       kthread+0xd4/0xe0
       ret_from_fork+0x10/0x20
      Code: 9409e911 937b7e60 8b0002a0 f9405800 (f9401005)
      ---[ end trace 0000000000000000 ]---
      Reported-by: Daniel Golle <daniel@makrotopia.org>
      Tested-by: Daniel Golle <daniel@makrotopia.org>
      Fixes: cbd1f243 ("net: dsa: mt7530: partially convert to phylink_pcs")
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/E1nj6FW-007WZB-5Y@rmk-PC.armlinux.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Eric Dumazet authored
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows move the cost of skb
      frees outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      skb payload has been consumed, meaning that BH handler has no chance
      to pick the skb before recvmsg() thread. This issue is more visible
      with BIG TCP, as more RPCs fit in one skb.
      
      For RFS, even if BH handler picks the skbs, they are still picked
      from the cpu on which user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
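
The deferral scheme described above can be modelled in a few lines of single-threaded userspace C. This is only a sketch under stated assumptions, not the kernel implementation: `struct buf`, `struct cpu_state`, `buf_defer_free()`, `drain_defer_list()` and the `DEFER_MAX` threshold are all hypothetical, the IPI is reduced to a flag, and the locking the real code needs (sd->defer_lock) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS   4
#define DEFER_MAX 2 /* illustrative threshold, not the kernel's value */

struct buf {
	int alloc_cpu;    /* cpu that originally allocated this buffer */
	bool freed;
	struct buf *next; /* single-linked list, as chosen in v2 */
};

struct cpu_state {
	struct buf *defer_list;
	int defer_count;
	bool rx_softirq_raised; /* models the IPI raising NET_RX_SOFTIRQ */
	int freed_here;         /* how many buffers were freed on this cpu */
};

static struct cpu_state cpus[NR_CPUS];

static void buf_free_on(struct buf *b, int cpu)
{
	b->freed = true;
	cpus[cpu].freed_here++;
}

/* Called by the consumer running on this_cpu once the payload is copied. */
static void buf_defer_free(struct buf *b, int this_cpu)
{
	struct cpu_state *sd;

	assert(b->alloc_cpu >= 0 && b->alloc_cpu < NR_CPUS);

	/* Do not defer when we already run on the allocating cpu. */
	if (b->alloc_cpu == this_cpu) {
		buf_free_on(b, this_cpu);
		return;
	}

	/* Queue on the allocating cpu's list; normally no further action. */
	sd = &cpus[b->alloc_cpu];
	b->next = sd->defer_list;
	sd->defer_list = b;

	/* If that cpu is not draining fast enough, kick it. */
	if (++sd->defer_count > DEFER_MAX)
		sd->rx_softirq_raised = true;
}

/* Models the drain at the end of the rx softirq handler on cpu. */
static void drain_defer_list(int cpu)
{
	struct cpu_state *sd = &cpus[cpu];
	struct buf *b = sd->defer_list;

	sd->defer_list = NULL;
	sd->defer_count = 0;
	sd->rx_softirq_raised = false;

	while (b) {
		struct buf *next = b->next;

		buf_free_on(b, cpu);
		b = next;
	}
}
```

The point of the model is that the free always happens on the allocating cpu: either immediately, when consumer and allocator coincide, or later when that cpu drains its own list.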
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      page recycling strategy used by NIC driver (its page pool capacity
      being too small compared to number of skbs/pages held in sockets
      receive queues).
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workload.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() look better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 26 Apr, 2022 4 commits
  3. 25 Apr, 2022 19 commits
  4. 23 Apr, 2022 12 commits
    • Merge branch 'dsa-selftests' · cfc1d91a
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      DSA selftests
      
      When working on complex new features or reworks it becomes increasingly
      difficult to ensure there aren't regressions being introduced, and
      therefore it would be nice if we could go over the functionality we
      already have and write some tests for it.
      
      Verbally I know from Tobias Waldekranz that he has been working on some
      selftests for DSA, yet I have never seen them, so here I am adding some
      tests I have written which have been useful for me. The list is by no
      means complete (it only covers elementary functionality), but it's still
      good to have as a starting point. I also borrowed some refactoring
      changes from Joachim Wiberg that he submitted for his "net: bridge:
      forwarding of unknown IPv4/IPv6/MAC BUM traffic" series, but not the
      entirety of his selftests. I now think that his selftests have some
      overlap with bridge_vlan_unaware.sh and bridge_vlan_aware.sh and they
      should be more tightly integrated with each other - yet I didn't do that
      either :). Another issue I had with his selftests was that they jumped
      straight ahead to configure brport flags on br0 (a radical new idea
      still at RFC status) while we have bigger problems, and we don't have
      nearly enough coverage for the *existing* functionality.
      
      One idea introduced here which I haven't seen before is the symlinking
      of relevant forwarding selftests to the selftests/drivers/net/<my-driver>/
      folder, plus a forwarding.config file. I think there's some value in
      having things structured this way, since the forwarding dir has so many
      selftests that aren't relevant to DSA that it is a bit difficult to find
      the ones that are.
      
      While searching for applications that I could use for multicast testing
      (not my domain of interest/knowledge really), I found Joachim Wiberg's
      mtools, mcjoin and omping, and I tried them all with various degrees of
      success. In particular, I was going to use mcjoin, but I faced some
      issues getting IPv6 multicast traffic to work in a VRF, and I bothered
      David Ahern about it here:
      https://lore.kernel.org/netdev/97eaffb8-2125-834e-641f-c99c097b6ee2@gmail.com/t/
      It seems that the problem is that this application should use
      SO_BINDTODEVICE, yet it doesn't.
      
      So I ended up patching the bare-bones mtools (msend, mreceive) forked by
      Joachim from the University of Virginia's Multimedia Networks Group to
      include IPv6 support, and to use SO_BINDTODEVICE. This is what I'm using
      now for IPv6.
      
      Note that mausezahn doesn't appear to do a particularly good job of
      supporting IPv6 really, and I needed a program to emit the actual
      IP_ADD_MEMBERSHIP calls, for dev_mc_add(), so I could test RX filtering.
      Crafting the IGMP/MLD reports by hand doesn't really do the trick.
      While extremely bare-bones, the mreceive application now seems to do
      what I need it to.
      
      Feedback appreciated, it is very likely that I could have done things in
      a better way.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: drivers: dsa: add a subset of forwarding selftests · 07c8a2dd
      Vladimir Oltean authored
      This adds an initial subset of forwarding selftests which I considered
      to be relevant for DSA drivers, along with a forwarding.config that
      makes it easier to run them (disables veth pair creation, makes sure MAC
      addresses are unique and stable).
      
      The intention is to request driver writers to run these selftests during
      review and make sure that the tests pass, or at least that the problems
      are known.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add a test for local_termination.sh · 90b9566a
      Vladimir Oltean authored
      This tests the capability of switch ports to filter out undesired
      traffic. Different drivers are expected to have different capabilities
      here (so some may fail and some may pass), yet the test still has some
      value, for example to check for regressions.
      
      There are 2 kinds of failures, one is when a packet which should have
      been accepted isn't (and that should be fixed), and the other "failure"
      (as reported by the test) is when a packet could have been filtered out
      (for being unnecessary) yet it was received.
      
      The bridge driver fares particularly badly at this test:
      
      TEST: br0: Unicast IPv4 to primary MAC address                      [ OK ]
      TEST: br0: Unicast IPv4 to macvlan MAC address                      [ OK ]
      TEST: br0: Unicast IPv4 to unknown MAC address                      [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Unicast IPv4 to unknown MAC address, promisc             [ OK ]
      TEST: br0: Unicast IPv4 to unknown MAC address, allmulti            [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv4 to joined group                           [ OK ]
      TEST: br0: Multicast IPv4 to unknown group                          [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv4 to unknown group, promisc                 [ OK ]
      TEST: br0: Multicast IPv4 to unknown group, allmulti                [ OK ]
      TEST: br0: Multicast IPv6 to joined group                           [ OK ]
      TEST: br0: Multicast IPv6 to unknown group                          [FAIL]
              reception succeeded, but should have failed
      TEST: br0: Multicast IPv6 to unknown group, promisc                 [ OK ]
      TEST: br0: Multicast IPv6 to unknown group, allmulti                [ OK ]
      
      mainly because it does not implement IFF_UNICAST_FLT. Yet I still think
      having the test (with the failures) is useful in case somebody wants to
      tackle that problem in the future, to make an easy before-and-after
      comparison.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add a no_forwarding.sh test · 476a4f05
      Vladimir Oltean authored
      Bombard a standalone switch port with various kinds of traffic to ensure
      it is really standalone and doesn't leak packets to other switch ports.
      Also check for switch ports in different bridges, and switch ports in a
      VLAN-aware bridge but having different pvids. No forwarding should take
      place in either case.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add helper for retrieving IPv6 link-local address of interface · a5114df6
      Vladimir Oltean authored
      Pinging an IPv6 link-local multicast address selects the link-local
      unicast address of the interface as source, and we'd like to monitor for
      that in tcpdump.
      
      Add a helper to the forwarding library which retrieves the link-local
      IPv6 address of an interface, to make that task easier.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add helpers for IP multicast group joins/leaves · f23cddc7
      Vladimir Oltean authored
      Extend the forwarding library with calls to some small C programs which
      join an IP multicast group and send some packets to it. Both IPv4 and
      IPv6 groups are supported. Use cases range from testing IGMP/MLD
      snooping, to RX filtering, to multicast routing.
      
      Testing multicast traffic using msend/mreceive is intended to be done
      using tcpdump.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: multiple instances in tcpdump helper · 6182c5c5
      Joachim Wiberg authored
      Extend tcpdump_start() & co. to handle multiple instances.  Useful when
      observing bridge operation, e.g., unicast learning/flooding, and any
      case of multicast distribution (to these ports but not that one ...).
      
      This means the interface argument is now a mandatory argument to all
      tcpdump_*() functions, hence the changes to the ocelot flower test.
      Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add TCPDUMP_EXTRA_FLAGS to lib.sh · fe32dffd
      Joachim Wiberg authored
      For some use-cases we may want to change the tcpdump flags used in
      tcpdump_start().  For instance, observing interfaces without the PROMISC
      flag, e.g. to see what's really being forwarded to the bridge interface.
      Signed-off-by: Joachim Wiberg <troglobit@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: add option to run tests with stable MAC addresses · b343734e
      Vladimir Oltean authored
      By default, DSA switch ports inherit their MAC address from the DSA
      master.
      
      This works well for practical situations, but some selftests like
      bridge_vlan_unaware.sh loop back 2 standalone DSA ports with 2 bridged
      DSA ports, and require the bridge to forward packets between the
      standalone ports.
      
      Due to the bridge seeing that the MAC DA it needs to forward is present
      as a local FDB entry (it coincides with the MAC address of the bridge
      ports), the test packets are not forwarded, but terminated locally on
      br0. In turn, this makes the ping and ping6 tests fail.
      
      Address this by introducing an option to have stable MAC addresses.
      When mac_addr_prepare is called, the current addresses of the netifs are
      saved and replaced with 00:01:02:03:04:${netif number}. Then when
      mac_addr_restore is called at the end of the test, the original MAC
      addresses are restored. This ensures that the MAC addresses are unique,
      which makes the test pass even for DSA ports.
      
      The usage model is for the behavior to be opt-in via STABLE_MAC_ADDRS,
      which DSA should set to true, all others behave as before. By hooking
      the calls to mac_addr_prepare and mac_addr_restore within the forwarding
      lib itself, we do not need to patch each individual selftest, the only
      requirement is that pre_cleanup is called.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mptcp-tcp-fallback' · 988998ac
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: TCP fallback for established connections
      
      RFC 8684 allows some MPTCP connections to fall back to regular TCP when
      the MPTCP DSS checksum detects middlebox interference, there is only a
      single subflow, and there is no unacknowledged out-of-sequence
      data. When this condition is detected, the stack sends an MPTCP DSS
      option with an "infinite mapping" to signal that a fallback is
      happening, and the peers will stop sending MPTCP options in their TCP
      headers. The Linux MPTCP stack has not yet supported this type of
      fallback, instead closing the connection when the MPTCP checksum fails.
      
      This series adds support for fallback to regular TCP in a more limited
      scenario, for only MPTCP connections that have never connected
      additional subflows or transmitted out-of-sequence data. The selftests
      are also updated to check new MIBs that track infinite mappings.
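
The eligibility rule of this series can be read as a simple predicate. The sketch below is a hedged illustration only: `struct conn_state`, `enum csum_action` and `on_checksum_result()` are hypothetical names, and the assumption that ineligible connections keep the older behaviour (closing on checksum failure) is inferred from the text above rather than taken from the code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection history relevant to the fallback decision. */
struct conn_state {
	bool csum_failed;            /* DSS checksum detected interference */
	bool ever_had_extra_subflow; /* an additional subflow was ever connected */
	bool ever_sent_out_of_seq;   /* out-of-sequence data was ever transmitted */
};

enum csum_action {
	ACTION_NONE,                  /* checksum fine, keep using MPTCP */
	ACTION_FALLBACK_INFINITE_MAP, /* signal fallback with an infinite mapping */
	ACTION_CLOSE                  /* assumed: previous behaviour is kept */
};

/* Decide what to do after a DSS checksum verification. */
static enum csum_action on_checksum_result(const struct conn_state *c)
{
	assert(c);

	if (!c->csum_failed)
		return ACTION_NONE;

	/* Limited scenario: only connections with a "clean" history fall back. */
	if (!c->ever_had_extra_subflow && !c->ever_sent_out_of_seq)
		return ACTION_FALLBACK_INFINITE_MAP;

	return ACTION_CLOSE;
}
```

Tracking "ever" rather than "currently" is what makes this scenario narrower than what RFC 8684 permits: a connection that once had a second subflow stays ineligible even after that subflow is gone.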
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mptcp: add infinite map mibs check · 8bd03be3
      Geliang Tang authored
      This patch adds a function chk_infi_nr() to check the mibs for the
      infinite mapping. Invoke it in chk_join_nr() when validate_checksum
      is set.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: dump infinite_map field in mptcp_dump_mpext · d9fdd02d
      Geliang Tang authored
      In trace event class mptcp_dump_mpext, dump the newly added infinite_map
      field of struct mptcp_dump_mpext too.
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>