15 Feb, 2021 (15 commits)
  13 Feb, 2021 (25 commits)
    • Merge branch 'skbuff-introduce-skbuff_heads-bulking-and-reusing' · c4762993
      David S. Miller authored
      Alexander Lobakin says:
      
      ====================
      skbuff: introduce skbuff_heads bulking and reusing
      
      Currently, all kinds of skb allocation allocate skbuff_heads
      one by one via kmem_cache_alloc().
      On the other hand, we have the percpu napi_alloc_cache to store
      skbuff_heads queued up for freeing and flush them in bulk.
      
      We can use this cache not only for bulk-wiping, but also to obtain
      heads for new skbs and avoid unconditional allocations, as well as
      for bulk-allocating (like XDP's cpumap code and veth driver already
      do).
      
      As this might affect latencies, cache pressure and lots of hardware-
      and driver-dependent behaviour, this new feature is mostly optional
      and can be enabled via:
       - a new napi_build_skb() function (as a replacement for build_skb());
       - existing {,__}napi_alloc_skb() and napi_get_frags() functions;
       - __alloc_skb() with passing SKB_ALLOC_NAPI in flags.
      
      iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing
      VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be bigger
      on more powerful hosts and on NICs handling tens of Mpps.
      
      Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs:
       - kmalloc()/kmem_cache_alloc() itself allows, by default, allocating
         memory from remote nodes in order to defragment their slabs. This
         is controlled by a sysctl, so by that logic an skbuff_head from a
         remote node is an OK case;
       - The easiest way to check if the slab of skbuff_head is remote or
         pfmemalloc'ed is:
      
      	if (!dev_page_is_reusable(virt_to_head_page(skb)))
      		/* drop it */;
      
         ...*but*, given that most slabs are built from compound pages,
         virt_to_head_page() will hit its unlikely branch on every single
         call. This check cost at least 20 Mbps in test scenarios, so it
         seems better _not_ to do it.
      
      Since v5 [4]:
       - revert flags-to-bool conversion and simplify flags testing in
         __alloc_skb() (Alexander Duyck).
      
      Since v4 [3]:
       - rebase on top of net-next and address kernel build robot issue;
       - reorder checks a bit in __alloc_skb() to make the new condition
         even more harmless.
      
      Since v3 [2]:
       - make the feature mostly optional, so driver developers could
         decide whether to use it or not (Paolo Abeni).
         This reuses the old flag for __alloc_skb() and introduces
         a new napi_build_skb();
       - reduce bulk-allocation size from 32 to 16 elements (also Paolo).
         This equals the value used by XDP's devmap and veth batch processing
         (which were tested a lot) and should be sane enough;
       - don't waste cycles on explicit in_serving_softirq() check.
      
      Since v2 [1]:
       - also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy
         after the changes that pass tiny skb requests to the kmalloc layer);
       - cover the cache with KASAN instrumentation (suggested by Eric
         Dumazet, help of Dmitry Vyukov);
       - completely drop redundant __kfree_skb_flush() (also Eric);
       - lots of code cleanups;
       - expand the commit message with NUMA and pfmemalloc points (Jakub).
      
      Since v1 [0]:
       - use one unified cache instead of two separate ones to greatly
         simplify the logic and reduce hotpath overhead (Edward Cree);
       - new: recycle also GRO_MERGED_FREE skbs instead of immediate
         freeing;
       - correct performance numbers after optimizations and performing
         lots of tests for different use cases.
      
      [0] https://lore.kernel.org/netdev/20210111182655.12159-1-alobakin@pm.me
      [1] https://lore.kernel.org/netdev/20210113133523.39205-1-alobakin@pm.me
      [2] https://lore.kernel.org/netdev/20210209204533.327360-1-alobakin@pm.me
      [3] https://lore.kernel.org/netdev/20210210162732.80467-1-alobakin@pm.me
      [4] https://lore.kernel.org/netdev/20210211185220.9753-1-alobakin@pm.me
      ====================
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing · 9243adfc
      Alexander Lobakin authored
      napi_frags_finish() and napi_skb_finish() can only be called inside
      NAPI Rx context, so we can feed the NAPI cache with skbuff_heads that
      got the NAPI_MERGED_FREE verdict instead of freeing them immediately.
      Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
      and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
      into the NAPI cache.
      As many drivers call napi_alloc_skb()/napi_get_frags() on their
      receive path, this becomes especially useful.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: allow to use NAPI cache from __napi_alloc_skb() · cfb8ec65
      Alexander Lobakin authored
      {,__}napi_alloc_skb() is mostly used for optional non-linear
      receive methods (usually controlled via Ethtool private flags and off
      by default) and/or for Rx copybreaks.
      Use __napi_build_skb() here to obtain skbuff_heads from the NAPI cache
      instead of allocating them in place. This covers both the kmalloc and
      page frag paths.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: allow to optionally use NAPI cache from __alloc_skb() · d13612b5
      Alexander Lobakin authored
      Reuse the old and forgotten SKB_ALLOC_NAPI flag to add an option to
      get an skbuff_head from the NAPI cache instead of allocating it in
      place inside __alloc_skb().
      This implies that the function is called from softirq or BH-disabled
      context, and not for allocating a clone or from a distant node.
      
      Cc: Alexander Duyck <alexander.duyck@gmail.com> # Simplified flags check
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: introduce {,__}napi_build_skb() which reuses NAPI cache heads · f450d539
      Alexander Lobakin authored
      Instead of just bulk-flushing skbuff_heads queued up through
      napi_consume_skb() or __kfree_skb_defer(), try to reuse them
      on the allocation path.
      If the cache is empty on allocation, bulk-allocate the first
      16 elements, which is more efficient than per-skb allocation.
      If the cache is full on freeing, bulk-wipe the second half of
      the cache (32 elements).
      This also includes custom KASAN poisoning/unpoisoning to be
      doubly sure there are no use-after-free cases.
      
      To not change current behaviour, introduce a new function,
      napi_build_skb(), to optionally use a new approach later
      in drivers.
      
      Note on selected bulk size, 16:
       - this equals XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE
         and especially VETH_XDP_BATCH, which is also used to
         bulk-allocate skbuff_heads and was tested on powerful
         setups;
       - this also showed the best performance in the actual
         test series (out of {8, 16, 32}).
      
      Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves
      Suggested-by: Eric Dumazet <edumazet@google.com>   # KASAN poisoning
      Cc: Dmitry Vyukov <dvyukov@google.com>             # Help with KASAN
      Cc: Paolo Abeni <pabeni@redhat.com>                # Reduced batch size
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: move NAPI cache declarations upper in the file · 50fad4b5
      Alexander Lobakin authored
      NAPI cache structures will be used for allocating skbuff_heads,
      so move their declarations further up in the file.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: remove __kfree_skb_flush() · fec6e49b
      Alexander Lobakin authored
      This function isn't really needed, as the NAPI skb queue gets
      bulk-freed anyway when there's no more room, and it may even reduce
      the efficiency of bulk operations.
      It will be even less needed after reusing the skb cache on the
      allocation path, so remove it and lighten network softirqs a bit.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: use __build_skb_around() in __alloc_skb() · f9d6725b
      Alexander Lobakin authored
      Just call __build_skb_around() instead of open-coding it.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: simplify __alloc_skb() a bit · df1ae022
      Alexander Lobakin authored
      Use unlikely() annotations for skbuff_head and data similarly to the
      two other allocation functions and remove totally redundant goto.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: make __build_skb_around() return void · 483126b3
      Alexander Lobakin authored
      __build_skb_around() can never fail and always returns the passed skb.
      Make it return void to simplify and optimize the code.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: simplify kmalloc_reserve() · ef28095f
      Alexander Lobakin authored
      Ever since the introduction of __kmalloc_reserve(), its "ip" argument
      has been unused: _RET_IP_ is embedded inside
      kmalloc_node_track_caller().
      Remove the redundant macro and rename the function accordingly.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skbuff: move __alloc_skb() next to the other skb allocation functions · 5381b23d
      Alexander Lobakin authored
      In preparation for reusing several functions in all three skb
      allocation variants, move __alloc_skb() next to
      __netdev_alloc_skb() and __napi_alloc_skb().
      No functional changes.
      Signed-off-by: Alexander Lobakin <alobakin@pm.me>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Xilinx-axienet-updates' · 773dc50d
      David S. Miller authored
      Robert Hancock says:
      
      ====================
      Xilinx axienet updates
      
      Updates to the Xilinx AXI Ethernet driver to add support for an additional
      ethtool operation, and to support dynamic switching between 1000BaseX and
      SGMII interface modes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: axienet: Support dynamic switching between 1000BaseX and SGMII · 6c8f06bb
      Robert Hancock authored
      Newer versions of the Xilinx AXI Ethernet core (specifically version
      7.2 or later) can be configured with a PHY interface mode of "Both",
      allowing either 1000BaseX or SGMII mode to be selected at runtime. Add
      support for this in the driver to better support applications which
      can use both fiber and copper SFP modules.
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dt-bindings: net: xilinx_axienet: add xlnx,switch-x-sgmii attribute · eceac9d2
      Robert Hancock authored
      Document the new xlnx,switch-x-sgmii attribute which is used to indicate
      that the Ethernet core supports dynamic switching between 1000BaseX and
      SGMII.
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: axienet: hook up nway_reset ethtool operation · 66b51663
      Robert Hancock authored
      Hook up the nway_reset ethtool operation to the corresponding phylink
      function so that "ethtool -r" can be supported.
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-mem-pressure-vs-SO_RCVLOWAT' · 762d17b9
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: mem pressure vs SO_RCVLOWAT
      
      First patch fixes an issue for applications using SO_RCVLOWAT
      to reduce context switches.
      
      Second patch is a cleanup.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: factorize logic into tcp_epollin_ready() · 05dc72ab
      Eric Dumazet authored
      Both tcp_data_ready() and tcp_stream_is_readable() share the same logic.
      
      Add tcp_epollin_ready() helper to avoid duplication.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix SO_RCVLOWAT related hangs under mem pressure · f969dc5a
      Eric Dumazet authored
      While commit 24adbc16 ("tcp: fix SO_RCVLOWAT hangs with fat skbs")
      fixed an issue with a too-small sk_rcvbuf for a given sk_rcvlowat
      constraint, it failed to address the issue caused by memory pressure.
      
      1) If we are under memory pressure and the socket receive queue is
      empty, the first incoming packet is allowed to be queued, after commit
      76dfa608 ("tcp: allow one skb to be received per socket under memory pressure").
      
      But we do not send EPOLLIN yet, in case tcp_data_ready() sees that
      sk_rcvlowat is bigger than the skb length.
      
      2) Then, when the next packet comes, it is dropped, and we directly
      call sk->sk_data_ready().
      
      3) If the application is using poll(), tcp_poll() will then use
      tcp_stream_is_readable() and decide the socket receive queue is
      not yet filled, so nothing will happen.
      
      Even when the sender retransmits packets, steps 2) and 3) repeat
      and the flow is effectively frozen until memory pressure goes away.
      
      The fix is to take tcp_under_memory_pressure() into account, covering
      both global memory pressure and memcg pressure.
      
      Fixes: 24adbc16 ("tcp: fix SO_RCVLOWAT hangs with fat skbs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Arjun Roy <arjunroy@google.com>
      Suggested-by: Wei Wang <weiwan@google.com>
      Reviewed-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · 5cdaf9d6
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2021-02-12
      
      This series contains updates to i40e, ice, and ixgbe drivers.
      
      Maciej does cleanups on the following drivers.
      For i40e, removes a redundant check for the XDP prog, cleans up no
      longer relevant information, and removes an unused function argument.
      For ice, removes local variable use, instead returning values directly;
      moves the skb pointer from the buffer to the ring and removes an
      unneeded check for xdp_prog in the zero-copy path; also removes a
      redundant MTU check when changing it.
      For i40e, ice, and ixgbe, stores rx_offset in the Rx ring, as the
      value is constant, so there's no need for continual recalculation.
      
      Bjorn folds a decrement into a while statement.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tc-mpls-selftests' · 7aceeb73
      David S. Miller authored
      Guillaume Nault says:
      
      ====================
      selftests: tc: Test tc-flower's MPLS features
      
      A couple of patches for exercising the MPLS filters of tc-flower.
      
      Patch 1 tests basic MPLS matching features: those that only work on the
      first label stack entry (that is, the mpls_label, mpls_tc, mpls_bos and
      mpls_ttl options).
      
      Patch 2 tests the more generic "mpls" and "lse" options, which allow
      matching MPLS fields beyond the first stack entry.
      
      In both patches, special care is taken to skip these new tests for
      incompatible versions of tc.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: tc: Add generic mpls matching support for tc-flower · c09bfd9a
      Guillaume Nault authored
      Add tests in tc_flower.sh for generic matching on MPLS Label Stack
      Entries. The label, tc, bos and ttl fields are tested for the first
      and second labels. For each field, the minimal and maximal values are
      tested (the former at depth 1 and the latter at depth 2).
      There are also tests for matching the presence of a label stack entry
      at a given depth.
      
      In order to reduce the amount of code, all "lse" subcommands are tested
      in match_mpls_lse_test(). Action "continue" is used, so that test
      packets are evaluated by all filters. Then, we can verify if each
      filter matched the expected number of packets.
      
      Some versions of tc-flower produced invalid json output when dumping
      MPLS filters with depth > 1. Skip the test if tc isn't recent enough.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: tc: Add basic mpls_* matching support for tc-flower · 203ee5cd
      Guillaume Nault authored
      Add tests in tc_flower.sh for mpls_label, mpls_tc, mpls_bos and
      mpls_ttl. For each keyword, test the minimal and maximal values.
      
      Selectively skip these new mpls tests for tc versions that don't
      support them.
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'brport-flags' · 4098ced4
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Cleanup in brport flags switchdev offload for DSA
      
      The initial goal of this series was to have better support for
      standalone ports mode on the DSA drivers like ocelot/felix and sja1105.
      This turned out to require some API adjustments in both directions:
      to the information presented to and by the switchdev notifier, and to
      the API presented to the switch drivers by the DSA layer.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: sja1105: offload bridge port flags to device · 4d942354
      Vladimir Oltean authored
      The chip can configure unicast flooding, broadcast flooding and learning.
      Learning is per port, while flooding is per {ingress, egress} port pair
      and we need to configure the same value for all possible ingress ports
      towards the requested one.
      
      While multicast flooding is not officially supported, we can hack it by
      using a feature of the second-generation (P/Q/R/S) devices: FDB entries
      are maskable, and multicast addresses always have an odd first octet.
      So by putting a match-all entry with address 00:01:00:00:00:00 and mask
      00:01:00:00:00:00 at the end of the FDB, we make sure that it is always
      checked last and does not take precedence over any other MDB entry, so
      it effectively behaves as an unknown-multicast entry.
      
      For the first-generation switches, this feature is not available, so
      unknown multicast will always be treated the same as unknown unicast.
      The only thing we can do is ask the user to offload the settings
      for these 2 flags in tandem, i.e.
      
      ip link set swp2 type bridge_slave flood off
      Error: sja1105: This chip cannot configure multicast flooding independently of unicast.
      ip link set swp2 type bridge_slave flood off mcast_flood off
      ip link set swp2 type bridge_slave mcast_flood on
      Error: sja1105: This chip cannot configure multicast flooding independently of unicast.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>