1. 02 Feb, 2023 24 commits
    • Merge branch 'net-sched-transition-act_pedit-to-rcu-and-percpu-stats' · 8b6f322e
      Paolo Abeni authored
      Pedro Tammela says:
      
      ====================
      net/sched: transition act_pedit to rcu and percpu stats
      
      The software pedit action didn't get the same love as some of the
      other actions and it's still using spinlocks and shared stats.
      Therefore, transition the action to rcu and percpu stats which
      improves the action's performance.
      
      We test this change with a very simple packet forwarding setup:
      
      tc filter add dev ens2f0 ingress protocol ip matchall \
         action pedit ex munge eth src set b8:ce:f6:4b:68:35 pipe \
         action pedit ex munge eth dst set ac:1f:6b:e4:ff:93 pipe \
         action mirred egress redirect dev ens2f1
      tc filter add dev ens2f1 ingress protocol ip matchall \
         action pedit ex munge eth src set b8:ce:f6:4b:68:34 pipe \
         action pedit ex munge eth dst set ac:1f:6b:e4:ff:92 pipe \
         action mirred egress redirect dev ens2f0
      
      Using TRex with an HTTP-like profile, in our setup with a 25G NIC
      and a 26-core Intel CPU, we observe the following in perf:
         before:
          11.59%  2.30%  [kernel]  [k] tcf_pedit_act
             2.55% tcf_pedit_act
                   8.38% _raw_spin_lock
                             6.43% native_queued_spin_lock_slowpath
         after:
          1.46%  1.46%  [kernel]  [k] tcf_pedit_act
      
      tdc results for pedit after the patch:
      1..69
      ok 1 319a - Add pedit action that mangles IP TTL
      ok 2 7e67 - Replace pedit action with invalid goto chain
      ok 3 377e - Add pedit action with RAW_OP offset u32
      ok 4 a0ca - Add pedit action with RAW_OP offset u32 (INVALID)
      ok 5 dd8a - Add pedit action with RAW_OP offset u16 u16
      ok 6 53db - Add pedit action with RAW_OP offset u16 (INVALID)
      ok 7 5c7e - Add pedit action with RAW_OP offset u8 add value
      ok 8 2893 - Add pedit action with RAW_OP offset u8 quad
      ok 9 3a07 - Add pedit action with RAW_OP offset u8-u16-u8
      ok 10 ab0f - Add pedit action with RAW_OP offset u16-u8-u8
      ok 11 9d12 - Add pedit action with RAW_OP offset u32 set u16 clear u8 invert
      ok 12 ebfa - Add pedit action with RAW_OP offset overflow u32 (INVALID)
      ok 13 f512 - Add pedit action with RAW_OP offset u16 at offmask shift set
      ok 14 c2cb - Add pedit action with RAW_OP offset u32 retain value
      ok 15 1762 - Add pedit action with RAW_OP offset u8 clear value
      ok 16 bcee - Add pedit action with RAW_OP offset u8 retain value
      ok 17 e89f - Add pedit action with RAW_OP offset u16 retain value
      ok 18 c282 - Add pedit action with RAW_OP offset u32 clear value
      ok 19 c422 - Add pedit action with RAW_OP offset u16 invert value
      ok 20 d3d3 - Add pedit action with RAW_OP offset u32 invert value
      ok 21 57e5 - Add pedit action with RAW_OP offset u8 preserve value
      ok 22 99e0 - Add pedit action with RAW_OP offset u16 preserve value
      ok 23 1892 - Add pedit action with RAW_OP offset u32 preserve value
      ok 24 4b60 - Add pedit action with RAW_OP negative offset u16/u32 set value
      ok 25 a5a7 - Add pedit action with LAYERED_OP eth set src
      ok 26 86d4 - Add pedit action with LAYERED_OP eth set src & dst
      ok 27 f8a9 - Add pedit action with LAYERED_OP eth set dst
      ok 28 c715 - Add pedit action with LAYERED_OP eth set src (INVALID)
      ok 29 8131 - Add pedit action with LAYERED_OP eth set dst (INVALID)
      ok 30 ba22 - Add pedit action with LAYERED_OP eth type set/clear sequence
      ok 31 dec4 - Add pedit action with LAYERED_OP eth set type (INVALID)
      ok 32 ab06 - Add pedit action with LAYERED_OP eth add type
      ok 33 918d - Add pedit action with LAYERED_OP eth invert src
      ok 34 a8d4 - Add pedit action with LAYERED_OP eth invert dst
      ok 35 ee13 - Add pedit action with LAYERED_OP eth invert type
      ok 36 7588 - Add pedit action with LAYERED_OP ip set src
      ok 37 0fa7 - Add pedit action with LAYERED_OP ip set dst
      ok 38 5810 - Add pedit action with LAYERED_OP ip set src & dst
      ok 39 1092 - Add pedit action with LAYERED_OP ip set ihl & dsfield
      ok 40 02d8 - Add pedit action with LAYERED_OP ip set ttl & protocol
      ok 41 3e2d - Add pedit action with LAYERED_OP ip set ttl (INVALID)
      ok 42 31ae - Add pedit action with LAYERED_OP ip ttl clear/set
      ok 43 486f - Add pedit action with LAYERED_OP ip set duplicate fields
      ok 44 e790 - Add pedit action with LAYERED_OP ip set ce, df, mf, firstfrag, nofrag fields
      ok 45 cc8a - Add pedit action with LAYERED_OP ip set tos
      ok 46 7a17 - Add pedit action with LAYERED_OP ip set precedence
      ok 47 c3b6 - Add pedit action with LAYERED_OP ip add tos
      ok 48 43d3 - Add pedit action with LAYERED_OP ip add precedence
      ok 49 438e - Add pedit action with LAYERED_OP ip clear tos
      ok 50 6b1b - Add pedit action with LAYERED_OP ip clear precedence
      ok 51 824a - Add pedit action with LAYERED_OP ip invert tos
      ok 52 106f - Add pedit action with LAYERED_OP ip invert precedence
      ok 53 6829 - Add pedit action with LAYERED_OP beyond ip set dport & sport
      ok 54 afd8 - Add pedit action with LAYERED_OP beyond ip set icmp_type & icmp_code
      ok 55 3143 - Add pedit action with LAYERED_OP beyond ip set dport (INVALID)
      ok 56 815c - Add pedit action with LAYERED_OP ip6 set src
      ok 57 4dae - Add pedit action with LAYERED_OP ip6 set dst
      ok 58 fc1f - Add pedit action with LAYERED_OP ip6 set src & dst
      ok 59 6d34 - Add pedit action with LAYERED_OP ip6 dst retain value (INVALID)
      ok 60 94bb - Add pedit action with LAYERED_OP ip6 traffic_class
      ok 61 6f5e - Add pedit action with LAYERED_OP ip6 flow_lbl
      ok 62 6795 - Add pedit action with LAYERED_OP ip6 set payload_len, nexthdr, hoplimit
      ok 63 1442 - Add pedit action with LAYERED_OP tcp set dport & sport
      ok 64 b7ac - Add pedit action with LAYERED_OP tcp sport set (INVALID)
      ok 65 cfcc - Add pedit action with LAYERED_OP tcp flags set
      ok 66 3bc4 - Add pedit action with LAYERED_OP tcp set dport, sport & flags fields
      ok 67 f1c8 - Add pedit action with LAYERED_OP udp set dport & sport
      ok 68 d784 - Add pedit action with mixed RAW/LAYERED_OP #1
      ok 69 70ca - Add pedit action with mixed RAW/LAYERED_OP #2
      ====================
      
      Link: https://lore.kernel.org/r/20230131190512.3805897-1-pctammela@mojatatu.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net/sched: simplify tcf_pedit_act · 95b06938
      Pedro Tammela authored
      Remove the check for a negative number of keys, as
      this can never happen.
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net/sched: transition act_pedit to rcu and percpu stats · 52cf89f7
      Pedro Tammela authored
      The software pedit action didn't get the same love as some of the
      other actions and it's still using spinlocks and shared stats in the
      datapath.
      Transition the action to rcu and percpu stats as this improves the
      action's performance dramatically on multiple cpu deployments.
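The per-CPU statistics idea can be illustrated with a small userspace model. This is only a sketch under assumptions: the type and function names are hypothetical, the CPU count is fixed at 4, and the real action uses the kernel's percpu and RCU infrastructure rather than a plain array.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4 /* hypothetical CPU count for this sketch */

/* One counter slot per CPU: a writer only touches its own slot, so the
 * hot path needs no shared lock. */
struct percpu_stats_sketch {
	uint64_t packets[NR_CPUS_SKETCH];
};

/* Datapath side: bump the counter that belongs to the given CPU. */
static void stats_inc(struct percpu_stats_sketch *s, unsigned int cpu)
{
	s->packets[cpu % NR_CPUS_SKETCH]++;
}

/* Control-path side: fold all per-CPU slots into one total on read. */
static uint64_t stats_read(const struct percpu_stats_sketch *s)
{
	uint64_t total = 0;

	for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		total += s->packets[cpu];
	return total;
}
```

The cost moves from every packet (a contended spinlock) to the rare reader, which has to sum the slots.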
      Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge tag 'rxrpc-next-20230131' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · a8248fc4
      Paolo Abeni authored
      David Howells says:
      
      ====================
      Here's the fifth part of patches in the process of moving rxrpc from doing
      a lot of its stuff in softirq context to doing it in an I/O thread in
      process context and thereby making it easier to support a larger SACK
      table.
      
      The full description is in the description for the first part[1] which is
      now upstream.  The second and third parts are also upstream[2].  A subset
      of the original fourth part[3] got applied as a fix for a race[4].
      
      The fifth part includes some cleanups:
      
       (1) Miscellaneous trace header cleanups: fix a trace string, display the
           security index in rx_packet rather than displaying the type twice,
           remove some whitespace to make checkpatch happier and remove some
           excess tabulation.
      
       (2) Convert ->recvmsg_lock to a spinlock as it's only ever locked
           exclusively.
      
       (3) Make ->ackr_window and ->ackr_nr_unacked non-atomic as they're only
           used in the I/O thread.
      
       (4) Don't use call->tx_lock to access ->tx_buffer as that is only accessed
           inside the I/O thread.  sendmsg() loads onto ->tx_sendmsg and the I/O
           thread decants from that to the buffer.
      
       (5) Remove local->defrag_sem as DATA packets are transmitted serially by
           the I/O thread.
      
       (6) Remove the service connection bundle as it was only used for its
           channel_lock - which has now gone.
      
      And some more significant changes:
      
       (7) Add a debugging option to allow a delay to be injected into packet
           reception to help investigate the behaviour over longer links than
           just a few cm.
      
       (8) Generate occasional PING ACKs to probe for RTT information during a
           receive heavy call.
      
       (9) Simplify the SACK table maintenance and ACK generation.  Now that both
           parts are done in the same thread, there's no possibility of a race
           and no need to try and be cunning to avoid taking a BH spinlock whilst
           copying the SACK table (which in the future will be up to 2K) and no
           need to rotate the copy to fit the ACK packet table.
      
      (10) Use SKB_CONSUMED when freeing received DATA packets (stop dropwatch
           complaining).
      
      * tag 'rxrpc-next-20230131' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
        rxrpc: Kill service bundle
        rxrpc: Change rx_packet tracepoint to display securityIndex not type twice
        rxrpc: Show consumed and freed packets as non-dropped in dropwatch
        rxrpc: Remove local->defrag_sem
        rxrpc: Don't lock call->tx_lock to access call->tx_buffer
        rxrpc: Simplify ACK handling
        rxrpc: De-atomic call->ackr_window and call->ackr_nr_unacked
        rxrpc: Generate extra pings for RTT during heavy-receive call
        rxrpc: Allow a delay to be injected into packet reception
        rxrpc: Convert call->recvmsg_lock to a spinlock
        rxrpc: Shrink the tabulation in the rxrpc trace header a bit
        rxrpc: Remove whitespace before ')' in trace header
        rxrpc: Fix trace string
      ====================
      
      Link: https://lore.kernel.org/all/20230131171227.3912130-1-dhowells@redhat.com/
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • octeontx2-af: Removed unnecessary debug messages. · 609aa68d
      Sunil Goutham authored
      The NPC exact match feature is supported on only one silicon
      variant. Remove the debug messages that report the feature as
      unavailable on all other silicon variants.
      Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
      Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20230201040301.1034843-1-rkannoth@marvell.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • virtio-net: fix possible unsigned integer overflow · 981f14d4
      Heng Qi authored
      When the single-buffer xdp is loaded and after xdp_linearize_page()
      is called, *num_buf becomes 0 and (*num_buf - 1) may overflow into
      a large integer in virtnet_build_xdp_buff_mrg(), resulting in
      unexpected packet dropping.
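The wraparound described above can be reproduced in isolation. The sketch below is not the driver code: the function names are hypothetical and a 32-bit unsigned counter is assumed purely for illustration.

```c
#include <stdint.h>

/* Naive count of the buffers that follow the first one: wraps when
 * num_buf is 0 because the subtraction is done in unsigned arithmetic. */
static uint32_t remaining_naive(uint32_t num_buf)
{
	return num_buf - 1u;
}

/* Guarded variant: treat "no buffers" as "nothing remaining". */
static uint32_t remaining_guarded(uint32_t num_buf)
{
	return num_buf ? num_buf - 1u : 0u;
}
```

With `num_buf == 0`, the naive version yields `UINT32_MAX`, which any later bounds check sees as an impossibly large buffer count.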
      
      Fixes: ef75cb51 ("virtio-net: build xdp_buff with multi buffers")
      Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
      Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Link: https://lore.kernel.org/r/20230131085004.98687-1-hengqi@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netlink: provide an ability to set default extack message · 028fb19c
      Leon Romanovsky authored
      In a common netdev pattern, the extack pointer is forwarded to the
      drivers to be filled with an error message. However, the caller can
      easily overwrite the filled message.
      
      Instead of adding multiple "if (!extack->_msg)" checks before every
      NL_SET_ERR_MSG() call that appears after a call to the driver, let's
      add a new macro to common code.
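A "set only if unset" macro of the kind described could look like the following userspace sketch. The structure is a simplified stand-in for the kernel's extack, and the macro name is hypothetical; it is not taken from the patch.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's extack structure. */
struct extack_sketch {
	const char *_msg;
};

/* Hypothetical name: set the message only when none has been set yet,
 * so a driver-provided message is never overwritten by the caller. */
#define SET_ERR_MSG_IF_UNSET(extack, msg)		\
	do {						\
		if ((extack) && !(extack)->_msg)	\
			(extack)->_msg = (msg);		\
	} while (0)
```

A caller can then invoke the macro unconditionally after the driver call; the generic text only lands when the driver left the message empty.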
      
      [1] https://lore.kernel.org/all/Y9Irgrgf3uxOjwUm@unreal
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/6993fac557a40a1973dfa0095107c3d03d40bec1.1675171790.git.leon@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • neighbor: fix proxy_delay usage when it is zero · 62e395f8
      Brian Haley authored
      When set to zero, the neighbor sysctl proxy_delay value
      does not cause an immediate reply to ARP/ND requests
      as expected; it instead causes a random delay in the range
      [0, U32_MAX). This comment from
      __get_random_u32_below() explains the reason:
      
      /*
       * This function is technically undefined for ceil == 0, and in fact
       * for the non-underscored constant version in the header, we build bug
       * on that. But for the non-constant case, it's convenient to have that
       * evaluate to being a straight call to get_random_u32(), so that
       * get_random_u32_inclusive() can work over its whole range without
       * undefined behavior.
       */
      
      Added helper function that does not call get_random_u32_below()
      if proxy_delay is zero and just uses the current value of
      jiffies instead, causing pneigh_enqueue() to respond
      immediately.
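The zero-delay guard can be sketched in userspace as follows. The function names and the pluggable random source are assumptions made for illustration; the real helper works on jiffies and the kernel's random API.

```c
#include <stdint.h>

/* Stand-in for a "random value below ceil" source. */
typedef uint32_t (*rand_below_fn)(uint32_t ceil);

/* Deterministic example source, used only to exercise the sketch. */
static uint32_t half_of(uint32_t ceil)
{
	return ceil / 2;
}

/* Returns the time at which a proxied request should be answered.
 * A zero proxy_delay skips the random source entirely, so the reply is
 * scheduled for "now". */
static unsigned long proxy_sched_time(unsigned long now, uint32_t proxy_delay,
				      rand_below_fn rand_below)
{
	if (proxy_delay == 0)
		return now;
	return now + rand_below(proxy_delay);
}
```

The point of the guard is that the random source is never asked for a value "below zero", which is the case the quoted comment calls technically undefined.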
      
      Also added definition of proxy_delay to ip-sysctl.txt since
      it was missing.
      Signed-off-by: Brian Haley <haleyb.dev@gmail.com>
      Link: https://lore.kernel.org/r/20230130171428.367111-1-haleyb.dev@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-support-ipv4-big-tcp' · 983f507c
      Jakub Kicinski authored
      Xin Long says:
      
      ====================
      net: support ipv4 big tcp
      
      This is similar to the BIG TCP patchset added by Eric for IPv6:
      
        https://lwn.net/Articles/895398/
      
      Unlike IPv6, the IPv4 tot_len field is only 16 bits long, and the IPv4
      header has no extension headers (options) to carry the BIG TCP packet
      length. To keep it simple, as David and Paolo suggested, we set IPv4
      tot_len to 0 to indicate that this might be a BIG TCP packet and use
      skb->len as the real IPv4 total length.
      
      This works safely because all BIG TCP packets are GSO/GRO packets and
      are processed on the same host on which they were created. There is no
      padding in GSO/GRO packets, so skb->len - network_offset is exactly the
      IPv4 packet total length. Also, before the feature is implemented, all
      the places that may read iph tot_len from BIG TCP packets are taken
      care of with some new APIs:
      
      Patch 1 adds some APIs for setting and getting iph tot_len, which are
      used in Patches 2-7 in all the places that IPv4 BIG TCP packets may
      reach. Patch 8 adds a GSO_TCP tp_status for af_packet users, Patch 9
      adds new netlink attributes to make the IPv4 BIG TCP configuration
      independent of IPv6 BIG TCP, and Patch 10 implements the feature.
      
      Note that changes similar to those in Patches 2-6 are also needed for
      IPv6 BIG TCP packets, and will be addressed in another patchset.
      
      A similar performance test was done for IPv4 BIG TCP with a 25Gbit NIC
      and 1.5K MTU:
      
      No BIG TCP:
      for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
      168          322          337          3776.49
      143          236          277          4654.67
      128          258          288          4772.83
      171          229          278          4645.77
      175          228          243          4678.93
      149          239          279          4599.86
      164          234          268          4606.94
      155          276          289          4235.82
      180          255          268          4418.95
      168          241          249          4417.82
      
      Enable BIG TCP:
      ip link set dev ens1f0np0 gro_ipv4_max_size 128000 gso_ipv4_max_size 128000
      for i in {1..10}; do netperf -t TCP_RR -H 192.168.100.1 -- -r80000,80000 -O MIN_LATENCY,P90_LATENCY,P99_LATENCY,THROUGHPUT|tail -1; done
      161          241          252          4821.73
      174          205          217          5098.28
      167          208          220          5001.43
      164          228          249          4883.98
      150          233          249          4914.90
      180          233          244          4819.66
      154          208          219          5004.92
      157          209          247          4999.78
      160          218          246          4842.31
      174          206          217          5080.99
      
      Thanks for the feedback from Eric and David Ahern.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1674921359.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: add support for ipv4 big tcp · b1a78b9b
      Xin Long authored
      Similar to Eric's IPv6 BIG TCP, this patch is to enable IPv4 BIG TCP.
      
      Firstly, allow sk->sk_gso_max_size to be set to a value greater than
      GSO_LEGACY_MAX_SIZE by not trimming gso_max_size in sk_trim_gso_size()
      for IPv4 TCP sockets.
      
      Then on the TX path, set the IP header tot_len to 0 when skb->len >
      IP_MAX_MTU in __ip_local_out() to allow BIG TCP packets to be sent; this
      implies that skb->len is the length of an IPv4 packet. On the RX path,
      use skb->len as the length of the IPv4 packet when the IP header tot_len
      is 0 and skb->len > IP_MAX_MTU in ip_rcv_core(). As the APIs
      iph_set_totlen() and skb_ip_totlen() are used in __ip_local_out() and
      ip_rcv_core(), we only need to update these APIs.
      
      Also in GRO receive, add the check for ETH_P_IP/IPPROTO_TCP, and allow
      the merged packet size to be >= GRO_LEGACY_MAX_SIZE in skb_gro_receive().
      In GRO complete, set the IP header tot_len to 0 in iph_set_totlen() when
      the merged packet size is greater than IP_MAX_MTU, so that it can be
      processed on the RX path.
      
      Note that checking skb_is_gso_tcp() in iph_totlen() makes it safe for
      this implementation to use iph->tot_len == 0 to indicate IPv4 BIG TCP
      packets.
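The "tot_len == 0 means use the skb length" convention can be modelled with a small userspace sketch. Everything here is an assumption for illustration: the struct is a simplified packet model, the function names are hypothetical, and fields are kept in host byte order rather than network byte order.

```c
#include <stdbool.h>
#include <stdint.h>

#define IP_MAX_MTU_SKETCH 0xFFFFu /* largest value a 16-bit tot_len can hold */

/* Simplified packet model; fields are in host byte order for brevity. */
struct pkt_sketch {
	uint16_t tot_len;        /* IPv4 header total length field */
	uint32_t len;            /* skb->len equivalent */
	uint32_t network_offset; /* bytes before the IPv4 header */
	bool is_gso_tcp;         /* packet is a TCP GSO/GRO packet */
};

/* Setter: lengths that do not fit in 16 bits are encoded as 0. */
static void set_totlen(struct pkt_sketch *p, uint32_t len)
{
	p->tot_len = len <= IP_MAX_MTU_SKETCH ? (uint16_t)len : 0;
}

/* Getter: a zero tot_len on a TCP GSO packet means "derive the length
 * from the buffer length"; otherwise the header field is trusted. */
static uint32_t get_totlen(const struct pkt_sketch *p)
{
	if (p->tot_len != 0 || !p->is_gso_tcp)
		return p->tot_len;
	return p->len - p->network_offset;
}
```

Restricting the fallback to TCP GSO packets is what keeps a genuinely malformed header with tot_len 0 from being silently accepted.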
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: add gso_ipv4_max_size and gro_ipv4_max_size per device · 9eefedd5
      Xin Long authored
      This patch introduces gso_ipv4_max_size and gro_ipv4_max_size
      per device and adds netlink attributes for them, so that IPv4
      BIG TCP can be guarded by a separate tunable in the next patch.
      
      To avoid breaking old applications that use "gso/gro_max_size" for
      IPv4 GSO packets, this patch updates "gso/gro_ipv4_max_size"
      in netif_set_gso/gro_max_size() if the new size isn't greater
      than GSO_LEGACY_MAX_SIZE, so that nothing changes even if
      userspace is not aware of the new netlink attributes.
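The backward-compatibility rule can be sketched as below. The struct and function names are hypothetical, and the legacy limit is assumed to be 65536 for this illustration.

```c
#define GSO_LEGACY_MAX_SIZE_SKETCH 65536u /* assumed legacy limit */

struct dev_sketch {
	unsigned int gso_max_size;
	unsigned int gso_ipv4_max_size;
};

/* Old-style setter: mirror the value into the IPv4 tunable only while it
 * stays within the legacy limit, so larger BIG TCP sizes never reach the
 * IPv4 tunable through the old attribute. */
static void set_gso_max_size(struct dev_sketch *dev, unsigned int size)
{
	dev->gso_max_size = size;
	if (size <= GSO_LEGACY_MAX_SIZE_SKETCH)
		dev->gso_ipv4_max_size = size;
}
```

An old tool that lowers gso_max_size therefore still limits IPv4 GSO, while raising it above the legacy limit leaves the IPv4 value untouched.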
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • packet: add TP_STATUS_GSO_TCP for tp_status · 8e08bb75
      Xin Long authored
      Introduce TP_STATUS_GSO_TCP tp_status flag to tell the af_packet user
      that this is a TCP GSO packet. When parsing IPv4 BIG TCP packets in
      tcpdump/libpcap, it can use tp_len as the IPv4 packet len when this
      flag is set, as iph tot_len is set to 0 for IPv4 BIG TCP packets.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipvlan: use skb_ip_totlen in ipvlan_get_L3_hdr · 50e6fb5c
      Xin Long authored
      ipvlan devices call netif_inherit_tso_max() to get the tso_max_size/segs
      from the lower device, so when the lower device supports BIG TCP, the
      ipvlan devices support it too. We should therefore also handle their
      iph tot_len accesses.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • cipso_ipv4: use iph_set_totlen in skbuff_setattr · 7eb072be
      Xin Long authored
      cipso_v4_skbuff_setattr() may process IPv4 TCP GSO packets, so
      the iph->tot_len update should use iph_set_totlen().
      
      Note that for non-GSO packets, the new tot_len with the extra IP
      option length added may become greater than 65535. The old code casts
      the value and stores it in iph->tot_len, which is a bug. In theory,
      IP options should not be added to such big packets here, and a fix
      may be needed in the future. For now, this patch only sets
      iph->tot_len to 0 when that happens.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • netfilter: use skb_ip_totlen and iph_totlen · a13fbf5e
      Xin Long authored
      There are also quite a few places in netfilter that may process IPv4
      TCP GSO packets; we need to replace them too.
      
      In length_mt(), we have to use u_int32_t/int to hold the skb_ip_totlen()
      return value, otherwise it may overflow and mismatch. This change will
      also help us add a selftest for IPv4 BIG TCP in the following patch.
      
      Note that we don't need to replace the one in tcpmss_tg4(), as it will
      return if there is data after tcphdr in tcpmss_mangle_packet(). The
      same in mangle_contents() in nf_nat_helper.c, it returns false when
      skb->len + extra > 65535 in enlarge_skb().
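The length mismatch that a 16-bit variable would cause can be shown with a tiny sketch. The function names are hypothetical and the range values are arbitrary; only the truncation behaviour is the point.

```c
#include <stdint.h>

/* Storing a length in 16 bits keeps only the low 16 bits. */
static uint16_t len_as_u16(uint32_t totlen)
{
	return (uint16_t)totlen;
}

/* Range match done on the full 32-bit value, as a wider type allows. */
static int len_in_range(uint32_t totlen, uint32_t min, uint32_t max)
{
	return totlen >= min && totlen <= max;
}
```

A 128000-byte BIG TCP packet truncates to 62464 in 16 bits, so a rule matching lengths of 100000 to 200000 would wrongly miss it unless the wider type is used.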
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sched: use skb_ip_totlen and iph_totlen · 043e397e
      Xin Long authored
      There is one action and one qdisc that may process IPv4 TCP GSO
      packets and access iph->tot_len; replace these accesses with
      skb_ip_totlen() and iph_totlen() accordingly.
      
      Note that we don't need to replace the one in tcf_csum_ipv4(), as it
      will return for TCP GSO packets in tcf_csum_ipv4_tcp().
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • openvswitch: use skb_ip_totlen in conntrack · ec84c955
      Xin Long authored
      IPv4 GSO packets may get processed in ovs_skb_network_trim(),
      and we need to use skb_ip_totlen() to get iph totlen.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Aaron Conole <aconole@redhat.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: use skb_ip_totlen in br netfilter · 46abd173
      Xin Long authored
      These 3 places in bridge netfilter are called on the RX path after
      GRO, and IPv4 TCP GSO packets may come through, so replace the iph
      tot_len accesses there with skb_ip_totlen().
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: add a couple of helpers for iph tot_len · 058a8f7f
      Xin Long authored
      This patch adds three APIs to replace the iph->tot_len setting
      and getting in all places where IPv4 BIG TCP packets may reach,
      they will be used in the following patches.
      
      Note that iph_totlen() will be used when iph is not in linear
      data of the skb.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'virtio_net-vdpa-update-mac-address-when-it-is-generated-by-virtio-net' · d8673afb
      Jakub Kicinski authored
      Laurent Vivier says:
      
      ====================
      virtio_net: vdpa: update MAC address when it is generated by virtio-net
      
      When the MAC address is not provided by the vdpa device, the virtio_net
      driver assigns a random one without notifying the device.
      The consequence, in the case of mlx5_vdpa, is that the internal routing
      tables of the device are not updated, and this can block the
      communication between two namespaces.
      
      To fix this problem, use virtnet_send_command(VIRTIO_NET_CTRL_MAC)
      to set the address from virtnet_probe() when the MAC address is
      not provided by the device.
      ====================
      
      Link: https://lore.kernel.org/r/20230127204500.51930-1-lvivier@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • virtio_net: notify MAC address change on device initialization · 9f62d221
      Laurent Vivier authored
      In virtnet_probe(), if the device doesn't provide a MAC address the
      driver assigns a random one.
      As we modify the MAC address we need to notify the device to allow it
      to update all the related information.
      
      The problem can be seen with vDPA and mlx5_vdpa driver as it doesn't
      assign a MAC address by default. The virtio_net device uses a random
      MAC address (we can see it with "ip link"), but we can't ping a net
      namespace from another one using the virtio-vdpa device because the
      new MAC address has not been provided to the hardware:
      RX packets are dropped since they don't go through the receive filters,
      TX packets go through unaffected.
      Signed-off-by: Laurent Vivier <lvivier@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • virtio_net: disable VIRTIO_NET_F_STANDBY if VIRTIO_NET_F_MAC is not set · 7c06458c
      Laurent Vivier authored
      failover relies on the MAC address to pair the primary and the standby
      devices:
      
        "[...] the hypervisor needs to enable VIRTIO_NET_F_STANDBY
         feature on the virtio-net interface and assign the same MAC address
         to both virtio-net and VF interfaces."
      
        Documentation/networking/net_failover.rst
      
      This patch disables the STANDBY feature if the MAC address is not
      provided by the hypervisor.
      Signed-off-by: Laurent Vivier <lvivier@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: correct cleanup related to DCB resources · ca3daf43
      Huayu Chen authored
      This patch corrects two oversights relating to releasing resources
      and DCB initialisation.
      
      1. If mapping of the dcbcfg_tbl area fails: an error should be
         propagated, allowing partial initialisation (probe) to be unwound.
      
      2. Conversely, where dcbcfg_tbl is successfully mapped: it should
         be unmapped in nfp_nic_dcb_clean(), which is called via various error
         cleanup paths, and on shutdown or removal of the PCIe device.
      
      Fixes: 9b7fe804 ("nfp: add DCB IEEE support")
      Signed-off-by: Huayu Chen <huayu.chen@corigine.com>
      Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20230131163033.981937-1-simon.horman@corigine.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv6: ICMPV6: Use swap() instead of open coding it · bc617613
      Jiapeng Chong authored
      swap() is a macro that exchanges two values. Use it instead of
      open-coding the exchange to avoid code duplication.
      
      ./net/ipv6/icmp.c:344:25-26: WARNING opportunity for swap().
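For readers unfamiliar with the pattern the warning refers to, the sketch below contrasts an open-coded exchange with a swap-style macro. The macro and function names are hypothetical, and `__typeof__` is a GCC/Clang extension assumed to be available.

```c
/* Minimal swap macro in the style of the kernel one; __typeof__ is a
 * GCC/Clang extension. */
#define swap_sketch(a, b) \
	do { __typeof__(a) tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Open-coded exchange, the pattern the warning flags. */
static void exchange_open_coded(int *x, int *y)
{
	int tmp = *x;

	*x = *y;
	*y = tmp;
}

/* Same exchange expressed with the macro. */
static void exchange_with_swap(int *x, int *y)
{
	swap_sketch(*x, *y);
}
```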
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3896
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20230131063456.76302-1-jiapeng.chong@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 01 Feb, 2023 16 commits
    • Merge branch 'devlink-trivial-names-cleanup' · 074dd3b3
      Jakub Kicinski authored
      Jiri Pirko says:
      
      ====================
      devlink: trivial names cleanup
      
      This is a follow-up to Jakub's devlink code split and dump iteration
      helper patchset. No functional changes, just a couple of renames to make
      things consistent and perhaps easier to follow.
      ====================
      
      Link: https://lore.kernel.org/r/20230131090613.2131740-1-jiri@resnulli.us
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      074dd3b3
    • Jiri Pirko's avatar
      devlink: rename and reorder instances of struct devlink_cmd · 8589ba4e
      Jiri Pirko authored
      In order to maintain naming consistency, rename and reorder all usages
      of struct devlink_cmd in the following way:
      1) Remove "gen" and replace it with "cmd" to match the struct name
      2) Order devl_cmds[] and the header file to match the order
         of enum devlink_command
      3) Move devl_cmd_rate_get among the peers
      4) Remove "inst" for DEVLINK_CMD_GET
      5) Add "_get" suffix to all to match DEVLINK_CMD_*_GET (only rate had it
         done correctly)
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8589ba4e
    • Jiri Pirko's avatar
      devlink: remove "gen" from struct devlink_gen_cmd name · f8744595
      Jiri Pirko authored
      No need to have "gen" inside name of the structure for devlink commands.
      Remove it.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f8744595
    • Jiri Pirko's avatar
      devlink: rename devlink_nl_instance_iter_dump() to "dumpit" · c3a4fd57
      Jiri Pirko authored
      To have the name of the function consistent with the struct cb name,
      rename devlink_nl_instance_iter_dump() to
      devlink_nl_instance_iter_dumpit().
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c3a4fd57
    • Jakub Kicinski's avatar
      Merge branch 'net-ipa-remaining-ipa-v5-0-support' · dd25cfab
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: remaining IPA v5.0 support
      
      This series includes almost all remaining IPA code changes required
      to support IPA v5.0.  IPA register definitions and configuration
      data for IPA v5.0 will be sent later (soon).  Note that the GSI
      register definitions still require work.  GSI for IPA v5.0 supports
      up to 256 (rather than 32) channels, and this changes the way GSI
      register offsets are calculated.  A few GSI register fields also
      change.
      
      The first patch in this series increases the number of IPA endpoints
      supported by the driver, from 32 to 36.  The next updates the width
      of the destination field for the IP_PACKET_INIT immediate command so
      it can represent up to 256 endpoints rather than just 32.  The next
      adds a few definitions of some IPA registers and fields that are
      first available in IPA v5.0.
      
      The next two patches update the code that handles router and filter
      table caches.  Previously these were referred to as "hashed" tables,
      and the IPv4 and IPv6 tables are now combined into one "unified"
      table.  The sixth and seventh patches add support for a new pulse
      generator, which allows time periods to be specified with a wider
      range of clock resolution.  And the last patch just defines two new
      memory regions that were not previously used.
      ====================
      
      Link: https://lore.kernel.org/r/20230130210158.4126129-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      dd25cfab
    • Alex Elder's avatar
      net: ipa: define two new memory regions · 5157d6bf
      Alex Elder authored
      IPA v5.0 uses two memory regions not previously used.  Define them
      and treat them as valid only for IPA v5.0.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      5157d6bf
    • Alex Elder's avatar
      net: ipa: support a third pulse register · 2cdbcbfd
      Alex Elder authored
      The AP has a third pulse generator available starting with IPA v5.0.
      Redefine ipa_qtime_val() to support that possibility.  Pass the IPA
      pointer as an argument so the version can be determined.  And stop
      using the sign of the returned tick count to indicate which of two
      pulse generators to use.
      
      Instead, have the caller provide the address of a variable that will
      hold the selected pulse generator for the Qtime value.  And for
      version 5.0, check whether the third pulse generator best represents
      the time period.
      
      Add code in ipa_qtime_config() to configure the fourth pulse
      generator for IPA v5.0+; in that case configure both the third and
      fourth pulse generators to use 10 msec granularity.
      
      Consistently use "ticks" for local variables that represent a tick
      count.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2cdbcbfd
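The selection logic described in this commit (return a tick count, report the chosen pulse generator through a caller-supplied variable, and consider a third generator only on v5.0) can be sketched as follows. This is a userspace illustration, not the driver's ipa_qtime_val(): the function name `qtime_val()`, the boolean version flag, the 100 microsecond and 1 millisecond granularities for the first two generators, and the clamping fallback are all assumptions. Only the 10 msec granularity comes from the commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Round-to-nearest unsigned division (64-bit intermediate avoids overflow). */
static uint32_t div_round_closest(uint32_t n, uint32_t d)
{
	return (uint32_t)(((uint64_t)n + d / 2) / d);
}

/*
 * Convert a period in microseconds to a tick count no larger than max,
 * storing the index of the chosen pulse generator in *select.
 *
 * Assumed granularities: generator 0 = 100 us, generator 1 = 1 ms.
 * Generator 2 = 10 ms is taken from the commit text and is only
 * considered when v5_0_or_later is true.
 */
static uint32_t qtime_val(bool v5_0_or_later, uint32_t microseconds,
			  uint32_t max, uint32_t *select)
{
	static const uint32_t gran_us[] = { 100, 1000, 10000 };
	unsigned int count = v5_0_or_later ? 3 : 2;
	uint32_t ticks = 0;
	unsigned int i;

	assert(select != NULL);

	/* Prefer the finest granularity whose tick count fits in the field. */
	for (i = 0; i < count; i++) {
		ticks = div_round_closest(microseconds, gran_us[i]);
		if (ticks <= max)
			break;
	}

	/* Nothing fits: clamp to the coarsest generator (a sketch choice). */
	if (i == count) {
		i = count - 1;
		ticks = max;
	}

	*select = i;

	return ticks;
}
```

Returning the generator index through `select` keeps the tick count unsigned, which is the point of no longer overloading the sign of the return value.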
    • Alex Elder's avatar
      net: ipa: greater timer granularity options · 32079a4a
      Alex Elder authored
      Starting with IPA v5.0, the head-of-line blocking timer has more
      than two pulse generators available to define timer granularity.
      To prepare for that, change the way the field value is encoded
      to use ipa_reg_encode() rather than ipa_reg_bit().
      
      The aggregation granularity selection could (in principle) also use
      an additional pulse generator starting with IPA v5.0.  Encode the
      AGGR_GRAN_SEL field differently to allow that as well.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      32079a4a
    • Alex Elder's avatar
      net: ipa: support zeroing new cache tables · a08cedc3
      Alex Elder authored
      IPA v5.0+ separates the configuration of entries in the cached
      (previously "hashed") routing and filtering tables into distinct
      registers.  Previously a single "filter and router" register updated
      entries in both tables at once; now the routing and filter table
      caches have separate registers that define their content.
      
      This patch updates the code that zeroes entries in the cached filter
      and router tables to support IPA versions including v5.0+.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a08cedc3
    • Alex Elder's avatar
      net: ipa: update table cache flushing · 8e7c89d8
      Alex Elder authored
      Update the code that causes filter and router table caches to be
      flushed so that it supports IPA versions 5.0+.  It adds a comment in
      ipa_hardware_config_hashing() that explains that caching does not
      need to be enabled, just as before, because it's enabled by default.
      (For the record, the FILT_ROUT_CACHE_CFG register would have been
      used if we wanted to explicitly enable these.)
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8e7c89d8
    • Alex Elder's avatar
      net: ipa: define IPA v5.0+ registers · 8ba59716
      Alex Elder authored
      Define some new registers that appear starting with IPA v5.0, along
      with enumerated types identifying their fields.  Code that uses
      these will be added by upcoming patches.
      
      Most of the new registers are related to filter and routing tables,
      and in particular, their "hashed" variant.  These tables are better
      described as "cached", where a hash value determines which entries
      are cached.  From now on, naming related to this functionality will
      use "cache" instead of "hash", and that is reflected in these new
      register names.  Some registers for managing these caches and their
      contents have changed as well.
      
      A few other new field definitions for registers (unrelated to table
      caches) are also defined.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8ba59716
    • Alex Elder's avatar
      net: ipa: extend endpoints in packet init command · c84ddc11
      Alex Elder authored
      The IP_PACKET_INIT immediate command defines the destination
      endpoint to which a packet should be sent.  Prior to IPA v5.0, a
      5 bit field in that command represents the endpoint, but starting
      with IPA v5.0, the field is extended to 8 bits to support more than
      32 endpoints.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c84ddc11
    • Alex Elder's avatar
      net: ipa: support more endpoints · 07abde54
      Alex Elder authored
      Increase the number of endpoints supported by the driver to 36,
      which IPA v5.0 supports.  This makes it impossible to check at build
      time whether the supported number is too big to fit within the
      (5-bit) PACKET_INIT destination endpoint field.  Instead, convert
      the build time check to compare against what fits in 8 bits.
      
      Add a check in ipa_endpoint_config() to also ensure the hardware
      reports an endpoint count that's in the expected range.  Just
      open-code 32 as the limit (the PACKET_INIT field mask is not
      available where we'd want to use it).
      Signed-off-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      07abde54
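The interaction between the endpoint count and the destination field width can be illustrated with a small sketch. All identifiers here (`ENDPOINT_MAX`, `DEST_ENDPOINT_BITS_*`, `dest_endpoint_mask()`, `dest_endpoint_fits()`) are hypothetical; only the numbers 36, 5 and 8 come from the commit messages, and C11 `static_assert` stands in for the kernel's build-time check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the commit text; the names are illustrative. */
#define ENDPOINT_MAX              36	/* endpoints supported by the driver */
#define DEST_ENDPOINT_BITS_LEGACY 5	/* field width before IPA v5.0 */
#define DEST_ENDPOINT_BITS_V5_0   8	/* field width from IPA v5.0 onward */

/* Build-time check: the supported count must fit in the widest field. */
static_assert(ENDPOINT_MAX <= (1u << DEST_ENDPOINT_BITS_V5_0),
	      "endpoint count does not fit in the 8-bit destination field");

/* Return the mask for the destination endpoint field of a given version. */
static uint32_t dest_endpoint_mask(bool v5_0_or_later)
{
	unsigned int bits = v5_0_or_later ? DEST_ENDPOINT_BITS_V5_0
					  : DEST_ENDPOINT_BITS_LEGACY;

	return (UINT32_C(1) << bits) - 1;
}

/* Run-time check: can this endpoint ID be represented on this version? */
static bool dest_endpoint_fits(uint32_t endpoint_id, bool v5_0_or_later)
{
	return endpoint_id <= dest_endpoint_mask(v5_0_or_later);
}
```

Because 36 no longer fits in 5 bits, the older limit can only be enforced at run time, once the hardware version and its reported endpoint count are known.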
    • Jakub Kicinski's avatar
      Merge tag 'mlx5-updates-2023-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 71af6a2d
      Jakub Kicinski authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2023-01-30
      
      Add fast update encryption key
      
      Jianbo Liu Says:
      ================
      
      Data encryption keys (DEKs) are the keys used for data encryption and
      decryption operations. Starting from version 22.33.0783, firmware is
      optimized to accelerate the update of user keys into DEK object in
      hardware. The support for bulk allocation and destruction of DEK
      objects is added, and the bulk allocated DEKs are uninitialized, as
      the bulk creation requires no input key. When offloading
      encryption/decryption, the user gets one object from a bulk and updates
      its key with a new "modify DEK" command. This command is the same as create
      DEK object, but requires no heavy context memory allocation in
      firmware, which consumes most cpu cycles of the create DEK command.
      
      DEKs are cached internally by the NIC, so invalidating internal NIC
      caches is required before reusing DEKs. The SYNC_CRYPTO command is
      added to support it. A DEK object can be reused, and the key in it
      updated, after this command is executed.
      
      This patchset enhances the key creation and destruction flow to make
      use of this new feature. Any user, for example ktls, ipsec and
      macsec, can use it to offload keys. However, only ktls uses it, as the
      others don't need many keys, and caching too many DEKs in the pool is wasteful.
      
      There are two new data structs added:
          a. DEK pool. One pool is created for each key type. The bulks of
      that type are placed in the pool's different bulk lists, according to
      the number of available and in-use DEKs in the bulk.
          b. DEK bulk. All DEKs in one bulk allocation are stored here. There
      are two bitmaps to indicate the state of each DEK.
      
      New APIs are then added. When a user needs a DEK object:
          a. Fetch one bulk with available DEKs from the partial_list or
      avail_list; otherwise create a new one.
          b. Pick one DEK, and set its need_sync and in_use bits to 1.
      Move the bulk to full_list if it has no more available keys, or put it on
      partial_list if the bulk is newly created.
          c. Update the DEK object's key with the user key, by the "modify DEK"
      command.
          d. Return the DEK struct to the user, who then gets the object id and
      fills it into the offload commands.
      When a user frees a DEK:
          a. Set its in_use bit to 0. If all need_sync bits are 1 and all in_use
      bits of this bulk are 0, move it to sync_list.
          b. If the number of DEKs freed by users is over the
      threshold (128), schedule a workqueue to do the sync process.
      
      For the sync process, the SYNC_CRYPTO command is executed first. Then,
      for each bulk in partial_list, full_list and sync_list, reset
      need_sync bits of the freed DEK objects. If all need_sync bits in one
      bulk are zero, move it to avail_list.
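The allocate, free and sync steps above amount to a small state machine over two bitmaps. The following userspace sketch models a single bulk under stated assumptions: `DEKS_PER_BULK`, `struct dek_bulk`, `enum bulk_list` and the three functions are hypothetical names, the bulk size is kept tiny for illustration, the firmware commands are omitted, and moving a full bulk back to the partial list after a sync is an assumption not spelled out in the text.

```c
#include <assert.h>
#include <stdint.h>

#define DEKS_PER_BULK 4	/* hypothetical; kept small for illustration */
#define ALL_DEKS ((UINT32_C(1) << DEKS_PER_BULK) - 1)

enum bulk_list { AVAIL_LIST, PARTIAL_LIST, FULL_LIST, SYNC_LIST };

struct dek_bulk {
	uint32_t need_sync;	/* bit set: DEK handed out since the last sync */
	uint32_t in_use;	/* bit set: DEK currently held by a user */
	enum bulk_list list;	/* which pool list the bulk sits on */
};

/* Take one DEK from the bulk; returns its index, or -1 if none is available. */
static int dek_alloc(struct dek_bulk *bulk)
{
	for (int i = 0; i < DEKS_PER_BULK; i++) {
		uint32_t bit = UINT32_C(1) << i;

		if (bulk->need_sync & bit)
			continue;	/* in use, or freed but not yet synced */

		bulk->need_sync |= bit;
		bulk->in_use |= bit;
		bulk->list = bulk->need_sync == ALL_DEKS ? FULL_LIST
							 : PARTIAL_LIST;
		return i;	/* the "modify DEK" command would follow */
	}

	return -1;
}

/* Release a DEK; a fully used, fully freed bulk becomes a sync candidate. */
static void dek_free(struct dek_bulk *bulk, int idx)
{
	assert(idx >= 0 && idx < DEKS_PER_BULK);

	bulk->in_use &= ~(UINT32_C(1) << idx);
	if (bulk->need_sync == ALL_DEKS && bulk->in_use == 0)
		bulk->list = SYNC_LIST;
}

/* Run after SYNC_CRYPTO: freed DEKs become available for reuse. */
static void dek_sync(struct dek_bulk *bulk)
{
	bulk->need_sync &= bulk->in_use;	/* keep bits of DEKs still held */

	if (bulk->need_sync == 0)
		bulk->list = AVAIL_LIST;
	else if (bulk->list == FULL_LIST && bulk->need_sync != ALL_DEKS)
		bulk->list = PARTIAL_LIST;	/* assumption, see lead-in */
}
```

Note that a freed DEK stays unavailable until the sync step clears its need_sync bit, which reflects the requirement that the NIC's internal caches be invalidated before a DEK is reused.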
      
      We already supported TIS pool to recycle the TISes. With this series
      and TIS pool, TLS CPS performance is improved greatly.
      And we tested https on the system:
          CPU: dual AMD EPYC 7763 64-Core processors
          RAM: 512G
          DEV: ConnectX-6 DX, with FW ver 22.33.0838 and TLS_OPTIMISE=true
      TLS CPS performance numbers are:
          Before: 11k connections/sec
          After: 101k connections/sec
      
      ================
      
      * tag 'mlx5-updates-2023-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
        net/mlx5e: kTLS, Improve connection rate by using fast update encryption key
        net/mlx5: Keep only one bulk of full available DEKs
        net/mlx5: Add async garbage collector for DEK bulk
        net/mlx5: Reuse DEKs after executing SYNC_CRYPTO command
        net/mlx5: Use bulk allocation for fast update encryption key
        net/mlx5: Add bulk allocation and modify_dek operation
        net/mlx5: Add support SYNC_CRYPTO command
        net/mlx5: Add new APIs for fast update encryption key
        net/mlx5: Refactor the encryption key creation
        net/mlx5: Add const to the key pointer of encryption key creation
        net/mlx5: Prepare for fast crypto key update if hardware supports it
        net/mlx5: Change key type to key purpose
        net/mlx5: Add IFC bits and enums for crypto key
        net/mlx5: Add IFC bits for general obj create param
        net/mlx5: Header file for crypto
      ====================
      
      Link: https://lore.kernel.org/r/20230131031201.35336-1-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      71af6a2d
    • Jakub Kicinski's avatar
      Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · c925ed5f
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN: Remove redundant Device Control Error Reporting Enable
      
      Bjorn Helgaas says:
      
      Since f26e58bf ("PCI/AER: Enable error reporting when AER is native"),
      the PCI core sets the Device Control bits that enable error reporting for
      PCIe devices.
      
      This series removes redundant calls to pci_enable_pcie_error_reporting()
      that do the same thing from several NIC drivers.
      
      There are several more drivers where this should be removed; I started with
      just the Intel drivers here.
      
      * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
        ixgbe: Remove redundant pci_enable_pcie_error_reporting()
        igc: Remove redundant pci_enable_pcie_error_reporting()
        igb: Remove redundant pci_enable_pcie_error_reporting()
        ice: Remove redundant pci_enable_pcie_error_reporting()
        iavf: Remove redundant pci_enable_pcie_error_reporting()
        i40e: Remove redundant pci_enable_pcie_error_reporting()
        fm10k: Remove redundant pci_enable_pcie_error_reporting()
        e1000e: Remove redundant pci_enable_pcie_error_reporting()
      ====================
      
      Link: https://lore.kernel.org/r/20230130192519.686446-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c925ed5f
    • Jakub Kicinski's avatar
      Merge branch 'selftests-mlxsw-convert-to-iproute2-dcb' · 67971c38
      Jakub Kicinski authored
      Petr Machata says:
      
      ====================
      selftests: mlxsw: Convert to iproute2 dcb
      
      There is a dedicated tool for configuration of DCB in iproute2. Use it
      in the selftests instead of lldpad.
      
      Patches #1-#3 convert three tests. Patch #4 drops the now-unnecessary
      lldpad helpers.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1675096231.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      67971c38