1. 30 Sep, 2021 13 commits
    • ixgbe: let the xdpdrv work with more than 64 cpus · 4fe81585
      Jason Xing authored
      Originally, the ixgbe driver doesn't allow loading xdpdrv if the
      server has more than 64 CPUs online, so loading xdpdrv fails with
      "NOMEM".

      We can adjust the algorithm to make this work by mapping the current
      CPU to an XDP ring under the protection of @tx_lock.
      
      Here are some numbers before/after applying this patch with xdp-example
      loaded on the eth0X:
      
      As client (tx path):
                           Before    After
      TCP_STREAM send-64   734.14    714.20
      TCP_STREAM send-128  1401.91   1395.05
      TCP_STREAM send-512  5311.67   5292.84
      TCP_STREAM send-1k   9277.40   9356.22 (not stable)
      TCP_RR     send-1    22559.75  21844.22
      TCP_RR     send-128  23169.54  22725.13
      TCP_RR     send-512  21670.91  21412.56
      
      As server (rx path):
                           Before    After
      TCP_STREAM send-64   1416.49   1383.12
      TCP_STREAM send-128  3141.49   3055.50
      TCP_STREAM send-512  9488.73   9487.44
      TCP_STREAM send-1k   9491.17   9356.22 (not stable)
      TCP_RR     send-1    23617.74  23601.60
      ...
      
      Notice: the TCP_RR mode is unstable, as the official netperf
      documentation explains.

      I tested many times with different parameter combinations through
      netperf. Though the results are not that precise, I cannot see much
      impact from this patch. The static key is placed on the hot path, but
      theoretically it shouldn't cause a significant regression.
      Co-developed-by: Shujin Li <lishujin@kuaishou.com>
      Signed-off-by: Shujin Li <lishujin@kuaishou.com>
      Signed-off-by: Jason Xing <xingwanli@kuaishou.com>
      Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
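      The CPU-to-ring remapping described in the commit message can be
      sketched in user space. This is an illustrative model only, not the
      driver's actual code: the ring structure, ring count, and function
      names are assumptions, and a pthread mutex stands in for the driver's
      @tx_lock.

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stdio.h>

      /* Sketch: when online CPUs outnumber XDP TX rings, map each CPU onto
       * a ring by modulo and serialize access with a per-ring lock (standing
       * in for the driver's @tx_lock). All names here are hypothetical. */
      #define NUM_XDP_RINGS 64

      struct xdp_ring {
          pthread_mutex_t tx_lock;
          unsigned long tx_count;
      };

      static struct xdp_ring rings[NUM_XDP_RINGS];

      static struct xdp_ring *ring_for_cpu(int cpu)
      {
          return &rings[cpu % NUM_XDP_RINGS];
      }

      static void xdp_xmit_on_cpu(int cpu)
      {
          struct xdp_ring *ring = ring_for_cpu(cpu);

          pthread_mutex_lock(&ring->tx_lock);   /* CPUs sharing a ring contend here */
          ring->tx_count++;
          pthread_mutex_unlock(&ring->tx_lock);
      }

      int main(void)
      {
          for (int i = 0; i < NUM_XDP_RINGS; i++)
              pthread_mutex_init(&rings[i].tx_lock, NULL);

          /* 128 "CPUs" transmit once each: CPUs 0 and 64 share ring 0, etc. */
          for (int cpu = 0; cpu < 128; cpu++)
              xdp_xmit_on_cpu(cpu);

          assert(rings[0].tx_count == 2);
          printf("ring0 tx_count=%lu\n", rings[0].tx_count);
          return 0;
      }
      ```

      The trade-off is visible in the benchmark tables above: correctness on
      >64-CPU machines in exchange for occasional lock contention when two
      CPUs land on the same ring.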
    • Merge branch 'SO_RESEVED_MEM' · a3e4abac
      David S. Miller authored
      Wei Wang says:
      
      ====================
      net: add new socket option SO_RESERVE_MEM
      
      This patch series introduces a new socket option SO_RESERVE_MEM.
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends fewer cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, at the cost of a certain amount of pre-allocated,
      unreclaimable memory, even under memory pressure.
      With a tcp_stream test with 10 flows running on a simulated 100ms RTT
      link, I can see the cycles spent in __sk_mem_raise_allocated() dropping
      by ~0.02%. Not a whole lot, since we already have logic in
      sk_mem_uncharge() to only reclaim 1MB when sk_forward_alloc has more
      than 2MB of free space. But on a system constantly under memory
      pressure, the savings should be larger.
      
      The first patch is the implementation of this socket option. The
      following 2 patches change the tcp stack to make use of this reserved
      memory when under memory pressure. This makes the tcp stack behavior
      more flexible under memory pressure, and provides a way for users to
      control the distribution of memory among their sockets.
      With a TCP connection on a simulated 100ms RTT link, the default
      throughput under memory pressure is ~500Kbps. With SO_RESERVE_MEM set to
      100KB, the throughput under memory pressure goes up to ~3.5Mbps.
      
      Change since v2:
      - Added description for new field added in struct sock in patch 1
      Change since v1:
      - Added performance stats in cover letter and rebased
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
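      From user space, the new option is set with an ordinary setsockopt()
      call. A minimal sketch follows; note that the SO_RESERVE_MEM constant
      value of 73 is an assumption taken from the kernel's generic socket
      option numbering (check your asm-generic/socket.h), and the call is
      expected to fail on kernels without this series or without memcg
      accounting, so the example handles that case.

      ```c
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      /* May be missing from older libc headers; 73 is an assumption based on
       * the kernel's generic socket option numbering. */
      #ifndef SO_RESERVE_MEM
      #define SO_RESERVE_MEM 73
      #endif

      int main(void)
      {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          int reserve = 100 * 1024;  /* 100KB, as in the cover letter's test */

          printf("requesting %d bytes of reserved memory\n", reserve);

          /* Fails (e.g. ENOPROTOOPT/EOPNOTSUPP/EPERM) on kernels without the
           * option or when memcg accounting is unavailable. */
          if (setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
                         &reserve, sizeof(reserve)) == 0)
              printf("reservation accepted\n");
          else
              printf("reservation unavailable: %s\n", strerror(errno));

          close(fd);
          return 0;
      }
      ```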
    • tcp: adjust rcv_ssthresh according to sk_reserved_mem · 053f3684
      Wei Wang authored
      When the user sets the SO_RESERVE_MEM socket option, in order to
      utilize the reserved memory under memory pressure, we adjust
      rcv_ssthresh according to the available reserved memory for the
      socket, instead of always using 4 * advmss.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
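      The adjustment amounts to taking the larger of the socket's available
      reserved memory and the old 4 * advmss floor. The sketch below shows
      only that arithmetic; the helper name and the values are illustrative
      assumptions, not the patch's actual code.

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Illustrative arithmetic only: under memory pressure, let the receive
       * window threshold cover the available reserved memory instead of
       * always falling back to 4 * advmss. Names/values are hypothetical. */
      static unsigned int rcv_ssthresh_under_pressure(unsigned int reserved_avail,
                                                      unsigned int advmss)
      {
          unsigned int floor = 4 * advmss;

          return reserved_avail > floor ? reserved_avail : floor;
      }

      int main(void)
      {
          /* No reservation: old behavior, 4 * advmss. */
          assert(rcv_ssthresh_under_pressure(0, 1460) == 4 * 1460);
          /* 100KB reserved: the threshold follows the reservation. */
          assert(rcv_ssthresh_under_pressure(100 * 1024, 1460) == 100 * 1024);
          printf("ok\n");
          return 0;
      }
      ```

      This is what allows a socket with SO_RESERVE_MEM set to keep a usable
      receive window under memory pressure instead of collapsing to the
      minimum.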
    • tcp: adjust sndbuf according to sk_reserved_mem · ca057051
      Wei Wang authored
      If the user sets the SO_RESERVE_MEM socket option, in order to fully
      utilize the reserved memory under memory pressure on the tx path, we
      modify the logic in sk_stream_moderate_sndbuf() to set sk_sndbuf
      according to the available reserved memory, instead of
      MIN_SOCK_SNDBUF, and adjust it when new data is acked.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add new socket option SO_RESERVE_MEM · 2bb2f5fb
      Wei Wang authored
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends fewer cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, at the cost of a certain amount of pre-allocated,
      unreclaimable memory, even under memory pressure.
      
      Note:
      This socket option is only available when memory cgroup is enabled,
      and we require this reserved memory to be charged to the user's memcg.
      We hope this prevents misbehaving users from abusing this feature to
      reserve a large amount of memory on certain sockets and causing
      unfairness to others.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell10g: add downshift tunable support · 4075a6a0
      Russell King authored
      Add support for the downshift tunable for the Marvell 88x3310 PHY.
      Downshift is only usable with firmware 0.3.5.0 and later.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-af: Remove redundant initialization of variable pin · 75f81afb
      Colin Ian King authored
      The variable pin is initialized with a value that is never read; it is
      updated later in only one case of a switch statement. The assignment
      is redundant and can be removed.
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: macb: ptp: Switch to gettimex64() interface · e51bb5c2
      Lars-Peter Clausen authored
      The macb PTP support currently implements the `gettime64` callback to
      allow retrieving the hardware clock time. Update the implementation to
      provide the `gettimex64` callback instead.

      The difference between the two is that with `gettime64` a snapshot of
      the system clock is taken before and after invoking the callback,
      whereas `gettimex64` expects the callback itself to take the snapshots.
      
      To get the time from the macb Ethernet core, multiple register
      accesses have to be done, only one of which happens at the time
      reported by the function. This leads to a non-symmetric delay and adds
      a slight offset between the hardware and system clock time when using
      the `gettime64` method. This offset can be a few hundred nanoseconds.
      Switching to the `gettimex64` method allows for a more precise
      correlation of the hardware and system clocks and results in a lower
      offset between the two.
      
      On a Xilinx ZynqMP system, `phc2sys` reports a delay of 1120 ns before
      and 300 ns after the patch, with the latter being mostly symmetric.
      Signed-off-by: Lars-Peter Clausen <lars@metafoo.de>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
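      The pre/post snapshot idea behind `gettimex64` can be demonstrated in
      user space. In the kernel, the driver brackets the register access with
      ptp_read_system_prets()/ptp_read_system_postts(); here both the
      snapshots and the "device read" are simulated with clock_gettime(), so
      this is a model of the timing structure, not the macb driver.

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <time.h>

      /* Model of the gettimex64 contract: snapshot the system clock
       * immediately before and after the device read, so the caller can
       * bound the device time between two system timestamps. */
      static long long ts_ns(const struct timespec *ts)
      {
          return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
      }

      static long long read_device_time(struct timespec *pre, struct timespec *post)
      {
          struct timespec dev;

          clock_gettime(CLOCK_MONOTONIC, pre);   /* pre-snapshot */
          clock_gettime(CLOCK_MONOTONIC, &dev);  /* stands in for register reads */
          clock_gettime(CLOCK_MONOTONIC, post);  /* post-snapshot */

          return ts_ns(&dev);
      }

      int main(void)
      {
          struct timespec pre, post;
          long long dev = read_device_time(&pre, &post);

          /* The device time must fall inside the [pre, post] window. */
          assert(ts_ns(&pre) <= dev && dev <= ts_ns(&post));
          printf("window %lld ns\n", ts_ns(&post) - ts_ns(&pre));
          return 0;
      }
      ```

      With `gettime64`, the snapshots are taken around the whole callback,
      so the window also contains the register accesses that do not latch
      the time, which is the source of the asymmetric offset described above.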
    • dissector: do not set invalid PPP protocol · 2e861e5e
      Boris Sukholitko authored
      The following flower filter fails to match non-PPP_IP{V6} packets
      wrapped in PPP_SES protocol:
      
      tc filter add dev eth0 ingress protocol ppp_ses flower \
              action simple sdata hi64
      
      The reason is that the proto local variable is set even when the
      FLOW_DISSECT_RET_OUT_BAD status is returned.

      The fix is to avoid setting the proto variable if the PPP protocol is
      unknown.
      Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
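      The bug fits a general pattern: a caller-visible output was written
      before the nested value was validated. The sketch below demonstrates
      the corrected pattern with hypothetical constants (the _DEMO enums are
      stand-ins, not the kernel's PPP_*/ETH_P_* definitions or the flow
      dissector's actual code).

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Only assign to the caller-visible protocol once the nested value is
       * recognized; on an unknown protocol, report failure and leave the
       * output untouched. Constants are hypothetical stand-ins. */
      enum { PPP_IP_DEMO = 0x0021, PPP_IPV6_DEMO = 0x0057 };
      enum { ETH_P_IP_DEMO = 0x0800, ETH_P_IPV6_DEMO = 0x86DD };
      enum { RET_OK = 0, RET_OUT_BAD = -1 };

      static int dissect_ppp(int ppp_proto, int *proto)
      {
          switch (ppp_proto) {
          case PPP_IP_DEMO:
              *proto = ETH_P_IP_DEMO;
              return RET_OK;
          case PPP_IPV6_DEMO:
              *proto = ETH_P_IPV6_DEMO;
              return RET_OK;
          default:
              /* Do NOT touch *proto here -- writing it anyway was the bug. */
              return RET_OUT_BAD;
          }
      }

      int main(void)
      {
          int proto = 0;

          assert(dissect_ppp(PPP_IP_DEMO, &proto) == RET_OK);
          assert(proto == ETH_P_IP_DEMO);

          proto = ETH_P_IP_DEMO;            /* as if a previous header set it */
          assert(dissect_ppp(0x1234, &proto) == RET_OUT_BAD);
          assert(proto == ETH_P_IP_DEMO);   /* unchanged on failure */
          printf("ok\n");
          return 0;
      }
      ```

      Leaving the output untouched on failure is what lets the flower filter
      in the example above still match non-PPP_IP{V6} packets.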
    • net: dsa: rtl8366rb: Use core filtering tracking · 55b115c7
      Linus Walleij authored
      We added a state variable to track whether a certain port was VLAN
      filtering or not, but we can just query the DSA core for this.
      
      Cc: Vladimir Oltean <olteanv@gmail.com>
      Cc: Mauri Sandberg <sandberg@mailfence.com>
      Cc: DENG Qingfang <dqfext@gmail.com>
      Cc: Alvin Šipraga <alsi@bang-olufsen.dk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-pf: Add XDP support to netdev PF · 06059a1a
      Geetha sowjanya authored
      Adds XDP_PASS, XDP_TX, XDP_DROP and XDP_REDIRECT support
      for netdev PF.
      Signed-off-by: Geetha sowjanya <gakula@marvell.com>
      Signed-off-by: Sunil Goutham <sgoutham@cavium.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • octeontx2-af: Adjust LA pointer for cpt parse header · 85212a12
      Kiran Kumar K authored
      In case of ltype NPC_LT_LA_CPT_HDR, the LA pointer points to the start
      of the cpt parse header. Since the cpt parse header has variable
      length padding, this is a problem for DMAC extraction. Add KPU profile
      changes to adjust the LA pointer to start at the ether header in case
      of a cpt parse header by:
         - Adding ptr advance in pkind 58 to a fixed value of 40
         - Adding variable length offset 7 and mask 7 (pad len in
           CPT_PARSE_HDR).
      Also add the missing static declaration for the
      npc_set_var_len_offset_pkind function.
      Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: Use struct_size() and flex_array_size() helpers · 69508d43
      Gustavo A. R. Silva authored
      Make use of the struct_size() and flex_array_size() helpers instead of
      open-coded versions, in order to avoid any potential type mistakes or
      integer overflows that, in the worst-case scenario, could lead to heap
      overflows.
      
      Link: https://github.com/KSPP/linux/issues/160
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Link: https://lore.kernel.org/r/20210928193107.GA262595@embeddedor
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
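      The kernel helpers live in <linux/overflow.h> and additionally saturate
      on integer overflow. The user-space sketch below shows only the
      open-coded-size replacement they enable; the _demo macros and the
      tc_entry struct are illustrative assumptions, not the net_sched code.

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      /* Simplified stand-ins for the kernel's struct_size() and
       * flex_array_size(); the real macros also guard against overflow. */
      #define struct_size_demo(p, member, n) \
          (sizeof(*(p)) + (n) * sizeof((p)->member[0]))
      #define flex_array_size_demo(p, member, n) \
          ((n) * sizeof((p)->member[0]))

      struct tc_entry {
          int count;
          int keys[];  /* flexible array member */
      };

      int main(void)
      {
          struct tc_entry *e;
          size_t n = 4;

          /* Instead of the open-coded: malloc(sizeof(*e) + n * sizeof(int)) */
          e = malloc(struct_size_demo(e, keys, n));
          assert(e != NULL);
          e->count = (int)n;
          memset(e->keys, 0, flex_array_size_demo(e, keys, n));

          assert(struct_size_demo(e, keys, n) ==
                 sizeof(struct tc_entry) + 4 * sizeof(int));
          printf("ok\n");
          free(e);
          return 0;
      }
      ```

      Keeping the element type inside the macro (via the member itself) is
      what removes the class of mistakes where the allocation size uses a
      different type than the array it backs.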
  2. 29 Sep, 2021 26 commits
  3. 28 Sep, 2021 1 commit