1. 02 Nov, 2022 2 commits
    • tcp: refine tcp_prune_ofo_queue() logic · b0e01253
      Eric Dumazet authored
      After commits 36a6503f ("tcp: refine tcp_prune_ofo_queue()
      to not drop all packets") and 72cd43ba
      ("tcp: free batches of packets in tcp_prune_ofo_queue()"),
      tcp_prune_ofo_queue() drops a fraction of the ooo queue
      to make room for the incoming packet.
      
      However, it makes no sense to drop packets that are before
      the incoming packet in sequence space.

      To recover from packet losses faster, it makes more sense
      to drop only the ooo packets which are after the incoming
      packet.
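
      A simplified sketch of the new rule (kernel context assumed;
      an illustration of the idea, not the literal patch - the real
      function also maintains tp->ooo_last_skb and other state):
      walk the ooo rb-tree from its highest sequence and free skbs
      until enough room is reclaimed, stopping as soon as we reach
      data that is not after the incoming packet.

      static bool prune_ofo_sketch(struct sock *sk, struct sk_buff *in_skb)
      {
              struct tcp_sock *tp = tcp_sk(sk);
              struct rb_node *node = rb_last(&tp->out_of_order_queue);
              int goal = sk->sk_rcvbuf >> 3;  /* reclaim ~1/8 of rcvbuf */
              bool pruned = false;

              while (node && goal > 0) {
                      struct sk_buff *skb = rb_to_skb(node);
                      struct rb_node *prev = rb_prev(node);

                      /* Never drop data that is not after the incoming
                       * packet in sequence space.
                       */
                      if (!after(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(in_skb)->seq))
                              break;
                      goal -= skb->truesize;
                      rb_erase(node, &tp->out_of_order_queue);
                      tcp_drop_reason(sk, skb,
                                      SKB_DROP_REASON_TCP_OFO_QUEUE_PRUNE);
                      pruned = true;
                      node = prev;
              }
              return pruned;
      }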
      
      Tested:
      packetdrill test:
         0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3800], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 0>
        +.1 < . 1:1(0) ack 1 win 1024
         +0 accept(3, ..., ...) = 4
      
       +.01 < . 200:300(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 200:300>
      
       +.01 < . 400:500(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 400:500 200:300>
      
       +.01 < . 600:700(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 600:700 400:500 200:300>
      
       +.01 < . 800:900(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 800:900 600:700 400:500 200:300>
      
       +.01 < . 1000:1100(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 1000:1100 800:900 600:700 400:500>
      
       +.01 < . 1200:1300(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      
      // This packet is dropped because we have no room left.
       +.01 < . 1400:1500(100) ack 1 win 1024
      
       +.01 < . 1:200(199) ack 1 win 1024
      // Make sure kernel did not drop 200:300 sequence
         +0 > . 1:1(0) ack 300 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      // Make room, since our RCVBUF is very small
         +0 read(4, ..., 299) = 299
      
       +.01 < . 300:400(100) ack 1 win 1024
         +0 > . 1:1(0) ack 500 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      
       +.01 < . 500:600(100) ack 1 win 1024
         +0 > . 1:1(0) ack 700 <nop,nop, sack 1200:1300 1000:1100 800:900>
      
         +0 read(4, ..., 400) = 400
      
       +.01 < . 700:800(100) ack 1 win 1024
         +0 > . 1:1(0) ack 900 <nop,nop, sack 1200:1300 1000:1100>
      
       +.01 < . 900:1000(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1100 <nop,nop, sack 1200:1300>
      
       +.01 < . 1100:1200(100) ack 1 win 1024
      // This checks that 1200:1300 has not been removed from ooo queue
         +0 > . 1:1(0) ack 1300
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20221101035234.3910189-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: core: inet[46]_pton strlen len types · 44827016
      Dr. David Alan Gilbert authored
      inet[46]_pton() checks the input length against
      a sane length limit (INET[6]_ADDRSTRLEN), but
      the strlen() value gets truncated by being stored in an int,
      so there's a theoretical potential for a >4G string to pass
      the limit test.
      Use size_t, since that's what strlen() actually returns.
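
      A minimal userspace illustration of the pattern (hypothetical
      helpers, not the kernel's code):

      #include <string.h>

      #define ADDRSTRLEN_SKETCH 48    /* stand-in for INET6_ADDRSTRLEN */

      /* Buggy pattern: strlen() returns size_t; storing it in an int
       * truncates, so a string slightly longer than 4G can wrap to a
       * small value and slip past the check.
       */
      static int len_ok_buggy(const char *src)
      {
              int srclen = strlen(src);       /* implicit size_t -> int */

              return srclen <= ADDRSTRLEN_SKETCH;
      }

      /* Fixed pattern: keep the length in a size_t, matching strlen(). */
      static int len_ok_fixed(const char *src)
      {
              size_t srclen = strlen(src);

              return srclen <= ADDRSTRLEN_SKETCH;
      }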
      
      I've had a hunt for callers that could hit this, but
      I've not managed to find anything that isn't checked against
      some other limit first; it's possible, though, that I've missed
      something in the depths of the storage target paths.
      Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
      Link: https://lore.kernel.org/r/20221029014604.114024-1-linux@treblig.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 01 Nov, 2022 17 commits
  3. 31 Oct, 2022 18 commits
  4. 30 Oct, 2022 1 commit
  5. 29 Oct, 2022 2 commits
    • Merge tag 'mlx5-updates-2022-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 02a97e02
      Jakub Kicinski authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2022-10-24
      
      SW steering updates from Yevgeny Kliteynik:
      
      1) First four patches: small fixes / optimizations for SW steering:

       - Patch 1: Don't abort the destroy flow if destroying a table failed -
         continue and free everything else.
       - Patches 2 and 3 deal with fast teardown (a sketch of the CQ-polling
         idea follows this list):
          + Skip sync during fast teardown, as the PCI device is not there
            any more.
          + Check the device state when polling the CQ - otherwise SW steering
            keeps polling the CQ forever, because nobody is there to flush it.
       - Patch 4: Remove an unneeded function argument.
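
      A hypothetical sketch of the patch 3 idea (invented names and
      types, not the mlx5 code): give up polling once the device is
      known to be gone, instead of waiting for completions that will
      never arrive.

      #include <errno.h>

      struct sketch_dev { int in_error; };
      struct sketch_cq  { struct sketch_dev *dev; };

      /* Assumed helpers for the sketch. */
      static int device_in_error(struct sketch_dev *dev) { return dev->in_error; }
      static int poll_one(struct sketch_cq *cq) { (void)cq; return 1; }

      /* Poll until 'ne' completions, but bail out if the device has
       * gone away - nobody will flush the CQ after a fast teardown.
       */
      static int poll_cq_sketch(struct sketch_cq *cq, int ne)
      {
              int npolled = 0;

              while (npolled < ne) {
                      if (device_in_error(cq->dev))
                              return -EIO;
                      npolled += poll_one(cq);
              }
              return npolled;
      }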
      
      2) Deal with the hiccups that we get during rule insertion/deletion,
      which sometimes reach 1/4 of a second. While improving the
      insertion/deletion rate was not the focus here, it is still a
      by-product of removing these hiccups.
      
      Another by-product is the reduced standard deviation when measuring
      the duration of rule insertion/deletion bursts.
      
      In the testing we add K rules (warm-up phase), and then continuously
      do insertion/deletion bursts of N rules.
      During the test execution, the driver measures hiccups (count and
      duration) and the total time for insertion/deletion of a batch of rules.
      
      Here are some numbers, before and after these patches:
      
      +--------------------------------------------+-----------------+----------------+
      |                                            |   Create rules  |  Delete rules  |
      |                                            +--------+--------+--------+-------+
      |                                            | Before |  After | Before | After |
      +--------------------------------------------+--------+--------+--------+-------+
      | Max hiccup [msec]                          |    253 |     42 |    254 |    68 |
      +--------------------------------------------+--------+--------+--------+-------+
      | Avg duration of 10K rules add/remove [msec]| 140.07 | 124.32 | 106.99 | 99.51 |
      +--------------------------------------------+--------+--------+--------+-------+
      | Num of hiccups per 100K rules add/remove   |   7.77 |   7.97 |  12.60 | 11.57 |
      +--------------------------------------------+--------+--------+--------+-------+
      | Avg hiccup duration [msec]                 |  36.92 |  33.25 |  36.15 | 33.74 |
      +--------------------------------------------+--------+--------+--------+-------+
      
       - Patch 5: Allocate a short array on the stack instead of dynamically -
         it is destroyed at the end of the function.
       - Patch 6: Rather than cleaning the corresponding chunk's section of
         ste_arrays on chunk deletion, initialize these areas upon chunk creation.
         Chunk destruction tends to come in large batches (during pool syncing),
         so instead of doing huge memory initialization during pool sync,
         we amortize this by doing small initializations on chunk creation.
       - Patch 7: To simplify the error flow and allow cleaner addition of
         new pools, handle creation/destruction of all the domain's memory
         pools and other memory-related fields in separate init/uninit
         functions.
       - Patch 8: During rehash, write each table row immediately instead of
         waiting for the whole table to be ready and writing it all at once -
         this saves allocations of ste_send_info structures and improves
         performance.
       - Patch 9: Instead of allocating/freeing send info objects dynamically,
         manage them in a pool. The number of send info objects doesn't depend
         on the number of rules, so after pre-populating the pool with an
         initial batch of send info objects, the pool is not expected to grow.
         This way we avoid alloc/free while writing STEs to ICM, which by
         itself can sometimes take up to 40 msec.
       - Patch 10: Allocate icm_chunks from their own slab allocator, which
         lowered the frequency of alloc/free "hiccups".
       - Patch 11: Similar to patch 10, allocate htbl from its own slab
         allocator.
       - Patch 12: Lower the sync threshold for ICM hot memory - set the
         threshold for sync to 1/4 of the pool instead of 1/2 of the pool.
         Although we will have more syncs, each sync will be shorter and will
         help with insertion rate stability. Also, notice that the overall
         number of hiccups didn't increase, thanks to all the other patches.
       - Patch 13: Keep track of hot ICM chunks in an array instead of a list
         (a sketch of the idea follows this list).
         After steering sync, we traverse the hot list and finally free all
         the chunks. It appears that traversing a long list takes unusually
         long due to cache misses on many entries, which causes a big "hiccup"
         during rule insertion. This patch replaces the list with a
         pre-allocated array that stores only the bookkeeping information
         needed to later free the chunks in their buddy allocator.
       - Patch 14: Remove the unneeded buddy used_list - we don't need the
         list of used chunks, we only need the total amount of used memory.
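
      A hypothetical sketch of the array-instead-of-list idea from
      patch 13 (types and names invented for illustration; the real
      bookkeeping differs): a flat, pre-allocated array is walked
      sequentially after a sync, which is far friendlier to the cache
      than chasing scattered list pointers.

      struct dr_buddy;        /* opaque; stands in for the buddy allocator */
      void dr_buddy_free(struct dr_buddy *buddy, unsigned int seg); /* assumed */

      struct hot_chunk_info {
              struct dr_buddy *buddy; /* owning buddy allocator */
              unsigned int seg;       /* chunk's segment within that buddy */
      };

      struct hot_chunk_arr {
              struct hot_chunk_info *info;    /* pre-allocated at pool init */
              unsigned int count;
      };

      /* After a steering sync, free every hot chunk in one sequential pass. */
      static void flush_hot_chunks(struct hot_chunk_arr *arr)
      {
              unsigned int i;

              for (i = 0; i < arr->count; i++)
                      dr_buddy_free(arr->info[i].buddy, arr->info[i].seg);
              arr->count = 0; /* reuse the array for the next sync */
      }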
      
      * tag 'mlx5-updates-2022-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
        net/mlx5: DR, Remove the buddy used_list
        net/mlx5: DR, Keep track of hot ICM chunks in an array instead of list
        net/mlx5: DR, Lower sync threshold for ICM hot memory
        net/mlx5: DR, Allocate htbl from its own slab allocator
        net/mlx5: DR, Allocate icm_chunks from their own slab allocator
        net/mlx5: DR, Manage STE send info objects in pool
        net/mlx5: DR, In rehash write the line in the entry immediately
        net/mlx5: DR, Handle domain memory resources init/uninit separately
        net/mlx5: DR, Initialize chunk's ste_arrays at chunk creation
        net/mlx5: DR, For short chains of STEs, avoid allocating ste_arr dynamically
        net/mlx5: DR, Remove unneeded argument from dr_icm_chunk_destroy
        net/mlx5: DR, Check device state when polling CQ
        net/mlx5: DR, Fix the SMFS sync_steering for fast teardown
        net/mlx5: DR, In destroy flow, free resources even if FW command failed
      ====================
      
      Link: https://lore.kernel.org/r/20221027145643.6618-1-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-ipa-start-adding-ipa-v5-0-functionality' · eb288cbd
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: start adding IPA v5.0 functionality
      
      The biggest change for IPA v5.0 is that it supports more than 32
      endpoints.  However, there are two other unrelated changes:
        - The STATS_TETHERING memory region is not required
        - Filter tables no longer support a "global" filter
      
      Beyond this, refactoring some code makes supporting more than 32
      endpoints (in an upcoming series) easier.  So this series includes
      a few other changes (not in this order; a sketch of the endpoint
      bookkeeping follows this list):
        - The maximum endpoint ID in use is determined during config
        - Loops over all endpoints only involve those in use
        - Endpoint IDs and their directions are checked for validity
          differently, to simplify comparison against the maximum
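
      A hypothetical sketch of that bookkeeping (names invented; the
      driver's real structures differ): record defined endpoints in a
      bitmap during config and bound every loop by the highest defined
      ID instead of a hardware maximum.

      #include <stdint.h>

      struct ipa_sketch {
              uint64_t defined;        /* bit N set if endpoint N is in use */
              uint32_t endpoint_count; /* highest defined ID + 1, set at config */
      };

      /* Visit only the endpoints that config actually defined. */
      static void ipa_for_each_defined(struct ipa_sketch *ipa,
                                       void (*fn)(struct ipa_sketch *, uint32_t))
      {
              uint32_t id;

              for (id = 0; id < ipa->endpoint_count; id++)
                      if (ipa->defined & ((uint64_t)1 << id))
                              fn(ipa, id);
      }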
      ====================
      
      Link: https://lore.kernel.org/r/20221027122632.488694-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>