1. 05 Feb, 2021 1 commit
  2. 04 Feb, 2021 31 commits
  3. 03 Feb, 2021 8 commits
    • Merge branch 'net-use-indirect_call-in-some-dst_ops' · 2d912da0
      Jakub Kicinski authored
      Brian Vazquez says:
      
      ====================
      net: use INDIRECT_CALL in some dst_ops
      
      This patch series uses the INDIRECT_CALL wrappers in some dst_ops
      functions to mitigate retpoline costs. Benefits depend on the
      platform as described below.
      
      Background: The kernel rewrites the retpoline code at
      __x86_indirect_thunk_r11 depending on the CPU's requirements.
      The INDIRECT_CALL wrappers provide hints on possible targets and
      save the retpoline overhead using a direct call in case the
      target matches one of the hints.
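The hint mechanism can be illustrated with a minimal userspace sketch, modeled on the kernel's INDIRECT_CALL_1/INDIRECT_CALL_2 helpers. It is an approximation under stated assumptions, not the kernel source: the mock_* functions, their return values, and struct mock_dst_ops are hypothetical stand-ins, and the real helpers only take the direct-call path on retpoline-enabled builds.

```c
/* Branch hint; __builtin_expect is a GCC/Clang builtin. */
#define likely(x) __builtin_expect(!!(x), 1)

/* If the pointer equals the hinted function, call that function directly
 * (no retpoline thunk); otherwise fall back to the indirect call. */
#define INDIRECT_CALL_1(f, f1, ...) \
	(likely((f) == (f1)) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))
#define INDIRECT_CALL_2(f, f2, f1, ...) \
	(likely((f) == (f2)) ? f2(__VA_ARGS__) : \
			       INDIRECT_CALL_1(f, f1, __VA_ARGS__))

/* Hypothetical stand-in for a dst_ops-style table of function pointers. */
struct mock_dst_ops {
	unsigned int (*mtu)(unsigned int base);
};

/* Hypothetical targets; the offsets only make the results distinguishable. */
static unsigned int mock_ipv4_mtu(unsigned int base) { return base + 4; }
static unsigned int mock_ipv6_mtu(unsigned int base) { return base + 6; }
static unsigned int mock_other_mtu(unsigned int base) { return base; }

/* The two common targets are passed as hints; any other target still
 * works, it just goes through the plain indirect call. */
static unsigned int mock_dst_mtu(const struct mock_dst_ops *ops,
				 unsigned int base)
{
	return INDIRECT_CALL_2(ops->mtu, mock_ipv6_mtu, mock_ipv4_mtu, base);
}
```

The result is the same whichever branch is taken; only the call instruction differs, which is where the saving on retpoline platforms would come from.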
      
      The retpoline overhead for the following three cases has been
      measured by Luigi Rizzo in microbenchmarks, using CPU performance
      counters. The cases cover reasonably well the range of possible
      retpoline overheads compared to a plain indirect call (in equal
      conditions, specifically with predicted branch, hot cache):
      
      - just "jmp *(%r11)" on modern platforms like Intel Cascadelake.
        In this case the overhead is just 2 clock cycles.
      
      - "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs.
        In this case the lfence is blocked until pending reads complete,
        so the actual overhead depends on previous instructions.
        In the best case we have measured 15 clock cycles of overhead.
      
      - worst case, e.g. Skylake, the full retpoline is used:
      
          __x86_indirect_thunk_r11:     call set_up_target
          capture_speculation:          pause
                                        lfence
                                        jmp capture_speculation
          .align 16
          set_up_target:                mov %r11, (%rsp)
                                        ret
      
         In this case the overhead has been measured at 35-40 clock cycles.
      
      The actual time saved hence depends on the platform and current
      clock speed (which varies heavily, especially when C-states are active).
      Also note that actual benefit might be lower than expected if the
      longer retpoline overlaps with some pending memory read.
      
      MEASUREMENTS:
      The INDIRECT_CALL wrappers in this patchset involve the processing
      of incoming SYN and generation of syncookies. Hence, the test has been
      run by configuring a receiving host with a single NIC rx queue, disabling
      RPS and RFS so that all processing occurs on the same core.
      An external source generates SYN fast enough to saturate the receiving CPU.
      We ran two sets of experiments, with and without the dst_output patch,
      comparing the number of syncookies generated over a 20s period
      in multiple runs.
      
      Assuming the CPU is saturated, the time per packet is
         t = total_time/number_of_packets
      and if the two datasets have a statistically meaningful difference,
      the difference in times between the two cases gives an estimate
      of the benefits from one INDIRECT_CALL.
      
      Here are the experimental results:
      
      Skylake     Syncookies over 20s (5 tests)
      ---------------------------------------------------
      indirect    9166325 9182023 9170093 9134014 9171082
      retpoline   9099308 9126350 9154841 9056377 9122376
      
      Computing the stats on the time per packet in seconds (20/total_packets) gives the following:
      
      $ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp
      x /tmp/sk-indirect
      + /tmp/sk-retp
      +----------------------------------------------------------------------+
      |x     xx x     +          x    + +           +                       +|
      ||______M__A_______|_|____________M_____A___________________|          |
      +----------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x   5   2.17817e-06   2.18962e-06     2.181e-06  2.182292e-06 4.3252133e-09
      +   5   2.18464e-06   2.20839e-06   2.19241e-06  2.194974e-06 8.8695958e-09
      Difference at 95.0% confidence
              1.2682e-08 +/- 1.01766e-08
              0.581132% +/- 0.466326%
              (Student's t, pooled s = 6.97772e-09)
      
      This suggests a difference of 13ns +/- 10ns per packet.
      Our expectation from microbenchmarks was 35-40 cycles per call,
      but part of the gains may be eaten by stalls from pending memory reads.
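The ministat comparison above can be approximated from the raw syncookie counts. The sketch below assumes equal sample sizes and a pooled-variance Student's t interval, which is what the ministat output reports; all function and struct names are hypothetical, and a small Newton iteration stands in for sqrt() so the block builds without linking libm.

```c
#include <stddef.h>

/* Newton iteration for sqrt(x), x >= 0. 200 steps is enough for the
 * small magnitudes in this example; use sqrt() from <math.h> in real code. */
static double newton_sqrt(double x)
{
	double g = 1.0;
	int i;

	if (x <= 0.0)
		return 0.0;
	for (i = 0; i < 200; i++)
		g = 0.5 * (g + x / g);
	return g;
}

struct sample_stats {
	double mean;	/* mean seconds per packet */
	double stddev;	/* sample standard deviation (n - 1 divisor) */
};

/* Convert per-run packet counts into seconds per packet, then summarize. */
static struct sample_stats stats_from_counts(const unsigned long *counts,
					     size_t n, double total_time_s)
{
	struct sample_stats s = { 0.0, 0.0 };
	double sq = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		s.mean += total_time_s / (double)counts[i];
	s.mean /= (double)n;

	for (i = 0; i < n; i++) {
		double d = total_time_s / (double)counts[i] - s.mean;
		sq += d * d;
	}
	s.stddev = newton_sqrt(sq / (double)(n - 1));
	return s;
}

struct diff_result {
	double diff;		/* mean(b) - mean(a) */
	double half_width;	/* +/- term of the confidence interval */
	double pooled_s;	/* pooled standard deviation */
};

/* Pooled-variance interval for two samples of the same size n.
 * t_crit is the two-sided critical value for 2n - 2 degrees of freedom. */
static struct diff_result compare_equal_n(struct sample_stats a,
					  struct sample_stats b,
					  size_t n, double t_crit)
{
	struct diff_result r;

	r.pooled_s = newton_sqrt((a.stddev * a.stddev +
				  b.stddev * b.stddev) / 2.0);
	r.diff = b.mean - a.mean;
	r.half_width = t_crit * r.pooled_s * newton_sqrt(2.0 / (double)n);
	return r;
}
```

Fed the two Skylake rows with total_time_s = 20 and t_crit = 2.306 (two-sided 95%, 8 degrees of freedom), this should land close to the 1.2682e-08 +/- 1.01766e-08 seconds that ministat reports.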
      
      For Cascadelake:
      Cascadelake     Syncookies over 20s (5 tests)
      ---------------------------------------------------------
      indirect     10339797 10297547 10366826 10378891 10384854
      retpoline    10332674 10366805 10320374 10334272 10374087
      
      Computing the stats on the time per packet in seconds (20/total_packets)
      gives no meaningful difference even at just 80% confidence (this was expected):
      
      $ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp
      x /tmp/cl-indirect
      + /tmp/cl-retp
      +----------------------------------------------------------------------+
      |   x    x  +     *                   x   + +        +                x|
      ||______________|_M_________A_____A_______M________|___|               |
      +----------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x   5   1.92588e-06   1.94221e-06   1.92923e-06  1.931716e-06 6.6936746e-09
      +   5   1.92788e-06   1.93791e-06   1.93531e-06  1.933188e-06 4.3734106e-09
      No difference proven at 80.0% confidence
      ====================
      
      Link: https://lore.kernel.org/r/20210201174132.3534118-1-brianvv@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: indirect call helpers for ipv4/ipv6 dst_check functions · bbd807df
      Brian Vazquez authored
      This patch avoids the indirect call for the common cases:
      ip6_dst_check and ipv4_dst_check.
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: use indirect call helpers for dst_mtu · f67fbeae
      Brian Vazquez authored
      This patch avoids the indirect call for the common cases:
      ip6_mtu and ipv4_mtu.
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: use indirect call helpers for dst_output · 6585d7dc
      Brian Vazquez authored
      This patch avoids the indirect call for the common cases:
      ip6_output and ip_output.
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: use indirect call helpers for dst_input · e43b2190
      Brian Vazquez authored
      This patch avoids the indirect call for the common cases:
      ip_local_deliver and ip6_input.
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: usb: cdc_ncm: use new API for bh tasklet · 4f4e5436
      Emil Renner Berthing authored
      This converts the driver to use the new tasklet API introduced in
      commit 12cc923f ("tasklet: Introduce new initialization API").
      
      It is unfortunate that we need to add a pointer to the driver context to
      get back to the usbnet device, but the space will be reclaimed once
      there are no more users of the old API left and we can remove the data
      value and flag from the tasklet struct.
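The extra back-pointer can be pictured with a small userspace mock. In the new-style API the callback receives the tasklet itself instead of an opaque data value, so the driver recovers its context from the embedded tasklet and then follows a pointer to the device. Every name below (mock_tasklet, mock_ncm_ctx, mock_usbnet, and the helpers) is a hypothetical stand-in for illustration, not the kernel or cdc_ncm definitions.

```c
#include <stddef.h>

/* Mock tasklet: the callback is handed the tasklet pointer itself. */
struct mock_tasklet {
	void (*callback)(struct mock_tasklet *t);
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define mock_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Mock device; tx_kicks just records that the bottom half ran. */
struct mock_usbnet {
	int tx_kicks;
};

/* Mock driver context embedding the tasklet. */
struct mock_ncm_ctx {
	struct mock_tasklet bh;
	struct mock_usbnet *dev;	/* back-pointer needed by the callback */
};

static void mock_tasklet_setup(struct mock_tasklet *t,
			       void (*cb)(struct mock_tasklet *))
{
	t->callback = cb;
}

/* Bottom half: tasklet -> enclosing context -> device. */
static void mock_ncm_txpath_bh(struct mock_tasklet *t)
{
	struct mock_ncm_ctx *ctx =
		mock_container_of(t, struct mock_ncm_ctx, bh);

	ctx->dev->tx_kicks++;
}

/* Stand-in for the scheduler running the tasklet. */
static void mock_tasklet_run(struct mock_tasklet *t)
{
	t->callback(t);
}
```

Without the dev member the callback could reach the context but not the device, which is the cost the commit message describes.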
      Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
      Link: https://lore.kernel.org/r/20210130234637.26505-1-kernel@esmil.dk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: fec: Silence M5272 build warnings · 32d1bbb1
      Geert Uytterhoeven authored
      If CONFIG_M5272=y:
      
          drivers/net/ethernet/freescale/fec_main.c: In function ‘fec_restart’:
          drivers/net/ethernet/freescale/fec_main.c:948:6: warning: unused variable ‘val’ [-Wunused-variable]
            948 |  u32 val;
                |      ^~~
          drivers/net/ethernet/freescale/fec_main.c: In function ‘fec_get_mac’:
          drivers/net/ethernet/freescale/fec_main.c:1667:28: warning: unused variable ‘pdata’ [-Wunused-variable]
           1667 |  struct fec_platform_data *pdata = dev_get_platdata(&fep->pdev->dev);
                |                            ^~~~~
      
      Fix this by moving the variable declarations inside the existing #ifdef
      blocks.
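As an illustration of the pattern (not the fec_main.c code), the sketch below declares a variable inside the same preprocessor block that uses it. MOCK_M5272 is a hypothetical stand-in for CONFIG_M5272, and mock_restart with its bit manipulation is invented for the example.

```c
/* Hypothetical example: 'val' is only needed on non-M5272 builds. */
static unsigned int mock_restart(unsigned int reg)
{
#ifndef MOCK_M5272
	/* Declared inside the block that uses it, so a build with
	 * MOCK_M5272 defined never sees an unused variable. */
	unsigned int val;

	val = reg | 0x1u;
	return val;
#else
	return reg;
#endif
}
```

Had val been declared at the top of the function, a build with MOCK_M5272 defined would warn about it under -Wunused-variable, as in the report above.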
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Link: https://lore.kernel.org/r/20210202130650.865023-1-geert@linux-m68k.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • inet: do not export inet_gro_{receive|complete} · fca23f37
      Eric Dumazet authored
      inet_gro_receive() and inet_gro_complete() are part
      of the GRO engine, which cannot be modular.
      
      Similarly, inet_gso_segment() does not need to be exported,
      being part of the GSO stack.
      
      In other words, net/ipv6/ip6_offload.o is part of vmlinux,
      regardless of CONFIG_IPV6.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Link: https://lore.kernel.org/r/20210202154145.1568451-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>