1. 02 Oct, 2018 1 commit
  2. 01 Oct, 2018 11 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-per-cpu-cgroup-storage' · cb86d0f8
      Daniel Borkmann authored
      Roman Gushchin says:
      
      ====================
      This patchset implements per-cpu cgroup local storage and provides
      an example how per-cpu and shared cgroup local storage can be used
      for efficient accounting of network traffic.
      
      v4->v3:
        1) incorporated Alexei's feedback
      
      v3->v2:
        1) incorporated Song's feedback
        2) rebased on top of current bpf-next
      
      v2->v1:
        1) added a selftest implementing network counters
        2) added a missing free() in cgroup local storage selftest
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      cb86d0f8
    • Roman Gushchin's avatar
      selftests/bpf: cgroup local storage-based network counters · 371e4fcc
      Roman Gushchin authored
      This commit adds a bpf kselftest, which demonstrates how percpu
      and shared cgroup local storage can be used for efficient lookup-free
      network accounting.
      
      Cgroup local storage provides generic memory area with a very efficient
      lookup free access. To avoid expensive atomic operations for each
      packet, per-cpu cgroup local storage is used. Each packet is initially
      charged to a per-cpu counter, and only if the counter reaches certain
      value (32 in this case), the charge is moved into the global atomic
      counter. This allows to amortize atomic operations, keeping reasonable
      accuracy.
      
      The test also implements a naive network traffic throttling, mostly to
      demonstrate the possibility of bpf cgroup--based network bandwidth
      control.
      
      Expected output:
        ./test_netcnt
        test_netcnt:PASS
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      371e4fcc
    • Roman Gushchin's avatar
      samples/bpf: extend test_cgrp2_attach2 test to use per-cpu cgroup storage · 5fcbd29b
      Roman Gushchin authored
      This commit extends the test_cgrp2_attach2 test to cover per-cpu
      cgroup storage. Bpf program will use shared and per-cpu cgroup
      storages simultaneously, so a better coverage of corresponding
      core code will be achieved.
      
      Expected output:
        $ ./test_cgrp2_attach2
        Attached DROP prog. This ping in cgroup /foo should fail...
        ping: sendmsg: Operation not permitted
        Attached DROP prog. This ping in cgroup /foo/bar should fail...
        ping: sendmsg: Operation not permitted
        Attached PASS prog. This ping in cgroup /foo/bar should pass...
        Detached PASS from /foo/bar while DROP is attached to /foo.
        This ping in cgroup /foo/bar should fail...
        ping: sendmsg: Operation not permitted
        Attached PASS from /foo/bar and detached DROP from /foo.
        This ping in cgroup /foo/bar should pass...
        ### override:PASS
        ### multi:PASS
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      5fcbd29b
    • Roman Gushchin's avatar
      selftests/bpf: extend the storage test to test per-cpu cgroup storage · 919646d2
      Roman Gushchin authored
      This test extends the cgroup storage test to use per-cpu flavor
      of the cgroup storage as well.
      
      The test initializes a per-cpu cgroup storage to some non-zero initial
      value (1000), and then simple bumps a per-cpu counter each time
      the shared counter is atomically incremented. Then it reads all
      per-cpu areas from the userspace side, and checks that the sum
      of values adds to the expected sum.
      
      Expected output:
        $ ./test_cgroup_storage
        test_cgroup_storage:PASS
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      919646d2
    • Roman Gushchin's avatar
      selftests/bpf: add verifier per-cpu cgroup storage tests · a3c6054f
      Roman Gushchin authored
      This commits adds verifier tests covering per-cpu cgroup storage
      functionality. There are 6 new tests, which are exactly the same
      as for shared cgroup storage, but do use per-cpu cgroup storage
      map.
      
      Expected output:
        $ ./test_verifier
        #0/u add+sub+mul OK
        #0/p add+sub+mul OK
        ...
        #286/p invalid cgroup storage access 6 OK
        #287/p valid per-cpu cgroup storage access OK
        #288/p invalid per-cpu cgroup storage access 1 OK
        #289/p invalid per-cpu cgroup storage access 2 OK
        #290/p invalid per-cpu cgroup storage access 3 OK
        #291/p invalid per-cpu cgroup storage access 4 OK
        #292/p invalid per-cpu cgroup storage access 5 OK
        #293/p invalid per-cpu cgroup storage access 6 OK
        #294/p multiple registers share map_lookup_elem result OK
        ...
        #662/p mov64 src == dst OK
        #663/p mov64 src != dst OK
        Summary: 914 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      a3c6054f
    • Roman Gushchin's avatar
      bpftool: add support for PERCPU_CGROUP_STORAGE maps · e5487092
      Roman Gushchin authored
      This commit adds support for BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE
      map type.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarJakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      e5487092
    • Roman Gushchin's avatar
      bpf: sync include/uapi/linux/bpf.h to tools/include/uapi/linux/bpf.h · 25025e0a
      Roman Gushchin authored
      The sync is required due to the appearance of a new map type:
      BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, which implements per-cpu
      cgroup local storage.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      25025e0a
    • Roman Gushchin's avatar
      bpf: don't allow create maps of per-cpu cgroup local storages · c6fdcd6e
      Roman Gushchin authored
      Explicitly forbid creating map of per-cpu cgroup local storages.
      This behavior matches the behavior of shared cgroup storages.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c6fdcd6e
    • Roman Gushchin's avatar
      bpf: introduce per-cpu cgroup local storage · b741f163
      Roman Gushchin authored
      This commit introduced per-cpu cgroup local storage.
      
      Per-cpu cgroup local storage is very similar to simple cgroup storage
      (let's call it shared), except all the data is per-cpu.
      
      The main goal of per-cpu variant is to implement super fast
      counters (e.g. packet counters), which don't require neither
      lookups, neither atomic operations.
      
      >From userspace's point of view, accessing a per-cpu cgroup storage
      is similar to other per-cpu map types (e.g. per-cpu hashmaps and
      arrays).
      
      Writing to a per-cpu cgroup storage is not atomic, but is performed
      by copying longs, so some minimal atomicity is here, exactly
      as with other per-cpu maps.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      b741f163
    • Roman Gushchin's avatar
      bpf: rework cgroup storage pointer passing · f294b37e
      Roman Gushchin authored
      To simplify the following introduction of per-cpu cgroup storage,
      let's rework a bit a mechanism of passing a pointer to a cgroup
      storage into the bpf_get_local_storage(). Let's save a pointer
      to the corresponding bpf_cgroup_storage structure, instead of
      a pointer to the actual buffer.
      
      It will help us to handle per-cpu storage later, which has
      a different way of accessing to the actual data.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      f294b37e
    • Roman Gushchin's avatar
      bpf: extend cgroup bpf core to allow multiple cgroup storage types · 8bad74f9
      Roman Gushchin authored
      In order to introduce per-cpu cgroup storage, let's generalize
      bpf cgroup core to support multiple cgroup storage types.
      Potentially, per-node cgroup storage can be added later.
      
      This commit is mostly a formal change that replaces
      cgroup_storage pointer with a array of cgroup_storage pointers.
      It doesn't actually introduce a new storage type,
      it will be done later.
      
      Each bpf program is now able to have one cgroup storage of each type.
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      8bad74f9
  3. 28 Sep, 2018 1 commit
  4. 27 Sep, 2018 17 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-libbpf-attach-by-name' · 78e6e5c1
      Daniel Borkmann authored
      Andrey Ignatov says:
      
      ====================
      This patch set introduces libbpf_attach_type_by_name function in libbpf
      to identify attach type by section name.
      
      This is useful to avoid writing same logic over and over again in user
      space applications that leverage libbpf.
      
      Patch 1 has more details on the new function and problem being solved.
      Patches 2 and 3 add support for new section names.
      Patch 4 uses new function in a selftest.
      Patch 5 adds selftest for libbpf_{prog,attach}_type_by_name.
      
      As a side note there are a lot of inconsistencies now between names used
      by libbpf and bpftool (e.g. cgroup/skb vs cgroup_skb, cgroup_device and
      device vs cgroup/dev, sockops vs sock_ops, etc). This patch set does not
      address it but it tries not to make it harder to address it in the future.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      78e6e5c1
    • Andrey Ignatov's avatar
      selftests/bpf: Test libbpf_{prog,attach}_type_by_name · 370920c4
      Andrey Ignatov authored
      Add selftest for libbpf functions libbpf_prog_type_by_name and
      libbpf_attach_type_by_name.
      
      Example of output:
        % ./tools/testing/selftests/bpf/test_section_names
        Summary: 35 PASSED, 0 FAILED
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      370920c4
    • Andrey Ignatov's avatar
      selftests/bpf: Use libbpf_attach_type_by_name in test_socket_cookie · c9bf507d
      Andrey Ignatov authored
      Use newly introduced libbpf_attach_type_by_name in test_socket_cookie
      selftest.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c9bf507d
    • Andrey Ignatov's avatar
      libbpf: Support sk_skb/stream_{parser, verdict} section names · c6f6851b
      Andrey Ignatov authored
      Add section names for BPF_SK_SKB_STREAM_PARSER and
      BPF_SK_SKB_STREAM_VERDICT attach types to be able to identify them in
      libbpf_attach_type_by_name.
      
      "stream_parser" and "stream_verdict" are used instead of simple "parser"
      and "verdict" just to avoid possible confusion in a place where attach
      type is used alone (e.g. in bpftool's show sub-commands) since there is
      another attach point that can be named as "verdict": BPF_SK_MSG_VERDICT.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      c6f6851b
    • Andrey Ignatov's avatar
      libbpf: Support cgroup_skb/{e,in}gress section names · bafa7afe
      Andrey Ignatov authored
      Add section names for BPF_CGROUP_INET_INGRESS and BPF_CGROUP_INET_EGRESS
      attach types to be able to identify them in libbpf_attach_type_by_name.
      
      "cgroup_skb" is used instead of "cgroup/skb" mostly to easy possible
      unifying of how libbpf and bpftool works with section names:
      * bpftool uses "cgroup_skb" to in "prog list" sub-command;
      * bpftool uses "ingress" and "egress" in "cgroup list" sub-command;
      * having two parts instead of three in a string like "cgroup_skb/ingress"
        can be leveraged to split it to prog_type part and attach_type part,
        or vise versa: use two parts to make a section name.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      bafa7afe
    • Andrey Ignatov's avatar
      libbpf: Introduce libbpf_attach_type_by_name · 956b620f
      Andrey Ignatov authored
      There is a common use-case when ELF object contains multiple BPF
      programs and every program has its own section name. If it's cgroup-bpf
      then programs have to be 1) loaded and 2) attached to a cgroup.
      
      It's convenient to have information necessary to load BPF program
      together with program itself. This is where section name works fine in
      conjunction with libbpf_prog_type_by_name that identifies prog_type and
      expected_attach_type and these can be used with BPF_PROG_LOAD.
      
      But there is currently no way to identify attach_type by section name
      and it leads to messy code in user space that reinvents guessing logic
      every time it has to identify attach type to use with BPF_PROG_ATTACH.
      
      The patch introduces libbpf_attach_type_by_name that guesses attach type
      by section name if a program can be attached.
      
      The difference between expected_attach_type provided by
      libbpf_prog_type_by_name and attach_type provided by
      libbpf_attach_type_by_name is the former is used at BPF_PROG_LOAD time
      and can be zero if a program of prog_type X has only one corresponding
      attach type Y whether the latter provides specific attach type to use
      with BPF_PROG_ATTACH.
      
      No new section names were added to section_names array. Only existing
      ones were reorganized and attach_type was added where appropriate.
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      956b620f
    • Song Liu's avatar
      bpf: test_bpf: add init_net to dev for flow_dissector · 10081193
      Song Liu authored
      Latest changes in __skb_flow_dissect() assume skb->dev has valid nd_net.
      However, this is not true for test_bpf. As a result, test_bpf.ko crashes
      the system with the following stack trace:
      
      [ 1133.716622] BUG: unable to handle kernel paging request at 0000000000001030
      [ 1133.716623] PGD 8000001fbf7ee067
      [ 1133.716624] P4D 8000001fbf7ee067
      [ 1133.716624] PUD 1f6c1cf067
      [ 1133.716625] PMD 0
      [ 1133.716628] Oops: 0000 [#1] SMP PTI
      [ 1133.716630] CPU: 7 PID: 40473 Comm: modprobe Kdump: loaded Not tainted 4.19.0-rc5-00805-gca11cc92ccd2 #1167
      [ 1133.716631] Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM12.5 12/06/2017
      [ 1133.716638] RIP: 0010:__skb_flow_dissect+0x83/0x1680
      [ 1133.716639] Code: 04 00 00 41 0f b7 44 24 04 48 85 db 4d 8d 14 07 0f 84 01 02 00 00 48 8b 43 10 48 85 c0 0f 84 e5 01 00 00 48 8b 80 a8 04 00 00 <48> 8b 90 30 10 00 00 48 85 d2 0f 84 dd 01 00 00 31 c0 b9 05 00 00
      [ 1133.716640] RSP: 0018:ffffc900303c7a80 EFLAGS: 00010282
      [ 1133.716642] RAX: 0000000000000000 RBX: ffff881fea0b7400 RCX: 0000000000000000
      [ 1133.716643] RDX: ffffc900303c7bb4 RSI: ffffffff8235c3e0 RDI: ffff881fea0b7400
      [ 1133.716643] RBP: ffffc900303c7b80 R08: 0000000000000000 R09: 000000000000000e
      [ 1133.716644] R10: ffffc900303c7bb4 R11: ffff881fb6840400 R12: ffffffff8235c3e0
      [ 1133.716645] R13: 0000000000000008 R14: 000000000000001e R15: ffffc900303c7bb4
      [ 1133.716646] FS:  00007f54e75d3740(0000) GS:ffff881fff5c0000(0000) knlGS:0000000000000000
      [ 1133.716648] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1133.716649] CR2: 0000000000001030 CR3: 0000001f6c226005 CR4: 00000000003606e0
      [ 1133.716649] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1133.716650] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1133.716651] Call Trace:
      [ 1133.716660]  ? sched_clock_cpu+0xc/0xa0
      [ 1133.716662]  ? sched_clock_cpu+0xc/0xa0
      [ 1133.716665]  ? log_store+0x1b5/0x260
      [ 1133.716667]  ? up+0x12/0x60
      [ 1133.716669]  ? skb_get_poff+0x4b/0xa0
      [ 1133.716674]  ? __kmalloc_reserve.isra.47+0x2e/0x80
      [ 1133.716675]  skb_get_poff+0x4b/0xa0
      [ 1133.716680]  bpf_skb_get_pay_offset+0xa/0x10
      [ 1133.716686]  ? test_bpf_init+0x578/0x1000 [test_bpf]
      [ 1133.716690]  ? netlink_broadcast_filtered+0x153/0x3d0
      [ 1133.716695]  ? free_pcppages_bulk+0x324/0x600
      [ 1133.716696]  ? 0xffffffffa0279000
      [ 1133.716699]  ? do_one_initcall+0x46/0x1bd
      [ 1133.716704]  ? kmem_cache_alloc_trace+0x144/0x1a0
      [ 1133.716709]  ? do_init_module+0x5b/0x209
      [ 1133.716712]  ? load_module+0x2136/0x25d0
      [ 1133.716715]  ? __do_sys_finit_module+0xba/0xe0
      [ 1133.716717]  ? __do_sys_finit_module+0xba/0xe0
      [ 1133.716719]  ? do_syscall_64+0x48/0x100
      [ 1133.716724]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This patch fixes tes_bpf by using init_net in the dummy dev.
      
      Fixes: d58e468b ("flow_dissector: implements flow dissector BPF hook")
      Reported-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Petar Penkov <ppenkov@google.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      10081193
    • Andrey Ignatov's avatar
      bpftool: Fix bpftool net output · 53d6eb08
      Andrey Ignatov authored
      Print `bpftool net` output to stdout instead of stderr. Only errors
      should be printed to stderr. Regular output should go to stdout and this
      is what all other subcommands of bpftool do, including --json and
      --pretty formats of `bpftool net` itself.
      
      Fixes: commit f6f3bac0 ("tools/bpf: bpftool: add net support")
      Signed-off-by: default avatarAndrey Ignatov <rdna@fb.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      53d6eb08
    • Maciej Żenczykowski's avatar
      net-ipv4: remove 2 always zero parameters from ipv4_redirect() · 1042caa7
      Maciej Żenczykowski authored
      (the parameters in question are mark and flow_flags)
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarMaciej Żenczykowski <maze@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1042caa7
    • Maciej Żenczykowski's avatar
      net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu() · d888f396
      Maciej Żenczykowski authored
      (the parameters in question are mark and flow_flags)
      Reviewed-by: default avatarDavid Ahern <dsahern@gmail.com>
      Signed-off-by: default avatarMaciej Żenczykowski <maze@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d888f396
    • Maxime Chevallier's avatar
      net: mvneta: Add support for 2500Mbps SGMII · da58a931
      Maxime Chevallier authored
      The mvneta controller can handle speeds up to 2500Mbps on the SGMII
      interface. This relies on serdes configuration, the lane must be
      configured at 3.125Gbps and we can't use in-band autoneg at that speed.
      
      The main issue when supporting that speed on this particular controller
      is that the link partner can send ethernet frames with a shortened
      preamble, which if not explicitly enabled in the controller will cause
      unexpected behaviours.
      
      This was tested on Armada 385, with the comphy configuration done in
      bootloader.
      Signed-off-by: default avatarMaxime Chevallier <maxime.chevallier@bootlin.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      da58a931
    • David S. Miller's avatar
      Merge branch 'net-vhost-improve-performance-when-enable-busyloop' · c09c1474
      David S. Miller authored
      Tonghao Zhang says:
      
      ====================
      net: vhost: improve performance when enable busyloop
      
      This patches improve the guest receive performance.
      On the handle_tx side, we poll the sock receive queue
      at the same time. handle_rx do that in the same way.
      
      For more performance report, see patch 4
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c09c1474
    • Tonghao Zhang's avatar
      net: vhost: add rx busy polling in tx path · 441abde4
      Tonghao Zhang authored
      This patch improves the guest receive performance.
      On the handle_tx side, we poll the sock receive queue at the
      same time. handle_rx do that in the same way.
      
      We set the poll-us=100us and use the netperf to test throughput
      and mean latency. When running the tests, the vhost-net kthread
      of that VM, is alway 100% CPU. The commands are shown as below.
      
      Rx performance is greatly improved by this patch. There is not
      notable performance change on tx with this series though. This
      patch is useful for bi-directional traffic.
      
      netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY"
      
      Topology:
      [Host] ->linux bridge -> tap vhost-net ->[Guest]
      
      TCP_STREAM:
      * Without the patch:  19842.95 Mbps, 6.50 us mean latency
      * With the patch:     37598.20 Mbps, 3.43 us mean latency
      Signed-off-by: default avatarTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      441abde4
    • Tonghao Zhang's avatar
      net: vhost: factor out busy polling logic to vhost_net_busy_poll() · dc151282
      Tonghao Zhang authored
      Factor out generic busy polling logic and will be
      used for in tx path in the next patch. And with the patch,
      qemu can set differently the busyloop_timeout for rx queue.
      
      To avoid duplicate codes, introduce the helper functions:
      * sock_has_rx_data(changed from sk_has_rx_data)
      * vhost_net_busy_poll_try_queue
      Signed-off-by: default avatarTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dc151282
    • Tonghao Zhang's avatar
      net: vhost: replace magic number of lock annotation · a6a67a2f
      Tonghao Zhang authored
      Use the VHOST_NET_VQ_XXX as a subclass for mutex_lock_nested.
      Signed-off-by: default avatarTonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a6a67a2f
    • Tonghao Zhang's avatar
      net: vhost: lock the vqs one by one · 78139c94
      Tonghao Zhang authored
      This patch changes the way that lock all vqs
      at the same, to lock them one by one. It will
      be used for next patch to avoid the deadlock.
      Signed-off-by: default avatarTonghao Zhang <xiangxia.m.yue@gmail.com>
      Acked-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      78139c94
    • Yafang Shao's avatar
      tcp: expose sk_state in tcp_retransmit_skb tracepoint · af4325ec
      Yafang Shao authored
      After sk_state exposed, we can get in which state this retransmission
      occurs. That could give us more detail for dignostic.
      For example, if this retransmission occurs in SYN_SENT state, it may
      also indicates that the syn packet may be dropped on the remote peer due
      to syn backlog queue full and then we could check the remote peer.
      
      BTW,SYNACK retransmission is traced in tcp_retransmit_synack tracepoint.
      Signed-off-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      af4325ec
  5. 26 Sep, 2018 10 commits