1. 07 Aug, 2017 5 commits
    • Merge branch 'IP-cleanup-LSRR-option-processing' · 21e27f2d
      David S. Miller authored
      Paolo Abeni says:
      
      ====================
      IP: cleanup LSRR option processing
      
      The __ip_options_echo() function expects a valid dst entry in skb->dst;
      as a result we sometimes need to preserve the dst entry for the whole IP
      RX path.
      
      The current usage of skb->dst looks more like a relic from the ancient past
      than a real functional constraint. This patchset tries to remove such usage,
      and then drops some hacks currently in place in the IP code to keep
      skb->dst around.
      
      __ip_options_echo() uses skb->dst for two different purposes: retrieving
      the netns associated with the skb, and modifying the ingress packet LSRR
      address list.
      
      The first patch removes the code modifying the ingress packet, and the second
      one provides an explicit netns argument to __ip_options_echo(). The following
      patches clean up the current code keeping skb->dst around for
      __ip_options_echo()'s sake.
      
      Updating the __ip_options_echo() function has been previously discussed here:
      
      http://marc.info/?l=linux-netdev&m=150064533516348&w=2
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: no need to preserve skb->dst · 3bdefdf9
      Paolo Abeni authored
      __ip_options_echo() no longer needs skb->dst, so we can
      avoid explicitly preserving it for its own sake.
      
      This is almost a revert of commit 0ddf3fb2 ("udp: preserve
      skb->dst if required for IP options processing") plus some
      lifting to fit later changes.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "ipv4: keep skb->dst around in presence of IP options" · 61a1030b
      Paolo Abeni authored
      ip_options_echo() no longer uses skb->dst, so there is no
      need to keep the dst around for the options' sake only.
      This reverts commit 34b2cef2.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip/options: explicitly provide net ns to __ip_options_echo() · 91ed1e66
      Paolo Abeni authored
      __ip_options_echo() uses the current network namespace, and
      currently retrieves it via skb->dst->dev.
      
      This commit adds an explicit 'net' argument to __ip_options_echo()
      and updates all the call sites to provide it, usually via a simpler
      sock_net().
      
      After this change, __ip_options_echo() no longer needs to access
      skb->dst and we can drop a couple of hacks that preserve such
      info in the rx path.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
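The idea of passing the namespace explicitly, rather than deriving it from skb->dst, can be illustrated with a small userspace model. This is only a sketch: every `demo_*` type and function below is hypothetical and much simpler than the kernel's structures; it shows the shape of the change, not the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature types; these are not the kernel's definitions. */
struct demo_net { int id; };
struct demo_dev { struct demo_net *net; };
struct demo_dst { struct demo_dev *dev; };
struct demo_skb { struct demo_dst *dst; };
struct demo_sock { struct demo_net *net; };

/* Before: the namespace is reachable only while skb->dst is still attached. */
static struct demo_net *demo_net_from_skb(const struct demo_skb *skb)
{
	return skb->dst ? skb->dst->dev->net : NULL;
}

/* After: the caller passes the namespace, e.g. taken from its socket. */
static int demo_options_echo(const struct demo_net *net,
			     const struct demo_skb *skb)
{
	(void)skb;		/* skb->dst is never dereferenced here */
	assert(net != NULL);
	return net->id;
}
```

With the explicit argument, a caller that has already dropped the dst (modelled here as `skb.dst == NULL`) can still call `demo_options_echo(sk.net, &skb)`, whereas `demo_net_from_skb()` would have nothing to return.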
    • IP: do not modify ingress packet IP option in ip_options_echo() · a1e155ec
      Paolo Abeni authored
      While computing the response option set for LSRR, ip_options_echo()
      also changes the ingress packet LSRR addresses list, setting
      the last one to the dst specific address for the ingress packet
      - via memset(start[ ...
      The only visible effect of such change - beyond possibly damaging
      shared/cloned skbs - is modifying the data carried by ICMP replies,
      changing the header information reported for the ingress packet,
      which violates RFC1122 3.2.2.6.
      All the other call sites just ignore the ingress packet IP options
      after calling ip_options_echo().
      Note that the last element in the LSRR option address list for the
      reply packet will be properly set later in the ip output path
      via ip_options_build().
      This buggy memset() predates git history and apparently was present
      in the initial ip_options_echo() implementation in linux 1.3.30 but
      still looks wrong.
      
      The removal of the fib_compute_spec_dst() call will help
      completely dropping the skb->dst usage by __ip_options_echo() with a
      later patch.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 05 Aug, 2017 1 commit
  3. 04 Aug, 2017 34 commits
    • net: comment fixes against BPF devmap helper calls · 56ce097c
      John Fastabend authored
      Update BPF comments to accurately reflect XDP usage.
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Reported-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-sched-summer-cleanup-part-1-mainly-in-exts-area' · 8f752224
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      net: sched: summer cleanup part 1, mainly in exts area
      
      This patchset is one of a couple of cleanup patchsets I have in queue.
      The motivation, aside from the obvious need to "make things nicer", is also
      to prepare for shared filter blocks introduction. That requires tp->q
      removal, and therefore removal of all tp->q users.
      
      Patch 1 is just some small thing I spotted on the way
      Patch 2 removes one user of tp->q, namely tcf_em_tree_change
      Patches 3-8 do preparations for exts->nr_actions removal
      Patches 9-10 do simple renames of functions in cls*
      Patches 11-19 remove unnecessary calls of tcf_exts_change helper
      The last patch changes tcf_exts_change so that it does not take the lock
      
      Tested by tools/testing/selftests/tc-testing
      
      v1->v2:
      - removed conversion of action array to list as noted by Cong
      - added the past patch instead
      - small rebases of patches 11-19
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: avoid atomic swap in tcf_exts_change · 9b0d4446
      Jiri Pirko authored
      tcf_exts_change is always called on newly created exts, which are not used
      on fastpath. Therefore, simple struct copy is enough.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
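Why a plain struct copy suffices for an object no reader can see yet can be sketched in userspace as follows. The `demo_exts` struct and `demo_exts_move()` are hypothetical stand-ins, and emptying the source after the copy is an assumption made here to keep ownership obvious; it is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for an extensions struct. */
struct demo_exts {
	int nr_actions;
	void **actions;
};

/*
 * dst belongs to a filter that has not been published yet, so no fast-path
 * reader can observe it and a plain struct assignment is enough. src is
 * emptied afterwards so the actions have exactly one owner (an assumption
 * of this sketch).
 */
static void demo_exts_move(struct demo_exts *dst, struct demo_exts *src)
{
	assert(dst != src);
	*dst = *src;
	src->nr_actions = 0;
	src->actions = NULL;
}
```

No lock or atomic exchange appears because the destination is private to the caller until the enclosing filter is linked in.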
    • net: sched: cls_u32: no need to call tcf_exts_change for newly allocated struct · 705c7091
      Jiri Pirko authored
      As the n struct was allocated right before u32_set_parms call,
      no need to use tcf_exts_change to do atomic change, and we can just
      fill-up the unused exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_route: no need to call tcf_exts_change for newly allocated struct · 8c98d571
      Jiri Pirko authored
      As the f struct was allocated right before route4_set_parms call,
      no need to use tcf_exts_change to do atomic change, and we can just
      fill-up the unused exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_flow: no need to call tcf_exts_change for newly allocated struct · c09fc2e1
      Jiri Pirko authored
      As the fnew struct was just allocated, there is no need to use
      tcf_exts_change to do atomic change, and we can just fill-up the unused
      exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_cgroup: no need to call tcf_exts_change for newly allocated struct · 8cc62513
      Jiri Pirko authored
      As the new struct was just allocated, there is no need to use
      tcf_exts_change to do atomic change, and we can just fill-up the unused
      exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_bpf: no need to call tcf_exts_change for newly allocated struct · 6839da32
      Jiri Pirko authored
      As the prog struct was allocated right before cls_bpf_set_parms call,
      no need to use tcf_exts_change to do atomic change, and we can just
      fill-up the unused exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_basic: no need to call tcf_exts_change for newly allocated struct · ff1f8ca0
      Jiri Pirko authored
      As the f struct was allocated right before basic_set_parms call, no need
      to use tcf_exts_change to do atomic change, and we can just fill-up
      the unused exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_matchall: no need to call tcf_exts_change for newly allocated struct · a74cb369
      Jiri Pirko authored
      As the head struct was allocated right before mall_set_parms call,
      no need to use tcf_exts_change to do atomic change, and we can just
      fill-up the unused exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_fw: no need to call tcf_exts_change for newly allocated struct · 94611bff
      Jiri Pirko authored
      As the f struct was allocated right before fw_set_parms call, no need
      to use tcf_exts_change to do atomic change, and we can just fill-up
      the unused exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_flower: no need to call tcf_exts_change for newly allocated struct · 45507529
      Jiri Pirko authored
      As the f struct was allocated right before fl_set_parms call, no need
      to use tcf_exts_change to do atomic change, and we can just fill-up
      the unused exts struct directly by tcf_exts_validate.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_fw: rename fw_change_attrs function · 1e5003af
      Jiri Pirko authored
      The function name is misleading since it does not change
      anything, so name it similarly to the other cls.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: cls_bpf: rename cls_bpf_modify_existing function · 6a725c48
      Jiri Pirko authored
      The name cls_bpf_modify_existing is highly misleading, as it indeed does
      not modify anything existing. It does not modify at all.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: use tcf_exts_has_actions instead of exts->nr_actions · 978dfd8d
      Jiri Pirko authored
      For the check in tcf_exts_dump, use the tcf_exts_has_actions helper instead
      of exts->nr_actions to test whether any actions are present.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: remove check for number of actions in tcf_exts_exec · ec1a9cca
      Jiri Pirko authored
      Leave it to tcf_action_exec to return TC_ACT_OK in case there is no
      action present.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix return value of tcf_exts_exec · af089e70
      Jiri Pirko authored
      Return the defined TC_ACT_OK instead of 0.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: remove redundant helpers tcf_exts_is_predicative and tcf_exts_is_available · 6fc6d06e
      Jiri Pirko authored
      These two helpers are doing the same as tcf_exts_has_actions, so remove
      them and use tcf_exts_has_actions instead.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: use tcf_exts_has_actions in tcf_exts_exec · af69afc5
      Jiri Pirko authored
      Use the tcf_exts_has_actions helper instead of directly testing
      exts->nr_actions in tcf_exts_exec.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: change names of action number helpers to be aligned with the rest · 3bcc0cec
      Jiri Pirko authored
      The rest of the helpers are named tcf_exts_*, so change the name of
      the action number helpers to be aligned. While at it, change to inline
      functions.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: remove unneeded tcf_em_tree_change · 4ebc1e3c
      Jiri Pirko authored
      Since tcf_em_tree_validate could be always called on a newly created
      filter, there is no need for this change function.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: sch_atm: use Qdisc_class_common structure · f7ebdff7
      Jiri Pirko authored
      Even if it is only for classid now, use this common struct to be aligned
      with the rest of the classful qdiscs.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns: Fix for __udivdi3 compiler error · 967b2e2a
      Lin Yun Sheng authored
      This patch fixes the __udivdi3 undefined error reported by
      test robot.
      
      Fixes: b8c17f70 ("net: hns: Add self-adaptive interrupt coalesce support in hns driver")
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell: logical vs bitwise OR typo · 5987feb3
      Dan Carpenter authored
      This was supposed to be a bitwise OR but there is a || vs | typo.
      
      Fixes: 864dc729 ("net: phy: marvell: Refactor m88e1121 RGMII delay configuration")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
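The difference between the two operators is easy to see in isolation. The sketch below uses hypothetical helper names and arbitrary bit values (0x20 and 0x10 are illustrative, not the Marvell register layout): `||` collapses any nonzero operands to 1, so a mask built with it selects only bit 0.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy form: '||' yields 0 or 1, so two flag bits collapse to 0x0001. */
static uint16_t demo_or_logical(uint16_t a, uint16_t b)
{
	return (uint16_t)(a || b);
}

/* Intended form: '|' combines the bits of both operands. */
static uint16_t demo_or_bitwise(uint16_t a, uint16_t b)
{
	return (uint16_t)(a | b);
}

/* Clear the bits selected by mask, then set the requested ones. */
static uint16_t demo_modify(uint16_t reg, uint16_t mask, uint16_t set)
{
	assert((set & ~mask) == 0);	/* set must lie inside mask */
	return (uint16_t)((reg & ~mask) | set);
}
```

Clearing with the bitwise mask turns 0xffff into 0xffcf, while the logical variant yields 0xfffe and leaves both intended bits untouched.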
    • Merge branch 'socket-sendmsg-zerocopy' · 35615994
      David S. Miller authored
      Willem de Bruijn says:
      
      ====================
      socket sendmsg MSG_ZEROCOPY
      
      Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
      shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
      Implement the feature for TCP initially, as large writes benefit
      most.
      
      On a send call with MSG_ZEROCOPY, the kernel pins user pages and
      links these directly into the skbuff frags[] array.
      
      Each send call with MSG_ZEROCOPY that transmits data will eventually
      queue a completion notification on the error queue: a per-socket u32
      incremented on each such call. A request may have to revert to copy
      to succeed, for instance when a device cannot support scatter-gather
      IO. In that case a flag is passed along to notify that the operation
      succeeded without zerocopy optimization.
      
      The implementation extends the existing zerocopy infra for tuntap,
      vhost and xen with features needed for TCP, notably reference
      counting to handle cloning on retransmit and GSO.
      
      For more details, see also the netdev 2.1 paper and presentation at
      https://netdevconf.org/2.1/session.html?debruijn
      
      Changelog:
      
        v3 -> v4:
          - dropped UDP, RAW and PF_PACKET for now
              Without loopback support, datagrams are usually smaller than
              the ~8KB size threshold needed to benefit from zerocopy.
          - style: a few reverse christmas tree
          - minor: SO_ZEROCOPY returns ENOTSUPP on unsupported protocols
          - minor: squashed SO_EE_CODE_ZEROCOPY_COPIED patch
          - minor: rebased on top of net-next with kmap_atomic fix
      
        v2 -> v3:
          - fix rebase conflict: SO_ZEROCOPY 59 -> 60
      
        v1 -> v2:
          - fix (kbuild-bot): do not remove uarg until patch 5
          - fix (kbuild-bot): move zerocopy_sg_from_iter doc with function
          - fix: remove unused extern in header file
      
        RFCv2 -> v1:
          - patch 2
              - review comment: in skb_copy_ubufs, always allocate order-0
                  page, also when replacing compound source pages.
          - patch 3
              - fix: always queue completion notification on MSG_ZEROCOPY,
                  also if revert to copy.
              - fix: on syscall abort, correctly revert notification state
              - minor: skip queue notification on SOCK_DEAD
              - minor: replace BUG_ON with WARN_ON in recoverable error
          - patch 4
              - new: add socket option SOCK_ZEROCOPY.
                  only honor MSG_ZEROCOPY if set, ignore for legacy apps.
          - patch 5
              - fix: clear zerocopy state on skb_linearize
          - patch 6
              - fix: only coalesce if prev errqueue elem is zerocopy
              - minor: try coalescing with list tail instead of head
              - minor: merge bytelen limit patch
          - patch 7
              - new: signal when data had to be copied
          - patch 8 (tcp)
              - optimize: avoid setting PSH bit when exceeding max frags.
                  that limits GRO on the client. do not goto new_segment.
              - fix: fail on MSG_ZEROCOPY | MSG_FASTOPEN
              - minor: do not wait for memory: does not work for optmem
              - minor: simplify alloc
          - patch 9 (udp)
              - new: add PF_INET6
              - fix: attach zerocopy notification even if revert to copy
              - minor: simplify alloc size arithmetic
          - patch 10 (raw hdrinc)
              - new: add PF_INET6
          - patch 11 (pf_packet)
              - minor: simplify slightly
          - patch 12
              - new msg_zerocopy regression test: use veth pair to test
                  all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork
                  all relevant ethtool settings: rx off, sg off
                  all relevant packet lengths: 0, <MAX_HEADER, max size
      
        RFC -> RFCv2:
          - review comment: do not loop skb with zerocopy frags onto rx:
                add skb_orphan_frags_rx to orphan even refcounted frags
                call this in __netif_receive_skb_core, deliver_skb and tun:
                same as commit 1080e512 ("net: orphan frags on receive")
          - fix: hold an explicit sk reference on each notification skb.
                previously relied on the reference (or wmem) held by the
                data skb that would trigger notification, but this breaks
                on skb_orphan.
          - fix: when aborting a send, do not inc the zerocopy counter
                this caused gaps in the notification chain
          - fix: in packet with SOCK_DGRAM, pull ll headers before calling
                zerocopy_sg_from_iter
          - fix: if sock_zerocopy_realloc does not allow coalescing,
                do not fail, just allocate a new ubuf
          - fix: in tcp, check return value of second allocation attempt
          - chg: allocate notification skbs from optmem
                to avoid affecting tcp write queue accounting (TSQ)
          - chg: limit #locked pages (ulimit) per user instead of per process
          - chg: grow notification ids from 16 to 32 bit
            - pass range [lo, hi] through 32 bit fields ee_info and ee_data
          - chg: rebased to davem-net-next on top of v4.10-rc7
          - add: limit notification coalescing
                sharing ubufs limits overhead, but delays notification until
                the last packet is released, possibly unbounded. Add a cap.
          - tests: add snd_zerocopy_lo pf_packet test
          - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
      
      Limitations / Known Issues:
          - TCP may build slightly smaller than max TSO packets due to
            exceeding MAX_SKB_FRAGS frags when zerocopy pages are unaligned.
          - All SKBTX_SHARED_FRAG may require additional __skb_linearize or
            skb_copy_ubufs calls in u32, skb_find_text, similar to
            skb_checksum_help.
      
      Notification skbuffs are allocated from optmem. For sockets that
      cannot effectively coalesce notifications, the optmem max may need
      to be increased to avoid hitting -ENOBUFS:
      
        sysctl -w net.core.optmem_max=1048576
      
      In application load, copy avoidance shows a roughly 5% systemwide
      reduction in cycles when streaming large flows and a 4-8% reduction in
      wall clock time on early tensorflow test workloads.
      
      For the single-machine veth tests to succeed, loopback support has to
      be temporarily enabled by making skb_orphan_frags_rx map to
      skb_orphan_frags.
      
      * Performance
      
      The below table shows cycles reported by perf for a netperf process
      sending a single 10 Gbps TCP_STREAM. The first three columns show
      Mcycles spent in the netperf process context. The second three columns
      show time spent systemwide (-a -C A,B) on the two cpus that run the
      process and interrupt handler. Reported is the median of at least 3
      runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
      Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
      are disabled and the kernel is booted with idle=halt.
      
      NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
      
      perf stat -e cycles $NETPERF
      perf stat -C 2,3 -a -e cycles $NETPERF
      
              --process cycles--      ----cpu cycles----
                 std      zc   %      std         zc   %
      4K      27,609  11,217  41      49,217  39,175  79
      16K     21,370   3,823  18      43,540  29,213  67
      64K     20,557   2,312  11      42,189  26,910  64
      256K    21,110   2,134  10      43,006  27,104  63
      1M      20,987   1,610   8      42,759  25,931  61
      
      Perf record indicates the main source of these differences. Process
      cycles only at 1M writes (perf record; perf report -n):
      
      std:
      Samples: 42K of event 'cycles', Event count (approx.): 21258597313
       79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
        3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
        1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
        0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
        0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb
      
      zc:
      Samples: 1K of event 'cycles', Event count (approx.): 1439509124
       30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
       14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
        8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
        4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
        3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node
      
      * Safety
      
      The number of pages that can be pinned on behalf of a user with
      MSG_ZEROCOPY is bound by the locked memory ulimit.
      
      While the kernel holds process memory pinned, a process cannot safely
      reuse those pages for other purposes. Packets looped onto the receive
      stack and queued to a socket can be held indefinitely. Avoid unbounded
      notification latency by restricting user pages to egress paths only.
      skb_orphan_frags_rx() will create a private copy of pages even for
      refcounted packets when these are looped, as did skb_orphan_frags for
      the original tun zerocopy implementation.
      
      Pages are not remapped read-only. Processes can modify packet contents
      while packets are in flight in the kernel path. Bytes on which kernel
      control flow depends (headers) are copied to avoid TOCTTOU attacks.
      Datapath integrity does not otherwise depend on payload, with three
      exceptions: checksums, optional sk_filter/tc u32/.. and device +
      driver logic. The effect of wrong checksums is limited to the
      misbehaving process. TC filters that access contents may have to be
      excluded by adding an skb_orphan_frags_rx.
      
      Processes can also safely avoid OOM conditions by bounding the number
      of bytes passed with MSG_ZEROCOPY and by removing shared pages after
      transmission from their own memory map.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • test: add msg_zerocopy test · 07b65c5b
      Willem de Bruijn authored
      Introduce regression test for msg_zerocopy feature. Send traffic from
      one process to another with and without zerocopy.
      
      Evaluate tcp, udp, raw and packet sockets, including variants
      - udp: corking and corking with mixed copy/zerocopy calls
      - raw: with and without hdrincl
      - packet: at both raw and dgram level
      
      Test on both ipv4 and ipv6, optionally with ethtool changes to
      disable scatter-gather, tx checksum or tso offload. All of these
      can affect zerocopy behavior.
      
      The regression test can be run on a single machine if over a veth
      pair. Then skb_orphan_frags_rx must be modified to be identical to
      skb_orphan_frags to allow forwarding zerocopy locally.
      
      The msg_zerocopy.sh script will set up the veth pair in network
      namespaces and run all tests.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: enable MSG_ZEROCOPY · f214f915
      Willem de Bruijn authored
      Enable support for MSG_ZEROCOPY in the TCP stack. TSO and GSO are
      both supported. Only data sent to remote destinations is sent without
      copying. Packets looped onto a local destination have their payload
      copied to avoid unbounded latency.
      
      Tested:
        A 10x TCP_STREAM between two hosts showed a reduction in netserver
        process cycles by up to 70%, depending on packet size. Systemwide,
        savings are of course much less pronounced, at up to 20% best case.
      
        msg_zerocopy.sh 4 tcp:
      
        without zerocopy
          tx=121792 (7600 MB) txc=0 zc=n
          rx=60458 (7600 MB)
      
        with zerocopy
          tx=286257 (17863 MB) txc=286257 zc=y
          rx=140022 (17863 MB)
      
        This test opens a pair of sockets over veth; one calls send with
        64KB and optionally MSG_ZEROCOPY, and the other reads the initial
        bytes. The receiver truncates, so this is strictly an upper bound on
        what is achievable. It is more representative of sending data out of
        a physical NIC (when payload is not touched, either).
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sock: ulimit on MSG_ZEROCOPY pages · a91dbff5
      Willem de Bruijn authored
      Bound the number of pages that a user may pin.
      
      Follow the lead of perf tools and maintain a per-user bound on locked
      pages, as in commit 789f90fc ("perf_counter: per user mlock gift").
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
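A per-user bound of this kind can be modelled in userspace with a compare-and-swap loop. The sketch below is a minimal model under stated assumptions: `demo_user`, `demo_account_pinned()` and `demo_unaccount_pinned()` are hypothetical names, the limit is passed in as a page count rather than read from the locked-memory ulimit, and nothing here is copied from the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user counter of pinned pages. */
struct demo_user {
	atomic_size_t locked_pages;
};

/*
 * Try to charge nr_pages to the user. Returns false, leaving the counter
 * unchanged, if the charge would exceed max_pages.
 */
static bool demo_account_pinned(struct demo_user *u, size_t nr_pages,
				size_t max_pages)
{
	size_t old = atomic_load(&u->locked_pages);
	size_t new_total;

	do {
		if (nr_pages > max_pages || old > max_pages - nr_pages)
			return false;
		new_total = old + nr_pages;
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(&u->locked_pages, &old,
					       new_total));
	return true;
}

/* Release a previous charge once the pages are unpinned. */
static void demo_unaccount_pinned(struct demo_user *u, size_t nr_pages)
{
	size_t prev = atomic_fetch_sub(&u->locked_pages, nr_pages);

	assert(prev >= nr_pages);
	(void)prev;
}
```

Keeping the counter per user rather than per process means several processes of the same user share one budget, which matches the "per user instead of per process" wording in the series changelog.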
    • sock: MSG_ZEROCOPY notification coalescing · 4ab6c99d
      Willem de Bruijn authored
      In the simple case, each sendmsg() call generates data and eventually
      a zerocopy ready notification N, where N indicates the Nth successful
      invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.
      
      TCP and corked sockets can cause send() calls to append new data to an
      existing sk_buff and, thus, ubuf_info. In that case the notification
      must hold a range. Modify ubuf_info to store an inclusive range [N..N+m]
      and add skb_zerocopy_realloc() to optionally extend an existing range.
      
      Also coalesce notifications in this common case: if a notification
      [1, 1] is about to be queued while [0, 0] is the queue tail, just modify
      the head of the queue to read [0, 1].
      
      Coalescing is limited to a few TSO frames worth of data to bound
      notification latency.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
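The range merging described above, where [0, 0] followed by [1, 1] becomes [0, 1], can be sketched as a small pure function. Names are hypothetical, and the cap is an assumption: the commit bounds coalescing by bytes ("a few TSO frames worth of data"), whereas this sketch caps the number of ids purely to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of notification ids [lo, hi]; ids are 32 bit and may wrap. */
struct demo_range {
	uint32_t lo;
	uint32_t hi;
};

/* Hypothetical cap on how many ids a single notification may cover. */
#define DEMO_MAX_COALESCED 16u

/*
 * Try to extend the queued range tail with the range next. Succeeds only
 * when next starts right after tail and the merged range stays within the
 * cap; otherwise tail is left untouched and the caller queues next itself.
 */
static bool demo_coalesce(struct demo_range *tail, struct demo_range next)
{
	uint32_t merged_len;

	assert(tail != NULL);
	if (next.lo != (uint32_t)(tail->hi + 1u))
		return false;
	merged_len = (uint32_t)(next.hi - tail->lo + 1u);	/* modulo 2^32 */
	if (merged_len == 0 || merged_len > DEMO_MAX_COALESCED)
		return false;
	tail->hi = next.hi;
	return true;
}
```

Because all arithmetic is unsigned and modulo 2^32, the adjacency test also holds across the wrap from 0xffffffff to 0.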
    • sock: enable MSG_ZEROCOPY · 1f8b977a
      Willem de Bruijn authored
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, ... paths
      support reference counted zerocopy buffers, so do not do a deep copy.
      Add skb_orphan_frags_rx for paths that may loop packets to receive
      sockets. Looping zerocopy buffers to receive sockets is not allowed,
      as it may cause unbounded latency. Deep copy all zerocopy buffers,
      ref-counted or not, in this path.
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f8b977a
    • sock: add SOCK_ZEROCOPY sockopt · 76851d12
      Willem de Bruijn authored
      The send call ignores unknown flags. Legacy applications may already
      unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
      socket opts in to zerocopy.
      
      Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
      Processes can also query this socket option to detect kernel support
      for the feature. Older kernels will return ENOPROTOOPT.
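
      A minimal userspace sketch of the opt-in and the support query
      described above. The helper names zerocopy_supported() and
      zerocopy_enable() are hypothetical. The fallback value 60 for
      SO_ZEROCOPY is the asm-generic one and is an assumption for headers
      that predate the option; it may differ on some architectures.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60  /* assumed asm-generic value for older libc headers */
#endif

/* Query the option to detect kernel support.
 * Returns 1 if the kernel recognises SO_ZEROCOPY on fd, 0 if it reports
 * ENOPROTOOPT (older kernel), and -1 on any other error. */
static int zerocopy_supported(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, &len) == 0)
        return 1;
    return errno == ENOPROTOOPT ? 0 : -1;
}

/* Opt the socket in to MSG_ZEROCOPY processing.
 * Returns 0 on success, -1 with errno set otherwise. */
static int zerocopy_enable(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}
```

      An application would typically call zerocopy_enable() once after
      creating the socket and fall back to ordinary copying sends if it
      fails.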
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76851d12
    • sock: add MSG_ZEROCOPY · 52267790
      Willem de Bruijn authored
      The kernel supports zerocopy sendmsg in virtio and tap. Expand the
      infrastructure to support other socket types. Introduce a completion
      notification channel over the socket error queue. Notifications are
      returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
      blocking the send/recv path on receiving notifications.
      
      Add reference counting, to support the skb split, merge, resize and
      clone operations possible with SOCK_STREAM and other socket types.
      
      The patch does not yet modify any datapaths.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52267790
    • sock: skb_copy_ubufs support for compound pages · 3ece7826
      Willem de Bruijn authored
      Refine skb_copy_ubufs to support compound pages. With upcoming TCP
      zerocopy sendmsg, such fragments may appear.
      
      The existing code replaces each page one for one. Splitting each
      compound page independently into regular pages can exceed the
      MAX_SKB_FRAGS limit if the data is not exactly page aligned.
      
      Instead, fill all destination pages but the last to PAGE_SIZE.
      Split the existing alloc + copy loop into separate stages:
      1. compute bytelength and minimum number of pages to store this.
      2. allocate
      3. copy, filling each page except the last to PAGE_SIZE bytes
      4. update skb frag array
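
      The four stages above can be sketched in userspace as follows. This
      is an illustration under stated assumptions, not skb_copy_ubufs
      itself: struct frag, copy_frags_packed() and SKETCH_PAGE_SIZE are
      hypothetical, heap buffers stand in for pages, and returning the new
      page array stands in for updating the skb frag array.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical source fragment: a byte span of arbitrary length. */
struct frag {
    const unsigned char *data;
    size_t len;
};

/* Copy all fragments into the minimum number of pages, filling every
 * page except the last completely. Returns the page count and stores the
 * page array in *pages_out; returns 0 on empty input or allocation
 * failure. */
static size_t copy_frags_packed(const struct frag *frags, size_t nfrags,
                                unsigned char ***pages_out)
{
    size_t total = 0, npages, i, page = 0, off = 0;
    unsigned char **pages;

    /* Stage 1: total byte length and minimum number of pages. */
    for (i = 0; i < nfrags; i++)
        total += frags[i].len;
    npages = (total + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    if (npages == 0)
        return 0;

    /* Stage 2: allocate all destination pages up front. */
    pages = calloc(npages, sizeof(*pages));
    if (!pages)
        return 0;
    for (i = 0; i < npages; i++) {
        pages[i] = malloc(SKETCH_PAGE_SIZE);
        if (!pages[i]) {
            while (i--)
                free(pages[i]);
            free(pages);
            return 0;
        }
    }

    /* Stage 3: copy, moving to the next page only when one is full. */
    for (i = 0; i < nfrags; i++) {
        size_t done = 0;

        while (done < frags[i].len) {
            size_t n = frags[i].len - done;
            size_t room = SKETCH_PAGE_SIZE - off;

            if (n > room)
                n = room;
            memcpy(pages[page] + off, frags[i].data + done, n);
            done += n;
            off += n;
            if (off == SKETCH_PAGE_SIZE) {
                page++;
                off = 0;
            }
        }
    }

    /* Stage 4: publish the new pages (stands in for the frag update). */
    *pages_out = pages;
    return npages;
}
```

      Three 3000-byte fragments, for example, pack into three pages
      (4096 + 4096 + 808 bytes), the minimum that holds 9000 bytes.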
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ece7826
    • sock: allocate skbs from optmem · 98ba0bd5
      Willem de Bruijn authored
      Add sock_omalloc and sock_ofree to be able to allocate control skbs,
      for instance for looping errors onto sk_error_queue.
      
      The transmit budget (sk_wmem_alloc) is involved in transmit skb
      shaping, most notably in TCP Small Queues. Using this budget for
      control packets would impact transmission.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      98ba0bd5