1. 11 Sep, 2016 8 commits
    • net: fs_enet: make rx_copybreak value configurable · b0ba357b
      Christophe Leroy authored
      Measurements show that on an MPC8xx running at 132 MHz, the optimal
      limit is 112:
      * 114-byte packets are processed in 147 TB ticks with higher copybreak
      * 114-byte packets are processed in 148 TB ticks with lower copybreak
      * 128-byte packets are processed in 154 TB ticks with higher copybreak
      * 128-byte packets are processed in 148 TB ticks with lower copybreak
      * 238-byte packets are processed in 172 TB ticks with higher copybreak
      * 238-byte packets are processed in 148 TB ticks with lower copybreak
      
      However it might be different on other processors
      and/or frequencies. So it is useful to make it configurable.
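
      A minimal sketch of the copybreak decision, in plain C rather than
      driver code. The names rx_copybreak, rx_action and rx_decide are
      hypothetical, and the strict "less than" comparison is an assumption;
      112 is simply the limit measured above.

```c
#include <stddef.h>

/* Hypothetical tunable; 112 is the optimum measured on the MPC8xx above. */
static size_t rx_copybreak = 112;

enum rx_action {
	RX_COPY_AND_REUSE,	/* copy into a small new buffer, keep the ring buffer */
	RX_PASS_AND_REALLOC	/* hand the ring buffer up, allocate a replacement */
};

/* Assumption: packets shorter than the limit are cheaper to copy than to replace. */
static enum rx_action rx_decide(size_t pkt_len)
{
	return pkt_len < rx_copybreak ? RX_COPY_AND_REUSE : RX_PASS_AND_REALLOC;
}
```

      Making the limit a variable rather than a constant is what allows it
      to be tuned per processor and frequency.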
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fs_enet: don't unmap DMA when packet len is below copybreak · 070e1f01
      Christophe Leroy authored
      When the length of the packet is below the defined copybreak limit,
      the received packet is copied into a newly allocated skb in order
      to reuse the skb. This is only interesting if it allows us to avoid
      a new DMA mapping. We shall therefore not DMA unmap and remap the
      skb->data. Instead, we invalidate the cache
      with dma_sync_single_for_cpu() once the received data has been
      copied into the new skb.
      
      The following measurements were obtained on an MPC885 running at 132 MHz.
      Measurement is done using the timebase with packets sent to the target
      with 'ping -s 1' (packet len is 60):
      * Without this patch: 182 TB ticks
      * With this patch: 143 TB ticks
      
      As a comparison, if we set the copybreak limit to 0, then we get
      148 TB ticks. It means that without this patch, duration is even
      worse when copying received data to a new skb instead of
      allocating a new skb for the next packet to be received.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fs_enet: merge NAPI RX and NAPI TX · 8572763a
      Christophe Leroy authored
      Initially, a NAPI TX routine has been implemented separately from
      NAPI RX, as done on the freescale/gianfar driver.
      
      By merging NAPI RX and NAPI TX, we reduce the amount of TX completion
      interrupts.
      
      Handling of the budget in association with TX interrupts is based on
      indications provided at https://wiki.linuxfoundation.org/networking/napi
      We never process more than the complete TX ring on a single run.
      
      At the same time, we fix an issue in the handling of fep->tx_free:
      
      It is only when fep->tx_free goes up to MAX_SKB_FRAGS that
      we need to wake up the queue. There is no need to call
      netif_wake_queue() at every packet successfully transmitted.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'act_tunnel_key' · d1ba24fe
      David S. Miller authored
      Hadar Hen Zion says:
      
      ====================
      net/sched: ip tunnel metadata set/release/classify by using TC
      
      This patchset introduces ip tunnel manipulation support using the TC subsystem.
      
      In the decap flow, it enables the user to redirect packets from a shared tunnel
      device and classify by outer and inner headers. The outer headers are extracted
      from the metadata and used by the flower filter. A new action, act_tunnel_key,
      releases the metadata.
      
      In the encap flow, act_tunnel_key creates a metadata object to be used by the
      shared tunnel device. The actual redirection to the tunnel device is done using
      act_mirred.
      
      For example:
      $ tc qdisc add dev vnet0 ingress
      $ tc filter add dev vnet0 protocol ip parent ffff: \
      	flower \
      	 ip_proto 1 \
      	action tunnel_key set \
      	 src_ip 11.11.0.1 \
      	 dst_ip 11.11.0.2 \
      	 id 11 \
      	action mirred egress redirect dev vxlan0
      
      $ tc qdisc add dev vxlan0 ingress
      $ tc filter add dev vxlan0 protocol ip parent ffff: \
      	flower \
      	 enc_src_ip 11.11.0.2 \
      	 enc_dst_ip 11.11.0.1 \
      	 enc_key_id 11 \
      	action tunnel_key release \
      	action mirred egress redirect dev vnet0
      
      Amir & Hadar
      
      Changes from V6:
      - Add kfree_rcu to tunnel_key_release function
      - Use reverse Christmas tree order in tunnel_key_init function
      
      Changes from V5:
      - Add __rcu notation to struct tcf_tunnel_key_params in struct tcf_tunnel_key
      - Fix indentation in include/net/dst_metadata.h
      - Fix syntax error in commit message
      
      Changes from V4:
      - Fix tunnel_key_init function error flow.
      - Add 'action' variable to struct tcf_tunnel_key_params and use it instead of
        tcf_action variable which is not protected by rcu lock.
      
      Changes from V3:
      - Use percpu stats
      - No spinlock on datapath - protecting parameters with rcu
      - Fix buggy handling of set/release dst
      - Use nla_get_in_addr and nla_put_in_addr
      - Fix change logs
      - Pass in6_addr by pointer
      - Rename utility functions to start with double underscore
      
      Changes from V2:
      - Use union in struct fl_flow_key for enc_ipv6 and enc_ipv4.
      - Rename functions _ip_tun_rx_dst and _ipv6_tun_rx_dst to _ip_tun_set_dst and
        _ipv6_tun_set_dst accordingly.
      - Remove local parameter 'encapdecap' from tunnel_key_init function.
      - Don't copy in6_addr values in tunnel_key_dump_addresses function, use pointers.
      
      Changes from V1:
      - More cleanups to key32_to_tunnel_id() and tunnel_id_to_key32()
      - IPv6 Support added
      - Set TUNNEL_KEY flag to make GRE work
      - Handle zero tunnel id properly in act_tunnel_key
      - Don't leave junk in decap action
      - Fix bug in act_tunnel_key initialization where (exists & ocr) is true
      - Remove BUG() from code
      - Rename action to tunnel_key
      - Improve grep-ability of code
      - Reuse code from ip_tun_rx_dst() and ipv6_tun_rx_dst()
      
      Changes from RFC:
      - Add a new action instead of making mirred too complex
      - No need to specify UDP port in action - it is already in the tunnel device
        configuration
      - Added a decap operation to drop tunnel metadata
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: Introduce act_tunnel_key · d0f6dd8a
      Amir Vadai authored
      This action can be used before redirecting packets to a shared tunnel
      device, or when redirecting packets arriving from such a device.
      
      The action will release the metadata created by the tunnel device
      (decap), or set the metadata with the specified values for encap
      operation.
      
      For example, the following flower filter will forward all ICMP packets
      destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
      redirecting, metadata for the vxlan tunnel is created using the
      tunnel_key action and its arguments:
      
      $ tc filter add dev net0 protocol ip parent ffff: \
          flower \
            ip_proto 1 \
            dst_ip 11.11.11.2 \
          action tunnel_key set \
            src_ip 11.11.0.1 \
            dst_ip 11.11.0.2 \
            id 11 \
          action mirred egress redirect dev vxlan0
      Signed-off-by: Amir Vadai <amir@vadai.me>
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: cls_flower: Classify packet in ip tunnels · bc3103f1
      Amir Vadai authored
      Introduce classifying by metadata extracted by the tunnel device.
      Outer header fields - source/dest ip and tunnel id, are extracted from
      the metadata when classifying.
      
      For example, the following adds a filter on the ingress Qdisc of the shared
      vxlan device named 'vxlan0' to forward packets with outer src ip
      11.11.0.2, dst ip 11.11.0.1 and tunnel id 11. The packets are
      forwarded to the tap device 'vnet0' (after the metadata is released):
      
      $ tc filter add dev vxlan0 protocol ip parent ffff: \
          flower \
            enc_src_ip 11.11.0.2 \
            enc_dst_ip 11.11.0.1 \
            enc_key_id 11 \
            dst_ip 11.11.11.1 \
          action tunnel_key release \
          action mirred egress redirect dev vnet0
      
      The tunnel_key action will be introduced in the next patch in this
      series.
      Signed-off-by: Amir Vadai <amir@vadai.me>
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/dst: Utility functions to build dst_metadata without supplying an skb · 2ff378b7
      Amir Vadai authored
      Extract __ip_tun_set_dst() and __ipv6_tun_set_dst() out of
      ip_tun_rx_dst() and ipv6_tun_rx_dst(), to be used without supplying an
      skb.
      Signed-off-by: Amir Vadai <amir@vadai.me>
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id() · d817f432
      Amir Vadai authored
      Add utility functions to convert a 32-bit key into a 64-bit tunnel ID and
      vice versa.
      These functions will be used instead of cloning code in GRE and VXLAN,
      and in tc act_iptunnel which will be introduced in a following patch in
      this patchset.
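
      A userspace sketch of such a conversion, assuming both values are kept
      in network byte order and the 32-bit key occupies the low-order 32 bits
      of the 64-bit tunnel ID. The be32/be64 byte-array types and the
      sketch_ function names are hypothetical and stand in for the kernel's
      __be32/__be64 types and the helpers named in the title.

```c
#include <stdint.h>
#include <string.h>

/* Values in network byte order, held as raw bytes to stay endian-neutral. */
struct be32 { uint8_t b[4]; };
struct be64 { uint8_t b[8]; };

/* Place the 32-bit key in the low-order half of the 64-bit tunnel ID. */
static struct be64 sketch_key32_to_tunnel_id(struct be32 key)
{
	struct be64 id;

	memset(id.b, 0, 4);		/* high-order 32 bits are zero */
	memcpy(id.b + 4, key.b, 4);	/* low-order 32 bits carry the key */
	return id;
}

/* Recover the 32-bit key from the low-order half of the tunnel ID. */
static struct be32 sketch_tunnel_id_to_key32(struct be64 id)
{
	struct be32 key;

	memcpy(key.b, id.b + 4, 4);
	return key;
}
```

      Keeping one pair of helpers avoids each tunnel driver open-coding the
      same shift or byte shuffle.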
      Signed-off-by: Amir Vadai <amir@vadai.me>
      Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
      Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Jiri Benc <jbenc@redhat.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 Sep, 2016 20 commits
    • ATM-iphase: Use kmalloc_array() in tx_init() · e808bb6e
      Markus Elfring authored
      * Multiplications for the size determination of memory allocations
        indicated that array data structures should be processed.
        Thus use the corresponding function "kmalloc_array".
      
        This issue was detected by using the Coccinelle software.
      
      * Replace the specification of data types by pointer dereferences
        to make the corresponding size determination a bit safer according to
        the Linux coding style convention.
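
      A userspace sketch of the two points above, not the kernel
      implementation: alloc_array() is a hypothetical stand-in for the
      overflow check that an array allocator performs before multiplying,
      and alloc_descs() shows sizing by pointer dereference instead of by
      type name. struct desc is invented for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Refuse the allocation if n * size would overflow size_t. */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* the product would wrap around */
	return malloc(n * size);
}

struct desc {
	uint32_t addr;
	uint32_t len;
};

static struct desc *alloc_descs(size_t n)
{
	struct desc *d;

	/* sizeof(*d) tracks the pointer's type even if that type changes later. */
	d = alloc_array(n, sizeof(*d));
	return d;
}
```

      A plain malloc(n * sizeof(struct desc)) would silently allocate too
      little memory if the multiplication wrapped.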
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'alx-msix' · 171a6c52
      David S. Miller authored
      Tobias Regnery says:
      
      ====================
      alx: add msi-x support
      
      This patchset adds msi-x support to the alx driver. It is a preparatory
      series for multi queue support, which I am currently working on. As there
      is no advantage over msi interrupts without multi queue support, msi-x
      interrupts are disabled by default. In order to test for regressions, a
      new module parameter is added to enable msi-x interrupts.
      
      Based on information of the downstream driver at github.com/qca/alx
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • alx: add module parameter to enable msi-x support · 0c58ee0b
      Tobias Regnery authored
      msi-x support is disabled by default in the alx driver. Add a module
      parameter to the driver so that msi-x interrupts can be tested for regressions.
      Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • alx: add msi-x support · dc39a78b
      Tobias Regnery authored
      Add msi-x support to the alx driver. This is in preparation for multi queue
      support.
      
      msi-x interrupts are disabled by default because without multi queue support
      there is no advantage over msi interrupts. The performance numbers observed
      with iperf stay the same.
      
      Based on information of the downstream driver at github.com/qca/alx
      Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • alx: factor out part of the interrupt handler · a0373aef
      Tobias Regnery authored
      Factor out the handling of misc interrupts into a new function.
      This function can be reused later for msi-x interrupts.
      Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • alx: refactor msi enablement and disablement · 9ee7b683
      Tobias Regnery authored
      Introduce a new flag field for the advanced interrupt capabilities and add
      new functions to enable and disable msi interrupts. These functions will be
      extended later to cover msi-x interrupts.
      
      We enable msi interrupts earlier in alx_init_intr because with msi-x and multi
      queue support the number of queues must be set before we allocate resources for
      the rx and tx paths.
      Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: mark symbols static where possible · ba56947a
      Baoyou Xie authored
      We get a few warnings when building kernel with W=1:
      drivers/net/ethernet/qlogic/qed/qed_l2.c:112:5: warning: no previous prototype for 'qed_sp_vport_start' [-Wmissing-prototypes]
      drivers/net/ethernet/qlogic/qed/qed_sriov.c:110:6: warning: no previous prototype for 'qed_iov_is_valid_vfid' [-Wmissing-prototypes]
      drivers/net/ethernet/qlogic/qed/qed_sriov.c:188:5: warning: no previous prototype for 'qed_iov_post_vf_bulletin' [-Wmissing-prototypes]
      drivers/net/ethernet/qlogic/qed/qed_sriov.c:578:6: warning: no previous prototype for 'qed_iov_set_vfs_to_disable' [-Wmissing-prototypes]
      drivers/net/ethernet/qlogic/qed/qed_sriov.c:1135:28: warning: no previous prototype for 'qed_iov_get_public_vf_info' [-Wmissing-prototypes]
      drivers/net/ethernet/qlogic/qed/qed_sriov.c:1148:6: warning: no previous prototype for 'qed_iov_clean_vf' [-Wmissing-prototypes]
      drivers/net/ethernet/qlogic/qed/qed_sriov.c:2444:5: warning: no previous prototype for 'qed_iov_chk_ucast' [-Wmissing-prototypes]
      drivers/net/ethernet/qlogic/qed/qed_sriov.c:2762:5: warning: no previous prototype for 'qed_iov_vf_flr_cleanup' [-Wmissing-prototypes]
      ....
      
      In fact, these functions are only used in the file in which they are
      defined, so they don't need a separate declaration and can be made static.
      This patch marks these functions with 'static'.
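
      A small sketch of the pattern, with hypothetical names that do not
      come from the qed driver: the file-local helper is made 'static',
      while the function used from other files keeps external linkage and a
      prototype (normally supplied by a header).

```c
#include <stddef.h>

/* Prototype that a header would normally provide for the exported function. */
int sketch_count_valid(const int *ids, size_t n);

/* Used only in this file, so it gets internal linkage via 'static'. */
static int sketch_is_valid_id(int id)
{
	return id >= 0 && id < 8;
}

int sketch_count_valid(const int *ids, size_t n)
{
	size_t i;
	int count = 0;

	for (i = 0; i < n; i++)
		count += sketch_is_valid_id(ids[i]);
	return count;
}
```

      With 'static', -Wmissing-prototypes has nothing to report for the
      helper, and the symbol no longer pollutes the global namespace.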
      Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-helper-cleanups' · 349aa334
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      Some BPF helper cleanups
      
      This series contains a couple of misc cleanups and improvements
      for BPF helpers. For details please see individual patches. We
      let this also sit for a few days with Fengguang's kbuild test
      robot, and there were no issues seen (besides one false positive,
      see last one for details).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add BPF_CALL_x macros for declaring helpers · f3694e00
      Daniel Borkmann authored
      This work adds BPF_CALL_<n>() macros and converts all the eBPF helper functions
      to use them, in a similar fashion like we do with SYSCALL_DEFINE<n>() macros
      that are used today. Motivation for this is to hide all the register handling
      and all necessary casts from the user, so that it is done automatically in the
      background when adding a BPF_CALL_<n>() call.
      
      This makes current helpers easier to review, makes future helpers easier to
      write, avoids getting the casting mess wrong, and allows for extending all
      helpers at once (f.e. build time checks, etc). It also makes it easier to
      detect in code reviews that unused registers are not instrumented in the code
      by accident, breaking compatibility with existing programs.
      
      BPF_CALL_<n>() internals are quite similar to SYSCALL_DEFINE<n>() ones with some
      fundamental differences, for example, for generating the actual helper function
      that carries all u64 regs, we need to fill unused regs, so that we always end up
      with 5 u64 regs as an argument.
      
      I reviewed several 0-5 generated BPF_CALL_<n>() variants of the .i results and
      they all look as expected. No sparse issue spotted. We let this also sit for a
      few days with Fengguang's kbuild test robot, and there were no issues seen. On
      s390, it barked on the "uses dynamic stack allocation" notice, which is an old
      one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
      to the call wrapper, just telling that the perf raw record/frag sits on stack
      (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
      and they were fine as well. All eBPF helpers are now converted to use these
      macros, getting rid of a good chunk of all the raw castings.
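
      A much-reduced userspace sketch of the wrapper idea, not the kernel
      macro itself: SKETCH_CALL_2 is a hypothetical two-argument variant
      that emits an outer function taking five u64 "registers" and a typed
      inner function that holds the helper body. It only handles integer
      argument types; the real macros also deal with pointers and any
      argument count from 0 to 5.

```c
#include <stdint.h>

typedef uint64_t u64;

/*
 * Emit name(), which always takes five u64 registers, and name_typed(),
 * which receives the first two registers cast to the declared types.
 */
#define SKETCH_CALL_2(name, t1, a1, t2, a2)				\
	static u64 name##_typed(t1 a1, t2 a2);				\
	static u64 name(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)	\
	{								\
		(void)r3; (void)r4; (void)r5; /* unused registers */	\
		return name##_typed((t1)r1, (t2)r2);			\
	}								\
	static u64 name##_typed(t1 a1, t2 a2)

/* The helper author writes only the typed body. */
SKETCH_CALL_2(sketch_add, uint32_t, a, uint32_t, b)
{
	return (u64)a + b;
}
```

      The casts live in one place, so a helper body cannot get them wrong,
      and every helper keeps the same five-register signature.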
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add own ctx rewriter on ifindex for clsact progs · 374fb54e
      Daniel Borkmann authored
      When fetching ifindex, we don't need to test dev for being NULL since
      we're always guaranteed to have a valid dev for clsact programs. Thus,
      avoid this test in fast path.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add BPF_SIZEOF and BPF_FIELD_SIZEOF macros · f035a515
      Daniel Borkmann authored
      Add BPF_SIZEOF() and BPF_FIELD_SIZEOF() macros to improve the code a bit,
      which otherwise often results in overly long bytes_to_bpf_size(sizeof())
      and bytes_to_bpf_size(FIELD_SIZEOF()) lines. So place them into a macro
      helper instead. Moreover, we currently have a BUILD_BUG_ON(BPF_FIELD_SIZEOF())
      check in convert_bpf_extensions(), but we should rather make that generic
      as well and add a BUILD_BUG_ON() test in all BPF_SIZEOF()/BPF_FIELD_SIZEOF()
      users to detect any rewriter size issues at compile time. Note, there are
      currently none, but we want to assert that it stays this way.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: minor cleanups in helpers · 6088b582
      Daniel Borkmann authored
      Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
      and in __bpf_skb_max_len(), I missed that we always have skb->dev valid
      anyway, so we can drop the unneeded test for dev; also few more other
      misc bits addressed here.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: do not clear l4 hashes · bf8d85d4
      Eric Dumazet authored
      If skb has a valid l4 hash, there is no point in clearing the hash and forcing
      a further flow dissection when a tunnel encapsulation is added.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ATM-ForeRunnerHE: Use kmalloc_array() in he_init_group() · 2c4f414f
      Markus Elfring authored
      * Multiplications for the size determination of memory allocations
        indicated that array data structures should be processed.
        Thus use the corresponding function "kmalloc_array".
      
        This issue was detected by using the Coccinelle software.
      
      * Replace the specification of data types by pointer dereferences
        to make the corresponding size determination a bit safer according to
        the Linux coding style convention.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ATM-ENI: Use kmalloc_array() in eni_start() · d9e6620c
      Markus Elfring authored
      * A multiplication for the size determination of a memory allocation
        indicated that an array data structure should be processed.
        Thus use the corresponding function "kmalloc_array".
      
        This issue was detected by using the Coccinelle software.
      
      * Replace the specification of a data structure by a pointer dereference
        to make the corresponding size determination a bit safer according to
        the Linux coding style convention.
      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'rxrpc-rewrite-20160908' of... · fa5f4aaf
      David S. Miller authored
      Merge tag 'rxrpc-rewrite-20160908' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
      
      David Howells says:
      
      ====================
      rxrpc: Rewrite data and ack handling
      
      This patch set constitutes the main portion of the AF_RXRPC rewrite.  It
      consists of five fix/helper patches:
      
       (1) Fix ASSERTCMP's and ASSERTIFCMP's handling of signed values.
      
       (2) Update some protocol definitions slightly.
      
       (3) Use of an hlist for RCU purposes.
      
       (4) Removal of per-call sk_buff accounting (not really needed when skbs
           aren't being queued on the main queue).
      
       (5) Addition of a tracepoint to log incoming packets in the data_ready
           callback and to log the end of the data_ready callback.
      
      And then there are two patches that form the main part:
      
       (6) Preallocation of resources for incoming calls so that in patch (7) the
           data_ready handler can be made to fully instantiate an incoming call
           and make it live.  This extends through into AFS so that AFS can
           preallocate its own incoming call resources.
      
           The preallocation size is capped at the listen() backlog setting - and
           that is capped at a sysctl limit which can be set between 4 and 32.
      
           The preallocation is (re)charged either by accepting/rejecting pending
           calls or, in the case of AFS, manually.  If insufficient preallocation
           resources exist, a BUSY packet will be transmitted.
      
           The advantage of using this preallocation is that once a call is set
           up in the data_ready handler, DATA packets can be queued on it
           immediately rather than the DATA packets being queued for a background
           work item to do all the allocation and then try and sort out the DATA
           packets whilst other DATA packets may still be coming in and going
           either to the background thread or the new call.
      
       (7) Rewrite the handling of DATA, ACK and ABORT packets.
      
           In the receive phase, DATA packets are now held in per-call circular
           buffers with deduplication, out of sequence detection and suchlike
           being done in data_ready.  Since there is only one producer and only
           one consumer, no locks need be used on the receive queue.
      
           Received ACK and ABORT packets are now parsed and discarded in
           data_ready to recycle resources as fast as possible.
      
           sk_buffs are no longer pulled, trimmed or cloned, but rather the
           offset and size of the content is tracked.  This particularly affects
           jumbo DATA packets which need insertion into the receive buffer in
           multiple places.  Annotations are kept to track which bit is which.
      
           Packets are no longer queued on the socket receive queue; rather,
           calls are queued.  Dummy packets to convey events therefore no longer
           need to be invented and metadata packets can be discarded as soon as
           parsed rather than being pushed onto the socket receive queue to
           indicate terminal events.
      
           The preallocation facility added in (6) is now used to set up incoming
           calls with very little locking required and no calls to the allocator
           in data_ready.
      
           Decryption and verification is now handled in recvmsg() rather than in
           a background thread.  This allows for the future possibility of
           decrypting directly into the user buffer.
      
           With this patch, the code is a lot simpler and most of the mass of
           call event and state wangling code in call_event.c is gone.
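
      The lock-free claim for a one-producer, one-consumer receive queue
      can be illustrated with a generic ring buffer in C11. This is a
      sketch under that assumption only, not the AF_RXRPC code: the
      spsc_ring type, RING_SIZE and the ring_* functions are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8	/* must be a power of two */

struct spsc_ring {
	void *slot[RING_SIZE];
	atomic_uint head;	/* advanced only by the producer */
	atomic_uint tail;	/* advanced only by the consumer */
};

static void ring_init(struct spsc_ring *r)
{
	atomic_init(&r->head, 0);
	atomic_init(&r->tail, 0);
}

/* Producer side: returns 1 on success, 0 if the ring is full. */
static int ring_push(struct spsc_ring *r, void *item)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return 0;
	r->slot[head & (RING_SIZE - 1)] = item;
	/* Publish the slot before the consumer can observe the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 1;
}

/* Consumer side: returns the oldest item, or NULL if the ring is empty. */
static void *ring_pop(struct spsc_ring *r)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
	void *item;

	if (head == tail)
		return NULL;
	item = r->slot[tail & (RING_SIZE - 1)];
	/* Release the slot back to the producer. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return item;
}
```

      Each index has exactly one writer, so ordering the index updates
      against the slot accesses is enough and no lock is needed.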
      
      With this, the majority of the AF_RXRPC rewrite is complete.  However,
      there are still things to be done, including:
      
       (*) Limit the number of active service calls to prevent an attacker from
           filling up a server's memory.
      
       (*) Limit the number of calls on the rebuff-with-BUSY queue.
      
       (*) Transmit delayed/deferred ACKs from recvmsg() if possible, rather than
           punting to the background thread.  Ideally, the background thread
           shouldn't run at all, but data_ready can't call kernel_sendmsg() and
           we can't rely on recvmsg() attending to the call in a timely fashion.
      
       (*) Prevent the call at the front of the socket queue from hogging
           recvmsg()'s attention if there's a sufficiently continuous supply of
           data.
      
       (*) Distribute ICMP errors by connection rather than by call.  Possibly
           parse the ICMP packet to try and pin down the exact connection and
           call.
      
       (*) Encrypt/decrypt directly between user buffers and socket buffers where
           possible.
      
       (*) IPv6.
      
       (*) Service ID upgrade.  This is a facility whereby a special flag bit is
           set in the DATA packet header when making a call that tells the server
           that it is allowed to change the service ID to an upgraded one and
           reply with an equivalent call from the upgraded service.
      
           This is used, for example, to override certain AFS calls so that IPv6
           addresses can be returned.
      
       (*) Allow userspace to preallocate call user IDs for incoming calls.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • via-velocity: remove null pointer check on array tdinfo->skb_dma · 46dfc23e
      tdinfo->skb_dma is a 7-element array of dma_addr_t and hence cannot be
      null, so the null pointer check on tdinfo->skb_dma is redundant.
      Remove it.
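
      A sketch of why such a check is redundant. The struct below is a
      hypothetical stand-in for the driver's descriptor info, with uint64_t
      in place of dma_addr_t: an array embedded in a struct has an address
      inside that struct, so it is never NULL for a valid struct pointer.

```c
#include <stddef.h>
#include <stdint.h>

struct td_info_sketch {
	void *skb;
	uint64_t skb_dma[7];	/* embedded array, not a pointer */
};

/* Return the embedded array; it decays to &t->skb_dma[0]. */
static const uint64_t *dma_slots(const struct td_info_sketch *t)
{
	return t->skb_dma;	/* struct address plus a fixed offset, never NULL */
}
```

      Only a pointer member (such as skb above) can meaningfully be
      compared against NULL.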
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Acked-by: Francois Romieu <romieu@fr.zoreil.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qede: mark qede_set_features() static · 9438451e
      Baoyou Xie authored
      We get 1 warning when building kernel with W=1:
      drivers/net/ethernet/qlogic/qede/qede_main.c:2113:5: warning: no previous prototype for 'qede_set_features' [-Wmissing-prototypes]
      
      In fact, this function is only used in the file in which it is
      defined, so it doesn't need a separate declaration and can be made static.
      This patch marks this function with 'static'.
      Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
      Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: Fixed checkpatch errors for Microsemi PHYs. · 4ffd03f5
      Raju Lakkaraju authored
      The existing VSC85xx PHY driver did not follow the coding style and caused "checkpatch" to complain. This commit fixes this.
      Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microsemi.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: x25: remove null checks on arrays calling_ae and called_ae · 05f1b12f
      Colin Ian King authored
      dtefacs.calling_ae and called_ae are both 20-element __u8 arrays and
      cannot be null, so the null checks on them are redundant. Remove these.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 09 Sep, 2016 12 commits
    • macsec: set network devtype · c24acf03
      stephen hemminger authored
      The netdevice type structure for macsec was being defined but never used.
      To set the network device type the macro SET_NETDEV_DEVTYPE must be called.
      Compile tested only, I don't use macsec.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: remove unused ifla_stats_policy · b8b867e1
      stephen hemminger authored
      This structure is defined but never used. Flagged with W=1
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'newroute-creation-flags' · a349fcc8
      David S. Miller authored
      Guillaume Nault says:
      
      ====================
      ip: fix creation flags reported in RTM_NEWROUTE events
      
      Netlink messages sent to user-space upon RTM_NEWROUTE events have their
      nlmsg_flags field inconsistently set. While the NLM_F_REPLACE and
      NLM_F_APPEND bits are correctly handled, NLM_F_CREATE and NLM_F_EXCL
      are always 0.
      
      This series sets the NLM_F_CREATE and NLM_F_EXCL bits when applicable,
      for IPv4 and IPv6.
      
      Since IPv6 ignores the NLM_F_APPEND flag in requests, this flag isn't
      reported in RTM_NEWROUTE IPv6 events. This keeps IPv6 internally
      consistent (same flag semantics for user requests and kernel events) at
      the cost of a different flag interpretation for IPv4 and IPv6.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Guillaume Nault's avatar
      ipv6: report NLM_F_CREATE and NLM_F_EXCL flags in RTM_NEWROUTE events · 73483c12
      Guillaume Nault authored
      Since commit 37a1d361 ("ipv6: include NLM_F_REPLACE in route
      replace notifications"), RTM_NEWROUTE notifications have their
      NLM_F_REPLACE flag set if the new route replaced a preexisting one.
      However, other flags aren't set.
      
      This patch reports the missing NLM_F_CREATE and NLM_F_EXCL flag bits.
      
      NLM_F_APPEND is not reported, because in ipv6 a NLM_F_CREATE request
      is interpreted as an append request (contrary to ipv4, "prepend" is not
      supported, so if NLM_F_EXCL is not set then NLM_F_APPEND is implicit).
      
      As a result, the following flag combinations can now be reported
      (iproute2's terminology in parentheses):
      
        * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
          ("add").
        * NLM_F_CREATE: route did already exist, new route added after
          preexisting ones ("append").
        * NLM_F_REPLACE: route did already exist, new route replaced the
          first preexisting one ("change").
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Guillaume Nault's avatar
      ipv4: fix value of ->nlmsg_flags reported in RTM_NEWROUTE events · b93e1fa7
      Guillaume Nault authored
      fib_table_insert() inconsistently fills the nlmsg_flags field in its
      notification messages.
      
      Since commit b8f55831 ("[RTNETLINK]: Fix sending netlink message
      when replace route."), the netlink message has its nlmsg_flags set to
      NLM_F_REPLACE if the route replaced a preexisting one.
      
      Then commit a2bb6d7d ("ipv4: include NLM_F_APPEND flag in append
      route notifications") started setting nlmsg_flags to NLM_F_APPEND if
      the route matched a preexisting one but was appended.
      
      In other cases (exclusive creation or prepend), nlmsg_flags is 0.
      
      This patch sets ->nlmsg_flags in all situations, preserving the
      semantic of the NLM_F_* bits:
      
        * NLM_F_CREATE: a new fib entry has been created for this route.
        * NLM_F_EXCL: no other fib entry existed for this route.
        * NLM_F_REPLACE: this route has overwritten a preexisting fib entry.
        * NLM_F_APPEND: the new fib entry was added after other entries for
          the same route.
      
      As a result, the following flag combinations can now be reported
      (iproute2's terminology in parentheses):
      
        * NLM_F_CREATE | NLM_F_EXCL: route didn't exist, exclusive creation
          ("add").
        * NLM_F_CREATE | NLM_F_APPEND: route did already exist, new route
          added after preexisting ones ("append").
        * NLM_F_CREATE: route did already exist, new route added before
          preexisting ones ("prepend").
        * NLM_F_REPLACE: route did already exist, new route replaced the
          first preexisting one ("change").
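
The IPv4 combinations listed above could be decoded by a netlink listener roughly as follows. This is a minimal userspace sketch, not code from the patch: the helper name `newroute_op()` is hypothetical, and the fallback `#define`s only repeat the values of the Linux uapi header for builds where it is unavailable.

```c
#ifdef __linux__
#include <linux/netlink.h>   /* NLM_F_* flag bits */
#endif

/* Fallback values, identical to the Linux uapi definitions. */
#ifndef NLM_F_REPLACE
#define NLM_F_REPLACE 0x100
#define NLM_F_EXCL    0x200
#define NLM_F_CREATE  0x400
#define NLM_F_APPEND  0x800
#endif

/*
 * Hypothetical helper: map the nlmsg_flags of an IPv4 RTM_NEWROUTE
 * notification to iproute2's terminology, following the list above.
 */
static const char *newroute_op(unsigned int nlmsg_flags)
{
    unsigned int f = nlmsg_flags &
        (NLM_F_CREATE | NLM_F_EXCL | NLM_F_REPLACE | NLM_F_APPEND);

    if (f == (NLM_F_CREATE | NLM_F_EXCL))
        return "add";       /* route didn't exist, exclusive creation */
    if (f == (NLM_F_CREATE | NLM_F_APPEND))
        return "append";    /* added after preexisting entries */
    if (f == NLM_F_CREATE)
        return "prepend";   /* added before preexisting entries */
    if (f == NLM_F_REPLACE)
        return "change";    /* replaced the first preexisting entry */
    return "unknown";       /* e.g. 0, as reported before this patch */
}
```

Unrelated bits in `nlmsg_flags` are masked off first, so the comparison only depends on the four flags discussed in this commit.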
      Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      ipv4: accept u8 in IP_TOS ancillary data · e895cdce
      Eric Dumazet authored
      In commit f02db315 ("ipv4: IP_TOS and IP_TTL can be specified as
      ancillary data") Francesco added IP_TOS values specified as integer.
      
      However, kernel sends to userspace (at recvmsg() time) an IP_TOS value
      in a single byte, when IP_RECVTOS is set on the socket.
      
      It can be very useful to reflect all ancillary options as given by the
      kernel in a subsequent sendmsg(), instead of having the sendmsg() abort
      with EINVAL, as happens after Francesco's patch.
      
      So this patch extends IP_TOS ancillary data to accept a u8, so that a
      UDP server can simply reuse the same ancillary block without having to
      mangle it.
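
For illustration, a one-byte IP_TOS control message of the kind described above could be built like this. It is a sketch under assumptions: `set_tos_cmsg()` is a hypothetical helper, and the caller is assumed to pass a suitably aligned buffer of at least `CMSG_SPACE(1)` bytes. Only standard socket macros are used.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IPPROTO_IP, IP_TOS */

/*
 * Hypothetical helper: attach a one-byte IP_TOS control message to msg,
 * using buf as control storage. Returns 0 on success, -1 if buf is too
 * small to hold the message.
 */
static int set_tos_cmsg(struct msghdr *msg, void *buf, size_t buflen,
                        uint8_t tos)
{
    struct cmsghdr *cmsg;

    if (buflen < CMSG_SPACE(sizeof(tos)))
        return -1;

    memset(buf, 0, buflen);
    msg->msg_control = buf;
    msg->msg_controllen = CMSG_SPACE(sizeof(tos));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = IPPROTO_IP;
    cmsg->cmsg_type = IP_TOS;
    /* One byte of payload, the size recvmsg() uses with IP_RECVTOS. */
    cmsg->cmsg_len = CMSG_LEN(sizeof(tos));
    memcpy(CMSG_DATA(cmsg), &tos, sizeof(tos));
    return 0;
}
```

A server that received its control block from recvmsg() would not need such a helper at all; per the commit message, it can pass the same block back to sendmsg() unchanged.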
      
      Jesper can then augment
      https://github.com/netoptimizer/network-testing/blob/master/src/udp_example02.c
      to add TOS reflection ;)
      
      Fixes: f02db315 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Francesco Fusco <ffusco@redhat.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Daniel Borkmann's avatar
      bpf: fix range propagation on direct packet access · 2d2be8ca
      Daniel Borkmann authored
      LLVM can generate code that tests for direct packet access via
      skb->data/data_end in a way that currently gets rejected by the
      verifier, example:
      
        [...]
         7: (61) r3 = *(u32 *)(r6 +80)
         8: (61) r9 = *(u32 *)(r6 +76)
         9: (bf) r2 = r9
        10: (07) r2 += 54
        11: (3d) if r3 >= r2 goto pc+12
         R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
         R9=pkt(id=0,off=0,r=0) R10=fp
        12: (18) r4 = 0xffffff7a
        14: (05) goto pc+430
        [...]
      
        from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
                       R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
        24: (7b) *(u64 *)(r10 -40) = r1
        25: (b7) r1 = 0
        26: (63) *(u32 *)(r6 +56) = r1
        27: (b7) r2 = 40
        28: (71) r8 = *(u8 *)(r9 +20)
        invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)
      
      The reason why this gets rejected despite a proper test is that we
      currently call find_good_pkt_pointers() only in case where we detect
      tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
      derived, for example, from a register of type pkt(id=Y,off=0,r=0)
      pointing to skb->data. find_good_pkt_pointers() then fills the range
      in the current branch to pkt(id=Y,off=0,r=Z) on success.
      
      For above case, we need to extend that to recognize pkt_end >= rX
      pattern and mark the other branch that is taken on success with the
      appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
      Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
      only two practical options to test for from what LLVM could have
      generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=)
      that we would need to take into account as well.
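
The two test shapes discussed above can be pictured with a plain userspace analogue. This is only a sketch, not a BPF program: the function names are hypothetical, and `uintptr_t` values stand in for the packet registers so that `data + 54` can be computed without touching memory.

```c
#include <stdint.h>

/*
 * Shape "rX > pkt_end": the taken branch is the failure path, so the
 * access is safe on the fall-through branch.
 */
static int read_off20_gt(uintptr_t data, uintptr_t data_end)
{
    uintptr_t r2 = data + 54;

    if (r2 > data_end)
        return -1;
    return *(const uint8_t *)(data + 20);
}

/*
 * Shape "pkt_end >= rX": the taken branch is the success path, so the
 * access is safe inside the branch that the test jumps to.
 */
static int read_off20_ge(uintptr_t data, uintptr_t data_end)
{
    uintptr_t r2 = data + 54;

    if (data_end >= r2)
        return *(const uint8_t *)(data + 20);
    return -1;
}
```

Both functions perform the same bounds check. They differ only in which branch carries the proven range, which is the distinction the verifier has to track.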
      
      After the fix:
      
        [...]
         7: (61) r3 = *(u32 *)(r6 +80)
         8: (61) r9 = *(u32 *)(r6 +76)
         9: (bf) r2 = r9
        10: (07) r2 += 54
        11: (3d) if r3 >= r2 goto pc+12
         R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
         R9=pkt(id=0,off=0,r=0) R10=fp
        12: (18) r4 = 0xffffff7a
        14: (05) goto pc+430
        [...]
      
        from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
                       R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
        24: (7b) *(u64 *)(r10 -40) = r1
        25: (b7) r1 = 0
        26: (63) *(u32 *)(r6 +56) = r1
        27: (b7) r2 = 40
        28: (71) r8 = *(u8 *)(r9 +20)
        29: (bf) r1 = r8
        30: (25) if r8 > 0x3c goto pc+47
         R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
         R9=pkt(id=0,off=0,r=54) R10=fp
        31: (b7) r1 = 1
        [...]
      
      Verifier test cases are also added in this work, one that demonstrates
      the mentioned example here and one that tries a bad packet access for
      the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
      pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
      with both test variants (>, >=).
      
      Fixes: 969bf05e ("bpf: direct packet access")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Yaogong Wang's avatar
      tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang authored
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbytes limit.
      
      Even with the current window scale limit of 14, ~1 Gbytes maps to
      ~740,000 MSS.
      
      In the presence of packet losses (or reorders), TCP stores incoming
      packets in an out-of-order queue, and the number of skbs sitting there
      waiting for the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to the receive queue, we scan the
      queue from its head.
      
      However, in the presence of heavy losses, we might have to find an
      arbitrary point in this queue, involving a linear scan for every
      incoming packet and thrashing CPU caches.
      
      This patch converts it to an RB tree, to get bounded latencies.
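
The idea of an ordered tree with a tail shortcut can be sketched in userspace as below. Several things here are assumptions for illustration: the type and function names are hypothetical, and a plain unbalanced binary search tree stands in for the kernel's RB tree, so this sketch does not give the bounded latencies that rebalancing provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical out-of-order segment, keyed by its starting sequence number. */
struct seg {
    uint32_t seq;
    struct seg *left, *right;
};

struct ooo_queue {
    struct seg *root;
    struct seg *last;   /* segment with the highest seq (tail cache) */
};

/* Wraparound-safe "a comes before b" on 32-bit sequence numbers. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Insert s; returns 0 on success, -1 if that seq is already queued. */
static int ooo_insert(struct ooo_queue *q, struct seg *s)
{
    struct seg **link = &q->root;

    s->left = s->right = NULL;

    /* Fast path: most segments land right after the current tail. */
    if (q->last && seq_before(q->last->seq, s->seq)) {
        q->last->right = s;
        q->last = s;
        return 0;
    }

    /* Slow path: descend from the root to find the insertion point. */
    while (*link) {
        if (s->seq == (*link)->seq)
            return -1;
        link = seq_before(s->seq, (*link)->seq) ? &(*link)->left
                                                : &(*link)->right;
    }
    *link = s;
    if (!q->last)
        q->last = s;   /* first segment in an empty queue */
    return 0;
}

/* Lowest-sequence segment, i.e. the next candidate for the receive queue. */
static struct seg *ooo_first(const struct ooo_queue *q)
{
    struct seg *n = q->root;

    while (n && n->left)
        n = n->left;
    return n;
}
```

The tail cache plays the role the commit message gives to the ofo_last_skb cache: the common append case skips the tree descent entirely.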
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with a network dropping between 1 and 10 % of packets, with good
      results (about 30 % increase in throughput in stress tests).
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'ovs-802.1ad' · 3b61075b
      David S. Miller authored
      Eric Garver says:
      
      ====================
      openvswitch: add 802.1ad support
      
      This series adds 802.1ad support to openvswitch. It is a continuation of the
      work originally started by Thomas F Herbert - hence the large rev number.
      
      The extra VLAN is implemented by using an additional level of the
      OVS_KEY_ATTR_ENCAP netlink attribute.
      In OVS flow speak, this looks like
      
         eth_type(0x88a8),vlan(vid=100),encap(eth_type(0x8100), vlan(vid=200),
                                              encap(eth_type(0x0800), ...))
      
      The userspace counterpart has also seen recent activity on the ovs-dev mailing
      lists. There are some new 802.1ad OVS tests being added - also on the ovs-dev
      list. This patch series has been tested using the most recent version of
      userspace (v3) and tests (v2).
      
      v22 changes:
        - merge patch 4 into patch 3
        - fix checkpatch.pl errors
          - Still some 80 char warnings for long string literals
        - refresh pointer after pskb_may_pull()
        - refactor vlan nlattr parsing to remove some double checks
        - introduce ovs_nla_put_vlan()
        - move triple VLAN check to after ethertype serialization
        - WARN_ON_ONCE() on triple VLAN and unexpected encap values
      
      v21 changes:
        - Fix (and simplify) netlink attribute parsing
        - re-add handling of truncated VLAN tags
        - fix if/else dangling assignment in {push,pop}_vlan()
        - simplify parse_vlan()
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Garver's avatar
      openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes · 018c1dda
      Eric Garver authored
      Add support for 802.1ad including the ability to push and pop double
      tagged vlans. Add support for 802.1ad to netlink parsing and flow
      conversion. Uses double nested encap attributes to represent double
      tagged vlan. Inner TPID encoded along with ctci in nested attributes.
      
      This is based on Thomas F Herbert's original v20 patch. I made some
      small clean ups and bug fixes.
      Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
      Signed-off-by: Eric Garver <e@erig.me>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Garver's avatar
      vlan: Check for vlan ethernet types for 8021.q or 802.1ad · fe19c4f9
      Eric Garver authored
      To simplify handling of double-tagged VLANs, add a function that checks
      for all valid VLAN ethertypes in a single call.
      Also replace some existing instances that check for both ETH_P_8021Q
      and ETH_P_8021AD.
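
A userspace analogue of such a check might look like the sketch below. The names `is_vlan_ethertype()` and `count_vlan_tags()` are hypothetical and are not the function added by this patch. The two constants carry the same values as ETH_P_8021Q and ETH_P_8021AD.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_8021Q  0x8100   /* same value as ETH_P_8021Q */
#define ETHERTYPE_8021AD 0x88A8   /* same value as ETH_P_8021AD */

/* Hypothetical helper: true for either VLAN ethertype (host byte order). */
static bool is_vlan_ethertype(uint16_t ethertype)
{
    switch (ethertype) {
    case ETHERTYPE_8021Q:
    case ETHERTYPE_8021AD:
        return true;
    default:
        return false;
    }
}

/* Count the VLAN tags that follow the MAC addresses of a raw frame. */
static int count_vlan_tags(const uint8_t *frame, size_t len)
{
    size_t off = 12;   /* skip destination and source MAC addresses */
    int tags = 0;

    while (off + 4 <= len) {
        uint16_t type = (uint16_t)((frame[off] << 8) | frame[off + 1]);

        if (!is_vlan_ethertype(type))
            break;
        tags++;
        off += 4;      /* 2-byte TPID followed by 2-byte TCI */
    }
    return tags;
}
```

With one predicate covering both ethertypes, the loop handles single-tagged and double-tagged (802.1ad outer, 802.1Q inner) frames with the same code path.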
      
      Patch based on one originally by Thomas F Herbert.
      Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
      Signed-off-by: Eric Garver <e@erig.me>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Thomas F Herbert's avatar
      openvswitch: 802.1ad uapi changes. · 8c146bb9
      Thomas F Herbert authored
      openvswitch: Add support for 802.1AD
      
      Change the description of the VLAN tpid field.
      Signed-off-by: Thomas F Herbert <thomasfherbert@gmail.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>