1. 18 Jan, 2017 23 commits
    • tun: rx batching · 5503fcec
      Jason Wang authored
      We can only process one packet at a time during sendmsg(). This often
      leads to bad cache utilization under heavy load. So this patch tries to do
      some batching during rx before submitting packets to the host network
      stack. This is done by accepting MSG_MORE as a hint from the
      sendmsg() caller: if it is set, the packet is batched temporarily in a
      linked list, and all batched packets are submitted once MSG_MORE is cleared.
      
      Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:
      
                                       Mpps  -+%
          rx-frames = 0                0.91  +0%
          rx-frames = 4                1.00  +9.8%
          rx-frames = 8                1.00  +9.8%
          rx-frames = 16               1.01  +10.9%
          rx-frames = 32               1.07  +17.5%
          rx-frames = 48               1.07  +17.5%
          rx-frames = 64               1.08  +18.6%
          rx-frames = 64 (no MSG_MORE) 0.91  +0%
      
      Users can change the number of packets batched per device through
      ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as the upper limit
      to prevent bh from being disabled for too long.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
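The batching idea in the tun commit above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct pkt`, `struct rx_batcher`, `rx_flush()` and `rx_submit()` are hypothetical names, the `more` argument stands in for the MSG_MORE hint, and `rx_frames` stands in for the ethtool rx-frames setting (0 meaning no batching, as in the table). It is not the driver's code and does not model the NAPI_POLL_WEIGHT cap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names come from the tun driver. */
struct pkt {
    struct pkt *next;
};

struct rx_batcher {
    struct pkt *head;      /* first queued packet */
    struct pkt *tail;      /* last queued packet */
    size_t queued;         /* packets currently held in the list */
    size_t rx_frames;      /* batch limit; 0 disables batching (assumption) */
    size_t delivered;      /* packets handed to the modelled stack */
    size_t flushes;        /* number of batch submissions */
};

/* Hand every queued packet to the stack in one pass and empty the list. */
static void rx_flush(struct rx_batcher *b)
{
    struct pkt *p = b->head;

    if (!p)
        return;
    while (p) {
        struct pkt *next = p->next;

        p->next = NULL;
        b->delivered++;    /* stand-in for submitting to the network stack */
        p = next;
    }
    b->head = b->tail = NULL;
    b->queued = 0;
    b->flushes++;
}

/* Queue one packet; `more` plays the role of the MSG_MORE hint. */
static void rx_submit(struct rx_batcher *b, struct pkt *p, bool more)
{
    assert(b && p);
    p->next = NULL;
    if (b->tail)
        b->tail->next = p;
    else
        b->head = p;
    b->tail = p;
    b->queued++;

    /* Flush when the hint is cleared, batching is off, or the limit is hit. */
    if (!more || b->rx_frames == 0 || b->queued >= b->rx_frames)
        rx_flush(b);
}
```

With `rx_frames` set to 4, four calls with `more` set produce a single flush of four packets, while a call with `more` cleared flushes whatever is queued immediately.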
    • vhost_net: tx batching · 0ed005ce
      Jason Wang authored
      This patch tries to utilize tuntap rx batching by peeking at the tx
      virtqueue during transmission. If there are more available buffers in
      the virtqueue, it sets the MSG_MORE flag as a hint for the backend
      (e.g. tuntap) to batch the packets.
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • vhost: better detection of available buffers · 275bf960
      Jason Wang authored
      This patch makes several tweaks to vhost_vq_avail_empty() for
      better performance:
      
      - check the cached avail index first, which can avoid a userspace memory access.
      - use unlikely() for the failure of the userspace access.
      - check vq->last_avail_idx instead of the cached avail index as the last
        step.
      
      This patch is needed for batching support, which needs to peek whether
      or not there are still available buffers in the ring.
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
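The three tweaks listed in the vhost commit above can be sketched in a user-space model. The struct and function below are hypothetical: `shared_idx` stands in for the avail index that lives in guest-visible memory, a NULL pointer models a failed userspace access, and returning "not empty" on failure is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a virtqueue; field names loosely follow the commit text. */
struct vq_model {
    uint16_t avail_idx;         /* cached copy of the ring's avail index */
    uint16_t last_avail_idx;    /* next entry the consumer will process */
    const uint16_t *shared_idx; /* stand-in for the index in shared memory;
                                   NULL models a failed userspace access */
};

static bool vq_avail_empty(struct vq_model *vq)
{
    assert(vq);

    /* 1. Cached index first: no shared-memory access if buffers are known. */
    if (vq->avail_idx != vq->last_avail_idx)
        return false;

    /* 2. Read failure is the rare path; reporting "not empty" is an assumption. */
    if (vq->shared_idx == NULL)
        return false;
    vq->avail_idx = *vq->shared_idx;

    /* 3. Final comparison against last_avail_idx using the refreshed index. */
    return vq->avail_idx == vq->last_avail_idx;
}
```

A caller that wants to set a batching hint would treat a `false` result as "more buffers may follow".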
    • net:add one common config ARCH_WANT_RELAX_ORDER to support relax ordering · 1a8b6d76
      Mao Wenan authored
      Relax ordering (RO) is a feature of the 82599 NIC. Enabling it can
      improve performance on some CPU architectures, such as SPARC.
      Currently the 82599 driver enables RO for only one CPU architecture
      (SPARC), which leaves out other CPU architectures that also need the
      RO feature.
      This patch adds a common config option, CONFIG_ARCH_WANT_RELAX_ORDER,
      to enable RO, and defines it in the sparc Kconfig first.
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Reviewed-by: Alexander Duyck <alexander.duyck@gmail.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ipv6-simplify-rt6_fill_node' · 1e48aac1
      David S. Miller authored
      David Ahern says:
      
      ====================
      net: ipv6: simplify rt6_fill_node
      
      Remove a couple of unnecessary input arguments to rt6_fill_node.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: remove prefix arg to rt6_fill_node · f8cfe2ce
      David Ahern authored
      The prefix arg to rt6_fill_node is non-0 in only 1 path - rt6_dump_route
      where a user is requesting a prefix-only dump. Simplify rt6_fill_node
      by removing the prefix arg and moving the prefix check to rt6_dump_route.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: remove nowait arg to rt6_fill_node · fd61c6ba
      David Ahern authored
      All callers of rt6_fill_node pass 0 for the nowait arg. Remove the arg and
      simplify rt6_fill_node accordingly.
      
      rt6_fill_node passes the nowait of 0 to ip6mr_get_route. Remove the
      nowait arg from it as well.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sctp-sender-side-stream-reconf-ssn-reset-request-chunk' · 1ce463dd
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: add sender-side procedures for stream reconf ssn reset request chunk
      
      Patch 6/6 is to implement sender-side procedures for the Outgoing
      and Incoming SSN Reset Request Parameter described in rfc6525
      section 5.1.2 and 5.1.3
      
      Patches 1-5/6 are ahead of it to define some apis and asoc members
      for it.
      
      Note that with this patchset, asoc->reconf_enable cannot yet be set;
      that becomes possible once the patch "sctp: add get and set sockopt for
      reconf_enable" is applied in the future, because we cannot enable it
      while sctp is not yet capable of processing reconf chunks.
      
      v1->v2:
        - put these into a smaller group.
        - rename some temporary variables in the codes.
        - rename the titles of the commits and improve some changelogs.
      v2->v3:
        - re-split the patchset and make sure it has no dead codes for review.
      v3->v4:
        - move sctp_make_reconf() into patch 1/6 to avoid kbuild warning.
        - drop unused struct sctp_strreset_req.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: implement sender-side procedures for SSN Reset Request Parameter · 7f9d68ac
      Xin Long authored
      This patch implements sender-side procedures for the Outgoing
      and Incoming SSN Reset Request Parameter described in rfc6525 sections
      5.1.2 and 5.1.3.
      
      It also adds the sockopt SCTP_RESET_STREAMS described in rfc6525
      section 6.3.2 for users.
      
      Note that the new asoc member strreset_outstanding makes sure that
      only one reconf request chunk is in flight, as rfc6525 section 5.1.1
      demands.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add sockopt SCTP_ENABLE_STREAM_RESET · 9fb657ae
      Xin Long authored
      This patch adds the sockopt SCTP_ENABLE_STREAM_RESET to get/set
      strreset_enable, which indicates which reconf request types are supported,
      as described in rfc6525 section 6.3.1.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add reconf_enable in asoc ep and netns · c28445c3
      Xin Long authored
      This patch adds a reconf_enable field to each of asoc, ep and netns
      to indicate whether they support stream reset.
      
      When initializing, asoc reconf_enable gets its default value from ep
      reconf_enable, which in turn defaults to netns reconf_enable.
      
      It also adds reconf_capable to the asoc peer part to record whether the
      peer supports reconf_enable. The value is set if the ext params indicate
      reconf chunk support when processing the init chunk, just as rfc6525
      section 5.1.1 demands.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add stream reconf primitive · 7a090b04
      Xin Long authored
      This patch adds a primitive, based on the sctp primitive frame, for
      sending a stream reconf request. It works like the other primitives
      and creates a SCTP_CMD_REPLY command to send the request chunk out.
      
      sctp_primitive_RECONF is the api for sending a reconf request
      chunk.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add stream reconf timer · 7b9438de
      Xin Long authored
      This patch adds a per-transport timer, based on the sctp timer frame,
      for stream reconf chunk retransmission. It starts after sending
      a reconf request chunk and stops after receiving the response chunk.
      
      If the timer expires, besides retransmitting the reconf request chunk,
      it also does the same things as the data RTO timer, such as increasing
      the appropriate error counts and performing threshold management, possibly
      destroying the asoc if sctp retransmission thresholds are exceeded, just
      as section 5.1.1 describes.
      
      This patch also adds asoc strreset_chunk, which is used to save the
      reconf request chunk so that it can be retransmitted, and to check whether
      a response is really for this request by comparing the information
      inside it with the response chunk.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add support for generating stream reconf ssn reset request chunk · cc16f00f
      Xin Long authored
      This patch adds asoc strreset_outseq and strreset_inseq for
      saving the reconf request sequence, initializes them when creating
      the assoc and processing init, and defines the Incoming and Outgoing
      SSN Reset Request Parameters described in rfc6525 sections 4.1 and
      4.2. As they can be in the same chunk, as rfc6525 section 3.1-3
      describes, both are built in one function.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rework-inet_csk_get_port' · b16ed2b1
      David S. Miller authored
      Josef Bacik says:
      
      ====================
      Rework inet_csk_get_port
      
      V3->V4:
      -Removed the random include of addrconf.h that is no longer needed.
      
      V2->V3:
      -Dropped the fastsock from the tb and instead just carry the saddrs, family, and
       ipv6 only flag.
      -Reworked the helper functions to deal with this change so I could still use
       them when checking the fast path.
      -Killed tb->num_owners as per Eric's request.
      -Attached a reproducer to the bottom of this email.
      
      V1->V2:
      -Added a new patch 'inet: collapse ipv4/v6 rcv_saddr_equal functions into one'
       at Hannes' suggestion.
      -Dropped ->bind_conflict and just use the new helper.
      -Fixed a compile bug from the original ->bind_conflict patch.
      
      The original description of the series follows:
      
      At some point recently the guys working on our load balancer added the ability
      to use SO_REUSEPORT.  When they restarted their app with this option enabled
      they immediately hit a softlockup on what appeared to be the
      inet_bind_bucket->lock.  Eventually what all of our debugging and discussion led
      us to was the fact that the application comes up without SO_REUSEPORT, shuts
      down which creates around 100k twsk's, and then comes up and tries to open a
      bunch of sockets using SO_REUSEPORT, which meant traversing the inet_bind_bucket
      owners list under the lock.  Since this lock is needed for dealing with the
      twsk's and basically anything else related to connections we would softlockup,
      and sometimes not ever recover.
      
      To solve this problem I did what you see in Patch 5/5.  Once we have a
      SO_REUSEPORT socket on the tb->owners list we know that the socket has no
      conflicts with any of the other sockets on that list.  So we can add a copy of
      the sock_common (really all we need is the recv_saddr but it seemed ugly to copy
      just the ipv6, ipv4, and flag to indicate if we were ipv6 only in there so I've
      copied the whole common) in order to check subsequent SO_REUSEPORT sockets.  If
      they match the previous one then we can skip the expensive
      inet_csk_bind_conflict check.  This is what eliminated the soft lockup that we
      were seeing.
      
      Patches 1-4 are cleanups and re-workings.  For instance when we specify port ==
      0 we need to find an open port, but we would do two passes through
      inet_csk_bind_conflict every time we found a possible port.  We would also keep
      track of the smallest_port value in order to try and use it if we found no
      port our first run through.  This however made no sense as it would have had to
      fail the first pass through inet_csk_bind_conflict, so would not actually pass
      the second pass through either.  Finally I split the function into two functions
      in order to make it easier to read and to distinguish between the two behaviors.
      
      I have tested this on one of our load balancing boxes during peak traffic and it
      hasn't fallen over.  But this is not my area, so obviously feel free to point
      out where I'm being stupid and I'll get it fixed up and retested.  Thanks,
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: reset tb->fastreuseport when adding a reuseport sk · 637bc8bb
      Josef Bacik authored
      If we have non reuseport sockets on a tb we will set tb->fastreuseport to 0 and
      never set it again, which means that in the future if we end up adding a bunch
      of reuseport sk's to that tb we'll have to do the expensive scan every time.
      Instead add the ipv4/ipv6 saddr fields to the bind bucket, as well as the family
      so we know what comparison to make, and the ipv6 only setting so we can make
      sure to compare with new sockets appropriately.  Once one sk has made it onto
      the list we know that there are no potential bind conflicts on the owners list
      that match that sk's rcv_saddr.  So copy the sk's information into our bind
      bucket and set tb->fastreuseport to FASTREUSESOCK_STRICT so we know we have to do
      an extra check for subsequent reuseport sockets and skip the expensive bind
      conflict check.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: split inet_csk_get_port into two functions · 289141b7
      Josef Bacik authored
      inet_csk_get_port does two different things: it either scans for an open port,
      or it tries to see if the specified port is available for use.  Since these two
      operations have different rules and are basically independent, let's split them
      into two different functions to make them both more readable.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: don't check for bind conflicts twice when searching for a port · 6cd66616
      Josef Bacik authored
      This is just wasted time, we've already found a tb that doesn't have a bind
      conflict, and we don't drop the head lock so scanning again isn't going to give
      us a different answer.  Instead move the tb->reuse setting logic outside of the
      found_tb path and put it in the success: path.  Then make it so that we don't
      goto again if we find a bind conflict in the found_tb path as we won't reach
      this anymore when we are scanning for an ephemeral port.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: kill smallest_size and smallest_port · b9470c27
      Josef Bacik authored
      In inet_csk_get_port we seem to be using smallest_port to figure out where the
      best place to look for a SO_REUSEPORT sk that matches with an existing set of
      SO_REUSEPORT's.  However if we get to the logic
      
      if (smallest_size != -1) {
      	port = smallest_port;
      	goto have_port;
      }
      
      we will do a useless search, because we would have already done the
      inet_csk_bind_conflict for that port and it would have returned 1, otherwise we
      would have gone to found_tb and succeeded.  Since this logic makes us do yet
      another trip through inet_csk_bind_conflict for a port we know won't work just
      delete this code and save us the time.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: drop ->bind_conflict · aa078842
      Josef Bacik authored
      The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
      is how they check the rcv_saddr, so delete this callback and simply
      change inet_csk_bind_conflict to call inet_rcv_saddr_equal.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: collapse ipv4/v6 rcv_saddr_equal functions into one · fe38d2a1
      Josef Bacik authored
      We pass these per-protocol equal functions around in various places, but
      we can just have one function that checks the sk->sk_family and then calls
      the right comparison function.  I've also changed the ipv4 version to
      not cast to inet_sock since it is unneeded.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmicro: add more information to Kconfig · ab70e586
      jpinto authored
      This patch adds more info to stmicro's Kconfig files in order to make it clearer
      that the driver can be used by ethernet cards based on 10/100/1000/EQOS
      Synopsys IP Cores.
      
      EQOS was also added to stmmac/Kconfig, since dwmac4 is in fact EQoS,
      one of Synopsys Ethernet IPs. More info at:
      https://www.synopsys.com/dw/ipdir.php?ds=dwc_ether_qos
      Signed-off-by: Joao Pinto <jpinto@synopsys.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: Support bpf_xdp_adjust_head() · d8bec2b2
      Martin KaFai Lau authored
      This patch adds bpf_xdp_adjust_head() support to mlx5e.
      
      1. rx_headroom is added to struct mlx5e_rq.  It uses
         an existing 4 byte hole in the struct.
      2. The adjusted data length is checked against
         MLX5E_XDP_MIN_INLINE and MLX5E_SW2HW_MTU(rq->netdev->mtu).
      3. The macro MLX5E_SW2HW_MTU is moved from en_main.c to en.h.
         MLX5E_HW2SW_MTU is also moved to en.h for symmetry,
         although this is not required.
      
      v2:
      - Keep the xdp specific logic in mlx5e_xdp_handle()
      - Update dma_len after the sanity checks in mlx5e_xmit_xdp_frame()
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 17 Jan, 2017 17 commits
    • tcp: accept RST for rcv_nxt - 1 after receiving a FIN · 0e40f4c9
      Jason Baron authored
      Using a Mac OSX box as a client connecting to a Linux server, we have found
      that when certain applications (such as 'ab'), are abruptly terminated
      (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
      FIN is accepted by the Linux stack but the RST is sent with the same
      sequence number as the FIN, and Linux responds with a challenge ACK per
      RFC 5961. The OSX client then sometimes (they are rate-limited) does not
      reply with any RST as would be expected on a closed socket.
      
      This results in sockets accumulating on the Linux server left mostly in
      the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
      This sequence of events can tie up a lot of resources on the Linux server
      since there may be a lot of data in write buffers at the time of the RST.
      Accepting a RST equal to rcv_nxt - 1, after we have already successfully
      processed a FIN, has made a significant difference for us in practice, by
      freeing up unneeded resources in a more expedient fashion.
      
      A packetdrill test demonstrating the behavior:
      
      // testing mac osx rst behavior
      
      // Establish a connection
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
      0.200 < . 1:1(0) ack 1 win 32768
      0.200 accept(3, ..., ...) = 4
      
      // Client closes the connection
      0.300 < F. 1:1(0) ack 1 win 32768
      
      // now send rst with same sequence
      0.300 < R. 1:1(0) ack 1 win 32768
      
      // make sure we are in TCP_CLOSE
      0.400 %{
      assert tcpi_state == 7
      }%
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
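The acceptance rule described in the tcp commit above can be written as a small predicate. This is a sketch under assumptions, not the kernel's code: the enum and function names are hypothetical, only the three post-FIN states named in the commit message are modelled, and the challenge-ACK path for other sequence numbers is left out. In the packetdrill script, the FIN at sequence 1 advances rcv_nxt to 2, so the RST at sequence 1 arrives at rcv_nxt - 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of TCP states, for illustration only. */
enum conn_state { ST_ESTABLISHED, ST_CLOSE_WAIT, ST_LAST_ACK, ST_CLOSING };

/* States the commit message lists as reachable after a FIN was processed. */
static bool fin_already_received(enum conn_state s)
{
    return s == ST_CLOSE_WAIT || s == ST_LAST_ACK || s == ST_CLOSING;
}

/* True if a RST carrying `seq` should reset the connection in this model. */
static bool rst_seq_acceptable(enum conn_state s, uint32_t seq, uint32_t rcv_nxt)
{
    /* Exact match is always accepted. */
    if (seq == rcv_nxt)
        return true;

    /* After a FIN, also accept the FIN's own sequence number (rcv_nxt - 1).
     * Unsigned arithmetic keeps this correct across sequence wraparound. */
    return fin_already_received(s) && seq == (uint32_t)(rcv_nxt - 1);
}
```

For example, `rst_seq_acceptable(ST_CLOSE_WAIT, 1, 2)` is true, while the same sequence numbers in `ST_ESTABLISHED` are rejected.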
    • net: ethoc: Make needlessly global struct ethtool_ops static · a870a977
      Tobias Klauser authored
      Make the needlessly global struct ethtool_ops ethoc_ethtool_ops static
      to fix a sparse warning.
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sfc-RX-hash-config' · 49a67b5d
      David S. Miller authored
      Edward Cree says:
      
      ====================
      sfc: RX hash configuration
      
      This series improves support for getting and setting RX hashing
      configuration on Solarflare adapters through ethtool.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sfc: read back RX hash config from the NIC when querying it with ethtool -x · a707d188
      Edward Cree authored
      Ensures that we report the key and indirection table the NIC is using,
      rather than (if setting them failed earlier) what we wanted it to use.
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ping: Use right format specifier to avoid type casting · a7ef6715
      Gao Feng authored
      The inet_num is u16, so use %hu instead of casting it to int. The
      sk_bound_dev_if is already an int, so it does not need a cast to int.
      Signed-off-by: Gao Feng <fgao@ikuai8.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
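The format-specifier point in the ping commit above can be shown with a short user-space sketch. The helper name and message text are made up for this example, and `unsigned short` stands in for the kernel's u16; the only thing taken from the commit is the idea of matching `%hu` to a 16-bit unsigned value and `%d` to an int so that no casts are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper; the message text is made up for this example. */
static int format_bind_info(char *buf, size_t len,
                            unsigned short num, int bound_dev_if)
{
    assert(buf && len > 0);

    /* %hu matches the 16-bit unsigned value, %d matches the int; no casts. */
    return snprintf(buf, len, "num=%hu dif=%d", num, bound_dev_if);
}
```

Calling it with `num` set to 65535 and `bound_dev_if` set to 2 writes `num=65535 dif=2` into the buffer.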
    • qed: Replace memset with eth_zero_addr · 0ee28e31
      Shyam Saini authored
      Use eth_zero_addr to zero the given address array instead of calling
      memset with zero as its second argument. This also makes the code
      clearer.
      Signed-off-by: Shyam Saini <mayhs11saini@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: sparse fixes in br_ip6_multicast_alloc_query() · 53631a5f
      Lance Richardson authored
      Changed type of csum field in struct igmpv3_query from __be16 to
      __sum16 to eliminate type warning, made same change in struct
      igmpv3_report for consistency.
      
      Fixed up an ntohs() where htons() should have been used instead.
      Signed-off-by: Lance Richardson <lrichard@redhat.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mpls-packet-stats' · e60a4263
      David S. Miller authored
      Robert Shearman says:
      
      ====================
      mpls: Packet stats
      
      This patchset records per-interface packet stats in the MPLS
      forwarding path and exports them using a nest of attributes rooted at a
      new IFLA_STATS_AF_SPEC attribute as part of RTM_GETSTATS messages:
      
      [IFLA_STATS_AF_SPEC]
       -> [AF_MPLS]
        -> [MPLS_STATS_LINK]
         -> struct mpls_link_stats
      
      The first patch adds the rtnl infrastructure for this, including new
      fill_stats_af and get_stats_af_size callbacks in the per-AF ops. The
      second patch records MPLS stats and makes use of the infrastructure to
      export them. The rtnl infrastructure could also be used to export IPv6
      stats in the future.
      
      Changes in v2:
       - make incrementing IPv6 stats in mpls_stats_inc_outucastpkts
         conditional on CONFIG_IPV6 to fix build with CONFIG_IPV6=n
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mpls: Packet stats · 27d69105
      Robert Shearman authored
      Having MPLS packet stats is useful for observing network operation and
      for diagnosing network problems. In the absence of anything better,
      RFC2863 and RFC3813 are used for guidance for which stats to expose
      and the semantics of them. In particular rx_noroutes maps to in
      unknown protos in RFC2863. The stats are exposed to userspace via
      AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
      RTM_GETSTATS messages.
      
      All the introduced fields are 64-bit, even error ones, to ensure no
      overflow with long uptimes. Per-CPU counters are used to avoid
      cache-line contention on the commonly used fields. The other fields
      have also been made per-CPU to avoid performance problems in
      error conditions, on the assumption that on some platforms the cost of
      atomic operations could be more expensive than sending the packet
      (which is what would be done in the success case). If that's not the
      case, we could instead not use per-CPU counters for these fields.
      
      Only unicast and non-fragment are exposed at the moment, but other
      counters can be exposed in the future either by adding to the end of
      struct mpls_link_stats or by additional netlink attributes in the
      AF_MPLS IFLA_STATS_AF_SPEC nested attribute.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
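The per-CPU counter approach described in the MPLS stats commit above can be sketched as a user-space model. Everything here is hypothetical: the struct names, the fixed `MODEL_NR_CPUS`, and the explicit `cpu` argument are stand-ins chosen for illustration, and the model ignores the synchronization a real implementation needs when readers run concurrently with writers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4  /* hypothetical fixed CPU count for the model */

/* 64-bit fields, so long uptimes do not overflow the counters. */
struct link_stats {
    uint64_t rx_packets;
    uint64_t rx_bytes;
};

/* One private copy per CPU, so writers never share a cache line. */
struct pcpu_link_stats {
    struct link_stats cpu[MODEL_NR_CPUS];
};

/* Fast path: touch only the calling CPU's copy. */
static void stats_inc_rx(struct pcpu_link_stats *s, unsigned int cpu, uint64_t len)
{
    assert(s && cpu < MODEL_NR_CPUS);
    s->cpu[cpu].rx_packets++;
    s->cpu[cpu].rx_bytes += len;
}

/* Slow path: fold all per-CPU copies into one total for reporting. */
static struct link_stats stats_sum(const struct pcpu_link_stats *s)
{
    struct link_stats total = {0, 0};

    for (size_t i = 0; i < MODEL_NR_CPUS; i++) {
        total.rx_packets += s->cpu[i].rx_packets;
        total.rx_bytes += s->cpu[i].rx_bytes;
    }
    return total;
}
```

The trade-off is that updates stay cheap and uncontended, while a read has to walk every CPU's copy.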
    • net: AF-specific RTM_GETSTATS attributes · aefb4d4a
      Robert Shearman authored
      Add the functionality for including address-family-specific per-link
      stats in RTM_GETSTATS messages. This is done through adding a new
      IFLA_STATS_AF_SPEC attribute under which address family attributes are
      nested and then the AF-specific attributes can be further nested. This
      follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
      advantage of presenting an easily extended hierarchy. The rtnl_af_ops
      structure is extended to provide AFs with the opportunity to fill and
      provide the size of their stats attributes.
      
      One alternative would have been to provide AFs with the ability to add
      attributes directly into the RTM_GETSTATS message without a nested
      hierarchy. I discounted this approach as it increases the rate at
      which the 32 attribute number space is used up and it makes
      implementation a little more tricky for stats dump resuming (at the
      moment the order in which attributes are added to the message has to
      match the numeric order of the attributes).
      
      Another alternative would have been to register per-AF RTM_GETSTATS
      handlers. I discounted this approach as I perceived a common use-case
      to be getting all the stats for an interface and this approach would
      necessitate multiple requests/dumps to retrieve them all.
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • stmmac: add missing of_node_put · a249708b
      Julia Lawall authored
      The function stmmac_dt_phy provides several possibilities for initializing
      plat->mdio_node, all of which have the effect of increasing the reference
      count of the assigned value.  This field is not updated elsewhere, so the
      value is live until the end of the lifetime of plat (devm_allocated), just
      after the end of stmmac_remove_config_dt.  Thus, add an of_node_put on
      plat->mdio_node in stmmac_remove_config_dt.  It is possible that the field
      mdio_node is never initialized, but of_node_put is NULL-safe, so it is also
      safe to call of_node_put in that case.
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on xmit · 501db511
      Rolf Neugebauer authored
      This patch partially reverts fd2a0437 and e858fae2, which introduced a
      subtle change in how the virtio_net flags are derived from the SKB's
      ip_summed field.
      
      With the above commits, the flags are set to VIRTIO_NET_HDR_F_DATA_VALID
      when ip_summed == CHECKSUM_UNNECESSARY, thus treating it differently to
      ip_summed == CHECKSUM_NONE, which should be the same.
      
      Further, the virtio spec 1.0 / CS04 explicitly says that
      VIRTIO_NET_HDR_F_DATA_VALID must not be set by the driver.
      
      Fixes: fd2a0437 ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
      Fixes: e858fae2 ("virtio_net: use common code for virtio_net_hdr and skb GSO conversion")
      Signed-off-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      501db511
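
The transmit-side behaviour described in the virtio commit above can be sketched as a small mapping function. This is a hedged illustration, not the patch itself: `xmit_hdr_flags` is a hypothetical helper invented here, and the constant values are reproduced from memory of the kernel headers (`skbuff.h` and the virtio_net UAPI header), so they should be checked against the real headers before being relied on.

```c
#include <stdint.h>

/* ip_summed states (values assumed to match the kernel's skbuff.h). */
#define CHECKSUM_NONE        0
#define CHECKSUM_UNNECESSARY 1
#define CHECKSUM_COMPLETE    2
#define CHECKSUM_PARTIAL     3

/* virtio_net header flags (values assumed to match the UAPI header). */
#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
#define VIRTIO_NET_HDR_F_DATA_VALID 2

/*
 * Hypothetical helper: derive the header flags for a packet the driver
 * transmits. Only CHECKSUM_PARTIAL asks the device to finish the checksum.
 * CHECKSUM_UNNECESSARY and CHECKSUM_NONE both yield no flags, so
 * VIRTIO_NET_HDR_F_DATA_VALID is never set on this path.
 */
static uint8_t xmit_hdr_flags(int ip_summed)
{
    if (ip_summed == CHECKSUM_PARTIAL)
        return VIRTIO_NET_HDR_F_NEEDS_CSUM;
    return 0;
}
```

The point of the sketch is the absence of a `CHECKSUM_UNNECESSARY` branch: on transmit it falls through to the same result as `CHECKSUM_NONE`, which is the behaviour the commit message says the spec requires of the driver.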
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 4b19a9e2
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Handle multicast packets properly in fast-RX path of mac80211, from
          Johannes Berg.
      
       2) Because of a logic bug, the user can't actually force SW
          checksumming on r8152 devices. This makes diagnosis of hw
          checksumming bugs really annoying. Fix from Hayes Wang.
      
       3) VXLAN route lookup does not take the source and destination ports
          into account, which means IPSEC policies cannot be matched properly.
          Fix from Martynas Pumputis.
      
       4) Do proper RCU locking in netvsc callbacks, from Stephen Hemminger.
      
       5) Fix SKB leaks in mlxsw driver, from Arkadi Sharshevsky.
      
       6) If lwtunnel_fill_encap() fails, we do not abort the netlink message
          construction properly in fib_dump_info(), from David Ahern.
      
       7) Do not use kernel stack for DMA buffers in atusb driver, from Stefan
          Schmidt.
      
       8) Openvswitch conntrack actions need to maintain a correct checksum,
          fix from Lance Richardson.
      
       9) ax25_disconnect() already checks for ax25->sk being NULL, but not
          in all of the necessary spots. Fix from Basil Gunn.
      
      10) Action GET operations in the packet scheduler can erroneously bump
          the reference count of the entry, making it unreleasable. Fix from
          Jamal Hadi Salim. Jamal gives a great set of example command lines
          that trigger this in the commit message.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
        net sched actions: fix refcnt when GETing of action after bind
        net/mlx4_core: Eliminate warning messages for SRQ_LIMIT under SRIOV
        net/mlx4_core: Fix when to save some qp context flags for dynamic VST to VGT transitions
        net/mlx4_core: Fix racy CQ (Completion Queue) free
        net: stmmac: don't use netdev_[dbg, info, ..] before net_device is registered
        net/mlx5e: Fix a -Wmaybe-uninitialized warning
        ax25: Fix segfault after sock connection timeout
        bpf: rework prog_digest into prog_tag
        tipc: allocate user memory with GFP_KERNEL flag
        net: phy: dp83867: allow RGMII_TXID/RGMII_RXID interface types
        ip6_tunnel: Account for tunnel header in tunnel MTU
        mld: do not remove mld souce list info when set link down
        be2net: fix MAC addr setting on privileged BE3 VFs
        be2net: don't delete MAC on close on unprivileged BE3 VFs
        be2net: fix status check in be_cmd_pmac_add()
        cpmac: remove hopeless #warning
        ravb: do not use zero-length alignment DMA descriptor
        mlx4: do not call napi_schedule() without care
        openvswitch: maintain correct checksum state in conntrack actions
        tcp: fix tcp_fastopen unaligned access complaints on sparc
        ...
      4b19a9e2
    • Merge branch 'stable/for-linus-4.10' of... · 203f80f1
      Linus Torvalds authored
      Merge branch 'stable/for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
      
      Pull swiotlb fix from Konrad Rzeszutek Wilk:
       "A tiny fix to make sure that page-sized mappings are page-aligned (and
        do not, say, straddle two pages). This is important for some drivers
        (such as NVME)"
      
      * 'stable/for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
        swiotlb: ensure that page-sized mappings are page-aligned
      203f80f1
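
The property the swiotlb fix above is after can be illustrated with simple address arithmetic. This is a sketch under assumptions, not the swiotlb code: `SKETCH_PAGE_SIZE`, `straddles_page` and `page_align_up` are hypothetical names invented here, and a 4096-byte page is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for the sketch; real systems may differ. */
#define SKETCH_PAGE_SIZE ((uintptr_t)4096)

/* True if the byte range [addr, addr + len) touches more than one page. */
static bool straddles_page(uintptr_t addr, size_t len)
{
    if (len == 0)
        return false;
    return (addr / SKETCH_PAGE_SIZE) != ((addr + len - 1) / SKETCH_PAGE_SIZE);
}

/* Round an address up to the next page boundary (no-op if already aligned). */
static uintptr_t page_align_up(uintptr_t addr)
{
    return (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}
```

A page-sized buffer that starts in the middle of a page necessarily spills into the next one, while the same buffer placed at an aligned address fits within a single page. That is the distinction the pull message says matters to drivers such as NVME.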