1. 15 Apr, 2016 38 commits
    • Xin Long's avatar
      sctp: add the sctp_diag.c file · 8f840e47
      Xin Long authored
      This one will implement all the interfaces of inet_diag's inet_diag_handler,
      which include sctp_diag_dump, sctp_diag_dump_one and sctp_diag_get_info.
      
      It will work as a module, and register inet_diag_handler when loading.
      
      v2->v3:
      - fix the mistake in inet_assoc_attr_size().
      
      - change inet_diag_msg_laddrs_fill() name to inet_diag_msg_sctpladdrs_fill.
      
      - change inet_diag_msg_paddrs_fill() name to inet_diag_msg_sctpaddrs_fill.
      
      - add inet_diag_msg_sctpinfo_fill() to make asoc/ep fill code clearer.
      
      - add inet_diag_msg_sctpasoc_fill() to make asoc fill code clearer.
      
      - merge inet_asoc_diag_fill() and inet_ep_diag_fill() to
        inet_sctp_diag_fill().
      
      - call sctp_diag_get_info() directly, instead of via the handler, because
        the caller is in the same file as it.
      
      - call lock_sock in sctp_tsp_dump_one() to make sure we get sctp info
        safely.
      
      - after lock_sock(sk), we should check sk != assoc->base.sk.
      
      - change mem[SK_MEMINFO_WMEM_ALLOC] to asoc->sndbuf_used for asoc dump when
        asoc->ep->sndbuf_policy is set. don't use INET_DIAG_MEMINFO attr any more.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Xin Long's avatar
      sctp: export some functions for sctp_diag in inet_diag · cb2050a7
      Xin Long authored
      inet_diag_msg_common_fill is used to fill the diag msg common info,
      we need to use it in sctp_diag as well, so export it.
      
      inet_diag_msg_attrs_fill is used to fill some common attrs info between
      sctp diag and tcp diag.
      
      v2->v3:
      - do not need to define and export inet_diag_get_handler any more,
        because all the functions in it are in sctp_diag.ko, so we just call
        them in sctp_diag.ko.
      
      - add inet_diag_msg_attrs_fill to make the code clearer.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Xin Long's avatar
      sctp: export some apis or variables for sctp_diag and reuse some for proc · 626d16f5
      Xin Long authored
      Some of the main variables in sctp.ko cannot be exported to other modules,
      so we have to define some APIs to access them.
      
      These include traversal of sctp transports and endpoints.
      
      The transport traversal functions added for sctp_diag can also be used
      by sctp_proc, because it needs to traverse transports in a similar way.
      
      v2->v3:
      - rhashtable_walk_init needs the gfp parameter, because of a recent
        upstream update
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Xin Long's avatar
      sctp: add sctp_info dump api for sctp_diag · 52c52a61
      Xin Long authored
      sctp_diag will dump some important details of sctp's assoc or ep. We use
      sctp_info to describe them and sctp_get_sctp_info to get them, and export
      it to sctp_diag.ko.
      
      v2->v3:
      - we will not use list_for_each_safe in sctp_get_sctp_info, because
        all the callers of it will use lock_sock.
      
      - fix the holes in struct sctp_info with __reserved* fields,
        because sctp_diag is a new feature, and sctp_info is just for now;
        it may be changed in the future.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
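The __reserved* approach mentioned in this commit can be illustrated with a small userspace sketch. The struct below is hypothetical: it is not the real struct sctp_info, and every field name in it is an assumption made for illustration. It only shows how explicit reserved fields take the place of the padding a compiler would otherwise insert, so the layout has no implicit holes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; NOT the real struct sctp_info. */
struct demo_info {
	uint32_t tag;            /* bytes 0..3 */
	uint8_t  state;          /* byte 4 */
	uint8_t  __reserved1[3]; /* bytes 5..7, explicit instead of compiler padding */
	uint64_t counter;        /* bytes 8..15 */
	uint16_t streams;        /* bytes 16..17 */
	uint8_t  __reserved2[6]; /* bytes 18..23, pads the tail explicitly */
};

/* Sum of the member sizes as declared above. */
#define DEMO_INFO_MEMBER_BYTES (4 + 1 + 3 + 8 + 2 + 6)

/* If the compiler had to add hidden padding, the two sizes would differ. */
_Static_assert(sizeof(struct demo_info) == DEMO_INFO_MEMBER_BYTES,
	       "struct demo_info has implicit holes");

/* Returns 1 when the struct contains compiler-inserted padding, else 0. */
static int demo_info_has_holes(void)
{
	return sizeof(struct demo_info) != DEMO_INFO_MEMBER_BYTES;
}
```

Keeping the holes explicit means the reserved bytes can later be given a meaning without changing the size or the offsets seen by userspace.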
    • Marcelo Ricardo Leitner's avatar
      sctp: simplify sk_receive_queue locking · 311b2177
      Marcelo Ricardo Leitner authored
      SCTP already serializes access to rcvbuf through its sock lock:
      sctp_recvmsg takes it right at the start and releases it at the end, while
      the rx path will also take the lock before doing any socket processing. In
      sctp_rcv() it will check if there is a user using the socket and, if
      there is, it will queue incoming packets to the backlog. The backlog
      processing will do the same. Even timers will do such a check and
      re-schedule if a user is using the socket.
      
      Simplifying this will allow us to remove sctp_skb_list_tail and get rid
      of some expensive locking.  The lists that it is used on are also
      mangled with functions like __skb_queue_tail and __skb_unlink in the
      same context, like on sctp_ulpq_tail_event() and sctp_clear_pd().
      sctp_close() will also purge those while using only the sock lock.
      
      Therefore the locking performed by sctp_skb_list_tail() is not
      necessary. This patch removes this function and replaces its calls with
      just skb_queue_splice_tail_init() instead.
      
      The biggest gain is at sctp_ulpq_tail_event(), because the events always
      contain a list, even if it's queueing a single skb and this was
      triggering expensive calls to spin_lock_irqsave/_irqrestore for every
      data chunk received.
      
      As SCTP will deliver each data chunk on a corresponding recvmsg, the
      smaller the chunks, the more effective the change will be.
      Before this patch, with chunks with 30 bytes:
      netperf -t SCTP_STREAM -H 192.168.1.2 -cC -l 60 -- -m 30 -S 400000
      400000 -s 400000 400000
      on a 10Gbit link with 1500 MTU:
      
      SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      425984 425984     30    60.00       137.45   7.34     7.36     52.504  52.608
      
      With it:
      
      SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.1 () port 0 AF_INET
      Recv   Send    Send                          Utilization       Service Demand
      Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
      Size   Size    Size     Time     Throughput  local    remote   local   remote
      bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
      
      425984 425984     30    60.00       179.10   7.97     6.70     43.740  36.788
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'mlx5_ifc-updates' · 936d4b41
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5_core: mlx5_ifc updates
      
      This series includes mlx5_core updates for both net-next and rdma
      trees for 4.7 kernel cycle. This is the only shared code planned
      for 4.7 between rdma and net trees. Hopefully, this will prevent
      future conflicts when merging between ib-next and net-next once
      4.7 cycle is over and merge window is opened.
      
      Both Mellanox rdma and net submissions will proceed once this series
      is applied into both trees.
      
      Future shared code will be sent to both maintainers as pull requests
      from Mellanox's kernel.org tree.
      
      We have included all the maintainers of respective drivers.
      Kindly review the change and let us know in case of any review comments.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Saeed Mahameed's avatar
      net/mlx5: Update mlx5_ifc hardware features · 7d5e1423
      Saeed Mahameed authored
      Adding the needed mlx5_ifc hardware bits and structs
      for the following features:
      
      * Add vport to steering commands for SRIOV ACL support
      * Add mlcr, pcmr and mcia registers for dump module EEPROM
      * Add support for FCS, beacon led and disable_link bits to
        hca caps
      * Add CQE period mode bit in CQ context for CQE based CQ
        moderation support
      * Add umr SQ bit for fragmented memory registration
      * Add needed bits and caps for Striding RQ support
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Tariq Toukan's avatar
      net/mlx5: Fix mlx5 ifc cmd_hca_cap bad offsets · e1c9c62b
      Tariq Toukan authored
      All reserved fields after early_vf_enable are off by 1, since
      early_vf_enable was not explicitly declared as array of size 1.
      
      Reserved field before cqe_zip had a wrong size, it should
      be 0x80 + 0x3f.
      
      Fixes: b0844444 ("net/mlx5_core: Introduce access function to read internal timer")
      Fixes: b4ff3a36 ("net/mlx5: Use offset based reserved field names in the IFC header file")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Matan Barak <matanb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'qed-tunneling-offload' · 993feee9
      David S. Miller authored
      Manish Chopra says:
      
      ====================
      qed/qede: Add tunneling support
      
      This patch series adds support for VXLAN, GRE and GENEVE tunnels
      to be used over this driver. With this support, adapter can perform
      TSO offload, inner/outer checksums offloads on TX and RX for
      encapsulated packets.
      
      V1->V2 [ Comments from Jesse Gross incorporated ]
      * Drop general infrastructure change patch.
        "net: Make vxlan/geneve default udp ports public"
      * Remove the by-default configuration of Linux default UDP ports in the
        driver. Instead, use general registration APIs for UDP port configurations
      * Removing .ndo_features_check - we will add it later with proper change.
      
      Please consider applying this series to net-next.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Manish Chopra's avatar
      qede: Add fastpath support for tunneling · 14db81de
      Manish Chopra authored
      This patch enables netdev tunneling features and adds
      TX/RX fastpath support for tunneling in driver.
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Manish Chopra's avatar
      qed/qede: Add GENEVE tunnel slowpath configuration support · 9a109dd0
      Manish Chopra authored
      This patch enables GENEVE tunnel on the adapter and
      adds support for driver hooks to configure UDP ports
      for GENEVE tunnel offload to be performed by the adapter.
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Manish Chopra's avatar
      qed/qede: Add VXLAN tunnel slowpath configuration support · b18e170c
      Manish Chopra authored
      This patch enables VXLAN tunnel on the adapter and
      adds support for driver hooks to configure UDP ports
      for VXLAN tunnel offload to be performed by the adapter.
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Manish Chopra's avatar
      qed: Add infrastructure support for tunneling · 464f6645
      Manish Chopra authored
      This patch adds various structures/APIs needed to configure/enable different
      tunnel [VXLAN/GRE/GENEVE] parameters on the adapter.
      Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Peter Heise's avatar
      net/hsr: Added support for HSR v1 · ee1c2797
      Peter Heise authored
      This patch adds support for the newer version 1 of the HSR
      networking standard. Version 0 is still default and the new
      version has to be selected via iproute2.
      
      Main changes are in the supervision frame handling and its
      ethertype field.
      Signed-off-by: Peter Heise <peter.heise@airbus.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'tcp-synflood-perf' · 125c8d12
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      tcp: final work on SYNFLOOD behavior
      
      In the first patch, I remove the costly association of SYNACK+COOKIES
      to a listener. I believe other parts of the stack should be ready.
      
      The second patch removes a useless write into listener socket
      in tcp_rcv_state_process(), incurring false sharing in
      tcp_conn_request()
      
      Performance under SYNFLOOD goes from 3.2 Mpps to 6 Mpps.
      
      Test was using a single TCP listener, on a host with 8 RX queues
      on the NIC, and 24 cores (48 ht)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tcp: remove false sharing in tcp_rcv_state_process() · 8804b272
      Eric Dumazet authored
      Last known hot point during SYNFLOOD attack is the clearing
      of rx_opt.saw_tstamp in tcp_rcv_state_process()
      
      It is not needed for a listener, so we move it where it matters.
      
      Performance while a SYNFLOOD hits a single listener socket
      went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Eric Dumazet's avatar
      tcp: do not mess with listener sk_wmem_alloc · b3d05147
      Eric Dumazet authored
      When removing sk_refcnt manipulation on synflood, I missed that
      using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
      transitioned to 0.
      
      We should hold sk_refcnt instead, but this is a big deal under attack.
      (Doing so increases performance from 3.2 Mpps to 3.8 Mpps only)
      
      In this patch, I chose to not attach a socket to syncookies skb.
      
      Performance is now 5 Mpps instead of 3.2 Mpps.
      
      Following patch will remove last known false sharing in
      tcp_rcv_state_process()
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Amitoj Kaur Chawla's avatar
      qlge: Replace create_singlethread_workqueue with alloc_ordered_workqueue · ac18dd9e
      Amitoj Kaur Chawla authored
      Replace deprecated create_singlethread_workqueue with
      alloc_ordered_workqueue.
      
      Work items include getting tx/rx frame sizes, resetting the MPI processor
      and setting the asic recovery bit, so ordering seems necessary as only one
      work item should be queued/executing at any given time, hence the use of
      alloc_ordered_workqueue.
      
      The WQ_MEM_RECLAIM flag has been set since ethernet devices seem to sit in
      the memory reclaim path, so as to guarantee forward progress regardless of
      memory pressure.
      Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'tipc-link-setup-improvements' · 0818556c
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: improvements to the link setup algorithm
      
      This series addresses some smaller issues regarding the link setup
      algorithm. The first commit fixes a rare bug we have discovered during
      testing; the second one may have some future impact on cluster
      scalability, while the remaining ones can be regarded as cosmetic in
      a wider sense of the word.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: let first message on link be a state message · 34b9cd64
      Jon Paul Maloy authored
      According to the link FSM, a received traffic packet can take a link
      from state ESTABLISHING to ESTABLISHED, but the link can still not be
      fully set up in one atomic operation. This means that even if the
      very first packet on the link is a traffic packet with sequence number
      1 (one), it has to be dropped and retransmitted.
      
      This can be avoided if we let the mentioned packet be preceded by a
      LINK_PROTOCOL/STATE message, which takes up the endpoint before the
      arrival of the traffic.
      
      We add this small feature in this commit.
      
      This is a fully compatible change.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: ensure that first packets on link are sent in order · de7e07f9
      Jon Paul Maloy authored
      In some link establishment scenarios we see that packet #2 may be sent
      out before packet #1, forcing the receiver to demand retransmission of
      the missing packet. This is harmless, but may cause confusion among
      people tracing the packet flow.
      
      Since this is extremely easy to fix, we do so by adding an extra send
      call to the bearer immediately after the link has come up.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: refactor function tipc_link_timeout() · 42b18f60
      Jon Paul Maloy authored
      The function tipc_link_timeout() is unnecessarily complex, and can
      easily be made more readable.
      
      We do that with this commit. The only functional change is that we
      remove a redundant test for whether the broadcast link is up or not.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jon Paul Maloy's avatar
      tipc: reduce transmission rate of reset messages when link is down · 88e8ac70
      Jon Paul Maloy authored
      When a link is down, it will continuously try to re-establish contact
      with the peer by sending out a RESET or an ACTIVATE message at each
      timeout interval. The default value for this interval is currently
      375 ms. This is wasteful, and may become a problem in very large
      clusters with dozens or hundreds of nodes being down simultaneously.
      
      We now introduce a simple backoff algorithm for these cases. The
      first five messages are sent at default rate; thereafter a message
      is sent only every 16th timer interval.
      
      This will cover the vast majority of link recycling cases, since the
      endpoint starting last will transmit at the higher speed, and the link
      should normally be established well before the rate needs to be
      reduced.
      
      The only case where we will see a degradation of link re-establishment
      times is when the endpoints remain intact, and a glitch in the
      transmission media is causing the link reset. We will then experience
      a worst-case re-establishing time of 6 seconds, something we deem
      acceptable.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
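The backoff described in this commit can be sketched in plain C as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the function names, the macro names and the convention that ticks are counted from 1 after the link goes down are all hypothetical. Only the numbers (375 ms interval, five fast messages, every 16th interval afterwards) come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_TIMER_INTERVAL_MS 375 /* default interval quoted in the commit */
#define DEMO_FAST_MSGS         5   /* messages sent at the default rate */
#define DEMO_SLOW_DIVISOR      16  /* afterwards: one message per 16 intervals */

/*
 * Decide whether a RESET/ACTIVATE message goes out on this timer tick.
 * 'tick' is assumed to count timer expiries since the link went down,
 * starting at 1.
 */
static bool demo_should_send(unsigned int tick)
{
	if (tick <= DEMO_FAST_MSGS)
		return true;
	return (tick % DEMO_SLOW_DIVISOR) == 0;
}

/* Longest gap between two messages once the slow rate is in effect. */
static unsigned int demo_worst_case_gap_ms(void)
{
	return DEMO_SLOW_DIVISOR * DEMO_TIMER_INTERVAL_MS;
}
```

With these values the slow-rate gap is 16 * 375 ms = 6000 ms, which matches the worst-case re-establishing time of 6 seconds stated in the commit message.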
    • Jon Paul Maloy's avatar
      tipc: guarantee peer bearer id exchange after reboot · 634696b1
      Jon Paul Maloy authored
      When a link endpoint is going down locally, e.g., because its interface
      is being stopped, it will spontaneously send out a RESET message to
      its peer, informing it about this fact. This saves the peer from
      detecting the failure via probing, and hence gives both speedier and
      less resource consuming failure detection on the peer side.
      
      According to the link FSM, a receiver of a RESET message, ignoring the
      reason for it, must now consider the sender ready to come back up, and
      starts periodically sending out ACTIVATE messages to the peer in order
      to re-establish the link. Also, according to the FSM, the receiver of
      an ACTIVATE message can now go directly to state ESTABLISHED and start
      sending regular traffic packets. This is a well-proven and robust FSM.
      
      However, in the case of a reboot, there is a small possibility that the link
      endpoint on the rebooted node may have been re-created with a new bearer
      identity between the moment it sent its (pre-boot) RESET and the moment
      it receives the ACTIVATE from the peer. The new bearer identity cannot
      be known by the peer according to this scenario, since traffic headers
      don't convey such information. This is a problem, because both endpoints
      need to know the correct value of the peer's bearer id at any moment in
      time in order to be able to produce correct link events for their users.
      
      The only way to guarantee this is to enforce a full setup message
      exchange (RESET + ACTIVATE) even after the reboot, since those messages
      carry the bearer identity in their header.
      
      In this commit we do this by introducing and setting a "stopping" bit in
      the header of the spontaneously generated RESET messages, informing the
      peer that the sender will not be immediately ready to re-establish the
      link. A receiver seeing this bit must act as if this were a locally
      detected connectivity failure, and hence has to go through a full two-
      way setup message exchange before any link can be re-established.
      
      Although never reported, this problem seems to have always been around.
      
      This protocol addition is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • David S. Miller's avatar
      Merge branch 'mlxsw-next' · 25fb0b6c
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      mlxsw: spectrum_buffers: couple of cosmetic patches
      
      As suggested by David Laight
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Jiri Pirko's avatar
      devlink: fix sb register stub in case devlink is disabled · de33efd0
      Jiri Pirko authored
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Fixes: bf797471 ("devlink: add shared buffer configuration")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Paolo Abeni's avatar
      tun: use per cpu variables for stats accounting · 608b9977
      Paolo Abeni authored
      Currently the tun device accounting uses dev->stats without applying any
      kind of protection, even though the accounting happens in preemptible
      process context.
      This patch moves the tun stats to a per cpu data structure, and protects
      the updates with u64_stats_update_begin()/u64_stats_update_end() or
      this_cpu_inc according to the stat type. The per cpu stats are
      aggregated by the newly added ndo_get_stats64 ops.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
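The per-cpu accounting pattern described in this commit can be illustrated with a userspace analogue. This is only a sketch under assumptions: the real patch uses kernel per-cpu allocations and u64_stats_sync, neither of which is available here, so a plain array indexed by a hypothetical CPU number stands in for them. All names below are made up for illustration, and the code is single-threaded.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_CPUS 4 /* assumed CPU count for the illustration */

/* One slot per CPU; each slot is meant to be updated only by its own CPU. */
struct demo_pcpu_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t tx_packets;
	uint64_t tx_bytes;
};

static struct demo_pcpu_stats demo_stats[DEMO_NR_CPUS];

/* Account one received packet of 'len' bytes on the given CPU's slot. */
static void demo_count_rx(unsigned int cpu, uint64_t len)
{
	struct demo_pcpu_stats *s = &demo_stats[cpu % DEMO_NR_CPUS];

	s->rx_packets++;
	s->rx_bytes += len;
}

/* Account one transmitted packet of 'len' bytes on the given CPU's slot. */
static void demo_count_tx(unsigned int cpu, uint64_t len)
{
	struct demo_pcpu_stats *s = &demo_stats[cpu % DEMO_NR_CPUS];

	s->tx_packets++;
	s->tx_bytes += len;
}

/* Aggregate all per-CPU slots, in the spirit of an ndo_get_stats64 callback. */
static struct demo_pcpu_stats demo_get_stats64(void)
{
	struct demo_pcpu_stats sum = { 0, 0, 0, 0 };
	unsigned int cpu;

	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		sum.rx_packets += demo_stats[cpu].rx_packets;
		sum.rx_bytes   += demo_stats[cpu].rx_bytes;
		sum.tx_packets += demo_stats[cpu].tx_packets;
		sum.tx_bytes   += demo_stats[cpu].tx_bytes;
	}
	return sum;
}
```

Because each writer touches only its own slot, writers never contend with each other; the cost is moved to the reader, which has to walk every slot to produce the totals.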
    • David S. Miller's avatar
      Merge branch 'bpf-ARG_PTR_TO_RAW_STACK' · 548aacdd
      David S. Miller authored
      Daniel Borkmann says:
      
      ====================
      BPF updates
      
      This series adds a new verifier argument type called
      ARG_PTR_TO_RAW_STACK and converts related helpers to make
      use of it. Basic idea is that we can save init of stack
      memory when the helper function is guaranteed to fully
      fill out the passed buffer in every path. Series also adds
      test cases and converts samples. For more details, please
      see individual patches.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Daniel Borkmann's avatar
      bpf, samples: add test cases for raw stack · 3f2050e2
      Daniel Borkmann authored
      This adds test cases mostly around ARG_PTR_TO_RAW_STACK to check the
      verifier behaviour.
      
        [...]
        #84 raw_stack: no skb_load_bytes OK
        #85 raw_stack: skb_load_bytes, no init OK
        #86 raw_stack: skb_load_bytes, init OK
        #87 raw_stack: skb_load_bytes, spilled regs around bounds OK
        #88 raw_stack: skb_load_bytes, spilled regs corruption OK
        #89 raw_stack: skb_load_bytes, spilled regs corruption 2 OK
        #90 raw_stack: skb_load_bytes, spilled regs + data OK
        #91 raw_stack: skb_load_bytes, invalid access 1 OK
        #92 raw_stack: skb_load_bytes, invalid access 2 OK
        #93 raw_stack: skb_load_bytes, invalid access 3 OK
        #94 raw_stack: skb_load_bytes, invalid access 4 OK
        #95 raw_stack: skb_load_bytes, invalid access 5 OK
        #96 raw_stack: skb_load_bytes, invalid access 6 OK
        #97 raw_stack: skb_load_bytes, large access OK
        Summary: 98 PASSED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Daniel Borkmann's avatar
      bpf, samples: don't zero data when not needed · 02413cab
      Daniel Borkmann authored
      Remove the zero initialization in the sample programs where appropriate.
      Note that this is an optimization which is now possible, old programs
      still doing the zero initialization are just fine as well. Also, make
      sure we don't have padding issues when we don't memset() the entire
      struct anymore.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Daniel Borkmann's avatar
      bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK · 074f528e
      Daniel Borkmann authored
      This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
      type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
      bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
      and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
      also be removed since the verifier already makes sure we stay within bounds
      on stack buffers.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Daniel Borkmann's avatar
      bpf, verifier: add ARG_PTR_TO_RAW_STACK type · 435faee1
      Daniel Borkmann authored
      When passing buffers from eBPF stack space into a helper function, we have
      ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure
      that such buffers are initialized, within boundaries, etc.
      
      However, the downside with this is that we have a couple of helper functions
      such as bpf_skb_load_bytes() that fill out the passed buffer in the expected
      success case anyway, so zero initializing them prior to the helper call is
      unneeded/wasted instructions in the eBPF program that can be avoided.
      
      Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
      The idea is to skip the STACK_MISC check in check_stack_boundary() and color
      the related stack slots as STACK_MISC after we checked all call arguments.
      
      Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
      the helper function will fill the provided buffer area, so that we cannot leak
      any uninitialized stack memory. This means, e.g., that error paths need to
      memset() the buffers, but the expected fast-path doesn't have to do this
      anymore.
      
      Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK
      argument, we can keep it simple and don't need to check for multiple areas.
      Should in future such a use-case really appear, we have check_raw_mode() that
      will make sure we implement support for it first.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
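The contract this commit places on helpers ("every path of the helper function will fill the provided buffer area") can be sketched in userspace C. The function below is a hypothetical stand-in, not the kernel's bpf_skb_load_bytes(); its name, signature and bounds check are assumptions chosen for illustration. The point it shows is that the success path overwrites the whole buffer, while the error path zeroes it, so the caller never sees stale memory even if it passed the buffer in uninitialized.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper that accepts a possibly uninitialized output buffer.
 * Copies 'len' bytes from 'src' at 'offset' into 'to'.
 */
static int demo_load_bytes(const unsigned char *src, size_t src_len,
			   size_t offset, unsigned char *to, size_t len)
{
	if (offset > src_len || len > src_len - offset) {
		/* Error path: still fill the buffer so nothing stale leaks. */
		memset(to, 0, len);
		return -EFAULT;
	}

	/* Fast path: the copy itself initializes every byte of 'to'. */
	memcpy(to, src + offset, len);
	return 0;
}
```

A caller written against such a helper can skip its own zero initialization of the buffer, which is the saving the new argument type is meant to allow.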
    • Daniel Borkmann's avatar
      bpf, verifier: add bpf_call_arg_meta for passing meta data · 33ff9823
      Daniel Borkmann authored
      Currently, when the verifier checks calls in check_call() function, we
      call check_func_arg() for all 5 arguments e.g. to make sure expected types
      are correct. In some cases, we collect meta data (here: map pointer) to
      perform additional checks such as checking stack boundary on key/value
      sizes for subsequent arguments. As we're going to extend the meta data,
      add a generic struct bpf_call_arg_meta that we can use for passing into
      check_func_arg().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: add support for RPS and RFS · 486bdee0
      Marcelo Ricardo Leitner authored
      This patch adds what's missing to properly support RPS and RFS on SCTP,
      as some of it is already implemented in common calls.
      
      Having support for RPS and RFS allows better scaling, especially because
      not all NICs support hashing SCTP headers.
      
      Save the hash right when we dequeue a skb from inqueue so we do it only
      once per skb instead of per chunk. New sockets will then inherit the
      hash through sctp_copy_sock().
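      The "once per skb instead of per chunk" point can be pictured with a
      toy model. The toy_* types and functions below are hypothetical and
      only mimic the shape of the logic: the hash is recorded when a new
      packet is pulled from the queue, not each time one of its chunks is
      handed out, and a copied socket inherits the stored hash.

```c
/* Toy model, not the SCTP implementation. */
struct toy_pkt  { unsigned int hash; int nchunks; };
struct toy_sock { unsigned int rxhash; int hash_saves; };
struct toy_inq  { struct toy_pkt *pkts; int npkts; int cur_pkt; int cur_chunk; };

/* Hands out the next chunk. Returns 1 if a chunk was popped, 0 if the
 * queue is empty. The hash is saved only when a packet is first dequeued. */
int toy_inq_pop(struct toy_inq *q, struct toy_sock *sk)
{
    while (q->cur_pkt < q->npkts) {
        struct toy_pkt *p = &q->pkts[q->cur_pkt];

        if (q->cur_chunk == 0 && p->nchunks > 0) {
            sk->rxhash = p->hash;   /* once per packet */
            sk->hash_saves++;
        }
        if (q->cur_chunk < p->nchunks) {
            q->cur_chunk++;
            return 1;
        }
        /* Packet exhausted: advance to the next one. */
        q->cur_pkt++;
        q->cur_chunk = 0;
    }
    return 0;
}

/* A new socket created from an existing one inherits the stored hash. */
void toy_copy_sock(struct toy_sock *newsk, const struct toy_sock *oldsk)
{
    newsk->rxhash = oldsk->rxhash;
    newsk->hash_saves = 0;
}
```

      With two packets carrying three and two chunks, five chunks are
      popped but the hash is written only twice.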
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: validate_xmit_skb() changes · d21fd63e
      Eric Dumazet authored
      skbs given to validate_xmit_skb() should not have a next
      pointer anymore.
      
      Also, if a packet is dropped, increment dev->tx_dropped.
      __dev_queue_xmit() no longer has to change tx_dropped in this case.
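      The shift in responsibility for the drop counter might look roughly
      like the following toy model. The toy_* names and the "too long"
      drop reason are assumptions for illustration, not the kernel's
      logic. The validation step counts the drop itself, so the caller
      only has to react to the NULL return.

```c
#include <stddef.h>

/* Toy model, not the networking core. */
struct toy_dev { unsigned long tx_dropped; unsigned int max_len; };
struct toy_skb { unsigned int len; };

/* Returns the skb if it may be transmitted, or NULL if it was dropped.
 * The drop is accounted here rather than in the caller. */
struct toy_skb *toy_validate_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
    if (skb->len > dev->max_len) {   /* hypothetical drop reason */
        dev->tx_dropped++;
        return NULL;
    }
    return skb;
}

/* Caller: no tx_dropped bookkeeping needed on the failure path. */
int toy_queue_xmit(struct toy_skb *skb, struct toy_dev *dev)
{
    skb = toy_validate_xmit(skb, dev);
    if (skb == NULL)
        return -1;                   /* drop already counted */
    return 0;
}
```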
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Apr, 2016 2 commits
    • packet: uses kfree_skb() for errors. · da37845f
      Weongyo Jeong authored
      consume_skb() isn't meant for error cases; kfree_skb() is the proper
      one there. This patch fixes tpacket_rcv() and packet_rcv() to use them
      consistently for error and non-error cases, letting perf trace its
      events properly.
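      The distinction can be sketched in user space as two release
      functions that free the buffer identically but feed different
      counters, standing in for the separate trace events. Everything
      named toy_* is hypothetical, and the "no room" error condition is
      an assumption for illustration.

```c
#include <stdlib.h>

/* Toy model: counters stand in for the separate trace events. */
struct toy_stats { unsigned long drops_traced; unsigned long consumed; };
struct toy_skb   { unsigned int len; };

/* Error/drop path: the buffer was not delivered. */
void toy_kfree_skb(struct toy_skb *skb, struct toy_stats *st)
{
    st->drops_traced++;
    free(skb);
}

/* Normal path: the buffer was processed successfully. */
void toy_consume_skb(struct toy_skb *skb, struct toy_stats *st)
{
    st->consumed++;
    free(skb);
}

/* Receive routine that picks the release function matching the outcome.
 * Takes ownership of skb in both cases. */
int toy_packet_rcv(struct toy_skb *skb, unsigned int room,
                   struct toy_stats *st)
{
    if (skb->len > room) {           /* hypothetical error condition */
        toy_kfree_skb(skb, st);
        return -1;
    }
    toy_consume_skb(skb, st);
    return 0;
}
```

      Keeping the two paths distinct means a drop counter or tracer sees
      only real failures instead of every freed buffer.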
      Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix a race condition leading to subscriber refcnt bug · 333f7962
      Parthasarathy Bhuvaragan authored
      Until now, the requests sent to the topology server are queued
      to a workqueue by the generic server framework.
      These messages are processed by worker threads and trigger the
      registered callbacks.
      To reduce latency on uniprocessor systems, explicit rescheduling
      is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
      messages.
      
      On SMP systems, this implementation leads to a subscriber refcnt
      error, as described below:
      When a worker thread yields by calling cond_resched() on an SMP
      system, a new worker is created on another CPU to process the
      pending workitem. Sometimes the sleeping thread wakes up before
      the new thread finishes execution.
      This breaks the assumption on ordering and being single threaded.
      The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.
      
      If the first thread was processing a subscription create and the
      second thread was processing close(), the close request will free
      the subscriber and the create request oopses as follows:
      
      [31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
      [31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
      [31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
      [...]
      [31.228377] Call Trace:
      [31.228377]  [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
      [31.228377]  [<ffffffff8105a311>] __warn+0xd1/0xf0
      [31.228377]  [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
      [31.228377]  [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
      [31.228377]  [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
      [31.228377]  [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
      [31.228377]  [<ffffffff81071925>] process_one_work+0x145/0x3d0
      [31.246554] ---[ end trace c3882c9baa05a4fd ]---
      [31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
      [31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
      [31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
      [31.249323] PGD 0
      [31.249323] Oops: 0000 [#1] SMP
      
      In this commit, we
      - rename tipc_conn_shutdown() to tipc_conn_release().
      - move connection release callback execution from tipc_close_conn()
        to a new function tipc_sock_release(), which is executed before
        we free the connection.
      Thus we release the subscriber during the connection release procedure
      rather than the connection shutdown procedure.
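      The ordering change can be pictured with a single-threaded toy
      model. The toy_* names are hypothetical and a plain int replaces
      the kernel's kref, so this is not thread-safe and is not the TIPC
      code. It shows that closing the connection only drops the base
      reference, and that the subscriber is released when the last
      reference goes away, so a worker still holding a reference never
      sees a freed subscriber.

```c
/* Toy model of the lifetime rule, not the TIPC server code. */
struct toy_conn {
    int refs;              /* stand-in for a kref */
    int closed;
    int subscriber_alive;
};

void toy_conn_get(struct toy_conn *c)
{
    c->refs++;
}

/* Dropping the last reference runs the release step, which is where
 * the subscriber is torn down. */
void toy_conn_put(struct toy_conn *c)
{
    if (--c->refs == 0)
        c->subscriber_alive = 0;
}

/* Shutdown marks the connection closed and drops the base reference,
 * but does not touch the subscriber directly. */
void toy_conn_close(struct toy_conn *c)
{
    c->closed = 1;
    toy_conn_put(c);
}
```

      If a worker takes a reference before close() runs, the subscriber
      stays valid until that worker calls toy_conn_put().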
      Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>