1. 15 Jun, 2016 40 commits
    • net_sched: sch_netem: defer skb freeing · 2f08a9a1
      Eric Dumazet authored
      rtnl_kfree_skbs() can be used in tfifo_reset()
      
      It would be nice if we could iterate through rb tree instead
      of removing one skb at a time, and build a single skb chain.
      But this is left for a future patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_htb: defer skb freeing · a5a9f534
      Eric Dumazet authored
      Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
      instead of __skb_queue_purge() to defer skb freeing of internal
      queues.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_hhf: defer skb freeing · e7e424cd
      Eric Dumazet authored
      Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fq_codel: defer skb freeing · ece5d4c7
      Eric Dumazet authored
      Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_fq: defer skb freeing · e14ffdfd
      Eric Dumazet authored
      Both fq_change() and fq_reset() can use rtnl_kfree_skbs()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_codel: defer skb freeing in codel_change() · b3d7e2b2
      Eric Dumazet authored
      codel_change() can use rtnl_qdisc_drop()
      to defer expensive skb freeing after locks are released.
      
      codel_reset() already has support for deferred skb freeing
      because it uses qdisc_reset_queue()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_choke: defer skb freeing · f9aed311
      Eric Dumazet authored
      choke_reset() and choke_change() can use rtnl_qdisc_drop()
      to defer expensive skb freeing after locks are released.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: add the ability to defer skb freeing · 1b5c5493
      Eric Dumazet authored
      Qdiscs are changed under RTNL protection, often
      while BH is blocked and the root qdisc spinlock is held.
      
      When lots of skbs need to be dropped, we free
      them under these locks causing TX/RX freezes,
      and more generally latency spikes.
      
      This commit adds rtnl_kfree_skbs(), used to queue
      skbs for deferred freeing.
      
      Actual freeing happens right after RTNL is released,
      with appropriate scheduling points.
      
      rtnl_qdisc_drop() can also be used in place
      of qdisc_drop() when RTNL is held.
      
      qdisc_reset_queue() and __qdisc_reset_queue() get
      the new behavior, so standard qdiscs like pfifo, pfifo_fast...
      have their ->reset() method automatically handled.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
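
The deferral pattern described in the commit above can be modeled outside the kernel. What follows is a minimal Python sketch, not kernel code: `DeferredFreeQueue`, `defer()`, `flush()` and `reset_queue()` are hypothetical names that only stand in for the roles the message gives to rtnl_kfree_skbs() and to the freeing step that runs after RTNL is released.

```python
import threading


class DeferredFreeQueue:
    """Toy model: collect buffers while a lock is held, free them afterwards."""

    def __init__(self):
        self.lock = threading.Lock()   # stands in for RTNL / the qdisc spinlock
        self.pending = []              # buffers queued for deferred freeing
        self.freed = []                # record of what was freed, for inspection

    def defer(self, buf):
        # Cheap bookkeeping that is acceptable under the lock.
        assert self.lock.locked(), "defer() is meant to run with the lock held"
        self.pending.append(buf)

    def flush(self):
        # The expensive part runs only once the lock has been released.
        assert not self.lock.locked(), "flush() must run after unlock"
        batch, self.pending = self.pending, []
        for buf in batch:
            self.freed.append(buf)     # placeholder for the real free operation
        return len(batch)


def reset_queue(q, buffers):
    """Drop every buffer: queue under the lock, free after releasing it."""
    with q.lock:
        for buf in buffers:
            q.defer(buf)
    return q.flush()
```

The point of the split is that the time spent under the lock no longer grows with the cost of freeing each buffer, only with the cost of appending it to a list.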
    • tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy authored
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the members of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
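
The domain selection in the commit above can be illustrated with a short Python sketch. It is a model under stated assumptions, not the TIPC implementation: `monitoring_plan()` is a hypothetical name, N is taken as the cluster size including the local node, sqrt(N) is rounded down, and the head selection rule (every node that follows a full domain of M members becomes a head) is just one simple way to satisfy the "no more than two direct monitoring hops" property the message describes.

```python
import math


def monitoring_plan(node_ids, self_id):
    """Return (local_domain, heads) for self_id on a sorted circular ring."""
    ring = sorted(node_ids)
    n = len(ring)
    start = ring.index(self_id)
    # Nodes downstream from self, in ring order, excluding self.
    downstream = [ring[(start + i) % n] for i in range(1, n)]
    m = max(math.isqrt(n) - 1, 0)      # M = sqrt(N) - 1, rounded down (assumption)
    local_domain = downstream[:m]
    heads = []
    i = m
    while i < len(downstream):
        heads.append(downstream[i])    # monitor this remote domain head directly
        i += m + 1                     # skip the members assumed covered by that head
    return local_domain, heads
```

For a 400-node ring this yields 19 local domain members and 19 remote heads, which matches the 38 actively monitored neighbors quoted in the commit message.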
    • net: vrf: Update flags and features settings · 7889681f
      David Ahern authored
      1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users
         can add one as desired.
      
      2. Disable adding a VLAN to a VRF device.
      
      3. Enable offloads and hardware features similar to other logical
         devices (e.g., dummy, veth)
      
      Change provides a significant boost in TCP stream Tx performance,
      from ~2,700 Mbps to ~18,100 Mbps and makes throughput close to the
      performance without a VRF (18,500 Mbps). netperf TCP_STREAM benchmark
      using qemu with virtio+vhost for the NICs
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: fix csum generation for tap devices · df10db98
      Paolo Abeni authored
      The commit 34166093 ("tuntap: use common code for virtio_net_hdr
      and skb GSO conversion") replaced the tun code for header manipulation
      with the generic helpers. While doing so, it implicitly moved the
      skb_partial_csum_set() invocation after eth_type_trans(), which
      invalidates the current gso start/offset values.
      Fix it by moving the helper invocation before the mac pulling.
      
      Fixes: 34166093 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'skb_array' · 829e64d1
      David S. Miller authored
      Michael S. Tsirkin says:
      
      ====================
      skb_array: array based FIFO for skbs
      
      This is in response to the proposal by Jason to make tun
      rx packet queue lockless using a circular buffer.
      My testing seems to show that at least for the common usecase
      in networking, which isn't lockless, circular buffer
      with indices does not perform that well, because
      each index access causes a cache line to bounce between
      CPUs, and index access causes stalls due to the dependency.
      
      By comparison, an array of pointers where NULL means invalid
      and !NULL means valid, can be updated without messing up barriers
      at all and does not have this issue.
      
      On the flip side, cache pressure may be caused by using large queues.
      tun has a queue of 1000 entries by default and that's 8K.
      At this point I'm not sure this can be solved efficiently.
      The correct solution might be sizing the queues appropriately.
      
      Here's an implementation of this idea: it can be used more
      or less whenever sk_buff_head can be used, except you need
      to know the queue size in advance.
      
      As this might be useful outside of networking, I implemented
      a generic array of void pointers, with a type-safe wrapper for skbs.
      
      It remains to be seen whether resizing is required. In case it is,
      I included patches implementing resizing by holding both the
      consumer and the producer locks.
      
      I think this code works fine without any extra memory barriers since we
      always read and write the same location, so the accesses can not be
      reordered.
      Multiple writes of the same value into memory would mess things up
      for us, I don't think compilers would do it though.
      But if people feel it's better to be safe wrt compiler optimizations,
      specifying queue as volatile would probably do it in a cleaner way
      than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?
      
      The only issue is with calls within a loop using the __ptr_ring_XXX
      accessors - in theory compiler could hoist accesses out of the loop.
      
      Following volatile-considered-harmful.txt I merely
      documented that callers that busy-poll should invoke cpu_relax().
      Most people will use the external skb_array_XXX APIs with a spinlock,
      so this should not be an issue for them.
      
      Eric Dumazet suggested adding an extra pointer to skb for when
      we have a single outstanding packet. I could not figure out
      a way to implement this without a shared consumer/producer lock
      though, which would cause cache line bounces by itself.
      
      Jesper, Jason, I know that both of you tested this,
      please post Tested-by tags for whatever was tested.
      
      changes since v7
      	fix typos noticed by Jesper Brouer
      
      changes since v6
      	resize implemented. peek/full calls are no longer lockless
      
      	replaced _FIELD macros with _CALL which invoke a function
      	on the pointer rather than just returning a value
      
      	destroy now scans the array and frees all queued skbs
      
      changes since v5
      	implemented a generic ptr_ring api, and
      		made skb_array a type-safe wrapper
      	apis for taking the spinlock in different contexts
      		following expected usecase in tun
      changes since v4 (v3 was never posted)
      	documentation
      	dropped SKB_ARRAY_MIN_SIZE heuristic
      	unit test (in userspace, included as patch 2)
      
      changes since v2:
              fixed integer overflow pointed out by Eric.
              added some comments.
      
      changes since v1:
              fixed bug pointed out by Eric.
      ====================
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skb_array: resize support · 7d7072e3
      Michael S. Tsirkin authored
      Update skb_array after ptr_ring API changes.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptr_ring: resize support · 5d49de53
      Michael S. Tsirkin authored
      This adds ring resize support. Seems to be necessary as
      users such as tun allow userspace control over queue size.
      
      If resize is used, this costs us ability to peek at queue without
      consumer lock - should not be a big deal as peek and consumer are
      usually run on the same CPU.
      
      If ring is made bigger, ring contents are preserved.  If ring is made
      smaller, extra pointers are passed to an optional destructor callback.
      
      Cleanup function also gains destructor callback such that
      all pointers in queue can be cleaned up.
      
      This changes some APIs but we don't have any users yet,
      so it won't break bisect.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
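
The resize semantics in the commit above (contents preserved when growing, extra pointers handed to an optional destructor when shrinking) can be sketched as follows. This is a hedged Python model: `resize_ring()` is a hypothetical name, `queued` is assumed to hold the pending entries in FIFO order, and keeping the oldest entries on shrink is an assumption, since the message does not say which entries count as "extra".

```python
def resize_ring(queued, new_size, destroy=None):
    """Return a new slot array of new_size holding the oldest queued entries."""
    kept = queued[:new_size]
    for extra in queued[new_size:]:
        if destroy is not None:
            destroy(extra)             # optional destructor for entries that no longer fit
    # Unused slots are marked empty with None.
    return kept + [None] * (new_size - len(kept))
```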
    • skb_array: array based FIFO for skbs · ad69f35d
      Michael S. Tsirkin authored
      A simple array based FIFO of pointers.  Intended for net stack so uses
      skbs for type safety. Implemented as a set of wrappers around ptr_ring.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptr_ring: ring test · 9fb6bc5b
      Michael S. Tsirkin authored
      Add ringtest based unit test for ptr ring.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ptr_ring: array based FIFO for pointers · 2e0ab8ca
      Michael S. Tsirkin authored
      A simple array based FIFO of pointers.  Intended for net stack which
      commonly has a single consumer/producer.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
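
The idea behind the array based FIFO, where a NULL slot means "invalid" and a non-NULL slot means "valid", can be shown with a small Python model. This is a sketch only: the `PtrRing` class and its `produce()`/`consume()` methods are hypothetical stand-ins, `None` plays the role of NULL, and no locking or memory ordering is modeled.

```python
class PtrRing:
    """Fixed-size FIFO in which an empty slot is marked by None."""

    def __init__(self, size):
        self.queue = [None] * size
        self.producer = 0              # next slot to write; used only by the producer
        self.consumer = 0              # next slot to read; used only by the consumer

    def produce(self, item):
        if item is None:
            raise ValueError("None is reserved to mark an empty slot")
        if self.queue[self.producer] is not None:
            return False               # slot still occupied: the ring is full
        self.queue[self.producer] = item
        self.producer = (self.producer + 1) % len(self.queue)
        return True

    def consume(self):
        item = self.queue[self.consumer]
        if item is None:
            return None                # slot empty: nothing queued
        self.queue[self.consumer] = None
        self.consumer = (self.consumer + 1) % len(self.queue)
        return item
```

Note that neither side reads the other side's index: fullness and emptiness are both decided by looking at the slot itself, which is the property the cover letter credits with avoiding the shared-index cache line bounce.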
    • net_sched: make tcf_hash_check() boolean · b2313077
      WANG Cong authored
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'vrf-ipv6-mcast-link-local' · a6e225ca
      David S. Miller authored
      David Ahern says:
      
      ====================
      net: vrf: Handle ipv6 multicast and link-local addresses
      
      IPv6 multicast and link-local addresses require special handling by the
      VRF driver. Rather than using the VRF device index and full FIB lookups,
      packets to/from these addresses should use direct FIB lookups based on
      the VRF device table.
      
      Multicast routes do not make sense for the L3 master device directly.
      Accordingly, do not add mcast routes for the device, and the VRF driver
      should fail attempts to send packets to ipv6 mcast addresses on the
      device (e.g., ping6 ff02::1%<vrf> should fail)
      
      With this change connections into and out of a VRF enslaved device work
      for multicast and link-local addresses (icmp, tcp, and udp).  e.g.,
      
      1. packets into VM with VRF config:
          ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
          ping6 -c3 ff02::1%br1
          ssh -6 fe80::e0:f9ff:fe1c:b974%br1
      
      2. packets going out a VRF enslaved device:
          ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
          ping6 -c3 ff02::1%eth1
          ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: Handle ipv6 multicast and link-local addresses · 9ff74384
      David Ahern authored
      IPv6 multicast and link-local addresses require special handling by the
      VRF driver:
      1. Rather than using the VRF device index and full FIB lookups,
         packets to/from these addresses should use direct FIB lookups based on
         the VRF device table.
      
      2. fail sends/receives on a VRF device to/from a multicast address
         (e.g., make ping6 ff02::1%<vrf> fail)
      
      3. move the setting of the flow oif to the first dst lookup and revert
         the change in icmpv6_echo_reply made in ca254490 ("net: Add VRF
         support to IPv6 stack"). Linklocal/mcast addresses require use of the
         skb->dev.
      
      With this change connections into and out of a VRF enslaved device work
      for multicast and link-local addresses (icmp, tcp, and udp),
      e.g.,
      
      1. packets into VM with VRF config:
          ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
          ping6 -c3 ff02::1%br1
      
          ssh -6 fe80::e0:f9ff:fe1c:b974%br1
      
      2. packets going out a VRF enslaved device:
          ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
          ping6 -c3 ff02::1%eth1
          ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
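
To make concrete which destinations fall under the special handling in the commit above, here is a minimal userspace sketch using Python's standard `ipaddress` module. The helper name `needs_vrf_special_handling()` is hypothetical, and the check only classifies addresses; it says nothing about how the VRF driver performs its lookups.

```python
import ipaddress


def needs_vrf_special_handling(addr):
    """True for IPv6 multicast or link-local addresses (scope suffix ignored)."""
    host = addr.split("%", 1)[0]       # drop a "%<device>" scope suffix if present
    ip = ipaddress.IPv6Address(host)
    return ip.is_multicast or ip.is_link_local
```

Applied to the addresses in the test commands above, both `fe80::e0:f9ff:fe1c:b974%br1` and `ff02::1%eth1` are classified as needing the special path.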
    • net: ipv6: Do not add multicast route for l3 master devices · ba46ee4c
      David Ahern authored
      L3 master devices are virtual devices similar to the loopback
      device. Link local and multicast routes for these devices do
      not make sense. The ipv6 addrconf code already skips adding a
      linklocal address; do the same for the mcast route.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: l3mdev: Remove const from flowi6 arg to get_rt6_dst · cd2a9e62
      David Ahern authored
      Allow drivers to pass flow arg to functions where the arg is not const
      and allow the driver to make updates as needed (e.g., setting oif).
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'af_iucv-big-bufs' · c9ad5a65
      David S. Miller authored
      Ursula Braun says:
      
      ====================
      s390: af_iucv patches
      
      here are improvements for af_iucv relaxing the pressure to allocate
      big contiguous kernel buffers.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_iucv: use paged SKBs for big inbound messages · a006353a
      Eugene Crosser authored
      When an inbound message is bigger than a page, allocate a paged SKB,
      and subsequently use IUCV receive primitive with IPBUFLST flag.
      This relaxes the pressure to allocate big contiguous kernel buffers.
      Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_iucv: remove fragment_skb() to use paged SKBs · 291759a5
      Eugene Crosser authored
      Before introducing paged skbs in the receive path, get rid of the
      function `iucv_fragment_skb()` that replaces one large linear skb
      with several smaller linear skbs.
      Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_iucv: use paged SKBs for big outbound messages · e5374399
      Eugene Crosser authored
      When an outbound message is bigger than a page, allocate and fill
      a paged SKB, and subsequently use IUCV send primitive with IPBUFLST
      flag. This relaxes the pressure to allocate big contiguous kernel
      buffers.
      Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
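
The size rule shared by the inbound and outbound af_iucv commits above (a paged buffer only when the message is bigger than a page) can be sketched in a few lines of Python. This is an illustration under assumptions: `plan_buffer()` is a hypothetical helper, the 4096-byte page size is assumed, and the real driver's handling of header space or the IPBUFLST descriptor list is not modeled.

```python
PAGE_SIZE = 4096                       # assumed page size for this sketch


def plan_buffer(msg_len, page_size=PAGE_SIZE):
    """Return ("linear", [msg_len]) or ("paged", [fragment lengths])."""
    if msg_len <= page_size:
        return "linear", [msg_len]     # fits in one contiguous buffer
    full, rest = divmod(msg_len, page_size)
    # One fragment per full page, plus a final partial fragment if needed.
    frags = [page_size] * full + ([rest] if rest else [])
    return "paged", frags
```

With this split, no single allocation exceeds one page, which is the sense in which the commits relax the pressure to allocate big contiguous kernel buffers.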
    • dt: bindings: Add bindings for Cirrus Logic CS89x0 ethernet chip · 818d49ad
      Alexander Shiyan authored
      Add device tree binding documentation details for Cirrus Logic
      CS8900/CS8920 ethernet chip.
      Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
      Acked-by: Rob Herring <robh@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cx89x0: Add DT support · d3cf8fd3
      Alexander Shiyan authored
      Add DT support to the Cirrus Logic CS89x0 driver.
      Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • act_police: rename tcf_act_police_locate() to tcf_act_police_init() · d9fa17ef
      WANG Cong authored
      This function is just ->init(), rename it to make it obvious.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: remove internal use of TC_POLICE_* · 95df1b16
      WANG Cong authored
      These should have been removed when we removed CONFIG_NET_CLS_POLICE.
      We cannot remove them entirely since they are exposed
      to userspace.
      
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'rds-mprds-foundations' · 161cd45f
      David S. Miller authored
      Sowmini Varadhan says:
      
      ====================
      RDS: multiple connection paths for scaling
      
      Today RDS-over-TCP is implemented by demux-ing multiple PF_RDS sockets
      between any 2 endpoints (where endpoint == [IP address, port]) over a
      single TCP socket between the 2 IP addresses involved. This has the
      limitation that it ends up funneling multiple RDS flows over a single
      TCP flow, thus the rds/tcp connection is
         (a) upper-bounded to the single-flow bandwidth,
         (b) suffers from head-of-line blocking for the RDS sockets.
      
      Better throughput (for a fixed small packet size, MTU) can be achieved
      by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
      RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
      connection. RDS sockets will be attached to a path based on some hash
      (e.g., of local address and RDS port number) and packets for that RDS
      socket will be sent over the attached path using TCP to segment/reassemble
      RDS datagrams on that path.
      
      The table below, generated using a prototype that implements mprds,
      shows that this is significant for scaling to 40G.  Packet sizes
      used were: 8K byte req, 256 byte resp. MTU: 1500.  The parameters for
      RDS-concurrency used below are described in the rds-stress(1) man page;
      the number listed is proportional to the number of threads at which max
      throughput was attained.
      
        -------------------------------------------------------------------
           RDS-concurrency   Num of       tx+rx K/s (iops)       throughput
           (-t N -d N)       TCP paths
        -------------------------------------------------------------------
              16             1             600K -  700K            4 Gbps
              28             8            5000K - 6000K           32 Gbps
        -------------------------------------------------------------------
      
      FAQ: what is the relation between mprds and mptcp?
        mprds is orthogonal to mptcp. Whereas mptcp creates
        sub-flows for a single TCP connection, mprds parallelizes tx/rx
        at the RDS layer. MPRDS with N paths will allow N datagrams to
        be sent in parallel; each path will continue to send one
        datagram at a time, with sender and receiver keeping track of
        the retransmit and dgram-assembly state based on the RDS header.
        If desired, mptcp can additionally be used to speed up each TCP
        path. That acceleration is orthogonal to the parallelization benefits
        of mprds.
      
      This patch series lays down the foundational data-structures to support
      mprds in the kernel. It implements the changes to split up the
      rds_connection structure into a common (to all paths) part,
      and a per-path rds_conn_path. All I/O workqs are driven from
      the rds_conn_path.
      
      Note that this patchset does not (yet) actually enable multipathing
      for any of the transports; all transports will continue to use a
      single path with the refactored data-structures. A subsequent patchset
      will add the changes to the rds-tcp module to actually use mprds
      in rds-tcp.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
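
The cover letter above says RDS sockets will be attached to a path "based on some hash (e.g., of local address and RDS port number)". A minimal Python sketch of that attachment step follows. The function name `select_path()` and the use of CRC32 are assumptions made purely for illustration; the series does not specify a hash function, and this patchset does not yet enable multipathing.

```python
import zlib


def select_path(local_addr, rds_port, num_paths):
    """Map an RDS socket to a path index in [0, num_paths)."""
    key = f"{local_addr}:{rds_port}".encode()
    # Any stable hash works for the sketch; CRC32 is an arbitrary choice.
    return zlib.crc32(key) % num_paths
```

Because the mapping depends only on the socket's bound address and port, every datagram from one socket stays on one path, while different sockets can spread across the available paths.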
    • RDS: Update rds_conn_destroy to be MP capable · 3ecc5693
      Sowmini Varadhan authored
      Refactor rds_conn_destroy() so that the per-path dismantling
      is done in rds_conn_path_destroy, and then iterate as needed
      over rds_conn_path_destroy().
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: Update rds_conn_shutdown to work with rds_conn_path · d769ef81
      Sowmini Varadhan authored
      This commit changes rds_conn_shutdown to take a rds_conn_path *
      argument, allowing it to shutdown paths other than c_path[0] for
      MP-capable transports.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_create · 1c5113cf
      Sowmini Varadhan authored
      Add a for() loop in __rds_conn_create to initialize all the
      conn_paths, in preparation for MP capable transports.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: Add rds_conn_path_error() · fb1b3dc4
      Sowmini Varadhan authored
      rds_conn_path_error() is the MP-aware analog of rds_conn_error,
      to be used by multipath-capable callers.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: update rds-info related functions to traverse multiple conn_paths · 992c9ec5
      Sowmini Varadhan authored
      This commit updates the callbacks related to the rds-info command
      so that they walk through all the rds_conn_path structures and
      report the requested info.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: Add rds_conn_path_connect_if_down() for MP-aware callers · 3c0a5900
      Sowmini Varadhan authored
      rds_conn_path_connect_if_down() works on the rds_conn_path
      that it is passed. Callers who are not t_m_capable may continue
      calling rds_conn_connect_if_down, which will invoke
      rds_conn_path_connect_if_down() with the default c_path[0].
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: Make rds_send_pong() take a rds_conn_path argument · 45997e9e
      Sowmini Varadhan authored
      This commit allows rds_send_pong() callers to send back
      the rds pong message on some path other than c_path[0] by
      passing in a struct rds_conn_path * argument.  It also
      removes the last dependency on the #defines in rds_single.h
      from send.c
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: Extract rds_conn_path from i_conn_path in rds_send_drop_to() for MP-capable transports · 01ff34ed
      Sowmini Varadhan authored
      Explicitly set up rds_conn_path, either from i_conn_path (for
      MP capable transports) or as c_path[0], and use this in
      rds_send_drop_to()
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • RDS: Pass rds_conn_path to rds_send_xmit() · 1f9ecd7e
      Sowmini Varadhan authored
      Pass a struct rds_conn_path to rds_send_xmit so that MP capable
      transports can transmit packets on something other than c_path[0].
      The eventual goal for MP capable transports is to hash the rds
      socket to a path based on the bound local address/port, and use
      this path as the argument to rds_send_xmit()
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>