1. 17 Dec, 2013 15 commits
    • tcp: refine TSO splits · d4589926
      Eric Dumazet authored
      While investigating performance problems on small RPC workloads,
      I noticed that the Linux TCP stack was always splitting the last TSO skb
      into two parts (skbs): one a multiple of MSS, and a small one
      with the Push flag. This split is done even if TCP_NODELAY is set,
      or if no small packet is in flight.
      
      Example with request/response of 4K/4K
      
      IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      
      We can avoid this split by including the Nagle tests at the right place.
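
The split seen in the trace above can be reproduced with simple arithmetic. The following is a minimal sketch, not the kernel code: `split_at_mss` and `struct tso_split` are hypothetical names, and the MSS of 1448 bytes is an assumption inferred from the 2896-byte segments in the trace (2 x 1448).

```c
#include <stddef.h>

/* Hypothetical illustration type: how a payload is cut at an MSS boundary. */
struct tso_split {
    size_t full_part; /* largest multiple of MSS that fits in the payload */
    size_t tail_part; /* remaining partial segment, sent with the Push flag */
};

/* Cut `len` bytes into a multiple-of-MSS part and a small remainder,
 * which is the behaviour the commit message describes before the patch. */
static struct tso_split split_at_mss(size_t len, size_t mss)
{
    struct tso_split s;

    s.full_part = (len / mss) * mss;
    s.tail_part = len - s.full_part;
    return s;
}
```

With a 4096-byte response and the assumed MSS of 1448, this yields the 2896-byte and 1200-byte packets shown in the first trace; after the patch the same 4096 bytes go out as a single skb.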
      
      Note : If some NIC had trouble sending TSO packets with a partial
      last segment, we would have hit the problem in GRO/forwarding workload already.
      
      tcp_minshall_update() is moved to tcp_output.c and is updated as we might
      feed a TSO packet with a partial last segment.
      
      This patch tremendously improves performance, as the traffic now looks
      like :
      
      IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      
      Before :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280774
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           205719.049006 task-clock                #    9.278 CPUs utilized
               8,449,968 context-switches          #    0.041 M/sec
               1,935,997 CPU-migrations            #    0.009 M/sec
                 160,541 page-faults               #    0.780 K/sec
         548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
         455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
         272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
         166,091,460,030 instructions              #    0.30  insns per cycle
                                                   #    2.74  stalled cycles per insn [83.39%]
          29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
           1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]
      
            22.173517844 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   16851063           0.0
      IpExtOutOctets                  23878580777        0.0
      
      After patch :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280877
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           107496.071918 task-clock                #    4.847 CPUs utilized
               5,635,458 context-switches          #    0.052 M/sec
               1,374,707 CPU-migrations            #    0.013 M/sec
                 160,920 page-faults               #    0.001 M/sec
         281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
         228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
         142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
          95,227,712,566 instructions              #    0.34  insns per cycle
                                                   #    2.40  stalled cycles per insn [83.43%]
          16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
             874,252,952 branch-misses             #    5.39% of all branches         [83.37%]
      
            22.175821286 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   11239428           0.0
      IpExtOutOctets                  23595191035        0.0
      
      Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
      2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
      scheduler.
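
The occupancy figures quoted above follow directly from the two nstat counters. A minimal sketch of the calculation, where `avg_bytes_per_packet` is a hypothetical helper name and the inputs are the counter values printed in this commit message:

```c
#include <stdint.h>

/* Average payload carried per transmitted IP packet:
 * IpExtOutOctets divided by IpOutRequests (integer division). */
static uint64_t avg_bytes_per_packet(uint64_t out_octets, uint64_t out_requests)
{
    return out_requests ? out_octets / out_requests : 0;
}
```

Feeding in 23878580777 / 16851063 (before) and 23595191035 / 11239428 (after) gives 1417 and 2099 bytes per packet respectively.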
      
      Many thanks to Neal for review and ideas.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove dead code for add/del multiple · 477bb933
      stephen hemminger authored
      These functions to manipulate multiple addresses are not used anywhere
      in the current net-next tree. Some out-of-tree code may be using these, but
      too bad; they should submit their code upstream.

      Also, make __hw_addr_flush local since it is only used by dev_addr_lists.c
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next · 6ea09d8a
      David S. Miller authored
      John W. Linville says:
      
      ====================
      Please pull this batch of updates for the 3.14 stream...
      
      For the Bluetooth bits, Gustavo says:
      
      "This is the first batch of patches intended for 3.14. There is
      nothing big here.  Most of the code are refactors, clean up, small
      fixes, plus some new device id support."
      
      And...
      
      "More patches to 3.14. Here we have the support for Low Energy
      Connection Oriented Channels (LE CoC). Basically, as the name says,
      this adds support for connection oriented channels in the same way
      we already have them for BR/EDR connections, so profiles/protocols
      that work on top of BR/EDR can now work on LE, plus plenty of new
      possibilities for LE."
      
      For the ath10k bits, Kalle says:
      
      "Janusz and Marek implemented DFS support to ath10k, but the code is
      not enabled yet due to missing cfg80211/mac80211 patches (it will be
      enabled in the next pull request). Michal did some device reset fixes
      and made it possible for ath10k to share an interrupt with another
      device. And lots of smaller fixes from different people."
      
      For the iwlwifi bits, Emmanuel says:
      
      "I have here a big rework of the rate control by Eyal. This is obviously
      the biggest part of this batch.
      I also have enhancement of protection flags by Avri and a few bits for
      WoWLAN by Eliad and Luca. Johannes cleans up the debugfs plus a few
      fixes. I provided a few things for Bluetooth coexistence.
      Besides this we have an implementation for low priority scan."
      
      Along with all that, there are big batches of updates to mwifiex and
      ath9k, Jeff Kirsher's FSF address fix patches, and a handful of other
      bits here and there.
      
      Please let me know if there are problems!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'phy_power' · b80b376c
      David S. Miller authored
      Sebastian Hesselbarth says:
      
      ====================
      net: phy: Ethernet PHY powerdown optimization
      
      This is v2 of the ethernet PHY power optimization patches to reduce
      power consumption of network PHYs whose link is unused or whose
      corresponding netdev is down.
      
      Compared to the last version, this patch set drops a patch to disable
      unused PHYs after late initcall, as it is not compatible with a modular
      mdio bus [1]. I'll investigate different ways to have a modular mdio bus
      driver get notified when driver loading is done.
      
      Again, a branch with v2 applied to v3.13-rc2 can also be found at
      https://github.com/shesselba/linux-dove.git topic/ethphy-power-v2
      
      [1] http://www.spinics.net/lists/arm-kernel/msg293028.html
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: suspend phydev when going to HALTED · be9dad1f
      Sebastian Hesselbarth authored
      When phydev is going to HALTED state, we can try to suspend it to
      save more power. The phy_suspend helper will check if the PHY can be
      suspended, so just call it when entering HALTED state.
      Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: resume/suspend PHYs on attach/detach · 1211ce53
      Sebastian Hesselbarth authored
      This ensures PHYs are resumed on attach and suspended on detach.
      Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: provide phy_resume/phy_suspend helpers · 481b5d93
      Sebastian Hesselbarth authored
      This adds helper functions to resume and suspend a given phy_device
      by calling the corresponding driver callbacks if available.
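
The "call the driver callback if available" pattern described above can be sketched in plain C. This is an illustration under stated assumptions, not the kernel implementation: every type and function name ending in `_sketch`, the `demo_*` callbacks, and the choice to return 0 when a driver provides no callback are all hypothetical.

```c
#include <stddef.h>

struct phy_dev_sketch;

/* Hypothetical driver ops table: either callback may be NULL. */
struct phy_drv_sketch {
    int (*suspend)(struct phy_dev_sketch *dev);
    int (*resume)(struct phy_dev_sketch *dev);
};

/* Hypothetical device: points at its driver and tracks power state. */
struct phy_dev_sketch {
    const struct phy_drv_sketch *drv;
    int suspended;
};

/* Call the driver's suspend callback only if one is provided.
 * Returning 0 when there is no callback is an assumption of this sketch. */
static int phy_suspend_sketch(struct phy_dev_sketch *dev)
{
    if (dev->drv && dev->drv->suspend)
        return dev->drv->suspend(dev);
    return 0;
}

/* Same pattern for resume. */
static int phy_resume_sketch(struct phy_dev_sketch *dev)
{
    if (dev->drv && dev->drv->resume)
        return dev->drv->resume(dev);
    return 0;
}

/* Example callbacks standing in for a generic suspend/resume pair. */
static int demo_suspend(struct phy_dev_sketch *dev)
{
    dev->suspended = 1;
    return 0;
}

static int demo_resume(struct phy_dev_sketch *dev)
{
    dev->suspended = 0;
    return 0;
}
```

Callers such as an attach/detach path only need the two helpers; whether a given PHY can actually be powered down stays a per-driver decision.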
      Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: marvell: provide genphy suspend/resume · 0898b448
      Sebastian Hesselbarth authored
      Marvell PHYs support generic PHY suspend/resume, so provide those
      callbacks to all marvell specific drivers.
      Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mv643xx_eth: properly start/stop phy device · 58911151
      Sebastian Hesselbarth authored
      When using phydev, it should be phy_start/phy_stop'ed properly. This
      driver doesn't do that, so add the corresponding calls to port_start/
      stop respectively.
      Signed-off-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Acked-by: Mugunthan V N <mugunthanvnm@ti.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: Reorder 'struct association' members to reduce its size · be78cfcb
      wangweidong authored
      Members of 'struct association' are not in appropriate order to
      reuse compiler added padding on 64bit architectures. In this patch
      we reorder those struct members and help reduce the size of the
      structure from 2776 bytes to 2720 bytes on 64 bit architectures.
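
The effect of member ordering on padding can be seen with a toy example. This is a minimal sketch with hypothetical structs (`struct padded`, `struct reordered`), not the SCTP structure itself; the exact sizes depend on the ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* A small member before and after a 64-bit member: on ABIs that align
 * uint64_t to 8 bytes, the compiler pads after each uint8_t. */
struct padded {
    uint8_t  flag_a;
    uint64_t counter;
    uint8_t  flag_b;
};

/* Same members, largest first: the two small members share one slot,
 * so less padding is needed. */
struct reordered {
    uint64_t counter;
    uint8_t  flag_a;
    uint8_t  flag_b;
};

/* Bytes saved by reordering on the current ABI (may be 0 on some). */
static size_t padding_saved(void)
{
    return sizeof(struct padded) - sizeof(struct reordered);
}
```

On x86_64 the first layout occupies 24 bytes and the second 16. The same idea, applied across a large structure, is what accounts for the 56 bytes saved in this commit.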
      Signed-off-by: Wang Weidong <wangweidong1@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · e4379310
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates
      
      This series contains updates to i40e only (again).
      
      Jesse provides a fix for the case where the tx_rings structure is NULL and
      we do not want to panic, then refactors the flow control setup and disables
      L2 flow control by default. Jesse also provides some trivial fixes as well as
      changes to prevent compiler warnings. Then, to align with similar behaviour in
      ixgbe, the total number of CPUs in the system is used to suggest the number of
      transmit and receive queue pairs.

      Shannon provides an i40e ethtool fix to report more reasonable information
      back out to ethtool. In addition, Shannon fixes PF reset after the offline
      test: the tests are reordered to put the register test last, as it is the
      only one that needs a reset, and the reset is not triggered until after the
      testing bit is cleared. Lastly, Shannon provides basic support for handling
      suspend and resume for now; Wake-On-LAN support will be added later.

      Anjali provides changes to tell the stack about our actual number of queues
      in order for RFS/RPS/XPS to work correctly, then provides several patches to
      implement dynamically changing the queue count for the main VSI. Anjali adds
      basic support for get/set channels for RSS so that the number of receive and
      transmit queue pairs can be changed via ethtool, and cleans up the use of
      rtnl_lock in the reset patch since it runs from a work time.
      
      Neerav Parikh cleans up the VF interface to remove FCoE code as this
      feature will not be supported on VF interfaces.
      
      v2:
        - submitted patch 1 to net (since it was a fix needed for net), so dropped
          from this series (this patch will get added to net-next when Dave syncs
          his trees)
        - Dropped patches 4 & 11 from previous submission because of feedback
          received from Ben Hutchings and Sergei Shtylyov.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'ovs_hash' · bc4d0f61
      David S. Miller authored
      Francesco Fusco says:
      
      ====================
      ovs: introduce arch-specific fast hashing improvements
      
      From: Daniel Borkmann <dborkman@redhat.com>
      
      We are introducing a fast hash function (see patch1) that can be
      used in the context of OpenVSwitch to reduce the hashing footprint
      (patch2). For details, please see individual patches!
      
      v1->v2:
       - Make hash generic and place it under lib
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ovs: use CRC32 accelerated flow hash if available · 500f8087
      Francesco Fusco authored
      Currently OVS uses jhash2() for calculating flow hashes in its
      internal flow_hash() function. The performance of the flow_hash()
      function is critical, as the input data can be hundreds of bytes
      long.
      
      OVS is largely deployed in x86_64 based datacenters.  Therefore,
      we argue that the performance critical fast path of OVS should
      exploit underlying CPU features in order to reduce the per packet
      processing costs. We replace jhash2 with the hash implementation
      provided by the kernel hash lib, which exploits the crc32l
      instruction to achieve high performance.

      Our patch greatly reduces the hash footprint from ~200 cycles for
      jhash2() to ~90 cycles for ovs_flow_hash_crc()
      (measured with rdtsc over maximum length flow keys on an i7 Intel
      CPU).
      
      Additionally, we wrote a microbenchmark to stress the flow table
      performance. The benchmark inserts random flows into the flow
      hash and then performs lookups. Our hash deployed on a CRC32
      capable CPU reduces the lookup for 1000 flows, 100 masks from
      ~10,100us to ~6,700us, for example.
      
      Thus, simply use the newly introduced arch_fast_hash2() as a
      drop-in replacement.
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Thomas Graf <tgraf@redhat.com>
      Acked-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • lib: introduce arch optimized hash library · 71ae8aac
      Francesco Fusco authored
      We introduce a new hashing library that is meant to be used in
      contexts where speed is more important than uniformity of the
      hashed values. The hash library leverages architecture-specific
      implementations to achieve high performance and falls back to
      jhash() for the generic case.
      
      On Intel-based x86 architectures, the library can exploit the crc32l
      instruction, part of the Intel SSE4.2 instruction set, if the
      instruction is supported by the processor. This implementation
      is twice as fast as the jhash() implementation on an i7 processor.
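
For readers unfamiliar with what the SSE4.2 crc32 instruction accumulates, the following is a portable, bit-at-a-time sketch of CRC-32C (Castagnoli polynomial). It is not the library added by this commit and is far slower than the hardware instruction; the function names `crc32c_update` and `crc32c` are hypothetical, and the initial value and final inversion follow the common CRC-32C convention rather than anything stated in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Fold `len` bytes into a running CRC, one bit at a time.
 * The hardware instruction performs the same accumulation on
 * 1, 2, 4 or 8 bytes per step. */
static uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
    }
    return crc;
}

/* Conventional CRC-32C: start from all ones, invert the result. */
static uint32_t crc32c(const void *data, size_t len)
{
    return ~crc32c_update(0xFFFFFFFFu, data, len);
}
```

A hash built on this trades some uniformity for speed, which is the trade-off the commit message describes; a generic fallback such as jhash() remains necessary on CPUs without the instruction.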
      
      Additional architectures, such as Arm64, provide instructions for
      accelerating the computation of CRC, so they could be added as well
      in follow-up work.
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Thomas Graf <tgraf@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fddi: cleanup unsigned to unsigned int/short · 89e47d3b
      tanxiaojun authored
      Use "unsigned int/short" instead of "unsigned", and change the type of
      iteration variable "i" to "unsigned int".
      Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Dec, 2013 17 commits
  3. 15 Dec, 2013 1 commit
  4. 14 Dec, 2013 7 commits
    • ipv6: fix compiler warning in ipv6_exthdrs_len · f52d81dc
      Hannes Frederic Sowa authored
      Commit 299603e8 ("net-gro: Prepare GRO
      stack for the upcoming tunneling support") used an uninitialized variable
      which leads to the following compiler warning:
      
      net/ipv6/ip6_offload.c: In function ‘ipv6_gro_complete’:
      net/ipv6/ip6_offload.c:178:24: warning: ‘optlen’ may be used uninitialized in this function [-Wmaybe-uninitialized]
          opth = (void *)opth + optlen;
                              ^
      net/ipv6/ip6_offload.c:164:22: note: ‘optlen’ was declared here
        int len = 0, proto, optlen;
                            ^
      Fix it up.
      
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bonding_rcu' · df012169
      David S. Miller authored
      Ding Tianhong says:
      
      ====================
      bonding: rebuild the lock use for bond monitor
      
      The bond slave list is no longer protected by the bond lock, only by RTNL,
      but the monitors still take the bond lock to protect the slave list, which
      is useless. According to Veaceslav, there were three ways to fix the
      protection problem:

      1. Add bond_master_upper_dev_link() and bond_upper_dev_unlink()
         under bond->lock, but it is unsafe to call call_netdevice_notifiers()
         while holding a write lock.
      2. Remove the unused bond->lock from the monitor functions and rely only
         on the existing RTNL lock, at the cost of a performance loss in the
         fast path.
      3. Use RCU to protect the slave list. Performance is better in the fast
         path, though in the slow path this does not matter.

      Solution 1 obviously does not fit here, so I considered solutions 2 and 3.
      My principle is simple: in the fast path RCU is better, and in the slow
      path either works. According to Jay Vosburgh, the monitors would lose
      performance if RTNL were used to protect the whole slave list, so remove
      the bond lock and replace it with RCU.
      
      The second problem is the bond's curr_slave_lock. It is old and
      unnecessary in many places, because curr_active_slave is only
      changed in three places:

      1. enslave slave.
      2. release slave.
      3. change active slave.

      All of the above already held the bond lock, RTNL and curr_slave_lock
      together, which is tedious; there is no need for so many locks. To change
      curr_active_slave, you have to hold both RTNL and curr_slave_lock;
      to read curr_active_slave, holding either RTNL or curr_slave_lock
      is sufficient.

      For stability, I did not change the logic of the monitors.
      All changes are clear and simple. I have tested the patch set with lockdep,
      and it works well and is stable.
      
      v2. Accept Jay Vosburgh's opinion: remove the RTNL and replace it with RCU,
          and add some RCU functions for bond use, so the patch set grows to 10 patches.

      v3. Accept Nikolay Aleksandrov's opinion: remove the unneeded bond_has_slave_rcu(),
          add protection for several 3ad mode handler functions and current_arp_slave,
          and rebuild bond_first_slave_rcu() to make it clearer.

      v4. Because struct netdev_adjacent should not exist in netdevice.h, I had
          to add a new function to support the macro bond_first_slave_rcu().
          Also add a new patch to simplify bond_resend_igmp_join_requests_delayed().

      v5. According to Jay Vosburgh, in patches 2 and 6 the notify peer call
          rarely coincides with bond_xxx_commit() while monitoring is running,
          so the performance gain from merging two RTNL round trips into one is
          minimal and not worth doing. Patches 2 and 6 are therefore modified to
          restore the separate notify peer call under RTNL.
      ====================
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: rebuild the bond_resend_igmp_join_requests_delayed() · f2369109
      dingtianhong authored
      bond_resend_igmp_join_requests_delayed() and
      bond_resend_igmp_join_requests() should be merged,
      because bond_resend_igmp_join_requests_delayed() does
      nothing except call bond_resend_igmp_join_requests().

      The bond igmp_retrans can only be changed in bond_change_active_slave()
      and here. bond_change_active_slave() is called under RTNL and curr_slave_lock,
      and bond_resend_igmp_join_requests() already holds RTNL, so there is no need
      to release RTNL and take curr_slave_lock again. As a small optimization,
      move the igmp_retrans update under RTNL and remove the curr_slave_lock.
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: remove unwanted lock for bond_store_primaryxxx() · 75ad932c
      dingtianhong authored
      bond_select_active_slave() will not release and acquire the
      bond lock, so there is no need to take the bond read lock around it,
      and bond_store_primaryxxx() already runs under RTNL, so remove the
      unwanted lock.
      Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
      Suggested-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: remove unwanted lock for bond_option_active_slave_set() · 4e789fc1
      dingtianhong authored
      bond_option_active_slave_set() is always called under RTNL, and
      RTNL protects the bond slave list, so remove the unwanted
      bond lock.
      Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
      Suggested-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: add RCU for bond_3ad_state_machine_handler() · be79bd04
      dingtianhong authored
      bond_3ad_state_machine_handler() uses the bond lock to protect
      both the bond slave list and the slave port, but that is not enough:
      the slave list is linked and unlinked under RTNL, not the bond lock,
      so I add RCU to keep slaves from leaving the list while it is in use.

      The bond lock is still used here because, when a slave has been
      removed from the list by the time the state machine runs, it appears
      to be possible for both functions to manipulate the same aggregator->lag_ports
      by finding the aggregator via two different ports that are both members of
      that aggregator (i.e., port A of the agg is being unbound, and port B
      of the agg is running its state machine).

      If I removed the bond lock, nothing would serialize changes
      to aggregator->lag_ports between bond_3ad_state_machine_handler() and
      bond_3ad_unbind_slave(), so the bond lock is the simplest way to protect
      aggregator->lag_ports.

      Many functions need RCU protection. I had two choices
      to make them RCU-safe: (1) create new, similar functions
      that access the bond slave list under RCU, or (2) modify the existing
      functions and run them in a read-side critical section, since RCU
      read-side critical sections may be nested.

      I chose (2) because it avoids creating more similar functions.

      The notes in the function are outdated, so clean them up.
      Suggested-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
      Suggested-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: remove unwanted lock for bond enslave and release · c8517035
      dingtianhong authored
      bond_change_active_slave() and bond_select_active_slave()
      don't need the bond lock anymore, so remove the unwanted bond lock
      for these two functions.

      bond_select_active_slave() will release and acquire
      curr_slave_lock, so curr_slave_lock must be held around
      the call.

      In bond enslave and bond release, the bond slave list is also
      protected by RTNL, so the bond lock is not needed; remove
      the lock and clean up the functions.
      Suggested-by: Jay Vosburgh <fubar@us.ibm.com>
      Suggested-by: Veaceslav Falico <vfalico@redhat.com>
      Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>