1. 17 Jan, 2014 7 commits
    • packet: use percpu mmap tx frame pending refcount · b0138408
      Daniel Borkmann authored
      In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
      and one atomic_dec() call in skb destructor and use a percpu
      reference count instead in order to determine if packets are
      still pending to be sent out. Micro-benchmark with [1] that has
      been slightly modified (that is, protocol = 0 in socket(2) and
      bind(2)), example on a rather crappy testing machine; I expect
      it to scale and have even better results on bigger machines:
      
      ./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:
      
      With patch:    4,022,015 cyc
      Without patch: 4,812,994 cyc
      
      time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:
      
      With patch:
        real         1m32.241s
        user         0m0.287s
        sys          1m29.316s
      
      Without patch:
        real         1m38.386s
        user         0m0.265s
        sys          1m35.572s
      
      In function tpacket_snd(), it is okay to use packet_read_pending()
      since in fast-path we short-circuit the condition already with
      ph != NULL, since we have next frames to process. In case we have
      MSG_DONTWAIT, we also do not execute this path as need_wait is
      false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
      okay to call a packet_read_pending(), because when we ever reach
      that path, we're done processing outgoing frames anyway and only
      look if there are skbs still outstanding to be orphaned. We can
      stay lockless in this percpu counter since it's acceptable when we
      reach this path for the sum to be imprecise first, but we'll level
      out at 0 after all pending frames have reached the skb destructor
      eventually through tx reclaim. When people pin a tx process to
      particular CPUs, we expect overflows to happen in the reference
      counter as on one CPU we expect heavy increase; and distributed
      through ksoftirqd on all CPUs a decrease, for example. As
      David Laight points out, since the C language doesn't define the
      result of signed int overflow (i.e. rather than wrap, it is
      allowed to saturate as a possible outcome), we have to use
      unsigned int as reference count. The sum over all CPUs when tx
      is complete will result in 0 again.
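
      The wrap-around argument above can be illustrated with a small user-space
      simulation. This is only a sketch, not the kernel code: the struct, the
      helper names and NR_CPUS are hypothetical, and plain array slots stand in
      for real percpu variables.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4 /* hypothetical CPU count for the simulation */

/* One unsigned counter per simulated CPU. Unsigned arithmetic wraps
 * modulo 2^N by definition, so a slot may underflow or overflow safely. */
struct pending_refcnt {
    unsigned int cnt[NR_CPUS];
};

static void pending_inc(struct pending_refcnt *r, size_t cpu)
{
    assert(cpu < NR_CPUS);
    r->cnt[cpu]++;
}

static void pending_dec(struct pending_refcnt *r, size_t cpu)
{
    assert(cpu < NR_CPUS);
    r->cnt[cpu]--;
}

/* Sum over all slots. Individual slots may have wrapped, but the
 * modular sum still equals the number of outstanding frames. */
static unsigned int pending_read(const struct pending_refcnt *r)
{
    unsigned int sum = 0;

    for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
        sum += r->cnt[cpu];
    return sum;
}

/* Sender pinned to CPU 0, completions spread over CPUs 1..NR_CPUS-1.
 * Returns the number of frames still pending afterwards. */
static unsigned int simulate_pinned_tx(unsigned int frames)
{
    struct pending_refcnt r = { { 0 } };

    for (unsigned int i = 0; i < frames; i++)
        pending_inc(&r, 0);
    for (unsigned int i = 0; i < frames; i++)
        pending_dec(&r, 1 + i % (NR_CPUS - 1));
    return pending_read(&r);
}
```

      In the simulation, the slots of the "completion" CPUs hold huge wrapped
      values, yet the sum levels out at 0 once every increment has been matched
      by a decrement.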
      
      The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
      can _only_ be set from inside tpacket_snd() path and we made sure
      to increase tx_ring.pending in any case before we called po->xmit(skb).
      So testing for tx_ring.pending == 0 is not too useful. Instead, it
      would rather have been useful to test if lower layers didn't orphan
      the skb so that we're missing ring slots being put back to
      TP_STATUS_AVAILABLE. But such a bug will be caught in user space
      already as we end up realizing that we do not have any
      TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
      
      Btw, in case of RX_RING path, we do not make use of the pending
      member, therefore we also don't need to use up any percpu memory
      here. Also note that __alloc_percpu() already returns a zero-filled
      percpu area, so initialization is done already.
      
        [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: don't unconditionally schedule() in case of MSG_DONTWAIT · 87a2fd28
      Daniel Borkmann authored
      In tpacket_snd(), when we've discovered a first frame that is
      not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
      we exit the send routine in case of MSG_DONTWAIT, since we've
      finished traversing the mmaped send ring buffer and don't care
      about pending frames.
      
      While doing so, we still unconditionally call an expensive
      schedule() in the packet_current_frame() "error" path, which
      is unnecessary in this case since it's enough to just quit
      the function.
      
      Also, in case MSG_DONTWAIT is not set, we should rather test
      for need_resched() first and do schedule() only if necessary
      since meanwhile pending frames could already have finished
      processing and called skb destructor.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • packet: improve socket create/bind latency in some cases · 902fefb8
      Daniel Borkmann authored
      Most people acquire PF_PACKET sockets with a protocol argument in
      the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
      all its sockets. Most likely, at some point in time a subsequent
      bind() call will follow, e.g. in libpcap with ...
      
        memset(&sll, 0, sizeof(sll));
        sll.sll_family          = AF_PACKET;
        sll.sll_ifindex         = ifindex;
        sll.sll_protocol        = htons(ETH_P_ALL);
      
      ... as arguments. What happens in the kernel is that already
      in socket() syscall, we install a proto hook via register_prot_hook()
      if our protocol argument is != 0. Yet, in bind() we're almost
      doing the same work by doing a unregister_prot_hook() with an
      expensive synchronize_net() call in case during socket() the proto
      was != 0, plus follow-up register_prot_hook() with a bound device
      to it this time, in order to limit traffic we get.
      
      In the case when the protocol and user supplied device index (== 0)
      do not change from socket() to bind(), we can spare ourselves doing
      the same work twice. Similarly for re-binding to the same device
      and protocol. For these scenarios, we can decrease create/bind
      latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
      with this patch.
      
      Alternatively, for the first case, if people care, they should
      simply create their sockets with proto == 0 argument and define
      the protocol during bind() as this saves a call to synchronize_net()
      as well (sock-bind-3 case).
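
      A user-space sketch of the sock-bind-3 pattern, for illustration only. The
      helper names fill_sll() and open_packet_socket_late_bind() are hypothetical;
      opening the socket needs CAP_NET_RAW, so the second helper may fail with
      EPERM when run unprivileged.

```c
#include <arpa/inet.h>       /* htons */
#include <assert.h>
#include <errno.h>
#include <linux/if_ether.h>  /* ETH_P_IP */
#include <linux/if_packet.h> /* struct sockaddr_ll */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a sockaddr_ll for bind(); proto is given in host byte order. */
static void fill_sll(struct sockaddr_ll *sll, int ifindex, unsigned short proto)
{
    assert(sll != NULL);
    memset(sll, 0, sizeof(*sll));
    sll->sll_family   = AF_PACKET;
    sll->sll_ifindex  = ifindex;
    sll->sll_protocol = htons(proto);
}

/* Create the socket with protocol 0 so that no protocol hook is
 * registered in socket(), then choose device and protocol in bind().
 * Returns the fd, or -1 with errno set. */
int open_packet_socket_late_bind(int ifindex, unsigned short proto)
{
    struct sockaddr_ll sll;
    int fd = socket(PF_PACKET, SOCK_RAW, 0);

    if (fd < 0)
        return -1;

    fill_sll(&sll, ifindex, proto);
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        int saved = errno;

        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

      A caller would use it as open_packet_socket_late_bind(ifindex, ETH_P_IP),
      with ifindex obtained from, for example, if_nametoindex().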
      
      In all other cases, we're tied to user space behaviour we must not
      change, also since a bind() is not strictly required. Thus, we need
      the synchronize_net() to make sure no asynchronous packet processing
      paths still refer to the previous elements of po->prot_hook.
      
      In case of mmap()ed sockets, the workflow that includes bind() is
      socket() -> setsockopt(<ring>) -> bind(). In that case, a pair of
      {__unregister, register}_prot_hook is being called from setsockopt()
      in order to install the new protocol receive handler. Thus, when
      we call bind and can skip a re-hook, we have already previously
      installed the new handler. For fanout, this is handled entirely
      differently, so we should be good.
      
      Timings on an i7-3520M machine:
      
        * sock-bind-1:   89 us
        * sock-bind-2: 7447 us
        * sock-bind-3:   75 us
      
      sock-bind-1:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=all(0),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-2:
        socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      
      sock-bind-3:
        socket(PF_PACKET, SOCK_RAW, 0) = 3
        bind(3, {sa_family=AF_PACKET, proto=htons(ETH_P_IP), if=lo(1),
                 pkttype=PACKET_HOST, addr(0)={0, }, 20) = 0
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: Remove autogenerated Module.symvers file. · ec48a787
      David S. Miller authored
      Fixes: 9d8bf547 ("i40e: associate VMDq queue with VM type")
      Reported-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv4: don't use module_init in non-modular gre_offload · cf172283
      Paul Gortmaker authored
      Recent commit 438e38fa
      ("gre_offload: statically build GRE offloading support") added
      new module_init/module_exit calls to the gre_offload.c file.
      
      The file is obj-y and can't be anything other than built-in.
      Currently it can never be built modular, so using module_init
      as an alias for __initcall can be somewhat misleading.
      
      Fix this up now, so that we can relocate module_init from
      init.h into module.h in the future.  If we don't do this, we'd
      have to add module.h to obviously non-modular code, and that
      would be a worse thing.  We also make the inclusion explicit.
      
      Note that direct use of __initcall is discouraged, vs. one
      of the priority categorized subgroups.  As __initcall gets
      mapped onto device_initcall, our use of device_initcall
      directly in this change means that the runtime impact is
      zero -- it will remain at level 6 in initcall ordering.
      
      As for the module_exit, rather than replace it with __exitcall,
      we simply remove it, since it appears only UML does anything
      with those, and even for UML, there is no relevant cleanup
      to be done here.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_core: clean up srq_res_start_move_to() · f088cbb8
      Paul Bolle authored
      Building resource_tracker.o triggers a GCC warning:
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 'mlx4_HW2SW_SRQ_wrapper':
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3202:17: warning: 'srq' may be used uninitialized in this function [-Wmaybe-uninitialized]
            atomic_dec(&srq->mtt->ref_count);
                           ^
      
      This is a false positive. But a cleanup of srq_res_start_move_to() can
      help GCC here. The code currently uses a switch statement where a plain
      if/else would do, since only two of the switch's four cases can ever
      occur. Dropping that switch makes the warning go away.
      
      While we're at it, add some missing braces, and convert state to the
      correct type.
      Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx4_core: clean up cq_res_start_move_to() · c9218a9e
      Paul Bolle authored
      Building resource_tracker.o triggers a GCC warning:
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 'mlx4_HW2SW_CQ_wrapper':
          drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3019:16: warning: 'cq' may be used uninitialized in this function [-Wmaybe-uninitialized]
            atomic_dec(&cq->mtt->ref_count);
                          ^
      
      This is a false positive. But a cleanup of cq_res_start_move_to() can
      help GCC here. The code currently uses a switch statement where an
      if/else construct would do too, since only two of the switch's four
      cases can ever occur. Dropping that switch makes the warning go away.
      
      While we're at it, add some missing braces.
      Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 16 Jan, 2014 33 commits
    • Merge branch 'ixgbe-next' · f7cbdb7d
      David S. Miller authored
      Aaron Brown says:
      
      ====================
      Intel Wired LAN Driver Updates
      
      This series contains updates to ixgbe and ixgbevf.
      
      John adds rtnl lock / unlock semantics for ixgbe_reinit_locked()
      which was being called without the rtnl lock being held.
      
      Jacob corrects an issue where ixgbevf_qv_disable function does not
      set the disabled bit correctly.
      
      From the community, Wei fixes ixgbevf_suspend() to treat the pci
      driver-specific data as a struct net_device, as set in ixgbevf_probe().
      
      Don changes the way we store ring arrays in a manner that allows
      support of multiple queues on multiple nodes and creates new ring
      initialization functions for work previously done across multiple
      functions - making the code closer to ixgbe and hopefully more readable.
      He also fixes incorrect fiber eeprom write logic.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbe: Fix incorrect logic for fixed fiber eeprom write · d3cec927
      Don Skidmore authored
      In this code we wanted to set the bit in IXGBE_SFF_SOFT_RS_SELECT_MASK to
      the value in rs, so we needed an OR rather than an AND. This patch makes
      that change.
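
      The difference between the two operators can be sketched in plain C. This
      is an illustration, not the driver code: RS_SELECT_MASK and its value are
      hypothetical stand-ins for IXGBE_SFF_SOFT_RS_SELECT_MASK, and the helper
      names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask value, for illustration only. */
#define RS_SELECT_MASK 0x08u

/* AND form: after clearing the field, AND-ing with rs also clears
 * every other bit that is not set in rs. */
static uint8_t set_rs_with_and(uint8_t eeprom_data, uint8_t rs)
{
    return (uint8_t)((eeprom_data & ~RS_SELECT_MASK) & rs);
}

/* OR form: clear the field, then OR in the new value, leaving all
 * bits outside the mask untouched. */
static uint8_t set_rs_with_or(uint8_t eeprom_data, uint8_t rs)
{
    assert((rs & ~RS_SELECT_MASK) == 0); /* rs must fit inside the mask */
    return (uint8_t)((eeprom_data & ~RS_SELECT_MASK) | rs);
}
```

      With eeprom_data = 0xF7 and rs = 0x08, the OR form yields 0xFF (only the
      masked bit changes) while the AND form yields 0x00.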
      Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbevf: create function for all of ring init · de02decb
      Don Skidmore authored
      This patch creates new functions for ring initialization,
      ixgbevf_configure_tx_ring() and ixgbevf_configure_rx_ring(). The work done
      in these functions previously was spread between several other functions and
      this change should hopefully lead to greater readability and make the code
      more like ixgbe.  This patch also moves the placement of some older functions
      to avoid having to write prototypes.  It also promotes a couple of debug
      messages to errors.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbevf: Convert ring storage from pointer to an array to array of pointers · 87e70ab9
      Don Skidmore authored
      This changes how we store ring arrays in the adapter struct.
      We used to have a pointer to an array; now we use an array
      of pointers. This will allow us to support multiple queues on
      multiple nodes, and at some point we would be able to reallocate the rings
      so that each is on a local node if needed.
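
      A user-space sketch of the array-of-pointers layout, under stated
      assumptions: struct ring, struct adapter, MAX_RINGS and the helpers are
      hypothetical simplifications, and calloc()/malloc() stand in for whatever
      node-aware allocator a driver would use.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_RINGS 4 /* hypothetical upper bound */

struct ring {
    unsigned int queue_index;
    unsigned int count;
};

struct adapter {
    /* Array of pointers: every ring is its own allocation, so a single
     * ring can be moved or reallocated without touching the others. */
    struct ring *rings[MAX_RINGS];
    int num_rings;
};

static void free_rings(struct adapter *a)
{
    for (int i = 0; i < a->num_rings; i++) {
        free(a->rings[i]);
        a->rings[i] = NULL;
    }
    a->num_rings = 0;
}

static int alloc_rings(struct adapter *a, int n)
{
    assert(n >= 0 && n <= MAX_RINGS);
    a->num_rings = 0;
    for (int i = 0; i < n; i++) {
        /* A driver could pick a per-node allocation here. */
        struct ring *r = calloc(1, sizeof(*r));

        if (!r) {
            free_rings(a);
            return -1;
        }
        r->queue_index = (unsigned int)i;
        a->rings[i] = r;
        a->num_rings++;
    }
    return 0;
}

/* Reallocate one ring in isolation, e.g. to move it elsewhere. */
static int replace_ring(struct adapter *a, int i)
{
    struct ring *fresh;

    assert(i >= 0 && i < a->num_rings);
    fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return -1;
    *fresh = *a->rings[i];
    free(a->rings[i]);
    a->rings[i] = fresh;
    return 0;
}
```

      With a single contiguous array behind one pointer, replacing one ring would
      mean reallocating all of them; here only one slot changes.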
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbevf: use pci drvdata correctly in ixgbevf_suspend() · 27ae2967
      Wei Yongjun authored
      We had set the pci driver-specific data in ixgbevf_probe() as a type of
      struct net_device, so we should use it as netdev in ixgbevf_suspend().
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbevf: set the disable state when ixgbevf_qv_disable is called · e689e728
      Jacob Keller authored
      The ixgbevf_qv_disable function used by CONFIG_NET_RX_BUSY_POLL is broken,
      because it does not properly set the IXGBEVF_QV_STATE_DISABLED bit, indicating
      that the q_vector should be disabled (and preventing future locks from
      obtaining the vector). This patch corrects the issue by setting the disable
      state.
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ixgbe: reinit_locked() should be called with rtnl_lock · 8f4c5c9f
      John Fastabend authored
      ixgbe_service_task() is calling ixgbe_reinit_locked() without
      the rtnl_lock being held. This is because it is being called
      from a worker thread and not a rtnl netlink or dcbnl path.
      
      Add rtnl_{un}lock() semantics. I found this during code review.
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
      Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: eth_type_trans() should use skb_header_pointer() · 0864c158
      Eric Dumazet authored
      eth_type_trans() can read uninitialized memory as drivers
      do not necessarily pull more than 14 bytes in skb->head before
      calling it.
      
      As David suggested, we can use skb_header_pointer() to
      fix this without breaking some drivers that might not expect
      eth_type_trans() pulling 2 additional bytes.
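
      The idea can be sketched in user space with a toy packet that has a short
      linear head and a separate fragment. Everything here is hypothetical
      (struct toy_pkt, toy_header_pointer()); it only mimics the contract of
      skb_header_pointer(): return a pointer into the head when the bytes are
      there, otherwise copy them into a caller-supplied buffer, without pulling
      any data.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an skb: `head` holds the first head_len bytes,
 * `frag` holds the remaining frag_len bytes. */
struct toy_pkt {
    const unsigned char *head;
    size_t head_len;
    const unsigned char *frag;
    size_t frag_len;
};

/* Return a pointer to `len` bytes at `offset`. If they all sit in the
 * head, point straight into it; otherwise gather them into `buf`.
 * Returns NULL if the packet is too short. The packet is not modified. */
static const void *toy_header_pointer(const struct toy_pkt *p, size_t offset,
                                      size_t len, void *buf)
{
    size_t total;
    unsigned char *out = buf;

    assert(p != NULL && buf != NULL);
    total = p->head_len + p->frag_len;
    if (offset > total || len > total - offset)
        return NULL;
    if (offset + len <= p->head_len)
        return p->head + offset;

    for (size_t i = 0; i < len; i++) {
        size_t pos = offset + i;

        out[i] = pos < p->head_len ? p->head[pos]
                                   : p->frag[pos - p->head_len];
    }
    return buf;
}
```

      Reading the two bytes that follow a 14-byte Ethernet header then works
      whether or not the driver placed them in the head, and never touches
      memory beyond the packet.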
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'stmmac_pm' · 7967919d
      David S. Miller authored
      Srinivas Kandagatla says:
      
      ====================
      net: stmmac PM related fixes.
      
      During PM_SUSPEND_FREEZE testing, I noticed that PM support in STMMAC is
      partly broken. I had to re-arrange the code to do PM correctly. There were a lot
      of things I did not like personally, and some bits did not work in the first
      place. I thought this was a good opportunity to clean up the mess.
      
      Here is what I did:
      
      1> Test PM suspend freeze via pm_test
      It did not work for the following reason.
       - If the power to the gmac is removed when it enters a low power state,
      stmmac_resume could not cope with such behaviour: it was expecting the ip
      register contents to still be the same as before entering low power. This
      assumption is wrong. So I started to add some code to do hardware
      initialization, and that's when I started to re-arrange the code. stmmac_open
      contains both resource and memory allocations and hardware initialization. I
      had to separate these two things into two different functions.
      
      These two patches do that
        net: stmmac: move dma allocation to new function
        net: stmmac: move hardware setup for stmmac_open to new function
      
      The rest of the patches fix the loose ends, things like mdio
      reset, which might be necessary in cases like hibernation (I did not test).
      
      In hibernation cases the driver was just unregistering with subsystems and
      releasing resources, which I did not like and which is not necessary as
      part of PM. So using the same stmmac_suspend/resume made more sense for
      hibernation cases than using stmmac_open/release.
      Also fixed a NULL pointer dereference bug.
      
      2> Test WOL via PM_SUSPEND_FREEZE
      Did get a wakeup interrupt, but could not wake up a frozen system.
      So I had to add pm_wakeup_event to the driver, in the patch
        net: stmmac: notify the PM core of a wakeup event.
      
      Also a few patches like
        net: stmmac: make stmmac_mdio_reset non-static
        net: stmmac: restore pinstate in pm resume.
      help the resume function to reset the phy and put the pins back in the
      default state.
      
      Changes since RFC:
      	- Rebased to net-next on Dave's suggestion.
      
      All these patches are Acked by Peppe.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: notify the PM core of a wakeup event. · 89f7f2cf
      Srinivas Kandagatla authored
      In the PM_SUSPEND_FREEZE and WOL (Wake-on-LAN) case, when the driver gets a
      wakeup event, either the driver or platform specific PM code should notify
      the PM core about it, so that the system can wake up from low power.
      
      In cases where there is no involvement of platform specific PM, it
      becomes the driver's responsibility to notify the PM core to wake up the
      system.
      
      Without this, WOL with PM_SUSPEND_FREEZE does not work on STi based SOCs.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: restore pinstate in pm resume. · db88f10a
      Srinivas Kandagatla authored
      This patch adds code to restore default pinstate of the pins when it
      comes back from low power state. Without this patch the state of the
      pins would be unknown and the driver would not work.
      
      This patch also adds code to put the pins into sleep state when the
      driver enters low power state.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: use suspend functions for hibernation · 33a23e22
      Srinivas Kandagatla authored
      In the hibernation freeze case the driver just releases resources such as
      dma buffers and irqs and unregisters the drivers, and during restore it
      registers again and re-requests the resources. This is not really
      necessary: as part of power management all the data structures are
      intact, and all the previously allocated resources can be used after
      coming out of low power.
      
      This patch uses the suspend and resume callbacks for freeze and
      restore which initializes the hardware correctly without unregistering
      or releasing the resources, this should also help in reducing the time
      to restore.
      
      Also this patch fixes a bug in stmmac_pltfr_restore and
      stmmac_pltfr_freeze where it tries to get hold of platform data via
      dev_get_platdata call, which would return NULL in device tree cases and
      the next if statement would crash as there is no NULL check.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: fix power management suspend-resume case · 623997fb
      Srinivas Kandagatla authored
      The driver PM resume assumes that the IP is still powered up and that
      all the register contents are undisturbed when it comes out of low
      power suspend. This assumption is wrong: the driver should not assume
      any register state after it comes out of low power. The driver can keep
      part of the IP powered up if it is a wake-up source, but it still cannot
      assume the register state of the IP. Also, it is possible that the SOC
      glue layer takes the power off the IP if it is not a wake-up source, to
      reduce power consumption.
      
      This patch re-initializes the hardware by calling the stmmac_hw_setup
      function in the resume case.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: make stmmac_mdio_reset non-static · 073752aa
      Srinivas Kandagatla authored
      This patch promotes the stmmac_mdio_reset function from static to
      non-static, so that power management functions can decide to reset when
      the IP comes out of a low power state, especially in hibernation cases.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: move hardware setup for stmmac_open to new function · 523f11b5
      Srinivas Kandagatla authored
      This patch moves the hardware setup part of stmmac_open to a new
      function, stmmac_hw_setup. The reason for doing this is to make hw
      initialization an independent function so that PM functions can re-use it to
      re-initialize the IP after returning from a low power state.
      This will also avoid code duplication across stmmac_resume/restore and
      stmmac_open.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: move dma allocation to new function · 09f8d696
      Srinivas Kandagatla authored
      This patch moves dma resource allocation to a new function,
      alloc_dma_desc_resources, to keep the memory allocations in a separate
      function. One more reason is to get suspend and hibernation cases
      working without releasing and allocating these resources during
      suspend-resume and freeze-restore cases.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: mdio: remove reset gpio free · 984203ce
      Srinivas Kandagatla authored
      This patch removes gpio_free for the reset line of the phy. The driver
      stores the gpio number in its private data structure for later use, so
      the pin should not be freed.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: stmmac: support max-speed device tree property · 9cbadf09
      Srinivas Kandagatla authored
      This patch adds support for the "max-speed" property, which is a standard
      Ethernet device tree property. max-speed specifies the maximum speed
      (in megabits per second) supported by the device.
      
      Depending on the clocking schemes some of the boards can only support
      few link speeds, so having a way to limit the link speed in the mac
      driver would allow such setups to work reliably.
      
      Without this patch there is no way to tell the driver to limit the
      link speed.
      Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com>
      Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mvneta' · 82a342d1
      David S. Miller authored
      Willy Tarreau says:
      
      ====================
      Assorted mvneta fixes and improvements
      
      this series provides some fixes for a number of issues met with the
      mvneta driver, then adds some improvements. Patches 1-5 are fixes
      and would be needed in 3.13 and likely -stable. The next ones are
      performance improvements and cleanups :
      
        - driver lockup when reading stats while sending traffic from multiple
          CPUs : this obviously only happens on SMP and is the result of missing
          locking on the driver. The problem was present since the introduction
          of the driver in 3.8. The first patch performs some changes that are
          needed for the second one which actually fixes the issue by using
          per-cpu counters. It could make sense to backport this to the relevant
          stable versions.
      
        - mvneta_tx_timeout calls various functions to reset the NIC, and these
          functions sleep, which is not allowed here, resulting in a panic.
          Better completely disable this Tx timeout handler for now since it is
          never called. The problem was encountered while developing some new
          features, it's uncertain whether it's possible to reproduce it with
          regular usage, so maybe a backport to stable is not needed.
      
        - replace the Tx timer with a real Tx IRQ. As first reported by Arnaud
          Ebalard and explained by Eric Dumazet, there is no way this driver
          can work correctly if it uses a timer to recycle the Tx descriptors.
          If too many packets are sent at once, the driver quickly ends up with
          no descriptors (which happens twice as easily in GSO) and has to wait
          10ms for recycling its descriptors and being able to send again. Eric
          has worked around this in the core GSO code. But still when routing
          traffic or sending UDP packets, the limitation is very visible. Using
          Tx IRQs allows Tx descriptors to be recycled when sent. The coalesce
          value is still configurable using ethtool. This fix turns the UDP
          send bitrate from 134 Mbps to 987 Mbps (ie: line rate). It's made of
          two patches, one to add the relevant bits from the original Marvell's
          driver, and another one to implement the change. I don't know if it
          should be backported to stable, as the bug only causes poor performance.
      
        - Patches 6..8 are essentially cleanups, code deduplication and minor
          optimizations for not re-fetching a value we already have (status).
      
        - patch 9 changes the prefetch of Rx descriptor from current one to
          next one. In benchmarks, it results in about 1% general performance
          increase on HTTP traffic, probably because prefetching the current
          descriptor does not leave enough time between the start of prefetch
          and its usage.
      
        - patch 10 implements support for build_skb() on Rx path. The driver
          now preallocates frags instead of skbs and builds an skb just before
          delivering it. This results in a 2% performance increase on HTTP
          traffic, and up to 5% on small packet Rx rate.
      
        - patch 11 implements rx_copybreak for small packets (256 bytes). It
          avoids a dma_map_single()/dma_unmap_single() and increases the Rx
          rate by 16.4%, from 486kpps to 573kpps. Further improvements up to
          711kpps are possible depending on how the DMA is used.
      
        - patches 12 and 13 are extra cleanups made possible by some of the
          simplifications above.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      82a342d1
    • Arnaud Ebalard's avatar
      net: mvneta: make mvneta_txq_done() return void · cd713199
      Arnaud Ebalard authored
      The function's return value is not used in mvneta_tx_done_gbe(),
      where the function is called. This patch makes the function return
      void.
      Reviewed-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd713199
    • Arnaud Ebalard's avatar
      net: mvneta: mvneta_tx_done_gbe() cleanups · 0713a86a
      Arnaud Ebalard authored
      mvneta_tx_done_gbe()'s return value and third parameter are no longer
      used. This patch changes the function prototype and removes a useless
      variable where the function is called.
      Reviewed-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0713a86a
    • willy tarreau's avatar
      net: mvneta: implement rx_copybreak · f19fadfc
      willy tarreau authored
      Calling dma_map_single()/dma_unmap_single() is quite expensive compared
      to copying a small packet. So let's copy short frames and keep the buffers
      mapped. We set the limit to 256 bytes which seems to give good results both
      on the XP-GP board and on the AX3/4.
      
      The Rx small packet rate increased by 16.4% doing this, from 486kpps to
      573kpps. It is worth noting that even the call to the function
      dma_sync_single_range_for_cpu() is expensive (300 ns) although less
      than dma_unmap_single(). Without it, the packet rate rises to 711kpps
      (+24%). Thus on systems where coherency from device to CPU is
      guaranteed by a snoop control unit, this patch should provide even more
      gains, and probably rx_copybreak could be increased.
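
      The copy-versus-hand-over decision described above can be sketched in
      userspace C. This is only an illustration, not the driver's code: the
      names rx_slot, rx_deliver and RX_COPYBREAK are hypothetical, malloc()
      stands in for buffer allocation, and the "remaps" counter stands in for
      the dma_unmap_single()/refill cycle that short frames avoid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Threshold taken from the commit message; everything else is illustrative. */
#define RX_COPYBREAK 256

struct rx_slot {
    unsigned char *buf;   /* stands in for a DMA-mapped receive buffer */
    size_t buf_size;
    unsigned remaps;      /* counts simulated unmap + refill cycles */
};

/*
 * Return a caller-owned buffer holding the first len bytes of the frame.
 * Short frames are copied and the slot keeps its (still mapped) buffer.
 * Long frames hand the buffer over and the slot is refilled with a new one.
 */
static unsigned char *rx_deliver(struct rx_slot *slot, size_t len)
{
    unsigned char *out;
    unsigned char *fresh;

    assert(slot != NULL && len <= slot->buf_size);

    if (len <= RX_COPYBREAK) {
        out = malloc(len ? len : 1);
        if (out)
            memcpy(out, slot->buf, len);
        return out;
    }

    /* Allocate the replacement first so the slot is never left empty. */
    fresh = malloc(slot->buf_size);
    if (!fresh)
        return NULL;
    out = slot->buf;
    slot->buf = fresh;
    slot->remaps++;
    return out;
}
```

      With this sketch, a 64-byte frame leaves slot->buf untouched, while a
      1500-byte frame replaces it and increments the remap counter.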
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f19fadfc
    • willy tarreau's avatar
      net: mvneta: convert to build_skb() · 8ec2cd48
      willy tarreau authored
      Make use of build_skb() to allocate frags on the RX path. When frag size
      is lower than a page size, we can use netdev_alloc_frag(), and we fall back
      to kmalloc() for larger sizes. The frag size is stored into the mvneta_port
      struct. The alloc/free functions check the frag size to decide what alloc/
      free method to use. MTU changes are safe because the MTU change function
      stops the device and clears the queues before applying the change.
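
      A rough userspace model of the size-based alloc/free pairing follows.
      All names (sk_port, sk_frag_alloc, SK_PAGE_SIZE, ...) are hypothetical,
      the page size and the "<=" boundary are assumptions, and both paths use
      malloc() here where the driver would use netdev_alloc_frag() or
      kmalloc(). The point is only that alloc and free derive the method from
      the same stored frag size, so the size must not change while buffers
      are outstanding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_PAGE_SIZE 4096u   /* assumed page size for this sketch */

enum sk_frag_method { SK_FRAG_PAGE, SK_FRAG_HEAP };

struct sk_port {
    size_t frag_size;        /* stored once, consulted by alloc and free */
    unsigned outstanding;    /* buffers currently owned by the Rx queues */
    unsigned page_allocs, heap_allocs;
    unsigned page_frees, heap_frees;
};

static enum sk_frag_method sk_pick_method(const struct sk_port *pp)
{
    return pp->frag_size <= SK_PAGE_SIZE ? SK_FRAG_PAGE : SK_FRAG_HEAP;
}

static void *sk_frag_alloc(struct sk_port *pp)
{
    void *p = malloc(pp->frag_size);   /* stand-in for both allocators */

    if (!p)
        return NULL;
    if (sk_pick_method(pp) == SK_FRAG_PAGE)
        pp->page_allocs++;
    else
        pp->heap_allocs++;
    pp->outstanding++;
    return p;
}

static void sk_frag_free(struct sk_port *pp, void *p)
{
    assert(pp->outstanding > 0);
    /* The free side re-derives the method from the same frag_size. */
    if (sk_pick_method(pp) == SK_FRAG_PAGE)
        pp->page_frees++;
    else
        pp->heap_frees++;
    free(p);
    pp->outstanding--;
}

/* Models the MTU change: only legal once the queues have been cleared. */
static void sk_port_set_frag_size(struct sk_port *pp, size_t new_size)
{
    assert(pp->outstanding == 0);
    pp->frag_size = new_size;
}
```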
      
      With this patch, I observed a reproducible 2% performance improvement on
      HTTP-based benchmarks, and 5% on small packet RX rate.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8ec2cd48
    • willy tarreau's avatar
      net: mvneta: prefetch next rx descriptor instead of current one · 34e4179d
      willy tarreau authored
      Currently, the mvneta driver tries to prefetch the current Rx
      descriptor during read. Tests have shown that prefetching the
      next one instead increases general performance by about 1% on
      HTTP traffic.
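
      The idea can be illustrated with a small ring walk in userspace C. This
      is a sketch under assumptions, not the driver's Rx loop: the descriptor
      layout and the names sk_rx_desc and sk_rx_ring_consume are hypothetical,
      and __builtin_prefetch() is the GCC/Clang builtin standing in for the
      kernel's prefetch(). The prefetch is issued for the next descriptor
      before the current one is processed, which leaves time for the fetch to
      complete.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_rx_desc {
    uint32_t status;
    uint32_t data_size;
};

/* Walk count descriptors starting at start, wrapping around the ring. */
static uint64_t sk_rx_ring_consume(const struct sk_rx_desc *ring,
                                   size_t ring_size, size_t start,
                                   size_t count)
{
    uint64_t bytes = 0;
    size_t i = start;

    assert(ring != NULL && ring_size > 0 && start < ring_size);

    while (count--) {
        const struct sk_rx_desc *cur = &ring[i];

        /* Advance first, then prefetch the descriptor used next time. */
        i = (i + 1 == ring_size) ? 0 : i + 1;
        __builtin_prefetch(&ring[i]);

        bytes += cur->data_size;   /* "processing" of the current one */
    }
    return bytes;
}
```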
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34e4179d
    • willy tarreau's avatar
      net: mvneta: simplify access to the rx descriptor status · 5428213c
      willy tarreau authored
      At several places, we already know the value of the rx status but
      we call functions which dereference the pointer again to get it
      and don't need the descriptor for anything else. Simplify this
      task by replacing the rx desc pointer by the status word itself.
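
      A before/after sketch of that kind of change, in plain C. The bit
      positions, the struct layout and the function names are hypothetical
      and only chosen for illustration; the real descriptor format is not
      reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status bits, for illustration only. */
#define SK_RXD_ERR_SUMMARY (1u << 16)
#define SK_RXD_L4_CSUM_OK  (1u << 30)

struct sk_rxd {
    uint32_t status;
    uint32_t data_size;
};

/* Before: the helper dereferences the descriptor again to read the status. */
static bool sk_rx_csum_ok_desc(const struct sk_rxd *desc)
{
    assert(desc != NULL);
    return !(desc->status & SK_RXD_ERR_SUMMARY) &&
           (desc->status & SK_RXD_L4_CSUM_OK);
}

/* After: the caller already holds the status word and passes it by value. */
static bool sk_rx_csum_ok_status(uint32_t status)
{
    return !(status & SK_RXD_ERR_SUMMARY) && (status & SK_RXD_L4_CSUM_OK);
}
```

      Both helpers return the same result; the second one simply avoids a
      memory access when the caller has the status in a register already.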
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5428213c
    • willy tarreau's avatar
      net: mvneta: factor rx refilling code · a1a65ab1
      willy tarreau authored
      Make mvneta_rxq_fill() use mvneta_rx_refill() instead of using
      duplicate code.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1a65ab1
    • willy tarreau's avatar
      net: mvneta: remove tests for impossible cases in the tx_done path · 6c498974
      willy tarreau authored
      Currently, mvneta_txq_bufs_free() calls mvneta_tx_done_policy() with
      a non-null cause to retrieve the pointer to the next queue to process.
      There are useless tests on the return queue number and on the pointer,
      all of which are well defined within a known limited set. This code
      path is fast, although not critical. Removing 3 tests here that the
      compiler could not optimize (verified) is always desirable.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c498974
    • willy tarreau's avatar
      net: mvneta: replace Tx timer with a real interrupt · 71f6d1b3
      willy tarreau authored
      Right now the mvneta driver doesn't handle Tx IRQ, and relies on two
      mechanisms to flush Tx descriptors: a flush at the end of mvneta_tx()
      and a timer. If a burst of packets is emitted faster than the device
      can send them, then the queue is stopped until next wake-up of the
      timer 10ms later. This causes jerky output traffic with bursts and
      pauses, making it difficult to reach line rate with very few streams.
      
      A test on UDP traffic shows that it's not possible to go beyond 134
      Mbps / 12 kpps of outgoing traffic with 1500-byte IP packets. Routed
      traffic tends to observe pauses as well if the traffic is bursty,
      making it even burstier after the wake-up.
      
      It seems that this feature was inherited from the original driver but
      nothing there mentions any reason for not using the interrupt instead,
      which the chip supports.
      
      Thus, this patch enables Tx interrupts and removes the timer. It does
      the two at once because it's not really possible to make the two
      mechanisms coexist, so a split patch doesn't make sense.
      
      First tests performed on a Mirabox (Armada 370) show that less CPU
      seems to be used when sending traffic. One reason might be that we now
      call the mvneta_tx_done_gbe() with a mask indicating which queues have
      been done instead of looping over all of them.
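
      The difference between looping over every queue and following a cause
      mask can be sketched as follows. This is a userspace illustration with
      hypothetical names (sk_txq, sk_tx_done_masked, SK_TXQ_NUM), not the
      driver's mvneta_tx_done_gbe(); the "visits" counter exists only to show
      which queues were touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_TXQ_NUM 8   /* assumed number of Tx queues */

struct sk_txq {
    unsigned pending;  /* descriptors waiting to be released */
    unsigned visits;   /* how many times the queue was examined */
};

/*
 * Release descriptors only on the queues whose bit is set in cause.
 * Returns the total number of descriptors released.
 */
static unsigned sk_tx_done_masked(struct sk_txq *queues, uint32_t cause)
{
    unsigned released = 0;

    assert(queues != NULL && (cause >> SK_TXQ_NUM) == 0);

    while (cause) {
        unsigned idx = 0;

        /* Find the lowest set bit; cause is non-zero so this terminates. */
        while (!(cause & (1u << idx)))
            idx++;

        queues[idx].visits++;
        released += queues[idx].pending;
        queues[idx].pending = 0;

        cause &= ~(1u << idx);
    }
    return released;
}
```

      Queues whose bit is clear are never examined, so the cost follows the
      number of queues that actually completed work.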
      
      The same UDP test above now happily reaches 987 Mbps / 87.7 kpps.
      Single-stream TCP traffic can now more easily reach line rate. HTTP
      transfers of 1 MB objects over a single connection went from 730 to
      840 Mbps. It is even possible to go significantly higher (>900 Mbps)
      by tweaking tcp_tso_win_divisor.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Cc: Arnaud Ebalard <arno@natisbad.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      71f6d1b3
    • willy tarreau's avatar
      net: mvneta: add missing bit descriptions for interrupt masks and causes · 40ba35e7
      willy tarreau authored
      Marvell has not published the chip's datasheet yet, so it's very hard
      to find the relevant bits to manipulate to change the IRQ behaviour.
      Fortunately, these bits are described in the proprietary LSP patch set
      which is publicly available here:
      
          http://www.plugcomputer.org/downloads/mirabox/
      
      So let's put them back in the driver in order to reduce the burden of
      current and future maintenance.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40ba35e7
    • willy tarreau's avatar
      net: mvneta: do not schedule in mvneta_tx_timeout · 29021366
      willy tarreau authored
      If a queue timeout is reported, we can oops because of some
      schedules while the caller is atomic, as shown below:
      
        mvneta d0070000.ethernet eth0: tx timeout
        BUG: scheduling while atomic: bash/1528/0x00000100
        Modules linked in: slhttp_ethdiv(C) [last unloaded: slhttp_ethdiv]
        CPU: 2 PID: 1528 Comm: bash Tainted: G        WC   3.13.0-rc4-mvebu-nf #180
        [<c0011bd9>] (unwind_backtrace+0x1/0x98) from [<c000f1ab>] (show_stack+0xb/0xc)
        [<c000f1ab>] (show_stack+0xb/0xc) from [<c02ad323>] (dump_stack+0x4f/0x64)
        [<c02ad323>] (dump_stack+0x4f/0x64) from [<c02abe67>] (__schedule_bug+0x37/0x4c)
        [<c02abe67>] (__schedule_bug+0x37/0x4c) from [<c02ae261>] (__schedule+0x325/0x3ec)
        [<c02ae261>] (__schedule+0x325/0x3ec) from [<c02adb97>] (schedule_timeout+0xb7/0x118)
        [<c02adb97>] (schedule_timeout+0xb7/0x118) from [<c0020a67>] (msleep+0xf/0x14)
        [<c0020a67>] (msleep+0xf/0x14) from [<c01dcbe5>] (mvneta_stop_dev+0x21/0x194)
        [<c01dcbe5>] (mvneta_stop_dev+0x21/0x194) from [<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24)
        [<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24) from [<c024afc7>] (dev_watchdog+0x18b/0x1c4)
        [<c024afc7>] (dev_watchdog+0x18b/0x1c4) from [<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c)
        [<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c) from [<c0020cad>] (run_timer_softirq+0x115/0x170)
        [<c0020cad>] (run_timer_softirq+0x115/0x170) from [<c001ccb9>] (__do_softirq+0xbd/0x1a8)
        [<c001ccb9>] (__do_softirq+0xbd/0x1a8) from [<c001cfad>] (irq_exit+0x61/0x98)
        [<c001cfad>] (irq_exit+0x61/0x98) from [<c000d4bf>] (handle_IRQ+0x27/0x60)
        [<c000d4bf>] (handle_IRQ+0x27/0x60) from [<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8)
        [<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8) from [<c000fba9>] (__irq_usr+0x49/0x60)
      
      Ben Hutchings proposed a better fix that uses scheduled work for this,
      but while it fixed this panic, it caused other random freezes and
      panics, proving that the reset sequence in the driver is unreliable
      and that additional fixes should be investigated.
      
      When sending multiple streams over a link limited to 100 Mbps, Tx timeouts
      happen from time to time, and the driver correctly recovers only when the
      function is disabled.
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29021366
    • willy tarreau's avatar
      net: mvneta: use per_cpu stats to fix an SMP lock up · 74c41b04
      willy tarreau authored
      Stats writers are mvneta_rx() and mvneta_tx(). They don't lock anything
      when they update the stats, and as a result, it randomly happens that
      the stats freeze on SMP if two updates happen during stats retrieval.
      This is very easily reproducible by starting two HTTP servers and binding
      each of them to a different CPU, then consulting /proc/net/dev in loops
      during transfers; the interface should immediately lock up. This issue
      also randomly happens upon link state changes during transfers, because
      the stats are collected in this situation, but it takes more attempts to
      reproduce it.
      
      The comments in netdevice.h suggest using per_cpu stats instead to get
      rid of this issue.
      
      This patch implements this. It merges both rx_stats and tx_stats into
      a single "stats" member with a single syncp. Both mvneta_rx() and
      mvneta_tx() now only update a single CPU's counters.
      
      In turn, mvneta_get_stats64() does the summing by iterating over all CPUs
      to get their respective stats.
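
      The shape of that change can be modelled in userspace C as below. The
      names (sk_pcpu_stats, sk_stats_rx_add, sk_stats_sum, SK_NR_CPUS) are
      hypothetical and the CPU number is passed explicitly. The kernel
      additionally protects each per-CPU block with a u64_stats_sync (the
      "syncp" mentioned above) so readers see consistent 64-bit values; that
      part is deliberately left out of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NR_CPUS 4   /* assumed CPU count for this sketch */

struct sk_pcpu_stats {
    uint64_t rx_packets;
    uint64_t rx_bytes;
    uint64_t tx_packets;
    uint64_t tx_bytes;
};

/* Writers only ever touch the slot of the CPU they run on. */
static void sk_stats_rx_add(struct sk_pcpu_stats *stats, unsigned cpu,
                            uint32_t pkts, uint32_t bytes)
{
    assert(stats != NULL && cpu < SK_NR_CPUS);
    stats[cpu].rx_packets += pkts;
    stats[cpu].rx_bytes += bytes;
}

static void sk_stats_tx_add(struct sk_pcpu_stats *stats, unsigned cpu,
                            uint32_t pkts, uint32_t bytes)
{
    assert(stats != NULL && cpu < SK_NR_CPUS);
    stats[cpu].tx_packets += pkts;
    stats[cpu].tx_bytes += bytes;
}

/* The reader sums all per-CPU slots, as a get_stats64-style hook would. */
static struct sk_pcpu_stats sk_stats_sum(const struct sk_pcpu_stats *stats)
{
    struct sk_pcpu_stats total = { 0, 0, 0, 0 };
    unsigned cpu;

    assert(stats != NULL);
    for (cpu = 0; cpu < SK_NR_CPUS; cpu++) {
        total.rx_packets += stats[cpu].rx_packets;
        total.rx_bytes += stats[cpu].rx_bytes;
        total.tx_packets += stats[cpu].tx_packets;
        total.tx_bytes += stats[cpu].tx_bytes;
    }
    return total;
}
```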
      
      With this change, stats are still correct and no lockup is encountered anymore.
      
      Note that this bug was present since the first import of the mvneta
      driver.  It might make sense to backport it to some stable trees. If
      so, it depends on "d33dc73 net: mvneta: increase the 64-bit rx/tx stats
      out of the hot path".
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74c41b04
    • willy tarreau's avatar
      net: mvneta: increase the 64-bit rx/tx stats out of the hot path · dc4277dd
      willy tarreau authored
      It is better to count packets and bytes on the stack in 32-bit variables,
      then accumulate them once at the end. This saves two memory writes
      and two memory barriers per packet. The incoming packet rate was
      increased by 4.7% on the Openblocks AX3 thanks to this.
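
      A minimal userspace sketch of the pattern, assuming a poll loop that
      handles a bounded batch of frames. The names sk_stats64 and
      sk_rx_poll are hypothetical, and the "updates" field exists only to
      show that the shared 64-bit counters are written once per batch rather
      than once per packet. The 32-bit locals are assumed not to overflow
      because the batch is small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_stats64 {
    uint64_t packets;
    uint64_t bytes;
    unsigned updates;   /* number of writes to the shared counters */
};

/* Process n frames whose lengths are given in lens[]. */
static void sk_rx_poll(struct sk_stats64 *st, const uint16_t *lens, size_t n)
{
    uint32_t rcvd_pkts = 0;    /* cheap local counters, kept on the stack */
    uint32_t rcvd_bytes = 0;
    size_t i;

    assert(st != NULL && (lens != NULL || n == 0));

    for (i = 0; i < n; i++) {
        rcvd_pkts++;
        rcvd_bytes += lens[i];
    }

    /* Single update of the 64-bit stats, outside the per-packet loop. */
    if (rcvd_pkts) {
        st->packets += rcvd_pkts;
        st->bytes += rcvd_bytes;
        st->updates++;
    }
}
```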
      
      Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Gregory CLEMENT <gregory.clement@free-electrons.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Arnaud Ebalard <arno@natisbad.org>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc4277dd
    • Paul Gortmaker's avatar
      drivers/net: delete non-required instances of include <linux/init.h> · a81ab36b
      Paul Gortmaker authored
      None of these files are actually using any __init type directives
      and hence don't need to include <linux/init.h>. Most are just
      leftovers from __devinit and __cpuinit removal, or simply due to
      code getting copied from one driver to the next.
      
      This covers everything under drivers/net except for wireless, which
      has been submitted separately.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a81ab36b