1. 22 Oct, 2015 1 commit
    • netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen authored
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding it as an argument to the get_link_af_size op in rtnl_af_ops.
      
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and is thus removed.
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 21 Oct, 2015 10 commits
    • Adding switchdev ageing notification on port bridged · 6ac311ae
      Elad Raz authored
      Configure the ageing time in the HW for a newly bridged device.
      
      CC: Scott Feldman <sfeldma@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Elad Raz <eladr@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-rack' · eb9fae32
      David S. Miller authored
      Yuchung Cheng says:
      
      ====================
      RACK loss detection
      
      RACK (Recent ACK) loss recovery uses the notion of time instead of
      packet sequence (FACK) or counts (dupthresh).
      
      It's inspired by the FACK heuristic in tcp_mark_lost_retrans(): when a
      limited transmit (new data packet) is sacked in recovery, then any
      retransmission sent before that newly sacked packet was sent must have
      been lost, since at least one round trip time has elapsed.
      
      But that existing heuristic from tcp_mark_lost_retrans()
      has several limitations:
        1) it can't detect tail drops since it depends on limited transmit
        2) it's disabled upon reordering (assumes no reordering)
        3) it's only enabled in fast recovery but not timeout recovery
      
      RACK addresses these limitations with a core idea: an unacknowledged
      packet P1 is deemed lost if a packet P2 that was sent later is
      s/acked, since at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when a later retransmission is
      s/acked, while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering, similar to the delayed
      Early Retransmit.
      
      In the current patch set RACK is only a supplemental loss detection
      and does not trigger fast recovery. However we are developing RACK
      to replace or consolidate FACK/dupthresh, early retransmit, and
      thin-dupack. These heuristics all implicitly bear the time notion.
      For example, the delayed Early Retransmit is simply applying RACK
      to trigger the fast recovery with small inflight.
      
      RACK requires measuring the minimum RTT. Tracking a global min is less
      robust due to traffic engineering pathing changes. Therefore it uses a
      windowed filter by Kathleen Nichols. The min RTT can also be useful
      for various other purposes like congestion control or stat monitoring.
      
      This patch has been used on Google servers for well over 1 year. RACK
      has also been implemented in the QUIC protocol. We are submitting an
      IETF draft as well.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng authored
      This patch implements the second half of RACK that uses the most
      recent transmit time among all delivered packets to detect losses.
      
      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if a not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered packet.
      If so the packet is deemed lost.
      
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found that 1msec accommodates tiny degrees of reordering
      (<3 pkts) well on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicted congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
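      
      The marking rule described above can be sketched in user-space C as
      follows. This is only an illustration under stated assumptions: struct
      pkt, rack_reo_wnd_us() and rack_mark_lost() are hypothetical names, not
      the functions added by the patch, and the comparison follows the
      "at least reo_wnd" wording of this message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-packet state; field names are illustrative only. */
struct pkt {
	uint64_t xmit_us;	/* time of the most recent (re)transmission */
	bool sacked;		/* already delivered (ACKed or SACKed) */
	bool lost;		/* marked lost by this sketch */
};

/* Reordering window: 1 msec until reordering is seen, then min-RTT/4. */
static uint64_t rack_reo_wnd_us(bool reordering_seen, uint64_t min_rtt_us)
{
	return reordering_seen ? min_rtt_us / 4 : 1000;
}

/* Mark every not-yet-sacked packet that was sent at least reo_wnd_us
 * before the most recently delivered packet (rack_mstamp_us).
 * Returns the number of packets newly marked lost.
 */
static unsigned int rack_mark_lost(struct pkt *pkts, unsigned int n,
				   uint64_t rack_mstamp_us,
				   uint64_t reo_wnd_us)
{
	unsigned int i, newly_lost = 0;

	for (i = 0; i < n; i++) {
		struct pkt *p = &pkts[i];

		if (p->sacked || p->lost)
			continue;
		if (p->xmit_us + reo_wnd_us <= rack_mstamp_us) {
			p->lost = true;
			newly_lost++;
		}
	}
	return newly_lost;
}
```

      With a 1 msec window and a most recent delivered transmit time of
      5000 usec, an unsacked packet sent at 1000 usec is marked lost while
      one sent at 4500 usec is not.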
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng authored
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then current retransmitted
      sequence below the newly sacked sequence must have been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery but not timeout recovery
      
      RACK (Recent ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering.
      
      This patch implements tcp_rack_advance() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example in which the sender sends four packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
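      
      The timeline above can be replayed with a minimal sketch of the
      timestamp tracking. struct rack_state and rack_advance() are
      hypothetical stand-ins for tp->rack and the patch's helper, and the
      spurious-retransmission filtering described in this message is
      deliberately left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the RACK state kept in tp->rack. */
struct rack_state {
	uint64_t mstamp;	/* latest transmit time among delivered packets */
};

/* Call for each packet newly ACKed or SACKed, passing that packet's most
 * recent transmit time. The stored timestamp only ever moves forward.
 */
static void rack_advance(struct rack_state *rack, uint64_t xmit_time)
{
	if (xmit_time > rack->mstamp)
		rack->mstamp = xmit_time;
}
```

      Feeding the delivered transmit times in the order of the example
      (T2, then T2/T3/T4, then the retransmission sent at T101) moves
      mstamp to 2, 4 and 101 in turn.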
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch, which marks packets
      lost using the RACK timestamp.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: skb_mstamp_after helper · 625a5e10
      Yuchung Cheng authored
      A helper to prepare the first main RACK patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add tcp_tsopt_ecr_before helper · 77c63127
      Yuchung Cheng authored
      A helper to prepare the main RACK patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove tcp_mark_lost_retrans() · af82f4e8
      Yuchung Cheng authored
      Remove the existing lost retransmit detection because RACK subsumes
      it completely. This also stops overloading the ack_seq field of
      the skb control block.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: track min RTT using windowed min-filter · f6722583
      Yuchung Cheng authored
      Implement Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
      
      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worst-case error when that data is monotonically increasing
      over the window.
      
      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrite the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.
      
      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code.
      
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
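      
      A user-space sketch of a three-sample windowed min filter in the
      spirit of the description above. The names (struct rtt_min_filter,
      rtt_min_reset(), rtt_min_update()) and the quarter-window and
      half-window refresh thresholds are assumptions made for this
      illustration; time is in arbitrary ticks and the sketch is not the
      kernel code from this patch.

```c
#include <stdint.h>

struct rtt_sample {
	uint32_t rtt;	/* measured RTT */
	uint32_t ts;	/* time at which the measurement was taken */
};

/* s[0] is the best (smallest) sample, s[1] the 2nd best, s[2] the 3rd. */
struct rtt_min_filter {
	struct rtt_sample s[3];
};

static void rtt_min_reset(struct rtt_min_filter *f, uint32_t rtt,
			  uint32_t now)
{
	struct rtt_sample m = { rtt, now };

	f->s[0] = f->s[1] = f->s[2] = m;
}

/* Feed one measurement and return the minimum over the last win ticks. */
static uint32_t rtt_min_update(struct rtt_min_filter *f, uint32_t win,
			       uint32_t rtt, uint32_t now)
{
	struct rtt_sample m = { rtt, now };
	uint32_t elapsed;

	/* A new best (or 2nd/3rd best) makes all later choices obsolete. */
	if (m.rtt <= f->s[0].rtt)
		f->s[0] = f->s[1] = f->s[2] = m;
	else if (m.rtt <= f->s[1].rtt)
		f->s[1] = f->s[2] = m;
	else if (m.rtt <= f->s[2].rtt)
		f->s[2] = m;

	elapsed = now - f->s[0].ts;
	if (elapsed > win) {
		/* The best sample aged out: promote the 2nd and 3rd. */
		f->s[0] = f->s[1];
		f->s[1] = f->s[2];
		f->s[2] = m;
		if (now - f->s[0].ts > win) {
			f->s[0] = f->s[1];
			f->s[1] = m;
			if (now - f->s[0].ts > win)
				f->s[0] = m;
		}
	} else if (f->s[1].ts == f->s[0].ts && elapsed > win / 4) {
		/* No distinct 2nd choice yet: take one after a quarter window. */
		f->s[2] = f->s[1] = m;
	} else if (f->s[2].ts == f->s[1].ts && elapsed > win / 2) {
		/* No distinct 3rd choice yet: take one after half the window. */
		f->s[2] = m;
	}
	return f->s[0].rtt;
}
```

      With a window of 100 ticks and an initial sample of 50 at time 0, a
      sample of 55 at time 60 becomes the 2nd choice and takes over as the
      minimum once the sample of 50 ages out at time 101.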
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: apply Kern's check on RTTs used for congestion control · 9e45a3e3
      Yuchung Cheng authored
      Currently ca_seq_rtt_us does not use Karn's check. Fix that by
      checking if any packet acked is a retransmit, for both RTT used
      for RTT estimation and congestion control.
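      
      The check amounts to discarding an RTT sample whenever the ACK covers
      a retransmitted packet, since the ACK may belong to either
      transmission. A minimal sketch, with hypothetical names (RTT_INVALID,
      rtt_sample_us()) that do not come from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define RTT_INVALID (-1L)	/* hypothetical "no usable sample" marker */

/* Return the RTT sample in usec, or RTT_INVALID when any packet covered
 * by the ACK was retransmitted and the measurement is therefore
 * ambiguous.
 */
static long rtt_sample_us(uint64_t first_xmit_us, uint64_t ack_us,
			  bool any_retrans_acked)
{
	if (any_retrans_acked)
		return RTT_INVALID;
	return (long)(ack_us - first_xmit_us);
}
```

      The same gate would apply to both consumers of the sample, RTT
      estimation and congestion control.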
      
      Fixes: 5b08e47c ("tcp: prefer packet timing to TS-ECR for RTT")
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · c8fdc324
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2015-10-19
      
      This series contains updates to i40e and i40evf only.
      
      Kiran adds a spinlock around code accessing VSI MAC filter list to
      ensure that we are synchronizing access to the filter list, otherwise
      we can end up with multiple accesses at the same time which can cause
      the VSI MAC filter list to get in an unstable or corrupted state.
      
      Jesse fixes overlong BIT defines, where the defines from the RSS
      enabling call were mistakenly missed.  Also fixes a bug where the
      enable function was enabling the interrupt twice while trying to
      update the two interrupt throttle rate thresholds for Rx and Tx,
      while refactoring the IRQ enable function to simplify reading the
      flow.  Addressed the high CPU utilization of some small streaming
      workloads, for which the driver should reduce CPU usage.
      
      Anjali fixes two X722 issues with respect to EEPROM checksum verify and
      reading NVM version info.  Fixed where a mask value was accidentally
      replaced with a bit mask causing Flow Director sideband to be broken.
      
      Alex Duyck fixes areas of the drivers which run from hard interrupt
      context or with interrupts already disabled in netpoll, so use
      napi_schedule_irqoff() instead of napi_schedule().
      
      Mitch fixes the VF drivers to not easily give up when it is not able
      to communicate with the PF driver.
      
      Carolyn fixes a problem where our tools MAC loopback test, after driver
      unbind would fail because the hardware was configured for multiqueue and
      unbind operation did not clear this configuration.  Also fixed an issue
      where the NVMUpdate tool gets bad data from the PHY when using the PHY
      NVM feature because of contention on the MDIO interface from getting
      PHY capability calls from the driver during regular operations.
      
      Catherine fixed an issue where we were checking if autoneg was allowed
      to change before checking if autoneg was changing; these checks need to
      be in the reverse order.
      
      Jean Sacren fixes up a function header comment to align the kernel-docs
      with the actual code.
      
      v2: Cleaned up the use of spin_is_locked() in patch 1 based on feedback
          from David Miller, since it always evaluates to zero on uni-processor
          builds
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 20 Oct, 2015 1 commit
  4. 19 Oct, 2015 28 commits
    • i40e/i40evf: Bump i40e to 1.3.38 and i40evf to 1.3.25 · a1f192cf
      Catherine Sullivan authored
      Bump.
      
      Change-ID: Id0a7ecaa491f88ce94c9eba4901e592a56044ee0
      Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: declare rather than initialize int object · 6f66a484
      Jean Sacren authored
      'err' would be overwritten immediately, so we should declare it only
      rather than initialize it to zero.
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: fix kernel-doc argument name · 2bc11c63
      Jean Sacren authored
      The second argument name in the kernel-doc argument list for
      i40e_features_check() was slightly off. Fix it for the kernel doc.
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Move error message to debug level · 52e9689e
      Catherine Sullivan authored
      There is an error coming back from get_phy_capabilities that does not
      seem to have any functional implications. We will continue looking into
      why this error message is occurring, but in the meantime, we will move it
      to debug to avoid confusion.
      
      Change-ID: I9091754bf62c066ddedeb249923d85606e2d68ed
      Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Fix order of checks when enabling/disabling autoneg in ethtool · 3ce12ee9
      Catherine Sullivan authored
      We were previously checking if autoneg was allowed to change before
      checking if autoneg was changing. We need to do this in the other order
      or else we will erroneously return EINVAL when autoneg is not changing.
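      
      The corrected ordering can be illustrated with a small sketch.
      check_autoneg_change() and its parameters are hypothetical and only
      model the order of the two tests, not the driver's ethtool handler.

```c
#include <errno.h>
#include <stdbool.h>

/* Validate an autoneg request: first test whether anything changes,
 * and only then whether a change is permitted.
 */
static int check_autoneg_change(bool current_autoneg, bool requested_autoneg,
				bool change_allowed)
{
	if (requested_autoneg == current_autoneg)
		return 0;	/* nothing changes, so nothing to reject */
	if (!change_allowed)
		return -EINVAL;	/* a real change that is not permitted */
	return 0;
}
```

      With the tests in the opposite order, a request that keeps the
      current setting on a port where changes are not allowed would be
      rejected with EINVAL even though it asks for nothing new.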
      
      Change-ID: Iff9f7d1c9bddc1ad1e5d227d4f42754f90155410
      Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: Fix an accidental error with BIT_ULL replacement · a03dc368
      Anjali Singhai Jain authored
      A mask value of 0x1FF was accidentally replaced with a bit mask
      causing flow director sideband to be broken.
      
      Change-ID: Id3387f67dd1b567b41692b570b383c58671e1eae
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: fix for PHY NVM interaction problem · 8589af70
      Carolyn Wyborny authored
      This patch fixes a problem where the NVMUpdate Tool, when using the PHY
      NVM feature, gets bad data from the PHY because of contention on the
      MDIO interface from get PHY capability calls from the driver during
      regular operations.  The problem is fixed by adding a check if media
      is available before calling get PHY capability function because that
      bit is not set when device is in PHY interaction mode.
      
      Change-ID: Ib89991b0f841808dd92410f5e8683d6ee3301cd0
      Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Fix for Tools loopback test failing after driver load · bcab2db9
      Carolyn Wyborny authored
      This patch fixes a problem where our Tools MAC Loopback test, after
      driver unbind would fail.  This was because the hw was configured
      for multiqueue and unbind operation did not clear this configuration.
      The problem is fixed by resetting this configuration in i40e_remove.
      
      Change-ID: I130c05138319182ed1476d3a0b5222d6a6320af9
      Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: adjust interrupt throttle less frequently · ee2319cf
      Jesse Brandeburg authored
      The adaptive ITR (interrupt throttle rate) algorithm was adjusting
      the hardware's interrupt rate too frequently.  This caused a lot
      of variation in the interrupt rate for fairly constant workloads.
      
      Change the code to have a counter and adjust only once every N
      interrupts.
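      
      A counter-based gate of this kind might look like the sketch below.
      The interval of 16 is an assumed placeholder (the message does not
      state N), and struct itr_state and itr_should_adjust() are
      hypothetical names.

```c
#include <stdbool.h>

#define ITR_ADJUST_INTERVAL 16	/* assumed N; not taken from the commit */

struct itr_state {
	unsigned int countdown;	/* interrupts left until the next adjustment */
};

/* Call once per interrupt. Returns true when the throttle rate should
 * be recalculated, which happens once every ITR_ADJUST_INTERVAL calls.
 */
static bool itr_should_adjust(struct itr_state *st)
{
	if (st->countdown > 0) {
		st->countdown--;
		return false;
	}
	st->countdown = ITR_ADJUST_INTERVAL - 1;
	return true;
}
```

      Starting from a zeroed state, the first interrupt triggers an
      adjustment and the next one follows 16 interrupts later.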
      
      Change-ID: I0460f1f86571037484eca5aca36ac4d889cb8389
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: change dynamic interrupt thresholds · c56625d5
      Jesse Brandeburg authored
      The dynamic algorithm, while now working, doesn't have good
      performance in 40G mode.
      
      One part of this patch addresses the high CPU utilization of some small
      streaming workloads, for which the driver should reduce CPU usage.
      
      It also changes the minimum ITR that the dynamic algorithm
      will settle on, causing our minimum latency to go from 12us
      to about 14us, when using adaptive mode.
      
      It also changes the BULK interrupt rate to allow maximum throughput
      on a 40Gb connection with a single thread of transmit; clamping the
      interrupt rate to 8000 for TX makes single-thread traffic too
      slow.
      
      The new ULTRA bulk setting is introduced and is used
      when the Rx packet rate on this queue exceeds 40000 packets per
      second.  This value of 40000 was chosen because the automatic tuning
      of minimum ITR=20us means that a single queue can't quite achieve
      that many packets per second from a round-robin test.
      
      Change-ID: Icce8faa128688ca5fd2c4229bdd9726877a92ea2
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: fix bug in throttle rate math · 51cc6d9f
      Jesse Brandeburg authored
      The driver was using a value expressed in 2us increments
      for the divisor to figure out our bytes/usec values.
      
      Fix the usecs variable to contain a value in microseconds.
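      
      The unit mismatch can be shown with a short worked sketch.
      itr_to_usecs() and bytes_per_usec() are hypothetical helpers; the only
      fact taken from this message is that the stored value counts 2-usec
      increments.

```c
#include <stdint.h>

/* Convert a throttle value counted in 2-usec increments to microseconds. */
static uint32_t itr_to_usecs(uint32_t itr_2us)
{
	return itr_2us * 2;
}

/* Bytes per microsecond over one throttle interval. Dividing by the raw
 * 2-usec count instead of microseconds would double the result.
 */
static uint32_t bytes_per_usec(uint32_t bytes, uint32_t itr_2us)
{
	uint32_t usecs = itr_to_usecs(itr_2us);

	return usecs ? bytes / usecs : 0;
}
```

      For 1000 bytes over a throttle value of 25 (50 usec) the rate is
      20 bytes/usec, whereas dividing by the raw value 25 would report 40.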
      
      Change-ID: I5c20493103c295d6f201947bb908add7040b7c41
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: refactor IRQ enable function · 8f5e39ce
      Jesse Brandeburg authored
      This change moves a multi-line register setting into a function
      which simplifies reading the flow of the enable function.
      
      This also fixes a bug where the enable function was enabling
      the interrupt twice while trying to update the two interrupt
      throttle rate thresholds for Rx and Tx.
      
      Change-ID: Ie308f9d0d48540204590cb9d7a5a7b1196f959bb
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40evf: don't give up · b9029e94
      Mitch Williams authored
      When the VF driver is unable to communicate with the PF, it just gives
      up and never tries again. Aside from the obvious character flaw that
      this shows, it's also a lousy user experience.
      
      When PF communications fail, wait five seconds, and try again. And
      again. Don't give up, little VF driver! Your prince will come!
      
      Change-ID: Ia1378a39879883563b8faffce819f375821f9585
      Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: use napi_schedule_irqoff() · 5d3465a1
      Alexander Duyck authored
      The i40e_intr and i40e/i40evf_msix_clean_rings functions run from hard
      interrupt context or with interrupts already disabled in netpoll.
      
      They can use napi_schedule_irqoff() instead of napi_schedule().
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Fix basic support for X722 devices · 07f89be8
      Anjali Singhai authored
      Acquire NVM before issuing an AQ read nvm command for X722.
      We need to acquire the NVM before issuing an AQ read to the NVM
      otherwise we will get EBUSY from the FW. Also release when done.
      
      This fixes the two X722 issues with respect to eeprom checksum verify
      and reading NVM version info.
      
      With this patch in place, i40e driver will provide basic support
      for X722 devices.
      Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
      Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40evf: fix overlong BIT defines · d08f5558
      Jesse Brandeburg authored
      The defines from the RSS enabling call were mistakenly
      missed in the patches to i40e, which should have been
      applied to i40evf as well.
      
      This is a follow up to (commit ed921559886dd40528) "fix
      32 bit build warnings".
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Lock for VSI's MAC filter list · 21659035
      Kiran Patil authored
      This patch introduces a spinlock which is to be used for synchronizing
      access to VSI's MAC filter list.
      
      This patch also synchronizes execution of other codepaths which are
      accessing VSI's MAC filter list with execution of
      service_task:sync_vsi_filters.
      
      In function i40e_add_vsi, copied out LAA MAC address instead of cloning
      MAC filter entry because only MAC address is needed to remove MAC VLAN
      filter from FW/HW.
      
      Change-ID: I0e10ac7c715d44aa994239642aa4d57c998573a2
      Signed-off-by: Kiran Patil <kiran.patil@intel.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 1099f860
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Account for extra headroom in ath9k driver, from Felix Fietkau.
      
       2) Fix OOPS in pppoe driver due to incorrect socket state transition,
          from Guillaume Nault.
      
       3) Kill memory leak in amd-xgbe debugfs, from Geliang Tang.
      
       4) Power management fixes for iwlwifi, from Johannes Berg.
      
       5) Fix races in reqsk_queue_unlink(), from Eric Dumazet.
      
       6) Fix dst_entry usage in ARP replies, from Jiri Benc.
      
       7) Cure OOPSes with SO_GET_FILTER, from Daniel Borkmann.
      
       8) Missing allocation failure check in amd-xgbe, from Tom Lendacky.
      
       9) Various resource allocation/freeing cures in DSA, from Neil
          Armstrong.
      
      10) A series of bug fixes in the openvswitch conntrack support, from
          Joe Stringer.
      
      11) Fix two cases (BPF and act_mirred) where we have to clean the sender
          cpu stored in the SKB before transmitting.  From WANG Cong and
          Alexei Starovoitov.
      
      12) Disable VLAN filtering in promiscuous mode in mlx5 driver, from
          Achiad Shochat.
      
      13) Older bnx2x chips cannot do 4-tuple UDP hashing, so prevent this
          configuration via ethtool.  From Yuval Mintz.
      
      14) Don't call rt6_uncached_list_flush_dev() from rt6_ifdown() when
          'dev' is NULL, from Eric Biederman.
      
      15) Prevent stalled link synchronization in tipc, from Jon Paul Maloy.
      
      16) kcalloc() gstrings ethtool buffer before having driver fill it in,
          in order to prevent kernel memory leaking.  From Joe Perches.
      
      17) Fix missing rt6_info initialization for blackhole routes, from
          Martin KaFai Lau.
      
      18) Kill VLAN regression in via-rhine, from Andrej Ota.
      
      19) Missing pfmemalloc check in sk_add_backlog(), from Eric Dumazet.
      
      20) Fix spurious MSG_TRUNC signalling in netlink dumps, from Ronen Arad.
      
      21) Scrub SKBs when pushing them between namespaces in openvswitch,
          from Joe Stringer.
      
      22) bcmgenet enables link interrupts too early, fix from Florian
          Fainelli.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (92 commits)
        net: bcmgenet: Fix early link interrupt enabling
        tunnels: Don't require remote endpoint or ID during creation.
        openvswitch: Scrub skb between namespaces
        xen-netback: correctly check failed allocation
        net: asix: add support for the Billionton GUSB2AM-1G-B USB adapter
        netlink: Trim skb to alloc size to avoid MSG_TRUNC
        net: add pfmemalloc check in sk_add_backlog()
        via-rhine: fix VLAN receive handling regression.
        ipv6: Initialize rt6_info properly in ip6_blackhole_route()
        ipv6: Move common init code for rt6_info to a new function rt6_info_init()
        Bluetooth: Fix initializing conn_params in scan phase
        Bluetooth: Fix conn_params list update in hci_connect_le_scan_cleanup
        Bluetooth: Fix remove_device behavior for explicit connects
        Bluetooth: Fix LE reconnection logic
        Bluetooth: Fix reference counting for LE-scan based connections
        Bluetooth: Fix double scan updates
        mlxsw: core: Fix race condition in __mlxsw_emad_transmit
        tipc: move fragment importance field to new header position
        ethtool: Use kcalloc instead of kmalloc for ethtool_get_strings
        tipc: eliminate risk of stalled link synchronization
        ...
      1099f860
    • Florian Fainelli's avatar
      net: bcmgenet: Fix early link interrupt enabling · 37850e37
      Florian Fainelli authored
      Link interrupts are enabled in init_umac(), which is too early for us to
      process them since we do not yet have a valid PHY device pointer. On
      BCM7425 chips for instance, we will crash calling phy_mac_interrupt()
      because phydev is NULL.
      
      Fix this by moving the enabling of link interrupts into
      bcmgenet_netif_start(), under a specific function:
      bcmgenet_link_intr_enable() and while at it, update the comments
      surrounding the code.
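
The ordering problem described above can be sketched in userspace C. This is only an illustration under stated assumptions: the fake_* structures and functions are hypothetical stand-ins, not the bcmgenet driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_phy {
    int link_events;           /* counts link events delivered to the PHY */
};

struct fake_priv {
    struct fake_phy *phydev;   /* NULL until the PHY is attached */
    bool link_intr_enabled;
};

/* Early init: deliberately leaves link interrupts masked. */
static void fake_init_umac(struct fake_priv *priv)
{
    priv->link_intr_enabled = false;
}

/* Late start: the PHY is attached first, so enabling is safe here. */
static void fake_netif_start(struct fake_priv *priv, struct fake_phy *phy)
{
    priv->phydev = phy;
    assert(priv->phydev != NULL);
    priv->link_intr_enabled = true;
}

/* Interrupt handler: returns false if the event was masked and ignored. */
static bool fake_link_interrupt(struct fake_priv *priv)
{
    if (!priv->link_intr_enabled)
        return false;          /* masked: phydev is never dereferenced */
    priv->phydev->link_events++;
    return true;
}
```

Because the interrupt is unmasked only after the PHY pointer is set, the handler can never dereference a NULL phydev.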
      
      Fixes: 6cc8e6d4 ("net: bcmgenet: Delay PHY initialization to bcmgenet_open()")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      37850e37
    • David S. Miller's avatar
      Merge tag 'wireless-drivers-for-davem-2015-10-17' of... · afc050dd
      David S. Miller authored
      Merge tag 'wireless-drivers-for-davem-2015-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
      
      Kalle Valo says:
      
      ====================
      iwlwifi:
      
      * mvm: flush fw_dump_wk when mvm fails to start
      * mvm: init card correctly on ctkill exit check
      * pci: add a few more PCI subvendor IDs for the 7265 series
      * fix firmware filename for 3160
      * mvm: clear csa countdown when AP is stopped
      * mvm: fix D3 firmware PN programming
      * dvm: fix D3 firmware PN programming
      * mvm: fix D3 CCMP TX PN assignment
      
      rtlwifi:
      
      * rtl8821ae: Fix system lockups on boot
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      afc050dd
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 371f1c7e
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      The following patchset contains Netfilter/IPVS updates for your net-next
      tree. Most relevantly, updates for the nfnetlink_log to integrate with
      conntrack, fixes for cttimeout and improvements for nf_queue core, they are:
      
      1) Remove useless ifdef around static inline function in IPVS, from
         Eric W. Biederman.
      
      2) Simplify the conntrack support for nfnetlink_queue: Merge
         nfnetlink_queue_ct.c file into nfnetlink_queue_core.c, then rename it back
         to nfnetlink_queue.c
      
      3) Use y2038 safe timestamp from nfnetlink_queue.
      
      4) Get rid of dead function definition in nf_conntrack, from Flavio
         Leitner.
      
      5) Attach conntrack support for nfnetlink_log.c, from Ken-ichirou MATSUZAWA.
         This adds a new NETFILTER_NETLINK_GLUE_CT Kconfig switch that
         controls enabling both nfqueue and nflog integration with conntrack.
         The userspace application can request this via NFULNL_CFG_F_CONNTRACK
         configuration flag.
      
      6) Remove unused netns variables in IPVS, from Eric W. Biederman and
         Simon Horman.
      
      7) Don't put back the refcount on the cttimeout object from xt_CT on success.
      
      8) Fix crash on cttimeout policy object removal. We have to flush out
         the cttimeout extension area of the conntrack so that it does not refer
         to a nonexistent object that was just removed.
      
      9) Make sure rcu_callback completes before nfnetlink_cttimeout
         module removal.
      
      10) Fix compilation warning in br_netfilter when no nf_defrag_ipv4 and
          nf_defrag_ipv6 are enabled. Patch from Arnd Bergmann.
      
      11) Autoload ctnetlink dependencies when NFULNL_CFG_F_CONNTRACK is
          requested. Again from Ken-ichirou MATSUZAWA.
      
      12) Don't use pointer to previous hook when reinjecting traffic via
          nf_queue with NF_REPEAT verdict since it may be already gone. This
          also avoids a deadloop if the userspace application keeps returning
          NF_REPEAT.
      
      13) A bunch of cleanups for netfilter IPv4 and IPv6 code from Ian Morris.
      
      14) Consolidate logger instance existence check in nfulnl_recv_config().
      
      15) Fix broken atomicity when applying configuration updates to logger
          instances in nfnetlink_log.
      
      16) Get rid of the .owner attribute in our hook object. We don't need
          this anymore since we're dropping pending packets that have escaped
          from the kernel when unregistering the hook. Patch from Florian Westphal.
      
      17) Remove unnecessary rcu_read_lock() from nf_reinject code, we always
          assume RCU read side lock from .call_rcu in nfnetlink. Also from Florian.
      
      18) Use static inline functions instead of macros to define NF_HOOK() and
          NF_HOOK_COND() when no netfilter support is on, from Arnd Bergmann.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      371f1c7e
    • santosh.shilimkar@oracle.com's avatar
      RDS: fix rds-ping deadlock over TCP transport · 7b4b0009
      santosh.shilimkar@oracle.com authored
      Sowmini found a hang with rds-ping while testing RDS over TCP. It's
      a corner case and doesn't always happen. The issue is not reproducible
      with IB transport. It's clear from the dump below why we see it with RDS TCP.
      
       [<ffffffff8153b7e5>] do_tcp_setsockopt+0xb5/0x740
       [<ffffffff8153bec4>] tcp_setsockopt+0x24/0x30
       [<ffffffff814d57d4>] sock_common_setsockopt+0x14/0x20
       [<ffffffffa096071d>] rds_tcp_xmit_prepare+0x5d/0x70 [rds_tcp]
       [<ffffffffa093b5f7>] rds_send_xmit+0xd7/0x740 [rds]
       [<ffffffffa093bda2>] rds_send_pong+0x142/0x180 [rds]
       [<ffffffffa0939d34>] rds_recv_incoming+0x274/0x330 [rds]
       [<ffffffff810815ae>] ? ttwu_queue+0x11e/0x130
       [<ffffffff814dcacd>] ? skb_copy_bits+0x6d/0x2c0
       [<ffffffffa0960350>] rds_tcp_data_recv+0x2f0/0x3d0 [rds_tcp]
       [<ffffffff8153d836>] tcp_read_sock+0x96/0x1c0
       [<ffffffffa0960060>] ? rds_tcp_recv_init+0x40/0x40 [rds_tcp]
       [<ffffffff814d6a90>] ? sock_def_write_space+0xa0/0xa0
       [<ffffffffa09604d1>] rds_tcp_data_ready+0xa1/0xf0 [rds_tcp]
       [<ffffffff81545249>] tcp_data_queue+0x379/0x5b0
       [<ffffffffa0960cdb>] ? rds_tcp_write_space+0xbb/0x110 [rds_tcp]
       [<ffffffff81547fd2>] tcp_rcv_established+0x2e2/0x6e0
       [<ffffffff81552602>] tcp_v4_do_rcv+0x122/0x220
       [<ffffffff81553627>] tcp_v4_rcv+0x867/0x880
       [<ffffffff8152e0b3>] ip_local_deliver_finish+0xa3/0x220
      
      This happens because the rds_send_xmit() chain wants to take
      sock_lock, which is already taken by tcp_v4_rcv() on its
      way to rds_tcp_data_ready(). Commit db6526dc ("RDS: use
      rds_send_xmit() state instead of RDS_LL_SEND_FULL") was
      trying to opportunistically finish the send request
      in the same thread context.
      
      But because of the above recursive lock hang with RDS TCP,
      the send work from rds_send_pong() needs to be deferred to a
      worker to avoid the lockup. Given that RDS ping is more of a connectivity
      test than a performance-critical path, it should be OK even
      for transports like IB.
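
The deferral described above can be sketched as a small userspace model. It is an illustration only: the fake_* names, the boolean "lock", and the single pending-work flag are hypothetical simplifications, not the RDS implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_conn {
    bool sock_locked;    /* models the non-recursive socket lock */
    bool work_pending;   /* models a queued send work item */
    int  sent;           /* number of pongs actually transmitted */
};

/* Transmit path: takes the socket lock, so it must not run while it is held. */
static void fake_send_xmit(struct fake_conn *c)
{
    assert(!c->sock_locked);   /* taking it a second time would self-deadlock */
    c->sock_locked = true;
    c->sent++;
    c->sock_locked = false;
}

/* Called from the receive path with the socket lock held: only queue work. */
static void fake_send_pong(struct fake_conn *c)
{
    c->work_pending = true;
}

/* Worker: runs later, outside the receive path, and performs the real send. */
static void fake_send_worker(struct fake_conn *c)
{
    if (c->work_pending) {
        c->work_pending = false;
        fake_send_xmit(c);
    }
}
```

The pong is sent slightly later than with an inline transmit, which is the trade-off the commit message accepts for a connectivity test.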
      Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
      Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b4b0009
    • Jesse Gross's avatar
      tunnels: Don't require remote endpoint or ID during creation. · e277de5f
      Jesse Gross authored
      Before lightweight tunnels existed, it really didn't make sense to
      create a tunnel that was not fully specified, such as without a
      destination IP address - the resulting packets would go nowhere.
      However, with lightweight tunnels, the opposite is true - it doesn't
      make sense to require this information when it will be provided later
      on by the route. This loosens the requirements for this information.
      
      An alternative would be to allow the relaxed version only when
      COLLECT_METADATA is enabled. However, since there are several
      variations on this theme (such as NBMA tunnels in GRE), just dropping
      the restrictions seems the most consistent across tunnels and with
      the existing configuration.
      
      CC: John Linville <linville@tuxdriver.com>
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e277de5f
    • stephen hemminger's avatar
      uapi: add mpls_iptunnel.h · b3958b9e
      stephen hemminger authored
      Add missing rule to export mpls iptunnel header needed by iproute2
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b3958b9e
    • Eric Dumazet's avatar
      tcp: do not set queue_mapping on SYNACK · dc6ef6be
      Eric Dumazet authored
      At the time of commit fff32699 ("tcp: reflect SYN queue_mapping into
      SYNACK packets") we had few ways to cope with SYN floods.
      
      We no longer need to reflect incoming skb queue mappings, and instead
      can pick a TX queue based on cpu cooking the SYNACK, with normal XPS
      affinities.
      
      Note that all SYNACK retransmits were picking TX queue 0, this no longer
      is a win given that SYNACK rtx are now distributed on all cpus.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc6ef6be
    • Joe Stringer's avatar
      openvswitch: Scrub skb between namespaces · 740dbc28
      Joe Stringer authored
      If OVS receives a packet from another namespace, then the packet should
      be scrubbed. However, people have already begun to rely on the behaviour
      that skb->mark is preserved across namespaces, so retain this one field.
      
      This is mainly to address information leakage between namespaces when
      using OVS internal ports, but by placing it in ovs_vport_receive() it is
      more generally applicable, meaning it should not be overlooked if other
      port types are allowed to be moved into namespaces in future.
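
A minimal sketch of the "scrub everything except the mark" behaviour described above. The structure, its fields, and the fake_* functions are hypothetical stand-ins for illustration; they do not reproduce the kernel's sk_buff or the OVS code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-packet metadata; the real sk_buff has many more fields. */
struct fake_skb {
    uint32_t mark;
    uint32_t priority;
    uint32_t conntrack_id;
};

/* Stand-in for a full scrub: clears all metadata. */
static void fake_scrub(struct fake_skb *skb)
{
    memset(skb, 0, sizeof(*skb));
}

/* Receive hook: scrub only when the packet crosses a namespace boundary. */
static void fake_vport_receive(struct fake_skb *skb, int src_ns, int dst_ns)
{
    assert(skb != NULL);
    if (src_ns != dst_ns) {
        uint32_t mark = skb->mark;  /* save the one field users rely on */

        fake_scrub(skb);
        skb->mark = mark;           /* restore it after the scrub */
    }
}
```

Doing this in the common receive hook rather than per port type is what makes the behaviour apply to any port that may later be moved into a namespace.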
      Signed-off-by: Joe Stringer <joestringer@nicira.com>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      740dbc28
    • David S. Miller's avatar
      Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · a5d6f7dd
      David S. Miller authored
      Johan Hedberg says:
      
      ====================
      pull request: bluetooth 2015-10-16
      
      First of all, sorry for the late set of patches for the 4.3 cycle. We
      just finished an intensive week of testing at the Bluetooth UnPlugFest
      and discovered (and fixed) issues there. Unfortunately a few issues
      affect 4.3-rc5 in a way that they break existing Bluetooth LE mouse and
      keyboard support.
      
      The regressions result from supporting LE privacy in conjunction with
      scanning for Resolvable Private Addresses before connecting. A feature
      that has been tested heavily (including automated unit tests), but sadly
      some regressions slipped in. The UnPlugFest with its multitude of test
      platforms is a good battle testing ground for uncovering every corner
      case.
      
      The patches in this pull request focus only on fixing the regressions in
      4.3-rc5. The patches look a bit larger since we also added comments in
      the critical sections of the fixes to improve clarity.
      
      I would appreciate if we can get these regression fixes to Linus
      quickly. Please let me know if there are any issues pulling. Thanks.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a5d6f7dd
    • Arnd Bergmann's avatar
      net: hix5hd2_gmac: avoid integer overload warning · 951b5d95
      Arnd Bergmann authored
      BITS_RX_EN is an 'unsigned long' constant, so the ones complement of that
      has bits set that do not fit into a 32-bit variable on 64-bit architectures,
      which causes a harmless gcc warning:
      
      drivers/net/ethernet/hisilicon/hix5hd2_gmac.c: In function 'hix5hd2_port_disable':
      drivers/net/ethernet/hisilicon/hix5hd2_gmac.c:374:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
        writel_relaxed(~(BITS_RX_EN | BITS_TX_EN), priv->base + PORT_EN);
      
      This adds a cast to (u32) to tell gcc that the code is indeed fine.
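
The truncation can be illustrated with a small standalone sketch. The bit positions and the fake_writel() helper are assumptions made for the example, not values or functions taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions are assumptions for illustration, not taken from the driver. */
#define BITS_RX_EN (1UL << 1)
#define BITS_TX_EN (1UL << 2)

/* Stand-in for a 32-bit register write: it can only carry 32 bits. */
static uint32_t fake_writel(uint32_t val)
{
    return val;
}

/* Value written to disable both directions of the port. */
static uint32_t port_disable_value(void)
{
    /*
     * Where 'unsigned long' is 64 bits wide, the complement also sets
     * bits 32..63. The explicit cast states that dropping them is intended.
     */
    return fake_writel((uint32_t)~(BITS_RX_EN | BITS_TX_EN));
}
```

The stored value is the same with or without the cast; the cast only makes the narrowing explicit so the compiler has nothing to warn about.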
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      951b5d95