Commits · db30485408326a6f466a843b291b23535f63eda0 · Kirill Smelkov / linux

09 Jan, 2015 10 commits

rhashtable: involve rhashtable_lookup_insert routine · db304854

Ying Xue authored Jan 07, 2015

Introduce a new function called rhashtable_lookup_insert() which makes
lookup and insertion atomic under bucket lock protection, so that we
do not need an extra lock when we search for and insert an object
into the hash table.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
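
The idea of an atomic lookup-and-insert can be sketched in user space as follows. This is a minimal illustration, not the kernel implementation: the fixed-size table, the `atomic_flag` spinlock standing in for the bucket lock, and the names `lookup_insert` and `table_init` are all assumptions made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUCKETS 8

struct node {
    int key;
    struct node *next;
};

struct bucket {
    atomic_flag lock;   /* stand-in for the per-bucket lock (assumption) */
    struct node *head;
};

struct table {
    struct bucket buckets[NBUCKETS];
};

static void table_init(struct table *t)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        atomic_flag_clear(&t->buckets[i].lock);
        t->buckets[i].head = NULL;
    }
}

/*
 * Insert obj unless an object with the same key already exists.
 * The search and the insertion happen under the same bucket lock,
 * so no other thread can insert a duplicate in between.
 * Returns true if obj was inserted.
 */
static bool lookup_insert(struct table *t, struct node *obj)
{
    struct bucket *b = &t->buckets[(unsigned int)obj->key % NBUCKETS];
    bool inserted = true;

    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ; /* spin until the bucket lock is free */

    for (struct node *n = b->head; n; n = n->next) {
        if (n->key == obj->key) {
            inserted = false;
            break;
        }
    }
    if (inserted) {
        obj->next = b->head;
        b->head = obj;
    }

    atomic_flag_clear_explicit(&b->lock, memory_order_release);
    return inserted;
}
```

Without the combined operation, a caller would have to wrap a separate lookup and insert in its own lock to get the same guarantee.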

rhashtable: introduce rhashtable_wakeup_worker helper function · 54c5b7d3

Ying Xue authored Jan 07, 2015

Introduce the rhashtable_wakeup_worker() helper function to reduce
duplicated code at the places where the worker is woken up.

In addition, we should try to wake up the resizing worker thread only
when both the "future_tbl" and "tbl" bucket table pointers point to
the same bucket array; if they differ, the previous resize of the
hash table is not finished yet. However, currently we wake up the
worker thread only when the two pointers point to different bucket
arrays, which is wrong. This patch fixes that issue as well.
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
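
The corrected wake-up condition might look roughly like the sketch below. The struct layout, the helper names `needs_resize` and `should_wake_worker`, and the 75%/30% load thresholds are assumptions chosen for illustration; only the pointer comparison reflects what the commit message describes.

```c
#include <stdbool.h>
#include <stddef.h>

struct bucket_table {
    size_t size;
};

struct hash_table {
    struct bucket_table *tbl;        /* table currently in use */
    struct bucket_table *future_tbl; /* table a resize is moving to */
    size_t nelems;
};

/* Hypothetical policy: grow above 75% load, shrink below 30% load. */
static bool needs_resize(const struct hash_table *ht)
{
    size_t size = ht->tbl->size;

    return ht->nelems * 4 > size * 3 || ht->nelems * 10 < size * 3;
}

/*
 * Wake the worker only when no resize is in flight, i.e. when both
 * pointers refer to the same bucket array, and a resize is wanted.
 */
static bool should_wake_worker(const struct hash_table *ht)
{
    if (ht->tbl != ht->future_tbl)
        return false; /* previous resize has not finished yet */
    return needs_resize(ht);
}
```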

rhashtable: optimize rhashtable_lookup routine · efb975a6

Ying Xue authored Jan 07, 2015

Define an internal compare function and the relevant compare argument,
and then use rhashtable_lookup_compare() to look up the key in the
hash table, reducing duplicated code between rhashtable_lookup()
and rhashtable_lookup_compare().
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
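
The refactoring pattern, a key lookup expressed through a generic compare-based lookup, can be sketched as follows. A flat array replaces the hash table to keep the example short, and all names here (`lookup_compare`, `default_compare`, `struct compare_arg`) are illustrative assumptions rather than the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef bool (*compare_fn)(const void *obj, const void *arg);

/* Argument handed to the default compare function. */
struct compare_arg {
    const void *key;
    size_t key_offset; /* offset of the key inside each object */
    size_t key_len;
};

/* Generic lookup: return the first object the callback accepts. */
static const void *lookup_compare(const void *const *objs, size_t n,
                                  compare_fn cmp, const void *arg)
{
    for (size_t i = 0; i < n; i++)
        if (cmp(objs[i], arg))
            return objs[i];
    return NULL;
}

/* Internal compare function: plain memcmp on the embedded key. */
static bool default_compare(const void *obj, const void *arg)
{
    const struct compare_arg *x = arg;

    return memcmp((const char *)obj + x->key_offset, x->key, x->key_len) == 0;
}

/* Key lookup is now a thin wrapper, so the search loop exists once. */
static const void *lookup(const void *const *objs, size_t n, const void *key,
                          size_t key_offset, size_t key_len)
{
    struct compare_arg arg = { key, key_offset, key_len };

    return lookup_compare(objs, n, default_compare, &arg);
}
```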

Merge branch 'cxgb4-next' · 7c1b7023

David S. Miller authored Jan 08, 2015

Hariprasad Shenai says:

====================
Add support for a few debugfs entries

This patch series adds support for devlog, cim_la, cim_qcfg and mps_tcam
debugfs entries.

The patch series is created against the 'net-next' tree
and includes patches for the cxgb4 driver.

We have included all the maintainers of the respective drivers. Kindly review the
changes and let us know of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support for mps_tcam debugfs · ef82f662

Hariprasad Shenai authored Jan 07, 2015

Debug log to get the MPS TCAM table
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support for cim_qcfg entry in debugfs · 74b3092c

Hariprasad Shenai authored Jan 07, 2015

Adds debug log to get cim queue config
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support for cim_la entry in debugfs · f1ff24aa

Hariprasad Shenai authored Jan 07, 2015

The CIM LA captures the embedded processor’s internal state. Optionally, it can
also trace the flow of data in and out of the embedded processor. Therefore, the
CIM LA output contains detailed information of what code the embedded processor
executed prior to the CIM LA capture.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add support for devlog · 49aa284f

Hariprasad Shenai authored Jan 07, 2015

Add support for device log entry in debugfs
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

doc: fix the compile error of txtimestamp.c · cd91cc5b

WANG Cong authored Jan 06, 2015

Vinson reported:

  HOSTCC  Documentation/networking/timestamping/txtimestamp
Documentation/networking/timestamping/txtimestamp.c:64:8: error:
redefinition of ‘struct in6_pktinfo’
 struct in6_pktinfo {
        ^
In file included from /usr/include/arpa/inet.h:23:0,
                 from Documentation/networking/timestamping/txtimestamp.c:33:
/usr/include/netinet/in.h:456:8: note: originally defined here
 struct in6_pktinfo
        ^

After we sync with libc header, we don't need this ugly hack any more.
Reported-by: Vinson Lee <vlee@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: fix redefinition of in6_pktinfo and ip6_mtuinfo · 3b50d902

WANG Cong authored Jan 06, 2015

Both netinet/in.h and linux/ipv6.h define these two structs;
if we include both of them, we get:

	/usr/include/linux/ipv6.h:19:8: error: redefinition of ‘struct in6_pktinfo’
	 struct in6_pktinfo {
		^
	In file included from /usr/include/arpa/inet.h:22:0,
			 from txtimestamp.c:33:
	/usr/include/netinet/in.h:524:8: note: originally defined here
	 struct in6_pktinfo
		^
	In file included from txtimestamp.c:40:0:
	/usr/include/linux/ipv6.h:24:8: error: redefinition of ‘struct ip6_mtuinfo’
	 struct ip6_mtuinfo {
		^
	In file included from /usr/include/arpa/inet.h:22:0,
			 from txtimestamp.c:33:
	/usr/include/netinet/in.h:531:8: note: originally defined here
	 struct ip6_mtuinfo
		^
So, similarly to what we did for in6_addr, we need to sync with
the libc header on their definitions.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
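
The general shape of such a guard can be sketched in a single file. Everything here is a stand-in: `LIBC_HEADER_INCLUDED`, `DEMO_UAPI_DEF_PKTINFO` and `struct demo_pktinfo` are hypothetical names, and the two "headers" are simulated by comments. The point is only that the second definition is compiled out when the first header has already provided the struct.

```c
/* ---- pretend libc header (hypothetical) ---- */
#define LIBC_HEADER_INCLUDED 1
struct demo_pktinfo {
    int ifindex;
};

/* ---- pretend kernel compat header (hypothetical) ---- */
#if defined(LIBC_HEADER_INCLUDED)
#define DEMO_UAPI_DEF_PKTINFO 0 /* libc already defined the struct */
#else
#define DEMO_UAPI_DEF_PKTINFO 1 /* kernel header must define it */
#endif

/* ---- pretend kernel header (hypothetical) ---- */
#if DEMO_UAPI_DEF_PKTINFO
struct demo_pktinfo {
    int ifindex;
};
#endif
```

Removing the `LIBC_HEADER_INCLUDED` section flips the macro to 1, so the struct is still defined exactly once.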

07 Jan, 2015 3 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 44d84d72

David S. Miller authored Jan 06, 2015

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · bdec4196

Linus Torvalds authored Jan 06, 2015

Pull networking fixes from David Miller:
 "Just a pile of random fixes, including:

   1) Do not apply TSO limits to non-TSO packets, fix from Herbert Xu.

   2) MDI{,X} eeprom check in e100 driver is reversed, from John W.
      Linville.

   3) Missing error return assignments in several ethernet drivers, from
      Julia Lawall.

   4) Altera TSE device doesn't come back up after ifconfig down/up
      sequence, fix from Kostya Belezko.

   5) Add more cases to the check for whether the qmi_wwan device has a
      bogus MAC address and needs to be assigned a random one.  From
      Kristian Evensen.

   6) Fix interrupt hangs in CPSW, from Felipe Balbi.

   7) Implement ndo_features_check in r8152 so that the stack doesn't
      feed GSO packets which are outside of the chip's capabilities.
      From Hayes Wang"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
  qla3xxx: don't allow never end busy loop
  xen-netback: fixing the propagation of the transmit shaper timeout
  r8152: support ndo_features_check
  batman-adv: fix potential TT client + orig-node memory leak
  batman-adv: fix multicast counter when purging originators
  batman-adv: fix counter for multicast supporting nodes
  batman-adv: fix lock class for decoding hash in network-coding.c
  batman-adv: fix delayed foreign originator recognition
  batman-adv: fix and simplify condition when bonding should be used
  Revert "mac80211: Fix accounting of the tailroom-needed counter"
  net: ethernet: cpsw: fix hangs with interrupts
  enic: free all rq buffs when allocation fails
  qmi_wwan: Set random MAC on devices with buggy fw
  openvswitch: Consistently include VLAN header in flow and port stats.
  tcp: Do not apply TSO segment limit to non-TSO packets
  Altera TSE: Add missing phydev
  net/mlx4_core: Fix error flow in mlx4_init_hca()
  net/mlx4_core: Correcly update the mtt's offset in the MR re-reg flow
  qlcnic: Fix return value in qlcnic_probe()
  net: axienet: fix error return code
  ...

Merge tag 'for-linus-3' of git://git.code.sf.net/p/openipmi/linux-ipmi · 0adc1803

Linus Torvalds authored Jan 06, 2015

Pull IPMI fixlet from Corey Minyard:
 "Fix a compile warning"

* tag 'for-linus-3' of git://git.code.sf.net/p/openipmi/linux-ipmi:
  ipmi: Fix compile warning with tv_usec

06 Jan, 2015 27 commits

net: eth: xgene: change APM X-Gene SoC platform ethernet to support ACPI · de7b5b3d

Feng Kan authored Jan 06, 2015

This adds support for the APM X-Gene ethernet driver to use ACPI tables to derive
the ethernet driver parameters.
Signed-off-by: Feng Kan <fkan@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qla3xxx: don't allow never end busy loop · 2abad79a

Andy Shevchenko authored Jan 06, 2015

The counter variable was never incremented, so the loop may get stuck
under certain circumstances.
Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
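
The class of bug can be illustrated with a bounded polling loop. This is a generic sketch, not the qla3xxx code: `wait_until_ready` and the two demo predicates are hypothetical names. Without the `count++` line the loop would spin forever whenever the condition never becomes true.

```c
#include <stdbool.h>

/*
 * Poll ready() at most max_tries times.
 * Returns 0 once the condition holds, -1 if the retry budget runs out.
 */
static int wait_until_ready(bool (*ready)(void *ctx), void *ctx,
                            unsigned int max_tries)
{
    unsigned int count = 0;

    while (count < max_tries) {
        if (ready(ctx))
            return 0;
        count++; /* the increment that bounds the loop */
    }
    return -1;
}

/* Demo predicate: the hardware never becomes ready. */
static bool never_ready(void *ctx)
{
    (void)ctx;
    return false;
}

/* Demo predicate: becomes ready on the third poll. */
static bool ready_on_third_poll(void *ctx)
{
    unsigned int *polls = ctx;

    return ++*polls >= 3;
}
```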

net/fsl: Add mEMAC MDIO support to XGMAC MDIO · 1fcf77c8

Andy Fleming authored Jan 04, 2015

The Freescale mEMAC supports operating at 10/100/1000/10G, and
its associated MDIO controller is likewise capable of operating
both Clause 22 and Clause 45 MDIO buses. It is nearly identical
to the MDIO controller on the XGMAC, so we just modify that
driver.

Portions of this driver developed by:

Sandeep Singh <sandeep@freescale.com>
Roy Zang <tie-fei.zang@freescale.com>
Signed-off-by: Andy Fleming <afleming@gmail.com>
Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: Extend ethtool plugin module eeprom API to phylib · 2f438366

Ed Swierk authored Jan 02, 2015

This patch extends the ethtool plugin module eeprom API to support cards
whose phy support is delegated to a separate driver.

The handlers for ETHTOOL_GMODULEINFO and ETHTOOL_GMODULEEEPROM call the
module_info and module_eeprom functions if the phy driver provides them;
otherwise the handlers call the equivalent ethtool_ops functions provided
by network drivers with built-in phy support.
Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
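
The dispatch order described above can be sketched as follows. The struct and function names are simplified stand-ins invented for this example (they are not the kernel's `struct phy_driver` or `struct ethtool_ops` definitions), and the `-1` return for "not supported" is likewise an assumption.

```c
#include <stddef.h>

struct module_info {
    int type;
};

/* Simplified stand-ins for the PHY driver and NIC driver hooks. */
struct phy_driver {
    int (*module_info)(struct module_info *mi);
};

struct ethtool_ops {
    int (*get_module_info)(struct module_info *mi);
};

struct net_device {
    const struct phy_driver *phy_drv;
    const struct ethtool_ops *ops;
};

/* Prefer the PHY driver's hook; otherwise fall back to the NIC driver. */
static int get_module_info(const struct net_device *dev, struct module_info *mi)
{
    if (dev->phy_drv && dev->phy_drv->module_info)
        return dev->phy_drv->module_info(mi);
    if (dev->ops && dev->ops->get_module_info)
        return dev->ops->get_module_info(mi);
    return -1; /* hypothetical "not supported" code */
}

/* Demo hooks that record which path was taken. */
static int phy_module_info(struct module_info *mi) { mi->type = 1; return 0; }
static int nic_module_info(struct module_info *mi) { mi->type = 2; return 0; }
```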

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 3b421b80

Linus Torvalds authored Jan 06, 2015

Pull ext4 bugfixes from Ted Ts'o:
 "Revert a potential seek_data/hole regression which shows up when using
  ext4 to handle ext3 file systems, plus two minor bug fixes"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove spurious KERN_INFO from ext4_warning call
  Revert "ext4: fix suboptimal seek_{data,hole} extents traversial"
  ext4: prevent online resize with backup superblock

mm: propagate error from stack expansion even for guard page · fee7e49d

Linus Torvalds authored Jan 06, 2015

Jay Foad reports that the address sanitizer test (asan) sometimes gets
confused by a stack pointer that ends up being outside the stack vma
that is reported by /proc/maps.

This happens due to an interaction between RLIMIT_STACK and the guard
page: when we do the guard page check, we ignore the potential error
from the stack expansion, which effectively results in a missing guard
page, since the expected stack expansion won't have been done.

And since /proc/maps explicitly ignores the guard page (commit
d7824370: "mm: fix up some user-visible effects of the stack guard
page"), the stack pointer ends up being outside the reported stack area.

This is the minimal patch: it just propagates the error. It also
effectively makes the guard page part of the stack limit, which in turn
means that the actual real stack is one page less than the stack limit.

Let's see if anybody notices. We could teach acct_stack_growth() to
allow an extra page for a grow-up/grow-down stack in the rlimit test,
but I don't want to add more complexity if it isn't needed.
Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge · 627d2cc0

David S. Miller authored Jan 06, 2015

Included changes:
- ensure bonding is used (if enabled) for packets coming in the soft
  interface
- fix race condition to avoid orig_nodes to be deleted right after
  being added
- avoid false positive lockdep splats by assigning lockclass to
  the proper hashtable lock objects
- avoid miscounting of multicast 'disabled' nodes in the network
- fix memory leak in the Global Translation Table in case of
  originator interval change
Signed-off-by: David S. Miller <davem@davemloft.net>

xen-netback: fixing the propagation of the transmit shaper timeout · 07ff890d

Palik, Imre authored Jan 06, 2015

Since e9ce7cb6 ("xen-netback: Factor queue-specific data into queue struct"),
the transmit shaper timeout is always set to 0.  The value the user sets via
xenbus is never propagated to the transmit shaper.

This patch fixes the issue.

Cc: Anthony Liguori <aliguori@amazon.com>
Signed-off-by: Imre Palik <imrep@amazon.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Driver: Vmxnet3: Make Rx ring 2 size configurable · 53831aa1

Shrikrishna Khare authored Jan 06, 2015

Rx ring 2 size can be configured by adjusting rx-jumbo parameter
of ethtool -G.
Signed-off-by: Ramya Bolla <bollar@vmware.com>
Signed-off-by: Shreyas Bhatewara <sbhatewara@vmware.com>
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-for-davem-2015-01-06' of... · 15ecf7a0

David S. Miller authored Jan 06, 2015

Merge tag 'mac80211-for-davem-2015-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Here's just a single fix - a revert of a patch that broke the
p54 and cw1200 drivers (arguably due to bad assumptions there.)
Since this affects kernels since 3.17, I decided to revert for
now and we'll revisit this optimisation properly for -next.
Signed-off-by: David S. Miller <davem@davemloft.net>

r8152: support ndo_features_check · a5e31255

hayeswang authored Jan 06, 2015

Support ndo_features_check to avoid:
 - the transport offset is more than the hw limitation when using hw checksum.
 - the skb->len of a GSO packet is more than the limitation.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
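
A feature check of this kind can be sketched as a pure function that masks out offloads the hardware cannot handle for a given packet. The flag names, the limits `MAX_CSUM_OFFSET` and `MAX_GSO_LEN`, and `struct pkt` are hypothetical values picked for the example, not the r8152 driver's constants.

```c
#include <stdint.h>

#define F_HW_CSUM (1u << 0)
#define F_GSO     (1u << 1)

/* Hypothetical hardware limits, for illustration only. */
#define MAX_CSUM_OFFSET 1024u
#define MAX_GSO_LEN     16384u

struct pkt {
    uint32_t len;              /* total packet length */
    uint32_t transport_offset; /* offset of the transport header */
    uint32_t gso_size;         /* 0 for non-GSO packets */
};

/*
 * Return the subset of features the hardware can apply to this packet.
 * The stack then falls back to software for whatever was cleared.
 */
static uint32_t features_check(const struct pkt *p, uint32_t features)
{
    if (p->transport_offset > MAX_CSUM_OFFSET)
        features &= ~(F_HW_CSUM | F_GSO);
    else if (p->gso_size && p->len > MAX_GSO_LEN)
        features &= ~F_GSO;
    return features;
}
```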

arm_arch_timer: include clocksource.h directly · 7c8f1e78

Richard Cochran authored Jan 06, 2015

This driver makes use of the clocksource code. Previously it had only
included the proper header indirectly, but that chain was inadvertently
broken by 74d23cc7 "time: move the timecounter/cyclecounter code into its
own file."

This patch fixes the issue by including clocksource.h directly.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add PCI device ID for new T5 adapter · 678109ea

Hariprasad Shenai authored Jan 06, 2015

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

batman-adv: fix potential TT client + orig-node memory leak · 9d31b3ce

Linus Lüssing authored Dec 13, 2014

This patch fixes a potential memory leak which can occur once an
originator times out. On timeout the corresponding global translation
table entry might not get purged correctly. Furthermore, the non-purged
TT entry will cause its orig-node to leak, too. This can additionally
prevent the new multicast optimization feature from kicking in because
of the resulting bogus counter.

In detail: The batadv_tt_global_entry->orig_list holds the reference to
the orig-node. Usually this reference is released after
BATADV_PURGE_TIMEOUT through: _batadv_purge_orig()->
batadv_purge_orig_node()->batadv_update_route()->_batadv_update_route()->
batadv_tt_global_del_orig() which purges this global tt entry and
releases the reference to the orig-node.

However, if between two batadv_purge_orig_node() calls the orig-node
timeout grew to 2*BATADV_PURGE_TIMEOUT then this call path isn't
reached. Instead the corresponding orig-node is removed from the
originator hash in _batadv_purge_orig(), the batadv_update_route()
part is skipped and won't be reached anymore.

Fix the issue by moving batadv_tt_global_del_orig() out of the rcu
callback.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Acked-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>

batman-adv: fix multicast counter when purging originators · a5164886

Linus Lüssing authored Oct 30, 2014

When purging an orig_node we should only decrease the counter tracking
the number of nodes without multicast optimizations support if it was
increased through this orig_node before.

A not yet fully initialized orig_node (meaning it did not have its turn
in the mcast-tvlv handler so far) which gets purged would not adhere to
this and will lead to a counter imbalance.

Fix this by adding a check whether the orig_node is mcast-initialized
before decreasing the counter in the mcast-orig_node-purging routine.

Introduced by 60432d75
("batman-adv: Announce new capability via multicast TVLV")
Reported-by: Tobias Hachmer <tobias@hachmer.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>

batman-adv: fix counter for multicast supporting nodes · e8829f00

Linus Lüssing authored Oct 30, 2014

A miscounting of nodes having multicast optimizations enabled can lead
to multicast packet loss in the following scenario:

If the first OGM a node receives from another one has no multicast
optimizations support (no multicast tvlv) then we fail to
increase the counter. This potentially leads to the wrong assumption
that we could safely use multicast optimizations.

Fix this by increasing the counter if the initial OGM has the
multicast TVLV unset, too.

Introduced by 60432d75
("batman-adv: Announce new capability via multicast TVLV")
Reported-by: Tobias Hachmer <tobias@hachmer.de>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>

batman-adv: fix lock class for decoding hash in network-coding.c · f44d5407

Martin Hundebøll authored Nov 11, 2014

batadv_hash_set_lock_class() is called with the wrong hash table as first
argument (probably due to a copy-paste error), which leads to false
positives when running with lockdep.

Introduced-by: 612d2b4f
("batman-adv: network coding - save overheard and tx packets for decoding")
Signed-off-by: Martin Hundebøll <martin@hundeboll.net>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>

batman-adv: fix delayed foreign originator recognition · 2c667a33

Linus Lüssing authored Oct 30, 2014

Currently it can happen that the reception of an OGM from a new
originator is not being accepted. More precisely it can happen that
an originator struct gets allocated and initialized
(batadv_orig_node_new()), even the TQ gets calculated and set correctly
(batadv_iv_ogm_calc_tq()) but still the periodic orig_node purging
thread will decide to delete it if it has a chance to jump between
these two function calls.

This is because batadv_orig_node_new() initializes the last_seen value
to zero and its caller (batadv_iv_ogm_orig_get()) makes it visible to
other threads by adding it to the hash table already.
batadv_iv_ogm_calc_tq() will set the last_seen variable to the correct,
current time a few lines later but if the purging thread jumps in between
that it will think that the orig_node timed out and will wrongly
schedule it for deletion already.

If the purging interval is the same as the originator interval (which is
the default: 1 second), then this game can continue for several rounds
until the random OGM jitter added enough difference between these
two (in tests, two to about four rounds seemed common).

Fix this by initializing the last_seen variable of an orig_node
to the current time before adding it to the hash table.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
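
The race can be reduced to the two helpers below. This is a sketch with made-up names (`purge_due`, `orig_node_init`) and plain integer timestamps in place of jiffies. A node whose last_seen is still zero looks ancient to the purge check, while a node stamped with the current time before it becomes visible does not.

```c
#include <stdbool.h>

struct orig_node {
    unsigned long last_seen; /* timestamp of the last OGM, in ticks */
};

/* Purge check as the periodic worker might perform it. */
static bool purge_due(const struct orig_node *n, unsigned long now,
                      unsigned long timeout)
{
    return now - n->last_seen > timeout;
}

/*
 * Initialize last_seen to the current time. This has to happen
 * before the node is added to any shared hash table.
 */
static void orig_node_init(struct orig_node *n, unsigned long now)
{
    n->last_seen = now;
}
```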

batman-adv: fix and simplify condition when bonding should be used · 329887ad

Simon Wunderlich authored Aug 13, 2014

The current condition actually does NOT consider bonding when the
interface the packet came in from is the soft interface, which is the
opposite of what it should do (and of what the comment describes). Fix
that and slightly simplify the condition.
Reported-by: Ray Gibson <booray@gmail.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>

Merge branch 'rt_cong_ctrl' · a918eb9f

David S. Miller authored Jan 05, 2015

Daniel Borkmann says:

====================
net: allow setting congctl via routing table

This is the second part of our work and allows for setting the congestion
control algorithm via routing table. For details, please see individual
patches.

Since patch 1 is a bug fix, we suggest applying patch 1 to net, and then
merging net into net-next, for example, and following up with the remaining
feature patches wrt dependencies.

Joint work with Florian Westphal, suggested by Hannes Frederic Sowa.

Patch for iproute2 is available under [1], but will be reposted along
with the man-page update when this set hits net-next.

  [1] http://patchwork.ozlabs.org/patch/418149/

Thanks!

v2 -> v3:
 - Added module auto-loading as suggested by David Miller, thanks!
  - Added patch 2 for handling possible sleeps in fib6
  - While working on this, we discovered a bug, hence fix in patch 1
  - Added auto-loading to patch 4
 - Rebased, retested, rest the same.
v1 -> v2:
 - Very sorry, I noticed I had decnet disabled during testing.
   Added missing header include in decnet, rest as is.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tcp: add per route congestion control · 81164413

Daniel Borkmann authored Jan 05, 2015

This work adds the possibility to define a per route/destination
congestion control algorithm. Generally, this opens up the possibility
for a machine with different links to enforce specific congestion
control algorithms with optimal strategies for each of them based
on their network characteristics, even transparently for a single
application listening on all links.

For our specific use case, this additionally facilitates deployment
of DCTCP; for example, applications can easily serve internal
traffic/dsts with DCTCP and external ones with CUBIC. Other scenarios
would also allow for utilizing e.g. long living, low priority
background flows for certain destinations/routes while still being
able for normal traffic to utilize the default congestion control
algorithm. We also thought about a per netns setting (where different
defaults are possible), but given it's actually a link specific
property, we argue that a per route/destination setting is the most
natural and flexible.

The administrator can utilize this through ip-route(8) by appending
"congctl [lock] <name>", where <name> denotes the name of a
congestion control algorithm and the optional lock parameter
enforces the given algorithm so that applications in user space
are not allowed to override that algorithm for that destination.

The dst metric lookups are being done when a dst entry is already
available in order to avoid a costly lookup and still before the
algorithms are being initialized, thus overhead is very low when the
feature is not being used. While the client side would need to drop
the current reference on the module, on server side this can actually
even be avoided as we just got a flat-copied socket clone.

Joint work with Florian Westphal.
Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
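
The effect of the optional lock parameter can be modelled as below. This is a user-space sketch under stated assumptions: `struct route`, `struct sock`, `sock_init_cc` and `sock_set_cc` are invented for the example, and the `-1` error return is a placeholder rather than the kernel's actual error code.

```c
#include <stdbool.h>
#include <stddef.h>

struct route {
    const char *cc_name; /* NULL when the route sets no algorithm */
    bool cc_locked;      /* set when "lock" was given on the route */
};

struct sock {
    const char *cc_name;
    bool cc_locked;
};

/* Pick the route's algorithm when present, else the system default. */
static void sock_init_cc(struct sock *sk, const struct route *rt,
                         const char *dflt)
{
    if (rt && rt->cc_name) {
        sk->cc_name = rt->cc_name;
        sk->cc_locked = rt->cc_locked;
    } else {
        sk->cc_name = dflt;
        sk->cc_locked = false;
    }
}

/* Application request to change the algorithm, as via setsockopt(2). */
static int sock_set_cc(struct sock *sk, const char *name)
{
    if (sk->cc_locked)
        return -1; /* route-enforced algorithm cannot be overridden */
    sk->cc_name = name;
    return 0;
}
```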

net: tcp: add RTAX_CC_ALGO fib handling · ea697639

Daniel Borkmann authored Jan 05, 2015

This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
control metric to be set up and dumped back to user space.

While the internal representation of RTAX_CC_ALGO is handled as a u32
key, we avoided exposing this implementation detail to user space.
Instead, we chose the netlink attribute that is exchanged with
user space to be the actual congestion control algorithm name, similarly
as in the setsockopt(2) API in order to allow for maximum flexibility,
even for 3rd party modules.

It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
it should have been stored in RTAX_FEATURES instead. We first thought
about reusing it for the congestion control key, but it brings more
complications and/or confusion than it is worth.

Joint work with Florian Westphal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tcp: add key management to congestion control · c5c6a8ab

Daniel Borkmann authored Jan 05, 2015

This patch adds necessary infrastructure to the congestion control
framework for later per route congestion control support.

For a per route congestion control possibility, our aim is to store
a unique u32 key identifier into dst metrics, which can then be
mapped into a tcp_congestion_ops struct. We argue that having a
RTAX key entry is the most simple, generic and easy way to manage,
and also keeps the memory footprint of dst entries lower on 64 bit
than with storing a pointer directly, for example. Having a unique
key id also allows for decoupling actual TCP congestion control
module management from the FIB layer, i.e. we don't have to care
about expensive module refcounting inside the FIB at this point.

We first thought of using an IDR store for the realization, which
takes over dynamic assignment of unused key space and also performs
the key to pointer mapping in RCU. While doing so, we stumbled upon
the issue that due to the nature of dynamic key distribution, it
just so happens, arguably in very rare occasions, that excessive
module loads and unloads can lead to a possible reuse of previously
used key space. Thus, previously stale keys in the dst metric are
now being reassigned to a different congestion control algorithm,
which might lead to unexpected behaviour. One way to resolve this
would have been to walk FIBs on the actually rare occasion of a
module unload and reset the metric keys for each FIB in each netns,
but that's just very costly.

Therefore, we argue a better solution is to reuse the unique
congestion control algorithm name member and map that into u32 key
space through jhash. For that, we split the flags attribute (as it
currently uses 2 bits only anyway) into two u32 attributes, flags
and key, so that we can keep the cacheline boundary of 2 cachelines
on x86_64 and cache the precalculated key at registration time for
the fast path. On average we might expect 2 - 4 modules being loaded,
worst case perhaps 15, so a key collision possibility is extremely
low, and guaranteed collision-free on LE/BE for all in-tree modules.
Overall this results in much simpler code, and all without the
overhead of an IDR. Due to the deterministic nature, modules can
now be unloaded, the congestion control algorithm for a specific
but unloaded key will fall back to the default one, and on module
reload time it will switch back to the expected algorithm
transparently.

Joint work with Florian Westphal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
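
The name-to-key scheme and the fallback behaviour on module unload can be sketched as follows. Two things are assumptions made to keep the example self-contained: FNV-1a is swapped in for the kernel's jhash, and the array registry with `cc_register`, `cc_unregister` and `cc_find_key` is a hypothetical stand-in for the real congestion control list.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CC 16

struct cc_ops {
    const char *name;
    uint32_t key; /* cached at registration time for the fast path */
};

static struct cc_ops *registry[MAX_CC];
static size_t nregistered;

/* FNV-1a, used here only as a stand-in for jhash. */
static uint32_t name_to_key(const char *name)
{
    uint32_t h = 2166136261u;

    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 16777619u;
    }
    return h;
}

/* Register an algorithm; refuse it if the registry is full or keys collide. */
static int cc_register(struct cc_ops *ops)
{
    ops->key = name_to_key(ops->name);
    if (nregistered == MAX_CC)
        return -1;
    for (size_t i = 0; i < nregistered; i++)
        if (registry[i]->key == ops->key)
            return -1;
    registry[nregistered++] = ops;
    return 0;
}

static void cc_unregister(const struct cc_ops *ops)
{
    for (size_t i = 0; i < nregistered; i++) {
        if (registry[i] == ops) {
            registry[i] = registry[--nregistered];
            return;
        }
    }
}

/* Map a stored key back to an algorithm, or to dflt if it is not loaded. */
static const struct cc_ops *cc_find_key(uint32_t key, const struct cc_ops *dflt)
{
    for (size_t i = 0; i < nregistered; i++)
        if (registry[i]->key == key)
            return registry[i];
    return dflt;
}
```

Because the key depends only on the name, a key stored in a route stays meaningful across unload and reload of the module, which is the property the IDR approach lacked.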

net: tcp: refactor reinitialization of congestion control · 29ba4fff

Daniel Borkmann authored Jan 05, 2015

We can just move this to an extra function and make the code
a bit more readable, no functional change.

Joint work with Florian Westphal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3

Florian Westphal authored Jan 05, 2015

Do the nla validation earlier, outside the write lock.

This is needed by followup patch which needs to be able to call
request_module (which can sleep) if needed.

Joint work with Daniel Borkmann.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference · 0409c9a5

Daniel Borkmann authored Jan 05, 2015

When IPv6 host routes with metrics attached are being added, we fetch
the metrics store from the dst via COW through dst_metrics_write_ptr(),
added through commit e5fd387a.

One remaining problem here is that we actually call into inet_getpeer()
and may end up allocating/creating a new peer from the kmemcache, which
may fail.

Example trace from perf probe (inet_getpeer:41) where create is 1:

ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
  85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
  85e578 inet_getpeer
  8eb333 ipv6_cow_metrics
  8f10ff fib6_commit_metrics

Therefore, a check for NULL on the return of dst_metrics_write_ptr()
is necessary here.

Joint work with Florian Westphal.

Fixes: e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely")
Cc: Michal Kubeček <mkubecek@suse.cz>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
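
The shape of the fix, checking a copy-on-write accessor for NULL before writing through it, can be sketched in user space. The names `metrics_write_ptr` and `commit_metric`, the `fail_alloc` switch used to simulate allocation failure, and the fixed `NMETRICS` size are assumptions for this example only.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NMETRICS 4

struct dst {
    uint32_t *metrics; /* NULL until a writable copy is allocated */
};

/*
 * Return a writable metrics array, allocating it on first use.
 * May return NULL when the allocation fails (forced via fail_alloc).
 */
static uint32_t *metrics_write_ptr(struct dst *d, bool fail_alloc)
{
    if (!d->metrics && !fail_alloc)
        d->metrics = calloc(NMETRICS, sizeof(*d->metrics));
    return d->metrics;
}

/* Store one metric; the caller must keep idx below NMETRICS. */
static int commit_metric(struct dst *d, size_t idx, uint32_t val,
                         bool fail_alloc)
{
    uint32_t *mp = metrics_write_ptr(d, fail_alloc);

    if (!mp)
        return -ENOMEM; /* the check that was missing */
    mp[idx] = val;
    return 0;
}
```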

net: Do not call ndo_dflt_fdb_dump if ndo_fdb_dump is defined · 6cb69742

Hubert Sokolowski authored Jan 05, 2015

Add a check for whether the call to ndo_dflt_fdb_dump is needed.
Some drivers (e.g. qlcnic or macvlan) that define their own
ndo_fdb_dump do not expect ndo_dflt_fdb_dump to be called
unconditionally. Other drivers define their own ndo_fdb_dump
and don't want ndo_dflt_fdb_dump to be called at all.
At the same time it is desirable to call the default dump
function on a bridge device.
Fix the attributes that are passed to dev->netdev_ops->ndo_fdb_dump.
Add an extra check in br_fdb_dump to avoid duplicate entries,
as filter_dev can now be NULL.

The following filtering tests were performed before the change
and after the patch was applied, to make sure the results are
the same and the patch doesn't break the filtering algorithm.

[root@localhost ~]# cd /root/iproute2-3.18.0/bridge
[root@localhost bridge]# modprobe dummy
[root@localhost bridge]# ./bridge fdb add f1:f2:f3:f4:f5:f6 dev dummy0
[root@localhost bridge]# brctl addbr br0
[root@localhost bridge]# brctl addif  br0 dummy0
[root@localhost bridge]# ip link set dev br0 address 02:00:00:12:01:04
[root@localhost bridge]# # show all
[root@localhost bridge]# ./bridge fdb show
33:33:00:00:00:01 dev p2p1 self permanent
01:00:5e:00:00:01 dev p2p1 self permanent
33:33:ff:ac:ce:32 dev p2p1 self permanent
33:33:00:00:02:02 dev p2p1 self permanent
01:00:5e:00:00:fb dev p2p1 self permanent
33:33:00:00:00:01 dev p7p1 self permanent
01:00:5e:00:00:01 dev p7p1 self permanent
33:33:ff:79:50:53 dev p7p1 self permanent
33:33:00:00:02:02 dev p7p1 self permanent
01:00:5e:00:00:fb dev p7p1 self permanent
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by bridge
[root@localhost bridge]# ./bridge fdb show br br0
f2:46:50:85:6d:d9 dev dummy0 master br0 permanent
f2:46:50:85:6d:d9 dev dummy0 vlan 1 master br0 permanent
33:33:00:00:00:01 dev dummy0 self permanent
f1:f2:f3:f4:f5:f6 dev dummy0 self permanent
33:33:00:00:00:01 dev br0 self permanent
02:00:00:12:01:04 dev br0 vlan 1 master br0 permanent
02:00:00:12:01:04 dev br0 master br0 permanent
[root@localhost bridge]# # filter by port
[root@localhost bridge]# ./bridge fdb show brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]# # filter by port + bridge
[root@localhost bridge]# ./bridge fdb show br br0 brport dummy0
f2:46:50:85:6d:d9 master br0 permanent
f2:46:50:85:6d:d9 vlan 1 master br0 permanent
33:33:00:00:00:01 self permanent
f1:f2:f3:f4:f5:f6 self permanent
[root@localhost bridge]#
Signed-off-by: Hubert Sokolowski <hubert.sokolowski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
