Commits · c0451fe1f27b815b3f400df2a63b9aecf589b7b0 · Kirill Smelkov / linux

23 Aug, 2016 1 commit

net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset · c0451fe1

Shmulik Ladkani authored Aug 21, 2016

In b8247f09,

   "net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"

gso skbs arriving from an ingress interface that go through UDP
tunneling are allowed to be fragmented if the resulting encapsulated
segments exceed the dst mtu of the egress interface.

This aligned the behavior of gso skbs with that of non-gso skbs going
through the udp encapsulation path.

However the non-gso vs gso anomaly is present also in the following
cases of a GRE tunnel:
 - ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
   (e.g. OvS vport-gre with df_default=false)
 - ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set

In both of the above cases, the non-gso skbs get fragmented, whereas the
gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
as they don't go through the segment+fragment code path.

Fix: Set IPSKB_FRAG_SEGS if the tunnel-specified IP_DF bit is NOT set.

Tunnels that do set IP_DF will not have their segments fragmented.
This preserves the behavior of ip_gre in (the default) pmtudisc mode.

Fixes: b8247f09 ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
Reported-by: wenxu <wenxu@ucloud.cn>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Tested-by: wenxu <wenxu@ucloud.cn>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
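
A rough Python model of the decision this commit describes, not the kernel code itself. The names `GsoSkb` and `gso_action` are hypothetical, and the real ip_finish_output_gso() path checks more conditions than this sketch does:

```python
from dataclasses import dataclass

IP_DF = 0x4000  # "don't fragment" bit in the IPv4 frag_off field


@dataclass
class GsoSkb:
    seglen: int           # stands in for skb_gso_network_seglen()
    tunnel_frag_off: int  # frag_off the tunnel chose for the outer header


def gso_action(skb: GsoSkb, mtu: int) -> str:
    """Return how this simplified model treats a tunneled GSO skb on output."""
    if skb.seglen <= mtu:
        return "send"              # every segment already fits the egress MTU
    if skb.tunnel_frag_off & IP_DF:
        return "drop"              # tunnel asked for DF: segments are not fragmented
    return "segment+fragment"      # DF unset: segment, then fragment each segment
```

With an egress MTU of 1400, a 1500-byte segment is fragmented when the tunnel left DF unset and dropped when the tunnel set it.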

22 Aug, 2016 9 commits

net: ipv6: Remove addresses for failures with strict DAD · 85b51b12

Mike Manning authored Aug 18, 2016

If DAD fails with accept_dad set to 2, global addresses and host routes
are incorrectly left in place. Even though disable_ipv6 is set,
contrary to documentation, the addresses are not dynamically deleted
from the interface. It is only on a subsequent link down/up that these
are removed. The fix is not only to set the disable_ipv6 flag, but
also to call addrconf_ifdown(), which is the action to carry out when
disabling IPv6. This results in the addresses and routes being deleted
immediately. The DAD failure for the LL addr is determined as before
via netlink, or by the absence of the LL addr (which also previously
would have had to be checked for in case of an intervening link down
and up). As the call to addrconf_ifdown() requires an rtnl lock, the
logic to disable IPv6 when DAD fails is moved to addrconf_dad_work().

Previous behavior:

root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
net.ipv6.conf.eth3.accept_dad = 2
root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
5: eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qlen 1000
    inet6 2000::10/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe43:dd5a/64 scope link tentative dadfailed
       valid_lft forever preferred_lft forever
root@vm1:/# ip -6 route show dev eth3
2000::/64  proto kernel  metric 256
fe80::/64  proto kernel  metric 256
root@vm1:/# ip link set down eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
root@vm1:/# ip -6 route show dev eth3
root@vm1:/#

New behavior:

root@vm1:/# sysctl net.ipv6.conf.eth3.accept_dad=2
net.ipv6.conf.eth3.accept_dad = 2
root@vm1:/# ip -6 addr add 2000::10/64 dev eth3
root@vm1:/# ip link set up eth3
root@vm1:/# ip -6 addr show dev eth3
root@vm1:/# ip -6 route show dev eth3
root@vm1:/#
Signed-off-by: Mike Manning <mmanning@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/ipx.h: fix conflicting defitions with glibc netipx/ipx.h · 53dc65d4

Mikko Rapeli authored Aug 22, 2016

Fixes these compiler errors via libc-compat.h when glibc netipx/ipx.h is
included before linux/ipx.h:

./linux/ipx.h:9:8: error: redefinition of ‘struct sockaddr_ipx’
./linux/ipx.h:26:8: error: redefinition of ‘struct ipx_route_definition’
./linux/ipx.h:32:8: error: redefinition of ‘struct ipx_interface_definition’
./linux/ipx.h:49:8: error: redefinition of ‘struct ipx_config_data’
./linux/ipx.h:58:8: error: redefinition of ‘struct ipx_route_def’
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/openvswitch.h: use __u32 from linux/types.h · a1d1f65f

Mikko Rapeli authored Aug 22, 2016

Kernel uapi headers are supposed to use the types from linux/types.h. Fixes userspace compile error:

linux/openvswitch.h:583:2: error: unknown type name ‘uint32_t’
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/atm_zatm.h: include linux/time.h · cf00713a

Mikko Rapeli authored Aug 22, 2016

Fixes userspace compile error:

error: field ‘real’ has incomplete type
 struct timeval real;  /* real (wall-clock) time */
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/openvswitch.h: use __u32 from linux/types.h · e6571aa5

Mikko Rapeli authored Aug 22, 2016

Fixes userspace compiler error:

error: unknown type name ‘uint32_t’
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/if_pppox.h: include linux/in.h and linux/in6.h · eafe9211

Mikko Rapeli authored Aug 22, 2016

Fixes userspace compilation errors:

error: field ‘addr’ has incomplete type
 struct sockaddr_in addr; /* IP address and port to send to */

error: field ‘addr’ has incomplete type
 struct sockaddr_in6 addr; /* IP address and port to send to */
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/if_pppol2tp.h: include linux/in.h and linux/in6.h · 05ee5de7

Mikko Rapeli authored Aug 22, 2016

Fixes userspace compilation errors like:

error: field ‘addr’ has incomplete type
 struct sockaddr_in addr; /* IP address and port to send to */
                    ^
error: field ‘addr’ has incomplete type
 struct sockaddr_in6 addr; /* IP address and port to send to */
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/if_tunnel.h: include linux/if.h, linux/ip.h and linux/in6.h · 1fe8e0f0

Mikko Rapeli authored Aug 22, 2016

Fixes userspace compilation errors like:

error: field ‘iph’ has incomplete type
error: field ‘prefix’ has incomplete type
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

include/uapi/linux/if_pppox.h: include linux/if.h · b47b0cc7

Mikko Rapeli authored Aug 22, 2016

Fixes userspace compilation error:

error: ‘IFNAMSIZ’ undeclared here (not in a function)
Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

21 Aug, 2016 2 commits

net: tehuti: fix typo: "eneble" -> "enable" · d524d84b

Colin Ian King authored Aug 21, 2016

trivial typo fix in pr_err message
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: xilinx: emaclite: Fallback to random MAC address. · 5575cf13

Daniel Romell authored Aug 19, 2016

If the address configured in the device tree is invalid, the
driver will fall back to using a random address from the locally
administered range.
Signed-off-by: Daniel Romell <daro@hms.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
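
The fallback can be illustrated with a small Python sketch. This is not the driver code: `is_valid_ether_addr`, `random_local_addr` and `pick_mac` are illustrative helpers written for this note, modelled on the usual rule that a usable unicast MAC is non-zero with the multicast bit clear, and that a locally administered address has bit 0x02 set in its first octet:

```python
import os
from typing import Optional


def is_valid_ether_addr(addr: bytes) -> bool:
    """Six bytes, not all zero, and multicast bit (0x01 of octet 0) clear."""
    return len(addr) == 6 and any(addr) and not (addr[0] & 0x01)


def random_local_addr() -> bytes:
    """Random unicast address from the locally administered range."""
    raw = bytearray(os.urandom(6))
    raw[0] &= 0xFE  # clear the multicast bit
    raw[0] |= 0x02  # set the locally administered bit
    return bytes(raw)


def pick_mac(dt_addr: Optional[bytes]) -> bytes:
    """Use the device-tree address when valid, otherwise fall back to random."""
    if dt_addr is not None and is_valid_ether_addr(dt_addr):
        return dt_addr
    return random_local_addr()
```

An all-zero or missing device-tree address therefore yields a random address whose first octet has 0x02 set and 0x01 clear.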

20 Aug, 2016 5 commits

vmxnet3: fix tx data ring copy for variable size · ff2e7d5d

Shrikrishna Khare authored Aug 19, 2016

Commit 3c8b3efc ("vmxnet3: allow variable length transmit data ring
buffer") changed the size of the buffers in the tx data ring from a
fixed size of 128 bytes to a variable size.

However, while copying data to the data ring, vmxnet3_copy_hdr still
carries the old code that assumes a fixed buffer size of 128 bytes. This
patch fixes it by computing the offset from the actual data ring buffer
size.
Signed-off-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
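
The offset problem can be shown with plain arithmetic. The sketch below is a Python illustration only; `old_offset` and `data_ring_offset` are hypothetical names, and the only figures taken from the commit message are the old fixed size of 128 bytes and the fact that the size is now variable:

```python
OLD_FIXED_SIZE = 128  # buffer size the old copy code assumed, in bytes


def old_offset(index: int) -> int:
    """Byte offset of ring entry `index` under the fixed-size assumption."""
    return index * OLD_FIXED_SIZE


def data_ring_offset(index: int, buf_size: int) -> int:
    """Byte offset of ring entry `index` when each buffer is `buf_size` bytes."""
    return index * buf_size
```

With 256-byte buffers, entry 1 really starts at byte 256, but the old computation gives 128, which lands inside entry 0's buffer.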

ixgbe: Do not clear RAR entry when clearing VMDq for SAN MAC · c10ac75a

Alexander Duyck authored Aug 19, 2016

The RAR entry for the SAN MAC address was being cleared when we were
clearing the VMDq pool bits.  In order to prevent this we need to add
an extra check to protect the SAN MAC from being cleared.

Fixes: 6e982aea ("ixgbe: Clear stale pool mappings")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_buffers: Fix pool value handling in mlxsw_sp_sb_tc_pool_bind_set · 8912862f

Jiri Pirko authored Aug 19, 2016

The pool index has to be converted by the get_pool helper to work correctly for
the egress pool. In mlxsw the egress pool index starts from 0.

Fixes: 0f433fa0 ("mlxsw: spectrum_buffers: Implement shared buffer configuration")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
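
A sketch of the kind of conversion the message refers to, in Python rather than driver C. It assumes that the caller numbers ingress pools first and egress pools after them, and uses a made-up `POOL_COUNT` of 4 pools per direction; both are assumptions for illustration, and `dir_get` and `pool_get` here are stand-ins, not the driver's helpers:

```python
POOL_COUNT = 4  # assumed number of pools per direction (illustrative value)


def dir_get(pool_index: int) -> str:
    """Direction implied by a global pool index (ingress pools come first)."""
    return "ingress" if pool_index < POOL_COUNT else "egress"


def pool_get(pool_index: int) -> int:
    """Per-direction pool number, so the first egress pool maps to 0."""
    return pool_index % POOL_COUNT
```

Under these assumptions global index 4 is egress pool 0; passing the raw index 4 to the hardware instead would name a pool that does not exist in that direction.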

l2tp: Fix the connect status check in pppol2tp_getname · 56cff471

Gao Feng authored Aug 19, 2016

sk->sk_state is a set of bit flags, so the check needs a bit
operation instead of a value comparison.
Signed-off-by: Gao Feng <fgao@ikuai8.com>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
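
The difference between the two checks is easy to see in a short Python model. The `State` enum and its values are illustrative stand-ins for the socket state flags, not a copy of the kernel definitions:

```python
from enum import IntFlag


class State(IntFlag):
    NONE = 0
    CONNECTED = 1
    BOUND = 2


def connected_by_value(state: State) -> bool:
    """Old-style check: only true when CONNECTED is the sole bit set."""
    return state == State.CONNECTED


def connected_by_bit(state: State) -> bool:
    """Bit check: true whenever the CONNECTED bit is set."""
    return bool(state & State.CONNECTED)
```

A socket whose state is `CONNECTED | BOUND` fails the value comparison even though it is connected; the bit test reports it correctly.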

sctp: linearize early if it's not GSO · 4c2f2454

Marcelo Ricardo Leitner authored Aug 18, 2016

Otherwise, when crc computation is still needed, it is far more
expensive than on a linear buffer, to the point that it affects
performance.

It's so expensive that a netperf test gives the perf output below:

Overhead Command Shared Object Symbol
18,62% netserver [kernel.vmlinux] [k] crc32_generic_shift
2,57% netserver [kernel.vmlinux] [k] __pskb_pull_tail
1,94% netserver [kernel.vmlinux] [k] fib_table_lookup
1,90% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string
1,66% swapper [kernel.vmlinux] [k] intel_idle
1,63% netserver [kernel.vmlinux] [k] _raw_spin_lock
1,59% netserver [sctp] [k] sctp_packet_transmit
1,55% netserver [kernel.vmlinux] [k] memcpy_erms
1,42% netserver [sctp] [k] sctp_rcv

# netperf -H 192.168.10.1 -l 10 -t SCTP_STREAM -cC -- -m 12000
SCTP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.1 () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB

212992 212992 12000 10.00 3016.42 2.88 3.78 1.874 2.462

After patch:
Overhead Command Shared Object Symbol
2,75% netserver [kernel.vmlinux] [k] memcpy_erms
2,63% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string
2,39% netserver [kernel.vmlinux] [k] fib_table_lookup
2,04% netserver [kernel.vmlinux] [k] __pskb_pull_tail
1,91% netserver [kernel.vmlinux] [k] _raw_spin_lock
1,91% netserver [sctp] [k] sctp_packet_transmit
1,72% netserver [mlx4_en] [k] mlx4_en_process_rx_cq
1,68% netserver [sctp] [k] sctp_rcv

212992 212992 12000 10.00 3681.77 3.83 3.46 2.045 1.849

Fixes: 3acb50c1 ("sctp: delay as much as possible skb_linearize")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

19 Aug, 2016 21 commits

Merge branch 'mlx5-fixes' · 187335cd

David S. Miller authored Aug 19, 2016

Saeed Mahameed says:

====================
Mellanox 100G mlx5 fixes 2016-08-16

This series includes some bug fixes for mlx5e driver.

From Saeed and Tariq, Optimize MTU change to not reset when it is not required.

From Paul, Command interface message length check to speedup firmware
command preparation.

From Mohamad, Save pci state when pci error is detected.

From Amir, Flow counters "lastuse" update fix.

From Hadar, Use correct flow dissector key on flower offloading.
Plus a small optimization for switchdev hardware id query.

From Or, three patches to address some E-Switch offloads issues.

For -stable of 4.6.y and 4.7.y:
    net/mlx5e: Use correct flow dissector key on flower offloading
    net/mlx5: Fix pci error recovery flow
    net/mlx5: Added missing check of msg length in verifying its signature
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: E-Switch, Avoid ACLs in the offloads mode · f96750f8

Or Gerlitz authored Aug 18, 2016

When we are in the switchdev/offloads mode, HW matching is done as
dictated by the offloaded rules and hence we don't need to enable
the ACLs mechanism used by the legacy mode.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: E-Switch, Set the send-to-vport rules in the correct table · 1a8ee6f2

Or Gerlitz authored Aug 18, 2016

While adding actual offloading support to the new switchdev mode, we didn't
change the setup of the send-to-vport rules to put them in the slow path
table, fix that.

Fixes: 1033665e ('net/mlx5: E-Switch, Use two priorities for SRIOV offloads mode')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: E-Switch, Return the correct devlink e-switch mode · ef78618b

Or Gerlitz authored Aug 18, 2016

Since mlx5 has also the NONE e-switch mode, we must translate from mlx5
mode to devlink mode on the devlink eswitch mode get call, do that.

While here, remove the mlx5_ prefix from the static function helpers
that deal with the mode to comply with the rest of the code.

Fixes: c930a3ad ('net/mlx5e: Add devlink based SRIOV mode change')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Retrieve the switchdev id from the firmware only once · dbe413e3

Hadar Hen Zion authored Aug 18, 2016

Avoid firmware command execution each time the switchdev HW ID attr get
call is made. We do that by reading the ID (PF NIC MAC) only once at
load time and store it on the representor structure.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Use correct flow dissector key on flower offloading · 1dbd0d37

Hadar Hen Zion authored Aug 18, 2016

The wrong key is used when extracting the address type field set by
the flower offload code. We have to use the control key and not the
basic key, fix that.

Fixes: e3a2b7ed ('net/mlx5e: Support offload cls_flower with drop action')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Update last-use statistics for flow rules · 6c3b4f90

Amir Vadai authored Aug 18, 2016

Set the lastuse statistic when the number of packets has changed since
the last query. This was wrongly dropped when bulk counter reading was added.

Fixes: a351a1b0 ('net/mlx5: Introduce bulk reading of flow counters')
Signed-off-by: Amir Vadai <amirva@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
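
The rule can be sketched as follows. This is a Python illustration with hypothetical names (`FlowCounter`, `apply_query`); the timestamp is passed in by the caller where the driver would use its own time source:

```python
from dataclasses import dataclass


@dataclass
class FlowCounter:
    packets: int = 0
    nbytes: int = 0
    lastuse: float = 0.0


def apply_query(counter: FlowCounter, packets: int, nbytes: int, now: float) -> None:
    """Store a fresh reading; bump lastuse only if the packet count moved."""
    if packets != counter.packets:
        counter.lastuse = now
    counter.packets = packets
    counter.nbytes = nbytes
```

A query that returns the same packet count leaves `lastuse` alone, so an idle flow keeps ageing.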

net/mlx5: Added missing check of msg length in verifying its signature · 2c0f8ce1

Paul Blakey authored Aug 18, 2016

Set and verify signature calculates the signature for each of the
mailbox nodes, even for those that are unused (from cache). Added
a missing length check to set and verify only those which are used.

While here, also moved the setting of msg's nodes token to where we
already go over them. This saves a pass because checksum is disabled,
and the only useful thing remaining that set signature does is setting
the token.

Fixes: e126ba97 ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5: Fix pci error recovery flow · 1061c90f

Mohamad Haj Yahia authored Aug 18, 2016

When PCI error is detected we should save the state of the pci prior to
disabling it.

Also when receiving pci slot reset call we need to verify that the
device is responsive.

Fixes: 89d44f0a ('net/mlx5_core: Add pci error handlers to mlx5_core driver')
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Optimization for MTU change · 506753b0

Tariq Toukan authored Aug 18, 2016

Avoid unnecessary interface down/up operations upon an MTU change
when it does not affect the rings configuration.

Fixes: 461017cb ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Set port MTU on netdev creation rather on open · 13f9bba7

Saeed Mahameed authored Aug 18, 2016

Port mtu shouldn't be written to hardware on every single interface
open.
Here we set it only when needed, on change_mtu and netdevice creation.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Fix the description of pos and bits · 98a384ec

Xunlei Pang authored Aug 18, 2016

1) Fix one typo: s/tn/tp/
2) Fix the description about the "u" bits.
Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'kaweth-oopses' · 4587a996

David S. Miller authored Aug 18, 2016

Oliver Neukum says:

====================
fixes to kaweth in response to Umap2 testing

These patches fix an oops in firmware downloading and an oops due
to a memory allocation failure
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

kaweth: fix oops upon failed memory allocation · 575ced7f

Oliver Neukum authored Aug 17, 2016

Just return an error upon failure.
Signed-off-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

kaweth: fix firmware download · 60bcabd0

Oliver Neukum authored Aug 17, 2016

This fixes the oops discovered by the Umap2 project and Alan Stern.
The intf member needs to be set before the firmware is downloaded.
Signed-off-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bgmac: fix reversed check for MII registration error · b9f63ae7

Rafał Miłecki authored Aug 17, 2016

It was failing on successful registration, returning meaningless errors.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Fixes: 55954f3b ("net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file")
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix use after free in tcp_xmit_retransmit_queue() · bb1fceca

Eric Dumazet authored Aug 17, 2016

When tcp_sendmsg() allocates a fresh and empty skb, it puts it at the
tail of the write queue using tcp_add_write_queue_tail().

Then it attempts to copy user data into this fresh skb.

If the copy fails, we undo the work and remove the fresh skb.

Unfortunately, this undo does not revert the change made to
tp->highest_sack, and we can leave a dangling pointer (to a freed skb).

Later, tcp_xmit_retransmit_queue() can dereference this pointer and
access freed memory. For regular kernels where memory is not unmapped,
this might cause SACK bugs because tcp_highest_sack_seq() is buggy,
returning garbage instead of tp->snd_nxt, but with various debug
features like CONFIG_DEBUG_PAGEALLOC, this can crash the kernel.

This bug was found by Marco Grassi thanks to syzkaller.

Fixes: 6859d494 ("[TCP]: Abstract tp->highest_sack accessing & point to next skb")
Reported-by: Marco Grassi <marco.gra@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
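
A toy Python model of the sequence described above. `WriteQueue` and its methods are hypothetical; Python has no dangling pointers, so the bug shows up here as `highest_sack` still naming an skb that is no longer in the queue:

```python
class WriteQueue:
    """Minimal stand-in for a socket write queue with a highest_sack hint."""

    def __init__(self):
        self.skbs = []
        self.highest_sack = None

    def add_tail(self, skb):
        # Queueing a fresh skb may also record it as highest_sack.
        self.skbs.append(skb)
        if self.highest_sack is None:
            self.highest_sack = skb

    def unlink_buggy(self, skb):
        # Undo that forgets the highest_sack side effect.
        self.skbs.remove(skb)

    def unlink_fixed(self, skb):
        # Undo that also clears highest_sack when it names the removed skb.
        if self.highest_sack is skb:
            self.highest_sack = None
        self.skbs.remove(skb)
```

After `unlink_buggy()` the hint still refers to the removed skb, which in the kernel is freed memory; `unlink_fixed()` resets it.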

cxgb4: Fixes resource allocation for ULD's in kdump kernel · e0d8b290

Hariprasad Shenai authored Aug 17, 2016

At present the kdump kernel check does not disable allocation of
resources when CONFIG_CHELSIO_T4_DCB is defined. Move the code outside
the #defines so that allocation is disabled in a kdump kernel
irrespective of the #define.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: Fix OOPs with ethtool --register-dump · 1423661f

David Daney authored Aug 16, 2016

The ethtool_ops .get_regs function attempts to read the nonexistent
register NIC_QSET_SQ_0_7_CNM_CHG, which produces a "bus error" type
OOPs.

Fix by not attempting to read, and removing the definition of,
NIC_QSET_SQ_0_7_CNM_CHG.  A zero is written into the register dump to
keep the layout unchanged.
Signed-off-by: David Daney <david.daney@cavium.com>
Cc: <stable@vger.kernel.org> # 4.4.x-
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: Fix Tx timeout due to xmit_more · 039a3927

Yuval Mintz authored Aug 16, 2016

Driver uses netif_tx_queue_stopped() to make sure the xmit_more
indication will be honored, but that only checks for DRV_XOFF.

At the same time, it's possible that during transmission the DQL will
close the transmission queue with STACK_XOFF indication.
In re-configuration flows, when the threshold is relatively low, it's
possible that the device has no pending transmissions, and during
transmission the driver would miss doorbelling the HW.
Since there are no pending transmissions, there will never be a Tx
completion [and thus the DQL would not remove the STACK_XOFF indication],
eventually causing the Tx queue to time out.

While we're at it - also doorbell in case the driver has to close the
transmission queue on its own [although this one is less important -
if the ring is full, we're bound to receive a completion eventually,
which means the doorbell would only be postponed and not indefinitely
blocked].

Fixes: 312e0676 ("qede: Utilize xmit_more")
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
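
The doorbell decision can be modelled in a few lines of Python. The flag values and function names are illustrative assumptions; only the DRV_XOFF and STACK_XOFF names and the behaviour described in the message come from the commit:

```python
DRV_XOFF = 0x1    # queue stopped by the driver (illustrative bit value)
STACK_XOFF = 0x2  # queue stopped by the stack, e.g. by DQL (illustrative bit value)


def should_doorbell_old(xmit_more: bool, queue_state: int) -> bool:
    """Old behaviour: only a driver-initiated stop overrides xmit_more."""
    return (not xmit_more) or bool(queue_state & DRV_XOFF)


def should_doorbell_fixed(xmit_more: bool, queue_state: int) -> bool:
    """Fixed behaviour: any stop reason overrides xmit_more."""
    return (not xmit_more) or bool(queue_state & (DRV_XOFF | STACK_XOFF))
```

With `xmit_more` set and the queue closed by STACK_XOFF, the old check skips the doorbell and nothing ever completes; the fixed check rings it.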

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · 53409afd

David S. Miller authored Aug 18, 2016

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter updates for your net tree,
they are:

1) Dump only conntrack that belong to this namespace via /proc file.
   This is some fallout from the conversion to single conntrack table
   for all netns, patch from Liping Zhang.

2) Missing MODULE_ALIAS_NF_LOGGER() for the ARP family that prevents
   module autoloading, also from Liping Zhang.

3) Report overquota event to the right netnamespace, again from Liping.

4) Fix tproxy listener sk refcount that leads to crash, from
   Eric Dumazet.

5) Fix racy refcounting on object deletion from nfnetlink and rule
   removal both for nfacct and cttimeout, from Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

18 Aug, 2016 2 commits

netfilter: cttimeout: fix use after free error when delete netns · b75911b6

Liping Zhang authored Aug 18, 2016

In general, when we want to delete a netns, cttimeout_net_exit will
be called before ipt_unregister_table, i.e. before ctnl_timeout_put.

But after calling kfree_rcu in cttimeout_net_exit, we still decrease
the timeout object's refcnt in ctnl_timeout_put. This is incorrect
and causes a use after free error.

It is easy to reproduce this problem:
  # while : ; do
  ip netns add xxx
  ip netns exec xxx nfct add timeout testx inet icmp timeout 200
  ip netns exec xxx iptables -t raw -p icmp -I OUTPUT -j CT --timeout testx
  ip netns del xxx
  done

  =======================================================================
  BUG kmalloc-96 (Tainted: G    B       E  ): Poison overwritten
  -----------------------------------------------------------------------
  INFO: 0xffff88002b5161e8-0xffff88002b5161e8. First byte 0x6a instead of
  0x6b
  INFO: Allocated in cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
  age=104 cpu=0 pid=3330
  ___slab_alloc+0x4da/0x540
  __slab_alloc+0x20/0x40
  __kmalloc+0x1c8/0x240
  cttimeout_new_timeout+0xd4/0x240 [nfnetlink_cttimeout]
  nfnetlink_rcv_msg+0x21a/0x230 [nfnetlink]
  [ ... ]

So call kfree_rcu to free the timeout object only when the refcnt has
dropped to 0. And, as nfnetlink_acct does, use atomic_cmpxchg to
avoid a race between ctnl_timeout_try_del and ctnl_timeout_put.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nfnetlink_acct: fix race between nfacct del and xt_nfacct destroy · 12be15dd

Liping Zhang authored Aug 13, 2016

Suppose that we input the following commands at first:
  # nfacct add test
  # iptables -A INPUT -m nfacct --nfacct-name test

Now the "test" acct's refcnt is 2, but later, when we try to delete the
"test" nfacct and the related iptables rule at the same time, a race may
happen:
      CPU0                                    CPU1
  nfnl_acct_try_del                      nfnl_acct_put
  atomic_dec_and_test //ref=1,testfail          -
       -                                 atomic_dec_and_test //ref=0,testok
       -                                 kfree_rcu
  atomic_inc //ref=1                            -

So after the rcu grace period, nf_acct will be freed while it is still linked
in the nfnl_acct_list; a later access to it will then cause an oops.

Convert the atomic_dec_and_test and atomic_inc combination into one atomic
operation, atomic_cmpxchg, to fix this problem.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
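
Why a single compare-and-exchange closes the window can be sketched in Python. `AtomicInt` is a lock-based stand-in written for this note, not a kernel API, and `try_del` only models the "delete if we hold the last reference" step:

```python
import threading


class AtomicInt:
    """Lock-based stand-in for an atomic counter."""

    def __init__(self, value: int):
        self._value = value
        self._lock = threading.Lock()

    def read(self) -> int:
        with self._lock:
            return self._value

    def dec_and_test(self) -> bool:
        """Decrement and report whether the counter reached zero."""
        with self._lock:
            self._value -= 1
            return self._value == 0

    def cmpxchg(self, old: int, new: int) -> int:
        """Set to `new` only if currently `old`; always return the prior value."""
        with self._lock:
            prev = self._value
            if prev == old:
                self._value = new
            return prev


def try_del(refcnt: AtomicInt) -> bool:
    """Succeed only when moving 1 -> 0 in one step; never touch a busy counter."""
    return refcnt.cmpxchg(1, 0) == 1
```

With the refcnt at 2, `try_del()` fails without modifying the counter, so a concurrent put can never observe a transient value and free an object that is still on the list.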
