Commits · 73570bf19faae9da0941ba4a10989dfe085119ec · Kirill Smelkov / linux

02 May, 2018 15 commits

r8169: merge r810x_pll_power_down/up into r8168_pll_power_down/up · 73570bf1

Heiner Kallweit authored May 02, 2018

r810x_pll_power_down/up and r8168_pll_power_down/up have a lot in common,
so we can simplify the code by merging the former into the latter.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

73570bf1

r8169: remove 810x_phy_power_up/down · 40242e23

Heiner Kallweit authored May 02, 2018

The functionality of 810x_phy_power_up/down is covered by the default
clause in 8168_phy_power_up/down. Therefore we don't need these
functions.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

40242e23

r8169: remove unneeded check in r8168_pll_power_down · 6851d025

Heiner Kallweit authored May 02, 2018

RTL_GIGA_MAC_VER_23/24 are configured by rtl_hw_start_8168cp_2()
and rtl_hw_start_8168cp_3() respectively which both apply
CPCMD_QUIRK_MASK, thus clearing bit ASF.

Bit ASF isn't set at any other place in the driver, therefore this
check can be removed.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6851d025

Merge branch 'net-smc-small-features' · 77ec3a0e

David S. Miller authored May 02, 2018

Ursula Braun says:

====================
net/smc: small features 2018/04/30

here are 4 smc patches for net-next covering small new features
in different areas:
   * link health check
   * diagnostics for IPv6 smc sockets
   * ioctl
   * improvement for vlan determination

v2 changes:
   * better title
   * patch 2 - remove compile problem for disabled CONFIG_IPV6
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

77ec3a0e

net/smc: determine vlan_id of stacked net_device · cb9d43f6

Ursula Braun authored May 02, 2018

An SMC link group is bound to a specific vlan_id. Its link uses
the RoCE-GIDs established for the specific vlan_id. This patch makes
sure the appropriate vlan_id is determined for stacked scenarios like
for instance a master bonding device with vlan devices enslaved.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb9d43f6

net/smc: handle ioctls SIOCINQ, SIOCOUTQ, and SIOCOUTQNSD · 9b67e26f

Ursula Braun authored May 02, 2018

SIOCINQ returns the amount of unread data in the RMB.
SIOCOUTQ returns the amount of unsent or unacked sent data in the send
buffer.
SIOCOUTQNSD returns the amount of data prepared for sending, but
not yet sent.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b67e26f

net/smc: ipv6 support for smc_diag.c · ed75986f

Karsten Graul authored May 02, 2018

Update smc_diag.c to support ipv6 addresses on the diagnosis interface.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed75986f

net/smc: periodic testlink support · 877ae5be

Karsten Graul authored May 02, 2018

Add periodic LLC testlink support to ensure the link is still active.
The interval time is initialized using the value of
sysctl_tcp_keepalive_time.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

877ae5be

Merge branch 'mlxsw-Reject-unsupported-FIB-configurations' · e90c1a10

David S. Miller authored May 02, 2018

Ido Schimmel says:

====================
mlxsw: Reject unsupported FIB configurations

Recently it became possible for listeners of the FIB notification chain
to veto operations such as addition of routes and rules.

Adjust the mlxsw driver to take advantage of it and return an error for
unsupported FIB rules and for routes configured after the abort
mechanism was triggered (due to exceeded resources for example).

v2:
* Change error code in first patch to -EOPNOTSUPP (David Ahern).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e90c1a10

mlxsw: spectrum_router: Return an error for routes added after abort · 50d10711

Ido Schimmel authored May 02, 2018

We currently do not perform accounting in the driver and thus can't
reject routes before resources are exceeded.

However, in order to make users aware of the fact that routes are no
longer offloaded we can return an error for routes configured after the
abort mechanism was triggered.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50d10711

mlxsw: spectrum_router: Return an error for non-default FIB rules · 6290182b

Ido Schimmel authored May 02, 2018

Since commit 9776d325 ("net: Move call_fib_rule_notifiers up in
fib_nl_newrule") it is possible to forbid the installation of
unsupported FIB rules.

Have mlxsw return an error for non-default FIB rules in addition to the
existing extack message.

Example:
# ip rule add from 198.51.100.1 table 10
Error: mlxsw_spectrum: FIB rules not supported.

Note that offload is only aborted when non-default FIB rules are already
installed and merely replayed during module initialization.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6290182b

cxgb4: add new T5 device id's · 794451c1

Ganesh Goudar authored May 02, 2018

Add device id's 0x5019, 0x501a and 0x501b for T5
cards.
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

794451c1

net: stmmac: Avoid VLA usage · 8ac60ffb

Kees Cook authored May 01, 2018

In the quest to remove all stack VLAs from the kernel[1], this switches
the "status" stack buffer to use the existing small (8) upper bound on
how many queues can be checked for DMA, and adds a sanity-check just to
make sure it doesn't operate under pathological conditions.

[1] http://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.comSigned-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Jose Abreu <joabreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ac60ffb

liquidio VF: indicate that disabling rx vlan offload is not allowed · 795d8098

Raghu Vatsavayi authored May 01, 2018

NIC firmware does not support disabling rx vlan offload, but the VF driver
incorrectly indicates that it is supported. The PF driver already does the
correct indication by clearing the NETIF_F_HW_VLAN_CTAG_RX bit in its
netdev->hw_features. So just do the same thing in the VF.
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@cavium.com>
Acked-by: Prasad Kanneganti <prasad.kanneganti@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

795d8098

udp: Complement partial checksum for GSO packet · 6c035ba7

Sean Tranchetti authored Apr 30, 2018

Using the udp_v4_check() function to calculate the pseudo header
for the newly segmented UDP packets results in assigning the complement
of the value to the UDP header checksum field.

Always undo the complement the partial checksum value in order to
match the case where GSO is not used on the UDP transmit path.

Fixes: ee80d1eb ("udp: add udp gso")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c035ba7

01 May, 2018 25 commits

selftest: add test for TCP_INQ · 702353b5

Soheil Hassas Yeganeh authored May 01, 2018

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

702353b5

tcp: send in-queue bytes in cmsg upon read · b75eba76

Soheil Hassas Yeganeh authored May 01, 2018

Applications with many concurrent connections, high variance
in receive queue length and tight memory bounds cannot
allocate worst-case buffer size to drain sockets. Knowing
the size of receive queue length, applications can optimize
how they allocate buffers to read from the socket.

The number of bytes pending on the socket is directly
available through ioctl(FIONREAD/SIOCINQ) and can be
approximated using getsockopt(MEMINFO) (rmem_alloc includes
skb overheads in addition to application data). But, both of
these options add an extra syscall per recvmsg. Moreover,
ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket
option is set, recvmsg() relays the number of bytes available
on the socket for reading to the application via the
TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock
to include the processed backlog, if any. To avoid an extra
branch in the hot path of recvmsg() for this new control
message, move all cmsg processing inside an existing branch for
processing receive timestamps. Since the socket lock is not held
when calculating the size of receive queue, TCP_INQ is a hint.
For example, it can overestimate the queue size by one byte,
if FIN is received.

With this method, applications can start reading from the socket
using a small buffer, and then use larger buffers based on the
remaining data when needed.

V3 change-log:
	As suggested by David Miller, added loads with barrier
	to check whether we have multiple threads calling recvmsg
	in parallel. When that happens we lock the socket to
	calculate inq.
V4 change-log:
	Removed inline from a static function.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

b75eba76

Merge branch 'hns3-fixes' · ab85539e

David S. Miller authored May 01, 2018

Salil Mehta says:

====================
Misc bug fixes for HNS3 Ethernet driver

This patch-set presents some miscellaneous bug fixs and cleanups for
HNS3 Ethernet Driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ab85539e

net: hns3: Remove packet statistics in the range of 8192~12287 · dbecc779

Xi Wang authored May 01, 2018

Because the current statistics for size 8192~12287 are only valid for GE,
the ranges of 8192~9216 and 9217~12287 are valid only for LGE/CGE, and are
always 0 for GE interfaces. it is easy to cause confusion when viewing the
packet statistics using the command ethtool -S.

This patch removes the 8192~12287 range of packet statistics and uses the
8192~9216 and 9217~12287 ranges for statistics. This change depends on the
firmware upgrade.
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dbecc779

net: hns3: Fix for packet loss due wrong filter config in VLAN tbls · dc8131d8

Yunsheng Lin authored May 01, 2018

There are two level of vlan tables in hardware, one is port vlan
which is shared by all functions, the other one is function
vlan table, each function has it's own function vlan table.
Currently, PF sets the port vlan table, and vf sets the function
vlan table, which will cause packet lost problem.

This patch fixes this problem by setting both vlan table, and
use hdev->vlan_table to manage thet port vlan table.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc8131d8

net: hns3: fix a dead loop in hclge_cmd_csq_clean · 3ff50490

Huazhong Tan authored May 01, 2018

If head has invlid value then a dead loop can be triggered in
hclge_cmd_csq_clean. This patch adds sanity check for this case.

Fixes: 68c0a5c7 ("net: hns3: Add HNS3 IMP(Integrated Mgmt Proc) Cmd
Interface Support")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3ff50490

net: hns3: Fix to support autoneg only for port attached with phy · 0c963e8c

Fuyun Liang authored May 01, 2018

This patch adds a check to support autoneg(ethtool -A) only when PHY
is attached with the port.

Fixes: e2cb1dec ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility
Layer) Support")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0c963e8c

net: hns3: fix for phy_addr error in hclge_mac_mdio_config · c5ef83cb

Huazhong Tan authored May 01, 2018

When phy exists, phy_addr must less than PHY_MAX_ADDR.
If not, hclge_mac_mdio_config should return error.
And for fiber(phy_addr=0xff), it does not need hclge_mac_mdio_config.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c5ef83cb

net: hns3: Fixes the error legs in hclge_init_ae_dev function · ffd5656e

Huazhong Tan authored May 01, 2018

This patch fixes some of the missed error legs in the initialization
function of the ae device. This might cause leaks in case of failure.

Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer
Support")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ffd5656e

net: hns3: Fixes the out of bounds access in hclge_map_tqp · 38e62046

Huazhong Tan authored May 01, 2018

This patch fixes the handling of the check when number of vports
are detected to be more than available TPQs. Current handling causes
an out of bounds access in hclge_map_tqp().

Fixes: 7df7dad6 ("net: hns3: Refactor the mapping of tqp to vport")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38e62046

net: hns3: fix to correctly fetch l4 protocol outer header · 35f58fd7

Huazhong Tan authored May 01, 2018

This patch fixes the function being used to fetch L4
protocol outer header. Mistakenly skb_inner_transport_header
API was being used earlier.

Fixes: 76ad4f0e ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC")
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35f58fd7

net: hns3: Remove error log when getting pfc stats fails · 20670328

Yunsheng Lin authored May 01, 2018

When mac supports DCB, but is in GE mode, it does not support
querying pfc stats, firmware returns error when trying to
query the pfc stats. this creates a lot of noise in the kernel
log when it prints the error log.

This patch fixes it by removing the error log, because it already
return the error to the user space, so the user should be aware of
the error.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

20670328

connector: add parent pid and tgid to coredump and exit events · b086ff87

Stefan Strogin authored May 01, 2018

The intention is to get notified of process failures as soon
as possible, before a possible core dumping (which could be very long)
(e.g. in some process-manager). Coredump and exit process events
are perfect for such use cases (see 2b5faa4c "connector: Added
coredumping event to the process connector").

The problem is that for now the process-manager cannot know the parent
of a dying process using connectors. This could be useful if the
process-manager should monitor for failures only children of certain
parents, so we could filter the coredump and exit events by parent
process and/or thread ID.

Add parent pid and tgid to coredump and exit process connectors event
data.
Signed-off-by: Stefan Strogin <sstrogin@cisco.com>
Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

b086ff87

net: core: Inline netdev_features_size_check() · e283de3a

Florian Fainelli authored Apr 30, 2018

We do not require this inline function to be used in multiple different
locations, just inline it where it gets used in register_netdevice().
Suggested-by: David Miller <davem@davemloft.net>
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e283de3a

udp: disable gso with no_check_tx · a8c744a8

Willem de Bruijn authored Apr 30, 2018

Syzbot managed to send a udp gso packet without checksum offload into
the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
triggered the skb_warn_bad_offload.

  RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
   skb_gso_segment include/linux/netdevice.h:4038 [inline]
   validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
   __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
   dev_queue_xmit+0x17/0x20 net/core/dev.c:3618

UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
catch and fail in this case.

After the integrity checks jump directly to the CHECKSUM_PARTIAL case
to avoid reading the no_check_tx flags again (a TOCTTOU race).

Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a8c744a8

cls_flower: Support multiple masks per priority · 05cd271f

Paul Blakey authored Apr 30, 2018

Currently flower doesn't support inserting filters with different masks
on a single priority, even if the actual flows (key + mask) inserted
aren't overlapping, as with the use case of offloading openvswitch
datapath flows. Instead one must go up one level, and assign different
priorities for each mask, which will create a different flower
instances.

This patch opens flower to support more than one mask per priority,
and a single flower instance. It does so by adding another hash table
on top of the existing one which will store the different masks,
and the filters that share it.

The user is left with the responsibility of ensuring non overlapping
flows, otherwise precedence is not guaranteed.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05cd271f

Merge branch 'sctp-unify-sctp_make_op_error_fixed-and-sctp_make_op_error_space' · 9908b363

David S. Miller authored May 01, 2018

Marcelo Ricardo Leitner says:

====================
sctp: unify sctp_make_op_error_fixed and sctp_make_op_error_space

These two variants are very close to each other and can be merged
to avoid code duplication. That's what this patchset does.

First, we allow sctp_init_cause to return errors, which then allow us to
add sctp_make_op_error_limited that handles both situations.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9908b363

sctp: add sctp_make_op_error_limited and reuse inner functions · 8914f4ba

Marcelo Ricardo Leitner authored Apr 29, 2018

The idea is quite similar to the old functions, but note that the _fixed
function wasn't "fixed" as in that it would generate a packet with a fixed
size, but rather limited/bounded to PMTU.

Also, now with sctp_mtu_payload(), we have a more accurate limit.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8914f4ba

sctp: allow sctp_init_cause to return errors · 6d3e8aa8

Marcelo Ricardo Leitner authored Apr 29, 2018

And do so if the skb doesn't have enough space for the payload.
This is a preparation for the next patch.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d3e8aa8

Merge branch 'net-stmmac-dwmac-meson-100M-phy-mode-support-for-AXG-SoC' · 065662d9

David S. Miller authored May 01, 2018

Yixun Lan says:

====================
net: stmmac: dwmac-meson: 100M phy mode support for AXG SoC

Due to the dwmac glue layer register changed, we need to
introduce a new compatible name for the Meson-AXG SoC
to support for the RMII 100M ethernet PHY.

Change since v1 at [1]:
  - implement set_phy_mode() for each SoC

[1] https://lkml.kernel.org/r/20180426160508.29380-1-yixun.lan@amlogic.com
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

065662d9

net: stmmac: dwmac-meson: extend phy mode setting · efacb568

Yixun Lan authored Apr 28, 2018

In the Meson-AXG SoC, the phy mode setting of PRG_ETH0 in the glue layer
is extended from bit[0] to bit[2:0].
  There is no problem if we configure it to the RGMII 1000M PHY mode,
since the register setting is coincidentally compatible with previous one,
but for the RMII 100M PHY mode, the configuration need to be changed to
value - b100.
  This patch was verified with a RTL8201F 100M ethernet PHY.
Signed-off-by: Yixun Lan <yixun.lan@amlogic.com>
Acked-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

efacb568

dt-bindings: net: meson-dwmac: new compatible name for AXG SoC · 7e5d05e1

Yixun Lan authored Apr 28, 2018

We need to introduce a new compatible name for the Meson-AXG SoC
in order to support the RMII 100M ethernet PHY, since the PRG_ETH0
register of the dwmac glue layer is changed from previous old SoC.
Signed-off-by: Yixun Lan <yixun.lan@amlogic.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e5d05e1

Merge branch 'netns-uevent-filtering' · 90d52d4f

David S. Miller authored May 01, 2018

Christian Brauner says:

====================
netns: uevent filtering

This is the new approach to uevent filtering as discussed (see the
threads in [1], [2], and [3]). It only contains *non-functional
changes*.

This series deals with with fixing up uevent filtering logic:
- uevent filtering logic is simplified
- locking time on uevent_sock_list is minimized
- tagged and untagged kobjects are handled in separate codepaths
- permissions for userspace are fixed for network device uevents in
  network namespaces owned by non-initial user namespaces
  Udev is now able to see those events correctly which it wasn't before.
  For example, moving a physical device into a network namespace not
  owned by the initial user namespaces before gave:

  root@xen1:~# udevadm --debug monitor -k
  calling: monitor
  monitor will print the received events for:
  KERNEL - the kernel uevent

  sender uid=65534, message ignored
  sender uid=65534, message ignored
  sender uid=65534, message ignored
  sender uid=65534, message ignored
  sender uid=65534, message ignored

  and now after the discussion and solution in [3] correctly gives:

  root@xen1:~# udevadm --debug monitor -k
  calling: monitor
  monitor will print the received events for:
  KERNEL - the kernel uevent

  KERNEL[625.301042] add      /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
  KERNEL[625.301109] move     /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/enp1s0f1 (net)
  KERNEL[625.301138] move     /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)
  KERNEL[655.333272] remove /devices/pci0000:00/0000:00:02.0/0000:01:00.1/net/eth1 (net)

Thanks!
Christian

[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738
====================
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90d52d4f

netns: restrict uevents · a3498436

Christian Brauner authored Apr 29, 2018

commit 07e98962 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk. We have now reached the point where hotplug events for all devices
that carry a namespace tag are filtered according to that namespace.
Specifically, they are filtered whenever the namespace tag of the kobject
does not match the namespace tag of the netlink socket.
Currently, only network devices carry namespace tags (i.e. network
namespace tags). Hence, uevents for network devices only show up in the
network namespace such devices are created in or moved to.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch simplifies and fixes couple of things:
- Split codepath for sending uevents by kobject namespace tags:
  1. Untagged kobjects - uevent_net_broadcast_untagged():
     Untagged kobjects will be broadcast into all uevent sockets recorded
     in uevent_sock_list, i.e. into all network namespacs owned by the
     intial user namespace.
  2. Tagged kobjects - uevent_net_broadcast_tagged():
     Tagged kobjects will only be broadcast into the network namespace they
     were tagged with.
  Handling of tagged kobjects in 2. does not cause any semantic changes.
  This is just splitting out the filtering logic that was handled by
  kobj_bcast_filter() before.
  Handling of untagged kobjects in 1. will cause a semantic change. The
  reasons why this is needed and ok have been discussed in [1]. Here is a
  short summary:
  - Userspace ignores uevents from network namespaces that are not owned by
    the intial user namespace:
    Uevents are filtered by userspace in a user namespace because the
    received uid != 0. Instead the uid associated with the event will be
    65534 == "nobody" because the global root uid is not mapped.
    This means we can safely and without introducing regressions modify the
    kernel to not send uevents into all network namespaces whose owning
    user namespace is not the initial user namespace because we know that
    userspace will ignore the message because of the uid anyway.
    I have a) verified that is is true for every udev implementation out
    there b) that this behavior has been present in all udev
    implementations from the very beginning.
  - Thundering herd:
    Broadcasting uevents into all network namespaces introduces significant
    overhead.
    All processes that listen to uevents running in non-initial user
    namespaces will end up responding to uevents that will be meaningless
    to them. Mainly, because non-initial user namespaces cannot easily
    manage devices unless they have a privileged host-process helping them
    out. This means that there will be a thundering herd of activity when
    there shouldn't be any.
  - Removing needless overhead/Increasing performance:
    Currently, the uevent socket for each network namespace is added to the
    global variable uevent_sock_list. The list itself needs to be protected
    by a mutex. So everytime a uevent is generated the mutex is taken on
    the list. The mutex is held *from the creation of the uevent (memory
    allocation, string creation etc. until all uevent sockets have been
    handled*. This is aggravated by the fact that for each uevent socket
    that has listeners the mc_list must be walked as well which means we're
    talking O(n^2) here. Given that a standard Linux workload usually has
    quite a lot of network namespaces and - in the face of containers - a
    lot of user namespaces this quickly becomes a performance problem (see
    "Thundering herd" above). By just recording uevent sockets of network
    namespaces that are owned by the initial user namespace we
    significantly increase performance in this codepath.
  - Injecting uevents:
    There's a valid argument that containers might be interested in
    receiving device events especially if they are delegated to them by a
    privileged userspace process. One prime example are SR-IOV enabled
    devices that are explicitly designed to be handed of to other users
    such as VMs or containers.
    This use-case can now be correctly handled since
    commit 692ec06d ("netns: send uevent messages"). This commit
    introduced the ability to send uevents from userspace. As such we can
    let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
    namespace of the network namespace of the netlink socket) userspace
    process make a decision what uevents should be sent. This removes the
    need to blindly broadcast uevents into all user namespaces and provides
    a performant and safe solution to this problem.
  - Filtering logic:
    This patch filters by *owning user namespace of the network namespace a
    given task resides in* and not by user namespace of the task per se.
    This means if the user namespace of a given task is unshared but the
    network namespace is kept and is owned by the initial user namespace a
    listener that is opening the uevent socket in that network namespace
    can still listen to uevents.
- Fix permission for tagged kobjects:
  Network devices that are created or moved into a network namespace that
  is owned by a non-initial user namespace currently are send with
  INVALID_{G,U}ID in their credentials. This means that all current udev
  implementations in userspace will ignore the uevent they receive for
  them. This has lead to weird bugs whereby new devices showing up in such
  network namespaces were not recognized and did not get IPs assigned etc.
  This patch adjusts the permission to the appropriate {g,u}id in the
  respective user namespace. This way udevd is able to correctly handle
  such devices.
- Simplify filtering logic:
  do_one_broadcast() already ensures that only listeners in mc_list receive
  uevents that have the same network namespace as the uevent socket itself.
  So the filtering logic in kobj_bcast_filter is not needed (see [3]). This
  patch therefore removes kobj_bcast_filter() and replaces
  netlink_broadcast_filtered() with the simpler netlink_broadcast()
  everywhere.

[1]: https://lkml.org/lkml/2018/4/4/739
[2]: https://lkml.org/lkml/2018/4/26/767
[3]: https://lkml.org/lkml/2018/4/26/738Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3498436

uevent: add alloc_uevent_skb() helper · 26045a7b

Christian Brauner authored Apr 29, 2018

This patch adds alloc_uevent_skb() in preparation for follow up patches.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

26045a7b