Commits · 4cc7b9544b9a904add353406ed1bacbf56f75c52 · nexedi / linux

07 Aug, 2017 40 commits

bpf: devmap fix mutex in rcu critical section · 4cc7b954

John Fastabend authored Aug 04, 2017

Originally we used a mutex to protect concurrent devmap update
and delete operations from racing with netdev unregister notifier
callbacks.

The notifier hook is needed because we increment the netdev ref
count when a dev is added to the devmap. This ensures the netdev
reference is valid in the datapath. However, we don't want to block
unregister events, hence the initial mutex and notifier handler.

The concern was in the notifier hook we search the map for dev
entries that hold a refcnt on the net device being torn down. But,
in order to do this we require two steps,

  (i) dereference the netdev:  dev = rcu_dereference(map[i])
 (ii) test ifindex:   dev->ifindex == removing_ifindex

and then finally we can swap in the NULL dev in the map via an
xchg operation,

  xchg(map[i], NULL)

The danger here is a concurrent update could run a different
xchg op concurrently leading us to replace the new dev with a
NULL dev incorrectly.

      CPU 1                        CPU 2

   notifier hook                   bpf devmap update

   dev = rcu_dereference(map[i])
                                   dev = rcu_dereference(map[i])
                                   xchg(map[i]), new_dev);
                                   rcu_call(dev,...)
   xchg(map[i], NULL)

The above flow would create the incorrect state with the dev
reference in the update path being lost. To resolve this the
original code used a mutex around the above block. However,
updates, deletes, and lookups occur inside rcu critical sections
so we can't use a mutex in this context safely.

Fortunately, by writing slightly better code we can avoid the
mutex altogether. If CPU 1 in the above example uses a cmpxchg
and _only_ replaces the dev reference in the map when it is in
fact the expected dev the race is removed completely. The two
cases being illustrated here, first the race condition,

      CPU 1                          CPU 2

   notifier hook                     bpf devmap update

   dev = rcu_dereference(map[i])
                                     dev = rcu_dereference(map[i])
                                     xchg(map[i]), new_dev);
                                     rcu_call(dev,...)
   odev = cmpxchg(map[i], dev, NULL)

Now we can test the cmpxchg return value, detect odev != dev and
abort. Or in the good case,

      CPU 1                          CPU 2

   notifier hook                     bpf devmap update
   dev = rcu_dereference(map[i])
   odev = cmpxchg(map[i], dev, NULL)
                                     [...]

Now 'odev == dev' and we can do proper cleanup.

And viola the original race we tried to solve with a mutex is
corrected and the trace noted by Sasha below is resolved due
to removal of the mutex.

Note: When walking the devmap and removing dev references as needed
we depend on the core to fail any calls to dev_get_by_index() using
the ifindex of the device being removed. This way we do not race with
the user while searching the devmap.

Additionally, the mutex was also protecting list add/del/read on
the list of maps in-use. This patch converts this to an RCU list
and spinlock implementation. This protects the list from concurrent
alloc/free operations. The notifier hook walks this list so it uses
RCU read semantics.

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
1 lock held by syz-executor1/16315:
 #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] map_delete_elem kernel/bpf/syscall.c:577 [inline]
 #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SYSC_bpf kernel/bpf/syscall.c:1427 [inline]
 #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SyS_bpf+0x1d32/0x4ba0 kernel/bpf/syscall.c:1388

Fixes: 2ddf71e2 ("net: add notifier hooks for devmap bpf map")
Reported-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4cc7b954

Merge branch 'net_sched-clean-up-filter-handle' · 98909f95

David S. Miller authored Aug 07, 2017

Cong Wang says:

====================
net_sched: clean up filter handle

This patchset sits in my local branch for a long time, it is time to
send it out. It cleans up the ambiguous use of 'unsigned long fh',
please see each of them for details.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

98909f95

net_sched: use void pointer for filter handle · 8113c095

WANG Cong authored Aug 04, 2017

Now we use 'unsigned long fh' as a pointer in every place,
it is safe to convert it to a void pointer now. This gets
rid of many casts to pointer.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8113c095

net_sched: refactor notification code for RTM_DELTFILTER · 54df2cf8

WANG Cong authored Aug 04, 2017

It is confusing to use 'unsigned long fh' as both a handle
and a pointer, especially commit 9ee78374
("net sched filters: fix notification of filter delete with proper handle").

This patch introduces tfilter_del_notify() so that we can
pass it as a pointer as before, and we don't need to check
RTM_DELTFILTER in tcf_fill_node() any more.

This prepares for the next patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54df2cf8

lwtunnel: replace EXPORT_SYMBOL with EXPORT_SYMBOL_GPL · 08bd10ff

Roopa Prabhu authored Aug 04, 2017

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

08bd10ff

Merge branch 'bpf-add-support-for-sys-enter-exit-tracepoints' · 1c14dc48

David S. Miller authored Aug 07, 2017

Yonghong Song says:

====================
bpf: add support for sys_{enter|exit}_* tracepoints

Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
style tracepoints. The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
tracepoints are treated differently from other tracepoints and there
is no bpf hook to it.

This patch set adds bpf support for these syscalls tracepoints and also
adds a test case for it.

Changelogs:
v3 -> v4:
 - Check the legality of ctx offset access for syscall tracepoint as well.
   trace_event_get_offsets will return correct max offset for each
   specific syscall tracepoint.
 - Use variable length array to avoid hardcode 6 as the maximum
   arguments beyond syscall_nr.
v2 -> v3:
 - Fix a build issue
v1 -> v2:
 - Do not use TRACE_EVENT_FL_CAP_ANY to identify syscall tracepoint.
   Instead use trace_event_call->class.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

1c14dc48

bpf: add a test case for syscalls/sys_{enter|exit}_* tracepoints · 1da236b6

Yonghong Song authored Aug 04, 2017

Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

1da236b6

bpf: add support for sys_enter_* and sys_exit_* tracepoints · cf5f5cea

Yonghong Song authored Aug 04, 2017

Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
style tracepoints. The iovisor/bcc issue #748
(https://github.com/iovisor/bcc/issues/748) documents this issue.
For example, if you try to attach a bpf program to tracepoints
syscalls/sys_enter_newfstat, you will get the following error:
   # ./tools/trace.py t:syscalls:sys_enter_newfstat
   Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
   Failed to attach BPF to tracepoint

The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
tracepoints are treated differently from other tracepoints and there
is no bpf hook to it.

This patch adds bpf support for these syscalls tracepoints by
  . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
  . calling bpf programs in perf_syscall_enter and perf_syscall_exit

The legality of bpf program ctx access is also checked.
Function trace_event_get_offsets returns correct max offset for each
specific syscall tracepoint, which is compared against the maximum offset
access in bpf program.
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf5f5cea

of_mdio: use of_property_read_u32_array() · d226a2b8

Sergei Shtylyov authored Aug 05, 2017

The "fixed-link" prop support predated of_property_read_u32_array(), so
basically had to open-code it. Using the modern API saves 24 bytes of the
object code (ARM gcc 4.8.5); the only behavior change would be that the
prop length check is now less strict (however the strict pre-check done
in of_phy_is_fixed_link() is left intact anyway)...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d226a2b8

ibmvnic: Report rx buffer return codes as netdev_dbg · e1cea2e7

John Allen authored Aug 07, 2017

Reporting any return code for a receive buffer as an "rx error" only
produces alarming noise and the only values that have been observed to be
used in this field are not error conditions. Change this to a netdev_dbg
with a more descriptive message.
Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1cea2e7

Merge branch 'net-l3mdev-Support-for-sockets-bound-to-enslaved-device' · 9bcb5a57

David S. Miller authored Aug 07, 2017

David Ahern says:

====================
net: l3mdev: Support for sockets bound to enslaved device

A missing piece to the VRF puzzle is the ability to bind sockets to
devices enslaved to a VRF. This patch set adds the enslaved device
index, sdif, to IPv4 and IPv6 socket lookups. The end result for users
is the following scope options for services:

1. "global" services - sockets not bound to any device

   Allows 1 service to work across all network interfaces with
   connected sockets bound to the VRF the connection originates
   (Requires net.ipv4.tcp_l3mdev_accept=1 for TCP and
    net.ipv4.udp_l3mdev_accept=1 for UDP)

2. "VRF" local services - sockets bound to a VRF

   Sockets work across all network interfaces enslaved to a VRF but
   are limited to just the one VRF.

3. "device" services - sockets bound to a specific network interface

   Service works only through the one specific interface.

v3
- convert __inet_lookup_established in dccp_v4_err; missed in v2

v2
- remove sk_lookup struct and add sdif as an argument to existing
  functions

Changes since RFC:
- no significant logic changes; mainly whitespace cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9bcb5a57

net: ipv6: add second dif to raw socket lookups · 5108ab4b

David Ahern authored Aug 07, 2017

Add a second device index, sdif, to raw socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5108ab4b

net: ipv6: add second dif to inet6 socket lookups · 4297a0ef

David Ahern authored Aug 07, 2017

Add a second device index, sdif, to inet6 socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after tcp_v4_rcv
tcp_v4_sdif is used.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4297a0ef

net: ipv6: add second dif to udp socket lookups · 1801b570

David Ahern authored Aug 07, 2017

Add a second device index, sdif, to udp socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

Early demux lookups are handled in the next patch as part of INET_MATCH
changes.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1801b570

net: ipv4: add second dif to multicast source filter · 60d9b031

David Ahern authored Aug 07, 2017

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

60d9b031

net: ipv4: add second dif to raw socket lookups · 67359930

David Ahern authored Aug 07, 2017

Add a second device index, sdif, to raw socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

67359930

net: ipv4: add second dif to inet socket lookups · 3fa6f616

David Ahern authored Aug 07, 2017

Add a second device index, sdif, to inet socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
ingress index is obtained from IPCB using inet_sdif and after the cb move
in  tcp_v4_rcv the tcp_v4_sdif helper is used.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3fa6f616

net: ipv4: add second dif to udp socket lookups · fb74c277

David Ahern authored Aug 07, 2017

Add a second device index, sdif, to udp socket lookups. sdif is the
index for ingress devices enslaved to an l3mdev. It allows the lookups
to consider the enslaved device as well as the L3 domain when searching
for a socket.

Early demux lookups are handled in the next patch as part of INET_MATCH
changes.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fb74c277

Merge tag 'wireless-drivers-next-for-davem-2017-08-07' of... · 46d4b68f

David S. Miller authored Aug 07, 2017

Merge tag 'wireless-drivers-next-for-davem-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.14

The first wireless-drivers-next pull request for 4.14. I'm submitting
this unusally late in the cycle as my vacation postponed this. But
even if this is late there's not still that much new features, mostly
cleanup or fixes.

Major changes:

ath10k

* preparation for wcn3990 support

iwlwifi

* Reorganization of the code into separate directories continues

qtnfmac

* regulatory support updates

* add get_channel, dump_survey and channel_switch cfg80211 handlers
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

46d4b68f

hns3: fix unused function warning · 2a32ca13

Arnd Bergmann authored Aug 07, 2017

Without CONFIG_PCI_IOV, we get a harmless warning about an
unused function:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:2273:13: error: 'hclge_disable_sriov' defined but not used [-Werror=unused-function]

The #ifdefs in this driver are obviously wrong, so this just
removes them and uses an IS_ENABLED() check that does the same
thing correctly in a more readable way.

Fixes: 46a3df9f ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a32ca13

Merge tag 'mlx5-shared-2017-08-07' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · fde6af47

David S. Miller authored Aug 07, 2017

Saeed Mahameed says:

====================
mlx5-shared-2017-08-07

This series includes some mlx5 updates for both net-next and rdma trees.

From Saeed,
Core driver updates to allow selectively building the driver with
or without some large driver components, such as
	- E-Switch (Ethernet SRIOV support).
	- Multi-Physical Function Switch (MPFs) support.
For that we split E-Switch and MPFs functionalities into separate files.

From Erez,
Delay mlx5_core events when mlx5 interfaces, namely mlx5_ib, registration
is taking place and until it completes.

From Rabie,
Increase the maximum supported flow counters.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

fde6af47

Merge branch 'net-sched-summer-cleanup-part-2-ndo_setup_tc' · 71feeef6

David S. Miller authored Aug 07, 2017

Jiri Pirko says:

====================
net: sched: summer cleanup part 2, ndo_setup_tc

This patchset focuses on ndo_setup_tc and its args.
Currently there are couple of things that do not make much sense.
The type is passed in struct tc_to_netdev, but as it is always
required, should be arg of the ndo. Other things are passed as args
but they are only relevant for cls offloads and not mqprio. Therefore,
they should be pushed to struct. As the tc_to_netdev struct in the end
is just a container of single pointer, we get rid of it and pass the
struct according to type. So in the end, we have:
ndo_setup_tc(dev, type, type_data_struct)

There are couple of cosmetics done on the way to make things smooth.
Also, reported error is consolidated to eopnotsupp in case the
asked offload is not supported.

v1->v2:
- added forgotten hns3pf bits
====================
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

71feeef6