Commits · 8051bb7f2cbf68ec1289753616703791dd004b5c · Kirill Smelkov / linux

22 Jul, 2019 13 commits

selftests/tls: close the socket with open record · 8051bb7f

Jakub Kicinski authored Jul 19, 2019

Add test which sends some data with MSG_MORE and then
closes the socket (never calling send without MSG_MORE).
This should make sure we clean up open records correctly.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

8051bb7f

selftests/tls: add a bidirectional test · 65d41fb3

Jakub Kicinski authored Jul 19, 2019

Add a simple test which installs the TLS state for both directions,
sends and receives data on both sockets.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

65d41fb3

selftests/tls: test error codes around TLS ULP installation · 78b5dc3d

Jakub Kicinski authored Jul 19, 2019

Test the error codes returned when TCP connection is not
in ESTABLISHED state.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

78b5dc3d

selftests/tls: add a test for ULP but no keys · cf32526c

Jakub Kicinski authored Jul 19, 2019

Make sure we test the TLS_BASE/TLS_BASE case both with data
and the tear down/clean up path.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

cf32526c

bpf: sockmap/tls, close can race with map free · 95fa1454

John Fastabend authored Jul 19, 2019

When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

v4:
 - make sure we don't free things for device;
 - remove the checks which swap the callbacks back
   only if TLS is at the top.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

95fa1454

bpf: sockmap, only create entry if ulp is not already enabled · 0e858739

John Fastabend authored Jul 19, 2019

Sockmap does not currently support adding sockets after TLS has been
enabled. There never was a real use case for this so it was never
added. But, we lost the test for ULP at some point so add it here
and fail the socket insert if TLS is enabled. Future work could
make sockmap support this use case but fixup the bug here.

Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

0e858739

bpf: sockmap, synchronize_rcu before free'ing map · 2bb90e5c

John Fastabend authored Jul 19, 2019

We need to have a synchronize_rcu before free'ing the sockmap because
any outstanding psock references will have a pointer to the map and
when they use this could trigger a use after free.

Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

2bb90e5c

bpf: sockmap, sock_map_delete needs to use xchg · 45a4521d

John Fastabend authored Jul 19, 2019

__sock_map_delete() may be called from a tcp event such as unhash or
close from the following trace,

  tcp_bpf_close()
    tcp_bpf_remove()
      sk_psock_unlink()
        sock_map_delete_from_link()
          __sock_map_delete()

In this case the sock lock is held but this only protects against
duplicate removals on the TCP side. If the map is free'd then we have
this trace,

  sock_map_free
    xchg()                  <- replaces map entry
    sock_map_unref()
      sk_psock_put()
        sock_map_del_link()

The __sock_map_delete() call however uses a read, test, null over the
map entry which can result in both paths trying to free the map
entry.

To fix use xchg in TCP paths as well so we avoid having two references
to the same map entry.

Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

45a4521d

net/tls: fix transition through disconnect with close · 32857cf5

John Fastabend authored Jul 19, 2019

It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call the tls close callback. Because of this a user could
disconnect a socket then put it in a LISTEN state which would break
our assumptions about sockets always being ESTABLISHED state.

More directly because close() can call unhash() and unhash is
implemented by sockmap if a sockmap socket has TLS enabled we can
incorrectly destroy the psock from unhash() and then call its close
handler again. But because the psock (sockmap socket representation)
is already destroyed we call close handler in sk->prot. However,
in some cases (TLS BASE/BASE case) this will still point at the
sockmap close handler resulting in a circular call and crash reported
by syzbot.

To fix both above issues implement the unhash() routine for TLS.

v4:
 - add note about tls offload still needing the fix;
 - move sk_proto to the cold cache line;
 - split TX context free into "release" and "free",
   otherwise the GC work itself is in already freed
   memory;
 - more TX before RX for consistency;
 - reuse tls_ctx_free();
 - schedule the GC work after we're done with context
   to avoid UAF;
 - don't set the unhash in all modes, all modes "inherit"
   TLS_BASE's callbacks anyway;
 - disable the unhash hook for TLS_HW.

Fixes: 3c4d7559 ("tls: kernel TLS support")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

32857cf5

net/tls: remove sock unlock/lock around strp_done() · 313ab004

John Fastabend authored Jul 19, 2019

The tls close() callback currently drops the sock lock to call
strp_done(). Split up the RX cleanup into stopping the strparser
and releasing most resources, syncing strparser and finally
freeing the context.

To avoid the need for a strp_done() call on the cleanup path
of device offload make sure we don't arm the strparser until
we are sure init will be successful.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

313ab004

net/tls: remove close callback sock unlock/lock around TX work flush · f87e62d4

John Fastabend authored Jul 19, 2019

The tls close() callback currently drops the sock lock, makes a
cancel_delayed_work_sync() call, and then relocks the sock.

By restructuring the code we can avoid droping lock and then
reclaiming it. To simplify this we do the following,

 tls_sk_proto_close
 set_bit(CLOSING)
 set_bit(SCHEDULE)
 cancel_delay_work_sync() <- cancel workqueue
 lock_sock(sk)
 ...
 release_sock(sk)
 strp_done()

Setting the CLOSING bit prevents the SCHEDULE bit from being
cleared by any workqueue items e.g. if one happens to be
scheduled and run between when we set SCHEDULE bit and cancel
work. Then because SCHEDULE bit is set now no new work will
be scheduled.

Tested with net selftests and bpf selftests.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

f87e62d4

net/tls: don't call tls_sk_proto_close for hw record offload · ac78fc14

Jakub Kicinski authored Jul 19, 2019

The deprecated TOE offload doesn't actually do anything in
tls_sk_proto_close() - all TLS code is skipped and context
not freed. Remove the callback to make it easier to refactor
tls_sk_proto_close().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

ac78fc14

net/tls: don't arm strparser immediately in tls_set_sw_offload() · 318892ac

Jakub Kicinski authored Jul 19, 2019

In tls_set_device_offload_rx() we prepare the software context
for RX fallback and proceed to add the connection to the device.
Unfortunately, software context prep includes arming strparser
so in case of a later error we have to release the socket lock
to call strp_done().

In preparation for not releasing the socket lock half way through
callbacks move arming strparser into a separate function.
Following patches will make use of that.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

318892ac

19 Jul, 2019 5 commits

libbpf: sanitize VAR to conservative 1-byte INT · 1d4126c4

Andrii Nakryiko authored Jul 19, 2019

If VAR in non-sanitized BTF was size less than 4, converting such VAR
into an INT with size=4 will cause BTF validation failure due to
violationg of STRUCT (into which DATASEC was converted) member size.
Fix by conservatively using size=1.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

1d4126c4

libbpf: fix SIGSEGV when BTF loading fails, but .BTF.ext exists · 04efe591

Andrii Nakryiko authored Jul 19, 2019

In case when BTF loading fails despite sanitization, but BPF object has
.BTF.ext loaded as well, we free and null obj->btf, but not
obj->btf_ext. This leads to an attempt to relocate .BTF.ext later on
during bpf_object__load(), which assumes obj->btf is present. This leads
to SIGSEGV on null pointer access. Fix bug by freeing and nulling
obj->btf_ext as well.
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

04efe591

tcp: fix tcp_set_congestion_control() use from bpf hook · 8d650cde

Eric Dumazet authored Jul 18, 2019

Neal reported incorrect use of ns_capable() from bpf hook.

bpf_setsockopt(...TCP_CONGESTION...)
  -> tcp_set_congestion_control()
   -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
    -> ns_capable_common()
     -> current_cred()
      -> rcu_dereference_protected(current->cred, 1)

Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.

As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.

The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.

Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d650cde

ag71xx: fix return value check in ag71xx_probe() · 269b7c5f

Wei Yongjun authored Jul 19, 2019

In case of error, the function of_get_mac_address() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check should
be replaced with IS_ERR().

Fixes: d51b6ce4 ("net: ethernet: add ag71xx driver")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

269b7c5f

ag71xx: fix error return code in ag71xx_probe() · 6f5fa8d2

Wei Yongjun authored Jul 19, 2019

Fix to return error code -ENOMEM from the dmam_alloc_coherent() error
handling case instead of 0, as done elsewhere in this function.

Fixes: d51b6ce4 ("net: ethernet: add ag71xx driver")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

6f5fa8d2

18 Jul, 2019 14 commits

usb: qmi_wwan: add D-Link DWM-222 A2 device ID · 7d605309

Rogan Dawes authored Jul 17, 2019

Signed-off-by: Rogan Dawes <rogan@dawes.za.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

7d605309

bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips. · 9b3d15e6

Michael Chan authored Jul 17, 2019

Unlike legacy chips, 57500 chips don't need additional VNIC resources
for aRFS/ntuple.  Fix the code accordingly so that we don't reserve
and allocate additional VNICs on 57500 chips.  Without this patch,
the driver is failing to initialize when it tries to allocate extra
VNICs.

Fixes: ac33906c ("bnxt_en: Add support for aRFS on 57500 chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b3d15e6

net: dsa: sja1105: Fix missing unlock on error in sk_buff() · 008cfbaa

Wei Yongjun authored Jul 17, 2019

Add the missing unlock before return from function sk_buff()
in the error handling case.

Fixes: f3097be2 ("net: dsa: sja1105: Add a state machine for RX timestamping")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

008cfbaa

gve: replace kfree with kvfree · 8ec1e900

Chuhong Yuan authored Jul 17, 2019

Variables allocated by kvzalloc should not be freed by kfree.
Because they may be allocated by vmalloc.
So we replace kfree with kvfree here.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ec1e900

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · bb745231

David S. Miller authored Jul 18, 2019

Alexei Starovoitov says:

====================
pull-request: bpf 2019-07-18

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) verifier precision propagation fix, from Andrii.

2) BTF size fix for typedefs, from Andrii.

3) a bunch of big endian fixes, from Ilya.

4) wide load from bpf_sock_addr fixes, from Stanislav.

5) a bunch of misc fixes from a number of developers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bb745231

selftests/bpf: fix test_xdp_noinline on s390 · 59fd3486

Ilya Leoshkevich authored Jul 17, 2019

test_xdp_noinline fails on s390 due to a handful of endianness issues.
Use ntohs for parsing eth_proto.
Replace bswaps with ntohs/htons.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

59fd3486

selftests/bpf: fix "valid read map access into a read-only array 1" on s390 · 01a0f9e4

Ilya Leoshkevich authored Jul 18, 2019

This test looks up a 32-bit map element and then loads it using a 64-bit
load. This does not work on s390, which is a big-endian machine.

Since the point of this test doesn't seem to be loading a smaller value
using a larger load, simply use a 32-bit load.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

01a0f9e4

net/mlx5: Replace kfree with kvfree · 7369c10f

Chuhong Yuan authored Jul 17, 2019

Variable allocated by kvmalloc should not be freed by kfree.
Because it may be allocated by vmalloc.
So replace kfree with kvfree here.

Fixes: 9b1f2982 ("net/mlx5: Add support for FW fatal reporter dump")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7369c10f

MAINTAINERS: update netsec driver · 184528af

Ilias Apalodimas authored Jul 18, 2019

Add myself to maintainers since i provided the XDP and page_pool
implementation
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jassi Brar <jaswinder.singh@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

184528af

ipv6: Unlink sibling route in case of failure · 54851aa9

Ido Schimmel authored Jul 17, 2019

When a route needs to be appended to an existing multipath route,
fib6_add_rt2node() first appends it to the siblings list and increments
the number of sibling routes on each sibling.

Later, the function notifies the route via call_fib6_entry_notifiers().
In case the notification is vetoed, the route is not unlinked from the
siblings list, which can result in a use-after-free.

Fix this by unlinking the route from the siblings list before returning
an error.

Audited the rest of the call sites from which the FIB notification chain
is called and could not find more problems.

Fixes: 2233000c ("net/ipv6: Move call_fib6_entry_notifiers up for route adds")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

54851aa9

Merge tag 'wireless-drivers-for-davem-2019-07-18' of... · 9c2a57d1

David S. Miller authored Jul 18, 2019

Merge tag 'wireless-drivers-for-davem-2019-07-18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
wireless-drivers fixes for 5.3

First set of fixes for 5.3.

iwlwifi

* add new cards for 9000 and 20000 series and qu c-step devices

ath10k

* workaround an uninitialised variable warning

rt2x00

* fix rx queue hand on USB
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9c2a57d1

liquidio: Replace vmalloc + memset with vzalloc · 5a860f91

Chuhong Yuan authored Jul 18, 2019

Use vzalloc and vzalloc_node instead of using vmalloc and
vmalloc_node and then zeroing the allocated memory by
memset 0.
This simplifies the code.
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a860f91

udp: Fix typo in net/ipv4/udp.c · 666a3d6e

Su Yanjun authored Jul 18, 2019

Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

666a3d6e

net: bcmgenet: use promisc for unsupported filters · 35cbef98

Justin Chen authored Jul 17, 2019

Currently we silently ignore filters if we cannot meet the filter
requirements. This will lead to the MAC dropping packets that are
expected to pass. A better solution would be to set the NIC to promisc
mode when the required filters cannot be met.

Also correct the number of MDF filters supported. It should be 17,
not 16.
Signed-off-by: Justin Chen <justinpopo6@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35cbef98

17 Jul, 2019 8 commits

ipv6: rt6_check should return NULL if 'from' is NULL · 49d05fe2

David Ahern authored Jul 17, 2019

Paul reported that l2tp sessions were broken after the commit referenced
in the Fixes tag. Prior to this commit rt6_check returned NULL if the
rt6_info 'from' was NULL - ie., the dst_entry was disconnected from a FIB
entry. Restore that behavior.

Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: Paul Donohue <linux-kernel@PaulSD.com>
Tested-by: Paul Donohue <linux-kernel@PaulSD.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

49d05fe2

tipc: initialize 'validated' field of received packets · 866e5fd8

Jon Maloy authored Jul 17, 2019

The tipc_msg_validate() function leaves a boolean flag 'validated' in
the validated buffer's control block, to avoid performing this action
more than once. However, at reception of new packets, the position of
this field may already have been set by lower layer protocols, so
that the packet is erroneously perceived as already validated by TIPC.

We fix this by initializing the said field to 'false' before performing
the initial validation.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

866e5fd8

Merge branch 'ipv4-relax-source-validation-check-for-loopback-packets' · 7b379472

David S. Miller authored Jul 17, 2019

Cong Wang says:

====================
ipv4: relax source validation check for loopback packets

This patchset fixes a corner case when loopback packets get dropped
by rp_filter when we route them from veth to lo. Patch 1 is the fix
and patch 2 provides a simplified test case for this scenario.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7b379472

selftests: add a test case for rp_filter · adb701d6

Cong Wang authored Jul 17, 2019

Add a test case to simulate the loopback packet case fixed
in the previous patch.

This test gets passed after the fix:

IPv4 rp_filter tests
    TEST: rp_filter passes local packets                                [ OK ]
    TEST: rp_filter passes loopback packets                             [ OK ]

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

adb701d6

fib: relax source validation check for loopback packets · 66f82095

Cong Wang authored Jul 17, 2019

In a rare case where we redirect local packets from veth to lo,
these packets fail to pass the source validation when rp_filter
is turned on, as the tracing shows:

  <...>-311708 [040] ..s1 7951180.957825: fib_table_lookup: table 254 oif 0 iif 1 src 10.53.180.130 dst 10.53.180.130 tos 0 scope 0 flags 0
  <...>-311708 [040] ..s1 7951180.957826: fib_table_lookup_nh: nexthop dev eth0 oif 4 src 10.53.180.130

So, the fib table lookup returns eth0 as the nexthop even though
the packets are local and should be routed to loopback nonetheless,
but they can't pass the dev match check in fib_info_nh_uses_dev()
without this patch.

It should be safe to relax this check for this special case, as
normally packets coming out of loopback device still have skb_dst
so they won't even hit this slow path.

Cc: Julian Anastasov <ja@ssi.bg>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

66f82095

Merge branch 'mlxsw-Two-fixes' · f1bf3e2a

David S. Miller authored Jul 17, 2019

Ido Schimmel says:

====================
mlxsw: Two fixes

This patchset contains two fixes for mlxsw.

Patch #1 from Petr fixes an issue in which DSCP rewrite can occur even
if the egress port was switched to Trust L2 mode where priority mapping
is based on PCP.

Patch #2 fixes a problem where packets can be learned on a non-existing
FID if a tc filter with a redirect action is configured on a bridged
port. The problem and fix are explained in detail in the commit message.

Please consider both patches for 5.2.y
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f1bf3e2a

mlxsw: spectrum: Do not process learned records with a dummy FID · 577fa14d

Ido Schimmel authored Jul 17, 2019

The switch periodically sends notifications about learned FDB entries.
Among other things, the notification includes the FID (Filtering
Identifier) and the port on which the MAC was learned.

In case the driver does not have the FID defined on the relevant port,
the following error will be periodically generated:

mlxsw_spectrum2 0000:06:00.0 swp32: Failed to find a matching {Port, VID} following FDB notification

This is not supposed to happen under normal conditions, but can happen
if an ingress tc filter with a redirect action is installed on a bridged
port. The redirect action will cause the packet's FID to be changed to
the dummy FID and a learning notification will be emitted with this FID
- which is not defined on the bridged port.

Fix this by having the driver ignore learning notifications generated
with the dummy FID and delete them from the device.

Another option is to chain an ignore action after the redirect action
which will cause the device to disable learning, but this means that we
need to consume another action whenever a redirect action is used. In
addition, the scenario described above is merely a corner case.

Fixes: cedbb8b2 ("mlxsw: spectrum_flower: Set dummy FID before forward action")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alex Kushnarov <alexanderk@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Alex Kushnarov <alexanderk@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

577fa14d

mlxsw: spectrum_dcb: Configure DSCP map as the last rule is removed · dedfde2f

Petr Machata authored Jul 17, 2019

Spectrum systems use DSCP rewrite map to update DSCP field in egressing
packets to correspond to priority that the packet has. Whether rewriting
will take place is determined at the point when the packet ingresses the
switch: if the port is in Trust L3 mode, packet priority is determined from
the DSCP map at the port, and DSCP rewrite will happen. If the port is in
Trust L2 mode, 802.1p is used for packet prioritization, and no DSCP
rewrite will happen.

The driver determines the port trust mode based on whether any DSCP
prioritization rules are in effect at given port. If there are any, trust
level is L3, otherwise it's L2. When the last DSCP rule is removed, the
port is switched to trust L2. Under that scenario, if DSCP of a packet
should be rewritten, it should be rewritten to 0.

However, when switching to Trust L2, the driver neglects to also update the
DSCP rewrite map. The last DSCP rule thus remains in effect, and packets
egressing through this port, if they have the right priority, will have
their DSCP set according to this rule.

Fix by first configuring the rewrite map, and only then switching to trust
L2 and bailing out.

Fixes: b2b1dab6 ("mlxsw: spectrum: Support ieee_setapp, ieee_delapp")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reported-by: Alex Veber <alexve@mellanox.com>
Tested-by: Alex Veber <alexve@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dedfde2f