Commits · 339bbff2d6e005a5586adeffc3d69a0eea50a764 · nexedi / linux

21 Dec, 2018 16 commits

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 339bbff2

David S. Miller authored Dec 20, 2018

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-12-21

The following pull-request contains BPF updates for your *net-next* tree.

There is a merge conflict in test_verifier.c. Result looks as follows:

        [...]
        },
        {
                "calls: cross frame pruning",
                .insns = {
                [...]
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .errstr_unpriv = "function calls to other bpf functions are allowed for root only",
                .result_unpriv = REJECT,
                .errstr = "!read_ok",
                .result = REJECT,
	},
        {
                "jset: functional",
                .insns = {
        [...]
        {
                "jset: unknown const compare not taken",
                .insns = {
                        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
                                     BPF_FUNC_get_prandom_u32),
                        BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
                        BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
                        BPF_EXIT_INSN(),
                },
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .errstr_unpriv = "!read_ok",
                .result_unpriv = REJECT,
                .errstr = "!read_ok",
                .result = REJECT,
        },
        [...]
        {
                "jset: range",
                .insns = {
                [...]
                },
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .result_unpriv = ACCEPT,
                .result = ACCEPT,
        },

The main changes are:

1) Various BTF related improvements in order to get line info
   working. Meaning, verifier will now annotate the corresponding
   BPF C code to the error log, from Martin and Yonghong.

2) Implement support for raw BPF tracepoints in modules, from Matt.

3) Add several improvements to verifier state logic, namely speeding
   up stacksafe check, optimizations for stack state equivalence
   test and safety checks for liveness analysis, from Alexei.

4) Teach verifier to make use of BPF_JSET instruction, add several
   test cases to kselftests and remove nfp specific JSET optimization
   now that verifier has awareness, from Jakub.

5) Improve BPF verifier's slot_type marking logic in order to
   allow more stack slot sharing, from Jiong.

6) Add sk_msg->size member for context access and add set of fixes
   and improvements to make sock_map with kTLS usable with openssl
   based applications, from John.

7) Several cleanups and documentation updates in bpftool as well as
   auto-mount of tracefs for "bpftool prog tracelog" command,
   from Quentin.

8) Include sub-program tags from now on in bpf_prog_info in order to
   have a reliable way for user space to get all tags of the program
   e.g. needed for kallsyms correlation, from Song.

9) Add BTF annotations for cgroup_local_storage BPF maps and
   implement bpf fs pretty print support, from Roman.

10) Fix bpftool in order to allow for cross-compilation, from Ivan.

11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
    to be compatible with libbfd and allow for Debian packaging,
    from Jakub.

12) Remove an obsolete prog->aux sanitation in dump and get rid of
    version check for prog load, from Daniel.

13) Fix a memory leak in libbpf's line info handling, from Prashant.

14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
    does not get unaligned, from Jesper.

15) Fix test_progs kselftest to work with older compilers which are less
    smart in optimizing (and thus throwing build error), from Stanislav.

16) Cleanup and simplify AF_XDP socket teardown, from Björn.

17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
    to netns_id argument, from Andrey.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

339bbff2

Merge branch 'expand-txtimestamp-selftest' · e770454f

David S. Miller authored Dec 20, 2018

Willem de Bruijn says:

====================
expand txtimestamp selftest

Convert the existing txtimestamp test to run as part of kselftest
and return a pass/fail.

Also expand the variations of timestamping tested, including packet
sockets, ipv6 raw and dgram and passing options using cmsg.

These are enough changes to split across a few patches, even if all
changes are only this one test.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e770454f

selftests: add txtimestamp kselftest · cda261f4

Willem de Bruijn authored Dec 20, 2018

Run the transmit timestamp tests as part of kselftests.

Add a txtimestamp.sh test script that runs most variants:
ipv4/ipv6, tcp/udp/raw/raw_ipproto/pf_packet, data/nodata,
setsockopt/cmsg. The script runs tests with netem delays.

Refine txtimestamp.c to validate results. Take expected
netem delays as input and compare against real timestamps.

To run without dependencies, add a listener socket to be
able to connect in the case of TCP.

Add the timestamping directory to the kselftests Makefile.
Build all the binaries. Only run verified txtimestamp.sh.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cda261f4

selftests: expand txtimestamp with ipv6 dgram + raw and pf_packet · b52354aa

Willem de Bruijn authored Dec 20, 2018

Expand the transmit timestamp regression test with support for
missing protocols: ipv6 datagram and raw and pf_packet.

Also refine resolve_hostname to independently request AF_INET or
AF_INET6 addresses. Else, ipv4 addresses may be returned as AF_INET6.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b52354aa

selftests: expand txtimestamp with cmsg support · 7085f47f

Willem de Bruijn authored Dec 20, 2018

Commit 3dd17e63 ("sock: accept SO_TIMESTAMPING flags in socket
cmsg") added support for passing tx timestamping options per-call
in sendmsg.

Expand the txtimestamp test with support for this feature.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7085f47f

net: seg6.h: remove an unused #include · a6ae520d

Peter Oskolkov authored Dec 20, 2018

A minor code cleanup.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6ae520d

ppp: Move PFC decompression to PPP generic layer · 7fb1b8ca

Sam Protsenko authored Dec 20, 2018

Extract "Protocol" field decompression code from transport protocols to
PPP generic layer, where it actually belongs. As a consequence, this
patch fixes incorrect place of PFC decompression in L2TP driver (when
it's not PPPOX_BOUND) and also enables this decompression for other
protocols, like PPPoE.

Protocol field decompression also happens in PPP Multilink Protocol
code and in PPP compression protocols implementations (bsd, deflate,
mppe). It looks like there is no easy way to get rid of that, so it was
decided to leave it as is, but provide those cases with appropriate
comments instead.

Changes in v2:
  - Fix the order of checking skb data room and proto decompression
  - Remove "inline" keyword from ppp_decompress_proto()
  - Don't split line before function name
  - Prefix ppp_decompress_proto() function with "__"
  - Add ppp_decompress_proto() function with skb data room checks
  - Add description for introduced functions
  - Fix comments (as per review on mailing list)
Signed-off-by: Sam Protsenko <semen.protsenko@linaro.org>
Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

7fb1b8ca

Merge tag 'wireless-drivers-next-for-davem-2018-12-20' of... · e69fbf31

David S. Miller authored Dec 20, 2018

Merge tag 'wireless-drivers-next-for-davem-2018-12-20' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.21

Last set of patches for 4.21. mt76 is still in very active development
and having some refactoring as well as new features. But also other
drivers got few new features and fixes.

Major changes:

ath10k

* add amsdu support for QCA6174 monitor mode

* report tx rate using the new ieee80211_tx_rate_update() API

* wcn3990 support is not experimental anymore

iwlwifi

* support for FW version 43 for 9000 and 22000 series

brcmfmac

* add support for CYW43012 SDIO chipset

* add the raw 4354 PCIe device ID for unprogrammed Cypress boards

mwifiex

* add NL80211_STA_INFO_RX_BITRATE support

mt76

* use the same firmware for mt76x2e and mt76x2u

* mt76x0e survey support

* more unification between mt76x2 and mt76x0

* mt76x0e AP mode support

* mt76x0e DFS support

* rework and fix tx status handling for mt76x0 and mt76x2
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e69fbf31

linux/netlink.h: drop unnecessary extern prefix · aa9d6e0f

Stephen Hemminger authored Dec 20, 2018

Don't need extern prefix before function prototypes.
Checkpatch has complained about this for a couple of years.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa9d6e0f

Merge branch 'ipv4-Prevent-user-triggerable-warning' · 7de33309

David S. Miller authored Dec 20, 2018

Ido Schimmel says:

====================
net: ipv4: Prevent user triggerable warning

Patch #1 prevents a user triaggerable warning in the flow dissector by
setting 'skb->dev' in skbs used for IPv4 output route get requests.

Patch #2 adds a test case that triggers the warning without the first
patch.

I have audited all the RTM_GETROUTE handlers and could not find any
other callpath where an skb is passed to the flow dissector with both
'skb->dev' and 'skb->sk' cleared.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7de33309

selftests: rtnetlink: Add a test case for multipath route get · 676f4bb1

Ido Schimmel authored Dec 20, 2018

Without previous patch a warning would be generated upon multipath route
get when FIB multipath hash policy is to use a 5-tuple for multipath
hash calculation.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

676f4bb1

net: ipv4: Set skb->dev for output route resolution · 21f94775

Ido Schimmel authored Dec 20, 2018

When user requests to resolve an output route, the kernel synthesizes
an skb where the relevant parameters (e.g., source address) are set. The
skb is then passed to ip_route_output_key_hash_rcu() which might call
into the flow dissector in case a multipath route was hit and a nexthop
needs to be selected based on the multipath hash.

Since both 'skb->dev' and 'skb->sk' are not set, a warning is triggered
in the flow dissector [1]. The warning is there to prevent codepaths
from silently falling back to the standard flow dissector instead of the
BPF one.

Therefore, instead of removing the warning, set 'skb->dev' to the
loopback device, as its not used for anything but resolving the correct
namespace.

[1]
WARNING: CPU: 1 PID: 24819 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x314/0x16b0
...
RSP: 0018:ffffa0df41fdf650 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8bcded232000 RCX: 0000000000000000
RDX: ffffa0df41fdf7e0 RSI: ffffffff98e415a0 RDI: ffff8bcded232000
RBP: ffffa0df41fdf760 R08: 0000000000000000 R09: 0000000000000000
R10: ffffa0df41fdf7e8 R11: ffff8bcdf27a3000 R12: ffffffff98e415a0
R13: ffffa0df41fdf7e0 R14: ffffffff98dd2980 R15: ffffa0df41fdf7e0
FS:  00007f46f6897680(0000) GS:ffff8bcdf7a80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055933e95f9a0 CR3: 000000021e636000 CR4: 00000000001006e0
Call Trace:
 fib_multipath_hash+0x28c/0x2d0
 ? fib_multipath_hash+0x28c/0x2d0
 fib_select_path+0x241/0x32f
 ? __fib_lookup+0x6a/0xb0
 ip_route_output_key_hash_rcu+0x650/0xa30
 ? __alloc_skb+0x9b/0x1d0
 inet_rtm_getroute+0x3f7/0xb80
 ? __alloc_pages_nodemask+0x11c/0x2c0
 rtnetlink_rcv_msg+0x1d9/0x2f0
 ? rtnl_calcit.isra.24+0x120/0x120
 netlink_rcv_skb+0x54/0x130
 rtnetlink_rcv+0x15/0x20
 netlink_unicast+0x20a/0x2c0
 netlink_sendmsg+0x2d1/0x3d0
 sock_sendmsg+0x39/0x50
 ___sys_sendmsg+0x2a0/0x2f0
 ? filemap_map_pages+0x16b/0x360
 ? __handle_mm_fault+0x108e/0x13d0
 __sys_sendmsg+0x63/0xa0
 ? __sys_sendmsg+0x63/0xa0
 __x64_sys_sendmsg+0x1f/0x30
 do_syscall_64+0x5a/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: d0e13a14 ("flow_dissector: lookup netns by skb->sk if skb->dev is NULL")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

21f94775

net: mscc: ocelot: Register poll timeout should be wall time not attempts · 639c1b26

Steen Hegelund authored Dec 20, 2018

When doing indirect access in the Ocelot chip, a command is setup,
issued and then we need to poll until the result is ready. The polling
timeout is specified in milliseconds in the datasheet and not in
register access attempts.
It is not a bug on the currently supported platform, but we observed
that the code does not work properly on other platforms that we want to
support as the timing requirements there are different.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

639c1b26

neighbour: remove stray semicolon · 463561e6

Colin Ian King authored Dec 20, 2018

Currently the stray semicolon means that the final term in the addition
is being missed. Fix this by removing it. Cleans up clang warning:

net/core/neighbour.c:2821:9: warning: expression result unused [-Wunused-value]

Fixes: 82cbb5c6 ("neighbour: register rtnl doit handler")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-By: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

463561e6

net: dsa: microchip: fix unicast frame leak · 962ad710

Tristram Ha authored Dec 19, 2018

Port partitioning is done by enabling UNICAST_VLAN_BOUNDARY and changing
the default port membership of 0x7f to other values such that there is
no communication between ports. In KSZ9477 the member for port 1 is
0x41; port 2, 0x42; port 3, 0x44; port 4, 0x48; port 5, 0x50; and port 7,
0x60. Port 6 is the host port.

Setting a zero value can be used to stop port from receiving.

However, when UNICAST_VLAN_BOUNDARY is disabled and the unicast addresses
are already learned in the dynamic MAC table, setting zero still allows
devices connected to those ports to communicate. This does not apply to
multicast and broadcast addresses though. To prevent these leaks and
make the function of port membership consistent UNICAST_VLAN_BOUNDARY
should never be disabled.

Note that UNICAST_VLAN_BOUNDARY is enabled by default in KSZ9477.

Fixes: b987e98e ("dsa: add DSA switch driver for Microchip KSZ9477")
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

962ad710

vxlan: Correct merge error. · 3a6d528a

David S. Miller authored Dec 20, 2018

When resolving the conflict wrt. the vxlan_fdb_update call
in vxlan_changelink() I made the last argument false instead
of true.

Fix this.
Signed-off-by: David S. Miller <davem@davemloft.net>

3a6d528a

20 Dec, 2018 24 commits

Merge tag 'mlx5-updates-2018-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · e7164313

David S. Miller authored Dec 20, 2018

Saeed Mahameed says:

====================
mlx5-updates-2018-12-19

This series adds some misc updates and the support for tunnels over VLAN
tc offloads.

From Miroslav Lichvar, patches #1,2
1) Update timecounter at least twice per counter overflow
2) Extend PTP gettime function to read system clock

From Gavi Teitz, patch #3
3) Increase VF representors' SQ size to 128

From Eli Britstein and Or Gerlitz, patches #4-10
4) Adds the capability to support tunnels over VLAN device.

Patch 4 avoids crash for TC flow with egress upper devices

Patch 5 refactors tunnel routing devs into a helper function

Patch 6 avoids crash for TC encap flows with vlan on underlay

Patches 7-8 refactor encap tunnel header preparing code.

Patch 9 adds support for building VLAN tagged ETH header.

Patch 10 adds support for tunnel routing to VLAN device.

From Aviv, patches 11,12 to fix earlier VF lag series
5) Fix query_nic_sys_image_guid() error during init
6) Fix LAG requirement when CONFIG_MLX5_ESWITCH is off
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e7164313

Merge branch 'mlxsw-Two-usability-improvements' · c337680f

David S. Miller authored Dec 20, 2018

Ido Schimmel says:

====================
mlxsw: Two usability improvements

This patchset contains two small improvements in the mlxsw driver. The
first one, in patches #1-#2, relieves the user from the need to
configure a VLAN interface and only later the corresponding VXLAN
tunnel. The issue is explained in detail in the first patch.

The second improvement is described below and allows the user to make
use of VID 1 by having the driver use the reserved 4095 VID for untagged
traffic.

VLAN entries on a given port can be associated with either a bridge or a
router. For example, if swp1.10 is assigned an IP address and swp1.20 is
enslaved to a VLAN-unaware bridge, then both {Port 1, VID 10} and {Port
1, VID 20} would be associated with a filtering identifier (FID) of the
correct type.

In case swp1 itself is assigned an IP address or enslaved to a
VLAN-unaware bridge, then a FID would be associated with {Port 1, VID
1}. Using VID 1 for this purpose means that VLAN devices with VID 1
cannot be created over mlxsw ports, as this VID is (ab)used as the
default VLAN.

Instead of using VID 1 for this purpose, we can use VID 4095 which is
reserved for internal use and cannot be configured by either the 8021q
or the bridge driver.

Patches #3-#7 perform small and non-functional changes that finally
allow us to switch to VID 4095 as the default VID in patch #8.

Patch #9 removes the limitation about creation of VLAN devices with VID
1 over mlxsw ports.

Patches #10-#11 add test cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c337680f

selftests: forwarding: Add router test with VID 1 · 03a84ea3

Ido Schimmel authored Dec 20, 2018

Previous patches made it possible to setup VLAN devices with VID 1 over
mlxsw ports. Verify this functionality actually works by conducting a
simple router test over VID 1.

Adding this test as a generic test since it can be run using veth pairs
and it can also be useful for other physical devices where VID 1 was
considered reserved (knowingly or not).
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03a84ea3

selftests: mlxsw: Adjust test regarding VID 1 · 29b1e34e

Ido Schimmel authored Dec 20, 2018

Previous patches made it possible to create VLAN devices with VID 1 over
mlxsw ports. Adjust the test to verify such an operation succeeds.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29b1e34e

mlxsw: spectrum: Remove limitation regarding VID 1 · d8a1f7ab

Ido Schimmel authored Dec 20, 2018

VID 1 is not reserved anymore, so remove the check that prevented the
creation of VLAN devices with this VID over mlxsw ports.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d8a1f7ab

mlxsw: spectrum: Switch to VID 4095 as default VID · 0417d25e

Ido Schimmel authored Dec 20, 2018

There is no need to abuse VID 1 anymore and we can instead use VID 4095
as the default VLAN, which will be configured on the port throughout its
lifetime.

The OVS join / leave functions are changed to enable VIDs 1-4094
(inclusive) instead of 2-4095. This because VID 4095 is now the default
VLAN instead of 1.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0417d25e

mlxsw: spectrum: Add an helper function to cleanup VLAN entries · 16f6aceb

Ido Schimmel authored Dec 20, 2018

VLAN entries on a port can be associated with either a bridge VLAN or a
router port. Before the VLAN entry is destroyed these associations need
to be cleaned up.

Currently, this is always invoked from the function which destroys the
VLAN entry, but next patch is going to skip the destruction of the
default entry when a port in unlinked from a LAG.

The above does not mean that the associations should not be cleaned up,
so add a helper that will be invoked from both call sites.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16f6aceb

mlxsw: spectrum: Store pointer to default port VLAN in port struct · 346fca3b

Ido Schimmel authored Dec 20, 2018

Subsequent patches will need to access the default port VLAN. Since this
VLAN will exist throughout the lifetime of the port, simply store it in
the port's struct.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

346fca3b

mlxsw: spectrum: Allow controlling destruction of default port VLAN · ab6c3b79

Ido Schimmel authored Dec 20, 2018

The function allows flushing all the existing VLAN entries on a port. It
is invoked when a port is destroyed and when it is unlinked from a LAG.
In the latter case, when moving to the new default VLAN, there will not
be a need to destroy the default VLAN entry.

Therefore, add an argument that allows to control whether the default
port VLAN should be destroyed or not. Currently it is always set to
'true'.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ab6c3b79

mlxsw: spectrum: Set PVID during port initialization · 262e1ff9

Ido Schimmel authored Dec 20, 2018

Currently, the driver does not set the port's PVID when initializing a
new port. This is because the driver is using VID 1 as PVID which is the
firmware default.

Subsequent patches are going to change the PVID the driver is setting
when initializing a new port.

Prepare for that by explicitly setting the port's PVID.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

262e1ff9

mlxsw: spectrum: Replace hard-coded default VID with a define · a2d2a205

Ido Schimmel authored Dec 20, 2018

Subsequent patches are going to replace the current default VID (1) with
VLAN_N_VID - 1 (4095).

Prepare for this conversion by replacing the hard-coded '1' with a
define.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2d2a205

selftests: mlxsw: Add a test case for L3 VNI · 9d15dceb

Ido Schimmel authored Dec 20, 2018

Previous patch added the ability to offload a VXLAN tunnel used for L3
VNI when it is present in the VLAN-aware bridge before the corresponding
VLAN interface is configured. This patch adds a test case to verify
that.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9d15dceb

mlxsw: spectrum_router: Do not force specific configuration order · f40be47a

Ido Schimmel authored Dec 20, 2018

In symmetric routing, the only two members in the VLAN corresponding to
the L3 VNI are the router port and the VXLAN tunnel.

In case the VXLAN device is already enslaved to the bridge and only
later the VLAN interface is configured, the tunnel will not be
offloaded.

The reason for this is that when the router interface (RIF)
corresponding to the VLAN interface is configured, it calls the core
fid_get() API which does not check if NVE should be enabled on the FID.

Instead, call into the bridge code which will check if NVE should be
enabled on the FID.

This effectively means that the same code path is used to retrieve a FID
when either a local port or a router port joins the FID.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f40be47a

Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 6eea2db2

David S. Miller authored Dec 20, 2018

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2018-12-20

This series contains updates to e100, igb, ixgbe, i40e and ice drivers.

I replaced spinlocks for mutex locks to reduce the latency on CPU0 for
igb when updating the statistics.  This work was based off a patch
provided by Jan Jablonsky, which was against an older version of the igb
driver.

Jesus adjusts the receive packet buffer size from 32K to 30K when
running in QAV mode, to stay within 60K for total packet buffer size for
igb.

Vinicius adds igb kernel documentation regarding the CBS algorithm and
its implementation in the i210 family of NICs.

YueHaibing from Huawei fixed the e100 driver that was potentially
passing a NULL pointer, so use the kernel macro IS_ERR_OR_NULL()
instead.

Konstantin Khorenko fixes i40e where we were not setting up the
neigh_priv_len in our net_device, which caused the driver to read beyond
the neighbor entry allocated memory.

Miroslav Lichvar extends the PTP gettime() to read the system clock by
adding support for PTP_SYS_OFFSET_EXTENDED ioctl in i40e.

Young Xiao fixed the ice driver to only enable NAPI on q_vectors that
actually have transmit and receive rings.

Kai-Heng Feng fixes an igb issue that when placed in suspend mode, the
NIC does not wake up when a cable is plugged in.  This was due to the
driver not setting PME during runtime suspend.

Stephen Douthit enables the ixgbe driver allow DSA devices to use the
MII interface to talk to switches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6eea2db2

Merge branch 'bpf-sockmap-fixes-and-improvements' · 1cf4a0cc

Daniel Borkmann authored Dec 20, 2018

John Fastabend says:

====================
Set of bpf fixes and improvements to make sockmap with kTLS usable
with "real" applications. This set came as the fallout of pulling
kTLS+sockmap into Cilium[1] and running in container environment.

Roughly broken into three parts,

Patches 1-3: resolve/improve handling of size field in sk_msg_md
Patch     4: it became difficult to use this in Cilium when the
	     SK_PASS verdict was not correctly handle. So handle
	     the case correctly.
Patch   5-8: Set of issues found while running OpenSSL TX kTLS
	     enabled applications. This resolves the most obvious
	     issues and gets applications using kTLS TX up and
	     running with sock{map|has}.

Other than the "sk_msg, zap ingress queue on psock down" (PATCH 6/8)
which can potentially cause a WARNING the issues fixed in this
series do not cause kernel side warnings, BUG, etc. but instead
cause stalls and other odd behavior in the user space applications
when using kTLS with BPF policies applied.

Primarily tested with 'curl' compiled with latest openssl and
also 'openssl s_client/s_server' containers using Cilium network
plugin with docker/k8s. Some basic testing with httpd was also
enabled. Cilium CI tests will be added shortly to cover these
cases as well. We also have 'wrk' and other test and benchmarking
tools we can run now.

We have two more sets of patches currently under testing that
will be sent shortly to address a few more issues. First the
OpenSSL RX kTLS side breaks when both sk_msg and sk_skb_verdict
programs are used with kTLS, the sk_skb_verdict programs are
not enforced. Second skmsg needs to call into tcp stack to
send to indicate consumed data.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

1cf4a0cc

bpf: tls_sw, init TLS ULP removes BPF proto hooks · 28cb6f1e

John Fastabend authored Dec 20, 2018

The existing code did not expect users would initialize the TLS ULP
without subsequently calling the TLS TX enabling socket option.
If the application tries to send data after the TLS ULP enable op
but before the TLS TX enable op the BPF sk_msg verdict program is
skipped. This patch resolves this by converting the ipv4 sock ops
to be calculated at init time the same way ipv6 ops are done. This
pulls in any changes to the sock ops structure that have been made
after the socket was created including the changes from adding the
socket to a sock{map|hash}.

This was discovered by running OpenSSL master branch which calls
the TLS ULP setsockopt early in TLS handshake but only enables
the TLS TX path once the handshake has completed. As a result the
datapath missed the initial handshake messages.

Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

28cb6f1e

bpf: sk_msg, sock{map|hash} redirect through ULP · 0608c69c

John Fastabend authored Dec 20, 2018

A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.

To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.

Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.

In the sendfile case the function do_sendfile() zero's flags,

./fs/read_write.c:
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
		   	    size_t count, loff_t max)
 {
   ...
   fl = 0;
#if 0
   /*
    * We need to debate whether we can enable this or not. The
    * man page documents EAGAIN return for the output at least,
    * and the application is arguably buggy if it doesn't expect
    * EAGAIN on a non-blocking file descriptor.
    */
    if (in.file->f_flags & O_NONBLOCK)
	fl = SPLICE_F_NONBLOCK;
#endif
    file_start_write(out.file);
    retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
 }

In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.

./fs/splice.c:
 static int pipe_to_sendpage(struct pipe_inode_info *pipe,
			    struct pipe_buffer *buf, struct splice_desc *sd)
 {
   ...
   more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
   ...
 }

Confirming what we expect that internal flags  are in fact internal
to socket side.

Fixes: d3b18ad3 ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

0608c69c

bpf: sk_msg, zap ingress queue on psock down · a136678c

John Fastabend authored Dec 20, 2018

In addition to releasing any cork'ed data on a psock when the psock
is removed we should also release any skb's in the ingress work queue.
Otherwise the skb's eventually get free'd but late in the tear
down process so we see the WARNING due to non-zero sk_forward_alloc.

  void sk_stream_kill_queues(struct sock *sk)
  {
	...
	WARN_ON(sk->sk_forward_alloc);
	...
  }

Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

a136678c

bpf: sk_msg, fix socket data_ready events · 552de910

John Fastabend authored Dec 20, 2018

When a skb verdict program is in-use and either another BPF program
redirects to that socket or the new SK_PASS support is used the
data_ready callback does not wake up application. Instead because
the stream parser/verdict is using the sk data_ready callback we wake
up the stream parser/verdict block.

Fix this by adding a helper to check if the stream parser block is
enabled on the sk and if so call the saved pointer which is the
upper layers wake up function.

This fixes application stalls observed when an application is waiting
for data in a blocking read().

Fixes: d829e9c4 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

552de910

bpf: skb_verdict, support SK_PASS on RX BPF path · 51199405

John Fastabend authored Dec 20, 2018

Add SK_PASS verdict support to SK_SKB_VERDICT programs. Now that
support for redirects exists we can implement SK_PASS as a redirect
to the same socket. This simplifies the BPF programs and avoids an
extra map lookup on RX path for simple visibility cases.

Further, reduces user (BPF programmer in this context) confusion
when their program drops skb due to lack of support.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

51199405

bpf: skmsg, replace comments with BUILD bug · 7a69c0f2

John Fastabend authored Dec 20, 2018

Enforce comment on structure layout dependency with a BUILD_BUG_ON
to ensure the condition is maintained.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

7a69c0f2

bpf: sk_msg, improve offset chk in _is_valid_access · bc1b4f01

John Fastabend authored Dec 20, 2018

The check for max offset in sk_msg_is_valid_access uses sizeof()
which is incorrect because it would allow accessing possibly
past the end of the struct in the padded case. Further, it doesn't
preclude accessing any padding that may be added in the middle of
a struct. All told this makes it fragile to rely on.

To fix this explicitly check offsets with fields using the
bpf_ctx_range() and bpf_ctx_range_till() macros.

For reference the current structure layout looks as follows (reported
by pahole)

struct sk_msg_md {
	union {
		void *             data;                 /*           8 */
	};                                               /*     0     8 */
	union {
		void *             data_end;             /*           8 */
	};                                               /*     8     8 */
	__u32                      family;               /*    16     4 */
	__u32                      remote_ip4;           /*    20     4 */
	__u32                      local_ip4;            /*    24     4 */
	__u32                      remote_ip6[4];        /*    28    16 */
	__u32                      local_ip6[4];         /*    44    16 */
	__u32                      remote_port;          /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	__u32                      local_port;           /*    64     4 */
	__u32                      size;                 /*    68     4 */

	/* size: 72, cachelines: 2, members: 10 */
	/* last cacheline: 8 bytes */
};

So there should be no padding at the moment but fixing this now
prevents future errors.
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bc1b4f01

bpf: sk_msg, fix sk_msg_md access past end test · 9ee79a65

John Fastabend authored Dec 20, 2018

Currently, the test to ensure reads past the end of the sk_msg_md
data structure fail is incorrectly expecting success. Fix this
typo and use correct expected error.

Fixes: 945a47d8 ("bpf: sk_msg, add tests for size field")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

9ee79a65

bpf/cpumap: make sure frame_size for build_skb is aligned if headroom isn't · 77ea5f4c

Jesper Dangaard Brouer authored Dec 19, 2018

The frame_size passed to build_skb must be aligned, else it is
possible that the embedded struct skb_shared_info gets unaligned.

For correctness make sure that xdpf->headroom in included in the
alignment. No upstream drivers can hit this, as all XDP drivers provide
an aligned headroom. This was discovered when playing with implementing
XDP support for mvneta, which have a 2 bytes DSA header, and this
Marvell ARM64 platform didn't like doing atomic operations on an
unaligned skb_shinfo(skb)->dataref addresses.

Fixes: 1c601d82 ("bpf: cpumap xdp_buff to skb conversion and allocation")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

77ea5f4c