Commits · 0d08026ac6099ef8bd73412005830ce7280b7c80 · Kirill Smelkov / linux

11 Aug, 2021 8 commits

net: ipa: kill ipa_clock_get_additional() · 0d08026a

Alex Elder authored Aug 10, 2021

Now that ipa_clock_get_additional() is a trivial wrapper around
pm_runtime_get_if_active(), just open-code it in its only caller
and delete the function.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d08026a

net: ipa: kill IPA clock reference count · a71aeff3

Alex Elder authored Aug 10, 2021

The runtime power management core code maintains a usage count.  This
count mirrors the IPA clock reference count, and there's no need to
maintain both.  So get rid of the IPA clock reference count and just
rely on the runtime PM usage count to determine when the hardware
should be suspended or resumed.

Use pm_runtime_get_if_active() in ipa_clock_get_additional().  We
care whether power is active, regardless of whether it's in use, so
pass true for its ign_usage_count argument.

The IPA clock mutex is just used to make enabling/disabling the
clock and updating the reference count occur atomically.  Without
the reference count, there's no need for the mutex, so get rid of
that too.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a71aeff3

net: ipa: get rid of extra clock reference · a3d3e759

Alex Elder authored Aug 10, 2021

Suspending the IPA hardware is now managed by the runtime PM core
code.  The ->runtime_idle callback returns a non-zero value, so it
will never suspend except when forced.  As a result, there's no need
to take an extra "do not suspend" clock reference.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3d3e759

net: ipa: use runtime PM core · 63de79f0

Alex Elder authored Aug 10, 2021

Use the runtime power management core to cause hardware suspend and
resume to occur.  Enable it in ipa_clock_init() (without autosuspend),
and disable it in ipa_clock_exit().

Use ipa_runtime_suspend() as the ->runtime_suspend power operation,
and arrange for it to be called by having ipa_clock_get() call
pm_runtime_get_sync() when the first clock reference is taken.
Similarly, use ipa_runtime_resume() as the ->runtime_resume power
operation, and pm_runtime_put() when the last IPA clock reference
is dropped.

Introduce ipa_runtime_idle() as the ->runtime_idle power operation,
and have it return a non-zero value; this way suspend will never
occur except when forced.

Use pm_runtime_force_suspend() and pm_runtime_force_resume() as the
system suspend and resume callbacks, and remove ipa_suspend() and
ipa_resume().

Store a pointer to the device structure passed to ipa_clock_init(),
so it can be used by ipa_clock_exit() to disable runtime power
management.

For now we preserve IPA clock reference counting.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

63de79f0

net: ipa: resume in ipa_clock_get() · 2abb0c7f

Alex Elder authored Aug 10, 2021

Introduce ipa_runtime_suspend() and ipa_runtime_resume(), which
encapsulate the activities necessary for suspending and resuming
the IPA hardware.  Call these functions from ipa_clock_get() and
ipa_clock_put() when the first reference is taken or last one is
dropped.

When the very first clock reference is taken (for ipa_config()),
setup isn't complete yet, so (as before) only the core clock gets
enabled.

When the last clock reference is dropped (after ipa_deconfig()),
ipa_teardown() will have made the setup_complete flag false, so
there too, the core clock will be stopped without affecting GSI
or the endpoints.

Otherwise these new functions will perform the desired suspend and
resume actions once setup is complete.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

2abb0c7f

net: ipa: disable clock in suspend · 1016c6b8

Alex Elder authored Aug 10, 2021

Disable the IPA clock rather than dropping a reference to it in the
system suspend callback.  This forces the suspend to occur without
affecting existing references.

Similarly, enable the clock rather than taking a reference in
ipa_resume(), forcing a resume without changing the reference count.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1016c6b8

net: ipa: have ipa_clock_get() return a value · 7ebd168c

Alex Elder authored Aug 10, 2021

We currently assume no errors occur when enabling or disabling the
IPA core clock and interconnects.  And although this commit exposes
errors that could occur, we generally assume this won't happen in
practice.

This commit changes ipa_clock_get() and ipa_clock_put() so each
returns a value.  The values returned are meant to mimic what the
runtime power management functions return, so we can set up error
handling here before we make the switch.  Have ipa_clock_get()
increment the reference count even if it returns an error, to match
the behavior of pm_runtime_get().

More details follow.

When taking a reference in ipa_clock_get(), return 0 for the first
reference, 1 for subsequent references, or a negative error code if
an error occurs.  Note that if ipa_clock_get() returns an error, we
must not touch hardware; in some cases such errors now cause entire
blocks of code to be skipped.

When dropping a reference in ipa_clock_put(), we return 0 or an
error code.  The error would come from ipa_clock_disable(), which
now returns what ipa_interconnect_disable() returns (either 0 or a
negative error code).  For now, callers ignore the return value;
if an error occurs, a message will have already been logged, and
little more can actually be done to improve the situation.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ebd168c

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · 6f45933d

David S. Miller authored Aug 11, 2021

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat.

2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout
   fixup.

3) CLUSTERIP registers ARP hook on demand, from Florian.

4) Use clusterip_net to store pernet warning, also from Florian.

5) Remove struct netns_xt, from Florian Westphal.

6) Enable ebtables hooks in initns on demand, from Florian.

7) Allow to filter conntrack netlink dump per status bits,
   from Florian Westphal.

8) Register x_tables hooks in initns on demand, from Florian.

9) Remove queue_handler from per-netns structure, again from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6f45933d

10 Aug, 2021 20 commits

net: Support filtering interfaces on no master · d3432bf1

Lahav Schlesinger authored Aug 10, 2021

Currently there's support for filtering neighbours/links for interfaces
which have a specific master device (using the IFLA_MASTER/NDA_MASTER
attributes).

This patch adds support for filtering interfaces/neighbours dump for
interfaces that *don't* have a master.
Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210810090658.2778960-1-lschlesinger@drivenets.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d3432bf1

net/sched: cls_api, reset flags on replay · a5397d68

Mark Bloch authored Aug 10, 2021

tc_new_tfilter() can replay a request if it got EAGAIN. The cited commit
didn't account for this when it converted TC action ->init() API
to use flags instead of parameters. This can lead to passing stale flags
down the call chain which results in trying to lock rtnl when it's
already locked, deadlocking the entire system.

Fix by making sure to reset flags on each replay.

============================================
WARNING: possible recursive locking detected
5.14.0-rc3-custom-49011-g3d2bbb4f104d #447 Not tainted
--------------------------------------------
tc/37605 is trying to acquire lock:
ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_setup_cb_add+0x14b/0x4d0

but task is already holding lock:
ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_new_tfilter+0xb12/0x22e0

other info that might help us debug this:
 Possible unsafe locking scenario:
       CPU0
       ----
  lock(rtnl_mutex);
  lock(rtnl_mutex);

 *** DEADLOCK ***
 May be due to missing lock nesting notation
1 lock held by tc/37605:
 #0: ffffffff841df2f0 (rtnl_mutex){+.+.}-{3:3}, at: tc_new_tfilter+0xb12/0x22e0

stack backtrace:
CPU: 0 PID: 37605 Comm: tc Not tainted 5.14.0-rc3-custom-49011-g3d2bbb4f104d #447
Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017
Call Trace:
 dump_stack_lvl+0x8b/0xb3
 __lock_acquire.cold+0x175/0x3cb
 lock_acquire+0x1a4/0x4f0
 __mutex_lock+0x136/0x10d0
 fl_hw_replace_filter+0x458/0x630 [cls_flower]
 fl_change+0x25f2/0x4a64 [cls_flower]
 tc_new_tfilter+0xa65/0x22e0
 rtnetlink_rcv_msg+0x86c/0xc60
 netlink_rcv_skb+0x14d/0x430
 netlink_unicast+0x539/0x7e0
 netlink_sendmsg+0x84d/0xd80
 ____sys_sendmsg+0x7ff/0x970
 ___sys_sendmsg+0xf8/0x170
 __sys_sendmsg+0xea/0x1b0
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7b93b6c0a7
Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48>
RSP: 002b:00007ffe365b3818 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b93b6c0a7
RDX: 0000000000000000 RSI: 00007ffe365b3880 RDI: 0000000000000003
RBP: 00000000610a75f6 R08: 0000000000000001 R09: 0000000000000000
R10: fffffffffffff3a9 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00007ffe365b7b58 R15: 00000000004822c0

Fixes: 695176bf ("net_sched: refactor TC action init API")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20210810034305.63997-1-mbloch@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a5397d68

Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · ebd0d30c

Jakub Kicinski authored Aug 10, 2021

Saeed Mahameed says:

====================
pull-request: mlx5-next 2020-08-9

This pulls mlx5-next branch which includes patches already reviewed on
net-next and rdma mailing lists.

1) mlx5 single E-Switch FDB for lag

2) IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq

3) Add DCS caps & fields support

[1] https://patchwork.kernel.org/project/netdevbpf/cover/20210803231959.26513-1-saeed@kernel.org/

[2] https://patchwork.kernel.org/project/netdevbpf/patch/0e3364dab7e0e4eea5423878b01aa42470be8d36.1626609184.git.leonro@nvidia.com/

[3] https://patchwork.kernel.org/project/netdevbpf/patch/55e1d69bef1fbfa5cf195c0bfcbe35c8019de35e.1624258894.git.leonro@nvidia.com/

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Lag, Create shared FDB when in switchdev mode
  net/mlx5: E-Switch, add logic to enable shared FDB
  net/mlx5: Lag, move lag destruction to a workqueue
  net/mlx5: Lag, properly lock eswitch if needed
  net/mlx5: Add send to vport rules on paired device
  net/mlx5: E-Switch, Add event callback for representors
  net/mlx5e: Use shared mappings for restoring from metadata
  net/mlx5e: Add an option to create a shared mapping
  net/mlx5: E-Switch, set flow source for send to uplink rule
  RDMA/mlx5: Add shared FDB support
  {net, RDMA}/mlx5: Extend send to vport rules
  RDMA/mlx5: Fill port info based on the relevant eswitch
  net/mlx5: Lag, add initial logic for shared FDB
  net/mlx5: Return mdev from eswitch
  IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq
  net/mlx5: Add DCS caps & fields support
====================

Link: https://lore.kernel.org/r/20210809202522.316930-1-saeed@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

ebd0d30c

netfilter: nf_queue: move hookfn registration out of struct net · 87029970

Florian Westphal authored Aug 05, 2021

This was done to detect when the pernet->init() function was not called
yet, by checking if net->nf.queue_handler is NULL.

Once the nfnetlink_queue module is active, all struct net pointers
contain the same address.  So place this back in nf_queue.c.

Handle the 'netns error unwind' test by checking nfnl_queue_net for a
NULL pointer and add a comment for this.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

87029970

Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · d1a4e0a9

Jakub Kicinski authored Aug 10, 2021

Daniel Borkmann says:

====================
bpf-next 2021-08-10

We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).

1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.

2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
   32-bit MIPS JIT development, from Johan Almbladh.

3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.

4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.

5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.

6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.

7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
   and Muhammad Falak R Wani.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
  bpf, tests: Add tail call test suite
  bpf, tests: Add tests for BPF_CMPXCHG
  bpf, tests: Add tests for atomic operations
  bpf, tests: Add test for 32-bit context pointer argument passing
  bpf, tests: Add branch conversion JIT test
  bpf, tests: Add word-order tests for load/store of double words
  bpf, tests: Add tests for ALU operations implemented with function calls
  bpf, tests: Add more ALU64 BPF_MUL tests
  bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
  bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
  bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
  bpf, tests: Fix typos in test case descriptions
  bpf, tests: Add BPF_MOV tests for zero and sign extension
  bpf, tests: Add BPF_JMP32 test cases
  samples, bpf: Add an explict comment to handle nested vlan tagging.
  selftests/bpf: Add tests for XDP bonding
  selftests/bpf: Fix xdp_tx.c prog section name
  net, core: Allow netdev_lower_get_next_private_rcu in bh context
  bpf, devmap: Exclude XDP broadcast to master device
  net, bonding: Add XDP support to the bonding driver
  ...
====================

Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d1a4e0a9

bpf, tests: Add tail call test suite · 874be05f

Johan Almbladh authored Aug 09, 2021

While BPF_CALL instructions were tested implicitly by the cBPF-to-eBPF
translation, there has not been any tests for BPF_TAIL_CALL instructions.
The new test suite includes tests for tail call chaining, tail call count
tracking and error paths. It is mainly intended for JIT development and
testing.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-15-johan.almbladh@anyfinetworks.com

874be05f

bpf, tests: Add tests for BPF_CMPXCHG · 6a3b24ca

Johan Almbladh authored Aug 09, 2021

Tests for BPF_CMPXCHG with both word and double word operands. As with
the tests for other atomic operations, these tests only check the result
of the arithmetic operation. The atomicity of the operations is not tested.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-14-johan.almbladh@anyfinetworks.com

6a3b24ca

bpf, tests: Add tests for atomic operations · e4517b36

Johan Almbladh authored Aug 09, 2021

Tests for each atomic arithmetic operation and BPF_XCHG, derived from
old BPF_XADD tests. The tests include BPF_W/DW and BPF_FETCH variants.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-13-johan.almbladh@anyfinetworks.com

e4517b36

bpf, tests: Add test for 32-bit context pointer argument passing · 53e33f99

Johan Almbladh authored Aug 09, 2021

On a 32-bit architecture, the context pointer will occupy the low
half of R1, and the other half will be zero.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-12-johan.almbladh@anyfinetworks.com

53e33f99

bpf, tests: Add branch conversion JIT test · 66e5eb84

Johan Almbladh authored Aug 09, 2021

Some JITs may need to convert a conditional jump instruction to
to short PC-relative branch and a long unconditional jump, if the
PC-relative offset exceeds offset field width in the CPU instruction.
This test triggers such branch conversion on the 32-bit MIPS JIT.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-11-johan.almbladh@anyfinetworks.com

66e5eb84

bpf, tests: Add word-order tests for load/store of double words · e5009b46

Johan Almbladh authored Aug 09, 2021

A double word (64-bit) load/store may be implemented as two successive
32-bit operations, one for each word. Check that the order of those
operations is consistent with the machine endianness.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-10-johan.almbladh@anyfinetworks.com

e5009b46

bpf, tests: Add tests for ALU operations implemented with function calls · 84024a4e

Johan Almbladh authored Aug 09, 2021

32-bit JITs may implement complex ALU64 instructions using function calls.
The new tests check aspects related to this, such as register clobbering
and register argument re-ordering.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-9-johan.almbladh@anyfinetworks.com

84024a4e

bpf, tests: Add more ALU64 BPF_MUL tests · faa57625

Johan Almbladh authored Aug 09, 2021

This patch adds BPF_MUL tests for 64x32 and 64x64 multiply. Mainly
testing 32-bit JITs that implement ALU64 operations with two 32-bit
CPU registers per operand.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-8-johan.almbladh@anyfinetworks.com

faa57625

bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64 · 3b9890ef

Johan Almbladh authored Aug 09, 2021

This patch adds a number of tests for BPF_LSH, BPF_RSH amd BPF_ARSH
ALU64 operations with values that may trigger different JIT code paths.
Mainly testing 32-bit JITs that implement ALU64 operations with two
32-bit CPU registers per operand.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809091829.810076-7-johan.almbladh@anyfinetworks.com

3b9890ef

bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH · 0f2fca1a

Johan Almbladh authored Aug 09, 2021

This patch adds more tests of ALU32 shift operations BPF_LSH and BPF_RSH,
including the special case of a zero immediate. Also add corresponding
BPF_ARSH tests which were missing for ALU32.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-6-johan.almbladh@anyfinetworks.com

0f2fca1a

bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations · ba89bcf7

Johan Almbladh authored Aug 09, 2021

This patch adds tests of BPF_AND, BPF_OR and BPF_XOR with different
magnitude of the immediate value. Mainly checking 32-bit JIT sub-word
handling and zero/sign extension.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-5-johan.almbladh@anyfinetworks.com

ba89bcf7

bpf, tests: Fix typos in test case descriptions · e92c813b

Johan Almbladh authored Aug 09, 2021

This patch corrects the test description in a number of cases where
the description differed from what was actually tested and expected.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-4-johan.almbladh@anyfinetworks.com

e92c813b

bpf, tests: Add BPF_MOV tests for zero and sign extension · 565731ac

Johan Almbladh authored Aug 09, 2021

Tests for ALU32 and ALU64 MOV with different sizes of the immediate
value. Depending on the immediate field width of the native CPU
instructions, a JIT may generate code differently depending on the
immediate value. Test that zero or sign extension is performed as
expected. Mainly for JIT testing.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-3-johan.almbladh@anyfinetworks.com

565731ac

bpf, tests: Add BPF_JMP32 test cases · b55dfa85

Johan Almbladh authored Aug 09, 2021

An eBPF JIT may implement JMP32 operations in a different way than JMP,
especially on 32-bit architectures. This patch adds a series of tests
for JMP32 operations, mainly for testing JITs.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210809091829.810076-2-johan.almbladh@anyfinetworks.com

b55dfa85

samples, bpf: Add an explict comment to handle nested vlan tagging. · d692a637

Muhammad Falak R Wani authored Aug 09, 2021

A codeblock for handling nested vlan trips newbies into thinking it as
duplicate code. Explicitly add a comment to clarify.
Signed-off-by: Muhammad Falak R Wani <falakreyaz@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809070046.32142-1-falakreyaz@gmail.com

d692a637

09 Aug, 2021 12 commits

Merge branch 'add-frag-page-support-in-page-pool' · 4ef3960e

Jakub Kicinski authored Aug 09, 2021

Yunsheng Lin says:

====================
add frag page support in page pool

This patchset adds frag page support in page pool and
enable skb's page frag recycling based on page pool in
hns3 drvier.
====================

Link: https://lore.kernel.org/r/1628217982-53533-1-git-send-email-linyunsheng@huawei.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

4ef3960e

net: hns3: support skb's frag page recycling based on page pool · 93188e96

Yunsheng Lin authored Aug 06, 2021

This patch adds skb's frag page recycling support based on
the frag page support in page pool.

The performance improves above 10~20% for single thread iperf
TCP flow with IOMMU disabled when iperf server and irq/NAPI
have a different CPU.

The performance improves about 135%(14Gbit to 33Gbit) for single
thread iperf TCP flow when IOMMU is in strict mode and iperf
server shares the same cpu with irq/NAPI.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

93188e96

page_pool: add frag page recycling support in page pool · 53e0961d

Yunsheng Lin authored Aug 06, 2021

Currently page pool only support page recycling when there
is only one user of the page, and the split page reusing
implemented in the most driver can not use the page pool as
bing-pong way of reusing requires the multi user support in
page pool.

Those reusing or recycling has below limitations:
1. page from page pool can only be used be one user in order
   for the page recycling to happen.
2. Bing-pong way of reusing in most driver does not support
   multi desc using different part of the same page in order
   to save memory.

So add multi-users support and frag page recycling in page
pool to overcome the above limitation.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

53e0961d

page_pool: add interface to manipulate frag count in page pool · 0e9d2a0a

Yunsheng Lin authored Aug 06, 2021

For 32 bit systems with 64 bit dma, dma_addr[1] is used to
store the upper 32 bit dma addr, those system should be rare
those days.

For normal system, the dma_addr[1] in 'struct page' is not
used, so we can reuse dma_addr[1] for storing frag count,
which means how many frags this page might be splited to.

In order to simplify the page frag support in the page pool,
the PAGE_POOL_DMA_USE_PP_FRAG_COUNT macro is added to indicate
the 32 bit systems with 64 bit dma, and the page frag support
in page pool is disabled for such system.

The newly added page_pool_set_frag_count() is called to reserve
the maximum frag count before any page frag is passed to the
user. The page_pool_atomic_sub_frag_count_return() is called
when user is done with the page frag.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0e9d2a0a

page_pool: keep pp info as long as page pool owns the page · 57f05bc2

Yunsheng Lin authored Aug 06, 2021

Currently, page->pp is cleared and set everytime the page
is recycled, which is unnecessary.

So only set the page->pp when the page is added to the page
pool and only clear it when the page is released from the
page pool.

This is also a preparation to support allocating frag page
in page pool.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

57f05bc2

selftests/bpf: Add tests for XDP bonding · 6aab1c81

Jussi Maki authored Jul 31, 2021

Add a test suite to test XDP bonding implementation over a pair of
veth devices.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210731055738.16820-8-joamaki@gmail.com

6aab1c81

selftests/bpf: Fix xdp_tx.c prog section name · 95413846

Jussi Maki authored Jul 31, 2021

The program type cannot be deduced from 'tx' which causes an invalid
argument error when trying to load xdp_tx.o using the skeleton.
Rename the section name to "xdp" so that libbpf can deduce the type.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210731055738.16820-7-joamaki@gmail.com

95413846

net, core: Allow netdev_lower_get_next_private_rcu in bh context · 68918669

Jussi Maki authored Jul 31, 2021

For the XDP bonding slave lookup to work in the NAPI poll context in which
the redudant rcu_read_lock() has been removed we have to follow the same
approach as in 694cea39 ("bpf: Allow RCU-protected lookups to happen
from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held().
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com

68918669

bpf, devmap: Exclude XDP broadcast to master device · aeea1b86

Jussi Maki authored Jul 31, 2021

If the ingress device is bond slave, do not broadcast back through it or
the bond master.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-5-joamaki@gmail.com

aeea1b86

net, bonding: Add XDP support to the bonding driver · 9e2ee5c7

Jussi Maki authored Jul 31, 2021

XDP is implemented in the bonding driver by transparently delegating
the XDP program loading, removal and xmit operations to the bonding
slave devices. The overall goal of this work is that XDP programs
can be attached to a bond device *without* any further changes (or
awareness) necessary to the program itself, meaning the same XDP
program can be attached to a native device but also a bonding device.

Semantics of XDP_TX when attached to a bond are equivalent in such
setting to the case when a tc/BPF program would be attached to the
bond, meaning transmitting the packet out of the bond itself using one
of the bond's configured xmit methods to select a slave device (rather
than XDP_TX on the slave itself). Handling of XDP_TX to transmit
using the configured bonding mechanism is therefore implemented by
rewriting the BPF program return value in bpf_prog_run_xdp. To avoid
performance impact this check is guarded by a static key, which is
incremented when a XDP program is loaded onto a bond device. This
approach was chosen to avoid changes to drivers implementing XDP. If
the slave device does not match the receive device, then XDP_REDIRECT
is transparently used to perform the redirection in order to have
the network driver release the packet from its RX ring. The bonding
driver hashing functions have been refactored to allow reuse with
xdp_buff's to avoid code duplication.

The motivation for this change is to enable use of bonding (and
802.3ad) in hairpinning L4 load-balancers such as [1] implemented with
XDP and also to transparently support bond devices for projects that
use XDP given most modern NICs have dual port adapters. An alternative
to this approach would be to implement 802.3ad in user-space and
implement the bonding load-balancing in the XDP program itself, but
is rather a cumbersome endeavor in terms of slave device management
(e.g. by watching netlink) and requires separate programs for native
vs bond cases for the orchestrator. A native in-kernel implementation
overcomes these issues and provides more flexibility.

Below are benchmark results done on two machines with 100Gbit
Intel E810 (ice) NIC and with 32-core 3970X on sending machine, and
16-core 3950X on receiving machine. 64 byte packets were sent with
pktgen-dpdk at full rate. Two issues [2, 3] were identified with the
ice driver, so the tests were performed with iommu=off and patch [2]
applied. Additionally the bonding round robin algorithm was modified
to use per-cpu tx counters as high CPU load (50% vs 10%) and high rate
of cache misses were caused by the shared rr_tx_counter (see patch
2/3). The statistics were collected using "sar -n dev -u 1 10". On top
of that, for ice, further work is in progress on improving the XDP_TX
numbers [4].

 -----------------------|  CPU  |--| rxpck/s |--| txpck/s |----
 without patch (1 dev):
   XDP_DROP:              3.15%      48.6Mpps
   XDP_TX:                3.12%      18.3Mpps     18.3Mpps
   XDP_DROP (RSS):        9.47%      116.5Mpps
   XDP_TX (RSS):          9.67%      25.3Mpps     24.2Mpps
 -----------------------
 with patch, bond (1 dev):
   XDP_DROP:              3.14%      46.7Mpps
   XDP_TX:                3.15%      13.9Mpps     13.9Mpps
   XDP_DROP (RSS):        10.33%     117.2Mpps
   XDP_TX (RSS):          10.64%     25.1Mpps     24.0Mpps
 -----------------------
 with patch, bond (2 devs):
   XDP_DROP:              6.27%      92.7Mpps
   XDP_TX:                6.26%      17.6Mpps     17.5Mpps
   XDP_DROP (RSS):       11.38%      117.2Mpps
   XDP_TX (RSS):         14.30%      28.7Mpps     27.4Mpps
 --------------------------------------------------------------

RSS: Receive Side Scaling, e.g. the packets were sent to a range of
destination IPs.

  [1]: https://cilium.io/blog/2021/05/20/cilium-110#standalonelb
  [2]: https://lore.kernel.org/bpf/20210601113236.42651-1-maciej.fijalkowski@intel.com/T/#t
  [3]: https://lore.kernel.org/bpf/CAHn8xckNXci+X_Eb2WMv4uVYjO2331UWB2JLtXr_58z0Av8+8A@mail.gmail.com/
  [4]: https://lore.kernel.org/bpf/20210805230046.28715-1-maciej.fijalkowski@intel.com/T/#tSigned-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20210731055738.16820-4-joamaki@gmail.com

9e2ee5c7

net, core: Add support for XDP redirection to slave device · 879af96f

Jussi Maki authored Jul 31, 2021

This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
into XDP_REDIRECT after BPF program run when the ingress device
is a bond slave.

The dev_xdp_prog_count is exposed so that slave devices can be checked
for loaded XDP programs in order to avoid the situation where both
bond master and slave have programs loaded according to xdp_state.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com

879af96f

net, bonding: Refactor bond_xmit_hash for use with xdp_buff · a815bde5

Jussi Maki authored Jul 31, 2021

In preparation for adding XDP support to the bonding driver
refactor the packet hashing functions to be able to work with
any linear data buffer without an skb.
Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Link: https://lore.kernel.org/bpf/20210731055738.16820-2-joamaki@gmail.com

a815bde5