1. 25 May, 2023 3 commits
  2. 24 May, 2023 7 commits
  3. 23 May, 2023 30 commits
    • John Fastabend's avatar
      bpf, sockmap: Test progs verifier error with latest clang · f726e035
      John Fastabend authored
      With a relatively recent clang (7090c10273119) and with this commit
      to fix warnings in selftests (c8ed6685) that uses __sink(err)
      to resolve unused variables. We get the following verifier error.
      
      root@6e731a24b33a:/host/tools/testing/selftests/bpf# ./test_sockmap
      libbpf: prog 'bpf_sockmap': BPF program load failed: Permission denied
      libbpf: prog 'bpf_sockmap': -- BEGIN PROG LOAD LOG --
      0: R1=ctx(off=0,imm=0) R10=fp0
      ; op = (int) skops->op;
      0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; switch (op) {
      1: (16) if w2 == 0x4 goto pc+5        ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      2: (56) if w2 != 0x5 goto pc+15       ; R2_w=5
      ; lport = skops->local_port;
      3: (61) r2 = *(u32 *)(r1 +68)         ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; if (lport == 10000) {
      4: (56) if w2 != 0x2710 goto pc+13 18: R1=ctx(off=0,imm=0) R2=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
      ; __sink(err);
      18: (bc) w1 = w0
      R0 !read_ok
      processed 18 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
      -- END PROG LOAD LOG --
      libbpf: prog 'bpf_sockmap': failed to load: -13
      libbpf: failed to load object 'test_sockmap_kern.bpf.o'
      load_bpf_file: (-1) No such file or directory
      ERROR: (-1) load bpf failed
      libbpf: prog 'bpf_sockmap': BPF program load failed: Permission denied
      libbpf: prog 'bpf_sockmap': -- BEGIN PROG LOAD LOG --
      0: R1=ctx(off=0,imm=0) R10=fp0
      ; op = (int) skops->op;
      0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; switch (op) {
      1: (16) if w2 == 0x4 goto pc+5        ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      2: (56) if w2 != 0x5 goto pc+15       ; R2_w=5
      ; lport = skops->local_port;
      3: (61) r2 = *(u32 *)(r1 +68)         ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; if (lport == 10000) {
      4: (56) if w2 != 0x2710 goto pc+13 18: R1=ctx(off=0,imm=0) R2=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
      ; __sink(err);
      18: (bc) w1 = w0
      R0 !read_ok
      processed 18 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
      -- END PROG LOAD LOG --
      libbpf: prog 'bpf_sockmap': failed to load: -13
      libbpf: failed to load object 'test_sockhash_kern.bpf.o'
      load_bpf_file: (-1) No such file or directory
      ERROR: (-1) load bpf failed
      libbpf: prog 'bpf_sockmap': BPF program load failed: Permission denied
      libbpf: prog 'bpf_sockmap': -- BEGIN PROG LOAD LOG --
      0: R1=ctx(off=0,imm=0) R10=fp0
      ; op = (int) skops->op;
      0: (61) r2 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; switch (op) {
      1: (16) if w2 == 0x4 goto pc+5        ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      2: (56) if w2 != 0x5 goto pc+15       ; R2_w=5
      ; lport = skops->local_port;
      3: (61) r2 = *(u32 *)(r1 +68)         ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
      ; if (lport == 10000) {
      4: (56) if w2 != 0x2710 goto pc+13 18: R1=ctx(off=0,imm=0) R2=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
      ; __sink(err);
      18: (bc) w1 = w0
      R0 !read_ok
      processed 18 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1
      -- END PROG LOAD LOG --
      
      To fix simply remove the err value because its not actually used anywhere
      in the testing. We can investigate the root cause later. Future patch should
      probably actually test the err value as well. Although if the map updates
      fail they will get caught eventually by userspace.
      
      Fixes: c8ed6685 ("selftests/bpf: fix lots of silly mistakes pointed out by compiler")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-15-john.fastabend@gmail.com
      f726e035
    • John Fastabend's avatar
      bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops · 80e24d22
      John Fastabend authored
      When BPF program drops pkts the sockmap logic 'eats' the packet and
      updates copied_seq. In the PASS case where the sk_buff is accepted
      we update copied_seq from recvmsg path so we need a new test to
      handle the drop case.
      
      Original patch series broke this resulting in
      
      test_sockmap_skb_verdict_fionread:PASS:ioctl(FIONREAD) error 0 nsec
      test_sockmap_skb_verdict_fionread:FAIL:ioctl(FIONREAD) unexpected ioctl(FIONREAD): actual 1503041772 != expected 256
      
      After updated patch with fix.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-14-john.fastabend@gmail.com
      80e24d22
    • John Fastabend's avatar
      bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer · bb516f98
      John Fastabend authored
      A bug was reported where ioctl(FIONREAD) returned zero even though the
      socket with a SK_SKB verdict program attached had bytes in the msg
      queue. The result is programs may hang or more likely try to recover,
      but use suboptimal buffer sizes.
      
      Add a test to check that ioctl(FIONREAD) returns the correct number of
      bytes.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-13-john.fastabend@gmail.com
      bb516f98
    • John Fastabend's avatar
      bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0 · 1fa1fe8f
      John Fastabend authored
      When session gracefully shutdowns epoll needs to wake up and any recv()
      readers should return 0 not the -EAGAIN they previously returned.
      
      Note we use epoll instead of select to test the epoll wake on shutdown
      event as well.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-12-john.fastabend@gmail.com
      1fa1fe8f
    • John Fastabend's avatar
      bpf, sockmap: Build helper to create connected socket pair · 298970c8
      John Fastabend authored
      A common operation for testing is to spin up a pair of sockets that are
      connected. Then we can use these to run specific tests that need to
      send data, check BPF programs and so on.
      
      The sockmap_listen programs already have this logic lets move it into
      the new sockmap_helpers header file for general use.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-11-john.fastabend@gmail.com
      298970c8
    • John Fastabend's avatar
      bpf, sockmap: Pull socket helpers out of listen test for general use · 4e02588d
      John Fastabend authored
      No functional change here we merely pull the helpers in sockmap_listen.c
      into a header file so we can use these in other programs. The tests we
      are about to add aren't really _listen tests so doesn't make sense
      to add them here.
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-10-john.fastabend@gmail.com
      4e02588d
    • John Fastabend's avatar
      bpf, sockmap: Incorrectly handling copied_seq · e5c6de5f
      John Fastabend authored
      The read_skb() logic is incrementing the tcp->copied_seq which is used for
      among other things calculating how many outstanding bytes can be read by
      the application. This results in application errors, if the application
      does an ioctl(FIONREAD) we return zero because this is calculated from
      the copied_seq value.
      
      To fix this we move tcp->copied_seq accounting into the recv handler so
      that we update these when the recvmsg() hook is called and data is in
      fact copied into user buffers. This gives an accurate FIONREAD value
      as expected and improves ACK handling. Before we were calling the
      tcp_rcv_space_adjust() which would update 'number of bytes copied to
      user in last RTT' which is wrong for programs returning SK_PASS. The
      bytes are only copied to the user when recvmsg is handled.
      
      Doing the fix for recvmsg is straightforward, but fixing redirect and
      SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then
      call this from skmsg handlers. This fixes another issue where a broken
      socket with a BPF program doing a resubmit could hang the receiver. This
      happened because although read_skb() consumed the skb through sock_drop()
      it did not update the copied_seq. Now if a single reccv socket is
      redirecting to many sockets (for example for lb) the receiver sk will be
      hung even though we might expect it to continue. The hang comes from
      not updating the copied_seq numbers and memory pressure resulting from
      that.
      
      We have a slight layer problem of calling tcp_eat_skb even if its not
      a TCP socket. To fix we could refactor and create per type receiver
      handlers. I decided this is more work than we want in the fix and we
      already have some small tweaks depending on caller that use the
      helper skb_bpf_strparser(). So we extend that a bit and always set
      the strparser bit when it is in use and then we can gate the
      seq_copied updates on this.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
      e5c6de5f
    • John Fastabend's avatar
      bpf, sockmap: Wake up polling after data copy · 6df7f764
      John Fastabend authored
      When TCP stack has data ready to read sk_data_ready() is called. Sockmap
      overwrites this with its own handler to call into BPF verdict program.
      But, the original TCP socket had sock_def_readable that would additionally
      wake up any user space waiters with sk_wake_async().
      
      Sockmap saved the callback when the socket was created so call the saved
      data ready callback and then we can wake up any epoll() logic waiting
      on the read.
      
      Note we call on 'copied >= 0' to account for returning 0 when a FIN is
      received because we need to wake up user for this as well so they
      can do the recvmsg() -> 0 and detect the shutdown.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
      6df7f764
    • John Fastabend's avatar
      bpf, sockmap: TCP data stall on recv before accept · ea444185
      John Fastabend authored
      A common mechanism to put a TCP socket into the sockmap is to hook the
      BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program
      that can map the socket info to the correct BPF verdict parser. When
      the user adds the socket to the map the psock is created and the new
      ops are assigned to ensure the verdict program will 'see' the sk_buffs
      as they arrive.
      
      Part of this process hooks the sk_data_ready op with a BPF specific
      handler to wake up the BPF verdict program when data is ready to read.
      The logic is simple enough (posted here for easy reading)
      
       static void sk_psock_verdict_data_ready(struct sock *sk)
       {
      	struct socket *sock = sk->sk_socket;
      
      	if (unlikely(!sock || !sock->ops || !sock->ops->read_skb))
      		return;
      	sock->ops->read_skb(sk, sk_psock_verdict_recv);
       }
      
      The oversight here is sk->sk_socket is not assigned until the application
      accepts() the new socket. However, its entirely ok for the peer application
      to do a connect() followed immediately by sends. The socket on the receiver
      is sitting on the backlog queue of the listening socket until its accepted
      and the data is queued up. If the peer never accepts the socket or is slow
      it will eventually hit data limits and rate limit the session. But,
      important for BPF sockmap hooks when this data is received TCP stack does
      the sk_data_ready() call but the read_skb() for this data is never called
      because sk_socket is missing. The data sits on the sk_receive_queue.
      
      Then once the socket is accepted if we never receive more data from the
      peer there will be no further sk_data_ready calls and all the data
      is still on the sk_receive_queue(). Then user calls recvmsg after accept()
      and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler.
      The handler checks for data in the sk_msg ingress queue expecting that
      the BPF program has already run from the sk_data_ready hook and enqueued
      the data as needed. So we are stuck.
      
      To fix do an unlikely check in recvmsg handler for data on the
      sk_receive_queue and if it exists wake up data_ready. We have the sock
      locked in both read_skb and recvmsg so should avoid having multiple
      runners.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com
      ea444185
    • John Fastabend's avatar
      bpf, sockmap: Handle fin correctly · 901546fd
      John Fastabend authored
      The sockmap code is returning EAGAIN after a FIN packet is received and no
      more data is on the receive queue. Correct behavior is to return 0 to the
      user and the user can then close the socket. The EAGAIN causes many apps
      to retry which masks the problem. Eventually the socket is evicted from
      the sockmap because its released from sockmap sock free handling. The
      issue creates a delay and can cause some errors on application side.
      
      To fix this check on sk_msg_recvmsg side if length is zero and FIN flag
      is set then set return to zero. A selftest will be added to check this
      condition.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
      901546fd
    • John Fastabend's avatar
      bpf, sockmap: Improved check for empty queue · 405df89d
      John Fastabend authored
      We noticed some rare sk_buffs were stepping past the queue when system was
      under memory pressure. The general theory is to skip enqueueing
      sk_buffs when its not necessary which is the normal case with a system
      that is properly provisioned for the task, no memory pressure and enough
      cpu assigned.
      
      But, if we can't allocate memory due to an ENOMEM error when enqueueing
      the sk_buff into the sockmap receive queue we push it onto a delayed
      workqueue to retry later. When a new sk_buff is received we then check
      if that queue is empty. However, there is a problem with simply checking
      the queue length. When a sk_buff is being processed from the ingress queue
      but not yet on the sockmap msg receive queue its possible to also recv
      a sk_buff through normal path. It will check the ingress queue which is
      zero and then skip ahead of the pkt being processed.
      
      Previously we used sock lock from both contexts which made the problem
      harder to hit, but not impossible.
      
      To fix instead of popping the skb from the queue entirely we peek the
      skb from the queue and do the copy there. This ensures checks to the
      queue length are non-zero while skb is being processed. Then finally
      when the entire skb has been copied to user space queue or another
      socket we pop it off the queue. This way the queue length check allows
      bypassing the queue only after the list has been completely processed.
      
      To reproduce issue we run NGINX compliance test with sockmap running and
      observe some flakes in our testing that we attributed to this issue.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Suggested-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
      405df89d
    • John Fastabend's avatar
      bpf, sockmap: Reschedule is now done through backlog · bce22552
      John Fastabend authored
      Now that the backlog manages the reschedule() logic correctly we can drop
      the partial fix to reschedule from recvmsg hook.
      
      Rescheduling on recvmsg hook was added to address a corner case where we
      still had data in the backlog state but had nothing to kick it and
      reschedule the backlog worker to run and finish copying data out of the
      state. This had a couple limitations, first it required user space to
      kick it introducing an unnecessary EBUSY and retry. Second it only
      handled the ingress case and egress redirects would still be hung.
      
      With the correct fix, pushing the reschedule logic down to where the
      enomem error occurs we can drop this fix.
      
      Fixes: bec21719 ("skmsg: Schedule psock work if the cached skb exists on the psock")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
      bce22552
    • John Fastabend's avatar
      bpf, sockmap: Convert schedule_work into delayed_work · 29173d07
      John Fastabend authored
      Sk_buffs are fed into sockmap verdict programs either from a strparser
      (when the user might want to decide how framing of skb is done by attaching
      another parser program) or directly through tcp_read_sock. The
      tcp_read_sock is the preferred method for performance when the BPF logic is
      a stream parser.
      
      The flow for Cilium's common use case with a stream parser is,
      
       tcp_read_sock()
        sk_psock_verdict_recv
          ret = bpf_prog_run_pin_on_cpu()
          sk_psock_verdict_apply(sock, skb, ret)
           // if system is under memory pressure or app is slow we may
           // need to queue skb. Do this queuing through ingress_skb and
           // then kick timer to wake up handler
           skb_queue_tail(ingress_skb, skb)
           schedule_work(work);
      
      The work queue is wired up to sk_psock_backlog(). This will then walk the
      ingress_skb skb list that holds our sk_buffs that could not be handled,
      but should be OK to run at some later point. However, its possible that
      the workqueue doing this work still hits an error when sending the skb.
      When this happens the skbuff is requeued on a temporary 'state' struct
      kept with the workqueue. This is necessary because its possible to
      partially send an skbuff before hitting an error and we need to know how
      and where to restart when the workqueue runs next.
      
      Now for the trouble, we don't rekick the workqueue. This can cause a
      stall where the skbuff we just cached on the state variable might never
      be sent. This happens when its the last packet in a flow and no further
      packets come along that would cause the system to kick the workqueue from
      that side.
      
      To fix we could do simple schedule_work(), but while under memory pressure
      it makes sense to back off some instead of continue to retry repeatedly. So
      instead to fix convert schedule_work to schedule_delayed_work and add
      backoff logic to reschedule from backlog queue on errors. Its not obvious
      though what a good backoff is so use '1'.
      
      To test we observed some flakes whil running NGINX compliance test with
      sockmap we attributed these failed test to this bug and subsequent issue.
      
      >From on list discussion. This commit
      
       bec21719("skmsg: Schedule psock work if the cached skb exists on the psock")
      
      was intended to address similar race, but had a couple cases it missed.
      Most obvious it only accounted for receiving traffic on the local socket
      so if redirecting into another socket we could still get an sk_buff stuck
      here. Next it missed the case where copied=0 in the recv() handler and
      then we wouldn't kick the scheduler. Also its sub-optimal to require
      userspace to kick the internal mechanisms of sockmap to wake it up and
      copy data to user. It results in an extra syscall and requires the app
      to actual handle the EAGAIN correctly.
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
      29173d07
    • John Fastabend's avatar
      bpf, sockmap: Pass skb ownership through read_skb · 78fa0d61
      John Fastabend authored
      The read_skb hook calls consume_skb() now, but this means that if the
      recv_actor program wants to use the skb it needs to inc the ref cnt
      so that the consume_skb() doesn't kfree the sk_buff.
      
      This is problematic because in some error cases under memory pressure
      we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
      Then we get this,
      
       skb_linearize()
         __pskb_pull_tail()
           pskb_expand_head()
             BUG_ON(skb_shared(skb))
      
      Because we incremented users refcnt from sk_psock_verdict_recv() we
      hit the bug on with refcnt > 1 and trip it.
      
      To fix lets simply pass ownership of the sk_buff through the skb_read
      call. Then we can drop the consume from read_skb handlers and assume
      the verdict recv does any required kfree.
      
      Bug found while testing in our CI which runs in VMs that hit memory
      constraints rather regularly. William tested TCP read_skb handlers.
      
      [  106.536188] ------------[ cut here ]------------
      [  106.536197] kernel BUG at net/core/skbuff.c:1693!
      [  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
      [  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
      [  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
      [  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
      [  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
      [  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
      [  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
      [  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
      [  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
      [  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
      [  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
      [  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  106.542255] Call Trace:
      [  106.542383]  <IRQ>
      [  106.542487]  __pskb_pull_tail+0x4b/0x3e0
      [  106.542681]  skb_ensure_writable+0x85/0xa0
      [  106.542882]  sk_skb_pull_data+0x18/0x20
      [  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
      [  106.543536]  ? migrate_disable+0x66/0x80
      [  106.543871]  sk_psock_verdict_recv+0xe2/0x310
      [  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
      [  106.544561]  tcp_read_skb+0x7b/0x120
      [  106.544740]  tcp_data_queue+0x904/0xee0
      [  106.544931]  tcp_rcv_established+0x212/0x7c0
      [  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
      [  106.545326]  tcp_v4_rcv+0xe70/0xf60
      [  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
      [  106.545744]  ip_local_deliver_finish+0xa7/0x150
      
      Fixes: 04919bed ("tcp: Introduce tcp_read_skb()")
      Reported-by: default avatarWilliam Findlay <will@isovalent.com>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Tested-by: default avatarWilliam Findlay <will@isovalent.com>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
      78fa0d61
    • Nicolas Dichtel's avatar
      ipv{4,6}/raw: fix output xfrm lookup wrt protocol · 3632679d
      Nicolas Dichtel authored
      With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
      protocol field of the flow structure, build by raw_sendmsg() /
      rawv6_sendmsg()),  is set to IPPROTO_RAW. This breaks the ipsec policy
      lookup when some policies are defined with a protocol in the selector.
      
      For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
      specify the protocol. Just accept all values for IPPROTO_RAW socket.
      
      For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
      without breaking backward compatibility (the value of this field was never
      checked). Let's add a new kind of control message, so that the userland
      could specify which protocol is used.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      CC: stable@vger.kernel.org
      Signed-off-by: default avatarNicolas Dichtel <nicolas.dichtel@6wind.com>
      Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      3632679d
    • Horatiu Vultur's avatar
      lan966x: Fix unloading/loading of the driver · 60076124
      Horatiu Vultur authored
      It was noticing that after a while when unloading/loading the driver and
      sending traffic through the switch, it would stop working. It would stop
      forwarding any traffic and the only way to get out of this was to do a
      power cycle of the board. The root cause seems to be that the switch
      core is initialized twice. Apparently initializing twice the switch core
      disturbs the pointers in the queue systems in the HW, so after a while
      it would stop sending the traffic.
      Unfortunetly, it is not possible to use a reset of the switch here,
      because the reset line is connected to multiple devices like MDIO,
      SGPIO, FAN, etc. So then all the devices will get reseted when the
      network driver will be loaded.
      So the fix is to check if the core is initialized already and if that is
      the case don't initialize it again.
      
      Fixes: db8bcaad ("net: lan966x: add the basic lan966x driver")
      Signed-off-by: default avatarHoratiu Vultur <horatiu.vultur@microchip.com>
      Reviewed-by: default avatarSimon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20230522120038.3749026-1-horatiu.vultur@microchip.comSigned-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      60076124
    • Shay Drory's avatar
      net/mlx5: Fix indexing of mlx5_irq · 1da438c0
      Shay Drory authored
      After the cited patch, mlx5_irq xarray index can be different then
      mlx5_irq MSIX table index.
      Fix it by storing both mlx5_irq xarray index and MSIX table index.
      
      Fixes: 3354822c ("net/mlx5: Use dynamic msix vectors allocation")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarEli Cohen <elic@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      1da438c0
    • Shay Drory's avatar
      net/mlx5: Fix irq affinity management · ef8c063c
      Shay Drory authored
      The cited patch deny the user of changing the affinity of mlx5 irqs,
      which break backward compatibility.
      Hence, allow the user to change the affinity of mlx5 irqs.
      
      Fixes: bbac70c7 ("net/mlx5: Use newer affinity descriptor")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarEli Cohen <elic@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      ef8c063c
    • Shay Drory's avatar
      net/mlx5: Free irqs only on shutdown callback · 9c2d0801
      Shay Drory authored
      Whenever a shutdown is invoked, free irqs only and keep mlx5_irq
      synthetic wrapper intact in order to avoid use-after-free on
      system shutdown.
      
      for example:
      ==================================================================
      BUG: KASAN: use-after-free in _find_first_bit+0x66/0x80
      Read of size 8 at addr ffff88823fc0d318 by task kworker/u192:0/13608
      
      CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
      Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
      Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
      Call Trace:
        <TASK>
        dump_stack_lvl+0x34/0x48
        print_report+0x170/0x473
        ? _find_first_bit+0x66/0x80
        kasan_report+0xad/0x130
        ? _find_first_bit+0x66/0x80
        _find_first_bit+0x66/0x80
        mlx5e_open_channels+0x3c5/0x3a10 [mlx5_core]
        ? console_unlock+0x2fa/0x430
        ? _raw_spin_lock_irqsave+0x8d/0xf0
        ? _raw_spin_unlock_irqrestore+0x42/0x80
        ? preempt_count_add+0x7d/0x150
        ? __wake_up_klogd.part.0+0x7d/0xc0
        ? vprintk_emit+0xfe/0x2c0
        ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
        ? dev_attr_show.cold+0x35/0x35
        ? devlink_health_do_dump.part.0+0x174/0x340
        ? devlink_health_report+0x504/0x810
        ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        ? process_one_work+0x680/0x1050
        mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
        ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
        ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
        mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
        ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
        devlink_health_reporter_recover+0xa6/0x1f0
        devlink_health_report+0x2f7/0x810
        ? vsnprintf+0x854/0x15e0
        mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
        ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
        ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
        ? newidle_balance+0x9b7/0xe30
        ? psi_group_change+0x6a7/0xb80
        ? mutex_lock+0x96/0xf0
        ? __mutex_lock_slowpath+0x10/0x10
        mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        process_one_work+0x680/0x1050
        worker_thread+0x5a0/0xeb0
        ? process_one_work+0x1050/0x1050
        kthread+0x2a2/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x22/0x30
        </TASK>
      
      Freed by task 1:
        kasan_save_stack+0x23/0x50
        kasan_set_track+0x21/0x30
        kasan_save_free_info+0x2a/0x40
        ____kasan_slab_free+0x169/0x1d0
        slab_free_freelist_hook+0xd2/0x190
        __kmem_cache_free+0x1a1/0x2f0
        irq_pool_free+0x138/0x200 [mlx5_core]
        mlx5_irq_table_destroy+0xf6/0x170 [mlx5_core]
        mlx5_core_eq_free_irqs+0x74/0xf0 [mlx5_core]
        shutdown+0x194/0x1aa [mlx5_core]
        pci_device_shutdown+0x75/0x120
        device_shutdown+0x35c/0x620
        kernel_restart+0x60/0xa0
        __do_sys_reboot+0x1cb/0x2c0
        do_syscall_64+0x3b/0x90
        entry_SYSCALL_64_after_hwframe+0x4b/0xb5
      
      The buggy address belongs to the object at ffff88823fc0d300
        which belongs to the cache kmalloc-192 of size 192
      The buggy address is located 24 bytes inside of
        192-byte region [ffff88823fc0d300, ffff88823fc0d3c0)
      
      The buggy address belongs to the physical page:
      page:0000000010139587 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x23fc0c
      head:0000000010139587 order:1 compound_mapcount:0 compound_pincount:0
      flags: 0x2ffff800010200(slab|head|node=0|zone=2|lastcpupid=0x1ffff)
      raw: 002ffff800010200 0000000000000000 dead000000000122 ffff88810004ca00
      raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
        ffff88823fc0d200: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff88823fc0d280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
       >ffff88823fc0d300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                   ^
        ffff88823fc0d380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
        ffff88823fc0d400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      ==================================================================
      general protection fault, probably for non-canonical address
      0xdffffc005c40d7ac: 0000 [#1] PREEMPT SMP KASAN NOPTI
      KASAN: probably user-memory-access in range [0x00000002e206bd60-0x00000002e206bd67]
      CPU: 25 PID: 13608 Comm: kworker/u192:0 Tainted: G    B   W  O  6.1.21-cloudflare-kasan-2023.3.21 #1
      Hardware name: GIGABYTE R162-R2-GEN0/MZ12-HD2-CD, BIOS R14 05/03/2021
      Workqueue: mlx5e mlx5e_tx_timeout_work [mlx5_core]
      RIP: 0010:__alloc_pages+0x141/0x5c0
      Call Trace:
        <TASK>
        ? sysvec_apic_timer_interrupt+0xa0/0xc0
        ? asm_sysvec_apic_timer_interrupt+0x16/0x20
        ? __alloc_pages_slowpath.constprop.0+0x1ec0/0x1ec0
        ? _raw_spin_unlock_irqrestore+0x3d/0x80
        __kmalloc_large_node+0x80/0x120
        ? kvmalloc_node+0x4e/0x170
        __kmalloc_node+0xd4/0x150
        kvmalloc_node+0x4e/0x170
        mlx5e_open_channels+0x631/0x3a10 [mlx5_core]
        ? console_unlock+0x2fa/0x430
        ? _raw_spin_lock_irqsave+0x8d/0xf0
        ? _raw_spin_unlock_irqrestore+0x42/0x80
        ? preempt_count_add+0x7d/0x150
        ? __wake_up_klogd.part.0+0x7d/0xc0
        ? vprintk_emit+0xfe/0x2c0
        ? mlx5e_trigger_napi_sched+0x40/0x40 [mlx5_core]
        ? dev_attr_show.cold+0x35/0x35
        ? devlink_health_do_dump.part.0+0x174/0x340
        ? devlink_health_report+0x504/0x810
        ? mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        ? process_one_work+0x680/0x1050
        mlx5e_safe_switch_params+0x156/0x220 [mlx5_core]
        ? mlx5e_switch_priv_channels+0x310/0x310 [mlx5_core]
        ? mlx5_eq_poll_irq_disabled+0xb6/0x100 [mlx5_core]
        mlx5e_tx_reporter_timeout_recover+0x123/0x240 [mlx5_core]
        ? __mutex_unlock_slowpath.constprop.0+0x2b0/0x2b0
        devlink_health_reporter_recover+0xa6/0x1f0
        devlink_health_report+0x2f7/0x810
        ? vsnprintf+0x854/0x15e0
        mlx5e_reporter_tx_timeout+0x29d/0x3a0 [mlx5_core]
        ? mlx5e_reporter_tx_err_cqe+0x1a0/0x1a0 [mlx5_core]
        ? mlx5e_tx_reporter_timeout_dump+0x50/0x50 [mlx5_core]
        ? mlx5e_tx_reporter_dump_sq+0x260/0x260 [mlx5_core]
        ? newidle_balance+0x9b7/0xe30
        ? psi_group_change+0x6a7/0xb80
        ? mutex_lock+0x96/0xf0
        ? __mutex_lock_slowpath+0x10/0x10
        mlx5e_tx_timeout_work+0x17c/0x230 [mlx5_core]
        process_one_work+0x680/0x1050
        worker_thread+0x5a0/0xeb0
        ? process_one_work+0x1050/0x1050
        kthread+0x2a2/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x22/0x30
        </TASK>
      ---[ end trace 0000000000000000  ]---
      RIP: 0010:__alloc_pages+0x141/0x5c0
      Code: e0 39 a3 96 89 e9 b8 22 01 32 01 83 e1 0f 48 89 fa 01 c9 48 c1 ea
      03 d3 f8 83 e0 03 89 44 24 6c 48 b8 00 00 00 00 00 fc ff df <80> 3c 02
      00 0f 85 fc 03 00 00 89 e8 4a 8b 14 f5 e0 39 a3 96 4c 89
      RSP: 0018:ffff888251f0f438 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 1ffff1104a3e1e8b RCX: 0000000000000000
      RDX: 000000005c40d7ac RSI: 0000000000000003 RDI: 00000002e206bd60
      RBP: 0000000000052dc0 R08: ffff8882b0044218 R09: ffff8882b0045e8a
      R10: fffffbfff300fefc R11: ffff888167af4000 R12: 0000000000000003
      R13: 0000000000000000 R14: 00000000696c7070 R15: ffff8882373f4380
      FS:  0000000000000000(0000) GS:ffff88bf2be80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005641d031eee8 CR3: 0000002e7ca14000 CR4: 0000000000350ee0
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range:
      0xffffffff80000000-0xffffffffbfffffff)
      ---[ end Kernel panic - not syncing: Fatal exception  ]---]
      Reported-by: default avatarFrederick Lawler <fred@cloudflare.com>
      Link: https://lore.kernel.org/netdev/be5b9271-7507-19c5-ded1-fa78f1980e69@cloudflare.comSigned-off-by: default avatarShay Drory <shayd@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      9c2d0801
    • Shay Drory's avatar
      net/mlx5: Devcom, serialize devcom registration · 1f893f57
      Shay Drory authored
      From one hand, mlx5 driver is allowing to probe PFs in parallel.
      From the other hand, devcom, which is a share resource between PFs, is
      registered without any lock. This might resulted in memory problems.
      
      Hence, use the global mlx5_dev_list_lock in order to serialize devcom
      registration.
      
      Fixes: fadd59fc ("net/mlx5: Introduce inter-device communication mechanism")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarMark Bloch <mbloch@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      1f893f57
    • Shay Drory's avatar
      net/mlx5: Devcom, fix error flow in mlx5_devcom_register_device · af871943
      Shay Drory authored
      In case devcom allocation is failed, mlx5 is always freeing the priv.
      However, this priv might have been allocated by a different thread,
      and freeing it might lead to use-after-free bugs.
      Fix it by freeing the priv only in case it was allocated by the
      running thread.
      
      Fixes: fadd59fc ("net/mlx5: Introduce inter-device communication mechanism")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      af871943
    • Shay Drory's avatar
      net/mlx5: E-switch, Devcom, sync devcom events and devcom comp register · 8c253dfc
      Shay Drory authored
      devcom events are sent to all registered component. Following the
      cited patch, it is possible for two components, e.g.: two eswitches,
      to send devcom events, while both components are registered. This
      means eswitch layer will do double un/pairing, which is double
      allocation and free of resources, even though only one un/pairing is
      needed. flow example:
      
      	cpu0					cpu1
      	----					----
      
       mlx5_devlink_eswitch_mode_set(dev0)
        esw_offloads_devcom_init()
         mlx5_devcom_register_component(esw0)
                                               mlx5_devlink_eswitch_mode_set(dev1)
                                                esw_offloads_devcom_init()
                                                 mlx5_devcom_register_component(esw1)
                                                 mlx5_devcom_send_event()
         mlx5_devcom_send_event()
      
      Hence, check whether the eswitches are already un/paired before
      free/allocation of resources.
      
      Fixes: 09b27846 ("net: devlink: enable parallel ops on netlink interface")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarMark Bloch <mbloch@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      8c253dfc
    • Paul Blakey's avatar
      net/mlx5e: TC, Fix using eswitch mapping in nic mode · dfa1e46d
      Paul Blakey authored
      Cited patch is using the eswitch object mapping pool while
      in nic mode where it isn't initialized. This results in the
      trace below [0].
      
      Fix that by using either nic or eswitch object mapping pool
      depending if eswitch is enabled or not.
      
      [0]:
      [  826.446057] ==================================================================
      [  826.446729] BUG: KASAN: slab-use-after-free in mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.447515] Read of size 8 at addr ffff888194485830 by task tc/6233
      
      [  826.448243] CPU: 16 PID: 6233 Comm: tc Tainted: G        W          6.3.0-rc6+ #1
      [  826.448890] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [  826.449785] Call Trace:
      [  826.450052]  <TASK>
      [  826.450302]  dump_stack_lvl+0x33/0x50
      [  826.450650]  print_report+0xc2/0x610
      [  826.450998]  ? __virt_addr_valid+0xb1/0x130
      [  826.451385]  ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.451935]  kasan_report+0xae/0xe0
      [  826.452276]  ? mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.452829]  mlx5_add_flow_rules+0x30/0x490 [mlx5_core]
      [  826.453368]  ? __kmalloc_node+0x5a/0x120
      [  826.453733]  esw_add_restore_rule+0x20f/0x270 [mlx5_core]
      [  826.454288]  ? mlx5_eswitch_add_send_to_vport_meta_rule+0x260/0x260 [mlx5_core]
      [  826.455011]  ? mutex_unlock+0x80/0xd0
      [  826.455361]  ? __mutex_unlock_slowpath.constprop.0+0x210/0x210
      [  826.455862]  ? mapping_add+0x2cb/0x440 [mlx5_core]
      [  826.456425]  mlx5e_tc_action_miss_mapping_get+0x139/0x180 [mlx5_core]
      [  826.457058]  ? mlx5e_tc_update_skb_nic+0xb0/0xb0 [mlx5_core]
      [  826.457636]  ? __kasan_kmalloc+0x77/0x90
      [  826.458000]  ? __kmalloc+0x57/0x120
      [  826.458336]  mlx5_tc_ct_flow_offload+0x325/0xe40 [mlx5_core]
      [  826.458916]  ? ct_kernel_enter.constprop.0+0x48/0xa0
      [  826.459360]  ? mlx5_tc_ct_parse_action+0xf0/0xf0 [mlx5_core]
      [  826.459933]  ? mlx5e_mod_hdr_attach+0x491/0x520 [mlx5_core]
      [  826.460507]  ? mlx5e_mod_hdr_get+0x12/0x20 [mlx5_core]
      [  826.461046]  ? mlx5e_tc_attach_mod_hdr+0x154/0x170 [mlx5_core]
      [  826.461635]  mlx5e_configure_flower+0x969/0x2110 [mlx5_core]
      [  826.462217]  ? _raw_spin_lock_bh+0x85/0xe0
      [  826.462597]  ? __mlx5e_add_fdb_flow+0x750/0x750 [mlx5_core]
      [  826.463163]  ? kasan_save_stack+0x2e/0x40
      [  826.463534]  ? down_read+0x115/0x1b0
      [  826.463878]  ? down_write_killable+0x110/0x110
      [  826.464288]  ? tc_setup_action.part.0+0x9f/0x3b0
      [  826.464701]  ? mlx5e_is_uplink_rep+0x4c/0x90 [mlx5_core]
      [  826.465253]  ? mlx5e_tc_reoffload_flows_work+0x130/0x130 [mlx5_core]
      [  826.465878]  tc_setup_cb_add+0x112/0x250
      [  826.466247]  fl_hw_replace_filter+0x230/0x310 [cls_flower]
      [  826.466724]  ? fl_hw_destroy_filter+0x1a0/0x1a0 [cls_flower]
      [  826.467212]  fl_change+0x14e1/0x2030 [cls_flower]
      [  826.467636]  ? sock_def_readable+0x89/0x120
      [  826.468019]  ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower]
      [  826.468509]  ? kasan_unpoison+0x23/0x50
      [  826.468873]  ? get_random_u16+0x180/0x180
      [  826.469244]  ? __radix_tree_lookup+0x2b/0x130
      [  826.469640]  ? fl_get+0x7b/0x140 [cls_flower]
      [  826.470042]  ? fl_mask_put+0x200/0x200 [cls_flower]
      [  826.470478]  ? __mutex_unlock_slowpath.constprop.0+0x210/0x210
      [  826.470973]  ? fl_tmplt_create+0x2d0/0x2d0 [cls_flower]
      [  826.471427]  tc_new_tfilter+0x644/0x1050
      [  826.471795]  ? tc_get_tfilter+0x860/0x860
      [  826.472170]  ? __thaw_task+0x130/0x130
      [  826.472525]  ? arch_stack_walk+0x98/0xf0
      [  826.472892]  ? cap_capable+0x9f/0xd0
      [  826.473235]  ? security_capable+0x47/0x60
      [  826.473608]  rtnetlink_rcv_msg+0x1d5/0x550
      [  826.473985]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
      [  826.474383]  ? __stack_depot_save+0x35/0x4c0
      [  826.474779]  ? kasan_save_stack+0x2e/0x40
      [  826.475149]  ? kasan_save_stack+0x1e/0x40
      [  826.475518]  ? __kasan_record_aux_stack+0x9f/0xb0
      [  826.475939]  ? task_work_add+0x77/0x1c0
      [  826.476305]  netlink_rcv_skb+0xe0/0x210
      [  826.476661]  ? rtnl_calcit.isra.0+0x1f0/0x1f0
      [  826.477057]  ? netlink_ack+0x7c0/0x7c0
      [  826.477412]  ? rhashtable_jhash2+0xef/0x150
      [  826.477796]  ? _copy_from_iter+0x105/0x770
      [  826.484386]  netlink_unicast+0x346/0x490
      [  826.484755]  ? netlink_attachskb+0x400/0x400
      [  826.485145]  ? kernel_text_address+0xc2/0xd0
      [  826.485535]  netlink_sendmsg+0x3b0/0x6c0
      [  826.485902]  ? kernel_text_address+0xc2/0xd0
      [  826.486296]  ? netlink_unicast+0x490/0x490
      [  826.486671]  ? iovec_from_user.part.0+0x7a/0x1a0
      [  826.487083]  ? netlink_unicast+0x490/0x490
      [  826.487461]  sock_sendmsg+0x73/0xc0
      [  826.487803]  ____sys_sendmsg+0x364/0x380
      [  826.488186]  ? import_iovec+0x7/0x10
      [  826.488531]  ? kernel_sendmsg+0x30/0x30
      [  826.488893]  ? __copy_msghdr+0x180/0x180
      [  826.489258]  ? kasan_save_stack+0x2e/0x40
      [  826.489629]  ? kasan_save_stack+0x1e/0x40
      [  826.490002]  ? __kasan_record_aux_stack+0x9f/0xb0
      [  826.490424]  ? __call_rcu_common.constprop.0+0x46/0x580
      [  826.490876]  ___sys_sendmsg+0xdf/0x140
      [  826.491231]  ? copy_msghdr_from_user+0x110/0x110
      [  826.491649]  ? fget_raw+0x120/0x120
      [  826.491988]  ? ___sys_recvmsg+0xd9/0x130
      [  826.492355]  ? folio_batch_add_and_move+0x80/0xa0
      [  826.492776]  ? _raw_spin_lock+0x7a/0xd0
      [  826.493137]  ? _raw_spin_lock+0x7a/0xd0
      [  826.493500]  ? _raw_read_lock_irq+0x30/0x30
      [  826.493880]  ? kasan_set_track+0x21/0x30
      [  826.494249]  ? kasan_save_free_info+0x2a/0x40
      [  826.494650]  ? do_sys_openat2+0xff/0x270
      [  826.495016]  ? __fget_light+0x1b5/0x200
      [  826.495377]  ? __virt_addr_valid+0xb1/0x130
      [  826.495763]  __sys_sendmsg+0xb2/0x130
      [  826.496118]  ? __sys_sendmsg_sock+0x20/0x20
      [  826.496501]  ? __x64_sys_rseq+0x2e0/0x2e0
      [  826.496874]  ? do_user_addr_fault+0x276/0x820
      [  826.497273]  ? fpregs_assert_state_consistent+0x52/0x60
      [  826.497727]  ? exit_to_user_mode_prepare+0x30/0x120
      [  826.498158]  do_syscall_64+0x3d/0x90
      [  826.498502]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  826.498949] RIP: 0033:0x7f9b67f4f887
      [  826.499294] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  826.500742] RSP: 002b:00007fff5d1a5498 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  826.501395] RAX: ffffffffffffffda RBX: 0000000064413ce6 RCX: 00007f9b67f4f887
      [  826.501975] RDX: 0000000000000000 RSI: 00007fff5d1a5500 RDI: 0000000000000003
      [  826.502556] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000001
      [  826.503135] R10: 00007f9b67e08708 R11: 0000000000000246 R12: 0000000000000001
      [  826.503714] R13: 0000000000000001 R14: 00007fff5d1a9800 R15: 0000000000485400
      [  826.504304]  </TASK>
      
      [  826.504753] Allocated by task 3764:
      [  826.505090]  kasan_save_stack+0x1e/0x40
      [  826.505453]  kasan_set_track+0x21/0x30
      [  826.505810]  __kasan_kmalloc+0x77/0x90
      [  826.506164]  __mlx5_create_flow_table+0x16d/0xbb0 [mlx5_core]
      [  826.506742]  esw_offloads_enable+0x60d/0xfb0 [mlx5_core]
      [  826.507292]  mlx5_eswitch_enable_locked+0x4d3/0x680 [mlx5_core]
      [  826.507885]  mlx5_devlink_eswitch_mode_set+0x2a3/0x580 [mlx5_core]
      [  826.508513]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
      [  826.508969]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
      [  826.509427]  genl_rcv_msg+0x28d/0x3e0
      [  826.509772]  netlink_rcv_skb+0xe0/0x210
      [  826.510133]  genl_rcv+0x24/0x40
      [  826.510448]  netlink_unicast+0x346/0x490
      [  826.510810]  netlink_sendmsg+0x3b0/0x6c0
      [  826.511179]  sock_sendmsg+0x73/0xc0
      [  826.511519]  __sys_sendto+0x18d/0x220
      [  826.511867]  __x64_sys_sendto+0x72/0x80
      [  826.512232]  do_syscall_64+0x3d/0x90
      [  826.512576]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      [  826.513220] Freed by task 5674:
      [  826.513535]  kasan_save_stack+0x1e/0x40
      [  826.513893]  kasan_set_track+0x21/0x30
      [  826.514245]  kasan_save_free_info+0x2a/0x40
      [  826.514629]  ____kasan_slab_free+0x11a/0x1b0
      [  826.515021]  __kmem_cache_free+0x14d/0x280
      [  826.515399]  tree_put_node+0x109/0x1c0 [mlx5_core]
      [  826.515907]  mlx5_destroy_flow_table+0x119/0x630 [mlx5_core]
      [  826.516481]  esw_offloads_steering_cleanup+0xe7/0x150 [mlx5_core]
      [  826.517084]  esw_offloads_disable+0xe0/0x160 [mlx5_core]
      [  826.517632]  mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core]
      [  826.518225]  mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core]
      [  826.518834]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
      [  826.519286]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
      [  826.519748]  genl_rcv_msg+0x28d/0x3e0
      [  826.520101]  netlink_rcv_skb+0xe0/0x210
      [  826.520458]  genl_rcv+0x24/0x40
      [  826.520771]  netlink_unicast+0x346/0x490
      [  826.521137]  netlink_sendmsg+0x3b0/0x6c0
      [  826.521505]  sock_sendmsg+0x73/0xc0
      [  826.521842]  __sys_sendto+0x18d/0x220
      [  826.522191]  __x64_sys_sendto+0x72/0x80
      [  826.522554]  do_syscall_64+0x3d/0x90
      [  826.522894]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      [  826.523540] Last potentially related work creation:
      [  826.523969]  kasan_save_stack+0x1e/0x40
      [  826.524331]  __kasan_record_aux_stack+0x9f/0xb0
      [  826.524739]  insert_work+0x30/0x130
      [  826.525078]  __queue_work+0x34b/0x690
      [  826.525426]  queue_work_on+0x48/0x50
      [  826.525766]  __rhashtable_remove_fast_one+0x4af/0x4d0 [mlx5_core]
      [  826.526365]  del_sw_flow_group+0x1b5/0x270 [mlx5_core]
      [  826.526898]  tree_put_node+0x109/0x1c0 [mlx5_core]
      [  826.527407]  esw_offloads_steering_cleanup+0xd3/0x150 [mlx5_core]
      [  826.528009]  esw_offloads_disable+0xe0/0x160 [mlx5_core]
      [  826.528616]  mlx5_eswitch_disable_locked+0x26c/0x290 [mlx5_core]
      [  826.529218]  mlx5_devlink_eswitch_mode_set+0x128/0x580 [mlx5_core]
      [  826.529823]  devlink_nl_cmd_eswitch_set_doit+0xdf/0x1f0
      [  826.530276]  genl_family_rcv_msg_doit.isra.0+0x146/0x1c0
      [  826.530733]  genl_rcv_msg+0x28d/0x3e0
      [  826.531079]  netlink_rcv_skb+0xe0/0x210
      [  826.531439]  genl_rcv+0x24/0x40
      [  826.531755]  netlink_unicast+0x346/0x490
      [  826.532123]  netlink_sendmsg+0x3b0/0x6c0
      [  826.532487]  sock_sendmsg+0x73/0xc0
      [  826.532825]  __sys_sendto+0x18d/0x220
      [  826.533175]  __x64_sys_sendto+0x72/0x80
      [  826.533533]  do_syscall_64+0x3d/0x90
      [  826.533877]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      [  826.534521] The buggy address belongs to the object at ffff888194485800
                      which belongs to the cache kmalloc-512 of size 512
      [  826.535506] The buggy address is located 48 bytes inside of
                      freed 512-byte region [ffff888194485800, ffff888194485a00)
      
      [  826.536666] The buggy address belongs to the physical page:
      [  826.537138] page:00000000d75841dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x194480
      [  826.537915] head:00000000d75841dd order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
      [  826.538595] flags: 0x200000000010200(slab|head|node=0|zone=2)
      [  826.539089] raw: 0200000000010200 ffff888100042c80 ffffea0004523800 dead000000000002
      [  826.539755] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
      [  826.540417] page dumped because: kasan: bad access detected
      
      [  826.541095] Memory state around the buggy address:
      [  826.541519]  ffff888194485700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  826.542149]  ffff888194485780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [  826.542773] >ffff888194485800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  826.543400]                                      ^
      [  826.543822]  ffff888194485880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  826.544452]  ffff888194485900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  826.545079] ==================================================================
      
      Fixes: 67027828 ("net/mlx5e: TC, Set CT miss to the specific ct action instance")
      Signed-off-by: default avatarPaul Blakey <paulb@nvidia.com>
      Reviewed-by: default avatarVlad Buslov <vladbu@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      dfa1e46d
    • Rahul Rameshbabu's avatar
      net/mlx5e: Fix SQ wake logic in ptp napi_poll context · 7aa50380
      Rahul Rameshbabu authored
      Check in the mlx5e_ptp_poll_ts_cq context if the ptp tx sq should be woken
      up. Before change, the ptp tx sq may never wake up if the ptp tx ts skb
      fifo is full when mlx5e_poll_tx_cq checks if the queue should be woken up.
      
      Fixes: 1880bc4e ("net/mlx5e: Add TX port timestamp support")
      Signed-off-by: default avatarRahul Rameshbabu <rrameshbabu@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      7aa50380
    • Vlad Buslov's avatar
      net/mlx5e: Fix deadlock in tc route query code · 691c041b
      Vlad Buslov authored
      Cited commit causes ABBA deadlock[0] when peer flows are created while
      holding the devcom rw semaphore. Due to peer flows offload implementation
      the lock is taken much higher up the call chain and there is no obvious way
      to easily fix the deadlock. Instead, since tc route query code needs the
      peer eswitch structure only to perform a lookup in xarray and doesn't
      perform any sleeping operations with it, refactor the code for lockless
      execution in following ways:
      
      - RCUify the devcom 'data' pointer. When resetting the pointer
      synchronously wait for RCU grace period before returning. This is fine
      since devcom is currently only used for synchronization of
      pairing/unpairing of eswitches which is rare and already expensive as-is.
      
      - Wrap all usages of 'paired' boolean in {READ|WRITE}_ONCE(). The flag has
      already been used in some unlocked contexts without proper
      annotations (e.g. users of mlx5_devcom_is_paired() function), but it wasn't
      an issue since all relevant code paths checked it again after obtaining the
      devcom semaphore. Now it is also used by mlx5_devcom_get_peer_data_rcu() as
      "best effort" check to return NULL when devcom is being unpaired. Note that
      while RCU read lock doesn't prevent the unpaired flag from being changed
      concurrently it still guarantees that reader can continue to use 'data'.
      
      - Refactor mlx5e_tc_query_route_vport() function to use new
      mlx5_devcom_get_peer_data_rcu() API which fixes the deadlock.
      
      [0]:
      
      [  164.599612] ======================================================
      [  164.600142] WARNING: possible circular locking dependency detected
      [  164.600667] 6.3.0-rc3+ #1 Not tainted
      [  164.601021] ------------------------------------------------------
      [  164.601557] handler1/3456 is trying to acquire lock:
      [  164.601998] ffff88811f1714b0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}, at: mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.603078]
                     but task is already holding lock:
      [  164.603617] ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
      [  164.604459]
                     which lock already depends on the new lock.
      
      [  164.605190]
                     the existing dependency chain (in reverse order) is:
      [  164.605848]
                     -> #1 (&comp->sem){++++}-{3:3}:
      [  164.606380]        down_read+0x39/0x50
      [  164.606772]        mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
      [  164.607336]        mlx5e_tc_query_route_vport+0x86/0xc0 [mlx5_core]
      [  164.607914]        mlx5e_tc_tun_route_lookup+0x1a4/0x1d0 [mlx5_core]
      [  164.608495]        mlx5e_attach_decap_route+0xc6/0x1e0 [mlx5_core]
      [  164.609063]        mlx5e_tc_add_fdb_flow+0x1ea/0x360 [mlx5_core]
      [  164.609627]        __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
      [  164.610175]        mlx5e_configure_flower+0x952/0x1a20 [mlx5_core]
      [  164.610741]        tc_setup_cb_add+0xd4/0x200
      [  164.611146]        fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
      [  164.611661]        fl_change+0xc95/0x18a0 [cls_flower]
      [  164.612116]        tc_new_tfilter+0x3fc/0xd20
      [  164.612516]        rtnetlink_rcv_msg+0x418/0x5b0
      [  164.612936]        netlink_rcv_skb+0x54/0x100
      [  164.613339]        netlink_unicast+0x190/0x250
      [  164.613746]        netlink_sendmsg+0x245/0x4a0
      [  164.614150]        sock_sendmsg+0x38/0x60
      [  164.614522]        ____sys_sendmsg+0x1d0/0x1e0
      [  164.614934]        ___sys_sendmsg+0x80/0xc0
      [  164.615320]        __sys_sendmsg+0x51/0x90
      [  164.615701]        do_syscall_64+0x3d/0x90
      [  164.616083]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  164.616568]
                     -> #0 (&esw->offloads.encap_tbl_lock){+.+.}-{3:3}:
      [  164.617210]        __lock_acquire+0x159e/0x26e0
      [  164.617638]        lock_acquire+0xc2/0x2a0
      [  164.618018]        __mutex_lock+0x92/0xcd0
      [  164.618401]        mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.618943]        post_process_attr+0x153/0x2d0 [mlx5_core]
      [  164.619471]        mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core]
      [  164.620021]        __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
      [  164.620564]        mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core]
      [  164.621125]        tc_setup_cb_add+0xd4/0x200
      [  164.621531]        fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
      [  164.622047]        fl_change+0xc95/0x18a0 [cls_flower]
      [  164.622500]        tc_new_tfilter+0x3fc/0xd20
      [  164.622906]        rtnetlink_rcv_msg+0x418/0x5b0
      [  164.623324]        netlink_rcv_skb+0x54/0x100
      [  164.623727]        netlink_unicast+0x190/0x250
      [  164.624138]        netlink_sendmsg+0x245/0x4a0
      [  164.624544]        sock_sendmsg+0x38/0x60
      [  164.624919]        ____sys_sendmsg+0x1d0/0x1e0
      [  164.625340]        ___sys_sendmsg+0x80/0xc0
      [  164.625731]        __sys_sendmsg+0x51/0x90
      [  164.626117]        do_syscall_64+0x3d/0x90
      [  164.626502]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  164.626995]
                     other info that might help us debug this:
      
      [  164.627725]  Possible unsafe locking scenario:
      
      [  164.628268]        CPU0                    CPU1
      [  164.628683]        ----                    ----
      [  164.629098]   lock(&comp->sem);
      [  164.629421]                                lock(&esw->offloads.encap_tbl_lock);
      [  164.630066]                                lock(&comp->sem);
      [  164.630555]   lock(&esw->offloads.encap_tbl_lock);
      [  164.630993]
                      *** DEADLOCK ***
      
      [  164.631575] 3 locks held by handler1/3456:
      [  164.631962]  #0: ffff888124b75130 (&block->cb_lock){++++}-{3:3}, at: tc_setup_cb_add+0x5b/0x200
      [  164.632703]  #1: ffff888116e512b8 (&esw->mode_lock){++++}-{3:3}, at: mlx5_esw_hold+0x39/0x50 [mlx5_core]
      [  164.633552]  #2: ffff88810137fc98 (&comp->sem){++++}-{3:3}, at: mlx5_devcom_get_peer_data+0x37/0x80 [mlx5_core]
      [  164.634435]
                     stack backtrace:
      [  164.634883] CPU: 17 PID: 3456 Comm: handler1 Not tainted 6.3.0-rc3+ #1
      [  164.635431] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [  164.636340] Call Trace:
      [  164.636616]  <TASK>
      [  164.636863]  dump_stack_lvl+0x47/0x70
      [  164.637217]  check_noncircular+0xfe/0x110
      [  164.637601]  __lock_acquire+0x159e/0x26e0
      [  164.637977]  ? mlx5_cmd_set_fte+0x5b0/0x830 [mlx5_core]
      [  164.638472]  lock_acquire+0xc2/0x2a0
      [  164.638828]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.639339]  ? lock_is_held_type+0x98/0x110
      [  164.639728]  __mutex_lock+0x92/0xcd0
      [  164.640074]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.640576]  ? __lock_acquire+0x382/0x26e0
      [  164.640958]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.641468]  ? mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.641965]  mlx5e_attach_encap+0xd8/0x8b0 [mlx5_core]
      [  164.642454]  ? lock_release+0xbf/0x240
      [  164.642819]  post_process_attr+0x153/0x2d0 [mlx5_core]
      [  164.643318]  mlx5e_tc_add_fdb_flow+0x164/0x360 [mlx5_core]
      [  164.643835]  __mlx5e_add_fdb_flow+0x2d2/0x430 [mlx5_core]
      [  164.644340]  mlx5e_configure_flower+0xe33/0x1a20 [mlx5_core]
      [  164.644862]  ? lock_acquire+0xc2/0x2a0
      [  164.645219]  tc_setup_cb_add+0xd4/0x200
      [  164.645588]  fl_hw_replace_filter+0x14c/0x1f0 [cls_flower]
      [  164.646067]  fl_change+0xc95/0x18a0 [cls_flower]
      [  164.646488]  tc_new_tfilter+0x3fc/0xd20
      [  164.646861]  ? tc_del_tfilter+0x810/0x810
      [  164.647236]  rtnetlink_rcv_msg+0x418/0x5b0
      [  164.647621]  ? rtnl_setlink+0x160/0x160
      [  164.647982]  netlink_rcv_skb+0x54/0x100
      [  164.648348]  netlink_unicast+0x190/0x250
      [  164.648722]  netlink_sendmsg+0x245/0x4a0
      [  164.649090]  sock_sendmsg+0x38/0x60
      [  164.649434]  ____sys_sendmsg+0x1d0/0x1e0
      [  164.649804]  ? copy_msghdr_from_user+0x6d/0xa0
      [  164.650213]  ___sys_sendmsg+0x80/0xc0
      [  164.650563]  ? lock_acquire+0xc2/0x2a0
      [  164.650926]  ? lock_acquire+0xc2/0x2a0
      [  164.651286]  ? __fget_files+0x5/0x190
      [  164.651644]  ? find_held_lock+0x2b/0x80
      [  164.652006]  ? __fget_files+0xb9/0x190
      [  164.652365]  ? lock_release+0xbf/0x240
      [  164.652723]  ? __fget_files+0xd3/0x190
      [  164.653079]  __sys_sendmsg+0x51/0x90
      [  164.653435]  do_syscall_64+0x3d/0x90
      [  164.653784]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      [  164.654229] RIP: 0033:0x7f378054f8bd
      [  164.654577] Code: 28 89 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 6a c3 f4 ff 8b 54 24 1c 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 33 44 89 c7 48 89 44 24 08 e8 be c3 f4 ff 48
      [  164.656041] RSP: 002b:00007f377fa114b0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
      [  164.656701] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f378054f8bd
      [  164.657297] RDX: 0000000000000000 RSI: 00007f377fa11540 RDI: 0000000000000014
      [  164.657885] RBP: 00007f377fa12278 R08: 0000000000000000 R09: 000000000000015c
      [  164.658472] R10: 00007f377fa123d0 R11: 0000000000000293 R12: 0000560962d99bd0
      [  164.665317] R13: 0000000000000000 R14: 0000560962d99bd0 R15: 00007f377fa11540
      
      Fixes: f9d196bd ("net/mlx5e: Use correct eswitch for stack devices with lag")
      Signed-off-by: default avatarVlad Buslov <vladbu@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Reviewed-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      691c041b
    • Roi Dayan's avatar
      net/mlx5: Fix error message when failing to allocate device memory · a6573514
      Roi Dayan authored
      Fix spacing for the error and also the correct error code pointer.
      
      Fixes: c9b9dcb4 ("net/mlx5: Move device memory management to mlx5_core")
      Signed-off-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      a6573514
    • Vlad Buslov's avatar
      net/mlx5e: Use correct encap attribute during invalidation · be071cdb
      Vlad Buslov authored
      With introduction of post action infrastructure most of the users of encap
      attribute had been modified in order to obtain the correct attribute by
      calling mlx5e_tc_get_encap_attr() helper instead of assuming encap action
      is always on default attribute. However, the cited commit didn't modify
      mlx5e_invalidate_encap() which prevents it from destroying correct modify
      header action which leads to a warning [0]. Fix the issue by using correct
      attribute.
      
      [0]:
      
      Feb 21 09:47:35 c-237-177-40-045 kernel: WARNING: CPU: 17 PID: 654 at drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:684 mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel: RIP: 0010:mlx5e_tc_attach_mod_hdr+0x1cc/0x230 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel: Call Trace:
      Feb 21 09:47:35 c-237-177-40-045 kernel:  <TASK>
      Feb 21 09:47:35 c-237-177-40-045 kernel:  mlx5e_tc_fib_event_work+0x8e3/0x1f60 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? mlx5e_take_all_encap_flows+0xe0/0xe0 [mlx5_core]
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lock_downgrade+0x6d0/0x6d0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x273/0x3f0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  process_one_work+0x7c2/0x1310
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? lockdep_hardirqs_on_prepare+0x3f0/0x3f0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? pwq_dec_nr_in_flight+0x230/0x230
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? rwlock_bug.part.0+0x90/0x90
      Feb 21 09:47:35 c-237-177-40-045 kernel:  worker_thread+0x59d/0xec0
      Feb 21 09:47:35 c-237-177-40-045 kernel:  ? __kthread_parkme+0xd9/0x1d0
      
      Fixes: 8300f225 ("net/mlx5e: Create new flow attr for multi table actions")
      Signed-off-by: default avatarVlad Buslov <vladbu@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Reviewed-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      be071cdb
    • Yevgeny Kliteynik's avatar
      net/mlx5: DR, Check force-loopback RC QP capability independently from RoCE · c7dd225b
      Yevgeny Kliteynik authored
      SW Steering uses RC QP for writing STEs to ICM. This writingis done in LB
      (loopback), and FL (force-loopback) QP is preferred for performance. FL is
      available when RoCE is enabled or disabled based on RoCE caps.
      This patch adds reading of FL capability from HCA caps in addition to the
      existing reading from RoCE caps, thus fixing the case where we didn't
      have loopback enabled when RoCE was disabled.
      
      Fixes: 7304d603 ("net/mlx5: DR, Add support for force-loopback QP")
      Signed-off-by: default avatarItamar Gozlan <igozlan@nvidia.com>
      Signed-off-by: default avatarYevgeny Kliteynik <kliteyn@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      c7dd225b
    • Erez Shitrit's avatar
      net/mlx5: DR, Fix crc32 calculation to work on big-endian (BE) CPUs · 1e5daf55
      Erez Shitrit authored
      When calculating crc for hash index we use the function crc32 that
      calculates for little-endian (LE) arch.
      Then we convert it to network endianness using htonl(), but it's wrong
      to do the conversion in BE archs since the crc32 value is already LE.
      
      The solution is to switch the bytes from the crc result for all types
      of arc.
      
      Fixes: 40416d8e ("net/mlx5: DR, Replace CRC32 implementation to use kernel lib")
      Signed-off-by: default avatarErez Shitrit <erezsh@nvidia.com>
      Reviewed-by: default avatarAlex Vesker <valex@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      1e5daf55
    • Shay Drory's avatar
      net/mlx5: Handle pairing of E-switch via uplink un/load APIs · 2be5bd42
      Shay Drory authored
      In case user switch a device from switchdev mode to legacy mode, mlx5
      first unpair the E-switch and afterwards unload the uplink vport.
      From the other hand, in case user remove or reload a device, mlx5
      first unload the uplink vport and afterwards unpair the E-switch.
      
      The latter is causing a bug[1], hence, handle pairing of E-switch as
      part of uplink un/load APIs.
      
      [1]
      In case VF_LAG is used, every tc fdb flow is duplicated to the peer
      esw. However, the original esw keeps a pointer to this duplicated
      flow, not the peer esw.
      e.g.: if user create tc fdb flow over esw0, the flow is duplicated
      over esw1, in FW/HW, but in SW, esw0 keeps a pointer to the duplicated
      flow.
      During module unload while a peer tc fdb flow is still offloaded, in
      case the first device to be removed is the peer device (esw1 in the
      example above), the peer net-dev is destroyed, and so the mlx5e_priv
      is memset to 0.
      Afterwards, the peer device is trying to unpair himself from the
      original device (esw0 in the example above). Unpair API invoke the
      original device to clear peer flow from its eswitch (esw0), but the
      peer flow, which is stored over the original eswitch (esw0), is
      trying to use the peer mlx5e_priv, which is memset to 0 and result in
      bellow kernel-oops.
      
      [  157.964081 ] BUG: unable to handle page fault for address: 000000000002ce60
      [  157.964662 ] #PF: supervisor read access in kernel mode
      [  157.965123 ] #PF: error_code(0x0000) - not-present page
      [  157.965582 ] PGD 0 P4D 0
      [  157.965866 ] Oops: 0000 [#1] SMP
      [  157.967670 ] RIP: 0010:mlx5e_tc_del_fdb_flow+0x48/0x460 [mlx5_core]
      [  157.976164 ] Call Trace:
      [  157.976437 ]  <TASK>
      [  157.976690 ]  __mlx5e_tc_del_fdb_peer_flow+0xe6/0x100 [mlx5_core]
      [  157.977230 ]  mlx5e_tc_clean_fdb_peer_flows+0x67/0x90 [mlx5_core]
      [  157.977767 ]  mlx5_esw_offloads_unpair+0x2d/0x1e0 [mlx5_core]
      [  157.984653 ]  mlx5_esw_offloads_devcom_event+0xbf/0x130 [mlx5_core]
      [  157.985212 ]  mlx5_devcom_send_event+0xa3/0xb0 [mlx5_core]
      [  157.985714 ]  esw_offloads_disable+0x5a/0x110 [mlx5_core]
      [  157.986209 ]  mlx5_eswitch_disable_locked+0x152/0x170 [mlx5_core]
      [  157.986757 ]  mlx5_eswitch_disable+0x51/0x80 [mlx5_core]
      [  157.987248 ]  mlx5_unload+0x2a/0xb0 [mlx5_core]
      [  157.987678 ]  mlx5_uninit_one+0x5f/0xd0 [mlx5_core]
      [  157.988127 ]  remove_one+0x64/0xe0 [mlx5_core]
      [  157.988549 ]  pci_device_remove+0x31/0xa0
      [  157.988933 ]  device_release_driver_internal+0x18f/0x1f0
      [  157.989402 ]  driver_detach+0x3f/0x80
      [  157.989754 ]  bus_remove_driver+0x70/0xf0
      [  157.990129 ]  pci_unregister_driver+0x34/0x90
      [  157.990537 ]  mlx5_cleanup+0xc/0x1c [mlx5_core]
      [  157.990972 ]  __x64_sys_delete_module+0x15a/0x250
      [  157.991398 ]  ? exit_to_user_mode_prepare+0xea/0x110
      [  157.991840 ]  do_syscall_64+0x3d/0x90
      [  157.992198 ]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: 04de7dda ("net/mlx5e: Infrastructure for duplicated offloading of TC flows")
      Fixes: 1418ddd9 ("net/mlx5e: Duplicate offloaded TC eswitch rules under uplink LAG")
      Signed-off-by: default avatarShay Drory <shayd@nvidia.com>
      Reviewed-by: default avatarRoi Dayan <roid@nvidia.com>
      Signed-off-by: default avatarSaeed Mahameed <saeedm@nvidia.com>
      2be5bd42