1. 11 Aug, 2023 3 commits
  2. 10 Aug, 2023 26 commits
    • Jakub Kicinski's avatar
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 6a1ed143
      Jakub Kicinski authored
      Martin KaFai Lau says:
      
      ====================
      pull-request: bpf-next 2023-08-09
      
      We've added 19 non-merge commits during the last 6 day(s) which contain
      a total of 25 files changed, 369 insertions(+), 141 deletions(-).
      
      The main changes are:
      
      1) Fix array-index-out-of-bounds access when detaching from an
         already empty mprog entry from Daniel Borkmann.
      
      2) Adjust bpf selftest because of a recent llvm change
         related to the cpu-v4 ISA from Eduard Zingerman.
      
      3) Add uprobe support for the bpf_get_func_ip helper from Jiri Olsa.
      
      4) Fix a KASAN splat due to the kernel incorrectly accepted
         an invalid program using the recent cpu-v4 instruction from
         Yonghong Song.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
        bpf: btf: Remove two unused function declarations
        bpf: lru: Remove unused declaration bpf_lru_promote()
        selftests/bpf: relax expected log messages to allow emitting BPF_ST
        selftests/bpf: remove duplicated functions
        bpf, docs: Fix small typo and define semantics of sign extension
        selftests/bpf: Add bpf_get_func_ip test for uprobe inside function
        selftests/bpf: Add bpf_get_func_ip tests for uprobe on function entry
        bpf: Add support for bpf_get_func_ip helper for uprobe program
        selftests/bpf: Add a movsx selftest for sign-extension of R10
        bpf: Fix an incorrect verification success with movsx insn
        bpf, docs: Formalize type notation and function semantics in ISA standard
        bpf: change bpf_alu_sign_string and bpf_movsx_string to static
        libbpf: Use local includes inside the library
        bpf: fix bpf_dynptr_slice() to stop return an ERR_PTR.
        bpf: fix inconsistent return types of bpf_xdp_copy_buf().
        selftests/bpf: fix the incorrect verification of port numbers.
        selftests/bpf: Add test for detachment on empty mprog entry
        bpf: Fix mprog detachment for empty mprog entry
        bpf: bpf_struct_ops: Remove unnecessary initial values of variables
      ====================
      
      Link: https://lore.kernel.org/r/20230810055123.109578-1-martin.lau@linux.devSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6a1ed143
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 4d016ae4
      Jakub Kicinski authored
      Cross-merge networking fixes after downstream PR.
      
      No conflicts.
      
      Adjacent changes:
      
      drivers/net/ethernet/intel/igc/igc_main.c
        06b41258 ("igc: Add lock to safeguard global Qbv variables")
        d3750076 ("igc: Add TransmissionOverrun counter")
      
      drivers/net/ethernet/microsoft/mana/mana_en.c
        a7dfeda6 ("net: mana: Fix MANA VF unload when hardware is unresponsive")
        a9ca9f9c ("page_pool: split types and declarations from page_pool.h")
        92272ec4 ("eth: add missing xdp.h includes in drivers")
      
      net/mptcp/protocol.h
        511b90e3 ("mptcp: fix disconnect vs accept race")
        b8dc6d6c ("mptcp: fix rcv buffer auto-tuning")
      
      tools/testing/selftests/net/mptcp/mptcp_join.sh
        c8c101ae ("selftests: mptcp: join: fix 'implicit EP' test")
        03668c65 ("selftests: mptcp: join: rework detailed report")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      4d016ae4
    • Linus Torvalds's avatar
      Merge tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 25aa0beb
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from netfilter, wireless and bpf.
      
        Still trending up in size but the good news is that the "current"
        regressions are resolved, AFAIK.
      
        We're getting weirdly many fixes for Wake-on-LAN and suspend/resume
        handling on embedded this week (most not merged yet), not sure why.
        But those are all for older bugs.
      
        Current release - regressions:
      
         - tls: set MSG_SPLICE_PAGES consistently when handing encrypted data
           over to TCP
      
        Current release - new code bugs:
      
         - eth: mlx5: correct IDs on VFs internal to the device (IPU)
      
        Previous releases - regressions:
      
         - phy: at803x: fix WoL support / reporting on AR8032
      
         - bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from
           slaves, leading to BUG_ON()
      
         - tun: prevent tun_build_skb() from exceeding the packet size limit
      
         - wifi: rtw89: fix 8852AE disconnection caused by RX full flags
      
         - eth/PCI: enetc: fix probing after 6fffbc7a ("PCI: Honor
           firmware's device disabled status"), keep PCI devices around even
           if they are disabled / not going to be probed to be able to apply
           quirks on them
      
         - eth: prestera: fix handling IPv4 routes with nexthop IDs
      
        Previous releases - always broken:
      
         - netfilter: re-work garbage collection to avoid races between
           user-facing API and timeouts
      
         - tunnels: fix generating ipv4 PMTU error on non-linear skbs
      
         - nexthop: fix infinite nexthop bucket dump when using maximum
           nexthop ID
      
         - wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()
      
        Misc:
      
         - unix: use consistent error code in SO_PEERPIDFD
      
         - ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for
           upcoming IETF RFC"
      
      * tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
        net: hns3: fix strscpy causing content truncation issue
        net: tls: set MSG_SPLICE_PAGES consistently
        ibmvnic: Ensure login failure recovery is safe from other resets
        ibmvnic: Do partial reset on login failure
        ibmvnic: Handle DMA unmapping of login buffs in release functions
        ibmvnic: Unmap DMA login rsp buffer on send login fail
        ibmvnic: Enforce stronger sanity checks on login response
        net: mana: Fix MANA VF unload when hardware is unresponsive
        netfilter: nf_tables: remove busy mark and gc batch API
        netfilter: nft_set_hash: mark set element as dead when deleting from packet path
        netfilter: nf_tables: adapt set backend to use GC transaction API
        netfilter: nf_tables: GC transaction API to avoid race with control plane
        selftests/bpf: Add sockmap test for redirecting partial skb data
        selftests/bpf: fix a CI failure caused by vsock sockmap test
        bpf, sockmap: Fix bug that strp_done cannot be called
        bpf, sockmap: Fix map type error in sock_map_del_link
        xsk: fix refcount underflow in error path
        ipv6: adjust ndisc_is_useropt() to also return true for PIO
        selftests: forwarding: bridge_mdb: Make test more robust
        selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet
        ...
      25aa0beb
    • Hao Chen's avatar
      net: hns3: fix strscpy causing content truncation issue · 5e3d2061
      Hao Chen authored
      hns3_dbg_fill_content()/hclge_dbg_fill_content() is aim to integrate some
      items to a string for content, and we add '\n' and '\0' in the last
      two bytes of content.
      
      strscpy() will add '\0' in the last byte of destination buffer(one of
      items), it result in finishing content print ahead of schedule and some
      dump content truncation.
      
      One Error log shows as below:
      cat mac_list/uc
      UC MAC_LIST:
      
      Expected:
      UC MAC_LIST:
      FUNC_ID  MAC_ADDR            STATE
      pf       00:2b:19:05:03:00   ACTIVE
      
      The destination buffer is length-bounded and not required to be
      NUL-terminated, so just change strscpy() to memcpy() to fix it.
      
      Fixes: 1cf3d556 ("net: hns3: fix strncpy() not using dest-buf length as length issue")
      Signed-off-by: default avatarHao Chen <chenhao418@huawei.com>
      Signed-off-by: default avatarJijie Shao <shaojijie@huawei.com>
      Link: https://lore.kernel.org/r/20230809020902.1941471-1-shaojijie@huawei.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      5e3d2061
    • Jakub Kicinski's avatar
      net: tls: set MSG_SPLICE_PAGES consistently · 6b486676
      Jakub Kicinski authored
      We used to change the flags for the last segment, because
      non-last segments had the MSG_SENDPAGE_NOTLAST flag set.
      That flag is no longer a thing so remove the setting.
      
      Since flags most likely don't have MSG_SPLICE_PAGES set
      this avoids passing parts of the sg as splice and parts
      as non-splice. Before commit under Fixes we'd have called
      tcp_sendpage() which would add the MSG_SPLICE_PAGES.
      
      Why this leads to trouble remains unclear but Tariq
      reports hitting the WARN_ON(!sendpage_ok()) due to
      page refcount of 0.
      
      Fixes: e117dcfd ("tls: Inline do_tcp_sendpages()")
      Reported-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/all/4c49176f-147a-4283-f1b1-32aac7b4b996@gmail.com/Tested-by: default avatarTariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20230808180917.1243540-1-kuba@kernel.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6b486676
    • Linus Torvalds's avatar
      Merge tag 'dmaengine-fix-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine · 30813656
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
      
       - HAS_IOMEM fixes for fsl edma and intel idma
      
       - return-value fix, interrupt vector setting and typo fix for xilinx
         xdma
      
       - email updates for codeaurora email domain move
      
       - correct pause status for pl330 driver
      
       - idxd clear flag on disable fix
      
       - function documentation fix for owl dma
      
       - potential un-allocated memory fix for mcf driver
      
      * tag 'dmaengine-fix-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
        dmaengine: xilinx: xdma: Fix typo
        dmaengine: xilinx: xdma: Fix interrupt vector setting
        dmaengine: owl-dma: Modify mismatched function name
        dmaengine: idxd: Clear PRS disable flag when disabling IDXD device
        dmaengine: pl330: Return DMA_PAUSED when transaction is paused
        dmaengine: qcom_hidma: Update codeaurora email domain
        dmaengine: mcf-edma: Fix a potential un-allocated memory access
        dmaengine: xilinx: xdma: Fix Judgment of the return value
        idmaengine: make FSL_EDMA and INTEL_IDMA64 depends on HAS_IOMEM
      30813656
    • Jakub Kicinski's avatar
      Merge tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 3e91b0eb
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The existing attempt to resolve races between control plane and GC work
      is error prone, as reported by Bien Pham <phamnnb@sea.com>, some places
      forgot to call nft_set_elem_mark_busy(), leading to double-deactivation
      of elements.
      
      This series contains the following patches:
      
      1) Do not skip expired elements during walk otherwise elements might
         never decrement the reference counter on data, leading to memleak.
      
      2) Add a GC transaction API to replace the former attempt to deal with
         races between control plane and GC. GC worker sets on NFT_SET_ELEM_DEAD_BIT
         on elements and it creates a GC transaction to remove the expired
         elements, GC transaction could abort in case of interference with
         control plane and retried later (GC async). Set backends such as
         rbtree and pipapo also perform GC from control plane (GC sync), in
         such case, element deactivation and removal is safe because mutex
         is held then collected elements are released via call_rcu().
      
      3) Adapt existing set backends to use the GC transaction API.
      
      4) Update rhash set backend to set on _DEAD bit to report deleted
         elements from datapath for GC.
      
      5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT.
      
      * tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        netfilter: nf_tables: remove busy mark and gc batch API
        netfilter: nft_set_hash: mark set element as dead when deleting from packet path
        netfilter: nf_tables: adapt set backend to use GC transaction API
        netfilter: nf_tables: GC transaction API to avoid race with control plane
        netfilter: nf_tables: don't skip expired elements during walk
      ====================
      
      Link: https://lore.kernel.org/r/20230810070830.24064-1-pablo@netfilter.orgSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      3e91b0eb
    • Jakub Kicinski's avatar
      Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 62d02fca
      Jakub Kicinski authored
      Martin KaFai Lau says:
      
      ====================
      pull-request: bpf 2023-08-09
      
      We've added 5 non-merge commits during the last 7 day(s) which contain
      a total of 6 files changed, 102 insertions(+), 8 deletions(-).
      
      The main changes are:
      
      1) A bpf sockmap memleak fix and a fix in accessing the programs of
         a sockmap under the incorrect map type from Xu Kuohai.
      
      2) A refcount underflow fix in xsk from Magnus Karlsson.
      
      * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        selftests/bpf: Add sockmap test for redirecting partial skb data
        selftests/bpf: fix a CI failure caused by vsock sockmap test
        bpf, sockmap: Fix bug that strp_done cannot be called
        bpf, sockmap: Fix map type error in sock_map_del_link
        xsk: fix refcount underflow in error path
      ====================
      
      Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.devSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      62d02fca
    • Nick Child's avatar
      ibmvnic: Ensure login failure recovery is safe from other resets · 6db541ae
      Nick Child authored
      If a login request fails, the recovery process should be protected
      against parallel resets. It is a known issue that freeing and
      registering CRQ's in quick succession can result in a failover CRQ from
      the VIOS. Processing a failover during login recovery is dangerous for
      two reasons:
       1. This will result in two parallel initialization processes, this can
       cause serious issues during login.
       2. It is possible that the failover CRQ is received but never executed.
       We get notified of a pending failover through a transport event CRQ.
       The reset is not performed until a INIT CRQ request is received.
       Previously, if CRQ init fails during login recovery, then the ibmvnic
       irq is freed and the login process returned error. If failover_pending
       is true (a transport event was received), then the ibmvnic device
       would never be able to process the reset since it cannot receive the
       CRQ_INIT request due to the irq being freed. This leaved the device
       in a inoperable state.
      
      Therefore, the login failure recovery process must be hardened against
      these possible issues. Possible failovers (due to quick CRQ free and
      init) must be avoided and any issues during re-initialization should be
      dealt with instead of being propagated up the stack. This logic is
      similar to that of ibmvnic_probe().
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-5-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6db541ae
    • Nick Child's avatar
      ibmvnic: Do partial reset on login failure · 23cc5f66
      Nick Child authored
      Perform a partial reset before sending a login request if any of the
      following are true:
       1. If a previous request times out. This can be dangerous because the
       	VIOS could still receive the old login request at any point after
       	the timeout. Therefore, it is best to re-register the CRQ's  and
       	sub-CRQ's before retrying.
       2. If the previous request returns an error that is not described in
       	PAPR. PAPR provides procedures if the login returns with partial
       	success or aborted return codes (section L.5.1) but other values
      	do not have a defined procedure. Previously, these conditions
      	just returned error from the login function rather than trying
      	to resolve the issue.
       	This can cause further issues since most callers of the login
       	function are not prepared to handle an error when logging in. This
       	improper cleanup can lead to the device being permanently DOWN'd.
       	For example, if the VIOS believes that the device is already logged
       	in then it will return INVALID_STATE (-7). If we never re-register
       	CRQ's then it will always think that the device is already logged
       	in. This leaves the device inoperable.
      
      The partial reset involves freeing the sub-CRQs, freeing the CRQ then
      registering and initializing a new CRQ and sub-CRQs. This essentially
      restarts all communication with VIOS to allow for a fresh login attempt
      that will be unhindered by any previous failed attempts.
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-4-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      23cc5f66
    • Nick Child's avatar
      ibmvnic: Handle DMA unmapping of login buffs in release functions · d78a671e
      Nick Child authored
      Rather than leaving the DMA unmapping of the login buffers to the
      login response handler, move this work into the login release functions.
      Previously, these functions were only used for freeing the allocated
      buffers. This could lead to issues if there are more than one
      outstanding login buffer requests, which is possible if a login request
      times out.
      
      If a login request times out, then there is another call to send login.
      The send login function makes a call to the login buffer release
      function. In the past, this freed the buffers but did not DMA unmap.
      Therefore, the VIOS could still write to the old login (now freed)
      buffer. It is for this reason that it is a good idea to leave the DMA
      unmap call to the login buffers release function.
      
      Since the login buffer release functions now handle DMA unmapping,
      remove the duplicate DMA unmapping in handle_login_rsp().
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-3-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      d78a671e
    • Nick Child's avatar
      ibmvnic: Unmap DMA login rsp buffer on send login fail · 411c565b
      Nick Child authored
      If the LOGIN CRQ fails to send then we must DMA unmap the response
      buffer. Previously, if the CRQ failed then the memory was freed without
      DMA unmapping.
      
      Fixes: c98d9cc4 ("ibmvnic: send_login should check for crq errors")
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-2-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      411c565b
    • Nick Child's avatar
      ibmvnic: Enforce stronger sanity checks on login response · db17ba71
      Nick Child authored
      Ensure that all offsets in a login response buffer are within the size
      of the allocated response buffer. Any offsets or lengths that surpass
      the allocation are likely the result of an incomplete response buffer.
      In these cases, a full reset is necessary.
      
      When attempting to login, the ibmvnic device will allocate a response
      buffer and pass a reference to the VIOS. The VIOS will then send the
      ibmvnic device a LOGIN_RSP CRQ to signal that the buffer has been filled
      with data. If the ibmvnic device does not get a response in 20 seconds,
      the old buffer is freed and a new login request is sent. With 2
      outstanding requests, any LOGIN_RSP CRQ's could be for the older
      login request. If this is the case then the login response buffer (which
      is for the newer login request) could be incomplete and contain invalid
      data. Therefore, we must enforce strict sanity checks on the response
      buffer values.
      
      Testing has shown that the `off_rxadd_buff_size` value is filled in last
      by the VIOS and will be the smoking gun for these circumstances.
      
      Until VIOS can implement a mechanism for tracking outstanding response
      buffers and a method for mapping a LOGIN_RSP CRQ to a particular login
      response buffer, the best ibmvnic can do in this situation is perform a
      full reset.
      
      Fixes: dff515a3 ("ibmvnic: Harden device login requests")
      Signed-off-by: default avatarNick Child <nnac123@linux.ibm.com>
      Reviewed-by: default avatarSimon Horman <horms@kernel.org>
      Link: https://lore.kernel.org/r/20230809221038.51296-1-nnac123@linux.ibm.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      db17ba71
    • Souradeep Chakrabarti's avatar
      net: mana: Fix MANA VF unload when hardware is unresponsive · a7dfeda6
      Souradeep Chakrabarti authored
      When unloading the MANA driver, mana_dealloc_queues() waits for the MANA
      hardware to complete any inflight packets and set the pending send count
      to zero. But if the hardware has failed, mana_dealloc_queues()
      could wait forever.
      
      Fix this by adding a timeout to the wait. Set the timeout to 120 seconds,
      which is a somewhat arbitrary value that is more than long enough for
      functional hardware to complete any sends.
      
      Cc: stable@vger.kernel.org
      Fixes: ca9c54d2 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
      Signed-off-by: default avatarSouradeep Chakrabarti <schakrabarti@linux.microsoft.com>
      Link: https://lore.kernel.org/r/1691576525-24271-1-git-send-email-schakrabarti@linux.microsoft.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      a7dfeda6
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: remove busy mark and gc batch API · a2dd0233
      Pablo Neira Ayuso authored
      Ditch it, it has been replace it by the GC transaction API and it has no
      clients anymore.
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      a2dd0233
    • Pablo Neira Ayuso's avatar
      netfilter: nft_set_hash: mark set element as dead when deleting from packet path · c92db303
      Pablo Neira Ayuso authored
      Set on the NFT_SET_ELEM_DEAD_BIT flag on this element, instead of
      performing element removal which might race with an ongoing transaction.
      Enable gc when dynamic flag is set on since dynset deletion requires
      garbage collection after this patch.
      
      Fixes: d0a8d877 ("netfilter: nft_dynset: support for element deletion")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      c92db303
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: adapt set backend to use GC transaction API · f6c383b8
      Pablo Neira Ayuso authored
      Use the GC transaction API to replace the old and buggy gc API and the
      busy mark approach.
      
      No set elements are removed from async garbage collection anymore,
      instead the _DEAD bit is set on so the set element is not visible from
      lookup path anymore. Async GC enqueues transaction work that might be
      aborted and retried later.
      
      rbtree and pipapo set backends does not set on the _DEAD bit from the
      sync GC path since this runs in control plane path where mutex is held.
      In this case, set elements are deactivated, removed and then released
      via RCU callback, sync GC never fails.
      
      Fixes: 3c4287f6 ("nf_tables: Add set type for arbitrary concatenation of ranges")
      Fixes: 8d8540c4 ("netfilter: nft_set_rbtree: add timeout support")
      Fixes: 9d098292 ("netfilter: nft_hash: add support for timeouts")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      f6c383b8
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: GC transaction API to avoid race with control plane · 5f68718b
      Pablo Neira Ayuso authored
      The set types rhashtable and rbtree use a GC worker to reclaim memory.
      From system work queue, in periodic intervals, a scan of the table is
      done.
      
      The major caveat here is that the nft transaction mutex is not held.
      This causes a race between control plane and GC when they attempt to
      delete the same element.
      
      We cannot grab the netlink mutex from the work queue, because the
      control plane has to wait for the GC work queue in case the set is to be
      removed, so we get following deadlock:
      
         cpu 1                                cpu2
           GC work                            transaction comes in , lock nft mutex
             `acquire nft mutex // BLOCKS
                                              transaction asks to remove the set
                                              set destruction calls cancel_work_sync()
      
      cancel_work_sync will now block forever, because it is waiting for the
      mutex the caller already owns.
      
      This patch adds a new API that deals with garbage collection in two
      steps:
      
      1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT
         so they are not visible via lookup. Annotate current GC sequence in
         the GC transaction. Enqueue GC transaction work as soon as it is
         full. If ruleset is updated, then GC transaction is aborted and
         retried later.
      
      2) GC work grabs the mutex. If GC sequence has changed then this GC
         transaction lost race with control plane, abort it as it contains
         stale references to objects and let GC try again later. If the
         ruleset is intact, then this GC transaction deactivates and removes
         the elements and it uses call_rcu() to destroy elements.
      
      Note that no elements are removed from GC lockless path, the _DEAD bit
      is set and pointers are collected. GC catchall does not remove the
      elements anymore too. There is a new set->dead flag that is set on to
      abort the GC transaction to deal with set->ops->destroy() path which
      removes the remaining elements in the set from commit_release, where no
      mutex is held.
      
      To deal with GC when mutex is held, which allows safe deactivate and
      removal, add sync GC API which releases the set element object via
      call_rcu(). This is used by rbtree and pipapo backends which also
      perform garbage collection from control plane path.
      
      Since element removal from sets can happen from control plane and
      element garbage collection/timeout, it is necessary to keep the set
      structure alive until all elements have been deactivated and destroyed.
      
      We cannot do a cancel_work_sync or flush_work in nft_set_destroy because
      its called with the transaction mutex held, but the aforementioned async
      work queue might be blocked on the very mutex that nft_set_destroy()
      callchain is sitting on.
      
      This gives us the choice of ABBA deadlock or UaF.
      
      To avoid both, add set->refs refcount_t member. The GC API can then
      increment the set refcount and release it once the elements have been
      free'd.
      
      Set backends are adapted to use the GC transaction API in a follow up
      patch entitled:
      
        ("netfilter: nf_tables: use gc transaction API in set backends")
      
      This is joint work with Florian Westphal.
      
      Fixes: cfed7e1b ("netfilter: nf_tables: add set garbage collection helpers")
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      5f68718b
    • Linus Torvalds's avatar
      Merge tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd · 374a7f47
      Linus Torvalds authored
      Pull smb server fixes from Steve French:
       "Two ksmbd server fixes, both also for stable:
      
         - improve buffer validation when multiple EAs returned
      
         - missing check for command payload size"
      
      * tag '6.5-rc5-ksmbd-server' of git://git.samba.org/ksmbd:
        ksmbd: fix wrong next length validation of ea buffer in smb2_set_ea()
        ksmbd: validate command request size
      374a7f47
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-fixes-for-v6.5-3-2023-08-09' of... · b4f63b0f
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v6.5-3-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Revert a patch that unconditionally resolved addresses to inlines in
         callchains, something that was done before when DWARF mode was asked
         for, but could as well be done when just frame pointers (the default)
         was selected.
      
         This enriches the callchains with inlines but the way to resolve it
         is gross right now, relying on addr2line, and even if we come up with
         an efficient way of processing all the associated DWARF info for a
         big file as vmlinux is, this has to be something people opt-in, as it
         will still result in overheads, so revert it until we get this done
         in a saner way.
      
       - Update the x86 msr-index.h header with the kernel original, no change
         in tooling output, just addresses a tools/perf build warning.
      
        - Resolve a regression where special "tool events", such as
          "duration_time" were being presented for all CPUs, when it only
          makes sense to show it for the workload, that is, just once.
      
      * tag 'perf-tools-fixes-for-v6.5-3-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf stat: Don't display zero tool counts
        tools arch x86: Sync the msr-index.h copy with the kernel sources
        Revert "perf report: Append inlines to non-DWARF callchains"
      b4f63b0f
    • Martin KaFai Lau's avatar
      b734f02c
    • Xu Kuohai's avatar
      selftests/bpf: Add sockmap test for redirecting partial skb data · a4b7193d
      Xu Kuohai authored
      Add a test case to check whether sockmap redirection works correctly
      when data length returned by stream_parser is less than skb->len.
      
      In addition, this test checks whether strp_done is called correctly.
      The reason is that we returns skb->len - 1 from the stream_parser, so
      the last byte in the skb will be held by strp->skb_head. Therefore,
      if strp_done is not called to free strp->skb_head, we'll get a memleak
      warning.
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20230804073740.194770-5-xukuohai@huaweicloud.comSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      a4b7193d
    • Xu Kuohai's avatar
      selftests/bpf: fix a CI failure caused by vsock sockmap test · 90f0074c
      Xu Kuohai authored
      BPF CI has reported the following failure:
      
      Error: #200/79 sockmap_listen/sockmap VSOCK test_vsock_redir
        Error: #200/79 sockmap_listen/sockmap VSOCK test_vsock_redir
        ./test_progs:vsock_unix_redir_connectible:1506: egress: write: Transport endpoint is not connected
        vsock_unix_redir_connectible:FAIL:1506
        ./test_progs:vsock_unix_redir_connectible:1506: ingress: write: Transport endpoint is not connected
        vsock_unix_redir_connectible:FAIL:1506
        ./test_progs:vsock_unix_redir_connectible:1506: egress: write: Transport endpoint is not connected
        vsock_unix_redir_connectible:FAIL:1506
        ./test_progs:vsock_unix_redir_connectible:1514: ingress: recv() err, errno=11
        vsock_unix_redir_connectible:FAIL:1514
        ./test_progs:vsock_unix_redir_connectible:1518: ingress: vsock socket map failed, a != b
        vsock_unix_redir_connectible:FAIL:1518
        ./test_progs:vsock_unix_redir_connectible:1525: ingress: want pass count 1, have 0
      
      It’s because the recv(... MSG_DONTWAIT) syscall in the test case is
      called before the queued work sk_psock_backlog() in the kernel finishes
      executing. So the data to be read is still queued in psock->ingress_skb
      and cannot be read by the user program. Therefore, the non-blocking
      recv() reads nothing and reports an EAGAIN error.
      
      So replace recv(... MSG_DONTWAIT) with xrecv_nonblock(), which calls
      select() to wait for data to be readable or timeout before calls recv().
      
      Fixes: d61bd8c1 ("selftests/bpf: add a test case for vsock sockmap")
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20230804073740.194770-4-xukuohai@huaweicloud.comSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      90f0074c
    • Xu Kuohai's avatar
      bpf, sockmap: Fix bug that strp_done cannot be called · 809e4dc7
      Xu Kuohai authored
      strp_done is only called when psock->progs.stream_parser is not NULL,
      but stream_parser was set to NULL by sk_psock_stop_strp(), called
      by sk_psock_drop() earlier. So, strp_done can never be called.
      
      Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock.
      Change the condition for calling strp_done from judging whether
      stream_parser is set to judging whether this flag is set. This flag is
      only set once when strp_init() succeeds, and will never be cleared later.
      
      Fixes: c0d95d33 ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap")
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Reviewed-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.comSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      809e4dc7
    • Xu Kuohai's avatar
      bpf, sockmap: Fix map type error in sock_map_del_link · 7e96ec0e
      Xu Kuohai authored
      sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although
      both types have member named "progs", the offset of "progs" member in
      these two types is different, so "progs" should be accessed with the
      real map type.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Reviewed-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/r/20230804073740.194770-2-xukuohai@huaweicloud.comSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      7e96ec0e
    • Magnus Karlsson's avatar
      xsk: fix refcount underflow in error path · 85c2c79a
      Magnus Karlsson authored
      Fix a refcount underflow problem reported by syzbot that can happen
      when a system is running out of memory. If xp_alloc_tx_descs() fails,
      and it can only fail due to not having enough memory, then the error
      path is triggered. In this error path, the refcount of the pool is
      decremented as it has incremented before. However, the reference to
      the pool in the socket was not nulled. This means that when the socket
      is closed later, the socket teardown logic will think that there is a
      pool attached to the socket and try to decrease the refcount again,
      leading to a refcount underflow.
      
      I chose this fix as it involved adding just a single line. Another
      option would have been to move xp_get_pool() and the assignment of
      xs->pool to after the if-statement and using xs_umem->pool instead of
      xs->pool in the whole if-statement resulting in somewhat simpler code,
      but this would have led to much more churn in the code base perhaps
      making it harder to backport.
      
      Fixes: ba3beec2 ("xsk: Fix possible crash when multiple sockets are created")
      Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com
      Signed-off-by: default avatarMagnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.comSigned-off-by: default avatarMartin KaFai Lau <martin.lau@kernel.org>
      85c2c79a
  3. 09 Aug, 2023 11 commits
    • Jakub Kicinski's avatar
      Merge branch 'improve-the-taprio-qdisc-s-relationship-with-its-children' · 29afcd69
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Improve the taprio qdisc's relationship with its children
      
      v1: https://lore.kernel.org/lkml/20230531173928.1942027-1-vladimir.oltean@nxp.com/
      
      Prompted by Vinicius' request to consolidate some child Qdisc
      dereferences in taprio:
      https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/
      
      I remembered that I had left some unfinished work in this Qdisc, namely
      commit af7b29b1 ("Revert "net/sched: taprio: make qdisc_leaf() see
      the per-netdev-queue pfifo child qdiscs"").
      
      This patch set represents another stab at, essentially, what's in the
      title. Not only does taprio not properly detect when it's grafted as a
      non-root qdisc, but it also returns incorrect per-class stats.
      Eventually, Vinicius' request is addressed too, although in a different
      form than the one he requested (which was purely cosmetic).
      
      Review from people more experienced with Qdiscs than me would be
      appreciated. I tried my best to explain what I consider to be problems.
      I am deliberately targeting net-next because the changes are too
      invasive for net - they were reverted from stable once already.
      ====================
      
      Link: https://lore.kernel.org/r/20230807193324.4128292-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      29afcd69
    • Vladimir Oltean's avatar
      selftests/tc-testing: verify that a qdisc can be grafted onto a taprio class · 29c298d2
      Vladimir Oltean authored
      The reason behind commit af7b29b1 ("Revert "net/sched: taprio: make
      qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") was that the
      patch it reverted caused a crash when attaching a CBS shaper to one of
      the taprio classes. Prevent that from happening again by adding a test
      case for it, which now passes correctly in both offload and software
      modes.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarPedro Tammela <pctammela@mojatatu.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-12-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      29c298d2
    • Vladimir Oltean's avatar
      selftests/tc-testing: test that taprio can only be attached as root · 1890cf08
      Vladimir Oltean authored
      Check that the "Can only be attached as root qdisc" error message from
      taprio is effective by attempting to attach it to a class of another
      taprio qdisc. That operation should fail.
      
      In the bug that was squashed by change "net/sched: taprio: try again to
      report q->qdiscs[] to qdisc_leaf()", grafting a child taprio to a root
      software taprio would be misinterpreted as a change() to the root
      taprio. Catch this by looking at whether the base-time of the root
      taprio has changed to follow the base-time of the child taprio,
      something which should have absolutely never happened assuming correct
      semantics.
      
      Vinicius points out that looking at "base_time" in the tc qdisc show
      output is unreliable because user space is in a race with the kernel
      applying the setting. So we create a helper bash script which waits
      while there is any pending schedule.
      
      Link: https://lore.kernel.org/netdev/87il9w0xx7.fsf@intel.com/Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: default avatarPedro Tammela <pctammela@mojatatu.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-11-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      1890cf08
    • Vladimir Oltean's avatar
      selftests/tc-testing: add ptp_mock Kconfig dependency · 355adce3
      Vladimir Oltean authored
      For offloaded tc-taprio testing with netdevsim, the mock-up PHC driver
      is used.
      Suggested-by: default avatarVictor Nogueira <victor@mojatatu.com>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-10-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      355adce3
    • Vladimir Oltean's avatar
      net: netdevsim: mimic tc-taprio offload · 35da47fe
      Vladimir Oltean authored
      To be able to use netdevsim for tc-testing with an offloaded tc-taprio
      schedule, it needs to report a PTP clock (which it now does), and to
      accept ndo_setup_tc(TC_SETUP_QDISC_TAPRIO) calls.
      
      Since netdevsim has no packet I/O, this doesn't do anything intelligent,
      it only allows taprio offload code paths to go through some level of
      automated testing.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-9-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      35da47fe
    • Vladimir Oltean's avatar
      net: netdevsim: use mock PHC driver · b63e78fc
      Vladimir Oltean authored
      I'd like to make netdevsim offload tc-taprio, but currently, this Qdisc
      emits a ETHTOOL_GET_TS_INFO call to the driver to make sure that it has
      a PTP clock, so that it is reasonably capable of offloading the schedule.
      
      By using the mock PHC driver, that becomes possible.
      
      Hardware timestamping is not necessary, and netdevsim does not support
      packet I/O anyway.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-8-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b63e78fc
    • Vladimir Oltean's avatar
      net: ptp: create a mock-up PTP Hardware Clock driver · 40b0425f
      Vladimir Oltean authored
      There are several cases where virtual net devices may benefit from
      having a PTP clock, and these have to do with testing. I can see at
      least netdevsim and veth as potential users of a common mock-up PTP
      hardware clock driver.
      
      The proposed idea is to create an object which emulates PTP clock
      operations on top of the unadjustable CLOCK_MONOTONIC_RAW plus a
      software-controlled time domain via a timecounter/cyclecounter and then
      link that PHC to the netdevsim device.
      
      The driver is fully functional for its intended purpose, and it
      successfully passes the PTP selftests.
      
      $ cd tools/testing/selftests/ptp/
      $ ./phc.sh /dev/ptp2
      TEST: settime                          [ OK ]
      TEST: adjtime                          [ OK ]
      TEST: adjfreq                          [ OK ]
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-7-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      40b0425f
    • Vladimir Oltean's avatar
      net/sched: taprio: dump class stats for the actual q->qdiscs[] · 665338b2
      Vladimir Oltean authored
      This makes a difference for the software scheduling mode, where
      dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
      but when we're talking about what Qdisc and stats get reported for a
      traffic class, the root taprio isn't what comes to mind, but q->qdiscs[]
      is.
      
      To understand the difference, I've attempted to send 100 packets in
      software mode through class 8001:5, and recorded the stats before and
      after the change.
      
      Here is before:
      
      $ tc -s class show dev eth0
      class taprio 8001:1 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:2 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:3 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:4 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:5 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:6 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:7 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:8 root leaf 8001:
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      
      and here is after:
      
      class taprio 8001:1 root
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:2 root
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:3 root
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:4 root
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:5 root
       Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:6 root
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:7 root
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      class taprio 8001:8 root leaf 800d:
       Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
       backlog 0b 0p requeues 0
       window_drops 0
      
      The most glaring (and expected) difference is that before, all class
      stats reported the global stats, whereas now, they really report just
      the counters for that traffic class.
      
      Finally, Pedro Tammela points out that there is a tc selftest which
      checks specifically which handle do the child Qdiscs corresponding to
      each class have. That's changing here - taprio no longer reports
      tcm->tcm_info as the same handle "1:" as itself (the root Qdisc), but 0
      (the handle of the default pfifo child Qdiscs). Since iproute2 does not
      print a child Qdisc handle of 0, adjust the test's expected output.
      
      Link: https://lore.kernel.org/netdev/3b83fcf6-a5e8-26fb-8c8a-ec34ec4c3342@mojatatu.com/Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-6-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      665338b2
    • Vladimir Oltean's avatar
      net/sched: taprio: delete misleading comment about preallocating child qdiscs · 6e0ec800
      Vladimir Oltean authored
      As mentioned in commit af7b29b1 ("Revert "net/sched: taprio: make
      qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike
      mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport
      between Qdisc_ops :: init() and Qdisc_ops :: attach().
      
      Delete the comment, which is just stolen from mqprio, but there, the
      usage patterns are a lot different, and this is nothing but confusing.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-5-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6e0ec800
    • Vladimir Oltean's avatar
      net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf() · 98766add
      Vladimir Oltean authored
      This is another stab at commit 1461d212 ("net/sched: taprio: make
      qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later
      reverted in commit af7b29b1 ("Revert "net/sched: taprio: make
      qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"").
      
      I believe that the problems that caused the revert were fixed, and thus,
      this change is identical to the original patch.
      
      Its purpose is to properly reject attaching a software taprio child
      qdisc to a software taprio parent. Because unoffloaded taprio currently
      reports itself (the root Qdisc) as the return value from qdisc_leaf(),
      then the process of attaching another taprio as child to a Qdisc class
      of the root will just result in a Qdisc_ops :: change() call for the
      root. Whereas that's not we want. We want Qdisc_ops :: init() to be
      called for the taprio child, in order to give the taprio child a chance
      to check whether its sch->parent is TC_H_ROOT or not (and reject this
      configuration).
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-4-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      98766add
    • Vladimir Oltean's avatar
      net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode · 25b0d4e4
      Vladimir Oltean authored
      Normally, Qdiscs have one reference on them held by their owner and one
      held for each TXQ to which they are attached, however this is not the
      case with the children of an offloaded taprio. Instead, the taprio qdisc
      currently lives in the following fragile equilibrium.
      
      In the software scheduling case, taprio attaches itself (the root Qdisc)
      to all TXQs, thus having a refcount of 1 + the number of TX queues. In
      this mode, the q->qdiscs[] children are not visible directly to the
      Qdisc API. The lifetime of the Qdiscs from this private array lasts
      until qdisc_destroy() -> taprio_destroy().
      
      In the fully offloaded case, the root taprio has a refcount of 1, and
      all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
      are attached to the netdev TXQs directly and thus are visible to the
      Qdisc API, however taprio loses a reference to them very early - during
      qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
      the q->qdiscs[] array to not leak memory, but interestingly, it does not
      release a reference on these qdiscs because it doesn't effectively own
      them - they are created by taprio but owned by the Qdisc core, and will
      be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
      the Qdisc is deleted or when the child Qdisc is replaced with something
      else.
      
      My interest is to change this equilibrium such that taprio also owns a
      reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
      Qdisc, including in full offload mode. I want this because I would like
      taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
      insight into q->qdiscs[] for the software scheduling mode - currently
      they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
      as the root taprio.
      
      The following set of changes is necessary:
      - don't free q->qdiscs[] early in taprio_attach(), free it late in
        taprio_destroy() for consistency with software mode. But:
      - currently that's not possible, because taprio doesn't own a reference
        on q->qdiscs[]. So hold that reference - once during the initial
        attach() and once during subsequent graft() calls when the child is
        changed.
      - always keep track of the current child in q->qdiscs[], even for full
        offload mode, so that we free in taprio_destroy() what we should, and
        not something stale.
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20230807193324.4128292-3-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      25b0d4e4