1. 04 Mar, 2022 8 commits
  2. 03 Mar, 2022 32 commits
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 80901bff
      Jakub Kicinski authored
      net/batman-adv/hard-interface.c
        commit 690bb6fb ("batman-adv: Request iflink once in batadv-on-batadv check")
        commit 6ee3c393 ("batman-adv: Demote batadv-on-batadv skip error message")
      https://lore.kernel.org/all/20220302163049.101957-1-sw@simonwunderlich.de/
      
      net/smc/af_smc.c
        commit 4d08b7b5 ("net/smc: Fix cleanup when register ULP fails")
        commit 462791bb ("net/smc: add sysctl interface for SMC")
      https://lore.kernel.org/all/20220302112209.355def40@canb.auug.org.au/Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      80901bff
    • Linus Torvalds's avatar
      Merge tag 'net-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · b949c21f
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from can, xfrm, wifi, bluetooth, and netfilter.
      
        Lots of various size fixes, the length of the tag speaks for itself.
        Most of the 5.17-relevant stuff comes from xfrm, wifi and bt trees
        which had been lagging as you pointed out previously. But there's also
        a larger than we'd like portion of fixes for bugs from previous
        releases.
      
        Three more fixes still under discussion, including and xfrm revert for
        uAPI error.
      
        Current release - regressions:
      
         - iwlwifi: don't advertise TWT support, prevent FW crash
      
         - xfrm: fix the if_id check in changelink
      
         - xen/netfront: destroy queues before real_num_tx_queues is zeroed
      
         - bluetooth: fix not checking MGMT cmd pending queue, make scanning
           work again
      
        Current release - new code bugs:
      
         - mptcp: make SIOCOUTQ accurate for fallback socket
      
         - bluetooth: access skb->len after null check
      
         - bluetooth: hci_sync: fix not using conn_timeout
      
         - smc: fix cleanup when register ULP fails
      
         - dsa: restore error path of dsa_tree_change_tag_proto
      
         - iwlwifi: fix build error for IWLMEI
      
         - iwlwifi: mvm: propagate error from request_ownership to the user
      
        Previous releases - regressions:
      
         - xfrm: fix pMTU regression when reported pMTU is too small
      
         - xfrm: fix TCP MSS calculation when pMTU is close to 1280
      
         - bluetooth: fix bt_skb_sendmmsg not allocating partial chunks
      
         - ipv6: ensure we call ipv6_mc_down() at most once, prevent leaks
      
         - ipv6: prevent leaks in igmp6 when input queues get full
      
         - fix up skbs delta_truesize in UDP GRO frag_list
      
         - eth: e1000e: fix possible HW unit hang after an s0ix exit
      
         - eth: e1000e: correct NVM checksum verification flow
      
         - ptp: ocp: fix large time adjustments
      
        Previous releases - always broken:
      
         - tcp: make tcp_read_sock() more robust in presence of urgent data
      
         - xfrm: distinguishing SAs and SPs by if_id in xfrm_migrate
      
         - xfrm: fix xfrm_migrate issues when address family changes
      
         - dcb: flush lingering app table entries for unregistered devices
      
         - smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error
      
         - mac80211: fix EAPoL rekey fail in 802.3 rx path
      
         - mac80211: fix forwarded mesh frames AC & queue selection
      
         - netfilter: nf_queue: fix socket access races and bugs
      
         - batman-adv: fix ToCToU iflink problems and check the result belongs
           to the expected net namespace
      
         - can: gs_usb, etas_es58x: fix opened_channel_cnt's accounting
      
         - can: rcar_canfd: register the CAN device when fully ready
      
         - eth: igb, igc: phy: drop premature return leaking HW semaphore
      
         - eth: ixgbe: xsk: change !netif_carrier_ok() handling in
           ixgbe_xmit_zc(), prevent live lock when link goes down
      
         - eth: stmmac: only enable DMA interrupts when ready
      
         - eth: sparx5: move vlan checks before any changes are made
      
         - eth: iavf: fix races around init, removal, resets and vlan ops
      
         - ibmvnic: more reset flow fixes
      
        Misc:
      
         - eth: fix return value of __setup handlers"
      
      * tag 'net-5.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (92 commits)
        ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()
        net: dsa: make dsa_tree_change_tag_proto actually unwind the tag proto change
        ixgbe: xsk: change !netif_carrier_ok() handling in ixgbe_xmit_zc()
        selftests: mlxsw: resource_scale: Fix return value
        selftests: mlxsw: tc_police_scale: Make test more robust
        net: dcb: disable softirqs in dcbnl_flush_dev()
        bnx2: Fix an error message
        sfc: extend the locking on mcdi->seqno
        net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server
        net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client
        net: arcnet: com20020: Fix null-ptr-deref in com20020pci_probe()
        tcp: make tcp_read_sock() more robust
        bpf, sockmap: Do not ignore orig_len parameter
        net: ipa: add an interconnect dependency
        net: fix up skbs delta_truesize in UDP GRO frag_list
        iwlwifi: mvm: return value for request_ownership
        nl80211: Update bss channel on channel switch for P2P_CLIENT
        iwlwifi: fix build error for IWLMEI
        ptp: ocp: Add ptp_ocp_adjtime_coarse for large adjustments
        batman-adv: Don't expect inter-netns unique iflink indices
        ...
      b949c21f
    • Linus Torvalds's avatar
      Merge tag 'mips-fixes-5.17_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux · e58bd49d
      Linus Torvalds authored
      Pull MIPS fixes from Thomas Bogendoerfer:
      
       - Fix memory detection for MT7621 devices
      
       - Fix setnocoherentio kernel option
      
       - Fix warning when CONFIG_SCHED_CORE is enabled
      
      * tag 'mips-fixes-5.17_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
        MIPS: ralink: mt7621: use bitwise NOT instead of logical
        mips: setup: fix setnocoherentio() boolean setting
        MIPS: smp: fill in sibling and core maps earlier
        MIPS: ralink: mt7621: do memory detection on KSEG1
      e58bd49d
    • Linus Torvalds's avatar
      Merge tag 'auxdisplay-for-linus-v5.17-rc7' of git://github.com/ojeda/linux · 4d5ae234
      Linus Torvalds authored
      Pull auxdisplay fixes from Miguel Ojeda:
       "A few lcd2s fixes from Andy Shevchenko"
      
      * tag 'auxdisplay-for-linus-v5.17-rc7' of git://github.com/ojeda/linux:
        auxdisplay: lcd2s: Use proper API to free the instance of charlcd object
        auxdisplay: lcd2s: Fix memory leak in ->remove()
        auxdisplay: lcd2s: Fix lcd2s_redefine_char() feature
      4d5ae234
    • Eric Dumazet's avatar
      ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report() · 2d3916f3
      Eric Dumazet authored
      While investigating on why a synchronize_net() has been added recently
      in ipv6_mc_down(), I found that igmp6_event_query() and igmp6_event_report()
      might drop skbs in some cases.
      
      Discussion about removing synchronize_net() from ipv6_mc_down()
      will happen in a different thread.
      
      Fixes: f185de28 ("mld: add new workqueues for process mld events")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Taehee Yoo <ap420073@gmail.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220303173728.937869-1-eric.dumazet@gmail.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      2d3916f3
    • Vladimir Oltean's avatar
      net: dsa: make dsa_tree_change_tag_proto actually unwind the tag proto change · e1bec7fa
      Vladimir Oltean authored
      The blamed commit said one thing but did another. It explains that we
      should restore the "return err" to the original "goto out_unwind_tagger",
      but instead it replaced it with "goto out_unlock".
      
      When DSA_NOTIFIER_TAG_PROTO fails after the first switch of a
      multi-switch tree, the switches would end up not using the same tagging
      protocol.
      
      Fixes: 0b0e2ff1 ("net: dsa: restore error path of dsa_tree_change_tag_proto")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220303154249.1854436-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e1bec7fa
    • Maciej Fijalkowski's avatar
      ixgbe: xsk: change !netif_carrier_ok() handling in ixgbe_xmit_zc() · 6c7273a2
      Maciej Fijalkowski authored
      Commit c685c69f ("ixgbe: don't do any AF_XDP zero-copy transmit if
      netif is not OK") addressed the ring transient state when
      MEM_TYPE_XSK_BUFF_POOL was being configured which in turn caused the
      interface to through down/up. Maurice reported that when carrier is not
      ok and xsk_pool is present on ring pair, ksoftirqd will consume 100% CPU
      cycles due to the constant NAPI rescheduling as ixgbe_poll() states that
      there is still some work to be done.
      
      To fix this, do not set work_done to false for a !netif_carrier_ok().
      
      Fixes: c685c69f ("ixgbe: don't do any AF_XDP zero-copy transmit if netif is not OK")
      Reported-by: default avatarMaurice Baijens <maurice.baijens@ellips.com>
      Tested-by: default avatarMaurice Baijens <maurice.baijens@ellips.com>
      Signed-off-by: default avatarMaciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: default avatarSandeep Penigalapati <sandeep.penigalapati@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6c7273a2
    • Jakub Kicinski's avatar
      Merge branch 'selftests-mlxsw-a-couple-of-fixes' · 312f2d50
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      selftests: mlxsw: A couple of fixes
      
      Patch #1 fixes a breakage due to a change in iproute2 output. The real
      problem is not iproute2, but the fact that the check was not strict
      enough. Fixed by using JSON output instead. Targeting at net so that the
      test will pass as part of old and new kernels regardless of iproute2
      version.
      
      Patch #2 fixes an issue uncovered by the first one.
      ====================
      
      Link: https://lore.kernel.org/r/20220302161447.217447-1-idosch@nvidia.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      312f2d50
    • Amit Cohen's avatar
      selftests: mlxsw: resource_scale: Fix return value · 196f9bc0
      Amit Cohen authored
      The test runs several test cases and is supposed to return an error in
      case at least one of them failed.
      
      Currently, the check of the return value of each test case is in the
      wrong place, which can result in the wrong return value. For example:
      
       # TESTS='tc_police' ./resource_scale.sh
       TEST: 'tc_police' [default] 968                                     [FAIL]
               tc police offload count failed
       Error: mlxsw_spectrum: Failed to allocate policer index.
       We have an error talking to the kernel
       Command failed /tmp/tmp.i7Oc5HwmXY:969
       TEST: 'tc_police' [default] overflow 969                            [ OK ]
       ...
       TEST: 'tc_police' [ipv4_max] overflow 969                           [ OK ]
      
       $ echo $?
       0
      
      Fix this by moving the check to be done after each test case.
      
      Fixes: 059b18e2 ("selftests: mlxsw: Return correct error code in resource scale test")
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Reviewed-by: default avatarPetr Machata <petrm@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      196f9bc0
    • Amit Cohen's avatar
      selftests: mlxsw: tc_police_scale: Make test more robust · dc975207
      Amit Cohen authored
      The test adds tc filters and checks how many of them were offloaded by
      grepping for 'in_hw'.
      
      iproute2 commit f4cd4f127047 ("tc: add skip_hw and skip_sw to control
      action offload") added offload indication to tc actions, producing the
      following output:
      
       $ tc filter show dev swp2 ingress
       ...
       filter protocol ipv6 pref 1000 flower chain 0 handle 0x7c0
         eth_type ipv6
         dst_ip 2001:db8:1::7bf
         skip_sw
         in_hw in_hw_count 1
               action order 1:  police 0x7c0 rate 10Mbit burst 100Kb mtu 2Kb action drop overhead 0b
               ref 1 bind 1
               not_in_hw
               used_hw_stats immediate
      
      The current grep expression matches on both 'in_hw' and 'not_in_hw',
      resulting in incorrect results.
      
      Fix that by using JSON output instead.
      
      Fixes: 5061e773 ("selftests: mlxsw: Add scale test for tc-police")
      Signed-off-by: default avatarAmit Cohen <amcohen@nvidia.com>
      Reviewed-by: default avatarPetr Machata <petrm@nvidia.com>
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      dc975207
    • Vladimir Oltean's avatar
      net: dcb: disable softirqs in dcbnl_flush_dev() · 10b6bb62
      Vladimir Oltean authored
      Ido Schimmel points out that since commit 52cff74e ("dcbnl : Disable
      software interrupts before taking dcb_lock"), the DCB API can be called
      by drivers from softirq context.
      
      One such in-tree example is the chelsio cxgb4 driver:
      dcb_rpl
      -> cxgb4_dcb_handle_fw_update
         -> dcb_ieee_setapp
      
      If the firmware for this driver happened to send an event which resulted
      in a call to dcb_ieee_setapp() at the exact same time as another
      DCB-enabled interface was unregistering on the same CPU, the softirq
      would deadlock, because the interrupted process was already holding the
      dcb_lock in dcbnl_flush_dev().
      
      Fix this unlikely event by using spin_lock_bh() in dcbnl_flush_dev() as
      in the rest of the dcbnl code.
      
      Fixes: 91b0383f ("net: dcb: flush lingering app table entries for unregistered devices")
      Reported-by: default avatarIdo Schimmel <idosch@idosch.org>
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220302193939.1368823-1-vladimir.oltean@nxp.comSigned-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      10b6bb62
    • Christophe JAILLET's avatar
      bnx2: Fix an error message · 8ccffe9a
      Christophe JAILLET authored
      Fix an error message and report the correct failing function.
      Signed-off-by: default avatarChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8ccffe9a
    • David S. Miller's avatar
      Merge branch 'ptp-ocp-next' · 25bf4df4
      David S. Miller authored
      Jonathan Lemon says:
      
      ====================
      ptp: ocp: TOD and monitoring updates
      
      Add a series of patches for monitoring the status of the
      driver and adjusting TOD handling, especially around leap seconds.
      
      Add documentation for the new sysfs nodes.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      25bf4df4
    • Jonathan Lemon's avatar
      docs: ABI: Document new timecard sysfs nodes. · 4db07317
      Jonathan Lemon authored
      Add documentation for the tod_correction, clock_status_drift,
      and clock_status_offset nodes.
      Signed-off-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4db07317
    • Vadim Fedorenko's avatar
      ptp: ocp: adjust utc_tai_offset to TOD info · e68462a0
      Vadim Fedorenko authored
      utc_tai_offset is used to correct IRIG, DCF and NMEA outputs and is
      set during initialisation but is not corrected during leap second
      announce event.  Add watchdog code to control this correction.
      Signed-off-by: default avatarVadim Fedorenko <vadfed@fb.com>
      Signed-off-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e68462a0
    • Vadim Fedorenko's avatar
      ptp: ocp: add tod_correction attribute · 44a412d1
      Vadim Fedorenko authored
      TOD correction register is used to compensate for leap seconds in
      different domains.  Export it as an attribute with write access.
      Signed-off-by: default avatarVadim Fedorenko <vadfed@fb.com>
      Signed-off-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      44a412d1
    • Vadim Fedorenko's avatar
      ptp: ocp: Expose clock status drift and offset · 2f23f486
      Vadim Fedorenko authored
      Monitoring of clock variance could be done through checking
      the offset and the drift updates that are applied to atomic
      clocks.  Expose these values as attributes for the timecard.
      Signed-off-by: default avatarVadim Fedorenko <vadfed@fb.com>
      Signed-off-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2f23f486
    • Vadim Fedorenko's avatar
      ptp: ocp: add TOD debug information · 9f492c4c
      Vadim Fedorenko authored
      TOD information is currently displayed only on module load,
      which doesn't provide updated information as the system runs.
      
      Create a debug file which provides the current TOD status information,
      and move the information display there.
      Signed-off-by: default avatarVadim Fedorenko <vadfed@fb.com>
      Signed-off-by: default avatarJonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9f492c4c
    • David S. Miller's avatar
      Merge branch 'skb-mono-delivery-time' · 01e2d157
      David S. Miller authored
      Martin KaFai Lau says:
      
      ====================
      Preserve mono delivery time (EDT) in skb->tstamp
      
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to make decision on when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
      or the delivery_time, so it is always reset to 0 whenever forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egress to a fq@phy-dev.
      
      This set is to keep the mono delivery time and make it available to
      the final egress interface.  Please see individual patch for
      the details.
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
      
      v6:
      - Add kdoc and use non-UAPI type in patch 6 (Jakub)
      
      v5:
      netdev:
      - Patch 3 in v4 is broken down into smaller patches 3, 4, and 5 in v5
      - The mono_delivery_time bit clearing in __skb_tstamp_tx() is
        done in __net_timestamp() instead.  This is patch 4 in v5.
      - Missed a skb_clear_delivery_time() for the 'skip_classify' case
        in dev.c in v4.  That is fixed in patch 5 in v5 for correctness.
        The skb_clear_delivery_time() will be moved to a later
        stage in Patch 10, so it was an intermediate error in v4.
      - Added delivery time handling for nfnetlink_{log, queue}.c in patch 9 (Daniel)
      - Added delivery time handling in the IPv6 IOAM hop-by-hop option which has
        an experimental IANA assigned value 49 in patch 8
      - Added delivery time handling in nf_conntrack for the ipv6 defrag case
        in patch 7
      - Removed unlikely() from testing skb->mono_delivery_time (Daniel)
      
      bpf:
      - Remove the skb->tstamp dance in ingress.  Depends on bpf insn
        rewrite to return 0 if skb->tstamp has delivery time in patch 11.
        It is to backward compatible with the existing tc-bpf@ingress in
        patch 11.
      - bpf_set_delivery_time() will also allow dtime == 0 and
        dtime_type == BPF_SKB_DELIVERY_TIME_NONE as argument
        in patch 12.
      
      v4:
      netdev:
      - Push the skb_clear_delivery_time() from
        ip_local_deliver() and ip6_input() to
        ip_local_deliver_finish() and ip6_input_finish()
        to accommodate the ipvs forward path.
        This is the notable change in v4 at the netdev side.
      
          - Patch 3/8 first does the skb_clear_delivery_time() after
            sch_handle_ingress() in dev.c and this will make the
            tc-bpf forward path work via the bpf_redirect_*() helper.
      
          - The next patch 4/8 (new in v4) will then postpone the
            skb_clear_delivery_time() from dev.c to
            the ip_local_deliver_finish() and ip6_input_finish() after
            taking care of the tstamp usage in the ip defrag case.
            This will make the kernel forward path also work, e.g.
            the ip[6]_forward().
      
      - Fixed a case v3 which missed setting the skb->mono_delivery_time bit
        when sending TCP rst/ack in some cases (e.g. from a ctl_sk).
        That case happens at ip_send_unicast_reply() and
        tcp_v6_send_response().  It is fixed in patch 1/8 (and
        then patch 3/8) in v4.
      
      bpf:
      - Adding __sk_buff->delivery_time_type instead of adding
        __sk_buff->mono_delivery_time as in v3.  The tc-bpf can stay with
        one __sk_buff->tstamp instead of having two 'time' fields
        while one is 0 and another is not.
        tc-bpf can use the new __sk_buff->delivery_time_type to tell
        what is stored in __sk_buff->tstamp.
      - bpf_skb_set_delivery_time() helper is added to set
        __sk_buff->tstamp from non mono delivery_time to
        mono delivery_time
      - Most of the convert_ctx_access() bpf insn rewrite in v3
        is gone, so no new rewrite added for __sk_buff->tstamp.
        The only rewrite added is for reading the new
        __sk_buff->delivery_time_type.
      - Added selftests, test_tc_dtime.c
      
      v3:
      - Feedback from v2 is using shinfo(skb)->tx_flags could be racy.
      - Considered to reuse a few bits in skb->tstamp to represent
        different semantics, other than more code churns, it will break
        the bpf usecase which currently can write and then read back
        the skb->tstamp.
      - Went back to v1 idea on adding a bit to skb and address the
        feedbacks on v1:
      - Added one bit skb->mono_delivery_time to flag that
        the skb->tstamp has the mono delivery_time (EDT), instead
        of adding a bit to flag if the skb->tstamp has been forwarded or not.
      - Instead of resetting the delivery_time back to the (rcv) timestamp
        during recvmsg syscall which may be too late and not useful,
        the delivery_time reset in v3 happens earlier once the stack
        knows that the skb will be delivered locally.
      - Handled the tapping@ingress case by af_packet
      - No need to change the (rcv) timestamp to mono clock base as in v1.
        The added one bit to flag skb->mono_delivery_time is enough
        to keep the EDT delivery_time during forward.
      - Added logic to the bpf side to make the existing bpf
        running at ingress can still get the (rcv) timestamp
        when reading the __sk_buff->tstamp.  New __sk_buff->mono_delivery_time
        is also added.  Test is still needed to test this piece.
      ====================
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      01e2d157
    • Martin KaFai Lau's avatar
      bpf: selftests: test skb->tstamp in redirect_neigh · c803475f
      Martin KaFai Lau authored
      This patch adds tests on forwarding the delivery_time for
      the following cases
      - tcp/udp + ip4/ip6 + bpf_redirect_neigh
      - tcp/udp + ip4/ip6 + ip[6]_forward
      - bpf_skb_set_delivery_time
      - The old rcv timestamp expectation on tc-bpf@ingress
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c803475f
    • Martin KaFai Lau's avatar
      bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_skb_delivery_time() · 8d21ec0e
      Martin KaFai Lau authored
      * __sk_buff->delivery_time_type:
      This patch adds __sk_buff->delivery_time_type.  It tells if the
      delivery_time is stored in __sk_buff->tstamp or not.
      
      It will be most useful for ingress to tell if the __sk_buff->tstamp
      has the (rcv) timestamp or delivery_time.  If delivery_time_type
      is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp.
      
      Two non-zero types are defined for the delivery_time_type,
      BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC.  For UNSPEC,
      it can only happen in egress because only mono delivery_time can be
      forwarded to ingress now.  The clock of UNSPEC delivery_time
      can be deduced from the skb->sk->sk_clockid which is how
      the sch_etf doing it also.
      
      * Provide forwarded delivery_time to tc-bpf@ingress:
      With the help of the new delivery_time_type, the tc-bpf has a way
      to tell if the __sk_buff->tstamp has the (rcv) timestamp or
      the delivery_time.  During bpf load time, the verifier will learn if
      the bpf prog has accessed the new __sk_buff->delivery_time_type.
      If it does, it means the tc-bpf@ingress is expecting the
      skb->tstamp could have the delivery_time.  The kernel will then
      read the skb->tstamp as-is during bpf insn rewrite without
      checking the skb->mono_delivery_time.  This is done by adding a
      new prog->delivery_time_access bit.  The same goes for
      writing skb->tstamp.
      
      * bpf_skb_set_delivery_time():
      The bpf_skb_set_delivery_time() helper is added to allow setting both
      delivery_time and the delivery_time_type at the same time.  If the
      tc-bpf does not need to change the delivery_time_type, it can directly
      write to the __sk_buff->tstamp as the existing tc-bpf has already been
      doing.  It will be most useful at ingress to change the
      __sk_buff->tstamp from the (rcv) timestamp to
      a mono delivery_time and then bpf_redirect_*().
      
      bpf only has mono clock helper (bpf_ktime_get_ns), and
      the current known use case is the mono EDT for fq, and
      only mono delivery time can be kept during forward now,
      so bpf_skb_set_delivery_time() only supports setting
      BPF_SKB_DELIVERY_TIME_MONO.  It can be extended later when use cases
      come up and the forwarding path also supports other clock bases.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8d21ec0e
    • Martin KaFai Lau's avatar
      bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress · 7449197d
      Martin KaFai Lau authored
      The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
      as a (rcv) timestamp which currently could either be 0 (not available)
      or ktime_get_real().  This patch is to backward compatible with the
      (rcv) timestamp expectation at ingress.  If the skb->tstamp has
      the delivery_time, the bpf insn rewrite will read 0 for tc-bpf
      running at ingress as it is not available.  When writing at ingress,
      it will also clear the skb->mono_delivery_time bit.
      
      /* BPF_READ: a = __sk_buff->tstamp */
      if (!skb->tc_at_ingress || !skb->mono_delivery_time)
      	a = skb->tstamp;
      else
      	a = 0
      
      /* BPF_WRITE: __sk_buff->tstamp = a */
      if (skb->tc_at_ingress)
      	skb->mono_delivery_time = 0;
      skb->tstamp = a;
      
      [ A note on the BPF_CGROUP_INET_INGRESS which can also access
        skb->tstamp.  At that point, the skb is delivered locally
        and skb_clear_delivery_time() has already been done,
        so the skb->tstamp will only have the (rcv) timestamp. ]
      
      If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
      has to be cleared also.  It could be done together during
      convert_ctx_access().  However, the latter patch will also expose
      the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
      Changing the delivery_time_type in the background may surprise
      the user, e.g. the 2nd read on __sk_buff->delivery_time_type
      may need a READ_ONCE() to avoid compiler optimization.  Thus,
      in expecting the needs in the latter patch, this patch does a
      check on !skb->tstamp after running the tc-bpf and clears the
      skb->mono_delivery_time bit if needed.  The earlier discussion
      on v4 [0].
      
      The bpf insn rewrite requires the skb's mono_delivery_time bit and
      tc_at_ingress bit.  They are moved up in sk_buff so that bpf rewrite
      can be done at a fixed offset.  tc_skip_classify is moved together with
      tc_at_ingress.  To get one bit for mono_delivery_time, csum_not_inet is
      moved down and this bit is currently used by sctp.
      
      [0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      7449197d
    • Martin KaFai Lau's avatar
      net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally · cd14e9b7
      Martin KaFai Lau authored
      The previous patches handled the delivery_time in the ingress path
      before the routing decision is made.  This patch can postpone clearing
      delivery_time in a skb until knowing it is delivered locally and also
      set the (rcv) timestamp if needed.  This patch moves the
      skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
      and ip6_input_finish().
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      cd14e9b7
    • Martin KaFai Lau's avatar
      net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c · 80fcec67
      Martin KaFai Lau authored
      If skb has the (rcv) timestamp available, nfnetlink_{log, queue}.c
      logs/outputs it to the userspace.  When the locally generated skb is
      looping from egress to ingress over a virtual interface (e.g. veth,
      loopback...),  skb->tstamp may have the delivery time before it is
      known that will be delivered locally and received by another sk.  Like
      handling the delivery time in network tapping,  use ktime_get_real() to
      get the (rcv) timestamp.  The earlier added helper skb_tstamp_cond() is
      used to do this.  false is passed to the second 'cond' arg such
      that doing ktime_get_real() or not only depends on the
      netstamp_needed_key static key.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      80fcec67
    • Martin KaFai Lau's avatar
      net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option · b6561f84
      Martin KaFai Lau authored
      IOAM is a hop-by-hop option with a temporary iana allocation (49).
      Since it is hop-by-hop, it is done before the input routing decision.
      One of the traced data field is the (rcv) timestamp.
      
      When the locally generated skb is looping from egress to ingress over
      a virtual interface (e.g. veth, loopback...), skb->tstamp may have the
      delivery time before it is known that it will be delivered locally
      and received by another sk.
      
      Like handling the network tapping (tcpdump) in the earlier patch,
      this patch gets the timestamp if needed without over-writing the
      delivery_time in the skb->tstamp.  skb_tstamp_cond() is added to do the
      ktime_get_real() with an extra cond arg to check on top of the
      netstamp_needed_key static key.  skb_tstamp_cond() will also be used in
      a latter patch and it needs the netstamp_needed_key check.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b6561f84
    • Martin KaFai Lau's avatar
      net: ipv6: Handle delivery_time in ipv6 defrag · 335c8cf3
      Martin KaFai Lau authored
      A latter patch will postpone the delivery_time clearing until the stack
      knows the skb is being delivered locally (i.e. calling
      skb_clear_delivery_time() at ip_local_deliver_finish() for IPv4
      and at ip6_input_finish() for IPv6).  That will allow other kernel
      forwarding path (e.g. ip[6]_forward) to keep the delivery_time also.
      
      A very similar IPv6 defrag codes have been duplicated in
      multiple places: regular IPv6, nf_conntrack, and 6lowpan.
      
      Unlike the IPv4 defrag which is done before ip_local_deliver_finish(),
      the regular IPv6 defrag is done after ip6_input_finish().
      Thus, no change should be needed in the regular IPv6 defrag
      logic because skb_clear_delivery_time() should have been called.
      
      6lowpan also does not need special handling on delivery_time
      because it is a non-inet packet_type.
      
      However, cf_conntrack has a case in NF_INET_PRE_ROUTING that needs
      to do the IPv6 defrag earlier.  Thus, it needs to save the
      mono_delivery_time bit in the inet_frag_queue which is similar
      to how it is handled in the previous patch for the IPv4 defrag.
      
      This patch chooses to do it consistently and stores the mono_delivery_time
      in the inet_frag_queue for all cases such that it will be easier
      for the future refactoring effort on the IPv6 reasm code.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      335c8cf3
    • Martin KaFai Lau's avatar
      net: ip: Handle delivery_time in ip defrag · 8672406e
      Martin KaFai Lau authored
      A latter patch will postpone the delivery_time clearing until the stack
      knows the skb is being delivered locally.  That will allow other kernel
      forwarding path (e.g. ip[6]_forward) to keep the delivery_time also.
      
      An earlier attempt was to do skb_clear_delivery_time() in
      ip_local_deliver() and ip6_input().  The discussion [0] requested
      to move it one step later into ip_local_deliver_finish()
      and ip6_input_finish() so that the delivery_time can be kept
      for the ip_vs forwarding path also.
      
      To do that, this patch also needs to take care of the (rcv) timestamp
      usecase in ip_is_fragment().  It needs to expect delivery_time in
      the skb->tstamp, so it needs to save the mono_delivery_time bit in
      inet_frag_queue such that the delivery_time (if any) can be restored
      in the final defragmented skb.
      
      [Note that it will only happen when the locally generated skb is looping
       from egress to ingress over a virtual interface (e.g. veth, loopback...),
       skb->tstamp may have the delivery time before it is known that it will
       be delivered locally and received by another sk.]
      
      [0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8672406e
    • Martin KaFai Lau's avatar
      net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() · d98d58a0
      Martin KaFai Lau authored
      The previous patches handled the delivery_time before sch_handle_ingress().
      
      This patch can now set the skb->mono_delivery_time to flag the skb->tstamp
      is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
      and also clear it with skb_clear_delivery_time() after
      sch_handle_ingress().  This will make the bpf_redirect_*()
      to keep the mono delivery_time and used by a qdisc (fq) of
      the egress-ing interface.
      
      A latter patch will postpone the skb_clear_delivery_time() until the
      stack learns that the skb is being delivered locally and that will
      make other kernel forwarding paths (ip[6]_forward) able to keep
      the delivery_time also.  Thus, like the previous patches on using
      the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
      is not limited within the CONFIG_NET_INGRESS to avoid too many code
      churns among this set.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d98d58a0
    • Martin KaFai Lau's avatar
      net: Clear mono_delivery_time bit in __skb_tstamp_tx() · d93376f5
      Martin KaFai Lau authored
      In __skb_tstamp_tx(), it may clone the egress skb and queues the clone to
      the sk_error_queue.  The outgoing skb may have the mono delivery_time
      while the (rcv) timestamp is expected for the clone, so the
      skb->mono_delivery_time bit needs to be cleared from the clone.
      
      This patch adds the skb->mono_delivery_time clearing to the existing
      __net_timestamp() and use it in __skb_tstamp_tx().
      The __net_timestamp() fast path usage in dev.c is changed to directly
      call ktime_get_real() since the mono_delivery_time bit is not set at
      that point.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d93376f5
    • Martin KaFai Lau's avatar
      net: Handle delivery_time in skb->tstamp during network tapping with af_packet · 27942a15
      Martin KaFai Lau authored
      A latter patch will set the skb->mono_delivery_time to flag the skb->tstamp
      is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
      skb_clear_tstamp() will then keep this delivery_time during forwarding.
      
      This patch is to make the network tapping (with af_packet) to handle
      the delivery_time stored in skb->tstamp.
      
      Regardless of tapping at the ingress or egress,  the tapped skb is
      received by the af_packet socket, so it is ingress to the af_packet
      socket and it expects the (rcv) timestamp.
      
      When tapping at egress, dev_queue_xmit_nit() is used.  It has already
      expected skb->tstamp may have delivery_time,  so it does
      skb_clone()+net_timestamp_set() to ensure the cloned skb has
      the (rcv) timestamp before passing to the af_packet sk.
      This patch only adds to clear the skb->mono_delivery_time
      bit in net_timestamp_set().
      
      When tapping at ingress, it currently expects the skb->tstamp is either 0
      or the (rcv) timestamp.  Meaning, the tapping at ingress path
      has already expected the skb->tstamp could be 0 and it will get
      the (rcv) timestamp by ktime_get_real() when needed.
      
      There are two cases for tapping at ingress:
      
      One case is af_packet queues the skb to its sk_receive_queue.
      The skb is either not shared or new clone created.  The newly
      added skb_clear_delivery_time() is called to clear the
      delivery_time (if any) and set the (rcv) timestamp if
      needed before the skb is queued to the sk_receive_queue.
      
      Another case, the ingress skb is directly copied to the rx_ring
      and tpacket_get_timestamp() is used to get the (rcv) timestamp.
      The newly added skb_tstamp() is used in tpacket_get_timestamp()
      to check the skb->mono_delivery_time bit before returning skb->tstamp.
      As mentioned earlier, the tapping@ingress has already expected
      the skb may not have the (rcv) timestamp (because no sk has asked
      for it) and has handled this case by directly calling ktime_get_real().
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      27942a15
    • Martin KaFai Lau's avatar
      net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101
      Martin KaFai Lau authored
      Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
      
      If skb->tstamp has the mono delivery_time, clearing it can hurt
      the performance when it finally transmits out to fq@phy-dev.
      
      The earlier patch added a skb->mono_delivery_time bit to
      flag the skb->tstamp carrying the mono delivery_time.
      
      This patch adds skb_clear_tstamp() helper which keeps
      the mono delivery_time and clears everything else.
      
      The delivery_time clearing will be postponed until the stack knows the
      skb will be delivered locally.  It will be done in a latter patch.
      Signed-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      de799101
    • Martin KaFai Lau's avatar
      net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau authored
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to make decision on when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
      or the delivery_time, so it is always reset to 0 whenever forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egress to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                           ^              ^
                                           reset          rest
      
      This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) and
      to be used with sch_fq.  A latter patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is a prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      skb_set_delivery_time() function is added.  It is used by the tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a place
      holder.  It will be enabled in a latter patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdfSigned-off-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a1ac9c8a