1. 26 Sep, 2024 12 commits
    • netfilter: nfnetlink_queue: remove old clash resolution logic · 8af79d3e
      Florian Westphal authored
      For historical reasons there are two clash resolution spots in
      netfilter, one in nfnetlink_queue and one in conntrack core.
      
      nfnetlink_queue one was added first: If a colliding entry is found, the
      NAT transformation is reversed by calling the NAT engine again with the
      altered tuple.
      
      See commit 368982cd ("netfilter: nfnetlink_queue: resolve clash for
      unconfirmed conntracks") for details.
      
      One problem is that nf_reroute() won't take an action if the queueing
      doesn't occur in the OUTPUT hook, i.e. when queueing in forward or
      postrouting, the packet will be sent via the wrong path.
      
      Another problem is that the scenario addressed (2nd UDP packet sent with
      identical addresses while first packet is still being processed) can also
      occur without any nfqueue involvement due to threaded resolvers doing
      A and AAAA requests back-to-back.
      
      This led us to add clash resolution logic to the conntrack core, see
      commit 6a757c07 ("netfilter: conntrack: allow insertion of clashing
      entries").  Instead of fixing the nfqueue based logic, let's remove it
      and let conntrack core handle this instead.
      
      Retain the ->update hook for sake of nfqueue based conntrack helpers.
      We could axe this hook completely but we'd have to split confirm and
      helper logic again, see commit ee04805f ("netfilter: conntrack: make
      conntrack userspace helpers work again").
      
      This SHOULD NOT be backported to kernels earlier than v5.6; they lack
      adequate clash resolution handling.
      
      Patch was originally written by Pablo Neira Ayuso.
      Reported-by: Antonio Ojea <aojea@google.com>
      Closes: https://bugzilla.netfilter.org/show_bug.cgi?id=1766
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Tested-by: Antonio Ojea <aojea@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: missing objects with no memcg accounting · 69e687ce
      Pablo Neira Ayuso authored
      Several ruleset objects are still not using GFP_KERNEL_ACCOUNT for
      memory accounting, update them. This includes:
      
      - catchall elements
      - compat match large info area
      - log prefix
      - meta secctx
      - numgen counters
      - pipapo set backend datastructure
      - tunnel private objects
      
      Fixes: 33758c89 ("memcg: enable accounting for nft objects")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: use rcu chain hook list iterator from netlink dump path · 4ffcf5ca
      Pablo Neira Ayuso authored
      Lockless iteration over hook list is possible from netlink dump path,
      use rcu variant to iterate over the hook list as is done with flowtable
      hooks.
      
      Fixes: b9703ed4 ("netfilter: nf_tables: support for adding new devices to an existing netdev chain")
      Reported-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ctnetlink: compile ctnetlink_label_size with CONFIG_NF_CONNTRACK_EVENTS · e1f1ee0e
      Simon Horman authored
      Only provide ctnetlink_label_size when it is used,
      which is when CONFIG_NF_CONNTRACK_EVENTS is configured.
      
      Flagged by clang-18 W=1 builds as:
      
      .../nf_conntrack_netlink.c:385:19: warning: unused function 'ctnetlink_label_size' [-Wunused-function]
        385 | static inline int ctnetlink_label_size(const struct nf_conn *ct)
            |                   ^~~~~~~~~~~~~~~~~~~~
      
      The condition on CONFIG_NF_CONNTRACK_LABELS being removed by
      this patch guards compilation of non-trivial implementations
      of ctnetlink_dump_labels() and ctnetlink_label_size().
      
      However, this is not necessary as each of these functions
      will always return 0 if CONFIG_NF_CONNTRACK_LABELS is not defined
      as each function starts with the equivalent of:
      
      	struct nf_conn_labels *labels = nf_ct_labels_find(ct);
      
      	if (!labels)
      		return 0;
      
      And nf_ct_labels_find always returns NULL if CONFIG_NF_CONNTRACK_LABELS
      is not enabled.  So I believe that the compiler optimises the code away
      in such cases anyway.
      
      Found by inspection.
      Compile tested only.
      
      Originally split into two patches; Pablo Neira Ayuso collapsed them and
      added the Fixes: tag.
      
      Fixes: 0ceabd83 ("netfilter: ctnetlink: deliver labels to userspace")
      Link: https://lore.kernel.org/netfilter-devel/20240909151712.GZ2097826@kernel.org/
      Signed-off-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
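The argument in the commit above (a lookup helper that always returns NULL when the feature is compiled out, so callers fall through to `return 0`) can be sketched in user space. This is only an illustration under stated assumptions: `DEMO_CONFIG_LABELS`, `struct demo_conn`, `demo_labels_find()` and `demo_label_size()` are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real layouts. */
struct demo_labels { unsigned char bits[16]; };
struct demo_conn { struct demo_labels labels; };

/* DEMO_CONFIG_LABELS plays the role of CONFIG_NF_CONNTRACK_LABELS. */
#ifdef DEMO_CONFIG_LABELS
static inline const struct demo_labels *demo_labels_find(const struct demo_conn *ct)
{
	return &ct->labels;
}
#else
static inline const struct demo_labels *demo_labels_find(const struct demo_conn *ct)
{
	(void)ct;
	return NULL;	/* constant NULL: the caller's label code becomes dead */
}
#endif

/* No #ifdef is needed here; the early return covers the disabled case. */
static inline size_t demo_label_size(const struct demo_conn *ct)
{
	const struct demo_labels *labels;

	assert(ct != NULL);
	labels = demo_labels_find(ct);
	if (!labels)
		return 0;
	return sizeof(labels->bits);
}
```

Built without `-DDEMO_CONFIG_LABELS`, `demo_label_size()` returns 0 for any connection, which mirrors why the commit considers the extra condition on the labels option unnecessary.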
    • netfilter: nf_reject: Fix build warning when CONFIG_BRIDGE_NETFILTER=n · fc56878c
      Simon Horman authored
      If CONFIG_BRIDGE_NETFILTER is not enabled, which is the case for x86_64
      defconfig, then building nf_reject_ipv4.c and nf_reject_ipv6.c with W=1
      using gcc-14 results in the following warnings, which are treated as
      errors:
      
      net/ipv4/netfilter/nf_reject_ipv4.c: In function 'nf_send_reset':
      net/ipv4/netfilter/nf_reject_ipv4.c:243:23: error: variable 'niph' set but not used [-Werror=unused-but-set-variable]
        243 |         struct iphdr *niph;
            |                       ^~~~
      cc1: all warnings being treated as errors
      net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_reset6':
      net/ipv6/netfilter/nf_reject_ipv6.c:286:25: error: variable 'ip6h' set but not used [-Werror=unused-but-set-variable]
        286 |         struct ipv6hdr *ip6h;
            |                         ^~~~
      cc1: all warnings being treated as errors
      
      Address this by reducing the scope of these local variables to where
      they are used, which is code only compiled when CONFIG_BRIDGE_NETFILTER
      is enabled.
      
      Compile tested and run through netfilter selftests.
      Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Closes: https://lore.kernel.org/netfilter-devel/20240906145513.567781-1-andriy.shevchenko@linux.intel.com/
      Signed-off-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
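The fix described in the commit above, narrowing a variable's scope to the conditionally compiled code that uses it, follows a general pattern that can be sketched outside the kernel. The names here (`DEMO_BRIDGE`, `struct demo_hdr`, `demo_send_reset()`) are assumptions made for illustration and do not reproduce the actual nf_reject code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct iphdr. */
struct demo_hdr { unsigned char ttl; };

/* DEMO_BRIDGE plays the role of CONFIG_BRIDGE_NETFILTER (assumed name). */
static int demo_send_reset(struct demo_hdr *hdr)
{
	int extra_work = 0;

	assert(hdr != NULL);
	hdr->ttl = 64;

#ifdef DEMO_BRIDGE
	{
		/*
		 * Declared only where it is consumed, so a build without
		 * DEMO_BRIDGE never sees a set-but-unused variable.
		 */
		struct demo_hdr *niph = hdr;

		extra_work = niph->ttl;
	}
#endif
	return extra_work;
}
```

Had `niph` been declared and assigned at function scope, a build without `DEMO_BRIDGE` would leave it set but never read, which is the shape of the gcc-14 `-Wunused-but-set-variable` diagnostic quoted in the commit message.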
    • netfilter: nf_tables: Keep deleted flowtable hooks until after RCU · 642c89c4
      Phil Sutter authored
      Documentation of list_del_rcu() warns callers to not immediately free
      the deleted list item. While it seems not necessary to use the
      RCU-variant of list_del() here in the first place, doing so seems to
      require calling kfree_rcu() on the deleted item as well.
      
      Fixes: 3f0465a9 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables")
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • docs: tproxy: ignore non-transparent sockets in iptables · aa758763
      谢致邦 (XIE Zhibang) authored
      The iptables example was added in commit d2f26037 (netfilter: Add
      documentation for tproxy, 2008-10-08), but xt_socket 'transparent'
      option was added in commit a31e1ffd (netfilter: xt_socket: added new
      revision of the 'socket' match supporting flags, 2009-06-09).
      
      Now add the 'transparent' option to the iptables example to ignore
      non-transparent sockets, which is also consistent with the nft example.
      Signed-off-by: 谢致邦 (XIE Zhibang) <Yeking@Red54.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ctnetlink: Guard possible unused functions · 2cadd3b1
      Andy Shevchenko authored
      Some of the functions may be unused (CONFIG_NETFILTER_NETLINK_GLUE_CT=n
      and CONFIG_NF_CONNTRACK_EVENTS=n), which breaks kernel builds with clang,
      `make W=1` and CONFIG_WERROR=y:
      
      net/netfilter/nf_conntrack_netlink.c:657:22: error: unused function 'ctnetlink_acct_size' [-Werror,-Wunused-function]
        657 | static inline size_t ctnetlink_acct_size(const struct nf_conn *ct)
            |                      ^~~~~~~~~~~~~~~~~~~
      net/netfilter/nf_conntrack_netlink.c:667:19: error: unused function 'ctnetlink_secctx_size' [-Werror,-Wunused-function]
        667 | static inline int ctnetlink_secctx_size(const struct nf_conn *ct)
            |                   ^~~~~~~~~~~~~~~~~~~~~
      net/netfilter/nf_conntrack_netlink.c:683:22: error: unused function 'ctnetlink_timestamp_size' [-Werror,-Wunused-function]
        683 | static inline size_t ctnetlink_timestamp_size(const struct nf_conn *ct)
            |                      ^~~~~~~~~~~~~~~~~~~~~~~~
      
      Fix this by guarding possible unused functions with ifdeffery.
      
      See also commit 6863f564 ("kbuild: allow Clang to find unused static
      inline functions for W=1 build").
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Reviewed-by: Simon Horman <horms@kernel.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • selftests: netfilter: nft_tproxy.sh: add tcp tests · 7e37e0ea
      Antonio Ojea authored
      The TPROXY functionality is widely used, however, there are only mptcp
      selftests covering this feature.
      
      The selftests represent the most common scenarios and can also be used
      as self-documentation of the feature.

      UDP and TCP testcases are split into different files because of the
      different nature of the protocols, especially the difficulty of testing
      UDP reliably given its connectionless nature. UDP only covers the
      scenarios involving the prerouting hook.

      The UDP tests are significantly slower than the TCP ones, hence they
      use a larger timeout; it takes 20 seconds to run the full UDP suite
      on a 48 vCPU Intel(R) Xeon(R) CPU @2.60GHz.
      Signed-off-by: Antonio Ojea <aojea@google.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • selftests: netfilter: add reverse-clash resolution test case · a57856c0
      Florian Westphal authored
      Add a test program that sends UDP packets in both directions and
      checks that packets arrive without source port modification.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: conntrack: add clash resolution for reverse collisions · a4e6a103
      Florian Westphal authored
      Given existing entry:
      ORIGIN: a:b -> c:d
      REPLY:  c:d -> a:b
      
      And colliding entry:
      ORIGIN: c:d -> a:b
      REPLY:  a:b -> c:d
      
      The colliding ct (and the associated skb) get dropped on insert.
      Permit this by checking if the colliding entry matches the reply
      direction.
      
      This happens when both ends send packets at the same time: both requests
      are picked up as NEW, rather than NEW for the 'first' and 'ESTABLISHED'
      for the second packet.
      
      This is an esoteric condition, as ruleset must permit NEW connections
      in either direction and both peers must already have a bidirectional
      traffic flow at the time conntrack gets enabled.
      
      Allow the 'reverse' skb to pass and assign the existing (clashing)
      entry.
      
      While at it, also drop the extra 'dying' check, this is already
      tested earlier by the calling function.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
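The ORIGIN/REPLY example in the commit above amounts to a symmetry test between two entries. A minimal user-space sketch of that test follows; `struct demo_tuple`, `struct demo_ct` and `demo_is_reverse_clash()` are hypothetical simplifications (the kernel's tuple also carries protocol, zone and more), and the sketch covers only the detection step, not the assignment of the existing entry to the skb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified tuple; an assumption for illustration only. */
struct demo_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* One tracked flow: a tuple for each direction. */
struct demo_ct {
	struct demo_tuple orig;
	struct demo_tuple reply;
};

static bool demo_tuple_equal(const struct demo_tuple *a,
			     const struct demo_tuple *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/*
 * True when the new entry is the existing flow seen from the other end:
 * its ORIGIN tuple equals the existing REPLY tuple and vice versa.
 */
static bool demo_is_reverse_clash(const struct demo_ct *existing,
				  const struct demo_ct *fresh)
{
	assert(existing && fresh);
	return demo_tuple_equal(&fresh->orig, &existing->reply) &&
	       demo_tuple_equal(&fresh->reply, &existing->orig);
}
```

With an existing entry `a:b -> c:d` and a new entry `c:d -> a:b`, the predicate holds; for an unrelated flow, or for a same-direction duplicate, it does not.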
    • netfilter: nf_nat: don't try nat source port reallocation for reverse dir clash · d8f84a9b
      Florian Westphal authored
      A conntrack entry can be inserted to the connection tracking table if there
      is no existing entry with an identical tuple in either direction.
      
      Example:
      INITIATOR -> NAT/PAT -> RESPONDER
      
      Initiator passes through NAT/PAT ("us") and SNAT is done (saddr rewrite).
      Then, later, NAT/PAT machine itself also wants to connect to RESPONDER.
      
      This will not work if the SNAT done earlier has same IP:PORT source pair.
      
      Conntrack table has:
      ORIGINAL: $IP_INITIATOR:$SPORT -> $IP_RESPONDER:$DPORT
      REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
      
      and new locally originating connection wants:
      ORIGINAL: $IP_NAT:$SPORT -> $IP_RESPONDER:$DPORT
      REPLY:    $IP_RESPONDER:$DPORT -> $IP_NAT:$SPORT
      
      This is handled by the NAT engine which will do a source port reallocation
      for the locally originating connection that is colliding with an existing
      tuple by attempting a source port rewrite.
      
      This is done even if this new connection attempt did not go through a
      masquerade/snat rule.
      
      There is a rare race condition with connection-less protocols like UDP,
      where we do the port reallocation even though it's not needed.
      
      This happens when new packets from the same, pre-existing flow are received
      in both directions at the exact same time on different CPUs after the
      conntrack table was flushed (or conntrack becomes active for first time).
      
      With strict ordering/single cpu, the first packet creates new ct entry and
      second packet is resolved as established reply packet.
      
      With parallel processing, both packets are picked up as new and both get
      their own ct entry.
      
      In this case, the 'reply' packet (picked up as ORIGINAL) can be mangled by
      NAT engine because a port collision is detected.
      
      This change isn't enough to prevent a packet drop later during
      nf_conntrack_confirm(), the existing clash resolution strategy will not
      detect such reverse clash case.  This is resolved by a followup patch.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
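The ordering effect described in the commit above (strict ordering yields one NEW packet and one reply, while parallel processing yields two NEW packets) can be shown with a toy one-entry table. This is a sketch under loud assumptions: NAT is left out entirely, and `demo_lookup()`, `demo_confirm()` and the other `demo_` names are hypothetical, not conntrack functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_tuple { uint32_t saddr, daddr; uint16_t sport, dport; };

enum demo_state { DEMO_NEW, DEMO_ESTABLISHED_ORIG, DEMO_ESTABLISHED_REPLY };

/* A one-slot "table": enough to show the ordering effect. */
struct demo_table {
	bool used;
	struct demo_tuple orig;
};

static bool demo_same(const struct demo_tuple *a, const struct demo_tuple *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

static struct demo_tuple demo_invert(const struct demo_tuple *t)
{
	struct demo_tuple r = { t->daddr, t->saddr, t->dport, t->sport };

	return r;
}

/* Classify a packet against the confirmed entry, if there is one. */
static enum demo_state demo_lookup(const struct demo_table *tbl,
				   const struct demo_tuple *pkt)
{
	struct demo_tuple reply;

	assert(tbl && pkt);
	if (!tbl->used)
		return DEMO_NEW;
	if (demo_same(pkt, &tbl->orig))
		return DEMO_ESTABLISHED_ORIG;
	reply = demo_invert(&tbl->orig);
	if (demo_same(pkt, &reply))
		return DEMO_ESTABLISHED_REPLY;
	return DEMO_NEW;
}

/* Insert the entry created for a NEW packet. */
static void demo_confirm(struct demo_table *tbl, const struct demo_tuple *pkt)
{
	assert(tbl && pkt);
	tbl->used = true;
	tbl->orig = *pkt;
}
```

If both directions are looked up before either entry is confirmed, both come back as `DEMO_NEW`, which is the parallel case in the commit message. If the first packet is confirmed before the second is looked up, the second resolves as `DEMO_ESTABLISHED_REPLY`.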
  2. 16 Sep, 2024 1 commit
    • Merge tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next · 94106455
      Linus Torvalds authored
      Pull networking updates from Jakub Kicinski:
       "The zero-copy changes are relatively significant, but regression risk
        should be contained. The feature needs to be used to cause trouble.
      
        Also it feels like we got an order of magnitude more semi-automated
        "refactoring" chaff than usual, I wonder if it's just us.
      
        Core & protocols:
      
         - Support Device Memory TCP, ability to zero-copy receive TCP
           payloads to a DMABUF region of memory while packet headers land
           separately in normal kernel buffers, and TCP processes them as
           usual.
      
         - The ability to read the PTP PHC (Physical Hardware Clock) alongside
           MONOTONIC_RAW timestamps with PTP_SYS_OFFSET_EXTENDED. Previously
           only CLOCK_REALTIME was supported.
      
         - Allow matching on all bits of IP DSCP for routing decisions.
           Previously we only supported matching on TOS bits in IPv4, which is
           a narrower interpretation of the same header field.
      
         - Increase the range of weights used for multi-path routing from
           8 bits to 16 bits.
      
         - Add support for IPv6 PIO p flag in the Prefix Information Option
           per draft-ietf-6man-pio-pflag.
      
         - IPv6 IOAM6 support for new tunsrc encap mode for better
           performance.
      
         - Detect destinations which blackhole MPTCP traffic and avoid
           initiating MPTCP connections to them for a certain period of time,
           1h by default.
      
         - Improve IPsec control path performance by removing the inexact
           policies list.
      
         - AF_VSOCK: add support for SIOCOUTQ ioctl.
      
         - Add enum for reasons TCP reset was sent for easier tracing.
      
         - Add SMC ringbufs usage statistics.
      
        Drivers:
      
         - Handle netconsole setup failures more gracefully, don't fail
           loading, retain the specified target as disabled.
      
         - Extend bonding's IPsec offload pass thru capabilities (ESN, stats).
      
        Filtering:
      
         - Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
           when long-lived sockets miss a chance to set additional callbacks
           if a sockops program was not attached early in their lifetime.
      
         - Support using BPF skb helpers in tracepoints.
      
         - Conntrack Netlink: support CTA_FILTER for flush.
      
         - Improve SCTP support in nfnetlink_queue.
      
         - Improve performance of large nftables flush transactions.
      
        Things we sprinkled into general kernel code:
      
         - selftests: support setting an "interpreter" for script files; make
           it easy to run as separate cases tests where one "interpreter" is
           fed various test descriptions (in our case packet sequences).
      
        Driver API:
      
         - Extend core and ethtool APIs to support many PHYs connected to a
           single interface (PHY topologies).
      
         - Extend cable diagnostics to specify whether Time Domain
           Reflectometry (TDR) or Active Link Cable Diagnostic (ALCD) was
           used.
      
         - Add library for implementing MAC-PHY Ethernet drivers for SPI
           devices compatible with Open Alliance 10BASE-T1x MAC-PHY Serial
           Interface (TC6) standard.
      
         - Add helpers to the PHY framework, for PHYs following the Open
           Alliance standards:
             - 1000BaseT1 link settings
             - cable test and diagnostics
      
         - Support listing / dumping all allocated RSS contexts.
      
         - Add configuration for frequency Embedded SYNC in DPLL, which
           magically embeds sync pulses into Ethernet signaling.
      
        Device drivers:
      
         - Ethernet high-speed NICs:
            - Broadcom (bnxt):
               - use better FW APIs for queue reset
               - support QOS and TPID settings for the SR-IOV VLAN
               - support dynamic MSI-X allocation
            - Intel (100G, ice, idpf):
               - ice: support PCIe subfunctions
               - iavf: add support for TC U32 filters on VFs
               - ice: support Embedded SYNC in DPLL
            - nVidia/Mellanox (mlx5):
               - support HW managed steering tables
               - support PCIe PTM cross timestamping
            - AMD/Pensando:
               - ionic: use page_pool to increase Rx performance
            - Cisco (enic):
               - report per-queue statistics
      
         - Ethernet virtual:
            - Microsoft vNIC:
               - mana: support configuring ring length
               - netvsc: enable more channels on systems with many CPUs
            - IBM veth:
               - optimize polling to improve TCP_RR performance
               - optimize performance of Tx handling
            - VirtIO net:
               - synchronize the operstate with the admin state to allow a
                 lower virtio-net to propagate the link status to an upper
                 device like macvlan
      
         - Ethernet NICs consumer, and embedded:
            - Add driver for Realtek automotive PCIe devices (RTL9054,
              RTL9068, RTL9072, RTL9075, RTL9068, RTL9071)
            - Add driver for Microchip LAN8650/1 10BASE-T1S MAC-PHY.
            - Microchip:
               - lan743x: use phylink - support WOL, EEE, pause, link settings
               - add Wake-on-LAN support for KSZ87xx family
               - add KSZ8895/KSZ8864 switch support
               - factor out FDMA code and use it in sparx5 and lan966x
                 (including DCB support in both)
            - Synopsys (stmmac):
               - support frame preemption (configured using TC and ethtool)
               - support Loongson DWMAC (GMAC v3.73)
               - support RockChips RK3576 DWMAC
            - TI:
               - am65-cpsw: add multi queue RX support
               - icssg-prueth: HSR offload support
            - Cadence (macb):
               - enable software (hrtimer based) IRQ coalescing by default
            - Xilinx (axinet):
               - expose HW statistics
               - improve multicast filtering
               - relax Rx checksum offload constraints
            - MediaTek:
               - mt7530: add EN7581 support
            - Aspeed (ftgmac100):
               - report link speed and duplex
            - Intel:
               - igc: add mqprio offload
               - igc: report EEE configuration
            - RealTek (r8169):
               - add support for RTL8126A rev.b
            - Vitesse (vsc73xx):
               - implement FDB add/del/dump operations
            - Freescale (fs_enet):
               - use phylink
      
         - Ethernet PHYs:
            - vitesse: implement downshift and MDI-X in vsc73xx PHYs
            - microchip: support LAN887x, supporting IEEE 802.3bw (100BASE-T1)
              and IEEE 802.3bp (1000BASE-T1) specifications
            - add Applied Micro QT2025 PHY driver (in Rust)
            - add Motorcomm yt8821 2.5G Ethernet PHY driver
      
         - CAN:
            - add driver for Rockchip RK3568 CAN-FD controller
            - flexcan: add wakeup support for imx95
            - kvaser_usb: set hardware timestamp on transmitted packets
      
         - WiFi:
            - mac80211/cfg80211:
               - EHT rate support in AQL airtime fairness
               - handle DFS (radar detection) per link in Multi-Link Operation
            - RealTek (rtw89):
               - support RTL8852BT and 8852BE-VT (WiFi 6)
               - support hardware rfkill
               - support HW encryption in unicast management frames
               - support Wake-on-WLAN with supported network detection
            - RealTek (rtw88):
               - improve Rx performance by using USB frame aggregation
               - support USB 3 with RTL8822CU/RTL8822BU
            - Intel (iwlwifi/mvm):
               - offload RLC/SMPS functionality to firmware
            - Marvell (mwifiex):
               - add host based MLME to enable WPA3
      
         - Bluetooth:
            - add support for Amlogic HCI UART protocol
            - add support for ISO data/packets to Intel and NXP drivers"
      
      * tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1303 commits)
        net/mlx5: HWS, check the correct variable in hws_send_ring_alloc_sq()
        netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level()
        ice: Fix a NULL vs IS_ERR() check in probe()
        ice: Fix a couple NULL vs IS_ERR() bugs
        net: ethernet: fs_enet: Make the per clock optional
        net: ti: icssg-prueth: Add multicast filtering support in HSR mode
        net: ti: icssg-prueth: Enable HSR Tx duplication, Tx Tag and Rx Tag offload
        net: ti: icssg-prueth: Add support for HSR frame forward offload
        net: ti: icssg-prueth: Stop hardcoding def_inc
        net: ti: icss-iep: Move icss_iep structure
        net: ibm: emac: get rid of wol_irq
        net: ibm: emac: remove all waiting code
        net: ibm: emac: replace of_get_property
        net: ibm: emac: use netdev's phydev directly
        net: ibm: emac: use devm for register_netdev
        net: ibm: emac: remove mii_bus with devm
        net: ibm: emac: use devm for of_iomap
        net: ibm: emac: manage emac_irq with devm
        net: ibm: emac: use devm for alloc_etherdev
        octeontx2-af: debugfs: Add Channel info to RPM map
        ...
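The pull message's DSCP item says that matching on TOS bits is a narrower interpretation of the same header field. A small sketch of the bit arithmetic may make that concrete. The DSCP mask follows RFC 2474 (upper six bits of the byte); the legacy mask value 0x1C is an assumption intended to mirror the kernel's `IPTOS_RT_MASK`, and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_DSCP_MASK   0xFCu	/* upper six bits of the TOS byte (RFC 2474) */
#define DEMO_LEGACY_MASK 0x1Cu	/* assumed to match the kernel's IPTOS_RT_MASK */

/* Place a 6-bit DSCP code point into the 8-bit TOS / DS field. */
static uint8_t demo_tos_from_dscp(unsigned int dscp)
{
	assert(dscp < 64);
	return (uint8_t)(dscp << 2);
}

/* Key used when routing decisions match on all DSCP bits. */
static uint8_t demo_dscp_key(uint8_t tos)
{
	return (uint8_t)(tos & DEMO_DSCP_MASK);
}

/* Key used when only the legacy TOS bits are considered. */
static uint8_t demo_legacy_key(uint8_t tos)
{
	return (uint8_t)(tos & DEMO_LEGACY_MASK);
}
```

Under these masks, DSCP 46 (EF, TOS byte 0xB8) and DSCP 6 (TOS byte 0x18) share the legacy key 0x18 but have distinct DSCP keys, so only a full DSCP match can tell them apart.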
  3. 15 Sep, 2024 9 commits
  4. 14 Sep, 2024 18 commits