1. 26 Jan, 2023 15 commits
    • mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses · b9d69db8
      Paolo Abeni authored
      Currently the in-kernel PM arbitrarily enforces that a created subflow's
      family must match the main MPTCP socket, while the RFC allows mixing
      IPv4 and IPv6 subflows.
      
      This patch changes the in-kernel PM logic to create subflows matching
      the currently selected source (or destination) address. IPv4 sockets
      can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets
      not restricted to V6ONLY can pick either IPv4 or IPv6 addresses as
      long as the source and destination match.
      
      A previously introduced helper is used to ease family matching checks,
      taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6-only addresses.
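
The family matching rule described above can be sketched in user space as follows. This is only an illustration of the rule as stated in this message, not the kernel helper; the names effective_family() and can_pair() are hypothetical, and the handling of a V6ONLY socket (IPv6 addresses only) is an assumption drawn from the wording.

```python
import ipaddress


def effective_family(addr: str) -> int:
    """Return 4 or 6, treating an IPv4-mapped IPv6 address as IPv4."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return 4
    return ip.version


def can_pair(local: str, remote: str, sock_family: int = 6,
             v6only: bool = False) -> bool:
    """Check whether a subflow local -> remote is allowed (hypothetical)."""
    local_family = effective_family(local)
    # Source and destination must belong to the same effective family.
    if local_family != effective_family(remote):
        return False
    if sock_family == 4:
        # IPv4 sockets: IPv4 (or v4-mapped) addresses only.
        return local_family == 4
    if v6only:
        # Assumption: a V6ONLY socket is limited to plain IPv6 addresses.
        return local_family == 6
    # IPv6 socket without V6ONLY: either family is fine.
    return True
```

For example, can_pair("192.0.2.1", "::ffff:198.51.100.2") is accepted because both ends are effectively IPv4, while pairing 192.0.2.1 with 2001:db8::1 is rejected.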
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269
      Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • icmp: Add counters for rate limits · d0941130
      Jamie Bainbridge authored
      There are multiple ICMP rate limiting mechanisms:
      
      * Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
      * v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
      * v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask
      
      However, when ICMP output is limited, there is no way to tell
      which limit has been hit or even if the limits are responsible
      for the lack of ICMP output.
      
      Add counters for each of the cases above. As we are within
      local_bh_disable(), use the __INC stats variant.
      
      Example output:
      
       # nstat -sz "*RateLimit*"
       IcmpOutRateLimitGlobal          134                0.0
       IcmpOutRateLimitHost            770                0.0
       Icmp6OutRateLimitHost           84                 0.0
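
To tell which limit was hit, the counters can be post-processed. The sketch below parses text in the three-column layout of the example output above; parse_nstat() and limits_hit() are hypothetical helper names, and the sample text is copied from this message rather than read from a live system.

```python
def parse_nstat(text: str) -> dict[str, int]:
    """Map counter name -> value from nstat-style output."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and header lines such as "#kernel".
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        counters[fields[0]] = int(fields[1])
    return counters


def limits_hit(counters: dict[str, int]) -> list[str]:
    """Names of the rate-limit counters that are non-zero."""
    return sorted(name for name, value in counters.items()
                  if "RateLimit" in name and value > 0)


SAMPLE = """\
#kernel
IcmpOutRateLimitGlobal          134                0.0
IcmpOutRateLimitHost            770                0.0
Icmp6OutRateLimitHost           84                 0.0
"""

if __name__ == "__main__":
    print(limits_hit(parse_nstat(SAMPLE)))
```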

      Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
      Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
      Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'adding-sparx5-is0-vcap-support' · 9f927527
      Paolo Abeni authored
      Steen Hegelund says:
      
      ====================
      Adding Sparx5 IS0 VCAP support
      
      This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware
      Processor) support for the Sparx5 platform.
      
      The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that
      mainly extracts frame information to metadata that follows the frame in the
      Sparx5 processing flow all the way to the egress port.
      
      The IS0 VCAP has 6 lookups and they are accessible with a TC chain id:
      
      - chain 1000000: IS0 Lookup 0
      - chain 1100000: IS0 Lookup 1
      - chain 1200000: IS0 Lookup 2
      - chain 1300000: IS0 Lookup 3
      - chain 1400000: IS0 Lookup 4
      - chain 1500000: IS0 Lookup 5
      
      Each of these lookups has its own port keyset configuration that decides
      which keys will be used for matching on which traffic type.
      
      The IS0 VCAP has these traffic classifications:
      
      - IPv4 frames
      - IPv6 frames
      - Unicast MPLS frames (ethertype = 0x8847)
      - Multicast MPLS frames (ethertype = 0x8848)
      - Other frame types than MPLS, IPv4 and IPv6
      
      The IS0 VCAP has an action that allows setting the value of a PAG (Policy
      Association Group) key field in the frame metadata, and this can be used
      for matching in an IS2 VCAP rule.
      
      This allows rules in the IS0 VCAP to be linked to rules in the IS2 VCAP.
      
      The linking is exposed by using the TC "goto chain" action with an offset
      from the IS2 chain ids.
      
      As an example a "goto chain 8000001" will use a PAG value of 1 to chain to
      a rule in IS2 Lookup 0.
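
The chain id arithmetic can be sketched as below. The IS0 base and step come from the chain list above; the IS2 Lookup 0 base of 8000000 is inferred from the "goto chain 8000001" example and is an assumption, as are the helper names and the upper bound on the offset.

```python
IS0_BASE = 1_000_000      # chain id of IS0 Lookup 0 (from the list above)
LOOKUP_STEP = 100_000     # distance between consecutive lookups
IS0_LOOKUPS = 6           # lookups 0..5
IS2_L0_BASE = 8_000_000   # assumption: inferred from "goto chain 8000001"


def is0_chain(lookup: int) -> int:
    """TC chain id for a given IS0 lookup."""
    if not 0 <= lookup < IS0_LOOKUPS:
        raise ValueError(f"no such IS0 lookup: {lookup}")
    return IS0_BASE + lookup * LOOKUP_STEP


def is0_lookup(chain: int) -> int:
    """IS0 lookup addressed by a TC chain id."""
    offset = chain - IS0_BASE
    if offset % LOOKUP_STEP or not 0 <= offset // LOOKUP_STEP < IS0_LOOKUPS:
        raise ValueError(f"not an IS0 chain id: {chain}")
    return offset // LOOKUP_STEP


def pag_for_goto(chain: int) -> int:
    """PAG value encoded as an offset from the IS2 Lookup 0 chain id."""
    offset = chain - IS2_L0_BASE
    # Assumption: offsets stay below the distance to the next lookup.
    if not 0 <= offset < LOOKUP_STEP:
        raise ValueError(f"not an IS2 Lookup 0 goto target: {chain}")
    return offset
```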
      ====================
      
      Link: https://lore.kernel.org/r/20230124104511.293938-1-steen.hegelund@microchip.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add support for IS0 VCAP CVLAN TC keys · 52df82cc
      Steen Hegelund authored
      This adds support for parsing and matching on the CVLAN tags in the Sparx5
      IS0 VCAP.

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add support for IS0 VCAP ethernet protocol types · 63e35645
      Steen Hegelund authored
      This allows the IS0 VCAP to have its own list of supported ethernet
      protocol types matching what is supported by the VCAP's port lookup
      classification.

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add automatic selection of VCAP rule actionset · 81e164c4
      Steen Hegelund authored
      With more than one possible actionset in a VCAP instance, the VCAP API will
      now use the actions in a VCAP rule to select the actionset that best fits
      these actions.

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs · 88bd9ea7
      Steen Hegelund authored
      This allows rules to be chained between VCAP instances, e.g. from IS0
      Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
      Lookups.
      
      Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
      hardware.

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add TC support for IS0 VCAP · 542e6e2c
      Steen Hegelund authored
      This enables the TC command to use the Sparx5 IS0 VCAP.

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add actionset type id information to rule · 7306fcd1
      Steen Hegelund authored
      This adds the actionset type id to the rule information. This is needed as
      we now have more than one actionset in a VCAP instance (IS0).

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add IS0 VCAP keyset configuration for Sparx5 · 545609fd
      Steen Hegelund authored
      This adds the IS0 VCAP port keyset configuration for Sparx5 and also
      updates the debugFS support to show the keyset configuration.

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net: microchip: sparx5: Add IS0 VCAP model and updated KUNIT VCAP model · f274a659
      Steen Hegelund authored
      This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5.
      This VCAP provides classification actions for Sparx5.

      Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'add-ip_local_port_range-socket-option' · 3f17e16f
      Jakub Kicinski authored
      Jakub Sitnicki says:
      
      ====================
      Add IP_LOCAL_PORT_RANGE socket option
      
      This patch set is a follow up to the "How to share IPv4 addresses by
      partitioning the port space" talk given at LPC 2022 [1].
      
      Please see patch #1 for the motivation & the use case description.
      Patch #2 adds tests exercising the new option in various scenarios.
      
      Documentation
      -------------
      
      Proposed update to the ip(7) man-page:
      
             IP_LOCAL_PORT_RANGE (since Linux X.Y)
                    Set or get the per-socket default local  port  range.  This
                    option  can  be  used  to  clamp down the global local port
                    range, defined by the ip_local_port_range  /proc  interface
                    described below, for a given socket.
      
                    The  option  takes  an uint32_t value with the high 16 bits
                    set to the upper range bound, and the low 16  bits  set  to
                    the  lower  range  bound.  Range  bounds are inclusive. The
                    16-bit values should be in host byte order.
      
                    The lower bound must not exceed the upper bound when
                    both bounds are not zero. Otherwise, setting the option
                    fails with EINVAL.
      
                    If either bound is outside of the global local port  range,
                    or is zero, then that bound has no effect.
      
                    To  reset  the setting, pass zero as both the upper and the
                    lower bound.
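
The clamping rule in the proposed man-page text can be sketched as below. This illustrates the rule as worded here and is not the kernel code; effective_port_range() is a hypothetical name, and 32768-60999 in the usage note is only an example value for the global range.

```python
def effective_port_range(global_lo: int, global_hi: int,
                         sock_lo: int, sock_hi: int) -> tuple[int, int]:
    """Combine the global local port range with per-socket bounds.

    A per-socket bound that is zero, or that lies outside the global
    range, has no effect; the global bound is used instead.
    """
    lo, hi = global_lo, global_hi
    if sock_lo != 0 and global_lo <= sock_lo <= global_hi:
        lo = sock_lo
    if sock_hi != 0 and global_lo <= sock_hi <= global_hi:
        hi = sock_hi
    return lo, hi
```

With a global range of 32768-60999, per-socket bounds of 40000 and 40511 narrow the range to 40000-40511, bounds of 0 and 40511 only lower the upper end, and bounds of 1000 and 2000 leave the global range untouched.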
      
      Interaction with SELinux bind() hook
      ------------------------------------
      
      SELinux bind() hook - selinux_socket_bind() - performs a permission check
      if the requested local port number lies outside of the netns ephemeral port
      range.
      
      The proposed socket option cannot be used to change the ephemeral port
      range to extend beyond the per-netns port range, as set by
      net.ipv4.ip_local_port_range.
      
      Hence, there is no interaction with SELinux, AFAICT.
      
      RFC -> v1
      RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/
      
       * Allow either the high bound or the low bound, or both, to be zero
       * Add getsockopt support
       * Add selftests
      
      Links:
      ------
      
      [1]: https://lpc.events/event/16/contributions/1349/
      ====================
      
      Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option · ae543965
      Jakub Sitnicki authored
      Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:
      
      1. pass invalid values to setsockopt
      2. pass a range outside of the per-netns port range
      3. configure a single-port range
      4. exhaust a configured multi-port range
      5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
      6. set then get the per-socket port range

      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • inet: Add IP_LOCAL_PORT_RANGE socket option · 91d0b78c
      Jakub Sitnicki authored
      Users who want to share a single public IP address for outgoing connections
      between several hosts traditionally reach for SNAT. However, SNAT requires
      state keeping on the node(s) performing the NAT.
      
      A stateless alternative exists, where a single IP address used for egress
      can be shared between several hosts by partitioning the available ephemeral
      port range. In such a setup:
      
      1. Each host gets assigned a disjoint range of ephemeral ports.
      2. Applications open connections from the host-assigned port range.
      3. Return traffic gets routed to the host based on both the destination IP
         and the destination port.
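
The setup above can be sketched as a small planning helper. Everything here is illustrative: partition_ports() and host_for_port() are hypothetical names, the equal-sized split is just one possible policy, and the real return-path routing happens in the network, not in Python.

```python
def partition_ports(lo: int, hi: int,
                    hosts: list[str]) -> dict[str, tuple[int, int]]:
    """Give each host an equal-sized, disjoint, inclusive port range."""
    size = (hi - lo + 1) // len(hosts)
    if size == 0:
        raise ValueError("more hosts than ports")
    return {host: (lo + i * size, lo + (i + 1) * size - 1)
            for i, host in enumerate(hosts)}


def host_for_port(plan: dict[str, tuple[int, int]], port: int) -> str:
    """Pick the host owning a destination port (return-traffic routing)."""
    for host, (lo, hi) in plan.items():
        if lo <= port <= hi:
            return host
    raise LookupError(f"port {port} is not assigned to any host")


if __name__ == "__main__":
    plan = partition_ports(60_000, 61_023, ["host-a", "host-b"])
    print(plan)
    print(host_for_port(plan, 60_600))
```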
      
      An application which wants to open an outgoing connection (connect) from a
      given port range today can choose between two solutions:
      
      1. Manually pick the source port by bind()'ing to it before connect()'ing
         the socket.
      
         This approach has a couple of downsides:
      
         a) Search for a free port has to be implemented in user space. If
            the chosen 4-tuple happens to be busy, the application needs to retry
            from a different local port number.
      
            Detecting if a 4-tuple is busy can be either easy (TCP) or hard
            (UDP). In the TCP case, the application simply has to check if
            connect() returned an error (EADDRNOTAVAIL). That is assuming that
            the local port sharing was enabled (REUSEADDR) by all the sockets.
      
              # Assume desired local port range is 60_000-60_511
              s = socket(AF_INET, SOCK_STREAM)
              s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
              s.bind(("192.0.2.1", 60_000))
              s.connect(("1.1.1.1", 53))
              # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
              # Application must retry with another local port
      
            In case of UDP, the network stack allows binding more than one socket
            to the same 4-tuple, when local port sharing is enabled
            (REUSEADDR). Hence detecting the conflict is much harder and involves
            querying sock_diag and toggling the REUSEADDR flag [1].
      
         b) For TCP, bind()-ing to a port within the ephemeral port range means
            that no connecting sockets, that is those which leave it to the
            network stack to find a free local port at connect() time, can use
            this port.
      
            IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
            will be skipped during the free port search at connect() time.
      
      2. Isolate the app in a dedicated netns and use the per-netns
         ip_local_port_range sysctl to adjust the ephemeral port range bounds.
      
         The per-netns setting affects all sockets, so this approach can be used
         only if:
      
         - there is just one egress IP address, or
         - the desired egress port range is the same for all egress IP addresses
           used by the application.
      
         For TCP, this approach avoids the downsides of (1). Free port search and
         4-tuple conflict detection is done by the network stack:
      
           system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
      
           s = socket(AF_INET, SOCK_STREAM)
           s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
           s.bind(("192.0.2.1", 0))
           s.connect(("1.1.1.1", 53))
           # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
      
         For UDP this approach has limited applicability. Setting the
         IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
         port being shared with other connected UDP sockets.
      
         Hence relying on the network stack to find a free source port limits the
         number of outgoing UDP flows from a single IP address down to the number
         of available ephemeral ports.
      
      To put it another way, partitioning the ephemeral port range between hosts
      using the existing Linux networking API is cumbersome.
      
      To address this use case, add a new socket option at the SOL_IP level,
      named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
      ephemeral port range for each socket individually.
      
      The option can be used only to narrow down the per-netns local port
      range. If the per-socket range lies outside of the per-netns range, the
      latter takes precedence.
      
      UAPI-wise, the low and high range bounds are passed to the kernel as a pair
      of u16 values in host byte order packed into a u32. This avoids pointer
      passing.
      
        PORT_LO = 40_000
        PORT_HI = 40_511
      
        s = socket(AF_INET, SOCK_STREAM)
        v = struct.pack("I", PORT_HI << 16 | PORT_LO)
        s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
        s.bind(("127.0.0.1", 0))
        s.getsockname()
        # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
        # if there is a free port. EADDRINUSE otherwise.
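
The u32 packing can be wrapped in a pair of helpers, sketched below. pack_port_range() and unpack_port_range() are hypothetical names. The rejection of a low bound above the high bound mirrors the EINVAL case described for the option, and accepting equal bounds is an assumption based on the single-port range covered by the selftests.

```python
import struct


def pack_port_range(lo: int, hi: int) -> bytes:
    """Pack (lo, hi) as a native-endian u32: hi in the high 16 bits."""
    for bound in (lo, hi):
        if not 0 <= bound <= 0xFFFF:
            raise ValueError(f"port bound out of range: {bound}")
    # Zero means "bound not set"; only compare when both are set.
    if lo and hi and lo > hi:
        raise ValueError("low bound exceeds high bound")
    return struct.pack("I", hi << 16 | lo)


def unpack_port_range(raw: bytes) -> tuple[int, int]:
    """Inverse of pack_port_range(); returns (lo, hi)."""
    (value,) = struct.unpack("I", raw)
    return value & 0xFFFF, value >> 16
```

The bytes returned by pack_port_range(PORT_LO, PORT_HI) are what the example above passes to setsockopt(), and unpack_port_range() can decode a 4-byte value read back with getsockopt().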
      
      [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

      Reviewed-by: Marek Majkowski <marek@cloudflare.com>
      Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Kconfig: fix spellos · 6a7a2c18
      Randy Dunlap authored
      Fix spelling in net/ Kconfig files.
      (reported by codespell)

      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: coreteam@netfilter.org
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 25 Jan, 2023 17 commits
  3. 24 Jan, 2023 8 commits