1. 12 Mar, 2020 40 commits
    • Merge branch 'ipa-fixes' · 0f70eedc
      David S. Miller authored
      Alex Elder says:
      
      ====================
      net: fix net-next
      
      David:	These patches resolve two issues caused by the IPA driver
      	being incorporated into net-next.  I hope you will merge
      	them as soon as you can.
      
      The IPA driver was merged into net-next last week, but two problems
      arise as a result, affecting net-next and linux-next:
        - The patch that defines field_max() was not incorporated into
          net-next, but is required by the IPA code
        - A patch that updates "sdm845.dtsi" *was* incorporated into
          net-next, but other changes to that file in the Qualcomm
          for-next branch lead to errors
      
      Bjorn has agreed to incorporate the DTS file change into the
      Qualcomm tree after it is reverted from net-next.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Revert "arm64: dts: sdm845: add IPA information" · 4639b38b
      Alex Elder authored
      This reverts commit 9cc5ae12.
      
      This commit:
        b303f9f0 arm64: dts: sdm845: Redefine interconnect provider DT nodes
      found in the Qualcomm for-next tree removes/redefines the interconnect
      provider node(s) used for IPA.  I'm not sure whether it technically
      conflicts with the IPA change to "sdm845.dtsi" in for-next, but it renders
      it broken.
      
      Revert this commit in the for-next tree, with the plan to incorporate
      it into the Qualcomm tree instead.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bitfield.h: add FIELD_MAX() and field_max() · e31a5016
      Alex Elder authored
      Define FIELD_MAX(), which supplies the maximum value that can be
      represented by a field value.  Define field_max() as well, to go
      along with the lower-case forms of the field mask functions.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
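
As a rough userspace illustration of what FIELD_MAX() and field_max() compute: the maximum value of a field is its mask shifted down to bit 0. This is not the kernel macro (which also performs compile-time checks on the mask); the name field_max_sketch and the register layout are hypothetical, and __builtin_ctz() assumes GCC or Clang.

```c
#include <stdint.h>

/*
 * Hypothetical userspace sketch, not the kernel macro: return the largest
 * value representable in the field described by a contiguous, non-zero mask.
 */
static inline uint32_t field_max_sketch(uint32_t mask)
{
	/* __builtin_ctz() counts trailing zero bits, i.e. the field's shift. */
	return mask >> __builtin_ctz(mask);
}

/* Hypothetical register layout: a 5-bit field occupying bits 4..8. */
#define EXAMPLE_FIELD_MASK 0x000001f0u
```

With this layout, field_max_sketch(EXAMPLE_FIELD_MASK) evaluates to 0x1f, the largest value a 5-bit field can hold.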
    • Merge branch 'ethtool-netlink-interface-part-3' · 82a9822b
      David S. Miller authored
      Michal Kubecek says:
      
      ====================
      ethtool netlink interface, part 3
      
      Implementation of more netlink request types:
      
        - netdev features (ethtool -k/-K, patches 3-6)
        - private flags (--show-priv-flags / --set-priv-flags, patches 7-9)
        - ring sizes (ethtool -g/-G, patches 10-12)
        - channel counts (ethtool -l/-L, patches 13-15)
      
      Patch 1 is a style cleanup suggested in part 2 review and patch 2 updates
      the mapping between netdev features and legacy ioctl requests (which are
      still used by ethtool for backward compatibility).
      
      Changes in v2:
        - fix netdev reference leaks in error path of ethnl_set_rings() and
          ethnl_set_channels() (found by Jakub Kicinski)
        - use __set_bit() rather than set_bit() (suggested by David Miller)
        - in replies to RINGS_GET and CHANNELS_GET requests, omit ring and
          channel types not supported by driver/device (suggested by Jakub
          Kicinski)
        - more descriptive message size calculations in rings_reply_size() and
          channels_reply_size() (suggested by Jakub Kicinski)
        - coding style cleanup (suggested by Jakub Kicinski)
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: add CHANNELS_NTF notification · 546379b9
      Michal Kubecek authored
      Send ETHTOOL_MSG_CHANNELS_NTF notification whenever channel counts of
      a network device are modified using ETHTOOL_MSG_CHANNELS_SET netlink
      message or ETHTOOL_SCHANNELS ioctl request.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: set device channel counts with CHANNELS_SET request · e19c591e
      Michal Kubecek authored
      Implement CHANNELS_SET netlink request to set channel counts of a network
      device. These are traditionally set with ETHTOOL_SCHANNELS ioctl request.
      
      Like the ioctl implementation, the generic ethtool code checks if supplied
      values do not exceed driver defined limits; if they do, first offending
      attribute is reported using extack. Checks preventing removing channels
      used for RX indirection table or zerocopy AF_XDP socket are also
      implemented.
      
      Move ethtool_get_max_rxfh_channel() helper into common.c so that it can be
      used by both ioctl and netlink code.
      
      v2:
        - fix netdev reference leak in error path (found by Jakub Kicinski)
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: provide channel counts with CHANNELS_GET request · 0c84979c
      Michal Kubecek authored
      Implement CHANNELS_GET request to get channel counts of a network device.
      These are traditionally available via ETHTOOL_GCHANNELS ioctl request.
      
      Omit attributes for channel types which are not supported by driver or
      device (zero reported for maximum).
      
      v2: (all suggested by Jakub Kicinski)
        - minor cleanup in channels_prepare_data()
        - more descriptive channels_reply_size()
        - omit attributes with zero max count
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
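
The "omit attributes with zero max count" rule from the commit above can be sketched as a reply-size calculation. This is a simplified userspace model and not the code from the patch: the struct and function names are hypothetical, only the per-type attributes are counted (the reply header is left out), and the attribute size assumes netlink's 4-byte attribute header and 4-byte alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the channel counts a driver reports. */
struct channels_sketch {
	uint32_t max_rx, max_tx, max_other, max_combined;
	uint32_t rx_count, tx_count, other_count, combined_count;
};

/* Size of one u32 attribute: 4-byte header plus payload, padded to 4 bytes. */
static size_t u32_attr_size(void)
{
	size_t len = 4 + sizeof(uint32_t);

	return (len + 3) & ~(size_t)3;
}

/*
 * Each supported channel type contributes two attributes (maximum and
 * current count); a type whose maximum is zero contributes nothing.
 */
static size_t channels_reply_size_sketch(const struct channels_sketch *c)
{
	size_t len = 0;

	if (c->max_rx)
		len += 2 * u32_attr_size();
	if (c->max_tx)
		len += 2 * u32_attr_size();
	if (c->max_other)
		len += 2 * u32_attr_size();
	if (c->max_combined)
		len += 2 * u32_attr_size();
	return len;
}
```

A device that only supports combined channels would therefore get a reply sized for two attributes rather than eight.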
    • ethtool: add RINGS_NTF notification · bc9d1c99
      Michal Kubecek authored
      Send ETHTOOL_MSG_RINGS_NTF notification whenever ring sizes of a network
      device are modified using ETHTOOL_MSG_RINGS_SET netlink message or
      ETHTOOL_SRINGPARAM ioctl request.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: set device ring sizes with RINGS_SET request · 2fc2929e
      Michal Kubecek authored
      Implement RINGS_SET netlink request to set ring sizes of a network device.
      These are traditionally set with ETHTOOL_SRINGPARAM ioctl request.
      
      Like the ioctl implementation, the generic ethtool code checks if supplied
      values do not exceed driver defined limits; if they do, first offending
      attribute is reported using extack.
      
      v2:
        - fix netdev reference leak in error path (found by Jakub Kicinski)
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
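
The limit check described above (reject the request and report the first offending attribute) might look roughly like the following. It is a userspace sketch under stated assumptions: the enum, struct and function names are hypothetical, and the real code reports the attribute through extack rather than returning an identifier.

```c
#include <stdint.h>

/* Hypothetical identifiers for the four ring types; 0 means "no error". */
enum ring_attr_sketch {
	RING_OK = 0,
	RING_RX,
	RING_RX_MINI,
	RING_RX_JUMBO,
	RING_TX,
};

struct ringparam_sketch {
	uint32_t rx_max, rx_mini_max, rx_jumbo_max, tx_max; /* driver limits */
	uint32_t rx, rx_mini, rx_jumbo, tx;                 /* requested sizes */
};

/* Return the first requested size that exceeds its limit, or RING_OK. */
static enum ring_attr_sketch
first_offending_ring(const struct ringparam_sketch *p)
{
	if (p->rx > p->rx_max)
		return RING_RX;
	if (p->rx_mini > p->rx_mini_max)
		return RING_RX_MINI;
	if (p->rx_jumbo > p->rx_jumbo_max)
		return RING_RX_JUMBO;
	if (p->tx > p->tx_max)
		return RING_TX;
	return RING_OK;
}
```

Checking in a fixed order means that a request violating several limits at once still produces a single, deterministic error.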
    • ethtool: provide ring sizes with RINGS_GET request · e4a1717b
      Michal Kubecek authored
      Implement RINGS_GET request to get ring sizes of a network device. These
      are traditionally available via ETHTOOL_GRINGPARAM ioctl request.
      
      Omit attributes for ring types which are not supported by driver or device
      (zero reported for maximum).
      
      v2: (all suggested by Jakub Kicinski)
        - minor cleanup in rings_prepare_data()
        - more descriptive rings_reply_size()
        - omit attributes with zero max size
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: add PRIVFLAGS_NTF notification · 111dcba3
      Michal Kubecek authored
      Send ETHTOOL_MSG_PRIVFLAGS_NTF notification whenever private flags of
      a network device are modified using ETHTOOL_MSG_PRIVFLAGS_SET netlink
      message or ETHTOOL_SPFLAGS ioctl request.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: set device private flags with PRIVFLAGS_SET request · f265d799
      Michal Kubecek authored
      Implement PRIVFLAGS_SET netlink request to set private flags of a network
      device. These are traditionally set with ETHTOOL_SPFLAGS ioctl request.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: provide private flags with PRIVFLAGS_GET request · e16c3386
      Michal Kubecek authored
      Implement PRIVFLAGS_GET request to get private flags for a network device.
      These are traditionally available via ETHTOOL_GPFLAGS ioctl request.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: add FEATURES_NTF notification · 9c6451ef
      Michal Kubecek authored
      Send ETHTOOL_MSG_FEATURES_NTF notification whenever network device features
      are modified using ETHTOOL_MSG_FEATURES_SET netlink message, ethtool ioctl
      request or any other way resulting in call to netdev_update_features() or
      netdev_change_features().
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: set netdev features with FEATURES_SET request · 0980bfcd
      Michal Kubecek authored
      Implement FEATURES_SET netlink request to set network device features.
      These are traditionally set using ETHTOOL_SFEATURES ioctl request.
      
      Actual change is subject to netdev_change_features() sanity checks so that
      it can differ from what was requested. Unlike with most other SET requests,
      in addition to error code and optional extack, kernel provides an optional
      reply message (ETHTOOL_MSG_FEATURES_SET_REPLY) in the same format but with
      different semantics: information about difference between user request and
      actual result and difference between old and new state of dev->features.
      This reply message can be suppressed by setting ETHTOOL_FLAG_OMIT_REPLY
      flag in request header.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
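
The two differences carried by the FEATURES_SET reply can be illustrated with plain bit operations. This is a minimal sketch, assuming features fit in a single 64-bit word; the struct and function names are hypothetical and the kernel's own bitmap handling is not reproduced here.

```c
#include <stdint.h>

struct features_reply_sketch {
	uint64_t wanted_diff; /* requested bits whose final state differs from the request */
	uint64_t active_diff; /* bits whose state changed between old and new */
};

/*
 * old_active / new_active: device features before and after the request.
 * req_value / req_mask:    requested bit states and which bits were requested.
 */
static struct features_reply_sketch
features_reply(uint64_t old_active, uint64_t new_active,
	       uint64_t req_value, uint64_t req_mask)
{
	struct features_reply_sketch r;

	r.wanted_diff = (req_value ^ new_active) & req_mask;
	r.active_diff = old_active ^ new_active;
	return r;
}
```

For example, if bit 0 is already on, the request asks to turn on bits 1 and 2, and the sanity checks refuse bit 2, then wanted_diff has only bit 2 set and active_diff has only bit 1 set.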
    • ethtool: add ethnl_parse_bitset() helper · 88db6d1e
      Michal Kubecek authored
      Unlike other SET type commands, modifying netdev features is required to
      provide a reply telling userspace what was actually changed, compared to
      what was requested. For that purpose, the "modified" flag provided by
      ethnl_update_bitset() is not sufficient; we need full information about
      which bits were requested to change.
      
      Therefore provide ethnl_parse_bitset() returning effective value and mask
      bitmaps equivalent to the contents of a bitset nested attribute.
      
      v2: use non-atomic __set_bit() (suggested by David Miller)
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
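
The idea of turning a bitset into "value and mask" bitmaps can be sketched in userspace as below. The input format (a flat array of index/value pairs), the 64-bit limit and all names are assumptions made for illustration; the real helper parses a nested netlink attribute.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened form of one entry in a bitset attribute. */
struct bit_req_sketch {
	unsigned int index; /* bit position, must be < 64 in this sketch */
	bool value;         /* requested state of the bit */
};

/*
 * Build value and mask words from a list of per-bit requests. Plain |= is
 * enough here (the analogue of non-atomic __set_bit()) because the words
 * belong to the caller and nothing else modifies them concurrently.
 * Returns 0 on success, -1 if an index is out of range.
 */
static int parse_bitset_sketch(const struct bit_req_sketch *reqs, size_t n,
			       uint64_t *value, uint64_t *mask)
{
	*value = 0;
	*mask = 0;
	for (size_t i = 0; i < n; i++) {
		if (reqs[i].index >= 64)
			return -1;
		*mask |= UINT64_C(1) << reqs[i].index;
		if (reqs[i].value)
			*value |= UINT64_C(1) << reqs[i].index;
	}
	return 0;
}
```

The mask records which bits the request mentions at all, so a caller can distinguish "set to 0" from "not requested".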
    • ethtool: provide netdev features with FEATURES_GET request · 0524399d
      Michal Kubecek authored
      Implement FEATURES_GET request to get network device features. These are
      traditionally available via ETHTOOL_GFEATURES ioctl request.
      
      v2:
        - style cleanup suggested by Jakub Kicinski
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ethtool: update mapping of features to legacy ioctl requests · f70bb065
      Michal Kubecek authored
      Legacy ioctl requests like ETHTOOL_GTXCSUM are still used by the ethtool
      utility to get values of legacy flags (which work rather as feature
      groups). These are calculated from values of actual features, and a
      request to set them is implemented as an attempt to set all features
      mapping to them. There are two inconsistencies, however:
      
      - tx-checksum-fcoe-crc is shown under tx-checksumming but NETIF_F_FCOE_CRC
        is not included in ETHTOOL_GTXCSUM/ETHTOOL_STXCSUM
      - tx-scatter-gather-fraglist is shown under scatter-gather but
        NETIF_F_FRAGLIST is not included in ETHTOOL_GSG/ETHTOOL_SSG
      
      As the mapping in ethtool output is more correct from logical point of
      view, fix ethtool_get_feature_mask() to match it.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
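
The legacy flags described above behave as feature groups, which can be modelled as a mask per group. The sketch below uses hypothetical bit assignments and names (the kernel's NETIF_F_* values and its checksum mask differ) and shows only the two groups the commit changes, with the FCoE CRC and fraglist bits included as the commit message proposes.

```c
#include <stdint.h>

/* Hypothetical bit assignments; the kernel's NETIF_F_* values differ. */
#define F_SG        (UINT64_C(1) << 0)
#define F_FRAGLIST  (UINT64_C(1) << 1)
#define F_CSUM_MASK (UINT64_C(1) << 2) /* stands in for several checksum bits */
#define F_FCOE_CRC  (UINT64_C(1) << 3)

enum legacy_group_sketch { LEGACY_TXCSUM, LEGACY_SG };

/* Features that map to one legacy flag. */
static uint64_t legacy_feature_mask_sketch(enum legacy_group_sketch g)
{
	switch (g) {
	case LEGACY_TXCSUM:
		return F_CSUM_MASK | F_FCOE_CRC;
	case LEGACY_SG:
		return F_SG | F_FRAGLIST;
	}
	return 0;
}

/* This sketch treats a legacy flag as on if any feature in its group is active. */
static int legacy_flag_on(uint64_t active, enum legacy_group_sketch g)
{
	return (active & legacy_feature_mask_sketch(g)) != 0;
}
```

With the group masks widened this way, a device that has only the FCoE CRC feature active reports the legacy tx-checksumming flag as on, matching what the ethtool output already groups together.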
    • ethtool: rename ethnl_parse_header() to ethnl_parse_header_dev_get() · 98130546
      Michal Kubecek authored
      Andrew Lunn pointed out that even though it is documented that
      ethnl_parse_header() takes a reference to the network device if it fills
      it into the target structure, the name does not make this apparent, so
      the corresponding dev_put() looks mismatched.
      
      Rename the function to ethnl_parse_header_dev_get() to indicate that it
      takes a reference.
      Suggested-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'Introduce-connection-tracking-offload' · f8ab3047
      David S. Miller authored
      Paul Blakey says:
      
      ====================
      Introduce connection tracking offload
      
      Background
      ----------
      
      The connection tracking action provides the ability to associate connection state to a packet.
      The connection state may be used for stateful packet processing such as stateful firewalls
      and NAT operations.
      
      Connection tracking in TC SW
      ----------------------------
      
      The CT state may be matched only after the CT action is performed.
      As such, CT use cases are commonly implemented using multiple chains.
      Consider the following TC filters, as an example:
      1. tc filter add dev ens1f0_0 ingress prio 1 chain 0 proto ip flower \
          src_mac 24:8a:07:a5:28:01 ct_state -trk \
          action ct \
          pipe action goto chain 2
      
      2. tc filter add dev ens1f0_0 ingress prio 1 chain 2 proto ip flower \
          ct_state +trk+new \
          action ct commit \
          pipe action tunnel_key set \
              src_ip 0.0.0.0 \
              dst_ip 7.7.7.8 \
              id 98 \
              dst_port 4789 \
          action mirred egress redirect dev vxlan0
      
      3. tc filter add dev ens1f0_0 ingress prio 1 chain 2 proto ip flower \
          ct_state +trk+est \
          action tunnel_key set \
              src_ip 0.0.0.0 \
              dst_ip 7.7.7.8 \
              id 98 \
              dst_port 4789 \
          action mirred egress redirect dev vxlan0
      
      Filter #1 (chain 0) decides, after initial packet classification, to send the packet to the
      connection tracking module (ct action).
      Once the ct_state is initialized by the CT action the packet processing continues on chain 2.
      
      Chain 2 classifies the packet based on the ct_state.
      Filter #2 matches on the +trk+new CT state while filter #3 matches on the +trk+est ct_state.
      
      MLX5 Connection tracking HW offload - MLX5 driver patches
      ---------------------------------------------------------
      
      The MLX5 hardware model aligns with the software model by realizing a multi-table
      architecture. In SW the TC CT action sets the CT state on the skb. Similarly,
      HW sets the CT state on a HW register. Driver gets this CT state while offloading
      a tuple with a new ct_metadata action that provides it.
      
      Matches on ct_state are translated to HW register matches.
      
      A TC filter with a CT action is broken into two rules, a pre_ct rule and a post_ct rule.
      pre_ct rule:
         Inserted in the corresponding tc chain table, matches on the original tc match, with
         actions: any pre ct actions, set fte_id, set zone, and goto the ct table.
         The fte_id is a register mapping uniquely identifying this filter.
      post_ct rule:
         Inserted in a post_ct table, matches on the fte_id register mapping, with
         actions: counter + any post ct actions (this is usually 'goto chain X')
      
      The post_ct table is the table that all tuples inserted into the ct table go to, so
      on a tuple hit the packet continues from the ct table to the post_ct table
      after being marked with the CT state (mark/label..)
      
      This design ensures that the rule's actions and counters will be executed only after a CT hit.
      HW misses will continue processing in SW from the last chain ID that was processed in hardware.
      
      The following illustrates the HW model:
      
      +-------------------+      +--------------------+    +--------------+
      + pre_ct (tc chain) +----->+ CT (nat or no nat) +--->+ post_ct      +----->
      + original match    +   |  + tuple + zone match + |  + fte_id match +  |
      +-------------------+   |  +--------------------+ |  +--------------+  |
                              v                         v                    v
                           set chain miss mapping    set mark             original
                           set fte_id                set label            filter
                           set zone                  set established      actions
                           set tunnel_id             do nat (if needed)
                           do decap
      
      To fill CT table, driver registers a CB for flow offload events, for each new
      flow table that is passed to it from offloading ct actions. Once a flow offload
      event is triggered on this CB, offload this flow to the hardware CT table.
      
      Established events offload
      --------------------------
      
      Currently, act_ct maintains an FT instance per ct zone. Flow table entries
      are created, per ct connection, when connections enter an established
      state and deleted otherwise. Once an entry is created, the FT assumes
      ownership of the entries, and manages their aging. FT is used for software
      offload of conntrack. FT entries associate 5-tuples with an action list.
      
      The act_ct changes in this patchset:
      Populate the action list with a (new) ct_metadata action, providing the
      connection's ct state (zone,mark and label), and mangle actions if NAT
      is configured.
      
      Pass the action's flow table instance as a ct action entry parameter,
      so when the action is offloaded, the driver may register a callback on
      its block to receive FT flow offload add/del/stats events.
      
      Netfilter changes
      -----------------
      The netfilter changes export the relevant bits, and add the relevant CBs
      to support the above.
      
      Applying this patchset
      ----------------------
      
      On top of current net-next ("r8169: simplify getting stats by using netdev_stats_to_stats64"),
      pull Saeed's ct-offload branch, from git git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git
      and fix the following non trivial conflict in fs_core.c as follows:
      
      Then apply this patchset.
      
      Changelog:
        v2->v3:
          Added the first two patches needed after rebasing on net-next:
           "net/mlx5: E-Switch, Enable reg c1 loopback when possible"
           "net/mlx5e: en_rep: Create uplink rep root table after eswitch offloads table"
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
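
The pre_ct / CT / post_ct flow from the cover letter above can be modelled in a few lines of userspace C. This is only a toy model under stated assumptions: every type and function name is hypothetical, a single integer stands in for the 5-tuple, and the hardware tables are replaced by a linear array search.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified packet metadata ("registers"). */
struct pkt_sketch {
	uint32_t tuple_hash; /* stands in for the 5-tuple */
	uint32_t fte_id;     /* set by pre_ct, matched by post_ct */
	uint16_t zone;       /* set by pre_ct, matched by the CT table */
	uint32_t mark;       /* set on CT hit */
	bool established;    /* set on CT hit */
};

struct ct_entry_sketch {
	uint32_t tuple_hash;
	uint16_t zone;
	uint32_t mark;
};

enum verdict_sketch { HW_MISS_TO_SW, HW_POST_CT_ACTIONS };

static enum verdict_sketch
ct_pipeline_sketch(struct pkt_sketch *pkt, uint32_t fte_id, uint16_t zone,
		   const struct ct_entry_sketch *ct, size_t n_ct)
{
	/* pre_ct: the original match already succeeded; record filter and zone. */
	pkt->fte_id = fte_id;
	pkt->zone = zone;

	/* CT table: tuple + zone match. */
	for (size_t i = 0; i < n_ct; i++) {
		if (ct[i].tuple_hash == pkt->tuple_hash &&
		    ct[i].zone == pkt->zone) {
			pkt->mark = ct[i].mark;
			pkt->established = true;
			/* post_ct: fte_id selects the filter's counter and actions. */
			return HW_POST_CT_ACTIONS;
		}
	}

	/* No tuple hit: post_ct is never reached and software takes over. */
	return HW_MISS_TO_SW;
}
```

The point the model tries to capture is the ordering guarantee stated in the cover letter: the filter's own actions and counter sit behind the CT table, so they only run after a CT hit.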
    • net/mlx5e: CT: Support clear action · 1ef3018f
      Paul Blakey authored
      Clear action, as with software, removes all ct metadata from
      the packet.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: CT: Handle misses after executing CT action · 5c6b9460
      Paul Blakey authored
      Mark packets with a unique tupleid, and on miss use that id to get
      the act ct restore_cookie. Using that restore cookie, we ask CT to
      restore the relevant info on the SKB.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: CT: Offload established flows · ac991b48
      Paul Blakey authored
      Register driver callbacks with the nf flow table platform.
      FT add/delete events will create/delete FTE in the CT/CT_NAT tables.
      
      Restoring the CT state on miss will be added in the following patch.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: CT: Introduce connection tracking · 4c3844d9
      Paul Blakey authored
      Add support for offloading tc ct action and ct matches.
      We translate the tc filter with CT action to the following HW model:
      
      +-------------------+      +--------------------+    +--------------+
      + pre_ct (tc chain) +----->+ CT (nat or no nat) +--->+ post_ct      +----->
      + original match    +  |   + tuple + zone match + |  + fte_id match +  |
      +-------------------+  |   +--------------------+ |  +--------------+  |
                             v                          v                    v
                            set chain miss mapping  set mark             original
                            set fte_id              set label            filter
                            set zone                set established      actions
                            set tunnel_id           do nat (if needed)
                            do decap
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • flow_offload: Add flow_match_ct to get rule ct match · ee1c45e8
      Paul Blakey authored
      Add relevant getter for ct info dissector.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Support getting chain mapping · 43435e91
      Paul Blakey authored
      Currently, we write chain register mapping on miss from the last
      prio of a chain. It is used to restore the chain in software.
      
      To support re-using the chain register mapping from global tables (such
      as CT tuple table) misses, export the chain mapping.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Add support for offloading rules with no in_port · 6fb0701a
      Paul Blakey authored
      FTEs in global tables may match on packets from multiple in_ports.
      Provide the capability to omit the in_port match condition.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Introduce global tables · d18296ff
      Paul Blakey authored
      Currently, flow tables are automatically connected according to their
      <chain,prio,level> tuple.
      
      Introduce global tables which are flow tables that are detached from the
      eswitch chains processing, and will be connected by explicitly referencing
      them from multiple chains.
      
      Add this new table type, and allow connecting them by reference.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_ct: Enable hardware offload of flow table entries · edd5861e
      Paul Blakey authored
      Pass the zone's flow table instance on the flow action to the drivers.
      Thus, allowing drivers to register FT add/del/stats callbacks.
      
      Finally, enable hardware offload on the flow table instance.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_ct: Support refreshing the flow table entries · 8b3646d6
      Paul Blakey authored
      If the driver deleted an FT entry, an FT entry failed to offload, or the
      driver registered to the flow table after flows were already added, we
      still get packets in software.
      
      For those packets, while restoring the ct state from the flow table
      entry, refresh its hardware offload.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_ct: Support restoring conntrack info on skbs · 30b0cf90
      Paul Blakey authored
      Provide an API to restore the ct state pointer.
      
      This may be used by drivers to restore the ct state if they
      miss in tc chain after they already did the hardware connection
      tracking action (ct_metadata action).
      
      For example, consider the following rule on chain 0 that is in_hw,
      however chain 1 is not_in_hw:
      
      $ tc filter add dev ... chain 0 ... \
        flower ... action ct pipe action goto chain 1
      
      Packets of a flow offloaded (via nf flow table offload) by the driver
      hit this rule in hardware, will be marked with the ct metadata action
      (mark, label, zone) that does the equivalent of the software ct action,
      and when the packet jumps to hardware chain 1, there would be a miss.
      
      CT was already processed in hardware. Therefore, the driver's miss
      handling should restore the ct state on the skb, using the provided API,
      and continue the packet processing in chain 1.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: act_ct: Instantiate flow table entry actions · 9c26ba9b
      Paul Blakey authored
      The NF flow table API associates a 5-tuple rule with an action list by calling
      the flow table type's action() CB to fill the rule's actions.
      
      In action CB of act_ct, populate the ct offload entry actions with a new
      ct_metadata action. Initialize the ct_metadata with the ct mark, label and
      zone information. If ct nat was performed, then also append the relevant
      packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites).
      
      Drivers that offload the ft entries may match on the 5-tuple and perform
      the action list.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
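
The way the action callback fills an entry's action list (a ct_metadata action, plus mangle actions when NAT was performed) can be sketched as follows. All types and names are hypothetical, the label is left out, a single IPv4 rewrite stands in for the full set of header mangles, and the ordering simply follows the wording of the commit message; the actual ordering in act_ct.c should be checked against the source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum action_kind_sketch { ACT_CT_METADATA, ACT_MANGLE };

struct action_sketch {
	enum action_kind_sketch kind;
	uint32_t mark;   /* ACT_CT_METADATA only */
	uint16_t zone;   /* ACT_CT_METADATA only */
	uint32_t new_ip; /* ACT_MANGLE only: rewritten IPv4 address */
};

/* Hypothetical view of one tracked connection. */
struct conn_sketch {
	uint32_t mark;
	uint16_t zone;
	bool nat;
	uint32_t nat_ip;
};

/* Fill 'out' (capacity 'cap') and return the number of actions written. */
static size_t fill_ct_actions_sketch(const struct conn_sketch *c,
				     struct action_sketch *out, size_t cap)
{
	size_t n = 0;

	if (n < cap)
		out[n++] = (struct action_sketch){
			.kind = ACT_CT_METADATA,
			.mark = c->mark,
			.zone = c->zone,
		};
	if (c->nat && n < cap)
		out[n++] = (struct action_sketch){
			.kind = ACT_MANGLE,
			.new_ip = c->nat_ip,
		};
	return n;
}
```

A driver offloading the entry would then match on the 5-tuple and walk this list, as the last paragraph of the commit message describes.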
    • netfilter: flowtable: Add API for registering to flow table events · 978703f4
      Paul Blakey authored
      Let drivers add their cb, allowing them to receive flow offload events
      of type TC_SETUP_CLSFLOWER (REPLACE/DEL/STATS) for flows managed by the
      flow table.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5e: en_rep: Create uplink rep root table after eswitch offloads table · c6fe5729
      Paul Blakey authored
      The eswitch offloads table, which has the reps (vport) rx miss rules,
      was moved from OFFLOADS namespace [0,0] (prio, level), to [1,0], so
      the restore table (the new [0,0]) can come before it. The destination
      of these miss rules is the rep root ft (ttc for non uplink reps).
      
      Uplink rep root ft is created as OFFLOADS namespace [0,1], and is used
      as a hook to next RX prio (either ethtool or ttc), but this fails to
      pass fs_core level's check.
      
      Move uplink rep root ft to OFFLOADS prio 1, level 1 ([1,1]), so it
      will keep the same relative position after the restore table
      change.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/mlx5: E-Switch, Enable reg c1 loopback when possible · 5b7cb745
      Paul Blakey authored
      Enable reg c1 loopback if firmware reports it's supported,
      as this is needed for restoring packet metadata (e.g. chain).
      
      Also define helper to query if it is enabled.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bind_addr_zero' · 93e61613
      David S. Miller authored
      Kuniyuki Iwashima says:
      
      ====================
      Improve bind(addr, 0) behaviour.
      
      Currently we fail to bind sockets to ephemeral ports when all of the ports
      are exhausted, even if all sockets have SO_REUSEADDR enabled. In this case,
      we could still connect to different remote hosts.
      
      These patches add the net.ipv4.ip_autobind_reuse option and fix the
      behaviour to fully utilize the space of local (addr, port) tuples.
      
      Changes in v5:
        - Add more description to documents.
        - Fix sysctl option to use proc_dointvec_minmax.
        - Remove the Fixes: tag and squash two commits.
      
      Changes in v4:
        - Add net.ipv4.ip_autobind_reuse option to not change the current behaviour.
        - Modify .gitignore for test.
        https://lore.kernel.org/netdev/20200308181615.90135-1-kuniyu@amazon.co.jp/
      
      Changes in v3:
        - Change the title and write more specific description of the 3rd patch.
        - Add a test in tools/testing/selftests/net/ as the 4th patch.
        https://lore.kernel.org/netdev/20200229113554.78338-1-kuniyu@amazon.co.jp/
      
      Changes in v2:
        - Change the description of the 2nd patch ('localhost' -> 'address').
        - Correct the description and the if statement of the 3rd patch.
        https://lore.kernel.org/netdev/20200226074631.67688-1-kuniyu@amazon.co.jp/
      
      v1 with tests:
        https://lore.kernel.org/netdev/20200220152020.13056-1-kuniyu@amazon.co.jp/
      ====================
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93e61613
    • Kuniyuki Iwashima's avatar
      selftests: net: Add SO_REUSEADDR test to check if 4-tuples are fully utilized. · 7f204a7d
      Kuniyuki Iwashima authored
      This commit adds a test to check if we can fully utilize 4-tuples for
      connect() when all ephemeral ports are exhausted.
      
      The test program changes the local port range to use only one port, binds
      two sockets with or without SO_REUSEADDR and SO_REUSEPORT, and with the same
      EUID or with different EUIDs, and then calls listen().
      
      We should be able to bind only one socket having both SO_REUSEADDR and
      SO_REUSEPORT per EUID. This restriction prevents unintentional
      listen().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7f204a7d
    • Kuniyuki Iwashima's avatar
      tcp: Forbid to bind more than one sockets having SO_REUSEADDR and SO_REUSEPORT per EUID. · 33575921
      Kuniyuki Iwashima authored
      If there is no TCP_LISTEN socket on an ephemeral port, we can bind multiple
      sockets having SO_REUSEADDR to the same port. Then, if all sockets bound to
      the port also have SO_REUSEPORT enabled and have the same EUID, all of them
      can call listen(). This is not safe.
      
      Let's say an application has root privilege and binds sockets to an
      ephemeral port with both SO_REUSEADDR and SO_REUSEPORT. While none of the
      sockets is listening yet, a malicious user can use sudo, exhaust
      ephemeral ports, and bind sockets to the same ephemeral port, and can then
      call listen() and steal the port.
      
      To prevent this issue, we must not bind more than one socket that has the
      same EUID and both SO_REUSEADDR and SO_REUSEPORT.
      
      On the other hand, if the sockets have different EUIDs, the issue above does
      not occur. After sockets with different EUIDs are bound to the same port and
      one of them is listening, no other socket can listen. This is because the
      condition below evaluates to true and listen() for the second socket fails.
      
      			} else if (!reuseport_ok ||
      				   !reuseport || !sk2->sk_reuseport ||
      				   rcu_access_pointer(sk->sk_reuseport_cb) ||
      				   (sk2->sk_state != TCP_TIME_WAIT &&
      				    !uid_eq(uid, sock_i_uid(sk2)))) {
      				if (inet_rcv_saddr_equal(sk, sk2, true))
      					break;
      			}
      
      Therefore, on the same port, we cannot call listen() for multiple sockets
      with different EUIDs: every listen() call after the first one fails, so the
      problem does not happen. In this case, we can still call connect() for the
      other sockets that cannot listen, so bind() has to succeed in order to
      fully utilize 4-tuples.
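      
      As a rough illustration of the reasoning above, the quoted condition can
      be modelled in userspace with plain booleans. This is only a sketch: the
      struct bind_check type and the bind_conflicts() helper are hypothetical
      names introduced here, and each field stands in for the kernel expression
      named in its comment rather than calling any kernel API.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the quoted condition; not kernel code. */
struct bind_check {
	bool reuseport_ok;     /* stands in for reuseport_ok */
	bool reuseport;        /* stands in for reuseport (SO_REUSEPORT on sk) */
	bool sk2_reuseport;    /* stands in for sk2->sk_reuseport */
	bool has_reuseport_cb; /* stands in for rcu_access_pointer(sk->sk_reuseport_cb) */
	bool sk2_time_wait;    /* stands in for sk2->sk_state == TCP_TIME_WAIT */
	bool same_uid;         /* stands in for uid_eq(uid, sock_i_uid(sk2)) */
	bool same_rcv_saddr;   /* stands in for inet_rcv_saddr_equal(sk, sk2, true) */
};

/* Returns true when the modelled branch would hit "break", i.e. a conflict. */
static bool bind_conflicts(const struct bind_check *c)
{
	if (!c->reuseport_ok ||
	    !c->reuseport || !c->sk2_reuseport ||
	    c->has_reuseport_cb ||
	    (!c->sk2_time_wait && !c->same_uid))
		return c->same_rcv_saddr;

	return false;
}
```

      In this model, two SO_REUSEPORT sockets on the same address conflict when
      their EUIDs differ (same_uid is false) and do not conflict when the EUIDs
      match, which mirrors the different-EUID case described above.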
      
      Summarizing the above, we should be able to bind only one socket having
      SO_REUSEADDR and SO_REUSEPORT per EUID.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33575921
    • Kuniyuki Iwashima's avatar
      tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted. · 4b01a967
      Kuniyuki Iwashima authored
      Commit aacd9289 ("tcp: bind() use stronger
      condition for bind_conflict") introduced a restriction that forbids binding
      SO_REUSEADDR enabled sockets to the same (addr, port) tuple, in order to
      spread sockets across ports so that we can connect to the same remote host.
      
      The change accelerates port depletion, so that we fail to bind
      sockets to the same local port even if we want to connect to different
      remote hosts.
      
      You can reproduce this issue by following the instructions below.
      
        1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
        2. Set SO_REUSEADDR on two sockets.
        3. Bind the two sockets to (localhost, 0); the second bind fails.
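      
      Steps 2 and 3 could be sketched in userspace C roughly as follows. The
      helper names bind_reuseaddr_ephemeral() and local_port() are hypothetical
      and exist only for this illustration; 127.0.0.1 is assumed for
      "localhost", and the port range from step 1 must be set separately.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket with SO_REUSEADDR set and bind it to (127.0.0.1, 0).
 * Returns the socket fd on success, or -1 if any step fails. */
static int bind_reuseaddr_ephemeral(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(0),                       /* let the kernel pick */
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}

/* Return the local port the socket is bound to (host order), or -1. */
static int local_port(int fd)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);

	if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
		return -1;

	return ntohs(addr.sin_port);
}
```

      With the port range narrowed to a single port as in step 1, calling
      bind_reuseaddr_ephemeral() twice and keeping the first fd open would be
      expected to return -1 on the second call, matching the failure described
      in step 3.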
      
      Therefore, when ephemeral ports are exhausted, bind(0) should fall back to
      the legacy behaviour, honour the SO_REUSEADDR option, and make it possible
      to connect to different remote (addr, port) tuples.
      
      This patch allows us to bind SO_REUSEADDR enabled sockets to the same
      (addr, port) only when net.ipv4.ip_autobind_reuse is set to 1 and all
      ephemeral ports are exhausted. This also allows connect() and listen() to
      share ports in the following way, which may break some applications. So
      ip_autobind_reuse is 0 by default, which disables the feature.
      
        1. setsockopt(sk1, SO_REUSEADDR)
        2. setsockopt(sk2, SO_REUSEADDR)
        3. bind(sk1, saddr, 0)
        4. bind(sk2, saddr, 0)
        5. connect(sk1, daddr)
        6. listen(sk2)
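      
      Because the sequence above depends on the sysctl being enabled, a program
      might want to check it first. A minimal sketch, assuming the option is
      exposed at the usual procfs path for net.ipv4 sysctls; the helper name
      read_autobind_reuse() is hypothetical:

```c
#include <stdio.h>

/* Read net.ipv4.ip_autobind_reuse through procfs.
 * Returns the value read, or -1 if the file is missing (for example on a
 * kernel without this option) or cannot be parsed. */
static int read_autobind_reuse(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/ip_autobind_reuse", "r");
	int val = -1;

	if (!f)
		return -1;

	if (fscanf(f, "%d", &val) != 1)
		val = -1;

	fclose(f);
	return val;
}
```

      A return value of 1 would mean that step 4 can succeed once all ephemeral
      ports are exhausted; 0 or -1 would mean the previous behaviour applies.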
      
      If it is set to 1, we can fully utilize the 4-tuples, but we should use
      IP_BIND_ADDRESS_NO_PORT for bind()+connect() whenever possible.
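      
      For reference, using IP_BIND_ADDRESS_NO_PORT before bind() might look
      like the sketch below. The helper name bind_addr_no_port() is
      hypothetical, 127.0.0.1 is an assumed source address, and the fallback
      #define is only for toolchains whose headers lack the constant.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24 /* value from linux/in.h; assumed if headers are old */
#endif

/* Bind a TCP socket to a source address only, asking the kernel to defer
 * the port choice until connect(). Returns the fd, or -1 on failure. */
static int bind_addr_no_port(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(0),
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
	};
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
		       &one, sizeof(one)) < 0 ||
	    bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```

      The caller would then connect() the returned socket; the local port is
      picked at that point, when the remote address is already known.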
      
      The notable thing is that if all sockets bound to the same port have
      both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
      ephemeral port and also do listen().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b01a967