1. 24 Jun, 2021 2 commits
    • tools/testing: add a selftest for SO_NETNS_COOKIE · ae24bab2
      Lorenz Bauer authored
      Make sure that SO_NETNS_COOKIE returns a non-zero value, and
      that sockets from different namespaces have a distinct cookie
      value.
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae24bab2
    • net: retrieve netns cookie via getsocketopt · e8b9eab9
      Martynas Pumputis authored
      It's getting more common to run nested container environments for
      testing cloud software. One such example is Kind [1], which runs a
      Kubernetes cluster in Docker containers on a single host. Each container
      acts as a Kubernetes node, and thus can run any Pod (aka container)
      inside the former. This approach simplifies testing a lot, as it
      eliminates complicated VM setups.
      
      Unfortunately, such a setup breaks some functionality when cgroupv2 BPF
      programs are used for load-balancing. The load-balancer BPF program
      needs to detect whether a request originates from the host netns or a
      container netns in order to allow some access, e.g. to a service via a
      loopback IP address. Typically, the programs detect this by comparing
      netns cookies with the one of the init ns via a call to
      bpf_get_netns_cookie(NULL). However, in nested environments the latter
      cannot be used given the Kubernetes node's netns is outside the init ns.
      To fix this, we need to pass the Kubernetes node netns cookie to the
      program in a different way: by extending getsockopt() with a
      SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes
      node netns can retrieve the cookie and pass it to the program instead.
      
      Thus, this is following up on Eric's commit 3d368ab8 ("net:
      initialize net->net_cookie at netns setup") to allow retrieval via
      SO_NETNS_COOKIE. This is also in line with how we retrieve the socket
      cookie via SO_COOKIE.
      
        [1] https://kind.sigs.k8s.io/

      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Martynas Pumputis <m@lambda.lt>
      Cc: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e8b9eab9
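The retrieval described in this commit can be sketched from user space as follows. This is a minimal illustration, not the selftest from the tree: the numeric value 71 for SO_NETNS_COOKIE is an assumption taken from the generic socket option numbering (some architectures may use a different value), and the helper names decode_cookie() and netns_cookie() are hypothetical.

```python
import socket
import struct

# Assumption: generic option number for SO_NETNS_COOKIE; Python's socket
# module does not export this constant, and some architectures may differ.
SO_NETNS_COOKIE = 71


def decode_cookie(raw: bytes) -> int:
    """Decode the 8-byte, native-endian u64 cookie returned by getsockopt()."""
    if len(raw) != 8:
        raise ValueError("expected an 8-byte netns cookie")
    return struct.unpack("=Q", raw)[0]


def netns_cookie():
    """Return the netns cookie of a fresh UDP socket, or None if unsupported.

    None is returned when the kernel (or platform) does not know the option,
    so callers can fall back to another detection method.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            raw = sock.getsockopt(socket.SOL_SOCKET, SO_NETNS_COOKIE, 8)
            return decode_cookie(raw)
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("netns cookie:", netns_cookie())
```

An orchestrator running in the Kubernetes node netns could call such a helper once and hand the value to the BPF program, for example through a map, instead of relying on bpf_get_netns_cookie(NULL).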
  2. 23 Jun, 2021 15 commits
    • Merge branch 'devlink-rate-limit-fixes' · 35713d9b
      David S. Miller authored
      Dmytro Linkin says:
      
      ====================
      Fixes for devlink rate objects API
      
      Patch #1 fixes a missing refcount decrement on the parent node when a
      leaf object is destroyed.
      
      Patch #2 fixes an incorrect eswitch mode check.
      
      Patch #3 protects list traversal with a lock.
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35713d9b
    • devlink: Protect rate list with lock while switching modes · a3e5e579
      Dmytro Linkin authored
      The devlink eswitch set command doesn't hold devlink->lock, which makes
      a race condition possible between rate list traversal and other devlink
      rate KAPI calls, like devlink_rate_nodes_destroy().
      Hold the devlink lock while traversing the list.
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3e5e579
    • devlink: Remove eswitch mode check for mode set call · ff99324d
      Dmytro Linkin authored
      When eswitch is disabled, querying its current mode results in an error.
      Because of this, setting the eswitch switchdev mode fails for mlx5
      devices.
      Hence remove the check.
      
      Fixes: a8ecb93e ("devlink: Introduce rate nodes")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ff99324d
    • devlink: Decrease refcnt of parent rate object on leaf destroy · 1321ed5e
      Dmytro Linkin authored
      Port functions, like SFs, can be deleted by the user while their leaf rate
      object has a parent node. In that case the node refcnt won't be decreased,
      which blocks the node from being deleted later.
      Do a simple refcnt decrease, since the driver is in its cleanup stage. This:
      1) assumes that the driver took the proper internal parent unset action;
      2) avoids nested callback calls and a deadlock.
      
      Fixes: d7555984 ("devlink: Allow setting parent node of rate objects")
      Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1321ed5e
    • virtio_net: Use virtio_find_vqs_ctx() helper · a2f7dc00
      Xianting Tian authored
      virtio_find_vqs_ctx() is defined but currently never called;
      this is the right place to use it.
      Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2f7dc00
    • net/tls: Remove the __TLS_DEC_STATS() macro. · 10ed7ce4
      Kuniyuki Iwashima authored
      Commit d26b698d ("net/tls: add skeleton of MIB statistics")
      introduced __TLS_DEC_STATS(), but it is unused, and __SNMP_DEC_STATS() is
      not defined either. Let's remove it.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10ed7ce4
    • tcp: Add stats for socket migration. · 55d444b3
      Kuniyuki Iwashima authored
      This commit adds two stats for the socket migration feature to evaluate its
      effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).
      
      If the migration fails because of the own_req race in the ACK-receiving and
      SYN+ACK-sending paths, we do not increment the failure stat, because another
      CPU is then responsible for the req.
      
      Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/
      Suggested-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55d444b3
    • ibmveth: Set CHECKSUM_PARTIAL if NULL TCP CSUM. · 7525de25
      David Wilder authored
      TCP checksums on received packets may be set to NULL by the sender if CSO
      is enabled. The hypervisor flags these packets as check-sum-ok and the
      skb is then flagged CHECKSUM_UNNECESSARY. If these packets are then
      forwarded the sender will not request CSO due to the CHECKSUM_UNNECESSARY
      flag. The result is a TCP packet sent with a bad checksum. This change
      sets up CHECKSUM_PARTIAL on these packets causing the sender to correctly
      request CSUM offload.
      Signed-off-by: David Wilder <dwilder@us.ibm.com>
      Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Tested-by: Cristobal Forno <cforno12@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7525de25
    • Merge tag 'mlx5-net-next-2021-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · fe87797b
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-net-next-2021-06-22
      
      1) Various minor cleanups and fixes from net-next branch
      2) Optimize mlx5 feature check on tx and
         a fix to allow Vxlan with Ipsec offloads
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fe87797b
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · a7b62112
      David S. Miller authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter updates for net-next
      
      The following patchset contains Netfilter updates for net-next:
      
      1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr,
         from Phil Sutter.
      
      2) Simplify TCP option sanity check for TCP packets, also from Phil.
      
      3) Add a new expression to store the last time the rule was used.
      
      4) Pass the hook state object to log function, from Florian Westphal.
      
      5) Document the new sysctl knobs to tune the flowtable timeouts,
         from Oz Shlomo.
      
      6) Fix snprintf error check in the new nfnetlink_hook infrastructure,
         from Dan Carpenter.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a7b62112
    • selftests: icmp_redirect: support expected failures · 0a36a75c
      Andrea Righi authored
      According to a comment in commit 99513cfa ("selftest: Fixes for
      icmp_redirect test") the test "IPv6: mtu exception plus redirect" is
      expected to fail, because of a bug in the IPv6 logic that apparently
      hasn't been fixed yet.
      
      We should probably consider this failure as an "expected failure",
      therefore change the script to return XFAIL for that particular test and
      also report the total number of expected failures at the end of the run.
      Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a36a75c
    • Merge branch 'lockless-qdisc-opts' · e940eb3c
      David S. Miller authored
      Yunsheng Lin says:
      
      ====================
      Some optimization for lockless qdisc
      
      Patch 1: remove unnecessary seqcount operation.
      Patch 2: implement TCQ_F_CAN_BYPASS.
      Patch 3: remove qdisc->empty.
      
      Performance data for pktgen in queue_xmit mode + dummy netdev
      with pfifo_fast:
      
       threads    unpatched           patched             delta
          1       2.60Mpps            3.21Mpps             +23%
          2       3.84Mpps            5.56Mpps             +44%
          4       5.52Mpps            5.58Mpps             +1%
          8       2.77Mpps            2.76Mpps             -0.3%
         16       2.24Mpps            2.23Mpps             -0.4%
      
      Performance for IP forward testing: 1.05Mpps increases to
      1.16Mpps, about 10% improvement.
      
      V3: Add 'Acked-by' from Jakub and 'Tested-by' from Vladimir,
          and resend based on latest net-next.
      V2: Adjust the comment and commit log according to discussion
          in V1.
      V1: Drop RFC tag, add nolock_qdisc_is_empty() and do the qdisc
          empty checking without the protection of qdisc->seqlock to
          avoid doing an unnecessary spin_trylock() in the contention case.
      RFC v4: Use STATE_MISSED and STATE_DRAINING to indicate non-empty
              qdisc, and add patch 1 and 3.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e940eb3c
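The delta column of the pktgen table in this cover letter can be recomputed from the two throughput columns. The sketch below only redoes that arithmetic; the names pct_change(), RESULTS and deltas() are hypothetical, and the numbers are copied from the table (in Mpps).

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# threads -> (unpatched Mpps, patched Mpps), copied from the table above
RESULTS = {
    1: (2.60, 3.21),
    2: (3.84, 5.56),
    4: (5.52, 5.58),
    8: (2.77, 2.76),
    16: (2.24, 2.23),
}


def deltas(results=RESULTS):
    """Return threads -> percentage change for every table row."""
    return {
        threads: pct_change(before, after)
        for threads, (before, after) in results.items()
    }


if __name__ == "__main__":
    for threads, delta in deltas().items():
        print(f"{threads:>2} threads: {delta:+.1f}%")
    print(f"IP forward: {pct_change(1.05, 1.16):+.1f}%")
```

The recomputed values agree with the table to within a percentage point; the gain is concentrated at one and two threads, where the empty-qdisc bypass applies most often.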
    • net: sched: remove qdisc->empty for lockless qdisc · d3e0f575
      Yunsheng Lin authored
      As the MISSED and DRAINING states are used to indicate a non-empty
      qdisc, qdisc->empty is no longer needed, so remove it.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d3e0f575
    • net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc · c4fef01b
      Yunsheng Lin authored
      Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
      flags set, but queue discipline bypass does not work for a lockless
      qdisc because the skb is always enqueued to the qdisc even when the qdisc
      is empty, see __dev_xmit_skb().
      
      This patch calls sch_direct_xmit() to transmit the skb directly
      to the driver for an empty lockless qdisc, which avoids the enqueuing
      and dequeuing operations.
      
      qdisc->empty is not reliable for indicating an empty qdisc because
      there is a time window between enqueuing and setting qdisc->empty.
      So we use the MISSED state added in commit a90c57f2 ("net:
      sched: fix packet stuck problem for lockless qdisc"), which
      indicates there is lock contention, suggesting that it is better
      not to do the qdisc bypass in order to avoid a packet out-of-order
      problem.
      
      In order to make the MISSED state reliable for indicating an empty qdisc,
      we need to ensure that testing and clearing of the MISSED state is
      within the protection of qdisc->seqlock; only setting the MISSED state
      can be done without the protection of qdisc->seqlock. A MISSED
      state test is added without the protection of qdisc->seqlock to
      avoid doing an unnecessary spin_trylock() in the contention case.
      
      As the enqueuing is not within the protection of qdisc->seqlock,
      there is still a potential data race as mentioned by Jakub [1]:
      
            thread1               thread2             thread3
      qdisc_run_begin() # true
                              qdisc_run_begin(q)
                                   set(MISSED)
      pfifo_fast_dequeue
        clear(MISSED)
        # recheck the queue
      qdisc_run_end()
                                  enqueue skb1
                                                   qdisc empty # true
                                                qdisc_run_begin() # true
                                                sch_direct_xmit() # skb2
                               qdisc_run_begin()
                                  set(MISSED)
      
      When the above happens, skb1 enqueued by thread2 is transmitted after
      skb2 is transmitted by thread3, because MISSED state setting and
      enqueuing are not under the qdisc->seqlock. If qdisc bypass is
      disabled, skb1 has a better chance of being transmitted sooner than
      skb2.
      
      This patch does not take care of the above data race, because we
      view it as similar to the following:
      Even if CPU1 and CPU2 write skbs at the same time to two sockets
      that both lead to the same qdisc, there is no guarantee as to
      which skb will hit the qdisc first, because many factors, such as
      interrupt/softirq/cache miss/scheduling, affect
      that.
      
      The following cases need special handling:
      1. When the MISSED state is cleared before another round of dequeuing
         in pfifo_fast_dequeue(), __qdisc_run() might not be able to
         dequeue all skbs in one round and may call __netif_schedule(), which
         might result in a non-empty qdisc without MISSED set. In order
         to avoid this, the MISSED state is set for a lockless qdisc and
         __netif_schedule() will be called at the end of qdisc_run_end.
      
      2. The MISSED state also needs to be set for a lockless qdisc instead
         of calling __netif_schedule() directly when requeuing an skb, for
         a similar reason.
      
      3. For the netdev queue stopped case, the MISSED state needs clearing
         while the netdev queue is stopped, otherwise there may be
         unnecessary __netif_schedule() calls. So a new DRAINING state
         is added to indicate this case, which also indicates a non-empty
         qdisc.
      
      4. There is already netif_xmit_frozen_or_stopped() checking in
         dequeue_skb() and sch_direct_xmit(), which are both within the
         protection of qdisc->seqlock, but the same check in
         __dev_xmit_skb() is without the protection, which might make the
         empty indication of a lockless qdisc unreliable. So
         remove the check in __dev_xmit_skb(); the check within
         the protection of qdisc->seqlock seems enough to avoid the cpu
         consumption problem for the netdev queue stopped case.
      
      1. https://lkml.org/lkml/2021/5/29/215

      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4fef01b
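The bypass decision described in this commit can be illustrated with a small single-threaded toy model. It is a sketch under strong assumptions and not the kernel code: LocklessQdisc and xmit() are hypothetical names, the `running` flag stands in for qdisc->seqlock being held, and contention from another CPU is simulated by setting `running` by hand.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LocklessQdisc:
    """Toy stand-in for a TCQ_F_NOLOCK qdisc (not the kernel structure)."""
    queue: deque = field(default_factory=deque)
    missed: bool = False    # models the MISSED state bit
    draining: bool = False  # models the DRAINING state bit
    running: bool = False   # models qdisc->seqlock being held

    def is_empty_hint(self) -> bool:
        # Empty is inferred from neither MISSED nor DRAINING being set.
        return not (self.missed or self.draining)

    def run_begin(self) -> bool:
        if self.running:
            # Lock contention: record it so nobody bypasses the queue.
            self.missed = True
            return False
        self.running = True
        return True

    def run_end(self) -> None:
        self.running = False


def xmit(q: LocklessQdisc, skb, sent: list) -> str:
    """Send `skb`, bypassing the queue only when the qdisc looks empty."""
    if q.is_empty_hint() and q.run_begin():
        sent.append(skb)          # direct transmit, no enqueue/dequeue
        q.run_end()
        return "bypass"
    q.queue.append(skb)           # slow path: enqueue first
    if q.run_begin():
        q.missed = False          # cleared only while holding the "lock"
        while q.queue:
            sent.append(q.queue.popleft())
        q.run_end()
    return "enqueued"


if __name__ == "__main__":
    q, sent = LocklessQdisc(), []
    print(xmit(q, "skb1", sent))  # empty qdisc, no contention
    q.running = True              # pretend another CPU holds the lock
    print(xmit(q, "skb2", sent))
    q.running = False
    print(xmit(q, "skb3", sent), sent)
```

The model keeps packet order because a set MISSED bit forces later packets through the queue until a lock holder has drained it; the requeue, DRAINING and stopped-queue cases listed above are left out on purpose.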
    • net: sched: avoid unnecessary seqcount operation for lockless qdisc · dd25296a
      Yunsheng Lin authored
      qdisc->running seqcount operation is mainly used to do heuristic
      locking on q->busylock for locked qdisc, see qdisc_is_running()
      and __dev_xmit_skb().
      
      So avoid doing seqcount operation for qdisc with TCQ_F_NOLOCK
      flag.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
      Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd25296a
  3. 22 Jun, 2021 23 commits
    • net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload · f1267798
      Huy Nguyen authored
      The packet is a VXLAN packet over an IPsec transport mode tunnel,
      which has the following format: [IP1 | ESP | UDP | VXLAN | IP2 | TCP]
      NVIDIA ConnectX card cannot do checksum offload for two L4 headers.
      The solution is using the checksum partial offload similar to
      VXLAN | TCP packet. Hardware calculates IP1, IP2 and TCP checksums and
      software calculates UDP checksum. However, unlike VXLAN | TCP case,
      IPsec's mlx5 driver cannot access the inner plaintext IP protocol type.
      Therefore, inner_ipproto is added in the sec_path structure
      to provide this information. Also, utilize the skb's csum_start to
      program L4 inner checksum offset.
      
      While at it, remove the call to mlx5e_set_eseg_swp and set up the software
      parser fields directly in mlx5e_ipsec_set_swp. mlx5e_set_eseg_swp is not
      needed as the two features (GENEVE and IPsec) are different and adding
      this sharing layer creates unnecessary complexity and affects
      performance.
      
      For a VXLAN packet over an IPsec tunnel mode tunnel, checksum offload
      is disabled because the hardware does not support checksum offload for
      three L3 (IP) headers.
      Signed-off-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Huy Nguyen <huyn@nvidia.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      f1267798
    • net/xfrm: Add inner_ipproto into sec_path · fa453523
      Huy Nguyen authored
      The inner_ipproto saves the inner IP protocol of the plaintext
      packet. This allows a vendor's IPsec feature to make the offload
      decision at the skb's features_check and to configure hardware at
      ndo_start_xmit.
      
      For example, ConnectX6-DX IPsec device needs the plaintext's
      IP protocol to support partial checksum offload on
      VXLAN/GENEVE packet over IPsec transport mode tunnel.
      Signed-off-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Huy Nguyen <huyn@nvidia.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      fa453523
    • net/mlx5: Optimize mlx5e_feature_checks for non IPsec packet · dd7cf00f
      Huy Nguyen authored
      mlx5e_ipsec_feature_check belongs to mlx5e_tunnel_features_check.
      Also, IPsec is not the default configuration so it should be
      checked at the end instead of the beginning of mlx5e_features_check.
      Signed-off-by: Raed Salem <raeds@nvidia.com>
      Signed-off-by: Huy Nguyen <huyn@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      dd7cf00f
    • net/mlx5: remove "default n" from Kconfig · 5bf3ee97
      caihuoqing authored
      Remove "default n", since "n" is already the default.
      Signed-off-by: caihuoqing <caihuoqing@baidu.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      5bf3ee97
    • net/mlx5: Fix spelling mistake "enught" -> "enough" · 2cc7dad7
      Colin Ian King authored
      There is a spelling mistake in a mlx5_core_err error message. Fix it.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      2cc7dad7
    • net/mlx5: Use cpumask_available() in mlx5_eq_create_generic() · d4472a4b
      Nathan Chancellor authored
      When CONFIG_CPUMASK_OFFSTACK is unset, cpumask_var_t is not a pointer
      but a single element array, meaning its address in a structure cannot be
      NULL as long as it is not the first element, which it is not. This
      results in a clang warning:
      
      drivers/net/ethernet/mellanox/mlx5/core/eq.c:715:14: warning: address of
      array 'param->affinity' will always evaluate to 'true'
      [-Wpointer-bool-conversion]
              if (!param->affinity)
                  ~~~~~~~~^~~~~~~~
      1 warning generated.
      
      The helper cpumask_available was added in commit f7e30f01 ("cpumask:
      Add helper cpumask_available()") to handle situations like this so use
      it to keep the meaning of the code the same while resolving the warning.
      
      Fixes: e4e3f24b ("net/mlx5: Provide cpumask at EQ creation phase")
      Link: https://github.com/ClangBuiltLinux/linux/issues/1400
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      d4472a4b
    • net/mlx5: Fix missing error code in mlx5_init_fs() · 9201ab5f
      Jiapeng Chong authored
      The error code is missing in this code path; set the return value
      'err' to '-ENOMEM'.
      
      Eliminate the following smatch warning:
      
      drivers/net/ethernet/mellanox/mlx5/core/fs_core.c:2973 mlx5_init_fs()
      warn: missing error code 'err'.
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Fixes: 4a98544d ("net/mlx5: Move chains ft pool to be used by all firmware steering").
      Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      9201ab5f
    • Merge branch 'mptcp-C-flag-and-fixes' · 38f75922
      David S. Miller authored
      Mat Martineau says:
      
      ====================
      mptcp: Connection-time 'C' flag and two fixes
      
      Here are six more patches from the MPTCP tree.
      
      Most of them add support for the 'C' flag in the MPTCP connection-time
      option headers. This flag affects how the initial address and port are
      treated by each peer. Normally one peer may send MP_JOIN requests to the
      remote address and port that were used when initiating the MPTCP
      connection. The 'C' bit indicates that MP_JOINs should only be sent to
      remote addresses that have been advertised with ADD_ADDR.
      
      The other two patches are unrelated improvements.
      
      Patches 1-4: Add the 'C' flag feature, a sysctl to optionally enable it,
      and a selftest.
      
      Patch 5: Adjust rp_filter settings in a selftest.
      
      Patch 6: Improve rbuf cleanup for MPTCP sockets.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      38f75922
    • mptcp: refine mptcp_cleanup_rbuf · fde56eea
      Paolo Abeni authored
      The current cleanup rbuf tries a bit too hard to avoid acquiring
      the subflow socket lock. We may end up delaying the needed ack,
      or skip acking a blocked subflow.
      
      Address the above by extending the conditions used to trigger the cleanup
      to reflect more closely what TCP does and by invoking tcp_cleanup_rbuf()
      on all the active subflows.
      
      Note that we can't replicate the exact tests implemented in
      tcp_cleanup_rbuf(), as MPTCP lacks some of the required info - e.g.
      ping-pong mode.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fde56eea
    • selftests: mptcp: turn rp_filter off on each NIC · d8e336f7
      Yonglong Li authored
      To turn rp_filter off we should:
      
        echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter
      
      and
      
        echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
      
      before the NIC is created.
      Co-developed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d8e336f7
    • selftests: mptcp: add deny_join_id0 testcases · 0cddb4a6
      Geliang Tang authored
      This patch adds a new argument '-d' to the mptcp_join.sh script, to invoke
      the testcases for the MP_CAPABLE 'C' flag.
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0cddb4a6
    • mptcp: add deny_join_id0 in mptcp_options_received · df377be3
      Geliang Tang authored
      This patch adds a new flag named deny_join_id0 in struct
      mptcp_options_received. It is set when an MP_CAPABLE option with the flag
      MPTCP_CAP_DENY_JOIN_ID0 is received.
      
      Also add a new flag remote_deny_join_id0 in struct mptcp_pm_data. When the
      flag deny_join_id0 is set, set this remote_deny_join_id0 flag.
      
      In mptcp_pm_create_subflow_or_signal_addr, if the remote_deny_join_id0 flag
      is set, and the remote address id is zero, stop this connection.
      Suggested-by: Florian Westphal <fw@strlen.de>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df377be3
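The receive-side handling in this commit amounts to testing one bit of the MP_CAPABLE flags octet and then refusing joins towards address id 0. The sketch below is illustrative only: the bit value 0x20 is an assumption derived from 'C' being the third of the eight flags A..H with A as the most significant bit, and parse_deny_join_id0() and may_join() are hypothetical helper names, not kernel functions.

```python
# Assumption: flags A..H occupy one octet with A as the most significant
# bit, which puts the 'C' flag at bit 5 (0x20).
MPTCP_CAP_DENY_JOIN_ID0 = 1 << 5


def parse_deny_join_id0(mp_capable_flags: int) -> bool:
    """Return True when the received MP_CAPABLE flags carry the 'C' bit."""
    return bool(mp_capable_flags & MPTCP_CAP_DENY_JOIN_ID0)


def may_join(remote_deny_join_id0: bool, remote_addr_id: int) -> bool:
    """Decide whether a new subflow towards `remote_addr_id` is allowed.

    Address id 0 is the address and port of the initial subflow; a peer
    that set the 'C' bit only accepts joins on addresses it advertised.
    """
    return not (remote_deny_join_id0 and remote_addr_id == 0)


if __name__ == "__main__":
    flags = 0x01 | MPTCP_CAP_DENY_JOIN_ID0  # some other bit plus 'C'
    deny = parse_deny_join_id0(flags)
    print(deny, may_join(deny, 0), may_join(deny, 1))
```

Keeping the parsed bit as a per-connection flag, as the commit does with remote_deny_join_id0, means the option only has to be decoded once.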
    • mptcp: add allow_join_id0 in mptcp_out_options · bab6b88e
      Geliang Tang authored
      This patch defines a new flag MPTCP_CAP_DENY_JOIN_ID0 for the third bit,
      labeled "C", of the MP_CAPABLE option.
      
      Add a new flag allow_join_id0 in struct mptcp_out_options. If this flag is
      set, send out the MP_CAPABLE option with the flag MPTCP_CAP_DENY_JOIN_ID0.
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bab6b88e
    • mptcp: add sysctl allow_join_initial_addr_port · d2f77960
      Geliang Tang authored
      This patch adds a new sysctl, named allow_join_initial_addr_port, to
      control whether to allow peers to send join requests to the IP address and
      port number used by the initial subflow.
      Suggested-by: Florian Westphal <fw@strlen.de>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d2f77960
    • Merge branch 'sctp-packetization-path-MTU' · a432c771
      David S. Miller authored
      Xin Long says:
      
      ====================
      sctp: implement RFC8899: Packetization Layer Path MTU Discovery for SCTP transport
      
      Overview (from RFC8899):
      
        In contrast to PMTUD, Packetization Layer Path MTU Discovery
        (PLPMTUD) [RFC4821] introduces a method that does not rely upon
        reception and validation of PTB messages.  It is therefore more
        robust than Classical PMTUD.  This has become the recommended
        approach for implementing discovery of the PMTU [BCP145].
      
        It uses a general strategy in which the PL sends probe packets to
        search for the largest size of unfragmented datagram that can be sent
        over a network path.  Probe packets are sent to explore using a
        larger packet size.  If a probe packet is successfully delivered (as
        determined by the PL), then the PLPMTU is raised to the size of the
        successful probe.  If a black hole is detected (e.g., where packets
        of size PLPMTU are consistently not received), the method reduces the
        PLPMTU.
      
      SCTP Probe Packets:
      
        As the RFC suggested, the probe packets consist of an SCTP common header
        followed by a HEARTBEAT chunk and a PAD chunk. The PAD chunk is used to
        control the length of the probe packet.  The HEARTBEAT chunk is used to
        trigger the sending of a HEARTBEAT ACK chunk to confirm this probe on
        the HEARTBEAT sender.
      
        The HEARTBEAT chunk also carries a Heartbeat Information parameter that
        includes the probe size to help an implementation associate a HEARTBEAT
        ACK with the size of probe that was sent. The sender uses the nonce and
        the probe size to verify the information returned.
      
      Detailed Implementation on SCTP:
      
                             +------+
                    +------->| Base |-----------------+ Connectivity
                    |        +------+                 | or BASE_PLPMTU
                    |           |                     | confirmation failed
                    |           |                     v
                    |           | Connectivity    +-------+
                    |           | and BASE_PLPMTU | Error |
                    |           | confirmed       +-------+
                    |           |                     | Consistent
                    |           v                     | connectivity
         Black Hole |       +--------+                | and BASE_PLPMTU
          detected  |       | Search |<---------------+ confirmed
                    |       +--------+
                    |          ^  |
                    |          |  |
                    |    Raise |  | Search
                    |    timer |  | algorithm
                    |  expired |  | completed
                    |          |  |
                    |          |  v
                    |   +-----------------+
                    +---| Search Complete |
                        +-----------------+
      
        When PLPMTUD is enabled, the transport starts in Base state and probes
        with BASE_PLPMTU (1200). If this probe succeeds, it goes to Search
        state. If this probe fails, it goes to Error state, in which pl.pmtu
        drops to MIN_PLPMTU (512) and probing with BASE_PLPMTU continues until
        a probe succeeds and the transport goes to Search state.
      
        In Search state, the probe size initially grows by a Big step (32)
        each time the last probe succeeds. Once a probe (such as 1420) fails
        after MAX_PROBES (3) attempts, probe_size goes back to the last
        successful size (1420 - 32 = 1388), 'probe_high' is set to 1420 and
        the growing step becomes a Small one (4). Probing then continues,
        growing by a Small step each round. When the probe with the next size
        (1404) fails, the optimal size (such as 1400) has been found: it is
        synced to pathmtu and the transport goes to Complete state.
      
        In Complete state, it only does a probe check for the pathmtu just
        set. If that probe fails, a Black Hole is detected and it goes back
        to Base state. If it succeeds, it goes back to Search state and
        probing continues with a Small step (1400 + 4). If this probe fails,
        probe_high is set, probe_size goes back to 1388 and the transport
        returns to Complete state, which normally forms a loop. However, if
        the environment's pathmtu grows, this probe will succeed and probing
        continues with a Big step (1400 + 32) each round until another probe
        fails.
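      
        The walk through the states can be sketched as a toy model, in Python
        for illustration only. It follows the description above, not the
        kernel code: the Plpmtud class, its method names and the run() helper
        are hypothetical, and one call stands for one probe outcome (a failure
        means the probe already went unanswered MAX_PROBES times). With a
        1200 start and a step of 32 the big-step sizes are 1200, 1232, ...,
        1392, 1424, so the figures differ slightly from the illustrative
        1420/1388 above.

```python
BASE_PLPMTU, MIN_PLPMTU = 1200, 512   # values quoted in the text
BIG_STEP, SMALL_STEP = 32, 4          # values quoted in the text


class Plpmtud:
    """Toy model of the four states; one call = one probe outcome."""

    def __init__(self, dst_mtu=1500):
        self.state = "Base"
        self.pmtu = BASE_PLPMTU        # pl.pmtu
        self.probe_size = BASE_PLPMTU
        self.probe_high = 0            # 0 means no failed size is recorded
        self.pathmtu = dst_mtu         # only synced when a search completes

    def probe_ok(self):
        """The probe of size probe_size was acknowledged."""
        self.pmtu = self.probe_size
        if self.state in ("Base", "Error"):
            self.state = "Search"
            self.probe_size += BIG_STEP
        elif self.state == "Search":
            if not self.probe_high:
                self.probe_size += BIG_STEP
                return
            self.probe_size += SMALL_STEP
            if self.probe_size >= self.probe_high:
                # Next size is known to fail: the search is complete.
                self.state = "Complete"
                self.probe_high = 0
                self.pathmtu = self.pmtu
                self.probe_size = self.pmtu
        elif self.state == "Complete":
            self.state = "Search"
            self.probe_size += SMALL_STEP

    def probe_failed(self):
        """The probe of size probe_size got no answer after MAX_PROBES tries."""
        if self.state == "Base":
            self.state = "Error"
            self.pmtu = MIN_PLPMTU
        elif self.state == "Error":
            pass                       # keep probing with BASE_PLPMTU
        elif self.probe_size == self.pmtu:
            # Black hole: the size in use is no longer delivered.
            self.state = "Base"
            self.pmtu = self.probe_size = BASE_PLPMTU
            self.probe_high = 0
        else:
            # Normal failure in Search: remember the failed size, step back.
            self.probe_high = self.probe_size
            self.probe_size = self.pmtu


def run(path_limit, rounds):
    """Feed the model a path that delivers probes up to path_limit bytes."""
    pl, trace = Plpmtud(), []
    for _ in range(rounds):
        size = pl.probe_size
        if size <= path_limit:
            pl.probe_ok()
        else:
            pl.probe_failed()
        trace.append((size, pl.state))
    return pl, trace
```

        With a path that delivers up to 1400 bytes, run(1400, 13) ends in
        Complete state with pathmtu set to 1400. Further rounds alternate
        between Search and Complete, which is the loop described above.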
      
      PTB Messages Process:
      
        PLPMTUD doesn't rely on these packets to find the pmtu, and shouldn't
        trust them either. When processing them, it only changes probe_size
        to PL_PTB_SIZE (info - hlen) if 'pl.pmtu < PL_PTB_SIZE < the current
        probe_size' during Search state, as this can help probe_size reach
        the optimal size faster. For example:
      
        pl.pmtu = 1388, probe_size = 1420, while the env's pathmtu = 1400.
        When probe_size is 1420, a Toobig packet with 1400 comes back. If probe
        size changes to use 1400, it will save quite a few rounds to get there.
        But of course after having this value, PLPMTUD will still verify it on
        its own before using it.
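      
        The Search-state rule above can be sketched as follows, in Python for
        illustration only. The function name ptb_hint is hypothetical, and the
        hlen of 20 used in the example call is an assumed header length, not a
        value taken from the patches:

```python
def ptb_hint(pl_pmtu: int, probe_size: int, info: int, hlen: int) -> int:
    """Return the probe size to use after a PTB message in Search state.

    info is the MTU reported by the PTB message and hlen the header length
    subtracted from it, so PL_PTB_SIZE = info - hlen.
    """
    pl_ptb_size = info - hlen
    if pl_pmtu < pl_ptb_size < probe_size:
        # The hint narrows the search; it is still verified by a real probe.
        return pl_ptb_size
    return probe_size  # hint is useless or implausible: keep the current size


# The example from the text: pl.pmtu = 1388, probe_size = 1420, hint = 1400.
next_size = ptb_hint(pl_pmtu=1388, probe_size=1420, info=1400 + 20, hlen=20)
```

        The returned size only replaces probe_size. pl.pmtu is untouched until
        a probe of that size is acknowledged.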
      
      Patches:
      
        - Patch 1-6: introduce some new constants/variables from the RFC, sysctl
          and members in transport, APIs for the following patches, chunks and
          a timer for the probe sending and some code for the probe receiving.
      
        - Patch 7-9: implement the state transition on the tx path, rx path and
          toobig ICMP packet processing. This is the main algorithm part.
      
        - Patch 10: activate this feature
      
        - Patch 11-14: improve the process for ICMP packets for SCTP over UDP,
          so that it can also be covered by this feature.
      
      Tests:
      
        - do sysctl and setsockopt tests for this feature's enabling and disabling.
      
        - get these pr_debug points for this feature by
            # cat /sys/kernel/debug/dynamic_debug/control | grep PLP
          and enable them on kernel dynamic debug, then play with the pathmtu and
          check if the state transition and plpmtu change match the RFC.
      
        - do the above tests for SCTP over IPv4/IPv6 and SCTP over UDP.
      
      v1->v2:
        - See Patch 06/14.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a432c771
    • Xin Long's avatar
      sctp: process sctp over udp icmp err on sctp side · 9e47df00
      Xin Long authored
      Previously, sctp over udp was using udp tunnel's icmp err process, which
      only does sk lookup on sctp side. However, sctp's icmp error process has
      more things to do, like syncing assoc pmtu/retransmitting packets for
      toobig type err, and starting proto_unreach_timer for unreach type err
      etc.
      
      PLPMTUD, now being added, also requires toobig type err to be processed
      on sctp side. This patch processes icmp err on sctp side by parsing the
      type/code/info in .encap_err_lookup and calling sctp's icmp processing
      functions. Note that as the 'redirect' err process needs to know the
      outer ip(v6) header, we have to leave it to udp(v6)_err to handle it.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e47df00
    • Xin Long's avatar
      sctp: extract sctp_v4_err_handle function from sctp_v4_err · d8306075
      Xin Long authored
      This patch is to extract sctp_v4_err_handle() from sctp_v4_err() to
      only handle the icmp err after the sock lookup, and it also makes
      the code clearer.
      
      sctp_v4_err_handle() will be used in sctp over udp's err handling
      in the following patch.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d8306075
    • Xin Long's avatar
      sctp: extract sctp_v6_err_handle function from sctp_v6_err · f6549bd3
      Xin Long authored
      This patch is to extract sctp_v6_err_handle() from sctp_v6_err() to
      only handle the icmp err after the sock lookup, and it also makes
      the code clearer.
      
      sctp_v6_err_handle() will be used in sctp over udp's err handling
      in the following patch.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f6549bd3
    • Xin Long's avatar
      sctp: remove the unessessary hold for idev in sctp_v6_err · 237a6a2e
      Xin Long authored
      Same as in tcp_v6_err() and __udp6_lib_err(), there's no need to
      hold idev in sctp_v6_err(), so just call __in6_dev_get() instead.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      237a6a2e
    • Xin Long's avatar
      sctp: enable PLPMTUD when the transport is ready · 7307e4fa
      Xin Long authored
      sctp_transport_pl_reset() is called whenever any of these 3 members in
      transport is changed:
      
        - probe_interval
        - param_flags & SPP_PMTUD_ENABLE
        - state == ACTIVE
      
      If all are true, start the PLPMTUD when it's not yet started. If any of
      these is false, stop the PLPMTUD when it's already running.
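      
      The start/stop rule can be sketched as follows, in Python for
      illustration only. pl_should_run and pl_reset_action are hypothetical
      names, the string "ACTIVE" stands in for the transport's active state,
      and the numeric value of SPP_PMTUD_ENABLE is an assumption made so the
      example runs:

```python
SPP_PMTUD_ENABLE = 1 << 3   # assumed flag value, for illustration only
ACTIVE = "ACTIVE"           # stand-in for the transport's active state


def pl_should_run(probe_interval: int, param_flags: int, state: str) -> bool:
    """All three conditions from the list above must hold."""
    return (probe_interval > 0
            and bool(param_flags & SPP_PMTUD_ENABLE)
            and state == ACTIVE)


def pl_reset_action(running: bool, probe_interval: int,
                    param_flags: int, state: str) -> str:
    """Decide what a reset does: 'start', 'stop' or 'none'."""
    wanted = pl_should_run(probe_interval, param_flags, state)
    if wanted and not running:
        return "start"
    if not wanted and running:
        return "stop"
    return "none"
```

      Calling the same function from every place that changes one of the
      three members keeps the decision in one spot, which matches the "is
      called whenever any of these 3 members ... is changed" wording.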
      
      sctp_transport_pl_update() is called when the transport dst has changed.
      It will restart the PLPMTUD probe. Again, the pathmtu doesn't change and
      keeps using the dst's mtu until the Search phase is done.
      
      Note that with PLPMTUD in use, the pathmtu is only initialized with the
      dst mtu when the transport dst changes. At other times it is updated by
      pl.pmtu. So sctp_transport_pmtu_check() will be called in
      sctp_packet_config() only when PLPMTUD is disabled.
      
      After this patch, the PLPMTUD feature from RFC8899 will be activated
      and can be used by users.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7307e4fa
    • Xin Long's avatar
      sctp: do state transition when receiving an icmp TOOBIG packet · 83696408
      Xin Long authored
      PLPMTUD will short-circuit the old process for icmp TOOBIG packets.
      This part is described in rfc8899#section-4.6.2 (PL_PTB_SIZE =
      PTB_SIZE - other_headers_len). Note that from rfc8899#section-5.2
      State Machine, each case below is for some specific states only:
      
        a) PL_PTB_SIZE < MIN_PLPMTU || PL_PTB_SIZE >= PROBED_SIZE,
           discard it, for any state
      
        b) MIN_PLPMTU < PL_PTB_SIZE < BASE_PLPMTU,
           Base -> Error, for Base state
      
        c) BASE_PLPMTU <= PL_PTB_SIZE < PLPMTU,
           Search -> Base or Complete -> Base, for Search and Complete states.
      
        d) PLPMTU < PL_PTB_SIZE < PROBED_SIZE,
           set pl.probe_size to PL_PTB_SIZE then verify it, for Search state.
      
      The most important one is case d), which helps find the optimal size
      quickly during searching. For example, when pathmtu = 1392 for SCTP over
      IPv4, the search will be (20 is iphdr_len):
      
        1. probe with 1200 - 20
        2. probe with 1232 - 20
        3. probe with 1264 - 20
        ...
        7. probe with 1388 - 20
        8. probe with 1420 - 20
      
      When sending the probe with 1420 - 20, TOOBIG may come with PL_PTB_SIZE =
      1392 - 20. Then it matches case d), and saves some rounds to try with the
      1392 - 20 probe. But of course, PLPMTUD doesn't trust TOOBIG packets, and
      it will go back to the common searching once the probe with the new size
      can't be verified.
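      
      Cases a) to d) can be sketched as one classifier, in Python for
      illustration only. classify_ptb and the returned labels are
      hypothetical, and the constants use the 512 and 1200 values quoted in
      the cover letter. The comparisons are taken literally from the list, so
      values the list does not cover (for example PL_PTB_SIZE equal to
      PLPMTU) fall through to "no_action":

```python
MIN_PLPMTU = 512     # value quoted in the cover letter
BASE_PLPMTU = 1200   # value quoted in the cover letter


def classify_ptb(state: str, pl_ptb_size: int,
                 plpmtu: int, probed_size: int) -> str:
    """Map a PTB message to the action listed for cases a) to d)."""
    # a) implausible in any state: too small, or not below what was probed
    if pl_ptb_size < MIN_PLPMTU or pl_ptb_size >= probed_size:
        return "discard"
    # b) Base state only: the base size itself is too large for the path
    if state == "Base" and MIN_PLPMTU < pl_ptb_size < BASE_PLPMTU:
        return "to_error"
    # c) Search or Complete: the size in use is too large, start over
    if state in ("Search", "Complete") and BASE_PLPMTU <= pl_ptb_size < plpmtu:
        return "to_base"
    # d) Search only: a hint between the confirmed and the probed size
    if state == "Search" and plpmtu < pl_ptb_size < probed_size:
        return "use_as_probe_size"
    return "no_action"


# The example from the text: probing 1420 - 20 with 1388 - 20 confirmed,
# and a TOOBIG message reporting 1392 - 20.
action = classify_ptb("Search", 1392 - 20, 1388 - 20, 1420 - 20)
```

      Here action is "use_as_probe_size", so the next probe is sent with
      1392 - 20 and is verified like any other probe.
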
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      83696408
    • Xin Long's avatar
      sctp: do state transition when a probe succeeds on HB ACK recv path · b87641af
      Xin Long authored
      As described in rfc8899#section-5.2, when a probe succeeds, there might
      be the following state transitions:
      
        - Base -> Search, occurs when probe succeeds with BASE_PLPMTU,
          pl.pmtu is not changing,
          pl.probe_size increases by SCTP_PL_BIG_STEP,
      
        - Error -> Search, occurs when probe succeeds with BASE_PLPMTU,
          pl.pmtu is changed from SCTP_MIN_PLPMTU to SCTP_BASE_PLPMTU,
          pl.probe_size increases by SCTP_PL_BIG_STEP.
      
        - Search -> Search Complete, occurs when probe succeeds with the probe
          size SCTP_PL_MIN_STEP less than pl.probe_high,
          pl.pmtu is not changing, but update *pathmtu* with it,
          pl.probe_size is set back to pl.pmtu to double check it.
      
        - Search Complete -> Search, occurs when probe succeeds with the probe
          size equal to pl.pmtu,
          pl.pmtu is not changing,
          pl.probe_size increases by SCTP_PL_MIN_STEP.
      
      So the search process can be described as:
      
       1. When it just enters 'Search' state, *pathmtu* is not updated with
          pl.pmtu, and probe_size increases by a big step (SCTP_PL_BIG_STEP)
          each round.
      
       2. This continues until a probe fails, at which point pl.probe_high is
          set and probe_size decreases back to pl.pmtu, as described in the
          last patch.
      
       3. When the probe with the new size succeeds, probe_size changes to
          increase by a small step (SCTP_PL_MIN_STEP) because pl.probe_high
          is set.
      
       4. Once probe_size is next to pl.probe_high, the searching finishes:
          it goes to 'Complete' state and updates *pathmtu* with pl.pmtu, and
          then probe_size is set to pl.pmtu to confirm it with one more probe.
      
       5. This probe occurs after "30 * probe_interval", a much longer time than
          that in Search state. Once it is done it goes to 'Search' state again
          with probe_size increased by SCTP_PL_MIN_STEP.
      
      As we can see above, during the searching, pl.pmtu changes while *pathmtu*
      doesn't. *pathmtu* is only updated when the search finishes by which it
      gets an optimal value for it. A big step is used at the beginning until
      it gets close to the optimal value, then it changes to a small step until
      it has this optimal value.
      
      The small step is also used in 'Complete' until it goes to 'Search' state
      again and the probe with 'pmtu + the small step' succeeds, which means a
      higher size could be used. Then probe_size changes to increase by a big
      step again until it gets close to the next optimal value.
      
      Note that whenever a black hole is detected, it goes directly to 'Base'
      state with pl.pmtu set to SCTP_BASE_PLPMTU, as described in the last patch.
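      
      The choice of step and of probe delay described above can be sketched
      as two small helpers, in Python for illustration only. The function
      names are hypothetical, and pairing SCTP_PL_BIG_STEP and
      SCTP_PL_MIN_STEP with the 32 and 4 quoted in the cover letter is an
      assumption:

```python
SCTP_PL_BIG_STEP = 32   # assumed: the "Big step" quoted in the cover letter
SCTP_PL_MIN_STEP = 4    # assumed: the "Small step" quoted in the cover letter
COMPLETE_SCALE = 30     # from "30 * probe_interval"


def step_after_success(state: str, probe_high: int) -> int:
    """Step added to probe_size after a successful probe in `state`.

    probe_high == 0 means no failed size has been recorded yet. In Search
    state with probe_high set, reaching probe_high ends the search instead
    of probing further; that check is left to the caller.
    """
    if state in ("Base", "Error"):
        return SCTP_PL_BIG_STEP
    if state == "Search":
        return SCTP_PL_MIN_STEP if probe_high else SCTP_PL_BIG_STEP
    if state == "Complete":
        return SCTP_PL_MIN_STEP
    raise ValueError(f"unknown state: {state}")


def next_probe_delay(state: str, probe_interval: int) -> int:
    """Delay before the next probe, in the same unit as probe_interval."""
    scale = COMPLETE_SCALE if state == "Complete" else 1
    return scale * probe_interval
```

      The long delay in 'Complete' state keeps the confirmation probes rare
      once an optimal value is in use, while the search itself stays quick.
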
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b87641af
    • Xin Long's avatar
      sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path · 1dc68c19
      Xin Long authored
      The state transition is described in rfc8899#section-5.2,
      PROBE_COUNT == MAX_PROBES means the probe fails for MAX times, and the
      state transition includes:
      
        - Base -> Error, occurs when BASE_PLPMTU Confirmation Fails,
          pl.pmtu is set to SCTP_MIN_PLPMTU,
          probe_size is still SCTP_BASE_PLPMTU;
      
        - Search -> Base, occurs when Black Hole Detected,
          pl.pmtu is set to SCTP_BASE_PLPMTU,
          probe_size is set back to SCTP_BASE_PLPMTU;
      
        - Search Complete -> Base, occurs when Black Hole Detected
          pl.pmtu is set to SCTP_BASE_PLPMTU,
          probe_size is set back to SCTP_BASE_PLPMTU;
      
      Note a black hole is encountered when a sender is unaware that packets
      are not being delivered to the destination endpoint. So it includes the
      probe failures with probe_size equal to pl.pmtu, and definitely does not
      include those with probe_size greater than pl.pmtu. The latter is the
      normal probe failure, where probe_size should decrease back to pl.pmtu
      and pl.probe_high is set.  pl.probe_high would be used on HB ACK recv
      path in the next patch.
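      
      The counting and the black hole test can be sketched as follows, in
      Python for illustration only. ProbeCounter, before_send, on_ack and
      classify_failure are hypothetical names, and counting on the send path
      in this way is one possible reading of "PROBE_COUNT == MAX_PROBES",
      not a copy of the patch:

```python
MAX_PROBES = 3   # value quoted in the cover letter


class ProbeCounter:
    """Counts unanswered probes of the current size on the send path."""

    def __init__(self):
        self.probe_count = 0

    def before_send(self) -> bool:
        """Call before each probe is sent.

        Returns True when the current size has already gone unanswered
        MAX_PROBES times and must be treated as failed.
        """
        failed = self.probe_count == MAX_PROBES
        if failed:
            self.probe_count = 0
        self.probe_count += 1
        return failed

    def on_ack(self):
        """A HEARTBEAT ACK for the probe arrived: start counting afresh."""
        self.probe_count = 0


def classify_failure(state: str, pmtu: int, probe_size: int) -> str:
    """Name the kind of failure once MAX_PROBES is reached."""
    if state == "Base":
        return "base_confirmation_failed"   # Base -> Error
    if state in ("Search", "Complete") and probe_size == pmtu:
        return "black_hole"                 # back to Base
    if state == "Search":
        return "normal_failure"             # set probe_high, step back
    return "keep_probing"                   # e.g. Error state
```

      Only a failure at the size already in use (probe_size equal to pmtu)
      is a black hole. A failure at a larger size just bounds the search.
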
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1dc68c19