1. 30 Jun, 2020 18 commits
    • mptcp: check for plain TCP sock at accept time · d2f77c53
      Paolo Abeni authored
      This cleans up the code a bit and avoids corrupted states
      on weird syscall sequences (accept(), connect()).
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d2f77c53
    • mptcp: fallback in case of simultaneous connect · 8fd73804
      Davide Caratti authored
      When an MPTCP client tries to connect to itself, tcp_finish_connect() is
      never reached. Because of this, depending on the socket's current state,
      multiple faulty behaviours can be observed:
      
      1) a WARN_ON() in subflow_data_ready() is hit
       WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230
       [...]
       CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187
       [...]
       RIP: 0010:subflow_data_ready+0x18b/0x230
       [...]
       Call Trace:
        tcp_data_queue+0xd2f/0x4250
        tcp_rcv_state_process+0xb1c/0x49d3
        tcp_v4_do_rcv+0x2bc/0x790
        __release_sock+0x153/0x2d0
        release_sock+0x4f/0x170
        mptcp_shutdown+0x167/0x4e0
        __sys_shutdown+0xe6/0x180
        __x64_sys_shutdown+0x50/0x70
        do_syscall_64+0x9a/0x370
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      2) client is stuck forever in mptcp_sendmsg() because the socket is not
         TCP_ESTABLISHED
      
       crash> bt 4847
       PID: 4847   TASK: ffff88814b2fb100  CPU: 1   COMMAND: "gh35"
        #0 [ffff8881376ff680] __schedule at ffffffff97248da4
        #1 [ffff8881376ff778] schedule at ffffffff9724a34f
        #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0
        #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba
        #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859
        #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca
        #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b
        #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa
        #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52
        #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f
       #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26
       #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba
       #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c
           RIP: 00007f126f6956ed  RSP: 00007ffc2a320278  RFLAGS: 00000217
           RAX: ffffffffffffffda  RBX: 0000000020000044  RCX: 00007f126f6956ed
           RDX: 0000000000000004  RSI: 00000000004007b8  RDI: 0000000000000003
           RBP: 00007ffc2a3202a0   R8: 0000000000400720   R9: 0000000000400720
           R10: 0000000000400720  R11: 0000000000000217  R12: 00000000004004b0
           R13: 00007ffc2a320380  R14: 0000000000000000  R15: 0000000000000000
           ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
      
      3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake
         didn't complete.
      
       $ tcpdump -tnnr bad.pcap
       IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0
       IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0
       IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0
       IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0
       IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0
      
      Force a fallback to TCP in these cases, and adjust the main socket
      state to avoid hanging in mptcp_sendmsg().
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35
      Reported-by: Christoph Paasch <cpaasch@apple.com>
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fd73804
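The self-connect scenario described in this commit can be approximated from user space. The following is a minimal sketch, not the reporter's original reproducer: the helper name `self_connect` is hypothetical, and the protocol number 262 for `IPPROTO_MPTCP` is used as a fallback because older Python versions do not export the constant. The socket binds to a local port and then connects to its own address, so no listener is involved and the handshake is a TCP simultaneous open.

```python
import socket

# IPPROTO_MPTCP is 262 on Linux; the socket module only exports the
# constant on newer Python versions, so fall back to the raw number.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def self_connect(proto=0, host="127.0.0.1", timeout=2.0):
    """Bind a stream socket to a local port, then connect it to that
    same address. There is no listener, so the socket's SYN is answered
    by the socket itself (a simultaneous open).

    proto=0 selects plain TCP; pass IPPROTO_MPTCP to request MPTCP.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, proto)
    try:
        sock.settimeout(timeout)          # avoid hanging forever
        sock.bind((host, 0))              # let the kernel pick a port
        sock.connect(sock.getsockname())  # connect to our own address
    except OSError:
        sock.close()
        raise
    return sock
```

Calling `self_connect(IPPROTO_MPTCP)` on a kernel with MPTCP enabled exercises the path this commit touches; writing to the returned socket is the step that, per the commit message, could hang in mptcp_sendmsg() before the fix.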
    • net: mptcp: improve fallback to TCP · e1ff9e82
      Davide Caratti authored
      Keep using MPTCP sockets and use a "dummy mapping" in case of fallback
      to regular TCP. When fallback is triggered, skip addition of the MPTCP
      option on send.
      
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11
      Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22
      Co-developed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1ff9e82
    • net: phy: marvell10g: support XFI rate matching mode · e1170333
      Baruch Siach authored
      When the MACTYPE hardware configuration pins are set to "XFI
      with Rate Matching", the PHY interface operates at a fixed 10Gbps speed. The
      MAC buffers packets in both directions to match various wire speeds.
      
      Read the MAC Type field in the Port Control register, and set the MAC
      interface speed accordingly.
      Signed-off-by: Baruch Siach <baruch@tkos.co.il>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e1170333
    • Merge tag 'mlx5-tls-2020-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 10780291
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-tls-2020-06-26
      
      1) Improve hardware layouts and structure for kTLS support
      
      2) Generalize ICOSQ (Internal Channel Operations Send Queue)
      Due to the asynchronous nature of adding new kTLS flows and handling
      HW asynchronous kTLS resync requests, the XSK ICOSQ was extended to
      support generic async operations, such as kTLS add flow and resync, in
      addition to the existing XSK usages.
      
      3) kTLS hardware flow steering and classification:
      The driver already has the means to classify TCP IPv4/IPv6 flows and send
      them to the corresponding RSS HW engine. As reflected in patches 3 through 5,
      the series adds a steering layer that hooks into the driver's TCP
      classifiers and matches on well-known kTLS connections. On a match,
      traffic is redirected to the kTLS decryption engine; otherwise,
      traffic continues flowing normally to the TCP RSS engine.
      
      3) kTLS add flow RX HW offload support
      New offload contexts post their static/progress params WQEs
      (Work Queue Element) to communicate the newly added kTLS contexts
      over the per-channel async ICOSQ.
      
      The Channel/RQ is selected according to the socket's rxq index.
      
      A new TLS-RX workqueue is used to allow asynchronous addition of
      steering rules, out of the NAPI context.
      It will also be used in a downstream patch in the resync procedure.
      
      Feature is OFF by default. Can be turned on by:
      $ ethtool -K <if> tls-hw-rx-offload on
      
      4) Added mlx5 kTLS sw stats and new counters are documented in
      Documentation/networking/tls-offload.rst
      rx_tls_ctx - number of TLS RX HW offload contexts added to device for
      decryption.
      
      rx_tls_ooo - number of RX packets which were part of a TLS stream
      but did not arrive in the expected order and triggered the resync
      procedure.
      
      rx_tls_del - number of TLS RX HW offload contexts deleted from device
      (connection has finished).
      
      rx_tls_err - number of RX packets which were part of a TLS stream
       but were not decrypted due to unexpected error in the state machine.
      
      5) Asynchronous RX resync
      
      a. The NIC driver indicates that it would like to resync on some TLS
      record within the received packet (P), but the driver does not
      know (yet) which of the TLS records within the packet.
      At this stage, the NIC driver will query the device to find the exact
      TCP sequence for resync (tcpsn), however, the driver does not wait
      for the device to provide the response.
      
      b. Eventually, the device responds, and the driver provides the tcpsn
      within the resync packet to KTLS. Now, KTLS can check the tcpsn against
      any processed TLS records within packet P, and also against any record
      that is processed in the future within packet P.
      
      The asynchronous resync path simplifies the device driver, as it can
      save bits on the packet completion (32-bit TCP sequence), and pass this
      information on an asynchronous command instead.
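The two-step asynchronous resync described in (a) and (b) can be pictured with a toy model. This is only a sketch of the idea, not the kernel or mlx5 implementation: the class `AsyncResyncTracker` and its method names are hypothetical, and TLS records are reduced to the TCP sequence number at which each record starts.

```python
class AsyncResyncTracker:
    """Toy model of an asynchronous RX resync handshake."""

    def __init__(self):
        self.pending = False  # a resync request is outstanding
        self.logged = []      # record starts seen since the request
        self.tcpsn = None     # device answer, once it arrives

    def request(self):
        """Step (a): ask the device for the resync point and return
        immediately, without waiting for the answer."""
        self.pending = True
        self.logged = []
        self.tcpsn = None

    def record_processed(self, start_seq):
        """Called for each TLS record the stack processes. Returns True
        when this record confirms the resync point."""
        if not self.pending:
            return False
        if self.tcpsn is None:
            # No answer yet: remember the record for a later check.
            self.logged.append(start_seq)
            return False
        if start_seq == self.tcpsn:
            self.pending = False
            return True
        return False

    def device_response(self, tcpsn):
        """Step (b): the device answered. Check the answer against the
        records already processed; otherwise keep it for future ones."""
        if not self.pending:
            return False
        if tcpsn in self.logged:
            self.pending = False
            return True
        self.tcpsn = tcpsn
        return False
```

The model covers both orderings mentioned in the text: the device answer may refer to a record that was already processed, or to one that is processed later.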
      
      Performance:
          CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off
          NIC: ConnectX-6 Dx 100GbE dual port
      
          Goodput (app-layer throughput) comparison:
          +---------------+-------+-------+---------+
          | # connections |   1   |   4   |    8    |
          +---------------+-------+-------+---------+
          | SW (Gbps)     |  7.26 | 24.70 |   50.30 |
          +---------------+-------+-------+---------+
          | HW (Gbps)     | 18.50 | 64.30 |   92.90 |
          +---------------+-------+-------+---------+
          | Speedup       | 2.55x | 2.56x | 1.85x * |
          +---------------+-------+-------+---------+
      
          * After linerate is reached, diff is observed in CPU util
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10780291
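The "Speedup" row of the goodput table in the cover letter above is the HW figure divided by the SW figure. A short sketch that recomputes it from the published numbers (the variable and function names are illustrative only):

```python
# Goodput figures in Gbps, copied from the table in the cover letter.
connections = [1, 4, 8]
sw_gbps = [7.26, 24.70, 50.30]
hw_gbps = [18.50, 64.30, 92.90]


def speedups(sw, hw):
    """Return the HW/SW goodput ratio per column, rounded to 2 places."""
    return [round(h / s, 2) for s, h in zip(sw, hw)]


result = speedups(sw_gbps, hw_gbps)
for n, ratio in zip(connections, result):
    print(f"{n} connection(s): {ratio:.2f}x")
```

Recomputed this way, the 1- and 8-connection columns match the table (2.55x and 1.85x), while the 4-connection column comes out at about 2.60x rather than the 2.56x listed. The table is left as published; the difference may come from rounding of the underlying measurements.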
    • Merge branch 'TC-Introduce-qevents' · 989d957a
      David S. Miller authored
      Petr Machata says:
      
      ====================
      TC: Introduce qevents
      
      The Spectrum hardware allows execution of one of several actions as a
      result of queue management decisions: tail-dropping, early-dropping,
      marking a packet, or passing a configured latency threshold or buffer
      size. Such packets can be mirrored, trapped, or sampled.
      
      Modeling the action to be taken as simply a TC action is very attractive,
      but it is not obvious where to put these actions. At least with ECN marking
      one could imagine a tree of qdiscs and classifiers that effectively
      accomplishes this task, albeit in an impractically complex manner. But
      there is just no way to match on dropped-ness of a packet, let alone
      dropped-ness due to a particular reason.
      
      To allow configuring user-defined actions as a result of inner workings of
      a qdisc, this patch set introduces a concept of qevents. Those are attach
      points for TC blocks, where filters can be put that are executed as the
      packet hits well-defined points in the qdisc algorithms. The attached
      blocks can be shared, in a manner similar to clsact ingress and egress
      blocks, arbitrary classifiers with arbitrary actions can be put on them,
      etc.
      
      For example:
      
      	red limit 500K avpkt 1K qevent early_drop block 10
      	matchall action mirred egress mirror dev eth1
      
      The central patch #2 introduces several helpers to allow easy and uniform
      addition of qevents to qdiscs: initialization, destruction, qevent block
      number change validation, and qevent handling, i.e. dispatch of the filters
      attached to the block bound to a qevent.
      
      Patch #1 adds root_lock argument to qdisc enqueue op. The problem this is
      tackling is that if a qevent filter pushes packets to the same qdisc tree
      that holds the qevent in the first place, attempt to take qdisc root lock
      for the second time will lead to a deadlock. To solve the issue, qevent
      handler needs to unlock and relock the root lock around the filter
      processing. Passing root_lock around makes it possible to get the lock
      where it is needed, and visibly so, such that it is obvious the lock will
      be used when invoking a qevent.
      
      The following two patches, #3 and #4, then add two qevents to the RED
      qdisc: "early_drop" qevent fires when a packet is early-dropped; "mark"
      qevent, when it is ECN-marked.
      
      Patch #5 contains a selftest. I have mentioned this test when pushing the
      RED ECN nodrop mode and said that "I have no confidence in its portability
      to [...] different configurations". That still holds. The backlog and
      packet size are tuned to make the test deterministic. But it is better than
      nothing, and on the boxes that I ran it on it does work and shows that
      qevents work the way they are supposed to, and that their addition has not
      broken the other tested features.
      
      This patch set does not deal with offloading. The idea there is that a
      driver will be able to figure out that a given block is used in qevent
      context by looking at binder type. A future patch-set will add a qdisc
      pointer to struct flow_block_offload, which a driver will be able to
      consult to glean the TC or other relevant attributes.
      
      Changes from RFC to v1:
      - Move a "q = qdisc_priv(sch)" from patch #3 to patch #4
      - Fix deadlock caused by mirroring packet back to the same qdisc tree.
      - Rename "tail" qevent to "tail_drop".
      - Adapt to the new 100-column standard.
      - Add a selftest
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      989d957a
    • selftests: forwarding: Add a RED test for SW datapath · 6cf0291f
      Petr Machata authored
      This test is inspired by the mlxsw RED selftest. It is much simpler to set
      up (also because there is no point in testing PRIO / RED encapsulation). It
      tests bare RED, ECN and ECN+nodrop modes of operation. On top of that it
      tests RED early_drop and mark qevents.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6cf0291f
    • net: sched: sch_red: Add qevents "early_drop" and "mark" · aee9caa0
      Petr Machata authored
      In order to allow acting on dropped and/or ECN-marked packets, add two new
      qevents to the RED qdisc: "early_drop" and "mark". Filters attached at
      "early_drop" block are executed as packets are early-dropped, those
      attached at the "mark" block are executed as packets are ECN-marked.
      
      Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block
      index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark"
      qevent. Absence of these attributes signifies "don't care": no block is
      allocated in that case, or the existing blocks are left intact in case of
      the change callback.
      
      For purposes of offloading, blocks attached to these qevents appear with
      newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and
      FLOW_BLOCK_BINDER_TYPE_RED_MARK.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aee9caa0
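The "don't care" semantics of the two block-index attributes can be sketched as a small model. This is an illustration under stated assumptions, not the sch_red code: `QeventBlock`, `qevent_init` and `qevent_validate_change` are hypothetical Python stand-ins, with `None` representing an absent netlink attribute. The rule that a present attribute must match the current value follows the description of tcf_qevent_validate_change() in the helpers commit of this series.

```python
class QeventBlock:
    """Stand-in for one qevent attach point (e.g. "early_drop")."""

    def __init__(self):
        self.block_index = None  # None means no block is bound


def qevent_init(qevent, attr):
    """Init callback: bind a block only if the attribute was supplied."""
    if attr is not None:
        qevent.block_index = attr


def qevent_validate_change(qevent, attr):
    """Change callback: an absent attribute leaves things intact; a
    present one is accepted only if it equals the current index."""
    return attr is None or attr == qevent.block_index


early_drop = QeventBlock()
qevent_init(early_drop, 10)  # as in "qevent early_drop block 10"
```

With this model, a later change request that omits the attribute or repeats block 10 is accepted, while one naming a different block is rejected.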
    • net: sched: sch_red: Split init and change callbacks · 65545ea2
      Petr Machata authored
      In the following patches, RED will get two qevents. The implementation will
      be clearer if the callback for change is not a pure subset of the callback
      for init. Split the two and promote attribute parsing to the callbacks
      themselves from the common code, because it will be handy there.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65545ea2
    • net: sched: Introduce helpers for qevent blocks · 3625750f
      Petr Machata authored
      Qevents are attach points for TC blocks, where filters can be put that are
      executed when "interesting events" take place in a qdisc. The data to keep
      and the functions to invoke to maintain a qevent will be largely the same
      between qevents. Therefore introduce sched-wide helpers for qevent
      management.
      
      Currently, similarly to ingress and egress blocks of the clsact pseudo-qdisc,
      block attachment cannot be changed after the qdisc is created. To that
      end, add a helper tcf_qevent_validate_change(), which verifies that the
      block index attribute is either absent or, if present, that its value
      matches the current one (i.e. there is no material change).
      
      The function tcf_qevent_handle() should be invoked when qdisc hits the
      "interesting event" corresponding to a block. This function releases root
      lock for the duration of executing the attached filters, to allow packets
      generated through user actions (notably mirred) to be reinserted to the
      same qdisc tree.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3625750f
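The reason tcf_qevent_handle() drops the root lock while filters run can be shown with a toy model. This is a sketch of the locking pattern only, not the kernel code: `ToyQdisc` and `mirror_to_same_qdisc` are hypothetical, a non-reentrant `threading.Lock` stands in for the qdisc root lock, and a packet flagged `early_drop` stands in for the "interesting event".

```python
import threading


class ToyQdisc:
    """Toy qdisc with one qevent and a non-reentrant root lock."""

    def __init__(self):
        self.root_lock = threading.Lock()
        self.queue = []
        self.filters = []  # callables run when the qevent fires

    def enqueue(self, pkt):
        with self.root_lock:
            if pkt.get("early_drop"):
                self._qevent_handle(pkt)
                return  # the packet itself is dropped
            self.queue.append(pkt)

    def _qevent_handle(self, pkt):
        # Release the root lock while filters run, so a filter may
        # enqueue into this same qdisc, then take the lock back.
        self.root_lock.release()
        try:
            for run_filter in self.filters:
                run_filter(self, pkt)
        finally:
            self.root_lock.acquire()


def mirror_to_same_qdisc(qdisc, pkt):
    """Filter that reinserts a copy of the packet into the same qdisc."""
    qdisc.enqueue({"id": pkt["id"], "mirrored": True})


qdisc = ToyQdisc()
qdisc.filters.append(mirror_to_same_qdisc)
qdisc.enqueue({"id": 1, "early_drop": True})
```

If `_qevent_handle` kept the lock held, the nested `enqueue()` call from the filter would block forever on the same lock, which is the deadlock the commit message describes.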
    • net: sched: Pass root lock to Qdisc_ops.enqueue · aebe4426
      Petr Machata authored
      A following patch introduces qevents, points in the qdisc algorithm where a
      packet can be processed by user-defined filters. Should this processing
      lead to a situation where a new packet is to be enqueued on the same port,
      holding the root lock would lead to deadlocks. To solve the issue, the qevent
      handler needs to unlock and relock the root lock when necessary.
      
      To that end, add the root lock argument to the qdisc op enqueue, and
      propagate throughout.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aebe4426
    • Merge branch 'net-ethernet-ti-am65-cpsw-update-and-enable-sr2-0-soc' · 5e701e49
      David S. Miller authored
      Grygorii Strashko says:
      
      ====================
      net: ethernet: ti: am65-cpsw: update and enable sr2.0 soc
      
      This series contains a set of improvements for the TI AM654x/J721E CPSW2G driver and
      adds support for the TI AM654x SR2.0 SoC.
      
      Patch 1: adds VLAN restoration after "if down/up"
      Patches 2-5: improvements
      Patch 6: adds support for the TI AM654x SR2.0 SoC, which allows disabling the errata i2027 W/A.
      By default, the errata i2027 W/A (TX csum offload disabled) is enabled on the AM654x SoC
      for backward compatibility, unless an SR2.0 SoC is identified using the SOC BUS framework.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e701e49
    • net: ethernet: ti: am65-cpsw-nuss: enable am65x sr2.0 support · 38389aa6
      Grygorii Strashko authored
      The AM65x SR2.0 MCU CPSW has fixed errata i2027 "CPSW: CPSW Does Not
      Support CPPI Receive Checksum (Host to Ethernet) Offload Feature". This
      erratum is also fixed for the J721E SoC.
      
      Use SOC bus data for K3 SoC identification and apply i2027 errata w/a only
      for the AM65x SR1.0 SoC.
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      38389aa6
    • net: ethernet: ti: am65-cpsw-ethtool: configured critical setting only when no running netdevs · 3d0fda90
      Grygorii Strashko authored
      Ensure that critical setting can only be configured when there are no
      running netdevs - all ports are down.
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d0fda90
    • net: ethernet: ti: am65-cpsw-ethtool: skip hw cfg when change p0-rx-ptype-rrobin · 7d58d3eb
      Grygorii Strashko authored
      Skip HW configuration when p0-rx-ptype-rrobin is changed as it will be done
      by .ndev_open().
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d58d3eb
    • net: ethernet: ti: am65-cpsw-nuss: fix ports mac sl initialization · d6d0aeaf
      Grygorii Strashko authored
      The MAC SL has to be initialized for each port; otherwise
      am65_cpsw_nuss_slave_disable_unused() will crash for disabled ports.
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d6d0aeaf
    • net: ethernet: ti: am65-cpsw: move to pf_p0_rx_ptype_rrobin init in probe · 51824048
      Grygorii Strashko authored
      The pf_p0_rx_ptype_rrobin is a global parameter, so move its initialization to
      probe.
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      51824048
    • net: ethernet: ti: am65-cpsw-nuss: restore vlan configuration while down/up · 7bcffde0
      Grygorii Strashko authored
      The VLAN configuration is not restored after an interface down/up sequence.
      
      Steps to check:
       # ip link add link eth0 name eth0.100 type vlan id 100
       # ifconfig eth0 down
       # ifconfig eth0 up
      
      This patch fixes it, restoring vlan ALE entries on .ndo_open().
      
      Fixes: 93a76530 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
      Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7bcffde0
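The behaviour this fix restores can be pictured with a toy model of the down/up sequence. It is a sketch under stated assumptions, not the am65-cpsw driver: `ToyPort` is hypothetical, a Python set stands in for the hardware ALE VLAN entries, and the model assumes those entries are lost whenever the port is stopped.

```python
class ToyPort:
    """Toy port: software VLAN list plus a stand-in for ALE entries."""

    def __init__(self):
        self.vlans = set()    # VLAN IDs configured by the stack
        self.ale = set()      # stand-in for hardware ALE VLAN entries
        self.running = False

    def add_vlan(self, vid):
        self.vlans.add(vid)
        if self.running:
            self.ale.add(vid)

    def stop(self):
        # Assumption of this model: hardware entries do not survive
        # the port going down.
        self.running = False
        self.ale.clear()

    def open(self):
        self.running = True
        # The fix in a nutshell: re-add every known VLAN on open.
        self.ale.update(self.vlans)


port = ToyPort()
port.open()
port.add_vlan(100)  # mirrors "ip link add ... type vlan id 100"
port.stop()         # mirrors "ifconfig eth0 down"
port.open()         # mirrors "ifconfig eth0 up"
```

Without the `update()` call in `open()`, VLAN 100 would remain configured in software but be missing from the hardware table after the down/up cycle, which matches the symptom in the "Steps to check" above.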
  2. 29 Jun, 2020 22 commits