Commit f3a3e248 authored by David S. Miller

Merge branch 'net-smc'

Ursula Braun says:

====================
net/smc: Shared Memory Communications - RDMA

here is now V4 of the SMC-R patches having processed your feedback from end
of November. The most important change is the replacement of sysfs by a
generic netlink solution in patch 04. And I tried to get rid of the __packed
attributes. There are still a few usages left due to SMC-R protocol defined
structures.

V4 changes:
The order of patches 03 and 04 for pnet table management and SMC IB-client
establishing has been exchanged, since pnet table management is now built on
top of smc_ib_devices.
Patch 01: Use EXPORT_SYMBOL_GPL().
Patch 02: Define "use_fallback" as bool.
          Get rid of useless smc_sock fields clearing in smc_sock_alloc(),
          since sk_alloc() clears out the memory.
Patch 03: Postpone smc_ib_remember_port_attr() call till ib_device is
          mentioned in the pnet table.
Patch 04: Replace sysfs-usage by a generic netlink approach for pnet table
          configuration.
          Change layout of pnet table entries to reference net_device and
          ib_device instead of dealing with names of net_devices and
          ib_devices.
Patch 05: Adapt "use_fallback" usages to new type bool.
          Get rid of useless smc_sock fields clearing in smc_sock_alloc()
          Avoid __packed where possible.
          Check if clc responses are not too big.
Patch 09: Postpone smc_setup_per_ibdev till the first connection with this
          ib_device is really created.
Patch 11: Get rid of __packed usage.

V3 changes:
Patch 05: Remove unneeded DEFINE_WAIT
Patch 06: Improve synchronization of link group creation
Patch 07: Rename peer_rmbe_len into peer_rmbe_size to be more consistent
Patch 09: Avoid calls of ib_get_memory_region with IB_ACCESS_LOCAL_WRITE,
          use new default local_dma_lkey from protection domain as lkey
          instead.
          Remove no longer needed function smc_ib_dereg_memory_region().
Patch 14: Switch to state ACTIVE only if still in state INIT.
          Return 0 for recvmsg invoked in a socket closing state.
          Allow getname call in state APPCLOSEWAIT1
          Do not trigger destruction of a socket-in-error queued in accept
          queue.
          During cleanup of accept queue, make sure sockets are destructed,
          and sockets in fallback mode are handled appropriately.
          When freeing sndbufs/rmbs, remove them from their list and free
          the entry.
          Use add_wait_queue() and remove_wait_queue() in close wait
          functions.
          If actively closing a socket in state PEERFINCLOSEWAIT, keep
          this state.
          If passively closing a socket while bytes are to be received, move
          to state APPCLOSEWAIT1.
          If actively aborting a socket, skip sending the close_abort flag,
          since RDMA communication is no longer possible.
          When terminating a link group, do not schedule link group freeing a
          2nd time, since already done when unregistering the last remaining
          connection.
Patch 15: Introduce smc_diag module for monitoring SMC protocol sockets.
          This replaces the old patch 0015 dealing with procfs.

V2 changes:
Patch 0002: Add SMC versions for family key strings in net/core/sock.c.
Patch 0006: initialize rb_tree.
Patch 0007: Get rid of unneeded use of xchg() in smc_sndbuf_unuse() and
            smc_rmb_unuse().
Patch 0008: Correct error checking logic for ib_function calls.
            Define struct smc_link field wr_tx_id as atomic_long_t.
            Use "do_div" instead of "%" to be architecture-independent.
Patch 0009: Correct error checking logic for ib_function calls.
Patch 0011: Remove xchg() calls in cursor handling. Use atomic64_t for cursor
            overlays on 64-bit architectures. If not available, use plain u64
            and add locking for cursor reading and writing.
            Implement smc_curs_add() without modulo operator "%".
Patch 0012: Remove xchg() calls in cursor handling.
            Implement smc_tx_rdma_writes() without modulo operator "%".
Patch 0013: Remove xchg() calls in cursor handling.
Patch 0014: Return type bool in smc_wr_tx_has_pending().
            Remove unneeded semicolon in smc_close_shutdown_write().
            Call smc_close_active() in non-fallback case only.
            Get rid of duplicate schedule of sock_put_work().
            Take nested sock_lock in smc_listen_work().
            Start close stream_wait in case of prepared sends only.
Patch 0015: Remove unneeded socket ref_count in smc_proc_seq_show().
            Take lock before list_empty check in smc_proc_sock_list_del().

These patches are the initial part of the implementation of the
"Shared Memory Communications-RDMA" (SMC-R) protocol as defined in
RFC7609 [1]. While SMC-R does not aim to replace TCP,
it taps a wealth of existing data center TCP socket applications
to become more efficient without the need for rewriting them.
SMC-R uses RDMA over Converged Ethernet (RoCE) to save CPU consumption.
For instance, when running 10 parallel connections with uperf, we measured
a decrease of 60% in CPU consumption with SMC-R compared to TCP/IP
(with throughput and latency comparable;
measured on x86_64 with the same RoCE card and port).

SMC-R does not require an RDMA communication manager (RDMA CM).

SMC-R inherits TCP qualities such as reliable connections, host-based
firewall packet filtering (on connection establishment) and unmodified
application of communication encryption such as TLS (transport layer
security) or SSL (secure sockets layer). Since original TCP is used to
establish SMC-R connections, load balancers and packet inspection based
on TCP/IP connection establishment continue to work for SMC-R.

On the other hand, using SMC-R implies:
- either involving a preload library when invoking the unchanged TCP-application
  or slightly modifying the source by simply changing the socket family in
  the socket() call
- accepting extra overhead and latency in connection establishment due to
  SMC Connection Layer Control (CLC) handshake
- explicit coupling of RoCE ports with Ethernet ports
- not routable as currently built on RoCE V1
- bypassing of packet-based networking features
    - filtering (netfilter)
    - sniffing (libpcap, packet sockets, (E)BPF)
    - traffic control (scheduling, shaping)
- bypassing of IP-header based socket options
- bypassing of memory buffer (pressure) management
- unusable together with IPsec

Overview of the SMC-R Protocol described in informational RFC 7609

SMC-R is an open protocol that provides RDMA capabilities over RoCE
transparently for applications exploiting TCP sockets.
A new socket protocol family PF_SMC is introduced.
There are no changes required to applications using the sockets API for TCP
stream sockets other than the specification of the new socket family AF_SMC.
Unmodified applications can be used by means of a dynamic preload shared
library which rewrites the socket API call
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) into
socket(AF_SMC,  SOCK_STREAM, IPPROTO_TCP).
SMC-R re-uses the address family AF_INET for all addressing purposes around
struct sockaddr.
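
The family rewrite performed by such a preload library can be sketched as
below. This is a minimal illustration only, not the actual library: the helper
name smc_preload_family() is hypothetical, and the fallback AF_SMC value of 43
is taken from the linux/socket.h hunk of this series. A real shim would also
have to interpose socket() itself (for example via LD_PRELOAD and dlsym()),
which is omitted here.

```c
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <sys/socket.h>   /* AF_INET, SOCK_STREAM */

#ifndef AF_SMC
#define AF_SMC 43         /* number reserved by this series in linux/socket.h */
#endif

/* Hypothetical helper: decide which family a preload shim would pass on.
 * Only the exact call described above, socket(AF_INET, SOCK_STREAM,
 * IPPROTO_TCP), is redirected; every other combination is left untouched.
 */
static int smc_preload_family(int domain, int type, int protocol)
{
	if (domain == AF_INET && type == SOCK_STREAM &&
	    protocol == IPPROTO_TCP)
		return AF_SMC;
	return domain;
}
```

Since addressing still uses AF_INET, the struct sockaddr_in passed to bind()
or connect() needs no change after the rewrite.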

SMC-R system architecture layers:

+=============================================================================+
|                                      | unmodified TCP application           |
| native SMC application               +--------------------------------------+
|                                      | dynamic preload shared library       |
+=============================================================================+
|                                 SMC socket                                  |
+-----------------------------------------------------------------------------+
|                    | TCP socket (for connection establishment and fallback) |
| IB verbs           +--------------------------------------------------------+
|                    | IP                                                     |
+--------------------+--------------------------------------------------------+
| RoCE device driver | some network device driver                             |
+=============================================================================+

Terms:

A link group is determined by an ordered peer pair of TCP client and TCP server
(IP addresses and subnet). Reversed client/server roles result in a separate
link group.
A link is a logical point-to-point connection based on an
infiniband reliable connected queue pair (RC-QP) between two RoCE ports
(MACs and GIDs) of a peer pair.
A link group can have 1..8 links for failover and load balancing.
This initial Linux implementation always has 1 link per link group.
Each link group on a peer can have 1..255 remote memory buffers (RMBs).
If more RMBs are needed, a peer can open another link group
(this initial Linux implementation) or fall back to TCP.
Each RMB has its own particular size and its own (R)DMA mapping and credentials
(rtoken consisting of rkey and RDMA "virtual address").
This initial Linux implementation uses physically contiguous memory for RMBs
but we are working towards scattered memory because of memory fragmentation.
Each RMB has 1..255 RMB elements (RMBEs) of equal size
to provide multiplexing of connections within an RMB.
An RMBE is the RDMA Write destination organized as wrapping ring buffer
for data transmit of a particular connection in one direction
(duplex by means of mirror symmetry as with TCP).
This initial Linux implementation always has 1 RMBE per RMB
and thus an individual RMB for each connection.

SMC-R connection establishment with subsequent data transfer:

   CLIENT                                                   SERVER

TCP three-way handshake:
                         regular TCP SYN
      -------------------------------------------------------->
                       regular TCP SYN ACK
      <--------------------------------------------------------
                         regular TCP ACK
      -------------------------------------------------------->

SMC Connection Layer Control (CLC) handshake
exchanges RDMA credentials between peers:
             via above TCP connection: SMC CLC Proposal
      -------------------------------------------------------->
              via above TCP connection: SMC CLC Accept
      <--------------------------------------------------------
             via above TCP connection: SMC CLC Confirm
      -------------------------------------------------------->

SMC Link Layer Control (LLC) (only once per link, i.e. 1st conn. of link group):
                 RoCE RC-QP: SMC LLC Confirm Link
      <========================================================
             RoCE RC-QP: SMC LLC Confirm Link response
      ========================================================>

SMC data transmission (incl. SMC Connection Data Control (CDC) message):
                       RoCE RC-QP: RDMA Write
      ========================================================>
             RoCE RC-QP: SMC CDC message (flow control)
      ========================================================>
                          ...

                       RoCE RC-QP: RDMA Write
      <========================================================
             RoCE RC-QP: SMC CDC message (flow control)
      <========================================================
                          ...

Data flow within an established connection:

+----------------------------------------------------------------------------
|            SENDER
| sendmsg()
|    |
|    | produces into sndbuf [sender's process context]
|    v
| +--------+
| | sndbuf | [ring buffer]
| +--------+
|    |
|    | consumes from sndbuf and produces into receiver's RMBE [any context]
|    | by sending RDMA Write followed by SMC CDC message over RoCE RC-QP
|    |
+----|-----------------------------------------------------------------------
     |
+----|-----------------------------------------------------------------------
|    v       RECEIVER
| +------+
| | RMBE | [ring buffer, can have size different from sender's sndbuf]
| |      | [RMBE represents rcvbuf, no further de-coupling as on sender side]
| +------+
|    |
|    | consumes from RMBE [receiver's process context]
|    v
| recvmsg()
+----------------------------------------------------------------------------
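
Producing into a wrapping ring buffer, as sendmsg() does with the sndbuf in
the picture above, can be sketched as follows. The struct and function names
are hypothetical and the code is a user-space illustration of the wrap-around
copy only; it is not the code of this series and leaves out locking and the
RDMA Write itself.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical ring buffer descriptor (illustration only). */
struct ring {
	unsigned char *buf;   /* backing memory */
	size_t size;          /* capacity in bytes */
	size_t prod;          /* producer offset, 0 <= prod < size */
	size_t space;         /* free bytes remaining */
};

/* Copy up to len bytes into the ring, wrapping at the end of the buffer.
 * Returns the number of bytes actually produced (limited by free space).
 */
static size_t ring_produce(struct ring *r, const void *src, size_t len)
{
	size_t first;

	if (len > r->space)
		len = r->space;
	first = r->size - r->prod;        /* bytes until the wrap point */
	if (first > len)
		first = len;
	memcpy(r->buf + r->prod, src, first);
	memcpy(r->buf, (const unsigned char *)src + first, len - first);
	r->prod += len;
	if (r->prod >= r->size)           /* wrap without using "%" */
		r->prod -= r->size;
	r->space -= len;
	return len;
}
```

A short return value tells the caller that the buffer is full and the
remaining bytes have to wait until the consumer has freed space.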

Flow control ("cursor" updates) by means of SMC CDC messages:

               SENDER                            RECEIVER

        sends updates via CDC-------------+   sends updates via CDC
        on consuming from sndbuf          |   on consuming from RMBE
        and producing into RMBE           |   by means of recvmsg()
                                          |            |
                                          |            |
      +-----------------------------------|------------+
      |                                   |
   +--v-------------------------+      +--v-----------------------+
   | receiver's consumer cursor |      | sender's producer cursor----+
   +----------------|-----------+      +--------------------------+  |
                    |                                                |
                    |                        receiver's RMBE         |
                    |                  +--------------------------+  |
                    |                  |                          |  |
                    +--------------------------------+            |  |
                                       |             |            |  |
                                       |             v            |  |
                                       |             +------------|  |
                                       |-------------+////////////|  |
                                       |//RDMA data written by////|  |
                                       |////sender that is////////|  |
                                       |/available to be consumed/|  |
                                       |///////// +---------------|  |
                                       |----------+^              |  |
                                       |           |              |  |
                                       |           +-----------------+
                                       |                          |
                                       +--------------------------+

Sending updates of the producer cursor is immediate for low latency;
something like Nagle's algorithm (absence of TCP_NODELAY) is optional and
currently not part of this initial Linux implementation.
Sending updates of the consumer cursor is conditional to avoid the
silly window syndrome.
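
The cursor arithmetic behind these updates can be sketched as follows. A
cursor is a (wrap, count) pair, where count is the offset in the RMBE and wrap
is incremented each time the offset passes the end of the buffer. The names
below are hypothetical and the code is a simplified user-space illustration,
assuming that a single step never exceeds the buffer size and that the two
cursors are at most one wrap apart.

```c
#include <stdint.h>

/* Hypothetical cursor: offset in a ring buffer plus a wrap counter. */
struct cursor {
	uint16_t wrap;    /* window wrap sequence number */
	uint32_t count;   /* offset within the buffer */
};

/* Advance a cursor by value bytes (value <= size), without "%". */
static void curs_add(uint32_t size, struct cursor *c, uint32_t value)
{
	c->count += value;
	if (c->count >= size) {
		c->wrap++;
		c->count -= size;
	}
}

/* Bytes between an older and a newer cursor, e.g. consumer and producer.
 * Assumes the newer cursor is at most one wrap ahead of the older one.
 */
static uint32_t curs_diff(uint32_t size, const struct cursor *old,
			  const struct cursor *cur)
{
	if (old->wrap != cur->wrap)
		return (size - old->count) + cur->count;
	return cur->count > old->count ? cur->count - old->count : 0;
}
```

With the consumer cursor as old and the producer cursor as cur, curs_diff()
yields the number of bytes available to be consumed, i.e. the hatched area in
the picture above.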

Normal connection termination:

Normal connection termination starts transitioning from socket state
ACTIVE via either "Active Close" or "Passive Close".

shutdown rdwr               +-----------------+
or close,   +-------------->|  INIT / CLOSED  |<-------------+
send PeerCon|nClosed        +-----------------+              | PeerConnClosed
            |                       |                        | received
            |            connection | established            |
            |                       V                        |
    +----------------+     +-----------------+     +----------------+
    |AppFinCloseWait |     |     ACTIVE      |     |PeerFinCloseWait|
    +----------------+     +-----------------+     +----------------+
            |                   |         |                   |
            |     Active Close: |         |Passive Close:     |
            |     close or      |         |PeerConnClosed or  |
            |     shutdown wr or|         |PeerDoneWriting    |
            |     shutdown rdwr |         |received           |
            |                   V         V                   |
 PeerConnClo|sed    +--------------+   +-------------+        | close or
 received   +--<----|PeerCloseWait1|   |AppCloseWait1|--->----+ shutdown rdwr,
            |       +--------------+   +-------------+        | send
            |  PeerDoneWri|ting                | shutdown wr, | PeerConnClosed
            |  received   |            send Pee|rDoneWriting  |
            |             V                    V              |
            |       +--------------+   +-------------+        |
            +--<----|PeerCloseWait2|   |AppCloseWait2|--->----+
                    +--------------+   +-------------+

In state CLOSED, the socket can be destructed only once the application has
issued a close().
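
The normal-close transitions can be written down as a small transition
function. The sketch below follows only the arrows of the diagram above as
read here; the enum and function names are hypothetical, shutdown rdwr is
folded into EV_CLOSE, and the sending of PeerDoneWriting and PeerConnClosed is
not modelled. It is not the implementation of this series.

```c
/* Hypothetical states and events mirroring the normal-close diagram. */
enum st {
	ST_ACTIVE, ST_PEERCLOSEWAIT1, ST_PEERCLOSEWAIT2,
	ST_APPCLOSEWAIT1, ST_APPCLOSEWAIT2,
	ST_APPFINCLOSEWAIT, ST_PEERFINCLOSEWAIT, ST_CLOSED
};

enum ev {
	EV_CLOSE,              /* close or shutdown rdwr by the application */
	EV_SHUT_WR,            /* shutdown wr by the application */
	EV_PEER_DONE_WRITING,  /* PeerDoneWriting received */
	EV_PEER_CONN_CLOSED    /* PeerConnClosed received */
};

/* Return the next state; events without an arrow keep the state. */
static enum st next_state(enum st s, enum ev e)
{
	switch (s) {
	case ST_ACTIVE:        /* active close vs. passive close */
		return (e == EV_CLOSE || e == EV_SHUT_WR) ?
			ST_PEERCLOSEWAIT1 : ST_APPCLOSEWAIT1;
	case ST_PEERCLOSEWAIT1:
		if (e == EV_PEER_DONE_WRITING)
			return ST_PEERCLOSEWAIT2;
		return e == EV_PEER_CONN_CLOSED ? ST_APPFINCLOSEWAIT : s;
	case ST_PEERCLOSEWAIT2:
		return e == EV_PEER_CONN_CLOSED ? ST_APPFINCLOSEWAIT : s;
	case ST_APPCLOSEWAIT1:
		if (e == EV_SHUT_WR)
			return ST_APPCLOSEWAIT2;
		return e == EV_CLOSE ? ST_PEERFINCLOSEWAIT : s;
	case ST_APPCLOSEWAIT2:
		return e == EV_CLOSE ? ST_PEERFINCLOSEWAIT : s;
	case ST_APPFINCLOSEWAIT:
		return e == EV_CLOSE ? ST_CLOSED : s;
	case ST_PEERFINCLOSEWAIT:
		return e == EV_PEER_CONN_CLOSED ? ST_CLOSED : s;
	default:
		return s;
	}
}
```

Both paths end in ST_CLOSED only after the local application has closed and
the peer has signalled PeerConnClosed, in either order.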

Abnormal connection termination:

                            +-----------------+
            +-------------->|  INIT / CLOSED  |<-------------+
            |               +-----------------+              |
            |                                                |
            |           +-----------------------+            |
            |           |     Any state         |            |
 PeerConnAbo|rt         | (before setting       |            | send
 received   |           |  PeerConnClosed       |            | PeerConnAbort
            |           |  indicator in         |            |
            |           |  peer's RMBE)         |            |
            |           +-----------------------+            |
            |                   |         |                  |
            |     Active Abort: |         | Passive Abort:   |
            |     problem,      |         | PeerConnAbort    |
            |     send          |         | received,        |
            |     PeerConnAbort,|         | ECONNRESET       |
            |     ECONNABORTED  |         |                  |
            |                   V         V                  |
            |       +--------------+   +--------------+      |
            +-------|PeerAbortWait |   | ProcessAbort |------+
                    +--------------+   +--------------+

Implementation notes beyond RFC 7609:

A PNET table, configured via generic netlink (see patch 04), provides the
mapping between network device names and RoCE Infiniband device names for the
transparent switch of data communication.
A PNET table can contain an arbitrary number of PNETIDs.
Each PNETID contains exactly one (Ethernet) network device name
and one or more RoCE Infiniband device names.
Each device name can only exist in at most one PNETID (no overlapping).
This initial Linux implementation allows at most one RoCE Infiniband device
name per PNETID.
After a new TCP connection is established, the network device name
used for egress traffic with the TCP connection's local source IP address
is used as key to lookup the unique PNETID, and the RoCE Infiniband device
of this PNETID is used to switch data communication from TCP to RDMA
during SMC CLC handshake.
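
The lookup described above can be sketched as a search keyed by the Ethernet
device name. The entry layout mirrors the SMC_PNETID_NAME, SMC_PNETID_ETHNAME,
SMC_PNETID_IBNAME and SMC_PNETID_IBPORT netlink attributes of this series, but
the struct and function names are hypothetical, and the sketch works on names,
whereas the V4 table entries reference net_device and ib_device.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical PNET table entry (illustration only). */
struct pnet_entry {
	const char *pnetid;     /* PNETID name */
	const char *ethname;    /* Ethernet network device name */
	const char *ibname;     /* RoCE Infiniband device name */
	unsigned char ibport;   /* port of the Infiniband device */
};

/* Find the entry whose Ethernet device carries the TCP connection.
 * Each device name exists in at most one PNETID, so the first match
 * is the only one. Returns NULL if the device is not in the table.
 */
static const struct pnet_entry *
pnet_find_by_eth(const struct pnet_entry *tbl, size_t n, const char *ethname)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(tbl[i].ethname, ethname) == 0)
			return &tbl[i];
	return NULL;
}
```

A NULL result corresponds to the case where no RoCE device is coupled with the
Ethernet device, so the connection would stay on TCP.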

Problem determination:

A protocol dissector is available with upstream wireshark for formatting
SMC-R related RoCE LAN traffic.
[https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-smcr.c]

We are working on enhancing the Linux implementation to cover:

- Improve default socket closing asynchronicity
- Address corner cases with many parallel connections
- Tracing
- Integrated load balancing and fail-over within a link group
- Splice and sendpage support
- IPv6 addressing support
- Keepalive, Cork
- Namespaces support
- Urgent data
- More socket options
- Diagnostics
- Statistics support
- SNMP support

References:

[1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents c8584b3f f16a7dd5
......@@ -10850,6 +10850,13 @@ S: Maintained
F: drivers/staging/media/st-cec/
F: Documentation/devicetree/bindings/media/stih-cec.txt
SHARED MEMORY COMMUNICATIONS (SMC) SOCKETS
M: Ursula Braun <ubraun@linux.vnet.ibm.com>
L: linux-s390@vger.kernel.org
W: http://www.ibm.com/developerworks/linux/linux390/
S: Supported
F: net/smc/
SYNOPSYS DESIGNWARE DMAC DRIVER
M: Viresh Kumar <vireshk@kernel.org>
M: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
......
......@@ -202,8 +202,12 @@ struct ucred {
#define AF_VSOCK 40 /* vSockets */
#define AF_KCM 41 /* Kernel Connection Multiplexor*/
#define AF_QIPCRTR 42 /* Qualcomm IPC Router */
#define AF_SMC 43 /* smc sockets: reserve number for
* PF_SMC protocol family that
* reuses AF_INET address family
*/
#define AF_MAX 43 /* For now.. */
#define AF_MAX 44 /* For now.. */
/* Protocol families, same as address families. */
#define PF_UNSPEC AF_UNSPEC
......@@ -251,6 +255,7 @@ struct ucred {
#define PF_VSOCK AF_VSOCK
#define PF_KCM AF_KCM
#define PF_QIPCRTR AF_QIPCRTR
#define PF_SMC AF_SMC
#define PF_MAX AF_MAX
/* Maximum queue length specifiable by listen. */
......
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Definitions for the SMC module (socket related)
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#ifndef _SMC_H
#define _SMC_H
struct smc_hashinfo {
rwlock_t lock;
struct hlist_head ht;
};
int smc_hash_sk(struct sock *sk);
void smc_unhash_sk(struct sock *sk);
#endif /* _SMC_H */
......@@ -70,6 +70,7 @@
#include <net/checksum.h>
#include <net/tcp_states.h>
#include <linux/net_tstamp.h>
#include <net/smc.h>
/*
* This structure really needs to be cleaned up.
......@@ -986,6 +987,7 @@ struct request_sock_ops;
struct timewait_sock_ops;
struct inet_hashinfo;
struct raw_hashinfo;
struct smc_hashinfo;
struct module;
/*
......@@ -1024,6 +1026,7 @@ struct proto {
int (*getsockopt)(struct sock *sk, int level,
int optname, char __user *optval,
int __user *option);
void (*keepalive)(struct sock *sk, int valbool);
#ifdef CONFIG_COMPAT
int (*compat_setsockopt)(struct sock *sk,
int level,
......@@ -1093,6 +1096,7 @@ struct proto {
struct inet_hashinfo *hashinfo;
struct udp_table *udp_table;
struct raw_hashinfo *raw_hash;
struct smc_hashinfo *smc_hash;
} h;
struct module *owner;
......
......@@ -27,6 +27,7 @@
#define NETLINK_ECRYPTFS 19
#define NETLINK_RDMA 20
#define NETLINK_CRYPTO 21 /* Crypto layer */
#define NETLINK_SMC 22 /* SMC monitoring */
#define NETLINK_INET_DIAG NETLINK_SOCK_DIAG
......
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Definitions for generic netlink based configuration of an SMC-R PNET table
*
* Copyright IBM Corp. 2016
*
* Author(s): Thomas Richter <tmricht@linux.vnet.ibm.com>
*/
#ifndef _UAPI_LINUX_SMC_H_
#define _UAPI_LINUX_SMC_H_
/* Netlink SMC_PNETID attributes */
enum {
SMC_PNETID_UNSPEC,
SMC_PNETID_NAME,
SMC_PNETID_ETHNAME,
SMC_PNETID_IBNAME,
SMC_PNETID_IBPORT,
__SMC_PNETID_MAX,
SMC_PNETID_MAX = __SMC_PNETID_MAX - 1
};
enum { /* SMC PNET Table commands */
SMC_PNETID_GET = 1,
SMC_PNETID_ADD,
SMC_PNETID_DEL,
SMC_PNETID_FLUSH
};
#define SMCR_GENL_FAMILY_NAME "SMC_PNETID"
#define SMCR_GENL_FAMILY_VERSION 1
#endif /* _UAPI_LINUX_SMC_H_ */
#ifndef _UAPI_SMC_DIAG_H_
#define _UAPI_SMC_DIAG_H_
#include <linux/types.h>
#include <linux/inet_diag.h>
#include <rdma/ib_verbs.h>
/* Request structure */
struct smc_diag_req {
__u8 diag_family;
__u8 pad[2];
__u8 diag_ext; /* Query extended information */
struct inet_diag_sockid id;
};
/* Base info structure. It contains socket identity (addrs/ports/cookie) based
* on the internal clcsock, and more SMC-related socket data
*/
struct smc_diag_msg {
__u8 diag_family;
__u8 diag_state;
__u8 diag_fallback;
__u8 diag_shutdown;
struct inet_diag_sockid id;
__u32 diag_uid;
__u64 diag_inode;
};
/* Extensions */
enum {
SMC_DIAG_NONE,
SMC_DIAG_CONNINFO,
SMC_DIAG_LGRINFO,
SMC_DIAG_SHUTDOWN,
__SMC_DIAG_MAX,
};
#define SMC_DIAG_MAX (__SMC_DIAG_MAX - 1)
/* SMC_DIAG_CONNINFO */
struct smc_diag_cursor {
__u16 reserved;
__u16 wrap;
__u32 count;
};
struct smc_diag_conninfo {
__u32 token; /* unique connection id */
__u32 sndbuf_size; /* size of send buffer */
__u32 rmbe_size; /* size of RMB element */
__u32 peer_rmbe_size; /* size of peer RMB element */
/* local RMB element cursors */
struct smc_diag_cursor rx_prod; /* received producer cursor */
struct smc_diag_cursor rx_cons; /* received consumer cursor */
/* peer RMB element cursors */
struct smc_diag_cursor tx_prod; /* sent producer cursor */
struct smc_diag_cursor tx_cons; /* sent consumer cursor */
__u8 rx_prod_flags; /* received producer flags */
__u8 rx_conn_state_flags; /* recvd connection flags*/
__u8 tx_prod_flags; /* sent producer flags */
__u8 tx_conn_state_flags; /* sent connection flags*/
/* send buffer cursors */
struct smc_diag_cursor tx_prep; /* prepared to be sent cursor */
struct smc_diag_cursor tx_sent; /* sent cursor */
struct smc_diag_cursor tx_fin; /* confirmed sent cursor */
};
/* SMC_DIAG_LINKINFO */
struct smc_diag_linkinfo {
__u8 link_id; /* link identifier */
__u8 ibname[IB_DEVICE_NAME_MAX]; /* name of the RDMA device */
__u8 ibport; /* RDMA device port number */
__u8 gid[40]; /* local GID */
__u8 peer_gid[40]; /* peer GID */
};
struct smc_diag_lgrinfo {
struct smc_diag_linkinfo lnk[1];
__u8 role;
};
#endif /* _UAPI_SMC_DIAG_H_ */
......@@ -57,6 +57,7 @@ source "net/packet/Kconfig"
source "net/unix/Kconfig"
source "net/xfrm/Kconfig"
source "net/iucv/Kconfig"
source "net/smc/Kconfig"
config INET
bool "TCP/IP networking"
......
......@@ -51,6 +51,7 @@ obj-$(CONFIG_MAC80211) += mac80211/
obj-$(CONFIG_TIPC) += tipc/
obj-$(CONFIG_NETLABEL) += netlabel/
obj-$(CONFIG_IUCV) += iucv/
obj-$(CONFIG_SMC) += smc/
obj-$(CONFIG_RFKILL) += rfkill/
obj-$(CONFIG_NET_9P) += 9p/
obj-$(CONFIG_CAIF) += caif/
......
......@@ -222,7 +222,7 @@ static const char *const af_family_key_strings[AF_MAX+1] = {
"sk_lock-AF_RXRPC" , "sk_lock-AF_ISDN" , "sk_lock-AF_PHONET" ,
"sk_lock-AF_IEEE802154", "sk_lock-AF_CAIF" , "sk_lock-AF_ALG" ,
"sk_lock-AF_NFC" , "sk_lock-AF_VSOCK" , "sk_lock-AF_KCM" ,
"sk_lock-AF_MAX"
"sk_lock-AF_SMC" , "sk_lock-AF_MAX"
};
static const char *const af_family_slock_key_strings[AF_MAX+1] = {
"slock-AF_UNSPEC", "slock-AF_UNIX" , "slock-AF_INET" ,
......@@ -239,7 +239,7 @@ static const char *const af_family_slock_key_strings[AF_MAX+1] = {
"slock-AF_RXRPC" , "slock-AF_ISDN" , "slock-AF_PHONET" ,
"slock-AF_IEEE802154", "slock-AF_CAIF" , "slock-AF_ALG" ,
"slock-AF_NFC" , "slock-AF_VSOCK" ,"slock-AF_KCM" ,
"slock-AF_MAX"
"slock-AF_SMC" , "slock-AF_MAX"
};
static const char *const af_family_clock_key_strings[AF_MAX+1] = {
"clock-AF_UNSPEC", "clock-AF_UNIX" , "clock-AF_INET" ,
......@@ -256,7 +256,7 @@ static const char *const af_family_clock_key_strings[AF_MAX+1] = {
"clock-AF_RXRPC" , "clock-AF_ISDN" , "clock-AF_PHONET" ,
"clock-AF_IEEE802154", "clock-AF_CAIF" , "clock-AF_ALG" ,
"clock-AF_NFC" , "clock-AF_VSOCK" , "clock-AF_KCM" ,
"clock-AF_MAX"
"clock-AF_SMC" , "clock-AF_MAX"
};
/*
......@@ -762,11 +762,8 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
goto set_rcvbuf;
case SO_KEEPALIVE:
#ifdef CONFIG_INET
if (sk->sk_protocol == IPPROTO_TCP &&
sk->sk_type == SOCK_STREAM)
tcp_set_keepalive(sk, valbool);
#endif
if (sk->sk_prot->keepalive)
sk->sk_prot->keepalive(sk, valbool);
sock_valbool_flag(sk, SOCK_KEEPOPEN, valbool);
break;
......
......@@ -2376,6 +2376,7 @@ struct proto tcp_prot = {
.shutdown = tcp_shutdown,
.setsockopt = tcp_setsockopt,
.getsockopt = tcp_getsockopt,
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
.sendpage = tcp_sendpage,
......
......@@ -617,6 +617,7 @@ void tcp_set_keepalive(struct sock *sk, int val)
else if (!val)
inet_csk_delete_keepalive_timer(sk);
}
EXPORT_SYMBOL_GPL(tcp_set_keepalive);
static void tcp_keepalive_timer (unsigned long data)
......
......@@ -1889,6 +1889,7 @@ struct proto tcpv6_prot = {
.shutdown = tcp_shutdown,
.setsockopt = tcp_setsockopt,
.getsockopt = tcp_getsockopt,
.keepalive = tcp_set_keepalive,
.recvmsg = tcp_recvmsg,
.sendmsg = tcp_sendmsg,
.sendpage = tcp_sendpage,
......
config SMC
tristate "SMC socket protocol family"
depends on INET && INFINIBAND
---help---
SMC-R provides a "sockets over RDMA" solution making use of
RDMA over Converged Ethernet (RoCE) technology to upgrade
AF_INET TCP connections transparently.
The Linux implementation of the SMC-R solution is designed as
a separate socket family SMC.
Select this option if you want to run SMC socket applications
config SMC_DIAG
tristate "SMC: socket monitoring interface"
depends on SMC
---help---
Support for SMC socket monitoring interface used by tools such as
smcss.
if unsure, say Y.
obj-$(CONFIG_SMC) += smc.o
obj-$(CONFIG_SMC_DIAG) += smc_diag.o
smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Definitions for the SMC module (socket related)
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#ifndef __SMC_H
#define __SMC_H
#include <linux/socket.h>
#include <linux/types.h>
#include <linux/compiler.h> /* __aligned */
#include <net/sock.h>
#include "smc_ib.h"
#define SMCPROTO_SMC 0 /* SMC protocol */
#define SMC_MAX_PORTS 2 /* Max # of ports */
extern struct proto smc_proto;
#ifdef ATOMIC64_INIT
#define KERNEL_HAS_ATOMIC64
#endif
enum smc_state { /* possible states of an SMC socket */
SMC_ACTIVE = 1,
SMC_INIT = 2,
SMC_CLOSED = 7,
SMC_LISTEN = 10,
/* normal close */
SMC_PEERCLOSEWAIT1 = 20,
SMC_PEERCLOSEWAIT2 = 21,
SMC_APPFINCLOSEWAIT = 24,
SMC_APPCLOSEWAIT1 = 22,
SMC_APPCLOSEWAIT2 = 23,
SMC_PEERFINCLOSEWAIT = 25,
/* abnormal close */
SMC_PEERABORTWAIT = 26,
SMC_PROCESSABORT = 27,
};
struct smc_link_group;
struct smc_wr_rx_hdr { /* common prefix part of LLC and CDC to demultiplex */
u8 type;
} __aligned(1);
struct smc_cdc_conn_state_flags {
#if defined(__BIG_ENDIAN_BITFIELD)
u8 peer_done_writing : 1; /* Sending done indicator */
u8 peer_conn_closed : 1; /* Peer connection closed indicator */
u8 peer_conn_abort : 1; /* Abnormal close indicator */
u8 reserved : 5;
#elif defined(__LITTLE_ENDIAN_BITFIELD)
u8 reserved : 5;
u8 peer_conn_abort : 1;
u8 peer_conn_closed : 1;
u8 peer_done_writing : 1;
#endif
};
struct smc_cdc_producer_flags {
#if defined(__BIG_ENDIAN_BITFIELD)
u8 write_blocked : 1; /* Writing Blocked, no rx buf space */
u8 urg_data_pending : 1; /* Urgent Data Pending */
u8 urg_data_present : 1; /* Urgent Data Present */
u8 cons_curs_upd_req : 1; /* cursor update requested */
u8 failover_validation : 1;/* message replay due to failover */
u8 reserved : 3;
#elif defined(__LITTLE_ENDIAN_BITFIELD)
u8 reserved : 3;
u8 failover_validation : 1;
u8 cons_curs_upd_req : 1;
u8 urg_data_present : 1;
u8 urg_data_pending : 1;
u8 write_blocked : 1;
#endif
};
/* in host byte order */
union smc_host_cursor { /* SMC cursor - an offset in an RMBE */
struct {
u16 reserved;
u16 wrap; /* window wrap sequence number */
u32 count; /* cursor (= offset) part */
};
#ifdef KERNEL_HAS_ATOMIC64
atomic64_t acurs; /* for atomic processing */
#else
u64 acurs; /* for atomic processing */
#endif
} __aligned(8);
/* in host byte order, except for flag bitfields in network byte order */
struct smc_host_cdc_msg { /* Connection Data Control message */
struct smc_wr_rx_hdr common; /* .type = 0xFE */
u8 len; /* length = 44 */
u16 seqno; /* connection seq # */
u32 token; /* alert_token */
union smc_host_cursor prod; /* producer cursor */
union smc_host_cursor cons; /* consumer cursor,
* piggy backed "ack"
*/
struct smc_cdc_producer_flags prod_flags; /* conn. tx/rx status */
struct smc_cdc_conn_state_flags conn_state_flags; /* peer conn. status*/
u8 reserved[18];
} __aligned(8);
struct smc_connection {
struct rb_node alert_node;
struct smc_link_group *lgr; /* link group of connection */
u32 alert_token_local; /* unique conn. id */
u8 peer_conn_idx; /* from tcp handshake */
int peer_rmbe_size; /* size of peer rx buffer */
atomic_t peer_rmbe_space;/* remaining free bytes in peer
* rmbe
*/
int rtoken_idx; /* idx to peer RMB rkey/addr */
struct smc_buf_desc *sndbuf_desc; /* send buffer descriptor */
int sndbuf_size; /* sndbuf size <== sock wmem */
struct smc_buf_desc *rmb_desc; /* RMBE descriptor */
int rmbe_size; /* RMBE size <== sock rmem */
int rmbe_size_short;/* compressed notation */
int rmbe_update_limit;
/* lower limit for consumer
* cursor update
*/
struct smc_host_cdc_msg local_tx_ctrl; /* host byte order staging
* buffer for CDC msg send
* .prod cf. TCP snd_nxt
* .cons cf. TCP sends ack
*/
union smc_host_cursor tx_curs_prep; /* tx - prepared data
* snd_max..wmem_alloc
*/
union smc_host_cursor tx_curs_sent; /* tx - sent data
* snd_nxt ?
*/
union smc_host_cursor tx_curs_fin; /* tx - confirmed by peer
* snd-wnd-begin ?
*/
atomic_t sndbuf_space; /* remaining space in sndbuf */
u16 tx_cdc_seq; /* sequence # for CDC send */
spinlock_t send_lock; /* protect wr_sends */
struct work_struct tx_work; /* retry of smc_cdc_msg_send */
struct smc_host_cdc_msg local_rx_ctrl; /* filled during event_handl.
* .prod cf. TCP rcv_nxt
* .cons cf. TCP snd_una
*/
union smc_host_cursor rx_curs_confirmed; /* confirmed to peer
* source of snd_una ?
*/
atomic_t bytes_to_rcv; /* arrived data,
* not yet received
*/
#ifndef KERNEL_HAS_ATOMIC64
spinlock_t acurs_lock; /* protect cursors */
#endif
};
struct smc_sock { /* smc sock container */
struct sock sk;
struct socket *clcsock; /* internal tcp socket */
struct smc_connection conn; /* smc connection */
struct sockaddr *addr; /* inet connect address */
struct smc_sock *listen_smc; /* listen parent */
struct work_struct tcp_listen_work;/* handle tcp socket accepts */
struct work_struct smc_listen_work;/* prepare new accept socket */
struct list_head accept_q; /* sockets to be accepted */
spinlock_t accept_q_lock; /* protects accept_q */
struct delayed_work sock_put_work; /* final socket freeing */
bool use_fallback; /* fallback to tcp */
u8 wait_close_tx_prepared : 1;
/* shutdown wr or close
* started, waiting for unsent
* data to be sent
*/
};
static inline struct smc_sock *smc_sk(const struct sock *sk)
{
return (struct smc_sock *)sk;
}
#define SMC_SYSTEMID_LEN 8
extern u8 local_systemid[SMC_SYSTEMID_LEN]; /* unique system identifier */
/* convert a u32 value into network byte order, store it into a 3 byte field */
static inline void hton24(u8 *net, u32 host)
{
__be32 t;
t = cpu_to_be32(host);
memcpy(net, ((u8 *)&t) + 1, 3);
}
/* convert a received 3 byte field into host byte order */
static inline u32 ntoh24(u8 *net)
{
__be32 t = 0;
memcpy(((u8 *)&t) + 1, net, 3);
return be32_to_cpu(t);
}
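The two helpers above move a 24-bit quantity (QP number, PSN) between a u32 and a 3 byte big-endian field. A minimal user-space sketch of the same idea follows. It uses shifts instead of the memcpy-on-__be32 approach of the kernel code, and the `demo_` names are hypothetical, not part of the patch.

```c
#include <stdint.h>

/* Store the low 24 bits of host in big-endian order; the top byte is dropped. */
static void demo_hton24(uint8_t net[3], uint32_t host)
{
	net[0] = (uint8_t)((host >> 16) & 0xff);
	net[1] = (uint8_t)((host >> 8) & 0xff);
	net[2] = (uint8_t)(host & 0xff);
}

/* Read a 3 byte big-endian field back into a host-order value. */
static uint32_t demo_ntoh24(const uint8_t net[3])
{
	return ((uint32_t)net[0] << 16) | ((uint32_t)net[1] << 8) | net[2];
}
```

Because only three bytes are stored, any value above 0xFFFFFF is silently truncated on the way out.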
#define SMC_BUF_MIN_SIZE 16384 /* minimum size of an RMB */
#define SMC_RMBE_SIZES 16 /* number of distinct sizes for an RMBE */
/* theoretically, the RFC states that largest size would be 512K,
* i.e. compressed 5 and thus 6 sizes (0..5), despite
* struct smc_clc_msg_accept_confirm.rmbe_size being a 4 bit value (0..15)
*/
/* convert the RMB size into the compressed notation - minimum 16K.
* In contrast to plain ilog2, this rounds towards the next power of 2,
* so the socket application gets at least its desired sndbuf / rcvbuf size.
*/
static inline u8 smc_compress_bufsize(int size)
{
u8 compressed;
if (size <= SMC_BUF_MIN_SIZE)
return 0;
size = (size - 1) >> 14;
compressed = ilog2(size) + 1;
if (compressed >= SMC_RMBE_SIZES)
compressed = SMC_RMBE_SIZES - 1;
return compressed;
}
/* convert the RMB size from compressed notation into integer */
static inline int smc_uncompress_bufsize(u8 compressed)
{
u32 size;
size = 0x00000001 << (((int)compressed) + 14);
return (int)size;
}
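The compressed notation above maps a buffer size to an exponent relative to 16K, rounding up to the next power of two. A user-space sketch of the round trip is shown here; `demo_ilog2` is a stand-in for the kernel's `ilog2()` and all `demo_`/`DEMO_` names are assumptions made for illustration.

```c
#include <stdint.h>

#define DEMO_BUF_MIN_SIZE 16384	/* mirrors SMC_BUF_MIN_SIZE */
#define DEMO_RMBE_SIZES   16	/* mirrors SMC_RMBE_SIZES */

/* floor(log2(v)) for v > 0; stand-in for the kernel's ilog2() */
static unsigned int demo_ilog2(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Round size up to 16K * 2^n and return n, capped at DEMO_RMBE_SIZES - 1. */
static uint8_t demo_compress_bufsize(int size)
{
	unsigned int compressed;

	if (size <= DEMO_BUF_MIN_SIZE)
		return 0;
	compressed = demo_ilog2((unsigned int)(size - 1) >> 14) + 1;
	if (compressed >= DEMO_RMBE_SIZES)
		compressed = DEMO_RMBE_SIZES - 1;
	return (uint8_t)compressed;
}

/* Expand n back to 16K * 2^n; callers are assumed to pass n <= 15. */
static int demo_uncompress_bufsize(uint8_t compressed)
{
	return (int)(1u << (compressed + 14));
}
```

For any request that does not hit the cap, uncompressing the compressed value yields a size that is at least the requested one, which is the property the comment above describes.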
#ifdef CONFIG_XFRM
static inline bool using_ipsec(struct smc_sock *smc)
{
return (smc->clcsock->sk->sk_policy[0] ||
smc->clcsock->sk->sk_policy[1]) ? 1 : 0;
}
#else
static inline bool using_ipsec(struct smc_sock *smc)
{
return 0;
}
#endif
struct smc_clc_msg_local;
int smc_netinfo_by_tcpsk(struct socket *clcsock, __be32 *subnet,
u8 *prefix_len);
void smc_conn_free(struct smc_connection *conn);
int smc_conn_create(struct smc_sock *smc, __be32 peer_in_addr,
struct smc_ib_device *smcibdev, u8 ibport,
struct smc_clc_msg_local *lcl, int srv_first_contact);
struct sock *smc_accept_dequeue(struct sock *parent, struct socket *new_sock);
void smc_close_non_accepted(struct sock *sk);
#endif /* __SMC_H */
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Connection Data Control (CDC)
* handles flow control
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#include <linux/spinlock.h>
#include "smc.h"
#include "smc_wr.h"
#include "smc_cdc.h"
#include "smc_tx.h"
#include "smc_rx.h"
#include "smc_close.h"
/********************************** send *************************************/
struct smc_cdc_tx_pend {
struct smc_connection *conn; /* socket connection */
union smc_host_cursor cursor; /* tx sndbuf cursor sent */
union smc_host_cursor p_cursor; /* rx RMBE cursor produced */
u16 ctrl_seq; /* conn. tx sequence # */
};
/* handler for send/transmission completion of a CDC msg */
static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
struct smc_link *link,
enum ib_wc_status wc_status)
{
struct smc_cdc_tx_pend *cdcpend = (struct smc_cdc_tx_pend *)pnd_snd;
struct smc_sock *smc;
int diff;
if (!cdcpend->conn)
/* already dismissed */
return;
smc = container_of(cdcpend->conn, struct smc_sock, conn);
bh_lock_sock(&smc->sk);
if (!wc_status) {
diff = smc_curs_diff(cdcpend->conn->sndbuf_size,
&cdcpend->conn->tx_curs_fin,
&cdcpend->cursor);
/* sndbuf_space is decreased in smc_sendmsg */
smp_mb__before_atomic();
atomic_add(diff, &cdcpend->conn->sndbuf_space);
/* guarantee 0 <= sndbuf_space <= sndbuf_size */
smp_mb__after_atomic();
smc_curs_write(&cdcpend->conn->tx_curs_fin,
smc_curs_read(&cdcpend->cursor, cdcpend->conn),
cdcpend->conn);
}
smc_tx_sndbuf_nonfull(smc);
if (smc->sk.sk_state != SMC_ACTIVE)
/* wake up smc_close_wait_tx_pends() */
smc->sk.sk_state_change(&smc->sk);
bh_unlock_sock(&smc->sk);
}
int smc_cdc_get_free_slot(struct smc_link *link,
struct smc_wr_buf **wr_buf,
struct smc_cdc_tx_pend **pend)
{
return smc_wr_tx_get_free_slot(link, smc_cdc_tx_handler, wr_buf,
(struct smc_wr_tx_pend_priv **)pend);
}
static inline void smc_cdc_add_pending_send(struct smc_connection *conn,
struct smc_cdc_tx_pend *pend)
{
BUILD_BUG_ON_MSG(
sizeof(struct smc_cdc_msg) > SMC_WR_BUF_SIZE,
"must increase SMC_WR_BUF_SIZE to at least sizeof(struct smc_cdc_msg)");
BUILD_BUG_ON_MSG(
offsetof(struct smc_cdc_msg, reserved) > SMC_WR_TX_SIZE,
"must adapt SMC_WR_TX_SIZE to sizeof(struct smc_cdc_msg); if not all smc_wr upper layer protocols use the same message size any more, must start to set link->wr_tx_sges[i].length on each individual smc_wr_tx_send()");
BUILD_BUG_ON_MSG(
sizeof(struct smc_cdc_tx_pend) > SMC_WR_TX_PEND_PRIV_SIZE,
"must increase SMC_WR_TX_PEND_PRIV_SIZE to at least sizeof(struct smc_cdc_tx_pend)");
pend->conn = conn;
pend->cursor = conn->tx_curs_sent;
pend->p_cursor = conn->local_tx_ctrl.prod;
pend->ctrl_seq = conn->tx_cdc_seq;
}
int smc_cdc_msg_send(struct smc_connection *conn,
struct smc_wr_buf *wr_buf,
struct smc_cdc_tx_pend *pend)
{
struct smc_link *link;
int rc;
link = &conn->lgr->lnk[SMC_SINGLE_LINK];
smc_cdc_add_pending_send(conn, pend);
conn->tx_cdc_seq++;
conn->local_tx_ctrl.seqno = conn->tx_cdc_seq;
smc_host_msg_to_cdc((struct smc_cdc_msg *)wr_buf,
&conn->local_tx_ctrl, conn);
rc = smc_wr_tx_send(link, (struct smc_wr_tx_pend_priv *)pend);
if (!rc)
smc_curs_write(&conn->rx_curs_confirmed,
smc_curs_read(&conn->local_tx_ctrl.cons, conn),
conn);
return rc;
}
int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
{
struct smc_cdc_tx_pend *pend;
struct smc_wr_buf *wr_buf;
int rc;
rc = smc_cdc_get_free_slot(&conn->lgr->lnk[SMC_SINGLE_LINK], &wr_buf,
&pend);
if (rc)
return rc;
return smc_cdc_msg_send(conn, wr_buf, pend);
}
static bool smc_cdc_tx_filter(struct smc_wr_tx_pend_priv *tx_pend,
unsigned long data)
{
struct smc_connection *conn = (struct smc_connection *)data;
struct smc_cdc_tx_pend *cdc_pend =
(struct smc_cdc_tx_pend *)tx_pend;
return cdc_pend->conn == conn;
}
static void smc_cdc_tx_dismisser(struct smc_wr_tx_pend_priv *tx_pend)
{
struct smc_cdc_tx_pend *cdc_pend =
(struct smc_cdc_tx_pend *)tx_pend;
cdc_pend->conn = NULL;
}
void smc_cdc_tx_dismiss_slots(struct smc_connection *conn)
{
struct smc_link *link = &conn->lgr->lnk[SMC_SINGLE_LINK];
smc_wr_tx_dismiss_slots(link, SMC_CDC_MSG_TYPE,
smc_cdc_tx_filter, smc_cdc_tx_dismisser,
(unsigned long)conn);
}
bool smc_cdc_tx_has_pending(struct smc_connection *conn)
{
struct smc_link *link = &conn->lgr->lnk[SMC_SINGLE_LINK];
return smc_wr_tx_has_pending(link, SMC_CDC_MSG_TYPE,
smc_cdc_tx_filter, (unsigned long)conn);
}
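The filter and dismisser callbacks above are handed to smc_wr_tx_dismiss_slots(), whose body is not part of this section. The sketch below only illustrates the general callback pattern under the assumption that the work request layer walks its pending slots, asks the filter whether a slot belongs to the caller, and then runs the dismisser on each match. Every name in it is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_pend {
	void *owner;	/* connection that queued the send, NULL once dismissed */
	bool in_use;	/* slot currently waits for a completion */
};

typedef bool (*demo_filter_fn)(struct demo_pend *pend, uintptr_t data);
typedef void (*demo_dismiss_fn)(struct demo_pend *pend);

/* Walk all pending slots and dismiss those selected by the filter. */
static size_t demo_dismiss_slots(struct demo_pend *slots, size_t n,
				 demo_filter_fn filter,
				 demo_dismiss_fn dismiss, uintptr_t data)
{
	size_t hits = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!slots[i].in_use || !filter(&slots[i], data))
			continue;
		dismiss(&slots[i]);
		hits++;
	}
	return hits;
}

/* Filter: match slots queued by the connection passed as data. */
static bool demo_owner_filter(struct demo_pend *pend, uintptr_t data)
{
	return pend->owner == (void *)data;
}

/* Dismisser: detach the slot so a late completion is ignored. */
static void demo_owner_dismiss(struct demo_pend *pend)
{
	pend->owner = NULL;
}
```

Clearing the owner rather than freeing the slot matches the check at the top of smc_cdc_tx_handler(), which returns early when the connection pointer is NULL.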
/********************************* receive ***********************************/
static inline bool smc_cdc_before(u16 seq1, u16 seq2)
{
return (s16)(seq1 - seq2) < 0;
}
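smc_cdc_before() compares 16-bit sequence numbers so that the result stays correct across wraparound, in the same spirit as TCP's before(). A small user-space sketch, with the hypothetical name `demo_seq_before`, makes the wraparound behaviour easy to try out:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if seq1 is older than seq2 in modulo-65536 arithmetic.
 * The intermediate uint16_t cast keeps the subtraction result in
 * 16 bits before it is reinterpreted as a signed distance.
 */
static bool demo_seq_before(uint16_t seq1, uint16_t seq2)
{
	return (int16_t)(uint16_t)(seq1 - seq2) < 0;
}
```

With this comparison 65535 counts as older than 0, since the signed distance between them is -1.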
static void smc_cdc_msg_recv_action(struct smc_sock *smc,
struct smc_link *link,
struct smc_cdc_msg *cdc)
{
union smc_host_cursor cons_old, prod_old;
struct smc_connection *conn = &smc->conn;
int diff_cons, diff_prod;
if (!cdc->prod_flags.failover_validation) {
if (smc_cdc_before(ntohs(cdc->seqno),
conn->local_rx_ctrl.seqno))
/* received seqno is old */
return;
}
smc_curs_write(&prod_old,
smc_curs_read(&conn->local_rx_ctrl.prod, conn),
conn);
smc_curs_write(&cons_old,
smc_curs_read(&conn->local_rx_ctrl.cons, conn),
conn);
smc_cdc_msg_to_host(&conn->local_rx_ctrl, cdc, conn);
diff_cons = smc_curs_diff(conn->peer_rmbe_size, &cons_old,
&conn->local_rx_ctrl.cons);
if (diff_cons) {
/* peer_rmbe_space is decreased during data transfer with RDMA
* write
*/
smp_mb__before_atomic();
atomic_add(diff_cons, &conn->peer_rmbe_space);
/* guarantee 0 <= peer_rmbe_space <= peer_rmbe_size */
smp_mb__after_atomic();
}
diff_prod = smc_curs_diff(conn->rmbe_size, &prod_old,
&conn->local_rx_ctrl.prod);
if (diff_prod) {
/* bytes_to_rcv is decreased in smc_recvmsg */
smp_mb__before_atomic();
atomic_add(diff_prod, &conn->bytes_to_rcv);
/* guarantee 0 <= bytes_to_rcv <= rmbe_size */
smp_mb__after_atomic();
smc->sk.sk_data_ready(&smc->sk);
}
if (conn->local_rx_ctrl.conn_state_flags.peer_conn_abort) {
smc->sk.sk_err = ECONNRESET;
conn->local_tx_ctrl.conn_state_flags.peer_conn_abort = 1;
}
if (smc_cdc_rxed_any_close_or_senddone(conn))
smc_close_passive_received(smc);
/* piggy backed tx info */
/* trigger sndbuf consumer: RDMA write into peer RMBE and CDC */
if (diff_cons && smc_tx_prepared_sends(conn)) {
smc_tx_sndbuf_nonempty(conn);
/* trigger socket release if connection closed */
smc_close_wake_tx_prepared(smc);
}
/* subsequent patch: trigger socket release if connection closed */
/* socket connected but not accepted */
if (!smc->sk.sk_socket)
return;
/* data available */
if ((conn->local_rx_ctrl.prod_flags.write_blocked) ||
(conn->local_rx_ctrl.prod_flags.cons_curs_upd_req))
smc_tx_consumer_update(conn);
}
/* called under tasklet context */
static inline void smc_cdc_msg_recv(struct smc_cdc_msg *cdc,
struct smc_link *link, u64 wr_id)
{
struct smc_link_group *lgr = container_of(link, struct smc_link_group,
lnk[SMC_SINGLE_LINK]);
struct smc_connection *connection;
struct smc_sock *smc;
/* lookup connection */
read_lock_bh(&lgr->conns_lock);
connection = smc_lgr_find_conn(ntohl(cdc->token), lgr);
if (!connection) {
read_unlock_bh(&lgr->conns_lock);
return;
}
smc = container_of(connection, struct smc_sock, conn);
sock_hold(&smc->sk);
read_unlock_bh(&lgr->conns_lock);
bh_lock_sock(&smc->sk);
smc_cdc_msg_recv_action(smc, link, cdc);
bh_unlock_sock(&smc->sk);
sock_put(&smc->sk); /* no free sk in softirq-context */
}
/***************************** init, exit, misc ******************************/
static void smc_cdc_rx_handler(struct ib_wc *wc, void *buf)
{
struct smc_link *link = (struct smc_link *)wc->qp->qp_context;
struct smc_cdc_msg *cdc = buf;
if (wc->byte_len < offsetof(struct smc_cdc_msg, reserved))
return; /* short message */
if (cdc->len != sizeof(*cdc))
return; /* invalid message */
smc_cdc_msg_recv(cdc, link, wc->wr_id);
}
static struct smc_wr_rx_handler smc_cdc_rx_handlers[] = {
{
.handler = smc_cdc_rx_handler,
.type = SMC_CDC_MSG_TYPE
},
{
.handler = NULL,
}
};
int __init smc_cdc_init(void)
{
struct smc_wr_rx_handler *handler;
int rc = 0;
for (handler = smc_cdc_rx_handlers; handler->handler; handler++) {
INIT_HLIST_NODE(&handler->list);
rc = smc_wr_rx_register_handler(handler);
if (rc)
break;
}
return rc;
}
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Connection Data Control (CDC)
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#ifndef SMC_CDC_H
#define SMC_CDC_H
#include <linux/kernel.h> /* max_t */
#include <linux/atomic.h>
#include <linux/in.h>
#include <linux/compiler.h>
#include "smc.h"
#include "smc_core.h"
#include "smc_wr.h"
#define SMC_CDC_MSG_TYPE 0xFE
/* in network byte order */
union smc_cdc_cursor { /* SMC cursor */
struct {
__be16 reserved;
__be16 wrap;
__be32 count;
};
#ifdef KERNEL_HAS_ATOMIC64
atomic64_t acurs; /* for atomic processing */
#else
u64 acurs; /* for atomic processing */
#endif
} __aligned(8);
/* in network byte order */
struct smc_cdc_msg {
struct smc_wr_rx_hdr common; /* .type = 0xFE */
u8 len; /* 44 */
__be16 seqno;
__be32 token;
union smc_cdc_cursor prod;
union smc_cdc_cursor cons; /* piggy backed "ack" */
struct smc_cdc_producer_flags prod_flags;
struct smc_cdc_conn_state_flags conn_state_flags;
u8 reserved[18];
} __aligned(8);
static inline bool smc_cdc_rxed_any_close(struct smc_connection *conn)
{
return conn->local_rx_ctrl.conn_state_flags.peer_conn_abort ||
conn->local_rx_ctrl.conn_state_flags.peer_conn_closed;
}
static inline bool smc_cdc_rxed_any_close_or_senddone(
struct smc_connection *conn)
{
return smc_cdc_rxed_any_close(conn) ||
conn->local_rx_ctrl.conn_state_flags.peer_done_writing;
}
static inline void smc_curs_add(int size, union smc_host_cursor *curs,
int value)
{
curs->count += value;
if (curs->count >= size) {
curs->wrap++;
curs->count -= size;
}
}
/* SMC cursors are 8 bytes long and require atomic reading and writing */
static inline u64 smc_curs_read(union smc_host_cursor *curs,
struct smc_connection *conn)
{
#ifndef KERNEL_HAS_ATOMIC64
unsigned long flags;
u64 ret;
spin_lock_irqsave(&conn->acurs_lock, flags);
ret = curs->acurs;
spin_unlock_irqrestore(&conn->acurs_lock, flags);
return ret;
#else
return atomic64_read(&curs->acurs);
#endif
}
static inline u64 smc_curs_read_net(union smc_cdc_cursor *curs,
struct smc_connection *conn)
{
#ifndef KERNEL_HAS_ATOMIC64
unsigned long flags;
u64 ret;
spin_lock_irqsave(&conn->acurs_lock, flags);
ret = curs->acurs;
spin_unlock_irqrestore(&conn->acurs_lock, flags);
return ret;
#else
return atomic64_read(&curs->acurs);
#endif
}
static inline void smc_curs_write(union smc_host_cursor *curs, u64 val,
struct smc_connection *conn)
{
#ifndef KERNEL_HAS_ATOMIC64
unsigned long flags;
spin_lock_irqsave(&conn->acurs_lock, flags);
curs->acurs = val;
spin_unlock_irqrestore(&conn->acurs_lock, flags);
#else
atomic64_set(&curs->acurs, val);
#endif
}
static inline void smc_curs_write_net(union smc_cdc_cursor *curs, u64 val,
struct smc_connection *conn)
{
#ifndef KERNEL_HAS_ATOMIC64
unsigned long flags;
spin_lock_irqsave(&conn->acurs_lock, flags);
curs->acurs = val;
spin_unlock_irqrestore(&conn->acurs_lock, flags);
#else
atomic64_set(&curs->acurs, val);
#endif
}
/* calculate cursor difference between old and new, where old <= new */
static inline int smc_curs_diff(unsigned int size,
union smc_host_cursor *old,
union smc_host_cursor *new)
{
if (old->wrap != new->wrap)
return max_t(int, 0,
((size - old->count) + new->count));
return max_t(int, 0, (new->count - old->count));
}
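smc_curs_add() and smc_curs_diff() treat a cursor as an offset into a ring of `size` bytes plus a wrap counter. The following user-space sketch shows the same arithmetic on a plain struct. It leaves out the atomic 64-bit access, assumes that at most one wrap separates the two cursors, and uses hypothetical `demo_` names.

```c
#include <stdint.h>

struct demo_cursor {
	uint16_t wrap;	/* number of times the ring has wrapped */
	uint32_t count;	/* byte offset within the ring */
};

/* Advance the cursor by value bytes (value <= size is assumed). */
static void demo_curs_add(uint32_t size, struct demo_cursor *curs,
			  uint32_t value)
{
	curs->count += value;
	if (curs->count >= size) {
		curs->wrap++;
		curs->count -= size;
	}
}

/* Bytes between old and new_, where old is expected not to be ahead.
 * A negative distance is clamped to 0.
 */
static int demo_curs_diff(uint32_t size, const struct demo_cursor *old,
			  const struct demo_cursor *new_)
{
	int diff;

	if (old->wrap != new_->wrap)
		diff = (int)((size - old->count) + new_->count);
	else
		diff = (int)(new_->count - old->count);
	return diff > 0 ? diff : 0;
}
```

With a 100 byte ring, advancing a cursor at offset 90 by 20 bytes lands at offset 10 with the wrap counter incremented, and the difference to the old position is again 20.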
static inline void smc_host_cursor_to_cdc(union smc_cdc_cursor *peer,
union smc_host_cursor *local,
struct smc_connection *conn)
{
union smc_host_cursor temp;
smc_curs_write(&temp, smc_curs_read(local, conn), conn);
peer->count = htonl(temp.count);
peer->wrap = htons(temp.wrap);
/* peer->reserved = htons(0); must be ensured by caller */
}
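smc_host_cursor_to_cdc() converts the wrap and count fields of a host cursor into network byte order. As a rough illustration, the sketch below serializes a cursor into the 8 byte layout implied by union smc_cdc_cursor (16-bit reserved, 16-bit wrap, 32-bit count) and reads it back. The `demo_` names and the byte-array representation are assumptions; the kernel code operates on the union directly.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

struct demo_host_cursor {
	uint16_t reserved;
	uint16_t wrap;
	uint32_t count;
};

/* Host cursor -> 8 wire bytes: reserved(2) | wrap(2) | count(4) */
static void demo_cursor_to_wire(uint8_t wire[8],
				const struct demo_host_cursor *host)
{
	uint16_t wrap = htons(host->wrap);
	uint32_t count = htonl(host->count);

	memset(wire, 0, 2);		/* reserved is sent as zero */
	memcpy(wire + 2, &wrap, sizeof(wrap));
	memcpy(wire + 4, &count, sizeof(count));
}

/* 8 wire bytes -> host cursor */
static void demo_cursor_from_wire(struct demo_host_cursor *host,
				  const uint8_t wire[8])
{
	uint16_t wrap;
	uint32_t count;

	memcpy(&wrap, wire + 2, sizeof(wrap));
	memcpy(&count, wire + 4, sizeof(count));
	host->reserved = 0;
	host->wrap = ntohs(wrap);
	host->count = ntohl(count);
}
```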
static inline void smc_host_msg_to_cdc(struct smc_cdc_msg *peer,
struct smc_host_cdc_msg *local,
struct smc_connection *conn)
{
peer->common.type = local->common.type;
peer->len = local->len;
peer->seqno = htons(local->seqno);
peer->token = htonl(local->token);
smc_host_cursor_to_cdc(&peer->prod, &local->prod, conn);
smc_host_cursor_to_cdc(&peer->cons, &local->cons, conn);
peer->prod_flags = local->prod_flags;
peer->conn_state_flags = local->conn_state_flags;
}
static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
union smc_cdc_cursor *peer,
struct smc_connection *conn)
{
union smc_host_cursor temp, old;
union smc_cdc_cursor net;
smc_curs_write(&old, smc_curs_read(local, conn), conn);
smc_curs_write_net(&net, smc_curs_read_net(peer, conn), conn);
temp.count = ntohl(net.count);
temp.wrap = ntohs(net.wrap);
if ((old.wrap > temp.wrap) && temp.wrap)
return;
if ((old.wrap == temp.wrap) &&
(old.count > temp.count))
return;
smc_curs_write(local, smc_curs_read(&temp, conn), conn);
}
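smc_cdc_cursor_to_host() drops an incoming cursor that would move the local copy backwards. The check can be isolated as a predicate, sketched below with hypothetical names. The exception for an incoming wrap value of 0 is copied from the code above; presumably it tolerates the 16-bit wrap counter rolling over, but that reading is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_cursor {
	uint16_t wrap;
	uint32_t count;
};

/* True if incoming is older than cur and must not overwrite it. */
static bool demo_cursor_is_stale(const struct demo_cursor *cur,
				 const struct demo_cursor *incoming)
{
	/* older wrap generation, unless the counter restarted at 0 */
	if (cur->wrap > incoming->wrap && incoming->wrap != 0)
		return true;
	/* same generation but a smaller offset */
	if (cur->wrap == incoming->wrap && cur->count > incoming->count)
		return true;
	return false;
}
```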
static inline void smc_cdc_msg_to_host(struct smc_host_cdc_msg *local,
struct smc_cdc_msg *peer,
struct smc_connection *conn)
{
local->common.type = peer->common.type;
local->len = peer->len;
local->seqno = ntohs(peer->seqno);
local->token = ntohl(peer->token);
smc_cdc_cursor_to_host(&local->prod, &peer->prod, conn);
smc_cdc_cursor_to_host(&local->cons, &peer->cons, conn);
local->prod_flags = peer->prod_flags;
local->conn_state_flags = peer->conn_state_flags;
}
struct smc_cdc_tx_pend;
int smc_cdc_get_free_slot(struct smc_link *link, struct smc_wr_buf **wr_buf,
struct smc_cdc_tx_pend **pend);
void smc_cdc_tx_dismiss_slots(struct smc_connection *conn);
int smc_cdc_msg_send(struct smc_connection *conn, struct smc_wr_buf *wr_buf,
struct smc_cdc_tx_pend *pend);
int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn);
bool smc_cdc_tx_has_pending(struct smc_connection *conn);
int smc_cdc_init(void) __init;
#endif /* SMC_CDC_H */
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* CLC (connection layer control) handshake over initial TCP socket to
* prepare for RDMA traffic
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#include <linux/in.h>
#include <net/sock.h>
#include <net/tcp.h>
#include "smc.h"
#include "smc_core.h"
#include "smc_clc.h"
#include "smc_ib.h"
/* Wait for data on the tcp-socket, analyze received data
* Returns:
* 0 if success and it was not a decline that we received.
* SMC_CLC_DECL_REPLY if decline received for fallback w/o another decl send.
* clcsock error, -EINTR, -ECONNRESET, -EPROTO otherwise.
*/
int smc_clc_wait_msg(struct smc_sock *smc, void *buf, int buflen,
u8 expected_type)
{
struct sock *clc_sk = smc->clcsock->sk;
struct smc_clc_msg_hdr *clcm = buf;
struct msghdr msg = {NULL, 0};
int reason_code = 0;
struct kvec vec;
int len, datlen;
int krflags;
/* peek the first few bytes to determine length of data to receive
* so we don't consume any subsequent CLC message or payload data
* in the TCP byte stream
*/
vec.iov_base = buf;
vec.iov_len = buflen;
krflags = MSG_PEEK | MSG_WAITALL;
smc->clcsock->sk->sk_rcvtimeo = CLC_WAIT_TIME;
len = kernel_recvmsg(smc->clcsock, &msg, &vec, 1,
sizeof(struct smc_clc_msg_hdr), krflags);
if (signal_pending(current)) {
reason_code = -EINTR;
clc_sk->sk_err = EINTR;
smc->sk.sk_err = EINTR;
goto out;
}
if (clc_sk->sk_err) {
reason_code = -clc_sk->sk_err;
smc->sk.sk_err = clc_sk->sk_err;
goto out;
}
if (!len) { /* peer has performed orderly shutdown */
smc->sk.sk_err = ECONNRESET;
reason_code = -ECONNRESET;
goto out;
}
if (len < 0) {
smc->sk.sk_err = -len;
reason_code = len;
goto out;
}
datlen = ntohs(clcm->length);
if ((len < sizeof(struct smc_clc_msg_hdr)) ||
(datlen < sizeof(struct smc_clc_msg_decline)) ||
(datlen > sizeof(struct smc_clc_msg_accept_confirm)) ||
memcmp(clcm->eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER)) ||
((clcm->type != SMC_CLC_DECLINE) &&
(clcm->type != expected_type))) {
smc->sk.sk_err = EPROTO;
reason_code = -EPROTO;
goto out;
}
/* receive the complete CLC message */
vec.iov_base = buf;
vec.iov_len = buflen;
memset(&msg, 0, sizeof(struct msghdr));
krflags = MSG_WAITALL;
smc->clcsock->sk->sk_rcvtimeo = CLC_WAIT_TIME;
len = kernel_recvmsg(smc->clcsock, &msg, &vec, 1, datlen, krflags);
if (len < datlen) {
smc->sk.sk_err = EPROTO;
reason_code = -EPROTO;
goto out;
}
if (clcm->type == SMC_CLC_DECLINE) {
reason_code = SMC_CLC_DECL_REPLY;
if (ntohl(((struct smc_clc_msg_decline *)buf)->peer_diagnosis)
== SMC_CLC_DECL_SYNCERR)
smc->conn.lgr->sync_err = true;
}
out:
return reason_code;
}
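smc_clc_wait_msg() first peeks at the header to learn the message length and only then consumes exactly that many bytes, so that data following the CLC message stays in the TCP stream. A user-space sketch of this peek-then-read pattern on an ordinary stream socket follows. The 4 byte header format (type, pad, 16-bit big-endian total length) and all `demo_`/`DEMO_` names are invented for the example and are not the CLC layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

#define DEMO_HDR_LEN 4	/* hypothetical header: type, pad, be16 total length */

/* Receive exactly one framed message into buf.
 * Returns the message length, or -1 on error or a malformed header.
 */
static ssize_t demo_recv_framed(int fd, uint8_t *buf, size_t buflen)
{
	ssize_t len;
	size_t total;

	if (buflen < DEMO_HDR_LEN)
		return -1;
	/* step 1: peek at the header without consuming it */
	len = recv(fd, buf, DEMO_HDR_LEN, MSG_PEEK | MSG_WAITALL);
	if (len != DEMO_HDR_LEN)
		return -1;
	total = ((size_t)buf[2] << 8) | buf[3];
	if (total < DEMO_HDR_LEN || total > buflen)
		return -1;
	/* step 2: consume the complete message and nothing beyond it */
	len = recv(fd, buf, total, MSG_WAITALL);
	if (len < 0 || (size_t)len != total)
		return -1;
	return len;
}
```

Checking the announced length against the buffer size before the second recv() corresponds to the "clc responses are not too big" check mentioned in the V4 change list.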
/* send CLC DECLINE message across internal TCP socket */
int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info,
u8 out_of_sync)
{
struct smc_clc_msg_decline dclc;
struct msghdr msg;
struct kvec vec;
int len;
memset(&dclc, 0, sizeof(dclc));
memcpy(dclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
dclc.hdr.type = SMC_CLC_DECLINE;
dclc.hdr.length = htons(sizeof(struct smc_clc_msg_decline));
dclc.hdr.version = SMC_CLC_V1;
dclc.hdr.flag = out_of_sync ? 1 : 0;
memcpy(dclc.id_for_peer, local_systemid, sizeof(local_systemid));
dclc.peer_diagnosis = htonl(peer_diag_info);
memcpy(dclc.trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
memset(&msg, 0, sizeof(msg));
vec.iov_base = &dclc;
vec.iov_len = sizeof(struct smc_clc_msg_decline);
len = kernel_sendmsg(smc->clcsock, &msg, &vec, 1,
sizeof(struct smc_clc_msg_decline));
if (len < sizeof(struct smc_clc_msg_decline))
smc->sk.sk_err = EPROTO;
if (len < 0)
smc->sk.sk_err = -len;
return len;
}
/* send CLC PROPOSAL message across internal TCP socket */
int smc_clc_send_proposal(struct smc_sock *smc,
struct smc_ib_device *smcibdev,
u8 ibport)
{
struct smc_clc_msg_proposal pclc;
int reason_code = 0;
struct msghdr msg;
struct kvec vec;
int len, rc;
/* send SMC Proposal CLC message */
memset(&pclc, 0, sizeof(pclc));
memcpy(pclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
pclc.hdr.type = SMC_CLC_PROPOSAL;
pclc.hdr.length = htons(sizeof(pclc));
pclc.hdr.version = SMC_CLC_V1; /* SMC version */
memcpy(pclc.lcl.id_for_peer, local_systemid, sizeof(local_systemid));
memcpy(&pclc.lcl.gid, &smcibdev->gid[ibport - 1], SMC_GID_SIZE);
memcpy(&pclc.lcl.mac, &smcibdev->mac[ibport - 1],
sizeof(smcibdev->mac[ibport - 1]));
/* determine subnet and mask from internal TCP socket */
rc = smc_netinfo_by_tcpsk(smc->clcsock, &pclc.outgoing_subnet,
&pclc.prefix_len);
if (rc)
return SMC_CLC_DECL_CNFERR; /* configuration error */
memcpy(pclc.trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
memset(&msg, 0, sizeof(msg));
vec.iov_base = &pclc;
vec.iov_len = sizeof(pclc);
/* due to the few bytes needed for clc-handshake this cannot block */
len = kernel_sendmsg(smc->clcsock, &msg, &vec, 1, sizeof(pclc));
if (len < (int)sizeof(pclc)) { /* cast keeps a negative len signed */
if (len >= 0) {
reason_code = -ENETUNREACH;
smc->sk.sk_err = -reason_code;
} else {
smc->sk.sk_err = smc->clcsock->sk->sk_err;
reason_code = -smc->sk.sk_err;
}
}
return reason_code;
}
/* send CLC CONFIRM message across internal TCP socket */
int smc_clc_send_confirm(struct smc_sock *smc)
{
struct smc_connection *conn = &smc->conn;
struct smc_clc_msg_accept_confirm cclc;
struct smc_link *link;
int reason_code = 0;
struct msghdr msg;
struct kvec vec;
int len;
link = &conn->lgr->lnk[SMC_SINGLE_LINK];
/* send SMC Confirm CLC msg */
memset(&cclc, 0, sizeof(cclc));
memcpy(cclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
cclc.hdr.type = SMC_CLC_CONFIRM;
cclc.hdr.length = htons(sizeof(cclc));
cclc.hdr.version = SMC_CLC_V1; /* SMC version */
memcpy(cclc.lcl.id_for_peer, local_systemid, sizeof(local_systemid));
memcpy(&cclc.lcl.gid, &link->smcibdev->gid[link->ibport - 1],
SMC_GID_SIZE);
memcpy(&cclc.lcl.mac, &link->smcibdev->mac[link->ibport - 1],
sizeof(link->smcibdev->mac[link->ibport - 1]));
hton24(cclc.qpn, link->roce_qp->qp_num);
cclc.rmb_rkey =
htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
cclc.conn_idx = 1; /* for now: 1 RMB = 1 RMBE */
cclc.rmbe_alert_token = htonl(conn->alert_token_local);
cclc.qp_mtu = min(link->path_mtu, link->peer_mtu);
cclc.rmbe_size = conn->rmbe_size_short;
cclc.rmb_dma_addr =
cpu_to_be64((u64)conn->rmb_desc->dma_addr[SMC_SINGLE_LINK]);
hton24(cclc.psn, link->psn_initial);
memcpy(cclc.trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
memset(&msg, 0, sizeof(msg));
vec.iov_base = &cclc;
vec.iov_len = sizeof(cclc);
len = kernel_sendmsg(smc->clcsock, &msg, &vec, 1, sizeof(cclc));
if (len < (int)sizeof(cclc)) { /* cast keeps a negative len signed */
if (len >= 0) {
reason_code = -ENETUNREACH;
smc->sk.sk_err = -reason_code;
} else {
smc->sk.sk_err = smc->clcsock->sk->sk_err;
reason_code = -smc->sk.sk_err;
}
}
return reason_code;
}
/* send CLC ACCEPT message across internal TCP socket */
int smc_clc_send_accept(struct smc_sock *new_smc, int srv_first_contact)
{
struct smc_connection *conn = &new_smc->conn;
struct smc_clc_msg_accept_confirm aclc;
struct smc_link *link;
struct msghdr msg;
struct kvec vec;
int rc = 0;
int len;
link = &conn->lgr->lnk[SMC_SINGLE_LINK];
memset(&aclc, 0, sizeof(aclc));
memcpy(aclc.hdr.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
aclc.hdr.type = SMC_CLC_ACCEPT;
aclc.hdr.length = htons(sizeof(aclc));
aclc.hdr.version = SMC_CLC_V1; /* SMC version */
if (srv_first_contact)
aclc.hdr.flag = 1;
memcpy(aclc.lcl.id_for_peer, local_systemid, sizeof(local_systemid));
memcpy(&aclc.lcl.gid, &link->smcibdev->gid[link->ibport - 1],
SMC_GID_SIZE);
memcpy(&aclc.lcl.mac, link->smcibdev->mac[link->ibport - 1],
sizeof(link->smcibdev->mac[link->ibport - 1]));
hton24(aclc.qpn, link->roce_qp->qp_num);
aclc.rmb_rkey =
htonl(conn->rmb_desc->mr_rx[SMC_SINGLE_LINK]->rkey);
aclc.conn_idx = 1; /* as long as 1 RMB = 1 RMBE */
aclc.rmbe_alert_token = htonl(conn->alert_token_local);
aclc.qp_mtu = link->path_mtu;
aclc.rmbe_size = conn->rmbe_size_short;
aclc.rmb_dma_addr =
cpu_to_be64((u64)conn->rmb_desc->dma_addr[SMC_SINGLE_LINK]);
hton24(aclc.psn, link->psn_initial);
memcpy(aclc.trl.eyecatcher, SMC_EYECATCHER, sizeof(SMC_EYECATCHER));
memset(&msg, 0, sizeof(msg));
vec.iov_base = &aclc;
vec.iov_len = sizeof(aclc);
len = kernel_sendmsg(new_smc->clcsock, &msg, &vec, 1, sizeof(aclc));
if (len < (int)sizeof(aclc)) { /* cast keeps a negative len signed */
if (len >= 0)
new_smc->sk.sk_err = EPROTO;
else
new_smc->sk.sk_err = new_smc->clcsock->sk->sk_err;
rc = sock_error(&new_smc->sk);
}
return rc;
}
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* CLC (connection layer control) handshake over initial TCP socket to
* prepare for RDMA traffic
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#ifndef _SMC_CLC_H
#define _SMC_CLC_H
#include <rdma/ib_verbs.h>
#include "smc.h"
#define SMC_CLC_PROPOSAL 0x01
#define SMC_CLC_ACCEPT 0x02
#define SMC_CLC_CONFIRM 0x03
#define SMC_CLC_DECLINE 0x04
/* eye catcher "SMCR" EBCDIC for CLC messages */
static const char SMC_EYECATCHER[4] = {'\xe2', '\xd4', '\xc3', '\xd9'};
#define SMC_CLC_V1 0x1 /* SMC version */
#define CLC_WAIT_TIME (6 * HZ) /* max. wait time on clcsock */
#define SMC_CLC_DECL_MEM 0x01010000 /* insufficient memory resources */
#define SMC_CLC_DECL_TIMEOUT 0x02000000 /* timeout */
#define SMC_CLC_DECL_CNFERR 0x03000000 /* configuration error */
#define SMC_CLC_DECL_IPSEC 0x03030000 /* IPsec usage */
#define SMC_CLC_DECL_SYNCERR 0x04000000 /* synchronization error */
#define SMC_CLC_DECL_REPLY 0x06000000 /* reply to a received decline */
#define SMC_CLC_DECL_INTERR 0x99990000 /* internal error */
#define SMC_CLC_DECL_TCL 0x02040000 /* timeout w4 QP confirm */
#define SMC_CLC_DECL_SEND 0x07000000 /* sending problem */
struct smc_clc_msg_hdr { /* header1 of clc messages */
u8 eyecatcher[4]; /* eye catcher */
u8 type; /* proposal / accept / confirm / decline */
__be16 length;
#if defined(__BIG_ENDIAN_BITFIELD)
u8 version : 4,
flag : 1,
rsvd : 3;
#elif defined(__LITTLE_ENDIAN_BITFIELD)
u8 rsvd : 3,
flag : 1,
version : 4;
#endif
} __packed; /* format defined in RFC7609 */
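The header above relies on endian-dependent bitfields to place `version` in the upper four bits and `flag` in bit 3 of the last byte. A portable user-space sketch that builds the same 8 byte header with explicit shifts, and applies the eye catcher and type checks from smc_clc_wait_msg(), might look as follows. The `demo_` names and the byte-array length field are assumptions made to avoid packing attributes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CLC_DECLINE 0x04	/* mirrors SMC_CLC_DECLINE */

/* "SMCR" in EBCDIC, as in SMC_EYECATCHER */
static const uint8_t demo_eyecatcher[4] = {0xe2, 0xd4, 0xc3, 0xd9};

struct demo_clc_hdr {
	uint8_t eyecatcher[4];
	uint8_t type;
	uint8_t length[2];	/* big-endian; bytes avoid alignment padding */
	uint8_t ver_flag;	/* version: bits 7..4, flag: bit 3, rest reserved */
};

_Static_assert(sizeof(struct demo_clc_hdr) == 8, "CLC header is 8 bytes");

static void demo_clc_hdr_init(struct demo_clc_hdr *h, uint8_t type,
			      uint16_t length, uint8_t version, bool flag)
{
	memcpy(h->eyecatcher, demo_eyecatcher, sizeof(h->eyecatcher));
	h->type = type;
	h->length[0] = (uint8_t)(length >> 8);
	h->length[1] = (uint8_t)(length & 0xff);
	h->ver_flag = (uint8_t)(((version & 0x0f) << 4) | (flag ? 0x08 : 0));
}

/* Accept the expected message type or a decline, as the receive path does. */
static bool demo_clc_hdr_ok(const struct demo_clc_hdr *h, uint8_t expected_type)
{
	if (memcmp(h->eyecatcher, demo_eyecatcher, sizeof(h->eyecatcher)))
		return false;
	return h->type == DEMO_CLC_DECLINE || h->type == expected_type;
}
```

Packing the last byte by hand gives the same wire image on little- and big-endian hosts, which is what the two bitfield orderings in the kernel struct achieve.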
struct smc_clc_msg_trail { /* trailer of clc messages */
u8 eyecatcher[4];
};
struct smc_clc_msg_local { /* header2 of clc messages */
u8 id_for_peer[SMC_SYSTEMID_LEN]; /* unique system id */
u8 gid[16]; /* gid of ib_device port */
u8 mac[6]; /* mac of ib_device port */
};
struct smc_clc_msg_proposal { /* clc proposal message */
struct smc_clc_msg_hdr hdr;
struct smc_clc_msg_local lcl;
__be16 iparea_offset; /* offset to IP address information area */
__be32 outgoing_subnet; /* subnet mask */
u8 prefix_len; /* number of significant bits in mask */
u8 reserved[2];
u8 ipv6_prefixes_cnt; /* number of IPv6 prefixes in prefix array */
struct smc_clc_msg_trail trl; /* eye catcher "SMCR" EBCDIC */
} __aligned(4);
struct smc_clc_msg_accept_confirm { /* clc accept / confirm message */
struct smc_clc_msg_hdr hdr;
struct smc_clc_msg_local lcl;
u8 qpn[3]; /* QP number */
__be32 rmb_rkey; /* RMB rkey */
u8 conn_idx; /* Connection index, which RMBE in RMB */
__be32 rmbe_alert_token;/* unique connection id */
#if defined(__BIG_ENDIAN_BITFIELD)
u8 rmbe_size : 4, /* RMBE buf size (compressed notation) */
qp_mtu : 4; /* QP mtu */
#elif defined(__LITTLE_ENDIAN_BITFIELD)
u8 qp_mtu : 4,
rmbe_size : 4;
#endif
u8 reserved;
__be64 rmb_dma_addr; /* RMB virtual address */
u8 reserved2;
u8 psn[3]; /* initial packet sequence number */
struct smc_clc_msg_trail trl; /* eye catcher "SMCR" EBCDIC */
} __packed; /* format defined in RFC7609 */
struct smc_clc_msg_decline { /* clc decline message */
struct smc_clc_msg_hdr hdr;
u8 id_for_peer[SMC_SYSTEMID_LEN]; /* sender peer_id */
__be32 peer_diagnosis; /* diagnosis information */
u8 reserved2[4];
struct smc_clc_msg_trail trl; /* eye catcher "SMCR" EBCDIC */
} __aligned(4);
struct smc_sock;
struct smc_ib_device;
int smc_clc_wait_msg(struct smc_sock *smc, void *buf, int buflen,
u8 expected_type);
int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info,
u8 out_of_sync);
int smc_clc_send_proposal(struct smc_sock *smc, struct smc_ib_device *smcibdev,
u8 ibport);
int smc_clc_send_confirm(struct smc_sock *smc);
int smc_clc_send_accept(struct smc_sock *smc, int srv_first_contact);
#endif
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Socket Closing
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#ifndef SMC_CLOSE_H
#define SMC_CLOSE_H
#include <linux/workqueue.h>
#include "smc.h"
#define SMC_MAX_STREAM_WAIT_TIMEOUT (2 * HZ)
#define SMC_CLOSE_SOCK_PUT_DELAY HZ
void smc_close_wake_tx_prepared(struct smc_sock *smc);
void smc_close_active_abort(struct smc_sock *smc);
int smc_close_active(struct smc_sock *smc);
void smc_close_passive_received(struct smc_sock *smc);
void smc_close_sock_put_work(struct work_struct *work);
int smc_close_shutdown_write(struct smc_sock *smc);
#endif /* SMC_CLOSE_H */
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Definitions for SMC Connections, Link Groups and Links
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#ifndef _SMC_CORE_H
#define _SMC_CORE_H
#include <linux/atomic.h>
#include <rdma/ib_verbs.h>
#include "smc.h"
#include "smc_ib.h"
#define SMC_RMBS_PER_LGR_MAX 255 /* max. # of RMBs per link group */
struct smc_lgr_list { /* list of link group definition */
struct list_head list;
spinlock_t lock; /* protects list of link groups */
};
extern struct smc_lgr_list smc_lgr_list; /* list of link groups */
enum smc_lgr_role { /* possible roles of a link group */
SMC_CLNT, /* client */
SMC_SERV /* server */
};
#define SMC_WR_BUF_SIZE 48 /* size of work request buffer */
struct smc_wr_buf {
u8 raw[SMC_WR_BUF_SIZE];
};
struct smc_link {
struct smc_ib_device *smcibdev; /* ib-device */
u8 ibport; /* port - values 1 | 2 */
struct ib_pd *roce_pd; /* IB protection domain,
* unique for every RoCE QP
*/
struct ib_qp *roce_qp; /* IB queue pair */
struct ib_qp_attr qp_attr; /* IB queue pair attributes */
struct smc_wr_buf *wr_tx_bufs; /* WR send payload buffers */
struct ib_send_wr *wr_tx_ibs; /* WR send meta data */
struct ib_sge *wr_tx_sges; /* WR send gather meta data */
struct smc_wr_tx_pend *wr_tx_pends; /* WR send waiting for CQE */
/* above four vectors have wr_tx_cnt elements and use the same index */
dma_addr_t wr_tx_dma_addr; /* DMA address of wr_tx_bufs */
atomic_long_t wr_tx_id; /* seq # of last sent WR */
unsigned long *wr_tx_mask; /* bit mask of used indexes */
u32 wr_tx_cnt; /* number of WR send buffers */
wait_queue_head_t wr_tx_wait; /* wait for free WR send buf */
struct smc_wr_buf *wr_rx_bufs; /* WR recv payload buffers */
struct ib_recv_wr *wr_rx_ibs; /* WR recv meta data */
struct ib_sge *wr_rx_sges; /* WR recv scatter meta data */
/* above three vectors have wr_rx_cnt elements and use the same index */
dma_addr_t wr_rx_dma_addr; /* DMA address of wr_rx_bufs */
u64 wr_rx_id; /* seq # of last recv WR */
u32 wr_rx_cnt; /* number of WR recv buffers */
union ib_gid gid; /* gid matching used vlan id */
u32 peer_qpn; /* QP number of peer */
enum ib_mtu path_mtu; /* used mtu */
enum ib_mtu peer_mtu; /* mtu size of peer */
u32 psn_initial; /* QP tx initial packet seqno */
u32 peer_psn; /* QP rx initial packet seqno */
u8 peer_mac[ETH_ALEN]; /* = gid[8:10||13:15] */
u8 peer_gid[sizeof(union ib_gid)]; /* gid of peer*/
u8 link_id; /* unique # within link group */
struct completion llc_confirm; /* wait for rx of conf link */
struct completion llc_confirm_resp; /* wait 4 rx of cnf lnk rsp */
};
/* For now we just allow one parallel link per link group. The SMC protocol
* allows more (up to 8).
*/
#define SMC_LINKS_PER_LGR_MAX 1
#define SMC_SINGLE_LINK 0
#define SMC_FIRST_CONTACT 1 /* first contact to a peer */
#define SMC_REUSE_CONTACT 0 /* follow-on contact to a peer*/
/* tx/rx buffer list element for sndbufs list and rmbs list of a lgr */
struct smc_buf_desc {
struct list_head list;
u64 dma_addr[SMC_LINKS_PER_LGR_MAX];
/* mapped address of buffer */
void *cpu_addr; /* virtual address of buffer */
struct ib_mr *mr_rx[SMC_LINKS_PER_LGR_MAX];
/* for rmb only:
* rkey provided to peer
*/
u32 used; /* currently used / unused */
};
struct smc_rtoken { /* address/key of remote RMB */
u64 dma_addr;
u32 rkey;
};
#define SMC_LGR_ID_SIZE 4
struct smc_link_group {
struct list_head list;
enum smc_lgr_role role; /* client or server */
__be32 daddr; /* destination ip address */
struct smc_link lnk[SMC_LINKS_PER_LGR_MAX]; /* smc link */
char peer_systemid[SMC_SYSTEMID_LEN];
/* unique system_id of peer */
struct rb_root conns_all; /* connection tree */
rwlock_t conns_lock; /* protects conns_all */
unsigned int conns_num; /* current # of connections */
unsigned short vlan_id; /* vlan id of link group */
struct list_head sndbufs[SMC_RMBE_SIZES];/* tx buffers */
rwlock_t sndbufs_lock; /* protects tx buffers */
struct list_head rmbs[SMC_RMBE_SIZES]; /* rx buffers */
rwlock_t rmbs_lock; /* protects rx buffers */
struct smc_rtoken rtokens[SMC_RMBS_PER_LGR_MAX]
[SMC_LINKS_PER_LGR_MAX];
/* remote addr/key pairs */
unsigned long rtokens_used_mask[BITS_TO_LONGS(
SMC_RMBS_PER_LGR_MAX)];
/* used rtoken elements */
u8 id[SMC_LGR_ID_SIZE]; /* unique lgr id */
struct delayed_work free_work; /* delayed freeing of an lgr */
bool sync_err; /* lgr no longer fits to peer */
};
/* Find the connection associated with the given alert token in the link group.
* To use rbtrees we have to implement our own search core.
* Requires @conns_lock
* @token alert token to search for
* @lgr link group to search in
* Returns connection associated with token if found, NULL otherwise.
*/
static inline struct smc_connection *smc_lgr_find_conn(
u32 token, struct smc_link_group *lgr)
{
struct smc_connection *res = NULL;
struct rb_node *node;
node = lgr->conns_all.rb_node;
while (node) {
struct smc_connection *cur = rb_entry(node,
struct smc_connection, alert_node);
if (cur->alert_token_local > token) {
node = node->rb_left;
} else {
if (cur->alert_token_local < token) {
node = node->rb_right;
} else {
res = cur;
break;
}
}
}
return res;
}
struct smc_sock;
struct smc_clc_msg_accept_confirm;
void smc_lgr_free(struct smc_link_group *lgr);
void smc_lgr_terminate(struct smc_link_group *lgr);
int smc_sndbuf_create(struct smc_sock *smc);
int smc_rmb_create(struct smc_sock *smc);
int smc_rmb_rtoken_handling(struct smc_connection *conn,
struct smc_clc_msg_accept_confirm *clc);
#endif
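The lookup in smc_lgr_find_conn() above is a plain binary search keyed on alert_token_local. The following is a minimal userspace sketch of that descent, not the kernel code: struct demo_conn and demo_find_conn() are hypothetical names, the node carries raw left/right pointers instead of an embedded struct rb_node, and neither rebalancing nor conns_lock is modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct smc_connection: only the search key and
 * the child pointers are modeled.
 */
struct demo_conn {
	uint32_t alert_token_local;
	struct demo_conn *left;
	struct demo_conn *right;
};

/* Same comparison order as smc_lgr_find_conn(): descend left when the
 * node's token is greater than the key, right when it is smaller, and
 * stop on an exact match.
 */
static struct demo_conn *demo_find_conn(uint32_t token, struct demo_conn *root)
{
	struct demo_conn *cur = root;

	while (cur) {
		if (cur->alert_token_local > token)
			cur = cur->left;
		else if (cur->alert_token_local < token)
			cur = cur->right;
		else
			return cur;
	}
	return NULL; /* token not present in the tree */
}
```

As in the kernel helper, a miss is reported as NULL, so a caller has to check the result before dereferencing it.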
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Monitoring SMC transport protocol sockets
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/types.h>
#include <linux/init.h>
#include <linux/sock_diag.h>
#include <linux/inet_diag.h>
#include <linux/smc_diag.h>
#include <net/netlink.h>
#include <net/smc.h>
#include "smc.h"
#include "smc_core.h"
static void smc_gid_be16_convert(__u8 *buf, u8 *gid_raw)
{
sprintf(buf, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x",
be16_to_cpu(((__be16 *)gid_raw)[0]),
be16_to_cpu(((__be16 *)gid_raw)[1]),
be16_to_cpu(((__be16 *)gid_raw)[2]),
be16_to_cpu(((__be16 *)gid_raw)[3]),
be16_to_cpu(((__be16 *)gid_raw)[4]),
be16_to_cpu(((__be16 *)gid_raw)[5]),
be16_to_cpu(((__be16 *)gid_raw)[6]),
be16_to_cpu(((__be16 *)gid_raw)[7]));
}
static void smc_diag_msg_common_fill(struct smc_diag_msg *r, struct sock *sk)
{
struct smc_sock *smc = smc_sk(sk);
r->diag_family = sk->sk_family;
if (!smc->clcsock)
return;
r->id.idiag_sport = htons(smc->clcsock->sk->sk_num);
r->id.idiag_dport = smc->clcsock->sk->sk_dport;
r->id.idiag_if = smc->clcsock->sk->sk_bound_dev_if;
sock_diag_save_cookie(sk, r->id.idiag_cookie);
memset(&r->id.idiag_src, 0, sizeof(r->id.idiag_src));
memset(&r->id.idiag_dst, 0, sizeof(r->id.idiag_dst));
r->id.idiag_src[0] = smc->clcsock->sk->sk_rcv_saddr;
r->id.idiag_dst[0] = smc->clcsock->sk->sk_daddr;
}
static int smc_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
struct smc_diag_msg *r,
struct user_namespace *user_ns)
{
if (nla_put_u8(skb, SMC_DIAG_SHUTDOWN, sk->sk_shutdown))
return 1;
r->diag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
r->diag_inode = sock_i_ino(sk);
return 0;
}
static int __smc_diag_dump(struct sock *sk, struct sk_buff *skb,
struct netlink_callback *cb,
const struct smc_diag_req *req,
struct nlattr *bc)
{
struct smc_sock *smc = smc_sk(sk);
struct user_namespace *user_ns;
struct smc_diag_msg *r;
struct nlmsghdr *nlh;
nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq,
cb->nlh->nlmsg_type, sizeof(*r), NLM_F_MULTI);
if (!nlh)
return -EMSGSIZE;
r = nlmsg_data(nlh);
smc_diag_msg_common_fill(r, sk);
r->diag_state = sk->sk_state;
r->diag_fallback = smc->use_fallback;
user_ns = sk_user_ns(NETLINK_CB(cb->skb).sk);
if (smc_diag_msg_attrs_fill(sk, skb, r, user_ns))
goto errout;
if ((req->diag_ext & (1 << (SMC_DIAG_CONNINFO - 1))) && smc->conn.lgr) {
struct smc_connection *conn = &smc->conn;
struct smc_diag_conninfo cinfo = {
.token = conn->alert_token_local,
.sndbuf_size = conn->sndbuf_size,
.rmbe_size = conn->rmbe_size,
.peer_rmbe_size = conn->peer_rmbe_size,
.rx_prod.wrap = conn->local_rx_ctrl.prod.wrap,
.rx_prod.count = conn->local_rx_ctrl.prod.count,
.rx_cons.wrap = conn->local_rx_ctrl.cons.wrap,
.rx_cons.count = conn->local_rx_ctrl.cons.count,
.tx_prod.wrap = conn->local_tx_ctrl.prod.wrap,
.tx_prod.count = conn->local_tx_ctrl.prod.count,
.tx_cons.wrap = conn->local_tx_ctrl.cons.wrap,
.tx_cons.count = conn->local_tx_ctrl.cons.count,
.tx_prod_flags =
*(u8 *)&conn->local_tx_ctrl.prod_flags,
.tx_conn_state_flags =
*(u8 *)&conn->local_tx_ctrl.conn_state_flags,
.rx_prod_flags = *(u8 *)&conn->local_rx_ctrl.prod_flags,
.rx_conn_state_flags =
*(u8 *)&conn->local_rx_ctrl.conn_state_flags,
.tx_prep.wrap = conn->tx_curs_prep.wrap,
.tx_prep.count = conn->tx_curs_prep.count,
.tx_sent.wrap = conn->tx_curs_sent.wrap,
.tx_sent.count = conn->tx_curs_sent.count,
.tx_fin.wrap = conn->tx_curs_fin.wrap,
.tx_fin.count = conn->tx_curs_fin.count,
};
if (nla_put(skb, SMC_DIAG_CONNINFO, sizeof(cinfo), &cinfo) < 0)
goto errout;
}
if ((req->diag_ext & (1 << (SMC_DIAG_LGRINFO - 1))) && smc->conn.lgr) {
struct smc_diag_lgrinfo linfo = {
.role = smc->conn.lgr->role,
.lnk[0].ibport = smc->conn.lgr->lnk[0].ibport,
.lnk[0].link_id = smc->conn.lgr->lnk[0].link_id,
};
memcpy(linfo.lnk[0].ibname,
smc->conn.lgr->lnk[0].smcibdev->ibdev->name,
sizeof(smc->conn.lgr->lnk[0].smcibdev->ibdev->name));
smc_gid_be16_convert(linfo.lnk[0].gid,
smc->conn.lgr->lnk[0].gid.raw);
smc_gid_be16_convert(linfo.lnk[0].peer_gid,
smc->conn.lgr->lnk[0].peer_gid);
if (nla_put(skb, SMC_DIAG_LGRINFO, sizeof(linfo), &linfo) < 0)
goto errout;
}
nlmsg_end(skb, nlh);
return 0;
errout:
nlmsg_cancel(skb, nlh);
return -EMSGSIZE;
}
static int smc_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
struct net *net = sock_net(skb->sk);
struct nlattr *bc = NULL;
struct hlist_head *head;
struct sock *sk;
int rc = 0;
read_lock(&smc_proto.h.smc_hash->lock);
head = &smc_proto.h.smc_hash->ht;
if (hlist_empty(head))
goto out;
sk_for_each(sk, head) {
if (!net_eq(sock_net(sk), net))
continue;
rc = __smc_diag_dump(sk, skb, cb, nlmsg_data(cb->nlh), bc);
if (rc)
break;
}
out:
read_unlock(&smc_proto.h.smc_hash->lock);
return rc;
}
static int smc_diag_handler_dump(struct sk_buff *skb, struct nlmsghdr *h)
{
struct net *net = sock_net(skb->sk);
if (h->nlmsg_type == SOCK_DIAG_BY_FAMILY &&
h->nlmsg_flags & NLM_F_DUMP) {
{
struct netlink_dump_control c = {
.dump = smc_diag_dump,
.min_dump_alloc = SKB_WITH_OVERHEAD(32768),
};
return netlink_dump_start(net->diag_nlsk, skb, h, &c);
}
}
return 0;
}
static const struct sock_diag_handler smc_diag_handler = {
.family = AF_SMC,
.dump = smc_diag_handler_dump,
};
static int __init smc_diag_init(void)
{
return sock_diag_register(&smc_diag_handler);
}
static void __exit smc_diag_exit(void)
{
sock_diag_unregister(&smc_diag_handler);
}
module_init(smc_diag_init);
module_exit(smc_diag_exit);
MODULE_LICENSE("GPL");
MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG, 43 /* AF_SMC */);
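smc_gid_be16_convert() in the file above renders a 16-byte GID as eight colon-separated groups of four hex digits. The sketch below reproduces that formatting in userspace so the output shape is easy to check. It is an illustration under assumptions rather than the kernel routine: demo_gid_format() is a hypothetical name, it uses snprintf() with an explicit buffer length instead of sprintf(), and it assembles each big-endian word byte by byte instead of casting to __be16 and calling be16_to_cpu().

```c
#include <stdint.h>
#include <stdio.h>

/* Userspace stand-in for smc_gid_be16_convert(). buf needs room for
 * 8 groups of 4 hex digits, 7 colons and the trailing NUL: 40 bytes.
 */
static void demo_gid_format(char *buf, size_t len, const uint8_t gid_raw[16])
{
	unsigned int w[8];
	int i;

	/* Build each 16-bit word from its high and low byte, so the result
	 * does not depend on host endianness or pointer alignment.
	 */
	for (i = 0; i < 8; i++)
		w[i] = ((unsigned int)gid_raw[2 * i] << 8) | gid_raw[2 * i + 1];

	snprintf(buf, len, "%04x:%04x:%04x:%04x:%04x:%04x:%04x:%04x",
		 w[0], w[1], w[2], w[3], w[4], w[5], w[6], w[7]);
}
```

For the raw bytes fe 80 00 00 00 00 00 00 02 11 22 ff fe 33 44 55 this yields "fe80:0000:0000:0000:0211:22ff:fe33:4455". Groups are always zero-padded and never compressed, unlike the usual IPv6 text notation.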
/*
* Shared Memory Communications over RDMA (SMC-R) and RoCE
*
* Definitions for IB environment
*
* Copyright IBM Corp. 2016
*
* Author(s): Ursula Braun <ubraun@linux.vnet.ibm.com>
*/
#ifndef _SMC_IB_H
#define _SMC_IB_H
#include <rdma/ib_verbs.h>
#define SMC_MAX_PORTS 2 /* Max # of ports */
#define SMC_GID_SIZE sizeof(union ib_gid)
#define SMC_IB_MAX_SEND_SGE 2
struct smc_ib_devices { /* list of smc ib devices definition */
struct list_head list;
spinlock_t lock; /* protects list of smc ib devices */
};
extern struct smc_ib_devices smc_ib_devices; /* list of smc ib devices */
struct smc_ib_device { /* ib-device infos for smc */
struct list_head list;
struct ib_device *ibdev;
struct ib_port_attr pattr[SMC_MAX_PORTS]; /* ib dev. port attrs */
struct ib_event_handler event_handler; /* global ib_event handler */
struct ib_cq *roce_cq_send; /* send completion queue */
struct ib_cq *roce_cq_recv; /* recv completion queue */
struct tasklet_struct send_tasklet; /* called by send cq handler */
struct tasklet_struct recv_tasklet; /* called by recv cq handler */
char mac[SMC_MAX_PORTS][6]; /* mac address per port*/
union ib_gid gid[SMC_MAX_PORTS]; /* gid per port */
u8 initialized : 1; /* ib dev CQ, evthdl done */
struct work_struct port_event_work;
unsigned long port_event_mask;
};
struct smc_buf_desc;
struct smc_link;
int smc_ib_register_client(void) __init;
void smc_ib_unregister_client(void);
bool smc_ib_port_active(struct smc_ib_device *smcibdev, u8 ibport);
int smc_ib_remember_port_attr(struct smc_ib_device *smcibdev, u8 ibport);
int smc_ib_buf_map(struct smc_ib_device *smcibdev, int buf_size,
struct smc_buf_desc *buf_slot,
enum dma_data_direction data_direction);
void smc_ib_buf_unmap(struct smc_ib_device *smcibdev, int bufsize,
struct smc_buf_desc *buf_slot,
enum dma_data_direction data_direction);
void smc_ib_dealloc_protection_domain(struct smc_link *lnk);
int smc_ib_create_protection_domain(struct smc_link *lnk);
void smc_ib_destroy_queue_pair(struct smc_link *lnk);
int smc_ib_create_queue_pair(struct smc_link *lnk);
int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
struct ib_mr **mr);
int smc_ib_ready_link(struct smc_link *lnk);
int smc_ib_modify_qp_rts(struct smc_link *lnk);
int smc_ib_modify_qp_reset(struct smc_link *lnk);
long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev);
#endif