Commit 2a3c389a authored by Linus Torvalds

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "A smaller cycle this time. Notably we see another new driver, 'Soft
  iWarp', and the deletion of an ancient unused driver for nes.

   - Revise and simplify the signature offload RDMA MR APIs

   - More progress on hoisting object allocation boiler plate code out
     of the drivers

   - Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib,
     i40iw

   - Tree wide cleanups: struct_size, put_user_page, xarray, rst doc
     conversion

   - Removal of obsolete ib_ucm chardev and nes driver

   - netlink based discovery of chardevs and autoloading of the modules
     providing them

   - Move more of the rdmavt/hfi1 uapi to include/uapi/rdma

   - New driver 'siw' for software based iWarp running on top of netdev,
     much like rxe's software RoCE.

   - mlx5 feature to report events in their raw devx format to userspace

   - Expose per-object counters through rdma tool

   - Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core
     from netdev"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits)
  RDMA/siw: Require a 64 bit arch
  RDMA/siw: Mark expected switch fall-throughs
  RDMA/core: Fix -Wunused-const-variable warnings
  rdma/siw: Remove set but not used variable 's'
  rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS
  RDMA/siw: Add missing rtnl_lock around access to ifa
  rdma/siw: Use proper enumerated type in map_cqe_status
  RDMA/siw: Remove unnecessary kthread create/destroy printouts
  IB/rdmavt: Fix variable shadowing issue in rvt_create_cq
  RDMA/core: Fix race when resolving IP address
  RDMA/core: Make rdma_counter.h compile stand alone
  IB/core: Work on the caller socket net namespace in nldev_newlink()
  RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM
  RDMA/mlx5: Set RDMA DIM to be enabled by default
  RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink
  RDMA/core: Provide RDMA DIM support for ULPs
  linux/dim: Implement RDMA adaptive moderation (DIM)
  IB/mlx5: Report correctly tag matching rendezvous capability
  docs: infiniband: add it to the driver-api bookset
  IB/mlx5: Implement VHCA tunnel mechanism in DEVX
  ...
parents 8de26253 0b043644
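As a hedged illustration of the DIM and netlink items above: with the matching iproute2 release, per-device adaptive moderation is expected to be toggled roughly as follows (the exact rdma-tool syntax and the mlx5_0 device name are assumptions, not confirmed by this commit):

    rdma dev set mlx5_0 dim on    # enable RDMA DIM on one device
    rdma dev show                 # inspect the current setting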
......@@ -423,23 +423,6 @@ Description:
(e.g. driver restart on the VM which owns the VF).
sysfs interface for NetEffect RNIC Low-Level iWARP driver (nes)
---------------------------------------------------------------
What: /sys/class/infiniband/nesX/hw_rev
What: /sys/class/infiniband/nesX/hca_type
What: /sys/class/infiniband/nesX/board_id
Date: Feb, 2008
KernelVersion: v2.6.25
Contact: linux-rdma@vger.kernel.org
Description:
hw_rev: (RO) Hardware revision number
hca_type: (RO) Host Channel Adapter type (NEX020)
board_id: (RO) Manufacturing board id
sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4)
-----------------------------------------------------
......
......@@ -90,6 +90,7 @@ needed).
driver-api/index
core-api/index
infiniband/index
media/index
networking/index
input/index
......
===========================
InfiniBand Midlayer Locking
===========================
This guide is an attempt to make explicit the locking assumptions
made by the InfiniBand midlayer. It describes the requirements on
......@@ -6,45 +8,47 @@ INFINIBAND MIDLAYER LOCKING
protocols that use the midlayer.
Sleeping and interrupt context
==============================
With the following exceptions, a low-level driver implementation of
all of the methods in struct ib_device may sleep. The exceptions
are any methods from the list:
  - create_ah
  - modify_ah
  - query_ah
  - destroy_ah
  - post_send
  - post_recv
  - poll_cq
  - req_notify_cq
  - map_phys_fmr
which may not sleep and must be callable from any context.
The corresponding functions exported to upper level protocol
consumers:
  - ib_create_ah
  - ib_modify_ah
  - ib_query_ah
  - ib_destroy_ah
  - ib_post_send
  - ib_post_recv
  - ib_req_notify_cq
  - ib_map_phys_fmr
are therefore safe to call from any context.
In addition, the function
  - ib_dispatch_event
used by low-level drivers to dispatch asynchronous events through
the midlayer is also safe to call from any context.
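As a hedged sketch of how a consumer relies on this, a completion handler may poll and rearm its CQ directly even though it can run in interrupt context (complete_io() is a hypothetical consumer helper)::

  static void my_cq_handler(struct ib_cq *cq, void *ctx)
  {
          struct ib_wc wc;

          /* May run in any context, including hard IRQ, so only the
           * non-sleeping functions listed above are legal here. */
          while (ib_poll_cq(cq, 1, &wc) > 0)
                  complete_io(ctx, &wc);

          ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
  }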
Reentrancy
----------
All of the methods in struct ib_device exported by a low-level
driver must be fully reentrant. The low-level driver is required to
......@@ -62,6 +66,7 @@ Reentrancy
information between different calls of ib_poll_cq() is not defined.
Callbacks
---------
A low-level driver must not perform a callback directly from the
same callchain as an ib_device method call. For example, it is not
......@@ -74,18 +79,18 @@ Callbacks
completion event handlers for the same CQ are not called
simultaneously. The driver must guarantee that only one CQ event
handler for a given CQ is running at a time. In other words, the
following situation is not allowed::

        CPU1                                    CPU2

  low-level driver ->
    consumer CQ event callback:
      /* ... */
      ib_req_notify_cq(cq, ...);
                                        low-level driver ->
  /* ... */                               consumer CQ event callback:
                                            /* ... */
  return from CQ event handler
The context in which completion event and asynchronous event
callbacks run is not defined. Depending on the low-level driver, it
......@@ -93,6 +98,7 @@ Callbacks
Upper level protocol consumers may not sleep in a callback.
Hot-plug
--------
A low-level driver announces that a device is ready for use by
consumers when it calls ib_register_device(), all initialization
......
.. SPDX-License-Identifier: GPL-2.0
==========
InfiniBand
==========
.. toctree::
   :maxdepth: 1

   core_locking
   ipoib
   opa_vnic
   sysfs
   tag_matching
   user_mad
   user_verbs
.. only:: subproject and html

   Indices
   =======

   * :ref:`genindex`
==================
IP over InfiniBand
==================
The ib_ipoib driver is an implementation of the IP over InfiniBand
protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
......@@ -8,16 +10,17 @@ IP OVER INFINIBAND
masqueraded to the kernel as ethernet interfaces).
Partitions and P_Keys
=====================
When the IPoIB driver is loaded, it creates one interface for each
port using the P_Key at index 0. To create an interface with a
different P_Key, write the desired P_Key into the main interface's
/sys/class/net/<intf name>/create_child file. For example::

  echo 0x8001 > /sys/class/net/ib0/create_child
This will create an interface named ib0.8001 with P_Key 0x8001. To
remove a subinterface, use the "delete_child" file::

  echo 0x8001 > /sys/class/net/ib0/delete_child
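A quick hedged check that the new child interface exists (assuming iproute2)::

  ip link show ib0.8001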
......@@ -28,6 +31,7 @@ Partitions and P_Keys
rtnl_link_ops, where children created using either way behave the same.
Datagram vs Connected modes
===========================
The IPoIB driver supports two modes of operation: datagram and
connected. The mode is set and read through an interface's
......@@ -51,6 +55,7 @@ Datagram vs Connected modes
networking stack to use the smaller UD MTU for these neighbours.
Stateless offloads
==================
If the IB HW supports IPoIB stateless offloads, IPoIB advertises
TCP/IP checksum and/or Large Send (LSO) offloading capability to the
......@@ -60,9 +65,10 @@ Stateless offloads
on/off using ethtool calls. Currently LRO is supported only for
checksum offload capable devices.
Stateless offloads are supported only in datagram mode.
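For example, the offload state can be inspected and toggled with ethtool (a hedged illustration; feature names vary by device and kernel)::

  ethtool -k ib0         # show current offload state
  ethtool -K ib0 tso on  # toggle large send offload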
Interrupt moderation
====================
If the underlying IB device supports CQ event moderation, one can
use ethtool to set interrupt mitigation parameters and thus reduce
......@@ -71,6 +77,7 @@ Interrupt moderation
moderation is supported.
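For instance, to set receive moderation parameters (values are illustrative only)::

  ethtool -C ib0 rx-usecs 16 rx-frames 32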
Debugging Information
=====================
By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
to 'y', tracing messages are compiled into the driver. They are
......@@ -79,7 +86,7 @@ Debugging Information
runtime through files in /sys/module/ib_ipoib/.
CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
virtual filesystem. By mounting this filesystem, for example with::

  mount -t debugfs none /sys/kernel/debug
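and then listing the IPoIB entries (the directory name is a hedged assumption)::

  ls /sys/kernel/debug/ipoib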
......@@ -96,10 +103,13 @@ Debugging Information
performance, because it adds tests to the fast path.
References
==========
Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
  http://ietf.org/rfc/rfc4391.txt

IP over InfiniBand (IPoIB) Architecture (RFC 4392)
  http://ietf.org/rfc/rfc4392.txt

IP over InfiniBand: Connected Mode (RFC 4755)
  http://ietf.org/rfc/rfc4755.txt
=================================================================
Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC)
=================================================================
Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC) feature
supports Ethernet functionality over Omni-Path fabric by encapsulating
the Ethernet packets between HFI nodes.
......@@ -17,70 +21,72 @@ an independent Ethernet network. The configuration is performed by an
Ethernet Manager (EM) which is part of the trusted Fabric Manager (FM)
application. HFI nodes can have multiple VNICs each connected to a
different virtual Ethernet switch. The below diagram presents a case
of two virtual Ethernet switches with two HFI nodes::

                             +-------------------+
                             |      Subnet/      |
                             |     Ethernet      |
                             |      Manager      |
                             +-------------------+
                                /            /
                               /            /
                              /            /
                             /            /
    +-----------------------------+  +------------------------------+
    |  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch     |
    |  +---------+    +---------+ |  | +---------+    +---------+   |
    |  | VPORT   |    |  VPORT  | |  | |  VPORT  |    |  VPORT  |   |
    +--+---------+----+---------+-+  +-+---------+----+---------+---+
             |            \                /             |
             |              \            /               |
             |                \        /                 |
             |                  \    /                   |
             |                   \  /                    |
             |                   /  \                    |
             |                 /      \                  |
         +-----------+------------+  +-----------+------------+
         |   VNIC    |    VNIC    |  |   VNIC    |    VNIC    |
         +-----------+------------+  +-----------+------------+
         |          HFI           |  |          HFI           |
         +------------------------+  +------------------------+
The Omni-Path encapsulated Ethernet packet format is as described below.
==================== ================================
Bits                 Field
==================== ================================
Quad Word 0:
0-19                 SLID (lower 20 bits)
20-30                Length (in Quad Words)
31                   BECN bit
32-51                DLID (lower 20 bits)
52-56                SC (Service Class)
57-59                RC (Routing Control)
60                   FECN bit
61-62                L2 (=10, 16B format)
63                   LT (=1, Link Transfer Head Flit)
Quad Word 1:
0-7                  L4 type (=0x78 ETHERNET)
8-11                 SLID[23:20]
12-15                DLID[23:20]
16-31                PKEY
32-47                Entropy
48-63                Reserved
Quad Word 2:
0-15                 Reserved
16-31                L4 header
32-63                Ethernet Packet
Quad Words 3 to N-1:
0-63                 Ethernet packet (pad extended)
Quad Word N (last):
0-23                 Ethernet packet (pad extended)
24-55                ICRC
56-61                Tail
62-63                LT (=01, Link Transfer Tail Flit)
==================== ================================
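As a hedged illustration only (bitfield order is compiler- and endian-dependent, so real code would use explicit shifts and masks; the struct name is hypothetical), Quad Word 0 can be pictured as::

  struct opa_16b_qw0 {
          u64 slid_low : 20;      /* SLID (lower 20 bits) */
          u64 length   : 11;      /* length in quad words */
          u64 becn     : 1;       /* BECN bit */
          u64 dlid_low : 20;      /* DLID (lower 20 bits) */
          u64 sc       : 5;       /* service class */
          u64 rc       : 3;       /* routing control */
          u64 fecn     : 1;       /* FECN bit */
          u64 l2       : 2;       /* =10, 16B format */
          u64 lt       : 1;       /* =1, link transfer head flit */
  };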
Ethernet packet is padded on the transmit side to ensure that the VNIC OPA
packet is quad word aligned. The 'Tail' field contains the number of bytes
......@@ -123,7 +129,7 @@ operation. It also handles the encapsulation of Ethernet packets with an
Omni-Path header in the transmit path. For each VNIC interface, the
information required for encapsulation is configured by the EM via VEMA MAD
interface. It also passes any control information to the HW dependent driver
by invoking the RDMA netdev control operations::
+-------------------+ +----------------------+
| | | Linux |
......
===========
Sysfs files
===========
The sysfs interface has moved to
Documentation/ABI/stable/sysfs-class-infiniband.
==================
Tag matching logic
==================
The MPI standard defines a set of rules, known as tag-matching, for matching
source send operations to destination receives. The following parameters
must match between the source and the destination:

* Communicator
* User tag - wild card may be specified by the receiver
* Source rank – wild card may be specified by the receiver
* Destination rank – wild
The ordering rules require that when more than one pair of send and receive
message envelopes may match, the pair that includes the earliest posted-send
and the earliest posted-receive is the pair that must be used to satisfy the
......@@ -35,6 +39,7 @@ the header to initiate an RDMA READ operation directly to the matching buffer.
A fin message needs to be received in order for the buffer to be reused.
Tag matching implementation
===========================
There are two types of matching objects used, the posted receive list and the
unexpected message list. The application posts receive buffers through calls
......
====================
Userspace MAD access
====================
Device files
============
Each port of each InfiniBand device has a "umad" device and an
"issm" device attached. For example, a two-port HCA will have two
......@@ -8,12 +11,13 @@ Device files
device of each type (for switch port 0).
Creating MAD agents
===================
A MAD agent can be created by filling in a struct ib_user_mad_reg_req
and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file
descriptor for the appropriate device file. If the registration
request succeeds, a 32-bit id will be returned in the structure.
For example::

  struct ib_user_mad_reg_req req = { /* ... */ };
  ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req);
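A slightly fuller hedged sketch with error checking (the class and version values are illustrative only)::

  struct ib_user_mad_reg_req req = { 0 };

  req.qpn = 1;                  /* GSI QP; 0 would select the SMI QP */
  req.mgmt_class = 0x03;        /* e.g. subnet administration */
  req.mgmt_class_version = 2;

  if (ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req))
          perror("IB_USER_MAD_REGISTER_AGENT");
  else
          printf("registered agent id %u\n", req.id);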
......@@ -26,12 +30,14 @@ Creating MAD agents
ioctl. Also, all agents registered through a file descriptor will
be unregistered when the descriptor is closed.
2014
     a new registration ioctl is now provided which allows additional
     fields to be provided during registration.
Users of this registration call are implicitly setting the use of
pkey_index (see below).
Receiving MADs
==============
MADs are received using read(). The receive side now supports
RMPP. The buffer passed to read() must be at least one
......@@ -41,7 +47,8 @@ Receiving MADs
MAD (RMPP), the errno is set to ENOSPC and the length of the
buffer needed is set in mad.length.
Example for normal MAD (non RMPP) reads::

  struct ib_user_mad *mad;
  mad = malloc(sizeof *mad + 256);
  ret = read(fd, mad, sizeof *mad + 256);
......@@ -50,7 +57,8 @@ Receiving MADs
free(mad);
}
Example for RMPP reads::

  struct ib_user_mad *mad;
  mad = malloc(sizeof *mad + 256);
  ret = read(fd, mad, sizeof *mad + 256);
......@@ -76,11 +84,12 @@ Receiving MADs
poll()/select() may be used to wait until a MAD can be read.
Sending MADs
============
MADs are sent using write(). The agent ID for sending should be
filled into the id field of the MAD, the destination LID should be
filled into the lid field, and so on. The send side does support
RMPP, so MADs of arbitrary length can be sent. For example::

  struct ib_user_mad *mad;
......@@ -97,6 +106,7 @@ Sending MADs
perror("write");
Transaction IDs
===============
Users of the umad devices can use the lower 32 bits of the
transaction ID field (that is, the least significant half of the
......@@ -105,6 +115,7 @@ Transaction IDs
the kernel and will be overwritten before a MAD is sent.
P_Key Index Handling
====================
The old ib_umad interface did not allow setting the P_Key index for
MADs that are sent and did not provide a way for obtaining the P_Key
......@@ -119,6 +130,7 @@ P_Key Index Handling
default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed.
Setting IsSM Capability Bit
===========================
To set the IsSM capability bit for a port, simply open the
corresponding issm device file. If the IsSM bit is already set,
......@@ -129,25 +141,26 @@ Setting IsSM Capability Bit
the issm file.
/dev files
==========
To create the appropriate character device files automatically with
udev, a rule like::

  KERNEL=="umad*", NAME="infiniband/%k"
  KERNEL=="issm*", NAME="infiniband/%k"
can be used. This will create device nodes named::

  /dev/infiniband/umad0
  /dev/infiniband/issm0
for the first port, and so on. The InfiniBand device and port
associated with these devices can be determined from the files::

  /sys/class/infiniband_mad/umad0/ibdev
  /sys/class/infiniband_mad/umad0/port
and::

  /sys/class/infiniband_mad/issm0/ibdev
  /sys/class/infiniband_mad/issm0/port
======================
Userspace verbs access
======================
The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS,
enables direct userspace access to IB hardware via "verbs," as
......@@ -13,6 +15,7 @@ USERSPACE VERBS ACCESS
libmthca userspace driver be installed.
User-kernel communication
=========================
Userspace communicates with the kernel for slow path, resource
management operations via the /dev/infiniband/uverbsN character
......@@ -28,6 +31,7 @@ User-kernel communication
system call.
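As a hedged illustration, the device can be opened directly, although real applications normally go through libibverbs, which wraps these commands::

  #include <fcntl.h>
  #include <stdio.h>

  int main(void)
  {
          int fd = open("/dev/infiniband/uverbs0", O_RDWR);

          if (fd < 0) {
                  perror("open uverbs0");
                  return 1;
          }
          /* command structures are then passed via write()/ioctl() */
          return 0;
  }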
Resource management
===================
Since creation and destruction of all IB resources is done by
commands passed through a file descriptor, the kernel can keep track
......@@ -41,6 +45,7 @@ Resource management
prevent one process from touching another process's resources.
Memory pinning
==============
Direct userspace I/O requires that memory regions that are potential
I/O targets be kept resident at the same physical address. The
......@@ -54,13 +59,14 @@ Memory pinning
number of pages pinned by a process.
/dev files
==========
To create the appropriate character device files automatically with
udev, a rule like::

  KERNEL=="uverbs*", NAME="infiniband/%k"
can be used. This will create device nodes named::

  /dev/infiniband/uverbs0
......
......@@ -11018,14 +11018,6 @@ F: driver/net/net_failover.c
F: include/net/net_failover.h
F: Documentation/networking/net_failover.rst
NETEFFECT IWARP RNIC DRIVER (IW_NES)
M: Faisal Latif <faisal.latif@intel.com>
L: linux-rdma@vger.kernel.org
W: http://www.intel.com/Products/Server/Adapters/Server-Cluster/Server-Cluster-overview.htm
S: Supported
F: drivers/infiniband/hw/nes/
F: include/uapi/rdma/nes-abi.h
NETEM NETWORK EMULATOR
M: Stephen Hemminger <stephen@networkplumber.org>
L: netem@lists.linux-foundation.org (moderated for non-subscribers)
......@@ -14755,6 +14747,13 @@ M: Chris Boot <bootc@bootc.net>
S: Maintained
F: drivers/leds/leds-net48xx.c
SOFT-IWARP DRIVER (siw)
M: Bernard Metzler <bmt@zurich.ibm.com>
L: linux-rdma@vger.kernel.org
S: Supported
F: drivers/infiniband/sw/siw/
F: include/uapi/rdma/siw-abi.h
SOFT-ROCE DRIVER (rxe)
M: Moni Shoua <monis@mellanox.com>
L: linux-rdma@vger.kernel.org
......
......@@ -7,6 +7,7 @@ menuconfig INFINIBAND
depends on m || IPV6 != m
depends on !ALPHA
select IRQ_POLL
select DIMLIB
---help---
Core support for InfiniBand (IB). Make sure to also select
any protocols you wish to use as well as drivers for your
......@@ -36,17 +37,6 @@ config INFINIBAND_USER_ACCESS
libibverbs, libibcm and a hardware driver library from
rdma-core <https://github.com/linux-rdma/rdma-core>.
config INFINIBAND_USER_ACCESS_UCM
tristate "Userspace CM (UCM, DEPRECATED)"
depends on BROKEN || COMPILE_TEST
depends on INFINIBAND_USER_ACCESS
help
The UCM module has known security flaws, which no one is
interested to fix. The user-space part of this code was
dropped from the upstream a long time ago.
This option is DEPRECATED and planned to be removed.
config INFINIBAND_EXP_LEGACY_VERBS_NEW_UAPI
bool "Allow experimental legacy verbs in new ioctl uAPI (EXPERIMENTAL)"
depends on INFINIBAND_USER_ACCESS
......@@ -98,7 +88,6 @@ source "drivers/infiniband/hw/efa/Kconfig"
source "drivers/infiniband/hw/i40iw/Kconfig"
source "drivers/infiniband/hw/mlx4/Kconfig"
source "drivers/infiniband/hw/mlx5/Kconfig"
source "drivers/infiniband/hw/nes/Kconfig"
source "drivers/infiniband/hw/ocrdma/Kconfig"
source "drivers/infiniband/hw/vmw_pvrdma/Kconfig"
source "drivers/infiniband/hw/usnic/Kconfig"
......@@ -108,6 +97,7 @@ source "drivers/infiniband/hw/hfi1/Kconfig"
source "drivers/infiniband/hw/qedr/Kconfig"
source "drivers/infiniband/sw/rdmavt/Kconfig"
source "drivers/infiniband/sw/rxe/Kconfig"
source "drivers/infiniband/sw/siw/Kconfig"
endif
source "drivers/infiniband/ulp/ipoib/Kconfig"
......
......@@ -6,13 +6,12 @@ obj-$(CONFIG_INFINIBAND) += ib_core.o ib_cm.o iw_cm.o \
$(infiniband-y)
obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o $(user_access-y)
obj-$(CONFIG_INFINIBAND_USER_ACCESS_UCM) += ib_ucm.o $(user_access-y)
ib_core-y := packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o \
roce_gid_mgmt.o mr_pool.o addr.o sa_query.o \
multicast.o mad.o smi.o agent.o mad_rmpp.o \
nldev.o restrack.o
nldev.o restrack.o counters.o
ib_core-$(CONFIG_SECURITY_INFINIBAND) += security.o
ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
......@@ -29,8 +28,6 @@ rdma_ucm-y := ucma.o
ib_umad-y := user_mad.o
ib_ucm-y := ucm.o
ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_marshall.o \
rdma_core.o uverbs_std_types.o uverbs_ioctl.o \
uverbs_std_types_cq.o \
......
......@@ -337,7 +337,7 @@ static int dst_fetch_ha(const struct dst_entry *dst,
neigh_event_send(n, NULL);
ret = -ENODATA;
} else {
memcpy(dev_addr->dst_dev_addr, n->ha, MAX_ADDR_LEN);
neigh_ha_snapshot(dev_addr->dst_dev_addr, n, dst->dev);
}
neigh_release(n);
......
......@@ -60,6 +60,7 @@ extern bool ib_devices_shared_netns;
int ib_device_register_sysfs(struct ib_device *device);
void ib_device_unregister_sysfs(struct ib_device *device);
int ib_device_rename(struct ib_device *ibdev, const char *name);
int ib_device_set_dim(struct ib_device *ibdev, u8 use_dim);
typedef void (*roce_netdev_callback)(struct ib_device *device, u8 port,
struct net_device *idev, void *cookie);
......@@ -88,6 +89,15 @@ typedef int (*nldev_callback)(struct ib_device *device,
int ib_enum_all_devs(nldev_callback nldev_cb, struct sk_buff *skb,
struct netlink_callback *cb);
struct ib_client_nl_info {
struct sk_buff *nl_msg;
struct device *cdev;
unsigned int port;
u64 abi;
};
int ib_get_client_nl_info(struct ib_device *ibdev, const char *client_name,
struct ib_client_nl_info *res);
enum ib_cache_gid_default_mode {
IB_CACHE_GID_DEFAULT_MODE_SET,
IB_CACHE_GID_DEFAULT_MODE_DELETE
......
......@@ -18,6 +18,53 @@
#define IB_POLL_FLAGS \
(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)
static const struct dim_cq_moder
rdma_dim_prof[RDMA_DIM_PARAMS_NUM_PROFILES] = {
	{1,   0, 1,  0},
	{1,   0, 4,  0},
	{2,   0, 4,  0},
	{2,   0, 8,  0},
	{4,   0, 8,  0},
	{16,  0, 8,  0},
	{16,  0, 16, 0},
	{32,  0, 16, 0},
	{32,  0, 32, 0},
};

static void ib_cq_rdma_dim_work(struct work_struct *w)
{
	struct dim *dim = container_of(w, struct dim, work);
	struct ib_cq *cq = dim->priv;

	u16 usec = rdma_dim_prof[dim->profile_ix].usec;
	u16 comps = rdma_dim_prof[dim->profile_ix].comps;

	dim->state = DIM_START_MEASURE;

	cq->device->ops.modify_cq(cq, comps, usec);
}

static void rdma_dim_init(struct ib_cq *cq)
{
	struct dim *dim;

	if (!cq->device->ops.modify_cq || !cq->device->use_cq_dim ||
	    cq->poll_ctx == IB_POLL_DIRECT)
		return;

	dim = kzalloc(sizeof(struct dim), GFP_KERNEL);
	if (!dim)
		return;

	dim->state = DIM_START_MEASURE;
	dim->tune_state = DIM_GOING_RIGHT;
	dim->profile_ix = RDMA_DIM_START_PROFILE;
	dim->priv = cq;
	cq->dim = dim;

	INIT_WORK(&dim->work, ib_cq_rdma_dim_work);
}
static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *wcs,
int batch)
{
......@@ -78,6 +125,7 @@ static void ib_cq_completion_direct(struct ib_cq *cq, void *private)
static int ib_poll_handler(struct irq_poll *iop, int budget)
{
struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
struct dim *dim = cq->dim;
int completed;
completed = __ib_process_cq(cq, budget, cq->wc, IB_POLL_BATCH);
......@@ -87,6 +135,9 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)
irq_poll_sched(&cq->iop);
}
if (dim)
rdma_dim(dim, completed);
return completed;
}
......@@ -105,6 +156,8 @@ static void ib_cq_poll_work(struct work_struct *work)
if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
queue_work(cq->comp_wq, &cq->work);
else if (cq->dim)
rdma_dim(cq->dim, completed);
}
static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
......@@ -113,7 +166,7 @@ static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
}
/**
* __ib_alloc_cq - allocate a completion queue
* __ib_alloc_cq_user - allocate a completion queue
* @dev: device to allocate the CQ for
* @private: driver private data, accessible from cq->cq_context
* @nr_cqe: number of CQEs to allocate
......@@ -139,25 +192,30 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
struct ib_cq *cq;
int ret = -ENOMEM;
cq = dev->ops.create_cq(dev, &cq_attr, NULL);
if (IS_ERR(cq))
return cq;
cq = rdma_zalloc_drv_obj(dev, ib_cq);
if (!cq)
return ERR_PTR(ret);
cq->device = dev;
cq->uobject = NULL;
cq->event_handler = NULL;
cq->cq_context = private;
cq->poll_ctx = poll_ctx;
atomic_set(&cq->usecnt, 0);
cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
if (!cq->wc)
goto out_destroy_cq;
goto out_free_cq;
cq->res.type = RDMA_RESTRACK_CQ;
rdma_restrack_set_task(&cq->res, caller);
ret = dev->ops.create_cq(cq, &cq_attr, NULL);
if (ret)
goto out_free_wc;
rdma_restrack_kadd(&cq->res);
rdma_dim_init(cq);
switch (cq->poll_ctx) {
case IB_POLL_DIRECT:
cq->comp_handler = ib_cq_completion_direct;
......@@ -178,29 +236,29 @@ struct ib_cq *__ib_alloc_cq_user(struct ib_device *dev, void *private,
break;
default:
ret = -EINVAL;
goto out_free_wc;
goto out_destroy_cq;
}
return cq;
out_free_wc:
kfree(cq->wc);
rdma_restrack_del(&cq->res);
out_destroy_cq:
rdma_restrack_del(&cq->res);
cq->device->ops.destroy_cq(cq, udata);
out_free_wc:
kfree(cq->wc);
out_free_cq:
kfree(cq);
return ERR_PTR(ret);
}
EXPORT_SYMBOL(__ib_alloc_cq_user);
/**
* ib_free_cq - free a completion queue
* ib_free_cq_user - free a completion queue
* @cq: completion queue to free.
* @udata: User data or NULL for kernel object
*/
void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
{
int ret;
if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
return;
......@@ -218,9 +276,12 @@ void ib_free_cq_user(struct ib_cq *cq, struct ib_udata *udata)
WARN_ON_ONCE(1);
}
kfree(cq->wc);
rdma_restrack_del(&cq->res);
ret = cq->device->ops.destroy_cq(cq, udata);
WARN_ON_ONCE(ret);
cq->device->ops.destroy_cq(cq, udata);
if (cq->dim)
cancel_work_sync(&cq->dim->work);
kfree(cq->dim);
kfree(cq->wc);
kfree(cq);
}
EXPORT_SYMBOL(ib_free_cq_user);
......@@ -46,6 +46,7 @@
#include <rdma/rdma_netlink.h>
#include <rdma/ib_addr.h>
#include <rdma/ib_cache.h>
#include <rdma/rdma_counter.h>
#include "core_priv.h"
#include "restrack.h"
......@@ -270,7 +271,7 @@ struct ib_port_data_rcu {
struct ib_port_data pdata[];
};
static int ib_device_check_mandatory(struct ib_device *device)
static void ib_device_check_mandatory(struct ib_device *device)
{
#define IB_MANDATORY_FUNC(x) { offsetof(struct ib_device_ops, x), #x }
static const struct {
......@@ -305,8 +306,6 @@ static int ib_device_check_mandatory(struct ib_device *device)
break;
}
}
return 0;
}
/*
......@@ -375,7 +374,7 @@ struct ib_device *ib_device_get_by_name(const char *name,
down_read(&devices_rwsem);
device = __ib_device_get_by_name(name);
if (device && driver_id != RDMA_DRIVER_UNKNOWN &&
device->driver_id != driver_id)
device->ops.driver_id != driver_id)
device = NULL;
if (device) {
......@@ -449,6 +448,15 @@ int ib_device_rename(struct ib_device *ibdev, const char *name)
return 0;
}
int ib_device_set_dim(struct ib_device *ibdev, u8 use_dim)
{
	if (use_dim > 1)
		return -EINVAL;
	ibdev->use_cq_dim = use_dim;

	return 0;
}
static int alloc_name(struct ib_device *ibdev, const char *name)
{
struct ib_device *device;
......@@ -494,10 +502,12 @@ static void ib_device_release(struct device *device)
if (dev->port_data) {
ib_cache_release_one(dev);
ib_security_release_port_pkey_list(dev);
rdma_counter_release(dev);
kfree_rcu(container_of(dev->port_data, struct ib_port_data_rcu,
pdata[0]),
rcu_head);
}
xa_destroy(&dev->compat_devs);
xa_destroy(&dev->client_data);
kfree_rcu(dev, rcu_head);
......@@ -1193,10 +1203,7 @@ static int setup_device(struct ib_device *device)
int ret;
setup_dma_device(device);
ret = ib_device_check_mandatory(device);
if (ret)
return ret;
ib_device_check_mandatory(device);
ret = setup_port_data(device);
if (ret) {
......@@ -1321,6 +1328,8 @@ int ib_register_device(struct ib_device *device, const char *name)
ib_device_register_rdmacg(device);
rdma_counter_init(device);
/*
* Ensure that ADD uevent is not fired because it
 * is too early and the device is not initialized yet.
......@@ -1479,7 +1488,7 @@ void ib_unregister_driver(enum rdma_driver_id driver_id)
down_read(&devices_rwsem);
xa_for_each (&devices, index, ib_dev) {
if (ib_dev->driver_id != driver_id)
if (ib_dev->ops.driver_id != driver_id)
continue;
get_device(&ib_dev->dev);
......@@ -1749,6 +1758,104 @@ void ib_unregister_client(struct ib_client *client)
}
EXPORT_SYMBOL(ib_unregister_client);
static int __ib_get_global_client_nl_info(const char *client_name,
struct ib_client_nl_info *res)
{
struct ib_client *client;
unsigned long index;
int ret = -ENOENT;
down_read(&clients_rwsem);
xa_for_each_marked (&clients, index, client, CLIENT_REGISTERED) {
if (strcmp(client->name, client_name) != 0)
continue;
if (!client->get_global_nl_info) {
ret = -EOPNOTSUPP;
break;
}
ret = client->get_global_nl_info(res);
if (WARN_ON(ret == -ENOENT))
ret = -EINVAL;
if (!ret && res->cdev)
get_device(res->cdev);
break;
}
up_read(&clients_rwsem);
return ret;
}
static int __ib_get_client_nl_info(struct ib_device *ibdev,
const char *client_name,
struct ib_client_nl_info *res)
{
unsigned long index;
void *client_data;
int ret = -ENOENT;
down_read(&ibdev->client_data_rwsem);
xan_for_each_marked (&ibdev->client_data, index, client_data,
CLIENT_DATA_REGISTERED) {
struct ib_client *client = xa_load(&clients, index);
if (!client || strcmp(client->name, client_name) != 0)
continue;
if (!client->get_nl_info) {
ret = -EOPNOTSUPP;
break;
}
ret = client->get_nl_info(ibdev, client_data, res);
if (WARN_ON(ret == -ENOENT))
ret = -EINVAL;
/*
* The cdev is guaranteed valid as long as we are inside the
* client_data_rwsem as remove_one can't be called. Keep it
* valid for the caller.
*/
if (!ret && res->cdev)
get_device(res->cdev);
break;
}
up_read(&ibdev->client_data_rwsem);
return ret;
}
/**
* ib_get_client_nl_info - Fetch the nl_info from a client
 * @ibdev: IB device
 * @client_name: Name of the client
 * @res: Result of the query
*/
int ib_get_client_nl_info(struct ib_device *ibdev, const char *client_name,
struct ib_client_nl_info *res)
{
int ret;
if (ibdev)
ret = __ib_get_client_nl_info(ibdev, client_name, res);
else
ret = __ib_get_global_client_nl_info(client_name, res);
#ifdef CONFIG_MODULES
if (ret == -ENOENT) {
request_module("rdma-client-%s", client_name);
if (ibdev)
ret = __ib_get_client_nl_info(ibdev, client_name, res);
else
ret = __ib_get_global_client_nl_info(client_name, res);
}
#endif
if (ret) {
if (ret == -ENOENT)
return -EOPNOTSUPP;
return ret;
}
if (WARN_ON(!res->cdev))
return -EINVAL;
return 0;
}
/**
* ib_set_client_data - Set IB client context
* @device:Device to set context for
......@@ -2039,7 +2146,7 @@ struct ib_device *ib_device_get_by_netdev(struct net_device *ndev,
(uintptr_t)ndev) {
if (rcu_access_pointer(cur->netdev) == ndev &&
(driver_id == RDMA_DRIVER_UNKNOWN ||
cur->ib_dev->driver_id == driver_id) &&
cur->ib_dev->ops.driver_id == driver_id) &&
ib_device_try_get(cur->ib_dev)) {
res = cur->ib_dev;
break;
......@@ -2344,12 +2451,28 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
#define SET_OBJ_SIZE(ptr, name) SET_DEVICE_OP(ptr, size_##name)
if (ops->driver_id != RDMA_DRIVER_UNKNOWN) {
WARN_ON(dev_ops->driver_id != RDMA_DRIVER_UNKNOWN &&
dev_ops->driver_id != ops->driver_id);
dev_ops->driver_id = ops->driver_id;
}
if (ops->owner) {
WARN_ON(dev_ops->owner && dev_ops->owner != ops->owner);
dev_ops->owner = ops->owner;
}
if (ops->uverbs_abi_ver)
dev_ops->uverbs_abi_ver = ops->uverbs_abi_ver;
dev_ops->uverbs_no_driver_id_binding |=
ops->uverbs_no_driver_id_binding;
SET_DEVICE_OP(dev_ops, add_gid);
SET_DEVICE_OP(dev_ops, advise_mr);
SET_DEVICE_OP(dev_ops, alloc_dm);
SET_DEVICE_OP(dev_ops, alloc_fmr);
SET_DEVICE_OP(dev_ops, alloc_hw_stats);
SET_DEVICE_OP(dev_ops, alloc_mr);
SET_DEVICE_OP(dev_ops, alloc_mr_integrity);
SET_DEVICE_OP(dev_ops, alloc_mw);
SET_DEVICE_OP(dev_ops, alloc_pd);
SET_DEVICE_OP(dev_ops, alloc_rdma_netdev);
......@@ -2357,6 +2480,11 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
SET_DEVICE_OP(dev_ops, alloc_xrcd);
SET_DEVICE_OP(dev_ops, attach_mcast);
SET_DEVICE_OP(dev_ops, check_mr_status);
SET_DEVICE_OP(dev_ops, counter_alloc_stats);
SET_DEVICE_OP(dev_ops, counter_bind_qp);
SET_DEVICE_OP(dev_ops, counter_dealloc);
SET_DEVICE_OP(dev_ops, counter_unbind_qp);
SET_DEVICE_OP(dev_ops, counter_update_stats);
SET_DEVICE_OP(dev_ops, create_ah);
SET_DEVICE_OP(dev_ops, create_counters);
SET_DEVICE_OP(dev_ops, create_cq);
......@@ -2409,6 +2537,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
SET_DEVICE_OP(dev_ops, iw_reject);
SET_DEVICE_OP(dev_ops, iw_rem_ref);
SET_DEVICE_OP(dev_ops, map_mr_sg);
SET_DEVICE_OP(dev_ops, map_mr_sg_pi);
SET_DEVICE_OP(dev_ops, map_phys_fmr);
SET_DEVICE_OP(dev_ops, mmap);
SET_DEVICE_OP(dev_ops, modify_ah);
......@@ -2445,6 +2574,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
SET_DEVICE_OP(dev_ops, unmap_fmr);
SET_OBJ_SIZE(dev_ops, ib_ah);
SET_OBJ_SIZE(dev_ops, ib_cq);
SET_OBJ_SIZE(dev_ops, ib_pd);
SET_OBJ_SIZE(dev_ops, ib_srq);
SET_OBJ_SIZE(dev_ops, ib_ucontext);
......
......@@ -34,14 +34,18 @@ void ib_mr_pool_put(struct ib_qp *qp, struct list_head *list, struct ib_mr *mr)
EXPORT_SYMBOL(ib_mr_pool_put);
int ib_mr_pool_init(struct ib_qp *qp, struct list_head *list, int nr,
enum ib_mr_type type, u32 max_num_sg)
enum ib_mr_type type, u32 max_num_sg, u32 max_num_meta_sg)
{
struct ib_mr *mr;
unsigned long flags;
int ret, i;
for (i = 0; i < nr; i++) {
mr = ib_alloc_mr(qp->pd, type, max_num_sg);
if (type == IB_MR_TYPE_INTEGRITY)
mr = ib_alloc_mr_integrity(qp->pd, max_num_sg,
max_num_meta_sg);
else
mr = ib_alloc_mr(qp->pd, type, max_num_sg);
if (IS_ERR(mr)) {
ret = PTR_ERR(mr);
goto out;
......
......@@ -6,6 +6,7 @@
#include <rdma/rdma_cm.h>
#include <rdma/ib_verbs.h>
#include <rdma/restrack.h>
#include <rdma/rdma_counter.h>
#include <linux/mutex.h>
#include <linux/sched/task.h>
#include <linux/pid_namespace.h>
......@@ -45,6 +46,7 @@ static const char *type2str(enum rdma_restrack_type type)
[RDMA_RESTRACK_CM_ID] = "CM_ID",
[RDMA_RESTRACK_MR] = "MR",
[RDMA_RESTRACK_CTX] = "CTX",
[RDMA_RESTRACK_COUNTER] = "COUNTER",
};
return names[type];
......@@ -169,6 +171,8 @@ static struct ib_device *res_to_dev(struct rdma_restrack_entry *res)
return container_of(res, struct ib_mr, res)->device;
case RDMA_RESTRACK_CTX:
return container_of(res, struct ib_ucontext, res)->device;
case RDMA_RESTRACK_COUNTER:
return container_of(res, struct rdma_counter, res)->device;
default:
WARN_ONCE(true, "Wrong resource tracking type %u\n", res->type);
return NULL;
......@@ -190,6 +194,20 @@ void rdma_restrack_set_task(struct rdma_restrack_entry *res,
}
EXPORT_SYMBOL(rdma_restrack_set_task);
/**
* rdma_restrack_attach_task() - attach the task onto this resource
* @res: resource entry
* @task: the task to attach, the current task will be used if it is NULL.
*/
void rdma_restrack_attach_task(struct rdma_restrack_entry *res,
struct task_struct *task)
{
if (res->task)
put_task_struct(res->task);
get_task_struct(task);
res->task = task;
}
static void rdma_restrack_add(struct rdma_restrack_entry *res)
{
struct ib_device *dev = res_to_dev(res);
......@@ -203,15 +221,22 @@ static void rdma_restrack_add(struct rdma_restrack_entry *res)
kref_init(&res->kref);
init_completion(&res->comp);
if (res->type != RDMA_RESTRACK_QP)
ret = xa_alloc_cyclic(&rt->xa, &res->id, res, xa_limit_32b,
&rt->next_id, GFP_KERNEL);
else {
if (res->type == RDMA_RESTRACK_QP) {
/* Special case to ensure that LQPN points to right QP */
struct ib_qp *qp = container_of(res, struct ib_qp, res);
ret = xa_insert(&rt->xa, qp->qp_num, res, GFP_KERNEL);
res->id = ret ? 0 : qp->qp_num;
} else if (res->type == RDMA_RESTRACK_COUNTER) {
/* Special case to ensure that cntn points to right counter */
struct rdma_counter *counter;
counter = container_of(res, struct rdma_counter, res);
ret = xa_insert(&rt->xa, counter->id, res, GFP_KERNEL);
res->id = ret ? 0 : counter->id;
} else {
ret = xa_alloc_cyclic(&rt->xa, &res->id, res, xa_limit_32b,
&rt->next_id, GFP_KERNEL);
}
if (!ret)
......@@ -237,7 +262,8 @@ EXPORT_SYMBOL(rdma_restrack_kadd);
*/
void rdma_restrack_uadd(struct rdma_restrack_entry *res)
{
if (res->type != RDMA_RESTRACK_CM_ID)
if ((res->type != RDMA_RESTRACK_CM_ID) &&
(res->type != RDMA_RESTRACK_COUNTER))
res->task = NULL;
if (!res->task)
......@@ -323,3 +349,16 @@ void rdma_restrack_del(struct rdma_restrack_entry *res)
}
}
EXPORT_SYMBOL(rdma_restrack_del);
bool rdma_is_visible_in_pid_ns(struct rdma_restrack_entry *res)
{
	/*
	 * 1. Kernel resources should be visible in the init
	 *    namespace only
	 * 2. Present only resources visible in the current
	 *    namespace
	 */
	if (rdma_is_kernel_res(res))
		return task_active_pid_ns(current) == &init_pid_ns;
	return task_active_pid_ns(current) == task_active_pid_ns(res->task);
}
......@@ -25,4 +25,7 @@ struct rdma_restrack_root {
int rdma_restrack_init(struct ib_device *dev);
void rdma_restrack_clean(struct ib_device *dev);
void rdma_restrack_attach_task(struct rdma_restrack_entry *res,
struct task_struct *task);
bool rdma_is_visible_in_pid_ns(struct rdma_restrack_entry *res);
#endif /* _RDMA_CORE_RESTRACK_H_ */
......@@ -43,6 +43,7 @@
#include <rdma/ib_mad.h>
#include <rdma/ib_pma.h>
#include <rdma/ib_cache.h>
#include <rdma/rdma_counter.h>
struct ib_port;
......@@ -800,9 +801,12 @@ static int update_hw_stats(struct ib_device *dev, struct rdma_hw_stats *stats,
return 0;
}
static ssize_t print_hw_stat(struct rdma_hw_stats *stats, int index, char *buf)
static ssize_t print_hw_stat(struct ib_device *dev, int port_num,
struct rdma_hw_stats *stats, int index, char *buf)
{
return sprintf(buf, "%llu\n", stats->value[index]);
u64 v = rdma_counter_get_hwstat_value(dev, port_num, index);
return sprintf(buf, "%llu\n", stats->value[index] + v);
}
static ssize_t show_hw_stats(struct kobject *kobj, struct attribute *attr,
......@@ -828,7 +832,7 @@ static ssize_t show_hw_stats(struct kobject *kobj, struct attribute *attr,
ret = update_hw_stats(dev, stats, hsa->port_num, hsa->index);
if (ret)
goto unlock;
ret = print_hw_stat(stats, hsa->index, buf);
ret = print_hw_stat(dev, hsa->port_num, stats, hsa->index, buf);
unlock:
mutex_unlock(&stats->lock);
......@@ -999,6 +1003,8 @@ static void setup_hw_stats(struct ib_device *device, struct ib_port *port,
goto err;
port->hw_stats_ag = hsag;
port->hw_stats = stats;
if (device->port_data)
device->port_data[port_num].hw_stats = stats;
} else {
struct kobject *kobj = &device->dev.kobj;
ret = sysfs_create_group(kobj, hsag);
......@@ -1289,6 +1295,8 @@ const struct attribute_group ib_dev_attr_group = {
void ib_free_port_attrs(struct ib_core_device *coredev)
{
struct ib_device *device = rdma_device_to_ibdev(&coredev->dev);
bool is_full_dev = &device->coredev == coredev;
struct kobject *p, *t;
list_for_each_entry_safe(p, t, &coredev->port_list, entry) {
......@@ -1298,6 +1306,8 @@ void ib_free_port_attrs(struct ib_core_device *coredev)
if (port->hw_stats_ag)
free_hsag(&port->kobj, port->hw_stats_ag);
kfree(port->hw_stats);
if (device->port_data && is_full_dev)
device->port_data[port->port_num].hw_stats = NULL;
if (port->pma_table)
sysfs_remove_group(p, port->pma_table);
......
......@@ -52,6 +52,8 @@
#include <rdma/rdma_cm_ib.h>
#include <rdma/ib_addr.h>
#include <rdma/ib.h>
#include <rdma/rdma_netlink.h>
#include "core_priv.h"
MODULE_AUTHOR("Sean Hefty");
MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access");
......@@ -81,7 +83,7 @@ struct ucma_file {
};
struct ucma_context {
int id;
u32 id;
struct completion comp;
atomic_t ref;
int events_reported;
......@@ -94,7 +96,7 @@ struct ucma_context {
struct list_head list;
struct list_head mc_list;
/* mark that device is in process of destroying the internal HW
* resources, protected by the global mut
* resources, protected by the ctx_table lock
*/
int closing;
/* sync between removal event and id destroy, protected by file mut */
......@@ -104,7 +106,7 @@ struct ucma_context {
struct ucma_multicast {
struct ucma_context *ctx;
int id;
u32 id;
int events_reported;
u64 uid;
......@@ -122,9 +124,8 @@ struct ucma_event {
struct work_struct close_work;
};
static DEFINE_MUTEX(mut);
static DEFINE_IDR(ctx_idr);
static DEFINE_IDR(multicast_idr);
static DEFINE_XARRAY_ALLOC(ctx_table);
static DEFINE_XARRAY_ALLOC(multicast_table);
static const struct file_operations ucma_fops;
......@@ -133,7 +134,7 @@ static inline struct ucma_context *_ucma_find_context(int id,
{
struct ucma_context *ctx;
ctx = idr_find(&ctx_idr, id);
ctx = xa_load(&ctx_table, id);
if (!ctx)
ctx = ERR_PTR(-ENOENT);
else if (ctx->file != file || !ctx->cm_id)
......@@ -145,7 +146,7 @@ static struct ucma_context *ucma_get_ctx(struct ucma_file *file, int id)
{
struct ucma_context *ctx;
mutex_lock(&mut);
xa_lock(&ctx_table);
ctx = _ucma_find_context(id, file);
if (!IS_ERR(ctx)) {
if (ctx->closing)
......@@ -153,7 +154,7 @@ static struct ucma_context *ucma_get_ctx(struct ucma_file *file, int id)
else
atomic_inc(&ctx->ref);
}
mutex_unlock(&mut);
xa_unlock(&ctx_table);
return ctx;
}
......@@ -216,10 +217,7 @@ static struct ucma_context *ucma_alloc_ctx(struct ucma_file *file)
INIT_LIST_HEAD(&ctx->mc_list);
ctx->file = file;
mutex_lock(&mut);
ctx->id = idr_alloc(&ctx_idr, ctx, 0, 0, GFP_KERNEL);
mutex_unlock(&mut);
if (ctx->id < 0)
if (xa_alloc(&ctx_table, &ctx->id, ctx, xa_limit_32b, GFP_KERNEL))
goto error;
list_add_tail(&ctx->list, &file->ctx_list);
......@@ -238,13 +236,10 @@ static struct ucma_multicast* ucma_alloc_multicast(struct ucma_context *ctx)
if (!mc)
return NULL;
mutex_lock(&mut);
mc->id = idr_alloc(&multicast_idr, NULL, 0, 0, GFP_KERNEL);
mutex_unlock(&mut);
if (mc->id < 0)
mc->ctx = ctx;
if (xa_alloc(&multicast_table, &mc->id, NULL, xa_limit_32b, GFP_KERNEL))
goto error;
mc->ctx = ctx;
list_add_tail(&mc->list, &ctx->mc_list);
return mc;
......@@ -319,9 +314,9 @@ static void ucma_removal_event_handler(struct rdma_cm_id *cm_id)
* handled separately below.
*/
if (ctx->cm_id == cm_id) {
mutex_lock(&mut);
xa_lock(&ctx_table);
ctx->closing = 1;
mutex_unlock(&mut);
xa_unlock(&ctx_table);
queue_work(ctx->file->close_wq, &ctx->close_work);
return;
}
......@@ -523,9 +518,7 @@ static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf,
err2:
rdma_destroy_id(cm_id);
err1:
mutex_lock(&mut);
idr_remove(&ctx_idr, ctx->id);
mutex_unlock(&mut);
xa_erase(&ctx_table, ctx->id);
mutex_lock(&file->mut);
list_del(&ctx->list);
mutex_unlock(&file->mut);
......@@ -537,13 +530,13 @@ static void ucma_cleanup_multicast(struct ucma_context *ctx)
{
struct ucma_multicast *mc, *tmp;
mutex_lock(&mut);
mutex_lock(&ctx->file->mut);
list_for_each_entry_safe(mc, tmp, &ctx->mc_list, list) {
list_del(&mc->list);
idr_remove(&multicast_idr, mc->id);
xa_erase(&multicast_table, mc->id);
kfree(mc);
}
mutex_unlock(&mut);
mutex_unlock(&ctx->file->mut);
}
static void ucma_cleanup_mc_events(struct ucma_multicast *mc)
......@@ -614,11 +607,11 @@ static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf,
if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
return -EFAULT;
mutex_lock(&mut);
xa_lock(&ctx_table);
ctx = _ucma_find_context(cmd.id, file);
if (!IS_ERR(ctx))
idr_remove(&ctx_idr, ctx->id);
mutex_unlock(&mut);
__xa_erase(&ctx_table, ctx->id);
xa_unlock(&ctx_table);
if (IS_ERR(ctx))
return PTR_ERR(ctx);
......@@ -630,14 +623,14 @@ static ssize_t ucma_destroy_id(struct ucma_file *file, const char __user *inbuf,
flush_workqueue(ctx->file->close_wq);
/* At this point it's guaranteed that there is no inflight
* closing task */
mutex_lock(&mut);
xa_lock(&ctx_table);
if (!ctx->closing) {
mutex_unlock(&mut);
xa_unlock(&ctx_table);
ucma_put_ctx(ctx);
wait_for_completion(&ctx->comp);
rdma_destroy_id(ctx->cm_id);
} else {
mutex_unlock(&mut);
xa_unlock(&ctx_table);
}
resp.events_reported = ucma_free_ctx(ctx);
......@@ -951,8 +944,7 @@ static ssize_t ucma_query_path(struct ucma_context *ctx,
}
}
if (copy_to_user(response, resp,
sizeof(*resp) + (i * sizeof(struct ib_path_rec_data))))
if (copy_to_user(response, resp, struct_size(resp, path_data, i)))
ret = -EFAULT;
kfree(resp);
......@@ -1432,9 +1424,7 @@ static ssize_t ucma_process_join(struct ucma_file *file,
goto err3;
}
mutex_lock(&mut);
idr_replace(&multicast_idr, mc, mc->id);
mutex_unlock(&mut);
xa_store(&multicast_table, mc->id, mc, 0);
mutex_unlock(&file->mut);
ucma_put_ctx(ctx);
......@@ -1444,9 +1434,7 @@ static ssize_t ucma_process_join(struct ucma_file *file,
rdma_leave_multicast(ctx->cm_id, (struct sockaddr *) &mc->addr);
ucma_cleanup_mc_events(mc);
err2:
mutex_lock(&mut);
idr_remove(&multicast_idr, mc->id);
mutex_unlock(&mut);
xa_erase(&multicast_table, mc->id);
list_del(&mc->list);
kfree(mc);
err1:
......@@ -1508,8 +1496,8 @@ static ssize_t ucma_leave_multicast(struct ucma_file *file,
if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
return -EFAULT;
mutex_lock(&mut);
mc = idr_find(&multicast_idr, cmd.id);
xa_lock(&multicast_table);
mc = xa_load(&multicast_table, cmd.id);
if (!mc)
mc = ERR_PTR(-ENOENT);
else if (mc->ctx->file != file)
......@@ -1517,8 +1505,8 @@ static ssize_t ucma_leave_multicast(struct ucma_file *file,
else if (!atomic_inc_not_zero(&mc->ctx->ref))
mc = ERR_PTR(-ENXIO);
else
idr_remove(&multicast_idr, mc->id);
mutex_unlock(&mut);
__xa_erase(&multicast_table, mc->id);
xa_unlock(&multicast_table);
if (IS_ERR(mc)) {
ret = PTR_ERR(mc);
......@@ -1615,14 +1603,14 @@ static ssize_t ucma_migrate_id(struct ucma_file *new_file,
* events being added before existing events.
*/
ucma_lock_files(cur_file, new_file);
mutex_lock(&mut);
xa_lock(&ctx_table);
list_move_tail(&ctx->list, &new_file->ctx_list);
ucma_move_events(ctx, new_file);
ctx->file = new_file;
resp.events_reported = ctx->events_reported;
mutex_unlock(&mut);
xa_unlock(&ctx_table);
ucma_unlock_files(cur_file, new_file);
response:
......@@ -1757,18 +1745,15 @@ static int ucma_close(struct inode *inode, struct file *filp)
ctx->destroying = 1;
mutex_unlock(&file->mut);
mutex_lock(&mut);
idr_remove(&ctx_idr, ctx->id);
mutex_unlock(&mut);
xa_erase(&ctx_table, ctx->id);
flush_workqueue(file->close_wq);
/* At that step once ctx was marked as destroying and workqueue
* was flushed we are safe from any inflights handlers that
* might put other closing task.
*/
mutex_lock(&mut);
xa_lock(&ctx_table);
if (!ctx->closing) {
mutex_unlock(&mut);
xa_unlock(&ctx_table);
ucma_put_ctx(ctx);
wait_for_completion(&ctx->comp);
/* rdma_destroy_id ensures that no event handlers are
......@@ -1776,7 +1761,7 @@ static int ucma_close(struct inode *inode, struct file *filp)
*/
rdma_destroy_id(ctx->cm_id);
} else {
mutex_unlock(&mut);
xa_unlock(&ctx_table);
}
ucma_free_ctx(ctx);
......@@ -1805,6 +1790,19 @@ static struct miscdevice ucma_misc = {
.fops = &ucma_fops,
};
static int ucma_get_global_nl_info(struct ib_client_nl_info *res)
{
	res->abi = RDMA_USER_CM_ABI_VERSION;
	res->cdev = ucma_misc.this_device;
	return 0;
}

static struct ib_client rdma_cma_client = {
	.name = "rdma_cm",
	.get_global_nl_info = ucma_get_global_nl_info,
};
MODULE_ALIAS_RDMA_CLIENT("rdma_cm");
static ssize_t show_abi_version(struct device *dev,
struct device_attribute *attr,
char *buf)
......@@ -1833,7 +1831,14 @@ static int __init ucma_init(void)
ret = -ENOMEM;
goto err2;
}
ret = ib_register_client(&rdma_cma_client);
if (ret)
goto err3;
return 0;
err3:
unregister_net_sysctl_table(ucma_ctl_table_hdr);
err2:
device_remove_file(ucma_misc.this_device, &dev_attr_abi_version);
err1:
......@@ -1843,11 +1848,10 @@ static int __init ucma_init(void)
static void __exit ucma_cleanup(void)
{
ib_unregister_client(&rdma_cma_client);
unregister_net_sysctl_table(ucma_ctl_table_hdr);
device_remove_file(ucma_misc.this_device, &dev_attr_abi_version);
misc_deregister(&ucma_misc);
idr_destroy(&ctx_idr);
idr_destroy(&multicast_idr);
}
module_init(ucma_init);
......
......@@ -54,9 +54,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
page = sg_page_iter_page(&sg_iter);
if (!PageDirty(page) && umem->writable && dirty)
set_page_dirty_lock(page);
put_page(page);
if (umem->writable && dirty)
put_user_pages_dirty_lock(&page, 1);
else
put_user_page(page);
}
sg_free_table(&umem->sg_head);
......@@ -244,7 +245,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
umem->context = context;
umem->length = size;
umem->address = addr;
umem->page_shift = PAGE_SHIFT;
umem->writable = ib_access_writable(access);
umem->owning_mm = mm = current->mm;
mmgrab(mm);
......@@ -361,6 +361,9 @@ static void __ib_umem_release_tail(struct ib_umem *umem)
*/
void ib_umem_release(struct ib_umem *umem)
{
if (!umem)
return;
if (umem->is_odp) {
ib_umem_odp_release(to_ib_umem_odp(umem));
__ib_umem_release_tail(umem);
......@@ -385,7 +388,7 @@ int ib_umem_page_count(struct ib_umem *umem)
n = 0;
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
n += sg_dma_len(sg) >> umem->page_shift;
n += sg_dma_len(sg) >> PAGE_SHIFT;
return n;
}
......
......@@ -54,6 +54,7 @@
#include <rdma/ib_mad.h>
#include <rdma/ib_user_mad.h>
#include <rdma/rdma_netlink.h>
#include "core_priv.h"
......@@ -744,7 +745,7 @@ static int ib_umad_reg_agent(struct ib_umad_file *file, void __user *arg,
"process %s did not enable P_Key index support.\n",
current->comm);
dev_warn(&file->port->dev,
" Documentation/infiniband/user_mad.txt has info on the new ABI.\n");
" Documentation/infiniband/user_mad.rst has info on the new ABI.\n");
}
}
......@@ -1124,11 +1125,48 @@ static const struct file_operations umad_sm_fops = {
.llseek = no_llseek,
};
static int ib_umad_get_nl_info(struct ib_device *ibdev, void *client_data,
struct ib_client_nl_info *res)
{
struct ib_umad_device *umad_dev = client_data;
if (!rdma_is_port_valid(ibdev, res->port))
return -EINVAL;
res->abi = IB_USER_MAD_ABI_VERSION;
res->cdev = &umad_dev->ports[res->port - rdma_start_port(ibdev)].dev;
return 0;
}
static struct ib_client umad_client = {
.name = "umad",
.add = ib_umad_add_one,
.remove = ib_umad_remove_one
.remove = ib_umad_remove_one,
.get_nl_info = ib_umad_get_nl_info,
};
MODULE_ALIAS_RDMA_CLIENT("umad");
static int ib_issm_get_nl_info(struct ib_device *ibdev, void *client_data,
struct ib_client_nl_info *res)
{
struct ib_umad_device *umad_dev =
ib_get_client_data(ibdev, &umad_client);
if (!rdma_is_port_valid(ibdev, res->port))
return -EINVAL;
res->abi = IB_USER_MAD_ABI_VERSION;
res->cdev = &umad_dev->ports[res->port - rdma_start_port(ibdev)].sm_dev;
return 0;
}
static struct ib_client issm_client = {
.name = "issm",
.get_nl_info = ib_issm_get_nl_info,
};
MODULE_ALIAS_RDMA_CLIENT("issm");
static ssize_t ibdev_show(struct device *dev, struct device_attribute *attr,
char *buf)
......@@ -1387,13 +1425,17 @@ static int __init ib_umad_init(void)
}
ret = ib_register_client(&umad_client);
if (ret) {
pr_err("couldn't register ib_umad client\n");
if (ret)
goto out_class;
}
ret = ib_register_client(&issm_client);
if (ret)
goto out_client;
return 0;
out_client:
ib_unregister_client(&umad_client);
out_class:
class_unregister(&umad_class);
......@@ -1411,6 +1453,7 @@ static int __init ib_umad_init(void)
static void __exit ib_umad_cleanup(void)
{
ib_unregister_client(&issm_client);
ib_unregister_client(&umad_client);
class_unregister(&umad_class);
unregister_chrdev_region(base_umad_dev,
......
......@@ -111,9 +111,9 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
INIT_LIST_HEAD(&obj->comp_list);
INIT_LIST_HEAD(&obj->async_list);
cq = ib_dev->ops.create_cq(ib_dev, &attr, &attrs->driver_udata);
if (IS_ERR(cq)) {
ret = PTR_ERR(cq);
cq = rdma_zalloc_drv_obj(ib_dev, ib_cq);
if (!cq) {
ret = -ENOMEM;
goto err_event_file;
}
......@@ -122,10 +122,15 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
cq->comp_handler = ib_uverbs_comp_handler;
cq->event_handler = ib_uverbs_cq_event_handler;
cq->cq_context = ev_file ? &ev_file->ev_queue : NULL;
obj->uobject.object = cq;
obj->uobject.user_handle = user_handle;
atomic_set(&cq->usecnt, 0);
cq->res.type = RDMA_RESTRACK_CQ;
ret = ib_dev->ops.create_cq(cq, &attr, &attrs->driver_udata);
if (ret)
goto err_free;
obj->uobject.object = cq;
obj->uobject.user_handle = user_handle;
rdma_restrack_uadd(&cq->res);
ret = uverbs_copy_to(attrs, UVERBS_ATTR_CREATE_CQ_RESP_CQE, &cq->cqe,
......@@ -136,7 +141,9 @@ static int UVERBS_HANDLER(UVERBS_METHOD_CQ_CREATE)(
return 0;
err_cq:
ib_destroy_cq_user(cq, uverbs_get_cleared_udata(attrs));
cq = NULL;
err_free:
kfree(cq);
err_event_file:
if (ev_file)
uverbs_uobject_put(ev_file_uobj);
......
......@@ -128,6 +128,7 @@ static int UVERBS_HANDLER(UVERBS_METHOD_DM_MR_REG)(
mr->device = pd->device;
mr->pd = pd;
mr->type = IB_MR_TYPE_DM;
mr->dm = dm;
mr->uobject = uobj;
atomic_inc(&pd->usecnt);
......
......@@ -22,6 +22,8 @@ static void *uapi_add_elm(struct uverbs_api *uapi, u32 key, size_t alloc_size)
return ERR_PTR(-EOVERFLOW);
elm = kzalloc(alloc_size, GFP_KERNEL);
if (!elm)
return ERR_PTR(-ENOMEM);
rc = radix_tree_insert(&uapi->radix, key, elm);
if (rc) {
kfree(elm);
......@@ -645,7 +647,7 @@ struct uverbs_api *uverbs_alloc_api(struct ib_device *ibdev)
return ERR_PTR(-ENOMEM);
INIT_RADIX_TREE(&uapi->radix, GFP_KERNEL);
uapi->driver_id = ibdev->driver_id;
uapi->driver_id = ibdev->ops.driver_id;
rc = uapi_merge_def(uapi, ibdev, uverbs_core_api, false);
if (rc)
......
......@@ -7,7 +7,6 @@ obj-$(CONFIG_INFINIBAND_EFA) += efa/
obj-$(CONFIG_INFINIBAND_I40IW) += i40iw/
obj-$(CONFIG_MLX4_INFINIBAND) += mlx4/
obj-$(CONFIG_MLX5_INFINIBAND) += mlx5/
obj-$(CONFIG_INFINIBAND_NES) += nes/
obj-$(CONFIG_INFINIBAND_OCRDMA) += ocrdma/
obj-$(CONFIG_INFINIBAND_VMWARE_PVRDMA) += vmw_pvrdma/
obj-$(CONFIG_INFINIBAND_USNIC) += usnic/
......