Commits · f515f86b34b2e7d4b24cc9b7375c9e749895088e · nexedi / linux

08 Feb, 2018 1 commit

fix parallelism for rpc tasks · f515f86b

Olga Kornievskaia authored Jun 29, 2017

Hi folks,

On a multi-core machine, is it expected that we can have parallel RPCs
handled by each of the per-core workqueues?

In testing a read workload, I observe via the "top" command that a single
"kworker" thread services the requests (no parallelism).
This is more prominent when doing these operations over a krb5p mount.

Bruce suggested trying this change, and in my testing I then see the
read workload spread among all the kworker threads.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

07 Feb, 2018 2 commits

Make the xprtiod workqueue unbounded. · 90ea9f1b

Trond Myklebust authored Feb 06, 2018

This should help reduce the latency on replies.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

SUNRPC: Queue latency-sensitive socket tasks to xprtiod · 2275cde4

Trond Myklebust authored Feb 07, 2018

The response to a write_space notification is very latency sensitive,
so we should queue it to the lower latency xprtiod_workqueue. This
is something we already do for the other cases where an rpc task
holds the transport XPRT_LOCKED bitlock.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

06 Feb, 2018 2 commits

Merge tag 'nfs-rdma-for-4.16-2' of git://git.linux-nfs.org/projects/anna/linux-nfs · 0c8cbcd3

Trond Myklebust authored Feb 06, 2018

NFS-over-RDMA client fixes for Linux 4.16 #2

Stable fixes:
- Fix calculating ri_max_send_sges, which can oops if max_sge is too small
- Fix a BUG after device removal if freed resources haven't been allocated yet

SUNRPC: Ensure we always close the socket after a connection shuts down · 9b30889c

Trond Myklebust authored Feb 05, 2018

Ensure that we release the TCP socket once it is in the TCP_CLOSE or
TCP_TIME_WAIT state (and only then) so that we don't confuse rkhunter
and its ilk.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

02 Feb, 2018 2 commits

xprtrdma: Fix BUG after a device removal · e89e8d8f

Chuck Lever authored Jan 31, 2018

Michal Kalderon reports a BUG that occurs just after device removal:

[  169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
[  169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[  169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]

The RPC/RDMA client transport attempts to allocate some resources
on demand. Registered buffers are one such resource. These are
allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
and Reply messages. A hardware resource is associated with each of
these buffers, as they can be used for a Send or Receive Work
Request.

If a device is removed from under an NFS/RDMA mount, the transport
layer is responsible for releasing all hardware resources before
the device can be finally unplugged. A BUG results when the NFS
mount hasn't yet seen much activity: the transport tries to release
resources that haven't yet been allocated.

rpcrdma_free_regbuf() already checks for this case, so just move
that check to cover the DEVICE_REMOVAL case as well.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Fix calculation of ri_max_send_sges · 1179e2c2

Chuck Lever authored Jan 31, 2018

Commit 16f906d6 ("xprtrdma: Reduce required number of send
SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
a problem where xprtrdma would not work if the device's max_sge
capability was small (low single digits).

At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
each RPC. ri_max_send_sges is set to this value:

  ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;

Then when marshaling each RPC, rpcrdma_args_inline uses that value
to determine whether the device has enough Send SGEs to convey an
NFS WRITE payload inline, or whether instead a Read chunk is
required.

More recently, commit ae72950a ("xprtrdma: Add data structure to
manage RDMA Send arguments") used the ri_max_send_sges value to
calculate the size of an array, but that commit erroneously assumed
ri_max_send_sges contains a value similar to the device's max_sge,
and not one that was reduced by the minimum SGE count.

This assumption results in the calculated size of the sendctx's
Send SGE array being too small. When the array is used to marshal
an RPC, the code can write Send SGEs into the following sendctx
element in that array, corrupting it. When the device's max_sge is
large, this issue is entirely harmless; but it results in an oops
in the provider's post_send method, if dev.attrs.max_sge is small.

So let's straighten this out: ri_max_send_sges will now contain a
value with the same meaning as dev.attrs.max_sge, which makes
the code easier to understand, and enables rpcrdma_sendctx_create
to calculate the size of the SGE array correctly.
Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
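
The sizing mismatch described in this message can be illustrated with a small user-space sketch. It is not the kernel code: the value 3 for RPCRDMA_MIN_SEND_SGES and all three helper names are assumptions made for illustration, and the model only shows how far short an array falls when it is sized from the reduced value.

```c
/* Assumed value, for illustration only; the commit message gives no number. */
#define RPCRDMA_MIN_SEND_SGES 3

/* Before the fix: the field held max_sge reduced by the minimum SGE count.
 * Precondition for this sketch: max_sge >= RPCRDMA_MIN_SEND_SGES. */
unsigned int ri_max_send_sges_before(unsigned int max_sge)
{
	return max_sge - RPCRDMA_MIN_SEND_SGES;
}

/* After the fix: the field has the same meaning as the device's max_sge. */
unsigned int ri_max_send_sges_after(unsigned int max_sge)
{
	return max_sge;
}

/* How many SGE slots short an array is when it is sized from
 * ri_max_send_sges but was assumed to hold max_sge entries. */
unsigned int sge_array_shortfall(unsigned int ri_max_send_sges,
				 unsigned int max_sge)
{
	return max_sge > ri_max_send_sges ? max_sge - ri_max_send_sges : 0;
}
```

Under these assumptions, a device with max_sge of 4 leaves the array three slots short before the fix and zero slots short afterwards.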

29 Jan, 2018 1 commit

NFS: Fix a race between mmap() and O_DIRECT · e231c687

Trond Myklebust authored Jan 28, 2018

When locking the file in order to do O_DIRECT on it, we must unmap
any mmapped ranges on the pagecache so that we can flush out the
dirty data.

Fixes: a5864c99 ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org # v4.8+

28 Jan, 2018 1 commit

NFS: Remove a redundant call to unmap_mapping_range() · 128159f2

Trond Myklebust authored Jan 28, 2018

We don't need to call unmap_mapping_range() prior to calling
nfs_sync_mapping().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

25 Jan, 2018 2 commits

pnfs/blocklayout: Ensure disk address in block device map · f34462c3

Benjamin Coddington authored Jan 25, 2018

It's possible that the device map is smaller than the offset into the device
for the I/O we're adding.  Add a check for it and bail out, otherwise we
risk botching the bio calculations that follow.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
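
A range check of the kind described above might look like the following user-space sketch. The struct layout, field names, and the function `map_disk_addr` are simplified assumptions for illustration, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified device map; all values are in bytes. */
struct dev_map {
	uint64_t start;       /* first device offset covered by this map */
	uint64_t len;         /* number of bytes covered */
	uint64_t disk_offset; /* where the covered range begins on disk */
};

/*
 * Translate disk_addr through the map. Returns false when the address
 * falls outside the map, so the caller can bail out before doing any
 * further I/O calculations with a bogus offset.
 */
bool map_disk_addr(const struct dev_map *map, uint64_t disk_addr,
		   uint64_t *out)
{
	if (disk_addr < map->start || disk_addr - map->start >= map->len)
		return false;

	*out = disk_addr - map->start + map->disk_offset;
	return true;
}
```

The subtraction is only performed after the lower-bound test, so the comparison against `len` cannot wrap around.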

pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors · b3960475

Benjamin Coddington authored Jan 25, 2018

Fix up the field types to match their use.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>

24 Jan, 2018 1 commit

lockd: Fix server refcounting · 535cb8f3

Trond Myklebust authored Jan 23, 2018

The server shouldn't actually delete the struct nlm_host until it hits
the garbage collector. In order to make that work correctly with the
refcount API, we can bump the refcount by one, and then use
refcount_dec_if_one() in the garbage collector.
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
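
The "extra reference plus decrement-if-one" idea can be sketched in user space with C11 atomics. The struct and function names below are hypothetical stand-ins; only the 1 -> 0 transition mirrors what the kernel's refcount_dec_if_one() is documented to do.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nlm_host; only the count is modeled. */
struct host {
	atomic_int refs;
};

/* A new host starts at 2: one caller reference plus one extra reference
 * that keeps the object alive until the garbage collector runs. */
void host_init(struct host *h)
{
	atomic_init(&h->refs, 2);
}

/* Dropping a caller reference never frees the object in this model. */
void host_release(struct host *h)
{
	atomic_fetch_sub(&h->refs, 1);
}

/* Garbage collector step: atomically move the count from 1 to 0 and
 * report success; any other value is left untouched. */
bool host_dec_if_one(struct host *h)
{
	int expected = 1;

	return atomic_compare_exchange_strong(&h->refs, &expected, 0);
}
```

With this arrangement the collector can only reclaim a host once every caller reference has been dropped.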

23 Jan, 2018 20 commits

Merge tag 'nfs-rdma-for-4.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs · 8f39fce8

Trond Myklebust authored Jan 23, 2018

NFS-over-RDMA client updates for Linux 4.16

New features:
- xprtrdma tracepoints

Bugfixes and cleanups:
- Fix memory leak if rpcrdma_buffer_create() fails
- Fix allocating extra rpcrdma_reps for the backchannel
- Remove various unused and redundant variables and lock cycles
- Fix IPv6 support in xprt_rdma_set_port()
- Fix memory leak by calling buf_free for callback replies
- Fix "bytes registered" accounting
- Fix kernel-doc comments
- SUNRPC tracepoint cleanups for consistent information
- Optimizations for __rpc_execute()

SUNRPC: Fix null rpc_clnt dereference in rpc_task_queued tracepoint · 0be283f6

Benjamin Coddington authored Jan 23, 2018

Backchannel tasks will not have a reference to the rpc_clnt.  Return -1 for
cl_clid in that case.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>

SUNRPC: Micro-optimize __rpc_execute · 21ead9ff

Chuck Lever authored Jan 03, 2018

The common case: There are 13 to 14 actions per RPC, and tk_callback
is non-NULL in only one of them. There's no need to store a NULL in
the tk_callback field during each FSM step.

This slightly improves throughput results in dbench and other multi-
threaded benchmarks on my two-socket client on 56Gb InfiniBand, but
will probably be inconsequential on slower systems.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
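
A minimal user-space sketch of the idea: pick the callback when one is set, and write NULL back only in that case. The pared-down struct, the store counter, and the two no-op step functions are assumptions added for illustration.

```c
#include <stddef.h>

struct task;
typedef void (*action_fn)(struct task *);

/* Hypothetical, pared-down task; only the two function pointers matter. */
struct task {
	action_fn tk_action;
	action_fn tk_callback;
	int callback_stores; /* counts writes to tk_callback, illustration only */
};

/* No-op FSM steps so that the sketch has something to point at. */
void step_action(struct task *t)   { (void)t; }
void step_callback(struct task *t) { (void)t; }

/* Choose the function for this FSM step. tk_callback is cleared only when
 * it was actually set, so the common case performs no store at all. */
action_fn next_action(struct task *task)
{
	action_fn do_action = task->tk_action;

	if (task->tk_callback != NULL) {
		do_action = task->tk_callback;
		task->tk_callback = NULL;
		task->callback_stores++;
	}
	return do_action;
}
```

In this model the value returned by `next_action` is the function that really runs in the step, whichever field it came from.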

SUNRPC: task_run_action should display tk_callback · cf08d6f2

Chuck Lever authored Jan 03, 2018

This shows up in every RPC:

kworker/4:1-19772 [004] 3467.373443: rpc_task_run_action: task:4711@2 flags=0e81 state=0005 status=0 action=call_status
kworker/4:1-19772 [004] 3467.373444: rpc_task_run_action: task:4711@2 flags=0e81 state=0005 status=0 action=call_status

What's actually going on is that the first iteration of the RPC
scheduler is invoking the function in tk_callback (in this case,
xprt_timer), then invoking call_status on the next iteration.

Feeding do_action, rather than tk_action, to the "task_run_action"
trace point will now always display the correct FSM step.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

sunrpc: Format RPC events consistently for display · 52069449

Chuck Lever authored Jan 03, 2018

Clean up: Make it easier to use text search when browsing a trace
report. Other events use "status=%d".
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Trace xprt_timer events · 82476d9f

Chuck Lever authored Jan 03, 2018

Track RPC timeouts: report the XID and the server address to match
the content of network capture.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Correct some documenting comments · 9ab6d89e

Chuck Lever authored Jan 03, 2018

Fix kernel-doc warnings in net/sunrpc/xprtrdma/.

net/sunrpc/xprtrdma/verbs.c:1575: warning: No description found for parameter 'count'
net/sunrpc/xprtrdma/verbs.c:1575: warning: Excess function parameter 'min_reqs' description in 'rpcrdma_ep_post_extra_recv'

net/sunrpc/xprtrdma/backchannel.c:288: warning: No description found for parameter 'r_xprt'
net/sunrpc/xprtrdma/backchannel.c:288: warning: Excess function parameter 'xprt' description in 'rpcrdma_bc_receive_call'
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Fix "bytes registered" accounting · aae2349c

Chuck Lever authored Jan 03, 2018

The contents of seg->mr_len changed when ->ro_map stopped returning
the full chunk length in the first segment. Count the full length of
each Write chunk, not the length of the first segment (which now can
only be as large as a page).

Fixes: 9d6b0409 ("xprtrdma: Place registered MWs on a ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects · ae724676

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points to instrument QP and CQ access upcalls · 643cf323

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points in the client-side backchannel code paths · fc1eb807

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points for connect events · b4744e00

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points to instrument MR allocation and recovery · 1c443eff

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points to instrument memory invalidation · 2937fede

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points in reply decoder path · e11b7c96

Chuck Lever authored Dec 20, 2017

This includes decoding Write and Reply chunks, and fixing up inline
payloads.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points to instrument memory registration · 58f10ad4

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points in the RPC Reply handler paths · b4a7f91c

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add trace points in RPC Call transmit paths · ab03eff5

Chuck Lever authored Dec 20, 2017

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

rpcrdma: infrastructure for static trace points in rpcrdma.ko · e48f083e

Chuck Lever authored Jan 20, 2018

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

rdma/ib: Add trace point macros to display human-readable values · 6b3a60ae

Chuck Lever authored Jan 20, 2018

These can be shared with all kernel ULPs, and more can easily be
added as needed.

Note: checkpatch.pl has some heartburn with the TRACE_DEFINE_ENUM
macros and the LIST macros. These follow the same style as other
header files under include/tracing/events, and thus should be
considered acceptable exceptions.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

22 Jan, 2018 1 commit

NFS: reject request for id_legacy key without auxdata · 49686cbb

Eric Biggers authored Jan 19, 2018

nfs_idmap_legacy_upcall() is supposed to be called with 'aux' pointing
to a 'struct idmap', via the call to request_key_with_auxdata() in
nfs_idmap_request_key().

However it can also be reached via the request_key() system call in
which case 'aux' will be NULL, causing a NULL pointer dereference in
nfs_idmap_prepare_pipe_upcall(), assuming that the key description is
valid enough to get that far.

Fix this by making nfs_idmap_legacy_upcall() negate the key if no
auxdata is provided.

As usual, this bug was found by syzkaller.  A simple reproducer using
the command-line keyctl program is:

    keyctl request2 id_legacy uid:0 '' @s

Fixes: 57e62324 ("NFS: Store the legacy idmapper result in the keyring")
Reported-by: syzbot+5dfdbcf7b3eb5912abbb@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Trond Myklebust <trondmy@gmail.com>
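
The shape of the guard can be sketched in user space as follows. The struct contents, the function name `legacy_upcall`, and the specific error codes are assumptions of this sketch; the only point taken from the message is that a NULL auxdata pointer must be rejected before anything dereferences it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct idmap. */
struct idmap {
	int pipe_ready;
};

/*
 * When the upcall is reached without auxdata (aux == NULL), refuse the
 * request up front. The error codes here are chosen arbitrarily.
 */
int legacy_upcall(const void *aux)
{
	const struct idmap *idmap = aux;

	if (idmap == NULL)
		return -EINVAL;

	return idmap->pipe_ready ? 0 : -EAGAIN;
}
```

A caller that passes no auxdata now gets an error instead of triggering a NULL pointer dereference further down the call chain.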

18 Jan, 2018 3 commits

nfs: Do not convert nfs_idmap_cache_timeout to jiffies · cbebc6ef

Jan Chochol authored Jan 05, 2018

Since commit 57e62324 ("NFS: Store the legacy idmapper result in the
keyring"), nfs_idmap_cache_timeout has been in seconds rather than jiffies.
Unfortunately, the sysctl interface was not updated accordingly.

As a result, writing a value to /proc/sys/fs/nfs/idmap_cache_timeout
incorrectly multiplies that value by HZ.
Likewise, reading /proc/sys/fs/nfs/idmap_cache_timeout shows the real value
divided by HZ.

Fixes: 57e62324 ("NFS: Store the legacy idmapper result in the keyring")
Signed-off-by: Jan Chochol <jan@chochol.info>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
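
The unit mismatch is simple arithmetic, sketched here in user space. The HZ value of 250 and the four function names are assumptions for illustration; the first pair models a handler that converts between seconds and jiffies, the second pair a handler that passes the integer through unchanged.

```c
/* Assumed tick rate for this sketch; real kernels fix HZ at build time. */
#define HZ 250

/* Jiffies-converting handler: user input is scaled by HZ on write... */
long sysctl_write_jiffies(long user_value)
{
	return user_value * HZ;
}

/* ...and scaled back down on read. */
long sysctl_read_jiffies(long stored)
{
	return stored / HZ;
}

/* Plain integer handler: the value is stored and shown as-is (seconds). */
long sysctl_write_plain(long user_value)
{
	return user_value;
}

long sysctl_read_plain(long stored)
{
	return stored;
}
```

With these assumptions, writing 600 through the converting handler stores 150000, and a stored timeout of 600 seconds reads back as 2; the plain handler keeps 600 in both directions.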

nfs: Use proper enum definitions for nfs_show_stable · 06e19024

Chuck Lever authored Jan 18, 2018

Commit 8224b273 ("NFS: Add static NFS I/O tracepoints") had a
hack to work around some odd behavior observed with
__print_symbolic. I couldn't ever get it to display NFS_FILE_SYNC
when using TRACE_DEFINE_ENUM macros to set up the enum values.

I tracked down the actual bug that forced me to add the workaround.
That issue will be addressed soon, so replace the hack with a proper
implementation.

Fixes: 8224b273 ("NFS: Add static NFS I/O tracepoints")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

nfs41: do not return ENOMEM on LAYOUTUNAVAILABLE · 7ff4cff6

Tigran Mkrtchyan authored Jan 16, 2018

A pNFS server may return a LAYOUTUNAVAILABLE error on LAYOUTGET for files
which don't have any layout. In this situation pnfs_update_layout
currently returns NULL. As this NULL is converted into ENOMEM, I/O
requests fail instead of falling back to the MDS.

Do not return ENOMEM on LAYOUTUNAVAILABLE, and let the client retry
through the MDS.

Fixes 8d40b0f1. I suggest backporting this fix to the affected
stable branches.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
[trondmy: Use IS_ERR_OR_NULL()]
Fixes: 8d40b0f1 ("NFS filelayout:call GETDEVICEINFO after...")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
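
The difference between "no layout" and "error" can be sketched in user space with imitations of the kernel's error-pointer convention. Everything below (`err_ptr`, `is_err`, the two `update_layout_*` helpers, and `choose_io_path`) is a hypothetical model of the behaviour described in the message, not the actual patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel's error-pointer helpers. */
void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

enum io_path { IO_VIA_LAYOUT, IO_VIA_MDS, IO_FAILED };

/* Before: a NULL layout was turned into an -ENOMEM error pointer. */
void *update_layout_before(void *lseg)
{
	return lseg ? lseg : err_ptr(-ENOMEM);
}

/* After: NULL is passed through untouched. */
void *update_layout_after(void *lseg)
{
	return lseg;
}

/* Caller: errors fail the I/O, NULL falls back to the MDS. */
enum io_path choose_io_path(const void *lseg)
{
	if (is_err(lseg))
		return IO_FAILED;
	return lseg ? IO_VIA_LAYOUT : IO_VIA_MDS;
}
```

In this model a missing layout fails the I/O before the change and takes the MDS path after it, while genuine errors still fail.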

16 Jan, 2018 4 commits

xprtrdma: Introduce rpcrdma_mw_unmap_and_put · ec12e479

Chuck Lever authored Dec 14, 2017

Clean up: Code review suggested that a common bit of code can be
placed into a helper function, and this gives us fewer places to
stick an "I DMA unmapped something" trace point.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove usage of "mw" · 96ceddea

Chuck Lever authored Dec 14, 2017

Clean up: struct rpcrdma_mw was named after Memory Windows, but
xprtrdma no longer supports a Memory Window registration mode.
Rename rpcrdma_mw and its fields to reduce confusion and make
the code more sensible to read.

Renaming "mw" was suggested by Tom Talpey, the author of the
original xprtrdma implementation. It's a good idea, but I haven't
done this until now because it's a huge diffstat for no benefit
other than code readability.

However, I'm about to introduce static trace points that expose
a few of xprtrdma's internal data structures. They should make sense
in the trace report, and it's reasonable to treat trace points as a
kernel API contract which might be difficult to change later.

While I'm churning things up, two additional changes:
- rename variables unhelpfully called "r" to "mr", to improve code
  clarity, and
- rename the MR-related helper functions using the form
  "rpcrdma_mr_<verb>", to be consistent with other areas of the
  code.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Replace all usage of "frmr" with "frwr" · ce5b3717

Chuck Lever authored Dec 14, 2017

Clean up: Over time, the industry has adopted the term "frwr"
instead of "frmr". The term "frwr" is now more widely recognized.

For the past couple of years I've attempted to add new code using
"frwr", but plenty of older code still uses "frmr". Replace all
usage of "frmr" to avoid confusion.

While we're churning code, rename variables unhelpfully called "f"
to "frwr", to improve code clarity.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Don't clear RPC_BC_PA_IN_USE on pre-allocated rpc_rqst's · 30b5416b

Chuck Lever authored Dec 14, 2017

No need for the overhead of atomically setting and clearing this bit
flag for every use of a pre-allocated backchannel rpc_rqst. These
are a distinct pool of rpc_rqsts that are used only for callback
operations, so it is safe to simply leave the bit set.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
