Commits · 58002399da65c53f00987623d9ff7c3c2773a0d9 · nexedi / linux

26 Apr, 2019 6 commits

NFSv4: Convert the NFS client idmapper to use the container user namespace · 58002399

Trond Myklebust authored Apr 24, 2019

When mapping NFS identities using the NFSv4 idmapper, we want to substitute
for the uids and gids that would normally go on the wire as part of a
NFSv3 request. So we use the same mapping in the NFSv4 upcall as we
use in the NFSv3 RPC call (i.e. the mapping stored in the rpc_clnt cred).
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

58002399

NFS: Convert NFSv3 to use the container user namespace · 264d948c

Trond Myklebust authored Apr 24, 2019

When mapping NFS identities, we want to substitute for the uids and
gids on the wire as we would for the AUTH_UNIX creds.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

264d948c

SUNRPC: Use namespace of listening daemon in the client AUTH_GSS upcall · ac83228a

Trond Myklebust authored Apr 24, 2019

When the client needs to talk to rpc.gssd, we should ensure that the
uid argument is encoded to match the user namespace of the daemon.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

ac83228a

SUNRPC: Use the client user namespace when encoding creds · 283ebe3e

Trond Myklebust authored Apr 24, 2019

When encoding AUTH_UNIX creds and AUTH_GSS upcalls, use the user namespace
of the process that created the rpc client.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

283ebe3e

NFS: Store the credential of the mount process in the nfs_server · 1a58e8a0

Trond Myklebust authored Apr 24, 2019

Store the credential of the mount process so that we can determine
information such as the user namespace.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

1a58e8a0

SUNRPC: Cache cred of process creating the rpc_client · 79caa5fa

Trond Myklebust authored Apr 24, 2019

When converting kuids to AUTH_UNIX creds, etc we will want to use the
same user namespace as the process that created the rpc client.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

79caa5fa

25 Apr, 2019 34 commits

xprtrdma: Remove stale comment · 2cfd11f1

Chuck Lever authored Apr 24, 2019

The comment hasn't been accurate for several years.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

2cfd11f1

xprtrdma: Update comments that reference ib_drain_qp · b8fe677f

Chuck Lever authored Apr 24, 2019

Commit e1ede312 ("xprtrdma: Fix helper that drains the
transport") replaced the ib_drain_qp() call, so update documenting
comments to reflect current operation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

b8fe677f

xprtrdma: Remove pr_err() call sites from completion handlers · 5f2311f5

Chuck Lever authored Apr 24, 2019

Clean up: rely on the trace points instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

5f2311f5

xprtrdma: Eliminate struct rpcrdma_create_data_internal · 86c4ccd9

Chuck Lever authored Apr 24, 2019

Clean up.

Move the remaining field in rpcrdma_create_data_internal so the
structure can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

86c4ccd9

xprtrdma: Aggregate the inline settings in struct rpcrdma_ep · 94087e97

Chuck Lever authored Apr 24, 2019

Clean up.

The inline settings are actually a characteristic of the endpoint,
and not related to the device. They are also modified after the
transport instance is created, so they do not belong in the cdata
structure either.

Lastly, let's use names that are more natural to RDMA than to NFS:
inline_write -> inline_send and inline_read -> inline_recv. The
/proc files retain their names to avoid breaking user space.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

94087e97

xprtrdma: Remove rpcrdma_create_data_internal::rsize and wsize · fd595174

Chuck Lever authored Apr 24, 2019

Clean up.

xprt_rdma_max_inline_{read,write} cannot be set to large values
by virtue of proc_dointvec_minmax. The current maximum is
RPCRDMA_MAX_INLINE, which is much smaller than RPCRDMA_MAX_SEGS *
PAGE_SIZE.

The .rsize and .wsize fields are otherwise unused in the current
code base, and thus can be removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

fd595174

SUNRPC: Update comments based on recent changes · 1f7d1c73

Chuck Lever authored Apr 24, 2019

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

1f7d1c73

xprtrdma: Eliminate rpcrdma_ia::ri_device · f19bd0bb

Chuck Lever authored Apr 24, 2019

Clean up.

Since commit 54cbd6b0 ("xprtrdma: Delay DMA mapping Send and
Receive buffers"), a pointer to the device is now saved in each
regbuf when it is DMA mapped.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

f19bd0bb

xprtrdma: More Send completion batching · c209e49c

Chuck Lever authored Apr 24, 2019

Instead of using a fixed number, allow the amount of Send completion
batching to vary based on the client's maximum credit limit.

- A larger default gives a small boost to IOPS throughput

- Reducing it based on max_requests gives a safe result when the
  max credit limit is cranked down (eg. when the device has a small
  max_qp_wr).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c209e49c

xprtrdma: Clean up sendctx functions · dbcc53a5

Chuck Lever authored Apr 24, 2019

Minor clean-ups I've stumbled on since sendctx was merged last year.
In particular, making Send completion processing more efficient
appears to have a measurable impact on IOPS throughput.

Note: test_and_clear_bit() returns a value, thus an explicit memory
barrier is not necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

dbcc53a5

xprtrdma: Trace marshaling failures · 17e4c443

Chuck Lever authored Apr 24, 2019

Record an event when rpcrdma_marshal_req returns a non-zero return
value to help track down why an xprt close might have occurred.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

17e4c443

xprtrdma: Increase maximum number of backchannel requests · 4ba02e8d

Chuck Lever authored Apr 24, 2019

Reflects the change introduced in commit 067c4696 ("NFSv4.1:
Bump the default callback session slot count to 16").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

4ba02e8d

xprtrdma: Backchannel can use GFP_KERNEL allocations · 3f9c7e76

Chuck Lever authored Apr 24, 2019

The Receive handler runs in process context, thus can use on-demand
GFP_KERNEL allocations instead of pre-allocation.

This makes the xprtrdma backchannel independent of the number of
backchannel session slots provisioned by the Upper Layer protocol.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

3f9c7e76

xprtrdma: Clean up regbuf helpers · d2832af3

Chuck Lever authored Apr 24, 2019

For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action

Also rpcrdma_regbuf_alloc and rpcrdma_regbuf_free no longer have any
callers outside of verbs.c, and can thus be made static.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

d2832af3

xprtrdma: De-duplicate "allocate new, free old regbuf" · 0f665ceb

Chuck Lever authored Apr 24, 2019

Clean up by providing an API to do this common task.

At this point, the difference between rpcrdma_get_sendbuf and
rpcrdma_get_recvbuf has become tiny. These can be collapsed into a
single helper.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

0f665ceb

xprtrdma: Allocate req's regbufs at xprt create time · bb93a1ae

Chuck Lever authored Apr 24, 2019

Allocating an rpcrdma_req's regbufs at xprt create time enables
a pair of micro-optimizations:

First, if these regbufs are always there, we can eliminate two
conditional branches from the hot xprt_rdma_allocate path.

Second, by allocating a 1KB buffer, it places a lower bound on the
size of these buffers, without adding yet another conditional
branch. The lower bound reduces the number of hardway re-
allocations. In fact, for some workloads it completely eliminates
hardway allocations.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

bb93a1ae

xprtrdma: rpcrdma_regbuf alignment · 8cec3dba

Chuck Lever authored Apr 24, 2019

Allocate the struct rpcrdma_regbuf separately from the I/O buffer
to better guarantee the alignment of the I/O buffer and eliminate
the wasted space between the rpcrdma_regbuf metadata and the buffer
itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

8cec3dba

xprtrdma: Clean up rpcrdma_create_rep() and rpcrdma_destroy_rep() · 23146500

Chuck Lever authored Apr 24, 2019

For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

23146500

xprtrdma: Clean up rpcrdma_create_req() · 1769e6a8

Chuck Lever authored Apr 24, 2019

Eventually, I'd like to invoke rpcrdma_create_req() during the
call_reserve step. Memory allocation there probably needs to use
GFP_NOIO. Therefore a set of GFP flags needs to be passed in.

As an additional clean up, just return a pointer or NULL, because
the only error return code here is -ENOMEM.

Lastly, clean up the function names to be consistent with the
pattern: "rpcrdma" _ object-type _ action
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

1769e6a8

xprtrdma: Fix an frwr_map recovery nit · b2ca473b

Chuck Lever authored Apr 24, 2019

After a DMA map failure in frwr_map, mark the MR so that recycling
won't attempt to DMA unmap it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: e2f34e26 ("xprtrdma: Yet another double DMA-unmap")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

b2ca473b

SUNRPC: Avoid digging into the ATOMIC pool · 52db6f9a

Chuck Lever authored Apr 24, 2019

Page allocation requests made when the SPARSE_PAGES flag is set are
allowed to fail, and are not critical. No need to spend a rare
resource.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

52db6f9a

Fix nfs4.2 return -EINVAL when do dedupe operation · ce96e888

Xiaoli Feng authored Mar 16, 2019

dedupe_file_range operations is combiled into remap_file_range.
But in nfs42_remap_file_range, it's skiped for dedupe operations.
Before this patch:
  # dd if=/dev/zero of=nfs/file bs=1M count=1
  # xfs_io -c "dedupe nfs/file 4k 64k 4k" nfs/file
  XFS_IOC_FILE_EXTENT_SAME: Invalid argument
After this patch:
  # dd if=/dev/zero of=nfs/file bs=1M count=1
  # xfs_io -c "dedupe nfs/file 4k 64k 4k" nfs/file
  deduped 4096/4096 bytes at offset 65536
  4 KiB, 1 ops; 0.0046 sec (865.988 KiB/sec and 216.4971 ops/sec)
Signed-off-by: Xiaoli Feng <fengxiaoli0714@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

ce96e888

NFS: Remove redundant open context from nfs_page · c79d183e

Trond Myklebust authored Apr 07, 2019

The lock context already references and tracks the open context, so
take the opportunity to save some space in struct nfs_page.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c79d183e

NFS: Add a helper to return a pointer to the open context of a struct nfs_page · 9fcd5960

Trond Myklebust authored Apr 07, 2019

Add a helper for when we remove the explicit pointer to the open
context.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

9fcd5960

NFS: Ensure that all nfs lock contexts have a valid open context · 15494511

Trond Myklebust authored Apr 07, 2019

Force the lock context to keep a reference to the parent open
context so that we can guarantee the validity of the latter.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

15494511

NFS: Allow signal interruption of NFS4ERR_DELAYed operations · 0688e64b

Trond Myklebust authored Apr 07, 2019

If the server is unable to immediately execute an RPC call, and returns
an NFS4ERR_DELAY then we can assume it is safe to interrupt the operation
in order to handle ordinary signals. This allows the application to
service timer interrupts that would otherwise have to wait until the
server is again able to respond.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

0688e64b

pNFS: Add tracking to limit the number of pNFS retries · 33344e0f

Trond Myklebust authored Apr 07, 2019

When the client is reading or writing using pNFS, and hits an error
on the DS, then it typically sends a LAYOUTERROR and/or LAYOUTRETURN
to the MDS, before redirtying the failed pages, and going for a new
round of reads/writebacks. The problem is that if the server has no
way to fix the DS, then we may need a way to interrupt this loop
after a set number of attempts have been made.
This patch adds an optional module parameter that allows the admin
to specify how many times to retry the read/writeback process before
failing with a fatal error.
The default behaviour is to retry forever.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

33344e0f

NFS: Remove unused argument from nfs_create_request() · 28b1d3f5

Trond Myklebust authored Apr 07, 2019

All the callers of nfs_create_request() are now creating page group
heads, so we can remove the redundant 'last' page argument.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

28b1d3f5

NFS: Fix up NFS I/O subrequest creation · c917cfaf

Trond Myklebust authored Apr 07, 2019

We require all NFS I/O subrequests to duplicate the lock context as well
as the open context.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

c917cfaf

NFS: Replace custom error reporting mechanism with generic one · 6fbda89b

Trond Myklebust authored Apr 07, 2019

Replace the NFS custom error reporting mechanism with the generic
mapping_set_error().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

6fbda89b

NFS: Don't inadvertently clear writeback errors · aded8d7b

Trond Myklebust authored Apr 07, 2019

vfs_fsync() has the side effect of clearing unreported writeback errors,
so we need to make sure that we do not abuse it in situations where
applications might not normally expect us to report those errors.

The solution is to replace calls to vfs_fsync() with calls to nfs_wb_all().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

aded8d7b

NFS: Don't call generic_error_remove_page() while holding locks · 22876f54

Trond Myklebust authored Apr 07, 2019

The NFS read code can trigger writeback while holding the page lock.
If an error then triggers a call to nfs_write_error_remove_page(),
we can deadlock.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

22876f54

NFS: Don't interrupt file writeout due to fatal errors · 14bebe3c

Trond Myklebust authored Apr 07, 2019

When flushing out dirty pages, the fact that we may hit fatal errors
is not a reason to stop writeback. Those errors are reported through
fsync(), not through the flush mechanism.

Fixes: a6598813 ("NFS: Don't write back further requests if there...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

14bebe3c

NFS: Add a mount option "softerr" to allow clients to see ETIMEDOUT errors · 91a575e1

Trond Myklebust authored Apr 07, 2019

Add a mount option that exposes the ETIMEDOUT errors that occur during
soft timeouts to the application. This allows aware applications to
distinguish between server disk IO errors and client timeout errors.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

91a575e1