Commits · 9bfffea3524b49d0268d01f8e7967f06c4d0a942 · Kirill Smelkov / linux

16 Dec, 2020 5 commits

pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read · 9bfffea3

Trond Myklebust authored Dec 16, 2020

The callers of ff_layout_choose_ds_for_read() should decide whether or
not they want to return the layout on error. Sometimes, we may just want
to retry from the beginning.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

9bfffea3

NFSv4/pnfs: Add tracing for the deviceid cache · cac1d3a2

Trond Myklebust authored Dec 16, 2020

Add tracepoints to allow debugging of the deviceid cache.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

cac1d3a2

fs/lockd: convert comma to semicolon · 3316fb80

Zheng Yongjun authored Dec 11, 2020

Replace a comma between expression statements by a semicolon.
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

3316fb80
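
As a minimal illustration of this class of cleanup (not the lockd code itself; the structure and function names below are hypothetical), the two functions behave identically, but the comma form fuses two assignments into a single expression statement:

```c
/* Hypothetical structure, used only to illustrate the cleanup. */
struct sketch_args {
    int cookie;
    int state;
};

/* Before: the comma operator joins both assignments into one statement. */
static void init_with_comma(struct sketch_args *a)
{
    a->cookie = 1,
    a->state = 2;
}

/* After: two independent expression statements. */
static void init_with_semicolon(struct sketch_args *a)
{
    a->cookie = 1;
    a->state = 2;
}
```

The result is the same in both cases; the semicolon form is simply harder to misread and stays correct if one of the lines is later moved under an unbraced `if`.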

NFSv4.2: fix error return on memory allocation failure · 7be9b38a

Colin Ian King authored Dec 16, 2020

Currently, when alloc_page() fails, the error return is not set in
variable err and an uninitialized garbage value is returned. Fix this
by setting err to -ENOMEM before taking the error return path.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: a1f26739 ("NFSv4.2: improve page handling for GETXATTR")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

7be9b38a
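
The pattern described above can be sketched in user space as follows. This is only an illustration under stated assumptions: `fill_pages()`, `alloc_fn`, `alloc_ok()` and `alloc_fail()` are hypothetical names, and `malloc()` stands in for the kernel page allocator.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook so that a failure can be simulated. */
typedef void *(*alloc_fn)(void);

static void *alloc_ok(void)   { return malloc(16); }
static void *alloc_fail(void) { return NULL; }

/* Allocate n buffers; on failure, free what was allocated and return -ENOMEM. */
static int fill_pages(void **pages, size_t n, alloc_fn alloc)
{
    int err;    /* not initialised at declaration, as in the reported bug */
    size_t i;

    for (i = 0; i < n; i++) {
        pages[i] = alloc();
        if (!pages[i]) {
            err = -ENOMEM;  /* the fix: set err before taking the error path */
            goto out_free;
        }
    }
    return 0;

out_free:
    while (i > 0)
        free(pages[--i]);
    return err;
}
```

Without the assignment before the `goto`, the function would return whatever happened to be in `err`, which is the class of defect the Coverity report points at.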

Merge tag 'nfs-rdma-for-5.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next · edffb84c

Trond Myklebust authored Dec 15, 2020

NFSoRDMA Client updates for Linux 5.11

Cleanups and improvements:
  - Remove use of raw kernel memory addresses in tracepoints
  - Replace dprintk() call sites in ERR_CHUNK path
  - Trace unmap sync calls
  - Optimize MR DMA-unmapping
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

edffb84c

14 Dec, 2020 19 commits

NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet · 5c3485bb

Trond Myklebust authored Dec 10, 2020

We have no way of tracking server READ_PLUS support in pNFS for now, so
just disable it.
Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

5c3485bb

NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow · 7aedc687

Trond Myklebust authored Dec 08, 2020

If the server returns more data than we have buffer space for, then
we need to truncate and exit early.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

7aedc687

NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow · 503b934a

Trond Myklebust authored Dec 08, 2020

Expanding the READ_PLUS extents can cause the read buffer to overflow.
If it does, then don't error, but just exit early.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

503b934a

NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer · dac3b105

Trond Myklebust authored Dec 08, 2020

If a hole extends beyond the READ_PLUS read buffer, then we want to fill
just the remaining buffer with zeros. Also ignore eof...
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

dac3b105
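
A rough user-space sketch of the clamping idea, assuming a flat byte buffer rather than the kernel's xdr_buf (the function name `fill_hole()` and its parameters are hypothetical):

```c
#include <stddef.h>
#include <string.h>

/*
 * Zero-fill a hole starting at pos, but never write past the end of the
 * buffer. Returns the number of bytes actually zeroed.
 */
static size_t fill_hole(unsigned char *buf, size_t buflen,
                        size_t pos, size_t hole_len)
{
    size_t room, n;

    if (pos >= buflen)
        return 0;               /* no space left at all */
    room = buflen - pos;
    n = hole_len < room ? hole_len : room;
    memset(buf + pos, 0, n);
    return n;
}
```

A hole of 100 bytes starting 2 bytes before the end of the buffer therefore zeroes only those 2 bytes.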

NFSv4.2: decode_read_plus_hole() needs to check the extent offset · 82f98c8b

Trond Myklebust authored Dec 08, 2020

The server is allowed to return a hole extent with an offset that starts
before the offset supplied in the READ_PLUS argument. Ensure that we
support that case too.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

82f98c8b
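
One possible way to handle such an extent is to clip the part that precedes the requested offset. The sketch below is an assumption about the general approach, not the decoder's actual logic; `struct sketch_extent` and `clip_hole()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical representation of a hole extent. */
struct sketch_extent {
    uint64_t offset;
    uint64_t length;
};

/* Trim the part of a hole that lies before the requested offset. */
static struct sketch_extent clip_hole(struct sketch_extent hole,
                                      uint64_t req_offset)
{
    if (hole.offset < req_offset) {
        uint64_t skip = req_offset - hole.offset;

        hole.length = hole.length > skip ? hole.length - skip : 0;
        hole.offset = req_offset;
    }
    return hole;
}
```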

NFSv4.2: decode_read_plus_data() must skip padding after data segment · 5c4afe2a

Trond Myklebust authored Dec 08, 2020

All XDR opaque object sizes are 32-bit aligned, and a data segment is no
exception.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

5c4afe2a
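
For reference, the 32-bit alignment rule means an opaque object of `len` bytes is followed by 0 to 3 pad bytes on the wire. A small sketch of the arithmetic (the helper names are hypothetical, and lengths within 3 of `UINT32_MAX` are not handled):

```c
#include <stdint.h>

/* Length of an XDR opaque object rounded up to a multiple of 4 bytes. */
static uint32_t xdr_padded_len(uint32_t len)
{
    return (len + 3u) & ~3u;
}

/* Number of pad bytes a decoder has to skip after the data itself. */
static uint32_t xdr_pad_bytes(uint32_t len)
{
    return xdr_padded_len(len) - len;
}
```

A 5-byte data segment, for instance, occupies 8 bytes, so a decoder that stops after 5 bytes is misaligned for whatever follows.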

NFSv4.2: Ensure we always reset the result->count in decode_read_plus() · 1ee63101

Trond Myklebust authored Dec 08, 2020

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

1ee63101

SUNRPC: When expanding the buffer, we may need grow the sparse pages · 5802f7c2

Trond Myklebust authored Dec 10, 2020

If we're shifting the page data to the right, and this happens to be a
sparse page array, then we may need to allocate new pages in order to
receive the data.
Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

5802f7c2

SUNRPC: Cleanup - constify a number of xdr_buf helpers · f8d0e60f

Trond Myklebust authored Dec 08, 2020

There are a number of xdr helpers for struct xdr_buf that do not change
the structure itself. Mark those as taking const pointers for
documentation purposes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

f8d0e60f

SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field · 5a5f1c2c

Trond Myklebust authored Dec 08, 2020

Move the setting of the xdr_stream 'nwords' field into the helpers that
reset the xdr_stream cursor.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

5a5f1c2c

SUNRPC: _copy_to/from_pages() now check for zero length · e43ac22b

Trond Myklebust authored Dec 06, 2020

Clean up callers of _copy_to/from_pages() that still check for a zero
length.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

e43ac22b

SUNRPC: Cleanup xdr_shrink_bufhead() · 6707fbd7

Trond Myklebust authored Dec 06, 2020

Clean up xdr_shrink_bufhead() to use the new helpers instead of doing
its own thing.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

6707fbd7

SUNRPC: Fix xdr_expand_hole() · c4f2f591

Trond Myklebust authored Dec 04, 2020

We do want to try to grow the buffer if possible, but if that attempt
fails, we still want to move the data and truncate the XDR message.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

c4f2f591

SUNRPC: Fixes for xdr_align_data() · 9a20f6f4

Trond Myklebust authored Dec 04, 2020

The main use case right now for xdr_align_data() is to shift the page
data to the left, and in practice shrink the total XDR data buffer.
This patch ensures that we fix up the accounting for the buffer length
as we shift that data around.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

9a20f6f4

SUNRPC: _shift_data_left/right_pages should check the shift length · c54e959b

Trond Myklebust authored Dec 07, 2020

Exit early if the shift is zero.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

c54e959b

NFSv4.1: use BITS_PER_LONG macro in nfs4session.h · 1f70ea70

Geliang Tang authored Dec 07, 2020

Use the existing BITS_PER_LONG macro instead of calculating the value.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

1f70ea70
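
In user-space terms, the value that BITS_PER_LONG stands for can be written as below. This is a sketch only: `SKETCH_BITS_PER_LONG` and `longs_for_bits()` are hypothetical names, and the kernel supplies its own macro rather than computing it this way at each use.

```c
#include <limits.h>
#include <stddef.h>

/* Open-coded equivalent of what the kernel macro expresses by name. */
#define SKETCH_BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* Number of longs needed to hold a bitmap of nbits bits. */
static size_t longs_for_bits(size_t nbits)
{
    return (nbits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}
```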

xprtrdma: Fix XDRBUF_SPARSE_PAGES support · 15261b91

Chuck Lever authored Dec 08, 2020

Olga K. observed that rpcrdma_marshal_req() allocates sparse pages
only when it has determined that a Reply chunk is necessary. There
are plenty of cases where no Reply chunk is needed, but the
XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in
rpcrdma_inline_fixup() when it tries to copy parts of the received
Reply into a missing page.

To avoid crashing, handle sparse page allocation up front.

Until XATTR support was added, this issue did not appear often
because the only SPARSE_PAGES consumer always expected a reply large
enough to always require a Reply chunk.
Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

15261b91

NFSv4.2: improve page handling for GETXATTR · a1f26739

Frank van der Linden authored Dec 02, 2020

XDRBUF_SPARSE_PAGES can cause problems for the RDMA transport,
and it's easy enough to allocate enough pages for the request
up front, so do that.

Also, since we've allocated the pages anyway, use the full
page aligned length for the receive buffer. This will allow
caching of valid replies that are too large for the caller,
but that still fit in the allocated pages.
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

a1f26739

sunrpc: fix xs_read_xdr_buf for partial pages receive · ac9645c8

Dan Aloni authored Dec 05, 2020

When receiving pages data, return value 'ret' when positive includes
`buf->page_base`, so we should subtract that before it is used for
changing `offset` and comparing against `want`.

This was discovered in the very rare cases where the server returned a
chunk of bytes that, when added to the bytes already received for the
pages, happened to match the current `recv.len`, for example in this
case:

     buf->page_base : 258356
     actually received from socket: 1740
     ret : 260096
     want : 260096

In this case neither of the two 'if ... goto out' checks triggers, and we
continue to tail parsing.

It is worth mentioning that the ensuing EMSGSIZE from the continued
execution of `xs_read_xdr_buf` may be observed by an application due to
4 superfluous bytes being added to the pages data.

Fixes: 277e4ab7 ("SUNRPC: Simplify TCP receive code by switching to using iterators")
Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

ac9645c8
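
The arithmetic in the example above can be checked with a small sketch. The helper names are hypothetical, and treating the comparison against `want` as a plain equality test is an assumption made for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* Bytes newly received for the pages: ret still includes page_base. */
static size_t pages_received(size_t ret, size_t page_base)
{
    return ret - page_base;
}

/* True only when the corrected count covers everything that was wanted. */
static bool read_complete(size_t ret, size_t page_base, size_t want)
{
    return pages_received(ret, page_base) == want;
}
```

With `page_base` = 258356 and `ret` = 260096, only 1740 bytes came from the socket, so comparing the uncorrected `ret` against `want` = 260096 wrongly suggests the read is complete.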

10 Dec, 2020 1 commit

NFSv4.2: Fix up the get/listxattr calls to rpc_prepare_reply_pages() · fa94a951

Trond Myklebust authored Dec 03, 2020

Ensure that both getxattr and listxattr page arrays are correctly
aligned, and that getxattr correctly accounts for the page padding word.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

fa94a951

02 Dec, 2020 15 commits

NFS: switch nfsiod to be an UNBOUND workqueue. · bf701b76

NeilBrown authored Nov 27, 2020

nfsiod is currently a concurrency-managed workqueue (CMWQ).
This means that work items scheduled to nfsiod on a given CPU are queued
behind all other work items queued on any CMWQ on the same CPU.  This
can introduce unexpected latency.

Occasionally nfsiod can even cause excessive latency.  If the work item
to complete a CLOSE request calls the final iput() on an inode, the
address_space of that inode will be dismantled.  This takes time
proportional to the number of in-memory pages, which on a large host
working on large files (e.g. 5TB), can be a large number of pages
resulting in a noticeable number of seconds.

We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
This causes each concurrent work item to get a dedicated thread which
can be scheduled to an idle CPU.

There is precedent for this as several other filesystems use WQ_UNBOUND
workqueue for handling various async events.
Signed-off-by: NeilBrown <neilb@suse.de>
Fixes: ada609ee ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

bf701b76

lockd: don't use interval-based rebinding over TCP · 9b82d88d

Calum Mackay authored Oct 28, 2020

NLM uses an interval-based rebinding, i.e. it clears the transport's
binding under certain conditions if more than 60 seconds have elapsed
since the connection was last bound.

This rebinding is not necessary for an autobind RPC client over a
connection-oriented protocol like TCP.

It can also cause problems: it is possible for nlm_bind_host() to clear
XPRT_BOUND whilst a connection worker is in the middle of trying to
reconnect, after it had already been checked in xprt_connect().

When the connection worker notices that XPRT_BOUND has been cleared
under it, in xs_tcp_finish_connecting(), that results in:

	xs_tcp_setup_socket: connect returned unhandled error -107

Worse, it's possible that the two can get into lockstep, resulting in
the same behaviour repeated indefinitely, with the above error every
300 seconds, without ever recovering, and the connection never being
established. This has been seen in practice, with a large number of NLM
client tasks, following a server restart.

The existing callers of nlm_bind_host & nlm_rebind_host should not need
to force the rebind, for TCP, so restrict the interval-based rebinding
to UDP only.

For TCP, we will still rebind when needed, e.g. on timeout, and connection
error (including closure), since connection-related errors on an existing
connection, ECONNREFUSED when trying to connect, and rpc_check_timeout(),
already unconditionally clear XPRT_BOUND.

To avoid having to add the fix, and explanation, to both nlm_bind_host()
and nlm_rebind_host(), remove the duplicate code from the former, and
have it call the latter.

Drop the dprintk, which adds no value over a trace.
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Fixes: 35f5a422 ("SUNRPC: new interface to force an RPC rebind")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

9b82d88d
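
The resulting policy can be summarised in a few lines. This is a simplified sketch, not the lockd implementation: the names are hypothetical, time is counted in plain seconds, and counter wraparound is ignored.

```c
#include <stdbool.h>

/* Hypothetical transport tags for the sketch. */
enum sketch_proto { SKETCH_UDP, SKETCH_TCP };

/* Rebind interval in seconds, taken from the description above. */
#define SKETCH_REBIND_INTERVAL 60UL

/* Interval-based rebinding applies to UDP only; TCP never forces it. */
static bool should_force_rebind(enum sketch_proto proto,
                                unsigned long now, unsigned long last_bound)
{
    if (proto != SKETCH_UDP)
        return false;
    return now - last_bound > SKETCH_REBIND_INTERVAL;
}
```

TCP connections still get rebound through the error and timeout paths mentioned above; only the timer-driven clearing of the binding is skipped.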

net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs' · 35a6d396

Fedor Tokarev authored Oct 15, 2020

'snprintf' returns the number of characters which would have been written
if enough space had been available, excluding the terminating null byte.
Thus, the return value of 'sizeof(buf)' means that the last character
has been dropped.
Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
Fixes: 2f34b8bf ("SUNRPC: add links for all client xprts to debugfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

35a6d396
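
The rule can be demonstrated with standard C `snprintf()`. The wrapper below is a hypothetical helper, not the code from `do_xprt_debugfs`:

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format name into buf and report whether it fit completely.
 * A return value equal to size already means the output was truncated,
 * so the check must be "len < size", not "len <= size".
 */
static int format_fits(char *buf, size_t size, const char *name)
{
    int len = snprintf(buf, size, "%s", name);

    return len >= 0 && (size_t)len < size;
}
```

With a 4-byte buffer, "abc" fits (3 characters plus the terminating null byte), while "abcd" makes `snprintf()` return 4 and leaves only "abc" in the buffer.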

NFSv4: Refactor to use user namespaces for nfs4idmap · d3ff46fe

Sargun Dhillon authored Nov 12, 2020

In several patches work has been done to enable NFSv4 to use user
namespaces:
58002399: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e3: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs were only such that the userspace facing
side of the filesystem (superblock s_user_ns) could be set to a non init
user namespace. This furthers the fs_context related refactoring, and
piggybacks on top of that logic, so the superblock user namespace, and the
NFS user namespace are the same.

Users can still use rpc.idmapd if they choose to, but there are complexities
with user namespaces and request-key that have yet to be addressed.

Eventually, we will need to at least:
  * Come up with an upcall mechanism that can be triggered inside of the container,
    or safely triggered outside, with the requisite context to do the right
    mapping.
  * Handle whatever refactoring needs to be done in net/sunrpc.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

d3ff46fe

NFS: NFSv2/NFSv3: Use cred from fs_context during mount · d18a9d3f

Sargun Dhillon authored Nov 12, 2020

Mounting was refactored to use the fs_context in:
62a55d08: NFS: Additional refactoring for fs_context conversion

This made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change also makes it so that the credential
fetched during fsopen is used as well as the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0: NFS: Store the credential of the mount process in the nfs_server
264d948c: NFS: Convert NFSv3 to use the container user namespace
c207db2f: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_creds() were fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged
user namespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

d18a9d3f

NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode · b6d49ecd

Trond Myklebust authored Nov 25, 2020

When returning the layout in nfs4_evict_inode(), we need to ensure that
the layout is actually done being freed before we can proceed to free the
inode itself.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

b6d49ecd

NFSv4: Fix open coded xdr_stream_remaining() · 17068466

Trond Myklebust authored Nov 21, 2020

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

17068466

SUNRPC: Fix open coded xdr_stream_remaining() · eee1f549

Trond Myklebust authored Nov 21, 2020

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

eee1f549

SUNRPC: Fix up xdr_set_page() · 0279024f

Trond Myklebust authored Nov 21, 2020

While we always want to align to the next page and/or the beginning of
the tail buffer when we call xdr_set_next_page(), the functions
xdr_align_data() and xdr_expand_hole() really want to align to the next
object in that next page or tail.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

0279024f

SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages() · 9ed5af26

Trond Myklebust authored Nov 21, 2020

rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to
contain the length of the data that we expect to want placed in the head
kvec plus a count of 1 word of padding that is placed after the page data.
This is very confusing when trying to read the code, and sometimes leads
to callers adding an arbitrary value of '1' just in order to satisfy the
requirement (whether or not the page data actually needs such padding).

This patch aims to clarify the code by changing the 'hdrsize' argument
to remove that 1 word of padding. This means we need to subtract the
padding from all the existing callers.

Fixes: 02ef04e4 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

9ed5af26

SUNRPC: Fix up xdr_read_pages() to take arbitrary object lengths · 1d973166

Trond Myklebust authored Nov 20, 2020

Fix up xdr_read_pages() so that it can handle object lengths that are
larger than the page length, by simply aligning to the next object in
the buffer tail.
The function will continue to return the length of the truncated object
data that actually fit into the pages.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

1d973166

SUNRPC: Clean up helpers xdr_set_iov() and xdr_set_page_base() · 8d86e373

Trond Myklebust authored Nov 21, 2020

Allow xdr_set_iov() to set a base so that we can use it to set the
cursor to a specific position in the kvec buffer.

If the new base overflows the kvec/pages buffer in either xdr_set_iov()
or xdr_set_page_base(), then truncate it so that we point to the end of
the buffer.

Finally, change both functions to return the number of bytes remaining to
read in their buffers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

8d86e373
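
The clamp-and-return-remaining behaviour described above can be sketched as follows, assuming a single flat buffer of `buflen` bytes (the function name and signature are hypothetical, not those of `xdr_set_iov()` or `xdr_set_page_base()`):

```c
#include <stddef.h>

/*
 * Clamp *base to the end of the buffer if it overflows, then return
 * how many bytes remain to be read from that position.
 */
static size_t set_base_clamped(size_t buflen, size_t *base)
{
    if (*base > buflen)
        *base = buflen;     /* point at the end of the buffer */
    return buflen - *base;
}
```

A caller can then treat a return value of 0 as "nothing left in this buffer" without a separate overflow check.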

SUNRPC: Fix up typo in xdr_init_decode() · 2b1f83d1

Trond Myklebust authored Nov 21, 2020

We already know that the head buffer and page are empty, so if there is
any data, it is in the tail.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

2b1f83d1

NFSv4: Fix the alignment of page data in the getdeviceinfo reply · 046e5ccb

Trond Myklebust authored Nov 13, 2020

We can fit the device_addr4 opaque data padding in the pages.

Fixes: cf500bac ("SUNRPC: Introduce rpc_prepare_reply_pages()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

046e5ccb

pNFS: Clean up open coded xdr string decoding · 98899813

Trond Myklebust authored Nov 09, 2020

Use the existing xdr_stream_decode_string_dup() to safely decode into
kmalloced strings.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

98899813