1. 16 Dec, 2020 6 commits
  2. 14 Dec, 2020 19 commits
  3. 10 Dec, 2020 1 commit
  4. 02 Dec, 2020 14 commits
    • NFS: switch nfsiod to be an UNBOUND workqueue. · bf701b76
      NeilBrown authored
      nfsiod is currently a concurrency-managed workqueue (CMWQ).
      This means that work items scheduled to nfsiod on a given CPU are queued
      behind all other work items queued on any CMWQ on the same CPU.  This
      can introduce unexpected latency.
      
      Occasionally nfsiod can even cause excessive latency.  If the work item
      to complete a CLOSE request calls the final iput() on an inode, the
      address_space of that inode will be dismantled.  This takes time
      proportional to the number of in-memory pages, which on a large host
      working on large files (e.g. 5TB) can be a large number of pages,
      resulting in a noticeable number of seconds.
      
      We can avoid these latency problems by switching nfsiod to WQ_UNBOUND.
      This causes each concurrent work item to get a dedicated thread which
      can be scheduled to an idle CPU.
      
      There is precedent for this as several other filesystems use WQ_UNBOUND
      workqueues for handling various async events.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: ada609ee ("workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      bf701b76
    • lockd: don't use interval-based rebinding over TCP · 9b82d88d
      Calum Mackay authored
      NLM uses an interval-based rebinding, i.e. it clears the transport's
      binding under certain conditions if more than 60 seconds have elapsed
      since the connection was last bound.
      
      This rebinding is not necessary for an autobind RPC client over a
      connection-oriented protocol like TCP.
      
      It can also cause problems: it is possible for nlm_bind_host() to clear
      XPRT_BOUND whilst a connection worker is in the middle of trying to
      reconnect, after it had already been checked in xprt_connect().
      
      When the connection worker notices that XPRT_BOUND has been cleared
      under it, in xs_tcp_finish_connecting(), that results in:
      
      	xs_tcp_setup_socket: connect returned unhandled error -107
      
      Worse, it's possible that the two can get into lockstep, resulting in
      the same behaviour repeated indefinitely, with the above error every
      300 seconds, without ever recovering, and the connection never being
      established. This has been seen in practice, with a large number of NLM
      client tasks, following a server restart.
      
      The existing callers of nlm_bind_host() and nlm_rebind_host() should not
      need to force the rebind for TCP, so restrict the interval-based rebinding
      to UDP only.
      
      For TCP, we will still rebind when needed, e.g. on timeout and on connection
      error (including closure), since connection-related errors on an existing
      connection, ECONNREFUSED when trying to connect, and rpc_check_timeout()
      already unconditionally clear XPRT_BOUND.
      
      To avoid having to add the fix, and explanation, to both nlm_bind_host()
      and nlm_rebind_host(), remove the duplicate code from the former, and
      have it call the latter.
      
      Drop the dprintk, which adds no value over a trace.
      Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
      Fixes: 35f5a422 ("SUNRPC: new interface to force an RPC rebind")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      9b82d88d
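
      The rule described in this commit (interval-based rebinding for UDP
      only) can be modelled in userspace roughly as follows. This is a
      simplified sketch, not the lockd code: the struct, the enum, the
      model_* names and the plain integer clock are all assumptions made
      for illustration, with `bound` standing in for XPRT_BOUND.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the transport protocol and the NLM host. */
enum model_proto { MODEL_UDP, MODEL_TCP };

struct model_host {
	enum model_proto proto;
	unsigned long next_rebind;	/* earliest time a forced rebind is allowed */
	bool bound;			/* stands in for XPRT_BOUND */
};

/* Assumed interval, in the same (arbitrary) units as the clock. */
#define MODEL_REBIND_INTERVAL 60UL

/*
 * Force a rebind only for UDP, and only once the interval has elapsed.
 * Returns true if the binding was cleared.
 */
static bool model_rebind_host(struct model_host *host, unsigned long now)
{
	assert(host != NULL);

	/* TCP: leave the binding alone; connection errors handle rebinding. */
	if (host->proto == MODEL_TCP)
		return false;

	if (now < host->next_rebind)
		return false;

	host->bound = false;
	host->next_rebind = now + MODEL_REBIND_INTERVAL;
	return true;
}
```

      In this model a TCP host is never unbound by the timer, so a
      reconnect already in progress cannot have its binding cleared
      underneath it by this path.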
    • net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs' · 35a6d396
      Fedor Tokarev authored
      'snprintf' returns the number of characters which would have been written
      if enough space had been available, excluding the terminating null byte.
      Thus, a return value equal to 'sizeof(buf)' already means that the last
      character has been dropped, and the check must treat it as truncation.
      Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
      Fixes: 2f34b8bf ("SUNRPC: add links for all client xprts to debugfs")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      35a6d396
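
      The return value rule above can be illustrated with a small userspace
      sketch. The helper name format_xprt_name() and the "xprt-%u" format
      are assumptions for illustration; this is not the code from
      do_xprt_debugfs().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "xprt-<id>" into buf.
 * Returns 0 on success, -1 if the output was truncated or snprintf() failed.
 */
static int format_xprt_name(char *buf, size_t size, unsigned int id)
{
	int len;

	assert(buf != NULL && size > 0);

	len = snprintf(buf, size, "xprt-%u", id);

	/*
	 * len excludes the terminating null byte, so len == size already
	 * means the last character did not fit. Checking only len > size
	 * would miss that case.
	 */
	if (len < 0 || (size_t)len >= size)
		return -1;
	return 0;
}
```

      With a 7-byte buffer and id 12, the 7-character string "xprt-12"
      needs 8 bytes including the terminator, so the helper reports
      truncation even though snprintf() returned exactly the buffer size.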
    • NFSv4: Refactor to use user namespaces for nfs4idmap · d3ff46fe
      Sargun Dhillon authored
      In several patches work has been done to enable NFSv4 to use user
      namespaces:
      58002399: NFSv4: Convert the NFS client idmapper to use the container user namespace
      3b7eb5e3: NFS: When mounting, don't share filesystems between different user namespaces
      
      Unfortunately, the userspace APIs were only such that the userspace-facing
      side of the filesystem (superblock s_user_ns) could be set to a non-init
      user namespace. This furthers the fs_context related refactoring, and
      piggybacks on top of that logic, so the superblock user namespace, and the
      NFS user namespace are the same.
      
      Users can still use rpc.idmapd if they choose to, but there are complexities
      with user namespaces and request-key that have yet to be addressed.
      
      Eventually, we will need to at least:
        * Come up with an upcall mechanism that can be triggered inside of the container,
          or safely triggered outside, with the requisite context to do the right
          mapping.
        * Handle whatever refactoring needs to be done in net/sunrpc.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Tested-by: Alban Crequy <alban.crequy@gmail.com>
      Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      d3ff46fe
    • NFS: NFSv2/NFSv3: Use cred from fs_context during mount · d18a9d3f
      Sargun Dhillon authored
      The mount code was refactored to use the fs_context in:
      62a55d08: NFS: Additional refactoring for fs_context conversion
      
      This made it so that the net_ns is fetched from the fs_context (the netns
      that fsopen is called in). This change also makes it so that the credential
      fetched during fsopen is used as well as the net_ns.
      
      NFS has already had a number of changes to prepare it for user namespaces:
      1a58e8a0: NFS: Store the credential of the mount process in the nfs_server
      264d948c: NFS: Convert NFSv3 to use the container user namespace
      c207db2f: NFS: Convert NFSv2 to use the container user namespace
      
      Previously, different credentials could be used for creation of the
      fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
      the actual credential check, and that's where current_creds() were fetched.
      This meant that the user namespace which fsopen was called in could be a
      non-init user namespace. This still requires that the user that calls
      FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.
      
      This roughly allows a privileged user to mount on behalf of an unprivileged
      user namespace, by forking off and calling fsopen in the unprivileged user
      namespace. The child can then pass that fsfd back to the privileged process,
      which can configure the NFS mount and call FSCONFIG_CMD_CREATE before
      switching back into the mount namespace of the container to finish the
      mounting process with fsmount and move_mount.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Tested-by: Alban Crequy <alban.crequy@gmail.com>
      Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      d18a9d3f
    • NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode · b6d49ecd
      Trond Myklebust authored
      When returning the layout in nfs4_evict_inode(), we need to ensure that
      the layout is actually done being freed before we can proceed to free the
      inode itself.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      b6d49ecd
    • Trond Myklebust's avatar
    • Trond Myklebust's avatar
    • SUNRPC: Fix up xdr_set_page() · 0279024f
      Trond Myklebust authored
      While we always want to align to the next page and/or the beginning of
      the tail buffer when we call xdr_set_next_page(), the functions
      xdr_align_data() and xdr_expand_hole() really want to align to the next
      object in that next page or tail.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      0279024f
    • SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages() · 9ed5af26
      Trond Myklebust authored
      rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to
      contain the length of the data that we expect to want placed in the head
      kvec plus a count of 1 word of padding that is placed after the page data.
      This is very confusing when trying to read the code, and sometimes leads
      to callers adding an arbitrary value of '1' just in order to satisfy the
      requirement (whether or not the page data actually needs such padding).
      
      This patch aims to clarify the code by changing the 'hdrsize' argument
      to remove that 1 word of padding. This means we need to subtract the
      padding from all the existing callers.
      
      Fixes: 02ef04e4 ("NFS: Account for XDR pad of buf->pages")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      9ed5af26
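
      For context on the "1 word of padding": XDR encodes data in 4-byte
      units, so page data needs trailing padding only when its length is
      not a multiple of 4. The sketch below shows just that arithmetic.
      The model_* names are hypothetical and this is not the code of
      rpc_prepare_reply_pages().

```c
#include <assert.h>
#include <stddef.h>

/* XDR encodes everything in 4-byte units. */
#define MODEL_XDR_UNIT 4u

/* Number of pad bytes needed to round len up to the next XDR unit. */
static size_t model_xdr_pad_size(size_t len)
{
	return (MODEL_XDR_UNIT - (len % MODEL_XDR_UNIT)) % MODEL_XDR_UNIT;
}

/* Number of pad words (0 or 1) that follow page data of length page_len. */
static size_t model_xdr_pad_words(size_t page_len)
{
	return model_xdr_pad_size(page_len) ? 1 : 0;
}

/* Usage example: a few spot checks of the arithmetic above. */
static void model_xdr_pad_examples(void)
{
	assert(model_xdr_pad_size(4096) == 0);	/* aligned: no padding */
	assert(model_xdr_pad_words(4096) == 0);
	assert(model_xdr_pad_words(4097) == 1);	/* unaligned: one pad word */
}
```

      Under this model a caller that always added 1 word would overcount
      for aligned page data, which is the kind of confusion the commit
      message describes.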
    • SUNRPC: Fix up xdr_read_pages() to take arbitrary object lengths · 1d973166
      Trond Myklebust authored
      Fix up xdr_read_pages() so that it can handle object lengths that are
      larger than the page length, by simply aligning to the next object in
      the buffer tail.
      The function will continue to return the length of the truncated object
      data that actually fit into the pages.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      1d973166
    • SUNRPC: Clean up helpers xdr_set_iov() and xdr_set_page_base() · 8d86e373
      Trond Myklebust authored
      Allow xdr_set_iov() to set a base so that we can use it to set the
      cursor to a specific position in the kvec buffer.
      
      If the new base overflows the kvec/pages buffer in either xdr_set_iov()
      or xdr_set_page_base(), then truncate it so that we point to the end of
      the buffer.
      
      Finally, change both functions to return the number of bytes remaining to
      read in their buffers.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      8d86e373
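
      The behaviour described above (set a cursor at a base offset, truncate
      an overflowing base to the end of the buffer, return the bytes left
      to read) can be sketched in userspace as follows. The model_buf and
      model_cursor types and model_set_base() are hypothetical
      simplifications, not the kernel's xdr_stream helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat buffer, standing in for a kvec. */
struct model_buf {
	const unsigned char *data;
	size_t len;
};

/* Hypothetical read cursor: current position and end of buffer. */
struct model_cursor {
	const unsigned char *p;
	const unsigned char *end;
};

/*
 * Point the cursor at offset base within buf. If base overflows the
 * buffer, truncate it so the cursor points at the end of the buffer.
 * Returns the number of bytes remaining to read.
 */
static size_t model_set_base(struct model_cursor *c,
			     const struct model_buf *buf, size_t base)
{
	assert(c != NULL && buf != NULL);

	if (base > buf->len)
		base = buf->len;

	c->p = buf->data + base;
	c->end = buf->data + buf->len;
	return buf->len - base;
}
```

      Returning the remaining length lets a caller tell from a single call
      whether there is anything left to decode in that buffer, since an
      overflowing base simply yields 0.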
    • SUNRPC: Fix up typo in xdr_init_decode() · 2b1f83d1
      Trond Myklebust authored
      We already know that the head buffer and page are empty, so if there is
      any data, it is in the tail.
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      2b1f83d1
    • NFSv4: Fix the alignment of page data in the getdeviceinfo reply · 046e5ccb
      Trond Myklebust authored
      We can fit the device_addr4 opaque data padding in the pages.
      
      Fixes: cf500bac ("SUNRPC: Introduce rpc_prepare_reply_pages()")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      046e5ccb