1. 02 Jan, 2019 12 commits
    • NFS: Make "port=" mount option optional for RDMA mounts · 0dfbb5f0
      Chuck Lever authored
      Having to specify "proto=rdma,port=20049" is cumbersome.
      
      RFC 8267 Section 6.3 requires NFSv4 clients to use "the alternative
      well-known port number", which is 20049. Make the use of the well-
      known port number automatic, just as it is for NFS/TCP and port
      2049.
      
      For NFSv2/3, Section 4.2 allows clients to simply choose 20049 as
      the default or use rpcbind. I don't know of an NFS/RDMA server
      implementation that registers its NFS/RDMA service with rpcbind,
      so automatically choosing 20049 seems like the better choice. The
      other widely-deployed NFS/RDMA client, Solaris, also uses 20049
      as the default port.
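The selection rule described above can be sketched in plain C; the function and constant names below are illustrative, not the kernel's actual mount-option code:

```c
#include <stdint.h>

#define NFS_RDMA_PORT 20049  /* well-known NFS/RDMA port */
#define NFS_TCP_PORT   2049  /* well-known NFS/TCP port */

/* Hypothetical sketch: pick the destination port when "port=" is
 * absent.  port == 0 means the user gave no "port=" option. */
static uint16_t nfs_default_port(int is_rdma, uint16_t port)
{
    if (port != 0)
        return port;                      /* explicit "port=" wins */
    return is_rdma ? NFS_RDMA_PORT : NFS_TCP_PORT;
}
```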
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) · 0a93fbcb
      Chuck Lever authored
      Place the associated RPC transaction's XID in the upper 32 bits of
      each RDMA segment's rdma_offset field. There are two reasons to do
      this:
      
      - The R_key only has 8 bits that are different from registration to
        registration. The XID adds more uniqueness to each RDMA segment to
        reduce the likelihood of a software bug on the server reading from
        or writing into memory it's not supposed to.
      
      - On-the-wire RDMA Read and Write requests do not otherwise carry
        any identifier that matches them up to an RPC. The XID in the
        upper 32 bits will act as an eye-catcher in network captures.
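The packing described above amounts to a simple shift-and-or; the function names here are illustrative stand-ins, not the kernel's:

```c
#include <stdint.h>

/* Sketch: place the RPC XID in the upper 32 bits of a segment's 64-bit
 * rdma_offset; the lower 32 bits keep the MR-relative offset. */
static uint64_t rdma_segment_offset(uint32_t xid, uint32_t mr_offset)
{
    return ((uint64_t)xid << 32) | mr_offset;
}

/* Recover the XID from a captured segment, e.g. in a wire trace. */
static uint32_t rdma_segment_xid(uint64_t offset)
{
    return (uint32_t)(offset >> 32);
}
```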
      Suggested-by: Tom Talpey <ttalpey@microsoft.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove rpcrdma_memreg_ops · 5f62412b
      Chuck Lever authored
      Clean up: Now that there is only FRWR, there is no need for a memory
      registration switch. The indirect calls to the memreg operations can
      be replaced with faster direct calls.
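The shape of the clean-up can be sketched as follows; the names and bodies are illustrative stand-ins, not the real xprtrdma code:

```c
/* Old style (sketch): an ops table dispatched per registration mode. */
struct rpcrdma_memreg_ops {
    int (*ro_map)(int nsegs);             /* one slot per operation */
};

static int frwr_map(int nsegs)            /* stand-in for the FRWR path */
{
    return nsegs;                         /* placeholder body */
}

/* New style (sketch): with FRWR the only mode, call it directly. */
static int rpcrdma_register(int nsegs)
{
    return frwr_map(nsegs);               /* direct call, no table */
}
```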
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove support for FMR memory registration · ba69cd12
      Chuck Lever authored
      FMR is not supported on most recent RDMA devices. It is also less
      secure than FRWR because an FMR memory registration can expose
      adjacent bytes to remote reading or writing. As discussed during the
      RDMA BoF at LPC 2018, it is time to remove support for FMR in the
      NFS/RDMA client stack.
      
      Note that the NFS/RDMA server side uses either local memory
      registration or FRWR; FMR is not used.
      
      There are a few Infiniband/RoCE devices in the kernel tree that do
      not appear to support MEM_MGT_EXTENSIONS (FRWR), and therefore will
      not support client-side NFS/RDMA after this patch. These are:
      
       - mthca
       - qib
       - hns (RoCE)
      
      Users of these devices can use NFS/TCP on IPoIB instead.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Reduce max_frwr_depth · a7886849
      Chuck Lever authored
      Some devices advertise a large max_fast_reg_page_list_len
      capability, but perform optimally when MRs are significantly smaller
      than that depth -- probably when the MR itself is no larger than a
      page.
      
      By default, the RDMA R/W core API uses max_sge_rd as the maximum
      page depth for MRs. For some devices, the value of max_sge_rd is
      1, which is also not optimal. Thus, when max_sge_rd is larger than
      1, use that value. Otherwise use the value of the
      max_fast_reg_page_list_len attribute.
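The choice described above reduces to a small helper; the function name is illustrative, not the kernel's:

```c
#include <stdint.h>

/* Sketch: prefer max_sge_rd unless it is degenerate (<= 1), in which
 * case fall back to the device's max_fast_reg_page_list_len. */
static uint32_t frwr_max_depth(uint32_t max_sge_rd,
                               uint32_t max_fast_reg_page_list_len)
{
    if (max_sge_rd > 1)
        return max_sge_rd;                /* device gave a useful hint */
    return max_fast_reg_page_list_len;    /* otherwise use the large cap */
}
```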
      
      I've tested this with CX-3 Pro, FastLinq, and CX-5 devices. It
      reproducibly improves the throughput of large I/Os by several
      percent.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix ri_max_segs and the result of ro_maxpages · 6946f823
      Chuck Lever authored
      With certain combinations of krb5i/p, MR size, and r/wsize, I/O can
      fail with EMSGSIZE. This is because the calculated value of
      ri_max_segs (the max number of MRs per RPC) exceeded
      RPCRDMA_MAX_HDR_SEGS, which caused Read or Write list encoding to
      walk off the end of the transport header.
      
      Once that was addressed, the ro_maxpages result has to be corrected
      to account for the number of MRs needed for Reply chunks, which is
      2 MRs smaller than a normal Read or Write chunk.
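The first part of the fix is in essence a clamp; the constant's value below is illustrative, not the kernel's:

```c
/* Sketch: clamp the computed per-RPC MR count so chunk-list encoding
 * cannot run past the end of the transport header. */
#define RPCRDMA_MAX_HDR_SEGS 16           /* illustrative value */

static unsigned int clamp_max_segs(unsigned int computed)
{
    return computed > RPCRDMA_MAX_HDR_SEGS ? RPCRDMA_MAX_HDR_SEGS
                                           : computed;
}
```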
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Don't wake pending tasks until disconnect is done · 0c0829bc
      Chuck Lever authored
      Transport disconnect processing does a "wake pending tasks" at
      various points.
      
      Suppose an RPC Reply is being processed. The RPC task that Reply
      goes with is waiting on the pending queue. If a disconnect wake-up
      happens before reply processing is done, that reply, even if it is
      good, is thrown away, and the RPC has to be sent again.
      
      This window apparently does not exist for socket transports because
      there is a lock held while a reply is being received which prevents
      the wake-up call until after reply processing is done.
      
      To resolve this, all RPC replies being processed on an RPC-over-RDMA
      transport have to complete before pending tasks are awoken due to a
      transport disconnect.
      
      Callers that already hold the transport write lock may invoke
      ->ops->close directly. Others use a generic helper that schedules
      a close when the write lock can be taken safely.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: No qp_event disconnect · 3d433ad8
      Chuck Lever authored
      After thinking about this more, and auditing other kernel ULP
      implementations, I believe that a DISCONNECT cm_event will occur after a
      fatal QP event. If that's the case, there's no need for an explicit
      disconnect in the QP event handler.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue · 6d2d0ee2
      Chuck Lever authored
      To address a connection-close ordering problem, we need the ability
      to drain the RPC completions running on rpcrdma_receive_wq for just
      one transport. Give each transport its own RPC completion workqueue,
      and drain that workqueue when disconnecting the transport.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Refactor Receive accounting · 6ceea368
      Chuck Lever authored
      Clean up: Divide the work cleanly:
      
      - rpcrdma_wc_receive is responsible only for RDMA Receives
      - rpcrdma_reply_handler is responsible only for RPC Replies
      - the posted send and receive counts both belong in rpcrdma_ep
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Ensure MRs are DMA-unmapped when posting LOCAL_INV fails · b674c4b4
      Chuck Lever authored
      The recovery case in frwr_op_unmap_sync needs to DMA unmap each MR.
      frwr_release_mr does not DMA-unmap, but the recycle worker does.
      
      Fixes: 61da886b ("xprtrdma: Explicitly resetting MRs is ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Yet another double DMA-unmap · e2f34e26
      Chuck Lever authored
      While chasing yet another set of DMAR fault reports, I noticed that
      the frwr recycler conflates whether or not an MR has been DMA
      unmapped with frwr->fr_state. Actually the two have only an indirect
      relationship. It's in fact impossible to guess reliably whether the
      MR has been DMA unmapped based on its fr_state field, especially as
      the surrounding code and its assumptions have changed over time.
      
      A better approach is to track the DMA mapping status explicitly so
      that the recycler is less brittle to unexpected situations, and
      attempts to DMA-unmap a second time are prevented.
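The approach can be sketched with an explicit flag on the MR; the struct and function names are illustrative, not the real xprtrdma types:

```c
#include <stdbool.h>

/* Sketch: record DMA mapping state on the MR itself instead of
 * inferring it from fr_state, so a second unmap becomes a no-op. */
struct frwr_mr {
    bool dma_mapped;
};

static bool frwr_mr_unmap(struct frwr_mr *mr)
{
    if (!mr->dma_mapped)
        return false;       /* already unmapped: do nothing */
    mr->dma_mapped = false; /* the real code would DMA-unmap here */
    return true;
}
```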
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: stable@vger.kernel.org # v4.20
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 21 Dec, 2018 1 commit
    • NFS: nfs_compare_mount_options always compares auth flavors. · 594d1644
      Chris Perl authored
      This patch removes the check in nfs_compare_mount_options that
      compared auth flavors only when a `sec' option was passed for the
      current mount; auth flavors are now always compared.
      
      Consider the following scenario:
      
      You have a server with the address 192.168.1.1 and two exports /export/a
      and /export/b.  The first export supports `sys' and `krb5' security, the
      second just `sys'.
      
      Assume you start with no mounts from the server.
      
      The following results in EIOs being returned as the kernel nfs client
      incorrectly thinks it can share the underlying `struct nfs_server's:
      
      $ mkdir /tmp/{a,b}
      $ sudo mount -t nfs -o vers=3,sec=krb5 192.168.1.1:/export/a /tmp/a
      $ sudo mount -t nfs -o vers=3          192.168.1.1:/export/b /tmp/b
      $ df >/dev/null
      df: ‘/tmp/b’: Input/output error
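The fixed comparison can be sketched as follows; the types and names are illustrative, not the kernel's:

```c
#include <stdbool.h>

/* Sketch: sharing a struct nfs_server is allowed only when the auth
 * flavors match, whether or not "sec=" was given on the new mount. */
struct mount_opts {
    int auth_flavor;
};

static bool nfs_can_share_server(const struct mount_opts *a,
                                 const struct mount_opts *b)
{
    return a->auth_flavor == b->auth_flavor;  /* always compared now */
}
```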
      Signed-off-by: Chris Perl <cperl@janestreet.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 19 Dec, 2018 26 commits
  4. 18 Dec, 2018 1 commit