  1. 02 Feb, 2018 1 commit
    • xprtrdma: Fix calculation of ri_max_send_sges · 1179e2c2
      Chuck Lever authored
      Commit 16f906d6 ("xprtrdma: Reduce required number of send
      SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. That
      commit fixed a problem where xprtrdma would not work if the device's
      max_sge capability was small (low single digits).
      
      At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
      each RPC. ri_max_send_sges is set to this value:
      
        ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
      
      Then when marshaling each RPC, rpcrdma_args_inline uses that value
      to determine whether the device has enough Send SGEs to convey an
      NFS WRITE payload inline, or whether instead a Read chunk is
      required.
      
      More recently, commit ae72950a ("xprtrdma: Add data structure to
      manage RDMA Send arguments") used the ri_max_send_sges value to
      calculate the size of an array, but that commit erroneously assumed
      ri_max_send_sges contains a value similar to the device's max_sge,
      and not one that was reduced by the minimum SGE count.
      
      This assumption causes the calculated size of the sendctx's
      Send SGE array to be too small. When the array is used to marshal
      an RPC, the code can write Send SGEs into the following sendctx
      element in that array, corrupting it. When the device's max_sge is
      large, this issue is entirely harmless, but it results in an oops
      in the provider's post_send method if dev.attrs.max_sge is small.
      
      So let's straighten this out: ri_max_send_sges will now contain a
      value with the same meaning as dev.attrs.max_sge, which makes
      the code easier to understand, and enables rpcrdma_sendctx_create
      to calculate the size of the SGE array correctly.
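      
      A minimal sketch of the corrected setup (illustrative only; it follows
      the prose above rather than reproducing the exact upstream diff):
      
        if (max_sge < RPCRDMA_MIN_SEND_SGES)
                return -ENOMEM;          /* device cannot support xprtrdma */
        ia->ri_max_send_sges = max_sge;  /* same meaning as dev.attrs.max_sge */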
      Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 23 Jan, 2018 5 commits
  3. 16 Jan, 2018 4 commits
    • xprtrdma: Remove usage of "mw" · 96ceddea
      Chuck Lever authored
      Clean up: struct rpcrdma_mw was named after Memory Windows, but
      xprtrdma no longer supports a Memory Window registration mode.
      Rename rpcrdma_mw and its fields to reduce confusion and make
      the code more sensible to read.
      
      Renaming "mw" was suggested by Tom Talpey, the author of the
      original xprtrdma implementation. It's a good idea, but I haven't
      done this until now because it's a huge diffstat for no benefit
      other than code readability.
      
      However, I'm about to introduce static trace points that expose
      a few of xprtrdma's internal data structures. They should make sense
      in the trace report, and it's reasonable to treat trace points as a
      kernel API contract which might be difficult to change later.
      
      While I'm churning things up, two additional changes:
      - rename variables unhelpfully called "r" to "mr", to improve code
        clarity, and
      - rename the MR-related helper functions using the form
        "rpcrdma_mr_<verb>", to be consistent with other areas of the
        code.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Split xprt_rdma_send_request · cf73daf5
      Chuck Lever authored
      Clean up. @rqst is set up differently for backchannel Replies. For
      example, rqst->rq_task and task->tk_client are both NULL. So it is
      easier to understand and maintain this code path if it is separated.
      
      Also, we can get rid of the confusing rl_connect_cookie hack in
      rpcrdma_bc_receive_call.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move unmap-safe logic to rpcrdma_marshal_req · a2b6470b
      Chuck Lever authored
      Clean up. This logic is related to marshaling the request, and I'd
      like to keep everything that touches req->rl_registered close
      together, for CPU cache efficiency.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Per-mode handling for Remote Invalidation · c3441618
      Chuck Lever authored
      Refactoring change: Remote Invalidation is particular to the memory
      registration mode that is in use. Use a callout instead of a generic
      function to handle Remote Invalidation.
      
      This gets rid of the 8-byte flags field in struct rpcrdma_mw, of
      which only a single bit flag has been allocated.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  4. 15 Dec, 2017 1 commit
    • xprtrdma: Spread reply processing over more CPUs · ccede759
      Chuck Lever authored
      Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler
      directly from RECV completion") introduced a performance regression
      for NFS I/O small enough to not need memory registration. In multi-
      threaded benchmarks that generate primarily small I/O requests,
      IOPS throughput is reduced by nearly a third. This patch restores
      the previous level of throughput.
      
      Because workqueues are typically BOUND (in particular ib_comp_wq,
      nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
      to aggregate on the CPU that is handling Receive completions.
      
      The usual approach to addressing this problem is to create a QP
      and CQ for each CPU, and then schedule transactions on the QP
      for the CPU where you want the transaction to complete. The
      transaction then does not require an extra context switch during
      completion to end up on the same CPU where the transaction was
      started.
      
      This approach doesn't work for the Linux NFS/RDMA client because
      currently the Linux NFS client does not support multiple connections
      per client-server pair, and the RDMA core API does not make it
      straightforward for ULPs to determine which CPU is responsible for
      handling Receive completions for a CQ.
      
      So for the moment, record the CPU number in the rpcrdma_req before
      the transport sends each RPC Call. Then during Receive completion,
      queue the RPC completion on that same CPU.
      
      Additionally, move all RPC completion processing to the deferred
      handler so that even RPCs with simple small replies complete on
      the CPU that sent the corresponding RPC Call.
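      
      Roughly, the change amounts to the following (a sketch; the field and
      workqueue names here are illustrative, not necessarily the upstream
      identifiers):
      
        /* Call side: remember which CPU issued this RPC */
        req->rl_cpu = smp_processor_id();
      
        /* Receive completion: defer reply processing back to that CPU */
        queue_work_on(req->rl_cpu, rpcrdma_receive_wq, &rep->rr_work);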
      
      Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  5. 17 Nov, 2017 14 commits
    • xprtrdma: Update copyright notices · 62b56a67
      Chuck Lever authored
      Credit work contributed by Oracle engineers since 2014.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • rpcrdma: Remove C structure definitions of XDR data items · 2232df5e
      Chuck Lever authored
      Clean up: C-structure style XDR encoding and decoding logic has
      been replaced over the past several merge windows on both the
      client and server. These data structures are no longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: RPC completion should wait for Send completion · 01bb35c8
      Chuck Lever authored
      When an RPC Call includes a file data payload, that payload can come
      from pages in the page cache, or a user buffer (for direct I/O).
      
      If the payload can fit inline, xprtrdma includes it in the Send
      using a scatter-gather technique. xprtrdma mustn't allow the RPC
      consumer to re-use the memory where that payload resides before the
      Send completes. Otherwise, the new contents of that memory would be
      exposed by an HCA retransmit of the Send operation.
      
      So, block RPC completion on Send completion, but only in the case
      where a separate file data payload is part of the Send. This
      prevents the reuse of that memory while it is still part of a Send
      operation without an undue cost to other cases.
      
      Waiting is avoided in the common case because typically the Send
      will have completed long before the RPC Reply arrives.
      
      These days, an RPC timeout will trigger a disconnect, which tears
      down the QP. The disconnect flushes all waiting Sends. This bounds
      the amount of time the reply handler has to wait for a Send
      completion.
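      
      One way to picture the mechanism (a hedged sketch; the flag name and
      surrounding helpers are assumptions for illustration, not the literal
      upstream identifiers):
      
        /* marshaling: note when the Send carries a separate payload */
        if (payload_mapped)
                set_bit(RPCRDMA_REQ_F_TX_PENDING, &req->rl_flags);
      
        /* reply handling: wait only if that flag is set */
        if (test_bit(RPCRDMA_REQ_F_TX_PENDING, &req->rl_flags))
                wait_on_bit(&req->rl_flags, RPCRDMA_REQ_F_TX_PENDING,
                            TASK_UNINTERRUPTIBLE);
      
        /* Send completion handler: release any waiter */
        clear_bit(RPCRDMA_REQ_F_TX_PENDING, &req->rl_flags);
        wake_up_bit(&req->rl_flags, RPCRDMA_REQ_F_TX_PENDING);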
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Refactor rpcrdma_deferred_completion · 0ba6f370
      Chuck Lever authored
      Invoke a common routine for releasing hardware resources (for
      example, invalidating MRs). This needs to be done whether an
      RPC Reply has arrived or the RPC was terminated early.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Add data structure to manage RDMA Send arguments · ae72950a
      Chuck Lever authored
      Problem statement:
      
      Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
      enabled storage initiators don't handle delayed Send completion
      correctly. If Send completion is delayed beyond the end of a ULP
      transaction, the ULP may release resources that are still being used
      by the HCA to complete a long-running Send operation.
      
      This is a common design trait amongst our initiators. Most Send
      operations are faster than the ULP transaction they are part of.
      Waiting for a completion for these is typically unnecessary.
      
      Infrequently, a network partition or some other problem crops up
      where an ordering problem can occur. In NFS parlance, the RPC Reply
      arrives and completes the RPC, but the HCA is still retrying the
      Send WR that conveyed the RPC Call. In this case, the HCA can try
      to use memory that has been invalidated or DMA unmapped, and the
      connection is lost. If that memory has been re-used for something
      else (possibly not related to NFS), the Send retransmission
      exposes that data on the wire.
      
      Thus we cannot assume that it is safe to release Send-related
      resources just because a ULP reply has arrived.
      
      After some analysis, we have determined that the completion
      housekeeping will not be difficult for xprtrdma:
      
       - Inline Send buffers are registered via the local DMA key, and
         are already left DMA mapped for the lifetime of a transport
         connection, thus no additional handling is necessary for those
       - Gathered Sends involving page cache pages _will_ need to
         DMA unmap those pages after the Send completes. But like
         inline send buffers, they are registered via the local DMA key,
         and thus will not need to be invalidated
      
      In addition, RPC completion will need to wait for Send completion
      in the latter case. However, nearly always, the Send that conveys
      the RPC Call will have completed long before the RPC Reply
      arrives, and thus no additional latency will be accrued.
      
      Design notes:
      
      In this patch, the rpcrdma_sendctx object is introduced, and a
      lock-free circular queue is added to manage a set of them per
      transport.
      
      The RPC client's send path already prevents sending more than one
      RPC Call at the same time. This allows us to treat the consumer
      side of the queue (rpcrdma_sendctx_get_locked) as if there is a
      single consumer thread.
      
      The producer side of the queue (rpcrdma_sendctx_put_locked) is
      invoked only from the Send completion handler, which is a single
      thread of execution (soft IRQ).
      
      The only care that needs to be taken is with the tail index, which
      is shared between the producer and consumer. Only the producer
      updates the tail index. The consumer compares the head with the
      tail to ensure that a sendctx that is in use is never handed
      out again (or, expressed more conventionally, the queue is empty).
      
      When the sendctx queue empties completely, there are enough Sends
      outstanding that posting more Send operations can result in a Send
      Queue overflow. In this case, the ULP is told to wait and try again.
      This introduces strong Send Queue accounting to xprtrdma.
      
      As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      suggested a mechanism that does not require signaling every Send.
      We signal once every N Sends, and perform SGE unmapping of N Send
      operations during that one completion.
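      
      A stripped-down sketch of the queue discipline described above (field
      and helper names are illustrative approximations, not the verbatim
      upstream code):
      
        /* Consumer side (send path): only this context moves the head */
        struct rpcrdma_sendctx *
        rpcrdma_sendctx_get_locked(struct rpcrdma_buffer *buf)
        {
                unsigned long next = buf->rb_sc_head + 1;
      
                if (next == buf->rb_sc_size)
                        next = 0;
                if (next == READ_ONCE(buf->rb_sc_tail))
                        return NULL;    /* empty: Send Queue saturated, caller waits */
                buf->rb_sc_head = next;
                return buf->rb_sc_ctxs[next];
        }
      
        /* Producer side (Send completion handler): only it moves the tail */
        void
        rpcrdma_sendctx_put_locked(struct rpcrdma_buffer *buf,
                                   struct rpcrdma_sendctx *sc)
        {
                unsigned long tail = buf->rb_sc_tail;
      
                /* one signaled completion retires a batch of N Sends:
                 * unmap each sendctx up to and including @sc */
                do {
                        if (++tail == buf->rb_sc_size)
                                tail = 0;
                        rpcrdma_sendctx_unmap(buf->rb_sc_ctxs[tail]);
                } while (buf->rb_sc_ctxs[tail] != sc);
      
                /* pairs with READ_ONCE in the consumer */
                smp_store_release(&buf->rb_sc_tail, tail);
        }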
      Reported-by: Sagi Grimberg <sagi@grimberg.me>
      Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: "Unoptimize" rpcrdma_prepare_hdr_sge() · a062a2a3
      Chuck Lever authored
      Commit 655fec69 ("xprtrdma: Use gathered Send for large inline
      messages") assumed that, since the zeroeth element of the Send SGE
      array always pointed to req->rl_rdmabuf, it needed to be initialized
      just once. This was a valid assumption because the Send SGE array
      and rl_rdmabuf both live in the same rpcrdma_req.
      
      In a subsequent patch, the Send SGE array will be separated from the
      rpcrdma_req, so the zeroeth element of the SGE array needs to be
      initialized every time.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Change return value of rpcrdma_prepare_send_sges() · 857f9aca
      Chuck Lever authored
      Clean up: Make rpcrdma_prepare_send_sges() return a negative errno
      instead of a bool. Soon callers will want distinct treatments of
      different types of failures.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix error handling in rpcrdma_prepare_msg_sges() · 394b2c77
      Chuck Lever authored
      When this function fails, it needs to undo the DMA mappings it's
      done so far. Otherwise these are leaked.
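      
      The unwind pattern, roughly (a sketch; index handling and variable
      names are illustrative):
      
        out_unmap:
                /* release only the mappings this function created;
                 * sge[0] (the header) is owned by the caller's setup */
                while (sge_no > 1) {
                        sge_no--;
                        ib_dma_unmap_page(device, sge[sge_no].addr,
                                          sge[sge_no].length, DMA_TO_DEVICE);
                }
                return false;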
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Clean up SGE accounting in rpcrdma_prepare_msg_sges() · ad99f053
      Chuck Lever authored
      Clean up. rpcrdma_prepare_hdr_sge() sets num_sge to one, then
      rpcrdma_prepare_msg_sges() sets num_sge again to the count of SGEs
      it added, plus one for the header SGE just mapped in
      rpcrdma_prepare_hdr_sge(). This is confusing, and nails in an
      assumption about when these functions are called.
      
      Instead, maintain a running count that both functions can update
      with just the number of SGEs they have added to the SGE array.
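      
      In other words (a sketch; where the counter lives is an assumption for
      illustration):
      
        /* rpcrdma_prepare_hdr_sge(): contributes exactly one SGE */
        req->rl_send_wr.num_sge++;
      
        /* rpcrdma_prepare_msg_sges(): contributes only what it mapped */
        req->rl_send_wr.num_sge += mapped_sges;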
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Decode credits field in rpcrdma_reply_handler · be798f90
      Chuck Lever authored
      We need to decode and save the incoming rdma_credits field _after_
      we know that the direction of the message is "forward direction
      Reply". Otherwise, the credits value in reverse direction Calls is
      also used to update the forward direction credits.
      
      It is safe to decode the rdma_credits field in rpcrdma_reply_handler
      now that rpcrdma_reply_handler is single-threaded. Receives complete
      in the same order as they were sent on the NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completion · d8f532d2
      Chuck Lever authored
      I noticed that the soft IRQ thread looked pretty busy under heavy
      I/O workloads. perf suggested one area that was expensive was the
      queue_work() call in rpcrdma_wc_receive. That gave me some ideas.
      
      Instead of scheduling a separate worker to process RPC Replies,
      promote the Receive completion handler to IB_POLL_WORKQUEUE, and
      invoke rpcrdma_reply_handler directly.
      
      Note that the poll workqueue is single-threaded. In order to keep
      memory invalidation from serializing all RPC Replies, handle any
      necessary invalidation tasks in a separate multi-threaded workqueue.
      
      This provides a two-tier scheme, similar to OS I/O interrupt
      handlers: A fast interrupt handler that schedules the slow handler
      and re-enables the interrupt, and a slower handler that is invoked
      for any needed heavy lifting.
      
      Benefits include:
      - One less context switch for RPCs that don't register memory
      - Receive completion handling is moved out of soft IRQ context to
        make room for other users of soft IRQ
      - The same CPU core now DMA syncs and XDR decodes the Receive buffer
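      
      A sketch of the shape of the change (allocation arguments and handler
      internals are illustrative, not the literal upstream code):
      
        /* The Receive CQ is polled from a kernel workqueue, not soft IRQ */
        recvcq = ib_alloc_cq(ia->ri_device, NULL, rx_depth, 0,
                             IB_POLL_WORKQUEUE);
      
        /* Fast tier: the completion handler calls the reply handler
         * directly; the reply handler defers MR invalidation to a
         * separate multi-threaded workqueue only when it is needed. */
        static void rpcrdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
        {
                struct rpcrdma_rep *rep =
                        container_of(wc->wr_cqe, struct rpcrdma_rep, rr_cqe);
      
                rpcrdma_reply_handler(rep);
        }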
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Refactor rpcrdma_reply_handler some more · e1352c96
      Chuck Lever authored
      Clean up: I'd like to be able to invoke the tail of
      rpcrdma_reply_handler in two different places. Split the tail out
      into its own helper function.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move decoded header fields into rpcrdma_rep · 5381e0ec
      Chuck Lever authored
      Clean up: Make it easier to pass the decoded XID, vers, credits, and
      proc fields around by moving these variables into struct rpcrdma_rep.
      
      Note: the credits field will be handled in a subsequent patch.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Throw away reply when version is unrecognized · 61433af5
      Chuck Lever authored
      A reply with an unrecognized value in the version field means the
      transport header is potentially garbled and therefore all the fields
      are untrustworthy.
      
      Fixes: 59aa1f9a ("xprtrdma: Properly handle RDMA_ERROR ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  6. 05 Sep, 2017 1 commit
    • xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler · 9590d083
      Chuck Lever authored
      Adopt the use of xprt_pin_rqst to eliminate contention between
      Call-side users of rb_lock and the use of rb_lock in
      rpcrdma_reply_handler.
      
      This replaces the mechanism introduced in 431af645 ("xprtrdma:
      Fix client lock-up after application signal fires").
      
      Use recv_lock to quickly find the completing rqst, pin it, then
      drop the lock. At that point invalidation and pull-up of the Reply
      XDR can be done. Both are often expensive operations.
      
      Finally, take recv_lock again to signal completion to the RPC
      layer. It also protects adjustment of "cwnd".
      
      This greatly reduces the amount of time a lock is held by the
      reply handler. Comparing lock_stat results shows a marked decrease
      in contention on rb_lock and recv_lock.
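      
      The locking shape described above, roughly (a sketch, not the verbatim
      handler):
      
        spin_lock(&xprt->recv_lock);
        rqst = xprt_lookup_rqst(xprt, xid);
        if (!rqst)
                goto out_norqst;
        xprt_pin_rqst(rqst);
        spin_unlock(&xprt->recv_lock);
      
        /* expensive work without the lock: invalidate MRs,
         * pull up the Reply XDR into rq_rcv_buf, etc. */
      
        spin_lock(&xprt->recv_lock);
        xprt_complete_rqst(rqst->rq_task, status);
        xprt_unpin_rqst(rqst);
        spin_unlock(&xprt->recv_lock);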
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      [trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from
         the "out_norqst:" path in rpcrdma_reply_handler.]
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  7. 18 Aug, 2017 1 commit
  8. 15 Aug, 2017 2 commits
  9. 11 Aug, 2017 4 commits
  10. 08 Aug, 2017 5 commits
  11. 13 Jul, 2017 2 commits
    • xprtrdma: Replace PAGE_MASK with offset_in_page() · d933cc32
      Chuck Lever authored
      Clean up.
      
      Reported-by: Geliang Tang <geliangtang@gmail.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix client lock-up after application signal fires · 431af645
      Chuck Lever authored
      After a signal, the RPC client aborts synchronous RPCs running on
      behalf of the signaled application.
      
      The server is still executing those RPCs, and will write the results
      back into the client's memory when it's done. By the time the server
      writes the results, that memory is likely being used for other
      purposes. Therefore xprtrdma has to immediately invalidate all
      memory regions used by those aborted RPCs to prevent the server's
      writes from clobbering that re-used memory.
      
      With FMR memory registration, invalidation takes a relatively long
      time. In fact, the invalidation is often still running when the
      server tries to write the results into the memory regions that are
      being invalidated.
      
      This sets up a race between two processes:
      
      1.  After the signal, xprt_rdma_free calls ro_unmap_safe.
      2.  While ro_unmap_safe is still running, the server replies and
          rpcrdma_reply_handler runs, calling ro_unmap_sync.
      
      Both processes invoke ib_unmap_fmr on the same FMR.
      
      The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
      the same time, but HCAs generally don't tolerate this. Sometimes
      this can result in a system crash.
      
      If the HCA happens to survive, rpcrdma_reply_handler continues. It
      removes the rpc_rqst from rq_list and releases the transport_lock.
      This enables xprt_rdma_free to run in another process, and the
      rpc_rqst is released while rpcrdma_reply_handler is still waiting
      for the ib_unmap_fmr call to finish.
      
      But further down in rpcrdma_reply_handler, the transport_lock is
      taken again, and "rqst" is dereferenced. If "rqst" has already been
      released, this triggers a general protection fault. Since bottom-
      halves are disabled, the system locks up.
      
      Address both issues by reversing the order of the xprt_lookup_rqst
      call and the ro_unmap_sync call. Introduce a separate lookup
      mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
      xprt_lookup_rqst. Now the handler takes the transport_lock once
      and holds it for the XID lookup and RPC completion.
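      
      After the fix, the handler's flow looks roughly like this (a sketch;
      the first two steps use the new req-lookup mechanism and the existing
      ro_unmap_sync callout mentioned above):
      
        /* 1. locate the rpcrdma_req via the new lookup (no transport_lock) */
        /* 2. invalidate its registered memory via ->ro_unmap_sync          */
        /* 3. only then take the transport_lock, once, for completion:      */
        spin_lock_bh(&xprt->transport_lock);
        rqst = xprt_lookup_rqst(xprt, xid);
        if (rqst)
                xprt_complete_rqst(rqst->rq_task, status);
        spin_unlock_bh(&xprt->transport_lock);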
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>