1. 27 Sep, 2016 3 commits
  2. 23 Sep, 2016 3 commits
    • xprtrdma: use complete() instead of complete_all() · 5690a22d
      Daniel Wagner authored
      There is only one waiter for the completion, therefore there
      is no need to use complete_all(). Let's make that clear by
      using complete() instead of complete_all().
      
      The usage pattern of the completion is:
      
      waiter context                          waker context
      
      frwr_op_unmap_sync()
        reinit_completion()
        ib_post_send()
        wait_for_completion()
      
      					frwr_wc_localinv_wake()
      					  complete()

      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: linux-nfs@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
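
The difference between complete() and complete_all() can be illustrated with a small single-threaded model. This is only a sketch, not the kernel implementation: `struct completion_model` and the `model_*` helpers are hypothetical names, and `model_try_wait()` is a non-blocking stand-in for wait_for_completion().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a completion: a counter of wake-ups. */
struct completion_model {
	unsigned int done;	/* pending wake-ups; UINT_MAX means "all" */
};

static void model_init(struct completion_model *c)
{
	assert(c != NULL);
	c->done = 0;
}

/* Models complete(): lets exactly one waiter proceed. */
static void model_complete(struct completion_model *c)
{
	assert(c != NULL);
	if (c->done != UINT_MAX)
		c->done++;
}

/* Models complete_all(): lets every current and future waiter proceed. */
static void model_complete_all(struct completion_model *c)
{
	assert(c != NULL);
	c->done = UINT_MAX;
}

/* Returns true if a waiter arriving now would be allowed to continue. */
static bool model_try_wait(struct completion_model *c)
{
	assert(c != NULL);
	if (c->done == 0)
		return false;
	if (c->done != UINT_MAX)
		c->done--;	/* consume one wake-up */
	return true;
}
```

With a single waiter, one `model_complete()` call is consumed by one wait, so the object is ready for reuse. After `model_complete_all()` the model stays signaled until it is reinitialized, which is more than a single-waiter pattern needs.
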
    • NFS: cache_lib: use complete() instead of complete_all() · 2a446a5d
      Daniel Wagner authored
      There is only one waiter for the completion, therefore there
      is no need to use complete_all(). Let's make that clear by
      using complete() instead of complete_all().
      
      The generic caching code from sunrpc is calling revisit() only once.
      
      The usage pattern of the completion is:
      
      waiter context                          waker context
      
      do_cache_lookup_wait()
        nfs_cache_defer_req_alloc()
          init_completion()
        do_cache_lookup()
        nfs_cache_wait_for_upcall()
          wait_for_completion_timeout()
      
      					nfs_dns_cache_revisit()
      					  complete()
      
        nfs_cache_defer_req_put()

      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: direct: use complete() instead of complete_all() · 024de8f1
      Daniel Wagner authored
      There is only one waiter for the completion, therefore there
      is no need to use complete_all(). Let's make that clear by
      using complete() instead of complete_all().
      
      nfs_file_direct_write() or nfs_file_direct_read() allocates a request
      object via nfs_direct_req_alloc(), which initializes the
      completion. The request object is freed later in the exit path.
      Between initialization and release, either
      nfs_direct_write_schedule_iovec() or
      nfs_direct_read_schedule_iovec() is called to process the request
      asynchronously. The calling function waits via nfs_direct_wait()
      until the async work is done. Thus there is only one waiter on
      the completion.
      
      nfs_direct_pgio_init() and nfs_direct_read_completion() are passed via
      function pointers to nfs pageio. The first function does reference
      counting (get_dreq() and put_dreq()), which ensures that
      nfs_direct_read_completion() and nfs_direct_read_schedule_iovec()
      call the completion path only once.
      
      The usage pattern of the completion is:
      
      waiter context                          waker context
      
      nfs_file_direct_write()
        dreq = nfs_direct_req_alloc()
          init_completion()
        nfs_direct_write_schedule_iovec()
        nfs_direct_wait()
          wait_for_completion_killable()
      
                                              nfs_direct_write_schedule_work()
                                                nfs_direct_complete()
                                                  complete()
      
      nfs_file_direct_read()
        dreq = nfs_direct_req_alloc()
          init_completion()
        nfs_direct_read_schedule_iovec()
        nfs_direct_wait()
          wait_for_completion_killable()
                                              nfs_direct_read_schedule_iovec()
                                                nfs_direct_complete()
                                                  complete()
      
                                              nfs_direct_read_completion()
                                                nfs_direct_complete()
                                                  complete()

      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 22 Sep, 2016 12 commits
  4. 20 Sep, 2016 1 commit
  5. 19 Sep, 2016 21 commits
    • sunrpc: fix write space race causing stalls · d48f9ce7
      David Vrabel authored
      Write space becoming available may race with putting the task to sleep
      in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
      race does not work.
      
      This (edited) partial trace illustrates the problem:
      
         [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
         [2] xs_write_space <-xs_tcp_write_space
         [3] xprt_write_space <-xs_write_space
         [4] rpc_task_sleep: task:43546@5 ...
         [5] xs_write_space <-xs_tcp_write_space
      
      [1] Task 43546 runs but is out of write space.
      
      [2] Space becomes available, xs_write_space() clears the
          SOCKWQ_ASYNC_NOSPACE bit.
      
      [3] xprt_write_space() attempts to wake xprt->snd_task (== 43546), but
          the task has not yet been queued, so the wake-up is lost.
      
      [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
          which queues task 43546.
      
      [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
          is supposed to handle the above race) does not call
          xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
          thus the task is not woken.
      
      Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
      so the second call to sk->sk_write_space() calls xprt_write_space().

      Suggested-by: Trond Myklebust <trondmy@primarydata.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      cc: stable@vger.kernel.org # 4.4
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
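
The lost wake-up in steps [2] to [5] can be replayed with a deterministic model. This is a sketch under stated assumptions, not the sunrpc code: `struct xprt_model` and the `model_*` functions are hypothetical, the `nospace_bit` field stands in for SOCKWQ_ASYNC_NOSPACE, and the model assumes write space is available by the time the final write-space call runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the state involved in the race. */
struct xprt_model {
	bool nospace_bit;	/* stands in for SOCKWQ_ASYNC_NOSPACE */
	bool task_queued;	/* task is waiting for buffer space */
	bool task_woken;	/* task has been woken up */
};

/* Models the write-space callback: it only acts while the bit is set. */
static void model_write_space(struct xprt_model *x)
{
	assert(x != NULL);
	if (!x->nospace_bit)
		return;			/* step [5] without the fix */
	x->nospace_bit = false;		/* step [2] */
	if (x->task_queued) {		/* step [3]: wake only a queued task */
		x->task_queued = false;
		x->task_woken = true;
	}
}

/* Models the no-space path: queue the task, then re-check for space. */
static void model_nospace(struct xprt_model *x, bool with_fix)
{
	assert(x != NULL);
	x->task_queued = true;		/* step [4] */
	if (with_fix)
		x->nospace_bit = true;	/* the fix: set the bit again */
	model_write_space(x);		/* step [5] */
}

/* Replays the trace; returns true if the task ends up woken. */
static bool model_run_trace(bool with_fix)
{
	struct xprt_model x = { .nospace_bit = true };

	model_write_space(&x);		/* steps [2] and [3]: wake-up is lost */
	model_nospace(&x, with_fix);	/* steps [4] and [5] */
	return x.task_woken;
}
```

In this model, `model_run_trace(false)` leaves the task asleep, while `model_run_trace(true)` wakes it, which mirrors the stall and the fix described above.
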
    • pnfs: add a new mechanism to select a layout driver according to an ordered list · ca440c38
      Jeff Layton authored
      Currently, the layout driver selection code always chooses the first one
      from the list. That's not really ideal however, as the server can send
      the list of layout types in any order that it likes. It's up to the
      client to select the best one for its needs.
      
      This patch adds an ordered list of preferred driver types and has the
      selection code sort the list of available layout drivers according to it.
      Any unrecognized layout type is sorted to the end of the list.
      
      For now, the order of preference is hardcoded, but it should be possible
      to make this configurable in the future.

      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
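
Sorting a server-supplied list against a fixed preference order might look like the sketch below. The numeric ids in `preferred[]` and their order are placeholders chosen for illustration, not the order used by the patch, and `rank_of()`, `cmp_layout()`, and `sort_layout_types()` are hypothetical names. Breaking ties by numeric value is also an assumption, made only to keep the result deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical preference order, most preferred first. */
static const unsigned int preferred[] = { 5, 3, 2, 4, 1 };

/* Position in the preference list; unrecognized types rank last. */
static size_t rank_of(unsigned int type)
{
	size_t i, n = sizeof(preferred) / sizeof(preferred[0]);

	for (i = 0; i < n; i++)
		if (preferred[i] == type)
			return i;
	return n;
}

/* qsort() comparator: order by rank, then by value to break ties. */
static int cmp_layout(const void *a, const void *b)
{
	unsigned int ta = *(const unsigned int *)a;
	unsigned int tb = *(const unsigned int *)b;
	size_t ra = rank_of(ta), rb = rank_of(tb);

	if (ra != rb)
		return ra < rb ? -1 : 1;
	return (ta > tb) - (ta < tb);
}

/* Sort the available layout types so the best match comes first. */
static void sort_layout_types(unsigned int *types, size_t n)
{
	assert(types != NULL || n == 0);
	if (n > 1)
		qsort(types, n, sizeof(*types), cmp_layout);
}
```

After sorting, a caller would walk the array from the front and pick the first type for which a driver is actually available.
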
    • xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5
      Chuck Lever authored
      Clean up: the extra layer of indirection doesn't add value.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Rename rpcrdma_receive_wc() · 1519e969
      Chuck Lever authored
      Clean up: When converting xprtrdma to use the new CQ API, I missed a
      spot. The naming convention elsewhere is:
      
        {svc_rdma,rpcrdma}_wc_{operation}

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrmda: Report address of frmr, not mw · eeb30613
      Chuck Lever authored
      Tie frwr debugging messages together by always reporting the address
      of the frwr.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Support larger inline thresholds · 44829d02
      Chuck Lever authored
      The Version One default inline threshold is still 1KB. But allow
      testing with thresholds up to 64KB.
      
      This maximum is somewhat arbitrary. There's no fundamental
      architectural limit I'm aware of, but it's good to keep the size of
      Receive buffers reasonable. Now that Send can use a s/g list, a
      Send buffer is only as large as each RPC requires. Receive buffers
      are always the size of the inline threshold, however.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use gathered Send for large inline messages · 655fec69
      Chuck Lever authored
      An RPC Call message that is sent inline but that has a data payload
      (i.e., one or more items in rq_snd_buf's page list) must be "pulled
      up":
      
      - call_allocate has to reserve enough RPC Call buffer space to
      accommodate the data payload
      
      - call_transmit has to memcopy the rq_snd_buf's page list and tail
      into its head iovec before it is sent
      
      As the inline threshold is increased beyond its current 1KB default,
      however, this means data payloads of more than a few KB are copied
      by the host CPU. For example, if the inline threshold is increased
      just to 4KB, then NFS WRITE requests up to 4KB would involve a
      memcpy of the NFS WRITE's payload data into the RPC Call buffer.
      This is an undesirable amount of participation by the host CPU.
      
      The inline threshold may be much larger than 4KB in the future,
      after negotiation with a peer server.
      
      Instead of copying the components of rq_snd_buf into its head iovec,
      construct a gather list of these components, and send them all in
      place. The same approach is already used in the Linux server's
      RPC-over-RDMA reply path.
      
      This mechanism also eliminates the need for rpcrdma_tail_pullup,
      which is used to manage the XDR pad and trailing inline content when
      a Read list is present.
      
      This requires that the pages in rq_snd_buf's page list be DMA-mapped
      during marshaling, and unmapped when a data-bearing RPC is
      completed. This is slightly less efficient for very small I/O
      payloads, but significantly more efficient as data payload size and
      inline threshold increase past a kilobyte.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
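
The idea of sending the head, page list, and tail in place can be sketched as building a list of (address, length) pairs rather than copying everything into one buffer. The types and names below (`struct sge_model`, `struct xdr_buf_model`, `build_gather_list()`, `MODEL_PAGE_SIZE`) are hypothetical simplifications, not the xprtrdma data structures. The sketch ignores DMA mapping and assumes the payload starts at offset zero of the first page.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* One gather element: a region that is sent from where it already lives. */
struct sge_model {
	const void *addr;
	size_t length;
};

/* Cut-down stand-in for an XDR buffer: head, page list, and tail. */
struct xdr_buf_model {
	const void *head;
	size_t head_len;
	const void *const *pages;
	size_t page_len;	/* payload bytes spread across pages[] */
	const void *tail;
	size_t tail_len;
};

/*
 * Describe buf as a gather list instead of copying it into the head.
 * Returns the number of elements used, or 0 if the buffer is empty
 * or max_sge is too small to describe it.
 */
static size_t build_gather_list(const struct xdr_buf_model *buf,
				struct sge_model *sge, size_t max_sge)
{
	size_t n = 0, i, remaining;

	assert(buf != NULL && sge != NULL);
	remaining = buf->page_len;

	if (buf->head_len > 0) {
		if (n == max_sge)
			return 0;
		sge[n].addr = buf->head;
		sge[n].length = buf->head_len;
		n++;
	}
	/* One element per page; the last page may be only partly used. */
	for (i = 0; remaining > 0; i++) {
		size_t len = remaining < MODEL_PAGE_SIZE ?
			     remaining : MODEL_PAGE_SIZE;

		if (n == max_sge)
			return 0;
		sge[n].addr = buf->pages[i];
		sge[n].length = len;
		n++;
		remaining -= len;
	}
	if (buf->tail_len > 0) {
		if (n == max_sge)
			return 0;
		sge[n].addr = buf->tail;
		sge[n].length = buf->tail_len;
		n++;
	}
	return n;
}
```

No payload byte is copied here: the cost moves from a memcpy proportional to the payload size to a fixed amount of bookkeeping per element.
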
    • xprtrdma: Basic support for Remote Invalidation · c8b920bb
      Chuck Lever authored
      Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
      as part of a Receive completion. Local invalidation can be skipped
      for that rkey.
      
      Use an out-of-band signaling mechanism to indicate to the server
      that the client is prepared to receive RDMA Send With Invalidate.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Chuck Lever authored
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust the use of Reply chunks, and eliminate most use of Position
      Zero Read chunks. Moderately-sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • rpcrdma: RDMA/CM private message data structure · ff06bd19
      Chuck Lever authored
      Introduce data structure used by both client and server to exchange
      implementation details during RDMA/CM connection establishment.
      
      This is an experimental out-of-band exchange between Linux
      RPC-over-RDMA Version One implementations, replacing the deprecated
      CCP (see RFC 5666bis). The purpose of this extension is to enable
      prototyping of features that might be introduced in a subsequent
      version of RPC-over-RDMA.
      
      Suggested by Christoph Hellwig and Devesh Sharma.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move recv_wr to struct rpcrdma_rep · 6ea8e711
      Chuck Lever authored
      Clean up: The fields in the recv_wr do not vary. There is no need to
      initialize them before each ib_post_recv(). This removes a large-ish
      data structure from the stack.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602
      Chuck Lever authored
      Clean up: Most of the fields in each send_wr do not vary. There is
      no need to initialize them before each ib_post_send(). This removes
      a large-ish data structure from the stack.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Simplify rpcrdma_ep_post_recv() · b157380a
      Chuck Lever authored
      Clean up.
      
      Since commit fc664485 ("xprtrdma: Split the completion queue"),
      rpcrdma_ep_post_recv() no longer uses the "ep" argument.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf · 13650c23
      Chuck Lever authored
      Clean up. The "ia" argument is no longer used.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Delay DMA mapping Send and Receive buffers · 54cbd6b0
      Chuck Lever authored
      Currently, each regbuf is allocated and DMA mapped at the same time.
      This is done during transport creation.
      
      When a device driver is unloaded, every DMA-mapped buffer in use by
      a transport has to be unmapped, and then remapped to the new
      device if the driver is loaded again. Remapping will have to be done
      _after_ the connect worker has set up the new device.
      
      But there's an ordering problem:
      
      call_allocate, which invokes xprt_rdma_allocate which calls
      rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
      the connect worker can run to set up the new device.
      
      Instead, at transport creation, allocate each buffer, but leave it
      unmapped. Once the RPC carries these buffers into ->send_request, by
      which time a transport connection should have been established,
      check to see that the RPC's buffers have been DMA mapped. If not,
      map them there.
      
      When device driver unplug support is added, it will simply unmap all
      the transport's regbufs, but it doesn't have to deallocate the
      underlying memory.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
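
The "allocate early, map on first use" approach can be modeled as follows. This is a sketch only: `struct regbuf_model` and the functions around it are hypothetical, a boolean stands in for the real DMA mapping state, and `device_ready` is an assumed flag representing whether the connect worker has set up the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a registered buffer whose memory outlives mappings. */
struct regbuf_model {
	bool mapped;		/* currently DMA mapped? */
	unsigned int map_calls;	/* how often a mapping was created */
};

static bool regbuf_is_mapped(const struct regbuf_model *rb)
{
	return rb->mapped;
}

/* Stand-in for the DMA mapping step; fails while no device is set up. */
static bool regbuf_map(struct regbuf_model *rb, bool device_ready)
{
	if (!device_ready)
		return false;
	rb->mapped = true;
	rb->map_calls++;
	return true;
}

/* Send path: map lazily, only if the buffer is not mapped yet. */
static bool send_request_model(struct regbuf_model *rb, bool device_ready)
{
	assert(rb != NULL);
	if (!regbuf_is_mapped(rb) && !regbuf_map(rb, device_ready))
		return false;
	return true;	/* buffer is mapped and may be posted */
}

/* Device removal: drop the mapping but keep the underlying memory. */
static void device_removed_model(struct regbuf_model *rb)
{
	assert(rb != NULL);
	rb->mapped = false;
}
```

Because the check sits in the send path, a buffer that lost its mapping when the device went away is simply remapped the next time it is used.
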
    • xprtrdma: Replace DMA_BIDIRECTIONAL · 99ef4db3
      Chuck Lever authored
      The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
      Fortunately, xprtrdma now knows which direction I/O is going as
      soon as it allocates each regbuf.
      
      The RPC Call and Reply buffers are no longer the same regbuf. They
      can each be labeled correctly now. The RPC Reply buffer is never
      part of either a Send or Receive WR, but it can be part of a Reply
      chunk, which is mapped and registered via ->ro_map. So it is not
      DMA mapped when it is allocated (DMA_NONE), to avoid a
      double-mapping.
      
      Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
      contents are never modified by the host CPU, DMA-API-HOWTO.txt
      suggests that a DMA sync before posting each buffer should be
      unnecessary. (See my_card_interrupt_handler).

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use smaller buffers for RPC-over-RDMA headers · 08cf2efd
      Chuck Lever authored
      Commit 94931746 ("xprtrdma: Limit number of RDMA segments in
      RPC-over-RDMA headers") capped the number of chunks that may appear
      in RPC-over-RDMA headers. The maximum header size can be estimated
      and fixed to avoid allocating buffer space that is never used.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Initialize separate RPC call and reply buffers · 9c40c49f
      Chuck Lever authored
      RPC-over-RDMA needs to separate its RPC call and reply buffers.
      
       o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
         Send operation using DMA_TO_DEVICE
      
       o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
         as part of a Reply chunk using DMA_FROM_DEVICE
      
      The two mappings are for data movement in opposite directions.
      
      DMA-API.txt suggests that if these mappings share a DMA cacheline,
      bad things can happen. This could occur in the final bytes of
      rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
      happen to share a DMA cacheline.
      
      On x86_64 the cacheline size is typically 64 bytes, and RPC call
      messages are usually much smaller than the send buffer, so this
      hasn't been a noticeable problem. But the DMA cacheline size can be
      larger on other platforms.
      
      Also, often rq_rcv_buf starts most of the way into a page, thus
      an additional RDMA segment is needed to map and register the end of
      that buffer. Try to avoid that scenario to reduce the cost of
      registering and invalidating Reply chunks.
      
      Instead of carrying a single regbuf that covers both rq_snd_buf and
      rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
      rq_snd_buf and one regbuf for rq_rcv_buf.
      
      Some incidental changes worth noting:
      
      - To clear out some spaghetti, refactor xprt_rdma_allocate.
      - The value stored in rg_size is the same as the value stored in
        the iov.length field, so eliminate rg_size

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
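
Whether the end of one buffer and the start of the next fall into the same cacheline is simple integer arithmetic. The helpers below are a hypothetical sketch (`shares_cacheline()` and `align_up()` are not kernel functions), and the cacheline size is passed in as a parameter because it differs between platforms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if the last byte of buffer A and the first byte of buffer B
 * lie in the same cacheline of the given size.
 */
static bool shares_cacheline(uintptr_t a_start, size_t a_len,
			     uintptr_t b_start, size_t line_size)
{
	assert(a_len > 0 && line_size > 0);
	return (a_start + a_len - 1) / line_size == b_start / line_size;
}

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
	assert(align > 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~((uintptr_t)align - 1);
}
```

For example, a 100-byte buffer at offset 0 followed directly by a second buffer at offset 100 shares a 64-byte line with it, while starting the second buffer at `align_up(100, 64)`, which is 128, does not.
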
    • SUNRPC: Add a transport-specific private field in rpc_rqst · 5a6d1db4
      Chuck Lever authored
      Currently there's a hidden and indirect mechanism for finding the
      rpcrdma_req that goes with an rpc_rqst. It depends on getting from
      the rq_buffer pointer in struct rpc_rqst to the struct
      rpcrdma_regbuf that controls that buffer, and then to the struct
      rpcrdma_req it goes with.
      
      This was done back in the day to avoid the need to add a per-rqst
      pointer or to alter the buf_free API when support for RPC-over-RDMA
      was introduced.
      
      I'm about to change the way regbufs work to support larger inline
      thresholds. Now is a good time to replace this indirect mechanism
      with something that is more straightforward. I guess this should be
      considered a clean up.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • SUNRPC: Separate buffer pointers for RPC Call and Reply messages · 68778945
      Chuck Lever authored
      For xprtrdma, the RPC Call and Reply buffers are involved in real
      I/O operations.
      
      To start with, the DMA direction of the I/O for a Call is opposite
      that of a Reply.
      
      In the current arrangement, the Reply buffer address is on a
      four-byte alignment just past the call buffer. It would be friendlier
      on some platforms if that were at a DMA cache alignment instead.
      
      Because the current arrangement allocates a single memory region
      which contains both buffers, the RPC Reply buffer often contains a
      page boundary in it when the Call buffer is large enough (which is
      frequent).
      
      It would be a little nicer for setting up DMA operations (and
      possible registration of the Reply buffer) if the two buffers were
      separated, well-aligned, and contained as few page boundaries as
      possible.
      
      Now, I could just pad out the single memory region used for the pair
      of buffers. But frequently that would mean a lot of unused space to
      ensure the Reply buffer did not have a page boundary.
      
      Add a separate pointer to rpc_rqst that points right to the RPC
      Reply buffer. This makes no difference to xprtsock, but it will help
      xprtrdma in subsequent patches.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • SUNRPC: Generalize the RPC buffer release API · 3435c74a
      Chuck Lever authored
      xprtrdma needs to allocate the Call and Reply buffers separately.
      TBH, the reliance on using a single buffer for the pair of XDR
      buffers is transport implementation-specific.
      
      Instead of passing just the rq_buffer into the buf_free method, pass
      the task structure and let buf_free take care of freeing both
      XDR buffers at once.
      
      There's a micro-optimization here. In the common case, both
      xprt_release and the transport's buf_free method were checking if
      rq_buffer was NULL. Now the check is done only once per RPC.

      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>