1. 02 Feb, 2015 11 commits
    • nfsd: implement pNFS layout recalls · c5c707f9
      Christoph Hellwig authored
      Add support to issue layout recalls to clients.  For now we only support
      full-file recalls to get a simple and stable implementation.  This allows
      us to embed an nfsd4_callback structure in the layout_state and thus avoid
      any memory allocations under spinlocks during a recall.  For normal
      use cases that do not intend to share a single file between multiple
      clients this implementation is fully sufficient.
      
      To ensure layouts are recalled on local filesystem access, each layout
      state registers a new FL_LAYOUT lease with the kernel file locking code.
      Filesystems that support pNFS exports requiring recalls must break this
      lease on conflicting access patterns.
      
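      To make that concrete, here is a sketch of a layout state with the
      recall callback embedded (field names are illustrative assumptions,
      not a verbatim copy of the committed structure):

        /* Sketch: the recall callback lives inside the layout state, so
         * issuing a recall never allocates memory under a spinlock. */
        struct nfs4_layout_stateid {
        	struct nfs4_stid	ls_stid;	/* generic stateid base */
        	struct list_head	ls_layouts;	/* layouts under this stateid */
        	struct nfsd4_callback	ls_recall;	/* preallocated CB_LAYOUTRECALL */
        	bool			ls_recalled;	/* recall already issued? */
        };
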
      The XDR code is based on the old pNFS server implementation by
      Andy Adamson, Benny Halevy, Boaz Harrosh, Dean Hildebrand, Fred Isaman,
      Marc Eshel, Mike Sager and Ricardo Labiaga.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nfsd: implement pNFS operations · 9cf514cc
      Christoph Hellwig authored
      Add support for the GETDEVICEINFO, LAYOUTGET, LAYOUTCOMMIT and
      LAYOUTRETURN NFSv4.1 operations, as well as backing code to manage
      outstanding layouts and devices.
      
      Layout management is very straightforward, with an nfs4_layout_stateid
      structure that extends nfs4_stid to manage layout stateids as the
      top-level structure.  It is linked into the nfs4_file and nfs4_client
      structures like the other stateids, and contains a linked list of
      layouts that hang off the stateid.  The actual layout operations are
      implemented in layout drivers that are not part of this commit, but
      will be added later.
      
      The worst part of this commit is the management of the pNFS device IDs,
      which suffers from a specification that is not sanely implementable:
      the device IDs are global and not bound to an export, are too small to
      store the fsid portion of a file handle, and must never be reused.  As
      we still need to perform all export authentication and validation
      checks on a device ID passed to GETDEVICEINFO, we are caught between a
      rock and a hard place.  To work around this issue we add a new hash
      that maps from a 64-bit integer to an fsid so that we can look up the
      export to authenticate against it, a 32-bit integer as a generation
      that we can bump when changing the device, and a currently unused
      32-bit integer that could be used in the future to handle more than a
      single device per export.  Entries in this hash table are never
      deleted: we can't reuse the IDs anyway, and they would otherwise have
      a severe lifetime problem because Linux export structures are
      temporary and can go away under load.
      
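      A sketch of a device ID laid out along those lines (a plausible
      reading of the description above, not necessarily the exact
      committed layout):

        /* 128-bit pNFS device ID: a key into an append-only hash of
         * fsids, a generation to invalidate stale IDs, and a spare
         * field for multiple devices per export later on. */
        struct nfsd4_deviceid {
        	u64	fsid_idx;	/* hash key resolving to an fsid */
        	u32	generation;	/* bumped when the device changes */
        	u32	pad;		/* reserved: per-export device index */
        };
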
      Parts of the XDR data, structures and marshaling/unmarshaling code, as
      well as many concepts are derived from the old pNFS server implementation
      from Andy Adamson, Benny Halevy, Dean Hildebrand, Marc Eshel, Fred Isaman,
      Mike Sager, Ricardo Labiaga and many others.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nfsd: add fh_fsid_match helper · 9558f250
      Christoph Hellwig authored
      Add a helper to check that the fsid parts of two file handles match.
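
      A minimal sketch of such a helper (assuming the knfsd_fh layout and
      the key_len() helper already present in nfsfh.h):

        static inline bool fh_fsid_match(struct knfsd_fh *fh1,
        				 struct knfsd_fh *fh2)
        {
        	if (fh1->fh_fsid_type != fh2->fh_fsid_type)
        		return false;
        	if (memcmp(fh1->fh_fsid, fh2->fh_fsid,
        		   key_len(fh1->fh_fsid_type)) != 0)
        		return false;
        	return true;
        }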
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nfsd: move nfsd_fh_match to nfsfh.h · 4d94c2ef
      Christoph Hellwig authored
      The pnfs code will need it too.  Also remove the nfsd_ prefix to match the
      other filehandle helpers in that file.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • fs: add FL_LAYOUT lease type · 11afe9f7
      Christoph Hellwig authored
      This (ab-)uses the file locking code to allow filesystems to recall
      outstanding pNFS layouts on a file.  This new lease type is similar but
      not quite the same as FL_DELEG.  A FL_LAYOUT lease can always be
      granted, and a per-filesystem lock (the XFS iolock for the initial
      implementation) ensures no FL_LAYOUT leases are granted when we would
      need to recall them.
      
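      A sketch of the intended use from a filesystem's side (helper and
      call names here are assumptions based on the description):

        /* Before local I/O that would conflict with outstanding pNFS
         * layouts, recall them via the lease machinery and wait. */
        static int example_begin_local_write(struct inode *inode)
        {
        	/* FL_LAYOUT selects layout recalls, not FL_LEASE breaks */
        	return __break_lease(inode, O_WRONLY, FL_LAYOUT);
        }
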
      Also included are changes that allow multiple outstanding read
      leases of different types on the same file as long as they have
      different owners.  This wasn't a problem until now as nfsd never set
      FL_LEASE leases, and no one else used FL_DELEG leases, but given that
      nfsd will also issue FL_LAYOUT leases we will have to handle it now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • fs: track fl_owner for leases · 2ab99ee1
      Christoph Hellwig authored
      Just like for other lock types we should allow different owners to have
      a read lease on a file.  Currently this can't happen, but with the addition
      of pNFS layout leases we'll need this feature.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nfs: add LAYOUT_TYPE_MAX enum value · 6cae0a46
      Christoph Hellwig authored
      This gives us a nice upper bound for later use in nfsd.
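
      The resulting enum looks roughly like this (existing values as in
      include/linux/nfs4.h at the time):

        enum pnfs_layouttype {
        	LAYOUT_NFSV4_1_FILES  = 1,
        	LAYOUT_OSD2_OBJECTS   = 2,
        	LAYOUT_BLOCK_VOLUME   = 3,
        	LAYOUT_TYPE_MAX
        };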
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • Merge branch 'locks-3.20' of git://git.samba.org/jlayton/linux into for-3.20 · a584143b
      J. Bruce Fields authored
      Christoph's block pnfs patches have some minor dependencies on these
      lock patches.
  2. 23 Jan, 2015 3 commits
  3. 22 Jan, 2015 1 commit
  4. 16 Jan, 2015 17 commits
  5. 15 Jan, 2015 8 commits
    • svcrdma: Handle additional inline content · a97c331f
      Chuck Lever authored
      Most NFS RPCs place their large payload argument at the end of the
      RPC header (e.g., NFSv3 WRITE). For NFSv3 WRITE and SYMLINK, RPC/RDMA
      sends the complete RPC header inline, and the payload argument in
      the read list. Data in the read list is the last part of the XDR
      stream.
      
      One important case is not like this, however. NFSv4 COMPOUND is a
      counted array of operations. A WRITE operation, with its large data
      payload, can appear in the middle of the compound's operations
      array. Thus NFSv4 WRITE compounds can have header content after the
      WRITE payload.
      
      The Linux client, for example, performs an NFSv4 WRITE like this:
      
        { PUTFH, WRITE, GETATTR }
      
      Though RFC 5667 is not precise about this, the proper way to convey
      this compound is to place the GETATTR inline, _after_ the front of
      the RPC header. The receiver inserts the read list payload into the
      XDR stream after the initial WRITE arguments, and before the GETATTR
      operation, thanks to the value of the read list "position" field.
      
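      Schematically, for the compound shown above (an illustration of the
      XDR reassembly, not a wire-exact dump):

        inline (RDMA SEND): [ RPC hdr | PUTFH | WRITE args | GETATTR ]
        read list:          [ WRITE payload @ position P ]

        reassembled stream: [ RPC hdr | PUTFH | WRITE args
                              | WRITE payload | GETATTR ]
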
      The Linux client currently sends the GETATTR at the end of the
      RPC/RDMA read list, which is incorrect. It will be corrected in the
      future.
      
      The Linux server currently rejects NFSv4 compounds with inline
      content after the read list. For the above NFSv4 WRITE compound, the
      NFS compound header indicates there are three operations, but the
      server finds nonsense when it looks in the XDR stream for the third
      operation, and the compound fails with OP_ILLEGAL.
      
      Move trailing inline content to the end of the XDR buffer's page
      list. This presents incoming NFSv4 WRITE compounds to NFSD in the
      same way the socket transport does.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Move read list XDR round-up logic · fcbeced5
      Chuck Lever authored
      This is a prerequisite for a subsequent patch.
      
      Read list XDR round-up needs to be done _before_ additional inline
      content is copied to the end of the XDR buffer's page list. Move
      the logic added by commit e560e3b5 ("svcrdma: Add zero padding
      if the client doesn't send it").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Support RDMA_NOMSG requests · 0b056c22
      Chuck Lever authored
      Currently the Linux server cannot decode RDMA_NOMSG type requests.
      Operations whose length exceeds the fixed size of RDMA SEND buffers,
      like large NFSv4 CREATE(NF4LNK) operations, must be conveyed via
      RDMA_NOMSG.
      
      For an RDMA_MSG type request, the client sends the RPC/RDMA header,
      RPC header, and some or all of the NFS arguments via RDMA SEND.
      
      For an RDMA_NOMSG type request, the client sends just the RPC/RDMA
      header via RDMA SEND. The request's read list contains elements for
      the entire RPC message, including the RPC header.
      
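      Schematically (illustration only):

        RDMA_MSG:    SEND: [ RPC/RDMA hdr | RPC hdr | some/all args ]
                     read list: [ large payload ]

        RDMA_NOMSG:  SEND: [ RPC/RDMA hdr ]
                     read list: [ RPC hdr | args | payload ]
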
      NFSD expects the RPC/RDMA header and RPC header to be contiguous in
      page zero of the XDR buffer. Add logic in the RDMA READ path to make
      the read list contents land where the server prefers, when the
      incoming message is an RDMA_NOMSG type message.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: rc_position sanity checking · 61edbcb7
      Chuck Lever authored
      An RPC/RDMA client may send large RPC arguments via a read
      list. This is a list of scatter/gather elements which convey
      RPC call arguments too large to fit in a small RDMA SEND.
      
      Each entry in the read list has a "position" field, whose value is
      the byte offset in the XDR stream where the data in that entry is to
      be inserted. Entries which share the same "position" value make up
      the same RPC argument. The receiver inserts entries with the same
      position field value in list order into the XDR stream.
      
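      For reference, a read list entry carries the position next to the
      segment it describes (along the lines of the RPC/RDMA XDR
      definitions):

        struct rpcrdma_read_chunk {
        	__be32			rc_discrim;	/* list discriminator */
        	__be32			rc_position;	/* XDR stream offset */
        	struct rpcrdma_segment	rc_target;	/* handle/length/offset */
        };
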
      Currently the Linux NFS/RDMA server cannot handle receiving read
      chunks in more than one position, mostly because no current client
      sends read lists with elements in more than one position. As a
      sanity check, ensure that all received chunks have the same
      "rc_position."
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Plant reader function in struct svcxprt_rdma · e5452411
      Chuck Lever authored
      The RDMA reader function doesn't change once an svcxprt_rdma is
      instantiated. Instead of checking sc_devcap during every incoming
      RPC, set the reader function once when the connection is accepted.
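
      A sketch of the idea (names taken from the description; treat them
      as assumptions rather than exact code):

        /* At accept time, choose the read path once from device caps
         * instead of re-checking sc_devcap on every incoming RPC. */
        newxprt->sc_reader = rdma_read_chunk_lcl;
        if (devattr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS)
        	newxprt->sc_reader = rdma_read_chunk_frmr;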
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Find rmsgp more reliably · e5523bd2
      Chuck Lever authored
      xdr_start() can return the wrong rmsgp address if an assumption
      about how the xdr_buf was constructed changes.  When it gets it
      wrong, the client receives a reply that has gibberish in the
      RPC/RDMA header, preventing it from matching a waiting RPC request.
      
      Instead, make (and document) just one assumption: that the RDMA
      header for the client's RPC call is at the start of the first page
      in rq_pages.
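
      With that single assumption, locating the header becomes trivial
      (a sketch):

        /* The client's RPC/RDMA header starts the first receive page. */
        struct rpcrdma_msg *rmsgp =
        	(struct rpcrdma_msg *)page_address(rqstp->rq_pages[0]);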
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Scrub BUG_ON() and WARN_ON() call sites · 3fe04ee9
      Chuck Lever authored
      Current convention is to avoid using BUG_ON() in places where an
      oops could cause complete system failure.
      
      Replace BUG_ON() call sites in svcrdma with an assertion error
      message and allow execution to continue safely.
      
      Some BUG_ON() calls are removed because they have never fired in
      production (that we are aware of).
      
      Some WARN_ON() calls are also replaced where a back trace is not
      helpful; e.g., in a workqueue task.
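
      The typical transformation looks like this (a representative
      sketch, not a specific hunk from the patch):

        /* Before: a failed sanity check oopses the whole machine. */
        BUG_ON(!ctxt);

        /* After: report the broken assumption and press on safely. */
        if (!ctxt) {
        	pr_err("svcrdma: no context available\n");
        	return;
        }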
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Clean up read chunk counting · 2397aa8b
      Chuck Lever authored
      The byte_count argument is not used, and the function is called
      only from one place.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>