1. 14 Feb, 2019 6 commits
  2. 13 Feb, 2019 4 commits
  3. 11 Feb, 2019 14 commits
  4. 09 Feb, 2019 5 commits
    • Merge branch 'wip/dl-for-next' into for-next · 82771f20
      Doug Ledford authored
      Due to concurrent work by myself and Jason, a normal fast forward merge
      was not possible.  This brings in a number of hfi1 changes, mainly the
      hfi1 TID RDMA support (roughly 10,000 LOC change), which was reviewed
      and integrated over a period of days.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • Merge branch 'hfi1-tid' into wip/dl-for-next · 416fbc1b
      Doug Ledford authored
      Omni-Path TID RDMA Feature
      
      Intel Omni-Path (OPA) TID RDMA support is a feature that accelerates
      data movement between two OPA nodes through the IB Verbs interface. It
      improves RDMA READ/WRITE performance by delivering the data payload to a
      user buffer directly without any software copying.
      
      Architecture
      =============
      The TID RDMA protocol is implemented at the hfi1 driver level and is
      therefore transparent to the ULPs. It is designed to facilitate the data
      transactions for two specific RDMA requests:
        - RDMA READ;
        - RDMA WRITE.
      Previously, when a verbs data packet was received at the destination
      (requester side for RDMA READ and responder side for RDMA WRITE), the
      data payload was copied to the user buffer by software, which slowed
      performance significantly for large requests.
      
      Internally, hfi1 converts qualified RDMA READ/WRITE requests into TID
      RDMA READ/WRITE requests when the requests are posted to the hfi1
      driver. Non-qualified RDMA requests are handled by the normal RDMA
      protocol.
      
      For TID RDMA requests, hardware resources (hardware flow and TID entries)
      are allocated on the destination side (the requester side for TID RDMA
      READ and the responder side for TID RDMA WRITE). The information for
      these resources is conveyed to the data source side (the responder side
      for TID RDMA READ and the requester side for TID RDMA WRITE) and embedded
      in data packets. When data packets are received by the destination,
      hardware will deliver the data payload to the destination buffer without
      involving software and therefore improve the performance.
      
      Details
      =======
      RDMA READ/WRITE requests are qualified by the following (a rough
      check is sketched in C after these lists):
        - Total data length >= 256k;
        - Total data length is a multiple of 4K pages.
      
      Additional qualifications are enforced for the destination buffers:
        For RDMA READ:
          - Each destination sge buffer is 4K aligned;
          - Each destination sge buffer is a multiple of 4K pages in length.
        For RDMA WRITE:
          - The destination number is 4K aligned.
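
      As an illustration only (the helper below is made up, not the actual
      hfi1 code, and it assumes 4K pages), the qualification rules above
      amount to a check along these lines:

        #include <linux/sizes.h>        /* SZ_4K */
        #include <rdma/ib_verbs.h>      /* struct ib_sge */

        /* Illustrative sketch of the qualification rules; not the real
         * hfi1 helper.
         */
        static bool tid_rdma_qualifies(u64 total_len, const struct ib_sge *sge,
                                       int num_sge, bool is_read)
        {
                int i;

                if (total_len < 256 * 1024 || total_len % SZ_4K)
                        return false;

                if (is_read) {
                        for (i = 0; i < num_sge; i++) {
                                if (sge[i].addr & (SZ_4K - 1))
                                        return false;   /* not 4K aligned */
                                if (sge[i].length % SZ_4K)
                                        return false;   /* not a 4K multiple */
                        }
                }
                return true;
        }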
      
      In addition, in an OPA fabric, some nodes may support TID RDMA while
      others may not. As such, it is important for two transacting nodes to
      exchange information about the features they support. This discovery
      mechanism is called OPA Feature Negotiation (OPFN) and is described in
      detail in the patch series. Through OPFN, two nodes can discover whether
      they both support TID RDMA and subsequently convert RDMA requests into
      TID RDMA requests.
      
      * hfi1-tid: (46 commits)
        IB/hfi1: Prioritize the sending of ACK packets
        IB/hfi1: Add static trace for TID RDMA WRITE protocol
        IB/hfi1: Enable TID RDMA WRITE protocol
        IB/hfi1: Add interlock between TID RDMA WRITE and other requests
        IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs
        IB/hfi1: Add the dual leg code
        IB/hfi1: Add the TID second leg ACK packet builder
        IB/hfi1: Add the TID second leg send packet builder
        IB/hfi1: Resend the TID RDMA WRITE DATA packets
        IB/hfi1: Add a function to receive TID RDMA RESYNC packet
        IB/hfi1: Add a function to build TID RDMA RESYNC packet
        IB/hfi1: Add TID RDMA retry timer
        IB/hfi1: Add a function to receive TID RDMA ACK packet
        IB/hfi1: Add a function to build TID RDMA ACK packet
        IB/hfi1: Add a function to receive TID RDMA WRITE DATA packet
        IB/hfi1: Add a function to build TID RDMA WRITE DATA packet
        IB/hfi1: Add a function to receive TID RDMA WRITE response
        IB/hfi1: Add TID resource timer
        IB/hfi1: Add a function to build TID RDMA WRITE response
        IB/hfi1: Add functions to receive TID RDMA WRITE request
        ...
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • iw_cxgb4: fix srqidx leak during connection abort · f368ff18
      Raju Rangoju authored
      When an application aborts a connection by moving the QP from RTS to
      ERROR, iw_cxgb4's modify_rc_qp() RTS->ERROR logic sets *srqidxp to 0
      via t4_set_wq_in_error(&qhp->wq, 0) and aborts the connection by
      calling c4iw_ep_disconnect().
      
      c4iw_ep_disconnect() does the following:
       1. sends up a close_complete_upcall(ep, -ECONNRESET) to libcxgb4.
       2. sends abort request CPL to hw.
      
      But since close_complete_upcall() is issued before the ABORT_REQ is
      sent to hw, libcxgb4 would fail to release the srqidx if the
      connection holds one: the srqidx is passed up to libcxgb4 only after
      the corresponding ABORT_RPL is processed by the kernel in abort_rpl().

      This patch handles the corner case by moving the call to
      close_complete_upcall() from c4iw_ep_disconnect() to abort_rpl(), so
      that libcxgb4 is notified about the -ECONNRESET only after abort_rpl()
      has run, and can then relinquish the srqidx properly.
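
      In heavily simplified form (names below only approximate the cxgb4
      code; send_abort_req() and recover_srqidx() are hypothetical wrappers),
      the ordering change looks like:

        #include <linux/errno.h>

        struct c4iw_ep;                                 /* opaque here */
        void close_complete_upcall(struct c4iw_ep *ep, int status);
        void send_abort_req(struct c4iw_ep *ep);        /* hypothetical */
        int recover_srqidx(struct c4iw_ep *ep);         /* hypothetical */

        /* Before: libcxgb4 learned about the reset before the hw abort
         * completed, so a cached srqidx could be leaked.
         */
        void disconnect_before(struct c4iw_ep *ep)
        {
                close_complete_upcall(ep, -ECONNRESET);
                send_abort_req(ep);                     /* ABORT_REQ to hw */
        }

        /* After: the upcall is deferred until the ABORT_RPL is processed,
         * so the srqidx reported by hw is dealt with first.
         */
        void abort_rpl_after(struct c4iw_ep *ep)
        {
                recover_srqidx(ep);
                close_complete_upcall(ep, -ECONNRESET);
        }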
      Signed-off-by: Raju Rangoju <rajur@chelsio.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • iw_cxgb4: complete the cached SRQ buffers · 11a27e21
      Raju Rangoju authored
      If TP fetches an SRQ buffer but ends up not using it before the connection
      is aborted, then it passes the index of that SRQ buffer to the host in
      ABORT_REQ_RSS or ABORT_RPL CPL message.
      
      But, if the srqidx field is zero in the received ABORT_RPL or
      ABORT_REQ_RSS CPL, then we need to read the tcb.rq_start field to see if
      it really did have an RQE cached. This works around a case where HW does
      not include the srqidx in the ABORT_RPL/ABORT_REQ_RSS CPL.
      
      The final value of rq_start is the one present in the TCB with the
      TF_RX_PDU_OUT bit cleared. So we need to read the TCB and examine
      TF_RX_PDU_OUT (bit 49 of t_flags) in order to determine whether an rx
      PDU feedback event is pending.
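
      For illustration (the actual TCB read goes through a CPL exchange that
      is omitted here), the t_flags test boils down to:

        #include <linux/types.h>

        /* Bit 49 of t_flags, per the description above (illustrative name). */
        #define TF_RX_PDU_OUT_SHIFT     49

        /* True if an rx PDU feedback event is still pending, i.e. the
         * rq_start value in this TCB snapshot is not yet the final one.
         */
        static bool rx_pdu_out_pending(u64 t_flags)
        {
                return t_flags & (1ULL << TF_RX_PDU_OUT_SHIFT);
        }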
      Signed-off-by: Raju Rangoju <rajur@chelsio.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • cxgb4: add tcb flags and tcb rpl struct · e381a1cb
      Raju Rangoju authored
      This patch adds the tcb flags and structures needed for querying tcb
      information.
      Signed-off-by: Raju Rangoju <rajur@chelsio.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  5. 08 Feb, 2019 11 commits
    • RDMA/devices: Re-organize device.c locking · 921eab11
      Jason Gunthorpe authored
      The locking here started out with a single lock that covered everything
      and then has lately veered into crazy town.
      
      The fundamental problem is that several places need to iterate over a
      linked list, but also need to drop their locks to avoid deadlock during
      client callbacks.
      
      xarray's restartable iteration offers a simple solution to the
      problem. Once all the lists are xarrays we can drop locks in the places
      that need that and rely on xarray to provide consistency and locking for
      the data structure.
      
      The resulting simplification is that each of the three lists has a
      dedicated rwsem that must be held when working with the list it
      covers. One data structure is no longer covered by multiple locks.
      
      The sleeping semaphore is selected because the read side generally needs
      to be held over something sleeping, and using RCU reader locking in those
      cases is overkill.
      
      In the process this simplifies the entire registration/unregistration flow
      to be the expected list of setups and the reversed list of matching
      teardowns, and the registration lock 'refcount' can now be revised to be
      released after the ULPs are removed, providing a very sane semantic for
      this feature.
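
      A minimal sketch of the resulting pattern (not the exact device.c
      code; the real loops also filter on registration state):

        #include <linux/rwsem.h>
        #include <linux/xarray.h>
        #include <rdma/ib_verbs.h>

        static DECLARE_RWSEM(clients_rwsem);    /* one dedicated rwsem per list */
        static DEFINE_XARRAY(clients);          /* struct ib_client, by client id */

        /* Walk the clients, dropping the rwsem around each (sleeping)
         * callback; xa_for_each() is restartable, so the walk resumes
         * safely after the lock is re-taken.
         */
        static void sketch_notify_clients(struct ib_device *device)
        {
                struct ib_client *client;
                unsigned long index;

                down_read(&clients_rwsem);
                xa_for_each (&clients, index, client) {
                        up_read(&clients_rwsem);
                        client->add(device);
                        down_read(&clients_rwsem);
                }
                up_read(&clients_rwsem);
        }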
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/devices: Use xarray to store the client_data · 0df91bb6
      Jason Gunthorpe authored
      Now that we have a small ID for each client we can use xarray instead of
      linearly searching linked lists for client data. This gives much
      faster and more scalable client data lookup, and lets us revise the
      locking scheme.
      
      Since xarray can store 'going_down' using a mark, just entirely eliminate
      the struct ib_client_data and directly store the client_data value in the
      xarray. However this does require a special iterator as we must still
      iterate over any NULL client_data values.
      
      Also eliminate the client_data_lock in favour of internal xarray locking.
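
      A minimal sketch of the idea (mark name and helpers are illustrative;
      the in-tree code also needs its own iterator because a stored
      client_data value may legitimately be NULL):

        #include <linux/xarray.h>

        #define SKETCH_GOING_DOWN XA_MARK_0     /* stands in for 'going_down' */

        static DEFINE_XARRAY(client_data);      /* indexed by client_id */

        static int sketch_set_client_data(u32 client_id, void *data)
        {
                return xa_err(xa_store(&client_data, client_id, data,
                                       GFP_KERNEL));
        }

        static void sketch_mark_going_down(u32 client_id)
        {
                xa_set_mark(&client_data, client_id, SKETCH_GOING_DOWN);
        }

        static void sketch_sweep(void)
        {
                unsigned long index;
                void *data;

                /* Visit only the entries flagged as going down. */
                xa_for_each_marked (&client_data, index, data, SKETCH_GOING_DOWN)
                        pr_debug("client %lu going down (%p)\n", index, data);
        }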
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/devices: Use xarray to store the clients · e59178d8
      Jason Gunthorpe authored
      This gives each client a unique ID and will let us move client_data to use
      xarray, and revise the locking scheme.
      
      Clients have to be added/removed in strict FIFO/LIFO order as they
      interdepend. To support this, the client_ids are assigned to increase in
      FIFO order. The existing linked list is kept to support reverse iteration
      until xarray can get a reverse iteration API.
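
      Roughly (the exact range limits that keep the IDs increasing are
      glossed over here), registration can hand out the next small ID with
      xa_alloc():

        #include <linux/xarray.h>
        #include <rdma/ib_verbs.h>

        static DEFINE_XARRAY_ALLOC(sketch_clients);

        static int sketch_assign_client_id(struct ib_client *client)
        {
                /* Stores the client at a small unused index and returns that
                 * index in client->client_id for later client_data lookups.
                 */
                return xa_alloc(&sketch_clients, &client->client_id, client,
                                xa_limit_32b, GFP_KERNEL);
        }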
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
    • RDMA/device: Use an ida instead of a free page in alloc_name · 3b88afd3
      Jason Gunthorpe authored
      ida is the proper data structure for holding a list of clustered small
      integers and then allocating an unused one. Get rid of the convoluted
      and limited open-coded bitmap.
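
      The pattern looks roughly like this (a sketch, not the exact
      alloc_name()):

        #include <linux/idr.h>

        /* Mark every index already used by an existing "ib%d"-style name,
         * then ask the ida for the lowest index that is still free.
         */
        static int sketch_pick_index(const char *pattern, const char *names[],
                                     int count)
        {
                struct ida inuse;
                int i, val, rc;

                ida_init(&inuse);
                for (i = 0; i < count; i++) {
                        if (sscanf(names[i], pattern, &val) != 1)
                                continue;
                        rc = ida_alloc_range(&inuse, val, val, GFP_KERNEL);
                        if (rc < 0 && rc != -ENOSPC)    /* -ENOSPC: duplicate */
                                goto out;
                }
                rc = ida_alloc(&inuse, GFP_KERNEL);     /* lowest unused index */
        out:
                ida_destroy(&inuse);
                return rc;
        }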
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/device: Get rid of reg_state · 652432f3
      Jason Gunthorpe authored
      This really has no purpose anymore; the refcount can be used to tell
      if the device is still registered. Keeping it around just invites
      misuse.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
    • RDMA/device: Call ib_cache_release_one() only from ib_device_release() · d45f89d5
      Jason Gunthorpe authored
      Instead of complicated logic about when this memory is freed, always free
      it during device release(). All the cache pointers start out as NULL, so
      it is safe to call this before the cache is initialized.
      
      This makes for a simpler error unwind flow, and a simpler understanding of
      the lifetime of the memory allocations inside the struct ib_device.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/device: Ensure that security memory is always freed · b34b269a
      Jason Gunthorpe authored
      Since this only frees memory it should be done during the release
      callback. Otherwise there are possible error flows where it might not get
      called if registration aborts.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/device: Check that the rename is nop under the lock · e3593b56
      Jason Gunthorpe authored
      Since another rename could be running in parallel, it is safer to
      check that the name is actually changing while holding the lock, where
      we know the device name cannot change underneath us.
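
      In outline (the lock below stands in for the real registration lock,
      and sketch_do_rename() is a hypothetical helper), the no-op check
      simply moves inside the locked region:

        #include <linux/rwsem.h>
        #include <linux/string.h>
        #include <rdma/ib_verbs.h>

        static DECLARE_RWSEM(sketch_devices_rwsem);
        static int sketch_do_rename(struct ib_device *ibdev, const char *name);

        static int sketch_rename(struct ib_device *ibdev, const char *new_name)
        {
                int ret = 0;

                down_write(&sketch_devices_rwsem);
                /* Checked under the lock: a parallel rename can no longer
                 * change ibdev->name between this check and the rename.
                 */
                if (strcmp(ibdev->name, new_name) != 0)
                        ret = sketch_do_rename(ibdev, new_name);
                up_write(&sketch_devices_rwsem);
                return ret;
        }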
      
      Fixes: d21943dd ("RDMA/core: Implement IB device rename function")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
    • RDMA: Handle PD allocations by IB/core · 21a428a0
      Leon Romanovsky authored
      Moving the PD allocations into IB/core allows us to simplify drivers
      and their error flows in their .alloc_pd() paths. The changes in
      .alloc_pd() go hand in hand with the relevant updates in .dealloc_pd().

      We use this opportunity to convert .dealloc_pd() so that it cannot
      fail, as was suggested a long time ago; such failures do not happen in
      practice, as we have never seen the WARN_ON print.
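
      Very roughly, a driver ends up with the following shape (parameter
      lists are approximations rather than the exact in-tree prototypes, and
      the demo_* names are made up):

        #include <rdma/ib_verbs.h>

        struct demo_pd {
                struct ib_pd ibpd;      /* embedded core object, placed first */
                u32 pdn;
        };

        int demo_hw_alloc_pdn(struct ib_device *dev, u32 *pdn);  /* hypothetical */
        void demo_hw_free_pdn(struct ib_device *dev, u32 pdn);   /* hypothetical */

        /* The core allocates the demo_pd; the driver only fills in its own
         * state and reports success or failure.
         */
        static int demo_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
        {
                struct demo_pd *pd = container_of(ibpd, struct demo_pd, ibpd);

                return demo_hw_alloc_pdn(ibpd->device, &pd->pdn);
        }

        /* Deallocation no longer fails; it only returns resources. */
        static void demo_dealloc_pd(struct ib_pd *ibpd)
        {
                struct demo_pd *pd = container_of(ibpd, struct demo_pd, ibpd);

                demo_hw_free_pdn(ibpd->device, pd->pdn);
        }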
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • RDMA/core: Share driver structure size with core · 30471d4b
      Leon Romanovsky authored
      Add new macros to be used by drivers when registering the ops
      structure and by IB/core when calling the allocation routines, so
      drivers won't need to perform kzalloc/kfree in their paths.

      The change in the allocation stage allows us to initialize common
      fields prior to calling into the drivers (e.g. restrack).
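
      Combined with the PD sketch above, a driver then advertises the size
      of its wrapper object in its ops table so the core can do the
      allocation (a sketch; see INIT_RDMA_OBJ_SIZE in rdma/ib_verbs.h for
      the exact form):

        static const struct ib_device_ops demo_dev_ops = {
                .alloc_pd = demo_alloc_pd,
                .dealloc_pd = demo_dealloc_pd,
                /* Tells the core how large the driver's PD wrapper is. */
                INIT_RDMA_OBJ_SIZE(ib_pd, demo_pd, ibpd),
        };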
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • IB/core: Don't register each MAD agent for LSM notifier · c66f6741
      Daniel Jurgens authored
      When creating many MAD agents in a short period of time, receive packet
      processing can be delayed long enough to cause timeouts while new agents
      are being added to the atomic notifier chain with IRQs disabled. Notifier
      chain registration and unregistration is an O(n) operation. With large
      numbers of MAD agents being created and destroyed simultaneously the CPUs
      spend too much time with interrupts disabled.
      
      Instead of each MAD agent registering for its own LSM notification,
      maintain a list of agents internally and register once; this
      registration already existed for handling the PKeys. This list is
      write-mostly, so a normal spin lock is used vs a read/write lock.
      All MAD agents must be
      checked, so a single list is used instead of breaking them down per
      device.
      
      Notifier calls are done under rcu_read_lock, so there isn't a risk of
      similar packet timeouts while checking the MAD agents security settings
      when notified.
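
      Schematically (all names here are made up; the real list and lock live
      in core/security.c):

        #include <linux/list.h>
        #include <linux/notifier.h>
        #include <linux/spinlock.h>
        #include <rdma/ib_mad.h>

        struct sketch_agent_sec {               /* hypothetical per-agent node */
                struct list_head list;
                struct ib_mad_agent *agent;
        };

        static LIST_HEAD(sketch_agent_list);
        static DEFINE_SPINLOCK(sketch_agent_list_lock); /* write-mostly */
        static void sketch_recheck_security(struct ib_mad_agent *agent);

        /* One notifier callback for the whole subsystem; it walks the
         * internal agent list instead of having every agent sit on the
         * notifier chain itself.
         */
        static int sketch_lsm_event(struct notifier_block *nb,
                                    unsigned long event, void *data)
        {
                struct sketch_agent_sec *e;

                spin_lock(&sketch_agent_list_lock);
                list_for_each_entry(e, &sketch_agent_list, list)
                        sketch_recheck_security(e->agent);
                spin_unlock(&sketch_agent_list_lock);
                return NOTIFY_OK;
        }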
      Signed-off-by: Daniel Jurgens <danielj@mellanox.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>