1. 15 Apr, 2015 40 commits
    • Doug Ledford's avatar
    • Honggang LI's avatar
      mlx5: wrong page mask if CONFIG_ARCH_DMA_ADDR_T_64BIT enabled for 32Bit architectures · 59d2d18c
      Honggang LI authored
      If CONFIG_ARCH_DMA_ADDR_T_64BIT enabled for x86 systems and physical
      memory is more than 4GB, dma_map_page may return a valid memory
      address which greater than 0xffffffff. As a result, the mlx5 device page
      allocator RB tree will be initialized with valid addresses greater than
      0xfffffff.
      
      However, (addr & PAGE_MASK) set the high four bytes to zeros. So, it's
      impossible for the function, free_4k, to release the pages whose
      addresses greater than 4GB. Memory leaks. And mlx5_ib module can't
      release the pages when user try to remove the module, as a result,
      system hang.
      
      [root@rdma05 root]# dmesg  | grep addr | head
      addr             = 3fe384000
      addr & PAGE_MASK =  fe384000
      [root@rdma05 root]# rmmod mlx5_ib   <---- hang on
      
      ---------------------- cosnole log -----------------
      mlx5_ib 0000:04:00.0: irq 138 for MSI/MSI-X
        alloc irq_desc for 139 on node -1
        alloc kstat_irqs on node -1
      mlx5_ib 0000:04:00.0: irq 139 for MSI/MSI-X
      0000:04:00.0:free_4k:221:(pid 1519): page not found
      0000:04:00.0:free_4k:221:(pid 1519): page not found
      0000:04:00.0:free_4k:221:(pid 1519): page not found
      0000:04:00.0:free_4k:221:(pid 1519): page not found
      ---------------------- cosnole log -----------------
      
      Fixes: bf0bf77f ('mlx5: Support communicating arbitrary host page size to firmware')
      Signed-off-by: default avatarHonggang Li <honli@redhat.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      59d2d18c
    • Sagi Grimberg's avatar
      IB/iser: Rewrite bounce buffer code path · ba943fb2
      Sagi Grimberg authored
      In some rare cases, IO operations may be not aligned to page
      boundaries. This prevents iser from performing fast memory
      registration. In order to overcome that iser uses a bounce
      buffer to carry the transaction. We basically allocate a buffer
      in the size of the transaction and perform a copy.
      
      The buffer allocation using kmalloc is too restrictive since it
      requires higher order (atomic) allocations for large transactions
      (which may result in memory exhaustion fairly fast for some workloads).
      We rewrite the bounce buffer code path to allocate scattered pages
      and perform a copy between the transaction sg and the bounce sg.
      Reported-by: default avatarAlex Lyakas <alex@zadarastorage.com>
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      ba943fb2
    • Sagi Grimberg's avatar
      IB/iser: Bump version to 1.6 · 4fcd1470
      Sagi Grimberg authored
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      4fcd1470
    • Sagi Grimberg's avatar
      IB/iser: Remove code duplication for a single DMA entry · ad1e5672
      Sagi Grimberg authored
      In singleton scatterlists, DMA memory registration code
      is taken both for Fastreg and FMR code paths. Move it to
      a function.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      ad1e5672
    • Sagi Grimberg's avatar
      IB/iser: Pass struct iser_mem_reg to iser_fast_reg_mr and iser_reg_sig_mr · 6ef8bb83
      Sagi Grimberg authored
      Instead of passing ib_sge as output variable, we pass the mem_reg
      pointer to have the routines fill the rkey as well. This reduces
      code duplication and extra assignments. This is a preparation step
      to unify some registration logics together. Also, pass iser_fast_reg_mr
      the fastreg descriptor directly.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      6ef8bb83
    • Sagi Grimberg's avatar
      IB/iser: Modify struct iser_mem_reg members · 90a6684c
      Sagi Grimberg authored
      No need to keep lkey, va, len variables, we can keep
      them as struct ib_sge. This will help when we change the
      memory registration logic.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      90a6684c
    • Sagi Grimberg's avatar
      IB/iser: Make fastreg pool cache friendly · 8b95aa2c
      Sagi Grimberg authored
      Memory regions are resources that are saved
      in the device caches. Increase the probability for
      a cache hit by adding the MRU descriptor to pool
      head.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      8b95aa2c
    • Sagi Grimberg's avatar
      IB/iser: Move PI context alloc/free to routines · 4dec2a27
      Sagi Grimberg authored
      Make iser_[create|destroy]_fastreg_desc shorter, more
      readable and easily extendable.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      4dec2a27
    • Sagi Grimberg's avatar
      IB/iser: Move fastreg descriptor pool get/put to helper functions · bd8b944e
      Sagi Grimberg authored
      Instead of open-coding connection fastreg pool get/put,
      we introduce iser_reg_desc[get|put] helpers.
      
      We aren't setting these static as this will be a per-device
      routine later on. Also, cleanup iser_unreg_rdma_mem_fastreg
      a bit.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      bd8b944e
    • Sagi Grimberg's avatar
      IB/iser: Merge build page-vec into register page-vec · f0e35c27
      Sagi Grimberg authored
      No need for these two separate. Keep it in a single routine
      like in the fastreg case. This will also make iser_reg_page_vec
      closer to iser_fast_reg_mr arguments. This is a preparation
      step for registration flow refactor.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      f0e35c27
    • Sagi Grimberg's avatar
      IB/iser: Get rid of struct iser_rdma_regd · b130eded
      Sagi Grimberg authored
      This struct members other than struct iser_mem_reg are unused,
      so remove it altogether.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      b130eded
    • Sagi Grimberg's avatar
      IB/iser: Remove redundant assignments in iser_reg_page_vec · 6847fdeb
      Sagi Grimberg authored
      Buffer length was assigned twice, and no reason to set va to
      io_addr and then add the offset, just set va to io_addr + offset.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      6847fdeb
    • Sagi Grimberg's avatar
      IB/iser: Move memory reg/dereg routines to iser_memory.c · d03e61d0
      Sagi Grimberg authored
      As memory registration/de-registration methods, lets
      move them to their natural location. While we're at it,
      make iser_reg_page_vec routine static.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      d03e61d0
    • Sagi Grimberg's avatar
      IB/iser: Don't pass ib_device to fall_to_bounce_buff routine · 56408325
      Sagi Grimberg authored
      No need to pass that, we can take it from the task.
      In a later stage, this function will be invoked
      according to a device capability.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      56408325
    • Sagi Grimberg's avatar
      IB/iser: Remove a redundant struct iser_data_buf · e3784bd1
      Sagi Grimberg authored
      No need to keep two iser_data_buf structures just in case we use
      mem copy. We can avoid that just by adding a pointer to the original
      sg. So keep only two iser_data_buf per command (data and protection)
      and pass the relevant data_buf to bounce buffer routine.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      e3784bd1
    • Sagi Grimberg's avatar
      IB/iser: Remove redundant cmd_data_len calculation · ecc3993a
      Sagi Grimberg authored
      This code was added before we had protection data length
      calculation (in iser_send_command), so we needed to calc
      the sg data length from the sg itself. This is not needed
      anymore.
      
      This patch does not change any functionality.
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarAdir Lev <adirl@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      ecc3993a
    • Sagi Grimberg's avatar
      IB/iser: Fix wrong calculation of protection buffer length · a065fe6a
      Sagi Grimberg authored
      This length miss-calculation may cause a silent data corruption
      in the DIX case and cause the device to reference unmapped area.
      
      Fixes: d77e6535 ('libiscsi, iser: Adjust data_length to include protection information')
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      a065fe6a
    • Sagi Grimberg's avatar
      IB/iser: Handle fastreg/local_inv completion errors · 30bf1d58
      Sagi Grimberg authored
      Fast registration and local invalidate work requests can
      also fail. We should call error completion handler for them.
      Reported-by: default avatarRoi Dayan <roid@mellanox.com>
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      30bf1d58
    • Sagi Grimberg's avatar
      IB/iser: Fix unload during ep_poll wrong dereference · c4de4663
      Sagi Grimberg authored
      In case the user unloaded ib_iser while ep_connect is in
      progress, we need to destroy the endpoint although ep_disconnect
      wasn't invoked (we detect this by the iser conn state != DOWN).
      However, if we got an REJECTED/UNREACHABLE CM event we move the
      connection state to DOWN which will prevent us from destroying
      the endpoint in the module unload stage. Fix this by setting the
      connection state to TERMINATING in iser_conn_error so we can still
      destroy the endpoint at unload stage.
      Reported-by: default avatarAriel Nahum <arieln@mellanox.com>
      Signed-off-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      c4de4663
    • Doug Ledford's avatar
      ib_srpt: convert printk's to pr_* functions · 9f5d32af
      Doug Ledford authored
      The driver already defined the pr_format, it just hadn't
      been converted to use pr_info, pr_warn, and pr_err instead
      of the equivalent printks.  Convert so that messages from
      the driver are now properly tagged with their driver name
      and can be more easily debugged.
      
      In addition, a number of these printk's were not newline
      terminated, so fix that at the same time.
      Reviewed-by: default avatarBart Van Assche <bvanassche@acm.org>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      9f5d32af
    • Bart Van Assche's avatar
      IB/srp: Use P_Key cache for P_Key lookups · 56b5390c
      Bart Van Assche authored
      This change slightly reduces the time needed to log in.
      Signed-off-by: default avatarBart Van Assche <bvanassche@acm.org>
      Reviewed-by: default avatarSagi Grimberg <sagig@mellanox.com>
      Reviewed-by: default avatarDavid Dillow <dave@thedillows.org>
      Cc: Sebastian Parschauer <sebastian.riemer@profitbricks.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      56b5390c
    • Sebastian Ott's avatar
      infiniband/mlx4: check for mapping error · cc47d369
      Sebastian Ott authored
      Since ib_dma_map_single can fail use ib_dma_mapping_error to check
      for errors.
      Signed-off-by: default avatarSebastian Ott <sebott@linux.vnet.ibm.com>
      Acked-by: default avatarJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      cc47d369
    • Selvin Xavier's avatar
      MAINTAINERS: Adding list of maintainers for ocrdma · d2928a8c
      Selvin Xavier authored
      Updating the MAINTAINERS file with ocrdma maintainers and their email ids
      Signed-off-by: default avatarSelvin Xavier <selvin.xavier@emulex.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      d2928a8c
    • Stephen Hemminger's avatar
      rdma: replace deprecated ifconfig in doc · b962dc0a
      Stephen Hemminger authored
      The ifconfig command has been deprecated for many years.
      To encourage new users not to continue using it and learning
      iproute2; the ifconfig should not be used in examples.
      Signed-off-by: default avatarStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      b962dc0a
    • Sébastien Dugué's avatar
      ib_uverbs: Fix pages leak when using XRC SRQs · a233c4b5
      Sébastien Dugué authored
      Hello,
      
        When an application using XRCs abruptly terminates, the mmaped pages
      of the CQ buffers are leaked.
      
        This comes from the fact that when resources are released in
      ib_uverbs_cleanup_ucontext(), we fail to release the CQs because their
      refcount is not 0.
      
        When creating an XRC SRQ, we increment the associated CQ refcount.
      This refcount is only decremented when the SRQ is released.
      
        Therefore we need to release the SRQs prior to the CQs to make sure
      that all references to the CQs are gone before trying to release these.
      Signed-off-by: default avatarSebastien Dugue <sebastien.dugue@bull.net>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      a233c4b5
    • Erez Shitrit's avatar
      IB/mlx4: Fix WQE LSO segment calculation · ca9b590c
      Erez Shitrit authored
      The current code decreases from the mss size (which is the gso_size
      from the kernel skb) the size of the packet headers.
      
      It shouldn't do that because the mss that comes from the stack
      (e.g IPoIB) includes only the tcp payload without the headers.
      
      The result is indication to the HW that each packet that the HW sends
      is smaller than what it could be, and too many packets will be sent
      for big messages.
      
      An easy way to demonstrate one more aspect of the problem is by
      configuring the ipoib mtu to be less than 2*hlen (2*56) and then
      run app sending big TCP messages. This will tell the HW to send packets
      with giant (negative value which under unsigned arithmetics becomes
      a huge positive one) length and the QP moves to SQE state.
      
      Fixes: b832be1e ('IB/mlx4: Add IPoIB LSO support')
      Reported-by: default avatarMatthew Finlay <matt@mellanox.com>
      Signed-off-by: default avatarErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      ca9b590c
    • Erez Shitrit's avatar
      IB/ipoib: Remove IPOIB_MCAST_RUN bit · 0e5544d9
      Erez Shitrit authored
      After Doug Ledford's changes there is no need in that bit, it's
      semantic becomes subset of the IPOIB_FLAG_OPER_UP bit.
      Signed-off-by: default avatarErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      0e5544d9
    • Erez Shitrit's avatar
      IB/ipoib: Save only IPOIB_MAX_PATH_REC_QUEUE skb's · 1e85b806
      Erez Shitrit authored
      Whenever there is no path->ah to the destination, keep only defined
      number of skb's. Otherwise there are cases that the driver can keep
      infinite list of skb's.
      
      For example, when one device want to send unicast arp to the destination,
      and from some reason the SM doesn't respond, the driver currently keeps
      all the skb's. If that unicast arp traffic stopped, all  these skb's
      are kept by the path object till the interface is down.
      Signed-off-by: default avatarErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      1e85b806
    • Erez Shitrit's avatar
      IB/ipoib: Handle QP in SQE state · 2c010730
      Erez Shitrit authored
      As the result of a completion error the QP can moved to SQE state by
      the hardware. Since it's not the Error state, there are no flushes
      and hence the driver doesn't know about that.
      
      The fix creates a task that after completion with error which is not a
      flush tracks the QP state and if it is in SQE state moves it back to RTS.
      Signed-off-by: default avatarErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      2c010730
    • Erez Shitrit's avatar
      IB/ipoib: Update broadcast record values after each successful join request · 3fd0605c
      Erez Shitrit authored
      Update the cached broadcast record in the priv object after every new
      join of this broadcast domain group.
      
      These values are needed for the port configuration (MTU size) and to
      all the new multicast (non-broadcast) join requests initial parameters.
      
      For example, SM starts with 2K MTU for all the fabric, and after that it
      restarts (or handover to new SM) with new port configuration of 4K MTU.
      Without using the new values, the driver will keep its old configuration
      of 2K and will not apply the new configuration of 4K.
      Signed-off-by: default avatarErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      3fd0605c
    • Erez Shitrit's avatar
      IB/ipoib: Use one linear skb in RX flow · a44878d1
      Erez Shitrit authored
      The current code in the RX flow uses two sg entries for each incoming
      packet, the first one was for the IB headers and the second for the rest
      of the data, that causes two  dma map/unmap and two allocations, and few
      more actions that were done at the data path.
      
      Use only one linear skb on each incoming packet, for the data (IB
      headers and payload), that reduces the packet processing in the
      data-path (only one skb, no frags, the first frag was not used anyway,
      less memory allocations) and the dma handling (only one dma map/unmap
      over each incoming packet instead of two map/unmap per each incoming packet).
      
      After commit 73d3fe6d ("gro: fix aggregation for skb using frag_list") from
      Eric Dumazet, we will get full aggregation for large packets.
      
      When running bandwidth tests before and after the (over the card's numa node),
      using "netperf -H 1.1.1.3 -T -t TCP_STREAM", the results before are ~12Gbs before
      and after ~16Gbs on my setup (Mellanox's ConnectX3).
      Signed-off-by: default avatarErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: default avatarOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      a44878d1
    • Doug Ledford's avatar
      IB/ipoib: drop mcast_mutex usage · 1c0453d6
      Doug Ledford authored
      We needed the mcast_mutex when we had to prevent the join completion
      callback from having the value it stored in mcast->mc overwritten
      by a delayed return from ib_sa_join_multicast.  By storing the return
      of ib_sa_join_multicast in an intermediate variable, we prevent a
      delayed return from ib_sa_join_multicast overwriting the valid
      contents of mcast->mc, and we no longer need a mutex to force the
      join callback to run after the return of ib_sa_join_multicast.  This
      allows us to do away with the mutex entirely and protect our critical
      sections with a just a spinlock instead.  This is highly desirable
      as there were some places where we couldn't use a mutex because the
      code was not allowed to sleep, and so we were currently using a mix
      of mutex and spinlock to protect what we needed to protect.  Now we
      only have a spin lock and the locking complexity is greatly reduced.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      1c0453d6
    • Doug Ledford's avatar
      IB/ipoib: deserialize multicast joins · d2fe937c
      Doug Ledford authored
      Allow the ipoib layer to attempt to join all outstanding multicast
      groups at once.  The ib_sa layer will serialize multiple attempts to
      join the same group, but will process attempts to join different groups
      in parallel.  Take advantage of that.
      
      In order to make this happen, change the mcast_join_thread to loop
      through all needed joins, sending a join request for each one that we
      still need to join.  There are a few special cases we handle though:
      
      1) Don't attempt to join anything but the broadcast group until the join
      of the broadcast group has succeeded.
      2) No longer restart the join task at the end of completion handling.
      If we completed successfully, we are done.  The join task now needs kicked
      either by mcast_send or mcast_restart_task or mcast_start_thread, but
      should not need started anytime else except when scheduling a backoff
      attempt to rejoin.
      3) No longer use separate join/completion routines for regular and
      sendonly joins, pass them all through the same routine and just do the
      right thing based on the SENDONLY join flag.
      4) Only try to join a SENDONLY join twice, then drop the packets and
      quit trying.  We leave the mcast group in the list so that if we get a
      new packet, all that we have to do is queue up the packet and restart
      the join task and it will automatically try to join twice and then
      either send or flush the queue again.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      d2fe937c
    • Doug Ledford's avatar
      IB/ipoib: fix MCAST_FLAG_BUSY usage · 69911416
      Doug Ledford authored
      Commit a9c8ba58 ("IPoIB: Fix usage of uninitialized multicast
      objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
      in how it was used.  We didn't always initialize the completion struct
      before we set the flag, and we didn't always call complete on the
      completion struct from all paths that complete it.  And when we did
      complete it, sometimes we continued to touch the mcast entry after
      the completion, opening us up to possible use after free issues.
      
      This made it less than totally effective, and certainly made its use
      confusing.  And in the flush function we would use the presence of this
      flag to signal that we should wait on the completion struct, but we never
      cleared this flag, ever.
      
      In order to make things clearer and aid in resolving the rtnl deadlock
      bug I've been chasing, I cleaned this up a bit.
      
       1) Remove the MCAST_JOIN_STARTED flag entirely
       2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
       3) Test mcast->mc directly to see if we have completed
          ib_sa_join_multicast (using IS_ERR_OR_NULL)
       4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
          the mcast->done completion struct
       5) Make sure that before calling complete(&mcast->done), we always clear
          the MCAST_FLAG_BUSY bit
       6) Take the mcast_mutex before we call ib_sa_multicast_join and also
          take the mutex in our join callback.  This forces
          ib_sa_multicast_join to return and set mcast->mc before we process
          the callback.  This way, our callback can safely clear mcast->mc
          if there is an error on the join and we will do the right thing as
          a result in mcast_dev_flush.
       7) Because we need the mutex to synchronize mcast->mc, we can no
          longer call mcast_sendonly_join directly from mcast_send and
          instead must add sendonly join processing to the mcast_join_task
       8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
          we have a running task.  We know when we need to reschedule our
          join task thread and don't need a flag to tell us.
       9) Add a helper for rescheduling the join task thread
      
      A number of different races are resolved with these changes.  These
      races existed with the old MCAST_FLAG_BUSY usage, the
      MCAST_JOIN_STARTED flag was an attempt to address them, and while it
      helped, a determined effort could still trip things up.
      
      One race looks something like this:
      
      Thread 1                             Thread 2
      ib_sa_join_multicast (as part of running restart mcast task)
        alloc member
        call callback
                                           ifconfig ib0 down
      				     wait_for_completion
          callback call completes
                                           wait_for_completion in
      				     mcast_dev_flush completes
      				       mcast->mc is PTR_ERR_OR_NULL
      				       so we skip ib_sa_leave_multicast
          return from callback
        return from ib_sa_join_multicast
      set mcast->mc = return from ib_sa_multicast
      
      We now have a permanently unbalanced join/leave issue that trips up the
      refcounting in core/multicast.c
      
      Another like this:
      
      Thread 1                   Thread 2         Thread 3
      ib_sa_multicast_join
                                                  ifconfig ib0 down
      					    priv->broadcast = NULL
                                 join_complete
      			                    wait_for_completion
      			   mcast->mc is not yet set, so don't clear
      return from ib_sa_join_multicast and set mcast->mc
      			   complete
      			   return -EAGAIN (making mcast->mc invalid)
      			   		    call ib_sa_multicast_leave
      					    on invalid mcast->mc, hang
      					    forever
      
      By holding the mutex around ib_sa_multicast_join and taking the mutex
      early in the callback, we force mcast->mc to be valid at the time we
      run the callback.  This allows us to clear mcast->mc if there is an
      error and the join is going to fail.  We do this before we complete
      the mcast.  In this way, mcast_dev_flush always sees consistent state
      in regards to mcast->mc membership at the time that the
      wait_for_completion() returns.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      69911416
    • Doug Ledford's avatar
      IB/ipoib: No longer use flush as a parameter · efc82eee
      Doug Ledford authored
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      efc82eee
    • Doug Ledford's avatar
      IB/ipoib: Use dedicated workqueues per interface · 0b39578b
      Doug Ledford authored
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue means
      that we would also have to prevent all races between different devices
      (for instance, ipoib_mcast_restart_task on interface ib0 can race with
      ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
      the rtnl_lock).
      
      There are several possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      0b39578b
    • Doug Ledford's avatar
      IB/ipoib: Make the carrier_on_task race aware · 894021a7
      Doug Ledford authored
      We blindly assume that we can just take the rtnl lock and that will
      prevent races with downing this interface.  Unfortunately, that's not
      the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
      in an attempt to clear out all remaining instances of ipoib_join_task.
      But, since this task is put on the same workqueue as the join task,
      the flush_workqueue waits on this thread too.  But this thread is
      deadlocked on the rtnl lock.  The better thing here is to use trylock
      and loop on that until we either get the lock or we see that
      FLAG_OPER_UP has been cleared, in which case we don't need to do
      anything anyway and we just return.
      
      While investigating which flag should be used, FLAG_ADMIN_UP or
      FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more
      appropriate flag to use.  However, there was a mix of these two flags in
      use in the existing code.  So while we check for that flag here as part
      of this race fix, also cleanup the two places that had used the less
      appropriate flag for their tests.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      894021a7
    • Doug Ledford's avatar
      IB/ipoib: Consolidate rtnl_lock tasks in workqueue · c84ca6d2
      Doug Ledford authored
      The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and
      needs to keep it held.  It also needs to call flush_workqueue() to flush
      out any outstanding work.  In the past, we've had to try and make sure
      that we didn't flush out any outstanding join completions because they
      also wanted to grab rtnl_lock() and that would deadlock.  It turns out
      that the only thing in the join completion handler that needs this lock
      can be safely moved to our carrier_on_task, thereby reducing the
      potential for the join completion code and the flush code to deadlock
      against each other.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      c84ca6d2
    • Doug Ledford's avatar
      IB/ipoib: change init sequence ordering · be7aa663
      Doug Ledford authored
      In preparation for using per device work queues, we need to move the
      start of the neighbor thread task to after ipoib_ib_dev_init and move
      the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
      Otherwise we will end up freeing our workqueue with work possibly
      still on it.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      be7aa663