1. 28 Apr, 2016 30 commits
    • Sebastian Sanchez's avatar
      IB/hfi1: Check P_KEY for all sent packets from user mode · e38d1e4f
      Sebastian Sanchez authored
      Add a P_KEY check for packets sent from user contexts over
      both PIO and SDMA. For PIO, the
      SendCtxtCheckEnable.DisallowKDETHPackets is set by
      default. When the P_KEY is set,
      SendCtxtCheckEnable.DisallowKDETHPackets is cleared.
      For SDMA, a software check was added. This change
      requires user processes to set the P_KEY before sending
      any packets; otherwise, the send will fail. The
      original submission didn't have this check, but it is
      required.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      e38d1e4f
    • Sebastian Sanchez's avatar
      IB/hfi1: Adjust default MTU to be 10KB · ef699e84
      Sebastian Sanchez authored
      Increasing the default MTU size to 10KB improves performance
      for PSM. Change the default MTU to 10KB but constrain
      Verbs MTU to 8KB. Also update default MTU module parameter
      description to be HFI1_DEFAULT_MAX_MTU.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: Jubin John <jubin.john@intel.com>
      Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      ef699e84
    • Dean Luick's avatar
      IB/hfi1: Simplify init_qpmap_table() · 60d585ad
      Dean Luick authored
      Make init_qpmap_table() easier to understand by simplifying
      the loop indexing and writing each register when it is "full",
      removing the need for a follow-on register write.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      60d585ad
    • Dean Luick's avatar
      IB/hfi1: Correctly obtain the full service class · de882ff5
      Dean Luick authored
      The function hdr2sc was using an unshifted mask to obtain
      the 5th bit of the service class.  Correct the issue by using
      the shifted mask.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      de882ff5
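
The difference between a shifted and an unshifted mask can be sketched as follows. This is a minimal illustration, not the driver's code: the flag position (bit 60), the macro names, and the two helper functions are assumptions made for the example. Only the idea comes from the commit message, namely that the 5th service class bit must be tested with the mask already shifted into place.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a 1-bit flag stored at bit 60 of a 64-bit word. */
#define FLAG_SHIFT 60
#define FLAG_MASK  0x1ull                     /* unshifted: valid only after shifting the word down */
#define FLAG_SMASK (FLAG_MASK << FLAG_SHIFT)  /* shifted: valid against the raw word */

static_assert(FLAG_SMASK != FLAG_MASK, "the two masks must differ for the bug to show");

/* Buggy variant: tests bit 0 of the raw word instead of the flag bit. */
static unsigned sc_buggy(uint16_t lrh0, uint64_t flags)
{
    return ((lrh0 >> 12) & 0xf) | ((unsigned)!!(flags & FLAG_MASK) << 4);
}

/* Fixed variant: tests the flag where it actually lives. */
static unsigned sc_fixed(uint16_t lrh0, uint64_t flags)
{
    return ((lrh0 >> 12) & 0xf) | ((unsigned)!!(flags & FLAG_SMASK) << 4);
}
```

With the flag set at bit 60 and the low four bits equal to 3, `sc_fixed()` yields 0x13 while `sc_buggy()` silently drops the 5th bit and yields 0x03.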
    • Dean Luick's avatar
      IB/hfi1: Fix QOS rule mappings · 33a9eb52
      Dean Luick authored
      The QOS RSM rule mappings are off by one, referencing a kernel receive
      context that does not exist.
      
      Correctly start the QOS RSM map entries at FIRST_KERNEL_CONTEXT rather
      than MIN_KERNEL_KCTXTS.  Remove the cruft that hid this.
      
      Change the QP map table so all traffic not caught by QOS RSM goes to
      the control context rather than the first QOS context.
      
      Correct comments to match the actual code operation and intent.
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      33a9eb52
    • Dean Luick's avatar
      IB/hfi1: Remove invalid QOS check · 35969d9b
      Dean Luick authored
      Remove an invalid compare of the number of QOS RSM map table entries
      against the number of physical receive contexts.  The RSM map table
      has its own size and has no relation to the number of physical receive
      contexts.
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      35969d9b
    • Dean Luick's avatar
      IB/hfi1: Fix QOS num_vl bit width · 153d58cd
      Dean Luick authored
      The bit width for num_vls, n, needs to be calculated from
      the number of VLs rounded up to the next power of two.  Otherwise
      num_vls of 3, 5, 6, and 7 will have misplaced QOS RSM map entries.
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      153d58cd
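
The bit-width rule above can be checked with a small user-space sketch. The helper names are made up for the example, and the "truncated" variant is only an assumption about what a plain floor(log2) would produce. The commit message does not show the original expression.

```c
#include <assert.h>

/* Smallest power of two >= v, for v >= 1. */
static unsigned roundup_pow2(unsigned v)
{
    unsigned p = 1;

    assert(v >= 1);
    while (p < v)
        p <<= 1;
    return p;
}

/* floor(log2(v)), for v >= 1. */
static unsigned ilog2_u(unsigned v)
{
    unsigned n = 0;

    assert(v >= 1);
    while (v >>= 1)
        n++;
    return n;
}

/* Index bits needed so that each of num_vls VLs gets a distinct slot. */
unsigned vl_index_bits(unsigned num_vls)
{
    return ilog2_u(roundup_pow2(num_vls));
}

/* Width from a plain floor(log2): too narrow for 3, 5, 6 and 7. */
unsigned vl_index_bits_truncated(unsigned num_vls)
{
    return ilog2_u(num_vls);
}
```

For powers of two both variants agree. For 3 VLs the rounded-up width is 2 bits against a truncated 1 bit, and for 5, 6, and 7 VLs it is 3 bits against 2, which matches the set of values the commit message lists as affected.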
    • Dean Luick's avatar
      IB/hfi1: Fix i2c resource reservation checks · f9c82a0b
      Dean Luick authored
      The i2c and qsfp read/write routines should check for the resource
      reservation of the incoming argument target rather than the implicit
      target of the hardware HFI.
      Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Jubin John <jubin.john@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      f9c82a0b
    • Dean Luick's avatar
      IB/hfi1: Fix sysfs file offset usage · 4ee15859
      Dean Luick authored
      Two sysfs files do not pay attention to the file offset when
      reading data. Fix that.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Jubin John <jubin.john@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      4ee15859
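
What "paying attention to the file offset" means for a read handler can be sketched in user space. The function below is a hypothetical stand-in, not the sysfs callback from the driver: it copies from a fixed source buffer starting at the requested offset and clamps the length to what remains.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy up to count bytes of src (src_len bytes long), starting at offset
 * off, into dst. Returns the number of bytes copied; 0 means end of data.
 */
size_t read_at_offset(char *dst, size_t count,
                      const char *src, size_t src_len, size_t off)
{
    assert(dst && src);
    if (off >= src_len)
        return 0;                 /* reading at or past the end */
    if (count > src_len - off)
        count = src_len - off;    /* clamp to the bytes that remain */
    memcpy(dst, src + off, count);
    return count;
}
```

A handler that ignores the offset returns the start of the data on every call, so a reader that consumes the file in several chunks sees the first chunk repeated.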
    • Jubin John's avatar
      IB/rdmavt,hfi1,qib: Fix memory leak · ea0e4ce3
      Jubin John authored
      rdi->ports has memory allocated in rvt_alloc_device(), but does not get
      freed because the hfi1 and qib drivers call ib_dealloc_device()
      directly instead of going through rdmavt. Add a rvt_dealloc_device()
      that frees rdi->ports and then calls ib_dealloc_device(). Switch hfi1
      and qib drivers to calling rvt_dealloc_device() instead of
      ib_dealloc_device() directly.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Brian Welty <brian.welty@intel.com>
      Signed-off-by: Jubin John <jubin.john@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      ea0e4ce3
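
The shape of the fix, a wrapper that releases what the wrapper layer allocated and then hands off to the lower-layer release, can be sketched in user space. All names below are stand-ins invented for the example. They model the relationship between rvt_dealloc_device() and ib_dealloc_device() described above, not the kernel structures themselves.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the object the lower layer allocates and frees. */
struct base_dev {
    void **ports;   /* owned by the wrapper layer, not by the base layer */
};

static int base_dealloc_calls;

/* Stand-in for the lower-layer release; it knows nothing about ports. */
static void base_dealloc(struct base_dev *dev)
{
    base_dealloc_calls++;
    free(dev);
}

/* Wrapper allocation: creates the device and the per-port array. */
struct base_dev *wrapper_alloc(size_t nports)
{
    struct base_dev *dev = calloc(1, sizeof(*dev));

    if (!dev)
        return NULL;
    dev->ports = calloc(nports, sizeof(*dev->ports));
    if (!dev->ports) {
        free(dev);
        return NULL;
    }
    return dev;
}

/* Free what the wrapper allocated, then hand off to the lower layer. */
void wrapper_dealloc(struct base_dev *dev)
{
    assert(dev);
    free(dev->ports);
    base_dealloc(dev);
}
```

Calling `base_dealloc()` directly on a device obtained from `wrapper_alloc()` would leak the `ports` array, which is the pattern the commit removes from the drivers.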
    • Mitko Haralanov's avatar
      IB/hfi1: Fix buffer cache races which may cause corruption · e88c9271
      Mitko Haralanov authored
      There are two possible causes for node/memory corruption, both
      of which are related to the cache eviction algorithm. One way
      to cause corruption is due to the asynchronous nature of the
      MMU invalidation and the locking used when invalidating a node.
      
      The MMU invalidation routine would temporarily release the
      RB tree lock to avoid a deadlock. However, this would allow
      the eviction function to take the lock resulting in the removal
      of cache nodes.
      
      If the node being removed by the eviction code is the same as
      the node being invalidated, the result is use after free.
      
      The same is true in the other direction due to the temporary
      release of the eviction list lock in the eviction loop.
      
      Another corner case exists when dealing with the SDMA buffer
      cache that could cause corruption of kernel memory.
      The most common way in which this corruption exhibits itself
      is a linked list node corruption. In that case, the kernel will
      complain that a node with poisoned pointers is being removed.
      The fact that the pointers are already poisoned means that the
      node has already been removed from the list.
      
      The root cause of this corruption was a mishandling of the
      eviction list maintained by the driver. In order for this
      to happen four conditions need to be satisfied:
      
         1. A node describing a user buffer already exists in the
            interval RB tree.
         2. The current user buffer starts at the same address as that
            node but is bigger. This will cause the node to be
            extended.
         3. The number of cached buffers is close to or at the limit
            of the buffer cache size.
         4. The node has dropped close to the end of the eviction
            list. This will cause the node to be considered for
            eviction.
      
      If all of the above conditions have been satisfied, it is
      possible for the eviction algorithm to evict the current node,
      which will free the node without the driver knowing.
      
      To solve both issues described above:
         - the locking around the MMU invalidation loop and cache
           eviction loop has been improved so locks are not released in
           the loop body,
         - a new RB function is introduced which will "atomically" find
           and remove the matching node from the RB tree, preventing the
           MMU invalidation loop from touching it, and
         - the node being extended by the pin_vector_pages() function is
           removed from the eviction list prior to calling the eviction
           function.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      e88c9271
    • Mitko Haralanov's avatar
      IB/hfi1: Extract and reinsert MMU RB node on lookup · f53af85e
      Mitko Haralanov authored
      The page pinning function, which also maintains the pin cache,
      behaves one of two ways when an exact buffer match is not found:
        1. If no node is found (a buffer with the same starting address
           is not found in the cache), a new node is created, the buffer
           pages are pinned, and the node is inserted into the RB tree, or
        2. If a node is found but the buffer in that node is a subset of
           the new user buffer, the node is extended with the new buffer
           pages.
      
      Both modes of operation require (re-)insertion into the interval RB
      tree.
      
      When the node being inserted is a new node, the operations are pretty
      simple. However, when the node is already existing and is being
      extended, special care must be taken.
      
      First, we want to guard against an asynchronous attempt to
      delete the node by the MMU invalidation notifier. The simplest way to
      do this is to remove the node from the RB tree, preventing the search
      algorithm from finding it.
      
      Second, the node needs to be re-inserted so it lands in the proper place
      in the tree and the tree is correctly re-balanced. This also requires
      the node to be removed from the RB tree.
      
      This commit adds the hfi1_mmu_rb_extract() function, which will search
      for a node in the interval RB tree matching an address and length and
      remove it from the RB tree if found. This allows for both of the above
      special cases to be handled in a single step.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      f53af85e
    • Mitko Haralanov's avatar
      IB/hfi1: Correctly compute node interval · de79093b
      Mitko Haralanov authored
      The computation of the interval of an interval RB node
      was incorrect, leading to data corruption due to the RB
      search algorithm not properly finding all the RB nodes
      in an MMU invalidation interval.
      
      The problem stemmed from the fact that the beginning
      address of the node's range was being aligned to a page
      boundary. For certain buffer sizes, this would lead to
      an end address calculation that was off by 1 page.
      
      An important aspect of keeping the RB tree sane is also
      updating the node's range in the case it's being extended.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      de79093b
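
One way an aligned start address can push the end of a range into the wrong page is sketched below. This is a guess at the class of error, not a reconstruction of the driver's formula: the 4096-byte page size, the function names, and the exact arithmetic are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* assumed page size for the example */

/* Last byte covered when the start is page-aligned before adding len. */
uint64_t last_byte_aligned_start(uint64_t addr, uint64_t len)
{
    uint64_t start = addr & ~(uint64_t)(PAGE_SIZE_BYTES - 1);

    assert(len > 0);
    return start + len - 1;
}

/* Last byte covered when the unaligned start is kept. */
uint64_t last_byte_exact(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    return addr + len - 1;
}

/* Index of the page that contains byte_addr. */
uint64_t page_of(uint64_t byte_addr)
{
    return byte_addr / PAGE_SIZE_BYTES;
}
```

A 200-byte buffer starting at address 4000 really ends in page 1, but aligning the start down to 0 first places the computed end in page 0. A range search keyed on that end would miss an invalidation that touches only page 1.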
    • Mitko Haralanov's avatar
      IB/hfi1: Protect the interval RB tree when cleaning up · 782f6697
      Mitko Haralanov authored
      The current implementation of the clean up function for
      the interval RB trees has two flaws which may cause
      problems when the function and the MMU notifier execute
      concurrently.
      
      The flaws were due to the fact that deregistration of the
      MMU callbacks was done after the tree was emptied and,
      furthermore, the tree was not being locked.
      
      This commit fixes both of these flaws by, first, switching the
      order of operations, and, second, locking the tree while
      traversing it to prevent any other operations.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      782f6697
    • Mitko Haralanov's avatar
      IB/hfi1: Fix memory leak in user ExpRcv and SDMA · 0ad2d3d0
      Mitko Haralanov authored
      The driver had two memory leaks - one in the user
      expected receive code and one in SDMA buffer cache.
      
      The leak in the expected receive code only showed up
      when the user/admin had set ulimit sufficiently low
      and the driver did not have enough room in the cache
      before hitting the limit of allowed cachable memory.
      
      When this condition occurred, the driver returned
      early signaling userland that it needed to free some
      buffers to free up room in the cache.
      
      The bug was that the driver was not cleaning up
      allocated memory prior to returning early.
      
      The leak in the SDMA buffer cache could occur (even
      though it never did) when the insertion of a buffer
      node in the interval RB tree failed. In this case, the
      driver failed to unpin the pages of the node, instead
      erroneously returning success.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      0ad2d3d0
    • Mitko Haralanov's avatar
      IB/hfi1: Don't remove list entries if they are not in a list · 4787bc5e
      Mitko Haralanov authored
      The SDMA cache logic maintains an eviction list which is ordered
      by most recently used user buffers. Upon errors or buffer freeing,
      the list nodes were unconditionally being deleted. This would lead
      to list corruption warnings if the nodes were never inserted in the
      eviction list to begin with.
      
      This commit prevents this by checking that the nodes are already
      part of the eviction list.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      4787bc5e
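
The guard described above, deleting a node only if it is actually linked, can be sketched with a minimal circular doubly linked list. The type and function names are invented for the example. How the driver performs the check is not shown in the commit message, so treat the self-linked "not on a list" convention here as an assumption.

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
    struct list_node *prev, *next;
};

/* A self-linked node is treated as "not on any list". */
static void node_init(struct list_node *n)
{
    n->prev = n->next = n;
}

static int node_is_linked(const struct list_node *n)
{
    return n->next != n;
}

/* Insert n right after head. */
static void list_add_head(struct list_node *head, struct list_node *n)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink n only if it is linked; returns 1 if a removal happened. */
int node_del_if_linked(struct list_node *n)
{
    assert(n);
    if (!node_is_linked(n))
        return 0;
    n->prev->next = n->next;
    n->next->prev = n->prev;
    node_init(n);   /* back to the "not on a list" state */
    return 1;
}
```

Re-initialising the node after removal makes a second delete a harmless no-op, which is the property an error path needs when it cannot know whether insertion ever happened.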
    • Mike Marciniszyn's avatar
      IB/qib, IB/hfi1: Fix up UD loopback use of irq flags · 747f4d7a
      Mike Marciniszyn authored
      The dual lock patch moved locking around and missed an issue
      with handling irq flags when processing UD loopback
      packets.  This issue was revealed by smatch.
      
      Fix for both qib and hfi1 to pass the saved flags to the UD request
      builder and handle the changes correctly.
      
      Fixes: 46a80d62 ("IB/qib, staging/rdma/hfi1: add s_hlock for use in post send")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      747f4d7a
    • Mike Marciniszyn's avatar
      IB/rdmavt: Fix adaptive pio hang · f39cc34d
      Mike Marciniszyn authored
      The RVT_S_WAIT_PIO_DRAIN flag was missing from
      the set of flags indicating a qp is waiting
      on a resource.
      
      This caused the sleep/wakeup for adaptive pio
      drain to lose a wakeup "hanging" a QP.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      f39cc34d
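
How a flag missing from a combined wait mask loses a wakeup can be shown with a few lines of C. The bit values and macro names below are hypothetical and chosen only for the example. The real RVT_S_* definitions are not reproduced here.

```c
#include <assert.h>

/* Hypothetical bit values; only the composition of the mask matters. */
#define S_WAIT_TX         0x01u
#define S_WAIT_DMA        0x02u
#define S_WAIT_PIO_DRAIN  0x04u

/* Combined mask without and with the drain flag. */
#define S_ANY_WAIT_BUGGY  (S_WAIT_TX | S_WAIT_DMA)
#define S_ANY_WAIT_FIXED  (S_WAIT_TX | S_WAIT_DMA | S_WAIT_PIO_DRAIN)

/* Returns 1 if a QP with these flags is considered to be waiting. */
int needs_wakeup(unsigned s_flags, unsigned any_wait_mask)
{
    assert(any_wait_mask != 0);
    return (s_flags & any_wait_mask) != 0;
}
```

A QP that sleeps with only the drain flag set is invisible to the buggy mask, so the code that decides whether to wake it concludes there is nothing to do.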
    • Doug Ledford's avatar
      e29bff46
    • Doug Ledford's avatar
      Merge branch 'master' of... · d53e181c
      Doug Ledford authored
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into testing/4.6
      d53e181c
    • Jason Gunthorpe's avatar
      IB/security: Restrict use of the write() interface · e6bd18f5
      Jason Gunthorpe authored
      The drivers/infiniband stack uses write() as a replacement for
      bi-directional ioctl().  This is not safe. There are ways to
      trigger write calls that result in the return structure that
      is normally written to user space being shunted off to user
      specified kernel memory instead.
      
      For the immediate repair, detect and deny suspicious accesses to
      the write API.
      
      For long term, update the user space libraries and the kernel API
      to something that doesn't present the same security vulnerabilities
      (likely a structured ioctl() interface).
      
      The impacted uAPI interfaces are generally only available if
      hardware from drivers/infiniband is installed in the system.
      Reported-by: Jann Horn <jann@thejh.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      [ Expanded check to all known write() entry points ]
      Cc: stable@vger.kernel.org
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      e6bd18f5
    • Dean Luick's avatar
      IB/hfi1: Use kernel default llseek for ui device · 7723d8c2
      Dean Luick authored
      The ui device llseek had a mistake with SEEK_END and did
      not fully follow seek semantics.  Correct all this by
      using a kernel supplied function for fixed size devices.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      7723d8c2
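
The seek semantics expected of a fixed-size device can be modelled in user space as below. This is a simplified illustration with an invented function name, not the kernel helper the commit switches to. It assumes that positions from 0 up to and including the device size are valid and that everything else is rejected.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Compute the new file position for a device of the given size.
 * Returns the new position, or -1 if whence is unknown or the
 * resulting position falls outside [0, size].
 */
int64_t fixed_seek(int64_t cur, int64_t offset, int whence, int64_t size)
{
    int64_t pos;

    assert(size >= 0 && cur >= 0 && cur <= size);
    switch (whence) {
    case SEEK_SET:
        pos = offset;
        break;
    case SEEK_CUR:
        pos = cur + offset;
        break;
    case SEEK_END:
        pos = size + offset;   /* offset is relative to the end */
        break;
    default:
        return -1;
    }
    return (pos < 0 || pos > size) ? -1 : pos;
}
```

The SEEK_END case is where hand-written implementations tend to go wrong: the offset is added to the size, so a negative offset moves backwards from the end and an offset of zero lands exactly at the end.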
    • Mitko Haralanov's avatar
      IB/hfi1: Don't attempt to free resources if initialization failed · 94158442
      Mitko Haralanov authored
      Attempting to free resources which have not been allocated and
      initialized properly led to the following kernel backtrace:
      
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1]
          PGD 852a43067 PUD 85d4a6067 PMD 0
          Oops: 0000 [#1] SMP
          CPU: 0 PID: 2831 Comm: osu_bw Tainted: G          IO 3.12.18-wfr+ #1
          task: ffff88085b15b540 ti: ffff8808588fe000 task.ti: ffff8808588fe000
          RIP: 0010:[<ffffffffa09658fe>]  [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1]
          RSP: 0018:ffff8808588ffde0  EFLAGS: 00010282
          RAX: 0000000000000000 RBX: ffff880858a31800 RCX: 0000000000000000
          RDX: ffff88085d971bc0 RSI: ffff880858a318f8 RDI: ffff880858a318c0
          RBP: ffff8808588ffe20 R08: 0000000000000000 R09: 0000000000000000
          R10: ffff88087ffd6f40 R11: 0000000001100348 R12: ffff880852900000
          R13: ffff880858a318c0 R14: 0000000000000000 R15: ffff88085d971be8
          FS:  00007f4674e83740(0000) GS:ffff88087f400000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000000 CR3: 000000085c377000 CR4: 00000000001407f0
          Stack:
           ffffffffa0941a71 ffff880858a318f8 ffff88085d971bc0 ffff880858a31800
           ffff880852900000 ffff880858a31800 00000000003ffff7 ffff88085d971bc0
           ffff8808588ffe60 ffffffffa09663fc ffff8808588ffe60 ffff880858a31800
          Call Trace:
           [<ffffffffa0941a71>] ? find_mmu_handler+0x51/0x70 [hfi1]
           [<ffffffffa09663fc>] hfi1_user_exp_rcv_free+0x6c/0x120 [hfi1]
           [<ffffffffa0932809>] hfi1_file_close+0x1a9/0x340 [hfi1]
           [<ffffffff8116c189>] __fput+0xe9/0x270
           [<ffffffff8116c35e>] ____fput+0xe/0x10
           [<ffffffff81065707>] task_work_run+0xa7/0xe0
           [<ffffffff81002969>] do_notify_resume+0x59/0x80
           [<ffffffff814ffc1a>] int_signal+0x12/0x17
      
      This commit re-arranges the context initialization code in a way that
      would allow for context event flags to be used to determine whether
      the context has been successfully initialized.
      
      In turn, this can be used to skip the resource de-allocation if they
      were never allocated in the first place.
      
      Fixes: 3abb33ac ("staging/hfi1: Add TID cache receive init and free funcs")
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      94158442
    • Mike Marciniszyn's avatar
      IB/hfi1: Fix missing lock/unlock in verbs drain callback · b9b06cb6
      Mike Marciniszyn authored
      The iowait_sdma_drained() callback lacked locking to
      protect the qp s_flags field.
      
      This causes the s_flags to be out of sync
      on multiple CPUs, potentially corrupting the s_flags.
      
      Fixes: a545f530 ("staging/rdma/hfi: fix CQ completion order issue")
      Reviewed-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      b9b06cb6
    • Jubin John's avatar
      IB/rdmavt: Fix send scheduling · e6d2e017
      Jubin John authored
      call_send is used to determine whether to send immediately or schedule
      a send for later. The current logic in rdmavt is inverted and has a
      negative impact on the latency of the hfi1 and qib drivers. Fix this
      regression by correctly calling send immediately when call_send is set.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Jubin John <jubin.john@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      e6d2e017
    • Mitko Haralanov's avatar
      IB/hfi1: Prevent unpinning of wrong pages · 849e3e93
      Mitko Haralanov authored
      The routine used by the SDMA cache to handle already
      cached nodes can extend an already existing node.
      
      In its error handling code, the routine will unpin pages
      when not all pages of the buffer extension were pinned.
      
      There was a bug in that part of the routine, which would
      mistakenly unpin pages from the original set rather than
      the newly pinned pages.
      
      This commit fixes that bug by offsetting the page array
      to the proper place pointing at the beginning of the newly
      pinned pages.
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      849e3e93
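
The offsetting fix can be illustrated with a small sketch. The `page_ref` type and both functions are stand-ins invented for the example, and the array layout (previously pinned pages first, newly pinned pages appended) is an assumption consistent with the description above.

```c
#include <assert.h>
#include <stddef.h>

struct page_ref {
    int pinned;
};

/* Stand-in for the real unpin helper: drops the pin on npages entries. */
static void unpin_pages(struct page_ref **pages, size_t npages)
{
    for (size_t i = 0; i < npages; i++)
        pages[i]->pinned = 0;
}

/*
 * Roll back a partially successful extension. The array holds old_count
 * previously pinned pages followed by newly_pinned pages from this
 * attempt; only the latter may be released.
 */
void rollback_extension(struct page_ref **pages, size_t old_count,
                        size_t newly_pinned)
{
    assert(pages);
    unpin_pages(pages + old_count, newly_pinned);
}
```

Passing `pages` instead of `pages + old_count` would release the first `newly_pinned` entries of the original set, leaving the cache node pointing at pages it no longer holds.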
    • Mitko Haralanov's avatar
      IB/hfi1: Fix deadlock caused by locking with wrong scope · de82bdff
      Mitko Haralanov authored
      The locking around the interval RB tree is designed to prevent
      access to the tree while it's being modified. The locking in its
      current form is overzealous, which is causing a deadlock in
      certain cases with the following backtrace:
      
          Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
          CPU: 0 PID: 5836 Comm: IMB-MPI1 Tainted: G           O 3.12.18-wfr+ #1
           0000000000000000 ffff88087f206c50 ffffffff814f1caa ffffffff817b53f0
           ffff88087f206cc8 ffffffff814ecd56 0000000000000010 ffff88087f206cd8
           ffff88087f206c78 0000000000000000 0000000000000000 0000000000001662
          Call Trace:
           <NMI>  [<ffffffff814f1caa>] dump_stack+0x45/0x56
           [<ffffffff814ecd56>] panic+0xc2/0x1cb
           [<ffffffff810d4370>] ? restart_watchdog_hrtimer+0x50/0x50
           [<ffffffff810d4432>] watchdog_overflow_callback+0xc2/0xd0
           [<ffffffff81109b4e>] __perf_event_overflow+0x8e/0x2b0
           [<ffffffff8110a714>] perf_event_overflow+0x14/0x20
           [<ffffffff8101c906>] intel_pmu_handle_irq+0x1b6/0x390
           [<ffffffff814f927b>] perf_event_nmi_handler+0x2b/0x50
           [<ffffffff814f8ad8>] nmi_handle.isra.3+0x88/0x180
           [<ffffffff814f8d39>] do_nmi+0x169/0x310
           [<ffffffff814f8177>] end_repeat_nmi+0x1e/0x2e
           [<ffffffff81272600>] ? unmap_single+0x30/0x30
           [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40
           [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40
           [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40
           <<EOE>>  <IRQ>  [<ffffffffa056c4a8>] hfi1_mmu_rb_search+0x38/0x70 [hfi1]
           [<ffffffffa05919cb>] user_sdma_free_request+0xcb/0x120 [hfi1]
           [<ffffffffa0593393>] user_sdma_txreq_cb+0x263/0x350 [hfi1]
           [<ffffffffa057fad7>] ? sdma_txclean+0x27/0x1c0 [hfi1]
           [<ffffffffa0593130>] ? user_sdma_send_pkts+0x1710/0x1710 [hfi1]
           [<ffffffffa057fdd6>] sdma_make_progress+0x166/0x480 [hfi1]
           [<ffffffff810762c9>] ? ttwu_do_wakeup+0x19/0xd0
           [<ffffffffa0581c7e>] sdma_engine_interrupt+0x8e/0x100 [hfi1]
           [<ffffffffa0546bdd>] sdma_interrupt+0x5d/0xa0 [hfi1]
           [<ffffffff81097e57>] handle_irq_event_percpu+0x47/0x1d0
           [<ffffffff81098017>] handle_irq_event+0x37/0x60
           [<ffffffff8109aa5f>] handle_edge_irq+0x6f/0x120
           [<ffffffff810044af>] handle_irq+0xbf/0x150
           [<ffffffff8104c9b7>] ? irq_enter+0x17/0x80
           [<ffffffff8150168d>] do_IRQ+0x4d/0xc0
           [<ffffffff814f7c6a>] common_interrupt+0x6a/0x6a
           <EOI>  [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0
           [<ffffffff814f56c6>] __schedule+0x3b6/0x7e0
           [<ffffffff810763a6>] __cond_resched+0x26/0x30
           [<ffffffff814f5eda>] _cond_resched+0x3a/0x50
           [<ffffffff814f4f82>] down_write+0x12/0x30
           [<ffffffffa0591619>] hfi1_release_user_pages+0x69/0x90 [hfi1]
           [<ffffffffa059173a>] sdma_rb_remove+0x9a/0xc0 [hfi1]
           [<ffffffffa056c00d>] __mmu_rb_remove.isra.5+0x5d/0x70 [hfi1]
           [<ffffffffa056c536>] hfi1_mmu_rb_remove+0x56/0x70 [hfi1]
           [<ffffffffa059427b>] hfi1_user_sdma_process_request+0x74b/0x1160 [hfi1]
           [<ffffffffa055c763>] hfi1_aio_write+0xc3/0x100 [hfi1]
           [<ffffffff8116a14c>] do_sync_readv_writev+0x4c/0x80
           [<ffffffff8116b58b>] do_readv_writev+0xbb/0x230
           [<ffffffff811a9da1>] ? fsnotify+0x241/0x320
           [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0
           [<ffffffff8116b795>] vfs_writev+0x35/0x60
           [<ffffffff8116b8c9>] SyS_writev+0x49/0xc0
           [<ffffffff810cd876>] ? __audit_syscall_exit+0x1f6/0x2a0
           [<ffffffff814ff992>] system_call_fastpath+0x16/0x1b
      
      As evident from the backtrace above, the process was being put to sleep
      while holding the lock.
      
      Limiting the scope of the lock only to the RB tree operation fixes the
      above error allowing for proper locking and the process being put to
      sleep when needed.
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      de82bdff
    • Mitko Haralanov's avatar
      IB/hfi1: Prevent NULL pointer deferences in caching code · f19bd643
      Mitko Haralanov authored
      There is a potential kernel crash when the MMU notifier calls the
      invalidation routines in the hfi1 pinned page caching code for sdma.
      
      The invalidation routine could call the remove callback
      for the node, which in turn ends up dereferencing the
      current task_struct to get a pointer to the mm_struct.
      However, the mm_struct pointer could be NULL resulting in
      the following backtrace:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
          IP: [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1]
          15
          task: ffff88085e66e080 ti: ffff88085c244000 task.ti: ffff88085c244000
          RIP: 0010:[<ffffffffa041f75a>]  [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1]
          RSP: 0000:ffff88085c245878  EFLAGS: 00010002
          RAX: 0000000000000000 RBX: ffff88105b9bbd40 RCX: ffffea003931a830
          RDX: 0000000000000004 RSI: ffff88105754a9c0 RDI: ffff88105754a9c0
          RBP: ffff88085c245890 R08: ffff88105b9bbd70 R09: 00000000fffffffb
          R10: ffff88105b9bbd58 R11: 0000000000000013 R12: ffff88105754a9c0
          R13: 0000000000000001 R14: 0000000000000001 R15: ffff88105b9bbd40
          FS:  0000000000000000(0000) GS:ffff88107ef40000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00000000000000a8 CR3: 0000000001a0b000 CR4: 00000000001407e0
          Stack:
           ffff88105b9bbd40 ffff88080ec481a8 ffff88080ec481b8 ffff88085c2458c0
           ffffffffa03fa00e ffff88080ec48190 ffff88080ed9cd00 0000000001024000
           0000000000000000 ffff88085c245920 ffffffffa03fa0e7 0000000000000282
          Call Trace:
           [<ffffffffa03fa00e>] __mmu_rb_remove.isra.5+0x5e/0x70 [hfi1]
           [<ffffffffa03fa0e7>] mmu_notifier_mem_invalidate+0xc7/0xf0 [hfi1]
           [<ffffffffa03fa143>] mmu_notifier_page+0x13/0x20 [hfi1]
           [<ffffffff81156dd0>] __mmu_notifier_invalidate_page+0x50/0x70
           [<ffffffff81140bbb>] try_to_unmap_one+0x20b/0x470
           [<ffffffff81141ee7>] try_to_unmap_anon+0xa7/0x120
           [<ffffffff81141fad>] try_to_unmap+0x4d/0x60
           [<ffffffff8111fd7b>] shrink_page_list+0x2eb/0x9d0
           [<ffffffff81120ab3>] shrink_inactive_list+0x243/0x490
           [<ffffffff81121491>] shrink_lruvec+0x4c1/0x640
           [<ffffffff81121641>] shrink_zone+0x31/0x100
           [<ffffffff81121b0f>] kswapd_shrink_zone.constprop.62+0xef/0x1c0
           [<ffffffff811229e3>] kswapd+0x403/0x7e0
           [<ffffffff811225e0>] ? shrink_all_memory+0xf0/0xf0
           [<ffffffff81068ac0>] kthread+0xc0/0xd0
           [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40
           [<ffffffff814ff8ec>] ret_from_fork+0x7c/0xb0
           [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40
      
      To correct this, the mm_struct passed to us by the MMU notifier is
      used (which is what should have been done to begin with). This avoids
      the broken dereferences and ensures that the correct mm_struct is used.
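
      The difference between the two lookups can be sketched in userspace C.
      This is only an illustration under stated assumptions: the struct
      definitions are stand-ins for the kernel types, and current_task,
      mm_from_current() and invalidate_range() are hypothetical names, not
      functions from the hfi1 driver.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; all names are illustrative. */
struct mm_struct { int id; };
struct task_struct { struct mm_struct *mm; };

/* A kernel thread such as kswapd has no user address space: mm is NULL. */
static struct task_struct kswapd_task = { .mm = NULL };
static struct task_struct *current_task = &kswapd_task;

/* Pattern before the fix: look the mm up through the current task. */
static struct mm_struct *mm_from_current(void)
{
    return current_task->mm;   /* NULL when running in a kernel thread */
}

/* Pattern after the fix: use the mm handed to the notifier callback. */
static int invalidate_range(struct mm_struct *mm)
{
    assert(mm != NULL);        /* the callback receives the mm as a parameter */
    return mm->id;
}
```

      When the invalidation runs in kswapd context, as in the backtrace,
      the first lookup yields NULL, while the mm passed in by the notifier
      still identifies the address space being invalidated.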
      Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      f19bd643
    • Sagi Grimberg's avatar
      IB/mlx5: Expose correct max_sge_rd limit · 986ef95e
      Sagi Grimberg authored
      mlx5 devices (Connect-IB, ConnectX-4, ConnectX-4-LX) have a limitation
      where rdma read work queue entries cannot exceed 512 bytes.
      An rdma_read wqe needs to fit in 512 bytes:
      - wqe control segment (16 bytes)
      - rdma segment (16 bytes)
      - scatter elements (16 bytes each)
      
      So max_sge_rd should be: (512 - 16 - 16) / 16 = 30.
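
      The arithmetic above can be checked with a small helper. This is a
      sketch only: the function name and parameters are hypothetical and
      are not taken from the mlx5 driver.

```c
#include <assert.h>

/*
 * Number of scatter elements that fit in an RDMA read WQE once the
 * control and rdma segments are accounted for. All sizes are in bytes.
 */
static int max_sge_rd(int wqe_bytes, int ctrl_bytes, int rdma_bytes,
                      int sge_bytes)
{
    assert(sge_bytes > 0 && wqe_bytes >= ctrl_bytes + rdma_bytes);
    return (wqe_bytes - ctrl_bytes - rdma_bytes) / sge_bytes;
}
```

      With the sizes listed above, max_sge_rd(512, 16, 16, 16) evaluates
      to 30.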
      
      Cc: linux-stable@vger.kernel.org
      Reported-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagig@grimberg.me>
      Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      986ef95e
  2. 27 Apr, 2016 9 commits
    • Linus Torvalds's avatar
      Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · b75a2bf8
      Linus Torvalds authored
      Pull workqueue fix from Tejun Heo:
       "So, it turns out we had a silly bug in the most fundamental part of
        workqueue for a very long time.  AFAICS, this dates back to pre-git
        era and has quite likely been there from the time workqueue was first
        introduced.
      
        A work item uses its PENDING bit to synchronize multiple queuers.
        Anyone who wins the PENDING bit owns the pending state of the work
        item.  Whether a queuer wins or loses the race, one thing should be
        guaranteed - there will soon be at least one execution of the work
        item after the queueing attempt - where "after" means that the
        execution instance would be able
        to see all the changes that the queuer has made prior to the queueing
        attempt.
      
        Unfortunately, we were missing a smp_mb() after clearing PENDING for
        execution, so nothing guaranteed visibility of the changes that a
        queueing loser has made, which manifested as a reproducible blk-mq
        stall.
      
        Lots of kudos to Roman for debugging the problem.  The patch for
        -stable is the minimal one.  For v4.7, Peter is working on a patch to
        make the code path slightly more efficient and less fragile"
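
      The PENDING protocol described above can be sketched with C11 atomics.
      This is a userspace analogy, not the kernel code: struct work_item,
      queue_work_sketch() and run_work_sketch() are hypothetical names, and
      the sequentially consistent fence stands in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct work_item {
    atomic_bool pending;   /* analogue of the PENDING bit */
    atomic_int  data;      /* state the work function consumes */
};

/* Queuer: publish data, then try to win PENDING. Returns true on a win. */
static bool queue_work_sketch(struct work_item *w, int value)
{
    atomic_store_explicit(&w->data, value, memory_order_relaxed);
    /* The seq_cst exchange orders the data store before the attempt. */
    return !atomic_exchange(&w->pending, true);
}

/* Executor: clear PENDING, then run the work and return what it saw. */
static int run_work_sketch(struct work_item *w)
{
    assert(atomic_load(&w->pending));
    atomic_store_explicit(&w->pending, false, memory_order_relaxed);
    /*
     * Full barrier after clearing PENDING. Without it, the load below
     * could be satisfied before the clear becomes visible, so a queuer
     * that lost the race might have its update missed.
     */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&w->data, memory_order_relaxed);
}
```

      A queuer that loses the race relies on the already pending execution
      to observe its update, which is why the barrier sits between
      clearing the flag and reading the data.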
      
      * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: fix ghost PENDING flag while doing MQ IO
      b75a2bf8
    • Linus Torvalds's avatar
      Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 763cfc86
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Two patches to fix a deadlock which can be easily triggered if memcg
        charge moving is used.
      
        This bug was introduced while converting threadgroup locking to a
        global percpu_rwsem and is caused by cgroup controller task migration
        path depending on the ability to create new kthreads.  cpuset had a
        similar issue which was fixed by performing heavy-lifting operations
        asynchronous to task migration.  The two patches fix the same issue in
        memcg in a similar way.  The first patch makes the mechanism generic
        and the second relocates memcg charge moving outside the migration
        path.
      
        Given that we don't want to perform heavy operations while
        writelocking threadgroup lock anyway, moving them out of the way is a
        desirable solution.  One thing to note is that the problem was
        difficult to debug because lockdep couldn't figure out the deadlock
        condition.  Looking into how to improve that"
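
      The shape of the fix, deferring heavy work to a callback that runs
      after the lock is dropped, can be sketched as follows. Everything
      here is a simplified, hypothetical model: struct subsys_sketch and
      the *_sketch functions are not the kernel's cgroup_subsys or memcg
      code, and a boolean stands in for the threadgroup lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified controller callbacks. */
struct subsys_sketch {
    void (*attach)(void);       /* runs with the migration lock held */
    void (*post_attach)(void);  /* runs after the lock is dropped */
};

static bool migration_lock_held;
static bool move_requested;
static bool heavy_work_done_unlocked;

/* Under the lock: only record that charges need to be moved. */
static void memcg_attach_sketch(void)
{
    assert(migration_lock_held);
    move_requested = true;
}

/* After the lock: perform the heavy operation that was recorded. */
static void memcg_post_attach_sketch(void)
{
    if (move_requested) {
        heavy_work_done_unlocked = !migration_lock_held;
        move_requested = false;
    }
}

/* Migration path: attach under the lock, post_attach outside of it. */
static void migrate_sketch(const struct subsys_sketch *ss)
{
    migration_lock_held = true;
    if (ss->attach)
        ss->attach();
    migration_lock_held = false;
    if (ss->post_attach)
        ss->post_attach();
}
```

      Because the heavy step runs only in the post-attach hook, nothing it
      waits on can be blocked by the lock held during migration.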
      
      * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        memcg: relocate charge moving from ->attach to ->post_attach
        cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
      763cfc86
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 3118e5f9
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "I2C has one buildfix, one ABBA deadlock fix, and three simple 'add ID'
        patches"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared
        i2c: cpm: Fix build break due to incompatible pointer types
        i2c: ismt: Add Intel DNV PCI ID
        i2c: xlp9xx: add support for Broadcom Vulcan
        i2c: rk3x: add support for rk3228
      3118e5f9
    • Linus Torvalds's avatar
      Merge tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 24131a61
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
      
       - lockdep now works for ARCv2 builds
      
       - enable DT reserved-memory binding (for forthcoming HDMI driver)
      
      * tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: add support for reserved memory defined by device tree
        ARC: support generic per-device coherent dma mem
        Documentation: dt: arc: fix spelling mistakes
        ARCv2: Enable LOCKDEP
      24131a61
    • Linus Torvalds's avatar
      Merge tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 · 508fea71
      Linus Torvalds authored
      Pull arch/nios2 fix from Ley Foon Tan:
       "memset: use the right constraint modifier for the %4 output operand"
      
      * tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
        nios2: memset: use the right constraint modifier for the %4 output operand
      508fea71
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v4.6-3' of... · 9453203b
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver fix from Darren Hart:
       "Fix regression caused by hotkey enabling value in toshiba_acpi"
      
      * tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
        toshiba_acpi: Fix regression caused by hotkey enabling value
      9453203b
    • Alexey Brodkin's avatar
      ARC: add support for reserved memory defined by device tree · 1b10cb21
      Alexey Brodkin authored
      Enable reserved memory initialization from device tree.
      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      1b10cb21
    • Alexey Brodkin's avatar
      ARC: support generic per-device coherent dma mem · 32ed9a0e
      Alexey Brodkin authored
      Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      32ed9a0e
    • Romain Perier's avatar
      nios2: memset: use the right constraint modifier for the %4 output operand · a8950e49
      Romain Perier authored
      Depending on the size of the area to be memset'ed, the nios2 memset implementation
      either uses a naive loop (for buffers smaller than or equal to 8 bytes) or a more optimized
      implementation (for buffers larger than 8 bytes). This implementation does 4-byte stores
      rather than 1-byte stores to speed up memset.
      
      However, we discovered that on our nios2 platform, memset() was not properly setting the
      buffer to the expected value. A memset of 0xff would not set the entire buffer to 0xff, but to:
      
      0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 ...
      
      Which is obviously incorrect. Our investigation has revealed that the problem lies in the
      incorrect constraints used in the inline assembly.
      
      The following piece of assembly, from the nios2 memset implementation, is supposed to
      create a 4-byte value that repeats 4 times the 1-byte pattern passed as memset argument:
      
      /* fill8 %3, %5 (c & 0xff) */
      "       slli    %4, %5, 8\n"
      "       or      %4, %4, %5\n"
      "       slli    %3, %4, 16\n"
      "       or      %3, %3, %4\n"
      
      However, depending on the compiler and optimization level, this code might be compiled as:
      
      34:	280a923a 	slli	r5,r5,8
      38:	294ab03a 	or	r5,r5,r5
      3c:	2808943a 	slli	r4,r5,16
      40:	2148b03a 	or	r4,r4,r5
      
      This is wrong because r5 gets used both for %5 and %4, which leads to the final pattern
      stored in r4 to be 0xff00ff00 rather than the expected 0xffffffff.
      
      %4 is defined with the "=r" constraint, i.e. as an output operand. However, as explained in
      http://www.ethernut.de/en/documents/arm-inline-asm.html, this does not prevent gcc from
      using the same register for an output operand (%4) and input operand (%5). By using the
      constraint modifier '&', we indicate that the register should be used for output only. With this
      change, we get the following assembly output:
      
      34:	2810923a 	slli	r8,r5,8
      38:	4150b03a 	or	r8,r8,r5
      3c:	400e943a 	slli	r7,r8,16
      40:	3a0eb03a 	or	r7,r7,r8
      
      Which correctly produces the 0xffffffff pattern when 0xff is passed as the memset() pattern.
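
      The two register allocations can be replayed in plain C to confirm
      the resulting patterns. This is a simulation of the four
      instructions, not nios2 inline assembly; the function names are
      hypothetical and the local variables merely mirror the register
      names in the disassembly above.

```c
#include <assert.h>
#include <stdint.h>

/* %4 and %5 share r5, as in the miscompiled output. */
static uint32_t fill_aliased(uint32_t c)
{
    assert(c <= 0xff);         /* c is already masked to one byte */
    uint32_t r5 = c, r4;
    r5 = r5 << 8;              /* slli r5,r5,8  : the input is overwritten */
    r5 = r5 | r5;              /* or   r5,r5,r5 : no effect */
    r4 = r5 << 16;             /* slli r4,r5,16 */
    r4 = r4 | r5;              /* or   r4,r4,r5 */
    return r4;
}

/* %4 gets its own register (r8), as with the '&' modifier. */
static uint32_t fill_separate(uint32_t c)
{
    assert(c <= 0xff);
    uint32_t r5 = c, r8, r7;
    r8 = r5 << 8;              /* slli r8,r5,8  */
    r8 = r8 | r5;              /* or   r8,r8,r5 */
    r7 = r8 << 16;             /* slli r7,r8,16 */
    r7 = r7 | r8;              /* or   r7,r7,r8 */
    return r7;
}
```

      For an input of 0xff, the aliased version returns 0xff00ff00 and the
      version with a separate register returns 0xffffffff, matching the
      values quoted above.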
      
      It is worth mentioning the observed consequence of this bug: we were hitting the kernel
      BUG() in mm/bootmem.c:__free() that verifies when marking a page as free that it was
      previously marked as occupied (i.e. that the bit was set to 1). The entire bootmem bitmap is
      set to 0xff via a memset() during the bootmem initialization. The bootmem_free() call right
      after the initialization was finding some bits to be set to 0, which didn't make sense since the
      bitmap had just been memset'ed to 0xff. Except that due to the bug explained above, the
      bitmap was in fact initialized to 0xff00ff00.
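
      Why the free path tripped can be seen with a small bitmap check.
      This is a sketch with a hypothetical helper that treats the bitmap
      as 32-bit words; it is not the mm/bootmem.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the given bit is set in a bitmap stored as 32-bit words. */
static int bit_is_set(const uint32_t *map, size_t nwords, size_t bit)
{
    assert(bit / 32 < nwords);
    return (int)((map[bit / 32] >> (bit % 32)) & 1u);
}
```

      In a word holding 0xff00ff00, bits 0 to 7 and 16 to 23 are clear, so
      a check that expects every page to be marked occupied fails for
      those positions, whereas a word holding 0xffffffff passes for all 32.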
      
      Thanks to Marek Vasut for his help and feedback.
      Signed-off-by: Romain Perier <romain.perier@free-electrons.com>
      Acked-by: Marek Vasut <marex@denx.de>
      Acked-by: Ley Foon Tan <lftan@altera.com>
      a8950e49
  3. 26 Apr, 2016 1 commit
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · f28f20da
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Handle v4/v6 mixed sockets properly in soreuseport, from Craig
          Gallak.
      
       2) Bug fixes for the new macsec facility (missing kmalloc NULL checks,
          missing locking around netdev list traversal, etc.) from Sabrina
          Dubroca.
      
       3) Fix handling of host routes on ifdown in ipv6, from David Ahern.
      
       4) Fix double-fdput in bpf verifier.  From Jann Horn.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
        bpf: fix double-fdput in replace_map_fd_with_map_ptr()
        net: ipv6: Delete host routes on an ifdown
        Revert "ipv6: Revert optional address flusing on ifdown."
        net/mlx4_en: fix spurious timestamping callbacks
        net: dummy: remove note about being Y by default
        cxgbi: fix uninitialized flowi6
        ipv6: Revert optional address flusing on ifdown.
        ipv4/fib: don't warn when primary address is missing if in_dev is dead
        net/mlx5: Add pci shutdown callback
        net/mlx5_core: Remove static from local variable
        net/mlx5e: Use vport MTU rather than physical port MTU
        net/mlx5e: Fix minimum MTU
        net/mlx5e: Device's mtu field is u16 and not int
        net/mlx5_core: Add ConnectX-5 to list of supported devices
        net/mlx5e: Fix MLX5E_100BASE_T define
        net/mlx5_core: Fix soft lockup in steering error flow
        qlcnic: Update version to 5.3.64
        net: stmmac: socfpga: Remove re-registration of reset controller
        macsec: fix netlink attribute validation
        macsec: add missing macsec prefix in uapi
        ...
      f28f20da