1. 28 Apr, 2016 19 commits
    • Ismail, Mustafa's avatar
      RDMA/i40iw: Set vendor_err only if there is an actual error · df35630a
      Ismail, Mustafa authored
      Add a check for cq_poll_info.error before setting vendor_err
      instead of always setting it.
      Signed-off-by: default avatarMustafa Ismail <mustafa.ismail@intel.com>
      Signed-off-by: default avatarFaisal Latif <faisal.latif@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      df35630a
    • Ismail, Mustafa's avatar
      RDMA/i40iw: Add qp table lock around AE processing · 996abf0a
      Ismail, Mustafa authored
      QP may be freed during Async Event processing.
      Add a lock around QP table to prevent it.
      Signed-off-by: default avatarMustafa Ismail <mustafa.ismail@intel.com>
      Signed-off-by: default avatarFaisal Latif <faisal.latif@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      996abf0a
    • Ismail, Mustafa's avatar
      RDMA/i40iw: Do not set self-referencing pointer to NULL after free · 36a47933
      Ismail, Mustafa authored
      iwqp->allocated_buffer is a self-referencing pointer to iwqp.
      Do not set iwqp->allocated_buffer to NULL after freeing it.
      Signed-off-by: default avatarMustafa Ismail <mustafa.ismail@intel.com>
      Signed-off-by: default avatarFaisal Latif <faisal.latif@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      36a47933
    • Ismail, Mustafa's avatar
      RDMA/i40iw: Correct max message size in query port · bd57aeae
      Ismail, Mustafa authored
      Fix to correct max reported message size in query port.
      Signed-off-by: default avatarMustafa Ismail <mustafa.ismail@intel.com>
      Signed-off-by: default avatarFaisal Latif <faisal.latif@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      bd57aeae
    • Ismail, Mustafa's avatar
      RDMA/i40iw: Fix refused connections · b3437e0d
      Ismail, Mustafa authored
      Make sure cm_node is setup before sending SYN packet and
      ORD/IRD negotiation.
      Signed-off-by: default avatarMustafa Ismail <mustafa.ismail@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      b3437e0d
    • Ismail, Mustafa's avatar
      RDMA/i40iw: Correct QP size calculation · 23ef48ad
      Ismail, Mustafa authored
      Include inline data size as part of SQ size calculation.
      RQ size calculation uses only number of SGEs and does not
      support 96 byte WQE size.
      Signed-off-by: default avatarMustafa Ismail <mustafa.ismail@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      23ef48ad
    • Ismail, Mustafa's avatar
      RDMA/i40iw: Fix overflow of region length · 6b900365
      Ismail, Mustafa authored
      Change region_length to u64 as a region can be > 4GB.
      Signed-off-by: default avatarMustafa Ismail <mustafa.ismail@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      6b900365
    • Doug Ledford's avatar
      e29bff46
    • Doug Ledford's avatar
      Merge branch 'master' of... · d53e181c
      Doug Ledford authored
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into testing/4.6
      d53e181c
    • Jason Gunthorpe's avatar
      IB/security: Restrict use of the write() interface · e6bd18f5
      Jason Gunthorpe authored
      The drivers/infiniband stack uses write() as a replacement for
      bi-directional ioctl().  This is not safe. There are ways to
      trigger write calls that result in the return structure that
      is normally written to user space being shunted off to user
      specified kernel memory instead.
      
      For the immediate repair, detect and deny suspicious accesses to
      the write API.
      
      For long term, update the user space libraries and the kernel API
      to something that doesn't present the same security vulnerabilities
      (likely a structured ioctl() interface).
      
      The impacted uAPI interfaces are generally only available if
      hardware from drivers/infiniband is installed in the system.
      Reported-by: default avatarJann Horn <jann@thejh.net>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarJason Gunthorpe <jgunthorpe@obsidianresearch.com>
      [ Expanded check to all known write() entry points ]
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      e6bd18f5
    • Dean Luick's avatar
      IB/hfi1: Use kernel default llseek for ui device · 7723d8c2
      Dean Luick authored
      The ui device llseek had a mistake with SEEK_END and did
      not fully follow seek semantics.  Correct all this by
      using a kernel supplied function for fixed size devices.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarDean Luick <dean.luick@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      7723d8c2
    • Mitko Haralanov's avatar
      IB/hfi1: Don't attempt to free resources if initialization failed · 94158442
      Mitko Haralanov authored
      Attempting to free resources which have not been allocated and
      initialized properly led to the following kernel backtrace:
      
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1]
          PGD 852a43067 PUD 85d4a6067 PMD 0
          Oops: 0000 [#1] SMP
          CPU: 0 PID: 2831 Comm: osu_bw Tainted: G          IO 3.12.18-wfr+ #1
          task: ffff88085b15b540 ti: ffff8808588fe000 task.ti: ffff8808588fe000
          RIP: 0010:[<ffffffffa09658fe>]  [<ffffffffa09658fe>] unlock_exp_tids.isra.8+0x2e/0x120 [hfi1]
          RSP: 0018:ffff8808588ffde0  EFLAGS: 00010282
          RAX: 0000000000000000 RBX: ffff880858a31800 RCX: 0000000000000000
          RDX: ffff88085d971bc0 RSI: ffff880858a318f8 RDI: ffff880858a318c0
          RBP: ffff8808588ffe20 R08: 0000000000000000 R09: 0000000000000000
          R10: ffff88087ffd6f40 R11: 0000000001100348 R12: ffff880852900000
          R13: ffff880858a318c0 R14: 0000000000000000 R15: ffff88085d971be8
          FS:  00007f4674e83740(0000) GS:ffff88087f400000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000000 CR3: 000000085c377000 CR4: 00000000001407f0
          Stack:
           ffffffffa0941a71 ffff880858a318f8 ffff88085d971bc0 ffff880858a31800
           ffff880852900000 ffff880858a31800 00000000003ffff7 ffff88085d971bc0
           ffff8808588ffe60 ffffffffa09663fc ffff8808588ffe60 ffff880858a31800
          Call Trace:
           [<ffffffffa0941a71>] ? find_mmu_handler+0x51/0x70 [hfi1]
           [<ffffffffa09663fc>] hfi1_user_exp_rcv_free+0x6c/0x120 [hfi1]
           [<ffffffffa0932809>] hfi1_file_close+0x1a9/0x340 [hfi1]
           [<ffffffff8116c189>] __fput+0xe9/0x270
           [<ffffffff8116c35e>] ____fput+0xe/0x10
           [<ffffffff81065707>] task_work_run+0xa7/0xe0
           [<ffffffff81002969>] do_notify_resume+0x59/0x80
           [<ffffffff814ffc1a>] int_signal+0x12/0x17
      
      This commit re-arranges the context initialization code in a way that
      would allow for context event flags to be used to determine whether
      the context has been successfully initialized.
      
      In turn, this can be used to skip the resource de-allocation if they
      were never allocated in the first place.
      
      Fixes: 3abb33ac ("staging/hfi1: Add TID cache receive init and free funcs")
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Signed-off-by: default avatarMitko Haralanov <mitko.haralanov@intel.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com.
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      94158442
    • Mike Marciniszyn's avatar
      IB/hfi1: Fix missing lock/unlock in verbs drain callback · b9b06cb6
      Mike Marciniszyn authored
      The iowait_sdma_drained() callback lacked locking to
      protect the qp s_flags field.
      
      This causes the s_flags to be out of sync
      on multiple CPUs, potentially corrupting the s_flags.
      
      Fixes: a545f530 ("staging/rdma/hfi: fix CQ completion order issue")
      Reviewed-by: default avatarSebastian Sanchez <sebastian.sanchez@intel.com>
      Signed-off-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Reviewed-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      b9b06cb6
    • Jubin John's avatar
      IB/rdmavt: Fix send scheduling · e6d2e017
      Jubin John authored
      call_send is used to determine whether to send immediately or schedule
      a send for later. The current logic in rdmavt is inverted and has a
      negative impact on the latency of the hfi1 and qib drivers. Fix this
      regression by correctly calling send immediately when call_send is set.
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: default avatarJubin John <jubin.john@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      e6d2e017
    • Mitko Haralanov's avatar
      IB/hfi1: Prevent unpinning of wrong pages · 849e3e93
      Mitko Haralanov authored
      The routine used by the SDMA cache to handle already
      cached nodes can extend an already existing node.
      
      In its error handling code, the routine will unpin pages
      when not all pages of the buffer extension were pinned.
      
      There was a bug in that part of the routine, which would
      mistakenly unpin pages from the original set rather than
      the newly pinned pages.
      
      This commit fixes that bug by offsetting the page array
      to the proper place pointing at the beginning of the newly
      pinned pages.
      Reviewed-by: default avatarDean Luick <dean.luick@intel.com>
      Signed-off-by: default avatarMitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      849e3e93
    • Mitko Haralanov's avatar
      IB/hfi1: Fix deadlock caused by locking with wrong scope · de82bdff
      Mitko Haralanov authored
      The locking around the interval RB tree is designed to prevent
      access to the tree while it's being modified. The locking in its
      current form is too overzealous, which is causing a deadlock in
      certain cases with the following backtrace:
      
          Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
          CPU: 0 PID: 5836 Comm: IMB-MPI1 Tainted: G           O 3.12.18-wfr+ #1
           0000000000000000 ffff88087f206c50 ffffffff814f1caa ffffffff817b53f0
           ffff88087f206cc8 ffffffff814ecd56 0000000000000010 ffff88087f206cd8
           ffff88087f206c78 0000000000000000 0000000000000000 0000000000001662
          Call Trace:
           <NMI>  [<ffffffff814f1caa>] dump_stack+0x45/0x56
           [<ffffffff814ecd56>] panic+0xc2/0x1cb
           [<ffffffff810d4370>] ? restart_watchdog_hrtimer+0x50/0x50
           [<ffffffff810d4432>] watchdog_overflow_callback+0xc2/0xd0
           [<ffffffff81109b4e>] __perf_event_overflow+0x8e/0x2b0
           [<ffffffff8110a714>] perf_event_overflow+0x14/0x20
           [<ffffffff8101c906>] intel_pmu_handle_irq+0x1b6/0x390
           [<ffffffff814f927b>] perf_event_nmi_handler+0x2b/0x50
           [<ffffffff814f8ad8>] nmi_handle.isra.3+0x88/0x180
           [<ffffffff814f8d39>] do_nmi+0x169/0x310
           [<ffffffff814f8177>] end_repeat_nmi+0x1e/0x2e
           [<ffffffff81272600>] ? unmap_single+0x30/0x30
           [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40
           [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40
           [<ffffffff814f780d>] ? _raw_spin_lock_irqsave+0x2d/0x40
           <<EOE>>  <IRQ>  [<ffffffffa056c4a8>] hfi1_mmu_rb_search+0x38/0x70 [hfi1]
           [<ffffffffa05919cb>] user_sdma_free_request+0xcb/0x120 [hfi1]
           [<ffffffffa0593393>] user_sdma_txreq_cb+0x263/0x350 [hfi1]
           [<ffffffffa057fad7>] ? sdma_txclean+0x27/0x1c0 [hfi1]
           [<ffffffffa0593130>] ? user_sdma_send_pkts+0x1710/0x1710 [hfi1]
           [<ffffffffa057fdd6>] sdma_make_progress+0x166/0x480 [hfi1]
           [<ffffffff810762c9>] ? ttwu_do_wakeup+0x19/0xd0
           [<ffffffffa0581c7e>] sdma_engine_interrupt+0x8e/0x100 [hfi1]
           [<ffffffffa0546bdd>] sdma_interrupt+0x5d/0xa0 [hfi1]
           [<ffffffff81097e57>] handle_irq_event_percpu+0x47/0x1d0
           [<ffffffff81098017>] handle_irq_event+0x37/0x60
           [<ffffffff8109aa5f>] handle_edge_irq+0x6f/0x120
           [<ffffffff810044af>] handle_irq+0xbf/0x150
           [<ffffffff8104c9b7>] ? irq_enter+0x17/0x80
           [<ffffffff8150168d>] do_IRQ+0x4d/0xc0
           [<ffffffff814f7c6a>] common_interrupt+0x6a/0x6a
           <EOI>  [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0
           [<ffffffff814f56c6>] __schedule+0x3b6/0x7e0
           [<ffffffff810763a6>] __cond_resched+0x26/0x30
           [<ffffffff814f5eda>] _cond_resched+0x3a/0x50
           [<ffffffff814f4f82>] down_write+0x12/0x30
           [<ffffffffa0591619>] hfi1_release_user_pages+0x69/0x90 [hfi1]
           [<ffffffffa059173a>] sdma_rb_remove+0x9a/0xc0 [hfi1]
           [<ffffffffa056c00d>] __mmu_rb_remove.isra.5+0x5d/0x70 [hfi1]
           [<ffffffffa056c536>] hfi1_mmu_rb_remove+0x56/0x70 [hfi1]
           [<ffffffffa059427b>] hfi1_user_sdma_process_request+0x74b/0x1160 [hfi1]
           [<ffffffffa055c763>] hfi1_aio_write+0xc3/0x100 [hfi1]
           [<ffffffff8116a14c>] do_sync_readv_writev+0x4c/0x80
           [<ffffffff8116b58b>] do_readv_writev+0xbb/0x230
           [<ffffffff811a9da1>] ? fsnotify+0x241/0x320
           [<ffffffff81073524>] ? finish_task_switch+0x54/0xe0
           [<ffffffff8116b795>] vfs_writev+0x35/0x60
           [<ffffffff8116b8c9>] SyS_writev+0x49/0xc0
           [<ffffffff810cd876>] ? __audit_syscall_exit+0x1f6/0x2a0
           [<ffffffff814ff992>] system_call_fastpath+0x16/0x1b
      
      As evident from the backtrace above, the process was being put to sleep
      while holding the lock.
      
      Limiting the scope of the lock only to the RB tree operation fixes the
      above error allowing for proper locking and the process being put to
      sleep when needed.
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: default avatarDean Luick <dean.luick@intel.com>
      Signed-off-by: default avatarMitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      de82bdff
    • Mitko Haralanov's avatar
      IB/hfi1: Prevent NULL pointer deferences in caching code · f19bd643
      Mitko Haralanov authored
      There is a potential kernel crash when the MMU notifier calls the
      invalidation routines in the hfi1 pinned page caching code for sdma.
      
      The invalidation routine could call the remove callback
      for the node, which in turn ends up dereferencing the
      current task_struct to get a pointer to the mm_struct.
      However, the mm_struct pointer could be NULL resulting in
      the following backtrace:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
          IP: [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1]
          15
          task: ffff88085e66e080 ti: ffff88085c244000 task.ti: ffff88085c244000
          RIP: 0010:[<ffffffffa041f75a>]  [<ffffffffa041f75a>] sdma_rb_remove+0xaa/0x100 [hfi1]
          RSP: 0000:ffff88085c245878  EFLAGS: 00010002
          RAX: 0000000000000000 RBX: ffff88105b9bbd40 RCX: ffffea003931a830
          RDX: 0000000000000004 RSI: ffff88105754a9c0 RDI: ffff88105754a9c0
          RBP: ffff88085c245890 R08: ffff88105b9bbd70 R09: 00000000fffffffb
          R10: ffff88105b9bbd58 R11: 0000000000000013 R12: ffff88105754a9c0
          R13: 0000000000000001 R14: 0000000000000001 R15: ffff88105b9bbd40
          FS:  0000000000000000(0000) GS:ffff88107ef40000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00000000000000a8 CR3: 0000000001a0b000 CR4: 00000000001407e0
          Stack:
           ffff88105b9bbd40 ffff88080ec481a8 ffff88080ec481b8 ffff88085c2458c0
           ffffffffa03fa00e ffff88080ec48190 ffff88080ed9cd00 0000000001024000
           0000000000000000 ffff88085c245920 ffffffffa03fa0e7 0000000000000282
          Call Trace:
           [<ffffffffa03fa00e>] __mmu_rb_remove.isra.5+0x5e/0x70 [hfi1]
           [<ffffffffa03fa0e7>] mmu_notifier_mem_invalidate+0xc7/0xf0 [hfi1]
           [<ffffffffa03fa143>] mmu_notifier_page+0x13/0x20 [hfi1]
           [<ffffffff81156dd0>] __mmu_notifier_invalidate_page+0x50/0x70
           [<ffffffff81140bbb>] try_to_unmap_one+0x20b/0x470
           [<ffffffff81141ee7>] try_to_unmap_anon+0xa7/0x120
           [<ffffffff81141fad>] try_to_unmap+0x4d/0x60
           [<ffffffff8111fd7b>] shrink_page_list+0x2eb/0x9d0
           [<ffffffff81120ab3>] shrink_inactive_list+0x243/0x490
           [<ffffffff81121491>] shrink_lruvec+0x4c1/0x640
           [<ffffffff81121641>] shrink_zone+0x31/0x100
           [<ffffffff81121b0f>] kswapd_shrink_zone.constprop.62+0xef/0x1c0
           [<ffffffff811229e3>] kswapd+0x403/0x7e0
           [<ffffffff811225e0>] ? shrink_all_memory+0xf0/0xf0
           [<ffffffff81068ac0>] kthread+0xc0/0xd0
           [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40
           [<ffffffff814ff8ec>] ret_from_fork+0x7c/0xb0
           [<ffffffff81068a00>] ? insert_kthread_work+0x40/0x40
      
      To correct this, the mm_struct passed to us by the MMU notifier is
      used (which is what should have been done to begin with). This avoids
      the broken derefences and ensures that the correct mm_struct is used.
      Reviewed-by: default avatarDennis Dalessandro <dennis.dalessandro@intel.com>
      Reviewed-by: default avatarDean Luick <dean.luick@intel.com>
      Signed-off-by: default avatarMitko Haralanov <mitko.haralanov@intel.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      f19bd643
    • Sagi Grimberg's avatar
    • Sagi Grimberg's avatar
      IB/mlx5: Expose correct max_sge_rd limit · 986ef95e
      Sagi Grimberg authored
      mlx5 devices (Connect-IB, ConnectX-4, ConnectX-4-LX) has a limitation
      where rdma read work queue entries cannot exceed 512 bytes.
      A rdma_read wqe needs to fit in 512 bytes:
      - wqe control segment (16 bytes)
      - rdma segment (16 bytes)
      - scatter elements (16 bytes each)
      
      So max_sge_rd should be: (512 - 16 - 16) / 16 = 30.
      
      Cc: linux-stable@vger.kernel.org
      Reported-by: default avatarChristoph Hellwig <hch@lst.de>
      Tested-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarSagi Grimberg <sagig@grimberg.me>
      Signed-off-by: default avatarLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      986ef95e
  2. 27 Apr, 2016 9 commits
    • Linus Torvalds's avatar
      Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq · b75a2bf8
      Linus Torvalds authored
      Pull workqueue fix from Tejun Heo:
       "So, it turns out we had a silly bug in the most fundamental part of
        workqueue for a very long time.  AFAICS, this dates back to pre-git
        era and has quite likely been there from the time workqueue was first
        introduced.
      
        A work item uses its PENDING bit to synchronize multiple queuers.
        Anyone who wins the PENDING bit owns the pending state of the work
        item.  Whether a queuer wins or loses the race, one thing should be
        guaranteed - there will soon be at least one execution of the work
        item - where "after" means that the execution instance would be able
        to see all the changes that the queuer has made prior to the queueing
        attempt.
      
        Unfortunately, we were missing a smp_mb() after clearing PENDING for
        execution, so nothing guaranteed visibility of the changes that a
        queueing loser has made, which manifested as a reproducible blk-mq
        stall.
      
        Lots of kudos to Roman for debugging the problem.  The patch for
        -stable is the minimal one.  For v3.7, Peter is working on a patch to
        make the code path slightly more efficient and less fragile"
      
      * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
        workqueue: fix ghost PENDING flag while doing MQ IO
      b75a2bf8
    • Linus Torvalds's avatar
      Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 763cfc86
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "Two patches to fix a deadlock which can be easily triggered if memcg
        charge moving is used.
      
        This bug was introduced while converting threadgroup locking to a
        global percpu_rwsem and is caused by cgroup controller task migration
        path depending on the ability to create new kthreads.  cpuset had a
        similar issue which was fixed by performing heavy-lifting operations
        asynchronous to task migration.  The two patches fix the same issue in
        memcg in a similar way.  The first patch makes the mechanism generic
        and the second relocates memcg charge moving outside the migration
        path.
      
        Given that we don't want to perform heavy operations while
        writelocking threadgroup lock anyway, moving them out of the way is a
        desirable solution.  One thing to note is that the problem was
        difficult to debug because lockdep couldn't figure out the deadlock
        condition.  Looking into how to improve that"
      
      * 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        memcg: relocate charge moving from ->attach to ->post_attach
        cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
      763cfc86
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 3118e5f9
      Linus Torvalds authored
      Pull i2c fixes from Wolfram Sang:
       "I2C has one buildfix, one ABBA deadlock fix, and three simple 'add ID'
        patches"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared
        i2c: cpm: Fix build break due to incompatible pointer types
        i2c: ismt: Add Intel DNV PCI ID
        i2c: xlp9xx: add support for Broadcom Vulcan
        i2c: rk3x: add support for rk3228
      3118e5f9
    • Linus Torvalds's avatar
      Merge tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc · 24131a61
      Linus Torvalds authored
      Pull ARC fixes from Vineet Gupta:
      
       - lockdep now works for ARCv2 builds
      
       - enable DT reserved-memory binding (for forthcoming HDMI driver)
      
      * tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
        ARC: add support for reserved memory defined by device tree
        ARC: support generic per-device coherent dma mem
        Documentation: dt: arc: fix spelling mistakes
        ARCv2: Enable LOCKDEP
      24131a61
    • Linus Torvalds's avatar
      Merge tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 · 508fea71
      Linus Torvalds authored
      Pull arch/nios2 fix from Ley Foon Tan:
       "memset: use the right constraint modifier for the %4 output operand"
      
      * tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
        nios2: memset: use the right constraint modifier for the %4 output operand
      508fea71
    • Linus Torvalds's avatar
      Merge tag 'platform-drivers-x86-v4.6-3' of... · 9453203b
      Linus Torvalds authored
      Merge tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
      
      Pull x86 platform driver fix from Darren Hart:
       "Fix regression caused by hotkey enabling value in toshiba_acpi"
      
      * tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
        toshiba_acpi: Fix regression caused by hotkey enabling value
      9453203b
    • Alexey Brodkin's avatar
      ARC: add support for reserved memory defined by device tree · 1b10cb21
      Alexey Brodkin authored
      Enable reserved memory initialization from device tree.
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      1b10cb21
    • Alexey Brodkin's avatar
      ARC: support generic per-device coherent dma mem · 32ed9a0e
      Alexey Brodkin authored
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      32ed9a0e
    • Romain Perier's avatar
      nios2: memset: use the right constraint modifier for the %4 output operand · a8950e49
      Romain Perier authored
      Depending on the size of the area to be memset'ed, the nios2 memset implementation
      either uses a naive loop (for buffers smaller or equal than 8 bytes) or a more optimized
      implementation (for buffers larger than 8 bytes). This implementation does 4-byte stores
      rather than 1-byte stores to speed up memset.
      
      However, we discovered that on our nios2 platform, memset() was not properly setting the
      buffer to the expected value. A memset of 0xff would not set the entire buffer to 0xff, but to:
      
      0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 ...
      
      Which is obviously incorrect. Our investigation has revealed that the problem lies in the
      incorrect constraints used in the inline assembly.
      
      The following piece of assembly, from the nios2 memset implementation, is supposed to
      create a 4-byte value that repeats 4 times the 1-byte pattern passed as memset argument:
      
      /* fill8 %3, %5 (c & 0xff) */
      "       slli    %4, %5, 8\n"
      "       or      %4, %4, %5\n"
      "       slli    %3, %4, 16\n"
      "       or      %3, %3, %4\n"
      
      However, depending on the compiler and optimization level, this code might be compiled as:
      
      34:	280a923a 	slli	r5,r5,8
      38:	294ab03a 	or	r5,r5,r5
      3c:	2808943a 	slli	r4,r5,16
      40:	2148b03a 	or	r4,r4,r5
      
      This is wrong because r5 gets used both for %5 and %4, which leads to the final pattern
      stored in r4 to be 0xff00ff00 rather than the expected 0xffffffff.
      
      %4 is defined with the "=r" constraint, i.e as an output operand. However, as explained in
      http://www.ethernut.de/en/documents/arm-inline-asm.html, this does not prevent gcc from
      using the same register for an output operand (%4) and input operand (%5). By using the
      constraint modifier '&', we indicate that the register should be used for output only. With this
      change, we get the following assembly output:
      
      34:	2810923a 	slli	r8,r5,8
      38:	4150b03a 	or	r8,r8,r5
      3c:	400e943a 	slli	r7,r8,16
      40:	3a0eb03a 	or	r7,r7,r8
      
      Which correctly produces the 0xffffffff pattern when 0xff is passed as the memset() pattern.
      
      It is worth mentioning the observed consequence of this bug: we were hitting the kernel
      BUG() in mm/bootmem.c:__free() that verifies when marking a page as free that it was
      previously marked as occupied (i.e that the bit was set to 1). The entire bootmem bitmap is
      set to 0xff bit via a memset() during the bootmem initialization. The bootmem_free() call right
      after the initialization was finding some bits to be set to 0, which didn't make sense since the
      bitmap has just been memset'ed to 0xff. Except that due to the bug explained above, the
      bitmap was in fact initialized to 0xff00ff00.
      
      Thanks to Marek Vasut for his help and feedback.
      Signed-off-by: default avatarRomain Perier <romain.perier@free-electrons.com>
      Acked-by: default avatarMarek Vasut <marex@denx.de>
      Acked-by: default avatarLey Foon Tan <lftan@altera.com>
      a8950e49
  3. 26 Apr, 2016 12 commits
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · f28f20da
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Handle v4/v6 mixed sockets properly in soreuseport, from Craig
          Gallak.
      
       2) Bug fixes for the new macsec facility (missing kmalloc NULL checks,
          missing locking around netdev list traversal, etc.) from Sabrina
          Dubroca.
      
       3) Fix handling of host routes on ifdown in ipv6, from David Ahern.
      
       4) Fix double-fdput in bpf verifier.  From Jann Horn.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
        bpf: fix double-fdput in replace_map_fd_with_map_ptr()
        net: ipv6: Delete host routes on an ifdown
        Revert "ipv6: Revert optional address flusing on ifdown."
        net/mlx4_en: fix spurious timestamping callbacks
        net: dummy: remove note about being Y by default
        cxgbi: fix uninitialized flowi6
        ipv6: Revert optional address flusing on ifdown.
        ipv4/fib: don't warn when primary address is missing if in_dev is dead
        net/mlx5: Add pci shutdown callback
        net/mlx5_core: Remove static from local variable
        net/mlx5e: Use vport MTU rather than physical port MTU
        net/mlx5e: Fix minimum MTU
        net/mlx5e: Device's mtu field is u16 and not int
        net/mlx5_core: Add ConnectX-5 to list of supported devices
        net/mlx5e: Fix MLX5E_100BASE_T define
        net/mlx5_core: Fix soft lockup in steering error flow
        qlcnic: Update version to 5.3.64
        net: stmmac: socfpga: Remove re-registration of reset controller
        macsec: fix netlink attribute validation
        macsec: add missing macsec prefix in uapi
        ...
      f28f20da
    • Linus Torvalds's avatar
      Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 91ea692f
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "Here are the latest bug fixes for ARM SoCs, mostly addressing recent
        regressions.  Changes are across several platforms, so I'm listing
        every change separately here.
      
        Regressions since 4.5:
      
         - A correction of the psci firmware DT binding, to prevent users from
           relying on unintended semantics
      
         - Actually getting the newly merged clock driver for some OMAP
           platforms to work
      
         - A revert of patches for the Qualcomm BAM, these need to be reworked
           for 4.7 to avoid breaking boards other than the one they were
           intended for
      
         - A correction for the I2C device nodes on the Socionext Uniphier
           platform
      
         - i.MX SDHCI was broken for non-DT platforms due to a change with the
           setting of the DMA mask
      
         - A revert of a patch that accidentally added a nonexisting clock on
           the Rensas "Porter" board
      
         - A couple of OMAP fixes that are all related to suspend after the
           power domain changes for dra7
      
         - On Mediatek, revert part of the power domain initialization changes
           that broke mt8173-evb
      
        Fixes for older bugs:
      
         - Workaround for an "external abort" in the omap34xx suspend/resume
           code.
      
         - The USB1/eSATA should not be listed as an excon device on
           am57xx-beagle-x15 (broken since v4.0)
      
         - A v4.5 regression in the TI AM33xx and AM43XX DT specifying
           incorrect DMA request lines for the GPMC
      
         - The jiffies calibration on Renesas platforms was incorrect for some
           modern CPU cores.
      
         - A hardware errata woraround for clockdomains on TI DRA7"
      
      * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        drivers: firmware: psci: unify enable-method binding on ARM {64,32}-bit systems
        arm64: dts: uniphier: fix I2C nodes of PH1-LD20
        ARM: shmobile: timer: Fix preset_lpj leading to too short delays
        Revert "ARM: dts: porter: Enable SCIF_CLK frequency and pins"
        ARM: dts: r8a7791: Don't disable referenced optional clocks
        Revert "ARM: OMAP: Catch callers of revision information prior to it being populated"
        ARM: OMAP3: Fix external abort on 36xx waking from off mode idle
        ARM: dts: am57xx-beagle-x15: remove extcon_usb1
        ARM: dts: am437x: Fix GPMC dma properties
        ARM: dts: am33xx: Fix GPMC dma properties
        Revert "soc: mediatek: SCPSYS: Fix double enabling of regulators"
        ARM: mach-imx: sdhci-esdhc-imx: initialize DMA mask
        ARM: DRA7: clockdomain: Implement timer workaround for errata i874
        ARM: OMAP: Catch callers of revision information prior to it being populated
        ARM: dts: dra7: Correct clock tree for sys_32k_ck
        ARM: OMAP: DRA7: Provide proper class to omap2_set_globals_tap
        ARM: OMAP: DRA7: wakeupgen: Skip SAR save for wakeupgen
        Revert "dts: msm8974: Add dma channels for blsp2_i2c1 node"
        Revert "dts: msm8974: Add blsp2_bam dma node"
        ARM: dts: Add clocks for dm814x ADPLL
      91ea692f
    • Linus Torvalds's avatar
      devpts: more pty driver interface cleanups · 8ead9dd5
      Linus Torvalds authored
      This is more prep-work for the upcoming pty changes.  Still just code
      cleanup with no actual semantic changes.
      
      This removes a bunch pointless complexity by just having the slave pty
      side remember the dentry associated with the devpts slave rather than
      the inode.  That allows us to remove all the "look up the dentry" code
      for when we want to remove it again.
      
      Together with moving the tty pointer from "inode->i_private" to
      "dentry->d_fsdata" and getting rid of pointless inode locking, this
      removes about 30 lines of code.  Not only is the end result smaller,
      it's simpler and easier to understand.
      
      The old code, for example, depended on the d_find_alias() to not just
      find the dentry, but also to check that it is still hashed, which in
      turn validated the tty pointer in the inode.
      
      That is a _very_ roundabout way to say "invalidate the cached tty
      pointer when the dentry is removed".
      
      The new code just does
      
      	dentry->d_fsdata = NULL;
      
      in devpts_pty_kill() instead, invalidating the tty pointer rather more
      directly and obviously.  Don't do something complex and subtle when the
      obvious straightforward approach will do.
      
      The rest of the patch (ie apart from code deletion and the above tty
      pointer clearing) is just switching the calling convention to pass the
      dentry or file pointer around instead of the inode.
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Aurelien Jarno <aurelien@aurel32.net>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Greg KH <greg@kroah.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ead9dd5
    • Jann Horn's avatar
      bpf: fix double-fdput in replace_map_fd_with_map_ptr() · 8358b02b
      Jann Horn authored
      When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
      references a non-map file descriptor as a map file descriptor, the error
      handling code called fdput() twice instead of once (in __bpf_map_get() and
      in replace_map_fd_with_map_ptr()). If the file descriptor table of the
      current task is shared, this causes f_count to be decremented too much,
      allowing the struct file to be freed while it is still in use
      (use-after-free). This can be exploited to gain root privileges by an
      unprivileged user.
      
      This bug was introduced in
      commit 0246e64d ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
      exploitable since
      commit 1be7f75d ("bpf: enable non-root eBPF programs") because
      previously, CAP_SYS_ADMIN was required to reach the vulnerable code.
      
      (posted publicly according to request by maintainer)
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8358b02b
    • Hariprasad S's avatar
      RDMA/iw_cxgb4: Fix bar2 virt addr calculation for T4 chips · 32cc92c7
      Hariprasad S authored
      For T4, kernel mode qps don't use the user doorbell. User mode qps during
      flow control db ringing are forced into kernel, where user doorbell is
      treated as kernel doorbell and proper bar2 offset in bar2 virtual space is
      calculated, which incase of T4 is a bogus address, causing a kernel panic
      due to illegal write during doorbell ringing.
      In case of T4, kernel mode qp bar2 virtual address should be 0. Added T4
      check during bar2 virtual address calculation to return 0. Fixed Bar2
      range checks based on bar2 physical address.
      
      The below oops will be fixed
      
        <1>BUG: unable to handle kernel paging request at 000000000002aa08
        <1>IP: [<ffffffffa011d800>] c4iw_uld_control+0x4e0/0x880 [iw_cxgb4]
        <4>PGD 1416a8067 PUD 15bf35067 PMD 0
        <4>Oops: 0002 [#1] SMP
        <4>last sysfs file:
        /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.4/infiniband/cxgb4_0/node_guid
        <4>CPU 5
        <4>Modules linked in: rdma_ucm rdma_cm ib_cm ib_sa ib_mad ib_uverbs
        ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE
        iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack
        ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge autofs4
        target_core_iblock target_core_file target_core_pscsi target_core_mod
        configfs bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q
        garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf vhost_net macvtap
        macvlan tun kvm uinput microcode iTCO_wdt iTCO_vendor_support sg joydev
        serio_raw i2c_i801 i2c_core lpc_ich mfd_core e1000e ptp pps_core ioatdma dca
        i7core_edac edac_core shpchp ext3 jbd mbcache sd_mod crc_t10dif pata_acpi
        ata_generic ata_piix iw_cxgb4 iw_cm ib_core ib_addr cxgb4 ipv6 dm_mirror
        dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
        <4>
        Supermicro X8ST3/X8ST3
        <4>RIP: 0010:[<ffffffffa011d800>]  [<ffffffffa011d800>]
        c4iw_uld_control+0x4e0/0x880 [iw_cxgb4]
        <4>RSP: 0000:ffff880155a03db0  EFLAGS: 00010006
        <4>RAX: 000000000000001d RBX: ffff88013ae5fc00 RCX: ffff880155adb180
        <4>RDX: 000000000002aa00 RSI: 0000000000000001 RDI: ffff88013ae5fdf8
        <4>RBP: ffff880155a03e10 R08: 0000000000000000 R09: 0000000000000001
        <4>R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        <4>R13: 000000000000001d R14: ffff880156414ab0 R15: ffffe8ffffc05b88
        <4>FS:  0000000000000000(0000) GS:ffff8800282a0000(0000) knlGS:0000000000000000
        <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
        <4>CR2: 000000000002aa08 CR3: 000000015bd0e000 CR4: 00000000000007e0
        <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        <4>Process cxgb4 (pid: 394, threadinfo ffff880155a00000, task ffff880156414ab0)
        <4>Stack:
        <4> ffff880156415068 ffff880155adb180 ffff880155a03df0 ffffffffa00a344b
        <4><d> 00000000000003e8 ffff880155920000 0000000000000004 ffff880155920000
        <4><d> ffff88015592d438 ffffffffa00a3860 ffff880155a03fd8 ffffe8ffffc05b88
        <4>Call Trace:
        <4> [<ffffffffa00a344b>] ? enable_txq_db+0x2b/0x80 [cxgb4]
        <4> [<ffffffffa00a3860>] ? process_db_full+0x0/0xa0 [cxgb4]
        <4> [<ffffffffa00a38a6>] process_db_full+0x46/0xa0 [cxgb4]
        <4> [<ffffffff8109fda0>] worker_thread+0x170/0x2a0
        <4> [<ffffffff810a6aa0>] ? autoremove_wake_function+0x0/0x40
        <4> [<ffffffff8109fc30>] ? worker_thread+0x0/0x2a0
        <4> [<ffffffff810a660e>] kthread+0x9e/0xc0
        <4> [<ffffffff8100c28a>] child_rip+0xa/0x20
        <4> [<ffffffff810a6570>] ? kthread+0x0/0xc0
        <4> [<ffffffff8100c280>] ? child_rip+0x0/0x20
        <4>Code: e9 ba 00 00 00 66 0f 1f 44 00 00 44 8b 05 29 07 02 00 45 85 c0 0f 85
        71 02 00 00 8b 83 70 01 00 00 45 0f b7 ed c1 e0 0f 44 09 e8 <89> 42 08 0f ae f8
        66 c7 83 82 01 00 00 00 00 44 0f b7 ab dc 01
        <1>RIP  [<ffffffffa011d800>] c4iw_uld_control+0x4e0/0x880 [iw_cxgb4]
        <4> RSP <ffff880155a03db0>
        <4>CR2: 000000000002aa08`
      
      Based on original work by Bharat Potnuri <bharat@chelsio.com>
      
      Fixes: 74217d4c ("iw_cxgb4: support for bar2 qid densities exceeding the page size")
      Signed-off-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarHariprasad Shenai <hariprasad@chelsio.com>
      Reviewed-by: default avatarLeon Romanovsky <leon@leon.nu>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      32cc92c7
    • Steve Wise's avatar
      iw_cxgb4: handle draining an idle qp · 40edd7fd
      Steve Wise authored
      In c4iw_drain_sq/rq(), if the particular queue is already empty
      then don't block.
      
      Fixes: ce4af14d94aa ('iw_cxgb4: add queue drain functions')
      Signed-off-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      40edd7fd
    • Steve Wise's avatar
      iw_cxgb3: initialize ibdev.iwcm->ifname for port mapping · ad202348
      Steve Wise authored
      The IWCM uses ibdev.iwcm->ifname for registration with the iwarp
      port map daemon.  But iw_cxgb3 did not initialize this field which
      causes intermittent registration failures based on the contents of the
      uninitialized memory.
      
      Fixes: c1340e8a ("iw_cxgb3: support for iWARP port mapping")
      Signed-off-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      ad202348
    • Steve Wise's avatar
      iw_cxgb4: initialize ibdev.iwcm->ifname for port mapping · 851d7b6b
      Steve Wise authored
      The IWCM uses ibdev.iwcm->ifname for registration with the iwarp
      port map daemon.  But iw_cxgb4 did not initialize this field which
      causes intermittent registration failures based on the contents of the
      uninitialized memory.
      
      Fixes: 170003c8 ("iw_cxgb4: remove port mapper related code")
      Signed-off-by: default avatarSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      851d7b6b
    • Sagi Grimberg's avatar
      IB/core: Don't drain non-existent rq queue-pair · 42235f80
      Sagi Grimberg authored
      The drain_rq function expects a normal receive qp to drain.  A qp can
      only have either a normal rq or an srq.  If there is an srq, there
      is no rq to drain.  Until the API supports draining SRQs, simply
      skip draining the rq when the qp has an srq attached.
      
      Fixes: 765d6774 ("IB: new common API for draining queues")
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarDoug Ledford <dledford@redhat.com>
      42235f80
    • David Ahern's avatar
      net: ipv6: Delete host routes on an ifdown · 38bd10c4
      David Ahern authored
      It was a simple idea -- save IPv6 configured addresses on a link down
      so that IPv6 behaves similar to IPv4. As always the devil is in the
      details and the IPv6 stack as too many behavioral differences from IPv4
      making the simple idea more complicated than it needs to be.
      
      The current implementation for keeping IPv6 addresses can panic or spit
      out a warning in one of many paths:
      
      1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
         rt6_fill_node while handling a route dump request.
      
      2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
         fib6_del
      
      3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
      
      The root cause of all these is references related to the host route for
      an address that is retained.
      
      So, this patch deletes the host route every time the ifdown loop runs.
      Since the host route is deleted and will be re-generated an up there is
      no longer a need for the l3mdev fix up. On the 'admin up' side move
      addrconf_permanent_addr into the NETDEV_UP event handling so that it
      runs only once versus on UP and CHANGE events.
      
      All of the current panics and warnings appear to be related to
      addresses on the loopback device, but given the catastrophic nature when
      a bug is triggered this patch takes the conservative approach and evicts
      all host routes rather than trying to determine when it can be re-used
      and when it can not. That can be a later optimizaton if desired.
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      38bd10c4
    • David S. Miller's avatar
      Revert "ipv6: Revert optional address flusing on ifdown." · 6a923934
      David S. Miller authored
      This reverts commit 841645b5.
      
      Ok, this puts the feature back.  I've decided to apply David A.'s
      bug fix and run with that rather than make everyone wait another
      whole release for this feature.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6a923934
    • Roman Pen's avatar
      workqueue: fix ghost PENDING flag while doing MQ IO · 346c09f8
      Roman Pen authored
      The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
      with the following backtrace:
      
      [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
      [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
      [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
      [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
      [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
      [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
      [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
      [  601.350965] Call Trace:
      [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
      [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
      [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
      [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
      [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
      [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
      [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
      [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
      [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
      [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
      [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
      [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
      [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
      [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
      [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
      [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
      [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
      [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
      [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
      [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
      
      The underlying device is a null_blk, with default parameters:
      
        queue_mode    = MQ
        submit_queues = 1
      
      Verification that nullb0 has something inflight:
      
      root@pserver8:~# cat /sys/block/nullb0/inflight
             0        1
      root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
      ...
      /sys/block/nullb0/mq/0/cpu2/rq_list
      CTX pending:
              ffff8838038e2400
      ...
      
      During debug it became clear that stalled request is always inserted in
      the rq_list from the following path:
      
         save_stack_trace_tsk + 34
         blk_mq_insert_requests + 231
         blk_mq_flush_plug_list + 281
         blk_flush_plug_list + 199
         wait_on_page_bit + 192
         __filemap_fdatawait_range + 228
         filemap_fdatawait_range + 20
         filemap_write_and_wait_range + 63
         blkdev_fsync + 27
         vfs_fsync_range + 73
         blkdev_write_iter + 202
         __vfs_write + 170
         vfs_write + 169
         kernel_write + 56
      
      So blk_flush_plug_list() was called with from_schedule == true.
      
      If from_schedule is true, that means that finally blk_mq_insert_requests()
      offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
      i.e. it calls kblockd_schedule_delayed_work_on().
      
      That means, that we race with another CPU, which is about to execute
      __blk_mq_run_hw_queue() work.
      
      Further debugging shows the following traces from different CPUs:
      
        CPU#0                                  CPU#1
        ----------------------------------     -------------------------------
        reqeust A inserted
        STORE hctx->ctx_map[0] bit marked
        kblockd_schedule...() returns 1
        <schedule to kblockd workqueue>
                                               request B inserted
                                               STORE hctx->ctx_map[1] bit marked
                                               kblockd_schedule...() returns 0
        *** WORK PENDING bit is cleared ***
        flush_busy_ctxs() is executed, but
        bit 1, set by CPU#1, is not observed
      
      As a result request B pended forever.
      
      This behaviour can be explained by speculative LOAD of hctx->ctx_map on
      CPU#0, which is reordered with clear of PENDING bit and executed _before_
      actual STORE of bit 1 on CPU#1.
      
      The proper fix is an explicit full barrier <mfence>, which guarantees
      that clear of PENDING bit is to be executed before all possible
      speculative LOADS or STORES inside actual work function.
      Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
      Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
      Cc: Michael Wang <yun.wang@profitbricks.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      346c09f8