1. 03 Jul, 2018 23 commits
  2. 26 Jun, 2018 17 commits
    • Linux 4.14.52 · a26899e0
      Greg Kroah-Hartman authored
    • mm, page_alloc: do not break __GFP_THISNODE by zonelist reset · 1d26c112
      Vlastimil Babka authored
      commit 7810e678 upstream.
      
      In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
      allocations that can ignore memory policies.  The zonelist is obtained
      from current CPU's node.  This is a problem for __GFP_THISNODE
      allocations that want to allocate on a different node, e.g.  because the
      allocating thread has been migrated to a different CPU.
      
      This has been observed to break SLAB in our 4.4-based kernel, because
      there it relies on __GFP_THISNODE working as intended.  If a slab page
      is put on wrong node's list, then further list manipulations may corrupt
      the list because page_to_nid() is used to determine which node's
      list_lock should be locked and thus we may take a wrong lock and race.
      
      Current SLAB implementation seems to be immune by luck thanks to commit
      511e3a05 ("mm/slab: make cache_grow() handle the page allocated on
      arbitrary node") but there may be others assuming that __GFP_THISNODE
      works as promised.
      
      We can fix it by simply removing the zonelist reset completely.  There
      is actually no reason to reset it, because memory policies and cpusets
      don't affect the zonelist choice in the first place.  This was different
      when commit 183f6371 ("mm: ignore mempolicies when using
      ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
      own restricted zonelists.
      
      We might consider this for 4.17 although I don't know if there's
      anything currently broken.
      
      SLAB is currently not affected, but in kernels older than 4.7 that don't
      yet have 511e3a05 ("mm/slab: make cache_grow() handle the page
      allocated on arbitrary node") it is.  That's at least 4.4 LTS.  Older
      ones I'll have to check.
      
      So stable backports should be more important, but will have to be
      reviewed carefully, as the code went through many changes.  BTW I think
      that also the ac->preferred_zoneref reset is currently useless if we
      don't also reset ac->nodemask from a mempolicy to NULL first (which we
      probably should for the OOM victims etc?), but I would leave that for a
      separate patch.
      
      Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Fixes: 183f6371 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • fs/binfmt_misc.c: do not allow offset overflow · 250edf95
      Thadeu Lima de Souza Cascardo authored
      commit 5cc41e09 upstream.
      
      When registering a new binfmt_misc handler, it is possible to overflow
      the offset to get a negative value, which might crash the system, or
      possibly leak kernel data.
      
      Here is a crash log when 2500000000 was used as an offset:
      
        BUG: unable to handle kernel paging request at ffff989cfd6edca0
        IP: load_misc_binary+0x22b/0x470 [binfmt_misc]
        PGD 1ef3e067 P4D 1ef3e067 PUD 0
        Oops: 0000 [#1] SMP NOPTI
        Modules linked in: binfmt_misc kvm_intel ppdev kvm irqbypass joydev input_leds serio_raw mac_hid parport_pc qemu_fw_cfg parpy
        CPU: 0 PID: 2499 Comm: bash Not tainted 4.15.0-22-generic #24-Ubuntu
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
        RIP: 0010:load_misc_binary+0x22b/0x470 [binfmt_misc]
        Call Trace:
          search_binary_handler+0x97/0x1d0
          do_execveat_common.isra.34+0x667/0x810
          SyS_execve+0x31/0x40
          do_syscall_64+0x73/0x130
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Use kstrtoint instead of simple_strtoul.  It will work as the code
      already set the delimiter byte to '\0' and we only do it when the field
      is not empty.
      
      Tested with offsets -1, 2500000000, UINT_MAX and INT_MAX.  Also tested
      with examples documented at Documentation/admin-guide/binfmt-misc.rst
      and other registrations from packages on Ubuntu.
      
      Link: http://lkml.kernel.org/r/20180529135648.14254-1-cascardo@canonical.com
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • vhost: fix info leak due to uninitialized memory · 7446344b
      Michael S. Tsirkin authored
      commit 670ae9ca upstream.
      
      struct vhost_msg within struct vhost_msg_node is copied to userspace.
      Unfortunately it turns out on 64 bit systems vhost_msg has padding after
      type which gcc doesn't initialize, leaking 4 uninitialized bytes to
      userspace.
      
      This padding also unfortunately means 32 bit users of this interface are
      broken on a 64 bit kernel which will need to be fixed separately.
      
      Fixes: CVE-2018-1118
      Cc: stable@vger.kernel.org
      Reported-by: Kevin Easton <kevin@guarana.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reported-by: syzbot+87cfa083e727a224754b@syzkaller.appspotmail.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • HID: wacom: Correct logical maximum Y for 2nd-gen Intuos Pro large · d37c95f5
      Jason Gerecke authored
      commit d471b6b2 upstream.
      
      The HID descriptor for the 2nd-gen Intuos Pro large (PTH-860) contains
      a typo which defines an incorrect logical maximum Y value. This causes
      a small portion of the bottom of the tablet to become unusable (both
      because the area is below the "bottom" of the tablet and because
      'wacom_wac_event' ignores out-of-range values). It also results in a
      skewed aspect ratio.
      
      To fix this, we add a quirk to 'wacom_usage_mapping' which overwrites
      the data with the correct value.

      Signed-off-by: Jason Gerecke <jason.gerecke@wacom.com>
      CC: stable@vger.kernel.org # v4.10+
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • HID: intel_ish-hid: ipc: register more pm callbacks to support hibernation · ab17de60
      Even Xu authored
      commit ebeaa367 upstream.
      
      The current ISH driver only registers suspend/resume PM callbacks, which
      don't support hibernation (suspend to disk). After hibernation, the ISH
      can't resume properly and the user may not see sensor events (for example,
      screen rotation may not work).
      
      User will not see a crash or panic or anything except the following message
      in log:
      
      	hid-sensor-hub 001F:8086:22D8.0001: timeout waiting for response from ISHTP device
      
      So this patch adds support for S4/hibernation to ISH by using the
      SIMPLE_DEV_PM_OPS() macro instead of struct dev_pm_ops directly. The suspend
      and resume functions will now be used for both suspend to RAM and hibernation.
      
      If power management is disabled, SIMPLE_DEV_PM_OPS will do nothing, the suspend
      and resume related functions won't be used, so mark them as __maybe_unused to
      clarify that this is the intended behavior, and remove #ifdefs for power
      management.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Even Xu <even.xu@intel.com>
      Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • orangefs: report attributes_mask and attributes for statx · e3e6bd6a
      Martin Brandenburg authored
      commit 7f54910f upstream.
      
      OrangeFS formerly failed to set attributes_mask with the result that
      software could not see immutable and append flags present in the
      filesystem.

      Reported-by: Becky Ligon <ligon@clemson.edu>
      Signed-off-by: Martin Brandenburg <martin@omnibond.com>
      Fixes: 68a24a6c ("orangefs: implement statx")
      Cc: stable@vger.kernel.org
      Cc: hubcap@omnibond.com
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • orangefs: set i_size on new symlink · f7e4328c
      Martin Brandenburg authored
      commit f6a4b4c9 upstream.
      
      As long as a symlink inode remains in-core, the destination (and
      therefore size) will not be re-fetched from the server, as it cannot
      change.  The original implementation of the attribute cache assumed that
      setting the expiry time in the past was sufficient to cause a re-fetch
      of all attributes on the next getattr.  That does not work in this case.
      
      The bug manifested itself as follows.  When the command sequence
      
      touch foo; ln -s foo bar; ls -l bar
      
      is run, the output was
      
      lrwxrwxrwx. 1 fedora fedora 4906 Apr 24 19:10 bar -> foo
      
      However, after a re-mount, ls -l bar produces
      
      lrwxrwxrwx. 1 fedora fedora    3 Apr 24 19:10 bar -> foo
      
      After this commit, even before a re-mount, the output is
      
      lrwxrwxrwx. 1 fedora fedora    3 Apr 24 19:10 bar -> foo

      Reported-by: Becky Ligon <ligon@clemson.edu>
      Signed-off-by: Martin Brandenburg <martin@omnibond.com>
      Fixes: 71680c18 ("orangefs: Cache getattr results.")
      Cc: stable@vger.kernel.org
      Cc: hubcap@omnibond.com
      Signed-off-by: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • iwlwifi: fw: harden page loading code · b8511dbf
      Luca Coelho authored
      commit 9039d985 upstream.
      
      The page loading code trusts the data provided in the firmware images
      a bit too much and may cause a buffer overflow or copy unknown data if
      the block sizes don't match what we expect.
      
      To prevent potential problems, harden the code by checking if the
      sizes we are copying are what we expect.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/intel_rdt: Enable CMT and MBM on new Skylake stepping · 2d58a9ac
      Tony Luck authored
      commit 1d9f3e20 upstream.
      
      New stepping of Skylake has fixes for cache occupancy and memory
      bandwidth monitoring.
      
      Update the code to enable these by default on newer steppings.

      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: stable@vger.kernel.org # v4.14
      Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180608160732.9842-1-tony.luck@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • w1: mxc_w1: Enable clock before calling clk_get_rate() on it · e6ef46cb
      Stefan Potyra authored
      commit 955bc613 upstream.
      
      According to the API, you may only call clk_get_rate() on a clock after
      actually enabling that clock.
      
      Found by Linux Driver Verification project (linuxtesting.org).
      
      Fixes: a5fd9139 ("w1: add 1-wire master driver for i.MX27 / i.MX31")
      Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com>
      Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk · 0667483a
      Hans de Goede authored
      commit 2cfce3a8 upstream.
      
      Commit 184add2c ("libata: Apply NOLPM quirk for SanDisk
      SD7UB3Q*G1001 SSDs") disabled LPM for SanDisk SD7UB3Q*G1001 SSDs.
      
      This has led to several reports from users of that SSD for whom LPM
      was working fine and who now have significantly increased idle
      power consumption on their laptops.
      
      Likely there is another problem on the T450s from the original
      reporter which gets exposed by the uncore reaching deeper sleep
      states (higher PC-states) due to LPM being enabled. The problem as
      reported, a hardfreeze about once a day, already did not sound like
      it would be caused by LPM and the reports of the SSD working fine
      confirm this. The original reporter is ok with dropping the quirk.
      
      A X250 user has reported the same hard freeze problem and for him
      the problem went away after unrelated updates, I suspect some GPU
      driver stack changes fixed things.
      
      TL;DR: The original reporter's problems were triggered by LPM but were
      not an LPM issue, so drop the quirk for the SSD in question.
      
      BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1583207
      Cc: stable@vger.kernel.org
      Cc: Richard W.M. Jones <rjones@redhat.com>
      Cc: Lorenzo Dalrio <lorenzo.dalrio@gmail.com>
      Reported-by: Lorenzo Dalrio <lorenzo.dalrio@gmail.com>
      Signed-off-by: Hans de Goede <hdegoede@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: "Richard W.M. Jones" <rjones@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • libata: zpodd: small read overflow in eject_tray() · 27c0f1e5
      Dan Carpenter authored
      commit 18c9a99b upstream.
      
      We read from the cdb[] buffer in ata_exec_internal_sg().  It has to be
      ATAPI_CDB_LEN (16) bytes long, but this buffer is only 12 bytes.
      
      Fixes: 21334205 ("libata: handle power transition of ODD")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpufreq: governors: Fix long idle detection logic in load calculation · 1404d2e5
      Chen Yu authored
      commit 75920196 upstream.
      
      According to current code implementation, detecting the long
      idle period is done by checking if the interval between two
      adjacent utilization update handlers is long enough. Although
      this mechanism can detect if the idle period is long enough
      (no utilization hooks invoked during idle period), it might
      not cover a corner case: if the task has occupied the CPU
      for too long which causes no context switches during that
      period, then no utilization handler will be launched until this
      high prio task is scheduled out. As a result, the idle_periods
      field might be calculated incorrectly because it regards the
      100% load as 0%, which confuses the conservative governor that
      uses this field.
      
      Change the detection to compare the idle_time with sampling_rate
      directly.

      Reported-by: Artem S. Tashkinov <t.artem@mailcity.com>
      Signed-off-by: Chen Yu <yu.c.chen@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: All applicable <stable@vger.kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • cpufreq: Fix new policy initialization during limits updates via sysfs · c3c77b5d
      Tao Wang authored
      commit c7d1f119 upstream.
      
      If the policy limits are updated via cpufreq_update_policy() and
      subsequently via sysfs, the limits stored in user_policy may be
      set incorrectly.
      
      For example, if both min and max are set via sysfs to the maximum
      available frequency, user_policy.min and user_policy.max will also
      be the maximum.  If a policy notifier triggered by
      cpufreq_update_policy() lowers both the min and the max at this
      point, that change is not reflected by the user_policy limits, so
      if the max is updated again via sysfs to the same lower value,
      then user_policy.max will be lower than user_policy.min which
      shouldn't happen.  In particular, if one of the policy CPUs is
      then taken offline and back online, cpufreq_set_policy() will
      fail for it due to a failing limits check.
      
      To prevent that from happening, initialize the min and max fields
      of the new_policy object to the ones stored in user_policy that
      were previously set via sysfs.

      Signed-off-by: Kevin Wangtao <kevin.wangtao@hisilicon.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      [ rjw: Subject & changelog ]
      Cc: All applicable <stable@vger.kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue · 67b46304
      Tejun Heo authored
      commit f1834646 upstream.
      
      From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
      From: Tejun Heo <tj@kernel.org>
      Date: Wed, 23 May 2018 10:29:00 -0700
      
      cgwb_release() punts the actual release to cgwb_release_workfn() on
      system_wq.  Depending on the number of cgroups or block devices, there
      can be a lot of cgwb_release_workfn() in flight at the same time.
      
      We're periodically seeing close to 256 kworkers getting stuck with the
      following stack trace and over time the entire system gets stuck.
      
        [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
        [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
        [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
        [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
        [<ffffffff811cd645>] wb_exit+0x85/0xa0
        [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
        [<ffffffff810a68d0>] process_one_work+0x150/0x410
        [<ffffffff810a71fd>] worker_thread+0x6d/0x520
        [<ffffffff810ad3dc>] kthread+0x12c/0x160
        [<ffffffff81969019>] ret_from_fork+0x29/0x40
        [<ffffffffffffffff>] 0xffffffffffffffff
      
      The events leading to the lockup are...
      
      1. A lot of cgwb_release_workfn() is queued at the same time and all
         system_wq kworkers are assigned to execute them.
      
      2. They all end up calling synchronize_rcu_expedited().  One of them
         wins and tries to perform the expedited synchronization.
      
      3. However, that involves queueing rcu_exp_work to system_wq and
         waiting for it.  Because #1 is holding all available kworkers on
         system_wq, rcu_exp_work can't be executed.  cgwb_release_workfn()
         is waiting for synchronize_rcu_expedited() which in turn is waiting
         for cgwb_release_workfn() to free up some of the kworkers.
      
      We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
      same time.  There's nothing to be gained from that.  This patch
      updates cgwb release path to use a dedicated percpu workqueue with
      @max_active of 1.
      
      While this resolves the problem at hand, it might be a good idea to
      isolate rcu_exp_work to its own workqueue too as it can be used from
      various paths and is prone to this sort of indirect A-A deadlocks.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • blk-mq: reinit q->tag_set_list entry only after grace period · ba502bf2
      Roman Pen authored
      commit a347c7ad upstream.
      
      It is not allowed to reinit q->tag_set_list list entry while RCU grace
      period has not completed yet, otherwise the following soft lockup in
      blk_mq_sched_restart() happens:
      
      [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
      [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
      [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
      [ 1064.256510] Call Trace:
      [ 1064.256664]  <IRQ>
      [ 1064.256824]  blk_mq_free_request+0xea/0x100
      [ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
      [ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
      [ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
      [ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
      [ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
      [ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
      [ 1064.258007]  irq_poll_softirq+0xb7/0xe0
      [ 1064.258165]  __do_softirq+0x106/0x2a2
      [ 1064.258328]  irq_exit+0x92/0xa0
      [ 1064.258509]  do_IRQ+0x4a/0xd0
      [ 1064.258660]  common_interrupt+0x7a/0x7a
      [ 1064.258818]  </IRQ>
      
      Meanwhile another context frees other queue but with the same set of
      shared tags:
      
      [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
      [ 1288.201833] bash            D    0  5910   5820 0x00000000
      [ 1288.202016] Call Trace:
      [ 1288.202315]  schedule+0x32/0x80
      [ 1288.202462]  schedule_timeout+0x1e5/0x380
      [ 1288.203838]  wait_for_completion+0xb0/0x120
      [ 1288.204137]  __wait_rcu_gp+0x125/0x160
      [ 1288.204287]  synchronize_sched+0x6e/0x80
      [ 1288.204770]  blk_mq_free_queue+0x74/0xe0
      [ 1288.204922]  blk_cleanup_queue+0xc7/0x110
      [ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
      [ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
      [ 1288.205548]  kernfs_fop_write+0x109/0x180
      [ 1288.206328]  vfs_write+0xb3/0x1a0
      [ 1288.206476]  SyS_write+0x52/0xc0
      [ 1288.206624]  do_syscall_64+0x68/0x1d0
      [ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      What happened is the following:
      
      1. There are several MQ queues with shared tags.
      2. One queue is about to be freed and now task is in
         blk_mq_del_queue_tag_set().
      3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
         tag list in order to find hctx to restart.
      
      Because linked list entry was modified in blk_mq_del_queue_tag_set()
      without proper waiting for a grace period, blk_mq_sched_restart()
      never ends, spinning in list_for_each_entry_rcu_rr(), hence the soft lockup.
      
      The fix is simple: reinit the list entry only after an RCU grace period
      has elapsed.
      
      Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
      Cc: stable@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-block@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>