1. 27 May, 2020 11 commits
    • nvme: replace zero-length array with flexible-array · f1e71d75
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kinds of undefined-behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      sizeof(flexible-array-member) triggers a warning because flexible array
      members have incomplete type[1]. There are some instances of code in
      which the sizeof operator is being erroneously applied to
      zero-length arrays and the result is zero. Such instances may be hiding
      some bugs. So, this work (flexible-array member conversions) will also
      help to get rid of those sorts of issues completely.
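      
      As an illustration only, here is a minimal userspace sketch of how a
      structure with a flexible array member is typically allocated. The
      struct boo contents and the foo_alloc() helper are hypothetical; they
      are not taken from the nvme code touched by this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical element type; the commit message does not define it. */
struct boo {
	int value;
};

/* Same shape as the struct foo shown in the commit message. */
struct foo {
	int stuff;
	struct boo array[];	/* flexible array member: must come last */
};

/* Hypothetical helper: allocate a struct foo with room for n elements. */
static struct foo *foo_alloc(size_t n)
{
	struct foo *f;

	/* sizeof(*f) covers only the fixed part, not the trailing array. */
	f = calloc(1, sizeof(*f) + n * sizeof(f->array[0]));
	if (f)
		f->stuff = (int)n;
	return f;
}
```

      The size computation is the same as it would be with a zero-length
      array, which is why allocations are unaffected. This sketch does not
      check the multiplication for overflow; kernel code would normally use
      the struct_size() helper for that.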
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix io_opt limit setting · 68ab60ca
      Damien Le Moal authored
      Currently, a namespace io_opt queue limit is set by default to the
      physical sector size of the namespace and to the write optimal
      size (NOWS) when the namespace reports optimal IO sizes. This causes
      problems with block limits stacking in blk_stack_limits() when a
      namespace block device is combined with an HDD, which generally does
      not report any optimal transfer size (io_opt limit is 0). The code:
      
      /* Optimal I/O a multiple of the physical block size? */
      if (t->io_opt & (t->physical_block_size - 1)) {
      	t->io_opt = 0;
      	t->misaligned = 1;
      	ret = -1;
      }
      
      in blk_stack_limits() results in an error return for this function when
      the combined devices have different but compatible physical sector
      sizes (e.g. 512B sector SSD with 4KB sector disks).
      
      Fix this by not setting the optimal IO size queue limit if the namespace
      does not report an optimal write size value.
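      
      The failure can be reproduced in isolation with a small userspace
      sketch of the quoted check. struct limits is a hypothetical, cut-down
      stand-in for the kernel's queue limits, and the values used below
      (4096-byte physical blocks, io_opt of 512 or 0) are assumptions chosen
      to match the example in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the kernel's struct queue_limits. */
struct limits {
	unsigned int physical_block_size;	/* bytes, power of two */
	unsigned int io_opt;			/* bytes, 0 = not reported */
	bool misaligned;
};

/*
 * Mirrors the check quoted above: fails when io_opt is not a multiple
 * of the physical block size.
 */
static int check_io_opt(struct limits *t)
{
	if (t->io_opt & (t->physical_block_size - 1)) {
		t->io_opt = 0;
		t->misaligned = true;
		return -1;
	}
	return 0;
}
```

      With a stacked physical block size of 4096 and an io_opt of 512 (the
      old default for a 512B-sector namespace), the check fails; with io_opt
      left at 0, as this fix does when no optimal write size is reported, it
      passes.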
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Bart van Assche <bvanassche@acm.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: disable streams when get stream params failed · 84e4c204
      Wu Bo authored
      Disable streams again if getting the stream params fails.
      Signed-off-by: Wu Bo <wubo40@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-fc: print proper nvme-fc devloss_tmo value · 614fc1c0
      Martin George authored
      The nvme-fc devloss_tmo is computed as the min of either the
      ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's
      devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in
      nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo
      value. So correct this by printing the min value instead.
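      
      The computation described above amounts to the following sketch. The
      helper name and its parameters are hypothetical and all values are
      assumed to be in seconds; the driver's own code is not reproduced here.

```c
#include <assert.h>

/* Hypothetical helper: effective nvme-fc devloss_tmo, in seconds. */
static unsigned int effective_devloss_tmo(unsigned int max_retries,
					  unsigned int reconnect_delay,
					  unsigned int rport_devloss_tmo)
{
	/* ctrl_loss_tmo as defined in the text above. */
	unsigned int ctrl_loss_tmo = max_retries * reconnect_delay;

	/* The smaller of the two timeouts is the one that takes effect. */
	return ctrl_loss_tmo < rport_devloss_tmo ?
		ctrl_loss_tmo : rport_devloss_tmo;
}
```

      For example, 3 retries with a 10 second reconnect delay against a
      remote port devloss_tmo of 60 gives 30, which is the value that should
      be printed rather than 60.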
      Signed-off-by: Martin George <marting@netapp.com>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: make sure write/poll_queues less or equal then cpu count · 9c9e76d5
      Weiping Zhang authored
      Check module parameter write/poll_queues before using it to catch
      too large values.
      
      Reproducer:
      
      modprobe -r nvme
      modprobe nvme write_queues=`nproc`
      echo $((`nproc`+1)) > /sys/module/nvme/parameters/write_queues
      echo 1 > /sys/block/nvme0n1/device/reset_controller
      
      [  657.069000] ------------[ cut here ]------------
      [  657.069022] WARNING: CPU: 10 PID: 1163 at kernel/irq/affinity.c:390 irq_create_affinity_masks+0x47c/0x4a0
      [  657.069056]  dm_region_hash dm_log dm_mod
      [  657.069059] CPU: 10 PID: 1163 Comm: kworker/u193:9 Kdump: loaded Tainted: G        W         5.6.0+ #8
      [  657.069060] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [  657.069064] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      [  657.069066] RIP: 0010:irq_create_affinity_masks+0x47c/0x4a0
      [  657.069067] Code: fe ff ff 48 c7 c0 b0 89 14 95 48 89 46 20 e9 e9 fb ff ff 31 c0 e9 90 fc ff ff 0f 0b 48 c7 44 24 08 00 00 00 00 e9 e9 fc ff ff <0f> 0b e9 87 fe ff ff 48 8b 7c 24 28 e8 33 a0 80 00 e9 b6 fc ff ff
      [  657.069068] RSP: 0018:ffffb505ce1ffc78 EFLAGS: 00010202
      [  657.069069] RAX: 0000000000000060 RBX: ffff9b97921fe5c0 RCX: 0000000000000000
      [  657.069069] RDX: ffff9b67bad80000 RSI: 00000000ffffffa0 RDI: 0000000000000000
      [  657.069070] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9b97921fe718
      [  657.069070] R10: ffff9b97921fe710 R11: 0000000000000001 R12: 0000000000000064
      [  657.069070] R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000001
      [  657.069071] FS:  0000000000000000(0000) GS:ffff9b67c0880000(0000) knlGS:0000000000000000
      [  657.069072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  657.069072] CR2: 0000559eac6fc238 CR3: 000000057860a002 CR4: 00000000007606e0
      [  657.069073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  657.069073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  657.069073] PKRU: 55555554
      [  657.069074] Call Trace:
      [  657.069080]  __pci_enable_msix_range+0x233/0x5a0
      [  657.069085]  ? kernfs_put+0xec/0x190
      [  657.069086]  pci_alloc_irq_vectors_affinity+0xbb/0x130
      [  657.069089]  nvme_reset_work+0x6e6/0xeab [nvme]
      [  657.069093]  ? __switch_to_asm+0x34/0x70
      [  657.069094]  ? __switch_to_asm+0x40/0x70
      [  657.069095]  ? nvme_irq_check+0x30/0x30 [nvme]
      [  657.069098]  process_one_work+0x1a7/0x370
      [  657.069101]  worker_thread+0x1c9/0x380
      [  657.069102]  ? max_active_store+0x80/0x80
      [  657.069103]  kthread+0x112/0x130
      [  657.069104]  ? __kthread_parkme+0x70/0x70
      [  657.069105]  ret_from_fork+0x35/0x40
      [  657.069106] ---[ end trace f4f06b7d24513d06 ]---
      [  657.077110] nvme nvme0: 95/1/0 default/read/poll queues
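      
      The kind of check described at the top of this message can be sketched
      in userspace as follows. set_queue_count() and its ncpus argument are
      hypothetical; this is not the patch itself, only an illustration of
      rejecting a value larger than the CPU count at the time it is written.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical setter: parse val and store it in *out only if it is a
 * valid number no larger than ncpus. Returns 0 or -EINVAL.
 */
static int set_queue_count(const char *val, unsigned int ncpus,
			   unsigned int *out)
{
	char *end;
	unsigned long n;

	errno = 0;
	n = strtoul(val, &end, 10);
	if (errno || end == val)
		return -EINVAL;
	if (*end == '\n')	/* values written via echo end in a newline */
		end++;
	if (*end != '\0' || n > ncpus)
		return -EINVAL;

	*out = (unsigned int)n;
	return 0;
}
```

      With such a check in place, the third step of the reproducer above
      (writing nproc+1) would fail with an error instead of leaving an
      oversized value behind for the next controller reset.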
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvmet-tcp: move send/recv error handling in the send/recv methods instead of call-sites · 0236d343
      Sagi Grimberg authored
      Have the send/recv routines handle errors themselves and just bail out
      of the poll loop. This simplifies the code and will help when we enhance
      the poll loop logic, where the call-site error handling is somewhat in
      the way.
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvmet-tcp: set MSG_EOR if we send last payload in the batch · f381ab1f
      Sagi Grimberg authored
      When trying to send the PDU data digest, we should set this
      flag.
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvmet-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to send · 4eea8043
      Sagi Grimberg authored
      We can signal the stack that this is not the last page coming and the
      stack can build a larger tso segment, so go ahead and use it.
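      
      A simplified sketch of the flag selection, for illustration. The
      send_flags() helper and its single "more" argument are hypothetical
      and collapse the driver's real conditions into one boolean.
      MSG_SENDPAGE_NOTLAST is a kernel-internal flag that userspace headers
      do not export, so the value defined below is an assumption taken from
      the kernel's include/linux/socket.h.

```c
#include <assert.h>
#include <sys/socket.h>

/* Kernel-internal flag; value assumed, not exported to userspace. */
#ifndef MSG_SENDPAGE_NOTLAST
#define MSG_SENDPAGE_NOTLAST 0x20000
#endif

/* Hypothetical helper: choose send flags for one piece of a batch. */
static int send_flags(int more)
{
	int flags = MSG_DONTWAIT;

	if (more)
		/* More data follows: let the stack build larger segments. */
		flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
	else
		/* Last payload in the batch: mark the end of the record. */
		flags |= MSG_EOR;

	return flags;
}
```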
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to send · 5bb052d7
      Sagi Grimberg authored
      We can signal the stack that this is not the last page coming and the
      stack can build a larger tso segment, so go ahead and use it.
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • Christoph Hellwig's avatar
    • nvmet: replace kstrndup() with kmemdup_nul() · 09bb8986
      Chen Zhou authored
      It is more efficient to use kmemdup_nul() if the size is known exactly.
      
      The doc in kernel:
      "Note: Use kmemdup_nul() instead if the size is known exactly."
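      
      The difference between the two helpers can be sketched in userspace as
      follows. memdup_nul() and strndup_scan() are hypothetical stand-ins
      written for this illustration; they are not the kernel
      implementations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for kmemdup_nul(): copy exactly len bytes, then add a NUL. */
static char *memdup_nul(const char *s, size_t len)
{
	char *buf = malloc(len + 1);

	if (buf) {
		memcpy(buf, s, len);
		buf[len] = '\0';
	}
	return buf;
}

/* Stand-in for kstrndup(): must first scan for a NUL within max bytes. */
static char *strndup_scan(const char *s, size_t max)
{
	const char *nul = memchr(s, '\0', max);
	size_t len = nul ? (size_t)(nul - s) : max;

	return memdup_nul(s, len);
}
```

      When the caller already knows the exact length, the scan in the second
      helper is redundant work, which is the point of the note quoted above.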
      Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  2. 26 May, 2020 1 commit
    • block/floppy: fix contended case in floppy_queue_rq() · 263c6158
      Jiri Kosina authored
      Since the switch of floppy driver to blk-mq, the contended (fdc_busy) case
      in floppy_queue_rq() is not handled correctly.
      
      In case we reach floppy_queue_rq() with fdc_busy set (i.e. with the floppy
      locked due to another request still being in-flight), we put the request
      on the list of requests and return BLK_STS_OK to the block core, without
      actually scheduling delayed work / doing further processing of the
      request. This means that processing of this request is postponed until
      another request comes and passes uncontended.
      
      In some cases that might never happen, and we keep waiting
      indefinitely. A simple testcase is
      
      	for i in `seq 1 2000`; do echo -en $i '\r'; blkid --info /dev/fd0 2> /dev/null; done
      
      run in qemu. That reliably causes blkid to eventually hang indefinitely
      in __floppy_read_block_0() waiting for completion, as the BIO callback
      never happens, and no further IO is ever submitted on the (non-existent)
      floppy device. This was observed reliably on a qemu-emulated device.
      
      Fix that by not queuing the request in the contended case and returning
      BLK_STS_RESOURCE instead, so that the block core handles rescheduling of
      the request and lets it pass uncontended later.
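      
      The shape of the fix can be shown with a small model. Everything here
      is a hypothetical stand-in: struct drive models the fdc_busy state, and
      STS_OK / STS_RESOURCE model BLK_STS_OK / BLK_STS_RESOURCE. It is a
      sketch of the idea, not the driver code.

```c
#include <assert.h>

/* Stand-ins for BLK_STS_OK and BLK_STS_RESOURCE. */
enum status { STS_OK, STS_RESOURCE };

/* Hypothetical model of the controller state. */
struct drive {
	int busy;	/* stand-in for fdc_busy */
	int started;	/* number of requests actually started */
};

/* Model of a queue_rq handler that refuses work while busy. */
static enum status queue_rq(struct drive *d)
{
	if (d->busy)
		return STS_RESOURCE;	/* caller keeps the request, retries */

	d->busy = 1;
	d->started++;
	return STS_OK;
}
```

      Because the contended request is never parked on a driver-private
      list, it cannot be forgotten: the caller still owns it and submits it
      again once the device is free.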
      
      Fixes: a9f38e1d ("floppy: convert to blk-mq")
      Cc: stable@vger.kernel.org
      Tested-by: Libor Pechacek <lpechacek@suse.cz>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 24 May, 2020 1 commit
  4. 21 May, 2020 14 commits
  5. 16 May, 2020 1 commit
  6. 13 May, 2020 12 commits
    • Merge branch 'md-next' of... · 8fd2b980
      Jens Axboe authored
      Merge branch 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.8/drivers
      
      Pull MD changes from Song.
      
      * 'md-next' of git://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid1: Replace zero-length array with flexible-array
        md: add a newline when printing parameter 'start_ro' by sysfs
        md: stop using ->queuedata
        md/raid1: release pending accounting for an I/O only after write-behind is also finished
        md: remove redundant memalloc scope API usage
        raid5: update code comment of scribble_alloc()
        raid5: remove gfp flags from scribble_alloc()
        md: use memalloc scope APIs in mddev_suspend()/mddev_resume()
        md: remove the extra line for ->hot_add_disk
        md: flush md_rdev_misc_wq for HOT_ADD_DISK case
        md: don't flush workqueue unconditionally in md_open
        md: add new workqueue for delete rdev
        md: add checkings before flush md_misc_wq
    • md/raid1: Replace zero-length array with flexible-array · 358369f0
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kinds of undefined-behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      sizeof(flexible-array-member) triggers a warning because flexible array
      members have incomplete type[1]. There are some instances of code in
      which the sizeof operator is being erroneously applied to
      zero-length arrays and the result is zero. Such instances may be hiding
      some bugs. So, this work (flexible-array member conversions) will also
      help to get rid of those sorts of issues completely.
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: add a newline when printing parameter 'start_ro' by sysfs · 3f99980c
      Xiongfeng Wang authored
      Add a missing newline when printing module parameter 'start_ro' by
      sysfs.
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: stop using ->queuedata · e4fc5a74
      Christoph Hellwig authored
      Pointer to mddev is already available in private_data.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md/raid1: release pending accounting for an I/O only after write-behind is also finished · c91114c2
      David Jeffery authored
      When using RAID1 and write-behind, md can deadlock when errors occur. With
      write-behind, r1bio structs can be accounted by raid1 as queued but not
      counted as pending. The pending count is dropped when the original bio is
      returned complete but write-behind for the r1bio may still be active.
      
      This breaks the accounting used in some conditions to know when the raid1
      md device has reached an idle state. It can result in calls to
      freeze_array deadlocking: freeze_array will never complete because a
      negative "unqueued" value is calculated when the queued count is larger
      than the pending count.
      
      To properly account for write-behind, move the call to allow_barrier from
      call_bio_endio to raid_end_bio_io. When using write-behind, md can call
      call_bio_endio before all write-behind I/O is complete. Calling
      allow_barrier from raid_end_bio_io releases the pending count at a point
      where all I/O for an r1bio, even write-behind, is done.
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: remove redundant memalloc scope API usage · 3024ba2d
      Coly Li authored
      In mddev_create_serial_pool(), memalloc scope APIs memalloc_noio_save()
      and memalloc_noio_restore() are used when allocating memory by calling
      mempool_create_kmalloc_pool(). After adding the memalloc scope APIs in
      raid array suspend context, it is unnecessary to explicitly call them
      around mempool_create_kmalloc_pool() any longer.
      
      This patch removes the redundant memalloc scope APIs in
      mddev_create_serial_pool().
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • raid5: update code comment of scribble_alloc() · 7f8a30e5
      Coly Li authored
      The code comment of scribble_alloc() has been outdated for a while. This
      patch updates the comment in the function header for the new parameter
      list.
      Suggested-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • raid5: remove gfp flags from scribble_alloc() · ba54d4d4
      Coly Li authored
      Using the GFP_NOIO flag to call scribble_alloc() from resize_chunk() does
      not have the expected behavior. kvmalloc_array() inside scribble_alloc(),
      which receives the GFP_NOIO flag, will eventually call kmalloc_node() to
      allocate physically contiguous pages.
      
      Now that we have memalloc scope APIs in mddev_suspend()/mddev_resume() to
      prevent memory reclaim I/Os during raid array suspend context, calling
      kvmalloc_array() with the GFP_KERNEL flag avoids the deadlock of
      recursive I/O as expected.
      
      This patch removes the useless gfp flags from the parameter list of
      scribble_alloc(), and calls kvmalloc_array() with the GFP_KERNEL flag.
      The incorrect GFP_NOIO flag is no longer used.
      
      Fixes: b330e6a4 ("md: convert to kvmalloc")
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: use memalloc scope APIs in mddev_suspend()/mddev_resume() · 78f57ef9
      Coly Li authored
      In raid5.c:resize_chunk(), scribble_alloc() is called with the GFP_NOIO
      flag, which is then passed into kvmalloc_array() inside scribble_alloc().
      
      The problem is that kvmalloc_array() eventually calls kvmalloc_node(),
      which does not accept flags that are not GFP_KERNEL compatible, such as
      GFP_NOIO, so kmalloc_node() is called instead to allocate physically
      contiguous pages. When system memory is under heavy pressure and the
      requested size is large, there is a high probability that allocating
      contiguous pages will fail.
      
      But simply using the GFP_KERNEL flag to call kvmalloc_array() is also
      problematic. In the code path where scribble_alloc() is called, the
      raid array is suspended; if kvmalloc_node() triggers memory reclaim I/Os
      and such I/Os go back to the suspended raid array, a deadlock will happen.
      
      What is desired here is to allocate non-physically (a.k.a. virtually)
      contiguous pages and avoid memory reclaim I/Os. Michal Hocko suggests
      using the memalloc scope APIs to restrict memory reclaim I/O in the
      allocating context, specifically calling memalloc_noio_save() when
      suspending the raid array and memalloc_noio_restore() when
      resuming the raid array.
      
      This patch adds the memalloc scope APIs in mddev_suspend() and
      mddev_resume(), to restrict memory reclaim I/Os while the raid array
      is suspended. The benefit of adding the memalloc scope API in the
      unified entry point mddev_suspend()/mddev_resume() is that, no matter
      which md raid array type (personality) is in use, we are sure the
      deadlock by recursive memory reclaim I/O won't happen in the suspending
      context.
      
      Please notice that the memalloc scope APIs only take effect in the raid
      array suspending context. If the memory allocation is from another newly
      created kthread after the raid array is suspended, the recursive memory
      reclaim I/Os won't be restricted. The mddev_suspend()/mddev_resume()
      entries are used for the critical section where the raid metadata is
      being modified; creating a kthread to allocate memory inside the
      critical section is odd and very probably buggy.
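      
      The save/restore scoping described above can be modelled in userspace
      as follows. This is only a sketch: PF_NOIO, task_flags and the three
      helpers are hypothetical stand-ins for the per-task state that the
      kernel's memalloc_noio_save()/memalloc_noio_restore() manage, and a
      single static variable replaces what the kernel keeps per task.

```c
#include <assert.h>

#define PF_NOIO 0x1u		/* hypothetical "no reclaim I/O" flag */

static unsigned int task_flags;	/* stand-in for per-task flags */

/* Enter a no-I/O scope; returns the previous state of the flag. */
static unsigned int noio_save(void)
{
	unsigned int old = task_flags & PF_NOIO;

	task_flags |= PF_NOIO;
	return old;
}

/* Leave the scope, putting the flag back to what it was on entry. */
static void noio_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_NOIO) | old;
}

/* What an allocator would consult before starting reclaim I/O. */
static int alloc_may_do_io(void)
{
	return !(task_flags & PF_NOIO);
}
```

      Because the saved value is restored rather than the flag simply being
      cleared, scopes nest correctly: an inner save/restore pair does not
      re-enable I/O while an outer scope, such as a suspended array, is
      still active. It also shows why a separate thread is not covered: the
      state belongs to the task that called the save function.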
      
      Fixes: b330e6a4 ("md: convert to kvmalloc")
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: remove the extra line for ->hot_add_disk · 3f79cc22
      Guoqing Jiang authored
      It is not necessary to add a newline for them since they don't exceed
      80 characters, and it is not intuitive to distinguish ->hot_add_disk()
      from hot_add_disk() either.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: flush md_rdev_misc_wq for HOT_ADD_DISK case · 78b990cf
      Guoqing Jiang authored
      Since rdev->kobj is removed asynchronously, it is possible that the
      rdev->kobj still exists when trying to add the rdev again after the rdev
      is removed. But the path md_ioctl (HOT_ADD_DISK) -> hot_add_disk
      -> bind_rdev_to_array missed the flush of md_rdev_misc_wq.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
    • md: don't flush workqueue unconditionally in md_open · f6766ff6
      Guoqing Jiang authored
      We need to check mddev->del_work before flushing the workqueue, since the
      purpose of the flush is to ensure the previous md has disappeared.
      Otherwise a similar deadlock appears if LOCKDEP is enabled, because
      md_open holds bdev->bd_mutex before flushing the workqueue.
      
      kernel: [  154.522645] ======================================================
      kernel: [  154.522647] WARNING: possible circular locking dependency detected
      kernel: [  154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G           O
      kernel: [  154.522651] ------------------------------------------------------
      kernel: [  154.522653] mdadm/2482 is trying to acquire lock:
      kernel: [  154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
      kernel: [  154.522673]
      kernel: [  154.522673] but task is already holding lock:
      kernel: [  154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
      kernel: [  154.522691]
      kernel: [  154.522691] which lock already depends on the new lock.
      kernel: [  154.522691]
      kernel: [  154.522694]
      kernel: [  154.522694] the existing dependency chain (in reverse order) is:
      kernel: [  154.522696]
      kernel: [  154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
      kernel: [  154.522704]        __mutex_lock+0x87/0x950
      kernel: [  154.522706]        __blkdev_get+0x79/0x590
      kernel: [  154.522708]        blkdev_get+0x65/0x140
      kernel: [  154.522709]        blkdev_get_by_dev+0x2f/0x40
      kernel: [  154.522716]        lock_rdev+0x3d/0x90 [md_mod]
      kernel: [  154.522719]        md_import_device+0xd6/0x1b0 [md_mod]
      kernel: [  154.522723]        new_dev_store+0x15e/0x210 [md_mod]
      kernel: [  154.522728]        md_attr_store+0x7a/0xc0 [md_mod]
      kernel: [  154.522732]        kernfs_fop_write+0x117/0x1b0
      kernel: [  154.522735]        vfs_write+0xad/0x1a0
      kernel: [  154.522737]        ksys_write+0xa4/0xe0
      kernel: [  154.522745]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522748]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522749]
      kernel: [  154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
      kernel: [  154.522752]        __mutex_lock+0x87/0x950
      kernel: [  154.522756]        new_dev_store+0xc9/0x210 [md_mod]
      kernel: [  154.522759]        md_attr_store+0x7a/0xc0 [md_mod]
      kernel: [  154.522761]        kernfs_fop_write+0x117/0x1b0
      kernel: [  154.522763]        vfs_write+0xad/0x1a0
      kernel: [  154.522765]        ksys_write+0xa4/0xe0
      kernel: [  154.522767]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522769]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522770]
      kernel: [  154.522770] -> #2 (kn->count#253){++++}:
      kernel: [  154.522775]        __kernfs_remove+0x253/0x2c0
      kernel: [  154.522778]        kernfs_remove+0x1f/0x30
      kernel: [  154.522780]        kobject_del+0x28/0x60
      kernel: [  154.522783]        mddev_delayed_delete+0x24/0x30 [md_mod]
      kernel: [  154.522786]        process_one_work+0x2a7/0x5f0
      kernel: [  154.522788]        worker_thread+0x2d/0x3d0
      kernel: [  154.522793]        kthread+0x117/0x130
      kernel: [  154.522795]        ret_from_fork+0x3a/0x50
      kernel: [  154.522796]
      kernel: [  154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
      kernel: [  154.522800]        process_one_work+0x27e/0x5f0
      kernel: [  154.522802]        worker_thread+0x2d/0x3d0
      kernel: [  154.522804]        kthread+0x117/0x130
      kernel: [  154.522806]        ret_from_fork+0x3a/0x50
      kernel: [  154.522807]
      kernel: [  154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
      kernel: [  154.522813]        __lock_acquire+0x1392/0x1690
      kernel: [  154.522816]        lock_acquire+0xb4/0x1a0
      kernel: [  154.522818]        flush_workqueue+0xab/0x4b0
      kernel: [  154.522821]        md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522823]        __blkdev_get+0xea/0x590
      kernel: [  154.522825]        blkdev_get+0x65/0x140
      kernel: [  154.522828]        do_dentry_open+0x1d1/0x380
      kernel: [  154.522831]        path_openat+0x567/0xcc0
      kernel: [  154.522834]        do_filp_open+0x9b/0x110
      kernel: [  154.522836]        do_sys_openat2+0x201/0x2a0
      kernel: [  154.522838]        do_sys_open+0x57/0x80
      kernel: [  154.522840]        do_syscall_64+0x64/0x2b0
      kernel: [  154.522842]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522844]
      kernel: [  154.522844] other info that might help us debug this:
      kernel: [  154.522844]
      kernel: [  154.522846] Chain exists of:
      kernel: [  154.522846]   (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
      kernel: [  154.522846]
      kernel: [  154.522850]  Possible unsafe locking scenario:
      kernel: [  154.522850]
      kernel: [  154.522852]        CPU0                    CPU1
      kernel: [  154.522853]        ----                    ----
      kernel: [  154.522854]   lock(&bdev->bd_mutex);
      kernel: [  154.522856]                                lock(&mddev->reconfig_mutex);
      kernel: [  154.522858]                                lock(&bdev->bd_mutex);
      kernel: [  154.522860]   lock((wq_completion)md_misc);
      kernel: [  154.522861]
      kernel: [  154.522861]  *** DEADLOCK ***
      kernel: [  154.522861]
      kernel: [  154.522864] 1 lock held by mdadm/2482:
      kernel: [  154.522865]  #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
      kernel: [  154.522868]
      kernel: [  154.522868] stack backtrace:
      kernel: [  154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G           O      5.6.0-rc7-lp151.27-default #25
      kernel: [  154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      kernel: [  154.522878] Call Trace:
      kernel: [  154.522881]  dump_stack+0x8f/0xcb
      kernel: [  154.522884]  check_noncircular+0x194/0x1b0
      kernel: [  154.522888]  ? __lock_acquire+0x1392/0x1690
      kernel: [  154.522890]  __lock_acquire+0x1392/0x1690
      kernel: [  154.522893]  lock_acquire+0xb4/0x1a0
      kernel: [  154.522895]  ? flush_workqueue+0x84/0x4b0
      kernel: [  154.522898]  flush_workqueue+0xab/0x4b0
      kernel: [  154.522900]  ? flush_workqueue+0x84/0x4b0
      kernel: [  154.522905]  ? md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522908]  md_open+0xb6/0xc0 [md_mod]
      kernel: [  154.522910]  __blkdev_get+0xea/0x590
      kernel: [  154.522912]  ? bd_acquire+0xc0/0xc0
      kernel: [  154.522914]  blkdev_get+0x65/0x140
      kernel: [  154.522916]  ? bd_acquire+0xc0/0xc0
      kernel: [  154.522918]  do_dentry_open+0x1d1/0x380
      kernel: [  154.522921]  path_openat+0x567/0xcc0
      kernel: [  154.522923]  ? __lock_acquire+0x380/0x1690
      kernel: [  154.522926]  do_filp_open+0x9b/0x110
      kernel: [  154.522929]  ? __alloc_fd+0xe5/0x1f0
      kernel: [  154.522935]  ? kmem_cache_alloc+0x28c/0x630
      kernel: [  154.522939]  ? do_sys_openat2+0x201/0x2a0
      kernel: [  154.522941]  do_sys_openat2+0x201/0x2a0
      kernel: [  154.522944]  do_sys_open+0x57/0x80
      kernel: [  154.522946]  do_syscall_64+0x64/0x2b0
      kernel: [  154.522948]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      kernel: [  154.522951] RIP: 0033:0x7f98d279d9ae
      
      md_alloc also flushes the same workqueue, but the situation is different
      there: none of the paths that call md_alloc hold bdev->bd_mutex, and
      the flush is necessary to avoid a race condition, so leave it as it is.
      Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>