1. 01 Dec, 2020 17 commits
    • Victor Gladkov's avatar
      nvme-fabrics: reject I/O to offline device · 8c4dfea9
      Victor Gladkov authored
      Commands get stuck while Host NVMe-oF controller is in reconnect state.
      The controller enters into reconnect state when it loses connection with
      the target.  It tries to reconnect every 10 seconds (default) until
      a successful reconnect or until the reconnect time-out is reached.
      The default reconnect time out is 10 minutes.
      
      Applications are expecting commands to complete with success or error
      within a certain timeout (30 seconds by default).  The NVMe host is
      enforcing that timeout while it is connected, but during reconnect the
      timeout is not enforced and commands may get stuck for a long period or
      even forever.
      
      To fix this long delay due to the default timeout, introduce new
      "fast_io_fail_tmo" session parameter.  The timeout is measured in seconds
      from the controller reconnect and any command beyond that timeout is
      rejected.  The new parameter value may be passed during 'connect'.
      The default value of -1 means no timeout (similar to current behavior).
      Signed-off-by: default avatarVictor Gladkov <victor.gladkov@kioxia.com>
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.de>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChao Leng <lengchao@huawei.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      8c4dfea9
    • Colin Ian King's avatar
      nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig · 9f20599c
      Colin Ian King authored
      There is a spelling mistake in the Kconfig help text. Fix it.
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      9f20599c
    • Max Gurtovoy's avatar
      nvmet: make sure discovery change log event is protected · 0068a7b0
      Max Gurtovoy authored
      Generation counter is protected by nvmet_config_sem. Make sure the
      callers that call functions that might change it, are calling it
      properly.
      Signed-off-by: default avatarMax Gurtovoy <mgurtovoy@nvidia.com>
      Reviewed-by: default avatarIsrael Rukshin <israelr@nvidia.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      0068a7b0
    • Amit's avatar
      nvmet: remove unused ctrl->cqs · 6d65aeab
      Amit authored
      remove unused cqs from nvmet_ctrl struct
      this will reduce the allocated memory.
      Signed-off-by: default avatarAmit <amit.engel@dell.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      6d65aeab
    • Niklas Schnelle's avatar
      nvme-pci: don't allocate unused I/O queues · e3aef095
      Niklas Schnelle authored
      currently the NVME_QUIRK_SHARED_TAGS quirk for Apple devices is handled
      during the assignment of nr_io_queues in nvme_setup_io_queues().
      This however means that for these devices nvme_max_io_queues() will
      actually not return the supported maximum which is confusing and
      unexpected and also means that in nvme_probe() we are allocating
      for I/O queues that will never be used.
      Fix this by moving the quirk handling into nvme_max_io_queues().
      Signed-off-by: default avatarNiklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      e3aef095
    • Niklas Schnelle's avatar
      nvme-pci: drop min() from nr_io_queues assignment · ff4e5fba
      Niklas Schnelle authored
      in nvme_setup_io_queues() the number of I/O queues is set to either 1 in
      case of a quirky Apple device or to the min of nvme_max_io_queues() or
      dev->nr_allocated_queues - 1.
      This is unnecessarily complicated as dev->nr_allocated_queues is only
      assigned once and is nvme_max_io_queues() + 1.
      Signed-off-by: default avatarNiklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      ff4e5fba
    • Chaitanya Kulkarni's avatar
      nvmet: use inline bio for passthru fast path · dab3902b
      Chaitanya Kulkarni authored
      In nvmet_passthru_execute_cmd() which is a high frequency function
      it uses bio_alloc() which leads to memory allocation from the fs pool
      for each I/O.
      
      For NVMeoF nvmet_req we already have inline_bvec allocated as a part of
      request allocation that can be used with preallocated bio when we
      already know the size of request before bio allocation with bio_alloc(),
      which we already do.
      
      Introduce a bio member for the nvmet_req passthru anon union. In the
      fast path, check if we can get away with inline bvec and bio from
      nvmet_req with bio_init() call before actually allocating from the
      bio_alloc().
      
      This will be useful to avoid any new memory allocation under high
      memory pressure situation and get rid of any extra work of
      allocation (bio_alloc()) vs initialization (bio_init()) when
      transfer len is < NVMET_MAX_INLINE_DATA_LEN that user can configure at
      compile time.
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      dab3902b
    • Chaitanya Kulkarni's avatar
      nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio · a4fe2d3a
      Chaitanya Kulkarni authored
      The function blk_rq_append_bio() is a genereric API written for all
      types driver (having bounce buffers) and different context (where
      request is already having a bio i.e. rq->bio != NULL).
      
      It does mainly three things: calculating the segments, bounce queue and
      if req->bio == NULL call blk_rq_bio_prep() or handle low level merge()
      case.
      
      The NVMe PCIe and fabrics transports currently does not use queue
      bounce mechanism. In order to find this for each request processing
      in the passthru blk_rq_append_bio() does extra work in the fast path
      for each request.
      
      When I ran I/Os with different block sizes on the passthru controller
      I found that we can reuse the req->sg_cnt instead of iterating over the
      bvecs to find out nr_segs in blk_rq_append_bio(). This calculation in
      blk_rq_append_bio() is a duplication of work given that we have the
      value in req->sg_cnt. (correct me here if I'm wrong).
      
      With NVMe passthru request based driver we allocate fresh request each
      time, so every call to blk_rq_append_bio() rq->bio will be NULL i.e.
      we don't really need the second condition in the blk_rq_append_bio()
      and the resulting error condition in the caller of blk_rq_append_bio().
      
      So for NVMeOF passthru driver recalculating the segments, bounce check
      and ll_back_merge code is not needed such that we can get away with the
      minimal version of the blk_rq_append_bio() which removes the error check
      in the fast path along with extra variable in nvmet_passthru_map_sg().
      
      This patch updates the nvmet_passthru_map_sg() such that it does only
      appending the bio to the request in the context of the NVMeOF Passthru
      driver. Following are perf numbers :-
      
      With current implementation (blk_rq_append_bio()) :-
      ----------------------------------------------------
      +    5.80%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    4.88%     0.00%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    4.86%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    5.17%     0.00%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
      
      With this patch using blk_rq_bio_prep() :-
      ----------------------------------------------------
      +    3.14%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    3.26%     0.01%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    5.37%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    5.18%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    4.84%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      +    4.87%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      a4fe2d3a
    • Chaitanya Kulkarni's avatar
      nvmet: remove op_flags for passthru commands · 06b3bec8
      Chaitanya Kulkarni authored
      For passthru commands setting op_flags has no meaning. Remove the code
      that sets the op flags in nvmet_passthru_map_sg().
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      06b3bec8
    • Chaitanya Kulkarni's avatar
      nvme: split nvme_alloc_request() · 39dfe844
      Chaitanya Kulkarni authored
      Right now nvme_alloc_request() allocates a request from block layer
      based on the value of the qid. When qid set to NVME_QID_ANY it used
      blk_mq_alloc_request() else blk_mq_alloc_request_hctx().
      
      The function nvme_alloc_request() is called from different context, The
      only place where it uses non NVME_QID_ANY value is for fabrics connect
      commands :-
      
      nvme_submit_sync_cmd()		NVME_QID_ANY
      nvme_features()			NVME_QID_ANY
      nvme_sec_submit()		NVME_QID_ANY
      nvmf_reg_read32()		NVME_QID_ANY
      nvmf_reg_read64()		NVME_QID_ANY
      nvmf_reg_write32()		NVME_QID_ANY
      nvmf_connect_admin_queue()	NVME_QID_ANY
      nvme_submit_user_cmd()		NVME_QID_ANY
      	nvme_alloc_request()
      nvme_keep_alive()		NVME_QID_ANY
      	nvme_alloc_request()
      nvme_timeout()			NVME_QID_ANY
      	nvme_alloc_request()
      nvme_delete_queue()		NVME_QID_ANY
      	nvme_alloc_request()
      nvmet_passthru_execute_cmd()	NVME_QID_ANY
      	nvme_alloc_request()
      nvmf_connect_io_queue() 	QID
      	__nvme_submit_sync_cmd()
      		nvme_alloc_request()
      
      With passthru nvme_alloc_request() now falls into the I/O fast path such
      that blk_mq_alloc_request_hctx() is never gets called and that adds
      additional branch check in fast path.
      
      Split the nvme_alloc_request() into nvme_alloc_request() and
      nvme_alloc_request_qid().
      
      Replace each call of the nvme_alloc_request() with NVME_QID_ANY param
      with a call to newly added nvme_alloc_request() without NVME_QID_ANY.
      
      Replace a call to nvme_alloc_request() with QID param with a call to
      newly added nvme_alloc_request() and nvme_alloc_request_qid()
      based on the qid value set in the __nvme_submit_sync_cmd().
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      39dfe844
    • Chaitanya Kulkarni's avatar
      block: move blk_rq_bio_prep() to linux/blk-mq.h · 53ffabfd
      Chaitanya Kulkarni authored
      This is a preparation patch to have minimal block layer request bio
      append functionality in the context of the NVMeOF Passthru driver which
      falls in the fast path and doesn't need calls from blk_rq_append_bio().
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarLogan Gunthorpe <logang@deltatee.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      53ffabfd
    • Chaitanya Kulkarni's avatar
      nvmet: add passthru io timeout value attr · 47e9730c
      Chaitanya Kulkarni authored
      NVMeOF controller in the passsthru mode is capable of handling wide set
      of I/O commands including vender specific passhtru io comands.
      
      The vendor specific I/O commands are used to read the large drive
      logs and can take longer than default NVMe commands, i.e. for
      passthru requests the timeout value may differ from the passthru
      controller's default timeout values (nvme-core:io_timeout).
      
      Add a configfs attribute so that user can set the io timeout values.
      In case if this configfs value is not set nvme_alloc_request() will set
      the NVME_IO_TIMEOUT value when request queuedata is NULL.
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      47e9730c
    • Chaitanya Kulkarni's avatar
      nvmet: add passthru admin timeout value attr · a2f6a2b8
      Chaitanya Kulkarni authored
      NVMeOF controller in the passsthru mode is capable of handling wide set
      of admin commands including vender specific passhtru admin comands.
      
      The vendor specific admin commands are used to read the large drive
      logs and can take longer than default NVMe commands, i.e. for
      passthru requests the timeout value may differ from the passthru
      controller's default timeout values (nvme-core:admin_timeout).
      
      Add a configfs attribute so that user can set the admin timeout values.
      In case if this configfs value is not set nvme_alloc_request() will set
      the ADMIN_TIMEOUT value when request queuedata is NULL.
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      a2f6a2b8
    • Chaitanya Kulkarni's avatar
      nvme: use consistent macro name for timeout · dc96f938
      Chaitanya Kulkarni authored
      This is purely a clenaup patch, add prefix NVME to the ADMIN_TIMEOUT to
      make consistent with NVME_IO_TIMEOUT.
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      dc96f938
    • Chaitanya Kulkarni's avatar
      nvme: centralize setting the timeout in nvme_alloc_request · 0d2e7c84
      Chaitanya Kulkarni authored
      The function nvme_alloc_request() is called from different context
      (I/O and Admin queue) where callers do not consider the I/O timeout when
      called from I/O queue context.
      
      Update nvme_alloc_request() to set the default I/O and Admin timeout
      value based on whether the queuedata is set or not.
      Signed-off-by: default avatarChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      0d2e7c84
    • Baolin Wang's avatar
      nvme: simplify nvme_req_qid() · 84115d6d
      Baolin Wang authored
      Use the request's '->mq_hctx->queue_num' directly to simplify the
      nvme_req_qid() function.
      Signed-off-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      84115d6d
    • James Smart's avatar
      nvme-fcloop: add sysfs attribute to inject command drop · 03d99e5d
      James Smart authored
      Add sysfs attribute to specify parameters for dropping a command.  The
      attribute takes a string of:
      
        <opcode>:<starting a what instance>:<number of times>
      
      Opcode is formatted as lower 8 bits are opcode.  If a fabrics opcode, a
      bit above bits 7:0 will be set.
      
      Once set, each sqe is looked at. If the opcode matches the running
      instance count is updated. If the instance count is in the range of where
      to drop (based on starting and # of times), then drop the command by not
      passing it to the target layer.
      Signed-off-by: default avatarJames Smart <james.smart@broadcom.com>
      03d99e5d
  2. 30 Nov, 2020 8 commits
    • Jens Axboe's avatar
      Merge branch 'md-next' of... · 48332ff2
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.11/drivers
      
      Pull MD changes from Song:
      
      "Summary:
       1. Fix race condition in md_ioctl(), by Dae R. Jeong;
       2. Initialize read_slot properly for raid10, by Kevin Vigor;
       3. Code cleanup, by Pankaj Gupta;
       4. md-cluster resync/reshape fix, by Zhao Heming."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/cluster: fix deadlock when node is doing resync job
        md/cluster: block reshape with remote resync job
        md: use current request time as base for ktime comparisons
        md: add comments in md_flush_request()
        md: improve variable names in md_flush_request()
        md/raid10: initialize r10_bio->read_slot before use.
        md: fix a warning caused by a race between concurrent md_ioctl()s
      48332ff2
    • Zhao Heming's avatar
      md/cluster: fix deadlock when node is doing resync job · bca5b065
      Zhao Heming authored
      md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
      During sending msg, node can concurrently receive msg from another node.
      When node does resync job, grab token_lockres:EX may trigger a deadlock:
      ```
      nodeA                       nodeB
      --------------------     --------------------
      a.
      send METADATA_UPDATED
      held token_lockres:EX
                               b.
                               md_do_sync
                                resync_info_update
                                  send RESYNCING
                                   + set MD_CLUSTER_SEND_LOCK
                                   + wait for holding token_lockres:EX
      
                               c.
                               mdadm /dev/md0 --remove /dev/sdg
                                + held reconfig_mutex
                                + send REMOVE
                                   + wait_event(MD_CLUSTER_SEND_LOCK)
      
                               d.
                               recv_daemon //METADATA_UPDATED from A
                                process_metadata_update
                                 + (mddev_trylock(mddev) ||
                                    MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                   //this time, both return false forever
      ```
      Explaination:
      a. A send METADATA_UPDATED
         This will block another node to send msg
      
      b. B does sync jobs, which will send RESYNCING at intervals.
         This will be block for holding token_lockres:EX lock.
      
      c. B do "mdadm --remove", which will send REMOVE.
         This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
      
      d. B recv METADATA_UPDATED msg, which send from A in step <a>.
         This will be blocked by step <c>: holding mddev lock, it makes
         wait_event can't hold mddev lock. (btw,
         MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
      
      There is a similar deadlock in commit 0ba95977
      ("md-cluster: use sync way to handle METADATA_UPDATED msg")
      In that commit, step c is "update sb". This patch step c is
      "mdadm --remove".
      
      For fixing this issue, we can refer the solution of function:
      metadata_update_start. Which does the same grab lock_token action.
      lock_comm can use the same steps to avoid deadlock. By moving
      MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
      It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
      but it is safe & can break deadlock.
      
      Repro steps (I only triggered 3 times with hundreds tests):
      
      two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
      ```
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
       --bitmap-chunk=1M
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mkfs.xfs /dev/md0
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      test script will hung when executing "mdadm --remove".
      
      ```
       # dump stacks by "echo t > /proc/sysrq-trigger"
      md0_cluster_rec D    0  5329      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? _cond_resched+0x2d/0x40
       ? schedule+0x4a/0xb0
       ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
       ? wait_woken+0x80/0x80
       ? process_recvd_msg+0x113/0x1d0 [md_cluster]
       ? recv_daemon+0x9e/0x120 [md_cluster]
       ? md_thread+0x94/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      
      mdadm           D    0  5423      1 0x00004004
      Call Trace:
       __schedule+0x1f6/0x560
       ? __schedule+0x1fe/0x560
       ? schedule+0x4a/0xb0
       ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? remove_disk+0x4f/0x90 [md_cluster]
       ? hot_remove_disk+0xb1/0x1b0 [md_mod]
       ? md_ioctl+0x50c/0xba0 [md_mod]
       ? wait_woken+0x80/0x80
       ? blkdev_ioctl+0xa2/0x2a0
       ? block_ioctl+0x39/0x40
       ? ksys_ioctl+0x82/0xc0
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x5f/0x150
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      md0_resync      D    0  5425      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? schedule+0x4a/0xb0
       ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? lock_token+0x2d/0x90 [md_cluster]
       ? resync_info_update+0x95/0x100 [md_cluster]
       ? raid1_sync_request+0x7d3/0xa40 [raid1]
       ? md_do_sync.cold+0x737/0xc8f [md_mod]
       ? md_thread+0x94/0x160 [md_mod]
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      ```
      
      At last, thanks for Xiao's solution.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarZhao Heming <heming.zhao@suse.com>
      Suggested-by: default avatarXiao Ni <xni@redhat.com>
      Reviewed-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      bca5b065
    • Zhao Heming's avatar
      md/cluster: block reshape with remote resync job · a8da01f7
      Zhao Heming authored
      Reshape request should be blocked with ongoing resync job. In cluster
      env, a node can start resync job even if the resync cmd isn't executed
      on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
      will start resync job. However, current update_raid_disks() only check
      local recovery status, which is incomplete. As a result, we see user will
      execute "mdadm --grow" successfully on local, while the remote node deny
      to do reshape job when it doing resync job. The inconsistent handling
      cause array enter unexpected status. If user doesn't observe this issue
      and continue executing mdadm cmd, the array doesn't work at last.
      
      Fix this issue by blocking reshape request. When node executes "--grow"
      and detects ongoing resync, it should stop and report error to user.
      
      The following script reproduces the issue with ~100% probability.
      (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
      ```
       # on node1, node2 is the remote node.
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarZhao Heming <heming.zhao@suse.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      a8da01f7
    • Pankaj Gupta's avatar
      md: use current request time as base for ktime comparisons · a23f2aae
      Pankaj Gupta authored
      Request coalescing logic uses 'prev_flush_start' as base to
      compare the current request start time. 'prev_flush_start' is
      updated in other context.
      
      This patch changes this by using ktime comparison base to
      'req_start' for better readability of code.
      Signed-off-by: default avatarPankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      a23f2aae
    • Pankaj Gupta's avatar
      md: add comments in md_flush_request() · 204d1a64
      Pankaj Gupta authored
      Request coalescing logic is dependent on flush time update in other
      context. This patch adds comments to understand the code flow better.
      Signed-off-by: default avatarPankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      204d1a64
    • Pankaj Gupta's avatar
      md: improve variable names in md_flush_request() · 81ba3c24
      Pankaj Gupta authored
      This patch improves readability by using better variable names
      in flush request coalescing logic.
      Signed-off-by: default avatarPankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Reviewed-by: default avatarPaul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      81ba3c24
    • Kevin Vigor's avatar
      md/raid10: initialize r10_bio->read_slot before use. · 93decc56
      Kevin Vigor authored
      In __make_request() a new r10bio is allocated and passed to
      raid10_read_request(). The read_slot member of the bio is not
      initialized, and the raid10_read_request() uses it to index an
      array. This leads to occasional panics.
      
      Fix by initializing the field to invalid value and checking for
      valid value in raid10_read_request().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarKevin Vigor <kvigor@gmail.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      93decc56
    • Dae R. Jeong's avatar
      md: fix a warning caused by a race between concurrent md_ioctl()s · c731b84b
      Dae R. Jeong authored
      Syzkaller reports a warning as belows.
      WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
      ...
      Call Trace:
      ...
      RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
      RSP: 0018:ffff888096027950 EFLAGS: 00010293
      RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
      RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
      RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
      R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
      R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
       __blkdev_driver_ioctl block/ioctl.c:304 [inline]
       blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
       block_ioctl+0xee/0x130 fs/block_dev.c:1930
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:509 [inline]
       do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
       ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
       __do_sys_ioctl fs/ioctl.c:720 [inline]
       __se_sys_ioctl fs/ioctl.c:718 [inline]
       __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This is caused by a race between two concurrenct md_ioctl()s closing
      the array.
      CPU1 (md_ioctl())                   CPU2 (md_ioctl())
      ------                              ------
      set_bit(MD_CLOSING, &mddev->flags);
      did_set_md_closing = true;
                                          WARN_ON_ONCE(test_bit(MD_CLOSING,
                                                  &mddev->flags));
      if(did_set_md_closing)
          clear_bit(MD_CLOSING, &mddev->flags);
      
      Fix the warning by returning immediately if the MD_CLOSING bit is set
      in &mddev->flags which indicates that the array is being closed.
      
      Fixes: 065e519e ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
      Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDae R. Jeong <dae.r.jeong@kaist.ac.kr>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      c731b84b
  3. 16 Nov, 2020 15 commits