1. 08 Dec, 2020 2 commits
  2. 07 Dec, 2020 2 commits
  3. 04 Dec, 2020 8 commits
  4. 02 Dec, 2020 1 commit
    • Jens Axboe's avatar
      Merge tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme into for-5.11/drivers · 3b9351f0
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "nvme updates for 5.11
      
       - nvmet passthrough improvements (Chaitanya Kulkarni)
       - fcloop error injection support (James Smart)
       - read-only support for zoned namespaces without Zone Append
         (Javier González)
       - improve some error message (Minwoo Im)
       - reject I/O to offline fabrics namespaces (Victor Gladkov)
       - PCI queue allocation cleanups (Niklas Schnelle)
       - remove an unused allocation in nvmet (Amit Engel)
       - a Kconfig spelling fix (Colin Ian King)
       - nvme_req_qid simplication (Baolin Wang)"
      
      * tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme: (23 commits)
        nvme: export zoned namespaces without Zone Append support read-only
        nvme: rename bdev operations
        nvme: rename controller base dev_t char device
        nvme: remove unnecessary return values
        nvme: print a warning for when listing active namespaces fails
        nvme: improve an error message on Identify failure
        nvme-fabrics: reject I/O to offline device
        nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig
        nvmet: make sure discovery change log event is protected
        nvmet: remove unused ctrl->cqs
        nvme-pci: don't allocate unused I/O queues
        nvme-pci: drop min() from nr_io_queues assignment
        nvmet: use inline bio for passthru fast path
        nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio
        nvmet: remove op_flags for passthru commands
        nvme: split nvme_alloc_request()
        block: move blk_rq_bio_prep() to linux/blk-mq.h
        nvmet: add passthru io timeout value attr
        nvmet: add passthru admin timeout value attr
        nvme: use consistent macro name for timeout
        ...
      3b9351f0
  5. 01 Dec, 2020 23 commits
  6. 30 Nov, 2020 4 commits
    • Jens Axboe's avatar
      Merge branch 'md-next' of... · 48332ff2
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.11/drivers
      
      Pull MD changes from Song:
      
      "Summary:
       1. Fix race condition in md_ioctl(), by Dae R. Jeong;
       2. Initialize read_slot properly for raid10, by Kevin Vigor;
       3. Code cleanup, by Pankaj Gupta;
       4. md-cluster resync/reshape fix, by Zhao Heming."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/cluster: fix deadlock when node is doing resync job
        md/cluster: block reshape with remote resync job
        md: use current request time as base for ktime comparisons
        md: add comments in md_flush_request()
        md: improve variable names in md_flush_request()
        md/raid10: initialize r10_bio->read_slot before use.
        md: fix a warning caused by a race between concurrent md_ioctl()s
      48332ff2
    • Zhao Heming's avatar
      md/cluster: fix deadlock when node is doing resync job · bca5b065
      Zhao Heming authored
      md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
      During sending msg, node can concurrently receive msg from another node.
      When node does resync job, grab token_lockres:EX may trigger a deadlock:
      ```
      nodeA                       nodeB
      --------------------     --------------------
      a.
      send METADATA_UPDATED
      held token_lockres:EX
                               b.
                               md_do_sync
                                resync_info_update
                                  send RESYNCING
                                   + set MD_CLUSTER_SEND_LOCK
                                   + wait for holding token_lockres:EX
      
                               c.
                               mdadm /dev/md0 --remove /dev/sdg
                                + held reconfig_mutex
                                + send REMOVE
                                   + wait_event(MD_CLUSTER_SEND_LOCK)
      
                               d.
                               recv_daemon //METADATA_UPDATED from A
                                process_metadata_update
                                 + (mddev_trylock(mddev) ||
                                    MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                                   //this time, both return false forever
      ```
      Explaination:
      a. A send METADATA_UPDATED
         This will block another node to send msg
      
      b. B does sync jobs, which will send RESYNCING at intervals.
         This will be block for holding token_lockres:EX lock.
      
      c. B do "mdadm --remove", which will send REMOVE.
         This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.
      
      d. B recv METADATA_UPDATED msg, which send from A in step <a>.
         This will be blocked by step <c>: holding mddev lock, it makes
         wait_event can't hold mddev lock. (btw,
         MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)
      
      There is a similar deadlock in commit 0ba95977
      ("md-cluster: use sync way to handle METADATA_UPDATED msg")
      In that commit, step c is "update sb". This patch step c is
      "mdadm --remove".
      
      For fixing this issue, we can refer the solution of function:
      metadata_update_start. Which does the same grab lock_token action.
      lock_comm can use the same steps to avoid deadlock. By moving
      MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
      It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
      but it is safe & can break deadlock.
      
      Repro steps (I only triggered 3 times with hundreds tests):
      
      two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
      ```
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
       --bitmap-chunk=1M
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mkfs.xfs /dev/md0
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      test script will hung when executing "mdadm --remove".
      
      ```
       # dump stacks by "echo t > /proc/sysrq-trigger"
      md0_cluster_rec D    0  5329      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? _cond_resched+0x2d/0x40
       ? schedule+0x4a/0xb0
       ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
       ? wait_woken+0x80/0x80
       ? process_recvd_msg+0x113/0x1d0 [md_cluster]
       ? recv_daemon+0x9e/0x120 [md_cluster]
       ? md_thread+0x94/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      
      mdadm           D    0  5423      1 0x00004004
      Call Trace:
       __schedule+0x1f6/0x560
       ? __schedule+0x1fe/0x560
       ? schedule+0x4a/0xb0
       ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? remove_disk+0x4f/0x90 [md_cluster]
       ? hot_remove_disk+0xb1/0x1b0 [md_mod]
       ? md_ioctl+0x50c/0xba0 [md_mod]
       ? wait_woken+0x80/0x80
       ? blkdev_ioctl+0xa2/0x2a0
       ? block_ioctl+0x39/0x40
       ? ksys_ioctl+0x82/0xc0
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x5f/0x150
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      md0_resync      D    0  5425      2 0x80004000
      Call Trace:
       __schedule+0x1f6/0x560
       ? schedule+0x4a/0xb0
       ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
       ? wait_woken+0x80/0x80
       ? lock_token+0x2d/0x90 [md_cluster]
       ? resync_info_update+0x95/0x100 [md_cluster]
       ? raid1_sync_request+0x7d3/0xa40 [raid1]
       ? md_do_sync.cold+0x737/0xc8f [md_mod]
       ? md_thread+0x94/0x160 [md_mod]
       ? md_congested+0x30/0x30 [md_mod]
       ? kthread+0x115/0x140
       ? __kthread_bind_mask+0x60/0x60
       ? ret_from_fork+0x1f/0x40
      ```
      
      At last, thanks for Xiao's solution.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarZhao Heming <heming.zhao@suse.com>
      Suggested-by: default avatarXiao Ni <xni@redhat.com>
      Reviewed-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      bca5b065
    • Zhao Heming's avatar
      md/cluster: block reshape with remote resync job · a8da01f7
      Zhao Heming authored
      Reshape request should be blocked with ongoing resync job. In cluster
      env, a node can start resync job even if the resync cmd isn't executed
      on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
      will start resync job. However, current update_raid_disks() only check
      local recovery status, which is incomplete. As a result, we see user will
      execute "mdadm --grow" successfully on local, while the remote node deny
      to do reshape job when it doing resync job. The inconsistent handling
      cause array enter unexpected status. If user doesn't observe this issue
      and continue executing mdadm cmd, the array doesn't work at last.
      
      Fix this issue by blocking reshape request. When node executes "--grow"
      and detects ongoing resync, it should stop and report error to user.
      
      The following script reproduces the issue with ~100% probability.
      (two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
      ```
       # on node1, node2 is the remote node.
      ssh root@node2 "mdadm -S --scan"
      mdadm -S --scan
      for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
      count=20; done
      
      mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
      ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"
      
      sleep 5
      
      mdadm --manage --add /dev/md0 /dev/sdi
      mdadm --wait /dev/md0
      mdadm --grow --raid-devices=3 /dev/md0
      
      mdadm /dev/md0 --fail /dev/sdg
      mdadm /dev/md0 --remove /dev/sdg
      mdadm --grow --raid-devices=2 /dev/md0
      ```
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarZhao Heming <heming.zhao@suse.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      a8da01f7
    • Pankaj Gupta's avatar
      md: use current request time as base for ktime comparisons · a23f2aae
      Pankaj Gupta authored
      Request coalescing logic uses 'prev_flush_start' as base to
      compare the current request start time. 'prev_flush_start' is
      updated in other context.
      
      This patch changes this by using ktime comparison base to
      'req_start' for better readability of code.
      Signed-off-by: default avatarPankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: default avatarSong Liu <songliubraving@fb.com>
      a23f2aae