Commits · 786998050cbc8ead32e6e9fcda2facb3bf3d198d · Kirill Smelkov / linux

04 Dec, 2020 5 commits

block/rnbd-srv: close a mapped device from server side. · 78699805

Lutz Pogrell authored Nov 26, 2020

The forceful close of an exported device is required
for the use case, when the client side hangs, is crashed,
or is not accessible.

There have been cases observed, where only some of
the devices are to be cleaned up, but the session shall
remain.

When the device is to be exported to a different
client host, server side cleanup is required.
Signed-off-by: Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

78699805

Documentation/ABI/rnbd-clt: session name is appended to the device path · 7578d5cd

Gioh Kim authored Nov 26, 2020

When mapping a device,
/sys/devices/virtual/rnbd-client/ctl/devices/<device_id> was created.
But we found out that it had a problem when mapping the same file
on different servers. So we append the session name after the
device_id as below.
/sys/devices/virtual/rnbd-client/ctl/devices/<device_id>@<session_name>
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7578d5cd

Documentation/ABI/rnbd-clt: fix typo in sysfs-class-rnbd-client · 47479b79

Gioh Kim authored Nov 26, 2020

/sys/block/rnbd<N> is created, not /sys/block/rnbd_client/rnbd<N>
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

47479b79

block/rnbd-clt: support mapping two devices with the same name from different servers · 91f4acb2

Guoqing Jiang authored Nov 26, 2020

Previously, we can't map same device name from different sessions
due to the limitation of sysfs naming mechanism.

root@clt2:~# ls -l /sys/class/rnbd-client/ctl/devices/
total 0
lrwxrwxrwx 1 root 0 Sep  2 16:31 !dev!nullb1 -> ../../../block/rnbd0

We only use the device name in above, which caused device with
the same name can't be mapped from another server. To address
the issue, the sessname is appended to the node to differentiate
where the device comes from.

Also, we need to check if the pathname is existed in a specific
session instead of search it in global sess_list.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Gioh Kim <gi-oh.kim@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

91f4acb2

block/rnbd-clt: Make path parameter optional for map_device · ce9fe18a

Md Haris Iqbal authored Nov 26, 2020

During map_device if the given session exists, then the path parameter is
not used. In such a case, the path parameter is redundant.

This commit makes the path parameter optional for map_device. When the
path parameter is not given, if the session exists then that is used to
establish the rtrs connection.

If the session does not exist, and the path parameter is also missing,
then map_device fails.
Signed-off-by: Md Haris Iqbal <haris.iqbal@cloud.ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ce9fe18a

02 Dec, 2020 1 commit

Merge tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme into for-5.11/drivers · 3b9351f0

Jens Axboe authored Dec 02, 2020

Pull NVMe updates from Christoph:

"nvme updates for 5.11

 - nvmet passthrough improvements (Chaitanya Kulkarni)
 - fcloop error injection support (James Smart)
 - read-only support for zoned namespaces without Zone Append
   (Javier González)
 - improve some error message (Minwoo Im)
 - reject I/O to offline fabrics namespaces (Victor Gladkov)
 - PCI queue allocation cleanups (Niklas Schnelle)
 - remove an unused allocation in nvmet (Amit Engel)
 - a Kconfig spelling fix (Colin Ian King)
 - nvme_req_qid simplication (Baolin Wang)"

* tag 'nvme-5.11-20201202' of git://git.infradead.org/nvme: (23 commits)
  nvme: export zoned namespaces without Zone Append support read-only
  nvme: rename bdev operations
  nvme: rename controller base dev_t char device
  nvme: remove unnecessary return values
  nvme: print a warning for when listing active namespaces fails
  nvme: improve an error message on Identify failure
  nvme-fabrics: reject I/O to offline device
  nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig
  nvmet: make sure discovery change log event is protected
  nvmet: remove unused ctrl->cqs
  nvme-pci: don't allocate unused I/O queues
  nvme-pci: drop min() from nr_io_queues assignment
  nvmet: use inline bio for passthru fast path
  nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio
  nvmet: remove op_flags for passthru commands
  nvme: split nvme_alloc_request()
  block: move blk_rq_bio_prep() to linux/blk-mq.h
  nvmet: add passthru io timeout value attr
  nvmet: add passthru admin timeout value attr
  nvme: use consistent macro name for timeout
  ...

3b9351f0

01 Dec, 2020 23 commits

nvme: export zoned namespaces without Zone Append support read-only · 2f4c9ba2

Javier González authored Dec 01, 2020

Allow ZNS NVMe SSDs to present a read-only namespace when append is not
supported, instead of rejecting the namespace directly.

This allows (i) the namespace to be used in read-only mode, which is not
a problem as the append command only affects the write path, and (ii) to
use standard management tools such as nvme-cli to choose a different
format or firmware slot that is compatible with the Linux zoned block
device.
Signed-off-by: Javier González <javier.gonz@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

2f4c9ba2

nvme: rename bdev operations · ba4fb320

Javier González authored Dec 01, 2020

Remane block device operations in preparation to add char device file
operations.
Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ba4fb320

nvme: rename controller base dev_t char device · f68abd9c

Javier González authored Dec 01, 2020

Rename controller base dev_t char device in preparation for adding a
namespace char device.
Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

f68abd9c

nvme: remove unnecessary return values · e1aaf5ca

Javier González authored Dec 01, 2020

Cleanup unnecessary ret values that are not checked or used in
nvme_alloc_ns().
Signed-off-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

e1aaf5ca

nvme: print a warning for when listing active namespaces fails · f781f3dd

Minwoo Im authored Nov 30, 2020

During the scan_work, an Identify command is issued to figure out which
namespaces are active.  If this command fails, the nvme driver falls back
to scanning namespaces sequentially.  In this situation, we don't see
any warnings and don't even know whether list-ns command has been failed
or not easiliy.

Printa warning when the Identify command executin fail:

[    1.108399] nvme nvme0: Identify NS List failed (status=0x400b)
[    1.109583] nvme0n1: detected capacity change from 0 to 1048576
[    1.112186] nvme nvme0: Identify Descriptors failed (nsid=2, status=0x4002)
[    1.113929] nvme nvme0: Identify Descriptors failed (nsid=3, status=0x4002)
[    1.116537] nvme nvme0: Identify Descriptors failed (nsid=4, status=0x4002)
...
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

f781f3dd

nvme: improve an error message on Identify failure · aa9d7295

Minwoo Im authored Nov 30, 2020

Add the namespace ID to the error message when the Identify command used
to retrieve the Namespace Identification Descriptor list fails.

This avoids rather useless and duplicative messages like the following:
[    1.321031] nvme nvme0: Identify Descriptors failed (16386)
[    1.321948] nvme nvme0: Identify Descriptors failed (16386)
[    1.322872] nvme nvme0: Identify Descriptors failed (16386)
[    1.323775] nvme nvme0: Identify Descriptors failed (16386)
[    1.324687] nvme nvme0: Identify Descriptors failed (16386)
...

Also, print the nvme status code in hexadecimal rather than decimal
format rather for better readability.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

aa9d7295

nvme-fabrics: reject I/O to offline device · 8c4dfea9

Victor Gladkov authored Nov 24, 2020

Commands get stuck while Host NVMe-oF controller is in reconnect state.
The controller enters into reconnect state when it loses connection with
the target.  It tries to reconnect every 10 seconds (default) until
a successful reconnect or until the reconnect time-out is reached.
The default reconnect time out is 10 minutes.

Applications are expecting commands to complete with success or error
within a certain timeout (30 seconds by default).  The NVMe host is
enforcing that timeout while it is connected, but during reconnect the
timeout is not enforced and commands may get stuck for a long period or
even forever.

To fix this long delay due to the default timeout, introduce new
"fast_io_fail_tmo" session parameter.  The timeout is measured in seconds
from the controller reconnect and any command beyond that timeout is
rejected.  The new parameter value may be passed during 'connect'.
The default value of -1 means no timeout (similar to current behavior).
Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

8c4dfea9

nvmet: fix a spelling mistake "incuding" -> "including" in Kconfig · 9f20599c

Colin Ian King authored Nov 26, 2020

There is a spelling mistake in the Kconfig help text. Fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

9f20599c

nvmet: make sure discovery change log event is protected · 0068a7b0

Max Gurtovoy authored Nov 25, 2020

Generation counter is protected by nvmet_config_sem. Make sure the
callers that call functions that might change it, are calling it
properly.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

0068a7b0

nvmet: remove unused ctrl->cqs · 6d65aeab

Amit authored Nov 15, 2020

remove unused cqs from nvmet_ctrl struct
this will reduce the allocated memory.
Signed-off-by: Amit <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

6d65aeab

nvme-pci: don't allocate unused I/O queues · e3aef095

Niklas Schnelle authored Nov 12, 2020

currently the NVME_QUIRK_SHARED_TAGS quirk for Apple devices is handled
during the assignment of nr_io_queues in nvme_setup_io_queues().
This however means that for these devices nvme_max_io_queues() will
actually not return the supported maximum which is confusing and
unexpected and also means that in nvme_probe() we are allocating
for I/O queues that will never be used.
Fix this by moving the quirk handling into nvme_max_io_queues().
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

e3aef095

nvme-pci: drop min() from nr_io_queues assignment · ff4e5fba

Niklas Schnelle authored Nov 12, 2020

in nvme_setup_io_queues() the number of I/O queues is set to either 1 in
case of a quirky Apple device or to the min of nvme_max_io_queues() or
dev->nr_allocated_queues - 1.
This is unnecessarily complicated as dev->nr_allocated_queues is only
assigned once and is nvme_max_io_queues() + 1.
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ff4e5fba

nvmet: use inline bio for passthru fast path · dab3902b

Chaitanya Kulkarni authored Nov 09, 2020

In nvmet_passthru_execute_cmd() which is a high frequency function
it uses bio_alloc() which leads to memory allocation from the fs pool
for each I/O.

For NVMeoF nvmet_req we already have inline_bvec allocated as a part of
request allocation that can be used with preallocated bio when we
already know the size of request before bio allocation with bio_alloc(),
which we already do.

Introduce a bio member for the nvmet_req passthru anon union. In the
fast path, check if we can get away with inline bvec and bio from
nvmet_req with bio_init() call before actually allocating from the
bio_alloc().

This will be useful to avoid any new memory allocation under high
memory pressure situation and get rid of any extra work of
allocation (bio_alloc()) vs initialization (bio_init()) when
transfer len is < NVMET_MAX_INLINE_DATA_LEN that user can configure at
compile time.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

dab3902b

nvmet: use blk_rq_bio_prep instead of blk_rq_append_bio · a4fe2d3a

Chaitanya Kulkarni authored Nov 09, 2020

The function blk_rq_append_bio() is a genereric API written for all
types driver (having bounce buffers) and different context (where
request is already having a bio i.e. rq->bio != NULL).

It does mainly three things: calculating the segments, bounce queue and
if req->bio == NULL call blk_rq_bio_prep() or handle low level merge()
case.

The NVMe PCIe and fabrics transports currently does not use queue
bounce mechanism. In order to find this for each request processing
in the passthru blk_rq_append_bio() does extra work in the fast path
for each request.

When I ran I/Os with different block sizes on the passthru controller
I found that we can reuse the req->sg_cnt instead of iterating over the
bvecs to find out nr_segs in blk_rq_append_bio(). This calculation in
blk_rq_append_bio() is a duplication of work given that we have the
value in req->sg_cnt. (correct me here if I'm wrong).

With NVMe passthru request based driver we allocate fresh request each
time, so every call to blk_rq_append_bio() rq->bio will be NULL i.e.
we don't really need the second condition in the blk_rq_append_bio()
and the resulting error condition in the caller of blk_rq_append_bio().

So for NVMeOF passthru driver recalculating the segments, bounce check
and ll_back_merge code is not needed such that we can get away with the
minimal version of the blk_rq_append_bio() which removes the error check
in the fast path along with extra variable in nvmet_passthru_map_sg().

This patch updates the nvmet_passthru_map_sg() such that it does only
appending the bio to the request in the context of the NVMeOF Passthru
driver. Following are perf numbers :-

With current implementation (blk_rq_append_bio()) :-
----------------------------------------------------
+    5.80%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.88%     0.00%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.44%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.86%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.17%     0.00%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd

With this patch using blk_rq_bio_prep() :-
----------------------------------------------------
+    3.14%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    3.26%     0.01%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.37%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    5.18%     0.02%  kworker/0:2-eve  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.84%     0.02%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
+    4.87%     0.01%  kworker/0:2-mm_  [nvmet]  [k] nvmet_passthru_execute_cmd
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

a4fe2d3a

nvmet: remove op_flags for passthru commands · 06b3bec8

Chaitanya Kulkarni authored Nov 09, 2020

For passthru commands setting op_flags has no meaning. Remove the code
that sets the op flags in nvmet_passthru_map_sg().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

06b3bec8

nvme: split nvme_alloc_request() · 39dfe844

Chaitanya Kulkarni authored Nov 09, 2020

Right now nvme_alloc_request() allocates a request from block layer
based on the value of the qid. When qid set to NVME_QID_ANY it used
blk_mq_alloc_request() else blk_mq_alloc_request_hctx().

The function nvme_alloc_request() is called from different context, The
only place where it uses non NVME_QID_ANY value is for fabrics connect
commands :-

nvme_submit_sync_cmd()		NVME_QID_ANY
nvme_features()			NVME_QID_ANY
nvme_sec_submit()		NVME_QID_ANY
nvmf_reg_read32()		NVME_QID_ANY
nvmf_reg_read64()		NVME_QID_ANY
nvmf_reg_write32()		NVME_QID_ANY
nvmf_connect_admin_queue()	NVME_QID_ANY
nvme_submit_user_cmd()		NVME_QID_ANY
	nvme_alloc_request()
nvme_keep_alive()		NVME_QID_ANY
	nvme_alloc_request()
nvme_timeout()			NVME_QID_ANY
	nvme_alloc_request()
nvme_delete_queue()		NVME_QID_ANY
	nvme_alloc_request()
nvmet_passthru_execute_cmd()	NVME_QID_ANY
	nvme_alloc_request()
nvmf_connect_io_queue() 	QID
	__nvme_submit_sync_cmd()
		nvme_alloc_request()

With passthru nvme_alloc_request() now falls into the I/O fast path such
that blk_mq_alloc_request_hctx() is never gets called and that adds
additional branch check in fast path.

Split the nvme_alloc_request() into nvme_alloc_request() and
nvme_alloc_request_qid().

Replace each call of the nvme_alloc_request() with NVME_QID_ANY param
with a call to newly added nvme_alloc_request() without NVME_QID_ANY.

Replace a call to nvme_alloc_request() with QID param with a call to
newly added nvme_alloc_request() and nvme_alloc_request_qid()
based on the qid value set in the __nvme_submit_sync_cmd().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

39dfe844

block: move blk_rq_bio_prep() to linux/blk-mq.h · 53ffabfd

Chaitanya Kulkarni authored Nov 09, 2020

This is a preparation patch to have minimal block layer request bio
append functionality in the context of the NVMeOF Passthru driver which
falls in the fast path and doesn't need calls from blk_rq_append_bio().
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

53ffabfd

nvmet: add passthru io timeout value attr · 47e9730c

Chaitanya Kulkarni authored Nov 09, 2020

NVMeOF controller in the passsthru mode is capable of handling wide set
of I/O commands including vender specific passhtru io comands.

The vendor specific I/O commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:io_timeout).

Add a configfs attribute so that user can set the io timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the NVME_IO_TIMEOUT value when request queuedata is NULL.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

47e9730c

nvmet: add passthru admin timeout value attr · a2f6a2b8

Chaitanya Kulkarni authored Nov 09, 2020

NVMeOF controller in the passsthru mode is capable of handling wide set
of admin commands including vender specific passhtru admin comands.

The vendor specific admin commands are used to read the large drive
logs and can take longer than default NVMe commands, i.e. for
passthru requests the timeout value may differ from the passthru
controller's default timeout values (nvme-core:admin_timeout).

Add a configfs attribute so that user can set the admin timeout values.
In case if this configfs value is not set nvme_alloc_request() will set
the ADMIN_TIMEOUT value when request queuedata is NULL.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

a2f6a2b8

nvme: use consistent macro name for timeout · dc96f938

Chaitanya Kulkarni authored Nov 09, 2020

This is purely a clenaup patch, add prefix NVME to the ADMIN_TIMEOUT to
make consistent with NVME_IO_TIMEOUT.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

dc96f938

nvme: centralize setting the timeout in nvme_alloc_request · 0d2e7c84

Chaitanya Kulkarni authored Nov 09, 2020

The function nvme_alloc_request() is called from different context
(I/O and Admin queue) where callers do not consider the I/O timeout when
called from I/O queue context.

Update nvme_alloc_request() to set the default I/O and Admin timeout
value based on whether the queuedata is set or not.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

0d2e7c84

nvme: simplify nvme_req_qid() · 84115d6d

Baolin Wang authored Oct 27, 2020

Use the request's '->mq_hctx->queue_num' directly to simplify the
nvme_req_qid() function.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

84115d6d

nvme-fcloop: add sysfs attribute to inject command drop · 03d99e5d

James Smart authored Oct 16, 2020

Add sysfs attribute to specify parameters for dropping a command.  The
attribute takes a string of:

  <opcode>:<starting a what instance>:<number of times>

Opcode is formatted as lower 8 bits are opcode.  If a fabrics opcode, a
bit above bits 7:0 will be set.

Once set, each sqe is looked at. If the opcode matches the running
instance count is updated. If the instance count is in the range of where
to drop (based on starting and # of times), then drop the command by not
passing it to the target layer.
Signed-off-by: James Smart <james.smart@broadcom.com>

03d99e5d

30 Nov, 2020 8 commits

Merge branch 'md-next' of... · 48332ff2

Jens Axboe authored Nov 30, 2020

Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.11/drivers

Pull MD changes from Song:

"Summary:
 1. Fix race condition in md_ioctl(), by Dae R. Jeong;
 2. Initialize read_slot properly for raid10, by Kevin Vigor;
 3. Code cleanup, by Pankaj Gupta;
 4. md-cluster resync/reshape fix, by Zhao Heming."

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
  md/cluster: fix deadlock when node is doing resync job
  md/cluster: block reshape with remote resync job
  md: use current request time as base for ktime comparisons
  md: add comments in md_flush_request()
  md: improve variable names in md_flush_request()
  md/raid10: initialize r10_bio->read_slot before use.
  md: fix a warning caused by a race between concurrent md_ioctl()s

48332ff2

md/cluster: fix deadlock when node is doing resync job · bca5b065

Zhao Heming authored Nov 19, 2020

md-cluster uses MD_CLUSTER_SEND_LOCK to make node can exclusively send msg.
During sending msg, node can concurrently receive msg from another node.
When node does resync job, grab token_lockres:EX may trigger a deadlock:
```
nodeA                       nodeB
--------------------     --------------------
a.
send METADATA_UPDATED
held token_lockres:EX
                         b.
                         md_do_sync
                          resync_info_update
                            send RESYNCING
                             + set MD_CLUSTER_SEND_LOCK
                             + wait for holding token_lockres:EX

                         c.
                         mdadm /dev/md0 --remove /dev/sdg
                          + held reconfig_mutex
                          + send REMOVE
                             + wait_event(MD_CLUSTER_SEND_LOCK)

                         d.
                         recv_daemon //METADATA_UPDATED from A
                          process_metadata_update
                           + (mddev_trylock(mddev) ||
                              MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD)
                             //this time, both return false forever
```
Explaination:
a. A send METADATA_UPDATED
   This will block another node to send msg

b. B does sync jobs, which will send RESYNCING at intervals.
   This will be block for holding token_lockres:EX lock.

c. B do "mdadm --remove", which will send REMOVE.
   This will be blocked by step <b>: MD_CLUSTER_SEND_LOCK is 1.

d. B recv METADATA_UPDATED msg, which send from A in step <a>.
   This will be blocked by step <c>: holding mddev lock, it makes
   wait_event can't hold mddev lock. (btw,
   MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD keep ZERO in this scenario.)

There is a similar deadlock in commit 0ba95977
("md-cluster: use sync way to handle METADATA_UPDATED msg")
In that commit, step c is "update sb". This patch step c is
"mdadm --remove".

For fixing this issue, we can refer the solution of function:
metadata_update_start. Which does the same grab lock_token action.
lock_comm can use the same steps to avoid deadlock. By moving
MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD from lock_token to lock_comm.
It enlarge a little bit window of MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD,
but it is safe & can break deadlock.

Repro steps (I only triggered 3 times with hundreds tests):

two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB.
```
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh \
 --bitmap-chunk=1M
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mkfs.xfs /dev/md0
mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

test script will hung when executing "mdadm --remove".

```
 # dump stacks by "echo t > /proc/sysrq-trigger"
md0_cluster_rec D    0  5329      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? _cond_resched+0x2d/0x40
 ? schedule+0x4a/0xb0
 ? process_metadata_update.isra.0+0xdb/0x140 [md_cluster]
 ? wait_woken+0x80/0x80
 ? process_recvd_msg+0x113/0x1d0 [md_cluster]
 ? recv_daemon+0x9e/0x120 [md_cluster]
 ? md_thread+0x94/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40

mdadm           D    0  5423      1 0x00004004
Call Trace:
 __schedule+0x1f6/0x560
 ? __schedule+0x1fe/0x560
 ? schedule+0x4a/0xb0
 ? lock_comm.isra.0+0x7b/0xb0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? remove_disk+0x4f/0x90 [md_cluster]
 ? hot_remove_disk+0xb1/0x1b0 [md_mod]
 ? md_ioctl+0x50c/0xba0 [md_mod]
 ? wait_woken+0x80/0x80
 ? blkdev_ioctl+0xa2/0x2a0
 ? block_ioctl+0x39/0x40
 ? ksys_ioctl+0x82/0xc0
 ? __x64_sys_ioctl+0x16/0x20
 ? do_syscall_64+0x5f/0x150
 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

md0_resync      D    0  5425      2 0x80004000
Call Trace:
 __schedule+0x1f6/0x560
 ? schedule+0x4a/0xb0
 ? dlm_lock_sync+0xa1/0xd0 [md_cluster]
 ? wait_woken+0x80/0x80
 ? lock_token+0x2d/0x90 [md_cluster]
 ? resync_info_update+0x95/0x100 [md_cluster]
 ? raid1_sync_request+0x7d3/0xa40 [raid1]
 ? md_do_sync.cold+0x737/0xc8f [md_mod]
 ? md_thread+0x94/0x160 [md_mod]
 ? md_congested+0x30/0x30 [md_mod]
 ? kthread+0x115/0x140
 ? __kthread_bind_mask+0x60/0x60
 ? ret_from_fork+0x1f/0x40
```

At last, thanks for Xiao's solution.

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Suggested-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

bca5b065

md/cluster: block reshape with remote resync job · a8da01f7

Zhao Heming authored Nov 19, 2020

Reshape request should be blocked with ongoing resync job. In cluster
env, a node can start resync job even if the resync cmd isn't executed
on it, e.g., user executes "mdadm --grow" on node A, sometimes node B
will start resync job. However, current update_raid_disks() only check
local recovery status, which is incomplete. As a result, we see user will
execute "mdadm --grow" successfully on local, while the remote node deny
to do reshape job when it doing resync job. The inconsistent handling
cause array enter unexpected status. If user doesn't observe this issue
and continue executing mdadm cmd, the array doesn't work at last.

Fix this issue by blocking reshape request. When node executes "--grow"
and detects ongoing resync, it should stop and report error to user.

The following script reproduces the issue with ~100% probability.
(two nodes share 3 iSCSI luns: sdg/sdh/sdi. Each lun size is 1GB)
```
 # on node1, node2 is the remote node.
ssh root@node2 "mdadm -S --scan"
mdadm -S --scan
for i in {g,h,i};do dd if=/dev/zero of=/dev/sd$i oflag=direct bs=1M \
count=20; done

mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdg /dev/sdh
ssh root@node2 "mdadm -A /dev/md0 /dev/sdg /dev/sdh"

sleep 5

mdadm --manage --add /dev/md0 /dev/sdi
mdadm --wait /dev/md0
mdadm --grow --raid-devices=3 /dev/md0

mdadm /dev/md0 --fail /dev/sdg
mdadm /dev/md0 --remove /dev/sdg
mdadm --grow --raid-devices=2 /dev/md0
```

Cc: stable@vger.kernel.org
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

a8da01f7

md: use current request time as base for ktime comparisons · a23f2aae

Pankaj Gupta authored Nov 11, 2020

Request coalescing logic uses 'prev_flush_start' as base to
compare the current request start time. 'prev_flush_start' is
updated in other context.

This patch changes this by using ktime comparison base to
'req_start' for better readability of code.
Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

a23f2aae

md: add comments in md_flush_request() · 204d1a64

Pankaj Gupta authored Nov 11, 2020

Request coalescing logic is dependent on flush time update in other
context. This patch adds comments to understand the code flow better.
Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

204d1a64

md: improve variable names in md_flush_request() · 81ba3c24

Pankaj Gupta authored Nov 11, 2020

This patch improves readability by using better variable names
in flush request coalescing logic.
Signed-off-by: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Song Liu <songliubraving@fb.com>

81ba3c24

md/raid10: initialize r10_bio->read_slot before use. · 93decc56

Kevin Vigor authored Nov 06, 2020

In __make_request() a new r10bio is allocated and passed to
raid10_read_request(). The read_slot member of the bio is not
initialized, and the raid10_read_request() uses it to index an
array. This leads to occasional panics.

Fix by initializing the field to invalid value and checking for
valid value in raid10_read_request().

Cc: stable@vger.kernel.org
Signed-off-by: Kevin Vigor <kvigor@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>

93decc56

md: fix a warning caused by a race between concurrent md_ioctl()s · c731b84b

Dae R. Jeong authored Oct 22, 2020

Syzkaller reports a warning as belows.
WARNING: CPU: 0 PID: 9647 at drivers/md/md.c:7169
...
Call Trace:
...
RIP: 0010:md_ioctl+0x4017/0x5980 drivers/md/md.c:7169
RSP: 0018:ffff888096027950 EFLAGS: 00010293
RAX: ffff88809322c380 RBX: 0000000000000932 RCX: ffffffff84e266f2
RDX: 0000000000000000 RSI: ffffffff84e299f7 RDI: 0000000000000007
RBP: ffff888096027bc0 R08: ffff88809322c380 R09: ffffed101341a482
R10: ffff888096027940 R11: ffff88809a0d240f R12: 0000000000000932
R13: ffff8880a2c14100 R14: ffff88809a0d2268 R15: ffff88809a0d2408
 __blkdev_driver_ioctl block/ioctl.c:304 [inline]
 blkdev_ioctl+0xece/0x1c10 block/ioctl.c:606
 block_ioctl+0xee/0x130 fs/block_dev.c:1930
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:509 [inline]
 do_vfs_ioctl+0xd5f/0x1380 fs/ioctl.c:696
 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is caused by a race between two concurrenct md_ioctl()s closing
the array.
CPU1 (md_ioctl())                   CPU2 (md_ioctl())
------                              ------
set_bit(MD_CLOSING, &mddev->flags);
did_set_md_closing = true;
                                    WARN_ON_ONCE(test_bit(MD_CLOSING,
                                            &mddev->flags));
if(did_set_md_closing)
    clear_bit(MD_CLOSING, &mddev->flags);

Fix the warning by returning immediately if the MD_CLOSING bit is set
in &mddev->flags which indicates that the array is being closed.

Fixes: 065e519e ("md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop")
Reported-by: syzbot+1e46a0864c1a6e9bd3d8@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dae R. Jeong <dae.r.jeong@kaist.ac.kr>
Signed-off-by: Song Liu <songliubraving@fb.com>

c731b84b

16 Nov, 2020 3 commits

s390/dasd: Process FCES path event notification · 4d063e64

Jan Höppner authored Oct 08, 2020

If the Fibre Channel Endpoint-Security status of a path changes, a
corresponding path event is received from the CIO layer.

Process this event by re-reading the FCES information.

As the information is retrieved for all paths on a single CU in one
call, the internal status can also be updated for all paths and no
processing per path is necessary.
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4d063e64

s390/dasd: Prepare for additional path event handling · b7294932

Jan Höppner authored Oct 08, 2020

As more path events need to be handled for ECKD the current path
verification infrastructure can be reused. Rename all path verifcation
code to fit the more broadly based task of path event handling and put
the path verification in a new separate function.
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b7294932

s390/dasd: Display FC Endpoint Security information via sysfs · 19508b20

Jan Höppner authored Oct 08, 2020

Add a new sysfs attribute (fc_security) per device and per operational
channel path. The information of the current FC Endpoint Security state
is received through the CIO layer.

The state of the FC Endpoint Security can be either "Unsupported",
"Authentication", or "Encryption".

For example:
$ cat /sys/bus/ccw/devices/0.0.c600/fc_security
Encryption

If any of the operational paths is in a state different from all
others, the device sysfs attribute will display the additional state
"Inconsistent".

The sysfs attributes per paths are organised in a new directory called
"paths_info" with subdirectories for each path.

/sys/bus/ccw/devices/0.0.c600/paths_info/
├── 0.38
│   └── fc_security
├── 0.39
│   └── fc_security
├── 0.3a
│   └── fc_security
└── 0.3b
    └── fc_security
Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

19508b20