Commit 0be600a5 authored by Linus Torvalds


Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - DM core fixes to ensure that bio submission follows a depth-first
   tree walk; this is critical to allow forward progress without the
   need to use the bioset's BIOSET_NEED_RESCUER.

 - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.

 - DM core cleanups and improvements to make bio-based DM more efficient
   (e.g. reduced memory footprint as well as leveraging per-bio-data more).

 - Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
   the more direct IO submission path in the block layer; this mode is
   used by DM multipath and also optimizes targets like DM thin-pool
   that stack directly on an NVMe data device (an illustrative table
   line follows this list).

 - DM multipath improvements to factor out legacy SCSI-only (e.g.
   scsi_dh) code paths to allow for more optimized support for NVMe
   multipath.

 - A fix for DM multipath path selectors (service-time and queue-length)
   to select paths in a more balanced way; largely academic but doesn't
   hurt.

 - Numerous DM raid target fixes and improvements.

 - Add a new DM "unstriped" target that enables Intel to work around
   firmware limitations in some NVMe drives that are striped internally
   (this target also works when stacked above the DM "striped" target).

 - Various Documentation fixes and improvements.

 - Misc cleanups and fixes across various DM infrastructure and targets
   (e.g. bufio, flakey, log-writes, snapshot).
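
  [Illustration added for this writeup, not part of the original pull request:
  a bio-based NVMe multipath device selects the new mode with the
  "queue_mode nvme" feature in its table line. The device numbers, length
  and path-selector arguments below are hypothetical:

    echo "0 41943040 multipath 2 queue_mode nvme 0 1 1 queue-length 0 2 1 259:0 1 259:5 1" \
      | dmsetup create mpath_nvme
  ]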

* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
  dm cache: Documentation: update default migration_throttling value
  dm mpath selector: more evenly distribute ties
  dm unstripe: fix target length versus number of stripes size check
  dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
  dm table: fix NVMe bio-based dm_table_determine_type() validation
  dm: various cleanups to md->queue initialization code
  dm mpath: delay the retry of a request if the target responded as busy
  dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
  dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
  dm log writes: fix max length used for kstrndup
  dm: backfill missing calls to mutex_destroy()
  dm snapshot: use mutex instead of rw_semaphore
  dm flakey: check for null arg_name in parse_features()
  dm thin: extend thinpool status format string with omitted fields
  dm thin: fixes in thin-provisioning.txt
  dm thin: document representation of <highest mapped sector> when there is none
  dm thin: fix documentation relative to low water mark threshold
  dm cache: be consistent in specifying sectors and SI units in cache.txt
  dm cache: delete obsoleted paragraph in cache.txt
  dm cache: fix grammar in cache-policies.txt
  ...
parents 040639b7 9614e2ba
@@ -60,7 +60,7 @@ Memory usage:
 The mq policy used a lot of memory; 88 bytes per cache block on a 64
 bit machine.
-smq uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement its data structures rather than
 pointers. It avoids storing an explicit hit count for each block. It
 has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
 the entries (each hotspot block covers a larger area than a single
@@ -84,7 +84,7 @@ resulting in better promotion/demotion decisions.
 Adaptability:
 The mq policy maintained a hit count for each cache block. For a
-different block to get promoted to the cache it's hit count has to
+different block to get promoted to the cache its hit count has to
 exceed the lowest currently in the cache. This meant it could take a
 long time for the cache to adapt between varying IO patterns.
......
@@ -59,7 +59,7 @@ Fixed block size
 The origin is divided up into blocks of a fixed size. This block size
 is configurable when you first create the cache. Typically we've been
 using block sizes of 256KB - 1024KB. The block size must be between 64
-(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).
+sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
 Having a fixed block size simplifies the target a lot. But it is
 something of a compromise. For instance, a small part of a block may be
@@ -119,7 +119,7 @@ doing here to avoid migrating during those peak io moments.
 For the time being, a message "migration_threshold <#sectors>"
 can be used to set the maximum number of sectors being migrated,
-the default being 204800 sectors (or 100MB).
+the default being 2048 sectors (1MB).
 Updating on-disk metadata
 -------------------------
@@ -143,11 +143,6 @@ the policy how big this chunk is, but it should be kept small. Like the
 dirty flags this data is lost if there's a crash so a safe fallback
 value should always be possible.
-For instance, the 'mq' policy, which is currently the default policy,
-uses this facility to store the hit count of the cache blocks.  If
-there's a crash this information will be lost, which means the cache
-may be less efficient until those hit counts are regenerated.
 Policy hints affect performance, not correctness.
 Policy messaging
......
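
Aside (not part of the diff): the hunk above only corrects the documented
default; the threshold itself is tuned at run time with a dmsetup message of
the form below. The device name and value are illustrative:

  dmsetup message my_cache 0 migration_threshold 2048
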
@@ -343,5 +343,8 @@ Version History
 1.11.0  Fix table line argument order
         (wrong raid10_copies/raid10_format sequence)
 1.11.1  Add raid4/5/6 journal write-back support via journal_mode option
-1.12.1  fix for MD deadlock between mddev_suspend() and md_write_start() available
+1.12.1  Fix for MD deadlock between mddev_suspend() and md_write_start() available
 1.13.0  Fix dev_health status at end of "recover" (was 'a', now 'A')
+1.13.1  Fix deadlock caused by early md_stop_writes().  Also fix size and
+        state races.
+1.13.2  Fix raid redundancy validation and avoid keeping raid set frozen
@@ -49,6 +49,10 @@ The difference between persistent and transient is with transient
 snapshots less metadata must be saved on disk - they can be kept in
 memory by the kernel.
 
+When loading or unloading the snapshot target, the corresponding
+snapshot-origin or snapshot-merge target must be suspended. A failure to
+suspend the origin target could result in data corruption.
+
 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
......
@@ -112,9 +112,11 @@ $low_water_mark is expressed in blocks of size $data_block_size. If
 free space on the data device drops below this level then a dm event
 will be triggered which a userspace daemon should catch allowing it to
 extend the pool device. Only one such event will be sent.
-Resuming a device with a new table itself triggers an event so the
-userspace daemon can use this to detect a situation where a new table
-already exceeds the threshold.
+No special event is triggered if a just resumed device's free space is below
+the low water mark. However, resuming a device always triggers an
+event; a userspace daemon should verify that free space exceeds the low
+water mark when handling this event.
 A low water mark for the metadata device is maintained in the kernel and
 will trigger a dm event if free space on the metadata device drops below
@@ -274,7 +276,8 @@ ii) Status
     <transaction id> <used metadata blocks>/<total metadata blocks>
     <used data blocks>/<total data blocks> <held metadata root>
-    [no_]discard_passdown ro|rw
+    ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
+    needs_check|-
 transaction id:
     A 64-bit number used by userspace to help synchronise with metadata
@@ -394,3 +397,6 @@ ii) Status
 If the pool has encountered device errors and failed, the status
 will just contain the string 'Fail'. The userspace recovery
 tools should then be used.
+
+In the case where <nr mapped sectors> is 0, there is no highest
+mapped sector and the value of <highest mapped sector> is unspecified.
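
Aside (not part of the diff): as a hedged illustration of the low-water-mark
behaviour documented above, a userspace helper could block until the pool's
next dm event and then re-check free space from the status line. The pool
name is hypothetical:

  dmsetup wait pool
  dmsetup status pool
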
Introduction
============
The device-mapper "unstriped" target provides a transparent mechanism to
unstripe a device-mapper "striped" target to access the underlying disks
without having to touch the true backing block-device. It can also be
used to unstripe a hardware RAID-0 to access backing disks.
Parameters:
<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
<number of stripes>
        The number of stripes in the RAID 0.

<chunk size>
        The number of 512B sectors in the chunk striping.

<dev_path>
        The block device you wish to unstripe.

<stripe #>
        The stripe number within the device that corresponds to the physical
        drive you wish to unstripe. This must be 0 indexed.
Why use this module?
====================
An example of undoing an existing dm-stripe
-------------------------------------------
This small bash script will set up 4 loop devices and use the existing
striped target to combine the 4 devices into one. It will then use the
unstriped target on top of the striped device to access the individual
backing loop devices. We write data to the newly exposed unstriped
devices and verify that the data written matches the correct underlying
device on the striped array.
#!/bin/bash

MEMBER_SIZE=$((128 * 1024 * 1024))
NUM=4
SEQ_END=$((${NUM}-1))
CHUNK=256
BS=4096

RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
COUNT=$((${MEMBER_SIZE} / ${BS}))

# Create the backing files, attach them to loop devices and build the
# striped target's table as we go.
for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
  losetup /dev/loop${i} member-${i}
  DM_PARMS+=" /dev/loop${i} 0"
done
echo $DM_PARMS | dmsetup create raid0

# Expose each member of the striped device through an unstriped target.
for i in $(seq 0 ${SEQ_END}); do
  echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
done;

# Write random data through the unstriped devices and verify it landed on
# the matching backing member.
for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
  diff /dev/mapper/set-${i} member-${i}
done;

# Tear everything down.
for i in $(seq 0 ${SEQ_END}); do
  dmsetup remove set-${i}
done
dmsetup remove raid0

for i in $(seq 0 ${SEQ_END}); do
  losetup -d /dev/loop${i}
  rm -f member-${i}
done
Another example
---------------
Intel NVMe drives contain two cores on the physical device.
Each core of the drive has segregated access to its LBA range.
The current LBA model has a RAID 0 128k chunk on each core, resulting
in a 256k stripe across the two cores:
Core 0:         Core 1:
  __________      __________
  | LBA 512|      | LBA 768|
  | LBA 0  |      | LBA 256|
  ----------      ----------
The purpose of this unstriping is to provide better QoS in noisy
neighbor environments. When two partitions are created on the
aggregate drive without this unstriping, reads on one partition
can affect writes on another partition. This is because the partitions
are striped across the two cores. When we unstripe this hardware RAID 0
and make partitions on each newly exposed device, the two partitions are
physically separated.
With the dm-unstriped target we're able to segregate an fio script that
has read and write jobs that are independent of each other. Compared to
running the same test on a combined drive with partitions, we saw a 92%
reduction in read latency using this device-mapper target.
Example dmsetup usage
=====================
unstriped on top of Intel NVMe device that has 2 cores
-----------------------------------------------------
dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
There will now be two devices that expose Intel NVMe core 0 and 1
respectively:
/dev/mapper/nvmset0
/dev/mapper/nvmset1
unstriped on top of striped with 4 drives using 128K chunk size
--------------------------------------------------------------
dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
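
Sizing note (an illustrative addition, not from the original document): each
unstriped mapping typically covers the backing device's size divided by the
number of stripes, so the table length can be computed rather than hard coded.
The device name and stripe count below are hypothetical:

  DEV=/dev/nvme0n1                                     # hypothetical backing device
  STRIPES=2
  LEN=$(( $(blockdev --getsz ${DEV}) / ${STRIPES} ))   # length in 512B sectors
  dmsetup create nvmset0 --table "0 ${LEN} unstriped ${STRIPES} 256 0 ${DEV} 0"
  dmsetup create nvmset1 --table "0 ${LEN} unstriped ${STRIPES} 256 1 ${DEV} 0"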
@@ -269,6 +269,13 @@ config DM_BIO_PRISON
 
 source "drivers/md/persistent-data/Kconfig"
 
+config DM_UNSTRIPED
+    tristate "Unstriped target"
+    depends on BLK_DEV_DM
+    ---help---
+      Unstripes I/O so it is issued solely on a single drive in a HW
+      RAID0 or dm-striped target.
+
 config DM_CRYPT
     tristate "Crypt target support"
     depends on BLK_DEV_DM
......
@@ -43,6 +43,7 @@ obj-$(CONFIG_BCACHE) += bcache/
 obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
 obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
+obj-$(CONFIG_DM_UNSTRIPED) += dm-unstripe.o
 obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
 obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
......
@@ -662,7 +662,7 @@ static void submit_io(struct dm_buffer *b, int rw, bio_end_io_t *end_io)
 
     sector = (b->block << b->c->sectors_per_block_bits) + b->c->start;
 
-    if (rw != WRITE) {
+    if (rw != REQ_OP_WRITE) {
         n_sectors = 1 << b->c->sectors_per_block_bits;
         offset = 0;
     } else {
@@ -740,7 +740,7 @@ static void __write_dirty_buffer(struct dm_buffer *b,
     b->write_end = b->dirty_end;
 
     if (!write_list)
-        submit_io(b, WRITE, write_endio);
+        submit_io(b, REQ_OP_WRITE, write_endio);
     else
         list_add_tail(&b->write_list, write_list);
 }
@@ -753,7 +753,7 @@ static void __flush_write_list(struct list_head *write_list)
         struct dm_buffer *b =
             list_entry(write_list->next, struct dm_buffer, write_list);
         list_del(&b->write_list);
-        submit_io(b, WRITE, write_endio);
+        submit_io(b, REQ_OP_WRITE, write_endio);
         cond_resched();
     }
     blk_finish_plug(&plug);
@@ -1123,7 +1123,7 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
         return NULL;
 
     if (need_submit)
-        submit_io(b, READ, read_endio);
+        submit_io(b, REQ_OP_READ, read_endio);
 
     wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE);
@@ -1193,7 +1193,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
         dm_bufio_unlock(c);
 
         if (need_submit)
-            submit_io(b, READ, read_endio);
+            submit_io(b, REQ_OP_READ, read_endio);
         dm_bufio_release(b);
 
         cond_resched();
@@ -1454,7 +1454,7 @@ void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block)
             old_block = b->block;
             __unlink_buffer(b);
             __link_buffer(b, new_block, b->list_mode);
-            submit_io(b, WRITE, write_endio);
+            submit_io(b, REQ_OP_WRITE, write_endio);
             wait_on_bit_io(&b->state, B_WRITING,
                            TASK_UNINTERRUPTIBLE);
             __unlink_buffer(b);
@@ -1716,7 +1716,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
         if (!DM_BUFIO_CACHE_NAME(c)) {
             r = -ENOMEM;
             mutex_unlock(&dm_bufio_clients_lock);
-            goto bad_cache;
+            goto bad;
         }
     }
@@ -1727,7 +1727,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
         if (!DM_BUFIO_CACHE(c)) {
             r = -ENOMEM;
             mutex_unlock(&dm_bufio_clients_lock);
-            goto bad_cache;
+            goto bad;
         }
     }
 }
@@ -1738,27 +1738,28 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
         if (!b) {
             r = -ENOMEM;
-            goto bad_buffer;
+            goto bad;
         }
         __free_buffer_wake(b);
     }
 
+    c->shrinker.count_objects = dm_bufio_shrink_count;
+    c->shrinker.scan_objects = dm_bufio_shrink_scan;
+    c->shrinker.seeks = 1;
+    c->shrinker.batch = 0;
+    r = register_shrinker(&c->shrinker);
+    if (r)
+        goto bad;
+
     mutex_lock(&dm_bufio_clients_lock);
     dm_bufio_client_count++;
     list_add(&c->client_list, &dm_bufio_all_clients);
     __cache_size_refresh();
     mutex_unlock(&dm_bufio_clients_lock);
 
-    c->shrinker.count_objects = dm_bufio_shrink_count;
-    c->shrinker.scan_objects = dm_bufio_shrink_scan;
-    c->shrinker.seeks = 1;
-    c->shrinker.batch = 0;
-    register_shrinker(&c->shrinker);
-
     return c;
 
-bad_buffer:
-bad_cache:
+bad:
     while (!list_empty(&c->reserved_buffers)) {
         struct dm_buffer *b = list_entry(c->reserved_buffers.next,
                                          struct dm_buffer, lru_list);
@@ -1767,6 +1768,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
     }
     dm_io_client_destroy(c->dm_io);
 bad_dm_io:
+    mutex_destroy(&c->lock);
     kfree(c);
 bad_client:
     return ERR_PTR(r);
@@ -1811,6 +1813,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
         BUG_ON(c->n_buffers[i]);
 
     dm_io_client_destroy(c->dm_io);
+    mutex_destroy(&c->lock);
     kfree(c);
 }
 EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
......
@@ -91,8 +91,7 @@ struct mapped_device {
     /*
      * io objects are allocated from here.
      */
-    mempool_t *io_pool;
-
+    struct bio_set *io_bs;
     struct bio_set *bs;
 
     /*
@@ -130,8 +129,6 @@ struct mapped_device {
     struct srcu_struct io_barrier;
 };
 
-void dm_init_md_queue(struct mapped_device *md);
-void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
......
@@ -2193,6 +2193,8 @@ static void crypt_dtr(struct dm_target *ti)
     kzfree(cc->cipher_auth);
     kzfree(cc->authenc_key);
 
+    mutex_destroy(&cc->bio_alloc_lock);
+
     /* Must zero key material before freeing */
     kzfree(cc);
 }
@@ -2702,8 +2704,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
         goto bad;
     }
 
-    cc->bs = bioset_create(MIN_IOS, 0, (BIOSET_NEED_BVECS |
-                                        BIOSET_NEED_RESCUER));
+    cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
     if (!cc->bs) {
         ti->error = "Cannot allocate crypt bioset";
         goto bad;
......
@@ -229,6 +229,8 @@ static void delay_dtr(struct dm_target *ti)
     if (dc->dev_write)
         dm_put_device(ti, dc->dev_write);
 
+    mutex_destroy(&dc->timer_lock);
+
     kfree(dc);
 }
......
@@ -70,6 +70,11 @@ static int parse_features(struct dm_arg_set *as, struct flakey_c *fc,
         arg_name = dm_shift_arg(as);
         argc--;
 
+        if (!arg_name) {
+            ti->error = "Insufficient feature arguments";
+            return -EINVAL;
+        }
+
         /*
          * drop_writes
          */
......
@@ -58,8 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
     if (!client->pool)
         goto bad;
 
-    client->bios = bioset_create(min_ios, 0, (BIOSET_NEED_BVECS |
-                                              BIOSET_NEED_RESCUER));
+    client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
     if (!client->bios)
         goto bad;
......
@@ -477,8 +477,10 @@ static int run_complete_job(struct kcopyd_job *job)
     * If this is the master job, the sub jobs have already
     * completed so we can free everything.
     */
-    if (job->master_job == job)
+    if (job->master_job == job) {
+        mutex_destroy(&job->lock);
         mempool_free(job, kc->job_pool);
+    }
 
     fn(read_err, write_err, context);
 
     if (atomic_dec_and_test(&kc->nr_jobs))
@@ -750,6 +752,7 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
     * followed by SPLIT_COUNT sub jobs.
     */
     job = mempool_alloc(kc->job_pool, GFP_NOIO);
+    mutex_init(&job->lock);
 
     /*
     * set up for the read.
@@ -811,7 +814,6 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
     if (job->source.count <= SUB_JOB_SIZE)
         dispatch_job(job);
     else {
-        mutex_init(&job->lock);
         job->progress = 0;
         split_job(job);
     }
......
@@ -594,7 +594,7 @@ static int log_mark(struct log_writes_c *lc, char *data)
         return -ENOMEM;
     }
 
-    block->data = kstrndup(data, maxsize, GFP_KERNEL);
+    block->data = kstrndup(data, maxsize - 1, GFP_KERNEL);
     if (!block->data) {
         DMERR("Error copying mark data");
         kfree(block);
......
@@ -64,36 +64,30 @@ struct priority_group {
 
 /* Multipath context */
 struct multipath {
-    struct list_head list;
-    struct dm_target *ti;
-
-    const char *hw_handler_name;
-    char *hw_handler_params;
+    unsigned long flags;            /* Multipath state flags */
 
     spinlock_t lock;
-
-    unsigned nr_priority_groups;
-    struct list_head priority_groups;
-
-    wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
+    enum dm_queue_mode queue_mode;
 
     struct pgpath *current_pgpath;
     struct priority_group *current_pg;
     struct priority_group *next_pg; /* Switch to this PG if set */
 
-    unsigned long flags;            /* Multipath state flags */
+    atomic_t nr_valid_paths;        /* Total number of usable paths */
+    unsigned nr_priority_groups;
+    struct list_head priority_groups;
 
+    const char *hw_handler_name;
+    char *hw_handler_params;
+    wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
     unsigned pg_init_retries;       /* Number of times to retry pg_init */
     unsigned pg_init_delay_msecs;   /* Number of msecs before pg_init retry */
-
-    atomic_t nr_valid_paths;        /* Total number of usable paths */
     atomic_t pg_init_in_progress;   /* Only one pg_init allowed at once */
    atomic_t pg_init_count;         /* Number of times pg_init called */
 
-    enum dm_queue_mode queue_mode;
-
     struct mutex work_mutex;
     struct work_struct trigger_event;
+    struct dm_target *ti;
 
     struct work_struct process_queued_bios;
     struct bio_list queued_bios;
@@ -135,10 +129,10 @@ static struct pgpath *alloc_pgpath(void)
 {
     struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
 
-    if (pgpath) {
-        pgpath->is_active = true;
-        INIT_DELAYED_WORK(&pgpath->activate_path, activate_path_work);
-    }
+    if (!pgpath)
+        return NULL;
+
+    pgpath->is_active = true;
 
     return pgpath;
 }
@@ -193,13 +187,8 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
     if (m) {
         INIT_LIST_HEAD(&m->priority_groups);
         spin_lock_init(&m->lock);
-        set_bit(MPATHF_QUEUE_IO, &m->flags);
         atomic_set(&m->nr_valid_paths, 0);
-        atomic_set(&m->pg_init_in_progress, 0);
-        atomic_set(&m->pg_init_count, 0);
-        m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
         INIT_WORK(&m->trigger_event, trigger_event);
-        init_waitqueue_head(&m->pg_init_wait);
         mutex_init(&m->work_mutex);
 
         m->queue_mode = DM_TYPE_NONE;
@@ -221,14 +210,27 @@ static int alloc_multipath_stage2(struct dm_target *ti, struct multipath *m)
             m->queue_mode = DM_TYPE_MQ_REQUEST_BASED;
         else
             m->queue_mode = DM_TYPE_REQUEST_BASED;
-    } else if (m->queue_mode == DM_TYPE_BIO_BASED) {
+
+    } else if (m->queue_mode == DM_TYPE_BIO_BASED ||
+               m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
         INIT_WORK(&m->process_queued_bios, process_queued_bios);
+
+        if (m->queue_mode == DM_TYPE_BIO_BASED) {
             /*
             * bio-based doesn't support any direct scsi_dh management;
             * it just discovers if a scsi_dh is attached.
             */
            set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
        }
+    }
+
+    if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
+        set_bit(MPATHF_QUEUE_IO, &m->flags);
+        atomic_set(&m->pg_init_in_progress, 0);
+        atomic_set(&m->pg_init_count, 0);
+        m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
+        init_waitqueue_head(&m->pg_init_wait);
+    }
 
     dm_table_set_type(ti->table, m->queue_mode);
@@ -246,6 +248,7 @@ static void free_multipath(struct multipath *m)
 
     kfree(m->hw_handler_name);
     kfree(m->hw_handler_params);
+    mutex_destroy(&m->work_mutex);
     kfree(m);
 }
@@ -264,29 +267,23 @@ static struct dm_mpath_io *get_mpio_from_bio(struct bio *bio)
     return dm_per_bio_data(bio, multipath_per_bio_data_size());
 }
 
-static struct dm_bio_details *get_bio_details_from_bio(struct bio *bio)
+static struct dm_bio_details *get_bio_details_from_mpio(struct dm_mpath_io *mpio)
 {
     /* dm_bio_details is immediately after the dm_mpath_io in bio's per-bio-data */
-    struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
     void *bio_details = mpio + 1;
     return bio_details;
 }
 
-static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p,
-                                        struct dm_bio_details **bio_details_p)
+static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p)
 {
     struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
-    struct dm_bio_details *bio_details = get_bio_details_from_bio(bio);
-
-    memset(mpio, 0, sizeof(*mpio));
-    memset(bio_details, 0, sizeof(*bio_details));
-    dm_bio_record(bio_details, bio);
+    struct dm_bio_details *bio_details = get_bio_details_from_mpio(mpio);
 
-    if (mpio_p)
+    mpio->nr_bytes = bio->bi_iter.bi_size;
+    mpio->pgpath = NULL;
     *mpio_p = mpio;
-    if (bio_details_p)
-        *bio_details_p = bio_details;
+
+    dm_bio_record(bio_details, bio);
 }
 
 /*-----------------------------------------------
@@ -340,6 +337,9 @@ static void __switch_pg(struct multipath *m, struct priority_group *pg)
 {
     m->current_pg = pg;
 
+    if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
+        return;
+
     /* Must we initialise the PG first, and queue I/O till it's ready? */
     if (m->hw_handler_name) {
         set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
@@ -385,6 +385,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
     unsigned bypassed = 1;
 
     if (!atomic_read(&m->nr_valid_paths)) {
+        if (m->queue_mode != DM_TYPE_NVME_BIO_BASED)
             clear_bit(MPATHF_QUEUE_IO, &m->flags);
         goto failed;
     }
@@ -516,12 +517,10 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
         return DM_MAPIO_KILL;
     } else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
                test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
-        if (pg_init_all_paths(m))
+        pg_init_all_paths(m);
         return DM_MAPIO_DELAY_REQUEUE;
-        return DM_MAPIO_REQUEUE;
     }
 
-    memset(mpio, 0, sizeof(*mpio));
     mpio->pgpath = pgpath;
     mpio->nr_bytes = nr_bytes;
@@ -530,11 +529,22 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
     clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
     if (IS_ERR(clone)) {
         /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
-        bool queue_dying = blk_queue_dying(q);
-        if (queue_dying) {
+        if (blk_queue_dying(q)) {
             atomic_inc(&m->pg_init_in_progress);
             activate_or_offline_path(pgpath);
+            return DM_MAPIO_DELAY_REQUEUE;
         }
+
+        /*
+         * blk-mq's SCHED_RESTART can cover this requeue, so we
+         * needn't deal with it by DELAY_REQUEUE. More importantly,
+         * we have to return DM_MAPIO_REQUEUE so that blk-mq can
+         * get the queue busy feedback (via BLK_STS_RESOURCE),
+         * otherwise I/O merging can suffer.
+         */
+        if (q->mq_ops)
+            return DM_MAPIO_REQUEUE;
+        else
             return DM_MAPIO_DELAY_REQUEUE;
     }
     clone->bio = clone->biotail = NULL;
@@ -557,9 +567,9 @@ static void multipath_release_clone(struct request *clone)
 /*
  * Map cloned bios (bio-based multipath)
  */
-static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_mpath_io *mpio)
+static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
 {
-    size_t nr_bytes = bio->bi_iter.bi_size;
     struct pgpath *pgpath;
     unsigned long flags;
     bool queue_io;
@@ -568,7 +578,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
     pgpath = READ_ONCE(m->current_pgpath);
     queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
     if (!pgpath || !queue_io)
-        pgpath = choose_pgpath(m, nr_bytes);
+        pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
 
     if ((pgpath && queue_io) ||
         (!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
@@ -576,14 +586,62 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
         spin_lock_irqsave(&m->lock, flags);
         bio_list_add(&m->queued_bios, bio);
         spin_unlock_irqrestore(&m->lock, flags);
+
         /* PG_INIT_REQUIRED cannot be set without QUEUE_IO */
         if (queue_io || test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
             pg_init_all_paths(m);
         else if (!queue_io)
             queue_work(kmultipathd, &m->process_queued_bios);
-        return DM_MAPIO_SUBMITTED;
+
+        return ERR_PTR(-EAGAIN);
+    }
+
+    return pgpath;
+}
+
+static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
+{
+    struct pgpath *pgpath;
+    unsigned long flags;
+
+    /* Do we need to select a new pgpath? */
+    /*
+     * FIXME: currently only switching path if no path (due to failure, etc)
+     * - which negates the point of using a path selector
+     */
+    pgpath = READ_ONCE(m->current_pgpath);
+    if (!pgpath)
+        pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
+
+    if (!pgpath) {
+        if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
+            /* Queue for the daemon to resubmit */
+            spin_lock_irqsave(&m->lock, flags);
+            bio_list_add(&m->queued_bios, bio);
+            spin_unlock_irqrestore(&m->lock, flags);
+            queue_work(kmultipathd, &m->process_queued_bios);
+
+            return ERR_PTR(-EAGAIN);
+        }
+        return NULL;
     }
 
+    return pgpath;
+}
+
+static int __multipath_map_bio(struct multipath *m, struct bio *bio,
+                               struct dm_mpath_io *mpio)
+{
+    struct pgpath *pgpath;
+
+    if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
+        pgpath = __map_bio_nvme(m, bio);
+    else
+        pgpath = __map_bio(m, bio);
+
+    if (IS_ERR(pgpath))
+        return DM_MAPIO_SUBMITTED;
+
     if (!pgpath) {
         if (must_push_back_bio(m))
             return DM_MAPIO_REQUEUE;
@@ -592,7 +650,6 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
     }
 
     mpio->pgpath = pgpath;
-    mpio->nr_bytes = nr_bytes;
 
     bio->bi_status = 0;
     bio_set_dev(bio, pgpath->path.dev->bdev);
@@ -601,7 +658,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
     if (pgpath->pg->ps.type->start_io)
         pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
                                       &pgpath->path,
-                                      nr_bytes);
+                                      mpio->nr_bytes);
     return DM_MAPIO_REMAPPED;
 }
@@ -610,8 +667,7 @@ static int multipath_map_bio(struct dm_target *ti, struct bio *bio)
     struct multipath *m = ti->private;
     struct dm_mpath_io *mpio = NULL;
 
-    multipath_init_per_bio_data(bio, &mpio, NULL);
+    multipath_init_per_bio_data(bio, &mpio);
     return __multipath_map_bio(m, bio, mpio);
 }
@@ -619,7 +675,8 @@ static void process_queued_io_list(struct multipath *m)
 {
     if (m->queue_mode == DM_TYPE_MQ_REQUEST_BASED)
         dm_mq_kick_requeue_list(dm_table_get_md(m->ti->table));
-    else if (m->queue_mode == DM_TYPE_BIO_BASED)
+    else if (m->queue_mode == DM_TYPE_BIO_BASED ||
+             m->queue_mode == DM_TYPE_NVME_BIO_BASED)
         queue_work(kmultipathd, &m->process_queued_bios);
 }
@@ -649,7 +706,9 @@ static void process_queued_bios(struct work_struct *work)
 
     blk_start_plug(&plug);
     while ((bio = bio_list_pop(&bios))) {
-        r = __multipath_map_bio(m, bio, get_mpio_from_bio(bio));
+        struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
+        dm_bio_restore(get_bio_details_from_mpio(mpio), bio);
+        r = __multipath_map_bio(m, bio, mpio);
+
         switch (r) {
         case DM_MAPIO_KILL:
             bio->bi_status = BLK_STS_IOERR;
@@ -752,34 +811,11 @@ static int parse_path_selector(struct dm_arg_set *as, struct priority_group *pg,
     return 0;
 }
 
-static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
-                                 struct dm_target *ti)
+static int setup_scsi_dh(struct block_device *bdev, struct multipath *m, char **error)
 {
-    int r;
-    struct pgpath *p;
-    struct multipath *m = ti->private;
-    struct request_queue *q = NULL;
+    struct request_queue *q = bdev_get_queue(bdev);
     const char *attached_handler_name;
-
-    /* we need at least a path arg */
-    if (as->argc < 1) {
-        ti->error = "no device given";
-        return ERR_PTR(-EINVAL);
-    }
-
-    p = alloc_pgpath();
-    if (!p)
-        return ERR_PTR(-ENOMEM);
-
-    r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
-                      &p->path.dev);
-    if (r) {
-        ti->error = "error getting device";
-        goto bad;
-    }
-
-    if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
-        q = bdev_get_queue(p->path.dev->bdev);
+    int r;
 
     if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
 retain:
@@ -811,23 +847,56 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
             char b[BDEVNAME_SIZE];
 
             printk(KERN_INFO "dm-mpath: retaining handler on device %s\n",
-                   bdevname(p->path.dev->bdev, b));
+                   bdevname(bdev, b));
             goto retain;
         }
         if (r < 0) {
-            ti->error = "error attaching hardware handler";
-            dm_put_device(ti, p->path.dev);
-            goto bad;
+            *error = "error attaching hardware handler";
+            return r;
         }
     }
 
     if (m->hw_handler_params) {
         r = scsi_dh_set_params(q, m->hw_handler_params);
         if (r < 0) {
-            ti->error = "unable to set hardware "
-                        "handler parameters";
-            dm_put_device(ti, p->path.dev);
-            goto bad;
+            *error = "unable to set hardware handler parameters";
+            return r;
         }
     }
 
+    return 0;
+}
+
+static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
+                                 struct dm_target *ti)
+{
+    int r;
+    struct pgpath *p;
+    struct multipath *m = ti->private;
+
+    /* we need at least a path arg */
+    if (as->argc < 1) {
+        ti->error = "no device given";
+        return ERR_PTR(-EINVAL);
+    }
+
+    p = alloc_pgpath();
+    if (!p)
+        return ERR_PTR(-ENOMEM);
+
+    r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
+                      &p->path.dev);
+    if (r) {
+        ti->error = "error getting device";
+        goto bad;
+    }
+
+    if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
+        INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
+        r = setup_scsi_dh(p->path.dev->bdev, m, &ti->error);
+        if (r) {
+            dm_put_device(ti, p->path.dev);
+            goto bad;
+        }
+    }
@@ -838,7 +907,6 @@ static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps
     }
 
     return p;
-
 bad:
     free_pgpath(p);
     return ERR_PTR(r);
@@ -933,7 +1001,8 @@ static int parse_hw_handler(struct dm_arg_set *as, struct multipath *m)
     if (!hw_argc)
         return 0;
 
-    if (m->queue_mode == DM_TYPE_BIO_BASED) {
+    if (m->queue_mode == DM_TYPE_BIO_BASED ||
+        m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
        dm_consume_args(as, hw_argc);
        DMERR("bio-based multipath doesn't allow hardware handler args");
        return 0;
@@ -1022,6 +1091,8 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
 
         if (!strcasecmp(queue_mode_name, "bio"))
             m->queue_mode = DM_TYPE_BIO_BASED;
+        else if (!strcasecmp(queue_mode_name, "nvme"))
+            m->queue_mode = DM_TYPE_NVME_BIO_BASED;
         else if (!strcasecmp(queue_mode_name, "rq"))
             m->queue_mode = DM_TYPE_REQUEST_BASED;
         else if (!strcasecmp(queue_mode_name, "mq"))
@@ -1122,7 +1193,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
     ti->num_discard_bios = 1;
     ti->num_write_same_bios = 1;
     ti->num_write_zeroes_bios = 1;
-    if (m->queue_mode == DM_TYPE_BIO_BASED)
+    if (m->queue_mode == DM_TYPE_BIO_BASED || m->queue_mode == DM_TYPE_NVME_BIO_BASED)
         ti->per_io_data_size = multipath_per_bio_data_size();
     else
         ti->per_io_data_size = sizeof(struct dm_mpath_io);
@@ -1151,16 +1222,19 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
 
 static void flush_multipath_work(struct multipath *m)
 {
+    if (m->hw_handler_name) {
         set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
         smp_mb__after_atomic();
 
         flush_workqueue(kmpath_handlerd);
         multipath_wait_for_pg_init_completion(m);
-        flush_workqueue(kmultipathd);
-        flush_work(&m->trigger_event);
 
         clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
         smp_mb__after_atomic();
+    }
+
+    flush_workqueue(kmultipathd);
+    flush_work(&m->trigger_event);
 }
 
 static void multipath_dtr(struct dm_target *ti)
@@ -1496,6 +1570,9 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
     if (error && blk_path_error(error)) {
         struct multipath *m = ti->private;
 
+        if (error == BLK_STS_RESOURCE)
+            r = DM_ENDIO_DELAY_REQUEUE;
+        else
             r = DM_ENDIO_REQUEUE;
 
         if (pgpath)
@@ -1546,9 +1623,6 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
         goto done;
     }
 
-    /* Queue for the daemon to resubmit */
-    dm_bio_restore(get_bio_details_from_bio(clone), clone);
-
     spin_lock_irqsave(&m->lock, flags);
     bio_list_add(&m->queued_bios, clone);
     spin_unlock_irqrestore(&m->lock, flags);
@@ -1656,6 +1730,9 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
             case DM_TYPE_BIO_BASED:
                 DMEMIT("queue_mode bio ");
                 break;
+            case DM_TYPE_NVME_BIO_BASED:
+                DMEMIT("queue_mode nvme ");
+                break;
            case DM_TYPE_MQ_REQUEST_BASED:
                 DMEMIT("queue_mode mq ");
                 break;
......
@@ -195,9 +195,6 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
     if (list_empty(&s->valid_paths))
         goto out;
 
-    /* Change preferred (first in list) path to evenly balance. */
-    list_move_tail(s->valid_paths.next, &s->valid_paths);
-
     list_for_each_entry(pi, &s->valid_paths, list) {
         if (!best ||
             (atomic_read(&pi->qlen) < atomic_read(&best->qlen)))
@@ -210,6 +207,9 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
     if (!best)
         goto out;
 
+    /* Move most recently used to least preferred to evenly balance. */
+    list_move_tail(&best->list, &s->valid_paths);
+
     ret = best->path;
 out:
     spin_unlock_irqrestore(&s->lock, flags);
......
...@@ -29,6 +29,9 @@ ...@@ -29,6 +29,9 @@
*/ */
#define MIN_RAID456_JOURNAL_SPACE (4*2048) #define MIN_RAID456_JOURNAL_SPACE (4*2048)
/* Global list of all raid sets */
static LIST_HEAD(raid_sets);
static bool devices_handle_discard_safely = false; static bool devices_handle_discard_safely = false;
/* /*
...@@ -105,8 +108,6 @@ struct raid_dev { ...@@ -105,8 +108,6 @@ struct raid_dev {
#define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV) #define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV)
#define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE) #define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE)
#define RESUME_STAY_FROZEN_FLAGS (CTR_FLAG_DELTA_DISKS | CTR_FLAG_DATA_OFFSET)
/* /*
* Definitions of various constructor flags to * Definitions of various constructor flags to
* be used in checks of valid / invalid flags * be used in checks of valid / invalid flags
...@@ -209,6 +210,8 @@ struct raid_dev { ...@@ -209,6 +210,8 @@ struct raid_dev {
#define RT_FLAG_UPDATE_SBS 3 #define RT_FLAG_UPDATE_SBS 3
#define RT_FLAG_RESHAPE_RS 4 #define RT_FLAG_RESHAPE_RS 4
#define RT_FLAG_RS_SUSPENDED 5 #define RT_FLAG_RS_SUSPENDED 5
#define RT_FLAG_RS_IN_SYNC 6
#define RT_FLAG_RS_RESYNCING 7
/* Array elements of 64 bit needed for rebuild/failed disk bits */ /* Array elements of 64 bit needed for rebuild/failed disk bits */
#define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8) #define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8)
...@@ -224,8 +227,8 @@ struct rs_layout { ...@@ -224,8 +227,8 @@ struct rs_layout {
struct raid_set { struct raid_set {
struct dm_target *ti; struct dm_target *ti;
struct list_head list;
uint32_t bitmap_loaded;
uint32_t stripe_cache_entries; uint32_t stripe_cache_entries;
unsigned long ctr_flags; unsigned long ctr_flags;
unsigned long runtime_flags; unsigned long runtime_flags;
...@@ -270,6 +273,19 @@ static void rs_config_restore(struct raid_set *rs, struct rs_layout *l) ...@@ -270,6 +273,19 @@ static void rs_config_restore(struct raid_set *rs, struct rs_layout *l)
mddev->new_chunk_sectors = l->new_chunk_sectors; mddev->new_chunk_sectors = l->new_chunk_sectors;
} }
/* Find any raid_set in active slot for @rs on global list */
static struct raid_set *rs_find_active(struct raid_set *rs)
{
struct raid_set *r;
struct mapped_device *md = dm_table_get_md(rs->ti->table);
list_for_each_entry(r, &raid_sets, list)
if (r != rs && dm_table_get_md(r->ti->table) == md)
return r;
return NULL;
}
/* raid10 algorithms (i.e. formats) */ /* raid10 algorithms (i.e. formats) */
#define ALGORITHM_RAID10_DEFAULT 0 #define ALGORITHM_RAID10_DEFAULT 0
#define ALGORITHM_RAID10_NEAR 1 #define ALGORITHM_RAID10_NEAR 1
...@@ -572,7 +588,7 @@ static const char *raid10_md_layout_to_format(int layout) ...@@ -572,7 +588,7 @@ static const char *raid10_md_layout_to_format(int layout)
} }
/* Return md raid10 algorithm for @name */ /* Return md raid10 algorithm for @name */
static int raid10_name_to_format(const char *name) static const int raid10_name_to_format(const char *name)
{ {
if (!strcasecmp(name, "near")) if (!strcasecmp(name, "near"))
return ALGORITHM_RAID10_NEAR; return ALGORITHM_RAID10_NEAR;
...@@ -675,15 +691,11 @@ static struct raid_type *get_raid_type_by_ll(const int level, const int layout) ...@@ -675,15 +691,11 @@ static struct raid_type *get_raid_type_by_ll(const int level, const int layout)
return NULL; return NULL;
} }
/* /* Adjust rdev sectors */
* Conditionally change bdev capacity of @rs static void rs_set_rdev_sectors(struct raid_set *rs)
* in case of a disk add/remove reshape
*/
static void rs_set_capacity(struct raid_set *rs)
{ {
struct mddev *mddev = &rs->md; struct mddev *mddev = &rs->md;
struct md_rdev *rdev; struct md_rdev *rdev;
struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
/* /*
* raid10 sets rdev->sector to the device size, which * raid10 sets rdev->sector to the device size, which
...@@ -692,8 +704,16 @@ static void rs_set_capacity(struct raid_set *rs) ...@@ -692,8 +704,16 @@ static void rs_set_capacity(struct raid_set *rs)
rdev_for_each(rdev, mddev) rdev_for_each(rdev, mddev)
if (!test_bit(Journal, &rdev->flags)) if (!test_bit(Journal, &rdev->flags))
rdev->sectors = mddev->dev_sectors; rdev->sectors = mddev->dev_sectors;
}
set_capacity(gendisk, mddev->array_sectors); /*
* Change bdev capacity of @rs in case of a disk add/remove reshape
*/
static void rs_set_capacity(struct raid_set *rs)
{
struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
set_capacity(gendisk, rs->md.array_sectors);
revalidate_disk(gendisk); revalidate_disk(gendisk);
} }
...@@ -744,6 +764,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r ...@@ -744,6 +764,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
mddev_init(&rs->md); mddev_init(&rs->md);
INIT_LIST_HEAD(&rs->list);
rs->raid_disks = raid_devs; rs->raid_disks = raid_devs;
rs->delta_disks = 0; rs->delta_disks = 0;
...@@ -761,6 +782,9 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r ...@@ -761,6 +782,9 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
for (i = 0; i < raid_devs; i++) for (i = 0; i < raid_devs; i++)
md_rdev_init(&rs->dev[i].rdev); md_rdev_init(&rs->dev[i].rdev);
/* Add @rs to global list. */
list_add(&rs->list, &raid_sets);
/* /*
* Remaining items to be initialized by further RAID params: * Remaining items to be initialized by further RAID params:
* rs->md.persistent * rs->md.persistent
...@@ -773,6 +797,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r ...@@ -773,6 +797,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
return rs; return rs;
} }
/* Free all @rs allocations and remove it from global list. */
static void raid_set_free(struct raid_set *rs) static void raid_set_free(struct raid_set *rs)
{ {
int i; int i;
...@@ -790,6 +815,8 @@ static void raid_set_free(struct raid_set *rs) ...@@ -790,6 +815,8 @@ static void raid_set_free(struct raid_set *rs)
dm_put_device(rs->ti, rs->dev[i].data_dev); dm_put_device(rs->ti, rs->dev[i].data_dev);
} }
list_del(&rs->list);
kfree(rs); kfree(rs);
} }
...@@ -1002,7 +1029,7 @@ static int validate_raid_redundancy(struct raid_set *rs) ...@@ -1002,7 +1029,7 @@ static int validate_raid_redundancy(struct raid_set *rs)
!rs->dev[i].rdev.sb_page) !rs->dev[i].rdev.sb_page)
rebuild_cnt++; rebuild_cnt++;
switch (rs->raid_type->level) { switch (rs->md.level) {
case 0: case 0:
break; break;
case 1: case 1:
...@@ -1017,6 +1044,11 @@ static int validate_raid_redundancy(struct raid_set *rs) ...@@ -1017,6 +1044,11 @@ static int validate_raid_redundancy(struct raid_set *rs)
break; break;
case 10: case 10:
copies = raid10_md_layout_to_copies(rs->md.new_layout); copies = raid10_md_layout_to_copies(rs->md.new_layout);
if (copies < 2) {
DMERR("Bogus raid10 data copies < 2!");
return -EINVAL;
}
if (rebuild_cnt < copies) if (rebuild_cnt < copies)
break; break;
...@@ -1576,6 +1608,24 @@ static sector_t __rdev_sectors(struct raid_set *rs) ...@@ -1576,6 +1608,24 @@ static sector_t __rdev_sectors(struct raid_set *rs)
return 0; return 0;
} }
/* Check that calculated dev_sectors fits all component devices. */
static int _check_data_dev_sectors(struct raid_set *rs)
{
sector_t ds = ~0;
struct md_rdev *rdev;
rdev_for_each(rdev, &rs->md)
if (!test_bit(Journal, &rdev->flags) && rdev->bdev) {
ds = min(ds, to_sector(i_size_read(rdev->bdev->bd_inode)));
if (ds < rs->md.dev_sectors) {
rs->ti->error = "Component device(s) too small";
return -EINVAL;
}
}
return 0;
}
/* Calculate the sectors per device and per array used for @rs */ /* Calculate the sectors per device and per array used for @rs */
static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev) static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
{ {
...@@ -1625,7 +1675,7 @@ static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev) ...@@ -1625,7 +1675,7 @@ static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
mddev->array_sectors = array_sectors; mddev->array_sectors = array_sectors;
mddev->dev_sectors = dev_sectors; mddev->dev_sectors = dev_sectors;
return 0; return _check_data_dev_sectors(rs);
bad: bad:
rs->ti->error = "Target length not divisible by number of data devices"; rs->ti->error = "Target length not divisible by number of data devices";
return -EINVAL; return -EINVAL;
...@@ -1674,8 +1724,11 @@ static void do_table_event(struct work_struct *ws) ...@@ -1674,8 +1724,11 @@ static void do_table_event(struct work_struct *ws)
struct raid_set *rs = container_of(ws, struct raid_set, md.event_work); struct raid_set *rs = container_of(ws, struct raid_set, md.event_work);
smp_rmb(); /* Make sure we access the most recent mddev properties */ smp_rmb(); /* Make sure we access the most recent mddev properties */
if (!rs_is_reshaping(rs)) if (!rs_is_reshaping(rs)) {
if (rs_is_raid10(rs))
rs_set_rdev_sectors(rs);
rs_set_capacity(rs); rs_set_capacity(rs);
}
dm_table_event(rs->ti->table); dm_table_event(rs->ti->table);
} }
...@@ -1860,7 +1913,7 @@ static bool rs_reshape_requested(struct raid_set *rs) ...@@ -1860,7 +1913,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
if (rs_takeover_requested(rs)) if (rs_takeover_requested(rs))
return false; return false;
if (!mddev->level) if (rs_is_raid0(rs))
return false; return false;
change = mddev->new_layout != mddev->layout || change = mddev->new_layout != mddev->layout ||
...@@ -1868,7 +1921,7 @@ static bool rs_reshape_requested(struct raid_set *rs) ...@@ -1868,7 +1921,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
rs->delta_disks; rs->delta_disks;
/* Historical case to support raid1 reshape without delta disks */ /* Historical case to support raid1 reshape without delta disks */
if (mddev->level == 1) { if (rs_is_raid1(rs)) {
if (rs->delta_disks) if (rs->delta_disks)
return !!rs->delta_disks; return !!rs->delta_disks;
...@@ -1876,7 +1929,7 @@ static bool rs_reshape_requested(struct raid_set *rs) ...@@ -1876,7 +1929,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
mddev->raid_disks != rs->raid_disks; mddev->raid_disks != rs->raid_disks;
} }
if (mddev->level == 10) if (rs_is_raid10(rs))
return change && return change &&
!__is_raid10_far(mddev->new_layout) && !__is_raid10_far(mddev->new_layout) &&
rs->delta_disks >= 0; rs->delta_disks >= 0;
...@@ -2340,7 +2393,7 @@ static int super_init_validation(struct raid_set *rs, struct md_rdev *rdev) ...@@ -2340,7 +2393,7 @@ static int super_init_validation(struct raid_set *rs, struct md_rdev *rdev)
DMERR("new device%s provided without 'rebuild'", DMERR("new device%s provided without 'rebuild'",
new_devs > 1 ? "s" : ""); new_devs > 1 ? "s" : "");
return -EINVAL; return -EINVAL;
} else if (rs_is_recovering(rs)) { } else if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags) && rs_is_recovering(rs)) {
DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)", DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)",
(unsigned long long) mddev->recovery_cp); (unsigned long long) mddev->recovery_cp);
return -EINVAL; return -EINVAL;
...@@ -2640,12 +2693,19 @@ static int rs_adjust_data_offsets(struct raid_set *rs) ...@@ -2640,12 +2693,19 @@ static int rs_adjust_data_offsets(struct raid_set *rs)
* Make sure we got a minimum amount of free sectors per device * Make sure we got a minimum amount of free sectors per device
*/ */
if (rs->data_offset && if (rs->data_offset &&
to_sector(i_size_read(rdev->bdev->bd_inode)) - rdev->sectors < MIN_FREE_RESHAPE_SPACE) { to_sector(i_size_read(rdev->bdev->bd_inode)) - rs->md.dev_sectors < MIN_FREE_RESHAPE_SPACE) {
rs->ti->error = data_offset ? "No space for forward reshape" : rs->ti->error = data_offset ? "No space for forward reshape" :
"No space for backward reshape"; "No space for backward reshape";
return -ENOSPC; return -ENOSPC;
} }
out: out:
/*
* Raise recovery_cp in case data_offset != 0 to
* avoid false recovery positives in the constructor.
*/
if (rs->md.recovery_cp < rs->md.dev_sectors)
rs->md.recovery_cp += rs->dev[0].rdev.data_offset;
/* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */ /* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */
rdev_for_each(rdev, &rs->md) { rdev_for_each(rdev, &rs->md) {
if (!test_bit(Journal, &rdev->flags)) { if (!test_bit(Journal, &rdev->flags)) {
...@@ -2682,14 +2742,14 @@ static int rs_setup_takeover(struct raid_set *rs) ...@@ -2682,14 +2742,14 @@ static int rs_setup_takeover(struct raid_set *rs)
sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset; sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset;
if (rt_is_raid10(rs->raid_type)) { if (rt_is_raid10(rs->raid_type)) {
if (mddev->level == 0) { if (rs_is_raid0(rs)) {
/* Userspace reordered disks -> adjust raid_disk indexes */ /* Userspace reordered disks -> adjust raid_disk indexes */
__reorder_raid_disk_indexes(rs); __reorder_raid_disk_indexes(rs);
/* raid0 -> raid10_far layout */ /* raid0 -> raid10_far layout */
mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR, mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR,
rs->raid10_copies); rs->raid10_copies);
} else if (mddev->level == 1) } else if (rs_is_raid1(rs))
/* raid1 -> raid10_near layout */ /* raid1 -> raid10_near layout */
mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR, mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR,
rs->raid_disks); rs->raid_disks);
...@@ -2777,6 +2837,23 @@ static int rs_prepare_reshape(struct raid_set *rs) ...@@ -2777,6 +2837,23 @@ static int rs_prepare_reshape(struct raid_set *rs)
return 0; return 0;
} }
/* Get reshape sectors from data_offsets or raid set */
static sector_t _get_reshape_sectors(struct raid_set *rs)
{
struct md_rdev *rdev;
sector_t reshape_sectors = 0;
rdev_for_each(rdev, &rs->md)
if (!test_bit(Journal, &rdev->flags)) {
reshape_sectors = (rdev->data_offset > rdev->new_data_offset) ?
rdev->data_offset - rdev->new_data_offset :
rdev->new_data_offset - rdev->data_offset;
break;
}
return max(reshape_sectors, (sector_t) rs->data_offset);
}
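/*
 * Illustrative note (editor's example, not part of the commit): if the
 * first non-journal rdev has data_offset 8192 and new_data_offset 0, the
 * loop above computes reshape_sectors = 8192, i.e. 4 MiB of out-of-place
 * reshape space per device.
 */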
/* /*
* *
* - change raid layout * - change raid layout
...@@ -2788,6 +2865,7 @@ static int rs_setup_reshape(struct raid_set *rs) ...@@ -2788,6 +2865,7 @@ static int rs_setup_reshape(struct raid_set *rs)
{ {
int r = 0; int r = 0;
unsigned int cur_raid_devs, d; unsigned int cur_raid_devs, d;
sector_t reshape_sectors = _get_reshape_sectors(rs);
struct mddev *mddev = &rs->md; struct mddev *mddev = &rs->md;
struct md_rdev *rdev; struct md_rdev *rdev;
...@@ -2804,13 +2882,13 @@ static int rs_setup_reshape(struct raid_set *rs) ...@@ -2804,13 +2882,13 @@ static int rs_setup_reshape(struct raid_set *rs)
/* /*
* Adjust array size: * Adjust array size:
* *
* - in case of adding disks, array size has * - in case of adding disk(s), array size has
* to grow after the disk adding reshape, * to grow after the disk adding reshape,
* which'll happen in the event handler; * which'll happen in the event handler;
* reshape will happen forward, so space has to * reshape will happen forward, so space has to
* be available at the beginning of each disk * be available at the beginning of each disk
* *
* - in case of removing disks, array size * - in case of removing disk(s), array size
* has to shrink before starting the reshape, * has to shrink before starting the reshape,
* which'll happen here; * which'll happen here;
* reshape will happen backward, so space has to * reshape will happen backward, so space has to
...@@ -2841,7 +2919,7 @@ static int rs_setup_reshape(struct raid_set *rs) ...@@ -2841,7 +2919,7 @@ static int rs_setup_reshape(struct raid_set *rs)
rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector; rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector;
} }
mddev->reshape_backwards = 0; /* adding disks -> forward reshape */ mddev->reshape_backwards = 0; /* adding disk(s) -> forward reshape */
/* Remove disk(s) */ /* Remove disk(s) */
} else if (rs->delta_disks < 0) { } else if (rs->delta_disks < 0) {
...@@ -2874,6 +2952,15 @@ static int rs_setup_reshape(struct raid_set *rs) ...@@ -2874,6 +2952,15 @@ static int rs_setup_reshape(struct raid_set *rs)
mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1; mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1;
} }
/*
* Adjust device size for forward reshape
* because md_finish_reshape() reduces it.
*/
if (!mddev->reshape_backwards)
rdev_for_each(rdev, &rs->md)
if (!test_bit(Journal, &rdev->flags))
rdev->sectors += reshape_sectors;
return r; return r;
} }
...@@ -2890,7 +2977,7 @@ static void configure_discard_support(struct raid_set *rs) ...@@ -2890,7 +2977,7 @@ static void configure_discard_support(struct raid_set *rs)
/* /*
* XXX: RAID level 4,5,6 require zeroing for safety. * XXX: RAID level 4,5,6 require zeroing for safety.
*/ */
raid456 = (rs->md.level == 4 || rs->md.level == 5 || rs->md.level == 6); raid456 = rs_is_raid456(rs);
for (i = 0; i < rs->raid_disks; i++) { for (i = 0; i < rs->raid_disks; i++) {
struct request_queue *q; struct request_queue *q;
...@@ -2915,7 +3002,7 @@ static void configure_discard_support(struct raid_set *rs) ...@@ -2915,7 +3002,7 @@ static void configure_discard_support(struct raid_set *rs)
* RAID1 and RAID10 personalities require bio splitting, * RAID1 and RAID10 personalities require bio splitting,
* RAID0/4/5/6 don't and process large discard bios properly. * RAID0/4/5/6 don't and process large discard bios properly.
*/ */
ti->split_discard_bios = !!(rs->md.level == 1 || rs->md.level == 10); ti->split_discard_bios = !!(rs_is_raid1(rs) || rs_is_raid10(rs));
ti->num_discard_bios = 1; ti->num_discard_bios = 1;
} }
...@@ -2935,10 +3022,10 @@ static void configure_discard_support(struct raid_set *rs) ...@@ -2935,10 +3022,10 @@ static void configure_discard_support(struct raid_set *rs)
static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{ {
int r; int r;
bool resize; bool resize = false;
struct raid_type *rt; struct raid_type *rt;
unsigned int num_raid_params, num_raid_devs; unsigned int num_raid_params, num_raid_devs;
sector_t calculated_dev_sectors, rdev_sectors; sector_t calculated_dev_sectors, rdev_sectors, reshape_sectors;
struct raid_set *rs = NULL; struct raid_set *rs = NULL;
const char *arg; const char *arg;
struct rs_layout rs_layout; struct rs_layout rs_layout;
...@@ -3021,7 +3108,10 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) ...@@ -3021,7 +3108,10 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad; goto bad;
} }
resize = calculated_dev_sectors != rdev_sectors;
reshape_sectors = _get_reshape_sectors(rs);
if (calculated_dev_sectors != rdev_sectors)
resize = calculated_dev_sectors != (reshape_sectors ? rdev_sectors - reshape_sectors : rdev_sectors);
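/*
 * Editor's note on the check above (not part of the commit): the
 * superblock-recorded rdev size is compared net of any out-of-place
 * reshape space, so a raid set that merely carries reshape headroom is
 * not mistaken for one that was grown or shrunk.
 */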
INIT_WORK(&rs->md.event_work, do_table_event); INIT_WORK(&rs->md.event_work, do_table_event);
ti->private = rs; ti->private = rs;
...@@ -3105,6 +3195,8 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) ...@@ -3105,6 +3195,8 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
goto bad; goto bad;
} }
/* Out-of-place space has to be available to allow for a reshape unless raid1! */
if (reshape_sectors || rs_is_raid1(rs)) {
/* /*
* We can only prepare for a reshape here, because the * We can only prepare for a reshape here, because the
* raid set needs to run to provide the respective reshape * raid set needs to run to provide the respective reshape
...@@ -3118,6 +3210,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) ...@@ -3118,6 +3210,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
/* Reshaping ain't recovery, so disable recovery */ /* Reshaping ain't recovery, so disable recovery */
rs_setup_recovery(rs, MaxSector); rs_setup_recovery(rs, MaxSector);
}
rs_set_cur(rs); rs_set_cur(rs);
} else { } else {
/* May not set recovery when a device rebuild is requested */ /* May not set recovery when a device rebuild is requested */
...@@ -3144,7 +3237,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) ...@@ -3144,7 +3237,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
mddev_lock_nointr(&rs->md); mddev_lock_nointr(&rs->md);
r = md_run(&rs->md); r = md_run(&rs->md);
rs->md.in_sync = 0; /* Assume already marked dirty */ rs->md.in_sync = 0; /* Assume already marked dirty */
if (r) { if (r) {
ti->error = "Failed to run raid array"; ti->error = "Failed to run raid array";
mddev_unlock(&rs->md); mddev_unlock(&rs->md);
...@@ -3248,25 +3340,27 @@ static int raid_map(struct dm_target *ti, struct bio *bio) ...@@ -3248,25 +3340,27 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
} }
/* Return string describing the current sync action of @mddev */ /* Return string describing the current sync action of @mddev */
static const char *decipher_sync_action(struct mddev *mddev) static const char *decipher_sync_action(struct mddev *mddev, unsigned long recovery)
{ {
if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) if (test_bit(MD_RECOVERY_FROZEN, &recovery))
return "frozen"; return "frozen";
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || /* The MD sync thread can be done with io but still be running */
(!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))) { if (!test_bit(MD_RECOVERY_DONE, &recovery) &&
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) (test_bit(MD_RECOVERY_RUNNING, &recovery) ||
(!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &recovery)))) {
if (test_bit(MD_RECOVERY_RESHAPE, &recovery))
return "reshape"; return "reshape";
if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) if (!test_bit(MD_RECOVERY_REQUESTED, &recovery))
return "resync"; return "resync";
else if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) else if (test_bit(MD_RECOVERY_CHECK, &recovery))
return "check"; return "check";
return "repair"; return "repair";
} }
if (test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) if (test_bit(MD_RECOVERY_RECOVER, &recovery))
return "recover"; return "recover";
} }
...@@ -3283,7 +3377,7 @@ static const char *decipher_sync_action(struct mddev *mddev) ...@@ -3283,7 +3377,7 @@ static const char *decipher_sync_action(struct mddev *mddev)
* 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device * 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device
* '-' = Non-existing device (i.e. uspace passed '- -' into the ctr) * '-' = Non-existing device (i.e. uspace passed '- -' into the ctr)
*/ */
static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, bool array_in_sync) static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev)
{ {
if (!rdev->bdev) if (!rdev->bdev)
return "-"; return "-";
...@@ -3291,85 +3385,108 @@ static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, ...@@ -3291,85 +3385,108 @@ static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev,
return "D"; return "D";
else if (test_bit(Journal, &rdev->flags)) else if (test_bit(Journal, &rdev->flags))
return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a"; return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a";
else if (!array_in_sync || !test_bit(In_sync, &rdev->flags)) else if (test_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags) ||
(!test_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags) &&
!test_bit(In_sync, &rdev->flags)))
return "a"; return "a";
else else
return "A"; return "A";
} }
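/*
 * Illustrative example (editor's sketch, not part of the commit): for a
 * two-leg raid1 with one leg still rebuilding, the characters emitted via
 * __raid_dev_status() read "Aa"; once rs_get_progress() sets
 * RT_FLAG_RS_IN_SYNC and both legs are In_sync, they become "AA".
 */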
/* Helper to return resync/reshape progress for @rs and @array_in_sync */ /* Helper to return resync/reshape progress for @rs and runtime flags for raid set in sync / resynching */
static sector_t rs_get_progress(struct raid_set *rs, static sector_t rs_get_progress(struct raid_set *rs, unsigned long recovery,
sector_t resync_max_sectors, bool *array_in_sync) sector_t resync_max_sectors)
{ {
sector_t r, curr_resync_completed; sector_t r;
struct mddev *mddev = &rs->md; struct mddev *mddev = &rs->md;
curr_resync_completed = mddev->curr_resync_completed ?: mddev->recovery_cp; clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
*array_in_sync = false; clear_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
if (rs_is_raid0(rs)) { if (rs_is_raid0(rs)) {
r = resync_max_sectors; r = resync_max_sectors;
*array_in_sync = true; set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
} else {
r = mddev->reshape_position;
/* Reshape is relative to the array size */
if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) ||
r != MaxSector) {
if (r == MaxSector) {
*array_in_sync = true;
r = resync_max_sectors;
} else { } else {
/* Got to reverse on backward reshape */ if (test_bit(MD_RECOVERY_NEEDED, &recovery) ||
if (mddev->reshape_backwards) test_bit(MD_RECOVERY_RESHAPE, &recovery) ||
r = mddev->array_sectors - r; test_bit(MD_RECOVERY_RUNNING, &recovery))
r = mddev->curr_resync_completed;
/* Devide by # of data stripes */
sector_div(r, mddev_data_stripes(rs));
}
/* Sync is relative to the component device size */
} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
r = curr_resync_completed;
else else
r = mddev->recovery_cp; r = mddev->recovery_cp;
if ((r == MaxSector) || if (r >= resync_max_sectors &&
(test_bit(MD_RECOVERY_DONE, &mddev->recovery) && (!test_bit(MD_RECOVERY_REQUESTED, &recovery) ||
(mddev->curr_resync_completed == resync_max_sectors))) { (!test_bit(MD_RECOVERY_FROZEN, &recovery) &&
!test_bit(MD_RECOVERY_NEEDED, &recovery) &&
!test_bit(MD_RECOVERY_RUNNING, &recovery)))) {
/* /*
* Sync complete. * Sync complete.
*/ */
*array_in_sync = true; /* In case we have finished recovering, the array is in sync. */
r = resync_max_sectors; if (test_bit(MD_RECOVERY_RECOVER, &recovery))
} else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
} else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) {
/*
* In case we are recovering, the array is not in sync
* and health chars should show the recovering legs.
*/
;
} else if (test_bit(MD_RECOVERY_SYNC, &recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
/*
* If "resync" is occurring, the raid set
* is or may be out of sync hence the health
* characters shall be 'a'.
*/
set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
} else if (test_bit(MD_RECOVERY_RESHAPE, &recovery) &&
!test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
/*
* If "reshape" is occurring, the raid set
* is or may be out of sync hence the health
* characters shall be 'a'.
*/
set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
} else if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
/* /*
* If "check" or "repair" is occurring, the raid set has * If "check" or "repair" is occurring, the raid set has
* undergone an initial sync and the health characters * undergone an initial sync and the health characters
* should not be 'a' anymore. * should not be 'a' anymore.
*/ */
*array_in_sync = true; set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
} else { } else {
struct md_rdev *rdev; struct md_rdev *rdev;
/*
* We are idle and recovery is needed, prevent 'A' chars race
* caused by components still set to in-sync by constructor.
*/
if (test_bit(MD_RECOVERY_NEEDED, &recovery))
set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
/* /*
* The raid set may be doing an initial sync, or it may * The raid set may be doing an initial sync, or it may
* be rebuilding individual components. If all the * be rebuilding individual components. If all the
* devices are In_sync, then it is the raid set that is * devices are In_sync, then it is the raid set that is
* being initialized. * being initialized.
*/ */
set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
rdev_for_each(rdev, mddev) rdev_for_each(rdev, mddev)
if (!test_bit(Journal, &rdev->flags) && if (!test_bit(Journal, &rdev->flags) &&
!test_bit(In_sync, &rdev->flags)) !test_bit(In_sync, &rdev->flags)) {
*array_in_sync = true; clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
#if 0 break;
r = 0; /* HM FIXME: TESTME: https://bugzilla.redhat.com/show_bug.cgi?id=1210637 ? */ }
#endif
} }
} }
return r; return min(r, resync_max_sectors);
} }
/* Helper to return @dev name or "-" if !@dev */ /* Helper to return @dev name or "-" if !@dev */
...@@ -3385,7 +3502,7 @@ static void raid_status(struct dm_target *ti, status_type_t type, ...@@ -3385,7 +3502,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
struct mddev *mddev = &rs->md; struct mddev *mddev = &rs->md;
struct r5conf *conf = mddev->private; struct r5conf *conf = mddev->private;
int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0; int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0;
bool array_in_sync; unsigned long recovery;
unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */ unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */
unsigned int sz = 0; unsigned int sz = 0;
unsigned int rebuild_disks; unsigned int rebuild_disks;
...@@ -3405,17 +3522,18 @@ static void raid_status(struct dm_target *ti, status_type_t type, ...@@ -3405,17 +3522,18 @@ static void raid_status(struct dm_target *ti, status_type_t type,
/* Access most recent mddev properties for status output */ /* Access most recent mddev properties for status output */
smp_rmb(); smp_rmb();
recovery = rs->md.recovery;
/* Get sensible max sectors even if raid set not yet started */ /* Get sensible max sectors even if raid set not yet started */
resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ? resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ?
mddev->resync_max_sectors : mddev->dev_sectors; mddev->resync_max_sectors : mddev->dev_sectors;
progress = rs_get_progress(rs, resync_max_sectors, &array_in_sync); progress = rs_get_progress(rs, recovery, resync_max_sectors);
resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ? resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ?
atomic64_read(&mddev->resync_mismatches) : 0; atomic64_read(&mddev->resync_mismatches) : 0;
sync_action = decipher_sync_action(&rs->md); sync_action = decipher_sync_action(&rs->md, recovery);
/* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */ /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */
for (i = 0; i < rs->raid_disks; i++) for (i = 0; i < rs->raid_disks; i++)
DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev, array_in_sync)); DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev));
/* /*
* In-sync/Reshape ratio: * In-sync/Reshape ratio:
...@@ -3466,7 +3584,7 @@ static void raid_status(struct dm_target *ti, status_type_t type, ...@@ -3466,7 +3584,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
* v1.10.0+: * v1.10.0+:
*/ */
DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ? DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ?
__raid_dev_status(rs, &rs->journal_dev.rdev, 0) : "-"); __raid_dev_status(rs, &rs->journal_dev.rdev) : "-");
break; break;
case STATUSTYPE_TABLE: case STATUSTYPE_TABLE:
...@@ -3622,24 +3740,19 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits) ...@@ -3622,24 +3740,19 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs)); blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs));
} }
static void raid_presuspend(struct dm_target *ti)
{
struct raid_set *rs = ti->private;
md_stop_writes(&rs->md);
}
static void raid_postsuspend(struct dm_target *ti) static void raid_postsuspend(struct dm_target *ti)
{ {
struct raid_set *rs = ti->private; struct raid_set *rs = ti->private;
if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
/* Writes have to be stopped before suspending to avoid deadlocks. */
if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery))
md_stop_writes(&rs->md);
mddev_lock_nointr(&rs->md); mddev_lock_nointr(&rs->md);
mddev_suspend(&rs->md); mddev_suspend(&rs->md);
mddev_unlock(&rs->md); mddev_unlock(&rs->md);
} }
rs->md.ro = 1;
} }
static void attempt_restore_of_faulty_devices(struct raid_set *rs) static void attempt_restore_of_faulty_devices(struct raid_set *rs)
...@@ -3816,10 +3929,33 @@ static int raid_preresume(struct dm_target *ti) ...@@ -3816,10 +3929,33 @@ static int raid_preresume(struct dm_target *ti)
struct raid_set *rs = ti->private; struct raid_set *rs = ti->private;
struct mddev *mddev = &rs->md; struct mddev *mddev = &rs->md;
/* This is a resume after a suspend of the set -> it's already started */ /* This is a resume after a suspend of the set -> it's already started. */
if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags)) if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags))
return 0; return 0;
if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags)) {
struct raid_set *rs_active = rs_find_active(rs);
if (rs_active) {
/*
* In case no rebuilds have been requested
* and an active table slot exists, copy
* current resynchronization completed and
* reshape position pointers across from
* suspended raid set in the active slot.
*
* This resumes the new mapping at current
* offsets to continue recovery/reshape without
* needlessly redoing part of the raid set or
* risking data corruption in case of a reshape.
*/
if (rs_active->md.curr_resync_completed != MaxSector)
mddev->curr_resync_completed = rs_active->md.curr_resync_completed;
if (rs_active->md.reshape_position != MaxSector)
mddev->reshape_position = rs_active->md.reshape_position;
}
}
/* /*
* The superblocks need to be updated on disk if the * The superblocks need to be updated on disk if the
* array is new or new devices got added (thus zeroed * array is new or new devices got added (thus zeroed
...@@ -3851,11 +3987,10 @@ static int raid_preresume(struct dm_target *ti) ...@@ -3851,11 +3987,10 @@ static int raid_preresume(struct dm_target *ti)
mddev->resync_min = mddev->recovery_cp; mddev->resync_min = mddev->recovery_cp;
} }
rs_set_capacity(rs);
/* Check for any reshape request unless new raid set */ /* Check for any reshape request unless new raid set */
if (test_and_clear_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) { if (test_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) {
/* Initiate a reshape. */ /* Initiate a reshape. */
rs_set_rdev_sectors(rs);
mddev_lock_nointr(mddev); mddev_lock_nointr(mddev);
r = rs_start_reshape(rs); r = rs_start_reshape(rs);
mddev_unlock(mddev); mddev_unlock(mddev);
...@@ -3881,21 +4016,15 @@ static void raid_resume(struct dm_target *ti) ...@@ -3881,21 +4016,15 @@ static void raid_resume(struct dm_target *ti)
attempt_restore_of_faulty_devices(rs); attempt_restore_of_faulty_devices(rs);
} }
mddev->ro = 0;
mddev->in_sync = 0;
/*
* Keep the RAID set frozen if reshape/rebuild flags are set.
* The RAID set is unfrozen once the next table load/resume,
* which clears the reshape/rebuild flags, occurs.
* This ensures that the constructor for the inactive table
* retrieves an up-to-date reshape_position.
*/
if (!(rs->ctr_flags & RESUME_STAY_FROZEN_FLAGS))
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
/* Only reduce raid set size before running a disk removing reshape. */
if (mddev->delta_disks < 0)
rs_set_capacity(rs);
mddev_lock_nointr(mddev); mddev_lock_nointr(mddev);
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
mddev->ro = 0;
mddev->in_sync = 0;
mddev_resume(mddev); mddev_resume(mddev);
mddev_unlock(mddev); mddev_unlock(mddev);
} }
...@@ -3903,7 +4032,7 @@ static void raid_resume(struct dm_target *ti) ...@@ -3903,7 +4032,7 @@ static void raid_resume(struct dm_target *ti)
static struct target_type raid_target = { static struct target_type raid_target = {
.name = "raid", .name = "raid",
.version = {1, 13, 0}, .version = {1, 13, 2},
.module = THIS_MODULE, .module = THIS_MODULE,
.ctr = raid_ctr, .ctr = raid_ctr,
.dtr = raid_dtr, .dtr = raid_dtr,
...@@ -3912,7 +4041,6 @@ static struct target_type raid_target = { ...@@ -3912,7 +4041,6 @@ static struct target_type raid_target = {
.message = raid_message, .message = raid_message,
.iterate_devices = raid_iterate_devices, .iterate_devices = raid_iterate_devices,
.io_hints = raid_io_hints, .io_hints = raid_io_hints,
.presuspend = raid_presuspend,
.postsuspend = raid_postsuspend, .postsuspend = raid_postsuspend,
.preresume = raid_preresume, .preresume = raid_preresume,
.resume = raid_resume, .resume = raid_resume,
......
...@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped) ...@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
/* The target wants to requeue the I/O */ /* The target wants to requeue the I/O */
dm_requeue_original_request(tio, false); dm_requeue_original_request(tio, false);
break; break;
case DM_ENDIO_DELAY_REQUEUE:
/* The target wants to requeue the I/O after a delay */
dm_requeue_original_request(tio, true);
break;
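/*
 * Editor's note (illustrative, not part of the commit): dm-mpath is the
 * expected user of DM_ENDIO_DELAY_REQUEUE, returning it when a path
 * target reports it is busy so that the retry is not reissued
 * immediately.
 */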
default: default:
DMWARN("unimplemented target endio return value: %d", r); DMWARN("unimplemented target endio return value: %d", r);
BUG(); BUG();
...@@ -713,7 +717,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t) ...@@ -713,7 +717,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t)
/* disable dm_old_request_fn's merge heuristic by default */ /* disable dm_old_request_fn's merge heuristic by default */
md->seq_rq_merge_deadline_usecs = 0; md->seq_rq_merge_deadline_usecs = 0;
dm_init_normal_md_queue(md);
blk_queue_softirq_done(md->queue, dm_softirq_done); blk_queue_softirq_done(md->queue, dm_softirq_done);
/* Initialize the request-based DM worker thread */ /* Initialize the request-based DM worker thread */
...@@ -821,7 +824,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t) ...@@ -821,7 +824,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
err = PTR_ERR(q); err = PTR_ERR(q);
goto out_tag_set; goto out_tag_set;
} }
dm_init_md_queue(md);
return 0; return 0;
......
...@@ -282,9 +282,6 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes) ...@@ -282,9 +282,6 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
if (list_empty(&s->valid_paths)) if (list_empty(&s->valid_paths))
goto out; goto out;
/* Change preferred (first in list) path to evenly balance. */
list_move_tail(s->valid_paths.next, &s->valid_paths);
list_for_each_entry(pi, &s->valid_paths, list) list_for_each_entry(pi, &s->valid_paths, list)
if (!best || (st_compare_load(pi, best, nr_bytes) < 0)) if (!best || (st_compare_load(pi, best, nr_bytes) < 0))
best = pi; best = pi;
...@@ -292,6 +289,9 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes) ...@@ -292,6 +289,9 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
if (!best) if (!best)
goto out; goto out;
/* Move most recently used to least preferred to evenly balance. */
list_move_tail(&best->list, &s->valid_paths);
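/*
 * Editor's note: st_compare_load() returning 0 for equally loaded paths
 * means the earliest such entry wins the loop above; rotating the winner
 * to the tail therefore round-robins ties instead of always reusing the
 * same path.
 */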
ret = best->path; ret = best->path;
out: out:
spin_unlock_irqrestore(&s->lock, flags); spin_unlock_irqrestore(&s->lock, flags);
......
...@@ -47,7 +47,7 @@ struct dm_exception_table { ...@@ -47,7 +47,7 @@ struct dm_exception_table {
}; };
struct dm_snapshot { struct dm_snapshot {
struct rw_semaphore lock; struct mutex lock;
struct dm_dev *origin; struct dm_dev *origin;
struct dm_dev *cow; struct dm_dev *cow;
...@@ -439,9 +439,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap, ...@@ -439,9 +439,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap,
if (!bdev_equal(s->cow->bdev, snap->cow->bdev)) if (!bdev_equal(s->cow->bdev, snap->cow->bdev))
continue; continue;
down_read(&s->lock); mutex_lock(&s->lock);
active = s->active; active = s->active;
up_read(&s->lock); mutex_unlock(&s->lock);
if (active) { if (active) {
if (snap_src) if (snap_src)
...@@ -909,7 +909,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s) ...@@ -909,7 +909,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
int r; int r;
chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1; chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1;
down_write(&s->lock); mutex_lock(&s->lock);
/* /*
* Process chunks (and associated exceptions) in reverse order * Process chunks (and associated exceptions) in reverse order
...@@ -924,7 +924,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s) ...@@ -924,7 +924,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
b = __release_queued_bios_after_merge(s); b = __release_queued_bios_after_merge(s);
out: out:
up_write(&s->lock); mutex_unlock(&s->lock);
if (b) if (b)
flush_bios(b); flush_bios(b);
...@@ -983,9 +983,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s) ...@@ -983,9 +983,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
if (linear_chunks < 0) { if (linear_chunks < 0) {
DMERR("Read error in exception store: " DMERR("Read error in exception store: "
"shutting down merge"); "shutting down merge");
down_write(&s->lock); mutex_lock(&s->lock);
s->merge_failed = 1; s->merge_failed = 1;
up_write(&s->lock); mutex_unlock(&s->lock);
} }
goto shut; goto shut;
} }
...@@ -1026,10 +1026,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s) ...@@ -1026,10 +1026,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
previous_count = read_pending_exceptions_done_count(); previous_count = read_pending_exceptions_done_count();
} }
down_write(&s->lock); mutex_lock(&s->lock);
s->first_merging_chunk = old_chunk; s->first_merging_chunk = old_chunk;
s->num_merging_chunks = linear_chunks; s->num_merging_chunks = linear_chunks;
up_write(&s->lock); mutex_unlock(&s->lock);
/* Wait until writes to all 'linear_chunks' drain */ /* Wait until writes to all 'linear_chunks' drain */
for (i = 0; i < linear_chunks; i++) for (i = 0; i < linear_chunks; i++)
...@@ -1071,10 +1071,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context) ...@@ -1071,10 +1071,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context)
return; return;
shut: shut:
down_write(&s->lock); mutex_lock(&s->lock);
s->merge_failed = 1; s->merge_failed = 1;
b = __release_queued_bios_after_merge(s); b = __release_queued_bios_after_merge(s);
up_write(&s->lock); mutex_unlock(&s->lock);
error_bios(b); error_bios(b);
merge_shutdown(s); merge_shutdown(s);
...@@ -1173,7 +1173,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv) ...@@ -1173,7 +1173,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
s->exception_start_sequence = 0; s->exception_start_sequence = 0;
s->exception_complete_sequence = 0; s->exception_complete_sequence = 0;
INIT_LIST_HEAD(&s->out_of_order_list); INIT_LIST_HEAD(&s->out_of_order_list);
init_rwsem(&s->lock); mutex_init(&s->lock);
INIT_LIST_HEAD(&s->list); INIT_LIST_HEAD(&s->list);
spin_lock_init(&s->pe_lock); spin_lock_init(&s->pe_lock);
s->state_bits = 0; s->state_bits = 0;
...@@ -1338,9 +1338,9 @@ static void snapshot_dtr(struct dm_target *ti) ...@@ -1338,9 +1338,9 @@ static void snapshot_dtr(struct dm_target *ti)
/* Check whether exception handover must be cancelled */ /* Check whether exception handover must be cancelled */
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
if (snap_src && snap_dest && (s == snap_src)) { if (snap_src && snap_dest && (s == snap_src)) {
down_write(&snap_dest->lock); mutex_lock(&snap_dest->lock);
snap_dest->valid = 0; snap_dest->valid = 0;
up_write(&snap_dest->lock); mutex_unlock(&snap_dest->lock);
DMERR("Cancelling snapshot handover."); DMERR("Cancelling snapshot handover.");
} }
up_read(&_origins_lock); up_read(&_origins_lock);
...@@ -1371,6 +1371,8 @@ static void snapshot_dtr(struct dm_target *ti) ...@@ -1371,6 +1371,8 @@ static void snapshot_dtr(struct dm_target *ti)
dm_exception_store_destroy(s->store); dm_exception_store_destroy(s->store);
mutex_destroy(&s->lock);
dm_put_device(ti, s->cow); dm_put_device(ti, s->cow);
dm_put_device(ti, s->origin); dm_put_device(ti, s->origin);
...@@ -1458,7 +1460,7 @@ static void pending_complete(void *context, int success) ...@@ -1458,7 +1460,7 @@ static void pending_complete(void *context, int success)
if (!success) { if (!success) {
/* Read/write error - snapshot is unusable */ /* Read/write error - snapshot is unusable */
down_write(&s->lock); mutex_lock(&s->lock);
__invalidate_snapshot(s, -EIO); __invalidate_snapshot(s, -EIO);
error = 1; error = 1;
goto out; goto out;
...@@ -1466,14 +1468,14 @@ static void pending_complete(void *context, int success) ...@@ -1466,14 +1468,14 @@ static void pending_complete(void *context, int success)
e = alloc_completed_exception(GFP_NOIO); e = alloc_completed_exception(GFP_NOIO);
if (!e) { if (!e) {
down_write(&s->lock); mutex_lock(&s->lock);
__invalidate_snapshot(s, -ENOMEM); __invalidate_snapshot(s, -ENOMEM);
error = 1; error = 1;
goto out; goto out;
} }
*e = pe->e; *e = pe->e;
down_write(&s->lock); mutex_lock(&s->lock);
if (!s->valid) { if (!s->valid) {
free_completed_exception(e); free_completed_exception(e);
error = 1; error = 1;
...@@ -1498,7 +1500,7 @@ static void pending_complete(void *context, int success) ...@@ -1498,7 +1500,7 @@ static void pending_complete(void *context, int success)
full_bio->bi_end_io = pe->full_bio_end_io; full_bio->bi_end_io = pe->full_bio_end_io;
increment_pending_exceptions_done_count(); increment_pending_exceptions_done_count();
up_write(&s->lock); mutex_unlock(&s->lock);
/* Submit any pending write bios */ /* Submit any pending write bios */
if (error) { if (error) {
...@@ -1694,7 +1696,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) ...@@ -1694,7 +1696,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
/* FIXME: should only take write lock if we need /* FIXME: should only take write lock if we need
* to copy an exception */ * to copy an exception */
down_write(&s->lock); mutex_lock(&s->lock);
if (!s->valid || (unlikely(s->snapshot_overflowed) && if (!s->valid || (unlikely(s->snapshot_overflowed) &&
bio_data_dir(bio) == WRITE)) { bio_data_dir(bio) == WRITE)) {
...@@ -1717,9 +1719,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) ...@@ -1717,9 +1719,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
if (bio_data_dir(bio) == WRITE) { if (bio_data_dir(bio) == WRITE) {
pe = __lookup_pending_exception(s, chunk); pe = __lookup_pending_exception(s, chunk);
if (!pe) { if (!pe) {
up_write(&s->lock); mutex_unlock(&s->lock);
pe = alloc_pending_exception(s); pe = alloc_pending_exception(s);
down_write(&s->lock); mutex_lock(&s->lock);
if (!s->valid || s->snapshot_overflowed) { if (!s->valid || s->snapshot_overflowed) {
free_pending_exception(pe); free_pending_exception(pe);
...@@ -1754,7 +1756,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) ...@@ -1754,7 +1756,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
bio->bi_iter.bi_size == bio->bi_iter.bi_size ==
(s->store->chunk_size << SECTOR_SHIFT)) { (s->store->chunk_size << SECTOR_SHIFT)) {
pe->started = 1; pe->started = 1;
up_write(&s->lock); mutex_unlock(&s->lock);
start_full_bio(pe, bio); start_full_bio(pe, bio);
goto out; goto out;
} }
...@@ -1764,7 +1766,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) ...@@ -1764,7 +1766,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
if (!pe->started) { if (!pe->started) {
/* this is protected by snap->lock */ /* this is protected by snap->lock */
pe->started = 1; pe->started = 1;
up_write(&s->lock); mutex_unlock(&s->lock);
start_copy(pe); start_copy(pe);
goto out; goto out;
} }
...@@ -1774,7 +1776,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) ...@@ -1774,7 +1776,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
} }
out_unlock: out_unlock:
up_write(&s->lock); mutex_unlock(&s->lock);
out: out:
return r; return r;
} }
...@@ -1810,7 +1812,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio) ...@@ -1810,7 +1812,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector); chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
down_write(&s->lock); mutex_lock(&s->lock);
/* Full merging snapshots are redirected to the origin */ /* Full merging snapshots are redirected to the origin */
if (!s->valid) if (!s->valid)
...@@ -1841,12 +1843,12 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio) ...@@ -1841,12 +1843,12 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
bio_set_dev(bio, s->origin->bdev); bio_set_dev(bio, s->origin->bdev);
if (bio_data_dir(bio) == WRITE) { if (bio_data_dir(bio) == WRITE) {
up_write(&s->lock); mutex_unlock(&s->lock);
return do_origin(s->origin, bio); return do_origin(s->origin, bio);
} }
out_unlock: out_unlock:
up_write(&s->lock); mutex_unlock(&s->lock);
return r; return r;
} }
...@@ -1878,7 +1880,7 @@ static int snapshot_preresume(struct dm_target *ti) ...@@ -1878,7 +1880,7 @@ static int snapshot_preresume(struct dm_target *ti)
down_read(&_origins_lock); down_read(&_origins_lock);
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
if (snap_src && snap_dest) { if (snap_src && snap_dest) {
down_read(&snap_src->lock); mutex_lock(&snap_src->lock);
if (s == snap_src) { if (s == snap_src) {
DMERR("Unable to resume snapshot source until " DMERR("Unable to resume snapshot source until "
"handover completes."); "handover completes.");
...@@ -1888,7 +1890,7 @@ static int snapshot_preresume(struct dm_target *ti) ...@@ -1888,7 +1890,7 @@ static int snapshot_preresume(struct dm_target *ti)
"source is suspended."); "source is suspended.");
r = -EINVAL; r = -EINVAL;
} }
up_read(&snap_src->lock); mutex_unlock(&snap_src->lock);
} }
up_read(&_origins_lock); up_read(&_origins_lock);
...@@ -1934,11 +1936,11 @@ static void snapshot_resume(struct dm_target *ti) ...@@ -1934,11 +1936,11 @@ static void snapshot_resume(struct dm_target *ti)
(void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
if (snap_src && snap_dest) { if (snap_src && snap_dest) {
down_write(&snap_src->lock); mutex_lock(&snap_src->lock);
down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING); mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
__handover_exceptions(snap_src, snap_dest); __handover_exceptions(snap_src, snap_dest);
up_write(&snap_dest->lock); mutex_unlock(&snap_dest->lock);
up_write(&snap_src->lock); mutex_unlock(&snap_src->lock);
} }
up_read(&_origins_lock); up_read(&_origins_lock);
...@@ -1953,9 +1955,9 @@ static void snapshot_resume(struct dm_target *ti) ...@@ -1953,9 +1955,9 @@ static void snapshot_resume(struct dm_target *ti)
/* Now we have correct chunk size, reregister */ /* Now we have correct chunk size, reregister */
reregister_snapshot(s); reregister_snapshot(s);
down_write(&s->lock); mutex_lock(&s->lock);
s->active = 1; s->active = 1;
up_write(&s->lock); mutex_unlock(&s->lock);
} }
static uint32_t get_origin_minimum_chunksize(struct block_device *bdev) static uint32_t get_origin_minimum_chunksize(struct block_device *bdev)
...@@ -1995,7 +1997,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type, ...@@ -1995,7 +1997,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
switch (type) { switch (type) {
case STATUSTYPE_INFO: case STATUSTYPE_INFO:
down_write(&snap->lock); mutex_lock(&snap->lock);
if (!snap->valid) if (!snap->valid)
DMEMIT("Invalid"); DMEMIT("Invalid");
...@@ -2020,7 +2022,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type, ...@@ -2020,7 +2022,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
DMEMIT("Unknown"); DMEMIT("Unknown");
} }
up_write(&snap->lock); mutex_unlock(&snap->lock);
break; break;
...@@ -2086,7 +2088,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector, ...@@ -2086,7 +2088,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
if (dm_target_is_snapshot_merge(snap->ti)) if (dm_target_is_snapshot_merge(snap->ti))
continue; continue;
down_write(&snap->lock); mutex_lock(&snap->lock);
/* Only deal with valid and active snapshots */ /* Only deal with valid and active snapshots */
if (!snap->valid || !snap->active) if (!snap->valid || !snap->active)
...@@ -2113,9 +2115,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector, ...@@ -2113,9 +2115,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
pe = __lookup_pending_exception(snap, chunk); pe = __lookup_pending_exception(snap, chunk);
if (!pe) { if (!pe) {
up_write(&snap->lock); mutex_unlock(&snap->lock);
pe = alloc_pending_exception(snap); pe = alloc_pending_exception(snap);
down_write(&snap->lock); mutex_lock(&snap->lock);
if (!snap->valid) { if (!snap->valid) {
free_pending_exception(pe); free_pending_exception(pe);
...@@ -2158,7 +2160,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector, ...@@ -2158,7 +2160,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
} }
next_snapshot: next_snapshot:
up_write(&snap->lock); mutex_unlock(&snap->lock);
if (pe_to_start_now) { if (pe_to_start_now) {
start_copy(pe_to_start_now); start_copy(pe_to_start_now);
......
...@@ -228,6 +228,7 @@ void dm_stats_cleanup(struct dm_stats *stats) ...@@ -228,6 +228,7 @@ void dm_stats_cleanup(struct dm_stats *stats)
dm_stat_free(&s->rcu_head); dm_stat_free(&s->rcu_head);
} }
free_percpu(stats->last); free_percpu(stats->last);
mutex_destroy(&stats->mutex);
} }
static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end, static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
......
...@@ -866,7 +866,8 @@ EXPORT_SYMBOL(dm_consume_args); ...@@ -866,7 +866,8 @@ EXPORT_SYMBOL(dm_consume_args);
static bool __table_type_bio_based(enum dm_queue_mode table_type) static bool __table_type_bio_based(enum dm_queue_mode table_type)
{ {
return (table_type == DM_TYPE_BIO_BASED || return (table_type == DM_TYPE_BIO_BASED ||
table_type == DM_TYPE_DAX_BIO_BASED); table_type == DM_TYPE_DAX_BIO_BASED ||
table_type == DM_TYPE_NVME_BIO_BASED);
} }
static bool __table_type_request_based(enum dm_queue_mode table_type) static bool __table_type_request_based(enum dm_queue_mode table_type)
...@@ -909,13 +910,33 @@ static bool dm_table_supports_dax(struct dm_table *t) ...@@ -909,13 +910,33 @@ static bool dm_table_supports_dax(struct dm_table *t)
return true; return true;
} }
static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
struct verify_rq_based_data {
unsigned sq_count;
unsigned mq_count;
};
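/*
 * Editor's note: the iterate_devices callback below tallies legacy (sq)
 * vs. blk-mq devices in the struct above while reporting whether each
 * underlying queue is request-stackable; the caller later uses the two
 * counts to reject tables that mix queue types.
 */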
static int device_is_rq_based(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
struct request_queue *q = bdev_get_queue(dev->bdev);
struct verify_rq_based_data *v = data;
if (q->mq_ops)
v->mq_count++;
else
v->sq_count++;
return queue_is_rq_based(q);
}
static int dm_table_determine_type(struct dm_table *t) static int dm_table_determine_type(struct dm_table *t)
{ {
unsigned i; unsigned i;
unsigned bio_based = 0, request_based = 0, hybrid = 0; unsigned bio_based = 0, request_based = 0, hybrid = 0;
unsigned sq_count = 0, mq_count = 0; struct verify_rq_based_data v = {.sq_count = 0, .mq_count = 0};
struct dm_target *tgt; struct dm_target *tgt;
struct dm_dev_internal *dd;
struct list_head *devices = dm_table_get_devices(t); struct list_head *devices = dm_table_get_devices(t);
enum dm_queue_mode live_md_type = dm_get_md_type(t->md); enum dm_queue_mode live_md_type = dm_get_md_type(t->md);
...@@ -923,6 +944,14 @@ static int dm_table_determine_type(struct dm_table *t) ...@@ -923,6 +944,14 @@ static int dm_table_determine_type(struct dm_table *t)
/* target already set the table's type */ /* target already set the table's type */
if (t->type == DM_TYPE_BIO_BASED) if (t->type == DM_TYPE_BIO_BASED)
return 0; return 0;
else if (t->type == DM_TYPE_NVME_BIO_BASED) {
if (!dm_table_does_not_support_partial_completion(t)) {
DMERR("nvme bio-based is only possible with devices"
" that don't support partial completion");
return -EINVAL;
}
/* Fallthru, also verify all devices are blk-mq */
}
BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED); BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
goto verify_rq_based; goto verify_rq_based;
} }
...@@ -937,7 +966,7 @@ static int dm_table_determine_type(struct dm_table *t) ...@@ -937,7 +966,7 @@ static int dm_table_determine_type(struct dm_table *t)
bio_based = 1; bio_based = 1;
if (bio_based && request_based) { if (bio_based && request_based) {
DMWARN("Inconsistent table: different target types" DMERR("Inconsistent table: different target types"
" can't be mixed up"); " can't be mixed up");
return -EINVAL; return -EINVAL;
} }
...@@ -959,8 +988,18 @@ static int dm_table_determine_type(struct dm_table *t) ...@@ -959,8 +988,18 @@ static int dm_table_determine_type(struct dm_table *t)
/* We must use this table as bio-based */ /* We must use this table as bio-based */
t->type = DM_TYPE_BIO_BASED; t->type = DM_TYPE_BIO_BASED;
if (dm_table_supports_dax(t) || if (dm_table_supports_dax(t) ||
(list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
t->type = DM_TYPE_DAX_BIO_BASED; t->type = DM_TYPE_DAX_BIO_BASED;
} else {
/* Check if upgrading to NVMe bio-based is valid or required */
tgt = dm_table_get_immutable_target(t);
if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) {
t->type = DM_TYPE_NVME_BIO_BASED;
goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */
} else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) {
t->type = DM_TYPE_NVME_BIO_BASED;
}
}
return 0; return 0;
} }
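/*
 * Editor's summary (not part of the commit): a bio-based table is
 * promoted to DM_TYPE_NVME_BIO_BASED when it has a single immutable
 * target that never splits I/O (max_io_len == 0) and every underlying
 * device claims not to support partial completion (currently keyed off
 * the "nvme" device-name prefix), or when an empty table inherits the
 * live NVMe bio-based type; the verify_rq_based checks that follow then
 * additionally require all such devices to be blk-mq.
 */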
...@@ -980,7 +1019,8 @@ static int dm_table_determine_type(struct dm_table *t) ...@@ -980,7 +1019,8 @@ static int dm_table_determine_type(struct dm_table *t)
* (e.g. request completion process for partial completion.) * (e.g. request completion process for partial completion.)
*/ */
if (t->num_targets > 1) { if (t->num_targets > 1) {
DMWARN("Request-based dm doesn't support multiple targets yet"); DMERR("%s DM doesn't support multiple targets",
t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based");
return -EINVAL; return -EINVAL;
} }
...@@ -997,28 +1037,29 @@ static int dm_table_determine_type(struct dm_table *t) ...@@ -997,28 +1037,29 @@ static int dm_table_determine_type(struct dm_table *t)
return 0; return 0;
} }
/* Non-request-stackable devices can't be used for request-based dm */ tgt = dm_table_get_immutable_target(t);
list_for_each_entry(dd, devices, list) { if (!tgt) {
struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev); DMERR("table load rejected: immutable target is required");
return -EINVAL;
if (!queue_is_rq_based(q)) { } else if (tgt->max_io_len) {
DMERR("table load rejected: including" DMERR("table load rejected: immutable target that splits IO is not supported");
" non-request-stackable devices");
return -EINVAL; return -EINVAL;
} }
if (q->mq_ops) /* Non-request-stackable devices can't be used for request-based dm */
mq_count++; if (!tgt->type->iterate_devices ||
else !tgt->type->iterate_devices(tgt, device_is_rq_based, &v)) {
sq_count++; DMERR("table load rejected: including non-request-stackable devices");
return -EINVAL;
} }
if (sq_count && mq_count) { if (v.sq_count && v.mq_count) {
DMERR("table load rejected: not all devices are blk-mq request-stackable"); DMERR("table load rejected: not all devices are blk-mq request-stackable");
return -EINVAL; return -EINVAL;
} }
t->all_blk_mq = mq_count > 0; t->all_blk_mq = v.mq_count > 0;
if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) { if (!t->all_blk_mq &&
(t->type == DM_TYPE_MQ_REQUEST_BASED || t->type == DM_TYPE_NVME_BIO_BASED)) {
DMERR("table load rejected: all devices are not blk-mq request-stackable"); DMERR("table load rejected: all devices are not blk-mq request-stackable");
return -EINVAL; return -EINVAL;
} }
...@@ -1079,7 +1120,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device * ...@@ -1079,7 +1120,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
{ {
enum dm_queue_mode type = dm_table_get_type(t); enum dm_queue_mode type = dm_table_get_type(t);
unsigned per_io_data_size = 0; unsigned per_io_data_size = 0;
struct dm_target *tgt; unsigned min_pool_size = 0;
struct dm_target *ti;
unsigned i; unsigned i;
if (unlikely(type == DM_TYPE_NONE)) { if (unlikely(type == DM_TYPE_NONE)) {
...@@ -1089,11 +1131,13 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device * ...@@ -1089,11 +1131,13 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
if (__table_type_bio_based(type)) if (__table_type_bio_based(type))
for (i = 0; i < t->num_targets; i++) { for (i = 0; i < t->num_targets; i++) {
tgt = t->targets + i; ti = t->targets + i;
per_io_data_size = max(per_io_data_size, tgt->per_io_data_size); per_io_data_size = max(per_io_data_size, ti->per_io_data_size);
min_pool_size = max(min_pool_size, ti->num_flush_bios);
} }
t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_io_data_size); t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported,
per_io_data_size, min_pool_size);
if (!t->mempools) if (!t->mempools)
return -ENOMEM; return -ENOMEM;
...@@ -1705,6 +1749,20 @@ static bool dm_table_all_devices_attribute(struct dm_table *t, ...@@ -1705,6 +1749,20 @@ static bool dm_table_all_devices_attribute(struct dm_table *t,
return true; return true;
} }
static int device_no_partial_completion(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
char b[BDEVNAME_SIZE];
/* For now, NVMe devices are the only devices of this class */
return (strncmp(bdevname(dev->bdev, b), "nvme", 3) == 0);
}
static bool dm_table_does_not_support_partial_completion(struct dm_table *t)
{
return dm_table_all_devices_attribute(t, device_no_partial_completion);
}
static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev, static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data) sector_t start, sector_t len, void *data)
{ {
...@@ -1820,6 +1878,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, ...@@ -1820,6 +1878,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
} }
blk_queue_write_cache(q, wc, fua); blk_queue_write_cache(q, wc, fua);
if (dm_table_supports_dax(t))
queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
if (dm_table_supports_dax_write_cache(t)) if (dm_table_supports_dax_write_cache(t))
dax_write_cache(t->md->dax_dev, true); dax_write_cache(t->md->dax_dev, true);
......
...@@ -492,6 +492,11 @@ static void pool_table_init(void) ...@@ -492,6 +492,11 @@ static void pool_table_init(void)
INIT_LIST_HEAD(&dm_thin_pool_table.pools); INIT_LIST_HEAD(&dm_thin_pool_table.pools);
} }
static void pool_table_exit(void)
{
mutex_destroy(&dm_thin_pool_table.mutex);
}
static void __pool_table_insert(struct pool *pool) static void __pool_table_insert(struct pool *pool)
{ {
BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex)); BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
...@@ -1717,7 +1722,7 @@ static void __remap_and_issue_shared_cell(void *context, ...@@ -1717,7 +1722,7 @@ static void __remap_and_issue_shared_cell(void *context,
bio_op(bio) == REQ_OP_DISCARD) bio_op(bio) == REQ_OP_DISCARD)
bio_list_add(&info->defer_bios, bio); bio_list_add(&info->defer_bios, bio);
else { else {
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));; struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds); h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds);
inc_all_io_entry(info->tc->pool, bio); inc_all_io_entry(info->tc->pool, bio);
...@@ -4387,6 +4392,8 @@ static void dm_thin_exit(void) ...@@ -4387,6 +4392,8 @@ static void dm_thin_exit(void)
dm_unregister_target(&pool_target); dm_unregister_target(&pool_target);
kmem_cache_destroy(_new_mapping_cache); kmem_cache_destroy(_new_mapping_cache);
pool_table_exit();
} }
module_init(dm_thin_init); module_init(dm_thin_init);
......
/*
* Copyright (C) 2017 Intel Corporation.
*
* This file is released under the GPL.
*/
#include "dm.h"
#include <linux/module.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/bitops.h>
#include <linux/device-mapper.h>
struct unstripe_c {
struct dm_dev *dev;
sector_t physical_start;
uint32_t stripes;
uint32_t unstripe;
sector_t unstripe_width;
sector_t unstripe_offset;
uint32_t chunk_size;
u8 chunk_shift;
};
#define DM_MSG_PREFIX "unstriped"
static void cleanup_unstripe(struct unstripe_c *uc, struct dm_target *ti)
{
if (uc->dev)
dm_put_device(ti, uc->dev);
kfree(uc);
}
/*
 * Construct an unstriped mapping.
* <number of stripes> <chunk size> <stripe #> <dev_path> <offset>
*/
static int unstripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
struct unstripe_c *uc;
sector_t tmp_len;
unsigned long long start;
char dummy;
if (argc != 5) {
ti->error = "Invalid number of arguments";
return -EINVAL;
}
uc = kzalloc(sizeof(*uc), GFP_KERNEL);
if (!uc) {
ti->error = "Memory allocation for unstriped context failed";
return -ENOMEM;
}
if (kstrtouint(argv[0], 10, &uc->stripes) || !uc->stripes) {
ti->error = "Invalid stripe count";
goto err;
}
if (kstrtouint(argv[1], 10, &uc->chunk_size) || !uc->chunk_size) {
ti->error = "Invalid chunk_size";
goto err;
}
// FIXME: must support non power of 2 chunk_size, dm-stripe.c does
if (!is_power_of_2(uc->chunk_size)) {
ti->error = "Non power of 2 chunk_size is not supported yet";
goto err;
}
if (kstrtouint(argv[2], 10, &uc->unstripe)) {
ti->error = "Invalid stripe number";
goto err;
}
if (uc->unstripe > uc->stripes && uc->stripes > 1) {
ti->error = "Please provide stripe between [0, # of stripes]";
goto err;
}
if (dm_get_device(ti, argv[3], dm_table_get_mode(ti->table), &uc->dev)) {
ti->error = "Couldn't get striped device";
goto err;
}
if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1) {
ti->error = "Invalid striped device offset";
goto err;
}
uc->physical_start = start;
uc->unstripe_offset = uc->unstripe * uc->chunk_size;
uc->unstripe_width = (uc->stripes - 1) * uc->chunk_size;
uc->chunk_shift = fls(uc->chunk_size) - 1;
tmp_len = ti->len;
if (sector_div(tmp_len, uc->chunk_size)) {
ti->error = "Target length not divisible by chunk size";
goto err;
}
if (dm_set_target_max_io_len(ti, uc->chunk_size)) {
ti->error = "Failed to set max io len";
goto err;
}
ti->private = uc;
return 0;
err:
cleanup_unstripe(uc, ti);
return -EINVAL;
}
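For illustration only (the sizes and device path below are assumptions, not taken from this patch): with the argument order parsed by unstripe_ctr() above, a table line that exposes stripe #1 of a 4-way stripe with 256-sector chunks could look like
    0 2097152 unstriped 4 256 1 /dev/nvme0n1 0
where 2097152 is the target length in sectors (divisible by the 256-sector chunk size, as the constructor requires) and the trailing 0 is the offset into the underlying striped device.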
static void unstripe_dtr(struct dm_target *ti)
{
struct unstripe_c *uc = ti->private;
cleanup_unstripe(uc, ti);
}
static sector_t map_to_core(struct dm_target *ti, struct bio *bio)
{
struct unstripe_c *uc = ti->private;
sector_t sector = bio->bi_iter.bi_sector;
/* Shift us up to the right "row" on the stripe */
sector += uc->unstripe_width * (sector >> uc->chunk_shift);
/* Account for what stripe we're operating on */
sector += uc->unstripe_offset;
return sector;
}
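To make the arithmetic above concrete, here is a small user-space sketch (not part of the kernel source; the parameters are made up for illustration) that applies the same two steps as map_to_core() to example values:
#include <stdio.h>
#include <stdint.h>
int main(void)
{
	uint32_t stripes = 4;		/* assumed stripe count */
	uint32_t chunk_size = 256;	/* chunk size in sectors (power of 2) */
	uint32_t unstripe = 1;		/* stripe number being exposed */
	uint64_t sector = 300;		/* logical sector on the unstriped target */
	/* derived exactly as in unstripe_ctr() */
	uint64_t unstripe_offset = (uint64_t)unstripe * chunk_size;
	uint64_t unstripe_width = (uint64_t)(stripes - 1) * chunk_size;
	unsigned int chunk_shift = 0;
	while ((1u << chunk_shift) < chunk_size)
		chunk_shift++;
	/* the same two steps as map_to_core() */
	sector += unstripe_width * (sector >> chunk_shift);	/* row 1 -> +768 */
	sector += unstripe_offset;				/* stripe #1 -> +256 */
	/* prints 1324: sector 44 of chunk 5 on the striped device */
	printf("%llu\n", (unsigned long long)sector);
	return 0;
}
The kernel code then adds uc->physical_start before remapping the bio, which the sketch leaves out.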
static int unstripe_map(struct dm_target *ti, struct bio *bio)
{
struct unstripe_c *uc = ti->private;
bio_set_dev(bio, uc->dev->bdev);
bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start;
return DM_MAPIO_REMAPPED;
}
static void unstripe_status(struct dm_target *ti, status_type_t type,
unsigned int status_flags, char *result, unsigned int maxlen)
{
struct unstripe_c *uc = ti->private;
unsigned int sz = 0;
switch (type) {
case STATUSTYPE_INFO:
break;
case STATUSTYPE_TABLE:
DMEMIT("%d %llu %d %s %llu",
uc->stripes, (unsigned long long)uc->chunk_size, uc->unstripe,
uc->dev->name, (unsigned long long)uc->physical_start);
break;
}
}
static int unstripe_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
struct unstripe_c *uc = ti->private;
return fn(ti, uc->dev, uc->physical_start, ti->len, data);
}
static void unstripe_io_hints(struct dm_target *ti,
struct queue_limits *limits)
{
struct unstripe_c *uc = ti->private;
limits->chunk_sectors = uc->chunk_size;
}
static struct target_type unstripe_target = {
.name = "unstriped",
.version = {1, 0, 0},
.module = THIS_MODULE,
.ctr = unstripe_ctr,
.dtr = unstripe_dtr,
.map = unstripe_map,
.status = unstripe_status,
.iterate_devices = unstripe_iterate_devices,
.io_hints = unstripe_io_hints,
};
static int __init dm_unstripe_init(void)
{
int r;
r = dm_register_target(&unstripe_target);
if (r < 0)
DMERR("target registration failed");
return r;
}
static void __exit dm_unstripe_exit(void)
{
dm_unregister_target(&unstripe_target);
}
module_init(dm_unstripe_init);
module_exit(dm_unstripe_exit);
MODULE_DESCRIPTION(DM_NAME " unstriped target");
MODULE_AUTHOR("Scott Bauer <scott.bauer@intel.com>");
MODULE_LICENSE("GPL");
...@@ -2333,6 +2333,9 @@ static void dmz_cleanup_metadata(struct dmz_metadata *zmd) ...@@ -2333,6 +2333,9 @@ static void dmz_cleanup_metadata(struct dmz_metadata *zmd)
/* Free the zone descriptors */ /* Free the zone descriptors */
dmz_drop_zones(zmd); dmz_drop_zones(zmd);
mutex_destroy(&zmd->mblk_flush_lock);
mutex_destroy(&zmd->map_lock);
} }
/* /*
......
...@@ -827,6 +827,7 @@ static int dmz_ctr(struct dm_target *ti, unsigned int argc, char **argv) ...@@ -827,6 +827,7 @@ static int dmz_ctr(struct dm_target *ti, unsigned int argc, char **argv)
err_cwq: err_cwq:
destroy_workqueue(dmz->chunk_wq); destroy_workqueue(dmz->chunk_wq);
err_bio: err_bio:
mutex_destroy(&dmz->chunk_lock);
bioset_free(dmz->bio_set); bioset_free(dmz->bio_set);
err_meta: err_meta:
dmz_dtr_metadata(dmz->metadata); dmz_dtr_metadata(dmz->metadata);
...@@ -861,6 +862,8 @@ static void dmz_dtr(struct dm_target *ti) ...@@ -861,6 +862,8 @@ static void dmz_dtr(struct dm_target *ti)
dmz_put_zoned_device(ti); dmz_put_zoned_device(ti);
mutex_destroy(&dmz->chunk_lock);
kfree(dmz); kfree(dmz);
} }
......
...@@ -60,18 +60,73 @@ void dm_issue_global_event(void) ...@@ -60,18 +60,73 @@ void dm_issue_global_event(void)
} }
/* /*
* One of these is allocated per bio. * One of these is allocated (on-stack) per original bio.
*/ */
struct clone_info {
struct dm_table *map;
struct bio *bio;
struct dm_io *io;
sector_t sector;
unsigned sector_count;
};
/*
* One of these is allocated per clone bio.
*/
#define DM_TIO_MAGIC 7282014
struct dm_target_io {
unsigned magic;
struct dm_io *io;
struct dm_target *ti;
unsigned target_bio_nr;
unsigned *len_ptr;
bool inside_dm_io;
struct bio clone;
};
/*
* One of these is allocated per original bio.
* It contains the first clone used for that original.
*/
#define DM_IO_MAGIC 5191977
struct dm_io { struct dm_io {
unsigned magic;
struct mapped_device *md; struct mapped_device *md;
blk_status_t status; blk_status_t status;
atomic_t io_count; atomic_t io_count;
struct bio *bio; struct bio *orig_bio;
unsigned long start_time; unsigned long start_time;
spinlock_t endio_lock; spinlock_t endio_lock;
struct dm_stats_aux stats_aux; struct dm_stats_aux stats_aux;
/* last member of dm_target_io is 'struct bio' */
struct dm_target_io tio;
}; };
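Read together with dm_per_bio_data() just below, the embedding above gives roughly the following front-pad layout for a clone allocated from the new io_bs bioset (a sketch that ignores the alignment rounding done in dm_alloc_md_mempools()):
/*
 * ... [ per-bio data ][ struct dm_io                                 ]
 *                     [ ...        [ struct dm_target_io             ]]
 *                     [            [ ...          [ struct bio clone ]]]
 *
 * bio_alloc_bioset() hands back &clone; dm_per_bio_data() walks backwards
 * past offsetof(struct dm_target_io, clone), offsetof(struct dm_io, tio)
 * and data_size to reach the per-bio data.  For additional clones taken
 * from md->bs (inside_dm_io == false) the struct dm_io layer is absent.
 */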
void *dm_per_bio_data(struct bio *bio, size_t data_size)
{
struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
if (!tio->inside_dm_io)
return (char *)bio - offsetof(struct dm_target_io, clone) - data_size;
return (char *)bio - offsetof(struct dm_target_io, clone) - offsetof(struct dm_io, tio) - data_size;
}
EXPORT_SYMBOL_GPL(dm_per_bio_data);
struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size)
{
struct dm_io *io = (struct dm_io *)((char *)data + data_size);
if (io->magic == DM_IO_MAGIC)
return (struct bio *)((char *)io + offsetof(struct dm_io, tio) + offsetof(struct dm_target_io, clone));
BUG_ON(io->magic != DM_TIO_MAGIC);
return (struct bio *)((char *)io + offsetof(struct dm_target_io, clone));
}
EXPORT_SYMBOL_GPL(dm_bio_from_per_bio_data);
unsigned dm_bio_get_target_bio_nr(const struct bio *bio)
{
return container_of(bio, struct dm_target_io, clone)->target_bio_nr;
}
EXPORT_SYMBOL_GPL(dm_bio_get_target_bio_nr);
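The exported helpers above are the interface targets use to reach that per-bio data. A minimal, hypothetical fragment (the names demo_per_bio, demo_ctr, demo_map and demo_bio_of are invented for illustration, and target registration/teardown are omitted) might use them like this:
#include <linux/device-mapper.h>
struct demo_per_bio {			/* hypothetical per-bio state */
	sector_t orig_sector;
};
static int demo_ctr(struct dm_target *ti, unsigned int argc, char **argv)
{
	/* reserve room for our state in the front pad of every clone bio */
	ti->per_io_data_size = sizeof(struct demo_per_bio);
	return 0;
}
static int demo_map(struct dm_target *ti, struct bio *bio)
{
	struct demo_per_bio *pb = dm_per_bio_data(bio, sizeof(struct demo_per_bio));
	pb->orig_sector = bio->bi_iter.bi_sector;
	/* ... remap the bio to the underlying device here ... */
	return DM_MAPIO_REMAPPED;
}
/* e.g. from a workqueue: recover the clone bio from the per-bio data pointer */
static struct bio *demo_bio_of(struct demo_per_bio *pb)
{
	return dm_bio_from_per_bio_data(pb, sizeof(struct demo_per_bio));
}
Setting per_io_data_size in the constructor is what feeds the front_pad and io_front_pad sizing in dm_alloc_md_mempools() further down.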
#define MINOR_ALLOCED ((void *)-1) #define MINOR_ALLOCED ((void *)-1)
/* /*
...@@ -93,8 +148,8 @@ static int dm_numa_node = DM_NUMA_NODE; ...@@ -93,8 +148,8 @@ static int dm_numa_node = DM_NUMA_NODE;
* For mempools pre-allocation at the table loading time. * For mempools pre-allocation at the table loading time.
*/ */
struct dm_md_mempools { struct dm_md_mempools {
mempool_t *io_pool;
struct bio_set *bs; struct bio_set *bs;
struct bio_set *io_bs;
}; };
struct table_device { struct table_device {
...@@ -103,7 +158,6 @@ struct table_device { ...@@ -103,7 +158,6 @@ struct table_device {
struct dm_dev dm_dev; struct dm_dev dm_dev;
}; };
static struct kmem_cache *_io_cache;
static struct kmem_cache *_rq_tio_cache; static struct kmem_cache *_rq_tio_cache;
static struct kmem_cache *_rq_cache; static struct kmem_cache *_rq_cache;
...@@ -170,14 +224,9 @@ static int __init local_init(void) ...@@ -170,14 +224,9 @@ static int __init local_init(void)
{ {
int r = -ENOMEM; int r = -ENOMEM;
/* allocate a slab for the dm_ios */
_io_cache = KMEM_CACHE(dm_io, 0);
if (!_io_cache)
return r;
_rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0); _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0);
if (!_rq_tio_cache) if (!_rq_tio_cache)
goto out_free_io_cache; return r;
_rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request), _rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request),
__alignof__(struct request), 0, NULL); __alignof__(struct request), 0, NULL);
...@@ -212,8 +261,6 @@ static int __init local_init(void) ...@@ -212,8 +261,6 @@ static int __init local_init(void)
kmem_cache_destroy(_rq_cache); kmem_cache_destroy(_rq_cache);
out_free_rq_tio_cache: out_free_rq_tio_cache:
kmem_cache_destroy(_rq_tio_cache); kmem_cache_destroy(_rq_tio_cache);
out_free_io_cache:
kmem_cache_destroy(_io_cache);
return r; return r;
} }
...@@ -225,7 +272,6 @@ static void local_exit(void) ...@@ -225,7 +272,6 @@ static void local_exit(void)
kmem_cache_destroy(_rq_cache); kmem_cache_destroy(_rq_cache);
kmem_cache_destroy(_rq_tio_cache); kmem_cache_destroy(_rq_tio_cache);
kmem_cache_destroy(_io_cache);
unregister_blkdev(_major, _name); unregister_blkdev(_major, _name);
dm_uevent_exit(); dm_uevent_exit();
...@@ -486,18 +532,69 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode, ...@@ -486,18 +532,69 @@ static int dm_blk_ioctl(struct block_device *bdev, fmode_t mode,
return r; return r;
} }
static struct dm_io *alloc_io(struct mapped_device *md) static void start_io_acct(struct dm_io *io);
static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
{ {
return mempool_alloc(md->io_pool, GFP_NOIO); struct dm_io *io;
struct dm_target_io *tio;
struct bio *clone;
clone = bio_alloc_bioset(GFP_NOIO, 0, md->io_bs);
if (!clone)
return NULL;
tio = container_of(clone, struct dm_target_io, clone);
tio->inside_dm_io = true;
tio->io = NULL;
io = container_of(tio, struct dm_io, tio);
io->magic = DM_IO_MAGIC;
io->status = 0;
atomic_set(&io->io_count, 1);
io->orig_bio = bio;
io->md = md;
spin_lock_init(&io->endio_lock);
start_io_acct(io);
return io;
} }
static void free_io(struct mapped_device *md, struct dm_io *io) static void free_io(struct mapped_device *md, struct dm_io *io)
{ {
mempool_free(io, md->io_pool); bio_put(&io->tio.clone);
}
static struct dm_target_io *alloc_tio(struct clone_info *ci, struct dm_target *ti,
unsigned target_bio_nr, gfp_t gfp_mask)
{
struct dm_target_io *tio;
if (!ci->io->tio.io) {
/* the dm_target_io embedded in ci->io is available */
tio = &ci->io->tio;
} else {
struct bio *clone = bio_alloc_bioset(gfp_mask, 0, ci->io->md->bs);
if (!clone)
return NULL;
tio = container_of(clone, struct dm_target_io, clone);
tio->inside_dm_io = false;
}
tio->magic = DM_TIO_MAGIC;
tio->io = ci->io;
tio->ti = ti;
tio->target_bio_nr = target_bio_nr;
return tio;
} }
static void free_tio(struct dm_target_io *tio) static void free_tio(struct dm_target_io *tio)
{ {
if (tio->inside_dm_io)
return;
bio_put(&tio->clone); bio_put(&tio->clone);
} }
...@@ -510,15 +607,13 @@ int md_in_flight(struct mapped_device *md) ...@@ -510,15 +607,13 @@ int md_in_flight(struct mapped_device *md)
static void start_io_acct(struct dm_io *io) static void start_io_acct(struct dm_io *io)
{ {
struct mapped_device *md = io->md; struct mapped_device *md = io->md;
struct bio *bio = io->bio; struct bio *bio = io->orig_bio;
int cpu;
int rw = bio_data_dir(bio); int rw = bio_data_dir(bio);
io->start_time = jiffies; io->start_time = jiffies;
cpu = part_stat_lock(); generic_start_io_acct(md->queue, rw, bio_sectors(bio), &dm_disk(md)->part0);
part_round_stats(md->queue, cpu, &dm_disk(md)->part0);
part_stat_unlock();
atomic_set(&dm_disk(md)->part0.in_flight[rw], atomic_set(&dm_disk(md)->part0.in_flight[rw],
atomic_inc_return(&md->pending[rw])); atomic_inc_return(&md->pending[rw]));
...@@ -531,7 +626,7 @@ static void start_io_acct(struct dm_io *io) ...@@ -531,7 +626,7 @@ static void start_io_acct(struct dm_io *io)
static void end_io_acct(struct dm_io *io) static void end_io_acct(struct dm_io *io)
{ {
struct mapped_device *md = io->md; struct mapped_device *md = io->md;
struct bio *bio = io->bio; struct bio *bio = io->orig_bio;
unsigned long duration = jiffies - io->start_time; unsigned long duration = jiffies - io->start_time;
int pending; int pending;
int rw = bio_data_dir(bio); int rw = bio_data_dir(bio);
...@@ -752,15 +847,6 @@ int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo) ...@@ -752,15 +847,6 @@ int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo)
return 0; return 0;
} }
/*-----------------------------------------------------------------
* CRUD START:
* A more elegant soln is in the works that uses the queue
* merge fn, unfortunately there are a couple of changes to
* the block layer that I want to make for this. So in the
* interests of getting something for people to use I give
* you this clearly demarcated crap.
*---------------------------------------------------------------*/
static int __noflush_suspending(struct mapped_device *md) static int __noflush_suspending(struct mapped_device *md)
{ {
return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
...@@ -780,8 +866,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error) ...@@ -780,8 +866,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
/* Push-back supersedes any I/O errors */ /* Push-back supersedes any I/O errors */
if (unlikely(error)) { if (unlikely(error)) {
spin_lock_irqsave(&io->endio_lock, flags); spin_lock_irqsave(&io->endio_lock, flags);
if (!(io->status == BLK_STS_DM_REQUEUE && if (!(io->status == BLK_STS_DM_REQUEUE && __noflush_suspending(md)))
__noflush_suspending(md)))
io->status = error; io->status = error;
spin_unlock_irqrestore(&io->endio_lock, flags); spin_unlock_irqrestore(&io->endio_lock, flags);
} }
...@@ -793,7 +878,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error) ...@@ -793,7 +878,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
*/ */
spin_lock_irqsave(&md->deferred_lock, flags); spin_lock_irqsave(&md->deferred_lock, flags);
if (__noflush_suspending(md)) if (__noflush_suspending(md))
bio_list_add_head(&md->deferred, io->bio); /* NOTE early return due to BLK_STS_DM_REQUEUE below */
bio_list_add_head(&md->deferred, io->orig_bio);
else else
/* noflush suspend was interrupted. */ /* noflush suspend was interrupted. */
io->status = BLK_STS_IOERR; io->status = BLK_STS_IOERR;
...@@ -801,7 +887,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error) ...@@ -801,7 +887,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
} }
io_error = io->status; io_error = io->status;
bio = io->bio; bio = io->orig_bio;
end_io_acct(io); end_io_acct(io);
free_io(md, io); free_io(md, io);
...@@ -847,7 +933,7 @@ static void clone_endio(struct bio *bio) ...@@ -847,7 +933,7 @@ static void clone_endio(struct bio *bio)
struct mapped_device *md = tio->io->md; struct mapped_device *md = tio->io->md;
dm_endio_fn endio = tio->ti->type->end_io; dm_endio_fn endio = tio->ti->type->end_io;
if (unlikely(error == BLK_STS_TARGET)) { if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
if (bio_op(bio) == REQ_OP_WRITE_SAME && if (bio_op(bio) == REQ_OP_WRITE_SAME &&
!bio->bi_disk->queue->limits.max_write_same_sectors) !bio->bi_disk->queue->limits.max_write_same_sectors)
disable_write_same(md); disable_write_same(md);
...@@ -1005,7 +1091,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, ...@@ -1005,7 +1091,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
/* /*
* A target may call dm_accept_partial_bio only from the map routine. It is * A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH. * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
* *
* dm_accept_partial_bio informs the dm that the target only wants to process * dm_accept_partial_bio informs the dm that the target only wants to process
* additional n_sectors sectors of the bio and the rest of the data should be * additional n_sectors sectors of the bio and the rest of the data should be
...@@ -1055,7 +1141,7 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start) ...@@ -1055,7 +1141,7 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
{ {
#ifdef CONFIG_BLK_DEV_ZONED #ifdef CONFIG_BLK_DEV_ZONED
struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
struct bio *report_bio = tio->io->bio; struct bio *report_bio = tio->io->orig_bio;
struct blk_zone_report_hdr *hdr = NULL; struct blk_zone_report_hdr *hdr = NULL;
struct blk_zone *zone; struct blk_zone *zone;
unsigned int nr_rep = 0; unsigned int nr_rep = 0;
...@@ -1122,67 +1208,15 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start) ...@@ -1122,67 +1208,15 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
} }
EXPORT_SYMBOL_GPL(dm_remap_zone_report); EXPORT_SYMBOL_GPL(dm_remap_zone_report);
/* static blk_qc_t __map_bio(struct dm_target_io *tio)
* Flush current->bio_list when the target map method blocks.
* This fixes deadlocks in snapshot and possibly in other targets.
*/
struct dm_offload {
struct blk_plug plug;
struct blk_plug_cb cb;
};
static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
{
struct dm_offload *o = container_of(cb, struct dm_offload, cb);
struct bio_list list;
struct bio *bio;
int i;
INIT_LIST_HEAD(&o->cb.list);
if (unlikely(!current->bio_list))
return;
for (i = 0; i < 2; i++) {
list = current->bio_list[i];
bio_list_init(&current->bio_list[i]);
while ((bio = bio_list_pop(&list))) {
struct bio_set *bs = bio->bi_pool;
if (unlikely(!bs) || bs == fs_bio_set ||
!bs->rescue_workqueue) {
bio_list_add(&current->bio_list[i], bio);
continue;
}
spin_lock(&bs->rescue_lock);
bio_list_add(&bs->rescue_list, bio);
queue_work(bs->rescue_workqueue, &bs->rescue_work);
spin_unlock(&bs->rescue_lock);
}
}
}
static void dm_offload_start(struct dm_offload *o)
{
blk_start_plug(&o->plug);
o->cb.callback = flush_current_bio_list;
list_add(&o->cb.list, &current->plug->cb_list);
}
static void dm_offload_end(struct dm_offload *o)
{
list_del(&o->cb.list);
blk_finish_plug(&o->plug);
}
static void __map_bio(struct dm_target_io *tio)
{ {
int r; int r;
sector_t sector; sector_t sector;
struct dm_offload o;
struct bio *clone = &tio->clone; struct bio *clone = &tio->clone;
struct dm_io *io = tio->io;
struct mapped_device *md = io->md;
struct dm_target *ti = tio->ti; struct dm_target *ti = tio->ti;
blk_qc_t ret = BLK_QC_T_NONE;
clone->bi_end_io = clone_endio; clone->bi_end_io = clone_endio;
...@@ -1191,44 +1225,37 @@ static void __map_bio(struct dm_target_io *tio) ...@@ -1191,44 +1225,37 @@ static void __map_bio(struct dm_target_io *tio)
* anything, the target has assumed ownership of * anything, the target has assumed ownership of
* this io. * this io.
*/ */
atomic_inc(&tio->io->io_count); atomic_inc(&io->io_count);
sector = clone->bi_iter.bi_sector; sector = clone->bi_iter.bi_sector;
dm_offload_start(&o);
r = ti->type->map(ti, clone); r = ti->type->map(ti, clone);
dm_offload_end(&o);
switch (r) { switch (r) {
case DM_MAPIO_SUBMITTED: case DM_MAPIO_SUBMITTED:
break; break;
case DM_MAPIO_REMAPPED: case DM_MAPIO_REMAPPED:
/* the bio has been remapped so dispatch it */ /* the bio has been remapped so dispatch it */
trace_block_bio_remap(clone->bi_disk->queue, clone, trace_block_bio_remap(clone->bi_disk->queue, clone,
bio_dev(tio->io->bio), sector); bio_dev(io->orig_bio), sector);
generic_make_request(clone); if (md->type == DM_TYPE_NVME_BIO_BASED)
ret = direct_make_request(clone);
else
ret = generic_make_request(clone);
break; break;
case DM_MAPIO_KILL: case DM_MAPIO_KILL:
dec_pending(tio->io, BLK_STS_IOERR);
free_tio(tio); free_tio(tio);
dec_pending(io, BLK_STS_IOERR);
break; break;
case DM_MAPIO_REQUEUE: case DM_MAPIO_REQUEUE:
dec_pending(tio->io, BLK_STS_DM_REQUEUE);
free_tio(tio); free_tio(tio);
dec_pending(io, BLK_STS_DM_REQUEUE);
break; break;
default: default:
DMWARN("unimplemented target map return value: %d", r); DMWARN("unimplemented target map return value: %d", r);
BUG(); BUG();
} }
}
struct clone_info { return ret;
struct mapped_device *md; }
struct dm_table *map;
struct bio *bio;
struct dm_io *io;
sector_t sector;
unsigned sector_count;
};
static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len) static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
{ {
...@@ -1272,28 +1299,49 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio, ...@@ -1272,28 +1299,49 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
return 0; return 0;
} }
static struct dm_target_io *alloc_tio(struct clone_info *ci, static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
struct dm_target *ti, struct dm_target *ti, unsigned num_bios)
unsigned target_bio_nr)
{ {
struct dm_target_io *tio; struct dm_target_io *tio;
struct bio *clone; int try;
clone = bio_alloc_bioset(GFP_NOIO, 0, ci->md->bs); if (!num_bios)
tio = container_of(clone, struct dm_target_io, clone); return;
tio->io = ci->io; if (num_bios == 1) {
tio->ti = ti; tio = alloc_tio(ci, ti, 0, GFP_NOIO);
tio->target_bio_nr = target_bio_nr; bio_list_add(blist, &tio->clone);
return;
}
return tio; for (try = 0; try < 2; try++) {
int bio_nr;
struct bio *bio;
if (try)
mutex_lock(&ci->io->md->table_devices_lock);
for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
tio = alloc_tio(ci, ti, bio_nr, try ? GFP_NOIO : GFP_NOWAIT);
if (!tio)
break;
bio_list_add(blist, &tio->clone);
}
if (try)
mutex_unlock(&ci->io->md->table_devices_lock);
if (bio_nr == num_bios)
return;
while ((bio = bio_list_pop(blist))) {
tio = container_of(bio, struct dm_target_io, clone);
free_tio(tio);
}
}
} }
static void __clone_and_map_simple_bio(struct clone_info *ci, static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
struct dm_target *ti, struct dm_target_io *tio, unsigned *len)
unsigned target_bio_nr, unsigned *len)
{ {
struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr);
struct bio *clone = &tio->clone; struct bio *clone = &tio->clone;
tio->len_ptr = len; tio->len_ptr = len;
...@@ -1302,16 +1350,22 @@ static void __clone_and_map_simple_bio(struct clone_info *ci, ...@@ -1302,16 +1350,22 @@ static void __clone_and_map_simple_bio(struct clone_info *ci,
if (len) if (len)
bio_setup_sector(clone, ci->sector, *len); bio_setup_sector(clone, ci->sector, *len);
__map_bio(tio); return __map_bio(tio);
} }
static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti, static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
unsigned num_bios, unsigned *len) unsigned num_bios, unsigned *len)
{ {
unsigned target_bio_nr; struct bio_list blist = BIO_EMPTY_LIST;
struct bio *bio;
struct dm_target_io *tio;
alloc_multiple_bios(&blist, ci, ti, num_bios);
for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++) while ((bio = bio_list_pop(&blist))) {
__clone_and_map_simple_bio(ci, ti, target_bio_nr, len); tio = container_of(bio, struct dm_target_io, clone);
(void) __clone_and_map_simple_bio(ci, tio, len);
}
} }
static int __send_empty_flush(struct clone_info *ci) static int __send_empty_flush(struct clone_info *ci)
...@@ -1331,28 +1385,18 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, ...@@ -1331,28 +1385,18 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
{ {
struct bio *bio = ci->bio; struct bio *bio = ci->bio;
struct dm_target_io *tio; struct dm_target_io *tio;
unsigned target_bio_nr; int r;
unsigned num_target_bios = 1;
int r = 0;
/*
* Does the target want to receive duplicate copies of the bio?
*/
if (bio_data_dir(bio) == WRITE && ti->num_write_bios)
num_target_bios = ti->num_write_bios(ti, bio);
for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) { tio = alloc_tio(ci, ti, 0, GFP_NOIO);
tio = alloc_tio(ci, ti, target_bio_nr);
tio->len_ptr = len; tio->len_ptr = len;
r = clone_bio(tio, bio, sector, *len); r = clone_bio(tio, bio, sector, *len);
if (r < 0) { if (r < 0) {
free_tio(tio); free_tio(tio);
break; return r;
}
__map_bio(tio);
} }
(void) __map_bio(tio);
return r; return 0;
} }
typedef unsigned (*get_num_bios_fn)(struct dm_target *ti); typedef unsigned (*get_num_bios_fn)(struct dm_target *ti);
...@@ -1379,19 +1423,13 @@ static bool is_split_required_for_discard(struct dm_target *ti) ...@@ -1379,19 +1423,13 @@ static bool is_split_required_for_discard(struct dm_target *ti)
return ti->split_discard_bios; return ti->split_discard_bios;
} }
static int __send_changing_extent_only(struct clone_info *ci, static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
get_num_bios_fn get_num_bios, get_num_bios_fn get_num_bios,
is_split_required_fn is_split_required) is_split_required_fn is_split_required)
{ {
struct dm_target *ti;
unsigned len; unsigned len;
unsigned num_bios; unsigned num_bios;
do {
ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti))
return -EIO;
/* /*
* Even though the device advertised support for this type of * Even though the device advertised support for this type of
* request, that does not mean every target supports it, and * request, that does not mean every target supports it, and
...@@ -1410,25 +1448,25 @@ static int __send_changing_extent_only(struct clone_info *ci, ...@@ -1410,25 +1448,25 @@ static int __send_changing_extent_only(struct clone_info *ci,
__send_duplicate_bios(ci, ti, num_bios, &len); __send_duplicate_bios(ci, ti, num_bios, &len);
ci->sector += len; ci->sector += len;
} while (ci->sector_count -= len); ci->sector_count -= len;
return 0; return 0;
} }
static int __send_discard(struct clone_info *ci) static int __send_discard(struct clone_info *ci, struct dm_target *ti)
{ {
return __send_changing_extent_only(ci, get_num_discard_bios, return __send_changing_extent_only(ci, ti, get_num_discard_bios,
is_split_required_for_discard); is_split_required_for_discard);
} }
static int __send_write_same(struct clone_info *ci) static int __send_write_same(struct clone_info *ci, struct dm_target *ti)
{ {
return __send_changing_extent_only(ci, get_num_write_same_bios, NULL); return __send_changing_extent_only(ci, ti, get_num_write_same_bios, NULL);
} }
static int __send_write_zeroes(struct clone_info *ci) static int __send_write_zeroes(struct clone_info *ci, struct dm_target *ti)
{ {
return __send_changing_extent_only(ci, get_num_write_zeroes_bios, NULL); return __send_changing_extent_only(ci, ti, get_num_write_zeroes_bios, NULL);
} }
/* /*
...@@ -1441,17 +1479,17 @@ static int __split_and_process_non_flush(struct clone_info *ci) ...@@ -1441,17 +1479,17 @@ static int __split_and_process_non_flush(struct clone_info *ci)
unsigned len; unsigned len;
int r; int r;
if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
return __send_discard(ci);
else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
return __send_write_same(ci);
else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES))
return __send_write_zeroes(ci);
ti = dm_table_find_target(ci->map, ci->sector); ti = dm_table_find_target(ci->map, ci->sector);
if (!dm_target_is_valid(ti)) if (!dm_target_is_valid(ti))
return -EIO; return -EIO;
if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
return __send_discard(ci, ti);
else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
return __send_write_same(ci, ti);
else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES))
return __send_write_zeroes(ci, ti);
if (bio_op(bio) == REQ_OP_ZONE_REPORT) if (bio_op(bio) == REQ_OP_ZONE_REPORT)
len = ci->sector_count; len = ci->sector_count;
else else
...@@ -1468,34 +1506,33 @@ static int __split_and_process_non_flush(struct clone_info *ci) ...@@ -1468,34 +1506,33 @@ static int __split_and_process_non_flush(struct clone_info *ci)
return 0; return 0;
} }
static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
struct dm_table *map, struct bio *bio)
{
ci->map = map;
ci->io = alloc_io(md, bio);
ci->sector = bio->bi_iter.bi_sector;
}
/* /*
* Entry point to split a bio into clones and submit them to the targets. * Entry point to split a bio into clones and submit them to the targets.
*/ */
static void __split_and_process_bio(struct mapped_device *md, static blk_qc_t __split_and_process_bio(struct mapped_device *md,
struct dm_table *map, struct bio *bio) struct dm_table *map, struct bio *bio)
{ {
struct clone_info ci; struct clone_info ci;
blk_qc_t ret = BLK_QC_T_NONE;
int error = 0; int error = 0;
if (unlikely(!map)) { if (unlikely(!map)) {
bio_io_error(bio); bio_io_error(bio);
return; return ret;
} }
ci.map = map; init_clone_info(&ci, md, map, bio);
ci.md = md;
ci.io = alloc_io(md);
ci.io->status = 0;
atomic_set(&ci.io->io_count, 1);
ci.io->bio = bio;
ci.io->md = md;
spin_lock_init(&ci.io->endio_lock);
ci.sector = bio->bi_iter.bi_sector;
start_io_acct(ci.io);
if (bio->bi_opf & REQ_PREFLUSH) { if (bio->bi_opf & REQ_PREFLUSH) {
ci.bio = &ci.md->flush_bio; ci.bio = &ci.io->md->flush_bio;
ci.sector_count = 0; ci.sector_count = 0;
error = __send_empty_flush(&ci); error = __send_empty_flush(&ci);
/* dec_pending submits any data associated with flush */ /* dec_pending submits any data associated with flush */
...@@ -1506,32 +1543,95 @@ static void __split_and_process_bio(struct mapped_device *md, ...@@ -1506,32 +1543,95 @@ static void __split_and_process_bio(struct mapped_device *md,
} else { } else {
ci.bio = bio; ci.bio = bio;
ci.sector_count = bio_sectors(bio); ci.sector_count = bio_sectors(bio);
while (ci.sector_count && !error) while (ci.sector_count && !error) {
error = __split_and_process_non_flush(&ci); error = __split_and_process_non_flush(&ci);
if (current->bio_list && ci.sector_count && !error) {
/*
* Remainder must be passed to generic_make_request()
* so that it gets handled *after* bios already submitted
* have been completely processed.
* We take a clone of the original to store in
* ci.io->orig_bio to be used by end_io_acct() and
* for dec_pending to use for completion handling.
* As this path is not used for REQ_OP_ZONE_REPORT,
* the usage of io->orig_bio in dm_remap_zone_report()
* won't be affected by this reassignment.
*/
struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
md->queue->bio_split);
ci.io->orig_bio = b;
bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
bio_chain(b, bio);
ret = generic_make_request(bio);
break;
}
}
} }
/* drop the extra reference count */ /* drop the extra reference count */
dec_pending(ci.io, errno_to_blk_status(error)); dec_pending(ci.io, errno_to_blk_status(error));
return ret;
} }
/*-----------------------------------------------------------------
* CRUD END
*---------------------------------------------------------------*/
/* /*
* The request function that just remaps the bio built up by * Optimized variant of __split_and_process_bio that leverages the
* dm_merge_bvec. * fact that targets that use it do _not_ have a need to split bios.
*/ */
static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) static blk_qc_t __process_bio(struct mapped_device *md,
struct dm_table *map, struct bio *bio)
{
struct clone_info ci;
blk_qc_t ret = BLK_QC_T_NONE;
int error = 0;
if (unlikely(!map)) {
bio_io_error(bio);
return ret;
}
init_clone_info(&ci, md, map, bio);
if (bio->bi_opf & REQ_PREFLUSH) {
ci.bio = &ci.io->md->flush_bio;
ci.sector_count = 0;
error = __send_empty_flush(&ci);
/* dec_pending submits any data associated with flush */
} else {
struct dm_target *ti = md->immutable_target;
struct dm_target_io *tio;
/*
* Defend against IO still getting in during teardown
* - as was seen for a time with nvme-fcloop
*/
if (unlikely(WARN_ON_ONCE(!ti || !dm_target_is_valid(ti)))) {
error = -EIO;
goto out;
}
tio = alloc_tio(&ci, ti, 0, GFP_NOIO);
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
ret = __clone_and_map_simple_bio(&ci, tio, NULL);
}
out:
/* drop the extra reference count */
dec_pending(ci.io, errno_to_blk_status(error));
return ret;
}
typedef blk_qc_t (process_bio_fn)(struct mapped_device *, struct dm_table *, struct bio *);
static blk_qc_t __dm_make_request(struct request_queue *q, struct bio *bio,
process_bio_fn process_bio)
{ {
int rw = bio_data_dir(bio);
struct mapped_device *md = q->queuedata; struct mapped_device *md = q->queuedata;
blk_qc_t ret = BLK_QC_T_NONE;
int srcu_idx; int srcu_idx;
struct dm_table *map; struct dm_table *map;
map = dm_get_live_table(md, &srcu_idx); map = dm_get_live_table(md, &srcu_idx);
generic_start_io_acct(q, rw, bio_sectors(bio), &dm_disk(md)->part0);
/* if we're suspended, we have to queue this io for later */ /* if we're suspended, we have to queue this io for later */
if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
dm_put_live_table(md, srcu_idx); dm_put_live_table(md, srcu_idx);
...@@ -1540,12 +1640,27 @@ static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) ...@@ -1540,12 +1640,27 @@ static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
queue_io(md, bio); queue_io(md, bio);
else else
bio_io_error(bio); bio_io_error(bio);
return BLK_QC_T_NONE; return ret;
} }
__split_and_process_bio(md, map, bio); ret = process_bio(md, map, bio);
dm_put_live_table(md, srcu_idx); dm_put_live_table(md, srcu_idx);
return BLK_QC_T_NONE; return ret;
}
/*
* The request function that remaps the bio to one target and
* splits off any remainder.
*/
static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
{
return __dm_make_request(q, bio, __split_and_process_bio);
}
static blk_qc_t dm_make_request_nvme(struct request_queue *q, struct bio *bio)
{
return __dm_make_request(q, bio, __process_bio);
} }
static int dm_any_congested(void *congested_data, int bdi_bits) static int dm_any_congested(void *congested_data, int bdi_bits)
...@@ -1626,20 +1741,9 @@ static const struct dax_operations dm_dax_ops; ...@@ -1626,20 +1741,9 @@ static const struct dax_operations dm_dax_ops;
static void dm_wq_work(struct work_struct *work); static void dm_wq_work(struct work_struct *work);
void dm_init_md_queue(struct mapped_device *md) static void dm_init_normal_md_queue(struct mapped_device *md)
{
/*
* Initialize data that will only be used by a non-blk-mq DM queue
* - must do so here (in alloc_dev callchain) before queue is used
*/
md->queue->queuedata = md;
md->queue->backing_dev_info->congested_data = md;
}
void dm_init_normal_md_queue(struct mapped_device *md)
{ {
md->use_blk_mq = false; md->use_blk_mq = false;
dm_init_md_queue(md);
/* /*
* Initialize aspects of queue that aren't relevant for blk-mq * Initialize aspects of queue that aren't relevant for blk-mq
...@@ -1653,9 +1757,10 @@ static void cleanup_mapped_device(struct mapped_device *md) ...@@ -1653,9 +1757,10 @@ static void cleanup_mapped_device(struct mapped_device *md)
destroy_workqueue(md->wq); destroy_workqueue(md->wq);
if (md->kworker_task) if (md->kworker_task)
kthread_stop(md->kworker_task); kthread_stop(md->kworker_task);
mempool_destroy(md->io_pool);
if (md->bs) if (md->bs)
bioset_free(md->bs); bioset_free(md->bs);
if (md->io_bs)
bioset_free(md->io_bs);
if (md->dax_dev) { if (md->dax_dev) {
kill_dax(md->dax_dev); kill_dax(md->dax_dev);
...@@ -1681,6 +1786,10 @@ static void cleanup_mapped_device(struct mapped_device *md) ...@@ -1681,6 +1786,10 @@ static void cleanup_mapped_device(struct mapped_device *md)
md->bdev = NULL; md->bdev = NULL;
} }
mutex_destroy(&md->suspend_lock);
mutex_destroy(&md->type_lock);
mutex_destroy(&md->table_devices_lock);
dm_mq_cleanup_mapped_device(md); dm_mq_cleanup_mapped_device(md);
} }
...@@ -1734,10 +1843,10 @@ static struct mapped_device *alloc_dev(int minor) ...@@ -1734,10 +1843,10 @@ static struct mapped_device *alloc_dev(int minor)
md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id); md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
if (!md->queue) if (!md->queue)
goto bad; goto bad;
md->queue->queuedata = md;
md->queue->backing_dev_info->congested_data = md;
dm_init_md_queue(md); md->disk = alloc_disk_node(1, md->numa_node_id);
md->disk = alloc_disk_node(1, numa_node_id);
if (!md->disk) if (!md->disk)
goto bad; goto bad;
...@@ -1820,17 +1929,22 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t) ...@@ -1820,17 +1929,22 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
{ {
struct dm_md_mempools *p = dm_table_get_md_mempools(t); struct dm_md_mempools *p = dm_table_get_md_mempools(t);
if (md->bs) {
/* The md already has necessary mempools. */
if (dm_table_bio_based(t)) { if (dm_table_bio_based(t)) {
/* /*
* Reload bioset because front_pad may have changed * The md may already have mempools that need changing.
* If so, reload bioset because front_pad may have changed
* because a different table was loaded. * because a different table was loaded.
*/ */
if (md->bs) {
bioset_free(md->bs); bioset_free(md->bs);
md->bs = p->bs; md->bs = NULL;
p->bs = NULL;
} }
if (md->io_bs) {
bioset_free(md->io_bs);
md->io_bs = NULL;
}
} else if (md->bs) {
/* /*
* There's no need to reload with request-based dm * There's no need to reload with request-based dm
* because the size of front_pad doesn't change. * because the size of front_pad doesn't change.
...@@ -1842,13 +1956,12 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t) ...@@ -1842,13 +1956,12 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
goto out; goto out;
} }
BUG_ON(!p || md->io_pool || md->bs); BUG_ON(!p || md->bs || md->io_bs);
md->io_pool = p->io_pool;
p->io_pool = NULL;
md->bs = p->bs; md->bs = p->bs;
p->bs = NULL; p->bs = NULL;
md->io_bs = p->io_bs;
p->io_bs = NULL;
out: out:
/* mempool bind completed, no longer need any mempools in the table */ /* mempool bind completed, no longer need any mempools in the table */
dm_table_free_md_mempools(t); dm_table_free_md_mempools(t);
...@@ -1894,6 +2007,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t, ...@@ -1894,6 +2007,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
{ {
struct dm_table *old_map; struct dm_table *old_map;
struct request_queue *q = md->queue; struct request_queue *q = md->queue;
bool request_based = dm_table_request_based(t);
sector_t size; sector_t size;
lockdep_assert_held(&md->suspend_lock); lockdep_assert_held(&md->suspend_lock);
...@@ -1917,12 +2031,15 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t, ...@@ -1917,12 +2031,15 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
* This must be done before setting the queue restrictions, * This must be done before setting the queue restrictions,
* because request-based dm may be run just after the setting. * because request-based dm may be run just after the setting.
*/ */
if (dm_table_request_based(t)) { if (request_based)
dm_stop_queue(q); dm_stop_queue(q);
if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) {
/* /*
* Leverage the fact that request-based DM targets are * Leverage the fact that request-based DM targets and
* immutable singletons and establish md->immutable_target * NVMe bio based targets are immutable singletons
* - used to optimize both dm_request_fn and dm_mq_queue_rq * - used to optimize both dm_request_fn and dm_mq_queue_rq;
* and __process_bio.
*/ */
md->immutable_target = dm_table_get_immutable_target(t); md->immutable_target = dm_table_get_immutable_target(t);
} }
...@@ -1962,13 +2079,18 @@ static struct dm_table *__unbind(struct mapped_device *md) ...@@ -1962,13 +2079,18 @@ static struct dm_table *__unbind(struct mapped_device *md)
*/ */
int dm_create(int minor, struct mapped_device **result) int dm_create(int minor, struct mapped_device **result)
{ {
int r;
struct mapped_device *md; struct mapped_device *md;
md = alloc_dev(minor); md = alloc_dev(minor);
if (!md) if (!md)
return -ENXIO; return -ENXIO;
dm_sysfs_init(md); r = dm_sysfs_init(md);
if (r) {
free_dev(md);
return r;
}
*result = md; *result = md;
return 0; return 0;
...@@ -2026,6 +2148,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t) ...@@ -2026,6 +2148,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
switch (type) { switch (type) {
case DM_TYPE_REQUEST_BASED: case DM_TYPE_REQUEST_BASED:
dm_init_normal_md_queue(md);
r = dm_old_init_request_queue(md, t); r = dm_old_init_request_queue(md, t);
if (r) { if (r) {
DMERR("Cannot initialize queue for request-based mapped device"); DMERR("Cannot initialize queue for request-based mapped device");
...@@ -2043,15 +2166,10 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t) ...@@ -2043,15 +2166,10 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
case DM_TYPE_DAX_BIO_BASED: case DM_TYPE_DAX_BIO_BASED:
dm_init_normal_md_queue(md); dm_init_normal_md_queue(md);
blk_queue_make_request(md->queue, dm_make_request); blk_queue_make_request(md->queue, dm_make_request);
/* break;
* DM handles splitting bios as needed. Free the bio_split bioset case DM_TYPE_NVME_BIO_BASED:
* since it won't be used (saves 1 process per bio-based DM device). dm_init_normal_md_queue(md);
*/ blk_queue_make_request(md->queue, dm_make_request_nvme);
bioset_free(md->queue->bio_split);
md->queue->bio_split = NULL;
if (type == DM_TYPE_DAX_BIO_BASED)
queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
break; break;
case DM_TYPE_NONE: case DM_TYPE_NONE:
WARN_ON_ONCE(true); WARN_ON_ONCE(true);
...@@ -2130,7 +2248,6 @@ EXPORT_SYMBOL_GPL(dm_device_name); ...@@ -2130,7 +2248,6 @@ EXPORT_SYMBOL_GPL(dm_device_name);
static void __dm_destroy(struct mapped_device *md, bool wait) static void __dm_destroy(struct mapped_device *md, bool wait)
{ {
struct request_queue *q = dm_get_md_queue(md);
struct dm_table *map; struct dm_table *map;
int srcu_idx; int srcu_idx;
...@@ -2141,7 +2258,7 @@ static void __dm_destroy(struct mapped_device *md, bool wait) ...@@ -2141,7 +2258,7 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
set_bit(DMF_FREEING, &md->flags); set_bit(DMF_FREEING, &md->flags);
spin_unlock(&_minor_lock); spin_unlock(&_minor_lock);
blk_set_queue_dying(q); blk_set_queue_dying(md->queue);
if (dm_request_based(md) && md->kworker_task) if (dm_request_based(md) && md->kworker_task)
kthread_flush_worker(&md->kworker); kthread_flush_worker(&md->kworker);
...@@ -2752,11 +2869,12 @@ int dm_noflush_suspending(struct dm_target *ti) ...@@ -2752,11 +2869,12 @@ int dm_noflush_suspending(struct dm_target *ti)
EXPORT_SYMBOL_GPL(dm_noflush_suspending); EXPORT_SYMBOL_GPL(dm_noflush_suspending);
struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
unsigned integrity, unsigned per_io_data_size) unsigned integrity, unsigned per_io_data_size,
unsigned min_pool_size)
{ {
struct dm_md_mempools *pools = kzalloc_node(sizeof(*pools), GFP_KERNEL, md->numa_node_id); struct dm_md_mempools *pools = kzalloc_node(sizeof(*pools), GFP_KERNEL, md->numa_node_id);
unsigned int pool_size = 0; unsigned int pool_size = 0;
unsigned int front_pad; unsigned int front_pad, io_front_pad;
if (!pools) if (!pools)
return NULL; return NULL;
...@@ -2764,16 +2882,19 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu ...@@ -2764,16 +2882,19 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
switch (type) { switch (type) {
case DM_TYPE_BIO_BASED: case DM_TYPE_BIO_BASED:
case DM_TYPE_DAX_BIO_BASED: case DM_TYPE_DAX_BIO_BASED:
pool_size = dm_get_reserved_bio_based_ios(); case DM_TYPE_NVME_BIO_BASED:
pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size);
front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone); front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
io_front_pad = roundup(front_pad, __alignof__(struct dm_io)) + offsetof(struct dm_io, tio);
pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache); pools->io_bs = bioset_create(pool_size, io_front_pad, 0);
if (!pools->io_pool) if (!pools->io_bs)
goto out;
if (integrity && bioset_integrity_create(pools->io_bs, pool_size))
goto out; goto out;
break; break;
case DM_TYPE_REQUEST_BASED: case DM_TYPE_REQUEST_BASED:
case DM_TYPE_MQ_REQUEST_BASED: case DM_TYPE_MQ_REQUEST_BASED:
pool_size = dm_get_reserved_rq_based_ios(); pool_size = max(dm_get_reserved_rq_based_ios(), min_pool_size);
front_pad = offsetof(struct dm_rq_clone_bio_info, clone); front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
/* per_io_data_size is used for blk-mq pdu at queue allocation */ /* per_io_data_size is used for blk-mq pdu at queue allocation */
break; break;
...@@ -2781,7 +2902,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu ...@@ -2781,7 +2902,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
BUG(); BUG();
} }
pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER); pools->bs = bioset_create(pool_size, front_pad, 0);
if (!pools->bs) if (!pools->bs)
goto out; goto out;
...@@ -2801,10 +2922,10 @@ void dm_free_md_mempools(struct dm_md_mempools *pools) ...@@ -2801,10 +2922,10 @@ void dm_free_md_mempools(struct dm_md_mempools *pools)
if (!pools) if (!pools)
return; return;
mempool_destroy(pools->io_pool);
if (pools->bs) if (pools->bs)
bioset_free(pools->bs); bioset_free(pools->bs);
if (pools->io_bs)
bioset_free(pools->io_bs);
kfree(pools); kfree(pools);
} }
......
...@@ -49,7 +49,6 @@ struct dm_md_mempools; ...@@ -49,7 +49,6 @@ struct dm_md_mempools;
/*----------------------------------------------------------------- /*-----------------------------------------------------------------
* Internal table functions. * Internal table functions.
*---------------------------------------------------------------*/ *---------------------------------------------------------------*/
void dm_table_destroy(struct dm_table *t);
void dm_table_event_callback(struct dm_table *t, void dm_table_event_callback(struct dm_table *t,
void (*fn)(void *), void *context); void (*fn)(void *), void *context);
struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index); struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);
...@@ -206,7 +205,8 @@ void dm_kcopyd_exit(void); ...@@ -206,7 +205,8 @@ void dm_kcopyd_exit(void);
* Mempool operations * Mempool operations
*/ */
struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
unsigned integrity, unsigned per_bio_data_size); unsigned integrity, unsigned per_bio_data_size,
unsigned min_pool_size);
void dm_free_md_mempools(struct dm_md_mempools *pools); void dm_free_md_mempools(struct dm_md_mempools *pools);
/* /*
......
...@@ -28,6 +28,7 @@ enum dm_queue_mode { ...@@ -28,6 +28,7 @@ enum dm_queue_mode {
DM_TYPE_REQUEST_BASED = 2, DM_TYPE_REQUEST_BASED = 2,
DM_TYPE_MQ_REQUEST_BASED = 3, DM_TYPE_MQ_REQUEST_BASED = 3,
DM_TYPE_DAX_BIO_BASED = 4, DM_TYPE_DAX_BIO_BASED = 4,
DM_TYPE_NVME_BIO_BASED = 5,
}; };
typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t; typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t;
...@@ -220,14 +221,6 @@ struct target_type { ...@@ -220,14 +221,6 @@ struct target_type {
#define DM_TARGET_WILDCARD 0x00000008 #define DM_TARGET_WILDCARD 0x00000008
#define dm_target_is_wildcard(type) ((type)->features & DM_TARGET_WILDCARD) #define dm_target_is_wildcard(type) ((type)->features & DM_TARGET_WILDCARD)
/*
* Some targets need to be sent the same WRITE bio severals times so
* that they can send copies of it to different devices. This function
* examines any supplied bio and returns the number of copies of it the
* target requires.
*/
typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
/* /*
* A target implements own bio data integrity. * A target implements own bio data integrity.
*/ */
...@@ -291,13 +284,6 @@ struct dm_target { ...@@ -291,13 +284,6 @@ struct dm_target {
*/ */
unsigned per_io_data_size; unsigned per_io_data_size;
/*
* If defined, this function is called to find out how many
* duplicate bios should be sent to the target when writing
* data.
*/
dm_num_write_bios_fn num_write_bios;
/* target specific data */ /* target specific data */
void *private; void *private;
...@@ -329,35 +315,9 @@ struct dm_target_callbacks { ...@@ -329,35 +315,9 @@ struct dm_target_callbacks {
int (*congested_fn) (struct dm_target_callbacks *, int); int (*congested_fn) (struct dm_target_callbacks *, int);
}; };
/* void *dm_per_bio_data(struct bio *bio, size_t data_size);
* For bio-based dm. struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size);
* One of these is allocated for each bio. unsigned dm_bio_get_target_bio_nr(const struct bio *bio);
* This structure shouldn't be touched directly by target drivers.
* It is here so that we can inline dm_per_bio_data and
* dm_bio_from_per_bio_data
*/
struct dm_target_io {
struct dm_io *io;
struct dm_target *ti;
unsigned target_bio_nr;
unsigned *len_ptr;
struct bio clone;
};
static inline void *dm_per_bio_data(struct bio *bio, size_t data_size)
{
return (char *)bio - offsetof(struct dm_target_io, clone) - data_size;
}
static inline struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size)
{
return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone));
}
static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio)
{
return container_of(bio, struct dm_target_io, clone)->target_bio_nr;
}
int dm_register_target(struct target_type *t); int dm_register_target(struct target_type *t);
void dm_unregister_target(struct target_type *t); void dm_unregister_target(struct target_type *t);
...@@ -499,6 +459,11 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type); ...@@ -499,6 +459,11 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type);
*/ */
int dm_table_complete(struct dm_table *t); int dm_table_complete(struct dm_table *t);
/*
* Destroy the table when finished.
*/
void dm_table_destroy(struct dm_table *t);
/* /*
* Target may require that it is never sent I/O larger than len. * Target may require that it is never sent I/O larger than len.
*/ */
...@@ -585,6 +550,7 @@ do { \ ...@@ -585,6 +550,7 @@ do { \
#define DM_ENDIO_DONE 0 #define DM_ENDIO_DONE 0
#define DM_ENDIO_INCOMPLETE 1 #define DM_ENDIO_INCOMPLETE 1
#define DM_ENDIO_REQUEUE 2 #define DM_ENDIO_REQUEUE 2
#define DM_ENDIO_DELAY_REQUEUE 3
/* /*
* Definitions of return values from target map function. * Definitions of return values from target map function.
...@@ -592,7 +558,7 @@ do { \ ...@@ -592,7 +558,7 @@ do { \
#define DM_MAPIO_SUBMITTED 0 #define DM_MAPIO_SUBMITTED 0
#define DM_MAPIO_REMAPPED 1 #define DM_MAPIO_REMAPPED 1
#define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE #define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE
#define DM_MAPIO_DELAY_REQUEUE 3 #define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE
#define DM_MAPIO_KILL 4 #define DM_MAPIO_KILL 4
#define dm_sector_div64(x, y)( \ #define dm_sector_div64(x, y)( \
......