Commit 9903883f authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper changes from Alasdair G Kergon:
 "Add a device-mapper target called dm-switch to provide a multipath
  framework for storage arrays that dynamically reconfigure their
  preferred paths for different device regions.

  Fix a bug in the verity target that prevented its use with some
  specific sizes of devices.

  Improve some locking mechanisms in the device-mapper core and bufio.

  Add Mike Snitzer as a device-mapper maintainer.

  A few more clean-ups and fixes"

* tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: add switch target
  dm: update maintainers
  dm: optimize reorder structure
  dm: optimize use SRCU and RCU
  dm bufio: submit writes outside lock
  dm cache: fix arm link errors with inline
  dm verity: use __ffs and __fls
  dm flakey: correct ctr alloc failure mesg
  dm verity: remove pointless comparison
  dm: use __GFP_HIGHMEM in __vmalloc
  dm verity: fix inability to use a few specific devices sizes
  dm ioctl: set noio flag to avoid __vmalloc deadlock
  dm mpath: fix ioctl deadlock when no paths
parents 36805aae 9d0eb0ab
dm-switch
=========
The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.
It maps I/O to underlying block devices efficiently when there is a large
number of fixed-sized address regions but there is no simple pattern
that would allow for a compact representation of the mapping such as
dm-stripe.
Background
----------
Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members") each having
independent controllers, disk storage and network adapters. When a LUN
is created it is spread across multiple members. The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used. When iSCSI sessions are created, each
session is connected to an eth port on a single member. Data to a LUN
can be sent on any iSCSI session, and if the blocks being accessed are
stored on another member the I/O will be forwarded as required. This
forwarding is invisible to the initiator. The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.
This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth. An initiator could use a simple round
robin algorithm to send I/O across all paths and let the storage array
members forward it as necessary, but there is a performance advantage to
sending data directly to the correct member.
A device-mapper table already lets you map different regions of a
device onto different targets. However in this architecture the LUN is
spread with an address region size on the order of 10s of MBs, which
means the resulting table could have more than a million entries and
consume far too much memory.
Using this device-mapper switch target we can now build a two-layer
device hierarchy:
Upper Tier – Determine which array member the I/O should be sent to.
Lower Tier – Load balance amongst paths to a particular member.
The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover reasons.
The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O. By using a bitmap we are able to
use 4 bits for each address range in a 16 member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.
Construction Parameters
=======================
<num_paths> <region_size> <num_optional_args> [<optional_args>...]
[<dev_path> <offset>]+
<num_paths>
The number of paths across which to distribute the I/O.
<region_size>
The number of 512-byte sectors in a region. Each region can be redirected
to any of the available paths.
<num_optional_args>
The number of optional arguments. Currently, no optional arguments
are supported and so this must be zero.
<dev_path>
The block device that represents a specific path to the device.
<offset>
The offset of the start of data on the specific <dev_path> (in units
of 512-byte sectors). This number is added to the sector number when
forwarding the request to the specific path. Typically it is zero.
Messages
========
set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...
Modify the region table by specifying which regions are redirected to
which paths.
<index>
The region number (region size was specified in constructor parameters).
If index is omitted, the next region (previous index + 1) is used.
Expressed in hexadecimal (WITHOUT any prefix like 0x).
<path_nr>
The path number in the range 0 ... (<num_paths> - 1).
Expressed in hexadecimal (WITHOUT any prefix like 0x).
Status
======
No status line is reported.
Example
=======
Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size.
Create a switch device with 64kB region size:
dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0`
switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"
Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
......@@ -2574,6 +2574,7 @@ S: Maintained
DEVICE-MAPPER (LVM)
M: Alasdair Kergon <agk@redhat.com>
M: Mike Snitzer <snitzer@redhat.com>
M: dm-devel@redhat.com
L: dm-devel@redhat.com
W: http://sources.redhat.com/dm
......@@ -2585,6 +2586,7 @@ F: drivers/md/dm*
F: drivers/md/persistent-data/
F: include/linux/device-mapper.h
F: include/linux/dm-*.h
F: include/uapi/linux/dm-*.h
DIOLAN U2C-12 I2C DRIVER
M: Guenter Roeck <linux@roeck-us.net>
......
......@@ -412,4 +412,18 @@ config DM_VERITY
If unsure, say N.
config DM_SWITCH
tristate "Switch target support (EXPERIMENTAL)"
depends on BLK_DEV_DM
---help---
This device-mapper target creates a device that supports an arbitrary
mapping of fixed-size regions of I/O across a fixed set of paths.
The path used for any specific region can be switched dynamically
by sending the target a message.
To compile this code as a module, choose M here: the module will
be called dm-switch.
If unsure, say N.
endif # MD
......@@ -40,6 +40,7 @@ obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
obj-$(CONFIG_DM_SWITCH) += dm-switch.o
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_PERSISTENT_DATA) += persistent-data/
obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o dm-region-hash.o
......
......@@ -145,6 +145,7 @@ struct dm_buffer {
unsigned long state;
unsigned long last_accessed;
struct dm_bufio_client *c;
struct list_head write_list;
struct bio bio;
struct bio_vec bio_vec[DM_BUFIO_INLINE_VECS];
};
......@@ -349,7 +350,7 @@ static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
if (gfp_mask & __GFP_NORETRY)
noio_flag = memalloc_noio_save();
ptr = __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
ptr = __vmalloc(c->block_size, gfp_mask | __GFP_HIGHMEM, PAGE_KERNEL);
if (gfp_mask & __GFP_NORETRY)
memalloc_noio_restore(noio_flag);
......@@ -630,7 +631,8 @@ static int do_io_schedule(void *word)
* - Submit our write and don't wait on it. We set B_WRITING indicating
* that there is a write in progress.
*/
static void __write_dirty_buffer(struct dm_buffer *b)
static void __write_dirty_buffer(struct dm_buffer *b,
struct list_head *write_list)
{
if (!test_bit(B_DIRTY, &b->state))
return;
......@@ -639,7 +641,24 @@ static void __write_dirty_buffer(struct dm_buffer *b)
wait_on_bit_lock(&b->state, B_WRITING,
do_io_schedule, TASK_UNINTERRUPTIBLE);
submit_io(b, WRITE, b->block, write_endio);
if (!write_list)
submit_io(b, WRITE, b->block, write_endio);
else
list_add_tail(&b->write_list, write_list);
}
static void __flush_write_list(struct list_head *write_list)
{
struct blk_plug plug;
blk_start_plug(&plug);
while (!list_empty(write_list)) {
struct dm_buffer *b =
list_entry(write_list->next, struct dm_buffer, write_list);
list_del(&b->write_list);
submit_io(b, WRITE, b->block, write_endio);
dm_bufio_cond_resched();
}
blk_finish_plug(&plug);
}
/*
......@@ -655,7 +674,7 @@ static void __make_buffer_clean(struct dm_buffer *b)
return;
wait_on_bit(&b->state, B_READING, do_io_schedule, TASK_UNINTERRUPTIBLE);
__write_dirty_buffer(b);
__write_dirty_buffer(b, NULL);
wait_on_bit(&b->state, B_WRITING, do_io_schedule, TASK_UNINTERRUPTIBLE);
}
......@@ -802,7 +821,8 @@ static void __free_buffer_wake(struct dm_buffer *b)
wake_up(&c->free_buffer_wait);
}
static void __write_dirty_buffers_async(struct dm_bufio_client *c, int no_wait)
static void __write_dirty_buffers_async(struct dm_bufio_client *c, int no_wait,
struct list_head *write_list)
{
struct dm_buffer *b, *tmp;
......@@ -818,7 +838,7 @@ static void __write_dirty_buffers_async(struct dm_bufio_client *c, int no_wait)
if (no_wait && test_bit(B_WRITING, &b->state))
return;
__write_dirty_buffer(b);
__write_dirty_buffer(b, write_list);
dm_bufio_cond_resched();
}
}
......@@ -853,7 +873,8 @@ static void __get_memory_limit(struct dm_bufio_client *c,
* If we are over threshold_buffers, start freeing buffers.
* If we're over "limit_buffers", block until we get under the limit.
*/
static void __check_watermark(struct dm_bufio_client *c)
static void __check_watermark(struct dm_bufio_client *c,
struct list_head *write_list)
{
unsigned long threshold_buffers, limit_buffers;
......@@ -872,7 +893,7 @@ static void __check_watermark(struct dm_bufio_client *c)
}
if (c->n_buffers[LIST_DIRTY] > threshold_buffers)
__write_dirty_buffers_async(c, 1);
__write_dirty_buffers_async(c, 1, write_list);
}
/*
......@@ -897,7 +918,8 @@ static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
*--------------------------------------------------------------*/
static struct dm_buffer *__bufio_new(struct dm_bufio_client *c, sector_t block,
enum new_flag nf, int *need_submit)
enum new_flag nf, int *need_submit,
struct list_head *write_list)
{
struct dm_buffer *b, *new_b = NULL;
......@@ -924,7 +946,7 @@ static struct dm_buffer *__bufio_new(struct dm_bufio_client *c, sector_t block,
goto found_buffer;
}
__check_watermark(c);
__check_watermark(c, write_list);
b = new_b;
b->hold_count = 1;
......@@ -992,10 +1014,14 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
int need_submit;
struct dm_buffer *b;
LIST_HEAD(write_list);
dm_bufio_lock(c);
b = __bufio_new(c, block, nf, &need_submit);
b = __bufio_new(c, block, nf, &need_submit, &write_list);
dm_bufio_unlock(c);
__flush_write_list(&write_list);
if (!b)
return b;
......@@ -1047,6 +1073,8 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
{
struct blk_plug plug;
LIST_HEAD(write_list);
BUG_ON(dm_bufio_in_request());
blk_start_plug(&plug);
......@@ -1055,7 +1083,15 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
for (; n_blocks--; block++) {
int need_submit;
struct dm_buffer *b;
b = __bufio_new(c, block, NF_PREFETCH, &need_submit);
b = __bufio_new(c, block, NF_PREFETCH, &need_submit,
&write_list);
if (unlikely(!list_empty(&write_list))) {
dm_bufio_unlock(c);
blk_finish_plug(&plug);
__flush_write_list(&write_list);
blk_start_plug(&plug);
dm_bufio_lock(c);
}
if (unlikely(b != NULL)) {
dm_bufio_unlock(c);
......@@ -1069,7 +1105,6 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
goto flush_plug;
dm_bufio_lock(c);
}
}
dm_bufio_unlock(c);
......@@ -1126,11 +1161,14 @@ EXPORT_SYMBOL_GPL(dm_bufio_mark_buffer_dirty);
void dm_bufio_write_dirty_buffers_async(struct dm_bufio_client *c)
{
LIST_HEAD(write_list);
BUG_ON(dm_bufio_in_request());
dm_bufio_lock(c);
__write_dirty_buffers_async(c, 0);
__write_dirty_buffers_async(c, 0, &write_list);
dm_bufio_unlock(c);
__flush_write_list(&write_list);
}
EXPORT_SYMBOL_GPL(dm_bufio_write_dirty_buffers_async);
......@@ -1147,8 +1185,13 @@ int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c)
unsigned long buffers_processed = 0;
struct dm_buffer *b, *tmp;
LIST_HEAD(write_list);
dm_bufio_lock(c);
__write_dirty_buffers_async(c, 0, &write_list);
dm_bufio_unlock(c);
__flush_write_list(&write_list);
dm_bufio_lock(c);
__write_dirty_buffers_async(c, 0);
again:
list_for_each_entry_safe_reverse(b, tmp, &c->lru[LIST_DIRTY], lru_list) {
......@@ -1274,7 +1317,7 @@ void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block)
BUG_ON(!b->hold_count);
BUG_ON(test_bit(B_READING, &b->state));
__write_dirty_buffer(b);
__write_dirty_buffer(b, NULL);
if (b->hold_count == 1) {
wait_on_bit(&b->state, B_WRITING,
do_io_schedule, TASK_UNINTERRUPTIBLE);
......
......@@ -425,6 +425,10 @@ static bool block_size_is_power_of_two(struct cache *cache)
return cache->sectors_per_block_shift >= 0;
}
/* gcc on ARM generates spurious references to __udivdi3 and __umoddi3 */
#if defined(CONFIG_ARM) && __GNUC__ == 4 && __GNUC_MINOR__ <= 6
__always_inline
#endif
static dm_block_t block_div(dm_block_t b, uint32_t n)
{
do_div(b, n);
......
......@@ -176,7 +176,7 @@ static int flakey_ctr(struct dm_target *ti, unsigned int argc, char **argv)
fc = kzalloc(sizeof(*fc), GFP_KERNEL);
if (!fc) {
ti->error = "Cannot allocate linear context";
ti->error = "Cannot allocate context";
return -ENOMEM;
}
fc->start_time = jiffies;
......
This diff is collapsed.
......@@ -1561,7 +1561,6 @@ static int multipath_ioctl(struct dm_target *ti, unsigned int cmd,
unsigned long flags;
int r;
again:
bdev = NULL;
mode = 0;
r = 0;
......@@ -1579,7 +1578,7 @@ static int multipath_ioctl(struct dm_target *ti, unsigned int cmd,
}
if ((pgpath && m->queue_io) || (!pgpath && m->queue_if_no_path))
r = -EAGAIN;
r = -ENOTCONN;
else if (!bdev)
r = -EIO;
......@@ -1591,11 +1590,8 @@ static int multipath_ioctl(struct dm_target *ti, unsigned int cmd,
if (!r && ti->len != i_size_read(bdev->bd_inode) >> SECTOR_SHIFT)
r = scsi_verify_blk_ioctl(NULL, cmd);
if (r == -EAGAIN && !fatal_signal_pending(current)) {
if (r == -ENOTCONN && !fatal_signal_pending(current))
queue_work(kmultipathd, &m->process_queued_ios);
msleep(10);
goto again;
}
return r ? : __blkdev_driver_ioctl(bdev, mode, cmd, arg);
}
......
This diff is collapsed.
......@@ -26,22 +26,8 @@
#define KEYS_PER_NODE (NODE_SIZE / sizeof(sector_t))
#define CHILDREN_PER_NODE (KEYS_PER_NODE + 1)
/*
* The table has always exactly one reference from either mapped_device->map
* or hash_cell->new_map. This reference is not counted in table->holders.
* A pair of dm_create_table/dm_destroy_table functions is used for table
* creation/destruction.
*
* Temporary references from the other code increase table->holders. A pair
* of dm_table_get/dm_table_put functions is used to manipulate it.
*
* When the table is about to be destroyed, we wait for table->holders to
* drop to zero.
*/
struct dm_table {
struct mapped_device *md;
atomic_t holders;
unsigned type;
/* btree table */
......@@ -208,7 +194,6 @@ int dm_table_create(struct dm_table **result, fmode_t mode,
INIT_LIST_HEAD(&t->devices);
INIT_LIST_HEAD(&t->target_callbacks);
atomic_set(&t->holders, 0);
if (!num_targets)
num_targets = KEYS_PER_NODE;
......@@ -246,10 +231,6 @@ void dm_table_destroy(struct dm_table *t)
if (!t)
return;
while (atomic_read(&t->holders))
msleep(1);
smp_mb();
/* free the indexes */
if (t->depth >= 2)
vfree(t->index[t->depth - 2]);
......@@ -274,22 +255,6 @@ void dm_table_destroy(struct dm_table *t)
kfree(t);
}
void dm_table_get(struct dm_table *t)
{
atomic_inc(&t->holders);
}
EXPORT_SYMBOL(dm_table_get);
void dm_table_put(struct dm_table *t)
{
if (!t)
return;
smp_mb__before_atomic_dec();
atomic_dec(&t->holders);
}
EXPORT_SYMBOL(dm_table_put);
/*
* Checks to see if we need to extend highs or targets.
*/
......
......@@ -451,7 +451,7 @@ static void verity_prefetch_io(struct work_struct *work)
goto no_prefetch_cluster;
if (unlikely(cluster & (cluster - 1)))
cluster = 1 << (fls(cluster) - 1);
cluster = 1 << __fls(cluster);
hash_block_start &= ~(sector_t)(cluster - 1);
hash_block_end |= cluster - 1;
......@@ -695,8 +695,8 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
goto bad;
}
if (sscanf(argv[0], "%d%c", &num, &dummy) != 1 ||
num < 0 || num > 1) {
if (sscanf(argv[0], "%u%c", &num, &dummy) != 1 ||
num > 1) {
ti->error = "Invalid version";
r = -EINVAL;
goto bad;
......@@ -723,7 +723,7 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
r = -EINVAL;
goto bad;
}
v->data_dev_block_bits = ffs(num) - 1;
v->data_dev_block_bits = __ffs(num);
if (sscanf(argv[4], "%u%c", &num, &dummy) != 1 ||
!num || (num & (num - 1)) ||
......@@ -733,7 +733,7 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
r = -EINVAL;
goto bad;
}
v->hash_dev_block_bits = ffs(num) - 1;
v->hash_dev_block_bits = __ffs(num);
if (sscanf(argv[5], "%llu%c", &num_ll, &dummy) != 1 ||
(sector_t)(num_ll << (v->data_dev_block_bits - SECTOR_SHIFT))
......@@ -812,7 +812,7 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
}
v->hash_per_block_bits =
fls((1 << v->hash_dev_block_bits) / v->digest_size) - 1;
__fls((1 << v->hash_dev_block_bits) / v->digest_size);
v->levels = 0;
if (v->data_blocks)
......@@ -831,9 +831,8 @@ static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
for (i = v->levels - 1; i >= 0; i--) {
sector_t s;
v->hash_level_block[i] = hash_position;
s = verity_position_at_level(v, v->data_blocks, i);
s = (s >> v->hash_per_block_bits) +
!!(s & ((1 << v->hash_per_block_bits) - 1));
s = (v->data_blocks + ((sector_t)1 << ((i + 1) * v->hash_per_block_bits)) - 1)
>> ((i + 1) * v->hash_per_block_bits);
if (hash_position + s < hash_position) {
ti->error = "Hash device offset overflow";
r = -E2BIG;
......
This diff is collapsed.
......@@ -446,9 +446,9 @@ int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len);
/*
* Table reference counting.
*/
struct dm_table *dm_get_live_table(struct mapped_device *md);
void dm_table_get(struct dm_table *t);
void dm_table_put(struct dm_table *t);
struct dm_table *dm_get_live_table(struct mapped_device *md, int *srcu_idx);
void dm_put_live_table(struct mapped_device *md, int srcu_idx);
void dm_sync_table(struct mapped_device *md);
/*
* Queries
......
......@@ -267,9 +267,9 @@ enum {
#define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl)
#define DM_VERSION_MAJOR 4
#define DM_VERSION_MINOR 24
#define DM_VERSION_MINOR 25
#define DM_VERSION_PATCHLEVEL 0
#define DM_VERSION_EXTRA "-ioctl (2013-01-15)"
#define DM_VERSION_EXTRA "-ioctl (2013-06-26)"
/* Status bits */
#define DM_READONLY_FLAG (1 << 0) /* In/Out */
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment