Commit fe41c2c0 authored by Linus Torvalds's avatar Linus Torvalds

Merge tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper changes from Mike Snitzer:
 "A lot of attention was paid to improving the thin-provisioning
  target's handling of metadata operation failures and running out of
  space.  A new 'error_if_no_space' feature was added to allow users to
  error IOs rather than queue them when either the data or metadata
  space is exhausted.

  Additional fixes/features include:
   - a few fixes to properly support thin metadata device resizing
   - a solution for reliably waiting for a DM device's embedded kobject
     to be released before destroying the device
   - old dm-snapshot is updated to use the dm-bufio interface to take
     advantage of readahead capabilities that improve snapshot
     activation
   - new dm-cache target tunables to control how quickly data is
     promoted to the cache (fast) device
   - improved write efficiency of cluster mirror target by combining
     userspace flush and mark requests"

* tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
  dm log userspace: allow mark requests to piggyback on flush requests
  dm space map metadata: fix bug in resizing of thin metadata
  dm cache: add policy name to status output
  dm thin: fix pool feature parsing
  dm sysfs: fix a module unload race
  dm snapshot: use dm-bufio prefetch
  dm snapshot: use dm-bufio
  dm snapshot: prepare for switch to using dm-bufio
  dm snapshot: use GFP_KERNEL when initializing exceptions
  dm cache: add block sizes and total cache blocks to status output
  dm btree: add dm_btree_find_lowest_key
  dm space map metadata: fix extending the space map
  dm space map common: make sure new space is used during extend
  dm: wait until embedded kobject is released before destroying a device
  dm: remove pointless kobject comparison in dm_get_from_kobject
  dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
  dm cache policy mq: introduce three promotion threshold tunables
  dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD
  dm thin: fix set_pool_mode exposed pool operation races
  dm thin: eliminate the no_free_space flag
  ...
parents 194e57fd 5066a4df
......@@ -40,8 +40,11 @@ on hit count on entry. The policy aims to take different cache miss
costs into account and to adjust to varying load patterns automatically.
Message and constructor argument pairs are:
'sequential_threshold <#nr_sequential_ios>' and
'random_threshold <#nr_random_ios>'.
'sequential_threshold <#nr_sequential_ios>'
'random_threshold <#nr_random_ios>'
'read_promote_adjustment <value>'
'write_promote_adjustment <value>'
'discard_promote_adjustment <value>'
The sequential threshold indicates the number of contiguous I/Os
required before a stream is treated as sequential. The random threshold
......@@ -55,6 +58,15 @@ since spindles tend to have good bandwidth. The io_tracker counts
contiguous I/Os to try to spot when the io is in one of these sequential
modes.
Internally the mq policy maintains a promotion threshold variable. If
the hit count of a block not in the cache goes above this threshold it
gets promoted to the cache. The read, write and discard promote adjustment
tunables allow you to tweak the promotion threshold by adding a small
value based on the io type. They default to 4, 8 and 1 respectively.
If you're trying to quickly warm a new cache device you may wish to
reduce these to encourage promotion. Remember to switch them back to
their defaults after the cache fills though.
cleaner
-------
......
......@@ -217,36 +217,43 @@ the characteristics of a specific policy, always request it by name.
Status
------
<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses>
<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache>
<#dirty> <#features> <features>* <#core args> <core args>* <#policy args>
<policy args>*
#used metadata blocks : Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks
#read hits : Number of times a READ bio has been mapped
<metadata block size> <#used metadata blocks>/<#total metadata blocks>
<cache block size> <#used cache blocks>/<#total cache blocks>
<#read hits> <#read misses> <#write hits> <#write misses>
<#demotions> <#promotions> <#dirty> <#features> <features>*
<#core args> <core args>* <policy name> <#policy args> <policy args>*
metadata block size : Fixed block size for each metadata block in
sectors
#used metadata blocks : Number of metadata blocks used
#total metadata blocks : Total number of metadata blocks
cache block size : Configurable block size for the cache device
in sectors
#used cache blocks : Number of blocks resident in the cache
#total cache blocks : Total number of cache blocks
#read hits : Number of times a READ bio has been mapped
to the cache
#read misses : Number of times a READ bio has been mapped
#read misses : Number of times a READ bio has been mapped
to the origin
#write hits : Number of times a WRITE bio has been mapped
#write hits : Number of times a WRITE bio has been mapped
to the cache
#write misses : Number of times a WRITE bio has been
#write misses : Number of times a WRITE bio has been
mapped to the origin
#demotions : Number of times a block has been removed
#demotions : Number of times a block has been removed
from the cache
#promotions : Number of times a block has been moved to
#promotions : Number of times a block has been moved to
the cache
#blocks in cache : Number of blocks resident in the cache
#dirty : Number of blocks in the cache that differ
#dirty : Number of blocks in the cache that differ
from the origin
#feature args : Number of feature args to follow
feature args : 'writethrough' (optional)
#core args : Number of core arguments (must be even)
core args : Key/value pairs for tuning the core
#feature args : Number of feature args to follow
feature args : 'writethrough' (optional)
#core args : Number of core arguments (must be even)
core args : Key/value pairs for tuning the core
e.g. migration_threshold
#policy args : Number of policy arguments to follow (must be even)
policy args : Key/value pairs
e.g. 'sequential_threshold 1024
policy name : Name of the policy
#policy args : Number of policy arguments to follow (must be even)
policy args : Key/value pairs
e.g. sequential_threshold
Messages
--------
......
......@@ -235,6 +235,8 @@ i) Constructor
read_only: Don't allow any changes to be made to the pool
metadata.
error_if_no_space: Error IOs, instead of queueing, if no space.
Data block size must be between 64KB (128 sectors) and 1GB
(2097152 sectors) inclusive.
......@@ -276,6 +278,11 @@ ii) Status
contain the string 'Fail'. The userspace recovery tools
should then be used.
error_if_no_space|queue_if_no_space
If the pool runs out of data or metadata space, the pool will
either queue or error the IO destined to the data device. The
default is to queue the IO until more space is added.
iii) Messages
create_thin <dev id>
......
......@@ -176,8 +176,12 @@ config MD_FAULTY
source "drivers/md/bcache/Kconfig"
config BLK_DEV_DM_BUILTIN
boolean
config BLK_DEV_DM
tristate "Device mapper support"
select BLK_DEV_DM_BUILTIN
---help---
Device-mapper is a low level volume manager. It works by allowing
people to specify mappings for ranges of logical sectors. Various
......@@ -238,6 +242,7 @@ config DM_CRYPT
config DM_SNAPSHOT
tristate "Snapshot target"
depends on BLK_DEV_DM
select DM_BUFIO
---help---
Allow volume managers to take writable snapshots of a device.
......@@ -250,12 +255,12 @@ config DM_THIN_PROVISIONING
Provides thin provisioning and snapshots that share a data store.
config DM_DEBUG_BLOCK_STACK_TRACING
boolean "Keep stack trace of thin provisioning block lock holders"
depends on STACKTRACE_SUPPORT && DM_THIN_PROVISIONING
boolean "Keep stack trace of persistent data block lock holders"
depends on STACKTRACE_SUPPORT && DM_PERSISTENT_DATA
select STACKTRACE
---help---
Enable this for messages that may help debug problems with the
block manager locking used by thin provisioning.
block manager locking used by thin provisioning and caching.
If unsure, say N.
......
......@@ -32,6 +32,7 @@ obj-$(CONFIG_MD_FAULTY) += faulty.o
obj-$(CONFIG_BCACHE) += bcache/
obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
......
......@@ -104,6 +104,8 @@ struct dm_bufio_client {
struct list_head reserved_buffers;
unsigned need_reserved_buffers;
unsigned minimum_buffers;
struct hlist_head *cache_hash;
wait_queue_head_t free_buffer_wait;
......@@ -861,8 +863,8 @@ static void __get_memory_limit(struct dm_bufio_client *c,
buffers = dm_bufio_cache_size_per_client >>
(c->sectors_per_block_bits + SECTOR_SHIFT);
if (buffers < DM_BUFIO_MIN_BUFFERS)
buffers = DM_BUFIO_MIN_BUFFERS;
if (buffers < c->minimum_buffers)
buffers = c->minimum_buffers;
*limit_buffers = buffers;
*threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
......@@ -1350,6 +1352,34 @@ void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block)
}
EXPORT_SYMBOL_GPL(dm_bufio_release_move);
/*
* Free the given buffer.
*
* This is just a hint, if the buffer is in use or dirty, this function
* does nothing.
*/
void dm_bufio_forget(struct dm_bufio_client *c, sector_t block)
{
struct dm_buffer *b;
dm_bufio_lock(c);
b = __find(c, block);
if (b && likely(!b->hold_count) && likely(!b->state)) {
__unlink_buffer(b);
__free_buffer_wake(b);
}
dm_bufio_unlock(c);
}
EXPORT_SYMBOL(dm_bufio_forget);
void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n)
{
c->minimum_buffers = n;
}
EXPORT_SYMBOL(dm_bufio_set_minimum_buffers);
unsigned dm_bufio_get_block_size(struct dm_bufio_client *c)
{
return c->block_size;
......@@ -1546,6 +1576,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
INIT_LIST_HEAD(&c->reserved_buffers);
c->need_reserved_buffers = reserved_buffers;
c->minimum_buffers = DM_BUFIO_MIN_BUFFERS;
init_waitqueue_head(&c->free_buffer_wait);
c->async_write_error = 0;
......
......@@ -108,6 +108,18 @@ int dm_bufio_issue_flush(struct dm_bufio_client *c);
*/
void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block);
/*
* Free the given buffer.
* This is just a hint, if the buffer is in use or dirty, this function
* does nothing.
*/
void dm_bufio_forget(struct dm_bufio_client *c, sector_t block);
/*
* Set the minimum number of buffers before cleanup happens.
*/
void dm_bufio_set_minimum_buffers(struct dm_bufio_client *c, unsigned n);
unsigned dm_bufio_get_block_size(struct dm_bufio_client *c);
sector_t dm_bufio_get_device_size(struct dm_bufio_client *c);
sector_t dm_bufio_get_block_number(struct dm_buffer *b);
......
#include "dm.h"
/*
* The kobject release method must not be placed in the module itself,
* otherwise we are subject to module unload races.
*
* The release method is called when the last reference to the kobject is
* dropped. It may be called by any other kernel code that drops the last
* reference.
*
* The release method suffers from module unload race. We may prevent the
* module from being unloaded at the start of the release method (using
* increased module reference count or synchronizing against the release
* method), however there is no way to prevent the module from being
* unloaded at the end of the release method.
*
* If this code were placed in the dm module, the following race may
* happen:
* 1. Some other process takes a reference to dm kobject
* 2. The user issues ioctl function to unload the dm device
* 3. dm_sysfs_exit calls kobject_put, however the object is not released
* because of the other reference taken at step 1
* 4. dm_sysfs_exit waits on the completion
* 5. The other process that took the reference in step 1 drops it,
* dm_kobject_release is called from this process
* 6. dm_kobject_release calls complete()
* 7. a reschedule happens before dm_kobject_release returns
* 8. dm_sysfs_exit continues, the dm device is unloaded, module reference
* count is decremented
* 9. The user unloads the dm module
* 10. The other process that was rescheduled in step 7 continues to run,
* it is now executing code in unloaded module, so it crashes
*
* Note that if the process that takes the foreign reference to dm kobject
* has a low priority and the system is sufficiently loaded with
* higher-priority processes that prevent the low-priority process from
* being scheduled long enough, this bug may really happen.
*
* In order to fix this module unload race, we place the release method
* into a helper code that is compiled directly into the kernel.
*/
void dm_kobject_release(struct kobject *kobj)
{
complete(dm_get_completion_from_kobject(kobj));
}
EXPORT_SYMBOL(dm_kobject_release);
......@@ -287,9 +287,8 @@ static struct entry *alloc_entry(struct entry_pool *ep)
static struct entry *alloc_particular_entry(struct entry_pool *ep, dm_cblock_t cblock)
{
struct entry *e = ep->entries + from_cblock(cblock);
list_del(&e->list);
INIT_LIST_HEAD(&e->list);
list_del_init(&e->list);
INIT_HLIST_NODE(&e->hlist);
ep->nr_allocated++;
......@@ -391,6 +390,10 @@ struct mq_policy {
*/
unsigned promote_threshold;
unsigned discard_promote_adjustment;
unsigned read_promote_adjustment;
unsigned write_promote_adjustment;
/*
* The hash table allows us to quickly find an entry by origin
* block. Both pre_cache and cache entries are in here.
......@@ -400,6 +403,10 @@ struct mq_policy {
struct hlist_head *table;
};
#define DEFAULT_DISCARD_PROMOTE_ADJUSTMENT 1
#define DEFAULT_READ_PROMOTE_ADJUSTMENT 4
#define DEFAULT_WRITE_PROMOTE_ADJUSTMENT 8
/*----------------------------------------------------------------*/
/*
......@@ -642,25 +649,21 @@ static int demote_cblock(struct mq_policy *mq, dm_oblock_t *oblock)
* We bias towards reads, since they can be demoted at no cost if they
* haven't been dirtied.
*/
#define DISCARDED_PROMOTE_THRESHOLD 1
#define READ_PROMOTE_THRESHOLD 4
#define WRITE_PROMOTE_THRESHOLD 8
static unsigned adjusted_promote_threshold(struct mq_policy *mq,
bool discarded_oblock, int data_dir)
{
if (data_dir == READ)
return mq->promote_threshold + READ_PROMOTE_THRESHOLD;
return mq->promote_threshold + mq->read_promote_adjustment;
if (discarded_oblock && (any_free_cblocks(mq) || any_clean_cblocks(mq))) {
/*
* We don't need to do any copying at all, so give this a
* very low threshold.
*/
return DISCARDED_PROMOTE_THRESHOLD;
return mq->discard_promote_adjustment;
}
return mq->promote_threshold + WRITE_PROMOTE_THRESHOLD;
return mq->promote_threshold + mq->write_promote_adjustment;
}
static bool should_promote(struct mq_policy *mq, struct entry *e,
......@@ -809,7 +812,7 @@ static int no_entry_found(struct mq_policy *mq, dm_oblock_t oblock,
bool can_migrate, bool discarded_oblock,
int data_dir, struct policy_result *result)
{
if (adjusted_promote_threshold(mq, discarded_oblock, data_dir) == 1) {
if (adjusted_promote_threshold(mq, discarded_oblock, data_dir) <= 1) {
if (can_migrate)
insert_in_cache(mq, oblock, result);
else
......@@ -1135,20 +1138,28 @@ static int mq_set_config_value(struct dm_cache_policy *p,
const char *key, const char *value)
{
struct mq_policy *mq = to_mq_policy(p);
enum io_pattern pattern;
unsigned long tmp;
if (!strcasecmp(key, "random_threshold"))
pattern = PATTERN_RANDOM;
else if (!strcasecmp(key, "sequential_threshold"))
pattern = PATTERN_SEQUENTIAL;
else
return -EINVAL;
if (kstrtoul(value, 10, &tmp))
return -EINVAL;
mq->tracker.thresholds[pattern] = tmp;
if (!strcasecmp(key, "random_threshold")) {
mq->tracker.thresholds[PATTERN_RANDOM] = tmp;
} else if (!strcasecmp(key, "sequential_threshold")) {
mq->tracker.thresholds[PATTERN_SEQUENTIAL] = tmp;
} else if (!strcasecmp(key, "discard_promote_adjustment"))
mq->discard_promote_adjustment = tmp;
else if (!strcasecmp(key, "read_promote_adjustment"))
mq->read_promote_adjustment = tmp;
else if (!strcasecmp(key, "write_promote_adjustment"))
mq->write_promote_adjustment = tmp;
else
return -EINVAL;
return 0;
}
......@@ -1158,9 +1169,16 @@ static int mq_emit_config_values(struct dm_cache_policy *p, char *result, unsign
ssize_t sz = 0;
struct mq_policy *mq = to_mq_policy(p);
DMEMIT("4 random_threshold %u sequential_threshold %u",
DMEMIT("10 random_threshold %u "
"sequential_threshold %u "
"discard_promote_adjustment %u "
"read_promote_adjustment %u "
"write_promote_adjustment %u",
mq->tracker.thresholds[PATTERN_RANDOM],
mq->tracker.thresholds[PATTERN_SEQUENTIAL]);
mq->tracker.thresholds[PATTERN_SEQUENTIAL],
mq->discard_promote_adjustment,
mq->read_promote_adjustment,
mq->write_promote_adjustment);
return 0;
}
......@@ -1213,6 +1231,9 @@ static struct dm_cache_policy *mq_create(dm_cblock_t cache_size,
mq->hit_count = 0;
mq->generation = 0;
mq->promote_threshold = 0;
mq->discard_promote_adjustment = DEFAULT_DISCARD_PROMOTE_ADJUSTMENT;
mq->read_promote_adjustment = DEFAULT_READ_PROMOTE_ADJUSTMENT;
mq->write_promote_adjustment = DEFAULT_WRITE_PROMOTE_ADJUSTMENT;
mutex_init(&mq->lock);
spin_lock_init(&mq->tick_lock);
......@@ -1244,7 +1265,7 @@ static struct dm_cache_policy *mq_create(dm_cblock_t cache_size,
static struct dm_cache_policy_type mq_policy_type = {
.name = "mq",
.version = {1, 1, 0},
.version = {1, 2, 0},
.hint_size = 4,
.owner = THIS_MODULE,
.create = mq_create
......@@ -1252,10 +1273,11 @@ static struct dm_cache_policy_type mq_policy_type = {
static struct dm_cache_policy_type default_policy_type = {
.name = "default",
.version = {1, 1, 0},
.version = {1, 2, 0},
.hint_size = 4,
.owner = THIS_MODULE,
.create = mq_create
.create = mq_create,
.real = &mq_policy_type
};
static int __init mq_init(void)
......
......@@ -146,6 +146,10 @@ const char *dm_cache_policy_get_name(struct dm_cache_policy *p)
{
struct dm_cache_policy_type *t = p->private;
/* if t->real is set then an alias was used (e.g. "default") */
if (t->real)
return t->real->name;
return t->name;
}
EXPORT_SYMBOL_GPL(dm_cache_policy_get_name);
......
......@@ -222,6 +222,12 @@ struct dm_cache_policy_type {
char name[CACHE_POLICY_NAME_SIZE];
unsigned version[CACHE_POLICY_VERSION_SIZE];
/*
* For use by an alias dm_cache_policy_type to point to the
* real dm_cache_policy_type.
*/
struct dm_cache_policy_type *real;
/*
* Policies may store a hint for each each cache block.
* Currently the size of this hint must be 0 or 4 bytes but we
......
......@@ -2826,12 +2826,13 @@ static void cache_resume(struct dm_target *ti)
/*
* Status format:
*
* <#used metadata blocks>/<#total metadata blocks>
* <metadata block size> <#used metadata blocks>/<#total metadata blocks>
* <cache block size> <#used cache blocks>/<#total cache blocks>
* <#read hits> <#read misses> <#write hits> <#write misses>
* <#demotions> <#promotions> <#blocks in cache> <#dirty>
* <#demotions> <#promotions> <#dirty>
* <#features> <features>*
* <#core args> <core args>
* <#policy args> <policy args>*
* <policy name> <#policy args> <policy args>*
*/
static void cache_status(struct dm_target *ti, status_type_t type,
unsigned status_flags, char *result, unsigned maxlen)
......@@ -2869,17 +2870,20 @@ static void cache_status(struct dm_target *ti, status_type_t type,
residency = policy_residency(cache->policy);
DMEMIT("%llu/%llu %u %u %u %u %u %u %llu %u ",
DMEMIT("%u %llu/%llu %u %llu/%llu %u %u %u %u %u %u %llu ",
(unsigned)(DM_CACHE_METADATA_BLOCK_SIZE >> SECTOR_SHIFT),
(unsigned long long)(nr_blocks_metadata - nr_free_blocks_metadata),
(unsigned long long)nr_blocks_metadata,
cache->sectors_per_block,
(unsigned long long) from_cblock(residency),
(unsigned long long) from_cblock(cache->cache_size),
(unsigned) atomic_read(&cache->stats.read_hit),
(unsigned) atomic_read(&cache->stats.read_miss),
(unsigned) atomic_read(&cache->stats.write_hit),
(unsigned) atomic_read(&cache->stats.write_miss),
(unsigned) atomic_read(&cache->stats.demotion),
(unsigned) atomic_read(&cache->stats.promotion),
(unsigned long long) from_cblock(residency),
cache->nr_dirty);
(unsigned long long) from_cblock(cache->nr_dirty));
if (writethrough_mode(&cache->features))
DMEMIT("1 writethrough ");
......@@ -2896,6 +2900,8 @@ static void cache_status(struct dm_target *ti, status_type_t type,
}
DMEMIT("2 migration_threshold %llu ", (unsigned long long) cache->migration_threshold);
DMEMIT("%s ", dm_cache_policy_get_name(cache->policy));
if (sz < maxlen) {
r = policy_emit_config_values(cache->policy, result + sz, maxlen - sz);
if (r)
......@@ -3129,7 +3135,7 @@ static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
static struct target_type cache_target = {
.name = "cache",
.version = {1, 2, 0},
.version = {1, 3, 0},
.module = THIS_MODULE,
.ctr = cache_ctr,
.dtr = cache_dtr,
......
......@@ -24,7 +24,6 @@ struct delay_c {
struct work_struct flush_expired_bios;
struct list_head delayed_bios;
atomic_t may_delay;
mempool_t *delayed_pool;
struct dm_dev *dev_read;
sector_t start_read;
......@@ -40,14 +39,11 @@ struct delay_c {
struct dm_delay_info {
struct delay_c *context;
struct list_head list;
struct bio *bio;
unsigned long expires;
};
static DEFINE_MUTEX(delayed_bios_lock);
static struct kmem_cache *delayed_cache;
static void handle_delayed_timer(unsigned long data)
{
struct delay_c *dc = (struct delay_c *)data;
......@@ -87,13 +83,14 @@ static struct bio *flush_delayed_bios(struct delay_c *dc, int flush_all)
mutex_lock(&delayed_bios_lock);
list_for_each_entry_safe(delayed, next, &dc->delayed_bios, list) {
if (flush_all || time_after_eq(jiffies, delayed->expires)) {
struct bio *bio = dm_bio_from_per_bio_data(delayed,
sizeof(struct dm_delay_info));
list_del(&delayed->list);
bio_list_add(&flush_bios, delayed->bio);
if ((bio_data_dir(delayed->bio) == WRITE))
bio_list_add(&flush_bios, bio);
if ((bio_data_dir(bio) == WRITE))
delayed->context->writes--;
else
delayed->context->reads--;
mempool_free(delayed, dc->delayed_pool);
continue;
}
......@@ -185,12 +182,6 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv)
}
out:
dc->delayed_pool = mempool_create_slab_pool(128, delayed_cache);
if (!dc->delayed_pool) {
DMERR("Couldn't create delayed bio pool.");
goto bad_dev_write;
}
dc->kdelayd_wq = alloc_workqueue("kdelayd", WQ_MEM_RECLAIM, 0);
if (!dc->kdelayd_wq) {
DMERR("Couldn't start kdelayd");
......@@ -206,12 +197,11 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv)
ti->num_flush_bios = 1;
ti->num_discard_bios = 1;
ti->per_bio_data_size = sizeof(struct dm_delay_info);
ti->private = dc;
return 0;
bad_queue:
mempool_destroy(dc->delayed_pool);
bad_dev_write:
if (dc->dev_write)
dm_put_device(ti, dc->dev_write);
bad_dev_read:
......@@ -232,7 +222,6 @@ static void delay_dtr(struct dm_target *ti)
if (dc->dev_write)
dm_put_device(ti, dc->dev_write);
mempool_destroy(dc->delayed_pool);
kfree(dc);
}
......@@ -244,10 +233,9 @@ static int delay_bio(struct delay_c *dc, int delay, struct bio *bio)
if (!delay || !atomic_read(&dc->may_delay))
return 1;
delayed = mempool_alloc(dc->delayed_pool, GFP_NOIO);
delayed = dm_per_bio_data(bio, sizeof(struct dm_delay_info));
delayed->context = dc;
delayed->bio = bio;
delayed->expires = expires = jiffies + (delay * HZ / 1000);
mutex_lock(&delayed_bios_lock);
......@@ -356,13 +344,7 @@ static struct target_type delay_target = {
static int __init dm_delay_init(void)
{
int r = -ENOMEM;
delayed_cache = KMEM_CACHE(dm_delay_info, 0);
if (!delayed_cache) {
DMERR("Couldn't create delayed bio cache.");
goto bad_memcache;
}
int r;
r = dm_register_target(&delay_target);
if (r < 0) {
......@@ -373,15 +355,12 @@ static int __init dm_delay_init(void)
return 0;
bad_register:
kmem_cache_destroy(delayed_cache);
bad_memcache:
return r;
}
static void __exit dm_delay_exit(void)
{
dm_unregister_target(&delay_target);
kmem_cache_destroy(delayed_cache);
}
/* Module hooks */
......
This diff is collapsed.
......@@ -13,10 +13,13 @@
#include <linux/export.h>
#include <linux/slab.h>
#include <linux/dm-io.h>
#include "dm-bufio.h"
#define DM_MSG_PREFIX "persistent snapshot"
#define DM_CHUNK_SIZE_DEFAULT_SECTORS 32 /* 16KB */
#define DM_PREFETCH_CHUNKS 12
/*-----------------------------------------------------------------
* Persistent snapshots, by persistent we mean that the snapshot
* will survive a reboot.
......@@ -257,6 +260,7 @@ static int chunk_io(struct pstore *ps, void *area, chunk_t chunk, int rw,
INIT_WORK_ONSTACK(&req.work, do_metadata);
queue_work(ps->metadata_wq, &req.work);
flush_workqueue(ps->metadata_wq);
destroy_work_on_stack(&req.work);
return req.result;
}
......@@ -401,17 +405,18 @@ static int write_header(struct pstore *ps)
/*
* Access functions for the disk exceptions, these do the endian conversions.
*/
static struct disk_exception *get_exception(struct pstore *ps, uint32_t index)
static struct disk_exception *get_exception(struct pstore *ps, void *ps_area,
uint32_t index)
{
BUG_ON(index >= ps->exceptions_per_area);
return ((struct disk_exception *) ps->area) + index;
return ((struct disk_exception *) ps_area) + index;
}
static void read_exception(struct pstore *ps,
static void read_exception(struct pstore *ps, void *ps_area,
uint32_t index, struct core_exception *result)
{
struct disk_exception *de = get_exception(ps, index);
struct disk_exception *de = get_exception(ps, ps_area, index);
/* copy it */
result->old_chunk = le64_to_cpu(de->old_chunk);
......@@ -421,7 +426,7 @@ static void read_exception(struct pstore *ps,
static void write_exception(struct pstore *ps,
uint32_t index, struct core_exception *e)
{
struct disk_exception *de = get_exception(ps, index);
struct disk_exception *de = get_exception(ps, ps->area, index);
/* copy it */
de->old_chunk = cpu_to_le64(e->old_chunk);
......@@ -430,7 +435,7 @@ static void write_exception(struct pstore *ps,
static void clear_exception(struct pstore *ps, uint32_t index)
{
struct disk_exception *de = get_exception(ps, index);
struct disk_exception *de = get_exception(ps, ps->area, index);
/* clear it */
de->old_chunk = 0;
......@@ -442,7 +447,7 @@ static void clear_exception(struct pstore *ps, uint32_t index)
* 'full' is filled in to indicate if the area has been
* filled.
*/
static int insert_exceptions(struct pstore *ps,
static int insert_exceptions(struct pstore *ps, void *ps_area,
int (*callback)(void *callback_context,
chunk_t old, chunk_t new),
void *callback_context,
......@@ -456,7 +461,7 @@ static int insert_exceptions(struct pstore *ps,
*full = 1;
for (i = 0; i < ps->exceptions_per_area; i++) {
read_exception(ps, i, &e);
read_exception(ps, ps_area, i, &e);
/*
* If the new_chunk is pointing at the start of
......@@ -493,26 +498,72 @@ static int read_exceptions(struct pstore *ps,
void *callback_context)
{
int r, full = 1;
struct dm_bufio_client *client;
chunk_t prefetch_area = 0;
client = dm_bufio_client_create(dm_snap_cow(ps->store->snap)->bdev,
ps->store->chunk_size << SECTOR_SHIFT,
1, 0, NULL, NULL);
if (IS_ERR(client))
return PTR_ERR(client);
/*
* Setup for one current buffer + desired readahead buffers.
*/
dm_bufio_set_minimum_buffers(client, 1 + DM_PREFETCH_CHUNKS);
/*
* Keeping reading chunks and inserting exceptions until
* we find a partially full area.
*/
for (ps->current_area = 0; full; ps->current_area++) {
r = area_io(ps, READ);
if (r)
return r;
struct dm_buffer *bp;
void *area;
chunk_t chunk;
if (unlikely(prefetch_area < ps->current_area))
prefetch_area = ps->current_area;
if (DM_PREFETCH_CHUNKS) do {
chunk_t pf_chunk = area_location(ps, prefetch_area);
if (unlikely(pf_chunk >= dm_bufio_get_device_size(client)))
break;
dm_bufio_prefetch(client, pf_chunk, 1);
prefetch_area++;
if (unlikely(!prefetch_area))
break;
} while (prefetch_area <= ps->current_area + DM_PREFETCH_CHUNKS);
chunk = area_location(ps, ps->current_area);
area = dm_bufio_read(client, chunk, &bp);
if (unlikely(IS_ERR(area))) {
r = PTR_ERR(area);
goto ret_destroy_bufio;
}
r = insert_exceptions(ps, callback, callback_context, &full);
if (r)
return r;
r = insert_exceptions(ps, area, callback, callback_context,
&full);
dm_bufio_release(bp);
dm_bufio_forget(client, chunk);
if (unlikely(r))
goto ret_destroy_bufio;
}
ps->current_area--;
skip_metadata(ps);
return 0;
r = 0;
ret_destroy_bufio:
dm_bufio_client_destroy(client);
return r;
}
static struct pstore *get_info(struct dm_exception_store *store)
......@@ -733,7 +784,7 @@ static int persistent_prepare_merge(struct dm_exception_store *store,
ps->current_committed = ps->exceptions_per_area;
}
read_exception(ps, ps->current_committed - 1, &ce);
read_exception(ps, ps->area, ps->current_committed - 1, &ce);
*last_old_chunk = ce.old_chunk;
*last_new_chunk = ce.new_chunk;
......@@ -743,8 +794,8 @@ static int persistent_prepare_merge(struct dm_exception_store *store,
*/
for (nr_consecutive = 1; nr_consecutive < ps->current_committed;
nr_consecutive++) {
read_exception(ps, ps->current_committed - 1 - nr_consecutive,
&ce);
read_exception(ps, ps->area,
ps->current_committed - 1 - nr_consecutive, &ce);
if (ce.old_chunk != *last_old_chunk - nr_consecutive ||
ce.new_chunk != *last_new_chunk - nr_consecutive)
break;
......
......@@ -610,12 +610,12 @@ static struct dm_exception *dm_lookup_exception(struct dm_exception_table *et,
return NULL;
}
static struct dm_exception *alloc_completed_exception(void)
static struct dm_exception *alloc_completed_exception(gfp_t gfp)
{
struct dm_exception *e;
e = kmem_cache_alloc(exception_cache, GFP_NOIO);
if (!e)
e = kmem_cache_alloc(exception_cache, gfp);
if (!e && gfp == GFP_NOIO)
e = kmem_cache_alloc(exception_cache, GFP_ATOMIC);
return e;
......@@ -697,7 +697,7 @@ static int dm_add_exception(void *context, chunk_t old, chunk_t new)
struct dm_snapshot *s = context;
struct dm_exception *e;
e = alloc_completed_exception();
e = alloc_completed_exception(GFP_KERNEL);
if (!e)
return -ENOMEM;
......@@ -1405,7 +1405,7 @@ static void pending_complete(struct dm_snap_pending_exception *pe, int success)
goto out;
}
e = alloc_completed_exception();
e = alloc_completed_exception(GFP_NOIO);
if (!e) {
down_write(&s->lock);
__invalidate_snapshot(s, -ENOMEM);
......
......@@ -86,6 +86,7 @@ static const struct sysfs_ops dm_sysfs_ops = {
static struct kobj_type dm_ktype = {
.sysfs_ops = &dm_sysfs_ops,
.default_attrs = dm_attrs,
.release = dm_kobject_release,
};
/*
......@@ -104,5 +105,7 @@ int dm_sysfs_init(struct mapped_device *md)
*/
void dm_sysfs_exit(struct mapped_device *md)
{
kobject_put(dm_kobject(md));
struct kobject *kobj = dm_kobject(md);
kobject_put(kobj);
wait_for_completion(dm_get_completion_from_kobject(kobj));
}
......@@ -155,7 +155,6 @@ static int alloc_targets(struct dm_table *t, unsigned int num)
{
sector_t *n_highs;
struct dm_target *n_targets;
int n = t->num_targets;
/*
* Allocate both the target array and offset array at once.
......@@ -169,12 +168,7 @@ static int alloc_targets(struct dm_table *t, unsigned int num)
n_targets = (struct dm_target *) (n_highs + num);
if (n) {
memcpy(n_highs, t->highs, sizeof(*n_highs) * n);
memcpy(n_targets, t->targets, sizeof(*n_targets) * n);
}
memset(n_highs + n, -1, sizeof(*n_highs) * (num - n));
memset(n_highs, -1, sizeof(*n_highs) * num);
vfree(t->highs);
t->num_allocated = num;
......@@ -260,17 +254,6 @@ void dm_table_destroy(struct dm_table *t)
kfree(t);
}
/*
* Checks to see if we need to extend highs or targets.
*/
static inline int check_space(struct dm_table *t)
{
if (t->num_targets >= t->num_allocated)
return alloc_targets(t, t->num_allocated * 2);
return 0;
}
/*
* See if we've already got a device in the list.
*/
......@@ -731,8 +714,7 @@ int dm_table_add_target(struct dm_table *t, const char *type,
return -EINVAL;
}
if ((r = check_space(t)))
return r;
BUG_ON(t->num_targets >= t->num_allocated);
tgt = t->targets + t->num_targets;
memset(tgt, 0, sizeof(*tgt));
......
......@@ -1349,6 +1349,12 @@ dm_thin_id dm_thin_dev_id(struct dm_thin_device *td)
return td->id;
}
/*
* Check whether @time (of block creation) is older than @td's last snapshot.
* If so then the associated block is shared with the last snapshot device.
* Any block on a device created *after* the device last got snapshotted is
* necessarily not shared.
*/
static bool __snapshotted_since(struct dm_thin_device *td, uint32_t time)
{
return td->snapshotted_time > time;
......@@ -1458,6 +1464,20 @@ int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
return r;
}
int dm_pool_block_is_used(struct dm_pool_metadata *pmd, dm_block_t b, bool *result)
{
int r;
uint32_t ref_count;
down_read(&pmd->root_lock);
r = dm_sm_get_count(pmd->data_sm, b, &ref_count);
if (!r)
*result = (ref_count != 0);
up_read(&pmd->root_lock);
return r;
}
bool dm_thin_changed_this_transaction(struct dm_thin_device *td)
{
int r;
......
......@@ -131,7 +131,7 @@ dm_thin_id dm_thin_dev_id(struct dm_thin_device *td);
struct dm_thin_lookup_result {
dm_block_t block;
unsigned shared:1;
bool shared:1;
};
/*
......@@ -181,6 +181,8 @@ int dm_pool_get_data_block_size(struct dm_pool_metadata *pmd, sector_t *result);
int dm_pool_get_data_dev_size(struct dm_pool_metadata *pmd, dm_block_t *result);
int dm_pool_block_is_used(struct dm_pool_metadata *pmd, dm_block_t b, bool *result);
/*
* Returns -ENOSPC if the new size is too small and already allocated
* blocks would be lost.
......
This diff is collapsed.
......@@ -200,8 +200,8 @@ struct mapped_device {
/* forced geometry settings */
struct hd_geometry geometry;
/* sysfs handle */
struct kobject kobj;
/* kobject and completion */
struct dm_kobject_holder kobj_holder;
/* zero-length flush that will be cloned and submitted to targets */
struct bio flush_bio;
......@@ -2041,6 +2041,7 @@ static struct mapped_device *alloc_dev(int minor)
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
init_waitqueue_head(&md->eventq);
init_completion(&md->kobj_holder.completion);
md->disk->major = _major;
md->disk->first_minor = minor;
......@@ -2902,20 +2903,14 @@ struct gendisk *dm_disk(struct mapped_device *md)
struct kobject *dm_kobject(struct mapped_device *md)
{
return &md->kobj;
return &md->kobj_holder.kobj;
}
/*
* struct mapped_device should not be exported outside of dm.c
* so use this check to verify that kobj is part of md structure
*/
struct mapped_device *dm_get_from_kobject(struct kobject *kobj)
{
struct mapped_device *md;
md = container_of(kobj, struct mapped_device, kobj);
if (&md->kobj != kobj)
return NULL;
md = container_of(kobj, struct mapped_device, kobj_holder.kobj);
if (test_bit(DMF_FREEING, &md->flags) ||
dm_deleting_md(md))
......
......@@ -15,6 +15,8 @@
#include <linux/list.h>
#include <linux/blkdev.h>
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/kobject.h>
#include "dm-stats.h"
......@@ -148,11 +150,26 @@ void dm_interface_exit(void);
/*
* sysfs interface
*/
struct dm_kobject_holder {
struct kobject kobj;
struct completion completion;
};
static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
{
return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
}
int dm_sysfs_init(struct mapped_device *md);
void dm_sysfs_exit(struct mapped_device *md);
struct kobject *dm_kobject(struct mapped_device *md);
struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
/*
* The kobject helper
*/
void dm_kobject_release(struct kobject *kobj);
/*
* Targets for linear and striped mappings
*/
......
......@@ -104,7 +104,7 @@ static int __check_holder(struct block_lock *lock)
for (i = 0; i < MAX_HOLDERS; i++) {
if (lock->holders[i] == current) {
DMERR("recursive lock detected in pool metadata");
DMERR("recursive lock detected in metadata");
#ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
DMERR("previously held here:");
print_stack_trace(lock->traces + i, 4);
......
......@@ -770,8 +770,8 @@ EXPORT_SYMBOL_GPL(dm_btree_insert_notify);
/*----------------------------------------------------------------*/
static int find_highest_key(struct ro_spine *s, dm_block_t block,
uint64_t *result_key, dm_block_t *next_block)
static int find_key(struct ro_spine *s, dm_block_t block, bool find_highest,
uint64_t *result_key, dm_block_t *next_block)
{
int i, r;
uint32_t flags;
......@@ -788,7 +788,11 @@ static int find_highest_key(struct ro_spine *s, dm_block_t block,
else
i--;
*result_key = le64_to_cpu(ro_node(s)->keys[i]);
if (find_highest)
*result_key = le64_to_cpu(ro_node(s)->keys[i]);
else
*result_key = le64_to_cpu(ro_node(s)->keys[0]);
if (next_block || flags & INTERNAL_NODE)
block = value64(ro_node(s), i);
......@@ -799,16 +803,16 @@ static int find_highest_key(struct ro_spine *s, dm_block_t block,
return 0;
}
int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys)
static int dm_btree_find_key(struct dm_btree_info *info, dm_block_t root,
bool find_highest, uint64_t *result_keys)
{
int r = 0, count = 0, level;
struct ro_spine spine;
init_ro_spine(&spine, info);
for (level = 0; level < info->levels; level++) {
r = find_highest_key(&spine, root, result_keys + level,
level == info->levels - 1 ? NULL : &root);
r = find_key(&spine, root, find_highest, result_keys + level,
level == info->levels - 1 ? NULL : &root);
if (r == -ENODATA) {
r = 0;
break;
......@@ -822,8 +826,23 @@ int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
return r ? r : count;
}
int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys)
{
return dm_btree_find_key(info, root, true, result_keys);
}
EXPORT_SYMBOL_GPL(dm_btree_find_highest_key);
int dm_btree_find_lowest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys)
{
return dm_btree_find_key(info, root, false, result_keys);
}
EXPORT_SYMBOL_GPL(dm_btree_find_lowest_key);
/*----------------------------------------------------------------*/
/*
* FIXME: We shouldn't use a recursive algorithm when we have limited stack
* space. Also this only works for single level trees.
......
......@@ -134,6 +134,14 @@ int dm_btree_insert_notify(struct dm_btree_info *info, dm_block_t root,
int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
uint64_t *keys, dm_block_t *new_root);
/*
* Returns < 0 on failure. Otherwise the number of key entries that have
* been filled out. Remember trees can have zero entries, and as such have
* no lowest key.
*/
int dm_btree_find_lowest_key(struct dm_btree_info *info, dm_block_t root,
uint64_t *result_keys);
/*
* Returns < 0 on failure. Otherwise the number of key entries that have
* been filled out. Remember trees can have zero entries, and as such have
......
......@@ -245,6 +245,10 @@ int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks)
return -EINVAL;
}
/*
* We need to set this before the dm_tm_new_block() call below.
*/
ll->nr_blocks = nr_blocks;
for (i = old_blocks; i < blocks; i++) {
struct dm_block *b;
struct disk_index_entry idx;
......@@ -252,6 +256,7 @@ int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks)
r = dm_tm_new_block(ll->tm, &dm_sm_bitmap_validator, &b);
if (r < 0)
return r;
idx.blocknr = cpu_to_le64(dm_block_location(b));
r = dm_tm_unlock(ll->tm, b);
......@@ -266,7 +271,6 @@ int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks)
return r;
}
ll->nr_blocks = nr_blocks;
return 0;
}
......
......@@ -385,13 +385,13 @@ static int sm_metadata_new_block(struct dm_space_map *sm, dm_block_t *b)
int r = sm_metadata_new_block_(sm, b);
if (r) {
DMERR("unable to allocate new metadata block");
DMERR_LIMIT("unable to allocate new metadata block");
return r;
}
r = sm_metadata_get_nr_free(sm, &count);
if (r) {
DMERR("couldn't get free block count");
DMERR_LIMIT("couldn't get free block count");
return r;
}
......@@ -608,20 +608,38 @@ static int sm_metadata_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
* Flick into a mode where all blocks get allocated in the new area.
*/
smm->begin = old_len;
memcpy(&smm->sm, &bootstrap_ops, sizeof(smm->sm));
memcpy(sm, &bootstrap_ops, sizeof(*sm));
/*
* Extend.
*/
r = sm_ll_extend(&smm->ll, extra_blocks);
if (r)
goto out;
/*
* Switch back to normal behaviour.
* We repeatedly increment then commit until the commit doesn't
* allocate any new blocks.
*/
memcpy(&smm->sm, &ops, sizeof(smm->sm));
for (i = old_len; !r && i < smm->begin; i++)
r = sm_ll_inc(&smm->ll, i, &ev);
do {
for (i = old_len; !r && i < smm->begin; i++) {
r = sm_ll_inc(&smm->ll, i, &ev);
if (r)
goto out;
}
old_len = smm->begin;
r = sm_ll_commit(&smm->ll);
if (r)
goto out;
} while (old_len != smm->begin);
out:
/*
* Switch back to normal behaviour.
*/
memcpy(sm, &ops, sizeof(*sm));
return r;
}
......
......@@ -201,11 +201,18 @@
* int (*flush)(struct dm_dirty_log *log);
*
* Payload-to-userspace:
* None.
* If the 'integrated_flush' directive is present in the constructor
* table, the payload is as same as DM_ULOG_MARK_REGION:
* uint64_t [] - region(s) to mark
* else
* None
* Payload-to-kernel:
* None.
*
* No incoming or outgoing payload. Simply flush log state to disk.
* If the 'integrated_flush' option was used during the creation of the
* log, mark region requests are carried as payload in the flush request.
* Piggybacking the mark requests in this way allows for fewer communications
* between kernel and userspace.
*
* When the request has been processed, user-space must return the
* dm_ulog_request to the kernel - setting the 'error' field and clearing
......@@ -385,8 +392,15 @@
* version 2: DM_ULOG_CTR allowed to return a string containing a
* device name that is to be registered with DM via
* 'dm_get_device'.
* version 3: DM_ULOG_FLUSH is capable of carrying payload for marking
* regions. This "integrated flush" reduces the number of
* requests between the kernel and userspace by effectively
* merging 'mark' and 'flush' requests. A constructor table
* argument ('integrated_flush') is required to turn this
* feature on, so it is backwards compatible with older
* userspace versions.
*/
#define DM_ULOG_REQUEST_VERSION 2
#define DM_ULOG_REQUEST_VERSION 3
struct dm_ulog_request {
/*
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment