Commit ebb37277 authored by Linus Torvalds

Merge branch 'for-3.10/drivers' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "It might look big in volume, but when categorized, not a lot of
  drivers are touched.  The pull request contains:

   - mtip32xx fixes from Micron.

   - A slew of drbd updates, this time in a nicer series.

   - bcache, a flash/ssd caching framework from Kent.

   - Fixes for cciss"

* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
  bcache: Use bd_link_disk_holder()
  bcache: Allocator cleanup/fixes
  cciss: bug fix to prevent cciss from loading in kdump crash kernel
  cciss: add cciss_allow_hpsa module parameter
  drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
  mtip32xx: Workaround for unaligned writes
  bcache: Make sure blocksize isn't smaller than device blocksize
  bcache: Fix merge_bvec_fn usage for when it modifies the bvm
  bcache: Correctly check against BIO_MAX_PAGES
  bcache: Hack around stuff that clones up to bi_max_vecs
  bcache: Set ra_pages based on backing device's ra_pages
  bcache: Take data offset from the bdev superblock.
  mtip32xx: mtip32xx: Disable TRIM support
  mtip32xx: fix a smatch warning
  bcache: Disable broken btree fuzz tester
  bcache: Fix a format string overflow
  bcache: Fix a minor memory leak on device teardown
  bcache: Documentation updates
  bcache: Use WARN_ONCE() instead of __WARN()
  bcache: Add missing #include <linux/prefetch.h>
  ...
parents 4de13d7a f50efd2f
What: /sys/block/<disk>/bcache/unregister
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
A write to this file causes the backing device or cache to be
unregistered. If a backing device had dirty data in the cache,
writeback mode is automatically disabled and all dirty data is
flushed before the device is unregistered. Caches unregister
all associated backing devices before unregistering themselves.
What: /sys/block/<disk>/bcache/clear_stats
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
Writing to this file resets all the statistics for the device.
What: /sys/block/<disk>/bcache/cache
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a backing device that has a cache, a symlink to
the bcache/ directory of that cache.
What: /sys/block/<disk>/bcache/cache_hits
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: integer number of full cache hits,
counted per bio. A partial cache hit counts as a miss.
What: /sys/block/<disk>/bcache/cache_misses
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: integer number of cache misses.
What: /sys/block/<disk>/bcache/cache_hit_ratio
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: cache hits as a percentage.
What: /sys/block/<disk>/bcache/sequential_cutoff
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: Threshold past which sequential IO will
skip the cache. Read and written as bytes in human readable
units (e.g. echo 10M > sequential_cutoff).
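The "human readable units" accepted by sequential_cutoff can be illustrated with a small userspace parser. This is a minimal sketch and not bcache's own parsing code: the helper name parse_size is hypothetical, and it assumes binary (1024-based) k, M and G suffixes.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse strings such as "10M", "512k" or "4096"
 * into a byte count. Assumes 1024-based suffixes (k, M, G), matched
 * case-insensitively. A single trailing newline is tolerated because
 * "echo" appends one. Returns 0 on success, -EINVAL otherwise.
 */
static int parse_size(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;
	unsigned int shift = 0;

	assert(out != NULL);
	if (!s || !isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno)
		return -EINVAL;

	switch (tolower((unsigned char)*end)) {
	case 'k': shift = 10; end++; break;
	case 'm': shift = 20; end++; break;
	case 'g': shift = 30; end++; break;
	default: break;	/* no suffix; validated below */
	}

	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (v > (UINT64_MAX >> shift))
		return -EINVAL;	/* the shift would overflow 64 bits */

	*out = (uint64_t)v << shift;
	return 0;
}
```

Under these assumptions, "10M" parses to 10 * 1024 * 1024 bytes, and a string with an unknown suffix is rejected.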
What: /sys/block/<disk>/bcache/bypassed
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
Sum of all reads and writes that have bypassed the cache (due
to the sequential cutoff). Expressed as bytes in human
readable units.
What: /sys/block/<disk>/bcache/writeback
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: When on, writeback caching is enabled and
writes will be buffered in the cache. When off, caching is in
writethrough mode; reads and writes will be added to the
cache but no write buffering will take place.
What: /sys/block/<disk>/bcache/writeback_running
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: when off, dirty data will not be written
from the cache to the backing device. The cache will still be
used to buffer writes until it is mostly full, at which point
writes transparently revert to writethrough mode. Intended only
for benchmarking/testing.
What: /sys/block/<disk>/bcache/writeback_delay
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: In writeback mode, when dirty data is
written to the cache and the cache held no dirty data for that
backing device, writeback from cache to backing device starts
after this delay, expressed as an integer number of seconds.
What: /sys/block/<disk>/bcache/writeback_percent
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For backing devices: If nonzero, writeback from cache to
backing device only takes place when more than this percentage
of the cache is used, allowing more write coalescing to take
place and reducing total number of writes sent to the backing
device. Integer between 0 and 40.
What: /sys/block/<disk>/bcache/synchronous
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, a boolean that allows synchronous mode to be
switched on and off. In synchronous mode all writes are ordered
such that the cache can reliably recover from unclean shutdown;
if disabled bcache will not generally wait for writes to
complete but if the cache is not shut down cleanly all data
will be discarded from the cache. Should not be turned off with
writeback caching enabled.
What: /sys/block/<disk>/bcache/discard
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, a boolean allowing discard/TRIM to be turned off
or back on if the device supports it.
What: /sys/block/<disk>/bcache/bucket_size
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, bucket size in human readable units, as set at
cache creation time; should match the erase block size of the
SSD for optimal performance.
What: /sys/block/<disk>/bcache/nbuckets
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, the number of usable buckets.
What: /sys/block/<disk>/bcache/tree_depth
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, height of the btree excluding leaf nodes (i.e. a
one node tree will have a depth of 0).
What: /sys/block/<disk>/bcache/btree_cache_size
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
Number of btree buckets/nodes that are currently cached in
memory; cache dynamically grows and shrinks in response to
memory pressure from the rest of the system.
What: /sys/block/<disk>/bcache/written
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, total amount of data in human readable units
written to the cache, excluding all metadata.
What: /sys/block/<disk>/bcache/btree_written
Date: November 2010
Contact: Kent Overstreet <kent.overstreet@gmail.com>
Description:
For a cache, sum of all btree writes in human readable units.
......@@ -1620,6 +1620,13 @@ W: http://www.baycom.org/~tom/ham/ham.html
S: Maintained
F: drivers/net/hamradio/baycom*
BCACHE (BLOCK LAYER CACHE)
M: Kent Overstreet <koverstreet@google.com>
L: linux-bcache@vger.kernel.org
W: http://bcache.evilpiepirate.org
S: Maintained:
F: drivers/md/bcache/
BEFS FILE SYSTEM
S: Orphan
F: Documentation/filesystems/befs.txt
......
......@@ -920,16 +920,14 @@ bio_pagedec(struct bio *bio)
static void
bufinit(struct buf *buf, struct request *rq, struct bio *bio)
{
struct bio_vec *bv;
memset(buf, 0, sizeof(*buf));
buf->rq = rq;
buf->bio = bio;
buf->resid = bio->bi_size;
buf->sector = bio->bi_sector;
bio_pageinc(bio);
buf->bv = bv = bio_iovec(bio);
buf->bv_resid = bv->bv_len;
buf->bv = bio_iovec(bio);
buf->bv_resid = buf->bv->bv_len;
WARN_ON(buf->bv_resid == 0);
}
......
......@@ -75,6 +75,12 @@ module_param(cciss_simple_mode, int, S_IRUGO|S_IWUSR);
MODULE_PARM_DESC(cciss_simple_mode,
"Use 'simple mode' rather than 'performant mode'");
static int cciss_allow_hpsa;
module_param(cciss_allow_hpsa, int, S_IRUGO|S_IWUSR);
MODULE_PARM_DESC(cciss_allow_hpsa,
"Prevent cciss driver from accessing hardware known to be "
" supported by the hpsa driver");
static DEFINE_MUTEX(cciss_mutex);
static struct proc_dir_entry *proc_cciss;
......@@ -4115,9 +4121,13 @@ static int cciss_lookup_board_id(struct pci_dev *pdev, u32 *board_id)
*board_id = ((subsystem_device_id << 16) & 0xffff0000) |
subsystem_vendor_id;
for (i = 0; i < ARRAY_SIZE(products); i++)
for (i = 0; i < ARRAY_SIZE(products); i++) {
/* Stand aside for hpsa driver on request */
if (cciss_allow_hpsa)
return -ENODEV;
if (*board_id == products[i].board_id)
return i;
}
dev_warn(&pdev->dev, "unrecognized board ID: 0x%08x, ignoring.\n",
*board_id);
return -ENODEV;
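The board ID lookup in the hunk above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the driver's code: the table entries and the helper names make_board_id and lookup_board_id are hypothetical, and the allow_hpsa flag is passed as a parameter instead of being a module parameter. With a non-empty table, checking the flag once before the loop behaves the same as checking it on the first iteration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct product {
	uint32_t board_id;
	const char *name;
};

/* Illustrative table; the IDs and names are made up for this example. */
static const struct product products[] = {
	{ 0x3223103Cu, "example controller A" },
	{ 0x3234103Cu, "example controller B" },
};

/* Subsystem device ID in the upper 16 bits, vendor ID in the lower 16. */
static uint32_t make_board_id(uint16_t subsystem_device_id,
			      uint16_t subsystem_vendor_id)
{
	return ((uint32_t)subsystem_device_id << 16) | subsystem_vendor_id;
}

/* Returns the table index, or -ENODEV if unknown or if asked to stand aside. */
static int lookup_board_id(uint32_t board_id, int allow_hpsa)
{
	size_t i;

	if (allow_hpsa)
		return -ENODEV;	/* stand aside for the other driver */

	for (i = 0; i < sizeof(products) / sizeof(products[0]); i++)
		if (board_id == products[i].board_id)
			return (int)i;

	return -ENODEV;
}

/* Small usage example. */
void board_id_example(void)
{
	uint32_t id = make_board_id(0x3234, 0x103C);

	assert(lookup_board_id(id, 0) == 1);
	assert(lookup_board_id(id, 1) == -ENODEV);
}
```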
......@@ -4959,6 +4969,16 @@ static int cciss_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
ctlr_info_t *h;
unsigned long flags;
/*
* By default the cciss driver is used for all older HP Smart Array
* controllers. There are module parameters that allow a user to
* override this behavior and instead use the hpsa SCSI driver. If
* this is the case cciss may be loaded first from the kdump initrd
* image and cause a kernel panic. So if reset_devices is true and
* cciss_allow_hpsa is set just bail.
*/
if ((reset_devices) && (cciss_allow_hpsa == 1))
return -ENODEV;
rc = cciss_init_reset_devices(pdev);
if (rc) {
if (rc != -ENOTSUPP)
......
......@@ -612,6 +612,17 @@ static void bm_memset(struct drbd_bitmap *b, size_t offset, int c, size_t len)
}
}
/* For the layout, see comment above drbd_md_set_sector_offsets(). */
static u64 drbd_md_on_disk_bits(struct drbd_backing_dev *ldev)
{
u64 bitmap_sectors;
if (ldev->md.al_offset == 8)
bitmap_sectors = ldev->md.md_size_sect - ldev->md.bm_offset;
else
bitmap_sectors = ldev->md.al_offset - ldev->md.bm_offset;
return bitmap_sectors << (9 + 3);
}
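The shift by (9 + 3) in drbd_md_on_disk_bits() converts sectors to bits: a sector is 2^9 = 512 bytes and a byte is 2^3 = 8 bits, so each bitmap sector holds 4096 bits. The following userspace sketch reproduces that arithmetic. The struct md_layout type and the sample offsets are hypothetical stand-ins for the kernel's metadata structure, chosen only to exercise both branches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the metadata layout fields used above. */
struct md_layout {
	int64_t al_offset;	/* activity log offset, in sectors */
	int64_t bm_offset;	/* bitmap offset, in sectors */
	int64_t md_size_sect;	/* total metadata size, in sectors */
};

/* 512 bytes per sector (<< 9), 8 bits per byte (<< 3). */
static uint64_t sectors_to_bits(uint64_t sectors)
{
	return sectors << (9 + 3);
}

static uint64_t on_disk_bits(const struct md_layout *md)
{
	int64_t bitmap_sectors;

	assert(md != NULL);
	if (md->al_offset == 8)
		bitmap_sectors = md->md_size_sect - md->bm_offset;
	else
		bitmap_sectors = md->al_offset - md->bm_offset;

	return sectors_to_bits((uint64_t)bitmap_sectors);
}
```

For example, with al_offset = 8, bm_offset = 72 and md_size_sect = 262144, the bitmap spans 262072 sectors, which is 262072 * 4096 bits.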
/*
* make sure the bitmap has enough room for the attached storage,
* if necessary, resize.
......@@ -668,7 +679,7 @@ int drbd_bm_resize(struct drbd_conf *mdev, sector_t capacity, int set_new_bits)
words = ALIGN(bits, 64) >> LN2_BPL;
if (get_ldev(mdev)) {
u64 bits_on_disk = ((u64)mdev->ldev->md.md_size_sect-MD_BM_OFFSET) << 12;
u64 bits_on_disk = drbd_md_on_disk_bits(mdev->ldev);
put_ldev(mdev);
if (bits > bits_on_disk) {
dev_info(DEV, "bits = %lu\n", bits);
......
......@@ -313,8 +313,14 @@ static int drbd_seq_show(struct seq_file *seq, void *v)
static int drbd_proc_open(struct inode *inode, struct file *file)
{
if (try_module_get(THIS_MODULE))
return single_open(file, drbd_seq_show, PDE_DATA(inode));
int err;
if (try_module_get(THIS_MODULE)) {
err = single_open(file, drbd_seq_show, PDE_DATA(inode));
if (err)
module_put(THIS_MODULE);
return err;
}
return -ENODEV;
}
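The drbd_proc_open() change above balances the module reference when single_open() fails. A minimal userspace sketch of the same pattern follows. Everything here is a hypothetical stand-in: module_refs models the module reference count, try_get and put model try_module_get() and module_put(), and fake_single_open simulates an open that may fail.

```c
#include <assert.h>
#include <errno.h>

static int module_refs;	/* stand-in for the module reference count */

static int try_get(void)
{
	module_refs++;
	return 1;	/* always succeeds in this sketch */
}

static void put(void)
{
	assert(module_refs > 0);
	module_refs--;
}

/* Simulated open step; fails with -ENOMEM when asked to. */
static int fake_single_open(int fail)
{
	return fail ? -ENOMEM : 0;
}

static int proc_open(int fail)
{
	int err;

	if (try_get()) {
		err = fake_single_open(fail);
		if (err)
			put();	/* drop the reference taken above */
		return err;
	}
	return -ENODEV;
}
```

Without the put() on the error path, every failed open would leave one reference behind, and in the kernel case that would prevent the module from being unloaded.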
......
......@@ -850,6 +850,7 @@ int drbd_connected(struct drbd_conf *mdev)
err = drbd_send_current_state(mdev);
clear_bit(USE_DEGR_WFC_T, &mdev->flags);
clear_bit(RESIZE_PENDING, &mdev->flags);
atomic_set(&mdev->ap_in_flight, 0);
mod_timer(&mdev->request_timer, jiffies + HZ); /* just start it here. */
return err;
}
......@@ -2266,7 +2267,7 @@ static int receive_Data(struct drbd_tconn *tconn, struct packet_info *pi)
drbd_set_out_of_sync(mdev, peer_req->i.sector, peer_req->i.size);
peer_req->flags |= EE_CALL_AL_COMPLETE_IO;
peer_req->flags &= ~EE_MAY_SET_IN_SYNC;
drbd_al_begin_io(mdev, &peer_req->i);
drbd_al_begin_io(mdev, &peer_req->i, true);
}
err = drbd_submit_peer_request(mdev, peer_req, rw, DRBD_FAULT_DT_WR);
......@@ -2662,7 +2663,6 @@ static int drbd_asb_recover_1p(struct drbd_conf *mdev) __must_hold(local)
if (hg == -1 && mdev->state.role == R_PRIMARY) {
enum drbd_state_rv rv2;
drbd_set_role(mdev, R_SECONDARY, 0);
/* drbd_change_state() does not sleep while in SS_IN_TRANSIENT_STATE,
* we might be here in C_WF_REPORT_PARAMS which is transient.
* we do not need to wait for the after state change work either. */
......@@ -3993,7 +3993,7 @@ static int receive_state(struct drbd_tconn *tconn, struct packet_info *pi)
clear_bit(DISCARD_MY_DATA, &mdev->flags);
drbd_md_sync(mdev); /* update connected indicator, la_size, ... */
drbd_md_sync(mdev); /* update connected indicator, la_size_sect, ... */
return 0;
}
......@@ -4660,8 +4660,8 @@ static int drbd_do_features(struct drbd_tconn *tconn)
#if !defined(CONFIG_CRYPTO_HMAC) && !defined(CONFIG_CRYPTO_HMAC_MODULE)
static int drbd_do_auth(struct drbd_tconn *tconn)
{
dev_err(DEV, "This kernel was build without CONFIG_CRYPTO_HMAC.\n");
dev_err(DEV, "You need to disable 'cram-hmac-alg' in drbd.conf.\n");
conn_err(tconn, "This kernel was build without CONFIG_CRYPTO_HMAC.\n");
conn_err(tconn, "You need to disable 'cram-hmac-alg' in drbd.conf.\n");
return -1;
}
#else
......@@ -5258,9 +5258,11 @@ int drbd_asender(struct drbd_thread *thi)
bool ping_timeout_active = false;
struct net_conf *nc;
int ping_timeo, tcp_cork, ping_int;
struct sched_param param = { .sched_priority = 2 };
current->policy = SCHED_RR; /* Make this a realtime task! */
current->rt_priority = 2; /* more important than all other tasks */
rv = sched_setscheduler(current, SCHED_RR, &param);
if (rv < 0)
conn_err(tconn, "drbd_asender: ERROR set priority, ret=%d\n", rv);
while (get_t_state(thi) == RUNNING) {
drbd_thread_current_set_cpu(thi);
......
......@@ -34,14 +34,14 @@
static bool drbd_may_do_local_read(struct drbd_conf *mdev, sector_t sector, int size);
/* Update disk stats at start of I/O request */
static void _drbd_start_io_acct(struct drbd_conf *mdev, struct drbd_request *req, struct bio *bio)
static void _drbd_start_io_acct(struct drbd_conf *mdev, struct drbd_request *req)
{
const int rw = bio_data_dir(bio);
const int rw = bio_data_dir(req->master_bio);
int cpu;
cpu = part_stat_lock();
part_round_stats(cpu, &mdev->vdisk->part0);
part_stat_inc(cpu, &mdev->vdisk->part0, ios[rw]);
part_stat_add(cpu, &mdev->vdisk->part0, sectors[rw], bio_sectors(bio));
part_stat_add(cpu, &mdev->vdisk->part0, sectors[rw], req->i.size >> 9);
(void) cpu; /* The macro invocations above want the cpu argument, I do not like
the compiler warning about cpu only assigned but never used... */
part_inc_in_flight(&mdev->vdisk->part0, rw);
......@@ -263,8 +263,7 @@ void drbd_req_complete(struct drbd_request *req, struct bio_and_error *m)
else
root = &mdev->read_requests;
drbd_remove_request_interval(root, req);
} else if (!(s & RQ_POSTPONED))
D_ASSERT((s & (RQ_NET_MASK & ~RQ_NET_DONE)) == 0);
}
/* Before we can signal completion to the upper layers,
* we may need to close the current transfer log epoch.
......@@ -755,6 +754,11 @@ int __req_mod(struct drbd_request *req, enum drbd_req_event what,
D_ASSERT(req->rq_state & RQ_NET_PENDING);
mod_rq_state(req, m, RQ_NET_PENDING, RQ_NET_OK|RQ_NET_DONE);
break;
case QUEUE_AS_DRBD_BARRIER:
start_new_tl_epoch(mdev->tconn);
mod_rq_state(req, m, 0, RQ_NET_OK|RQ_NET_DONE);
break;
};
return rv;
......@@ -861,8 +865,10 @@ static void maybe_pull_ahead(struct drbd_conf *mdev)
bool congested = false;
enum drbd_on_congestion on_congestion;
rcu_read_lock();
nc = rcu_dereference(tconn->net_conf);
on_congestion = nc ? nc->on_congestion : OC_BLOCK;
rcu_read_unlock();
if (on_congestion == OC_BLOCK ||
tconn->agreed_pro_version < 96)
return;
......@@ -956,14 +962,8 @@ static int drbd_process_write_request(struct drbd_request *req)
struct drbd_conf *mdev = req->w.mdev;
int remote, send_oos;
rcu_read_lock();
remote = drbd_should_do_remote(mdev->state);
if (remote) {
maybe_pull_ahead(mdev);
remote = drbd_should_do_remote(mdev->state);
}
send_oos = drbd_should_send_out_of_sync(mdev->state);
rcu_read_unlock();
/* Need to replicate writes. Unless it is an empty flush,
* which is better mapped to a DRBD P_BARRIER packet,
......@@ -975,8 +975,8 @@ static int drbd_process_write_request(struct drbd_request *req)
/* The only size==0 bios we expect are empty flushes. */
D_ASSERT(req->master_bio->bi_rw & REQ_FLUSH);
if (remote)
start_new_tl_epoch(mdev->tconn);
return 0;
_req_mod(req, QUEUE_AS_DRBD_BARRIER);
return remote;
}
if (!remote && !send_oos)
......@@ -1020,12 +1020,24 @@ drbd_submit_req_private_bio(struct drbd_request *req)
bio_endio(bio, -EIO);
}
void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long start_time)
static void drbd_queue_write(struct drbd_conf *mdev, struct drbd_request *req)
{
const int rw = bio_rw(bio);
struct bio_and_error m = { NULL, };
spin_lock(&mdev->submit.lock);
list_add_tail(&req->tl_requests, &mdev->submit.writes);
spin_unlock(&mdev->submit.lock);
queue_work(mdev->submit.wq, &mdev->submit.worker);
}
/* returns the new drbd_request pointer, if the caller is expected to
* drbd_send_and_submit() it (to save latency), or NULL if we queued the
* request on the submitter thread.
* Returns ERR_PTR(-ENOMEM) if we cannot allocate a drbd_request.
*/
struct drbd_request *
drbd_request_prepare(struct drbd_conf *mdev, struct bio *bio, unsigned long start_time)
{
const int rw = bio_data_dir(bio);
struct drbd_request *req;
bool no_remote = false;
/* allocate outside of all locks; */
req = drbd_req_new(mdev, bio);
......@@ -1035,7 +1047,7 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
* if user cannot handle io errors, that's not our business. */
dev_err(DEV, "could not kmalloc() req\n");
bio_endio(bio, -ENOMEM);
return;
return ERR_PTR(-ENOMEM);
}
req->start_time = start_time;
......@@ -1044,28 +1056,40 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
req->private_bio = NULL;
}
/* For WRITES going to the local disk, grab a reference on the target
* extent. This waits for any resync activity in the corresponding
* resync extent to finish, and, if necessary, pulls in the target
* extent into the activity log, which involves further disk io because
* of transactional on-disk meta data updates.
* Empty flushes don't need to go into the activity log, they can only
* flush data for pending writes which are already in there. */
/* Update disk stats */
_drbd_start_io_acct(mdev, req);
if (rw == WRITE && req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &mdev->flags)) {
if (!drbd_al_begin_io_fastpath(mdev, &req->i)) {
drbd_queue_write(mdev, req);
return NULL;
}
req->rq_state |= RQ_IN_ACT_LOG;
drbd_al_begin_io(mdev, &req->i);
}
return req;
}
static void drbd_send_and_submit(struct drbd_conf *mdev, struct drbd_request *req)
{
const int rw = bio_rw(req->master_bio);
struct bio_and_error m = { NULL, };
bool no_remote = false;
spin_lock_irq(&mdev->tconn->req_lock);
if (rw == WRITE) {
/* This may temporarily give up the req_lock,
* but will re-acquire it before it returns here.
* Needs to be before the check on drbd_suspended() */
complete_conflicting_writes(req);
/* no more giving up req_lock from now on! */
/* check for congestion, and potentially stop sending
* full data updates, but start sending "dirty bits" only. */
maybe_pull_ahead(mdev);
}
/* no more giving up req_lock from now on! */
if (drbd_suspended(mdev)) {
/* push back and retry: */
......@@ -1078,9 +1102,6 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
goto out;
}
/* Update disk stats */
_drbd_start_io_acct(mdev, req, bio);
/* We fail READ/READA early, if we can not serve it.
* We must do this before req is registered on any lists.
* Otherwise, drbd_req_complete() will queue failed READ for retry. */
......@@ -1137,7 +1158,116 @@ void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long
if (m.bio)
complete_master_bio(mdev, &m);
}
void __drbd_make_request(struct drbd_conf *mdev, struct bio *bio, unsigned long start_time)
{
struct drbd_request *req = drbd_request_prepare(mdev, bio, start_time);
if (IS_ERR_OR_NULL(req))
return;
drbd_send_and_submit(mdev, req);
}
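drbd_request_prepare() has three outcomes: a valid request pointer (the caller submits it), NULL (the request was queued for the submitter thread), or ERR_PTR(-ENOMEM). The userspace sketch below shows how one pointer can carry all three. The ERR_PTR family is re-implemented locally to mirror the kernel helpers of the same names; struct request, enum outcome, request_prepare and make_request are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Local re-implementations that mirror the kernel's pointer-error helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for error codes. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

struct request { int id; };
enum outcome { RUN_NOW, QUEUED, NO_MEMORY };

/* Hypothetical prepare step with the same three-way return convention. */
static struct request *request_prepare(struct request *slot, enum outcome o)
{
	assert(slot != NULL);
	switch (o) {
	case RUN_NOW:
		return slot;		/* caller should submit it */
	case QUEUED:
		return NULL;		/* handed to a worker */
	default:
		return ERR_PTR(-ENOMEM);
	}
}

/* Returns 1 if the caller submitted the request itself, 0 otherwise. */
static int make_request(struct request *slot, enum outcome o)
{
	struct request *req = request_prepare(slot, o);

	if (IS_ERR_OR_NULL(req))
		return 0;
	return 1;
}
```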
static void submit_fast_path(struct drbd_conf *mdev, struct list_head *incoming)
{
struct drbd_request *req, *tmp;
list_for_each_entry_safe(req, tmp, incoming, tl_requests) {
const int rw = bio_data_dir(req->master_bio);
if (rw == WRITE /* rw != WRITE should not even end up here! */
&& req->private_bio && req->i.size
&& !test_bit(AL_SUSPENDED, &mdev->flags)) {
if (!drbd_al_begin_io_fastpath(mdev, &req->i))
continue;
req->rq_state |= RQ_IN_ACT_LOG;
}
list_del_init(&req->tl_requests);
drbd_send_and_submit(mdev, req);
}
}
static bool prepare_al_transaction_nonblock(struct drbd_conf *mdev,
struct list_head *incoming,
struct list_head *pending)
{
struct drbd_request *req, *tmp;
int wake = 0;
int err;
spin_lock_irq(&mdev->al_lock);
list_for_each_entry_safe(req, tmp, incoming, tl_requests) {
err = drbd_al_begin_io_nonblock(mdev, &req->i);
if (err == -EBUSY)
wake = 1;
if (err)
continue;
req->rq_state |= RQ_IN_ACT_LOG;
list_move_tail(&req->tl_requests, pending);
}
spin_unlock_irq(&mdev->al_lock);
if (wake)
wake_up(&mdev->al_wait);
return !list_empty(pending);
}
void do_submit(struct work_struct *ws)
{
struct drbd_conf *mdev = container_of(ws, struct drbd_conf, submit.worker);
LIST_HEAD(incoming);
LIST_HEAD(pending);
struct drbd_request *req, *tmp;
for (;;) {
spin_lock(&mdev->submit.lock);
list_splice_tail_init(&mdev->submit.writes, &incoming);
spin_unlock(&mdev->submit.lock);
submit_fast_path(mdev, &incoming);
if (list_empty(&incoming))
break;
wait_event(mdev->al_wait, prepare_al_transaction_nonblock(mdev, &incoming, &pending));
/* Maybe more was queued, while we prepared the transaction?
* Try to stuff them into this transaction as well.
* Be strictly non-blocking here, no wait_event, we already
* have something to commit.
* Stop if we don't make any more progress.
*/
for (;;) {
LIST_HEAD(more_pending);
LIST_HEAD(more_incoming);
bool made_progress;
/* It is ok to look outside the lock,
* it's only an optimization anyways */
if (list_empty(&mdev->submit.writes))
break;
spin_lock(&mdev->submit.lock);
list_splice_tail_init(&mdev->submit.writes, &more_incoming);
spin_unlock(&mdev->submit.lock);
if (list_empty(&more_incoming))
break;
made_progress = prepare_al_transaction_nonblock(mdev, &more_incoming, &more_pending);
list_splice_tail_init(&more_pending, &pending);
list_splice_tail_init(&more_incoming, &incoming);
if (!made_progress)
break;
}
drbd_al_begin_io_commit(mdev, false);
list_for_each_entry_safe(req, tmp, &pending, tl_requests) {
list_del_init(&req->tl_requests);
drbd_send_and_submit(mdev, req);
}
}
}
void drbd_make_request(struct request_queue *q, struct bio *bio)
......
......@@ -88,6 +88,14 @@ enum drbd_req_event {
QUEUE_FOR_NET_READ,
QUEUE_FOR_SEND_OOS,
/* An empty flush is queued as P_BARRIER,
* which will cause it to complete "successfully",
* even if the local disk flush failed.
*
* Just like "real" requests, empty flushes (blkdev_issue_flush()) will
* only see an error if neither local nor remote data is reachable. */
QUEUE_AS_DRBD_BARRIER,
SEND_CANCELED,
SEND_FAILED,
HANDED_OVER_TO_NETWORK,
......
......@@ -570,6 +570,13 @@ is_valid_state(struct drbd_conf *mdev, union drbd_state ns)
mdev->tconn->agreed_pro_version < 88)
rv = SS_NOT_SUPPORTED;
else if (ns.role == R_PRIMARY && ns.disk < D_UP_TO_DATE && ns.pdsk < D_UP_TO_DATE)
rv = SS_NO_UP_TO_DATE_DISK;
else if ((ns.conn == C_STARTING_SYNC_S || ns.conn == C_STARTING_SYNC_T) &&
ns.pdsk == D_UNKNOWN)
rv = SS_NEED_CONNECTION;
else if (ns.conn >= C_CONNECTED && ns.pdsk == D_UNKNOWN)
rv = SS_CONNECTED_OUTDATES;
......@@ -635,6 +642,10 @@ is_valid_soft_transition(union drbd_state os, union drbd_state ns, struct drbd_t
&& os.conn < C_WF_REPORT_PARAMS)
rv = SS_NEED_CONNECTION; /* No NetworkFailure -> SyncTarget etc... */
if (ns.conn == C_DISCONNECTING && ns.pdsk == D_OUTDATED &&
os.conn < C_CONNECTED && os.pdsk > D_OUTDATED)
rv = SS_OUTDATE_WO_CONN;
return rv;
}
......@@ -1377,13 +1388,6 @@ static void after_state_ch(struct drbd_conf *mdev, union drbd_state os,
&drbd_bmio_set_n_write, &abw_start_sync,
"set_n_write from StartingSync", BM_LOCKED_TEST_ALLOWED);
/* We are invalidating our self... */
if (os.conn < C_CONNECTED && ns.conn < C_CONNECTED &&
os.disk > D_INCONSISTENT && ns.disk == D_INCONSISTENT)
/* other bitmap operation expected during this phase */
drbd_queue_bitmap_io(mdev, &drbd_bmio_set_n_write, NULL,
"set_n_write from invalidate", BM_LOCKED_MASK);
/* first half of local IO error, failure to attach,
* or administrative detach */
if (os.disk != D_FAILED && ns.disk == D_FAILED) {
......@@ -1748,13 +1752,9 @@ _conn_rq_cond(struct drbd_tconn *tconn, union drbd_state mask, union drbd_state
if (test_and_clear_bit(CONN_WD_ST_CHG_FAIL, &tconn->flags))
return SS_CW_FAILED_BY_PEER;
rv = tconn->cstate != C_WF_REPORT_PARAMS ? SS_CW_NO_NEED : SS_UNKNOWN_ERROR;
if (rv == SS_UNKNOWN_ERROR)
rv = conn_is_valid_transition(tconn, mask, val, 0);
if (rv == SS_SUCCESS)
rv = SS_UNKNOWN_ERROR; /* cont waiting, otherwise fail. */
if (rv == SS_SUCCESS && tconn->cstate == C_WF_REPORT_PARAMS)
rv = SS_UNKNOWN_ERROR; /* continue waiting */
return rv;
}
......
......@@ -89,6 +89,7 @@ static const char *drbd_state_sw_errors[] = {
[-SS_LOWER_THAN_OUTDATED] = "Disk state is lower than outdated",
[-SS_IN_TRANSIENT_STATE] = "In transient state, retry after next state change",
[-SS_CONCURRENT_ST_CHG] = "Concurrent state changes detected and aborted",
[-SS_OUTDATE_WO_CONN] = "Need a connection for a graceful disconnect/outdate peer",
[-SS_O_VOL_PEER_PRI] = "Other vol primary on peer not allowed by config",
};
......
......@@ -89,6 +89,7 @@ void drbd_md_io_complete(struct bio *bio, int error)
md_io->done = 1;
wake_up(&mdev->misc_wait);
bio_put(bio);
if (mdev->ldev) /* special case: drbd_md_read() during drbd_adm_attach() */
put_ldev(mdev);
}
......@@ -1410,7 +1411,7 @@ int w_restart_disk_io(struct drbd_work *w, int cancel)
struct drbd_conf *mdev = w->mdev;
if (bio_data_dir(req->master_bio) == WRITE && req->rq_state & RQ_IN_ACT_LOG)
drbd_al_begin_io(mdev, &req->i);
drbd_al_begin_io(mdev, &req->i, false);
drbd_req_make_private_bio(req, req->master_bio);
req->private_bio->bi_bdev = mdev->ldev->backing_bdev;
......@@ -1425,7 +1426,7 @@ static int _drbd_may_sync_now(struct drbd_conf *mdev)
int resync_after;
while (1) {
if (!odev->ldev)
if (!odev->ldev || odev->state.disk == D_DISKLESS)
return 1;
rcu_read_lock();
resync_after = rcu_dereference(odev->ldev->disk_conf)->resync_after;
......@@ -1433,7 +1434,7 @@ static int _drbd_may_sync_now(struct drbd_conf *mdev)
if (resync_after == -1)
return 1;
odev = minor_to_mdev(resync_after);
if (!expect(odev))
if (!odev)
return 1;
if ((odev->state.conn >= C_SYNC_SOURCE &&
odev->state.conn <= C_PAUSED_SYNC_T) ||
......@@ -1515,7 +1516,7 @@ enum drbd_ret_code drbd_resync_after_valid(struct drbd_conf *mdev, int o_minor)
if (o_minor == -1)
return NO_ERROR;
if (o_minor < -1 || minor_to_mdev(o_minor) == NULL)
if (o_minor < -1 || o_minor > MINORMASK)
return ERR_RESYNC_AFTER;
/* check for loops */
......@@ -1524,6 +1525,15 @@ enum drbd_ret_code drbd_resync_after_valid(struct drbd_conf *mdev, int o_minor)
if (odev == mdev)
return ERR_RESYNC_AFTER_CYCLE;
/* You are free to depend on diskless, non-existing,
* or not yet/no longer existing minors.
* We only reject dependency loops.
* We cannot follow the dependency chain beyond a detached or
* missing minor.
*/
if (!odev || !odev->ldev || odev->state.disk == D_DISKLESS)
return NO_ERROR;
rcu_read_lock();
resync_after = rcu_dereference(odev->ldev->disk_conf)->resync_after;
rcu_read_unlock();
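The loop check in drbd_resync_after_valid() follows the resync-after chain and rejects only dependency loops, stopping at a missing or diskless minor. A minimal userspace sketch of that walk is shown here. The types and names (struct minor_dev, enum dep_result, NO_DEP) are hypothetical, the table size ndevs plays the role of the minor number limit, and the step bound is a defensive addition that the source does not have.

```c
#include <assert.h>
#include <stddef.h>

#define NO_DEP (-1)

enum dep_result { DEP_OK, DEP_INVALID, DEP_CYCLE };

/* Hypothetical per-minor state. */
struct minor_dev {
	int present;		/* does this minor exist? */
	int has_disk;		/* is a backing disk attached? */
	int resync_after;	/* minor this one waits for, or NO_DEP */
};

/* May device "self" be configured to resync after minor "o_minor"? */
static enum dep_result resync_after_valid(const struct minor_dev *devs,
					  int ndevs, int self, int o_minor)
{
	int cur = o_minor;
	int steps;

	assert(devs != NULL && self >= 0 && self < ndevs);
	if (o_minor == NO_DEP)
		return DEP_OK;
	if (o_minor < NO_DEP || o_minor >= ndevs)
		return DEP_INVALID;

	for (steps = 0; steps < ndevs; steps++) {
		if (cur == self)
			return DEP_CYCLE;
		/* Cannot follow the chain beyond a missing or diskless minor. */
		if (!devs[cur].present || !devs[cur].has_disk)
			return DEP_OK;
		cur = devs[cur].resync_after;
		if (cur == NO_DEP || cur < 0 || cur >= ndevs)
			return DEP_OK;
	}
	/* More hops than devices: the chain must already contain a loop. */
	return DEP_CYCLE;
}
```

With three devices chained 0 -> 1 -> 2, asking device 2 to resync after device 0 is rejected as a cycle, while the same request is accepted once device 1 is diskless, because the chain can no longer be followed through it.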
......@@ -1652,7 +1662,9 @@ void drbd_start_resync(struct drbd_conf *mdev, enum drbd_conns side)
clear_bit(B_RS_H_DONE, &mdev->flags);
write_lock_irq(&global_state_lock);
if (!get_ldev_if_state(mdev, D_NEGOTIATING)) {
/* Did some connection breakage or IO error race with us? */
if (mdev->state.conn < C_CONNECTED
|| !get_ldev_if_state(mdev, D_NEGOTIATING)) {
write_unlock_irq(&global_state_lock);
mutex_unlock(mdev->state_mutex);
return;
......
......@@ -780,6 +780,7 @@ static const struct block_device_operations mg_disk_ops = {
.getgeo = mg_getgeo
};
#ifdef CONFIG_PM_SLEEP
static int mg_suspend(struct device *dev)
{
struct mg_drv_data *prv_data = dev->platform_data;
......@@ -824,6 +825,7 @@ static int mg_resume(struct device *dev)
return 0;
}
#endif
static SIMPLE_DEV_PM_OPS(mg_pm, mg_suspend, mg_resume);
......
......@@ -728,6 +728,9 @@ static void mtip_async_complete(struct mtip_port *port,
atomic_set(&port->commands[tag].active, 0);
release_slot(port, tag);
if (unlikely(command->unaligned))
up(&port->cmd_slot_unal);
else
up(&port->cmd_slot);
}
......@@ -1560,10 +1563,12 @@ static int mtip_get_identify(struct mtip_port *port, void __user *user_buffer)
}
#endif
#ifdef MTIP_TRIM /* Disabling TRIM support temporarily */
/* Demux ID.DRAT & ID.RZAT to determine trim support */
if (port->identify[69] & (1 << 14) && port->identify[69] & (1 << 5))
port->dd->trim_supp = true;
else
#endif
port->dd->trim_supp = false;
/* Set the identify buffer as valid. */
......@@ -2557,7 +2562,7 @@ static int mtip_hw_ioctl(struct driver_data *dd, unsigned int cmd,
*/
static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
int nsect, int nents, int tag, void *callback,
void *data, int dir)
void *data, int dir, int unaligned)
{
struct host_to_dev_fis *fis;
struct mtip_port *port = dd->port;
......@@ -2570,6 +2575,7 @@ static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
command->scatter_ents = nents;
command->unaligned = unaligned;
/*
* The number of retries for this command before it is
* reported as a failure to the upper layers.
......@@ -2598,6 +2604,9 @@ static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
fis->res3 = 0;
fill_command_sg(dd, command, nents);
if (unaligned)
fis->device |= 1 << 7;
/* Populate the command header */
command->command_header->opts =
__force_bit2int cpu_to_le32(
......@@ -2644,9 +2653,13 @@ static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
* return value
* None
*/
static void mtip_hw_release_scatterlist(struct driver_data *dd, int tag)
static void mtip_hw_release_scatterlist(struct driver_data *dd, int tag,
int unaligned)
{
struct semaphore *sem = unaligned ? &dd->port->cmd_slot_unal :
&dd->port->cmd_slot;
release_slot(dd->port, tag);
up(sem);
}
/*
......@@ -2661,22 +2674,25 @@ static void mtip_hw_release_scatterlist(struct driver_data *dd, int tag)
* or NULL if no command slots are available.
*/
static struct scatterlist *mtip_hw_get_scatterlist(struct driver_data *dd,
int *tag)
int *tag, int unaligned)
{
struct semaphore *sem = unaligned ? &dd->port->cmd_slot_unal :
&dd->port->cmd_slot;
/*
* It is possible that, even with this semaphore, a thread
* may think that no command slots are available. Therefore, we
* need to make an attempt to get_slot().
*/
down(&dd->port->cmd_slot);
down(sem);
*tag = get_slot(dd->port);
if (unlikely(test_bit(MTIP_DDF_REMOVE_PENDING_BIT, &dd->dd_flag))) {
up(&dd->port->cmd_slot);
up(sem);
return NULL;
}
if (unlikely(*tag < 0)) {
up(&dd->port->cmd_slot);
up(sem);
return NULL;
}
......@@ -3010,6 +3026,11 @@ static inline void hba_setup(struct driver_data *dd)
dd->mmio + HOST_HSORG);
}
static int mtip_device_unaligned_constrained(struct driver_data *dd)
{
return (dd->pdev->device == P420M_DEVICE_ID ? 1 : 0);
}
/*
* Detect the details of the product, and store anything needed
* into the driver data structure. This includes product type and
......@@ -3232,8 +3253,15 @@ static int mtip_hw_init(struct driver_data *dd)
for (i = 0; i < MTIP_MAX_SLOT_GROUPS; i++)
dd->work[i].port = dd->port;
/* Enable unaligned IO constraints for some devices */
if (mtip_device_unaligned_constrained(dd))
dd->unal_qdepth = MTIP_MAX_UNALIGNED_SLOTS;
else
dd->unal_qdepth = 0;
/* Counting semaphore to track command slot usage */
sema_init(&dd->port->cmd_slot, num_command_slots - 1);
sema_init(&dd->port->cmd_slot, num_command_slots - 1 - dd->unal_qdepth);
sema_init(&dd->port->cmd_slot_unal, dd->unal_qdepth);
/* Spinlock to prevent concurrent issue */
for (i = 0; i < MTIP_MAX_SLOT_GROUPS; i++)
......@@ -3836,7 +3864,7 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
struct scatterlist *sg;
struct bio_vec *bvec;
int nents = 0;
int tag = 0;
int tag = 0, unaligned = 0;
if (unlikely(dd->dd_flag & MTIP_DDF_STOP_IO)) {
if (unlikely(test_bit(MTIP_DDF_REMOVE_PENDING_BIT,
......@@ -3872,7 +3900,15 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
return;
}
sg = mtip_hw_get_scatterlist(dd, &tag);
if (bio_data_dir(bio) == WRITE && bio_sectors(bio) <= 64 &&
dd->unal_qdepth) {
if (bio->bi_sector % 8 != 0) /* Unaligned on 4k boundaries */
unaligned = 1;
else if (bio_sectors(bio) % 8 != 0) /* Aligned but not 4k/8k */
unaligned = 1;
}
sg = mtip_hw_get_scatterlist(dd, &tag, unaligned);
if (likely(sg != NULL)) {
blk_queue_bounce(queue, &bio);
......@@ -3880,7 +3916,7 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
dev_warn(&dd->pdev->dev,
"Maximum number of SGL entries exceeded\n");
bio_io_error(bio);
mtip_hw_release_scatterlist(dd, tag);
mtip_hw_release_scatterlist(dd, tag, unaligned);
return;
}
......@@ -3900,7 +3936,8 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
tag,
bio_endio,
bio,
bio_data_dir(bio));
bio_data_dir(bio),
unaligned);
} else
bio_io_error(bio);
}
......@@ -4156,26 +4193,24 @@ static int mtip_block_remove(struct driver_data *dd)
*/
static int mtip_block_shutdown(struct driver_data *dd)
{
/* Delete our gendisk structure, and cleanup the blk queue. */
if (dd->disk) {
dev_info(&dd->pdev->dev,
"Shutting down %s ...\n", dd->disk->disk_name);
/* Delete our gendisk structure, and cleanup the blk queue. */
if (dd->disk) {
if (dd->disk->queue)
if (dd->disk->queue) {
del_gendisk(dd->disk);
else
blk_cleanup_queue(dd->queue);
} else
put_disk(dd->disk);
dd->disk = NULL;
dd->queue = NULL;
}
spin_lock(&rssd_index_lock);
ida_remove(&rssd_index_ida, dd->index);
spin_unlock(&rssd_index_lock);
blk_cleanup_queue(dd->queue);
dd->disk = NULL;
dd->queue = NULL;
mtip_hw_shutdown(dd);
return 0;
}
......
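The classification rule added to mtip_make_request() above can be read in isolation. The following is a minimal user-space sketch of that rule, not driver code: demo_is_unaligned(), SECTORS_PER_4K and SMALL_WRITE_MAX are hypothetical names introduced here, and 512-byte sectors are assumed (so 8 sectors make up 4 KiB).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTORS_PER_4K  8   /* assumption: 8 x 512-byte sectors = 4 KiB */
#define SMALL_WRITE_MAX 64  /* same 64-sector cutoff as the hunk above */

/*
 * Hypothetical helper (not part of the driver): returns true when a
 * request would be routed to the "unaligned" slot pool under the rule
 * shown in the mtip_make_request() hunk.
 */
static bool demo_is_unaligned(bool is_write, uint64_t start_sector,
                              unsigned int nr_sectors, int unal_qdepth)
{
    assert(unal_qdepth >= 0);

    /* Only small writes on constrained devices are ever classified. */
    if (!is_write || nr_sectors > SMALL_WRITE_MAX || unal_qdepth == 0)
        return false;

    /* Start not on a 4 KiB boundary. */
    if (start_sector % SECTORS_PER_4K != 0)
        return true;

    /* Aligned start, but length is not a multiple of 4 KiB. */
    return nr_sectors % SECTORS_PER_4K != 0;
}
```

Note that reads, writes larger than 64 sectors, and any request on a device with unal_qdepth == 0 always fall through to the ordinary slot pool.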
......@@ -52,6 +52,9 @@
#define MTIP_FTL_REBUILD_MAGIC 0xED51
#define MTIP_FTL_REBUILD_TIMEOUT_MS 2400000
/* unaligned IO handling */
#define MTIP_MAX_UNALIGNED_SLOTS 8
/* Macro to extract the tag bit number from a tag value. */
#define MTIP_TAG_BIT(tag) (tag & 0x1F)
......@@ -333,6 +336,8 @@ struct mtip_cmd {
int scatter_ents; /* Number of scatter list entries used */
int unaligned; /* command is unaligned on 4k boundary */
struct scatterlist sg[MTIP_MAX_SG]; /* Scatter list entries */
int retries; /* The number of retries left for this command. */
......@@ -452,6 +457,10 @@ struct mtip_port {
* command slots available.
*/
struct semaphore cmd_slot;
/* Semaphore to control queue depth of unaligned IOs */
struct semaphore cmd_slot_unal;
/* Spinlock for working around command-issue bug. */
spinlock_t cmd_issue_lock[MTIP_MAX_SLOT_GROUPS];
};
......@@ -502,6 +511,8 @@ struct driver_data {
int isr_binding;
int unal_qdepth; /* qdepth of unaligned IO queue */
struct list_head online_list; /* linkage for online list */
struct list_head remove_list; /* linkage for removing list */
......
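The two semaphores (cmd_slot and cmd_slot_unal) split one pool of command slots into an ordinary budget and a small reserved budget for unaligned writes. The sketch below models that split with plain counters instead of kernel semaphores, so it never blocks; struct demo_port and the demo_* functions are hypothetical, and the value passed as num_command_slots is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_UNALIGNED_SLOTS 8 /* mirrors MTIP_MAX_UNALIGNED_SLOTS */

/* Hypothetical stand-in for the two counting semaphores in struct mtip_port. */
struct demo_port {
    int cmd_slot;      /* free slots for ordinary commands */
    int cmd_slot_unal; /* free slots reserved for unaligned writes */
};

/* Split the slot budget the same way the sema_init() calls do. */
static void demo_port_init(struct demo_port *p, int num_command_slots,
                           bool constrained)
{
    int unal_qdepth = constrained ? DEMO_MAX_UNALIGNED_SLOTS : 0;

    assert(num_command_slots - 1 > unal_qdepth);
    p->cmd_slot = num_command_slots - 1 - unal_qdepth;
    p->cmd_slot_unal = unal_qdepth;
}

/* Non-blocking analogue of down(): take a slot from the matching pool. */
static bool demo_try_get(struct demo_port *p, bool unaligned)
{
    int *pool = unaligned ? &p->cmd_slot_unal : &p->cmd_slot;

    if (*pool == 0)
        return false;
    (*pool)--;
    return true;
}

/* Analogue of up(): return a slot to the pool it was taken from. */
static void demo_put(struct demo_port *p, bool unaligned)
{
    int *pool = unaligned ? &p->cmd_slot_unal : &p->cmd_slot;

    (*pool)++;
}
```

The point of the design is that a burst of unaligned writes can exhaust only its own small pool, while aligned I/O keeps the remaining slots. This is also why the release path needs the unaligned flag: the slot has to go back to the pool it came from.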
......@@ -174,6 +174,8 @@ config MD_FAULTY
If unsure, say N.
source "drivers/md/bcache/Kconfig"
config BLK_DEV_DM
tristate "Device mapper support"
---help---
......
......@@ -29,6 +29,7 @@ obj-$(CONFIG_MD_RAID10) += raid10.o
obj-$(CONFIG_MD_RAID456) += raid456.o
obj-$(CONFIG_MD_MULTIPATH) += multipath.o
obj-$(CONFIG_MD_FAULTY) += faulty.o
obj-$(CONFIG_BCACHE) += bcache/
obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
......
config BCACHE
tristate "Block device as cache"
select CLOSURES
---help---
Allows a block device to be used as cache for other devices; uses
a btree for indexing and the layout is optimized for SSDs.
See Documentation/bcache.txt for details.
config BCACHE_DEBUG
bool "Bcache debugging"
depends on BCACHE
---help---
Don't select this option unless you're a developer.
Enables extra debugging tools (primarily a fuzz tester).
config BCACHE_EDEBUG
bool "Extended runtime checks"
depends on BCACHE
---help---
Don't select this option unless you're a developer.
Enables extra runtime checks which significantly affect performance.
config BCACHE_CLOSURES_DEBUG
bool "Debug closures"
depends on BCACHE
select DEBUG_FS
---help---
Keeps all active closures in a linked list and provides a debugfs
interface to list them, which makes it possible to see asynchronous
operations that get stuck.
# cgroup code needs to be updated:
#
#config CGROUP_BCACHE
# bool "Cgroup controls for bcache"
# depends on BCACHE && BLK_CGROUP
# ---help---
# TODO
obj-$(CONFIG_BCACHE) += bcache.o
bcache-y := alloc.o btree.o bset.o io.o journal.o writeback.o\
movinggc.o request.o super.o sysfs.o debug.o util.o trace.o stats.o closure.o
CFLAGS_request.o += -Iblock
#ifndef _BCACHE_DEBUG_H
#define _BCACHE_DEBUG_H
/* Btree/bkey debug printing */
#define KEYHACK_SIZE 80
struct keyprint_hack {
char s[KEYHACK_SIZE];
};
struct keyprint_hack bch_pkey(const struct bkey *k);
struct keyprint_hack bch_pbtree(const struct btree *b);
#define pkey(k) (&bch_pkey(k).s[0])
#define pbtree(b) (&bch_pbtree(b).s[0])
#ifdef CONFIG_BCACHE_EDEBUG
unsigned bch_count_data(struct btree *);
void bch_check_key_order_msg(struct btree *, struct bset *, const char *, ...);
void bch_check_keys(struct btree *, const char *, ...);
#define bch_check_key_order(b, i) \
bch_check_key_order_msg(b, i, "keys out of order")
#define EBUG_ON(cond) BUG_ON(cond)
#else /* EDEBUG */
#define bch_count_data(b) 0
#define bch_check_key_order(b, i) do {} while (0)
#define bch_check_key_order_msg(b, i, ...) do {} while (0)
#define bch_check_keys(b, ...) do {} while (0)
#define EBUG_ON(cond) do {} while (0)
#endif
#ifdef CONFIG_BCACHE_DEBUG
void bch_btree_verify(struct btree *, struct bset *);
void bch_data_verify(struct search *);
#else /* DEBUG */
static inline void bch_btree_verify(struct btree *b, struct bset *i) {}
static inline void bch_data_verify(struct search *s) {}
#endif
#ifdef CONFIG_DEBUG_FS
void bch_debug_init_cache_set(struct cache_set *);
#else
static inline void bch_debug_init_cache_set(struct cache_set *c) {}
#endif
#endif
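The pkey()/pbtree() macros rely on a function returning a small struct by value, so the formatted string lives in a temporary and can be handed straight to a printf-style call with no caller-owned buffer. A minimal user-space sketch of the same trick follows. struct demo_key, demo_pkey_buf() and demo_pkey() are hypothetical, and the "inode:offset" output format is an assumption, not the format bcache uses. The returned pointer is only valid inside the expression that contains the macro call (C11 temporary lifetime), so it must not be stored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEMO_KEYHACK_SIZE 80

/* Hypothetical key type, standing in for struct bkey. */
struct demo_key {
    unsigned int inode;
    unsigned long long offset;
};

/* Fixed-size buffer wrapped in a struct so it can be returned by value. */
struct demo_keyprint_hack {
    char s[DEMO_KEYHACK_SIZE];
};

/* Format the key into a struct that is returned by value. */
static struct demo_keyprint_hack demo_pkey_buf(const struct demo_key *k)
{
    struct demo_keyprint_hack r;

    assert(k != NULL);
    snprintf(r.s, sizeof(r.s), "%u:%llu", k->inode, k->offset);
    return r;
}

/* Pointer into the temporary; valid only within the enclosing expression. */
#define demo_pkey(k) (&demo_pkey_buf(k).s[0])
```

A typical call would be printf("key %s\n", demo_pkey(&k)); saving the pointer in a variable and using it in a later statement would read a dead temporary.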
#ifndef _BCACHE_JOURNAL_H
#define _BCACHE_JOURNAL_H
/*
* THE JOURNAL:
*
* The journal is treated as a circular buffer of buckets - a journal entry
* never spans two buckets. This will let us resize the journal at runtime (not
* implemented yet), and will be needed for bcache on raw flash support.
*
* Journal entries contain a list of keys, ordered by the time they were
* inserted; thus journal replay just has to reinsert the keys.
*
* We also keep some things in the journal header that are logically part of the
* superblock - all the things that are frequently updated. This is for future
* bcache on raw flash support; the superblock (which will become another
* journal) can't be moved or wear leveled, so it contains just enough
* information to find the main journal, and the superblock only has to be
* rewritten when we want to move/wear level the main journal.
*
* Currently, we don't journal BTREE_REPLACE operations - this will hopefully be
* fixed eventually. This isn't a bug - BTREE_REPLACE is used for insertions
* from cache misses, which don't have to be journaled, and for writeback and
* moving gc we work around it by flushing the btree to disk before updating the
* gc information. But it is a potential issue with incremental garbage
* collection, and it's fragile.
*
* OPEN JOURNAL ENTRIES:
*
* Each journal entry contains, in the header, the sequence number of the last
* journal entry still open - i.e. that has keys that haven't been flushed to
* disk in the btree.
*
* We track this by maintaining a refcount for every open journal entry, in a
* fifo; each entry in the fifo corresponds to a particular journal
* entry/sequence number. When the refcount at the tail of the fifo goes to
* zero, we pop it off - thus, the size of the fifo tells us the number of open
* journal entries.
*
* We take a refcount on a journal entry when we add some keys to a journal
* entry that we're going to insert (held by struct btree_op), and then when we
* insert those keys into the btree the btree write we're setting up takes a
* copy of that refcount (held by struct btree_write). That refcount is dropped
* when the btree write completes.
*
* A struct btree_write can only hold a refcount on a single journal entry, but
* might contain keys for many journal entries - we handle this by making sure
* it always has a refcount on the _oldest_ journal entry of all the journal
* entries it has keys for.
*
* JOURNAL RECLAIM:
*
* As mentioned previously, our fifo of refcounts tells us the number of open
* journal entries; from that and the current journal sequence number we compute
* last_seq - the oldest journal entry we still need. We write last_seq in each
* journal entry, and we also have to keep track of where it exists on disk so
* we don't overwrite it when we loop around the journal.
*
* To do that we track, for each journal bucket, the sequence number of the
* newest journal entry it contains - if we don't need that journal entry we
* don't need anything in that bucket anymore. From that we track the last
* journal bucket we still need; all this is tracked in struct journal_device
* and updated by journal_reclaim().
*
* JOURNAL FILLING UP:
*
* There are two ways the journal could fill up; either we could run out of
* space to write to, or we could have too many open journal entries and run out
* of room in the fifo of refcounts. Since those refcounts are decremented
* without any locking we can't safely resize that fifo, so we handle it the
* same way.
*
* If the journal fills up, we start flushing dirty btree nodes until we can
* allocate space for a journal write again - preferentially flushing btree
* nodes that are pinning the oldest journal entries first.
*/
#define BCACHE_JSET_VERSION_UUIDv1 1
/* Always latest UUID format */
#define BCACHE_JSET_VERSION_UUID 1
#define BCACHE_JSET_VERSION 1
/*
* On disk format for a journal entry:
* seq is monotonically increasing; every journal entry has its own unique
* sequence number.
*
* last_seq is the oldest journal entry that still has keys the btree hasn't
* flushed to disk yet.
*
* version is for on disk format changes.
*/
struct jset {
uint64_t csum;
uint64_t magic;
uint64_t seq;
uint32_t version;
uint32_t keys;
uint64_t last_seq;
BKEY_PADDED(uuid_bucket);
BKEY_PADDED(btree_root);
uint16_t btree_level;
uint16_t pad[3];
uint64_t prio_bucket[MAX_CACHES_PER_SET];
union {
struct bkey start[0];
uint64_t d[0];
};
};
/*
* Only used for holding the journal entries we read in bch_journal_read()
* during cache registration
*/
struct journal_replay {
struct list_head list;
atomic_t *pin;
struct jset j;
};
/*
* We put two of these in struct journal; we use them for writes to the
* journal that are being staged or in flight.
*/
struct journal_write {
struct jset *data;
#define JSET_BITS 3
struct cache_set *c;
struct closure_waitlist wait;
bool need_write;
};
/* Embedded in struct cache_set */
struct journal {
spinlock_t lock;
/* used when waiting because the journal was full */
struct closure_waitlist wait;
struct closure_with_timer io;
/* Number of blocks free in the bucket(s) we're currently writing to */
unsigned blocks_free;
uint64_t seq;
DECLARE_FIFO(atomic_t, pin);
BKEY_PADDED(key);
struct journal_write w[2], *cur;
};
/*
* Embedded in struct cache. First three fields refer to the array of journal
* buckets, in cache_sb.
*/
struct journal_device {
/*
* For each journal bucket, contains the max sequence number of the
* journal writes it contains - so we know when a bucket can be reused.
*/
uint64_t seq[SB_JOURNAL_BUCKETS];
/* Journal bucket we're currently writing to */
unsigned cur_idx;
/* Last journal bucket that still contains an open journal entry */
unsigned last_idx;
/* Next journal bucket to be discarded */
unsigned discard_idx;
#define DISCARD_READY 0
#define DISCARD_IN_FLIGHT 1
#define DISCARD_DONE 2
/* One of DISCARD_READY, DISCARD_IN_FLIGHT or DISCARD_DONE */
atomic_t discard_in_flight;
struct work_struct discard_work;
struct bio discard_bio;
struct bio_vec discard_bv;
/* Bio for journal reads/writes to this device */
struct bio bio;
struct bio_vec bv[8];
};
#define journal_pin_cmp(c, l, r) \
(fifo_idx(&(c)->journal.pin, (l)->journal) > \
fifo_idx(&(c)->journal.pin, (r)->journal))
#define JOURNAL_PIN 20000
#define journal_full(j) \
(!(j)->blocks_free || fifo_free(&(j)->pin) <= 1)
struct closure;
struct cache_set;
struct btree_op;
void bch_journal(struct closure *);
void bch_journal_next(struct journal *);
void bch_journal_mark(struct cache_set *, struct list_head *);
void bch_journal_meta(struct cache_set *, struct closure *);
int bch_journal_read(struct cache_set *, struct list_head *,
struct btree_op *);
int bch_journal_replay(struct cache_set *, struct list_head *,
struct btree_op *);
void bch_journal_free(struct cache_set *);
int bch_journal_alloc(struct cache_set *);
#endif /* _BCACHE_JOURNAL_H */
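The "OPEN JOURNAL ENTRIES" and "JOURNAL RECLAIM" comments above describe a fifo of refcounts from which last_seq is derived. The sketch below is a simplified single-threaded model of that bookkeeping. All demo_* names are hypothetical, the fifo is a tiny fixed array rather than DECLARE_FIFO, and the formula last_seq = seq - (open entries) + 1 is an assumption inferred from the comment, not taken from the collapsed source files.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PIN_SLOTS 4 /* tiny on purpose; the header sizes it with JOURNAL_PIN */

struct demo_journal {
    uint64_t seq;                     /* sequence number of the newest entry */
    unsigned int pin[DEMO_PIN_SLOTS]; /* one refcount per open entry */
    unsigned int front;               /* slot of the oldest open entry */
    unsigned int used;                /* number of open entries */
};

/* Open a new journal entry holding one reference; false if the fifo is full. */
static bool demo_journal_open(struct demo_journal *j)
{
    if (j->used == DEMO_PIN_SLOTS)
        return false;
    j->pin[(j->front + j->used) % DEMO_PIN_SLOTS] = 1;
    j->used++;
    j->seq++;
    return true;
}

/* Oldest sequence number still needed (assumed formula, see lead-in). */
static uint64_t demo_last_seq(const struct demo_journal *j)
{
    return j->seq - j->used + 1;
}

/* Drop one reference on entry `seq`, then pop released entries off the old end. */
static void demo_journal_put(struct demo_journal *j, uint64_t seq)
{
    uint64_t oldest = demo_last_seq(j);
    unsigned int idx;

    assert(j->used > 0 && seq >= oldest && seq <= j->seq);
    idx = (j->front + (unsigned int)(seq - oldest)) % DEMO_PIN_SLOTS;
    assert(j->pin[idx] > 0);
    j->pin[idx]--;

    /* Entries are only reclaimed in order, starting from the oldest. */
    while (j->used > 0 && j->pin[j->front] == 0) {
        j->front = (j->front + 1) % DEMO_PIN_SLOTS;
        j->used--;
    }
}
```

Releasing a newer entry before an older one does not advance last_seq: the older entry keeps pinning everything after it, which is why the header talks about flushing the btree nodes that pin the oldest journal entries first. Unlike journal_full() above, this model only refuses new entries when the fifo is completely full and does not keep a slot of headroom.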
#ifndef _BCACHE_STATS_H_
#define _BCACHE_STATS_H_
struct cache_stat_collector {
atomic_t cache_hits;
atomic_t cache_misses;
atomic_t cache_bypass_hits;
atomic_t cache_bypass_misses;
atomic_t cache_readaheads;
atomic_t cache_miss_collisions;
atomic_t sectors_bypassed;
};
struct cache_stats {
struct kobject kobj;
unsigned long cache_hits;
unsigned long cache_misses;
unsigned long cache_bypass_hits;
unsigned long cache_bypass_misses;
unsigned long cache_readaheads;
unsigned long cache_miss_collisions;
unsigned long sectors_bypassed;
unsigned rescale;
};
struct cache_accounting {
struct closure cl;
struct timer_list timer;
atomic_t closing;
struct cache_stat_collector collector;
struct cache_stats total;
struct cache_stats five_minute;
struct cache_stats hour;
struct cache_stats day;
};
struct search;
void bch_cache_accounting_init(struct cache_accounting *acc,
struct closure *parent);
int bch_cache_accounting_add_kobjs(struct cache_accounting *acc,
struct kobject *parent);
void bch_cache_accounting_clear(struct cache_accounting *acc);
void bch_cache_accounting_destroy(struct cache_accounting *acc);
void bch_mark_cache_accounting(struct search *s, bool hit, bool bypass);
void bch_mark_cache_readahead(struct search *s);
void bch_mark_cache_miss_collision(struct search *s);
void bch_mark_sectors_bypassed(struct search *s, int sectors);
#endif /* _BCACHE_STATS_H_ */
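The header pairs an atomic collector (struct cache_stat_collector) with plain unsigned long totals (struct cache_stats), but the code that moves counts from one to the other is in a collapsed file. The following is only one plausible arrangement, sketched in user space with C11 atomics: hot paths bump the atomic counters, and a periodic step drains them into the totals. Every demo_* name is hypothetical, and the hit-ratio helper is an illustration rather than something shown in the header.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-counter subset of struct cache_stat_collector. */
struct demo_collector {
    atomic_uint cache_hits;
    atomic_uint cache_misses;
};

/* Hypothetical two-counter subset of struct cache_stats. */
struct demo_stats {
    unsigned long cache_hits;
    unsigned long cache_misses;
};

static void demo_collector_init(struct demo_collector *c)
{
    atomic_init(&c->cache_hits, 0u);
    atomic_init(&c->cache_misses, 0u);
}

/* Hot path: record one lookup without taking a lock. */
static void demo_mark(struct demo_collector *c, bool hit)
{
    if (hit)
        atomic_fetch_add(&c->cache_hits, 1u);
    else
        atomic_fetch_add(&c->cache_misses, 1u);
}

/* Periodic step: drain the collector into the running totals. */
static void demo_fold(struct demo_collector *c, struct demo_stats *s)
{
    assert(c != NULL && s != NULL);
    s->cache_hits += atomic_exchange(&c->cache_hits, 0u);
    s->cache_misses += atomic_exchange(&c->cache_misses, 0u);
}

/* Integer hit ratio in percent; 0 when nothing has been counted yet. */
static unsigned long demo_hit_ratio(const struct demo_stats *s)
{
    unsigned long total = s->cache_hits + s->cache_misses;

    return total ? s->cache_hits * 100 / total : 0;
}
```

Using atomic_exchange to read and reset each counter in one step means increments that race with the drain are carried into the next period instead of being lost.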