Commit d4c90b1b authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block

Pull block IO driver bits from Jens Axboe:
 "As I mentioned in the core block pull request, due to real life
  circumstances the driver pull request would be late.  Now it looks
  like -rc2 late...  On the plus side, apart form the rsxx update, these
  are all things that I could argue could go in later in the cycle as
  they are fixes and not features.  So even though things are late, it's
  not ALL bad.

  The pull request contains:

   - Updates to bcache, all bug fixes, from Kent.

   - A pile of drbd bug fixes (no big features this time!).

   - xen blk front/back fixes.

   - rsxx driver updates, some of them deferred form 3.10.  So should be
     well cooked by now"

* 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
  bcache: Allocation kthread fixes
  bcache: Fix GC_SECTORS_USED() calculation
  bcache: Journal replay fix
  bcache: Shutdown fix
  bcache: Fix a sysfs splat on shutdown
  bcache: Advertise that flushes are supported
  bcache: check for allocation failures
  bcache: Fix a dumb race
  bcache: Use standard utility code
  bcache: Update email address
  bcache: Delete fuzz tester
  bcache: Document shrinker reserve better
  bcache: FUA fixes
  drbd: Allow online change of al-stripes and al-stripe-size
  drbd: Constants should be UPPERCASE
  drbd: Ignore the exit code of a fence-peer handler if it returns too late
  drbd: Fix rcu_read_lock balance on error path
  drbd: fix error return code in drbd_init()
  drbd: Do not sleep inside rcu
  bcache: Refresh usage docs
  ...
parents 3b2f64d0 0878ae2d
What: /sys/module/xen_blkback/parameters/max_buffer_pages
Date: March 2013
KernelVersion: 3.11
Contact: Roger Pau Monné <roger.pau@citrix.com>
Description:
Maximum number of free pages to keep in each block
backend buffer.
What: /sys/module/xen_blkback/parameters/max_persistent_grants
Date: March 2013
KernelVersion: 3.11
Contact: Roger Pau Monné <roger.pau@citrix.com>
Description:
Maximum number of grants to map persistently in
blkback. If the frontend tries to use more than
max_persistent_grants, the LRU kicks in and starts
removing 5% of max_persistent_grants every 100ms.
What: /sys/module/xen_blkfront/parameters/max
Date: June 2013
KernelVersion: 3.11
Contact: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Description:
Maximum number of segments that the frontend will negotiate
with the backend for indirect descriptors. The default value
is 32 - higher value means more potential throughput but more
memory usage. The backend picks the minimum of the frontend
and its default backend value.
......@@ -46,29 +46,33 @@ you format your backing devices and cache device at the same time, you won't
have to manually attach:
make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
To make bcache devices known to the kernel, echo them to /sys/fs/bcache/register:
bcache-tools now ships udev rules, and bcache devices are known to the kernel
immediately. Without udev, you can manually register devices like this:
echo /dev/sdb > /sys/fs/bcache/register
echo /dev/sdc > /sys/fs/bcache/register
To register your bcache devices automatically, you could add something like
this to an init script:
Registering the backing device makes the bcache device show up in /dev; you can
now format it and use it as normal. But the first time using a new bcache
device, it'll be running in passthrough mode until you attach it to a cache.
See the section on attaching.
echo /dev/sd* > /sys/fs/bcache/register_quiet
The devices show up as:
It'll look for bcache superblocks and ignore everything that doesn't have one.
/dev/bcache<N>
Registering the backing device makes the bcache show up in /dev; you can now
format it and use it as normal. But the first time using a new bcache device,
it'll be running in passthrough mode until you attach it to a cache. See the
section on attaching.
As well as (with udev):
The devices show up at /dev/bcacheN, and can be controlled via sysfs from
/sys/block/bcacheN/bcache:
/dev/bcache/by-uuid/<uuid>
/dev/bcache/by-label/<label>
To get started:
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /mnt
You can control bcache devices through sysfs at /sys/block/bcache<N>/bcache .
Cache devices are managed as sets; multiple caches per set isn't supported yet
but will allow for mirroring of metadata and dirty data in the future. Your new
cache set shows up as /sys/fs/bcache/<UUID>
......@@ -80,11 +84,11 @@ must be attached to your cache set to enable caching. Attaching a backing
device to a cache set is done thusly, with the UUID of the cache set in
/sys/fs/bcache:
echo <UUID> > /sys/block/bcache0/bcache/attach
echo <CSET-UUID> > /sys/block/bcache0/bcache/attach
This only has to be done once. The next time you reboot, just reregister all
your bcache devices. If a backing device has data in a cache somewhere, the
/dev/bcache# device won't be created until the cache shows up - particularly
/dev/bcache<N> device won't be created until the cache shows up - particularly
important if you have writeback caching turned on.
If you're booting up and your cache device is gone and never coming back, you
......@@ -191,6 +195,9 @@ want for getting the best possible numbers when benchmarking.
SYSFS - BACKING DEVICE:
Available at /sys/block/<bdev>/bcache, /sys/block/bcache*/bcache and
(if attached) /sys/fs/bcache/<cset-uuid>/bdev*
attach
Echo the UUID of a cache set to this file to enable caching.
......@@ -300,6 +307,8 @@ cache_readaheads
SYSFS - CACHE SET:
Available at /sys/fs/bcache/<cset-uuid>
average_key_size
Average data per key in the btree.
......@@ -390,6 +399,8 @@ trigger_gc
SYSFS - CACHE DEVICE:
Available at /sys/block/<cdev>/bcache
block_size
Minimum granularity of writes - should match hardware sector size.
......
......@@ -1642,7 +1642,7 @@ S: Maintained
F: drivers/net/hamradio/baycom*
BCACHE (BLOCK LAYER CACHE)
M: Kent Overstreet <koverstreet@google.com>
M: Kent Overstreet <kmo@daterainc.com>
L: linux-bcache@vger.kernel.org
W: http://bcache.evilpiepirate.org
S: Maintained:
......@@ -3346,7 +3346,7 @@ F: Documentation/firmware_class/
F: drivers/base/firmware*.c
F: include/linux/firmware.h
FLASHSYSTEM DRIVER (IBM FlashSystem 70/80 PCI SSD Flash Card)
FLASH ADAPTER DRIVER (IBM Flash Adapter 900GB Full Height PCI Flash Card)
M: Joshua Morris <josh.h.morris@us.ibm.com>
M: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
S: Maintained
......
......@@ -532,11 +532,11 @@ config BLK_DEV_RBD
If unsure, say N.
config BLK_DEV_RSXX
tristate "IBM FlashSystem 70/80 PCIe SSD Device Driver"
tristate "IBM Flash Adapter 900GB Full Height PCIe Device Driver"
depends on PCI
help
Device driver for IBM's high speed PCIe SSD
storage devices: FlashSystem-70 and FlashSystem-80.
storage device: Flash Adapter 900GB Full Height.
To compile this driver as a module, choose M here: the
module will be called rsxx.
......
......@@ -659,6 +659,27 @@ void drbd_al_shrink(struct drbd_conf *mdev)
wake_up(&mdev->al_wait);
}
int drbd_initialize_al(struct drbd_conf *mdev, void *buffer)
{
struct al_transaction_on_disk *al = buffer;
struct drbd_md *md = &mdev->ldev->md;
sector_t al_base = md->md_offset + md->al_offset;
int al_size_4k = md->al_stripes * md->al_stripe_size_4k;
int i;
memset(al, 0, 4096);
al->magic = cpu_to_be32(DRBD_AL_MAGIC);
al->transaction_type = cpu_to_be16(AL_TR_INITIALIZED);
al->crc32c = cpu_to_be32(crc32c(0, al, 4096));
for (i = 0; i < al_size_4k; i++) {
int err = drbd_md_sync_page_io(mdev, mdev->ldev, al_base + i * 8, WRITE);
if (err)
return err;
}
return 0;
}
static int w_update_odbm(struct drbd_work *w, int unused)
{
struct update_odbm_work *udw = container_of(w, struct update_odbm_work, w);
......
......@@ -832,6 +832,7 @@ struct drbd_tconn { /* is a resource from the config file */
unsigned susp_nod:1; /* IO suspended because no data */
unsigned susp_fen:1; /* IO suspended because fence peer handler runs */
struct mutex cstate_mutex; /* Protects graceful disconnects */
unsigned int connect_cnt; /* Inc each time a connection is established */
unsigned long flags;
struct net_conf *net_conf; /* content protected by rcu */
......@@ -1132,6 +1133,7 @@ extern void drbd_mdev_cleanup(struct drbd_conf *mdev);
void drbd_print_uuids(struct drbd_conf *mdev, const char *text);
extern void conn_md_sync(struct drbd_tconn *tconn);
extern void drbd_md_write(struct drbd_conf *mdev, void *buffer);
extern void drbd_md_sync(struct drbd_conf *mdev);
extern int drbd_md_read(struct drbd_conf *mdev, struct drbd_backing_dev *bdev);
extern void drbd_uuid_set(struct drbd_conf *mdev, int idx, u64 val) __must_hold(local);
......@@ -1466,8 +1468,16 @@ extern void drbd_suspend_io(struct drbd_conf *mdev);
extern void drbd_resume_io(struct drbd_conf *mdev);
extern char *ppsize(char *buf, unsigned long long size);
extern sector_t drbd_new_dev_size(struct drbd_conf *, struct drbd_backing_dev *, sector_t, int);
enum determine_dev_size { dev_size_error = -1, unchanged = 0, shrunk = 1, grew = 2 };
extern enum determine_dev_size drbd_determine_dev_size(struct drbd_conf *, enum dds_flags) __must_hold(local);
enum determine_dev_size {
DS_ERROR_SHRINK = -3,
DS_ERROR_SPACE_MD = -2,
DS_ERROR = -1,
DS_UNCHANGED = 0,
DS_SHRUNK = 1,
DS_GREW = 2
};
extern enum determine_dev_size
drbd_determine_dev_size(struct drbd_conf *, enum dds_flags, struct resize_parms *) __must_hold(local);
extern void resync_after_online_grow(struct drbd_conf *);
extern void drbd_reconsider_max_bio_size(struct drbd_conf *mdev);
extern enum drbd_state_rv drbd_set_role(struct drbd_conf *mdev,
......@@ -1633,6 +1643,7 @@ extern int __drbd_set_out_of_sync(struct drbd_conf *mdev, sector_t sector,
#define drbd_set_out_of_sync(mdev, sector, size) \
__drbd_set_out_of_sync(mdev, sector, size, __FILE__, __LINE__)
extern void drbd_al_shrink(struct drbd_conf *mdev);
extern int drbd_initialize_al(struct drbd_conf *, void *);
/* drbd_nl.c */
/* state info broadcast */
......
......@@ -2762,8 +2762,6 @@ int __init drbd_init(void)
/*
* allocate all necessary structs
*/
err = -ENOMEM;
init_waitqueue_head(&drbd_pp_wait);
drbd_proc = NULL; /* play safe for drbd_cleanup */
......@@ -2773,6 +2771,7 @@ int __init drbd_init(void)
if (err)
goto fail;
err = -ENOMEM;
drbd_proc = proc_create_data("drbd", S_IFREG | S_IRUGO , NULL, &drbd_proc_fops, NULL);
if (!drbd_proc) {
printk(KERN_ERR "drbd: unable to register proc file\n");
......@@ -2803,7 +2802,6 @@ int __init drbd_init(void)
fail:
drbd_cleanup();
if (err == -ENOMEM)
/* currently always the case */
printk(KERN_ERR "drbd: ran out of memory\n");
else
printk(KERN_ERR "drbd: initialization failure\n");
......@@ -2881,34 +2879,14 @@ struct meta_data_on_disk {
u8 reserved_u8[4096 - (7*8 + 10*4)];
} __packed;
/**
* drbd_md_sync() - Writes the meta data super block if the MD_DIRTY flag bit is set
* @mdev: DRBD device.
*/
void drbd_md_sync(struct drbd_conf *mdev)
void drbd_md_write(struct drbd_conf *mdev, void *b)
{
struct meta_data_on_disk *buffer;
struct meta_data_on_disk *buffer = b;
sector_t sector;
int i;
/* Don't accidentally change the DRBD meta data layout. */
BUILD_BUG_ON(UI_SIZE != 4);
BUILD_BUG_ON(sizeof(struct meta_data_on_disk) != 4096);
del_timer(&mdev->md_sync_timer);
/* timer may be rearmed by drbd_md_mark_dirty() now. */
if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))
return;
/* We use here D_FAILED and not D_ATTACHING because we try to write
* metadata even if we detach due to a disk failure! */
if (!get_ldev_if_state(mdev, D_FAILED))
return;
buffer = drbd_md_get_buffer(mdev);
if (!buffer)
goto out;
memset(buffer, 0, sizeof(*buffer));
buffer->la_size_sect = cpu_to_be64(drbd_get_capacity(mdev->this_bdev));
......@@ -2937,6 +2915,35 @@ void drbd_md_sync(struct drbd_conf *mdev)
dev_err(DEV, "meta data update failed!\n");
drbd_chk_io_error(mdev, 1, DRBD_META_IO_ERROR);
}
}
/**
* drbd_md_sync() - Writes the meta data super block if the MD_DIRTY flag bit is set
* @mdev: DRBD device.
*/
void drbd_md_sync(struct drbd_conf *mdev)
{
struct meta_data_on_disk *buffer;
/* Don't accidentally change the DRBD meta data layout. */
BUILD_BUG_ON(UI_SIZE != 4);
BUILD_BUG_ON(sizeof(struct meta_data_on_disk) != 4096);
del_timer(&mdev->md_sync_timer);
/* timer may be rearmed by drbd_md_mark_dirty() now. */
if (!test_and_clear_bit(MD_DIRTY, &mdev->flags))
return;
/* We use here D_FAILED and not D_ATTACHING because we try to write
* metadata even if we detach due to a disk failure! */
if (!get_ldev_if_state(mdev, D_FAILED))
return;
buffer = drbd_md_get_buffer(mdev);
if (!buffer)
goto out;
drbd_md_write(mdev, buffer);
/* Update mdev->ldev->md.la_size_sect,
* since we updated it on metadata. */
......
This diff is collapsed.
......@@ -1039,6 +1039,8 @@ static int conn_connect(struct drbd_tconn *tconn)
rcu_read_lock();
idr_for_each_entry(&tconn->volumes, mdev, vnr) {
kref_get(&mdev->kref);
rcu_read_unlock();
/* Prevent a race between resync-handshake and
* being promoted to Primary.
*
......@@ -1049,8 +1051,6 @@ static int conn_connect(struct drbd_tconn *tconn)
mutex_lock(mdev->state_mutex);
mutex_unlock(mdev->state_mutex);
rcu_read_unlock();
if (discard_my_data)
set_bit(DISCARD_MY_DATA, &mdev->flags);
else
......@@ -3545,7 +3545,7 @@ static int receive_sizes(struct drbd_tconn *tconn, struct packet_info *pi)
{
struct drbd_conf *mdev;
struct p_sizes *p = pi->data;
enum determine_dev_size dd = unchanged;
enum determine_dev_size dd = DS_UNCHANGED;
sector_t p_size, p_usize, my_usize;
int ldsc = 0; /* local disk size changed */
enum dds_flags ddsf;
......@@ -3617,9 +3617,9 @@ static int receive_sizes(struct drbd_tconn *tconn, struct packet_info *pi)
ddsf = be16_to_cpu(p->dds_flags);
if (get_ldev(mdev)) {
dd = drbd_determine_dev_size(mdev, ddsf);
dd = drbd_determine_dev_size(mdev, ddsf, NULL);
put_ldev(mdev);
if (dd == dev_size_error)
if (dd == DS_ERROR)
return -EIO;
drbd_md_sync(mdev);
} else {
......@@ -3647,7 +3647,7 @@ static int receive_sizes(struct drbd_tconn *tconn, struct packet_info *pi)
drbd_send_sizes(mdev, 0, ddsf);
}
if (test_and_clear_bit(RESIZE_PENDING, &mdev->flags) ||
(dd == grew && mdev->state.conn == C_CONNECTED)) {
(dd == DS_GREW && mdev->state.conn == C_CONNECTED)) {
if (mdev->state.pdsk >= D_INCONSISTENT &&
mdev->state.disk >= D_INCONSISTENT) {
if (ddsf & DDSF_NO_RESYNC)
......
......@@ -1115,8 +1115,10 @@ __drbd_set_state(struct drbd_conf *mdev, union drbd_state ns,
drbd_thread_restart_nowait(&mdev->tconn->receiver);
/* Resume AL writing if we get a connection */
if (os.conn < C_CONNECTED && ns.conn >= C_CONNECTED)
if (os.conn < C_CONNECTED && ns.conn >= C_CONNECTED) {
drbd_resume_al(mdev);
mdev->tconn->connect_cnt++;
}
/* remember last attach time so request_timer_fn() won't
* kill newly established sessions while we are still trying to thaw
......
This diff is collapsed.
......@@ -431,6 +431,15 @@ static int __issue_creg_rw(struct rsxx_cardinfo *card,
*hw_stat = completion.creg_status;
if (completion.st) {
/*
* This read is needed to verify that there has not been any
* extreme errors that might have occurred, i.e. EEH. The
* function iowrite32 will not detect EEH errors, so it is
* necessary that we recover if such an error is the reason
* for the timeout. This is a dummy read.
*/
ioread32(card->regmap + SCRATCH);
dev_warn(CARD_TO_DEV(card),
"creg command failed(%d x%08x)\n",
completion.st, addr);
......@@ -727,6 +736,11 @@ int rsxx_creg_setup(struct rsxx_cardinfo *card)
{
card->creg_ctrl.active_cmd = NULL;
card->creg_ctrl.creg_wq =
create_singlethread_workqueue(DRIVER_NAME"_creg");
if (!card->creg_ctrl.creg_wq)
return -ENOMEM;
INIT_WORK(&card->creg_ctrl.done_work, creg_cmd_done);
mutex_init(&card->creg_ctrl.reset_lock);
INIT_LIST_HEAD(&card->creg_ctrl.queue);
......
......@@ -155,6 +155,7 @@ static void bio_dma_done_cb(struct rsxx_cardinfo *card,
atomic_set(&meta->error, 1);
if (atomic_dec_and_test(&meta->pending_dmas)) {
if (!card->eeh_state && card->gendisk)
disk_stats_complete(card, meta->bio, meta->start_time);
bio_endio(meta->bio, atomic_read(&meta->error) ? -EIO : 0);
......@@ -170,6 +171,12 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
might_sleep();
if (!card)
goto req_err;
if (bio->bi_sector + (bio->bi_size >> 9) > get_capacity(card->gendisk))
goto req_err;
if (unlikely(card->halt)) {
st = -EFAULT;
goto req_err;
......@@ -196,6 +203,7 @@ static void rsxx_make_request(struct request_queue *q, struct bio *bio)
atomic_set(&bio_meta->pending_dmas, 0);
bio_meta->start_time = jiffies;
if (!unlikely(card->halt))
disk_stats_start(card, bio);
dev_dbg(CARD_TO_DEV(card), "BIO[%c]: meta: %p addr8: x%llx size: %d\n",
......@@ -225,24 +233,6 @@ static bool rsxx_discard_supported(struct rsxx_cardinfo *card)
return (pci_rev >= RSXX_DISCARD_SUPPORT);
}
static unsigned short rsxx_get_logical_block_size(
struct rsxx_cardinfo *card)
{
u32 capabilities = 0;
int st;
st = rsxx_get_card_capabilities(card, &capabilities);
if (st)
dev_warn(CARD_TO_DEV(card),
"Failed reading card capabilities register\n");
/* Earlier firmware did not have support for 512 byte accesses */
if (capabilities & CARD_CAP_SUBPAGE_WRITES)
return 512;
else
return RSXX_HW_BLK_SIZE;
}
int rsxx_attach_dev(struct rsxx_cardinfo *card)
{
mutex_lock(&card->dev_lock);
......@@ -305,7 +295,7 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
return -ENOMEM;
}
blk_size = rsxx_get_logical_block_size(card);
blk_size = card->config.data.block_size;
blk_queue_make_request(card->queue, rsxx_make_request);
blk_queue_bounce_limit(card->queue, BLK_BOUNCE_ANY);
......@@ -347,6 +337,7 @@ void rsxx_destroy_dev(struct rsxx_cardinfo *card)
card->gendisk = NULL;
blk_cleanup_queue(card->queue);
card->queue->queuedata = NULL;
unregister_blkdev(card->major, DRIVER_NAME);
}
......
......@@ -245,6 +245,22 @@ static void rsxx_complete_dma(struct rsxx_dma_ctrl *ctrl,
kmem_cache_free(rsxx_dma_pool, dma);
}
int rsxx_cleanup_dma_queue(struct rsxx_dma_ctrl *ctrl,
struct list_head *q)
{
struct rsxx_dma *dma;
struct rsxx_dma *tmp;
int cnt = 0;
list_for_each_entry_safe(dma, tmp, q, list) {
list_del(&dma->list);
rsxx_complete_dma(ctrl, dma, DMA_CANCELLED);
cnt++;
}
return cnt;
}
static void rsxx_requeue_dma(struct rsxx_dma_ctrl *ctrl,
struct rsxx_dma *dma)
{
......@@ -252,9 +268,10 @@ static void rsxx_requeue_dma(struct rsxx_dma_ctrl *ctrl,
* Requeued DMAs go to the front of the queue so they are issued
* first.
*/
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
ctrl->stats.sw_q_depth++;
list_add(&dma->list, &ctrl->queue);
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
}
static void rsxx_handle_dma_error(struct rsxx_dma_ctrl *ctrl,
......@@ -329,6 +346,7 @@ static void rsxx_handle_dma_error(struct rsxx_dma_ctrl *ctrl,
static void dma_engine_stalled(unsigned long data)
{
struct rsxx_dma_ctrl *ctrl = (struct rsxx_dma_ctrl *)data;
int cnt;
if (atomic_read(&ctrl->stats.hw_q_depth) == 0 ||
unlikely(ctrl->card->eeh_state))
......@@ -349,18 +367,28 @@ static void dma_engine_stalled(unsigned long data)
"DMA channel %d has stalled, faulting interface.\n",
ctrl->id);
ctrl->card->dma_fault = 1;
/* Clean up the DMA queue */
spin_lock(&ctrl->queue_lock);
cnt = rsxx_cleanup_dma_queue(ctrl, &ctrl->queue);
spin_unlock(&ctrl->queue_lock);
cnt += rsxx_dma_cancel(ctrl);
if (cnt)
dev_info(CARD_TO_DEV(ctrl->card),
"Freed %d queued DMAs on channel %d\n",
cnt, ctrl->id);
}
}
static void rsxx_issue_dmas(struct work_struct *work)
static void rsxx_issue_dmas(struct rsxx_dma_ctrl *ctrl)
{
struct rsxx_dma_ctrl *ctrl;
struct rsxx_dma *dma;
int tag;
int cmds_pending = 0;
struct hw_cmd *hw_cmd_buf;
ctrl = container_of(work, struct rsxx_dma_ctrl, issue_dma_work);
hw_cmd_buf = ctrl->cmd.buf;
if (unlikely(ctrl->card->halt) ||
......@@ -368,22 +396,22 @@ static void rsxx_issue_dmas(struct work_struct *work)
return;
while (1) {
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
if (list_empty(&ctrl->queue)) {
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
break;
}
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
tag = pop_tracker(ctrl->trackers);
if (tag == -1)
break;
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
dma = list_entry(ctrl->queue.next, struct rsxx_dma, list);
list_del(&dma->list);
ctrl->stats.sw_q_depth--;
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
/*
* This will catch any DMAs that slipped in right before the
......@@ -440,9 +468,8 @@ static void rsxx_issue_dmas(struct work_struct *work)
}
}
static void rsxx_dma_done(struct work_struct *work)
static void rsxx_dma_done(struct rsxx_dma_ctrl *ctrl)
{
struct rsxx_dma_ctrl *ctrl;
struct rsxx_dma *dma;
unsigned long flags;
u16 count;
......@@ -450,7 +477,6 @@ static void rsxx_dma_done(struct work_struct *work)
u8 tag;
struct hw_status *hw_st_buf;
ctrl = container_of(work, struct rsxx_dma_ctrl, dma_done_work);
hw_st_buf = ctrl->status.buf;
if (unlikely(ctrl->card->halt) ||
......@@ -520,33 +546,32 @@ static void rsxx_dma_done(struct work_struct *work)
rsxx_enable_ier(ctrl->card, CR_INTR_DMA(ctrl->id));
spin_unlock_irqrestore(&ctrl->card->irq_lock, flags);
spin_lock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
if (ctrl->stats.sw_q_depth)
queue_work(ctrl->issue_wq, &ctrl->issue_dma_work);
spin_unlock(&ctrl->queue_lock);
spin_unlock_bh(&ctrl->queue_lock);
}
static int rsxx_cleanup_dma_queue(struct rsxx_cardinfo *card,
struct list_head *q)
static void rsxx_schedule_issue(struct work_struct *work)
{
struct rsxx_dma *dma;
struct rsxx_dma *tmp;
int cnt = 0;
struct rsxx_dma_ctrl *ctrl;
list_for_each_entry_safe(dma, tmp, q, list) {
list_del(&dma->list);
ctrl = container_of(work, struct rsxx_dma_ctrl, issue_dma_work);
if (dma->dma_addr)
pci_unmap_page(card->dev, dma->dma_addr,
get_dma_size(dma),
(dma->cmd == HW_CMD_BLK_WRITE) ?
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
kmem_cache_free(rsxx_dma_pool, dma);
cnt++;
}
mutex_lock(&ctrl->work_lock);
rsxx_issue_dmas(ctrl);
mutex_unlock(&ctrl->work_lock);
}
return cnt;
static void rsxx_schedule_done(struct work_struct *work)
{
struct rsxx_dma_ctrl *ctrl;
ctrl = container_of(work, struct rsxx_dma_ctrl, dma_done_work);
mutex_lock(&ctrl->work_lock);
rsxx_dma_done(ctrl);
mutex_unlock(&ctrl->work_lock);
}
static int rsxx_queue_discard(struct rsxx_cardinfo *card,
......@@ -698,10 +723,10 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
for (i = 0; i < card->n_targets; i++) {
if (!list_empty(&dma_list[i])) {
spin_lock(&card->ctrl[i].queue_lock);
spin_lock_bh(&card->ctrl[i].queue_lock);
card->ctrl[i].stats.sw_q_depth += dma_cnt[i];
list_splice_tail(&dma_list[i], &card->ctrl[i].queue);
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
queue_work(card->ctrl[i].issue_wq,
&card->ctrl[i].issue_dma_work);
......@@ -711,8 +736,11 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
return 0;
bvec_err:
for (i = 0; i < card->n_targets; i++)
rsxx_cleanup_dma_queue(card, &dma_list[i]);
for (i = 0; i < card->n_targets; i++) {
spin_lock_bh(&card->ctrl[i].queue_lock);
rsxx_cleanup_dma_queue(&card->ctrl[i], &dma_list[i]);
spin_unlock_bh(&card->ctrl[i].queue_lock);
}
return st;
}
......@@ -780,6 +808,7 @@ static int rsxx_dma_ctrl_init(struct pci_dev *dev,
spin_lock_init(&ctrl->trackers->lock);
spin_lock_init(&ctrl->queue_lock);
mutex_init(&ctrl->work_lock);
INIT_LIST_HEAD(&ctrl->queue);
setup_timer(&ctrl->activity_timer, dma_engine_stalled,
......@@ -793,8 +822,8 @@ static int rsxx_dma_ctrl_init(struct pci_dev *dev,
if (!ctrl->done_wq)
return -ENOMEM;
INIT_WORK(&ctrl->issue_dma_work, rsxx_issue_dmas);
INIT_WORK(&ctrl->dma_done_work, rsxx_dma_done);
INIT_WORK(&ctrl->issue_dma_work, rsxx_schedule_issue);
INIT_WORK(&ctrl->dma_done_work, rsxx_schedule_done);
st = rsxx_hw_buffers_init(dev, ctrl);
if (st)
......@@ -918,13 +947,30 @@ int rsxx_dma_setup(struct rsxx_cardinfo *card)
return st;
}
int rsxx_dma_cancel(struct rsxx_dma_ctrl *ctrl)
{
struct rsxx_dma *dma;
int i;
int cnt = 0;
/* Clean up issued DMAs */
for (i = 0; i < RSXX_MAX_OUTSTANDING_CMDS; i++) {
dma = get_tracker_dma(ctrl->trackers, i);
if (dma) {
atomic_dec(&ctrl->stats.hw_q_depth);
rsxx_complete_dma(ctrl, dma, DMA_CANCELLED);
push_tracker(ctrl->trackers, i);
cnt++;
}
}
return cnt;
}
void rsxx_dma_destroy(struct rsxx_cardinfo *card)
{
struct rsxx_dma_ctrl *ctrl;
struct rsxx_dma *dma;
int i, j;
int cnt = 0;
int i;
for (i = 0; i < card->n_targets; i++) {
ctrl = &card->ctrl[i];
......@@ -943,33 +989,11 @@ void rsxx_dma_destroy(struct rsxx_cardinfo *card)
del_timer_sync(&ctrl->activity_timer);
/* Clean up the DMA queue */
spin_lock(&ctrl->queue_lock);
cnt = rsxx_cleanup_dma_queue(card, &ctrl->queue);
spin_unlock(&ctrl->queue_lock);
spin_lock_bh(&ctrl->queue_lock);
rsxx_cleanup_dma_queue(ctrl, &ctrl->queue);
spin_unlock_bh(&ctrl->queue_lock);
if (cnt)
dev_info(CARD_TO_DEV(card),
"Freed %d queued DMAs on channel %d\n",
cnt, i);
/* Clean up issued DMAs */
for (j = 0; j < RSXX_MAX_OUTSTANDING_CMDS; j++) {
dma = get_tracker_dma(ctrl->trackers, j);
if (dma) {
pci_unmap_page(card->dev, dma->dma_addr,
get_dma_size(dma),
(dma->cmd == HW_CMD_BLK_WRITE) ?
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
kmem_cache_free(rsxx_dma_pool, dma);
cnt++;
}
}
if (cnt)
dev_info(CARD_TO_DEV(card),
"Freed %d pending DMAs on channel %d\n",
cnt, i);
rsxx_dma_cancel(ctrl);
vfree(ctrl->trackers);
......@@ -1013,7 +1037,7 @@ int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card)
cnt++;
}
spin_lock(&card->ctrl[i].queue_lock);
spin_lock_bh(&card->ctrl[i].queue_lock);
list_splice(&issued_dmas[i], &card->ctrl[i].queue);
atomic_sub(cnt, &card->ctrl[i].stats.hw_q_depth);
......@@ -1028,7 +1052,7 @@ int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card)
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
}
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
}
kfree(issued_dmas);
......@@ -1036,30 +1060,13 @@ int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card)
return 0;
}
void rsxx_eeh_cancel_dmas(struct rsxx_cardinfo *card)
{
struct rsxx_dma *dma;
struct rsxx_dma *tmp;
int i;
for (i = 0; i < card->n_targets; i++) {
spin_lock(&card->ctrl[i].queue_lock);
list_for_each_entry_safe(dma, tmp, &card->ctrl[i].queue, list) {
list_del(&dma->list);
rsxx_complete_dma(&card->ctrl[i], dma, DMA_CANCELLED);
}
spin_unlock(&card->ctrl[i].queue_lock);
}
}
int rsxx_eeh_remap_dmas(struct rsxx_cardinfo *card)
{
struct rsxx_dma *dma;
int i;
for (i = 0; i < card->n_targets; i++) {
spin_lock(&card->ctrl[i].queue_lock);
spin_lock_bh(&card->ctrl[i].queue_lock);
list_for_each_entry(dma, &card->ctrl[i].queue, list) {
dma->dma_addr = pci_map_page(card->dev, dma->page,
dma->pg_off, get_dma_size(dma),
......@@ -1067,12 +1074,12 @@ int rsxx_eeh_remap_dmas(struct rsxx_cardinfo *card)
PCI_DMA_TODEVICE :
PCI_DMA_FROMDEVICE);
if (!dma->dma_addr) {
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
kmem_cache_free(rsxx_dma_pool, dma);
return -ENOMEM;
}
}
spin_unlock(&card->ctrl[i].queue_lock);
spin_unlock_bh(&card->ctrl[i].queue_lock);
}
return 0;
......
......@@ -39,6 +39,7 @@
#include <linux/vmalloc.h>
#include <linux/timer.h>
#include <linux/ioctl.h>
#include <linux/delay.h>
#include "rsxx.h"
#include "rsxx_cfg.h"
......@@ -114,6 +115,7 @@ struct rsxx_dma_ctrl {
struct timer_list activity_timer;
struct dma_tracker_list *trackers;
struct rsxx_dma_stats stats;
struct mutex work_lock;
};
struct rsxx_cardinfo {
......@@ -134,6 +136,7 @@ struct rsxx_cardinfo {
spinlock_t lock;
bool active;
struct creg_cmd *active_cmd;
struct workqueue_struct *creg_wq;
struct work_struct done_work;
struct list_head queue;
unsigned int q_depth;
......@@ -154,6 +157,7 @@ struct rsxx_cardinfo {
int buf_len;
} log;
struct workqueue_struct *event_wq;
struct work_struct event_work;
unsigned int state;
u64 size8;
......@@ -181,6 +185,8 @@ struct rsxx_cardinfo {
int n_targets;
struct rsxx_dma_ctrl *ctrl;
struct dentry *debugfs_dir;
};
enum rsxx_pci_regmap {
......@@ -283,6 +289,7 @@ enum rsxx_creg_addr {
CREG_ADD_CAPABILITIES = 0x80001050,
CREG_ADD_LOG = 0x80002000,
CREG_ADD_NUM_TARGETS = 0x80003000,
CREG_ADD_CRAM = 0xA0000000,
CREG_ADD_CONFIG = 0xB0000000,
};
......@@ -372,6 +379,8 @@ typedef void (*rsxx_dma_cb)(struct rsxx_cardinfo *card,
int rsxx_dma_setup(struct rsxx_cardinfo *card);
void rsxx_dma_destroy(struct rsxx_cardinfo *card);
int rsxx_dma_init(void);
int rsxx_cleanup_dma_queue(struct rsxx_dma_ctrl *ctrl, struct list_head *q);
int rsxx_dma_cancel(struct rsxx_dma_ctrl *ctrl);
void rsxx_dma_cleanup(void);
void rsxx_dma_queue_reset(struct rsxx_cardinfo *card);
int rsxx_dma_configure(struct rsxx_cardinfo *card);
......@@ -382,7 +391,6 @@ int rsxx_dma_queue_bio(struct rsxx_cardinfo *card,
void *cb_data);
int rsxx_hw_buffers_init(struct pci_dev *dev, struct rsxx_dma_ctrl *ctrl);
int rsxx_eeh_save_issued_dmas(struct rsxx_cardinfo *card);
void rsxx_eeh_cancel_dmas(struct rsxx_cardinfo *card);
int rsxx_eeh_remap_dmas(struct rsxx_cardinfo *card);
/***** cregs.c *****/
......
This diff is collapsed.
......@@ -50,6 +50,19 @@
__func__, __LINE__, ##args)
/*
* This is the maximum number of segments that would be allowed in indirect
* requests. This value will also be passed to the frontend.
*/
#define MAX_INDIRECT_SEGMENTS 256
#define SEGS_PER_INDIRECT_FRAME \
(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
#define MAX_INDIRECT_PAGES \
((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
#define INDIRECT_PAGES(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
/* Not a real protocol. Used to generate ring structs which contain
* the elements common to all protocols only. This way we get a
* compiler-checkable way to use common struct elements, so we can
......@@ -83,12 +96,31 @@ struct blkif_x86_32_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_x86_32_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad1;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint64_t _pad2; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_32_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_32_request_rw rw;
struct blkif_x86_32_request_discard discard;
struct blkif_x86_32_request_other other;
struct blkif_x86_32_request_indirect indirect;
} u;
} __attribute__((__packed__));
......@@ -127,12 +159,32 @@ struct blkif_x86_64_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_x86_64_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint32_t _pad1; /* offsetof(blkif_..,u.indirect.id)==8 */
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad2;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint32_t _pad3; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_64_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_64_request_rw rw;
struct blkif_x86_64_request_discard discard;
struct blkif_x86_64_request_other other;
struct blkif_x86_64_request_indirect indirect;
} u;
} __attribute__((__packed__));
......@@ -182,12 +234,26 @@ struct xen_vbd {
struct backend_info;
/* Number of available flags */
#define PERSISTENT_GNT_FLAGS_SIZE 2
/* This persistent grant is currently in use */
#define PERSISTENT_GNT_ACTIVE 0
/*
* This persistent grant has been used, this flag is set when we remove the
* PERSISTENT_GNT_ACTIVE, to know that this grant has been used recently.
*/
#define PERSISTENT_GNT_WAS_ACTIVE 1
/* Number of requests that we can fit in a ring */
#define XEN_BLKIF_REQS 32
struct persistent_gnt {
struct page *page;
grant_ref_t gnt;
grant_handle_t handle;
DECLARE_BITMAP(flags, PERSISTENT_GNT_FLAGS_SIZE);
struct rb_node node;
struct list_head remove_node;
};
struct xen_blkif {
......@@ -219,6 +285,23 @@ struct xen_blkif {
/* tree to store persistent grants */
struct rb_root persistent_gnts;
unsigned int persistent_gnt_c;
atomic_t persistent_gnt_in_use;
unsigned long next_lru;
/* used by the kworker that offload work from the persistent purge */
struct list_head persistent_purge_list;
struct work_struct persistent_purge_work;
/* buffer of free pages to map grant refs */
spinlock_t free_pages_lock;
int free_pages_num;
struct list_head free_pages;
/* List of all 'pending_req' available */
struct list_head pending_free;
/* And its spinlock. */
spinlock_t pending_free_lock;
wait_queue_head_t pending_free_wq;
/* statistics */
unsigned long st_print;
......@@ -231,6 +314,41 @@ struct xen_blkif {
unsigned long long st_wr_sect;
wait_queue_head_t waiting_to_free;
/* Thread shutdown wait queue. */
wait_queue_head_t shutdown_wq;
};
struct seg_buf {
unsigned long offset;
unsigned int nsec;
};
struct grant_page {
struct page *page;
struct persistent_gnt *persistent_gnt;
grant_handle_t handle;
grant_ref_t gref;
};
/*
* Each outstanding request that we've passed to the lower device layers has a
* 'pending_req' allocated to it. Each buffer_head that completes decrements
* the pendcnt towards zero. When it hits zero, the specified domain has a
* response queued for it, with the saved 'id' passed back.
*/
struct pending_req {
struct xen_blkif *blkif;
u64 id;
int nr_pages;
atomic_t pendcnt;
unsigned short operation;
int status;
struct list_head free_list;
struct grant_page *segments[MAX_INDIRECT_SEGMENTS];
/* Indirect descriptors */
struct grant_page *indirect_pages[MAX_INDIRECT_PAGES];
struct seg_buf seg[MAX_INDIRECT_SEGMENTS];
struct bio *biolist[MAX_INDIRECT_SEGMENTS];
};
......@@ -257,6 +375,7 @@ int xen_blkif_xenbus_init(void);
irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
int xen_blkif_schedule(void *arg);
int xen_blkif_purge_persistent(void *arg);
int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
struct backend_info *be, int state);
......@@ -268,7 +387,7 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be);
static inline void blkif_get_x86_32_req(struct blkif_request *dst,
struct blkif_x86_32_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
......@@ -291,6 +410,18 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
......@@ -304,7 +435,7 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
static inline void blkif_get_x86_64_req(struct blkif_request *dst,
struct blkif_x86_64_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
......@@ -327,6 +458,18 @@ static inline void blkif_get_x86_64_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
......
......@@ -98,12 +98,17 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
err = PTR_ERR(blkif->xenblkd);
blkif->xenblkd = NULL;
xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
return;
}
}
static struct xen_blkif *xen_blkif_alloc(domid_t domid)
{
struct xen_blkif *blkif;
struct pending_req *req, *n;
int i, j;
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
if (!blkif)
......@@ -118,8 +123,57 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
init_waitqueue_head(&blkif->waiting_to_free);
blkif->persistent_gnts.rb_node = NULL;
spin_lock_init(&blkif->free_pages_lock);
INIT_LIST_HEAD(&blkif->free_pages);
blkif->free_pages_num = 0;
atomic_set(&blkif->persistent_gnt_in_use, 0);
INIT_LIST_HEAD(&blkif->pending_free);
for (i = 0; i < XEN_BLKIF_REQS; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
if (!req)
goto fail;
list_add_tail(&req->free_list,
&blkif->pending_free);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
req->segments[j] = kzalloc(sizeof(*req->segments[0]),
GFP_KERNEL);
if (!req->segments[j])
goto fail;
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
GFP_KERNEL);
if (!req->indirect_pages[j])
goto fail;
}
}
spin_lock_init(&blkif->pending_free_lock);
init_waitqueue_head(&blkif->pending_free_wq);
init_waitqueue_head(&blkif->shutdown_wq);
return blkif;
fail:
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
if (!req->segments[j])
break;
kfree(req->segments[j]);
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
if (!req->indirect_pages[j])
break;
kfree(req->indirect_pages[j]);
}
kfree(req);
}
kmem_cache_free(xen_blkif_cachep, blkif);
return ERR_PTR(-ENOMEM);
}
static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
......@@ -178,6 +232,7 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
{
if (blkif->xenblkd) {
kthread_stop(blkif->xenblkd);
wake_up(&blkif->shutdown_wq);
blkif->xenblkd = NULL;
}
......@@ -198,8 +253,28 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
static void xen_blkif_free(struct xen_blkif *blkif)
{
struct pending_req *req, *n;
int i = 0, j;
if (!atomic_dec_and_test(&blkif->refcnt))
BUG();
/* Check that there is no request in use */
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
kfree(req->segments[j]);
for (j = 0; j < MAX_INDIRECT_PAGES; j++)
kfree(req->indirect_pages[j]);
kfree(req);
i++;
}
WARN_ON(i != XEN_BLKIF_REQS);
kmem_cache_free(xen_blkif_cachep, blkif);
}
......@@ -678,6 +753,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "feature-max-indirect-segments", "%u",
MAX_INDIRECT_SEGMENTS);
if (err)
dev_warn(&dev->dev, "writing %s/feature-max-indirect-segments (%d)",
dev->nodename, err);
err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
(unsigned long long)vbd_sz(&be->blkif->vbd));
......@@ -704,6 +784,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "physical-sector-size", "%u",
bdev_physical_block_size(be->blkif->vbd.bdev));
if (err)
xenbus_dev_error(dev, err, "writing %s/physical-sector-size",
dev->nodename);
err = xenbus_transaction_end(xbt, 0);
if (err == -EAGAIN)
......
This diff is collapsed.
......@@ -63,7 +63,10 @@
#include "bcache.h"
#include "btree.h"
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/random.h>
#include <trace/events/bcache.h>
#define MAX_IN_FLIGHT_DISCARDS 8U
......@@ -151,7 +154,7 @@ static void discard_finish(struct work_struct *w)
mutex_unlock(&ca->set->bucket_lock);
closure_wake_up(&ca->set->bucket_wait);
wake_up(&ca->set->alloc_wait);
wake_up_process(ca->alloc_thread);
closure_put(&ca->set->cl);
}
......@@ -350,38 +353,30 @@ static void invalidate_buckets(struct cache *ca)
break;
}
pr_debug("free %zu/%zu free_inc %zu/%zu unused %zu/%zu",
fifo_used(&ca->free), ca->free.size,
fifo_used(&ca->free_inc), ca->free_inc.size,
fifo_used(&ca->unused), ca->unused.size);
trace_bcache_alloc_invalidate(ca);
}
#define allocator_wait(ca, cond) \
do { \
DEFINE_WAIT(__wait); \
\
while (1) { \
prepare_to_wait(&ca->set->alloc_wait, \
&__wait, TASK_INTERRUPTIBLE); \
set_current_state(TASK_INTERRUPTIBLE); \
if (cond) \
break; \
\
mutex_unlock(&(ca)->set->bucket_lock); \
if (test_bit(CACHE_SET_STOPPING_2, &ca->set->flags)) { \
finish_wait(&ca->set->alloc_wait, &__wait); \
closure_return(cl); \
} \
if (kthread_should_stop()) \
return 0; \
\
try_to_freeze(); \
schedule(); \
mutex_lock(&(ca)->set->bucket_lock); \
} \
\
finish_wait(&ca->set->alloc_wait, &__wait); \
__set_current_state(TASK_RUNNING); \
} while (0)
void bch_allocator_thread(struct closure *cl)
static int bch_allocator_thread(void *arg)
{
struct cache *ca = container_of(cl, struct cache, alloc);
struct cache *ca = arg;
mutex_lock(&ca->set->bucket_lock);
......@@ -442,7 +437,7 @@ long bch_bucket_alloc(struct cache *ca, unsigned watermark, struct closure *cl)
{
long r = -1;
again:
wake_up(&ca->set->alloc_wait);
wake_up_process(ca->alloc_thread);
if (fifo_used(&ca->free) > ca->watermark[watermark] &&
fifo_pop(&ca->free, r)) {
......@@ -476,9 +471,7 @@ long bch_bucket_alloc(struct cache *ca, unsigned watermark, struct closure *cl)
return r;
}
pr_debug("alloc failure: blocked %i free %zu free_inc %zu unused %zu",
atomic_read(&ca->set->prio_blocked), fifo_used(&ca->free),
fifo_used(&ca->free_inc), fifo_used(&ca->unused));
trace_bcache_alloc_fail(ca);
if (cl) {
closure_wait(&ca->set->bucket_wait, cl);
......@@ -552,6 +545,17 @@ int bch_bucket_alloc_set(struct cache_set *c, unsigned watermark,
/* Init */
int bch_cache_allocator_start(struct cache *ca)
{
struct task_struct *k = kthread_run(bch_allocator_thread,
ca, "bcache_allocator");
if (IS_ERR(k))
return PTR_ERR(k);
ca->alloc_thread = k;
return 0;
}
void bch_cache_allocator_exit(struct cache *ca)
{
struct discard *d;
......
......@@ -178,7 +178,6 @@
#define pr_fmt(fmt) "bcache: %s() " fmt "\n", __func__
#include <linux/bio.h>
#include <linux/blktrace_api.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mutex.h>
......@@ -388,8 +387,6 @@ struct keybuf_key {
typedef bool (keybuf_pred_fn)(struct keybuf *, struct bkey *);
struct keybuf {
keybuf_pred_fn *key_predicate;
struct bkey last_scanned;
spinlock_t lock;
......@@ -437,9 +434,12 @@ struct bcache_device {
/* If nonzero, we're detaching/unregistering from cache set */
atomic_t detaching;
int flush_done;
uint64_t nr_stripes;
unsigned stripe_size_bits;
atomic_t *stripe_sectors_dirty;
atomic_long_t sectors_dirty;
unsigned long sectors_dirty_gc;
unsigned long sectors_dirty_last;
long sectors_dirty_derivative;
......@@ -531,6 +531,7 @@ struct cached_dev {
unsigned sequential_merge:1;
unsigned verify:1;
unsigned partial_stripes_expensive:1;
unsigned writeback_metadata:1;
unsigned writeback_running:1;
unsigned char writeback_percent;
......@@ -565,8 +566,7 @@ struct cache {
unsigned watermark[WATERMARK_MAX];
struct closure alloc;
struct workqueue_struct *alloc_workqueue;
struct task_struct *alloc_thread;
struct closure prio;
struct prio_set *disk_buckets;
......@@ -664,13 +664,9 @@ struct gc_stat {
* CACHE_SET_STOPPING always gets set first when we're closing down a cache set;
* we'll continue to run normally for awhile with CACHE_SET_STOPPING set (i.e.
* flushing dirty data).
*
* CACHE_SET_STOPPING_2 gets set at the last phase, when it's time to shut down
* the allocation thread.
*/
#define CACHE_SET_UNREGISTERING 0
#define CACHE_SET_STOPPING 1
#define CACHE_SET_STOPPING_2 2
struct cache_set {
struct closure cl;
......@@ -703,9 +699,6 @@ struct cache_set {
/* For the btree cache */
struct shrinker shrink;
/* For the allocator itself */
wait_queue_head_t alloc_wait;
/* For the btree cache and anything allocation related */
struct mutex bucket_lock;
......@@ -823,10 +816,9 @@ struct cache_set {
/*
* A btree node on disk could have too many bsets for an iterator to fit
* on the stack - this is a single element mempool for btree_read_work()
* on the stack - have to dynamically allocate them
*/
struct mutex fill_lock;
struct btree_iter *fill_iter;
mempool_t *fill_iter;
/*
* btree_sort() is a merge sort and requires temporary space - single
......@@ -834,6 +826,7 @@ struct cache_set {
*/
struct mutex sort_lock;
struct bset *sort;
unsigned sort_crit_factor;
/* List of buckets we're currently writing data to */
struct list_head data_buckets;
......@@ -906,8 +899,6 @@ static inline unsigned local_clock_us(void)
return local_clock() >> 10;
}
#define MAX_BSETS 4U
#define BTREE_PRIO USHRT_MAX
#define INITIAL_PRIO 32768
......@@ -1112,23 +1103,6 @@ static inline void __bkey_put(struct cache_set *c, struct bkey *k)
atomic_dec_bug(&PTR_BUCKET(c, k, i)->pin);
}
/* Blktrace macros */
#define blktrace_msg(c, fmt, ...) \
do { \
struct request_queue *q = bdev_get_queue(c->bdev); \
if (q) \
blk_add_trace_msg(q, fmt, ##__VA_ARGS__); \
} while (0)
#define blktrace_msg_all(s, fmt, ...) \
do { \
struct cache *_c; \
unsigned i; \
for_each_cache(_c, (s), i) \
blktrace_msg(_c, fmt, ##__VA_ARGS__); \
} while (0)
static inline void cached_dev_put(struct cached_dev *dc)
{
if (atomic_dec_and_test(&dc->count))
......@@ -1173,10 +1147,16 @@ static inline uint8_t bucket_disk_gen(struct bucket *b)
static struct kobj_attribute ksysfs_##n = \
__ATTR(n, S_IWUSR|S_IRUSR, show, store)
/* Forward declarations */
static inline void wake_up_allocators(struct cache_set *c)
{
struct cache *ca;
unsigned i;
for_each_cache(ca, c, i)
wake_up_process(ca->alloc_thread);
}
void bch_writeback_queue(struct cached_dev *);
void bch_writeback_add(struct cached_dev *, unsigned);
/* Forward declarations */
void bch_count_io_errors(struct cache *, int, const char *);
void bch_bbio_count_io_errors(struct cache_set *, struct bio *,
......@@ -1193,7 +1173,6 @@ void bch_submit_bbio(struct bio *, struct cache_set *, struct bkey *, unsigned);
uint8_t bch_inc_gen(struct cache *, struct bucket *);
void bch_rescale_priorities(struct cache_set *, int);
bool bch_bucket_add_unused(struct cache *, struct bucket *);
void bch_allocator_thread(struct closure *);
long bch_bucket_alloc(struct cache *, unsigned, struct closure *);
void bch_bucket_free(struct cache_set *, struct bkey *);
......@@ -1241,9 +1220,9 @@ void bch_cache_set_stop(struct cache_set *);
struct cache_set *bch_cache_set_alloc(struct cache_sb *);
void bch_btree_cache_free(struct cache_set *);
int bch_btree_cache_alloc(struct cache_set *);
void bch_cached_dev_writeback_init(struct cached_dev *);
void bch_moving_init_cache_set(struct cache_set *);
int bch_cache_allocator_start(struct cache *ca);
void bch_cache_allocator_exit(struct cache *ca);
int bch_cache_allocator_init(struct cache *ca);
......
......@@ -78,6 +78,7 @@ struct bkey *bch_keylist_pop(struct keylist *l)
bool __bch_ptr_invalid(struct cache_set *c, int level, const struct bkey *k)
{
unsigned i;
char buf[80];
if (level && (!KEY_PTRS(k) || !KEY_SIZE(k) || KEY_DIRTY(k)))
goto bad;
......@@ -102,7 +103,8 @@ bool __bch_ptr_invalid(struct cache_set *c, int level, const struct bkey *k)
return false;
bad:
cache_bug(c, "spotted bad key %s: %s", pkey(k), bch_ptr_status(c, k));
bch_bkey_to_text(buf, sizeof(buf), k);
cache_bug(c, "spotted bad key %s: %s", buf, bch_ptr_status(c, k));
return true;
}
......@@ -162,10 +164,16 @@ bool bch_ptr_bad(struct btree *b, const struct bkey *k)
#ifdef CONFIG_BCACHE_EDEBUG
bug:
mutex_unlock(&b->c->bucket_lock);
{
char buf[80];
bch_bkey_to_text(buf, sizeof(buf), k);
btree_bug(b,
"inconsistent pointer %s: bucket %zu pin %i prio %i gen %i last_gc %i mark %llu gc_gen %i",
pkey(k), PTR_BUCKET_NR(b->c, k, i), atomic_read(&g->pin),
buf, PTR_BUCKET_NR(b->c, k, i), atomic_read(&g->pin),
g->prio, g->gen, g->last_gc, GC_MARK(g), g->gc_gen);
}
return true;
#endif
}
......@@ -1084,33 +1092,39 @@ void bch_btree_sort_into(struct btree *b, struct btree *new)
new->sets->size = 0;
}
#define SORT_CRIT (4096 / sizeof(uint64_t))
void bch_btree_sort_lazy(struct btree *b)
{
if (b->nsets) {
unsigned i, j, keys = 0, total;
for (i = 0; i <= b->nsets; i++)
keys += b->sets[i].data->keys;
unsigned crit = SORT_CRIT;
int i;
total = keys;
/* Don't sort if nothing to do */
if (!b->nsets)
goto out;
for (j = 0; j < b->nsets; j++) {
if (keys * 2 < total ||
keys < 1000) {
bch_btree_sort_partial(b, j);
/* If not a leaf node, always sort */
if (b->level) {
bch_btree_sort(b);
return;
}
keys -= b->sets[j].data->keys;
for (i = b->nsets - 1; i >= 0; --i) {
crit *= b->c->sort_crit_factor;
if (b->sets[i].data->keys < crit) {
bch_btree_sort_partial(b, i);
return;
}
}
/* Must sort if b->nsets == 3 or we'll overflow */
if (b->nsets >= (MAX_BSETS - 1) - b->level) {
/* Sort if we'd overflow */
if (b->nsets + 1 == MAX_BSETS) {
bch_btree_sort(b);
return;
}
}
out:
bset_build_written_tree(b);
}
......
#ifndef _BCACHE_BSET_H
#define _BCACHE_BSET_H
#include <linux/slab.h>
/*
* BKEYS:
*
......@@ -142,6 +144,8 @@
/* Btree key comparison/iteration */
#define MAX_BSETS 4U
struct btree_iter {
size_t size, used;
struct btree_iter_set {
......
This diff is collapsed.
......@@ -102,7 +102,6 @@
#include "debug.h"
struct btree_write {
struct closure *owner;
atomic_t *journal;
/* If btree_split() frees a btree node, it writes a new pointer to that
......@@ -142,16 +141,12 @@ struct btree {
*/
struct bset_tree sets[MAX_BSETS];
/* Used to refcount bio splits, also protects b->bio */
/* For outstanding btree writes, used as a lock - protects write_idx */
struct closure_with_waitlist io;
/* Gets transferred to w->prio_blocked - see the comment there */
int prio_blocked;
struct list_head list;
struct delayed_work work;
uint64_t io_start_time;
struct btree_write writes[2];
struct bio *bio;
};
......@@ -164,13 +159,11 @@ static inline void set_btree_node_ ## flag(struct btree *b) \
{ set_bit(BTREE_NODE_ ## flag, &b->flags); } \
enum btree_flags {
BTREE_NODE_read_done,
BTREE_NODE_io_error,
BTREE_NODE_dirty,
BTREE_NODE_write_idx,
};
BTREE_FLAG(read_done);
BTREE_FLAG(io_error);
BTREE_FLAG(dirty);
BTREE_FLAG(write_idx);
......@@ -278,6 +271,13 @@ struct btree_op {
BKEY_PADDED(replace);
};
enum {
BTREE_INSERT_STATUS_INSERT,
BTREE_INSERT_STATUS_BACK_MERGE,
BTREE_INSERT_STATUS_OVERWROTE,
BTREE_INSERT_STATUS_FRONT_MERGE,
};
void bch_btree_op_init_stack(struct btree_op *);
static inline void rw_lock(bool w, struct btree *b, int level)
......@@ -293,9 +293,7 @@ static inline void rw_unlock(bool w, struct btree *b)
#ifdef CONFIG_BCACHE_EDEBUG
unsigned i;
if (w &&
b->key.ptr[0] &&
btree_node_read_done(b))
if (w && b->key.ptr[0])
for (i = 0; i <= b->nsets; i++)
bch_check_key_order(b, b->sets[i].data);
#endif
......@@ -370,9 +368,8 @@ static inline bool should_split(struct btree *b)
> btree_blocks(b));
}
void bch_btree_read_done(struct closure *);
void bch_btree_read(struct btree *);
void bch_btree_write(struct btree *b, bool now, struct btree_op *op);
void bch_btree_node_read(struct btree *);
void bch_btree_node_write(struct btree *, struct closure *);
void bch_cannibalize_unlock(struct cache_set *, struct closure *);
void bch_btree_set_root(struct btree *);
......@@ -380,7 +377,6 @@ struct btree *bch_btree_node_alloc(struct cache_set *, int, struct closure *);
struct btree *bch_btree_node_get(struct cache_set *, struct bkey *,
int, struct btree_op *);
bool bch_btree_insert_keys(struct btree *, struct btree_op *);
bool bch_btree_insert_check_key(struct btree *, struct btree_op *,
struct bio *);
int bch_btree_insert(struct btree_op *, struct cache_set *);
......@@ -393,13 +389,14 @@ void bch_moving_gc(struct closure *);
int bch_btree_check(struct cache_set *, struct btree_op *);
uint8_t __bch_btree_mark_key(struct cache_set *, int, struct bkey *);
void bch_keybuf_init(struct keybuf *, keybuf_pred_fn *);
void bch_refill_keybuf(struct cache_set *, struct keybuf *, struct bkey *);
void bch_keybuf_init(struct keybuf *);
void bch_refill_keybuf(struct cache_set *, struct keybuf *, struct bkey *,
keybuf_pred_fn *);
bool bch_keybuf_check_overlapping(struct keybuf *, struct bkey *,
struct bkey *);
void bch_keybuf_del(struct keybuf *, struct keybuf_key *);
struct keybuf_key *bch_keybuf_next(struct keybuf *);
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *,
struct keybuf *, struct bkey *);
struct keybuf_key *bch_keybuf_next_rescan(struct cache_set *, struct keybuf *,
struct bkey *, keybuf_pred_fn *);
#endif
......@@ -66,16 +66,18 @@ static inline void closure_put_after_sub(struct closure *cl, int flags)
} else {
struct closure *parent = cl->parent;
struct closure_waitlist *wait = closure_waitlist(cl);
closure_fn *destructor = cl->fn;
closure_debug_destroy(cl);
smp_mb();
atomic_set(&cl->remaining, -1);
if (wait)
closure_wake_up(wait);
if (cl->fn)
cl->fn(cl);
if (destructor)
destructor(cl);
if (parent)
closure_put(parent);
......
......@@ -47,11 +47,10 @@ const char *bch_ptr_status(struct cache_set *c, const struct bkey *k)
return "";
}
struct keyprint_hack bch_pkey(const struct bkey *k)
int bch_bkey_to_text(char *buf, size_t size, const struct bkey *k)
{
unsigned i = 0;
struct keyprint_hack r;
char *out = r.s, *end = r.s + KEYHACK_SIZE;
char *out = buf, *end = buf + size;
#define p(...) (out += scnprintf(out, end - out, __VA_ARGS__))
......@@ -75,16 +74,14 @@ struct keyprint_hack bch_pkey(const struct bkey *k)
if (KEY_CSUM(k))
p(" cs%llu %llx", KEY_CSUM(k), k->ptr[1]);
#undef p
return r;
return out - buf;
}
struct keyprint_hack bch_pbtree(const struct btree *b)
int bch_btree_to_text(char *buf, size_t size, const struct btree *b)
{
struct keyprint_hack r;
snprintf(r.s, 40, "%zu level %i/%i", PTR_BUCKET_NR(b->c, &b->key, 0),
return scnprintf(buf, size, "%zu level %i/%i",
PTR_BUCKET_NR(b->c, &b->key, 0),
b->level, b->c->root ? b->c->root->level : -1);
return r;
}
#if defined(CONFIG_BCACHE_DEBUG) || defined(CONFIG_BCACHE_EDEBUG)
......@@ -100,10 +97,12 @@ static void dump_bset(struct btree *b, struct bset *i)
{
struct bkey *k;
unsigned j;
char buf[80];
for (k = i->start; k < end(i); k = bkey_next(k)) {
bch_bkey_to_text(buf, sizeof(buf), k);
printk(KERN_ERR "block %zu key %zi/%u: %s", index(i, b),
(uint64_t *) k - i->d, i->keys, pkey(k));
(uint64_t *) k - i->d, i->keys, buf);
for (j = 0; j < KEY_PTRS(k); j++) {
size_t n = PTR_BUCKET_NR(b->c, k, j);
......@@ -144,7 +143,7 @@ void bch_btree_verify(struct btree *b, struct bset *new)
v->written = 0;
v->level = b->level;
bch_btree_read(v);
bch_btree_node_read(v);
closure_wait_event(&v->io.wait, &cl,
atomic_read(&b->io.cl.remaining) == -1);
......@@ -200,7 +199,7 @@ void bch_data_verify(struct search *s)
if (!check)
return;
if (bch_bio_alloc_pages(check, GFP_NOIO))
if (bio_alloc_pages(check, GFP_NOIO))
goto out_put;
check->bi_rw = READ_SYNC;
......@@ -252,6 +251,7 @@ static void vdump_bucket_and_panic(struct btree *b, const char *fmt,
va_list args)
{
unsigned i;
char buf[80];
console_lock();
......@@ -262,7 +262,8 @@ static void vdump_bucket_and_panic(struct btree *b, const char *fmt,
console_unlock();
panic("at %s\n", pbtree(b));
bch_btree_to_text(buf, sizeof(buf), b);
panic("at %s\n", buf);
}
void bch_check_key_order_msg(struct btree *b, struct bset *i,
......@@ -337,6 +338,7 @@ static ssize_t bch_dump_read(struct file *file, char __user *buf,
{
struct dump_iterator *i = file->private_data;
ssize_t ret = 0;
char kbuf[80];
while (size) {
struct keybuf_key *w;
......@@ -355,11 +357,12 @@ static ssize_t bch_dump_read(struct file *file, char __user *buf,
if (i->bytes)
break;
w = bch_keybuf_next_rescan(i->c, &i->keys, &MAX_KEY);
w = bch_keybuf_next_rescan(i->c, &i->keys, &MAX_KEY, dump_pred);
if (!w)
break;
i->bytes = snprintf(i->buf, PAGE_SIZE, "%s\n", pkey(&w->key));
bch_bkey_to_text(kbuf, sizeof(kbuf), &w->key);
i->bytes = snprintf(i->buf, PAGE_SIZE, "%s\n", kbuf);
bch_keybuf_del(&i->keys, w);
}
......@@ -377,7 +380,7 @@ static int bch_dump_open(struct inode *inode, struct file *file)
file->private_data = i;
i->c = c;
bch_keybuf_init(&i->keys, dump_pred);
bch_keybuf_init(&i->keys);
i->keys.last_scanned = KEY(0, 0, 0);
return 0;
......@@ -409,142 +412,6 @@ void bch_debug_init_cache_set(struct cache_set *c)
#endif
/* Fuzz tester has rotted: */
#if 0
static ssize_t btree_fuzz(struct kobject *k, struct kobj_attribute *a,
const char *buffer, size_t size)
{
void dump(struct btree *b)
{
struct bset *i;
for (i = b->sets[0].data;
index(i, b) < btree_blocks(b) &&
i->seq == b->sets[0].data->seq;
i = ((void *) i) + set_blocks(i, b->c) * block_bytes(b->c))
dump_bset(b, i);
}
struct cache_sb *sb;
struct cache_set *c;
struct btree *all[3], *b, *fill, *orig;
int j;
struct btree_op op;
bch_btree_op_init_stack(&op);
sb = kzalloc(sizeof(struct cache_sb), GFP_KERNEL);
if (!sb)
return -ENOMEM;
sb->bucket_size = 128;
sb->block_size = 4;
c = bch_cache_set_alloc(sb);
if (!c)
return -ENOMEM;
for (j = 0; j < 3; j++) {
BUG_ON(list_empty(&c->btree_cache));
all[j] = list_first_entry(&c->btree_cache, struct btree, list);
list_del_init(&all[j]->list);
all[j]->key = KEY(0, 0, c->sb.bucket_size);
bkey_copy_key(&all[j]->key, &MAX_KEY);
}
b = all[0];
fill = all[1];
orig = all[2];
while (1) {
for (j = 0; j < 3; j++)
all[j]->written = all[j]->nsets = 0;
bch_bset_init_next(b);
while (1) {
struct bset *i = write_block(b);
struct bkey *k = op.keys.top;
unsigned rand;
bkey_init(k);
rand = get_random_int();
op.type = rand & 1
? BTREE_INSERT
: BTREE_REPLACE;
rand >>= 1;
SET_KEY_SIZE(k, bucket_remainder(c, rand));
rand >>= c->bucket_bits;
rand &= 1024 * 512 - 1;
rand += c->sb.bucket_size;
SET_KEY_OFFSET(k, rand);
#if 0
SET_KEY_PTRS(k, 1);
#endif
bch_keylist_push(&op.keys);
bch_btree_insert_keys(b, &op);
if (should_split(b) ||
set_blocks(i, b->c) !=
__set_blocks(i, i->keys + 15, b->c)) {
i->csum = csum_set(i);
memcpy(write_block(fill),
i, set_bytes(i));
b->written += set_blocks(i, b->c);
fill->written = b->written;
if (b->written == btree_blocks(b))
break;
bch_btree_sort_lazy(b);
bch_bset_init_next(b);
}
}
memcpy(orig->sets[0].data,
fill->sets[0].data,
btree_bytes(c));
bch_btree_sort(b);
fill->written = 0;
bch_btree_read_done(&fill->io.cl);
if (b->sets[0].data->keys != fill->sets[0].data->keys ||
memcmp(b->sets[0].data->start,
fill->sets[0].data->start,
b->sets[0].data->keys * sizeof(uint64_t))) {
struct bset *i = b->sets[0].data;
struct bkey *k, *l;
for (k = i->start,
l = fill->sets[0].data->start;
k < end(i);
k = bkey_next(k), l = bkey_next(l))
if (bkey_cmp(k, l) ||
KEY_SIZE(k) != KEY_SIZE(l))
pr_err("key %zi differs: %s != %s",
(uint64_t *) k - i->d,
pkey(k), pkey(l));
for (j = 0; j < 3; j++) {
pr_err("**** Set %i ****", j);
dump(all[j]);
}
panic("\n");
}
pr_info("fuzz complete: %i keys", b->sets[0].data->keys);
}
}
kobj_attribute_write(fuzz, btree_fuzz);
#endif
void bch_debug_exit(void)
{
if (!IS_ERR_OR_NULL(debug))
......@@ -554,11 +421,6 @@ void bch_debug_exit(void)
int __init bch_debug_init(struct kobject *kobj)
{
int ret = 0;
#if 0
ret = sysfs_create_file(kobj, &ksysfs_fuzz.attr);
if (ret)
return ret;
#endif
debug = debugfs_create_dir("bcache", NULL);
return ret;
......
......@@ -3,15 +3,8 @@
/* Btree/bkey debug printing */
#define KEYHACK_SIZE 80
struct keyprint_hack {
char s[KEYHACK_SIZE];
};
struct keyprint_hack bch_pkey(const struct bkey *k);
struct keyprint_hack bch_pbtree(const struct btree *b);
#define pkey(k) (&bch_pkey(k).s[0])
#define pbtree(b) (&bch_pbtree(b).s[0])
int bch_bkey_to_text(char *buf, size_t size, const struct bkey *k);
int bch_btree_to_text(char *buf, size_t size, const struct btree *b);
#ifdef CONFIG_BCACHE_EDEBUG
......
......@@ -9,6 +9,8 @@
#include "bset.h"
#include "debug.h"
#include <linux/blkdev.h>
static void bch_bi_idx_hack_endio(struct bio *bio, int error)
{
struct bio *p = bio->bi_private;
......@@ -66,13 +68,6 @@ static void bch_generic_make_request_hack(struct bio *bio)
* The newly allocated bio will point to @bio's bi_io_vec, if the split was on a
* bvec boundry; it is the caller's responsibility to ensure that @bio is not
* freed before the split.
*
* If bch_bio_split() is running under generic_make_request(), it's not safe to
* allocate more than one bio from the same bio set. Therefore, if it is running
* under generic_make_request() it masks out __GFP_WAIT when doing the
* allocation. The caller must check for failure if there's any possibility of
* it being called from under generic_make_request(); it is then the caller's
* responsibility to retry from a safe context (by e.g. punting to workqueue).
*/
struct bio *bch_bio_split(struct bio *bio, int sectors,
gfp_t gfp, struct bio_set *bs)
......@@ -83,20 +78,13 @@ struct bio *bch_bio_split(struct bio *bio, int sectors,
BUG_ON(sectors <= 0);
/*
* If we're being called from underneath generic_make_request() and we
* already allocated any bios from this bio set, we risk deadlock if we
* use the mempool. So instead, we possibly fail and let the caller punt
* to workqueue or somesuch and retry in a safe context.
*/
if (current->bio_list)
gfp &= ~__GFP_WAIT;
if (sectors >= bio_sectors(bio))
return bio;
if (bio->bi_rw & REQ_DISCARD) {
ret = bio_alloc_bioset(gfp, 1, bs);
if (!ret)
return NULL;
idx = 0;
goto out;
}
......@@ -160,17 +148,18 @@ static unsigned bch_bio_max_sectors(struct bio *bio)
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
unsigned max_segments = min_t(unsigned, BIO_MAX_PAGES,
queue_max_segments(q));
struct bio_vec *bv, *end = bio_iovec(bio) +
min_t(int, bio_segments(bio), max_segments);
if (bio->bi_rw & REQ_DISCARD)
return min(ret, q->limits.max_discard_sectors);
if (bio_segments(bio) > max_segments ||
q->merge_bvec_fn) {
struct bio_vec *bv;
int i, seg = 0;
ret = 0;
for (bv = bio_iovec(bio); bv < end; bv++) {
bio_for_each_segment(bv, bio, i) {
struct bvec_merge_data bvm = {
.bi_bdev = bio->bi_bdev,
.bi_sector = bio->bi_sector,
......@@ -178,10 +167,14 @@ static unsigned bch_bio_max_sectors(struct bio *bio)
.bi_rw = bio->bi_rw,
};
if (seg == max_segments)
break;
if (q->merge_bvec_fn &&
q->merge_bvec_fn(q, &bvm, bv) < (int) bv->bv_len)
break;
seg++;
ret += bv->bv_len >> 9;
}
}
......@@ -218,30 +211,10 @@ static void bch_bio_submit_split_endio(struct bio *bio, int error)
closure_put(cl);
}
static void __bch_bio_submit_split(struct closure *cl)
{
struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
struct bio *bio = s->bio, *n;
do {
n = bch_bio_split(bio, bch_bio_max_sectors(bio),
GFP_NOIO, s->p->bio_split);
if (!n)
continue_at(cl, __bch_bio_submit_split, system_wq);
n->bi_end_io = bch_bio_submit_split_endio;
n->bi_private = cl;
closure_get(cl);
bch_generic_make_request_hack(n);
} while (n != bio);
continue_at(cl, bch_bio_submit_split_done, NULL);
}
void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
{
struct bio_split_hook *s;
struct bio *n;
if (!bio_has_data(bio) && !(bio->bi_rw & REQ_DISCARD))
goto submit;
......@@ -250,6 +223,7 @@ void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
goto submit;
s = mempool_alloc(p->bio_split_hook, GFP_NOIO);
closure_init(&s->cl, NULL);
s->bio = bio;
s->p = p;
......@@ -257,8 +231,18 @@ void bch_generic_make_request(struct bio *bio, struct bio_split_pool *p)
s->bi_private = bio->bi_private;
bio_get(bio);
closure_call(&s->cl, __bch_bio_submit_split, NULL, NULL);
return;
do {
n = bch_bio_split(bio, bch_bio_max_sectors(bio),
GFP_NOIO, s->p->bio_split);
n->bi_end_io = bch_bio_submit_split_endio;
n->bi_private = &s->cl;
closure_get(&s->cl);
bch_generic_make_request_hack(n);
} while (n != bio);
continue_at(&s->cl, bch_bio_submit_split_done, NULL);
submit:
bch_generic_make_request_hack(bio);
}
......
......@@ -9,6 +9,8 @@
#include "debug.h"
#include "request.h"
#include <trace/events/bcache.h>
/*
* Journal replay/recovery:
*
......@@ -182,9 +184,14 @@ int bch_journal_read(struct cache_set *c, struct list_head *list,
pr_debug("starting binary search, l %u r %u", l, r);
while (l + 1 < r) {
seq = list_entry(list->prev, struct journal_replay,
list)->j.seq;
m = (l + r) >> 1;
read_bucket(m);
if (read_bucket(m))
if (seq != list_entry(list->prev, struct journal_replay,
list)->j.seq)
l = m;
else
r = m;
......@@ -300,7 +307,8 @@ int bch_journal_replay(struct cache_set *s, struct list_head *list,
for (k = i->j.start;
k < end(&i->j);
k = bkey_next(k)) {
pr_debug("%s", pkey(k));
trace_bcache_journal_replay_key(k);
bkey_copy(op->keys.top, k);
bch_keylist_push(&op->keys);
......@@ -384,7 +392,7 @@ static void btree_flush_write(struct cache_set *c)
return;
found:
if (btree_node_dirty(best))
bch_btree_write(best, true, NULL);
bch_btree_node_write(best, NULL);
rw_unlock(true, best);
}
......@@ -617,7 +625,7 @@ static void journal_write_unlocked(struct closure *cl)
bio_reset(bio);
bio->bi_sector = PTR_OFFSET(k, i);
bio->bi_bdev = ca->bdev;
bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH;
bio->bi_rw = REQ_WRITE|REQ_SYNC|REQ_META|REQ_FLUSH|REQ_FUA;
bio->bi_size = sectors << 9;
bio->bi_end_io = journal_write_endio;
......@@ -712,7 +720,8 @@ void bch_journal(struct closure *cl)
spin_lock(&c->journal.lock);
if (journal_full(&c->journal)) {
/* XXX: tracepoint */
trace_bcache_journal_full(c);
closure_wait(&c->journal.wait, cl);
journal_reclaim(c);
......@@ -728,13 +737,15 @@ void bch_journal(struct closure *cl)
if (b * c->sb.block_size > PAGE_SECTORS << JSET_BITS ||
b > c->journal.blocks_free) {
/* XXX: If we were inserting so many keys that they won't fit in
trace_bcache_journal_entry_full(c);
/*
* XXX: If we were inserting so many keys that they won't fit in
* an _empty_ journal write, we'll deadlock. For now, handle
* this in bch_keylist_realloc() - but something to think about.
*/
BUG_ON(!w->data->keys);
/* XXX: tracepoint */
BUG_ON(!closure_wait(&w->wait, cl));
closure_flush(&c->journal.io);
......
......@@ -9,6 +9,8 @@
#include "debug.h"
#include "request.h"
#include <trace/events/bcache.h>
struct moving_io {
struct keybuf_key *w;
struct search s;
......@@ -44,14 +46,14 @@ static void write_moving_finish(struct closure *cl)
{
struct moving_io *io = container_of(cl, struct moving_io, s.cl);
struct bio *bio = &io->bio.bio;
struct bio_vec *bv = bio_iovec_idx(bio, bio->bi_vcnt);
struct bio_vec *bv;
int i;
while (bv-- != bio->bi_io_vec)
bio_for_each_segment_all(bv, bio, i)
__free_page(bv->bv_page);
pr_debug("%s %s", io->s.op.insert_collision
? "collision moving" : "moved",
pkey(&io->w->key));
if (io->s.op.insert_collision)
trace_bcache_gc_copy_collision(&io->w->key);
bch_keybuf_del(&io->s.op.c->moving_gc_keys, io->w);
......@@ -94,8 +96,6 @@ static void write_moving(struct closure *cl)
struct moving_io *io = container_of(s, struct moving_io, s);
if (!s->error) {
trace_bcache_write_moving(&io->bio.bio);
moving_init(io);
io->bio.bio.bi_sector = KEY_START(&io->w->key);
......@@ -122,7 +122,6 @@ static void read_moving_submit(struct closure *cl)
struct moving_io *io = container_of(s, struct moving_io, s);
struct bio *bio = &io->bio.bio;
trace_bcache_read_moving(bio);
bch_submit_bbio(bio, s->op.c, &io->w->key, 0);
continue_at(cl, write_moving, bch_gc_wq);
......@@ -138,7 +137,8 @@ static void read_moving(struct closure *cl)
/* XXX: if we error, background writeback could stall indefinitely */
while (!test_bit(CACHE_SET_STOPPING, &c->flags)) {
w = bch_keybuf_next_rescan(c, &c->moving_gc_keys, &MAX_KEY);
w = bch_keybuf_next_rescan(c, &c->moving_gc_keys,
&MAX_KEY, moving_pred);
if (!w)
break;
......@@ -159,10 +159,10 @@ static void read_moving(struct closure *cl)
bio->bi_rw = READ;
bio->bi_end_io = read_moving_endio;
if (bch_bio_alloc_pages(bio, GFP_KERNEL))
if (bio_alloc_pages(bio, GFP_KERNEL))
goto err;
pr_debug("%s", pkey(&w->key));
trace_bcache_gc_copy(&w->key);
closure_call(&io->s.cl, read_moving_submit, NULL, &c->gc.cl);
......@@ -250,5 +250,5 @@ void bch_moving_gc(struct closure *cl)
void bch_moving_init_cache_set(struct cache_set *c)
{
bch_keybuf_init(&c->moving_gc_keys, moving_pred);
bch_keybuf_init(&c->moving_gc_keys);
}
This diff is collapsed.
......@@ -30,7 +30,7 @@ struct search {
};
void bch_cache_read_endio(struct bio *, int);
int bch_get_congested(struct cache_set *);
unsigned bch_get_congested(struct cache_set *);
void bch_insert_data(struct closure *cl);
void bch_btree_insert_async(struct closure *);
void bch_cache_read_endio(struct bio *, int);
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
......@@ -15,8 +15,6 @@
struct closure;
#include <trace/events/bcache.h>
#ifdef CONFIG_BCACHE_EDEBUG
#define atomic_dec_bug(v) BUG_ON(atomic_dec_return(v) < 0)
......@@ -566,12 +564,8 @@ static inline unsigned fract_exp_two(unsigned x, unsigned fract_bits)
return x;
}
#define bio_end(bio) ((bio)->bi_sector + bio_sectors(bio))
void bch_bio_map(struct bio *bio, void *base);
int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp);
static inline sector_t bdev_sectors(struct block_device *bdev)
{
return bdev->bd_inode->i_size >> 9;
......
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment