Commit f35546e0 authored by Jens Axboe


Merge branch 'stable/for-jens-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers

Konrad writes:

It has the 'feature-max-indirect-segments' extension implemented in both the backend
and the frontend. The current problem with both is that each request is limited to
11 segments (pages), so we can squeeze at most 44kB into a single request. The ring
can hold 32 requests (the next power of two below 36), meaning we can have only about
1.4MB of I/O outstanding at any time. Nowadays that is not enough.
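
For reference, a quick back-of-the-envelope check of those figures (a standalone
sketch, assuming 4 KiB pages and the 32-slot single-page ring described above):

  /* Sketch: the pre-indirect-descriptor limits quoted above. */
  #include <stdio.h>

  int main(void)
  {
      const unsigned long page_size    = 4096;
      const unsigned long segs_per_req = 11;   /* BLKIF_MAX_SEGMENTS_PER_REQUEST */
      const unsigned long ring_slots   = 32;   /* requests in a single-page ring */

      unsigned long per_req  = segs_per_req * page_size;   /* 45056 B ~= 44 kB  */
      unsigned long per_ring = ring_slots * per_req;       /* ~1.4 MB in flight */

      printf("per request: %lu kB, per ring: %lu kB\n",
             per_req / 1024, per_ring / 1024);
      return 0;
  }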

The problem was addressed in the past in two ways - but neither one went upstream.
The first solution, proposed by Justin from Spectralogic, was to negotiate the segment
count, which makes 'struct blkif_sring_entry' variable-sized. It can expand from
112 bytes (covering 11 pages of data - 44kB) to 1580 bytes (256 pages of data - so 1MB).
It is a simple extension: the segment array in the request grows from a fixed 11 entries
to whatever size was negotiated. But it has limits: this extension still caps the number
of segments per request at 255, as the total number must be specified in the request,
which only has an 8-bit field for that purpose.
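
Purely for illustration (this is not the actual Spectralogic patch, just a sketch of
the idea with made-up field names): a ring entry whose segment array length is agreed
at connection time, with the 8-bit segment count being the 255-entry ceiling mentioned
above.

  /* Illustrative only: a request whose segment array size is negotiated. */
  #include <stdint.h>
  #include <stdio.h>

  struct seg {
      uint32_t gref;
      uint8_t  first_sect, last_sect;
  };

  struct negotiated_request {
      uint8_t  operation;
      uint8_t  nr_segments;     /* 8-bit field -> at most 255 segments */
      uint16_t handle;
      uint64_t id;
      uint64_t sector_number;
      struct seg seg[];         /* length agreed during ring setup */
  };

  int main(void)
  {
      printf("fixed header: %zu bytes, plus %zu bytes per negotiated segment\n",
             sizeof(struct negotiated_request), sizeof(struct seg));
      return 0;
  }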

The other solution (from Intel - Ronghui) was to create one extra ring that only holds
'struct blkif_request_segment' entries. 'struct blkif_request' would be changed to carry
an index into said 'segment ring'. There is only one segment ring, so the size of the
initial ring stays the same. Each request would point into the segment ring and state
how many of its entries it wants to use. The limit is of course the size of the segment
ring. Assuming a one-page segment ring, a single request can cover ~4MB.

Those patches were posted as an RFC and the author never followed up on suggestions
to make the design a bit more flexible.

There is yet another mechanism that could be employed (and it is what these patches
implement), borrowed from the VirtIO protocol: 'indirect descriptors'. This is very
similar to what Intel suggested, but with a twist: we negotiate how many of these
'segment' pages (aka indirect descriptor pages) we want to support (in reality we
negotiate how many segment entries we want to cover, and cap the number if it is
bigger than what the segment pages can hold).

This means that with the existing ring (36 slots in a single page) we can cover:
32 slots * what each blkif_request_indirect covers (512 * 4096 bytes) ~= 64MB.
Since there is ample space in blkif_request_indirect to span more than one indirect
page, that number (64MB) can also be multiplied by eight, giving 512MB.
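
The same arithmetic as a standalone sketch, assuming 4 KiB pages and the 8-byte
per-segment descriptor that gives the 512-entries-per-page figure used above:

  /* Sketch: data covered per ring once indirect descriptors are used. */
  #include <stdio.h>

  int main(void)
  {
      const unsigned long page_size      = 4096;
      const unsigned long seg_desc_size  = 8;   /* per-entry descriptor size  */
      const unsigned long ring_slots     = 32;
      const unsigned long indirect_pages = 8;   /* pages one request can span */

      unsigned long segs_per_page = page_size / seg_desc_size;    /* 512     */
      unsigned long per_req       = segs_per_page * page_size;    /* 2 MiB   */
      unsigned long per_ring      = ring_slots * per_req;         /* 64 MiB  */
      unsigned long max_per_ring  = per_ring * indirect_pages;    /* 512 MiB */

      printf("one indirect page:    %lu MiB per ring\n", per_ring >> 20);
      printf("eight indirect pages: %lu MiB per ring\n", max_per_ring >> 20);
      return 0;
  }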

Roger Pau Monné took the idea and implemented it in these patches. They work great,
and the corner cases (migration between backends with and without this extension)
work nicely. The backend currently has a limit on how many indirect entries it can
handle: one indirect page, and at most 256 entries (out of 512 - so 50% of the page
is used). That comes out to 32 slots * 256 entries in an indirect page * 1 indirect
page per request * 4096 bytes = 32MB.
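
The backend's conservative default, worked out the same way (256 of the 512 possible
entries, one indirect page per request; a sketch, again assuming 4 KiB pages):

  /* Sketch: the backend's default cap of 256 entries per indirect page. */
  #include <stdio.h>

  int main(void)
  {
      const unsigned long page_size    = 4096;
      const unsigned long ring_slots   = 32;
      const unsigned long max_indirect = 256;   /* backend default */

      printf("%lu MiB of data in flight per ring\n",
             (ring_slots * max_indirect * page_size) >> 20);   /* 32 MiB */
      return 0;
  }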

This is a conservative number that can change in the future. Right now it strikes
a good balance between excellent performance, memory usage in the backend, and the
needs of many guests.

The patchset also splits the blkback structure so that it is per-VBD. This eliminates
the spinlock contention we had when many guests were doing I/O and all the blkback
threads were hitting the same lock.
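
In code terms, the direction of that split looks roughly like this (a simplified
sketch with minimal stand-in types, not the kernel's definitions; the real fields
are visible in the struct xen_blkif changes further down):

  /* Minimal stand-ins so the sketch compiles outside the kernel. */
  struct list_head { struct list_head *next, *prev; };
  typedef struct { int locked;  } spinlock_t;
  typedef struct { int waiters; } wait_queue_head_t;

  /* Before: one global pool, shared by every blkback thread. */
  struct xen_blkbk_sketch {
      struct list_head  pending_free;       /* all guests contend here */
      spinlock_t        pending_free_lock;
      wait_queue_head_t pending_free_wq;
  };

  /* After: the same state lives inside each per-VBD xen_blkif, so
   * guests no longer fight over a single hot spinlock. */
  struct xen_blkif_sketch {
      /* ... per-VBD ring, grant and buffer state ... */
      struct list_head  pending_free;
      spinlock_t        pending_free_lock;
      wait_queue_head_t pending_free_wq;
  };

  int main(void) { return 0; }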

There are also bug fixes to deal with oddly sized sectors and insane request counts
on the ring, and a security fix (posted earlier).
parents 36f988e9 1e0f7a21

What:		/sys/module/xen_blkback/parameters/max_buffer_pages
Date:		March 2013
KernelVersion:	3.11
Contact:	Roger Pau Monné <roger.pau@citrix.com>
Description:
		Maximum number of free pages to keep in each block
		backend buffer.

What:		/sys/module/xen_blkback/parameters/max_persistent_grants
Date:		March 2013
KernelVersion:	3.11
Contact:	Roger Pau Monné <roger.pau@citrix.com>
Description:
		Maximum number of grants to map persistently in
		blkback. If the frontend tries to use more than
		max_persistent_grants, the LRU kicks in and starts
		removing 5% of max_persistent_grants every 100ms.

What:		/sys/module/xen_blkfront/parameters/max
Date:		June 2013
KernelVersion:	3.11
Contact:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Description:
		Maximum number of segments that the frontend will negotiate
		with the backend for indirect descriptors. The default value
		is 32 - a higher value means more potential throughput but more
		memory usage. The backend picks the minimum of the frontend
		and its default backend value.
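
These are ordinary module parameters, so they can be inspected (and, where the file
mode allows, changed) through sysfs at runtime. A minimal userspace sketch, assuming
the modules are loaded and using the paths documented above:

  /* Sketch: read the xen-blkback/xen-blkfront tunables described above. */
  #include <stdio.h>

  static int read_param(const char *path, unsigned int *val)
  {
      FILE *f = fopen(path, "r");
      int ok;

      if (!f)
          return -1;
      ok = (fscanf(f, "%u", val) == 1) ? 0 : -1;
      fclose(f);
      return ok;
  }

  int main(void)
  {
      unsigned int buf_pages, pgrants, fe_max;

      if (!read_param("/sys/module/xen_blkback/parameters/max_buffer_pages", &buf_pages))
          printf("blkback keeps up to %u free pages per backend\n", buf_pages);

      if (!read_param("/sys/module/xen_blkback/parameters/max_persistent_grants", &pgrants))
          /* LRU removes 5% of this every 100ms once the limit is exceeded;
           * the integer arithmetic mirrors the backend's own calculation. */
          printf("blkback maps up to %u persistent grants (~%u reclaimed per pass)\n",
                 pgrants, (pgrants / 100) * 5);

      if (!read_param("/sys/module/xen_blkfront/parameters/max", &fe_max))
          printf("blkfront negotiates up to %u indirect segments\n", fe_max);

      return 0;
  }
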
@@ -50,110 +50,118 @@
 #include "common.h"

-/*
- * These are rather arbitrary. They are fairly large because adjacent requests
- * pulled from a communication ring are quite likely to end up being part of
- * the same scatter/gather request at the disc.
- *
- * ** TRY INCREASING 'xen_blkif_reqs' IF WRITE SPEEDS SEEM TOO LOW **
- *
- * This will increase the chances of being able to write whole tracks.
- * 64 should be enough to keep us competitive with Linux.
- */
-static int xen_blkif_reqs = 64;
-module_param_named(reqs, xen_blkif_reqs, int, 0);
-MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");
-
-/* Run-time switchable: /sys/module/blkback/parameters/ */
-static unsigned int log_stats;
-module_param(log_stats, int, 0644);
-
-/*
- * Each outstanding request that we've passed to the lower device layers has a
- * 'pending_req' allocated to it. Each buffer_head that completes decrements
- * the pendcnt towards zero. When it hits zero, the specified domain has a
- * response queued for it, with the saved 'id' passed back.
- */
+/*
+ * Maximum number of unused free pages to keep in the internal buffer.
+ * Setting this to a value too low will reduce memory used in each backend,
+ * but can have a performance penalty.
+ *
+ * A sane value is xen_blkif_reqs * BLKIF_MAX_SEGMENTS_PER_REQUEST, but can
+ * be set to a lower value that might degrade performance on some intensive
+ * IO workloads.
+ */
+static int xen_blkif_max_buffer_pages = 1024;
+module_param_named(max_buffer_pages, xen_blkif_max_buffer_pages, int, 0644);
+MODULE_PARM_DESC(max_buffer_pages,
+		 "Maximum number of free pages to keep in each block backend buffer");
+
+/*
+ * Maximum number of grants to map persistently in blkback. For maximum
+ * performance this should be the total numbers of grants that can be used
+ * to fill the ring, but since this might become too high, specially with
+ * the use of indirect descriptors, we set it to a value that provides good
+ * performance without using too much memory.
+ *
+ * When the list of persistent grants is full we clean it up using a LRU
+ * algorithm.
+ */
struct pending_req {
struct xen_blkif *blkif;
u64 id;
int nr_pages;
atomic_t pendcnt;
unsigned short operation;
int status;
struct list_head free_list;
DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
};
#define BLKBACK_INVALID_HANDLE (~0) static int xen_blkif_max_pgrants = 1056;
module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
MODULE_PARM_DESC(max_persistent_grants,
"Maximum number of grants to map persistently");
struct xen_blkbk { /*
struct pending_req *pending_reqs; * The LRU mechanism to clean the lists of persistent grants needs to
/* List of all 'pending_req' available */ * be executed periodically. The time interval between consecutive executions
struct list_head pending_free; * of the purge mechanism is set in ms.
/* And its spinlock. */ */
spinlock_t pending_free_lock; #define LRU_INTERVAL 100
wait_queue_head_t pending_free_wq;
/* The list of all pages that are available. */
struct page **pending_pages;
/* And the grant handles that are available. */
grant_handle_t *pending_grant_handles;
};
static struct xen_blkbk *blkbk;
/* /*
* Maximum number of grant pages that can be mapped in blkback. * When the persistent grants list is full we will remove unused grants
* BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of * from the list. The percent number of grants to be removed at each LRU
* pages that blkback will persistently map. * execution.
* Currently, this is:
* RING_SIZE = 32 (for all known ring types)
* BLKIF_MAX_SEGMENTS_PER_REQUEST = 11
* sizeof(struct persistent_gnt) = 48
* So the maximum memory used to store the grants is:
* 32 * 11 * 48 = 16896 bytes
*/ */
static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol) #define LRU_PERCENT_CLEAN 5
/* Run-time switchable: /sys/module/blkback/parameters/ */
static unsigned int log_stats;
module_param(log_stats, int, 0644);
#define BLKBACK_INVALID_HANDLE (~0)
/* Number of free pages to remove on each call to free_xenballooned_pages */
#define NUM_BATCH_FREE_PAGES 10
static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
{ {
switch (protocol) { unsigned long flags;
case BLKIF_PROTOCOL_NATIVE:
return __CONST_RING_SIZE(blkif, PAGE_SIZE) * spin_lock_irqsave(&blkif->free_pages_lock, flags);
BLKIF_MAX_SEGMENTS_PER_REQUEST; if (list_empty(&blkif->free_pages)) {
case BLKIF_PROTOCOL_X86_32: BUG_ON(blkif->free_pages_num != 0);
return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) * spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
BLKIF_MAX_SEGMENTS_PER_REQUEST; return alloc_xenballooned_pages(1, page, false);
case BLKIF_PROTOCOL_X86_64:
return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
BLKIF_MAX_SEGMENTS_PER_REQUEST;
default:
BUG();
} }
BUG_ON(blkif->free_pages_num == 0);
page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
list_del(&page[0]->lru);
blkif->free_pages_num--;
spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
return 0; return 0;
} }
static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
/* int num)
* Little helpful macro to figure out the index and virtual address of the
* pending_pages[..]. For each 'pending_req' we have have up to
* BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
* 10 and would index in the pending_pages[..].
*/
static inline int vaddr_pagenr(struct pending_req *req, int seg)
{ {
return (req - blkbk->pending_reqs) * unsigned long flags;
BLKIF_MAX_SEGMENTS_PER_REQUEST + seg; int i;
}
#define pending_page(req, seg) pending_pages[vaddr_pagenr(req, seg)] spin_lock_irqsave(&blkif->free_pages_lock, flags);
for (i = 0; i < num; i++)
list_add(&page[i]->lru, &blkif->free_pages);
blkif->free_pages_num += num;
spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
}
static inline unsigned long vaddr(struct pending_req *req, int seg) static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
{ {
unsigned long pfn = page_to_pfn(blkbk->pending_page(req, seg)); /* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
return (unsigned long)pfn_to_kaddr(pfn); struct page *page[NUM_BATCH_FREE_PAGES];
} unsigned int num_pages = 0;
unsigned long flags;
#define pending_handle(_req, _seg) \ spin_lock_irqsave(&blkif->free_pages_lock, flags);
(blkbk->pending_grant_handles[vaddr_pagenr(_req, _seg)]) while (blkif->free_pages_num > num) {
BUG_ON(list_empty(&blkif->free_pages));
page[num_pages] = list_first_entry(&blkif->free_pages,
struct page, lru);
list_del(&page[num_pages]->lru);
blkif->free_pages_num--;
if (++num_pages == NUM_BATCH_FREE_PAGES) {
spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
free_xenballooned_pages(num_pages, page);
spin_lock_irqsave(&blkif->free_pages_lock, flags);
num_pages = 0;
}
}
spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
if (num_pages != 0)
free_xenballooned_pages(num_pages, page);
}
#define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
static int do_block_io_op(struct xen_blkif *blkif); static int do_block_io_op(struct xen_blkif *blkif);
static int dispatch_rw_block_io(struct xen_blkif *blkif, static int dispatch_rw_block_io(struct xen_blkif *blkif,
@@ -170,13 +178,29 @@ static void make_response(struct xen_blkif *blkif, u64 id,
 	 (n) = (&(pos)->node != NULL) ? rb_next(&(pos)->node) : NULL)

-static void add_persistent_gnt(struct rb_root *root,
-			       struct persistent_gnt *persistent_gnt)
+/*
+ * We don't need locking around the persistent grant helpers
+ * because blkback uses a single-thread for each backed, so we
+ * can be sure that this functions will never be called recursively.
+ *
+ * The only exception to that is put_persistent_grant, that can be called
+ * from interrupt context (by xen_blkbk_unmap), so we have to use atomic
+ * bit operations to modify the flags of a persistent grant and to count
+ * the number of used grants.
+ */
+static int add_persistent_gnt(struct xen_blkif *blkif,
+			      struct persistent_gnt *persistent_gnt)
{ {
struct rb_node **new = &(root->rb_node), *parent = NULL; struct rb_node **new = NULL, *parent = NULL;
struct persistent_gnt *this; struct persistent_gnt *this;
if (blkif->persistent_gnt_c >= xen_blkif_max_pgrants) {
if (!blkif->vbd.overflow_max_grants)
blkif->vbd.overflow_max_grants = 1;
return -EBUSY;
}
/* Figure out where to put new node */ /* Figure out where to put new node */
new = &blkif->persistent_gnts.rb_node;
while (*new) { while (*new) {
this = container_of(*new, struct persistent_gnt, node); this = container_of(*new, struct persistent_gnt, node);
...@@ -186,22 +210,28 @@ static void add_persistent_gnt(struct rb_root *root, ...@@ -186,22 +210,28 @@ static void add_persistent_gnt(struct rb_root *root,
else if (persistent_gnt->gnt > this->gnt) else if (persistent_gnt->gnt > this->gnt)
new = &((*new)->rb_right); new = &((*new)->rb_right);
else { else {
pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n"); pr_alert_ratelimited(DRV_PFX " trying to add a gref that's already in the tree\n");
BUG(); return -EINVAL;
} }
} }
bitmap_zero(persistent_gnt->flags, PERSISTENT_GNT_FLAGS_SIZE);
set_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
/* Add new node and rebalance tree. */ /* Add new node and rebalance tree. */
rb_link_node(&(persistent_gnt->node), parent, new); rb_link_node(&(persistent_gnt->node), parent, new);
rb_insert_color(&(persistent_gnt->node), root); rb_insert_color(&(persistent_gnt->node), &blkif->persistent_gnts);
blkif->persistent_gnt_c++;
atomic_inc(&blkif->persistent_gnt_in_use);
return 0;
} }
static struct persistent_gnt *get_persistent_gnt(struct rb_root *root, static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
grant_ref_t gref) grant_ref_t gref)
{ {
struct persistent_gnt *data; struct persistent_gnt *data;
struct rb_node *node = root->rb_node; struct rb_node *node = NULL;
node = blkif->persistent_gnts.rb_node;
while (node) { while (node) {
data = container_of(node, struct persistent_gnt, node); data = container_of(node, struct persistent_gnt, node);
...@@ -209,13 +239,31 @@ static struct persistent_gnt *get_persistent_gnt(struct rb_root *root, ...@@ -209,13 +239,31 @@ static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
node = node->rb_left; node = node->rb_left;
else if (gref > data->gnt) else if (gref > data->gnt)
node = node->rb_right; node = node->rb_right;
else else {
if(test_bit(PERSISTENT_GNT_ACTIVE, data->flags)) {
pr_alert_ratelimited(DRV_PFX " requesting a grant already in use\n");
return NULL;
}
set_bit(PERSISTENT_GNT_ACTIVE, data->flags);
atomic_inc(&blkif->persistent_gnt_in_use);
return data; return data;
} }
}
return NULL; return NULL;
} }
static void free_persistent_gnts(struct rb_root *root, unsigned int num) static void put_persistent_gnt(struct xen_blkif *blkif,
struct persistent_gnt *persistent_gnt)
{
if(!test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
pr_alert_ratelimited(DRV_PFX " freeing a grant already unused");
set_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
clear_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
atomic_dec(&blkif->persistent_gnt_in_use);
}
static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
unsigned int num)
{ {
struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
...@@ -240,7 +288,7 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num) ...@@ -240,7 +288,7 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
ret = gnttab_unmap_refs(unmap, NULL, pages, ret = gnttab_unmap_refs(unmap, NULL, pages,
segs_to_unmap); segs_to_unmap);
BUG_ON(ret); BUG_ON(ret);
free_xenballooned_pages(segs_to_unmap, pages); put_free_pages(blkif, pages, segs_to_unmap);
segs_to_unmap = 0; segs_to_unmap = 0;
} }
...@@ -251,21 +299,148 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num) ...@@ -251,21 +299,148 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
BUG_ON(num != 0); BUG_ON(num != 0);
} }
static void unmap_purged_grants(struct work_struct *work)
{
struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct persistent_gnt *persistent_gnt;
int ret, segs_to_unmap = 0;
struct xen_blkif *blkif = container_of(work, typeof(*blkif), persistent_purge_work);
while(!list_empty(&blkif->persistent_purge_list)) {
persistent_gnt = list_first_entry(&blkif->persistent_purge_list,
struct persistent_gnt,
remove_node);
list_del(&persistent_gnt->remove_node);
gnttab_set_unmap_op(&unmap[segs_to_unmap],
vaddr(persistent_gnt->page),
GNTMAP_host_map,
persistent_gnt->handle);
pages[segs_to_unmap] = persistent_gnt->page;
if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
ret = gnttab_unmap_refs(unmap, NULL, pages,
segs_to_unmap);
BUG_ON(ret);
put_free_pages(blkif, pages, segs_to_unmap);
segs_to_unmap = 0;
}
kfree(persistent_gnt);
}
if (segs_to_unmap > 0) {
ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
BUG_ON(ret);
put_free_pages(blkif, pages, segs_to_unmap);
}
}
static void purge_persistent_gnt(struct xen_blkif *blkif)
{
struct persistent_gnt *persistent_gnt;
struct rb_node *n;
unsigned int num_clean, total;
bool scan_used = false, clean_used = false;
struct rb_root *root;
if (blkif->persistent_gnt_c < xen_blkif_max_pgrants ||
(blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
!blkif->vbd.overflow_max_grants)) {
return;
}
if (work_pending(&blkif->persistent_purge_work)) {
pr_alert_ratelimited(DRV_PFX "Scheduled work from previous purge is still pending, cannot purge list\n");
return;
}
num_clean = (xen_blkif_max_pgrants / 100) * LRU_PERCENT_CLEAN;
num_clean = blkif->persistent_gnt_c - xen_blkif_max_pgrants + num_clean;
num_clean = min(blkif->persistent_gnt_c, num_clean);
if ((num_clean == 0) ||
(num_clean > (blkif->persistent_gnt_c - atomic_read(&blkif->persistent_gnt_in_use))))
return;
/*
* At this point, we can assure that there will be no calls
* to get_persistent_grant (because we are executing this code from
* xen_blkif_schedule), there can only be calls to put_persistent_gnt,
* which means that the number of currently used grants will go down,
* but never up, so we will always be able to remove the requested
* number of grants.
*/
total = num_clean;
pr_debug(DRV_PFX "Going to purge %u persistent grants\n", num_clean);
INIT_LIST_HEAD(&blkif->persistent_purge_list);
root = &blkif->persistent_gnts;
purge_list:
foreach_grant_safe(persistent_gnt, n, root, node) {
BUG_ON(persistent_gnt->handle ==
BLKBACK_INVALID_HANDLE);
if (clean_used) {
clear_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
continue;
}
if (test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
continue;
if (!scan_used &&
(test_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags)))
continue;
rb_erase(&persistent_gnt->node, root);
list_add(&persistent_gnt->remove_node,
&blkif->persistent_purge_list);
if (--num_clean == 0)
goto finished;
}
/*
* If we get here it means we also need to start cleaning
* grants that were used since last purge in order to cope
* with the requested num
*/
if (!scan_used && !clean_used) {
pr_debug(DRV_PFX "Still missing %u purged frames\n", num_clean);
scan_used = true;
goto purge_list;
}
finished:
if (!clean_used) {
pr_debug(DRV_PFX "Finished scanning for grants to clean, removing used flag\n");
clean_used = true;
goto purge_list;
}
blkif->persistent_gnt_c -= (total - num_clean);
blkif->vbd.overflow_max_grants = 0;
/* We can defer this work */
INIT_WORK(&blkif->persistent_purge_work, unmap_purged_grants);
schedule_work(&blkif->persistent_purge_work);
pr_debug(DRV_PFX "Purged %u/%u\n", (total - num_clean), total);
return;
}
/* /*
* Retrieve from the 'pending_reqs' a free pending_req structure to be used. * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
*/ */
static struct pending_req *alloc_req(void) static struct pending_req *alloc_req(struct xen_blkif *blkif)
{ {
struct pending_req *req = NULL; struct pending_req *req = NULL;
unsigned long flags; unsigned long flags;
spin_lock_irqsave(&blkbk->pending_free_lock, flags); spin_lock_irqsave(&blkif->pending_free_lock, flags);
if (!list_empty(&blkbk->pending_free)) { if (!list_empty(&blkif->pending_free)) {
req = list_entry(blkbk->pending_free.next, struct pending_req, req = list_entry(blkif->pending_free.next, struct pending_req,
free_list); free_list);
list_del(&req->free_list); list_del(&req->free_list);
} }
spin_unlock_irqrestore(&blkbk->pending_free_lock, flags); spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
return req; return req;
} }
...@@ -273,17 +448,17 @@ static struct pending_req *alloc_req(void) ...@@ -273,17 +448,17 @@ static struct pending_req *alloc_req(void)
* Return the 'pending_req' structure back to the freepool. We also * Return the 'pending_req' structure back to the freepool. We also
* wake up the thread if it was waiting for a free page. * wake up the thread if it was waiting for a free page.
*/ */
static void free_req(struct pending_req *req) static void free_req(struct xen_blkif *blkif, struct pending_req *req)
{ {
unsigned long flags; unsigned long flags;
int was_empty; int was_empty;
spin_lock_irqsave(&blkbk->pending_free_lock, flags); spin_lock_irqsave(&blkif->pending_free_lock, flags);
was_empty = list_empty(&blkbk->pending_free); was_empty = list_empty(&blkif->pending_free);
list_add(&req->free_list, &blkbk->pending_free); list_add(&req->free_list, &blkif->pending_free);
spin_unlock_irqrestore(&blkbk->pending_free_lock, flags); spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
if (was_empty) if (was_empty)
wake_up(&blkbk->pending_free_wq); wake_up(&blkif->pending_free_wq);
} }
/* /*
...@@ -382,10 +557,12 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id) ...@@ -382,10 +557,12 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
static void print_stats(struct xen_blkif *blkif) static void print_stats(struct xen_blkif *blkif)
{ {
pr_info("xen-blkback (%s): oo %3llu | rd %4llu | wr %4llu | f %4llu" pr_info("xen-blkback (%s): oo %3llu | rd %4llu | wr %4llu | f %4llu"
" | ds %4llu\n", " | ds %4llu | pg: %4u/%4d\n",
current->comm, blkif->st_oo_req, current->comm, blkif->st_oo_req,
blkif->st_rd_req, blkif->st_wr_req, blkif->st_rd_req, blkif->st_wr_req,
blkif->st_f_req, blkif->st_ds_req); blkif->st_f_req, blkif->st_ds_req,
blkif->persistent_gnt_c,
xen_blkif_max_pgrants);
blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000); blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
blkif->st_rd_req = 0; blkif->st_rd_req = 0;
blkif->st_wr_req = 0; blkif->st_wr_req = 0;
...@@ -397,6 +574,8 @@ int xen_blkif_schedule(void *arg) ...@@ -397,6 +574,8 @@ int xen_blkif_schedule(void *arg)
{ {
struct xen_blkif *blkif = arg; struct xen_blkif *blkif = arg;
struct xen_vbd *vbd = &blkif->vbd; struct xen_vbd *vbd = &blkif->vbd;
unsigned long timeout;
int ret;
xen_blkif_get(blkif); xen_blkif_get(blkif);
...@@ -406,27 +585,52 @@ int xen_blkif_schedule(void *arg) ...@@ -406,27 +585,52 @@ int xen_blkif_schedule(void *arg)
if (unlikely(vbd->size != vbd_sz(vbd))) if (unlikely(vbd->size != vbd_sz(vbd)))
xen_vbd_resize(blkif); xen_vbd_resize(blkif);
wait_event_interruptible( timeout = msecs_to_jiffies(LRU_INTERVAL);
timeout = wait_event_interruptible_timeout(
blkif->wq, blkif->wq,
blkif->waiting_reqs || kthread_should_stop()); blkif->waiting_reqs || kthread_should_stop(),
wait_event_interruptible( timeout);
blkbk->pending_free_wq, if (timeout == 0)
!list_empty(&blkbk->pending_free) || goto purge_gnt_list;
kthread_should_stop()); timeout = wait_event_interruptible_timeout(
blkif->pending_free_wq,
!list_empty(&blkif->pending_free) ||
kthread_should_stop(),
timeout);
if (timeout == 0)
goto purge_gnt_list;
blkif->waiting_reqs = 0; blkif->waiting_reqs = 0;
smp_mb(); /* clear flag *before* checking for work */ smp_mb(); /* clear flag *before* checking for work */
if (do_block_io_op(blkif)) ret = do_block_io_op(blkif);
if (ret > 0)
blkif->waiting_reqs = 1; blkif->waiting_reqs = 1;
if (ret == -EACCES)
wait_event_interruptible(blkif->shutdown_wq,
kthread_should_stop());
purge_gnt_list:
if (blkif->vbd.feature_gnt_persistent &&
time_after(jiffies, blkif->next_lru)) {
purge_persistent_gnt(blkif);
blkif->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
}
/* Shrink if we have more than xen_blkif_max_buffer_pages */
shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);
if (log_stats && time_after(jiffies, blkif->st_print)) if (log_stats && time_after(jiffies, blkif->st_print))
print_stats(blkif); print_stats(blkif);
} }
/* Since we are shutting down remove all pages from the buffer */
shrink_free_pagepool(blkif, 0 /* All */);
/* Free all persistent grant pages */ /* Free all persistent grant pages */
if (!RB_EMPTY_ROOT(&blkif->persistent_gnts)) if (!RB_EMPTY_ROOT(&blkif->persistent_gnts))
free_persistent_gnts(&blkif->persistent_gnts, free_persistent_gnts(blkif, &blkif->persistent_gnts,
blkif->persistent_gnt_c); blkif->persistent_gnt_c);
BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts)); BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
...@@ -441,148 +645,98 @@ int xen_blkif_schedule(void *arg) ...@@ -441,148 +645,98 @@ int xen_blkif_schedule(void *arg)
return 0; return 0;
} }
struct seg_buf {
unsigned int offset;
unsigned int nsec;
};
/* /*
* Unmap the grant references, and also remove the M2P over-rides * Unmap the grant references, and also remove the M2P over-rides
* used in the 'pending_req'. * used in the 'pending_req'.
*/ */
static void xen_blkbk_unmap(struct pending_req *req) static void xen_blkbk_unmap(struct xen_blkif *blkif,
struct grant_page *pages[],
int num)
{ {
struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct page *unmap_pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
unsigned int i, invcount = 0; unsigned int i, invcount = 0;
grant_handle_t handle;
int ret; int ret;
for (i = 0; i < req->nr_pages; i++) { for (i = 0; i < num; i++) {
if (!test_bit(i, req->unmap_seg)) if (pages[i]->persistent_gnt != NULL) {
put_persistent_gnt(blkif, pages[i]->persistent_gnt);
continue; continue;
handle = pending_handle(req, i); }
if (handle == BLKBACK_INVALID_HANDLE) if (pages[i]->handle == BLKBACK_INVALID_HANDLE)
continue; continue;
gnttab_set_unmap_op(&unmap[invcount], vaddr(req, i), unmap_pages[invcount] = pages[i]->page;
GNTMAP_host_map, handle); gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[i]->page),
pending_handle(req, i) = BLKBACK_INVALID_HANDLE; GNTMAP_host_map, pages[i]->handle);
pages[invcount] = virt_to_page(vaddr(req, i)); pages[i]->handle = BLKBACK_INVALID_HANDLE;
invcount++; if (++invcount == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
invcount);
BUG_ON(ret);
put_free_pages(blkif, unmap_pages, invcount);
invcount = 0;
} }
}
ret = gnttab_unmap_refs(unmap, NULL, pages, invcount); if (invcount) {
ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
BUG_ON(ret); BUG_ON(ret);
put_free_pages(blkif, unmap_pages, invcount);
}
} }
static int xen_blkbk_map(struct blkif_request *req, static int xen_blkbk_map(struct xen_blkif *blkif,
struct pending_req *pending_req, struct grant_page *pages[],
struct seg_buf seg[], int num, bool ro)
struct page *pages[])
{ {
struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct persistent_gnt *persistent_gnt = NULL; struct persistent_gnt *persistent_gnt = NULL;
struct xen_blkif *blkif = pending_req->blkif;
phys_addr_t addr = 0; phys_addr_t addr = 0;
int i, j; int i, seg_idx, new_map_idx;
bool new_map;
int nseg = req->u.rw.nr_segments;
int segs_to_map = 0; int segs_to_map = 0;
int ret = 0; int ret = 0;
int last_map = 0, map_until = 0;
int use_persistent_gnts; int use_persistent_gnts;
use_persistent_gnts = (blkif->vbd.feature_gnt_persistent); use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
BUG_ON(blkif->persistent_gnt_c >
max_mapped_grant_pages(pending_req->blkif->blk_protocol));
/* /*
* Fill out preq.nr_sects with proper amount of sectors, and setup * Fill out preq.nr_sects with proper amount of sectors, and setup
* assign map[..] with the PFN of the page in our domain with the * assign map[..] with the PFN of the page in our domain with the
* corresponding grant reference for each page. * corresponding grant reference for each page.
*/ */
for (i = 0; i < nseg; i++) { again:
for (i = map_until; i < num; i++) {
uint32_t flags; uint32_t flags;
if (use_persistent_gnts) if (use_persistent_gnts)
persistent_gnt = get_persistent_gnt( persistent_gnt = get_persistent_gnt(
&blkif->persistent_gnts, blkif,
req->u.rw.seg[i].gref); pages[i]->gref);
if (persistent_gnt) { if (persistent_gnt) {
/* /*
* We are using persistent grants and * We are using persistent grants and
* the grant is already mapped * the grant is already mapped
*/ */
new_map = false; pages[i]->page = persistent_gnt->page;
} else if (use_persistent_gnts && pages[i]->persistent_gnt = persistent_gnt;
blkif->persistent_gnt_c <
max_mapped_grant_pages(blkif->blk_protocol)) {
/*
* We are using persistent grants, the grant is
* not mapped but we have room for it
*/
new_map = true;
persistent_gnt = kmalloc(
sizeof(struct persistent_gnt),
GFP_KERNEL);
if (!persistent_gnt)
return -ENOMEM;
if (alloc_xenballooned_pages(1, &persistent_gnt->page,
false)) {
kfree(persistent_gnt);
return -ENOMEM;
}
persistent_gnt->gnt = req->u.rw.seg[i].gref;
persistent_gnt->handle = BLKBACK_INVALID_HANDLE;
pages_to_gnt[segs_to_map] =
persistent_gnt->page;
addr = (unsigned long) pfn_to_kaddr(
page_to_pfn(persistent_gnt->page));
add_persistent_gnt(&blkif->persistent_gnts,
persistent_gnt);
blkif->persistent_gnt_c++;
pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
persistent_gnt->gnt, blkif->persistent_gnt_c,
max_mapped_grant_pages(blkif->blk_protocol));
} else {
/*
* We are either using persistent grants and
* hit the maximum limit of grants mapped,
* or we are not using persistent grants.
*/
if (use_persistent_gnts &&
!blkif->vbd.overflow_max_grants) {
blkif->vbd.overflow_max_grants = 1;
pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
blkif->domid, blkif->vbd.handle);
}
new_map = true;
pages[i] = blkbk->pending_page(pending_req, i);
addr = vaddr(pending_req, i);
pages_to_gnt[segs_to_map] =
blkbk->pending_page(pending_req, i);
}
if (persistent_gnt) {
pages[i] = persistent_gnt->page;
persistent_gnts[i] = persistent_gnt;
} else { } else {
persistent_gnts[i] = NULL; if (get_free_page(blkif, &pages[i]->page))
} goto out_of_memory;
addr = vaddr(pages[i]->page);
if (new_map) { pages_to_gnt[segs_to_map] = pages[i]->page;
pages[i]->persistent_gnt = NULL;
flags = GNTMAP_host_map; flags = GNTMAP_host_map;
if (!persistent_gnt && if (!use_persistent_gnts && ro)
(pending_req->operation != BLKIF_OP_READ))
flags |= GNTMAP_readonly; flags |= GNTMAP_readonly;
gnttab_set_map_op(&map[segs_to_map++], addr, gnttab_set_map_op(&map[segs_to_map++], addr,
flags, req->u.rw.seg[i].gref, flags, pages[i]->gref,
blkif->domid); blkif->domid);
} }
map_until = i + 1;
if (segs_to_map == BLKIF_MAX_SEGMENTS_PER_REQUEST)
break;
} }
if (segs_to_map) { if (segs_to_map) {
...@@ -595,49 +749,133 @@ static int xen_blkbk_map(struct blkif_request *req, ...@@ -595,49 +749,133 @@ static int xen_blkbk_map(struct blkif_request *req,
* so that when we access vaddr(pending_req,i) it has the contents of * so that when we access vaddr(pending_req,i) it has the contents of
* the page from the other domain. * the page from the other domain.
*/ */
bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST); for (seg_idx = last_map, new_map_idx = 0; seg_idx < map_until; seg_idx++) {
for (i = 0, j = 0; i < nseg; i++) { if (!pages[seg_idx]->persistent_gnt) {
if (!persistent_gnts[i] ||
persistent_gnts[i]->handle == BLKBACK_INVALID_HANDLE) {
/* This is a newly mapped grant */ /* This is a newly mapped grant */
BUG_ON(j >= segs_to_map); BUG_ON(new_map_idx >= segs_to_map);
if (unlikely(map[j].status != 0)) { if (unlikely(map[new_map_idx].status != 0)) {
pr_debug(DRV_PFX "invalid buffer -- could not remap it\n"); pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
map[j].handle = BLKBACK_INVALID_HANDLE; pages[seg_idx]->handle = BLKBACK_INVALID_HANDLE;
ret |= 1; ret |= 1;
if (persistent_gnts[i]) { goto next;
rb_erase(&persistent_gnts[i]->node, }
&blkif->persistent_gnts); pages[seg_idx]->handle = map[new_map_idx].handle;
blkif->persistent_gnt_c--; } else {
kfree(persistent_gnts[i]); continue;
persistent_gnts[i] = NULL; }
if (use_persistent_gnts &&
blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
/*
* We are using persistent grants, the grant is
* not mapped but we might have room for it.
*/
persistent_gnt = kmalloc(sizeof(struct persistent_gnt),
GFP_KERNEL);
if (!persistent_gnt) {
/*
* If we don't have enough memory to
* allocate the persistent_gnt struct
* map this grant non-persistenly
*/
goto next;
} }
persistent_gnt->gnt = map[new_map_idx].ref;
persistent_gnt->handle = map[new_map_idx].handle;
persistent_gnt->page = pages[seg_idx]->page;
if (add_persistent_gnt(blkif,
persistent_gnt)) {
kfree(persistent_gnt);
persistent_gnt = NULL;
goto next;
} }
pages[seg_idx]->persistent_gnt = persistent_gnt;
pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
persistent_gnt->gnt, blkif->persistent_gnt_c,
xen_blkif_max_pgrants);
goto next;
}
if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
blkif->vbd.overflow_max_grants = 1;
pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
blkif->domid, blkif->vbd.handle);
} }
if (persistent_gnts[i]) {
if (persistent_gnts[i]->handle ==
BLKBACK_INVALID_HANDLE) {
/* /*
* If this is a new persistent grant * We could not map this grant persistently, so use it as
* save the handler * a non-persistent grant.
*/ */
persistent_gnts[i]->handle = map[j++].handle; next:
new_map_idx++;
} }
pending_handle(pending_req, i) = segs_to_map = 0;
persistent_gnts[i]->handle; last_map = map_until;
if (map_until != num)
goto again;
if (ret) return ret;
continue;
} else {
pending_handle(pending_req, i) = map[j++].handle;
bitmap_set(pending_req->unmap_seg, i, 1);
if (ret) out_of_memory:
continue; pr_alert(DRV_PFX "%s: out of memory\n", __func__);
put_free_pages(blkif, pages_to_gnt, segs_to_map);
return -ENOMEM;
}
static int xen_blkbk_map_seg(struct pending_req *pending_req)
{
int rc;
rc = xen_blkbk_map(pending_req->blkif, pending_req->segments,
pending_req->nr_pages,
(pending_req->operation != BLKIF_OP_READ));
return rc;
}
static int xen_blkbk_parse_indirect(struct blkif_request *req,
struct pending_req *pending_req,
struct seg_buf seg[],
struct phys_req *preq)
{
struct grant_page **pages = pending_req->indirect_pages;
struct xen_blkif *blkif = pending_req->blkif;
int indirect_grefs, rc, n, nseg, i;
struct blkif_request_segment_aligned *segments = NULL;
nseg = pending_req->nr_pages;
indirect_grefs = INDIRECT_PAGES(nseg);
BUG_ON(indirect_grefs > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
for (i = 0; i < indirect_grefs; i++)
pages[i]->gref = req->u.indirect.indirect_grefs[i];
rc = xen_blkbk_map(blkif, pages, indirect_grefs, true);
if (rc)
goto unmap;
for (n = 0, i = 0; n < nseg; n++) {
if ((n % SEGS_PER_INDIRECT_FRAME) == 0) {
/* Map indirect segments */
if (segments)
kunmap_atomic(segments);
segments = kmap_atomic(pages[n/SEGS_PER_INDIRECT_FRAME]->page);
} }
seg[i].offset = (req->u.rw.seg[i].first_sect << 9); i = n % SEGS_PER_INDIRECT_FRAME;
pending_req->segments[n]->gref = segments[i].gref;
seg[n].nsec = segments[i].last_sect -
segments[i].first_sect + 1;
seg[n].offset = (segments[i].first_sect << 9);
if ((segments[i].last_sect >= (PAGE_SIZE >> 9)) ||
(segments[i].last_sect < segments[i].first_sect)) {
rc = -EINVAL;
goto unmap;
} }
return ret; preq->nr_sects += seg[n].nsec;
}
unmap:
if (segments)
kunmap_atomic(segments);
xen_blkbk_unmap(blkif, pages, indirect_grefs);
return rc;
} }
static int dispatch_discard_io(struct xen_blkif *blkif, static int dispatch_discard_io(struct xen_blkif *blkif,
...@@ -647,7 +885,18 @@ static int dispatch_discard_io(struct xen_blkif *blkif, ...@@ -647,7 +885,18 @@ static int dispatch_discard_io(struct xen_blkif *blkif,
int status = BLKIF_RSP_OKAY; int status = BLKIF_RSP_OKAY;
struct block_device *bdev = blkif->vbd.bdev; struct block_device *bdev = blkif->vbd.bdev;
unsigned long secure; unsigned long secure;
struct phys_req preq;
preq.sector_number = req->u.discard.sector_number;
preq.nr_sects = req->u.discard.nr_sectors;
err = xen_vbd_translate(&preq, blkif, WRITE);
if (err) {
pr_warn(DRV_PFX "access denied: DISCARD [%llu->%llu] on dev=%04x\n",
preq.sector_number,
preq.sector_number + preq.nr_sects, blkif->vbd.pdevice);
goto fail_response;
}
blkif->st_ds_req++; blkif->st_ds_req++;
xen_blkif_get(blkif); xen_blkif_get(blkif);
...@@ -658,7 +907,7 @@ static int dispatch_discard_io(struct xen_blkif *blkif, ...@@ -658,7 +907,7 @@ static int dispatch_discard_io(struct xen_blkif *blkif,
err = blkdev_issue_discard(bdev, req->u.discard.sector_number, err = blkdev_issue_discard(bdev, req->u.discard.sector_number,
req->u.discard.nr_sectors, req->u.discard.nr_sectors,
GFP_KERNEL, secure); GFP_KERNEL, secure);
fail_response:
if (err == -EOPNOTSUPP) { if (err == -EOPNOTSUPP) {
pr_debug(DRV_PFX "discard op failed, not supported\n"); pr_debug(DRV_PFX "discard op failed, not supported\n");
status = BLKIF_RSP_EOPNOTSUPP; status = BLKIF_RSP_EOPNOTSUPP;
...@@ -674,7 +923,7 @@ static int dispatch_other_io(struct xen_blkif *blkif, ...@@ -674,7 +923,7 @@ static int dispatch_other_io(struct xen_blkif *blkif,
struct blkif_request *req, struct blkif_request *req,
struct pending_req *pending_req) struct pending_req *pending_req)
{ {
free_req(pending_req); free_req(blkif, pending_req);
make_response(blkif, req->u.other.id, req->operation, make_response(blkif, req->u.other.id, req->operation,
BLKIF_RSP_EOPNOTSUPP); BLKIF_RSP_EOPNOTSUPP);
return -EIO; return -EIO;
...@@ -726,7 +975,9 @@ static void __end_block_io_op(struct pending_req *pending_req, int error) ...@@ -726,7 +975,9 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
* the proper response on the ring. * the proper response on the ring.
*/ */
if (atomic_dec_and_test(&pending_req->pendcnt)) { if (atomic_dec_and_test(&pending_req->pendcnt)) {
xen_blkbk_unmap(pending_req); xen_blkbk_unmap(pending_req->blkif,
pending_req->segments,
pending_req->nr_pages);
make_response(pending_req->blkif, pending_req->id, make_response(pending_req->blkif, pending_req->id,
pending_req->operation, pending_req->status); pending_req->operation, pending_req->status);
xen_blkif_put(pending_req->blkif); xen_blkif_put(pending_req->blkif);
...@@ -734,7 +985,7 @@ static void __end_block_io_op(struct pending_req *pending_req, int error) ...@@ -734,7 +985,7 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
if (atomic_read(&pending_req->blkif->drain)) if (atomic_read(&pending_req->blkif->drain))
complete(&pending_req->blkif->drain_complete); complete(&pending_req->blkif->drain_complete);
} }
free_req(pending_req); free_req(pending_req->blkif, pending_req);
} }
} }
...@@ -767,6 +1018,12 @@ __do_block_io_op(struct xen_blkif *blkif) ...@@ -767,6 +1018,12 @@ __do_block_io_op(struct xen_blkif *blkif)
rp = blk_rings->common.sring->req_prod; rp = blk_rings->common.sring->req_prod;
rmb(); /* Ensure we see queued requests up to 'rp'. */ rmb(); /* Ensure we see queued requests up to 'rp'. */
if (RING_REQUEST_PROD_OVERFLOW(&blk_rings->common, rp)) {
rc = blk_rings->common.rsp_prod_pvt;
pr_warn(DRV_PFX "Frontend provided bogus ring requests (%d - %d = %d). Halting ring processing on dev=%04x\n",
rp, rc, rp - rc, blkif->vbd.pdevice);
return -EACCES;
}
while (rc != rp) { while (rc != rp) {
if (RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, rc)) if (RING_REQUEST_CONS_OVERFLOW(&blk_rings->common, rc))
...@@ -777,7 +1034,7 @@ __do_block_io_op(struct xen_blkif *blkif) ...@@ -777,7 +1034,7 @@ __do_block_io_op(struct xen_blkif *blkif)
break; break;
} }
pending_req = alloc_req(); pending_req = alloc_req(blkif);
if (NULL == pending_req) { if (NULL == pending_req) {
blkif->st_oo_req++; blkif->st_oo_req++;
more_to_do = 1; more_to_do = 1;
...@@ -807,11 +1064,12 @@ __do_block_io_op(struct xen_blkif *blkif) ...@@ -807,11 +1064,12 @@ __do_block_io_op(struct xen_blkif *blkif)
case BLKIF_OP_WRITE: case BLKIF_OP_WRITE:
case BLKIF_OP_WRITE_BARRIER: case BLKIF_OP_WRITE_BARRIER:
case BLKIF_OP_FLUSH_DISKCACHE: case BLKIF_OP_FLUSH_DISKCACHE:
case BLKIF_OP_INDIRECT:
if (dispatch_rw_block_io(blkif, &req, pending_req)) if (dispatch_rw_block_io(blkif, &req, pending_req))
goto done; goto done;
break; break;
case BLKIF_OP_DISCARD: case BLKIF_OP_DISCARD:
free_req(pending_req); free_req(blkif, pending_req);
if (dispatch_discard_io(blkif, &req)) if (dispatch_discard_io(blkif, &req))
goto done; goto done;
break; break;
...@@ -853,17 +1111,28 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, ...@@ -853,17 +1111,28 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
struct pending_req *pending_req) struct pending_req *pending_req)
{ {
struct phys_req preq; struct phys_req preq;
struct seg_buf seg[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct seg_buf *seg = pending_req->seg;
unsigned int nseg; unsigned int nseg;
struct bio *bio = NULL; struct bio *bio = NULL;
struct bio *biolist[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct bio **biolist = pending_req->biolist;
int i, nbio = 0; int i, nbio = 0;
int operation; int operation;
struct blk_plug plug; struct blk_plug plug;
bool drain = false; bool drain = false;
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST]; struct grant_page **pages = pending_req->segments;
unsigned short req_operation;
req_operation = req->operation == BLKIF_OP_INDIRECT ?
req->u.indirect.indirect_op : req->operation;
if ((req->operation == BLKIF_OP_INDIRECT) &&
(req_operation != BLKIF_OP_READ) &&
(req_operation != BLKIF_OP_WRITE)) {
pr_debug(DRV_PFX "Invalid indirect operation (%u)\n",
req_operation);
goto fail_response;
}
switch (req->operation) { switch (req_operation) {
case BLKIF_OP_READ: case BLKIF_OP_READ:
blkif->st_rd_req++; blkif->st_rd_req++;
operation = READ; operation = READ;
...@@ -885,33 +1154,47 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, ...@@ -885,33 +1154,47 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
} }
/* Check that the number of segments is sane. */ /* Check that the number of segments is sane. */
nseg = req->u.rw.nr_segments; nseg = req->operation == BLKIF_OP_INDIRECT ?
req->u.indirect.nr_segments : req->u.rw.nr_segments;
if (unlikely(nseg == 0 && operation != WRITE_FLUSH) || if (unlikely(nseg == 0 && operation != WRITE_FLUSH) ||
unlikely(nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) { unlikely((req->operation != BLKIF_OP_INDIRECT) &&
(nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
unlikely((req->operation == BLKIF_OP_INDIRECT) &&
(nseg > MAX_INDIRECT_SEGMENTS))) {
pr_debug(DRV_PFX "Bad number of segments in request (%d)\n", pr_debug(DRV_PFX "Bad number of segments in request (%d)\n",
nseg); nseg);
/* Haven't submitted any bio's yet. */ /* Haven't submitted any bio's yet. */
goto fail_response; goto fail_response;
} }
preq.sector_number = req->u.rw.sector_number;
preq.nr_sects = 0; preq.nr_sects = 0;
pending_req->blkif = blkif; pending_req->blkif = blkif;
pending_req->id = req->u.rw.id; pending_req->id = req->u.rw.id;
pending_req->operation = req->operation; pending_req->operation = req_operation;
pending_req->status = BLKIF_RSP_OKAY; pending_req->status = BLKIF_RSP_OKAY;
pending_req->nr_pages = nseg; pending_req->nr_pages = nseg;
if (req->operation != BLKIF_OP_INDIRECT) {
preq.dev = req->u.rw.handle;
preq.sector_number = req->u.rw.sector_number;
for (i = 0; i < nseg; i++) { for (i = 0; i < nseg; i++) {
pages[i]->gref = req->u.rw.seg[i].gref;
seg[i].nsec = req->u.rw.seg[i].last_sect - seg[i].nsec = req->u.rw.seg[i].last_sect -
req->u.rw.seg[i].first_sect + 1; req->u.rw.seg[i].first_sect + 1;
seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) || if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
(req->u.rw.seg[i].last_sect < req->u.rw.seg[i].first_sect)) (req->u.rw.seg[i].last_sect <
req->u.rw.seg[i].first_sect))
goto fail_response; goto fail_response;
preq.nr_sects += seg[i].nsec; preq.nr_sects += seg[i].nsec;
}
} else {
preq.dev = req->u.indirect.handle;
preq.sector_number = req->u.indirect.sector_number;
if (xen_blkbk_parse_indirect(req, pending_req, seg, &preq))
goto fail_response;
} }
if (xen_vbd_translate(&preq, blkif, operation) != 0) { if (xen_vbd_translate(&preq, blkif, operation) != 0) {
...@@ -948,7 +1231,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, ...@@ -948,7 +1231,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
* the hypercall to unmap the grants - that is all done in * the hypercall to unmap the grants - that is all done in
* xen_blkbk_unmap. * xen_blkbk_unmap.
*/ */
if (xen_blkbk_map(req, pending_req, seg, pages)) if (xen_blkbk_map_seg(pending_req))
goto fail_flush; goto fail_flush;
/* /*
...@@ -960,11 +1243,12 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, ...@@ -960,11 +1243,12 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
for (i = 0; i < nseg; i++) { for (i = 0; i < nseg; i++) {
while ((bio == NULL) || while ((bio == NULL) ||
(bio_add_page(bio, (bio_add_page(bio,
pages[i], pages[i]->page,
seg[i].nsec << 9, seg[i].nsec << 9,
seg[i].offset) == 0)) { seg[i].offset) == 0)) {
bio = bio_alloc(GFP_KERNEL, nseg-i); int nr_iovecs = min_t(int, (nseg-i), BIO_MAX_PAGES);
bio = bio_alloc(GFP_KERNEL, nr_iovecs);
if (unlikely(bio == NULL)) if (unlikely(bio == NULL))
goto fail_put_bio; goto fail_put_bio;
...@@ -1009,11 +1293,12 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif, ...@@ -1009,11 +1293,12 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
return 0; return 0;
fail_flush: fail_flush:
xen_blkbk_unmap(pending_req); xen_blkbk_unmap(blkif, pending_req->segments,
pending_req->nr_pages);
fail_response: fail_response:
/* Haven't submitted any bio's yet. */ /* Haven't submitted any bio's yet. */
make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR); make_response(blkif, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
free_req(pending_req); free_req(blkif, pending_req);
msleep(1); /* back off a bit */ msleep(1); /* back off a bit */
return -EIO; return -EIO;
...@@ -1070,73 +1355,20 @@ static void make_response(struct xen_blkif *blkif, u64 id, ...@@ -1070,73 +1355,20 @@ static void make_response(struct xen_blkif *blkif, u64 id,
static int __init xen_blkif_init(void) static int __init xen_blkif_init(void)
{ {
int i, mmap_pages;
int rc = 0; int rc = 0;
if (!xen_domain()) if (!xen_domain())
return -ENODEV; return -ENODEV;
blkbk = kzalloc(sizeof(struct xen_blkbk), GFP_KERNEL);
if (!blkbk) {
pr_alert(DRV_PFX "%s: out of memory!\n", __func__);
return -ENOMEM;
}
mmap_pages = xen_blkif_reqs * BLKIF_MAX_SEGMENTS_PER_REQUEST;
blkbk->pending_reqs = kzalloc(sizeof(blkbk->pending_reqs[0]) *
xen_blkif_reqs, GFP_KERNEL);
blkbk->pending_grant_handles = kmalloc(sizeof(blkbk->pending_grant_handles[0]) *
mmap_pages, GFP_KERNEL);
blkbk->pending_pages = kzalloc(sizeof(blkbk->pending_pages[0]) *
mmap_pages, GFP_KERNEL);
if (!blkbk->pending_reqs || !blkbk->pending_grant_handles ||
!blkbk->pending_pages) {
rc = -ENOMEM;
goto out_of_memory;
}
for (i = 0; i < mmap_pages; i++) {
blkbk->pending_grant_handles[i] = BLKBACK_INVALID_HANDLE;
blkbk->pending_pages[i] = alloc_page(GFP_KERNEL);
if (blkbk->pending_pages[i] == NULL) {
rc = -ENOMEM;
goto out_of_memory;
}
}
rc = xen_blkif_interface_init(); rc = xen_blkif_interface_init();
if (rc) if (rc)
goto failed_init; goto failed_init;
INIT_LIST_HEAD(&blkbk->pending_free);
spin_lock_init(&blkbk->pending_free_lock);
init_waitqueue_head(&blkbk->pending_free_wq);
for (i = 0; i < xen_blkif_reqs; i++)
list_add_tail(&blkbk->pending_reqs[i].free_list,
&blkbk->pending_free);
rc = xen_blkif_xenbus_init(); rc = xen_blkif_xenbus_init();
if (rc) if (rc)
goto failed_init; goto failed_init;
return 0;
out_of_memory:
pr_alert(DRV_PFX "%s: out of memory\n", __func__);
failed_init: failed_init:
kfree(blkbk->pending_reqs);
kfree(blkbk->pending_grant_handles);
if (blkbk->pending_pages) {
for (i = 0; i < mmap_pages; i++) {
if (blkbk->pending_pages[i])
__free_page(blkbk->pending_pages[i]);
}
kfree(blkbk->pending_pages);
}
kfree(blkbk);
blkbk = NULL;
return rc; return rc;
} }
......
...@@ -50,6 +50,19 @@ ...@@ -50,6 +50,19 @@
__func__, __LINE__, ##args) __func__, __LINE__, ##args)
/*
* This is the maximum number of segments that would be allowed in indirect
* requests. This value will also be passed to the frontend.
*/
#define MAX_INDIRECT_SEGMENTS 256
#define SEGS_PER_INDIRECT_FRAME \
(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
#define MAX_INDIRECT_PAGES \
((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
#define INDIRECT_PAGES(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
/* Not a real protocol. Used to generate ring structs which contain /* Not a real protocol. Used to generate ring structs which contain
* the elements common to all protocols only. This way we get a * the elements common to all protocols only. This way we get a
* compiler-checkable way to use common struct elements, so we can * compiler-checkable way to use common struct elements, so we can
...@@ -83,12 +96,31 @@ struct blkif_x86_32_request_other { ...@@ -83,12 +96,31 @@ struct blkif_x86_32_request_other {
uint64_t id; /* private guest value, echoed in resp */ uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__)); } __attribute__((__packed__));
struct blkif_x86_32_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad1;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint64_t _pad2; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_32_request { struct blkif_x86_32_request {
uint8_t operation; /* BLKIF_OP_??? */ uint8_t operation; /* BLKIF_OP_??? */
union { union {
struct blkif_x86_32_request_rw rw; struct blkif_x86_32_request_rw rw;
struct blkif_x86_32_request_discard discard; struct blkif_x86_32_request_discard discard;
struct blkif_x86_32_request_other other; struct blkif_x86_32_request_other other;
struct blkif_x86_32_request_indirect indirect;
} u; } u;
} __attribute__((__packed__)); } __attribute__((__packed__));
...@@ -127,12 +159,32 @@ struct blkif_x86_64_request_other { ...@@ -127,12 +159,32 @@ struct blkif_x86_64_request_other {
uint64_t id; /* private guest value, echoed in resp */ uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__)); } __attribute__((__packed__));
struct blkif_x86_64_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint32_t _pad1; /* offsetof(blkif_..,u.indirect.id)==8 */
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad2;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint32_t _pad3; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_64_request { struct blkif_x86_64_request {
uint8_t operation; /* BLKIF_OP_??? */ uint8_t operation; /* BLKIF_OP_??? */
union { union {
struct blkif_x86_64_request_rw rw; struct blkif_x86_64_request_rw rw;
struct blkif_x86_64_request_discard discard; struct blkif_x86_64_request_discard discard;
struct blkif_x86_64_request_other other; struct blkif_x86_64_request_other other;
struct blkif_x86_64_request_indirect indirect;
} u; } u;
} __attribute__((__packed__)); } __attribute__((__packed__));
...@@ -182,12 +234,26 @@ struct xen_vbd { ...@@ -182,12 +234,26 @@ struct xen_vbd {
struct backend_info; struct backend_info;
/* Number of available flags */
#define PERSISTENT_GNT_FLAGS_SIZE 2
/* This persistent grant is currently in use */
#define PERSISTENT_GNT_ACTIVE 0
/*
* This persistent grant has been used, this flag is set when we remove the
* PERSISTENT_GNT_ACTIVE, to know that this grant has been used recently.
*/
#define PERSISTENT_GNT_WAS_ACTIVE 1
/* Number of requests that we can fit in a ring */
#define XEN_BLKIF_REQS 32
struct persistent_gnt { struct persistent_gnt {
struct page *page; struct page *page;
grant_ref_t gnt; grant_ref_t gnt;
grant_handle_t handle; grant_handle_t handle;
DECLARE_BITMAP(flags, PERSISTENT_GNT_FLAGS_SIZE);
struct rb_node node; struct rb_node node;
struct list_head remove_node;
}; };
struct xen_blkif { struct xen_blkif {
...@@ -219,6 +285,23 @@ struct xen_blkif { ...@@ -219,6 +285,23 @@ struct xen_blkif {
/* tree to store persistent grants */ /* tree to store persistent grants */
struct rb_root persistent_gnts; struct rb_root persistent_gnts;
unsigned int persistent_gnt_c; unsigned int persistent_gnt_c;
atomic_t persistent_gnt_in_use;
unsigned long next_lru;
/* used by the kworker that offload work from the persistent purge */
struct list_head persistent_purge_list;
struct work_struct persistent_purge_work;
/* buffer of free pages to map grant refs */
spinlock_t free_pages_lock;
int free_pages_num;
struct list_head free_pages;
/* List of all 'pending_req' available */
struct list_head pending_free;
/* And its spinlock. */
spinlock_t pending_free_lock;
wait_queue_head_t pending_free_wq;
/* statistics */ /* statistics */
unsigned long st_print; unsigned long st_print;
...@@ -231,6 +314,41 @@ struct xen_blkif { ...@@ -231,6 +314,41 @@ struct xen_blkif {
unsigned long long st_wr_sect; unsigned long long st_wr_sect;
wait_queue_head_t waiting_to_free; wait_queue_head_t waiting_to_free;
/* Thread shutdown wait queue. */
wait_queue_head_t shutdown_wq;
};
struct seg_buf {
unsigned long offset;
unsigned int nsec;
};
struct grant_page {
struct page *page;
struct persistent_gnt *persistent_gnt;
grant_handle_t handle;
grant_ref_t gref;
};
/*
* Each outstanding request that we've passed to the lower device layers has a
* 'pending_req' allocated to it. Each buffer_head that completes decrements
* the pendcnt towards zero. When it hits zero, the specified domain has a
* response queued for it, with the saved 'id' passed back.
*/
struct pending_req {
struct xen_blkif *blkif;
u64 id;
int nr_pages;
atomic_t pendcnt;
unsigned short operation;
int status;
struct list_head free_list;
struct grant_page *segments[MAX_INDIRECT_SEGMENTS];
/* Indirect descriptors */
struct grant_page *indirect_pages[MAX_INDIRECT_PAGES];
struct seg_buf seg[MAX_INDIRECT_SEGMENTS];
struct bio *biolist[MAX_INDIRECT_SEGMENTS];
}; };
...@@ -257,6 +375,7 @@ int xen_blkif_xenbus_init(void); ...@@ -257,6 +375,7 @@ int xen_blkif_xenbus_init(void);
irqreturn_t xen_blkif_be_int(int irq, void *dev_id); irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
int xen_blkif_schedule(void *arg); int xen_blkif_schedule(void *arg);
int xen_blkif_purge_persistent(void *arg);
int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
struct backend_info *be, int state);
@@ -268,7 +387,7 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be);
static inline void blkif_get_x86_32_req(struct blkif_request *dst,
struct blkif_x86_32_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
@@ -291,6 +410,18 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
@@ -304,7 +435,7 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
static inline void blkif_get_x86_64_req(struct blkif_request *dst,
struct blkif_x86_64_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
@@ -327,6 +458,18 @@ static inline void blkif_get_x86_64_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
...
@@ -98,12 +98,17 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
err = PTR_ERR(blkif->xenblkd);
blkif->xenblkd = NULL;
xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
return;
}
}
static struct xen_blkif *xen_blkif_alloc(domid_t domid)
{
struct xen_blkif *blkif;
struct pending_req *req, *n;
int i, j;
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
if (!blkif)
@@ -118,8 +123,57 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
init_waitqueue_head(&blkif->waiting_to_free);
blkif->persistent_gnts.rb_node = NULL;
spin_lock_init(&blkif->free_pages_lock);
INIT_LIST_HEAD(&blkif->free_pages);
blkif->free_pages_num = 0;
atomic_set(&blkif->persistent_gnt_in_use, 0);
INIT_LIST_HEAD(&blkif->pending_free);
for (i = 0; i < XEN_BLKIF_REQS; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
if (!req)
goto fail;
list_add_tail(&req->free_list,
&blkif->pending_free);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
req->segments[j] = kzalloc(sizeof(*req->segments[0]),
GFP_KERNEL);
if (!req->segments[j])
goto fail;
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
GFP_KERNEL);
if (!req->indirect_pages[j])
goto fail;
}
}
spin_lock_init(&blkif->pending_free_lock);
init_waitqueue_head(&blkif->pending_free_wq);
init_waitqueue_head(&blkif->shutdown_wq);
return blkif;
fail:
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
if (!req->segments[j])
break;
kfree(req->segments[j]);
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
if (!req->indirect_pages[j])
break;
kfree(req->indirect_pages[j]);
}
kfree(req);
}
kmem_cache_free(xen_blkif_cachep, blkif);
return ERR_PTR(-ENOMEM);
}
static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
@@ -178,6 +232,7 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
{
if (blkif->xenblkd) {
kthread_stop(blkif->xenblkd);
wake_up(&blkif->shutdown_wq);
blkif->xenblkd = NULL;
}
@@ -198,8 +253,28 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
static void xen_blkif_free(struct xen_blkif *blkif)
{
struct pending_req *req, *n;
int i = 0, j;
if (!atomic_dec_and_test(&blkif->refcnt))
BUG();
/* Check that there is no request in use */
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
kfree(req->segments[j]);
for (j = 0; j < MAX_INDIRECT_PAGES; j++)
kfree(req->indirect_pages[j]);
kfree(req);
i++;
}
WARN_ON(i != XEN_BLKIF_REQS);
kmem_cache_free(xen_blkif_cachep, blkif);
}
@@ -678,6 +753,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "feature-max-indirect-segments", "%u",
MAX_INDIRECT_SEGMENTS);
if (err)
dev_warn(&dev->dev, "writing %s/feature-max-indirect-segments (%d)",
dev->nodename, err);
err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
(unsigned long long)vbd_sz(&be->blkif->vbd));
@@ -704,6 +784,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "physical-sector-size", "%u",
bdev_physical_block_size(be->blkif->vbd.bdev));
if (err)
xenbus_dev_error(dev, err, "writing %s/physical-sector-size",
dev->nodename);
err = xenbus_transaction_end(xbt, 0);
if (err == -EAGAIN)
...
@@ -74,12 +74,30 @@ struct grant {
struct blk_shadow {
struct blkif_request req;
struct request *request;
struct grant **grants_used;
struct grant **indirect_grants;
struct scatterlist *sg;
};
struct split_bio {
struct bio *bio;
atomic_t pending;
int err;
};
static DEFINE_MUTEX(blkfront_mutex);
static const struct block_device_operations xlvbd_block_fops;
/*
* Maximum number of segments in indirect requests; the actual value used by
* the frontend driver is the minimum of this value and the value provided
* by the backend driver.
*/
static unsigned int xen_blkif_max_segments = 32;
module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default is 32)");
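A hypothetical worked example of this negotiation (the numbers are illustrative, not mandated by the patch):
/*
 * If the backend advertises feature-max-indirect-segments = 256 and this
 * module parameter is left at its default of 32, the frontend uses
 * min(256, 32) = 32 segments per indirect request, i.e. up to
 * 32 * 4096 = 128 KiB of data per request with 4 KiB pages, compared to
 * 11 * 4096 = 44 KiB without indirect descriptors.
 */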
#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
/*
@@ -98,7 +116,6 @@ struct blkfront_info
enum blkif_state connected;
int ring_ref;
struct blkif_front_ring ring;
struct scatterlist sg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
unsigned int evtchn, irq;
struct request_queue *rq;
struct work_struct work;
@@ -114,6 +131,7 @@ struct blkfront_info
unsigned int discard_granularity;
unsigned int discard_alignment;
unsigned int feature_persistent:1;
unsigned int max_indirect_segments;
int is_ready;
};
@@ -142,6 +160,13 @@ static DEFINE_SPINLOCK(minor_lock);
#define DEV_NAME "xvd" /* name in /dev */
#define SEGS_PER_INDIRECT_FRAME \
(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
#define INDIRECT_GREFS(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
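A small worked example of these two macros, assuming 4 KiB pages and the 8-byte blkif_request_segment_aligned from the blkif interface header:
/*
 * SEGS_PER_INDIRECT_FRAME = 4096 / 8 = 512, so INDIRECT_GREFS(32) = 1 and
 * INDIRECT_GREFS(513) = 2 indirect grant pages per request.
 */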
static int blkfront_setup_indirect(struct blkfront_info *info);
static int get_id_from_freelist(struct blkfront_info *info)
{
unsigned long free = info->shadow_free;
@@ -358,7 +383,8 @@ static int blkif_queue_request(struct request *req)
struct blkif_request *ring_req;
unsigned long id;
unsigned int fsect, lsect;
int i, ref, n;
struct blkif_request_segment_aligned *segments = NULL;
/*
* Used to store if we are able to queue the request by just using
@@ -369,21 +395,27 @@ static int blkif_queue_request(struct request *req)
grant_ref_t gref_head;
struct grant *gnt_list_entry = NULL;
struct scatterlist *sg;
int nseg, max_grefs;
if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
return 1;
max_grefs = info->max_indirect_segments ?
info->max_indirect_segments +
INDIRECT_GREFS(info->max_indirect_segments) :
BLKIF_MAX_SEGMENTS_PER_REQUEST;
/* Check if we have enough grants to allocate a request */
if (info->persistent_gnts_c < max_grefs) {
new_persistent_gnts = 1;
if (gnttab_alloc_grant_references(
max_grefs - info->persistent_gnts_c,
&gref_head) < 0) {
gnttab_request_free_callback(
&info->callback,
blkif_restart_queue_callback,
info,
max_grefs);
return 1;
}
} else
@@ -394,13 +426,39 @@ static int blkif_queue_request(struct request *req)
id = get_id_from_freelist(info);
info->shadow[id].request = req;
if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
ring_req->operation = BLKIF_OP_DISCARD;
ring_req->u.discard.nr_sectors = blk_rq_sectors(req);
ring_req->u.discard.id = id;
ring_req->u.discard.sector_number = (blkif_sector_t)blk_rq_pos(req);
if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard)
ring_req->u.discard.flag = BLKIF_DISCARD_SECURE;
else
ring_req->u.discard.flag = 0;
} else {
BUG_ON(info->max_indirect_segments == 0 &&
req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
BUG_ON(info->max_indirect_segments &&
req->nr_phys_segments > info->max_indirect_segments);
nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
ring_req->u.rw.id = id;
if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
/*
* The indirect operation can only be a BLKIF_OP_READ or
* BLKIF_OP_WRITE
*/
BUG_ON(req->cmd_flags & (REQ_FLUSH | REQ_FUA));
ring_req->operation = BLKIF_OP_INDIRECT;
ring_req->u.indirect.indirect_op = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
ring_req->u.indirect.sector_number = (blkif_sector_t)blk_rq_pos(req);
ring_req->u.indirect.handle = info->handle;
ring_req->u.indirect.nr_segments = nseg;
} else {
ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
/*
* Ideally we can do an unordered flush-to-disk. In case the
@@ -411,25 +469,24 @@ static int blkif_queue_request(struct request *req)
*/
ring_req->operation = info->flush_op;
}
ring_req->u.rw.nr_segments = nseg;
}
for_each_sg(info->shadow[id].sg, sg, nseg, i) {
fsect = sg->offset >> 9;
lsect = fsect + (sg->length >> 9) - 1;
if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
(i % SEGS_PER_INDIRECT_FRAME == 0)) {
if (segments)
kunmap_atomic(segments);
n = i / SEGS_PER_INDIRECT_FRAME;
gnt_list_entry = get_grant(&gref_head, info);
info->shadow[id].indirect_grants[n] = gnt_list_entry;
segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
}
gnt_list_entry = get_grant(&gref_head, info);
ref = gnt_list_entry->gref;
@@ -441,8 +498,7 @@ static int blkif_queue_request(struct request *req)
BUG_ON(sg->offset + sg->length > PAGE_SIZE);
shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
bvec_data = kmap_atomic(sg_page(sg));
/*
@@ -461,13 +517,23 @@ static int blkif_queue_request(struct request *req)
kunmap_atomic(bvec_data);
kunmap_atomic(shared_data);
}
if (ring_req->operation != BLKIF_OP_INDIRECT) {
ring_req->u.rw.seg[i] =
(struct blkif_request_segment) {
.gref = ref,
.first_sect = fsect,
.last_sect = lsect };
} else {
n = i % SEGS_PER_INDIRECT_FRAME;
segments[n] =
(struct blkif_request_segment_aligned) {
.gref = ref,
.first_sect = fsect,
.last_sect = lsect };
}
}
if (segments)
kunmap_atomic(segments);
}
info->ring.req_prod_pvt++;
@@ -542,7 +608,9 @@ static void do_blkif_request(struct request_queue *rq)
flush_requests(info);
}
static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
unsigned int physical_sector_size,
unsigned int segments)
{
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -564,14 +632,15 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
/* Hard sector size and max sectors impersonate the equiv. hardware. */
blk_queue_logical_block_size(rq, sector_size);
blk_queue_physical_block_size(rq, physical_sector_size);
blk_queue_max_hw_sectors(rq, (segments * PAGE_SIZE) / 512);
/* Each segment in a request is up to an aligned page in size. */
blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
blk_queue_max_segment_size(rq, PAGE_SIZE);
/* Ensure a merged request will fit in a single I/O ring slot. */
blk_queue_max_segments(rq, segments);
/* Make sure buffer addresses are sector-aligned. */
blk_queue_dma_alignment(rq, 511);
@@ -588,13 +657,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
static void xlvbd_flush(struct blkfront_info *info)
{
blk_queue_flush(info->rq, info->feature_flush);
printk(KERN_INFO "blkfront: %s: %s: %s %s %s %s %s\n",
info->gd->disk_name,
info->flush_op == BLKIF_OP_WRITE_BARRIER ?
"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
"flush diskcache" : "barrier or flush"),
info->feature_flush ? "enabled;" : "disabled;",
"persistent grants:",
info->feature_persistent ? "enabled;" : "disabled;",
"indirect descriptors:",
info->max_indirect_segments ? "enabled;" : "disabled;");
}
static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
@@ -667,7 +739,8 @@ static char *encode_disk_name(char *ptr, unsigned int n)
static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
struct blkfront_info *info,
u16 vdisk_info, u16 sector_size,
unsigned int physical_sector_size)
{
struct gendisk *gd;
int nr_minors = 1;
@@ -734,7 +807,9 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);
if (xlvbd_init_blk_queue(gd, sector_size, physical_sector_size,
info->max_indirect_segments ? :
BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
del_gendisk(gd);
goto release;
}
@@ -818,6 +893,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
{
struct grant *persistent_gnt;
struct grant *n;
int i, j, segs;
/* Prevent new requests being issued until we fix things up. */
spin_lock_irq(&info->io_lock);
@@ -843,6 +919,47 @@ static void blkif_free(struct blkfront_info *info, int suspend)
}
BUG_ON(info->persistent_gnts_c != 0);
for (i = 0; i < BLK_RING_SIZE; i++) {
/*
* Clear persistent grants present in requests already
* on the shared ring
*/
if (!info->shadow[i].request)
goto free_shadow;
segs = info->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
info->shadow[i].req.u.indirect.nr_segments :
info->shadow[i].req.u.rw.nr_segments;
for (j = 0; j < segs; j++) {
persistent_gnt = info->shadow[i].grants_used[j];
gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
__free_page(pfn_to_page(persistent_gnt->pfn));
kfree(persistent_gnt);
}
if (info->shadow[i].req.operation != BLKIF_OP_INDIRECT)
/*
* If this is not an indirect operation don't try to
* free indirect segments
*/
goto free_shadow;
for (j = 0; j < INDIRECT_GREFS(segs); j++) {
persistent_gnt = info->shadow[i].indirect_grants[j];
gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
__free_page(pfn_to_page(persistent_gnt->pfn));
kfree(persistent_gnt);
}
free_shadow:
kfree(info->shadow[i].grants_used);
info->shadow[i].grants_used = NULL;
kfree(info->shadow[i].indirect_grants);
info->shadow[i].indirect_grants = NULL;
kfree(info->shadow[i].sg);
info->shadow[i].sg = NULL;
}
/* No more gnttab callback work. */
gnttab_cancel_free_callback(&info->callback);
spin_unlock_irq(&info->io_lock);
@@ -867,12 +984,13 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
struct blkif_response *bret)
{
int i = 0;
struct scatterlist *sg;
char *bvec_data;
void *shared_data;
int nseg;
nseg = s->req.operation == BLKIF_OP_INDIRECT ?
s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
if (bret->operation == BLKIF_OP_READ) {
/*
@@ -881,26 +999,29 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
* than PAGE_SIZE, we have to keep track of the current offset,
* to be sure we are copying the data from the right shared page.
*/
for_each_sg(s->sg, sg, nseg, i) {
BUG_ON(sg->offset + sg->length > PAGE_SIZE);
shared_data = kmap_atomic(
pfn_to_page(s->grants_used[i]->pfn));
bvec_data = kmap_atomic(sg_page(sg));
memcpy(bvec_data + sg->offset,
shared_data + sg->offset,
sg->length);
kunmap_atomic(bvec_data);
kunmap_atomic(shared_data);
}
}
/* Add the persistent grant into the list of free grants */
for (i = 0; i < nseg; i++) {
list_add(&s->grants_used[i]->node, &info->persistent_gnts);
info->persistent_gnts_c++;
}
if (s->req.operation == BLKIF_OP_INDIRECT) {
for (i = 0; i < INDIRECT_GREFS(nseg); i++) {
list_add(&s->indirect_grants[i]->node, &info->persistent_gnts);
info->persistent_gnts_c++;
}
}
}
static irqreturn_t blkif_interrupt(int irq, void *dev_id)
@@ -1034,14 +1155,6 @@ static int setup_blkring(struct xenbus_device *dev,
SHARED_RING_INIT(sring);
FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
sg_init_table(info->sg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
/* Allocate memory for grants */
err = fill_grant_buffer(info, BLK_RING_SIZE *
BLKIF_MAX_SEGMENTS_PER_REQUEST);
if (err)
goto fail;
err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
if (err < 0) {
free_page((unsigned long)sring);
@@ -1223,13 +1336,84 @@ static int blkfront_probe(struct xenbus_device *dev,
return 0;
}
/*
* This is a clone of md_trim_bio, used to split a bio into smaller ones
*/
static void trim_bio(struct bio *bio, int offset, int size)
{
/* 'bio' is a cloned bio which we need to trim to match
* the given offset and size.
* This requires adjusting bi_sector, bi_size, and bi_io_vec
*/
int i;
struct bio_vec *bvec;
int sofar = 0;
size <<= 9;
if (offset == 0 && size == bio->bi_size)
return;
bio->bi_sector += offset;
bio->bi_size = size;
offset <<= 9;
clear_bit(BIO_SEG_VALID, &bio->bi_flags);
while (bio->bi_idx < bio->bi_vcnt &&
bio->bi_io_vec[bio->bi_idx].bv_len <= offset) {
/* remove this whole bio_vec */
offset -= bio->bi_io_vec[bio->bi_idx].bv_len;
bio->bi_idx++;
}
if (bio->bi_idx < bio->bi_vcnt) {
bio->bi_io_vec[bio->bi_idx].bv_offset += offset;
bio->bi_io_vec[bio->bi_idx].bv_len -= offset;
}
/* avoid any complications with bi_idx being non-zero */
if (bio->bi_idx) {
memmove(bio->bi_io_vec, bio->bi_io_vec+bio->bi_idx,
(bio->bi_vcnt - bio->bi_idx) * sizeof(struct bio_vec));
bio->bi_vcnt -= bio->bi_idx;
bio->bi_idx = 0;
}
/* Make sure vcnt and last bv are not too big */
bio_for_each_segment(bvec, bio, i) {
if (sofar + bvec->bv_len > size)
bvec->bv_len = size - sofar;
if (bvec->bv_len == 0) {
bio->bi_vcnt = i;
break;
}
sofar += bvec->bv_len;
}
}
static void split_bio_end(struct bio *bio, int error)
{
struct split_bio *split_bio = bio->bi_private;
if (error)
split_bio->err = error;
if (atomic_dec_and_test(&split_bio->pending)) {
split_bio->bio->bi_phys_segments = 0;
bio_endio(split_bio->bio, split_bio->err);
kfree(split_bio);
}
bio_put(bio);
}
static int blkif_recover(struct blkfront_info *info)
{
int i;
struct request *req, *n;
struct blk_shadow *copy;
int rc;
struct bio *bio, *cloned_bio;
struct bio_list bio_list, merge_bio;
unsigned int segs, offset;
int pending, size;
struct split_bio *split_bio;
struct list_head requests;
/* Stage 1: Make a safe copy of the shadow state. */
copy = kmemdup(info->shadow, sizeof(info->shadow),
@@ -1244,36 +1428,64 @@ static int blkif_recover(struct blkfront_info *info)
info->shadow_free = info->ring.req_prod_pvt;
info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
rc = blkfront_setup_indirect(info);
if (rc) {
kfree(copy);
return rc;
}
segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
blk_queue_max_segments(info->rq, segs);
bio_list_init(&bio_list);
INIT_LIST_HEAD(&requests);
for (i = 0; i < BLK_RING_SIZE; i++) {
/* Not in use? */
if (!copy[i].request)
continue;
/*
* Get the bios in the request so we can re-queue them.
*/
if (copy[i].request->cmd_flags &
(REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
/*
* Flush operations don't contain bios, so
* we need to requeue the whole request
*/
list_add(&copy[i].request->queuelist, &requests);
continue;
}
merge_bio.head = copy[i].request->bio;
merge_bio.tail = copy[i].request->biotail;
bio_list_merge(&bio_list, &merge_bio);
copy[i].request->bio = NULL;
blk_put_request(copy[i].request);
}
kfree(copy);
/*
* Empty the queue; this is important because we might have
* requests in the queue with more segments than what we
* can handle now.
*/
spin_lock_irq(&info->io_lock);
while ((req = blk_fetch_request(info->rq)) != NULL) {
if (req->cmd_flags &
(REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
list_add(&req->queuelist, &requests);
continue;
}
merge_bio.head = req->bio;
merge_bio.tail = req->biotail;
bio_list_merge(&bio_list, &merge_bio);
req->bio = NULL;
if (req->cmd_flags & (REQ_FLUSH | REQ_FUA))
pr_alert("diskcache flush request found!\n");
__blk_put_request(info->rq, req);
}
spin_unlock_irq(&info->io_lock);
xenbus_switch_state(info->xbdev, XenbusStateConnected);
spin_lock_irq(&info->io_lock);
@@ -1281,14 +1493,50 @@ static int blkif_recover(struct blkfront_info *info)
/* Now safe for us to use the shared ring */
info->connected = BLKIF_STATE_CONNECTED;
/* Send off requeued requests */
flush_requests(info);
/* Kick any other new requests queued since we resumed */
kick_pending_request_queues(info);
list_for_each_entry_safe(req, n, &requests, queuelist) {
/* Requeue pending requests (flush or discard) */
list_del_init(&req->queuelist);
BUG_ON(req->nr_phys_segments > segs);
blk_requeue_request(info->rq, req);
}
spin_unlock_irq(&info->io_lock);
while ((bio = bio_list_pop(&bio_list)) != NULL) {
/* Traverse the list of pending bios and re-queue them */
if (bio_segments(bio) > segs) {
/*
* This bio has more segments than what we can
* handle, so we have to split it.
*/
pending = (bio_segments(bio) + segs - 1) / segs;
split_bio = kzalloc(sizeof(*split_bio), GFP_NOIO);
BUG_ON(split_bio == NULL);
atomic_set(&split_bio->pending, pending);
split_bio->bio = bio;
for (i = 0; i < pending; i++) {
offset = (i * segs * PAGE_SIZE) >> 9;
size = min((unsigned int)(segs * PAGE_SIZE) >> 9,
(unsigned int)(bio->bi_size >> 9) - offset);
cloned_bio = bio_clone(bio, GFP_NOIO);
BUG_ON(cloned_bio == NULL);
trim_bio(cloned_bio, offset, size);
cloned_bio->bi_private = split_bio;
cloned_bio->bi_end_io = split_bio_end;
submit_bio(cloned_bio->bi_rw, cloned_bio);
}
/*
* Now we have to wait for all those smaller bios to
* end, so we can also end the "parent" bio.
*/
continue;
}
/* We don't need to split this bio */
submit_bio(bio->bi_rw, bio);
}
return 0;
}
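A worked example for the bio-splitting loop above (the numbers are illustrative, not from the patch):
/*
 * With segs = 32 and a bio carrying 100 segments, pending = 4 clones are
 * submitted; clone i is trimmed to start (i * 32 * PAGE_SIZE) >> 9 sectors
 * into the original bio and to cover at most (32 * PAGE_SIZE) >> 9 sectors,
 * so the four clones together span the whole original bio before
 * split_bio_end() completes the parent.
 */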
@@ -1308,8 +1556,12 @@ static int blkfront_resume(struct xenbus_device *dev)
blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);
err = talk_to_blkback(dev, info);
/*
* We have to wait for the backend to switch to
* connected state, since we want to read which
* features it supports.
*/
return err;
}
@@ -1387,6 +1639,60 @@ static void blkfront_setup_discard(struct blkfront_info *info)
kfree(type);
}
static int blkfront_setup_indirect(struct blkfront_info *info)
{
unsigned int indirect_segments, segs;
int err, i;
err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
"feature-max-indirect-segments", "%u", &indirect_segments,
NULL);
if (err) {
info->max_indirect_segments = 0;
segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
} else {
info->max_indirect_segments = min(indirect_segments,
xen_blkif_max_segments);
segs = info->max_indirect_segments;
}
err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
if (err)
goto out_of_memory;
for (i = 0; i < BLK_RING_SIZE; i++) {
info->shadow[i].grants_used = kzalloc(
sizeof(info->shadow[i].grants_used[0]) * segs,
GFP_NOIO);
info->shadow[i].sg = kzalloc(sizeof(info->shadow[i].sg[0]) * segs, GFP_NOIO);
if (info->max_indirect_segments)
info->shadow[i].indirect_grants = kzalloc(
sizeof(info->shadow[i].indirect_grants[0]) *
INDIRECT_GREFS(segs),
GFP_NOIO);
if ((info->shadow[i].grants_used == NULL) ||
(info->shadow[i].sg == NULL) ||
(info->max_indirect_segments &&
(info->shadow[i].indirect_grants == NULL)))
goto out_of_memory;
sg_init_table(info->shadow[i].sg, segs);
}
return 0;
out_of_memory:
for (i = 0; i < BLK_RING_SIZE; i++) {
kfree(info->shadow[i].grants_used);
info->shadow[i].grants_used = NULL;
kfree(info->shadow[i].sg);
info->shadow[i].sg = NULL;
kfree(info->shadow[i].indirect_grants);
info->shadow[i].indirect_grants = NULL;
}
return -ENOMEM;
}
/*
* Invoked when the backend is finally 'ready' (and has provided
* the details about the physical device - #sectors, size, etc).
@@ -1395,6 +1701,7 @@ static void blkfront_connect(struct blkfront_info *info)
{
unsigned long long sectors;
unsigned long sector_size;
unsigned int physical_sector_size;
unsigned int binfo;
int err;
int barrier, flush, discard, persistent;
@@ -1414,8 +1721,15 @@ static void blkfront_connect(struct blkfront_info *info)
set_capacity(info->gd, sectors);
revalidate_disk(info->gd);
return;
case BLKIF_STATE_SUSPENDED:
/*
* If we are recovering from suspension, we need to wait
* for the backend to announce its features before
* reconnecting, at least we need to know if the backend
* supports indirect descriptors, and how many.
*/
blkif_recover(info);
return; return;
default:
@@ -1437,6 +1751,16 @@
return;
}
/*
* physical-sector-size is a newer field, so old backends may not
* provide this. Assume physical sector size to be the same as
* sector_size in that case.
*/
err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
"physical-sector-size", "%u", &physical_sector_size);
if (err != 1)
physical_sector_size = sector_size;
info->feature_flush = 0;
info->flush_op = 0;
@@ -1483,7 +1807,15 @@
else
info->feature_persistent = persistent;
err = blkfront_setup_indirect(info);
if (err) {
xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
info->xbdev->otherend);
return;
}
err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size,
physical_sector_size);
if (err) {
xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
info->xbdev->otherend);
...
@@ -102,6 +102,30 @@ typedef uint64_t blkif_sector_t;
*/
#define BLKIF_OP_DISCARD 5
/*
* Recognized if "feature-max-indirect-segments" is present in the backend
* xenbus info. The "feature-max-indirect-segments" node contains the maximum
* number of segments allowed by the backend per request. If the node is
* present, the frontend might use blkif_request_indirect structs in order to
* issue requests with more than BLKIF_MAX_SEGMENTS_PER_REQUEST (11). The
* maximum number of indirect segments is fixed by the backend, but the
* frontend can issue requests with any number of indirect segments as long as
* it's less than the number provided by the backend. The indirect_grefs field
* in blkif_request_indirect should be filled by the frontend with the
* grant references of the pages that are holding the indirect segments.
* These pages are filled with an array of blkif_request_segment_aligned
* entries that hold the information about the segments. The number of indirect
* pages to use is determined by the maximum number of segments
* an indirect request contains. Every indirect page can contain a maximum
* of 512 segments (PAGE_SIZE/sizeof(blkif_request_segment_aligned)),
* so to calculate the number of indirect pages to use we have to do
* ceil(indirect_segments/512).
*
* If a backend does not recognize BLKIF_OP_INDIRECT, it should *not*
* create the "feature-max-indirect-segments" node!
*/
#define BLKIF_OP_INDIRECT 6
/*
* Maximum scatter/gather segments per request.
* This is carefully chosen so that sizeof(struct blkif_ring) <= PAGE_SIZE.
@@ -109,6 +133,16 @@ typedef uint64_t blkif_sector_t;
*/
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
#define BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST 8
struct blkif_request_segment_aligned {
grant_ref_t gref; /* reference to I/O buffer frame */
/* @first_sect: first sector in frame to transfer (inclusive). */
/* @last_sect: last sector in frame to transfer (inclusive). */
uint8_t first_sect, last_sect;
uint16_t _pad; /* padding to make it 8 bytes, so it's cache-aligned */
} __attribute__((__packed__));
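As a sketch of the ceil(indirect_segments/512) rule spelled out in the BLKIF_OP_INDIRECT comment, using the structure just defined (illustrative only; the macro and helper below are not part of this header):
#define EXAMPLE_SEGS_PER_INDIRECT_PAGE \
(PAGE_SIZE / sizeof(struct blkif_request_segment_aligned))
/* Number of indirect grant pages needed to describe 'segments' segments. */
static inline unsigned int example_indirect_pages(unsigned int segments)
{
return (segments + EXAMPLE_SEGS_PER_INDIRECT_PAGE - 1) /
EXAMPLE_SEGS_PER_INDIRECT_PAGE;
}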
struct blkif_request_rw {
uint8_t nr_segments; /* number of segments */
blkif_vdev_t handle; /* only for read/write requests */
@@ -147,12 +181,31 @@ struct blkif_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
#ifdef CONFIG_X86_64
uint32_t _pad1; /* offsetof(blkif_...,u.indirect.id) == 8 */
#endif
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad2;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
#ifdef CONFIG_X86_64
uint32_t _pad3; /* make it 64 byte aligned */
#else
uint64_t _pad3; /* make it 64 byte aligned */
#endif
} __attribute__((__packed__));
struct blkif_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_request_rw rw;
struct blkif_request_discard discard;
struct blkif_request_other other;
struct blkif_request_indirect indirect;
} u;
} __attribute__((__packed__));
...
@@ -188,6 +188,11 @@ struct __name##_back_ring { \
#define RING_REQUEST_CONS_OVERFLOW(_r, _cons) \
(((_cons) - (_r)->rsp_prod_pvt) >= RING_SIZE(_r))
/* Ill-behaved frontend determination: Can there be this many requests? */
#define RING_REQUEST_PROD_OVERFLOW(_r, _prod) \
(((_prod) - (_r)->rsp_prod_pvt) > RING_SIZE(_r))
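A minimal sketch of how a backend could use this check when draining its ring (assumed usage, not part of this header; the function and its request handling are hypothetical):
static int example_consume_requests(struct blkif_back_ring *back_ring)
{
RING_IDX rc = back_ring->req_cons;
RING_IDX rp = back_ring->sring->req_prod;
rmb(); /* make sure we see requests up to rp */
if (RING_REQUEST_PROD_OVERFLOW(back_ring, rp))
return -EACCES; /* frontend published more slots than the ring holds */
while (rc != rp) {
/* ... copy out and handle the request at index rc ... */
back_ring->req_cons = ++rc;
}
return 0;
}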
#define RING_PUSH_REQUESTS(_r) do { \
wmb(); /* back sees requests /before/ updated producer index */ \
(_r)->sring->req_prod = (_r)->req_prod_pvt; \
...