Commit f35546e0 authored by Jens Axboe's avatar Jens Axboe

Merge branch 'stable/for-jens-3.10' of...

Merge branch 'stable/for-jens-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers

Konrad writes:

It has the 'feature-max-indirect-segments' implemented in both backend
and frontend. The current problem with the backend and frontend is that the
segment size is limited to 11 pages. It means we can at most squeeze in 44kB per
request. The ring can hold 32 (next power of two below 36) requests, meaning we
can do 1.4M of outstanding requests. Nowadays that is not enough.

The problem in the past was addressed in two ways - but neither one went upstream.
The first solution to this proposed by Justin from Spectralogic was to negotiate
the segment size.  This means that the ‘struct blkif_sring_entry’ is now a variable size.
It can expand from 112 bytes (cover 11 pages of data - 44kB) to 1580 bytes
(256 pages of data - so 1MB). It is a simple extension by just making the array in the
request expand from 11 to a variable size negotiated. But it had limits: this extension
still limits the number of segments per request to 255 (as the total number must be
specified in the request, which only has an 8-bit field for that purpose).

The other solution (from Intel - Ronghui) was to create one extra ring that only has the
‘struct blkif_request_segment’ in them. The ‘struct blkif_request’ would be changed to have
an index in said ‘segment ring’. There is only one segment ring. This means that the size of
the initial ring is still the same. The requests would point to the segment and enumerate out
how many of the indexes it wants to use. The limit is of course the size of the segment.
If one assumes a one-page segment this means we can in one request cover ~4MB.

Those patches were posted as RFC and the author never followed up on the ideas on changing
it to be a bit more flexible.

There is yet another mechanism that could be employed  (which these patches implement) - and it
borrows from VirtIO protocol. And that is the ‘indirect descriptors’. This very similar to
what Intel suggests, but with a twist. The twist is to negotiate how many of these
'segment' pages (aka indirect descriptor pages) we want to support (in reality we negotiate
how many entries in the segment we want to cover, and we module the number if it is
bigger than the segment size).

This means that with the existing 36 slots in the ring (single page) we can cover:
32 slots * each blkif_request_indirect covers: 512 * 4096 ~= 64M. Since we ample space
in the blkif_request_indirect to span more than one indirect page, that number (64M)
can be also multiplied by eight = 512MB.

Roger Pau Monne took the idea and implemented them in these patches. They work
great and the corner cases (migration between backends with and without this extension)
work nicely. The backend has a limit right now off how many indirect entries
it can handle: one indirect page, and at maximum 256 entries (out of 512 - so  50% of the page
is used). That comes out to 32 slots * 256 entries in a indirect page * 1 indirect page
per request * 4096 = 32MB.

This is a conservative number that can change in the future. Right now it strikes
a good balance between giving excellent performance, memory usage in the backend, and
balancing the needs of many guests.

In the patchset there is also the split of the blkback structure to be per-VBD.
This means that the spinlock contention we had with many guests trying to do I/O and
all the blkback threads hitting the same lock has been eliminated.

Also there are bug-fixes to deal with oddly sized sectors, insane amounts on
th ring, and also a security fix (posted earlier).
parents 36f988e9 1e0f7a21
What: /sys/module/xen_blkback/parameters/max_buffer_pages
Date: March 2013
KernelVersion: 3.11
Contact: Roger Pau Monné <roger.pau@citrix.com>
Description:
Maximum number of free pages to keep in each block
backend buffer.
What: /sys/module/xen_blkback/parameters/max_persistent_grants
Date: March 2013
KernelVersion: 3.11
Contact: Roger Pau Monné <roger.pau@citrix.com>
Description:
Maximum number of grants to map persistently in
blkback. If the frontend tries to use more than
max_persistent_grants, the LRU kicks in and starts
removing 5% of max_persistent_grants every 100ms.
What: /sys/module/xen_blkfront/parameters/max
Date: June 2013
KernelVersion: 3.11
Contact: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Description:
Maximum number of segments that the frontend will negotiate
with the backend for indirect descriptors. The default value
is 32 - higher value means more potential throughput but more
memory usage. The backend picks the minimum of the frontend
and its default backend value.
This diff is collapsed.
......@@ -50,6 +50,19 @@
__func__, __LINE__, ##args)
/*
* This is the maximum number of segments that would be allowed in indirect
* requests. This value will also be passed to the frontend.
*/
#define MAX_INDIRECT_SEGMENTS 256
#define SEGS_PER_INDIRECT_FRAME \
(PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
#define MAX_INDIRECT_PAGES \
((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
#define INDIRECT_PAGES(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
/* Not a real protocol. Used to generate ring structs which contain
* the elements common to all protocols only. This way we get a
* compiler-checkable way to use common struct elements, so we can
......@@ -83,12 +96,31 @@ struct blkif_x86_32_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_x86_32_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad1;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint64_t _pad2; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_32_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_32_request_rw rw;
struct blkif_x86_32_request_discard discard;
struct blkif_x86_32_request_other other;
struct blkif_x86_32_request_indirect indirect;
} u;
} __attribute__((__packed__));
......@@ -127,12 +159,32 @@ struct blkif_x86_64_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_x86_64_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
uint32_t _pad1; /* offsetof(blkif_..,u.indirect.id)==8 */
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad2;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
/*
* The maximum number of indirect segments (and pages) that will
* be used is determined by MAX_INDIRECT_SEGMENTS, this value
* is also exported to the guest (via xenstore
* feature-max-indirect-segments entry), so the frontend knows how
* many indirect segments the backend supports.
*/
uint32_t _pad3; /* make it 64 byte aligned */
} __attribute__((__packed__));
struct blkif_x86_64_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_64_request_rw rw;
struct blkif_x86_64_request_discard discard;
struct blkif_x86_64_request_other other;
struct blkif_x86_64_request_indirect indirect;
} u;
} __attribute__((__packed__));
......@@ -182,12 +234,26 @@ struct xen_vbd {
struct backend_info;
/* Number of available flags */
#define PERSISTENT_GNT_FLAGS_SIZE 2
/* This persistent grant is currently in use */
#define PERSISTENT_GNT_ACTIVE 0
/*
* This persistent grant has been used, this flag is set when we remove the
* PERSISTENT_GNT_ACTIVE, to know that this grant has been used recently.
*/
#define PERSISTENT_GNT_WAS_ACTIVE 1
/* Number of requests that we can fit in a ring */
#define XEN_BLKIF_REQS 32
struct persistent_gnt {
struct page *page;
grant_ref_t gnt;
grant_handle_t handle;
DECLARE_BITMAP(flags, PERSISTENT_GNT_FLAGS_SIZE);
struct rb_node node;
struct list_head remove_node;
};
struct xen_blkif {
......@@ -219,6 +285,23 @@ struct xen_blkif {
/* tree to store persistent grants */
struct rb_root persistent_gnts;
unsigned int persistent_gnt_c;
atomic_t persistent_gnt_in_use;
unsigned long next_lru;
/* used by the kworker that offload work from the persistent purge */
struct list_head persistent_purge_list;
struct work_struct persistent_purge_work;
/* buffer of free pages to map grant refs */
spinlock_t free_pages_lock;
int free_pages_num;
struct list_head free_pages;
/* List of all 'pending_req' available */
struct list_head pending_free;
/* And its spinlock. */
spinlock_t pending_free_lock;
wait_queue_head_t pending_free_wq;
/* statistics */
unsigned long st_print;
......@@ -231,6 +314,41 @@ struct xen_blkif {
unsigned long long st_wr_sect;
wait_queue_head_t waiting_to_free;
/* Thread shutdown wait queue. */
wait_queue_head_t shutdown_wq;
};
struct seg_buf {
unsigned long offset;
unsigned int nsec;
};
struct grant_page {
struct page *page;
struct persistent_gnt *persistent_gnt;
grant_handle_t handle;
grant_ref_t gref;
};
/*
* Each outstanding request that we've passed to the lower device layers has a
* 'pending_req' allocated to it. Each buffer_head that completes decrements
* the pendcnt towards zero. When it hits zero, the specified domain has a
* response queued for it, with the saved 'id' passed back.
*/
struct pending_req {
struct xen_blkif *blkif;
u64 id;
int nr_pages;
atomic_t pendcnt;
unsigned short operation;
int status;
struct list_head free_list;
struct grant_page *segments[MAX_INDIRECT_SEGMENTS];
/* Indirect descriptors */
struct grant_page *indirect_pages[MAX_INDIRECT_PAGES];
struct seg_buf seg[MAX_INDIRECT_SEGMENTS];
struct bio *biolist[MAX_INDIRECT_SEGMENTS];
};
......@@ -257,6 +375,7 @@ int xen_blkif_xenbus_init(void);
irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
int xen_blkif_schedule(void *arg);
int xen_blkif_purge_persistent(void *arg);
int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
struct backend_info *be, int state);
......@@ -268,7 +387,7 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be);
static inline void blkif_get_x86_32_req(struct blkif_request *dst,
struct blkif_x86_32_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
......@@ -291,6 +410,18 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
......@@ -304,7 +435,7 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
static inline void blkif_get_x86_64_req(struct blkif_request *dst,
struct blkif_x86_64_request *src)
{
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
......@@ -327,6 +458,18 @@ static inline void blkif_get_x86_64_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
case BLKIF_OP_INDIRECT:
dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
dst->u.indirect.handle = src->u.indirect.handle;
dst->u.indirect.id = src->u.indirect.id;
dst->u.indirect.sector_number = src->u.indirect.sector_number;
barrier();
j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
for (i = 0; i < j; i++)
dst->u.indirect.indirect_grefs[i] =
src->u.indirect.indirect_grefs[i];
break;
default:
/*
* Don't know how to translate this op. Only get the
......
......@@ -98,12 +98,17 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
err = PTR_ERR(blkif->xenblkd);
blkif->xenblkd = NULL;
xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
return;
}
}
static struct xen_blkif *xen_blkif_alloc(domid_t domid)
{
struct xen_blkif *blkif;
struct pending_req *req, *n;
int i, j;
BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
if (!blkif)
......@@ -118,8 +123,57 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
init_waitqueue_head(&blkif->waiting_to_free);
blkif->persistent_gnts.rb_node = NULL;
spin_lock_init(&blkif->free_pages_lock);
INIT_LIST_HEAD(&blkif->free_pages);
blkif->free_pages_num = 0;
atomic_set(&blkif->persistent_gnt_in_use, 0);
INIT_LIST_HEAD(&blkif->pending_free);
for (i = 0; i < XEN_BLKIF_REQS; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
if (!req)
goto fail;
list_add_tail(&req->free_list,
&blkif->pending_free);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
req->segments[j] = kzalloc(sizeof(*req->segments[0]),
GFP_KERNEL);
if (!req->segments[j])
goto fail;
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
GFP_KERNEL);
if (!req->indirect_pages[j])
goto fail;
}
}
spin_lock_init(&blkif->pending_free_lock);
init_waitqueue_head(&blkif->pending_free_wq);
init_waitqueue_head(&blkif->shutdown_wq);
return blkif;
fail:
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
if (!req->segments[j])
break;
kfree(req->segments[j]);
}
for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
if (!req->indirect_pages[j])
break;
kfree(req->indirect_pages[j]);
}
kfree(req);
}
kmem_cache_free(xen_blkif_cachep, blkif);
return ERR_PTR(-ENOMEM);
}
static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
......@@ -178,6 +232,7 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
{
if (blkif->xenblkd) {
kthread_stop(blkif->xenblkd);
wake_up(&blkif->shutdown_wq);
blkif->xenblkd = NULL;
}
......@@ -198,8 +253,28 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)
static void xen_blkif_free(struct xen_blkif *blkif)
{
struct pending_req *req, *n;
int i = 0, j;
if (!atomic_dec_and_test(&blkif->refcnt))
BUG();
/* Check that there is no request in use */
list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
kfree(req->segments[j]);
for (j = 0; j < MAX_INDIRECT_PAGES; j++)
kfree(req->indirect_pages[j]);
kfree(req);
i++;
}
WARN_ON(i != XEN_BLKIF_REQS);
kmem_cache_free(xen_blkif_cachep, blkif);
}
......@@ -678,6 +753,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "feature-max-indirect-segments", "%u",
MAX_INDIRECT_SEGMENTS);
if (err)
dev_warn(&dev->dev, "writing %s/feature-max-indirect-segments (%d)",
dev->nodename, err);
err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
(unsigned long long)vbd_sz(&be->blkif->vbd));
......@@ -704,6 +784,11 @@ static void connect(struct backend_info *be)
dev->nodename);
goto abort;
}
err = xenbus_printf(xbt, dev->nodename, "physical-sector-size", "%u",
bdev_physical_block_size(be->blkif->vbd.bdev));
if (err)
xenbus_dev_error(dev, err, "writing %s/physical-sector-size",
dev->nodename);
err = xenbus_transaction_end(xbt, 0);
if (err == -EAGAIN)
......
This diff is collapsed.
......@@ -102,6 +102,30 @@ typedef uint64_t blkif_sector_t;
*/
#define BLKIF_OP_DISCARD 5
/*
* Recognized if "feature-max-indirect-segments" in present in the backend
* xenbus info. The "feature-max-indirect-segments" node contains the maximum
* number of segments allowed by the backend per request. If the node is
* present, the frontend might use blkif_request_indirect structs in order to
* issue requests with more than BLKIF_MAX_SEGMENTS_PER_REQUEST (11). The
* maximum number of indirect segments is fixed by the backend, but the
* frontend can issue requests with any number of indirect segments as long as
* it's less than the number provided by the backend. The indirect_grefs field
* in blkif_request_indirect should be filled by the frontend with the
* grant references of the pages that are holding the indirect segments.
* This pages are filled with an array of blkif_request_segment_aligned
* that hold the information about the segments. The number of indirect
* pages to use is determined by the maximum number of segments
* a indirect request contains. Every indirect page can contain a maximum
* of 512 segments (PAGE_SIZE/sizeof(blkif_request_segment_aligned)),
* so to calculate the number of indirect pages to use we have to do
* ceil(indirect_segments/512).
*
* If a backend does not recognize BLKIF_OP_INDIRECT, it should *not*
* create the "feature-max-indirect-segments" node!
*/
#define BLKIF_OP_INDIRECT 6
/*
* Maximum scatter/gather segments per request.
* This is carefully chosen so that sizeof(struct blkif_ring) <= PAGE_SIZE.
......@@ -109,6 +133,16 @@ typedef uint64_t blkif_sector_t;
*/
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
#define BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST 8
struct blkif_request_segment_aligned {
grant_ref_t gref; /* reference to I/O buffer frame */
/* @first_sect: first sector in frame to transfer (inclusive). */
/* @last_sect: last sector in frame to transfer (inclusive). */
uint8_t first_sect, last_sect;
uint16_t _pad; /* padding to make it 8 bytes, so it's cache-aligned */
} __attribute__((__packed__));
struct blkif_request_rw {
uint8_t nr_segments; /* number of segments */
blkif_vdev_t handle; /* only for read/write requests */
......@@ -147,12 +181,31 @@ struct blkif_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));
struct blkif_request_indirect {
uint8_t indirect_op;
uint16_t nr_segments;
#ifdef CONFIG_X86_64
uint32_t _pad1; /* offsetof(blkif_...,u.indirect.id) == 8 */
#endif
uint64_t id;
blkif_sector_t sector_number;
blkif_vdev_t handle;
uint16_t _pad2;
grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
#ifdef CONFIG_X86_64
uint32_t _pad3; /* make it 64 byte aligned */
#else
uint64_t _pad3; /* make it 64 byte aligned */
#endif
} __attribute__((__packed__));
struct blkif_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_request_rw rw;
struct blkif_request_discard discard;
struct blkif_request_other other;
struct blkif_request_indirect indirect;
} u;
} __attribute__((__packed__));
......
......@@ -188,6 +188,11 @@ struct __name##_back_ring { \
#define RING_REQUEST_CONS_OVERFLOW(_r, _cons) \
(((_cons) - (_r)->rsp_prod_pvt) >= RING_SIZE(_r))
/* Ill-behaved frontend determination: Can there be this many requests? */
#define RING_REQUEST_PROD_OVERFLOW(_r, _prod) \
(((_prod) - (_r)->rsp_prod_pvt) > RING_SIZE(_r))
#define RING_PUSH_REQUESTS(_r) do { \
wmb(); /* back sees requests /before/ updated producer index */ \
(_r)->sring->req_prod = (_r)->req_prod_pvt; \
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment