Commit dd291d77 authored by Damien Le Moal, committed by Jens Axboe

block: Introduce zone write plugging

Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.

Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.
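
The plug and unplug steps described above can be sketched in user space
as follows. This is only an illustration under simplifying assumptions:
all sketch_* names are hypothetical, and the spinlock and kblockd work
used by the real implementation are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a write BIO; only what the sketch needs. */
struct sketch_bio {
	int id;
	struct sketch_bio *next;
};

/* Hypothetical per-zone plug: a FIFO of delayed BIOs and an in-flight flag. */
struct sketch_wplug {
	bool write_in_flight;
	struct sketch_bio *head;
	struct sketch_bio *tail;
};

/*
 * Returns true if the BIO was plugged (execution delayed) and false if the
 * caller may issue it right away, following the convention described above.
 */
static bool sketch_plug_bio(struct sketch_wplug *zwplug, struct sketch_bio *bio)
{
	if (!zwplug->write_in_flight) {
		zwplug->write_in_flight = true;
		return false;
	}

	/* A write is already executing for this zone: queue at the tail. */
	bio->next = NULL;
	if (zwplug->tail)
		zwplug->tail->next = bio;
	else
		zwplug->head = bio;
	zwplug->tail = bio;
	return true;
}

/*
 * Called when the in-flight write completes. Returns the next BIO to issue,
 * or NULL if nothing is plugged and the zone becomes idle.
 */
static struct sketch_bio *sketch_write_done(struct sketch_wplug *zwplug)
{
	struct sketch_bio *bio = zwplug->head;

	if (!bio) {
		zwplug->write_in_flight = false;
		return NULL;
	}

	zwplug->head = bio->next;
	if (!zwplug->head)
		zwplug->tail = NULL;
	bio->next = NULL;
	/* write_in_flight stays true: the returned BIO is now executing. */
	return bio;
}
```

With this model, the first write to an idle zone is issued immediately,
later writes are queued in submission order, and each completion
releases exactly one queued write.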

This mechanism has several benefits:
 - It untangles zone write ordering from block IO schedulers. This
   allows removing the restriction on using mq-deadline for writing to
   zoned block devices. Any block IO scheduler, including "none", can
   be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution do not hold scheduling tags and so do not
   prevent other BIOs from executing (reads or writes to other zones).
   Depending on the workload, this can significantly improve device
   utilization (higher queue depth operation) and performance.
 - Both blk-mq (request based) zoned devices and BIO-based zoned devices
   (e.g. device mapper) can use zone write plugging. It is mandatory
   for the former but optional for the latter. BIO-based drivers can
   use zone write plugging to implement write ordering guarantees, or
   the drivers can implement their own if needed.
 - The code is less invasive in the block layer and is mostly limited to
   blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
   bio.c.

Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plug structures are
managed using a per-disk hash table.

Plugging of zone write BIOs is done using the function
blk_zone_plug_bio() which returns false if a BIO execution does not
need to be delayed and true otherwise. This function is called from
blk_mq_submit_bio() after a BIO is split to avoid large BIOs spanning
multiple zones which would cause mishandling of zone write plugs. This
change enables zone write plugging by default for any mq request-based
block device. BIO-based device drivers can also use zone write plugging
by explicitly calling blk_zone_plug_bio() in their ->submit_bio method.
For such devices, the driver must ensure that a BIO passed to
blk_zone_plug_bio() is already split and not straddling zone
boundaries.
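
As an illustration of the "not straddling zone boundaries" requirement,
a BIO-based driver could check a BIO range before handing it to the
plugging function. The helper below is a hypothetical user-space sketch,
not a kernel API; it assumes sectors are counted from the start of the
device and that all zones have the same size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the sector range
 * [sector, sector + nr_sectors) lies entirely within one zone.
 * Assumes zone_sectors > 0 and nr_sectors > 0.
 */
static bool sketch_bio_within_one_zone(uint64_t sector, uint32_t nr_sectors,
				       uint64_t zone_sectors)
{
	uint64_t first_zone = sector / zone_sectors;
	uint64_t last_zone = (sector + nr_sectors - 1) / zone_sectors;

	return first_zone == last_zone;
}
```

A range that fails this check would have to be split at the zone
boundary before being plugged.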

Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of BIOs and requests carrying these flags triggers calls
to blk_zone_write_plug_bio_endio() and
blk_zone_write_plug_complete_request(), respectively. The latter
function triggers submission of the next plugged BIO using the zone
plug work. blk_zone_write_plug_bio_endio() does the same for BIO-based
devices. This ensures that at any time, at most one request (blk-mq
devices) or one BIO (BIO-based devices) is being executed for any zone.
The handling of zone write plugs using a per-zone plug spinlock
maximizes parallelism and device usage by allowing multiple zones to be
written simultaneously without lock contention.

Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.

Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bi_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
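
The effect of sharing storage through a union can be shown with a
reduced, hypothetical structure. The sketch assumes the cookie type is
an unsigned int; the struct and field names are illustrative and are
not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: the polling cookie is an unsigned int. */
typedef unsigned int sketch_qc_t;

/* Reduced layout before the change: the cookie has its own field. */
struct sketch_bio_before {
	sketch_qc_t cookie;
	void *end_io;
};

/* Reduced layout after the change: both fields share the same storage. */
struct sketch_bio_after {
	union {
		/* for polled BIOs */
		sketch_qc_t cookie;
		/* for plugged zone writes only */
		unsigned int nr_segments;
	};
	void *end_io;
};
```

Because only one of the two members is meaningful for a given BIO, the
structure size and the offset of the following field stay unchanged.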

When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.

Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
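
The write pointer offset rules above can be modeled as a small state
tracker. This is a sketch only: the sketch_* names are hypothetical,
and the freeing condition (no pending BIOs and a zone that is either
empty or full) is an assumption derived from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone state; field names are illustrative. */
struct sketch_zone {
	unsigned int zone_size;	/* zone size, in sectors */
	unsigned int wp_offset;	/* write pointer offset from the zone start */
};

/* A write advances the offset; the caller keeps it within the zone. */
static void sketch_zone_write(struct sketch_zone *z, unsigned int nr_sectors)
{
	z->wp_offset += nr_sectors;
}

/* A zone reset rewinds the write pointer to the zone start. */
static void sketch_zone_reset(struct sketch_zone *z)
{
	z->wp_offset = 0;
}

/* A zone finish moves the write pointer to the end of the zone. */
static void sketch_zone_finish(struct sketch_zone *z)
{
	z->wp_offset = z->zone_size;
}

/*
 * Assumed rule: the plug is no longer needed once nothing is pending and
 * the zone is empty (reset) or full (fully written or finished).
 */
static bool sketch_plug_can_be_freed(const struct sketch_zone *z,
				     bool has_pending_bios)
{
	if (has_pending_bios)
		return false;
	return z->wp_offset == 0 || z->wp_offset >= z->zone_size;
}
```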

If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plugs work keeps running
until all zone errors are handled.

To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plugs resources when a gendisk
is allocated.

In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these 2 device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.
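
The sizing rule can be written as a small helper. The function and
macro names are hypothetical, and the sketch assumes that a limit value
of 0 means the device reports no limit.

```c
#include <assert.h>

/* Default used when the device has no active and no open zone limit. */
#define SKETCH_DEFAULT_POOL_SIZE 128u

/*
 * Hypothetical helper: pool size is the larger of the two limits, or the
 * default when both are 0 (assumed here to mean "no limit").
 */
static unsigned int sketch_wplug_pool_size(unsigned int max_active_zones,
					   unsigned int max_open_zones)
{
	unsigned int size = max_active_zones > max_open_zones ?
			    max_active_zones : max_open_zones;

	return size ? size : SKETCH_DEFAULT_POOL_SIZE;
}
```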

If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.

This commit contains contributions from Christoph Hellwig <hch@lst.de>.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
parent ecfe43b1
@@ -1576,6 +1576,8 @@ void bio_endio(struct bio *bio)
 	if (!bio_integrity_endio(bio))
 		return;

+	blk_zone_bio_endio(bio);
+
 	rq_qos_done_bio(bio);

 	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACE_COMPLETION)) {
......
@@ -377,6 +377,7 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 		blkcg_bio_issue_init(split);
 		bio_chain(split, bio);
 		trace_block_split(split, bio->bi_iter.bi_sector);
+		WARN_ON_ONCE(bio_zone_write_plugging(bio));
 		submit_bio_noacct(bio);
 		return split;
 	}
@@ -988,6 +989,9 @@ enum bio_merge_status bio_attempt_back_merge(struct request *req,

 	blk_update_mixed_merge(req, bio, false);

+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_bio_merged(bio);
+
 	req->biotail->bi_next = bio;
 	req->biotail = bio;
 	req->__data_len += bio->bi_iter.bi_size;
@@ -1003,6 +1007,14 @@ static enum bio_merge_status bio_attempt_front_merge(struct request *req,
 {
 	const blk_opf_t ff = bio_failfast(bio);

+	/*
+	 * A front merge for writes to sequential zones of a zoned block device
+	 * can happen only if the user submitted writes out of order. Do not
+	 * merge such write to let it fail.
+	 */
+	if (req->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		return BIO_MERGE_FAILED;
+
 	if (!ll_front_merge_fn(req, bio, nr_segs))
 		return BIO_MERGE_FAILED;
......
@@ -828,6 +828,8 @@ static void blk_complete_request(struct request *req)
 		bio = next;
 	} while (bio);

+	blk_zone_complete_request(req);
+
 	/*
 	 * Reset counters so that the request stacking driver
 	 * can find how many bytes remain in the request
@@ -939,6 +941,7 @@ bool blk_update_request(struct request *req, blk_status_t error,
 	 * completely done
 	 */
 	if (!req->bio) {
+		blk_zone_complete_request(req);
 		/*
 		 * Reset counters so that the request stacking driver
 		 * can find how many bytes remain in the request
@@ -2963,15 +2966,30 @@ void blk_mq_submit_bio(struct bio *bio)
 	struct request *rq;
 	blk_status_t ret;

+	/*
+	 * If the plug has a cached request for this queue, try to use it.
+	 */
+	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
+
+	/*
+	 * A BIO that was released from a zone write plug has already been
+	 * through the preparation in this function, already holds a reference
+	 * on the queue usage counter, and is the only write BIO in-flight for
+	 * the target zone. Go straight to preparing a request for it.
+	 */
+	if (bio_zone_write_plugging(bio)) {
+		nr_segs = bio->__bi_nr_segments;
+		if (rq)
+			blk_queue_exit(q);
+		goto new_request;
+	}
+
 	bio = blk_queue_bounce(bio, q);

 	/*
-	 * If the plug has a cached request for this queue, try use it.
-	 *
 	 * The cached request already holds a q_usage_counter reference and we
 	 * don't have to acquire a new one if we use it.
 	 */
-	rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
 	if (!rq) {
 		if (unlikely(bio_queue_enter(bio)))
 			return;
@@ -2988,6 +3006,10 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 		goto queue_exit;

+	if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
+		goto queue_exit;
+
+new_request:
 	if (!rq) {
 		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
 		if (unlikely(!rq))
@@ -3006,6 +3028,7 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (ret != BLK_STS_OK) {
 		bio->bi_status = ret;
 		bio_endio(bio);
+		blk_zone_complete_request(rq);
 		blk_mq_free_request(rq);
 		return;
 	}
@@ -3013,6 +3036,9 @@ void blk_mq_submit_bio(struct bio *bio)
 	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
 		return;

+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_attempt_merge(rq);
+
 	if (plug) {
 		blk_add_rq_to_plug(plug, rq);
 		return;
......
This diff is collapsed.
@@ -415,7 +415,14 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 }

 #ifdef CONFIG_BLK_DEV_ZONED
-void disk_free_zone_bitmaps(struct gendisk *disk);
+void disk_init_zone_resources(struct gendisk *disk);
+void disk_free_zone_resources(struct gendisk *disk);
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
+}
+void blk_zone_write_plug_bio_merged(struct bio *bio);
+void blk_zone_write_plug_attempt_merge(struct request *rq);
 static inline void blk_zone_update_request_bio(struct request *rq,
 					       struct bio *bio)
 {
@@ -423,22 +430,60 @@ static inline void blk_zone_update_request_bio(struct request *rq,
 	 * For zone append requests, the request sector indicates the location
 	 * at which the BIO data was written. Return this value to the BIO
 	 * issuer through the BIO iter sector.
+	 * For plugged zone writes, we need the original BIO sector so
+	 * that blk_zone_write_plug_bio_endio() can lookup the zone write plug.
 	 */
-	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+	if (req_op(rq) == REQ_OP_ZONE_APPEND || bio_zone_write_plugging(bio))
 		bio->bi_iter.bi_sector = rq->__sector;
 }
+void blk_zone_write_plug_bio_endio(struct bio *bio);
+static inline void blk_zone_bio_endio(struct bio *bio)
+{
+	/*
+	 * For write BIOs to zoned devices, signal the completion of the BIO so
+	 * that the next write BIO can be submitted by zone write plugging.
+	 */
+	if (bio_zone_write_plugging(bio))
+		blk_zone_write_plug_bio_endio(bio);
+}
+
+void blk_zone_write_plug_complete_request(struct request *rq);
+static inline void blk_zone_complete_request(struct request *rq)
+{
+	if (rq->rq_flags & RQF_ZONE_WRITE_PLUGGING)
+		blk_zone_write_plug_complete_request(rq);
+}
 int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
 		unsigned long arg);
 int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 		unsigned int cmd, unsigned long arg);
 #else /* CONFIG_BLK_DEV_ZONED */
-static inline void disk_free_zone_bitmaps(struct gendisk *disk)
+static inline void disk_init_zone_resources(struct gendisk *disk)
+{
+}
+static inline void disk_free_zone_resources(struct gendisk *disk)
+{
+}
+static inline bool bio_zone_write_plugging(struct bio *bio)
+{
+	return false;
+}
+static inline void blk_zone_write_plug_bio_merged(struct bio *bio)
+{
+}
+static inline void blk_zone_write_plug_attempt_merge(struct request *rq)
 {
 }
 static inline void blk_zone_update_request_bio(struct request *rq,
 					       struct bio *bio)
 {
 }
+static inline void blk_zone_bio_endio(struct bio *bio)
+{
+}
+static inline void blk_zone_complete_request(struct request *rq)
+{
+}
 static inline int blkdev_report_zones_ioctl(struct block_device *bdev,
 		unsigned int cmd, unsigned long arg)
 {
......
@@ -1182,7 +1182,7 @@ static void disk_release(struct device *dev)

 	disk_release_events(disk);
 	kfree(disk->random);
-	disk_free_zone_bitmaps(disk);
+	disk_free_zone_resources(disk);
 	xa_destroy(&disk->part_tbl);

 	disk->queue->disk = NULL;
@@ -1364,6 +1364,7 @@ struct gendisk *__alloc_disk_node(struct request_queue *q, int node_id,
 	if (blkcg_init_disk(disk))
 		goto out_erase_part0;

+	disk_init_zone_resources(disk);
 	rand_initialize_disk(disk);
 	disk_to_dev(disk)->class = &block_class;
 	disk_to_dev(disk)->type = &disk_type;
......
@@ -56,6 +56,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD ((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED ((__force req_flags_t)(1 << 19))
+/* The request completion needs to be signaled to zone write pluging. */
+#define RQF_ZONE_WRITE_PLUGGING ((__force req_flags_t)(1 << 20))
 /* ->timeout has been called, don't expire again */
 #define RQF_TIMED_OUT ((__force req_flags_t)(1 << 21))
 #define RQF_RESV ((__force req_flags_t)(1 << 23))
......
@@ -234,7 +234,12 @@ struct bio {

 	struct bvec_iter	bi_iter;

-	blk_qc_t		bi_cookie;
+	union {
+		/* for polled bios: */
+		blk_qc_t		bi_cookie;
+		/* for plugged zoned writes only: */
+		unsigned int		__bi_nr_segments;
+	};
 	bio_end_io_t		*bi_end_io;
 	void			*bi_private;
 #ifdef CONFIG_BLK_CGROUP
@@ -305,6 +310,7 @@ enum {
 	BIO_QOS_MERGED,		/* but went through rq_qos merge path */
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
+	BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
 	BIO_FLAG_LAST
 };
......
@@ -194,6 +194,12 @@ struct gendisk {
 	unsigned int		zone_capacity;
 	unsigned long		*conv_zones_bitmap;
 	unsigned long		*seq_zones_wlock;
+	unsigned int		zone_wplugs_hash_bits;
+	spinlock_t		zone_wplugs_lock;
+	struct mempool_s	*zone_wplugs_pool;
+	struct hlist_head	*zone_wplugs_hash;
+	struct list_head	zone_wplugs_err_list;
+	struct work_struct	zone_wplugs_work;
 #endif /* CONFIG_BLK_DEV_ZONED */

 #if IS_ENABLED(CONFIG_CDROM)
@@ -663,6 +669,7 @@ static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 	return bdev->bd_disk->queue->limits.max_active_zones;
 }

+bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline unsigned int bdev_nr_zones(struct block_device *bdev)
 {
@@ -690,6 +697,10 @@ static inline unsigned int bdev_max_active_zones(struct block_device *bdev)
 {
 	return 0;
 }
+static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
+{
+	return false;
+}
 #endif /* CONFIG_BLK_DEV_ZONED */

 static inline unsigned int blk_queue_depth(struct request_queue *q)
......