    block: Introduce zone write plugging · dd291d77
    Damien Le Moal authored
    Zone write plugging implements a per-zone "plug" for write operations
    to control the submission and execution order of write operations to
    sequential write required zones of a zoned block device. Per-zone
    plugging guarantees that at any time, at most one write request per
    zone is being executed. This mechanism is intended to replace
    zone write locking which implements a similar per-zone write throttling
    at the scheduler level, but is implemented only by mq-deadline.
    
    Unlike zone write locking which operates on requests, zone write
    plugging operates on BIOs. A zone write plug is simply a BIO list that
    is atomically manipulated using a spinlock and a kblockd submission
    work. A write BIO to a zone is "plugged" to delay its execution if a
    write BIO for the same zone was already issued, that is, if a write
    request for the same zone is being executed. The next plugged BIO is
    unplugged and issued once the write request completes.
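
The plug and unplug behavior described above can be sketched in user space
as follows. This is only an illustration, not the kernel code: the
sketch_* names and types are hypothetical, and the spinlock and kblockd
work that the real implementation relies on are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the kernel types. */
struct sketch_bio {
	int id;
	struct sketch_bio *next;
};

struct sketch_zone_plug {
	bool write_in_flight;     /* a write for this zone is executing */
	struct sketch_bio *head;  /* FIFO of plugged (delayed) BIOs */
	struct sketch_bio *tail;
};

/*
 * Returns true if the BIO was plugged (its execution is delayed) and
 * false if the caller may issue it right away. The real code would hold
 * the plug spinlock around this.
 */
static bool sketch_plug_bio(struct sketch_zone_plug *zp,
			    struct sketch_bio *bio)
{
	if (!zp->write_in_flight) {
		zp->write_in_flight = true;
		return false;
	}

	bio->next = NULL;
	if (zp->tail)
		zp->tail->next = bio;
	else
		zp->head = bio;
	zp->tail = bio;
	return true;
}

/*
 * Called when the in-flight write completes. Returns the next BIO to
 * issue, or NULL if nothing is plugged and the zone becomes idle.
 */
static struct sketch_bio *sketch_write_done(struct sketch_zone_plug *zp)
{
	struct sketch_bio *next = zp->head;

	assert(zp->write_in_flight);
	if (!next) {
		zp->write_in_flight = false;
		return NULL;
	}

	zp->head = next->next;
	if (!zp->head)
		zp->tail = NULL;
	next->next = NULL;
	return next; /* write_in_flight stays set for this BIO */
}
```

With this model, the first write to an idle zone is issued immediately,
later writes are queued in order, and each completion hands back exactly
one queued BIO.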
    
    This mechanism has the following benefits:
     - Zone write ordering is untangled from block IO schedulers. This allows
       removing the restriction on using mq-deadline for writing to zoned
       block devices. Any block IO scheduler, including "none", can be used.
     - Zone write plugging operates on BIOs instead of requests. Plugged
       BIOs waiting for execution thus do not hold scheduling tags and thus
       are not preventing other BIOs from executing (reads or writes to
       other zones). Depending on the workload, this can significantly
       improve the device use (higher queue depth operation) and
       performance.
     - Both blk-mq (request based) zoned devices and BIO-based zoned devices
       (e.g. device mapper) can use zone write plugging. It is mandatory
       for the former but optional for the latter. BIO-based drivers can
       use zone write plugging to implement write ordering guarantees, or
       the drivers can implement their own if needed.
     - The code is less invasive in the block layer and is mostly limited to
       blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
       bio.c.
    
    Zone write plugging is implemented using struct blk_zone_wplug. This
    structure includes a spinlock, a BIO list and a work structure to
    handle the submission of plugged BIOs. Zone write plug structures are
    managed using a per-disk hash table.
    
    Plugging of zone write BIOs is done using the function
    blk_zone_write_plug_bio() which returns false if a BIO execution does
    not need to be delayed and true otherwise. This function is called
    from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
    spanning multiple zones which would cause mishandling of zone write
    plugs. This change enables zone write plugging by default for any mq
    request-based block device. BIO-based device drivers can also use zone
    write plugging by explicitly calling blk_zone_write_plug_bio() in their
    ->submit_bio method. For such devices, the driver must ensure that a
    BIO passed to blk_zone_write_plug_bio() is already split and not
    straddling zone boundaries.
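
For a BIO-based driver, the "already split and not straddling zone
boundaries" precondition could be checked along these lines. This is a
minimal sketch: sketch_bio_straddles_zone() is a hypothetical helper, not
a block layer function, and it assumes all zones have the same size in
sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if the sector range [start_sector, start_sector + nr_sectors)
 * touches more than one zone. Assumes a constant zone size (hypothetical
 * parameter zone_sectors); plain division is used so the zone size does
 * not have to be a power of two.
 */
static bool sketch_bio_straddles_zone(uint64_t start_sector,
				      uint32_t nr_sectors,
				      uint64_t zone_sectors)
{
	uint64_t last_sector;

	assert(zone_sectors > 0 && nr_sectors > 0);
	last_sector = start_sector + nr_sectors - 1;

	return start_sector / zone_sectors != last_sector / zone_sectors;
}
```

A driver following this sketch would split any BIO for which the helper
returns true before handing the fragments to the plugging code.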
    
    Only write and write zeroes BIOs are plugged. Zone write plugging does
    not introduce any significant overhead for other operations. A BIO that
    is being handled through zone write plugging is flagged using the new
    BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
    this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
    The completion of flagged BIOs and requests triggers calls to the
    functions blk_zone_write_bio_endio() and
    blk_zone_write_complete_request(), respectively. The latter function
    is used to
    trigger submission of the next plugged BIO using the zone plug work.
    blk_zone_write_bio_endio() does the same for BIO-based devices.
    This ensures that at any time, at most one request (blk-mq devices) or
    one BIO (BIO-based devices) is being executed for any zone. The
    handling of zone write plugs using a per-zone plug spinlock maximizes
    parallelism and device usage by allowing multiple zones to be written
    simultaneously without lock contention.
    
    Zone write plugging ignores flush BIOs without data. However, any flush
    BIO that has data is always plugged so that the write part of the flush
    sequence is serialized with other regular writes.
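
The rules for which BIOs go through zone write plugging can be summarized
as a small predicate. The sketch below is an assumption-laden model: the
enum, the struct and the preflush field are hypothetical stand-ins for the
real BIO operation and flags, and the check that the target zone is a
sequential write required zone is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified description of a BIO. */
enum sketch_op {
	SKETCH_OP_READ,
	SKETCH_OP_WRITE,
	SKETCH_OP_WRITE_ZEROES,
	SKETCH_OP_ZONE_RESET,
};

struct sketch_bio_desc {
	enum sketch_op op;
	bool preflush;        /* BIO carries a flush */
	uint32_t nr_sectors;  /* 0 for a flush without data */
};

/* Returns true if the BIO must be handled through zone write plugging. */
static bool sketch_needs_zone_plugging(const struct sketch_bio_desc *bio)
{
	/* Only write and write zeroes BIOs are plugged. */
	if (bio->op != SKETCH_OP_WRITE && bio->op != SKETCH_OP_WRITE_ZEROES)
		return false;

	/* Flush BIOs without data are ignored. */
	if (bio->preflush && bio->nr_sectors == 0)
		return false;

	/* Regular writes and flushes with data are serialized per zone. */
	return true;
}
```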
    
    Given that any BIO handled through zone write plugging will be the only
    BIO in flight for the target zone when it is executed, the unplugging
    and submission of a BIO will have no chance of successfully merging with
    plugged requests or requests in the scheduler. To overcome this
    potential performance degradation, blk_mq_submit_bio() calls the
    function blk_zone_write_plug_attempt_merge() to try to merge other
    plugged BIOs with the one just unplugged and submitted. Successful
    merging is signaled using blk_zone_write_plug_bio_merged(), called from
    bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
    of segments of plugged BIOs to attempt merging, the number of segments
    of a plugged BIO is saved using the new struct bio field
    __bi_nr_segments. To avoid growing the size of struct bio, this field is
    added as a union with the bio_cookie field. This is safe to do as
    polling is always disabled for plugged BIOs.
    
    When BIOs are plugged in a zone write plug, the device request queue
    usage counter is always incremented. This reference is kept and reused
    for blk-mq devices when the plugged BIO is unplugged and submitted
    again using submit_bio_noacct_nocheck(). For this case, the unplugged
    BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
    blk_mq_submit_bio() proceeds directly to allocating a new request for
    the BIO, re-using the usage reference count taken when the BIO was
    plugged. This extra reference count is dropped in
    blk_zone_write_plug_attempt_merge() for any plugged BIO that is
    successfully merged. Given that BIO-based devices will not take this
    path, the extra reference is dropped after a plugged BIO is unplugged
    and submitted.
    
    Zone write plugs are dynamically allocated and managed using a hash
    table (an array of struct hlist_head) with RCU protection.
    A zone write plug is allocated when a write BIO is received for the
    zone and not freed until the zone is fully written, reset or finished.
    To detect when a zone write plug can be freed, the write state of each
    zone is tracked using a write pointer offset which corresponds to the
    offset of a zone write pointer relative to the zone start. Write
    operations always increment this write pointer offset. Zone reset
    operations set it to 0 and zone finish operations set it to the zone
    size.
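
The write pointer offset tracking described above can be modeled as
follows. This is a sketch only: struct sketch_wplug and its helpers are
hypothetical, and the busy field is an assumption standing in for "a
write to this zone is still plugged or in flight", which the real code
tracks differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-zone write state. */
struct sketch_wplug {
	uint32_t wp_offset;     /* write pointer offset from zone start */
	uint32_t zone_sectors;  /* zone size in sectors */
	bool busy;              /* assumption: writes plugged or in flight */
};

/* Write operations always advance the write pointer offset. */
static void sketch_wplug_write(struct sketch_wplug *zw, uint32_t nr_sectors)
{
	assert(nr_sectors <= zw->zone_sectors - zw->wp_offset);
	zw->wp_offset += nr_sectors;
}

/* Zone reset sets the offset to 0. */
static void sketch_wplug_reset(struct sketch_wplug *zw)
{
	zw->wp_offset = 0;
}

/* Zone finish sets the offset to the zone size. */
static void sketch_wplug_finish(struct sketch_wplug *zw)
{
	zw->wp_offset = zw->zone_sectors;
}

/*
 * The plug is no longer needed once the zone is empty (reset) or full
 * (fully written or finished) and no write is pending.
 */
static bool sketch_wplug_can_free(const struct sketch_wplug *zw)
{
	if (zw->busy)
		return false;
	return zw->wp_offset == 0 || zw->wp_offset == zw->zone_sectors;
}
```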
    
    If a write error happens, the wp_offset value of a zone write plug may
    become incorrect and out of sync with the device managed write pointer.
    This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
    The function blk_zone_wplug_handle_error() is called from the new disk
    zone write plug work when this flag is set. This function executes a
    report zone to update the zone write pointer offset to the current
    value as indicated by the device. The disk zone write plug work is
    scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
    with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
    write. Once scheduled, the disk zone write plug work keeps running
    until all zone errors are handled.
    
    To match the new data structures used for zoned disks, the function
    disk_free_zone_bitmaps() is renamed to the more generic
    disk_free_zone_resources(). The function disk_init_zone_resources() is
    also introduced to initialize zone write plug resources when a gendisk
    is allocated.
    
    In order to guarantee that the user can simultaneously write up to a
    number of zones equal to a device max active zone limit or max open zone
    limit, zone write plugs are allocated using a mempool sized to the
    maximum of these 2 device limits. For a device that does not have
    active and open zone limits, 128 is used as the default mempool size.
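
The mempool sizing rule above amounts to the following calculation. The
function and macro names are hypothetical, and the sketch assumes that a
limit of 0 means the device reports no limit.

```c
#include <assert.h>

/* Default pool size when the device has no active or open zone limit. */
#define SKETCH_WPLUG_DEFAULT_POOL_SIZE	128u

/*
 * Returns the number of zone write plugs to reserve: the larger of the
 * two device limits, or the default if neither limit is set (both 0).
 */
static unsigned int sketch_wplug_pool_size(unsigned int max_active_zones,
					   unsigned int max_open_zones)
{
	unsigned int size = max_active_zones > max_open_zones ?
			    max_active_zones : max_open_zones;

	return size ? size : SKETCH_WPLUG_DEFAULT_POOL_SIZE;
}
```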
    
    If a change to the device active and open zone limits is detected, the
    disk mempool is resized when blk_revalidate_disk_zones() is executed.
    
    This commit contains contributions from Christoph Hellwig <hch@lst.de>.
    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Tested-by: Hans Holmberg <hans.holmberg@wdc.com>
    Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com>
    Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20240408014128.205141-8-dlemoal@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>