    dm: ensure bio submission follows a depth-first tree walk · 18a25da8
    NeilBrown authored
    A dm device can, in general, represent a tree of targets, each of which
    handles a sub-range of the range of blocks handled by the parent.
    
    The bio sequencing managed by generic_make_request() requires that bios
    are generated and handled in a depth-first manner.  Each call to a
    make_request_fn() may submit bios to a single member device, and may
    also resubmit bios covering a reduced region of the device that the
    make_request_fn itself serves.
    
    In particular, any bios submitted to member devices must be expected to
    be processed in order, so a later one must never wait for an earlier
    one.
    
    This ordering is usually achieved by using bio_split() to reduce a bio
    to a size that can be completely handled by one target, and resubmitting
    the remainder to the originating device. bio_queue_split() shows the
    canonical approach.
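    
    The split-and-resubmit ordering described above can be sketched in
    userspace.  This is only a toy model, not kernel code: struct piece,
    walk_in_order() and the fixed target_size are hypothetical names, and
    every target is assumed to cover the same number of sectors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of a bio handled by one target. */
struct piece {
    unsigned long sector;
    unsigned long nr_sectors;
};

/*
 * Toy model of "handle what fits in one target, resubmit the rest".
 * Records the pieces in the order they would be handled and returns
 * how many were written to `out` (at most `max_out`).
 */
static size_t walk_in_order(unsigned long sector, unsigned long nr_sectors,
                            unsigned long target_size,
                            struct piece *out, size_t max_out)
{
    size_t n = 0;

    assert(target_size > 0);

    /* `pending` models the single remainder waiting on the submit queue. */
    struct piece pending = { sector, nr_sectors };

    while (pending.nr_sectors > 0 && n < max_out) {
        /* Sectors left before the end of the target holding pending.sector. */
        unsigned long room = target_size - (pending.sector % target_size);
        unsigned long take = pending.nr_sectors < room ? pending.nr_sectors : room;

        /* The front part is handled completely by one target... */
        out[n].sector = pending.sector;
        out[n].nr_sectors = take;
        n++;

        /* ...and only then is the remainder looked at again. */
        pending.sector += take;
        pending.nr_sectors -= take;
    }
    return n;
}
```

    In this model a 20-sector request starting at sector 6, with 8-sector
    targets, is handled as four pieces in strictly increasing order, and
    no piece is started before the one in front of it has been handed
    off.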
    
    dm doesn't follow this approach, largely because it has needed to split
    bios since long before bio_split() was available.  It currently can
    submit bios to separate targets within the one dm_make_request() call.
    Dependencies between these targets, as can happen with dm-snap, can
    cause deadlocks if either bio gets stuck behind the other in the queues
    managed by generic_make_request().  This requires the 'rescue'
    functionality provided by dm_offload_{start,end}.
    
    Some of this requirement can be removed by changing the order of bio
    submission to follow the canonical approach.  That is, if dm finds that
    it needs to split a bio, the remainder should be sent to
    generic_make_request() rather than being handled immediately.  This
    delays the handling until the first part is completely processed, so the
    deadlock problems do not occur.
    
    __split_and_process_bio() can be called both from dm_make_request() and
    from dm_wq_work().  When called from dm_wq_work() the current approach
    is perfectly satisfactory as each bio will be processed immediately.
    When called from dm_make_request(), current->bio_list will be non-NULL,
    and in this case it is best to create a separate "clone" bio for the
    remainder.
    
    When we use bio_clone_bioset() to split off the front part of a bio
    and chain the two together and submit the remainder to
    generic_make_request(), it is important that the newly allocated
    bio is used as the head to be processed immediately, and the original
    bio gets "bio_advance()"d and sent to generic_make_request() as the
    remainder.  Otherwise, if the newly allocated bio is used as the
    remainder, and if it then needs to be split again, then the next
    bio_clone_bioset() call will be made while holding a reference to a bio
    (result of the first clone) from the same bioset.  This can potentially
    exhaust the bioset mempool and result in a memory allocation deadlock.
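    
    The difference between the two roles can be illustrated with a toy
    userspace model.  It is a sketch under stated assumptions, not the dm
    implementation: struct toy_bio, enum clone_role and
    count_nested_allocs() are hypothetical, and a "nested" allocation
    here means a clone taken while the bio being split is itself a clone
    from the same pool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct bio: a sector range plus its origin. */
struct toy_bio {
    unsigned long sector;
    unsigned long nr_sectors;
    bool from_pool;   /* true if it came from the shared split pool */
};

enum clone_role { CLONE_IS_HEAD, CLONE_IS_REMAINDER };

/*
 * Split `bio` into pieces of at most `max` sectors and count the clone
 * allocations made while the bio being split is itself a pool object.
 * Those nested allocations are the ones that could exhaust a mempool.
 */
static int count_nested_allocs(struct toy_bio bio, unsigned long max,
                               enum clone_role role)
{
    int nested = 0;
    struct toy_bio cur = bio;   /* the bio that still needs splitting */

    assert(max > 0);

    while (cur.nr_sectors > max) {
        /* A clone is allocated here (models bio_clone_bioset()). */
        if (cur.from_pool)
            nested++;

        /* The front `max` sectors are handled now; the rest is requeued. */
        cur.sector += max;
        cur.nr_sectors -= max;

        /*
         * CLONE_IS_HEAD: the clone took the front part, so the requeued
         * remainder is still the original bio and from_pool is unchanged.
         * CLONE_IS_REMAINDER: the requeued remainder is the clone.
         */
        if (role == CLONE_IS_REMAINDER)
            cur.from_pool = true;
    }
    return nested;
}
```

    For a 40-sector bio and an 8-sector limit, this model reports no
    nested allocations when the clone is the head, and three when the
    clone is the remainder.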
    
    Note that there is no race caused by reassigning cio.io->bio after already
    calling __map_bio().  This bio will only be dereferenced again after
    dec_pending() has found io->io_count to be zero, and this cannot happen
    before the dec_pending() call at the end of __split_and_process_bio().
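    
    The reference-counting argument above can be modelled in userspace
    as follows.  This is a simplified sketch: struct toy_io and the
    toy_* helpers are hypothetical, and it assumes the submitter holds
    one reference on the io until its final dec_pending() call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dm io structure. */
struct toy_io {
    atomic_int io_count;   /* models io->io_count */
    int bio_id;            /* models io->bio, read only on final completion */
    int completed_bio_id;  /* bio_id seen at final completion, else -1 */
};

static void toy_io_init(struct toy_io *io, int bio_id)
{
    atomic_init(&io->io_count, 1);   /* assumed reference held by the submitter */
    io->bio_id = bio_id;
    io->completed_bio_id = -1;
}

/* Models taking a reference for each clone handed to a target. */
static void toy_io_get(struct toy_io *io)
{
    atomic_fetch_add(&io->io_count, 1);
}

/*
 * Models dec_pending(): io->bio is dereferenced only by the caller that
 * drops the count to zero.  Returns true for that final drop.
 */
static bool toy_dec_pending(struct toy_io *io)
{
    if (atomic_fetch_sub(&io->io_count, 1) == 1) {
        io->completed_bio_id = io->bio_id;
        return true;
    }
    return false;
}
```

    In this model a clone that completes early cannot trigger the final
    completion, so reassigning bio_id before the submitter's own
    toy_dec_pending() call is always observed by whoever reads it.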
    
    To provide the clone bio when splitting, we use q->bio_split.  This
    was previously being freed by bio-based dm to avoid having excess
    rescuer threads.  As bio_split bio sets no longer create rescuer
    threads, there is little cost and much gain from restoring the
    q->bio_split bio set.
    
    Signed-off-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>