    netfs: New writeback implementation · 288ace2f
    David Howells authored
    The current netfslib writeback implementation creates writeback requests of
    contiguous folio data and then separately tiles subrequests over the space
    twice, once for the server and once for the cache.  This creates a few
    issues:
    
     (1) Every time there's a discontiguity or a change between writing to only
         one destination or writing to both, it must create a new request.
         This makes it harder to do vectored writes.
    
     (2) The folios don't have the writeback mark removed until the end of the
         request - and a request could be hundreds of megabytes.
    
     (3) In future, I want to support a larger cache granularity, which will
         require aggregation of some folios that contain unmodified data (which
         only need to go to the cache) and some which contain modifications
         (which need to be uploaded and stored to the cache) - but, currently,
         these are treated as discontiguous.
    
    There's also a move to get everyone to use writeback_iter() to extract
    writable folios from the pagecache.  That said, currently writeback_iter()
    has some issues that make it less than ideal:
    
     (1) there's no way to cancel the iteration, even if you find a "temporary"
         error that means the current folio and all subsequent folios are going
         to fail;
    
     (2) there's no way to filter the folios being written back - something
         that will impact Ceph with its ordered snap system;
    
     (3) and if you get a folio you can't immediately deal with (say you need
         to flush the preceding writes), you are left with a folio hanging in
         the locked state for the duration, when really we should unlock it and
         relock it later.
    
    In this new implementation, I use writeback_iter() to pump folios,
    progressively creating two parallel, but separate streams and cleaning up
    the finished folios as the subrequests complete.  Either or both streams
    can contain gaps, and the subrequests in each stream can be of variable
    size, don't need to align with each other and don't need to align with the
    folios.
    
    Indeed, subrequests can cross folio boundaries and may cover several
    folios, or a folio may be spanned by multiple subrequests, e.g.:
    
             +---+---+-----+-----+---+----------+
    Folios:  |   |   |     |     |   |          |
             +---+---+-----+-----+---+----------+
    
               +------+------+     +----+----+
    Upload:    |      |      |.....|    |    |
               +------+------+     +----+----+
    
             +------+------+------+------+------+
    Cache:   |      |      |      |      |      |
             +------+------+------+------+------+
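    
    As a rough illustration of the tiling shown above, the following is a
    minimal userspace sketch, not the netfslib code.  The names struct span,
    tile_stream() and count_overlaps() are hypothetical, and the sketch
    assumes each stream simply caps subrequests at some maximum length.  It
    shows that subrequest spans are derived purely from file positions, so
    they need not line up with folio boundaries:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration type; not a netfslib structure. */
struct span {
	unsigned long long start;	/* File position of the first byte */
	unsigned long long len;		/* Number of bytes covered */
};

/*
 * Tile [start, start + len) with subrequests of at most max_len bytes,
 * ignoring folio boundaries entirely.  Returns the number of spans
 * written to out[], which is never more than max_out.
 */
static size_t tile_stream(unsigned long long start, unsigned long long len,
			  unsigned long long max_len,
			  struct span *out, size_t max_out)
{
	size_t n = 0;

	assert(max_len > 0);
	while (len > 0 && n < max_out) {
		unsigned long long part = len < max_len ? len : max_len;

		out[n].start = start;
		out[n].len = part;
		n++;
		start += part;
		len -= part;
	}
	return n;
}

/* Count how many of the n subrequest spans overlap the given folio. */
static size_t count_overlaps(const struct span *folio,
			     const struct span *subreqs, size_t n)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++) {
		if (subreqs[i].start < folio->start + folio->len &&
		    folio->start < subreqs[i].start + subreqs[i].len)
			hits++;
	}
	return hits;
}
```

    With a 6144-byte cap, a 4096-byte folio at position 4096 is overlapped
    by two subrequests, whereas a 16384-byte folio at position 0 is
    overlapped by three.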
    
    The progressive subrequest construction permits the algorithm to be
    preparing both the next upload to the server and the next write to the
    cache whilst the previous ones are already in progress.  Throttling can be
    applied to control the rate of production of subrequests - and, in any
    case, we probably want to write them to the server in ascending order,
    particularly if the file will be extended.
    
    Content crypto can also be prepared at the same time as the subrequests and
    run asynchronously, with the prepped requests being stalled until the
    crypto catches up with them.  This might also be useful for transport
    crypto, but that happens at a lower layer, so probably would be harder to
    pull off.
    
    The algorithm is split into three parts:
    
     (1) The issuer.  This walks through the data, packaging it up, encrypting
         it and creating subrequests.  The part of this that generates
         subrequests only deals with file positions and spans and so is usable
         for DIO/unbuffered writes as well as buffered writes.
    
     (2) The collector.  This asynchronously collects completed subrequests,
         unlocks folios, frees crypto buffers and performs any retries.  This
         runs in a work queue so that the issuer can return to the caller for
         writeback (so that the VM can have its kswapd thread back) or async
         writes.
    
     (3) The retryer.  This pauses the issuer, waits for all outstanding
         subrequests to complete and then goes through the failed subrequests
         to reissue them.  This may involve reprepping them (with cifs, the
         credits must be renegotiated, and a subrequest may need splitting),
         and doing RMW for content crypto if there's a conflicting change on
         the server.
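    
    The collector's folio cleanup can be pictured with the minimal sketch
    below.  It rests on an assumption that is not spelled out above: that
    each stream tracks how far its subrequests have completed contiguously,
    and that a folio can be cleaned once every active stream has passed its
    end.  struct stream, collected_front() and folios_to_clean() are
    hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-stream state (upload or cache); not a netfslib type. */
struct stream {
	bool active;			/* Stream is in use for this request */
	unsigned long long collected_to;/* Completed contiguously up to here */
};

/*
 * Work out the position up to which every active stream has completed.
 * If no stream is active, everything issued so far counts as complete.
 */
static unsigned long long collected_front(const struct stream *streams,
					  size_t nr_streams,
					  unsigned long long issued_to)
{
	unsigned long long front = issued_to;
	size_t i;

	for (i = 0; i < nr_streams; i++) {
		if (streams[i].active && streams[i].collected_to < front)
			front = streams[i].collected_to;
	}
	return front;
}

/*
 * Advance the count of cleaned folios.  folio_end[] holds the end
 * position of each folio in ascending order; a folio is cleaned only
 * when the front has reached or passed its end.
 */
static size_t folios_to_clean(const unsigned long long *folio_end,
			      size_t nr_folios, size_t cleaned,
			      unsigned long long front)
{
	assert(cleaned <= nr_folios);
	while (cleaned < nr_folios && folio_end[cleaned] <= front)
		cleaned++;
	return cleaned;
}
```

    In this sketch a slow cache write holds back cleanup of a folio even if
    the upload covering it has finished, and folios are released one at a
    time as the front advances rather than all at the end of the request.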
    
    [!] Note that some of the functions are prefixed with "new_" to avoid
    clashes with existing functions.  These will be renamed in a later patch
    that cuts over to the new algorithm.
    
    Signed-off-by: David Howells <dhowells@redhat.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
    cc: Eric Van Hensbergen <ericvh@kernel.org>
    cc: Latchesar Ionkov <lucho@ionkov.net>
    cc: Dominique Martinet <asmadeus@codewreck.org>
    cc: Christian Schoenebeck <linux_oss@crudebyte.com>
    cc: Marc Dionne <marc.dionne@auristor.com>
    cc: v9fs@lists.linux.dev
    cc: linux-afs@lists.infradead.org
    cc: netfs@lists.linux.dev
    cc: linux-fsdevel@vger.kernel.org
write_issue.c 19.4 KB