    [PATCH] readv/writev speedup · a83638a4
    Andrew Morton authored
    This is Janet Morgan's patch which converts the readv/writev code
    to submit all segments for IO before waiting on them, rather than
    submitting each segment separately.
    
    This is a critical performance fix for O_DIRECT reads and writes.
    Prior to this change, O_DIRECT vectored IO was forced to wait for
    completion against each segment of the iovec rather than submitting all
    segments and waiting on the lot.  For an iovec of ten segments, this
    code can be up to ten times faster.
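
    For readers unfamiliar with the interface being sped up, here is a
    minimal userspace sketch of vectored IO: several buffers handed to the
    kernel in a single writev() or readv() call.  The function name
    vectored_roundtrip() is made up for illustration, and a pipe stands in
    for a file so the example runs anywhere; it therefore does not exercise
    the O_DIRECT path this patch changes.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Illustrative only: send three buffers with one writev() call and gather
 * the bytes back into two differently sized buffers with one readv() call.
 * A pipe is used instead of a file (an assumption made for portability).
 * Returns 0 on success, -1 on any failure.
 */
int vectored_roundtrip(void)
{
    char a[] = "abc", b[] = "defg", c[] = "hi";
    struct iovec out[3] = {
        { .iov_base = a, .iov_len = 3 },
        { .iov_base = b, .iov_len = 4 },
        { .iov_base = c, .iov_len = 2 },
    };
    char first[5], second[4];
    struct iovec in[2] = {
        { .iov_base = first,  .iov_len = sizeof(first)  },
        { .iov_base = second, .iov_len = sizeof(second) },
    };
    int fds[2];
    ssize_t written, got = -1;

    /* The two receive buffers together must hold all nine bytes. */
    assert(sizeof(first) + sizeof(second) == 9);

    if (pipe(fds) != 0)
        return -1;

    /* One syscall submits all three segments. */
    written = writev(fds[1], out, 3);

    /* Only read if the write succeeded, so we never block on an empty pipe. */
    if (written == 9)
        got = readv(fds[0], in, 2);

    close(fds[0]);
    close(fds[1]);

    if (written != 9 || got != 9)
        return -1;

    /* The byte stream is split on the receiver's boundaries, not the sender's. */
    if (memcmp(first, "abcde", 5) != 0 || memcmp(second, "fghi", 4) != 0)
        return -1;
    return 0;
}
```

    The segment boundaries chosen by the writer and the reader are
    independent; only the total byte stream is preserved.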
    
    There will also be moderate improvements for buffered IO - smaller code
    paths, plus writev() only takes i_sem once.
    
    The patch ended up quite large unfortunately - turned out that the only
    sane way to implement this without duplicating significant amounts of
    code (the generic_file_write() bounds checking, all the O_DIRECT
    handling, etc) was to redo generic_file_read() and generic_file_write()
    to take an iovec/nr_segs pair rather than `buf, count'.
    
    New exported functions generic_file_readv() and generic_file_writev()
    have been added:
    
    ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
                              unsigned long nr_segs, loff_t *ppos);
    ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
                              unsigned long nr_segs, loff_t * ppos);
    
    If a driver does not use these in its file_operations then it will
    continue to use the old readv/writev code, which sits in a loop
    calling fops->read() or fops->write().
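
    The per-segment fallback described above can be modelled in userspace
    roughly as follows.  This is a sketch, not the kernel code: the names
    seg_fn, loop_per_segment(), struct mem_src and mem_read() are
    hypothetical, and the error and short-transfer handling is an
    assumption about how such a loop would reasonably behave.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a driver's single-buffer read or write method. */
typedef ssize_t (*seg_fn)(void *ctx, void *buf, size_t len);

/*
 * Call fn once per iovec segment and add up the bytes transferred.
 * Stops early on an error or a short transfer.  An error is reported
 * only if nothing was transferred before it occurred.
 */
ssize_t loop_per_segment(seg_fn fn, void *ctx,
                         const struct iovec *iov, unsigned long nr_segs)
{
    ssize_t total = 0;
    unsigned long seg;

    assert(fn != NULL);

    for (seg = 0; seg < nr_segs; seg++) {
        ssize_t n = fn(ctx, iov[seg].iov_base, iov[seg].iov_len);

        if (n < 0)
            return total ? total : n;
        total += n;
        if ((size_t)n != iov[seg].iov_len)
            break;              /* short transfer: stop here */
    }
    return total;
}

/* Toy in-memory data source that counts how often it is called. */
struct mem_src {
    const char *data;
    size_t len;
    size_t pos;
    int calls;
};

/* Copy up to len bytes from the source into buf and advance the position. */
ssize_t mem_read(void *ctx, void *buf, size_t len)
{
    struct mem_src *src = ctx;
    size_t left = src->len - src->pos;
    size_t n = len < left ? len : left;

    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    src->calls++;
    return (ssize_t)n;
}
```

    Each segment costs one full call into the driver method here, which is
    the overhead the new entry points avoid by receiving the whole iovec.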
    
    ext2, ext3, JFS and the blockdev driver are currently using this
    capability.
    
    Some coding cleanups were made in fs/read_write.c.  Mainly:
    
    - pass "READ" or "WRITE" around to indicate the direction of the
      operation, rather than the (confusing, inverted)
      VERIFY_READ/VERIFY_WRITE.
    
    - Use the identifier `nr_segs' everywhere to indicate the iovec
      length rather than `count', which is often used to indicate the
      number of bytes in the syscall.  It was confusing the heck out of me.
    
    - Some cleanups to the raw driver.
    
    - Some additional generality in fs/direct-io.c: the core `struct dio'
      used to be a "populate-and-go" thing.  Janet has broken that up so
      you can initialise a struct dio once, then loop around feeding it
      more file segments, then wait on completion against everything.
    
    - In a couple of places we needed to handle the situation where we
      knew, a priori, that the user was going to get a short read or write.
      File size limit exceeded, read past i_size, etc.  We handled that by
      shortening the iovec in-place with iov_shorten().  Which is not
      particularly pretty, but neither were the alternatives.
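
    To make the last point concrete, here is a userspace sketch of
    trimming an iovec in place to a byte limit.  The helper name
    shorten_iovec() and its exact semantics are assumptions made for
    illustration; it is not a copy of the kernel's iov_shorten().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Trim iov in place so that it describes at most `to` bytes in total.
 * The segment that crosses the limit has its length reduced; later
 * segments are left untouched but are excluded from the returned count.
 * Returns the number of segments the caller should now use.
 */
unsigned long shorten_iovec(struct iovec *iov, unsigned long nr_segs, size_t to)
{
    unsigned long seg = 0;
    size_t len = 0;     /* bytes covered by the segments accepted so far */

    assert(iov != NULL || nr_segs == 0);

    while (seg < nr_segs) {
        seg++;
        if (len + iov->iov_len >= to) {
            /* This segment reaches or crosses the limit: cut it and stop. */
            iov->iov_len = to - len;
            break;
        }
        len += iov->iov_len;
        iov++;
    }
    return seg;
}
```

    For three 4-byte segments and a limit of 6 bytes, this returns 2 and
    leaves segment lengths of 4 and 2.  The caller's array is modified,
    which is the "not particularly pretty" part.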