    [PATCH] direct IO updates
    Andrew Morton authored
    This patch is a performance and correctness update to the direct-IO
    code: O_DIRECT and the raw driver.  It mainly affects IO against
    blockdevs.
    
    The direct_io code was returning -EINVAL for a filesystem hole.  Change
    it to clear the userspace page instead.
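
    For illustration only (not part of the patch): a userspace check of the new
    behaviour might look like the sketch below.  The filename, buffer size and
    4096-byte alignment are arbitrary assumptions; the point is that an O_DIRECT
    read over a hole should now return zeroed data rather than failing with
    -EINVAL.

    /*
     * Hedged userspace sketch: read a hole in a sparse file with O_DIRECT.
     * With this patch the read is expected to succeed and the buffer to be
     * zero-filled; previously the direct-IO path returned -EINVAL.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            char *buf;
            ssize_t n;
            int fd, i;

            if (posix_memalign((void **)&buf, 4096, 4096))
                    return 1;
            memset(buf, 0xff, 4096);            /* poison, so zeroes are visible */

            fd = open("sparse.tmp", O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
            if (fd < 0)
                    return 1;
            if (ftruncate(fd, 1024 * 1024))     /* the whole file is a hole */
                    return 1;

            n = pread(fd, buf, 4096, 0);        /* direct read from the hole */
            printf("pread returned %zd\n", n);
            for (i = 0; i < (int)n; i++)
                    if (buf[i] != 0)
                            printf("unexpected non-zero byte at %d\n", i);

            close(fd);
            unlink("sparse.tmp");
            free(buf);
            return 0;
    }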
    
    There were a few restrictions and weirdnesses wrt blocksize and
    alignments.  The code has been reworked so we now lay out maximum-sized
    BIOs at any sector alignment.
    
    Because of this, the raw driver has been altered to set the blockdev's
    soft blocksize to the minimum possible at open() time.  Typically, 512
    bytes.  There are now no performance disadvantages to using small
    blocksizes, and this gives the finest possible alignment.
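
    Roughly, the open-time change amounts to the following sketch.  The function
    name is illustrative, and bdev_hardsect_size()/set_blocksize() are assumed to
    be the right helpers for querying and setting those sizes:

    #include <linux/fs.h>
    #include <linux/blkdev.h>

    /*
     * Sketch only, not the patch itself: at raw open() time, drop the bound
     * blockdev's soft blocksize to the smallest value the hardware allows,
     * typically 512 bytes, so direct IO gets the finest possible alignment.
     */
    static int raw_set_min_blocksize(struct block_device *bdev)
    {
            /* helper names are assumptions about the block-layer API */
            return set_blocksize(bdev, bdev_hardsect_size(bdev));
    }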
    
    There is no API here for setting or querying the soft blocksize of the
    raw driver (there never was, really), which could conceivably be a
    problem.  If it is, we can permit BLKBSZSET and BLKBSZGET against the
    fd which /dev/raw/rawN returned, but that would require that
    blk_ioctl() be exported to modules again.
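
    If we did go that way, the forwarding itself would be trivial; something
    like the hypothetical handler below, where both the blk_ioctl() signature
    and the function name are assumptions:

    #include <linux/fs.h>
    #include <linux/blkdev.h>
    #include <linux/errno.h>

    /*
     * Hypothetical sketch: pass the soft-blocksize ioctls on the bound
     * /dev/raw/rawN fd straight through to the underlying blockdev.  This
     * presumes blk_ioctl() is exported again and takes (bdev, cmd, arg);
     * that signature is an assumption, not a statement about the tree.
     */
    static int raw_blkbsz_ioctl(struct block_device *bdev,
                                unsigned int command, unsigned long arg)
    {
            switch (command) {
            case BLKBSZGET:
            case BLKBSZSET:
                    return blk_ioctl(bdev, command, arg);
            default:
                    return -ENOTTY;
            }
    }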
    
    This code is wickedly quick.  Here's an oprofile of a single 500MHz
    PIII reading from four (old) scsi disks (two aic7xxx controllers) via
    the raw driver.  Aggregate throughput is 72 megabytes/second:
    
    c013363c 24       0.0896492   __set_page_dirty_buffers
    c021b8cc 24       0.0896492   ahc_linux_isr
    c012b5dc 25       0.0933846   kmem_cache_free
    c014d894 26       0.09712     dio_bio_complete
    c01cc78c 26       0.09712     number
    c0123bd4 40       0.149415    follow_page
    c01eed8c 46       0.171828    end_that_request_first
    c01ed410 49       0.183034    blk_recount_segments
    c01ed574 65       0.2428      blk_rq_map_sg
    c014db38 85       0.317508    do_direct_IO
    c021b090 90       0.336185    ahc_linux_run_device_queue
    c010bb78 236      0.881551    timer_interrupt
    c01052d8 25354    94.707      poll_idle
    
    A testament to the efficiency of the 2.5 block layer.
    
    And against four IDE disks on an HPT374 controller.  Throughput is 120
    megabytes/sec:
    
    c01eed8c 80       0.292462    end_that_request_first
    c01fe850 87       0.318052    hpt3xx_intrproc
    c01ed574 123      0.44966     blk_rq_map_sg
    c01f8f10 141      0.515464    ata_select
    c014db38 153      0.559333    do_direct_IO
    c010bb78 235      0.859107    timer_interrupt
    c01f9144 281      1.02727     ata_irq_enable
    c01ff990 290      1.06017     udma_pci_init
    c01fe878 308      1.12598     hpt3xx_maskproc
    c02006f8 379      1.38554     idedisk_do_request
    c02356a0 609      2.22637     pci_conf1_read
    c01ff8dc 611      2.23368     udma_pci_start
    c01ff950 922      3.37062     udma_pci_irq_status
    c01f8fac 1002     3.66308     ata_status
    c01ff26c 1059     3.87146     ata_start_dma
    c01feb70 1141     4.17124     hpt374_udma_stop
    c01f9228 3072     11.2305     ata_out_regfile
    c01052d8 15193    55.5422     poll_idle
    
    Not so good.
    
    One problem which has been identified with O_DIRECT is the cost of
    repeated calls into the mapping's get_block() callback.  Not a big
    problem with ext2 but other filesystems have more complex get_block
    implementations.
    
    So what I have done is to require that callers of generic_direct_IO()
    implement the new `get_blocks()' interface.  This is a small extension
    to get_block().  It gets passed another argument which indicates the
    maximum number of blocks which should be mapped, and it returns the
    number of blocks which it did map in bh_result->b_size.  This allows
    the fs to map up to 4G of disk (or of hole) in a single get_blocks()
    invocation.
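
    As a sketch of the shape of the interface (the prototype and the
    example_get_blocks() implementation below are illustrative, based on the
    description above, not the final API):

    #include <linux/fs.h>
    #include <linux/buffer_head.h>

    /*
     * Illustrative only.  Like get_block(), but max_blocks says how many
     * blocks the caller is prepared to accept, and the fs reports how much
     * it actually mapped via bh_result->b_size (in bytes, hence up to 4G
     * per call).
     */
    typedef int (get_blocks_t)(struct inode *inode, sector_t iblock,
                               unsigned long max_blocks,
                               struct buffer_head *bh_result, int create);

    static int example_get_blocks(struct inode *inode, sector_t iblock,
                                  unsigned long max_blocks,
                                  struct buffer_head *bh_result, int create)
    {
            /*
             * Trivial 1:1 mapping, as a blockdev effectively has: file block
             * N is disk block N, so any run of max_blocks blocks is linear.
             * A real filesystem would walk its metadata here and might map
             * fewer blocks, or none at all for a hole (leaving the bh
             * unmapped so the caller zeroes the user page).
             */
            bh_result->b_bdev = inode->i_sb->s_bdev;
            bh_result->b_blocknr = iblock;
            bh_result->b_size = max_blocks << inode->i_blkbits;
            set_buffer_mapped(bh_result);
            return 0;
    }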
    
    There are some other caveats and requirements of get_blocks() which are
    documented in the comment block over fs/direct_io.c:get_more_blocks().
    
    Possibly, get_blocks() will be the 2.6 kernel's way of doing gang block
    mapping.  It certainly allows good speedups.  But it doesn't allow the
    fs to return a scatter list of blocks - it only understands linear
    chunks of disk.  I think that's really all it _should_ do.
    
    I'll let get_blocks() sit for a while and wait for some feedback.  If
    it is sufficient and nobody objects too much, I shall convert all
    get_block() instances in the kernel to be get_blocks() instances.  And
    I'll teach readahead (at least) to use the get_blocks() extension.
    
    Delayed allocate writeback could use get_blocks().  As could
    block_prepare_write() for blocksize < PAGE_CACHE_SIZE.  There's no
    mileage in using it in mpage_writepages() because all our filesystems are
    syncalloc, and nobody uses MAP_SHARED for much.
    
    It will be tricky to use get_blocks() for writes, because if a ton of
    blocks have been mapped into the file and then something goes wrong,
    the kernel needs to either remove those blocks from the file or zero
    them out.  The direct_io code zeroes them out.
    
    btw, some time ago you mentioned that some drivers and/or hardware may
    get upset if there are multiple simultaneous IOs in progress against
    the same block.  Well, the raw driver has always allowed that to
    happen.  O_DIRECT writes to blockdevs do as well now.
    
    todo:
    
    1) The driver will probably explode if someone runs BLKBSZSET while
       IO is in progress.  Need to use bdclaim() somewhere.
    
    2) readv() and writev() need to become direct_io-aware.  At present
       we're doing stop-and-wait for each segment when performing
       readv/writev against the raw driver and O_DIRECT blockdevs.