    [PATCH] O_DIRECT data exposure fixes · bc0e2bbf
    Andrew Morton authored
    From: Badari Pulavarty, Suparna Bhattacharya, Andrew Morton
    
    Forward port of Stephen Tweedie's DIO fixes from 2.4, to fix various DIO vs
    buffered IO exposures involving races causing:
    
    (a) stale data from uninstantiated blocks to be read, e.g.
    
        - O_DIRECT reads against buffered writes to a sparse region
    
        - O_DIRECT writes to a sparse region against buffered reads
    
    (b) potential data corruption with
    
        - O_DIRECT IOs against truncate
    
        due to writes to truncated blocks (which may have been reallocated to
        another file).
    
    Summary of fixes:
    
    1) All the changes affect only regular files.  RAW/O_DIRECT on block
       devices is unaffected.
    
    2) The DIO code will not fill in sparse regions on a write.  Instead,
       -ENOTBLK is returned and the generic file write code falls through to
       buffered IO, then writes the pages through to disk using
       filemap_fdatawrite/wait.
    
    3) i_sem is held during both DIO reads and writes.  For reads, and writes
       to already allocated blocks, it is released right after IO is issued,
       while for writes to newly allocated blocks (e.g. file-extending writes and
       hole overwrites) it is held all the way through until IO completes (and
       data is committed to disk).
    
    4) filemap_fdatawrite/wait are called under i_sem to synchronize buffered
       pages to disk blocks before issuing DIO.
    
    5) A new rwsem (i_alloc_sem) is held in shared mode all the while a DIO
       (read or write) is in progress, and in exclusive mode by truncate to guard
       against deallocation of data blocks during DIO. 
    
    6) All this new locking has been pushed down into blockdev_direct_IO to
       avoid interfering with NFS direct IO.  The locks are taken in the order
       i_sem followed by i_alloc_sem.  While i_sem may be released after IO
       submission in some cases, i_alloc_sem is held through until dio_complete
       (in the case of AIO-DIO this happens through the IO completion callback).
    
    7) i_sem and i_alloc_sem are not held for the _nolock versions of write
       routines, as used by blockdev and XFS.  Filesystems can specify the
       needs_special_locking parameter to __blockdev_direct_IO from their direct
       IO address space op accordingly.
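
The fallback described in fix 2 can be sketched as a small userspace model.
This is only an illustration under stated assumptions, not the kernel code:
every toy_* name and struct toy_file are hypothetical, and toy_writeback()
merely stands in for the filemap_fdatawrite/wait step.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* Hypothetical toy file: one flag per block, not a kernel structure. */
struct toy_file {
    bool allocated[NBLOCKS];  /* false means a hole (sparse region) */
    bool dirty[NBLOCKS];      /* buffered data not yet written back */
    int  direct_writes;       /* counters kept only for illustration */
    int  buffered_writes;
};

/* Direct path: never instantiates blocks; a hole yields -ENOTBLK. */
static int toy_direct_write(struct toy_file *f, size_t blk)
{
    if (blk >= NBLOCKS)
        return -EINVAL;
    if (!f->allocated[blk])
        return -ENOTBLK;
    f->direct_writes++;
    return 0;
}

/* Buffered path: allocates the block and leaves the data dirty. */
static int toy_buffered_write(struct toy_file *f, size_t blk)
{
    if (blk >= NBLOCKS)
        return -EINVAL;
    f->allocated[blk] = true;
    f->dirty[blk] = true;
    f->buffered_writes++;
    return 0;
}

/* Stand-in for filemap_fdatawrite/wait: push dirty data to "disk". */
static void toy_writeback(struct toy_file *f)
{
    for (size_t i = 0; i < NBLOCKS; i++)
        f->dirty[i] = false;
}

/* Try direct IO first; on -ENOTBLK fall back to buffered IO + writeback. */
int toy_write(struct toy_file *f, size_t blk)
{
    int ret = toy_direct_write(f, blk);

    if (ret != -ENOTBLK)
        return ret;

    ret = toy_buffered_write(f, blk);
    if (ret == 0)
        toy_writeback(f);
    return ret;
}
```

In this model a write to an allocated block stays on the direct path, while a
write to a hole is served by the buffered path and is clean again once
toy_write() returns.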
    
    Note from Badari:
    Here is the locking (when needs_special_locking is true):
    
    (1) generic_file_*_write() holds i_sem (as before) and calls
        ->direct_IO().  blockdev_direct_IO takes i_alloc_sem and calls
        direct_io_worker().
    
    (2) generic_file_*_read() does not hold any locks.  blockdev_direct_IO()
        gets i_sem and then i_alloc_sem and calls direct_io_worker() to do the
        work.
    
    (3) direct_io_worker() does the work and drops i_sem after submitting IOs
        if appropriate and drops i_alloc_sem after completing IOs.
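
The ordering in the note above can be traced with a single-threaded model.
This is a sketch under assumptions, not the kernel implementation: the toy_*
names are hypothetical, the "locks" are plain state fields that provide no
real synchronization, and the caller-side i_sem acquisition of the write path
is folded into one function for brevity.  The relative order in which i_sem
and i_alloc_sem are dropped at completion for newly allocated blocks is also
an assumption.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum toy_op { TOY_READ, TOY_WRITE_ALLOCATED, TOY_WRITE_NEW_BLOCKS };

/* Hypothetical inode model: lock state plus a trace of lock events. */
struct toy_inode {
    bool i_sem_held;
    int  i_alloc_sem_readers;
    char trace[128];
};

/* Append an event name to the trace, separated by single spaces. */
static void toy_note(struct toy_inode *ino, const char *event)
{
    size_t used = strlen(ino->trace);

    snprintf(ino->trace + used, sizeof(ino->trace) - used,
             "%s%s", used ? " " : "", event);
}

void toy_inode_init(struct toy_inode *ino)
{
    memset(ino, 0, sizeof(*ino));
}

void toy_direct_io(struct toy_inode *ino, enum toy_op op)
{
    /* Writes that instantiate blocks keep i_sem until IO completes. */
    bool hold_until_done = (op == TOY_WRITE_NEW_BLOCKS);

    /* Lock order: i_sem first, then i_alloc_sem in shared mode. */
    ino->i_sem_held = true;
    toy_note(ino, "i_sem+");
    ino->i_alloc_sem_readers++;
    toy_note(ino, "i_alloc_sem+");

    toy_note(ino, "submit");
    if (!hold_until_done) {
        /* Reads and writes to allocated blocks drop i_sem early. */
        ino->i_sem_held = false;
        toy_note(ino, "i_sem-");
    }

    toy_note(ino, "complete");
    if (hold_until_done) {
        ino->i_sem_held = false;
        toy_note(ino, "i_sem-");
    }

    /* i_alloc_sem is always held through IO completion. */
    ino->i_alloc_sem_readers--;
    toy_note(ino, "i_alloc_sem-");
}
```

Inspecting the trace field after a call shows where i_sem is released
relative to submission and completion for each kind of operation.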