    ext4: flush the i_completed_io_list during ext4_truncate · 3889fd57
    Jiaying Zhang authored
    Ted first found the bug when running a 2.6.36 kernel with the dioread_nolock
    mount option: xfstests #13 complained about a wrong file size during fsck.
    However, the bug exists in older kernels as well, although it is
    somewhat harder to trigger there.
    
    The problem is that ext4_end_io_work() can happen after we have truncated an
    inode to a smaller size. Then when ext4_end_io_work() calls 
    ext4_convert_unwritten_extents(), we may reallocate some blocks that have 
    been truncated, so the inode size becomes inconsistent with the allocated
    blocks. 
    
    The following patch flushes the i_completed_io_list during truncate to reduce 
    the risk that some pending end_io requests are executed later and convert 
    already truncated blocks to initialized. 
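The ordering problem and the effect of flushing first can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: every `model_*` name, the block-array representation, and the `flush_first` switch are hypothetical stand-ins invented for illustration. The pending list plays the role of i_completed_io_list, and `model_convert()` plays the role of ext4_convert_unwritten_extents().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these types are the kernel's. */
#define MODEL_MAX_BLOCKS 16

struct model_io_end {
    size_t first_block;          /* first block covered by the write */
    size_t nr_blocks;            /* number of blocks covered */
    struct model_io_end *next;
};

struct model_inode {
    size_t size_blocks;                 /* file size, in blocks */
    int allocated[MODEL_MAX_BLOCKS];    /* 1 if the block is allocated */
    struct model_io_end *completed_io;  /* stand-in for i_completed_io_list */
};

/* Stand-in for the conversion step: marks the blocks of a completed
 * write as allocated without checking the current file size. */
static void model_convert(struct model_inode *inode,
                          const struct model_io_end *io)
{
    assert(io->first_block + io->nr_blocks <= MODEL_MAX_BLOCKS);
    for (size_t b = io->first_block; b < io->first_block + io->nr_blocks; b++)
        inode->allocated[b] = 1;
}

/* Queue a completed write for later processing. */
static void model_queue_io(struct model_inode *inode, struct model_io_end *io)
{
    io->next = inode->completed_io;
    inode->completed_io = io;
}

/* Process every pending entry and leave the list empty. */
static void model_flush_completed_io(struct model_inode *inode)
{
    while (inode->completed_io) {
        struct model_io_end *io = inode->completed_io;
        inode->completed_io = io->next;
        io->next = NULL;
        model_convert(inode, io);
    }
}

/* Truncate to new_size blocks; optionally drain pending work first. */
static void model_truncate(struct model_inode *inode, size_t new_size,
                           int flush_first)
{
    if (flush_first)
        model_flush_completed_io(inode);
    for (size_t b = new_size; b < MODEL_MAX_BLOCKS; b++)
        inode->allocated[b] = 0;
    inode->size_blocks = new_size;
}

/* Returns 1 when no block at or beyond the file size is allocated. */
static int model_consistent(const struct model_inode *inode)
{
    for (size_t b = inode->size_blocks; b < MODEL_MAX_BLOCKS; b++)
        if (inode->allocated[b])
            return 0;
    return 1;
}

/* One write covering blocks 4..7 is pending, the file is truncated to
 * 2 blocks, and the worker runs only afterwards. */
static int model_run_scenario(int flush_first)
{
    struct model_inode inode = { 0 };
    struct model_io_end io = { 4, 4, NULL };

    inode.size_blocks = 8;
    model_queue_io(&inode, &io);
    model_truncate(&inode, 2, flush_first);
    model_flush_completed_io(&inode);   /* the late worker run */
    return model_consistent(&inode);
}
```

In this model, `model_run_scenario(0)` leaves blocks allocated beyond the file size, while `model_run_scenario(1)` does not, because the late worker run finds an empty list. The model is single-threaded, so it cannot show the remaining race window with writes that complete after the flush.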
    
    Note that although the fix helps reduce the problem a lot, there may still
    be a race window between vmtruncate() and ext4_end_io_work(). The fundamental
    problem is that if vmtruncate() is called without either i_mutex or i_alloc_sem
    held, it can race with an ongoing write request so that the io_end request is
    processed later when the corresponding blocks have been truncated.
    
    Ted and I have discussed the problem offline and we saw a few ways to fix
    the race completely:
    
    a) We guarantee that the i_mutex lock and the i_alloc_sem write lock are both held
    whenever vmtruncate() is called. The i_mutex lock prevents any new write
    requests from entering writeback and the i_alloc_sem prevents the race
    from ext4_page_mkwrite(). Currently we hold both locks if vmtruncate()
    is called from do_truncate(), which is probably the most common case.
    However, there are places where we may call vmtruncate() without holding
    either i_mutex or i_alloc_sem. I would like to ask for other people's
    opinions on what locks are expected to be held before calling vmtruncate().
    There seems to be disagreement among the callers of that function.
    
    b) We change the ext4 write path so that we change the extent tree to contain 
    the newly allocated blocks and update i_size both at the same time --- when 
    the write of the data blocks is completed.
    
    c) We add some additional locking to synchronize vmtruncate() and 
    ext4_end_io_work(). This approach may have performance implications so we
    need to be careful.
    
    All of the above proposals may require more substantial changes, so
    we may consider taking the following patch as a band-aid.
    Signed-off-by: Jiaying Zhang <jiayingz@google.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
extents.c 103 KB