• Dave Chinner's avatar
    xfs: don't use speculative prealloc for small files · 133eeb17
    Dave Chinner authored
    Dedicated small file workloads have been seeing significant free
    space fragmentation causing premature inode allocation failure
    when large inode sizes are in use. A particular test case showed
    that a workload that runs to a real ENOSPC on 256 byte inodes would
    fail inode allocation with ENOSPC about about 80% full with 512 byte
    inodes, and at about 50% full with 1024 byte inodes.
    
    The same workload, when run with -o allocsize=4096 on 1024 byte
    inodes would run to being 100% full before giving ENOSPC. That is,
    no freespace fragmentation at all.
    
    The issue was caused by the specific IO pattern the application had
    - the framework it was using did not support direct IO, and so it
    was emulating it by using fadvise(DONT_NEED). The result was that
    the data was getting written back before the speculative prealloc
    had been trimmed from memory by the close(), and so small single
    block files were being allocated with 2 blocks, and then having one
    truncated away. The result was lots of small 4k free space extents,
    and hence each new 8k allocation would take another 8k from
    contiguous free space and turn it into 4k of allocated space and 4k
    of free space.
    
    Hence inode allocation, which requires contiguous, aligned
    allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
    (1024 byte inodes) can fail to find sufficiently large freespace and
    hence fail while there is still lots of free space available.
    
    There's a simple fix for this, and one that has precendence in the
    allocator code already - don't do speculative allocation unless the
    size of the file is larger than a certain size. In this case, that
    size is the minimum default preallocation size:
    mp->m_writeio_blocks. And to keep with the concept of being nice to
    people when the files are still relatively small, cap the prealloc
    to mp->m_writeio_blocks until the file goes over a stripe unit is
    size, at which point we'll fall back to the current behaviour based
    on the last extent size.
    
    This will effectively turn off speculative prealloc for very small
    files, keep preallocation low for small files, and behave as it
    currently does for any file larger than a stripe unit. This
    completely avoids the freespace fragmentation problem this
    particular IO pattern was causing.
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
    Reviewed-by: default avatarMark Tinguely <tinguely@sgi.com>
    Signed-off-by: default avatarBen Myers <bpm@sgi.com>
    133eeb17
xfs_iomap.c 25 KB