• Dave Chinner's avatar
    xfs: optimise away log forces on timestamp updates for fdatasync · fc0561ce
    Dave Chinner authored
    xfs: timestamp updates cause excessive fdatasync log traffic
    
    Sage Weil reported that a ceph test workload was writing to the
    log on every fdatasync during an overwrite workload. Event tracing
    showed that the only metadata modification being made was the
    timestamp updates during the write(2) syscall, but fdatasync(2)
    is supposed to ignore them. The key observation was that the
    transactions in the log all looked like this:
    
    INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
    
    And contained a flags field of 0x45 or 0x85, and had data and
    attribute forks following the inode core. This means that the
    timestamp updates were triggering dirty relogging of previously
    logged parts of the inode that hadn't yet been flushed back to
    disk.
    
    There are two parts to this problem. The first is that XFS relogs
    dirty regions in subsequent transactions, so it carries around the
    fields that have been dirtied since the last time the inode was
    written back to disk, not since the last time the inode was forced
    into the log.
    
    The second part is that on v5 filesystems, the inode change count
    update during inode dirtying also sets the XFS_ILOG_CORE flag, so
    on v5 filesystems this makes a timestamp update dirty the entire
    inode.
    
    As a result when fdatasync is run, it looks at the dirty fields in
    the inode, and sees more than just the timestamp flag, even though
    the only metadata change since the last fdatasync was just the
    timestamps. Hence we force the log on every subsequent fdatasync
    even though it is not needed.
    
    To fix this, add a new field to the inode log item that tracks
    changes since the last time fsync/fdatasync forced the log to flush
    the changes to the journal. This flag is updated when we dirty the
    inode, but we do it before updating the change count so it does not
    carry the "core dirty" flag from timestamp updates. The fields are
    zeroed when the inode is marked clean (due to writeback/freeing) or
    when an fsync/datasync forces the log. Hence if we only dirty the
    timestamps on the inode between fsync/fdatasync calls, the fdatasync
    will not trigger another log force.
    
    Over 100 runs of the test program:
    
    Ext4 baseline:
    	runtime: 1.63s +/- 0.24s
    	avg lat: 1.59ms +/- 0.24ms
    	iops: ~2000
    
    XFS, vanilla kernel:
            runtime: 2.45s +/- 0.18s
    	avg lat: 2.39ms +/- 0.18ms
    	log forces: ~400/s
    	iops: ~1000
    
    XFS, patched kernel:
            runtime: 1.49s +/- 0.26s
    	avg lat: 1.46ms +/- 0.25ms
    	log forces: ~30/s
    	iops: ~1500
    Reported-by: default avatarSage Weil <sage@redhat.com>
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarBrian Foster <bfoster@redhat.com>
    Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
    fc0561ce
xfs_inode_item.h 2.02 KB