    block: cache current nsec time in struct blk_plug · da4c8c3d
    Jens Axboe authored
    Querying the current time is the most costly thing we do in the block
    layer per IO, and depending on kernel config settings, we may do it
    many times per IO.
    
    None of the callers actually need nsec granularity. Take advantage of
    that by caching the current time in the plug, with the assumption here
    being that any time checking will be temporally close enough that the
    slight loss of precision doesn't matter.
    
    If the block plug gets flushed, e.g. on preempt or schedule out, then
    we invalidate the cached clock.
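
The caching and invalidation described above can be sketched in userspace C. This is only an illustrative model, not the kernel code: `struct plug_sketch`, `plug_time_ns()`, `plug_flush()` and the simulated clock are hypothetical names invented for this example, and a zero timestamp is assumed to mean "nothing cached".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct blk_plug; only the cached time is modeled. */
struct plug_sketch {
	uint64_t cur_ns;	/* cached time in ns, 0 means "no valid cache" */
};

/* Simulated clock source (assumption): stands in for the costly time query. */
static uint64_t fake_now_ns = 1000;
static unsigned long clock_reads;	/* how often the "costly" clock was read */

static uint64_t read_clock_ns(void)
{
	clock_reads++;
	fake_now_ns += 100;	/* time always moves forward between reads */
	return fake_now_ns;
}

/* Return the time, reusing the value cached in the plug when there is one. */
static uint64_t plug_time_ns(struct plug_sketch *plug)
{
	if (!plug)
		return read_clock_ns();	/* no plug active: no caching possible */
	if (!plug->cur_ns)
		plug->cur_ns = read_clock_ns();	/* first query fills the cache */
	return plug->cur_ns;
}

/* Called when the plug is flushed: the cached clock is no longer trusted. */
static void plug_flush(struct plug_sketch *plug)
{
	plug->cur_ns = 0;
}

/* Small demo: two lookups under one plug cost a single clock read. */
static unsigned long demo_reads_for_two_lookups(void)
{
	struct plug_sketch plug = { 0 };
	unsigned long before = clock_reads;

	(void)plug_time_ns(&plug);
	(void)plug_time_ns(&plug);
	return clock_reads - before;
}
```

In this model, every time lookup issued while the same plug stays active shares one clock read, and the first lookup after `plug_flush()` pays for a fresh one.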
    
    On a basic peak IOPS test case with iostats enabled, this changes
    the performance from:
    
    IOPS=108.41M, BW=52.93GiB/s, IOS/call=31/31
    IOPS=108.43M, BW=52.94GiB/s, IOS/call=32/32
    IOPS=108.29M, BW=52.88GiB/s, IOS/call=31/32
    IOPS=108.35M, BW=52.91GiB/s, IOS/call=32/32
    IOPS=108.42M, BW=52.94GiB/s, IOS/call=31/31
    IOPS=108.40M, BW=52.93GiB/s, IOS/call=32/32
    IOPS=108.31M, BW=52.89GiB/s, IOS/call=32/31
    
    to
    
    IOPS=118.79M, BW=58.00GiB/s, IOS/call=31/32
    IOPS=118.62M, BW=57.92GiB/s, IOS/call=31/31
    IOPS=118.80M, BW=58.01GiB/s, IOS/call=32/31
    IOPS=118.78M, BW=58.00GiB/s, IOS/call=32/32
    IOPS=118.69M, BW=57.95GiB/s, IOS/call=32/31
    IOPS=118.62M, BW=57.92GiB/s, IOS/call=32/31
    IOPS=118.63M, BW=57.92GiB/s, IOS/call=31/32
    
    which is more than a 9% improvement in performance. Looking at perf diff,
    we can see a huge reduction in time overhead:
    
        10.55%     -9.88%  [kernel.vmlinux]  [k] read_tsc
         1.31%     -1.22%  [kernel.vmlinux]  [k] ktime_get
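
As a quick check of the "more than 9%" figure, the IOPS samples quoted above can be averaged and compared. The helper names below are hypothetical; the values are copied from the two runs, in millions of IOPS.

```c
#include <assert.h>
#include <stddef.h>

/* IOPS samples (in millions) copied from the before/after runs above. */
static const double iops_before[] = {
	108.41, 108.43, 108.29, 108.35, 108.42, 108.40, 108.31
};
static const double iops_after[] = {
	118.79, 118.62, 118.80, 118.78, 118.69, 118.62, 118.63
};

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative gain of the "after" mean over the "before" mean, in percent. */
static double iops_gain_percent(void)
{
	double before = mean(iops_before, sizeof(iops_before) / sizeof(iops_before[0]));
	double after = mean(iops_after, sizeof(iops_after) / sizeof(iops_after[0]));

	return (after - before) / before * 100.0;
}
```

With these samples the means are about 108.37M and 118.70M IOPS, which works out to a gain of roughly 9.5%.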
    
    Note that since this relies on blk_plug for the caching, it's only
    applicable to the issue side. But this is where most of the time calls
    happen anyway. On the completion side, cached time stamping is done with
    struct io_comp_batch, as long as the driver supports it.
    
    It's also worth noting that the above testing doesn't enable any of the
    higher cost CPU items on the block layer side, like wbt, cgroups,
    iocost, etc, which all would add additional time querying and hence
    overhead. IOW, results would likely look even better in comparison with
    those enabled, as distros would do.
    
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>