    [PATCH] cfq-v2 I/O scheduler update (f9887e4a)
    Author: Jens Axboe
    Here is the next incarnation of the CFQ io scheduler, so far known
    locally as CFQ v2. It attempts to address some of the limitations of
    the original CFQ io scheduler (henceforth known as CFQ v1). Some of
    the problems with CFQ v1 are:
    
    - It does accounting for the lifetime of the cfq_queue, which is set
      up and torn down for the time when a process has io in flight. For
      a fork-heavy workload (such as a kernel compile, for instance),
      new processes can effectively starve the io of already running
      processes. This is in part due to the fact that CFQ v1 gives
      preference to new processes to get better latency numbers.
      Removing that heuristic is not an option, precisely because it is
      what provides those latency numbers.
    
    - It makes no attempt to address inter-cfq_queue fairness.
    
    - It makes no attempt to put an upper bound on the latency of a
      single request.
    
    - It only provides per-tgid grouping. You need to change the source
      to group on a different criterion.
    
    - It uses a mempool for the cfq_queues. Theoretically this could
      deadlock if io-bound processes never exit.
    
    - The may_queue() logic can be unfair since it fluctuates quickly, thus
      leaving processes sleeping while new processes are allowed to allocate
      a request.
    
    CFQ v2 attempts to fix these issues. It uses the process io_context
    logic to maintain a cfq_queue for the lifetime of the process (and
    its io). This means we can now be a lot more clever in deciding
    which process is allowed to queue or dispatch io to the device. The
    cfq_io_context is per-process per-queue; this extends what AS
    currently does in that we truly do have a unique per-process
    identifier for io grouping. Busy queues are sorted by service time
    used, sub-sorted by in-flight requests. Queues that have no io in
    flight are also preferred at dispatch time.
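
    To make the ordering above concrete, here is a minimal sketch (not
    the actual patch code) of the comparison it implies; the field
    names service_used and in_flight are illustrative stand-ins for the
    real struct cfq_queue members:

    struct cfq_queue_sketch {
            unsigned long service_used;     /* time spent servicing this queue */
            unsigned int in_flight;         /* dispatched, not yet completed */
    };

    /* return < 0 if q1 should be dispatched before q2 */
    static int cfqq_compare(const struct cfq_queue_sketch *q1,
                            const struct cfq_queue_sketch *q2)
    {
            /* queues with no io in flight jump ahead */
            if (!q1->in_flight != !q2->in_flight)
                    return q1->in_flight ? 1 : -1;

            /* primary key: least service time used so far */
            if (q1->service_used != q2->service_used)
                    return q1->service_used < q2->service_used ? -1 : 1;

            /* secondary key: fewer requests in flight */
            return (int)q1->in_flight - (int)q2->in_flight;
    }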
    
    Accounting is done on completion time of a request, or with a fixed
    cost for tagged command queueing. Requests are fifo'ed as with
    deadline, to make sure that a single request doesn't stay in the io
    scheduler for ages.
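
    The fifo bound amounts to a simple deadline check on the oldest
    request. A sketch, assuming jiffies-style tick timestamps; the
    struct and the tick values are illustrative, with fifo_expire_sync
    and fifo_expire_async mirroring the sysfs knobs listed further down:

    #include <stdbool.h>

    struct rq_sketch {
            unsigned long fifo_time;        /* tick when queued to the fifo */
            bool sync;                      /* read or sync write? */
    };

    static unsigned long fifo_expire_sync = 500;    /* ~0.5s at 1000 ticks/s */
    static unsigned long fifo_expire_async = 5000;  /* ~5s at 1000 ticks/s */

    /* has the oldest fifo'ed request waited past its deadline? */
    static bool cfq_fifo_expired(const struct rq_sketch *rq, unsigned long now)
    {
            unsigned long expire = rq->sync ? fifo_expire_sync
                                            : fifo_expire_async;

            return now > rq->fifo_time + expire;
    }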
    
    Process grouping is selectable at runtime. I provide four grouping
    criteria: process group, thread group id, user id, and group id.
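
    A sketch of how such a runtime-selectable key could look; the real
    patch keys off the task_struct, so the libc getters below are
    userspace stand-ins for the kernel accessors:

    #include <unistd.h>
    #include <sys/types.h>

    enum cfq_key_type { CFQ_KEY_PGID, CFQ_KEY_TGID, CFQ_KEY_UID, CFQ_KEY_GID };

    static enum cfq_key_type key_type = CFQ_KEY_TGID;       /* default */

    /* key used to group this process's io into one cfq_queue */
    static unsigned int cfq_hash_key(void)
    {
            switch (key_type) {
            case CFQ_KEY_PGID:
                    return getpgrp();       /* process group id */
            case CFQ_KEY_UID:
                    return getuid();        /* user id */
            case CFQ_KEY_GID:
                    return getgid();        /* group id */
            case CFQ_KEY_TGID:
            default:
                    return getpid();        /* thread group id */
            }
    }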
    
    As usual, settings are sysfs tweakable in /sys/block/<dev>/queue/iosched
    
    axboe@apu:[.]s/block/hda/queue/iosched $ ls
    back_seek_max      fifo_batch_expire  find_best_crq  queued
    back_seek_penalty  fifo_expire_async  key_type       show_status
    clear_elapsed      fifo_expire_sync   quantum        tagged
    
    In order, each of these settings controls:
    
    back_seek_max
    back_seek_penalty:
    	Useful logic stolen from AS that allows small backwards seeks
    	in the io stream if we deem them useful. CFQ uses a strict
    	ascending elevator otherwise. _max controls the maximum allowed
    	backwards seek, defaulting to 16MiB. _penalty denotes how
    	expensive we account a backwards seek compared to a forward
    	seek. Default is 2, meaning it's twice as expensive. A sketch
    	of this accounting follows the settings list.
    
    clear_elapsed:
    	Really a debug switch, will go away in the future. It clears the
    	maximum values for completion and dispatch time, shown in
    	show_status.
    
    fifo_batch_expire
    fifo_expire_async
    fifo_expire_sync:
    	The settings for the expiry fifo. batch_expire controls how
    	often we let fifo expiry decide which request gets selected.
    	Default is 125ms. _async is the deadline for async requests
    	(typically writes), _sync is the deadline for sync requests
    	(reads and sync writes). Defaults are, respectively, 5 seconds
    	and 0.5 seconds.
    
    key_type:
    	The grouping key. Can be set to pgid, tgid, uid, or gid. The
    	current value is shown bracketed:
    
    	axboe@apu:[.]s/block/hda/queue/iosched $ cat key_type
    	[pgid] tgid uid gid
    
    	Default is tgid. To set, simply echo any of the 4 words into the
    	file.
    
    quantum:
    	The number of requests we select for dispatch when the driver
    	asks for work to do and the current pending list is empty.
    	Default is 4.
    
    queued:
    	The minimum number of requests a group is allowed to queue.
    	Default is 8.
    
    show_status:
    	Debug output showing the current state of the queues.
    
    tagged:
    	Set this to 1 if the device is using tagged command queueing.
    	This cannot be reliably detected by CFQ yet, since most drivers
    	don't use the block layer tagging (it could be inferred from
    	the number of requests sitting between dispatch and completion,
    	but not completely reliably). Default is 0.
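
    As promised under back_seek_max/back_seek_penalty, here is a sketch
    of the back-seek accounting under the defaults above; cfq_seek_cost
    and the sector arithmetic are illustrative, not the exact patch
    code:

    typedef unsigned long long sector_t;

    static sector_t back_seek_max = 16 * 1024 * 2;  /* 16MiB in 512b sectors */
    static unsigned int back_seek_penalty = 2;      /* back seek costs 2x */

    /* effective distance from head to sector, or ~0 if not allowed */
    static sector_t cfq_seek_cost(sector_t head, sector_t sector)
    {
            if (sector >= head)
                    return sector - head;   /* forward: plain distance */

            if (head - sector > back_seek_max)
                    return ~0ULL;           /* too far back, not considered */

            /* backwards seek, weighted by the penalty */
            return (head - sector) * back_seek_penalty;
    }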
    
    The patch is a little big, but works reliably here on my laptop.
    There are a number of other changes and fixes in there (like
    converting to hlist for the hashes). The code is commented a lot
    better; CFQ v1 has basically no comments (reflecting that it was
    written in one go, not touched or tuned much since then). This is of
    course only done to increase the AAF, the akpm acceptance factor.
    Since I'm on the road, I cannot provide any really good numbers
    comparing CFQ v1 to v2, maybe someone will help me out there.
    Signed-off-by: Jens Axboe <axboe@suse.de>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>