Commit f9887e4a authored by Jens Axboe, committed by Linus Torvalds

[PATCH] cfq-v2 I/O scheduler update

Here is the next incarnation of the CFQ io scheduler, known locally so
far as CFQ v2. It attempts to address some of the limitations of the
original CFQ io scheduler (henceforth known as CFQ v1). Some of the
problems with CFQ v1 are:

- It does accounting for the lifetime of the cfq_queue, which is set up
  and torn down for the period when a process has io in flight. For a
  fork-heavy work load (such as a kernel compile, for instance), new
  processes can effectively starve the io of running processes. This is
  in part due to the fact that CFQ v1 gives preference to new processes
  to get better latency numbers. Removing that heuristic is not an
  option, precisely because it is what gives new processes that latency.

- It makes no attempt to address inter-cfq_queue fairness.

- It makes no attempt to limit the upper latency bound of a single
  request.

- It only provides per-tgid grouping. You need to change the source to
  group on a different criterion.

- It uses a mempool for the cfq_queues. Theoretically this could
  deadlock if io bound processes never exit.

- The may_queue() logic can be unfair since it fluctuates quickly, thus
  leaving processes sleeping while new processes are allowed to allocate
  a request.

CFQ v2 attempts to fix these issues. It uses the process io_context
logic to keep a cfq_queue alive for the lifetime of the process (and
its io). This means we can now be a lot more clever in deciding which
process is allowed to queue or dispatch io to the device. The
cfq_io_context is per-process, per-queue; this extends what AS
currently does in that we truly do have a unique per-process identifier
for io grouping. Busy queues are sorted by service time used, sub-sorted
by in-flight requests. Queues that have no io in flight are also
preferred at dispatch time.
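
To illustrate the sort order, the comparison between two busy queues
boils down to something like this (just a sketch with made up names,
not the actual cfq code):

/* illustrative only -- these are not the real cfq_queue fields */
struct example_queue {
	unsigned long service_used;	/* io service time accounted so far */
	int in_flight;			/* requests currently at the driver */
};

/* least service time wins; ties go to the queue with less io in flight */
static int busy_queue_cmp(const struct example_queue *a,
			  const struct example_queue *b)
{
	if (a->service_used != b->service_used)
		return a->service_used < b->service_used ? -1 : 1;
	return a->in_flight - b->in_flight;
}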

Accounting is done on the completion time of a request, or with a fixed
cost for tagged command queueing. Requests are fifo'ed as in deadline,
to make sure that a single request doesn't stay in the io scheduler for
ages.
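
The expiry check itself is the usual deadline-style comparison; roughly
something like this (made up names and jiffies style timestamps, not
the patch code):

/* illustrative deadline-style fifo expiry check, not the actual patch code */
struct example_request {
	unsigned long fifo_expire;	/* entry time + fifo_expire_sync/_async */
};

static int fifo_head_expired(const struct example_request *head,
			     unsigned long now)
{
	/* wrap-safe "now >= fifo_expire", like time_after_eq() in the kernel */
	return (long)(now - head->fifo_expire) >= 0;
}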

Process grouping is selectable at runtime. I provide four grouping
criteria: process group, thread group id, user id, and group id.
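
Conceptually the grouping key is simply derived from the submitting
task according to the selected criterion, along these lines
(hypothetical types and helper, not the in-kernel code):

/* illustrative only -- not the actual kernel types or helpers */
enum example_key_type { KEY_PGID, KEY_TGID, KEY_UID, KEY_GID };

struct example_task {
	int pgid, tgid, uid, gid;
};

static int io_group_key(const struct example_task *tsk, enum example_key_type t)
{
	switch (t) {
	case KEY_PGID:
		return tsk->pgid;
	case KEY_UID:
		return tsk->uid;
	case KEY_GID:
		return tsk->gid;
	case KEY_TGID:
	default:
		return tsk->tgid;	/* tgid is the default grouping */
	}
}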

As usual, settings are sysfs tweakable in /sys/block/<dev>/queue/iosched

axboe@apu:[.]s/block/hda/queue/iosched $ ls
back_seek_max      fifo_batch_expire  find_best_crq  queued
back_seek_penalty  fifo_expire_async  key_type       show_status
clear_elapsed      fifo_expire_sync   quantum        tagged

In order, each of these settings controls:

back_seek_max
back_seek_penalty:
	Useful logic stolen from AS that allows small backwards seeks in
	the io stream if we deem them useful. CFQ uses a strict
	ascending elevator otherwise. _max controls the maximum allowed
	backwards seek, defaulting to 16MiB. _penalty denotes how
	expensive we account a backwards seek compared to a forward
	seek. Default is 2, meaning it's twice as expensive (see the
	sketch after this list).

clear_elapsed:
	Really a debug switch, will go away in the future. It clears the
	maximum values for completion and dispatch time, shown in
	show_status.

fifo_batch_expire
fifo_expire_async
fifo_expire_sync:
	The settings for the expiry fifo. batch_expire is how often we
	allow the fifo expire to control which request to select.
	Default is 125ms. _async is the deadline for async requests
	(typically writes), _sync is the deadline for sync requests
	(reads and sync writes). Defaults are, respectively, 5 seconds
	and 0.5 seconds.

key_type:
	The grouping key. Can be set to pgid, tgid, uid, or gid. The
	current value is shown bracketed:

	axboe@apu:[.]s/block/hda/queue/iosched $ cat key_type
	[pgid] tgid uid gid

	Default is tgid. To set, simply echo any of the 4 words into the
	file.

quantum:
	The number of requests we select for dispatch when the driver
	asks for work to do and the current pending list is empty.
	Default is 4.

queued:
	The minimum number of requests a group is allowed to queue.
	Default is 8.

show_status:
	Debug output showing the current state of the queues.

tagged:
	Set this to 1 if the device is using tagged command queueing.
	This cannot be reliably detected by CFQ yet, since most drivers
	don't use the block layer tagging support (it could be guessed
	from the number of requests between dispatch and completion,
	but not completely reliably). Default is 0.
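
To make the back_seek_* knobs above concrete, this is roughly how a
backwards candidate is weighed against a forwards one (illustrative
sketch with made up names and sector units, not the actual request
selection code):

/* illustrative only -- not the actual cfq request selection code */
#define EX_BACK_SEEK_MAX	(16 * 1024 * 1024 / 512)	/* 16MiB, in sectors */
#define EX_BACK_SEEK_PENALTY	2

/* head: current position, fwd/back: candidate sectors ahead of/behind it */
static int prefer_backward(unsigned long long head,
			   unsigned long long fwd, unsigned long long back)
{
	unsigned long long fwd_dist = fwd - head;
	unsigned long long back_dist = head - back;

	if (back_dist > EX_BACK_SEEK_MAX)
		return 0;	/* too far behind the head, not considered */

	/* a backwards seek counts as EX_BACK_SEEK_PENALTY times its distance */
	return back_dist * EX_BACK_SEEK_PENALTY < fwd_dist;
}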

The patch is a little big, but works reliably here on my laptop. There
are a number of other changes and fixes in there (like converting to
hlist for hashes). The code is commented a lot better; CFQ v1 has
basically no comments (reflecting that it was written in one go, and
not touched or tuned much since then). This is of course only done to
increase the AAF (akpm acceptance factor). Since I'm on the road, I
cannot provide any really good numbers comparing CFQ v1 to v2; maybe
someone will help me out there.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
parent df02202c
--- a/drivers/block/as-iosched.c
+++ b/drivers/block/as-iosched.c
@@ -1828,14 +1828,14 @@ static int as_set_request(request_queue_t *q, struct request *rq, int gfp_mask)
 static int as_may_queue(request_queue_t *q, int rw)
 {
-	int ret = 0;
+	int ret = ELV_MQUEUE_MAY;
 	struct as_data *ad = q->elevator->elevator_data;
 	struct io_context *ioc;
 
 	if (ad->antic_status == ANTIC_WAIT_REQ ||
 	    ad->antic_status == ANTIC_WAIT_NEXT) {
 		ioc = as_get_io_context();
 		if (ad->io_context == ioc)
-			ret = 1;
+			ret = ELV_MQUEUE_MUST;
 		put_io_context(ioc);
 	}
(collapsed diff not shown)
--- a/drivers/block/elevator.c
+++ b/drivers/block/elevator.c
@@ -437,7 +437,7 @@ int elv_may_queue(request_queue_t *q, int rw)
 	if (e->ops->elevator_may_queue_fn)
 		return e->ops->elevator_may_queue_fn(q, rw);
 
-	return 0;
+	return ELV_MQUEUE_MAY;
 }
 
 void elv_completed_request(request_queue_t *q, struct request *rq)
--- a/drivers/block/ll_rw_blk.c
+++ b/drivers/block/ll_rw_blk.c
@@ -243,6 +243,7 @@ void blk_queue_make_request(request_queue_t * q, make_request_fn * mfn)
 	blk_queue_hardsect_size(q, 512);
 	blk_queue_dma_alignment(q, 511);
 	blk_queue_congestion_threshold(q);
+	q->nr_batching = BLK_BATCH_REQ;
 
 	q->unplug_thresh = 4; /* hmm */
 	q->unplug_delay = (3 * HZ) / 1000; /* 3 milliseconds */
@@ -1511,8 +1512,10 @@ request_queue_t *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
 	/*
 	 * all done
 	 */
-	if (!elevator_init(q, NULL))
+	if (!elevator_init(q, NULL)) {
+		blk_queue_congestion_threshold(q);
 		return q;
+	}
 
 	blk_cleanup_queue(q);
 out_init:
@@ -1540,13 +1543,20 @@ static inline void blk_free_request(request_queue_t *q, struct request *rq)
 	mempool_free(rq, q->rq.rq_pool);
 }
 
-static inline struct request *blk_alloc_request(request_queue_t *q,int gfp_mask)
+static inline struct request *blk_alloc_request(request_queue_t *q, int rw,
+						int gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
 	if (!rq)
 		return NULL;
 
+	/*
+	 * first three bits are identical in rq->flags and bio->bi_rw,
+	 * see bio.h and blkdev.h
+	 */
+	rq->flags = rw;
+
 	if (!elv_set_request(q, rq, gfp_mask))
 		return rq;
@@ -1558,7 +1568,7 @@ static inline struct request *blk_alloc_request(request_queue_t *q,int gfp_mask)
  * ioc_batching returns true if the ioc is a valid batching request and
  * should be given priority access to a request.
  */
-static inline int ioc_batching(struct io_context *ioc)
+static inline int ioc_batching(request_queue_t *q, struct io_context *ioc)
 {
 	if (!ioc)
 		return 0;
@@ -1568,7 +1578,7 @@ static inline int ioc_batching(struct io_context *ioc)
 	 * even if the batch times out, otherwise we could theoretically
 	 * lose wakeups.
 	 */
-	return ioc->nr_batch_requests == BLK_BATCH_REQ ||
+	return ioc->nr_batch_requests == q->nr_batching ||
 		(ioc->nr_batch_requests > 0
 		&& time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
 }
@@ -1579,12 +1589,12 @@ static inline int ioc_batching(struct io_context *ioc)
  * is the behaviour we want though - once it gets a wakeup it should be given
  * a nice run.
  */
-void ioc_set_batching(struct io_context *ioc)
+void ioc_set_batching(request_queue_t *q, struct io_context *ioc)
 {
-	if (!ioc || ioc_batching(ioc))
+	if (!ioc || ioc_batching(q, ioc))
 		return;
 
-	ioc->nr_batch_requests = BLK_BATCH_REQ;
+	ioc->nr_batch_requests = q->nr_batching;
 	ioc->last_waited = jiffies;
 }
@@ -1600,10 +1610,10 @@ static void freed_request(request_queue_t *q, int rw)
 	if (rl->count[rw] < queue_congestion_off_threshold(q))
 		clear_queue_congested(q, rw);
 	if (rl->count[rw]+1 <= q->nr_requests) {
 		smp_mb();
 		if (waitqueue_active(&rl->wait[rw]))
 			wake_up(&rl->wait[rw]);
-		if (!waitqueue_active(&rl->wait[rw]))
-			blk_clear_queue_full(q, rw);
+
+		blk_clear_queue_full(q, rw);
 	}
 	if (unlikely(waitqueue_active(&rl->drain)) &&
 	    !rl->count[READ] && !rl->count[WRITE])
@@ -1632,13 +1642,22 @@ static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
 		 * will be blocked.
 		 */
 		if (!blk_queue_full(q, rw)) {
-			ioc_set_batching(ioc);
+			ioc_set_batching(q, ioc);
 			blk_set_queue_full(q, rw);
 		}
 	}
 
-	if (blk_queue_full(q, rw)
-			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
+	switch (elv_may_queue(q, rw)) {
+		case ELV_MQUEUE_NO:
+			spin_unlock_irq(q->queue_lock);
+			goto out;
+		case ELV_MQUEUE_MAY:
+			break;
+		case ELV_MQUEUE_MUST:
+			goto get_rq;
+	}
+
+	if (blk_queue_full(q, rw) && !ioc_batching(q, ioc)) {
 		/*
 		 * The queue is full and the allocating process is not a
 		 * "batcher", and not exempted by the IO scheduler
@@ -1647,12 +1666,13 @@ static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
 		goto out;
 	}
 
+get_rq:
 	rl->count[rw]++;
 	if (rl->count[rw] >= queue_congestion_on_threshold(q))
 		set_queue_congested(q, rw);
 
 	spin_unlock_irq(q->queue_lock);
-	rq = blk_alloc_request(q, gfp_mask);
+	rq = blk_alloc_request(q, rw, gfp_mask);
 	if (!rq) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -1667,17 +1687,11 @@ static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
 		goto out;
 	}
 
-	if (ioc_batching(ioc))
+	if (ioc_batching(q, ioc))
 		ioc->nr_batch_requests--;
 
 	INIT_LIST_HEAD(&rq->queuelist);
 
-	/*
-	 * first three bits are identical in rq->flags and bio->bi_rw,
-	 * see bio.h and blkdev.h
-	 */
-	rq->flags = rw;
-
 	rq->errors = 0;
 	rq->rq_status = RQ_ACTIVE;
 	rq->bio = rq->biotail = NULL;
@@ -1726,7 +1740,7 @@ static struct request *get_request_wait(request_queue_t *q, int rw)
 		 * See ioc_batching, ioc_set_batching
 		 */
 		ioc = get_io_context(GFP_NOIO);
-		ioc_set_batching(ioc);
+		ioc_set_batching(q, ioc);
 		put_io_context(ioc);
 	}
 	finish_wait(&rl->wait[rw], &wait);
@@ -3082,6 +3096,9 @@ void put_io_context(struct io_context *ioc)
 	if (atomic_dec_and_test(&ioc->refcount)) {
 		if (ioc->aic && ioc->aic->dtor)
 			ioc->aic->dtor(ioc->aic);
+		if (ioc->cic && ioc->cic->dtor)
+			ioc->cic->dtor(ioc->cic);
+
 		kmem_cache_free(iocontext_cachep, ioc);
 	}
 }
@@ -3095,14 +3112,15 @@ void exit_io_context(void)
 	local_irq_save(flags);
 	ioc = current->io_context;
-	if (ioc) {
-		if (ioc->aic && ioc->aic->exit)
-			ioc->aic->exit(ioc->aic);
-		put_io_context(ioc);
-		current->io_context = NULL;
-	} else
-		WARN_ON(1);
+	current->io_context = NULL;
 	local_irq_restore(flags);
+
+	if (ioc->aic && ioc->aic->exit)
+		ioc->aic->exit(ioc->aic);
+	if (ioc->cic && ioc->cic->exit)
+		ioc->cic->exit(ioc->cic);
+
+	put_io_context(ioc);
 }
 
 /*
@@ -3121,20 +3139,39 @@ struct io_context *get_io_context(int gfp_flags)
 	local_irq_save(flags);
 	ret = tsk->io_context;
-	if (ret == NULL) {
-		ret = kmem_cache_alloc(iocontext_cachep, GFP_ATOMIC);
-		if (ret) {
-			atomic_set(&ret->refcount, 1);
-			ret->pid = tsk->pid;
-			ret->last_waited = jiffies; /* doesn't matter... */
-			ret->nr_batch_requests = 0; /* because this is 0 */
-			ret->aic = NULL;
+	if (ret)
+		goto out;
+
+	local_irq_restore(flags);
+
+	ret = kmem_cache_alloc(iocontext_cachep, gfp_flags);
+	if (ret) {
+		atomic_set(&ret->refcount, 1);
+		ret->pid = tsk->pid;
+		ret->last_waited = jiffies; /* doesn't matter... */
+		ret->nr_batch_requests = 0; /* because this is 0 */
+		ret->aic = NULL;
+		ret->cic = NULL;
+		spin_lock_init(&ret->lock);
+
+		local_irq_save(flags);
+
+		/*
+		 * very unlikely, someone raced with us in setting up the task
+		 * io context. free new context and just grab a reference.
+		 */
+		if (!tsk->io_context)
 			tsk->io_context = ret;
+		else {
+			kmem_cache_free(iocontext_cachep, ret);
+			ret = tsk->io_context;
 		}
-	}
-	if (ret)
+
+out:
 		atomic_inc(&ret->refcount);
-	local_irq_restore(flags);
+		local_irq_restore(flags);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL(get_io_context);
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -52,6 +52,20 @@ struct as_io_context {
 	sector_t seek_mean;
 };
 
+struct cfq_queue;
+struct cfq_io_context {
+	void (*dtor)(struct cfq_io_context *);
+	void (*exit)(struct cfq_io_context *);
+
+	struct io_context *ioc;
+
+	/*
+	 * circular list of cfq_io_contexts belonging to a process io context
+	 */
+	struct list_head list;
+	struct cfq_queue *cfqq;
+};
+
 /*
  * This is the per-process I/O subsystem state. It is refcounted and
  * kmalloc'ed. Currently all fields are modified in process io context
@@ -67,7 +81,10 @@ struct io_context {
 	unsigned long last_waited; /* Time last woken after wait for request */
 	int nr_batch_requests; /* Number of requests left in the batch */
 
+	spinlock_t lock;
+
 	struct as_io_context *aic;
+	struct cfq_io_context *cic;
 };
 
 void put_io_context(struct io_context *ioc);
@@ -343,6 +360,7 @@ struct request_queue
 	unsigned long nr_requests; /* Max # of requests */
 	unsigned int nr_congestion_on;
 	unsigned int nr_congestion_off;
+	unsigned int nr_batching;
 
 	unsigned short max_sectors;
 	unsigned short max_hw_sectors;
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -130,4 +130,13 @@ extern int elv_try_last_merge(request_queue_t *, struct bio *);
 #define ELEVATOR_INSERT_BACK 2
 #define ELEVATOR_INSERT_SORT 3
 
+/*
+ * return values from elevator_may_queue_fn
+ */
+enum {
+	ELV_MQUEUE_MAY,
+	ELV_MQUEUE_NO,
+	ELV_MQUEUE_MUST,
+};
+
 #endif