[PATCH] anticipatory I/O scheduler

From: Nick Piggin <piggin@cyberone.com.au>

This is the core anticipatory IO scheduler. There are nearly 100 changesets
in this and five months' work. I really cannot describe it fully here.

Major points:

- It works by recognising that reads are dependent: we don't know where the
  next read will occur, but it's probably close to the previous one. So once
  a read has completed we leave the disk idle, anticipating that a request
  for a nearby read will come in.

- There is read batching and write batching logic.

  - When we're servicing a batch of writes we will refuse to seek away for a
    read for some tens of milliseconds. Then the write stream is preempted.

  - When we're servicing a batch of reads (via anticipation) we'll do that
    for some tens of milliseconds, then preempt.

- There are request deadlines, for latency and fairness.

  The oldest outstanding request is examined at regular intervals. If this
  request is older than a specific deadline, it will be the next one
  dispatched. This gives a good fairness heuristic while being simple,
  because processes tend to have localised IO.

Just about all of the rest of the complexity involves an array of fixups
which prevent most of the obvious failure modes with anticipation: trying
not to leave the disk head pointlessly idle. Some of these algorithms are:

- Process tracking. If the process whose read we are anticipating submits a
  write, abandon anticipation.

- Process exit tracking. If the process whose read we are anticipating
  exits, abandon anticipation.

- Process IO history. We accumulate statistical info on the process's
  recent IO patterns to aid in making decisions about how long to anticipate
  new reads.

  Currently thinktime and seek distance are tracked. Thinktime is the time
  between when a process's last request has completed and when it submits
  another one. Seek distance is simply the number of sectors between each
  read request. If either statistic becomes too high, then it isn't
  anticipated that the process will submit another read.

  The above all means that we need a per-process "io context". This is a
  fully refcounted structure. In this patch it is AS-only. Later we
  generalise it a little so other IO schedulers could use the same
  framework.

- Requests are grouped as synchronous and asynchronous, whereas the deadline
  scheduler groups requests as reads and writes. This can provide better
  sync write performance, and may give better responsiveness with
  journalling filesystems (although we haven't done that yet).

  We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
  current->flags. The plan is to remove this later, and to propagate the
  sync hint from writeback_control.sync_mode into bio->bi_flags, thence into
  request->flags. Once that is done, direct-io needs to set the BIO sync
  hint as well.

- Quite a bit of complexity has also gone into bashing TCQ into submission.
  Timing for a read batch is not started until the first read request
  actually completes. A read batch also does not start until all
  outstanding writes have completed.

AS is the default IO scheduler. deadline may be chosen by booting with
"elevator=deadline".

There are a few reasons for retaining deadline:

- AS is often slower than deadline in random IO loads with large TCQ
  windows. The usual real world task here is OLTP database loads.

- deadline is presumably more stable.

- deadline is much simpler.

The tunable per-queue entries under /sys/block/*/iosched/ are all in
milliseconds:

* read_expire

  Controls how long until a request becomes "expired".

  It also controls the interval between which expired requests are served,
  so with it set to 50, a request might take anywhere up to 100ms to be
  serviced _if_ it is the next on the expired list.

  Obviously it can't make the disk go faster. The result is basically the
  timeslice a reader gets in the presence of other IO.
  100/((seek time / read_expire) + 1) is very roughly the % streaming read
  efficiency your disk should get in the presence of multiple readers.

* read_batch_expire

  Controls how much time a batch of reads is given before pending writes
  are served. A higher value is more efficient. Shouldn't really be below
  read_expire.

* write_ versions of the above

* antic_expire

  Controls the maximum amount of time we can anticipate a good read before
  giving up. Many other factors may cause anticipation to be stopped early,
  or some processes will not be "anticipated" at all. Should be a bit
  higher for devices with big seek times, though the correspondence is not
  linear - most processes have only a few ms thinktime.
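
As a rough check on the read_expire rule of thumb above, the sketch below evaluates it for a few settings. It assumes the estimate means the fraction of each timeslice spent transferring data rather than seeking, read_expire / (seek time + read_expire). The 8 ms seek time and the function names are illustrative assumptions, not values from the patch.

```c
#include <stdio.h>

/*
 * Rough streaming-read efficiency (percent) with multiple readers,
 * assuming each reader gets a read_expire timeslice and pays one seek
 * to get it: 100 / ((seek / read_expire) + 1).
 * Both arguments are in milliseconds and must be positive.
 */
static double streaming_read_efficiency(double seek_ms, double read_expire_ms)
{
	return 100.0 / ((seek_ms / read_expire_ms) + 1.0);
}

/* Print the estimate for a few read_expire settings. */
static void print_efficiency_table(void)
{
	static const double read_expire_ms[] = { 10.0, 50.0, 100.0 };
	const double seek_ms = 8.0;	/* assumed average seek time */
	size_t i;

	for (i = 0; i < sizeof(read_expire_ms) / sizeof(read_expire_ms[0]); i++)
		printf("read_expire=%3.0f ms -> ~%.0f%%\n", read_expire_ms[i],
		       streaming_read_efficiency(seek_ms, read_expire_ms[i]));
}
```

Under these assumptions a read_expire of 50 with an 8 ms seek comes out at roughly 86%, and the gain from raising it further flattens quickly, which is the trade-off against latency for other IO.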

Commit 97ff29c2 authored Jul 04, 2003 by Andrew Morton Committed by Linus Torvalds Jul 04, 2003
9 changed files
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -13,7 +13,8 @@
 # kblockd threads
 #

-obj-y	:= elevator.o ll_rw_blk.o ioctl.o genhd.o scsi_ioctl.o deadline-iosched.o
+obj-y	:= elevator.o ll_rw_blk.o ioctl.o genhd.o scsi_ioctl.o \
+	deadline-iosched.o as-iosched.o

 obj-$(CONFIG_MAC_FLOPPY)	+= swim3.o
 obj-$(CONFIG_BLK_DEV_FD)	+= floppy.o

--- a/drivers/block/as-iosched.c
+++ b/drivers/block/as-iosched.c
--- a/drivers/block/ll_rw_blk.c
+++ b/drivers/block/ll_rw_blk.c
@@ -1033,7 +1033,7 @@ static inline void __generic_unplug_device(request_queue_t *q)
 	/*
 	 * was plugged, fire request_fn if queue has stuff to do
 	 */
-	if (!elv_queue_empty(q))
+	if (elv_next_request(q))
 		q->request_fn(q);
 }

@@ -1204,6 +1204,18 @@ static int blk_init_free_list(request_queue_t *q)

 static int __make_request(request_queue_t *, struct bio *);

+static elevator_t *chosen_elevator = &iosched_as;
+
+static int __init elevator_setup(char *str)
+{
+	if (!strcmp(str, "deadline"))
+		chosen_elevator = &iosched_deadline;
+	if (!strcmp(str, "as"))
+		chosen_elevator = &iosched_as;
+	return 1;
+}
+__setup("elevator=", elevator_setup);
+
 /**
 * blk_init_queue  - prepare a request queue for use with a block device
 * @q:    The &request_queue_t to be initialised
@@ -1235,11 +1247,20 @@ static int __make_request(request_queue_t *, struct bio *);
 int blk_init_queue(request_queue_t *q, request_fn_proc *rfn, spinlock_t *lock)
 {
 	int ret;
+	static int printed;

 	if (blk_init_free_list(q))
 		return -ENOMEM;

-	if ((ret = elevator_init(q, &iosched_deadline))) {
+	if (!printed) {
+		printed = 1;
+		if (chosen_elevator == &iosched_deadline)
+			printk("deadline elevator\n");
+		else if (chosen_elevator == &iosched_as)
+			printk("anticipatory scheduling elevator\n");
+	}
+
+	if ((ret = elevator_init(q, chosen_elevator))) {
 		blk_cleanup_queue(q);
 		return ret;
 	}
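
The boot-time selection in this file can be modelled in userspace as below. It is a sketch, not the kernel code: `struct elevator`, `pick_elevator()` and the stand-in scheduler objects are hypothetical, and only the behaviour visible in the diff is mirrored (AS by default, "deadline" and "as" recognised, any other string leaves the current choice in place).

```c
#include <string.h>

/* Stand-in for elevator_t; only the name matters for this sketch. */
struct elevator {
	const char *name;
};

static struct elevator iosched_deadline = { "deadline" };
static struct elevator iosched_as = { "as" };

/* AS is the default, as in the patch. */
static struct elevator *chosen_elevator = &iosched_as;

/*
 * Model of elevator_setup(): str is the text after "elevator=".
 * Unknown names are ignored, so the current choice is kept.
 */
static int pick_elevator(const char *str)
{
	if (!strcmp(str, "deadline"))
		chosen_elevator = &iosched_deadline;
	if (!strcmp(str, "as"))
		chosen_elevator = &iosched_as;
	return 1;
}
```

One consequence of this shape is that a misspelt scheduler name is accepted silently, so the one-off printk in blk_init_queue() is the only confirmation of which scheduler is in use.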

--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -319,6 +319,7 @@ asmlinkage long sys_fsync(unsigned int fd)

 	/* We need to protect against concurrent writers.. */
 	down(&inode->i_sem);
+	current->flags |= PF_SYNCWRITE;
 	ret = filemap_fdatawrite(inode->i_mapping);
 	err = file->f_op->fsync(file, dentry, 0);
 	if (!ret)
@@ -326,6 +327,7 @@ asmlinkage long sys_fsync(unsigned int fd)
 	err = filemap_fdatawait(inode->i_mapping);
 	if (!ret)
 		ret = err;
+	current->flags &= ~PF_SYNCWRITE;
 	up(&inode->i_sem);

 out_putf:
@@ -354,6 +356,7 @@ asmlinkage long sys_fdatasync(unsigned int fd)
 		goto out_putf;

 	down(&inode->i_sem);
+	current->flags |= PF_SYNCWRITE;
 	ret = filemap_fdatawrite(inode->i_mapping);
 	err = file->f_op->fsync(file, dentry, 1);
 	if (!ret)
@@ -361,6 +364,7 @@ asmlinkage long sys_fdatasync(unsigned int fd)
 	err = filemap_fdatawait(inode->i_mapping);
 	if (!ret)
 		ret = err;
+	current->flags &= ~PF_SYNCWRITE;
 	up(&inode->i_sem);

 out_putf:

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -516,6 +516,7 @@ int generic_osync_inode(struct inode *inode, int what)
 	int need_write_inode_now = 0;
 	int err2;

+	current->flags |= PF_SYNCWRITE;
 	if (what & OSYNC_DATA)
 		err = filemap_fdatawrite(inode->i_mapping);
 	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
@@ -528,6 +529,7 @@ int generic_osync_inode(struct inode *inode, int what)
 		if (!err)
 			err = err2;
 	}
+	current->flags &= ~PF_SYNCWRITE;

 	spin_lock(&inode_lock);
 	if ((inode->i_state & I_DIRTY) &&
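
The pattern shared by the fs/buffer.c and fs/fs-writeback.c changes is to set PF_SYNCWRITE around the writeout so that the scheduler can tell the writes come from a synchronous caller. A minimal userspace model follows. The flag value is taken from the sched.h change in this commit; `struct task`, `request_is_sync()` and the rule that every read counts as synchronous are assumptions made for illustration, since as-iosched.c is not shown here.

```c
#include <stdbool.h>

#define PF_SYNCWRITE	0x00200000	/* value from the sched.h change */

enum data_dir { DIR_READ, DIR_WRITE };

/* Hypothetical stand-in for the flags word of task_struct. */
struct task {
	unsigned long flags;
};

/*
 * Assumed classification: reads are synchronous (someone is waiting
 * for the data); writes are synchronous only while the submitting
 * task has PF_SYNCWRITE set.
 */
static bool request_is_sync(const struct task *tsk, enum data_dir dir)
{
	return dir == DIR_READ || (tsk->flags & PF_SYNCWRITE) != 0;
}

/* Model of the fsync-style bracket: set the hint, submit, clear it. */
static bool sync_write_is_seen_as_sync(struct task *tsk)
{
	bool seen;

	tsk->flags |= PF_SYNCWRITE;
	seen = request_is_sync(tsk, DIR_WRITE);	/* submission happens here */
	tsk->flags &= ~(unsigned long)PF_SYNCWRITE;
	return seen;
}
```

Because the hint lives on the task rather than on the bio, it only covers IO submitted in that task's context, which is presumably part of why the commit message calls the approach nasty and plans to carry the hint in bio->bi_flags instead.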

--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -89,6 +89,11 @@ extern elevator_t elevator_noop;
 */
 extern elevator_t iosched_deadline;

+/*
+ * anticipatory I/O scheduler
+ */
+extern elevator_t iosched_as;
+
 extern int elevator_init(request_queue_t *, elevator_t *);
 extern void elevator_exit(request_queue_t *);
 extern inline int elv_rq_merge_ok(struct request *, struct bio *);

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -321,6 +321,8 @@ struct k_itimer {
 };


+struct as_io_context;			/* Anticipatory scheduler */
+void exit_as_io_context(void);

 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
@@ -450,6 +452,8 @@ struct task_struct {
 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;

+	struct as_io_context *as_io_context;
+
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
 };
@@ -481,6 +485,7 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
 #define PF_SWAPOFF	0x00080000	/* I am in swapoff */
 #define PF_LESS_THROTTLE 0x01000000	/* Throttle me less: I clena memory */
+#define PF_SYNCWRITE	0x00200000	/* I am doing a sync write */

 #ifdef CONFIG_SMP
 extern int set_cpus_allowed(task_t *p, unsigned long new_mask);
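
The as_io_context pointer added to task_struct is where the per-process IO history described in the commit message would hang. The sketch below shows one way such bookkeeping could work: keep decaying averages of thinktime and seek distance, and decline to anticipate once either grows too large. The structure layout, the 7/8 decay and both thresholds are assumptions for illustration; the real logic is in as-iosched.c, which is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-process IO history; not the layout used by AS. */
struct io_history {
	uint64_t last_end_ms;		/* when the last request completed */
	uint64_t last_sector;		/* where the last read ended */
	uint64_t mean_thinktime_ms;	/* decaying average */
	uint64_t mean_seek_sectors;	/* decaying average */
};

/* Decaying average: 7/8 old value, 1/8 new sample. */
static uint64_t decay_avg(uint64_t mean, uint64_t sample)
{
	return (7 * mean + sample) / 8;
}

/*
 * Record a newly submitted read at `sector`, issued at time `now_ms`.
 * Assumes now_ms is not earlier than the last completion time.
 */
static void history_note_read(struct io_history *h, uint64_t now_ms,
			      uint64_t sector)
{
	uint64_t think = now_ms - h->last_end_ms;
	uint64_t seek = sector > h->last_sector ? sector - h->last_sector
						: h->last_sector - sector;

	h->mean_thinktime_ms = decay_avg(h->mean_thinktime_ms, think);
	h->mean_seek_sectors = decay_avg(h->mean_seek_sectors, seek);
}

/* Record completion of a read that ended at `end_sector`. */
static void history_note_completion(struct io_history *h, uint64_t now_ms,
				    uint64_t end_sector)
{
	h->last_end_ms = now_ms;
	h->last_sector = end_sector;
}

/*
 * Anticipate only if the process usually comes back sooner than
 * antic_expire_ms and close to where it left off.
 */
static bool worth_anticipating(const struct io_history *h,
			       uint64_t antic_expire_ms,
			       uint64_t max_seek_sectors)
{
	return h->mean_thinktime_ms <= antic_expire_ms &&
	       h->mean_seek_sectors <= max_seek_sectors;
}
```

A decaying average lets one slow or distant request weaken the case for anticipation without a long-past burst of good behaviour dominating the decision forever.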

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -682,6 +682,8 @@ NORET_TYPE void do_exit(long code)
 		panic("Attempted to kill the idle task!");
 	if (unlikely(tsk->pid == 1))
 		panic("Attempted to kill init!");
+	if (tsk->as_io_context)
+		exit_as_io_context();
 	tsk->flags |= PF_EXITING;
 	del_timer_sync(&tsk->real_timer);


--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -864,6 +864,7 @@ struct task_struct *copy_process(unsigned long clone_flags,
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = get_jiffies_64();
 	p->security = NULL;
+	p->as_io_context = NULL;

 	retval = -ENOMEM;
 	if ((retval = security_task_alloc(p)))
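
Taken together, the kernel/fork.c and kernel/exit.c changes give the context a simple lifecycle: a new task starts with no context, and do_exit() releases one only if it exists. The sketch below models that in userspace with a plain reference count. Lazy allocation on first IO, the get/put helper names and the non-atomic counter are assumptions for illustration; the commit message says only that the structure is fully refcounted.

```c
#include <stdlib.h>

/* Hypothetical refcounted per-process IO context. */
struct io_ctx {
	int refcount;	/* the kernel would need an atomic counter */
};

struct task {
	struct io_ctx *io_ctx;	/* NULL until the task first does IO */
};

/* Model of the fork.c change: a child starts with no context. */
static void task_init(struct task *t)
{
	t->io_ctx = NULL;
}

/*
 * Return the task's context with an extra reference for the caller
 * (for example an in-flight request), allocating it on first use.
 * Returns NULL if allocation fails.
 */
static struct io_ctx *io_ctx_get(struct task *t)
{
	if (!t->io_ctx) {
		t->io_ctx = calloc(1, sizeof(*t->io_ctx));
		if (!t->io_ctx)
			return NULL;
		t->io_ctx->refcount = 1;	/* the task's own reference */
	}
	t->io_ctx->refcount++;
	return t->io_ctx;
}

/* Drop one reference; free on the last one. Returns 1 if freed. */
static int io_ctx_put(struct io_ctx *ctx)
{
	if (--ctx->refcount > 0)
		return 0;
	free(ctx);
	return 1;
}

/* Model of the exit.c change: drop the task's reference if it has one. */
static int task_exit(struct task *t)
{
	struct io_ctx *ctx = t->io_ctx;

	if (!ctx)
		return 0;
	t->io_ctx = NULL;
	return io_ctx_put(ctx);
}
```

The reference count matters because a queued request can outlive the process that submitted it, so the context has to stay valid until the last holder lets go.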