Commit 97ff29c2 authored by Andrew Morton, committed by Linus Torvalds

[PATCH] anticipatory I/O scheduler

From: Nick Piggin <piggin@cyberone.com.au>

This is the core anticipatory IO scheduler.  There are nearly 100 changesets
in this and five months' work.  I really cannot describe it fully here.

Major points:

- It works by recognising that reads are dependent: we don't know where the
  next read will occur, but it's probably close to the previous one.  So once
  a read has completed we leave the disk idle, anticipating that a request
  for a nearby read will come in.

- There is read batching and write batching logic.

  - when we're servicing a batch of writes we will refuse to seek away
    for a read for some tens of milliseconds.  Then the write stream is
    preempted.

  - when we're servicing a batch of reads (via anticipation) we'll do
    that for some tens of milliseconds, then preempt.

- There are request deadlines, for latency and fairness.
  The oldest outstanding request is examined at regular intervals. If
  this request is older than a specific deadline, it will be the next
  one dispatched. This gives a good fairness heuristic while being simple
  because processes tend to have localised IO.
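
The deadline rule above can be sketched roughly as follows.  This is only an
illustration, not code from the patch: the sketch_ names, the millisecond
clock and the "otherwise take the request nearest the head" fallback are all
assumptions, and the real scheduler also applies batching and anticipation
before choosing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request record; field names are illustrative only. */
struct sketch_request {
	unsigned long submitted_ms;	/* time the request was queued */
	unsigned long sector;		/* starting sector of the request */
};

static unsigned long sketch_distance(unsigned long a, unsigned long b)
{
	return a > b ? a - b : b - a;
}

/*
 * Pick the index of the request to dispatch next from a non-empty array.
 * The oldest request wins if it has waited longer than expire_ms;
 * otherwise the request closest to the current head position wins
 * (an assumed stand-in for the scheduler's normal ordering).
 */
static size_t sketch_pick_next(const struct sketch_request *rq, size_t n,
			       unsigned long head, unsigned long now_ms,
			       unsigned long expire_ms)
{
	size_t oldest = 0, nearest = 0, i;

	assert(rq != NULL && n > 0);
	for (i = 1; i < n; i++) {
		if (rq[i].submitted_ms < rq[oldest].submitted_ms)
			oldest = i;
		if (sketch_distance(rq[i].sector, head) <
		    sketch_distance(rq[nearest].sector, head))
			nearest = i;
	}
	if (now_ms - rq[oldest].submitted_ms > expire_ms)
		return oldest;
	return nearest;
}
```

With a 50ms deadline, a request queued at t=100 is passed over at t=145 in
favour of a nearer one, but is forced to the front at t=160.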


Just about all of the rest of the complexity involves an array of fixups
which prevent most of the obvious failure modes with anticipation: trying to
not leave the disk head pointlessly idle.  Some of these algorithms are:

- Process tracking.  If the process whose read we are anticipating submits
  a write, abandon anticipation.

- Process exit tracking.  If the process whose read we are anticipating
  exits, abandon anticipation.

- Process IO history.  We accumulate statistical info on the process's
  recent IO patterns to aid in making decisions about how long to anticipate
  new reads.

  Currently thinktime and seek distance are tracked. Thinktime is the
  time between when a process's last request has completed and when it
  submits another one. Seek distance is simply the number of sectors
  between each read request. If either statistic becomes too high, then
  it isn't anticipated that the process will submit another read.

The above all means that we need a per-process "io context".  This is a fully
refcounted structure.  In this patch it is AS-only.  Later we generalise it a
little so other IO schedulers could use the same framework.

- Requests are grouped as synchronous and asynchronous whereas the deadline
  scheduler groups requests as reads and writes. This can provide better
  sync write performance, and may give better responsiveness with journalling
  filesystems (although we haven't done that yet).

  We currently detect synchronous writes by nastily setting PF_SYNCWRITE in
  current->flags.  The plan is to remove this later, and to propagate the
  sync hint from writeback_control.sync_mode into bio->bi_flags thence into
  request->flags.  Once that is done, direct-io needs to set the BIO sync
  hint as well.

- Quite a bit of complexity has also gone into bashing TCQ into
  submission. Timing for a read batch is not started until the first read
  request actually completes. A read batch also does not start until all
  outstanding writes have completed.
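
A minimal sketch of the sync/async grouping, assuming that every read counts
as synchronous and that a write counts as synchronous only when the
submitting task has PF_SYNCWRITE set.  The flag value is the one defined in
this patch; the enum and function names are hypothetical, and the scheduler
file itself is not shown on this page, so treat the rule as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit value as PF_SYNCWRITE in this patch; the name is illustrative. */
#define SKETCH_PF_SYNCWRITE 0x00200000UL

enum sketch_dir { SKETCH_READ, SKETCH_WRITE };

/*
 * Classify a request as synchronous: reads always are (assumed), writes
 * only when the submitting task is flagged as doing a sync write.
 */
static bool sketch_is_sync(enum sketch_dir dir, unsigned long task_flags)
{
	assert(dir == SKETCH_READ || dir == SKETCH_WRITE);
	return dir == SKETCH_READ ||
	       (task_flags & SKETCH_PF_SYNCWRITE) != 0;
}
```

Under this rule a write issued from inside fsync() lands in the same group as
reads, while ordinary background writeback stays asynchronous.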

AS is the default IO scheduler.  deadline may be chosen by booting with
"elevator=deadline".

There are a few reasons for retaining deadline:

- AS is often slower than deadline in random IO loads with large TCQ
  windows. The usual real world task here is OLTP database loads.

- deadline is presumably more stable.

- deadline is much simpler.



The tunable per-queue entries under /sys/block/*/iosched/ are all in
milliseconds:

* read_expire

  Controls how long until a request becomes "expired".

  It also controls the interval at which expired requests are served, so
  with it set to 50, a request might take anywhere up to 100ms to be
  serviced _if_ it is the next on the expired list.

  Obviously it can't make the disk go faster.  Result is basically the
  timeslice a reader gets in the presence of other IO.  100/((seek time /
  read_expire) + 1) is very roughly the % streaming read efficiency your disk
  should get in the presence of multiple readers.

* read_batch_expire

  Controls how much time a batch of reads is given before pending writes
  are served.  Higher value is more efficient.  Shouldn't really be below
  read_expire.

* write_ versions of the above

* antic_expire

  Controls the maximum amount of time we can anticipate a good read before
  giving up.  Many other factors may cause anticipation to be stopped early,
  or some processes will not be "anticipated" at all.  Should be a bit higher
  for big seek time devices though not a linear correspondence - most
  processes have only a few ms thinktime.
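
For read_expire, the efficiency estimate can be worked through with a small
helper.  This sketch assumes that efficiency means the share of each cycle
spent transferring data, where a cycle is one read_expire timeslice plus one
seek to reach the next reader, i.e. read_expire / (read_expire + seek time).
That reading and the function name are assumptions, and the result is as
rough as the description says.

```c
#include <assert.h>

/*
 * Very rough streaming read efficiency, in percent, with multiple readers:
 * the fraction of each (timeslice + seek) cycle spent in the timeslice.
 */
static double sketch_read_efficiency_pct(double seek_ms, double read_expire_ms)
{
	assert(seek_ms >= 0.0 && read_expire_ms > 0.0);
	return 100.0 * read_expire_ms / (read_expire_ms + seek_ms);
}
```

With a 10ms seek and read_expire at 50 this gives a little over 83%; raising
read_expire to 100 lifts it to about 91%, at the cost of longer waits for
other readers.
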
parent 104e6fdc
......@@ -13,7 +13,8 @@
# kblockd threads
#
-obj-y := elevator.o ll_rw_blk.o ioctl.o genhd.o scsi_ioctl.o deadline-iosched.o
+obj-y := elevator.o ll_rw_blk.o ioctl.o genhd.o scsi_ioctl.o \
+deadline-iosched.o as-iosched.o
obj-$(CONFIG_MAC_FLOPPY) += swim3.o
obj-$(CONFIG_BLK_DEV_FD) += floppy.o
......
This diff is collapsed.
......@@ -1033,7 +1033,7 @@ static inline void __generic_unplug_device(request_queue_t *q)
/*
* was plugged, fire request_fn if queue has stuff to do
*/
-if (!elv_queue_empty(q))
+if (elv_next_request(q))
q->request_fn(q);
}
......@@ -1204,6 +1204,18 @@ static int blk_init_free_list(request_queue_t *q)
static int __make_request(request_queue_t *, struct bio *);
+static elevator_t *chosen_elevator = &iosched_as;
+static int __init elevator_setup(char *str)
+{
+if (!strcmp(str, "deadline"))
+chosen_elevator = &iosched_deadline;
+if (!strcmp(str, "as"))
+chosen_elevator = &iosched_as;
+return 1;
+}
+__setup("elevator=", elevator_setup);
/**
* blk_init_queue - prepare a request queue for use with a block device
* @q: The &request_queue_t to be initialised
......@@ -1235,11 +1247,20 @@ static int __make_request(request_queue_t *, struct bio *);
int blk_init_queue(request_queue_t *q, request_fn_proc *rfn, spinlock_t *lock)
{
int ret;
+static int printed;
if (blk_init_free_list(q))
return -ENOMEM;
-if ((ret = elevator_init(q, &iosched_deadline))) {
+if (!printed) {
+printed = 1;
+if (chosen_elevator == &iosched_deadline)
+printk("deadline elevator\n");
+else if (chosen_elevator == &iosched_as)
+printk("anticipatory scheduling elevator\n");
+}
+if ((ret = elevator_init(q, chosen_elevator))) {
blk_cleanup_queue(q);
return ret;
}
......
......@@ -319,6 +319,7 @@ asmlinkage long sys_fsync(unsigned int fd)
/* We need to protect against concurrent writers.. */
down(&inode->i_sem);
+current->flags |= PF_SYNCWRITE;
ret = filemap_fdatawrite(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 0);
if (!ret)
......@@ -326,6 +327,7 @@ asmlinkage long sys_fsync(unsigned int fd)
err = filemap_fdatawait(inode->i_mapping);
if (!ret)
ret = err;
+current->flags &= ~PF_SYNCWRITE;
up(&inode->i_sem);
out_putf:
......@@ -354,6 +356,7 @@ asmlinkage long sys_fdatasync(unsigned int fd)
goto out_putf;
down(&inode->i_sem);
+current->flags |= PF_SYNCWRITE;
ret = filemap_fdatawrite(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 1);
if (!ret)
......@@ -361,6 +364,7 @@ asmlinkage long sys_fdatasync(unsigned int fd)
err = filemap_fdatawait(inode->i_mapping);
if (!ret)
ret = err;
+current->flags &= ~PF_SYNCWRITE;
up(&inode->i_sem);
out_putf:
......
......@@ -516,6 +516,7 @@ int generic_osync_inode(struct inode *inode, int what)
int need_write_inode_now = 0;
int err2;
+current->flags |= PF_SYNCWRITE;
if (what & OSYNC_DATA)
err = filemap_fdatawrite(inode->i_mapping);
if (what & (OSYNC_METADATA|OSYNC_DATA)) {
......@@ -528,6 +529,7 @@ int generic_osync_inode(struct inode *inode, int what)
if (!err)
err = err2;
}
+current->flags &= ~PF_SYNCWRITE;
spin_lock(&inode_lock);
if ((inode->i_state & I_DIRTY) &&
......
......@@ -89,6 +89,11 @@ extern elevator_t elevator_noop;
*/
extern elevator_t iosched_deadline;
+/*
+ * anticipatory I/O scheduler
+ */
+extern elevator_t iosched_as;
extern int elevator_init(request_queue_t *, elevator_t *);
extern void elevator_exit(request_queue_t *);
extern inline int elv_rq_merge_ok(struct request *, struct bio *);
......
......@@ -321,6 +321,8 @@ struct k_itimer {
};
+struct as_io_context; /* Anticipatory scheduler */
+void exit_as_io_context(void);
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
......@@ -450,6 +452,8 @@ struct task_struct {
struct dentry *proc_dentry;
struct backing_dev_info *backing_dev_info;
+struct as_io_context *as_io_context;
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
};
......@@ -481,6 +485,7 @@ do { if (atomic_dec_and_test(&(tsk)->usage)) __put_task_struct(tsk); } while(0)
#define PF_KSWAPD 0x00040000 /* I am kswapd */
#define PF_SWAPOFF 0x00080000 /* I am in swapoff */
#define PF_LESS_THROTTLE 0x01000000 /* Throttle me less: I clena memory */
+#define PF_SYNCWRITE 0x00200000 /* I am doing a sync write */
#ifdef CONFIG_SMP
extern int set_cpus_allowed(task_t *p, unsigned long new_mask);
......
......@@ -682,6 +682,8 @@ NORET_TYPE void do_exit(long code)
panic("Attempted to kill the idle task!");
if (unlikely(tsk->pid == 1))
panic("Attempted to kill init!");
+if (tsk->as_io_context)
+exit_as_io_context();
tsk->flags |= PF_EXITING;
del_timer_sync(&tsk->real_timer);
......
......@@ -864,6 +864,7 @@ struct task_struct *copy_process(unsigned long clone_flags,
p->lock_depth = -1; /* -1 = no lock */
p->start_time = get_jiffies_64();
p->security = NULL;
+p->as_io_context = NULL;
retval = -ENOMEM;
if ((retval = security_task_alloc(p)))
......