Commit 799391cc authored by Andrew Morton, committed by Arnaldo Carvalho de Melo

[PATCH] improved I/O scheduling for indirect blocks

Fixes a performance problem with many-small-file writeout.

At present, files are written out via their mapping and their indirect
blocks are written out via the blockdev mapping.  As we know that
indirects are disk-adjacent to the data it is better to start I/O
against the indirects at the same time as the data.
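The adjacency argument can be illustrated with a toy userspace model.
This is not kernel code and not how the elevator is implemented: the
function name count_contiguous_runs() and the block numbers in the
usage note are made up for illustration.  It sorts the submitted block
numbers and counts how many contiguous runs (roughly, merged requests)
they form, assuming every block number appears at most once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for block numbers. */
static int cmp_block(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return (x > y) - (x < y);
}

/*
 * Toy model of request merging: sort the block numbers in place and
 * count the maximal runs of consecutive blocks.  Each run stands for
 * one merged request.  Assumes the block numbers are distinct.
 */
static size_t count_contiguous_runs(unsigned long *blocks, size_t n)
{
	size_t runs, i;

	assert(blocks != NULL || n == 0);
	if (n == 0)
		return 0;

	qsort(blocks, n, sizeof(*blocks), cmp_block);

	runs = 1;
	for (i = 1; i < n; i++)
		if (blocks[i] != blocks[i - 1] + 1)
			runs++;	/* gap: a new request starts here */
	return runs;
}
```

If data blocks 100-102 and 104-105 are queued and the indirect sits at
block 103, the data alone forms two runs.  Queueing block 103 at the
same time closes the gap and leaves a single run.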

The delalloc paths have code in ext2_writepage() which recognises when
the target page->index is at an indirect boundary and does an explicit
hunt-and-write against the neighbouring indirect block, which is ideal
(unless the file was dirtied seekily and the page which is next to the
indirect was not dirtied).

This patch does it the other way: when we start writeback against a
mapping, also start writeback against any dirty buffers which are
attached to mapping->private_list.  Let the elevator take care of the
rest.
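The list walk this relies on can be sketched in userspace as below.
It is a simplified model of the write_mapping_buffers() loop added by
this patch, under loud assumptions: struct buf and the list helpers
are hypothetical stand-ins for buffer_head and list_head, "starting
I/O" only sets a flag, and there is no locking at all.  The point is
the shape of the loop: take buffers from the tail, drop the ones which
are clean and unlocked, rotate the rest to the far end, and bound the
work to twice the initial list length so that late additions cannot
livelock the walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head on mapping->private_list. */
struct buf {
	struct buf *prev, *next;
	int dirty;	/* contents need writing */
	int locked;	/* I/O already in flight */
	int submitted;	/* set when this sketch "starts I/O" */
};

static void list_init(struct buf *head)
{
	head->prev = head->next = head;
}

static int list_is_empty(const struct buf *head)
{
	return head->next == head;
}

/* Unlink and leave the node pointing at itself, like list_del_init(). */
static void list_unlink(struct buf *b)
{
	b->prev->next = b->next;
	b->next->prev = b->prev;
	b->prev = b->next = b;
}

/* Insert b between two neighbouring nodes. */
static void list_insert(struct buf *b, struct buf *before, struct buf *after)
{
	b->prev = before;
	b->next = after;
	before->next = b;
	after->prev = b;
}

static void list_add_head(struct buf *b, struct buf *head)
{
	list_insert(b, head, head->next);
}

static void list_add_tail(struct buf *b, struct buf *head)
{
	list_insert(b, head->prev, head);
}

/* Start "I/O" on dirty buffers, tail first; returns how many were started. */
static unsigned write_list_bounded(struct buf *head)
{
	unsigned budget = 0, submitted = 0;
	struct buf *b;

	for (b = head->next; b != head; b = b->next)
		budget++;
	budget *= 2;	/* allow for some late additions */

	while (budget-- && !list_is_empty(head)) {
		b = head->prev;
		list_unlink(b);
		if (!b->dirty && !b->locked)
			continue;	/* clean and unlocked: stays off the list */
		list_add_head(b, head);	/* rotate to the far end */
		if (b->locked)
			continue;	/* already under I/O elsewhere */
		assert(b->dirty);	/* unlocked and still listed, so dirty */
		b->locked = 1;		/* a real completion handler would unlock */
		b->dirty = 0;
		b->submitted = 1;
		submitted++;
	}
	return submitted;
}
```

Unlike the kernel function, this sketch never drops a lock around the
submission, so it has no need to re-test the dirty bit after locking
the buffer.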

The patch makes a number of tuning changes to the writeback path in
fs-writeback.c.  This is very fiddly code: getting the throughput
tuned, getting the data-integrity "sync" operations right, avoiding
most of the livelock opportunities, getting the `kupdate' function
working efficiently, keeping it all at least somewhat comprehensible.

An important intent here is to ensure that metadata blocks for inodes
are marked dirty before writeback starts working the blockdev mapping,
so all the inode blocks are efficiently written back.

The patch removes try_to_writeback_unused_inodes(), which became
unreferenced in vm-writeback.patch.

The patch has a tweak in ext2_put_inode() to prevent ext2 from
incorrectly dropping its preallocation window in response to a random
iput().


Generally, many-small-file writeout is a lot faster than 2.5.7 (which
is linux-before-I-futzed-with-it).  The workload which was optimised was

	tar xfz /nfs/mountpoint/linux-2.4.18.tar.gz ; sync

on mem=128M and mem=2048M.

With these patches, 2.5.15 is completing in about 2/3 of the time of
2.5.7.  But it is only a shade faster than 2.4.19-pre7.  Why is 2.5.7
so much slower than 2.4.19?  Not sure yet.

Heavy dbench loads (dbench 32 on mem=128M) are slightly faster than
2.5.7 and significantly slower than 2.4.19.  It appears that the cause
is poor read throughput at the later stages of the run, because there
are background writeback threads operating at the same time.

The 2.4.19-pre8 write scheduling manages to stop writeback during the
latter stages of the dbench run in a way which I haven't been able to
sanely emulate yet.  It may not be desirable to do this anyway - it's
optimising for the case where the files are about to be deleted.  But
it would be good to find a way of "pausing" the writeback for a few
seconds to allow readers to get an interval of decent bandwidth.

tiobench throughput is basically the same across all recent kernels.
CPU load on writes is down maybe 30% in 2.5.15.
parent a9f525e6
@@ -210,10 +210,7 @@ int sync_blockdev(struct block_device *bdev)
if (bdev) {
int err;
ret = filemap_fdatawait(bdev->bd_inode->i_mapping);
err = filemap_fdatawrite(bdev->bd_inode->i_mapping);
if (!ret)
ret = err;
ret = filemap_fdatawrite(bdev->bd_inode->i_mapping);
err = filemap_fdatawait(bdev->bd_inode->i_mapping);
if (!ret)
ret = err;
@@ -229,12 +226,14 @@ EXPORT_SYMBOL(sync_blockdev);
*/
int fsync_super(struct super_block *sb)
{
sync_inodes_sb(sb); /* All the inodes */
sync_inodes_sb(sb, 0);
DQUOT_SYNC(sb);
lock_super(sb);
if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
sb->s_op->write_super(sb);
unlock_super(sb);
sync_blockdev(sb->s_bdev);
sync_inodes_sb(sb, 1);
return sync_blockdev(sb->s_bdev);
}
@@ -276,10 +275,10 @@ int fsync_dev(kdev_t dev)
*/
asmlinkage long sys_sync(void)
{
sync_inodes(); /* All mappings and inodes, including block devices */
sync_inodes(0); /* All mappings and inodes, including block devices */
DQUOT_SYNC(NULL);
sync_supers(); /* Write the superblocks */
sync_inodes(); /* All the mappings and inodes, again. */
sync_inodes(1); /* All the mappings and inodes, again. */
return 0;
}
@@ -775,6 +774,80 @@ int sync_mapping_buffers(struct address_space *mapping)
}
EXPORT_SYMBOL(sync_mapping_buffers);
/**
* write_mapping_buffers - Start writeout of a mapping's "associated" buffers.
* @mapping - the mapping which wants those buffers written.
*
* Starts I/O against dirty buffers which are on @mapping->private_list.
* Those buffers must be backed by @mapping->assoc_mapping.
*
* The private_list buffers generally contain filesystem indirect blocks.
* The idea is that the filesystem can start I/O against the indirects at
* the same time as running generic_writeback_mapping(), so the indirect's
* I/O will be merged with the data.
*
* We sneakily write the buffers in probable tail-to-head order. This is
* because generic_writeback_mapping writes in probable head-to-tail
* order. If the file is so huge that the data or the indirects overflow
* the request queue we will at least get some merging this way.
*
* Any clean+unlocked buffers are de-listed. clean/locked buffers must be
* left on the list for an fsync() to wait on.
*
* Couldn't think of a smart way of avoiding livelock, so chose the dumb
* way instead.
*
* FIXME: duplicates fsync_inode_buffers() functionality a bit.
*/
int write_mapping_buffers(struct address_space *mapping)
{
spinlock_t *lock;
struct address_space *buffer_mapping;
unsigned nr_to_write; /* livelock avoidance */
struct list_head *lh;
int ret = 0;
if (list_empty(&mapping->private_list))
goto out;
buffer_mapping = mapping->assoc_mapping;
lock = &buffer_mapping->private_lock;
spin_lock(lock);
nr_to_write = 0;
lh = mapping->private_list.next;
while (lh != &mapping->private_list) {
lh = lh->next;
nr_to_write++;
}
nr_to_write *= 2; /* Allow for some late additions */
while (nr_to_write-- && !list_empty(&mapping->private_list)) {
struct buffer_head *bh;
bh = BH_ENTRY(mapping->private_list.prev);
list_del_init(&bh->b_assoc_buffers);
if (!buffer_dirty(bh) && !buffer_locked(bh))
continue;
/* Stick it on the far end of the list. Order is preserved. */
list_add(&bh->b_assoc_buffers, &mapping->private_list);
if (test_set_buffer_locked(bh))
continue;
get_bh(bh);
spin_unlock(lock);
if (test_clear_buffer_dirty(bh)) {
bh->b_end_io = end_buffer_io_sync;
submit_bh(WRITE, bh);
} else {
unlock_buffer(bh);
put_bh(bh);
}
spin_lock(lock);
}
spin_unlock(lock);
out:
return ret;
}
void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
{
struct address_space *mapping = inode->i_mapping;
......
@@ -41,7 +41,7 @@ static int ext2_update_inode(struct inode * inode, int do_sync);
*/
void ext2_put_inode (struct inode * inode)
{
if (atomic_read(&inode->i_count) < 2)
if (atomic_read(&inode->i_count) < 2) /* final iput? */
ext2_discard_prealloc (inode);
}
@@ -584,6 +584,20 @@ static int ext2_direct_IO(int rw, struct inode * inode, struct kiobuf * iobuf, u
{
return generic_direct_IO(rw, inode, iobuf, blocknr, blocksize, ext2_get_block);
}
static int
ext2_writeback_mapping(struct address_space *mapping, int *nr_to_write)
{
int ret;
int err;
ret = write_mapping_buffers(mapping);
err = generic_writeback_mapping(mapping, nr_to_write);
if (!ret)
ret = err;
return ret;
}
struct address_space_operations ext2_aops = {
readpage: ext2_readpage,
writepage: ext2_writepage,
@@ -592,7 +606,7 @@ struct address_space_operations ext2_aops = {
commit_write: generic_commit_write,
bmap: ext2_bmap,
direct_IO: ext2_direct_IO,
writeback_mapping: generic_writeback_mapping,
writeback_mapping: ext2_writeback_mapping,
vm_writeback: generic_vm_writeback,
};
......
This diff is collapsed.
@@ -311,6 +311,7 @@ int invalidate_inodes(struct super_block * sb)
busy = invalidate_list(&inode_in_use, sb, &throw_away);
busy |= invalidate_list(&inode_unused, sb, &throw_away);
busy |= invalidate_list(&sb->s_dirty, sb, &throw_away);
busy |= invalidate_list(&sb->s_io, sb, &throw_away);
busy |= invalidate_list(&sb->s_locked_inodes, sb, &throw_away);
spin_unlock(&inode_lock);
@@ -896,6 +897,11 @@ void remove_dquot_ref(struct super_block *sb, short type)
if (IS_QUOTAINIT(inode))
remove_inode_dquot_ref(inode, type, &tofree_head);
}
list_for_each(act_head, &sb->s_io) {
inode = list_entry(act_head, struct inode, i_list);
if (IS_QUOTAINIT(inode))
remove_inode_dquot_ref(inode, type, &tofree_head);
}
list_for_each(act_head, &sb->s_locked_inodes) {
inode = list_entry(act_head, struct inode, i_list);
if (IS_QUOTAINIT(inode))
......
@@ -48,6 +48,7 @@ static struct super_block *alloc_super(void)
if (s) {
memset(s, 0, sizeof(struct super_block));
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_locked_inodes);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
@@ -154,6 +155,9 @@ static int grab_super(struct super_block *s)
*
* Associates superblock with fs type and puts it on per-type and global
* superblocks' lists. Should be called with sb_lock held; drops it.
*
* NOTE: the super_blocks ordering here is important: writeback wants
* the blockdev superblock to be at super_blocks.next.
*/
static void insert_super(struct super_block *s, struct file_system_type *type)
{
......
@@ -29,6 +29,7 @@ enum bh_state_bits {
struct page;
struct kiobuf;
struct buffer_head;
struct address_space;
typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
/*
@@ -145,14 +146,19 @@ int try_to_free_buffers(struct page *);
void create_empty_buffers(struct page *, unsigned long,
unsigned long b_state);
void end_buffer_io_sync(struct buffer_head *bh, int uptodate);
/* Things to do with buffers at mapping->private_list */
void buffer_insert_list(spinlock_t *lock,
struct buffer_head *, struct list_head *);
int sync_mapping_buffers(struct address_space *mapping);
void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
int write_mapping_buffers(struct address_space *mapping);
int inode_has_buffers(struct inode *);
void invalidate_inode_buffers(struct inode *);
int fsync_buffers_list(spinlock_t *lock, struct list_head *);
int sync_mapping_buffers(struct address_space *mapping);
void mark_buffer_async_read(struct buffer_head *bh);
void mark_buffer_async_write(struct buffer_head *bh);
void invalidate_inode_buffers(struct inode *);
void invalidate_bdev(struct block_device *, int);
void __invalidate_buffers(kdev_t dev, int);
int sync_blockdev(struct block_device *bdev);
@@ -163,8 +169,6 @@ int fsync_dev(kdev_t);
int fsync_bdev(struct block_device *);
int fsync_super(struct super_block *);
int fsync_no_super(struct block_device *);
int fsync_buffers_list(spinlock_t *lock, struct list_head *);
int inode_has_buffers(struct inode *);
struct buffer_head *__get_hash_table(struct block_device *, sector_t, int);
struct buffer_head * __getblk(struct block_device *, sector_t, int);
void __brelse(struct buffer_head *);
......
@@ -618,7 +618,6 @@ struct super_block {
kdev_t s_dev;
unsigned long s_blocksize;
unsigned long s_old_blocksize;
unsigned short s_writeback_gen;/* To avoid writeback livelock */
unsigned char s_blocksize_bits;
unsigned char s_dirt;
unsigned long long s_maxbytes; /* Max file size */
@@ -632,9 +631,11 @@ struct super_block {
struct rw_semaphore s_umount;
struct semaphore s_lock;
int s_count;
int s_syncing;
atomic_t s_active;
struct list_head s_dirty; /* dirty inodes */
struct list_head s_io; /* parked for writeback */
struct list_head s_locked_inodes;/* inodes being synced */
struct list_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;
@@ -1116,7 +1117,6 @@ extern int invalidate_device(kdev_t, int);
extern void invalidate_inode_pages(struct inode *);
extern void invalidate_inode_pages2(struct address_space *);
extern void write_inode_now(struct inode *, int);
extern void sync_inodes_sb(struct super_block *);
extern int filemap_fdatawrite(struct address_space *);
extern int filemap_fdatawait(struct address_space *);
extern void sync_supers(void);
......
@@ -27,15 +27,13 @@ static inline int current_is_pdflush(void)
#define WB_SYNC_NONE 0 /* Don't wait on anything */
#define WB_SYNC_LAST 1 /* Wait on the last-written mapping */
#define WB_SYNC_ALL 2 /* Wait on every mapping */
#define WB_SYNC_HOLD 3 /* Hold the inode on sb_dirty for sys_sync() */
void try_to_writeback_unused_inodes(unsigned long pexclusive);
void writeback_single_inode(struct inode *inode,
int sync, int *nr_to_write);
void writeback_unlocked_inodes(int *nr_to_write, int sync_mode,
unsigned long *older_than_this);
void writeback_inodes_sb(struct super_block *);
void __wait_on_inode(struct inode * inode);
void sync_inodes(void);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);
static inline void wait_on_inode(struct inode *inode)
{
......