Commit 43152186 authored by Andrew Morton's avatar Andrew Morton Committed by Arnaldo Carvalho de Melo

[PATCH] i_dirty_buffers locking fix

This fixes a race between try_to_free_buffers' call to
__remove_inode_queue() and other users of b_inode_buffers
(fsync_inode_buffers and mark_buffer_dirty_inode()).  They are
presently taking different locks.

The patch relocates and redefines and clarifies(?) the role of
inode.i_dirty_buffers.

The 2.4 definition of i_dirty_buffers is "a list of random buffers
which is protected by a kernel-wide lock".  This definition needs to be
narrowed in the 2.5 context.  It is now

"a list of buffers from a different mapping, protected by a lock within
that mapping".  This list of buffers is specifically for fsync().

As this is a "data plane" operation, all the structures have been moved
out of the inode and into the address_space.  So address_space now has:

list_head private_list;

     A list, available to the address_space for any purpose.  If
     that address_space chooses to use the helper functions
     mark_buffer_dirty_inode and sync_mapping_buffers() then this list
     will contain buffer_heads, attached via
     buffer_head.b_assoc_buffers.

     If the address_space does not call those helper functions
     then the list is free for other usage.  The only requirement is
     that the list be list_empty() at destroy_inode() time.

     At least, this is the objective.  At present,
     generic_file_write() will call generic_osync_inode(), which
     expects that list to contain buffer_heads.  So private_list isn't
     useful for anything else yet.

spinlock_t private_lock;

     A spinlock, available to the address_space.

     If the address_space is using try_to_free_buffers(),
     mark_inode_dirty_buffers() and fsync_inode_buffers() then this
     lock is used to protect the private_list of *other* mappings which
     have listed buffers from *this* mapping onto themselves.

     That is: for buffer_heads, mapping_A->private_lock does not
     protect mapping_A->private_list!  It protects the b_assoc_buffers
     list from buffers which are backed by mapping_A and it protects
     mapping_B->private_list, mapping_C->private_list, ...

     So what we have here is a cross-mapping association.  S_ISREG
     mappings maintain a list of buffers from the blockdev's
     address_space which they need to know about for a successful
     fsync().  The locking follows the buffers: the lock in in the
     blockdev's mapping, not in the S_ISREG file's mapping.

     For address_spaces which use try_to_free_buffers,
     private_lock is also (and quite unrelatedly) used for protection
     of the buffer ring at page->private.  Exclusion between
     try_to_free_buffers(), __get_hash_table() and
     __set_page_dirty_buffers().  This is in fact its major use.

address_space *assoc_mapping

    Sigh.  This is the address of the mapping which backs the
    buffers which are attached to private_list.  It's here so that
    generic_osync_inode() can locate the lock which protects this
    mapping's private_list.  Will probably go away.


A consequence of all the above is that:

    a) All the buffers at a mapping_A's ->private_list must come
       from the same mapping, mapping_B.  There is no requirement that
       mapping_B be a blockdev mapping, but that's how it's used.

       There is a BUG() check in mark_buffer_dirty_inode() for this.

    b) blockdev mappings never have any buffers on ->private_list.
       It just never happens, and doesn't make a lot of sense.

reiserfs is using b_inode_buffers for attaching dependent buffers to its
journal and that caused a few problems.  Fixed in reiserfs_releasepage.patch
parent 6b9f3b41
This diff is collapsed.
......@@ -37,7 +37,7 @@ int ext2_sync_file(struct file * file, struct dentry *dentry, int datasync)
struct inode *inode = dentry->d_inode;
int err;
err = fsync_inode_buffers(inode);
err = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
......
......@@ -41,7 +41,8 @@ static int ext2_update_inode(struct inode * inode, int do_sync);
*/
void ext2_put_inode (struct inode * inode)
{
ext2_discard_prealloc (inode);
if (atomic_read(&inode->i_count) < 2)
ext2_discard_prealloc (inode);
}
/*
......@@ -860,7 +861,7 @@ void ext2_truncate (struct inode * inode)
}
inode->i_mtime = inode->i_ctime = CURRENT_TIME;
if (IS_SYNC(inode)) {
fsync_inode_buffers(inode);
sync_mapping_buffers(inode->i_mapping);
ext2_sync_inode (inode);
} else {
mark_inode_dirty(inode);
......
......@@ -55,13 +55,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
J_ASSERT(ext3_journal_current_handle() == 0);
/*
* fsync_inode_buffers() just walks i_dirty_buffers and waits
* fsync_inode_buffers() just walks private_list and waits
* on them. It's a no-op for full data journalling because
* i_dirty_buffers will be ampty.
* private_list will be empty.
* Really, we only need to start I/O on the dirty buffers -
* we'll end up waiting on them in commit.
*/
ret = fsync_inode_buffers(inode);
ret = sync_mapping_buffers(inode->i_mapping);
ext3_force_commit(inode->i_sb);
return ret;
......
......@@ -1078,14 +1078,8 @@ static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
* We need to pick up the new inode size which generic_commit_write gave us
* `file' can be NULL - eg, when called from block_symlink().
*
* ext3 inode->i_dirty_buffers policy: If we're journalling data we
* definitely don't want them to appear on the inode at all - instead
* we need to manage them at the JBD layer and we need to intercept
* the relevant sync operations and translate them into journal operations.
*
* If we're not journalling data then we can just leave the buffers
* on ->i_dirty_buffers. If someone writes them out for us then thanks.
* Otherwise we'll do it in commit, if we're using ordered data.
* ext3 never places buffers on inode->i_mapping->private_list. metadata
* buffers are managed internally.
*/
static int ext3_commit_write(struct file *file, struct page *page,
......
......@@ -467,43 +467,34 @@ void write_inode_now(struct inode *inode, int sync)
/**
* generic_osync_inode - flush all dirty data for a given inode to disk
* @inode: inode to write
* @datasync: if set, don't bother flushing timestamps
* @what: what to write and wait upon
*
* This can be called by file_write functions for files which have the
* O_SYNC flag set, to flush dirty writes to disk.
* O_SYNC flag set, to flush dirty writes to disk.
*
* @what is a bitmask, specifying which part of the inode's data should be
* written and waited upon:
*
* OSYNC_DATA: i_mapping's dirty data
* OSYNC_METADATA: the buffers at i_mapping->private_list
* OSYNC_INODE: the inode itself
*/
int generic_osync_inode(struct inode *inode, int what)
{
int err = 0, err2 = 0, need_write_inode_now = 0;
/*
* WARNING
*
* Currently, the filesystem write path does not pass the
* filp down to the low-level write functions. Therefore it
* is impossible for (say) __block_commit_write to know if
* the operation is O_SYNC or not.
*
* Ideally, O_SYNC writes would have the filesystem call
* ll_rw_block as it went to kick-start the writes, and we
* could call osync_inode_buffers() here to wait only for
* those IOs which have already been submitted to the device
* driver layer. As it stands, if we did this we'd not write
* anything to disk since our writes have not been queued by
* this point: they are still on the dirty LRU.
*
* So, currently we will call fsync_inode_buffers() instead,
* to flush _all_ dirty buffers for this inode to disk on
* every O_SYNC write, not just the synchronous I/Os. --sct
*/
int err = 0;
int need_write_inode_now = 0;
int err2;
if (what & OSYNC_DATA)
writeback_single_inode(inode, 0, NULL);
if (what & (OSYNC_METADATA|OSYNC_DATA))
err = fsync_inode_buffers(inode);
err = filemap_fdatawrite(inode->i_mapping);
if (what & (OSYNC_METADATA|OSYNC_DATA)) {
err2 = sync_mapping_buffers(inode->i_mapping);
if (!err)
err = err2;
}
if (what & OSYNC_DATA) {
err2 = filemap_fdatawrite(inode->i_mapping);
err2 = filemap_fdatawait(inode->i_mapping);
if (!err)
err = err2;
}
......
......@@ -106,6 +106,7 @@ static struct inode *alloc_inode(struct super_block *sb)
inode->i_data.dirtied_when = 0;
inode->i_mapping = &inode->i_data;
inode->i_data.ra_pages = &default_ra_pages;
inode->i_data.assoc_mapping = NULL;
if (sb->s_bdev)
inode->i_data.ra_pages = sb->s_bdev->bd_inode->i_mapping->ra_pages;
memset(&inode->u, 0, sizeof(inode->u));
......@@ -139,13 +140,13 @@ void inode_init_once(struct inode *inode)
INIT_LIST_HEAD(&inode->i_data.locked_pages);
INIT_LIST_HEAD(&inode->i_data.io_pages);
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_dirty_buffers);
INIT_LIST_HEAD(&inode->i_devices);
sema_init(&inode->i_sem, 1);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.page_lock);
spin_lock_init(&inode->i_data.i_shared_lock);
spin_lock_init(&inode->i_bufferlist_lock);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_LIST_HEAD(&inode->i_data.i_mmap);
INIT_LIST_HEAD(&inode->i_data.i_mmap_shared);
}
......
......@@ -31,7 +31,7 @@ int minix_sync_file(struct file * file, struct dentry *dentry, int datasync)
struct inode *inode = dentry->d_inode;
int err;
err = fsync_inode_buffers(inode);
err = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
......
......@@ -1510,6 +1510,13 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
INIT_LIST_HEAD(&vol->mftbmp_mapping.i_mmap);
INIT_LIST_HEAD(&vol->mftbmp_mapping.i_mmap_shared);
spin_lock_init(&vol->mftbmp_mapping.i_shared_lock);
/*
* private_lock and private_list are unused by ntfs. But they
* are available.
*/
spin_lock_init(&vol->mftbmp_mapping.private_lock);
INIT_LIST_HEAD(&vol->mftbmp_mapping.private_list);
vol->mftbmp_mapping.assoc_mapping = NULL;
vol->mftbmp_mapping.dirtied_when = 0;
vol->mftbmp_mapping.gfp_mask = GFP_HIGHUSER;
vol->mftbmp_mapping.ra_pages =
......
......@@ -85,7 +85,7 @@ static int reiserfs_sync_file(
if (!S_ISREG(p_s_inode->i_mode))
BUG ();
n_err = fsync_inode_buffers(p_s_inode) ;
n_err = sync_mapping_buffers(p_s_inode->i_mapping) ;
reiserfs_commit_for_inode(p_s_inode) ;
unlock_kernel() ;
return ( n_err < 0 ) ? -EIO : 0;
......
......@@ -36,7 +36,7 @@ int sysv_sync_file(struct file * file, struct dentry *dentry, int datasync)
struct inode *inode = dentry->d_inode;
int err;
err = fsync_inode_buffers(inode);
err = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
......
......@@ -44,7 +44,7 @@ int udf_fsync_inode(struct inode *inode, int datasync)
{
int err;
err = fsync_inode_buffers(inode);
err = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
......
......@@ -50,7 +50,7 @@ struct buffer_head {
struct block_device *b_bdev;
bh_end_io_t *b_end_io; /* I/O completion */
void *b_private; /* reserved for b_end_io */
struct list_head b_inode_buffers; /* list of inode dirty buffers */
struct list_head b_assoc_buffers; /* associated with another mapping */
};
......@@ -147,6 +147,8 @@ void create_empty_buffers(struct page *, unsigned long,
void end_buffer_io_sync(struct buffer_head *bh, int uptodate);
void buffer_insert_list(spinlock_t *lock,
struct buffer_head *, struct list_head *);
int sync_mapping_buffers(struct address_space *mapping);
void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
void mark_buffer_async_read(struct buffer_head *bh);
void mark_buffer_async_write(struct buffer_head *bh);
......@@ -217,14 +219,6 @@ static inline void put_bh(struct buffer_head *bh)
atomic_dec(&bh->b_count);
}
static inline void
mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
{
mark_buffer_dirty(bh);
buffer_insert_list(&inode->i_bufferlist_lock,
bh, &inode->i_dirty_buffers);
}
/*
* If an error happens during the make_request, this function
* has to be recalled. It marks the buffer as clean and not
......@@ -243,11 +237,6 @@ static inline void buffer_IO_error(struct buffer_head * bh)
bh->b_end_io(bh, buffer_uptodate(bh));
}
static inline int fsync_inode_buffers(struct inode *inode)
{
return fsync_buffers_list(&inode->i_bufferlist_lock,
&inode->i_dirty_buffers);
}
static inline void brelse(struct buffer_head *buf)
{
......
......@@ -306,6 +306,7 @@ struct address_space_operations {
};
struct address_space {
struct inode *host; /* owner: inode, block_device */
struct radix_tree_root page_tree; /* radix tree of all pages */
rwlock_t page_lock; /* and rwlock protecting it */
struct list_head clean_pages; /* list of clean pages */
......@@ -314,13 +315,15 @@ struct address_space {
struct list_head io_pages; /* being prepared for I/O */
unsigned long nrpages; /* number of total pages */
struct address_space_operations *a_ops; /* methods */
struct inode *host; /* owner: inode, block_device */
list_t i_mmap; /* list of private mappings */
list_t i_mmap_shared; /* list of private mappings */
spinlock_t i_shared_lock; /* and spinlock protecting it */
unsigned long dirtied_when; /* jiffies of first page dirtying */
int gfp_mask; /* how to allocate the pages */
unsigned long *ra_pages; /* device readahead */
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
};
struct char_device {
......@@ -350,10 +353,6 @@ struct inode {
struct list_head i_hash;
struct list_head i_list;
struct list_head i_dentry;
struct list_head i_dirty_buffers; /* uses i_bufferlist_lock */
spinlock_t i_bufferlist_lock;
unsigned long i_ino;
atomic_t i_count;
kdev_t i_dev;
......
......@@ -42,7 +42,7 @@
*
* pagemap_lru_lock
* ->i_shared_lock (vmtruncate)
* ->i_bufferlist_lock (__free_pte->__set_page_dirty_buffers)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->mapping->page_lock
* ->inode_lock (__mark_inode_dirty)
* ->sb_lock (fs/fs-writeback.c)
......
......@@ -450,7 +450,7 @@ EXPORT_SYMBOL(write_one_page);
* It's better to have clean pages accidentally attached to dirty_pages than to
* leave dirty pages attached to clean_pages.
*
* We use i_bufferlist_lock to lock against try_to_free_buffers while using the
* We use private_lock to lock against try_to_free_buffers while using the
* page's buffer list. Also use this to protect against clean buffers being
* added to the page after it was set dirty.
*
......@@ -462,18 +462,15 @@ EXPORT_SYMBOL(write_one_page);
*/
int __set_page_dirty_buffers(struct page *page)
{
struct address_space * const mapping = page->mapping;
int ret = 0;
struct address_space *mapping = page->mapping;
struct inode *inode;
if (mapping == NULL) {
SetPageDirty(page);
goto out;
}
inode = mapping->host;
spin_lock(&inode->i_bufferlist_lock);
spin_lock(&mapping->private_lock);
if (page_has_buffers(page) && !PageSwapCache(page)) {
struct buffer_head *head = page_buffers(page);
......@@ -496,7 +493,7 @@ int __set_page_dirty_buffers(struct page *page)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
spin_unlock(&inode->i_bufferlist_lock);
spin_unlock(&mapping->private_lock);
out:
return ret;
}
......
......@@ -37,11 +37,10 @@ static struct address_space_operations swap_aops = {
};
/*
* swapper_inode is needed only for for i_bufferlist_lock. This
* avoid special-casing in other parts of the kernel.
* swapper_inode doesn't do anything much. It is really only here to
* avoid some special-casing in other parts of the kernel.
*/
static struct inode swapper_inode = {
i_bufferlist_lock: SPIN_LOCK_UNLOCKED,
i_mapping: &swapper_space,
};
......@@ -55,6 +54,8 @@ struct address_space swapper_space = {
host: &swapper_inode,
a_ops: &swap_aops,
i_shared_lock: SPIN_LOCK_UNLOCKED,
private_lock: SPIN_LOCK_UNLOCKED,
private_list: LIST_HEAD_INIT(swapper_space.private_list),
};
#ifdef SWAP_CACHE_INFO
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment