[PATCH] i_dirty_buffers locking fix

This fixes a race between try_to_free_buffers' call to
__remove_inode_queue() and other users of b_inode_buffers
(fsync_inode_buffers() and mark_buffer_dirty_inode()).  They are
presently taking different locks.

The patch relocates, redefines and clarifies(?) the role of
inode.i_dirty_buffers.

The 2.4 definition of i_dirty_buffers is "a list of random buffers
which is protected by a kernel-wide lock".  This definition needs to be
narrowed in the 2.5 context.  It is now "a list of buffers from a
different mapping, protected by a lock within that mapping".  This list
of buffers is specifically for fsync().

As this is a "data plane" operation, all the structures have been moved
out of the inode and into the address_space.  So address_space now has:

list_head private_list;

    A list, available to the address_space for any purpose.  If that
    address_space chooses to use the helper functions
    mark_buffer_dirty_inode() and sync_mapping_buffers() then this list
    will contain buffer_heads, attached via
    buffer_head.b_assoc_buffers.

    If the address_space does not call those helper functions then the
    list is free for other usage.  The only requirement is that the
    list be list_empty() at destroy_inode() time.

    At least, this is the objective.  At present, generic_file_write()
    will call generic_osync_inode(), which expects that list to contain
    buffer_heads.  So private_list isn't useful for anything else yet.

spinlock_t private_lock;

    A spinlock, available to the address_space.

    If the address_space is using try_to_free_buffers(),
    mark_buffer_dirty_inode() and fsync_inode_buffers() then this lock
    is used to protect the private_list of *other* mappings which have
    listed buffers from *this* mapping onto themselves.

    That is: for buffer_heads, mapping_A->private_lock does not protect
    mapping_A->private_list!  It protects the b_assoc_buffers list from
    buffers which are backed by mapping_A and it protects
    mapping_B->private_list, mapping_C->private_list, ...

    So what we have here is a cross-mapping association.  S_ISREG
    mappings maintain a list of buffers from the blockdev's
    address_space which they need to know about for a successful
    fsync().  The locking follows the buffers: the lock is in the
    blockdev's mapping, not in the S_ISREG file's mapping.

    For address_spaces which use try_to_free_buffers(), private_lock is
    also (and quite unrelatedly) used for protection of the buffer ring
    at page->private.  It provides exclusion between
    try_to_free_buffers(), __get_hash_table() and
    __set_page_dirty_buffers().  This is in fact its major use.

address_space *assoc_mapping

    Sigh.  This is the address of the mapping which backs the buffers
    which are attached to private_list.  It's here so that
    generic_osync_inode() can locate the lock which protects this
    mapping's private_list.  Will probably go away.

A consequence of all the above is that:

a) All the buffers at a mapping_A's ->private_list must come from the
   same mapping, mapping_B.  There is no requirement that mapping_B be
   a blockdev mapping, but that's how it's used.

   There is a BUG() check in mark_buffer_dirty_inode() for this.

b) blockdev mappings never have any buffers on ->private_list.  It just
   never happens, and doesn't make a lot of sense.

reiserfs is using b_inode_buffers for attaching dependent buffers to its
journal and that caused a few problems.  Fixed in
reiserfs_releasepage.patch
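The cross-mapping locking rule in the message above is easy to get backwards, so here is a minimal user-space sketch of it. This is not the code from fs/buffer.c, whose hunks are not shown on this page. Every name below (`struct model_lock`, `struct buf`, `struct mapping`, `lock_acquire()`, `lock_release()`, `mapping_init()`, `mark_buffer_dirty_assoc()`, `sync_assoc_buffers()`) is a hypothetical stand-in, a flag with assertions stands in for the spinlock, and a singly linked list stands in for `list_head`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for spinlock_t: a flag checked with assertions. */
struct model_lock { int held; };

void lock_acquire(struct model_lock *l) { assert(!l->held); l->held = 1; }
void lock_release(struct model_lock *l) { assert(l->held); l->held = 0; }

struct mapping;

/* Hypothetical stand-in for struct buffer_head. */
struct buf {
    struct mapping *backing;   /* mapping that backs this buffer (e.g. blockdev) */
    struct mapping *assoc;     /* mapping whose private_list holds us, or NULL */
    struct buf *next_assoc;    /* models buffer_head.b_assoc_buffers */
    int dirty;
};

/* Hypothetical stand-in for struct address_space. */
struct mapping {
    struct model_lock private_lock; /* protects OTHER mappings' private_list */
    struct buf *private_list;       /* buffers from assoc_mapping, for fsync() */
    struct mapping *assoc_mapping;  /* mapping backing the buffers listed above */
};

void mapping_init(struct mapping *m)
{
    m->private_lock.held = 0;
    m->private_list = NULL;
    m->assoc_mapping = NULL;
}

/* Dirty bh and list it on file_mapping, locking via the buffer's own mapping. */
void mark_buffer_dirty_assoc(struct buf *bh, struct mapping *file_mapping)
{
    struct mapping *buffer_mapping = bh->backing;

    bh->dirty = 1;
    if (file_mapping->assoc_mapping == NULL)
        file_mapping->assoc_mapping = buffer_mapping;
    else
        /* Models the BUG() check: one backing mapping per private_list. */
        assert(file_mapping->assoc_mapping == buffer_mapping);

    if (bh->assoc == NULL) {
        lock_acquire(&buffer_mapping->private_lock);
        bh->next_assoc = file_mapping->private_list;
        file_mapping->private_list = bh;
        bh->assoc = file_mapping;
        lock_release(&buffer_mapping->private_lock);
    }
}

/* Drain file_mapping's list; returns how many dirty buffers were "written". */
int sync_assoc_buffers(struct mapping *file_mapping)
{
    struct mapping *buffer_mapping = file_mapping->assoc_mapping;
    int written = 0;

    if (buffer_mapping == NULL)
        return 0;               /* nothing was ever listed */

    lock_acquire(&buffer_mapping->private_lock);
    while (file_mapping->private_list != NULL) {
        struct buf *bh = file_mapping->private_list;

        file_mapping->private_list = bh->next_assoc;
        bh->next_assoc = NULL;
        bh->assoc = NULL;
        if (bh->dirty) {
            bh->dirty = 0;
            written++;
        }
    }
    lock_release(&buffer_mapping->private_lock);
    return written;
}
```

In this model the file's own `private_lock` is never taken while its `private_list` is modified. The lock is found through `assoc_mapping`, which is the role the message gives that field in generic_osync_inode().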

Commit 43152186, authored May 19, 2002 by Andrew Morton; committed by Arnaldo Carvalho de Melo on May 19, 2002
17 changed files
--- a/fs/buffer.c
+++ b/fs/buffer.c
--- a/fs/ext2/fsync.c
+++ b/fs/ext2/fsync.c
@@ -37,7 +37,7 @@ int ext2_sync_file(struct file * file, struct dentry *dentry, int datasync)
 	struct inode *inode = dentry->d_inode;
 	int err;
 	
-	err  = fsync_inode_buffers(inode);
+	err  = sync_mapping_buffers(inode->i_mapping);
 	if (!(inode->i_state & I_DIRTY))
 		return err;
 	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))

--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -41,7 +41,8 @@ static int ext2_update_inode(struct inode * inode, int do_sync);
 */
 void ext2_put_inode (struct inode * inode)
 {
-	ext2_discard_prealloc (inode);
+	if (atomic_read(&inode->i_count) < 2)
+		ext2_discard_prealloc (inode);
 }

 /*
@@ -860,7 +861,7 @@ void ext2_truncate (struct inode * inode)
 	}
 	inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 	if (IS_SYNC(inode)) {
-		fsync_inode_buffers(inode);
+		sync_mapping_buffers(inode->i_mapping);
 		ext2_sync_inode (inode);
 	} else {
 		mark_inode_dirty(inode);

--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -55,13 +55,13 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
 	J_ASSERT(ext3_journal_current_handle() == 0);

 	/*
-	 * fsync_inode_buffers() just walks i_dirty_buffers and waits
+	 * fsync_inode_buffers() just walks private_list and waits
 	 * on them.  It's a no-op for full data journalling because
-	 * i_dirty_buffers will be ampty.
+	 * private_list will be empty.
 	 * Really, we only need to start I/O on the dirty buffers -
 	 * we'll end up waiting on them in commit.
 	 */
-	ret = fsync_inode_buffers(inode);
+	ret = sync_mapping_buffers(inode->i_mapping);
 	ext3_force_commit(inode->i_sb);

 	return ret;

--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1078,14 +1078,8 @@ static int commit_write_fn(handle_t *handle, struct buffer_head *bh)
 * We need to pick up the new inode size which generic_commit_write gave us
 * `file' can be NULL - eg, when called from block_symlink().
 *
- * ext3 inode->i_dirty_buffers policy:  If we're journalling data we
- * definitely don't want them to appear on the inode at all - instead
- * we need to manage them at the JBD layer and we need to intercept
- * the relevant sync operations and translate them into journal operations.
- *
- * If we're not journalling data then we can just leave the buffers
- * on ->i_dirty_buffers.  If someone writes them out for us then thanks.
- * Otherwise we'll do it in commit, if we're using ordered data.
+ * ext3 never places buffers on inode->i_mapping->private_list.  metadata
+ * buffers are managed internally.
 */

 static int ext3_commit_write(struct file *file, struct page *page,

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -467,43 +467,34 @@ void write_inode_now(struct inode *inode, int sync)
 /**
 * generic_osync_inode - flush all dirty data for a given inode to disk
 * @inode: inode to write
- * @datasync: if set, don't bother flushing timestamps
+ * @what:  what to write and wait upon
 *
 * This can be called by file_write functions for files which have the
- * O_SYNC flag set, to flush dirty writes to disk.  
+ * O_SYNC flag set, to flush dirty writes to disk.
+ *
+ * @what is a bitmask, specifying which part of the inode's data should be
+ * written and waited upon:
+ *
+ *    OSYNC_DATA:     i_mapping's dirty data
+ *    OSYNC_METADATA: the buffers at i_mapping->private_list
+ *    OSYNC_INODE:    the inode itself
 */

 int generic_osync_inode(struct inode *inode, int what)
 {
-	int err = 0, err2 = 0, need_write_inode_now = 0;
-	
-	/* 
-	 * WARNING
-	 *
-	 * Currently, the filesystem write path does not pass the
-	 * filp down to the low-level write functions.  Therefore it
-	 * is impossible for (say) __block_commit_write to know if
-	 * the operation is O_SYNC or not.
-	 *
-	 * Ideally, O_SYNC writes would have the filesystem call
-	 * ll_rw_block as it went to kick-start the writes, and we
-	 * could call osync_inode_buffers() here to wait only for
-	 * those IOs which have already been submitted to the device
-	 * driver layer.  As it stands, if we did this we'd not write
-	 * anything to disk since our writes have not been queued by
-	 * this point: they are still on the dirty LRU.
-	 * 
-	 * So, currently we will call fsync_inode_buffers() instead,
-	 * to flush _all_ dirty buffers for this inode to disk on 
-	 * every O_SYNC write, not just the synchronous I/Os.  --sct
-	 */
+	int err = 0;
+	int need_write_inode_now = 0;
+	int err2;

 	if (what & OSYNC_DATA)
-		writeback_single_inode(inode, 0, NULL);
-	if (what & (OSYNC_METADATA|OSYNC_DATA))
-		err = fsync_inode_buffers(inode);
+		err = filemap_fdatawrite(inode->i_mapping);
+	if (what & (OSYNC_METADATA|OSYNC_DATA)) {
+		err2 = sync_mapping_buffers(inode->i_mapping);
+		if (!err)
+			err = err2;
+	}
 	if (what & OSYNC_DATA) {
-		err2 = filemap_fdatawrite(inode->i_mapping);
+		err2 = filemap_fdatawait(inode->i_mapping);
 		if (!err)
 			err = err2;
 	}
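The reordered hunk above starts data writeout, then syncs the associated buffers, then waits on the data, and it reports the first error seen. A minimal user-space sketch of that control flow follows. It assumes each step's result is supplied by the caller rather than produced by real I/O. `struct step_results`, the `STEP_*` flags and `model_osync()` are hypothetical names, the `OSYNC_*` values are illustrative only, and the sketch stops where the hunk does, so the OSYNC_INODE handling is not modelled.

```c
#include <assert.h>

/* Illustrative bit values only; the real definitions live in the kernel headers. */
#define OSYNC_METADATA (1 << 0)
#define OSYNC_DATA     (1 << 1)
#define OSYNC_INODE    (1 << 2)

/* Hypothetical flags recording which steps the model executed. */
enum {
    STEP_FDATAWRITE   = 1,
    STEP_SYNC_BUFFERS = 2,
    STEP_FDATAWAIT    = 4
};

/* Hypothetical canned return values for the three steps. */
struct step_results {
    int fdatawrite;    /* stands in for filemap_fdatawrite() */
    int sync_buffers;  /* stands in for sync_mapping_buffers() */
    int fdatawait;     /* stands in for filemap_fdatawait() */
};

/* Mirrors the hunk's ordering and its "first error wins" rule. */
int model_osync(int what, const struct step_results *r, int *steps_run)
{
    int err = 0;
    int err2;

    *steps_run = 0;

    if (what & OSYNC_DATA) {
        err = r->fdatawrite;            /* start writeout of dirty data */
        *steps_run |= STEP_FDATAWRITE;
    }
    if (what & (OSYNC_METADATA | OSYNC_DATA)) {
        err2 = r->sync_buffers;         /* write and wait on private_list */
        *steps_run |= STEP_SYNC_BUFFERS;
        if (!err)
            err = err2;
    }
    if (what & OSYNC_DATA) {
        err2 = r->fdatawait;            /* wait for the data started above */
        *steps_run |= STEP_FDATAWAIT;
        if (!err)
            err = err2;
    }
    return err;
}
```

A later failure never overwrites an earlier one, so with results {0, -5, -28} the caller sees -5. Note that OSYNC_DATA alone still runs the buffer sync, because the middle test accepts either flag.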

--- a/fs/inode.c
+++ b/fs/inode.c
@@ -106,6 +106,7 @@ static struct inode *alloc_inode(struct super_block *sb)
 		inode->i_data.dirtied_when = 0;
 		inode->i_mapping = &inode->i_data;
 		inode->i_data.ra_pages = &default_ra_pages;
+		inode->i_data.assoc_mapping = NULL;
 		if (sb->s_bdev)
 			inode->i_data.ra_pages = sb->s_bdev->bd_inode->i_mapping->ra_pages;
 		memset(&inode->u, 0, sizeof(inode->u));
@@ -139,13 +140,13 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->i_data.locked_pages);
 	INIT_LIST_HEAD(&inode->i_data.io_pages);
 	INIT_LIST_HEAD(&inode->i_dentry);
-	INIT_LIST_HEAD(&inode->i_dirty_buffers);
 	INIT_LIST_HEAD(&inode->i_devices);
 	sema_init(&inode->i_sem, 1);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.page_lock);
 	spin_lock_init(&inode->i_data.i_shared_lock);
-	spin_lock_init(&inode->i_bufferlist_lock);
+	INIT_LIST_HEAD(&inode->i_data.private_list);
+	spin_lock_init(&inode->i_data.private_lock);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap_shared);
 }

--- a/fs/minix/file.c
+++ b/fs/minix/file.c
@@ -31,7 +31,7 @@ int minix_sync_file(struct file * file, struct dentry *dentry, int datasync)
 	struct inode *inode = dentry->d_inode;
 	int err;

-	err = fsync_inode_buffers(inode);
+	err = sync_mapping_buffers(inode->i_mapping);
 	if (!(inode->i_state & I_DIRTY))
 		return err;
 	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))

--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -1510,6 +1510,13 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
 	INIT_LIST_HEAD(&vol->mftbmp_mapping.i_mmap);
 	INIT_LIST_HEAD(&vol->mftbmp_mapping.i_mmap_shared);
 	spin_lock_init(&vol->mftbmp_mapping.i_shared_lock);
+	/*
+	 * private_lock and private_list are unused by ntfs.  But they
+	 * are available.
+	 */
+	spin_lock_init(&vol->mftbmp_mapping.private_lock);
+	INIT_LIST_HEAD(&vol->mftbmp_mapping.private_list);
+	vol->mftbmp_mapping.assoc_mapping = NULL;
 	vol->mftbmp_mapping.dirtied_when = 0;
 	vol->mftbmp_mapping.gfp_mask = GFP_HIGHUSER;
 	vol->mftbmp_mapping.ra_pages =

--- a/fs/reiserfs/file.c
+++ b/fs/reiserfs/file.c
@@ -85,7 +85,7 @@ static int reiserfs_sync_file(
  if (!S_ISREG(p_s_inode->i_mode))
      BUG ();

-  n_err = fsync_inode_buffers(p_s_inode) ;
+  n_err = sync_mapping_buffers(p_s_inode->i_mapping) ;
  reiserfs_commit_for_inode(p_s_inode) ;
  unlock_kernel() ;
  return ( n_err < 0 ) ? -EIO : 0;

--- a/fs/sysv/file.c
+++ b/fs/sysv/file.c
@@ -36,7 +36,7 @@ int sysv_sync_file(struct file * file, struct dentry *dentry, int datasync)
 	struct inode *inode = dentry->d_inode;
 	int err;

-	err = fsync_inode_buffers(inode);
+	err = sync_mapping_buffers(inode->i_mapping);
 	if (!(inode->i_state & I_DIRTY))
 		return err;
 	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))

--- a/fs/udf/fsync.c
+++ b/fs/udf/fsync.c
@@ -44,7 +44,7 @@ int udf_fsync_inode(struct inode *inode, int datasync)
 {
 	int err;

-	err = fsync_inode_buffers(inode);
+	err = sync_mapping_buffers(inode->i_mapping);
 	if (!(inode->i_state & I_DIRTY))
 		return err;
 	if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))

--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -50,7 +50,7 @@ struct buffer_head {
 	struct block_device *b_bdev;
 	bh_end_io_t *b_end_io;		/* I/O completion */
 	void *b_private;		/* reserved for b_end_io */
-	struct list_head     b_inode_buffers; /* list of inode dirty buffers */
+	struct list_head b_assoc_buffers; /* associated with another mapping */
 };


@@ -147,6 +147,8 @@ void create_empty_buffers(struct page *, unsigned long,
 void end_buffer_io_sync(struct buffer_head *bh, int uptodate);
 void buffer_insert_list(spinlock_t *lock,
 			struct buffer_head *, struct list_head *);
+int sync_mapping_buffers(struct address_space *mapping);
+void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);

 void mark_buffer_async_read(struct buffer_head *bh);
 void mark_buffer_async_write(struct buffer_head *bh);
@@ -217,14 +219,6 @@ static inline void put_bh(struct buffer_head *bh)
        atomic_dec(&bh->b_count);
 }

-static inline void
-mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
-{
-	mark_buffer_dirty(bh);
-	buffer_insert_list(&inode->i_bufferlist_lock,
-			bh, &inode->i_dirty_buffers);
-}
-
 /*
 * If an error happens during the make_request, this function
 * has to be recalled. It marks the buffer as clean and not
@@ -243,11 +237,6 @@ static inline void buffer_IO_error(struct buffer_head * bh)
 	bh->b_end_io(bh, buffer_uptodate(bh));
 }

-static inline int fsync_inode_buffers(struct inode *inode)
-{
-	return fsync_buffers_list(&inode->i_bufferlist_lock,
-				&inode->i_dirty_buffers);
-}

 static inline void brelse(struct buffer_head *buf)
 {

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -306,6 +306,7 @@ struct address_space_operations {
 };

 struct address_space {
+	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
 	rwlock_t		page_lock;	/* and rwlock protecting it */
 	struct list_head	clean_pages;	/* list of clean pages */
@@ -314,13 +315,15 @@ struct address_space {
 	struct list_head	io_pages;	/* being prepared for I/O */
 	unsigned long		nrpages;	/* number of total pages */
 	struct address_space_operations *a_ops;	/* methods */
-	struct inode		*host;		/* owner: inode, block_device */
 	list_t			i_mmap;		/* list of private mappings */
 	list_t			i_mmap_shared;	/* list of private mappings */
 	spinlock_t		i_shared_lock;  /* and spinlock protecting it */
 	unsigned long		dirtied_when;	/* jiffies of first page dirtying */
 	int			gfp_mask;	/* how to allocate the pages */
 	unsigned long 		*ra_pages;	/* device readahead */
+	spinlock_t		private_lock;	/* for use by the address_space */
+	struct list_head	private_list;	/* ditto */
+	struct address_space	*assoc_mapping;	/* ditto */
 };

 struct char_device {
@@ -350,10 +353,6 @@ struct inode {
 	struct list_head	i_hash;
 	struct list_head	i_list;
 	struct list_head	i_dentry;
-
-	struct list_head	i_dirty_buffers;   /* uses i_bufferlist_lock */
-	spinlock_t		i_bufferlist_lock;
-
 	unsigned long		i_ino;
 	atomic_t		i_count;
 	kdev_t			i_dev;

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -42,7 +42,7 @@
 *
 *  pagemap_lru_lock
 *  ->i_shared_lock		(vmtruncate)
- *    ->i_bufferlist_lock	(__free_pte->__set_page_dirty_buffers)
+ *    ->private_lock		(__free_pte->__set_page_dirty_buffers)
 *      ->mapping->page_lock
 *      ->inode_lock		(__mark_inode_dirty)
 *        ->sb_lock		(fs/fs-writeback.c)

--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -450,7 +450,7 @@ EXPORT_SYMBOL(write_one_page);
 * It's better to have clean pages accidentally attached to dirty_pages than to
 * leave dirty pages attached to clean_pages.
 *
- * We use i_bufferlist_lock to lock against try_to_free_buffers while using the
+ * We use private_lock to lock against try_to_free_buffers while using the
 * page's buffer list.  Also use this to protect against clean buffers being
 * added to the page after it was set dirty.
 *
@@ -462,18 +462,15 @@ EXPORT_SYMBOL(write_one_page);
 */
 int __set_page_dirty_buffers(struct page *page)
 {
+	struct address_space * const mapping = page->mapping;
 	int ret = 0;
-	struct address_space *mapping = page->mapping;
-	struct inode *inode;

 	if (mapping == NULL) {
 		SetPageDirty(page);
 		goto out;
 	}

-	inode = mapping->host;
-
-	spin_lock(&inode->i_bufferlist_lock);
+	spin_lock(&mapping->private_lock);

 	if (page_has_buffers(page) && !PageSwapCache(page)) {
 		struct buffer_head *head = page_buffers(page);
@@ -496,7 +493,7 @@ int __set_page_dirty_buffers(struct page *page)
 		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
 	
-	spin_unlock(&inode->i_bufferlist_lock);
+	spin_unlock(&mapping->private_lock);
 out:
 	return ret;
 }

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -37,11 +37,10 @@ static struct address_space_operations swap_aops = {
 };

 /*
- * swapper_inode is needed only for for i_bufferlist_lock. This
- * avoid special-casing in other parts of the kernel.
+ * swapper_inode doesn't do anything much.  It is really only here to
+ * avoid some special-casing in other parts of the kernel.
 */
 static struct inode swapper_inode = {
-	i_bufferlist_lock:	SPIN_LOCK_UNLOCKED,
 	i_mapping:		&swapper_space,
 };

@@ -55,6 +54,8 @@ struct address_space swapper_space = {
 	host:		&swapper_inode,
 	a_ops:		&swap_aops,
 	i_shared_lock:	SPIN_LOCK_UNLOCKED,
+	private_lock:	SPIN_LOCK_UNLOCKED,
+	private_list:	LIST_HEAD_INIT(swapper_space.private_list),
 };

 #ifdef SWAP_CACHE_INFO
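The swapper_space hunk initializes private_list statically with LIST_HEAD_INIT, which makes the list head point at itself so that it starts out empty. The sketch below shows why that works in a static initializer. `struct list_head`, `LIST_HEAD_INIT` and `list_empty()` are re-declared locally in their familiar shape so the example compiles in user space. `struct demo_space`, `demo` and `demo_check()` are hypothetical names, and C99 designated initializers are used in place of the older `field:` syntax seen in the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Local re-declaration of the circular doubly linked list head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head whose next and prev both point at itself. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }

int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Hypothetical stand-in for a statically allocated address_space. */
struct demo_space {
    int nr_items;
    struct list_head private_list;
};

/*
 * The address of a member of a static object is a constant expression,
 * so the initializer may refer to the object being defined.
 */
struct demo_space demo = {
    .nr_items = 0,
    .private_list = LIST_HEAD_INIT(demo.private_list),
};

/* Sanity check: the list satisfies list_empty() before any runtime init. */
void demo_check(void)
{
    assert(list_empty(&demo.private_list));
}
```

Dynamically allocated mappings cannot use this form, which is why the fs/inode.c and fs/ntfs/super.c hunks call INIT_LIST_HEAD() at run time instead.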