Commit ace476bb authored by Andrew Morton's avatar Andrew Morton Committed by Linus Torvalds

[PATCH] jbd: descriptor buffer state fix

Fix a problem discovered by Jeff Mahoney <jeffm@suse.com>, based on an initial
patch from Chris Mason <mason@suse.com>.

journal_get_descriptor_buffer() is used to obtain a regular old buffer_head
against the blockdev mapping.  The caller will populate that bh by hand and
will then submit it for writing.

But there are problems:

a) The function sets bh->b_state nonatomically.  But this buffer is
   accessible to other CPUs via pagecache lookup.

b) The function sets the buffer dirty and then the caller populates it and
   then it is submitted for I/O.  Wrong order: there's a window in which the
   VM could write the buffer before it is fully populated.

c) The function fails to set the buffer uptodate after zeroing it.  And one
   caller forgot to mark it uptodate as well.  So if the VM happens to decide
   to write the containing page back __block_write_full_page() encounters a
   dirty, not uptodate buffer, which is an illegal state.  This was generating
   buffer_error() warnings before we removed buffer_error().

   Leaving the buffer not uptodate also means that a concurrent reader of
   /dev/hda1 could cause physical I/O against the buffer, scribbling on what
   we just put in it.

   So journal_get_descriptor_buffer() is changed to mark the buffer
   uptodate, under the buffer lock.

I considered changing journal_get_descriptor_buffer() to return a locked
buffer but there doesn't seem to be a need for this, and both callers end up
using ll_rw_block() anyway, which requires that the buffer be unlocked again.

Note that the journal_get_descriptor_buffer() callers dirty these buffers with
set_buffer_dirty().  That's a bit naughty, because it could create dirty
buffers against a clean page - an illegal state.  They really should use
mark_buffer_dirty() to dirty the page and inode as well.  But all callers will
immediately write and clean the buffer anyway, so we can safely leave this
optimising cheat in place.
Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
parent 669be61f
...@@ -412,7 +412,8 @@ void journal_commit_transaction(journal_t *journal) ...@@ -412,7 +412,8 @@ void journal_commit_transaction(journal_t *journal)
tagp = &bh->b_data[sizeof(journal_header_t)]; tagp = &bh->b_data[sizeof(journal_header_t)];
space_left = bh->b_size - sizeof(journal_header_t); space_left = bh->b_size - sizeof(journal_header_t);
first_tag = 1; first_tag = 1;
set_bit(BH_JWrite, &bh->b_state); set_buffer_jwrite(bh);
set_buffer_dirty(bh);
wbuf[bufs++] = bh; wbuf[bufs++] = bh;
/* Record it so that we can wait for IO /* Record it so that we can wait for IO
...@@ -638,7 +639,8 @@ void journal_commit_transaction(journal_t *journal) ...@@ -638,7 +639,8 @@ void journal_commit_transaction(journal_t *journal)
JBUFFER_TRACE(descriptor, "write commit block"); JBUFFER_TRACE(descriptor, "write commit block");
{ {
struct buffer_head *bh = jh2bh(descriptor); struct buffer_head *bh = jh2bh(descriptor);
set_buffer_uptodate(bh);
set_buffer_dirty(bh);
sync_dirty_buffer(bh); sync_dirty_buffer(bh);
if (unlikely(!buffer_uptodate(bh))) if (unlikely(!buffer_uptodate(bh)))
err = -EIO; err = -EIO;
......
...@@ -585,9 +585,13 @@ int journal_bmap(journal_t *journal, unsigned long blocknr, ...@@ -585,9 +585,13 @@ int journal_bmap(journal_t *journal, unsigned long blocknr,
* We play buffer_head aliasing tricks to write data/metadata blocks to * We play buffer_head aliasing tricks to write data/metadata blocks to
* the journal without copying their contents, but for journal * the journal without copying their contents, but for journal
* descriptor blocks we do need to generate bona fide buffers. * descriptor blocks we do need to generate bona fide buffers.
*
* After the caller of journal_get_descriptor_buffer() has finished modifying
* the buffer's contents they really should run flush_dcache_page(bh->b_page).
* But we don't bother doing that, so there will be coherency problems with
* mmaps of blockdevs which hold live JBD-controlled filesystems.
*/ */
struct journal_head *journal_get_descriptor_buffer(journal_t *journal)
struct journal_head * journal_get_descriptor_buffer(journal_t *journal)
{ {
struct buffer_head *bh; struct buffer_head *bh;
unsigned long blocknr; unsigned long blocknr;
...@@ -599,8 +603,10 @@ struct journal_head * journal_get_descriptor_buffer(journal_t *journal) ...@@ -599,8 +603,10 @@ struct journal_head * journal_get_descriptor_buffer(journal_t *journal)
return NULL; return NULL;
bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize); bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
lock_buffer(bh);
memset(bh->b_data, 0, journal->j_blocksize); memset(bh->b_data, 0, journal->j_blocksize);
bh->b_state |= (1 << BH_Dirty); set_buffer_uptodate(bh);
unlock_buffer(bh);
BUFFER_TRACE(bh, "return this buffer"); BUFFER_TRACE(bh, "return this buffer");
return journal_add_journal_head(bh); return journal_add_journal_head(bh);
} }
......
...@@ -522,7 +522,7 @@ void journal_write_revoke_records(journal_t *journal, ...@@ -522,7 +522,7 @@ void journal_write_revoke_records(journal_t *journal,
kmem_cache_free(revoke_record_cache, record); kmem_cache_free(revoke_record_cache, record);
} }
} }
if (descriptor) if (descriptor)
flush_descriptor(journal, descriptor, offset); flush_descriptor(journal, descriptor, offset);
jbd_debug(1, "Wrote %d revoke records\n", count); jbd_debug(1, "Wrote %d revoke records\n", count);
} }
...@@ -606,7 +606,7 @@ static void flush_descriptor(journal_t *journal, ...@@ -606,7 +606,7 @@ static void flush_descriptor(journal_t *journal,
header->r_count = htonl(offset); header->r_count = htonl(offset);
set_buffer_jwrite(bh); set_buffer_jwrite(bh);
BUFFER_TRACE(bh, "write"); BUFFER_TRACE(bh, "write");
set_buffer_uptodate(bh); set_buffer_dirty(bh);
ll_rw_block(WRITE, 1, &bh); ll_rw_block(WRITE, 1, &bh);
} }
#endif #endif
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment