Commit 4dbd80fb authored by Qu Wenruo's avatar Qu Wenruo Committed by Filipe Manana

btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error

[BUG]
When btrfs_reloc_clone_csum() reports error, it can underflow metadata
and leads to kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under certain block group layout and using error injection.

a) Create one data block group with one 4K extent in it.
   To avoid the bug that hangs btrfs due to ordered extent which never
   finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handles error from
btrfs_reloc_clone_csum() wrongly:

(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So error handler, which calls extent_clear_unlock_delalloc() with
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n, and free its metadata, causing metadata under flow.

[Fix]
The fix is to ensure after calling btrfs_add_ordered_extent(), we only
call error handler after increasing the iteration offset, so that
cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range
Signed-off-by: default avatarQu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
parent a967efb3
...@@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, ...@@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode,
BTRFS_DATA_RELOC_TREE_OBJECTID) { BTRFS_DATA_RELOC_TREE_OBJECTID) {
ret = btrfs_reloc_clone_csums(inode, start, ret = btrfs_reloc_clone_csums(inode, start,
cur_alloc_size); cur_alloc_size);
/*
* Only drop cache here, and process as normal.
*
* We must not allow extent_clear_unlock_delalloc()
* at out_unlock label to free meta of this ordered
* extent, as its meta should be freed by
* btrfs_finish_ordered_io().
*
* So we must continue until @start is increased to
* skip current ordered extent.
*/
if (ret) if (ret)
goto out_drop_extent_cache; btrfs_drop_extent_cache(BTRFS_I(inode), start,
start + ram_size - 1, 0);
} }
btrfs_dec_block_group_reservations(fs_info, ins.objectid); btrfs_dec_block_group_reservations(fs_info, ins.objectid);
if (disk_num_bytes < cur_alloc_size)
break;
/* we're not doing compressed IO, don't unlock the first /* we're not doing compressed IO, don't unlock the first
* page (which the caller expects to stay locked), don't * page (which the caller expects to stay locked), don't
* clear any dirty bits and don't set any writeback bits * clear any dirty bits and don't set any writeback bits
...@@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode, ...@@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode *inode,
delalloc_end, locked_page, delalloc_end, locked_page,
EXTENT_LOCKED | EXTENT_DELALLOC, EXTENT_LOCKED | EXTENT_DELALLOC,
op); op);
disk_num_bytes -= cur_alloc_size; if (disk_num_bytes < cur_alloc_size)
disk_num_bytes = 0;
else
disk_num_bytes -= cur_alloc_size;
num_bytes -= cur_alloc_size; num_bytes -= cur_alloc_size;
alloc_hint = ins.objectid + ins.offset; alloc_hint = ins.objectid + ins.offset;
start += cur_alloc_size; start += cur_alloc_size;
/*
* btrfs_reloc_clone_csums() error, since start is increased
* extent_clear_unlock_delalloc() at out_unlock label won't
* free metadata of current ordered extent, we're OK to exit.
*/
if (ret)
goto out_unlock;
} }
out: out:
return ret; return ret;
...@@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, ...@@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
BUG_ON(ret); /* -ENOMEM */ BUG_ON(ret); /* -ENOMEM */
if (root->root_key.objectid == if (root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) { BTRFS_DATA_RELOC_TREE_OBJECTID)
/*
* Error handled later, as we must prevent
* extent_clear_unlock_delalloc() in error handler
* from freeing metadata of created ordered extent.
*/
ret = btrfs_reloc_clone_csums(inode, cur_offset, ret = btrfs_reloc_clone_csums(inode, cur_offset,
num_bytes); num_bytes);
if (ret) {
if (!nolock && nocow)
btrfs_end_write_no_snapshoting(root);
goto error;
}
}
extent_clear_unlock_delalloc(inode, cur_offset, extent_clear_unlock_delalloc(inode, cur_offset,
cur_offset + num_bytes - 1, end, cur_offset + num_bytes - 1, end,
...@@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode, ...@@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode *inode,
if (!nolock && nocow) if (!nolock && nocow)
btrfs_end_write_no_snapshoting(root); btrfs_end_write_no_snapshoting(root);
cur_offset = extent_end; cur_offset = extent_end;
/*
* btrfs_reloc_clone_csums() error, now we're OK to call error
* handler, as metadata for created ordered extent will only
* be freed by btrfs_finish_ordered_io().
*/
if (ret)
goto error;
if (cur_offset > end) if (cur_offset > end)
break; break;
} }
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment