Commit 8407f553 authored by Filipe Manana, committed by Chris Mason

Btrfs: fix data corruption after fast fsync and writeback error

When we do a fast fsync, we start all ordered operations and then while
they're running in parallel we visit the list of modified extent maps
and construct their matching file extent items and write them to the
log btree. After that, in btrfs_sync_log() we wait for all the ordered
operations to finish (via btrfs_wait_logged_extents).

The problem with this is that we were completely ignoring errors that
can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
or -EIO. When such an error happens, it means that parts of the on disk
extent were not written to, and so we end up logging file extent items
that point to extents containing garbage/random data. After a
crash/reboot plus log replay, our inode's metadata then points to those
extents.

This contrasts with the full (non-fast) fsync path, where we
start all ordered operations, wait for them to finish and then write
to the log btree. In this path, after each ordered operation completes
we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
-EIO if so (via btrfs_wait_ordered_range).

So if an error happens with any ordered operation, just return -EIO
to userspace, so that it knows that not all of its previous writes
were durably persisted and the application can take proper action (such
as redoing the writes) - and definitely do not leave any file extent items
in the log that refer to extents that were not fully written.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
parent 669249ee
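
The commit message expects applications to notice an fsync failure and react to it, for example by redoing the writes. A minimal userspace sketch of that pattern follows. The helper name `write_durably` is hypothetical and is not part of this commit; it only illustrates checking the return value of fsync(2) instead of assuming success.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): write len bytes from buf to
 * path and fsync the file. Returns 0 on success or a negative errno.
 */
static int write_durably(const char *path, const void *buf, size_t len)
{
	const char *p = buf;
	size_t done = 0;
	int ret = 0;
	int fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -errno;

	while (done < len) {
		ssize_t n = write(fd, p + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			ret = -errno;
			goto out;
		}
		done += (size_t)n;
	}

	/*
	 * If fsync fails (for example with EIO), the earlier writes may
	 * not have reached stable storage, so report the error.
	 */
	if (fsync(fd) < 0)
		ret = -errno;
out:
	if (close(fd) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller that gets a negative return value should treat the data as not durable and decide whether to rewrite it, report the failure, or abort.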
@@ -2029,6 +2029,25 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
	 */
	mutex_unlock(&inode->i_mutex);

	/*
	 * If any of the ordered extents had an error, just return it to user
	 * space, so that the application knows some writes didn't succeed and
	 * can take proper action (retry for e.g.). Blindly committing the
	 * transaction in this case, would fool userspace that everything was
	 * successful. And we also want to make sure our log doesn't contain
	 * file extent items pointing to extents that weren't fully written to -
	 * just like in the non fast fsync path, where we check for the ordered
	 * operation's error flag before writing to the log tree and return -EIO
	 * if any of them had this flag set (btrfs_wait_ordered_range) -
	 * therefore we need to check for errors in the ordered operations,
	 * which are indicated by ctx.io_err.
	 */
	if (ctx.io_err) {
		btrfs_end_transaction(trans, root);
		ret = ctx.io_err;
		goto out;
	}

	if (ret != BTRFS_NO_LOG_SYNC) {
		if (!ret) {
			ret = btrfs_sync_log(trans, root, &ctx);
This diff is collapsed.
@@ -28,6 +28,7 @@
struct btrfs_log_ctx {
	int log_ret;
	int log_transid;
	int io_err;
	struct list_head list;
};
@@ -35,6 +36,7 @@ static inline void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx)
{
	ctx->log_ret = 0;
	ctx->log_transid = 0;
	ctx->io_err = 0;
	INIT_LIST_HEAD(&ctx->list);
}
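
The hunks shown here add the `io_err` field and the check in btrfs_sync_file(), but the code that sets `io_err` lives in the collapsed diff and is not visible. The following userspace model is therefore only a sketch under assumptions: every `model_` name is hypothetical, and the way errors are collected is a guess at the general idea (an ordered operation flagged with an error causes -EIO to be recorded in the context), not the kernel implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an ordered operation (hypothetical). */
struct model_ordered_extent {
	bool io_error;	/* models the BTRFS_ORDERED_IOERR flag */
};

/* Simplified stand-in for struct btrfs_log_ctx, without the list head. */
struct model_log_ctx {
	int log_ret;
	int log_transid;
	int io_err;
};

static void model_init_log_ctx(struct model_log_ctx *ctx)
{
	ctx->log_ret = 0;
	ctx->log_transid = 0;
	ctx->io_err = 0;
}

/*
 * Assumption: while logging, each ordered operation is inspected and an
 * error flag on any of them is recorded in the context as -EIO.
 */
static void model_collect_errors(struct model_log_ctx *ctx,
				 const struct model_ordered_extent *oe,
				 size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (oe[i].io_error)
			ctx->io_err = -EIO;
	}
}

/*
 * Mirrors the check added to btrfs_sync_file(): a recorded I/O error
 * takes priority and is returned instead of syncing the log.
 */
static int model_fsync_result(const struct model_log_ctx *ctx, int log_ret)
{
	if (ctx->io_err)
		return ctx->io_err;
	return log_ret;
}
```

In this model, a context that has seen no failed ordered operation passes the log result through unchanged, while a single failed operation makes the fsync result -EIO.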