Btrfs: avoid unnecessary splits when setting bits on an extent io tree
When attempting to set bits on a range of an exent io tree that already has those bits set we can end up splitting an extent state record, use the preallocated extent state record, insert it into the red black tree, do another search on the red black tree, merge the preallocated extent state record with the previous extent state record, remove that previous record from the red black tree and then free it. This is all unnecessary work that consumes time. This happens specifically at the following case at __set_extent_bit(): $ cat -n fs/btrfs/extent_io.c 957 static int __must_check 958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, (...) 1044 /* 1045 * | ---- desired range ---- | 1046 * | state | 1047 * or 1048 * | ------------- state -------------- | 1049 * (...) 1060 if (state->start < start) { 1061 if (state->state & exclusive_bits) { 1062 *failed_start = start; 1063 err = -EEXIST; 1064 goto out; 1065 } 1066 1067 prealloc = alloc_extent_state_atomic(prealloc); 1068 BUG_ON(!prealloc); 1069 err = split_state(tree, state, prealloc, start); 1070 if (err) 1071 extent_io_tree_panic(tree, err); 1072 1073 prealloc = NULL; So if our extent state represents a range from 0 to 1MiB for example, and we want to set bits in the range 128KiB to 256KiB for example, and that extent state record already has all those bits set, we end up splitting that record, so we end up with extent state records in the tree which represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is temporary because a subsequent iteration in that function will end up merging the records. The splitting requires using the preallocated extent state record, so a future iteration that needs to do another split will need to allocate another extent state record in an atomic context, something not ideal that we try to avoid as much as possible. The splitting also requires an insertion in the red black tree, and a subsequent merge will require a deletion from the red black tree and freeing an extent state record. This change just skips the splitting of an extent state record when it already has all the bits the we need to set. Setting a bit that is already set for a range is very common in the inode's 'file_extent_tree' extent io tree for example, where we keep setting the EXTENT_DIRTY bit every time we replace an extent. This change also fixes a bug that happens after the recent patchset from Josef that avoids having implicit holes after a power failure when not using the NO_HOLES feature, more specifically the patch with the subject: "btrfs: introduce the inode->file_extent_tree" This patch introduced an extent io tree per inode to keep track of completed ordered extents and figure out at any time what is the safe value for the inode's disk_i_size. This assumes that for contiguous ranges in a file we always end up with a single extent state record in the io tree, but that is not the case, as there is a short time window where we can have two extent state records representing contiguous ranges. When this happens we end setting up an incorrect value for the inode's disk_i_size, resulting in data loss after a clean unmount of the filesystem. The following example explains how this can happen. Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in the inode's file_extent_tree we have a single extent state record that represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the following steps happen: 1) A buffered write against file range [512KiB, 768KiB) is made. At this point delalloc was not flushed yet; 2) Deduplication from some other inode into this inode's range [128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range() to be called, from btrfs_insert_clone_extent(), to mark the range [128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree; 3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we end up at __set_extent_bit(). In the first iteration of that function's loop we end up in the following branch: $ cat -n fs/btrfs/extent_io.c 957 static int __must_check 958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, (...) 1044 /* 1045 * | ---- desired range ---- | 1046 * | state | 1047 * or 1048 * | ------------- state -------------- | 1049 * (...) 1060 if (state->start < start) { 1061 if (state->state & exclusive_bits) { 1062 *failed_start = start; 1063 err = -EEXIST; 1064 goto out; 1065 } 1066 1067 prealloc = alloc_extent_state_atomic(prealloc); 1068 BUG_ON(!prealloc); 1069 err = split_state(tree, state, prealloc, start); 1070 if (err) 1071 extent_io_tree_panic(tree, err); 1072 1073 prealloc = NULL; (...) 1089 goto search_again; This splits the state record into two, one for range [0, 128KiB) and another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY bit set. Then we jump to the 'search_again' label, where we unlock the the spinlock protecting the extent io tree before jumping to the 'again' label to perform the next iteration; 4) In the meanwhile, delalloc is flushed, the ordered extent for the range [512KiB, 768KiB) is created and when it completes, at btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write() with a value of 0 for its 'new_size' argument; 5) Before the deduplication task currently at __set_extent_bit() moves to the next iteration, the task finishing the ordered extent calls find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write() and gets 'start' set to 0 and 'end' set to 128KiB - because at this moment the io tree has two extent state records, one representing the range [0, 128KiB) and another representing the range [128KiB, 1MiB), both with EXTENT_DIRTY set. Then we set 'isize' to: isize = min(isize, end + 1) = min(1MiB, 128KiB - 1 + 1) = 128KiB Then we set the inode's disk_i_size to 128KiB (isize). After a clean unmount of the filesystem and mounting it again, we have the file with a size of 128KiB, and effectively lost all the data it had before in the range from 128KiB to 1MiB. This change fixes that issue too, as we never end up splitting extent state records when they already have all the bits we want set. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Showing
Please register or sign in to comment