- 25 May, 2020 40 commits
-
-
Qu Wenruo authored
[BUG] When balance is canceled, there is a pretty high chance that unmounting the fs can lead to lead the NULL pointer dereference: BTRFS warning (device dm-3): page private not zero on page 223158272 ... BTRFS warning (device dm-3): page private not zero on page 223162368 BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1 BUG: kernel NULL pointer dereference, address: 0000000000000168 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 2 PID: 5793 Comm: umount Tainted: G O 5.7.0-rc5-custom+ #53 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__lock_acquire+0x5dc/0x24c0 Call Trace: lock_acquire+0xab/0x390 _raw_spin_lock+0x39/0x80 btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs] release_extent_buffer+0xb2/0x170 [btrfs] free_extent_buffer+0x66/0xb0 [btrfs] btrfs_put_root+0x8e/0x130 [btrfs] btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs] btrfs_free_fs_info+0xe5/0x120 [btrfs] btrfs_kill_super+0x1f/0x30 [btrfs] deactivate_locked_super+0x3b/0x80 deactivate_super+0x3e/0x50 cleanup_mnt+0x109/0x160 __cleanup_mnt+0x12/0x20 task_work_run+0x67/0xa0 exit_to_usermode_loop+0xc5/0xd0 syscall_return_slowpath+0x205/0x360 do_syscall_64+0x6e/0xb0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7fd028ef740b [CAUSE] When balance is canceled, all reloc roots are marked as orphan, and orphan reloc roots are going to be cleaned up. However for orphan reloc roots and merged reloc roots, their lifespan are quite different: Merged reloc roots | Orphan reloc roots by cancel -------------------------------------------------------------------- create_reloc_root() | create_reloc_root() |- refs == 1 | |- refs == 1 | btrfs_grab_root(reloc_root); | btrfs_grab_root(reloc_root); |- refs == 2 | |- refs == 2 | root->reloc_root = reloc_root; | root->reloc_root = reloc_root; >>> No difference so far <<< | prepare_to_merge() | prepare_to_merge() |- btrfs_set_root_refs(item, 1);| |- if (!err) (err == -EINTR) | merge_reloc_roots() | merge_reloc_roots() |- merge_reloc_root() | |- Doing nothing to put reloc root |- insert_dirty_subvol() | |- refs == 2 |- __del_reloc_root() | |- btrfs_put_root() | |- refs == 1 | >>> Now orphan reloc roots still have refs 2 <<< | clean_dirty_subvols() | clean_dirty_subvols() |- btrfs_drop_snapshot() | |- btrfS_drop_snapshot() |- reloc_root get freed | |- reloc_root still has refs 2 | related ebs get freed, but | reloc_root still recorded in | allocated_roots btrfs_check_leaked_roots() | btrfs_check_leaked_roots() |- No leaked roots | |- Leaked reloc_roots detected | |- btrfs_put_root() | |- free_extent_buffer(root->node); | |- eb already freed, caused NULL | pointer dereference [FIX] The fix is to clear fs_root->reloc_root and put it at merge_reloc_roots() time, so that we won't leak reloc roots. Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.1+ Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Robbie Ko authored
When creating a snapshot, ordered extents need to be flushed and this can take a long time. In create_snapshot there are two locks held when this happens: 1. Destination directory inode lock 2. Global subvolume semaphore This will unnecessarily block other operations like subvolume destroy, create, or setflag until the snapshot is created. We can fix that by moving the flush outside the locked section as this does not depend on the aforementioned locks. The code factors out the snapshot related work from create_snapshot to btrfs_mksnapshot. __btrfs_ioctl_snap_create btrfs_mksubvol create_subvol btrfs_mksnapshot <flush> btrfs_mksubvol create_snapshot Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
SHAREABLE flag is set for subvolumes because users can create snapshot for subvolumes, thus sharing tree blocks of them. But data reloc tree is not exposed to user space, as it's only an internal tree for data relocation, thus it doesn't need the full path replacement handling at all. This patch will make data reloc tree a non-shareable tree, and add btrfs_fs_info::data_reloc_root for data reloc tree, so relocation code can grab it from fs_info directly. This would slightly improve tree relocation, as now data reloc tree can go through regular COW routine to get relocated, without bothering the complex tree reloc tree routine. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There are a lot of root owner checks in btrfs_truncate_inode_items() like: if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) || root == fs_info->tree_root) But considering that, only these trees can have INODE_ITEMs: - tree root (for v1 space cache) - subvolume trees - tree reloc trees - data reloc tree - log trees And since subvolume/tree reloc/data reloc trees all have SHAREABLE bit, and we're checking tree root manually, so above check is just excluding log trees. This patch will replace two of such checks to a simpler one: if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) This would merge btrfs_drop_extent_cache() and lock_extent_bits() call into the same if branch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The name BTRFS_ROOT_REF_COWS is not very clear about the meaning. In fact, that bit can only be set to those trees: - Subvolume roots - Data reloc root - Reloc roots for above roots All other trees won't get this bit set. So just by the result, it is obvious that, roots with this bit set can have tree blocks shared with other trees. Either shared by snapshots, or by reloc roots (an special snapshot created by relocation). This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to make it easier to understand, and update all comment mentioning "reference counted" to follow the rename. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Commit dccdb07b ("btrfs: kill btrfs_fs_info::volume_mutex") removed the last use of the volume_mutex, forgetting to update the comment. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The fallback path calls helper write_extent_buffer to do write of the data spanning two extent buffer pages. As the size is known, we can do the write directly in two steps. This removes one function call and compiler can optimize memcpy as the sizes are known at compile time. The cached token address is set to the second page. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The helper write_extent_buffer is called to do write of the data spanning two extent buffer pages. As the size is known, we can do the write directly in two steps. This removes one function call and compiler can optimize memcpy as the sizes are known at compile time. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The fallback path calls helper read_extent_buffer to do read of the data spanning two extent buffer pages. As the size is known, we can do the read directly in two steps. This removes one function call and compiler can optimize memcpy as the sizes are known at compile time. The cached token address is set to the second page. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The helper read_extent_buffer is called to do read of the data spanning two extent buffer pages. As the size is known, we can do the read directly in two steps. This removes one function call and compiler can optimize memcpy as the sizes are known at compile time. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Helpers that iterate over extent buffer pages set up several variables, one of them is finding out offset of the extent buffer start within a page. Right now we have extent buffers aligned to page sizes so this is effectively storing zero. This makes the code harder the follow and can be simplified. The same change is done in all the helpers: * remove: size_t start_offset = offset_in_page(eb->start); * simplify code using start_offset Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There are many helpers around extent buffers, found in extent_io.h and ctree.h. Most of them can be converted to take constified eb as there are no changes to the extent buffer structure itself but rather the pages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
All uses of map_private_extent_buffer have been replaced by more effective way. The set/get helpers have their own bounds checker. The function name was confusing since the non-private helper was removed in a6591715 ("Btrfs: stop using highmem for extent_buffers") many years ago. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The bin search jumps over the extent buffer item keys, comparing directly the bytes if the key is in one page, or storing it in a temporary buffer in case it spans two pages. The mapping start and length are obtained from map_private_extent_buffer, which is heavy weight compared to what we need. We know the key size and can find out the eb page in a simple way. For keys spanning two pages the fallback read_extent_buffer is used. The temporary variables are reduced and moved to the scope of use. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The set/get token helpers either use the cached address in the token or unconditionally call map_private_extent_buffer to get the address of page containing the requested offset plus the mapping start and length. Depending on the return value, the fast path uses unaligned put to write data within a page, or fall back to write_extent_buffer that can handle writes spanning more pages. This is all wasteful. We know the number of bytes to write, 1/2/4/8 and can find out the page. Then simply check if it's contained or the fallback is needed. The token address is updated to the page, or the on the next index, expecting that the next write will use that. This saves one function call to map_private_extent_buffer and several unnecessary temporary variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The helpers unconditionally call map_private_extent_buffer to get the address of page containing the requested offset plus the mapping start and length. Depending on the return value, the fast path uses unaligned put to write data within a page, or fall back to write_extent_buffer that can handle writes spanning more pages. This is all wasteful. We know the number of bytes to write, 1/2/4/8 and can find out the page. Then simply check if it's contained or the fallback is needed. This saves one function call to map_private_extent_buffer and several unnecessary temporary variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The set/get token helpers either use the cached address in the token or unconditionally call map_private_extent_buffer to get the address of page containing the requested offset plus the mapping start and length. Depending on the return value, the fast path uses unaligned read to get data within a page, or fall back to read_extent_buffer that can handle reads spanning more pages. This is all wasteful. We know the number of bytes to read, 1/2/4/8 and can find out the page. Then simply check if it's contained or the fallback is needed. The token address is updated to the page, or the on the next index, expecting that the next read will use that. This saves one function call to map_private_extent_buffer and several unnecessary temporary variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The helpers unconditionally call map_private_extent_buffer to get the address of page containing the requested offset plus the mapping start and length. Depending on the return value, the fast path uses unaligned read to get data within a page, or fall back to read_extent_buffer that can handle reads spanning more pages. This is all wasteful. We know the number of bytes to read, 1/2/4/8 and can find out the page. Then simply check if it's contained or the fallback is needed. This saves one function call to map_private_extent_buffer and several unnecessary temporary variables. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The bounds checking is now done in map_private_extent_buffer but that will be removed in following patches and some sanity checks should still be done. There are two separate checks to see the kind of out of bounds access: partial (start offset is in the buffer) or complete (both start and end are out). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
All the set/get helpers first check if the token contains a cached address. After first use the address is always valid, but the extra check is done for each call. The token initialization can optimistically set it to the first extent buffer page, that we know always exists. Then the condition in all btrfs_token_*/btrfs_set_token_* can be simplified by removing the address check from the condition, but for development the assertion still makes sure it's valid. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The token is supposed to cache the last page used by the set/get helpers. In leaf_space_used the first and last items are accessed, it's not likely they'd be on the same page so there's some overhead caused updating the token address but not using it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The set/get token is supposed to cache the last page that was accessed so it speeds up subsequential access to the eb. It does not make sense to use that for just one change, which is the case of inode size in overwrite_item. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Now that all set/get helpers use the eb from the token, we don't need to pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some stack space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The token stores a copy of the extent buffer pointer but does not make any use of it besides sanity checks. We can use it and drop the eb parameter from several functions, this patch only switches the use inside the set/get helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Tiezhu Yang authored
disk-io.h is included more than once in block-group.c, remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The name of this function contains the word "cache", which is left from the times where btrfs_block_group was called btrfs_block_group_cache. Now this "cache" doesn't match anything, and we have better namings for functions like read/insert/remove_block_group_item(). Rename it to update_block_group_item(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Currently the block group item insert is pretty straight forward, fill the block group item structure and insert it into extent tree. However the incoming skinny block group feature is going to change this, so this patch will refactor insertion into a new function, insert_block_group_item(), to make the incoming feature easier to add. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
When deleting a block group item, it's pretty straight forward, just delete the item pointed by the key. However it will not be that straight-forward for incoming skinny block group item. So refactor the block group item deletion into a new function, remove_block_group_item(), also to make the already lengthy btrfs_remove_block_group() a little shorter. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Structure btrfs_block_group has the following members which are currently read from on-disk block group item and key: - length - from item key - used - flags - from block group item However for incoming skinny block group tree, we are going to read those members from different sources. This patch will refactor such read by: - Don't initialize btrfs_block_group::length at allocation Caller should initialize them manually. Also to avoid possible (well, only two callers) missing initialization, add extra ASSERT() in btrfs_add_block_group_cache(). - Refactor length/used/flags initialization into one function The new function, fill_one_block_group() will handle the initialization of such members. - Use btrfs_block_group::length to replace key::offset Since skinny block group item would have a different meaning for its key offset. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Regular block group items in extent tree are scattered inside the huge tree, thus forward readahead makes no sense. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Marcos Paulo de Souza authored
Whenever a chown is executed, all capabilities of the file being touched are lost. When doing incremental send with a file with capabilities, there is a situation where the capability can be lost on the receiving side. The sequence of actions bellow shows the problem: $ mount /dev/sda fs1 $ mount /dev/sdb fs2 $ touch fs1/foo.bar $ setcap cap_sys_nice+ep fs1/foo.bar $ btrfs subvolume snapshot -r fs1 fs1/snap_init $ btrfs send fs1/snap_init | btrfs receive fs2 $ chgrp adm fs1/foo.bar $ setcap cap_sys_nice+ep fs1/foo.bar $ btrfs subvolume snapshot -r fs1 fs1/snap_complete $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental $ btrfs send fs1/snap_complete | btrfs receive fs2 $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2 At this point, only a chown was emitted by "btrfs send" since only the group was changed. This makes the cap_sys_nice capability to be dropped from fs2/snap_incremental/foo.bar To fix that, only emit capabilities after chown is emitted. The current code first checks for xattrs that are new/changed, emits them, and later emit the chown. Now, __process_new_xattr skips capabilities, letting only finish_inode_if_needed to emit them, if they exist, for the inode being processed. This behavior was being worked around in "btrfs receive" side by caching the capability and only applying it after chown. Now, xattrs are only emmited _after_ chown, making that workaround not needed anymore. Link: https://github.com/kdave/btrfs-progs/issues/202 CC: stable@vger.kernel.org # 4.4+ Suggested-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When scrubbing a stripe, whenever we find an extent we lookup for its checksums in the checksum tree. However we do it even for metadata extents which don't have checksum items stored in the checksum tree, that is only for data extents. So make the lookup for checksums only if we are processing with a data extent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The helpers btrfs_freeze_block_group() and btrfs_unfreeze_block_group() used to be named btrfs_get_block_group_trimming() and btrfs_put_block_group_trimming() respectively. At the time they were added to free-space-cache.c, by commit e33e17ee ("btrfs: add missing discards when unpinning extents with -o discard") because all the trimming related functions were in free-space-cache.c. Now that the helpers were renamed and are used in scrub context as well, move them to block-group.c, a much more logical location for them. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Back in 2014, commit 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation"), I added the 'trimming' member to the block group structure. Its purpose was to prevent races between trimming and block group deletion/allocation by pinning the block group in a way that prevents its logical address and device extents from being reused while trimming is in progress for a block group, so that if another task deletes the block group and then another task allocates a new block group that gets the same logical address and device extents while the trimming task is still in progress. After the previous fix for scrub (patch "btrfs: fix a race between scrub and block group removal/allocation"), scrub now also has the same needs that trimming has, so the member name 'trimming' no longer makes sense. Since there is already a 'pinned' member in the block group that refers to space reservations (pinned bytes), rename the member to 'frozen', add a comment on top of it to describe its general purpose and rename the helpers to increment and decrement the counter as well, to match the new member name. The next patch in the series will move the helpers into a more suitable file (from free-space-cache.c to block-group.c). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When scrub is verifying the extents of a block group for a device, it is possible that the corresponding block group gets removed and its logical address and device extents get used for a new block group allocation. When this happens scrub incorrectly reports that errors were detected and, if the the new block group has a different profile then the old one, deleted block group, we can crash due to a null pointer dereference. Possibly other unexpected and weird consequences can happen as well. Consider the following sequence of actions that leads to the null pointer dereference crash when scrub is running in parallel with balance: 1) Balance sets block group X to read-only mode and starts relocating it. Block group X is a metadata block group, has a raid1 profile (two device extents, each one in a different device) and a logical address of 19424870400; 2) Scrub is running and finds device extent E, which belongs to block group X. It enters scrub_stripe() to find all extents allocated to block group X, the search is done using the extent tree; 3) Balance finishes relocating block group X and removes block group X; 4) Balance starts relocating another block group and when trying to commit the current transaction as part of the preparation step (prepare_to_relocate()), it blocks because scrub is running; 5) The scrub task finds the metadata extent at the logical address 19425001472 and marks the pages of the extent to be read by a bio (struct scrub_bio). The extent item's flags, which have the bit BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct scrub_page). It is these flags in the scrub pages that tells the bio's end io function (scrub_bio_end_io_worker) which type of extent it is dealing with. At this point we end up with 4 pages in a bio which is ready for submission (the metadata extent has a size of 16Kb, so that gives 4 pages on x86); 6) At the next iteration of scrub_stripe(), scrub checks that there is a pause request from the relocation task trying to commit a transaction, therefore it submits the pending bio and pauses, waiting for the transaction commit to complete before resuming; 7) The relocation task commits the transaction. The device extent E, that was used by our block group X, is now available for allocation, since the commit root for the device tree was swapped by the transaction commit; 8) Another task doing a direct IO write allocates a new data block group Y which ends using device extent E. This new block group Y also ends up getting the same logical address that block group X had: 19424870400. This happens because block group X was the block group with the highest logical address and, when allocating Y, find_next_chunk() returns the end offset of the current last block group to be used as the logical address for the new block group, which is 18351128576 + 1073741824 = 19424870400 So our new block group Y has the same logical address and device extent that block group X had. However Y is a data block group, while X was a metadata one, and Y has a raid0 profile, while X had a raid1 profile; 9) After allocating block group Y, the direct IO submits a bio to write to device extent E; 10) The read bio submitted by scrub reads the 4 pages (16Kb) from device extent E, which now correspond to the data written by the task that did a direct IO write. Then at the end io function associated with the bio, scrub_bio_end_io_worker(), we call scrub_block_complete() which calls scrub_checksum(). This later function checks the flags of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK is set in the flags, so it assumes it has a metadata extent and then calls scrub_checksum_tree_block(). That functions returns an error, since interpreting data as a metadata extent causes the checksum verification to fail. So this makes scrub_checksum() call scrub_handle_errored_block(), which determines 'failed_mirror_index' to be 1, since the device extent E was allocated as the second mirror of block group X. It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at 'sblocks_for_recheck', and all the memory is initialized to zeroes by kcalloc(). After that it calls scrub_setup_recheck_block(), which is responsible for filling each of those structures. However, when that function calls btrfs_map_sblock() against the logical address of the metadata extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches the current block group Y. However block group Y has a raid0 profile and not a raid1 profile like X had, so the following call returns 1: scrub_nr_raid_mirrors(bbio) And as a result scrub_setup_recheck_block() only initializes the first (index 0) scrub_block structure in 'sblocks_for_recheck'. Then scrub_recheck_block() is called by scrub_handle_errored_block() with the second (index 1) scrub_block structure as the argument, because 'failed_mirror_index' was previously set to 1. This scrub_block was not initialized by scrub_setup_recheck_block(), so it has zero pages, its 'page_count' member is 0 and its 'pagev' page array has all members pointing to NULL. Finally when scrub_recheck_block() calls scrub_recheck_block_checksum() we have a NULL pointer dereference when accessing the flags of the first page, as pavev[0] is NULL: static void scrub_recheck_block_checksum(struct scrub_block *sblock) { (...) if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA) scrub_checksum_data(sblock); (...) } Producing a stack trace like the following: [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028 [542998.010238] #PF: supervisor read access in kernel mode [542998.010878] #PF: error_code(0x0000) - not-present page [542998.011516] PGD 0 P4D 0 [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1 [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs] [542998.018474] Code: 4c 89 e6 ... [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202 [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000 [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120 [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001 [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800 [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0 [542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000 [542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0 [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [542998.032380] Call Trace: [542998.032752] scrub_recheck_block+0x162/0x400 [btrfs] [542998.033500] ? __alloc_pages_nodemask+0x31e/0x460 [542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs] [542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs] [542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs] [542998.036735] process_one_work+0x26d/0x6a0 [542998.037275] worker_thread+0x4f/0x3e0 [542998.037740] ? process_one_work+0x6a0/0x6a0 [542998.038378] kthread+0x103/0x140 [542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70 [542998.039419] ret_from_fork+0x3a/0x50 [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ... [542998.047288] CR2: 0000000000000028 [542998.047724] ---[ end trace bde186e176c7f96a ]--- This issue has been around for a long time, possibly since scrub exists. The last time I ran into it was over 2 years ago. After recently fixing fstests to pass the "--full-balance" command line option to btrfs-progs when doing balance, several tests started to more heavily exercise balance with fsstress, scrub and other operations in parallel, and therefore started to hit this issue again (with btrfs/061 for example). Fix this by having scrub increment the 'trimming' counter of the block group, which pins the block group in such a way that it guarantees neither its logical address nor device extents can be reused by future block group allocations until we decrement the 'trimming' counter. Also make sure that on each iteration of scrub_stripe() we stop scrubbing the block group if it was removed already. A later patch in the series will rename the block group's 'trimming' counter and its helpers to a more generic name, since now it is not used exclusively for pinning while trimming anymore. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The extent references v0 have been superseded long time go, there are some unused declarations of access helpers. We can safely remove them now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct btrfs_extent_item_v0 is still part of a backward compatibility check in relocation.c and thus not removed. Signed-off-by: David Sterba <dsterba@suse.com>
-
YueHaibing authored
There's no callers in-tree anymore since commit d24ee97b ("btrfs: use new helpers to set uuids in eb") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] For the following operation, qgroup is guaranteed to be screwed up due to snapshot adding to a new qgroup: # mkfs.btrfs -f $dev # mount $dev $mnt # btrfs qgroup en $mnt # btrfs subv create $mnt/src # xfs_io -f -c "pwrite 0 1m" $mnt/src/file # sync # btrfs qgroup create 1/0 $mnt/src # btrfs subv snapshot -i 1/0 $mnt/src $mnt/snapshot # btrfs qgroup show -prce $mnt/src qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16.00KiB 16.00KiB none none --- --- 0/257 1.02MiB 16.00KiB none none --- --- 0/258 1.02MiB 16.00KiB none none 1/0 --- 1/0 0.00B 0.00B none none --- 0/258 ^^^^^^^^^^^^^^^^^^^^ [CAUSE] The problem is in btrfs_qgroup_inherit(), we don't have good enough check to determine if the new relation would break the existing accounting. Unlike btrfs_add_qgroup_relation(), which has proper check to determine if we can do quick update without a rescan, in btrfs_qgroup_inherit() we can even assign a snapshot to multiple qgroups. [FIX] Fix it by manually marking qgroup inconsistent for snapshot inheritance. For subvolume creation, since all its extents are exclusively owned, we don't need to rescan. In theory, we should call relation check like quick_update_accounting() when doing qgroup inheritance and inform user about qgroup accounting inconsistency. But we don't have good mechanism to relay that back to the user in the snapshot creation context, thus we can only silently mark the qgroup inconsistent. Anyway, user shouldn't use qgroup inheritance during snapshot creation, and should add qgroup relationship after snapshot creation by 'btrfs qgroup assign', which has a much better UI to inform user about qgroup inconsistent and kick in rescan automatically. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Robbie Ko authored
When mounting, we handle deleted subvolume and orphan items. First, find add orphan roots, then add them to fs_root radix tree. Second, in tree-root, process each orphan item, skip if it is dead root. The original algorithm is based on the list of dead_roots, one by one to visit and check whether the objectid is consistent, the time complexity is O (n ^ 2). When processing 50000 deleted subvols, it takes about 120s. Because btrfs_find_orphan_roots has already ran before us, and added deleted subvol to fs_roots radix tree. The fs root will only be removed from the fs_roots radix tree after the cleaner process is started, and the cleaner will only start execution after the mount is complete. btrfs_orphan_cleanup can be called during the whole filesystem mount lifetime, but only "tree root" will be used in this section of code, and only mount time will be brought into tree root. So we can quickly check whether the orphan item is dead root through the fs_roots radix tree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
-