- 11 Jul, 2024 40 commits
-
Filipe Manana authored
At btrfs_force_cow_block() we have several error paths that need to unlock the "cow" extent buffer, drop the reference on it and then return an error. This is a bit verbose, so add a label where we perform these tasks and make the error paths jump to that label. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
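For illustration, a minimal self-contained C sketch of the cleanup-label pattern this change applies (generic names, not the actual btrfs_force_cow_block() code; in the real function the shared steps are unlocking the "cow" extent buffer and dropping its reference):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Generic sketch: several error paths share one cleanup label. */
    static int process_buffer(size_t len)
    {
            char *buf;
            int ret = 0;

            buf = calloc(1, len);
            if (!buf)
                    return -1;

            if (len < 16) {                 /* error path 1: input too small */
                    ret = -1;
                    goto out_free;
            }

            memset(buf, 'a', len - 1);
            if (buf[len - 1] != '\0') {     /* error path 2: sanity check */
                    ret = -1;
                    goto out_free;
            }

            printf("processed %zu bytes\n", len);

    out_free:                               /* single place that undoes the setup */
            free(buf);
            return ret;
    }

    int main(void)
    {
            return process_buffer(32) == 0 ? 0 : 1;
    }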
-
Filipe Manana authored
When freeing a tree block, at btrfs_free_tree_block(), if we fail to create a delayed reference we don't deal with the error and just do a BUG_ON(). The error most likely to happen is -ENOMEM, and we have a comment mentioning that only -ENOMEM can happen, but that is not true, because if qgroups are enabled any error returned from btrfs_qgroup_trace_extent_post() (it can be -EUCLEAN or anything returned from btrfs_search_slot(), for example) can be propagated back to btrfs_free_tree_block(). So stop doing a BUG_ON() and return the error to the callers and make them abort the transaction to prevent leaking space. Syzbot was triggering this, likely due to memory allocation failure injection. Reported-by: syzbot+a306f914b4d01b3958fe@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000fcba1e05e998263c@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
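A hedged, userspace-only sketch of the conversion being described; the stub standing in for delayed-ref creation is made up, and in the kernel the callers additionally abort the transaction when they receive the error:

    #include <assert.h>
    #include <errno.h>
    #include <stdio.h>

    /* Hypothetical stand-in for creating a delayed reference; may fail with -ENOMEM. */
    static int add_delayed_ref_stub(int simulate_failure)
    {
            return simulate_failure ? -ENOMEM : 0;
    }

    /* Before (sketch): treat any failure as a programming error and crash. */
    static void free_tree_block_old(int simulate_failure)
    {
            int ret = add_delayed_ref_stub(simulate_failure);
            assert(ret == 0);               /* analogous to BUG_ON(ret) */
    }

    /* After (sketch): hand the error back so the caller can abort the transaction. */
    static int free_tree_block_new(int simulate_failure)
    {
            int ret = add_delayed_ref_stub(simulate_failure);
            if (ret < 0)
                    return ret;             /* caller aborts the transaction */
            return 0;
    }

    int main(void)
    {
            free_tree_block_old(0);
            printf("new path returns %d on failure\n", free_tree_block_new(1));
            return 0;
    }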
-
Filipe Manana authored
It's pointless to pass a super block argument to btrfs_iget_locked() because we always pass a root and from it we can get the super block through root->fs_info->sb. So remove the super block argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
It's pointless to pass a super block argument to btrfs_iget_path() because we always pass a root and from it we can get the super block through root->fs_info->sb. So remove the super block argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
It's pointless to pass a super block argument to btrfs_iget() because we always pass a root and from it we can get the super block through root->fs_info->sb. So remove the super block argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
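The same pattern applies to all three btrfs_iget*() cleanups above; a simplified, self-contained C sketch (stand-in struct definitions, not the real btrfs types) of why the super block parameter is redundant:

    #include <stdio.h>

    /* Simplified stand-ins for the real btrfs types, for illustration only. */
    struct super_block { int dummy; };
    struct btrfs_fs_info { struct super_block *sb; };
    struct btrfs_root { struct btrfs_fs_info *fs_info; };

    /* Old shape (sketch): caller had to pass both sb and root. */
    static void iget_old(struct super_block *sb, struct btrfs_root *root)
    {
            (void)sb;
            (void)root;
    }

    /* New shape (sketch): derive the super block from the root. */
    static void iget_new(struct btrfs_root *root)
    {
            struct super_block *sb = root->fs_info->sb;
            (void)sb;
    }

    int main(void)
    {
            struct super_block sb = { 0 };
            struct btrfs_fs_info fs_info = { .sb = &sb };
            struct btrfs_root root = { .fs_info = &fs_info };

            iget_old(&sb, &root);
            iget_new(&root);
            printf("super block reachable via root->fs_info->sb\n");
            return 0;
    }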
-
Qu Wenruo authored
Since commit 2b2553f1 ("btrfs: stop setting PageError in the data I/O path") btrfs no longer uses the subpage error bitmap, but the commit forgot to remove the error bitmap dumping in btrfs_subpage_dump_bitmap(), which can produce a meaningless result for the error bitmap. Fix it by just removing the error bitmap dumping. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] There is a bug report that a canceled checksum conversion (still an experimental feature) results in unexpected super block flags: csum_type 0 (crc32c) csum_size 4 csum 0x14973811 [match] bytenr 65536 flags 0x1000000001 ( WRITTEN | CHANGING_FSID_V2 ) magic _BHRfS_M [match] However, a filesystem with an ongoing checksum conversion should instead have either CHANGING_DATA_CSUM or CHANGING_META_CSUM set. [CAUSE] It turns out that, because btrfs-progs keeps its own extra flags inside its own ctree.h header rather than the shared uapi header, we have conflicting super flags: kernel-shared/uapi/btrfs_tree.h:#define BTRFS_SUPER_FLAG_METADUMP_V2 (1ULL << 34) kernel-shared/uapi/btrfs_tree.h:#define BTRFS_SUPER_FLAG_CHANGING_FSID (1ULL << 35) kernel-shared/uapi/btrfs_tree.h:#define BTRFS_SUPER_FLAG_CHANGING_FSID_V2 (1ULL << 36) kernel-shared/ctree.h:#define BTRFS_SUPER_FLAG_CHANGING_DATA_CSUM (1ULL << 36) kernel-shared/ctree.h:#define BTRFS_SUPER_FLAG_CHANGING_META_CSUM (1ULL << 37) Note that CHANGING_FSID_V2 conflicts with CHANGING_DATA_CSUM. [FIX] The proper fix would be done inside btrfs-progs, but to keep everything properly recorded we should have everything inside the same uapi header. Copy all the new flags into the uapi header, and change the values for CHANGING_DATA_CSUM and CHANGING_META_CSUM, while keeping the value of CHANGING_BG_TREE untouched. Thankfully, since the checksum change is still only experimental and all those CHANGING_* flags are transient (only for btrfs-progs to resume the conversion; the kernel will reject them all), the damage is minor. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Snapshot delete has some complicated-looking code that is weirdly subtle at times. I've cleaned it up the best I can without re-writing it, but there are still a lot of details that are non-obvious. Add a bunch of comments to the main parts of the code to help future developers better understand the mechanics of snapshot deletion. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
In walk_up_proc() we BUG_ON(ret) from btrfs_dec_ref(). This is incorrect: we have proper error handling here, so return the error. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
In walk_up_proc() we have several sanity checks that should only trip if the programmer made a mistake. Convert these to ASSERT()s instead of BUG_ON()s. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
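For context, a userspace illustration of the difference between the two kinds of checks; the macros below are local stand-ins, not the kernel definitions (in btrfs, ASSERT() is only active with CONFIG_BTRFS_ASSERT):

    #include <stdio.h>
    #include <stdlib.h>

    /* BUG_ON-style: always crashes on failure, even in production. */
    #define MY_BUG_ON(cond) do { if (cond) abort(); } while (0)

    /* ASSERT-style: compiled out unless a debug option is enabled. */
    #ifdef DEBUG_ASSERTIONS
    #define MY_ASSERT(cond) do { if (!(cond)) abort(); } while (0)
    #else
    #define MY_ASSERT(cond) ((void)(cond))
    #endif

    static int walk_level(int level)
    {
            /* Programmer-error check: only meaningful for developers. */
            MY_ASSERT(level >= 0);
            return level + 1;
    }

    int main(void)
    {
            MY_BUG_ON(walk_level(0) != 1);
            printf("ok\n");
            return 0;
    }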
-
Josef Bacik authored
In reada we BUG_ON(refs == 0), which could be unkind since we aren't holding a lock on the extent leaf and thus could get a transient incorrect answer. In walk_down_proc() we also BUG_ON(refs == 0), which could happen if we have extent tree corruption. Change that to return -EUCLEAN. In do_walk_down() we catch this case and handle it correctly, however we return -EIO, whereas -EUCLEAN is a more appropriate error code. Finally in walk_up_proc() we have the same BUG_ON(refs == 0), so convert that to proper error handling. Also adjust the error message so we can actually do something with the information. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
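A small, self-contained C sketch of the approach (illustrative only, not the walk_down_proc() code): report corruption with -EUCLEAN and a usable message instead of crashing:

    #include <errno.h>
    #include <stdio.h>

    #ifndef EUCLEAN
    #define EUCLEAN 117     /* "Structure needs cleaning": the on-disk data is bad */
    #endif

    /*
     * Instead of crashing when a reference count of zero is read for a block
     * that clearly exists, report corruption to the caller with -EUCLEAN and
     * enough information to act on.
     */
    static int check_block_refs(unsigned long long bytenr, unsigned long long refs)
    {
            if (refs == 0) {
                    fprintf(stderr,
                            "corrupt extent tree: block %llu has 0 references\n",
                            bytenr);
                    return -EUCLEAN;
            }
            return 0;
    }

    int main(void)
    {
            printf("good block -> %d, corrupt block -> %d\n",
                   check_block_refs(65536, 1), check_block_refs(131072, 0));
            return 0;
    }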
-
Josef Bacik authored
We have a couple of areas where we check to make sure the tree block is locked before looking up or messing with references. This is old code, so it has these checks as BUG_ON()s. Convert them to ASSERT()s for developers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We have blanket BUG_ON(ret) after every one of these reference mod attempts, which is just incorrect. If we encounter any errors during walk_down_tree() we will abort, so abort on any one of these failures. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We handle errors here properly and ENOMEM isn't fatal, so return the error. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
This is a big chunk of code in do_walk_down() that will conditionally remove the reference for the child block we're currently evaluating. Extract it out into its own helper and call that from do_walk_down() instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We currently duplicate the logic for walking into a node during snapshot delete. In one case it is during the actual delete, and in the other we use it for deciding if we should reada the block or not. Factor this code into its own helper, comment fully what we're doing, and then update the two users to use the new helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We only set this if wc->refs[level - 1] > 1, and we check it way above where we actually need it because the first thing we do before dropping our refs is reset wc->refs[level - 1] to 0. Reorder the resetting of wc->refs to after our drop logic, then remove the need_account variable and simply check wc->refs[level - 1] directly instead of using a variable. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
do_walk_down() already has a bunch of things going on, and there's a bit of code related to reading in the next eb if we decide we need it. Move this code off into its own helper to clean up do_walk_down() a little bit. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Instead of using a flag that we pass around everywhere, add a field to walk_control, which we are already passing around everywhere, and use that instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Currently if our extent buffer isn't uptodate we will drop the lock, free it, and then call read_tree_block() for the bytenr. This is inefficient: we already have the extent buffer, so we can simply call btrfs_read_extent_buffer(). Merge these two cases down into one if statement: if we are not uptodate we can drop the lock, trigger readahead, do the read using btrfs_read_extent_buffer(), and carry on. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Currently we have a handful of btrfs_check_eb_owner() calls in various places and helpers that read extent buffers. However, we call this in the endio handler for every metadata block, so these extra checks are unnecessary; simply remove them from everywhere except the endio handler. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We do find_extent_buffer(), and then if we don't find the eb in cache we call btrfs_find_create_tree_block(), which calls find_extent_buffer() first and then allocates the extent buffer. The reason we're doing this is that if we don't find the extent buffer in cache we set reada = 1. However this doesn't matter, because lower down we only trigger reada if !btrfs_buffer_uptodate(eb), which would be the case if we didn't find the extent buffer in cache and had to allocate it. Clean this up to simply call btrfs_find_create_tree_block(), and then use the fact that we have to read the extent buffer from disk to go ahead and kick off readahead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
As of commit 1b53e51a ("btrfs: don't commit transaction for every subvol create") we started to make any fsync after creating a subvolume fall back to a transaction commit if the fsync is performed in the same transaction that was used to create the subvolume. This happens with the following at ioctl.c:create_subvol(): $ cat fs/btrfs/ioctl.c (...) /* Tree log can't currently deal with an inode which is a new root. */ btrfs_set_log_full_commit(trans); (...) Note that the comment is misleading, as the problem is not that fsync cannot deal with the root inode of a new root, but that we cannot log any inode that belongs to a root that was not yet persisted, because that would make log replay fail since the root doesn't exist at log replay time. The above simply makes any fsync fall back to a full transaction commit if it happens in the same transaction used to create the subvolume - even if it's an inode that belongs to any other subvolume. This is a brute force solution and it doesn't necessarily improve performance for every workload out there - it just moves a full transaction commit from one place, the subvolume creation, to another - an fsync for any inode. Improve on this by making the fallback to a transaction commit happen only for an fsync against an inode of the new subvolume, or for the directory that contains the dentry that points to the new subvolume (in case anyone attempts to fsync the directory in the same transaction). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
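To make the intent concrete, a purely illustrative sketch of the narrowed fallback decision; all types, fields and the helper name here are hypothetical and do not reflect the actual btrfs implementation:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, heavily simplified stand-ins. */
    struct root  { unsigned long long id; unsigned long long created_in_transid; };
    struct inode { struct root *root; unsigned long long ino; };

    /*
     * Only force the fsync to fall back to a transaction commit when the
     * inode belongs to the not-yet-persisted subvolume, or is the directory
     * holding the new subvolume's dentry, and only while the creating
     * transaction is still open.
     */
    static bool must_fallback_to_commit(const struct inode *inode,
                                        const struct root *new_root,
                                        unsigned long long parent_dir_ino,
                                        unsigned long long current_transid)
    {
            if (new_root->created_in_transid != current_transid)
                    return false;
            if (inode->root->id == new_root->id)
                    return true;
            return inode->ino == parent_dir_ino;
    }

    int main(void)
    {
            struct root new_root = { .id = 256, .created_in_transid = 42 };
            struct root fs_root  = { .id = 5,   .created_in_transid = 1 };
            struct inode unrelated = { .root = &fs_root, .ino = 1000 };
            struct inode parent    = { .root = &fs_root, .ino = 257 };

            printf("unrelated inode forces commit: %d\n",
                   must_fallback_to_commit(&unrelated, &new_root, 257, 42));
            printf("parent dir forces commit: %d\n",
                   must_fallback_to_commit(&parent, &new_root, 257, 42));
            return 0;
    }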
-
Filipe Manana authored
When creating and deleting a subvolume, after starting a transaction we are explicitly calling btrfs_record_root_in_trans() for the root which we passed to btrfs_start_transaction(). This is pointless because at transaction.c:start_transaction() we end up doing that call, regardless of whether we actually start a new transaction or join an existing one, and if we were not doing it there, the root item of that root would not be updated in the root tree when committing the transaction, leading to problems that are easy to spot with fstests, for example. Remove these redundant calls. They were introduced with commit 74e97958 ("btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations"). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
All parameters passed into setup_relocation_extent_mapping() can be derived from 'struct reloc_control', so only pass in a 'struct reloc_control'. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
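The relocation cleanups here all follow the same shape; a minimal, self-contained C sketch of the pattern with simplified stand-in types (the real struct reloc_control is much larger):

    #include <stdio.h>

    /* Simplified stand-ins; not the real btrfs relocation structures. */
    struct inode { int dummy; };
    struct file_extent_cluster { int nr; };
    struct reloc_control {
            struct inode *data_inode;
            struct file_extent_cluster cluster;
    };

    /* Old shape (sketch): members passed one by one. */
    static void setup_old(struct inode *data_inode, struct file_extent_cluster *cluster)
    {
            (void)data_inode;
            (void)cluster;
    }

    /* New shape (sketch): hand over the owning structure instead. */
    static void setup_new(struct reloc_control *rc)
    {
            setup_old(rc->data_inode, &rc->cluster);
    }

    int main(void)
    {
            struct inode ino = { 0 };
            struct reloc_control rc = { .data_inode = &ino, .cluster = { .nr = 1 } };

            setup_new(&rc);
            printf("cluster has %d extent(s)\n", rc.cluster.nr);
            return 0;
    }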
-
Johannes Thumshirn authored
Pass a 'struct reloc_control' to prealloc_file_extent_cluster() instead of passing its members 'data_inode' and 'cluster' on their own. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
In describe_relocation() the fs_info is only needed for printing information via btrfs_info() and can easily be accessed via the passed in 'struct btrfs_block_group'. So we can safely remove the fs_info parameter. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Pass a 'struct reloc_control' to relocate_one_folio() instead of passing its members data_inode and cluster as separate arguments to the function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Instead of passing in a reloc_control's data_inode and file_extent_cluster members, pass in the whole reloc_control structure. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Pass a 'struct reloc_control' to relocate_data_extent() instead of its data_inode and file_extent_cluster members separately. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During ordered extent splitting, if we find a duplicated ordered extent when attempting to insert the new ordered extent, we panic but with a message that has the "zoned:" prefix. This is because the splitting used to be exclusive to zoned filesystems, but as of commit b73a6fd1 ("btrfs: split partial dio bios before submit") it can also be done for non-zoned filesystems during direct IO writes. So remove the "zoned:" prefix from the message and mention the split to make it more specific and different from the panic message at insert_ordered_extent(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
We never expect an ordered extent insertion to fail due to already having another ordered extent in the tree for the same file offset, since we always wait for existing ordered extents in a range to complete before writing into the range again. So mark the failure checks for the results of tree_insert() as unlikely, to make it clear it's never expected (save for exceptional causes like bugs or memory corruption) and to serve as a hint for the compiler to possibly generate better code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
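For reference, a tiny userspace example of the annotation (assuming GCC or Clang; unlikely() is defined locally here, equivalent in effect to the kernel's use of __builtin_expect):

    #include <stdio.h>

    /* Userspace stand-in equivalent to the kernel's annotation. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /*
     * A failure that should never happen in practice is wrapped in unlikely()
     * so readers, and the compiler's block layout, treat it as the cold path.
     */
    static int insert_node(int already_present)
    {
            if (unlikely(already_present)) {
                    fprintf(stderr, "duplicate entry, this should never happen\n");
                    return -1;
            }
            return 0;
    }

    int main(void)
    {
            return insert_node(0);
    }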
-
Filipe Manana authored
At btrfs_split_ordered_extent(), we are removing and re-inserting the ordered extent that we are trimming, but we don't need to, since the trimming doesn't change its position in the red-black tree because we don't have overlapping ordered extents (that would imply double allocation of extents) and we know the split length is smaller than the ordered extent's num_bytes field (we checked that early in the function). So drop the remove and re-insert code for the split ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
There are subtle details about why the root's ordered_extent_lock is held, so add a comment mentioning them. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_wait_ordered_extents(), there's no point in updating the counters after locking the root's ordered extent lock, as the counters are local. So change this to update the counters before taking the lock. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
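A generic, self-contained illustration of the idea, using a pthread mutex instead of the ordered extent lock: purely local updates happen before the lock is taken, keeping the critical section minimal:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_pending;

    /*
     * The local counter update does not touch shared state, so it is done
     * before taking the lock; only the shared counter needs protection.
     */
    static void queue_work(int *local_count)
    {
            (*local_count)++;               /* local, no lock needed */

            pthread_mutex_lock(&lock);
            shared_pending++;               /* shared, lock required */
            pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
            int local = 0;

            queue_work(&local);
            printf("local=%d shared=%d\n", local, shared_pending);
            return 0;
    }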
-
Filipe Manana authored
At btrfs_wait_ordered_roots(), there's no point in decrementing the counter after locking fs_info->ordered_root_lock as the counter is local. So change this to decrement the counter before taking the lock. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We can add const to many parameters; this is for clarity and a minor addition to safety. There are some minor effects in the assembly code and the .ko, measured on a release config. This patch does not cover all possible conversions. Signed-off-by: David Sterba <dsterba@suse.com>
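A small illustrative example of what the added const buys (stand-in type, not the real btrfs structures): read-only helpers take pointer-to-const, so accidental writes become compile errors:

    #include <stdio.h>

    /* Simplified stand-in type, for illustration only. */
    struct extent_state { unsigned long start, end; };

    /* A read-only helper: the const parameter documents and enforces that
     * the argument is not modified. */
    static unsigned long extent_len(const struct extent_state *state)
    {
            /* state->start = 0;  <-- would now be a compile error */
            return state->end - state->start + 1;
    }

    int main(void)
    {
            struct extent_state es = { .start = 4096, .end = 8191 };

            printf("len=%lu\n", extent_len(&es));
            return 0;
    }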
-
Qu Wenruo authored
There is already an error inside that header: #if !defined(__LINUX_SPINLOCK_TYPES_H) # error "Do not include directly, include spinlock_types.h" #endif Thankfully it never gets triggered, as some other headers have already included spinlock_types.h. However, clangd still raises a proper warning about it if only extent_map.h is opened. Fix it by using spinlock_types.h instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
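A kernel-style header sketch of the idea, with hypothetical names (it only builds inside a kernel tree): a header that merely embeds a lock includes the *_types.h header that defines the type, rather than a header that errors out when included directly:

    /*
     * Sketch only, not the actual extent_map.h: the structure embeds a lock,
     * so the header needs the lock type definition and nothing more.
     */
    #ifndef SKETCH_EXTENT_MAP_H
    #define SKETCH_EXTENT_MAP_H

    #include <linux/spinlock_types.h>      /* defines the lock types only */
    #include <linux/rbtree.h>

    struct sketch_extent_map_tree {
            struct rb_root root;
            rwlock_t lock;                 /* only the type is needed here */
    };

    #endif /* SKETCH_EXTENT_MAP_H */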
-
Qu Wenruo authored
We have several headers that include themselves, triggering clangd warnings. Such includes are caused by commit 602035d7 ("btrfs: add forward declarations and headers, part 2"). Just remove such unnecessary includes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Junchao Sun authored
Generic slab works fine allocating btrfs_qgroup_extent_record structures. It's not necessary to create a dedicated kmem cache that would be created but unused if quotas were not enabled. Let's delete the TODO line. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-