- 16 May, 2022 40 commits
-
-
Qu Wenruo authored
It's only internally used as another way to represent btrfs profiles, it's not exposed through any on-disk format, in fact this btrfs_raid_types is diverted from the on-disk format values. Furthermore, since it's internal structure, its definition can change in the future. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
rmw_workers doesn't need ordered execution or thread disabling threshold (as the thresh parameter is less than DFT_THRESHOLD). Just switch to the normal workqueues that use a lot less resources, especially in the work_struct vs btrfs_work structures. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
All three scrub workqueues don't need ordered execution or thread disabling threshold (as the thresh parameter is less than DFT_THRESHOLD). Just switch to the normal workqueues that use a lot less resources, especially in the work_struct vs btrfs_work structures. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Just let the one caller that wants optional WQ_HIGHPRI handling allocate a separate btrfs_workqueue for that. This allows to rename struct __btrfs_workqueue to btrfs_workqueue, remove a pointer indirection and separate allocation for all btrfs_workqueue users and generally simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Now the btrfs RAID56 infrastructure has migrated to use sector_ptr interface, it should be safe to enable subpage support for RAID56. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The non-compatible part is only the bitmap iteration part, now the bitmap size is extended to rbio::stripe_nsectors, not the old rbio::stripe_npages. Since we're here, also slightly improve the function by: - Rename @i to @stripe - Rename @bit to @sectornr - Move @page and @index into the inner loop Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Function steal_rbio() will take all the uptodate pages from the source rbio to destination rbio. With the new stripe_sectors[] array, we also need to do the extra check: - Check sector::flags to make sure the full page is uptodate Now we don't use PageUptodate flag for subpage cases to indicate if the page is uptodate. Instead we need to check all the sectors belong to the page to be sure about whether it's full page uptodate. So here we introduce a new helper, full_page_sectors_uptodate() to do the check. - Update rbio::stripe_sectors[] to use the new page pointer We only need to change the page pointer, no need to change anything else. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Unlike previous code, we can not directly set PageUptodate for stripe pages now. Instead we have to iterate through all the sectors and set SECTOR_UPTODATE flag there. Introduce a new helper find_stripe_sector(), to do the work. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The functionality is completely replaced by the new bio_sectors member, now it's time to remove the old member. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This requires one extra parameter @pgoff for the function. In the current code base, scrub is still one page per sector, thus the new parameter will always be 0. It needs the extra subpage scrub optimization code to fully take advantage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There is only one caller for that helper now, and we're definitely fine to open-code it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
With this function converted to subpage compatible sector interfaces, the following helper functions can be removed: - rbio_stripe_page() - rbio_pstripe_page() - rbio_qstripe_page() - page_in_rbio() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This involves: - Use sector_ptr interface to grab the pointers - Add sector->pgoff to pointers[] - Rebuild data using sectorsize instead of PAGE_SIZE - Use memcpy() to replace copy_page() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The core is to convert direct page usage into sector_ptr usage, and use memcpy() to replace copy_page(). For pointers usage, we need to convert it to kmap_local_page() + sector->pgoff. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Make rbio_add_io_page() subpage compatible, which involves: - Rename rbio_add_io_page() to rbio_add_io_sector() Although we still rely on PAGE_SIZE == sectorsize, so add a new ASSERT() inside rbio_add_io_sector() to make sure all pgoff is 0. - Introduce rbio_stripe_sector() helper The equivalent of rbio_stripe_page(). This new helper has extra ASSERT()s to validate the stripe and sector number. - Introduce sector_in_rbio() helper The equivalent of page_in_rbio(). - Rename @pagenr variables to @sectornr - Use rbio::stripe_nsectors when iterating the bitmap Please note that, this only changes the interface, the bios are still using full page for IO. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
This new member is going to fully replace bio_pages in the future, but for now let's keep them co-exist, until the full switch is done. Currently cache_rbio_pages() and index_rbio_pages() will also populate the new array. And cache_rbio_pages() need to record which sectors are uptodate, so we also need to introduce sector_ptr::uptodate bit. To avoid extra memory usage, we let the new @uptodate bit to share bits with @pgoff. Now pgoff only has at most 31 bits, which is already more than enough, as even for 256K page size, we only need 18 bits. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The new member is an array of sector_ptr pointers, they will represent all sectors inside a full stripe (including P/Q). They co-operate with btrfs_raid_bio::stripe_pages: stripe_pages: | Page 0, range [0, 64K) | Page 1 ... stripe_sectors: | | | ... | | | | \- sector 15, page 0, pgoff=60K | \- sector 1, page 0, pgoff=4K \---- sector 0, page 0, pfoff=0 With such structure, we can represent subpage sectors without using extra pages. Here we introduce a new helper, index_stripe_sectors(), to update stripe_sectors[] to point to correct page and pgoff. So every time rbio::stripe_pages[] pointer gets updated, the new helper should be called. The following functions have to call the new helper: - steal_rbio() - alloc_rbio_pages() - alloc_rbio_parity_pages() - alloc_rbio_essential_pages() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The new members are all related to number of sectors, but the existing number of pages members are kept as is: - nr_sectors Total sectors of the full stripe including P/Q. - stripe_nsectors The sectors of a single stripe. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There are a lot of members using much larger type in btrfs_raid_bio than necessary, like nr_pages which represents the total number of a full stripe. Instead of int (which is at least 32bits), u16 is already enough (max stripe length will be 256MiB, already beyond current RAID56 device number limit). So this patch will reduce the width of the following members: - stripe_len to u32 - nr_pages to u16 - nr_data to u8 - real_stripes to u8 - scrubp to u8 - faila/b to s8 As -1 is used to indicate no corruption This will slightly reduce the size of btrfs_raid_bio from 272 bytes to 256 bytes, reducing 16 bytes usage. But please note that, when using btrfs_raid_bio, we allocate extra space for it to cover various pointer array, so the reduce memory is not really a big saving overall. As we're here modifying the comments already, update existing comments to current code standard. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
The function rbio_nr_pages() is only called once inside alloc_rbio(), there is no reason to make it dedicated helper. Furthermore, the return type doesn't match, the function return "unsigned long" which may not be necessary, while the only caller only uses "int". Since we're doing cleaning up here, also fix the type to "const unsigned int" for all involved local variables. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Currently btrfs uses fixed stripe length (64K), thus u32 is wide enough for the usage. Furthermore, even in the future we choose to enlarge stripe length to larger values, I don't believe we would want stripe as large as 4G or larger. So this patch will reduce the width for all in-memory structures and parameters, this involves: - RAID56 related function argument lists This allows us to do direct division related to stripe_len. Although we will use bits shift to replace the division anyway. - btrfs_io_geometry structure This involves one change to simplify the calculation of both @stripe_nr and @stripe_offset, using div64_u64_rem(). And add extra sanity check to make sure @stripe_offset is always small enough for u32. This saves 8 bytes for the structure. - map_lookup structure This convert @stripe_len to u32, which saves 8 bytes. (saved 4 bytes, and removed a 4-bytes hole) Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Both btrfs_repair_one_sector and submit_bio_one as the direct caller of one of the instances ignore errors as they expect the methods themselves to call ->bi_end_io on error. Remove the unused and dangerous return value. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
btrfs_submit_compressed_read already calls ->bi_end_io on error and the caller must ignore the return value, so remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
btrfs_submit_metadata_bio already calls ->bi_end_io on error and the caller must ignore the return value, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
This argument is unused since commit 953651eb ("btrfs: factor out helper adding a page to bio") and commit 1b36294a ("btrfs: call submit_bio_hook directly for metadata pages") reworked the way metadata bio submission is handled. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Keep btrfs_readpage next to btrfs_do_readpage and the other address space operations. This allows to keep submit_one_bio and struct btrfs_bio_ctrl file local in extent_io.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] There is a report that a btrfs has a bad super block num devices. This makes btrfs to reject the fs completely. BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here BTRFS error (device sdd3): failed to read chunk tree: -22 BTRFS error (device sdd3): open_ctree failed [CAUSE] During btrfs device removal, chunk tree and super block num devs are updated in two different transactions: btrfs_rm_device() |- btrfs_rm_dev_item(device) | |- trans = btrfs_start_transaction() | | Now we got transaction X | | | |- btrfs_del_item() | | Now device item is removed from chunk tree | | | |- btrfs_commit_transaction() | Transaction X got committed, super num devs untouched, | but device item removed from chunk tree. | (AKA, super num devs is already incorrect) | |- cur_devices->num_devices--; |- cur_devices->total_devices--; |- btrfs_set_super_num_devices() All those operations are not in transaction X, thus it will only be written back to disk in next transaction. So after the transaction X in btrfs_rm_dev_item() committed, but before transaction X+1 (which can be minutes away), a power loss happen, then we got the super num mismatch. This has been fixed by commit bbac5869 ("btrfs: remove device item and update super block in the same transaction"). [FIX] Make the super_num_devices check less strict, converting it from a hard error to a warning, and reset the value to a correct one for the current or next transaction commit. As the number of device items is the critical information where the super block num_devices is only a cached value (and also useful for cross checking), it's safe to automatically update it. Other device related problems like missing device are handled after that and may require other means to resolve, like degraded mount. With this fix, potentially affected filesystems won't fail mount and require the manual repair by btrfs check. Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Goldwyn Rodrigues authored
Parameter struct compressed_bio is not used by the function submit_compressed_bio(). Remove it. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When doing a NOCOW write, either through direct IO or buffered IO, we do two lookups for the block group that contains the target extent: once when we call btrfs_inc_nocow_writers() and then later again when we call btrfs_dec_nocow_writers() after creating the ordered extent. The lookups require taking a lock and navigating the red black tree used to track all block groups, which can take a non-negligible amount of time for a large filesystem with thousands of block groups, as well as lock contention and cache line bouncing. Improve on this by having a single block group search: making btrfs_inc_nocow_writers() return the block group to its caller and then have the caller pass that block group to btrfs_dec_nocow_writers(). This is part of a patchset comprised of the following patches: btrfs: remove search start argument from first_logical_byte() btrfs: use rbtree with leftmost node cached for tracking lowest block group btrfs: use a read/write lock for protecting the block groups tree btrfs: return block group directly at btrfs_next_block_group() btrfs: avoid double search for block group during NOCOW writes The following test was used to test these changes from a performance perspective: $ cat test.sh #!/bin/bash modprobe null_blk nr_devices=0 NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0 mkdir $NULL_DEV_PATH if [ $? -ne 0 ]; then echo "Failed to create nullb0 directory." exit 1 fi echo 2 > $NULL_DEV_PATH/submit_queues echo 16384 > $NULL_DEV_PATH/size # 16G echo 1 > $NULL_DEV_PATH/memory_backed echo 1 > $NULL_DEV_PATH/power DEV=/dev/nullb0 MNT=/mnt/nullb0 LOOP_MNT="$MNT/loop" MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [io_uring_writes] rw=randwrite fsync=0 fallocate=posix group_reporting=1 direct=1 ioengine=io_uring iodepth=64 bs=64k filesize=1g runtime=300 time_based directory=$LOOP_MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT mkdir $LOOP_MNT truncate -s 4T $MNT/loopfile mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT # Trigger the allocation of about 3500 data block groups, without # actually consuming space on underlying filesystem, just to make # the tree of block group large. fallocate -l 3500G $LOOP_MNT/filler fio /tmp/fio-job.ini umount $LOOP_MNT umount $MNT echo 0 > $NULL_DEV_PATH/power rmdir $NULL_DEV_PATH The test was run on a non-debug kernel (Debian's default kernel config), the result were the following. Before patchset: WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec After patchset: WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec +3.3% write throughput and +3.3% IO done in the same time period. The test has somewhat limited coverage scope, as with only NOCOW writes we get less contention on the red black tree of block groups, since we don't have the extra contention caused by COW writes, namely when allocating data extents, pinning and unpinning data extents, but on the hand there's access to tree in the NOCOW path, when incrementing a block group's number of NOCOW writers. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_next_block_group(), we have this long line with two statements: cache = btrfs_lookup_first_block_group(...); return cache; This makes it a bit harder to read due to two statements on the same line, so change that to directly return the result of the call to btrfs_lookup_first_block_group(). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently we use a spin lock to protect the red black tree that we use to track block groups. Most accesses to that tree are actually read only and for large filesystems, with thousands of block groups, it actually has a bad impact on performance, as concurrent read only searches on the tree are serialized. Read only searches on the tree are very frequent and done when: 1) Pinning and unpinning extents, as we need to lookup the respective block group from the tree; 2) Freeing the last reference of a tree block, regardless if we pin the underlying extent or add it back to free space cache/tree; 3) During NOCOW writes, both buffered IO and direct IO, we need to check if the block group that contains an extent is read only or not and to increment the number of NOCOW writers in the block group. For those operations we need to search for the block group in the tree. Similarly, after creating the ordered extent for the NOCOW write, we need to decrement the number of NOCOW writers from the same block group, which requires searching for it in the tree; 4) Decreasing the number of extent reservations in a block group; 5) When allocating extents and freeing reserved extents; 6) Adding and removing free space to the free space tree; 7) When releasing delalloc bytes during ordered extent completion; 8) When relocating a block group; 9) During fitrim, to iterate over the block groups; 10) etc; Write accesses to the tree, to add or remove block groups, are much less frequent as they happen only when allocating a new block group or when deleting a block group. We also use the same spin lock to protect the list of currently caching block groups. Additions to this list are made when we need to cache a block group, because we don't have a free space cache for it (or we have but it's invalid), and removals from this list are done when caching of the block group's free space finishes. These cases are also not very common, but when they happen, they happen only once when the filesystem is mounted. So switch the lock that protects the tree of block groups from a spinning lock to a read/write lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
We keep track of the start offset of the block group with the lowest start offset at fs_info->first_logical_byte. This requires explicitly updating that field every time we add, delete or lookup a block group to/from the red black tree at fs_info->block_group_cache_tree. Since the block group with the lowest start address happens to always be the one that is the leftmost node of the tree, we can use a red black tree that caches the left most node. Then when we need the start address of that block group, we can just quickly get the leftmost node in the tree and extract the start offset of that node's block group. This avoids the need to explicitly keep track of that address in the dedicated member fs_info->first_logical_byte, and it also allows the next patch in the series to switch the lock that protects the red black tree from a spin lock to a read/write lock - without this change it would be tricky because block group searches also update fs_info->first_logical_byte. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The search start argument passed to first_logical_byte() is always 0, as we always want to get the logical start address of the block group with the lowest logical start address. So remove it, as not only it is not necessary, it also makes the following patches that change the lock that protects the red black tree of block groups from a spin lock to a read/write lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] If we hit an error from submit_extent_page() inside __extent_writepage_io(), we could still return 0 to the caller, and even trigger the warning in btrfs_page_assert_not_dirty(). [CAUSE] In __extent_writepage_io(), if we hit an error from submit_extent_page(), we will just clean up the range and continue. This is completely fine for regular PAGE_SIZE == sectorsize, as we can only hit one sector in one page, thus after the error we're ensured to exit and @ret will be saved. But for subpage case, we may have other dirty subpage range in the page, and in the next loop, we may succeeded submitting the next range. In that case, @ret will be overwritten, and we return 0 to the caller, while we have hit some error. [FIX] Introduce @has_error and @saved_ret to record the first error we hit, so we will never forget what error we hit. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] Test case generic/475 have a very high chance (almost 100%) to hit a fs hang, where a data page will never be unlocked and hang all later operations. [CAUSE] In btrfs_do_readpage(), if we hit an error from submit_extent_page() we will try to do the cleanup for our current io range, and exit. This works fine for PAGE_SIZE == sectorsize cases, but not for subpage. For subpage btrfs_do_readpage() will lock the full page first, which can contain several different sectors and extents: btrfs_do_readpage() |- begin_page_read() | |- btrfs_subpage_start_reader(); | Now the page will have PAGE_SIZE / sectorsize reader pending, | and the page is locked. | |- end_page_read() for different branches | This function will reduce subpage readers, and when readers | reach 0, it will unlock the page. But when submit_extent_page() failed, we only cleanup the current io range, while the remaining io range will never be cleaned up, and the page remains locked forever. [FIX] Update the error handling of submit_extent_page() to cleanup all the remaining subpage range before exiting the loop. Please note that, now submit_extent_page() can only fail due to sanity check in alloc_new_bio(). Thus regular IO errors are impossible to trigger the error path. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When running generic/475 with 64K page size and 4K sector size, it has a very high chance (almost 100%) to hang, with mostly data page locked but no one is going to unlock it. [CAUSE] With commit 1784b7d5 ("btrfs: handle csum lookup errors properly on reads"), if we failed to lookup checksum due to metadata IO error, we will return error for btrfs_submit_data_bio(). This will cause the page to be unlocked twice in btrfs_do_readpage(): btrfs_do_readpage() |- submit_extent_page() | |- submit_one_bio() | |- btrfs_submit_data_bio() | |- if (ret) { | |- bio->bi_status = ret; | |- bio_endio(bio); } | In the endio function, we will call end_page_read() | and unlock_extent() to cleanup the subpage range. | |- if (ret) { |- unlock_extent(); end_page_read() } Here we unlock the extent and cleanup the subpage range again. For unlock_extent(), it's mostly double unlock safe. But for end_page_read(), it's not, especially for subpage case, as for subpage case we will call btrfs_subpage_end_reader() to reduce the reader number, and use that to number to determine if we need to unlock the full page. If double accounted, it can underflow the number and leave the page locked without anyone to unlock it. [FIX] The commit 1784b7d5 ("btrfs: handle csum lookup errors properly on reads") itself is completely fine, it's our existing code not properly handling the error from bio submission hook properly. This patch will make submit_one_bio() to return void so that the callers will never be able to do cleanup when bio submission hook fails. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Schspa Shi authored
This is an optimization for fix fee13fe9 ("btrfs: correct zstd workspace manager lock to use spin_lock_bh()") The critical region for wsm.lock is only accessed by the process context and the softirq context. Because in the soft interrupt, the critical section will not be preempted by the soft interrupt again, there is no need to call spin_lock_bh(&wsm.lock) to turn off the soft interrupt, spin_lock(&wsm.lock) is enough for this situation. Signed-off-by: Schspa Shi <schspa@gmail.com> [ minor comment update ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
We are still using the magic value of 2 at btrfs_create_new_inode(), but there's now a constant for that, named BTRFS_DIR_START_INDEX, which was introduced in commit 528ee697 ("btrfs: put initial index value of a directory in a constant"). So change that to use the constant. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Cleanup the function submit_read_repair() by: - Remove the fixed argument submit_bio_hook() The function is only called on buffered data read path, so the @submit_bio_hook argument is always btrfs_submit_data_bio(). Since it's fixed, then there is no need to pass that argument at all. - Rename the function to submit_data_read_repair() Just to be more explicit on all the 3 things, data, read and repair. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Reading a value from a different member of a union is not just a great way to obfuscate code, but also creates an aliasing violation. Switch btrfs_is_zoned to look at ->zone_size and remove the union. Note: union was to simplify the detection of zoned filesystem but now this is wrapped behind btrfs_is_zoned so we can drop the union. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
-