- 23 Aug, 2021 40 commits
-
-
Qu Wenruo authored
[BUG] When using the following script, btrfs will report data corruption after one data balance with subpage support: mkfs.btrfs -f -s 4k $dev mount $dev -o nospace_cache $mnt $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress sync btrfs balance start -d $mnt btrfs scrub start -B $mnt Similar problem can be easily observed in btrfs/028 test case, there will be tons of balance failure with -EIO. [CAUSE] Above fsstress will result the following data extents layout in extent tree: item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82 refs 2 gen 7 flags DATA extent data backref root FS_TREE objectid 259 offset 1339392 count 1 extent data backref root FS_TREE objectid 259 offset 647168 count 1 item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24 block group used 102400 chunk_objectid 256 flags DATA item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53 refs 1 gen 7 flags DATA extent data backref root FS_TREE objectid 259 offset 729088 count 1 Then when creating the data reloc inode, the data reloc inode will look like this: 0 32K 64K 96K 100K 104K |<------ Extent A ----->| |<- Ext B ->| Then when we first try to relocate extent A, we setup the data reloc inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K). For page 64K, since the i_size is just 96K, we fill range [96K, 128K) with 0 and set it uptodate. Then when we come to extent B, we update i_size to 104K, then try to read page [64K, 128K). Then we find the page is already uptodate, so we skip the read. But range [96K, 128K) is filled with 0, not the real data. Then we writeback the data reloc inode to disk, with 0 filling range [96K, 128K), corrupting the content of extent B. The behavior is caused by the fact that we still do full page read for subpage case. The bug won't really happen for regular sectorsize, as one page only contains one sector. [FIX] This patch will fix the problem by invalidating range [i_size, PAGE_END] in prealloc_file_extent_cluster(). So that if above example happens, when we preallocate the file extent for extent B, we will clear the uptodate bits for range [96K, 128K), allowing later relocate_one_page() to re-read the needed range. There is a special note for the invalidating part. Since we're not calling real btrfs_invalidatepage(), but just clearing the subpage and page uptodate bits, we can leave a page half dirty and half out of date. Reading such page can cause a deadlock, as we normally expect a dirty page to be fully uptodate. Thus here we flush and wait the data reloc inode before doing the hacked invalidating. This won't cause extra overhead, as we're going to writeback the data later anyway. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When relocating partial preallocated data extents (part of the preallocated extent is written) for subpage, it can cause the following false alert and make the relocation to fail: BTRFS info (device dm-3): balance: start -d BTRFS info (device dm-3): relocating block group 13631488 flags data BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1 BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1 BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 BTRFS info (device dm-3): balance: ended with status: -5 The minimal script to reproduce looks like this: mkfs.btrfs -f -s 4k $dev mount $dev -o nospace_cache $mnt xfs_io -f -c "falloc 0 8k" $mnt/file xfs_io -f -c "pwrite 0 4k" $mnt/file btrfs balance start -d $mnt [CAUSE] Function btrfs_verify_data_csum() checks if the full range has EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range have EXTENT_NODATASUM bit, then it skip the range. This works pretty well for regular sectorsize, as in that case btrfs_verify_data_csum() is called for each sector, thus no problem at all. But for subpage case, btrfs_verify_data_csum() is called on each bvec, which can contain several sectors, and since it checks *all* bytes for EXTENT_NODATASUM bit, if we have some range with csum, then we will continue checking all the sectors. For the preallocated sectors, it doesn't have any csum, thus obviously the csum won't match and cause the false alert. [FIX] Move the EXTENT_NODATASUM check into the main loop, so that we can check each sector for EXTENT_NODATASUM bit for subpage case. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] There is a possible use-after-free bug when running generic/095. BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b Faulting instruction address: 0xc000000000283654 c000000000283078 do_raw_spin_unlock+0x88/0x230 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0 c0000000009e0458 end_bio_extent_writepage+0x158/0x270 c000000000b6fd14 bio_endio+0x254/0x270 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200 c000000000b6fd14 bio_endio+0x254/0x270 c000000000b781fc blk_update_request+0x46c/0x670 c000000000b8b394 blk_mq_end_request+0x34/0x1d0 c000000000d82d1c lo_complete_rq+0x11c/0x140 c000000000b880a4 blk_complete_reqs+0x84/0xb0 c0000000012b2ca4 __do_softirq+0x334/0x680 c0000000001dd878 irq_exit+0x148/0x1d0 c000000000016f4c do_IRQ+0x20c/0x240 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0 [CAUSE] There is very small race window like the following in generic/095. Thread 1 | Thread 2 --------------------------------+------------------------------------ end_bio_extent_writepage() | btrfs_releasepage() |- spin_lock_irqsave() | | |- end_page_writeback() | | | | |- if (PageWriteback() ||...) | | |- clear_page_extent_mapped() | | |- kfree(subpage); |- spin_unlock_irqrestore(). The race can also happen between writeback and btrfs_invalidatepage(), although that would be much harder as btrfs_invalidatepage() has much more work to do before the clear_page_extent_mapped() call. [FIX] Here we "wait" for the subapge spinlock to be released before we detach subpage structure. So this patch will introduce a new function, wait_subpage_spinlock(), to do the "wait" by acquiring the spinlock and release it. Since the caller has ensured the page is not dirty nor writeback, and page is already locked, the only way to hold the subpage spinlock is from endio function. Thus we only need to acquire the spinlock to wait for any existing holder. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When running generic/095, there is a high chance to crash with subpage data RW support: assertion failed: PagePrivate(page) && page->private ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3403! Internal error: Oops - BUG: 0 [#1] SMP CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17 Hardware name: Khadas VIM3 (DT) Call trace: assertfail.constprop.0+0x28/0x2c [btrfs] btrfs_subpage_assert+0x80/0xa0 [btrfs] btrfs_subpage_set_uptodate+0x34/0xec [btrfs] btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs] btrfs_dirty_pages+0x160/0x270 [btrfs] btrfs_buffered_write+0x444/0x630 [btrfs] btrfs_direct_write+0x1cc/0x2d0 [btrfs] btrfs_file_write_iter+0xc0/0x160 [btrfs] new_sync_write+0xe8/0x180 vfs_write+0x1b4/0x210 ksys_pwrite64+0x7c/0xc0 __arm64_sys_pwrite64+0x24/0x30 el0_svc_common.constprop.0+0x70/0x140 do_el0_svc+0x28/0x90 el0_svc+0x2c/0x54 el0_sync_handler+0x1a8/0x1ac el0_sync+0x170/0x180 Code: f0000160 913be042 913c4000 955444bc (d4210000) ---[ end trace 3fdd39f4cccedd68 ]--- [CAUSE] Although prepare_pages() calls find_or_create_page(), which returns the page locked, but in later prepare_uptodate_page() calls, we may call btrfs_readpage() which will unlock the page before it returns. This leaves a window where btrfs_releasepage() can sneak in and release the page, clearing page->private and causing above ASSERT(). [FIX] In prepare_uptodate_page(), we should not only check page->mapping, but also PagePrivate() to ensure we are still holding the correct page which has proper fs context setup. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
RAID56 is not only unsafe due to its write-hole problem, but also has tons of hardcoded PAGE_SIZE. Disable it for subpage support for now. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Current submit_extent_page() just checks if the current page range can be fitted into current bio, and if not, submit then re-add. But this behavior can't handle subpage case at all. For subpage case, the problem is in the page size, 64K, which is also the same size as stripe size. This means, if we can't fit a full 64K into a bio, due to stripe limit, then it won't fit into next bio without crossing stripe either. The proper way to handle it is: - Check how many bytes we can be put into current bio - Put as many bytes as possible into current bio first - Submit current bio - Create a new bio - Add the remaining bytes into the new bio Refactor submit_extent_page() so that it does the above iteration. The main loop inside submit_extent_page() will look like this: cur = pg_offset; while (cur < pg_offset + size) { u32 offset = cur - pg_offset; int added; if (!bio_ctrl->bio) { /* Allocate new bio if needed */ } /* Add as many bytes into the bio */ added = btrfs_bio_add_page(); if (added < size - offset) { /* The current bio is full, submit it */ } cur += added; } Also, since we're doing new bio allocation deep inside the main loop, extract that code into a new helper, alloc_new_bio(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When running the following fsx command (extracted from generic/127) on subpage filesystem, it can create inline extent with regular extents: fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx The offending extent would look like: item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14 index 2 namelen 4 name: file item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728 generation 7 type 0 (inline) inline extent data size 707 ram_bytes 707 compression 0 (none) item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53 generation 7 type 2 (prealloc) prealloc data disk byte 102346752 nr 4096 prealloc data offset 0 nr 4096 [CAUSE] For subpage filesystem, the writeback is triggered in page units, which means, even if we just want to writeback range [16K, 20K) for 64K page system, we will still try to writeback any dirty sector of range [0, 64K). This is never a problem if sectorsize == PAGE_SIZE, but for subpage, this can cause unexpected problems. For above test case, the last several operations from fsx are: 9055 trunc from 0x40000 to 0x2c3 9057 falloc from 0x164c to 0x19d2 (0x386 bytes) In operation 9055, we dirtied sector [0, 4096), then in falloc, we call btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to writeback any dirty data in [4096, 8192), but nothing else. Unfortunately, in subpage case, above btrfs_wait_ordered_range() will trigger writeback of the range [0, 64K), which includes the data at [0, 4096). And since at the call site, we haven't yet increased i_size, which is still 707, this means cow_file_range() can insert an inline extent. Resulting above inline + regular extent. [WORKAROUND] I don't really have any good short-term solution yet, as this means all operations that would trigger writeback need to be reviewed for any i_size change. So here I choose to disable inline extent creation for subpage case as a workaround. We have done tons of work just to avoid such extent, so I don't to create an exception just for subpage. This only affects inline extent creation, subpage has no problem reading existing inline extents at all. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When running fsstress with subpage RW support, there are random BUG_ON()s triggered with the following trace: kernel BUG at fs/btrfs/file-item.c:667! Internal error: Oops - BUG: 0 [#1] SMP CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43 Hardware name: Radxa ROCK Pi 4B (DT) Workqueue: btrfs-worker-high btrfs_work_helper [btrfs] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs] lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs] Call trace: btrfs_csum_one_bio+0x420/0x4e0 [btrfs] btrfs_submit_bio_start+0x20/0x30 [btrfs] run_one_async_start+0x28/0x44 [btrfs] btrfs_work_helper+0x128/0x1b4 [btrfs] process_one_work+0x22c/0x430 worker_thread+0x70/0x3a0 kthread+0x13c/0x140 ret_from_fork+0x10/0x30 [CAUSE] Above BUG_ON() means there is some bio range which doesn't have ordered extent, which indeed is worth a BUG_ON(). Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra subpage dirty bitmap to record which range is dirty and should be written back. This means, if we submit bio for a subpage range, we do not only need to clear page dirty, but also need to clear subpage dirty bits. In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for any range we submit a bio. But there is loophole, if we hit a range which is beyond i_size, we just call btrfs_writepage_endio_finish_ordered() to finish the ordered io, then break out, without clearing the subpage dirty. This means, if we hit above branch, the subpage dirty bits are still there, if other range of the page get dirtied and we need to writeback that page again, we will submit bio for the old range, leaving a wild bio range which doesn't have ordered extent. [FIX] Fix it by always calling btrfs_page_clear_dirty() in __extent_writepage_io(). Also to avoid such problem from happening again, add a new assert, btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage dirty bits are cleared before exiting __extent_writepage_io(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
For subpage case, one page of data reloc inode can contain several file extents, like this: |<--- File extent A --->| FE B | FE C |<--- File extent D -->| |<--------- Page --------->| We can no longer use PAGE_SIZE directly for various operations. This patch will relocate_one_page() to handle subpage case by: - Iterating through all extents of a cluster when marking pages When marking pages dirty and delalloc, we need to check the cluster extent boundary. Now we introduce a loop to go extent by extent of a page, until we either finished the last extent, or reach the page end. By this, regular sectorsize == PAGE_SIZE can still work as usual, since we will do that loop only once. - Iteration start from max(page_start, extent_start) Since we can have the following case: | FE B | FE C |<--- File extent D -->| |<--------- Page --------->| Thus we can't always start from page_start, but do a max(page_start, extent_start) - Iteration end when the cluster is exhausted Similar to previous case, the last file extent can end before the page end: |<--- File extent A --->| FE B | FE C | |<--------- Page --------->| In this case, we need to manually exit the loop after we have finished the last extent of the cluster. - Reserve metadata space for each extent range Since now we can hit multiple ranges in one page, we should reserve metadata for each range, not simply PAGE_SIZE. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
In function relocate_file_extent_cluster(), we have a big loop for marking all involved page delalloc. That part is long enough to be contained in one function, so this patch will move that code chunk into a new function, relocate_one_page(). This also provides enough space for later subpage work. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
For the initial subpage support, although we won't support compressed write, we still need to support compressed read. But for lzo_decompress_bio() it has several problems: - The abuse of PAGE_SIZE for boundary detection For subpage case, we should follow sectorsize to detect the padding zeros. Using PAGE_SIZE will cause subpage compress read to skip certain bytes, and causing read error. - Too many helper variables There are half a dozen helper variables, which is only making things harder to read This patch will rework lzo_decompress_bio() to make it work for subpage: - Use sectorsize to do boundary check, while still use PAGE_SIZE for page switching This allows us to have the same on-disk format for 4K sectorsize fs, while take advantage of larger page size. - Use two main cursors Only @cur_in and @cur_out is utilized as the main cursor. The helper variables will only be declared inside the loop, and only 2 helper variables needed. - Introduce a helper function to copy compressed segment payload Introduce a new helper, copy_compressed_segment(), to copy a compressed segment to workspace buffer. This function will handle the page switching. Now the net result is, with all the excessive comments and new helper function, the refactored code is still smaller, and easier to read. For other decompression code, they have no special padding rule, thus no need to bother for initial subpage support, but will be refactored to the same style later. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There are several bugs inside the function btrfs_decompress_buf2page() - @start_byte doesn't take bvec.bv_offset into consideration Thus it can't handle case where the target range is not page aligned. - Too many helper variables There are tons of helper variables, @buf_offset, @current_buf_start, @start_byte, @prev_start_byte, @working_bytes, @bytes. This hurts anyone who wants to read the function. - No obvious main cursor for the iteartion A new problem caused by previous problem. - Comments for parameter list makes no sense Like @buf_start is the offset to @buf, or offset inside the full decompressed extent? (Spoiler alert, the later case) And @total_out acts more like @buf_start + @size_of_buf. The worst is @disk_start. The real meaning of it is the file offset of the full decompressed extent. This patch will rework the whole function by: - Add a proper comment with ASCII art to explain the parameter list - Rework parameter list The old @buf_start is renamed to @decompressed, to show how many bytes are already decompressed inside the full decompressed extent. The old @total_out is replaced by @buf_len, which is the decompressed data size. For old @disk_start and @bio, just pass @compressed_bio in. - Use single main cursor The main cursor will be @cur_file_offset, to show what's the current file offset. Other helper variables will be declared inside the main loop, and only minimal amount of helper variables: * offset_inside_decompressed_buf: The only real helper * copy_start_file_offset: File offset we start memcpy * bvec_file_offset: File offset of current bvec Even with all these extensive comments, the final function is still smaller than the original function, which is definitely a win. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When subpage compressed read write support is enabled, btrfs/038 always fails with EIO. A simplified script can easily trigger the problem: mkfs.btrfs -f -s 4k $dev mount $dev $mnt -o compress=lzo xfs_io -f -c "truncate 118811" $mnt/foo xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null sync btrfs subvolume snapshot -r $mnt $mnt/mysnap1 xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null sync xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null sync btrfs subvolume snapshot -r $mnt $mnt/mysnap2 cat $mnt/mysnap2/foo # Above cat will fail due to EIO [CAUSE] The problem is in btrfs_submit_compressed_read(). When it tries to grab the extent map of the read range, it uses the following call: em = lookup_extent_mapping(em_tree, page_offset(bio_first_page_all(bio)), fs_info->sectorsize); The problem is in the page_offset(bio_first_page_all(bio)) part. The offending inode has the following file extent layout item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13680640 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53 generation 8 type 1 (regular) extent data disk byte 0 nr 0 item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53 generation 8 type 1 (regular) extent data disk byte 13676544 nr 4096 extent data offset 0 nr 53248 ram 86016 extent compression 2 (lzo) And the bio passed in has the following parameters: page_offset(bio_first_page_all(bio)) = 131072 bio_first_bvec_all(bio)->bv_offset = 65536 If we use page_offset(bio_first_page_all(bio) without adding bv_offset, we will get an extent map for file offset 131072, not 196608. This means we read uncompressed data from disk, and later decompression will definitely fail. [FIX] Take bv_offset into consideration when trying to grab an extent map. And add an ASSERT() to ensure we're really getting a compressed extent. Thankfully this won't affect anything but subpage, thus we only need to ensure this patch get merged before we enabled basic subpage support. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
For current subpage support, we only support 64K page size with 4K sector size. This makes compressed readahead less effective, as maximum compressed extent size is only 128K, 2x the page size. On the other hand, the function add_ra_bio_pages() is still assuming sectorsize == PAGE_SIZE, and code change may affect 4K page size systems. So for now, let's disable subpage compressed readahead for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] When testing experimental subpage compressed write support, it hits a NULL pointer dereference inside read path: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018 pc : __pi_memcmp+0x28/0x1ec lr : check_data_csum+0xd0/0x274 [btrfs] Call trace: __pi_memcmp+0x28/0x1ec btrfs_verify_data_csum+0xf4/0x244 [btrfs] end_bio_extent_readpage+0x1d0/0x6b0 [btrfs] bio_endio+0x15c/0x1dc end_workqueue_fn+0x44/0x64 [btrfs] btrfs_work_helper+0x74/0x250 [btrfs] process_one_work+0x1d4/0x47c worker_thread+0x180/0x400 kthread+0x11c/0x120 ret_from_fork+0x10/0x30 Code: 54000261 d100044c d343fd8c f8408403 (f8408424) ---[ end trace 9e2c59f33ea40866 ]--- [CAUSE] When reading two compressed extents inside the same page, like the following layout, we trigger above crash: 0 32K 64K |-------|\\\\\\\| | \- Compressed extent (A) \--------- Compressed extent (B) For compressed read, we don't need to populate its io_bio->csum, as we rely on compressed_bio->csum to verify the compressed data, and then copy the decompressed to inode pages. Normally btrfs_verify_data_csum() skip such page by checking and clearing its PageChecked flag But since that flag is still for the full page, when endio for inode page range [0, 32K) gets executed, it clears PageChecked flag for the full page. Then when endio for inode page range [32K, 64K) gets executed, since the page no longer has PageChecked flag, it just continues checking, even though io_bio->csum is NULL. [FIX] Thankfully there are only two users of PageChecked bit: - Cow fixup Since subpage has its own way to trace page dirty (dirty_bitmap) and ordered bit (ordered_bitmap), it should never trigger cow fixup. - Compressed read We can distinguish such read by just checking io_bio->csum. So just check io_bio->csum before doing the verification to avoid such NULL pointer dereference. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a compressed extent. This is fine, as for PAGE_SIZE == sectorsize case, we can only have one sector for one page, thus @this_bio_flag will only be set at most once. But for subpage case, after hitting a compressed extent, @this_bio_flag will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular extent. This will lead to various read errors, and causing new ASSERT() in incoming subpage patches, which adds more strict check in btrfs_submit_compressed_read(). Fix it by declaring @this_bio_flag inside the main loop and reset its value for each iteration. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Comparators just read the data and thus get const parameters. This should be also preserved by the local variables, update all comparators passed to sort or bsearch. Cleanups: - unnecessary casts are dropped - btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern and 'inline' is dropped as the function address is taken Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There are two helpers doing the same calculations based on nparity and ncopies. calc_data_stripes can be simplified into one expression, so far we don't have profile with both copies and parity, so there's no effective change. calc_stripe_length should reuse the helper and not repeat the same calculation. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The device allocation is split to two functions, but one just calls the other and they're very far in the file. Merge them together. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The helper does a simple translation from block group flags to index to the btrfs_raid_array table. There's no apparent reason to inline the function, the translation happens usually once per function and is not called in a loop. Making it a proper function saves quite some binary code (x86_64, release config): text data bss dec hex filename 1164011 19253 14912 1198176 124860 pre/btrfs.ko 1161559 19253 14912 1195724 123ecc post/btrfs.ko DELTA: -2451 Also add the const attribute as there are no side effects, this could help compiler to optimize a few things without the function body. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The stripe checks for raid1c3/raid1c4 are missing in the sequence in btrfs_check_chunk_valid. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There are hardcoded values in several checks regarding chunks and stripe constraints. We have that defined in the raid table and ought to use it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it to header to avoid the function call overhead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
In commit e65f152e ("btrfs: refactor how we finish ordered extent io for endio functions") there was last caller not using 1 for the uptodate parameter. Now there's only one, passing 1, so we can remove it and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The uptodate parameter should be bool, change the type. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Since commit d75855b4 ("btrfs: Remove extent_io_ops::writepage_start_hook") removes the writepage_start_hook() and adds btrfs_writepage_cow_fixup() function, there is no need to follow the old hook parameters. Remove the @start and @end hook, since currently the fixup check is full page check, it doesn't need @start and @end hook. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Marcos Paulo de Souza authored
btrfs_search_slot is called in multiple places in dir-item.c to search for a dir entry, and then calling btrfs_match_dir_name to return a btrfs_dir_item. In order to reduce the number of callers of btrfs_search_slot, create a common function that looks for the dir key, and if found call btrfs_match_dir_item_name. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Marcos Paulo de Souza authored
We can return from btrfs_search_slot directly which also shows that it follows the same return value convention. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Marcos Paulo de Souza authored
After calling btrfs_search_slot is a common practice to check if the slot found isn't bigger than number of slots in the current leaf, and if so, search for the same key in the next leaf by calling btrfs_next_leaf, which calls btrfs_next_old_leaf to do the job. Calling btrfs_next_item in the same situation would end up in the same code flow, since * btrfs_next_item * btrfs_next_old_item * if slot >= nritems(curr_leaf) btrfs_next_old_leaf Change btrfs_verify_dev_extents and calculate_emulated_zone_size functions to use btrfs_next_leaf in the same situation. Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
Currently all the callers of btrfs_find_all_roots() pass a value of false for its ignore_offset argument. This makes the argument pointless and we can remove it and make btrfs_find_all_roots() always pass false as the ignore_offset argument for btrfs_find_all_roots_safe(). So just do that. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
During a fast fsync, if we have already fsynced the file before and in the current transaction, we can make the inode item update more efficient and avoid acquiring a write lock on the leaf's parent. To update the inode item we are always using btrfs_insert_empty_item() to get a path pointing to the inode item, which calls btrfs_search_slot() with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) + sizeof(struct btrfs_item)', and that always results in the search taking a write lock on the level 1 node that is the parent of the leaf that contains the inode item. This adds unnecessary lock contention on log trees when we have multiple fsyncs in parallel against inodes in the same subvolume, which has a very significant impact due to the fact that log trees are short lived and their height very rarely goes beyond level 2. Also, by using btrfs_insert_empty_item() when we need to update the inode item, we also end up splitting the leaf of the existing inode item when the leaf has an amount of free space smaller than the size of an inode item. Improve this by using btrfs_seach_slot(), with a 0 "ins_len" argument, when we know the inode item already exists in the log. This avoids these two inefficiencies. The following script, using fio, was used to perform the tests: $ cat fio-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-d single -m single" if [ $# -ne 4 ]; then echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE" exit 1 fi NUM_JOBS=$1 FILE_SIZE=$2 FSYNC_FREQ=$3 BLOCK_SIZE=$4 cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=$FSYNC_FREQ fallocate=none group_reporting=1 direct=0 bs=$BLOCK_SIZE ioengine=sync size=$FILE_SIZE directory=$MNT numjobs=$NUM_JOBS EOF echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo echo "mount options: $MOUNT_OPTIONS" echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The tests were done on a physical machine, with 12 cores, 64G of RAM, using a NVMEe device and using a non-debug kernel config (the default one from Debian). The summary line from fio is provided below for each test run. With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K: Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec After: WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec +1.4% throughput, -1.2% runtime With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K: Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec After: WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec +2.2% throughput, -2.1% runtime The changes are small but it's possible to be better on faster hardware as in the test machine used disk utilization was pretty much 100% during the whole time the tests were running (observed with 'iostat -xz 1'). The tests also included the previous patch with the subject of: "btrfs: avoid unnecessary log mutex contention when syncing log". So they compared a branch without that patch and without this patch versus a branch with these two patches applied. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
One of the last steps of syncing the log is to remove all log contexts from the root's list of contexts, done at btrfs_remove_all_log_ctxs(). There we iterate over all the contexts in the list and delete each one from the list, and after that we call INIT_LIST_HEAD() on the list. That is unnecessary since at that point the list is empty. So just remove the INIT_LIST_HEAD() call. It's not needed, increases code size (bloat-o-meter reported a delta of -122 for btrfs_sync_log() after this change) and increases two critical sections delimited by log mutexes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When syncing the log we acquire the root's log mutex just to update the root's last_log_commit. This is unnecessary because: 1) At this point there can only be one task updating this value, which is the task committing the current log transaction. Any task that enters btrfs_sync_log() has to wait for the previous log transaction to commit and wait for the current log transaction to commit if someone else already started it (in this case it never reaches to the point of updating last_log_commit, as that is done by the committing task); 2) All readers of the root's last_log_commit don't acquire the root's log mutex. This is to avoid blocking the readers, potentially for too long and because getting a stale value of last_log_commit does not cause any functional problem, in the worst case getting a stale value results in logging an inode unnecessarily. Plus it's actually very rare to get a stale value that results in unnecessarily logging the inode. So in order to avoid unnecessary contention on the root's log mutex, which is used for several different purposes, like starting/joining a log transaction and starting writeback of a log transaction, stop acquiring the log mutex for updating the root's last_log_commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When using the NO_HOLES feature and expanding the size of an inode, we update the inode's last_trans, last_sub_trans and last_log_commit fields at maybe_insert_hole() so that a fsync does know that the inode needs to be logged (by making sure that btrfs_inode_in_log() returns false). This happens for expanding truncate operations, buffered writes, direct IO writes and when cloning extents to an offset greater than the inode's i_size. However the way we do it is racy, because in between setting the inode's last_sub_trans and last_log_commit fields, the log transaction ID that was assigned to last_sub_trans might be committed before we read the root's last_log_commit and assign that value to last_log_commit. If that happens it would make a future call to btrfs_inode_in_log() return true. This is a race that should be extremely unlikely to be hit in practice, and it is the same that was described by commit bc0939fc ("btrfs: fix race between marking inode needs to be logged and log syncing"). The fix would simply be to set last_log_commit to the value we assigned to last_sub_trans minus 1, like it was done in that commit. However updating these two fields plus the last_trans field is pointless here because all the callers of btrfs_cont_expand() (which is the only caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans() or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the next fsync will log the inode, as it makes btrfs_inode_in_log() return false. So just remove the code that explicitly sets the inode's last_trans, last_sub_trans and last_log_commit fields. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
In commit 351cbf6e ("btrfs: use nofs allocations for running delayed items") we wrapped all btree updates when running delayed items with memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion detected by lockdep involving reclaim and the mutex of delayed nodes. The problem is because the ref verify tool does some memory allocations with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode eviction, which requires locking the mutex of an inode's delayed node. On the other hand the ref verify tool is called when allocating metadata extents as part of operations that modify a btree, which is a problem when running delayed nodes, where we do btree updates while holding the mutex of a delayed node. This is what caused the lockdep warning. Instead of wrapping every btree update when running delayed nodes, change the ref verify tool to never do GFP_KERNEL allocations, because: 1) We get less repeated code, which at the moment does not even have a comment mentioning why we need to setup the NOFS context, which is a recommended good practice as mentioned at Documentation/core-api/gfp_mask-from-fs-io.rst 2) The ref verify tool is something meant only for debugging and not something that should be enabled on non-debug / non-development kernels; 3) We may have yet more places outside delayed-inode.c where we have similar problem: doing btree updates while holding some lock and then having the GFP_KERNEL memory allocations, from the ref verify tool, trigger reclaim and trying again to acquire the same lock through the reclaim path. Or we could get more such cases in the future, therefore this change prevents getting into similar cases when using the ref verify tool. Curiously most of the memory allocations done by the ref verify tool were already using GFP_NOFS, except a few ones for no apparent reason. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When we insert the delayed items of an inode, which corresponds to the directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we do the following: 1) Pick the first delayed item from the rbtree and insert it into the fs/subvolume btree, using btrfs_insert_empty_item() for that; 2) Without releasing the path returned by btrfs_insert_empty_item(), keep collecting as many consecutive delayed items from the rbtree as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the immediate successor of the previously picked item and as long as they fit in the available space of the leaf the path points to; 3) Then insert all the collected items into the leaf; 4) Release the reserve metadata space for each collected item and release each item (implies deleting from the rbtree); 5) Unlock the path. While this is much better than inserting items one by one, it can be improved in a few aspects: 1) Instead of adding items based on the remaining free space of the leaf, collect as many items that can fit in a leaf and bulk insert them. This results in less and larger batches, reducing the total amount of time to insert the delayed items. For example when adding 100K files to a directory, we ended up creating 1658 batches with very variable sizes ranging from 1 item to 118 items, on a filesystem with a node/leaf size of 16K. After this change, we end up with 839 batches, with the vast majority of them having exactly 120 items; 2) We do the search for more items to batch, by iterating the rbtree, while holding a write lock on the leaf; 3) While still holding the leaf locked, we are releasing the reserved metadata for each item and then deleting each item, keeping a write lock on the leaf for longer than necessary. Releasing the delayed items one by one can take a significant amount of time, because deleting them from the rbtree can often be a bit slow when the deletion results in rebalancing the rbtree. So change this so that we try to create larger batches, with a total item size up to the maximum a leaf can support, and by unlocking the leaf immediately after inserting the items, releasing the reserved metadata space of each item and releasing each item without holding the write lock on the leaf. The following script that runs fs_mark was used to test this change: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" FILES=1000000 THREADS=16 FILE_SIZE=0 echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16" for ((i = 1; i <= $THREADS; i++)); do OPTS="$OPTS -d $MNT/d$i" done fs_mark $OPTS umount $MNT It was run on machine with 12 cores, 64G of ram, using a NVMe device and using a non-debug kernel config (Debian's default config). Results before this change: FSUse% Count Size Files/sec App Overhead 1 16000000 0 76182.1 72223046 3 32000000 0 62746.9 80776528 5 48000000 0 77029.0 93022381 6 64000000 0 73691.6 95251075 8 80000000 0 66288.0 85089634 Results after this change: FSUse% Count Size Files/sec App Overhead 1 16000000 0 79049.5 (+3.7%) 69700824 3 32000000 0 65248.9 (+3.9%) 80583693 5 48000000 0 77991.4 (+1.2%) 90040908 6 64000000 0 75096.8 (+1.9%) 89862241 8 80000000 0 66926.8 (+1.0%) 84429169 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
When extent tree gets corrupted, normally it's not extent tree root, but one toasted tree leaf/node. In that case, rescue=ibadroots mount option won't help as it can only handle the extent tree root corruption. This patch will enhance the behavior by: - Allow fill_dummy_bgs() to ignore -EEXIST error This means we may have some block group items read from disk, but then hit some error halfway. - Fallback to fill_dummy_bgs() if any error gets hit in btrfs_read_block_groups() Of course, this still needs rescue=ibadroots mount option. With that, rescue=ibadroots can handle extent tree corruption more gracefully and allow a better recover chance. Reported-by: Zhenyu Wu <wuzy001@gmail.com> Link: https://www.spinics.net/lists/linux-btrfs/msg114424.htmlReviewed-by: Su Yue <l@damenly.su> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Marcos Paulo de Souza authored
Using a transaction in btrfs_search_slot is only useful when we are searching to add or modify the tree. When the function is used for searching, insert length and mod arguments are 0, there is no need to use a transaction. No functional changes, changing for consistency. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At reada_for_search(), when attempting to readahead a node or leaf's siblings, we skip the readahead of the siblings if the node/leaf is already in memory. That is probably fine for the READA_FORWARD and READA_BACK readahead types, as they are used on contexts where we end up reading some consecutive leaves, but usually not the whole btree. However for a READA_FORWARD_ALWAYS mode, currently only used for full send operations, it does not make sense to skip the readahead if the target node or leaf is already loaded in memory, since we know the caller is visiting every node and leaf of the btree in ascending order. So change the behaviour to not skip the readahead when the target node is already in memory and the readahead mode is READA_FORWARD_ALWAYS. The following test script was used to measure the improvement on a box using an average, consumer grade, spinning disk, with 32GiB of RAM and using a non-debug kernel config (Debian's default config). $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null mount $MOUNT_OPTIONS $DEV $MNT # Create files with inline data to make it easier and faster to create # large btrees. add_files() { local total=$1 local start_offset=$2 local number_jobs=$3 local total_per_job=$(($total / $number_jobs)) echo "Creating $total new files using $number_jobs jobs" for ((n = 0; n < $number_jobs; n++)); do ( local start_num=$(($start_offset + $n * $total_per_job)) for ((i = 1; i <= $total_per_job; i++)); do local file_num=$((start_num + $i)) local file_path="$MNT/file_${file_num}" xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null if [ $? -ne 0 ]; then echo "Failed creating file $file_path" break fi done ) & worker_pids[$n]=$! done wait ${worker_pids[@]} sync echo echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)" } file_count=2000000 add_files $file_count 0 4 echo echo "Creating snapshot..." btrfs subvolume snapshot -r $MNT $MNT/snap1 umount $MNT echo 3 > /proc/sys/vm/drop_caches blockdev --flushbufs $DEV &> /dev/null hdparm -F $DEV &> /dev/null mount $MOUNT_OPTIONS $DEV $MNT echo echo "Testing full send..." start=$(date +%s) btrfs send $MNT/snap1 > /dev/null end=$(date +%s) echo echo "Full send took $((end - start)) seconds" umount $MNT The duration of the full send operations, in seconds, were the following: Before this change: 85 seconds After this change: 76 seconds (-11.2%) Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The pages in block_ctx have never been allocated from highmem (in btrfsic_read_block) so the mapping is pointless and can be removed. Signed-off-by: David Sterba <dsterba@suse.com>
-