- 25 Jul, 2022 40 commits
-
-
Christoph Hellwig authored
btrfs_submit_data_write_bio special cases the reloc root because the checksums are preloaded, but only does so for the !sync case. The sync case can't happen for data relocation, but just handling it more generally significantly simplifies the logic. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Transfer the bio counter reference acquired by btrfs_submit_bio to raid56_parity_write and raid56_parity_recovery together with the bio that the reference was acquired for instead of acquiring another reference in those helpers and dropping the original one in btrfs_submit_bio. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Also use the proper bool type for the generic_io argument. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. As this requires touching all the callers, rename the function to btrfs_submit_bio, which describes the functionality much better. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
For profiles other than RAID56, __btrfs_map_block() returns @map_length as min(stripe_end, logical + *length), which is also the same result from btrfs_get_io_geometry(). But for RAID56, __btrfs_map_block() returns @map_length as stripe_len. This strange behavior is going to hurt incoming bio split at btrfs_map_bio() time, as we will use @map_length as bio split size. Fix this behavior by returning @map_length by the same calculation as for other profiles. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there are functions passing it as arguments, this is not necessary. The fixed value has been used for a long time and though the stripe length should be configurable by super block member stripesize, this hasn't been implemented and would require more changes so we don't need to keep this code around until then. Partially based on a patch from Qu Wenruo. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The inode cache feature was removed in kernel 5.11, and we no longer have any code that reads from or writes to inode caches. We may still mount a filesystem that has inode caches, but they are ignored. Remove the check for an inode cache from btrfs_is_free_space_inode(), since we no longer have code to trigger reads from an inode cache or writes to an inode cache. The check at send.c is still needed, because in case we find a filesystem with an inode cache, we must ignore it. Also leave the checks at tree-checker.c, as they are sanity checks. This eliminates a dead branch and reduces the amount of code since it's in an inline function. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620662 189240 29032 1838934 1c0f56 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1620502 189240 29032 1838774 1c0eb6 fs/btrfs/btrfs.ko Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
This flag has been merged in 3.10 and is effectively always-on. Its status depends on the host page size so there's another way to guarantee compatibility with old kernels. Due to a bug introduced in 6f93e834 ("btrfs: fix upper limit for max_inline for page size 64K") the flag is not persisted among features in the superblock so it's not reliable. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
This feature has been the default for about 13 year. At this point it's safe to consider it an indispensable feature of BTRFS as such there's no need to advertise it in sysfs. Remove the global sysfs feature file, the per-filesystem feature file has never been there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Skinny extents have been a default mkfs feature since version 3.18 i (introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs: make skinny-metadata default") ). It really doesn't bring any value to users to simply remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Added in commit 727011e0 ("Btrfs: allow metadata blocks larger than the page size") in 2010 and it's been default for mkfs since 3.12 (2013). The message doesn't really convey any useful information to users. Remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The chained assignments may be convenient to write, but make readability a bit worse as it's too easy to overlook that there are several values set on the same line while this is rather an exception. Making it consistent everywhere avoids surprises. The pattern where inode times are initialized reuses the first value and the order is mtime, ctime. In other blocks the assignments are expanded so the order of variables is similar to the neighboring code. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1) and RAID10 (map->sub_stripes is 2), with equivalent results. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's a sequence of hard coded values for RAID1 profiles that are already stored in the raid_attr table that should be used instead. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Commit 6f93e834 seemingly inadvertently moved the code responsible for flagging the filesystem as having BIG_METADATA to a place where setting the flag was essentially lost. This means that filesystems created with kernels containing this bug (starting with 5.15) can potentially be mounted by older (pre-3.4) kernels. In reality chances for this happening are low because there are other incompat flags introduced in the mean time. Still the correct behavior is to set INCOMPAT_BIG_METADATA flag and persist this in the superblock. Fixes: 6f93e834 ("btrfs: fix upper limit for max_inline for page size 64K") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Per user request, print the checksum type and implementation at mount time among the messages. The checksum is user configurable and the actual crypto implementation is useful to see for performance reasons. The same information is also available after mount in /sys/fs/FSID/checksum file. Example: [25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm Link: https://github.com/kdave/btrfs-progs/issues/483Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
If you try to force a chunk allocation, but you race with another chunk allocation, you will end up waiting on the chunk allocation that just occurred and then allocate another chunk. If you have many threads all doing this at once you can way over-allocate chunks. Fix this by resetting force to NO_FORCE, that way if we think we need to allocate we can, otherwise we don't force another chunk allocation if one is already happening. Reviewed-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS and later from XFLAGS interfaces, now commonly found under the 'fileattr' API. This corresponds to the individual inode bits and that's part of the on-disk format, so this is suitable for the protocol. The other interfaces contain a lot of cruft or bits that btrfs does not support yet. Currently the value is u64 and matches btrfs_inode_item. Not all the bits can be set by ioctls (like NODATASUM or READONLY), but we can send them over the protocol and leave it up to the receiving side what and how to apply. As some of the flags, eg. IMMUTABLE, can prevent any further changes, the receiving side needs to understand that and apply the changes in the right order, or possibly with some intermediate steps. This should be easier, future proof and simpler on the protocol layer than implementing in kernel. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
When send v1 was introduced the otime (inode creation time) was not available, however the attribute in btrfs send protocol exists. Though it would be possible to add it for v1 too as the attribute would be ignored by v1 receive, let's not change the layout of v1 and only add that to v2+. The otime cannot be changed and is only informative. Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
When handling a real world transid mismatch image, it's hard to know which copy is corrupted, as the error messages just look like this: BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on 30408704 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 We don't even know if the retry is caused by btrfs or the VFS retry. To make things a little easier to read, add mirror number for all related tree block read errors. So the above messages would look like this: BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 1 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 BTRFS warning (device dm-3): checksum verify failed on logical 30408704 mirror 2 wanted 0xcdcdcdcd found 0x3c0adc8e level 0 Signed-off-by: Qu Wenruo <wqu@suse.com> [ update messages, add "logical" ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Naohiro Aota authored
The 'goto out' in cow_file_range() in the exit block are not necessary and jump back. Replace them with return, while still keeping 'goto out' in the main code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ keep goto in the main code, update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Naohiro Aota authored
When cow_file_range() fails in the middle of the allocation loop, it unlocks the pages but leaves the ordered extents intact. Thus, we need to call btrfs_cleanup_ordered_extents() to finish the created ordered extents. Also, we need to call end_extent_writepage() if locked_page is available because btrfs_cleanup_ordered_extents() never processes the region on the locked_page. Furthermore, we need to set the mapping as error if locked_page is unavailable before unlocking the pages, so that the errno is properly propagated to the user space. CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Naohiro Aota authored
btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it is not usable for submit_uncompressed_range() which can have NULL locked_page. Add support supports locked_page == NULL case. Also, it rewrites redundant "page_offset(locked_page)". Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Naohiro Aota authored
There is a hung_task report on zoned btrfs like below. https://github.com/naota/linux/issues/59 [726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds. [726.329839] Not tainted 5.16.0-rc1+ #1 [726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [726.331603] task:rocksdb:high0 state:D stack: 0 pid:11085 ppid: 11082 flags:0x00000000 [726.331608] Call Trace: [726.331611] <TASK> [726.331614] __schedule+0x2e5/0x9d0 [726.331622] schedule+0x58/0xd0 [726.331626] io_schedule+0x3f/0x70 [726.331629] __folio_lock+0x125/0x200 [726.331634] ? find_get_entries+0x1bc/0x240 [726.331638] ? filemap_invalidate_unlock_two+0x40/0x40 [726.331642] truncate_inode_pages_range+0x5b2/0x770 [726.331649] truncate_inode_pages_final+0x44/0x50 [726.331653] btrfs_evict_inode+0x67/0x480 [726.331658] evict+0xd0/0x180 [726.331661] iput+0x13f/0x200 [726.331664] do_unlinkat+0x1c0/0x2b0 [726.331668] __x64_sys_unlink+0x23/0x30 [726.331670] do_syscall_64+0x3b/0xc0 [726.331674] entry_SYSCALL_64_after_hwframe+0x44/0xae [726.331677] RIP: 0033:0x7fb9490a171b [726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057 [726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b [726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300 [726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000 [726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000 [726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260 [726.331693] </TASK> While we debug the issue, we found running fstests generic/551 on 5GB non-zoned null_blk device in the emulated zoned mode also had a similar hung issue. Also, we can reproduce the same symptom with an error injected cow_file_range() setup. The hang occurs when cow_file_range() fails in the middle of allocation. cow_file_range() called from do_allocation_zoned() can split the give region ([start, end]) for allocation depending on current block group usages. When btrfs can allocate bytes for one part of the split regions but fails for the other region (e.g. because of -ENOSPC), we return the error leaving the pages in the succeeded regions locked. Technically, this occurs only when @unlock == 0. Otherwise, we unlock the pages in an allocated region after creating an ordered extent. Considering the callers of cow_file_range(unlock=0) won't write out the pages, we can unlock the pages on error exit from cow_file_range(). So, we can ensure all the pages except @locked_page are unlocked on error case. In summary, cow_file_range now behaves like this: - page_started == 1 (return value) - All the pages are unlocked. IO is started. - unlock == 1 - All the pages except @locked_page are unlocked in any case - unlock == 0 - On success, all the pages are locked for writing out them - On failure, all the pages except @locked_page are unlocked Fixes: 42c01100 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Ioannis Angelakopoulos authored
Export commit stats in file /sys/fs/btrfs/UUID/commit_stats with example output like: commits 123 last_commit_ms 11 max_commit_ms 150 total_commit_ms 2000 The values are in one file so reading them at a single time will give a more consistent view. The stats are internally tracked in nanoseconds so the cumulative values should not suffer from rounding errors. Writing 0 to the file 'commit_stats' will reset max_commit_ms. Initial values are set at first mount of the filesystem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Ioannis Angelakopoulos authored
Track several stats about transaction commit, to be later exported via sysfs: - number of commits so far - duration of the last commit in ns - maximum commit duration seen so far in ns - total duration for all commits so far in ns The update of the commit stats occurs after the commit thread has gone through all the logic that checks if there is another thread committing at the same time. This means that we only account for actual commit work in the commit stats we report and not the time the thread spends waiting until it is ready to do the commit work. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Same as in commit 21b4ee70 ("xfs: drop ->writepage completely"): we can remove the callback as it's only used in one place - single page writeback from memory reclaim and is not called for cgroup writeback at all. We only allow such writeback from kswapd, not from direct memory reclaim, and so it is rarely used. When it comes from kswapd, it is effectively random dirty page shoot-down, which is horrible for IO patterns. We can rely on background writeback to clean all dirty pages in an efficient way and not let it be interrupted by kswapd. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The new, new_gen and deleted indicate a status, use boolean type instead of int. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The whole send operation is restartable and handling properly a buffer write may not be easy. We can't know what caused that and if a short delay and retry will fix it or how many retries should be performed in case it's a temporary condition. The error value is returned to the ioctl caller so in case it's transient problem, the user would be notified about the reason. Remove the TODO note as there's no plan to handle ERESTARTSYS. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We don't need the whole ctree.h in send.h, none of the data types defined there are used. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We don't need this ifdef as the header file is not shared, the protocol definition used by userspace should be from libbtrfs or libbtrfsutil. Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
Btrfs currently limits direct I/O reads to a single sector, which goes back to commit c329861d ("Btrfs: don't allocate a separate csums array for direct reads") from Josef. That commit changes the direct I/O code to ".. use the private part of the io_tree for our csums.", but ten years later that isn't how checksums for direct reads work, instead they use a csums allocation on a per-btrfs_dio_private basis (which have their own performance problem for small I/O, but that will be addressed later). There is no fundamental limit in btrfs itself to limit the I/O size except for the size of the checksum array that scales linearly with the number of sectors in an I/O. Pick a somewhat arbitrary limit of 256 limits, which matches what the buffered reads typically see as the upper limit as the limit for direct I/O as well. This significantly improves direct read performance. For example a fio run doing 1 MiB aio reads with a queue depth of 1 roughly triples the throughput: Baseline: READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec With this patch: READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msc Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] There is a small workload which will always fail with recent kernel: (A simplified version from btrfs/125 test case) mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3 mount $dev1 $mnt xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1 sync umount $mnt btrfs dev scan -u $dev3 mount -o degraded $dev1 $mnt xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2 umount $mnt btrfs dev scan mount $dev1 $mnt btrfs balance start --full-balance $mnt umount $mnt The failure is always failed to read some tree blocks: BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7 ... [CAUSE] With the recently added debug output, we can see all RAID56 operations related to full stripe 38928384: 56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384 56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384 56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384 56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384 56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384 56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384 56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384 56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2 Before we enter raid56_parity_recover(), we have triggered some metadata write for the full stripe 38928384, this leads to us to read all the sectors from disk. Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to avoid unnecessary read. This means, for that full stripe, after any partial write, we will have stale data, along with P/Q calculated using that stale data. Thankfully due to patch "btrfs: only write the sectors in the vertical stripe which has data stripes" we haven't submitted all the corrupted P/Q to disk. When we really need to recover certain range, aka in raid56_parity_recover(), we will use the cached rbio, along with its cached sectors (the full stripe is all cached). This explains why we have no event raid56_scrub_read_recover() triggered. Since we have the cached P/Q which is calculated using the stale data, the recovered one will just be stale. In our particular test case, it will always return the same incorrect metadata, thus causing the same error message "parent transid verify failed on 39010304 wanted 9 found 7" again and again. [BTRFS DESTRUCTIVE RMW PROBLEM] Test case btrfs/125 (and above workload) always has its trouble with the destructive read-modify-write (RMW) cycle: 0 32K 64K Data1: | Good | Good | Data2: | Bad | Bad | Parity: | Good | Good | In above case, if we trigger any write into Data1, we will use the bad data in Data2 to re-generate parity, killing the only chance to recovery Data2, thus Data2 is lost forever. This destructive RMW cycle is not specific to btrfs RAID56, but there are some btrfs specific behaviors making the case even worse: - Btrfs will cache sectors for unrelated vertical stripes. In above example, if we're only writing into 0~32K range, btrfs will still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2. This behavior is to cache sectors for later update. Incidentally commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible") has a bug which makes RAID56 to never trust the cached sectors, thus slightly improve the situation for recovery. Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in steal_rbio" will revert the behavior back to the old one. - Btrfs raid56 partial write will update all P/Q sectors and cache them This means, even if data at (64K ~ 96K) of Data2 is free space, and only (96K ~ 128K) of Data2 is really stale data. And we write into that (96K ~ 128K), we will update all the parity sectors for the full stripe. This unnecessary behavior will completely kill the chance of recovery. Thankfully, an unrelated optimization "btrfs: only write the sectors in the vertical stripe which has data stripes" will prevent submitting the write bio for untouched vertical sectors. That optimization will keep the on-disk P/Q untouched for a chance for later recovery. [FIX] Although we have no good way to completely fix the destructive RMW (unless we go full scrub for each partial write), we can still limit the damage. With patch "btrfs: only write the sectors in the vertical stripe which has data stripes" now we won't really submit the P/Q of unrelated vertical stripes, so the on-disk P/Q should still be fine. Now we really need to do is just drop all the cached sectors when doing recovery. By this, we have a chance to read the original P/Q from disk, and have a chance to recover the stale data, while still keep the cache to speed up regular write path. In fact, just dropping all the cache for recovery path is good enough to allow the test case btrfs/125 along with the small script to pass reliably. The lack of metadata write after the degraded mount, and forced metadata COW is saving us this time. So this patch will fix the behavior by not trust any cache in __raid56_parity_recover(), to solve the problem while still keep the cache useful. But please note that this test pass DOES NOT mean we have solved the destructive RMW problem, we just do better damage control a little better. Related patches: - btrfs: only write the sectors in the vertical stripe - d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible") - btrfs: update stripe_sectors::uptodate in steal_rbio Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Christoph Hellwig authored
finish_func is always set to finish_ordered_fn, so remove it and also the now pointless and somewhat confusingly named __endio_write_update_ordered wrapper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
With Filipe's recent rework of the delayed inode code one aspect which isn't batched is the release of the reserved metadata of delayed inode's delete items. With this patch on top of Filipe's rework and running the same test as provided in the description of a patch titled "btrfs: improve batch deletion of delayed dir index items" I observe the following change of the number of calls to btrfs_block_rsv_release: Before this change: - block_rsv_release: 1004 - btrfs_delete_delayed_items_total_time: 14602 - delete_batches: 505 After: - block_rsv_release: 510 - btrfs_delete_delayed_items_total_time: 13643 - delete_batches: 507 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Btrfs on-disk format has reserved the first 1MiB for the primary super block (at 64KiB offset) and bootloaders may also use this space. This behavior is only introduced since v4.1 btrfs-progs release, although kernel can ensure we never touch the reserved range of super blocks, it's better to inform the end users, and a balance will resolve the problem. Signed-off-by: Qu Wenruo <wqu@suse.com> [ update changelog and message ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There's a reserved space on each device of size 1MiB that can be used by bootloaders or to avoid accidental overwrite. Use a symbolic constant with the explaining comment instead of hard coding the value and multiple comments. Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for the primary super block (at offset 64KiB), until then the range could have been used by mistake. Kernel has been always respecting the 1MiB range for writes. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's only one function we pass to iterate_inodes_from_logical as iterator, so we can drop the indirection and call it directly, after moving the function to backref.c Signed-off-by: David Sterba <dsterba@suse.com>
-