- 25 May, 2020 40 commits
-
-
Filipe Manana authored
When scrub is verifying the extents of a block group for a device, it is possible that the corresponding block group gets removed and its logical address and device extents get used for a new block group allocation. When this happens scrub incorrectly reports that errors were detected and, if the the new block group has a different profile then the old one, deleted block group, we can crash due to a null pointer dereference. Possibly other unexpected and weird consequences can happen as well. Consider the following sequence of actions that leads to the null pointer dereference crash when scrub is running in parallel with balance: 1) Balance sets block group X to read-only mode and starts relocating it. Block group X is a metadata block group, has a raid1 profile (two device extents, each one in a different device) and a logical address of 19424870400; 2) Scrub is running and finds device extent E, which belongs to block group X. It enters scrub_stripe() to find all extents allocated to block group X, the search is done using the extent tree; 3) Balance finishes relocating block group X and removes block group X; 4) Balance starts relocating another block group and when trying to commit the current transaction as part of the preparation step (prepare_to_relocate()), it blocks because scrub is running; 5) The scrub task finds the metadata extent at the logical address 19425001472 and marks the pages of the extent to be read by a bio (struct scrub_bio). The extent item's flags, which have the bit BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct scrub_page). It is these flags in the scrub pages that tells the bio's end io function (scrub_bio_end_io_worker) which type of extent it is dealing with. At this point we end up with 4 pages in a bio which is ready for submission (the metadata extent has a size of 16Kb, so that gives 4 pages on x86); 6) At the next iteration of scrub_stripe(), scrub checks that there is a pause request from the relocation task trying to commit a transaction, therefore it submits the pending bio and pauses, waiting for the transaction commit to complete before resuming; 7) The relocation task commits the transaction. The device extent E, that was used by our block group X, is now available for allocation, since the commit root for the device tree was swapped by the transaction commit; 8) Another task doing a direct IO write allocates a new data block group Y which ends using device extent E. This new block group Y also ends up getting the same logical address that block group X had: 19424870400. This happens because block group X was the block group with the highest logical address and, when allocating Y, find_next_chunk() returns the end offset of the current last block group to be used as the logical address for the new block group, which is 18351128576 + 1073741824 = 19424870400 So our new block group Y has the same logical address and device extent that block group X had. However Y is a data block group, while X was a metadata one, and Y has a raid0 profile, while X had a raid1 profile; 9) After allocating block group Y, the direct IO submits a bio to write to device extent E; 10) The read bio submitted by scrub reads the 4 pages (16Kb) from device extent E, which now correspond to the data written by the task that did a direct IO write. Then at the end io function associated with the bio, scrub_bio_end_io_worker(), we call scrub_block_complete() which calls scrub_checksum(). This later function checks the flags of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK is set in the flags, so it assumes it has a metadata extent and then calls scrub_checksum_tree_block(). That functions returns an error, since interpreting data as a metadata extent causes the checksum verification to fail. So this makes scrub_checksum() call scrub_handle_errored_block(), which determines 'failed_mirror_index' to be 1, since the device extent E was allocated as the second mirror of block group X. It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at 'sblocks_for_recheck', and all the memory is initialized to zeroes by kcalloc(). After that it calls scrub_setup_recheck_block(), which is responsible for filling each of those structures. However, when that function calls btrfs_map_sblock() against the logical address of the metadata extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches the current block group Y. However block group Y has a raid0 profile and not a raid1 profile like X had, so the following call returns 1: scrub_nr_raid_mirrors(bbio) And as a result scrub_setup_recheck_block() only initializes the first (index 0) scrub_block structure in 'sblocks_for_recheck'. Then scrub_recheck_block() is called by scrub_handle_errored_block() with the second (index 1) scrub_block structure as the argument, because 'failed_mirror_index' was previously set to 1. This scrub_block was not initialized by scrub_setup_recheck_block(), so it has zero pages, its 'page_count' member is 0 and its 'pagev' page array has all members pointing to NULL. Finally when scrub_recheck_block() calls scrub_recheck_block_checksum() we have a NULL pointer dereference when accessing the flags of the first page, as pavev[0] is NULL: static void scrub_recheck_block_checksum(struct scrub_block *sblock) { (...) if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA) scrub_checksum_data(sblock); (...) } Producing a stack trace like the following: [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028 [542998.010238] #PF: supervisor read access in kernel mode [542998.010878] #PF: error_code(0x0000) - not-present page [542998.011516] PGD 0 P4D 0 [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1 [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs] [542998.018474] Code: 4c 89 e6 ... [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202 [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000 [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120 [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001 [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800 [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0 [542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000 [542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0 [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [542998.032380] Call Trace: [542998.032752] scrub_recheck_block+0x162/0x400 [btrfs] [542998.033500] ? __alloc_pages_nodemask+0x31e/0x460 [542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs] [542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs] [542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs] [542998.036735] process_one_work+0x26d/0x6a0 [542998.037275] worker_thread+0x4f/0x3e0 [542998.037740] ? process_one_work+0x6a0/0x6a0 [542998.038378] kthread+0x103/0x140 [542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70 [542998.039419] ret_from_fork+0x3a/0x50 [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ... [542998.047288] CR2: 0000000000000028 [542998.047724] ---[ end trace bde186e176c7f96a ]--- This issue has been around for a long time, possibly since scrub exists. The last time I ran into it was over 2 years ago. After recently fixing fstests to pass the "--full-balance" command line option to btrfs-progs when doing balance, several tests started to more heavily exercise balance with fsstress, scrub and other operations in parallel, and therefore started to hit this issue again (with btrfs/061 for example). Fix this by having scrub increment the 'trimming' counter of the block group, which pins the block group in such a way that it guarantees neither its logical address nor device extents can be reused by future block group allocations until we decrement the 'trimming' counter. Also make sure that on each iteration of scrub_stripe() we stop scrubbing the block group if it was removed already. A later patch in the series will rename the block group's 'trimming' counter and its helpers to a more generic name, since now it is not used exclusively for pinning while trimming anymore. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The extent references v0 have been superseded long time go, there are some unused declarations of access helpers. We can safely remove them now. The struct btrfs_extent_ref_v0 is not used anywhere, but struct btrfs_extent_item_v0 is still part of a backward compatibility check in relocation.c and thus not removed. Signed-off-by: David Sterba <dsterba@suse.com>
-
YueHaibing authored
There's no callers in-tree anymore since commit d24ee97b ("btrfs: use new helpers to set uuids in eb") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
[BUG] For the following operation, qgroup is guaranteed to be screwed up due to snapshot adding to a new qgroup: # mkfs.btrfs -f $dev # mount $dev $mnt # btrfs qgroup en $mnt # btrfs subv create $mnt/src # xfs_io -f -c "pwrite 0 1m" $mnt/src/file # sync # btrfs qgroup create 1/0 $mnt/src # btrfs subv snapshot -i 1/0 $mnt/src $mnt/snapshot # btrfs qgroup show -prce $mnt/src qgroupid rfer excl max_rfer max_excl parent child -------- ---- ---- -------- -------- ------ ----- 0/5 16.00KiB 16.00KiB none none --- --- 0/257 1.02MiB 16.00KiB none none --- --- 0/258 1.02MiB 16.00KiB none none 1/0 --- 1/0 0.00B 0.00B none none --- 0/258 ^^^^^^^^^^^^^^^^^^^^ [CAUSE] The problem is in btrfs_qgroup_inherit(), we don't have good enough check to determine if the new relation would break the existing accounting. Unlike btrfs_add_qgroup_relation(), which has proper check to determine if we can do quick update without a rescan, in btrfs_qgroup_inherit() we can even assign a snapshot to multiple qgroups. [FIX] Fix it by manually marking qgroup inconsistent for snapshot inheritance. For subvolume creation, since all its extents are exclusively owned, we don't need to rescan. In theory, we should call relation check like quick_update_accounting() when doing qgroup inheritance and inform user about qgroup accounting inconsistency. But we don't have good mechanism to relay that back to the user in the snapshot creation context, thus we can only silently mark the qgroup inconsistent. Anyway, user shouldn't use qgroup inheritance during snapshot creation, and should add qgroup relationship after snapshot creation by 'btrfs qgroup assign', which has a much better UI to inform user about qgroup inconsistent and kick in rescan automatically. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Robbie Ko authored
When mounting, we handle deleted subvolume and orphan items. First, find add orphan roots, then add them to fs_root radix tree. Second, in tree-root, process each orphan item, skip if it is dead root. The original algorithm is based on the list of dead_roots, one by one to visit and check whether the objectid is consistent, the time complexity is O (n ^ 2). When processing 50000 deleted subvols, it takes about 120s. Because btrfs_find_orphan_roots has already ran before us, and added deleted subvol to fs_roots radix tree. The fs root will only be removed from the fs_roots radix tree after the cleaner process is started, and the cleaner will only start execution after the mount is complete. btrfs_orphan_cleanup can be called during the whole filesystem mount lifetime, but only "tree root" will be used in this section of code, and only mount time will be brought into tree root. So we can quickly check whether the orphan item is dead root through the fs_roots radix tree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
YueHaibing authored
There's no callers in-tree anymore since commit 64403612 ("btrfs: rework btrfs_check_space_for_delayed_refs") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
I've grepped logs for 'errno=.*unknown' and found -95, -117 and -122, now added to the table. The wording is adjusted so it makes sense in context of filesystem. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Add the raw errnos and sort them accordingly. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
When an old device has new fsid through 'btrfs device add -f <dev>' our fs_devices list has an alien device in one of the fs_devices lists. By having an alien device in fs_devices, we have two issues so far 1. missing device does not not show as missing in the userland 2. degraded mount will fail Both issues are caused by the fact that there's an alien device in the fs_devices list. (Alien means that it does not belong to the filesystem, identified by fsid, or does not contain btrfs filesystem at all, eg. due to overwrite). A device can be scanned/added through the control device ioctls SCAN_DEV, DEVICES_READY or by ADD_DEV. And device coming through the control device is checked against the all other devices in the lists, but this was not the case for ADD_DEV. This patch fixes both issues above by removing the alien device. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to the bdev with greatest device::generation number. For a typical-missing device the generation number is zero so fs_devices::latest_bdev will never point to it. But if the missing device is due to alienation [1], then device::generation is not zero and if it is greater or equal to the rest of device generations in the list, then fs_devices::latest_bdev ends up pointing to the missing device and reports the error like [2]. [1] We maintain devices of a fsid (as in fs_device::fsid) in the fs_devices::devices list, a device is considered as an alien device if its fsid does not match with the fs_device::fsid Consider a working filesystem with raid1: $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb $ mount /dev/sda /mnt-raid1 $ umount /mnt-raid1 While mnt-raid1 was unmounted the user force-adds one of its devices to another btrfs filesystem: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt-single $ btrfs dev add -f /dev/sda /mnt-single Now the original mnt-raid1 fails to mount in degraded mode, because fs_devices::latest_bdev is pointing to the alien device. $ mount -o degraded /dev/sdb /mnt-raid1 [2] mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so. kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing kernel: BTRFS error (device sdb): failed to read devices kernel: BTRFS error (device sdb): open_ctree failed Fix the root cause by checking if the device is not missing before it can be considered for the fs_devices::latest_bdev. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Eric Biggers authored
Use crypto_shash_digest() instead of crypto_shash_init() + crypto_shash_update() + crypto_shash_final(). This is more efficient. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
There is no need of goto out in open_fs_devices() as there is nothing special done there. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At btrfs_log_prealloc_extents() we are checking if copy_items() returns a value greater than 0. That used to happen in the past to signal the caller that the path given to it was released and reused for other searches, but as of commit 0e56315c ("Btrfs: fix missing hole after hole punching and fsync when using NO_HOLES"), the copy_items() function does not have that behaviour anymore and always returns 0 or a negative value. So just remove that check at btrfs_log_prealloc_extents(), which the previously mentioned commit forgot to remove. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Currently, direct I/O has its own versions of bio_readpage_error() and btrfs_check_repairable() (dio_read_error() and btrfs_check_dio_repairable(), respectively). The main difference is that the direct I/O version doesn't do read validation. The rework of direct I/O repair makes it possible to do validation, so we can get rid of btrfs_check_dio_repairable() and combine bio_readpage_error() and dio_read_error() into a new helper, btrfs_submit_read_repair(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
This was originally added in commit 8b110e39 ("Btrfs: implement repair function when direct read fails") to avoid a deadlock. In that commit, the direct I/O read endio executes on the endio_workers workqueue, submits a repair bio, and waits for it to complete. The repair bio endio must execute on a different workqueue, otherwise it could block on the endio_workers workqueue becoming available, which won't happen because the original endio is blocked on the repair bio. As of the previous commit, the original endio doesn't wait for the repair bio, so this separate workqueue is unnecessary. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Direct I/O read repair was originally implemented in commit 8b110e39 ("Btrfs: implement repair function when direct read fails"). This implementation is unnecessarily complicated. There is major code duplication between __btrfs_subio_endio_read() (checks checksums and handles I/O errors for files with checksums), __btrfs_correct_data_nocsum() (handles I/O errors for files without checksums), btrfs_retry_endio() (checks checksums and handles I/O errors for retries of files with checksums), and btrfs_retry_endio_nocsum() (handles I/O errors for retries of files without checksum). If it sounds like these should be one function, that's because they should. Additionally, these functions are very hard to follow due to their excessive use of goto. This commit replaces the original implementation. After the previous commit getting rid of orig_bio, we can reuse the same endio callback for repair I/O and the original I/O, we just need to track the file offset and original iterator in the repair bio. We can also unify the handling of files with and without checksums and simplify the control flow. We also no longer have to wait for each repair I/O to complete one by one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
In the worst case, there are _4_ layers of bios in the Btrfs direct I/O path: 1. The bio created by the generic direct I/O code (dio_bio). 2. A clone of dio_bio we create in btrfs_submit_direct() to represent the entire direct I/O range (orig_bio). 3. A partial clone of orig_bio limited to the size of a RAID stripe that we create in btrfs_submit_direct_hook(). 4. Clones of each of those split bios for each RAID stripe that we create in btrfs_map_bio(). As of the previous commit, the second layer (orig_bio) is no longer needed for anything: we can split dio_bio instead, and complete dio_bio directly when all of the cloned bios complete. This lets us clean up a bunch of cruft, including dip->subio_endio and dip->errors (we can use dio_bio->bi_status instead). It also enables the next big cleanup of direct I/O read repair. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
The next commit will get rid of btrfs_dio_private->orig_bio. The only thing we really need it for is containing all of the checksums, but we can easily put the checksum array in btrfs_dio_private and have the submitted bios reference the array. We can also look the checksums up while we're setting up instead of the current awkward logic that looks them up for orig_bio when the first split bio is submitted. (Interestingly, btrfs_dio_private did contain the checksums before commit 23ea8e5a ("Btrfs: load checksum data once when submitting a direct read io"), but it didn't look them up up front.) Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
This is really a reference count now, so convert it to refcount_t and rename it to refs. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
We haven't used this since commit 9be3395b ("Btrfs: use a btrfs bioset instead of abusing bio internals"). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Since its introduction in commit 2fe6303e ("Btrfs: split bio_readpage_error into several functions"), btrfs_check_repairable() has only been used from extent_io.c where it is defined. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
__readpage_endio_check() is also used from the direct I/O read code, so give it a more descriptive name. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Fix a couple of issues in the btrfs_lookup_bio_sums documentation: * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move the declaration in the code to make that clear, too. * dst must be large enough to hold nblocks * csum_size, not just csum_size. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
The purpose of the validation step is to distinguish between good and bad sectors in a failed multi-sector read. If a multi-sector read succeeded but some of those sectors had checksum errors, we don't need to validate anything; we know the sectors with bad checksums need to be repaired. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
Read repair does two things: it finds a good copy of data to return to the reader, and it corrects the bad copy on disk. If a read of multiple sectors has an I/O error, repair does an extra "validation" step that issues a separate read for each sector. This allows us to find the exact failing sectors and only rewrite those. This heuristic is implemented in bio_readpage_error()/btrfs_check_repairable() as: failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT; if (failed_bio_pages > 1) do validation However, at this point, bi_iter may have already been advanced. This means that we'll skip the validation step and rewrite the entire failed read. Fix it by getting the actual size from the biovec (which we can do because this is only called for non-cloned bios, although that will change in a later commit). Fixes: 8a2ee44a ("btrfs: look at bi_size for repair decisions") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private, we complete the ordered extent range. However, we don't mark that the range doesn't need to be cleaned up from btrfs_direct_IO() until later. Therefore, if we fail to allocate the btrfs_dio_private, we complete the ordered extent range twice. We could fix this by updating unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so that creating the btrfs_dio_private and submitting the bios are separate, and once the btrfs_dio_private is created, cleanup always happens through the btrfs_dio_private. The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start is really subtle. We have the following: 1. btrfs_direct_IO sets those two to the same value. 2. When we call __blockdev_direct_IO unless btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to modify unsubmitted_oe_range_start so that start < end. Cleanup won't happen. 3. We come into btrfs_submit_direct - if it dip allocation fails we'd return with oe_range_end now modified so cleanup will happen. 4. If we manage to allocate the dip we reset the unsubmitted range members to be equal so that cleanup happens from btrfs_endio_direct_write. This 4-step logic is not really obvious, especially given it's scattered across 3 functions. Fixes: f28a4928 ("Btrfs: fix leaking of ordered extents after direct IO write error") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> [ add range start/end logic explanation from Nikolay ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID stripe or chunk, we submit orig_bio without cloning it. In this case, we don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we decrement pending_bios to -1, and we never complete orig_bio. Fix it by initializing pending_bios to 1 instead of incrementing later. Fixing this exposes another bug: we put orig_bio prematurely and then put it again from end_io. Fix it by not putting orig_bio. After this change, pending_bios is really more of a reference count, but I'll leave that cleanup separate to keep the fix small. Fixes: e65e1535 ("btrfs: fix panic caused by direct IO") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Omar Sandoval authored
An upcoming Btrfs fix needs to know the original size of a non-cloned bios. Rather than accessing the bvec table directly, let's add a bio_for_each_bvec_all() accessor. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
At clean_pinned_extents(), whether we end up returning success or failure, we pretty much have to do the same things: 1) unlock unused_bg_unpin_mutex 2) decrement reference count on the previous transaction We also call btrfs_dec_block_group_ro() in case of failure, but that is better done in its caller, btrfs_delete_unused_bgs(), since its the caller that calls inc_block_group_ro(), so it should be responsible for the decrement operation, as it is in case any of the other functions it calls fail. So move the call to btrfs_dec_block_group_ro() from clean_pinned_extents() into btrfs_delete_unused_bgs() and unify the error and success return paths for clean_pinned_extents(), reducing duplicated code and making it simpler. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
All callers pass the eb::level so we can get read it directly inside the btrfs_bin_search and key_search. This is inspired by the work of Marek in U-boot. CC: Marek Behun <marek.behun@nic.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Instead of returning both the page and the super block structure, make btrfs_read_disk_super just return a pointer to struct btrfs_disk_super. As a result the function signature is simplified. Also, read_cache_page_gfp can never return NULL so check its return value only for IS_ERR. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
The function always works on a local copy of the reloc root list, which cannot be modified outside of it so using list_for_each_entry is fine. Additionally the macro handles empty lists so drop list_empty checks of callers. No semantic changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Deleting a subvolume on a full filesystem leads to ENOSPC followed by a forced read-only. This is not a transaction abort and the filesystem is otherwise ok, so the error should be just propagated to the callers. This is caused by unnecessary call to btrfs_handle_fs_error for all errors, except EAGAIN. This does not make sense as the standard transaction abort mechanism is in btrfs_drop_snapshot so all relevant failures are handled. Originally in commit cb1b69f4 ("Btrfs: forced readonly when btrfs_drop_snapshot() fails") there was no return value at all, so the btrfs_std_error made some sense but once the error handling and propagation has been implemented we don't need it anymore. Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
The reclaim_size counter of a space_info object is unsigned. So its value can never be negative, it's pointless to have an assertion that checks its value is >= 0, therefore remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Zheng Wei authored
Remove the duplicate definition of 'inode_item_err' in the file tree-checker.c that got there by accident in c23c77b0 ("btrfs: tree-checker: Refactor inode key check into seperate function"). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Zheng Wei <wei.zheng@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
Nikolay noticed a bunch of test failures with my global rsv steal patches. At first he thought they were introduced by them, but they've been failing for a while with 64k nodes. The problem is with 64k nodes we have a global reserve that calculates out to 13MiB on a freshly made file system, which only has 8MiB of metadata space. Because of changes I previously made we no longer account for the global reserve in the overcommit logic, which means we correctly allow overcommit to happen even though we are already overcommitted. However in some corner cases, for example btrfs/170, we will allocate the entire file system up with data chunks before we have enough space pressure to allocate a metadata chunk. Then once the fs is full we ENOSPC out because we cannot overcommit and the global reserve is taking up all of the available space. The most ideal way to deal with this is to change our space reservation stuff to take into account the height of the tree's that we're modifying, so that our global reserve calculation does not end up so obscenely large. However that is a huge undertaking. Instead fix this by forcing a chunk allocation if the global reserve is larger than the total metadata space. This gives us essentially the same behavior that happened before, we get a chunk allocated and these tests can pass. This is meant to be a stop-gap measure until we can tackle the "tree height only" project. Fixes: 0096420a ("btrfs: do not account global reserve in can_overcommit") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
With normal tickets we could have a large reservation at the front of the list that is unable to be satisfied, but a smaller ticket later on that can be satisfied. The way we handle this is to run btrfs_try_granting_tickets() in maybe_fail_all_tickets(). However no such protection exists for priority tickets. Fix this by handling it in handle_reserve_ticket(). If we've returned after attempting to flush space in a priority related way, we'll still be on the priority list and need to be removed. We rely on the flushing to free up space and wake the ticket, but if there is not enough space to reclaim _but_ there's enough space in the space_info to handle subsequent reservations then we would have gotten an ENOSPC erroneously. Address this by catching where we are still on the list, meaning we were a priority ticket, and removing ourselves and then running btrfs_try_granting_tickets(). This will handle this particular corner case. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
In debugging a generic/320 failure on ppc64, Nikolay noticed that sometimes we'd ENOSPC out with plenty of space to reclaim if we had committed the transaction. He further discovered that this was because there was a priority ticket that was small enough to fit in the free space currently in the space_info. Consider the following scenario. There is no more space to reclaim in the fs without committing the transaction. Assume there's 1MiB of space free in the space info, but there are pending normal tickets with 2MiB reservations. Now a priority ticket comes in with a .5MiB reservation. Because we have normal tickets pending we add ourselves to the priority list, despite the fact that we could satisfy this reservation. The flushing machinery now gets to the point where it wants to commit the transaction, but because there's a .5MiB ticket on the priority list and we have 1MiB of free space we assume the ticket will be granted soon, so we bail without committing the transaction. Meanwhile the priority flushing does not commit the transaction, and eventually fails with an ENOSPC. Then all other tickets are failed with ENOSPC because we were never able to actually commit the transaction. The fix for this is we should have simply granted the priority flusher his reservation, because there was space to make the reservation. Priority flushers by definition take priority, so they are allowed to make their reservations before any previous normal tickets. By not adding this priority ticket to the list the normal flushing mechanisms will then commit the transaction and everything will continue normally. We still need to serialize ourselves with other priority tickets, so if there are any tickets on the priority list then we need to add ourselves to that list in order to maintain the serialization between priority tickets. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
On ppc64le with 64k page size (respectively 64k block size) generic/320 was failing and debug output showed we were getting a premature ENOSPC with a bunch of space in btrfs_fs_info::trans_block_rsv. This meant there were still open transaction handles holding space, yet the flusher didn't commit the transaction because it deemed the freed space won't be enough to satisfy the current reserve ticket. Fix this by accounting for space in trans_block_rsv when deciding whether the current transaction should be committed or not. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We previously had a limit of stealing 50% of the global reserve for unlink. This was from a time when the global reserve was used for the delayed refs as well. However now those reservations are kept separate, so the global reserve can be depleted much more to allow us to make progress for space restoring operations like unlink. Change the minimum amount of space required to be left in the global reserve to 10%. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-