- 10 Jun, 2015 26 commits
-
-
Qu Wenruo authored
qgroup. Make snapshot accounting work with new extent-oriented mechanism by skipping given root in new/old_roots in create_pending_snapshot(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
This is used by later qgroup fix patches for snapshot. As current snapshot accounting is done by btrfs_qgroup_inherit(), but new extent oriented quota mechanism will account extent from btrfs_copy_root() and other snapshot things, causing wrong result. So add this ability to handle snapshot accounting. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
This function will delete unode with given (val,aux) pair. And with this patch, seqnum for debug usage doesn't have any meaning now, so remove them. This is used by later patches to skip snapshot root. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Goodbye, the old mechanisim. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Since the self test transaction don't have delayed_ref_roots, so use find_all_roots() and export btrfs_qgroup_account_extent() to simulate it Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Switch from old ref_node based qgroup to extent based qgroup mechanism for normal operations. The new mechanism should hugely reduce the overhead of btrfs quota system, and further more, the codes and logic should be more clean and easier to maintain. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Switch rescan to use the new new extent oriented mechanism. As rescan is also based on extent, new mechanism is just a perfect match for rescan. With re-designed internal functions, rescan is quite easy, just call btrfs_find_all_roots() and then btrfs_qgroup_account_one_extent(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
btrfs_qgroup_account_extents(). The new btrfs_qgroup_account_extents() function should be called in btrfs_commit_transaction() and it will update all the qgroup according to delayed_ref_root->dirty_extent_root. The new function can handle both normal operation during commit_transaction() or in rescan in a unified method with clearer logic. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
btrfs_find_all_roots(). Allow btrfs_find_all_roots() to skip all delayed_ref_head lock and tree lock to do tree search. This is important for later qgroup implement which will call find_all_roots() after fs trees are committed. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Add function btrfs_qgroup_prepare_account_extents() to get old_roots which are needed for qgroup. We do it in commit_transaction() and before switch_roots(), and only search commit_root, so it gives a quite accurate view for previous transaction. With old_roots from previous transaction, we can use it to do accurate account with current transaction. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Add hook in add_delayed_ref_head() to record quota-related extent record into delayed_ref_root->dirty_extent_record rb-tree for later qgroup accounting. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Add function qgroup_update_counters(), which will update related qgroups' rfer/excl according to old/new_roots. This is one of the two core functions for the new qgroup implement. This is based on btrfs_adjust_coutners() but with clearer logic and comment. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
This function is used to update refcnt for qgroups. And is one of the two core functions used in the new qgroup implement. This is based on the old update_old/new_refcnt, but provides a unified logic and behavior. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent() __btrfs_inc_extent_ref() and __btrfs_free_extent() have already had too many parameters, but three of them can be extracted from btrfs_delayed_ref_node struct. So use btrfs_delayed_ref_node struct as a single parameter to replace the bytenr/num_byte/no_quota parameters. The real objective of this patch is to allow btrfs_qgroup_record_ref() get the delayed_ref_node in incoming qgroup patches. Other functions calling btrfs_qgroup_record_ref() are not affected since the rest will only add/sub exclusive extents, where node is not used. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Use inline functions to do such things, to improve readability. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Acked-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Cleanup the rb_tree merge/insert/update functions, since now we use list instead of rb_tree now. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
This patch replace the rbtree used in ref_head to list. This has the following advantage: 1) Easier merge logic. With the new list implement, we only need to care merging the tail ref_node with the new ref_node. And this can be done quite easy at insert time, no need to do a indicated merge at run_delayed_refs(). Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Qu Wenruo authored
Old __merge_refs() in backref.c will even merge refs whose root_id are different, which makes qgroup gives wrong result. Fix it by checking ref_for_same_block() before any mode specific works. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Zhao Lei authored
lockdep report following warning in test: [25176.843958] ================================= [25176.844519] [ INFO: inconsistent lock state ] [25176.845047] 4.1.0-rc3 #22 Tainted: G W [25176.845591] --------------------------------- [25176.846153] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [25176.846713] fsstress/26661 [HC0[0]:SC1[1]:HE1:SE0] takes: [25176.847246] (&wr_ctx->wr_lock){+.?...}, at: [<ffffffffa04cdc6d>] scrub_free_ctx+0x2d/0xf0 [btrfs] [25176.847838] {SOFTIRQ-ON-W} state was registered at: [25176.848396] [<ffffffff810bf460>] __lock_acquire+0x6a0/0xe10 [25176.848955] [<ffffffff810bfd1e>] lock_acquire+0xce/0x2c0 [25176.849491] [<ffffffff816489af>] mutex_lock_nested+0x7f/0x410 [25176.850029] [<ffffffffa04d04ff>] scrub_stripe+0x4df/0x1080 [btrfs] [25176.850575] [<ffffffffa04d11b1>] scrub_chunk.isra.19+0x111/0x130 [btrfs] [25176.851110] [<ffffffffa04d144c>] scrub_enumerate_chunks+0x27c/0x510 [btrfs] [25176.851660] [<ffffffffa04d3b87>] btrfs_scrub_dev+0x1c7/0x6c0 [btrfs] [25176.852189] [<ffffffffa04e918e>] btrfs_dev_replace_start+0x36e/0x450 [btrfs] [25176.852771] [<ffffffffa04a98e0>] btrfs_ioctl+0x1e10/0x2d20 [btrfs] [25176.853315] [<ffffffff8121c5b8>] do_vfs_ioctl+0x318/0x570 [25176.853868] [<ffffffff8121c851>] SyS_ioctl+0x41/0x80 [25176.854406] [<ffffffff8164da17>] system_call_fastpath+0x12/0x6f [25176.854935] irq event stamp: 51506 [25176.855511] hardirqs last enabled at (51506): [<ffffffff810d4ce5>] vprintk_emit+0x225/0x5e0 [25176.856059] hardirqs last disabled at (51505): [<ffffffff810d4b77>] vprintk_emit+0xb7/0x5e0 [25176.856642] softirqs last enabled at (50886): [<ffffffff81067a23>] __do_softirq+0x363/0x640 [25176.857184] softirqs last disabled at (50949): [<ffffffff8106804d>] irq_exit+0x10d/0x120 [25176.857746] other info that might help us debug this: [25176.858845] Possible unsafe locking scenario: [25176.859981] CPU0 [25176.860537] ---- [25176.861059] lock(&wr_ctx->wr_lock); [25176.861705] <Interrupt> [25176.862272] lock(&wr_ctx->wr_lock); [25176.862881] *** DEADLOCK *** Reason: Above warning is caused by: Interrupt -> bio_endio() -> ... -> scrub_put_ctx() -> scrub_free_ctx() *1 -> ... -> mutex_lock(&wr_ctx->wr_lock); scrub_put_ctx() is allowed to be called in end_bio interrupt, but in code design, it will never call scrub_free_ctx(sctx) in interrupe context(above *1), because btrfs_scrub_dev() get one additional reference of sctx->refs, which makes scrub_free_ctx() only called withine btrfs_scrub_dev(). Now the code runs out of our wish, because free sequence in scrub_pending_bio_dec() have a gap. Current code: -----------------------------------+----------------------------------- scrub_pending_bio_dec() | btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); | wake_up(&sctx->list_wait); | | scrub_put_ctx() | -> atomic_dec_and_test(&sctx->refs) scrub_put_ctx(sctx); | -> atomic_dec_and_test(&sctx->refs)| -> scrub_free_ctx() | -----------------------------------+----------------------------------- We expected: -----------------------------------+----------------------------------- scrub_pending_bio_dec() | btrfs_scrub_dev -----------------------------------+----------------------------------- atomic_dec(&sctx->bios_in_flight); | wake_up(&sctx->list_wait); | scrub_put_ctx(sctx); | -> atomic_dec_and_test(&sctx->refs)| | scrub_put_ctx() | -> atomic_dec_and_test(&sctx->refs) | -> scrub_free_ctx() -----------------------------------+----------------------------------- Fix: Move scrub_pending_bio_dec() to a workqueue, to avoid this function run in interrupt context. Tested by check tracelog in debug. Changelog v1->v2: Use workqueue instead of adjust function call sequence in v1, because v1 will introduce a bug pointed out by: Filipe David Manana <fdmanana@gmail.com> Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Mark Fasheh authored
The extent-same code rejects requests with an unaligned length. This poses a problem when we want to dedupe the tail extent of files as we skip cloning the portion between i_size and the extent boundary. If we don't clone the entire extent, it won't be deleted. So the combination of these behaviors winds up giving us worst-case dedupe on many files. We can fix this by allowing a length that extents to i_size and internally aligining those to the end of the block. This is what btrfs_ioctl_clone() so we can just copy that check over. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
-
chandan authored
max_to_defrag represents the number of pages to defrag rather than the last page of the file range to be defragged. Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag should actually have the value 4. Instead in the current code we end up setting it to 6. Now, this does not (yet) cause an issue since the first part of the while loop condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control to flow out of the while loop before any buggy behavior is actually caused. So the patch just makes sure that max_to_defrag ends up having the right value rather than fixing a bug. I did run the xfstests suite to make sure that the code does not regress. Changelog: v1->v2: Provide a much descriptive commit message. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
-
chandan authored
Read-ahead is done for the pages in the range [ra_index, ra_index + cluster - 1]. So the next read-ahead should be starting from the page at index 'ra_index + cluster' (unless we deemed that the extent at 'ra_index + cluster' as non-defraggable) rather than from the page at index 'ra_index + max_cluster'. This patch fixes this. I did run the xfstests suite to make sure that the code does not regress. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
When allocating a new chunk or removing one we need to update num_devs device items and insert or remove a chunk item in the chunk tree, so in the worst case the space needed in the chunk space_info is: btrfs_calc_trunc_metadata_size(chunk_root, num_devs) + btrfs_calc_trans_metadata_size(chunk_root, 1) That is, in the worst case we need to cow num_devs paths and cow 1 other path that can result in splitting every node and leaf, and each path consisting of BTRFS_MAX_LEVEL - 1 nodes and 1 leaf. We were requiring some additional chunk_root->nodesize * BTRFS_MAX_LEVEL * num_devs bytes, which were unnecessary since updating the existing device items does not result in splitting the nodes and leaf since after updating them they remain with the same size. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
We don't need to attach ordered extents that have completed to the current transaction. Doing so only makes us hold memory for longer than necessary and delaying the iput of the inode until the transaction is committed (for each created ordered extent we do an igrab and then schedule an asynchronous iput when the ordered extent's reference count drops to 0), preventing the inode from being evictable until the transaction commits. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
Commit 3a8b36f3 ("Btrfs: fix data loss in the fast fsync path") added a performance regression for that causes an unnecessary sync of the log trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done against a file, without no writes or any metadata updates to the inode in between them and if a transaction is committed before the second fsync is called. Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99) after a test sysbench test that measured a -62% decrease of file io requests per second for that tests' workload. The test is: echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor mkfs -t btrfs /dev/sda2 mount -t btrfs /dev/sda2 /fs/sda2 cd /fs/sda2 for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \ --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \ --file-num=1024 run A test on kvm guest, running a debug kernel gave me the following results: Without 3a8b36f3: 16.01 reqs/sec With 3a8b36f3: 3.39 reqs/sec With 3a8b36f3 and this patch: 16.04 reqs/sec Reported-by: Huang Ying <ying.huang@intel.com> Tested-by: Huang, Ying <ying.huang@intel.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Chris Mason authored
Merge branch 'send_fixes_4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.2
-
- 03 Jun, 2015 14 commits
-
-
Filipe Manana authored
Zygo Blaxell and other users have reported occasional hangs while an inode is being evicted, leading to traces like the following: [ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds. [ 5281.973836] Not tainted 4.0.0-rc5-btrfs-next-9+ #2 [ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 5281.976364] rm D ffff8800724cfc38 0 20488 7747 0x00000000 [ 5281.977506] ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001 [ 5281.978461] ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78 [ 5281.979541] ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123 [ 5281.981396] Call Trace: [ 5281.982066] [<ffffffff8143107e>] schedule+0x74/0x83 [ 5281.983341] [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs] [ 5281.985127] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [ 5281.986715] [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs] [ 5281.988680] [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs] [ 5281.990200] [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs] [ 5281.991781] [<ffffffff8116964d>] evict+0xa0/0x148 [ 5281.992735] [<ffffffff8116a43d>] iput+0x18f/0x1e5 [ 5281.993796] [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa [ 5281.994806] [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58 [ 5281.996120] [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab [ 5281.997562] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 5281.998815] [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b [ 5281.999920] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [ 5282.001299] 1 lock held by rm/20488: [ 5282.002066] #0: (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b This happens when we have readahead, which calls readpages(), happening right before the inode eviction handler is invoked. So the reason is essentially: 1) readpages() is called while a reference on the inode is held, so eviction can not be triggered before readpages() returns. It also locks one or more ranges in the inode's io_tree (which is done at extent_io.c:__do_contiguous_readpages()); 2) readpages() submits several read bios, all with an end io callback that runs extent_io.c:end_bio_extent_readpage() and that is executed by other task when a bio finishes, corresponding to a work queue (fs_info->end_io_workers) worker kthread. This callback unlocks the ranges in the inode's io_tree that were previously locked in step 1; 3) readpages() returns, the reference on the inode is dropped; 4) One or more of the read bios previously submitted are still not complete (their end io callback was not yet invoked or has not yet finished execution); 5) Inode eviction is triggered (through an unlink call for example). The inode reference count was not incremented before submitting the read bios, therefore this is possible; 6) The eviction handler starts executing and enters the loop that iterates over all extent states in the inode's io_tree; 7) The loop picks one extent state record and uses its ->start and ->end fields, after releasing the inode's io_tree spinlock, to call lock_extent_bits() and clear_extent_bit(). The call to lock the range [state->start, state->end] blocks because the whole range or a part of it was locked by the previous call to readpages() and the corresponding end io callback, which unlocks the range was not yet executed; 8) The end io callback for the read bio is executed and unlocks the range [state->start, state->end] (or a superset of that range). And at clear_extent_bit() the extent_state record state is used as a second argument to split_state(), which sets state->start to a larger value; 9) The task executing the eviction handler is woken up by the task executing the bio's end io callback (through clear_state_bit) and the eviction handler locks the range [old value for state->start, state->end]. Shortly after, when calling clear_extent_bit(), it unlocks the range [new value for state->start, state->end], so it ends up unlocking only part of the range that it locked, leaving an extent state record in the io_tree that represents the unlocked subrange; 10) The eviction handler loop, in its next iteration, gets the extent_state record for the subrange that it did not unlock in the previous step and then tries to lock it, resulting in an hang. So fix this by not using the ->start and ->end fields of an existing extent_state record. This is a simple solution, and an alternative could be to bump the inode's reference count before submitting each read bio and having it dropped in the bio's end io callback. But that would be a more invasive/complex change and would not protect against other possible places that are not holding a reference on the inode as well. Something to consider in the future. Many thanks to Zygo Blaxell for reporting, in the mailing list, the issue, a set of scripts to trigger it and testing this fix. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Liu Bo authored
The return value of read_tree_block() can confuse callers as it always returns NULL for either -ENOMEM or -EIO, so it's likely that callers parse it to a wrong error, for instance, in btrfs_read_tree_root(). This fixes the above issue. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Liu Bo authored
read_tree_block may take a reference on the 'eb', a following free_extent_buffer is necessary. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Liu Bo authored
After commit 8407f553 ("Btrfs: fix data corruption after fast fsync and writeback error"), during wait_ordered_extents(), we wait for ordered extent setting BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've already got checksum information, so we don't need to check (csum_bytes_left == 0) in the whole logging path. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
Unlike when attempting to allocate a new block group, where we check that we have enough space in the system space_info to update the device items and insert a new chunk item in the chunk tree, we were not checking if the system space_info had enough space for updating the device items and deleting the chunk item in the chunk tree. This often lead to -ENOSPC error when attempting to allocate blocks for the chunk tree (during btree node/leaf COW operations) while updating the device items or deleting the chunk item, which resulted in the current transaction being aborted and turning the filesystem into read-only mode. While running fstests generic/038, which stresses allocation of block groups and removal of unused block groups, with a large scratch device (750Gb) this happened often, despite more than enough unallocated space, and resulted in the following trace: [68663.586604] WARNING: CPU: 3 PID: 1521 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [68663.600407] BTRFS: Transaction aborted (error -28) (...) [68663.730829] Call Trace: [68663.732585] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [68663.734334] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [68663.739980] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [68663.757153] [<ffffffffa036ca6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.760925] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [68663.762854] [<ffffffffa03b159d>] ? btrfs_update_device+0x15a/0x16c [btrfs] [68663.764073] [<ffffffffa036ca6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [68663.765130] [<ffffffffa03b3638>] btrfs_remove_chunk+0x597/0x5ee [btrfs] [68663.765998] [<ffffffffa0384663>] ? btrfs_delete_unused_bgs+0x245/0x296 [btrfs] [68663.767068] [<ffffffffa0384676>] btrfs_delete_unused_bgs+0x258/0x296 [btrfs] [68663.768227] [<ffffffff8143527f>] ? _raw_spin_unlock_irq+0x2d/0x4c [68663.769081] [<ffffffffa038b109>] cleaner_kthread+0x13d/0x16c [btrfs] [68663.799485] [<ffffffffa038afcc>] ? btrfs_alloc_root+0x28/0x28 [btrfs] [68663.809208] [<ffffffff8105f367>] kthread+0xef/0xf7 [68663.828795] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28 [68663.844942] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.846486] [<ffffffff81435a88>] ret_from_fork+0x58/0x90 [68663.847760] [<ffffffff8105f278>] ? __kthread_parkme+0xad/0xad [68663.849503] ---[ end trace 798477c6d6dbaad6 ]--- [68663.850525] BTRFS: error (device sdc) in btrfs_remove_chunk:2652: errno=-28 No space left So fix this by verifying that enough space exists in system space_info, and reserving the space in the chunk block reserve, before attempting to delete the block group and allocate a new system chunk if we don't have enough space to perform the necessary updates and delete in the chunk tree. Like for the block group creation case, we don't error our if we fail to allocate a new system chunk, since we might end up not needing it (no node/leaf splits happen during the COW operations and/or we end up not needing to COW any btree nodes or leafs because they were already COWed in the current transaction and their writeback didn't start yet). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
While creating a block group, we often end up getting ENOSPC while updating the chunk tree, which leads to a transaction abortion that produces a trace like the following: [30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]() [30670.117777] BTRFS: Transaction aborted (error -28) (...) [30670.163567] Call Trace: [30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs] [30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs] [30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs] [30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs] [30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs] [30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15 [30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de [30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f [30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62 [30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17 [30670.187792] ---[ end trace 0373e6b491c4a8cc ]--- This is because we don't do proper space reservation for the chunk block reserve when we have multiple tasks allocating chunks in parallel. So block group creation has 2 phases, and the first phase essentially checks if there is enough space in the system space_info, allocating a new system chunk if there isn't, while the second phase updates the device, extent and chunk trees. However, because the updates to the chunk tree happen in the second phase, if we have N tasks, each with its own transaction handle, allocating new chunks in parallel and if there is only enough space in the system space_info to allocate M chunks, where M < N, none of the tasks ends up allocating a new system chunk in the first phase and N - M tasks will get -ENOSPC when attempting to update the chunk tree in phase 2 if they need to COW any nodes/leafs from the chunk tree. Fix this by doing proper reservation in the chunk block reserve. The issue could be reproduced by running fstests generic/038 in a loop, which eventually triggered the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Josef Bacik authored
We should be doing this, it's weird we hadn't been doing this. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Omar Sandoval authored
Now that we're guaranteed to have a meaningful root dentry, we can just export seq_dentry() and use it in btrfs_show_options(). The subvolume ID is easy to get and can also be useful, so put that in there, too. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Omar Sandoval authored
Currently, mounting a subvolume with subvolid= takes a different code path than mounting with subvol=. This isn't really a big deal except for the fact that mounts done with subvolid= or the default subvolume don't have a dentry that's connected to the dentry tree like in the subvol= case. To unify the code paths, when given subvolid= or using the default subvolume ID, translate it into a subvolume name by walking ROOT_BACKREFs in the root tree and INODE_REFs in the filesystem trees. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Omar Sandoval authored
There's nothing to stop a user from passing both subvol= and subvolid= to mount, but if they don't refer to the same subvolume, someone is going to be surprised at some point. Error out on this case, but allow users to pass in both if they do match (which they could, for example, get out of /proc/mounts). Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Omar Sandoval authored
In preparation for new functionality in mount_subvol(), give it ownership of subvol_name and tidy up the error paths. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Omar Sandoval authored
Currently, setup_root_args() substitutes 's/subvol=[^,]*/subvolid=0/'. But, this means that if the user passes both a subvol and subvolid for some reason, we won't actually mount the top-level when we recursively mount. For example, consider: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt btrfs subvol create /mnt/subvol1 # subvolid=257 btrfs subvol create /mnt/subvol2 # subvolid=258 umount /mnt mount -osubvol=/subvol1,subvolid=258 /dev/sdb /mnt In the final mount, subvol=/subvol1,subvolid=258 becomes subvolid=0,subvolid=258, and the last option takes precedence, so we mount subvol2 and try to look up subvol1 inside of it, which fails. So, instead, do a thorough scan through the argument list and remove any subvol= and subvolid= options, then append subvolid=0 to the end. This implicitly makes subvol= take precedence over subvolid=, but we're about to add a stricter check for that. This also makes setup_root_args() more generic, which we'll need soon. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Omar Sandoval authored
Since commit 0723a047 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options"), when mounting a subvolume read/write when another subvolume has previously been mounted read-only, we first do a remount. However, this should be done with the superblock locked, as per sync_filesystem(): /* * We need to be protected against the filesystem going from * r/o to r/w or vice versa. */ WARN_ON(!rwsem_is_locked(&sb->s_umount)); This WARN_ON can easily be hit with: mkfs.btrfs -f /dev/vdb mount /dev/vdb /mnt btrfs subvol create /mnt/vol1 btrfs subvol create /mnt/vol2 umount /mnt mount -oro,subvol=/vol1 /dev/vdb /mnt mount -orw,subvol=/vol2 /dev/vdb /mnt2 Fixes: 0723a047 ("btrfs: allow mounting btrfs subvolumes with different ro/rw options") Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Filipe Manana authored
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits() through free_io_failure(), we weren't waking up any tasks waiting for the extent's state EXTENT_LOCKED bit, leading to an hang. So make sure clear_extent_bits() ends up waking up any waiters if the bit EXTENT_LOCKED is supplied by its callers. Zygo Blaxell was experiencing such hangs at inode eviction time after file unlinks. Thanks to him for a set of scripts to reproduce the issue. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
-