- 25 Feb, 2019 40 commits
-
Anand Jain authored
This fixes a long-standing lockdep warning triggered by fstests/btrfs/011. The circular locking dependency check reports the warning [1] because btrfs_scrub_dev() calls the stack #0 below with the fs_info::scrub_lock held. The test case leading to this warning:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /btrfs
  $ btrfs scrub start -B /btrfs

In fact we have fs_info::scrub_workers_refcnt to track whether the init and destroy of the scrub workers are needed. So once we have incremented and decremented the fs_info::scrub_workers_refcnt value in the thread, it's OK to drop the scrub_lock and then actually do the btrfs_destroy_workqueue() part. So this patch drops the scrub_lock before calling btrfs_destroy_workqueue().

[1]
[359.258534] ======================================================
[359.260305] WARNING: possible circular locking dependency detected
[359.261938] 5.0.0-rc6-default #461 Not tainted
[359.263135] ------------------------------------------------------
[359.264672] btrfs/20975 is trying to acquire lock:
[359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
[359.268416]
[359.268416] but task is already holding lock:
[359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
[359.272418]
[359.272418] which lock already depends on the new lock.
[359.272418]
[359.274692]
[359.274692] the existing dependency chain (in reverse order) is:
[359.276671]
[359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
[359.278187]        __mutex_lock+0x86/0x9c0
[359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
[359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
[359.281931]        close_ctree+0x30b/0x350 [btrfs]
[359.283208]        generic_shutdown_super+0x64/0x100
[359.284516]        kill_anon_super+0x14/0x30
[359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
[359.286964]        deactivate_locked_super+0x29/0x60
[359.288242]        cleanup_mnt+0x3b/0x70
[359.289310]        task_work_run+0x98/0xc0
[359.290428]        exit_to_usermode_loop+0x83/0x90
[359.291445]        do_syscall_64+0x15b/0x180
[359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[359.294011]
[359.294011] -> #2 (sb_internal#2){.+.+}:
[359.295432]        __sb_start_write+0x113/0x1d0
[359.296394]        start_transaction+0x369/0x500 [btrfs]
[359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
[359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
[359.299698]        process_one_work+0x246/0x610
[359.300898]        worker_thread+0x3c/0x390
[359.302020]        kthread+0x116/0x130
[359.303053]        ret_from_fork+0x24/0x30
[359.304152]
[359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
[359.306100]        process_one_work+0x21f/0x610
[359.307302]        worker_thread+0x3c/0x390
[359.308465]        kthread+0x116/0x130
[359.309357]        ret_from_fork+0x24/0x30
[359.310229]
[359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
[359.311812]        lock_acquire+0x90/0x180
[359.312929]        flush_workqueue+0xaa/0x540
[359.313845]        drain_workqueue+0xa1/0x180
[359.314761]        destroy_workqueue+0x17/0x240
[359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
[359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
[359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
[359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
[359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
[359.322908]        do_vfs_ioctl+0xa2/0x6d0
[359.324021]        ksys_ioctl+0x3a/0x70
[359.325066]        __x64_sys_ioctl+0x16/0x20
[359.326236]        do_syscall_64+0x54/0x180
[359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[359.328772]
[359.328772] other info that might help us debug this:
[359.328772]
[359.330990] Chain exists of:
[359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
[359.330990]
[359.334376]  Possible unsafe locking scenario:
[359.334376]
[359.336020]        CPU0                    CPU1
[359.337070]        ----                    ----
[359.337821]   lock(&fs_info->scrub_lock);
[359.338506]                                lock(sb_internal#2);
[359.339506]                                lock(&fs_info->scrub_lock);
[359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
[359.342437]
[359.342437]  *** DEADLOCK ***
[359.342437]
[359.343745] 1 lock held by btrfs/20975:
[359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
[359.346778]
[359.346778] stack backtrace:
[359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
[359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
[359.350501] Call Trace:
[359.350931]  dump_stack+0x67/0x90
[359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
[359.353569]  check_prev_add.constprop.44+0x4f9/0x750
[359.354849]  ? check_prev_add.constprop.44+0x286/0x750
[359.356505]  __lock_acquire+0xb84/0xf10
[359.357505]  lock_acquire+0x90/0x180
[359.358271]  ? flush_workqueue+0x87/0x540
[359.359098]  flush_workqueue+0xaa/0x540
[359.359912]  ? flush_workqueue+0x87/0x540
[359.360740]  ? drain_workqueue+0x1e/0x180
[359.361565]  ? drain_workqueue+0xa1/0x180
[359.362391]  drain_workqueue+0xa1/0x180
[359.363193]  destroy_workqueue+0x17/0x240
[359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
[359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
[359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
[359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
[359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
[359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
[359.370186]  ? __lock_acquire+0x263/0xf10
[359.370777]  ? kvm_clock_read+0x14/0x30
[359.371392]  ? kvm_sched_clock_read+0x5/0x10
[359.372248]  ? sched_clock+0x5/0x10
[359.372786]  ? sched_clock_cpu+0xc/0xc0
[359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
[359.374552]  do_vfs_ioctl+0xa2/0x6d0
[359.375378]  ? do_sigaction+0xff/0x250
[359.376233]  ksys_ioctl+0x3a/0x70
[359.376954]  __x64_sys_ioctl+0x16/0x20
[359.377772]  do_syscall_64+0x54/0x180
[359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[359.380422] RIP: 0033:0x7f5429296a97

Backporting to older kernels: scrub_nocow_workers must be freed the same way as the others.

CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Anand Jain <anand.jain@oracle.com> [ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
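A minimal sketch of the reordering described above, with simplified names and only one workqueue shown (not the actual patch): the refcount drop and the workqueue-pointer detach happen under scrub_lock, while the destruction happens only after the lock is released.

  static void scrub_workers_put(struct btrfs_fs_info *fs_info)
  {
      struct btrfs_workqueue *scrub_workers = NULL;

      mutex_lock(&fs_info->scrub_lock);
      if (--fs_info->scrub_workers_refcnt == 0)
          scrub_workers = fs_info->scrub_workers;  /* detach under the lock */
      mutex_unlock(&fs_info->scrub_lock);

      /* flush/destroy outside of scrub_lock, which breaks the cycle with
       * the (wq_completion) lock class reported by lockdep */
      if (scrub_workers)
          btrfs_destroy_workqueue(scrub_workers);
  }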
-
Anand Jain authored
We have killed the volume mutex (commit dccdb07b, "btrfs: kill btrfs_fs_info::volume_mutex"). This trivial one seems to have escaped. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
There is no need to forward declare flush_write_bio(), as it only depends on submit_one_bio(). Both of them are pretty small, just move them to kill the forward declaration. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
The variables and function parameters of __etree_search which pertain to prev/next are grossly misnamed. Namely, prev_ret holds the next state and not the previous. Similarly, next_ret actually holds the previous extent state relating to the offset we are interested in. Fix this by renaming the variables as well as switching the arguments order. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
With the refactoring introduced in 8b62f87b ("Btrfs: rework outstanding_extents") this flag became unused. Remove it and renumber the following flags accordingly. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
There is no point in using a construct like 'if (!condition) WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
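For illustration, the pattern before and after (the condition here is a made-up example):

  /* before: a redundant branch just to raise the warning */
  if (!path->locks[level])
      WARN_ON(1);

  /* after: the condition is passed to WARN_ON() directly */
  WARN_ON(!path->locks[level]);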
-
Josef Bacik authored
We could generate a lot of delayed refs in evict but never have any left over space from our block rsv to make up for that fact. So reserve some extra space and give it to the transaction so it can be used to refill the delayed refs rsv every loop through the truncate path. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
For FLUSH_LIMIT flushers we really can only allocate chunks and flush delayed inode items; everything else is problematic. I added a bunch of new states and it led to weirdness in the FLUSH_LIMIT case because I forgot about how it worked. So instead explicitly declare the states that are ok for flushing with FLUSH_LIMIT and use that for our state machine. Then as we add new things that are safe we can just add them to this list. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
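A sketch of the idea, with the state list and the "need more space" check assumed rather than taken from the patch: the FLUSH_LIMIT path walks a dedicated array of safe states instead of the full flushing state machine.

  /* states considered safe for FLUSH_LIMIT flushers (assumed names) */
  static const enum btrfs_flush_state priority_flush_states[] = {
      FLUSH_DELAYED_ITEMS_NR,
      FLUSH_DELAYED_ITEMS,
      ALLOC_CHUNK,
  };

  static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
                                              struct btrfs_space_info *space_info,
                                              u64 to_reclaim)
  {
      int i;

      for (i = 0; i < ARRAY_SIZE(priority_flush_states); i++) {
          flush_space(fs_info, space_info, to_reclaim,
                      priority_flush_states[i]);
          if (!need_more_space(space_info, to_reclaim))  /* placeholder check */
              return;
      }
  }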
-
Josef Bacik authored
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete, because of the over-reservation for the worst case. So instead try to make our reservation, and if we couldn't make it, re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out: flushing that could have been running in parallel did not make any space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ rename to calc_refill_bytes, update comment and changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
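A sketch of the retry logic: calc_refill_bytes is the name mentioned in the changelog, while reserve_bytes and the signatures here are placeholders.

  /* Retry the reservation with a recalculated size; error out only when
   * the required amount stops shrinking, i.e. parallel flushing freed nothing. */
  static int refill_inode_rsv(struct btrfs_inode *inode)
  {
      u64 bytes = calc_refill_bytes(inode);

      while (bytes) {
          u64 recalc;

          if (!reserve_bytes(inode, bytes))  /* placeholder for the real call */
              return 0;

          recalc = calc_refill_bytes(inode);
          if (recalc >= bytes)
              return -ENOSPC;
          bytes = recalc;
      }
      return 0;
  }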
-
Josef Bacik authored
With the introduction of the per-inode block_rsv it became possible to have really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests, it just killed all tickets if we were unable to satisfy the current request. However, this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were, continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
We've done this forever because of the voodoo around knowing how much space we have. However, we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not. This has a side effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add an ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves issues I was seeing with the mixed bg tests in xfstests without the new flushing state. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ] Signed-off-by: David Sterba <dsterba@suse.com>
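A simplified sketch of the new check, with the threshold and signature simplified: compare the space_info's actually used bytes plus the pending request against its capacity instead of reasoning about the global reserve.

  static bool should_alloc_chunk(struct btrfs_space_info *sinfo, u64 bytes_needed)
  {
      u64 used = sinfo->bytes_used + sinfo->bytes_reserved +
                 sinfo->bytes_pinned + sinfo->bytes_readonly +
                 sinfo->bytes_may_use;

      /* allocate once real usage plus the pending request approaches capacity */
      return used + bytes_needed >= sinfo->total_bytes;
  }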
-
Josef Bacik authored
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However, if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
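A rough sketch of the added escape hatch; the helper and flag names for "this transaction has block groups queued to be freed" are placeholders, not the actual btrfs symbols.

  /* Decide whether a commit can still free space even though pinned bytes
   * alone would not cover the reservation. */
  static bool commit_can_release_space(struct btrfs_trans_handle *trans,
                                       bool enough_pinned_space)
  {
      if (enough_pinned_space)
          return true;
      /* block groups freed in this transaction become allocatable space
       * once it commits, so committing is still worth it */
      return transaction_has_pending_free_bgs(trans);  /* placeholder helper */
  }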
-
Dennis Zhou authored
Zstd compression requires different amounts of memory for each level of compression. The prior patches implemented indirection to allow for each compression type to manage their workspaces independently. This patch uses this indirection to implement compression level support for zstd.

To manage the additional memory required, each compression level has its own queue of workspaces. A global LRU is used to help with reclaim. Reclaim is done via a timer which provides a mechanism to decrease memory utilization by keeping only workspaces around that are sized appropriately. Forward progress is guaranteed by a preallocated max workspace hidden from the LRU.

When getting a workspace, it uses a bitmap to identify the levels that are populated and scans up. If it finds a workspace that is greater than it, it uses it, but does not update the last_used time and the corresponding place in the LRU. If we hit memory pressure, we sleep on the max level workspace. We continue to rescan in case we can use a smaller workspace, but eventually should be able to obtain the max level workspace or allocate one again should memory pressure subside.

The memory requirement for decompression is the same as level 1, and therefore can use any available workspace.

The number of workspaces is bound by an upper limit of the workqueue's limit which currently is 2 (percpu limit). The reclaim timer is used to free inactive/improperly sized workspaces and is set to 307s to avoid colliding with transaction commit (every 30s).

Repeating the experiment from v2 [1], the Silesia corpus was copied to a btrfs filesystem 10 times and then read back after dropping the caches. The btrfs filesystem was on an SSD.

Level   Ratio   Compression (MB/s)   Decompression (MB/s)   Memory (KB)
1       2.658   438.47               910.51                 780
2       2.744   364.86               886.55                 1004
3       2.801   336.33               828.41                 1260
4       2.858   286.71               886.55                 1260
5       2.916   212.77               556.84                 1388
6       2.363   119.82               990.85                 1516
7       3.000   154.06               849.30                 1516
8       3.011   159.54               875.03                 1772
9       3.025   100.51               940.15                 1772
10      3.033   118.97               616.26                 1772
11      3.036   94.19                802.11                 1772
12      3.037   73.45                931.49                 1772
13      3.041   55.17                835.26                 2284
14      3.087   44.70                716.78                 2547
15      3.126   37.30                878.84                 2547

[1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/

Cc: Nick Terrell <terrelln@fb.com> Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
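For orientation, a simplified sketch of the per-type manager described above; field and constant names are illustrative, not the exact btrfs definitions.

  #define ZSTD_BTRFS_MAX_LEVEL          15
  #define ZSTD_BTRFS_RECLAIM_JIFFIES    (307 * HZ)  /* offset from the 30s commit */

  struct zstd_workspace_manager {
      spinlock_t lock;
      struct list_head idle_ws[ZSTD_BTRFS_MAX_LEVEL]; /* one queue per level */
      unsigned long active_map;      /* bitmap: which levels have workspaces */
      struct list_head lru_list;     /* global LRU feeding the reclaim timer */
      struct timer_list timer;       /* frees idle or oversized workspaces */
      wait_queue_head_t wait;        /* sleep here for the max-level workspace */
      void *max_workspace;           /* preallocated, hidden from the LRU */
  };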
-
Dennis Zhou authored
It is possible based on the level configurations that a higher level workspace uses less memory than a lower level workspace. In order to reuse workspaces, this must be made a monotonic relationship. This precomputes the required memory for each level and enforces the monotonicity between level and memory required. This is also done in upstream zstd in [1]. [1] https://github.com/facebook/zstd/commit/a68b76afefec6876f8e8a538155109a5aeac0143 Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
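A sketch of the precomputation, assuming a placeholder helper for the zstd workspace bound of a level: taking a running maximum makes the per-level sizes monotonically non-decreasing, so a level-N workspace can always serve any level <= N.

  static size_t ws_mem_sizes[ZSTD_BTRFS_MAX_LEVEL];

  static void calc_ws_mem_sizes(void)
  {
      size_t max_size = 0;
      int level;

      for (level = 1; level <= ZSTD_BTRFS_MAX_LEVEL; level++) {
          /* workspace_bound_for_level() stands in for the zstd
           * CStream/DStream workspace bound computation */
          size_t level_size = workspace_bound_for_level(level);

          max_size = max(max_size, level_size);  /* enforce monotonicity */
          ws_mem_sizes[level - 1] = max_size;
      }
  }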
-
Dennis Zhou authored
Zstd currently only supports the default level of compression. This patch switches to using the level passed in for btrfs zstd configuration. Zstd workspaces now keep track of the requested level as this can differ from the size of the workspace. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Dennis Zhou authored
Currently, the only user of set_level() is zlib, which sets an internal workspace parameter. As level is now plumbed into get_workspace(), this can be handled there rather than separately. This repurposes set_level() to bound the level passed in, so it can be used when setting the mount's compression level as well as when verifying the level before getting a workspace. The other benefit is this divides the meaning of compress(0) and get_workspace(0). The former means we want to use the default compression level of the compression type. The latter means we can use any workspace available. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
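A sketch of the repurposed zstd set_level() under assumed constant names: 0 maps to the default level, anything else is clamped to the supported maximum.

  #define ZSTD_BTRFS_DEFAULT_LEVEL  3
  #define ZSTD_BTRFS_MAX_LEVEL      15

  static unsigned int zstd_set_level(unsigned int level)
  {
      if (!level)
          return ZSTD_BTRFS_DEFAULT_LEVEL;
      return min_t(unsigned int, level, ZSTD_BTRFS_MAX_LEVEL);
  }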
-
Dennis Zhou authored
Zlib compression supports multiple levels, but doesn't require changes in how a workspace itself is created and managed. Zstd introduces a different memory requirement such that higher levels of compression require more memory. This requires changes in how the alloc()/get() methods work for zstd. This patch plumbs compression level through the interface as a parameter in preparation for zstd compression levels. This gives the compression types opportunity to create/manage based on the compression level. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Dennis Zhou authored
The previous patch added generic helpers for get_workspace() and put_workspace(). Now, we can migrate ownership of the workspace_manager to be in the compression type code as the compression code itself doesn't care beyond being able to get a workspace. The init/cleanup and get/put methods are abstracted so each compression algorithm can decide how they want to manage their workspaces. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Dennis Zhou authored
There are two levels of workspace management. First, alloc()/free() which are responsible for actually creating and destroy workspaces. Second, at a higher level, get()/put() which is the compression code asking for a workspace from a workspace_manager. The compression code shouldn't really care how it gets a workspace, but that it got a workspace. This adds get_workspace() and put_workspace() to be the higher level interface which is responsible for indexing into the appropriate compression type. It also introduces btrfs_put_workspace() and btrfs_get_workspace() to be the generic implementations of the higher interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
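To illustrate the split, a rough sketch with simplified signatures (the exact prototypes in the series may differ): the type-indexed dispatchers sit on top, while btrfs_get_workspace()/btrfs_put_workspace() are the generic implementations a compression type can fall back to.

  /* generic implementations a compression type may reuse */
  struct list_head *btrfs_get_workspace(struct workspace_manager *wsm,
                                        unsigned int level);
  void btrfs_put_workspace(struct workspace_manager *wsm, struct list_head *ws);

  /* top-level interface: the compression code only asks by type */
  static struct list_head *get_workspace(int type, unsigned int level)
  {
      /* index into the per-type manager; types can later override this */
      return btrfs_get_workspace(&wsm[type], level);
  }

  static void put_workspace(int type, struct list_head *ws)
  {
      btrfs_put_workspace(&wsm[type], ws);
  }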
-
Dennis Zhou authored
Workspace manager init and cleanup code is open coded inside a for loop over the compression types. This forces each compression type to rely on the same workspace manager implementation. This patch creates helper methods that will be the generic implementation for btrfs workspace management. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Dennis Zhou authored
Make the workspace_manager own the interface operations rather than managing index-paired arrays for the workspace_manager and compression operations. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Dennis Zhou authored
While the heuristic workspaces aren't really compression workspaces, they use the same interface for managing them. So rather than branching, let's just handle them once again as the index 0 compression type. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Dennis Zhou authored
This is in preparation for zstd compression levels. As each level will require different size of workspace, workspaces_list is no longer a really fitting name. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Dennis Zhou authored
It is very easy to miss places that rely on a certain bitshifting for decoding the type_level overloading. Add helpers to do this instead. Cc: Omar Sandoval <osandov@osandov.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
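A sketch of the decode helpers; the masks assume the compression type sits in the low 4 bits and the level in the next 4, matching the encoding used by the rest of the series.

  static inline unsigned int btrfs_compress_type(unsigned int type_level)
  {
      return (type_level & 0xF);            /* low 4 bits: compression type */
  }

  static inline unsigned int btrfs_compress_level(unsigned int type_level)
  {
      return ((type_level & 0xF0) >> 4);    /* next 4 bits: level */
  }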
-
Anand Jain authored
Support for a new command that can be used eg. as 'btrfs device scan --forget [dev]' (the final name may change though) to undo the effects of 'btrfs device scan [dev]'. For this purpose this patch proposes to use ioctl #5 as it was empty and is next to the SCAN ioctl.

The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device (/dev/btrfs-control) to unregister one or all devices that are not mounted. The argument is struct btrfs_ioctl_vol_args; ::name specifies the device path. To unregister all devices, the path is an empty string. Again, the devices are removed only if they aren't part of a mounted filesystem.

This new ioctl provides:
- release of unwanted btrfs_fs_devices and btrfs_devices structures from memory if the device is not going to be mounted
- ability to mount a filesystem in degraded mode, when one device is corrupted, like in a split-brain raid1
- running test cases which would require reloading the kernel module, but this is not possible eg. due to a mounted filesystem or a built-in module

Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
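For illustration, a userspace sketch that exercises the new ioctl (needs UAPI headers that already define BTRFS_IOC_FORGET_DEV); an empty name asks to forget all unmounted devices.

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/btrfs.h>  /* struct btrfs_ioctl_vol_args, BTRFS_IOC_FORGET_DEV */

  /* pass "" as path to unregister all currently unmounted devices */
  static int btrfs_forget_device(const char *path)
  {
      struct btrfs_ioctl_vol_args args = {0};
      int fd, ret;

      strncpy(args.name, path, BTRFS_PATH_NAME_MAX);
      fd = open("/dev/btrfs-control", O_RDWR);
      if (fd < 0)
          return -1;
      ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args);
      close(fd);
      return ret;
  }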
-
Josef Bacik authored
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. The waiting is killable as it could be indirectly called from user operations like fallocate or zero-range. Such call sites should handle the error but otherwise it's not necessary. Eg. flush_space just needs to attempt to make space by waiting on iputs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add killable comment and changelog parts ] Signed-off-by: David Sterba <dsterba@suse.com>
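A sketch of the replacement, with assumed field names on btrfs_fs_info (an atomic counter plus a waitqueue): each delayed iput is counted when queued, the counter is dropped after the iput runs, and flushers do a killable wait for it to drain.

  /* run side: the counter was bumped when the iput was queued */
  static void run_delayed_iput(struct btrfs_fs_info *fs_info, struct inode *inode)
  {
      iput(inode);
      if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
          wake_up(&fs_info->delayed_iputs_wait);
  }

  /* flush side: killable so user-triggered paths (fallocate, zero-range)
   * can be interrupted; callers just see -EINTR */
  static int wait_on_delayed_iputs(struct btrfs_fs_info *fs_info)
  {
      int ret = wait_event_killable(fs_info->delayed_iputs_wait,
                    atomic_read(&fs_info->nr_delayed_iputs) == 0);
      return ret ? -EINTR : 0;
  }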
-
Qu Wenruo authored
Since inc_block_group_ro() can return -ENOSPC, outputting debug info when the enospc_debug mount option is set would be helpful to debug false ENOSPC reports from balance. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Inside qgroup_rsv_add/release(), we have trace events trace_qgroup_update_reserve() to catch reserved space update. However we still have two manual trace_qgroup_update_reserve() calls just outside these functions. Remove these duplicated calls. Fixes: 64ee4e75 ("btrfs: qgroup: Update trace events to use new separate rsv types") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anders Roxell authored
A compiler warning (in a patch in development) pointed to a variable that was used only inside an ASSERT:

  u64 root_objectid = root->root_key.objectid;
  ASSERT(root_objectid == ...);

  fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
  fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
    u64 root_objectid = root->root_key.objectid;
        ^~~~~~~~~~~~~

When CONFIG_BTRFS_ASSERT isn't enabled, the variable root_objectid isn't used. Rework the assertion helper by adding a runtime check instead of the '#ifdef CONFIG_BTRFS_ASSERT #else ...' construct, so the compiler sees the condition being passed into an inline function after preprocessing. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
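A sketch of the reworked assertion, following the description above (helper name assumed): the expression always reaches an inline function, so variables used only in assertions count as used in every config, while the message/BUG part compiles away when CONFIG_BTRFS_ASSERT is off.

  static inline void assertfail(const char *expr, const char *file, int line)
  {
      if (IS_ENABLED(CONFIG_BTRFS_ASSERT)) {
          pr_err("assertion failed: %s, in %s:%d\n", expr, file, line);
          BUG();
      }
  }

  #define ASSERT(expr) \
      (likely(expr) ? (void)0 : assertfail(#expr, __FILE__, __LINE__))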
-
David Sterba authored
The last caller that does not have a fixed value of lock is btrfs_set_path_blocking, which actually does the same conditional switch by the lock type, so we can merge the branches together and remove the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Currently, the number of readers and writers is checked and in case there are any, wait and redo the locks. There's some duplication before the branches go back to the again label, eg. calling wait_event on blocking_readers twice. The sequence is transformed to:

loop:
* wait for readers
* wait for writers
* write_lock
* check readers, unlock and wait for readers, loop
* check writers, unlock and wait for writers, loop

The new sequence is not exactly the same due to the simplification; for readers it's slightly faster. For the writers, the original code does:

* wait for writers
* (loop) wait for readers
* wait for writers -- again

while the new one goes directly to the reader check. This should behave the same on a contended lock with multiple writers and readers, but can reduce the number of times we're waiting on something. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
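A structural sketch of the resulting write-lock loop, following the transformed sequence listed above (fields as on the extent_buffer of that era; memory barriers and the rest of btrfs_tree_lock() omitted):

  static void tree_write_lock_sketch(struct extent_buffer *eb)
  {
  again:
      wait_event(eb->read_lock_wq, atomic_read(&eb->blocking_readers) == 0);
      wait_event(eb->write_lock_wq, atomic_read(&eb->blocking_writers) == 0);
      write_lock(&eb->lock);
      if (atomic_read(&eb->blocking_readers)) {
          write_unlock(&eb->lock);
          goto again;
      }
      if (atomic_read(&eb->blocking_writers)) {
          write_unlock(&eb->lock);
          goto again;
      }
  }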
-
David Sterba authored
btrfs_set_lock_blocking is now only a simple wrapper around btrfs_set_lock_blocking_write. The name does not bring any semantic value that could not be inferred from the new function so there's no point keeping it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We can use the right helper where the lock type is a fixed parameter. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There are many callers that hardcode the desired lock type so we can avoid the switch and call them directly. Split the current function to two. There are no remaining users of btrfs_clear_lock_blocking_rw so it's removed. The call sites will be converted in followup patches. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There are many callers that hardcode the desired lock type so we can avoid the switch and call them directly. Split the current function to two but leave a helper that still takes the variable lock type to make current code compile. The call sites will be converted in followup patches. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Since it's replaced by new delayed subtree swap code, remove the original code. The cleanup is small since most of its core function is still used by delayed subtree swap trace. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Before this patch, qgroup code traces the whole subtree of subvolume and reloc trees unconditionally. This makes qgroup numbers consistent, but it could cause tons of unnecessary extent tracing, which causes a lot of overhead.

However for the subtree swap done by balance, both subtrees contain the same contents and tree structure, so qgroup numbers won't change. It's the race window between subtree swap and transaction commit that could cause qgroup numbers to change.

This patch will delay the qgroup subtree scan until COW happens for the subtree root. So if there are no other operations on the fs, balance won't cause extra qgroup overhead (best case scenario). Depending on the workload, most of the subtree scan can still be avoided. Only for the worst case scenario will it fall back to the old subtree swap overhead (scan all swapped subtrees).

[[Benchmark]]
Hardware:
  VM 4G vRAM, 8 vCPUs, disk is using 'unsafe' cache mode, backing device is SAMSUNG 850 evo SSD. Host has 16G ram.

Mkfs parameter:
  --nodesize 4K (To bump up tree size)

Initial subvolume contents:
  4G data copied from /usr and /lib. (With enough regular small files)

Snapshots:
  16 snapshots of the original subvolume. Each snapshot has 3 random files modified.

balance parameter:
  -m

So the content should be pretty similar to a real world root fs layout. And after file system population, there is no other activity, so it should be the best case scenario.

                     | v4.20-rc1 | w/ patchset | diff
-----------------------------------------------------------------------
relocated extents    | 22615     | 22457       | -0.1%
qgroup dirty extents | 163457    | 121606      | -25.6%
time (sys)           | 22.884s   | 18.842s     | -17.6%
time (real)          | 27.724s   | 22.884s     | -17.5%

Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
To allow delayed subtree swap rescan, btrfs needs to record per-root information about which tree blocks get swapped. This patch introduces the required infrastructure.

The designed workflow will be:

1) Record the subtree root block that gets swapped.

   During subtree swap:
   O = Old tree blocks
   N = New tree blocks

         reloc tree                     subvolume tree X
            Root                               Root
           /    \                             /    \
         NA     OB                          OA     OB
       /  |     |  \                      /  |     |  \
     NC  ND     OE  OF                  OC  OD     OE  OF

   In this case, NA and OA are going to be swapped, so record (NA, OA) into subvolume tree X.

2) After subtree swap.

         reloc tree                     subvolume tree X
            Root                               Root
           /    \                             /    \
         OA     OB                          NA     OB
       /  |     |  \                      /  |     |  \
     OC  OD     OE  OF                  NC  ND     OE  OF

3a) COW happens for OB
    If we are going to COW tree block OB, we check OB's bytenr against tree X's swapped_blocks structure. If it doesn't fit any record, nothing will happen.

3b) COW happens for NA
    Check NA's bytenr against tree X's swapped_blocks, and get a hit. Then we do a subtree scan on both subtrees OA and NA, resulting in 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND). Then no matter what we do to subvolume tree X, qgroup numbers will still be correct. NA's record then gets removed from X's swapped_blocks.

4) Transaction commit
   Any record in X's swapped_blocks gets removed, since there is no modification to the swapped subtrees and thus no need to trigger a heavy qgroup subtree rescan for them.

This will introduce 128 bytes of overhead for each btrfs_root even if qgroup is not enabled. This is to reduce memory allocations and potential failures. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
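A sketch of the per-root bookkeeping this introduces (close to, but not necessarily identical to, the real definition):

  struct btrfs_qgroup_swapped_blocks {
      spinlock_t lock;
      /* cheap check at COW time: are any swapped subtree roots recorded? */
      bool swapped;
      /* records of swapped (reloc, subvolume) subtree roots, indexed by level */
      struct rb_root blocks[BTRFS_MAX_LEVEL];
  };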
-
Qu Wenruo authored
Refactor btrfs_qgroup_trace_subtree_swap() into qgroup_trace_subtree_swap(), which only needs two extent buffers and a few bool parameters to control the behavior. This provides the basis for later delayed subtree scan work. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-