- 18 Nov, 2019 40 commits
-
-
Johannes Thumshirn authored
Add xxhash64 to the list of possible checksumming algorithms used by BTRFS. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
To remove use of extent_map::bdev we need to find a replacement, and the latest_bdev is the only one we can use here, because inode::i_bdev and superblock::s_bdev are NULL. The DIO code uses bdev in two places: * to read blocksize to perform alignment checks in do_blockdev_direct_IO, but we do them in btrfs code before any call to DIO * in the following call chain: do_direct_IO get_more_blocks sdio->get_block() <-- this is btrfs_get_blocks_direct subsequently the map_bh->b_dev member is used in clean_bdev_aliases and dio_new_bio to set the bio's bdev to that of the buffer_head. However, because we have provided a submit function dio_bio_submit calls our submission function and ignores the bdev. So it's safe to pass any valid bdev that's used within the filesystem. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
This is a preparatory patch for removing extent_map::bdev. There's some history behind the code so this is only precaution to catch if things break before the actual removal happens. Logically, comparing a raw low-level block device (bdev) does not make sense for extent maps (high-level objects). This had no effect in practice but was quite confusing in the code. The lookup_map is set iff EXTENT_FLAG_FS_MAPPING is set. The two pointers were stored in the same bytes and used potentially in two meanings. Now they're split, so the asserts are in place to check that the condition will not change. The lookup map pointer misused bdev, this has been changed in commit 95617d69 ("btrfs: cleanup, stop casting for extent_map->lookup everywhere") to the explicit type. But the semantics hasn't changed and bdev was not actually used to decide if maps are mergeable. Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
Instead of checking if we've read a BTRFS_CHUNK_ITEM_KEY from disk and then process it we could just bail out early if the read disk key wasn't a BTRFS_CHUNK_ITEM_KEY. This removes a level of indentation and makes the code nicer to read. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
In btrfs_may_alloc_data_chunk() we're checking if the chunk type is of type BTRFS_BLOCK_GROUP_DATA and if it is we process it. Instead of checking if the chunk type is a BTRFS_BLOCK_GROUP_DATA chunk we can negate the check and bail out early if it isn't. This makes the code a bit more readable. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
In lock_stripe_add() we're caching the bucket for the stripe hash table just for a single call to dereference the stripe hash. If we just directly call rbio_bucket() we can safe the pointless local variable. Also move the dereferencing of the stripe hash outside of the variable declaration block to not break over the 80 characters limit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Johannes Thumshirn authored
In lock_stripe_add() we're traversing the stripe hash list and check if the current list element's raid_map equals is equal to the raid bio's raid_map. If both are equal we continue processing. If we'd check for inequality instead of equality we can reduce one level of indentation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We don't modify the data passed to tracepoints, some of the declarations are already const, add it to the rest. Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Remove typecasts from trace printk, adjust types and move typecast to the assignment if necessary. When assigning, the types are more obvious compared to matching the variables to the format strings. Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Instead of using an input pointer parameter as the return value and have an int as the return type of find_desired_extent, rework the function to directly return the found offset. Doing that the 'ret' variable in btrfs_llseek_file can be removed. Additional (subjective) benefit is that btrfs' llseek function now resemebles those of the other major filesystems. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Handle SEEK_END/SEEK_CUR in a single 'default' case by directly returning from generic_file_llseek. This makes the 'out' label redundant. Finally return directly the vale from vfs_setpos. No semantic changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Modifying the file position is done on a per-file basis. This renders holding the inode lock for writing useless and makes the performance of concurrent llseek's abysmal. Fix this by holding the inode for read. This provides protection against concurrent truncates and find_desired_extent already includes proper extent locking for the range which ensures proper locking against concurrent writes. SEEK_CUR and SEEK_END can be done lockessly. The former is synchronized by file::f_lock spinlock. SEEK_END is not synchronized but atomic, but that's OK since there is not guarantee that SEEK_END will always be at the end of the file in the face of tail modifications. This change brings ~82% performance improvement when doing a lot of parallel fseeks. The workload essentially does: for (d=0; d<num_seek_read; d++) { /* offset %= 16777216; */ fseek (f, 256 * d % 16777216, SEEK_SET); fread (buffer, 64, 1, f); } Without patch: num workprocesses = 16 num fseek/fread = 8000000 step = 256 fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 real 0m41.412s user 0m28.777s sys 2m16.510s With patch: num workprocesses = 16 num fseek/fread = 8000000 step = 256 fork 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 real 0m11.479s user 0m27.629s sys 0m21.040s Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We can infer the ops from the type that is now passed to all functions that would need it, this makes workspace_manager::ops redundant and can be removed. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Replace indirect calls to free_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for free_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Replace indirect calls to alloc_workspace by switch and calls to the specific callbacks. This is mainly to get rid of the indirection due to spectre vulnerability mitigations. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
We can infer the workspace_manager from type and the type will be used in the following patch to call a common helper for alloc_workspace. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Similar to get_workspace, majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. Trivial callback implementations use the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Majority of the callbacks is trivial, we don't gain anything by the indirection, so replace them by a switch function. ZLIB needs to adjust level in the callback and ZSTD workspace management is complex, the rest is call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The indirect calls will be replaced by a switch in compression.c. (Switch is faster than indirect calls with when Spectre mitigations are enabled). Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Replace loop calling to all algos with a list of direct calls to the cleanup manager callback. When that becomes trivial it is replaced by direct call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
With the access to the workspace structures, we can look it up together with the compression ops inside the workspace manager cleanup helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Replace loop calling to all algos with a list of direct calls to the init manager callback. When that becomes trivial it is replaced by direct call to the helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
With the access to the workspace structures, we can look it up together with the compression ops inside the workspace manager init helper. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There's a lot of indirection when the generic code calls into algo-specific callbacks to reach the private workspace manager structure and back to the generic code. To simplify that, export the workspace manager for heuristic, LZO and ZLIB, while ZSTD is going to use it's own manager. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The indirect calls bring some overhead due to spectre vulnerability mitigations. The number of cases is small and below the threshold (10-20) where indirect call would be better. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Export compress_pages, decompress_bio and decompress callbacks for all compression algos. The indirect calls will be replaced by a switch. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
When free'ing extents in a block group we check to see if the block group is not cached, and then cache it if we need to. However we'll just carry on as long as we're loading the cache. This is problematic because we are dirtying the block group here. If we are fast enough we could do a transaction commit and clear the free space cache while we're still loading the space cache in another thread. This truncates the free space inode, which will keep it from loading the space cache. Fix this by using the btrfs_block_group_cache_done helper so that we try to load the space cache unconditionally here, which will result in the caller waiting for the fast caching to complete and keep us from truncating the free space inode. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Josef Bacik authored
While testing 5.2 we ran into the following panic [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001 [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150 [52238.304051] Call Trace: [52238.308958] try_to_free_buffers+0x15b/0x1b0 [52238.317503] shrink_page_list+0x1164/0x1780 [52238.325877] shrink_inactive_list+0x18f/0x3b0 [52238.334596] shrink_node_memcg+0x23e/0x7d0 [52238.342790] ? do_shrink_slab+0x4f/0x290 [52238.350648] shrink_node+0xce/0x4a0 [52238.357628] balance_pgdat+0x2c7/0x510 [52238.365135] kswapd+0x216/0x3e0 [52238.371425] ? wait_woken+0x80/0x80 [52238.378412] ? balance_pgdat+0x510/0x510 [52238.386265] kthread+0x111/0x130 [52238.392727] ? kthread_create_on_node+0x60/0x60 [52238.401782] ret_from_fork+0x1f/0x30 The page we were trying to drop had a page->private, but had no page->mapping and so called drop_buffers, assuming that we had a buffer_head on the page, and then panic'ed trying to deref 1, which is our page->private for data pages. This is happening because we're truncating the free space cache while we're trying to load the free space cache. This isn't supposed to happen, and I'll fix that in a followup patch. However we still shouldn't allow those sort of mistakes to result in messing with pages that do not belong to us. So add the page->mapping check to verify that we still own this page after dropping and re-acquiring the page lock. This page being unlocked as: btrfs_readpage extent_read_full_page __extent_read_full_page __do_readpage if (!nr) unlock_page <-- nr can be 0 only if submit_extent_page returns an error CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add callchain ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
In the fixup worker, if we fail to mark the range as delalloc in the io tree, we must release the previously reserved metadata, as well as update the outstanding extents counter for the inode, otherwise we leak metadata space. In pratice we can't return an error from btrfs_set_extent_delalloc(), which is just a wrapper around __set_extent_bit(), as for most errors __set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as well) and returning an -EEXIST error doesn't happen in this case since the exclusive bits parameter always has a value of 0 through this code path. Nevertheless, just fix the error handling in the fixup worker, in case one day __set_extent_bit() can return an error to this code path. Fixes: f3038ee3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Filipe Manana authored
When doing a buffered write it's possible to leave the subv_writers counter of the root, used for synchronization between buffered nocow writers and snapshotting. This happens in an exceptional case like the following: 1) We fail to allocate data space for the write, since there's not enough available data space nor enough unallocated space for allocating a new data block group; 2) Because of that failure, we try to go to NOCOW mode, which succeeds and therefore we set the local variable 'only_release_metadata' to true and set the root's sub_writers counter to 1 through the call to btrfs_start_write_no_snapshotting() made by check_can_nocow(); 3) The call to btrfs_copy_from_user() returns zero, which is very unlikely to happen but not impossible; 4) No pages are copied because btrfs_copy_from_user() returned zero; 5) We call btrfs_end_write_no_snapshotting() which decrements the root's subv_writers counter to 0; 6) We don't set 'only_release_metadata' back to 'false' because we do it only if 'copied', the value returned by btrfs_copy_from_user(), is greater than zero; 7) On the next iteration of the while loop, which processes the same page range, we are now able to allocate data space for the write (we got enough data space released in the meanwhile); 8) After this if we fail at btrfs_delalloc_reserve_metadata(), because now there isn't enough free metadata space, or in some other place further below (prepare_pages(), lock_and_cleanup_extent_if_need(), btrfs_dirty_pages()), we break out of the while loop with 'only_release_metadata' having a value of 'true'; 9) Because 'only_release_metadata' is 'true' we end up decrementing the root's subv_writers counter to -1 (through a call to btrfs_end_write_no_snapshotting()), and we also end up not releasing the data space previously reserved through btrfs_check_data_free_space(). As a consequence the mechanism for synchronizing NOCOW buffered writes with snapshotting gets broken. Fix this by always setting 'only_release_metadata' to false at the start of each iteration. Fixes: 8257b2dc ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume") Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Marcos Paulo de Souza authored
Some functions are doing some unnecessary indirection to reach the btrfs_fs_info struct. Change these functions to receive a btrfs_fs_info struct instead of a *file. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
We don't need int argument bool shall do in free_root_pointers(). And rename the argument as it confused two people. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Chengguang Xu authored
The compression type upper limit constant is the same as the last value and this is confusing. In order to keep coding style consistent, use BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of 'NR' being one more than the last value. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Chengguang Xu authored
Use enum to replace macro definitions of extent types. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Chengguang Xu authored
DEFINE_HASHTABLE itself has already included initialization code, we don't have to call hash_init() again, so remove it. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
This function is used only during the final phase of freespace cache writeout. This is necessary since using the plain btrfs_join_transaction api is deadlock prone. The deadlock looks like: T1: btrfs_commit_transaction commit_cowonly_roots btrfs_write_dirty_block_groups btrfs_wait_cache_io __btrfs_wait_cache_io btrfs_wait_ordered_range <-- Triggers ordered IO for freespace inode and blocks transaction commit until freespace cache writeout T2: <-- after T1 has triggered the writeout finish_ordered_fn btrfs_finish_ordered_io btrfs_join_transaction <--- this would block waiting for current transaction to commit, but since trans commit is waiting for this writeout to finish The special purpose functions prevents it by simply skipping the "wait for writeout" since it's guaranteed the transaction won't proceed until we are done. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Using an ASSERT in btrfs_pin_extent allows to more stringently observe whether the function is called under a transaction or not. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
The helper is trivial and we can understand what the atomic_inc on something named refs does. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Tejun Heo authored
During a cyclic writeback, extent_write_cache_pages() uses done_index to update the writeback_index after the current run is over. However, instead of current index + 1, it gets to to the current index itself. Unfortunately, this, combined with returning on EOF instead of looping back, can lead to the following pathlogical behavior. 1. There is a single file which has accumulated enough dirty pages to trigger balance_dirty_pages() and the writer appending to the file with a series of short writes. 2. balance_dirty_pages kicks in, wakes up background writeback and sleeps. 3. Writeback kicks in and the cursor is on the last page of the dirty file. Writeback is started or skipped if already in progress. As it's EOF, extent_write_cache_pages() returns and the cursor is set to done_index which is pointing to the last page. 4. Writeback is done. Nothing happens till balance_dirty_pages finishes, at which point we go back to #1. This can almost completely stall out writing back of the file and keep the system over dirty threshold for a long time which can mess up the whole system. We encountered this issue in production with a package handling application which can reliably reproduce the issue when running under tight memory limits. Reading the comment in the error handling section, this seems to be to avoid accidentally skipping a page in case the write attempt on the page doesn't succeed. However, this concern seems bogus. On each page, the code either: * Skips and moves onto the next page. * Fails issue and sets done_index to index + 1. * Successfully issues and continue to the next page if budget allows and not EOF. IOW, as long as it's not EOF and there's budget, the code never retries writing back the same page. Only when a page happens to be the last page of a particular run, we end up retrying the page, which can't possibly guarantee anything data integrity related. Besides, cyclic writes are only used for non-syncing writebacks meaning that there's no data integrity implication to begin with. Fix it by always setting done_index past the current page being processed. Note that this problem exists in other writepages too. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-