- 06 Aug, 2018 40 commits
-
-
Nikolay Borisov authored
The purpose of the function is to free all the pages comprising an extent buffer. This can be achieved with a simple for loop rather than the slightly more involved 'do {} while' construct. So rewrite the loop using a 'for' construct. Additionally we can never have an extent_buffer that has 0 pages so remove the check for index == 0. No functional changes. The reversed order used to have a meaning in the past where the first page served as a blocking point for several callers. See eg 4f2de97a ("Btrfs: set page->private to the eb"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Commit eb14ab8e ("Btrfs: fix page->private races") fixed a genuine race between extent buffer initialisation and btree_releasepage. Unfortunately as the code has evolved the comments weren't changed which made them slightly wrong and they weren't very clear in the fist place. Fix this by (hopefully) rewording them in a more approachable manner. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
Current version of the page unlocking code was added in 727011e0 ("Btrfs: allow metadata blocks larger than the page size") but even in this commit that particular flag was never used per-se. In fact, btrfs only uses PageChecked for data pages to identify pages which have been dirtied but don't have ORDERED bit set. For more information see 247e743c ("Btrfs: Use async helpers to deal with pages that have been improperly dirtied"). However, this doesn't apply to extent buffer pages. The important bit here is that the pages are unlocked AFTER the extent buffer has been properly recorded in the radix tree to avoid races with btree_releasepage. Let's exploit this fact and simplify the page unlocking sequence by unlocking the pages in-order and removing the redundant PageChecked flag setting/clearing. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Remove the remaining code that misused the page cache pages during device replace and could cause data corruption for compressed nodatasum extents. Such files do not normally exist but there's a bug that allows this combination and the corruption was exposed by device replace fixup code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
There are many places that open code the duplicity factor of the block group profiles, create a common helper. This can be easily extended for more copies. Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
We have assigned the %fs_info->fs_devices in %fs_devices as its not modified just use it for the mutex_lock(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
The fs_info can be fetched from the transaction handle directly. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. In addition, remove the WARN_ON(trans == NULL) because it's not possible to hit this condition. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
They can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
It can be fetched from the transaction handle. Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Rename btrfs_parse_early_options() to btrfs_parse_device_options(). As btrfs_parse_early_options() parses the -o device options and scan the device provided. So this rename specifies its action. Also the function name is in line with btrfs_parse_subvol_options(). No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Lu Fengqi authored
Since commit 0b246afa ("btrfs: root->fs_info cleanup, add fs_info convenience variables"), the srcroot is no longer used to get fs_info::nodesize. In fact, it can be dropped after commit 707e8a07 ("btrfs: use nodesize everywhere, kill leafsize"). Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Qu Wenruo authored
Introduce a small helper, btrfs_mark_bg_unused(), to acquire locks and add a block group to unused_bgs list. No functional modification, and only 3 callers are involved. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Nikolay Borisov authored
do_chunk_alloc implements logic to detect whether there is currently pending chunk allocation (by means of space_info->chunk_alloc being set) and if so it loops around to the 'again' label. Additionally, based on the state of the space_info (e.g. whether it's full or not) and the return value of should_alloc_chunk() it decides whether this is a "hard" error (ENOSPC) or we can just return 0. This patch refactors all of this: 1. Put order to the scattered ifs handling the various cases in an easy-to-read if {} else if{} branches. This makes clear the various cases we are interested in handling. 2. Call should_alloc_chunk only once and use the result in the if/else if constructs. All of this is done under space_info->lock, so even before multiple calls of should_alloc_chunk were unnecessary. 3. Rewrite the "do {} while()" loop currently implemented via label into an explicit loop construct. 4. Move the mutex locking for the case where the caller is the one doing the allocation. For the case where the caller needs to wait a concurrent allocation, introduce a pair of mutex_lock/mutex_unlock to act as a barrier and reword the comment. 5. Switch local vars to bool type where pertinent. All in all this shouldn't introduce any functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Ethan Lien authored
In commit b150a4f1 ("Btrfs: use a percpu to keep track of possibly pinned bytes") we use total_bytes_pinned to track how many bytes we are going to free in this transaction. When we are close to ENOSPC, we check it and know if we can make the allocation by commit the current transaction. For every data/metadata extent we are going to free, we add total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and release it in unpin_extent_range() when we finish the transaction. So this is a variable we frequently update but rarely read - just the suitable use of percpu_counter. But in previous commit we update total_bytes_pinned by default 32 batch size, making every update essentially a spin lock protected update. Since every spin lock/unlock operation involves syncing a globally used variable and some kind of barrier in a SMP system, this is more expensive than using total_bytes_pinned as a simple atomic64_t. So fix this by using a customized batch size. Since we only read total_bytes_pinned when we are close to ENOSPC and fail to allocate new chunk, we can use a really large batch size and have nearly no penalty in most cases. [Test] We tested the patch on a 4-cores x86 machine: 1. fallocate a 16GiB size test file 2. take snapshot (so all following writes will be COW) 3. run a 180 sec, 4 jobs, 4K random write fio on test file We also added a temporary lockdep class on percpu_counter's spin lock used by total_bytes_pinned to track it by lock_stat. [Results] unpatched: lock_stat version 0.4 ----------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu: 82 82 0.21 0.61 29.46 0.36 298340 635973 0.09 11.01 173476.25 0.27 patched: lock_stat version 0.4 ----------------------------------------------------------------------- class name con-bounces contentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu: 1 1 0.62 0.62 0.62 0.62 13601 31542 0.14 9.61 11016.90 0.35 [Analysis] Since the spin lock only protects a single in-memory variable, the contentions (number of lock acquisitions that had to wait) in both unpatched and patched version are low. But when we see acquisitions and acq-bounces, we get much lower counts in patched version. Here the most important metric is acq-bounces. It means how many times the lock gets transferred between different cpus, so the patch can really reduce cacheline bouncing of spin lock (also the global counter of percpu_counter) in a SMP system. Fixes: b150a4f1 ("Btrfs: use a percpu to keep track of possibly pinned bytes") Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Ethan Lien authored
We use customized, nodesize batch value to update dirty_metadata_bytes. We should also use batch version of compare function or we will easily goto fast path and get false result from percpu_counter_compare(). Fixes: e2d84521 ("Btrfs: use percpu counter for dirty metadata count") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Ethan Lien <ethanlien@synology.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Gu Jinxiang authored
Return device pointer (with the IS_ERR semantics) from btrfs_scan_one_device so we don't have to return in through pointer. And since btrfs_fs_devices can be obtained from btrfs_device, return that. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ fixed conflics after recent changes to btrfs_scan_one_device ] Signed-off-by: David Sterba <dsterba@suse.com>
-
Gu Jinxiang authored
fs_devices is always passed to btrfs_scan_one_device which overrides it. In the call stack below fs_devices is passed to btrfs_scan_one_device from btrfs_mount_root. In btrfs_mount_root the output fs_devices of this call stack is not used. btrfs_mount_root btrfs_parse_early_options btrfs_scan_one_device So, it is not necessary to pass fs_devices from btrfs_mount_root, using a local variable in btrfs_parse_early_options is enough. Signed-off-by: Gu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Technically this extends the critical section covered by uuid_mutex to: - parse early mount options -- here we can call device scan on paths that can be passed as 'device=/dev/...' - scan the device passed to mount - open the devices related to the fs_devices -- this increases fs_devices::opened The race can happen when mount calls one of the scans and there's another one called eg. by mkfs or 'btrfs dev scan': Mount Scan ----- ---- scan_one_device (dev1, fsid1) scan_one_device (dev2, fsid1) add the device free stale devices fsid1 fs_devices::opened == 0 find fsid1:dev1 free fsid1:dev1 if it's the last one, free fs_devices of fsid1 too open_devices (dev1, fsid1) dev1 not found When fixed, the uuid mutex will make sure that mount will increase fs_devices::opened and this will not be touched by the racing scan ioctl. Reported-and-tested-by: syzbot+909a5177749d7990ffa4@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+ceb2606025ec1cc3479c@syzkaller.appspotmail.com Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
In preparation to take a big lock, move resource initialization before the critical section. It's not obvious from the diff, the desired order is: - initialize mount security options - allocate temporary fs_info - allocate superblock buffers Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Prepartory work to fix race between mount and device scan. btrfs_parse_early_options calls the device scan from mount and we'll need to let mount completely manage the critical section. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Prepartory work to fix race between mount and device scan. The callers will have to manage the critical section, eg. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
David Sterba authored
Prepartory work to fix race between mount and device scan. The callers will have to manage the critical section, eg. mount wants to scan and then call btrfs_open_devices without the ioctl scan walking in and modifying the fs devices in the meantime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
btrfs_free_stale_devices() finds a stale (not opened) device matching path in the fs_uuid list. We are already under uuid_mutex so when we check for each fs_devices, hold the device_list_mutex too. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
-
Anand Jain authored
Over the years we named %fs_devices and %devices to represent the struct btrfs_fs_devices and the struct btrfs_device. So follow the same scheme here too. No functional changes. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
-