- 20 Feb, 2013 40 commits
-
-
Zach Brown authored
Very large fallocate requests are cpu bound and result in extents with a repeating pattern of ever decreasing size: $ time fallocate -l 1T file real 0m13.039s ( an excerpt of the extents from btrfs-debug-tree: ) prealloc data disk byte 1536292564992 nr 397312 prealloc data disk byte 1536292962304 nr 196608 prealloc data disk byte 1536293158912 nr 98304 prealloc data disk byte 1536293257216 nr 49152 prealloc data disk byte 1536293306368 nr 24576 prealloc data disk byte 1536293330944 nr 12288 prealloc data disk byte 1536293343232 nr 8192 prealloc data disk byte 1536293351424 nr 4096 prealloc data disk byte 1536293355520 nr 4096 prealloc data disk byte 1536293359616 nr 4096 The excessive cpu use comes from __btrfs_prealloc_file_range() trying to allocate the entire remaining size after each extent is allocated. btrfs_reserve_extent() repeatedly cuts this requested size in half until it gets down to the size that the allocators can return. We limit the problem for now by capping each reservation at 256 meg. The small extents come from a masking bug when decreasing the requested reservation size. The high 32bits are cleared and the remaining low bits might happen to reserve a small size. Fix this by using round_down() which properly casts the mask. After these fixes huge fallocate requests are fast and result in nice large extents: $ time fallocate -l 1T file real 0m0.082s prealloc data disk byte 1112425889792 nr 268435456 prealloc data disk byte 1112694325248 nr 268435456 prealloc data disk byte 1112962760704 nr 268435456 Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
-
Thomas Gleixner authored
__btrfs_close_devices() clones btrfs device structs with memcpy(). Some of the fields in the clone are reinitialized, but it's missing to init io_lock. In mainline this goes unnoticed, but on RT it leaves the plist pointing to the original about to be freed lock struct. Initialize io_lock after cloning, so no references to the original struct are left. Reported-and-tested-by: Mike Galbraith <efault@gmx.de> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
-
Chris Mason authored
Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/ctree.h fs/btrfs/extent-tree.c fs/btrfs/inode.c fs/btrfs/volumes.c
-
Chris Mason authored
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9 Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/disk-io.c
-
Miao Xie authored
We forget to free qgroup reservation in commit_transaction(),fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Wang Shilong authored
The original code forget to check whether quota has been disabled firstly, and it will return 'EINVAL' and return error to users if quota has been disabled,it will be unfriendly and confusing for users to see that. So just return directly if quota has been disabled. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
Right now inode cache inode is treated as the same as space cache inode, ie. keep inode in memory till putting super. But this leads to an awkward situation. If we're going to delete a snapshot/subvolume, btrfs will not actually delete it and return free space, but will add it to dead roots list until the last inode on this snap/subvol being destroyed. Then we'll fetch deleted roots and cleanup them via cleaner thread. So here is the problem, if we enable inode cache option, each snap/subvol has a cached inode which is used to store inode allcation information. And this cache inode will be kept in memory, as the above said. So with inode cache, snap/subvol can only be added into dead roots list during freeing roots stage in umount, so that we can ONLY get space back after another remount(we cleanup dead roots on mount). But the real thing is we'll no more use the snap/subvol if we mark it deleted, so we can safely iput its cache inode when we delete snap/subvol. Another thing is that we need to change the rules of droping inode, we don't keep snap/subvol's cache inode in memory till end so that we can add snap/subvol into dead roots list in time. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
In some cases, we need commit the current transaction, but don't want to start a new one if there is no running transaction, so we introduce the function - btrfs_attach_transaction(), which can catch the current transaction, and return -ENOENT if there is no running transaction. But no running transaction doesn't mean the current transction completely, because we removed the running transaction before it completes. In some cases, it doesn't matter. But in some special cases, such as freeze fs, we hope the transaction is fully on disk, it will introduce some bugs, for example, we may feeze the fs and dump the data in the disk, if the transction doesn't complete, we would dump inconsistent data. So we need fix the above problem for those cases. We fixes this problem by introducing a function: btrfs_attach_transaction_barrier() if we hope all the transaction is fully on the disk, even they are not running, we can use this function. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
Now btrfs_commit_transaction() does this ret = btrfs_run_ordered_operations(root, 0) which async flushes all inodes on the ordered operations list, it introduced a deadlock that transaction-start task, transaction-commit task and the flush workers waited for each other. (See the following URL to get the detail http://marc.info/?l=linux-btrfs&m=136070705732646&w=2) As we know, if ->in_commit is set, it means someone is committing the current transaction, we should not try to join it if we are not JOIN or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid the above problem. In this way, there is another benefit: there is no new transaction handle to block the transaction which is on the way of commit, once we set ->in_commit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
In start_transactio(), we will try to join the transaction again after the current transaction is committed, so we should not release the reserved space of the qgroup. Fix it. Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Zach Brown authored
super.magic is an le64 but it's treated as an unterminated string when compared against BTRFS_MAGIC which is defined as a string. Instead define BTRFS_MAGIC as a normal hex value and use endian helpers to compare it to the super's magic. I tested this by mounting an fs made before the change and made sure that it didn't introduce sparse errors. This matches a similar cleanup that is pending in btrfs-progs. David Sterba pointed out that we should fix the kernel side as well :). Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
jeff.liu authored
With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
jeff.liu authored
Add a new ioctl(2) BTRFS_IOC_GET_FSLABLE, so that we can get the label upon a mounted filesystem. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Goffredo Baroncelli <kreijack@inwind.it> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Miao made the ordered operations stuff run async, which introduced a deadlock where we could get somebody (sync) racing in and committing the transaction while a commit was already happening. The new committer would try and flush ordered operations which would hang waiting for the commit to finish because it is done asynchronously and no longer inherits the callers trans handle. To fix this we need to make the ordered operations list a per transaction list. We can get new inodes added to the ordered operation list by truncating them and then having another process writing to them, so this makes it so that anybody trying to add an ordered operation _must_ start a transaction in order to add itself to the list, which will keep new inodes from getting added to the ordered operations list after we start committing. This should fix the deadlock and also keeps us from doing a lot more work than we need to during commit. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Dave pointed out that xfstests 273 will tell you that it failed to load the space cache for a block group when it remounts. This is because we run out of space writing out the block group cache. This is ok and is working as it should, but let's try to be a bit nicer. This happens because the block group was 100mb, but bitmap entries cover 128mb, so we were only getting extent entries for this block group, which ended up being too many to fit in the free space cache. So relax the bitmap size requirements to block groups that are at least half the size a bitmap will cover or larger, that way we can still keep the amount of space used in the free space cache low enough to be able to write it out. With this patch I no longer fail to write out the free space cache. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Ilya Dryomov authored
Enhance balance usage filter by making it possible to balance out only completely empty chunks. Today, usage filter properly acts on values from 1 to 99 inclusive, usage=100 selects all chunks, and usage=0 selects no chunks. This commit changes the usage=0 case: the new meaning is to restripe only completely empty chunks and nothing else. Suggested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Ilya Dryomov authored
Commit 5af3e8cc introduced a use-after-free at volumes.c:3139: bctl is freed above in __cancel_balance() in all cases except for balance pause. Fix this by moving the offending check a couple statements above, the meaning of the check is preserved. Reported-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Nobody uses these io tree ops anymore so just remove them and clean up the code a bit. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
David Sterba authored
The defrag operation can take very long, we want to have a way how to cancel it. The code checks for a pending signal at safe points in the defrag loops and returns EAGAIN. This means a user can press ^C after running 'btrfs fi defrag', woks for both defrag modes, files and root. Returning from the command was instant in my light tests, but may take longer depending on the aging factor of the filesystem. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
David Sterba authored
The warning in use_block_rsv is not useful for users and may fill the logs unnecessarily. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
This idea is from ext4. By this patch, we can make the dio write parallel, and improve the performance. But because we can not update isize without i_mutex, the unlocked dio write just can be done in front of the EOF. We needn't worry about the race between dio write and truncate, because the truncate need wait untill all the dio write end. And we also needn't worry about the race between dio write and punch hole, because we have extent lock to protect our operation. I ran fio to test the performance of this feature. == Hardware == CPU: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz Mem: 2GB SSD: Intel X25-M 120GB (Test Partition: 60GB) == config file == [global] ioengine=psync direct=1 bs=4k size=32G runtime=60 directory=/mnt/btrfs/ filename=testfile group_reporting thread [file1] numjobs=1 # 2 4 rw=randwrite == result (KBps) == write 1 2 4 lock 24936 24738 24726 nolock 24962 30866 32101 == result (iops) == write 1 2 4 lock 6234 6184 6181 nolock 6240 7716 8025 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
Currently, we can do unlocked dio reads, but the following race is possible: dio_read_task truncate_task ->btrfs_setattr() ->btrfs_direct_IO ->__blockdev_direct_IO ->btrfs_get_block ->btrfs_truncate() #alloc truncated blocks #to other inode ->submit_io() #INFORMATION LEAK In order to avoid this problem, we must serialize unlocked dio reads with truncate. There are two approaches: - use extent lock to protect the extent that we truncate - use inode_dio_wait() to make sure the truncating task will wait for the read DIO. If we use the 1st one, we will meet the endless truncation problem due to the nonlocked read DIO after we implement the nonlocked write DIO. It is because we still need invoke inode_dio_wait() avoid the race between write DIO and truncation. By that time, we have to introduce btrfs_inode_{block, resume}_nolock_dio() again. That is we have to implement this patch again, so I choose the 2nd way to fix the problem. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
The deadlock problem happened when running fsstress(a test program in LTP). Steps to reproduce: # mkfs.btrfs -b 100M <partition> # mount <partition> <mnt> # <Path>/fsstress -p 3 -n 10000000 -d <mnt> The reason is: btrfs_direct_IO() |->do_direct_IO() |->get_page() |->get_blocks() | |->btrfs_delalloc_resereve_space() | |->btrfs_add_ordered_extent() ------- Add a new ordered extent |->dio_send_cur_page(page0) -------------- We didn't submit bio here |->get_page() |->get_blocks() |->btrfs_delalloc_resereve_space() |->flush_space() |->btrfs_start_ordered_extent() |->wait_event() ---------- Wait the completion of the ordered extent that is mentioned above But because we didn't submit the bio that is mentioned above, the ordered extent can not complete, we would wait for its completion forever. There are two methods which can fix this deadlock problem: 1. submit the bio before we invoke get_blocks() 2. reserve the space before we do dio Though the 1st is the simplest way, we need modify the code of VFS, and it is likely to break contiguous requests, and introduce performance regression for the other filesystems. So we have to choose the 2nd way. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I noticed we were getting lots of warnings with xfstest 83 because we have reservations outstanding. This is because we moved the orphan add outside of the truncate, but we don't actually cleanup our reservation if something fails. This fixes the problem and I no longer see warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Sometimes xfstest 83 will fail to remount the scratch device because we've gotten ourselves so full that we cannot cleanup the orphan items. In this case check to see if we're doing the orphan cleanup and if we are allow us to steal our reservation from the global block rsv. With this patch I've not been able to reproduce the failed mount problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned to NULL during we created the snapshots, so we didn't free it though we called kfree() in the caller. But since we are sure the snapshot creation is done after the function - btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't assign the pointer "inherit" to NULL, and just free it in the caller of btrfs_ioctl_snap_create_transid(). In this way, the code can become more readable. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
open_ctree() need read the metadata to initialize the global information of btrfs. But it may fail after it submit some bio, and then it will jump to the error path. Unfortunately, it doesn't check if there are some bios in flight, and just stop all the worker threads. As a result, when the submitted bios end, they can not find any worker thread which can deal with subsequent work, then oops happen. kernel BUG at fs/btrfs/async-thread.c:605! Fix this problem by invoking invalidate_inode_pages2() before we stop the worker threads. This function will wait until the bio end because it need lock the pages which are going to be invalidated, and if a page is under disk read IO, it must be locked. invalidate_inode_pages2() need wait until end bio handler to unlocked it. Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Mark Fasheh authored
This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send ioctl code. When this flag is set, the btrfs send code will never write file data into the stream (thus also avoiding expensive reads of that data in the first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in question. This patch does not affect the operation of BTRFS_SEND_C_CLONE commands - they will continue to be sent when a search finds an appropriate extent to clone from. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
For write, we also reserve some space for COW blocks during updating the checksum tree, and we calculate the number of blocks by checking if the number of bytes outstanding that are going to need csums needs one more block for csum. When we add these checksum into the checksum tree, we use ordered sums list. Every ordered sum contains csums for each sector, and we'll first try to look up an existing csum item, a) if we don't yet have a proper csum item, then we need to insert one, b) or if we find one but the csum item is not big enough, then we need to extend it. The point is we'll unlock the whole path and then insert or extend. So others can hack in and update the tree. Each insert or extend needs update the tree with COW on, and we may need to insert/extend for many times. That means what we've reserved for updating checksum tree is NOT enough indeed. The case is even more serious with having several write threads at the same time, it can end up eating our reserved space quickly and starting eating globle reserve pool instead. I don't yet come up with a way to calculate the worse case for updating csum, but extending the checksum item as much as possible can be helpful in my test. The idea behind is that it can reduce the times we insert/extend so that it saves us precious reserved space. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
The entry point at the defrag ioctl always sets "cache only" to 0; the codepaths haven't run for a long time as far as I can tell. Chris says they're dead code, so remove them. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I hit a deadlock where transaction commit was waiting on num_writers to be 0. This happened because somebody came into btrfs_commit_transaction and noticed we had aborted and it went to cleanup_transaction. This shouldn't happen because cleanup_transaction is really to fixup a bad commit, it doesn't do the normal trans handle cleanup things. So if we have an error just do the normal btrfs_end_transaction dance and return. Once we are in the actual commit path we can use cleanup_transaction and be good to go. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I noticed we would deadlock if we aborted a transaction while doing compressed io. This is because we don't unlock our pages if something goes horribly wrong. To fix this we need to make sure that we call extent_clear_unlock_delalloc in order to unlock all the pages. If we have to cow in the async submission thread we need to make sure to unlock our locked_page as the cow error path will not unlock the locked page as it depends on the caller to unlock that page. With this patch we no longer deadlock on the page lock when we have an aborted transaction. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
People have been complaining about random ENOSPC errors that will clear up after a umount or just a given amount of time. Chris was able to reproduce this with stress.sh and lots of processes and so was I. Basically the overcommit stuff would really let us get out of hand, in my tests I saw up to 30 gigs of outstanding reservations with only 2 gigs total of metadata space. This usually worked out fine but with so much outstanding reservation the flushing stuff short circuits to make sure we don't hang forever flushing when we really need ENOSPC. Plus we allocate chunks in order to alleviate the pressure, but this doesn't actually help us since we only use the non-allocated area in our over commit logic. So instead of basing overcommit on the amount of non-allocated space, instead just do it based on how much total space we have, and then limit it to the non-allocated space in case we are short on space to spill over into. This allows us to have the same performance as well as no longer giving random ENOSPC. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Dave sent me a panic where we were doing the orphan cleanup and panic'ed trying to release our reservation from the orphan block rsv. The reason for this is because our orphan block rsv had been free'd out from underneath us because the transaction commit found that there were no orphan inodes according to its count and decided to free it. This is incorrect so make sure we inc the orphan inodes count so the accounting is all done properly. This would also cause the warning in the orphan commit code normally if you had any orphans to cleanup as they would only decrement the orphan count so you'd get a negative orphan count which could cause problems during runtime. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
When a transaction aborts or there's an EIO on an ordered extent or any error really we will not free up the space we reserved for this ordered extent. This results in warnings from the block group cache cleanup in the case of a transaction abort, or leaking space in the case of EIO on an ordered extent. Fix this up by free'ing the reserved space if we have an error at all trying to complete an ordered extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
When we abort we've been just free'ing up all the ordered extents and hoping for the best. This results in lots of warnings from various places, warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't fixed. It will also screw up lots of pages who have been set private but never get cleared because the ordered extents are never allowed to be submitted. This patch fixes those warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I hit this error when reproducing a bug that would end in a transaction abort. We take the delayed ref head's mutex to keep anybody from processing it while we're destroying it, but we fail to drop the mutex before we carry on and free the damned thing. Fix this by doing the remove logic for the head ourselves and unlock the mutex, that way we can avoid use after free's or hung tasks waiting on that mutex to come back so they know the delayed ref completed. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
WARN_ON isn't enough, we need to stop the loop if for any reason we would overrun the devices_info array. I tried to track down the connection between the length of the alloc_devices list and the rw_devices counter but it wasn't immediately obvious, so be defensive about it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
No point in DEFINE_WAIT(wait) if it's not used! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
"item" was set but never used in this function. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-