- 20 Feb, 2013 40 commits
-
-
Miao Xie authored
This idea is from ext4. By this patch, we can make the dio write parallel, and improve the performance. But because we can not update isize without i_mutex, the unlocked dio write just can be done in front of the EOF. We needn't worry about the race between dio write and truncate, because the truncate need wait untill all the dio write end. And we also needn't worry about the race between dio write and punch hole, because we have extent lock to protect our operation. I ran fio to test the performance of this feature. == Hardware == CPU: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz Mem: 2GB SSD: Intel X25-M 120GB (Test Partition: 60GB) == config file == [global] ioengine=psync direct=1 bs=4k size=32G runtime=60 directory=/mnt/btrfs/ filename=testfile group_reporting thread [file1] numjobs=1 # 2 4 rw=randwrite == result (KBps) == write 1 2 4 lock 24936 24738 24726 nolock 24962 30866 32101 == result (iops) == write 1 2 4 lock 6234 6184 6181 nolock 6240 7716 8025 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
Currently, we can do unlocked dio reads, but the following race is possible: dio_read_task truncate_task ->btrfs_setattr() ->btrfs_direct_IO ->__blockdev_direct_IO ->btrfs_get_block ->btrfs_truncate() #alloc truncated blocks #to other inode ->submit_io() #INFORMATION LEAK In order to avoid this problem, we must serialize unlocked dio reads with truncate. There are two approaches: - use extent lock to protect the extent that we truncate - use inode_dio_wait() to make sure the truncating task will wait for the read DIO. If we use the 1st one, we will meet the endless truncation problem due to the nonlocked read DIO after we implement the nonlocked write DIO. It is because we still need invoke inode_dio_wait() avoid the race between write DIO and truncation. By that time, we have to introduce btrfs_inode_{block, resume}_nolock_dio() again. That is we have to implement this patch again, so I choose the 2nd way to fix the problem. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
The deadlock problem happened when running fsstress(a test program in LTP). Steps to reproduce: # mkfs.btrfs -b 100M <partition> # mount <partition> <mnt> # <Path>/fsstress -p 3 -n 10000000 -d <mnt> The reason is: btrfs_direct_IO() |->do_direct_IO() |->get_page() |->get_blocks() | |->btrfs_delalloc_resereve_space() | |->btrfs_add_ordered_extent() ------- Add a new ordered extent |->dio_send_cur_page(page0) -------------- We didn't submit bio here |->get_page() |->get_blocks() |->btrfs_delalloc_resereve_space() |->flush_space() |->btrfs_start_ordered_extent() |->wait_event() ---------- Wait the completion of the ordered extent that is mentioned above But because we didn't submit the bio that is mentioned above, the ordered extent can not complete, we would wait for its completion forever. There are two methods which can fix this deadlock problem: 1. submit the bio before we invoke get_blocks() 2. reserve the space before we do dio Though the 1st is the simplest way, we need modify the code of VFS, and it is likely to break contiguous requests, and introduce performance regression for the other filesystems. So we have to choose the 2nd way. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I noticed we were getting lots of warnings with xfstest 83 because we have reservations outstanding. This is because we moved the orphan add outside of the truncate, but we don't actually cleanup our reservation if something fails. This fixes the problem and I no longer see warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Sometimes xfstest 83 will fail to remount the scratch device because we've gotten ourselves so full that we cannot cleanup the orphan items. In this case check to see if we're doing the orphan cleanup and if we are allow us to steal our reservation from the global block rsv. With this patch I've not been able to reproduce the failed mount problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned to NULL during we created the snapshots, so we didn't free it though we called kfree() in the caller. But since we are sure the snapshot creation is done after the function - btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't assign the pointer "inherit" to NULL, and just free it in the caller of btrfs_ioctl_snap_create_transid(). In this way, the code can become more readable. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
open_ctree() need read the metadata to initialize the global information of btrfs. But it may fail after it submit some bio, and then it will jump to the error path. Unfortunately, it doesn't check if there are some bios in flight, and just stop all the worker threads. As a result, when the submitted bios end, they can not find any worker thread which can deal with subsequent work, then oops happen. kernel BUG at fs/btrfs/async-thread.c:605! Fix this problem by invoking invalidate_inode_pages2() before we stop the worker threads. This function will wait until the bio end because it need lock the pages which are going to be invalidated, and if a page is under disk read IO, it must be locked. invalidate_inode_pages2() need wait until end bio handler to unlocked it. Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Mark Fasheh authored
This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send ioctl code. When this flag is set, the btrfs send code will never write file data into the stream (thus also avoiding expensive reads of that data in the first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in question. This patch does not affect the operation of BTRFS_SEND_C_CLONE commands - they will continue to be sent when a search finds an appropriate extent to clone from. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Liu Bo authored
For write, we also reserve some space for COW blocks during updating the checksum tree, and we calculate the number of blocks by checking if the number of bytes outstanding that are going to need csums needs one more block for csum. When we add these checksum into the checksum tree, we use ordered sums list. Every ordered sum contains csums for each sector, and we'll first try to look up an existing csum item, a) if we don't yet have a proper csum item, then we need to insert one, b) or if we find one but the csum item is not big enough, then we need to extend it. The point is we'll unlock the whole path and then insert or extend. So others can hack in and update the tree. Each insert or extend needs update the tree with COW on, and we may need to insert/extend for many times. That means what we've reserved for updating checksum tree is NOT enough indeed. The case is even more serious with having several write threads at the same time, it can end up eating our reserved space quickly and starting eating globle reserve pool instead. I don't yet come up with a way to calculate the worse case for updating csum, but extending the checksum item as much as possible can be helpful in my test. The idea behind is that it can reduce the times we insert/extend so that it saves us precious reserved space. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
The entry point at the defrag ioctl always sets "cache only" to 0; the codepaths haven't run for a long time as far as I can tell. Chris says they're dead code, so remove them. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I hit a deadlock where transaction commit was waiting on num_writers to be 0. This happened because somebody came into btrfs_commit_transaction and noticed we had aborted and it went to cleanup_transaction. This shouldn't happen because cleanup_transaction is really to fixup a bad commit, it doesn't do the normal trans handle cleanup things. So if we have an error just do the normal btrfs_end_transaction dance and return. Once we are in the actual commit path we can use cleanup_transaction and be good to go. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I noticed we would deadlock if we aborted a transaction while doing compressed io. This is because we don't unlock our pages if something goes horribly wrong. To fix this we need to make sure that we call extent_clear_unlock_delalloc in order to unlock all the pages. If we have to cow in the async submission thread we need to make sure to unlock our locked_page as the cow error path will not unlock the locked page as it depends on the caller to unlock that page. With this patch we no longer deadlock on the page lock when we have an aborted transaction. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
People have been complaining about random ENOSPC errors that will clear up after a umount or just a given amount of time. Chris was able to reproduce this with stress.sh and lots of processes and so was I. Basically the overcommit stuff would really let us get out of hand, in my tests I saw up to 30 gigs of outstanding reservations with only 2 gigs total of metadata space. This usually worked out fine but with so much outstanding reservation the flushing stuff short circuits to make sure we don't hang forever flushing when we really need ENOSPC. Plus we allocate chunks in order to alleviate the pressure, but this doesn't actually help us since we only use the non-allocated area in our over commit logic. So instead of basing overcommit on the amount of non-allocated space, instead just do it based on how much total space we have, and then limit it to the non-allocated space in case we are short on space to spill over into. This allows us to have the same performance as well as no longer giving random ENOSPC. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Dave sent me a panic where we were doing the orphan cleanup and panic'ed trying to release our reservation from the orphan block rsv. The reason for this is because our orphan block rsv had been free'd out from underneath us because the transaction commit found that there were no orphan inodes according to its count and decided to free it. This is incorrect so make sure we inc the orphan inodes count so the accounting is all done properly. This would also cause the warning in the orphan commit code normally if you had any orphans to cleanup as they would only decrement the orphan count so you'd get a negative orphan count which could cause problems during runtime. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
When a transaction aborts or there's an EIO on an ordered extent or any error really we will not free up the space we reserved for this ordered extent. This results in warnings from the block group cache cleanup in the case of a transaction abort, or leaking space in the case of EIO on an ordered extent. Fix this up by free'ing the reserved space if we have an error at all trying to complete an ordered extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
When we abort we've been just free'ing up all the ordered extents and hoping for the best. This results in lots of warnings from various places, warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't fixed. It will also screw up lots of pages who have been set private but never get cleared because the ordered extents are never allowed to be submitted. This patch fixes those warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I hit this error when reproducing a bug that would end in a transaction abort. We take the delayed ref head's mutex to keep anybody from processing it while we're destroying it, but we fail to drop the mutex before we carry on and free the damned thing. Fix this by doing the remove logic for the head ourselves and unlock the mutex, that way we can avoid use after free's or hung tasks waiting on that mutex to come back so they know the delayed ref completed. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
WARN_ON isn't enough, we need to stop the loop if for any reason we would overrun the devices_info array. I tried to track down the connection between the length of the alloc_devices list and the rw_devices counter but it wasn't immediately obvious, so be defensive about it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
No point in DEFINE_WAIT(wait) if it's not used! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
"item" was set but never used in this function. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
__btrfs_std_error didn't always properly call va_end, and might call va_start even if fmt was NULL. Move all the varargs handling into the block where we have fmt. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
I don't think that BTRFS_DEV_EXTENT_KEY is supposed to fall through to BTRFS_DEV_STATS_KEY ... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
This keeps static checkers happy. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
At least backref_tree_panic() can apparently pass in a null fs_info, so handle that in __btrfs_panic to get the message out on the console. The btrfs_panic macro also uses fs_info, but that's largely pointless; it's testing to see if BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set. But if it *were* set, __btrfs_panic() would have, well, paniced and we wouldn't be here, testing it! So just BUG() at this point. And since we only use fs_info once now, just use it directly. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
No need to test the result, we can't get a null pointer from list_entry() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Eric Sandeen authored
All we do is set it to NULL and test it :) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
Because of how little we allocate chunks now we can get really tight on metadata space before we will allocate a new chunk. This resulted in being unable to add device extents when allocating a new metadata chunk as we did not have enough space. This is because we were allowed to overcommit too much metadata without actually making sure we had enough space to make allocations. The idea behind overcommit is that we are allowed to say "sure you can have that reservation" when most of the free space is occupied by reservations, not actual allocations. But in this case where a majority of the total space is in use by actual allocations we can screw ourselves by not being able to make real allocations when it matters. So make sure we have enough real space for our global reserve, and if not then don't allow overcommitting. Thanks, Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
I got a double free error when unmounting a file system that failed to add a chunk during its operation. This is because we will kfree the mapping that we created but leave the extent_map in the em_tree for chunks. So to fix this just remove the extent_map when we error out so we don't run into this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
If we error out allocating a dev extent we will have already created the block group and such which will cause problems since the allocator may have tried to allocate out of the block group that no longer exists. This will cause BUG_ON()'s in the bio submission path. This also makes a failure to allocate a dev extent a non-abort error, we will just clean up the dev extents we did allocate and exit. Now if we fail to delete the dev extents we will abort since we can't have half of the dev extents hanging around, but this will make us much less likely to abort. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
There is no lock to protect fs_info->fs_state, it will introduce some problems, such as the value may be covered by the other task when several tasks modify it. For example: Task0 - CPU0 Task1 - CPU1 mov %fs_state rax or $0x1 rax mov %fs_state rax or $0x2 rax mov rax %fs_state mov rax %fs_state The expected value is 3, but in fact, it is 2. Though this problem doesn't happen now (because there is only one flag currently), the code is error prone, if we add other flags, the above problem will happen to a certainty. Now we use bit operation for it to fix the above problem. In this way, we can make the code more robust and be easy to add new flags. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
There is no lock to protect fs_info->avail_{data, metadata, system}_alloc_bits, it may introduce some problem, such as the wrong profile information, so we add a seqlock to protect them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
We need not use a global lock to protect the delalloc_bytes of the inode, just use its own lock. In this way, we can reduce the lock contention and ->delalloc_lock will just protect delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
fs_info->delalloc_bytes is accessed very frequently, so use percpu counter instead of the u64 variant for it to reduce the lock contention. This patch also fixed the problem that we access the variant without the lock protection.At worst, we would not flush the delalloc inodes, and just return ENOSPC error when we still have some free space in the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
->dirty_metadata_bytes is accessed very frequently, so use percpu counter instead of the u64 variant to reduce the contention of the lock. This patch also fixed the problem that we access it without lock protection in __btrfs_btree_balance_dirty(), which may cause we skip the dirty pages flush. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
fs_info->alloc_start is a 64bits variant, can be accessed by multi-task, but it is not protected strictly, it can be changed while we are accessing it. On 32bit machine, we will get wrong value because we access it by two instructions.(In fact, it is also possible that the same problem happens on the 64bit machine, because the compiler may split the 64bit operation into two 32bit operation.) For example: Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning, then we remount and set ->alloc_start to 0x0000 0100 0000 0000. Task0 Task1 load high 32 bits set high 32 bits set low 32 bits load low 32 bits Task1 will get 0. This patch fixes this problem by using two locks to protect it fs_info->chunk_mutex sb->s_umount On the read side, we just need get one of these two locks, and on the write side, we must lock all of them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Miao Xie authored
Though ->max_inline is a 64bit variant, and may be accessed by multi-task, but it is just suggestive number, so we needn't add anything to protect fs_info->max_inline, just add a comment to explain wny we don't use a lock to protect it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Filipe Brandenburger authored
The header file will then be installed under /usr/include/linux so that userspace applications can refer to Btrfs ioctls by name and use the same structs used internally in the kernel. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Kusanagi Kouichi authored
CAP_DAC_READ_SEARCH overrides read and search permission check on file and directory. It seems fit for BTRFS_IOC_INO_PATHS. Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-
Josef Bacik authored
This reverts commit 2794ed01. Wasn't supposed to get used in btrfs_mknod, it was supposed to be in btrfs_create, which was done in commit 9185aa58. Signed-off-by: Josef Bacik <jbacik@fusionio.com>
-