Commits · 46d8bc34248f3a94dea910137d1ddf5fb1e3a1cc · Kirill Smelkov / linux

01 Oct, 2012 35 commits

Btrfs: fix a bug in checking whether a inode is already in log · 46d8bc34

Liu Bo authored Aug 29, 2012

This is based on Josef's "Btrfs: turbo charge fsync".

The current btrfs checks if an inode is in log by comparing
root's last_log_commit to inode's last_sub_trans[1].

But the problem is that this root->last_log_commit is shared among
inodes.

Say we have N inodes to be logged: after the first inode is logged,
root's last_log_commit is updated and the remaining N-1 inodes will
be skipped.

This fixes the bug by keeping a local copy of root's last_log_commit
inside each inode; each inode maintains its own copy.

[1]: we regard each log transaction as a subset of btrfs's transaction,
i.e. sub_trans
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>

46d8bc34
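
The skipping behaviour described in this commit can be modelled in user space. The following is a minimal sketch, not the kernel code: the field names `last_log_commit` and `last_sub_trans` come from the commit message, while the classes, the `log_transid` field and the `fsync_all` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Root:
    log_transid: int = 1       # hypothetical: id of the current log sub-transaction
    last_log_commit: int = 0   # shared by every inode under this root


@dataclass
class Inode:
    last_sub_trans: int = 0    # sub-transaction that last modified this inode
    last_log_commit: int = 0   # per-inode copy, as introduced by the fix


def fsync_all(inodes, root, use_local_copy):
    """Return the indexes of the inodes that actually get logged."""
    logged = []
    for idx, inode in enumerate(inodes):
        if use_local_copy:
            in_log = inode.last_sub_trans <= inode.last_log_commit
        else:
            in_log = inode.last_sub_trans <= root.last_log_commit
        if in_log:
            continue  # treated as "already in the log"
        logged.append(idx)
        # Committing the log advances both the shared and the local marker.
        root.last_log_commit = root.log_transid
        inode.last_log_commit = inode.last_sub_trans
    return logged
```

With three inodes modified in the same sub-transaction, the shared check logs only the first one, while the per-inode check logs all three.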

Btrfs: fix wrong orphan count of the fs/file tree · 321f0e70

Miao Xie authored Aug 28, 2012

If we add a new orphan item, we should increase the atomic counter,
not decrease it. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>

321f0e70

Btrfs: improve fsync by filtering extents that we want · 4e2f84e6

Liu Bo authored Aug 27, 2012

This is based on Josef's "Btrfs: turbo charge fsync".

Josef's patch above performs very well in the random sync write test,
because we won't have too many extents to merge.

However, it does not perform well on this test:
dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync

The reason is that when we do sequential sync writes, we need to merge the
current extent with the previous one, so the extents to log keep
accumulating:

A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...

So we'll have to flush more and more checksums into the log tree, which is the
bottleneck according to my tests.

But we can avoid this by telling fsync which extents actually need
to be logged.

With this, I did the above dd sync write test (size=50m),

         w/o (orig)   w/ (josef's)   w/ (this)
SATA      104KB/s       109KB/s       121KB/s
ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>

4e2f84e6
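
To see why the accumulated extents hurt, here is a rough back-of-the-envelope model of the A, AA, AAA pattern. It is only a sketch under a stated assumption: that every sequential 4k sync write is merged into one ever-growing extent and that the checksums for the whole merged extent are logged on each fsync. The function names are hypothetical.

```python
BLOCK = 4096  # bs=4k from the dd command


def bytes_covered_when_merged(num_writes, block=BLOCK):
    """Each fsync logs checksums for the whole merged extent: 4k, 8k, 12k, ..."""
    total = 0
    extent_len = 0
    for _ in range(num_writes):
        extent_len += block   # the new write is merged with the previous extent
        total += extent_len   # checksums for the entire extent are logged again
    return total


def bytes_covered_when_filtered(num_writes, block=BLOCK):
    """Each fsync logs checksums only for the newly written block."""
    return num_writes * block
```

For the 12500 writes of the dd test, the merged case covers about 6250 times more data than the filtered case (the ratio is (n + 1) / 2), which is in line with the checksum logging being the bottleneck.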

Btrfs: do not needlessly restart the transaction for enospc · ca7e70f5

Josef Bacik authored Aug 27, 2012

We will stop and restart a transaction every time we move to a different leaf
when truncating a file. This is for enospc reasons, but really we could
probably get away with doing this a little better by actually working until we
hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do
truncates which will fail as soon as the block rsv runs out of space, and then
at that point we can stop and restart the transaction and refill the block rsv
and carry on. This will make rm'ing of a file with lots of extents a bit
faster. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

ca7e70f5
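
The difference between restarting at every leaf and restarting only when the reservation runs dry can be sketched as follows. This is a toy model, not the kernel implementation: `BlockRsv`, `NoSpace` and the "one unit per leaf" cost are assumptions made for illustration, and only the idea of a fail-fast reservation comes from the commit message.

```python
class NoSpace(Exception):
    """Stands in for ENOSPC in this model."""


class BlockRsv:
    def __init__(self, size):
        self.size = size
        self.reserved = size

    def use(self, amount):
        # Fail-fast behaviour: error out as soon as the reservation is empty.
        if self.reserved < amount:
            raise NoSpace()
        self.reserved -= amount

    def refill(self):
        self.reserved = self.size


def restarts_per_leaf(num_leaves):
    """Old behaviour: restart the transaction on every move to a new leaf."""
    return max(num_leaves - 1, 0)


def restarts_failfast(num_leaves, rsv):
    """New behaviour: keep going until the reservation runs out, then restart."""
    restarts = 0
    for _ in range(num_leaves):
        try:
            rsv.use(1)
        except NoSpace:
            restarts += 1   # stop and restart the transaction
            rsv.refill()    # refill the block rsv and carry on
            rsv.use(1)
    return restarts
```

In this model, truncating across 10 leaves with a reservation of 4 units needs 2 restarts instead of 9.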

Btrfs: cleanup extents after we finish logging inode · 06d3d22b

Liu Bo authored Aug 27, 2012

This is based on Josef's "Btrfs: turbo charge fsync".

We should clean up those extents after we've finished logging the inode,
otherwise we may do redundant work on them.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>

06d3d22b

Btrfs: only warn if we hit an error when doing the tree logging · 0fa83cdb

Josef Bacik authored Aug 24, 2012

I hit this a couple times while working on my fsync patch (all my bugs, not
normal operation), but with my new stuff we could have new errors from cases
I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
that we are notified there is a problem but the user doesn't lose their
data. We can easily commit the transaction in the case that the tree
logging fails and still be fine, so let's try and be as nice to the user as
possible. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

0fa83cdb

Btrfs: turbo charge fsync · 5dc562c5

Josef Bacik authored Aug 17, 2012

At least for the vm workload.  Currently on fsync we will

1) Truncate all items in the log tree for the given inode if they exist

and

2) Copy all items for a given inode into the log

The problem with this is that for things like VMs you can have lots of
extents from the fragmented writing behavior, and worse yet you may have
only modified a few extents, not the entire thing.  This patch fixes this
problem by tracking which transid modified our extent, and then when we do
the tree logging we find all of the extents we've modified in our current
transaction, sort them and commit them.  We also only truncate up to the
xattrs of the inode and copy that stuff in normally, and then just drop any
extents in the range we have that exist in the log already.  Here are some
numbers of a 50 meg fio job that does random writes and fsync()s after every
write

		Original	Patched
SATA drive	82KB/s		140KB/s
Fusion drive	431KB/s		2532KB/s

So around 2-6 times faster depending on your hardware.  There are a few
corner cases, for example if you truncate at all we have to do it the old
way since there is no way to be sure what is in the log is ok.  This
probably could be done smarter, but if you write-fsync-truncate-write-fsync
you deserve what you get.  All this work is in RAM of course so if your
inode gets evicted from cache and you read it in and fsync it we'll do it
the slow way if we are still in the same transaction that we last modified
the inode in.

The biggest cool part of this is that it requires no changes to the recovery
code, so if you fsync with this patch and crash and load an old kernel, it
will run the recovery and be a-ok.  I have tested this pretty thoroughly
with an fsync tester and everything comes back fine, as well as xfstests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

5dc562c5
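
The core idea, tracking which transid modified each extent and logging only those touched in the current transaction, can be sketched like this. The `Extent` type and `extents_to_log` helper are hypothetical names for illustration; the real code works on the kernel's in-memory extent structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    start: int
    length: int
    generation: int  # transid that last modified this extent


def extents_to_log(extents, current_transid):
    """Pick only the extents modified in the current transaction, sorted by offset."""
    modified = [e for e in extents if e.generation == current_transid]
    return sorted(modified, key=lambda e: e.start)
```

For a heavily fragmented VM image with thousands of extents, an fsync after a handful of writes then copies only those few extents into the log instead of every item for the inode.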

Btrfs: fix possible corruption when fsyncing written prealloced extents · 224ecce5

Josef Bacik authored Aug 16, 2012

While working on my fsync patch my fsync tester kept hitting mismatching
md5sums when I would randomly write to a prealloc'ed region, syncfs() and
then write to the prealloced region some more and then fsync() and then
immediately reboot. This is because the tree logging code will skip writing
csums for file extents whose generation is less than the current running
transaction. When we mark extents as written we haven't been updating their
generation so they were always being skipped. This wouldn't happen if you
were to preallocate and then write in the same transaction, but if you for
example prealloced a VM you could definitely run into this problem. This
patch makes my fsync tester happy again. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

224ecce5
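
A small model of the generation check may help here. It is a sketch only: `FileExtent`, `mark_extent_written` and `csums_logged` are hypothetical names, and the `update_generation` switch exists purely to contrast the behaviour before and after the fix.

```python
from dataclasses import dataclass


@dataclass
class FileExtent:
    generation: int        # transaction that last touched the extent item
    prealloc: bool = True


def mark_extent_written(extent, running_transid, update_generation=True):
    """Convert a preallocated extent to a written one."""
    extent.prealloc = False
    if update_generation:
        extent.generation = running_transid  # the fix: bump the generation


def csums_logged(extent, running_transid):
    """Tree logging skips csums for extents older than the running transaction."""
    return extent.generation >= running_transid
```

If the extent was preallocated in transaction 5 and written in transaction 6, its csums are logged only when the generation is bumped at write time.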

Btrfs: do not allocate chunks as agressively · 54338b5c

Josef Bacik authored Aug 14, 2012

Swinging this pendulum back the other way. We've been allocating chunks up
to 2% of the disk no matter how much we actually have allocated. So instead
fix this calculation to only allocate chunks if we have more than 80% of the
space available allocated. Please test this as it will likely cause all
sorts of ENOSPC problems to pop up suddenly. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

54338b5c
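
One possible reading of the new rule, written as a threshold check. This is an interpretation rather than the actual calculation: the commit message does not say exactly which counters are compared, so `bytes_used` and `bytes_allocated` are assumed names and the function is hypothetical.

```python
def should_alloc_chunk(bytes_used, bytes_allocated, threshold_pct=80):
    """Allocate a new chunk only once more than threshold_pct of the
    already-allocated space is in use (integer math, no division)."""
    return bytes_used * 100 > bytes_allocated * threshold_pct
```

Under this reading, 81 units used out of 100 allocated triggers an allocation, while exactly 80 out of 100 does not.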

Btrfs: update last trans if we don't update the inode · 7c735313

Josef Bacik authored Aug 13, 2012

There is a completely impossible situation to hit where you can preallocate
a file, fsync it, write into the preallocated region, have the transaction
commit twice and then fsync and then immediately lose power and lose all of
the contents of the write. This patch fixes this just so I feel better
about the situation and because it is lightweight, we just update the
last_trans when we finish an ordered IO and we don't update the inode
itself. This way we are completely safe and I feel better. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>

7c735313

Btrfs: fix gcc warnings for 32bit compiles · 995e01b7

Jan Schmidt authored Aug 13, 2012

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

995e01b7

Btrfs: fix btrfs send for inline items and compression · 74dd17fb

Chris Mason authored Aug 07, 2012

The btrfs send code was assuming the offset of the file item into the
extent translated to bytes on disk.  If we're compressed, this isn't
true, and so it was off into extents owned by other files.

It was also improperly handling inline extents.  This solves a crash
where we may have gone past the end of the file extent item by not
testing early enough for an inline extent.  It also solves problems
where we have a hole between the end of the inline item and the start
of the full extent.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>

74dd17fb

Btrfs: don't treat top/root directory inode as deleted/reused · 6d85ed05

Alexander Block authored Aug 01, 2012

We can't do the deleted/reused logic for top/root inodes as it would
create a stream that tries to delete and recreate the root dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

6d85ed05

Btrfs: ignore non-FS inodes for send/receive · 2981e225

Alexander Block authored Aug 01, 2012

We have to ignore inode/space cache objects in send/receive.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

2981e225

Btrfs: pass root instead of parent_root to iterate_inode_ref · 2f28f478

Alexander Block authored Aug 01, 2012

We need to pass the root that we determined earlier to iterate_inode_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

2f28f478

Btrfs: use <= instead of < in is_extent_unchanged · d8347fa4

Alexander Block authored Aug 01, 2012

Used the wrong compare operator here.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

d8347fa4

Btrfs: fix check for changed extent in is_extent_unchanged · 3954096d

Alexander Block authored Aug 01, 2012

The previous check was working fine, but this check should be
easier to read. Also, we could theoretically have some exotic
bugs with the previous checks.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

3954096d

Btrfs: free nce and nce_head on error in name_cache_insert · 5dc67d0b

Alexander Block authored Aug 01, 2012

Both were leaked in case of error.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

5dc67d0b

Btrfs: remove unused tmp_path from iterate_dir_item · 3e126f32

Alexander Block authored Aug 01, 2012

A leftover from older code and unused now.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

3e126f32

Btrfs: code cleanups for send/receive · e938c8ad

Alexander Block authored Jul 28, 2012

Doing some code cleanups as suggested by Arne.
Changes do not change any logic.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

e938c8ad

Btrfs: add/fix comments/documentation for send/receive · 766702ef

Alexander Block authored Jul 28, 2012

As the subject already said, add/fix comments.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

766702ef

Btrfs: update send_progress at correct places · e479d9bb

Alexander Block authored Jul 28, 2012

Updating send_progress in process_recorded_refs was not correct.
It got updated too early in the cur_inode_new_gen case.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

e479d9bb

Btrfs: make aux field of ulist 64 bit · 34d73f54

Alexander Block authored Jul 28, 2012

Btrfs send/receive uses the aux field to store inode numbers. On
32 bit machines this may become a problem.

Also fix all users of ulist_add and ulist_add_merged.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

34d73f54
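
Why a 32-bit aux field is a problem for inode numbers can be shown with a short model. It assumes the field was previously as wide as a machine word (32 bits on 32-bit machines); the helper names are hypothetical and the masking only simulates fixed-width storage.

```python
def store_aux_32(value):
    """Simulate storing a value in a 32-bit wide field."""
    return value & 0xFFFFFFFF


def store_aux_64(value):
    """Simulate storing a value in a 64-bit wide field."""
    return value & 0xFFFFFFFFFFFFFFFF


big_inum = (1 << 32) + 257  # an inode number that needs more than 32 bits
```

Stored in 32 bits, `big_inum` comes back as 257 and becomes indistinguishable from the real inode 257; stored in 64 bits it survives intact.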

Btrfs: fix use of radix_tree for name_cache in send/receive · 7e0926fe

Alexander Block authored Jul 28, 2012

We can't easily use the index of the radix tree for inums as the
radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels,
we now use the lower 32bit of the inum as index and an additional
list to store multiple entries per radix tree entry.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

7e0926fe
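
The scheme described here, indexing by the lower 32 bits and chaining the entries that share an index, can be sketched with a dictionary standing in for the radix tree. `NameCache` and its methods are hypothetical and far simpler than the real name cache.

```python
class NameCache:
    def __init__(self, index_bits=32):
        self.mask = (1 << index_bits) - 1
        self.slots = {}  # stands in for the radix tree: index -> list of entries

    def insert(self, inum, name):
        # The lower bits of the inum select the slot; colliding inums share a list.
        self.slots.setdefault(inum & self.mask, []).append((inum, name))

    def lookup(self, inum):
        # Walk the list and compare the full 64-bit inum.
        for stored_inum, name in self.slots.get(inum & self.mask, []):
            if stored_inum == inum:
                return name
        return None
```

Two inode numbers that differ only above bit 31 land in the same slot but are still told apart by the full comparison during lookup.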

Btrfs: fix memory leak for name_cache in send/receive · 17589bd9

Alexander Block authored Jul 28, 2012

When everything is done, name_cache_free is called which however
forgot to call kfree on the cache entries.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

17589bd9

Btrfs: don't break in the final loop of find_extent_clone · adbe7fb6

Alexander Block authored Jul 28, 2012

If we break, we may miss the clone from send_root which we prefer
over all other clones.

Commit is a result of Arne's review.
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

adbe7fb6

Btrfs: use normal return path for root == send_root case · 52f9e53e

Alexander Block authored Jul 28, 2012

Don't have a separate return path for the mentioned case. Now
we do the same "take lowest inode/offset" logic for all found clones.

Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

52f9e53e

Btrfs: use kmalloc instead of stack for backref_ctx · 35075bb0

Alexander Block authored Jul 28, 2012

Make sure to never get in trouble due to the backref_ctx
which was on the stack before.

Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

35075bb0

Btrfs: rename backref_ctx::found_in_send_root to found_itself · ee849c04

Alexander Block authored Jul 28, 2012

The new name should be easier to understand/read.

Commit is a result of Arne's review.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

ee849c04

Btrfs: remove unused use_list from send/receive code · d27aed5e

Alexander Block authored Jul 28, 2012

use_list is a leftover and unused.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

d27aed5e

Btrfs: add correct parent to check_dirs when dir got moved · ccf1626b

Alexander Block authored Jul 28, 2012

We only added the parent for the new position of a moved dir.
We also need to add the old parent of the moved dir.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

ccf1626b

Btrfs: remove unused code with #if 0 · 9ea3ef51

Alexander Block authored Jul 28, 2012

fs_path_remove is not used at the moment due to a previous patch.
Remove it for now (with #if 0) to avoid compile warnings.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

9ea3ef51

Btrfs: add missing check for dir != tmp_dir to is_first_ref · b9291aff

Alexander Block authored Jul 28, 2012

We missed that check, which resulted in all refs with the same name
being reported as first_ref.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

b9291aff

Btrfs: fix cur_ino < parent_ino case for send/receive · 1f4692da

Alexander Block authored Jul 28, 2012

When the current inode's inum is smaller than the inum of the
parent directory, strange things were happening due to wrong
path resolution and other bugs. Fix this with a new approach
for the problem.
Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Signed-off-by: Alexander Block <ablock84@googlemail.com>

1f4692da

Btrfs: add rdev to get_inode_info in send/receive · 85a7b33b

Alexander Block authored Jul 26, 2012

We need rdev in the next commit.
Signed-off-by: Alexander Block <ablock84@googlemail.com>

85a7b33b

30 Sep, 2012 2 commits

Linux 3.6 · a0d271cb

Linus Torvalds authored Sep 30, 2012

a0d271cb

vfs: dcache: fix deadlock in tree traversal · 8110e16d

Miklos Szeredi authored Sep 17, 2012

IBM reported a deadlock in select_parent().  This was found to be caused
by taking rename_lock when already locked when restarting the tree
traversal.

There are two cases when the traversal needs to be restarted:

 1) concurrent d_move(); this can only happen when not already locked,
    since taking rename_lock protects against concurrent d_move().

 2) racing with final d_put() on child just at the moment of ascending
    to parent; rename_lock doesn't protect against this rare race, so it
    can happen when already locked.

Because of case 2, we need to be able to handle restarting the traversal
when rename_lock is already held.  This patch fixes all three callers of
try_to_ascend().

IBM reported that the deadlock is gone with this patch.

[ I rewrote the patch to be smaller and just do the "goto again" if the
  lock was already held, but credit goes to Miklos for the real work.
   - Linus ]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8110e16d
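
The shape of the fix, restarting the walk without taking the lock a second time, can be illustrated with a non-reentrant lock. This is a loose user-space analogy, not the dcache code: a plain `threading.Lock` stands in for rename_lock, and the list of booleans stands in for the outcomes of successive try_to_ascend() attempts.

```python
import threading

rename_lock = threading.Lock()  # non-reentrant: acquiring it twice would deadlock


def traverse_with_restarts(ascend_results):
    """Walk until an ascend succeeds; return how many restarts were needed.

    ascend_results: list of booleans, False meaning the ascend raced and
    the traversal has to start again.
    """
    pending = list(ascend_results)
    locked = False
    restarts = 0
    try:
        while True:
            ok = pending.pop(0) if pending else True
            if ok:
                return restarts
            restarts += 1
            if not locked:
                rename_lock.acquire()  # first restart: take the lock
                locked = True
            # Otherwise the lock is already held, so just go again.
    finally:
        if locked:
            rename_lock.release()
```

Two failed ascends in a row (case 2 above happening while the lock is held) complete with two restarts and a single lock acquisition, where an unconditional second acquire would hang.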

29 Sep, 2012 2 commits

Merge tag 'iommu-fixes-v3.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 6a3e3dbe

Linus Torvalds authored Sep 29, 2012

Pull IOMMU fixes from Joerg Roedel:
 "Two small patches:

	* One patch to fix the function declarations for
	  !CONFIG_IOMMU_API. This is causing build errors
	  in linux-next and should be fixed for v3.6.

	* Another patch to fix an IOMMU group related NULL pointer
	  dereference."

* tag 'iommu-fixes-v3.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/amd: Fix wrong assumption in iommu-group specific code
  iommu: static inline iommu group stub functions

6a3e3dbe

Merge git://git.infradead.org/users/willy/linux-nvme · 21e98932

Linus Torvalds authored Sep 29, 2012

Pull NVMe driver fixes from Matthew Wilcox:
 "Now that actual hardware has been released (don't have any yet
  myself), people are starting to want some of these fixes merged."

Willy doesn't have hardware? Guys...

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Cancel outstanding IOs on queue deletion
  NVMe: Free admin queue memory on initialisation failure
  NVMe: Use ida for nvme device instance
  NVMe: Fix whitespace damage in nvme_init
  NVMe: handle allocation failure in nvme_map_user_pages()
  NVMe: Fix uninitialized iod compiler warning
  NVMe: Do not set IO queue depth beyond device max
  NVMe: Set block queue max sectors
  NVMe: use namespace id for nvme_get_features
  NVMe: replace nvme_ns with nvme_dev for user admin
  NVMe: Fix nvme module init when nvme_major is set
  NVMe: Set request queue logical block size

21e98932

28 Sep, 2012 1 commit

mtdchar: fix offset overflow detection · 9c603e53

Linus Torvalds authored Sep 08, 2012

Sasha Levin has been running trinity in a KVM tools guest, and was able
to trigger the BUG_ON() at arch/x86/mm/pat.c:279 (verifying the range of
the memory type).  The call trace showed that it was mtdchar_mmap() that
created an invalid remap_pfn_range().

The problem is that mtdchar_mmap() does various really odd and subtle
things with the vma page offset etc, and uses the wrong types (and the
wrong overflow detection) for it.

For example, the page offset may well be 32-bit on a 32-bit
architecture, but after shifting it up by PAGE_SHIFT, we need to use a
potentially 64-bit resource_size_t to correctly hold the full value.

Also, we need to check that the vma length plus offset doesn't overflow
before we check that it is smaller than the length of the mtdmap region.

This fixes things up and tries to make the code a bit easier to read.
Reported-and-tested-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Artem Bityutskiy <dedekind1@gmail.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9c603e53
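
Both problems named in this commit, the truncated shift and the missing overflow check, can be reproduced with fixed-width arithmetic simulated by masking. The sketch below is a model only: PAGE_SHIFT is assumed to be 12, and the function names are hypothetical rather than taken from mtdchar_mmap().

```python
PAGE_SHIFT = 12  # assumed 4k pages


def offset_in_32_bits(pgoff):
    """Shifting inside a 32-bit type silently drops the high bits."""
    return (pgoff << PAGE_SHIFT) & 0xFFFFFFFF


def mmap_range_ok(pgoff, vma_len, region_len, bits=64):
    """Check overflow first, then compare against the region length."""
    mask = (1 << bits) - 1
    off = pgoff << PAGE_SHIFT
    if off > mask:
        return False             # the byte offset itself does not fit
    end = (off + vma_len) & mask
    if end < off:
        return False             # offset + length wrapped around
    return end <= region_len
```

A page offset of 0x100000 becomes byte offset 0 when shifted in 32 bits, and a huge offset plus a small length wraps to a small end value that would pass a plain "end <= region length" comparison if the wrap were not checked first.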