Commits · d05a2b4cd97071462e77e6a7a8f109c36307182a · Kirill Smelkov / linux

28 Oct, 2014 1 commit

Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items · d05a2b4c

Filipe Manana authored Oct 27, 2014

We have a race that can lead us to miss skinny extent items in the function
btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
So basically the sequence of steps is:

1) We search in the extent tree for the skinny extent, which returns > 0
(not found);

2) We check the previous item in the returned leaf for a non-skinny extent,
and we don't find it;

3) Because we didn't find the non-skinny extent in step 2), we release our
path to search the extent tree again, but this time for a non-skinny
extent key;

4) Right after we released our path in step 3), a skinny extent was inserted
in the extent tree (delayed refs were run) - our second extent tree search
will miss it, because it's not looking for a skinny extent;

5) After the second search returned (with ret > 0), we look for any delayed
ref for our extent's bytenr (and we do it while holding a read lock on the
leaf), but we won't find any, as such delayed ref had just run and completed
after we released our path in step 3) before doing the second search.

Fix this by completely removing the path release and re-search logic. This is
safe, because if we search for a metadata item and we don't find it, we have the
guarantee that the returned leaf is the one where the item would be inserted,
and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
non-skinny extent item is if it exists. The only case where path->slots[0] is
zero is when there are no smaller keys in the tree (i.e. no left siblings for
our leaf), in which case the re-search logic isn't needed either.

This race has been present since the introduction of skinny metadata (change
3173a18f).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

d05a2b4c
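
The lookup order described in the commit above can be sketched in userspace C.
This is only a single-leaf model, not the kernel code: `struct key`,
`search_leaf()` and `lookup_extent()` are hypothetical names, and the key type
values 168 and 169 are assumed to match BTRFS_EXTENT_ITEM_KEY and
BTRFS_METADATA_ITEM_KEY. Because the non-skinny key sorts just before the
skinny one for the same bytenr, a failed search for the skinny key leaves the
non-skinny item, if present, in the previous slot, so no second search is needed.

```c
#include <stddef.h>
#include <stdint.h>

#define EXTENT_ITEM_KEY   168 /* non-skinny extent item (assumed value) */
#define METADATA_ITEM_KEY 169 /* skinny metadata item (assumed value) */

struct key {
	uint64_t objectid; /* extent bytenr */
	uint8_t  type;
	uint64_t offset;   /* tree level (skinny) or extent length (non-skinny) */
};

static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Binary search in one sorted leaf. Returns 0 with *slot set to the match,
 * or 1 with *slot set to the position where the key would be inserted.
 */
static int search_leaf(const struct key *items, size_t nritems,
		       const struct key *want, size_t *slot)
{
	size_t lo = 0, hi = nritems;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = key_cmp(&items[mid], want);

		if (c == 0) {
			*slot = mid;
			return 0;
		}
		if (c < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	*slot = lo;
	return 1;
}

/* Returns the slot of the extent item for bytenr, or -1 if there is none. */
static long lookup_extent(const struct key *items, size_t nritems,
			  uint64_t bytenr, uint64_t level, uint64_t nodesize)
{
	struct key want = { bytenr, METADATA_ITEM_KEY, level };
	size_t slot;

	if (search_leaf(items, nritems, &want, &slot) == 0)
		return (long)slot; /* skinny item found */

	/* Single search only: the non-skinny item can only be at slot - 1. */
	if (slot > 0) {
		const struct key *prev = &items[slot - 1];

		if (prev->objectid == bytenr &&
		    prev->type == EXTENT_ITEM_KEY &&
		    prev->offset == nodesize)
			return (long)(slot - 1);
	}
	return -1;
}
```

Since the path is never released between the search and the check of the
previous slot, the model has no window in which an item could be inserted
unnoticed.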

27 Oct, 2014 3 commits

Btrfs: properly clean up btrfs_end_io_wq_cache · 5ed5f588

Josef Bacik authored Oct 15, 2014

In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
unload, which makes us unable to unload and then re-load the btrfs module.  This
fixes the problem.  Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

5ed5f588

Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() · 1a4ed8fd

Filipe Manana authored Oct 27, 2014

If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.

Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by completely removing the logic that looks up an
equivalent skinny metadata item, since it cannot exist.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

1a4ed8fd

btrfs: use macro accessors in superblock validation checks · 21e7626b

David Sterba authored Oct 27, 2014

The initial patch c926093e (btrfs: add more superblock checks)
did not properly use the macro accessors that wrap endianness and the
code would not work correctly on big endian machines.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

21e7626b
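
To illustrate why accessors matter here: on-disk btrfs fields are
little-endian, so reading them through a host-order cast only works on
little-endian machines. The sketch below is a userspace model, not the kernel
macros; `get_le64()` and `magic_is_valid()` are hypothetical helpers, and the
constant is the btrfs superblock magic ("_BHRfS_M").

```c
#include <stdint.h>

/* Decode a little-endian 64-bit field from raw on-disk bytes, byte by byte,
 * so the result is the same on little- and big-endian hosts. */
static uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Hypothetical validation step in the spirit of a superblock check. */
static int magic_is_valid(const uint8_t *raw_magic)
{
	return get_le64(raw_magic) == 0x4D5F53665248425FULL; /* "_BHRfS_M" */
}
```

A check written as `*(const uint64_t *)raw_magic == ...` would give the same
answer on x86 but would compare byte-swapped values on a big-endian host.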

17 Oct, 2014 1 commit

Revert "Btrfs: race free update of commit root for ro snapshots" · d3797308

Chris Mason authored Oct 15, 2014

This reverts commit 9c3b306e.

Switching only one commit root during a transaction is wrong because it
leads the fs into an inconsistent state. All commit roots should be
switched at once, at transaction commit time, otherwise backref walking
can often miss important references that were only accessible through
the old commit root.  Plus, the root item for the snapshot's root wasn't
getting updated, which prevented the next transaction commit from doing it.

This made several users get into random corruption issues after creation
of readonly snapshots.

A regression test for xfstests will follow soon.

Cc: stable@vger.kernel.org # 3.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

d3797308

08 Oct, 2014 1 commit

btrfs: Fix compile error when CONFIG_SECURITY is not set. · a43bb39b

Qu Wenruo authored Oct 08, 2014

Fix the following compile error when CONFIG_SECURITY is not set:

error: 'struct security_mnt_opts' has no member named 'num_mnt_opts'
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

a43bb39b

07 Oct, 2014 1 commit

Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off · 0d4cf4e6

Chris Mason authored Oct 07, 2014

Commit fccb84c9 added some helpers to clean up our sanity tests,
but it looks like both Dave and I always compile with the tests enabled.

This fixes things to work when they are turned off too.
Signed-off-by: Chris Mason <clm@fb.com>

0d4cf4e6

06 Oct, 2014 1 commit

btrfs: Make btrfs handle security mount options internally to avoid losing security label. · f667aef6

Qu Wenruo authored Sep 23, 2014

[BUG]
Originally, when btrfs is mounted with the "-o subvol=" mount option, it
loses all security labels.
If the btrfs fs is then mounted somewhere else, SELinux refuses the mount
because of the lost security labels: the same super block is being mounted
with different security labels.

[REPRODUCER]
With SELinux enabled:
 #mkfs -t btrfs /dev/sda5
 #mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
 #btrfs subvolume create /mnt/btrfs/subvol
 #mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/test

kernel message:
SELinux: mount invalid.  Same superblock, different security settings
for (dev sda5, type btrfs)

[REASON]
This happens because btrfs calls vfs_kern_mount() and then
mount_subtree() to handle subvolume name lookup.
The first mount strips all the security labels, so by the time of
the second vfs_kern_mount() no security label is left.

[FIX]
This patch makes btrfs behave much more like nfs,
which has the type flag FS_BINARY_MOUNTDATA,
so that btrfs handles the security label internally.
The security label is then set at the real mount time and is not lost
when the "subvol=" mount option is used.
Reported-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

f667aef6

04 Oct, 2014 3 commits

Merge branch 'remove-unlikely' of... · 0ec31a61

Chris Mason authored Oct 04, 2014

Merge branch 'remove-unlikely' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus

0ec31a61

Merge branch 'cleanup/blocksize-diet-part1' of... · 27b19cc8

Chris Mason authored Oct 04, 2014

Merge branch 'cleanup/blocksize-diet-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus

27b19cc8

Merge branch 'cleanup/misc-for-3.18' of... · bbf65cf0

Chris Mason authored Oct 04, 2014

Merge branch 'cleanup/misc-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/extent_io.c

bbf65cf0

03 Oct, 2014 12 commits

Btrfs: send, don't delay dir move if there's a new parent inode · bf8e8ca6

Filipe Manana authored Oct 02, 2014

If between two snapshots we rename an existing directory named X to Y and
make it a child (direct or not) of a new inode named X, we were delaying
the move/rename of the former directory unnecessarily, which would result
in attempting to rename the new directory from its orphan name to name X
prematurely.

Minimal reproducer:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ mkdir -p /mnt/merlin/RC/OSD/Source

    $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    $ mkdir /mnt/OSD
    $ mv /mnt/merlin/RC/OSD /mnt/OSD/OSD-Plane_788
    $ mv /mnt/OSD /mnt/merlin/RC

    $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    $ btrfs send /mnt/mysnap1 -f /tmp/1.snap
    $ btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap

    $ mkfs.btrfs -f /dev/vdc
    $ mount /dev/vdc /mnt2

    $ btrfs receive /mnt2 -f /tmp/1.snap
    $ btrfs receive /mnt2 -f /tmp/2.snap

The second receive (from an incremental send) failed with the following
error message: "rename o261-7-0 -> merlin/RC/OSD failed".
This is a regression introduced in the 3.16 kernel.

A test case for xfstests follows.
Reported-by: Marc Merlin <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

bf8e8ca6

btrfs: add more superblock checks · c926093e

David Sterba authored Sep 30, 2014

Populate btrfs_check_super_valid() with checks that try to verify the
consistency of the superblock by additional conditions that may arise from
corrupted devices or bitflips. Some of the tests are only hints and issue
warnings instead of failing the mount, basically when the checks are
derived from the data found in the superblock.

Tested on a broken image provided by Qu.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

c926093e

Btrfs: fix race in WAIT_SYNC ioctl · 42383020

Sage Weil authored Sep 26, 2014

We check whether transid is already committed via last_trans_committed and
then search through trans_list for pending transactions.  If
last_trans_committed is updated by btrfs_commit_transaction after we check
it (there is no locking), we will fail to find the committed transaction
and return EINVAL to the caller.  This has been observed occasionally by
ceph-osd (which uses this ioctl heavily).

Fix by rechecking whether the provided transid <= last_trans_committed
after the search fails, and if so return 0.
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>

42383020
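
The check, search, recheck sequence from the commit above can be modeled in a
few lines of C. This is a single-threaded sketch under stated assumptions:
`struct fs_model`, `wait_for_commit()` and the `hook` callback (which stands in
for a concurrent btrfs_commit_transaction) are all hypothetical and exist only
to make the race window reproducible.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct fs_model {
	uint64_t last_trans_committed;
	uint64_t pending[4]; /* transids of transactions still in flight */
	size_t npending;
};

typedef void (*race_hook_t)(struct fs_model *fs);

static int wait_for_commit(struct fs_model *fs, uint64_t transid,
			   race_hook_t hook)
{
	size_t i;

	/* First, unlocked check: already committed? */
	if (transid <= fs->last_trans_committed)
		return 0;

	/* Race window: a commit may complete right here. */
	if (hook)
		hook(fs);

	for (i = 0; i < fs->npending; i++)
		if (fs->pending[i] == transid)
			return 0; /* real code would wait for the commit here */

	/* The fix: recheck before concluding the transid is invalid. */
	if (transid <= fs->last_trans_committed)
		return 0;
	return -EINVAL;
}

/* Simulated concurrent commit: everything pending becomes committed. */
static void commit_everything(struct fs_model *fs)
{
	if (fs->npending > 0) {
		fs->last_trans_committed = fs->pending[fs->npending - 1];
		fs->npending = 0;
	}
}
```

Without the final recheck, a caller waiting for transid 10 while that
transaction commits inside the window would get -EINVAL in this model.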

Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db

Filipe Manana authored Sep 26, 2014

While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this latter case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the 3 new flags for the btree inode also achieves the goal of
AS_EIO/AS_ENOSPC in the case where writepages() returns success after
starting writeback for all dirty pages, and that writeback has already
finished with errors before filemap_fdatawait_range() is called. Because
we were not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would
return success, as it could not know that writeback errors happened (the
pages were no longer tagged for writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

656f30db
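
The idea of separate, sticky error flags can be sketched as follows. This is a
simplified model, not the patch itself: the flag names, `struct
btree_inode_model` and the helper functions are hypothetical, and returning
-EIO on a recorded error is an assumption. What it shows is that a log sync
consumes only its own flag, so an error on a non-log tree extent is still
visible at transaction commit time.

```c
#include <errno.h>

/* Hypothetical flag names; one for non-log trees, one per log tree. */
#define BTREE_ERR (1u << 0)
#define LOG1_ERR  (1u << 1)
#define LOG2_ERR  (1u << 2)

struct btree_inode_model {
	unsigned int err_flags;
};

/* Called when writeback of a tree block fails.
 * log_index: 0 or 1 for the two log trees, anything else for non-log trees. */
static void record_write_error(struct btree_inode_model *bi, int log_index)
{
	if (log_index == 0)
		bi->err_flags |= LOG1_ERR;
	else if (log_index == 1)
		bi->err_flags |= LOG2_ERR;
	else
		bi->err_flags |= BTREE_ERR;
}

static int test_and_clear(struct btree_inode_model *bi, unsigned int flag)
{
	int was_set = (bi->err_flags & flag) != 0;

	bi->err_flags &= ~flag;
	return was_set;
}

/* A log sync only looks at, and clears, the flag of its own log tree. */
static int sync_log(struct btree_inode_model *bi, int log_index)
{
	unsigned int flag = log_index == 0 ? LOG1_ERR : LOG2_ERR;

	return test_and_clear(bi, flag) ? -EIO : 0;
}

/* A transaction commit must fail if any non-log tree block failed to write. */
static int commit_transaction(struct btree_inode_model *bi)
{
	return test_and_clear(bi, BTREE_ERR) ? -EIO : 0;
}
```

With a single shared error indicator, the `sync_log()` call would have caught
and cleared the error, and the later commit would have seen nothing.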

Btrfs: remove redundant btrfs_verify_qgroup_counts declaration. · 15b636e1

Fabian Frederick authored Sep 25, 2014

Follow the approach used for the disk-io function declared under
CONFIG_BTRFS_FS_RUN_SANITY_TESTS and keep the prototype in qgroup.h only.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>

15b636e1

btrfs: fix shadow warning on cmp · b99d9a6a

Fabian Frederick authored Sep 25, 2014

cmp was declared twice in btrfs_compare_trees, resulting in a shadow
warning. This patch renames the second, inner variable.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>

b99d9a6a

Btrfs: fix compilation errors under DEBUG · 1b6e4469

Fabian Frederick authored Sep 24, 2014

bi_sector and bi_size moved to bi_iter since commit 4f024f37
("block: Abstract out bvec iterator")
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>

1b6e4469

Btrfs: fix crash of btrfs_release_extent_buffer_page · 81465028

Liu Bo authored Sep 23, 2014

This is actually inspired by Filipe's patch. When write_one_eb() fails on
submit_extent_page(), it gives up writing this eb and marks it with
EXTENT_BUFFER_IOERR. So if it's not the last page that encounters the failure,
some remaining pages stay DIRTY, and if a later COW on this eb
happens, i.e. the eb is COWed and freed, it runs into the BUG_ON in
btrfs_release_extent_buffer_page() for the DIRTY page, i.e. BUG_ON(PageDirty(page));

This adds the missing clear_page_dirty_for_io() for the remaining pages of the eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

81465028

Btrfs: add missing end_page_writeback on submit_extent_page failure · 55e3bd2e

Filipe Manana authored Sep 22, 2014

If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>

55e3bd2e
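
Taken together with 81465028 above, the failure path of the extent buffer
write can be modeled like this. It is a userspace sketch, not write_one_eb()
itself: `struct page_model`, `write_eb()`, the `submit_fn` callback and
`fail_at_second_page()` are hypothetical, and the two booleans stand in for
the page dirty and writeback bits.

```c
#include <stdbool.h>
#include <stddef.h>

struct page_model {
	bool dirty;
	bool writeback;
};

/* Stand-in for submit_extent_page(): returns 0 or a negative error. */
typedef int (*submit_fn)(size_t index);

static int write_eb(struct page_model *pages, size_t num_pages,
		    submit_fn submit)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < num_pages; i++) {
		pages[i].dirty = false;    /* models clear_page_dirty_for_io() */
		pages[i].writeback = true; /* models set_page_writeback() */
		ret = submit(i);
		if (ret) {
			/* 55e3bd2e: end writeback on the failed page, or
			 * waiters on the writeback bit would hang forever. */
			pages[i].writeback = false;
			break;
		}
	}

	if (ret) {
		/* 81465028: the pages after the failed one were never
		 * submitted, so clear their dirty bit as well. */
		for (i = i + 1; i < num_pages; i++)
			pages[i].dirty = false;
	}
	return ret;
}

/* Demo submit callback that fails on the page at index 1. */
static int fail_at_second_page(size_t index)
{
	return index == 1 ? -5 : 0;
}
```

Pages submitted successfully keep their writeback bit in this model, since
their I/O would complete asynchronously.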

btrfs: Fix the wrong condition judgment about subset extent map · 32be3a1a

Qu Wenruo authored Sep 22, 2014

The previous commit, "btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map",
uses the wrong condition to judge whether the range is a subset of an
existing extent map.

This may cause a bug in btrfs no-holes mode.

This patch corrects the condition and fixes the bug.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>

32be3a1a

Btrfs: fix build_backref_tree issue with multiple shared blocks · bbe90514

Josef Bacik authored Sep 19, 2014

Marc Merlin sent me a broken fs image months ago where it would blow up in the
upper->checked BUG_ON() in build_backref_tree.  This is because we had a
scenario like this

block a -- level 4 (not shared)
   |
block b -- level 3 (reloc block, shared)
   |
block c -- level 2 (not shared)
   |
block d -- level 1 (shared)
   |
block e -- level 0 (shared)

We go to build a backref tree for block e, we notice block d is shared and add
it to the list of blocks to look up its backrefs for.  Now when we loop around
we will check edges for the block, so we will see we looked up block c last
time.  So we lookup block d and then see that the block that points to it is
block c and we can just skip that edge since we've already been up this path.
The problem is because we clear need_check when we see block d (as it is shared)
we never add block b as needing to be checked.  And because block c is in our
path already we bail out before we walk up to block b and add it to the backref
check list.

To fix this we need to reset need_check if we trip over a block that doesn't
need to be checked.  This will make sure that any subsequent blocks in the path
as we're walking up afterwards are added to the list to be processed.  With this
patch I can now mount Marc's fs image and it'll complete the balance without
panicking.  Thanks,
Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>

bbe90514

Btrfs: cleanup error handling in build_backref_tree · 75bfb9af

Josef Bacik authored Sep 19, 2014

When balance panics it tends to panic in the

BUG_ON(!upper->checked);

test, because it means it couldn't build the backref tree properly. This is
annoying to users and frankly a recoverable error, nothing in this function is
actually fatal since it is just an in-memory building of the backrefs for a
given bytenr. So go through and change all the BUG_ON()'s to ASSERT()'s, and
fix the BUG_ON(!upper->checked) thing to just return an error.

This patch also fixes the error handling so it tears down the work we've done
properly. This code was horribly broken since we always just panic'ed instead
of actually erroring out, so it needed to be completely re-worked. With this
patch my broken image no longer panics when I mount it. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>

75bfb9af

02 Oct, 2014 17 commits

btrfs: move checks for DUMMY_ROOT into a helper · fccb84c9

David Sterba authored Sep 29, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

fccb84c9

btrfs: new define for the inline extent data start · 7ec20afb

David Sterba authored Jul 24, 2014

Use a common definition for the inline data start so we don't have to
open-code it and introduce bugs like the one fixed by "Btrfs: fix wrong
max inline data size limit".
Signed-off-by: David Sterba <dsterba@suse.cz>

7ec20afb
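
A common definition of this kind can be expressed with offsetof(). The sketch
below uses a packed struct modeled on the on-disk file extent item; the struct
name, the `INLINE_DATA_START` macro and `inline_item_size()` are hypothetical
stand-ins, and the `packed` attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the on-disk file extent item layout. For inline extents the
 * file data begins where disk_bytenr would otherwise be stored. */
struct file_extent_item_model {
	uint64_t generation;
	uint64_t ram_bytes;
	uint8_t  compression;
	uint8_t  encryption;
	uint16_t other_encoding;
	uint8_t  type;
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t offset;
	uint64_t num_bytes;
} __attribute__((packed));

/* One shared definition instead of open-coded offset arithmetic. */
#define INLINE_DATA_START \
	offsetof(struct file_extent_item_model, disk_bytenr)

/* Item size needed to store data_len bytes of inline data. */
static size_t inline_item_size(size_t data_len)
{
	return INLINE_DATA_START + data_len;
}
```

Every caller that needs the header size of an inline extent then uses the same
constant, so the computation cannot drift between call sites.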

btrfs: kill extent_buffer_page helper · fb85fc9a

David Sterba authored Jul 31, 2014

It used to be more complex but now it's just a simple array access.
Signed-off-by: David Sterba <dsterba@suse.cz>

fb85fc9a

btrfs: drop constant param from btrfs_release_extent_buffer_page · a50924e3

David Sterba authored Jul 31, 2014

All callers use the same value, simplify the function.
Signed-off-by: David Sterba <dsterba@suse.cz>

a50924e3

btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB · 2755a0de

David Sterba authored Jul 31, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

2755a0de

btrfs: let merge_reloc_roots return void · 94404e82

David Sterba authored Jul 30, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

94404e82

btrfs: remove unused members from struct scrub_warning · 8b9456da

David Sterba authored Jul 30, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

8b9456da

btrfs: use slab for end_io_wq structures · 97eb6b69

David Sterba authored Jul 30, 2014

The structure is frequently reused.  Rename it according to the slab
name.
Signed-off-by: David Sterba <dsterba@suse.cz>

97eb6b69

btrfs: fix error labels in init_btrfs_fs · af13b492

David Sterba authored Jul 30, 2014

btrfs_interface_init rarely fails but we could leak the prelim_ref slab.
Signed-off-by: David Sterba <dsterba@suse.cz>

af13b492
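
The class of bug fixed here, an error label that skips one teardown step, is
easiest to see in the usual goto-unwind pattern. The following is a generic
sketch, not init_btrfs_fs(): `step_init()`, `step_exit()`,
`init_module_model()` and the `live_resources` counter are hypothetical.

```c
#include <stdbool.h>

/* Number of initialized subsystems that have not been torn down. */
static int live_resources;

static int step_init(bool fail)
{
	if (fail)
		return -1;
	live_resources++;
	return 0;
}

static void step_exit(void)
{
	live_resources--;
}

/* fail_at selects which init step (1..3) fails; 0 means none. */
static int init_module_model(int fail_at)
{
	int err;

	err = step_init(fail_at == 1);
	if (err)
		return err;

	err = step_init(fail_at == 2);
	if (err)
		goto free_first;

	err = step_init(fail_at == 3);
	if (err)
		goto free_second;

	return 0;

	/* Labels unwind in reverse order; jumping to the wrong one leaks. */
free_second:
	step_exit();
free_first:
	step_exit();
	return err;
}
```

If the third failure jumped to `free_first` instead of `free_second`, one
resource would stay allocated, which is the shape of the leak described above.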

btrfs: use enum for wq endio metadata type · bfebd8b5

David Sterba authored Jul 30, 2014

The enum exists but is not consistently used.
Signed-off-by: David Sterba <dsterba@suse.cz>

bfebd8b5

btrfs: remove unused extent state bits · 01d5bc37

David Sterba authored Jul 30, 2014

The last users are long gone.
Signed-off-by: David Sterba <dsterba@suse.cz>

01d5bc37

Btrfs: set default max_inline to 8KiB instead of 8MiB · 95ac567a

Filipe David Borba Manana authored Aug 08, 2013

8MiB is way too large and likely set by mistake. This is not
a significant issue as in practice the max amount of data
added to an inline extent is also limited by the page cache
and btree leaf sizes.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>

95ac567a
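
The arithmetic behind the commit above, as a small sketch. The macro names and
`effective_inline_limit()` are hypothetical, and taking a plain minimum of the
three limits is a simplification of whatever the kernel actually computes; it
only illustrates why the oversized default had little practical effect.

```c
#include <stdint.h>

#define MAX_INLINE_OLD (8ULL * 1024 * 1024) /* 8MiB, the old default */
#define MAX_INLINE_NEW (8ULL * 1024)        /* 8KiB, the new default */

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Simplified model: inline data is capped by the mount option, the page
 * size and the space available in a btree leaf, whichever is smallest. */
static uint64_t effective_inline_limit(uint64_t max_inline,
				       uint64_t page_size,
				       uint64_t leaf_data_size)
{
	return min_u64(max_inline, min_u64(page_size, leaf_data_size));
}
```

With a 4096-byte page, both defaults yield the same effective limit in this
model, since the page size is already the smallest of the three values.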

btrfs: remove blocksize from btrfs_alloc_free_block and rename · 4d75f8a9

David Sterba authored Jun 15, 2014

Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free +
_tree_block family. The parameter blocksize was set to the metadata
block size, directly or indirectly.
Signed-off-by: David Sterba <dsterba@suse.cz>

4d75f8a9

btrfs: remove unused parameter blocksize from btrfs_find_tree_block · 0308af44

David Sterba authored Jun 15, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>

0308af44

btrfs: remove parameter blocksize from read_tree_block · ce86cd59

David Sterba authored Jun 15, 2014

We know the tree block size, no need to pass it around.
Signed-off-by: David Sterba <dsterba@suse.cz>

ce86cd59

btrfs: inline code of reada_tree_block and remove it · 453848a0

David Sterba authored Jun 15, 2014

It's trivial with a single user. And remove one pointless BUG_ON.
Signed-off-by: David Sterba <dsterba@suse.cz>

453848a0

btrfs: return void from readahead_tree_block · 6197d86e

David Sterba authored Jun 15, 2014

Errors in readahead are not fatal and ignored elsewhere in the code.
Signed-off-by: David Sterba <dsterba@suse.cz>

6197d86e