Commits · a57195214358b75807a74bad96a8601a36262af7 · Kirill Smelkov / linux

24 Sep, 2009 4 commits

Btrfs: check size of inode backref before adding hardlink · a5719521

Yan, Zheng authored Sep 24, 2009

For every hardlink in btrfs, there is a corresponding inode back
reference. All inode back references for hardlinks in a given
directory are stored in a single b-tree item. The size of a b-tree item
is limited by the size of a b-tree leaf, so we can only create a limited
number of hardlinks to a given file in a directory.

The original code lacks this check and oopses if the number of
hardlinks goes over the limit. This patch fixes the issue by adding
the check to btrfs_link and btrfs_rename.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
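
A minimal sketch of the kind of size check described above, not the actual patch. The names `inode_ref_sketch` and `can_add_backref` and the `max_item_size` parameter are hypothetical, and returning `-EMLINK` is an assumption; the message does not say which error the patch returns.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-size header stored in front of each link name. */
struct inode_ref_sketch {
    uint64_t index;     /* position of the name in the directory */
    uint16_t name_len;  /* length of the name that follows */
};

/*
 * Decide whether one more back reference with a name of name_len bytes
 * fits into an item that already holds cur_item_size bytes.
 * max_item_size stands in for the limit imposed by the b-tree leaf size.
 */
int can_add_backref(size_t cur_item_size, size_t name_len,
                    size_t max_item_size)
{
    size_t need = sizeof(struct inode_ref_sketch) + name_len;

    if (cur_item_size > max_item_size ||
        need > max_item_size - cur_item_size)
        return -EMLINK;  /* assumed error code: too many links */
    return 0;
}
```

A caller in the link or rename path would run this check before modifying anything, so a full item produces an error instead of an oops.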

Btrfs: fix releasepage to avoid unlocking extents we haven't locked · 11ef160f

Chris Mason authored Sep 23, 2009

During releasepage, we try to drop any extent_state structs for the
byte offsets of the page we're releasing.  But the code was incorrectly
telling clear_extent_bit to delete the state struct unconditionally.

Normally this would be fine because we have the page locked, but other
parts of btrfs will lock down an entire extent, the most common place
being IO completion.

releasepage was deleting the extent state without first locking the extent,
which may result in removing a state struct that another process had
locked down.  The fix here is to leave the NODATASUM and EXTENT_LOCKED
bits alone in releasepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix test_range_bit for whole file extents · 46562cec

Chris Mason authored Sep 23, 2009

If test_range_bit finds an extent that goes all the way to (u64)-1, it
can incorrectly wrap the u64 instead of treating it like the end of
the address space.

This just adds a check for the highest possible offset so we don't wrap.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
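
The wraparound can be illustrated with a simplified range walk. This is a sketch under stated assumptions, not the kernel's test_range_bit: `range_state` and `range_has_bit` are hypothetical names, and a sorted array stands in for the extent state tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an extent state record. */
struct range_state {
    uint64_t start;  /* first byte covered */
    uint64_t end;    /* last byte covered (inclusive) */
    unsigned bits;   /* state bits set on this range */
};

/*
 * Return true if every byte in [start, end] is covered by a state that
 * has `bit` set. `states` must be sorted by start and non-overlapping.
 */
bool range_has_bit(const struct range_state *states, size_t n,
                   uint64_t start, uint64_t end, unsigned bit)
{
    for (size_t i = 0; i < n; i++) {
        const struct range_state *s = &states[i];

        if (s->end < start)
            continue;      /* state lies entirely before the cursor */
        if (s->start > start)
            return false;  /* gap in front of this state */
        if (!(s->bits & bit))
            return false;
        /* Without this guard, s->end + 1 wraps to 0 at the top. */
        if (s->end == UINT64_MAX)
            return true;
        start = s->end + 1;
        if (start > end)
            return true;   /* covered the whole requested range */
    }
    return false;          /* ran out of states before reaching end */
}
```

If the guard is removed, a state ending at `UINT64_MAX` sets `start` to 0, the `start > end` test fails, and the walk continues past the end of the address space.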

Btrfs: fix errors handling cached state in set/clear_extent_bit · 42daec29

Chris Mason authored Sep 23, 2009

Both set and clear_extent_bit allow passing a cached
state struct to reduce rbtree search times.  clear_extent_bit
was improperly bypassing some of the checks around making sure
the extent state fields were correct for a given operation.

The fix used here (from Yan Zheng) is to use the hit_next
goto target instead of jumping all the way down to start clearing
bits without making sure the cached state was exactly correct
for the operation we were doing.

This also fixes up the setting of the start variable for both
ops in the case where we find an overlapping extent that
begins before the range we want to change.  In both cases
we were incorrectly going backwards from the original
requested change.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

22 Sep, 2009 2 commits

Btrfs: fix early enospc during balancing · 7ce618db

Chris Mason authored Sep 22, 2009

We now do extra checks before a balance to make sure
there is room for the balance to take place.  One of
the checks was testing to see if we were trying to
balance away the last block group of a given type.

If there is no space available for new chunks, we
should not try and balance away the last block group
of a given type.  But, the code wasn't checking for
available chunk space, and so it was exiting too soon.

The fix here is to combine some of the checks and make
sure we try to allocate new chunks when we're balancing
the last block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: deal with NULL space info · 33b4d47f

Chris Mason authored Sep 22, 2009

After a balance it is briefly possible for the space info
field in the inode to be NULL.  This adds some checks
to make sure things properly deal with the NULL value.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

21 Sep, 2009 11 commits

Btrfs: account for space used by the super mirrors · 1b2da372

Josef Bacik authored Sep 11, 2009

As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
space accounting for the space infos. Currently we exclude the free space for
the super mirrors, but the space they take up isn't accounted for in any of the
counters. This patch introduces bytes_super, which keeps track of the number
of bytes used for a super mirror in the block group cache and space info. This
makes sure that our free space calculations will be completely accurate.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
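
The accounting idea can be sketched as follows. Only `bytes_super` comes from the message above; the struct name, the other fields, and `free_bytes` are hypothetical simplifications, and a real space info tracks more counters than shown here.

```c
#include <stdint.h>

/* Hypothetical, reduced view of a space info. */
struct space_info_sketch {
    uint64_t total_bytes;  /* size of the space being tracked */
    uint64_t bytes_used;   /* bytes allocated to extents */
    uint64_t bytes_super;  /* bytes taken up by super mirrors */
};

/* Free space once the super mirrors are counted as occupied. */
uint64_t free_bytes(const struct space_info_sketch *si)
{
    uint64_t taken = si->bytes_used + si->bytes_super;

    return taken >= si->total_bytes ? 0 : si->total_bytes - taken;
}
```

Leaving `bytes_super` out of the sum would report more free space than can actually be allocated.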

Btrfs: fix extent entry threshold calculation · 25891f79

Josef Bacik authored Sep 11, 2009

There is a slight problem with the extent entry threshold calculation for the
free space cache. We only adjust the threshold down as we add bitmaps, but
never actually adjust the threshold up as we free bitmaps. This means we could
fragment the free space so badly that we end up using all bitmaps to describe
the free space, then use all the free space, which would result in the bitmaps
being freed, and then go to add free space again as we delete things and
immediately add bitmaps, since the extent threshold would still be 0. Now as we
free bitmaps the extent threshold will be ratcheted up to allow more extent
entries to be added.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: remove dead code · f61408b8

Josef Bacik authored Sep 11, 2009

This patch removes a bunch of dead code from the snapshot removal stuff.  It
was confusing me when doing the metadata ENOSPC stuff so I killed it.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix bitmap size tracking · f019f426

Josef Bacik authored Sep 11, 2009

When we first go to add free space, we allocate a new info and set the offset
and bytes to the space we are adding. This is fine, except we actually set the
size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd
end up counting the same space twice. This isn't a huge deal; it just makes
the allocator behave weirdly since it will think that a bitmap entry has more
space than it ends up actually having. I used a BUG_ON() to catch when this
problem happened, and with this patch I no longer get the BUG_ON().
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: don't keep retrying a block group if we fail to allocate a cluster · 0a24325e

Josef Bacik authored Sep 11, 2009

The box can get locked up in the allocator if we happen upon a block group
under these conditions:

1) During a commit, so caching threads cannot make progress
2) Our block group currently is in the middle of being cached
3) Our block group currently has plenty of free space in it
4) Our block group is so fragmented that it ends up having no free space chunks
larger than min_bytes calculated by btrfs_find_space_cluster.

What happens is we try and do btrfs_find_space_cluster, which fails because it
is unable to find enough free space chunks that are larger than min_bytes and
are close enough together. Since the block group is not cached we do a
wait_block_group_cache_progress, which waits for the number of bytes we need,
except the block group already has _plenty_ of free space, it's just severely
fragmented, so we loop and try again, ad infinitum. This patch keeps us from
waiting on the block group to finish caching if we failed to find a free space
cluster before. It also makes sure that we don't even try to find a free space
cluster if we are on our last loop in the allocator, since we will have tried
everything at this point and it is futile.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: make balance code choose more wisely when relocating · ba1bf481

Josef Bacik authored Sep 11, 2009

Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents.  For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic.  Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk.  This will make sure we remove block
groups that are moveable, like if we have a lot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.

V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.

-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't, return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.

-check to make sure the block group we are going to relocate isn't the last one
in that particular space

-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix arithmetic error in clone ioctl · 1fb58a60

Sage Weil authored Sep 21, 2009

Fix an arithmetic error that was breaking extents cloned via the clone
ioctl starting in the second half of a file.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: add snapshot/subvolume destroy ioctl · 76dda93c

Yan, Zheng authored Sep 21, 2009

This patch adds a snapshot/subvolume destroy ioctl. A subvolume that isn't being
used and doesn't contain links to other subvolumes can be destroyed.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: change how subvolumes are organized · 4df27c4d

Yan, Zheng authored Sep 21, 2009

btrfs allows subvolumes and snapshots anywhere in the directory tree.
If we snapshot a subvolume that contains a link to another subvolume
called subvolA, subvolA can be accessed through both the original
subvolume and the snapshot. This is similar to creating a hard link to
a directory, and has very similar problems.

The aim of this patch is to enforce that there is only one access point to
each subvolume. Only the first directory entry (the one added when
the subvolume/snapshot was created) is treated as a valid access point.
The first directory entry is distinguished by checking the root forward
reference. If the corresponding root forward reference is missing,
we know the entry is not the first one.

This patch also adds snapshot/subvolume rename support. The code
allows renaming a subvolume link across subvolumes.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: do not reuse objectid of deleted snapshot/subvol · 13a8a7c8

Yan, Zheng authored Sep 21, 2009

The new back reference format does not allow reusing the objectid of
a deleted snapshot/subvol. So we use ++highest_objectid to allocate
the objectid for a new snapshot/subvol.

Now we use ++highest_objectid to allocate the objectid for both new inodes
and new snapshots/subvolumes, so this patch removes the 'find hole' code in
btrfs_find_free_objectid.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
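
A minimal sketch of monotonic objectid allocation as described above. Only the `++highest_objectid` idea comes from the message; `root_sketch`, `alloc_objectid`, `OBJECTID_LIMIT_SKETCH`, and the `-ENOSPC` return are hypothetical choices for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical upper bound on usable objectids. */
#define OBJECTID_LIMIT_SKETCH ((uint64_t)-256)

/* Hypothetical stand-in for the per-root allocation state. */
struct root_sketch {
    uint64_t highest_objectid;  /* largest objectid handed out so far */
};

/*
 * Hand out the next objectid. Ids only ever grow, so the id of a
 * deleted inode, snapshot, or subvolume is never reused.
 */
int alloc_objectid(struct root_sketch *root, uint64_t *objectid)
{
    if (root->highest_objectid >= OBJECTID_LIMIT_SKETCH)
        return -ENOSPC;  /* assumed error code when ids run out */
    *objectid = ++root->highest_objectid;
    return 0;
}
```

Because there is no search for holes, allocation is a single increment and needs no knowledge of which earlier ids have been freed.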

Btrfs: speed up snapshot dropping · 1c4850e2

Yan, Zheng authored Sep 21, 2009

This patch contains two changes to avoid unnecessary tree block reads during
snapshot dropping.

First, check the tree block's reference count and flags before reading the tree
block. If the reference count is > 1 and there is no need to update backrefs, we
can avoid reading the tree block.

Second, save when the snapshot was created in root_key.offset. We can compare the
block pointer's generation with the snapshot's creation generation while updating
backrefs. If a given block was created before the snapshot was created, the
snapshot can't be the tree block's owner, so we can avoid reading the block.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
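
The two checks can be written as small predicates. This is a sketch with hypothetical function and parameter names, not the code from the patch. The message only says "before", so the sketch conservatively treats equal generations as "may be owned"; whether the real code does the same is not stated here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * First check: a block that is still referenced elsewhere and needs no
 * backref update does not have to be read.
 */
bool can_skip_shared_block(uint64_t refs, bool need_backref_update)
{
    return refs > 1 && !need_backref_update;
}

/*
 * Second check: a block created before the snapshot existed cannot be
 * owned by the snapshot, so it does not have to be read either.
 */
bool snapshot_cannot_own(uint64_t block_generation,
                         uint64_t snapshot_generation)
{
    return block_generation < snapshot_generation;
}
```

Both predicates use only information available in the parent's block pointer and the extent tree, which is what lets the read of the child block be skipped.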

18 Sep, 2009 2 commits

Btrfs: search for an allocation hint while filling file COW · b917b7c3

Chris Mason authored Sep 18, 2009

The allocator has some nice knobs for sending hints about where
to try and allocate new blocks, but when we're doing file allocations
we're not sending any hint at all.

This commit adds a simple extent map search to see if we can
quickly and easily find a hint for the allocator.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: properly honor wbc->nr_to_write changes · f85d7d6c

Chris Mason authored Sep 18, 2009

When btrfs fills a delayed allocation, it tries to increase
the wbc nr_to_write to cover a big part of the allocation.  The
theory is that we're doing contiguous IO and writing a few
more blocks will save seeks overall at a very low cost.

The problem is that extent_write_cache_pages could ignore
the new higher nr_to_write if nr_to_write had already gone
down to zero.  We fix that by rechecking the nr_to_write
for every page that is processed in the pagevec.

This updates the math around bumping the nr_to_write value
to make sure we don't leave a tiny amount of IO hanging
around for the very end of a new extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

17 Sep, 2009 1 commit

Btrfs: improve async block group caching · 11833d66

Yan Zheng authored Sep 11, 2009

This patch gets rid of two limitations of async block group caching.
The old code delays handling pinned extents while a block group is
being cached. To allocate logged file extents, the old code needs to wait
until the block group is fully cached. To get rid of these limitations,
this patch introduces a data structure to track the progress of
caching. Based on the caching progress, we know which extents should
be added to the free space cache when handling the pinned extents.
The logged file extents are also handled in a similar way.

This patch also changes how pinned extents are tracked. The old
code uses one tree to track pinned extents, and copies the pinned
extents tree at transaction commit time. This patch makes it use
two trees to track pinned extents: one tree for extents that are
pinned in the running transaction, and one tree for extents that can
be unpinned. At transaction commit time, we swap the two trees.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
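
The two-tree scheme can be sketched with a pointer swap. All names here are hypothetical, and a byte counter stands in for each tree; the point is only that commit exchanges two pointers instead of copying a tree.

```c
#include <stdint.h>

/* Hypothetical stand-in for one tree of pinned extents. */
struct pinned_set_sketch {
    uint64_t bytes;  /* total bytes recorded in this set */
};

struct fs_sketch {
    struct pinned_set_sketch sets[2];
    struct pinned_set_sketch *running;     /* pinned in the running transaction */
    struct pinned_set_sketch *unpinnable;  /* extents that can be unpinned */
};

void fs_init(struct fs_sketch *fs)
{
    fs->sets[0].bytes = 0;
    fs->sets[1].bytes = 0;
    fs->running = &fs->sets[0];
    fs->unpinnable = &fs->sets[1];
}

/* Record an extent pinned by the running transaction. */
void pin_extent(struct fs_sketch *fs, uint64_t len)
{
    fs->running->bytes += len;
}

/* At commit time, swap the roles of the two sets; nothing is copied. */
void commit_swap(struct fs_sketch *fs)
{
    struct pinned_set_sketch *tmp = fs->running;

    fs->running = fs->unpinnable;
    fs->unpinnable = tmp;
}

/* Release everything in the unpinnable set; return the bytes freed. */
uint64_t unpin_all(struct fs_sketch *fs)
{
    uint64_t released = fs->unpinnable->bytes;

    fs->unpinnable->bytes = 0;
    return released;
}
```

After `commit_swap`, new pins from the next transaction land in the other set, so they are not mixed with the extents being unpinned.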

16 Sep, 2009 3 commits

Btrfs: Fix async thread shutdown race · 6e74057c

Chris Mason authored Sep 15, 2009

It was possible for an async worker thread to be selected to
receive a new work item, but exit before the work item was
actually placed into that thread's work list.

This commit fixes the race by incrementing the num_pending
counter earlier, and making sure to check the number of pending
work items before a thread exits.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix worker thread double spin_lock_irq · 627e421a

Chris Mason authored Sep 15, 2009

The exit-on-idle code for async worker threads was incorrectly
calling spin_lock_irq with interrupts already off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix async worker startup race · 3e99d8eb

Chris Mason authored Sep 15, 2009

After a new worker thread starts, it is placed into the
list of idle threads.  But, this may race with a
check for idle done by the worker thread itself, resulting
in a double list_add operation.

This fix adds a check to make sure the idle thread addition
is done properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

11 Sep, 2009 17 commits

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable · 83ebade3

Chris Mason authored Sep 11, 2009

Btrfs: zero page past end of inline file items · 93c82d57

Chris Mason authored Sep 11, 2009

When btrfs_get_extent is reading inline file items for readpage,
it needs to copy the inline extent into the page.  If the
inline extent doesn't cover all of the page, that means there
is a hole in the file, or that our file is smaller than one
page.

readpage does zeroing for the case where the file is smaller than one
page, but nobody is currently zeroing for the case where there is
a hole after the inline item.

This commit changes btrfs_get_extent to zero fill the page past
the end of the inline item.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
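
A minimal sketch of the zero-fill step, assuming a plain byte buffer stands in for the page. `fill_page_from_inline` and its parameters are hypothetical names, not the signature of btrfs_get_extent.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy an inline extent into a page buffer and zero whatever part of
 * the page the inline data does not cover.
 */
void fill_page_from_inline(unsigned char *page, size_t page_size,
                           const unsigned char *inline_data,
                           size_t inline_len)
{
    size_t copy = inline_len < page_size ? inline_len : page_size;

    memcpy(page, inline_data, copy);
    if (copy < page_size)
        memset(page + copy, 0, page_size - copy);  /* hole after the item */
}
```

Without the `memset`, bytes past the inline item would keep whatever the page held before, which is the stale-data case the commit closes.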

Btrfs: fix btrfs page_mkwrite to return locked page · 50a9b214

Chris Mason authored Sep 11, 2009

This closes a hole where the page may be written before
the page_mkwrite caller has a chance to dirty it.

(thanks to Nick Piggin)
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix extent replacement race · a1ed835e

Chris Mason authored Sep 11, 2009

Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones.  There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.

Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.

This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly set up with the new
extent pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b

Chris Mason authored Sep 02, 2009

Btrfs writes go through delalloc to the data=ordered code.  This
makes sure that all of the data is on disk before the metadata
that references it.  The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.

This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.

One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page became
dirty without going through the proper path.

These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.

This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.

As IO finishes, the PagePrivate2 bit is cleared and the ordered
accounting is updated for each page.

At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: use a cached state for extent state operations during delalloc · 9655d298

Chris Mason authored Sep 02, 2009

This changes the btrfs code that finds delalloc ranges in the extent state
tree to use the new state caching code from set/test bit.  It reduces
one of the biggest causes of rbtree searches in the writeback path.

test_range_bit is also modified to take the cached state as a starting
point while searching.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: don't lock bits in the extent tree during writepage · d5550c63

Chris Mason authored Sep 02, 2009

At writepage time, we have the page locked and we have the
extent_map entry for this extent pinned in the extent_map tree.
So, the page can't go away and its mapping can't change.

There is no need for the extra extent_state lock bits during writepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: cache values for locking extents · 2c64c53d

Chris Mason authored Sep 02, 2009

Many of the btrfs extent state tree users follow the same pattern.
They lock an extent range in the tree, do some operation and then
unlock.

This translates to at least 2 rbtree searches, and maybe more if they
are doing operations on the extent state tree.  A locked extent
in the tree isn't going to be merged or changed, and so we can
safely return the extent state structure as a cached handle.

This changes set_extent_bit to give back a cached handle, and also
changes both set_extent_bit and clear_extent_bit to use the cached
handle if it is available.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: reduce CPU usage in the extent_state tree · 1edbb734

Chris Mason authored Sep 02, 2009

Btrfs is currently mirroring some of the page state bits into
its extent state tree.  The goal behind this was to use it in supporting
blocksizes other than the page size.

But, we don't currently support that, and we're using quite a lot of CPU
on the rb tree and its spin lock.  This commit starts a series of
cleanups to reduce the amount of work done in the extent state tree as
part of each IO.

This commit:

* Adds the ability to lock an extent in the state tree and also set
other bits.  The idea is to do locking and delalloc in one call

* Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits.  Btrfs is using
a combination of the page bits and the ordered write code for this
instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Fix new state initialization order · e48c465b

Chris Mason authored Sep 11, 2009

As the extent state tree is manipulated, there are callbacks
that are used to take extra actions when different state bits are set
or cleared.  One example of this is a counter for the total number
of delayed allocation bytes in a single inode and in the whole FS.

When new states are inserted, this callback is being done before we
properly set up the new state.  This hasn't caused problems before
because the lock bit was always done first, and the existing callbacks
don't care about the lock bit.

This patch makes sure the state is properly set up before using the
callback, which is important for later optimizations that do more work
without using the lock bit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: switch extent_map to a rw lock · 890871be

Chris Mason authored Sep 02, 2009

There are two main users of the extent_map tree.  The
first is regular file inodes, where it is evenly spread
between readers and writers.

The second is the chunk allocation tree, which maps blocks from
logical addresses to physical ones, and it is 99.99% reads.

The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: tweak congestion backoff · 57fd5a5f

Chris Mason authored Aug 07, 2009

The btrfs io submission thread tries to back off congested devices in
favor of rotating off to another disk.

But, it tries to make sure it submits at least some IO before rotating
on (the others may be congested too), and so it has a magic number of
requests it tries to write before it hops.

This makes the magic number smaller.  Testing shows that we're spending
too much time on congested devices and leaving the other devices idle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: use larger nr_to_write for larger extents · a97adc9f

Chris Mason authored Aug 07, 2009

When btrfs fills a large delayed allocation extent, it is a good idea
to try and convince the write_cache_pages caller to go ahead and
write a good chunk of that extent.  The extra IO is basically free
because we know it is contiguous.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: reduce worker thread spin_lock_irq hold times · 4f878e84

Chris Mason authored Aug 07, 2009

This changes the btrfs worker threads to batch work items
into a local list.  It allows us to pull work items in
large chunks and significantly reduces the number of times we
need to take the worker thread spinlock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: keep irqs on more often in the worker threads · 4e3f9c50

Chris Mason authored Aug 05, 2009

The btrfs worker thread spinlock was being used both for the
queueing of IO and for the processing of ordered events.

The ordered events never happen from end_io handlers, and so they
don't need to use the _irq version of spinlocks.  This adds a
dedicated lock to the ordered lists so they don't have to run
with irqs off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: optimize set extent bit · 40431d6c

Chris Mason authored Aug 05, 2009

The Btrfs set_extent_bit call currently searches the rbtree
every time it needs to find more extent_state objects to fill
the requested operation.

This adds a simple test with rb_next to see if the next object
in the tree was adjacent to the one we just found.  If so,
we skip the search and just use the next object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: Allow worker threads to exit when idle · 9042846b

Chris Mason authored Aug 04, 2009

The Btrfs worker threads don't currently die off after they have
been idle for a while, leading to a lot of threads sitting around
doing nothing for each mount.

Also, they are unable to start atomically (from end_io handlers).

This commit reworks the worker threads so they can be started
from end_io handlers (just setting a flag that asks for a thread
to be added at a later date) and so they can exit if they
have been idle for a long time.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
