Commits · 1bc8779349d6278e2713a1ff94418c2a6746a791 · Kirill Smelkov / linux

04 Jun, 2011 1 commit

btrfs: scrub: don't reuse bios and pages · 1bc87793

Arne Jansen authored May 28, 2011

The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


28 May, 2011 1 commit

Merge branch 'for-chris' of · ff5714cc

Chris Mason authored May 28, 2011

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

Conflicts:
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/transaction.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>


27 May, 2011 1 commit

Btrfs: use the device_list_mutex during write_dev_supers · 174ba509

Chris Mason authored May 27, 2011

write_dev_supers was changed to use RCU to protect the list of
devices, but it was then sleeping while it actually wrote the supers.
This fixes it to just use the mutex, since we really don't need any
concurrency in write_dev_supers anyway.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


26 May, 2011 4 commits

Btrfs: setup free ino caching in a more asynchronous way · a47d6b70

Li Zefan authored May 26, 2011

For a filesystem that has lots of files in it, the first time we mount
it with free ino caching support, it can take quite a long time to
setup the caching before we can create new files.

Here we fill the cache with [highest_ino, BTRFS_LAST_FREE_OBJECTID]
before we start the caching thread to search through the extent tree.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs scrub: don't coalesce pages that are logically discontiguous · 00d01bc1

Arne Jansen authored May 25, 2011

scrub_page collects several pages into one bio as long as they are physically
contiguous. As we only save one logical address for the whole bio, don't
collect pages that are physically contiguous but logically discontiguous.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
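
The merge condition described in this commit can be modelled in user space. This is only a sketch of the idea, not the scrub code itself: the function name `can_coalesce`, its parameters, and the 4096-byte page size are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def can_coalesce(bio_physical, bio_logical, bio_pages, physical, logical):
    """Return True if the next page may be appended to the current bio.

    The bio stores a single physical and a single logical start address,
    so the new page must continue both address ranges.
    """
    offset = bio_pages * PAGE_SIZE
    physically_contiguous = bio_physical + offset == physical
    logically_contiguous = bio_logical + offset == logical
    return physically_contiguous and logically_contiguous


# A bio holding 2 pages, starting at physical 1 MiB and logical 8 MiB.
phys, logi = 1 << 20, 8 << 20
print(can_coalesce(phys, logi, 2, phys + 8192, logi + 8192))  # True
# Physically contiguous, but the logical address jumps: start a new bio.
print(can_coalesce(phys, logi, 2, phys + 8192, 9 << 20))  # False
```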


Btrfs: return -ENOMEM in clear_extent_bit · c309df07

Chris Mason authored May 26, 2011

The btrfs releasepage function depends on ENOMEM coming
back when it is called atomic.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: add mount -o auto_defrag · 4cb5300b

Chris Mason authored May 24, 2011

This will detect small random writes into files and
queue them up for an auto defrag process.  It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


23 May, 2011 33 commits

Merge branch 'cleanups_and_fixes' into inode_numbers · d6c0cb37

Chris Mason authored May 23, 2011

Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: using rcu lock in the reader side of devices list · 1f78160c

Xiao Guangrong authored Apr 20, 2011

fs_devices->devices is only updated on the remove and add device paths, so we
can use RCU to protect it on the reader side.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: drop unnecessary device lock · 46224705

Xiao Guangrong authored Apr 20, 2011

Drop device_list_mutex for the reader side on the clone_fs_devices and
btrfs_rm_device paths, since fs_info->volume_mutex can ensure the device
list is not updated.

btrfs_close_extra_devices is on the initialization path, where we cannot add or
remove devices, so we can simply drop the mutex safely, as other
initialization functions do (add_missing_dev, __find_device,
__btrfs_open_devices, ...).
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: fix the race between remove dev and alloc chunk · 0c1daee0

Xiao Guangrong authored Apr 20, 2011

The remove device path updates device->dev_alloc_list but does not hold the
chunk lock.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: fix the race between reading and updating devices · c9513edb

Xiao Guangrong authored Apr 20, 2011

On the btrfs_congested_fn and __unplug_io_fn paths, we should hold
device_list_mutex to prevent the remove/add device paths from
updating fs_devices->devices.

On the __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in
fs_devices->devices, or fs_devices->devices itself, are updated, so we should
hold the mutex to keep the reader side from reaching them.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: fix bh leak on __btrfs_open_devices path · 4f6c9328

Xiao Guangrong authored Apr 20, 2011

'bh' is not released if no error is detected.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: fix unsafe usage of merge_state · c7f895a2

Xiao Guangrong authored Apr 20, 2011

merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: allocate extent state and check the result properly · 8233767a

Xiao Guangrong authored Apr 20, 2011

The code doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
  allowed to wait

- in clear_extent_bit, it doesn't check the result after an atomic allocation;
  we trigger BUG_ON() if it fails

- if allocation fails, we trigger BUG_ON instead of returning -ENOMEM, since
  the return value of clear_extent_bit() is ignored by many callers
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


fs/btrfs: Add missing btrfs_free_path · b0839166

Julia Lawall authored May 14, 2011

btrfs_alloc_path should be matched with btrfs_free_path in error-handling code.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
local idexpression struct btrfs_path * x;
expression ra,rb;
position p1,p2;
@@

x = btrfs_alloc_path@p1(...)
...  when != btrfs_free_path(x,...)
     when != if (...) { ... btrfs_free_path(x,...) ...}
     when != x = ra
if(...) { ... when != x = rb
     when forall
     when != btrfs_free_path(x,...)
 \(return <+...x...+>; \| return@p2...; \) }

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

cocci.print_main("alloc",p1)
cocci.print_secs("return",p2)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: check return value of btrfs_inc_extent_ref() · 37daa4f9

Tsutomu Itoh authored Apr 28, 2011

If return value of btrfs_inc_extent_ref() is not 0, BUG() is called.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: return error to caller if read_one_inode() fails · c00e9493

Tsutomu Itoh authored Apr 28, 2011

When read_one_inode() fails, the error code is returned to the caller instead
of calling BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item · 1cd30799

Tsutomu Itoh authored May 19, 2011

Currently, btrfs_truncate_item and btrfs_extend_item return only 0.
So, the BUG_ON check in the caller is unnecessary.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: return error code to caller when btrfs_del_item fails · 65a246c5

Tsutomu Itoh authored May 19, 2011

The error code is returned instead of calling BUG_ON when
btrfs_del_item returns an error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: return error code to caller when btrfs_previous_item fails · b0b802d7

Tsutomu Itoh authored May 19, 2011

The error code is returned instead of calling BUG_ON when
btrfs_previous_item returns an error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: fix typo 'testeing' -> 'testing' · 27160b6b

Sergei Trofimovich authored May 20, 2011

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: typo: 'btrfS' -> 'btrfs' · 9694b3fc

Sergei Trofimovich authored May 20, 2011

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: don't spin in shrink_delalloc if there is nothing to free · c4f675cd

Sergei Trofimovich authored May 20, 2011

Observed as a large delay when --mixed filesystem is filled up.
Test example:
1. create tiny --mixed FS:
   $ dd if=/dev/zero of=2G.img seek=$((2048 * 1024 * 1024 - 1)) count=1 bs=1
   $ mkfs.btrfs --mixed 2G.img
   $ mount -oloop 2G.img /mnt/ut/
2. Try to fill it up:
   $ dd if=/dev/urandom of=10M.file bs=10240 count=1024
   $ seq 1 256 | while read file_no; do echo $file_no; time cp 10M.file ${file_no}.copy; done

Up to '200.copy' it goes fast, but once the disk fills up, _every_ -ENOSPC
message takes 3 seconds to pop up (and in user-mode Linux it's even more:
30-60 seconds!). (Maybe the time depends on the kernel's timer resolution.)

No IO, no CPU load, just rescheduling. Some debugging revealed busy spinning
in shrink_delalloc.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: Delete unused version.sh script. · 0f3b708c

Jamey Sharp authored May 05, 2011

In 2008, commit b4f6c45d dropped the use
of fs/btrfs/version.sh, but left the script behind. Kill it.

Commit by Jamey Sharp and Josh Triplett.
Signed-off-by: Jamey Sharp <jamey@minilop.net>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: Ensure the tree search ioctl returns the right number of records · e2156867

Hugo Mills authored May 14, 2011

Btrfs's tree search ioctl has a field to indicate that no more than a
given number of records should be returned. The ioctl doesn't honour
this, as the tested value is not incremented until the end of the
copy_to_sk function. This patch removes an unnecessary local variable,
and updates the num_found counter as each key is found in the tree.
Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
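
The counting fix can be illustrated with a small user-space model. It is a sketch only: `copy_matching_keys` and its parameters are hypothetical stand-ins for the copy loop, with plain integers in place of btrfs keys.

```python
def copy_matching_keys(keys, min_key, max_key, nr_items):
    """Copy keys in [min_key, max_key], returning at most nr_items of them."""
    found = []
    num_found = 0
    for key in keys:
        # The limit is tested against a counter that is current for every
        # key, so the loop cannot overshoot nr_items.
        if num_found >= nr_items:
            break
        if key < min_key or key > max_key:
            continue
        found.append(key)
        num_found += 1  # updated as each key is found, not at the end
    return found


print(copy_matching_keys(range(10), 2, 8, 3))  # [2, 3, 4]
```

If the counter were only updated after the loop, the limit test inside it would never trigger and every matching key would be copied.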


BTRFS: Remove unused node_lock · 0956c798

Andi Kleen authored May 18, 2011

240f62c8 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: leave spinning on lookup and map the leaf · d90c7321

Josef Bacik authored May 17, 2011

On lookup we only want to read the inode item, so leave the path spinning.  Also
we're just wholesale reading the leaf off, so map the leaf so we don't do a
bunch of kmap/kunmaps.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: check for duplicate entries in the free space cache · 207dde82

Josef Bacik authored May 13, 2011

If there are duplicate entries in the free space cache, discard the entire cache
and load it the old fashioned way.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
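
A minimal sketch of the "discard everything on a duplicate" policy, modelled in user space. The function name, the `(offset, length)` entry layout, and treating two entries with the same offset as duplicates are all assumptions made for illustration.

```python
def load_free_space_cache(entries):
    """Return the cached (offset, length) entries, or None if any is duplicated.

    None tells the caller to ignore the on-disk cache and rebuild the free
    space information the slow way.
    """
    seen_offsets = set()
    cache = []
    for offset, length in entries:
        if offset in seen_offsets:
            return None  # a duplicate means the whole cache is suspect
        seen_offsets.add(offset)
        cache.append((offset, length))
    return cache


print(load_free_space_cache([(0, 4096), (8192, 4096)]))  # [(0, 4096), (8192, 4096)]
print(load_free_space_cache([(0, 4096), (0, 4096)]))     # None
```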


Btrfs: don't try to allocate from a block group that doesn't have enough space · cca1c81f

Josef Bacik authored May 13, 2011

If we have a very large filesystem, we can spend a lot of time in
find_free_extent just trying to allocate from empty block groups.  So instead
check to see if the block group even has enough space for the allocation, and if
not go on to the next block group.
Signed-off-by: Josef Bacik <josef@redhat.com>
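
The early check can be sketched as follows. This is a user-space model, not find_free_extent itself: `pick_block_group` and the `(size, used)` tuples are hypothetical.

```python
def pick_block_group(block_groups, num_bytes):
    """Return the index of the first block group that could hold num_bytes.

    block_groups is a list of (size, used) tuples. Groups without enough
    free space are skipped before any expensive search inside them.
    """
    for index, (size, used) in enumerate(block_groups):
        if size - used < num_bytes:
            continue  # cannot possibly satisfy the allocation, move on
        return index
    return None


groups = [(1024, 1024), (1024, 1000), (1024, 100)]
print(pick_block_group(groups, 64))    # 2
print(pick_block_group(groups, 2048))  # None
```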


Btrfs: don't always do readahead · 026fd317

Josef Bacik authored May 13, 2011

Our readahead is sort of sloppy, and really isn't always needed. For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat. Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes. This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: try not to sleep as much when doing slow caching · 589d8ade

Josef Bacik authored May 11, 2011

When the fs is super full and we unmount the fs, we could get stuck in this
thing where unmount is waiting for the caching kthread to make progress and the
caching kthread keeps scheduling because we're in the middle of a commit. So
instead just let the caching kthread keep going and only yield if
need_resched(). This makes my horrible umount case go from taking up to 10
minutes to taking less than 20 seconds. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: kill BTRFS_I(inode)->block_group · d82a6f1d

Josef Bacik authored May 11, 2011

Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything useful. In addition to being completely useless, when we initialize
an inode we try and find a freeish block group to set as the inode's block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable. So just axe the thing altogether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: don't look at the extent buffer level 3 times in a row · 7e2355ba

Josef Bacik authored May 11, 2011

We have a bit of debugging in btrfs_search_slot to make sure the level of the
cow block is the same as the original block we were cow'ing.  I don't think I've
ever seen this tripped, so kill it.  This saves us 2 kmap's per level in our
search.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: map the node block when looking for readahead targets · cb25c2ea

Josef Bacik authored May 11, 2011

If we have particularly full nodes, we could call btrfs_node_blockptr up to 32
times, which is 32 pairs of kmap/kunmap, which _sucks_. So go ahead and map the
extent buffer while we look for readahead targets. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
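
The saving can be made concrete with a toy model that counts mapping operations. Everything here is hypothetical: `ExtentBuffer` merely counts `map()` calls and does not reflect the kernel's extent buffer API.

```python
class ExtentBuffer:
    """Toy buffer that counts how often it is mapped."""

    def __init__(self, blockptrs):
        self._blockptrs = list(blockptrs)
        self.map_calls = 0

    def map(self):
        self.map_calls += 1  # stands in for one kmap
        return self._blockptrs

    def unmap(self):
        pass  # stands in for the matching kunmap


def blockptrs_unmapped(eb, slots):
    """Map and unmap around every single pointer read."""
    result = []
    for slot in slots:
        ptrs = eb.map()
        result.append(ptrs[slot])
        eb.unmap()
    return result


def blockptrs_mapped_once(eb, slots):
    """Map the buffer once, then read all pointers from the mapping."""
    ptrs = eb.map()
    try:
        return [ptrs[slot] for slot in slots]
    finally:
        eb.unmap()


slow, fast = ExtentBuffer(range(32)), ExtentBuffer(range(32))
blockptrs_unmapped(slow, range(32))
blockptrs_mapped_once(fast, range(32))
print(slow.map_calls, fast.map_calls)  # 32 1
```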


Btrfs: set range_start to the right start in count_range_bits · af60bed2

Josef Bacik authored May 04, 2011

In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results.  For example, if the range

[0-8192]

is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096].  So instead set range_start = max(cur_start, state->start).
This makes everything come out right.  Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
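
The example from this message can be reproduced with a small user-space model. It is a sketch under stated assumptions: ranges are treated as half-open `[start, end)` byte ranges to keep the arithmetic simple, and the function signature is hypothetical rather than that of count_range_bits.

```python
def count_range_bits(states, search_start, search_end):
    """Count bytes with the bit set inside [search_start, search_end).

    states is a sorted list of non-overlapping (start, end) ranges that
    have the bit set. Returns (total_bytes, range_start).
    """
    cur_start = search_start
    total_bytes = 0
    range_start = None
    for start, end in states:
        if end <= cur_start:
            continue  # state lies entirely before the search range
        if start >= search_end:
            break  # state lies entirely after the search range
        total_bytes += min(search_end, end) - max(cur_start, start)
        if range_start is None:
            # The fix: clamp the reported start to the searched range
            # instead of reporting the state's own start.
            range_start = max(cur_start, start)
    return total_bytes, range_start


# [0-8192] is set DELALLOC and we search 4096-8192.
print(count_range_bits([(0, 8192)], 4096, 8192))  # (4096, 4096)
```

Reporting `start` unclamped would give `(4096, 0)` here, which is the misleading result the message describes.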


Btrfs: fix how we do space reservation for truncate · fcb80c2a

Josef Bacik authored May 03, 2011

The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up. This is because they tend to do snapshots
a lot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen. This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen. So in order
to fix this we need to split all of the reservation stuff up. So with this patch
we have

1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.

2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.

3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.

Hopefully this will make the ceph problem go away. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: kill trans_mutex · a4abeea4

Josef Bacik authored Apr 11, 2011

We use trans_mutex for lots of things, here's a basic list

1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists

Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction. So replace the
trans_mutex with a trans_lock spinlock and use it to do the following

1) Protect fs_info->running_transaction. All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list. This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list. This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work. So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.

I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: if we've already started a trans handle, use that one · 2a1eb461

Josef Bacik authored Apr 13, 2011

We currently track trans handles in current->journal_info, but we don't actually
use it. This patch fixes it. This will cover the case where we have multiple
people starting transactions down the call chain. This keeps us from having to
allocate a new handle and all of that, we just increase the use count of the
current handle, save the old block_rsv, and return. I tested this with xfstests
and it worked out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>


Btrfs: take away the num_items argument from btrfs_join_transaction · 7a7eaa40

Josef Bacik authored Apr 13, 2011

I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :). So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
