Commits · 89ce8a63d0c761fbb02089850605360f389477d8 · nexedi / linux

25 Sep, 2008 40 commits

Add btrfs_end_transaction_throttle to force writers to wait for pending commits · 89ce8a63

Chris Mason authored Jun 25, 2008

The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.

There is no wait inside file_write, inode updates, or cow filling, all of which
have different deadlock possibilities.

This is a temporary measure until better asynchronous commit support is
added.  This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
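
As a rough userspace illustration of the idea (not the btrfs code itself), the sketch below has each writer wait at the end of its operation whenever a commit is pending. The names `struct txn`, `txn_start`, `txn_end_throttle`, and `txn_commit` are hypothetical, and pthread primitives stand in for the kernel's wait queues.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a running transaction. */
struct txn {
    pthread_mutex_t lock;
    pthread_cond_t writers_gone;  /* signalled when num_writers drops to 0 */
    pthread_cond_t commit_done;   /* broadcast when a commit finishes */
    int num_writers;
    bool commit_pending;
};

void txn_init(struct txn *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->writers_gone, NULL);
    pthread_cond_init(&t->commit_done, NULL);
    t->num_writers = 0;
    t->commit_pending = false;
}

/* Join the current transaction as a writer. */
void txn_start(struct txn *t)
{
    pthread_mutex_lock(&t->lock);
    t->num_writers++;
    pthread_mutex_unlock(&t->lock);
}

/*
 * Leave the transaction, then wait out any pending commit before
 * returning. Returns true if the caller had to wait.
 */
bool txn_end_throttle(struct txn *t)
{
    bool waited = false;

    pthread_mutex_lock(&t->lock);
    assert(t->num_writers > 0);
    if (--t->num_writers == 0)
        pthread_cond_signal(&t->writers_gone);
    while (t->commit_pending) {
        waited = true;
        pthread_cond_wait(&t->commit_done, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);
    return waited;
}

/* Mark a commit as pending, wait for writers to drain, then finish. */
void txn_commit(struct txn *t)
{
    pthread_mutex_lock(&t->lock);
    t->commit_pending = true;
    while (t->num_writers > 0)
        pthread_cond_wait(&t->writers_gone, &t->lock);
    /* ... the transaction would be written out here ... */
    t->commit_pending = false;
    pthread_cond_broadcast(&t->commit_done);
    pthread_mutex_unlock(&t->lock);
}
```

Because a writer that finishes while a commit is pending blocks in `txn_end_throttle`, it cannot immediately start another operation, which gives the writer count a chance to reach zero.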


Btrfs: Fix snapshot deletion to release the alloc_mutex much more often. · 333db94c
Chris Mason authored Jun 25, 2008
```
This lowers the impact of snapshot deletion on the rest of the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it · 5cd57b2c

Chris Mason authored Jun 25, 2008

Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree.  But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups.  A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Fix btrfs_next_leaf to check for new items after dropping locks · 168fd7d2
Chris Mason authored Jun 25, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks · 594a24eb

Chris Mason authored Jun 25, 2008

This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Drop locks in btrfs_search_slot when reading a tree block. · 051e1b9f

Chris Mason authored Jun 25, 2008

One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree. This drops
locks held by a path when doing reads during a tree search.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Replace the big fs_mutex with a collection of other locks · a2135011

Chris Mason authored Jun 25, 2008

Extent allocations are still protected by a large alloc_mutex.
Objectid allocations are covered by an objectid mutex.
Other btree operations are protected by a lock on individual btree nodes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Start btree concurrency work. · 925baedd

Chris Mason authored Jun 25, 2008

The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation locking is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
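
The top-down locking described above is often called lock coupling. A minimal userspace sketch follows, assuming pthread mutexes in place of the kernel's per-block locks; `struct node`, `struct path`, `search_slot`, and `release_path` here are simplified, hypothetical stand-ins and not the btrfs definitions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_LEVEL 8   /* hypothetical maximum tree height */
#define FANOUT    4   /* hypothetical number of pointers per node */

struct node {
    pthread_mutex_t lock;
    int level;                      /* 0 is a leaf */
    int nkeys;
    int keys[FANOUT];               /* sorted; keys[i] is the smallest key under children[i] */
    struct node *children[FANOUT];  /* unused in leaves */
};

struct path {
    struct node *nodes[MAX_LEVEL];
    int locked[MAX_LEVEL];
};

void node_init(struct node *n, int level)
{
    pthread_mutex_init(&n->lock, NULL);
    n->level = level;
    n->nkeys = 0;
}

/* Pick the last slot whose key is <= the search key (slot 0 otherwise). */
static int pick_slot(const struct node *n, int key)
{
    int slot = 0;

    for (int i = 1; i < n->nkeys; i++)
        if (n->keys[i] <= key)
            slot = i;
    return slot;
}

static void path_lock(struct path *p, struct node *n)
{
    assert(n->level >= 0 && n->level < MAX_LEVEL);
    pthread_mutex_lock(&n->lock);
    p->nodes[n->level] = n;
    p->locked[n->level] = 1;
}

static void path_unlock(struct path *p, int level)
{
    pthread_mutex_unlock(&p->nodes[level]->lock);
    p->locked[level] = 0;
}

/* Walk from the root to a leaf; on return only the leaf is locked. */
void search_slot(struct node *root, int key, struct path *p)
{
    struct node *cur = root;

    for (int i = 0; i < MAX_LEVEL; i++) {
        p->nodes[i] = NULL;
        p->locked[i] = 0;
    }
    path_lock(p, cur);
    while (cur->level > 0) {
        struct node *child = cur->children[pick_slot(cur, key)];

        path_lock(p, child);         /* lock the lower level first ... */
        path_unlock(p, cur->level);  /* ... then release the upper one */
        cur = child;
    }
}

/* Drop any locks the path still holds and clear it. */
void release_path(struct path *p)
{
    for (int i = 0; i < MAX_LEVEL; i++) {
        if (p->locked[i])
            path_unlock(p, i);
        p->nodes[i] = NULL;
    }
}
```

Taking the child lock before dropping the parent means no other thread can modify the pointer being followed in between the two steps.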


Btrfs: Add a thread pool just for submit_bio · 1cc127b5

Chris Mason authored Jun 12, 2008

If a bio submission is queued behind a lock holder that is waiting for
that bio on the same work queue, it is possible to deadlock.  Move the
bios into their own pool.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


BTRFS_IOC_TRANS_START should be privileged · df5b5520

Christoph Hellwig authored Jun 11, 2008

As mentioned in the comment next to it, btrfs_ioctl_trans_start can
do serious damage to filesystems and thus should be limited to privileged
users.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: split out ioctl.c · f46b5a66

Christoph Hellwig authored Jun 11, 2008

Split the ioctl handling out of inode.c into a file of its own.
Also fix up checkpatch.pl warnings for the moved code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: kerneldoc comments for extent_map.c · 9d2423c5

Christoph Hellwig authored Jun 11, 2008

Add kerneldoc comments for all exported functions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add a mount option to control worker thread pool size · 4543df7e

Chris Mason authored Jun 11, 2008

mount -o thread_pool_size changes the default, which is
min(num_cpus + 2, 8).  Larger thread pools would make more sense on
very large disk arrays.

This mount option controls the max size of each thread pool.  There
are multiple thread pools, so the total worker count will be larger
than the mount option.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
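
The default quoted above can be written out as a small helper. This is only a sketch of the arithmetic in the commit message: `default_thread_pool_size` and `max_total_workers` are hypothetical names, and the number of pools is passed in as a parameter because the message does not state it.

```c
#include <assert.h>

/* Default per-pool limit as described: min(num_cpus + 2, 8). */
int default_thread_pool_size(int num_cpus)
{
    int size;

    assert(num_cpus > 0);
    size = num_cpus + 2;
    return size < 8 ? size : 8;
}

/*
 * Upper bound on workers across all pools, assuming every pool is
 * allowed to grow to the same per-pool limit (an assumption made
 * for this illustration).
 */
int max_total_workers(int per_pool_max, int num_pools)
{
    assert(per_pool_max > 0 && num_pools > 0);
    return per_pool_max * num_pools;
}
```

For example, a 4-CPU machine would default to 6 threads per pool, and any machine with 6 or more CPUs would hit the cap of 8 unless `thread_pool_size` is given.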


Btrfs: Worker thread optimizations · 35d8ba66

Chris Mason authored Jun 11, 2008

This changes the worker thread pool to maintain a list of idle threads,
avoiding a complex search for a good thread to wake up.

Threads have two states:

idle - we try to reuse the last thread used in hopes of improving the batching
ratios

busy - each time a new work item is added to a busy task, the task is
rotated to the end of the line.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
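
The two states can be modeled with a stack of idle workers and a rotating queue of busy ones. The sketch below is a simplified illustration under that assumption, not the kernel implementation: workers are plain integer ids, and `struct pool`, `pool_pick_worker`, and `pool_worker_idle` are hypothetical names.

```c
#include <assert.h>

#define MAX_WORKERS 8  /* hypothetical pool capacity */

struct pool {
    int idle[MAX_WORKERS];  /* stack; top is the most recently idled worker */
    int n_idle;
    int busy[MAX_WORKERS];  /* queue; index 0 is the next busy worker to get work */
    int n_busy;
};

void pool_init(struct pool *p, int nworkers)
{
    assert(nworkers > 0 && nworkers <= MAX_WORKERS);
    p->n_idle = 0;
    p->n_busy = 0;
    for (int id = 0; id < nworkers; id++)
        p->idle[p->n_idle++] = id;
}

/* Choose the worker that should receive the next work item. */
int pool_pick_worker(struct pool *p)
{
    int id;

    if (p->n_idle > 0) {
        /* Reuse the most recently idled worker; it becomes busy. */
        id = p->idle[--p->n_idle];
        p->busy[p->n_busy++] = id;
        return id;
    }

    /* No idle worker: use the busy head and rotate it to the tail. */
    assert(p->n_busy > 0);
    id = p->busy[0];
    for (int i = 1; i < p->n_busy; i++)
        p->busy[i - 1] = p->busy[i];
    p->busy[p->n_busy - 1] = id;
    return id;
}

/* Called when a worker has drained its queue and goes idle. */
void pool_worker_idle(struct pool *p, int id)
{
    for (int i = 0; i < p->n_busy; i++) {
        if (p->busy[i] != id)
            continue;
        for (int j = i + 1; j < p->n_busy; j++)
            p->busy[j - 1] = p->busy[j];
        p->n_busy--;
        p->idle[p->n_idle++] = id;
        return;
    }
}
```

Picking a worker never searches for the least loaded thread: it is either a pop from the idle stack or a rotation of the busy queue.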


Btrfs: Add backport for the kthread work on kernels older than 2.6.20 · d05e5a4d
Chris Mason authored Jun 11, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Fix mount -o max_inline=0 · 15ada040

Chris Mason authored Jun 11, 2008

max_inline=0 used to force the max_inline size to one sector instead.  Now
it properly disables inline data items, while still being able to read
any that happen to exist on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add async worker threads for pre and post IO checksumming · 8b712842

Chris Mason authored Jun 11, 2008

Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system.  But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.

This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up.  The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.

The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines.  Two worker pools are created,
one for writes and one for endio processing.  The two could deadlock if
we tried to service both from a single pool.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: allow scanning multiple devices during mount · 43e570b0

Christoph Hellwig authored Jun 10, 2008

Allows specifying one or more device=/dev/foo options during mount
so that ioctls on the control device can be avoided.  Especially useful
when trying to mount a multi-device setup as root.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: sanity mount option parsing and early mount code · edf24abe

Christoph Hellwig authored Jun 10, 2008

Also adds lots of comments to describe what's going on here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs: fix strange indentation in lookup_extent_mapping · 306929f3
Christoph Hellwig authored Jun 10, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

btrfs: tiny makefile cleanup · 95c9eb17

Christoph Hellwig authored Jun 10, 2008

Use normal kbuild syntax to build acl.o conditionally and remove
commented-out lines.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: transaction ioctls · 6bf13c0c

Sage Weil authored Jun 10, 2008

These ioctls let a user application hold a transaction open while it
performs a series of operations.  A final ioctl does a sync on the fs
(closing the current transaction).  This is the main requirement for
Ceph's OSD to be able to keep the data it's storing in a btrfs volume
consistent, and AFAICS it works just fine.  The application would do
something like

	fd = ::open("some/file", O_RDONLY);
	::ioctl(fd, BTRFS_IOC_TRANS_START);
	/* do a bunch of stuff */
	::ioctl(fd, BTRFS_IOC_TRANS_END);
or just
	::close(fd);

And to ensure it commits to disk,

	::ioctl(fd, BTRFS_IOC_SYNC);

When a transaction is held open, the trans_handle is attached to the
struct file (via private_data) so that it will get cleaned up if the
process dies unexpectedly.  A held transaction is also ended on fsync() to
avoid a deadlock.

A misbehaving application could also deliberately hold a transaction open,
effectively locking up the FS, so it may make sense to restrict something
like this to root or something.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Disable acl xattr handlers · eba12c7b

Yan authored Jun 09, 2008

The acl code is not yet complete, and the xattr handlers are causing
problems for cp -p on some distros.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: bdi_init and bdi_destroy come with 2.6.23 · 51ebc0d3
Jan Engelhardt authored Jun 09, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

btrfsctl -A error code fixup · f819d837

Linda Knippers authored Jun 09, 2008

Send the error back to userland if the ioctl fails
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Invalidate dcache entry after creating snapshot and · 3b96362c

Sven Wegener authored Jun 09, 2008

We need to invalidate an existing dcache entry after creating a new
snapshot or subvolume, because a negative dcache entry will stop us from
accessing the new snapshot or subvolume.

---
  ctree.h       |   23 +++++++++++++++++++++++
  inode.c       |    4 ++++
  transaction.c |    4 ++++
  3 files changed, 31 insertions(+)
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Fix race in running_transaction checks · 48ec2cf8

Chris Mason authored Jun 09, 2008

When a new transaction was started, the code would incorrectly
set the pointer in fs_info before all the data structures were set up.
fsync-heavy workloads hit races on the setup of the ordered inode spinlock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


btrfs delete ordered inode handling fix · e1b81e67

Mingming authored May 27, 2008

Use btrfs_release_file instead of a put_inode call
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Always use the async submission queue for checksummed writes · da496f2a

Chris Mason authored May 27, 2008

This avoids IO stalls and poorly ordered IO from inline writers mixing in
with the async submission queue
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Allocator fix variety pack · 0ef3e66b

Chris Mason authored May 24, 2008

* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
  next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
  twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
  try to allocate from devices that don't exist
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Use kzalloc on the fs_devices allocation · 515dc322
Chris Mason authored May 16, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Handle transid == 0 while opening devices · 6af5ac3c
Chris Mason authored May 16, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Enable btree balancing on old kernels again · 1c8cfcc1
Chris Mason authored May 16, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Change the congestion functions to meter the number of async submits as well · cb03c743

Chris Mason authored May 15, 2008

The async submit workqueue was absorbing too many requests, leading to long
stalls in the async submitters.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Fix corners in writepage and btrfs_truncate_page · 211c17f5

Chris Mason authored May 15, 2008

The extent_io writepage calls needed an extra check for discarding
pages that started on the last byte in the file.

btrfs_truncate_page needed checks to make sure the page was still part
of the file after reading it, and most importantly, needed to wait for
all IO to the page to finish before freeing the corresponding extents on
disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Fix btrfs_open_devices to deal with changes since the scan ioctls · a0af469b

Chris Mason authored May 13, 2008

Devices can change after the scan ioctls are done, and btrfs_open_devices
needs to be able to verify them as they are opened and used by the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Add mount -o degraded to allow mounts to continue with missing devices · dfe25020
Chris Mason authored May 13, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```

Btrfs: Handle write errors on raid1 and raid10 · 1259ab75

Chris Mason authored May 12, 2008

When duplicate copies exist, writes are allowed to fail to one of those
copies.  This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes are detected.
Signed-off-by: Chris Mason <chris.mason@oracle.com>


Btrfs: Pass down the expected generation number when reading tree blocks · ca7a79ad
Chris Mason authored May 12, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```
Btrfs: Don't do btree balance_dirty_pages on old kernels, it stalls forever · 188de649
Chris Mason authored May 09, 2008
```
Signed-off-by: Chris Mason <chris.mason@oracle.com>
```