Commits · 0878ae2db83a10894724cdeaba7ef9f1ac1c9ac8 · nexedi / linux

17 Jul, 2013 1 commit

Merge branch 'bcache-for-3.11' of git://evilpiepirate.org/~kent/linux-bcache into for-3.11/drivers · 0878ae2d

Jens Axboe authored Jul 16, 2013

Kent writes:

Hey Jens - I've been busy torture testing and chasing bugs, here's the
fruits of my labors. These are all fairly small fixes, some of them
quite important.

0878ae2d

12 Jul, 2013 8 commits

bcache: Allocation kthread fixes · 79826c35

Kent Overstreet authored Jul 10, 2013

The alloc kthread should've been using try_to_freeze() - and also there
was the potential for the alloc kthread to get woken up after it had
shut down, which would have been bad.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>

79826c35

bcache: Fix GC_SECTORS_USED() calculation · 29ebf465

Kent Overstreet authored Jul 11, 2013

Part of the job of garbage collection is to add up however many sectors
of live data it finds in each bucket, but that doesn't work very well if
it doesn't reset GC_SECTORS_USED() when it starts. Whoops.

This wouldn't have broken anything horribly, but allocation tries to
preferentially reclaim buckets that are mostly empty and that's not
gonna work with an incorrect GC_SECTORS_USED() value.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
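
A minimal userspace sketch of the counting problem described above, not bcache's actual code. `struct bucket`, `gc_start()` and `gc_mark()` are hypothetical names chosen for illustration; the point is only that the per-bucket counter has to be zeroed before a pass starts adding to it.

```c
#include <stddef.h>

/* Hypothetical stand-in for a cache bucket; not bcache's real struct. */
struct bucket {
    unsigned sectors_used; /* live sectors counted by the current GC pass */
};

/* Reset every per-bucket counter before a GC pass begins. */
static void gc_start(struct bucket *buckets, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buckets[i].sectors_used = 0;
}

/* Account for one live extent found in bucket b during the pass. */
static void gc_mark(struct bucket *b, unsigned sectors)
{
    b->sectors_used += sectors;
}
```

Without the reset in `gc_start()`, a second pass over the same 8 live sectors would report 16, and a mostly empty bucket would look fuller than it is.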

29ebf465

bcache: Journal replay fix · faa56736

Kent Overstreet authored Jul 11, 2013

The journal replay code starts by finding something that looks like a
valid journal entry, then it does a binary search over the unchecked
region of the journal for the journal entries with the highest sequence
numbers.

Trouble is, the logic was wrong - journal_read_bucket() returns true if
it found journal entries we need, but if the range of journal entries
we're looking for loops around the end of the journal - in that case
journal_read_bucket() could return true when it hadn't found the highest
sequence number we'd seen yet, and in that case the binary search did
the wrong thing. Whoops.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

faa56736

bcache: Shutdown fix · 5caa52af

Kent Overstreet authored Jul 10, 2013

Stopping a cache set is supposed to make it stop attached backing
devices, but somewhere along the way that code got lost. Fixing this
mainly has the effect of fixing our reboot notifier.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

5caa52af

bcache: Fix a sysfs splat on shutdown · c9502ea4

Kent Overstreet authored Jul 10, 2013

If we stopped a bcache device when we were already detaching (or
something like that), bcache_device_unlink() would try to remove a
symlink from sysfs that was already gone because the bcache dev kobject
had already been removed from sysfs.

So keep track of whether we've removed stuff from sysfs.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

c9502ea4

bcache: Advertise that flushes are supported · 54d12f2b

Kent Overstreet authored Jul 10, 2013

Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
out unless we say we support them...
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

54d12f2b

bcache: check for allocation failures · d2a65ce2

Dan Carpenter authored Jul 05, 2013

There is a missing NULL check after the kzalloc().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
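
The pattern being fixed can be sketched in userspace with `calloc()`, which, like `kzalloc()`, returns zeroed memory or NULL. The commit does not say which structure was allocated, so `struct widget` and `widget_create()` below are purely hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical structure; the commit does not name what was allocated. */
struct widget {
    int id;
};

/*
 * Allocate and initialise a widget. The allocation can fail, so the
 * result is checked before it is dereferenced.
 */
static int widget_create(struct widget **out, int id)
{
    struct widget *w = calloc(1, sizeof(*w));

    if (!w)
        return -ENOMEM;

    w->id = id;
    *out = w;
    return 0;
}
```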

d2a65ce2

bcache: Fix a dumb race · 6aa8f1a6

Kent Overstreet authored Jul 10, 2013

In the far-too-complicated closure code - closures can have destructors,
for probably dubious reasons; they get run after the closure is no
longer waiting on anything but before dropping the parent ref, intended
just for freeing whatever memory the closure is embedded in.

Trouble is, when remaining goes to 0 and we've got nothing more to run -
we also have to unlock the closure, setting remaining to -1. If there's
a destructor, that unlock isn't doing anything - nobody could be trying
to lock it if we're about to free it - but if the unlock _is_ needed...
that check for a destructor was racy. Argh.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: linux-stable <stable@vger.kernel.org> # >= v3.10

6aa8f1a6

02 Jul, 2013 2 commits

Merge branch 'bcache-for-3.11' of git://evilpiepirate.org/~kent/linux-bcache into for-3.11/drivers · d0e3d023

Jens Axboe authored Jul 02, 2013

d0e3d023

Merge tag 'v3.10-rc7' into for-3.11/drivers · 5f0e5afa

Jens Axboe authored Jul 02, 2013

Linux 3.10-rc7

Pull this in early to avoid doing it with the bcache merge,
since there are a number of changes to bcache between my old
base (3.10-rc1) and the new pull request.

5f0e5afa

01 Jul, 2013 5 commits

bcache: Use standard utility code · 8e51e414

Kent Overstreet authored Jun 06, 2013

Some of bcache's utility code has made it into the rest of the kernel,
so drop the bcache versions.

Bcache used to have a workaround for allocating from a bio set under
generic_make_request() (if you allocated more than once, the bios you
already allocated would get stuck on current->bio_list when you
submitted, and you'd risk deadlock) - bcache would mask out __GFP_WAIT
when allocating bios under generic_make_request() so that allocation
could fail and it could retry from workqueue. But bio_alloc_bioset() has
a workaround now, so we can drop this hack and the associated error
handling.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

8e51e414

bcache: Update email address · 47cd2eb0

Kent Overstreet authored Jul 01, 2013

Signed-off-by: Kent Overstreet <kmo@daterainc.com>

47cd2eb0

bcache: Delete fuzz tester · f3059a54

Kent Overstreet authored May 15, 2013

This code has rotted and it hasn't been used in ages anyways.
Signed-off-by: Kent Overstreet <kmo@daterainc.com>

f3059a54

bcache: Document shrinker reserve better · 36c9ea98

Kent Overstreet authored Jun 03, 2013

Signed-off-by: Kent Overstreet <kmo@daterainc.com>

36c9ea98

bcache: FUA fixes · e49c7c37

Kent Overstreet authored Jun 26, 2013

Journal writes need to be marked FUA, not just REQ_FLUSH. And btree node
writes have... weird ordering requirements.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

e49c7c37

28 Jun, 2013 7 commits

drbd: Allow online change of al-stripes and al-stripe-size · d752b269

Philipp Reisner authored Jun 25, 2013

Allow changing the AL layout with a resize operation. For that
the resize command gets two new fields: al_stripes and al_stripe_size.

In order to make the operation crash safe:
1) Lock out all IO and MD-IO
2) Write the super block with MDF_PRIMARY_IND clear
3) Write the bitmap to the new location (all zeros, since
   we allow only while connected)
4) Initialize the new AL-area
5) Write the super block with the restored MDF_PRIMARY_IND.
6) Unfreeze all IO

Since the AL-layout has no influence on the protocol, this operation
needs to be performed on both sides of a resource (if intended).
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d752b269

drbd: Constants should be UPPERCASE · e96c9633

Philipp Reisner authored Jun 25, 2013

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e96c9633

drbd: Ignore the exit code of a fence-peer handler if it returns too late · 28e448bb

Philipp Reisner authored Jun 25, 2013

In case the connection was established and lost again before
a fence-peer handler returns, ignore the exit code of this
instance (and use the exit code of the instance started later).
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

28e448bb

drbd: Fix rcu_read_lock balance on error path · f9eb7bf4

Andreas Gruenbacher authored Jun 25, 2013

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f9eb7bf4

drbd: fix error return code in drbd_init() · 6110d70b

Wei Yongjun authored Jun 25, 2013

Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
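
The class of bug fixed here can be sketched as follows. This is a generic userspace illustration, not drbd_init() itself: `example_init()` and its `fail_alloc` parameter are hypothetical, and the point is only that the error variable has to hold a negative code before control reaches the shared cleanup label.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical init routine; fail_alloc simulates an allocation failure. */
static int example_init(int fail_alloc)
{
    int err;
    void *pool = fail_alloc ? NULL : malloc(64);

    if (!pool) {
        err = -ENOMEM; /* set the code before jumping to cleanup */
        goto fail;
    }

    free(pool);
    return 0;

fail:
    /* Cleanup of partially initialised state would go here. */
    return err;
}
```

If `err` were left at 0 on that path, the caller would treat a failed initialisation as success.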

6110d70b

drbd: Do not sleep inside rcu · 26ea8f92

Andreas Gruenbacher authored Jun 25, 2013

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

26ea8f92

Merge branch 'stable/for-jens-3.10' of... · f35546e0

Jens Axboe authored Jun 28, 2013

Merge branch 'stable/for-jens-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers

Konrad writes:

It has the 'feature-max-indirect-segments' implemented in both backend
and frontend. The current problem with the backend and frontend is that the
segment size is limited to 11 pages. It means we can at most squeeze in 44kB per
request. The ring can hold 32 (next power of two below 36) requests, meaning we
can do 1.4M of outstanding requests. Nowadays that is not enough.

The problem in the past was addressed in two ways - but neither one went upstream.
The first solution to this proposed by Justin from Spectralogic was to negotiate
the segment size. This means that the ‘struct blkif_sring_entry’ is now a variable size.
It can expand from 112 bytes (cover 11 pages of data - 44kB) to 1580 bytes
(256 pages of data - so 1MB). It is a simple extension by just making the array in the
request expand from 11 to a variable size negotiated. But it had limits: this extension
still limits the number of segments per request to 255 (as the total number must be
specified in the request, which only has an 8-bit field for that purpose).

The other solution (from Intel - Ronghui) was to create one extra ring that only has the
‘struct blkif_request_segment’ in them. The ‘struct blkif_request’ would be changed to have
an index in said ‘segment ring’. There is only one segment ring. This means that the size of
the initial ring is still the same. The requests would point to the segment and enumerate out
how many of the indexes it wants to use. The limit is of course the size of the segment.
If one assumes a one-page segment this means we can in one request cover ~4MB.

Those patches were posted as RFC and the author never followed up on the ideas on changing
it to be a bit more flexible.

There is yet another mechanism that could be employed (which these patches implement) - and it
borrows from the VirtIO protocol. And that is the ‘indirect descriptors’. This is very similar to
what Intel suggests, but with a twist. The twist is to negotiate how many of these
'segment' pages (aka indirect descriptor pages) we want to support (in reality we negotiate
how many entries in the segment we want to cover, and we modulo the number if it is
bigger than the segment size).

This means that with the existing 36 slots in the ring (single page) we can cover:
32 slots * each blkif_request_indirect covers: 512 * 4096 ~= 64M. Since we have ample space
in the blkif_request_indirect to span more than one indirect page, that number (64M)
can be also multiplied by eight = 512MB.

Roger Pau Monne took the idea and implemented them in these patches. They work
great and the corner cases (migration between backends with and without this extension)
work nicely. The backend has a limit right now of how many indirect entries
it can handle: one indirect page, and at maximum 256 entries (out of 512 - so 50% of the page
is used). That comes out to 32 slots * 256 entries in an indirect page * 1 indirect page
per request * 4096 = 32MB.
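
The capacity figures quoted in this message can be checked with a few lines of arithmetic. The sketch below assumes the 4096-byte page and the 32 usable ring slots stated in the message; `ring_capacity()` is a hypothetical helper, not part of the Xen block protocol.

```c
#include <stdint.h>

#define PAGE_BYTES 4096u /* page size assumed in the message */
#define RING_SLOTS 32u   /* usable requests in a single-page ring */

/* Bytes in flight when every ring slot carries `segs` one-page segments. */
static uint64_t ring_capacity(uint64_t segs)
{
    return (uint64_t)RING_SLOTS * segs * PAGE_BYTES;
}
```

With 11 segments per request this gives 1,441,792 bytes (1408 KiB, the "1.4M" above); with 512 entries it gives 64 MiB; with eight indirect pages of 512 entries each, 512 MiB; and with the backend's current limit of 256 entries, 32 MiB.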

This is a conservative number that can change in the future. Right now it strikes
a good balance between giving excellent performance, memory usage in the backend, and
balancing the needs of many guests.

In the patchset there is also the split of the blkback structure to be per-VBD.
This means that the spinlock contention we had with many guests trying to do I/O and
all the blkback threads hitting the same lock has been eliminated.

Also there are bug-fixes to deal with oddly sized sectors, insane amounts on
the ring, and also a security fix (posted earlier).

f35546e0

27 Jun, 2013 15 commits

bcache: Refresh usage docs · cecd628d

Gabriel de Perthuis authored Jun 27, 2013

Mention udev autoregistration, symlinks.  Write down some sysfs paths.
Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>

cecd628d

bcache: Send label uevents · ab9e1400

Gabriel de Perthuis authored Jun 09, 2013

Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>

ab9e1400

bcache: Send a uevent with a cached device's UUID · a25c32be

Gabriel de Perthuis authored Jun 07, 2013

Signed-off-by: Gabriel de Perthuis <g2p.code@gmail.com>

a25c32be

doc: Fix typo in documentation/bcache.txt · bd206b51

Masanari Iida authored May 20, 2013

Correct spelling typo in documentation/bcache.txt
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>

bd206b51

bcache: Write out full stripes · 72c27061

Kent Overstreet authored Jun 05, 2013

Now that we're tracking dirty data per stripe, we can add two
optimizations for raid5/6:

 * If a stripe is already dirty, force writes to that stripe to
   writeback mode - to help build up full stripes of dirty data

 * When flushing dirty data, preferentially write out full stripes first
   if there are any.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

72c27061

bcache: Track dirty data by stripe · 279afbad

Kent Overstreet authored Jun 05, 2013

To make background writeback aware of raid5/6 stripes, we first need to
track the amount of dirty data within each stripe - we do this by
breaking up the existing sectors_dirty into per-stripe atomic_ts.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
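
One way to picture per-stripe dirty accounting is an array of atomic counters indexed by `offset / stripe_size`. The sketch below is a userspace illustration using C11 atomics, not the bcache implementation: `struct dirty_map` and all of its helpers are hypothetical, and callers are assumed to pass ranges that lie inside the device.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-device state; names are illustrative only. */
struct dirty_map {
    uint64_t stripe_sectors;  /* stripe size in sectors */
    size_t nr_stripes;
    atomic_int *stripe_dirty; /* dirty sector count per stripe */
};

static int dirty_map_init(struct dirty_map *m, uint64_t dev_sectors,
                          uint64_t stripe_sectors)
{
    if (!stripe_sectors)
        return -1;

    m->stripe_sectors = stripe_sectors;
    m->nr_stripes = (size_t)((dev_sectors + stripe_sectors - 1) / stripe_sectors);
    m->stripe_dirty = malloc(m->nr_stripes * sizeof(*m->stripe_dirty));
    if (!m->stripe_dirty)
        return -1;

    for (size_t i = 0; i < m->nr_stripes; i++)
        atomic_init(&m->stripe_dirty[i], 0);
    return 0;
}

/* Add (or, for negative nr, remove) dirty sectors starting at offset. */
static void dirty_map_add(struct dirty_map *m, uint64_t offset, int nr)
{
    int sign = nr < 0 ? -1 : 1;
    uint64_t left = (uint64_t)(nr < 0 ? -nr : nr);

    while (left) {
        size_t stripe = (size_t)(offset / m->stripe_sectors);
        /* Sectors remaining in this stripe from offset onwards. */
        uint64_t room = m->stripe_sectors - offset % m->stripe_sectors;
        uint64_t s = left < room ? left : room;

        atomic_fetch_add(&m->stripe_dirty[stripe], sign * (int)s);
        offset += s;
        left -= s;
    }
}

/* A stripe is full when every sector in it is dirty. */
static int stripe_is_full(struct dirty_map *m, size_t stripe)
{
    return (uint64_t)atomic_load(&m->stripe_dirty[stripe]) == m->stripe_sectors;
}
```

A range that crosses a stripe boundary is split so that each stripe's counter only ever reflects its own sectors, which is what lets a writeback pass ask whether a given stripe is completely dirty.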

279afbad

bcache: Initialize sectors_dirty when attaching · 444fc0b6

Kent Overstreet authored May 11, 2013

Previously, dirty_data wouldn't get initialized until the first garbage
collection... which was a bit of a problem for background writeback (as
the PD controller keys off of it) and also confusing for users.

This is also prep work for making background writeback aware of raid5/6
stripes.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

444fc0b6

bcache: Improve lazy sorting · 6ded34d1

Kent Overstreet authored May 11, 2013

The old lazy sorting code was kind of hacky - rewrite in a way that
mathematically makes more sense; the idea is that the size of the sets
of keys in a btree node should increase by a more or less fixed ratio
from smallest to biggest.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

6ded34d1

bcache: Rip out pkey()/pbtree() · 85b1492e

Kent Overstreet authored May 14, 2013

Old gcc doesn't like the struct hack, and it is kind of ugly. So finish
off the work to convert pr_debug() statements to tracepoints, and delete
pkey()/pbtree().
Signed-off-by: Kent Overstreet <koverstreet@google.com>

85b1492e

bcache: Fix/revamp tracepoints · c37511b8

Kent Overstreet authored Apr 26, 2013

The tracepoints were reworked to be more sensible, and a null
pointer deref in one of the tracepoints was fixed.

Converted some of the pr_debug()s to tracepoints - this is partly a
performance optimization; it used to be that without DEBUG or
CONFIG_DYNAMIC_DEBUG pr_debug() was an empty macro; but at some point it
was changed to an empty inline function.

Some of the pr_debug() statements had rather expensive function calls as
part of the arguments, so this code was getting run unnecessarily even
on non debug kernels - in some fast paths, too.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
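
The cost difference between an empty macro and an empty inline function can be shown in plain C. The names below (`debug_macro`, `debug_inline`, `expensive_format`) are hypothetical stand-ins, not kernel APIs; the sketch only demonstrates that macro arguments disappear during preprocessing while function arguments with side effects still have to be evaluated.

```c
/* Counts how often the "expensive" argument expression is evaluated. */
static int expensive_calls;

static const char *expensive_format(void)
{
    expensive_calls++;
    return "state";
}

/* Empty macro: the arguments vanish at preprocessing time and never run. */
#define debug_macro(...) do { } while (0)

/* Empty inline function: the arguments are ordinary call arguments. */
static inline void debug_inline(const char *fmt, ...)
{
    (void)fmt;
}
```

Calling `debug_macro("%s\n", expensive_format())` leaves `expensive_calls` at 0, whereas the same call through `debug_inline()` increments it even though nothing is printed.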

c37511b8

bcache: Refactor btree io · 57943511

Kent Overstreet authored Apr 25, 2013

The most significant change is that btree reads are now done
synchronously, instead of asynchronously and doing the post read stuff
from a workqueue.

This was originally done because we can't block on IO under
generic_make_request(). But - we already have a mechanism to punt cache
lookups to workqueue if needed, so if we just use that we don't have to
deal with the complexity of doing things asynchronously.

The main benefit is this makes the locking situation saner; we can hold
our write lock on the btree node until we're finished reading it, and we
don't need that btree_node_read_done() flag anymore.

Also, for writes, btree_write() was broken out into btree_node_write()
and btree_leaf_dirty() - the old code with the boolean argument was dumb
and confusing.

The prio_blocked mechanism was improved a bit too: now the only counter
is in struct btree_write, and we don't mess with transferring a count from
struct btree anymore.

This required changing garbage collection to block prios at the start
and unblock when it finishes, which is cleaner than what it was doing
anyways (the old code had mostly the same effect, but was doing it in a
convoluted way)

And the btree iter btree_node_read_done() uses was converted to a real
mempool.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

57943511

bcache: Convert allocator thread to kthread · 119ba0f8

Kent Overstreet authored Apr 24, 2013

Using a workqueue when we just want a single thread is a bit silly.
Signed-off-by: Kent Overstreet <koverstreet@google.com>

119ba0f8

bcache: Warn when a device is already registered. · a9dd53ad

Gabriel de Perthuis authored May 04, 2013

Signed-off-by: Gabriel de Perthuis <g2p.code+bcache@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>

a9dd53ad

bcache: fix a spurious gcc complaint, use scnprintf · bbc77aa7

Kent Overstreet authored May 28, 2013

An old version of gcc was complaining about using a const int as the
size of a stack allocated array. Which should be fine - but using
ARRAY_SIZE() is better, anyways.

Also, refactor the code to use scnprintf().
Signed-off-by: Kent Overstreet <koverstreet@google.com>
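
For readers unfamiliar with the difference: `snprintf()` returns the length the full output would have had, while the kernel's `scnprintf()` returns the number of characters actually stored. The function below is a userspace approximation written for illustration; `my_scnprintf` is a hypothetical name and this is not the kernel's implementation.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format into buf and return the number of characters actually stored,
 * excluding the trailing NUL. Truncated output returns size - 1.
 */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    int n;

    if (size == 0)
        return 0;

    va_start(args, fmt);
    n = vsnprintf(buf, size, fmt, args);
    va_end(args);

    if (n < 0)
        return 0;
    return (size_t)n < size ? n : (int)(size - 1);
}
```

This matters when the return value is used to advance a write position in a fixed buffer: with `snprintf()` a truncated write can push the position past the end of the buffer, while the `scnprintf()` behaviour keeps it in bounds.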

bbc77aa7

md: bcache: io.c: fix a potential NULL pointer dereference · 5c694129

Kumar Amit Mehta authored May 28, 2013

bio_alloc_bioset returns NULL on failure. This fix adds a missing check
for potential NULL pointer dereferencing.
Signed-off-by: Kumar Amit Mehta <gmate.amit@gmail.com>
Signed-off-by: Kent Overstreet <koverstreet@google.com>

5c694129

25 Jun, 2013 1 commit

xen-blkback: check the number of iovecs before allocating a bios · 1e0f7a21

Roger Pau Monne authored Jun 22, 2013

With the introduction of indirect segments we can receive requests
with a number of segments bigger than the maximum number of allowed
iovecs in a bio, so make sure that blkback doesn't try to allocate a
bio with more iovecs than BIO_MAX_PAGES.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
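
The check amounts to clamping each allocation to the per-bio limit and spreading the remaining segments over further bios. A minimal sketch, assuming one iovec per segment and a limit of 256 (the value is defined locally here for illustration); `next_bio_vecs()` and `bios_needed()` are hypothetical helpers, not blkback functions.

```c
#define BIO_MAX_PAGES 256u /* value assumed here for illustration */

/* Number of iovecs to request for the next bio, given segments still unmapped. */
static unsigned next_bio_vecs(unsigned segs_left)
{
    return segs_left < BIO_MAX_PAGES ? segs_left : BIO_MAX_PAGES;
}

/* How many bios a request of nseg segments needs under that cap. */
static unsigned bios_needed(unsigned nseg)
{
    return (nseg + BIO_MAX_PAGES - 1) / BIO_MAX_PAGES;
}
```

Under these assumptions a classic 11-segment request still fits in one bio, while a 512-segment indirect request is split across two.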

1e0f7a21

22 Jun, 2013 1 commit

Linux 3.10-rc7 · 9e895ace

Linus Torvalds authored Jun 22, 2013

9e895ace