Commits · 582686915803e34adc8fdcd90bff7ca7f6a42221 · Kirill Smelkov / linux

21 Jul, 2011 8 commits

fat: remove i_alloc_sem abuse · 58268691

Christoph Hellwig authored Jun 24, 2011

Add a new rw_semaphore to protect bmap against truncate.  Previously
i_alloc_sem was abused for this, but it's going away in this series.

Note that we can't simply use i_mutex, given that the swapon code
calls ->bmap under it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

58268691

VFS: Fixup kerneldoc for generic_permission() · 8c5dc70a

Tobias Klauser authored Jul 01, 2011

The flags parameter went away in
d749519b444db985e40b897f73ce1898b11f997e
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8c5dc70a

anonfd: fix missing declaration · e46ebd27

Tomasz Stanislawski authored Jul 12, 2011

The forward declaration of struct file_operations is
added to avoid compilation warnings.
Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e46ebd27

xfs: make use of new shrinker callout for the inode cache · 8daaa831

Dave Chinner authored Jul 08, 2011

Convert the inode reclaim shrinker to use the new per-sb shrinker
operations. This allows much bigger reclaim batches to be used, and
allows the XFS inode cache to be shrunk in proportion with the VFS
dentry and inode caches. This avoids the problem of the VFS caches
being shrunk significantly before the XFS inode cache is shrunk
resulting in imbalances in the caches during reclaim.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8daaa831

vfs: increase shrinker batch size · 8ab47664

Dave Chinner authored Jul 08, 2011

Now that the per-sb shrinker is responsible for shrinking 2 or more
caches, increase the batch size to keep economies of scale for
shrinking each cache.  Increase the shrinker batch size to 1024
objects.

To allow for a large increase in batch size, add a conditional
reschedule to prune_icache_sb() so that we don't hold the LRU spin
lock for too long. This mirrors the behaviour of the
__shrink_dcache_sb(), and allows us to increase the batch size
without needing to worry about problems caused by long lock hold
times.

To ensure that filesystems using the per-sb shrinker callouts don't
cause problems, document that the object freeing method must
reschedule appropriately inside loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8ab47664

superblock: add filesystem shrinker operations · 0e1fdafd

Dave Chinner authored Jul 08, 2011

Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.

Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many objects are freeable in the
cache"), two operations will be added. The first will return the
number of objects that are freeable, the second is the actual
shrinker call.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0e1fdafd
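
A minimal userspace sketch of the split described in 0e1fdafd, contrasting the overloaded callback with two separate operations. The names toy_cache, legacy_shrink, count_objects and free_objects are hypothetical and chosen for illustration; they are not the kernel's interfaces.

```c
/* Toy cache: only tracks how many objects could be freed. */
struct toy_cache {
    long freeable;
};

/*
 * Overloaded style: nr_to_scan == 0 means "report the count",
 * anything else means "free up to nr_to_scan objects".
 */
static long legacy_shrink(struct toy_cache *c, long nr_to_scan)
{
    if (nr_to_scan == 0)
        return c->freeable;
    if (nr_to_scan > c->freeable)
        nr_to_scan = c->freeable;
    c->freeable -= nr_to_scan;
    return c->freeable;
}

/* Split style: one operation per job (hypothetical names). */
struct toy_cache_ops {
    long (*count_objects)(struct toy_cache *c);
    void (*free_objects)(struct toy_cache *c, long nr);
};

static long toy_count(struct toy_cache *c)
{
    return c->freeable;
}

static void toy_free(struct toy_cache *c, long nr)
{
    if (nr > c->freeable)
        nr = c->freeable;
    c->freeable -= nr;
}

static const struct toy_cache_ops toy_ops = {
    .count_objects = toy_count,
    .free_objects  = toy_free,
};
```

With the split form a caller never has to encode "just count" as a magic argument value, which is the ambiguity the commit message objects to.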

inode: remove iprune_sem · 4f8c19fd

Dave Chinner authored Jul 08, 2011

Now that we have per-sb shrinkers with a lifecycle that is a subset
of the superblock lifecycle and can reliably detect a filesystem
being unmounted, there is no longer any race condition for the
iprune_sem to protect against. Hence we can remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4f8c19fd

superblock: introduce per-sb cache shrinker infrastructure · b0d40c92

Dave Chinner authored Jul 08, 2011

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well. The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code, as the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do fewer
superblock list traversals and less locking, and reclaim should batch
more effectively. This should result in less CPU overhead for reclaim
and potentially faster reclaim of items from each filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b0d40c92

20 Jul, 2011 32 commits

superblock: move pin_sb_for_writeback() to fs/super.c · 12ad3ab6

Dave Chinner authored Jul 08, 2011

The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.

pin_sb_for_writeback() does this exactly, so move it to fs/super.c,
rename it to grab_super_passive() and export it via fs/internal.h
for all the VFS code to be able to use.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

12ad3ab6

inode: move to per-sb LRU locks · 09cc9fc7

Dave Chinner authored Jul 08, 2011

With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

09cc9fc7

inode: Make unused inode LRU per superblock · 98b745c6

Dave Chinner authored Jul 08, 2011

The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclamation schemes.

To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.

The patch only does rudimentary per-sb proportioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

98b745c6

inode: convert inode_stat.nr_unused to per-cpu counters · fcb94f72

Dave Chinner authored Jul 08, 2011

Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fcb94f72
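
A single-threaded model of the idea in fcb94f72: each CPU updates only its own slot, and a reader sums the slots on demand. The array, the slot count and the function names are assumptions made for this sketch; real per-cpu counters also depend on CPU-local access rules that plain userspace C does not model.

```c
#define NR_SLOTS 4  /* stand-in for the number of CPUs (assumption) */

/* One counter per slot; no shared lock is needed to update a slot. */
static long nr_unused[NR_SLOTS];

static void unused_inc(int slot)
{
    nr_unused[slot]++;
}

static void unused_dec(int slot)
{
    nr_unused[slot]--;
}

/*
 * Sum all slots. A single slot may be negative when an object is
 * added on one CPU and removed on another, so only the sum is
 * meaningful; clamp it at zero for readers.
 */
static long unused_total(void)
{
    long sum = 0;

    for (int i = 0; i < NR_SLOTS; i++)
        sum += nr_unused[i];
    return sum < 0 ? 0 : sum;
}
```

The trade-off is that updates become cheap and lock-free while reads cost a walk over every slot, which suits a counter that is written far more often than it is read.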

vmscan: add customisable shrinker batch size · e9299f50

Dave Chinner authored Jul 08, 2011

For shrinkers that have their own cond_resched* calls, having
shrink_slab break the work down into small batches is not
particularly efficient. Add a custom batchsize field to the struct
shrinker so that shrinkers can use a larger batch size if they
desire.

A value of zero (uninitialised) means "use the default", so
behaviour is unchanged by this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e9299f50
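
A small sketch of the "zero means use the default" rule from e9299f50. The struct, the helper names and the default of 128 objects per batch are assumptions for illustration, not a copy of the kernel code.

```c
#define DEFAULT_BATCH 128  /* assumed default batch size */

struct shrinker_sketch {
    long batch;  /* 0 (uninitialised) selects DEFAULT_BATCH */
};

/* Pick the batch size actually used for this shrinker. */
static long effective_batch(const struct shrinker_sketch *s)
{
    return s->batch ? s->batch : DEFAULT_BATCH;
}

/*
 * Count how many shrinker calls a scan of total_scan objects is
 * split into. Any remainder smaller than one batch is left over
 * and not counted here.
 */
static long count_batches(const struct shrinker_sketch *s, long total_scan)
{
    long batch = effective_batch(s);
    long calls = 0;

    while (total_scan >= batch) {
        calls++;
        total_scan -= batch;
    }
    return calls;
}
```

Under these assumptions a scan of 2048 objects takes 16 calls with the default batch but only 2 calls with a batch of 1024, which is the saving a shrinker with its own rescheduling points is after.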

vmscan: reduce wind up shrinker->nr when shrinker can't do work · 3567b59a

Dave Chinner authored Jul 08, 2011

When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, shrink_slab()
adds the entire total_scan count to shrinker->nr. The idea behind this
is that when the shrinker is next called and can do work, it will do
the work of the previously aborted shrinker call as well.

However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
its maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.

This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache needs
to be emptied to free sufficient memory.

Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.

This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.

To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.

If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3567b59a
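
The windup limit described in 3567b59a can be sketched as a pure function. The function name and parameter names are hypothetical, and treating a delta of exactly one quarter of the cache as a "large scan" is an assumption of this sketch.

```c
/*
 * total_scan: deferred work plus the newly calculated delta
 * delta:      work calculated for this call alone
 * cache_size: current number of freeable objects in the cache
 */
static unsigned long limit_windup(unsigned long total_scan,
                                  unsigned long delta,
                                  unsigned long cache_size)
{
    /* Small delta: never let windup exceed half the cache. */
    if (delta < cache_size / 4 && total_scan > cache_size / 2)
        total_scan = cache_size / 2;

    /* Large delta: still cap at the existing maximum of twice the cache. */
    if (total_scan > cache_size * 2)
        total_scan = cache_size * 2;

    return total_scan;
}
```

For a cache of 1000 objects, a wound-up total of 1900 with a delta of 10 is cut to 500, while the same total with a delta of 300 is left at 1900 because the scan is considered large.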

vmscan: shrinker->nr updates race and go wrong · acf92b48

Dave Chinner authored Jul 08, 2011

shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.

As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!

Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
other updates via cmpxchg loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

acf92b48
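
A userspace sketch of the cmpxchg-loop pattern named in acf92b48, written with C11 atomics instead of the kernel's cmpxchg(). The struct and function names are hypothetical; the point is only that the deferred count is claimed and handed back atomically, so concurrent callers neither lose nor duplicate work.

```c
#include <stdatomic.h>

struct shrinker_sketch {
    _Atomic long nr;  /* deferred scan count shared between callers */
};

/* Atomically claim all deferred work, leaving zero behind. */
static long take_deferred(struct shrinker_sketch *s)
{
    long old = atomic_load(&s->nr);

    /* On failure, old is reloaded with the current value; retry. */
    while (!atomic_compare_exchange_weak(&s->nr, &old, 0))
        ;
    return old;
}

/* Atomically add unused scan count back for a later call. */
static void return_unused(struct shrinker_sketch *s, long unused)
{
    long old;

    if (unused <= 0)
        return;
    old = atomic_load(&s->nr);
    while (!atomic_compare_exchange_weak(&s->nr, &old, old + unused))
        ;
}
```

Because each caller works from the private value it claimed, the scan calculation no longer reads a field that another caller may be changing underneath it.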

vmscan: add shrink_slab tracepoints · 09576073

Dave Chinner authored Jul 08, 2011

It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add some tracepoints to allow
insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

09576073

make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) · a9049376

Al Viro authored Jul 08, 2011

... and simplify the living hell out of callers
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a9049376
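
A toy userspace model of why a9049376 simplifies callers: if the splice helper passes an error pointer straight through, a lookup routine can hand it whatever the inode lookup returned without checking first. The ERR_PTR helpers below are reimplemented for userspace, and toy_splice_alias, toy_iget and toy_lookup are hypothetical stand-ins, not the kernel functions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace versions of the error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct toy_inode { int ino; };
struct toy_dentry { struct toy_inode *inode; };

/* Toy splice: errors pass through, anything else is attached. */
static struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
                                           struct toy_dentry *dentry)
{
    if (IS_ERR(inode))
        return ERR_PTR(PTR_ERR(inode));
    dentry->inode = inode;  /* NULL leaves the dentry negative */
    return NULL;
}

/* Toy inode lookup: only inode 1 exists. */
static struct toy_inode *toy_iget(int ino)
{
    static struct toy_inode found = { 1 };

    return ino == 1 ? &found : ERR_PTR(-ENOENT);
}

/* The caller needs no IS_ERR() check of its own. */
static struct toy_dentry *toy_lookup(int ino, struct toy_dentry *dentry)
{
    return toy_splice_alias(toy_iget(ino), dentry);
}
```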

deuglify squashfs_lookup() · 0c1aa9a9

Al Viro authored Jul 08, 2011

d_splice_alias(NULL, dentry) is equivalent to d_add(dentry, NULL), NULL
so no need for that if (inode) ... in there (or ERR_PTR(0), for that
matter)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0c1aa9a9

nfsd4_list_rec_dir(): don't bother with reopening rec_file · 5b4b299c

Al Viro authored Jul 07, 2011

just rewind it to the beginning before vfs_readdir() and be
done with that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5b4b299c

kill useless checks for sb->s_op == NULL · e7f59097

Al Viro authored Jul 07, 2011

never is...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e7f59097

btrfs: kill magical embedded struct superblock · 0ee5dc67

Al Viro authored Jul 07, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0ee5dc67

get rid of pointless checks for dentry->sb == NULL · fb408e6c

Al Viro authored Jul 07, 2011

it never is...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fb408e6c

Make ->d_sb assign-once and always non-NULL · a4464dbc

Al Viro authored Jul 07, 2011

New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly.  Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want).  d_alloc()
itself is left only for parent != NULL case; uses __d_alloc(),
inserts result into the list of parent's children.

Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a4464dbc

unexport kern_path_parent() · e3c3d9c8

Al Viro authored Jun 27, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e3c3d9c8

switch vfs_path_lookup() to struct path · e0a01249

Al Viro authored Jun 27, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

e0a01249

kill lookup_create() · ed75e95d

Al Viro authored Jun 27, 2011

folded into the only caller (kern_path_create())
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ed75e95d

devtmpfs: get rid of bogus mkdir in create_path() · 5da4e689

Al Viro authored Jun 27, 2011

We do _NOT_ want to mkdir the path itself - we are preparing to
mknod it, after all.  Normally it'll fail with -ENOENT and
just do nothing, but if somebody has created the parent in
the meanwhile, we'll get buggered...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5da4e689

switch devtmpfs to kern_path_create() · 69753a0f

Al Viro authored Jun 27, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

69753a0f

switch devtmpfs object creation/removal to separate kernel thread · 2780f1ff

Al Viro authored Jun 27, 2011

... and give it a namespace where devtmpfs would be mounted on root,
thus avoiding abuses of vfs_path_lookup() (it was never intended to
be used with LOOKUP_PARENT).  Games with credentials are also gone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2780f1ff

make sure that nsproxy_cache is initialized early enough · 66577193

Al Viro authored Jun 28, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

66577193

switch do_spufs_create() to user_path_create(), fix double-unlock · 1ba10681

Al Viro authored Jun 26, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1ba10681

new helpers: kern_path_create/user_path_create · dae6ad8f

Al Viro authored Jun 26, 2011

combination of kern_path_parent() and lookup_create().  Does *not*
expose struct nameidata to caller.  Syscalls converted to that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

dae6ad8f

kill LOOKUP_CONTINUE · 49084c3b

Al Viro authored Jun 25, 2011

LOOKUP_PARENT is equivalent to it now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

49084c3b

nfs: LOOKUP_{OPEN,CREATE,EXCL} is set only on the last step · 8aeb376c

Al Viro authored Jun 25, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8aeb376c

cifs_lookup(): LOOKUP_OPEN is set only on the last component · 43527803

Al Viro authored Jun 25, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

43527803

ceph: LOOKUP_OPEN is set only when it's the last component · a127e0af

Al Viro authored Jun 25, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a127e0af

jfs_ci_revalidate() is safe from RCU mode · 5c0f360b

Al Viro authored Jun 25, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5c0f360b

LOOKUP_CREATE and LOOKUP_RENAME_TARGET can be set only on the last step · 407938e7

Al Viro authored Jun 25, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

407938e7

no need to check for LOOKUP_OPEN in ->create() instances · dd7dd556

Al Viro authored Jun 25, 2011

... it will be set in nd->flags for all cases with non-NULL nd
(i.e. when called from do_last()).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

dd7dd556

don't pass nameidata to vfs_create() from ecryptfs_create() · bf6c7f6c

Al Viro authored Jun 25, 2011

Instead of playing with removal of LOOKUP_OPEN, mangling (and
restoring) nd->path, just pass NULL to vfs_create().  The whole
point of what's being done there is to suppress any attempts
to open file by underlying fs, which is what nd == NULL indicates.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bf6c7f6c