Commits · 3825bdb7ed920845961f32f364454bee5f469abb · Kirill Smelkov / linux

26 Oct, 2010 40 commits

fs: use RCU read side protection in d_validate · 3825bdb7

Christoph Hellwig authored Oct 10, 2010

d_validate does a purely read lookup in the dentry hash, so use RCU read side
locking instead of dcache_lock.  Split out from a larget patch by
Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3825bdb7

fs: clean up dentry lru modification · a4633357

Christoph Hellwig authored Oct 10, 2010

Always do a list_del_init on the LRU to make sure the list_empty invariant for
not beeing on the LRU always holds true, and fold dentry_lru_del_init into
dentry_lru_del. Replace the dentry_lru_add_tail primitive with a
dentry_lru_move_tail operations that simpler when the dentry already is one
the list, which is always is. Move the list_empty into dentry_lru_add to
fit the scheme of the other lru helpers, and simplify locking once we
move to a separate LRU lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a4633357

fs: split __shrink_dcache_sb · 3049cfe2

Christoph Hellwig authored Oct 10, 2010

Currently __shrink_dcache_sb has an extremly awkward calling convention
because it tries to please very different callers. Split out the
main loop into a shrink_dentry_list helper, which gets called directly
from shrink_dcache_sb for the cases where all dentries need to be pruned,
or from __shrink_dcache_sb for pruning only a certain number of dentries.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3049cfe2

fs: improve DCACHE_REFERENCED usage · 265ac902

Nick Piggin authored Oct 10, 2010

dentry referenced bit is only set when installing the dentry back
onto the LRU. However with lazy LRU, the dentry can already be on
the LRU list at dput time, thus missing out on setting the referenced
bit. Fix this.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

265ac902

fs: use percpu counter for nr_dentry and nr_dentry_unused · 312d3ca8

Christoph Hellwig authored Oct 10, 2010

The nr_dentry stat is a globally touched cacheline and atomic operation
twice over the lifetime of a dentry. It is used for the benfit of userspace
only. Turn it into a per-cpu counter and always decrement it in d_free instead
of doing various batching operations to reduce lock hold times in the callers.

Based on an earlier patch from Nick Piggin <npiggin@suse.de>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

312d3ca8

fs: simplify __d_free · 9c82ab9c

Christoph Hellwig authored Oct 10, 2010

Remove d_callback and always call __d_free with a RCU head.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9c82ab9c

fs: take dcache_lock inside __d_path · be148247

Christoph Hellwig authored Oct 10, 2010

All callers take dcache_lock just around the call to __d_path, so
take the lock into it in preparation of getting rid of dcache_lock.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

be148247

fs: do not assign default i_ino in new_inode · 85fe4025

Christoph Hellwig authored Oct 23, 2010

Instead of always assigning an increasing inode number in new_inode
move the call to assign it into those callers that actually need it.
For now callers that need it is estimated conservatively, that is
the call is added to all filesystems that do not assign an i_ino
by themselves.  For a few more filesystems we can avoid assigning
any inode number given that they aren't user visible, and for others
it could be done lazily when an inode number is actually needed,
but that's left for later patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

85fe4025

fs: introduce a per-cpu last_ino allocator · f991bd2e

Eric Dumazet authored Oct 23, 2010

new_inode() dirties a contended cache line to get increasing
inode numbers. This limits performance on workloads that cause
significant parallel inode allocation.

Solve this problem by using a per_cpu variable fed by the shared
last_ino in batches of 1024 allocations.  This reduces contention on
the shared last_ino, and give same spreading ino numbers than before
(i.e. same wraparound after 2^32 allocations).
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

f991bd2e

new helper: ihold() · 7de9c6ee

Al Viro authored Oct 23, 2010

Clones an existing reference to inode; caller must already hold one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7de9c6ee

fs: remove inode_add_to_list/__inode_add_to_list · 646ec461

Christoph Hellwig authored Oct 23, 2010

Split up inode_add_to_list/__inode_add_to_list.  Locking for the two
lists will be split soon so these helpers really don't buy us much
anymore.

The __ prefixes for the sb list helpers will go away soon, but until
inode_lock is gone we'll need them to distinguish between the locked
and unlocked variants.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

646ec461

fs: move i_count increments into find_inode/find_inode_fast · f7899bd5

Christoph Hellwig authored Oct 23, 2010

Now that iunique is not abusing find_inode anymore we can move the i_ref
increment back to where it belongs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

f7899bd5

fs: Stop abusing find_inode_fast in iunique · ad5e195a

Christoph Hellwig authored Oct 23, 2010

Stop abusing find_inode_fast for iunique and opencode the inode hash walk.
Introduce a new iunique_lock to protect the iunique counters once inode_lock
is removed.

Based on a patch originally from Nick Piggin.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ad5e195a

fs: Factor inode hash operations into functions · 4c51acbc

Dave Chinner authored Oct 23, 2010

Before replacing the inode hash locking with a more scalable
mechanism, factor the removal of the inode from the hashes rather
than open coding it in several places.

Based on a patch originally from Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4c51acbc

fs: Implement lazy LRU updates for inodes · 9e38d86f

Nick Piggin authored Oct 23, 2010

Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount contention
on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under it's own lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9e38d86f

fs: Convert nr_inodes and nr_unused to per-cpu counters · cffbc8aa

Dave Chinner authored Oct 23, 2010

The number of inodes allocated does not need to be tied to the
addition or removal of an inode to/from a list. If we are not tied
to a list lock, we could update the counters when inodes are
initialised or destroyed, but to do that we need to convert the
counters to be per-cpu (i.e. independent of a lock). This means that
we have the freedom to change the list/locking implementation
without needing to care about the counters.

Based on a patch originally from Eric Dumazet.

[AV: cleaned up a bit, fixed build breakage on weird configs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

cffbc8aa

vfs: fix infinite loop caused by clone_mnt race · be1a16a0

Miklos Szeredi authored Oct 05, 2010

If clone_mnt() happens while mnt_make_readonly() is running, the
cloned mount might have MNT_WRITE_HOLD flag set, which results in
mnt_want_write() spinning forever on this mount.

Needs CAP_SYS_ADMIN to trigger deliberately and unlikely to happen
accidentally.  But if it does happen it can hang the machine.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

be1a16a0

switch hfs to hlist_add_fake() · 89b0fc38
Al Viro authored Oct 23, 2010
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
89b0fc38

list.h: new helper - hlist_add_fake() · 756acc2d

Al Viro authored Oct 23, 2010

Make node look as if it was on hlist, with hlist_del()
working correctly.  Usable without any locking...

Convert a couple of places where we want to do that to
inode->i_hash.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

756acc2d

new helper: inode_unhashed() · 1d3382cb

Al Viro authored Oct 23, 2010

note: for race-free uses you inode_lock held
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1d3382cb

unexport invalidate_inodes · a8dade34
Al Viro authored Oct 24, 2010
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
a8dade34
smbfs never retains inodes with zero refcount in the first place · 61ebdb42
Al Viro authored Oct 24, 2010
```
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
```
61ebdb42

ntfs: don't call invalidate_inodes() · 70fd136e

Al Viro authored Oct 24, 2010

We are in fill_super(); again, no inodes with zero i_count could
be around until we set MS_ACTIVE.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

70fd136e

gfs2: invalidate_inodes() is no-op there · 9dcefee5

Al Viro authored Oct 24, 2010

In fill_super() we hadn't MS_ACTIVE set yet, so there won't
be any inodes with zero i_count sitting around.

In put_super() we already have MS_ACTIVE removed *and* we
had called invalidate_inodes() since then.  So again there
won't be any inodes with zero i_count...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9dcefee5

ext2_remount: don't bother with invalidate_inodes() · 8e3b9a07

Al Viro authored Oct 24, 2010

It's pointless - we *do* have busy inodes (root directory,
for one), so that call will fail and attempt to change
XIP flag will be ignored.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8e3b9a07

fs/buffer.c: call __block_write_begin() if we have page · 309f77ad

Namhyung Kim authored Oct 25, 2010

If we have the appropriate page already, call __block_write_begin()
directly instead of releasing and regrabbing it inside of
block_write_begin().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

309f77ad

lockdep: fixup checking of dir inode annotation · a3314a0e

Namhyung Kim authored Oct 11, 2010

Since inode->i_mode shares its bits for S_IFMT, S_ISDIR should be
used to distinguish whether it is a dir or not.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a3314a0e

aio: bump i_count instead of using igrab · 306fb097

Chris Mason authored Aug 23, 2010

The aio batching code is using igrab to get an extra reference on the
inode so it can safely batch.  igrab will go ahead and take the global
inode spinlock, which can be a bottleneck on large machines doing lots
of AIO.

In this case, igrab isn't required because we already have a reference
on the file handle.  It is safe to just bump the i_count directly
on the inode.

Benchmarking shows this patch brings IOP/s on tons of flash up by about
2.5X.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

306fb097

update block_device_operations documentation · e1455d1b

Christoph Hellwig authored Oct 06, 2010

Updated Documentation/filesystems/Locking to match the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>

e1455d1b

fs/buffer.c: remove duplicated assignment on b_private · 8358e7d7

Namhyung Kim authored Oct 16, 2010

bh->b_private is initialized within init_buffer(), thus the
assignment should be redundant. Remove it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8358e7d7

fs: move exportfs since it is not a networking filesystem · bb1e5f8c

Randy Dunlap authored Oct 15, 2010

Move the EXPORTFS kconfig symbol out of the NETWORK_FILESYSTEMS block
since it provides a library function that can be (and is) used by other
(non-network) filesystems.

This also eliminates a kconfig dependency warning:

warning: (XFS_FS && BLOCK || NFSD && NETWORK_FILESYSTEMS && INET && FILE_LOCKING && BKL) selects EXPORTFS which has unmet direct dependencies (NETWORK_FILESYSTEMS)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alex Elder <aelder@sgi.com>
Cc: xfs-masters@oss.sgi.com
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bb1e5f8c

hfs: use sync_dirty_buffer · 3072b90c

Christoph Hellwig authored Oct 06, 2010

Use sync_dirty_buffer instead of the incorrect opencoding it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3072b90c

vfs: introduce FMODE_UNSIGNED_OFFSET for allowing negative f_pos · 4a3956c7

KAMEZAWA Hiroyuki authored Oct 01, 2010

Now, rw_verify_area() checsk f_pos is negative or not.  And if negative,
returns -EINVAL.

But, some special files as /dev/(k)mem and /proc/<pid>/mem etc..  has
negative offsets.  And we can't do any access via read/write to the
file(device).

So introduce FMODE_UNSIGNED_OFFSET to allow negative file offsets.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4a3956c7

hostfs: fix UML crash: remove f_spare from hostfs · ba10f486

Richard Weinberger authored Oct 19, 2010

365b1818 ("add f_flags to struct statfs(64)") resized f_spare within
struct statfs which caused a UML crash.  There is no need to copy f_spare.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Tested-by: Toralf Förster <toralf.foerster@gmx.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ba10f486

Documentation: Fix trivial typo in filesystems/sharedsubtree.txt · d9d1dc80

Valerie Aurora authored Aug 30, 2010

Documentation: Fix trivial typo in filesystems/sharedsubtree.txt

This typo is easy to ignore unless you have spent a great deal of time
thinking about how to eliminate duplicate dentries in unions.
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d9d1dc80

affs: testing the wrong variable · 0e45b67d

Dan Carpenter authored Aug 25, 2010

The intent was to verify that bh = affs_bread_ino(...) returned a valid
pointer.  We checked "ext_bh" earlier in the function and it's valid
here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0e45b67d

fs: allow for more than 2^31 files · 7e360c38

Eric Dumazet authored Oct 05, 2010

Andrew,

Could you please review this patch, you probably are the right guy to
take it, because it crosses fs and net trees.

Note : /proc/sys/fs/file-nr is a read-only file, so this patch doesnt
depend on previous patch (sysctl: fix min/max handling in
__do_proc_doulongvec_minmax())

Thanks !

[PATCH V4] fs: allow for more than 2^31 files

Robin Holt tried to boot a 16TB system and found af_unix was overflowing
a 32bit value :

<quote>

We were seeing a failure which prevented boot.  The kernel was incapable
of creating either a named pipe or unix domain socket.  This comes down
to a common kernel function called unix_create1() which does:

        atomic_inc(&unix_nr_socks);
        if (atomic_read(&unix_nr_socks) > 2 * get_max_files())
                goto out;

The function get_max_files() is a simple return of files_stat.max_files.
files_stat.max_files is a signed integer and is computed in
fs/file_table.c's files_init().

        n = (mempages * (PAGE_SIZE / 1024)) / 10;
        files_stat.max_files = n;

In our case, mempages (total_ram_pages) is approx 3,758,096,384
(0xe0000000).  That leaves max_files at approximately 1,503,238,553.
This causes 2 * get_max_files() to integer overflow.

</quote>

Fix is to let /proc/sys/fs/file-nr & /proc/sys/fs/file-max use long
integers, and change af_unix to use an atomic_long_t instead of
atomic_t.

get_max_files() is changed to return an unsigned long.
get_nr_files() is changed to return a long.

unix_nr_socks is changed from atomic_t to atomic_long_t, while not
strictly needed to address Robin problem.

Before patch (on a 64bit kernel) :
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
-18446744071562067968

After patch:
# echo 2147483648 >/proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
2147483648
# cat /proc/sys/fs/file-nr
704     0       2147483648
Reported-by: Robin Holt <holt@sgi.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Miller <davem@davemloft.net>
Reviewed-by: Robin Holt <holt@sgi.com>
Tested-by: Robin Holt <holt@sgi.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7e360c38

isofs: Fix isofs_get_blocks for 8TB files · fde214d4

Jan Kara authored Oct 04, 2010

Currently isofs_get_blocks() was limited to handle only 4TB files on 32-bit
architectures because of unnecessary use of iblock variable which was signed
long. Just remove the variable. The error messages that were using this
variable should have rather used b_off anyway because that is the block we
are currently mapping.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fde214d4

fs: kill block_prepare_write · ebdec241

Christoph Hellwig authored Oct 06, 2010

__block_write_begin and block_prepare_write are identical except for slightly
different calling conventions.  Convert all callers to the __block_write_begin
calling conventions and drop block_prepare_write.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ebdec241

fs: mark destroy_inode static · 56b0dacf

Christoph Hellwig authored Oct 06, 2010

Hugetlbfs used to need it, but after the destroy_inode and evict_inode
changes it's not required anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

56b0dacf