Commits · d8914002a0391331a88d9f5de4a235220735d4cc · Kirill Smelkov / linux

30 Aug, 2013 4 commits

xfs: inode buffers may not be valid during recovery readahead · d8914002

Dave Chinner authored Aug 27, 2013

CRC enabled filesystems fail log recovery with 100% reliability on
xfstests xfs/085 with the following failure:

XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS (vdb): Corruption detected. Unmount and run xfs_repair
XFS (vdb): bad inode magic/vsn daddr 144 #0 (magic=0)
XFS: Assertion failed: 0, file: fs/xfs/xfs_inode_buf.c, line: 95

The problem is that the inode buffer has not been recovered before
the readahead on the inode buffer is issued. The checkpoint being
recovered actually allocates the inode chunk we are doing readahead
from, so what comes from disk during readahead is essentially
random and the verifier barfs on it.

This inode buffer readahead problem affects non-crc filesystems,
too, but xfstests does not trigger it at all on such
configurations....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

d8914002

xfs: check LSN ordering for v5 superblocks during recovery · 50d5c8d8

Dave Chinner authored Aug 28, 2013

Log recovery has some strict ordering requirements which unordered
or reordered metadata writeback can defeat. This can occur when an
item is logged in a transaction, written back to disk, and then
logged in a new transaction before the tail of the log is moved past
the original modification.

The result of this is that when we read an object off disk for
recovery purposes, the buffer that we read may not contain the
object type that recovery is expecting and hence at the end of the
checkpoint being recovered we have an invalid object in memory.

This isn't usually a problem, as recovery will then replay all the
other checkpoints and that brings the object back to a valid and
correct state, but the issue is that while the object is in the
invalid state it can be flushed to disk. This results in the object
verifier failing and triggering a corruption shutdown of log
recover. This is correct behaviour for the verifiers - the problem
is that we are not detecting that the object we've read off disk is
newer than the transaction we are replaying.

All metadata in v5 filesystems has the LSN of it's last modification
stamped in it. This enabled log recover to read that field and
determine the age of the object on disk correctly. If the LSN of the
object on disk is older than the transaction being replayed, then we
replay the modification. If the LSN of the object matches or is more
recent than the transaction's LSN, then we should avoid overwriting
the object as that is what leads to the transient corrupt state.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

50d5c8d8

xfs: btree block LSN escaping to disk uninitialised · b58fa554

Dave Chinner authored Aug 28, 2013

When testing LSN ordering code for v5 superblocks, it was discovered
that the the LSN embedded in the generic btree blocks was
occasionally uninitialised. These values didn't get written to disk
by metadata writeback - they got written by previous transactions in
log recovery.

The issue is here that the when the block is first allocated and
initialised, the LSN field was not initialised - it gets overwritten
before IO is issued on the buffer - but the value that is logged by
transactions that modify the header before it is written to disk
(and initialised) contain garbage. Hence the first recovery of the
buffer will stamp garbage into the LSN field, and that can cause
subsequent transactions to not replay correctly.

The fix is simply to initialise the bb_lsn field to zero when we
initialise the block for the first time.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

b58fa554

XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file:... · 37804376

Dave Chinner authored Aug 26, 2013

XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568

The calculation doesn't take into account the size of the dir v3
header, so overestimates the hash entries in a node. This causes
directory buffer overruns when splitting and merging nodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

37804376

29 Aug, 2013 2 commits

xfs: fix bad dquot buffer size in log recovery readahead · 0f0d3345

Dave Chinner authored Aug 27, 2013

xfstests xfs/087 fails 100% reliably with this assert:

XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS: Assertion failed: bp->b_flags & XBF_STALE, file: fs/xfs/xfs_buf.c, line: 548

while trying to read a dquot buffer in xlog_recover_dquot_ra_pass2().

The issue is that the buffer length to read that is passed to
xfs_buf_readahead is in units of filesystem blocks, not disk blocks.
(i.e. FSB, not daddr). Fix it but putting the correct conversion in
place.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

0f0d3345

xfs: don't account buffer cancellation during log recovery readahead · 84a5b730

Dave Chinner authored Aug 27, 2013

When doing readhaead in log recovery, we check to see if buffers are
cancelled before doing readahead. If we find a cancelled buffer,
however, we always decrement the reference count we have on it, and
that means that readahead is causing a double decrement of the
cancelled buffer reference count.

This results in log recovery *replaying cancelled buffers* as the
actual recovery pass does not find the cancelled buffer entry in the
commit phase of the second pass across a transaction. On debug
kernels, this results in an ASSERT failure like so:

XFS: Assertion failed: !(flags & XFS_BLF_CANCEL), file: fs/xfs/xfs_log_recover.c, line: 1815

xfstests generic/311 reproduces this ASSERT failure with 100%
reproducability.

Fix it by making readahead only peek at the buffer cancelled state
rather than the full accounting that xlog_check_buffer_cancelled()
does.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

84a5b730

26 Aug, 2013 2 commits

xfs: check for underflow in xfs_iformat_fork() · 0d0ab120

Dan Carpenter authored Aug 15, 2013

The "di_size" variable comes from the disk and it's a signed 64 bit.
We check the upper limit but we should check for negative numbers as
well.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

0d0ab120

xfs: xfs_dir3_sfe_put_ino can be static · 98f7462c

Fengguang Wu authored Aug 24, 2013

TO: Dave Chinner <david@fromorbit.com>
CC: Ben Myers <bpm@sgi.com>
CC: linux-kernel@vger.kernel.org 
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

98f7462c

23 Aug, 2013 2 commits

xfs: introduce object readahead to log recovery · 00574da1

Zhi Yong Wu authored Aug 14, 2013

  It can take a long time to run log recovery operation because it is
single threaded and is bound by read latency. We can find that it took
most of the time to wait for the read IO to occur, so if one object
readahead is introduced to log recovery, it will obviously reduce the
log recovery time.

Log recovery time stat:

          w/o this patch        w/ this patch

real:        0m15.023s             0m7.802s
user:        0m0.001s              0m0.001s
sys:         0m0.246s              0m0.107s
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

00574da1

xfs: Simplify xfs_ail_min() with list_first_entry_or_null() · 8d1d4083

Jie Liu authored Aug 15, 2013

At xfs_ail_min(), we do check if the AIL list is empty or not before
returning the first item in it with list_empty() and list_first_entry().

This can be simplified a bit with a new list operation routine that is
the list_first_entry_or_null() which has been introduced by:

commit 6d7581e6
    list: introduce list_first_entry_or_null

v2: make xfs_ail_min() as a static inline function and move it to
    xfs_trans_priv.h as per Dave Chinner's comments.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

8d1d4083

22 Aug, 2013 4 commits

xfs: Register hotcpu notifier after initialization · 46677e67

Richard Weinberger authored Aug 19, 2013

Currently the code initializizes mp->m_icsb_mutex and other things
_after_ register_hotcpu_notifier().
As the notifier takes mp->m_icsb_mutex it can happen
that it takes the lock before it's initialization.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

46677e67

xfs: add xfs sb v4 support for dirent filetype field · 3e3c51ce

Mark Tinguely authored Aug 19, 2013

Add XFS superblock v4 support for the file type field in the
directory entry feature.

This support adds a feature bit for version 4 superblocks and
leaves the original superblock 5 incompatibility bit.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

3e3c51ce

xfs: Add write support for dirent filetype field · 1c55cece

Dave Chinner authored Aug 12, 2013

Add support to propagate and add filetype values into the on-disk
directs. This involves passing the filetype into the xfs_da_args
structure along with the name and namelength for direct operations,
and encoding it into the dirent at the same time we write the inode
number into the dirent.

With write support, add the feature flag to the
XFS_SB_FEAT_INCOMPAT_ALL mask so we can now mount filesystems with
this feature set.

Performance of directory recursion is now much improved. Parallel
walk of ~50 million directory entries across hundreds of directories
improves significantly. Unpatched, no CRCs:

Walking via ls -R

real    3m19.886s
user    6m36.960s
sys     28m19.087s

THis is doing roughly 500 getdents() calls per second, and 250,000
inode lookups per second to determine the inode type at roughly
17,000 read IOPS. CPU usage is 90% kernel space.

With dtype support patched in and the fileset recreated with CRCs
enabled:

Walking via ls -R

real    0m31.316s
user    6m32.975s
sys     0m21.111s

This is doing roughly 3500 getdents() calls per second at 16,000
IOPS. There are no inode lookups at all. CPU usages is almost 100%
userspace.

This is a big win for recursive directory walks that only need to
find file names and file types.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

1c55cece

xfs: Add read-only support for dirent filetype field · 0cb97766

Dave Chinner authored Aug 12, 2013

Add support for the file type field in directory entries so that
readdir can return the type of the inode the dirent points to to
userspace without first having to read the inode off disk.

The encoding of the type field is a single byte that is added to the
end of the directory entry name length. For all intents and
purposes, it appends a "hidden" byte to the name field which
contains the type information. As the directory entry is already of
dynamic size, helpers are already required to access and decode the
direct entry structures.

Hence the relevent extraction and iteration helpers are updated to
understand the hidden byte.  Helpers for reading and writing the
filetype field from the directory entries are also added. Only the
read helpers are used by this patch.  It also adds all the code
necessary to read the type information out of the dirents on disk.

Further we add the superblock feature bit and helpers to indicate
that we understand the on-disk format change. This is not a
compatible change - existing kernels cannot read the new format
successfully - so an incompatible feature flag is added. We don't
yet allow filesystems to mount with this flag yet - that will be
added once write support is added.

Finally, the code to take the type from the VFS, convert it to an
XFS on-disk type and put it into the xfs_name structures passed
around is added, but the directory code does not use this field yet.
That will be in the next patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

0cb97766

21 Aug, 2013 1 commit

powerpc/spufs: convert userns uid/gid mount options to kuid/kgid · ed56f34f

Dwight Engen authored Aug 21, 2013

Acked-by: Jeremy Kerr <jk@ozlabs.org>
Tested-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

ed56f34f

20 Aug, 2013 25 commits

xfs: Add support for the Q_XGETQSTATV · 5d5e3d57

Chandra Seetharaman authored Aug 06, 2013

For XFS, add support for Q_XGETQSTATV quotactl command.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

5d5e3d57

quota: Add a new quotactl command Q_XGETQSTATV · af30cb44

Chandra Seetharaman authored Aug 06, 2013

XFS now supports three types of quotas (user, group and project).

Current version of Q_XGETSTAT has support for only two types of quotas.
In order to support three types of quotas, the interface, specifically
struct fs_quota_stat, need to be expanded. Current version of fs_quota_stat
does not allow expansion without breaking backward compatibility.

So, a quotactl command and new fs_quota_stat structure need to be added.

This patch adds a new command Q_XGETQSTATV to quotactl() which takes
a new data structure fs_quota_statv. This new data structure provides
support for future expansion and backward compatibility.

Callers of the new quotactl command have to set the version of the data
structure being passed, and kernel will fill as much data as requested.
If the kernel does not support the user-space provided version, EINVAL
will be returned. User-space can reduce the version number and call the same
quotactl again.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

[v2: Applied rjohnston's suggestions as per Chandra's request. -bpm]

af30cb44

xfs: fix the comment of xfs_mountfs() · c2bfbc9b

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

c2bfbc9b

xfs: fix the comment of xfs_sb_quiet_read_verify() · 2533787a

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

2533787a

xfs: fix the comment of xlog_recover_do_dquot_buffer() · 8ba701ee

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

8ba701ee

xfs: fix the comment of xfs_log_unmount_write() · 8e159e72

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

8e159e72

xfs: fix the comment of xfs_ifree_cluster() · 0b8182db

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

0b8182db

xfs: fix the comment of xfs_ialloc_ag_select() · 2f21ff1c

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

2f21ff1c

xfs: fix the comment of xfs_extent_busy_update_extent() · b3c49634

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

b3c49634

xfs: fix the comment of xfs_setsize_buftarg_early() · 8b4ad79c

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

8b4ad79c

xfs: fix the comment of xfs_bmap_punch_delalloc_range() · ad4809bf

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

ad4809bf

xfs: fix the comment of xfs_bmap_last_before() · 02bb4873

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

02bb4873

xfs: fix the comment of xfs_bmap_validate_ret() · a97f4df7

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

a97f4df7

xfs: fix the comment of xfs_bmap_count_tree() · 8be11e92

Zhi Yong Wu authored Aug 12, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

8be11e92

xfs: rename bio_add_buffer() to xfs_bio_add_buffer() · c7c1a7d8

Zhi Yong Wu authored Aug 07, 2013

Follow up with xfs naming style.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

c7c1a7d8

xfs: fix the comment of xlog_find_head() · 0a94da24

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

0a94da24

xfs: fix the comment of xlog_recover_buffer_pass2() · 34be5ff3

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

34be5ff3

xfs: remove two unused macro definitions in xfs_linux.h · 5c753909

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

5c753909

xfs: fix the comment of xfs_btree_get_iroot() · 1cb93863

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

1cb93863

xfs: fix the comment of xfs_iroot_realloc() · f6c27349

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

f6c27349

xfs: remove one blank line in xfs_btree_make_block_unfull() · 7c3e6640

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

7c3e6640

xfs: fix the comment of xlog_write_setup_copy() · ac0e300f

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

ac0e300f

xfs: fix the comment of xfs_mod_incore_sb_unlocked() · 99e738b7

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

99e738b7

xfs: fix the comment of xfs_btree_lookup() · 49d3da14

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

49d3da14

xfs: fix the comment of xfs_buf_free() · b46fe825

Zhi Yong Wu authored Aug 07, 2013

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

b46fe825