Commits · a44667226d32a168a0953f9382cd8503ec947d7d · Kirill Smelkov / linux

13 Apr, 2023 14 commits

Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of... · a4466722

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix bugs in parent pointer checking [v24.5]

Jan Kara pointed out that the VFS doesn't take i_rwsem of a child
subdirectory that is being moved from one parent to another. Upon
deeper analysis, I realized that this was the source of a very hard to
trigger false corruption report in the parent pointer checking code.

Now that we've refactored how directory walks work in scrub, we can also
get rid of all the unnecessary and broken locking to make parent pointer
scrubbing work properly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

a4466722

Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of... · f697c2cc

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix iget usage in directory scrub [v24.5]

In this series, we fix some problems with how the directory scrubber
grabs child inodes. First, we want to reduce EDEADLOCK returns by
replacing fixed-iteration loops with interruptible trylock loops.
Second, we add UNTRUSTED to the child iget call so that we can detect a
dirent that points to an unallocated inode. Third, we fix a bug where
we weren't checking the inode pointed to by dotdot entries at all.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

f697c2cc

Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of... · b1bdab25

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: detect incorrect gaps in rmap btree [v24.5]

Following in the theme of the last two patchsets, this one strengthens
the rmap btree record checking so that scrub can count the number of
space records that map to a given owner and that do not map to a given
owner. This enables us to determine exclusive ownership of space that
can't be shared.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

b1bdab25

Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of... · f1121b99

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: detect incorrect gaps in inode btree [v24.5]

This series continues the corrections for a couple of problems I found
in the inode btree scrubber. The first problem is that we don't
directly check the inobt records have a direct correspondence with the
finobt records, and vice versa. The second problem occurs on
filesystems with sparse inode chunks -- the cross-referencing we do
detects sparseness, but it doesn't actually check the consistency
between the inobt hole records and the rmap data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

f1121b99

Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of... · e7cef2fe

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: detect incorrect gaps in refcount btree [v24.5]

The next few patchsets address a deficiency in scrub that I found while
QAing the refcount btree scrubber. If there's a gap between refcount
records, we need to cross-reference that gap with the reverse mappings
to ensure that there are no overlapping records in the rmap btree. If
we find any, then the refcount btree is not consistent. This is not a
property that is specific to the refcount btree; they all need to have
this sort of keyspace scanning logic to detect inconsistencies.

To do this accurately, we need to be able to scan the keyspace of a
btree (which we already do) to be able to tell the caller if the
keyspace is empty, sparse, or fully covered by records. The first few
patches add the keyspace scanner to the generic btree code, along with
the ability to mask off parts of btree keys because when we scan the
rmapbt, we only care about space usage, not the owners.

The final patch closes the scanning gap in the refcountbt scanner.

v23.1: create helpers for the key extraction and comparison functions,
improve documentation, and eliminate the ->mask_key indirect
calls
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

e7cef2fe

Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of... · 6858c887

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: enhance btree key scrubbing [v24.5]

This series fixes the scrub btree block checker to ensure that the keys
in the parent block accurately represent the block, and check the
ordering of all interior key records.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

6858c887

Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of... · 1ee75505

Dave Chinner authored Apr 14, 2023

Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: fix rmap btree key flag handling [v24.5]

This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups. The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.

The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups. The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.

This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock. Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.

A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record. This was what doomed my previous
attempt to correct this problem in 2019.

A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.

With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

1ee75505

Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of... · b764ea20

Dave Chinner authored Apr 14, 2023

Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: hoist scrub record checks into libxfs [v24.5]

There are a few things about btree records that scrub checked but the
libxfs _get_rec functions didn't. Move these bits into libxfs so that
everyone can benefit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

b764ea20

Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of... · 01822a74

Dave Chinner authored Apr 14, 2023

Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: standardize btree record checking code [v24.5]

While I was cleaning things up for 6.1, I noticed that the btree
_query_range and _query_all functions don't perform the same checking
that the _get_rec functions perform.  In fact, they don't perform /any/
sanity checking, which means that callers aren't warned about impossible
records.

Therefore, hoist the record validation and complaint logging code into
separate functions, and call them from any place where we convert an
ondisk record into an incore record.  For online scrub, we can replace
checking code with a call to the record checking functions in libxfs,
thereby reducing the size of the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

01822a74

Merge tag 'scrub-drain-intents-6.4_2023-04-11' of... · b634abac

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-drain-intents-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: drain deferred work items when scrubbing [v24.5]

The design doc for XFS online fsck contains a long discussion of the
eventual consistency models in use for XFS metadata. In that chapter,
we note that it is possible for scrub to collide with a chain of
deferred space metadata updates, and proposes a lightweight solution:
The use of a pending-intents counter so that scrub can wait for the
system to drain all chains.

This patchset implements that scrub drain. The first patch implements
the basic mechanism, and the subsequent patches reduce the runtime
overhead by converting the implementation to use sloppy counters and
introducing jump labels to avoid walking into scrub hooks when it isn't
running. This last paradigm repeats elsewhere in this megaseries.

v23.1: make intent items take an active ref to the perag structure and
document why we bump and drop the intent counts when we do
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

b634abac

Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of... · 793f5c2c

Dave Chinner authored Apr 14, 2023

Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs_scrub: fix licensing and copyright notices [v24.5]

Fix various attribution problems in the xfs_scrub source code, such as
the author's contact information, out of date SPDX tags, and a rough
estimate of when the feature was under heavy development. The most
egregious parts are the files that are missing license information
completely.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

793f5c2c

Merge tag 'pass-perag-refs-6.4_2023-04-11' of... · 1e5ffdc5

Dave Chinner authored Apr 14, 2023

Merge tag 'pass-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: pass perag references around when possible [v24.5]

Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.

v24.2: rework some of the naming and whatnot so there's less opencoding
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

1e5ffdc5

Merge tag 'intents-perag-refs-6.4_2023-04-11' of... · 826053db

Dave Chinner authored Apr 14, 2023

Merge tag 'intents-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: make intent items take a perag reference [v24.5]

Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery. This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

826053db

Merge tag 'online-fsck-design-6.4_2023-04-11' of... · bed25d80

Dave Chinner authored Apr 14, 2023

Merge tag 'online-fsck-design-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next

xfs: design documentation for online fsck [v24.5]

After six years of development and a nearly two year hiatus from
patchbombing, I think it is time to resume the process of merging the
online fsck feature into XFS. The full patchset comprises 105 separate
patchsets that capture 470 patches across the kernel, xfsprogs, and
fstests projects.

I would like to merge this feature into upstream in time for the 2023
LTS kernel. As of 5.15 (aka last year's LTS), we have merged all
generally useful infrastructure improvements into the regular
filesystem. The only changes to the core filesystem that remain are the
ones that are only useful to online fsck itself. In other words, the
vast majority of the new code in the patchsets comprising the online
fsck feature are is mostly self contained and can be turned off via
Kconfig.

Many of you readers might be wondering -- why have I chosen to make one
large submission with 100+ patchsets comprising ~500 patches? Why
didn't I merge small pieces of functionality bit by bit and revise
common code as necessary? Well, the simple answer is that in the past
six years, the fundamental algorithms have been revised repeatedly as
I've built out the functionality. In other words, the codebase as it is
now has the benefit that I now know every piece that's necessary to get
the job done in a reasonable manner and within the constraints laid out
by community reviews. I believe this has reduced code churn in mainline
and freed up my time so that I can iterate faster.

As a concession to the mail servers, I'm breaking up the submission into
smaller pieces; I'm only pushing the design document and the revisions
to the existing scrub code, which is the first 20%% of the patches.
Also, I'm arbitrarily restarting the version numbering by reversioning
all patchsets from version 22 to epoch 23, version 1.

The big question to everyone reading this is: How might I convince you
that there is more merit in merging the whole feature and dealing with
the consequences than continuing to maintain it out of tree?

---------

To prepare the XFS community and potential patch reviewers for the
upstream submission of the online fsck feature, I decided to write a
document capturing the broader picture behind the online repair
development effort. The document begins by defining the problems that
online fsck aims to solve and outlining specific use cases for the
functionality.

Using that as a base, the rest of the design document presents the high
level algorithms that fulfill the goals set out at the start and the
interactions between the large pieces of the system. Case studies round
out the design documentation by adding the details of exactly how
specific parts of the online fsck code integrate the algorithms with the
filesystem.

The goal of this effort is to help the XFS community understand how the
gigantic online repair patchset works. The questions I submit to the
community reviewers are:

1. As you read the design doc (and later the code), do you feel that you
understand what's going on well enough to try to fix a bug if you
found one?

2. What sorts of interactions between systems (or between scrub and the
rest of the kernel) am I missing?

3. Do you feel confident enough in the implementation as it is now that
the benefits of merging the feature (as EXPERIMENTAL) outweigh any
potential disruptions to XFS at large?

4. Are there problematic interactions between subsystems that ought to
be cleared up before merging?

5. Can I just merge all of this?

I intend to commit this document to the kernel's documentation directory
when we start merging the patchset, albeit without the links to
git.kernel.org. A much more readable version of this is posted at:
https://djwong.org/docs/xfs-online-fsck-design/

v2: add missing sections about: all the in-kernel data structures and
new apis that the scrub and repair functions use; how xattrs and
directories are checked; how space btree records are checked; and
add more details to the parts where all these bits tie together.
Proofread for verb tense inconsistencies and eliminate vague 'we'
usage. Move all the discussion of what we can do with pageable
kernel memory into a single source file and section. Document where
log incompat feature locks fit into the locking model.

v3: resync with 6.0, fix a few typos, begin discussion of the merging
plan for this megapatchset.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

bed25d80

12 Apr, 2023 26 commits

xfs: fix BUG_ON in xfs_getbmap() · 8ee81ed5

Ye Bin authored Apr 12, 2023

There's issue as follows:
XFS: Assertion failed: (bmv->bmv_iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c, line: 329
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:102!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 14612 Comm: xfs_io Not tainted 6.3.0-rc2-next-20230315-00006-g2729d23ddb3b-dirty #422
RIP: 0010:assfail+0x96/0xa0
RSP: 0018:ffffc9000fa178c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff888179a18000
RDX: 0000000000000000 RSI: ffff888179a18000 RDI: 0000000000000002
RBP: 0000000000000000 R08: ffffffff8321aab6 R09: 0000000000000000
R10: 0000000000000001 R11: ffffed1105f85139 R12: ffffffff8aacc4c0
R13: 0000000000000149 R14: ffff888269f58000 R15: 000000000000000c
FS:  00007f42f27a4740(0000) GS:ffff88882fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000b92388 CR3: 000000024f006000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 xfs_getbmap+0x1a5b/0x1e40
 xfs_ioc_getbmap+0x1fd/0x5b0
 xfs_file_ioctl+0x2cb/0x1d50
 __x64_sys_ioctl+0x197/0x210
 do_syscall_64+0x39/0xb0
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Above issue may happen as follows:
         ThreadA                       ThreadB
do_shared_fault
 __do_fault
  xfs_filemap_fault
   __xfs_filemap_fault
    filemap_fault
                             xfs_ioc_getbmap -> Without BMV_IF_DELALLOC flag
			      xfs_getbmap
			       xfs_ilock(ip, XFS_IOLOCK_SHARED);
			       filemap_write_and_wait
 do_page_mkwrite
  xfs_filemap_page_mkwrite
   __xfs_filemap_fault
    xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
    iomap_page_mkwrite
     ...
     xfs_buffered_write_iomap_begin
      xfs_bmapi_reserve_delalloc -> Allocate delay extent
                              xfs_ilock_data_map_shared(ip)
	                      xfs_getbmap_report_one
			       ASSERT((bmv->bmv_iflags & BMV_IF_DELALLOC) != 0)
	                        -> trigger BUG_ON

As xfs_filemap_page_mkwrite() only hold XFS_MMAPLOCK_SHARED lock, there's
small window mkwrite can produce delay extent after file write in xfs_getbmap().
To solve above issue, just skip delalloc extents.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

8ee81ed5

xfs: verify buffer contents when we skip log replay · 22ed903e

Darrick J. Wong authored Apr 12, 2023

syzbot detected a crash during log recovery:

XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
XFS (loop0): Starting recovery (logdev: internal)
==================================================================
BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074

CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
 print_address_description+0x74/0x340 mm/kasan/report.c:306
 print_report+0x107/0x1f0 mm/kasan/report.c:417
 kasan_report+0xcd/0x100 mm/kasan/report.c:517
 xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
 xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
 xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
 xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
 xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
 xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
 xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
 xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
 xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
 xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
 get_tree_bdev+0x400/0x620 fs/super.c:1282
 vfs_get_tree+0x88/0x270 fs/super.c:1489
 do_new_mount+0x289/0xad0 fs/namespace.c:3145
 do_mount fs/namespace.c:3488 [inline]
 __do_sys_mount fs/namespace.c:3697 [inline]
 __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89fa3f4aca
Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
 </TASK>

The fuzzed image contains an AGF with an obviously garbage
agf_refcount_level value of 32, and a dirty log with a buffer log item
for that AGF.  The ondisk AGF has a higher LSN than the recovered log
item.  xlog_recover_buf_commit_pass2 reads the buffer, compares the
LSNs, and decides to skip replay because the ondisk buffer appears to be
newer.

Unfortunately, the ondisk buffer is corrupt, but recovery just read the
buffer with no buffer ops specified:

	error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
			buf_f->blf_len, buf_flags, &bp, NULL);

Skipping the buffer leaves its contents in memory unverified.  This sets
us up for a kernel crash because xfs_refcount_recover_cow_leftovers
reads the buffer (which is still around in XBF_DONE state, so no read
verification) and creates a refcountbt cursor of height 32.  This is
impossible so we run off the end of the cursor object and crash.

Fix this by invoking the verifier on all skipped buffers and aborting
log recovery if the ondisk buffer is corrupt.  It might be smarter to
force replay the log item atop the buffer and then see if it'll pass the
write verifier (like ext4 does) but for now let's go with the
conservative option where we stop immediately.

Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994eSigned-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

22ed903e

xfs: _{attr,data}_map_shared should take ILOCK_EXCL until iread_extents is completely done · c95356ca

Darrick J. Wong authored Apr 12, 2023

While fuzzing the data fork extent count on a btree-format directory
with xfs/375, I observed the following (excerpted) splat:

XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
------------[ cut here ]------------
WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
Call Trace:
 <TASK>
 xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
 __x64_sys_ioctl+0x82/0xa0
 do_syscall_64+0x2b/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

The cause of this is a race condition in xfs_ilock_data_map_shared,
which performs an unlocked access to the data fork to guess which lock
mode it needs:

Thread 0                          Thread 1

xfs_need_iread_extents
<observe no iext tree>
xfs_ilock(..., ILOCK_EXCL)
xfs_iread_extents
<observe no iext tree>
<check ILOCK_EXCL>
<load bmbt extents into iext>
<notice iext size doesn't
 match nextents>
                                  xfs_need_iread_extents
                                  <observe iext tree>
                                  xfs_ilock(..., ILOCK_SHARED)
<tear down iext tree>
xfs_iunlock(..., ILOCK_EXCL)
                                  xfs_iread_extents
                                  <observe no iext tree>
                                  <check ILOCK_EXCL>
                                  *BOOM*

Fix this race by adding a flag to the xfs_ifork structure to indicate
that we have not yet read in the extent records and changing the
predicate to look at the flag state, not if_height.  The memory barrier
ensures that the flag will not be set until the very end of the
function.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

c95356ca

xfs: remove WARN when dquot cache insertion fails · 4b827b3f

Dave Chinner authored Apr 12, 2023

It just creates unnecessary bot noise these days.

Reported-by: syzbot+6ae213503fb12e87934f@syzkaller.appspotmail.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

4b827b3f

xfs: don't consider future format versions valid · aa880198

Dave Chinner authored Apr 12, 2023

In commit fe08cc50 we reworked the valid superblock version
checks. If it is a V5 filesystem, it is always valid, then we
checked if the version was less than V4 (reject) and then checked
feature fields in the V4 flags to determine if it was valid.

What we missed was that if the version is not V4 at this point,
we shoudl reject the fs. i.e. the check current treats V6+
filesystems as if it was a v4 filesystem. Fix this.

cc: stable@vger.kernel.org
Fixes: fe08cc50 ("xfs: open code sb verifier feature checks")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>

aa880198

xfs: fix parent pointer scrub racing with subdirectory reparenting · 0916056e

Darrick J. Wong authored Apr 11, 2023

Jan Kara pointed out that rename() doesn't lock a subdirectory that is
being moved from one parent to another, even though the move requires an
update to the subdirectory's dotdot entry. This means that it's *not*
sufficient to hold a directory's IOLOCK to stabilize the dotdot entry.
We must hold the ILOCK of both the child and the alleged parent, and
there's no use in holding the parent's IOLOCK.

With that in mind, we can get rid of all the messy code that tries to
grab the parent's IOLOCK, which means we don't need to let go of the
ILOCK of the directory whose parent we are checking. We still have to
use nonblocking mode to take the ILOCK of the alleged parent, so the
revalidation loop has to stay.

However, we can remove the retry counter, since threads aren't supposed
to hold the ILOCK for long periods of time. Remove the inverted ilock
helper from the common code since nobody uses it. Remove the entire
source of -EDEADLOCK-based "retry harder" scrub executions.

Link: https://lore.kernel.org/linux-xfs/20230117123735.un7wbamlbdihninm@quack3/Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

0916056e

xfs: simplify xchk_parent_validate · b049962c

Darrick J. Wong authored Apr 11, 2023

This function is unnecessarily long because it contains code to
revalidate a dotdot entry after cycling locks to try to confirm a
subdirectory parent pointer.  Shorten the codebase by making the
parent's lookup call do double duty as the revalidation code.

This weakeans the efficacy of this scrub function temporarily, but the
next patch will resolve this as part of fixing an unhandled race that is
the result of the VFS rename locking model not working the way Darrick
thought it did.

Rename this stupid 'dnum' variable too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

b049962c

xfs: remove xchk_parent_count_parent_dentries · cbab28f4

Darrick J. Wong authored Apr 11, 2023

This helper is now trivial, so get rid of it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

cbab28f4

xfs: always check the existence of a dirent's child inode · 6bb9209c

Darrick J. Wong authored Apr 11, 2023

When we're scrubbing directory entries, we always need to iget the child
inode to make sure that the inode pointer points to a valid inode. The
original directory scrub code (commit a5c4) only set us up to do this
for ftype=1 filesystems, which is not sufficient; and then commit 4b80
made it worse by exempting the dot and dotdot entries.

Sorta-fixes: a5c46e5e ("xfs: scrub directory metadata")
Sorta-fixes: 4b80ac64 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

6bb9209c

xfs: xfs_iget in the directory scrubber needs to use UNTRUSTED · d9a94480

Darrick J. Wong authored Apr 11, 2023

In commit 4b80ac64, we tried to strengthen the directory scrubber by
using the iget call to detect directory entries that point to
unallocated inodes. Unfortunately, that commit neglected to pass
XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first.
If the inode number points to something that isn't even an inode
cluster, iget will throw corruption errors and return -EFSCORRUPTED,
which means that we fail to mark the directory corrupt.

Fixes: 4b80ac64 ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

d9a94480

xfs: streamline the directory iteration code for scrub · 4c233b5c

Darrick J. Wong authored Apr 11, 2023

Currently, online scrub reuses the xfs_readdir code to walk every entry
in a directory. This isn't awesome for performance, since we end up
cycling the directory ILOCK needlessly and coding around the particular
quirks of the VFS dir_context interface.

Create a streamlined version of readdir that keeps the ILOCK (since the
walk function isn't going to copy stuff to userspace), skips a whole lot
of directory walk cursor checks (since we start at 0 and walk to the
end) and has a sane way to return error codes.

Note: Porting the dotdot checking code is left for a subsequent patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

4c233b5c

xfs: ensure that single-owner file blocks are not owned by others · 30f8ee5e

Darrick J. Wong authored Apr 11, 2023

For any file fork mapping that can only have a single owner, make sure
that there are no other rmap owners for that mapping. This patch
requires the more detailed checking provided by xfs_rmap_count_owners so
that we can know how many rmap records for a given range of space had a
matching owner, how many had a non-matching owner, and how many
conflicted with the records that have a matching owner.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

30f8ee5e

xfs: use the directory name hash function for dir scrubbing · 9dceccc5

Darrick J. Wong authored Apr 11, 2023

The directory code has a directory-specific hash computation function
that includes a modified hash function for case-insensitive lookups.
Hence we must use that function (and not the raw da_hashname) when
checking the dabtree structure.

Found by accidentally breaking xfs/188 to create an abnormally huge
case-insensitive directory and watching scrub break.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

9dceccc5

xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan results · efc0845f

Darrick J. Wong authored Apr 11, 2023

Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill
scan results because for a given range of inode numbers, we might have
no indexed inodes at all; the entire region might be allocated ondisk
inodes; or there might be a mix of the two.

Unfortunately, sparse inodes adds to the complexity, because each inode
record can have holes, which means that we cannot use the generic btree
_scan_keyfill function because we must look for holes in individual
records to decide the result.  On the plus side, online fsck can now
detect sub-chunk discrepancies in the inobt.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

efc0845f

xfs: teach scrub to check for sole ownership of metadata objects · 69115f77

Darrick J. Wong authored Apr 11, 2023

Strengthen online scrub's checking even further by enabling us to check
that a range of blocks are owned solely by a given owner.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

69115f77

xfs: directly cross-reference the inode btrees with each other · bc0f3b55

Darrick J. Wong authored Apr 11, 2023

Improve the cross-referencing of the two inode btrees by directly
checking the free and hole state of each inode with the other btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

bc0f3b55

xfs: clean up broken eearly-exit code in the inode btree scrubber · c01868b6

Darrick J. Wong authored Apr 11, 2023

Corrupt inode chunks should cause us to exit early after setting the
CORRUPT flag on the scrub state.  While we're at it, collapse trivial
helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

c01868b6

xfs: remove pointless shadow variable from xfs_difree_inobt · cc120766

Darrick J. Wong authored Apr 11, 2023

In xfs_difree_inobt, the pag passed in was previously used to look up
the AGI buffer.  There's no need to extract it again, so remove the
shadow variable and shut up -Wshadow.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

cc120766

xfs: ensure that all metadata and data blocks are not cow staging extents · 7ac14fa2

Darrick J. Wong authored Apr 11, 2023

Make sure that all filesystem metadata blocks and file data blocks are
not also marked as CoW staging extents.  The extra checking added here
was inspired by an actual VM host filesystem corruption incident due to
bugs in the CoW handling of 4.x kernels.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

7ac14fa2

xfs: check the reference counts of gaps in the refcount btree · 7ad9ea63

Darrick J. Wong authored Apr 11, 2023

Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings.  We don't
currently check this, so make the refcount scrubber more complete.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

7ad9ea63

xfs: implement masked btree key comparisons for _has_records scans · 4a200a09

Darrick J. Wong authored Apr 11, 2023

For keyspace fullness scans, we want to be able to mask off the parts of
the key that we don't care about. For most btree types we /do/ want the
full keyspace, but for checking that a given space usage also has a full
complement of rmapbt records (even if different/multiple owners) we need
this masking so that we only track sparseness of rm_startblock, not the
whole keyspace (which is extremely sparse).

Augment the ->diff_two_keys and ->keys_contiguous helpers to take a
third union xfs_btree_key argument, and wire up xfs_rmap_has_records to
pass this through. This third "mask" argument should contain a nonzero
value in each structure field that should be used in the key comparisons
done during the scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

4a200a09

xfs: replace xfs_btree_has_record with a general keyspace scanner · 6abc7aef

Darrick J. Wong authored Apr 11, 2023

The current implementation of xfs_btree_has_record returns true if it
finds /any/ record within the given range.  Unfortunately, that's not
sufficient for scrub.  We want to be able to tell if a range of keyspace
for a btree is devoid of records, is totally mapped to records, or is
somewhere in between.  By forcing this to be a boolean, we conflated
sparseness and fullness, which caused scrub to return incorrect results.
Fix the API so that we can tell the caller which of those three is the
current state.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

6abc7aef

xfs: refactor ->diff_two_keys callsites · bd7e7951

Darrick J. Wong authored Apr 11, 2023

Create wrapper functions around ->diff_two_keys so that we don't have to
remember what the return values mean, and adjust some of the code
comments to reflect the longtime code behavior.  We're going to
introduce more uses of ->diff_two_keys in the next patch, so reduce the
cognitive load for readers by doing this refactoring now.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

bd7e7951

xfs: always scrub record/key order of interior records · 2bea8df0

Darrick J. Wong authored Apr 11, 2023

In commit d47fef93, we removed the firstrec and firstkey fields of
struct xchk_btree because Christoph thought they were unnecessary
because we could use the record index in the btree cursor. This is
incorrect because bc_ptrs (now bc_levels[].ptr) tracks the cursor
position within a specific btree block, not within the entire level.

The end result is that scrub no longer detects situations where the
rightmost record of a block is identical to the leftmost record of that
block's right sibling. Fix this regression by reintroducing record
validity booleans so that order checking skips *only* the leftmost
record/key in each level.

Fixes: d47fef93 ("xfs: don't track firstrec/firstkey separately in xchk_btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

2bea8df0

xfs: refactor converting btree irec to btree key · ee5fe8ff

Darrick J. Wong authored Apr 11, 2023

We keep doing these conversions to support btree queries, so refactor
this into a helper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

ee5fe8ff

xfs: check btree keys reflect the child block · c99f99fa

Darrick J. Wong authored Apr 11, 2023

When scrub is checking a non-root btree block, it should make sure that
the keys in the parent btree block accurately capture the keyspace that
the child block stores.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

c99f99fa