- 19 Aug, 2021 36 commits
-
-
Dave Chinner authored
The attr2 feature is somewhat unique in that it has both a superblock feature bit to enable it and mount options to enable and disable it. Back when it was first introduced in 2005, attr2 was disabled unless either the attr2 superblock feature bit was set, or the attr2 mount option was set. If the superblock feature bit was not set but the mount option was set, then when the first attr2 format inode fork was created, it would set the superblock feature bit. This is as it should be - the superblock feature bit indicated the presence of the attr2 on-disk format. The noattr2 mount option, however, did not affect the superblock feature bit. If noattr2 was specified, the on-disk superblock feature bit was ignored and the code always just created attr1 format inode forks. If neither the attr2 nor the noattr2 mount option was specified, then the behaviour was determined by the superblock feature bit. This was all pretty sane.

Fast forward 3 years, and we are dealing with fallout from the botched sb_features2 addition and having to deal with feature mismatches between the sb_features2 and sb_bad_features2 fields. The attr2 feature bit was one of these flags. The reconciliation was done well after mount option parsing and, unfortunately, the feature reconciliation had a bug where it ignored the noattr2 mount option. For reasons lost to the mists of time, it was decided that resolving this issue in commit 7c12f296 ("[XFS] Fix up noattr2 so that it will properly update the versionnum and features2 fields.") required noattr2 to clear the superblock attr2 feature bit. This greatly complicated the attr2 behaviour and broke rules about feature bits needing to be set when those specific features are present in the filesystem.

By complicated, I mean that it introduced problems due to feature bit interactions with log recovery. All of the superblock feature bit checks are done prior to log recovery, but if we crash after removing a feature bit, then on the next mount we see the feature bit in the unrecovered superblock, only to have it go away after the log has been replayed. This means our mount time feature processing could be all wrong. Hence you can mount with noattr2, crash shortly afterwards, and mount again without attr2 or noattr2 and still have attr2 enabled because the second mount sees attr2 still enabled in the superblock before recovery runs and removes the feature bit. It's just a mess.

Further, this is all legacy code as the v5 format requires attr2 to be enabled at all times and it cannot be disabled. i.e. the noattr2 mount option returns an error when used on v5 format filesystems.

To straighten this all out, this patch reverts the attr2/noattr2 mount option behaviour back to the original behaviour. There is no reason for disabling attr2 these days, so we will only do this when the noattr2 mount option is set. This will not remove the superblock feature bit. The superblock bit will provide the default behaviour and only track whether attr2 is present on disk or not. The attr2 mount option will enable the creation of attr2 format inode forks, and if the superblock feature bit is not set it will be added when the first attr2 inode fork is created. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
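As a rough illustration of the restored behaviour, the decision this patch reinstates can be sketched in standalone C; the struct and helper below are hypothetical stand-ins, not the kernel's actual code:

    #include <stdbool.h>

    /* Hypothetical stand-ins for the mount options and superblock bit. */
    struct attr2_state {
            bool sb_feature_attr2;  /* on-disk superblock feature bit */
            bool opt_attr2;         /* "attr2" mount option */
            bool opt_noattr2;       /* "noattr2" mount option */
    };

    /* Should a new inode get an attr2 format attribute fork? */
    static bool use_attr2_fork(const struct attr2_state *s)
    {
            if (s->opt_noattr2)
                    return false;           /* create attr1 forks; the sb bit is left alone */
            if (s->opt_attr2)
                    return true;            /* sb bit gets set when the first attr2 fork is created */
            return s->sb_feature_attr2;     /* default comes from the on-disk feature bit */
    }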
-
Dave Chinner authored
xfs_has_attr() is poorly named. It has global scope as it is defined in a header file, but it has no namespace scope that tells us what it is checking has attributes. It's not even clear what "has_attr" means, because what it is actually doing is an attribute fork lookup to see if the attribute exists. Upcoming patches use this "xfs_has_<foo>" namespace for global filesystem features, which conflicts with this function. Rename xfs_has_attr() to xfs_attr_lookup() and make it a static function, freeing up the "xfs_has_" namespace for global scope usage. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
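For context, the change amounts to roughly the following; the signatures are approximated from the description above rather than copied from the patch:

    /* Approximate before/after, not copied from the patch. */
    struct xfs_da_args;

    /* Before: global scope via the header, with a name that doesn't say
     * what it is checking. */
    int xfs_has_attr(struct xfs_da_args *args);

    /* After: a static helper local to the attr code, named for the
     * attribute fork lookup it actually performs, freeing "xfs_has_*"
     * for global feature checks. */
    static int xfs_attr_lookup(struct xfs_da_args *args);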
-
Dave Chinner authored
The verifier checks explicitly for bp->b_bn == XFS_SB_DADDR to match the primary superblock buffer, but the primary superblock is an uncached buffer and so bp->b_bn is always -1ULL. Hence this never matches and the CRC error reporting is wholly dependent on the mount superblock already being populated so CRC feature checks pass and allow CRC errors to be reported. Fix this so that the primary superblock CRC error reporting is not dependent on already having read the superblock into memory. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Darrick J. Wong authored
Because there are a lot of tracepoints that express numeric data with an associated unit and tag, document what they are to help everyone else keep these things straight. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
When using pretty-printed scrub tracepoints, decode the meaning of the scrub flags as strings for easier reading. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
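The kernel typically does this decoding with __print_flags() inside a tracepoint's TP_printk(); here is a standalone sketch of the same idea, with made-up flag names rather than the real scrub flag values:

    #include <stdio.h>

    /* Hypothetical flag bits standing in for the real scrub flags. */
    #define SCRUB_FLAG_REPAIR   (1u << 0)
    #define SCRUB_FLAG_PREEN    (1u << 1)
    #define SCRUB_FLAG_CORRUPT  (1u << 2)

    /* Decode a flags word into readable strings, as the tracepoint does. */
    static void print_scrub_flags(unsigned int flags)
    {
            printf("flags:");
            if (flags & SCRUB_FLAG_REPAIR)
                    printf(" repair");
            if (flags & SCRUB_FLAG_PREEN)
                    printf(" preen");
            if (flags & SCRUB_FLAG_CORRUPT)
                    printf(" corrupt");
            printf("\n");
    }

    int main(void)
    {
            print_scrub_flags(SCRUB_FLAG_REPAIR | SCRUB_FLAG_CORRUPT);
            return 0;
    }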
-
Darrick J. Wong authored
Always print inode generation in hexadecimal and preceded with the unit "gen". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
For the remaining xfs_buf tracepoints, convert all the tags to xfs_daddr_t units and retag them 'daddrcount' to match everything else. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Emit whichfork values as text strings in the ftrace output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Whenever we record i_disk_size (i.e. the ondisk file size), use the "disize" tag and hexadecimal format consistently. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Some of our tracepoints have a field known as "count". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal when we're referring to blocks, extents, or IO operations. "fsbcount" are in units of fs blocks "bytecount" are in units of bytes Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Some of our tracepoints have a field known as "len". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal. "fsbcount" are in units of fs blocks "bbcount" are in units of 512b blocks "ireccount" are in units of inodes Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Some of our tracepoints describe fields as "offset". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal. "fileoff" means file offset, in units of fs blocks "pos" means file offset, in bytes "forkoff" means inode fork offset, in bytes The one remaining "offset" value is for iclogs, since that's the byte offset of the end of where we've written into the current iclog. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Some of our tracepoints describe fields as "blkno", "block", or "bno". That name doesn't describe any units, which makes the fields not very useful. Rename the fields to capture units and ensure the format is hexadecimal. "startblock" is the startblock field from the bmap structure, which is a segmented fsblock on the data device, or an rfsblock on the realtime device. "fileoff" is a file offset, in units of filesystem blocks "daddr" is a raw device offset, in 512b blocks Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Always print disk addr (i.e. 512 byte block) numbers in hexadecimal and preceded with the unit "daddr". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Always print rmap owner number in hexadecimal and preceded with the unit "owner". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Always print allocation group block numbers in hexadecimal and preceded with the unit "agbno". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Always print allocation group numbers in hexadecimal and preceded with the unit "agno". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
Always print inode numbers in hexadecimal and preceded with the unit "ino" or "agino", as appropriate. Fix one tracepoint that used "ino %u" for an inode btree block count to reduce confusion. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Darrick J. Wong authored
XFS_DADDR_TO_FSB converts a raw disk address (in units of 512b blocks) to a raw disk address (in units of fs blocks). Unfortunately, the xchk_block_error_class tracepoints incorrectly use this to decode xfs_daddr_t into segmented AG number and AG block addresses. Use the correct translation code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
-
Christoph Hellwig authored
With quotaoff not allowing disabling of accounting there is no need for untagged lookups in this code, so remove the dead leftovers. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [djwong: convert to for_each_perag_tag] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Darrick J. Wong authored
Constify the rest of the btree functions that take structure and union pointers and are not supposed to modify them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
This btree function is called when updating a record in the rightmost block of a btree so that we can update the AGF's longest free extent length field. Neither parameter is supposed to be updated, so mark them both const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The @start pointer passed to each per-AG btree type's ->alloc_block function isn't supposed to be modified, since it's a hint about the location of the btree block being split that is to be fed to the allocator, so mark the parameter const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The pointer passed to each per-AG btree type's ->set_root function isn't supposed to be modified (that function sets an external pointer to the root block) so mark them const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
xchk_btree calls a user-supplied function to validate each btree record that it finds. Those functions are not supposed to change the record data, so mark the parameter const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The inorder functions are simple predicates, which means that they don't modify the parameters. Mark them all const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
These functions initialize a key from a record, but they aren't supposed to modify the record. Mark it const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The query_range functions are supposed to call a caller-supplied function on each record found in the dataset. These functions don't own the memory storing the record, so don't let them change the record. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Range query functions are not supposed to modify the query keys that are being passed in, so mark them all const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
The btree key comparison functions are not allowed to change the keys that are passed in, so mark them const. We'll need this for the next patch, which adds const to the btree range query functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
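The shape of the change, sketched with a generic key type rather than the real union xfs_btree_key (purely illustrative):

    #include <stdint.h>

    struct key { uint64_t startblock; };

    /* A comparison callback that only reads its keys can (and should)
     * take them as const pointers, which is what this series enforces. */
    static int64_t diff_two_keys(const struct key *k1, const struct key *k2)
    {
            return (int64_t)(k1->startblock - k2->startblock);
    }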
-
Darrick J. Wong authored
Add a tracepoint for fs shutdowns so we can capture that in ftrace output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Now that we always grab an active reference to a perag structure when dealing with perag metadata, we can remove this unnecessary variable. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
There are several GETFSMAP backend functions for XFS to cover the three devices and various feature support. Each of these functions is passed pointers to the low and high keys for the dataset that userspace requested, and a pointer to scratchpad variables that are used to control the iteration and fill out records. The scratchpad data can be changed arbitrarily, but the keys are supposed to remain unchanged (and under the control of the outermost loop in xfs_getfsmap). Unfortunately, the data and rt backends modify the keys that are passed in from the main control loop, which causes subsequent calls to return incorrect query results. Specifically, each of those two functions sets the block number in the high key to the size of their respective device. Since fsmap results are sorted in device number order, if the lower numbered device is smaller than the higher numbered device, the first function will set the high key to the small size, and the key remains unchanged as it is passed into the function for the higher numbered device. The second function will then fail to return all of the results for the dataset that userspace is asking for because the keyspace is incorrectly constrained. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
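One way to avoid this class of bug is for each backend to clamp a private copy of the caller's keys rather than writing through the pointers it was handed; a minimal sketch with hypothetical types:

    struct fsmap_key { unsigned long long block; };

    /* Hypothetical backend: clamp the high key to the device size in a
     * local copy, leaving the caller's keys untouched for the backend of
     * the next (possibly larger) device. */
    static void backend_query(const struct fsmap_key *lo,
                              const struct fsmap_key *hi,
                              unsigned long long dev_size_blocks)
    {
            struct fsmap_key local_hi = *hi;

            if (local_hi.block > dev_size_blocks)
                    local_hi.block = dev_size_blocks;

            /* ... run the query between *lo and local_hi ... */
            (void)lo;
    }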
-
Darrick J. Wong authored
The fsmap implementation for realtime devices uses the gap between info->next_daddr and a free rtextent reported by xfs_rtalloc_query_range to feed userspace fsmap records with an "unknown" owner. We use this trick to report to userspace when the last rtextent in the filesystem is in use by synthesizing a null rmap record starting at the next block after the query range. Unfortunately, there's a minor accounting bug in the way that we construct the null rmap record. Originally, ahigh.ar_startext contains the last rtextent for which the user wants records. It's entirely possible that number is beyond the end of the rt volume, so the location of the synthesized rmap record /must/ be constrained to the minimum of the high key and the number of extents in the rt volume. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
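The constraint described above boils down to a simple clamp; a hypothetical helper, with units in rt extents:

    /* Hypothetical helper: place the synthesized "unknown owner" record no
     * further out than the end of the rt volume, even if the caller's high
     * key points beyond it. */
    static unsigned long long clamp_rt_end(unsigned long long high_key_startext,
                                           unsigned long long rextents)
    {
            return high_key_startext < rextents ? high_key_startext : rextents;
    }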
-
Darrick J. Wong authored
In commit 8ad560d2, we changed xfs_rtalloc_query_range to constrain the range of bits in the realtime bitmap file that would actually be searched. In commit a3a374bf, we changed the range again (incorrectly), leading to the fix in commit d88850bd, which finally corrected the range check code. Unfortunately, the author never noticed that the function modifies its input parameters, which is a totally no-no since none of the other range query functions change their input parameters. So, fix this function yet again to stash the upper end of the query range (i.e. the high key) in a local variable and hope this is the last time I have to fix my own function. While we're at it, mark the key inputs const so nobody makes this mistake again. :( Fixes: 8ad560d2 ("xfs: strengthen rtalloc query range checks") Not-fixed-by: a3a374bf ("xfs: fix off-by-one error in xfs_rtalloc_query_range") Not-fixed-by: d88850bd ("xfs: fix high key handling in the rt allocator's query_range function") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
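Conceptually the fix looks like this; a generic sketch, not the actual xfs_rtalloc_query_range() signature:

    struct rtkey { unsigned long long startext; };

    /* Inputs are const so the function cannot modify the caller's keys;
     * the clamped upper end lives in a local variable instead. */
    static void query_range(const struct rtkey *low, const struct rtkey *high,
                            unsigned long long last_ext)
    {
            unsigned long long end = high->startext;

            if (end > last_ext)
                    end = last_ext;

            /* ... walk the realtime bitmap from low->startext to end ... */
            (void)low;
    }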
-
Dave Chinner authored
->writepage is only used in one place - single-page writeback from memory reclaim. We only allow such writeback from kswapd, not from direct memory reclaim, and so it is rarely used. When it comes from kswapd, it is effectively random dirty page shoot-down, which is horrible for IO patterns. We will already have background writeback trying to clean all the dirty pages in memory as efficiently as possible, so having kswapd interrupt our well-formed IO stream only slows things down. So get rid of xfs_vm_writepage() completely. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: forward port to 5.15] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
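The visible effect is that the XFS aops table no longer wires up single-page writeback; an abridged sketch (the function and struct names are the real ones, but the member list is not complete):

    #include <linux/fs.h>

    const struct address_space_operations xfs_address_space_operations = {
            /* .writepage = xfs_vm_writepage removed: kswapd can no longer
             * shoot down random single dirty pages through this path. */
            .writepages     = xfs_vm_writepages,
            /* ... remaining methods unchanged ... */
    };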
-
- 16 Aug, 2021 4 commits
-
-
Dave Chinner authored
We only use the CIL workqueue in the CIL, so it makes no sense to hang it off the xfs_mount and have to walk multiple pointers back up to the mount when we have the CIL structures right there. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-
Dave Chinner authored
Because we use a single work structure attached to the CIL rather than the CIL context, we can only queue a single work item at a time. This results in the CIL being single threaded and limits performance when it becomes CPU bound. The design of the CIL is that it is pipelined and multiple commits can be running concurrently, but the way the work is currently implemented means that it is not pipelining as it was intended.

The critical work to switch the CIL context can take a few milliseconds to run, but the rest of the CIL context flush can take hundreds of milliseconds to complete. The context switching is the serialisation point of the CIL; once the context has been switched, the rest of the context push can run asynchronously with all other context pushes. Hence we can move the work to the CIL context so that we can run multiple CIL pushes at the same time and spread the majority of the work out over multiple CPUs. We can keep the per-cpu CIL commit state on the CIL rather than the context, because the context is pinned to the CIL until the switch is done and we aggregate and drain the per-cpu state held on the CIL during the context switch.

However, because we no longer serialise the CIL work, we can have effectively unlimited CIL pushes in progress. We don't want to do this - not only does it create contention on the iclogs and the state machine locks, we can run the log right out of space with outstanding pushes. Instead, limit the work concurrency to 4 concurrent work items being processed at a time. This is enough concurrency to remove the CIL from being a CPU bound bottleneck but not enough to create new contention points or unbound concurrency issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
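One way to express that concurrency cap is a bounded workqueue; a kernel-style sketch in which the limit of 4 comes from the description above, while the surrounding names and flags are illustrative:

    #include <linux/errno.h>
    #include <linux/workqueue.h>

    /* Bound the number of CIL push works running at once to 4: enough to
     * stop the CIL being a single-threaded CPU bottleneck, not enough to
     * create new contention on the iclogs and state machine locks. */
    static struct workqueue_struct *cil_push_wq;

    static int cil_wq_init(const char *sb_name)
    {
            cil_push_wq = alloc_workqueue("xfs-cil/%s",
                            WQ_FREEZABLE | WQ_MEM_RECLAIM, 4, sb_name);
            return cil_push_wq ? 0 : -ENOMEM;
    }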
-
Dave Chinner authored
The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from the AIL when the checkpoint that marks the items stale is committed to the journal. This results in relatively few items in the AIL, but those that remain are often pinned, as the directories that items are being removed from are still being logged. As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon:

    ....
    399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
                <wanted 20ms, got 270ms delay>
    400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
    400.099623: xfsaild: first lsn 0x11002f3600
    400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
                <wanted 50ms, got 500ms delay>
    400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
    400.589349: xfsaild: first lsn 0x1100305000
    400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
                <wanted 50ms, got 460ms delay>
    400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
    400.950343: xfsaild: first lsn 0x1100317c00
    400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
                <wanted 20ms, got 200ms delay>
    401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
    401.142334: xfsaild: first lsn 0x110032e600
    401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
                <wanted 10ms, got 10ms delay>
    401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
    401.154328: xfsaild: first lsn 0x1100353000
    401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
                <wanted 20ms, got 300ms delay>
    401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
    401.451526: xfsaild: first lsn 0x1100353000
    401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
                <wanted 50ms, got 500ms delay>
    401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
    ....

In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and the delay between last/first is the sleep time + the sync log force time. Some of these 101 items pinned the tail of the log. The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it's been running so many transactions that generate stale items which never reach the AIL but still consume log space. Hence we have a largely empty AIL, with long-term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available.

The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued. Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue. 
Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing. We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it. One of the complexities here is that the CIL push does not guarantee that the commit record for the CIL checkpoint is written to disk. The current log force ensures this by submitting the current ACTIVE iclog that the commit record was written to. We need the CIL to actually write this commit record to disk for an async push to ensure that the checkpoint actually makes it to disk and unpins the pinned items in the checkpoint on completion. Hence we need to pass down to the CIL push that we are doing an async flush so that it can switch out the commit_iclog if necessary to get written to disk when the commit iclog is finally released. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
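A sketch of the AIL-side change; the helper name below is a hypothetical stand-in for the async "push current sequence" interface described above, not the real function added by the patch:

    struct xlog;

    /* Hypothetical async interface: ask the CIL to start pushing its
     * current sequence without waiting for the push to complete. */
    void cil_push_current_async(struct xlog *log);

    /* AIL side: instead of a blocking log force (effectively
     * xfs_log_force(mp, XFS_LOG_SYNC)), kick the CIL and keep pushing. */
    static void xfsaild_kick_cil(struct xlog *log)
    {
            cil_push_current_async(log);
    }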
-
Dave Chinner authored
Log recovery depends on strictly ordered start records as well as strictly ordered commit records. There is a zero day bug in the way XFS writes pipelined transactions to the journal which is exposed by fixing the zero day bug that prevents the CIL from pipelining checkpoints. That fix re-introduces explicit concurrent commits back into the on-disk journal and hence out-of-order start records.

The XFS journal commit code has never ordered start records and we have relied on strict commit record ordering for correct recovery ordering of concurrently written transactions. Unfortunately, root cause analysis uncovered the fact that log recovery uses the LSN of the start record for transaction commit processing. Hence, whilst the commits are processed in strict order by recovery, the LSNs associated with the commits can be out of order and so recovery may stamp incorrect LSNs into objects and/or misorder intents in the AIL for later processing. This can result in log recovery failures and/or on disk corruption, sometimes silent.

Because this is a long standing log recovery issue, we can't just fix log recovery and call it good. This still leaves older kernels susceptible to recovery failures and corruption when replaying a log from a kernel that pipelines checkpoints. There is also the issue that in-memory ordering for AIL pushing and data integrity operations are based on checkpoint start LSNs, and if the start LSN is incorrect in the journal, it is also incorrect in memory.

Hence there's really only one choice for fixing this zero-day bug: we need to strictly order checkpoint start records in ascending sequence order in the log, the same way we already strictly order commit records. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-