- 30 Apr, 2019 1 commit
-
-
Darrick J. Wong authored
During testing of xfs/141 on a V4 filesystem, I observed some inconsistent behavior with regards to resources that are held (i.e. remain locked) across a defer roll. The transaction roll always gives the defer roll function a new transaction, even if committing the old transaction fails. However, the defer roll function only rejoins the held resources if the transaction commit succeedied. This means that callers of defer roll have to figure out whether the held resources are attached to the transaction being passed back. Worse yet, if the defer roll was part of a defer finish call, we have a third possibility: the defer finish could pass back a dirty transaction with dirty held resources and an error code. The only sane way to handle all of these scenarios is to require that the code that held the resource either cancel the transaction before unlocking and releasing the resources, or use functions that detach resources from a transaction properly (e.g. xfs_trans_brelse) if they need to drop the reference before committing or cancelling the transaction. In order to make this so, change the defer roll code to join held resources to the new transaction unconditionally and fix all the bhold callers to release the held buffers correctly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
- 26 Apr, 2019 6 commits
-
-
Brian Foster authored
xfs_prepare_shift() fails to check the error return from xfs_flush_unmap_range(). If the latter fails, that could lead to an insert/collapse range operation over a delalloc range, which is not supported. Add an error check and return appropriately. This is reproduced rarely by generic/475. Fixes: 7f9f71be ("xfs: extent shifting doesn't fully invalidate page cache") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
In theory, the incore per-AG structure counters should match the ones on disk, so check that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
The forthcoming summary counter patch races with regular filesystem activity to compute rough expected values for the counters. This design was chosen to avoid having to freeze the entire filesystem to check the counters, but while that's running we'd prefer to minimize background reclamation activity to reduce the perturbations to the incore free block count. Therefore, provide a way for scrubbers to disable background posteof and cowblock reclamation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
"reclaim" is used throughout the icache code to mean reclamation of incore inode structures. It's also used for two helper functions that toggle background deletion of speculative preallocations. Separate the second of the two uses to make things less confusing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
Add a percpu counter to track the number of blocks directly reserved for delayed allocations on the data device. This counter (in contrast to i_delayed_blks) does not track allocated CoW staging extents or anything going on with the realtime device. It will be used in the upcoming summary counter scrub function to check the free block counts without having to freeze the filesystem or walk all the inodes to find the delayed allocations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
In xrep_roll_ag_trans, the transaction roll will always set sc->tp to the new transaction, even if committing the old one fails. A bare transaction roll leaves the buffer(s) locked but not joined to the new transaction, so it's not necessary to release the hold if the roll fails. Remove the incorrect xfs_trans_bhold_release calls. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
- 23 Apr, 2019 6 commits
-
-
Darrick J. Wong authored
We passed an inode into xfs_ioctl_setattr_get_trans with join_flags indicating which locks are held on that inode. If we can't allocate a transaction then we need to unlock the inode before we bail out, like all the other error paths do. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
There's only a few uses left, so just kill the typedef while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Widen the incore inode's i_delayed_blks counter to be a 64-bit integer. This is necessary to fix an integer overflow problem that can be reproduced easily now that we use the counter to track blocks that are assigned to the inode in memory but not on disk. This includes actual delalloc reservations as well as real extents in the COW fork that are waiting to be remapped into the data fork. These 'delayed mapping' blocks can easily exceed 2^32 blocks if one creates a very large sparse file of size approximately 2^33 bytes with one byte written every 2^23 bytes, sets a very large COW extent size hint of 2^23 blocks, reflinks the first file into a second file, and then writes a single byte every 2^23 blocks in the original file. When this happens, we'll try to create approximately 1024 2^23 extent reservations in the COW fork, which will overflow the counter and cause problems. Note that on x64 we end up filling a 4-byte gap in the structure so this doesn't increase the incore size. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Widen the incore quota transaction delta structure to treat block counters as 64-bit integers. This is a necessary addition so that we can widen the i_delayed_blks counter to be a 64-bit integer. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Darrick J. Wong authored
Dave Chinner noticed that xfs_file_dio_aio_write returns EAGAIN without dropping the IOLOCK when its deciding not to wait, which means that we leak the IOLOCK there. Since we now make unaligned directio always wait, we have the opportunity to bail out before trying to take the lock, which should reduce the overhead of this never-gonna-work case considerably while also solving the dropped lock problem. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
-
Brian Foster authored
Block allocation requires a permanent transaction for deferred AGFL frees. Add an assert in the block allocation path to make explicit and obvious to future callers the requirement of a transaction with a permanent reservation. Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: split this out from the previous patch per hch request] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
- 22 Apr, 2019 1 commit
-
-
Brian Foster authored
The growdata transaction is used by growfs operations to increase the data size of the filesystem. Part of this sequence involves extending the size of the last preexisting AG in the fs, if necessary. This is implemented by freeing the newly available physical range to the AG. tr_growdata is not a permanent transaction, however, and block allocation transactions must be permanent to handle deferred frees of AGFL blocks. If the grow operation extends an existing AG that requires AGFL fixing, assert failures occur due to a populated dfops list on a non-permanent transaction and the AGFL free does not occur. This is reproduced (rarely) by xfs/104. Change tr_growdata to a permanent transaction with a default log count. This increases initial transaction reservation size, but growfs is an infrequent and non-performance critical operation and so should have minimal impact. Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: add a comment to the assert] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
- 16 Apr, 2019 8 commits
-
-
Darrick J. Wong authored
It's possible for pagecache writeback to split up a large amount of work into smaller pieces for throttling purposes or to reduce the amount of time a writeback operation is pending. Whatever the reason, XFS can end up with a bunch of IO completions that call for the same operation to be performed on a contiguous extent mapping. Since mappings are extent based in XFS, we'd prefer to run fewer transactions when we can. When we're processing an ioend on the list of io completions, check to see if the next items on the list are both adjacent and of the same type. If so, we can merge the completions to reduce transaction overhead. On fast storage this doesn't seem to make much of a difference in performance, though the number of transactions for an overnight xfstests run seems to drop by ~5%. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Now that we're no longer using m_data_workqueue, remove it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
When scheduling writeback of dirty file data in the page cache, XFS uses IO completion workqueue items to ensure that filesystem metadata only updates after the write completes successfully. This is essential for converting unwritten extents to real extents at the right time and performing COW remappings. Unfortunately, XFS queues each IO completion work item to an unbounded workqueue, which means that the kernel can spawn dozens of threads to try to handle the items quickly. These threads need to take the ILOCK to update file metadata, which results in heavy ILOCK contention if a large number of the work items target a single file, which is inefficient. Worse yet, the writeback completion threads get stuck waiting for the ILOCK while holding transaction reservations, which can use up all available log reservation space. When that happens, metadata updates to other parts of the filesystem grind to a halt, even if the filesystem could otherwise have handled it. Even worse, if one of the things grinding to a halt happens to be a thread in the middle of a defer-ops finish holding the same ILOCK and trying to obtain more log reservation having exhausted the permanent reservation, we now have an ABBA deadlock - writeback completion has a transaction reserved and wants the ILOCK, and someone else has the ILOCK and wants a transaction reservation. Therefore, we create a per-inode writeback io completion queue + work item. When writeback finishes, it can add the ioend to the per-inode queue and let the single worker item process that queue. This dramatically cuts down on the number of kworkers and ILOCK contention in the system, and seems to have eliminated an occasional deadlock I was seeing while running generic/476. Testing with a program that simulates a heavy random-write workload to a single file demonstrates that the number of kworkers drops from approximately 120 threads per file to 1, without dramatically changing write bandwidth or pagecache access latency. Note that we leave the xfs-conv workqueue's max_active alone because we still want to be able to run ioend processing for as many inodes as the system can handle. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Skip cross-referencing with a btree if the health report tells us that it's known to be bad. This should reduce the dmesg spew considerably. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
Now that we have the ability to track sick metadata in-core, make scrub and repair update those health assessments after doing work. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
Now that we no longer memset the scrub context, we can move the already_fixed variable into the scrub context's state flags instead of passing around pointers to separate stack variables. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
Combine all the boolean state flags in struct xfs_scrub into a single unsigned int, because we're going to be adding more state flags soon. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
Darrick J. Wong authored
It's a little silly how the memset in scrub context initialization forces us to declare stack variables to preserve context variables across a retry. Since the teardown functions already null out most of the ephemeral state (buffer pointers, btree cursors, etc.), just skip the memset and move the initialization as needed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
-
- 15 Apr, 2019 13 commits
-
-
Darrick J. Wong authored
Use space in the bulkstat ioctl structure to report any problems observed with the inode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Use the AG geometry info ioctl to report health status too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Use our newly expanded geometry structure to report the overall fs and realtime health status. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Add a new ioctl to describe an allocation group's geometry. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Dave Chinner authored
Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space so we can't just add a new field to it. Hence we need to bump the definition to V5 and and treat the V4 ioctl and structure similar to v1 to v3. While doing this, clean up all the definitions associated with the XFS_IOC_FSGEOMETRY ioctl. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: forward port to 5.1, expand structure size to 256 bytes] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
If we know the filesystem metadata isn't healthy during unmount, we want to encourage the administrator to run xfs_repair right away. We can't do this if BAD_SUMMARY will cause an unclean log unmount to force summary recalculation, so turn it off if the fs is bad. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Replace the BAD_SUMMARY mount flag with calls to the equivalent health tracking code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Darrick J. Wong authored
Add the necessary in-core metadata fields to keep track of which parts of the filesystem have been observed and which parts were observed to be unhealthy, and print a warning at unmount time if we have unfixed problems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
-
Wang Shilong authored
This patch tries to address two problems: 1) return @minlen we used to trim to user space. 2) return EINVAL if granularity is larger than avg size, even most of cases, granularity is small(4K), but if devices return a lager granularity for some reaons (testing, bugs etc), fstrim should return failure directly. Signed-off-by: Wang Shilong <wshilong@ddn.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
The block allocation AG selection code has parameters that allow a caller to perform multiple allocations from a single AG and transaction (under certain conditions). The parameters specify the total block allocation count required by the transaction and the AG selection code selects and locks an AG that will be able to satisfy the overall requirement. If the available block accounting calculation turns out to be inaccurate and a subsequent allocation call fails with -ENOSPC, the resulting transaction cancel leads to filesystem shutdown because the transaction is dirty. This exact problem can be reproduced with a highly parallel space consumer and fsstress workload running long enough to a large filesystem against -ENOSPC conditions. A bmbt block allocation request made for inode extent to bmap format conversion after an extent allocation is expected to be satisfied by the same AG and the same transaction as the extent allocation. The bmbt block allocation fails, however, because the block availability of the AG has changed since the AG was selected (outside of the blocks used for the extent itself). The inconsistent block availability calculation is caused by the deferred block freeing behavior of the AGFL. This immediately removes extra blocks from the AGFL to free up AGFL slots, but rather than immediately freeing such blocks as was done in the past, the block free is deferred such that said blocks are not available for allocation until the current transaction commits. The AG selection logic currently considers all AGFL blocks as available and executes shortly before any extra AGFL blocks are freed. This means the block availability of the current AG can change before the first allocation even occurs, but in practice a failure is more likely to manifest via a subsequent allocation because extent allocation usually has a contiguity requirement larger than a single block that can't be satisfied from the AGFL. In general, XFS prefers operational robustness to absolute allocation efficiency. In other words, we prefer to return -ENOSPC slightly earlier at the expense of not being able to allocate every last block in an AG to avoid this kind of problem. As such, update the AG block availability calculation to consider extra AGFL blocks as unavailable since they are immediately removed following the calculation and will not become available until the current transaction commits. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
If xfs_iflush_cluster() fails due to corruption, the error path issues a shutdown and simulates an I/O completion to release the buffer. This code has a couple small problems. First, the shutdown sequence can issue a synchronous log force, which is unsafe to do with buffer locks held. Second, the simulated I/O completion does not guarantee the buffer is async and thus is unlocked and released. For example, if the last operation on the buffer was a read off disk prior to the corruption event, XBF_ASYNC is not set and the buffer is left locked and held upon return. This results in a memory leak as shown by the following message on module unload: BUG xfs_buf (...): Objects remaining in xfs_buf on __kmem_cache_shutdown() Fix both of these problems by setting XBF_ASYNC on the buffer prior to the simulated I/O error and performing the shutdown immediately after ioend processing when the buffer has been released. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
XFS shutdown deadlocks have been reproduced by fstest generic/475. The deadlock signature involves log I/O completion running error handling to abort logged items and waiting for an inode cluster buffer lock in the buffer item unpin handler. The buffer lock is held by xfsaild attempting to flush an inode. The buffer happens to be pinned and so xfs_iflush() triggers an async log force to begin work required to get it unpinned. The log force is blocked waiting on the commit completion, which never occurs and thus leaves the filesystem deadlocked. The root problem is that aborted log I/O completion pots commit completion behind callback completion, which is unexpected for async log forces. Under normal running conditions, an async log force returns to the caller once the CIL ctx has been formatted/submitted and the commit completion event triggered at the tail end of xlog_cil_push(). If the filesystem has shutdown, however, we rely on xlog_cil_committed() to trigger the completion event and it happens to do so after running log item unpin callbacks. This makes it unsafe to invoke an async log force from contexts that hold locks that might also be required in log completion processing. To address this problem, wake commit completion waiters before aborting log items in the log I/O completion handler. This ensures that an async log force will not deadlock on held locks if the filesystem happens to shutdown. Note that it is still unsafe to issue a sync log force while holding such locks because a sync log force explicitly waits on the force completion, which occurs after log I/O completion processing. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
Brian Foster authored
The xfs_buf_log_item ->iop_unlock() callback asserts that the buffer is unlocked when either non-stale or aborted. This assert occurs after the bli refcount has been dropped and the log item potentially freed. The aborted check is thus a potential use after free. This problem has been reproduced with KASAN enabled via generic/475. Fix up xfs_buf_item_unlock() to query aborted state before the bli reference is dropped to prevent a potential use after free. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
-
- 14 Apr, 2019 5 commits
-
-
Linus Torvalds authored
-
Linus Torvalds authored
Merge page ref overflow branch. Jann Horn reported that he can overflow the page ref count with sufficient memory (and a filesystem that is intentionally extremely slow). Admittedly it's not exactly easy. To have more than four billion references to a page requires a minimum of 32GB of kernel memory just for the pointers to the pages, much less any metadata to keep track of those pointers. Jann needed a total of 140GB of memory and a specially crafted filesystem that leaves all reads pending (in order to not ever free the page references and just keep adding more). Still, we have a fairly straightforward way to limit the two obvious user-controllable sources of page references: direct-IO like page references gotten through get_user_pages(), and the splice pipe page duplication. So let's just do that. * branch page-refs: fs: prevent page refcount overflow in pipe_buf_get mm: prevent get_user_pages() from overflowing page refcount mm: add 'try_get_page()' helper function mm: make page ref count overflow check tighter and more explicit
-
Matthew Wilcox authored
Change pipe_buf_get() to return a bool indicating whether it succeeded in raising the refcount of the page (if the thing in the pipe is a page). This removes another mechanism for overflowing the page refcount. All callers converted to handle a failure. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Linus Torvalds authored
If the page refcount wraps around past zero, it will be freed while there are still four billion references to it. One of the possible avenues for an attacker to try to make this happen is by doing direct IO on a page multiple times. This patch makes get_user_pages() refuse to take a new page reference if there are already more than two billion references to the page. Reported-by: Jann Horn <jannh@google.com> Acked-by: Matthew Wilcox <willy@infradead.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Linus Torvalds authored
This is the same as the traditional 'get_page()' function, but instead of unconditionally incrementing the reference count of the page, it only does so if the count was "safe". It returns whether the reference count was incremented (and is marked __must_check, since the caller obviously has to be aware of it). Also like 'get_page()', you can't use this function unless you already had a reference to the page. The intent is that you can use this exactly like get_page(), but in situations where you want to limit the maximum reference count. The code currently does an unconditional WARN_ON_ONCE() if we ever hit the reference count issues (either zero or negative), as a notification that the conditional non-increment actually happened. NOTE! The count access for the "safety" check is inherently racy, but that doesn't matter since the buffer we use is basically half the range of the reference count (ie we look at the sign of the count). Acked-by: Matthew Wilcox <willy@infradead.org> Cc: Jann Horn <jannh@google.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-