Commit 4613b17c authored by Darrick J. Wong

Merge tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-5.20-mergeB

xfs: introduce in-memory inode unlink log items

To facilitate future improvements in inode logging and to improve
inode cluster buffer locking order consistency, we need a new
mechanism for deferring inode cluster buffer modifications during
unlinked list modifications.

The unlinked inode list buffer locking is complex. The unlinked
list is unordered - we add to the tail, remove from wherever the
inode is in the list. Hence we might need to lock two inode buffers
here (previous inode in list and the one being removed). While we
can order the locking of these buffers correctly within the confines
of the unlinked list, there may be other inodes that need buffer
locking in the same transaction. e.g. O_TMPFILE being linked into a
directory also modifies the directory inode.

Hence we need a mechanism for deferring unlinked inode list updates
until a point where we know that all modifications have been made
and all that remains is to lock and modify the cluster buffers.

We can do this by first observing that we serialise unlinked list
modifications by holding the AGI buffer lock. IOWs, the AGI is going
to be locked until the transaction commits any time we modify the
unlinked list. Hence it doesn't matter when in the unlink
transaction we actually load, lock and modify the inode
cluster buffer.

We add an in-memory unlinked inode log item to defer the inode
cluster buffer update to transaction commit time where it can be
ordered with all the other inode cluster operations that need to be
done. Essentially all we need to do is record the inodes that need
to have their unlinked list pointer updated in a new log item that
we attach to the transaction.
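
As an illustration of the "record now, apply later" idea described above, here is a minimal userspace sketch. It is not the kernel code: every `sketch_` name, the return-value convention and the single-field stand-in for an on-disk inode are assumptions made for the example. It records the old and new next-unlinked values up front, and only touches the "on-disk" field when the deferred item is applied, after checking that the old value is still what was recorded.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NULLAGINO UINT32_MAX	/* stand-in for the list terminator */

/* Hypothetical stand-in for the on-disk inode's next-unlinked field. */
struct sketch_dinode {
	uint32_t next_unlinked;
};

/* Hypothetical deferred-update record attached to a transaction. */
struct sketch_iunlink_item {
	struct sketch_dinode *dip;	/* slot to modify at apply time */
	uint32_t old_agino;		/* value expected at apply time */
	uint32_t next_agino;		/* value to write at apply time */
};

/*
 * Record the intended pointer change without touching the dinode.
 * Returns 1 if an item was filled in, 0 if there is nothing to do
 * (terminating an already terminated list), and -1 if the new value
 * equals the current non-null value, which a linked list update
 * should never request.
 */
static int sketch_iunlink_record(struct sketch_iunlink_item *item,
		struct sketch_dinode *dip, uint32_t incore_next,
		uint32_t next_agino)
{
	if (incore_next == next_agino)
		return next_agino == SKETCH_NULLAGINO ? 0 : -1;

	item->dip = dip;
	item->old_agino = incore_next;
	item->next_agino = next_agino;
	return 1;
}

/*
 * Apply the deferred change. Fails with -1, leaving the dinode
 * untouched, if the current value is not the one recorded earlier.
 */
static int sketch_iunlink_apply(const struct sketch_iunlink_item *item)
{
	if (item->dip->next_unlinked != item->old_agino)
		return -1;
	item->dip->next_unlinked = item->next_agino;
	return 0;
}
```

In this sketch a caller would invoke `sketch_iunlink_record()` wherever the list is modified and `sketch_iunlink_apply()` once all other modifications are done. The old-value check plays the same role as a corruption check: a mismatch means something else changed the pointer in between.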

This log item exists purely for the purpose of delaying the update
of the unlinked list pointer until the inode cluster buffer can be
locked in the correct order around the other inode cluster buffers.
It plays no part in the actual commit, and there's no change to
anything that is written to the log. i.e. the inode cluster buffers
still have to be fully logged here (not just ordered) as log
recovery depends on this to replay mods to the unlinked inode
list.

Hence if we add a "precommit" hook into xfs_trans_commit()
to run a "precommit" operation on these iunlink log items, we can
delay the locking, modification and logging of the inode cluster
buffer until after all other modifications have been made. The
precommit hook requires us to sort the items that are going to be run
so that we can lock precommit items in the correct order as we
perform the modifications they describe.
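
The sort-then-run behaviour of the precommit hook can be modelled in userspace as follows. This is a sketch under stated assumptions, not the kernel implementation: it uses an array of pointers and `qsort()` where the kernel sorts a linked list, all `sketch_` names are hypothetical, and the callback takes a sequence number only so the run order can be observed. Items with a sort callback are ordered by ascending 64-bit key, items without one go last, and only dirty items have their precommit callback run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_item;

/* Hypothetical per-item callbacks; either may be NULL. */
struct sketch_item_ops {
	uint64_t (*sort)(const struct sketch_item *it);
	int (*precommit)(struct sketch_item *it, int seq);
};

struct sketch_item {
	const struct sketch_item_ops *ops;
	uint64_t key;	/* e.g. an inode number */
	int dirty;
	int ran_at;	/* sequence recorded by the callback, -1 if never run */
};

static uint64_t sketch_key(const struct sketch_item *it)
{
	return it->key;
}

static int sketch_note_order(struct sketch_item *it, int seq)
{
	it->ran_at = seq;
	return 0;
}

/*
 * Compare 64-bit keys without subtracting them, so the int result
 * cannot be corrupted by truncation. Unsortable items sort last.
 */
static int sketch_cmp(const void *a, const void *b)
{
	const struct sketch_item *ia = *(const struct sketch_item *const *)a;
	const struct sketch_item *ib = *(const struct sketch_item *const *)b;
	uint64_t ka, kb;

	if (!ia->ops->sort && !ib->ops->sort)
		return 0;
	if (!ia->ops->sort)
		return 1;
	if (!ib->ops->sort)
		return -1;

	ka = ia->ops->sort(ia);
	kb = ib->ops->sort(ib);
	return (ka > kb) - (ka < kb);
}

/* Sort the items, then run precommit on dirty ones; stop on first error. */
static int sketch_run_precommits(struct sketch_item **items, size_t n)
{
	size_t i;
	int seq = 0;

	qsort(items, n, sizeof(*items), sketch_cmp);
	for (i = 0; i < n; i++) {
		struct sketch_item *it = items[i];
		int error;

		if (!it->dirty || !it->ops->precommit)
			continue;
		error = it->ops->precommit(it, seq++);
		if (error)
			return error;
	}
	return 0;
}
```

Because every transaction in this model visits its sortable items in the same ascending key order, two transactions that both need the same pair of shared objects cannot acquire them in opposite orders. Note that `qsort()` is not guaranteed to be stable, so the relative order of items that compare equal is unspecified here.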

To make this unlinked inode list processing simpler and easier to
implement as a log item, we need to change the way we track the
unlinked list in memory. Starting from the observation that an inode
on the unlinked list is pinned in memory by the VFS, we can use the
xfs_inode itself to track the unlinked list. To do this efficiently,
we want the unlinked list to be a double linked list. The problem
here is that we need a list per AGI unlinked list, and there are 64
of these per AGI. The approach taken in this patchset is to shadow
the AGI unlinked list heads in the perag, and link inodes by agino,
hence requiring only 8 extra bytes per inode to track this state.

We can then use the agino pointers for lockless inode cache lookups
to retrieve the inode. The aginos in the inode are modified only
under the AGI lock, just like the cluster buffer pointers, so we
don't need any extra locking here.  The i_next_unlinked field tracks
the on-disk value of the unlinked list, and the i_prev_unlinked is a
purely in-memory pointer that enables us to efficiently remove
inodes from the middle of the list.
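
The index-linked list described above can be sketched in userspace like this. It is only a model: the fixed-size array stands in for the inode cache lookup, a single `bucket_head` stands in for one shadowed AGI bucket, all `sketch_` names are hypothetical, and inserting at the head is a choice made purely to keep the example short rather than a statement about which end the real code uses. The point it illustrates is that two 32-bit indices per inode (8 bytes) are enough to unlink an inode from the middle of the list without walking it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NULLAGINO UINT32_MAX	/* stand-in for the list terminator */
#define SKETCH_NINODES   16

struct sketch_inode {
	uint32_t next_unlinked;	/* mirrors the on-disk pointer */
	uint32_t prev_unlinked;	/* in-memory only back pointer */
};

struct sketch_ag {
	uint32_t bucket_head;				/* shadow of one bucket head */
	struct sketch_inode cache[SKETCH_NINODES];	/* lookup by agino */
};

static void sketch_ag_init(struct sketch_ag *ag)
{
	uint32_t i;

	ag->bucket_head = SKETCH_NULLAGINO;
	for (i = 0; i < SKETCH_NINODES; i++) {
		ag->cache[i].next_unlinked = SKETCH_NULLAGINO;
		ag->cache[i].prev_unlinked = SKETCH_NULLAGINO;
	}
}

/* Link agino at the head of the bucket (sketch assumption, see lead-in). */
static void sketch_iunlink_insert(struct sketch_ag *ag, uint32_t agino)
{
	uint32_t old_head = ag->bucket_head;

	ag->cache[agino].next_unlinked = old_head;
	ag->cache[agino].prev_unlinked = SKETCH_NULLAGINO;
	if (old_head != SKETCH_NULLAGINO)
		ag->cache[old_head].prev_unlinked = agino;
	ag->bucket_head = agino;
}

/* Unlink agino from anywhere in the list in constant time. */
static void sketch_iunlink_remove(struct sketch_ag *ag, uint32_t agino)
{
	uint32_t prev = ag->cache[agino].prev_unlinked;
	uint32_t next = ag->cache[agino].next_unlinked;

	if (prev != SKETCH_NULLAGINO)
		ag->cache[prev].next_unlinked = next;
	else
		ag->bucket_head = next;
	if (next != SKETCH_NULLAGINO)
		ag->cache[next].prev_unlinked = prev;

	ag->cache[agino].next_unlinked = SKETCH_NULLAGINO;
	ag->cache[agino].prev_unlinked = SKETCH_NULLAGINO;
}
```

In this model every pointer update happens under whatever lock protects the bucket head, which matches the text's observation that the aginos need no locking beyond the AGI lock.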

This results in moving a lot of the unlink modification work into
the precommit operations on the unlink log item. Tracking all the
unlinked inodes in the inodes themselves also gets rid of the
unlinked list reference hash table that is used to track this back
pointer relationship. This greatly simplifies the unlinked list
modification code, and removes memory allocations in this hot path
to track back pointers. This, overall, slightly reduces the CPU
overhead of the unlink path.

The result of this log item means that we move all the actual
manipulation of objects to be logged out of the iunlink path and
into the iunlink item. This allows for future optimisation of this
mechanism without needing changes to the high level unlink path, as
well as making the unlink lock ordering predictable and synchronised
with other operations that may require inode cluster locking.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

* tag 'xfs-iunlink-item-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: add in-memory iunlink log item
  xfs: add log item precommit operation
  xfs: combine iunlink inode update functions
  xfs: clean up xfs_iunlink_update_inode()
  xfs: double link the unlinked inode list
  xfs: introduce xfs_iunlink_lookup
  xfs: refactor xlog_recover_process_iunlinks()
  xfs: track the iunlink list pointer in the xfs_inode
  xfs: factor the xfs_iunlink functions
  xfs: flush inode gc workqueue before clearing agi bucket
parents 0f38063d 784eb7d8
@@ -106,6 +106,7 @@ xfs-y += xfs_log.o \
 	xfs_icreate_item.o \
 	xfs_inode_item.o \
 	xfs_inode_item_recover.o \
+	xfs_iunlink_item.o \
 	xfs_refcount_item.o \
 	xfs_rmap_item.o \
 	xfs_log_recover.o \
......
@@ -194,7 +194,6 @@ xfs_free_perag(
 		XFS_IS_CORRUPT(pag->pag_mount, atomic_read(&pag->pag_ref) != 0);
 		cancel_delayed_work_sync(&pag->pag_blockgc_work);
-		xfs_iunlink_destroy(pag);
 		xfs_buf_hash_destroy(pag);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
@@ -323,10 +322,6 @@ xfs_initialize_perag(
 		if (error)
 			goto out_remove_pag;
-		error = xfs_iunlink_init(pag);
-		if (error)
-			goto out_hash_destroy;
 		/* first new pag is fully initialized */
 		if (first_initialised == NULLAGNUMBER)
 			first_initialised = index;
@@ -349,8 +344,6 @@ xfs_initialize_perag(
 	mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
 	return 0;
-out_hash_destroy:
-	xfs_buf_hash_destroy(pag);
 out_remove_pag:
 	radix_tree_delete(&mp->m_perag_tree, index);
 out_free_pag:
@@ -362,7 +355,6 @@ xfs_initialize_perag(
 		if (!pag)
 			break;
 		xfs_buf_hash_destroy(pag);
-		xfs_iunlink_destroy(pag);
 		kmem_free(pag);
 	}
 	return error;
......
@@ -103,12 +103,6 @@ struct xfs_perag {
 	/* background prealloc block trimming */
 	struct delayed_work pag_blockgc_work;
-	/*
-	 * Unlinked inode information. This incore information reflects
-	 * data stored in the AGI, so callers must hold the AGI buffer lock
-	 * or have some other means to control concurrency.
-	 */
-	struct rhashtable pagi_unlinked_hash;
 #endif /* __KERNEL__ */
 };
......
@@ -230,6 +230,7 @@ xfs_inode_from_disk(
 	ip->i_extsize = be32_to_cpu(from->di_extsize);
 	ip->i_forkoff = from->di_forkoff;
 	ip->i_diflags = be16_to_cpu(from->di_flags);
+	ip->i_next_unlinked = be32_to_cpu(from->di_next_unlinked);
 	if (from->di_dmevmask || from->di_dmstate)
 		xfs_iflags_set(ip, XFS_IPRESERVE_DM_FIELDS);
......
@@ -111,6 +111,8 @@ xfs_inode_alloc(
 	INIT_WORK(&ip->i_ioend_work, xfs_end_io);
 	INIT_LIST_HEAD(&ip->i_ioend_list);
 	spin_lock_init(&ip->i_ioend_lock);
+	ip->i_next_unlinked = NULLAGINO;
+	ip->i_prev_unlinked = NULLAGINO;
 	return ip;
 }
@@ -912,6 +914,7 @@ xfs_reclaim_inode(
 	ip->i_checked = 0;
 	spin_unlock(&ip->i_flags_lock);
+	ASSERT(!ip->i_itemp || ip->i_itemp->ili_item.li_buf == NULL);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	XFS_STATS_INC(ip->i_mount, xs_ig_reclaims);
......
This diff is collapsed.
@@ -68,6 +68,10 @@ typedef struct xfs_inode {
 	uint64_t i_diflags2; /* XFS_DIFLAG2_... */
 	struct timespec64 i_crtime; /* time created */
+	/* unlinked list pointers */
+	xfs_agino_t i_next_unlinked;
+	xfs_agino_t i_prev_unlinked;
 	/* VFS inode */
 	struct inode i_vnode; /* embedded VFS inode */
@@ -505,9 +509,6 @@ extern struct kmem_cache *xfs_inode_cache;
 bool xfs_inode_needs_inactive(struct xfs_inode *ip);
-int xfs_iunlink_init(struct xfs_perag *pag);
-void xfs_iunlink_destroy(struct xfs_perag *pag);
 void xfs_end_io(struct work_struct *work);
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
......
// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (c) 2020-2022, Red Hat, Inc.
* All Rights Reserved.
*/
#include "xfs.h"
#include "xfs_fs.h"
#include "xfs_shared.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_mount.h"
#include "xfs_inode.h"
#include "xfs_trans.h"
#include "xfs_trans_priv.h"
#include "xfs_ag.h"
#include "xfs_iunlink_item.h"
#include "xfs_trace.h"
#include "xfs_error.h"
struct kmem_cache *xfs_iunlink_cache;
static inline struct xfs_iunlink_item *IUL_ITEM(struct xfs_log_item *lip)
{
return container_of(lip, struct xfs_iunlink_item, item);
}
static void
xfs_iunlink_item_release(
struct xfs_log_item *lip)
{
struct xfs_iunlink_item *iup = IUL_ITEM(lip);
xfs_perag_put(iup->pag);
kmem_cache_free(xfs_iunlink_cache, IUL_ITEM(lip));
}
static uint64_t
xfs_iunlink_item_sort(
struct xfs_log_item *lip)
{
return IUL_ITEM(lip)->ip->i_ino;
}
/*
* Look up the inode cluster buffer and log the on-disk unlinked inode change
* we need to make.
*/
static int
xfs_iunlink_log_dinode(
struct xfs_trans *tp,
struct xfs_iunlink_item *iup)
{
struct xfs_mount *mp = tp->t_mountp;
struct xfs_inode *ip = iup->ip;
struct xfs_dinode *dip;
struct xfs_buf *ibp;
int offset;
int error;
error = xfs_imap_to_bp(mp, tp, &ip->i_imap, &ibp);
if (error)
return error;
/*
* Don't log the unlinked field on stale buffers as this may be the
* transaction that frees the inode cluster and relogging the buffer
* here will incorrectly remove the stale state.
*/
if (ibp->b_flags & XBF_STALE)
goto out;
dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
/* Make sure the old pointer isn't garbage. */
if (be32_to_cpu(dip->di_next_unlinked) != iup->old_agino) {
xfs_inode_verifier_error(ip, -EFSCORRUPTED, __func__, dip,
sizeof(*dip), __this_address);
error = -EFSCORRUPTED;
goto out;
}
trace_xfs_iunlink_update_dinode(mp, iup->pag->pag_agno,
XFS_INO_TO_AGINO(mp, ip->i_ino),
be32_to_cpu(dip->di_next_unlinked), iup->next_agino);
dip->di_next_unlinked = cpu_to_be32(iup->next_agino);
offset = ip->i_imap.im_boffset +
offsetof(struct xfs_dinode, di_next_unlinked);
xfs_dinode_calc_crc(mp, dip);
xfs_trans_inode_buf(tp, ibp);
xfs_trans_log_buf(tp, ibp, offset, offset + sizeof(xfs_agino_t) - 1);
return 0;
out:
xfs_trans_brelse(tp, ibp);
return error;
}
/*
* On precommit, we grab the inode cluster buffer for the inode number we were
* passed, then update the next unlinked field for that inode in the buffer and
* log the buffer. This ensures that the inode cluster buffer was logged in the
* correct order w.r.t. other inode cluster buffers. We can then remove the
* iunlink item from the transaction and release it as it is has now served it's
* purpose.
*/
static int
xfs_iunlink_item_precommit(
struct xfs_trans *tp,
struct xfs_log_item *lip)
{
struct xfs_iunlink_item *iup = IUL_ITEM(lip);
int error;
error = xfs_iunlink_log_dinode(tp, iup);
list_del(&lip->li_trans);
xfs_iunlink_item_release(lip);
return error;
}
static const struct xfs_item_ops xfs_iunlink_item_ops = {
.iop_release = xfs_iunlink_item_release,
.iop_sort = xfs_iunlink_item_sort,
.iop_precommit = xfs_iunlink_item_precommit,
};
/*
* Initialize the inode log item for a newly allocated (in-core) inode.
*
* Inode extents can only reside within an AG. Hence specify the starting
* block for the inode chunk by offset within an AG as well as the
* length of the allocated extent.
*
* This joins the item to the transaction and marks it dirty so
* that we don't need a separate call to do this, nor does the
* caller need to know anything about the iunlink item.
*/
int
xfs_iunlink_log_inode(
struct xfs_trans *tp,
struct xfs_inode *ip,
struct xfs_perag *pag,
xfs_agino_t next_agino)
{
struct xfs_mount *mp = tp->t_mountp;
struct xfs_iunlink_item *iup;
ASSERT(xfs_verify_agino_or_null(pag, next_agino));
ASSERT(xfs_verify_agino_or_null(pag, ip->i_next_unlinked));
/*
* Since we're updating a linked list, we should never find that the
* current pointer is the same as the new value, unless we're
* terminating the list.
*/
if (ip->i_next_unlinked == next_agino) {
if (next_agino != NULLAGINO)
return -EFSCORRUPTED;
return 0;
}
iup = kmem_cache_zalloc(xfs_iunlink_cache, GFP_KERNEL | __GFP_NOFAIL);
xfs_log_item_init(mp, &iup->item, XFS_LI_IUNLINK,
&xfs_iunlink_item_ops);
iup->ip = ip;
iup->next_agino = next_agino;
iup->old_agino = ip->i_next_unlinked;
atomic_inc(&pag->pag_ref);
iup->pag = pag;
xfs_trans_add_item(tp, &iup->item);
tp->t_flags |= XFS_TRANS_DIRTY;
set_bit(XFS_LI_DIRTY, &iup->item.li_flags);
return 0;
}
// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (c) 2020-2022, Red Hat, Inc.
* All Rights Reserved.
*/
#ifndef XFS_IUNLINK_ITEM_H
#define XFS_IUNLINK_ITEM_H 1
struct xfs_trans;
struct xfs_inode;
struct xfs_perag;
/* in memory log item structure */
struct xfs_iunlink_item {
struct xfs_log_item item;
struct xfs_inode *ip;
struct xfs_perag *pag;
xfs_agino_t next_agino;
xfs_agino_t old_agino;
};
extern struct kmem_cache *xfs_iunlink_cache;
int xfs_iunlink_log_inode(struct xfs_trans *tp, struct xfs_inode *ip,
struct xfs_perag *pag, xfs_agino_t next_agino);
#endif /* XFS_IUNLINK_ITEM_H */
@@ -2667,55 +2667,57 @@ xlog_recover_clear_agi_bucket(
 	return;
 }
-STATIC xfs_agino_t
-xlog_recover_process_one_iunlink(
+static int
+xlog_recover_iunlink_bucket(
 	struct xfs_perag *pag,
-	xfs_agino_t agino,
+	struct xfs_agi *agi,
 	int bucket)
 {
-	struct xfs_buf *ibp;
-	struct xfs_dinode *dip;
+	struct xfs_mount *mp = pag->pag_mount;
+	struct xfs_inode *prev_ip = NULL;
 	struct xfs_inode *ip;
-	xfs_ino_t ino;
-	int error;
-	ino = XFS_AGINO_TO_INO(pag->pag_mount, pag->pag_agno, agino);
-	error = xfs_iget(pag->pag_mount, NULL, ino, 0, 0, &ip);
-	if (error)
-		goto fail;
-	/*
-	 * Get the on disk inode to find the next inode in the bucket.
-	 */
-	error = xfs_imap_to_bp(pag->pag_mount, NULL, &ip->i_imap, &ibp);
-	if (error)
-		goto fail_iput;
-	dip = xfs_buf_offset(ibp, ip->i_imap.im_boffset);
-	xfs_iflags_clear(ip, XFS_IRECOVERY);
-	ASSERT(VFS_I(ip)->i_nlink == 0);
-	ASSERT(VFS_I(ip)->i_mode != 0);
-	/* setup for the next pass */
-	agino = be32_to_cpu(dip->di_next_unlinked);
-	xfs_buf_relse(ibp);
-	xfs_irele(ip);
-	return agino;
-fail_iput:
-	xfs_irele(ip);
-fail:
-	/*
-	 * We can't read in the inode this bucket points to, or this inode
-	 * is messed up. Just ditch this bucket of inodes. We will lose
-	 * some inodes and space, but at least we won't hang.
-	 *
-	 * Call xlog_recover_clear_agi_bucket() to perform a transaction to
-	 * clear the inode pointer in the bucket.
-	 */
-	xlog_recover_clear_agi_bucket(pag, bucket);
-	return NULLAGINO;
+	xfs_agino_t prev_agino, agino;
+	int error = 0;
+	agino = be32_to_cpu(agi->agi_unlinked[bucket]);
+	while (agino != NULLAGINO) {
+		error = xfs_iget(mp, NULL,
+				XFS_AGINO_TO_INO(mp, pag->pag_agno, agino),
+				0, 0, &ip);
+		if (error)
+			break;
+		ASSERT(VFS_I(ip)->i_nlink == 0);
+		ASSERT(VFS_I(ip)->i_mode != 0);
+		xfs_iflags_clear(ip, XFS_IRECOVERY);
+		agino = ip->i_next_unlinked;
+		if (prev_ip) {
+			ip->i_prev_unlinked = prev_agino;
+			xfs_irele(prev_ip);
+			/*
+			 * Ensure the inode is removed from the unlinked list
+			 * before we continue so that it won't race with
+			 * building the in-memory list here. This could be
+			 * serialised with the agibp lock, but that just
+			 * serialises via lockstepping and it's much simpler
+			 * just to flush the inodegc queue and wait for it to
+			 * complete.
+			 */
+			xfs_inodegc_flush(mp);
+		}
+		prev_agino = agino;
+		prev_ip = ip;
+	}
+	if (prev_ip) {
+		ip->i_prev_unlinked = prev_agino;
+		xfs_irele(prev_ip);
+	}
+	xfs_inodegc_flush(mp);
+	return error;
 }
 /*
@@ -2741,59 +2743,70 @@ xlog_recover_process_one_iunlink(
  * scheduled on this CPU to ensure other scheduled work can run without undue
  * latency.
  */
-STATIC void
-xlog_recover_process_iunlinks(
-	struct xlog *log)
+static void
+xlog_recover_iunlink_ag(
+	struct xfs_perag *pag)
 {
-	struct xfs_mount *mp = log->l_mp;
-	struct xfs_perag *pag;
-	xfs_agnumber_t agno;
 	struct xfs_agi *agi;
 	struct xfs_buf *agibp;
-	xfs_agino_t agino;
 	int bucket;
 	int error;
-	for_each_perag(mp, agno, pag) {
-		error = xfs_read_agi(pag, NULL, &agibp);
-		if (error) {
-			/*
-			 * AGI is b0rked. Don't process it.
-			 *
-			 * We should probably mark the filesystem as corrupt
-			 * after we've recovered all the ag's we can....
-			 */
-			continue;
-		}
-		/*
-		 * Unlock the buffer so that it can be acquired in the normal
-		 * course of the transaction to truncate and free each inode.
-		 * Because we are not racing with anyone else here for the AGI
-		 * buffer, we don't even need to hold it locked to read the
-		 * initial unlinked bucket entries out of the buffer. We keep
-		 * buffer reference though, so that it stays pinned in memory
-		 * while we need the buffer.
-		 */
-		agi = agibp->b_addr;
-		xfs_buf_unlock(agibp);
-		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
-			agino = be32_to_cpu(agi->agi_unlinked[bucket]);
-			while (agino != NULLAGINO) {
-				agino = xlog_recover_process_one_iunlink(pag,
-						agino, bucket);
-				cond_resched();
-			}
-		}
-		xfs_buf_rele(agibp);
-	}
+	error = xfs_read_agi(pag, NULL, &agibp);
+	if (error) {
+		/*
+		 * AGI is b0rked. Don't process it.
+		 *
+		 * We should probably mark the filesystem as corrupt after we've
+		 * recovered all the ag's we can....
+		 */
+		return;
+	}
+	/*
+	 * Unlock the buffer so that it can be acquired in the normal course of
+	 * the transaction to truncate and free each inode. Because we are not
+	 * racing with anyone else here for the AGI buffer, we don't even need
+	 * to hold it locked to read the initial unlinked bucket entries out of
+	 * the buffer. We keep buffer reference though, so that it stays pinned
+	 * in memory while we need the buffer.
+	 */
+	agi = agibp->b_addr;
+	xfs_buf_unlock(agibp);
+	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
+		error = xlog_recover_iunlink_bucket(pag, agi, bucket);
+		if (error) {
+			/*
+			 * Bucket is unrecoverable, so only a repair scan can
+			 * free the remaining unlinked inodes. Just empty the
+			 * bucket and remaining inodes on it unreferenced and
+			 * unfreeable.
+			 */
+			xfs_inodegc_flush(pag->pag_mount);
+			xlog_recover_clear_agi_bucket(pag, bucket);
+		}
+	}
+	xfs_buf_rele(agibp);
+}
+static void
+xlog_recover_process_iunlinks(
+	struct xlog *log)
+{
+	struct xfs_perag *pag;
+	xfs_agnumber_t agno;
+	for_each_perag(log->l_mp, agno, pag)
+		xlog_recover_iunlink_ag(pag);
 	/*
 	 * Flush the pending unlinked inodes to ensure that the inactivations
 	 * are fully completed on disk and the incore inodes can be reclaimed
 	 * before we signal that recovery is complete.
 	 */
-	xfs_inodegc_flush(mp);
+	xfs_inodegc_flush(log->l_mp);
 }
 STATIC void
......
@@ -40,6 +40,7 @@
 #include "xfs_defer.h"
 #include "xfs_attr_item.h"
 #include "xfs_xattr.h"
+#include "xfs_iunlink_item.h"
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -2096,8 +2097,16 @@ xfs_init_caches(void)
 	if (!xfs_attri_cache)
 		goto out_destroy_attrd_cache;
+	xfs_iunlink_cache = kmem_cache_create("xfs_iul_item",
+					     sizeof(struct xfs_iunlink_item),
+					     0, 0, NULL);
+	if (!xfs_iunlink_cache)
+		goto out_destroy_attri_cache;
 	return 0;
+out_destroy_attri_cache:
+	kmem_cache_destroy(xfs_attri_cache);
 out_destroy_attrd_cache:
 	kmem_cache_destroy(xfs_attrd_cache);
 out_destroy_bui_cache:
@@ -2148,6 +2157,7 @@ xfs_destroy_caches(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_cache_destroy(xfs_iunlink_cache);
 	kmem_cache_destroy(xfs_attri_cache);
 	kmem_cache_destroy(xfs_attrd_cache);
 	kmem_cache_destroy(xfs_bui_cache);
......
@@ -3672,7 +3672,6 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
 	TP_ARGS(ip))
 DEFINE_AGINODE_EVENT(xfs_iunlink);
 DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
-DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
 DECLARE_EVENT_CLASS(xfs_fs_corrupt_class,
 	TP_PROTO(struct xfs_mount *mp, unsigned int flags),
......
@@ -844,6 +844,90 @@ xfs_trans_committed_bulk(
 	spin_unlock(&ailp->ail_lock);
 }
+/*
+ * Sort transaction items prior to running precommit operations. This will
+ * attempt to order the items such that they will always be locked in the same
+ * order. Items that have no sort function are moved to the end of the list
+ * and so are locked last.
+ *
+ * This may need refinement as different types of objects add sort functions.
+ *
+ * Function is more complex than it needs to be because we are comparing 64 bit
+ * values and the function only returns 32 bit values.
+ */
+static int
+xfs_trans_precommit_sort(
+	void *unused_arg,
+	const struct list_head *a,
+	const struct list_head *b)
+{
+	struct xfs_log_item *lia = container_of(a,
+					struct xfs_log_item, li_trans);
+	struct xfs_log_item *lib = container_of(b,
+					struct xfs_log_item, li_trans);
+	int64_t diff;
+	/*
+	 * If both items are non-sortable, leave them alone. If only one is
+	 * sortable, move the non-sortable item towards the end of the list.
+	 */
+	if (!lia->li_ops->iop_sort && !lib->li_ops->iop_sort)
+		return 0;
+	if (!lia->li_ops->iop_sort)
+		return 1;
+	if (!lib->li_ops->iop_sort)
+		return -1;
+	diff = lia->li_ops->iop_sort(lia) - lib->li_ops->iop_sort(lib);
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+/*
+ * Run transaction precommit functions.
+ *
+ * If there is an error in any of the callouts, then stop immediately and
+ * trigger a shutdown to abort the transaction. There is no recovery possible
+ * from errors at this point as the transaction is dirty....
+ */
+static int
+xfs_trans_run_precommits(
+	struct xfs_trans *tp)
+{
+	struct xfs_mount *mp = tp->t_mountp;
+	struct xfs_log_item *lip, *n;
+	int error = 0;
+	/*
+	 * Sort the item list to avoid ABBA deadlocks with other transactions
+	 * running precommit operations that lock multiple shared items such as
+	 * inode cluster buffers.
+	 */
+	list_sort(NULL, &tp->t_items, xfs_trans_precommit_sort);
+	/*
+	 * Precommit operations can remove the log item from the transaction
+	 * if the log item exists purely to delay modifications until they
+	 * can be ordered against other operations. Hence we have to use
+	 * list_for_each_entry_safe() here.
+	 */
+	list_for_each_entry_safe(lip, n, &tp->t_items, li_trans) {
+		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
+			continue;
+		if (lip->li_ops->iop_precommit) {
+			error = lip->li_ops->iop_precommit(tp, lip);
+			if (error)
+				break;
+		}
+	}
+	if (error)
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	return error;
+}
 /*
  * Commit the given transaction to the log.
  *
@@ -869,6 +953,13 @@ __xfs_trans_commit(
 	trace_xfs_trans_commit(tp, _RET_IP_);
+	error = xfs_trans_run_precommits(tp);
+	if (error) {
+		if (tp->t_flags & XFS_TRANS_PERM_LOG_RES)
+			xfs_defer_cancel(tp);
+		goto out_unreserve;
+	}
 	/*
 	 * Finish deferred items on final commit. Only permanent transactions
 	 * should ever have deferred ops.
......
@@ -72,10 +72,12 @@ struct xfs_item_ops {
 	void (*iop_format)(struct xfs_log_item *, struct xfs_log_vec *);
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
-	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	uint64_t (*iop_sort)(struct xfs_log_item *lip);
+	int (*iop_precommit)(struct xfs_trans *tp, struct xfs_log_item *lip);
 	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
-	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
+	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
+	void (*iop_release)(struct xfs_log_item *);
 	int (*iop_recover)(struct xfs_log_item *lip,
 			   struct list_head *capture_list);
 	bool (*iop_match)(struct xfs_log_item *item, uint64_t id);
......