Commit 0d227466 authored by Dave Chinner's avatar Dave Chinner Committed by Dave Chinner

xfs: intent item whiteouts

When we log modifications based on intents, we add both intent
and intent done items to the modification being made. These get
written to the log to ensure that the operation is re-run if the
intent done is not found in the log.

However, for operations that complete wholly within a single
checkpoint, the change in the checkpoint is atomic and will never
need replay. In this case, we don't need to actually write the
intent and intent done items to the journal because log recovery
will never need to manually restart this modification.

Log recovery currently handles intent/intent done matching by
inserting the intent into the AIL, then removing it when a matching
intent done item is found. Hence for all the intent-based operations
that complete within a checkpoint, we spend all that time parsing
the intent/intent done items just to cancel them and do nothing with
them.

Hence it follows that the only time we actually need intents in the
log is when the modification crosses checkpoint boundaries in the
log and so may only be partially complete in the journal. Hence if
we commit and intent done item to the CIL and the intent item is in
the same checkpoint, we don't actually have to write them to the
journal because log recovery will always cancel the intents.

We've never really worried about the overhead of logging intents
unnecessarily like this because the intents we log are generally
very much smaller than the change being made. e.g. freeing an extent
involves modifying at lease two freespace btree blocks and the AGF,
so the EFI/EFD overhead is only a small increase in space and
processing time compared to the overall cost of freeing an extent.

However, delayed attributes change this cost equation dramatically,
especially for inline attributes. In the case of adding an inline
attribute, we only log the inode core and attribute fork at present.
With delayed attributes, we now log the attr intent which includes
the name and value, the inode core adn attr fork, and finally the
attr intent done item. We increase the number of items we log from 1
to 3, and the number of log vectors (regions) goes up from 3 to 7.
Hence we tripple the number of objects that the CIL has to process,
and more than double the number of log vectors that need to be
written to the journal.

At scale, this means delayed attributes cause a non-pipelined CIL to
become CPU bound processing all the extra items, resulting in a > 40%
performance degradation on 16-way file+xattr create worklaods.
Pipelining the CIL (as per 5.15) reduces the performance degradation
to 20%, but now the limitation is the rate at which the log items
can be written to the iclogs and iclogs be dispatched for IO and
completed.

Even log IO completion is slowed down by these intents, because it
now has to process 3x the number of items in the checkpoint.
Processing completed intents is especially inefficient here, because
we first insert the intent into the AIL, then remove it from the AIL
when the intent done is processed. IOWs, we are also doing expensive
operations in log IO completion we could completely avoid if we
didn't log completed intent/intent done pairs.

Enter log item whiteouts.

When an intent done is committed, we can check to see if the
associated intent is in the same checkpoint as we are currently
committing the intent done to. If so, we can mark the intent log
item with a whiteout and immediately free the intent done item
rather than committing it to the CIL. We can basically skip the
entire formatting and CIL insertion steps for the intent done item.

However, we cannot remove the intent item from the CIL at this point
because the unlocked per-cpu CIL item lists do not permit removal
without holding the CIL context lock exclusively. Transaction commit
only holds the context lock shared, hence the best we can do is mark
the intent item with a whiteout so that the CIL push can release it
rather than writing it to the log.

This means we never write the intent to the log if the intent done
has also been committed to the same checkpoint, but we'll always
write the intent if the intent done has not been committed or has
been committed to a different checkpoint. This will result in
correct log recovery behaviour in all cases, without the overhead of
logging unnecessary intents.

This intent whiteout concept is generic - we can apply it to all
intent/intent done pairs that have a direct 1:1 relationship. The
way deferred ops iterate and relog intents mean that all intents
currently have a 1:1 relationship with their done intent, and hence
we can apply this cancellation to all existing intent/intent done
implementations.

For delayed attributes with a 16-way 64kB xattr create workload,
whiteouts reduce the amount of journalled metadata from ~2.5GB/s
down to ~600MB/s and improve the creation rate from 9000/s to
14000/s.
Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
Reviewed-by: default avatarAllison Henderson <allison.henderson@oracle.com>
Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
parent 3512fc1e
...@@ -476,7 +476,8 @@ xlog_cil_insert_format_items( ...@@ -476,7 +476,8 @@ xlog_cil_insert_format_items(
static void static void
xlog_cil_insert_items( xlog_cil_insert_items(
struct xlog *log, struct xlog *log,
struct xfs_trans *tp) struct xfs_trans *tp,
uint32_t released_space)
{ {
struct xfs_cil *cil = log->l_cilp; struct xfs_cil *cil = log->l_cilp;
struct xfs_cil_ctx *ctx = cil->xc_ctx; struct xfs_cil_ctx *ctx = cil->xc_ctx;
...@@ -525,7 +526,9 @@ xlog_cil_insert_items( ...@@ -525,7 +526,9 @@ xlog_cil_insert_items(
ASSERT(tp->t_ticket->t_curr_res >= len); ASSERT(tp->t_ticket->t_curr_res >= len);
} }
tp->t_ticket->t_curr_res -= len; tp->t_ticket->t_curr_res -= len;
tp->t_ticket->t_curr_res += released_space;
ctx->space_used += len; ctx->space_used += len;
ctx->space_used -= released_space;
/* /*
* If we've overrun the reservation, dump the tx details before we move * If we've overrun the reservation, dump the tx details before we move
...@@ -970,11 +973,16 @@ xlog_cil_build_trans_hdr( ...@@ -970,11 +973,16 @@ xlog_cil_build_trans_hdr(
* Pull all the log vectors off the items in the CIL, and remove the items from * Pull all the log vectors off the items in the CIL, and remove the items from
* the CIL. We don't need the CIL lock here because it's only needed on the * the CIL. We don't need the CIL lock here because it's only needed on the
* transaction commit side which is currently locked out by the flush lock. * transaction commit side which is currently locked out by the flush lock.
*
* If a log item is marked with a whiteout, we do not need to write it to the
* journal and so we just move them to the whiteout list for the caller to
* dispose of appropriately.
*/ */
static void static void
xlog_cil_build_lv_chain( xlog_cil_build_lv_chain(
struct xfs_cil *cil, struct xfs_cil *cil,
struct xfs_cil_ctx *ctx, struct xfs_cil_ctx *ctx,
struct list_head *whiteouts,
uint32_t *num_iovecs, uint32_t *num_iovecs,
uint32_t *num_bytes) uint32_t *num_bytes)
{ {
...@@ -985,6 +993,13 @@ xlog_cil_build_lv_chain( ...@@ -985,6 +993,13 @@ xlog_cil_build_lv_chain(
item = list_first_entry(&cil->xc_cil, item = list_first_entry(&cil->xc_cil,
struct xfs_log_item, li_cil); struct xfs_log_item, li_cil);
if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
list_move(&item->li_cil, whiteouts);
trace_xfs_cil_whiteout_skip(item);
continue;
}
list_del_init(&item->li_cil); list_del_init(&item->li_cil);
if (!ctx->lv_chain) if (!ctx->lv_chain)
ctx->lv_chain = item->li_lv; ctx->lv_chain = item->li_lv;
...@@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain( ...@@ -1000,6 +1015,19 @@ xlog_cil_build_lv_chain(
} }
} }
static void
xlog_cil_cleanup_whiteouts(
struct list_head *whiteouts)
{
while (!list_empty(whiteouts)) {
struct xfs_log_item *item = list_first_entry(whiteouts,
struct xfs_log_item, li_cil);
list_del_init(&item->li_cil);
trace_xfs_cil_whiteout_unpin(item);
item->li_ops->iop_unpin(item, 1);
}
}
/* /*
* Push the Committed Item List to the log. * Push the Committed Item List to the log.
* *
...@@ -1030,6 +1058,7 @@ xlog_cil_push_work( ...@@ -1030,6 +1058,7 @@ xlog_cil_push_work(
struct xfs_log_vec lvhdr = { NULL }; struct xfs_log_vec lvhdr = { NULL };
xfs_csn_t push_seq; xfs_csn_t push_seq;
bool push_commit_stable; bool push_commit_stable;
LIST_HEAD (whiteouts);
new_ctx = xlog_cil_ctx_alloc(); new_ctx = xlog_cil_ctx_alloc();
new_ctx->ticket = xlog_cil_ticket_alloc(log); new_ctx->ticket = xlog_cil_ticket_alloc(log);
...@@ -1098,7 +1127,7 @@ xlog_cil_push_work( ...@@ -1098,7 +1127,7 @@ xlog_cil_push_work(
list_add(&ctx->committing, &cil->xc_committing); list_add(&ctx->committing, &cil->xc_committing);
spin_unlock(&cil->xc_push_lock); spin_unlock(&cil->xc_push_lock);
xlog_cil_build_lv_chain(cil, ctx, &num_iovecs, &num_bytes); xlog_cil_build_lv_chain(cil, ctx, &whiteouts, &num_iovecs, &num_bytes);
/* /*
* Switch the contexts so we can drop the context lock and move out * Switch the contexts so we can drop the context lock and move out
...@@ -1201,6 +1230,7 @@ xlog_cil_push_work( ...@@ -1201,6 +1230,7 @@ xlog_cil_push_work(
/* Not safe to reference ctx now! */ /* Not safe to reference ctx now! */
spin_unlock(&log->l_icloglock); spin_unlock(&log->l_icloglock);
xlog_cil_cleanup_whiteouts(&whiteouts);
return; return;
out_skip: out_skip:
...@@ -1212,6 +1242,7 @@ xlog_cil_push_work( ...@@ -1212,6 +1242,7 @@ xlog_cil_push_work(
out_abort_free_ticket: out_abort_free_ticket:
xfs_log_ticket_ungrant(log, ctx->ticket); xfs_log_ticket_ungrant(log, ctx->ticket);
ASSERT(xlog_is_shutdown(log)); ASSERT(xlog_is_shutdown(log));
xlog_cil_cleanup_whiteouts(&whiteouts);
if (!ctx->commit_iclog) { if (!ctx->commit_iclog) {
xlog_cil_committed(ctx); xlog_cil_committed(ctx);
return; return;
...@@ -1360,6 +1391,43 @@ xlog_cil_empty( ...@@ -1360,6 +1391,43 @@ xlog_cil_empty(
return empty; return empty;
} }
/*
* If there are intent done items in this transaction and the related intent was
* committed in the current (same) CIL checkpoint, we don't need to write either
* the intent or intent done item to the journal as the change will be
* journalled atomically within this checkpoint. As we cannot remove items from
* the CIL here, mark the related intent with a whiteout so that the CIL push
* can remove it rather than writing it to the journal. Then remove the intent
* done item from the current transaction and release it so it doesn't get put
* into the CIL at all.
*/
static uint32_t
xlog_cil_process_intents(
struct xfs_cil *cil,
struct xfs_trans *tp)
{
struct xfs_log_item *lip, *ilip, *next;
uint32_t len = 0;
list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
continue;
ilip = lip->li_ops->iop_intent(lip);
if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
continue;
set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
trace_xfs_cil_whiteout_mark(ilip);
len += ilip->li_lv->lv_bytes;
kmem_free(ilip->li_lv);
ilip->li_lv = NULL;
xfs_trans_del_item(lip);
lip->li_ops->iop_release(lip);
}
return len;
}
/* /*
* Commit a transaction with the given vector to the Committed Item List. * Commit a transaction with the given vector to the Committed Item List.
* *
...@@ -1382,6 +1450,7 @@ xlog_cil_commit( ...@@ -1382,6 +1450,7 @@ xlog_cil_commit(
{ {
struct xfs_cil *cil = log->l_cilp; struct xfs_cil *cil = log->l_cilp;
struct xfs_log_item *lip, *next; struct xfs_log_item *lip, *next;
uint32_t released_space = 0;
/* /*
* Do all necessary memory allocation before we lock the CIL. * Do all necessary memory allocation before we lock the CIL.
...@@ -1393,7 +1462,10 @@ xlog_cil_commit( ...@@ -1393,7 +1462,10 @@ xlog_cil_commit(
/* lock out background commit */ /* lock out background commit */
down_read(&cil->xc_ctx_lock); down_read(&cil->xc_ctx_lock);
xlog_cil_insert_items(log, tp); if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
released_space = xlog_cil_process_intents(cil, tp);
xlog_cil_insert_items(log, tp, released_space);
if (regrant && !xlog_is_shutdown(log)) if (regrant && !xlog_is_shutdown(log))
xfs_log_ticket_regrant(log, tp->t_ticket); xfs_log_ticket_regrant(log, tp->t_ticket);
......
...@@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push); ...@@ -1332,6 +1332,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned); DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
DEFINE_LOG_ITEM_EVENT(xfs_ail_locked); DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing); DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
DECLARE_EVENT_CLASS(xfs_ail_class, DECLARE_EVENT_CLASS(xfs_ail_class,
TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn), TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
......
...@@ -55,13 +55,15 @@ struct xfs_log_item { ...@@ -55,13 +55,15 @@ struct xfs_log_item {
#define XFS_LI_IN_AIL 0 #define XFS_LI_IN_AIL 0
#define XFS_LI_ABORTED 1 #define XFS_LI_ABORTED 1
#define XFS_LI_FAILED 2 #define XFS_LI_FAILED 2
#define XFS_LI_DIRTY 3 /* log item dirty in transaction */ #define XFS_LI_DIRTY 3
#define XFS_LI_WHITEOUT 4
#define XFS_LI_FLAGS \ #define XFS_LI_FLAGS \
{ (1u << XFS_LI_IN_AIL), "IN_AIL" }, \ { (1u << XFS_LI_IN_AIL), "IN_AIL" }, \
{ (1u << XFS_LI_ABORTED), "ABORTED" }, \ { (1u << XFS_LI_ABORTED), "ABORTED" }, \
{ (1u << XFS_LI_FAILED), "FAILED" }, \ { (1u << XFS_LI_FAILED), "FAILED" }, \
{ (1u << XFS_LI_DIRTY), "DIRTY" } { (1u << XFS_LI_DIRTY), "DIRTY" }, \
{ (1u << XFS_LI_WHITEOUT), "WHITEOUT" }
struct xfs_item_ops { struct xfs_item_ops {
unsigned flags; unsigned flags;
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment