Commit a3365263 authored by unknown's avatar unknown

WL#3071 Maria checkpoint

Fixing bad comments (I remember my maths' teacher "one late night you'll
obey to the simplifications made by your tired neurons"; exactly
what happened here). In Checkpoint, when we flush a table's state
we must flush all log records (WAL), not only those before checkpoint
started.


storage/maria/ma_bitmap.c:
  there was a flaw in reasoning, bug does exist.
storage/maria/ma_blockrec.c:
  moving piece of comment to ma_checkpoint.c
storage/maria/ma_checkpoint.c:
  Comments.
  When checkpoint flushes a state, WAL imposes that all records up
  to this state have been flushed, not only up to checkpoint_start_log_horizon.
storage/maria/ma_recovery.c:
  finishing comment.
parent e0028938
...@@ -533,16 +533,25 @@ static my_bool _ma_read_bitmap_page(MARIA_SHARE *share, ...@@ -533,16 +533,25 @@ static my_bool _ma_read_bitmap_page(MARIA_SHARE *share,
Inexistent or half-created page (could be crash in the middle of Inexistent or half-created page (could be crash in the middle of
_ma_bitmap_create_first(), before appending maria_bitmap_marker). _ma_bitmap_create_first(), before appending maria_bitmap_marker).
*/ */
/* /**
@todo RECOVERY BUG
We are updating data_file_length before writing any log record for the We are updating data_file_length before writing any log record for the
row operation. What if now state is flushed by a checkpoint with the row operation. What if now state is flushed by a checkpoint with the
new value, and crash before the checkpoint record is written, recovery new value, and crash before the checkpoint record is written, recovery
may not even open the table (no log records) so not fix may not even open the table (no log records) so not fix
data_file_length ("WAL violation")? In fact this is ok: data_file_length ("WAL violation")?
- checkpoint flushes state only if share->id!=0 Scenario: assume share->id==0, then:
- so if state was flushed, table had share->id!=0, so had a thread 1 (here) thread 2 (checkpoint)
LOGREC_FILE_ID (or was in previous checkpoint record), so recovery will update data_file_length
meet and open it and fix data_file_length. copy state to memory, flush log
set share->id and write FILE_ID (not flushed)
see share->id!=0 so flush state
crash
FILE_ID will be missing, Recovery will not open table and not fix
data_file_length. This bug should be fixed with other "checkpoint vs
bitmap" bugs.
One possibility will be logging a standalone LOGREC_CREATE_BITMAP in a
separate transaction (using dummy_transaction_object).
*/ */
share->state.state.data_file_length= end_of_page; share->state.state.data_file_length= end_of_page;
bzero(bitmap->map, bitmap->block_size); bzero(bitmap->map, bitmap->block_size);
...@@ -598,6 +607,12 @@ static my_bool _ma_change_bitmap_page(MARIA_HA *info, ...@@ -598,6 +607,12 @@ static my_bool _ma_change_bitmap_page(MARIA_HA *info,
if (bitmap->changed) if (bitmap->changed)
{ {
/**
@todo RECOVERY BUG this is going to flush the bitmap page possibly to
disk even though it could be over-allocated with not yet any REDO-UNDO
complete group (WAL violation: no way to undo the over-allocation if
crash). See also collect_tables().
*/
if (write_changed_bitmap(info->s, bitmap)) if (write_changed_bitmap(info->s, bitmap))
DBUG_RETURN(1); DBUG_RETURN(1);
bitmap->changed= 0; bitmap->changed= 0;
......
...@@ -1444,17 +1444,9 @@ static my_bool write_tail(MARIA_HA *info, ...@@ -1444,17 +1444,9 @@ static my_bool write_tail(MARIA_HA *info,
{ {
/* /*
We are modifying a state member before writing the UNDO; this is a WAL We are modifying a state member before writing the UNDO; this is a WAL
violation: assume this setting is made, checkpoint flushes new state, violation. But for data_file_length this is ok, as long as we change
and crash happens before the UNDO is written: how to undo the bad state? data_file_length after writing any log record (FILE_ID/REDO/UNDO) (see
Fortunately for data_file_length this is ok: as long as we change collect_tables()).
data_file_length after writing any REDO or UNDO we are safe:
- checkpoint flushes state only if it's older than
checkpoint_start_log_horizon, and flushes log up to that horizon first
- so if checkpoint flushed state with new data_file_length, REDO is in
log so LOGREC_FILE_ID too, recovery will meet and open the table thus
fix data_file_length to be the file's physical size.
Same property is currently true in all places of this file which change
data_file_length.
*/ */
info->state->data_file_length= position + block_size; info->state->data_file_length= position + block_size;
} }
......
...@@ -176,27 +176,6 @@ static int really_execute_checkpoint(void) ...@@ -176,27 +176,6 @@ static int really_execute_checkpoint(void)
DBUG_PRINT("info",("checkpoint_start_log_horizon (%lu,0x%lx)", DBUG_PRINT("info",("checkpoint_start_log_horizon (%lu,0x%lx)",
LSN_IN_PARTS(checkpoint_start_log_horizon))); LSN_IN_PARTS(checkpoint_start_log_horizon)));
lsn_store(checkpoint_start_log_horizon_char, checkpoint_start_log_horizon); lsn_store(checkpoint_start_log_horizon_char, checkpoint_start_log_horizon);
/*
We are going to flush the state of some tables (in collect_tables()) if
it's older than checkpoint_start_log_horizon. Before, all records
describing how to undo this flushed state must be in the log
(WAL). Usually this means UNDOs. In the special case of data_file_length,
recovery just needs to open the table, so any LOGREC_FILE_ID/REDO/UNDO
allowing recovery to understand it must open a table, is enough.
*/
/**
Apart from data|key_file_length which are easily recoverable from the OS,
all other state members must be updated only when writing the UNDO;
otherwise, if updated before, if their new value is flushed by a
checkpoint and there is a crash before UNDO is written, their REDO group
will be missing or at least incomplete and skipped by recovery, so bad
state value will stay. For example, setting key_root before writing the
UNDO: the table would have old index page (they were pinned at time of
crash) and a new, thus wrong, key_root.
@todo RECOVERY BUG check that all code honours that.
*/
if (translog_flush(checkpoint_start_log_horizon))
goto err;
/* /*
STEP 2: fetch information about transactions. STEP 2: fetch information about transactions.
...@@ -887,9 +866,32 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon) ...@@ -887,9 +866,32 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
*/ */
} }
translog_unlock(); translog_unlock();
/**
We are going to flush these states.
Before, all records describing how to undo such state must be
in the log (WAL). Usually this means UNDOs. In the special case of
data|key_file_length, recovery just needs to open the table to fix the
length, so any LOGREC_FILE_ID/REDO/UNDO allowing recovery to
understand it must open a table, is enough; so as long as
data|key_file_length is updated after writing any log record it's ok:
if we copied new value above, it means the record was before
state_copies_horizon and we flush such record below.
Apart from data|key_file_length which are easily recoverable from the
real file's size, all other state members must be updated only when
writing the UNDO; otherwise, if updated before, if their new value is
flushed by a checkpoint and there is a crash before UNDO is written,
their REDO group will be missing or at least incomplete and skipped
by recovery, so bad state value will stay. For example, setting
key_root before writing the UNDO: the table would have old index
pages (they were pinned at time of crash) and a new, thus wrong,
key_root.
@todo RECOVERY BUG check that all code honours that.
*/
if (translog_flush(state_copies_horizon))
goto err;
/* now we have cached states and they are WAL-safe*/
state_copies_end= state_copy; state_copies_end= state_copy;
state_copy= state_copies; state_copy= state_copies;
/* so now we have cached states */
} }
/* locate our state among these cached ones */ /* locate our state among these cached ones */
...@@ -911,15 +913,11 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon) ...@@ -911,15 +913,11 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
dfile= share->bitmap.file; dfile= share->bitmap.file;
/* /*
Ignore table which has no logged writes (all its future log records will Ignore table which has no logged writes (all its future log records will
be found naturally by Recovery). This also avoids flushing be found naturally by Recovery). Ignore obsolete shares (_before_
a data_file_length changed too early by a client (before any log record setting themselves to last_version=0 they already did all flush and
was written, giving no chance to recovery to meet and open the table, sync; if we flush their state now we may be flushing an obsolete state
see _ma_read_bitmap_page()). onto a newer one (assuming the table has been reopened with a different
Ignore obsolete shares (_before_ setting themselves to last_version=0 share but of course same physical index file).
they already did all flush and sync; if we flush their state now we may
be flushing an obsolete state onto a newer one (assuming the table has
been reopened with a different share but of course same physical index
file).
*/ */
if ((share->id != 0) && (share->last_version != 0)) if ((share->id != 0) && (share->last_version != 0))
{ {
...@@ -974,7 +972,6 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon) ...@@ -974,7 +972,6 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
It may also be a share which got last_version==0 since we checked It may also be a share which got last_version==0 since we checked
last_version; in this case, it flushed its state and the LSN test last_version; in this case, it flushed its state and the LSN test
above will catch it. above will catch it.
Last, see comments at start of really_execute_checkpoint().
*/ */
} }
else else
...@@ -1005,6 +1002,12 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon) ...@@ -1005,6 +1002,12 @@ static int collect_tables(LEX_STRING *str, LSN checkpoint_start_log_horizon)
each checkpoint if the table was once written and then not anymore. each checkpoint if the table was once written and then not anymore.
*/ */
} }
/**
@todo RECOVERY BUG this is going to flush the bitmap page possibly to
disk even though it could be over-allocated with not yet any
REDO-UNDO complete group (WAL violation: no way to undo the
over-allocation if crash); see also _ma_change_bitmap_page().
*/
sync_error|= sync_error|=
_ma_flush_bitmap(share); /* after that, all is in page cache */ _ma_flush_bitmap(share); /* after that, all is in page cache */
DBUG_ASSERT(share->pagecache == maria_pagecache); DBUG_ASSERT(share->pagecache == maria_pagecache);
......
...@@ -1156,13 +1156,14 @@ prototype_redo_exec_hook(REDO_INSERT_ROW_HEAD) ...@@ -1156,13 +1156,14 @@ prototype_redo_exec_hook(REDO_INSERT_ROW_HEAD)
/** /**
@todo RECOVERY BUG @todo RECOVERY BUG
we stamp page with UNDO's LSN. Assume an operation logs REDO-REDO-UNDO we stamp page with UNDO's LSN. Assume an operation logs REDO-REDO-UNDO
where the two REDOs are about the same page. Then recovery applies first where the two REDOs are about the same page (that is possible only with a
REDO and skips second REDO which is wrong. Solutions: head or tail page, not blob page). Then recovery applies first REDO and
a) when applying REDO, keep page pinned, don't stamp it, remember it; skips second REDO which is wrong. Solution:
when seeing UNDO, unpin pages and stamp them; for BLOB pages we cannot a)
pin them (too large for memory) so need an additional pass in REDO phase: * when applying REDO to head or tail, keep page pinned, don't stamp it,
- find UNDO * when applying REDO to blob page, stamp it with UNDO's LSN
- execute all REDOs about this UNDO but skipping REDOs for * when seeing UNDO, unpin head/tail pages and stamp them with UNDO's
LSN.
or b) when applying REDO, stamp page with REDO's LSN (=> difference in or b) when applying REDO, stamp page with REDO's LSN (=> difference in
'cmp' between run-time and recovery, need a special 'cmp'...). 'cmp' between run-time and recovery, need a special 'cmp'...).
*/ */
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment