Commit 753ee49f authored by unknown

- WL#3072 Maria Recovery:

Recovery of state.records (the count of records which is stored into
the header of the index file). For that, state.is_of_lsn is introduced;
logic is explained in ma_recovery.c (look for "Recovery of the state").
The net gain is that in case of crash, we now recover state.records,
and it is idempotent (ma_test_recovery tests it).
state.checksum is not recovered yet, mail sent for discussion.
- WL#3071 Maria Checkpoint: preparation for it, by protecting
all modifications of the state in memory or on disk with intern_lock
(with the exception of the really-often-modified state.records,
which is now protected with the log's lock; see ma_recovery.c,
look for "Recovery of the state"). Also, if maria_close() sees that
Checkpoint is looking at this table, it will not my_free() the share.
- don't compute row's checksum twice in case of UPDATE (correction
to a bugfix I made yesterday).
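
The recovery rule for state.records described in the first item above can be
sketched as follows. This is only an illustration under stated assumptions,
not the code in ma_recovery.c: lsn_t, struct log_rec, struct state and
redo_phase() are hypothetical names, the log is a plain in-memory array, and
is_of_lsn is advanced once at the end (standing in for "when closing the
table during Recovery").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;   /* hypothetical stand-in for Maria's LSN type */

enum rec_type { UNDO_ROW_INSERT, UNDO_ROW_DELETE, UNDO_ROW_PURGE };

struct log_rec { lsn_t lsn; enum rec_type type; };

/* Only the two index-header fields this sketch needs. */
struct state { uint64_t records; lsn_t is_of_lsn; };

/*
  Replays the log over the state. A record changes the count only if it is
  newer than the state (lsn > is_of_lsn); older records are already
  reflected in "records" and are skipped.
*/
static void redo_phase(struct state *st, const struct log_rec *recs, size_t n)
{
  size_t i;
  assert(st != NULL && (recs != NULL || n == 0));
  for (i= 0; i < n; i++)
  {
    if (recs[i].lsn <= st->is_of_lsn)
      continue;                         /* state already includes it */
    if (recs[i].type == UNDO_ROW_INSERT)
      st->records++;
    else                                /* UNDO_ROW_DELETE, UNDO_ROW_PURGE */
      st->records--;
  }
  /* The state is now at least as new as the last record we looked at. */
  if (n > 0 && recs[n - 1].lsn > st->is_of_lsn)
    st->is_of_lsn= recs[n - 1].lsn;
}
```

Because is_of_lsn moves forward together with the count, running redo_phase()
a second time over the same log skips every record, which is the idempotence
property that ma_test_recovery is said to check.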


storage/maria/ha_maria.cc:
  protect state write with intern_lock (against Checkpoint)
storage/maria/ma_blockrec.c:
  * don't reset trn->rec_lsn in _ma_unpin_all_pages(), because it
  should wait until we have corrected the allocation in the bitmap
  (as the REDO can serve to correct the allocation during Recovery);
  introducing _ma_finalize_row() for that.
  * In a changeset yesterday I moved computation of the checksum
  into write_block_record(), to fix a bug in UPDATE. Now I notice
  that maria_update() already computes the checksum, it's just that
  it puts it into info->cur_row while _ma_update_block_record()
  uses info->new_row; so, removing the checksum computation from
  write_block_record(), putting it back into allocate_and_write_block_record()
  (which is called only by INSERT and UNDO_DELETE), and copying
  cur_row->checksum into new_row->checksum in _ma_update_block_record().
storage/maria/ma_check.c:
  new prototypes, they will take intern_lock when writing the state;
  also take intern_lock when changing share->kfile. In both cases
  this is to protect against Checkpoint reading/writing the state or reading
  kfile at the same time.
  Not updating create_rename_lsn directly at end of write_log_record_for_repair()
  as it wouldn't have intern_lock.
storage/maria/ma_close.c:
  Checkpoint builds a list of shares (under THR_LOCK_maria), then it
  handles each such share (under intern_lock) (doing flushing etc);
  if maria_close() freed this share between the two, Checkpoint
  would see a bad pointer. To avoid this, when building the list Checkpoint
  marks each share, so that maria_close() knows it should not free it
  and Checkpoint will free it itself.
  Extending the zone covered by intern_lock to protect against
  Checkpoint reading kfile, writing state.
storage/maria/ma_create.c:
  When we update create_rename_lsn, we also update is_of_lsn to
  the same value: it is logical, and allows us to test in maria_open()
  that the former is not bigger than the latter (the contrary is a sign
  of index header corruption, or severe logging bug which hinders
  Recovery, table needs a repair).
  _ma_update_create_rename_lsn_on_disk() also writes is_of_lsn;
  it now operates under intern_lock (protect against Checkpoint),
  a shortcut function is available for cases where acquiring
  intern_lock is not needed (table's creation or first open).
storage/maria/ma_delete.c:
  if table is transactional, "records" is already decremented
  when logging UNDO_ROW_DELETE.
storage/maria/ma_delete_all.c:
  comments
storage/maria/ma_extra.c:
  Protect modifications of the state, in memory and/or on disk,
  with intern_lock, against a concurrent Checkpoint.
  When state goes to disk, update its is_of_lsn (by calling
  the new _ma_state_info_write()).
  In HA_EXTRA_FORCE_REOPEN, don't set share->changed to 0 (undoing
  a change I made a few days ago); the old assignment is kept under
  #ifdef ASK_MONTY.
storage/maria/ma_locking.c:
  no real code change here.
storage/maria/ma_loghandler.c:
  Log-write-hooks for updating "state.records" under log's mutex
  when writing/updating/deleting a row or deleting all rows.
storage/maria/ma_loghandler_lsn.h:
  merge (make LSN_ERROR and LSN_REPAIRED_BY_MARIA_CHK different)
storage/maria/ma_open.c:
  When opening a table verify that is_of_lsn >= create_rename_lsn; if
  false the header must be corrupted.
  _ma_state_info_write() is split in two: _ma_state_info_write_sub()
  which is the old _ma_state_info_write(), and _ma_state_info_write()
  which additionally takes intern_lock if requested (to protect
  against Checkpoint) and updates is_of_lsn.
  _ma_open_keyfile() should change kfile.file under intern_lock
  to protect Checkpoint from reading a wrong kfile.file.
storage/maria/ma_recovery.c:
  Recovery of state.records: when the REDO phase sees UNDO_ROW_INSERT
  which has a LSN > state.is_of_lsn it increments state.records.
  Same for UNDO_ROW_DELETE and UNDO_ROW_PURGE.
  When closing a table during Recovery, we know its state is at least
  as new as the current log record we are looking at, so increase
  is_of_lsn to the LSN of the current log record.
storage/maria/ma_rename.c:
  update for new behaviour of _ma_update_create_rename_lsn_on_disk().
storage/maria/ma_test1.c:
  update to new prototype
storage/maria/ma_test2.c:
  update to new prototype (actually prototype was changed days ago,
  but compiler does not complain about the extra argument??)
storage/maria/ma_test_recovery.expected:
  new result file of ma_test_recovery. Improvements: record
  count read from index's header is now always correct.
storage/maria/ma_test_recovery:
  "rm" fails if file does not exist. Redirect stderr of script.
storage/maria/ma_write.c:
  if table is transactional, "records" is already incremented when
  logging UNDO_ROW_INSERT. Comments.
storage/maria/maria_chk.c:
  update is_of_lsn too
storage/maria/maria_def.h:
  - MARIA_STATE_INFO::is_of_lsn which is used by Recovery. It is stored
  into the index file's header.
  - Checkpoint can now mark a table as "don't free this", and maria_close()
  can reply "ok then you will free it".
  - new functions
storage/maria/maria_pack.c:
  update for new name
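
The "don't free this" / "ok then you will free it" handoff between Checkpoint
and maria_close() (see the ma_close.c and maria_def.h notes above) might look
roughly like this. It is a minimal sketch, not the committed code: struct
share, the flag values and all four functions are hypothetical, a single
mutex stands in for both THR_LOCK_maria and intern_lock, and the Checkpoint
side in particular is an assumption, since only maria_close() appears in this
changeset.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define CHECKPOINT_LOOKS_AT_ME    1   /* hypothetical flag values */
#define CHECKPOINT_SHOULD_FREE_ME 2

struct share { pthread_mutex_t intern_lock; int in_checkpoint; };

static struct share *new_share(void)
{
  struct share *s= malloc(sizeof(*s));
  assert(s != NULL);
  pthread_mutex_init(&s->intern_lock, NULL);
  s->in_checkpoint= 0;
  return s;
}

static void free_share(struct share *s)
{
  pthread_mutex_destroy(&s->intern_lock);
  free(s);
}

/* Checkpoint: called while building its list of shares. */
static void checkpoint_mark_share(struct share *s)
{
  pthread_mutex_lock(&s->intern_lock);
  s->in_checkpoint= CHECKPOINT_LOOKS_AT_ME;
  pthread_mutex_unlock(&s->intern_lock);
}

/* Closer: returns 1 if it freed the share, 0 if Checkpoint must do it. */
static int close_share(struct share *s)
{
  int can_free= 1;
  pthread_mutex_lock(&s->intern_lock);
  if (s->in_checkpoint & CHECKPOINT_LOOKS_AT_ME)
  {
    s->in_checkpoint|= CHECKPOINT_SHOULD_FREE_ME;  /* hand ownership over */
    can_free= 0;
  }
  pthread_mutex_unlock(&s->intern_lock);
  if (can_free)
    free_share(s);              /* free only after the mutex is released */
  return can_free;
}

/* Checkpoint: called when it has finished handling the share. */
static int checkpoint_done_with_share(struct share *s)
{
  int must_free;
  pthread_mutex_lock(&s->intern_lock);
  must_free= (s->in_checkpoint & CHECKPOINT_SHOULD_FREE_ME) != 0;
  s->in_checkpoint= 0;
  pthread_mutex_unlock(&s->intern_lock);
  if (must_free)
    free_share(s);
  return must_free;
}
```

Whichever side finishes last frees the share, so neither side dereferences a
pointer the other has already released. As in the maria_close() change, the
mutex is destroyed only after it has been unlocked.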
parent f456b30c
......@@ -1287,6 +1287,7 @@ int ha_maria::repair(THD *thd, HA_CHECK &param, bool do_optimize)
}
}
thd->proc_info= "Saving state";
pthread_mutex_lock(&share->intern_lock);
if (!error)
{
if ((share->state.changed & STATE_CHANGED) || maria_is_crashed(file))
......@@ -1324,6 +1325,7 @@ int ha_maria::repair(THD *thd, HA_CHECK &param, bool do_optimize)
file->update |= HA_STATE_CHANGED | HA_STATE_ROW_CHANGED;
maria_update_state_info(&param, file, 0);
}
pthread_mutex_unlock(&share->intern_lock);
thd->proc_info= old_proc_info;
if (!thd->locked_tables)
{
......
......@@ -690,8 +690,6 @@ static my_bool check_if_zero(uchar *pos, uint length)
We unpin pages in the reverse order as they where pinned; This may not
be strictly necessary but may simplify things in the future.
info->trn->rec_lsn contains the lsn for the first REDO
RETURN
0 ok
1 error (fatal disk error)
......@@ -717,7 +715,6 @@ void _ma_unpin_all_pages(MARIA_HA *info, LSN undo_lsn)
pinned_page->unlock, PAGECACHE_UNPIN,
info->trn->rec_lsn, undo_lsn);
info->trn->rec_lsn= 0;
info->pinned_pages.elements= 0;
DBUG_VOID_RETURN;
}
......@@ -739,6 +736,22 @@ static uint empty_space_on_page(uchar *buff, uint block_size)
}
#endif
/**
When we have finished the write/update/delete of a row, we have cleanups to
do. For now it is signalling to Checkpoint that all dirtied pages have
their rec_lsn set and page LSN set (_ma_unpin_all_pages() has been called),
and that bitmap pages are correct (_ma_bitmap_release_unused() has been
called).
*/
#define _ma_finalize_row(info) \
do { info->trn->rec_lsn= LSN_IMPOSSIBLE; } while(0)
/** unpinning is often the last operation before finalizing: */
#define _ma_unpin_all_pages_and_finalize_row(info,undo_lsn) do \
{ \
_ma_unpin_all_pages(info, undo_lsn); \
_ma_finalize_row(info); \
} while(0)
/*
Find free position in directory
......@@ -1729,10 +1742,7 @@ static my_bool write_block_record(MARIA_HA *info,
if (share->base.pack_fields)
store_key_length_inc(data, row->field_lengths_length);
if (share->calc_checksum)
{
row->checksum= (info->s->calc_checksum)(info, record);
*(data++)= (uchar) (row->checksum); /* store least significant byte */
}
memcpy(data, record, share->base.null_bytes);
data+= share->base.null_bytes;
memcpy(data, row->empty_bits, share->base.pack_bytes);
......@@ -2387,6 +2397,8 @@ static my_bool write_block_record(MARIA_HA *info,
/* Release not used space in used pages */
if (_ma_bitmap_release_unused(info, bitmap_blocks))
goto disk_err;
_ma_finalize_row(info);
DBUG_RETURN(0);
crashed:
......@@ -2421,7 +2433,7 @@ static my_bool write_block_record(MARIA_HA *info,
Unpin all pinned pages to not cause problems for disk cache. This is
safe to call even if we already called _ma_unpin_all_pages() above.
*/
_ma_unpin_all_pages(info, 0);
_ma_unpin_all_pages_and_finalize_row(info, 0);
DBUG_RETURN(1);
}
......@@ -2458,6 +2470,8 @@ static my_bool allocate_and_write_block_record(MARIA_HA *info,
PAGECACHE_LOCK_WRITE, &row_pos))
DBUG_RETURN(1);
row->lastpos= ma_recordpos(blocks->block->page, row_pos.rownr);
if (info->s->calc_checksum)
row->checksum= (info->s->calc_checksum)(info, record);
if (write_block_record(info, (uchar*) 0, record, row,
blocks, blocks->block->org_bitmap_value != 0,
&row_pos, undo_lsn))
......@@ -2595,7 +2609,7 @@ my_bool _ma_write_abort_block_record(MARIA_HA *info)
log_data + LSN_STORE_SIZE))
res= 1;
}
_ma_unpin_all_pages(info, info->trn->undo_lsn);
_ma_unpin_all_pages_and_finalize_row(info, info->trn->undo_lsn);
DBUG_RETURN(res);
}
......@@ -2625,6 +2639,8 @@ my_bool _ma_update_block_record(MARIA_HA *info, MARIA_RECORD_POS record_pos,
DBUG_ENTER("_ma_update_block_record");
DBUG_PRINT("enter", ("rowid: %lu", (long) record_pos));
/* checksum was computed by maria_update() already and put into cur_row */
new_row->checksum= cur_row->checksum;
calc_record_size(info, record, new_row);
page= ma_recordpos_to_page(record_pos);
......@@ -2713,7 +2729,7 @@ my_bool _ma_update_block_record(MARIA_HA *info, MARIA_RECORD_POS record_pos,
&row_pos, LSN_ERROR));
err:
_ma_unpin_all_pages(info, 0);
_ma_unpin_all_pages_and_finalize_row(info, 0);
DBUG_RETURN(1);
}
......@@ -3001,11 +3017,11 @@ my_bool _ma_delete_block_record(MARIA_HA *info, const uchar *record)
}
_ma_unpin_all_pages(info, info->trn->undo_lsn);
_ma_unpin_all_pages_and_finalize_row(info, info->trn->undo_lsn);
DBUG_RETURN(0);
err:
_ma_unpin_all_pages(info, 0);
_ma_unpin_all_pages_and_finalize_row(info, 0);
DBUG_RETURN(1);
}
......@@ -4878,7 +4894,7 @@ my_bool _ma_apply_undo_row_insert(MARIA_HA *info, LSN undo_lsn,
res= 0;
err:
_ma_unpin_all_pages(info, lsn);
_ma_unpin_all_pages_and_finalize_row(info, lsn);
DBUG_RETURN(res);
}
......
......@@ -2001,7 +2001,7 @@ int maria_repair(HA_CHECK *param, register MARIA_HA *info,
*/
if (_ma_flush_table_files(info, MARIA_FLUSH_DATA | MARIA_FLUSH_INDEX,
FLUSH_FORCE_WRITE, FLUSH_IGNORE_CHANGED) ||
_ma_state_info_write(share->kfile.file, &share->state, 1|2))
_ma_state_info_write(share, 1|2|4))
goto err;
if (!rep_quick)
......@@ -2459,9 +2459,8 @@ int _ma_flush_table_files_after_repair(HA_CHECK *param, MARIA_HA *info)
MARIA_SHARE *share= info->s;
if (_ma_flush_table_files(info, MARIA_FLUSH_DATA | MARIA_FLUSH_INDEX,
FLUSH_RELEASE, FLUSH_RELEASE) ||
_ma_state_info_write(share->kfile.file, &share->state, 1) ||
(share->now_transactional && !share->temporary
&& _ma_sync_table_files(info)))
_ma_state_info_write(share, 1|4) ||
(share->base.born_transactional && _ma_sync_table_files(info)))
{
_ma_check_print_error(param,"%d when trying to write bufferts",my_errno);
return 1;
......@@ -2540,8 +2539,10 @@ int maria_sort_index(HA_CHECK *param, register MARIA_HA *info, char *name)
/* Put same locks as old file */
share->r_locks= share->w_locks= share->tot_locks= 0;
(void) _ma_writeinfo(info,WRITEINFO_UPDATE_KEYFILE);
pthread_mutex_lock(&share->intern_lock);
VOID(my_close(share->kfile.file, MYF(MY_WME)));
share->kfile.file = -1;
pthread_mutex_unlock(&share->intern_lock);
VOID(my_close(new_file,MYF(MY_WME)));
if (maria_change_to_newfile(share->index_file_name, MARIA_NAME_IEXT,
INDEX_TMP_EXT, sync_dir) ||
......@@ -5087,7 +5088,7 @@ int maria_update_state_info(HA_CHECK *param, MARIA_HA *info,uint update)
*/
if (info->lock_type == F_WRLCK)
share->state.state= *info->state;
if (_ma_state_info_write(share->kfile.file, &share->state, 1 + 2))
if (_ma_state_info_write(share, 1|2))
goto err;
share->changed=0;
}
......@@ -5540,6 +5541,7 @@ static int _ma_safe_scan_block_record(MARIA_SORT_INFO *sort_info,
/**
@brief Writes a LOGREC_REPAIR_TABLE record and updates create_rename_lsn
and is_of_lsn
REPAIR/OPTIMIZE have replaced the data/index file with a new file
and so, in this scenario:
......@@ -5560,8 +5562,8 @@ static int _ma_safe_scan_block_record(MARIA_SORT_INFO *sort_info,
static int write_log_record_for_repair(const HA_CHECK *param, MARIA_HA *info)
{
MARIA_SHARE *share= info->s;
if (translog_inited) /* test it in case this is maria_chk */
/* in case this is maria_chk or recovery... */
if (translog_inited && !maria_in_recovery)
{
/*
For now this record is only informative. It could serve when applying
......@@ -5582,6 +5584,7 @@ static int write_log_record_for_repair(const HA_CHECK *param, MARIA_HA *info)
*/
LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 1];
uchar log_data[LSN_STORE_SIZE];
LSN lsn;
compile_time_assert(LSN_STORE_SIZE >= (FILEID_STORE_SIZE + 4));
log_array[TRANSLOG_INTERNAL_PARTS + 0].str= (char*) log_data;
log_array[TRANSLOG_INTERNAL_PARTS + 0].length= FILEID_STORE_SIZE + 4;
......@@ -5590,22 +5593,21 @@ static int write_log_record_for_repair(const HA_CHECK *param, MARIA_HA *info)
or not: did it touch the data file or not?).
*/
int4store(log_data + FILEID_STORE_SIZE, param->testflag);
if (unlikely(translog_write_record(&share->state.create_rename_lsn,
LOGREC_REDO_REPAIR_TABLE,
if (unlikely(translog_write_record(&lsn, LOGREC_REDO_REPAIR_TABLE,
&dummy_transaction_object, info,
log_array[TRANSLOG_INTERNAL_PARTS +
0].length,
sizeof(log_array)/sizeof(log_array[0]),
log_array, log_data) ||
translog_flush(share->state.create_rename_lsn)))
translog_flush(lsn)))
return 1;
/*
The table's existence was made durable earlier (MY_SYNC_DIR passed to
maria_change_to_newfile()).
_ma_flush_table_files_after_repair() is later called by maria_repair(),
and makes sure to flush the data, index and state and sync, so
create_rename_lsn reaches disk, thus we won't apply old REDOs to the new
table.
and makes sure to flush the data, index, update is_of_lsn, flush state
and sync, so create_rename_lsn reaches disk, thus we won't apply old
REDOs to the new table.
*/
}
return 0;
......
......@@ -25,6 +25,7 @@
int maria_close(register MARIA_HA *info)
{
int error=0,flag;
my_bool share_can_be_freed= FALSE;
MARIA_SHARE *share=info->s;
DBUG_ENTER("maria_close");
DBUG_PRINT("enter",("base: 0x%lx reopen: %u locks: %u",
......@@ -58,7 +59,6 @@ int maria_close(register MARIA_HA *info)
}
flag= !--share->reopen;
maria_open_list=list_delete(maria_open_list,&info->open_list);
pthread_mutex_unlock(&share->intern_lock);
my_free(info->rec_buff, MYF(MY_ALLOW_ZERO_PTR));
(*share->end)(info);
......@@ -90,20 +90,23 @@ int maria_close(register MARIA_HA *info)
(share->mode != O_RDONLY && maria_is_crashed(info)))
{
/*
File must be synced as it is going out of the maria_open_list and so
becoming unknown to Checkpoint. State must be written to file as
it was not done at table's unlocking.
State must be written to file as it was not done at table's
unlocking.
*/
if (_ma_state_info_write(share->kfile.file, &share->state, 1) ||
my_sync(share->kfile.file, MYF(MY_WME)))
if (_ma_state_info_write(share, 1))
error= my_errno;
}
/*
File must be synced as it is going out of the maria_open_list and so
becoming unknown to future Checkpoints.
*/
if (my_sync(share->kfile.file, MYF(MY_WME)))
error= my_errno;
if (my_close(share->kfile.file, MYF(0)))
error= my_errno;
}
#ifdef THREAD
thr_lock_delete(&share->lock);
VOID(pthread_mutex_destroy(&share->intern_lock));
{
int i,keys;
keys = share->state.header.keys;
......@@ -114,16 +117,36 @@ int maria_close(register MARIA_HA *info)
}
#endif
DBUG_ASSERT(share->now_transactional == share->base.born_transactional);
my_free((uchar*) share, MYF(0));
if (share->in_checkpoint == MARIA_CHECKPOINT_LOOKS_AT_ME)
{
share->kfile.file= -1; /* because Checkpoint does not need to flush */
/* we cannot my_free() the share, Checkpoint would see a bad pointer */
share->in_checkpoint|= MARIA_CHECKPOINT_SHOULD_FREE_ME;
}
else
share_can_be_freed= TRUE;
}
pthread_mutex_unlock(&THR_LOCK_maria);
pthread_mutex_unlock(&share->intern_lock);
if (share_can_be_freed)
{
VOID(pthread_mutex_destroy(&share->intern_lock));
my_free((uchar *)share, MYF(0));
}
if (info->ftparser_param)
{
my_free((uchar*)info->ftparser_param, MYF(0));
info->ftparser_param= 0;
}
if (info->dfile.file >= 0 && my_close(info->dfile.file, MYF(0)))
if (info->dfile.file >= 0)
{
/*
This is outside of mutex so would confuse a concurrent
Checkpoint. Fortunately in BLOCK_RECORD we close earlier under mutex.
*/
if (my_close(info->dfile.file, MYF(0)))
error = my_errno;
}
my_free((uchar*) info,MYF(0));
......
......@@ -634,7 +634,7 @@ int maria_create(const char *name, enum data_file_type datafile_type,
share.state.dellink = HA_OFFSET_ERROR;
share.state.first_bitmap_with_space= 0;
share.state.create_rename_lsn= LSN_IMPOSSIBLE;
share.state.create_rename_lsn= share.state.is_of_lsn= LSN_IMPOSSIBLE;
share.state.process= (ulong) getpid();
share.state.unique= (ulong) 0;
share.state.update_count=(ulong) 0;
......@@ -792,7 +792,7 @@ int maria_create(const char *name, enum data_file_type datafile_type,
errpos=1;
DBUG_PRINT("info", ("write state info and base info"));
if (_ma_state_info_write(file, &share.state, 2) ||
if (_ma_state_info_write_sub(file, &share.state, 2) ||
_ma_base_info_write(file, &share.base))
goto err;
DBUG_PRINT("info", ("base_pos: %d base_info_size: %d",
......@@ -933,6 +933,7 @@ int maria_create(const char *name, enum data_file_type datafile_type,
LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 4];
uint total_rec_length= 0;
uint i;
LSN lsn;
log_array[TRANSLOG_INTERNAL_PARTS + 1].length= 1 + 2 + 2 +
kfile_size_before_extension;
/* we are needing maybe 64 kB, so don't use the stack */
......@@ -991,20 +992,20 @@ int maria_create(const char *name, enum data_file_type datafile_type,
called external_lock(), so have no TRN. It does not matter, as all
these operations are non-transactional and sync their files.
*/
if (unlikely(translog_write_record(&share.state.create_rename_lsn,
if (unlikely(translog_write_record(&lsn,
LOGREC_REDO_CREATE_TABLE,
&dummy_transaction_object, NULL,
total_rec_length,
sizeof(log_array)/sizeof(log_array[0]),
log_array, NULL) ||
translog_flush(share.state.create_rename_lsn)))
translog_flush(lsn)))
goto err;
/*
store LSN into file, needed for Recovery to not be confused if a
DROP+CREATE happened (applying REDOs to the wrong table).
*/
share.kfile.file= file;
if (_ma_update_create_rename_lsn_on_disk(&share, FALSE))
if (_ma_update_create_rename_lsn_on_disk_sub(&share, lsn, FALSE))
goto err;
my_free(log_data, MYF(0));
}
......@@ -1205,13 +1206,14 @@ int _ma_initialize_data_file(MARIA_SHARE *share, File dfile)
/**
@brief Writes create_rename_lsn to disk, optionally forces
@brief Writes create_rename_lsn and is_of_lsn to disk, optionally forces.
This is for special cases where:
- we don't want to write the full state to disk (so, not call
_ma_state_info_write()) because some parts of the state may be
currently inconsistent, or because it would be overkill
- we must sync this LSN immediately for correctness.
- we must sync these LSNs immediately for correctness.
It acquires intern_lock to protect the two LSNs and state write.
@param share table's share
@param do_sync if the write should be forced to disk
......@@ -1221,13 +1223,42 @@ int _ma_initialize_data_file(MARIA_SHARE *share, File dfile)
@retval 1 error (disk problem)
*/
int _ma_update_create_rename_lsn_on_disk(MARIA_SHARE *share, my_bool do_sync)
int _ma_update_create_rename_lsn_on_disk(MARIA_SHARE *share,
LSN lsn, my_bool do_sync)
{
char buf[LSN_STORE_SIZE];
int res;
pthread_mutex_lock(&share->intern_lock);
res= _ma_update_create_rename_lsn_on_disk_sub(share, lsn, do_sync);
pthread_mutex_unlock(&share->intern_lock);
return res;
}
/**
@brief Writes create_rename_lsn and is_of_lsn to disk, optionally forces.
Shortcut of _ma_update_create_rename_lsn_on_disk() when we know that
intern_lock is not needed (when creating a table or opening it for the
first time).
@param share table's share
@param do_sync if the write should be forced to disk
@return Operation status
@retval 0 ok
@retval 1 error (disk problem)
*/
int _ma_update_create_rename_lsn_on_disk_sub(MARIA_SHARE *share,
LSN lsn, my_bool do_sync)
{
char buf[LSN_STORE_SIZE*2], *ptr;
File file= share->kfile.file;
DBUG_ASSERT(file >= 0);
lsn_store(buf, share->state.create_rename_lsn);
return (my_pwrite(file, buf, sizeof(buf),
for (ptr= buf; ptr < (buf + sizeof(buf)); ptr+= LSN_STORE_SIZE)
lsn_store(ptr, lsn);
share->state.is_of_lsn= share->state.create_rename_lsn= lsn;
return my_pwrite(file, buf, sizeof(buf),
sizeof(share->state.header) + 2, MYF(MY_NABP)) ||
(do_sync && my_sync(file, MYF(0))));
(do_sync && my_sync(file, MYF(0)));
}
......@@ -103,7 +103,7 @@ int maria_delete(MARIA_HA *info,const uchar *record)
}
info->update= HA_STATE_CHANGED+HA_STATE_DELETED+HA_STATE_ROW_CHANGED;
info->state->records--;
info->state->records-= !share->now_transactional;
share->state.changed|= STATE_NOT_OPTIMIZED_ROWS;
mi_sizestore(lastpos, info->cur_row.lastpos);
......
......@@ -69,6 +69,10 @@ int maria_delete_all_rows(MARIA_HA *info)
goto err;
}
/*
For recovery it matters that this is called after writing the log record,
so that resetting state.records actually happens under log's mutex.
*/
_ma_reset_status(info);
/*
......@@ -143,6 +147,10 @@ void _ma_reset_status(MARIA_HA *info)
info->state->key_file_length= share->base.keystart;
info->state->data_file_length= 0;
info->state->empty= info->state->key_empty= 0;
/**
@todo RECOVERY BUG
the line below must happen under log's mutex when writing the REDO
*/
info->state->checksum= 0;
/* Drop the delete key chain. */
......
......@@ -227,8 +227,11 @@ int maria_extra(MARIA_HA *info, enum ha_extra_function function,
info->lock_wait=MY_DONT_WAIT;
break;
case HA_EXTRA_NO_KEYS:
/* we're going to modify pieces of the state, stall Checkpoint */
pthread_mutex_lock(&share->intern_lock);
if (info->lock_type == F_UNLCK)
{
pthread_mutex_unlock(&share->intern_lock);
error=1; /* Not possibly if not lock */
break;
}
......@@ -263,8 +266,10 @@ int maria_extra(MARIA_HA *info, enum ha_extra_function function,
0), and so the only way it leaves information (share->state.key_map)
for the posterity is by writing it to disk.
*/
error=_ma_state_info_write(share->kfile.file, &share->state, (1 | 2));
DBUG_ASSERT(!maria_in_recovery);
error= _ma_state_info_write(share, 1|2);
}
pthread_mutex_unlock(&share->intern_lock);
break;
case HA_EXTRA_FORCE_REOPEN:
/*
......@@ -275,8 +280,22 @@ int maria_extra(MARIA_HA *info, enum ha_extra_function function,
/** @todo consider porting these flush-es to MyISAM */
error= _ma_flush_table_files(info, MARIA_FLUSH_DATA | MARIA_FLUSH_INDEX,
FLUSH_FORCE_WRITE, FLUSH_FORCE_WRITE) ||
_ma_state_info_write(share->kfile.file, &share->state, 1 | 2) ||
(share->changed= 0);
_ma_state_info_write(share, 1|2|4);
#ifdef ASK_MONTY
|| (share->changed= 0);
#endif
/**
@todo RECOVERY BUG
Though we flushed the state, IF some other thread may have the same
table (same MARIA_SHARE) open at this time then it may have a
more recent state to flush when it closes, thus we don't set
share->changed to 0 here. On the other hand, this means that when our
thread closes its table, it will flush the state again, then it would
overwrite any state written by yet another thread which may have opened
the table (new MARIA_SHARE) and done some updates.
ASK_MONTY about the IF above. See also same tag in
HA_EXTRA_PREPARE_FOR_DROP|RENAME.
*/
pthread_mutex_lock(&THR_LOCK_maria);
/* this makes the share not be re-used next time the table is opened */
share->last_version= 0L; /* Impossible version */
......@@ -328,11 +347,13 @@ int maria_extra(MARIA_HA *info, enum ha_extra_function function,
We have to sync now, as on Windows we are going to close the file
(so cannot sync later).
*/
if (_ma_state_info_write(share->kfile.file, &share->state, 1 | 2) ||
if (_ma_state_info_write(share, 1 | 2) ||
my_sync(share->kfile.file, MYF(0)))
error= my_errno;
#ifdef ASK_MONTY /* see same tag in HA_EXTRA_FORCE_REOPEN */
else
share->changed= 0;
#endif
}
else
{
......
......@@ -116,7 +116,7 @@ int maria_lock_database(MARIA_HA *info, int lock_type)
/* transactional tables rather flush their state at Checkpoint */
if (!share->base.born_transactional)
{
if (_ma_state_info_write(share->kfile.file, &share->state, 1))
if (_ma_state_info_write_sub(share->kfile.file, &share->state, 1))
error= my_errno;
else
{
......@@ -287,6 +287,7 @@ void _ma_get_status(void* param, int concurrent_insert)
void _ma_update_status(void* param)
{
MARIA_HA *info=(MARIA_HA*) param;
MARIA_SHARE *share= info->s;
/*
Because someone may have closed the table we point at, we only
update the state if its our own state. This isn't a problem as
......@@ -299,19 +300,19 @@ void _ma_update_status(void* param)
DBUG_PRINT("info",("updating status: key_file: %ld data_file: %ld",
(long) info->state->key_file_length,
(long) info->state->data_file_length));
if (info->state->key_file_length < info->s->state.state.key_file_length ||
info->state->data_file_length < info->s->state.state.data_file_length)
if (info->state->key_file_length < share->state.state.key_file_length ||
info->state->data_file_length < share->state.state.data_file_length)
DBUG_PRINT("warning",("old info: key_file: %ld data_file: %ld",
(long) info->s->state.state.key_file_length,
(long) info->s->state.state.data_file_length));
(long) share->state.state.key_file_length,
(long) share->state.state.data_file_length));
#endif
/*
we are going to modify the state without lock's log, this would break
recovery if done with a transactional table.
*/
DBUG_ASSERT(!info->s->base.born_transactional);
info->s->state.state= *info->state;
info->state= &info->s->state.state;
share->state.state= *info->state;
info->state= &share->state.state;
}
info->append_insert_at_end= 0;
}
......@@ -432,7 +433,8 @@ int _ma_writeinfo(register MARIA_HA *info, uint operation)
share->state.process= share->last_process= share->this_process;
share->state.unique= info->last_unique= info->this_unique;
share->state.update_count= info->last_loop= ++info->this_loop;
if ((error= _ma_state_info_write(share->kfile.file, &share->state, 1)))
if ((error= _ma_state_info_write_sub(share->kfile.file,
&share->state, 1)))
olderror=my_errno;
#ifdef __WIN__
if (maria_flush)
......
......@@ -213,6 +213,22 @@ static my_bool write_hook_for_redo(enum translog_record_type type,
static my_bool write_hook_for_undo(enum translog_record_type type,
TRN *trn, MARIA_HA *tbl_info, LSN *lsn,
struct st_translog_parts *parts);
static my_bool write_hook_for_redo_delete_all(enum translog_record_type type,
TRN *trn, MARIA_HA *tbl_info,
LSN *lsn,
struct st_translog_parts *parts);
static my_bool write_hook_for_undo_row_insert(enum translog_record_type type,
TRN *trn, MARIA_HA *tbl_info,
LSN *lsn,
struct st_translog_parts *parts);
static my_bool write_hook_for_undo_row_delete(enum translog_record_type type,
TRN *trn, MARIA_HA *tbl_info,
LSN *lsn,
struct st_translog_parts *parts);
static my_bool write_hook_for_undo_row_purge(enum translog_record_type type,
TRN *trn, MARIA_HA *tbl_info,
LSN *lsn,
struct st_translog_parts *parts);
static my_bool write_hook_for_clr_end(enum translog_record_type type,
TRN *trn, MARIA_HA *tbl_info, LSN *lsn,
struct st_translog_parts *parts);
......@@ -429,13 +445,13 @@ static LOG_DESC INIT_LOGREC_UNDO_ROW_INSERT=
{LOGRECTYPE_PSEUDOFIXEDLENGTH,
LSN_STORE_SIZE + FILEID_STORE_SIZE + PAGE_STORE_SIZE + DIRPOS_STORE_SIZE,
LSN_STORE_SIZE + FILEID_STORE_SIZE + PAGE_STORE_SIZE + DIRPOS_STORE_SIZE,
NULL, write_hook_for_undo, NULL, 1,
NULL, write_hook_for_undo_row_insert, NULL, 1,
"undo_row_insert", LOGREC_LAST_IN_GROUP, NULL, NULL};
static LOG_DESC INIT_LOGREC_UNDO_ROW_DELETE=
{LOGRECTYPE_VARIABLE_LENGTH, 0,
LSN_STORE_SIZE + FILEID_STORE_SIZE + PAGE_STORE_SIZE + DIRPOS_STORE_SIZE,
NULL, write_hook_for_undo, NULL, 1,
NULL, write_hook_for_undo_row_delete, NULL, 1,
"undo_row_delete", LOGREC_LAST_IN_GROUP, NULL, NULL};
static LOG_DESC INIT_LOGREC_UNDO_ROW_UPDATE=
......@@ -447,7 +463,7 @@ static LOG_DESC INIT_LOGREC_UNDO_ROW_UPDATE=
static LOG_DESC INIT_LOGREC_UNDO_ROW_PURGE=
{LOGRECTYPE_PSEUDOFIXEDLENGTH, LSN_STORE_SIZE + FILEID_STORE_SIZE,
LSN_STORE_SIZE + FILEID_STORE_SIZE,
NULL, write_hook_for_undo, NULL, 1,
NULL, write_hook_for_undo_row_purge, NULL, 1,
"undo_row_purge", LOGREC_LAST_IN_GROUP, NULL, NULL};
static LOG_DESC INIT_LOGREC_UNDO_KEY_INSERT=
......@@ -493,7 +509,7 @@ static LOG_DESC INIT_LOGREC_REDO_DROP_TABLE=
static LOG_DESC INIT_LOGREC_REDO_DELETE_ALL=
{LOGRECTYPE_FIXEDLENGTH, FILEID_STORE_SIZE, FILEID_STORE_SIZE,
NULL, write_hook_for_redo, NULL, 0,
NULL, write_hook_for_redo_delete_all, NULL, 0,
"redo_delete_all", LOGREC_IS_GROUP_ITSELF, NULL, NULL};
static LOG_DESC INIT_LOGREC_REDO_REPAIR_TABLE=
......@@ -6308,6 +6324,88 @@ static my_bool write_hook_for_undo(enum translog_record_type type
}
/**
@brief Sets the table's records count to 0, then calls the generic REDO
hook.
@todo move it to a separate file
@return Operation status, always 0 (success)
*/
static my_bool write_hook_for_redo_delete_all(enum translog_record_type type
__attribute__ ((unused)),
TRN *trn, MARIA_HA *tbl_info
__attribute__ ((unused)),
LSN *lsn,
struct st_translog_parts *parts
__attribute__ ((unused)))
{
tbl_info->s->state.state.records= 0;
return write_hook_for_redo(type, trn, tbl_info, lsn, parts);
}
/**
@brief Upates "records" and calls the generic UNDO hook
@todo move it to a separate file
@return Operation status, always 0 (success)
*/
static my_bool write_hook_for_undo_row_insert(enum translog_record_type type
__attribute__ ((unused)),
TRN *trn, MARIA_HA *tbl_info,
LSN *lsn,
struct st_translog_parts *parts
__attribute__ ((unused)))
{
tbl_info->s->state.state.records++;
return write_hook_for_undo(type, trn, tbl_info, lsn, parts);
}
/**
@brief Upates "records" and calls the generic UNDO hook
@todo move it to a separate file
@return Operation status, always 0 (success)
*/
static my_bool write_hook_for_undo_row_delete(enum translog_record_type type
__attribute__ ((unused)),
TRN *trn, MARIA_HA *tbl_info,
LSN *lsn,
struct st_translog_parts *parts
__attribute__ ((unused)))
{
tbl_info->s->state.state.records--;
return write_hook_for_undo(type, trn, tbl_info, lsn, parts);
}
/**
@brief Upates "records" and calls the generic UNDO hook
@todo we will get rid of this record soon.
@return Operation status, always 0 (success)
*/
static my_bool write_hook_for_undo_row_purge(enum translog_record_type type
__attribute__ ((unused)),
TRN *trn, MARIA_HA *tbl_info,
LSN *lsn,
struct st_translog_parts *parts
__attribute__ ((unused)))
{
tbl_info->s->state.state.records--;
return write_hook_for_undo(type, trn, tbl_info, lsn, parts);
}
/**
@brief Sets transaction's undo_lsn, first_undo_lsn if needed
......
......@@ -85,7 +85,7 @@ typedef LSN LSN_WITH_FLAGS;
#define LSN_ERROR 1
/** @brief some impossible LSN serve as markers */
#define LSN_REPAIRED_BY_MARIA_CHK ((LSN)1)
#define LSN_REPAIRED_BY_MARIA_CHK ((LSN)2)
/**
@brief the maximum valid LSN.
......
......@@ -613,12 +613,14 @@ MARIA_HA *maria_open(const char *name, int mode, uint open_flags)
view of the server, including server's recovery) now.
*/
if ((open_flags & HA_OPEN_FROM_SQL_LAYER) || maria_in_recovery)
{
share->state.create_rename_lsn= translog_get_horizon();
_ma_update_create_rename_lsn_on_disk(share, TRUE);
}
}
else if (!LSN_VALID(share->state.create_rename_lsn) &&
_ma_update_create_rename_lsn_on_disk_sub(share,
translog_get_horizon(),
TRUE);
}
else if ((!LSN_VALID(share->state.create_rename_lsn) ||
!LSN_VALID(share->state.is_of_lsn) ||
(cmp_translog_addr(share->state.create_rename_lsn,
share->state.is_of_lsn) > 0)) &&
!(open_flags & HA_OPEN_FOR_REPAIR))
{
/*
......@@ -968,18 +970,64 @@ static void setup_key_functions(register MARIA_KEYDEF *keyinfo)
/**
@brief Function to save and store the header in the index file (.MYI)
Operates under MARIA_SHARE::intern_lock if requested.
Sets MARIA_SHARE::MARIA_STATE_INFO::is_of_lsn if table is transactional.
Then calls _ma_state_info_write_sub().
@param share table
@param pWrite bitmap: if 1 is set my_pwrite() is used otherwise
my_write(); if 2 is set, info about keys is written
(should only be needed after ALTER TABLE
ENABLE/DISABLE KEYS, and REPAIR/OPTIMIZE); if 4 is
set, MARIA_SHARE::intern_lock is taken.
@return Operation status
@retval 0 OK
@retval 1 Error
*/
uint _ma_state_info_write(MARIA_SHARE *share, uint pWrite)
{
uint res= 0;
if (pWrite & 4)
pthread_mutex_lock(&share->intern_lock);
else if (maria_multi_threaded)
safe_mutex_assert_owner(&share->intern_lock);
if (share->base.born_transactional && translog_inited &&
!maria_in_recovery)
{
/*
In a recovery, we want to set is_of_lsn to the LSN of the last
record executed by Recovery, not the current EOF of the log (which
is too new). Recovery does it by itself.
*/
share->state.is_of_lsn= translog_get_horizon();
}
res= _ma_state_info_write_sub(share->kfile.file, &share->state, pWrite);
if (pWrite & 4)
pthread_mutex_unlock(&share->intern_lock);
return res;
}
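The pWrite bitmap documented above can be pictured with a small decoder. This is only a sketch: the flag names and the `state_write_plan` struct are hypothetical (the source passes bare 1, 2 and 4), and nothing here touches a real index file.

```c
#include <assert.h>

/* Hypothetical names for the bits; the source uses the bare values. */
enum
{
  STATE_WRITE_USE_PWRITE= 1,   /* my_pwrite() instead of my_write() */
  STATE_WRITE_KEY_INFO=   2,   /* also write info about keys */
  STATE_WRITE_TAKE_LOCK=  4    /* take MARIA_SHARE::intern_lock */
};

/* Hypothetical helper struct: what one call is being asked to do. */
struct state_write_plan
{
  int use_pwrite;
  int write_key_info;
  int take_intern_lock;
};

static struct state_write_plan decode_pwrite(unsigned pWrite)
{
  struct state_write_plan plan;
  /* Assumption of this sketch: only the three documented bits exist. */
  assert((pWrite & ~7u) == 0);
  plan.use_pwrite=       (pWrite & STATE_WRITE_USE_PWRITE) != 0;
  plan.write_key_info=   (pWrite & STATE_WRITE_KEY_INFO) != 0;
  plan.take_intern_lock= (pWrite & STATE_WRITE_TAKE_LOCK) != 0;
  return plan;
}
```

For instance, the `(1 + 2)` passed by save_state() decodes to "use my_pwrite(), write key info, do not take the lock".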
/**
@brief Function to save and store the header in the index file (.MYI).
Shortcut to use instead of _ma_state_info_write() when appropriate.
@param file descriptor of the index file to write
@param state state information to write to the file
@param pWrite bitmap (determines the amount of information to
write, and if my_write() or my_pwrite() should be
used)
@param pWrite bitmap: if 1 is set my_pwrite() is used otherwise
my_write(); if 2 is set, info about keys is written
(should only be needed after ALTER TABLE
ENABLE/DISABLE KEYS, and REPAIR/OPTIMIZE).
@return Operation status
@retval 0 OK
@retval 1 Error
*/
uint _ma_state_info_write(File file, MARIA_STATE_INFO *state, uint pWrite)
uint _ma_state_info_write_sub(File file, MARIA_STATE_INFO *state, uint pWrite)
{
/** @todo RECOVERY write it only at checkpoint time */
uchar buff[MARIA_STATE_INFO_SIZE + MARIA_STATE_EXTRA_SIZE];
......@@ -994,10 +1042,11 @@ uint _ma_state_info_write(File file, MARIA_STATE_INFO *state, uint pWrite)
/* open_count must be first because of _ma_mark_file_changed ! */
mi_int2store(ptr,state->open_count); ptr+= 2;
/*
if you change the offset of this LSN inside the file, fix
ma_create + ma_rename + ma_delete_all + backward-compatibility.
if you change the offset of create_rename_lsn/is_of_lsn inside the file,
fix ma_create + ma_rename + ma_delete_all + backward-compatibility.
*/
lsn_store(ptr, state->create_rename_lsn); ptr+= LSN_STORE_SIZE;
lsn_store(ptr, state->is_of_lsn); ptr+= LSN_STORE_SIZE;
*ptr++= (uchar)state->changed;
*ptr++= state->sortkey;
mi_rowstore(ptr,state->state.records); ptr+= 8;
......@@ -1022,7 +1071,7 @@ uint _ma_state_info_write(File file, MARIA_STATE_INFO *state, uint pWrite)
{
mi_sizestore(ptr,state->key_root[i]); ptr+= 8;
}
/** @todo RECOVERY key_del is a problem for recovery */
/** @todo RECOVERY BUG key_del is a problem for recovery */
mi_sizestore(ptr,state->key_del); ptr+= 8;
if (pWrite & 2) /* From maria_chk */
{
......@@ -1060,6 +1109,7 @@ static uchar *_ma_state_info_read(uchar *ptr, MARIA_STATE_INFO *state)
state->open_count = mi_uint2korr(ptr); ptr+= 2;
state->create_rename_lsn= lsn_korr(ptr); ptr+= LSN_STORE_SIZE;
state->is_of_lsn= lsn_korr(ptr); ptr+= LSN_STORE_SIZE;
state->changed= (my_bool) *ptr++;
state->sortkey= (uint) *ptr++;
state->state.records= mi_rowkorr(ptr); ptr+= 8;
......@@ -1382,11 +1432,16 @@ int _ma_open_datafile(MARIA_HA *info, MARIA_SHARE *share,
int _ma_open_keyfile(MARIA_SHARE *share)
{
if ((share->kfile.file= my_open(share->unique_file_name,
/*
Modifications to share->kfile should be under intern_lock to protect
against a concurrent checkpoint.
*/
pthread_mutex_lock(&share->intern_lock);
share->kfile.file= my_open(share->unique_file_name,
share->mode | O_SHARE,
MYF(MY_WME))) < 0)
return 1;
return 0;
MYF(MY_WME));
pthread_mutex_unlock(&share->intern_lock);
return (share->kfile.file < 0);
}
......
......@@ -143,6 +143,9 @@ int maria_recover()
fprintf(trace_file, "SUCCESS\n");
fclose(trace_file);
}
// @todo set global_trid_generator from checkpoint or default value of 1/0,
// and also update it when seeing LOGREC_LONG_TRANSACTION_ID
// suggestion: add an arg to trnman_init
maria_in_recovery= FALSE;
DBUG_RETURN(res);
}
......@@ -224,7 +227,7 @@ int maria_apply_log(LSN from_lsn, my_bool apply, FILE *trace_file,
/*
we don't use maria_panic() because it would maria_end(), and Recovery does
not want that (we want to keep modules initialized for runtime).
not want that (we want to keep some modules initialized for runtime).
*/
if (close_all_tables())
goto err;
......@@ -333,6 +336,10 @@ static void new_transaction(uint16 sid, TrID long_id, LSN undo_lsn,
llbuf, sid);
all_active_trans[sid].undo_lsn= undo_lsn;
all_active_trans[sid].first_undo_lsn= first_undo_lsn;
// @todo set_if_bigger(global_trid_generator, long_id)
// indeed not only uncommitted transactions should bump generator,
// committed ones too (those not seen by undo phase so not
// into trnman_recreate)
}
......@@ -424,6 +431,9 @@ prototype_redo_exec_hook(REDO_CREATE_TABLE)
ptr+= 2;
/* set create_rename_lsn (for maria_read_log to be idempotent) */
lsn_store(ptr + sizeof(info->s->state.header) + 2, rec->lsn);
/* we also set is_of_lsn, like maria_create() does */
lsn_store(ptr + sizeof(info->s->state.header) + 2 + LSN_STORE_SIZE,
rec->lsn);
if (my_pwrite(kfile, ptr,
kfile_size_before_extension, 0, MYF(MY_NABP|MY_WME)) ||
my_chsize(kfile, keystart, 0, MYF(MY_WME)))
......@@ -843,11 +853,7 @@ prototype_redo_exec_hook(UNDO_ROW_INSERT)
if (info == NULL)
return 0;
set_undo_lsn_for_active_trans(rec->short_trid, rec->lsn);
/*
in an upcoming patch ("recovery of the state"), we introduce
state.is_of_lsn. For now, we just assume the state is old (true when we
recreate tables from scratch - but not idempotent).
*/
if (cmp_translog_addr(rec->lsn, info->s->state.is_of_lsn) > 0)
{
fprintf(tracef, " state older than record, updating rows' count\n");
info->s->state.state.records++;
......@@ -870,6 +876,7 @@ prototype_redo_exec_hook(UNDO_ROW_DELETE)
if (info == NULL)
return 0;
set_undo_lsn_for_active_trans(rec->short_trid, rec->lsn);
if (cmp_translog_addr(rec->lsn, info->s->state.is_of_lsn) > 0)
{
fprintf(tracef, " state older than record, updating rows' count\n");
info->s->state.state.records--;
......@@ -887,6 +894,7 @@ prototype_redo_exec_hook(UNDO_ROW_UPDATE)
if (info == NULL)
return 0;
set_undo_lsn_for_active_trans(rec->short_trid, rec->lsn);
if (cmp_translog_addr(rec->lsn, info->s->state.is_of_lsn) > 0)
{
info->s->state.changed|= STATE_CHANGED | STATE_NOT_ANALYZED |
STATE_NOT_OPTIMIZED_KEYS | STATE_NOT_SORTED_PAGES;
......@@ -902,6 +910,7 @@ prototype_redo_exec_hook(UNDO_ROW_PURGE)
return 0;
/* this is a bit broken, but this log record type will be deleted soon */
set_undo_lsn_for_active_trans(rec->short_trid, rec->lsn);
if (cmp_translog_addr(rec->lsn, info->s->state.is_of_lsn) > 0)
{
fprintf(tracef, " state older than record, updating rows' count\n");
info->s->state.state.records--;
......@@ -965,6 +974,7 @@ prototype_redo_exec_hook(CLR_END)
set_undo_lsn_for_active_trans(rec->short_trid, previous_undo_lsn);
fprintf(tracef, " CLR_END was about %s, undo_lsn now LSN (%lu,0x%lx)\n",
log_desc->name, LSN_IN_HEX(previous_undo_lsn));
if (cmp_translog_addr(rec->lsn, info->s->state.is_of_lsn) > 0)
{
fprintf(tracef, " state older than record, updating rows' count\n");
switch (undone_record_type) {
......@@ -1395,11 +1405,23 @@ static int run_undo_phase(uint unfinished)
}
static void prepare_table_for_close(MARIA_HA *info,
LSN at_lsn __attribute__ ((unused)))
/**
@brief re-enables transactionality, updates is_of_lsn
@param info table
@param at_lsn LSN to set is_of_lsn
*/
static void prepare_table_for_close(MARIA_HA *info, LSN at_lsn)
{
MARIA_SHARE *share= info->s;
/* we will soon use at_lsn here */
/*
State is now at least as new as the LSN of the current record. It may be
newer, in case we are seeing a LOGREC_FILE_ID which tells us to close a
table, but that table was later modified further in the log.
*/
if (cmp_translog_addr(share->state.is_of_lsn, at_lsn) < 0)
share->state.is_of_lsn= at_lsn;
_ma_reenable_logging_for_table(share);
}
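The effect of the is_of_lsn comparisons in the REDO hooks, together with the forward-only bump in prepare_table_for_close(), can be illustrated with a toy model. This is a minimal sketch, not Maria code: `lsn_t`, `log_rec`, `table_state` and `replay_log()` are hypothetical stand-ins, and LSNs are modelled as plain integers compared with `>` rather than with cmp_translog_addr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;                 /* hypothetical stand-in for LSN */

enum rec_type { REC_UNDO_ROW_INSERT, REC_UNDO_ROW_DELETE };

struct log_rec { lsn_t lsn; enum rec_type type; };

struct table_state
{
  uint64_t records;                     /* models state.state.records */
  lsn_t is_of_lsn;                      /* models state.is_of_lsn */
};

/* Replays a log against a table state, then "closes" the table. */
static void replay_log(struct table_state *st,
                       const struct log_rec *log, size_t n)
{
  lsn_t last_seen= st->is_of_lsn;
  size_t i;
  for (i= 0; i < n; i++)
  {
    /* Assumption of this sketch: the log is ordered by LSN. */
    assert(i == 0 || log[i - 1].lsn < log[i].lsn);
    /* Only touch the count if the state is older than the record. */
    if (log[i].lsn > st->is_of_lsn)
    {
      if (log[i].type == REC_UNDO_ROW_INSERT)
        st->records++;
      else
        st->records--;
    }
    if (log[i].lsn > last_seen)
      last_seen= log[i].lsn;
  }
  /* As in prepare_table_for_close(): is_of_lsn only moves forward. */
  if (st->is_of_lsn < last_seen)
    st->is_of_lsn= last_seen;
}
```

Calling replay_log() twice with the same log leaves `records` unchanged the second time, because every record's LSN is then no greater than is_of_lsn. Without the comparison each replay would add the log's net change again, which is the kind of growing "Data records" count that the removed lines of the expected test output show.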
......@@ -1637,11 +1659,18 @@ static int close_all_tables()
if (maria_open_list == NULL)
goto end;
fprintf(tracef, "Closing all tables\n");
/*
Since the end of end_of_redo_phase(), we may have written new records
(if UNDO phase ran) and thus the state is newer than at
end_of_redo_phase(), we need to bump is_of_lsn again.
*/
LSN addr= translog_get_horizon();
for (list_element= maria_open_list ; list_element ; list_element= next_open)
{
next_open= list_element->next;
info= (MARIA_HA*)list_element->data;
pthread_mutex_unlock(&THR_LOCK_maria); /* ok, UNDO phase not online yet */
prepare_table_for_close(info, addr);
error|= maria_close(info);
pthread_mutex_lock(&THR_LOCK_maria);
}
......@@ -1650,6 +1679,100 @@ static int close_all_tables()
return error;
}
#ifdef MARIA_EXTERNAL_LOCKING
#error Maria's Recovery is really not ready for it
#endif
/*
Recovery of the state : how it works
=====================================
Ignoring Checkpoints for a start.
The state (MARIA_HA::MARIA_SHARE::MARIA_STATE_INFO) is updated in
memory frequently (at least at every row write/update/delete) but goes
to disk at few moments: maria_close() when closing the last open
instance, and a few rare places like CHECK/REPAIR/ALTER
(non-transactional tables also do it at maria_lock_database() but we
needn't cover them here).
In case of crash, state on disk is likely to be older than what it was
in memory, the REDO phase needs to recreate the state as it was in
memory at the time of crash. When we say Recovery here we will always
mean "REDO phase".
For example MARIA_STATUS_INFO::records (count of records). It is updated at
the end of every row write/update/delete/delete_all. When Recovery sees the
sign of such row operation (UNDO or REDO), it may need to update the records'
count if that count does not reflect that operation (is older). How to know
the age of the state compared to the log record: every time the state
goes to disk at runtime, its member "is_of_lsn" is updated to the
current end-of-log LSN. So Recovery just needs to compare is_of_lsn
and the record's LSN to know if it should modify "records".
Other operations like ALTER TABLE DISABLE KEYS update the state but
don't write log records, thus the REDO phase cannot repeat their
effect on the state in case of crash. But we make them sync the state
as soon as they have finished. This reduces the window for a problem.
It looks like only one thread at a time updates the state in memory or
on disk. However there is not 100% certainty when it comes to
HA_EXTRA_(FORCE_REOPEN|PREPARE_FOR_RENAME): can they read the state
from memory while some other thread is updating "records" in memory?
If yes, they may write a corrupted state to disk.
We assume that no for now: ASK_MONTY.
With checkpoints
================
Checkpoint module needs to read the state in memory and write it to
disk. This may happen while some other thread is modifying the state
in memory or on disk. Checkpoint thus may be reading changing data, it
needs a mutex to not have it corrupted, and concurrent modifiers of
the state need that mutex too for the same reason.
"records" is modified for every row write/update/delete, we don't want
to add a mutex lock/unlock there. So we re-use the mutex lock/unlock
which is already present in these moments, namely the log's mutex which is
taken when UNDO_ROW_INSERT|UPDATE|DELETE is written: we update "records" in
under-log-mutex hooks when writing these records (thus "records" is
not updated at the end of maria_write/update/delete() anymore).
Thus Checkpoint takes the log's lock and can read "records" from
memory and write it to disk and release log's lock.
We however want to avoid having the disk write under the log's
lock. So it has to be under another mutex, natural choice is
intern_lock (as Checkpoint needs it anyway to read MARIA_SHARE::kfile,
and as maria_close() takes it too). All state writes to disk are
changed to be protected with intern_lock.
So Checkpoint takes intern_lock, log's lock, reads "records" from
memory, releases log's lock, updates is_of_lsn and writes "records" to
disk, releases intern_lock.
In practice, not only "records" needs to be written but the full
state. So, Checkpoint reads the full state from memory. Some other
thread may at this moment be modifying in memory some pieces of the
state which are not protected by the log's lock (see ma_extra.c
HA_EXTRA_NO_KEYS), and Checkpoint would be reading a corrupted state
from memory; to guard against that we extend the intern_lock-zone to
changes done to the state in memory by HA_EXTRA_NO_KEYS et al, and
also any change made in memory to create_rename_lsn/is_of_lsn.
Last, we don't want in Checkpoint to do
log lock; read state from memory; release log lock;
for each table, it may hold the log's lock too much in total.
So, we instead do
log lock; read N states from memory; release log lock;
Thus, the sequence above happens outside of any intern_lock.
But this re-introduces the problem that some other thread may be changing the
state in memory and on disk under intern_lock, without log's lock, like
HA_EXTRA_NO_KEYS, while we read the N states. However, when Checkpoint later
comes to handling the table under intern_lock, which is serialized with
HA_EXTRA_NO_KEYS, it can see that is_of_lsn is higher than when the state was
read from memory under log's lock, and thus can decide to not flush the
obsolete state it has, knowing that the other thread flushed a more recent
state already. If on the other hand is_of_lsn is not higher, the read state is
current and can be flushed. So we have a per-table sequence:
lock intern_lock; test if is_of_lsn is higher than when we read the state
under log's lock; if no then flush the read state to disk.
*/
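The two-phase Checkpoint sequence described in the comment above could look roughly like the following. This is a sketch under stated assumptions, not the WL#3071 implementation (which this commit only prepares for): `struct share`, `on_disk`, `log_lock`, `checkpoint_collect()` and `checkpoint_flush_one()` are hypothetical names, the "disk" is a struct member, and stamping the flushed state with the log horizon captured while the states were read is an assumption of the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;                 /* hypothetical stand-in for LSN */

struct state { uint64_t records; lsn_t is_of_lsn; };

struct share
{
  pthread_mutex_t intern_lock;
  struct state state;                   /* state in memory */
  struct state on_disk;                 /* stands in for the index header */
};

static pthread_mutex_t log_lock= PTHREAD_MUTEX_INITIALIZER;

/* Phase 1: one log lock/unlock to copy N states from memory. */
static void checkpoint_collect(struct share **tables,
                               struct state *copies, size_t n)
{
  size_t i;
  assert(n == 0 || (tables != NULL && copies != NULL));
  pthread_mutex_lock(&log_lock);
  for (i= 0; i < n; i++)
    copies[i]= tables[i]->state;
  pthread_mutex_unlock(&log_lock);
}

/*
  Phase 2, per table: flush the copy unless another thread has already
  written a more recent state. Returns 1 if flushed, 0 if skipped.
*/
static int checkpoint_flush_one(struct share *share,
                                const struct state *copy,
                                lsn_t read_horizon)
{
  int flushed= 0;
  pthread_mutex_lock(&share->intern_lock);
  if (share->state.is_of_lsn <= copy->is_of_lsn)
  {
    share->on_disk= *copy;
    /* Assumption: the copy is as of the horizon seen in phase 1. */
    share->on_disk.is_of_lsn= read_horizon;
    share->state.is_of_lsn= read_horizon;
    flushed= 1;
  }
  pthread_mutex_unlock(&share->intern_lock);
  return flushed;
}
```

The point of the split is that the log's lock is taken once for all tables and is never held across a disk write. The price is the is_of_lsn test in phase 2.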
/* some comments and pseudo-code which we keep for later */
#if 0
/*
......
......@@ -66,6 +66,7 @@ int maria_rename(const char *old_name, const char *new_name)
!maria_in_recovery) ? MY_SYNC_DIR : 0;
if (sync_dir)
{
LSN lsn;
uchar log_data[2 + 2];
LEX_STRING log_array[TRANSLOG_INTERNAL_PARTS + 3];
uint old_name_len= strlen(old_name), new_name_len= strlen(new_name);
......@@ -85,13 +86,12 @@ int maria_rename(const char *old_name, const char *new_name)
under THR_LOCK_maria or not...), how to use it in Recovery.
For now it can serve to apply logs to a backup so we sync it.
*/
if (unlikely(translog_write_record(&share->state.create_rename_lsn,
LOGREC_REDO_RENAME_TABLE,
if (unlikely(translog_write_record(&lsn, LOGREC_REDO_RENAME_TABLE,
&dummy_transaction_object, NULL,
2 + 2 + old_name_len + new_name_len,
sizeof(log_array)/sizeof(log_array[0]),
log_array, NULL) ||
translog_flush(share->state.create_rename_lsn)))
translog_flush(lsn)))
{
maria_close(info);
DBUG_RETURN(1);
......@@ -100,7 +100,7 @@ int maria_rename(const char *old_name, const char *new_name)
store LSN into file, needed for Recovery to not be confused if a
RENAME happened (applying REDOs to the wrong table).
*/
if (_ma_update_create_rename_lsn_on_disk(share, TRUE))
if (_ma_update_create_rename_lsn_on_disk(share, lsn, TRUE))
{
maria_close(info);
DBUG_RETURN(1);
......
......@@ -75,7 +75,7 @@ int main(int argc,char *argv[])
if (maria_init() ||
(init_pagecache(maria_pagecache, IO_SIZE*16, 0, 0,
maria_block_size) == 0) ||
ma_control_file_create_or_open(TRUE) ||
ma_control_file_create_or_open() ||
(init_pagecache(maria_log_pagecache,
TRANSLOG_PAGECACHE_SIZE, 0, 0,
TRANSLOG_PAGE_SIZE) == 0) ||
......
......@@ -89,7 +89,7 @@ int main(int argc, char *argv[])
if (maria_init() ||
(init_pagecache(maria_pagecache, pagecache_size, 0, 0,
maria_block_size) == 0) ||
ma_control_file_create_or_open(TRUE) ||
ma_control_file_create_or_open() ||
(init_pagecache(maria_log_pagecache,
TRANSLOG_PAGECACHE_SIZE, 0, 0,
TRANSLOG_PAGE_SIZE) == 0) ||
......
......@@ -100,7 +100,7 @@ set -- "ma_test1 $silent -M -T -c" "ma_test2 $silent -L -K -W -P -M -T -c" "ma_t
while [ $# != 0 ]
do
prog=$1
rm maria_log.* maria_log_control
rm -f maria_log.* maria_log_control
echo "TEST WITH $prog"
$maria_path/$prog
# derive table's name from program's name
......@@ -138,7 +138,7 @@ do
prog=$1
commit_run_args=$2
abort_run_args=$3;
rm maria_log.* maria_log_control
rm -f maria_log.* maria_log_control
echo "TEST WITH $prog $commit_run_args (commit at end)"
$maria_path/$prog $commit_run_args
# derive table's name from program's name
......@@ -193,7 +193,7 @@ done
done
rm -f $table.* $tmp/$table* $tmp/maria_chk_*.txt $tmp/maria_read_log_$table.txt
) > $tmp/ma_test_recovery.output
) 2>&1 > $tmp/ma_test_recovery.output
diff $maria_path/ma_test_recovery.expected $tmp/ma_test_recovery.output > /dev/null || diff_failed=1
if [ "$diff_failed" == "1" ]
......
......@@ -21,12 +21,10 @@ testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
7,8c7,8
7c7
< Checksum: 3757530372
< Data records: 15 Deleted blocks: 0
---
> Checksum: 0
> Data records: 30 Deleted blocks: 0
11c11
< Datafile length: 16384 Keyfile length: 16384
---
......@@ -41,7 +39,7 @@ applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
11c11
< Datafile length: 90112 Keyfile length: 212992
< Datafile length: 90112 Keyfile length: 204800
---
> Datafile length: 90112 Keyfile length: 8192
========DIFF END=======
......@@ -50,7 +48,7 @@ applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
11c11
< Datafile length: 90112 Keyfile length: 212992
< Datafile length: 90112 Keyfile length: 204800
---
> Datafile length: 90112 Keyfile length: 8192
========DIFF END=======
......@@ -97,12 +95,10 @@ testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
7,8c7,8
7c7
< Checksum: 221293111
< Data records: 25 Deleted blocks: 0
---
> Checksum: 0
> Data records: 50 Deleted blocks: 0
11c11
< Datafile length: 16384 Keyfile length: 16384
---
......@@ -134,22 +130,8 @@ TEST WITH ma_test1 -s -M -T -c -N --debug=d:t:i:o,/tmp/ma_test1.trace --testfla
terminating after deletes
Dying on request without maria_commit()/maria_close()
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 54 Deleted blocks: 0
========DIFF END=======
testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 81 Deleted blocks: 0
========DIFF END=======
testing applying of CLRs to recreate table
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
......@@ -191,12 +173,10 @@ testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
7,8c7,8
7c7
< Checksum: 221293111
< Data records: 25 Deleted blocks: 0
---
> Checksum: 0
> Data records: 50 Deleted blocks: 0
11c11
< Datafile length: 16384 Keyfile length: 16384
---
......@@ -228,22 +208,8 @@ TEST WITH ma_test1 -s -M -T -c -N --debug=d:t:i:o,/tmp/ma_test1.trace --testfla
terminating after deletes
Dying on request without maria_commit()/maria_close()
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 54 Deleted blocks: 0
========DIFF END=======
testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 81 Deleted blocks: 0
========DIFF END=======
testing applying of CLRs to recreate table
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
......@@ -285,12 +251,10 @@ testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
7,8c7,8
7c7
< Checksum: 221293111
< Data records: 25 Deleted blocks: 0
---
> Checksum: 0
> Data records: 50 Deleted blocks: 0
11c11
< Datafile length: 16384 Keyfile length: 16384
---
......@@ -322,22 +286,8 @@ TEST WITH ma_test1 -s -M -T -c -N --debug=d:t:i:o,/tmp/ma_test1.trace --testfla
terminating after deletes
Dying on request without maria_commit()/maria_close()
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 54 Deleted blocks: 0
========DIFF END=======
testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 81 Deleted blocks: 0
========DIFF END=======
testing applying of CLRs to recreate table
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
......@@ -379,12 +329,10 @@ testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
7,8c7,8
7c7
< Checksum: 411409161
< Data records: 25 Deleted blocks: 0
---
> Checksum: 0
> Data records: 50 Deleted blocks: 0
11c11
< Datafile length: 49152 Keyfile length: 16384
---
......@@ -416,22 +364,8 @@ TEST WITH ma_test1 -s -M -T -c -N --debug=d:t:i:o,/tmp/ma_test1.trace -b --testf
terminating after deletes
Dying on request without maria_commit()/maria_close()
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 54 Deleted blocks: 0
========DIFF END=======
testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 81 Deleted blocks: 0
========DIFF END=======
testing applying of CLRs to recreate table
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
......@@ -473,12 +407,10 @@ testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
7,8c7,8
7c7
< Checksum: 411409161
< Data records: 25 Deleted blocks: 0
---
> Checksum: 0
> Data records: 50 Deleted blocks: 0
11c11
< Datafile length: 49152 Keyfile length: 16384
---
......@@ -510,22 +442,8 @@ TEST WITH ma_test1 -s -M -T -c -N --debug=d:t:i:o,/tmp/ma_test1.trace -b --testf
terminating after deletes
Dying on request without maria_commit()/maria_close()
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 54 Deleted blocks: 0
========DIFF END=======
testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 81 Deleted blocks: 0
========DIFF END=======
testing applying of CLRs to recreate table
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
......@@ -567,12 +485,10 @@ testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
7,8c7,8
7c7
< Checksum: 411409161
< Data records: 25 Deleted blocks: 0
---
> Checksum: 0
> Data records: 50 Deleted blocks: 0
11c11
< Datafile length: 49152 Keyfile length: 16384
---
......@@ -604,22 +520,8 @@ TEST WITH ma_test1 -s -M -T -c -N --debug=d:t:i:o,/tmp/ma_test1.trace -b --testf
terminating after deletes
Dying on request without maria_commit()/maria_close()
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 54 Deleted blocks: 0
========DIFF END=======
testing idempotency
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
========DIFF START=======
8c8
< Data records: 27 Deleted blocks: 0
---
> Data records: 81 Deleted blocks: 0
========DIFF END=======
testing applying of CLRs to recreate table
applying log
Differences in maria_chk -dvv, recovery not yet perfect !
......
......@@ -162,12 +162,20 @@ int maria_write(MARIA_HA *info, uchar *record)
rw_unlock(&share->key_root_lock[i]);
}
}
/**
@todo RECOVERY BUG
this += must happen under log's mutex when writing the UNDO
*/
if (share->calc_write_checksum)
info->cur_row.checksum= (*share->calc_write_checksum)(info,record);
if (filepos != HA_OFFSET_ERROR)
{
if ((*share->write_record)(info,record))
goto err;
/**
@todo when we enable multiple writers, we will have to protect
'records' and 'checksum' somehow.
*/
info->state->checksum+= info->cur_row.checksum;
}
if (share->base.auto_key)
......@@ -175,7 +183,7 @@ int maria_write(MARIA_HA *info, uchar *record)
ma_retrieve_auto_increment(info, record));
info->update= (HA_STATE_CHANGED | HA_STATE_AKTIV | HA_STATE_WRITTEN |
HA_STATE_ROW_CHANGED);
info->state->records++;
info->state->records+= !share->now_transactional; /*otherwise already done*/
info->cur_row.lastpos= filepos;
VOID(_ma_writeinfo(info, WRITEINFO_UPDATE_KEYFILE));
if (info->invalidator != 0)
......
......@@ -1035,7 +1035,8 @@ static int maria_chk(HA_CHECK *param, char *filename)
that it will have to find and store it.
*/
if (share->base.born_transactional)
share->state.create_rename_lsn= LSN_REPAIRED_BY_MARIA_CHK;
share->state.create_rename_lsn= share->state.is_of_lsn=
LSN_REPAIRED_BY_MARIA_CHK;
if ((param->testflag & (T_REP_BY_SORT | T_REP_PARALLEL)) &&
(maria_is_any_key_active(share->state.key_map) ||
(rep_quick && !param->keys_in_use && !recreate)) &&
......
......@@ -95,6 +95,7 @@ typedef struct st_maria_state_info
uint open_count;
uint8 changed; /* Changed since mariachk */
LSN create_rename_lsn; /**< LSN when table was last created/renamed */
LSN is_of_lsn; /**< LSN when state was last updated on disk */
/* the following isn't saved on disk */
uint state_diff_length; /* Should be 0 */
......@@ -104,7 +105,7 @@ typedef struct st_maria_state_info
#define MARIA_STATE_INFO_SIZE \
(24 + LSN_STORE_SIZE + 4 + 11*8 + 4*4 + 8 + 3*4 + 5*8)
(24 + LSN_STORE_SIZE*2 + 4 + 11*8 + 4*4 + 8 + 3*4 + 5*8)
#define MARIA_STATE_KEY_SIZE 8
#define MARIA_STATE_KEYBLOCK_SIZE 8
#define MARIA_STATE_KEYSEG_SIZE 4
......@@ -214,6 +215,8 @@ typedef struct st_maria_file_bitmap
ulong pages_covered; /* Pages covered by bitmap + 1 */
} MARIA_FILE_BITMAP;
#define MARIA_CHECKPOINT_LOOKS_AT_ME 1
#define MARIA_CHECKPOINT_SHOULD_FREE_ME 2
typedef struct st_maria_share
{ /* Shared between opens */
......@@ -300,6 +303,7 @@ typedef struct st_maria_share
myf write_flag;
enum data_file_type data_file_type;
enum pagecache_page_type page_type; /* value depending transactional */
uint8 in_checkpoint; /**< if Checkpoint looking at table */
my_bool temporary;
/* Below flag is needed to make log tables work with concurrent insert */
my_bool is_log_table;
......@@ -864,7 +868,8 @@ extern uint _ma_nommap_pread(MARIA_HA *info, uchar *Buffer,
extern uint _ma_nommap_pwrite(MARIA_HA *info, uchar *Buffer,
uint Count, my_off_t offset, myf MyFlags);
uint _ma_state_info_write(File file, MARIA_STATE_INFO *state, uint pWrite);
uint _ma_state_info_write(MARIA_SHARE *share, uint pWrite);
uint _ma_state_info_write_sub(File file, MARIA_STATE_INFO *state, uint pWrite);
uint _ma_state_info_read_dsk(File file, MARIA_STATE_INFO *state);
uint _ma_base_info_write(File file, MARIA_BASE_INFO *base);
int _ma_keyseg_write(File file, const HA_KEYSEG *keyseg);
......@@ -933,7 +938,10 @@ int _ma_create_index_by_sort(MARIA_SORT_PARAM *info, my_bool no_messages,
ulong);
int _ma_sync_table_files(const MARIA_HA *info);
int _ma_initialize_data_file(MARIA_SHARE *share, File dfile);
int _ma_update_create_rename_lsn_on_disk(MARIA_SHARE *share, my_bool do_sync);
int _ma_update_create_rename_lsn_on_disk(MARIA_SHARE *share,
LSN lsn, my_bool do_sync);
int _ma_update_create_rename_lsn_on_disk_sub(MARIA_SHARE *share,
LSN lsn, my_bool do_sync);
void _ma_unpin_all_pages(MARIA_HA *info, LSN undo_lsn);
#define _ma_tmp_disable_logging_for_table(S) \
......
......@@ -3000,7 +3000,8 @@ static int save_state(MARIA_HA *isam_file,PACK_MRG_INFO *mrg,
VOID(my_chsize(share->kfile.file, share->base.keystart, 0, MYF(0)));
if (share->base.keys)
isamchk_neaded=1;
DBUG_RETURN(_ma_state_info_write(share->kfile.file, &share->state, (1 + 2)));
DBUG_RETURN(_ma_state_info_write_sub(share->kfile.file,
&share->state, (1 + 2)));
}
......@@ -3033,7 +3034,7 @@ static int save_state_mrg(File file,PACK_MRG_INFO *mrg,my_off_t new_length,
if (isam_file->s->base.keys)
isamchk_neaded=1;
state.changed=STATE_CHANGED | STATE_NOT_ANALYZED; /* Force check of table */
DBUG_RETURN (_ma_state_info_write(file,&state,1+2));
DBUG_RETURN (_ma_state_info_write_sub(file,&state,1+2));
}
......