    Bug#12704861 Corruption after a crash during BLOB update · d259243a
    Marko Mäkelä authored
    The fix of Bug#12612184 broke crash recovery. When a record that
    contains off-page columns (BLOBs) is updated, we must first write redo
    log about the BLOB page writes, and only after that write the redo log
    about the B-tree changes. The buggy fix would log the B-tree changes
    first, meaning that after recovery, we could end up having a record
    that contains a null BLOB pointer.
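
    The ordering argument above can be illustrated with a toy model. This
    is only a sketch, not InnoDB code: the names toy_log_type_t,
    toy_prefix_is_consistent() and toy_order_is_crash_safe() are
    hypothetical, and the model assumes that a crash may preserve any
    prefix of the redo log and that a record is broken when the B-tree
    change has been applied without the BLOB write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical redo log record types for this toy model. */
typedef enum { TOY_LOG_BLOB_WRITE, TOY_LOG_BTREE_UPDATE } toy_log_type_t;

/* Replay the first n_applied log records and report whether the result
is consistent: the record may refer to the BLOB only if the BLOB pages
have been written. */
static bool toy_prefix_is_consistent(const toy_log_type_t *log,
                                     size_t n_applied)
{
    bool blob_written = false;
    bool record_updated = false;

    assert(log != NULL);

    for (size_t i = 0; i < n_applied; i++) {
        if (log[i] == TOY_LOG_BLOB_WRITE) {
            blob_written = true;
        } else {
            record_updated = true;
        }
    }

    return !record_updated || blob_written;
}

/* A log order is crash-safe in this model if every prefix of the log,
including the empty one and the full one, replays to a consistent state. */
static bool toy_order_is_crash_safe(const toy_log_type_t *log, size_t n)
{
    for (size_t k = 0; k <= n; k++) {
        if (!toy_prefix_is_consistent(log, k)) {
            return false;
        }
    }
    return true;
}
```

    In this model, the order {BLOB write, B-tree update} is crash-safe,
    while the reverse order has a one-record prefix that leaves the
    record without its BLOB.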
    
    Because we will be redo logging the writes of the off-page columns
    before the B-tree changes, we must make sure that the pages chosen for
    the off-page columns are free both before and after the B-tree
    changes. In this way, the worst thing that can happen in crash
    recovery is that the BLOBs are written to free pages, but the B-tree
    changes are not applied. The BLOB pages would correctly remain free in
    this case. To achieve this, we must allocate the BLOB pages in the
    mini-transaction of the B-tree operation. A further quirk is that BLOB
    pages are allocated from the same file segment as leaf pages. Because
    of this, we must temporarily "hide" any leaf pages that were freed
    during the B-tree operation by "fake allocating" them prior to writing
    the BLOBs, and freeing them again before the mtr_commit() of the
    B-tree operation, in btr_mark_freed_leaves().
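
    The "hide, allocate, unhide" sequence described above can be sketched
    with a toy allocator. This is a simplified illustration under stated
    assumptions, not the actual implementation: all toy_* names are
    hypothetical, the segment is a plain array of free flags, and the
    allocator simply picks the lowest-numbered free page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SEG_PAGES 8
#define TOY_NO_PAGE   (-1)

/* Toy file segment: is_free[i] tells whether page i may be allocated. */
typedef struct {
    bool is_free[TOY_SEG_PAGES];
} toy_seg_t;

/* Toy mini-transaction: remembers the leaf pages that it has freed. */
typedef struct {
    int    freed_leaf[TOY_SEG_PAGES];
    size_t n_freed;
} toy_mtr_t;

/* Free a leaf page and remember it in the mini-transaction. */
static void toy_page_free(toy_seg_t *seg, toy_mtr_t *mtr, int page)
{
    assert(!seg->is_free[page]);
    seg->is_free[page] = true;
    mtr->freed_leaf[mtr->n_freed++] = page;
}

/* nonfree=true: "fake allocate" (hide) the freed leaves;
nonfree=false: mark them free again before the commit. */
static void toy_mark_freed_leaves(toy_seg_t *seg, const toy_mtr_t *mtr,
                                  bool nonfree)
{
    for (size_t i = 0; i < mtr->n_freed; i++) {
        seg->is_free[mtr->freed_leaf[i]] = !nonfree;
    }
}

/* Allocate the lowest-numbered free page, or return TOY_NO_PAGE. */
static int toy_alloc_page(toy_seg_t *seg)
{
    for (int i = 0; i < TOY_SEG_PAGES; i++) {
        if (seg->is_free[i]) {
            seg->is_free[i] = false;
            return i;
        }
    }
    return TOY_NO_PAGE;
}

/* Pages 0..3 start in use and pages 4..7 start free. The update frees
leaf page 2 and then stores one BLOB page. Returns the BLOB page number. */
static int toy_update_with_blob(toy_seg_t *seg, toy_mtr_t *mtr, bool hide)
{
    for (int i = 0; i < TOY_SEG_PAGES; i++) {
        seg->is_free[i] = (i >= 4);
    }
    mtr->n_freed = 0;

    toy_page_free(seg, mtr, 2);

    if (hide) {
        toy_mark_freed_leaves(seg, mtr, true);
    }
    int blob_page = toy_alloc_page(seg);
    if (hide) {
        toy_mark_freed_leaves(seg, mtr, false);
    }
    return blob_page;
}
```

    With hide=true the BLOB lands on page 4 and page 2 is free again
    afterwards. With hide=false the BLOB reuses the freed leaf page 2,
    which is the overwrite that this fix is meant to prevent.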
    
    btr_cur_mtr_commit_and_start(): Remove this faulty function that was
    introduced in the Bug#12612184 fix. The problem that this function was
    trying to address was that when the BLOB writes were committed by
    mtr_commit() before the mtr_commit() of the update, the new BLOB
    pages could have
    overwritten clustered index B-tree leaf pages that were freed during
    the update. If recovery applied the redo log of the BLOB writes but
    did not see the log of the record update, the index tree would be
    corrupted. The correct solution is to make the freed clustered index
    pages unavailable to the BLOB allocation. This function is also a
    likely culprit of InnoDB hangs that were observed when testing the
    Bug#12612184 fix.
    
    btr_mark_freed_leaves(): Mark all freed clustered index leaf pages of
    a mini-transaction allocated (nonfree=TRUE) before storing the BLOBs,
    or freed (nonfree=FALSE) before committing the mini-transaction.
    
    btr_freed_leaves_validate(): A debug function for checking that all
    clustered index leaf pages that have been marked free in the
    mini-transaction are consistent (have not been zeroed out).
    
    btr_page_alloc_low(): Refactored from btr_page_alloc(). Return the
    number of the allocated page, or FIL_NULL if out of space. Add the
    parameter "mtr_t* init_mtr" for specifying the mini-transaction where
    the page should be initialized, or NULL if this is a "fake
    allocation" by btr_mark_freed_leaves(nonfree=TRUE).
    
    btr_page_alloc(): Add the parameter init_mtr, allowing the page to be
    initialized and X-latched in a different mini-transaction than the one
    that is used for the allocation. Invoke btr_page_alloc_low(). If a
    clustered index leaf page was previously freed in mtr, remove it from
    the memo of previously freed pages.
    
    btr_page_free(): Assert that the page is a B-tree page and that it has
    been X-latched by the mini-transaction. If the freed page was a leaf page
    of a clustered index, link it by a MTR_MEMO_FREE_CLUST_LEAF marker to
    the mini-transaction.
    
    btr_store_big_rec_extern_fields_func(): Add the parameter alloc_mtr,
    which is NULL in inserts (the old behaviour) and the same as local_mtr in
    updates. If alloc_mtr!=NULL, the BLOB pages will be allocated from it
    instead of the mini-transaction that is used for writing the BLOBs.
    
    fsp_alloc_from_free_frag(): Refactored from
    fsp_alloc_free_page(). Allocate the specified page from a partially
    free extent.
    
    fseg_alloc_free_page_low(), fseg_alloc_free_page_general(): Add the
    parameter "mtr_t* init_mtr" for specifying the mini-transaction where
    the page should be initialized, or NULL if this is a "fake allocation"
    that prevents the reuse of a previously freed B-tree page for BLOB
    storage. If init_mtr==NULL, try harder to reallocate the specified page
    and assert that it succeeded.
    
    fsp_alloc_free_page(): Add the parameter "mtr_t* init_mtr" for
    specifying the mini-transaction where the page should be initialized.
    Do not allow init_mtr == NULL, because this function is never to be
    used for "fake allocations".
    
    mtr_t: Add the operation MTR_MEMO_FREE_CLUST_LEAF and the flag
    mtr->freed_clust_leaf for quickly determining if any
    MTR_MEMO_FREE_CLUST_LEAF operations have been posted.
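
    The purpose of the flag can be sketched as follows. This is a toy
    illustration, not the real mtr_t: the struct layout, the fixed-size
    memo array and all toy_* names are assumptions. The point is only
    that a flag set when the marker is posted lets callers skip the memo
    scan in the common case where no clustered index leaf was freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MEMO_MAX 16

/* Hypothetical memo slot types for this toy model. */
typedef enum {
    TOY_MEMO_PAGE_X_FIX,
    TOY_MEMO_FREE_CLUST_LEAF
} toy_memo_type_t;

typedef struct {
    toy_memo_type_t type;
    int             page_no;
} toy_memo_slot_t;

/* Toy mini-transaction with a memo and a summary flag. */
typedef struct {
    toy_memo_slot_t memo[TOY_MEMO_MAX];
    size_t          n_memo;
    bool            freed_clust_leaf; /* any FREE_CLUST_LEAF posted? */
} toy_memo_mtr_t;

static void toy_memo_mtr_start(toy_memo_mtr_t *mtr)
{
    mtr->n_memo = 0;
    mtr->freed_clust_leaf = false;
}

/* Post a memo slot and keep the summary flag up to date. */
static void toy_memo_push(toy_memo_mtr_t *mtr, toy_memo_type_t type,
                          int page_no)
{
    assert(mtr->n_memo < TOY_MEMO_MAX);
    mtr->memo[mtr->n_memo].type = type;
    mtr->memo[mtr->n_memo].page_no = page_no;
    mtr->n_memo++;

    if (type == TOY_MEMO_FREE_CLUST_LEAF) {
        mtr->freed_clust_leaf = true;
    }
}

/* Count the freed clustered index leaves. The flag check makes the
common case (nothing freed) return without scanning the memo. */
static size_t toy_count_freed_clust_leaves(const toy_memo_mtr_t *mtr)
{
    size_t n = 0;

    if (!mtr->freed_clust_leaf) {
        return 0;
    }
    for (size_t i = 0; i < mtr->n_memo; i++) {
        if (mtr->memo[i].type == TOY_MEMO_FREE_CLUST_LEAF) {
            n++;
        }
    }
    return n;
}
```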
    
    row_ins_index_entry_low(): When columns are being made off-page in
    insert-by-update, invoke btr_mark_freed_leaves(nonfree=TRUE) and pass
    the mini-transaction as the alloc_mtr to
    btr_store_big_rec_extern_fields(). Finally, invoke
    btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
    
    row_build(): Correct a comment, and add a debug assertion that a
    record that contains NULL BLOB pointers must be a fresh insert.
    
    row_upd_clust_rec(): When columns are being moved off-page, invoke
    btr_mark_freed_leaves(nonfree=TRUE) and pass the mini-transaction as
    the alloc_mtr to btr_store_big_rec_extern_fields(). Finally, invoke
    btr_mark_freed_leaves(nonfree=FALSE) to avoid leaking pages.
    
    buf_reset_check_index_page_at_flush(): Remove. The function
    fsp_init_file_page_low() already sets
    bpage->check_index_page_at_flush=FALSE.
    
    There is a known issue in tablespace extension. If the request to
    allocate a BLOB page leads to the tablespace being extended, crash
    recovery could see BLOB writes to pages that are outside the tablespace
    file bounds. This should trigger an assertion failure in fil_io() at
    crash recovery. The safe thing would be to write redo log about the
    tablespace extension to the mini-transaction of the BLOB write, not to
    the mini-transaction of the record update. However, there is no redo
    log record for file extension in the current redo log format.
    
    rb:693 approved by Sunny Bains