    MDEV-14425 Improve the redo log for concurrency · 685d958e
    Marko Mäkelä authored
    The InnoDB redo log used to be formatted in blocks of 512 bytes.
    The log blocks were encrypted and the checksum was calculated while
    holding log_sys.mutex, creating a serious scalability bottleneck.
    
    We remove the fixed-size redo log block structure altogether and
    essentially turn every mini-transaction into a log block of its own.
    This allows encryption and checksum calculations to be performed
    on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
    The mutex only protects a memcpy() of the data to the shared
    log_sys.buf, as well as the padding of the log, in case the
    to-be-written part of the log would not end at a block boundary of
    the underlying storage. For now, the "padding" consists of writing
    a single NUL byte, to allow recovery and mariadb-backup to detect
    the end of the circular log faster.
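
    The division of work described above can be sketched as follows. This
    is only an illustration, not the InnoDB code: SharedLog, toy_checksum()
    and commit_local() are hypothetical names, and the placeholder checksum
    is not CRC-32C. The point is that the per-record computation runs on a
    private buffer, and the mutex covers nothing but the copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the shared redo log buffer (log_sys.buf).
struct SharedLog {
  std::mutex mutex;               // plays the role of log_sys.mutex
  std::vector<unsigned char> buf; // plays the role of log_sys.buf
  size_t used = 0;

  explicit SharedLog(size_t size) : buf(size) {}

  // Append an already finalized record; only the copy is serialized.
  size_t append(const unsigned char *data, size_t len) {
    std::lock_guard<std::mutex> guard(mutex);
    assert(used + len <= buf.size());
    std::memcpy(buf.data() + used, data, len);
    const size_t start = used;
    used += len;
    return start;
  }
};

// Placeholder checksum (NOT CRC-32C); stands for work done outside the mutex.
inline uint32_t toy_checksum(const unsigned char *data, size_t len) {
  uint32_t sum = 0;
  for (size_t i = 0; i < len; i++) sum = sum * 31 + data[i];
  return sum;
}

// Finalize a private buffer (append a 4-byte checksum), then publish it.
inline size_t commit_local(SharedLog &log, std::vector<unsigned char> local) {
  const uint32_t sum = toy_checksum(local.data(), local.size());
  for (int shift = 24; shift >= 0; shift -= 8)
    local.push_back(static_cast<unsigned char>(sum >> shift));
  return log.append(local.data(), local.size());
}
```

    Calling commit_local() from several threads would contend only on the
    memcpy(), which is the property the new format is designed for.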
    
    Like the previous implementation, we will overwrite the last log block
    over and over again, until it has been completely filled. It would be
    possible to write only up to the last completed block (if no more
    recent write was requested), or to write dummy FILE_CHECKPOINT records
    to fill the incomplete block, by invoking the currently disabled
    function log_pad(). This would require adjustments to some logic around
    log checkpoints, page flushing, and shutdown.
    
    An upgrade after a crash of any previous version is not supported.
    Logically empty log files from a previous version will be upgraded.
    
    An attempt to start up InnoDB without a valid ib_logfile0 will be
    refused. Previously, the redo log used to be created automatically
    if it was missing. Only with innodb_force_recovery=6 is it
    possible to start InnoDB in read-only mode even if the log file
    does not exist. This allows the contents of a possibly corrupted
    database to be dumped.
    
    Because a prepared backup from an earlier version of mariadb-backup
    will create a 0-sized log file, we will allow an upgrade from such
    log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
    tablespace looks valid.
    
    The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
    with 64-byte log checkpoint blocks at 0x1000 and 0x2000.
    
    The start of log records will move from 0x800 to 0x3000. This allows us
    to use 4096-byte aligned blocks for all I/O in a future revision.
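
    As a rough illustration of the new file layout, the offsets above can
    be written down as constants. The names CHECKPOINT_1, CHECKPOINT_2 and
    IO_ALIGN, and the helper lsn_to_offset(), are hypothetical; the mapping
    assumes that the area from 0x3000 to the end of the file is used as a
    circular buffer and that first_lsn is the LSN stored at 0x3000.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t CHECKPOINT_1 = 0x1000; // first 64-byte checkpoint block
constexpr uint64_t CHECKPOINT_2 = 0x2000; // second 64-byte checkpoint block
constexpr uint64_t START_OFFSET = 0x3000; // first byte of log records
constexpr uint64_t IO_ALIGN = 4096;       // largest supported block size

static_assert(CHECKPOINT_1 % IO_ALIGN == 0, "checkpoint 1 is block aligned");
static_assert(CHECKPOINT_2 % IO_ALIGN == 0, "checkpoint 2 is block aligned");
static_assert(START_OFFSET % IO_ALIGN == 0, "log records are block aligned");

// Hypothetical helper: map an LSN to a byte offset in the circular file.
inline uint64_t lsn_to_offset(uint64_t lsn, uint64_t first_lsn,
                              uint64_t file_size) {
  assert(file_size > START_OFFSET);
  assert(lsn >= first_lsn);
  return START_OFFSET + (lsn - first_lsn) % (file_size - START_OFFSET);
}
```

    Because every fixed offset is a multiple of 4096, a checkpoint block or
    the start of the record area never shares a device block with anything
    else, whatever block size between 64 and 4096 bytes is in use.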
    
    We extend the MDEV-12353 redo log record format as follows.
    
    (1) Empty mini-transactions or extra NUL bytes will not be allowed.
    (2) The end-of-minitransaction marker (a NUL byte) will be replaced
    with a 1-bit sequence number, which will be toggled each time the
    circular log file wraps back to the beginning.
    (3) After the sequence bit, a CRC-32C checksum of all data
    (excluding the sequence bit) will be written.
    (4) If the log is encrypted, 8 bytes will be written before
    the checksum and included in it. This is part of the
    initialization vector (IV) of encrypted log data.
    (5) File names, page numbers, and checkpoint information will not be
    encrypted. Only the payload bytes of page-level log will be encrypted.
    The tablespace ID and page number will form part of the IV.
    (6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
    with all-zero payload, and with the normal end marker and checksum.
    The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.
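
    Items (2) and (3) can be illustrated with a small sketch for the
    unencrypted case. It is not the server code: frame_mtr() and
    verify_mtr() are hypothetical helpers, the sequence bit is assumed to
    occupy a whole byte with value 0 or 1, and the checksum is assumed to
    be stored in big-endian order. The crc32c() routine is a plain bitwise
    CRC-32C; a real implementation would use hardware instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
inline uint32_t crc32c(const unsigned char *data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; bit++)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}

// Append the sequence byte and the checksum of the records before it.
inline std::vector<unsigned char> frame_mtr(std::vector<unsigned char> records,
                                            bool sequence_bit) {
  assert(!records.empty()); // empty mini-transactions are not allowed
  const uint32_t crc = crc32c(records.data(), records.size());
  records.push_back(sequence_bit ? 1 : 0);
  for (int shift = 24; shift >= 0; shift -= 8)
    records.push_back(static_cast<unsigned char>(crc >> shift));
  return records;
}

// Accept a framed mini-transaction only if both the sequence bit and the
// checksum match; anything else is treated as the end of the log.
inline bool verify_mtr(const std::vector<unsigned char> &framed,
                       bool expected_sequence_bit) {
  if (framed.size() < 6) return false;
  const size_t n = framed.size() - 5; // length of the records
  if (framed[n] != (expected_sequence_bit ? 1 : 0)) return false;
  uint32_t stored = 0;
  for (size_t i = 1; i <= 4; i++) stored = stored << 8 | framed[n + i];
  return stored == crc32c(framed.data(), n);
}
```

    A stale mini-transaction left over from the previous pass through the
    circular file carries the opposite sequence bit, so verify_mtr() would
    reject it even though its checksum is still valid.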
    
    In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
    no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
    will require a valid log file. When resizing the log, we will create
    a logically empty ib_logfile101 at the current LSN and use an atomic rename
    to replace ib_logfile0 with it. See the test innodb.log_file_size.
    
    Because there is no mandatory padding in the log file, we are able
    to create a dummy log file as of an arbitrary log sequence number.
    See the test mariabackup.huge_lsn.
    
    The parameter innodb_log_write_ahead_size and the
    INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.
    
    The minimum value of innodb_log_buffer_size will be increased to 2MiB
    (because log_sys.buf will replace recv_sys.buf) and the increment
    adjusted to 4096 bytes (the maximum log block size).
    
    The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:
    
    os_log_fsyncs
    os_log_pending_fsyncs
    log_pending_log_flushes
    log_pending_checkpoint_writes
    
    The following status variables will be removed:
    
    Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
    Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)
    
    log_sys.get_block_size(): Return the physical block size of the log file.
    This is only implemented on Linux and Microsoft Windows for now, and for
    the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
    maximum size of a checkpoint block). If the block size is anything else,
    the traditional 512-byte size will be used via normal file system
    buffering.
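
    The selection rule can be sketched as a pure function. LogIo and
    choose_log_io() are hypothetical names, and the sketch leaves out how
    the physical block size is queried from the operating system; it only
    shows the decision that is made once that size is known.

```cpp
#include <cassert>

struct LogIo {
  unsigned block_size;   // granularity of log writes, in bytes
  bool may_bypass_cache; // false: keep normal file system buffering
};

// Accept power-of-2 sizes from 64 to 4096 bytes; otherwise fall back to
// the traditional 512-byte size with file system buffering.
inline LogIo choose_log_io(unsigned physical_block_size) {
  const bool power_of_2 = physical_block_size != 0 &&
    (physical_block_size & (physical_block_size - 1)) == 0;
  if (power_of_2 && physical_block_size >= 64 && physical_block_size <= 4096)
    return {physical_block_size, true};
  return {512, false};
}
```

    For example, a device reporting 8192-byte blocks would be handled with
    512-byte writes through the file system cache.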
    
    If the file system buffers can be bypassed, a message like the following
    will be issued:
    
    InnoDB: File system buffers for log disabled (block size=512 bytes)
    InnoDB: File system buffers for log disabled (block size=4096 bytes)
    
    This has been tested on Linux and Microsoft Windows with both sizes.
    
    On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
    Tests in 3 different environments where the log is stored in a device
    with a physical block size of 512 bytes are yielding better throughput
    without O_DIRECT. This could be because, when the last log block is
    being overwritten (multiple transactions become durable at the same
    time, and each of them writes a small number of bytes to the last
    log block), it should be faster to re-copy data from log_sys.buf or
    log_sys.flush_buf to the kernel buffer, to be finally written at
    fdatasync() time.
    
    The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
    data files. This option will enable O_DIRECT on the log file on Linux.
    It may be unsafe to use when the storage device does not support
    FUA (Force Unit Access) mode.
    
    When the server is compiled WITH_PMEM=ON, we will use memory-mapped
    I/O for the log file if the log resides on a "mount -o dax" device.
    We will identify PMEM in a start-up message:
    
    InnoDB: log sequence number 0 (memory-mapped); transaction id 3
    
    On Linux, we will also invoke mmap() on any ib_logfile0 that resides
    in /dev/shm, effectively treating the log file as persistent memory.
    This should speed up "./mtr --mem" and increase the test coverage of
    PMEM on non-PMEM hardware. It also allows users to estimate how much
    the performance would be improved by installing persistent memory.
    On other tmpfs file systems such as /run, we will not use mmap().
    
    mariadb-backup: Eliminated several variables. We will refer
    directly to recv_sys and log_sys.
    
    backup_wait_for_lsn(): Detect non-progress of
    xtrabackup_copy_logfile(). In this new log format with
    arbitrary-sized blocks, we can only detect log file overrun
    indirectly, by observing that the scanned log sequence number
    is not advancing.
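
    One way to detect such non-progress is sketched below. ProgressWatch
    is a hypothetical helper, and the number of idle observations that is
    tolerated is an arbitrary parameter; the commit does not state how
    backup_wait_for_lsn() decides that copying has stalled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reports a stall after `limit` consecutive
// observations in which the scanned LSN did not advance.
class ProgressWatch {
  uint64_t last_lsn_;
  unsigned idle_ = 0;
  const unsigned limit_;
public:
  ProgressWatch(uint64_t start_lsn, unsigned limit)
    : last_lsn_(start_lsn), limit_(limit) {
    assert(limit > 0);
  }

  // Returns true while the copier still appears to make progress.
  bool observe(uint64_t scanned_lsn) {
    if (scanned_lsn > last_lsn_) {
      last_lsn_ = scanned_lsn;
      idle_ = 0;
      return true;
    }
    return ++idle_ < limit_;
  }
};
```

    A waiting loop could call observe() after each wake-up and give up
    with an error once it returns false.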
    
    xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
    because we are not allowed to modify the server's log file, and our
    memory mapping is read-only.
    
    trx_flush_log_if_needed_low(): Do not use the callback on PMEM.
    Using neither flush_lock nor write_lock around PMEM writes seems
    to yield the best performance. The pmem_persist() calls may
    still be somewhat slower than the pwrite() and fdatasync() based
    interface (PMEM mounted without -o dax).
    
    recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.
    
    recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.
    
    recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.
    
    recv_sys_t, log_sys_t: Removed many data members.
    
    recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
    recv_sys.offset: Renamed from recv_sys.recovered_offset.
    log_sys.buf_size: Replaces srv_log_buffer_size.
    
    recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
    when the buffer is being allocated from the memory heap.
    
    recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
    backed by ib_logfile0. The pointer will wrap from recv_sys.len
    (log_sys.file_size) to log_sys.START_OFFSET. For the record that
    wraps around, we may copy file name or record payload data to
    the auxiliary buffer decrypt_buf in order to have a contiguous
    block of memory. The maximum size of a record is less than
    innodb_page_size bytes.
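
    The wrap-around behaviour can be illustrated with a minimal cursor
    type. RingCursor is a hypothetical simplification of recv_ring: it
    walks a circular region [start, end) and can copy a record that
    straddles the end of the region into contiguous memory, which is the
    purpose decrypt_buf serves in the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cursor over the circular region [start, end) of a buffer.
class RingCursor {
  const unsigned char *base_;
  size_t start_, end_, pos_;
public:
  RingCursor(const unsigned char *base, size_t start, size_t end, size_t pos)
    : base_(base), start_(start), end_(end), pos_(pos) {
    assert(start < end && pos >= start && pos < end);
  }

  unsigned char operator*() const { return base_[pos_]; }

  // Advance by one byte, wrapping from end back to start.
  RingCursor &operator++() {
    if (++pos_ == end_) pos_ = start_;
    return *this;
  }

  size_t offset() const { return pos_; }

  // Copy len bytes, following the wrap, into contiguous memory.
  std::vector<unsigned char> copy(size_t len) const {
    assert(len <= end_ - start_);
    std::vector<unsigned char> out;
    out.reserve(len);
    RingCursor c = *this;
    for (size_t i = 0; i < len; i++, ++c) out.push_back(*c);
    return out;
  }
};
```

    In terms of the log file, start would correspond to
    log_sys.START_OFFSET and end to log_sys.file_size.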
    
    recv_sys_t::parse(): Take the smart pointer as a template parameter.
    Do not temporarily add a trailing NUL byte to FILE_ records, because
    we are not supposed to modify the memory-mapped log file. (It is
    attached in read-write mode already during recovery.)
    
    recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().
    
    recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
    returned on PMEM, use recv_ring to wrap around the buffer to the start.
    
    mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
    on PMEM, because it has no meaning on the mmap-based log.
    
    log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
    srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
    Protected by log_sys.mutex. Updated consistently in log_close().
    Previously, mtr_t::commit() conditionally updated the count,
    which was inconsistent.
    
    log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
    for writing to log_sys.log (the ib_logfile0). Replaces
    srv_stats.log_writes and export_vars.innodb_log_writes.
    Protected by log_sys.mutex.
    
    log_sys.waits: Count waits in append_prepare(). Replaces
    srv_stats.log_waits and export_vars.innodb_log_waits.
    
    recv_recover_page(): Do not unnecessarily acquire
    log_sys.flush_order_mutex. We are inserting the blocks in arbitrary
    order anyway, to be adjusted in recv_sys.apply(true).
    
    We will change the definition of flush_lock and write_lock to
    avoid potential false sharing. Depending on sizeof(log_sys) and
    CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
    share a cache line with each other or with the last data members
    of log_sys.
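
    The general technique for avoiding this kind of false sharing is to
    align each lock to its own cache line, as in the sketch below.
    PaddedLock and LogLocks are hypothetical types, and the 64-byte
    CACHE_LINE is an assumption standing in for
    CPU_LEVEL1_DCACHE_LINESIZE; the actual definitions in the commit may
    differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHE_LINE = 64; // assumed cache line size

// Each lock is aligned to, and padded to a multiple of, a cache line.
struct alignas(CACHE_LINE) PaddedLock {
  std::atomic<uint32_t> word{0};
};

struct LogLocks {
  PaddedLock flush_lock;
  PaddedLock write_lock;
};

static_assert(alignof(PaddedLock) == CACHE_LINE,
              "lock starts on a cache line boundary");
static_assert(sizeof(PaddedLock) % CACHE_LINE == 0,
              "lock occupies whole cache lines");
static_assert(sizeof(LogLocks) >= 2 * CACHE_LINE,
              "the two locks cannot share a cache line");
```

    With this layout, a thread spinning on write_lock would not invalidate
    the cache line that holds flush_lock, and neither lock shares a line
    with neighbouring data members.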
    
    Thanks to Matthias Leich for providing https://rr-project.org traces
    for various failures during the development, and to
    Thirunarayanan Balathandayuthapani for his help in debugging
    some of the recovery code. And thanks to the developers of the
    rr debugger for a tool without which extensive changes to InnoDB
    would be very challenging to get right.
    
    Thanks to Vladislav Vaintroub for useful feedback and
    to him, Axel Schwenke and Krunal Bauskar for testing the performance.