Commits · 1f5fc7b745fd5009325fcf94be67c3f850408266 · nexedi / MariaDB

22 Jan, 2022 1 commit

MDEV-27208: mtr --ps-protocol test fixup · 1f5fc7b7

Marko Mäkelä authored Jan 22, 2022

The test ./mtr --ps-protocol main.func_math
was broken in commit 5b3ad94c
because in that mode, one of several truncation warnings for
a single integer literal would be omitted. Those warnings are
issued by the parser somewhere outside CRC32() or CRC32C().

1f5fc7b7

21 Jan, 2022 5 commits

MDEV-27208: Extend CRC32() and implement CRC32C() · 5b3ad94c

Marko Mäkelä authored Jan 21, 2022

We used to define a native unary function CRC32() that computes the CRC-32
of a string using the ISO 3309 polynomial that is being used by zlib
and many others.

Often, a CRC is computed in pieces. To facilitate this, we introduce a
2-ary variant of the function that inputs a previous CRC as the first
argument: CRC32('MariaDB')=CRC32(CRC32('Maria'),'DB').
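The chaining identity can be checked outside the server, because zlib uses the same polynomial. A minimal Python sketch, assuming only the standard zlib module; the helper name crc32_chained is illustrative and not part of the commit:

```python
import zlib

def crc32_chained(pieces):
    """Compute a CRC-32 over several byte strings, feeding each result forward."""
    crc = 0
    for piece in pieces:
        # The second argument of zlib.crc32() is the running CRC so far.
        crc = zlib.crc32(piece, crc)
    return crc

whole = zlib.crc32(b"MariaDB")
in_pieces = crc32_chained([b"Maria", b"DB"])
print(whole == in_pieces)  # True
```

Note that zlib.crc32() takes the previous CRC as its second argument, while the 2-ary SQL function described above takes it as the first.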

InnoDB and MyRocks use a different polynomial, which was implemented
in SSE4.2 instructions that were introduced in the
Intel Nehalem microarchitecture. This is commonly called CRC-32C
(Castagnoli).

We introduce a native function that uses the Castagnoli polynomial:
CRC32C('MariaDB')=CRC32C(CRC32C('Maria'),'DB'). This allows
SELECT...INTO DUMPFILE to be used for the creation of files with
valid checksums, such as a logically empty InnoDB redo log file
ib_logfile0 corresponding to a particular log sequence number.
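For illustration, a slow bitwise CRC-32C can be written in a few lines of Python. This is only a sketch of the Castagnoli checksum and its chaining property, not the SSE4.2-based code in the server; the function name and argument order here are assumptions made for the example:

```python
CASTAGNOLI_REFLECTED = 0x82F63B78  # bit-reversed form of the 0x1EDC6F41 polynomial

def crc32c(data, crc=0):
    """Bitwise (slow) CRC-32C of data, continuing from a previous CRC."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; apply the polynomial if that bit was set.
            crc = (crc >> 1) ^ (CASTAGNOLI_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard CRC-32C check value for the ASCII string "123456789".
assert crc32c(b"123456789") == 0xE3069283
print(crc32c(b"DB", crc32c(b"Maria")) == crc32c(b"MariaDB"))  # True
```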

5b3ad94c

MDEV-27199: Remove FIL_PAGE_FILE_FLUSH_LSN · b07920b6

Marko Mäkelä authored Jan 21, 2022

The only purpose of the field FIL_PAGE_FILE_FLUSH_LSN was to
store the log sequence number for a new ib_logfile0 when the
InnoDB redo log was missing at startup.

Because FIL_PAGE_FILE_FLUSH_LSN no longer serves any purpose,
we will stop updating it. The writes of that field were inherently
risky, because they were covered by neither the redo log nor
the doublewrite buffer.

Warning: After MDEV-14425 and before this change, users could perform
a clean shutdown of the server, replace the ib_logfile0 with a
0-length file, and expect a valid log file to be created on the
next server startup. After this change, if the FIL_PAGE_FILE_FLUSH_LSN
had ever been updated in the past, the server would still create a
log file in such a scenario, but possibly with an incorrect (too small)
LSN. Users should not manipulate log files directly!

b07920b6

Disable adaptive spinning on buf_pool.mutex · 88d9fbb4

Marko Mäkelä authored Jan 21, 2022

During the testing of MDEV-14425, buf_pool.mutex and log_sys.mutex
were identified as the main bottlenecks for write workloads.
Let us disable spinning also for buf_pool.mutex, except on ARMv8
where spinning was enabled for log_sys.mutex
in commit f7684f0c (MDEV-26855).
This was tested on AMD64 and recommended by Axel Schwenke.

According to Krunal Bauskar, removing the spinloops did not improve
performance in his tests on ARMv8.

88d9fbb4

Cleanup: Replace ut_crc32c(x,y) with my_crc32c(0,x,y) · 5d54fd61
Marko Mäkelä authored Jan 21, 2022

5d54fd61

MDEV-14425 Improve the redo log for concurrency · 685d958e

Marko Mäkelä authored Jan 21, 2022

The InnoDB redo log used to be formatted in blocks of 512 bytes.
The log blocks were encrypted and the checksum was calculated while
holding log_sys.mutex, creating a serious scalability bottleneck.

We remove the fixed-size redo log block structure altogether and
essentially turn every mini-transaction into a log block of its own.
This allows encryption and checksum calculations to be performed
on local mtr_t::m_log buffers, before acquiring log_sys.mutex.
The mutex only protects a memcpy() of the data to the shared
log_sys.buf, as well as the padding of the log, in case the
to-be-written part of the log would not end in a block boundary of
the underlying storage. For now, the "padding" consists of writing
a single NUL byte, to allow recovery and mariadb-backup to detect
the end of the circular log faster.
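The locking idea can be sketched in Python as follows. This is a simplified illustration, not InnoDB code: the class SharedLog and its members are hypothetical, zlib.crc32 stands in for the CRC-32C used by the server, and encryption and padding are left out.

```python
import threading
import zlib

class SharedLog:
    """Toy shared log buffer protected by a single mutex."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.buf = bytearray()

    def append(self, local_record: bytes) -> int:
        # Expensive work happens on the caller's local buffer,
        # before the mutex is acquired.
        checksum = zlib.crc32(local_record).to_bytes(4, "big")
        framed = local_record + checksum
        # The critical section only copies bytes into the shared buffer.
        with self.mutex:
            offset = len(self.buf)
            self.buf += framed
        return offset

log = SharedLog()
threads = [threading.Thread(target=log.append, args=(b"mtr",)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(log.buf))  # 28
```

The shorter the critical section, the less time concurrent writers spend waiting for each other, which is the point of moving the checksum out of it.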

Like the previous implementation, we will overwrite the last log block
over and over again, until it has been completely filled. It would be
possible to write only up to the last completed block (if no more
recent write was requested), or to write dummy FILE_CHECKPOINT records
to fill the incomplete block, by invoking the currently disabled
function log_pad(). This would require adjustments to some logic around
log checkpoints, page flushing, and shutdown.

An upgrade after a crash of any previous version is not supported.
Logically empty log files from a previous version will be upgraded.

An attempt to start up InnoDB without a valid ib_logfile0 will be
refused. Previously, the redo log used to be created automatically
if it was missing. Only with innodb_force_recovery=6 is it
possible to start InnoDB in read-only mode even if the log file
does not exist. This allows the contents of a possibly corrupted
database to be dumped.

Because a prepared backup from an earlier version of mariadb-backup
will create a 0-sized log file, we will allow an upgrade from such
log files, provided that the FIL_PAGE_FILE_FLUSH_LSN in the system
tablespace looks valid.

The 512-byte log checkpoint blocks at 0x200 and 0x600 will be replaced
with 64-byte log checkpoint blocks at 0x1000 and 0x2000.

The start of log records will move from 0x800 to 0x3000. This allows us
to use 4096-byte aligned blocks for all I/O in a future revision.
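A quick check of the offsets quoted above shows why the new layout suits 4096-byte I/O. The dictionary names in this Python sketch are illustrative only:

```python
OLD_LAYOUT = {"checkpoint_1": 0x200, "checkpoint_2": 0x600, "first_record": 0x800}
NEW_LAYOUT = {"checkpoint_1": 0x1000, "checkpoint_2": 0x2000, "first_record": 0x3000}

def aligned(layout, block_size):
    """Return True if every offset is a multiple of block_size."""
    return all(offset % block_size == 0 for offset in layout.values())

print(aligned(OLD_LAYOUT, 512))   # True
print(aligned(OLD_LAYOUT, 4096))  # False
print(aligned(NEW_LAYOUT, 4096))  # True
```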

We extend the MDEV-12353 redo log record format as follows.

(1) Empty mini-transactions or extra NUL bytes will not be allowed.
(2) The end-of-minitransaction marker (a NUL byte) will be replaced
with a 1-bit sequence number, which will be toggled each time when the
circular log file wraps back to the beginning.
(3) After the sequence bit, a CRC-32C checksum of all data
(excluding the sequence bit) will be written.
(4) If the log is encrypted, 8 bytes will be written before
the checksum and included in it. This is part of the
initialization vector (IV) of encrypted log data.
(5) File names, page numbers, and checkpoint information will not be
encrypted. Only the payload bytes of page-level log will be encrypted.
The tablespace ID and page number will form part of the IV.
(6) For padding, arbitrary-length FILE_CHECKPOINT records may be written,
with all-zero payload, and with the normal end marker and checksum.
The minimum size is 7 bytes, or 7+8 with innodb_encrypt_log=ON.
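Items (2) and (3) can be illustrated with a toy framing in Python. This is a sketch under loud assumptions, not the on-disk format: the sequence bit is stored here as a whole byte with value 0 or 1, the checksum is written in big-endian order, record boundaries are given to the reader, and the names frame() and is_valid() are hypothetical.

```python
def crc32c(data, crc=0):
    """Bitwise (slow) CRC-32C, Castagnoli polynomial in reflected form."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def frame(payload, sequence_bit):
    """Append a sequence marker and a CRC-32C of the payload (assumed layout)."""
    assert payload, "empty mini-transactions are not allowed"
    assert sequence_bit in (0, 1)
    # The checksum covers the payload but not the sequence marker.
    checksum = crc32c(payload).to_bytes(4, "big")  # byte order is an assumption
    return payload + bytes([sequence_bit]) + checksum

def is_valid(framed, expected_sequence_bit):
    """Accept a frame only if the marker and the checksum both match."""
    if len(framed) < 6:
        return False
    payload, marker, checksum = framed[:-5], framed[-5], framed[-4:]
    return (marker == expected_sequence_bit
            and checksum == crc32c(payload).to_bytes(4, "big"))

record = frame(b"page change", 1)
print(is_valid(record, 1))  # True
print(is_valid(record, 0))  # False: looks like data from the previous pass
```

Because the bit is toggled each time the circular file wraps, a reader that expects one value treats a record carrying the other value as stale data from the previous pass and stops there.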

In mariadb-backup and in Galera snapshot transfer (SST) scripts, we will
no longer remove ib_logfile0 or create an empty ib_logfile0. Server startup
will require a valid log file. When resizing the log, we will create
a logically empty ib_logfile101 at the current LSN and use an atomic rename
to replace ib_logfile0 with it. See the test innodb.log_file_size.

Because there is no mandatory padding in the log file, we are able
to create a dummy log file as of an arbitrary log sequence number.
See the test mariabackup.huge_lsn.

The parameter innodb_log_write_ahead_size and the
INFORMATION_SCHEMA.INNODB_METRICS counter log_padded will be removed.

The minimum value of innodb_log_buffer_size will be increased to 2MiB
(because log_sys.buf will replace recv_sys.buf) and the increment
adjusted to 4096 bytes (the maximum log block size).

The following INFORMATION_SCHEMA.INNODB_METRICS counters will be removed:

os_log_fsyncs
os_log_pending_fsyncs
log_pending_log_flushes
log_pending_checkpoint_writes

The following status variables will be removed:

Innodb_os_log_fsyncs (this is included in Innodb_data_fsyncs)
Innodb_os_log_pending_fsyncs (this was limited to at most 1 by design)

log_sys.get_block_size(): Return the physical block size of the log file.
This is only implemented on Linux and Microsoft Windows for now, and for
the power-of-2 block sizes between 64 and 4096 bytes (the minimum and
maximum size of a checkpoint block). If the block size is anything else,
the traditional 512-byte size will be used via normal file system
buffering.
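The selection rule described above can be written as a small Python sketch. It only restates the rule from this message; the function name and the returned pair are assumptions, and the platform-specific query of the physical block size is not shown.

```python
def log_block_size(physical_block_size):
    """Return (block_size, may_bypass_buffering) for a reported physical size."""
    size = physical_block_size
    is_power_of_two = size > 0 and size & (size - 1) == 0
    if is_power_of_two and 64 <= size <= 4096:
        # Supported size: file system buffers may be bypassed.
        return size, False if size is None else True
    # Anything else: fall back to 512 bytes with normal buffering.
    return 512, False

print(log_block_size(4096))  # (4096, True)
print(log_block_size(8192))  # (512, False)
```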

If the file system buffers can be bypassed, a message like the following
will be issued:

InnoDB: File system buffers for log disabled (block size=512 bytes)
InnoDB: File system buffers for log disabled (block size=4096 bytes)

This has been tested on Linux and Microsoft Windows with both sizes.

On Linux, only enable O_DIRECT on the log for innodb_flush_method=O_DSYNC.
Tests in 3 different environments where the log is stored in a device
with a physical block size of 512 bytes are yielding better throughput
without O_DIRECT. This could be due to the fact that in the event the
last log block is being overwritten (if multiple transactions would
become durable at the same time, and each of them will write a small
number of bytes to the last log block), it should be faster to re-copy
data from log_sys.buf or log_sys.flush_buf to the kernel buffer,
to be finally written at fdatasync() time.

The parameter innodb_flush_method=O_DSYNC will imply O_DIRECT for
data files. This option will enable O_DIRECT on the log file on Linux.
It may be unsafe to use when the storage device does not support
FUA (Force Unit Access) mode.

When the server is compiled WITH_PMEM=ON, we will use memory-mapped
I/O for the log file if the log resides on a "mount -o dax" device.
We will identify PMEM in a start-up message:

InnoDB: log sequence number 0 (memory-mapped); transaction id 3

On Linux, we will also invoke mmap() on any ib_logfile0 that resides
in /dev/shm, effectively treating the log file as persistent memory.
This should speed up "./mtr --mem" and increase the test coverage of
PMEM on non-PMEM hardware. It also allows users to estimate how much
the performance would be improved by installing persistent memory.
On other tmpfs file systems such as /run, we will not use mmap().

mariadb-backup: Eliminated several variables. We will refer
directly to recv_sys and log_sys.

backup_wait_for_lsn(): Detect non-progress of
xtrabackup_copy_logfile(). In this new log format with
arbitrary-sized blocks, we can only detect log file overrun
indirectly, by observing that the scanned log sequence number
is not advancing.
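The indirect detection can be sketched as a polling loop in Python. Everything here is hypothetical (the function name, the callback, and the stall threshold); it only illustrates "give up when the scanned LSN stops advancing", not the actual logic of backup_wait_for_lsn().

```python
def wait_for_lsn(target_lsn, read_scanned_lsn, max_stalled_polls=3):
    """Poll until the scanned LSN reaches target_lsn.

    Returns False if the LSN fails to advance for max_stalled_polls
    consecutive polls, which is treated as a sign of log overrun.
    """
    last = read_scanned_lsn()
    stalled = 0
    while last < target_lsn:
        current = read_scanned_lsn()
        if current > last:
            stalled = 0
            last = current
        else:
            stalled += 1
            if stalled >= max_stalled_polls:
                return False
    return True

progress = iter([10, 20, 30])
print(wait_for_lsn(30, lambda: next(progress)))  # True
stuck = iter([10] * 10)
print(wait_for_lsn(30, lambda: next(stuck)))     # False
```

A real implementation would sleep or wait on a condition between polls; that is omitted to keep the sketch short.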

xtrabackup_copy_logfile(): On PMEM, do not modify the sequence bit,
because we are not allowed to modify the server's log file, and our
memory mapping is read-only.

trx_flush_log_if_needed_low(): Do not use the callback on pmem.
Using neither flush_lock nor write_lock around PMEM writes seems
to yield the best performance. The pmem_persist() calls may
still be somewhat slower than the pwrite() and fdatasync() based
interface (PMEM mounted without -o dax).

recv_sys_t::buf: Remove. We will use log_sys.buf for parsing.

recv_sys_t::MTR_SIZE_MAX: Replaces RECV_SCAN_SIZE.

recv_sys_t::file_checkpoint: Renamed from mlog_checkpoint_lsn.

recv_sys_t, log_sys_t: Removed many data members.

recv_sys.lsn: Renamed from recv_sys.recovered_lsn.
recv_sys.offset: Renamed from recv_sys.recovered_offset.
log_sys.buf_size: Replaces srv_log_buffer_size.

recv_buf: A smart pointer that wraps log_sys.buf[recv_sys.offset]
when the buffer is being allocated from the memory heap.

recv_ring: A smart pointer that wraps a circular log_sys.buf[] that is
backed by ib_logfile0. The pointer will wrap from recv_sys.len
(log_sys.file_size) to log_sys.START_OFFSET. For the record that
wraps around, we may copy file name or record payload data to
the auxiliary buffer decrypt_buf in order to have a contiguous
block of memory. The maximum size of a record is less than
innodb_page_size bytes.
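The wrap-around can be illustrated with a small Python sketch. The function read_wrapped() and its parameters are hypothetical; the returned copy plays the role that the auxiliary buffer has in the description above.

```python
def read_wrapped(log, offset, length, start_offset):
    """Read length bytes at offset from a circular log as one contiguous copy.

    Reads that run past the end of the file continue at start_offset,
    skipping the header area that precedes the log records.
    """
    end = len(log)
    assert start_offset <= offset < end
    assert length <= end - start_offset
    first = min(length, end - offset)
    out = bytes(log[offset:offset + first])
    if first < length:
        # The record wraps: take the remainder from the start of the ring.
        out += bytes(log[start_offset:start_offset + (length - first)])
    return out

log = bytes(range(16))                    # toy "file": header in bytes 0..3
print(list(read_wrapped(log, 14, 4, 4)))  # [14, 15, 4, 5]
```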

recv_sys_t::parse(): Take the smart pointer as a template parameter.
Do not temporarily add a trailing NUL byte to FILE_ records, because
we are not supposed to modify the memory-mapped log file. (It is
attached in read-write mode already during recovery.)

recv_sys_t::parse_mtr(): Wrapper for recv_sys_t::parse().

recv_sys_t::parse_pmem(): Like parse_mtr(), but if PREMATURE_EOF would be
returned on PMEM, use recv_ring to wrap around the buffer to the start.

mtr_t::finish_write(), log_close(): Do not enforce log_sys.max_buf_free
on PMEM, because it has no meaning on the mmap-based log.

log_sys.write_to_buf: Count writes to log_sys.buf. Replaces
srv_stats.log_write_requests and export_vars.innodb_log_write_requests.
Protected by log_sys.mutex. Updated consistently in log_close().
Previously, mtr_t::commit() conditionally updated the count,
which was inconsistent.

log_sys.write_to_log: Count swaps of log_sys.buf and log_sys.flush_buf,
for writing to log_sys.log (the ib_logfile0). Replaces
srv_stats.log_writes and export_vars.innodb_log_writes.
Protected by log_sys.mutex.

log_sys.waits: Count waits in append_prepare(). Replaces
srv_stats.log_waits and export_vars.innodb_log_waits.

recv_recover_page(): Do not unnecessarily acquire
log_sys.flush_order_mutex. We are inserting the blocks in arbitrary
order anyway, to be adjusted in recv_sys.apply(true).

We will change the definition of flush_lock and write_lock to
avoid potential false sharing. Depending on sizeof(log_sys) and
CPU_LEVEL1_DCACHE_LINESIZE, the flush_lock and write_lock could
share a cache line with each other or with the last data members
of log_sys.

Thanks to Matthias Leich for providing https://rr-project.org traces
for various failures during the development, and to
Thirunarayanan Balathandayuthapani for his help in debugging
some of the recovery code. And thanks to the developers of the
rr debugger for a tool without which extensive changes to InnoDB
would be very challenging to get right.

Thanks to Vladislav Vaintroub for useful feedback and
to him, Axel Schwenke and Krunal Bauskar for testing the performance.

685d958e

20 Jan, 2022 7 commits

MDEV-27540 Different OpenSSL versions mix up in build depending on cmake options · baef53a7
Sergei Golubchik authored Jan 20, 2022
```
list ${OPENSSL_ROOT_DIR}/lib64 explicitly, because
cmake below version 3.23.0 won't search there.
```
baef53a7

MDEV-25785 Add support for OpenSSL 3.0 · d42c2efb

Vladislav Vaintroub authored Nov 08, 2021

Summary of changes

- MD_CTX_SIZE is increased

- EVP_CIPHER_CTX_buf_noconst(ctx) does not work anymore; it points
  to nobody knows where. Since the function does not seem to be
  documented, the previous assumption was that it points to the
  last partial source block.
  Add our own partial block buffer for NOPAD encryption instead

- SECLEVEL in CipherString in openssl.cnf
  had been downgraded to 0, from 1, to make TLSv1.0 and TLSv1.1 possible
   (according to https://github.com/openssl/openssl/blob/openssl-3.0.0/NEWS.md
   even though the manual for SSL_CTX_get_security_level claims that it
   should not be necessary)

- Workaround Ssl_cipher_list issue, it now returns TLSv1.3 ciphers,
  in addition to what was set in --ssl-cipher

- ctx_buf buffer now must be aligned to 16 bytes with OpenSSL
  (previously with WolfSSL only), or crashes will happen

- updated aes-t to be easier to debug, using a function rather than
  a huge multiline macro;
  added a test that does "nopad" encryption piece-wise, to test the
  replacement of EVP_CIPHER_CTX_buf_noconst

d42c2efb

Merge 10.7 into 10.8 · a855d6d9
Marko Mäkelä authored Jan 20, 2022

a855d6d9

MDEV-26519 fixup: GCC 11 -Og -Wmaybe-uninitialized · 852534dc

Marko Mäkelä authored Jan 20, 2022

GCC does not understand that the variable have_ndv determines
whether the variable ndv_ll is initialized. Let us add a
redundant initialization to pacify GCC.

852534dc

Merge 10.6 into 10.7 · 5e6fd4e8
Marko Mäkelä authored Jan 20, 2022

5e6fd4e8
Merge 10.5 into 10.6 · 21778b8a
Marko Mäkelä authored Jan 20, 2022

21778b8a
MDEV-27550: Disable galera.MW-328D · 66465914
Marko Mäkelä authored Jan 20, 2022

66465914

19 Jan, 2022 27 commits

MDEV-27499 fixup: Add a wait to buf_flush_sync() · 764ca7e6

Marko Mäkelä authored Jan 19, 2022

The test innodb.log_file_size would occasionally fail with
an assertion failure !buf_pool.any_io_pending(). Let us wait
for the page cleaner thread to become idle already in
srv_prepare_to_delete_redo_log_file(), like we used to.

764ca7e6

Merge MDEV-26519: JSON_HB histograms into 10.8 · da78030e
Sergei Petrunia authored Jan 19, 2022

da78030e
Code cleanup · ce4956f3
Sergei Petrunia authored Jan 19, 2022

ce4956f3
Switch the default histogram_type to still be DOUBLE_PREC_HB · f7e49c98
Sergei Petrunia authored Jan 19, 2022
```
MTR still uses JSON_HB as the default.
```
f7e49c98
JSON_HB histogram: represent values of BIT() columns in hex always · 4842a563
Sergei Petrunia authored Jan 14, 2022

4842a563

MDEV-26901: Estimation for filtered rows less precise ... #4 · dae20dde

Sergei Petrunia authored Jan 11, 2022

In Histogram_json_hb::point_selectivity(), do return selectivity of 0.0
when the histogram says so.

The logic of "Do not return 0.0 estimate as it causes a multiply-by-zero
meltdown in cost and cardinality calculations" is moved into
records_in_column_ranges(), where it is done *once* per column pair (as
opposed to once per range, which can cause the error to add up
to a large number when there are many ranges).
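The effect of where the lower bound is applied can be seen in a simplified Python sketch. The floor value and both function names are hypothetical, and summing per-range selectivities is a simplification of what the optimizer does:

```python
MIN_SELECTIVITY = 0.001  # hypothetical floor that avoids a zero estimate

def floor_per_range(range_selectivities):
    """Apply the floor to each range, then add the ranges up."""
    return sum(max(s, MIN_SELECTIVITY) for s in range_selectivities)

def floor_once(range_selectivities):
    """Add the ranges up first, then apply the floor a single time."""
    return max(sum(range_selectivities), MIN_SELECTIVITY)

ranges = [0.0] * 100  # the histogram says that none of the 100 ranges match
print(floor_per_range(ranges))  # roughly 100 times the floor
print(floor_once(ranges))       # exactly the floor
```

With many ranges, the per-range variant inflates the estimate in proportion to the number of ranges, while the single floor keeps the estimate at the minimum.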

dae20dde

MDEV-27229: Estimation for filtered rows less precise ... #5 · db8f15be

Sergei Petrunia authored Jan 11, 2022

Followup: remove this line from get_column_range_cardinality()

set_if_bigger(res, col_stats->get_avg_frequency());

and make sure it is only used with the binary histograms.
For JSON histograms, it makes the estimates unnecessarily imprecise.

db8f15be

MDEV-27243: Estimation for filtered rows less precise ... #7 · d3e511d4
Sergei Petrunia authored Jan 08, 2022
```
Added a testcase
```
d3e511d4
MDEV-27229: Estimation for filtered rows less precise ... #5 · 531dd708
Sergei Petrunia authored Jan 08, 2022
```
Fix special handling for values that are right next to buckets with ndv=1.
```
531dd708
Update test results · 67d4d042
Sergei Petrunia authored Dec 14, 2021

67d4d042

MDEV-27230: Estimation for filtered rows less precise ... · 905634dc

Sergei Petrunia authored Dec 13, 2021

Fix the code in Histogram_json_hb::range_selectivity that handles
special cases: a non-inclusive endpoint hitting a bucket boundary...

905634dc

MDEV-27203: Valgrind / MSAN errors in Histogram_json_hb::parse_bucket · 08f1c4a2
Sergei Petrunia authored Dec 13, 2021
```
In read_bucket_endpoint(), handle all possible parser states.
```
08f1c4a2
MDEV-26764: JSON_HB Histograms: handle BINARY and unassigned characters · d8d57d2c
Sergei Petrunia authored Dec 03, 2021
```
Encode such characters in hex.
```
d8d57d2c
More test coverage · 748b293c
Sergei Petrunia authored Dec 03, 2021

748b293c

MDEV-26519: Improved histograms · c2d2c1e7

Sergei Petrunia authored Dec 03, 2021

Save extra information in the histogram:

    "target_histogram_size": nnn,
    "collected_at": "(date and time)",
    "collected_by": "(server version)",

c2d2c1e7

MDEV-26519: Improved histograms: Better error reporting, test coverage · a0916cf5

Sergei Petrunia authored Dec 02, 2021

Also report JSON histogram load errors into error log, like it is already
done with other histogram/statistics load errors.

Add test coverage to see what happens if one upgrades but does NOT run
mysql_upgrade.

a0916cf5

Rename histogram_hb_v2 -> histogram_hb · a0f93f43
Sergei Petrunia authored Dec 02, 2021

a0f93f43

MDEV-26519: Improved histograms: Make JSON parser efficient · 1d14176e

Sergei Petrunia authored Dec 02, 2021

The previous JSON parser used an API that made parsing
inefficient: the same JSON content was parsed again and again.

Switch to a lower-level parsing API that allows the parsing
to be done efficiently.

1d14176e

MDEV-27062: Make histogram_type=JSON_HB the new default · be55ad0d
Sergei Petrunia authored Nov 29, 2021

be55ad0d

MDEV-26886: Estimation for filtered rows less precise with JSON histogram · eb6a9ad7

Sergei Petrunia authored Nov 26, 2021

- Make Histogram_json_hb::range_selectivity handle singleton buckets
  specially when computing selectivity of the max. endpoint bound.
  (for min. endpoint, we already do that).

- Also, fixed comments for Histogram_json_hb::find_bucket

eb6a9ad7

MDEV-26911: Unexpected ER_DUP_KEY, ASAN errors, double free detected in ... · 106c785e

Sergei Petrunia authored Nov 02, 2021

When loading the histogram, use table->field[N], not table->s->field[N].

When we used the latter we would corrupt the field's default value. One
of the consequences of that would be that AUTO_INCREMENT fields would
stop working correctly.

106c785e

MDEV-26892: JSON histograms become invalid with a specific (corrupt) value .. · ac0194bd
Sergei Petrunia authored Oct 24, 2021
```
Handle the case where the last value in the table cannot be represented
in utf8mb4.
```
ac0194bd
MDEV-26849: JSON Histograms: point selectivity estimates are off · 05877df4
Sergei Petrunia authored Oct 22, 2021
```
.. for non-existent values.

Handle this special case.
```
05877df4
MDEV-26750: Estimation for filtered rows is far off with JSON_HB histogram · f3f78bed
Sergei Petrunia authored Oct 18, 2021
```
Fix a bug in position_in_interval(). Do not overwrite one interval endpoint
with another.
```
f3f78bed

MDEV-26801: Valgrind/MSAN errors in Column_statistics_collected::finish ... · 27539cd2

Sergei Petrunia authored Oct 11, 2021

The problem was introduced in the fix for MDEV-26724. That patch made it
possible for histogram collection to fail. In particular, it fails for
non-assigned characters.

When histogram construction fails, we also abort the computation of
COUNT(DISTINCT). When we try to use the value, we get valgrind failures.

Switched the code to abort the statistics collection in this case.

27539cd2

MDEV-26709: JSON histogram may contain more buckets than histogram_size allows · 93d59804
Sergei Petrunia authored Oct 11, 2021
```
When computing bucket_capacity= records/histogram->get_width(), round
the value UP, not down.
```
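A small worked example shows why the rounding direction matters. This Python sketch assumes that buckets are filled up to bucket_capacity rows each; the function name and the numbers are illustrative only:

```python
def bucket_count(records, width, round_up):
    """Number of buckets needed when each bucket holds bucket_capacity rows."""
    if round_up:
        capacity = -(-records // width)  # ceiling division
    else:
        capacity = records // width      # floor division
    capacity = max(capacity, 1)          # guard for the sketch only
    return -(-records // capacity)

# 10 rows, histogram limited to 4 buckets
print(bucket_count(10, 4, round_up=False))  # 5, one more than allowed
print(bucket_count(10, 4, round_up=True))   # 4
```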
93d59804

MDEV-26724 Endless loop in json_escape_to_string upon ... empty string · 3936dc33

Sergei Petrunia authored Oct 10, 2021

Part#3:
- make json_escape() return different errors on conversion error
  and on out-of-space condition.
- Make histogram code handle conversion errors.

3936dc33