Commit d7ffb7c3 authored by Alexander Barkov, committed by Oleksandr Byelkin

MDEV-27009 Add UCA-14.0.0 collations - dump logical positions and contractions

- uca-dump can now dump logical positions as a set of "#define" directives.
  Logical positions for 4.0.0 and for 5.2.0 were calculated and put into
  ctype-uca.c manually. That required some effort, analyzing allkeys.txt
  with the help of grep and sort.
  Now, when defining a new MY_UCA_INFO, it's possible to use the new #defines
  instead of calculating logical positions manually.
  Each logical position also prints its weights in DUCET format as a comment
  before the define:

/*
[.0000.0021.0002]
[.0000.0117.0002]
*/

  The comment shows the weight ranges on the various levels,
  which makes it easier to debug the code.

- uca-dump can now dump built-in DUCET contractions.

- Adding a new uca-dump command line option --no-contractions. This is useful
  if one needs to re-dump 4.0.0 and 5.2.0 data in a ctype-uca.c compatible way.

- Adding a new uca-dump command line option --case-first=upper|level.
  This can be useful if one needs to dump with UPPER case first by default.
  It's not yet decided if we'll use --case-first=upper during the dump though.

- Moving parts of the code from the main loop into separate functions
  parse_chars() and parse_weights(). This allows the code to be reused
  between single characters and contractions.

- Adding a new function my_ducet_weight_normalize(), to cut zero weights
  from a weight string, e.g. [AAAA][0000][BBBB] -> [AAAA][BBBB].
  This helps to reuse the code between single characters and contractions.

- Weight normalization is now done before printing, in separate loops inside
  my_ducet_normalize(). Before this change, normalization was done during
  printing, inside the printing loop. This helps to separate the steps:
  loading -> normalizing -> printing.
  This makes it easier to follow what's going on, e.g. while debugging.

- Fixing ctype-uca.c to handle built-in contractions of any length.
  Previously the only built-in contractions were in utf8mb4_thai_520_w2,
  which contains only 2-character contractions.
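
The zero-weight removal described for my_ducet_weight_normalize() can be sketched roughly as below. This is only an illustration of the [AAAA][0000][BBBB] -> [AAAA][BBBB] idea, not the code from uca-dump: the function name normalize_weights_sketch(), its signature, and the choice to zero-pad the tail are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
  Hypothetical sketch, not the actual my_ducet_weight_normalize().
  Compacts a weight array in place by dropping zero weights and
  fills the freed tail with zeros.
  Returns the number of non-zero weights that remain.
*/
static size_t normalize_weights_sketch(uint16_t *w, size_t n)
{
  size_t src, dst= 0;
  assert(w != NULL);
  /* Move every non-zero weight towards the start of the array */
  for (src= 0; src < n; src++)
  {
    if (w[src] != 0)
      w[dst++]= w[src];
  }
  /* Zero the tail so the array stays zero-terminated */
  for (src= dst; src < n; src++)
    w[src]= 0;
  return dst;
}
```

Because the routine works on a plain array of weights, the same call can serve a single character and a contraction alike, which is the reuse the commit message refers to.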
parent 0736c03d
@@ -33720,16 +33720,11 @@ init_weight_level(MY_CHARSET_LOADER *loader, MY_COLL_RULES *rules,
   for (i= 0; i != src->contractions.nitems; i++)
   {
     MY_CONTRACTION *item= &src->contractions.item[i];
-    /*
-      TODO: calculate length from item->ch.
-      Generally contractions can consist of more than 2 characters.
-    */
-    uint length= 2;
+    uint length= my_wstrnlen(item->ch, array_elements(item->ch));
     uint16 *weights= my_uca_init_one_contraction(&dst->contractions,
                                                  item->ch, length,
                                                  item->with_context);
-    memcpy(weights, item->weight, length * sizeof(uint16));
-    weights[length]= 0;
+    memcpy(weights, item->weight, sizeof(item->weight));
   }
   return FALSE;
 }
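
The hunk above replaces the hard-coded length of 2 with a length computed from item->ch. A bounded length function of that kind might look like the sketch below. It assumes the character array is zero-padded after the last character of the contraction; the names wc_sketch_t and wstrnlen_sketch() and the array size of 6 used in the usage example are hypothetical and are not taken from ctype-uca.c.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the server's wide character type (assumption) */
typedef unsigned long wc_sketch_t;

/*
  Hypothetical sketch of a bounded wide-string length.
  Counts characters up to the first zero, but never reads
  past maxlen elements, so a completely filled array is safe.
*/
static size_t wstrnlen_sketch(const wc_sketch_t *s, size_t maxlen)
{
  size_t i;
  assert(s != NULL);
  for (i= 0; i < maxlen && s[i] != 0; i++)
  {
    /* nothing to do: the loop condition does the counting */
  }
  return i;
}
```

With a zero-padded array such as {0x0041, 0x0042, 0x0043, 0, 0, 0}, the sketch returns 3, so a three-character contraction is no longer truncated to its first two characters.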