nexedi / MariaDB
Commit 68acb479, authored May 28, 2008 by Alexander Barkov
Updating charset doc files.
Thanks to Paul for preparing the up-to-date files reflecting 4.1 changes.
parent 4396f7c2
Showing 2 changed files with 113 additions and 68 deletions (+113 -68)
sql/share/charsets/README +19 -20
strings/CHARSET_INFO.txt +94 -48
sql/share/charsets/README
This directory holds configuration files which allow MySQL to work with
This directory holds configuration files that enable MySQL to work with
different character sets. It contains:
*.conf
Each conf file contains four tables which describe character types,
charset_name.xml
Each charset_name.xml file contains information for a simple character
set. The information in the file describes character types,
lower- and upper-case equivalencies and sorting orders for the
character values in the set.
Index
The Index file lists all of the available charset configurations.
Index.xml
The Index.xml file lists all of the available charset configurations,
including collations.
Each charset is paired with a number. The number is stored
IN THE DATABASE TABLE FILES and must not be changed. Always
add new character sets to the end of the list, so that the
numbers of the other character sets will not be changed.
Each collation must have a unique number. The number is stored
IN THE DATABASE TABLE FILES and must not be changed.
The max-id attribute of the <charsets> element must be set to
the largest collation number.
Compiled in or configuration file?
When should a character set be compiled in to MySQL's string library
(libmystrings), and when should it be placed in a configuration file?
(libmystrings), and when should it be placed in a charset_name.xml configuration file?
If the character set requires the strcoll functions or is a
multi-byte character set, it MUST be compiled in to the string
library. If it does not require these functions, it should be
placed in a configuration file.
placed in a charset_name.xml configuration file.
If the character set uses any one of the strcoll functions, it
must define all of them. Likewise, if the set uses one of the
...
...
@@ -30,11 +33,7 @@ Compiled in or configuration file?
more information on how to add a complex character set to MySQL.
Syntax of configuration files
The syntax is very simple. Comments start with a '#' character and
proceed to the end of the line. Words are separated by arbitrary
amounts of whitespace.
For the character set configuration files, every word must be a
number in hexadecimal format. The ctype array takes up the first
257 words; the to_lower, to_upper and sort_order arrays take up 256
words each after that.
The syntax is very simple. Words in <map> array elements are
separated by arbitrary amounts of whitespace. Each word must be a
number in hexadecimal format. The ctype array has 257 words; the
other arrays (lower, upper, etc.) take up 256 words each after that.
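
As a rough illustration of the word syntax described above, the following sketch reads whitespace-separated hexadecimal words into a byte array. It is not the loader MySQL uses; `parse_hex_words` is a hypothetical helper, and the surrounding XML (`<map>` elements and so on) is assumed to have been stripped already.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not a MySQL function.
   Parses whitespace-separated hexadecimal words from text into out[].
   Returns the number of words parsed, or -1 if a word is malformed,
   does not fit in one byte, or there are more than max_words words. */
static int parse_hex_words(const char *text, unsigned char *out,
                           size_t max_words)
{
  size_t n = 0;
  assert(text != NULL && out != NULL);
  for (;;)
  {
    char *end;
    unsigned long v;
    while (isspace((unsigned char) *text))  /* arbitrary whitespace */
      text++;
    if (*text == '\0')
      break;
    if (n == max_words)
      return -1;                            /* too many words */
    v = strtoul(text, &end, 16);            /* each word is hexadecimal */
    if (end == text || v > 0xFF)
      return -1;                            /* not hex, or out of range */
    out[n++] = (unsigned char) v;
    text = end;
  }
  return (int) n;
}
```

A caller would pass `max_words` of 257 for the ctype array and 256 for the other arrays, and treat any other return value as an error.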
strings/CHARSET_INFO.txt
...
...
@@ -3,9 +3,8 @@ CHARSET_INFO
============
A structure containing data for charset+collation pair implementation.
Virtual functions which use this data are collected
into separate structures MY_CHARSET_HANDLER and
MY_COLLATION_HANDLER.
Virtual functions that use this data are collected into separate
structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER.
typedef struct charset_info_st
...
...
@@ -56,7 +55,7 @@ character set. Not really used now. Intended to optimize some
parts of the code where we need to find the default collation
using its non-default counterpart for the given character set.
binary_numner - ID of a charset+collation pair, which consists
binary_number - ID of a charset+collation pair, which consists
of the same character set and the binary collation of this
character set. Not really used now.
...
...
@@ -65,15 +64,15 @@ Names
csname - name of the character set for this charset+collation pair.
name - name of the collation for this charset+collation pair.
comment - a text comment, dysplayed in "Description" column of
comment - a text comment, displayed in "Description" column of
SHOW CHARACTER SET output.
Conversion tables
-----------------
ctype - pointer to array[257] of "type of characters"
bit mask for each chatacter, e.g. if a
character is a digit or a letter or a separator, etc.
bit mask for each character, e.g., whether a
character is a digit, letter, separator, etc.
Monty 2004-10-21:
If you look at the macros, we use ctype[(char)+1].
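
The `ctype[(char)+1]` indexing can be sketched as follows. This is only an illustration: the flag names and the `sk_` helpers are hypothetical, and the reading that slot 0 is reserved for a value of -1 (such as EOF) is an assumption, since the rest of the note is not shown in this excerpt.

```c
#include <assert.h>

/* Hypothetical bit flags; the real flag names are not shown here. */
#define SK_DIGIT 1
#define SK_ALPHA 2

/* 257 entries: slot 0 is assumed to stand for -1, slots 1..256
   hold the masks for byte values 0..255. */
static unsigned char sk_ctype[257];

/* Fill the table for plain ASCII digits and letters only. */
static void sk_init(void)
{
  int c;
  sk_ctype[0] = 0;
  for (c = 0; c < 256; c++)
  {
    unsigned char mask = 0;
    if (c >= '0' && c <= '9')
      mask |= SK_DIGIT;
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
      mask |= SK_ALPHA;
    sk_ctype[c + 1] = mask;
  }
}

/* Look up a character class using the +1 offset. */
static int sk_test(int c, int mask)
{
  assert(c >= -1 && c <= 255);
  return (sk_ctype[c + 1] & mask) != 0;
}
```

With this layout a value of -1 can be passed to the lookup without a separate range check, which is one possible motivation for the extra slot.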
...
...
@@ -87,17 +86,64 @@ Conversion tables
to_upper - pointer to array[256] used in UCASE()
sort_order - pointer to array[256] used for strings comparison
In all Asian charsets these arrays are set up as follows:
- All bytes in the range 0x80..0xFF were marked as letters in the
ctype array.
- The to_lower and to_upper arrays map only ASCII letters.
UPPER() and LOWER() doesn't really work for multi-byte characters.
Most of the characters in Asian character sets are ideograms
anyway and they don't have case mapping. However, there are
still some characters from European alphabets.
For example:
_ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE
_ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE
But they don't map to each other with UPPER and LOWER operations.
- The sort_order array is filled case insensitively for the
ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte
range 0x80..0xFF for these collations:
cp932_japanese_ci,
euckr_korean_ci,
eucjpms_japanese_ci,
gb2312_chinese_ci,
sjis_japanese_ci,
ujis_japanese_ci.
So multi-byte characters are sorted just according to their codes.
- Two collations are still case insensitive for the ASCII characters,
but have special sorting order for multi-byte characters
(something more complex than just according to codes):
big5_chinese_ci
gbk_chinese_ci
So handlers for these collations use only the 0x00..0x7F part
of their sort_order arrays, and apply the special functions
for multi-byte characters
In Unicode character sets we have full support of UPPER/LOWER mapping,
for sorting order, and for character type detection.
"utf8_general_ci" still has the "old-fashioned" arrays
like to_upper, to_lower, sort_order and ctype, but they are
not really used (maybe only in some rare legacy functions).
Unicode conversion data
-----------------------
For 8bit character sets:
For 8-bit character sets:
tab_to_uni : array[256] of charset->Unicode translation
tab_from_uni: a structure for Unicode->charset translation
Non-8bit charsets have their own structures per charset
hidden in correspondent ctype-xxx.c file and don't use
Non-8-bit charsets have their own structures per charset
hidden in corresponding ctype-xxx.c file and don't use
tab_to_uni and tab_from_uni tables.
...
...
@@ -106,9 +152,9 @@ Parser maps
state_map[]
ident_map[]
These maps are to quickly identify if a character is
an identificator part, a digit, a special character,
or a part of other SQL language lexical item.
These maps are used to quickly identify whether a character is an
identifier part, a digit, a special character, or a part of another
SQL language lexical item.
Probably can be combined with ctype array in the future.
But for some reasons these two arrays are used in the parser,
...
...
@@ -116,32 +162,32 @@ while a separate ctype[] array is used in the other part of the
code, like fulltext, etc.
Misc fields
-----------
Miscellaneous fields
--------------------
strxfrm_multiply - how many times a sort key (i.e. a string
which can be passed into memcmp() for comparison)
strxfrm_multiply - how many times a sort key (that is, a string
that can be passed into memcmp() for comparison)
can be longer than the original string.
Usually it is 1. For some complex
collations it can be bigger. For example
collations it can be bigger. For example,
in latin1_german2_ci, a sort key is up to
twice longer than the original string.
two times longer than the original string.
e.g. Letter 'A' with two dots above is
substituted with 'AE'.
mbminlen - mininum multibyte sequence length.
Now always 1 except ucs2. For ucs2
mbminlen - minimum multi-byte sequence length.
Now always 1 except for ucs2. For ucs2,
it is 2.
mbmaxlen - maximum multibyte sequence length.
1 for 8bit charsets. Can be also 2 or 3.
mbmaxlen - maximum multi-byte sequence length.
1 for 8-bit charsets. Can be also 2 or 3.
max_sort_char - for LIKE range
in case of 8bit character sets - native code
in case of 8-bit character sets - native code
of maximum character (max_str pad byte);
in case of UTF8 and UCS2 - Unicode code of the maximum
possible character (usually U+FFFF). This code is
converted to multibyte representation (usually 0xEFBFBF)
converted to multi-byte representation (usually 0xEFBFBF)
and then used as a pad sequence for max_str.
in case of other multibyte character sets -
in case of other multi-byte character sets -
max_str pad byte (usually 0xFF).
MY_CHARSET_HANDLER
...
...
@@ -151,10 +197,10 @@ MY_CHARSET_HANDLER is a collection of character-set
related routines. Defined in m_ctype.h. Have the
following set of functions:
Multibyte routines
Multi-byte routines
------------------
ismbchar() - detects if the given string is a multibyte sequence
mbcharlen() - returns length of multibyte sequence starting with
ismbchar() - detects whether the given string is a multi-byte sequence
mbcharlen() - returns length of multi-byte sequence starting with
the given character
numchars() - returns number of characters in the given string, e.g.
in SQL function CHAR_LENGTH().
...
...
@@ -163,29 +209,29 @@ charpos() - calculates the offset of the given position in the string.
INSERT()
well_formed_length()
- finds the length of correctly formed multybyte beginning.
- finds the length of correctly formed multi-byte beginning.
Used in INSERTs to cut a beginning of the given string
which is
a) "well formed" according to the given character set.
b)  can fit into the given data type
b) can fit into the given data type
Terminates the string in the good position, taking in account
multibyte character boundaries.
multi-byte character boundaries.
lengthsp() - returns the length of the given string without traling spaces.
lengthsp() - returns the length of the given string without trailing spaces.
Unicode conversion routines
---------------------------
mb_wc - converts the left multibyte sequence into it Unicode code.
mc_mb - converts the given Unicode code into multibyte sequence.
mb_wc - converts the left multi-byte sequence into its Unicode code.
mc_mb - converts the given Unicode code into multi-byte sequence.
Case and sort convertion
Case and sort conversion
------------------------
caseup_str - converts the given 0-terminated string into the upper case
casedn_str - converts the given 0-terminated string into the lower case
caseup - converts the given string into the lower case using length
casedn - converts the given string into the lower case using length
caseup_str - converts the given 0-terminated string to upper case
casedn_str - converts the given 0-terminated string to lower case
caseup - converts the given string to lower case using length
casedn - converts the given string to lower case using length
Number-to-string conversion routines
------------------------------------
...
...
@@ -193,7 +239,7 @@ snprintf()
long10_to_str()
longlong10_to_str()
The names are pretty self-descripting.
The names are pretty self-describing.
String padding routines
-----------------------
...
...
@@ -201,7 +247,7 @@ fill() - writes the given Unicode value into the given string
with the given length. Used to pad the string, usually
with space character, according to the given charset.
String-to-numner conversion routines
String-to-number conversion routines
------------------------------------
strntol()
strntoul()
...
...
@@ -209,10 +255,10 @@ strntoll()
strntoull()
strntod()
These functions are almost for the same thing with their
STDLIB counterparts, but also:
These functions are almost the same as their STDLIB counterparts,
but also:
- accept length instead of 0-terminator
- and are character set dependant
- are character set dependent
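
The "length instead of 0-terminator" point can be illustrated with a simplified decimal parser that never reads past the given length. This is a sketch only: `sk_strntol10` is a hypothetical name with a hypothetical signature, it handles base 10 only, and it assumes a single-byte ASCII-compatible character set, so it leaves out the character-set-dependent part entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not the MySQL strntol() handler.
   Parses a decimal integer from s[0..len-1]; the buffer does not need
   to be 0-terminated. *consumed receives the number of bytes used
   (0 if no digits were found). Out-of-range values are clamped to
   LONG_MIN or LONG_MAX. */
static long sk_strntol10(const char *s, size_t len, size_t *consumed)
{
  size_t i = 0, first_digit;
  int negative = 0, overflow = 0;
  unsigned long limit, acc = 0;

  assert(s != NULL && consumed != NULL);
  while (i < len && (s[i] == ' ' || s[i] == '\t'))   /* leading blanks */
    i++;
  if (i < len && (s[i] == '+' || s[i] == '-'))
    negative = (s[i++] == '-');
  limit = negative ? (unsigned long) LONG_MAX + 1UL
                   : (unsigned long) LONG_MAX;
  first_digit = i;
  while (i < len && s[i] >= '0' && s[i] <= '9')
  {
    unsigned long d = (unsigned long) (s[i] - '0');
    if (acc > (limit - d) / 10)     /* acc * 10 + d would exceed limit */
      overflow = 1;
    else
      acc = acc * 10 + d;
    i++;
  }
  if (i == first_digit)             /* no digits at all */
  {
    *consumed = 0;
    return 0;
  }
  *consumed = i;
  if (overflow)
    return negative ? LONG_MIN : LONG_MAX;
  if (negative)
    return acc == limit ? LONG_MIN : -(long) acc;
  return (long) acc;
}
```

Called on the five bytes `12345` with a length of 3, the sketch stops after `123`, which is the behavior a 0-terminated `strtol()` cannot provide on an unterminated buffer.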
Simple scanner routines
-----------------------
...
...
@@ -230,9 +276,9 @@ strnxfrm() - makes a sort key suitable for memcmp() corresponding
like_range() - creates a LIKE range, for optimizer
wildcmp() - wildcard comparison, for LIKE
strcasecmp() - 0-terminated string comparison
instr() - finds the first substring appearence in the string
hash_sort() - calculates hash value taking in account
instr() - finds the first substring appearance in the string
hash_sort() - calculates hash value taking into account
the collation rules, e.g. case-insensitivity,
accent sensitivity, etc.
\ No newline at end of file