
A structure containing data for charset+collation pair implementation. 

Virtual functions which use this data are collected
into separate handler structures such as MY_CHARSET_HANDLER.

typedef struct charset_info_st
{
  uint      number;
  uint      primary_number;
  uint      binary_number;
  uint      state;

  const char *csname;
  const char *name;
  const char *comment;

  uchar    *ctype;
  uchar    *to_lower;
  uchar    *to_upper;
  uchar    *sort_order;

  uint16      *tab_to_uni;
  MY_UNI_IDX  *tab_from_uni;

  uchar state_map[256];
  uchar ident_map[256];

  uint      strxfrm_multiply;
  uint      mbminlen;
  uint      mbmaxlen;
  uint16    max_sort_char; /* For LIKE optimization */
} CHARSET_INFO;



CHARSET_INFO fields description:

Numbers (identifiers)

number - an ID uniquely identifying this charset+collation pair.

primary_number - ID of a charset+collation pair, which consists
of the same character set and the default collation of this
character set. Not really used now. Intended to optimize some
parts of the code where we need to find the default collation
using its non-default counterpart for the given character set.

binary_number - ID of a charset+collation pair, which consists
of the same character set and the binary collation of this
character set. Not really used now.


Names

  csname  - name of the character set for this charset+collation pair.
  name    - name of the collation for this charset+collation pair.
  comment - a text comment, displayed in the "Description" column of
            SHOW CHARACTER SET output.

Conversion tables
  ctype      - pointer to array[257] of "type of characters"
               bit masks, one for each character, e.g. whether a
               character is a digit, a letter, a separator, etc.

               Monty 2004-10-21:
                 If you look at the macros, we use ctype[(char)+1].
                 ctype[0] is traditionally in most ctype libraries
                 reserved for EOF (-1). The idea is that you can use
                 the result from fgetc() directly with ctype[]. As
                 we have to be compatible with external ctype[] versions,
                 it's better to do it the same way as they do...

  to_lower   - pointer to array[256] used in LCASE()
  to_upper   - pointer to array[256] used in UCASE()
  sort_order - pointer to array[256] used for string comparison
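
The +1 offset from the note above can be sketched as follows. This is only an illustration, not the server's code: the SK_* flags, the sk_ctype table and the sk_isdigit()/sk_isalpha() helpers are hypothetical names, the flag values are made up, and the table is filled for ASCII letters and digits only.

```c
/* Hypothetical type flags; the real bit values may differ. */
#define SK_UPPER 0x01
#define SK_LOWER 0x02
#define SK_DIGIT 0x04

/* 257 entries: slot 0 is reserved for EOF (assumed to be -1),
   slots 1..256 hold the flags for character codes 0..255. */
static unsigned char sk_ctype[257];

/* Fill the table for ASCII letters and digits only. */
static void sk_init_ctype(void)
{
  int c;
  for (c = '0'; c <= '9'; c++) sk_ctype[c + 1] = SK_DIGIT;
  for (c = 'A'; c <= 'Z'; c++) sk_ctype[c + 1] = SK_UPPER;
  for (c = 'a'; c <= 'z'; c++) sk_ctype[c + 1] = SK_LOWER;
}

/* c is either -1 (EOF) or a value in 0..255, as returned by fgetc(). */
static int sk_isdigit(int c)
{
  return (sk_ctype[c + 1] & SK_DIGIT) != 0;
}

static int sk_isalpha(int c)
{
  return (sk_ctype[c + 1] & (SK_UPPER | SK_LOWER)) != 0;
}
```

After sk_init_ctype(), sk_isdigit('7') is true and sk_isalpha(-1) is false without any special-casing of the EOF value.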

Unicode conversion data
For 8bit character sets:

tab_to_uni  : array[256] of charset->Unicode translation
tab_from_uni: a structure for Unicode->charset translation

Non-8bit charsets have their own per-charset structures
hidden in the corresponding ctype-xxx.c file and don't use
the tab_to_uni and tab_from_uni tables.
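
A minimal sketch of the 8bit case, under stated assumptions: the charset below is invented (ASCII plus the euro sign at byte 0x80), all sk_* names are hypothetical, and the reverse direction uses a plain linear scan instead of the MY_UNI_IDX structure, whose layout is not described in this document.

```c
/* Hypothetical 8bit charset: bytes 0x00..0x7F are ASCII, byte 0x80 is
   the euro sign, every other byte is unmapped (0). */
static unsigned short sk_tab_to_uni[256];

static void sk_init_tab_to_uni(void)
{
  int b;
  for (b = 0; b < 0x80; b++) sk_tab_to_uni[b] = (unsigned short) b;
  sk_tab_to_uni[0x80] = 0x20AC;   /* EURO SIGN */
}

/* charset -> Unicode: a single array lookup. */
static unsigned short sk_to_uni(unsigned char b)
{
  return sk_tab_to_uni[b];
}

/* Unicode -> charset: linear scan over the forward table (a stand-in
   for an indexed lookup). Returns the byte, or -1 if there is no mapping. */
static int sk_from_uni(unsigned short wc)
{
  int b;
  if (wc == 0) return 0;
  for (b = 1; b < 256; b++)
    if (sk_tab_to_uni[b] == wc) return b;
  return -1;
}
```

With this table, sk_to_uni(0x80) gives 0x20AC and sk_from_uni(0x20AC) gives 0x80, while a code point outside the charset yields -1.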

Parser maps

The state_map[] and ident_map[] arrays are used to quickly identify
whether a character is part of an identifier, a digit, a special
character, or part of another SQL lexical item.

They can probably be combined with the ctype array in the future.
But for some reason these two arrays are used in the parser,
while a separate ctype[] array is used in other parts of the
code, like fulltext, etc.
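
As an illustration of how such maps let a lexer classify a byte with one array lookup, here is a small sketch. The state names, the classification rules and every sk_* identifier are assumptions made for the example; the real parser uses more states and different rules.

```c
/* Hypothetical lexer states; the real parser has more of them. */
enum sk_state { SK_ST_CHAR = 0, SK_ST_IDENT, SK_ST_NUMBER, SK_ST_SPACE };

static unsigned char sk_state_map[256];
static unsigned char sk_ident_map[256];

static void sk_init_maps(void)
{
  int c;
  for (c = 0; c < 256; c++)
  {
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
      sk_state_map[c] = SK_ST_IDENT;
    else if (c >= '0' && c <= '9')
      sk_state_map[c] = SK_ST_NUMBER;
    else if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
      sk_state_map[c] = SK_ST_SPACE;
    else
      sk_state_map[c] = SK_ST_CHAR;
    /* In this sketch a digit cannot start an identifier
       but may continue one. */
    sk_ident_map[c] = (sk_state_map[c] == SK_ST_IDENT ||
                       sk_state_map[c] == SK_ST_NUMBER);
  }
}

/* Length of the identifier starting at s, or 0 if none starts there. */
static unsigned sk_ident_len(const char *s)
{
  unsigned n = 0;
  if (sk_state_map[(unsigned char) s[0]] != SK_ST_IDENT) return 0;
  while (sk_ident_map[(unsigned char) s[n]]) n++;
  return n;
}
```

For the input "col_1 = 5", sk_ident_len() stops at the first space and reports 5 bytes.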

Misc fields

  strxfrm_multiply - the maximum factor by which a sort key (i.e. a
                     string which can be passed into memcmp() for
                     comparison) can be longer than the original string.
                     Usually it is 1. For some complex collations it
                     can be bigger. For example, in latin1_german2_ci
                     a sort key can be up to twice as long as the
                     original string, e.g. the letter 'A' with two dots
                     above is substituted with 'AE'.
  mbminlen         - minimum multibyte sequence length.
                     Now always 1 except for ucs2. For ucs2
                     it is 2.
  mbmaxlen         - maximum multibyte sequence length.
                     1 for 8bit charsets. Can also be 2 or 3.
  max_sort_char    - used for the LIKE range:
                     for 8bit character sets - the native code
                     of the maximum character (max_str pad byte);
                     for UTF8 and UCS2 - the Unicode code of the maximum
                     possible character (usually U+FFFF). This code is
                     converted to its multibyte representation (usually
                     0xEFBFBF) and then used as a pad sequence for max_str;
                     for other multibyte character sets -
                     the max_str pad byte (usually 0xFF).
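
The UTF8 case of max_sort_char can be sketched like this: U+FFFF is encoded once and the resulting 3-byte sequence is repeated after the LIKE prefix. The function names and the rule of stopping when a whole pad sequence no longer fits are assumptions of this sketch, not the server's behaviour.

```c
#include <stddef.h>
#include <string.h>

/* Encode a BMP code point (U+0000..U+FFFF) as UTF-8.
   Returns the number of bytes written (1..3). */
static size_t sk_utf8_encode(unsigned wc, unsigned char *out)
{
  if (wc < 0x80)
  {
    out[0] = (unsigned char) wc;
    return 1;
  }
  if (wc < 0x800)
  {
    out[0] = (unsigned char) (0xC0 | (wc >> 6));
    out[1] = (unsigned char) (0x80 | (wc & 0x3F));
    return 2;
  }
  out[0] = (unsigned char) (0xE0 | (wc >> 12));
  out[1] = (unsigned char) (0x80 | ((wc >> 6) & 0x3F));
  out[2] = (unsigned char) (0x80 | (wc & 0x3F));
  return 3;
}

/* Build max_str for a LIKE prefix: copy the prefix, then repeat the
   encoded max_sort_char while a whole sequence still fits.
   Returns the number of bytes written to max_str. */
static size_t sk_like_max_str(const char *prefix, size_t prefix_len,
                              unsigned max_sort_char,
                              unsigned char *max_str, size_t size)
{
  unsigned char pad[3];
  size_t pad_len = sk_utf8_encode(max_sort_char, pad);
  size_t pos = prefix_len < size ? prefix_len : size;
  memcpy(max_str, prefix, pos);
  while (pos + pad_len <= size)
  {
    memcpy(max_str + pos, pad, pad_len);
    pos += pad_len;
  }
  return pos;
}
```

For the prefix "ab", max_sort_char 0xFFFF and an 8-byte buffer, the result is the two prefix bytes followed by EF BF BF twice.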


MY_CHARSET_HANDLER is a collection of character-set
related routines, defined in m_ctype.h. It has the
following set of functions:

Multibyte routines
ismbchar()  - detects if the given string is a multibyte sequence
mbcharlen() - returns the length of the multibyte sequence starting
              with the given character
numchars()  - returns number of characters in the given string, e.g.
              in SQL function CHAR_LENGTH().
charpos()   - calculates the offset of the given position in the string.
              Used in SQL functions LEFT(), RIGHT(), SUBSTRING(), 

            - finds the length of the correctly formed multibyte beginning
              of a string. Used in INSERTs to cut off the beginning of
              the given string which is
              a) "well formed" according to the given character set
              b) can fit into the given data type
              Terminates the string at a good position, taking into
              account multibyte character boundaries.

lengthsp()  - returns the length of the given string without trailing spaces.
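
The routines above can be sketched for a simplified UTF-8 (BMP only, at most 3 bytes per character, no checks for overlong forms). All sk_* names and signatures are hypothetical and chosen for the example; the real handler functions take a charset argument and differ in detail.

```c
#include <stddef.h>

/* Length of a UTF-8 sequence judged by its first byte;
   0 means the byte cannot start a sequence. */
static size_t sk_mbcharlen(unsigned char c)
{
  if (c < 0x80) return 1;
  if (c >= 0xC2 && c <= 0xDF) return 2;
  if (c >= 0xE0 && c <= 0xEF) return 3;
  return 0;
}

/* Length in bytes of the longest well-formed prefix of s that holds
   at most max_chars characters. Stores the character count in *nchars. */
static size_t sk_well_formed_prefix(const unsigned char *s, size_t len,
                                    size_t max_chars, size_t *nchars)
{
  size_t pos = 0, n = 0;
  while (pos < len && n < max_chars)
  {
    size_t i, l = sk_mbcharlen(s[pos]);
    if (l == 0 || pos + l > len) break;        /* bad or truncated lead */
    for (i = 1; i < l; i++)
      if ((s[pos + i] & 0xC0) != 0x80) break;  /* bad continuation byte */
    if (i < l) break;
    pos += l;
    n++;
  }
  *nchars = n;
  return pos;
}

/* Number of characters in the well-formed part of the string. */
static size_t sk_numchars(const unsigned char *s, size_t len)
{
  size_t n;
  sk_well_formed_prefix(s, len, len, &n);
  return n;
}

/* Byte offset of character number pos (0-based). */
static size_t sk_charpos(const unsigned char *s, size_t len, size_t pos)
{
  size_t n;
  return sk_well_formed_prefix(s, len, pos, &n);
}
```

A 7-byte string holding one 1-byte, one 2-byte, one 1-byte and one 3-byte character counts as 4 characters. If only the first 6 bytes are passed, the prefix scan stops at byte offset 4, before the truncated last character.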

Unicode conversion routines
mb_wc       - converts the leftmost multibyte sequence into its Unicode code.
wc_mb       - converts the given Unicode code into a multibyte sequence.
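
Both conversion directions are easiest to show for a fixed-width encoding. The sketch below assumes 2-byte big-endian code units, in the spirit of a charset with mbminlen = mbmaxlen = 2. The function names, signatures and return conventions are hypothetical.

```c
#include <stddef.h>

/* Decode the leading 2-byte sequence of [s, e) into *wc.
   Returns bytes consumed (2), or 0 if fewer than 2 bytes remain. */
static size_t sk_ucs2_decode(unsigned long *wc,
                             const unsigned char *s, const unsigned char *e)
{
  if (e - s < 2) return 0;
  *wc = ((unsigned long) s[0] << 8) | s[1];
  return 2;
}

/* Encode wc into [s, e). Returns bytes written (2), or 0 if wc is
   outside the BMP or the buffer is too small. */
static size_t sk_ucs2_encode(unsigned long wc,
                             unsigned char *s, unsigned char *e)
{
  if (wc > 0xFFFF || e - s < 2) return 0;
  s[0] = (unsigned char) (wc >> 8);
  s[1] = (unsigned char) (wc & 0xFF);
  return 2;
}
```

Encoding U+0416 gives the bytes 04 16, and decoding them returns the same code. Returning 0 for "not enough room" is a simplification chosen for the sketch.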

Case and sort conversion
caseup_str  - converts the given 0-terminated string into the upper case
casedn_str  - converts the given 0-terminated string into the lower case
caseup      - converts the given string into the upper case using length
casedn      - converts the given string into the lower case using length
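
The difference between the 0-terminated and the length-based variants can be sketched for an 8bit charset with a to_upper-style table. The table here covers ASCII letters only and the sk_* names are hypothetical.

```c
#include <stddef.h>

static unsigned char sk_to_upper[256];

/* Identity mapping except ASCII a..z; a real charset table would also
   map its accented letters. */
static void sk_init_to_upper(void)
{
  int c;
  for (c = 0; c < 256; c++) sk_to_upper[c] = (unsigned char) c;
  for (c = 'a'; c <= 'z'; c++)
    sk_to_upper[c] = (unsigned char) (c - 'a' + 'A');
}

/* 0-terminated variant: stops at the first zero byte. */
static void sk_caseup_str(char *s)
{
  for (; *s; s++) *s = (char) sk_to_upper[(unsigned char) *s];
}

/* Length variant: converts exactly len bytes, embedded zeros included. */
static void sk_caseup(char *s, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    s[i] = (char) sk_to_upper[(unsigned char) s[i]];
}
```

The length variant keeps going past an embedded zero byte, which matters for values that are not C strings.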

Number-to-string conversion routines

The names are pretty self-descriptive.

String padding routines
fill()     - writes the given Unicode value into the given string
             with the given length. Used to pad the string, usually
             with space character, according to the given charset.
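
A minimal sketch of charset-aware padding follows. Unlike the routine described above, which receives a Unicode value, this hypothetical sk_fill() takes the pad character already encoded in the target charset, to keep the example short. Leaving leftover bytes untouched is also an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Fill buf[0..len) with repetitions of one encoded character.
   pad/pad_len is the character's byte sequence in the target charset,
   e.g. {0x20} for an 8bit charset or {0x00, 0x20} for a 2-byte one.
   Bytes left over when len is not a multiple of pad_len are untouched.
   Returns the number of bytes written. */
static size_t sk_fill(unsigned char *buf, size_t len,
                      const unsigned char *pad, size_t pad_len)
{
  size_t pos = 0;
  if (pad_len == 0) return 0;
  while (pos + pad_len <= len)
  {
    memcpy(buf + pos, pad, pad_len);
    pos += pad_len;
  }
  return pos;
}
```

With a 5-byte buffer, a 1-byte space fills all 5 bytes, while a 2-byte space fills only the first 4.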

String-to-number conversion routines

These functions do almost the same thing as their
STDLIB counterparts, but also:
  - accept a length instead of a 0-terminator
  - and are character set dependent
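
To illustrate the "length instead of 0-terminator" point, here is a sketch of a base-10 parser in the spirit of strtol(). The name sk_strntol10(), its signature and its overflow behaviour are assumptions. It handles only ASCII-compatible 8bit input, so the character set dependent part is not shown.

```c
#include <stddef.h>
#include <limits.h>

/* Parse a base-10 integer from s[0..len). No 0-terminator is needed.
   *end receives the offset of the first unparsed byte (0 if no digits
   were found). Overflow saturates at LONG_MAX or LONG_MIN. */
static long sk_strntol10(const char *s, size_t len, size_t *end)
{
  size_t i = 0, digits_start;
  int neg = 0, overflow = 0;
  unsigned long val = 0;
  unsigned long max;

  while (i < len && (s[i] == ' ' || s[i] == '\t')) i++;
  if (i < len && (s[i] == '+' || s[i] == '-')) neg = (s[i++] == '-');
  max = neg ? (unsigned long) LONG_MAX + 1UL : (unsigned long) LONG_MAX;

  digits_start = i;
  for (; i < len && s[i] >= '0' && s[i] <= '9'; i++)
  {
    unsigned long d = (unsigned long) (s[i] - '0');
    if (val > (max - d) / 10) overflow = 1;   /* val * 10 + d > max */
    else val = val * 10 + d;
  }
  *end = (i == digits_start) ? 0 : i;
  if (overflow) return neg ? LONG_MIN : LONG_MAX;
  if (neg) return val == max ? LONG_MIN : -(long) val;
  return (long) val;
}
```

Calling it on "123456" with a length of 3 parses only "123", something the 0-terminated STDLIB function cannot do without copying the data.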

Simple scanner routines
scan()    - skips leading spaces in the given string.
            Used when a string value is inserted into a numeric field.
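
A sketch of the space-skipping step, assuming an 8bit ASCII-compatible charset in which a space is the single byte 0x20. The name and signature of sk_scan_spaces() are hypothetical.

```c
#include <stddef.h>

/* Number of leading space bytes in [b, e). */
static size_t sk_scan_spaces(const char *b, const char *e)
{
  const char *p = b;
  while (p < e && *p == ' ') p++;
  return (size_t) (p - b);
}
```

A caller inserting "   42" into a numeric field would skip the 3 bytes reported here and hand the rest to a string-to-number routine.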

Collation routines
strnncoll()   - compares two strings according to the given collation
strnncollsp() - like the above but ignores trailing spaces
strnxfrm()    - makes a sort key suitable for memcmp() corresponding
                to the given string
like_range()  - creates a LIKE range, for the optimizer
wildcmp()     - wildcard comparison, for LIKE
strcasecmp()  - 0-terminated string comparison
instr()       - finds the first substring appearance in the string
hash_sort()   - calculates hash value taking into account
                the collation rules, e.g. case-insensitivity, 
                accent sensitivity, etc.
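
Two of the routines above can be sketched for a simple 8bit collation driven by a sort_order-style weight table. The table is an invented case-insensitive ASCII one and the sk_* names are hypothetical. Trailing spaces are handled by stripping them before comparison, which is a simplification.

```c
#include <stddef.h>

/* Hypothetical case-insensitive weights: every byte sorts as its
   upper-case ASCII form. */
static unsigned char sk_sort_order[256];

static void sk_init_sort_order(void)
{
  int c;
  for (c = 0; c < 256; c++) sk_sort_order[c] = (unsigned char) c;
  for (c = 'a'; c <= 'z'; c++)
    sk_sort_order[c] = (unsigned char) (c - 'a' + 'A');
}

/* Sort key: one weight per input byte, so the key is never longer
   than the source (a multiplier of 1). Returns the key length. */
static size_t sk_strnxfrm(unsigned char *dst, size_t dst_len,
                          const char *src, size_t src_len)
{
  size_t i, n = src_len < dst_len ? src_len : dst_len;
  for (i = 0; i < n; i++) dst[i] = sk_sort_order[(unsigned char) src[i]];
  return n;
}

/* Compare by weight, ignoring trailing spaces. */
static int sk_strnncollsp(const char *a, size_t a_len,
                          const char *b, size_t b_len)
{
  size_t i, n;
  while (a_len > 0 && a[a_len - 1] == ' ') a_len--;
  while (b_len > 0 && b[b_len - 1] == ' ') b_len--;
  n = a_len < b_len ? a_len : b_len;
  for (i = 0; i < n; i++)
  {
    int wa = sk_sort_order[(unsigned char) a[i]];
    int wb = sk_sort_order[(unsigned char) b[i]];
    if (wa != wb) return wa < wb ? -1 : 1;
  }
  return a_len == b_len ? 0 : (a_len < b_len ? -1 : 1);
}
```

Under these weights "abc  " and "ABC" compare as equal, and the sort keys of "Hello" and "hELLO" are byte-identical, so memcmp() on the keys gives the same answer as the collation-aware comparison.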