    unicode: introduce UTF-8 character database · 955405d1
    Gabriel Krisman Bertazi authored
    The decomposition and casefolding of UTF-8 characters are described in a
    prefix tree in utf8data.h, which is generated from the Unicode
    Character Database (UCD), published by the Unicode Consortium, and
    should not be edited by hand.  The structures in utf8data.h are meant to
    be used for lookup operations by the unicode subsystem when decoding a
    UTF-8 string.
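
    As a rough illustration of the kind of byte-wise prefix-tree lookup
    described above, the sketch below walks a tiny hand-written trie.  The
    struct trie_node layout, the table contents and trie_lookup() are all
    hypothetical; the generated utf8data.h uses its own packed format,
    which is not reproduced here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical node layout: NOT the packed format used in utf8data.h. */
struct trie_node {
	unsigned char byte;            /* UTF-8 byte matched by this node */
	const char *fold;              /* mapping if a character ends here, else NULL */
	const struct trie_node *child; /* nodes for the next byte */
	size_t nchild;                 /* number of nodes in child[] */
};

/* Second bytes of U+00C0 (C3 80) and U+00C9 (C3 89). */
static const struct trie_node c3_children[] = {
	{ 0x80, "a\xCC\x80", NULL, 0 }, /* A WITH GRAVE -> a + U+0300 */
	{ 0x89, "e\xCC\x81", NULL, 0 }, /* E WITH ACUTE -> e + U+0301 */
};

static const struct trie_node root[] = {
	{ 'A',  "a",  NULL,        0 },
	{ 0xC3, NULL, c3_children, 2 },
};

/*
 * Walk the trie one byte at a time.  Returns the decomposed, casefolded
 * mapping for the character at the start of s, or NULL if the character
 * is not in this toy table (or the sequence is truncated).
 */
static const char *trie_lookup(const char *s)
{
	const struct trie_node *level = root;
	size_t n = sizeof(root) / sizeof(root[0]);

	while (*s) {
		const struct trie_node *hit = NULL;

		for (size_t i = 0; i < n; i++)
			if (level[i].byte == (unsigned char)*s)
				hit = &level[i];
		if (!hit)
			return NULL;
		if (hit->fold)
			return hit->fold;
		level = hit->child;
		n = hit->nchild;
		s++;
	}
	return NULL;
}
```

    In this sketch, trie_lookup("\xC3\x89") yields the bytes for "e"
    followed by U+0301, while a character missing from the table yields
    NULL so that the caller can keep the input bytes unchanged.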
    
    mkutf8data.c is the source for a program that generates utf8data.h. It
    was written by Olaf Weber from SGI and originally proposed to be merged
    into Linux in 2014.  The original proposal performed the compatibility
    decomposition, NFKD, but the current version was modified by me to do
    canonical decomposition, NFD, as suggested by the community.  The
    changes from the original submission are:
    
      * Rebase to mainline.
      * Fix out-of-tree build.
      * Update makefile to build 11.0.0 UCD files.
      * Drop references to XFS.
      * Convert NFKD to NFD.
      * Merge back robustness fixes from original patch. Requested by
        Dave Chinner.
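
    To make the NFKD to NFD conversion in the list above concrete, here
    is a minimal sketch of how the two forms differ.  The three-entry
    table and the decompose() helper are hypothetical and only mirror
    well-known UCD mappings; they are not taken from mkutf8data.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one code point and its decomposition. */
struct decomp {
	uint32_t cp;
	int compat_only;  /* 1: mapping applies to NFKD but not to NFD */
	uint32_t to[2];
	size_t len;
};

static const struct decomp table[] = {
	{ 0x00E9, 0, { 0x0065, 0x0301 }, 2 }, /* e with acute: canonical */
	{ 0x00B2, 1, { 0x0032, 0      }, 1 }, /* superscript two: compat */
	{ 0xFB01, 1, { 0x0066, 0x0069 }, 2 }, /* fi ligature: compat     */
};

/*
 * Decompose cp into out[] and return the number of code points written.
 * With nfkd == 0 only canonical mappings are applied (NFD); with
 * nfkd != 0 compatibility mappings are applied as well (NFKD).
 */
static size_t decompose(uint32_t cp, int nfkd, uint32_t out[2])
{
	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (table[i].cp != cp)
			continue;
		if (table[i].compat_only && !nfkd)
			break; /* NFD leaves compatibility characters alone */
		for (size_t j = 0; j < table[i].len; j++)
			out[j] = table[i].to[j];
		return table[i].len;
	}
	out[0] = cp;
	return 1;
}
```

    Under NFD the ligature U+FB01 is left untouched, whereas NFKD would
    rewrite it to "f" followed by "i"; the accented U+00E9 decomposes the
    same way in both forms.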
    
    The original submission is archived at:
    
    <https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs>
    
    The utf8data.h file can be regenerated using the instructions in
    fs/unicode/README.utf8data.
    
    - Notes on the update from 8.0.0 to 11.0.0:
    
    The structure of the UCD files and the special cases did not change
    between versions 8.0.0 and 11.0.0.  8.0.0 saw the addition of
    Cherokee LC characters, which is an interesting case for
    case-folding.  The update is accompanied by new tests in the test_ucd
    module to catch specific cases.  No changes to the mkutf8data script
    were required for the updates.
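
    For readers wondering why Cherokee is an interesting case: in the
    UCD's CaseFolding.txt the lowercase letters added in 8.0.0 fold to
    the older uppercase code points, the reverse of most scripts.  The
    helper below is a hypothetical sketch of that mapping, not code from
    mkutf8data.c or test_ucd; the ranges are worth checking against
    CaseFolding.txt for the UCD version in use.

```c
#include <stdint.h>

/*
 * Hypothetical helper: fold a Cherokee lowercase code point to the
 * uppercase code point that case folding maps it to.  Any other code
 * point is returned unchanged.
 */
static uint32_t fold_cherokee(uint32_t cp)
{
	/* U+AB70..U+ABBF fold to U+13A0..U+13EF. */
	if (cp >= 0xAB70 && cp <= 0xABBF)
		return cp - 0xAB70 + 0x13A0;
	/* U+13F8..U+13FD fold to U+13F0..U+13F5. */
	if (cp >= 0x13F8 && cp <= 0x13FD)
		return cp - 0x13F8 + 0x13F0;
	return cp;
}
```

    For example, fold_cherokee(0xAB70) gives 0x13A0, while an uppercase
    letter such as 0x13A0 is already in folded form and passes through.
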
    Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
utf8data.h 1.08 MB