    console UTF-8 fixes · 2f1a2ccb
    Egmont Koblinger authored
    The UTF-8 part of the vt driver suffers from the following issues which are
    addressed in my patch:
    
    1) If there's no glyph found for a particular valid UTF-8 character, we try
       to display U+FFFD. However if this one is not found either, here's what
       the current kernel does:
    
       - First, if the Unicode value is less than the number of glyphs, use the
         glyph directly from that position of the glyph table. While this may be
         a good idea in the 8-bit world, it makes absolutely no sense with
         Unicode in mind. For example, if a Latin-2 font is loaded and an
         application prints U+00FB ("u with circumflex", not present in Latin-2)
         then as a fallback solution the glyph from the 0xFB position of the
         Latin-2 fontset (which is a "u with double accent" - a different
         character) is displayed.
    
       - Second, if this fallback fails too, a simple ASCII question mark is
         printed, which is visually indistinguishable from a real question mark.
    
       I changed the code to skip the first step (except in non-UTF-8 mode),
       and changed the second step to print the question mark with inverse color
       attributes, so it is visually clear that it's not a real question mark,
       and it more closely resembles the common glyph of U+FFFD.
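
    The fallback order described above can be sketched roughly as follows. This
    is only an illustration, not the code from the patch: choose_glyph(),
    struct glyph_choice, glyph_lookup_fn and toy_ascii_lookup() are hypothetical
    names, and the toy font covers printable ASCII only.

```c
#include <stdbool.h>
#include <stdint.h>

#define UCS_REPLACEMENT 0xFFFDu

/* Hypothetical result type: which font slot to draw and whether to invert it. */
struct glyph_choice {
    int slot;
    bool inverse;
};

/* Hypothetical font lookup: returns a font slot, or -1 if there is no glyph. */
typedef int (*glyph_lookup_fn)(uint32_t ucs);

/* Toy font for illustration only: it covers printable ASCII and nothing else. */
static int toy_ascii_lookup(uint32_t ucs)
{
    return (ucs >= 0x20 && ucs <= 0x7E) ? (int)ucs : -1;
}

static struct glyph_choice choose_glyph(uint32_t ucs, bool utf8_mode,
                                        uint32_t font_glyphs,
                                        glyph_lookup_fn lookup)
{
    struct glyph_choice g = { -1, false };

    /* 1. The character itself. */
    g.slot = lookup(ucs);
    if (g.slot >= 0)
        return g;

    /* 2. The replacement character U+FFFD. */
    g.slot = lookup(UCS_REPLACEMENT);
    if (g.slot >= 0)
        return g;

    /* 3. Direct font position, kept only for non-UTF-8 (8-bit) mode. */
    if (!utf8_mode && ucs < font_glyphs) {
        g.slot = (int)ucs;
        return g;
    }

    /* 4. Last resort: a question mark drawn with inverse attributes. */
    g.slot = lookup('?');
    g.inverse = true;
    return g;
}
```

    With the toy font, U+00FB in UTF-8 mode ends up as an inverse question mark,
    while in non-UTF-8 mode it still maps to font position 0xFB.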
    
    2) The UTF-8 decoder is buggy in many ways:
    
       - Lone continuation bytes (section 3.1 of Markus Kuhn's UTF-8 stress
         test) are not caught; they are displayed as some "random" glyphs (taken
         directly from the font table, see above) instead of the replacement
         character.
    
       - Incomplete sequences (sections 3.2 and 3.3 of the stress test) emit no
         replacement character, but rather cause the subsequent valid character
         to be displayed multiple times(!).
    
       - The decoder is not safe: overlong sequences are currently not caught,
         they are displayed as if they were valid representations. This may
         even have security implications.
    
       - The decoder does not handle D800..DFFF and FFFE..FFFF specially, it
         just emits these code points and lets them be looked up in the glyph
         table. Since these are invalid code points, I replace them by U+FFFD
         and hence give no chance for them to be looked up in the glyph table.
         (Assuming no font ships glyphs for these code points, this change is
         not visible to the users since the glyph shown will be the same.)
    
       With my fixes to the decoder it now behaves exactly as Markus Kuhn's
       stress test recommends.
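
    To make the four failure classes above concrete, here is a minimal
    buffer-based sketch of a strict decoder. It is not the decoder from the
    patch (the vt driver consumes bytes one at a time); utf8_decode_one() and
    struct utf8_result are hypothetical names, and the sketch assumes the
    4-byte limit of RFC 3629, so bytes F8..FF and values above U+10FFFF are
    also rejected.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_REPLACEMENT 0xFFFDu

/* Hypothetical result type: decoded code point and number of bytes consumed. */
struct utf8_result {
    uint32_t cp;
    size_t consumed;
};

/* Decodes one character from s[0..len-1]; every malformed unit yields U+FFFD. */
static struct utf8_result utf8_decode_one(const unsigned char *s, size_t len)
{
    /* Smallest code point that legitimately needs a sequence of N bytes. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    struct utf8_result r = { UTF8_REPLACEMENT, 0 };
    uint32_t cp;
    size_t need, i;

    if (len == 0)
        return r;

    r.consumed = 1;
    if (s[0] < 0x80) {                  /* plain ASCII */
        r.cp = s[0];
        return r;
    }
    if (s[0] < 0xC0)                    /* lone continuation byte */
        return r;
    if (s[0] < 0xE0) {
        need = 2; cp = s[0] & 0x1F;
    } else if (s[0] < 0xF0) {
        need = 3; cp = s[0] & 0x0F;
    } else if (s[0] < 0xF8) {
        need = 4; cp = s[0] & 0x07;
    } else {
        return r;                       /* F8..FF: rejected in this sketch */
    }

    for (i = 1; i < need; i++) {
        if (i >= len || (s[i] & 0xC0) != 0x80) {
            r.consumed = i;             /* incomplete: one U+FFFD, keep next byte */
            return r;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    r.consumed = need;
    if (cp < min_value[need])           /* overlong encoding */
        return r;
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates */
        return r;
    if (cp == 0xFFFE || cp == 0xFFFF)   /* noncharacters */
        return r;
    if (cp > 0x10FFFF)                  /* beyond the Unicode range */
        return r;

    r.cp = cp;
    return r;
}
```

    An incomplete sequence consumes only the bytes that belonged to it, so the
    following valid character is decoded exactly once.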
    
    3) The driver has no concept of double-width (CJK) characters. It's way beyond the
       scope of my patch to try to display them, but at least I think it's
       important for the cursor to jump two positions when printing such
       characters, since this is what applications (such as text editors)
       expect. Currently the cursor only jumps one position, and hence
       applications suffer from displaying and refreshing problems, and editing
       some English letters that are preceded by some CJK characters in the same
       line is a nightmare. With my patch an additional space is inserted after
       the CJK character has been printed (which usually means a replacement
       symbol of course). (If U+FFFD isn't available and hence an inverse
       question mark is displayed in the first cell, I keep the inverted state
       for the space in the 2nd column so it's quite easy to see that they are
       tied together.)
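
    A rough sketch of the two-cell behaviour described above, under stated
    assumptions: the range table is an illustrative subset of East Asian wide
    ranges (not the table used by the patch), struct cell, is_double_width()
    and put_glyph() are hypothetical names, and line wrapping is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ucs_range {
    uint32_t first, last;
};

/* Illustrative subset of double-width ranges; not the table from the patch. */
static const struct ucs_range wide_ranges[] = {
    { 0x1100, 0x115F },   /* Hangul Jamo initial consonants */
    { 0x2E80, 0x303E },   /* CJK radicals, symbols and punctuation */
    { 0x3040, 0xA4CF },   /* Kana, CJK ideographs, Yi */
    { 0xAC00, 0xD7A3 },   /* Hangul syllables */
    { 0xF900, 0xFAFF },   /* CJK compatibility ideographs */
    { 0xFE30, 0xFE6F },   /* CJK compatibility forms */
    { 0xFF00, 0xFF60 },   /* fullwidth forms */
    { 0xFFE0, 0xFFE6 },   /* fullwidth signs */
};

static bool is_double_width(uint32_t ucs)
{
    size_t i;

    for (i = 0; i < sizeof(wide_ranges) / sizeof(wide_ranges[0]); i++)
        if (ucs >= wide_ranges[i].first && ucs <= wide_ranges[i].last)
            return true;
    return false;
}

/* Hypothetical screen cell: the displayed character and its inverse flag. */
struct cell {
    uint32_t ch;
    bool inverse;
};

/*
 * Writes ucs at row[col] (requires 0 <= col < ncols) and returns the new
 * cursor column. Without a glyph, an inverse '?' is shown instead. A
 * double-width character also fills the next cell with a space that keeps
 * the inverse state of the first cell.
 */
static int put_glyph(struct cell *row, int ncols, int col,
                     uint32_t ucs, bool have_glyph)
{
    bool inverse = !have_glyph;

    row[col].ch = have_glyph ? ucs : (uint32_t)'?';
    row[col].inverse = inverse;
    col++;

    if (is_double_width(ucs) && col < ncols) {
        row[col].ch = ' ';
        row[col].inverse = inverse;
        col++;
    }
    return col;
}
```

    The cursor therefore advances by two columns for a CJK character even when
    only a replacement symbol can be drawn for it.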
    
    4) There is a small built-in table of zero-width spaces that are not to be
       printed but silently skipped. U+200A is included there, but it's not a
       zero-width character, so I remove it from there.
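
    As an illustration of the corrected check, here is a minimal sketch.
    The exact contents of the kernel's table are an assumption here, and
    is_zero_width() is a hypothetical name; the point is only that the range
    starts at U+200B (ZERO WIDTH SPACE) rather than U+200A (HAIR SPACE).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed set of silently skipped characters: U+200B..U+200F and U+FEFF. */
static bool is_zero_width(uint32_t ucs)
{
    /*
     * U+200B ZERO WIDTH SPACE, U+200C/U+200D (non-)joiners,
     * U+200E/U+200F directional marks. U+200A HAIR SPACE is a thin but
     * visible space, so the range must not start there.
     */
    if (ucs >= 0x200B && ucs <= 0x200F)
        return true;

    /* U+FEFF ZERO WIDTH NO-BREAK SPACE (byte order mark). */
    return ucs == 0xFEFF;
}
```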
    Signed-off-by: Egmont Koblinger <egmont@uhulinux.hu>
    Cc: Jan Engelhardt <jengelh@linux01.gwdg.de>
    Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Cc: "Antonino A. Daplas" <adaplas@pol.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
consolemap.c 20.7 KB