    strconv: Fix b & friends on macos/windows · 0561926a
    Kirill Smelkov authored
    On macOS and Windows, Python2 is built with --enable-unicode=ucs2, which
    makes it store unicode strings as UTF-16 code units, so every
    character at or above U+10000 is represented as a surrogate pair of _2_
    code units, for example:
    
            >>> import sys
            >>> sys.maxunicode
            65535                       <-- NOTE indicates UCS2 build
            >>> s = u'\U00012345'
            >>> s
            u'\U00012345'
            >>> s.encode('utf-8')
            '\xf0\x92\x8d\x85'
            >>> len(s)
            2                           <-- NOTE _not_ 1
            >>> s[0]
            u'\ud808'
            >>> s[1]
            u'\udf45'
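
The two values seen in `s[0]` and `s[1]` follow from the standard UTF-16 surrogate arithmetic. A minimal sketch of that arithmetic is given here in Python 3 for illustration; the helper name `split_surrogates` is hypothetical and is not taken from strconv.py:

```python
def split_surrogates(codepoint):
    """Return the (high, low) UTF-16 surrogate pair for a code point >= U+10000.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = codepoint - 0x10000          # 20-bit value to split in two halves
    high = 0xD800 + (offset >> 10)        # top 10 bits    -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low


high, low = split_surrogates(0x12345)
print(hex(high), hex(low))
```

For U+12345 this yields the same pair that the UCS2 build exposes as two separate string elements.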
    
    This leads to, for example, the b tests failing for this test vector:
    
        # tbytes                        tunicode
        (b"\xf0\x90\x8c\xbc",           u'\U0001033c'),     # Valid 4 Octet Sequence '𐌼'
    
        >           assert b(tunicode) == tbytes
        E           AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
        E             - \xed\xa0\x80\xed\xbc\xbc
        E  ...
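
The 6-byte value in the assertion is what results when each surrogate is UTF-8 encoded on its own instead of being combined into one code point first. The sketch below reproduces both byte strings in Python 3, assuming the string is available as a list of UTF-16 code units. The helper `join_surrogates` is hypothetical and only illustrates the idea; it is not the change made to strconv.py in this commit.

```python
def join_surrogates(units):
    """Combine UTF-16 surrogate pairs in `units` into full code points.

    Hypothetical helper, for illustration only. Unpaired surrogates are
    passed through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


units = [0xD800, 0xDF3C]   # u'\U0001033c' as seen by a UCS2 build

# Encoding each code unit separately gives the wrong, 6-byte result.
naive = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

# Joining the pair first gives the valid 4-octet sequence.
fixed = "".join(chr(cp) for cp in join_surrogates(units)).encode("utf-8")

print(naive)
print(fixed)
```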
strconv.py 10.9 KB