Kirill Smelkov authored
On macOS and Windows, Python2 is built with --enable-unicode=ucs2, which makes it use UTF-16 encoding for unicode characters, and so for characters at or above U+10000 it uses surrogate encoding with _2_ code points, for example:

    >>> import sys
    >>> sys.maxunicode
    65535               <-- NOTE indicates UCS2 build
    >>> s = u'\U00012345'
    >>> s
    u'\U00012345'
    >>> s.encode('utf-8')
    '\xf0\x92\x8d\x85'
    >>> len(s)
    2                   <-- NOTE _not_ 1
    >>> s[0]
    u'\ud808'
    >>> s[1]
    u'\udf45'

This leads to e.g. b tests failing for

    # tbytes                 tunicode
    (b"\xf0\x90\x8c\xbc",    u'\U0001033c'), # Valid 4 Octet Sequence '𐌼'

    >       assert b(tunicode) == tbytes
    E       AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
    E         - \xed\xa0\x80\xed\xbc\xbc
    E         ...
0561926a
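
The surrogate arithmetic behind the session and the failing assertion can be reproduced with a small sketch. It is written for Python 3 (it relies on the `surrogatepass` error handler, which Python 2 lacks) and is illustrative only. The helper names `to_surrogate_pair`, `from_surrogate_pair` and `utf8_per_unit` are hypothetical and are not part of the patched code. The last helper encodes each UTF-16 code unit separately, which yields the same bytes as the left-hand side of the assertion error above.

```python
def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    v = cp - 0x10000                  # 20-bit offset into the supplementary planes
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogate_pair(hi, lo):
    """Combine a (high, low) surrogate pair back into a single code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def utf8_per_unit(units):
    """Encode every UTF-16 code unit on its own (illustrates the faulty result)."""
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


hi, lo = to_surrogate_pair(0x12345)
print(hex(hi), hex(lo))               # same units as s[0] and s[1] in the session

pair = to_surrogate_pair(0x1033C)
print(utf8_per_unit(pair))            # per-surrogate bytes, 6 of them
print(chr(from_surrogate_pair(*pair)).encode("utf-8"))   # proper 4-byte sequence
```

Recombining the pair before encoding is what gives the expected 4-byte UTF-8 sequence; encoding the two surrogates independently gives 6 bytes instead.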