golang/strconv.py · 0561926a15ee63c31b9fe926613b15ab3e34e9ab · Boxiang Sun / pygolang

strconv: Fix b & friends on macos/windows · 0561926a

Kirill Smelkov authored Feb 28, 2020

On macOS and Windows, Python2 is built with --enable-unicode=ucs2, which
makes it store unicode strings as UTF-16 code units, and so a character
at or above U+10000 is represented by a surrogate pair of _2_ code
units, for example:

        >>> import sys
        >>> sys.maxunicode
        65535                       <-- NOTE indicates UCS2 build
        >>> s = u'\U00012345'
        >>> s
        u'\U00012345'
        >>> s.encode('utf-8')
        '\xf0\x92\x8d\x85'
        >>> len(s)
        2                           <-- NOTE _not_ 1
        >>> s[0]
        u'\ud808'
        >>> s[1]
        u'\udf45'
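
The two values above follow from the standard UTF-16 surrogate arithmetic.
A minimal Python 3 sketch of that arithmetic is given here for illustration
only; the helper name to_surrogate_pair is hypothetical and is not part of
strconv.py:

```python
def to_surrogate_pair(cp):
    """Split a code point >= U+10000 into its (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; not taken from strconv.py.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    v = cp - 0x10000                 # 20-bit offset into the supplementary planes
    high = 0xD800 + (v >> 10)        # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x12345)
print(hex(high), hex(low))           # 0xd808 0xdf45, matching s[0] and s[1] above
```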

As a result, e.g. the tests for b fail on

    # tbytes                        tunicode
    (b"\xf0\x90\x8c\xbc",           u'\U0001033c'),     # Valid 4 Octet Sequence '𐌼'

    >           assert b(tunicode) == tbytes
    E           AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
    E             - \xed\xa0\x80\xed\xbc\xbc
    E  ...
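
The wrong bytes in the assertion are what comes out when each surrogate is
UTF-8 encoded on its own instead of being joined into one code point first.
The Python 3 sketch below reproduces both results from a list of UTF-16 code
units. It is an illustration under assumptions, not the code of this commit:
the names utf8_naive and utf8_joined are hypothetical, and the 'surrogatepass'
error handler is used only so that Python 3 agrees to encode lone surrogates.

```python
def utf8_naive(units):
    # Encode every UTF-16 code unit separately, surrogates included.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


def utf8_joined(units):
    # Join each high+low surrogate pair into one code point before encoding.
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = u                   # BMP character or unpaired surrogate
            i += 1
        out.append(chr(cp).encode("utf-8", "surrogatepass"))
    return b"".join(out)


units = [0xD800, 0xDF3C]             # u'\U0001033c' as stored on a UCS2 build
assert utf8_naive(units) == b"\xed\xa0\x80\xed\xbc\xbc"   # the failing value
assert utf8_joined(units) == b"\xf0\x90\x8c\xbc"          # the expected tbytes
```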
