Kirill Smelkov authored
bstr and ustr currently claim that:

- bstr → ustr → bstr is always identity even if bytes data is not valid UTF-8, and
- ustr → bstr → ustr is always identity even if bytes data is not valid UTF-8.

This is indeed true for any bytes data. But for some (incorrect) unicode, the conversion ustr → bstr might currently fail, as the following example demonstrates:

    # py3
    In [1]: x = u'\udc00'

    In [2]: x.encode('utf-8')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

    In [3]: x.encode('utf-8', 'surrogateescape')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

I know how to fix this by adjusting the UTF-8b(*) encoding process a bit, but I currently lack the time to do it.

-> Let's place a corresponding todo entry.

Please note, once again, that for arbitrary bytes input the conversion bstr → ustr → bstr always succeeds and already works correctly. And it is this particular conversion that is most relevant in practice.

(*) aka surrogateescape in Python speak. See http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html for the original explanation from 2000.
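
The asymmetry described above can be reproduced with Python 3's built-in codec alone. The following is a minimal sketch, not the bstr/ustr implementation: the helper names `roundtrips` and `encodable` are hypothetical and exist only for illustration. It assumes the standard behaviour of the `surrogateescape` error handler, which maps undecodable bytes 0x80..0xFF to the surrogates U+DC80..U+DCFF and encodes only that range back to bytes.

```python
def roundtrips(data: bytes) -> bool:
    """Check that bytes -> str -> bytes is identity under UTF-8 + surrogateescape."""
    text = data.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogateescape') == data


def encodable(text: str) -> bool:
    """Check whether str -> bytes succeeds under UTF-8 + surrogateescape."""
    try:
        text.encode('utf-8', 'surrogateescape')
    except UnicodeEncodeError:
        return False
    return True


if __name__ == '__main__':
    # Arbitrary bytes, valid UTF-8 or not, survive the round trip.
    print(roundtrips(b'\xff\xfe not utf-8 \x80'))
    # A surrogate produced by surrogateescape itself encodes back to one byte.
    print(encodable('\udc80'))
    # A lone surrogate outside U+DC80..U+DCFF is rejected, as in the session above.
    print(encodable('\udc00'))
```

Under these assumptions the bytes → str → bytes direction never fails, while the str → bytes direction fails exactly for surrogates the handler does not own, which is the case the todo entry is meant to cover.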
c0a53847