Commit ddf6958b authored by Kirill Smelkov

golang_str: Teach bstr/ustr to stringify bytes as UTF-8 bytestrings even inside containers

bstr/ustr constructors either convert or stringify their argument. For
example bstr(u'α') gives b('α'), while bstr(1) gives b('1'). And if the
argument is bytes, bstr treats it as a UTF-8 encoded bytestring:

    >>> x = u'β'.encode()
    >>> x
    b'\xce\xb2'
    >>> bstr(x)
    b('β')

However, if that same bytes argument is placed inside a container - e.g.
inside a list - it is currently not stringified as a bytestring:

    >>> bstr([x])
    b("[b'\\xce\\xb2']")	<-- NOTE not b("['β']")

which is not consistent with our intended approach that bstr/ustr treat
bytes in their arguments as UTF-8 encoded strings.

This happens because when a list is stringified, the list.__str__
implementation goes through the list's elements and invokes their
__repr__. In general a container might be arbitrarily deep, e.g. dict
-> list -> list -> bytes, and even when stringifying such a deep dict,
we want to handle the leaf bytes as a UTF-8 encoded string.
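The recursion described above can be observed directly in plain Python
(py3 shown; `x` is the UTF-8 encoding of 'β'):

```python
# str(container) goes through element __repr__, at any nesting depth
x = 'β'.encode('utf-8')                 # b'\xce\xb2'

assert repr(x)           == "b'\\xce\\xb2'"
assert str([x])          == "[b'\\xce\\xb2']"           # list -> bytes.__repr__
assert str({'k': [[x]]}) == "{'k': [[b'\\xce\\xb2']]}"  # dict -> list -> list -> bytes
```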

There are many containers in Python - lists, tuples, dicts,
collections.OrderedDict, collections.UserDict, collections.namedtuple,
collections.defaultdict, etc. - and also many user-defined containers,
including ones implemented at C level, which we cannot all know in
advance.

This means that we cannot implement, inside bstringify, some
deep/recursive typechecking with a kind of parallel stringification of
arbitrarily complex structures that adjusts how bytes are stringified.
Nor can we create an object clone - for stringification only - with
bytes instances replaced by str (e.g. via DeepReplacer - see the recent
previous patch), and then stringify the clone. That would generally be
incorrect, because with this approach we cannot know whether an object
is being stringified as is, or whether it is used internally for data
storage and is not stringified directly. In the latter case, replacing
bytes with unicode might break internal invariants of a custom
container class and so break its logic.
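To illustrate why cloning is unsafe, consider a hypothetical container
(TypedStore is illustrative only, not part of pygolang) that indexes its
items by type:

```python
# Hypothetical container whose internal invariant depends on item types.
class TypedStore:
    def __init__(self):
        self._by_type = {}              # type -> list of items of that type

    def add(self, obj):
        self._by_type.setdefault(type(obj), []).append(obj)

    def get(self, typ):
        return self._by_type.get(typ, [])

s = TypedStore()
s.add(b'data')
assert s.get(bytes) == [b'data']
# A deep clone that replaced b'data' with 'data' would leave the item
# filed under the `bytes` key while actually being str - lookups by
# type would then silently return wrong results.
```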

What we can do, however, is hook into the bytes.__repr__
implementation and detect whether it is called from under bstringify.
If it is, we know we should adjust it and treat the bytes as a
bytestring; otherwise we use the original bytes.__repr__
implementation. This way we can handle arbitrarily complex data
structures.

This patch implements that approach for bytes, for unicode on py2, and
for bytearray. See the added comments that start with

    # patch bytes.{__repr__,__str__} and ...

for details.

After this patch, stringification of bytes inside containers treats
them as UTF-8 bytestrings:

    >>> bstr([x])
    b("['β']")
parent ff24be3d
@@ -144,12 +144,20 @@ def test_strings_basic():
_ = ustr(); assert type(_) is ustr; assert _ == ''
_ = bstr(123); assert type(_) is bstr; assert _ == '123'
_ = ustr(123); assert type(_) is ustr; assert _ == '123'
_ = bstr([1,'b']); assert type(_) is bstr; assert _ == "[1, 'b']"
_ = ustr([1,'b']); assert type(_) is ustr; assert _ == "[1, 'b']"
_ = bstr([1,'β']); assert type(_) is bstr; assert _ == "[1, 'β']"
_ = ustr([1,'β']); assert type(_) is ustr; assert _ == "[1, 'β']"
obj = object()
_ = bstr(obj); assert type(_) is bstr; assert _ == str(obj) # <object ...>
_ = ustr(obj); assert type(_) is ustr; assert _ == str(obj) # <object ...>
# when stringifying they also handle bytes/bytearray inside containers as UTF-8 strings
_ = bstr([xunicode( 'β')]); assert type(_) is bstr; assert _ == "['β']"
_ = ustr([xunicode( 'β')]); assert type(_) is ustr; assert _ == "['β']"
_ = bstr([xbytes( 'β')]); assert type(_) is bstr; assert _ == "['β']"
_ = ustr([xbytes( 'β')]); assert type(_) is ustr; assert _ == "['β']"
_ = bstr([xbytearray('β')]); assert type(_) is bstr; assert _ == "['β']"
_ = ustr([xbytearray('β')]); assert type(_) is ustr; assert _ == "['β']"
b_ = xbytes ("мир"); assert type(b_) is bytes
u_ = xunicode ("мир"); assert type(u_) is unicode
@@ -1138,6 +1146,11 @@ def test_qq():
_( b('мир'), '"мир"') # b()
_( u('мир'), '"мир"') # u()
_( 1, '"1"') # int
_( [xbytes('мир')], '"[\'мир\']"') # [b'']
_( [u'мир'], '"[\'мир\']"') # [u'']
_([xbytearray('мир')], '"[\'мир\']"') # [b'']
_( [b('мир')], '"[\'мир\']"') # [b()]
_( [u('мир')], '"[\'мир\']"') # [u()]
# what qq returns - bstr - can be mixed with both unicode, bytes and bytearray
@@ -1669,11 +1682,31 @@ def test_deepreplace_str():
# ----------------------------------------
-# verify that what we patched stay unaffected when
+# verify that what we patched - e.g. bytes.__repr__ - stay unaffected when
# called outside of bstr/ustr context.
def test_strings_patched_transparently():
b_ = xbytes ("мир"); assert type(b_) is bytes
u_ = xunicode ("мир"); assert type(u_) is unicode
ba_ = xbytearray("мир"); assert type(ba_) is bytearray
# standard {repr,str}(bytes|unicode|bytearray) stay unaffected
assert repr(b_) == x32(r"b'\xd0\xbc\xd0\xb8\xd1\x80'",
r"'\xd0\xbc\xd0\xb8\xd1\x80'")
assert repr(u_) == x32(r"'мир'",
r"u'\u043c\u0438\u0440'")
assert repr(ba_) == r"bytearray(b'\xd0\xbc\xd0\xb8\xd1\x80')"
assert str(b_) == x32(r"b'\xd0\xbc\xd0\xb8\xd1\x80'",
"\xd0\xbc\xd0\xb8\xd1\x80")
if six.PY3 or sys.getdefaultencoding() == 'utf-8': # py3 or gpython/py2
assert str(u_) == "мир"
else:
# python/py2
with raises(UnicodeEncodeError): str(u_) # 'ascii' codec can't encode ...
assert str(u'abc') == "abc"
assert str(ba_) == x32(r"bytearray(b'\xd0\xbc\xd0\xb8\xd1\x80')",
b'\xd0\xbc\xd0\xb8\xd1\x80')
# unicode comparison stay unaffected
assert (u_ == u_) is True
@@ -1855,3 +1888,7 @@ def isascii(x):
class hlist(list):
def __hash__(self):
return 0 # always hashable
# x32(a,b) returns a on py3, or b on py2
def x32(a, b):
return a if six.PY3 else b