• Kirill Smelkov's avatar
    golang_str: Teach bstr/ustr to stringify bytes as UTF-8 bytestrings even inside containers · ddf6958b
    Kirill Smelkov authored
    bstr/ustr constructors either convert or stringify its argument. For
    example bstr(u'α') gives b('α') while bstr(1) gives b('1'). And if the
    argument is bytes, bstr treats it as UTF-8 encoded bytestring:
    
        >>> x = u'β'.encode()
        >>> x
        b'\xce\xb2'
        >>> bstr(x)
        b('β')
    
    however if that same bytes argument is placed inside container - e.g. inside
    list - currently it is not stringified as bytestring:
    
        >>> bstr([x])
        b("[b'\\xce\\xb2']")	<-- NOTE not b("['β']")
    
    which is not consistent with our intended approach that bstr/ustr treat
    bytes in their arguments as UTF-8 encoded strings.
    
    This happens because when a list is stringified, list.__str__
    implementation goes through its arguments and invokes __repr__ of the
    arguments. And in general a container might be arbitrary deep, e.g. dict
    -> list -> list -> bytes, and even when stringifying that deep dict, we
    want to handle that leaf bytes as UTF-8 encoded string.
    
    There are many containers in Python - lists, tuples, dicts,
    collections.OrderedDict, collections.UserDict, collections.namedtuple,
    collections.defaultdict, etc, and also there are many user-defined
    containers - including implemented at C level - which we can not even
    know all in advance.
    
    It means that we cannot do some, probably deep/recursive typechecking,
    inside bstringify and implement kind of parallel stringification of
    arbitrary complex structure with adjustment to stringification of bytes.
    We cannot also create object clone - for stringification - with bytes
    instances replaced with str (e.g. via DeepReplacer - see recent previous
    patch), and then stringify the clone. That would generally be incorrect,
    because in this approach we cannot know whether an object is being
    stringified as it is, or whether it is being used internally for data
    storage and is not stringified directly. In the latter case if we
    replace bytes with unicode, it might break internal invariant of custom
    container class and break its logic.
    
    What we can do however, is to hook into bytes.__repr__ implementations,
    and to detect - if this implementation is called from under bstringify -
    then we know we should adjust it and treat this bytes as bytestring.
    Else - use original bytes.__repr__ implementation. This way we can handle
    arbitrary complex data structures.
    
    Hereby patch implements that approach for bytes, unicode on py2, and for
    bytearray. See added comments that start with
    
        # patch bytes.{__repr__,__str__} and ...
    
    for details.
    
    After this patch stringification of bytes inside containers treat them
    as UTF-8 bytestrings:
    
        >>> bstr([x])
        b("['β']")
    ddf6958b
_golang_str.pyx 53.8 KB