    golang_str: bstr/ustr %-formatting · 390fd810
    Kirill Smelkov authored
    Teach bstr/ustr to do %-formatting similarly to how unicode does it, but
    treating bytes as UTF8-encoded strings, in line with the general idea
    for bstr/ustr to treat bytes as strings.
    
    The following approach is used to implement this:
    
    1. both bstr and ustr format via bytes-based _bprintf.
    2. we parse the format string and handle every formatting specifier separately:
    3. for formats besides %s/%r we use bytes.__mod__ directly.
    
    4. for %s we stringify the corresponding argument specially, with all bytes
       instances, including nested ones, treated as UTF8-encoded strings:
    
          '%s' % b'\xce\xb2'      ->  "β"
          '%s' % [b'\xce\xb2']    ->  "['β']"
    
    5. for %r, similarly to %s, we prepare the repr of the corresponding argument
       specially, with all bytes instances, including nested ones, also treated as
       UTF8-encoded strings:
    
          '%r' % b'\xce\xb2'      ->  "b'β'"
          '%r' % [b'\xce\xb2']    ->  "[b'β']"
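
The dispatch described in steps 1-5 can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the patch's `_bprintf`: the names `bprintf_sketch`, `_text` and `_repr` are hypothetical, only positional arguments are handled (no `%(key)s`, no `*` width or precision), malformed formats are not diagnosed, bytes arguments are assumed to be valid UTF-8, and nested containers are not treated specially.

```python
import re

# One conversion spec: %[flags][width][.precision]conv
# (mapping keys and '*' are deliberately left out of this sketch)
_SPEC = re.compile(rb'%([#0\- +]*)(\d+)?(?:\.(\d+))?(.)', re.DOTALL)

def _text(obj):
    # %s: bytes are already UTF-8 text; everything else goes through str()
    if isinstance(obj, bytes):
        return obj
    return str(obj).encode('utf-8')

def _repr(obj):
    # %r: show bytes as b'...' with the content decoded as UTF-8
    if isinstance(obj, bytes):
        return ('b' + repr(obj.decode('utf-8'))).encode('utf-8')
    return repr(obj).encode('utf-8')

def bprintf_sketch(fmt, args):
    out = []
    pos = 0
    argv = iter(args)
    for m in _SPEC.finditer(fmt):
        out.append(fmt[pos:m.start()])      # literal text before the spec
        pos = m.end()
        spec = m.group(0)
        conv = m.group(4)
        if conv == b'%':                    # '%%' consumes no argument
            out.append(b'%')
            continue
        arg = next(argv)
        if conv == b's':
            arg = _text(arg)
        elif conv == b'r':
            arg = _repr(arg)
            spec = spec[:-1] + b's'         # argument is already prepared bytes
        # flags, width and precision are delegated to bytes.__mod__
        out.append(spec % (arg,))
    out.append(fmt[pos:])
    return b''.join(out)
```

For example, `bprintf_sketch(b'%r', (b'\xce\xb2',))` yields the UTF-8 encoding of `b'β'`, while numeric conversions such as `%5d` go through `bytes.__mod__` unchanged.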
    
    For "2" we implement %-format parsing ourselves. test_strings_mod
    has good coverage for this phase to make sure we get it right and behave
    exactly the same way as standard Python does.
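
To give an idea of what such parsing involves, here is a minimal sketch of a %-format tokenizer. It is an assumption-laden illustration rather than the parser from this patch: `Spec` and `parse_format` are hypothetical names, the grammar follows the conversion syntax documented for Python's printf-style formatting, and nested parentheses inside mapping keys are not supported.

```python
import re
from typing import NamedTuple, Optional

class Spec(NamedTuple):
    key: Optional[str]        # mapping key from %(key)s, if any
    flags: str                # any of '#', '0', '-', ' ', '+'
    width: Optional[str]      # digits or '*'
    precision: Optional[str]  # digits, '*' or '' (as in '%.f')
    conv: str                 # conversion type, or '%' for a literal percent

_SPEC_RE = re.compile(
    r'%(?:\((?P<key>[^)]*)\))?(?P<flags>[#0\- +]*)(?P<width>\*|\d+)?'
    r'(?:\.(?P<precision>\*|\d*))?[hlL]?(?P<conv>[diouxXeEfFgGcrsab%])')

def parse_format(fmt):
    """Split fmt into a list of literal strings and Spec items."""
    items, pos = [], 0
    while True:
        i = fmt.find('%', pos)
        if i < 0:
            if pos < len(fmt):
                items.append(fmt[pos:])
            return items
        if i > pos:
            items.append(fmt[pos:i])
        m = _SPEC_RE.match(fmt, i)
        if m is None:
            raise ValueError('incomplete or unsupported format at index %d' % i)
        items.append(Spec(m.group('key'), m.group('flags'), m.group('width'),
                          m.group('precision'), m.group('conv')))
        pos = m.end()
```

With this, `parse_format('a %(x)s %-5.2f%%')` produces the literal `'a '`, a spec with key `x` and conversion `s`, the literal `' '`, a spec with flags `-`, width `5`, precision `2` and conversion `f`, and finally a `%` spec.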
    
    For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
    from under bstr.__mod__(). See _bstringify for details.
    
    For "5", similarly to "4", we rely on adjustments to bytes.__repr__ .
    See _bstringify_repr for details.
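
The intended results for "4" and "5" can be imitated in pure Python without touching bytes.__repr__. The sketch below is a swapped-in technique, not what the patch does: instead of adjusting bytes.__repr__ it walks a few builtin containers explicitly. The names `stringify_sketch` and `repr_sketch` are hypothetical, only list, tuple and dict are traversed, and bytes are assumed to be valid UTF-8.

```python
def _rep(x, bprefix):
    # repr-like rendering with bytes shown as decoded UTF-8 text
    if isinstance(x, bytes):
        s = repr(x.decode('utf-8'))
        return 'b' + s if bprefix else s
    if isinstance(x, list):
        return '[' + ', '.join(_rep(e, bprefix) for e in x) + ']'
    if isinstance(x, tuple):
        inner = ', '.join(_rep(e, bprefix) for e in x)
        return '(' + inner + (',)' if len(x) == 1 else ')')
    if isinstance(x, dict):
        return '{' + ', '.join(_rep(k, bprefix) + ': ' + _rep(v, bprefix)
                               for k, v in x.items()) + '}'
    return repr(x)

def stringify_sketch(obj):
    # imitates %s: top-level bytes become plain text, nested bytes lose the b prefix
    if isinstance(obj, bytes):
        return obj.decode('utf-8')
    if isinstance(obj, (list, tuple, dict)):
        return _rep(obj, bprefix=False)
    return str(obj)

def repr_sketch(obj):
    # imitates %r: bytes keep the b prefix but show decoded content
    return _rep(obj, bprefix=True)
```

An explicit walk like this only covers the container types it knows about, whereas adjusting bytes.__repr__ also affects bytes nested inside arbitrary objects whose repr ends up calling it.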
    
    I initially tried to avoid parsing the format specification myself and
    wanted to reuse the original bytes.__mod__, adjusting its behaviour
    only slightly. This did not work quite right, as the following comment
    explains:
    
        # Rejected alternative: try to format; if we get "TypeError: %b requires a
        # bytes-like object ..." retry with that argument converted to bstr.
        #
        # Rejected because e.g. for  `'%(x)s %(x)r' % {'x': obj}`  we need to use
        # access number instead of key 'x' to determine which accesses to
        # bstringify. We could do that, but unfortunately on Python2 the access
        # number is not easily predictable because string could be upgraded to
        # unicode in the midst of being formatted and so some keys will be
        # accessed more than once.
        #
        # Another reason for rejection: b'%r' and u'%r' handle arguments
        # differently - on b %r is aliased to %a.
    
    That's why full %-format parsing and handling is implemented in this
    patch. Once again, to make sure its behaviour is really the same as
    Python's builtin %-formatting, we have good test coverage both for
    %-format parsing itself and for actual formatting in many different cases.
    
    See test_strings_mod for details.
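
The builtin behaviours that the rejected-alternative comment relies on can be observed on Python 3 with a small experiment. This is only a sketch: `CountingDict` and `observe` are hypothetical helpers introduced for illustration, and the Python 2 unpredictability mentioned in the comment is not reproduced here.

```python
class CountingDict(dict):
    """dict that records every key looked up via d[key]."""
    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.accesses = []

    def __getitem__(self, key):
        self.accesses.append(key)
        return super().__getitem__(key)

def observe():
    # the same key is looked up once per conversion spec that names it
    d = CountingDict(x=1)
    '%(x)s %(x)r' % d
    lookups = list(d.accesses)

    # bytes %s (an alias of %b) rejects str arguments with TypeError
    try:
        b'%s' % ('β',)
        rejected = False
    except TypeError:
        rejected = True

    # on bytes %r behaves like %a and escapes non-ASCII; on str it does not
    return lookups, rejected, b'%r' % ('β',), u'%r' % ('β',)
```

Because `x` is looked up twice, a retry scheme would have to track which access to convert, not just which key.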