README.rst · 390fd81005ee527c087a830af3d32baa1f5d8805 · Kirill Smelkov / pygolang

golang_str: bstr/ustr %-formatting · 390fd810
Kirill Smelkov authored Oct 09, 2022
Teach bstr/ustr to do %-formatting the way unicode does, but with
bytes treated as UTF8-encoded strings. This is in line with the
general idea for bstr/ustr to treat bytes as strings.

The following approach is used to implement this:

1. both bstr and ustr format via bytes-based _bprintf.
2. we parse the format string and handle every formatting specifier separately:
3. for formats besides %s/%r we use bytes.__mod__ directly.

4. for %s we stringify the corresponding argument specially, treating all
   bytes instances, including ones nested inside it, as UTF8-encoded strings:

      '%s' % b'\xce\xb2'      ->  "β"
      '%s' % [b'\xce\xb2']    ->  "['β']"

5. for %r, similarly to %s, we prepare the repr of the corresponding argument
   specially, also treating all bytes instances, including ones nested inside
   it, as UTF8-encoded strings:

      '%r' % b'\xce\xb2'      ->  "b'β'"
      '%r' % [b'\xce\xb2']    ->  "[b'β']"
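
The target behaviour of "4" and "5" can be sketched in plain Python 3 as
below. This is only an illustration of the expected results, not the actual
_bstringify code: the names str_sketch and repr_sketch are hypothetical, only
bytes and list are handled, and the use of surrogateescape for invalid UTF8 is
an assumption. For comparison, stock Python 3 gives "b'\\xce\\xb2'" for
'%s' % b'\xce\xb2'.

```python
def _repr_utf8(obj, bprefix):
    # repr(obj), but with bytes shown as UTF8 text instead of \x escapes.
    # Only bytes and list are handled here; this is a sketch.
    if isinstance(obj, bytes):
        return bprefix + repr(obj.decode("utf-8", "surrogateescape"))
    if isinstance(obj, list):
        return "[" + ", ".join(_repr_utf8(x, bprefix) for x in obj) + "]"
    return repr(obj)


def str_sketch(obj):
    # %s rule: top-level bytes become plain text,
    # nested bytes are shown quoted but without the b prefix.
    if isinstance(obj, bytes):
        return obj.decode("utf-8", "surrogateescape")
    if isinstance(obj, list):
        return _repr_utf8(obj, "")
    return str(obj)


def repr_sketch(obj):
    # %r rule: bytes keep the b prefix but show decoded text.
    return _repr_utf8(obj, "b")


# str_sketch([b"\xce\xb2"])   ->  "['β']"
# repr_sketch([b"\xce\xb2"])  ->  "[b'β']"
```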

For "2" we implement %-format parsing ourselves. test_strings_mod
has good coverage for this phase to make sure we get it right and that it
behaves exactly the same way as standard Python does.
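
To give an idea of what such parsing involves, here is a minimal sketch that
splits a format string into literal text and conversion specifiers. It is not
the parser from this patch: split_format and _SPEC are hypothetical names, the
regular expression does not handle nested parentheses in mapping keys, and
malformed specifiers are silently left as literal text instead of raising an
error as Python does.

```python
import re

# %[(key)][flags][width][.precision][length]type
_SPEC = re.compile(
    r"%"
    r"(?:\((?P<key>[^)]*)\))?"            # optional (mapping key)
    r"(?P<flags>[-+ #0]*)"                # conversion flags
    r"(?P<width>\*|\d+)?"                 # minimum field width
    r"(?:\.(?P<prec>\*|\d*))?"            # precision
    r"[hlL]?"                             # length modifier (ignored by Python)
    r"(?P<type>[diouxXeEfFgGcrsab%])"     # conversion type
)


def split_format(fmt):
    # Return a list of ("lit", text) and ("spec", spec, key, type) tuples.
    out = []
    pos = 0
    for m in _SPEC.finditer(fmt):
        if m.start() > pos:
            out.append(("lit", fmt[pos:m.start()]))
        out.append(("spec", m.group(0), m.group("key"), m.group("type")))
        pos = m.end()
    if pos < len(fmt):
        out.append(("lit", fmt[pos:]))
    return out
```

With the pieces separated like this, each specifier can be handled on its own:
%s and %r specially, everything else via the standard formatting.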

For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
from under bstr.__mod__(). See _bstringify for details.

For "5", similarly to "4", we rely on adjustments to bytes.__repr__.
See _bstringify_repr for details.

I initially tried to avoid parsing the format specification myself and
wanted to reuse the original bytes.__mod__, only adjusting its behaviour
a bit. This did not work quite right, as the following comment
explains:

    # Rejected alternative: try to format; if we get "TypeError: %b requires a
    # bytes-like object ..." retry with that argument converted to bstr.
    #
    # Rejected because e.g. for  `'%(x)s %(x)r' % {'x': obj}`  we need to use
    # the access number instead of key 'x' to determine which accesses to
    # bstringify. We could do that, but unfortunately on Python2 the access
    # number is not easily predictable because a string could be upgraded to
    # unicode in the midst of being formatted, and so some keys will be
    # accessed more than once.
    #
    # Another reason for rejection: b'%r' and u'%r' handle arguments
    # differently - for bytes, %r is an alias for %a.
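
The two standard-library behaviours mentioned in that comment can be observed
directly on Python 3. The snippet below only exercises stock bytes and str
formatting, and the variable names are arbitrary:

```python
# bytes %s refuses a str argument: this is the TypeError the rejected
# alternative would have had to catch and retry on.
try:
    b"%s" % ("\u03b2",)
    err = None
except TypeError as e:
    err = str(e)

# %r differs between str and bytes formatting:
u_r = "%r" % ("\u03b2",)    # str:   uses repr()  -> non-ASCII kept as is
b_r = b"%r" % ("\u03b2",)   # bytes: uses ascii() -> non-ASCII escaped
b_a = b"%a" % ("\u03b2",)   # same as %r for bytes
```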

That's why this patch implements full %-format parsing and handling.
Again, to make sure its behaviour really matches Python's builtin
%-formatting, we have good test coverage both for %-format parsing
itself and for the actual formatting of many different cases.

See test_strings_mod for details.