Commit 390fd810 authored by Kirill Smelkov

golang_str: bstr/ustr %-formatting

Teach bstr/ustr to do % formatting similarly to how unicode does, but
with treating bytes as UTF8-encoded strings - all in line with
general idea for bstr/ustr to treat bytes as strings.

The following approach is used to implement this:

1. both bstr and ustr format via bytes-based _bprintf.
2. we parse the format string and handle every formatting specifier separately:
3. for formats besides %s/%r we use bytes.__mod__ directly.

4. for %s we stringify corresponding argument specially with all, potentially
   internal, bytes instances treated as UTF8-encoded strings:

      '%s' % b'\xce\xb2'      ->  "β"
      '%s' % [b'\xce\xb2']    ->  "['β']"

5. for %r, similarly to %s, we prepare repr of corresponding argument
   specially with all, potentially internal, bytes instances also treated as
   UTF8-encoded strings:

      '%r' % b'\xce\xb2'      ->  "b'β'"
      '%r' % [b'\xce\xb2']    ->  "[b'β']"
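The %s and %r rules above can be approximated in plain Python. This is only a sketch of the intended results, not the code from this patch: the helper names `sketch_s` and `sketch_r` are hypothetical, only bytes and lists are handled, and `surrogateescape` is assumed as a stand-in for the decoding bstr/ustr use.

```python
def _text(data):
    # Assumption: surrogateescape stands in for the real bstr/ustr decoding.
    return data.decode("utf-8", "surrogateescape")


def _nested(obj, prefix):
    """repr-like rendering where bytes are shown as UTF-8 text."""
    if isinstance(obj, bytes):
        # repr() of the decoded text supplies the quotes and escaping.
        return prefix + repr(_text(obj))
    if isinstance(obj, list):
        return "[" + ", ".join(_nested(x, prefix) for x in obj) + "]"
    return repr(obj)


def sketch_s(obj):
    """Hypothetical helper: what %s should produce for obj."""
    if isinstance(obj, bytes):
        return _text(obj)
    if isinstance(obj, list):
        return _nested(obj, "")   # inner bytes print without the b prefix
    return str(obj)


def sketch_r(obj):
    """Hypothetical helper: what %r should produce for obj."""
    return _nested(obj, "b")      # bytes keep the b prefix, content is text
```

The four examples listed in items 4 and 5 map onto `sketch_s` and `sketch_r` called with `b'\xce\xb2'` and `[b'\xce\xb2']`.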

For "2" we implement %-format parsing ourselves. test_strings_mod
has good coverage for this phase to make sure we get it right and behave
exactly the same way as standard Python does.
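To illustrate what parsing a %-format string involves, here is a minimal regex-based sketch. It is not the parser from this patch: `split_format` and `_SPEC` are hypothetical names, mapping keys with nested parentheses are not supported, and malformed specifiers are silently left as literal text instead of raising an error as Python does.

```python
import re

# One conversion specifier: %[(key)][flags][width][.precision][length]type
_SPEC = re.compile(r"""
    %
    (?:\((?P<key>[^()]*)\))?       # optional mapping key, e.g. %(x)s
    (?P<flags>[-#0 +]*)            # conversion flags
    (?P<width>\*|\d+)?             # minimum field width
    (?:\.(?P<prec>\*|\d*))?        # precision
    [hlL]?                         # length modifier, accepted and ignored
    (?P<conv>[diouxXeEfFgGcrsab%]) # conversion type
""", re.VERBOSE)


def split_format(fmt):
    """Split fmt into literal strings and dicts describing specifiers."""
    parts, pos = [], 0
    for m in _SPEC.finditer(fmt):
        if m.start() > pos:
            parts.append(fmt[pos:m.start()])   # literal text before the spec
        parts.append(m.groupdict())
        pos = m.end()
    if pos < len(fmt):
        parts.append(fmt[pos:])                # trailing literal text
    return parts
```

With the specifiers isolated this way, each one can be dispatched separately: %s and %r to the special stringification, everything else to the stock formatting.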

For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
from under bstr.__mod__(). See _bstringify for details.

For "5", similarly to "4", we rely on adjustments to bytes.__repr__ .
See _bstringify_repr for details.

I initially tried to avoid parsing format specification myself and
wanted to reuse original bytes.__mod__ and just adjust its behaviour
a bit somehow. This did not work quite right, as the following comment
explains:

    # Rejected alternative: try to format; if we get "TypeError: %b requires a
    # bytes-like object ..." retry with that argument converted to bstr.
    #
    # Rejected because e.g. for  `'%(x)s %(x)r' % {'x': obj}`  we need to use
    # access number instead of key 'x' to determine which accesses to
    # bstringify. We could do that, but unfortunately on Python2 the access
    # number is not easily predictable because the string could be upgraded to
    # unicode in the midst of being formatted and so some keys will be
    # accessed more than once.
    #
    # Another reason for rejection: b'%r' and u'%r' handle arguments
    # differently - on b %r is aliased to %a.
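Both the error message quoted in the comment and the %r difference can be observed on stock Python 3 without bstr/ustr. The variable names below are only for illustration.

```python
text = "β"

# On str, %r calls repr() and keeps non-ASCII characters as they are.
as_str = "%r" % text

# On bytes, %r is an alias of %a, so the result is ASCII-only with escapes.
as_bytes = b"%r" % text

# On bytes, %s is an alias of %b and refuses a str argument outright.
try:
    b"%s" % text
except TypeError as exc:
    error = str(exc)
```

Here `as_str` is `"'β'"` while `as_bytes` is `b"'\\u03b2'"`, so delegating %r to bytes.__mod__ cannot give the unicode-like result.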

That's why full %-format parsing and handling is implemented in this
patch. Once again, to make sure its behaviour is really the same as
Python's builtin %-formatting, we have good test coverage both for
%-format parsing itself and for actual formatting in a wide variety of cases.

See test_strings_mod for details.
parent ddf6958b
@@ -269,6 +269,11 @@ Usage example::
    for c in s:  # c will iterate through
        ...      # [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
    # the following gives b('привет мир труд май')
    b('привет %s %s %s') % (u'мир',                   # raw unicode
                            u'труд'.encode('utf-8'),  # raw bytes
                            u('май'))                 # ustr
    def f(s):
        s = u(s)  # make sure s is ustr, decoding as UTF-8(*) if it was bstr, bytes, bytearray or buffer.
        ...       # (*) the decoding never fails nor looses information.