Commit 1f99393d authored by Kirill Smelkov

golang_str: Start exposing Pygolang string types publicly

In 2020, in edc7aaab (golang: Teach qq to be usable with both bytes and
str format whatever type qq argument is), I added custom bytes-like and
unicode-like types for qq to return instead of str, so that qq's result
would be interoperable with both bytes and unicode. Citing that patch:

    qq is used to quote strings or byte-strings. The following example
    illustrates the problem we are currently hitting in zodbtools with
    Python3:

        >>> "hello %s" % qq("мир")
        'hello "мир"'

        >>> b"hello %s" % qq("мир")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

        >>> "hello %s" % qq(b("мир"))
        'hello "мир"'

        >>> b"hello %s" % qq(b("мир"))
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

    i.e. one way or another if type of format string and what qq returns do not
    match it creates a TypeError.

    We want qq(obj) to be useable with both string and bytestring format.

    For that let's teach qq to return special str- and bytes- derived types that
    know how to automatically convert to str->bytes and bytes->str via b/u
    correspondingly. This way formatting works whatever types combination it was
    for format and for qq, and the whole result has the same type as format.

    For now we teach only qq to use new types and don't generally expose
    _str and _unicode to be returned by b and u yet. However we might do so
    in the future after incrementally gaining a bit more experience.
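
The mechanism described in the cited patch can be sketched in plain Python 3
as follows. This is only an illustration under stated assumptions: BytesLike,
StrLike and quote are hypothetical stand-ins, not Pygolang's _str, _unicode
and qq, and the "surrogateescape" error handler is used as a simple substitute
for whatever conversion Pygolang performs internally.

```python
class BytesLike(bytes):
    """Hypothetical bytes subclass: str() decodes it as UTF-8."""
    def __str__(self):
        return self.decode("utf-8", "surrogateescape")


class StrLike(str):
    """Hypothetical str subclass: bytes() and b"%s" encode it as UTF-8."""
    def __bytes__(self):
        return self.encode("utf-8", "surrogateescape")


def quote(s):
    """Toy stand-in for qq: wrap s in double quotes, keeping its flavour."""
    if isinstance(s, bytes):
        return BytesLike(b'"' + s + b'"')
    return StrLike('"' + s + '"')


# All four format/argument combinations now work, and the result
# always has the type of the format string.
text_from_str = "hello %s" % quote("мир")
bytes_from_str = b"hello %s" % quote("мир")
text_from_bytes = "hello %s" % quote("мир".encode("utf-8"))
bytes_from_bytes = b"hello %s" % quote("мир".encode("utf-8"))
```

The sketch relies on two standard hooks: text formatting calls __str__ on the
argument, and bytes formatting accepts any object that implements __bytes__.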

Two years later I have gained that experience and found that a string
type that can interoperate with both bytes and unicode is generally
useful: it gives practical backward compatibility with Python2 and
simplifies programming by avoiding a constant stream of encode/decode
noise. Thus the day to expose Pygolang string types for general use has
come.

This patch takes the first small step: it exposes the bytes-like and
unicode-like types (now named bstr and ustr) publicly, and switches b and
u to return bstr and ustr respectively instead of bytes and unicode. This
is a change in behaviour, but hopefully it should not break anything, as
there are currently not many b/u users and bstr and ustr are intended to
be drop-in replacements for the standard string types.
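
The new behaviour of b and u can be approximated in plain Python 3 as shown
below. This is a minimal sketch, not Pygolang's code: BStr and UStr are
hypothetical stand-ins for bstr and ustr, the control flow only mirrors the
type checks visible in the diff, and "surrogateescape" is an assumed
substitute for Pygolang's own never-failing conversion.

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr: bytes that can turn into text."""
    def __str__(self):
        return self.decode("utf-8", "surrogateescape")


class UStr(str):
    """Hypothetical stand-in for ustr: text that can turn into bytes."""
    def __bytes__(self):
        return self.encode("utf-8", "surrogateescape")


def b(s):
    """Return s as BStr: text is encoded as UTF-8, bytes are wrapped."""
    if type(s) is BStr:
        return s                      # already BStr: returned as-is
    if isinstance(s, bytes):
        return BStr(s)
    if isinstance(s, str):
        return BStr(s.encode("utf-8", "surrogateescape"))
    raise TypeError("b: invalid type %s" % type(s))


def u(s):
    """Return s as UStr: bytes are decoded as UTF-8, text is wrapped."""
    if type(s) is UStr:
        return s                      # already UStr: returned as-is
    if isinstance(s, str):
        return UStr(s)
    if isinstance(s, bytes):
        return UStr(s.decode("utf-8", "surrogateescape"))
    raise TypeError("u: invalid type %s" % type(s))


# UTF-8 for "мир ", followed by one byte that is not valid UTF-8.
raw = b"\xd0\xbc\xd0\xb8\xd1\x80 \xff"
bs = b(raw)     # BStr wrapping the same bytes
us = u(bs)      # UStr; the invalid byte survives as an escape code point
```

In this stand-in, b(u(raw)) gives back the original bytes even though raw is
not valid UTF-8. The sketch does not guarantee the opposite direction for
arbitrary text input.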

The next patches will enhance bstr/ustr step by step so that they become
real drop-in replacements for the standard string types.

See nexedi/zodbtools!13 (comment 81646)
for preliminary discussion from 2019.

See also "Python 3 Losses: Nexedi Perspective"[1] and the associated "cost
overview"[2] for a related presentation by Jean-Paul from 2018.

[1] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[2] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
parent ffb40903
@@ -10,7 +10,7 @@ Package `golang` provides Go-like features for Python:
 - `func` allows to define methods separate from class.
 - `defer` allows to schedule a cleanup from the main control flow.
 - `error` and package `errors` provide error chaining.
-- `b` and `u` provide way to make sure an object is either bytes or unicode.
+- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
 - `gimport` allows to import python modules by full path in a Go workspace.

 Package `golang.pyx` provides__ similar features for Cython/nogil.
@@ -229,19 +229,32 @@ __ https://www.python.org/dev/peps/pep-3134/
 Strings
 -------

-`b` and `u` provide way to make sure an object is either bytes or unicode.
-`b(obj)` converts str/unicode/bytes obj to UTF-8 encoded bytestring, while
-`u(obj)` converts str/unicode/bytes obj to unicode string. For example::
+Pygolang, similarly to Go, provides uniform UTF8-based approach to strings with
+the idea to make working with byte- and unicode- strings easy and transparently
+interoperable:

-    b("привет мир") # -> gives bytes corresponding to UTF-8 encoding of "привет мир".
+- `bstr` is byte-string: it is based on `bytes` and can automatically convert to `unicode` [*]_.
+- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to `bytes`.
+
+The conversion, in both encoding and decoding, never fails and never looses
+information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
+even if bytes data is not valid UTF-8.
+
+`bstr`/`ustr` constructors will accept arbitrary objects and either convert or stringify them. For
+cases when no stringification is desired, and one only wants to convert
+`bstr`/`ustr` / `unicode`/`bytes`
+to Pygolang string, `b` and `u` provide way to make sure an
+object is either `bstr` or `ustr` correspondingly.
+
+Usage example::
+
+    s = b('привет') # s is bstr corresponding to UTF-8 encoding of 'привет'.

     def f(s):
-        s = u(s) # make sure s is unicode, decoding as UTF-8(*) if it was bytes.
-        ... # (*) but see below about lack of decode errors.
+        s = u(s) # make sure s is ustr, decoding as UTF-8(*) if it was bstr or bytes.
+        ... # (*) the decoding never fails nor looses information.

-The conversion in both encoding and decoding never fails and never looses
-information: `b(u(·))` and `u(b(·))` are always identity for bytes and unicode
-correspondingly, even if bytes input is not valid UTF-8.
+.. [*] `unicode` on Python2, `str` on Python3.


 Import
......
@@ -24,7 +24,7 @@
 - `func` allows to define methods separate from class.
 - `defer` allows to schedule a cleanup from the main control flow.
 - `error` and package `errors` provide error chaining.
-- `b` and `u` provide way to make sure an object is either bytes or unicode.
+- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
 - `gimport` allows to import python modules by full path in a Go workspace.

 See README for thorough overview.
@@ -36,7 +36,7 @@ from __future__ import print_function, absolute_import
 __version__ = "0.1"

 __all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
-           'recover', 'func', 'error', 'b', 'u', 'gimport']
+           'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'gimport']

 from golang._gopath import gimport # make gimport available from golang
 import inspect, sys
@@ -316,7 +316,9 @@ from ._golang import \
     pypanic as panic, \
     pyerror as error, \
     pyb as b, \
-    pyu as u
+    pybstr as bstr, \
+    pyu as u, \
+    pyustr as ustr

 # import golang.strconv into _golang from here to workaround cyclic golang ↔ strconv dependency
 def _():
......
@@ -43,6 +43,7 @@ In addition to Cython/nogil API, golang.pyx provides runtime for golang.py:
 - Python-level channels are represented by pychan + pyselect.
 - Python-level error is represented by pyerror.
 - Python-level panic is represented by pypanic.
+- Python-level strings are represented by pybstr and pyustr.
 """
......
@@ -28,7 +28,7 @@ from libc.stdint cimport uint8_t
 pystrconv = None # = golang.strconv imported at runtime (see __init__.py)

-def pyb(s): # -> bytes
+def pyb(s): # -> bstr
     """b converts str/unicode/bytes s to UTF-8 encoded bytestring.

     Bytes input is preserved as-is:
@@ -42,8 +42,11 @@ def pyb(s): # -> bytes
     TypeError is raised if type(s) is not one of the above.

-    See also: u.
+    See also: u, bstr/ustr.
     """
+    if type(s) is pybstr:
+        return s
+
     if isinstance(s, bytes): # py2: str py3: bytes
         pass
     elif isinstance(s, unicode): # py2: unicode py3: str
@@ -51,9 +54,9 @@ def pyb(s): # -> bytes
     else:
         raise TypeError("b: invalid type %s" % type(s))

-    return s
+    return pybstr(s)

-def pyu(s): # -> unicode
+def pyu(s): # -> ustr
     """u converts str/unicode/bytes s to unicode string.

     Unicode input is preserved as-is:
@@ -69,8 +72,11 @@ def pyu(s): # -> unicode
     TypeError is raised if type(s) is not one of the above.

-    See also: b.
+    See also: b, bstr/ustr.
     """
+    if type(s) is pyustr:
+        return s
+
     if isinstance(s, unicode): # py2: unicode py3: str
         pass
     elif isinstance(s, bytes): # py2: str py3: bytes
@@ -78,22 +84,22 @@ def pyu(s): # -> unicode
     else:
         raise TypeError("u: invalid type %s" % type(s))

-    return s
+    return pyustr(s)


-# __pystr converts obj to str of current python:
+# __pystr converts obj to ~str of current python:
 #
-# - to bytes, via b, if running on py2, or
-# - to unicode, via u, if running on py3.
+# - to ~bytes, via b, if running on py2, or
+# - to ~unicode, via u, if running on py3.
 #
 # It is handy to use __pystr when implementing __str__ methods.
 #
 # NOTE __pystr is currently considered to be internal function and should not
 # be used by code outside of pygolang.
 #
-# XXX we should be able to use _pystr, but py3's str verify that it must have
+# XXX we should be able to use pybstr, but py3's str verify that it must have
 # Py_TPFLAGS_UNICODE_SUBCLASS in its type flags.
-cdef __pystr(object obj):
+cdef __pystr(object obj): # -> ~str
     if PY_MAJOR_VERSION >= 3:
         return pyu(obj)
     else:
@@ -101,8 +107,8 @@ cdef __pystr(object obj):
 # XXX cannot `cdef class`: github.com/cython/cython/issues/711
-class _pystr(bytes):
-    """_str is like bytes but can be automatically converted to Python unicode
+class pybstr(bytes):
+    """bstr is like bytes but can be automatically converted to Python unicode
     string via UTF-8 decoding.

     The decoding never fails nor looses information - see u for details.
@@ -123,8 +129,8 @@ class _pystr(bytes):
             return self


-cdef class _pyunicode(unicode):
-    """_unicode is like unicode(py2)|str(py3) but can be automatically converted
+cdef class pyustr(unicode):
+    """ustr is like unicode(py2)|str(py3) but can be automatically converted
     to bytes via UTF-8 encoding.

     The encoding always succeeds - see b for details.
@@ -139,11 +145,11 @@ cdef class _pyunicode(unicode):
         else:
             return pyb(self)

-# initialize .tp_print for _pystr so that this type could be printed.
+# initialize .tp_print for pybstr so that this type could be printed.
 # If we don't - printing it will result in `RuntimeError: print recursion`
 # because str of this type never reaches real bytes or unicode.
 # Do it only on python2, because python3 does not use tp_print at all.
-# NOTE _pyunicode does not need this because on py2 str(_pyunicode) returns _pystr.
+# NOTE pyustr does not need this because on py2 str(pyustr) returns pybstr.
 IF PY2:
     # NOTE Cython does not define tp_print for PyTypeObject - do it ourselves
     from libc.stdio cimport FILE
@@ -153,12 +159,12 @@ IF PY2:
             printfunc tp_print
         cdef PyTypeObject *Py_TYPE(object)

-    cdef int _pystr_tp_print(PyObject *obj, FILE *f, int nesting) except -1:
+    cdef int _pybstr_tp_print(PyObject *obj, FILE *f, int nesting) except -1:
         o = <bytes>obj
-        o = bytes(buffer(o)) # change tp_type to bytes instead of _pystr
+        o = bytes(buffer(o)) # change tp_type to bytes instead of pybstr
         return Py_TYPE(o).tp_print(<PyObject*>o, f, nesting)

-    Py_TYPE(_pystr()).tp_print = _pystr_tp_print
+    Py_TYPE(pybstr()).tp_print = _pybstr_tp_print


 # qq is substitute for %q, which is missing in python.
@@ -179,9 +185,9 @@ def pyqq(obj):
     # a-la str type (unicode on py3, bytes on py2), that can be transparently
     # converted to unicode or bytes as needed.
     if PY_MAJOR_VERSION >= 3:
-        qobj = _pyunicode(pyu(qobj))
+        qobj = pyu(qobj)
     else:
-        qobj = _pystr(pyb(qobj))
+        qobj = pyb(qobj)

     return qobj
......
@@ -111,7 +111,7 @@ def test_strings():
     assert isinstance(_, unicode)
     assert u(_) is _

-# verify print for _pystr and _pyunicode
+# verify print for bstr/ustr.
 def test_strings_print():
     outok = readfile(dir_testprog + "/golang_test_str.txt")
     retcode, stdout, stderr = _pyrun(["golang_test_str.py"],
......
@@ -18,7 +18,7 @@
 #
 # See COPYING file for full licensing terms.
 # See https://www.nexedi.com/licensing for rationale and options.
-"""This program helps to verify _pystr and _pyunicode.
+"""This program helps to verify b, u and underlying bstr and ustr.

 It complements golang_str_test.test_strings_print.
 """
@@ -31,6 +31,8 @@ from golang.gcompat import qq
 def main():
     sb = b("привет b")
     su = u("привет u")
+    print("print(b):", sb)
+    print("print(u):", su)
     print("print(qq(b)):", qq(sb))
     print("print(qq(u)):", qq(su))
......
+print(b): привет b
+print(u): привет u
 print(qq(b)): "привет b"
 print(qq(u)): "привет u"
 # -*- coding: utf-8 -*-
-# Copyright (C) 2019-2021 Nexedi SA and Contributors.
+# Copyright (C) 2019-2022 Nexedi SA and Contributors.
 # Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -71,6 +71,8 @@ def test_golang_builtins():
     assert error is golang.error
     assert b is golang.b
     assert u is golang.u
+    assert bstr is golang.bstr
+    assert ustr is golang.ustr

     # indirectly verify golang.__all__
     for k in golang.__all__:
......