Commit a72c1c1a authored by Kirill Smelkov

golang_str: bstr/ustr iteration

Even though bstr is semantically an array of bytes, while ustr is an array of
unicode characters, iterating _both_ of them yields unicode characters.
This is in line with the Go approach described in "Strings, bytes, runes
and characters in Go"[1] and allows both ustr _and_ bstr to be used
as strings in the unicode world.
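
The iteration semantics described above can be sketched in plain Python, without pygolang. This is only an illustration and not the code from this commit: `iter_chars` is a hypothetical helper, and decoding with `surrogateescape` is an assumption made here so that bytes which are not valid UTF-8 do not raise.

```python
# Illustration only: `iter_chars` is a hypothetical helper, not part of pygolang.

def iter_chars(data: bytes):
    """Yield unicode characters from UTF-8 encoded bytes.

    Similar in spirit to iterating a string with `for ... range` in Go.
    """
    # Assumption of this sketch: surrogateescape keeps undecodable bytes
    # as lone surrogates instead of raising UnicodeDecodeError.
    text = data.decode("utf-8", errors="surrogateescape")
    for ch in text:
        yield ch


raw = "привет".encode("utf-8")

# Plain bytes iteration on Python 3 yields integers, one per byte.
byte_items = list(raw)

# Character iteration yields one str per code point.
char_items = list(iter_chars(raw))

print(len(byte_items))  # 12: each Cyrillic letter takes 2 bytes in UTF-8
print(char_items)       # ['п', 'р', 'и', 'в', 'е', 'т']
```

The contrast between `byte_items` and `char_items` is the divergence from py3 `bytes` that the commit message refers to.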

Even though this diverges (just a bit) from str/py2 behaviour, and
diverges more from bytes/py3 behaviour, I have not hit any problem in
practice due to this divergence. In other words, the semantics of
bytestrings used in Go - to iterate them as unicode characters - is
sound. For reference, it is the authors of Go who originally invented
UTF-8 - see [2] for details.

See also [3] for our discussion with Jérome on this topic.

[1] https://blog.golang.org/strings
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
[3] nexedi/zodbtools!13 (comment 81646)
parent 04be919b
......@@ -242,7 +242,12 @@ even if bytes data is not valid UTF-8.
Semantically `bstr` is array of bytes, while `ustr` is array of
unicode-characters. Accessing their elements by `[index]` yields byte and
unicode character correspondingly [*]_.
unicode character correspondingly [*]_. Iterating them, however, yields unicode
characters for both `bstr` and `ustr`. In practice `bstr` is enough 99% of the
time, and `ustr` only needs to be used for random access to string characters.
See `Strings, bytes, runes and characters in Go`__ for overview of this approach.
__ https://blog.golang.org/strings
Operations in between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while
operations in between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce
......@@ -260,6 +265,8 @@ object is either `bstr` or `ustr` correspondingly.
Usage example::
s = b('привет') # s is bstr corresponding to UTF-8 encoding of 'привет'.
for c in s: # c will iterate through
... # [u(_) for _ in ('п','р','и','в','е','т')]
def f(s):
s = u(s) # make sure s is ustr, decoding as UTF-8(*) if it was bstr, bytes, bytearray or buffer.
......
......@@ -26,6 +26,7 @@ from cpython cimport PyUnicode_AsUnicode, PyUnicode_GetSize, PyUnicode_FromUnico
from cpython cimport PyUnicode_DecodeUTF8
from cpython cimport PyTypeObject, Py_TYPE, richcmpfunc
from cpython cimport Py_EQ, Py_NE, Py_LT, Py_GT, Py_LE, Py_GE
from cpython.iterobject cimport PySeqIter_New
from cpython cimport PyObject_CheckBuffer
cdef extern from "Python.h":
void PyType_Modified(PyTypeObject *)
......@@ -195,7 +196,10 @@ class pybstr(bytes):
is always identity even if bytes data is not valid UTF-8.
Semantically bstr is array of bytes. Accessing its elements by [index]
yields byte character.
yields byte character. Iterating through bstr, however, yields unicode
characters. In practice bstr is enough 99% of the time, and ustr only
needs to be used for random access to string characters. See
https://blog.golang.org/strings for overview of this approach.
Operations in between bstr and ustr/unicode / bytes/bytearray coerce to bstr.
When the coercion happens, bytes and bytearray, similarly to bstr, are also
......@@ -280,6 +284,11 @@ class pybstr(bytes):
else:
return pyb(x)
# __iter__ - yields unicode characters
def __iter__(self):
# TODO iterate without converting self to u
return pyu(self).__iter__()
# XXX cannot `cdef class` with __new__: https://github.com/cython/cython/issues/799
class pyustr(unicode):
......@@ -292,9 +301,13 @@ class pyustr(unicode):
is always identity even if bytes data is not valid UTF-8.
ustr is similar to standard unicode type - accessing its
ustr is similar to standard unicode type - iterating and accessing its
elements by [index] yields unicode characters.
ustr complements bstr and is meant to be used only in situations when
random access to string characters is needed. Otherwise bstr is more
preferable and should be enough 99% of the time.
Operations in between ustr and bstr/bytes/bytearray / unicode coerce to ustr.
When the coercion happens, bytes and bytearray, similarly to bstr, are also
treated as UTF8-encoded strings.
......@@ -360,6 +373,26 @@ class pyustr(unicode):
def __getitem__(self, idx):
return pyu(unicode.__getitem__(self, idx))
# __iter__
def __iter__(self):
if PY_MAJOR_VERSION >= 3:
return _pyustrIter(unicode.__iter__(self))
else:
# on python 2 unicode does not have .__iter__
return PySeqIter_New(self)
# _pyustrIter wraps unicode iterator to return pyustr for each yielded character.
cdef class _pyustrIter:
cdef object uiter
def __init__(self, uiter):
self.uiter = uiter
def __iter__(self):
return self
def __next__(self):
x = next(self.uiter)
return pyu(x)
# _bdata/_udata retrieve raw data from bytes/unicode.
def _bdata(obj): # -> bytes
......
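
The wrapping-iterator pattern used by `_pyustrIter` above can be tried out in pure Python. The sketch below is a simplified stand-in and not the Cython code from this commit: `UStr` and `_UStrIter` are hypothetical names, and only Python 3 is assumed.

```python
# Illustration only: `UStr` and `_UStrIter` are hypothetical stand-ins
# for pyustr and _pyustrIter; Python 3 is assumed.

class _UStrIter:
    """Wrap a str iterator so that each yielded character is a UStr."""

    def __init__(self, inner):
        self._inner = inner

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration raised by the inner iterator propagates unchanged.
        return UStr(next(self._inner))


class UStr(str):
    def __iter__(self):
        # str.__iter__ yields plain str objects; the wrapper re-types them.
        return _UStrIter(str.__iter__(self))


chars = list(UStr("мир"))
print(chars)                                # ['м', 'и', 'р']
print(all(type(c) is UStr for c in chars))  # True
```

Without the wrapper, iterating a `str` subclass yields plain `str` items, so the subclass type would be lost after the first iteration step.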
......@@ -384,6 +384,38 @@ def test_strings_index():
assert _[1:-1:2]== b'\xbc\xb8\x80\x83\xd0\xd0\xd1'
# verify strings iteration.
def test_strings_iter():
us = u("миру мир"); u_ = u"миру мир"
bs = b("миру мир")
# iter( b/u/unicode ) -> iterate unicode characters
# NOTE that iter(b) too yields unicode characters - not integers or bytes
bi = iter(bs)
ui = iter(us)
ui_ = iter(u_)
class XIter:
def __iter__(self):
return self
def __next__(self, missing=object):
x = next(bi, missing)
y = next(ui, missing)
z = next(ui_, missing)
assert type(x) is type(y)
if x is not missing:
assert type(x) is ustr
if z is not missing:
assert type(z) is unicode
assert x == y
assert y == z
if x is missing:
raise StopIteration
return x
next = __next__ # py2
assert list(XIter()) == ['м','и','р','у',' ','м','и','р']
# verify string operations like `x + y` for all combinations of pairs from
# bytes, unicode, bstr, ustr and bytearray. Except if both x and y are std
# python types, e.g. (bytes, unicode), because those combinations are handled
......
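
The lockstep comparison done by `XIter` in `test_strings_iter` can also be expressed with `itertools.zip_longest`. The following is a sketch under the assumption that only plain Python 3 types are compared; `iter_lockstep` is a hypothetical helper and not part of the test suite.

```python
# Illustration only: `iter_lockstep` is a hypothetical helper.
from itertools import zip_longest

_missing = object()  # sentinel marking an exhausted iterable


def iter_lockstep(*iterables):
    """Yield items while checking that all iterables agree item by item."""
    for items in zip_longest(*iterables, fillvalue=_missing):
        # A sentinel in the tuple means one iterable ended before the others.
        if any(x is _missing for x in items):
            raise AssertionError("iterables have different lengths")
        first = items[0]
        if any(x != first for x in items):
            raise AssertionError("iterables disagree: %r" % (items,))
        yield first


decoded = "миру мир".encode("utf-8").decode("utf-8")
result = list(iter_lockstep(decoded, "миру мир"))
print(result)  # ['м', 'и', 'р', 'у', ' ', 'м', 'и', 'р']
```

Unlike `XIter`, this variant does not check the type of each yielded item; that check would have to be added separately when comparing `bstr`, `ustr` and `unicode`.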