golang/strconv.py · bcb95cd55731676911459742abc1284e1e24837a · nexedi / pygolang

golang: Provide b, u for strings · bcb95cd5

Kirill Smelkov authored Jan 29, 2020

With Python3 I've grown tired of constantly calling .encode() and .decode();
of getting an exception from e.g. b.decode() when the original argument was
already unicode; of getting an exception on raw bytes that are invalid UTF-8;
of not being able to use a bytes literal with non-ASCII characters, etc.

So instead of this pain provide two functions that make sure an object
is either bytes or unicode:

- b converts str/unicode/bytes s to UTF-8 encoded bytestring.

	Bytes input is preserved as-is:

	   b(bytes_input) == bytes_input

	Unicode input is UTF-8 encoded. The encoding always succeeds.
	b is the reverse operation of u - the following invariant is always true:

	   b(u(bytes_input)) == bytes_input

- u converts str/unicode/bytes s to unicode string.

	Unicode input is preserved as-is:

	   u(unicode_input) == unicode_input

	Bytes input is UTF-8 decoded. The decoding always succeeds and input
	information is not lost: bytes that are not valid UTF-8 are decoded into
	surrogate code points in the range U+DC80 to U+DCFF.
	u is the reverse operation of b - the following invariant is always true:

	   u(b(unicode_input)) == unicode_input

NOTE: encoding _and_ decoding *never* fail and *never* lose information. This
is achieved by using the 'surrogateescape' error handler on Python3, and by
providing a manual fallback that behaves the same way on Python2.
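
For illustration, the Python3 half of the NOTE above can be sketched as
follows. This is only a minimal sketch and not the code from strconv.py:
the names b_sketch and u_sketch are hypothetical stand-ins for b and u,
the sketch assumes Python3 only, and it does not reproduce the Python2
fallback.

```python
# Hypothetical stand-ins for b and u, Python3 only.
# They rely on the standard 'surrogateescape' error handler.

def b_sketch(s):
    """Return s as UTF-8 encoded bytes; bytes input is returned as-is."""
    if isinstance(s, bytes):
        return s
    return s.encode('utf-8', 'surrogateescape')

def u_sketch(s):
    """Return s as a unicode string; unicode input is returned as-is."""
    if isinstance(s, str):
        return s
    return s.decode('utf-8', 'surrogateescape')

raw = b'\xff\xfeabc'            # 0xff and 0xfe are not valid UTF-8
text = u_sketch(raw)            # decoding does not raise

# Each invalid byte 0xNN becomes the surrogate code point U+DCNN.
assert text == '\udcff\udcfeabc'

# Round trips in both directions preserve the input.
assert b_sketch(text) == raw
assert u_sketch(b_sketch('мир')) == 'мир'
```

A plain raw.decode('utf-8') on the same input would raise
UnicodeDecodeError, which is the kind of failure described at the top of
this message.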

The names are chosen so that b(something) resembles b"something", and
u(something) resembles u"something".

Although this is only a part of the strings solution discussed in [1],
it should help to handle byte- and unicode- strings in a more robust and
distraction-free way.

Top-level documentation is TODO.

[1] nexedi/zodbtools!13
