Kirill Smelkov authored
With Python3 I've got tired of constantly using .encode() and .decode();
of getting an exception if the original argument was unicode on e.g.
b.decode(); of getting an exception on raw bytes that are invalid UTF-8;
of not being able to use bytes literals with non-ASCII characters, etc.

So instead of this pain provide two functions that make sure an object
is either bytes or unicode:

- b converts str/unicode/bytes s to UTF-8 encoded bytestring.

    Bytes input is preserved as-is:

       b(bytes_input) == bytes_input

    Unicode input is UTF-8 encoded. The encoding always succeeds.
    b is the reverse operation to u - the following invariant is always true:

       b(u(bytes_input)) == bytes_input

- u converts str/unicode/bytes s to unicode string.

    Unicode input is preserved as-is:

       u(unicode_input) == unicode_input

    Bytes input is UTF-8 decoded. The decoding always succeeds and input
    information is not lost: invalid UTF-8 bytes are decoded into
    surrogate codes ranging from U+DC80 to U+DCFF.
    u is the reverse operation to b - the following invariant is always true:

       u(b(unicode_input)) == unicode_input

NOTE: encoding _and_ decoding *never* fail nor lose information. This
is achieved by using the 'surrogateescape' error handler on Python3, and
by providing a manual fallback that behaves the same way on Python2.

The naming is chosen so that b(something) resembles b"something", and
u(something) resembles u"something".

This, even being only a part of the strings solution discussed in [1],
should help handle byte and unicode strings in a more robust and
distraction-free way.

Top-level documentation is TODO.

[1] zodbtools!13
bcb95cd5
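
The behaviour of b and u described in the message above can be sketched
roughly as follows. This is a minimal Python3-only illustration built on
the standard 'surrogateescape' error handler, not the code from this
commit: the function bodies and the TypeError for other input types are
assumptions, and the manual Python2 fallback mentioned in the message is
not reproduced here.

```python
def b(s):
    """Return s as a UTF-8 encoded bytestring (sketch, Python3 only)."""
    if isinstance(s, bytes):
        return s                      # bytes input is preserved as-is
    if isinstance(s, str):
        # Surrogates U+DC80..U+DCFF produced by u() turn back into the
        # original raw bytes.
        return s.encode('utf-8', 'surrogateescape')
    # Assumption: anything else is rejected.
    raise TypeError("b: invalid type %s" % type(s).__name__)


def u(s):
    """Return s as a unicode string (sketch, Python3 only)."""
    if isinstance(s, str):
        return s                      # unicode input is preserved as-is
    if isinstance(s, bytes):
        # Invalid UTF-8 bytes become surrogates U+DC80..U+DCFF instead
        # of raising UnicodeDecodeError.
        return s.decode('utf-8', 'surrogateescape')
    # Assumption: anything else is rejected.
    raise TypeError("u: invalid type %s" % type(s).__name__)


# Round trips stated in the commit message, on data that is not valid UTF-8.
raw = b'\xff\xfeabc'
assert u(raw) == '\udcff\udcfeabc'
assert b(u(raw)) == raw
assert u(b('\u043c\u0438\u0440')) == '\u043c\u0438\u0440'
```

With plain .decode('utf-8') the first round trip would raise
UnicodeDecodeError; here the two invalid bytes survive as U+DCFF and
U+DCFE and are restored by b.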