Commit 6722061c authored by Stefan Behnel

improve some Sphinx markup

parent a9963a76
@@ -16,46 +16,46 @@ implicitly insert these encoding/decoding steps.
Python string types in Cython code
----------------------------------
Cython supports four Python string types: ``bytes``, ``str``,
``unicode`` and ``basestring``. The ``bytes`` and ``unicode`` types
are the specific types known from normal Python 2.x (named ``bytes``
and ``str`` in Python 3). Additionally, Cython also supports the
``bytearray`` type starting with Python 2.6. It behaves like the
``bytes`` type, except that it is mutable.
The ``str`` type is special in that it is the byte string in Python 2
Cython supports four Python string types: :obj:`bytes`, :obj:`str`,
:obj:`unicode` and :obj:`basestring`. The :obj:`bytes` and :obj:`unicode` types
are the specific types known from normal Python 2.x (named :obj:`bytes`
and :obj:`str` in Python 3). Additionally, Cython also supports the
:obj:`bytearray` type starting with Python 2.6. It behaves like the
:obj:`bytes` type, except that it is mutable.
The :obj:`str` type is special in that it is the byte string in Python 2
and the Unicode string in Python 3 (for Cython code compiled with
language level 2, i.e. the default). This means that it always corresponds
exactly with the type that the Python runtime itself calls ``str``.
Thus, in Python 2, both ``bytes`` and ``str`` represent the byte string
type, whereas in Python 3, both ``str`` and ``unicode`` represent the
exactly with the type that the Python runtime itself calls :obj:`str`.
Thus, in Python 2, both :obj:`bytes` and :obj:`str` represent the byte string
type, whereas in Python 3, both :obj:`str` and :obj:`unicode` represent the
Python Unicode string type. The switch is made at C compile time; the
Python version that is used to run Cython is not relevant.
When compiling Cython code with language level 3, the ``str`` type is
When compiling Cython code with language level 3, the :obj:`str` type is
identified with exactly the Unicode string type at Cython compile time,
i.e. it does not identify with ``bytes`` when running in Python 2.
i.e. it does not identify with :obj:`bytes` when running in Python 2.
Note that the ``str`` type is not compatible with the ``unicode``
Note that the :obj:`str` type is not compatible with the :obj:`unicode`
type in Python 2, i.e. you cannot assign a Unicode string to a variable
or argument that is typed ``str``. The attempt will result in either
a compile time error (if detectable) or a ``TypeError`` exception at
or argument that is typed :obj:`str`. The attempt will result in either
a compile time error (if detectable) or a :obj:`TypeError` exception at
runtime. You should therefore be careful when you statically type a
string variable in code that must be compatible with Python 2, as this
Python version allows a mix of byte strings and unicode strings for data
and users normally expect code to be able to work with both. Code that
only targets Python 3 can safely type variables and arguments as either
``bytes`` or ``unicode``.
:obj:`bytes` or :obj:`unicode`.
The ``basestring`` type represents both the types ``str`` and ``unicode``,
The :obj:`basestring` type represents both the types :obj:`str` and :obj:`unicode`,
i.e. all Python text string types in Python 2 and Python 3. This can be
used for typing text variables that normally contain Unicode text (at
least in Python 3) but must additionally accept the ``str`` type in
least in Python 3) but must additionally accept the :obj:`str` type in
Python 2 for backwards compatibility reasons. It is not compatible with
the ``bytes`` type. Its usage should be rare in normal Cython code as
the generic ``object`` type (i.e. untyped code) will normally be good
the :obj:`bytes` type. Its usage should be rare in normal Cython code as
the generic :obj:`object` type (i.e. untyped code) will normally be good
enough and has the additional advantage of supporting the assignment of
string subtypes. Support for the ``basestring`` type is new in Cython
string subtypes. Support for the :obj:`basestring` type is new in Cython
0.20.
@@ -100,7 +100,7 @@ Python variable::
cdef char* c_string = c_call_returning_a_c_string()
cdef bytes py_string = c_string
A type cast to ``object`` or ``bytes`` will do the same thing::
A type cast to :obj:`object` or :obj:`bytes` will do the same thing::
py_string = <bytes> c_string
@@ -163,8 +163,8 @@ however, when the C function stores the pointer for later use. Apart
from keeping a Python reference to the string object, no manual memory
management is required.
Starting with Cython 0.20, the ``bytearray`` type is supported and
coerces in the same way as the ``bytes`` type. However, when using it
Starting with Cython 0.20, the :obj:`bytearray` type is supported and
coerces in the same way as the :obj:`bytes` type. However, when using it
in a C context, special care must be taken not to grow or shrink the
object buffer after converting it to a C string pointer. These
modifications can change the internal buffer address, which will make
@@ -224,6 +224,7 @@ In Cython 0.18, these standard declarations have been changed to
use the correct ``const`` modifier, so your code will automatically
benefit from the new ``const`` support if it uses them.
Decoding bytes to text
----------------------
@@ -234,7 +235,7 @@ the C byte strings to Python Unicode strings on reception, and to
encode Python Unicode strings to C byte strings on the way out.
With a Python byte string object, you would normally just call the
``.decode()`` method to decode it into a Unicode string::
``bytes.decode()`` method to decode it into a Unicode string::
ustring = byte_string.decode('UTF-8')
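
As a minimal sketch in plain Python, assuming a UTF-8 encoded input (the
sample bytes and the ``round_trip`` name are illustrative only), decoding
and the inverse encoding step could look like this:

```python
# Hypothetical input: the UTF-8 encoding of the four-character text 'abcö'.
byte_string = b'abc\xc3\xb6'

# Decode the byte string into a Unicode string.
ustring = byte_string.decode('UTF-8')

# Encoding the text again with the same codec restores the original bytes.
round_trip = ustring.encode('UTF-8')

print(len(byte_string), len(ustring))
```

The byte string is one item longer than the decoded text because ``ö``
takes two bytes in UTF-8.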
@@ -318,6 +319,7 @@ assignment. Later access to the invalidated pointer will read invalid
memory and likely result in a segfault. Cython will therefore refuse
to compile this code.
C++ strings
-----------
@@ -375,7 +377,7 @@ There are two use cases where this is inconvenient. First, if all
C strings that are being processed (or the large majority) contain
text, automatic encoding and decoding from and to Python unicode
objects can reduce the code overhead a little. In this case, you
can set the ``c_string_type`` directive in your module to ``unicode``
can set the ``c_string_type`` directive in your module to :obj:`unicode`
and the ``c_string_encoding`` to the encoding that your C code uses,
for example::
@@ -393,7 +395,7 @@ The second use case is when all C strings that are being processed
only contain ASCII encodable characters (e.g. numbers) and you want
your code to use the native legacy string type in Python 2 for them,
instead of always using Unicode. In this case, you can set the
string type to ``str``::
string type to :obj:`str`::
# cython: c_string_type=str, c_string_encoding=ascii
@@ -472,15 +474,15 @@ whereas the following ``ISO-8859-15`` encoded source file will print
Note that the unicode literal ``u'abcö'`` is a correctly decoded four
character Unicode string in both cases, whereas the unprefixed Python
``str`` literal ``'abcö'`` will become a byte string in Python 2 (thus
:obj:`str` literal ``'abcö'`` will become a byte string in Python 2 (thus
having length 4 or 5 in the examples above), and a 4 character Unicode
string in Python 3. If you are not familiar with encodings, this may
not appear obvious at first read. See `CEP 108`_ for details.
As a rule of thumb, it is best to avoid unprefixed non-ASCII ``str``
As a rule of thumb, it is best to avoid unprefixed non-ASCII :obj:`str`
literals and to use unicode string literals for all text. Cython also
supports the ``__future__`` import ``unicode_literals`` that instructs
the parser to read all unprefixed ``str`` literals in a source file as
the parser to read all unprefixed :obj:`str` literals in a source file as
unicode string literals, just like Python 3.
.. _`CEP 108`: http://wiki.cython.org/enhancements/stringliterals
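
The length difference described above (4 or 5 bytes for the same four
characters) can be reproduced in plain Python. This is only a sketch: it
encodes the text explicitly instead of relying on the source file encoding,
and the variable names are illustrative.

```python
# The four-character text 'abcö', written with an escape so that the
# example does not depend on the encoding of this source file.
text = u'abc\xf6'

# The same text as bytes in the two source encodings discussed above.
utf8_bytes = text.encode('UTF-8')          # 'ö' takes two bytes
latin9_bytes = text.encode('ISO-8859-15')  # 'ö' takes one byte

print(len(text), len(utf8_bytes), len(latin9_bytes))
```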
@@ -522,7 +524,7 @@ explicitly, and the following will print ``A`` (or ``b'A'`` in Python
The explicit coercion works for any C integer type. Values outside of
the range of a :c:type:`char` or :c:type:`unsigned char` will raise an
``OverflowError`` at runtime. Coercion will also happen automatically
:obj:`OverflowError` at runtime. Coercion will also happen automatically
when assigning to a typed variable, e.g.::
cdef bytes py_byte_string
@@ -544,10 +546,10 @@ The following will print 65::
cdef Py_UCS4 uchar_val = u'A'
print( <long>uchar_val )
Note that casting to a C ``long`` (or ``unsigned long``) will work
Note that casting to a C :c:type:`long` (or :c:type:`unsigned long`) will work
just fine, as the maximum code point value that a Unicode character
can have is 1114111 (``0x10FFFF``). On platforms with 32 bits or more,
``int`` is just as good.
:c:type:`int` is just as good.
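
The range argument can be checked with a few lines of plain Python. This is
a sketch that assumes Python 3 (for ``chr``) and a 32-bit signed C ``int``;
the constant names are illustrative.

```python
# Highest code point that a Unicode character can have.
MAX_CODE_POINT = 0x10FFFF

# Largest value of a 32-bit signed integer (assumed width of a C int).
INT32_MAX = 2**31 - 1

# Every code point therefore fits into such an integer.
fits_in_int32 = MAX_CODE_POINT <= INT32_MAX

# Round trip through a one-character string (Python 3 only).
highest = ord(chr(MAX_CODE_POINT))

print(MAX_CODE_POINT, fits_in_int32)
```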
Narrow Unicode builds
@@ -682,15 +684,15 @@ zero-terminated UTF-16 encoded :c:type:`wchar_t*` strings, so called
"wide strings".
By default, Windows builds of CPython define :c:type:`Py_UNICODE` as
a synonym for :c:type:`wchar_t`. This makes internal ``unicode``
a synonym for :c:type:`wchar_t`. This makes internal :obj:`unicode`
representation compatible with UTF-16 and allows for efficient zero-copy
conversions. This also means that Windows builds are always
`Narrow Unicode builds`_ with all the caveats.
To aid interoperation with Windows APIs, Cython 0.19 supports wide
strings (in the form of :c:type:`Py_UNICODE*`) and implicitly converts
them to and from ``unicode`` string objects. These conversions behave the
same way as they do for :c:type:`char*` and ``bytes`` as described in
them to and from :obj:`unicode` string objects. These conversions behave the
same way as they do for :c:type:`char*` and :obj:`bytes` as described in
`Passing byte strings`_.
In addition to automatic conversion, unicode literals that appear
@@ -722,7 +724,7 @@ Here is an example of how one would call a Unicode API on Windows::
APIs deprecated and inefficient.
One consequence of CPython 3.3 changes is that :py:func:`len` of
``unicode`` strings is always measured in *code points* ("characters"),
:obj:`unicode` strings is always measured in *code points* ("characters"),
while the Windows API expects the number of UTF-16 *code units*
(where each surrogate is counted individually). To always get the number
of code units, call :c:func:`PyUnicode_GetSize` directly.
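
The difference between code points and UTF-16 code units can be illustrated
in plain Python. This sketch assumes CPython 3.3 or later, and
``utf16_code_units`` is a hypothetical helper that counts units by encoding
the string rather than by calling the C-API:

```python
def utf16_code_units(text):
    """Count the UTF-16 code units of ``text`` (hypothetical helper)."""
    # Each UTF-16 code unit occupies two bytes; 'utf-16-le' adds no BOM.
    return len(text.encode('utf-16-le')) // 2

# One ASCII character followed by one character outside the BMP,
# which UTF-16 represents as a surrogate pair.
s = u'a\U0001F600'

code_points = len(s)              # counted in code points on CPython 3.3+
code_units = utf16_code_units(s)  # each surrogate is counted individually

print(code_points, code_units)
```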