Commit 6722061c
authored Jan 25, 2014 by Stefan Behnel

improve some Sphinx markup

parent a9963a76

Showing 1 changed file with 40 additions and 38 deletions

docs/src/tutorial/strings.rst (+40 -38)
@@ -16,46 +16,46 @@ implicitly insert these encoding/decoding steps.
 Python string types in Cython code
 ----------------------------------

-Cython supports four Python string types: ``bytes``, ``str``,
-``unicode`` and ``basestring``. The ``bytes`` and ``unicode`` types
-are the specific types known from normal Python 2.x (named ``bytes``
-and ``str`` in Python 3). Additionally, Cython also supports the
-``bytearray`` type starting with Python 2.6. It behaves like the
-``bytes`` type, except that it is mutable.
+Cython supports four Python string types: :obj:`bytes`, :obj:`str`,
+:obj:`unicode` and :obj:`basestring`. The :obj:`bytes` and :obj:`unicode` types
+are the specific types known from normal Python 2.x (named :obj:`bytes`
+and :obj:`str` in Python 3). Additionally, Cython also supports the
+:obj:`bytearray` type starting with Python 2.6. It behaves like the
+:obj:`bytes` type, except that it is mutable.

-The ``str`` type is special in that it is the byte string in Python 2
+The :obj:`str` type is special in that it is the byte string in Python 2
 and the Unicode string in Python 3 (for Cython code compiled with
 language level 2, i.e. the default). Meaning, it always corresponds
-exactly with the type that the Python runtime itself calls ``str``.
-Thus, in Python 2, both ``bytes`` and ``str`` represent the byte string
-type, whereas in Python 3, both ``str`` and ``unicode`` represent the
+exactly with the type that the Python runtime itself calls :obj:`str`.
+Thus, in Python 2, both :obj:`bytes` and :obj:`str` represent the byte string
+type, whereas in Python 3, both :obj:`str` and :obj:`unicode` represent the
 Python Unicode string type. The switch is made at C compile time, the
 Python version that is used to run Cython is not relevant.

-When compiling Cython code with language level 3, the ``str`` type is
+When compiling Cython code with language level 3, the :obj:`str` type is
 identified with exactly the Unicode string type at Cython compile time,
-i.e. it does not identify with ``bytes`` when running in Python 2.
+i.e. it does not identify with :obj:`bytes` when running in Python 2.

-Note that the ``str`` type is not compatible with the ``unicode``
+Note that the :obj:`str` type is not compatible with the :obj:`unicode`
 type in Python 2, i.e. you cannot assign a Unicode string to a variable
-or argument that is typed ``str``. The attempt will result in either
-a compile time error (if detectable) or a ``TypeError`` exception at
+or argument that is typed :obj:`str`. The attempt will result in either
+a compile time error (if detectable) or a :obj:`TypeError` exception at
 runtime. You should therefore be careful when you statically type a
 string variable in code that must be compatible with Python 2, as this
 Python version allows a mix of byte strings and unicode strings for data
 and users normally expect code to be able to work with both. Code that
 only targets Python 3 can safely type variables and arguments as either
-``bytes`` or ``unicode``.
+:obj:`bytes` or :obj:`unicode`.

-The ``basestring`` type represents both the types ``str`` and ``unicode``,
+The :obj:`basestring` type represents both the types :obj:`str` and :obj:`unicode`,
 i.e. all Python text string types in Python 2 and Python 3. This can be
 used for typing text variables that normally contain Unicode text (at
-least in Python 3) but must additionally accept the ``str`` type in
+least in Python 3) but must additionally accept the :obj:`str` type in
 Python 2 for backwards compatibility reasons. It is not compatible with
-the ``bytes`` type. Its usage should be rare in normal Cython code as
-the generic ``object`` type (i.e. untyped code) will normally be good
+the :obj:`bytes` type. Its usage should be rare in normal Cython code as
+the generic :obj:`object` type (i.e. untyped code) will normally be good
 enough and has the additional advantage of supporting the assignment of
-string subtypes. Support for the ``basestring`` type is new in Cython
+string subtypes. Support for the :obj:`basestring` type is new in Cython
 0.20.
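
Reviewer note (not part of the commit): the hunk above says that ``bytearray`` behaves like ``bytes`` except that it is mutable, and that ``str`` is the Unicode text type under Python 3. A minimal sketch of both points in plain Python 3 follows; it is an illustration only, not Cython code, and the variable names are made up for this note.

```python
# Illustration only, plain Python 3; not part of the commit.
data = b"abc"
buf = bytearray(data)    # mutable copy holding the same byte values
buf[0] = ord("x")        # in-place item assignment works on bytearray

try:
    data[0] = ord("x")   # bytes objects reject item assignment
    bytes_is_mutable = True
except TypeError:
    bytes_is_mutable = False

text = "abc"             # str is the Unicode text type in Python 3
```

In Cython code the same distinction matters when typing variables as ``bytes`` or ``bytearray``, but the typed declarations themselves are outside the scope of this sketch.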
@@ -100,7 +100,7 @@ Python variable::
     cdef char* c_string = c_call_returning_a_c_string()
     cdef bytes py_string = c_string

-A type cast to ``object`` or ``bytes`` will do the same thing::
+A type cast to :obj:`object` or :obj:`bytes` will do the same thing::

     py_string = <bytes> c_string
@@ -163,8 +163,8 @@ however, when the C function stores the pointer for later use. Apart
 from keeping a Python reference to the string object, no manual memory
 management is required.

-Starting with Cython 0.20, the ``bytearray`` type is supported and
-coerces in the same way as the ``bytes`` type. However, when using it
+Starting with Cython 0.20, the :obj:`bytearray` type is supported and
+coerces in the same way as the :obj:`bytes` type. However, when using it
 in a C context, special care must be taken not to grow or shrink the
 object buffer after converting it to a C string pointer. These
 modifications can change the internal buffer address, which will make
@@ -224,6 +224,7 @@ In Cython 0.18, these standard declarations have been changed to
 use the correct ``const`` modifier, so your code will automatically
 benefit from the new ``const`` support if it uses them.

+
 Decoding bytes to text
 ----------------------
@@ -234,7 +235,7 @@ the C byte strings to Python Unicode strings on reception, and to
 encode Python Unicode strings to C byte strings on the way out.

 With a Python byte string object, you would normally just call the
-``.decode()`` method to decode it into a Unicode string::
+``bytes.decode()`` method to decode it into a Unicode string::

     ustring = byte_string.decode('UTF-8')
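
Reviewer note (not part of the commit): the ``bytes.decode()`` call named in the hunk above can be tried in plain Python 3. This is a small sketch, assuming a UTF-8 encoded input chosen for this note; it shows that the byte length and the decoded text length differ once a non-ASCII character is involved.

```python
# Illustration only, plain Python 3; the sample value is an assumption.
byte_string = b"abc\xc3\xb6"               # UTF-8 encoding of 'abcö'
ustring = byte_string.decode("UTF-8")      # bytes -> Unicode text

byte_len = len(byte_string)                # counts bytes
text_len = len(ustring)                    # counts code points
```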
@@ -318,6 +319,7 @@ assignment. Later access to the invalidated pointer will read invalid
 memory and likely result in a segfault. Cython will therefore refuse
 to compile this code.

+
 C++ strings
 -----------
@@ -375,7 +377,7 @@ There are two use cases where this is inconvenient. First, if all
 C strings that are being processed (or the large majority) contain
 text, automatic encoding and decoding from and to Python unicode
 objects can reduce the code overhead a little. In this case, you
-can set the ``c_string_type`` directive in your module to ``unicode``
+can set the ``c_string_type`` directive in your module to :obj:`unicode`
 and the ``c_string_encoding`` to the encoding that your C code uses,
 for example::
@@ -393,7 +395,7 @@ The second use case is when all C strings that are being processed
 only contain ASCII encodable characters (e.g. numbers) and you want
 your code to use the native legacy string type in Python 2 for them,
 instead of always using Unicode. In this case, you can set the
-string type to ``str``::
+string type to :obj:`str`::

     # cython: c_string_type=str, c_string_encoding=ascii
@@ -472,15 +474,15 @@ whereas the following ``ISO-8859-15`` encoded source file will print
 Note that the unicode literal ``u'abcö'`` is a correctly decoded four
 character Unicode string in both cases, whereas the unprefixed Python
-``str`` literal ``'abcö'`` will become a byte string in Python 2 (thus
+:obj:`str` literal ``'abcö'`` will become a byte string in Python 2 (thus
 having length 4 or 5 in the examples above), and a 4 character Unicode
 string in Python 3. If you are not familiar with encodings, this may
 not appear obvious at first read. See `CEP 108`_ for details.

-As a rule of thumb, it is best to avoid unprefixed non-ASCII ``str``
+As a rule of thumb, it is best to avoid unprefixed non-ASCII :obj:`str`
 literals and to use unicode string literals for all text. Cython also
 supports the ``__future__`` import ``unicode_literals`` that instructs
-the parser to read all unprefixed ``str`` literals in a source file as
+the parser to read all unprefixed :obj:`str` literals in a source file as
 unicode string literals, just like Python 3.

 .. _`CEP 108`: http://wiki.cython.org/enhancements/stringliterals
@@ -522,7 +524,7 @@ explicitly, and the following will print ``A`` (or ``b'A'`` in Python
 The explicit coercion works for any C integer type. Values outside of
 the range of a :c:type:`char` or :c:type:`unsigned char` will raise an
-``OverflowError`` at runtime. Coercion will also happen automatically
+:obj:`OverflowError` at runtime. Coercion will also happen automatically
 when assigning to a typed variable, e.g.::

     cdef bytes py_byte_string
@@ -544,10 +546,10 @@ The following will print 65::
     cdef Py_UCS4 uchar_val = u'A'
     print( <long>uchar_val )

-Note that casting to a C ``long`` (or ``unsigned long``) will work
+Note that casting to a C :c:type:`long` (or :c:type:`unsigned long`) will work
 just fine, as the maximum code point value that a Unicode character
 can have is 1114111 (``0x10FFFF``). On platforms with 32bit or more,
-``int`` is just as good.
+:c:type:`int` is just as good.

 Narrow Unicode builds
@@ -682,15 +684,15 @@ zero-terminated UTF-16 encoded :c:type:`wchar_t*` strings, so called
 "wide strings".

 By default, Windows builds of CPython define :c:type:`Py_UNICODE` as
-a synonym for :c:type:`wchar_t`. This makes internal ``unicode``
+a synonym for :c:type:`wchar_t`. This makes internal :obj:`unicode`
 representation compatible with UTF-16 and allows for efficient zero-copy
 conversions. This also means that Windows builds are always
 `Narrow Unicode builds`_ with all the caveats.

 To aid interoperation with Windows APIs, Cython 0.19 supports wide
 strings (in the form of :c:type:`Py_UNICODE*`) and implicitly converts
-them to and from ``unicode`` string objects. These conversions behave the
-same way as they do for :c:type:`char*` and ``bytes`` as described in
+them to and from :obj:`unicode` string objects. These conversions behave the
+same way as they do for :c:type:`char*` and :obj:`bytes` as described in
 `Passing byte strings`_.

 In addition to automatic conversion, unicode literals that appear
@@ -722,7 +724,7 @@ Here is an example of how one would call a Unicode API on Windows::
 APIs deprecated and inefficient.

 One consequence of CPython 3.3 changes is that :py:func:`len` of
-``unicode`` strings is always measured in *code points* ("characters"),
+:obj:`unicode` strings is always measured in *code points* ("characters"),
 while Windows API expect the number of UTF-16 *code units*
 (where each surrogate is counted individually). To always get the number
 of code units, call :c:func:`PyUnicode_GetSize` directly.
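
Reviewer note (not part of the commit): the difference between code points and UTF-16 code units described in the last hunk can be shown in plain Python 3.3 or later. The helper ``utf16_code_units`` below is a hypothetical name introduced for this note; it counts code units by encoding to UTF-16 and is a pure-Python stand-in, not the C-API call that the documentation recommends.

```python
# Illustration only; utf16_code_units is a hypothetical helper for this note.
def utf16_code_units(text):
    """Count UTF-16 code units by encoding without a byte order mark."""
    return len(text.encode("utf-16-le")) // 2   # two bytes per code unit

sample = "a\U0001F600"                    # 'a' plus one non-BMP character

code_points = len(sample)                 # len() counts code points here
code_units = utf16_code_units(sample)     # the non-BMP character needs a surrogate pair
```

The two counts agree for text that stays inside the Basic Multilingual Plane and diverge by one for every character outside it.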