Port/move channels to C/C++/Pyx
- Move channels implementation to be done in C++ inside libgolang. The code and logic is based on previous Python-level channels implementation, but the new code is just C++ and does not depend on Python nor GIL at all, and so works without GIL if libgolang runtime works without GIL(*). (*) for example "thread" runtime works without GIL, while "gevent" runtime acquires GIL on every semaphore acquire. New channels implementation is located in δ(libgolang.cpp). - Provide low-level C channels API to the implementation. The low-level C API was inspired by Libtask[1] and Plan9/Libthread[2]. [1] Libtask: a Coroutine Library for C and Unix. https://swtch.com/libtask. [2] http://9p.io/magic/man2html/2/thread. - Provide high-level C++ channels API that provides type-safety and automatic channel lifetime management. Overview of C and C++ APIs are in δ(libgolang.h). - Expose C++ channels API at Pyx level as Cython/nogil API so that Cython programs could use channels with ease and without need to care about lifetime management and low-level details. Overview of Cython/nogil channels API is in δ(README.rst) and δ(_golang.pxd). - Turn Python channels to be tiny wrapper around chan<PyObject>. Implementation note: - gevent case needs special care because greenlet, which gevent uses, swaps coroutine stack from C stack to heap on coroutine park, and replaces that space on C stack with stack of activated coroutine copied back from heap. This way if an object on g's stack is accessed while g is parked it would be memory of another g's stack. The channels implementation explicitly cares about this issue so that stack -> * channel send, or * -> stack channel receive work correctly. It should be noted that greenlet approach, which it inherits from stackless, is not only a bit tricky, but also comes with overhead (stack <-> heap copy), and prevents a coroutine to migrate from 1 OS thread to another OS thread as that would change addresses of on-stack things for that coroutine. As the latter property prevents to use multiple CPUs even if the program / runtime are prepared to work without GIL, it would be more logical to change gevent/greenlet to use separate stack for each coroutine. That would remove stack <-> heap copy and the need for special care in channels implementation for stack - stack sends. Such approach should be possible to implement with e.g. swapcontext or similar mechanism, and a proof of concept of such work wrapped into greenlet-compatible API exists[3]. It would be good if at some point there would be a chance to explore such approach in Pygolang context. [3] https://github.com/python-greenlet/greenlet/issues/113#issuecomment-264529838 and below Just this patch brings in the following speedup at Python level: (on i7@2.6GHz) thread runtime: name old time/op new time/op delta go 20.0µs ± 1% 15.6µs ± 1% -21.84% (p=0.000 n=10+10) chan 9.37µs ± 4% 2.89µs ± 6% -69.12% (p=0.000 n=10+10) select 20.2µs ± 4% 3.4µs ± 5% -83.20% (p=0.000 n=8+10) def 58.0ns ± 0% 60.0ns ± 0% +3.45% (p=0.000 n=8+10) func_def 43.8µs ± 1% 43.9µs ± 1% ~ (p=0.796 n=10+10) call 62.4ns ± 1% 63.5ns ± 1% +1.76% (p=0.001 n=10+10) func_call 1.06µs ± 1% 1.05µs ± 1% -0.63% (p=0.002 n=10+10) try_finally 136ns ± 0% 137ns ± 0% +0.74% (p=0.000 n=9+10) defer 2.28µs ± 1% 2.33µs ± 1% +2.34% (p=0.000 n=10+10) workgroup_empty 48.2µs ± 1% 34.1µs ± 2% -29.18% (p=0.000 n=9+10) workgroup_raise 58.9µs ± 1% 45.5µs ± 1% -22.74% (p=0.000 n=10+10) gevent runtime: name old time/op new time/op delta go 24.7µs ± 1% 15.9µs ± 1% -35.72% (p=0.000 n=9+9) chan 11.6µs ± 1% 7.3µs ± 1% -36.74% (p=0.000 n=10+10) select 22.5µs ± 1% 10.4µs ± 1% -53.73% (p=0.000 n=10+10) def 55.0ns ± 0% 55.0ns ± 0% ~ (all equal) func_def 43.6µs ± 1% 43.6µs ± 1% ~ (p=0.684 n=10+10) call 63.0ns ± 0% 64.0ns ± 0% +1.59% (p=0.000 n=10+10) func_call 1.06µs ± 1% 1.07µs ± 1% +0.45% (p=0.045 n=10+9) try_finally 135ns ± 0% 137ns ± 0% +1.48% (p=0.000 n=10+10) defer 2.31µs ± 1% 2.33µs ± 1% +0.89% (p=0.000 n=10+10) workgroup_empty 70.2µs ± 0% 55.8µs ± 0% -20.63% (p=0.000 n=10+10) workgroup_raise 90.3µs ± 0% 70.9µs ± 1% -21.51% (p=0.000 n=9+10) The whole Cython/nogil work - starting from 8fa3c15b (Start using Cython and providing Cython/nogil API) to this patch - brings in the following speedup at Python level: (on i7@2.6GHz) thread runtime: name old time/op new time/op delta go 92.9µs ± 1% 15.6µs ± 1% -83.16% (p=0.000 n=10+10) chan 13.9µs ± 1% 2.9µs ± 6% -79.14% (p=0.000 n=10+10) select 29.7µs ± 6% 3.4µs ± 5% -88.55% (p=0.000 n=10+10) def 57.0ns ± 0% 60.0ns ± 0% +5.26% (p=0.000 n=10+10) func_def 44.0µs ± 1% 43.9µs ± 1% ~ (p=0.055 n=10+10) call 63.5ns ± 1% 63.5ns ± 1% ~ (p=1.000 n=10+10) func_call 1.06µs ± 0% 1.05µs ± 1% -1.31% (p=0.000 n=10+10) try_finally 139ns ± 0% 137ns ± 0% -1.44% (p=0.000 n=10+10) defer 2.36µs ± 1% 2.33µs ± 1% -1.26% (p=0.000 n=10+10) workgroup_empty 98.4µs ± 1% 34.1µs ± 2% -65.32% (p=0.000 n=10+10) workgroup_raise 135µs ± 1% 46µs ± 1% -66.35% (p=0.000 n=10+10) gevent runtime: name old time/op new time/op delta go 68.8µs ± 1% 15.9µs ± 1% -76.91% (p=0.000 n=10+9) chan 14.8µs ± 1% 7.3µs ± 1% -50.67% (p=0.000 n=10+10) select 32.0µs ± 0% 10.4µs ± 1% -67.57% (p=0.000 n=10+10) def 58.0ns ± 0% 55.0ns ± 0% -5.17% (p=0.000 n=10+10) func_def 43.9µs ± 1% 43.6µs ± 1% -0.53% (p=0.035 n=10+10) call 63.5ns ± 1% 64.0ns ± 0% +0.79% (p=0.033 n=10+10) func_call 1.08µs ± 1% 1.07µs ± 1% -1.74% (p=0.000 n=10+9) try_finally 142ns ± 0% 137ns ± 0% -3.52% (p=0.000 n=10+10) defer 2.32µs ± 1% 2.33µs ± 1% +0.71% (p=0.005 n=10+10) workgroup_empty 90.3µs ± 0% 55.8µs ± 0% -38.26% (p=0.000 n=10+10) workgroup_raise 108µs ± 1% 71µs ± 1% -34.64% (p=0.000 n=10+10) This patch is the final patch in series to reach the goal of providing channels that could be used in Cython/nogil code. Cython/nogil channels work is dedicated to the memory of Вера Павловна Супрун[4]. [4] https://navytux.spb.ru/memory/%D0%A2%D1%91%D1%82%D1%8F%20%D0%92%D0%B5%D1%80%D0%B0.pdf#page=3
Showing
Please register or sign in to comment