Activate LeakSanitizer on recent CPython 3 + Fix memory leak in .from_chan_* pychan <- chan[X] wrappers
The most interesting patches are patches 4 and 5 whose description I quote below.
/cc @jerome
/cc ORS team (@jhuge, @lu.xu, @tomo, @xavier_thompson, @Daetalus)
---- 8< ----
tox: *-asan: Activate LeakSanitizer on recent CPython 3
Background: in 2019 in 4dc1a7f0 (tox += ThreadSanitizer, AddressSanitizer, Python debug builds) I've added ThreadSanitizer and AddressSanitizer to our test build matrix. That was done in order to help catching concurrency issues while doing quality assurance and it indeed worked well(*). However AddressSanitizer was added with disabled memory leak detection, because at that time there were tons of various allocations done by Python itself, that were not released on Python shutdown. So leak reporting pass was disabled to avoid huge non-pygolang related printouts.
5 years later Pygolang is used by various projects, including in XLTE, where, in particular, it is used to organize 100Hz polling of Amarisoft eNodeB service to retrieve information about flows on Data Radio Bearers:
kirr/xlte@2a016d48
https://lab.nexedi.com/kirr/xlte/-/blob/8e606c64/amari/drb.py
And everything works relatively well except that recently Joanne approached me with reports that xamari program, that does the polling, is leaking memory.
During investigation I could manually find at least two causes of the leakage, and, while working on a fix, discovered third type of leak. Since all those leaks happen inside Pygolang and at, or around, C level - most of the time not visible through python objects and so not discoverable via e.g objgraph, it raises the need to have robust quality assurance procedures built into maintenance process to cover not only concurrency and memory safety issues, but also to reliably detect low-level memory leaks.
So before we start fixing any of the leakages, let's activate LeakSanitizer (https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer) to be present during tox runs, so that whenever we run something via ASAN in the end tested process is checked for whether anything was leaked and if yes, which codepath was used to allocate leaked memory.
Initially I though that it would be hard to have distilled reports, the ones without much of added noise due to leaks on python side, but it turns out that starting from py3.11 cpython codebase was improved to release on interpreter shutdown most of what was allocated, and so with modest suppression rules we can distill LeakSanitizer output to come with the issues that are relevant to pygolang itself.
That builds the foundation for making sure there are no memory leaks done by pygolang in the follow-up patches. The next patch will fix chan leakage in pychan_from_raw, and there will be more fixes to come later.
For now ASAN builds become broken, but that is good that we can see the issues. Please see the appendix for current LeakSanitizer output.
(*) at that time, besides discovering longstanding race-condition in Python itself (https://bugs.python.org/issue38106, https://github.com/python/cpython/pull/16047), several concurrency issues were found in pygolang codebase as well:
dcf4ebd1
65c43848
5aa1e899
fd2a6fab
golang: pychan: Fix memory leak in .from_chan_* pychan <- chan[X] wrappers
The code of pychan_from_raw, that pychan.from_chan_X calls, was creating pychan object and then setting pychan._ch to the specified raw channel. But it missed that pychan.new was creating full Python object, with everything initialized - in particular with pychan._ch initialized to another channel created by pychan.cinit constructor, and so pointer to that another channel was removed without decrefing it first. That caused the leak of that second channel observable as the following LeakSanitizer report when run on e.g. added test:
Direct leak of 72 byte(s) in 1 object(s) allocated from:
#0 0x7f70902f3bd7 in malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:69
#1 0x7f708bfab612 in zalloc golang/runtime/libgolang.cpp:1307
#2 0x7f708bfa56c0 in _makechan golang/runtime/libgolang.cpp:469
#3 0x7f708be78da2 in __pyx_f_6golang_7_golang__makechan_pyexc golang/_golang.cpp:8844
#4 0x7f708be703ad in __pyx_pf_6golang_7_golang_6pychan___cinit__ golang/_golang.cpp:4870
#5 0x7f708be7019d in __pyx_pw_6golang_7_golang_6pychan_1__cinit__ golang/_golang.cpp:4833
#6 0x7f708beaa0f8 in __pyx_tp_new_6golang_7_golang_pychan golang/_golang.cpp:29896
#7 0x7f708be743af in __pyx_f_6golang_7_golang_pychan_from_raw golang/_golang.cpp:7240
#8 0x7f708be73e76 in __pyx_f_6golang_7_golang_6pychan_from_chan_int golang/_golang.cpp:6927
#9 0x7f7088a990a5 in __pyx_pf_6golang_12_golang_test_62test_pychan_from_raw_noleak golang/_golang_test.cpp:7479
#10 0x7f7088a98ef2 in __pyx_pw_6golang_12_golang_test_63test_pychan_from_raw_noleak golang/_golang_test.cpp:7445
-> Fix it by adjusting raw chan -> py chan conversion routine to first create uninitialized py object - with no underlying channel created at all.
Adjust pynil to use pychan_from_raw instead of duplicating its code, so that we keep the logic and the fix only in one place.
The context where this problem was originally discovered is xamari from XLTE where on every request new timer is created to handle request timeout, and that timer, being Python-level object, wraps underlying C-level timer with creating pychan wrapper of that:
https://lab.nexedi.com/kirr/xlte/-/blob/8e606c64/amari/__init__.py#L182-193 https://lab.nexedi.com/nexedi/pygolang/-/blob/6dd420da/golang/_time.pyx#L96 https://lab.nexedi.com/nexedi/pygolang/-/blob/6dd420da/golang/_time.pyx#L104
As the result on every request memory was leaked on and on.
Besides new test going ok even under LeakSanitizer, the following test program confirms the fix:
from golang import context
def main():
bg = context.background()
key = object()
while 1:
ctx = context.with_value(bg, key, 1)
main()
Before the patch it leaks ~ 1GB of RAM every second on my computer due to similar raw chan -> py chan wrapping in py context.with_value
https://lab.nexedi.com/nexedi/pygolang/-/blob/6dd420da/golang/_context.pyx#L169-180 https://lab.nexedi.com/nexedi/pygolang/-/blob/6dd420da/golang/_context.pyx#L38-43
After the fix that program stays at constant RSS usage forever.