    time: Redo timers properly · 044deb35
    Kirill Smelkov authored
    Background: in 2019, in 9c260fde (time: New package that mirrors Go's
    time) and b073f6df (time: Move/Port timers to C++/Pyx nogil), I added
    basic timers - with a proper API but with a very dumb implementation that
    spawned one thread per timer. Users had just a few timers and this was
    working, surprisingly, relatively ok...
    
    ... until 2023, when I was working on XLTE, which needs to organize 100Hz
    polling of the Amarisoft eNodeB service to retrieve information about
    flows on Data Radio Bearers:
    
        kirr/xlte@2a016d48
        https://lab.nexedi.com/kirr/xlte/-/blob/8e606c64/amari/drb.py
    
    There each request comes with its own deadline - to catch "no reply" -
    and the deadlines are implemented via timers. So 100 threads are created
    every second, which adds visible overhead, consumes a lot of virtual
    address space and RSS for thread stacks, and should all be unnecessary.
    
    We were tolerating even that for some time, but recently Joanne approached
    me with reports that the xamari program, which does the polling, is
    leaking memory.
    
    With that, and because it was hard to find what was actually leaking, I
    started to remove uncertainties - and there is a lot of uncertainty in
    what is going on when lots of threads are created over and over.
    
    In the end the leak likely turned out to be a different thing (see
    !24, still
    discovered while working on this patch), but all of the above was
    enough motivation to finally start redoing the timers properly.
    
    --------
    
    So when it comes to doing timers more or less properly, there is usually
    a queue of armed timers, and a loop that picks entries from that queue to
    fire them. I initially tried to do the simple thing and use
    std::priority_queue for that, because priority_queue is internally a
    heap, and heaps can provide O(log(n)) insertion and removal of an
    arbitrary element, plus O(1) "pick top element to process". Exactly what
    would suit. However, I quickly found that even in 2024 std::priority_queue
    does not provide a removal operation at all, and there is no such thing
    as e.g. std::sift_heap that would help to implement it manually. Which is
    surprising, because e.g. libevent implements all that just fine via
    sifting up/down upon removal in logarithmic complexity:
    
    https://github.com/libevent/libevent/blob/80e25c02/minheap-internal.h#L96-L115
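
    For illustration, the missing primitive is not hard: to delete an
    arbitrary element from a binary heap one moves the last element into the
    freed slot and sifts it up or down. A minimal sketch over a plain
    std::vector (just to show the idea, not code from this patch):

        // O(log n) removal of element i from a min-heap stored in a vector.
        #include <vector>
        #include <utility>

        static void sift_down(std::vector<int>& h, size_t i) {
            for (;;) {
                size_t l = 2*i + 1, r = 2*i + 2, m = i;
                if (l < h.size() && h[l] < h[m]) m = l;
                if (r < h.size() && h[r] < h[m]) m = r;
                if (m == i) break;
                std::swap(h[i], h[m]);
                i = m;
            }
        }

        static void sift_up(std::vector<int>& h, size_t i) {
            while (i > 0 && h[i] < h[(i-1)/2]) {
                std::swap(h[i], h[(i-1)/2]);
                i = (i-1)/2;
            }
        }

        // remove h[i] while preserving the heap invariant
        static void heap_remove(std::vector<int>& h, size_t i) {
            h[i] = h.back();
            h.pop_back();
            if (i < h.size()) {
                // the element moved into slot i might need to go either way
                if (i > 0 && h[i] < h[(i-1)/2])
                    sift_up(h, i);
                else
                    sift_down(h, i);
            }
        }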
    
    The lack of an efficient removal operation turned out to be a blocker for
    using std::priority_queue, because most of the timers that are armed for
    timeouts never expire: upon successful completion of the covered
    operation, the timer is stopped. In other words, the timer is removed
    from the timer queue, and removal is one of the most frequent operations.
    
    So, since std::priority_queue cannot work, we would need to either bring
    in another heap implementation, or, if we are to bring in something
    anyway, bring in and use something else that is more suitable for
    implementing timers.
    
    That reminded me that back in 2005, for my Navy project, I had already
    implemented a custom timer wheel to handle timeouts after reading
    https://lwn.net/Articles/152436/ . Contrary to heaps, such timer wheels
    provide O(1) insertion and removal of timers and generally work faster.
    But this time I did not want to delve into implementing all that myself
    again, and looked around at what is available out there.
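
    To illustrate the idea only (a deliberately simplified single-level
    "hashed" wheel, not the hierarchical implementation that ends up being
    used): timers live in per-tick buckets, so both arming and cancelling are
    plain list insert/unlink, i.e. O(1):

        #include <cstdint>
        #include <functional>
        #include <list>

        struct WheelTimer {
            uint64_t                 deadline;    // absolute tick at which to fire
            std::function<void()>    f;           // callback to run on expiration
            std::list<WheelTimer*>::iterator it;  // position in bucket -> O(1) unlink
            int                      bucket;
        };

        struct Wheel {
            enum { N = 256 };                     // number of buckets (one tick apart)
            std::list<WheelTimer*>   buckets[N];
            uint64_t                 now = 0;

            void arm(WheelTimer* t, uint64_t delta) {   // O(1)
                t->deadline = now + delta;
                t->bucket   = t->deadline % N;
                buckets[t->bucket].push_front(t);
                t->it = buckets[t->bucket].begin();
            }

            void cancel(WheelTimer* t) {                // O(1)
                buckets[t->bucket].erase(t->it);
            }

            void tick() {                               // advance time by one tick
                now++;
                auto& b = buckets[now % N];
                for (auto it = b.begin(); it != b.end(); ) {
                    WheelTimer* t = *it;
                    if (t->deadline <= now) {
                        it = b.erase(it);
                        t->f();                         // expired - fire it
                    } else {
                        ++it;                           // armed for a later lap
                    }
                }
            }
        };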
    
    There was an update to the kernel timer-wheel implementation described at
    https://lwn.net/Articles/646950/ , and from that a project called
    Timeout.c was found, which provides an implementation of such a wheel for
    user space: https://25thandclement.com/~william/projects/timeout.c.html .
    
    However, when picking third-party code, we should be ready to understand
    it and fix bugs in it on our own. The audit of timeout.c did not go very
    smoothly - there are many platform-dependent places, and the issue
    tracker shows signs that not everything is always ok with the
    implementation. With that I looked around a bit more and found the more
    compact and more portable Ratas library, with good structure and
    description, whose audit went better:
    
        https://www.snellman.net/blog/archive/2016-07-27-ratas-hierarchical-timer-wheel
        https://github.com/jsnell/ratas
    
    Here, after going through the code, I feel capable of understanding
    issues and fixing bugs myself if that becomes needed.
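
    For a feel of the interface, arming, cancelling and advancing a Ratas
    wheel looks approximately like below (an approximate sketch based on the
    Ratas README; the actual integration in the patch wraps the wheel behind
    pygolang's time API):

        #include <cstdio>
        #include <functional>
        #include "timer-wheel.h"    // single header from github.com/jsnell/ratas

        int main() {
            TimerWheel timers;

            // event whose callback runs if/when the timer expires
            TimerEvent<std::function<void()>> timeout(
                [] { std::printf("timed out\n"); });

            timers.schedule(&timeout, 100);  // arm: fire in 100 ticks
            timers.advance(50);              // 50 ticks pass - nothing fires
            timeout.cancel();                // covered operation completed in time
            timers.advance(100);             // nothing fires: timer was cancelled
            return 0;
        }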
    
    And the benchmark comparison of Timeout.c and Ratas shows that they
    should be of the same order regarding performance:
    
    https://lab.nexedi.com/kirr/misc/-/blob/4f51fd6/bench/time-wheel/ratas-vs-timeout.pdf
    kirr/ratas@382321d2
    kirr/timeout@d6f15744
    
    which makes Ratas the winner for me.
    
    Having a timer-wheel implementation, the rest is just technique to glue it
    all together. One implementation aspect deserves to be mentioned, though:
    
    The timer loop uses Semaphore.acquire, recently modernized to also accept
    a timeout, to organize sleep in between firing pauses, while also being
    able to be woken up early if a new timer is armed with an earlier
    expiration time.
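
    In rough pseudocode-like terms the scheme is as below (an illustrative
    analogue using C++20 std::counting_semaphore; Wheel contents, arm_timer
    and the fixed 10ms value are hypothetical stand-ins, not pygolang's
    actual internals):

        #include <chrono>
        #include <mutex>
        #include <semaphore>

        std::mutex                 mu;         // protects the timer wheel
        std::counting_semaphore<1> wakeup(0);  // released when an earlier timer is armed

        void timer_loop() {
            for (;;) {
                std::chrono::steady_clock::duration dt;
                {
                    std::lock_guard<std::mutex> g(mu);
                    // dt <- time until the nearest armed timer expires
                    dt = std::chrono::milliseconds(10);
                }
                // sleep until that expiry, or until woken up early
                wakeup.try_acquire_for(dt);

                std::lock_guard<std::mutex> g(mu);
                // fire all timers whose deadline has passed
            }
        }

        void arm_timer(bool earlier_than_nearest) {
            std::lock_guard<std::mutex> g(mu);
            // insert the new timer into the wheel ...
            if (earlier_than_nearest)
                wakeup.release();   // wake the loop so it recomputes its sleep
        }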
    
    Other than that the changes are mostly straightforward. Please see the
    patch itself for details.
    
    Regarding how much more efficient the new implementation is compared to
    what we had before, benchmarks are added to measure arming timers that do
    not fire and, for symmetry, arming timers that do fire. We are most
    interested in the first benchmark, because it shows how cheap or
    expensive it is to use timers to implement timeouts, but the second one
    is also useful to see the overhead of the whole timers machinery.
    
    On my machine under py3.11 they go as follows after this patch:
    
        name              time/op
        timer_arm_cancel   805ns ± 0%
        timer_arm_fire    9.63µs ± 0%
    
    and before the patch the benchmarks simply do not run till the end,
    because they run out of memory due to the huge number of threads being
    created.
    
    Still, with the following test program we can measure the effect the new
    timers implementation has:
    
        ---- 8< ----
        from golang import time
    
        def main():
            δt_rate = 1*time.millisecond
    
            tprev = time.now()
            tnext = tprev + δt_rate
            while 1:
                # arm a timer as if to guard an operation with a timeout, then
                # cancel it right away as if the operation completed in time
                timer = time.Timer(5*time.second)
                _ = timer.stop()
                assert _ is True
    
                t = time.now()
                δtsleep = tnext - t
                #print('sleep %.3f ms' % (δtsleep/time.millisecond))
                time.sleep(δtsleep)
                tprev = tnext
                tnext += δt_rate
    
        main()
        ---- 8< ----
    
    This program creates/arms and cancels a timer 1000 times per second.
    
    Before this patch the program consumes ~ 30% of CPU, while after this
    patch it consumes ~ 7-8% of CPU.
    
    For reference, just the sleep part of that program, with all timer-related
    code removed, consumes ~5% of CPU, while the consumption of a plain
    sleep(1ms) loop in C, directly using system calls,
    
        ---- 8< ----
        #include <unistd.h>
    
        int main() {
            while (1) {
                usleep(1000);
            }
            return 0;
        }
        ---- 8< ----
    
    is ~ 3-4% of CPU on my machine.
    
    /cc @jerome
    /cc ORS team (@jhuge, @lu.xu, @tomo, @xavier_thompson, @Daetalus)
    /proposed-for-review-on !26