    atomics: Provide rcuref - scalable reference counting · ee1ee6db
    Thomas Gleixner authored
    atomic_t based reference counting, including refcount_t, uses
    atomic_inc_not_zero() for acquiring a reference. atomic_inc_not_zero() is
    implemented with an atomic_try_cmpxchg() loop. High contention of the
    reference count leads to retry loops and scales badly. There is nothing to
    improve on this implementation as the semantics have to be preserved.
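
The retry loop described above can be modelled in userspace with C11 atomics. This is only a sketch: the `sketch_*` names are hypothetical, and the code is not the kernel's atomic_inc_not_zero(). It shows why contention hurts: every failed compare-exchange forces another pass through the loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of an inc_not_zero() style acquire. */
static bool sketch_inc_not_zero(atomic_int *cnt)
{
	int old = atomic_load(cnt);

	do {
		/* A count of zero means the object is already going away. */
		if (old == 0)
			return false;
		/*
		 * On failure the compare-exchange stores the current value
		 * in 'old' and the loop retries. Under contention this
		 * retry is what scales badly.
		 */
	} while (!atomic_compare_exchange_weak(cnt, &old, old + 1));

	return true;
}

/* Hypothetical release side: returns true when the last reference is dropped. */
static bool sketch_dec_and_test(atomic_int *cnt)
{
	int old = atomic_fetch_sub(cnt, 1);

	assert(old > 0);	/* an underflow would be a caller bug in this model */
	return old == 1;
}
```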
    
    Provide rcuref as a scalable alternative solution which is suitable for RCU
    managed objects. Similar to refcount_t it comes with overflow and underflow
    detection and mitigation.
    
    rcuref treats the underlying atomic_t as an unsigned integer and partitions
    this space into zones:
    
      0x00000000 - 0x7FFFFFFF	valid zone (1 .. (INT_MAX + 1) references)
      0x80000000 - 0xBFFFFFFF	saturation zone
      0xC0000000 - 0xFFFFFFFE	dead zone
      0xFFFFFFFF   			no reference
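
As an illustration of the zone table, the following userspace sketch classifies a raw counter value. The enum and function names are hypothetical and exist only for this example. The boundaries are copied from the table above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the four zones listed in the table. */
enum sketch_zone {
	SKETCH_ZONE_VALID,
	SKETCH_ZONE_SATURATED,
	SKETCH_ZONE_DEAD,
	SKETCH_ZONE_NOREF,
};

/* Map a raw counter value, read as an unsigned integer, to its zone. */
static enum sketch_zone sketch_classify(uint32_t v)
{
	if (v <= 0x7FFFFFFFu)
		return SKETCH_ZONE_VALID;
	if (v <= 0xBFFFFFFFu)
		return SKETCH_ZONE_SATURATED;
	if (v <= 0xFFFFFFFEu)
		return SKETCH_ZONE_DEAD;
	return SKETCH_ZONE_NOREF;
}

/*
 * In the valid zone the stored value is one less than the number of
 * references: 0 means one reference, 0x7FFFFFFF means INT_MAX + 1.
 */
static uint64_t sketch_references(uint32_t v)
{
	assert(sketch_classify(v) == SKETCH_ZONE_VALID);
	return (uint64_t)v + 1;
}
```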
    
    rcuref_get() unconditionally increments the reference count with
    atomic_add_negative_relaxed(). rcuref_put() unconditionally decrements the
    reference count with atomic_add_negative_release().
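
The "add and test for negative" primitive can be modelled in userspace as follows. This is a sketch under assumptions: `sketch_add_negative()` is a hypothetical helper with its own argument order, it uses relaxed ordering only, and it is not the kernel's atomic_add_negative_relaxed(). It adds a delta unconditionally and reports whether the result, read as a signed 32-bit value, is negative, which is the case for everything outside the valid zone.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: add 'delta' unconditionally and return true if
 * the new value has its sign bit set (0x80000000 .. 0xFFFFFFFF).
 */
static bool sketch_add_negative(_Atomic uint32_t *v, uint32_t delta)
{
	uint32_t newv = atomic_fetch_add_explicit(v, delta,
						  memory_order_relaxed) + delta;

	return (newv & 0x80000000u) != 0;
}

/* Usage: a get/put pair on an elevated count never reports negative. */
static void sketch_add_negative_demo(void)
{
	_Atomic uint32_t v;

	atomic_init(&v, 0);				/* one reference */
	assert(!sketch_add_negative(&v, 1));		/* 0 -> 1 */
	assert(!sketch_add_negative(&v, (uint32_t)-1));	/* 1 -> 0 */
	assert(sketch_add_negative(&v, (uint32_t)-1));	/* 0 -> -1 */
}
```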
    
    This unconditional increment avoids the inc_not_zero() problem, but
    requires a more complex implementation on the put() side when the count
    drops from 0 to -1.
    
    When this transition is detected, rcuref_put() attempts to mark the
    reference count dead by setting it to the midpoint of the dead zone with a
    single atomic_cmpxchg_release() operation. This operation can fail due to a
    concurrent rcuref_get() elevating the reference count from -1 to 0 again.
    
    If the unconditional increment in rcuref_get() hits a reference count which
    is marked dead (or saturated) it will detect it after the fact and bring
    back the reference count to the midpoint of the respective zone. The zones
    provide enough tolerance which makes it practically impossible to escape
    from a zone.
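
The get() and put() behaviour described above can be sketched in userspace with C11 atomics. Everything named `sk_*` or `SK_*` is hypothetical. The midpoint constants are derived from the zone table rather than taken from kernel headers. Letting a get() on a saturated count succeed is an assumption of this sketch, as is the acq_rel ordering on the compare-exchange (the text names a release operation). The RCU protection that the scheme relies on is not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed constants, derived from the zone table. */
#define SK_MAXREF	0x7FFFFFFFu	/* top of the valid zone */
#define SK_SATURATED	0xA0000000u	/* midpoint of the saturation zone */
#define SK_RELEASED	0xC0000000u	/* start of the dead zone */
#define SK_DEAD		0xE0000000u	/* midpoint of the dead zone */
#define SK_NOREF	0xFFFFFFFFu	/* -1: no reference held */

struct sk_ref {
	_Atomic uint32_t cnt;
};

/* The stored value is the number of references minus one. */
static void sk_init(struct sk_ref *r, uint32_t refs)
{
	assert(refs >= 1);
	atomic_init(&r->cnt, refs - 1);
}

/* Returns true if a reference was acquired. */
static bool sk_get(struct sk_ref *r)
{
	uint32_t newv = atomic_fetch_add_explicit(&r->cnt, 1,
						  memory_order_relaxed) + 1;

	if (!(newv & 0x80000000u))
		return true;		/* fast path: valid zone */

	if (newv >= SK_RELEASED) {
		/* Object is dead: undo the drift and report failure. */
		atomic_store(&r->cnt, SK_DEAD);
		return false;
	}

	/* Saturated: pin the count; the object is leaked (assumption). */
	atomic_store(&r->cnt, SK_SATURATED);
	return true;
}

/* Returns true if the caller dropped the last reference. */
static bool sk_put(struct sk_ref *r)
{
	uint32_t newv = atomic_fetch_sub_explicit(&r->cnt, 1,
						  memory_order_release) - 1;

	if (!(newv & 0x80000000u))
		return false;		/* fast path: references remain */

	if (newv == SK_NOREF) {
		uint32_t expected = SK_NOREF;

		/*
		 * 0 -> -1 transition: try to mark the count dead. This
		 * fails if a concurrent sk_get() moved it back to 0.
		 */
		return atomic_compare_exchange_strong_explicit(&r->cnt,
				&expected, SK_DEAD,
				memory_order_acq_rel, memory_order_relaxed);
	}

	/* Imbalanced put on a dead count, or a saturated count. */
	atomic_store(&r->cnt, newv >= SK_RELEASED ? SK_DEAD : SK_SATURATED);
	return false;
}
```

In this model a caller would free the object only when sk_put() returns true. A later sk_get() on the dead count fails and resets the value to the dead midpoint, so repeated failed attempts cannot walk the count out of the dead zone.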
    
    Because rcuref_put() is racy, it must be protected against a grace period
    ending in order to prevent a subtle use after free. As RCU is the only
    mechanism which provides that protection, it is not possible to fully
    replace the atomic_inc_not_zero() based implementation of refcount_t with
    this scheme.
    
    The final drop is slightly more expensive than the atomic_dec_return()
    counterpart, but that's not the case which this is optimized for. The
    optimization is for the high frequency get()/put() pairs and their
    scalability.
    
    The performance of an uncontended rcuref_get()/put() pair where the put()
    is not dropping the last reference is still on par with the plain atomic
    operations, while at the same time providing overflow and underflow
    detection and mitigation.
    
    The performance of rcuref compared to plain atomic_inc_not_zero() and
    atomic_dec_return() based reference counting under contention:
    
     -  Micro benchmark: All CPUs running an increment/decrement loop on an
        elevated reference count, which means the 0 to -1 transition never
        happens.
    
        The performance gain depends on microarchitecture and the number of
        CPUs and has been observed in the range of 1.3X to 4.7X.
    
     -  Conversion of dst_entry::__refcnt to rcuref and testing with the
        localhost memtier/memcached benchmark. That benchmark shows the
        reference count contention prominently.
    
        The performance gain depends on microarchitecture and the number of
        CPUs and has been observed in the range of 1.1X to 2.6X over the
        previous fix for the false sharing issue vs. struct
        dst_entry::__refcnt.
    
        When memtier is run over a real 1Gb network connection, there is a
        small gain on top of the false sharing fix. The two changes combined
        result in a 2%-5% total gain for that networked test.
    
    Reported-by: Wangyang Guo <wangyang.guo@intel.com>
    Reported-by: Arjan Van De Ven <arjan.van.de.ven@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Link: https://lore.kernel.org/r/20230323102800.158429195@linutronix.de