
xlte

Assorted LTE tools

    Draft support for E-UTRAN IP Throughput KPI · 2a016d48
    Kirill Smelkov authored
    The most interesting patches are
    
    - d102ffaa (drb: Start of the package)
    - 5bf7dc1c (amari.{drb,xlog}: Provide aggregated DRB statistics in the form of synthetic x.drb_stats message)
    - 499a7c1b (amari.kpi: Teach LogMeasure to handle x.drb_stats messages)
    - 2824f50d (kpi: Calc: Add support for E-UTRAN IP Throughput KPI)
    - 4b2c8c21 (demo/kpidemo.*: Add support for E-UTRAN IP Throughput KPI + demonstrate it in the notebook)
    
    The other patches introduce or adjust needed infrastructure. A byproduct
    of particular note is that kpi.Measurement now supports QCI.
    
    A demo can be seen in the last part of
    https://nbviewer.org/urls/lab.nexedi.com/kirr/xlte/raw/43aac33e/demo/kpidemo.ipynb
    
    Below we provide an overview of the implementation.
    
    Overview of E-UTRAN IP Throughput computation
    ---------------------------------------------
    
    Before we begin explaining how IP Throughput is computed, let's first recall
    what it is and have a look at what is required to compute it reasonably.

    This KPI is defined in TS 32.450[1] and aggregates transmission volume and
    time over bursts of transmissions from an average UE point of view. It should be
    particularly noted that only the time during which transmission is actually going on
    should be accounted for. For example if a UE receives 10KB over a 4ms burst, and for
    the rest of, say, 1 minute there is no transmission to it, the downlink IP
    Throughput for that UE over the minute is 20Mbit/s (= 8·10KB/4ms), not 1.3Kbit/s (= 8·10KB/60s).
    This KPI basically shows what the speed would be to e.g. download a response to an
    HTTP request issued from a mobile.
    
    [1] https://www.etsi.org/deliver/etsi_ts/132400_132499/132450/16.00.00_60/ts_132450v160000p.pdf#page=13
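
    To make that arithmetic explicit, here is a throwaway sketch with the same
    numbers (all names are ad-hoc, nothing from the codebase):

        tx_bytes, burst_time, wall_time = 10e3, 4e-3, 60.0

        8*tx_bytes / burst_time     # 2e7 bit/s   = 20 Mbit/s  - per the KPI definition
        8*tx_bytes / wall_time      # ~1333 bit/s ≈ 1.3 Kbit/s - naive wall-clock average, not the KPI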
    
    To compute IP Throughput we thus need to know the Σ of the transmitted
    bytes and the Σ of the time of all transmission bursts.
    
    The Σ of the bytes is relatively easy to get. eNB already provides close values
    in the overall `stats` and in the per-UE `ue_get[stats]` messages. However there is
    nothing readily available out of the box for the Σ of burst transmission time.
    Thus we need to measure the time of transmission bursts ourselves somehow.
    
    It turns out that with the current state of things the only practical way to
    measure it to some degree is to poll eNB frequently with `ue_get[stats]` and
    estimate transmission time based on the δ of `ue_get` timestamps.
    
    Let's see how frequently we need to poll to get reasonable accuracy of the resulting throughput.
    
    A common situation for HTTP requests issued via LTE is that downloading the
    response content takes only a few milliseconds. For example I used the chromium
    network profiler to access various sites via internet tethered from my phone
    and saw that for many requests the response content downloading time was e.g. 4ms,
    5ms, 3.2ms, etc. The accuracy of measuring transmission time should thus be on the
    order of a millisecond to cover that properly. It makes a real difference for the
    reported throughput whether, say, a 10KB download sample took 4ms, or whether it took
    e.g. "something under 100ms". In the first case we know that for that sample the
    downlink throughput is 2500KB/s, while in the second case all we know is that the
    downlink throughput is "higher than 100KB/s" - a 25 times difference, and not
    certain. Similarly if we poll at a 10ms rate we would get that the throughput is "higher
    than 1000KB/s" - a 2.5 times difference from the actual value. The accuracy of 1
    millisecond coincides with the TTI time and with how downlink/uplink transmissions
    generally work in LTE.
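
    To restate those numbers as a tiny sketch (`thp_lower_bound` is a hypothetical
    helper used only for this illustration):

        tx_bytes = 10e3                     # the 10KB download sample
        tx_bytes / 4e-3                     # 2.5e6 B/s = 2500 KB/s actual throughput

        # if all we know is that the burst fits somewhere inside one polling period,
        # we only get a lower bound on its throughput:
        def thp_lower_bound(tx_bytes, poll_period):
            return tx_bytes / poll_period   # bytes/s

        thp_lower_bound(tx_bytes, 100e-3)   # 1e5 B/s -> "higher than 100KB/s"  (25x below actual)
        thp_lower_bound(tx_bytes,  10e-3)   # 1e6 B/s -> "higher than 1000KB/s" (2.5x below actual)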
    
    With the above, the scheme to compute IP Throughput looks to be as
    follows: poll eNB at a 1000Hz rate for `ue_get[stats]`, process the retrieved
    information into per-UE and per-QCI streams, detect bursts on each UE/QCI pair,
    and aggregate `tx_bytes` and `tx_time` from every burst.
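
    For illustration, here is a minimal toy of that aggregation for a single UE/QCI
    flow (ad-hoc names; the real logic, described below, is more involved and in
    particular handles per-burst boundaries and QCI separation):

        # samples[i] = (δt, δtx_bytes) between two consecutive polls; any interval
        # with δtx_bytes > 0 is counted as transmission time.
        def aggregate(samples):
            tx_bytes = tx_time = 0.0
            for dt, dbytes in samples:
                if dbytes > 0:
                    tx_bytes += dbytes
                    tx_time  += dt
            return tx_bytes, tx_time

        # polled at 1ms: a 4ms burst carrying 10KB, then ~1s of silence
        samples = [(1e-3, 2500)]*4 + [(1e-3, 0)]*996
        b, t = aggregate(samples)
        print(8*b / t)                      # ≈ 2e7 bit/s = 20 Mbit/s, as in the earlier example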
    
    It looks to be straightforward, but 1000Hz polling will likely create
    non-negligible additional load on the system and disturb eNB itself,
    introducing much jitter and harming its latency requirements. That's probably
    why eNB actually rate-limits WebSocket requests to not go higher than 100Hz -
    a frequency 10 times lower than what we need to reach reasonable
    accuracy for IP Throughput.
    
    Fortunately there is additional information that provides a way to improve the
    accuracy of the measured `tx_time` even when polling every 10ms at a 100Hz rate:
    that additional information is the number of transport blocks transmitted to/from
    a UE. If we know that during a 10ms frame e.g. 4 transport blocks were transmitted
    to the UE, that there were no retransmissions *and* that eNB is not congested, we can
    reasonably estimate that it was actually a 4ms transmission. And if eNB is
    congested we can still say that the transmission time is somewhere in the `[4ms, 10ms]`
    interval, because transmitting each transport block takes 1 TTI. Even if
    imprecise, that still provides some information that could be useful.
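
    As a rough sketch of that estimate (a simplified model only: ad-hoc names,
    retransmissions ignored, 1 TTI = 1ms):

        # bounds for the transmission time within one 10ms frame, given the number
        # of transport blocks sent to the UE in that frame
        def tx_time_bounds(n_tb, congested, tti=1e-3, frame=10e-3):
            if n_tb == 0:
                return 0.0, 0.0
            lo = n_tb * tti                 # each transport block takes 1 TTI
            hi = frame if congested else lo # congestion may spread blocks over the frame
            return lo, min(hi, frame)

        tx_time_bounds(4, congested=False)  # (0.004, 0.004) -> ~4ms transmission
        tx_time_bounds(4, congested=True)   # (0.004, 0.01)  -> somewhere in [4ms, 10ms]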
    
    Also 100Hz polling turns out to be acceptable from a performance point of view and
    does not disturb the system much. For example on the callbox machine the process
    that issues the polls takes only about 3% of CPU load and only on one core, and
    the CPU usage of eNB practically does not change, nor does its reported tx/rx
    latency. For sure, there is some disturbance, but it appears to
    be small. To have a better idea of what rate of polling is possible, I've made
    an experiment with the poller accessing my own WebSocket echo server quickly
    implemented in Python. Both the poller and the echo server are not optimized,
    but without rate-limiting they could go up to 8000Hz, reaching 100%
    CPU usage of one CPU core. That 8000Hz is 80 times more than the 100Hz
    frequency actually allowed by eNB. This shows what kind of polling
    frequency limit the system can handle, if absolutely needed, and that 100Hz
    turns out to be not so high a frequency. Also the Linux 5.6 kernel, installed
    on the callbox from Fedora 32, is configured with `CONFIG_HZ=1000`, which is
    likely helping here.
    
    Implementation overview
    ~~~~~~~~~~~~~~~~~~~~~~~
    
    The scheme to compute E-UTRAN IP Throughput is thus as follows: poll eNB at
    100Hz frequency for `ue_get[stats]` and retrieve information about per-UE/QCI
    streams and the number of transport blocks dl/ul-ed to the UE in question
    during that 10ms frame. Estimate `tx_time` taking into account
    the number of transmitted transport blocks. And estimate whether eNB is congested or
    not based on `dl_use_avg`/`ul_use_avg` taken from `stats`. For the latter we
    also need to poll for `stats` at 100Hz frequency and synchronize the
    `ue_get[stats]` and `stats` requests in time so that they both cover the same
    time interval of a particular frame.
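
    Schematically the polling could look like the following sketch (`query` is a
    stand-in for issuing one request to eNB and returning its reply - not the
    actual connection API used in the implementation):

        import time

        def poll_100hz(query, period=10e-3):
            while True:
                t = time.monotonic()
                # both requests are issued back-to-back so that they describe
                # approximately the same 10ms frame
                ue_stats = query('ue_get[stats]')   # per-UE/QCI bytes + transport blocks
                stats    = query('stats')           # dl_use_avg/ul_use_avg -> congestion
                yield (t, ue_stats, stats)
                time.sleep(max(0, period - (time.monotonic() - t)))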
    
    Then organize the polling process to provide aggregated statistics in the form of a
    new `x.drb_stats` message, and teach `xamari xlog` to save those messages to
    `enb.xlog` together with `stats`. Then further adjust `amari.kpi.LogMeasure`
    and the generic `kpi.Measurement` and `kpi.Calc` to handle DRB-related data.
    
    That is how it is implemented.
    
    The main part, which performs the 100Hz polling and flow aggregation, is in amari/drb.py.
    There `Sampler` extracts bursts of data transmissions from the stream of `ue_get[stats]`
    observations, and `x_stats_srv` organizes the whole 100Hz sampling process and provides
    aggregated `x.drb_stats` messages to `amari.xlog`.

    Even though the main idea is relatively straightforward, several aspects
    deserve to be noted:
    
    1. the information about transmitted bytes and the corresponding transmitted transport
       blocks is emitted by eNB without being synchronized in time. The reason here is that,
       for example, for DL a block is transmitted via PDCCH+PDSCH during one TTI, and
       then the base station awaits the HARQ ACK/NACK. That ACK/NACK comes later via
       PUCCH or PUSCH. The time window between the original transmission and the
       reception of the ACK/NACK is 4 TTIs for FDD and 4-13 TTIs for TDD(*).
       And Amarisoft LTEENB updates the counters for dl_total_bytes and dl_tx at
       different times:
    
           ue.erab.dl_total_bytes      - right after sending data on  PDCCH+PDSCH
           ue.cell.{dl_tx,dl_retx}     - after receiving ACK/NACK via PUCCH|PUSCH
    
       this way an update to dl_total_bytes might be seen in one frame (= 10·TTI),
       while the corresponding update to dl_tx/dl_retx might be seen either in the same,
       the next, or the next-next frame.

       `Sampler` brings δ(tx_bytes) and #tx_tb in sync itself via `BitSync` (see the
       toy sketch after this list).
    
    2. when we see multiple transmissions related to a UE on different QCIs, we
       cannot directly use the corresponding global number of transport blocks to estimate
       transmission times, because we do not know how the eNB scheduler placed those
       transmissions onto the resource map. So without additional information we can only
       estimate the corresponding lower and upper bounds.
    
    3. for output stability, and to avoid the throughput being affected by the partial fill
       of the tail TTI of a burst, E-UTRAN IP Throughput is required to be computed
       without taking into account the last TTI of every sample. We don't have that
       level of detail, since all we have is the total amount of transmitted bytes in a
       burst and an estimate of how long in time the burst is. Thus, once again, we
       can only provide an estimate such that the resulting E-UTRAN IP
       Throughput uncertainty window covers the right value required by the 3GPP standard.
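
    For illustration only, here is a much-simplified toy of the BitSync idea from
    point 1 for the FDD "same or next frame" case (ad-hoc code, not the actual
    algorithm from amari/drb.py):

        # frames: list of [δtx_bytes, #tx_tb] per 10ms frame, modified in place.
        # If bytes were sent in frame i but the transport-block counters only moved
        # in frame i+1 (late ACK/NACK), attribute those blocks back to frame i.
        def bitsync_toy(frames):
            for i in range(len(frames) - 1):
                dbytes, ntb = frames[i]
                dbytes_next, ntb_next = frames[i+1]
                if dbytes > 0 and ntb == 0 and dbytes_next == 0 and ntb_next > 0:
                    frames[i][1]   += ntb_next
                    frames[i+1][1]  = 0
            return frames

        bitsync_toy([[10000, 0], [0, 4], [0, 0]])   # -> [[10000, 4], [0, 0], [0, 0]]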
    
    A curious reader might be interested to look at the tests in `amari/drb_test.py`,
    and at the whole set of changes that brought E-UTRAN IP Throughput alive.
    
    Limitations
    ~~~~~~~~~~~
    
    The current implementation has the following limitations:

    - we account for the whole PDCP traffic instead of only IP traffic.
    - the KPI is computed with an uncertainty window instead of being precise, even when the
      connection to eNB is alive all the time. The shorter the bursts are, the larger
      the uncertainty.
    - the implementation works correctly for FDD, but not for TDD. That's because
      BitSync currently supports only the "next frame" case, and support for the "next-next
      frame" case is marked as TODO.
    - the eNB `t` monitor command practically stops working and now only reports
      ``Warning, remote API ue_get (stats = true) pending...`` instead of reporting
      useful information. This is because, contrary to `stats`, for `ue_get` eNB
      does not maintain per-connection state and uses global singleton counters.
    - the performance overhead might be more noticeable on machines less
      powerful than the callbox.
    
    To address these limitations I plan to talk to Amarisoft about eNB improvements,
    so that E-UTRAN IP Throughput could be computed precisely from DRB statistics
    directly provided by eNB itself.

    However it is still useful to have the current implementation, even with all its
    limitations, because it already works today with existing eNB versions.
    
    Kirill