cmd/compile: compile more complex functions first
    When using a concurrent backend,
    the overall compilation time is bounded
    in part by the slowest function to compile.
    The number of top-level statements in a function
    is an easily calculated and fairly reliable
    proxy for compilation time.
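
    As a rough sketch of the idea (hypothetical types and names, not the
    actual pgen.go code), the queue of functions handed to the backend
    workers can be sorted by top-level statement count, largest first:

    package sched

    import "sort"

    // fn stands in for the compiler's function node; only the field
    // needed for scheduling is shown.
    type fn struct {
        name  string
        stmts int // top-level statements, a proxy for compile time
    }

    // sortByComplexity orders functions so that the likely
    // slowest-to-compile ones are dispatched to workers first,
    // reducing the chance that one giant function is still running
    // after every other worker has drained the queue.
    func sortByComplexity(fns []*fn) {
        sort.Slice(fns, func(i, j int) bool {
            return fns[i].stmts > fns[j].stmts
        })
    }

    Sorting in descending order approximates longest-processing-time-first
    scheduling, which keeps all workers busy until near the end.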
    
    Here's a standard compilecmp output for -c=8 with this CL:
    
    name       old time/op       new time/op       delta
    Template         127ms ± 4%        125ms ± 6%   -1.33%  (p=0.000 n=47+50)
    Unicode         84.8ms ± 4%       84.5ms ± 4%     ~     (p=0.217 n=49+49)
    GoTypes          289ms ± 3%        287ms ± 3%   -0.78%  (p=0.002 n=48+50)
    Compiler         1.36s ± 3%        1.34s ± 2%   -1.29%  (p=0.000 n=49+47)
    SSA              2.95s ± 3%        2.77s ± 4%   -6.23%  (p=0.000 n=50+49)
    Flate           70.7ms ± 3%       70.9ms ± 2%     ~     (p=0.112 n=50+49)
    GoParser        85.0ms ± 3%       83.0ms ± 4%   -2.31%  (p=0.000 n=48+49)
    Reflect          229ms ± 3%        225ms ± 4%   -1.83%  (p=0.000 n=49+49)
    Tar             70.2ms ± 3%       69.4ms ± 3%   -1.17%  (p=0.000 n=49+49)
    XML              115ms ± 7%        114ms ± 6%     ~     (p=0.158 n=49+47)
    
    name       old user-time/op  new user-time/op  delta
    Template         352ms ± 5%        342ms ± 8%   -2.74%  (p=0.000 n=49+50)
    Unicode          117ms ± 5%        118ms ± 4%   +0.88%  (p=0.005 n=46+48)
    GoTypes          986ms ± 3%        980ms ± 4%     ~     (p=0.110 n=46+48)
    Compiler         4.39s ± 2%        4.43s ± 4%   +0.97%  (p=0.002 n=50+50)
    SSA              12.0s ± 2%        13.3s ± 3%  +11.33%  (p=0.000 n=49+49)
    Flate            222ms ± 5%        219ms ± 6%   -1.56%  (p=0.002 n=50+50)
    GoParser         271ms ± 5%        268ms ± 4%   -0.83%  (p=0.036 n=49+48)
    Reflect          560ms ± 4%        571ms ± 3%   +1.90%  (p=0.000 n=50+49)
    Tar              183ms ± 3%        183ms ± 3%     ~     (p=0.903 n=45+50)
    XML              364ms ±13%        391ms ± 4%   +7.16%  (p=0.000 n=50+40)
    
    A more interesting way of viewing the data is by
    looking at the ratio of the time taken to compile
    the slowest-to-compile function to the overall
    time spent compiling functions.
    
    If this ratio is small (near 0), then increased concurrency might help.
    If this ratio is large (near 1), then we're bounded by that single
    slowest function.
    
    I instrumented the compiler to emit this ratio per-package,
    ran 'go build -a -gcflags=-c=C -p=P std cmd' three times,
    for varying values of C and P,
    and collected the ratios encountered into an ASCII histogram.
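
    The instrumentation amounts to something like this sketch
    (hypothetical code, not what was actually added to the compiler):
    take the per-function compile times for a package, divide the
    slowest by the total, and bucket the result by floor(10*ratio):

    package histo

    import "time"

    // ratioBucket returns the histogram column (0 through 9) for one
    // package: the compile time of its single slowest function divided
    // by the total time spent compiling its functions, bucketed by
    // floor(10*ratio).
    func ratioBucket(funcTimes []time.Duration) int {
        var slowest, total time.Duration
        for _, d := range funcTimes {
            if d > slowest {
                slowest = d
            }
            total += d
        }
        if total == 0 {
            return 0
        }
        ratio := float64(slowest) / float64(total) // in (0, 1]
        bucket := int(10 * ratio)
        if bucket > 9 { // ratio == 1 would otherwise land in bucket 10
            bucket = 9
        }
        return bucket
    }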
    
    Here's c=1 p=1, which is a non-concurrent backend, single process at a time:
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|
     30%|
     20%|**
     10%|***
      0%|*********
    ----+----------
        |0123456789
    
    The x-axis is floor(10*ratio), so the first column indicates the percent of
    ratios that fell in the 0% to 9.9999% range.
    We can see in this histogram that more concurrency will help;
    in most cases, the ratio is small.
    
    Here's c=8 p=1, before this CL:
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|
     30%|         *
     20%|         *
     10%|*   *    *
      0%|**********
    ----+----------
        |0123456789
    
    In 30-40% of cases, we're mostly bound by the compilation time
    of a single function.
    
    Here's c=8 p=1, after this CL:
    
     90%|
     80%|
     70%|
     60%|
     50%|         *
     40%|         *
     30%|         *
     20%|         *
     10%|         *
      0%|**********
    ----+----------
        |0123456789
    
    The sorting pays off; we are bounded by the
    compilation time of a single function in over half of all packages.
    A single * in a column of the histogram indicates a value in the
    0-10% range.
    The actual values for this chart are:
    0: 5%, 1: 1%, 2: 1%, 3: 4%, 4: 5%, 5: 7%, 6: 7%, 7: 7%, 8: 9%, 9: 55%
    
    This indicates that efforts to increase or enable more concurrency,
    e.g. by optimizing mutexes or increasing the value of c,
    will probably not bear fruit.
    That matches what compilecmp tells us.
    
    Further optimization efforts should instead focus on one of:
    
    (1) making more functions compile concurrently
    (2) improving the compilation time of the slowest functions
    (3) speeding up the remaining serial parts of the compiler
    (4) automatically splitting up some large autogenerated functions
        into small ones, as discussed in #19751
    
    I hope to spend more time on (1) before the freeze.
    
    Adding process parallelism doesn't change the story much.
    For example, here's c=8 p=8, after this CL:
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|         *
     30%|         *
     20%|         *
     10%|       ***
      0%|**********
    ----+----------
        |0123456789
    
    Since we don't need to worry much about p,
    these histograms can help us select a good
    general value of c to use as a default,
    assuming we're not bounded by GOMAXPROCS.
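
    For example (a hypothetical sketch, not how the default is actually
    chosen), a reasonable policy would clamp the preferred backend
    concurrency to the available parallelism:

    package conc

    import "runtime"

    // backendConcurrency picks how many functions to compile
    // concurrently: the preferred value c, clamped to GOMAXPROCS so
    // the backend does not oversubscribe the machine.
    func backendConcurrency(c int) int {
        if p := runtime.GOMAXPROCS(0); c > p {
            return p
        }
        return c
    }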
    
    Here are some charts after this CL, for c from 1 to 8:
    
    c=1 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|
     30%|
     20%|**
     10%|***
      0%|*********
    ----+----------
        |0123456789
    
    c=2 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|
     30%|
     20%|
     10%| ****    *
      0%|**********
    ----+----------
        |0123456789
    
    c=3 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|
     30%|
     20%|         *
     10%|  ** *   *
      0%|**********
    ----+----------
        |0123456789
    
    c=4 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|
     30%|         *
     20%|         *
     10%|     *   *
      0%|**********
    ----+----------
        |0123456789
    
    c=5 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|
     30%|         *
     20%|         *
     10%|     *   *
      0%|**********
    ----+----------
        |0123456789
    
    c=6 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|
     40%|         *
     30%|         *
     20%|         *
     10%|         *
      0%|**********
    ----+----------
        |0123456789
    
    c=7 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|         *
     40%|         *
     30%|         *
     20%|         *
     10%|        **
      0%|**********
    ----+----------
        |0123456789
    
    c=8 p=1
    
     90%|
     80%|
     70%|
     60%|
     50%|         *
     40%|         *
     30%|         *
     20%|         *
     10%|         *
      0%|**********
    ----+----------
        |0123456789
    
    Given the increased user-CPU costs as
    c increases, it looks like c=4 is probably
    the sweet spot, at least for now.
    
    Pleasingly, this matches (and explains)
    the results of the standard benchmarking
    that I have done.
    
    Updates #15756
    
    Change-Id: I82b606c06efd34a5dbd1afdbcf66a605905b2aeb
    Reviewed-on: https://go-review.googlesource.com/41192
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Robert Griesemer <gri@golang.org>
    Reviewed-by: Matthew Dempsky <mdempsky@google.com>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>