-
Josh Bleecher Snyder authored
When using a concurrent backend, the overall compilation time is bounded in part by the slowest function to compile. The number of top-level statements in a function is an easily calculated and fairly reliable proxy for compilation time. Here's a standard compilecmp output for -c=8 with this CL: name old time/op new time/op delta Template 127ms ± 4% 125ms ± 6% -1.33% (p=0.000 n=47+50) Unicode 84.8ms ± 4% 84.5ms ± 4% ~ (p=0.217 n=49+49) GoTypes 289ms ± 3% 287ms ± 3% -0.78% (p=0.002 n=48+50) Compiler 1.36s ± 3% 1.34s ± 2% -1.29% (p=0.000 n=49+47) SSA 2.95s ± 3% 2.77s ± 4% -6.23% (p=0.000 n=50+49) Flate 70.7ms ± 3% 70.9ms ± 2% ~ (p=0.112 n=50+49) GoParser 85.0ms ± 3% 83.0ms ± 4% -2.31% (p=0.000 n=48+49) Reflect 229ms ± 3% 225ms ± 4% -1.83% (p=0.000 n=49+49) Tar 70.2ms ± 3% 69.4ms ± 3% -1.17% (p=0.000 n=49+49) XML 115ms ± 7% 114ms ± 6% ~ (p=0.158 n=49+47) name old user-time/op new user-time/op delta Template 352ms ± 5% 342ms ± 8% -2.74% (p=0.000 n=49+50) Unicode 117ms ± 5% 118ms ± 4% +0.88% (p=0.005 n=46+48) GoTypes 986ms ± 3% 980ms ± 4% ~ (p=0.110 n=46+48) Compiler 4.39s ± 2% 4.43s ± 4% +0.97% (p=0.002 n=50+50) SSA 12.0s ± 2% 13.3s ± 3% +11.33% (p=0.000 n=49+49) Flate 222ms ± 5% 219ms ± 6% -1.56% (p=0.002 n=50+50) GoParser 271ms ± 5% 268ms ± 4% -0.83% (p=0.036 n=49+48) Reflect 560ms ± 4% 571ms ± 3% +1.90% (p=0.000 n=50+49) Tar 183ms ± 3% 183ms ± 3% ~ (p=0.903 n=45+50) XML 364ms ±13% 391ms ± 4% +7.16% (p=0.000 n=50+40) A more interesting way of viewing the data is by looking at the ratio of the time taken to compile the slowest-to-compile function to the overall time spent compiling functions. If this ratio is small (near 0), then increased concurrency might help. If this ratio is big (near 1), then we're bounded by that single function. I instrumented the compiler to emit this ratio per-package, ran 'go build -a -gcflags=-c=C -p=P std cmd' three times, for varying values of C and P, and collected the ratios encountered into an ASCII histogram. Here's c=1 p=1, which is a non-concurrent backend, single process at a time: 90%| 80%| 70%| 60%| 50%| 40%| 30%| 20%|** 10%|*** 0%|********* ----+---------- |0123456789 The x-axis is floor(10*ratio), so the first column indicates the percent of ratios that fell in the 0% to 9.9999% range. We can see in this histogram that more concurrency will help; in most cases, the ratio is small. Here's c=8 p=1, before this CL: 90%| 80%| 70%| 60%| 50%| 40%| 30%| * 20%| * 10%|* * * 0%|********** ----+---------- |0123456789 In 30-40% of cases, we're mostly bound by the compilation time of a single function. Here's c=8 p=1, after this CL: 90%| 80%| 70%| 60%| 50%| * 40%| * 30%| * 20%| * 10%| * 0%|********** ----+---------- |0123456789 The sorting pays off; we are bound by the compilation time of a single function in over half of packages. The single * in the histogram indicates 0-10%. The actual values for this chart are: 0: 5%, 1: 1%, 2: 1%, 3: 4%, 4: 5%, 5: 7%, 6: 7%, 7: 7%, 8: 9%, 9: 55% This indicates that efforts to increase or enable more concurrency, e.g. by optimizing mutexes or increasing the value of c, will probably not yield fruit. That matches what compilecmp tells us. Further optimization efforts should thus focus instead on one of: (1) making more functions compile concurrently (2) improving the compilation time of the slowest functions (3) speeding up the remaining serial parts of the compiler (4) automatically splitting up some large autogenerated functions into small ones, as discussed in #19751 I hope to spend more time on (1) before the freeze. Adding process parallelism doesn't change the story much. For example, here's c=8 p=8, after this CL: 90%| 80%| 70%| 60%| 50%| 40%| * 30%| * 20%| * 10%| *** 0%|********** ----+---------- |0123456789 Since we don't need to worry much about p, these histograms can help us select a good general value of c to use as a default, assuming we're not bounded by GOMAXPROCS. Here are some charts after this CL, for c from 1 to 8: c=1 p=1 90%| 80%| 70%| 60%| 50%| 40%| 30%| 20%|** 10%|*** 0%|********* ----+---------- |0123456789 c=2 p=1 90%| 80%| 70%| 60%| 50%| 40%| 30%| 20%| 10%| **** * 0%|********** ----+---------- |0123456789 c=3 p=1 90%| 80%| 70%| 60%| 50%| 40%| 30%| 20%| * 10%| ** * * 0%|********** ----+---------- |0123456789 c=4 p=1 90%| 80%| 70%| 60%| 50%| 40%| 30%| * 20%| * 10%| * * 0%|********** ----+---------- |0123456789 c=5 p=1 90%| 80%| 70%| 60%| 50%| 40%| 30%| * 20%| * 10%| * * 0%|********** ----+---------- |0123456789 c=6 p=1 90%| 80%| 70%| 60%| 50%| 40%| * 30%| * 20%| * 10%| * 0%|********** ----+---------- |0123456789 c=7 p=1 90%| 80%| 70%| 60%| 50%| * 40%| * 30%| * 20%| * 10%| ** 0%|********** ----+---------- |0123456789 c=8 p=1 90%| 80%| 70%| 60%| 50%| * 40%| * 30%| * 20%| * 10%| * 0%|********** ----+---------- |0123456789 Given the increased user-CPU costs as c increases, it looks like c=4 is probably the sweet spot, at least for now. Pleasingly, this matches (and explains) the results of the standard benchmarking that I have done. Updates #15756 Change-Id: I82b606c06efd34a5dbd1afdbcf66a605905b2aeb Reviewed-on: https://go-review.googlesource.com/41192 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org> Reviewed-by: Matthew Dempsky <mdempsky@google.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
6664ccb4