runtime: separate soft and hard heap limits

Currently, GC pacing is based on a single hard heap limit computed based on GOGC. In order to achieve this hard limit, assist pacing makes the conservative assumption that the entire heap is live. However, in the steady state (with GOGC=100), only half of the heap is live. As a result, the garbage collector works twice as hard as necessary and finishes half way between the trigger and the goal. Since this is a stable state for the trigger controller, this repeats from cycle to cycle. Matters are even worse if GOGC is higher. For example, if GOGC=200, only a third of the heap is live in steady state, so the GC will work three times harder than necessary and finish only a third of the way between the trigger and the goal. Since this causes the garbage collector to consume ~50% of the available CPU during marking instead of the intended 25%, about 25% of the CPU goes to mutator assists. This high mutator assist cost causes high mutator latency variability. This commit improves the situation by separating the heap goal into two goals: a soft goal and a hard goal. The soft goal is set based on GOGC, just like the current goal is, and the hard goal is set at a 10% larger heap than the soft goal. Prior to the soft goal, assist pacing assumes the heap is in steady state (e.g., only half of it is live). Between the soft goal and the hard goal, assist pacing switches to the current conservative assumption that the entire heap is live. In benchmarks, this nearly eliminates mutator assists. However, since background marking is fixed at 25% CPU, this causes the trigger controller to saturate, which leads to somewhat higher variability in heap size. The next commit will address this. The lower CPU usage of course leads to longer mark cycles, though really it means the mark cycles are as long as they should have been in the first place. This does, however, lead to two potential down-sides compared to the current pacing policy: 1. the total overhead of the write barrier is higher because it's enabled more of the time and 2. the heap size may be larger because there's more floating garbage. We addressed 1 by significantly improving the performance of the write barrier in the preceding commits. 2 can be demonstrated in intense GC benchmarks, but doesn't seem to be a problem in any real applications. Updates #14951. Updates #14812 (fixes?). Fixes #18534. This has no significant effect on the throughput of the github.com/dr2chase/bent benchmarks-50. This has little overall throughput effect on the go1 benchmarks: name old time/op new time/op delta BinaryTree17-12 2.41s ± 0% 2.40s ± 0% -0.22% (p=0.007 n=20+18) Fannkuch11-12 2.95s ± 0% 2.95s ± 0% +0.07% (p=0.003 n=17+18) FmtFprintfEmpty-12 41.7ns ± 3% 42.2ns ± 0% +1.17% (p=0.002 n=20+15) FmtFprintfString-12 66.5ns ± 0% 67.9ns ± 2% +2.16% (p=0.000 n=16+20) FmtFprintfInt-12 77.6ns ± 2% 75.6ns ± 3% -2.55% (p=0.000 n=19+19) FmtFprintfIntInt-12 124ns ± 1% 123ns ± 1% -0.98% (p=0.000 n=18+17) FmtFprintfPrefixedInt-12 151ns ± 1% 148ns ± 1% -1.75% (p=0.000 n=19+20) FmtFprintfFloat-12 210ns ± 1% 212ns ± 0% +0.75% (p=0.000 n=19+16) FmtManyArgs-12 501ns ± 1% 499ns ± 1% -0.30% (p=0.041 n=17+19) GobDecode-12 6.50ms ± 1% 6.49ms ± 1% ~ (p=0.234 n=19+19) GobEncode-12 5.43ms ± 0% 5.47ms ± 0% +0.75% (p=0.000 n=20+19) Gzip-12 216ms ± 1% 220ms ± 1% +1.71% (p=0.000 n=19+20) Gunzip-12 38.6ms ± 0% 38.8ms ± 0% +0.66% (p=0.000 n=18+19) HTTPClientServer-12 78.1µs ± 1% 78.5µs ± 1% +0.49% (p=0.035 n=20+20) JSONEncode-12 12.1ms ± 0% 12.2ms ± 0% +1.05% (p=0.000 n=18+17) JSONDecode-12 53.0ms ± 0% 52.3ms ± 0% -1.27% (p=0.000 n=19+19) Mandelbrot200-12 3.74ms ± 0% 3.69ms ± 0% -1.17% (p=0.000 n=18+19) GoParse-12 3.17ms ± 1% 3.17ms ± 1% ~ (p=0.569 n=19+20) RegexpMatchEasy0_32-12 73.2ns ± 1% 73.7ns ± 0% +0.76% (p=0.000 n=18+17) RegexpMatchEasy0_1K-12 239ns ± 0% 238ns ± 0% -0.27% (p=0.000 n=13+17) RegexpMatchEasy1_32-12 69.0ns ± 2% 69.1ns ± 1% ~ (p=0.404 n=19+19) RegexpMatchEasy1_1K-12 367ns ± 1% 365ns ± 1% -0.60% (p=0.000 n=19+19) RegexpMatchMedium_32-12 105ns ± 1% 104ns ± 1% -1.24% (p=0.000 n=19+16) RegexpMatchMedium_1K-12 34.1µs ± 2% 33.6µs ± 3% -1.60% (p=0.000 n=20+20) RegexpMatchHard_32-12 1.62µs ± 1% 1.67µs ± 1% +2.75% (p=0.000 n=18+18) RegexpMatchHard_1K-12 48.8µs ± 1% 50.3µs ± 2% +3.07% (p=0.000 n=20+19) Revcomp-12 386ms ± 0% 384ms ± 0% -0.57% (p=0.000 n=20+19) Template-12 59.9ms ± 1% 61.1ms ± 1% +2.01% (p=0.000 n=20+19) TimeParse-12 301ns ± 2% 307ns ± 0% +2.11% (p=0.000 n=20+19) TimeFormat-12 323ns ± 0% 323ns ± 0% ~ (all samples are equal) [Geo mean] 47.0µs 47.1µs +0.23% https://perf.golang.org/search?q=upload:20171030.1 Likewise, the throughput effect on the x/benchmarks is minimal (and reasonably positive on the garbage benchmark with a large heap): name old time/op new time/op delta Garbage/benchmem-MB=1024-12 2.40ms ± 4% 2.29ms ± 3% -4.57% (p=0.000 n=19+18) Garbage/benchmem-MB=64-12 2.23ms ± 1% 2.24ms ± 2% +0.59% (p=0.016 n=19+18) HTTP-12 12.5µs ± 1% 12.6µs ± 1% ~ (p=0.326 n=20+19) JSON-12 11.1ms ± 1% 11.3ms ± 2% +2.15% (p=0.000 n=16+17) It does increase the heap size of the garbage benchmarks, but seems to have relatively little impact on more realistic programs. Also, we'll gain some of this back with the next commit. name old peak-RSS-bytes new peak-RSS-bytes delta Garbage/benchmem-MB=1024-12 1.21G ± 1% 1.88G ± 2% +55.59% (p=0.000 n=19+20) Garbage/benchmem-MB=64-12 168M ± 3% 248M ± 8% +48.08% (p=0.000 n=18+20) HTTP-12 45.6M ± 9% 47.0M ±27% ~ (p=0.925 n=20+20) JSON-12 193M ±11% 206M ±11% +7.06% (p=0.001 n=20+20) https://perf.golang.org/search?q=upload:20171030.2 Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736 Reviewed-on: https://go-review.googlesource.com/59970 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>

runtime: separate soft and hard heap limits
Currently, GC pacing is based on a single hard heap limit computed based on GOGC. In order to achieve this hard limit, assist pacing makes the conservative assumption that the entire heap is live. However, in the steady state (with GOGC=100), only half of the heap is live. As a result, the garbage collector works twice as hard as necessary and finishes half way between the trigger and the goal. Since this is a stable state for the trigger controller, this repeats from cycle to cycle. Matters are even worse if GOGC is higher. For example, if GOGC=200, only a third of the heap is live in steady state, so the GC will work three times harder than necessary and finish only a third of the way between the trigger and the goal. Since this causes the garbage collector to consume ~50% of the available CPU during marking instead of the intended 25%, about 25% of the CPU goes to mutator assists. This high mutator assist cost causes high mutator latency variability. This commit improves the situation by separating the heap goal into two goals: a soft goal and a hard goal. The soft goal is set based on GOGC, just like the current goal is, and the hard goal is set at a 10% larger heap than the soft goal. Prior to the soft goal, assist pacing assumes the heap is in steady state (e.g., only half of it is live). Between the soft goal and the hard goal, assist pacing switches to the current conservative assumption that the entire heap is live. In benchmarks, this nearly eliminates mutator assists. However, since background marking is fixed at 25% CPU, this causes the trigger controller to saturate, which leads to somewhat higher variability in heap size. The next commit will address this. The lower CPU usage of course leads to longer mark cycles, though really it means the mark cycles are as long as they should have been in the first place. This does, however, lead to two potential down-sides compared to the current pacing policy: 1. the total overhead of the write barrier is higher because it's enabled more of the time and 2. the heap size may be larger because there's more floating garbage. We addressed 1 by significantly improving the performance of the write barrier in the preceding commits. 2 can be demonstrated in intense GC benchmarks, but doesn't seem to be a problem in any real applications. Updates #14951. Updates #14812 (fixes?). Fixes #18534. This has no significant effect on the throughput of the github.com/dr2chase/bent benchmarks-50. This has little overall throughput effect on the go1 benchmarks: name old time/op new time/op delta BinaryTree17-12 2.41s ± 0% 2.40s ± 0% -0.22% (p=0.007 n=20+18) Fannkuch11-12 2.95s ± 0% 2.95s ± 0% +0.07% (p=0.003 n=17+18) FmtFprintfEmpty-12 41.7ns ± 3% 42.2ns ± 0% +1.17% (p=0.002 n=20+15) FmtFprintfString-12 66.5ns ± 0% 67.9ns ± 2% +2.16% (p=0.000 n=16+20) FmtFprintfInt-12 77.6ns ± 2% 75.6ns ± 3% -2.55% (p=0.000 n=19+19) FmtFprintfIntInt-12 124ns ± 1% 123ns ± 1% -0.98% (p=0.000 n=18+17) FmtFprintfPrefixedInt-12 151ns ± 1% 148ns ± 1% -1.75% (p=0.000 n=19+20) FmtFprintfFloat-12 210ns ± 1% 212ns ± 0% +0.75% (p=0.000 n=19+16) FmtManyArgs-12 501ns ± 1% 499ns ± 1% -0.30% (p=0.041 n=17+19) GobDecode-12 6.50ms ± 1% 6.49ms ± 1% ~ (p=0.234 n=19+19) GobEncode-12 5.43ms ± 0% 5.47ms ± 0% +0.75% (p=0.000 n=20+19) Gzip-12 216ms ± 1% 220ms ± 1% +1.71% (p=0.000 n=19+20) Gunzip-12 38.6ms ± 0% 38.8ms ± 0% +0.66% (p=0.000 n=18+19) HTTPClientServer-12 78.1µs ± 1% 78.5µs ± 1% +0.49% (p=0.035 n=20+20) JSONEncode-12 12.1ms ± 0% 12.2ms ± 0% +1.05% (p=0.000 n=18+17) JSONDecode-12 53.0ms ± 0% 52.3ms ± 0% -1.27% (p=0.000 n=19+19) Mandelbrot200-12 3.74ms ± 0% 3.69ms ± 0% -1.17% (p=0.000 n=18+19) GoParse-12 3.17ms ± 1% 3.17ms ± 1% ~ (p=0.569 n=19+20) RegexpMatchEasy0_32-12 73.2ns ± 1% 73.7ns ± 0% +0.76% (p=0.000 n=18+17) RegexpMatchEasy0_1K-12 239ns ± 0% 238ns ± 0% -0.27% (p=0.000 n=13+17) RegexpMatchEasy1_32-12 69.0ns ± 2% 69.1ns ± 1% ~ (p=0.404 n=19+19) RegexpMatchEasy1_1K-12 367ns ± 1% 365ns ± 1% -0.60% (p=0.000 n=19+19) RegexpMatchMedium_32-12 105ns ± 1% 104ns ± 1% -1.24% (p=0.000 n=19+16) RegexpMatchMedium_1K-12 34.1µs ± 2% 33.6µs ± 3% -1.60% (p=0.000 n=20+20) RegexpMatchHard_32-12 1.62µs ± 1% 1.67µs ± 1% +2.75% (p=0.000 n=18+18) RegexpMatchHard_1K-12 48.8µs ± 1% 50.3µs ± 2% +3.07% (p=0.000 n=20+19) Revcomp-12 386ms ± 0% 384ms ± 0% -0.57% (p=0.000 n=20+19) Template-12 59.9ms ± 1% 61.1ms ± 1% +2.01% (p=0.000 n=20+19) TimeParse-12 301ns ± 2% 307ns ± 0% +2.11% (p=0.000 n=20+19) TimeFormat-12 323ns ± 0% 323ns ± 0% ~ (all samples are equal) [Geo mean] 47.0µs 47.1µs +0.23% https://perf.golang.org/search?q=upload:20171030.1 Likewise, the throughput effect on the x/benchmarks is minimal (and reasonably positive on the garbage benchmark with a large heap): name old time/op new time/op delta Garbage/benchmem-MB=1024-12 2.40ms ± 4% 2.29ms ± 3% -4.57% (p=0.000 n=19+18) Garbage/benchmem-MB=64-12 2.23ms ± 1% 2.24ms ± 2% +0.59% (p=0.016 n=19+18) HTTP-12 12.5µs ± 1% 12.6µs ± 1% ~ (p=0.326 n=20+19) JSON-12 11.1ms ± 1% 11.3ms ± 2% +2.15% (p=0.000 n=16+17) It does increase the heap size of the garbage benchmarks, but seems to have relatively little impact on more realistic programs. Also, we'll gain some of this back with the next commit. name old peak-RSS-bytes new peak-RSS-bytes delta Garbage/benchmem-MB=1024-12 1.21G ± 1% 1.88G ± 2% +55.59% (p=0.000 n=19+20) Garbage/benchmem-MB=64-12 168M ± 3% 248M ± 8% +48.08% (p=0.000 n=18+20) HTTP-12 45.6M ± 9% 47.0M ±27% ~ (p=0.925 n=20+20) JSON-12 193M ±11% 206M ±11% +7.06% (p=0.001 n=20+20) https://perf.golang.org/search?q=upload:20171030.2 Change-Id: Ic78904135f832b4d64056cbe734ab979f5ad9736 Reviewed-on: https://go-review.googlesource.com/59970 Run-TryBot: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Rick Hudson <rlh@golang.org>
03eb9483 · Austin Clements · bc723cf3 · 03eb9483
Commit 03eb9483 authored Jul 19, 2017 by Austin Clements
Show whitespace changes
Inline Side-by-side

Showing with 48 additions and 22 deletions

src/runtime/mgc.go src/runtime/mgc.go +48 -22

No files found.
--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -500,47 +500,73 @@ func (c *gcControllerState) startCycle() {
 // is when assists are enabled and the necessary statistics are
 // available).
 func (c *gcControllerState) revise() {
+	gcpercent := gcpercent
+	if gcpercent < 0 {
+		// If GC is disabled but we're running a forced GC,
+		// act like GOGC is huge for the below calculations.
+		gcpercent = 100000
+	}
+	live := atomic.Load64(&memstats.heap_live)
+
+	var heapGoal, scanWorkExpected int64
+	if live <= memstats.next_gc {
+		// We're under the soft goal. Pace GC to complete at
+		// next_gc assuming the heap is in steady-state.
+		heapGoal = int64(memstats.next_gc)
+
 		// Compute the expected scan work remaining.
 		//
+		// This is estimated based on the expected
+		// steady-state scannable heap. For example, with
+		// GOGC=100, only half of the scannable heap is
+		// expected to be live, so that's what we target.
+		//
+		// (This is a float calculation to avoid overflowing on
+		// 100*heap_scan.)
+		scanWorkExpected = int64(float64(memstats.heap_scan) * 100 / float64(100+gcpercent))
+	} else {
+		// We're past the soft goal. Pace GC so that in the
+		// worst case it will complete by the hard goal.
+		const maxOvershoot = 1.1
+		heapGoal = int64(float64(memstats.next_gc) * maxOvershoot)
+
+		// Compute the upper bound on the scan work remaining.
+		scanWorkExpected = int64(memstats.heap_scan)
+	}
+
+	// Compute the remaining scan work estimate.
+	//
 	// Note that we currently count allocations during GC as both
 	// scannable heap (heap_scan) and scan work completed
-	// (scanWork), so this difference won't be changed by
-	// allocations during GC.
-	//
-	// This particular estimate is a strict upper bound on the
-	// possible remaining scan work for the current heap.
-	// You might consider dividing this by 2 (or by
-	// (100+GOGC)/100) to counter this over-estimation, but
-	// benchmarks show that this has almost no effect on mean
-	// mutator utilization, heap size, or assist time and it
-	// introduces the danger of under-estimating and letting the
-	// mutator outpace the garbage collector.
-	scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
-	if scanWorkExpected < 1000 {
+	// (scanWork), so allocation will change this difference will
+	// slowly in the soft regime and not at all in the hard
+	// regime.
+	scanWorkRemaining := scanWorkExpected - c.scanWork
+	if scanWorkRemaining < 1000 {
 		// We set a somewhat arbitrary lower bound on
 		// remaining scan work since if we aim a little high,
 		// we can miss by a little.
 		//
 		// We *do* need to enforce that this is at least 1,
 		// since marking is racy and double-scanning objects
-		// may legitimately make the expected scan work
-		// negative.
-		scanWorkExpected = 1000
+		// may legitimately make the remaining scan work
+		// negative, even in the hard goal regime.
+		scanWorkRemaining = 1000
 	}

 	// Compute the heap distance remaining.
-	heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
-	if heapDistance <= 0 {
+	heapRemaining := heapGoal - int64(live)
+	if heapRemaining <= 0 {
 		// This shouldn't happen, but if it does, avoid
 		// dividing by zero or setting the assist negative.
-		heapDistance = 1
+		heapRemaining = 1
 	}

 	// Compute the mutator assist ratio so by the time the mutator
 	// allocates the remaining heap bytes up to next_gc, it will
 	// have done (or stolen) the remaining amount of scan work.
-	c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
-	c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)
+	c.assistWorkPerByte = float64(scanWorkRemaining) / float64(heapRemaining)
+	c.assistBytesPerWork = float64(heapRemaining) / float64(scanWorkRemaining)
 }

 // endCycle computes the trigger ratio for the next cycle.