Commit 9108ae77 authored by Austin Clements's avatar Austin Clements

runtime: eliminate mark 2 and fix mark termination race

The mark 2 phase was originally introduced as a way to reduce the
chance of entering STW mark termination while there was still marking
work to do. It works by flushing and disabling all local work caches
so that all enqueued work becomes immediately globally visible.
However, mark 2 is not only slow–disabling caches makes marking and
the write barrier both much more expensive–but also imperfect. There
is still a rare but possible race (~once per all.bash) that can cause
GC to enter mark termination while there is still marking work. This
race is detailed at
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md#appendix-mark-completion-race
The effect of this is that mark termination must still cope with the
possibility that there may be work remaining after a concurrent mark
phase. Dealing with this increases STW pause time and increases the
complexity of mark termination.

Furthermore, a similar but far more likely race can cause early
transition from mark 1 to mark 2. This is unfortunate because it
causes performance instability because of the cost of mark 2.

This CL fixes this by replacing mark 2 with a distributed termination
detection algorithm. This algorithm is correct, so it eliminates the
mark termination race, and doesn't require disabling local caches. It
ensures that there are no grey objects upon entering mark termination.
With this change, we're one step closer to eliminating marking from
mark termination entirely (it's still used by STW GC and checkmarks
mode).

This CL does not eliminate the gcBlackenPromptly global flag, though
it is always set to false now. It will be removed in a cleanup CL.

This led to only minor variations in the go1 benchmarks
(https://perf.golang.org/search?q=upload:20180909.1) and compilebench
benchmarks (https://perf.golang.org/search?q=upload:20180910.2).

This significantly improves performance of the garbage benchmark, with
no impact on STW times:

name                        old time/op    new time/op   delta
Garbage/benchmem-MB=64-12    2.21ms ± 1%   2.05ms ± 1%   -7.38% (p=0.000 n=18+19)
Garbage/benchmem-MB=1024-12  2.30ms ±16%   2.20ms ± 7%   -4.51% (p=0.001 n=20+20)

name                        old STW-ns/GC  new STW-ns/GC  delta
Garbage/benchmem-MB=64-12      138k ±44%     141k ±23%     ~    (p=0.309 n=19+20)
Garbage/benchmem-MB=1024-12    159k ±25%     178k ±98%     ~    (p=0.798 n=16+18)

name                        old STW-ns/op  new STW-ns/op                delta
Garbage/benchmem-MB=64-12     4.42k ±44%    4.24k ±23%     ~    (p=0.531 n=19+20)
Garbage/benchmem-MB=1024-12     591 ±24%      636 ±111%    ~    (p=0.309 n=16+18)

(https://perf.golang.org/search?q=upload:20180910.1)

Updates #26903.
Updates #17503.

Change-Id: Icbd1e12b7a12a76f423c9bf033b13cb363e4cd19
Reviewed-on: https://go-review.googlesource.com/c/134318
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: default avatarRick Hudson <rlh@golang.org>
parent a2a2901b
...@@ -574,8 +574,8 @@ func BenchmarkWriteBarrier(b *testing.B) { ...@@ -574,8 +574,8 @@ func BenchmarkWriteBarrier(b *testing.B) {
n := &node{mkTree(level - 1), mkTree(level - 1)} n := &node{mkTree(level - 1), mkTree(level - 1)}
if level == 10 { if level == 10 {
// Seed GC with enough early pointers so it // Seed GC with enough early pointers so it
// doesn't accidentally switch to mark 2 when // doesn't start termination barriers when it
// it only has the top of the tree. // only has the top of the tree.
wbRoots = append(wbRoots, n) wbRoots = append(wbRoots, n)
} }
return n return n
......
...@@ -28,8 +28,7 @@ ...@@ -28,8 +28,7 @@
// b. Sweep any unswept spans. There will only be unswept spans if // b. Sweep any unswept spans. There will only be unswept spans if
// this GC cycle was forced before the expected time. // this GC cycle was forced before the expected time.
// //
// 2. GC performs the "mark 1" sub-phase. In this sub-phase, Ps are // 2. GC performs the mark phase.
// allowed to locally cache parts of the work queue.
// //
// a. Prepare for the mark phase by setting gcphase to _GCmark // a. Prepare for the mark phase by setting gcphase to _GCmark
// (from _GCoff), enabling the write barrier, enabling mutator // (from _GCoff), enabling the write barrier, enabling mutator
...@@ -54,28 +53,21 @@ ...@@ -54,28 +53,21 @@
// object to black and shading all pointers found in the object // object to black and shading all pointers found in the object
// (which in turn may add those pointers to the work queue). // (which in turn may add those pointers to the work queue).
// //
// 3. Once the global work queue is empty (but local work queue caches // e. Because GC work is spread across local caches, GC uses a
// may still contain work), GC performs the "mark 2" sub-phase. // distributed termination algorithm to detect when there are no
// more root marking jobs or grey objects (see gcMarkDone). At this
// point, GC transitions to mark termination.
// //
// a. GC stops all workers, disables local work queue caches, // 3. GC performs mark termination.
// flushes each P's local work queue cache to the global work queue
// cache, and reenables workers.
//
// b. GC again drains the work queue, as in 2d above.
//
// 4. Once the work queue is empty, GC performs mark termination.
// //
// a. Stop the world. // a. Stop the world.
// //
// b. Set gcphase to _GCmarktermination, and disable workers and // b. Set gcphase to _GCmarktermination, and disable workers and
// assists. // assists.
// //
// c. Drain any remaining work from the work queue (typically there // c. Perform housekeeping like flushing mcaches.
// will be none).
//
// d. Perform other housekeeping like flushing mcaches.
// //
// 5. GC performs the sweep phase. // 4. GC performs the sweep phase.
// //
// a. Prepare for the sweep phase by setting gcphase to _GCoff, // a. Prepare for the sweep phase by setting gcphase to _GCoff,
// setting up sweep state and disabling the write barrier. // setting up sweep state and disabling the write barrier.
...@@ -86,7 +78,7 @@ ...@@ -86,7 +78,7 @@
// c. GC does concurrent sweeping in the background and in response // c. GC does concurrent sweeping in the background and in response
// to allocation. See description below. // to allocation. See description below.
// //
// 6. When sufficient allocation has taken place, replay the sequence // 5. When sufficient allocation has taken place, replay the sequence
// starting with 1 above. See discussion of GC rate below. // starting with 1 above. See discussion of GC rate below.
// Concurrent sweep. // Concurrent sweep.
...@@ -996,8 +988,7 @@ var work struct { ...@@ -996,8 +988,7 @@ var work struct {
// startSema protects the transition from "off" to mark or // startSema protects the transition from "off" to mark or
// mark termination. // mark termination.
startSema uint32 startSema uint32
// markDoneSema protects transitions from mark 1 to mark 2 and // markDoneSema protects transitions from mark to mark termination.
// from mark 2 to mark termination.
markDoneSema uint32 markDoneSema uint32
bgMarkReady note // signal background mark worker has started bgMarkReady note // signal background mark worker has started
...@@ -1385,93 +1376,86 @@ func gcStart(mode gcMode, trigger gcTrigger) { ...@@ -1385,93 +1376,86 @@ func gcStart(mode gcMode, trigger gcTrigger) {
semrelease(&work.startSema) semrelease(&work.startSema)
} }
// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2 // gcMarkDoneFlushed counts the number of P's with flushed work.
// to mark termination. //
// Ideally this would be a captured local in gcMarkDone, but forEachP
// escapes its callback closure, so it can't capture anything.
//
// This is protected by markDoneSema.
var gcMarkDoneFlushed uint32
// gcMarkDone transitions the GC from mark to mark termination if all
// reachable objects have been marked (that is, there are no grey
// objects and can be no more in the future). Otherwise, it flushes
// all local work to the global queues where it can be discovered by
// other workers.
// //
// This should be called when all mark work has been drained. In mark // This should be called when all local mark work has been drained and
// 1, this includes all root marking jobs, global work buffers, and // there are no remaining workers. Specifically, when
// active work buffers in assists and background workers; however, //
// work may still be cached in per-P work buffers. In mark 2, per-P // work.nwait == work.nproc && !gcMarkWorkAvailable(p)
// caches are disabled.
// //
// The calling context must be preemptible. // The calling context must be preemptible.
// //
// Note that it is explicitly okay to have write barriers in this // Flushing local work is important because idle Ps may have local
// function because completion of concurrent mark is best-effort // work queued. This is the only way to make that work visible and
// anyway. Any work created by write barriers here will be cleaned up // drive GC to completion.
// by mark termination. //
// It is explicitly okay to have write barriers in this function. If
// it does transition to mark termination, then all reachable objects
// have been marked, so the write barrier cannot shade any more
// objects.
func gcMarkDone() { func gcMarkDone() {
top: // Ensure only one thread is running the ragged barrier at a
// time.
semacquire(&work.markDoneSema) semacquire(&work.markDoneSema)
top:
// Re-check transition condition under transition lock. // Re-check transition condition under transition lock.
//
// It's critical that this checks the global work queues are
// empty before performing the ragged barrier. Otherwise,
// there could be global work that a P could take after the P
// has passed the ragged barrier.
if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) { if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
semrelease(&work.markDoneSema) semrelease(&work.markDoneSema)
return return
} }
// Disallow starting new workers so that any remaining workers // Flush all local buffers and collect flushedWork flags.
// in the current mark phase will drain out. gcMarkDoneFlushed = 0
//
// TODO(austin): Should dedicated workers keep an eye on this
// and exit gcDrain promptly?
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
prevFractionalGoal := gcController.fractionalUtilizationGoal
gcController.fractionalUtilizationGoal = 0
if !gcBlackenPromptly {
// Transition from mark 1 to mark 2.
//
// The global work list is empty, but there can still be work
// sitting in the per-P work caches.
// Flush and disable work caches.
// Disallow caching workbufs and indicate that we're in mark 2.
gcBlackenPromptly = true
// Prevent completion of mark 2 until we've flushed
// cached workbufs.
atomic.Xadd(&work.nwait, -1)
// GC is set up for mark 2. Let Gs blocked on the
// transition lock go while we flush caches.
semrelease(&work.markDoneSema)
systemstack(func() { systemstack(func() {
// Flush all currently cached workbufs and
// ensure all Ps see gcBlackenPromptly. This
// also blocks until any remaining mark 1
// workers have exited their loop so we can
// start new mark 2 workers.
forEachP(func(_p_ *p) { forEachP(func(_p_ *p) {
// Flush the write barrier buffer, since this may add
// work to the gcWork.
wbBufFlush1(_p_) wbBufFlush1(_p_)
// Flush the gcWork, since this may create global work
// and set the flushedWork flag.
//
// TODO(austin): Break up these workbufs to
// better distribute work.
_p_.gcw.dispose() _p_.gcw.dispose()
// Collect the flushedWork flag.
if _p_.gcw.flushedWork {
atomic.Xadd(&gcMarkDoneFlushed, 1)
_p_.gcw.flushedWork = false
}
}) })
}) })
// Check that roots are marked. We should be able to if gcMarkDoneFlushed != 0 {
// do this before the forEachP, but based on issue // More grey objects were discovered since the
// #16083 there may be a (harmless) race where we can // previous termination check, so there may be more
// enter mark 2 while some workers are still scanning // work to do. Keep going. It's possible the
// stacks. The forEachP ensures these scans are done. // transition condition became true again during the
// // ragged barrier, so re-check it.
// TODO(austin): Figure out the race and fix this
// properly.
gcMarkRootCheck()
// Now we can start up mark 2 workers.
atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
gcController.fractionalUtilizationGoal = prevFractionalGoal
incnwait := atomic.Xadd(&work.nwait, +1)
if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
// This loop will make progress because
// gcBlackenPromptly is now true, so it won't
// take this same "if" branch.
goto top goto top
} }
} else {
// Transition to mark termination. // There was no global work, no local work, and no Ps
// communicated work since we took markDoneSema. Therefore
// there are no grey objects and no more objects can be
// shaded. Transition to mark termination.
now := nanotime() now := nanotime()
work.tMarkTerm = now work.tMarkTerm = now
work.pauseStart = now work.pauseStart = now
...@@ -1500,13 +1484,13 @@ top: ...@@ -1500,13 +1484,13 @@ top:
// world again. // world again.
semrelease(&work.markDoneSema) semrelease(&work.markDoneSema)
// endCycle depends on all gcWork cache stats being // endCycle depends on all gcWork cache stats being flushed.
// flushed. This is ensured by mark 2. // The termination algorithm above ensured that up to
// allocations since the ragged barrier.
nextTriggerRatio := gcController.endCycle() nextTriggerRatio := gcController.endCycle()
// Perform mark termination. This will restart the world. // Perform mark termination. This will restart the world.
gcMarkTermination(nextTriggerRatio) gcMarkTermination(nextTriggerRatio)
}
} }
func gcMarkTermination(nextTriggerRatio float64) { func gcMarkTermination(nextTriggerRatio float64) {
...@@ -1940,23 +1924,23 @@ func gcMark(start_time int64) { ...@@ -1940,23 +1924,23 @@ func gcMark(start_time int64) {
if work.full == 0 && work.nDataRoots+work.nBSSRoots+work.nSpanRoots+work.nStackRoots == 0 { if work.full == 0 && work.nDataRoots+work.nBSSRoots+work.nSpanRoots+work.nStackRoots == 0 {
// There's no work on the work queue and no root jobs // There's no work on the work queue and no root jobs
// that can produce work, so don't bother entering the // that can produce work, so don't bother entering the
// getfull() barrier. // getfull() barrier. There will be flushCacheRoots
// // work, but that doesn't gray anything.
// This will be the situation the vast majority of the
// time after concurrent mark. However, we still need
// a fallback for STW GC and because there are some
// known races that occasionally leave work around for
// mark termination.
// //
// We're still hedging our bets here: if we do // This should always be the situation after
// accidentally produce some work, we'll still process // concurrent mark.
// it, just not necessarily in parallel.
//
// TODO(austin): Fix the races and and remove
// work draining from mark termination so we don't
// need the fallback path.
work.helperDrainBlock = false work.helperDrainBlock = false
} else { } else {
// There's marking work to do. This is the case during
// STW GC and in checkmark mode. Instruct GC workers
// to block in getfull until all GC workers are in getfull.
//
// TODO(austin): Move STW and checkmark marking out of
// mark termination and eliminate this code path.
if !useCheckmark && debug.gcstoptheworld == 0 && debug.gcrescanstacks == 0 {
print("runtime: full=", hex(work.full), " nDataRoots=", work.nDataRoots, " nBSSRoots=", work.nBSSRoots, " nSpanRoots=", work.nSpanRoots, " nStackRoots=", work.nStackRoots, "\n")
panic("non-empty mark queue after concurrent mark")
}
work.helperDrainBlock = true work.helperDrainBlock = true
} }
...@@ -1991,16 +1975,35 @@ func gcMark(start_time int64) { ...@@ -1991,16 +1975,35 @@ func gcMark(start_time int64) {
// Record that at least one root marking pass has completed. // Record that at least one root marking pass has completed.
work.markrootDone = true work.markrootDone = true
// Double-check that all gcWork caches are empty. This should // Clear out buffers and double-check that all gcWork caches
// be ensured by mark 2 before we enter mark termination. // are empty. This should be ensured by gcMarkDone before we
// enter mark termination.
//
// TODO: We could clear out buffers just before mark if this
// has a non-negligible impact on STW time.
for _, p := range allp { for _, p := range allp {
// The write barrier may have buffered pointers since
// the gcMarkDone barrier. However, since the barrier
// ensured all reachable objects were marked, all of
// these must be pointers to black objects. Hence we
// can just discard the write barrier buffer.
if debug.gccheckmark > 0 {
// For debugging, flush the buffer and make
// sure it really was all marked.
wbBufFlush1(p)
} else {
p.wbBuf.reset()
}
gcw := &p.gcw gcw := &p.gcw
if !gcw.empty() { if !gcw.empty() {
throw("P has cached GC work at end of mark termination") throw("P has cached GC work at end of mark termination")
} }
if gcw.scanWork != 0 || gcw.bytesMarked != 0 { // There may still be cached empty buffers, which we
throw("P has unflushed stats at end of mark termination") // need to flush since we're going to free them. Also,
} // there may be non-zero stats because we allocated
// black after the gcMarkDone barrier.
gcw.dispose()
} }
cachestats() cachestats()
......
...@@ -107,6 +107,11 @@ func (b *wbBuf) discard() { ...@@ -107,6 +107,11 @@ func (b *wbBuf) discard() {
b.next = uintptr(unsafe.Pointer(&b.buf[0])) b.next = uintptr(unsafe.Pointer(&b.buf[0]))
} }
// empty returns true if b contains no pointers.
func (b *wbBuf) empty() bool {
return b.next == uintptr(unsafe.Pointer(&b.buf[0]))
}
// putFast adds old and new to the write barrier buffer and returns // putFast adds old and new to the write barrier buffer and returns
// false if a flush is necessary. Callers should use this as: // false if a flush is necessary. Callers should use this as:
// //
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment