runtime: reintroduce ``dead'' space during GC scan

Reintroduce an optimization discarded during the initial conversion from 4-bit heap bitmaps to 2-bit heap bitmaps: when we reach the place in the bitmap where there are no more pointers, mark that position for the GC so that it can avoid scanning past that place. During heapBitsSetType we can also avoid initializing heap bitmap beyond that location, which gives a bit of a win compared to Go 1.4. This particular optimization (not initializing the heap bitmap) may not last: we might change typedmemmove to use the heap bitmap, in which case it would all need to be initialized. The early stop in the GC scan will stay no matter what. Compared to Go 1.4 (github.com/rsc/go, branch go14bench): name old mean new mean delta SetTypeNode64 80.7ns × (1.00,1.01) 57.4ns × (1.00,1.01) -28.83% (p=0.000) SetTypeNode64Dead 80.5ns × (1.00,1.01) 13.1ns × (0.99,1.02) -83.77% (p=0.000) SetTypeNode64Slice 2.16µs × (1.00,1.01) 1.54µs × (1.00,1.01) -28.75% (p=0.000) SetTypeNode64DeadSlice 2.16µs × (1.00,1.01) 1.52µs × (1.00,1.00) -29.74% (p=0.000) Compared to previous CL: name old mean new mean delta SetTypeNode64 56.7ns × (1.00,1.00) 57.4ns × (1.00,1.01) +1.19% (p=0.000) SetTypeNode64Dead 57.2ns × (1.00,1.00) 13.1ns × (0.99,1.02) -77.15% (p=0.000) SetTypeNode64Slice 1.56µs × (1.00,1.01) 1.54µs × (1.00,1.01) -0.89% (p=0.000) SetTypeNode64DeadSlice 1.55µs × (1.00,1.01) 1.52µs × (1.00,1.00) -2.23% (p=0.000) This is the last CL in the sequence converting from the 4-bit heap to the 2-bit heap, with all the same optimizations reenabled. Compared to before that process began (compared to CL 9701 patch set 1): name old mean new mean delta BinaryTree17 5.87s × (0.94,1.09) 5.91s × (0.96,1.06) ~ (p=0.578) Fannkuch11 4.32s × (1.00,1.00) 4.32s × (1.00,1.00) ~ (p=0.474) FmtFprintfEmpty 89.1ns × (0.95,1.16) 89.0ns × (0.93,1.10) ~ (p=0.942) FmtFprintfString 283ns × (0.98,1.02) 298ns × (0.98,1.06) +5.33% (p=0.000) FmtFprintfInt 284ns × (0.98,1.04) 286ns × (0.98,1.03) ~ (p=0.208) FmtFprintfIntInt 486ns × (0.98,1.03) 498ns × (0.97,1.06) +2.48% (p=0.000) FmtFprintfPrefixedInt 400ns × (0.99,1.02) 408ns × (0.98,1.02) +2.23% (p=0.000) FmtFprintfFloat 566ns × (0.99,1.01) 587ns × (0.98,1.01) +3.69% (p=0.000) FmtManyArgs 1.91µs × (0.99,1.02) 1.94µs × (0.99,1.02) +1.81% (p=0.000) GobDecode 15.5ms × (0.98,1.05) 15.8ms × (0.98,1.03) +1.94% (p=0.002) GobEncode 11.9ms × (0.97,1.03) 12.0ms × (0.96,1.09) ~ (p=0.263) Gzip 648ms × (0.99,1.01) 648ms × (0.99,1.01) ~ (p=0.992) Gunzip 143ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.585) HTTPClientServer 89.2µs × (0.99,1.02) 90.3µs × (0.98,1.01) +1.24% (p=0.000) JSONEncode 32.3ms × (0.97,1.06) 31.6ms × (0.99,1.01) -2.29% (p=0.000) JSONDecode 106ms × (0.99,1.01) 107ms × (1.00,1.01) +0.62% (p=0.000) Mandelbrot200 6.02ms × (1.00,1.00) 6.03ms × (1.00,1.01) ~ (p=0.250) GoParse 6.57ms × (0.97,1.06) 6.53ms × (0.99,1.03) ~ (p=0.243) RegexpMatchEasy0_32 162ns × (1.00,1.00) 161ns × (1.00,1.01) -0.80% (p=0.000) RegexpMatchEasy0_1K 561ns × (0.99,1.02) 541ns × (0.99,1.01) -3.67% (p=0.000) RegexpMatchEasy1_32 145ns × (0.95,1.04) 138ns × (1.00,1.00) -5.04% (p=0.000) RegexpMatchEasy1_1K 864ns × (0.99,1.04) 887ns × (0.99,1.01) +2.57% (p=0.000) RegexpMatchMedium_32 255ns × (0.99,1.04) 253ns × (0.99,1.01) -1.05% (p=0.012) RegexpMatchMedium_1K 73.9µs × (0.98,1.04) 72.8µs × (1.00,1.00) -1.51% (p=0.005) RegexpMatchHard_32 3.92µs × (0.98,1.04) 3.85µs × (1.00,1.01) -1.88% (p=0.002) RegexpMatchHard_1K 120µs × (0.98,1.04) 117µs × (1.00,1.01) -2.02% (p=0.001) Revcomp 936ms × (0.95,1.08) 922ms × (0.97,1.08) ~ (p=0.234) Template 130ms × (0.98,1.04) 126ms × (0.99,1.01) -2.99% (p=0.000) TimeParse 638ns × (0.98,1.05) 628ns × (0.99,1.01) -1.54% (p=0.004) TimeFormat 674ns × (0.99,1.01) 668ns × (0.99,1.01) -0.80% (p=0.001) The slowdown of the first few benchmarks seems to be due to the new atomic operations for certain small size allocations. But the larger benchmarks mostly improve, probably due to the decreased memory pressure from having half as much heap bitmap. CL 9706, which removes the (never used anymore) wbshadow mode, gets back what is lost in the early microbenchmarks. Change-Id: I37423a209e8ec2a2e92538b45cac5422a6acd32d Reviewed-on: https://go-review.googlesource.com/9705Reviewed-by: Rick Hudson <rlh@golang.org>

runtime: reintroduce ``dead'' space during GC scan
Reintroduce an optimization discarded during the initial conversion from 4-bit heap bitmaps to 2-bit heap bitmaps: when we reach the place in the bitmap where there are no more pointers, mark that position for the GC so that it can avoid scanning past that place. During heapBitsSetType we can also avoid initializing heap bitmap beyond that location, which gives a bit of a win compared to Go 1.4. This particular optimization (not initializing the heap bitmap) may not last: we might change typedmemmove to use the heap bitmap, in which case it would all need to be initialized. The early stop in the GC scan will stay no matter what. Compared to Go 1.4 (github.com/rsc/go, branch go14bench): name old mean new mean delta SetTypeNode64 80.7ns × (1.00,1.01) 57.4ns × (1.00,1.01) -28.83% (p=0.000) SetTypeNode64Dead 80.5ns × (1.00,1.01) 13.1ns × (0.99,1.02) -83.77% (p=0.000) SetTypeNode64Slice 2.16µs × (1.00,1.01) 1.54µs × (1.00,1.01) -28.75% (p=0.000) SetTypeNode64DeadSlice 2.16µs × (1.00,1.01) 1.52µs × (1.00,1.00) -29.74% (p=0.000) Compared to previous CL: name old mean new mean delta SetTypeNode64 56.7ns × (1.00,1.00) 57.4ns × (1.00,1.01) +1.19% (p=0.000) SetTypeNode64Dead 57.2ns × (1.00,1.00) 13.1ns × (0.99,1.02) -77.15% (p=0.000) SetTypeNode64Slice 1.56µs × (1.00,1.01) 1.54µs × (1.00,1.01) -0.89% (p=0.000) SetTypeNode64DeadSlice 1.55µs × (1.00,1.01) 1.52µs × (1.00,1.00) -2.23% (p=0.000) This is the last CL in the sequence converting from the 4-bit heap to the 2-bit heap, with all the same optimizations reenabled. Compared to before that process began (compared to CL 9701 patch set 1): name old mean new mean delta BinaryTree17 5.87s × (0.94,1.09) 5.91s × (0.96,1.06) ~ (p=0.578) Fannkuch11 4.32s × (1.00,1.00) 4.32s × (1.00,1.00) ~ (p=0.474) FmtFprintfEmpty 89.1ns × (0.95,1.16) 89.0ns × (0.93,1.10) ~ (p=0.942) FmtFprintfString 283ns × (0.98,1.02) 298ns × (0.98,1.06) +5.33% (p=0.000) FmtFprintfInt 284ns × (0.98,1.04) 286ns × (0.98,1.03) ~ (p=0.208) FmtFprintfIntInt 486ns × (0.98,1.03) 498ns × (0.97,1.06) +2.48% (p=0.000) FmtFprintfPrefixedInt 400ns × (0.99,1.02) 408ns × (0.98,1.02) +2.23% (p=0.000) FmtFprintfFloat 566ns × (0.99,1.01) 587ns × (0.98,1.01) +3.69% (p=0.000) FmtManyArgs 1.91µs × (0.99,1.02) 1.94µs × (0.99,1.02) +1.81% (p=0.000) GobDecode 15.5ms × (0.98,1.05) 15.8ms × (0.98,1.03) +1.94% (p=0.002) GobEncode 11.9ms × (0.97,1.03) 12.0ms × (0.96,1.09) ~ (p=0.263) Gzip 648ms × (0.99,1.01) 648ms × (0.99,1.01) ~ (p=0.992) Gunzip 143ms × (1.00,1.00) 143ms × (1.00,1.01) ~ (p=0.585) HTTPClientServer 89.2µs × (0.99,1.02) 90.3µs × (0.98,1.01) +1.24% (p=0.000) JSONEncode 32.3ms × (0.97,1.06) 31.6ms × (0.99,1.01) -2.29% (p=0.000) JSONDecode 106ms × (0.99,1.01) 107ms × (1.00,1.01) +0.62% (p=0.000) Mandelbrot200 6.02ms × (1.00,1.00) 6.03ms × (1.00,1.01) ~ (p=0.250) GoParse 6.57ms × (0.97,1.06) 6.53ms × (0.99,1.03) ~ (p=0.243) RegexpMatchEasy0_32 162ns × (1.00,1.00) 161ns × (1.00,1.01) -0.80% (p=0.000) RegexpMatchEasy0_1K 561ns × (0.99,1.02) 541ns × (0.99,1.01) -3.67% (p=0.000) RegexpMatchEasy1_32 145ns × (0.95,1.04) 138ns × (1.00,1.00) -5.04% (p=0.000) RegexpMatchEasy1_1K 864ns × (0.99,1.04) 887ns × (0.99,1.01) +2.57% (p=0.000) RegexpMatchMedium_32 255ns × (0.99,1.04) 253ns × (0.99,1.01) -1.05% (p=0.012) RegexpMatchMedium_1K 73.9µs × (0.98,1.04) 72.8µs × (1.00,1.00) -1.51% (p=0.005) RegexpMatchHard_32 3.92µs × (0.98,1.04) 3.85µs × (1.00,1.01) -1.88% (p=0.002) RegexpMatchHard_1K 120µs × (0.98,1.04) 117µs × (1.00,1.01) -2.02% (p=0.001) Revcomp 936ms × (0.95,1.08) 922ms × (0.97,1.08) ~ (p=0.234) Template 130ms × (0.98,1.04) 126ms × (0.99,1.01) -2.99% (p=0.000) TimeParse 638ns × (0.98,1.05) 628ns × (0.99,1.01) -1.54% (p=0.004) TimeFormat 674ns × (0.99,1.01) 668ns × (0.99,1.01) -0.80% (p=0.001) The slowdown of the first few benchmarks seems to be due to the new atomic operations for certain small size allocations. But the larger benchmarks mostly improve, probably due to the decreased memory pressure from having half as much heap bitmap. CL 9706, which removes the (never used anymore) wbshadow mode, gets back what is lost in the early microbenchmarks. Change-Id: I37423a209e8ec2a2e92538b45cac5422a6acd32d Reviewed-on: https://go-review.googlesource.com/9705Reviewed-by: Rick Hudson <rlh@golang.org>
54af9a3b · Russ Cox · feb8a3b6 · 54af9a3b · 54af9a3b · 54af9a3b
Commit 54af9a3b authored May 04, 2015 by Russ Cox
5 changed files
--- a/src/cmd/internal/gc/reflect.go
+++ b/src/cmd/internal/gc/reflect.go
@@ -687,7 +687,7 @@ func haspointers(t *Type) bool {

 // typeptrdata returns the length in bytes of the prefix of t
 // containing pointer data. Anything after this offset is scalar data.
-func typeptrdata(t *Type) uint64 {
+func typeptrdata(t *Type) int64 {
 	if !haspointers(t) {
 		return 0
 	}
@@ -699,24 +699,24 @@ func typeptrdata(t *Type) uint64 {
 		TFUNC,
 		TCHAN,
 		TMAP:
-		return uint64(Widthptr)
+		return int64(Widthptr)

 	case TSTRING:
 		// struct { byte *str; intgo len; }
-		return uint64(Widthptr)
+		return int64(Widthptr)

 	case TINTER:
 		// struct { Itab *tab;	void *data; } or
 		// struct { Type *type; void *data; }
-		return 2 * uint64(Widthptr)
+		return 2 * int64(Widthptr)

 	case TARRAY:
 		if Isslice(t) {
 			// struct { byte *array; uintgo len; uintgo cap; }
-			return uint64(Widthptr)
+			return int64(Widthptr)
 		}
 		// haspointers already eliminated t.Bound == 0.
-		return uint64(t.Bound-1)*uint64(t.Type.Width) + typeptrdata(t.Type)
+		return (t.Bound-1)*t.Type.Width + typeptrdata(t.Type)

 	case TSTRUCT:
 		// Find the last field that has pointers.
@@ -726,7 +726,7 @@ func typeptrdata(t *Type) uint64 {
 				lastPtrField = t1
 			}
 		}
-		return uint64(lastPtrField.Width) + typeptrdata(lastPtrField.Type)
+		return lastPtrField.Width + typeptrdata(lastPtrField.Type)

 	default:
 		Fatal("typeptrdata: unexpected type, %v", t)
@@ -794,7 +794,7 @@ func dcommontype(s *Sym, ot int, t *Type) int {
 	//		zero          unsafe.Pointer
 	//	}
 	ot = duintptr(s, ot, uint64(t.Width))
-	ot = duintptr(s, ot, typeptrdata(t))
+	ot = duintptr(s, ot, uint64(typeptrdata(t)))

 	ot = duint32(s, ot, typehash(t))
 	ot = duint8(s, ot, 0) // unused
@@ -1428,17 +1428,12 @@ func usegcprog(t *Type) bool {
 	}

 	// Calculate size of the unrolled GC mask.
-	nptr := (t.Width + int64(Widthptr) - 1) / int64(Widthptr)
-
-	size := (nptr + 7) / 8
+	nptr := typeptrdata(t) / int64(Widthptr)

 	// Decide whether to use unrolled GC mask or GC program.
 	// We could use a more elaborate condition, but this seems to work well in practice.
-	// For small objects GC program can't give significant reduction.
-	// While large objects usually contain arrays; and even if it don't
-	// the program uses 2-bits per word while mask uses 4-bits per word,
-	// so the program is still smaller.
-	return size > int64(2*Widthptr)
+	// For small objects, the GC program can't give significant reduction.
+	return nptr > int64(2*Widthptr*8)
 }

 // Generates GC bitmask (1 bit per word).
@@ -1450,11 +1445,11 @@ func gengcmask(t *Type, gcmask []byte) {
 		return
 	}

-	vec := bvalloc(2 * int32(Widthptr) * 8)
+	vec := bvalloc(int32(2 * Widthptr * 8))
 	xoffset := int64(0)
 	onebitwalktype1(t, &xoffset, vec)

-	nptr := (t.Width + int64(Widthptr) - 1) / int64(Widthptr)
+	nptr := typeptrdata(t) / int64(Widthptr)
 	for i := int64(0); i < nptr; i++ {
 		if bvget(vec, int32(i)) == 1 {
 			gcmask[i/8] |= 1 << (uint(i) % 8)

--- a/src/cmd/internal/ld/data.go
+++ b/src/cmd/internal/ld/data.go
@@ -1109,8 +1109,7 @@ func proggenaddsym(g *ProgGen, s *LSym) {

 	// Skip alignment hole from the previous symbol.
 	proggenskip(g, g.pos, s.Value-g.pos)
-
-	g.pos += s.Value - g.pos
+	g.pos = s.Value

 	// The test for names beginning with . here is meant
 	// to keep .dynamic and .dynsym from turning up as
@@ -1142,16 +1141,16 @@ func proggenaddsym(g *ProgGen, s *LSym) {
 			proggendata(g, 0)
 			proggenarrayend(g)
 		}
-
 		g.pos = s.Value + s.Size
 	} else if decodetype_usegcprog(s.Gotype) != 0 {
 		// gc program, copy directly
+		// TODO(rsc): Maybe someday the gc program will only describe
+		// the first decodetype_ptrdata(s.Gotype) bytes instead of the full size.
 		proggendataflush(g)
-
 		gcprog := decodetype_gcprog(s.Gotype)
 		size := decodetype_size(s.Gotype)
 		if (size%int64(Thearch.Ptrsize) != 0) || (g.pos%int64(Thearch.Ptrsize) != 0) {
-			Diag("proggenaddsym: unaligned gcprog symbol %s: size=%d pos=%d", s.Name, s.Size, g.pos)
+			Diag("proggenaddsym: unaligned gcprog symbol %s: size=%d pos=%d", s.Name, size, g.pos)
 		}
 		for i := int64(0); i < int64(len(gcprog.P)-1); i++ {
 			proggenemit(g, uint8(gcprog.P[i]))
@@ -1160,16 +1159,15 @@ func proggenaddsym(g *ProgGen, s *LSym) {
 	} else {
 		// gc mask, it's small so emit as data
 		mask := decodetype_gcmask(s.Gotype)
-
-		size := decodetype_size(s.Gotype)
-		if (size%int64(Thearch.Ptrsize) != 0) || (g.pos%int64(Thearch.Ptrsize) != 0) {
-			Diag("proggenaddsym: unaligned gcmask symbol %s: size=%d pos=%d", s.Name, s.Size, g.pos)
+		ptrdata := decodetype_ptrdata(s.Gotype)
+		if (ptrdata%int64(Thearch.Ptrsize) != 0) || (g.pos%int64(Thearch.Ptrsize) != 0) {
+			Diag("proggenaddsym: unaligned gcmask symbol %s: size=%d pos=%d", s.Name, ptrdata, g.pos)
 		}
-		for i := int64(0); i < size; i += int64(Thearch.Ptrsize) {
+		for i := int64(0); i < ptrdata; i += int64(Thearch.Ptrsize) {
 			word := uint(i / int64(Thearch.Ptrsize))
 			proggendata(g, (mask[word/8]>>(word%8))&1)
 		}
-		g.pos = s.Value + size
+		g.pos = s.Value + ptrdata
 	}
 }


--- a/src/cmd/internal/ld/decodesym.go
+++ b/src/cmd/internal/ld/decodesym.go
@@ -67,6 +67,11 @@ func decodetype_size(s *LSym) int64 {
 	return int64(decode_inuxi(s.P, Thearch.Ptrsize)) // 0x8 / 0x10
 }

+// Type.commonType.ptrdata
+func decodetype_ptrdata(s *LSym) int64 {
+	return int64(decode_inuxi(s.P[Thearch.Ptrsize:], Thearch.Ptrsize)) // 0x8 / 0x10
+}
+
 // Type.commonType.gc
 func decodetype_gcprog(s *LSym) *LSym {
 	if s.Type == obj.SDYNIMPORT {

--- a/src/runtime/gcinfo_test.go
+++ b/src/runtime/gcinfo_test.go
@@ -34,23 +34,23 @@ func TestGCInfo(t *testing.T) {
 	verifyGCInfo(t, "data eface", &dataEface, infoEface)
 	verifyGCInfo(t, "data iface", &dataIface, infoIface)

-	verifyGCInfo(t, "stack ScalarPtr", new(ScalarPtr), nonStackInfo(infoScalarPtr))
-	verifyGCInfo(t, "stack PtrScalar", new(PtrScalar), nonStackInfo(infoPtrScalar))
-	verifyGCInfo(t, "stack BigStruct", new(BigStruct), nonStackInfo(infoBigStruct()))
-	verifyGCInfo(t, "stack string", new(string), nonStackInfo(infoString))
-	verifyGCInfo(t, "stack slice", new([]string), nonStackInfo(infoSlice))
-	verifyGCInfo(t, "stack eface", new(interface{}), nonStackInfo(infoEface))
-	verifyGCInfo(t, "stack iface", new(Iface), nonStackInfo(infoIface))
+	verifyGCInfo(t, "stack ScalarPtr", new(ScalarPtr), infoScalarPtr)
+	verifyGCInfo(t, "stack PtrScalar", new(PtrScalar), infoPtrScalar)
+	verifyGCInfo(t, "stack BigStruct", new(BigStruct), infoBigStruct())
+	verifyGCInfo(t, "stack string", new(string), infoString)
+	verifyGCInfo(t, "stack slice", new([]string), infoSlice)
+	verifyGCInfo(t, "stack eface", new(interface{}), infoEface)
+	verifyGCInfo(t, "stack iface", new(Iface), infoIface)

 	for i := 0; i < 10; i++ {
-		verifyGCInfo(t, "heap PtrSlice", escape(&make([]*byte, 10)[0]), infoPtr10)
-		verifyGCInfo(t, "heap ScalarPtr", escape(new(ScalarPtr)), infoScalarPtr)
-		verifyGCInfo(t, "heap ScalarPtrSlice", escape(&make([]ScalarPtr, 4)[0]), infoScalarPtr4)
-		verifyGCInfo(t, "heap PtrScalar", escape(new(PtrScalar)), infoPtrScalar)
-		verifyGCInfo(t, "heap BigStruct", escape(new(BigStruct)), infoBigStruct())
-		verifyGCInfo(t, "heap string", escape(new(string)), infoString)
-		verifyGCInfo(t, "heap eface", escape(new(interface{})), infoEface)
-		verifyGCInfo(t, "heap iface", escape(new(Iface)), infoIface)
+		verifyGCInfo(t, "heap PtrSlice", escape(&make([]*byte, 10)[0]), trimDead(infoPtr10))
+		verifyGCInfo(t, "heap ScalarPtr", escape(new(ScalarPtr)), trimDead(infoScalarPtr))
+		verifyGCInfo(t, "heap ScalarPtrSlice", escape(&make([]ScalarPtr, 4)[0]), trimDead(infoScalarPtr4))
+		verifyGCInfo(t, "heap PtrScalar", escape(new(PtrScalar)), trimDead(infoPtrScalar))
+		verifyGCInfo(t, "heap BigStruct", escape(new(BigStruct)), trimDead(infoBigStruct()))
+		verifyGCInfo(t, "heap string", escape(new(string)), trimDead(infoString))
+		verifyGCInfo(t, "heap eface", escape(new(interface{})), trimDead(infoEface))
+		verifyGCInfo(t, "heap iface", escape(new(Iface)), trimDead(infoIface))
 	}

 }
@@ -67,16 +67,11 @@ func verifyGCInfo(t *testing.T, name string, p interface{}, mask0 []byte) {
 	}
 }

-func nonStackInfo(mask []byte) []byte {
-	// typeDead is replaced with typeScalar everywhere except stacks.
-	mask1 := make([]byte, len(mask))
-	for i, v := range mask {
-		if v == typeDead {
-			v = typeScalar
-		}
-		mask1[i] = v
+func trimDead(mask []byte) []byte {
+	for len(mask) > 2 && mask[len(mask)-1] == typeScalar {
+		mask = mask[:len(mask)-1]
 	}
-	return mask1
+	return mask
 }

 var gcinfoSink interface{}

--- a/src/runtime/mbitmap.go
+++ b/src/runtime/mbitmap.go