Commit 512f75e8 authored by Russ Cox

runtime: replace GC programs with simpler encoding, faster decoder

Small types record the location of pointers in their memory layout
by using a simple bitmap. In Go 1.4 the bitmap held 4-bit entries,
and in Go 1.5 the bitmap holds 1-bit entries, but in both cases using
a bitmap for a large type containing arrays does not make sense:
if someone refers to the type [1<<28]*byte in a program in such
a way that the type information makes it into the binary, it would be
a waste of space to write a 128 MB (for 4-bit entries) or even 32 MB
(for 1-bit entries) bitmap full of 1s into the binary or even to keep
one in memory during the execution of the program.

For large types containing arrays, it is much more compact to describe
the locations of pointers using a notation that can express repetition
than to lay out a bitmap of pointers. Go 1.4 included such a notation,
called "GC programs", but it was complex, required recursion during
decoding, and was generally slow. Dmitriy measured the execution of
these programs writing directly to the heap bitmap as being 7x slower
than copying from a preunrolled 4-bit mask (and frankly that code was
not terribly fast either). For some tests, unrollgcprog1 was seen costing
as much as 3x more than the rest of malloc combined.

This CL introduces a different form for the GC programs. They use a
simple Lempel-Ziv-style encoding of the 1-bit pointer information,
in which the only operations are (1) emit the following n bits
and (2) repeat the last n bits c more times. This encoding can be
generated directly from the Go type information (using repetition
only for arrays or large runs of non-pointer data) and it can be decoded
very efficiently. In particular the decoding requires little state and
no recursion, so that the entire decoding can run without any memory
accesses other than the reads of the encoding and the writes of the
decoded form to the heap bitmap. For recursive types like arrays of
arrays of arrays, the inner instructions are only executed once, not
n times, so that large repetitions run at full speed. (In contrast, large
repetitions in the old programs repeated the individual bit-level layout
of the inner data over and over.) The result is as much as 25x faster
decoding compared to the old form.
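
As a concrete sketch of the new form, using the opcode encoding
defined in the new cmd/internal/gcprog package below, the 2^28
pointer bits for [1<<28]*byte reduce to an 8-byte program:

	01 01            emit 1 literal bit: 1 (a pointer word)
	81 ff ff ff 7f   repeat the previous 1 bit (1<<28)-1 more times
	00               stop

rather than a 32 MB bitmap.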

Because the old decoder was so slow, Go 1.4 had three (or so) cases
for how to set the heap bitmap bits for an allocation of a given type:

(1) If the type had an even number of words up to 32 words, then
the 4-bit pointer mask for the type fit in no more than 16 bytes;
store the 4-bit pointer mask directly in the binary and copy from it.

(1b) If the type had an odd number of words up to 15 words, then
the 4-bit pointer mask for the type, doubled to end on a byte boundary,
fit in no more than 16 bytes; store that doubled mask directly in the
binary and copy from it.

(2) If the type had an even number of words up to 128 words,
or an odd number of words up to 63 words (again due to doubling),
then the 4-bit pointer mask would fit in a 64-byte unrolled mask.
Store a GC program in the binary, but leave space in the BSS for
the unrolled mask. Execute the GC program to construct the mask the
first time it is needed, and thereafter copy from the mask.

(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.
(This is the case that was 7x slower than the other two.)

Because the new pointer masks store 1-bit entries instead of 4-bit
entries and because using the decoder no longer carries a significant
overhead, after this CL (that is, for Go 1.5) there are only two cases:

(1) If the type is 128 words or less (no condition about odd or even),
store the 1-bit pointer mask directly in the binary and use it to
initialize the heap bitmap during malloc. (Implemented in CL 9702.)

(2) There is no case 2 anymore.

(3) Otherwise, store a GC program and execute it to write directly to
the heap bitmap each time an object of that type is allocated.

Executing the GC program directly into the heap bitmap (case (3) above)
was disabled for the Go 1.5 dev cycle, both to avoid needing to use
GC programs for typedmemmove and to avoid updating that code as
the heap bitmap format changed. Typedmemmove no longer uses this
type information; as of CL 9886 it uses the heap bitmap directly.
Now that the heap bitmap format is stable, we reintroduce GC programs
and their space savings.

Benchmarks for heapBitsSetType, before this CL vs this CL:

name                    old mean               new mean              delta
SetTypePtr              7.59ns × (0.99,1.02)   5.16ns × (1.00,1.00)  -32.05% (p=0.000)
SetTypePtr8             21.0ns × (0.98,1.05)   21.4ns × (1.00,1.00)     ~    (p=0.179)
SetTypePtr16            24.1ns × (0.99,1.01)   24.6ns × (1.00,1.00)   +2.41% (p=0.001)
SetTypePtr32            31.2ns × (0.99,1.01)   32.4ns × (0.99,1.02)   +3.72% (p=0.001)
SetTypePtr64            45.2ns × (1.00,1.00)   47.2ns × (1.00,1.00)   +4.42% (p=0.000)
SetTypePtr126           75.8ns × (0.99,1.01)   79.1ns × (1.00,1.00)   +4.25% (p=0.000)
SetTypePtr128           74.3ns × (0.99,1.01)   77.6ns × (1.00,1.01)   +4.55% (p=0.000)
SetTypePtrSlice          726ns × (1.00,1.01)    712ns × (1.00,1.00)   -1.95% (p=0.001)
SetTypeNode1            20.0ns × (0.99,1.01)   20.7ns × (1.00,1.00)   +3.71% (p=0.000)
SetTypeNode1Slice        112ns × (1.00,1.00)    113ns × (0.99,1.00)     ~    (p=0.070)
SetTypeNode8            23.9ns × (1.00,1.00)   24.7ns × (1.00,1.01)   +3.18% (p=0.000)
SetTypeNode8Slice        294ns × (0.99,1.02)    287ns × (0.99,1.01)   -2.38% (p=0.015)
SetTypeNode64           52.8ns × (0.99,1.03)   51.8ns × (0.99,1.01)     ~    (p=0.069)
SetTypeNode64Slice      1.13µs × (0.99,1.05)   1.14µs × (0.99,1.00)     ~    (p=0.767)
SetTypeNode64Dead       36.0ns × (1.00,1.01)   32.5ns × (0.99,1.00)   -9.67% (p=0.000)
SetTypeNode64DeadSlice  1.43µs × (0.99,1.01)   1.40µs × (1.00,1.00)   -2.39% (p=0.001)
SetTypeNode124          75.7ns × (1.00,1.01)   79.0ns × (1.00,1.00)   +4.44% (p=0.000)
SetTypeNode124Slice     1.94µs × (1.00,1.01)   2.04µs × (0.99,1.01)   +4.98% (p=0.000)
SetTypeNode126          75.4ns × (1.00,1.01)   77.7ns × (0.99,1.01)   +3.11% (p=0.000)
SetTypeNode126Slice     1.95µs × (0.99,1.01)   2.03µs × (1.00,1.00)   +3.74% (p=0.000)
SetTypeNode128          85.4ns × (0.99,1.01)  122.0ns × (1.00,1.00)  +42.89% (p=0.000)
SetTypeNode128Slice     2.20µs × (1.00,1.01)   2.36µs × (0.98,1.02)   +7.48% (p=0.001)
SetTypeNode130          83.3ns × (1.00,1.00)  123.0ns × (1.00,1.00)  +47.61% (p=0.000)
SetTypeNode130Slice     2.30µs × (0.99,1.01)   2.40µs × (0.98,1.01)   +4.37% (p=0.000)
SetTypeNode1024          498ns × (1.00,1.00)    537ns × (1.00,1.00)   +7.96% (p=0.000)
SetTypeNode1024Slice    15.5µs × (0.99,1.01)   17.8µs × (1.00,1.00)  +15.27% (p=0.000)

The above compares always using a cached pointer mask (and the
corresponding waste of memory) against using the programs directly.
Some slowdown is expected, in exchange for having a better general algorithm.
The GC programs kick in for SetTypeNode128, SetTypeNode130, SetTypeNode1024,
along with the slice variants of those.
It is possible that the cutoff of 128 words (bits) should be raised
in a followup CL, but even with this low cutoff the GC programs are
faster than Go 1.4's "fast path" non-GC program case.

Benchmarks for heapBitsSetType, Go 1.4 vs this CL:

name                    old mean              new mean              delta
SetTypePtr              6.89ns × (1.00,1.00)  5.17ns × (1.00,1.00)  -25.02% (p=0.000)
SetTypePtr8             25.8ns × (0.97,1.05)  21.5ns × (1.00,1.00)  -16.70% (p=0.000)
SetTypePtr16            39.8ns × (0.97,1.02)  24.7ns × (0.99,1.01)  -37.81% (p=0.000)
SetTypePtr32            68.8ns × (0.98,1.01)  32.2ns × (1.00,1.01)  -53.18% (p=0.000)
SetTypePtr64             130ns × (1.00,1.00)    47ns × (1.00,1.00)  -63.67% (p=0.000)
SetTypePtr126            241ns × (0.99,1.01)    79ns × (1.00,1.01)  -67.25% (p=0.000)
SetTypePtr128           2.07µs × (1.00,1.00)  0.08µs × (1.00,1.00)  -96.27% (p=0.000)
SetTypePtrSlice         1.05µs × (0.99,1.01)  0.72µs × (0.99,1.02)  -31.70% (p=0.000)
SetTypeNode1            16.0ns × (0.99,1.01)  20.8ns × (0.99,1.03)  +29.91% (p=0.000)
SetTypeNode1Slice        184ns × (0.99,1.01)   112ns × (0.99,1.01)  -39.26% (p=0.000)
SetTypeNode8            29.5ns × (0.97,1.02)  24.6ns × (1.00,1.00)  -16.50% (p=0.000)
SetTypeNode8Slice        624ns × (0.98,1.02)   285ns × (1.00,1.00)  -54.31% (p=0.000)
SetTypeNode64            135ns × (0.96,1.08)    52ns × (0.99,1.02)  -61.32% (p=0.000)
SetTypeNode64Slice      3.83µs × (1.00,1.00)  1.14µs × (0.99,1.01)  -70.16% (p=0.000)
SetTypeNode64Dead        134ns × (0.99,1.01)    32ns × (1.00,1.01)  -75.74% (p=0.000)
SetTypeNode64DeadSlice  3.83µs × (0.99,1.00)  1.40µs × (1.00,1.01)  -63.42% (p=0.000)
SetTypeNode124           240ns × (0.99,1.01)    79ns × (1.00,1.01)  -67.05% (p=0.000)
SetTypeNode124Slice     7.27µs × (1.00,1.00)  2.04µs × (1.00,1.00)  -71.95% (p=0.000)
SetTypeNode126          2.06µs × (0.99,1.01)  0.08µs × (0.99,1.01)  -96.23% (p=0.000)
SetTypeNode126Slice     64.4µs × (1.00,1.00)   2.0µs × (1.00,1.00)  -96.85% (p=0.000)
SetTypeNode128          2.09µs × (1.00,1.01)  0.12µs × (1.00,1.00)  -94.15% (p=0.000)
SetTypeNode128Slice     65.4µs × (1.00,1.00)   2.4µs × (0.99,1.03)  -96.39% (p=0.000)
SetTypeNode130          2.11µs × (1.00,1.00)  0.12µs × (1.00,1.00)  -94.18% (p=0.000)
SetTypeNode130Slice     66.3µs × (1.00,1.00)   2.4µs × (0.97,1.08)  -96.34% (p=0.000)
SetTypeNode1024         16.0µs × (1.00,1.01)   0.5µs × (1.00,1.00)  -96.65% (p=0.000)
SetTypeNode1024Slice     512µs × (1.00,1.00)    18µs × (0.98,1.04)  -96.45% (p=0.000)

SetTypeNode124 uses a 124 data + 2 ptr = 126-word allocation.
Both Go 1.4 and this CL are using pointer bitmaps for this case,
so that's an overall 3x speedup for using pointer bitmaps.

SetTypeNode128 uses a 128 data + 2 ptr = 130-word allocation.
Both Go 1.4 and this CL are running the GC program for this case,
so that's an overall 17x speedup when using GC programs (and
I've seen >20x on other systems).

Comparing Go 1.4's SetTypeNode124 (pointer bitmap) against
this CL's SetTypeNode128 (GC program), the slow path in the
code in this CL is 2x faster than the fast path in Go 1.4.

The Go 1 benchmarks are basically unaffected compared to just before this CL.

Go 1 benchmarks, before this CL vs this CL:

name                   old mean              new mean              delta
BinaryTree17            5.87s × (0.97,1.04)   5.91s × (0.96,1.04)    ~    (p=0.306)
Fannkuch11              4.38s × (1.00,1.00)   4.37s × (1.00,1.01)  -0.22% (p=0.006)
FmtFprintfEmpty        90.7ns × (0.97,1.10)  89.3ns × (0.96,1.09)    ~    (p=0.280)
FmtFprintfString        282ns × (0.98,1.04)   287ns × (0.98,1.07)  +1.72% (p=0.039)
FmtFprintfInt           269ns × (0.99,1.03)   282ns × (0.97,1.04)  +4.87% (p=0.000)
FmtFprintfIntInt        478ns × (0.99,1.02)   481ns × (0.99,1.02)  +0.61% (p=0.048)
FmtFprintfPrefixedInt   399ns × (0.98,1.03)   400ns × (0.98,1.05)    ~    (p=0.533)
FmtFprintfFloat         563ns × (0.99,1.01)   570ns × (1.00,1.01)  +1.37% (p=0.000)
FmtManyArgs            1.89µs × (0.99,1.01)  1.92µs × (0.99,1.02)  +1.88% (p=0.000)
GobDecode              15.2ms × (0.99,1.01)  15.2ms × (0.98,1.05)    ~    (p=0.609)
GobEncode              11.6ms × (0.98,1.03)  11.9ms × (0.98,1.04)  +2.17% (p=0.000)
Gzip                    648ms × (0.99,1.01)   648ms × (1.00,1.01)    ~    (p=0.835)
Gunzip                  142ms × (1.00,1.00)   143ms × (1.00,1.01)    ~    (p=0.169)
HTTPClientServer       90.5µs × (0.98,1.03)  91.5µs × (0.98,1.04)  +1.04% (p=0.045)
JSONEncode             31.5ms × (0.98,1.03)  31.4ms × (0.98,1.03)    ~    (p=0.549)
JSONDecode              111ms × (0.99,1.01)   107ms × (0.99,1.01)  -3.21% (p=0.000)
Mandelbrot200          6.01ms × (1.00,1.00)  6.01ms × (1.00,1.00)    ~    (p=0.878)
GoParse                6.54ms × (0.99,1.02)  6.61ms × (0.99,1.03)  +1.08% (p=0.004)
RegexpMatchEasy0_32     160ns × (1.00,1.01)   161ns × (1.00,1.00)  +0.40% (p=0.000)
RegexpMatchEasy0_1K     560ns × (0.99,1.01)   559ns × (0.99,1.01)    ~    (p=0.088)
RegexpMatchEasy1_32     138ns × (0.99,1.01)   138ns × (1.00,1.00)    ~    (p=0.380)
RegexpMatchEasy1_1K     877ns × (1.00,1.00)   878ns × (1.00,1.00)    ~    (p=0.157)
RegexpMatchMedium_32    251ns × (0.99,1.00)   251ns × (1.00,1.01)  +0.28% (p=0.021)
RegexpMatchMedium_1K   72.6µs × (1.00,1.00)  72.6µs × (1.00,1.00)    ~    (p=0.539)
RegexpMatchHard_32     3.84µs × (1.00,1.00)  3.84µs × (1.00,1.00)    ~    (p=0.378)
RegexpMatchHard_1K      117µs × (1.00,1.00)   117µs × (1.00,1.00)    ~    (p=0.067)
Revcomp                 904ms × (0.99,1.02)   904ms × (0.99,1.01)    ~    (p=0.943)
Template                125ms × (0.99,1.02)   127ms × (0.99,1.01)  +1.79% (p=0.000)
TimeParse               627ns × (0.99,1.01)   622ns × (0.99,1.01)  -0.88% (p=0.000)
TimeFormat              655ns × (0.99,1.02)   655ns × (0.99,1.02)    ~    (p=0.976)

For the record, Go 1 benchmarks, Go 1.4 vs this CL:

name                   old mean              new mean              delta
BinaryTree17            4.61s × (0.97,1.05)   5.91s × (0.98,1.03)  +28.35% (p=0.000)
Fannkuch11              4.40s × (0.99,1.03)   4.41s × (0.99,1.01)     ~    (p=0.212)
FmtFprintfEmpty         102ns × (0.99,1.01)    84ns × (0.99,1.02)  -18.38% (p=0.000)
FmtFprintfString        302ns × (0.98,1.01)   303ns × (0.99,1.02)     ~    (p=0.203)
FmtFprintfInt           313ns × (0.97,1.05)   270ns × (0.99,1.01)  -13.69% (p=0.000)
FmtFprintfIntInt        524ns × (0.98,1.02)   477ns × (0.99,1.00)   -8.87% (p=0.000)
FmtFprintfPrefixedInt   424ns × (0.98,1.02)   386ns × (0.99,1.01)   -8.96% (p=0.000)
FmtFprintfFloat         652ns × (0.98,1.02)   594ns × (0.97,1.05)   -8.97% (p=0.000)
FmtManyArgs            2.13µs × (0.99,1.02)  1.94µs × (0.99,1.01)   -8.92% (p=0.000)
GobDecode              17.1ms × (0.99,1.02)  14.9ms × (0.98,1.03)  -13.07% (p=0.000)
GobEncode              13.5ms × (0.98,1.03)  11.5ms × (0.98,1.03)  -15.25% (p=0.000)
Gzip                    656ms × (0.99,1.02)   647ms × (0.99,1.01)   -1.29% (p=0.000)
Gunzip                  143ms × (0.99,1.02)   144ms × (0.99,1.01)     ~    (p=0.204)
HTTPClientServer       88.2µs × (0.98,1.02)  90.8µs × (0.98,1.01)   +2.93% (p=0.000)
JSONEncode             32.2ms × (0.98,1.02)  30.9ms × (0.97,1.04)   -4.06% (p=0.001)
JSONDecode              121ms × (0.98,1.02)   110ms × (0.98,1.05)   -8.95% (p=0.000)
Mandelbrot200          6.06ms × (0.99,1.01)  6.11ms × (0.98,1.04)     ~    (p=0.184)
GoParse                6.76ms × (0.97,1.04)  6.58ms × (0.98,1.05)   -2.63% (p=0.003)
RegexpMatchEasy0_32     195ns × (1.00,1.01)   155ns × (0.99,1.01)  -20.43% (p=0.000)
RegexpMatchEasy0_1K     479ns × (0.98,1.03)   535ns × (0.99,1.02)  +11.59% (p=0.000)
RegexpMatchEasy1_32     169ns × (0.99,1.02)   131ns × (0.99,1.03)  -22.44% (p=0.000)
RegexpMatchEasy1_1K    1.53µs × (0.99,1.01)  0.87µs × (0.99,1.02)  -43.07% (p=0.000)
RegexpMatchMedium_32    334ns × (0.99,1.01)   242ns × (0.99,1.01)  -27.53% (p=0.000)
RegexpMatchMedium_1K    125µs × (1.00,1.01)    72µs × (0.99,1.03)  -42.53% (p=0.000)
RegexpMatchHard_32     6.03µs × (0.99,1.01)  3.79µs × (0.99,1.01)  -37.12% (p=0.000)
RegexpMatchHard_1K      189µs × (0.99,1.02)   115µs × (0.99,1.01)  -39.20% (p=0.000)
Revcomp                 935ms × (0.96,1.03)   926ms × (0.98,1.02)     ~    (p=0.083)
Template                146ms × (0.97,1.05)   119ms × (0.99,1.01)  -18.37% (p=0.000)
TimeParse               660ns × (0.99,1.01)   624ns × (0.99,1.02)   -5.43% (p=0.000)
TimeFormat              670ns × (0.98,1.02)   710ns × (1.00,1.01)   +5.97% (p=0.000)

This CL is a bit larger than I would like, but the compiler, linker, runtime,
and package reflect all need to be in sync about the format of these programs,
so there is no easy way to split this into independent changes (at least
while keeping the build working at each change).

Fixes #9625.
Fixes #10524.

Change-Id: I9e3e20d6097099d0f8532d1cb5b1af528804989a
Reviewed-on: https://go-review.googlesource.com/9888
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Russ Cox <rsc@golang.org>
parent ebe733cb
@@ -39,6 +39,7 @@ var bootstrapDirs = []string{
"asm/internal/flags",
"asm/internal/lex",
"internal/asm",
"internal/gcprog",
"internal/gc/big",
"internal/gc",
"internal/ld",
@@ -48,12 +48,13 @@ var debugtab = []struct {
name string
val *int
}{
{"append", &Debug_append}, // print information about append compilation
{"disablenil", &Disable_checknil}, // disable nil checks
{"gcprog", &Debug_gcprog}, // print dump of GC programs
{"nil", &Debug_checknil}, // print information about nil checks
{"slice", &Debug_slice}, // print information about slice compilation
{"typeassert", &Debug_typeassert}, // print information about type assertion inlining
{"disablenil", &Disable_checknil}, // disable nil checks
{"wb", &Debug_wb}, // print information about write barriers
{"append", &Debug_append}, // print information about append compilation
{"slice", &Debug_slice}, // print information about slice compilation
}
// Our own isdigit, isspace, isalpha, isalnum that take care
@@ -944,7 +944,7 @@ func onebitwalktype1(t *Type, xoffset *int64, bv Bvec) {
*xoffset += t.Width
case TARRAY:
- // The value of t->bound is -1 for slices types and >0 for
+ // The value of t->bound is -1 for slices types and >=0 for
// for fixed array types. All other values are invalid.
if t.Bound < -1 {
Fatal("onebitwalktype1: invalid bound, %v", t)
This diff is collapsed.
// Copyright 2015 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Package gcprog implements an encoder for packed GC pointer bitmaps,
// known as GC programs.
//
// Program Format
//
// The GC program encodes a sequence of 0 and 1 bits indicating scalar or pointer words in an object.
// The encoding is a simple Lempel-Ziv program, with codes to emit literal bits and to repeat the
// last n bits c times.
//
// The possible codes are:
//
// 00000000: stop
// 0nnnnnnn: emit n bits copied from the next (n+7)/8 bytes, least significant bit first
// 10000000 n c: repeat the previous n bits c times; n, c are varints
// 1nnnnnnn c: repeat the previous n bits c times; c is a varint
//
// The numbers n and c, when they follow a code, are encoded as varints
// using the same encoding as encoding/binary's Uvarint.
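//
// For example, the program 02 02 82 02 00 encodes the six bits
// 0 1 0 1 0 1: "02 02" emits the two literal bits 0 and 1 (read from
// the data byte 0x02, least significant bit first), "82 02" repeats
// the previous two bits twice more, and "00" stops.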
//
package gcprog
import (
"fmt"
"io"
)
const progMaxLiteral = 127 // maximum n for literal n bit code
// A Writer is an encoder for GC programs.
//
// The typical use of a Writer is to call Init, maybe call Debug,
// make a sequence of Ptr, Advance, Repeat, and Append calls
// to describe the data type, and then finally call End.
type Writer struct {
writeByte func(byte)
symoff int
index int64
b [progMaxLiteral]byte
nb int
debug io.Writer
debugBuf []byte
}
// Init initializes w to write a new GC program
// by calling writeByte for each byte in the program.
func (w *Writer) Init(writeByte func(byte)) {
w.writeByte = writeByte
}
// Debug causes the writer to print a debugging trace to out
// during future calls to methods like Ptr, Advance, and End.
// It also enables debugging checks during the encoding.
func (w *Writer) Debug(out io.Writer) {
w.debug = out
}
// BitIndex returns the number of bits written to the bit stream so far.
func (w *Writer) BitIndex() int64 {
return w.index
}
// byte writes the byte x to the output.
func (w *Writer) byte(x byte) {
if w.debug != nil {
w.debugBuf = append(w.debugBuf, x)
}
w.writeByte(x)
}
// End marks the end of the program, writing any remaining bytes.
func (w *Writer) End() {
w.flushlit()
w.byte(0)
if w.debug != nil {
index := progbits(w.debugBuf)
if index != w.index {
println("gcprog: End wrote program for", index, "bits, but current index is", w.index)
panic("gcprog: out of sync")
}
}
}
// Ptr emits a 1 into the bit stream at the given bit index;
// that is, it records that the index'th word in the object memory is a pointer.
// Any bits between the current index and the new index
// are set to zero, meaning the corresponding words are scalars.
func (w *Writer) Ptr(index int64) {
if index < w.index {
println("gcprog: Ptr at index", index, "but current index is", w.index)
panic("gcprog: invalid Ptr index")
}
w.ZeroUntil(index)
if w.debug != nil {
fmt.Fprintf(w.debug, "gcprog: ptr at %d\n", index)
}
w.lit(1)
}
// ShouldRepeat reports whether it would be worthwhile to
// use a Repeat to describe c elements of n bits each,
// compared to just emitting c copies of the n-bit description.
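// For example, ShouldRepeat(1, 100) reports true (100 literal bits would
// need 13 data bytes), while ShouldRepeat(8, 2) reports false (16 bits
// fit in 2 literal data bytes, cheaper than flushing plus a repeat code).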
func (w *Writer) ShouldRepeat(n, c int64) bool {
// Should we lay out the bits directly instead of
// encoding them as a repetition? Certainly if count==1,
// since there's nothing to repeat, but also if the total
// size of the plain pointer bits for the type will fit in
// 4 or fewer bytes, since using a repetition will require
// flushing the current bits plus at least one byte for
// the repeat size and one for the repeat count.
return c > 1 && c*n > 4*8
}
// Repeat emits an instruction to repeat the description
// of the last n words c times (including the initial description, c+1 times in total).
func (w *Writer) Repeat(n, c int64) {
if n == 0 || c == 0 {
return
}
w.flushlit()
if w.debug != nil {
fmt.Fprintf(w.debug, "gcprog: repeat %d × %d\n", n, c)
}
if n < 128 {
w.byte(0x80 | byte(n))
} else {
w.byte(0x80)
w.varint(n)
}
w.varint(c)
w.index += n * c
}
// ZeroUntil adds zeros to the bit stream until reaching the given index;
// that is, it records that the words from the most recent pointer until
// the index'th word are scalars.
// ZeroUntil is usually called in preparation for a call to Repeat, Append, or End.
func (w *Writer) ZeroUntil(index int64) {
if index < w.index {
println("gcprog: Advance", index, "but index is", w.index)
panic("gcprog: invalid Advance index")
}
skip := (index - w.index)
if skip == 0 {
return
}
if skip < 4*8 {
if w.debug != nil {
fmt.Fprintf(w.debug, "gcprog: advance to %d by literals\n", index)
}
for i := int64(0); i < skip; i++ {
w.lit(0)
}
return
}
if w.debug != nil {
fmt.Fprintf(w.debug, "gcprog: advance to %d by repeat\n", index)
}
w.lit(0)
w.flushlit()
w.Repeat(1, skip-1)
}
// Append emits the given GC program into the current output.
// The caller asserts that the program emits n bits (describes n words),
// and Append panics if that is not true.
func (w *Writer) Append(prog []byte, n int64) {
w.flushlit()
if w.debug != nil {
fmt.Fprintf(w.debug, "gcprog: append prog for %d ptrs\n", n)
fmt.Fprintf(w.debug, "\t")
}
n1 := progbits(prog)
if n1 != n {
panic("gcprog: wrong bit count in append")
}
// The last byte of the prog terminates the program.
// Don't emit that, or else our own program will end.
for i, x := range prog[:len(prog)-1] {
if w.debug != nil {
if i > 0 {
fmt.Fprintf(w.debug, " ")
}
fmt.Fprintf(w.debug, "%02x", x)
}
w.byte(x)
}
if w.debug != nil {
fmt.Fprintf(w.debug, "\n")
}
w.index += n
}
// progbits returns the length of the bit stream encoded by the program p.
func progbits(p []byte) int64 {
var n int64
for len(p) > 0 {
x := p[0]
p = p[1:]
if x == 0 {
break
}
if x&0x80 == 0 {
count := x &^ 0x80
n += int64(count)
p = p[(count+7)/8:]
continue
}
nbit := int64(x &^ 0x80)
if nbit == 0 {
nbit, p = readvarint(p)
}
var count int64
count, p = readvarint(p)
n += nbit * count
}
if len(p) > 0 {
println("gcprog: found end instruction after", n, "ptrs, with", len(p), "bytes remaining")
panic("gcprog: extra data at end of program")
}
return n
}
// readvarint reads a varint from p, returning the value and the remainder of p.
func readvarint(p []byte) (int64, []byte) {
var v int64
var nb uint
for {
c := p[0]
p = p[1:]
v |= int64(c&^0x80) << nb
nb += 7
if c&0x80 == 0 {
break
}
}
return v, p
}
// lit adds a single literal bit to w.
func (w *Writer) lit(x byte) {
if w.nb == progMaxLiteral {
w.flushlit()
}
w.b[w.nb] = x
w.nb++
w.index++
}
// varint emits the varint encoding of x.
func (w *Writer) varint(x int64) {
if x < 0 {
panic("gcprog: negative varint")
}
for x >= 0x80 {
w.byte(byte(0x80 | x))
x >>= 7
}
w.byte(byte(x))
}
// flushlit flushes any pending literal bits.
func (w *Writer) flushlit() {
if w.nb == 0 {
return
}
if w.debug != nil {
fmt.Fprintf(w.debug, "gcprog: flush %d literals\n", w.nb)
fmt.Fprintf(w.debug, "\t%v\n", w.b[:w.nb])
fmt.Fprintf(w.debug, "\t%02x", byte(w.nb))
}
w.byte(byte(w.nb))
var bits uint8
for i := 0; i < w.nb; i++ {
bits |= w.b[i] << uint(i%8)
if (i+1)%8 == 0 {
if w.debug != nil {
fmt.Fprintf(w.debug, " %02x", bits)
}
w.byte(bits)
bits = 0
}
}
if w.nb%8 != 0 {
if w.debug != nil {
fmt.Fprintf(w.debug, " %02x", bits)
}
w.byte(bits)
}
if w.debug != nil {
fmt.Fprintf(w.debug, "\n")
}
w.nb = 0
}
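
A minimal sketch of driving the Writer above (a hypothetical example;
the byte-collecting closure and the layout of 100 consecutive pointer
words are illustrative, not taken from this CL's callers):

var buf []byte
var w gcprog.Writer
w.Init(func(x byte) { buf = append(buf, x) })
w.Ptr(0)        // word 0 holds a pointer
w.Repeat(1, 99) // words 1..99 repeat the previous 1-bit pattern
w.End()         // flush pending literal bits, then emit the stop byte
// buf is now 01 01 81 63 00 and w.BitIndex() == 100.
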
@@ -32,9 +32,11 @@
package ld
import (
"cmd/internal/gcprog"
"cmd/internal/obj"
"fmt"
"log"
"os"
"strings"
)
@@ -1044,141 +1046,65 @@ func maxalign(s *LSym, type_ int) int32 {
return max
}
// Helper object for building GC type programs.
type ProgGen struct {
s *LSym
datasize int32
data [256 / 8]uint8
pos int64
}
func proggeninit(g *ProgGen, s *LSym) {
g.s = s
g.datasize = 0
g.pos = 0
g.data = [256 / 8]uint8{}
}
const debugGCProg = false
func proggenemit(g *ProgGen, v uint8) {
Adduint8(Ctxt, g.s, v)
type GCProg struct {
sym *LSym
w gcprog.Writer
}
// Writes insData block from g->data.
func proggendataflush(g *ProgGen) {
if g.datasize == 0 {
return
func (p *GCProg) Init(name string) {
p.sym = Linklookup(Ctxt, name, 0)
p.w.Init(p.writeByte)
if debugGCProg {
fmt.Fprintf(os.Stderr, "ld: start GCProg %s\n", name)
p.w.Debug(os.Stderr)
}
proggenemit(g, obj.InsData)
proggenemit(g, uint8(g.datasize))
s := (g.datasize + 7) / 8
for i := int32(0); i < s; i++ {
proggenemit(g, g.data[i])
}
g.datasize = 0
g.data = [256 / 8]uint8{}
}
func proggendata(g *ProgGen, d uint8) {
g.data[g.datasize/8] |= d << uint(g.datasize%8)
g.datasize++
if g.datasize == 255 {
proggendataflush(g)
}
func (p *GCProg) writeByte(x byte) {
Adduint8(Ctxt, p.sym, x)
}
// Skip v bytes due to alignment, etc.
func proggenskip(g *ProgGen, off int64, v int64) {
for i := off; i < off+v; i++ {
if (i % int64(Thearch.Ptrsize)) == 0 {
proggendata(g, 0)
}
func (p *GCProg) End(size int64) {
p.w.ZeroUntil(size / int64(Thearch.Ptrsize))
p.w.End()
if debugGCProg {
fmt.Fprintf(os.Stderr, "ld: end GCProg\n")
}
}
// Emit insArray instruction.
func proggenarray(g *ProgGen, length int64) {
var i int32
proggendataflush(g)
proggenemit(g, obj.InsArray)
for i = 0; i < int32(Thearch.Ptrsize); i, length = i+1, length>>8 {
proggenemit(g, uint8(length))
}
}
func proggenarrayend(g *ProgGen) {
proggendataflush(g)
proggenemit(g, obj.InsArrayEnd)
}
func proggenfini(g *ProgGen, size int64) {
proggenskip(g, g.pos, size-g.pos)
proggendataflush(g)
proggenemit(g, obj.InsEnd)
}
// This function generates GC pointer info for global variables.
func proggenaddsym(g *ProgGen, s *LSym) {
if s.Size == 0 {
func (p *GCProg) AddSym(s *LSym) {
typ := s.Gotype
// Things without pointers should be in SNOPTRDATA or SNOPTRBSS;
// everything we see should have pointers and should therefore have a type.
if typ == nil {
Diag("missing Go type information for global symbol: %s size %d", s.Name, int(s.Size))
return
}
// Skip alignment hole from the previous symbol.
proggenskip(g, g.pos, s.Value-g.pos)
g.pos = s.Value
ptrsize := int64(Thearch.Ptrsize)
nptr := decodetype_ptrdata(typ) / ptrsize
if s.Gotype == nil && s.Size >= int64(Thearch.Ptrsize) {
Diag("missing Go type information for global symbol: %s size %d", s.Name, int(s.Size))
return
if debugGCProg {
fmt.Fprintf(os.Stderr, "gcprog sym: %s at %d (ptr=%d+%d)\n", s.Name, s.Value, s.Value/ptrsize, nptr)
}
if s.Gotype == nil || decodetype_noptr(s.Gotype) != 0 || s.Size < int64(Thearch.Ptrsize) || s.Name[0] == '.' {
// no scan
if s.Size < int64(32*Thearch.Ptrsize) {
// Emit small symbols as data.
// This case also handles unaligned and tiny symbols, so tread carefully.
for i := s.Value; i < s.Value+s.Size; i++ {
if (i % int64(Thearch.Ptrsize)) == 0 {
proggendata(g, 0)
}
if decodetype_usegcprog(typ) == 0 {
// Copy pointers from mask into program.
mask := decodetype_gcmask(typ)
for i := int64(0); i < nptr; i++ {
if (mask[i/8]>>uint(i%8))&1 != 0 {
p.w.Ptr(s.Value/ptrsize + i)
}
} else {
// Emit large symbols as array.
if (s.Size%int64(Thearch.Ptrsize) != 0) || (g.pos%int64(Thearch.Ptrsize) != 0) {
Diag("proggenaddsym: unaligned noscan symbol %s: size=%d pos=%d", s.Name, s.Size, g.pos)
}
proggenarray(g, s.Size/int64(Thearch.Ptrsize))
proggendata(g, 0)
proggenarrayend(g)
}
g.pos = s.Value + s.Size
} else if decodetype_usegcprog(s.Gotype) != 0 {
// gc program, copy directly
// TODO(rsc): Maybe someday the gc program will only describe
// the first decodetype_ptrdata(s.Gotype) bytes instead of the full size.
proggendataflush(g)
gcprog := decodetype_gcprog(s.Gotype)
size := decodetype_size(s.Gotype)
if (size%int64(Thearch.Ptrsize) != 0) || (g.pos%int64(Thearch.Ptrsize) != 0) {
Diag("proggenaddsym: unaligned gcprog symbol %s: size=%d pos=%d", s.Name, size, g.pos)
}
for i := int64(0); i < int64(len(gcprog.P)-1); i++ {
proggenemit(g, uint8(gcprog.P[i]))
}
g.pos = s.Value + size
} else {
// gc mask, it's small so emit as data
mask := decodetype_gcmask(s.Gotype)
ptrdata := decodetype_ptrdata(s.Gotype)
if (ptrdata%int64(Thearch.Ptrsize) != 0) || (g.pos%int64(Thearch.Ptrsize) != 0) {
Diag("proggenaddsym: unaligned gcmask symbol %s: size=%d pos=%d", s.Name, ptrdata, g.pos)
}
for i := int64(0); i < ptrdata; i += int64(Thearch.Ptrsize) {
word := uint(i / int64(Thearch.Ptrsize))
proggendata(g, (mask[word/8]>>(word%8))&1)
}
g.pos = s.Value + ptrdata
return
}
// Copy program.
prog := decodetype_gcprog(typ)
p.w.ZeroUntil(s.Value / ptrsize)
p.w.Append(prog.P[4:prog.Size], nptr)
}
func growdatsize(datsizep *int64, s *LSym) {
@@ -1386,15 +1312,13 @@ func dodata() {
/* data */
sect = addsection(&Segdata, ".data", 06)
sect.Align = maxalign(s, obj.SBSS-1)
datsize = Rnd(datsize, int64(sect.Align))
sect.Vaddr = uint64(datsize)
Linklookup(Ctxt, "runtime.data", 0).Sect = sect
Linklookup(Ctxt, "runtime.edata", 0).Sect = sect
- gcdata := Linklookup(Ctxt, "runtime.gcdata", 0)
- var gen ProgGen
- proggeninit(&gen, gcdata)
+ var gc GCProg
+ gc.Init("runtime.gcdata")
for ; s != nil && s.Type < obj.SBSS; s = s.Next {
if s.Type == obj.SINITARR {
Ctxt.Cursym = s
@@ -1405,33 +1329,30 @@
s.Type = obj.SDATA
datsize = aligndatsize(datsize, s)
s.Value = int64(uint64(datsize) - sect.Vaddr)
- proggenaddsym(&gen, s) // gc
+ gc.AddSym(s)
growdatsize(&datsize, s)
}
sect.Length = uint64(datsize) - sect.Vaddr
- proggenfini(&gen, int64(sect.Length)) // gc
+ gc.End(int64(sect.Length))
/* bss */
sect = addsection(&Segdata, ".bss", 06)
sect.Align = maxalign(s, obj.SNOPTRBSS-1)
datsize = Rnd(datsize, int64(sect.Align))
sect.Vaddr = uint64(datsize)
Linklookup(Ctxt, "runtime.bss", 0).Sect = sect
Linklookup(Ctxt, "runtime.ebss", 0).Sect = sect
- gcbss := Linklookup(Ctxt, "runtime.gcbss", 0)
- proggeninit(&gen, gcbss)
+ gc = GCProg{}
+ gc.Init("runtime.gcbss")
for ; s != nil && s.Type < obj.SNOPTRBSS; s = s.Next {
s.Sect = sect
datsize = aligndatsize(datsize, s)
s.Value = int64(uint64(datsize) - sect.Vaddr)
- proggenaddsym(&gen, s) // gc
+ gc.AddSym(s)
growdatsize(&datsize, s)
}
sect.Length = uint64(datsize) - sect.Vaddr
- proggenfini(&gen, int64(sect.Length)) // gc
+ gc.End(int64(sect.Length))
/* pointer-free bss */
sect = addsection(&Segdata, ".noptrbss", 06)
@@ -44,7 +44,7 @@ func decode_inuxi(p []byte, sz int) uint64 {
// commonsize returns the size of the common prefix for all type
// structures (runtime._type).
func commonsize() int {
- return 9*Thearch.Ptrsize + 8
+ return 8*Thearch.Ptrsize + 8
}
// Type.commonType.kind
@@ -79,7 +79,7 @@ func decodetype_gcprog(s *LSym) *LSym {
x := "type..gcprog." + s.Name[5:]
return Linklookup(Ctxt, x, 0)
}
- return decode_reloc_sym(s, 2*int32(Thearch.Ptrsize)+8+2*int32(Thearch.Ptrsize))
+ return decode_reloc_sym(s, 2*int32(Thearch.Ptrsize)+8+1*int32(Thearch.Ptrsize))
}
func decodetype_gcprog_shlib(s *LSym) uint64 {
@@ -1200,7 +1200,7 @@ func ldshlibsyms(shlib string) {
if strings.HasPrefix(s.Name, "_") {
continue
}
if strings.HasPrefix(s.Name, "runtime.gcbits.0x") {
if strings.HasPrefix(s.Name, "runtime.gcbits.") {
gcmasks[s.Value] = readelfsymboldata(f, &s)
}
if s.Name == "go.link.abihashbytes" {
@@ -34,6 +34,7 @@ import (
"cmd/internal/obj"
"debug/elf"
"encoding/binary"
"fmt"
)
type LSym struct {
@@ -86,6 +87,13 @@ type LSym struct {
gcmask []byte
}
+ func (s *LSym) String() string {
+ if s.Version == 0 {
+ return s.Name
+ }
+ return fmt.Sprintf("%s<%d>", s.Name, s.Version)
+ }
type Reloc struct {
Off int32
Siz uint8
@@ -347,7 +347,7 @@ func rdsym(ctxt *Link, f *obj.Biobuf, pkg string) *LSym {
s.Reachable = false
}
}
- if v == 0 && strings.HasPrefix(s.Name, "runtime.gcbits.0x") {
+ if v == 0 && strings.HasPrefix(s.Name, "runtime.gcbits.") {
s.Local = true
}
return s
@@ -4395,11 +4395,9 @@ type funcLayoutTest struct {
var funcLayoutTests []funcLayoutTest
func init() {
- var argAlign = PtrSize
- var naclExtra []byte
+ var argAlign uintptr = PtrSize
if runtime.GOARCH == "amd64p32" {
argAlign = 2 * PtrSize
- naclExtra = append(naclExtra, 0)
}
roundup := func(x uintptr, a uintptr) uintptr {
return (x + a - 1) / a * a
@@ -4413,16 +4411,14 @@ func init() {
4 * PtrSize,
4 * PtrSize,
[]byte{1, 0, 1},
- []byte{1, 0, 1, 0, 1, 0},
+ []byte{1, 0, 1, 0, 1},
})
- var r, s []byte
+ var r []byte
if PtrSize == 4 {
r = []byte{0, 0, 0, 1}
- s = append([]byte{0, 0, 0, 1, 0}, naclExtra...)
} else {
r = []byte{0, 0, 1}
- s = []byte{0, 0, 1, 0}
}
funcLayoutTests = append(funcLayoutTests,
funcLayoutTest{
@@ -4432,7 +4428,7 @@
roundup(3*4, PtrSize) + PtrSize + 2,
roundup(roundup(3*4, PtrSize)+PtrSize+2, argAlign),
r,
- s,
+ r,
})
funcLayoutTests = append(funcLayoutTests,
@@ -4469,7 +4465,7 @@ func init() {
3 * PtrSize,
roundup(3*PtrSize, argAlign),
[]byte{1, 0, 1},
- append([]byte{1, 0, 1}, naclExtra...),
+ []byte{1, 0, 1},
})
funcLayoutTests = append(funcLayoutTests,
@@ -4480,7 +4476,7 @@ func init() {
PtrSize,
roundup(PtrSize, argAlign),
[]byte{},
- append([]byte{0}, naclExtra...),
+ []byte{},
})
funcLayoutTests = append(funcLayoutTests,
@@ -4491,7 +4487,7 @@ func init() {
0,
0,
[]byte{},
- []byte{0},
+ []byte{},
})
funcLayoutTests = append(funcLayoutTests,
@@ -4502,7 +4498,7 @@ func init() {
2 * PtrSize,
2 * PtrSize,
[]byte{1},
- []byte{1, 0},
+ []byte{1},
// Note: this one is tricky, as the receiver is not a pointer. But we
// pass the receiver by reference to the autogenerated pointer-receiver
// version of the function.
@@ -4532,3 +4528,118 @@ func TestFuncLayout(t *testing.T) {
}
}
}
func verifyGCBits(t *testing.T, typ Type, bits []byte) {
heapBits := GCBits(New(typ).Interface())
if !bytes.Equal(heapBits, bits) {
t.Errorf("heapBits incorrect for %v\nhave %v\nwant %v", typ, heapBits, bits)
}
}
func TestGCBits(t *testing.T) {
verifyGCBits(t, TypeOf((*byte)(nil)), []byte{1})
// Building blocks for types seen by the compiler (like [2]Xscalar).
// The compiler will create the type structures for the derived types,
// including their GC metadata.
type Xscalar struct{ x uintptr }
type Xptr struct{ x *byte }
type Xptrscalar struct {
*byte
uintptr
}
type Xscalarptr struct {
uintptr
*byte
}
var Tscalar, Tptr, Tscalarptr, Tptrscalar Type
{
// Building blocks for types constructed by reflect.
// This code is in a separate block so that code below
// cannot accidentally refer to these.
// The compiler must NOT see types derived from these
// (for example, [2]Scalar must NOT appear in the program),
// or else reflect will use it instead of having to construct one.
// The goal is to test the construction.
type Scalar struct{ x uintptr }
type Ptr struct{ x *byte }
type Ptrscalar struct {
*byte
uintptr
}
type Scalarptr struct {
uintptr
*byte
}
Tscalar = TypeOf(Scalar{})
Tptr = TypeOf(Ptr{})
Tscalarptr = TypeOf(Scalarptr{})
Tptrscalar = TypeOf(Ptrscalar{})
}
empty := []byte{}
verifyGCBits(t, TypeOf(Xscalar{}), empty)
verifyGCBits(t, Tscalar, empty)
verifyGCBits(t, TypeOf(Xptr{}), lit(1))
verifyGCBits(t, Tptr, lit(1))
verifyGCBits(t, TypeOf(Xscalarptr{}), lit(0, 1))
verifyGCBits(t, Tscalarptr, lit(0, 1))
verifyGCBits(t, TypeOf(Xptrscalar{}), lit(1))
verifyGCBits(t, Tptrscalar, lit(1))
verifyGCBits(t, TypeOf([0]Xptr{}), empty)
verifyGCBits(t, ArrayOf(0, Tptr), empty)
verifyGCBits(t, TypeOf([1]Xptrscalar{}), lit(1))
verifyGCBits(t, ArrayOf(1, Tptrscalar), lit(1))
verifyGCBits(t, TypeOf([2]Xscalar{}), empty)
verifyGCBits(t, ArrayOf(2, Tscalar), empty)
verifyGCBits(t, TypeOf([100]Xscalar{}), empty)
verifyGCBits(t, ArrayOf(100, Tscalar), empty)
verifyGCBits(t, TypeOf([2]Xptr{}), lit(1, 1))
verifyGCBits(t, ArrayOf(2, Tptr), lit(1, 1))
verifyGCBits(t, TypeOf([100]Xptr{}), rep(100, lit(1)))
verifyGCBits(t, ArrayOf(100, Tptr), rep(100, lit(1)))
verifyGCBits(t, TypeOf([2]Xscalarptr{}), lit(0, 1, 0, 1))
verifyGCBits(t, ArrayOf(2, Tscalarptr), lit(0, 1, 0, 1))
verifyGCBits(t, TypeOf([100]Xscalarptr{}), rep(100, lit(0, 1)))
verifyGCBits(t, ArrayOf(100, Tscalarptr), rep(100, lit(0, 1)))
verifyGCBits(t, TypeOf([2]Xptrscalar{}), lit(1, 0, 1))
verifyGCBits(t, ArrayOf(2, Tptrscalar), lit(1, 0, 1))
verifyGCBits(t, TypeOf([100]Xptrscalar{}), rep(100, lit(1, 0)))
verifyGCBits(t, ArrayOf(100, Tptrscalar), rep(100, lit(1, 0)))
verifyGCBits(t, TypeOf([1][100]Xptrscalar{}), rep(100, lit(1, 0)))
verifyGCBits(t, ArrayOf(1, ArrayOf(100, Tptrscalar)), rep(100, lit(1, 0)))
verifyGCBits(t, TypeOf([2][100]Xptrscalar{}), rep(200, lit(1, 0)))
verifyGCBits(t, ArrayOf(2, ArrayOf(100, Tptrscalar)), rep(200, lit(1, 0)))
verifyGCBits(t, TypeOf((chan [100]Xscalar)(nil)), lit(1))
verifyGCBits(t, ChanOf(BothDir, ArrayOf(100, Tscalar)), lit(1))
verifyGCBits(t, TypeOf((func([100]Xscalarptr))(nil)), lit(1))
//verifyGCBits(t, FuncOf([]Type{ArrayOf(100, Tscalarptr)}, nil, false), lit(1))
verifyGCBits(t, TypeOf((map[[100]Xscalarptr]Xscalar)(nil)), lit(1))
verifyGCBits(t, MapOf(ArrayOf(100, Tscalarptr), Tscalar), lit(1))
verifyGCBits(t, TypeOf((*[100]Xscalar)(nil)), lit(1))
verifyGCBits(t, PtrTo(ArrayOf(100, Tscalar)), lit(1))
verifyGCBits(t, TypeOf(([][100]Xscalar)(nil)), lit(1))
verifyGCBits(t, SliceOf(ArrayOf(100, Tscalar)), lit(1))
hdr := make([]byte, 8/PtrSize)
verifyGCBits(t, MapBucketOf(Tscalar, Tptr), join(hdr, rep(8, lit(0)), rep(8, lit(1)), lit(1)))
verifyGCBits(t, MapBucketOf(Tscalarptr, Tptr), join(hdr, rep(8, lit(0, 1)), rep(8, lit(1)), lit(1)))
verifyGCBits(t, MapBucketOf(Tscalar, Tscalar), empty)
verifyGCBits(t, MapBucketOf(ArrayOf(2, Tscalarptr), ArrayOf(3, Tptrscalar)), join(hdr, rep(8*2, lit(0, 1)), rep(8*3, lit(1, 0)), lit(1)))
verifyGCBits(t, MapBucketOf(ArrayOf(64/PtrSize, Tscalarptr), ArrayOf(64/PtrSize, Tptrscalar)), join(hdr, rep(8*64/PtrSize, lit(0, 1)), rep(8*64/PtrSize, lit(1, 0)), lit(1)))
verifyGCBits(t, MapBucketOf(ArrayOf(64/PtrSize+1, Tscalarptr), ArrayOf(64/PtrSize, Tptrscalar)), join(hdr, rep(8, lit(1)), rep(8*64/PtrSize, lit(1, 0)), lit(1)))
verifyGCBits(t, MapBucketOf(ArrayOf(64/PtrSize, Tscalarptr), ArrayOf(64/PtrSize+1, Tptrscalar)), join(hdr, rep(8*64/PtrSize, lit(0, 1)), rep(8, lit(1)), lit(1)))
verifyGCBits(t, MapBucketOf(ArrayOf(64/PtrSize+1, Tscalarptr), ArrayOf(64/PtrSize+1, Tptrscalar)), join(hdr, rep(8, lit(1)), rep(8, lit(1)), lit(1)))
}
func rep(n int, b []byte) []byte { return bytes.Repeat(b, n) }
func join(b ...[]byte) []byte { return bytes.Join(b, nil) }
func lit(x ...byte) []byte { return x }
@@ -4,6 +4,8 @@
package reflect
import "unsafe"
// MakeRO returns a copy of v with the read-only flag set.
func MakeRO(v Value) Value {
v.flag |= flagRO
@@ -28,14 +30,14 @@ func FuncLayout(t Type, rcvr Type) (frametype Type, argSize, retOffset uintptr,
ft, argSize, retOffset, s, _ = funcLayout(t.(*rtype), nil)
}
frametype = ft
- for i := uint32(0); i < s.n; i += 2 {
- stack = append(stack, s.data[i/8]>>(i%8)&3)
+ for i := uint32(0); i < s.n; i++ {
+ stack = append(stack, s.data[i/8]>>(i%8)&1)
}
if ft.kind&kindGCProg != 0 {
panic("can't handle gc programs")
}
- gcdata := (*[1000]byte)(ft.gc[0])
- for i := uintptr(0); i < ft.size/ptrSize; i++ {
+ gcdata := (*[1000]byte)(unsafe.Pointer(ft.gcdata))
+ for i := uintptr(0); i < ft.ptrdata/ptrSize; i++ {
gc = append(gc, gcdata[i/8]>>(i%8)&1)
}
ptrs = ft.kind&kindNoPointers == 0
@@ -51,3 +53,11 @@ func TypeLinks() []string {
}
return r
}
var GCBits = gcbits
func gcbits(interface{}) []byte // provided by runtime
func MapBucketOf(x, y Type) Type {
return bucketOf(x.(*rtype), y.(*rtype))
}
This diff is collapsed.
@@ -10,7 +10,7 @@ import (
"unsafe"
)
- const ptrSize = unsafe.Sizeof((*byte)(nil))
+ const ptrSize = 4 << (^uintptr(0) >> 63) // unsafe.Sizeof(uintptr(0)) but an ideal const
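// (^uintptr(0) >> 63 is 1 when uintptr is 64 bits wide and 0 when it
// is 32 bits wide, so ptrSize evaluates to 8 or 4 at compile time.)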
const cannotSet = "cannot set value obtained from unexported struct field"
// Value is the reflection interface to a Go value.
This diff is collapsed.
@@ -146,8 +146,8 @@ func gcinit() {
work.markfor = parforalloc(_MaxGcproc)
_ = setGCPercent(readgogc())
for datap := &firstmoduledata; datap != nil; datap = datap.next {
- datap.gcdatamask = unrollglobgcprog((*byte)(unsafe.Pointer(datap.gcdata)), datap.edata-datap.data)
- datap.gcbssmask = unrollglobgcprog((*byte)(unsafe.Pointer(datap.gcbss)), datap.ebss-datap.bss)
+ datap.gcdatamask = progToPointerMask((*byte)(unsafe.Pointer(datap.gcdata)), datap.edata-datap.data)
+ datap.gcbssmask = progToPointerMask((*byte)(unsafe.Pointer(datap.gcbss)), datap.ebss-datap.bss)
}
memstats.next_gc = heapminimum
}
@@ -32,6 +32,8 @@ const (
// moduledata records information about the layout of the executable
// image. It is written by the linker. Any changes here must be
// matched changes to the code in cmd/internal/ld/symtab.go:symtab.
+ // moduledata is stored in read-only memory; none of the pointers here
+ // are visible to the garbage collector.
type moduledata struct {
pclntable []byte
ftab []functab
@@ -469,7 +469,7 @@ func setArgInfo(frame *stkframe, f *_func, needArgMap bool) {
throw("reflect mismatch")
}
bv := (*bitvector)(unsafe.Pointer(fn[1]))
- frame.arglen = uintptr(bv.n / 2 * ptrSize)
+ frame.arglen = uintptr(bv.n * ptrSize)
frame.argmap = bv
}
}
@@ -20,17 +20,10 @@ type _type struct {
fieldalign uint8
kind uint8
alg *typeAlg
- // gc stores type info required for garbage collector.
- // If (kind&KindGCProg)==0, then gc[0] points at sparse GC bitmap
- // (no indirection), 4 bits per word.
- // If (kind&KindGCProg)!=0, then gc[1] points to a compiler-generated
- // read-only GC program; and gc[0] points to BSS space for sparse GC bitmap.
- // For huge types (>maxGCMask), runtime unrolls the program directly into
- // GC bitmap and gc[0] is not used. For moderately-sized types, runtime
- // unrolls the program into gc[0] space on first use. The first byte of gc[0]
- // (gc[0][0]) contains 'unroll' flag saying whether the program is already
- // unrolled into gc[0] or not.
- gc [2]uintptr
+ // gcdata stores the GC type data for the garbage collector.
+ // If the KindGCProg bit is set in kind, gcdata is a GC program.
+ // Otherwise it is a ptrmask bitmap. See mbitmap.go for details.
+ gcdata *byte
_string *string
x *uncommontype
ptrto *_type