Simpler and faster GC for Go
Dmitry Vyukov, dvyukov@
July 16, 2014
The current GC code is very complex and difficult to modify, which hampers future GC evolution and the application of spot optimizations. The GC consumes a significant amount of memory for internal data structures (namely, for type info), and it is slow. See the Evaluation section for numbers.
Replace the current program-based GC type info with simpler and more compact bit-mask-based type info. The type info uses the same format as stacks currently use; namely, 2 bits per word:
00 - BitsDead; unlike on stacks, in the heap this means 'end of object'
01 - BitsScalar, not a pointer, ignore in GC
10 - BitsPointer, pointer
11 - BitsMultiWord, a multi-word object, changes meaning of the next pair of bits to:
00 - BitsString, string object
01 - BitsSlice, slice object
10 - BitsIface, iface object
11 - BitsEface, eface object
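For illustration, the encoding above can be written down as constants. This is a minimal Go sketch with values taken directly from the table; the actual declarations live in the runtime and compiler sources:

const (
	BitsDead      = 0 // end of object (or noscan object, see below)
	BitsScalar    = 1 // not a pointer, ignored by the GC
	BitsPointer   = 2 // pointer
	BitsMultiWord = 3 // multi-word object; the next pair refines the kind

	// Meaning of the pair that follows BitsMultiWord.
	BitsString = 0
	BitsSlice  = 1
	BitsIface  = 2
	BitsEface  = 3
)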
The compiler generates type info in a "sparse" format: 4 bits per word, with only the 2 high bits of every 4 used. This is necessary for malloc, so that it can copy the info directly into the GC bitmap (see below). The compiler also repeats the type info twice for types with an odd number of words; this makes any type info an integral number of bytes, which simplifies malloc's handling of arrays and makes it faster.
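As an illustration of this layout, here is a sketch of expanding a per-word 2-bit mask into the sparse format. The function name and the nibble order within a byte are assumptions:

// buildSparseMask expands a per-word 2-bit type mask (one entry per word)
// into the sparse layout: 4 bits per word, with the type bits in the high
// 2 bits of each nibble. For a type with an odd number of words the mask is
// emitted twice so that it occupies a whole number of bytes.
func buildSparseMask(bits []byte) []byte {
	words := bits
	if len(bits)%2 != 0 {
		words = append(append([]byte(nil), bits...), bits...) // repeat twice
	}
	out := make([]byte, len(words)/2)
	for i, b := range words {
		nibble := (b & 3) << 2                // only the high 2 bits of the nibble are used
		out[i/2] |= nibble << (uint(i%2) * 4) // which nibble comes first is an assumption
	}
	return out
}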
However, this format has a problem with very large types: the size of the type info is proportional to the type size, which can significantly bloat binaries. For example, for type struct { x [1<<30]byte; y *byte; } the GC info is 64MB. So for large types the compiler generates a different format for storage in the binary -- a GC program. If the type info mask fits into 2 words, the compiler generates the mask; otherwise it generates the program.
The program grammar follows:
Program = {Block} "insEnd"
Block = Data | Array
Data = "insData" DataSize DataBlock
DataSize = int // size of the DataBlock in bit pairs, 1 byte
DataBlock = binary // dense GC mask of ⌈DataSize/4⌉ bytes
Array = "insArray" ArrayLen Block "insArrayEnd"
ArrayLen = int // length of the array, 8 bytes (4 bytes for 32-bit arch)
Each instruction (insData, insArray, etc.) is 1 byte.
For example, for type struct { x []byte; y [20]struct{ z int; w *byte }; } the program looks like:
insData 3 (BitsMultiWord BitsSlice BitsScalar) insArray 20 insData 2 (BitsScalar BitsPointer) insArrayEnd insEnd
The total size of the program is 17 bytes (13 bytes on 32-bit). The corresponding mask would be 43 bytes (the 43-word mask is repeated twice because the type has an odd number of words).
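To sketch how such a program expands, the following interpreter unrolls a program into a flat sequence of 2-bit values, one per word. The opcode values, the operand endianness, and the function names are assumptions; only the structure follows the grammar above:

// Hypothetical opcode values.
const (
	insEnd = iota
	insData
	insArray
	insArrayEnd
)

// unrollProg expands a GC program into one 2-bit type value per word.
func unrollProg(prog []byte, ptrSize int) []byte {
	var out []byte
	unrollBlocks(prog, ptrSize, &out)
	return out
}

// unrollBlocks interprets blocks until insEnd/insArrayEnd and returns the
// number of program bytes consumed, including the terminating instruction.
func unrollBlocks(prog []byte, ptrSize int, out *[]byte) int {
	i := 0
	for {
		switch prog[i] {
		case insEnd, insArrayEnd:
			return i + 1
		case insData:
			n := int(prog[i+1]) // DataSize: number of bit pairs
			data := prog[i+2:]
			for j := 0; j < n; j++ {
				*out = append(*out, (data[j/4]>>uint(j%4*2))&3) // dense mask: 4 pairs per byte
			}
			i += 2 + (n+3)/4
		case insArray:
			count := 0
			for j := 0; j < ptrSize; j++ { // ArrayLen: ptrSize bytes, assumed little-endian
				count |= int(prog[i+1+j]) << uint(8*j)
			}
			i += 1 + ptrSize
			var elem []byte
			i += unrollBlocks(prog[i:], ptrSize, &elem) // unroll the element block once...
			for k := 0; k < count; k++ {
				*out = append(*out, elem...) // ...then repeat it ArrayLen times
			}
		}
	}
}

For the example program above this yields 3 + 20*2 = 43 values, one per word of the type.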
The Type object contains 2 words for storing GC info:
struct Type {
	...
	uintptr gc[2];
	...
};
If the mask fits into these 2 words, the compiler generates the mask and stores it directly in Type.gc (no indirection). If the mask fits into 8K, gc[0] points to a symbol in BSS large enough to store the mask plus 1 byte, and gc[1] points to the program in RODATA. Otherwise (mask > 8K), gc[1] points to the program in RODATA and gc[0] is nil (the runtime unrolls the program directly into the GC bitmap).
A new KindGCProg flag in Type.kind determines whether the type uses the program format or the mask format.
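The resulting layout choice can be summarized as follows. The thresholds are the ones stated above; the function and constant names are made up for illustration:

const (
	ptrSize        = 8           // 64-bit
	maxInlineMask  = 2 * ptrSize // sparse mask stored directly in Type.gc
	maxUnrollBytes = 8 * 1024    // lazily unrolled into a BSS buffer
)

// gcInfoLayout describes where the GC info for a type with the given sparse
// mask size ends up; a sketch of the decision, not the actual compiler code.
func gcInfoLayout(maskBytes int) string {
	switch {
	case maskBytes <= maxInlineMask:
		return "mask stored inline in Type.gc, KindGCProg not set"
	case maskBytes <= maxUnrollBytes:
		return "gc[0] -> BSS buffer (mask + 1 byte), gc[1] -> program in RODATA, KindGCProg set"
	default:
		return "gc[0] = nil, gc[1] -> program in RODATA, KindGCProg set (unrolled straight into the GC bitmap)"
	}
}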
The linker generates the type info for the data/bss sections; that info is always in the program format.
At runtime the type info is stored in the GC bitmap. The GC bitmap stays at 4 bits per word of heap memory, but the meaning of the bits changes. The low 2 bits are:
00 - bitMiddle, middle of a heap object
01 - bitBoundary, beginning of a non-allocated object
10 - bitAllocated, beginning of an allocated object
11 - bitAllocated, beginning of an allocated and marked object
The high 2 bits store the type info which describes the word of heap memory.
BitsDead set for the first word of an object means a "noscan" object; elsewhere BitsDead means 'stop scanning here' (end of object).
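A minimal sketch of decoding one bitmap nibble under this layout. The split into low/high bit pairs follows the description above; the concrete constant values and the name of the marked state are assumptions:

const (
	bitMiddle    = 0 // 00: middle of a heap object
	bitBoundary  = 1 // 01: beginning of a non-allocated object
	bitAllocated = 2 // 10: beginning of an allocated object
	bitMarked    = 3 // 11: beginning of an allocated and marked object

	BitsDead = 0 // in the first word: noscan object; elsewhere: end of object
)

// decodeNibble splits the 4 bitmap bits of one heap word into the
// allocation-state pair (low 2 bits) and the type-info pair (high 2 bits).
func decodeNibble(n byte) (state, typeBits byte) {
	return n & 3, (n >> 2) & 3
}

// noscan reports whether the object whose first word has the given nibble
// needs no scanning at all.
func noscan(firstWordNibble byte) bool {
	_, typeBits := decodeNibble(firstWordNibble)
	return typeBits == BitsDead
}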
Per-span type info goes away.
During scanning, the GC treats words of heap memory according to the type info stored in the GC bitmap.
In most cases the GC can find the beginning of an object by looking at the GC bitmap: anything other than bitMiddle marks the beginning of an object. The GC can also find the end of an object while scanning the GC bitmap. This means that in most cases the GC does not need to touch any memory other than the GC bitmap.
Malloc stores the type info for the object in the GC bitmap.
For noscan objects it merely sets bitAllocated|BitsDead for the first word of the object.
For objects that require scanning, it copies type info from the Type object into GC bitmap.
If the allocated type has the KindGCProg flag, malloc first checks whether the program has already been unrolled into mask format. If not, it unrolls the program in Type.gc[1] into the buffer pointed to by Type.gc[0] and then uses that mask. The 'unrolled' flag is stored in Type.gc[0][0] (the first byte of the mask buffer).
If the unrolled mask would occupy more than 8K, malloc unrolls the program directly into the GC bitmap. This is required to save memory for very large types.
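Putting the malloc side together, here is a sketch of that decision path. The type descriptor fields are hypothetical stand-ins; the real code writes bitmap nibbles directly rather than returning a description:

// typeInfo is a stand-in for the relevant parts of the runtime Type.
type typeInfo struct {
	noscan   bool
	isGCProg bool   // KindGCProg flag
	maskSize int    // size of the (unrolled) sparse mask in bytes
	mask     []byte // inline mask, used when !isGCProg
	prog     []byte // GC program, used when isGCProg
	unroll   []byte // BSS buffer for the unrolled mask (mask + 1 byte)
}

const maxUnrollBytes = 8 * 1024

// markAllocatedAction describes what malloc does for a new object of type t.
func markAllocatedAction(t *typeInfo) string {
	switch {
	case t.noscan:
		return "set bitAllocated|BitsDead for the first word of the object"
	case !t.isGCProg:
		return "copy the mask from Type.gc into the GC bitmap"
	case t.maskSize > maxUnrollBytes:
		return "unroll the program directly into the GC bitmap"
	default:
		if t.unroll[0] == 0 { // the first byte of the buffer doubles as the 'unrolled' flag
			// unroll t.prog into the buffer once, then mark it unrolled
			t.unroll[0] = 1
		}
		return "copy the unrolled mask from the BSS buffer into the GC bitmap"
	}
}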
This change has been implemented in https://codereview.appspot.com/106260045.
The change leads to a massive code size reduction: -1680 LOC total, more than -1000 LOC in mgc0.c alone. This is partially due to the simpler type info and the code that processes it, and partially due to code sharing between stack and heap type info (both generation in the compiler and usage in the GC and heapdump). An additional effect of the "type-punned" type info (it does not encode the pointee type) is that, for example, all Hmaps have the same type info; this significantly simplifies type info generation in the reflect package (in most cases a prototype type info is OK).
Subjectively, GC code is considerably simpler with this change.
This change is a strict win with regard to memory consumption: it eliminates the per-span type info and does not add any new data structures. The upper bound for the memory consumption reduction is 50% (for a span of 8-byte objects with at least 8 different types).
For real programs the observed memory consumption reduction is between 0% and 30%. For programs that store most of their data in large byte slices or strings the difference is closer to 0%; for programs that create lots of small objects it is closer to 30%.
See details in the Appendix below.
The GC scanning and marking phase is 10-20% faster for programs with a large number of small objects.
For programs that store most of their data in large byte slices or strings the time is unaffected (but scanning is fast for such programs anyway).
See details in the Appendix below.
Binaries generally get smaller because the new type info is more compact: the go.benchmarks binary is 6.7% smaller; the test/bench/go1 binary is 6.3% smaller; the go command binary is 3.2% smaller; the godoc binary is 3.7% smaller.
The heap dumper becomes less useful because it no longer knows object types. But it does not seem reasonable to pay 30% of memory for this very rarely used feature.
One possible solution is to augment the memory profiler (memprof) info with type info; heapdump can use that info for sampled objects. A user can then set the memprof rate to 1 and get very detailed info (type+stack) for all objects (currently we don't store type info for all allocations).
Another option is to use DWARF debug info to trace object types starting from root objects (globals and stacks).
This change already allowed adding memory prefetching to the scanning loop (implemented in the referenced CL), something I was afraid to do in the current code. It also opens the road for other changes:
1. 1-bit-per-word type info for stacks, heap, and globals. This will further reduce binary size and memory consumption and increase runtime speed.
2. Concurrent GC [presumably] will need 1-bit-per-word type info, as it won't be able to atomically look at complex multi-word objects in the presence of concurrent mutators.
3. Once we have 1-bit type info, it is trivial to split data/bss into smaller chunks for better parallelization. It may even be possible to split scanning of large heap objects (hashmaps and arrays) for better parallelization.
4. Switch to a byte-based GC bitmap (instead of word-based). This will allow non-atomic operations to be used for marking objects larger than 1 word, and it will also simplify and speed up markallocated.
On the go.benchmarks garbage benchmark (columns: metric, old, new, delta):
garbage-1
allocated 3060956 2965058 -3.13%
allocs 57729 57624 -0.18%
cputime 16178480 16739958 +3.47%
gc-pause-one 133386156 113134986 -15.18%
gc-pause-total 2934495 3394049 +15.66%
rss 300072960 218271744 -27.26%
sys-gc 17670144 11980800 -32.20%
sys-heap 265945088 191496192 -27.99%
sys-other 10015392 8266216 -17.46%
sys-stack 9043968 8912896 -1.45%
sys-total 302674592 220656104 -27.10%
time 16188036 16743835 +3.43%
virtual-mem 491409408 407719936 -17.03%
garbage-2
allocated 3073933 2964383 -3.56%
allocs 57739 57623 -0.20%
cputime 17588468 17919136 +1.88%
gc-pause-one 79779842 75689054 -5.13%
gc-pause-total 1834936 2346360 +27.87%
rss 310652928 227540992 -26.75%
sys-gc 18128896 12316672 -32.06%
sys-heap 273285120 196739072 -28.01%
sys-other 10285144 8522624 -17.14%
sys-stack 9306112 9306112 +0.00%
sys-total 311005272 226884480 -27.05%
time 8881479 9263811 +4.30%
virtual-mem 574124032 423456768 -26.24%
garbage-4
allocated 3071034 2964380 -3.47%
allocs 57735 57623 -0.19%
cputime 18395208 19084593 +3.75%
gc-pause-one 42328695 39775705 -6.03%
gc-pause-total 931231 1153495 +23.87%
rss 325066752 241741824 -25.63%
sys-gc 18980864 13180928 -30.56%
sys-heap 286916608 210370560 -26.68%
sys-other 10613232 8787872 -17.20%
sys-stack 10092544 10223616 +1.30%
sys-total 326603248 242562976 -25.73%
time 4608702 4830480 +4.81%
virtual-mem 547790848 531423232 -2.99%
garbage-8
allocated 3048763 2957203 -3.00%
allocs 57718 57623 -0.16%
cputime 19151198 19957434 +4.21%
gc-pause-one 24585584 22530624 -8.36%
gc-pause-total 491711 608326 +23.72%
rss 356458496 262660096 -26.31%
sys-gc 20750336 14323712 -30.97%
sys-heap 315228160 228196352 -27.61%
sys-other 11164304 9297984 -16.72%
sys-stack 12451840 12320768 -1.05%
sys-total 359594640 264138816 -26.55%
time 2422566 2531798 +4.51%
virtual-mem 707973120 645025792 -8.89%
Since the heap size is reduced, GCs happen more frequently, which increases total GC time. It's possible to trade the memory back for performance. Below is a comparison of old vs new with GOGC=165 (which brings memory consumption roughly back to the old level).
garbage-1
allocated 3052052 2965030 -2.85%
allocs 57720 57624 -0.17%
cputime 16621510 15395976 -7.37%
gc-pause-one 136762062 112110308 -18.03%
gc-pause-total 3008765 2017985 -32.93%
rss 299483136 285331456 -4.73%
sys-gc 17670144 15785984 -10.66%
sys-heap 265945088 252313600 -5.13%
sys-other 10015504 9198984 -8.15%
sys-stack 8912896 9043968 +1.47%
sys-total 302543632 286342536 -5.35%
time 16621385 15404268 -7.32%
virtual-mem 557273088 474525696 -14.85%
garbage-2
allocated 3073702 2964237 -3.56%
allocs 57738 57623 -0.20%
cputime 17307008 15874992 -8.27%
gc-pause-one 86934156 85146756 -2.06%
gc-pause-total 1999485 1532641 -23.35%
rss 309473280 295632896 -4.47%
sys-gc 18063360 16244736 -10.07%
sys-heap 272236544 259653632 -4.62%
sys-other 10280824 9464032 -7.94%
sys-stack 9437184 9175040 -2.78%
sys-total 310017912 294537440 -4.99%
time 8936413 8389483 -6.12%
virtual-mem 574255104 491114496 -14.48%
garbage-4
allocated 3065954 2956425 -3.57%
allocs 57731 57621 -0.19%
cputime 18431932 17124132 -7.10%
gc-pause-one 44980995 39581480 -12.00%
gc-pause-total 989581 692675 -30.00%
rss 321982464 314998784 -2.17%
sys-gc 18784256 17375232 -7.50%
sys-heap 283770880 277479424 -2.22%
sys-other 10346984 9989656 -3.45%
sys-stack 10223616 9961472 -2.56%
sys-total 323125736 314805784 -2.57%
time 4685893 4336762 -7.45%
virtual-mem 737247232 595279872 -19.26%
garbage-8
allocated 3041382 2956830 -2.78%
allocs 57712 57622 -0.16%
cputime 19078905 18206227 -4.57%
gc-pause-one 25034791 22108232 -11.69%
gc-pause-total 500695 364785 -27.14%
rss 357806080 345243648 -3.51%
sys-gc 20750336 19107840 -7.92%
sys-heap 315228160 304742400 -3.33%
sys-other 11163856 10510576 -5.85%
sys-stack 12451840 12189696 -2.11%
sys-total 359594192 346550512 -3.63%
time 2419061 2291795 -5.26%
virtual-mem 808402944 727433216 -10.02%
test/bench/go1 benchmarks:
benchmark old ns/op new ns/op delta
BenchmarkBinaryTree17 3882774233 3544278312 -8.72%
BenchmarkFannkuch11 2760029132 2618294904 -5.14%
BenchmarkFmtFprintfEmpty 71.6 72.7 +1.54%
BenchmarkFmtFprintfString 231 219 -5.19%
BenchmarkFmtFprintfInt 186 164 -11.83%
BenchmarkFmtFprintfIntInt 256 257 +0.39%
BenchmarkFmtFprintfPrefixedInt 257 242 -5.84%
BenchmarkFmtFprintfFloat 460 354 -23.04%
BenchmarkFmtManyArgs 1364 1072 -21.41%
BenchmarkGobDecode 16333122 13887405 -14.97%
BenchmarkGobEncode 10785547 9556318 -11.40%
BenchmarkGzip 438134110 406536964 -7.21%
BenchmarkGunzip 113861506 104544186 -8.18%
BenchmarkHTTPClientServer 92163 89593 -2.79%
BenchmarkJSONEncode 21279690 20524117 -3.55%
BenchmarkJSONDecode 81470709 71832238 -11.83%
BenchmarkMandelbrot200 5294830 4101358 -22.54%
BenchmarkGoParse 5982219 5655285 -5.47%
BenchmarkRegexpMatchEasy0_32 155 152 -1.94%
BenchmarkRegexpMatchEasy0_1K 303 294 -2.97%
BenchmarkRegexpMatchEasy1_32 144 149 +3.47%
BenchmarkRegexpMatchEasy1_1K 798 826 +3.51%
BenchmarkRegexpMatchMedium_32 232 247 +6.47%
BenchmarkRegexpMatchMedium_1K 63225 75730 +19.78%
BenchmarkRegexpMatchHard_32 3701 4377 +18.27%
BenchmarkRegexpMatchHard_1K 143579 136470 -4.95%
BenchmarkRevcomp 1059133929 697414142 -34.15%
BenchmarkTemplate 132634561 115347401 -13.03%
BenchmarkTimeParse 389 385 -1.03%
BenchmarkTimeFormat 397 394 -0.76%
benchmark old MB/s new MB/s speedup
BenchmarkGobDecode 46.99 55.27 1.18x
BenchmarkGobEncode 71.16 80.32 1.13x
BenchmarkGzip 44.29 47.73 1.08x
BenchmarkGunzip 170.42 185.61 1.09x
BenchmarkJSONEncode 91.19 94.55 1.04x
BenchmarkJSONDecode 23.82 27.01 1.13x
BenchmarkGoParse 9.68 10.24 1.06x
BenchmarkRegexpMatchEasy0_32 205.66 209.52 1.02x
BenchmarkRegexpMatchEasy0_1K 2485.81 2260.24 0.91x
BenchmarkRegexpMatchEasy1_32 222.08 214.21 0.96x
BenchmarkRegexpMatchEasy1_1K 930.48 798.08 0.86x
BenchmarkRegexpMatchMedium_32 4.31 4.05 0.94x
BenchmarkRegexpMatchMedium_1K 15.74 11.53 0.73x
BenchmarkRegexpMatchHard_32 8.65 7.31 0.85x
BenchmarkRegexpMatchHard_1K 7.13 7.50 1.05x
BenchmarkRevcomp 239.98 364.44 1.52x
BenchmarkTemplate 14.63 16.82 1.15x
The +18.27% degradation of BenchmarkRegexpMatchHard_32 looks like a phantom code-movement issue: the profile does not include anything GC/malloc-related, but the numbers for unrelated functions are slightly different:
before:
43.73% go1.test go1.test [.] regexp.(*machine).add
17.52% go1.test go1.test [.] regexp.(*machine).step
7.98% go1.test go1.test [.] _/ssd/src/go1/test/bench/go1.fastaRandom
7.49% go1.test go1.test [.] regexp.(*machine).match
3.93% go1.test go1.test [.] regexp/syntax.(*Inst).MatchRunePos
3.57% go1.test go1.test [.] regexp/syntax.EmptyOpContext
2.63% go1.test go1.test [.] regexp.(*inputBytes).step
1.94% go1.test go1.test [.] regexp.(*machine).alloc
1.32% go1.test go1.test [.] compress/flate.(*compressor).findMatch
1.19% go1.test go1.test [.] regexp/syntax.(*Inst).MatchRune
1.18% go1.test go1.test [.] compress/flate.(*compressor).deflate
0.48% go1.test go1.test [.] hash/crc32.update
0.40% go1.test go1.test [.] sync/atomic.AddUint32
0.31% go1.test go1.test [.] sync/atomic.CompareAndSwapUint32
after:
46.94% go1.test go1.test [.] regexp.(*machine).add
16.17% go1.test go1.test [.] regexp.(*machine).step
7.16% go1.test go1.test [.] regexp.(*machine).match
7.05% go1.test go1.test [.] _/ssd/src/go2/test/bench/go1.fastaRandom
6.04% go1.test go1.test [.] regexp/syntax.EmptyOpContext
3.38% go1.test go1.test [.] regexp/syntax.(*Inst).MatchRunePos
2.14% go1.test go1.test [.] regexp.(*machine).alloc
2.02% go1.test go1.test [.] regexp.(*inputBytes).step
1.27% go1.test go1.test [.] compress/flate.(*compressor).deflate
1.06% go1.test go1.test [.] compress/flate.(*compressor).findMatch
0.87% go1.test go1.test [.] regexp/syntax.(*Inst).MatchRune
0.42% go1.test go1.test [.] hash/crc32.update
0.32% go1.test go1.test [.] sync/atomic.AddUint32
0.29% go1.test go1.test [.] sync/atomic.CompareAndSwapUint32
RSS for BenchmarkGoParse stabilizes at 700 MB before and 584 MB after (-16.6%).