Simpler and faster GC for Go

Dmitry Vyukov, dvyukov@

July 16, 2014

Problem

Current GC code is very complex and difficult to modify, this hampers future GC evolution and application of spot optimizations. GC consumes significant amount of memory for internal data structures (namely, for type info). GC is slow. See Evaluation section for numbers.

Proposed Solution

Type Info

Replace the current program-based GC type info with a more compact and simpler bit-mask-based type info. Type info uses the same format as stacks currently use; namely, 2 bits per word:

00 - BitsDead, as opposed to stacks in heap this means 'end of object'

01 - BitsScalar, not a pointer, ignore in GC

10 - BitsPointer, pointer

11 - BitsMultiWord, a multi-word object, changes meaning of the next pair of bits to:

       00 - BitsString, string object

       01 - BitsSlice, slice object

       10 - BitsIface, iface object

       11 - BitsEface, eface object

Compiler generates type info in "sparse" format: 4 bits per word, only 2 high bits of every 4 bits are used. This is necessary for malloc, so that it can directly copy the info into GC bitmap (see below). Compiler also repeats twice type info for objects with odd number of words; this makes any type info integral number of bytes, which simplifies and makes faster malloc handling of arrays.

However, this format has a problem with very large types -- size of the type info is proportional to the type size, this can significantly bloat binaries. For example, for type struct { x [1<<30]byte; y *byte; } GC info size is 64MB. So the compiler generates a different format for large types for storing in binary -- GC program. If type info mask fits into 2 words, then compiler generates the mask. Otherwise it generates the program.

The program grammar follows:

Program = {Block} "insEnd"

Block = Data | Array

Data = "insData" DataSize DataBlock

DataSize = int // size of the DataBlock in bit pairs, 1 byte

DataBlock = binary // dense GC mask of size ]DataSize/4[ bytes

Array = "insArray" ArrayLen Block "insArrayEnd"

ArrayLen = int // length of the array, 8 bytes (4 bytes for 32-bit arch)

Each instruction (insData, insArray, etc) is 1 byte.

For example, for type struct { x []byte; y [20]struct{ z int; w *byte }; } the program looks as:

insData 3 (BitsMultiWord BitsSlice BitsScalar) insArray 20 insData 2 (BitsScalar BitsPointer) insArrayEnd insEnd

Total size of the program is 17 bytes (13 bytes on 32-bits). The corresponding mask would be 43 bytes (it is repeated twice because the type has odd number of words).

Type object contains 2 words for storing of GC info:

struct Type {

        ...

        uintptr gc[2];

        ...

};

If the mask fits into these 2 words, then compiler generates mask and stores it directly into Type.gc (no indirection). If the mask fits into 8K, gc[0] points to a symbol in BSS large enough to store the mask + 1 byte; and gc[1] points to the program in RODATA. Otherwise (mask > 8K), gc[1] points to the program in RODATA and gc[0] is nil (runtime will unroll the program directly into GC bitmap).

A new KindGCProg flag in Type.kind determines whether the type uses program format or mask format.

Linker generates the type info for data/bss sections. That info is always in the program format.

GC

At runtime the type info is stored in the GC bitmap. GC bitmap stays 4 bits per word of heap memory, but meaning of the bits changes. Low 2 bits:

00 - bitMiddle, middle of a heap object

01 - bitBoundary, beginning of a non-allocated object

10 - bitAllocated, beginning of an allocated object

11 - bitAllocated, beginning of an allocated and marked object

The high 2 bits store the type info which describes the word of heap memory.

BitsDead set for the first word of the object means "noscan" object. Otherwise BitsDead means 'stop scan here' (end of object).

Per-span type info goes away.

During scanning GC treats words of heap memory according to type info stored in the GC bitmap.

In most cases GC can find beginning of the object by looking at the GC bitmap, anything other than bitMiddle means beginning of an object. GC can also find end of the object during scanning of the GC bitmap. This means that in most cases GC does not need to touch any memory other than the GC bitmap.

Malloc

Malloc stores type info the object in the GC bitmap.

For noscan objects it merely sets bitAllocated|BitsDead for the first word of the object.

For objects that require scanning, it copies type info from the Type object into GC bitmap.

If the allocated type has KindGCProg flag, then malloc first checks whether the program is unrolled into mask format. If not, it unrolls the program in Type.gc[1] into Type.gc[0] and then uses that mask. The 'unrolled' flag is stored in Type.gc[0][0] (first byte of the mask).

If the unrolled mask occupies more than 8K, then malloc unrolls the program directly into GC bitmap. This is required to save memory for very large types.

Evaluation

This change has been implemented in https://codereview.appspot.com/106260045.

Code Size/Complexity

The change leads to a massive code size reduction: -1680 LOC total, -1000+ LOC in mgc0.c only. Partially this is due to simpler type info and code that processes it; partially this is due to code sharing between stacks and heap type info (both generation in the compiler and usage in GC and heapdump). Additional effect of "type punned" type info (it does not encode pointee type) is that all e.g. Hmap's have the same type; this significantly simplifies generation of type info in reflect package (in most cases prototype type info is OK).

Subjectively, GC code is considerably simpler with this change.

Memory consumption

This change is a strict win with regard to memory consumption. It eliminates per-span type info and does not add any new data structures. The upper bound for memory consumption reduction is 50% (for 8-byte span with at least 8 different types).

For real programs observed values of memory consumption reduction are between 0% and 30%. Programs that store most of the data in large byte slices or strings the difference is closer to 0%. For programs that create lots of small objects the difference is closer to 30%.

See details in the Appendix below.

GC Time

GC scanning and marking phase is 10-20% faster for programs with large number of small objects.

For programs that store most of the data in large byte slices or strings the time is unaffected (but scanning is fast for such programs anyway).

See details in the Appendix below.

Binary Size

Binaries generally get smaller because the new type info is more compact. go.benchmarks binary is 6.7% smaller; test/bench/go1 binary is 6.3% smaller; go command binary is 3.2% smaller; godoc binary is 3.7% smaller.

Heap Dumper

Heap dumper becomes less useful as it does not know types for objects anymore. But it does not seem to be reasonable to pay 30% of memory for the very rarely used feature.

One possible solution is to augment memory profiler (memprof) info with type info. Heapdump can use that info for sampled objects. Then a user can set memprof rate to 1 and get very detailed info (type+stack) for all objects (currently we don't store type info for all allocations).

Another option is to use DWARF debug info to trace object types starting from root objects (globals and stacks).

Future Possible Changes

This change already allowed to add memory prefetching to the scanning loop (implemented in the referenced CL), something that I was afraid doing in the current code. But it also opens road for other changes:

1. 1-bit per word type info for both stacks, heap and global. It will further reduce size of binaries and memory consumption and increase runtime speed.

2. Concurrent GC [presumably] will need 1-bit per word type info, as it won't be able to atomically look at complex multi-word objects in presence of concurrent mutators.

3. Once we have 1-bit type info, it is trivial to split data/bss into smaller chunks for better parallelization. It may even be possible to split scanning of large heap objects (hashmaps and arrays) for better parallelization.

4. Switch to byte-based GC bitmap (instead of word-based). This will allow to use non-atomic operations for marking of objects larger than 1 word. And also will simplify and speed up markallocated.

Appendix. Benchmark Results

On go.benchmarks garbage benchmark:

garbage-1
allocated                 3060956      2965058      -3.13%
allocs                      57729        57624      -0.18%
cputime                  16178480     16739958      +3.47%
gc-pause-one            133386156    113134986     -15.18%
gc-pause-total            2934495      3394049     +15.66%
rss                     300072960    218271744     -27.26%
sys-gc                   17670144     11980800     -32.20%
sys-heap                265945088    191496192     -27.99%
sys-other                10015392      8266216     -17.46%
sys-stack                 9043968      8912896      -1.45%
sys-total               302674592    220656104     -27.10%
time                     16188036     16743835      +3.43%
virtual-mem             491409408    407719936     -17.03%

garbage-2
allocated                 3073933      2964383      -3.56%
allocs                      57739        57623      -0.20%
cputime                  17588468     17919136      +1.88%
gc-pause-one             79779842     75689054      -5.13%
gc-pause-total            1834936      2346360     +27.87%
rss                     310652928    227540992     -26.75%
sys-gc                   18128896     12316672     -32.06%
sys-heap                273285120    196739072     -28.01%
sys-other                10285144      8522624     -17.14%
sys-stack                 9306112      9306112      +0.00%
sys-total               311005272    226884480     -27.05%
time                      8881479      9263811      +4.30%
virtual-mem             574124032    423456768     -26.24%

garbage-4
allocated                 3071034      2964380      -3.47%
allocs                      57735        57623      -0.19%
cputime                  18395208     19084593      +3.75%
gc-pause-one             42328695     39775705      -6.03%
gc-pause-total             931231      1153495     +23.87%
rss                     325066752    241741824     -25.63%
sys-gc                   18980864     13180928     -30.56%
sys-heap                286916608    210370560     -26.68%
sys-other                10613232      8787872     -17.20%
sys-stack                10092544     10223616      +1.30%
sys-total               326603248    242562976     -25.73%
time                      4608702      4830480      +4.81%
virtual-mem             547790848    531423232      -2.99%

garbage-8
allocated                 3048763      2957203      -3.00%
allocs                      57718        57623      -0.16%
cputime                  19151198     19957434      +4.21%
gc-pause-one             24585584     22530624      -8.36%
gc-pause-total             491711       608326     +23.72%
rss                     356458496    262660096     -26.31%
sys-gc                   20750336     14323712     -30.97%
sys-heap                315228160    228196352     -27.61%
sys-other                11164304      9297984     -16.72%
sys-stack                12451840     12320768      -1.05%
sys-total               359594640    264138816     -26.55%
time                      2422566      2531798      +4.51%
virtual-mem             707973120    645025792      -8.89%

Since heap size is reduced GCs happen more frequently, this increases total GC time. It's possible to trade memory back for performance. Below is comparison of old vs new+GOGC=165 (which brings memory consumption roughly to the old level).

garbage-1
allocated                 3052052      2965030      -2.85%
allocs                      57720        57624      -0.17%
cputime                  16621510     15395976      -7.37%
gc-pause-one            136762062    112110308     -18.03%
gc-pause-total            3008765      2017985     -32.93%
rss                     299483136    285331456      -4.73%
sys-gc                   17670144     15785984     -10.66%
sys-heap                265945088    252313600      -5.13%
sys-other                10015504      9198984      -8.15%
sys-stack                 8912896      9043968      +1.47%
sys-total               302543632    286342536      -5.35%
time                     16621385     15404268      -7.32%
virtual-mem             557273088    474525696     -14.85%

garbage-2
allocated                 3073702      2964237      -3.56%
allocs                      57738        57623      -0.20%
cputime                  17307008     15874992      -8.27%
gc-pause-one             86934156     85146756      -2.06%
gc-pause-total            1999485      1532641     -23.35%
rss                     309473280    295632896      -4.47%
sys-gc                   18063360     16244736     -10.07%
sys-heap                272236544    259653632      -4.62%
sys-other                10280824      9464032      -7.94%
sys-stack                 9437184      9175040      -2.78%
sys-total               310017912    294537440      -4.99%
time                      8936413      8389483      -6.12%
virtual-mem             574255104    491114496     -14.48%

garbage-4
allocated                 3065954      2956425      -3.57%
allocs                      57731        57621      -0.19%
cputime                  18431932     17124132      -7.10%
gc-pause-one             44980995     39581480     -12.00%
gc-pause-total             989581       692675     -30.00%
rss                     321982464    314998784      -2.17%
sys-gc                   18784256     17375232      -7.50%
sys-heap                283770880    277479424      -2.22%
sys-other                10346984      9989656      -3.45%
sys-stack                10223616      9961472      -2.56%
sys-total               323125736    314805784      -2.57%
time                      4685893      4336762      -7.45%
virtual-mem             737247232    595279872     -19.26%

garbage-8
allocated                 3041382      2956830      -2.78%
allocs                      57712        57622      -0.16%
cputime                  19078905     18206227      -4.57%
gc-pause-one             25034791     22108232     -11.69%
gc-pause-total             500695       364785     -27.14%
rss                     357806080    345243648      -3.51%
sys-gc                   20750336     19107840      -7.92%
sys-heap                315228160    304742400      -3.33%
sys-other                11163856     10510576      -5.85%
sys-stack                12451840     12189696      -2.11%
sys-total               359594192    346550512      -3.63%
time                      2419061      2291795      -5.26%
virtual-mem             808402944    727433216     -10.02%

test/bench/go1 benchmarks:

benchmark                          old ns/op      new ns/op      delta      
BenchmarkBinaryTree17              3882774233     3544278312     -8.72%      
BenchmarkFannkuch11                2760029132     2618294904     -5.14%      
BenchmarkFmtFprintfEmpty           71.6           72.7           +1.54%      
BenchmarkFmtFprintfString          231            219            -5.19%      
BenchmarkFmtFprintfInt             186            164            -11.83%    
BenchmarkFmtFprintfIntInt          256            257            +0.39%      
BenchmarkFmtFprintfPrefixedInt     257            242            -5.84%      
BenchmarkFmtFprintfFloat           460            354            -23.04%    
BenchmarkFmtManyArgs               1364           1072           -21.41%    
BenchmarkGobDecode                 16333122       13887405       -14.97%    
BenchmarkGobEncode                 10785547       9556318        -11.40%    
BenchmarkGzip                      438134110      406536964      -7.21%      
BenchmarkGunzip                    113861506      104544186      -8.18%      
BenchmarkHTTPClientServer          92163          89593          -2.79%      
BenchmarkJSONEncode                21279690       20524117       -3.55%      
BenchmarkJSONDecode                81470709       71832238       -11.83%    
BenchmarkMandelbrot200             5294830        4101358        -22.54%    
BenchmarkGoParse                   5982219        5655285        -5.47%      
BenchmarkRegexpMatchEasy0_32       155            152            -1.94%      
BenchmarkRegexpMatchEasy0_1K       303            294            -2.97%      
BenchmarkRegexpMatchEasy1_32       144            149            +3.47%      
BenchmarkRegexpMatchEasy1_1K       798            826            +3.51%      
BenchmarkRegexpMatchMedium_32      232            247            +6.47%      
BenchmarkRegexpMatchMedium_1K      63225          75730          +19.78%    
BenchmarkRegexpMatchHard_32        3701           4377           +18.27%    
BenchmarkRegexpMatchHard_1K        143579         136470         -4.95%      
BenchmarkRevcomp                   1059133929     697414142      -34.15%    
BenchmarkTemplate                  132634561      115347401      -13.03%    
BenchmarkTimeParse                 389            385            -1.03%      
BenchmarkTimeFormat                397            394            -0.76%      

benchmark                         old MB/s     new MB/s     speedup    
BenchmarkGobDecode                46.99        55.27        1.18x      
BenchmarkGobEncode                71.16        80.32        1.13x      
BenchmarkGzip                     44.29        47.73        1.08x      
BenchmarkGunzip                   170.42       185.61       1.09x      
BenchmarkJSONEncode               91.19        94.55        1.04x      
BenchmarkJSONDecode               23.82        27.01        1.13x      
BenchmarkGoParse                  9.68         10.24        1.06x      
BenchmarkRegexpMatchEasy0_32      205.66       209.52       1.02x      
BenchmarkRegexpMatchEasy0_1K      2485.81      2260.24      0.91x      
BenchmarkRegexpMatchEasy1_32      222.08       214.21       0.96x      
BenchmarkRegexpMatchEasy1_1K      930.48       798.08       0.86x      
BenchmarkRegexpMatchMedium_32     4.31         4.05         0.94x      
BenchmarkRegexpMatchMedium_1K     15.74        11.53        0.73x      
BenchmarkRegexpMatchHard_32       8.65         7.31         0.85x      
BenchmarkRegexpMatchHard_1K       7.13         7.50         1.05x      
BenchmarkRevcomp                  239.98       364.44       1.52x      
BenchmarkTemplate                 14.63        16.82        1.15x    

BenchmarkRegexpMatchHard_32 with +18.27% degradation looks like phantom code
movement issue, profile do not include anything GC/malloc-related, but the
numbers for unrelated functions are slightly different:
before:
43.73%  go1.test  go1.test           [.] regexp.(*machine).add
17.52%  go1.test  go1.test           [.] regexp.(*machine).step
 7.98%  go1.test  go1.test           [.]
_/ssd/src/go1/test/bench/go1.fastaRandom
 7.49%  go1.test  go1.test           [.] regexp.(*machine).match
 3.93%  go1.test  go1.test           [.] regexp/syntax.(*Inst).MatchRunePos
 3.57%  go1.test  go1.test           [.] regexp/syntax.EmptyOpContext
 2.63%  go1.test  go1.test           [.] regexp.(*inputBytes).step
 1.94%  go1.test  go1.test           [.] regexp.(*machine).alloc
 1.32%  go1.test  go1.test           [.] compress/flate.(*compressor).findMatch
 1.19%  go1.test  go1.test           [.] regexp/syntax.(*Inst).MatchRune
 1.18%  go1.test  go1.test           [.] compress/flate.(*compressor).deflate
 0.48%  go1.test  go1.test           [.] hash/crc32.update
 0.40%  go1.test  go1.test           [.] sync/atomic.AddUint32
 0.31%  go1.test  go1.test           [.] sync/atomic.CompareAndSwapUint32
after:
46.94%  go1.test  go1.test           [.] regexp.(*machine).add
16.17%  go1.test  go1.test           [.] regexp.(*machine).step
 7.16%  go1.test  go1.test           [.] regexp.(*machine).match
 7.05%  go1.test  go1.test           [.]
_/ssd/src/go2/test/bench/go1.fastaRandom
 6.04%  go1.test  go1.test           [.] regexp/syntax.EmptyOpContext
 3.38%  go1.test  go1.test           [.] regexp/syntax.(*Inst).MatchRunePos
 2.14%  go1.test  go1.test           [.] regexp.(*machine).alloc
 2.02%  go1.test  go1.test           [.] regexp.(*inputBytes).step
 1.27%  go1.test  go1.test           [.] compress/flate.(*compressor).deflate
 1.06%  go1.test  go1.test           [.] compress/flate.(*compressor).findMatch
 0.87%  go1.test  go1.test           [.] regexp/syntax.(*Inst).MatchRune
 0.42%  go1.test  go1.test           [.] hash/crc32.update
 0.32%  go1.test  go1.test           [.] sync/atomic.AddUint32
 0.29%  go1.test  go1.test           [.] sync/atomic.CompareAndSwapUint32

RSS for BenchmarkGoParse stabilizes at 700m before and 584m after (-16.6%).