Contiguous stacks

Allocate each Go routine a contiguous piece of memory for its stack, grown by reallocation/copy when it fills up.

Why?

With contiguous stacks, we avoid both of these problems.  Stacks are grown to the size required, and no more work is needed from then on (modulo shrinking, see below).

How?

How can you move a stack?  It turns out that our compiler’s escape analysis provides a very important invariant: pointers to data on the stack are only ever passed down in the call tree.  Any other escape (written to global variables, returned to a parent, written to the heap, etc.) prevents that data from being allocated on the stack.  This invariant means that the only pointers to stack data are in the stack itself (but see below for exceptions), which makes them easy to find and update in case we need to copy the stack.

Overflow check

The stack overflow check is very similar to the overflow check for split stacks.  The only major difference is that we no longer need to know what the argument size is, as there is no longer the need to copy the arguments to a new stack segment.  This simplifies the proliferation of the morestack routine family somewhat, and eliminates the need for most of the NOSPLIT directives.

We also don’t even need to know the frame size.  Just double the stack size and try again, if it still doesn’t fit we’ll hit the overflow check again and double the stack size again.  Repeat until it fits.

Copying

When a stack overflow check triggers:

The same rules apply when copying - we need to know whether the data really is a pointer and if so, whether it points into the old stack.

Reflect.call

Reflect.call calls a function given a pointer to an argument area and an argument size.  The old stack-split mechanism forced a new stack chunk so this variable-sized argument area can be allocated at the top of the stack.  The new mechanism needs to allocate these args anywhere on the stack.  The implementation is simpler if all stack frames are fixed size, so to “simulate” a variable-sized frame we define a set of fixed-size-frame functions in powers-of-two sizes which can be used as trampolines.

reflect.call(function, args, argsize)

        if argsize <= 16: jump reflect.call16(function, args, argsize)

        if argsize <= 32: jump reflect.call32(function, args, argsize)

        if argsize <= 64: jump reflect.call64(function, args, argsize)

reflect.call16(function, args, argsize)

        sp -= 16

        copy [args, args+argsize] to [sp, sp+argsize]

        call function

        copy [sp, sp+argsize] to [args, args+argsize]  (for return values)

        sp += 16

        return

I propose a maximum reflect.call arg size of 64K.

The stack copier needs to know what is a pointer and what is not for these frames.  This info can be derived by looking at the signature of the called function. (and for vararg calls also the argument size?)

Stack tracing

Stack tracing code simplifies somewhat because it doesn’t need to deal with splits on the stack.

GC

As in stack tracing, GC walk of the stack for a G is simplified.

Defer/panic/recover

The implementations of these mechanisms change slightly, and are simplified in most instances.  The old mechanisms used stack-split locations to mark where panics are happening.

Testing

For testing, all stacks will be allocated somewhere outside the heap.  When a stack is copied, the old stack will be munmap’d so any errant access to the old stack will trap.

CGo

TBD, I don’t know what the issues are yet.

Shrinking

If a Go routine uses lots of stack at the start of its lifetime but then uses much less stack for the rest of its lifetime, then a lot of stack space is left idle.  We need to recover this space somehow.  Ideas:

My current plan is at GC time, if a go routine is using at most ¼ of its stack, free the bottom ½ of the stack.  I’ll be freeing directly if the stack is big enough to allow it, and copying otherwise.

Starting stacks

All Gs by default start with a minimum-sized stack.  When a G finishes, we deallocate its stack, as it may be large and not what is required by the next go routine which happens to start on that G.  (Or maybe we let the shrinking code handle this?)

We could keep some statistics of how much stack each go routine uses (keyed by function address) and pre-allocate that much stack before starting that go routine.  Statistics can be updated each time a go routine finishes.  It won’t be perfect, for example a go routine whose stack use is data-dependent.  But it may be an adequate starting point.

This mechanism could be used in the split stack world as well, if we were willing to record the maximum stack usage during a Go routine.  It would have to be more conservative, however.  In the split-stack world allocating too large an initial stack can’t be undone.

Disadvantages

Experiments so far

I have a prototype implementation running.  It is not complete yet but it can compile simple examples.

Peano (test/peano.go) modified to run up to 11! is about 10% faster.  It is very sensitive to the growth rate, though, so take it with a lot of salt.  (Doubling the stack size each copy is 10% faster.  Growing the stack by 50% each copy is only 2% faster. Growing by 25% is 20% slower.)

“Hot split” testcase (stacksplit.go).  Compiled with -gcflags -l to disable inlining.

segmented stacks:

 no split: 1.25925147s
with split: 5.372118558s   <- triggers hot split problem
both split: 1.293200571s

contiguous stacks:

 no split: 1.261624848s
with split: 1.262939769s
both split: 1.29008309s

Benchmark examples

I modified the segmented stack and contiguous stack implementations to use an environment variable to configure their stack segment size / initial stack size.  We can use this feature to test a benchmark’s sensitivity to stack segment size.  (For these experiments, the contiguous stack size is limited to powers of 2.)

encoding/json/BenchmarkEncode: this one has been complained about historically.  See bug 3787.

Presumably the max stack use is 19KB or so, and below that bad splits cause the downward spikes.

html/template/BenchmarkEscapedExecute: this benchmark has a particular problem at 4096 bytes/segment (the default size, which is how I found it).

(I don’t understand why the contiguous stack impl is asymptotically better.)