Go channels on steroids

Dmitry Vyukov

dvyukov@

Jan 28, 2014

Why?

Channels are the main synchronization and communication primitive in Go; they need to be fast and scalable.

Goals:

- make single-threaded (non-contended) channel operations faster

- make contended buffered (producer/consumer) channel operations faster

- make non-blocking failing operations (e.g. checking a "stop" channel) faster

- make chan semaphores (chan struct{}) faster

- make select statements faster

Non-goals:

- make channels completely lock-free (this would significantly complicate implementation and make it slower for common cases)

- make contended synchronous channel operations faster

The rest of the document describes details of the design.

How?

Types of channels

There are 3 internal types of channels:

1. Sync channels. They need no buffering or buffer management code. They also implement direct hand-off semantics: a goroutine directly chooses its counterpart and completes the communication with it.

2. Async channels. These are traditional producer/consumer queues based on a ring buffer. They do not implement hand-off semantics: an unblocked consumer competes on equal terms with other consumers, and if it loses the competition it blocks again.

3. Async channels with zero-sized elements (chan struct{}). These are essentially semaphores. They need no buffer (they consume O(1) memory) and do not implement hand-off semantics.
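At the Go language level the three internal kinds correspond directly to how the channel is made; a quick illustration in plain Go (nothing channel-internal is visible here):

```go
package main

import "fmt"

func main() {
	// 1. Sync channel: no buffer, a send blocks until a receiver is ready.
	sync := make(chan int)
	go func() { sync <- 1 }() // hands off directly to the receiver
	fmt.Println(<-sync)

	// 2. Async channel: a buffered producer/consumer queue.
	async := make(chan int, 2)
	async <- 1
	async <- 2
	fmt.Println(<-async, <-async)

	// 3. Zero-sized async channel: essentially a semaphore, O(1) memory.
	sem := make(chan struct{}, 2)
	sem <- struct{}{} // acquire
	<-sem             // release
	fmt.Println(len(sem))
}
```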

Sync send/recv

Before diving into selects, let's first consider how standalone send/recv work.

Sync channels are mostly mutex-protected, except for the fast path that fails non-blocking operations (e.g. a non-blocking recv from an empty chan). A sync chan contains the following data:

struct Hchan {
        Lock;
        bool   closed;
        SudoG* sendq;  // waiting senders
        SudoG* recvq;  // waiting receivers
};

Send locks the mutex and checks whether it needs to block or can satisfy an inverse operation:

bool syncchansend(Hchan *c, T val, bool blocking) {
        SudoG *sg;  // per-goroutine wait record (allocation elided)
        if(c->closed)  // fast path
                panic("closed");
        if(!blocking && c->recvq == nil)  // fast path
                return false;
        lock(c);
        if(c->closed) {
                unlock(c);
                panic("closed");
        }
        if((sg = removewaiter(&c->recvq)) != nil) {
                // Have a blocked receiver, communicate with it.
                unlock(c);
                sg->val = val;
                sg->completed = true;
                unblock(sg->g);
                return true;
        }
        if(!blocking) {
                unlock(c);
                return false;
        }
        // Block and wait for a pair.
        sg->g = g;
        sg->val = val;
        addwaiter(&c->sendq, sg);
        unlock(c);
        block();
        if(!sg->completed)
                panic("closed");  // unblocked by close
        // Unblocked by a recv.
        return true;
}
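For reference, the non-blocking (`blocking == false`) paths above are what a `select` with a `default` case exercises at the language level; a small illustration:

```go
package main

import "fmt"

func main() {
	c := make(chan int) // sync channel, no receiver yet

	// Non-blocking send: fails fast because the receiver queue is empty.
	select {
	case c <- 1:
		fmt.Println("sent")
	default:
		fmt.Println("would block") // taken: no waiting receiver
	}

	// Non-blocking recv fails the same way on an empty channel.
	select {
	case v := <-c:
		fmt.Println("got", v)
	default:
		fmt.Println("empty")
	}
}
```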

Async send/recv

Async send/recv is lock-free as long as it does not need to manipulate the wait queues; the wait queues themselves are protected by the mutex. Non-blocking failing operations are fast-pathed as well.

Let’s first consider how non-blocking operations proceed.

The async channel contains the following data:

struct Hchan {
        uint32 cap;   // channel capacity
        Elem*  buf;   // ring buffer of size cap
        // Send and receive positions:
        // low 32 bits represent the position in the buffer,
        // high 32 bits represent the current "lap" over the ring buffer.
        uint64 sendx;
        uint64 recvx;
};

struct Elem {
        // Current lap: the element is ready for writing on laps 0, 2, 4, ...
        // and for reading on laps 1, 3, 5, ...
        uint32 lap;
        T      val;  // user data
};

Sends synchronize with each other by advancing sendx with a CAS: whoever advances the position writes to the element. Sends synchronize with recvs by means of the lap variable in each element; the lap value says whether the element is ready for reading/writing on the current lap (the high 32 bits of sendx/recvx).
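To make the encoding concrete, here is a small Go sketch of how the combined {lap, pos} value advances and wraps (the names `unpack`/`next` and the capacity are ours, for illustration only):

```go
package main

import "fmt"

const capacity = 4 // hypothetical channel capacity

// unpack splits the combined position: low 32 bits pos, high 32 bits lap.
func unpack(x uint64) (pos, lap uint32) { return uint32(x), uint32(x >> 32) }

// next computes the value the CAS on sendx/recvx installs.
func next(x uint64) uint64 {
	pos, lap := unpack(x)
	if pos+1 < capacity {
		return x + 1 // same lap, next slot
	}
	return uint64(lap+2) << 32 // wrap: pos=0, lap advances by 2
}

func main() {
	x := uint64(0)
	for i := 0; i < 6; i++ {
		pos, lap := unpack(x)
		fmt.Printf("pos=%d lap=%d\n", pos, lap)
		x = next(x)
	}
	// pos walks 0..3 on lap 0, then wraps to pos=0 on lap 2.
}
```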

Below is the send algorithm:

bool asyncchansend_nonblock(Hchan* c, T val) {
        uint32 pos, lap, elap;
        uint64 x, newx;
        Elem *e;

        for(;;) {
                x = atomicload64(&c->sendx);
                pos = (uint32)x;
                lap = (uint32)(x >> 32);
                e = &c->buf[pos];
                elap = atomicload32(&e->lap);
                if(lap == elap) {
                        // The element is ready for writing on this lap.
                        // Try to claim the right to write to this element.
                        if(pos + 1 < c->cap)
                                newx = x + 1;  // just increase the pos
                        else
                                newx = (uint64)(lap + 2) << 32;  // wrap around, next lap
                        if(!cas64(&c->sendx, x, newx))
                                continue;  // lost the race, retry
                        // We own the element, do a non-atomic write.
                        e->val = val;
                        // Make the element available for reading.
                        atomicstore32(&e->lap, elap + 1);
                        return true;
                } else if((int32)(lap - elap) > 0) {
                        // The element has not yet been read on the previous lap,
                        // the chan is full.
                        return false;
                } else {
                        // The element has already been written on this lap,
                        // which means c->sendx has changed as well; retry.
                }
        }
}

The recv operation is completely symmetrical, except that recvs start at lap 1 and read the element instead of writing it.
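For illustration, the non-blocking send/recv pair can be sketched in plain Go with sync/atomic. This is a simplified model with int elements; the type and method names (`ring`, `trySend`, `tryRecv`) are ours, not the runtime's:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// elem holds a value plus its lap: writable on even laps, readable on odd ones.
type elem struct {
	lap uint32 // accessed atomically
	val int
}

// ring models the lock-free buffer: sendx/recvx pack {lap, pos} into a uint64.
type ring struct {
	capacity uint32
	buf      []elem
	sendx    uint64
	recvx    uint64
}

func newRing(n uint32) *ring {
	return &ring{
		capacity: n,
		buf:      make([]elem, n), // element laps start at 0: writable
		recvx:    uint64(1) << 32, // receivers start at lap 1
	}
}

func (r *ring) trySend(v int) bool {
	for {
		x := atomic.LoadUint64(&r.sendx)
		pos, lap := uint32(x), uint32(x>>32)
		e := &r.buf[pos]
		elap := atomic.LoadUint32(&e.lap)
		switch {
		case lap == elap:
			// Claim the slot by advancing sendx.
			newx := x + 1
			if pos+1 == r.capacity {
				newx = uint64(lap+2) << 32 // wrap to the next lap
			}
			if !atomic.CompareAndSwapUint64(&r.sendx, x, newx) {
				continue // lost the race, retry
			}
			e.val = v
			atomic.StoreUint32(&e.lap, elap+1) // publish for readers
			return true
		case int32(lap-elap) > 0:
			return false // not yet read on the previous lap: full
		}
		// Slot already written on this lap: sendx moved, retry.
	}
}

func (r *ring) tryRecv() (int, bool) {
	for {
		x := atomic.LoadUint64(&r.recvx)
		pos, lap := uint32(x), uint32(x>>32)
		e := &r.buf[pos]
		elap := atomic.LoadUint32(&e.lap)
		switch {
		case lap == elap:
			newx := x + 1
			if pos+1 == r.capacity {
				newx = uint64(lap+2) << 32
			}
			if !atomic.CompareAndSwapUint64(&r.recvx, x, newx) {
				continue
			}
			v := e.val
			atomic.StoreUint32(&e.lap, elap+1) // free the slot for the next lap's writer
			return v, true
		case int32(lap-elap) > 0:
			return 0, false // empty
		}
	}
}

func main() {
	r := newRing(2)
	fmt.Println(r.trySend(1), r.trySend(2), r.trySend(3)) // true true false
	v, _ := r.tryRecv()
	fmt.Println(v)            // 1
	fmt.Println(r.trySend(3)) // true: the recv freed a slot
}
```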

Now, let's consider how blocking operations are implemented. The channel structure is extended with a mutex and send/recv waiter queues:

struct Hchan {
        …
        Lock;
        SudoG* sendq;
        SudoG* recvq;
};

To do a blocking send, a goroutine first tries the non-blocking send. If it succeeds, the goroutine checks whether there are recv waiters and, if so, unblocks one of them.

If the non-blocking send fails (the chan is full), the goroutine locks the mutex, adds itself to the send waiter queue, and then re-checks whether the chan is still full. If it is, the goroutine blocks. If it is not, the goroutine removes itself from the waiter queue, unlocks the mutex and retries.

Blocking recv proceeds in exactly the same way, with send and recv swapped.

The main tricky aspect of such blocking algorithms is to ensure that no deadlocks are possible (a sender blocked indefinitely on a non-full channel, or a receiver blocked indefinitely on a non-empty channel). The check-store-recheck sequence ensures that either (1) the sender sees that there is a recv waiter and unblocks it, or (2) the receiver sees the element in the buffer and consumes it, or (3) both 1 and 2 (in which case the competition is resolved by means of the mutex); but NOT (4) the sender does not see recv waiters while the receiver does not see the element in the buffer and blocks indefinitely.

Below is the blocking send algorithm:

void asyncchansend(Hchan* c, T val) {
        SudoG *sg;  // per-goroutine wait record (allocation elided)
        for(;;) {
                if(asyncchansend_nonblock(c, val)) {
                        // Send succeeded, see if we need to unblock a receiver.
                        if(c->recvq != nil) {
                                lock(c);
                                sg = removewaiter(&c->recvq);
                                unlock(c);
                                if(sg != nil)
                                        unblock(sg->g);
                        }
                        return;
                } else {
                        // The channel is full.
                        lock(c);
                        sg->g = g;
                        addwaiter(&c->sendq, sg);
                        if(notfull(c)) {
                                removewaiter(&c->sendq, sg);
                                unlock(c);
                                continue;
                        }
                        unlock(c);
                        block();
                        // Retry send.
                }
        }
}

struct{} send/recv

Zero-sized async channels generally resemble non-zero-sized async channels:

- operations are lock-free in non-blocking case

- wait queues are still protected by the mutex

- non-blocking failing operations are fast-pathed

The differences are:

- Hchan contains a single counter instead of the send/recv positions and the ring buffer; the counter represents the number of elements in the channel

- non-blocking send/receive do a CAS loop to adjust the counter

- full/empty predicates merely check the counter value

The rest, including the blocking algorithm, is the same.
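A minimal Go sketch of this counter-based scheme (the `sem` type and its methods are hypothetical names for illustration; the runtime version also handles waiters and close):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sem models the zero-sized-element channel: one counter, no buffer.
type sem struct {
	capacity uint32
	count    uint32 // number of "elements" in the channel, accessed atomically
}

// trySend increments the counter unless the channel is full.
func (s *sem) trySend() bool {
	for {
		n := atomic.LoadUint32(&s.count)
		if n == s.capacity {
			return false // full predicate: just a counter check
		}
		if atomic.CompareAndSwapUint32(&s.count, n, n+1) {
			return true
		}
		// Lost the race, retry.
	}
}

// tryRecv decrements the counter unless the channel is empty.
func (s *sem) tryRecv() bool {
	for {
		n := atomic.LoadUint32(&s.count)
		if n == 0 {
			return false // empty predicate: just a counter check
		}
		if atomic.CompareAndSwapUint32(&s.count, n, n-1) {
			return true
		}
	}
}

func main() {
	s := &sem{capacity: 2}
	fmt.Println(s.trySend(), s.trySend(), s.trySend()) // true true false
	fmt.Println(s.tryRecv(), s.tryRecv(), s.tryRecv()) // true true false
}
```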

close

The close operation locks the mutex, sets the closed flag and then unblocks all waiters. Async send/recv operations check the closed flag before blocking.

This achieves the same guarantees as for async send/recv blocking: either (1) close sees a waiter, or (2) a waiter sees the closed flag set, or (3) both 1 and 2 (in which case the competition is again resolved by means of the mutex).
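At the language level these guarantees surface as the familiar close semantics; a small illustration in plain Go:

```go
package main

import "fmt"

func main() {
	c := make(chan int, 2)
	c <- 1
	close(c)

	// Receivers drain buffered values first, then observe the closed flag.
	v, ok := <-c
	fmt.Println(v, ok) // 1 true
	v, ok = <-c
	fmt.Println(v, ok) // 0 false

	// A sender that observes the closed flag panics.
	defer func() { fmt.Println("send panicked:", recover() != nil) }()
	c <- 2
}
```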

select

Now we are ready for The Select.

The select operation does not lock the mutexes of all involved channels at once; instead it proceeds by doing fine-grained operations on individual channels.

Select consists of 4 phases:

0. Shuffle all involved channels to provide the pseudo-random fairness guarantee (all subsequent phases work with this shuffled list of channels).

1. Check all channels one-by-one to see if any of them is ready for communication; if so, do the communication and exit. This makes selects that do not block faster and more scalable, since they do not need to sort and lock mutexes. Moreover, such a select does not even need to touch all channels if the first one is ready.

2. Prepare for blocking on all channels.

3. Block. Goto 1.

Phase 2 needs a more detailed description.

Essentially it proceeds the same way as blocking in async send/recv: lock the channel mutex, add itself to the send/recv waiter queue, and after that re-check whether the chan is still not ready for communication. If the channel is not ready, proceed to the next channel. Otherwise, remove itself from all waiter queues and go to phase 1.

There is another tricky aspect. We add the select as a waiter to several channels, but we do not want several sync channel operations to complete communication with the select (for sync channels, unblocking completes a successful communication). To prevent this, select-related entries in waiter queues contain a pointer to a select-global state word. Before unblocking such a waiter, other goroutines try CAS(statep, nil, sg), which gives them the right to unblock/communicate with the waiter. If the CAS fails, the goroutine ignores the waiter (it is being signaled by somebody else).

This algorithm requires implementing an isready(c) predicate for all channel types, which does not present a significant problem. The high-level algorithm of a select operation follows:

Scase *select(Select *sel) {
        randomize channel order;
        for(;;) {
                // Phase 1.
                foreach(Scase *cas in sel) {
                        if(chansend/recv_nonblock(cas->c, ...))
                                return cas;
                }
                // Phase 2.
                selectstate = nil;
                foreach(Scase *cas in sel) {
                        lock(cas->c);
                        cas->sg->g = g;
                        cas->sg->selectstatep = &selectstate;
                        addwaiter(&cas->c->sendq/recvq, cas->sg);
                        if(isready(cas->c)) {
                                unlock(cas->c);
                                goto ready;
                        }
                        unlock(cas->c);
                }
                // Phase 3.
                block();
ready:
                // Claim the state word so that no sync chan operation
                // can complete communication with us anymore.
                CAS(&selectstate, nil, 1);
                foreach(Scase *cas in sel) {
                        lock(cas->c);
                        removewaiter(&cas->c->sendq/recvq, cas->sg);
                        unlock(cas->c);
                }
                // If we were unblocked by a sync chan operation,
                // the communication has completed.
                if(selectstate > 1)
                        return selectstate;  // denotes the completed case
        }
}

Prerequisites

There are two prerequisites (on top of revision 18999). Since chan-related data structures and algorithms will change significantly, we need to:

1. Remove the special handling of chans from the GC (the GC_CHAN_PTR program). It is possible to adapt GC_CHAN_PTR to the new representations, but it looks better to just give chan objects proper types.

2. Similarly, move chanlen/chancap into runtime instead of guessing in the compiler which word means what (there is no single len word in async chans).

Performance evaluation

Below is an annotated evaluation of the prototype (https://codereview.appspot.com/12544043/) on standard synthetic channel benchmarks.

benchmark                          old ns/op    new ns/op    delta
BenchmarkChanNonblocking                  24            8  -66.53%
BenchmarkChanNonblocking-2                92            4  -95.57%
BenchmarkChanNonblocking-4               104            2  -97.95%
BenchmarkChanNonblocking-8               114            1  -99.02%
BenchmarkChanNonblocking-16               92            0  -99.37%
BenchmarkChanNonblocking-32               82            0  -99.46%
// This takes advantage of the non-blocking fast path.

BenchmarkSelectUncontended               222          159  -28.38%
BenchmarkSelectUncontended-2             128           97  -23.98%
BenchmarkSelectUncontended-4              87           52  -39.79%
BenchmarkSelectUncontended-8              46           29  -37.69%
BenchmarkSelectUncontended-16             23           15  -34.75%
BenchmarkSelectUncontended-32             18           11  -36.26%
// Single-threaded speedup, because we don't sort/lock all mutexes and don't zero the select descriptor.

BenchmarkSelectContended                 221          152  -31.22%
BenchmarkSelectContended-2               459          159  -65.36%
BenchmarkSelectContended-4               678          254  -62.54%
BenchmarkSelectContended-8               985          365  -62.94%
BenchmarkSelectContended-16              749          337  -55.01%
BenchmarkSelectContended-32              731          421  -42.41%
// Scalability improvements caused by finer-grained locking and lock-free paths.

BenchmarkSelectNonblock                  104           34  -66.92%
BenchmarkSelectNonblock-2                 51           17  -66.60%
BenchmarkSelectNonblock-4                 25            8  -64.56%
BenchmarkSelectNonblock-8                 12            4  -63.44%
BenchmarkSelectNonblock-16                 6            2  -63.22%
BenchmarkSelectNonblock-32                 5            2  -56.15%
// Again, non-blocking fast paths.

BenchmarkChanUncontended                  61           37  -38.30%
BenchmarkChanUncontended-2                30           18  -39.07%
BenchmarkChanUncontended-4                15            9  -38.58%
BenchmarkChanUncontended-8                 8            5  -37.20%
BenchmarkChanUncontended-16                4            2  -36.43%
BenchmarkChanUncontended-32                3            2  -32.46%
// Single-threaded speedup, because instead of lock/unlock we do a single CAS.

BenchmarkChanContended                    60           39  -34.05%
BenchmarkChanContended-2                 268          291   +8.58%
BenchmarkChanContended-4                 301          289   -3.99%
BenchmarkChanContended-8                 331          332   +0.30%
BenchmarkChanContended-16                254          591  +132.68%
BenchmarkChanContended-32                242          726  +200.00%
// Extremely contended async chans are slower because of increased contention in the lock-free paths.

BenchmarkChanSync                        127          127   +0.00%
BenchmarkChanSync-2                      395          393   -0.51%
BenchmarkChanSync-4                      358          340   -5.03%
BenchmarkChanSync-8                      293          312   +6.48%
BenchmarkChanSync-16                     353          361   +2.27%
BenchmarkChanSync-32                     216          220   +1.85%
// Mostly flakes due to extreme contention.

BenchmarkChanProdCons0                   134          140   +4.48%
BenchmarkChanProdCons0-2                 407          573  +40.79%
BenchmarkChanProdCons0-4                 667          745  +11.69%
BenchmarkChanProdCons0-8                 934          883   -5.46%
BenchmarkChanProdCons0-16                760          762   +0.26%
BenchmarkChanProdCons0-32                700          759   +8.43%
// Highly contended sync channels are slower (due to the fast paths), but are not interesting.

BenchmarkChanProdCons10                   88           71  -18.86%
BenchmarkChanProdCons10-2                223          104  -53.36%
BenchmarkChanProdCons10-4                699          197  -71.82%
BenchmarkChanProdCons10-8                821          512  -37.64%
BenchmarkChanProdCons10-16               616          588   -4.55%
BenchmarkChanProdCons10-32               553          575   +3.98%
// Highly contended async chans are faster with few threads (because they are lock-free), but slower with more threads (because they are lock-free); both results are not very interesting.

BenchmarkChanProdCons100                  68           43  -37.12%
BenchmarkChanProdCons100-2               206           95  -53.83%
BenchmarkChanProdCons100-4               397          338  -14.86%
BenchmarkChanProdCons100-8               400          469  +17.25%
BenchmarkChanProdCons100-16              310          773  +149.35%
BenchmarkChanProdCons100-32              291          753  +158.76%
// Same as BenchmarkChanProdCons10.

BenchmarkChanProdConsWork0               730          691   -5.34%
BenchmarkChanProdConsWork0-2             427          487  +14.05%
BenchmarkChanProdConsWork0-4             767          855  +11.47%
BenchmarkChanProdConsWork0-8            1096         1062   -3.10%
BenchmarkChanProdConsWork0-16            906          965   +6.51%
BenchmarkChanProdConsWork0-32            858          916   +6.76%
// Sync channels are a bit slower under moderate load (goroutines do some local work), because of the fast paths. Use buffering!

BenchmarkChanProdConsWork10              648          594   -8.33%
BenchmarkChanProdConsWork10-2            593          519  -12.48%
BenchmarkChanProdConsWork10-4           1089          344  -68.41%
BenchmarkChanProdConsWork10-8           1334          500  -62.52%
BenchmarkChanProdConsWork10-16          1107          572  -48.33%
BenchmarkChanProdConsWork10-32          1005          575  -42.79%
// This is a much more interesting producer/consumer case -- buffered channel and goroutines do some local work -- new channels are significantly faster because there are no mutexes.

BenchmarkChanProdConsWork100             620          607   -2.10%
BenchmarkChanProdConsWork100-2           543          365  -32.78%
BenchmarkChanProdConsWork100-4           823          211  -74.36%
BenchmarkChanProdConsWork100-8          1030          563  -45.34%
BenchmarkChanProdConsWork100-16          855          755  -11.70%
BenchmarkChanProdConsWork100-32          788          760   -3.55%
// Same as BenchmarkChanProdConsWork10.

BenchmarkSelectProdCons                 1180         1031  -12.63%
BenchmarkSelectProdCons-2                880          683  -22.39%
BenchmarkSelectProdCons-4               1213          433  -64.30%
BenchmarkSelectProdCons-8               1613          578  -64.17%
BenchmarkSelectProdCons-16              1298          805  -37.98%
BenchmarkSelectProdCons-32              1289          773  -40.03%
// Takes advantage of finer-grained locking and lock-free paths.

BenchmarkChanCreation                    150          108  -28.00%
BenchmarkChanCreation-2                  104           65  -37.31%
BenchmarkChanCreation-4                   56           56   -0.70%
BenchmarkChanCreation-8                   57           47  -18.06%
BenchmarkChanCreation-16                  63           48  -23.17%
BenchmarkChanCreation-32                  77           51  -34.53%
// This is faster because the benchmark also includes send/recv operations; chan creation itself should not be significantly affected.

BenchmarkChanSem                          55           28  -48.01%
BenchmarkChanSem-2                       260           76  -70.69%
BenchmarkChanSem-4                       303           95  -68.38%
BenchmarkChanSem-8                       309          116  -62.46%
BenchmarkChanSem-16                      215          134  -37.67%
BenchmarkChanSem-32                      196          153  -21.94%
// Takes advantage of the special lock-free paths for chan struct{}.