NUMA-aware scheduler for Go

Dmitry Vyukov, dvyukov@

Sep 2014

Abstract

With the advent of Intel QPI and AMD HT technologies even commodity SMP servers become NUMA. As core counts continue to grow, this trend will continue as well. When per-package computational power will reach a threshold, even a single package will require multiple memory lanes effectively becoming NUMA (see e.g. Tile-Gx72 processors).

Temporal and spatial locality of computations with respect to data becomes exceedingly important in such world. Go misses the bus.

The goal of this document is to propose a general infrastructure that allows to do NUMA/cache-aware scheduling and computation/data migration; describe a set of signals that the scheduler can use and a set of simple policies that look like a clear win. Once we implement it and gather more information, the signals and policies most likely will need to be refined and tuned.

Reference Target Machine

Let's consider a typical machine with 4 Processors and 4 Memory packages:

Latency between P0 <-> MEM0 is minimal; latency between P0 <-> MEM1 and P0 <-> MEM2 is larger; latency between P0 <-> MEM3 is even larger. This Non-Uniform Memory Architecture (NUMA) dictates need for spatial locality.

And each processor is:

Caches store recently accesses data and are much faster than memory. Caches dictate need for temporal locality. Note than caches alleviate NUMA effect, as remote data in local cache is fast.

Current Architecture

The main concepts in Go runtime are:

M - OS thread (Machine).

P - Logical CPU (Processor), there are exactly GOMAXPROCS P's.

G - Goroutine.

Netpoll - network poller (epoll descriptor).

RunQ - scheduler queue with runnable G's.

MHeap - global memory allocator state (contains central caches and free spans).

MCache - per-P memory allocator cache.

GC - garbage collector.

Current system architecture can be depicted as follows:

Runtime does not try hard to ensure any locality, resources like P's and M's are pooled. Runtime is not aware of system topology.

MCache is tied to P, and this provides some locality. But then P is not tied to M, so this locality is easily destroyable.

New G's are generally submitted to the local RunQ, but once again P is not tied to M.

When an M returns from a syscall quickly, it tries to re-acquire the same P it used before the syscall.

Proposed Architecture

In short: global resources (MHeap, global RunQ and pool of M's) are partitioned between NUMA nodes; netpoll and timers become distributed per-P.

In order to get the most benefit we need to ensure affinity throughout the stack: core <-> M <-> P <-> G and Memory. If one of the levels does not work, benefits significantly diminish.

Let's consider different subsystems in detail.

Memory allocator

Heap is partitioned between NUMA nodes. When a P needs another span, it allocates it [preferably] from the node's MHeap. This ensures that we always allocate local memory.

MCache is attached to P as now. But this time we have P <-> M affinity, so MCache affinity propagates all way down to processor cache.

GC bitmap also needs to be partitioned. Once we grab a new 64K block from OS, we map 4K of bitmap on the current NUMA node as well. This ensures that malloc always work with bitmap in local memory.

GC

GC allocates worker threads on all NUMA nodes. A GC thread scans stacks of goroutines with stacks allocated on the current node. This ensures that accesses to the stack are local. Also there are chances that such goroutines run on this node and thus allocated heap objects on this node as well. Consequently, transitive heap references are local with high probability.

GC has per-node queues with to-be-scanned (gray) objects. When a thread is out of local work, it first tries to steal/request work on local node. This ensures that even after dynamic load-balancing some notion of locality is preserved.

Sweeping. We have 2 types of sweeping: background (done by a dedicated sweeper goroutine) and on-demand (done by user goroutines). On-demand sweep sweeps spans that the goroutine is about to use; thus locality is automatically ensured by the memory allocator (a goroutine uses local spans). Background sweeping functionality is moved into scheduler (as we don't have means to bind goroutines to NUMA nodes). Once in a while scheduler decides to sweep one of the remaining local spans.

There are yet no plans for locality-driven object migration. First, we don't have capability to move objects; second, it's easy to do more harm than good by wasting CPU cycles on data migration.

Stacks

We have 2 caches for stack segments: per-P and global. Per-P cache is OK as is. The global cache is partitioned per NUMA node.

Stack allocation always goes to local per-P cache, then to per-node cache, then to per-node MHeap.

If we free a remote stack segment it is freed directly into remote node cache. This is required to ensure that goroutines always start with local stacks.

As opposed to heap objects, we are able to move stacks in memory. Also required heuristics seems to be simpler for stacks. For example, if a goroutine is long running (has passed through scheduler enough times) and was scheduled on a remote node (as compared to stack segment node) last N times, we can decide to migrate stack segment to that node. The right time for migration seems to be right before goroutine execution. More data and experiments needed to work out exact policy.

Scheduling

M's (threads) are hard bound to NUMA nodes, and each node has own pool of idle M's. For M <-> core affinity we rely on OS.

Alternative: It is possible to bound M's to exact cores instead of nodes. It would allow us to do smarter scheduling (e.g. steal work from the HT sibling first). But at the same time it is subject to the following negative effect: 2 processes can decide to do work on threads bound to the same core, both will work 2 times slower w/o realizing it. Binding M's to nodes gives OS more flexibility in inter-process contention resolution. Binding of M's to cores would require API similar to SetThreadIdealProcessor (ideally - soft affinity to a core, hard affinity a node), but unfortunately it's unavailable on unixes (Update(Jul 5, 2019): Linux may get sched_setaffinity2 for this).

P's (logical processors) are strictly partitioned between NUMA nodes. That is, if we want to start a P bound to NODE0, we need to get an M from NODE0 M pool. Each P has own local RunQ as now. Global RunQ and pool of idle G descriptors become partitioned between NUMA nodes.

To ensure P <-> M affinity, lastm field is added to P. When runtime wants to start a P it first tries to get p->lastm M to run it.

What's left is scheduling policy. Current scheduling policy is very simple: new and unblocked goroutines go to local RunQ; network poller injects work into global queue; work stealing is completely random; GC reshuffles goroutines to balance work.

Most likely we stay with "new and unblocked goroutines go to local RunQ" as it's the cheapest strategy. But the rest needs to be changed.

First, GC does not reshuffle goroutines, it just leaves them where they are.

Then, work stealing prefers stealing from the same node: per-node RunQ -> per-P RunQ's on the same node -> per-node RunQ's of remote nodes -> per-P RunQ's of remote nodes. We don't bind threads to exact cores, so we don't take into account HyperThread siblings and shared caches.

Then, scheduling order changes from strict FIFO to mostly LIFO. LIFO is generally much better in cache utilization. But we still need a bare minimum of fairness (ensure that every runnable goroutine is eventually scheduled). To provide fairness each P pops from the other end of local RunQ once in a while. Global RunQ's stay FIFO as locality is not important for them.

Then, netpoller and timers are used to migrate stolen goroutines back to "home" P. Consider that a goroutine has registered in netpoller of Pi and then was stolen by Pj. When Pi runs out of work, it first checks local netpoll/timers. If the goroutine becomes runnable, it returns to Pi.

If we decide to use a different policy -- prefer to run a goroutine on P where it last run -- we can re-register the goroutine in the necessary netpoller. But still netpoller is used as means to bring a goroutine to the desired P.

The latter strategy can actually be preferable as Linux kernel tries to service the network interrupt on the core where the last EAGAIN read on the socket was executed. So most likely the received data is in the cache of the core that last run the goroutine.

If we only try to ensure locality, we will end up with situation when all goroutines consider the starting P0 as their home (as all goroutines are rooted in the main goroutine). So besides ensuring locality we also need some mechanism to evenly distribute "root" goroutines (e.g. a new goroutine servicing a new connection) across P's.

Assigning each new goroutine to P's round robin is too expensive.

There is at least one very good signal: goroutines created by a goroutine that executes syscall.Accept should be considered "root" and distributed round-robin.

I don't have any other good signals right now. But we can look at communication patterns or accesses to the same fd by different goroutines and similar things. For example, if a goroutine creates a bunch of new goroutines in quick succession, they may need to be distributed across P's and nodes.

Distribution of work may require splitting local RunQ's into two queues: the first is "I would like to do this work myself", which is lock-free single-producer and LIFO; and the second one is "I am ready to share this work with others", which is mutex-protected multi-producer and not necessary LIFO. This separation also gives us some notion of priorities, as the first queue is polled first.

Regarding priorities. Network writes should have higher priority than network reads, as it helps to reduce latency and reduces memory consumption (finish old work before accepting new). Similarly, network accepts should have even lower priority than reads (deal with old connections before accepting new ones). Local per-P RunQ's can be used to express these priorities. Namely, goroutines unblocked on network writes are queued at the front of the first RunQ; network reads -- at the end of the first RunQ; network accepts -- queued into the second RunQ. This soft priority scheduling can be used in other contexts as well; e.g. a goroutine unblocked on chan recv can be given a higher priority than a goroutine unblocked on chan send (the idea is that the goroutine unblocked on chan recv will consume the data that the current goroutine has just produced). But such policies require more investigation.

Non-NUMA Systems

On non-NUMA systems this design mostly reduces to the current design (per-node data effectively becomes global). However, there are several differences:

We will [hopefully] get some additional locality with respect to caches (as we try to keep goroutines on the same cores and do LIFO scheduling).
We will get partitioned per-code network poller and timers. This should reduce contention on epoll descriptors and timers and improve locality as well.

Risks

Even this relatively modest affinity scheme has potential to degrade an application performance.

Several processes can decide to schedule threads on the same NUMA node. If each process has only one runnable goroutine, the NUMA node will be over-subscribed, while other nodes will be idle. To partially alleviate the problem, we can randomize node numbering within each process. Then the starting NODE0 refers to different physical nodes across [Go] processes.
Consider that a program allocates a very large data structure on a single node, and then the data structure is heavily accesses by all goroutines. The memory bus of the node will become a bottleneck, while memory buses of other nodes will be idle. It's known that the so-called "stripped" allocation works better in this case.