Goroutines:
Under the Hood
Vicki Niu
@vickiniu
we ❤️ Go’s concurrency primitives
but what is a goroutine?
“goroutines are lightweight threads”
vs
the lightweight goroutine
the OS thread
the lightweight goroutine
leverages
the OS thread
Fast facts about goroutines
In general, trivial to run millions of goroutines on a given machine!
👀 so... what is a goroutine?
type g struct {
	stack        stack  // stack offsets
	m            *m     // current m (os thread)
	sched        gobuf  // saves context of g for recovering after context switch
	atomicstatus uint32 // current status of g (runnable, running, waiting)

	// activeStackChans indicates that there are unlocked channels
	// pointing into this goroutine's stack. If true, stack
	// copying needs to acquire channel locks to protect these
	// areas of the stack.
	activeStackChans bool

	// parkingOnChan indicates that the goroutine is about to
	// park on a chansend or chanrecv. Used to signal an unsafe point
	// for stack shrinking. It's a boolean value, but is updated atomically.
	parkingOnChan uint8
}
goroutine stack
All goroutines are initially allocated a 2 KB stack
Each Go function has a small preamble, which calls morestack if it is about to run out of stack space
Then, the runtime allocates a new stack with double the size, copies over the old one, and restarts execution
Effectively, this lets goroutine stacks grow as needed (up to the runtime’s maximum stack size), with efficient shrinking
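A quick way to see both claims at once — lots of goroutines, each growing well past its initial stack — is a sketch like this (the `grow` helper and the worker count are illustrative, not from the talk):

```go
package main

import (
	"fmt"
	"sync"
)

// grow recurses deeply enough that each goroutine's stack must expand
// past its initial 2 KB; the runtime copies the stack transparently.
func grow(n int) int {
	var pad [256]byte // consume some stack in every frame
	pad[0] = byte(n % 256)
	if n == 0 {
		return 0
	}
	return grow(n-1) + 1
}

func main() {
	const workers = 100000 // far more than is practical with OS threads
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			grow(100) // ~25 KB of stack frames, grown on demand
		}()
	}
	wg.Wait()
	fmt.Println("ran", workers, "goroutines")
}
```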
🧠 so, when does a goroutine run?
✨ enter the go scheduler ✨
G — goroutine: the code to execute
M — OS thread: where to execute it
P — logical processor: rights & resources to execute it
M:N scheduler
[diagram: many G’s multiplexed onto a few M’s]
G — goroutine: the code to execute
can be in one of three states: runnable, running, or waiting
the scheduler’s job:
running goroutines as efficiently as possible
main() runs on the main execution thread
[diagram: P0/M running G: func main()]
Now, we create our processors
# of processors is equal to the number of logical cores, or GOMAXPROCS
[diagram: P0/M running func main(); idle processors P1, P2, P3]
main() spawns a goroutine!
[diagram: P0/M running func main(); a new G1 appears; idle processors P1, P2, P3]
G1 wakes up an idle P
[diagram: G1 waking idle processor P3]
P3 creates an M, with an OS thread to run G1
[diagram: P3/M running G1]
G1 finishes executing, now P3 and M are idle
[diagram: P3 back with the idle processors; its M with the idle threads]
what do we really need P for?
[diagram: P0/M running func main(); P1/M running G1; a new G2 appears]
G2 gets added to the local run queue (LRQ) of the current P
When P1 becomes available again, it looks in the LRQ for a G
[diagram: P1/M now running G2 from its LRQ]
P knows what G’s are available for an M to run
🤔 what about blocked goroutines?
💡the scheduler should maximize execution time on OS threads
what blocks a goroutine?
blocking system call
[diagram: P0/M0 running G0; idle processor P1; idle thread M1]
G0 makes a blocking system call, invoking the scheduler
P0 releases M0 while G0 is still blocking
[diagram: M0 detached from P0, still bound to the blocked G0]
P0 wakes up M1 to continue executing G’s
[diagram: G0 still blocked on M0; P0 now paired with M1, running G1 with G2 in the LRQ]
When G0 unblocks, M0 attempts to find a P
G execution continues!
[diagram: G0 back running on a P; M1 running G2 with G1 in the LRQ]
If M0 can’t find a P, then M0 goes to sleep
G0 gets added to the GRQ
(global run queue)
[diagram: M0 with the idle threads; G0 in the GRQ; M1 running G2 with G1 in the LRQ]
blocking system call
blocking network call
[diagram: P0/M0 running G0, with G1 and G2 in the LRQ]
G0 wants to make a network call
Net Poller
G0 wants to make a network call, so G0 is moved to the net poller!
Net poller has its own OS thread, and handles events from goroutines doing network I/O
Interfaces with the OS to poll the appropriate network sockets
Re-schedules goroutines when their network resource is available!
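A loopback sketch of a goroutine parking in the net poller (assumes local TCP networking is available; the `roundTrip` helper is illustrative):

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// roundTrip starts a listener, reads one message in a goroutine, and
// sends msg to it. While the goroutine waits in Accept/Read, the runtime
// parks it in the net poller instead of tying up an OS thread.
func roundTrip(msg string) string {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // ephemeral localhost port
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	got := make(chan string)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		data, _ := io.ReadAll(conn) // reads until the client closes
		conn.Close()
		got <- string(data)
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	conn.Write([]byte(msg))
	conn.Close()
	return <-got
}

func main() {
	fmt.Println("got:", roundTrip("ping")) // got: ping
}
```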
[diagram: G0 handed off to the net poller; P0/M0 continue with G1 and G2 from the LRQ]
netpoll interface
// func netpollinit()
//
// func netpollopen(fd uintptr, pd *pollDesc) int32
// Arm edge-triggered notifications for fd. The pd argument is to pass
// back to netpollready when fd is ready. Return an errno value.
//
// func netpoll(delta int64) gList
// Poll the network. If delta < 0, block indefinitely. If delta == 0,
// poll without blocking. If delta > 0, block for up to delta nanoseconds.
// Return a list of goroutines built by calling netpollready.
//
// func netpollBreak()
// Wake up the network poller, assumed to be blocked in netpoll.
//
// func netpollIsPollDescriptor(fd uintptr) bool
// Reports whether fd is a file descriptor used by the poller.
// poll network if not polled for more than 10ms
lastpoll := int64(atomic.Load64(&sched.lastpoll))
if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
	atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
	list := netpoll(0) // non-blocking - returns list of goroutines
	if !list.empty() {
		...
		injectglist(&list)
	}
}
When the network call is complete, the net poller moves G0 back onto the LRQ
[diagram: net poller returns G0 to the LRQ; P0/M0 keep running G1, then G2]
blocking on channels
[diagram: P0/M0 running G0, with G1 and G2 in the LRQ]
G0 wants to send on a full channel, and blocks:
ch <- foo
G0 is added to ch’s sendq, the queue of goroutines waiting to send on ch
Scheduler removes G0 from M0, executing the next G
[diagram: M0 now running G1; G0 parked on ch’s sendq]
Now, G1 is our savior!
bar := <-ch
On the channel receive, ch looks into its sendq and finds G0.
G1 then calls into the scheduler, making G0 runnable.
[diagram: G0 removed from ch’s sendq]
G0 is added to the LRQ.
[diagram: P0/M0 running G2, with G1 and G0 in the LRQ]
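The sequence above, as a runnable sketch (the channel and values are illustrative):

```go
package main

import "fmt"

func main() {
	ch := make(chan string, 1)
	ch <- "first" // buffer is now full

	sent := make(chan struct{})
	go func() {
		// This send blocks: ch is full, so this goroutine is parked
		// on ch's sendq until a receiver makes room.
		ch <- "second"
		close(sent)
	}()

	fmt.Println(<-ch) // the receive wakes the parked sender
	<-sent
	fmt.Println(<-ch)
}
```

Prints "first" then "second": the blocked sender is made runnable again by the first receive, exactly the handoff the slides walk through.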
work stealing between threads
[diagram: P0/M0 and P1/M1, each with several G’s queued in its LRQ]
P0’s LRQ runs dry, so P0 tries to steal work from another P.
[diagram: P0 steals half of P1’s LRQ and resumes executing G’s]
Ensures that threads don’t remain idle while there is work to do!
[diagram: work rebalanced across P0 and P1]
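There is no user-level API for work stealing, but the runtime’s real GODEBUG=schedtrace=1000 flag prints scheduler state every second — gomaxprocs, idle P’s and M’s, and per-P run queue lengths — so stealing becomes visible under CPU-bound load. A toy workload to watch (the `busy` helper is illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// busy burns CPU so goroutines stay runnable long enough
// to be distributed (and stolen) across P's.
func busy(iters int) int {
	s := 0
	for i := 0; i < iters; i++ {
		s += i
	}
	return s
}

func main() {
	// Run with: GODEBUG=schedtrace=1000 go run main.go
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			busy(1 << 22)
		}()
	}
	wg.Wait()
	fmt.Println("done")
}
```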
🙇♀️ now, we’re goroutine experts!
goroutine stack is lightweight & growable
M:N scheduling allows for efficient use of OS threads
scheduler maximizes OS thread resources for execution, context switching goroutines cheaply in user space
exposes convenient blocking interfaces for us to use as Go developers!