1 of 96

Goroutines:

Under the Hood

Vicki Niu

@vickiniu

2 of 96

we ❤️ Go’s concurrency primitives

3 of 96

but what is a goroutine?

4 of 96

“goroutines are lightweight threads”

5 of 96

the lightweight goroutine vs the OS thread

6 of 96

the lightweight goroutine leverages the OS thread

7 of 96

Fast facts about goroutines

  • Started and managed by the Go runtime
  • Start with just 2 KB of memory, but have growable stacks
  • Cheap, seamless context switching, handled by the Go runtime scheduler
  • Native inter-goroutine communication via channels

In general, it's trivial to run millions of goroutines on a given machine!
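As a quick illustration (my own sketch, not from the talk), spawning a hundred thousand goroutines is unremarkable:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawn launches n goroutines that each increment a shared counter,
// then waits for all of them to finish.
func spawn(n int) int64 {
	var (
		wg    sync.WaitGroup
		count int64
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&count, 1)
		}()
	}
	wg.Wait()
	return count
}

func main() {
	// 100,000 goroutines at ~2 KB of initial stack each is only a few
	// hundred MB at worst -- trivial for the runtime to manage.
	fmt.Println(spawn(100000))
}
```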

8 of 96

👀 so... what is a goroutine?

9 of 96

type g struct {
	stack        stack  // stack offsets
	m            *m     // current m (os thread)
	sched        gobuf  // saves context of g for recovering after context switch
	atomicstatus uint32 // current status of g (runnable, running, waiting)

	// activeStackChans indicates that there are unlocked channels
	// pointing into this goroutine's stack. If true, stack
	// copying needs to acquire channel locks to protect these
	// areas of the stack.
	activeStackChans bool

	// parkingOnChan indicates that the goroutine is about to
	// park on a chansend or chanrecv. Used to signal an unsafe point
	// for stack shrinking. It's a boolean value, but is updated atomically.
	parkingOnChan uint8
}

10 of 96

goroutine stack

All goroutines are initially allocated 2 KB of memory

Each Go function has a small preamble that calls morestack if the goroutine is about to run out of stack

The runtime then allocates a new stack segment of double the size, copies the old segment over, and resumes execution

Effectively, this makes goroutine stacks growable almost without limit, with efficient shrinking too!
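A tiny sketch (my own example) that leans on growable stacks: recursion this deep would blow a fixed 2 KB stack, but the runtime grows the stack on demand:

```go
package main

import "fmt"

// depth recurses n times; each call adds a frame, forcing the runtime
// to grow this goroutine's stack (via morestack) far past the initial 2 KB.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	var pad [128]byte // make each frame noticeably large
	_ = pad
	return 1 + depth(n-1)
}

func main() {
	// 100,000 frames of >128 bytes each is megabytes of stack --
	// no problem, since goroutine stacks grow as needed.
	fmt.Println(depth(100000))
}
```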

11 of 96

type g struct {
	stack        stack  // stack offsets
	m            *m     // current m (os thread)
	sched        gobuf  // saves context of g for recovering after context switch
	atomicstatus uint32 // current status of g (runnable, running, waiting)

	// activeStackChans indicates that there are unlocked channels
	// pointing into this goroutine's stack. If true, stack
	// copying needs to acquire channel locks to protect these
	// areas of the stack.
	activeStackChans bool

	// parkingOnChan indicates that the goroutine is about to
	// park on a chansend or chanrecv. Used to signal an unsafe point
	// for stack shrinking. It's a boolean value, but is updated atomically.
	parkingOnChan uint8
}

12 of 96

🧠 so, when does a goroutine run?

13 of 96

enter the Go scheduler

14 of 96

G = goroutine
M = OS thread
P = logical processor

15 of 96

G = goroutine: the code to execute
M = OS thread: where to execute it
P = logical processor: rights + resources to execute it

16 of 96

M:N scheduler

[diagram: many goroutines (G) multiplexed onto a few OS threads (M)]

17 of 96

G = goroutine: the code to execute

18 of 96

G = goroutine: the code to execute

can be in one of three states:

  1. blocked
  2. runnable
  3. running

19 of 96

the scheduler’s job:

running goroutines as efficiently as possible

G = goroutine: the code to execute

can be in one of three states:

  • blocked
  • runnable
  • running
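As a small sketch of these state transitions (runtime.Gosched is the real API; the helper name is mine): a running goroutine can voluntarily go back to runnable, and a channel receive moves it to blocked:

```go
package main

import (
	"fmt"
	"runtime"
)

// yieldAndWait spawns a goroutine, yields the current one, then blocks
// until the other goroutine finishes.
func yieldAndWait() string {
	done := make(chan string)
	go func() { done <- "ran" }() // a new runnable G

	// Gosched moves this goroutine from running back to runnable,
	// letting the scheduler pick another G for the thread.
	runtime.Gosched()

	return <-done // blocked until the other goroutine sends
}

func main() {
	fmt.Println(yieldAndWait())
}
```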

20 of 96

G = goroutine: the code to execute
M = OS thread: where to execute it
P = logical processor: rights & resources to execute it

21 of 96

main() runs on the main execution thread

[diagram: P0, M, and G running func main() {...}]

22 of 96

Now, we create our processors

[diagram: P0, M, and G running func main(); idle processors P1, P2, P3]

23 of 96

Now, we create our processors

# of processors is equal to the number of logical cores, or GOMAXPROCS

[diagram: P0, M, and G running func main(); idle processors P1, P2, P3]

24 of 96

main() spawns a goroutine!

[diagram: func main() spawns a new goroutine, G1; P0, M, and G keep running main; idle processors P1, P2, P3]

25 of 96

G1 wakes up an idle P

[diagram: G1 wakes the idle processor P3; P0, M, and G keep running func main()]


27 of 96

P3 creates an M, with an OS thread to run G1

[diagram: P3 gets an M with an OS thread and runs G1; P0, M, and G keep running func main(); idle processors P1, P2]


29 of 96

G1 finishes executing, now P3 and M are idle

[diagram: G1 is gone; P3 and its M are now idle; P0, M, and G keep running func main()]

30 of 96

G1 finishes executing, now P3 and M are idle

[diagram: P3 returns to the idle processors; its M moves to the idle threads list]

31 of 96

what do we really need P for?

32 of 96

[diagram: P0, M, and G running func main(); P1 and its M running G1]

33 of 96

[diagram: G1 (on P1's M) spawns a new goroutine, G2]


35 of 96

G2 gets added to the local run queue (LRQ) of the current P

[diagram: P1's M still running G1, with G2 waiting in P1's LRQ]

36 of 96

When P1 becomes available again, it looks in its LRQ for a G

[diagram: P1's M running G1, with G2 queued in the LRQ]

37 of 96

When P1 becomes available again, it looks in its LRQ for a G

[diagram: G1 finished; P1's M picks up G2 from the LRQ]


39 of 96

P knows what G’s are available for an M to run

[diagram: P1's M running G2; P0, M, and G keep running func main()]

40 of 96

🤔 what about blocked goroutines?

41 of 96

💡the scheduler should maximize execution time on OS threads

42 of 96

what blocks a goroutine?

  • system calls (file I/O)
  • network calls
  • channels

43 of 96

blocking system call

44 of 96

[diagram: P0 and M0 running G0; idle processor P1; idle thread M1; empty LRQ]

45 of 96

G0 makes a blocking system call, invoking the scheduler

[diagram: G0, running on P0's M0, enters a blocking system call]


47 of 96

P0 releases M0 while G0 is still blocking

[diagram: M0, still attached to the blocked G0, is detached from P0]

48 of 96

P0 wakes up M1 to continue executing G’s

[diagram: P0 wakes the idle thread M1; G1 and G2 wait in the LRQ; M0 stays blocked on G0's syscall]


51 of 96

P0 wakes up M1 to continue executing G’s

[diagram: M1 begins running goroutines from the LRQ while M0 stays blocked on G0's syscall]

52 of 96

When G0 unblocks, M0 attempts to find a P

[diagram: G0's system call returns; M0 looks for a free P]


55 of 96

G execution continues!

[diagram: M0 re-acquires a P and resumes running G0]

56 of 96

If M0 can’t find a P, then M0 goes to sleep

[diagram: no free P available; M0 parks on the idle threads list]


59 of 96

If M0 can’t find a P, then M0 goes to sleep

G0 gets added to the GRQ (global run queue)

[diagram: M0 parks on the idle threads list; G0 is about to be handed to the global run queue]

60 of 96

If M0 can’t find a P, then M0 goes to sleep

G0 gets added to the GRQ (global run queue)

[diagram: M0 idle; G0 waiting in the GRQ]

61 of 96

blocking system call

  • The blocking G hangs onto the M executing the system call, while the P is released to run other goroutines!
  • Some system calls are expected to be quick! The Go runtime does a slight optimization, only doing this handoff for expensive syscalls

62 of 96

blocking network call

63 of 96

blocking network call

  • The net/http package spawns a goroutine for each incoming connection by default
  • Within Go, we can interact with network requests as if they were blocking, which makes our lives as Go developers easy!
  • In the runtime, this is taken care of by the netpoller: the netpoller's job is to park goroutines waiting on asynchronous network I/O, while providing a blocking interface

64 of 96

[diagram: P0 and M0 running G0, with G1 and G2 in the LRQ]

65 of 96

G0 wants to make a network call

[diagram: G0, running on P0's M0, initiates a network call]


67 of 96

G0 wants to make a network call

[diagram: the net poller appears alongside P0, M0, and G0]

68 of 96

G0 wants to make a network call, so G0 is moved to the net poller!

Net poller has its own OS thread, and handles events from goroutines doing network I/O

Interfaces with the OS to poll the appropriate network sockets

Re-schedules goroutines when their network resource is available!

[diagram: G0 handed off to the net poller; P0 and M0 continue with G1 and G2 in the LRQ]

69 of 96

netpoll interface

// func netpollinit()
//
// func netpollopen(fd uintptr, pd *pollDesc) int32
//	Arm edge-triggered notifications for fd. The pd argument is to pass
//	back to netpollready when fd is ready. Return an errno value.
//
// func netpoll(delta int64) gList
//	Poll the network. If delta < 0, block indefinitely. If delta == 0,
//	poll without blocking. If delta > 0, block for up to delta nanoseconds.
//	Return a list of goroutines built by calling netpollready.
//
// func netpollBreak()
//	Wake up the network poller, assumed to be blocked in netpoll.
//
// func netpollIsPollDescriptor(fd uintptr) bool
//	Reports whether fd is a file descriptor used by the poller.


71 of 96

// poll network if not polled for more than 10ms
lastpoll := int64(atomic.Load64(&sched.lastpoll))
if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
	atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
	list := netpoll(0) // non-blocking - returns list of goroutines
	if !list.empty() {
		...
		injectglist(&list)
	}
}

72 of 96

G0 wants to make a network call, so G0 is moved to the net poller!

[diagram: G0 moves off M0 and onto the net poller]

73 of 96

G0 wants to make a network call, so G0 is moved to the net poller!

[diagram: G0 parked on the net poller; M0 picks up G1, with G2 in the LRQ]

74 of 96

When the network call completes, the net poller moves G0 back onto the LRQ

[diagram: net poller still holding G0; P0's M0 running G1, with G2 in the LRQ]

75 of 96

When the network call completes, the net poller moves G0 back onto the LRQ

[diagram: G0 moved from the net poller back into P0's LRQ]

76 of 96

[diagram: normal execution resumes with G0 back in the run queue]

77 of 96

blocking on channels

78 of 96

[diagram: P0 and M0 running G0, with G1 and G2 in the LRQ]

79 of 96

G0 wants to send on a full channel

ch <- foo

[diagram: G0, running on P0's M0, executes ch <- foo]

80 of 96

G0 wants to send on a full channel, and blocks

ch <- foo

[diagram: G0 blocks on the full channel]

81 of 96

G0 wants to send on a full channel, and blocks, adding G0 to ch’s sendq

ch <- foo

[diagram: ch's sendq now holds G0]

82 of 96

Scheduler removes G0 from M0, executing the next G

ch <- foo

[diagram: the scheduler detaches G0 from M0]

83 of 96

Scheduler removes G0 from M0, executing the next G

ch <- foo

[diagram: M0 picks up the next runnable G; G0 stays parked on the channel]

84 of 96

Now, G1 is our savior!

ch <- foo

bar := <-ch

[diagram: G1 runs bar := <-ch]

85 of 96

Now, G1 is our savior!

On the channel receive, ch looks into its sendq.

G1 then calls into the scheduler, making G0 runnable.

ch <- foo

bar := <-ch

[diagram: ch's sendq gives up G0; the scheduler marks G0 runnable]

86 of 96

G0 is added to the LRQ.

bar := <-ch

[diagram: G0 rejoins P0's LRQ]
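The whole dance can be sketched in a few lines (my own example): a send on a full channel parks the sender until a receive frees a slot:

```go
package main

import "fmt"

// sendBoth fills a 1-slot channel, parks a second sender on the channel's
// wait queue, then unblocks it with a receive.
func sendBoth() (string, string) {
	ch := make(chan string, 1)
	ch <- "first" // fills the buffer

	sent := make(chan bool)
	go func() {
		ch <- "second" // blocks: this G is parked on ch's wait queue
		sent <- true
	}()

	a := <-ch // frees a slot; the scheduler makes the parked sender runnable
	<-sent    // wait until "second" has actually been sent
	b := <-ch
	return a, b
}

func main() {
	a, b := sendBoth()
	fmt.Println(a, b) // first second
}
```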

87 of 96

work stealing between threads

88 of 96

[diagram: P0's M0 running G0 with G1, G2, G3 in its LRQ; P1's M1 running G4 with G5, G6, G7 in its LRQ]

89 of 96

[diagram: P0's LRQ is now empty; P1 still has G4 through G7 to run]

90 of 96

P0 tries to steal work from another P.

[diagram: P0, out of work, looks at P1's LRQ]


92 of 96

P0 tries to steal work from another P.

[diagram: P0 steals half of P1's LRQ]

93 of 96

P0 tries to steal work from another P.

[diagram: the stolen G's now run on P0's M0; P1's M1 continues with its remaining G's]

94 of 96

Ensures that threads don’t remain idle while there is work to do!

[diagram: both Ps busy running goroutines]

95 of 96

🙇‍♀️ now, we’re goroutine experts!

goroutine stacks are lightweight & growable

M:N scheduling allows for efficient use of OS threads

the scheduler maximizes OS thread time spent executing, seamlessly context-switching for:

  • long OS syscalls,
  • network I/O, and
  • channel blocking

exposing convenient blocking interfaces for us to use as Go developers!

96 of 96