1 of 96

Goroutines:

Under the Hood

Vicki Niu

@vickiniu

2 of 96

we ❤️ Go’s concurrency primitives

3 of 96

but what is a goroutine?

4 of 96

“goroutines are lightweight threads”

5 of 96

the lightweight goroutine vs the OS thread

6 of 96

the lightweight goroutine leverages the OS thread

7 of 96

Fast facts about goroutines

  • Started and managed by the Go runtime
  • Start with just 2 KB of memory, but have growable stacks
  • Cheap, seamless context switching, handled by the Go runtime scheduler
  • Native inter-goroutine communication via channels

In general, it's trivial to run millions of goroutines on a given machine!
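As a quick illustration (my own sketch, not from the talk), spawning a hundred thousand goroutines is unremarkable:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawn launches n goroutines that each increment a shared counter,
// then waits for all of them to finish.
func spawn(n int) int64 {
	var (
		wg    sync.WaitGroup
		count int64
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&count, 1)
		}()
	}
	wg.Wait()
	return count
}

func main() {
	// 100,000 goroutines at ~2 KB of initial stack each is only a few
	// hundred MB at worst -- trivial for the runtime to manage.
	fmt.Println(spawn(100000))
}
```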

8 of 96

👀 so... what is a goroutine?

9 of 96

type g struct {
	stack        stack  // stack offsets
	m            *m     // current m (os thread)
	sched        gobuf  // saves context of g for recovering after context switch
	atomicstatus uint32 // current status of g (runnable, running, waiting)

	// activeStackChans indicates that there are unlocked channels
	// pointing into this goroutine's stack. If true, stack
	// copying needs to acquire channel locks to protect these
	// areas of the stack.
	activeStackChans bool

	// parkingOnChan indicates that the goroutine is about to
	// park on a chansend or chanrecv. Used to signal an unsafe point
	// for stack shrinking. It's a boolean value, but is updated atomically.
	parkingOnChan uint8
}

10 of 96

goroutine stack

All goroutines are initially allocated 2 KB of memory

Each Go function has a small preamble that calls morestack if the goroutine is about to run out of stack

The runtime then allocates a new stack segment of double the size, copies the old segment over, and resumes execution

Effectively, this makes goroutine stacks growable almost without limit, with efficient shrinking too!
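A tiny sketch (my own example) that leans on growable stacks: recursion this deep would blow a fixed 2 KB stack, but the runtime grows the stack on demand:

```go
package main

import "fmt"

// depth recurses n times; each call adds a frame, forcing the runtime
// to grow this goroutine's stack (via morestack) far past the initial 2 KB.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	var pad [128]byte // make each frame noticeably large
	_ = pad
	return 1 + depth(n-1)
}

func main() {
	// 100,000 frames of >128 bytes each is megabytes of stack --
	// no problem, since goroutine stacks grow as needed.
	fmt.Println(depth(100000))
}
```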

11 of 96

type g struct {
	stack        stack  // stack offsets
	m            *m     // current m (os thread)
	sched        gobuf  // saves context of g for recovering after context switch
	atomicstatus uint32 // current status of g (runnable, running, waiting)

	// activeStackChans indicates that there are unlocked channels
	// pointing into this goroutine's stack. If true, stack
	// copying needs to acquire channel locks to protect these
	// areas of the stack.
	activeStackChans bool

	// parkingOnChan indicates that the goroutine is about to
	// park on a chansend or chanrecv. Used to signal an unsafe point
	// for stack shrinking. It's a boolean value, but is updated atomically.
	parkingOnChan uint8
}

12 of 96

🧠 so, when does a goroutine run?

13 of 96

enter the Go scheduler

14 of 96

G = goroutine
M = OS thread
P = logical processor

15 of 96

G = goroutine: the code to execute
M = OS thread: where to execute it
P = logical processor: rights + resources to execute it

16 of 96

M:N scheduler

[diagram: many goroutines (G) multiplexed onto a few OS threads (M)]

17 of 96

G = goroutine: the code to execute

18 of 96

G = goroutine: the code to execute

can be in one of three states:

  1. blocked
  2. runnable
  3. running

19 of 96

the scheduler’s job:

running goroutines as efficiently as possible

G = goroutine: the code to execute

can be in one of three states:

  • blocked
  • runnable
  • running
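As a small sketch of these state transitions (runtime.Gosched is the real API; the helper name is mine): a running goroutine can voluntarily go back to runnable, and a channel receive moves it to blocked:

```go
package main

import (
	"fmt"
	"runtime"
)

// yieldAndWait spawns a goroutine, yields the current one, then blocks
// until the other goroutine finishes.
func yieldAndWait() string {
	done := make(chan string)
	go func() { done <- "ran" }() // a new runnable G

	// Gosched moves this goroutine from running back to runnable,
	// letting the scheduler pick another G for the thread.
	runtime.Gosched()

	return <-done // blocked until the other goroutine sends
}

func main() {
	fmt.Println(yieldAndWait())
}
```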

20 of 96

G = goroutine: the code to execute
M = OS thread: where to execute it
P = logical processor: rights & resources to execute it

21 of 96

main() runs on the main execution thread

[diagram: P0, M, and G running func main() {...}]

22 of 96

Now, we create our processors

[diagram: P0, M, and G running func main(); idle processors P1, P2, P3]

23 of 96

Now, we create our processors

# of processors is equal to the number of logical cores, or GOMAXPROCS

[diagram: P0, M, and G running func main(); idle processors P1, P2, P3]

24 of 96

main() spawns a goroutine!

[diagram: func main() spawns a new goroutine, G1; P0, M, and G keep running main; idle processors P1, P2, P3]

25 of 96

G1 wakes up an idle P

[diagram: G1 wakes the idle processor P3; P0, M, and G keep running func main()]


27 of 96

P3 creates an M, with an OS thread to run G1

[diagram: P3 gets an M with an OS thread and runs G1; P0, M, and G keep running func main(); idle processors P1, P2]


29 of 96

G1 finishes executing, now P3 and M are idle

[diagram: G1 is gone; P3 and its M are now idle; P0, M, and G keep running func main()]

30 of 96

G1 finishes executing, now P3 and M are idle

[diagram: P3 returns to the idle processors; its M moves to the idle threads list]

31 of 96

what do we really need P for?

32 of 96

[diagram: P0, M, and G running func main(); P1 and its M running G1]

33 of 96

[diagram: G1 (on P1's M) spawns a new goroutine, G2]


35 of 96

G2 gets added to the local run queue (LRQ) of the current P

[diagram: P1's M still running G1, with G2 waiting in P1's LRQ]

36 of 96

When P1 becomes available again, it looks in its LRQ for a G

[diagram: P1's M running G1, with G2 queued in the LRQ]

37 of 96

When P1 becomes available again, it looks in its LRQ for a G

[diagram: G1 finished; P1's M picks up G2 from the LRQ]


39 of 96

P knows what G’s are available for an M to run

[diagram: P1's M running G2; P0, M, and G keep running func main()]

40 of 96

🤔 what about blocked goroutines?

41 of 96

💡the scheduler should maximize execution time on OS threads

42 of 96

what blocks a goroutine?

  • system calls (file I/O)
  • network calls
  • channels

43 of 96

blocking system call

44 of 96

[diagram: P0 and M0 running G0; idle processor P1; idle thread M1; empty LRQ]

45 of 96

G0 makes a blocking system call, invoking the scheduler

[diagram: G0, running on P0's M0, enters a blocking system call]


47 of 96

P0 releases M0 while G0 is still blocking

[diagram: M0, still attached to the blocked G0, is detached from P0]

48 of 96

P0 wakes up M1 to continue executing G’s

[diagram: P0 wakes the idle thread M1; G1 and G2 wait in the LRQ; M0 stays blocked on G0's syscall]


51 of 96

P0 wakes up M1 to continue executing G’s

[diagram: M1 begins running goroutines from the LRQ while M0 stays blocked on G0's syscall]

52 of 96

When G0 unblocks, M0 attempts to find a P

[diagram: G0's system call returns; M0 looks for a free P]


55 of 96

G execution continues!

[diagram: M0 re-acquires a P and resumes running G0]

56 of 96

If M0 can’t find a P, then M0 goes to sleep

[diagram: no free P available; M0 parks on the idle threads list]


59 of 96

If M0 can’t find a P, then M0 goes to sleep

G0 gets added to the GRQ (global run queue)

[diagram: M0 parks on the idle threads list; G0 is about to be handed to the global run queue]

60 of 96

If M0 can’t find a P, then M0 goes to sleep

G0 gets added to the GRQ (global run queue)

[diagram: M0 idle; G0 waiting in the GRQ]

61 of 96

blocking system call

  • The blocking G hangs onto the M executing the system call, while the P is released to run other goroutines!
  • Some system calls are expected to be quick! The Go runtime does a slight optimization, only doing this handoff for expensive syscalls

62 of 96

blocking network call

63 of 96

blocking network call

  • The net/http package spawns a goroutine for each incoming connection by default
  • Within Go, we can interact with network requests as if they were blocking, which makes our lives as Go developers easy!
  • In the runtime, this is taken care of by the netpoller: the netpoller's job is to park goroutines waiting on asynchronous network I/O, while providing a blocking interface

64 of 96

[diagram: P0 and M0 running G0, with G1 and G2 in the LRQ]

65 of 96

G0 wants to make a network call

[diagram: G0, running on P0's M0, initiates a network call]


67 of 96

G0 wants to make a network call

[diagram: the net poller appears alongside P0, M0, and G0]

68 of 96

G0 wants to make a network call, so G0 is moved to the net poller!

Net poller has its own OS thread, and handles events from goroutines doing network I/O

Interfaces with the OS to poll the appropriate network sockets

Re-schedules goroutines when their network resource is available!

[diagram: G0 handed off to the net poller; P0 and M0 continue with G1 and G2 in the LRQ]

69 of 96

netpoll interface

// func netpollinit()
//
// func netpollopen(fd uintptr, pd *pollDesc) int32
//	Arm edge-triggered notifications for fd. The pd argument is to pass
//	back to netpollready when fd is ready. Return an errno value.
//
// func netpoll(delta int64) gList
//	Poll the network. If delta < 0, block indefinitely. If delta == 0,
//	poll without blocking. If delta > 0, block for up to delta nanoseconds.
//	Return a list of goroutines built by calling netpollready.
//
// func netpollBreak()
//	Wake up the network poller, assumed to be blocked in netpoll.
//
// func netpollIsPollDescriptor(fd uintptr) bool
//	Reports whether fd is a file descriptor used by the poller.


71 of 96

// poll network if not polled for more than 10ms
lastpoll := int64(atomic.Load64(&sched.lastpoll))
if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
	atomic.Cas64(&sched.lastpoll, uint64(lastpoll), uint64(now))
	list := netpoll(0) // non-blocking - returns list of goroutines
	if !list.empty() {
		...
		injectglist(&list)
	}
}

72 of 96

G0 wants to make a network call, so G0 is moved to the net poller!

[diagram: G0 moves off M0 and onto the net poller]

73 of 96

G0 wants to make a network call, so G0 is moved to the net poller!

[diagram: G0 parked on the net poller; M0 picks up G1, with G2 in the LRQ]

74 of 96

When the network call completes, the net poller moves G0 back onto the LRQ

[diagram: net poller still holding G0; P0's M0 running G1, with G2 in the LRQ]

75 of 96

When the network call completes, the net poller moves G0 back onto the LRQ

[diagram: G0 moved from the net poller back into P0's LRQ]

76 of 96

[diagram: normal execution resumes with G0 back in the run queue]

77 of 96

blocking on channels

78 of 96

[diagram: P0 and M0 running G0, with G1 and G2 in the LRQ]

79 of 96

G0 wants to send on a full channel

ch <- foo

[diagram: G0, running on P0's M0, executes ch <- foo]

80 of 96

G0 wants to send on a full channel, and blocks

ch <- foo

[diagram: G0 blocks on the full channel]

81 of 96

G0 wants to send on a full channel, and blocks, adding G0 to ch’s sendq

ch <- foo

[diagram: ch's sendq now holds G0]

82 of 96

Scheduler removes G0 from M0, executing the next G

ch <- foo

[diagram: the scheduler detaches G0 from M0]

83 of 96

Scheduler removes G0 from M0, executing the next G

ch <- foo

[diagram: M0 picks up the next runnable G; G0 stays parked on the channel]

84 of 96

Now, G1 is our savior!

ch <- foo

bar := <-ch

[diagram: G1 runs bar := <-ch]

85 of 96

Now, G1 is our savior!

On the channel receive, ch looks into its sendq.

G1 then calls into the scheduler, making G0 runnable.

ch <- foo

bar := <-ch

[diagram: ch's sendq gives up G0; the scheduler marks G0 runnable]

86 of 96

G0 is added to the LRQ.

bar := <-ch

[diagram: G0 rejoins P0's LRQ]
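The whole dance can be sketched in a few lines (my own example): a send on a full channel parks the sender until a receive frees a slot:

```go
package main

import "fmt"

// sendBoth fills a 1-slot channel, parks a second sender on the channel's
// wait queue, then unblocks it with a receive.
func sendBoth() (string, string) {
	ch := make(chan string, 1)
	ch <- "first" // fills the buffer

	sent := make(chan bool)
	go func() {
		ch <- "second" // blocks: this G is parked on ch's wait queue
		sent <- true
	}()

	a := <-ch // frees a slot; the scheduler makes the parked sender runnable
	<-sent    // wait until "second" has actually been sent
	b := <-ch
	return a, b
}

func main() {
	a, b := sendBoth()
	fmt.Println(a, b) // first second
}
```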

87 of 96

work stealing between threads

88 of 96

[diagram: P0's M0 running G0 with G1, G2, G3 in its LRQ; P1's M1 running G4 with G5, G6, G7 in its LRQ]

89 of 96

[diagram: P0's LRQ is now empty; P1 still has G4 through G7 to run]

90 of 96

P0 tries to steal work from another P.

[diagram: P0, out of work, looks at P1's LRQ]


92 of 96

P0 tries to steal work from another P.

[diagram: P0 steals half of P1's LRQ]

93 of 96

P0 tries to steal work from another P.

[diagram: the stolen G's now run on P0's M0; P1's M1 continues with its remaining G's]

94 of 96

Ensures that threads don’t remain idle while there is work to do!

[diagram: both Ps busy running goroutines]

95 of 96

🙇‍♀️ now, we’re goroutine experts!

goroutine stacks are lightweight & growable

M:N scheduling allows for efficient use of OS threads

the scheduler maximizes OS thread time spent executing, seamlessly context-switching for:

  • long OS syscalls,
  • network I/O, and
  • channel blocking

exposing convenient blocking interfaces for us to use as Go developers!

96 of 96