1 of 77

Performance patches

in Go 1.11

Aliaksandr Valialkin, VictoriaMetrics

2 of 77

About me

I like Go and performance optimizations
I’m the author of fasthttp, fastjson, fastrpc, fastrand, quicktemplate and many other projects - see https://github.com/valyala/
Now I work on the fastest time series DB - VictoriaMetrics. It is written in Go

3 of 77

Agenda

Compiler and runtime optimizations
Math/big optimizations (aka ‘crypto-optimizations’)
Standard library optimizations
Arch-specific optimizations

4 of 77

Compiler and runtime optimizations

5 of 77

109918 cmd/compile:

refactor inlining parameters; inline panic

6 of 77

What is inlining?

Inlining is the process of embedding function code into the place of function call
Inlining eliminates function call overhead
Inlining opens additional optimization opportunities for the compiler
Inlining may improve performance

7 of 77

What is inlining?

But sometimes inlining may hurt performance
Big functions’ inlining may lead to binary size bloat and bad performance if the resulting binary code stops fitting CPU instruction cache
So it is better to inline small functions
Go compiler performs basic inlining

8 of 77

Inline functions with panic

Go 1.11 may inline functions with panic()
Panic may be implicit:

If slice element is accessed without explicit or provable bounds check, then the compiler translates a[i] into

if i < 0 || i >= len(a) {

panic(“out of bounds access”)

}

a[i]

9 of 77

Inline functions with panic

Go 1.11 may inline functions with panic()
Panic may be implicit:

If struct field is accessed via struct pointer without explicit or provable nill check, then the compiler may translate foo.bar for foo *T into

if foo == nil {

panic(“nil dereference”)

}

foo.bar

10 of 77

Inline functions with panic

Go 1.11 is able to inline the following function:

type T struct {

N int

}

func f(a []int, b *T) {

b.N = a[2]

}

11 of 77

110055 cmd/compile:

optimize map-clearing range idiom

12 of 77

Optimize map-clearing range idiom

Go 1.11 now detects and optimizes the following code

func clearMap(m map[K]V) {

for k := range m {

delete(m, k)

}

13 of 77

Optimize map-clearing range idiom

It is substituted by something like:

func clearMap(m map[K]V) {

// fastClearMap is an optimized function

// for clearing the map. It preserves m capacity

// in order to reduce overhead during

// subsequent additions into the map

fastClearMap(m)

}

14 of 77

109517 cmd/compile:

optimize append(x, make([]T, y)...)

slice extension

15 of 77

Optimize append(x, make([]T, y)...)

Previously the following code was frequently used for fast slice extension:

func growSlice(a []T, itemsToAdd int) []T {

newSize := len(a) + itemsToAdd

for cap(a) < newSize {

a = append(a[:cap(a)], T{})

}

return a[:newSize]

}

16 of 77

Optimize append(x, make([]T, y)...)

Now this code may be substituted by simpler and faster code:

func growSlice(a []T, itemsToAdd int) []T {

return append(a, make([]T, itemsToAdd)...)

}

Previously such code wasn’t optimal, since Go performed an unnecessary allocation for make([]T, itemsToAdd).

17 of 77

91557 cmd/compile:

avoid extra mapaccess in "m[k] op= r"

18 of 77

Avoid extra mapaccess in “m[k] op= r”

Now the following code works faster:

func countWords(words []string) map[string]int {

m := make(map[string]int)

for _, w := range words {

m[w] += 1 // This line works faster in Go 1.11

}

return m

}

19 of 77

100838 cmd/compile:

avoid mapaccess at m[k]=append(m[k]..

20 of 77

Avoid mapaccess at m[k]=append(m[k]...)

Now the following code works faster:

func groupWordsByLen(words []string) map[int][]string {

m := make(map[int][]string)

for _, w := range words {

wLen := len(w)

// The following line works faster in Go 1.11

m[wLen] = append(m[wLen], w)

}

return m

}

21 of 77

84055 cmd/compile/internal/ssa:

update regalloc in loops

22 of 77

Update regalloc in loops

Improves performance for the following code by using better register allocation inside loops:

for ... {� if hard_case {� call()� }� // simple case, without call� }

23 of 77

100718 cmd/compile:

specialize Move up to 79B on amd64

24 of 77

Specialize Move up to 79B on amd64

Improves performance when copying structs and arrays with sizes from 32 bytes to 79 bytes
Benchmark results:

CopyFat24-4 0.80ns ± 0% 0.40ns ± 0% -50.00% (p=0.001 n=8+9)�CopyFat32-4 2.01ns ± 0% 0.40ns ± 0% -80.10% (p=0.000 n=8+8)�CopyFat64-4 2.87ns ± 0% 0.40ns ± 0% -86.07% (p=0.000 n=8+10)

25 of 77

Bounds check elimination (BCE) improvements

26 of 77

What is bounds check elimination (BCE)?

For a[i] go checks whether i exceeds a bounds by adding the following guard code before each a[i]:

if i < 0 || i >= len(a) {

panic(“out of bounds access”)

}

This code sometimes becomes redundant if a similar check already exists before a[i]
Detection with subsequent removal of such guard code is called BCE
BCE improves performance

27 of 77

BCE improvements

Go 1.11 contains many patches for detecting and eliminating more bounds checks comparing to previous go versions. Here are a few of such patches:

104037 cmd/compile: in prove, complete support for OpIsInBounds/OpIsSliceInBounds
100277 cmd/compile: in prove, add transitive closure of relations
100278 cmd/compile: in prove, infer unsigned relations while branching
104038 cmd/compile: implement loop BCE in prove
104041 cmd/compile: in prove, detect loops with negative increments
105635 cmd/compile: teach prove to handle expressions like len(s)-delta
109776 cmd/compile: simplify shifts using bounds from prove pass
102601 cmd/compile: teach prove about relations between constants
102602 cmd/compile: derive len/cap relations in factsTable.update

28 of 77

BCE improvements example 1

The following code works faster in go 1.11:

func HasSuffix(s, suffix string) bool {� return len(s) >= len(suffix) && s[len(s)-len(suffix):] == suffix� }

Because the compiler became smart enough to eliminate redundant bounds check for s[len(s)-len(suffix):]

29 of 77

BCE improvements example 2

The following code works faster in go 1.11:

x := 0

for i := len(a); i > 0; i-- {

x += int(a[i-1])

}

return x

Because the compiler is smart enough to detect that a[i-1] cannot go out of bounds inside the loop

30 of 77

BCE improvements example 3

The following code works faster in go 1.11:

if i < 0 || i >= len(b) {� return� }� for j := 0; j < i; j++ {� b[j]++� }

Because the compiler removes redundant bounds check at b[j]

31 of 77

105257 cmd/compile:

in escape analysis, propagate loop depth to field

32 of 77

What is escape analysis?

Working with heap variables is usually slower than working with stack variables due to garbage collector overhead
Initially all the variables go to heap, since it is unsafe to store certain variables on stack
Go compiler tries detecting which variables don’t escape from the stack scope and may be safely put on stack
This process is called escape analysis

33 of 77

Escape analysis improvements in Go 1.11

Now the following code works faster:

type T struct { x int }��func f(t *T) {� var y *int� for i := 0; i < 2; i++ {� y = &t.x� *y = 1� }�}

Because the compiler puts t on stack instead of heap

34 of 77

80144 runtime:

use private futexes on Linux

35 of 77

Use private futexes on Linux

Previously Go unnecessarily used shared futexes for synchronization primitives.
Now code with heavy use of synchronization primitives should work slightly faster on Linux.
See https://lwn.net/Articles/229668/ for differences between shared and private futexes.

36 of 77

Stack copy optimizations

37 of 77

Stack copy optimizations

Go copies goroutine stack each time the stack must grow beyond its’ capacity
Initial goroutine stack capacity is 2Kb
When it outgrows 2Kb, go copies the stack into a new memory with bigger size
Stack copies may become a bottleneck for a program with frequently created goroutines with deep call stacks. For instance, busy web server with a lot of middleware

38 of 77

Stack copy optimizations

Go 1.11 contains the following optimizations for stack copy:

94029 runtime: speed up stack copying a little
104737 runtime: avoid calling adjustpointers unnecessarily
108945 runtime: add fast version of getArgInfo
109716 runtime: iterate over set bits in adjustpointers
109001 runtime: allow inlining of stackmapdata
104175 cmd/compile: shrink liveness maps

These optimizations collectively improve stack copy performance by ~2X

39 of 77

math/big optimizations

40 of 77

math/big optimizations

math/big is mostly used in public key cryptography
Public key cryptography is used in https during session establishing
So, math/big optimizations usually speed up https

41 of 77

99838 math/big:

reduce amount of copying

in Montgomery multiplication

42 of 77

What is Montgomery modular multiplication?

Montgomery modular multiplication is an optimized algorithm for calculating a * b (mod q), where a and b - big integers and q - big prime number
See boring details at https://en.wikipedia.org/wiki/Montgomery_modular_multiplication
Modular multiplication is the core of many public key cryptography algorithms

43 of 77

Reduce amount of copying in Montgomery multiplication

Benchmark results:

name old time/op new time/op delta�RSA2048Decrypt-8 1.73ms ± 2% 1.55ms ± 2% -10.19% (p=0.000 n=10+10)�RSA2048Sign-8 2.17ms ± 2% 2.00ms ± 2% -7.93% (p=0.000 n=10+10)�3PrimeRSA2048Decrypt-8 1.10ms ± 2% 0.96ms ± 2% -13.03% (p=0.000 n=10+9)

44 of 77

99615 math/big:

implement Atkin's ModSqrt for 5 mod 8 primes

45 of 77

What is Atkin’s ModSqrt for 5 mod 8 algorithm?

This is an algorithm for fast calculating sqrt(x) (mod q) for Atkin’s prime q = 2^n+1 (mod 2^(n+1)) where n = 2
q=5 is the first such Atkin’s prime, since 5 (mod 8) = 5.
Boring details are available at https://ieeexplore.ieee.org/document/6504967/
It looks like the algorithm covers 1/4 or 25% of all the primes
So the patch speeds up ModSqrt calculation in 25% cases

46 of 77

Implement Atkin’s ModSqrt for 5 mod 8 primes

This improves performance for big.Int.ModSqrt in 25% cases:

ModSqrt231_5Mod8-4 1.03ms ± 2% 0.36ms ± 5% -65.06% (p=0.008 n=5+5)

47 of 77

105075 math/big:

specialize Karatsuba implementation

for squaring

48 of 77

What is squaring?

Squaring is just X * X calculation. Simple. Isn’t it?
This is simple for small integers. What about big integers with thouthands of decimal digits?
Squaring becomes complicated and slows down for big integers
Squaring is used in more complex algorithms such as a pow b calculation
These algorithms are used in more powerful algorithms for public key cryptography
There are special algorithms that may improve squaring performance for big integers

49 of 77

What is Karatsuba squaring?

Karatsuba multiplication is an optimized multiplication algorithm for big integers
Boring details may be found at https://www.hindawi.com/journals/jam/2014/107109/ .
Squaring is a special case for multiplication
So, Karatsuba squaring is just a special case for Karatsuba multiplication

50 of 77

Specialize Karatsuba implementation for squaring

Improves performance for x*x where x - big integer:

NatSqr/500-4 81.9µs ± 1% 67.0µs ± 1% -18.25% (p=0.000 n=48+48)�NatSqr/800-4 161µs ± 1% 140µs ± 1% -13.29% (p=0.000 n=47+48)�NatSqr/1000-4 245µs ± 1% 207µs ± 1% -15.17% (p=0.000 n=49+49)

51 of 77

78755 math/big:

implement Lehmer's extended GCD algorithm

52 of 77

Boring details about extended GCD algorithm

Extended GCD is an extension to the Euclidean algorithm for finding the following numbers:

Greatest common divisor for two integers, a and b, i.e. gcd(a, b). For instance gcd(10, 15)=5, because 10/5=2 and 15/5=3. Remember, how you did this in elementary school? :)
Integer coefficients, x and y, such that ax + by = gcd(a,b). For instance, 10*(-1) + 15*1 = 5

GCD(a, b) = 1 if a and b have no common divisors, i.e. if they are co-prime
Co-prime numbers are frequently used in public key cryptography
Cryptographers like calculating x = 1 / a (mod b) such that ax = 1 (mod b) aka modular multiplicative inverse . This is essentially x coefficient from the extended GCD
It is used in RSA algorithm, which may be used in https handshake

53 of 77

What is the Lehmer’s GCD algorithm?

Lehmer’s GCD algorithm is a modern optimized version of the ancient algorithm from Euclide. Read boring details at https://en.wikipedia.org/wiki/Lehmer%27s_GCD_algorithm :)
Lehmer’s GCD algorithm outperforms Euclidean algorithm for big integers
Big integers are used in public key cryptography
Public key cryptography is used in https
So, Lehmer’s GCD algorithm should improve https performance

54 of 77

Improve performance for extended GCD algorithm

Improves performance of big.Int.ModInverse used in https handshake
Benchmark results:

name old time/op new time/op delta�GCD100x100/WithXY-4 19.3µs ± 0% 3.9µs ± 1% -79.58% (p=0.008 n=5+5)�GCD100x1000/WithXY-4 22.8µs ± 1% 7.5µs ±10% -67.00% (p=0.008 n=5+5)�GCD100x10000/WithXY-4 75.1µs ± 2% 30.5µs ± 2% -59.38% (p=0.008 n=5+5)�GCD100x100000/WithXY-4 542µs ± 2% 267µs ± 2% -50.79% (p=0.008 n=5+5)�GCD1000x1000/WithXY-4 329µs ± 0% 42µs ± 1% -87.12% (p=0.008 n=5+5)�GCD1000x10000/WithXY-4 607µs ± 9% 123µs ± 1% -79.70% (p=0.008 n=5+5)�GCD1000x100000/WithXY-4 3.64ms ± 1% 0.93ms ± 1% -74.41% (p=0.016 n=4+5)�GCD10000x10000/WithXY-4 7.44ms ± 1% 1.00ms ± 0% -86.58% (p=0.008 n=5+5)�GCD10000x100000/WithXY-4 37.3ms ± 0% 7.3ms ± 1% -80.45% (p=0.008 n=5+5)�GCD100000x100000/WithXY-4 505ms ± 1% 56ms ± 1% -88.92% (p=0.008 n=5+5)

55 of 77

74851 math/big:

speed-up addMulVVW on amd64

56 of 77

Speed up addMulVVW on amd64

This is arch-specific patch for GOARCH=amd64, which improves performance for a += b * c calculations on big integers
Improves performance for https handshake:

RSA2048Decrypt-8 1.61ms ± 1% 1.38ms ± 1% -14.13% (p=0.000 n=10+10)�RSA2048Sign-8 1.93ms ± 1% 1.70ms ± 1% -11.86% (p=0.000 n=10+10)�3PrimeRSA2048Decrypt-8 932µs ± 0% 828µs ± 0% -11.15% (p=0.000 n=10+10)��HandshakeServer/RSA-8 901µs ± 1% 777µs ± 0% -13.70% (p=0.000 n=10+8)�HandshakeServer/ECDHE-8 1.01ms ± 1% 0.90ms ± 0% -11.53% (p=0.000 n=10+9)

57 of 77

Standard library optimizations

58 of 77

97255 strings:

speed-up replace for byteStringReplacer

59 of 77

Improve performance for strings.Replace

Benchmark results:

Escape-6 34.2µs ± 2% 20.8µs ± 2% -39.06% (p=0.000 n=10+10)�EscapeNone-6 7.04µs ± 1% 1.05µs ± 0% -85.03% (p=0.000 n=10+10)

�ByteStringMatch-6 1.59µs ± 2% 1.17µs ± 2% -26.35% (p=0.000 n=10+10)�HTMLEscapeNew-6 390ns ± 2% 337ns ± 2% -13.62% (p=0.000 n=10+10)�HTMLEscapeOld-6 621ns ± 2% 603ns ± 2% -2.95% (p=0.000 n=10+9)

60 of 77

101715 regexp:

use sync.Pool to cache

regexp.machine objects

61 of 77

Use sync.Pool to cache regexp.machine objects

Removes lock contention when a single regexp is used from concurrently running goroutines
Benchmark results:

BenchmarkMatchParallelShared-4 361 77.9 -78.42%

62 of 77

102235 compress/flate:

optimize huffSym

63 of 77

Improve gzip decompression speed

Benchmark results:

name old time/op new time/op delta�Decode/Digits/Huffman/1e4-6 278µs ± 1% 240µs ± 2% -13.72% (p=0.000 n=10+10)�Decode/Digits/Huffman/1e5-6 2.38ms ± 1% 2.05ms ± 1% -14.12% (p=0.000 n=10+10)�Decode/Digits/Huffman/1e6-6 23.4ms ± 1% 19.9ms ± 0% -14.69% (p=0.000 n=9+9)�Decode/Twain/Huffman/1e4-6 316µs ± 2% 267µs ± 3% -15.30% (p=0.000 n=9+10)�Decode/Twain/Huffman/1e5-6 2.62ms ± 0% 2.22ms ± 0% -15.24% (p=0.000 n=10+10)�Decode/Twain/Huffman/1e6-6 25.7ms ± 1% 21.8ms ± 0% -15.19% (p=0.000 n=10+10)�Decode/Twain/Compression/1e4-6 272µs ± 1% 250µs ± 4% -8.20% (p=0.000 n=9+10)�Decode/Twain/Compression/1e5-6 2.01ms ± 0% 1.84ms ± 1% -8.57% (p=0.000 n=9+10)�Decode/Twain/Compression/1e6-6 19.1ms ± 0% 17.4ms ± 1% -8.75% (p=0.000 n=9+10)

64 of 77

107715 net:

add support for splice(2)

in (*TCPConn).ReadFrom on Linux

65 of 77

What is splice(2)?

Splice is a Linux system call for fast data transfer. See `man 2 splice` or http://man7.org/linux/man-pages/man2/splice.2.html
Splice avoids data copy between kernel space and user space. This is sometimes called zero-copy
See https://lwn.net/Articles/178199/ for boring details about splice internals

66 of 77

Add splice(2) in TCPConn.ReadFrom on Linux

Go 1.11 transparently uses splice when io.Copy is used for copying data between two tcp connections
See details at https://acln.ro/articles/go-splice
Benchmark results:

benchmark old ns/op new ns/op delta�BenchmarkTCPReadFrom/8192-4 5219 4779 -8.43%�BenchmarkTCPReadFrom/16384-4 8708 8008 -8.04%�BenchmarkTCPReadFrom/32768-4 16349 14973 -8.42%�BenchmarkTCPReadFrom/65536-4 35246 27406 -22.24%�BenchmarkTCPReadFrom/131072-4 72920 52382 -28.17%�BenchmarkTCPReadFrom/262144-4 149311 95094 -36.31%�BenchmarkTCPReadFrom/524288-4 306704 181856 -40.71%�BenchmarkTCPReadFrom/1048576-4 674174 357406 -46.99%

67 of 77

116378 net/http:

remove an allocation in ServeMux

68 of 77

Remove an allocation in ServeMux

This is just classical optimization based on reducing memory allocations
Improves performance for net/http.ServeMux:

name old time/op new time/op delta�ServeMux_SkipServe-4 74.2µs ± 2% 60.6µs ± 1% -18.31% (p=0.000 n=10+9)��name old alloc/op new alloc/op delta�ServeMux_SkipServe-4 2.62kB ± 0% 0.00kB ±NaN% -100.00% (p=0.000 n=10+10)��name old allocs/op new allocs/op delta�ServeMux_SkipServe-4 180 ± 0% 0 ±NaN% -100.00% (p=0.000 n=10+10)

69 of 77

Arch-specific optimizations

70 of 77

Arch-specific optimizations

Go may build programs for many architectures, not only for GOARCH=amd64
GOARCH=arm becomes popular. It is used in smartphones and in energy effective servers. See https://blog.cloudflare.com/arm-takes-wing/ .
Go 1.11 has patches significantly optimizing performance for GOARCH=arm

71 of 77

Performance patches for arm

98095 runtime: use vDSO for clock_gettime on linux/arm

This improves time.Now() performance by 2.5x
The patch substitutes system call with access to vDSO - a shared memory visible to all the processes, where system time may be read

83799 runtime: improve arm64 memmove implementation

Memmove is the most frequently executed code in the majority of programs
Memmove just copies byte slice from one place to another
The patch improves memove performance by up to 2x

72 of 77

Performance patches for arm

There are pending patches for arm, which may still be merged into go 1.11:

99755 crypto/elliptic: implement P256 for arm64. Improves https handshake performance by up to 10x
81877 cmd/compile: improve atomic add intrinsics with ARMv8.1 new instruction. Improves atomic.Add* performance by up to 50X
107298 crypto/aes: implement AES-GCM AEAD for arm64. Improves https performance by up to 10x

73 of 77

There are many other (performance) patches

went into Go 1.11…

Interested readers may go to

https://go-review.googlesource.com/q/status:merged+project:go

75 of 77

Summary

Each new Go version contains performance improvements
Programs built with Go 1.11 should work faster comparing to builds with previous Go versions
The future Go versions will be optimized further
Everybody may take part in this process
This is fun and is easier than optimizing gcc / llvm, since the majority of the Go compiler, runtime and standard library is written in Go

76 of 77

Useful links

https://go-review.googlesource.com/q/status:open - patches (with review comments) that may go into the next Go version
https://go-review.googlesource.com/q/status:merged - patches recently merged into Go
https://golang.org/doc/contribute.html - how to contribute to Go. It’s easy - just do it! :)

1 of 77

2 of 77

3 of 77

4 of 77

5 of 77

6 of 77

7 of 77

8 of 77

9 of 77

10 of 77

11 of 77

12 of 77

13 of 77

14 of 77

15 of 77

16 of 77

17 of 77

18 of 77

19 of 77

20 of 77

21 of 77

22 of 77

23 of 77

24 of 77

25 of 77

26 of 77

27 of 77

28 of 77

29 of 77

30 of 77

31 of 77

32 of 77

33 of 77

34 of 77

35 of 77

36 of 77

37 of 77

38 of 77

39 of 77

40 of 77

41 of 77

42 of 77

43 of 77

44 of 77

45 of 77

46 of 77

47 of 77

48 of 77

49 of 77

50 of 77

51 of 77

52 of 77

53 of 77

54 of 77

55 of 77

56 of 77

57 of 77

58 of 77

59 of 77

60 of 77

61 of 77

62 of 77

63 of 77

64 of 77

65 of 77

66 of 77

67 of 77

68 of 77

69 of 77

70 of 77

71 of 77

72 of 77

73 of 77

74 of 77

75 of 77

76 of 77

77 of 77