Performance patches
in Go 1.11
Aliaksandr Valialkin, VictoriaMetrics
About me
Agenda
Compiler and runtime optimizations
109918 cmd/compile:
refactor inlining parameters; inline panic
What is inlining?
Inline functions with panic
if i < 0 || i >= len(a) {
panic("out of bounds access")
}
a[i]
Inline functions with panic
if foo == nil {
panic("nil dereference")
}
foo.bar
Inline functions with panic
type T struct {
N int
}
func f(a []int, b *T) {
b.N = a[2]
}
110055 cmd/compile:
optimize map-clearing range idiom
Optimize map-clearing range idiom
func clearMap(m map[K]V) {
for k := range m {
delete(m, k)
}
}
Optimize map-clearing range idiom
func clearMap(m map[K]V) {
// fastClearMap is an optimized routine
// for clearing the map. It preserves m's capacity
// in order to reduce overhead during
// subsequent additions to the map.
fastClearMap(m)
}
Optimize append(x, make([]T, y)...)
func growSlice(a []T, itemsToAdd int) []T {
newSize := len(a) + itemsToAdd
for cap(a) < newSize {
a = append(a[:cap(a)], T{})
}
return a[:newSize]
}
Optimize append(x, make([]T, y)...)
func growSlice(a []T, itemsToAdd int) []T {
return append(a, make([]T, itemsToAdd)...)
}
91557 cmd/compile:
avoid extra mapaccess in "m[k] op= r"
Avoid extra mapaccess in “m[k] op= r”
func countWords(words []string) map[string]int {
m := make(map[string]int)
for _, w := range words {
m[w] += 1 // This line works faster in Go 1.11
}
return m
}
100838 cmd/compile:
avoid mapaccess at m[k]=append(m[k]..
Avoid mapaccess at m[k]=append(m[k]...)
func groupWordsByLen(words []string) map[int][]string {
m := make(map[int][]string)
for _, w := range words {
wLen := len(w)
// The following line works faster in Go 1.11
m[wLen] = append(m[wLen], w)
}
return m
}
84055 cmd/compile/internal/ssa:
update regalloc in loops
Update regalloc in loops
for ... {
    if hard_case {
        call()
    }
    // simple case, without call
}
100718 cmd/compile:
specialize Move up to 79B on amd64
Specialize Move up to 79B on amd64
CopyFat24-4   0.80ns ± 0%   0.40ns ± 0%  -50.00%  (p=0.001 n=8+9)
CopyFat32-4   2.01ns ± 0%   0.40ns ± 0%  -80.10%  (p=0.000 n=8+8)
CopyFat64-4   2.87ns ± 0%   0.40ns ± 0%  -86.07%  (p=0.000 n=8+10)
Bounds check elimination (BCE) improvements
What is bounds check elimination (BCE)?
if i < 0 || i >= len(a) {
panic("out of bounds access")
}
BCE improvements
BCE improvements example 1
func HasSuffix(s, suffix string) bool {
    return len(s) >= len(suffix) && s[len(s)-len(suffix):] == suffix
}
BCE improvements example 2
x := 0
for i := len(a); i > 0; i-- {
x += int(a[i-1])
}
return x
BCE improvements example 3
if i < 0 || i >= len(b) {
    return
}
for j := 0; j < i; j++ {
    b[j]++
}
105257 cmd/compile:
in escape analysis, propagate loop depth to field
What is escape analysis?
Escape analysis improvements in Go 1.11
type T struct { x int }

func f(t *T) {
    var y *int
    for i := 0; i < 2; i++ {
        y = &t.x
        *y = 1
    }
}
80144 runtime:
use private futexes on Linux
Use private futexes on Linux
Stack copy optimizations
math/big optimizations
What is Montgomery modular multiplication?
Reduce amount of copying in Montgomery multiplication
name                     old time/op  new time/op  delta
RSA2048Decrypt-8         1.73ms ± 2%  1.55ms ± 2%  -10.19%  (p=0.000 n=10+10)
RSA2048Sign-8            2.17ms ± 2%  2.00ms ± 2%   -7.93%  (p=0.000 n=10+10)
3PrimeRSA2048Decrypt-8   1.10ms ± 2%  0.96ms ± 2%  -13.03%  (p=0.000 n=10+9)
99615 math/big:
implement Atkin's ModSqrt for 5 mod 8 primes
What is Atkin's ModSqrt algorithm for 5 mod 8 primes?
Implement Atkin’s ModSqrt for 5 mod 8 primes
ModSqrt231_5Mod8-4 1.03ms ± 2% 0.36ms ± 5% -65.06% (p=0.008 n=5+5)
What is squaring?
What is Karatsuba squaring?
Specialize Karatsuba implementation for squaring
NatSqr/500-4    81.9µs ± 1%  67.0µs ± 1%  -18.25%  (p=0.000 n=48+48)
NatSqr/800-4     161µs ± 1%   140µs ± 1%  -13.29%  (p=0.000 n=47+48)
NatSqr/1000-4    245µs ± 1%   207µs ± 1%  -15.17%  (p=0.000 n=49+49)
78755 math/big:
implement Lehmer's extended GCD algorithm
Boring details about extended GCD algorithm
What is Lehmer's GCD algorithm?
Improve performance for extended GCD algorithm
name                        old time/op  new time/op  delta
GCD100x100/WithXY-4         19.3µs ± 0%    3.9µs ± 1%  -79.58%  (p=0.008 n=5+5)
GCD100x1000/WithXY-4        22.8µs ± 1%    7.5µs ±10%  -67.00%  (p=0.008 n=5+5)
GCD100x10000/WithXY-4       75.1µs ± 2%   30.5µs ± 2%  -59.38%  (p=0.008 n=5+5)
GCD100x100000/WithXY-4       542µs ± 2%    267µs ± 2%  -50.79%  (p=0.008 n=5+5)
GCD1000x1000/WithXY-4        329µs ± 0%     42µs ± 1%  -87.12%  (p=0.008 n=5+5)
GCD1000x10000/WithXY-4       607µs ± 9%    123µs ± 1%  -79.70%  (p=0.008 n=5+5)
GCD1000x100000/WithXY-4     3.64ms ± 1%   0.93ms ± 1%  -74.41%  (p=0.016 n=4+5)
GCD10000x10000/WithXY-4     7.44ms ± 1%   1.00ms ± 0%  -86.58%  (p=0.008 n=5+5)
GCD10000x100000/WithXY-4    37.3ms ± 0%    7.3ms ± 1%  -80.45%  (p=0.008 n=5+5)
GCD100000x100000/WithXY-4    505ms ± 1%     56ms ± 1%  -88.92%  (p=0.008 n=5+5)
74851 math/big:
speed-up addMulVVW on amd64
Speed up addMulVVW on amd64
RSA2048Decrypt-8          1.61ms ± 1%  1.38ms ± 1%  -14.13%  (p=0.000 n=10+10)
RSA2048Sign-8             1.93ms ± 1%  1.70ms ± 1%  -11.86%  (p=0.000 n=10+10)
3PrimeRSA2048Decrypt-8     932µs ± 0%   828µs ± 0%  -11.15%  (p=0.000 n=10+10)

HandshakeServer/RSA-8      901µs ± 1%   777µs ± 0%  -13.70%  (p=0.000 n=10+8)
HandshakeServer/ECDHE-8   1.01ms ± 1%  0.90ms ± 0%  -11.53%  (p=0.000 n=10+9)
Standard library optimizations
97255 strings:
speed-up replace for byteStringReplacer
Improve performance for strings.Replace
Escape-6            34.2µs ± 2%  20.8µs ± 2%  -39.06%  (p=0.000 n=10+10)
EscapeNone-6        7.04µs ± 1%  1.05µs ± 0%  -85.03%  (p=0.000 n=10+10)

ByteStringMatch-6   1.59µs ± 2%  1.17µs ± 2%  -26.35%  (p=0.000 n=10+10)
HTMLEscapeNew-6      390ns ± 2%   337ns ± 2%  -13.62%  (p=0.000 n=10+10)
HTMLEscapeOld-6      621ns ± 2%   603ns ± 2%   -2.95%  (p=0.000 n=10+9)
Use sync.Pool to cache regexp.machine objects
BenchmarkMatchParallelShared-4 361 77.9 -78.42%
102235 compress/flate:
optimize huffSym
Improve gzip decompression speed
name                             old time/op  new time/op  delta
Decode/Digits/Huffman/1e4-6        278µs ± 1%   240µs ± 2%  -13.72%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-6       2.38ms ± 1%  2.05ms ± 1%  -14.12%  (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-6       23.4ms ± 1%  19.9ms ± 0%  -14.69%  (p=0.000 n=9+9)
Decode/Twain/Huffman/1e4-6         316µs ± 2%   267µs ± 3%  -15.30%  (p=0.000 n=9+10)
Decode/Twain/Huffman/1e5-6        2.62ms ± 0%  2.22ms ± 0%  -15.24%  (p=0.000 n=10+10)
Decode/Twain/Huffman/1e6-6        25.7ms ± 1%  21.8ms ± 0%  -15.19%  (p=0.000 n=10+10)
Decode/Twain/Compression/1e4-6     272µs ± 1%   250µs ± 4%   -8.20%  (p=0.000 n=9+10)
Decode/Twain/Compression/1e5-6    2.01ms ± 0%  1.84ms ± 1%   -8.57%  (p=0.000 n=9+10)
Decode/Twain/Compression/1e6-6    19.1ms ± 0%  17.4ms ± 1%   -8.75%  (p=0.000 n=9+10)
What is splice(2)?
Add splice(2) in TCPConn.ReadFrom on Linux
benchmark                        old ns/op  new ns/op  delta
BenchmarkTCPReadFrom/8192-4           5219       4779   -8.43%
BenchmarkTCPReadFrom/16384-4          8708       8008   -8.04%
BenchmarkTCPReadFrom/32768-4         16349      14973   -8.42%
BenchmarkTCPReadFrom/65536-4         35246      27406  -22.24%
BenchmarkTCPReadFrom/131072-4        72920      52382  -28.17%
BenchmarkTCPReadFrom/262144-4       149311      95094  -36.31%
BenchmarkTCPReadFrom/524288-4       306704     181856  -40.71%
BenchmarkTCPReadFrom/1048576-4      674174     357406  -46.99%
116378 net/http:
remove an allocation in ServeMux
Remove an allocation in ServeMux
name                   old time/op    new time/op    delta
ServeMux_SkipServe-4     74.2µs ± 2%    60.6µs ± 1%   -18.31%  (p=0.000 n=10+9)

name                   old alloc/op   new alloc/op   delta
ServeMux_SkipServe-4     2.62kB ± 0%    0.00kB ±NaN%  -100.00%  (p=0.000 n=10+10)

name                   old allocs/op  new allocs/op  delta
ServeMux_SkipServe-4        180 ± 0%        0 ±NaN%  -100.00%  (p=0.000 n=10+10)
Arch-specific optimizations
Performance patches for arm
Many other performance patches
went into Go 1.11…
Interested readers may go to
https://go-review.googlesource.com/q/status:merged+project:go
Summary
Useful links
Questions?