Accelerating data processing in Go
with SIMD instructions
Marat Dukhan
Performance Components
(Top Desktop x86 CPUs from Intel)
SIMD: Single Instruction - Multiple Data
SIMD extensions are sets of special instructions that operate on short fixed-length arrays of elements
SIMD: Special Operations
Often SIMD extensions include instructions for more data-processing operations than regular instruction sets
uint32_t a[4];
uint32_t b[4];
a[0] = a[0] > b[0] ? a[0] : b[0];
a[1] = a[1] > b[1] ? a[1] : b[1];
a[2] = a[2] > b[2] ? a[2] : b[2];
a[3] = a[3] > b[3] ? a[3] : b[3];
uint32x4_t a;
uint32x4_t b;
a = max(a, b); // Special instruction!
5 times fewer instructions in the SIMD version, although the SIMD vector has only 4 elements!
Scalar version: 20 RISC instructions
SIMD version: 4 RISC instructions
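For Go readers, the scalar C snippet above corresponds roughly to the loop below. This is only a sketch: maxScalar is a name made up for illustration, and the standard Go compiler is assumed to emit a compare-and-select per element rather than a single vector instruction. On x86, the SIMD counterpart for unsigned 32-bit lanes would be an instruction such as PMAXUD (SSE4.1).

```go
package main

import "fmt"

// maxScalar mirrors the scalar C code: one compare-and-select per element.
// (Hypothetical helper, not part of any library.)
func maxScalar(a, b [4]uint32) [4]uint32 {
	for i := range a {
		if b[i] > a[i] {
			a[i] = b[i]
		}
	}
	return a
}

func main() {
	a := [4]uint32{1, 7, 3, 9}
	b := [4]uint32{4, 2, 8, 5}
	fmt.Println(maxScalar(a, b)) // prints [4 7 8 9]
}
```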
x86 SIMD Extensions
MMX
i586
MMX+
SSE
SSE2
SSE3
3dnow!
3dnow!+
3dnow! Geode
SSSE3
SSE4A
SSE4.1
SSE4.2
AVX
F16C
FMA3
FMA4
EMMX
XOP
AVX512-F
AVX512-CD
AVX512-ER
AVX512-PF
3dnow! prefetch
AES-NI
PCLMULQDQ
Pentium begot MMX, MMX begot EMMX and MMX+, MMX+ begot 3dnow!, 3dnow! prefetch and SSE, 3dnow! begot 3dnow!+ and SSE begot SSE2, SSE2 begot SSE3, SSE3 begot SSSE3 and SSE4A, SSSE3 begot SSE4.1, SSE4.1 begot SSE4.2, SSE4.2 begot AES-NI, PCLMULQDQ, and AVX, AVX begot F16C, FMA4, and XOP, F16C begot FMA3, FMA3 begot AVX512-F, AVX512-F begot AVX512-CD, AVX512-PF, AVX512-ER, AVX512-BW, AVX512-DQ, AVX512-VL, AVX512-VBMI, and AVX512-IFMA!
AVX512-VBMI
AVX512-IFMA
AVX512-BW
AVX512-VL
AVX512-DQ
Instruction Support Detection: CPUID Instruction
On x86 and x86-64 architectures, the CPUID instruction reveals which instruction set extensions are implemented by the processor
The cpuid package by Klaus Post helps to get this info from Go:
package main

import (
	"fmt"

	"github.com/klauspost/cpuid"
)

func main() {
	if cpuid.CPU.AVX() {
		fmt.Println("AVX: supported")
	} else {
		fmt.Println("AVX: unsupported")
	}
}
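A common way to act on such a check is to pick an implementation once and call it through a function value. The sketch below shows the pattern under assumptions: hasAVX is a hypothetical flag standing in for a real feature test such as cpuid.CPU.AVX(), and sumAVX is a placeholder that merely calls the portable code where an accelerated kernel would go.

```go
package main

import "fmt"

// hasAVX stands in for a real feature check such as cpuid.CPU.AVX().
// It is a hypothetical flag so the sketch needs no external packages.
var hasAVX = false

// sumGeneric is the portable fallback.
func sumGeneric(x []float32) float32 {
	var s float32
	for _, v := range x {
		s += v
	}
	return s
}

// sumAVX is a placeholder for an accelerated kernel (for example one
// written in assembly). Here it simply reuses the fallback.
func sumAVX(x []float32) float32 {
	return sumGeneric(x)
}

// pickSum selects an implementation based on the detected features.
func pickSum(avx bool) func([]float32) float32 {
	if avx {
		return sumAVX
	}
	return sumGeneric
}

func main() {
	sum := pickSum(hasAVX) // decide once, then reuse the function value
	fmt.Println(sum([]float32{1, 2, 3, 4})) // prints 10
}
```

Selecting once at start-up keeps the feature test out of the hot loop.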
Three Ways to Use SIMD Instructions
Compiler auto-vectorization in Go
-compiler=gccgo -gccgoflags="-O3 -march=native -ffast-math"
Go auto-vectorization showcase: dot product
func DotProduct(x, y []float32) (z float32) {
	for i, xi := range x {
		z += xi * y[i]
	}
	return
}
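Floating-point addition is not associative, so a compiler may only vectorize this reduction if it is allowed to reorder the additions, which is what -ffast-math permits. The sketch below writes that reordering out by hand: four independent partial sums play the role of the four lanes of a 128-bit register. The name dotProduct4 and the four-lane width are illustrative assumptions, not what gccgo necessarily generates.

```go
package main

import "fmt"

// dotProduct4 keeps four independent partial sums, the way a 4-lane
// SIMD register would, and combines them only after the main loop.
// (Hypothetical helper for illustration.)
func dotProduct4(x, y []float32) float32 {
	n := len(x)
	if len(y) < n {
		n = len(y)
	}
	var acc [4]float32
	i := 0
	for ; i+4 <= n; i += 4 {
		acc[0] += x[i] * y[i]
		acc[1] += x[i+1] * y[i+1]
		acc[2] += x[i+2] * y[i+2]
		acc[3] += x[i+3] * y[i+3]
	}
	// Horizontal reduction of the four "lanes".
	z := (acc[0] + acc[1]) + (acc[2] + acc[3])
	// Scalar tail for the remaining 0..3 elements.
	for ; i < n; i++ {
		z += x[i] * y[i]
	}
	return z
}

func main() {
	x := []float32{1, 2, 3, 4, 5}
	y := []float32{5, 4, 3, 2, 1}
	fmt.Println(dotProduct4(x, y)) // prints 35
}
```

Because the summation order differs from the simple loop, results on general inputs may differ in the last bits.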
Intrinsic functions in Go
...and never will
Linking with External Implementation
Go provides 4 mechanisms to call into code implemented in other languages:
Using Go Assembler
Both Google’s and gccgo toolchains include assemblers, but the syntax is quite different:
6a Assembler
6a assembler weirdness:
6a Assembler: Example
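// func sliceSum(a []float64) float64
TEXT ·sliceSum(SB), 7, $0
	MOVQ a+0(FP), SI        // SI: &a[0]
	MOVQ a_len+8(FP), DX    // DX: len(a)
	XORPD X0, X0            // X0: first accumulator (2 lanes) = 0
	XORPD X1, X1            // X1: second accumulator (2 lanes) = 0
	MOVQ DX, R10            // R10: len(a)
	SHRQ $2, DX             // DX: len(a) / 4 (main-loop iterations)
	ANDQ $3, R10            // R10: len(a) % 4 (leftover elements)
	CMPQ DX, $0
	JEQ check_remain        // fewer than 4 elements: skip the main loop
next_sum:
	MOVUPD (SI), X2         // load a[i], a[i+1]
	MOVUPD 16(SI), X3       // load a[i+2], a[i+3]
	ADDPD X2, X0
	ADDPD X3, X1
	ADDQ $32, SI            // advance by 4 float64 values
	SUBQ $1, DX
	JNZ next_sum
check_remain:
	CMPQ R10, $0
	JZ done_sum             // no leftovers (also covers an empty slice)
remain_sum:
	MOVSD (SI), X2          // load one leftover element
	ADDSD X2, X0            // add it to the low lane of X0
	ADDQ $8, SI
	SUBQ $1, R10
	JNZ remain_sum
done_sum:
	ADDPD X1, X0            // combine the two accumulators
	MOVAPD X0, X2
	UNPCKHPD X0, X2         // X2: high lane of X0 in both lanes
	ADDSD X2, X0            // X0 low lane: low + high
	MOVSD X0, ret+24(FP)
	RET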
Despite all the weirdness, some Go packages make use of the built-in assembler.
This example is from GitHub.com/akualab/narray
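To make the control flow of the assembly easier to follow, here is a plain-Go sketch of the same algorithm. The name sliceSumRef and the two-element arrays standing in for the X0 and X1 registers are illustrative assumptions. The tail loop here naturally handles an empty slice.

```go
package main

import "fmt"

// sliceSumRef follows the structure of the assembly routine in plain Go.
// (Hypothetical reference helper for illustration.)
func sliceSumRef(a []float64) float64 {
	var x0, x1 [2]float64 // stand-ins for the two-lane X0 and X1 registers
	i := 0
	// Main loop: 4 elements per iteration, 2 per accumulator (ADDPD).
	for ; i+4 <= len(a); i += 4 {
		x0[0] += a[i]
		x0[1] += a[i+1]
		x1[0] += a[i+2]
		x1[1] += a[i+3]
	}
	// Scalar tail: ADDSD only touches the low lane of X0.
	for ; i < len(a); i++ {
		x0[0] += a[i]
	}
	// Final reduction: add X1 into X0, then add the high lane to the low lane.
	x0[0] += x1[0]
	x0[1] += x1[1]
	return x0[0] + x0[1]
}

func main() {
	fmt.Println(sliceSumRef([]float64{1, 2, 3, 4, 5, 6, 7})) // prints 28
}
```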
Linking Go with SysO
SysO files are object files with the .syso extension
TEXT ·DotProduct(SB),NOSPLIT,$0
	JMP DotProduct(SB)
Using CGo and extern
CGo is a tool that automatically generates bindings to C functions for Go. Gotchas:
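// Assumes the C prototype: float DotProduct(const float *x, const float *y, size_t length);
// #include "DotProduct.h"
// #cgo LDFLAGS: -ldot_product
import "C"

func DotProduct(x, y []float32) float32 {
	if len(x) == 0 {
		return 0 // &x[0] would panic on an empty slice
	}
	return float32(C.DotProduct((*C.float)(&x[0]), (*C.float)(&y[0]), C.size_t(len(x))))
}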
extern is a gccgo-only extension for calling into C functions:
//extern DotProduct
Generating Low-level Code with PeachPy
PeachPy is a Python framework for writing high-performance low-level code.
For Go users, PeachPy can generate SysO, Plan 9 assembly, as well as object files for Windows, Linux, and OS X
PeachPy Example: Dot Product
from peachpy import *
from peachpy.x86_64 import *

x = Argument(ptr(const_float_))
y = Argument(ptr(const_float_))
length = Argument(size_t)

with Function("DotProduct", (x, y, length), float_) as function:
    reg_x = GeneralPurposeRegister64()
    reg_y = GeneralPurposeRegister64()
    reg_length = GeneralPurposeRegister64()

    LOAD.ARGUMENT(reg_x, x)
    LOAD.ARGUMENT(reg_y, y)
    LOAD.ARGUMENT(reg_length, length)

    vector_loop = Loop()
    scalar_loop = Loop()

    unroll_factor = 6
    ymm_accs = [YMMRegister() for _ in range(unroll_factor)]

    # Zero the accumulators
    for ymm_acc in ymm_accs:
        xmm_acc = ymm_acc.as_xmm
        VXORPS(xmm_acc, xmm_acc, xmm_acc)

    SUB(reg_length, 8*unroll_factor)
    JB(vector_loop.end)
    with vector_loop:
        ymm_xs = [YMMRegister() for _ in range(unroll_factor)]

        for (i, ymm_x) in enumerate(ymm_xs):
            VMOVUPS(ymm_x, [reg_x + 32*i])

        # acc += x * y (the 231 form computes op1 = op2 * op3 + op1)
        for (i, (ymm_acc, ymm_x)) in enumerate(zip(ymm_accs, ymm_xs)):
            VFMADD231PS(ymm_acc, ymm_x, [reg_y + 32*i])

        ADD(reg_x, 32*unroll_factor)
        ADD(reg_y, 32*unroll_factor)

        SUB(reg_length, 8*unroll_factor)
        JAE(vector_loop.begin)

    # Reduce the YMM accumulators into a single YMM register
    VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[1])
    VADDPS(ymm_accs[2], ymm_accs[2], ymm_accs[3])
    VADDPS(ymm_accs[4], ymm_accs[4], ymm_accs[5])
    VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[2])
    VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[4])

    ymm_acc = ymm_accs[0]
    xmm_acc = ymm_acc.as_xmm

    xmm_scalar_acc = XMMRegister()
    VXORPS(xmm_scalar_acc, xmm_scalar_acc, xmm_scalar_acc)

    ADD(reg_length, 8*unroll_factor)
    JZ(scalar_loop.end)

    with scalar_loop:
        xmm_scalar_x = XMMRegister()
        VMOVSS(xmm_scalar_x, [reg_x])
        VFMADD231SS(xmm_scalar_acc, xmm_scalar_x, [reg_y])

        ADD(reg_x, 4)
        ADD(reg_y, 4)

        SUB(reg_length, 1)
        JNZ(scalar_loop.begin)

    # Add remainder
    VADDPS(ymm_acc, ymm_acc, xmm_scalar_acc.as_ymm)

    # Horizontal sum of the 8 lanes
    xmm_temp = XMMRegister()
    VEXTRACTF128(xmm_temp, ymm_acc, 1)
    VADDPS(xmm_acc, xmm_acc, xmm_temp)
    VHADDPS(xmm_acc, xmm_acc, xmm_acc)
    VHADDPS(xmm_acc, xmm_acc, xmm_acc)

    RETURN(xmm_acc)
Performance: Dot Product
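Before trusting any timing comparison between a generated kernel and the pure-Go loop, it is worth checking that both produce the same answer. A vectorized kernel sums in a different order, so an exact comparison is too strict and a tolerance is needed. The sketch below is one way to do that. All names are hypothetical, and dotUnderTest is a plain-Go placeholder where the accelerated function would be called.

```go
package main

import (
	"fmt"
	"math"
)

// dotRef64 accumulates in float64 to serve as a reference value.
// It assumes len(y) >= len(x).
func dotRef64(x, y []float32) float64 {
	var s float64
	for i := range x {
		s += float64(x[i]) * float64(y[i])
	}
	return s
}

// closeEnough reports whether got is within a relative tolerance of want.
func closeEnough(got float32, want, relTol float64) bool {
	diff := math.Abs(float64(got) - want)
	return diff <= relTol*math.Max(1, math.Abs(want))
}

// dotUnderTest is a placeholder for the accelerated kernel being checked.
func dotUnderTest(x, y []float32) float32 {
	var s float32
	for i := range x {
		s += x[i] * y[i]
	}
	return s
}

func main() {
	x := make([]float32, 1000)
	y := make([]float32, 1000)
	for i := range x {
		x[i] = float32(i%7) * 0.25
		y[i] = float32(i%5) * 0.5
	}
	ok := closeEnough(dotUnderTest(x, y), dotRef64(x, y), 1e-5)
	fmt.Println("matches reference:", ok)
}
```

The tolerance of 1e-5 is an arbitrary choice for this sketch. A suitable value depends on the input length and magnitudes.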
PeachPy: Resources
PeachPy is an open-source project on GitHub:
GitHub.com/Maratyszcza/PeachPy
The dot product code is included in examples/go-generate