1 of 20

Accelerating data processing in Go

with SIMD instructions

Marat Dukhan

2 of 20

Performance Components

(Top Desktop x86 CPUs from Intel)

3 of 20

SIMD: Single Instruction - Multiple Data

SIMD extensions are sets of special instructions that operate on short fixed-length arrays of elements

E.g. MMX can work on arrays of 4 16-bit elements

SSE2 can work on arrays of 2 double-precision floats

4 of 20

SIMD: Special Operations

Often SIMD extensions include instructions for more data-processing operations that regular instruction sets

E.g. abs, min, max, saturated arithmetic
It is possible to get superlinear speedup!

uint32_t a[4];

uint32_t b[4];

a[0] = a[0] > b[0] ? a[0] : b[0];

a[1] = a[1] > b[1] ? a[1] : b[1];

a[2] = a[2] > b[2] ? a[2] : b[2];

a[3] = a[3] > b[3] ? a[3] : b[3];

uint32x4_t a;

uint32x4_t b;

a = max(a, b); // Special instruction!

5 times less instructions in SIMD version although SIMD vector has only 4 elements!

4 RISC instructions:

2 SIMD loads
1 SIMD max
1 SIMD store

20 RISC instructions:

8 loads
4 compares
4 conditional moves
4 stores

5 of 20

x86 SIMD Extensions

MMX

i586

MMX+

SSE

SSE2

SSE3

3dnow!

3dnow!+

3dnow! Geode

SSSE3

SSE4A

SSE4.1

SSE4.2

AVX

F16C

FMA3

FMA4

EMMX

XOP

AVX512-F

AVX512-CD

AVX512-ER

AVX512-PF

3dnow! prefetch

AES-NI

PCLMULQDQ

Pentium begot MMX, MMX begot EMMX and MMX+, MMX+ begot 3dnow!, 3dnow! prefetch and SSE, 3dnow! begot 3dnow!+ and SSE begot SSE2, SSE2 begot SSE3, SSE3 begot SSSE3 and SSE4A, SSSE3 begon SSE4.1, SSE4.1 begot SSE4.2, SSE4.2 begot AES-NI, PCLMULQDQ, and AVX, AVX begot F16C, FMA4, and XOP, F16C begot FMA3, FMA3 begot AVX512-F, AVX512-F begot AVX512-CD, AVX512-PF, AVX512-ER, AVX512-BW, AVX512-DQ, AVX512-VL, AVX512-VBMI, and AVX512-IFMA!

AVX512-VBMI

AVX512-IFMA

AVX512-BW

AVX512-VL

AVX512-DQ

6 of 20

Instruction Support Detection: CPUID Instruction

On x86 and x86-64 architectures CPUID instruction reveals which instruction set extensions are implemented by the processor

The cpuid package by Klaus Post helps to get this info from Go:

package main

import (

"fmt"

"github.com/klauspost/cpuid"

)

func main() {

if cpuid.CPU.AVX() {

fmt.Println("AVX: supported")

} else {

fmt.Println("AVX: unsupported")

}

7 of 20

Three Ways to Use SIMD Instructions

Compiler auto-vectorization

Let the compiler figure out how to use SIMD

Intrinsic functions

Special low-level functions that easily map to processor instructions

Implement in another language, link with Go

8 of 20

Compiler auto-vectorization in Go

Google’s Go compiler does not support auto-vectorization
gccgo can do auto-vectorization

Pass -ftree-vectorize (or -O3) option to gccgo
Set -march option to enable additional SIMD extensions

-march=native to use everything available on host

Enable -ffast-math to unlock additional floating-point vectorization opportunities (use with care: it affects results)
With go utility: go build

-compiler=gccgo -gccgoflags=”-O3 -march=native -ffast-math”

9 of 20

Go auto-vectorization showcase: dot product

func DotProduct(x, y []float32) (z float32) {

for i, xi := range x {

z += xi * y[i]

}

return

}

10 of 20

Intrinsic functions in Go

Google’s Go toolchain does not support intrinsics
gccgo does not support intrinsics
Go as a language does not specify intrinsics

...and never will

11 of 20

Linking with External Implementation

Go provides 4 mechanisms to call into code implemented in other languages:

Built-in assembler
syso
cgo
extern

12 of 20

Using Go Assembler

Both Google’s and gccgo toolchains include assemblers, but the syntax is quite different:

gccgo toolchain uses GNU assembler, which accepts traditional AT&T and Intel syntax
Google’s Go assembler (6a) follows Plan 9 syntax, which is different from everything!

13 of 20

6a Assembler

6a assembler weirdness:

Unconventional names for instructions and registers
Branch instructions use names from Motorola 68020 arch!
Some widely used constructs are not supported

E.g. alignment, jump tables

6a lacks SIMD instructions later than SSE3 (2004)

Can be inserted directly as opcode bytes

Function names must start with Unicode middle dot character!

14 of 20

6a Assembler: Example

// func sliceSum(a []float64) float64

TEXT ·sliceSum(SB), 7, $0

MOVQ a(FP),SI // SI: &a

MOVQ a_len+8(FP),DX // DX: len(a)

XORPD X0, X0 // Initial value

XORPD X1, X1 // Initial value

MOVQ DX, R10 // R10: len(out) -1

SHRQ $2, DX // DX: (len(out) - 1) / 4

ANDQ $3, R10 // R10: (len(out) -1 ) % 4

CMPQ DX ,$0

JEQ remain_sum

next_sum:

MOVUPD (SI), X2

MOVUPD 16(SI), X3

ADDPD X2, X0

ADDPD X3, X1

ADDQ $32, SI

SUBQ $1, DX

JNZ next_sum

CMPQ R10, $0

JZ done_sum

remain_sum:

MOVSD (SI), X2

ADDSD X2, X0

ADDQ $8, SI

SUBQ $1, R10

JNZ remain_sum

done_sum:

ADDPD X1, X0

MOVAPD X0, X2

UNPCKHPD X0, X2

ADDSD X2, X0

MOVSD X0, ret+24(FP)

RET

Despite all the weirdness, some Go packages make use of built-in assembler.

This example is from GitHub.com/akualab/narray

15 of 20

Linking Go with SysO

SysO are object files with .syso extension

The files follow usual ELF/Mach-O/COFF format
go build will link the .syso files into the executable
Functions from .syso are called via asm trampoline

TEXT ·DotProduct(SB),NOSPLIT,$0

JMP DotProduct(SB)

Functions must follow Go calling convention:

All arguments are passed on stack
Return values are saved on stack
C/C++ compilers can’t generate .syso that would follow Go calling convention

16 of 20

Using CGo and extern

CGo is a tool that automatically generates bindings to C functions for Go. Gotchas:

Call into CGo function requires to switch thread stack
Cross-compilation gets complicated
gccgo does not support cgo

//#include "DotProduct.h"�//#cgo LDFLAGS: -ldot_product�import "C"��func DotProduct(x, y []float32) float32 {� return C.DotProduct(&x[0], &y[0], len(x))�}

extern is a gccgo-only extension for calling into C functions:

//extern DotProduct

17 of 20

Generating Low-level Code with PeachPy

PeachPy is a Python framework for writing high-performance low-level codes.

For Go users, PeachPy can generate SysO, Plan 9 assembly, as well as object files for Windows, Linux, and OS X

18 of 20

PeachPy Example: Dot Product

x = Argument(ptr(const_float_))

y = Argument(ptr(const_float_))

length = Argument(size_t)

with Function("DotProduct", (x, y, length), float_) as function:

reg_x = GeneralPurposeRegister64()

reg_y = GeneralPurposeRegister64()

reg_length = GeneralPurposeRegister64()

� LOAD.ARGUMENT(reg_x, x)

LOAD.ARGUMENT(reg_y, y)

LOAD.ARGUMENT(reg_length, length)

� vector_loop = Loop()

scalar_loop = Loop()

unroll_factor = 6

ymm_accs = [YMMRegister() for _ in range(unroll_factor)]

� for ymm_acc in ymm_accs:

xmm_acc = ymm_acc.as_xmm

VXORPS(xmm_acc, xmm_acc, xmm_acc)

SUB(reg_length, 8*unroll_factor)

JB(vector_loop.end)

with vector_loop:

ymm_xs = [YMMRegister() for _ in range(unroll_factor)]

� for (i, ymm_x) in enumerate(ymm_xs):

VMOVUPS(ymm_x, [reg_x + 32*i])

� for (i, (ymm_acc, ymm_x)) in enumerate(zip(ymm_accs, ymm_xs)):

VFMADD132PS(ymm_acc, ymm_x, [reg_y + 32*i])

� ADD(reg_x, 32*unroll_factor)

ADD(reg_y, 32*unroll_factor)

� SUB(reg_length, 8*unroll_factor)

JAE(vector_loop.begin)

� # Reduction of multiple YMM registers into into YMM register

VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[1])

VADDPS(ymm_accs[2], ymm_accs[2], ymm_accs[3])

VADDPS(ymm_accs[4], ymm_accs[4], ymm_accs[5])

VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[2])

VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[4])

� ymm_acc = ymm_accs[0]

xmm_acc = ymm_acc.as_xmm

� xmm_scalar_acc = XMMRegister()

VXORPS(xmm_scalar_acc, xmm_scalar_acc, xmm_scalar_acc)

� ADD(reg_length, 8*unroll_factor)

JZ(scalar_loop.end)

� with scalar_loop:

xmm_scalar_x = XMMRegister()

VMOVSS(xmm_scalar_x, [reg_x])

VFMADD132SS(xmm_scalar_acc, xmm_scalar_x, [reg_y])

ADD(reg_x, 4)

ADD(reg_y, 4)

SUB(reg_length, 1)

JNZ(scalar_loop.begin)

# Add remainder

VADDPS(ymm_acc, ymm_acc, xmm_scalar_acc.as_ymm)

xmm_temp = XMMRegister()

VEXTRACTF128(xmm_temp, ymm_acc, 1)

VADDPS(xmm_acc, xmm_acc, xmm_temp)

VHADDPS(xmm_acc, xmm_acc, xmm_acc)

� RETURN(xmm_acc)

19 of 20

Dot Product Performance: Dot Product

20 of 20

PeachPy: Resources

PeachPy is an open-source project on GitHub:

GitHub.com/Maratyszcza/PeachPy

The dot product code is included in examples/go-generate