1 of 20

Accelerating data processing in Go

with SIMD instructions

Marat Dukhan

2 of 20

Performance Components

(Top Desktop x86 CPUs from Intel)

3 of 20

SIMD: Single Instruction - Multiple Data

SIMD extensions are sets of special instructions that operate on short fixed-length arrays of elements

  • E.g. MMX can work on arrays of four 16-bit elements

  • SSE2 can work on arrays of 2 double-precision floats
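
To make the idea concrete, the sketch below emulates in plain Go what a packed 16-bit add does: one operation applied to every lane of a short fixed-length array. The type Int16x4 and the function AddLanes are hypothetical names chosen for illustration; they are not part of any Go package, and the Go compiler is not guaranteed to turn this loop into a SIMD instruction.

```go
package main

import "fmt"

// Int16x4 is a hypothetical stand-in for a 64-bit MMX register
// holding four 16-bit lanes.
type Int16x4 [4]int16

// AddLanes adds corresponding lanes of a and b. A SIMD unit would do
// all four additions with one instruction; here it is an ordinary loop.
// As with a wrapping packed add, int16 overflow wraps around.
func AddLanes(a, b Int16x4) Int16x4 {
	var r Int16x4
	for i := range r {
		r[i] = a[i] + b[i]
	}
	return r
}

func main() {
	a := Int16x4{1, 2, 3, 4}
	b := Int16x4{10, 20, 30, 40}
	fmt.Println(AddLanes(a, b)) // [11 22 33 44]
}
```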

4 of 20

SIMD: Special Operations

Often SIMD extensions include instructions for more data-processing operations than regular instruction sets

  • E.g. abs, min, max, saturated arithmetic
  • It is possible to get superlinear speedup!

uint32_t a[4];

uint32_t b[4];

a[0] = a[0] > b[0] ? a[0] : b[0];

a[1] = a[1] > b[1] ? a[1] : b[1];

a[2] = a[2] > b[2] ? a[2] : b[2];

a[3] = a[3] > b[3] ? a[3] : b[3];

uint32x4_t a;

uint32x4_t b;

a = max(a, b); // Special instruction!

5 times fewer instructions in the SIMD version, although the SIMD vector has only 4 elements!

SIMD version, 4 RISC instructions:

  • 2 SIMD loads
  • 1 SIMD max
  • 1 SIMD store

Scalar version, 20 RISC instructions:

  • 8 loads
  • 4 compares
  • 4 conditional moves
  • 4 stores
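
The special operations listed above each need a compare or a branch per element in scalar code. The sketch below writes two of them in plain Go, a 4-element unsigned max and a saturating byte add, to show the work that a single SIMD instruction would replace. MaxU32x4 and AddSatU8 are hypothetical names used only for this illustration.

```go
package main

import "fmt"

// MaxU32x4 is the scalar form of the 4-lane unsigned max shown above:
// each lane costs a compare plus a conditional select.
func MaxU32x4(a, b [4]uint32) [4]uint32 {
	for i := range a {
		if b[i] > a[i] {
			a[i] = b[i]
		}
	}
	return a
}

// AddSatU8 adds two bytes and clamps at 255 instead of wrapping,
// which is what a saturating SIMD add does in every lane.
func AddSatU8(x, y uint8) uint8 {
	s := x + y
	if s < x { // the sum wrapped around, so the true sum exceeded 255
		return 255
	}
	return s
}

func main() {
	fmt.Println(MaxU32x4([4]uint32{1, 9, 3, 7}, [4]uint32{5, 2, 8, 7})) // [5 9 8 7]
	fmt.Println(AddSatU8(200, 100)) // 255
	fmt.Println(AddSatU8(20, 30)) // 50
}
```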

5 of 20

x86 SIMD Extensions

MMX

i586

MMX+

SSE

SSE2

SSE3

3dnow!

3dnow!+

3dnow! Geode

SSSE3

SSE4A

SSE4.1

SSE4.2

AVX

F16C

FMA3

FMA4

EMMX

XOP

AVX512-F

AVX512-CD

AVX512-ER

AVX512-PF

3dnow! prefetch

AES-NI

PCLMULQDQ

Pentium begot MMX, MMX begot EMMX and MMX+, MMX+ begot 3dnow!, 3dnow! prefetch and SSE, 3dnow! begot 3dnow!+ and SSE begot SSE2, SSE2 begot SSE3, SSE3 begot SSSE3 and SSE4A, SSSE3 begot SSE4.1, SSE4.1 begot SSE4.2, SSE4.2 begot AES-NI, PCLMULQDQ, and AVX, AVX begot F16C, FMA4, and XOP, F16C begot FMA3, FMA3 begot AVX512-F, AVX512-F begot AVX512-CD, AVX512-PF, AVX512-ER, AVX512-BW, AVX512-DQ, AVX512-VL, AVX512-VBMI, and AVX512-IFMA!

AVX512-VBMI

AVX512-IFMA

AVX512-BW

AVX512-VL

AVX512-DQ

6 of 20

Instruction Support Detection: CPUID Instruction

On x86 and x86-64 architectures, the CPUID instruction reveals which instruction set extensions are implemented by the processor.

The cpuid package by Klaus Post helps to get this info from Go:

package main

import (
	"fmt"

	"github.com/klauspost/cpuid"
)

func main() {
	if cpuid.CPU.AVX() {
		fmt.Println("AVX: supported")
	} else {
		fmt.Println("AVX: unsupported")
	}
}
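
A common use of such a check is to pick an implementation once at startup. The sketch below shows that pattern with a plain boolean in place of the real query, so it has no external dependencies. The names hasAVX, sumGeneric, sumAVX, and selectSum are hypothetical; in real code sumAVX would be backed by assembly, and hasAVX would come from a call such as cpuid.CPU.AVX().

```go
package main

import "fmt"

// sumGeneric is the portable fallback.
func sumGeneric(a []float64) float64 {
	var s float64
	for _, v := range a {
		s += v
	}
	return s
}

// sumAVX stands in for an AVX-accelerated routine. In this sketch it
// simply calls the fallback so the example runs on any machine.
func sumAVX(a []float64) float64 {
	return sumGeneric(a)
}

// selectSum returns the implementation to use, given whether the CPU
// reports AVX support (a value supplied by the caller in this sketch).
func selectSum(hasAVX bool) func([]float64) float64 {
	if hasAVX {
		return sumAVX
	}
	return sumGeneric
}

func main() {
	hasAVX := false // placeholder; query the CPU in real code
	sum := selectSum(hasAVX)
	fmt.Println(sum([]float64{1, 2, 3, 4})) // 10
}
```

Selecting the function once keeps the feature test out of the hot loop.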

7 of 20

Three Ways to Use SIMD Instructions

  • Compiler auto-vectorization
    • Let the compiler figure out how to use SIMD
  • Intrinsic functions
    • Special low-level functions that easily map to processor instructions
  • Implement in another language, link with Go

8 of 20

Compiler auto-vectorization in Go

  • Google’s Go compiler does not support auto-vectorization
  • gccgo can do auto-vectorization
    • Pass -ftree-vectorize (or -O3) option to gccgo
    • Set -march option to enable additional SIMD extensions
      • -march=native to use everything available on host
    • Enable -ffast-math to unlock additional floating-point vectorization opportunities (use with care: it affects results)
    • With the go tool: go build -compiler=gccgo -gccgoflags="-O3 -march=native -ffast-math"

9 of 20

Go auto-vectorization showcase: dot product

func DotProduct(x, y []float32) (z float32) {
	for i, xi := range x {
		z += xi * y[i]
	}
	return
}
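
Vectorizing this loop means keeping several partial sums and combining them at the end, which changes the order of the floating-point additions. GCC-based compilers generally perform that reordering only when options such as -ffast-math allow it, because the result can differ in the last bits. The sketch below writes the reordered form by hand; DotProduct4 is a hypothetical name, and the four-way split is an assumption that mirrors a 4-lane SSE register.

```go
package main

import "fmt"

// DotProduct4 keeps four independent partial sums, the way a 4-lane
// SIMD accumulator would, and combines them after the loop.
func DotProduct4(x, y []float32) float32 {
	var acc [4]float32
	n := len(x) &^ 3 // largest multiple of 4 not exceeding len(x)
	for i := 0; i < n; i += 4 {
		acc[0] += x[i] * y[i]
		acc[1] += x[i+1] * y[i+1]
		acc[2] += x[i+2] * y[i+2]
		acc[3] += x[i+3] * y[i+3]
	}
	z := (acc[0] + acc[1]) + (acc[2] + acc[3])
	for i := n; i < len(x); i++ { // scalar tail for leftover elements
		z += x[i] * y[i]
	}
	return z
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6}
	y := []float32{6, 5, 4, 3, 2, 1}
	fmt.Println(DotProduct4(x, y)) // 56
}
```

For small integer inputs like these the result is exact; for general inputs it may differ slightly from the sequential loop.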

10 of 20

Intrinsic functions in Go

  • Google’s Go toolchain does not support intrinsics
  • gccgo does not support intrinsics
  • Go as a language does not specify intrinsics

...and never will

11 of 20

Linking with External Implementation

Go provides 4 mechanisms to call into code implemented in other languages:

  • Built-in assembler
  • syso
  • cgo
  • extern

12 of 20

Using Go Assembler

Both Google’s and gccgo toolchains include assemblers, but the syntax is quite different:

  • gccgo toolchain uses GNU assembler, which accepts traditional AT&T and Intel syntax
  • Google’s Go assembler (6a) follows Plan 9 syntax, which is different from everything!

13 of 20

6a Assembler

6a assembler weirdness:

  • Unconventional names for instructions and registers
  • Branch instructions use names from Motorola 68020 arch!
  • Some widely used constructs are not supported
    • E.g. alignment, jump tables
  • 6a lacks SIMD instructions later than SSE3 (2004)
    • Can be inserted directly as opcode bytes
  • Function names must start with Unicode middle dot character!

14 of 20

6a Assembler: Example

// func sliceSum(a []float64) float64
TEXT ·sliceSum(SB), 7, $0
	MOVQ a(FP), SI  // SI: &a[0]
	MOVQ a_len+8(FP), DX  // DX: len(a)
	XORPD X0, X0  // X0: first accumulator, zeroed
	XORPD X1, X1  // X1: second accumulator, zeroed
	MOVQ DX, R10  // R10: len(a)
	SHRQ $2, DX  // DX: len(a) / 4
	ANDQ $3, R10  // R10: len(a) % 4
	CMPQ DX, $0
	JEQ remain_sum
next_sum:  // vector loop: 4 elements per iteration
	MOVUPD (SI), X2
	MOVUPD 16(SI), X3
	ADDPD X2, X0
	ADDPD X3, X1
	ADDQ $32, SI
	SUBQ $1, DX
	JNZ next_sum
	CMPQ R10, $0
	JZ done_sum
remain_sum:  // scalar loop: 1 element per iteration
	MOVSD (SI), X2
	ADDSD X2, X0
	ADDQ $8, SI
	SUBQ $1, R10
	JNZ remain_sum
done_sum:  // reduce both accumulators to a single value
	ADDPD X1, X0
	MOVAPD X0, X2
	UNPCKHPD X0, X2
	ADDSD X2, X0
	MOVSD X0, ret+24(FP)
	RET

Despite all the weirdness, some Go packages make use of built-in assembler.

This example is from github.com/akualab/narray
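
For readers unfamiliar with Plan 9 syntax, the sketch below restates the structure of the assembly in plain Go: two 2-lane accumulators consume four elements per iteration, a scalar loop handles the remainder, and the lanes are added together at the end. sliceSumRef is a hypothetical name used here for illustration; it is not part of the narray package.

```go
package main

import "fmt"

// sliceSumRef mirrors the assembly routine: x0 and x1 play the role of
// registers X0 and X1, each holding two float64 lanes.
func sliceSumRef(a []float64) float64 {
	var x0, x1 [2]float64
	blocks := len(a) / 4 // iterations of the vector loop
	rem := len(a) % 4    // elements left for the scalar loop
	i := 0
	for ; blocks > 0; blocks-- {
		x0[0] += a[i]
		x0[1] += a[i+1]
		x1[0] += a[i+2]
		x1[1] += a[i+3]
		i += 4
	}
	for ; rem > 0; rem-- {
		x0[0] += a[i] // like ADDSD, only the low lane is updated
		i++
	}
	// Final reduction: add the accumulators, then add the two lanes.
	x0[0] += x1[0]
	x0[1] += x1[1]
	return x0[0] + x0[1]
}

func main() {
	fmt.Println(sliceSumRef([]float64{1, 2, 3, 4, 5, 6, 7})) // 28
}
```

A reference version like this is also handy as a test oracle for the assembly routine.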

15 of 20

Linking Go with SysO

SysO files are object files with the .syso extension

  • The files follow the usual ELF/Mach-O/COFF format
  • go build will link the .syso files into the executable
  • Functions from .syso are called via an asm trampoline

TEXT ·DotProduct(SB),NOSPLIT,$0
	JMP DotProduct(SB)

  • Functions must follow Go calling convention:
    • All arguments are passed on stack
    • Return values are saved on stack
    • C/C++ compilers can’t generate .syso that would follow Go calling convention

16 of 20

Using CGo and extern

CGo is a tool that automatically generates bindings to C functions for Go. Gotchas:

  • Calling a cgo function requires switching the thread stack
  • Cross-compilation gets complicated
  • gccgo does not support cgo

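//#include "DotProduct.h"
//#cgo LDFLAGS: -ldot_product
import "C"

func DotProduct(x, y []float32) float32 {
	// cgo requires explicit conversions between Go and C types.
	// The C signature (float pointers, size_t length) is assumed here.
	return float32(C.DotProduct((*C.float)(&x[0]), (*C.float)(&y[0]), C.size_t(len(x))))
}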

extern is a gccgo-only extension for calling into C functions:

//extern DotProduct

17 of 20

Generating Low-level Code with PeachPy

PeachPy is a Python framework for writing high-performance low-level code.

For Go users, PeachPy can generate SysO, Plan 9 assembly, as well as object files for Windows, Linux, and OS X

18 of 20

PeachPy Example: Dot Product

from peachpy import *
from peachpy.x86_64 import *

x = Argument(ptr(const_float_))
y = Argument(ptr(const_float_))
length = Argument(size_t)

with Function("DotProduct", (x, y, length), float_) as function:
    reg_x = GeneralPurposeRegister64()
    reg_y = GeneralPurposeRegister64()
    reg_length = GeneralPurposeRegister64()

    LOAD.ARGUMENT(reg_x, x)
    LOAD.ARGUMENT(reg_y, y)
    LOAD.ARGUMENT(reg_length, length)

    vector_loop = Loop()
    scalar_loop = Loop()

    unroll_factor = 6
    ymm_accs = [YMMRegister() for _ in range(unroll_factor)]
    for ymm_acc in ymm_accs:
        xmm_acc = ymm_acc.as_xmm
        VXORPS(xmm_acc, xmm_acc, xmm_acc)

    SUB(reg_length, 8*unroll_factor)
    JB(vector_loop.end)
    with vector_loop:
        ymm_xs = [YMMRegister() for _ in range(unroll_factor)]
        for (i, ymm_x) in enumerate(ymm_xs):
            VMOVUPS(ymm_x, [reg_x + 32*i])
        for (i, (ymm_acc, ymm_x)) in enumerate(zip(ymm_accs, ymm_xs)):
            VFMADD132PS(ymm_acc, ymm_x, [reg_y + 32*i])

        ADD(reg_x, 32*unroll_factor)
        ADD(reg_y, 32*unroll_factor)

        SUB(reg_length, 8*unroll_factor)
        JAE(vector_loop.begin)

    # Reduction of multiple YMM registers into one YMM register
    VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[1])
    VADDPS(ymm_accs[2], ymm_accs[2], ymm_accs[3])
    VADDPS(ymm_accs[4], ymm_accs[4], ymm_accs[5])
    VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[2])
    VADDPS(ymm_accs[0], ymm_accs[0], ymm_accs[4])

    ymm_acc = ymm_accs[0]
    xmm_acc = ymm_acc.as_xmm

    xmm_scalar_acc = XMMRegister()
    VXORPS(xmm_scalar_acc, xmm_scalar_acc, xmm_scalar_acc)

    ADD(reg_length, 8*unroll_factor)
    JZ(scalar_loop.end)
    with scalar_loop:
        xmm_scalar_x = XMMRegister()
        VMOVSS(xmm_scalar_x, [reg_x])
        VFMADD132SS(xmm_scalar_acc, xmm_scalar_x, [reg_y])

        ADD(reg_x, 4)
        ADD(reg_y, 4)

        SUB(reg_length, 1)
        JNZ(scalar_loop.begin)

    # Add remainder
    VADDPS(ymm_acc, ymm_acc, xmm_scalar_acc.as_ymm)

    xmm_temp = XMMRegister()
    VEXTRACTF128(xmm_temp, ymm_acc, 1)
    VADDPS(xmm_acc, xmm_acc, xmm_temp)
    VHADDPS(xmm_acc, xmm_acc, xmm_acc)
    VHADDPS(xmm_acc, xmm_acc, xmm_acc)

    RETURN(xmm_acc)

19 of 20

Performance: Dot Product

20 of 20

PeachPy: Resources

PeachPy is an open-source project on GitHub:

github.com/Maratyszcza/PeachPy

The dot product code is included in examples/go-generate