1 of 36

Optimizing Half-precision Winograd Algorithm on ARM Many-core Processors

Presenter: Dedong Xie

Date: 2022.08.24

Dedong Xie

University of Toronto

Zhen Jia

Amazon Web Services

Zili Zhang

Peking University

Xin Jin

Peking University

13th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2022)

2 of 36

CONTENTS

01  Introduction

02  Design

03  Experiments

04  Conclusion

3 of 36

Introduction

4 of 36

  • Convolutional Neural Networks
    • Successful in recognition and recommendation
    • E.g., VGG, ResNet

  • CPU
    • Vector instructions
    • High performance
    • GPU shortage
    • Attractive choice for certain use cases (e.g., inference)

Introduction

5 of 36

Graviton CPU

  • Fast, efficient, good price performance
  • ARM NEON SIMD ISA
  • Opportunity for optimizations for CNN


6 of 36

ARM NEON Vector Instructions

  • Each vector register is 128 bits wide and holds 8 lanes of FP16 data, so one instruction computes all 8 lanes
  • Example: Fused Multiply-Add, FMLA R3, R1, R2 (see the intrinsics sketch below)

[Figure: lane-wise FMLA: each of the 8 FP16 lanes of R3 accumulates the product of the corresponding lanes of R1 and R2]
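A minimal NEON-intrinsics sketch of the same operation (the slide only shows the assembly mnemonic; the wrapper function name is illustrative):

#include <arm_neon.h>

// acc = acc + a * b over all 8 FP16 lanes; compiles to a single
// FMLA Vd.8H, Vn.8H, Vm.8H on CPUs with FP16 vector arithmetic.
inline float16x8_t fmla_fp16(float16x8_t acc, float16x8_t a, float16x8_t b) {
    return vfmaq_f16(acc, a, b);
}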

7 of 36

Convolution

[Figure: convolution F(2,3): a 3-element kernel slides over the input to produce 2 output elements]
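As a baseline for the multiplication counts on the next slide, direct F(2,3) uses 2 × 3 = 6 multiplications. A small illustrative sketch (plain scalar code, not HAWC's implementation):

// Direct 1D convolution F(2,3): 2 outputs from a 4-element input and a
// 3-element kernel, using 2 * 3 = 6 multiplications.
void direct_conv_f2_3(const float d[4], const float g[3], float y[2]) {
    for (int i = 0; i < 2; ++i) {
        y[i] = 0.0f;
        for (int j = 0; j < 3; ++j)
            y[i] += d[i + j] * g[j];
    }
}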

8 of 36

Winograd Algorithm for Convolution

  • Compute intermediate (transformed) values instead of multiplying the input and kernel directly

  • A total of m+r-1 multiplications is needed for F(m,r), versus m × r for the direct method

[Figure: Winograd convolution pipeline: input transformation and kernel transformation, elementwise multiplication of the transformed tiles, then output transformation]

9 of 36

Winograd Algorithm

    • Input Transformation
      • $V = B^{T} d B$ (input tile $d$)
      • $U = G g G^{T}$ (kernel $g$, typically precomputed)

    • Elementwise Multiplication
      • $M = U \odot V$

    • Output Transformation
      • $Y = A^{T} M A$ (concrete F(2,3) matrices below)
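For reference, the standard transformation matrices for F(2,3) (Lavin and Gray's formulation; these are not printed on the slide, and sign conventions for the last rows of $B^{T}$ and $A^{T}$ vary between implementations):

\[
B^{T} = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix},\quad
G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix},\quad
A^{T} = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}
\]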

 

10 of 36

Design

11 of 36

HAWC: Design

We present HAWC, a Half-precision Winograd Algorithm Convolution implementation for ARM many-core processors.

Circled parts of the workflow figure mark where we apply specific optimizations.

12 of 36

HAWC: Main Components

  • Data Layout
  • GEMM Kernel Generator
  • Scatter Store
  • Parallelization

13 of 36

HAWC: Data Layout

Apply vectorization

Maximize data re-use

Reduce access overhead

14 of 36

HAWC: Optimizations for Transformations

  • Pre-defined transformation codelets for input, kernel, and output transformations
  • Use of NEON intrinsics
  • Use of C++ template
  • Scattered store for matrix multiplication

#include <arm_neon.h>
#include <type_traits>  // for std::enable_if

// Input-transformation codelet, enabled when the tile size M + R - 1 is 4,
// i.e., F(2,3). long_t is assumed to be a project-defined integer typedef.
// OS/IS are the output/input strides measured in float16x8_t vectors; each
// vector carries 8 independent tiles, one per FP16 lane.
template <long_t M, long_t R, long_t OS, long_t IS>
inline __attribute__((always_inline))
typename std::enable_if<(M + R - 1) == 4>::type
transform_image(float16x8_t* __restrict out, float16x8_t* __restrict in) {
    out[0]      = vsubq_f16(in[0],      in[IS * 2]);  // d0 - d2
    out[OS * 1] = vaddq_f16(in[IS],     in[IS * 2]);  // d1 + d2
    out[OS * 2] = vsubq_f16(in[IS * 2], in[IS]);      // d2 - d1
    out[OS * 3] = vsubq_f16(in[IS * 3], in[IS]);      // d3 - d1
}
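A hypothetical call site, only to show how the template parameters map onto strides (M=2, R=3 so M+R-1 == 4; the stride values are illustrative, not HAWC's actual layout):

// Apply the 1D pass to the first column of a 4x4 tile of vectors stored
// row-major (16 float16x8_t); with IS = OS = 4 it reads and writes indices
// 0, 4, 8, 12. Each FP16 lane carries an independent tile.
void transform_first_column(float16x8_t* out_buf, float16x8_t* in_buf) {
    transform_image<2, 3, /*OS=*/4, /*IS=*/4>(out_buf, in_buf);
}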

15 of 36

HAWC: GEMM Kernel Generator

Fast

Flexible

Adaptive

16 of 36

HAWC: Scattered Store

  • After the matrix multiplication, the elementwise-multiplied results must be rearranged back into tile order before the output transformation can be applied (see the sketch after the figure).

[Figure: normal workflow: matrix multiplication, then a gather step, then output transformation. HAWC workflow: matrix multiplication with a scattered store, then output transformation, so no separate gather is needed. The bracket on the slide marks the scope of the output transformation.]
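A minimal sketch of the idea; names and strides are hypothetical, not HAWC's actual code. Instead of writing the GEMM tile contiguously and gathering it later, each accumulator vector is stored directly at the strided location the output transformation will read:

#include <arm_neon.h>

// Hypothetical scattered store: write one column of GEMM accumulators straight
// into a transform-ready buffer with stride `os` (in vectors), so the separate
// gather pass before the output transformation disappears.
inline void store_scattered(float16x8_t* __restrict dst,
                            const float16x8_t acc[4], long os) {
    for (int i = 0; i < 4; ++i)
        dst[i * os] = acc[i];
}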

17 of 36

HAWC: Parallelization

  • Minimal parallel scheduler; the work inside each stage is parallelized, with synchronization at the stage boundaries (sketched below)

[Figure: each stage (input transformation, matrix multiplication, output transformation) is split across threads 1-3, with boundaries between the stages]
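A minimal sketch of per-stage parallelism with barriers at the stage boundaries. The slide does not name the threading mechanism, so OpenMP is used here purely for illustration, and the work inside each loop is a placeholder:

#include <omp.h>

// Each stage is parallelized over tiles; the implicit barrier at the end of
// each "omp for" plays the role of the stage boundary on the slide.
void run_winograd_layer(int num_tiles) {
    #pragma omp parallel
    {
        #pragma omp for  // stage 1: input transformation
        for (int t = 0; t < num_tiles; ++t) { /* transform input tile t */ }

        #pragma omp for  // stage 2: matrix multiplication
        for (int t = 0; t < num_tiles; ++t) { /* GEMM for tile t */ }

        #pragma omp for  // stage 3: output transformation
        for (int t = 0; t < num_tiles; ++t) { /* transform output tile t */ }
    }
}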

18 of 36

Implementation

  • Implemented in C++
  • Targets ARM CPUs with FP16 ASIMD support
  • Relies on the ARM compiler toolchain
  • Compiled with GCC (g++)
  • Built with Make

19 of 36

Experiments & Analysis

20 of 36

Experiments Setup

NCNN

Source: https://github.com/Tencent/ncnn

MNN

Source: https://github.com/alibaba/MNN

  • Amazon EC2 m6g.metal instance
  • Graviton2, 64 cores
  • Ubuntu 18.04

  • Compare latency on representative layers of CNN models
  • Compare with NCNN and MNN

21 of 36

Accuracy

  • The Winograd algorithm is numerically unstable
  • For F(m,r), a larger m yields fewer operations but lower accuracy (m: output size, a hyperparameter; r: kernel size)
  • We calculate the maximum element error and the average element error (see the sketch below)
  • An average error below 1e-2 will not influence stability*

*: Gupta et al. 2015. Deep Learning with Limited Numerical Precision. ICML’15.
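A small sketch of the two metrics named above, comparing the FP16 Winograd output against a higher-precision reference; the function and variable names are illustrative:

#include <algorithm>
#include <cmath>
#include <cstddef>

// Maximum and average absolute element error between a reference output and
// the FP16 Winograd output (both widened to float for the comparison).
void element_errors(const float* ref, const float* out, std::size_t n,
                    float& max_err, float& avg_err) {
    max_err = 0.0f;
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        float e = std::fabs(ref[i] - out[i]);
        max_err = std::max(max_err, e);
        sum += e;
    }
    avg_err = static_cast<float>(sum / n);
}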

22 of 36

Performance

 

23 of 36

Multi-Batch

Performance

 

24 of 36

GEMM Performance

Achieves ~70%-90% of the theoretical maximum of 5.12 TFLOPS (FP16)
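Where the 5.12 TFLOPS figure comes from, assuming 64 Graviton2 cores at 2.5 GHz with two 128-bit FP16 FMA pipelines per core (8 lanes, 2 FLOPs per lane per FMA); this derivation is ours, not shown on the slide:

\[
64 \times 2.5\,\text{GHz} \times 2\,\text{pipes} \times 8\,\text{lanes} \times 2\,\tfrac{\text{FLOPs}}{\text{FMA}} = 5.12\,\text{TFLOPS}
\]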

25 of 36

Computation Time Breakdown

Our scattered-store design reduces the time spent in the output transformation

26 of 36

Case Study: Graviton 3

  • AWS Graviton3 instances were released in May 2022 with new features.

27 of 36

Conclusion

28 of 36

Contributions

HAWC

    • Efficient implementation of FP16 Winograd convolution optimized for ARM many-core processors.

Design

    • Applies various optimizations across the Winograd pipeline.
    • A custom JIT-compiled matrix multiplication kernel for Winograd convolution on the ARM NEON ISA.

Performance

    • In our experiments, HAWC achieves a 10.74× speedup on average and up to 27.56×.

29 of 36

Future work

  • Autotune the selection of GEMM parameters
  • Longer vector registers: 256-bit, 512-bit, …
  • Different data types: BF16, INT8, …

30 of 36

Thank you

13th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2022)

August 24th, 2022

  • Thank you for listening!

  • Questions and comments?

31 of 36

32 of 36

HAWC: Workflow

33 of 36

Data Layout

34 of 36

JIT-Compiled GEMM Kernel Generator

Sub-Matrix Blocks

    • Further break the sub-matrices held in cache into smaller blocks held in registers
    • Use of prefetching

Generator of Assembly Code

    • Parameters are passed at (kernel) compilation time
    • Kernel code is generated at runtime (JIT)
    • Pseudo-intrinsic assembly code
    • Makes the code adaptive
    • Possible to extend to other platforms

Compile and Link into the Executable

    • The compiled GEMM component is linked into the program
    • Enables plug-in GEMM kernels and re-use of kernels (a toy sketch of this flow follows)
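A toy, self-contained sketch of the generate, compile, and link flow described above; every name here is illustrative, and HAWC's real generator emits specialized NEON assembly rather than the trivial C kernel used below:

#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>
#include <string>

// 1) Emit source specialized to a parameter known at generation time,
// 2) compile it at runtime, 3) dlopen and link the plug-in kernel.
using kernel_fn = void (*)(const float*, const float*, float*, int);

kernel_fn jit_compile_kernel(int unroll) {
    std::string u = std::to_string(unroll);
    std::string src =
        "void kernel(const float* a, const float* b, float* c, int n) {\n"
        "  for (int i = 0; i + " + u + " <= n; i += " + u + ")\n"
        "    for (int j = 0; j < " + u + "; ++j)\n"
        "      c[i + j] += a[i + j] * b[i + j];\n"
        "}\n";
    std::FILE* f = std::fopen("/tmp/kernel.c", "w");
    std::fwrite(src.data(), 1, src.size(), f);
    std::fclose(f);
    std::system("cc -O3 -shared -fPIC /tmp/kernel.c -o /tmp/kernel.so");
    void* handle = dlopen("/tmp/kernel.so", RTLD_NOW);
    return handle ? reinterpret_cast<kernel_fn>(dlsym(handle, "kernel")) : nullptr;
}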

35 of 36

GEMM Kernel Generator

[Figure: Layer Specifications, an Operation Translation List, and Register Bank Descriptions feed the Generator (JIT-compile), which produces the GEMM Kernel; the kernel is linked with the other components into the HAWC executable. Orange boxes in the figure change with each layer.]

36 of 36

GEMM Performance

  • Fusion 1.1
  • 64 channels, 640 × 640 images
  • This layer has a small channel count, so its computation is bounded by I/O