1 of 36

Optimizing Half-precision Winograd Algorithm on ARM Many-core Processors

Presenter: Dedong Xie

Date: 2022.08.24

Dedong Xie

University of Toronto

Zhen Jia

Amazon Web Services

Zili Zhang

Peking University

Xin Jin

Peking University

13th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2022)

2 of 36

CONTENTS

01  Introduction

02  Design

03  Experiments

04  Conclusion

3 of 36

Introduction

4 of 36

  • Convolutional Neural Networks
    • Successful in recognition and recommendation
    • E.g., VGG, ResNet

  • CPU
    • Vector instructions
    • High performance
    • GPU shortage
    • Attractive choice for certain use cases (e.g., inference)

Introduction

5 of 36

Graviton CPU

  • Fast, efficient, good price performance
  • ARM NEON SIMD ISA
  • Opportunity for optimizations for CNN


6 of 36

ARM NEON Vector Instructions

  • Each vector register is 128 bits wide and holds 8 lanes of FP16 data, so one instruction computes all 8 lanes
  • Example: Fused Multiply-Add, FMLA R3, R1, R2 (see the intrinsics sketch below)

[Figure: lane-wise FMLA: each of the 8 FP16 lanes of R3 accumulates the product of the corresponding lanes of R1 and R2]
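A minimal NEON-intrinsics sketch of the same operation (the slide only shows the assembly mnemonic; the wrapper function name is illustrative):

#include <arm_neon.h>

// acc = acc + a * b over all 8 FP16 lanes; compiles to a single
// FMLA Vd.8H, Vn.8H, Vm.8H on CPUs with FP16 vector arithmetic.
inline float16x8_t fmla_fp16(float16x8_t acc, float16x8_t a, float16x8_t b) {
    return vfmaq_f16(acc, a, b);
}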

7 of 36

Convolution

[Figure: convolution F(2,3): a 3-element kernel slides over the input to produce 2 output elements]
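As a baseline for the multiplication counts on the next slide, direct F(2,3) uses 2 × 3 = 6 multiplications. A small illustrative sketch (plain scalar code, not HAWC's implementation):

// Direct 1D convolution F(2,3): 2 outputs from a 4-element input and a
// 3-element kernel, using 2 * 3 = 6 multiplications.
void direct_conv_f2_3(const float d[4], const float g[3], float y[2]) {
    for (int i = 0; i < 2; ++i) {
        y[i] = 0.0f;
        for (int j = 0; j < 3; ++j)
            y[i] += d[i + j] * g[j];
    }
}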

8 of 36

Winograd Algorithm for Convolution

  • Compute intermediate (transformed) values instead of multiplying the input and kernel directly

  • A total of m+r-1 multiplications is needed for F(m,r), versus m × r for the direct method

[Figure: Winograd convolution pipeline: input transformation and kernel transformation, elementwise multiplication of the transformed tiles, then output transformation]

9 of 36

Winograd Algorithm

    • Input Transformation
      • $V = B^{T} d B$ (input tile $d$)
      • $U = G g G^{T}$ (kernel $g$, typically precomputed)

    • Elementwise Multiplication
      • $M = U \odot V$

    • Output Transformation
      • $Y = A^{T} M A$ (concrete F(2,3) matrices below)
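For reference, the standard transformation matrices for F(2,3) (Lavin and Gray's formulation; these are not printed on the slide, and sign conventions for the last rows of $B^{T}$ and $A^{T}$ vary between implementations):

\[
B^{T} = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix},\quad
G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix},\quad
A^{T} = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}
\]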

 

10 of 36

Design

11 of 36

HAWC: Design

We present HAWC, a Half-precision Winograd Algorithm Convolution implementation for ARM many-core processors.

Circled parts of the workflow figure mark where we apply specific optimizations.

12 of 36

HAWC: Main Components

  • Data Layout
  • GEMM Kernel Generator
  • Scatter Store
  • Parallelization

13 of 36

HAWC: Data Layout

Apply vectorization

Maximize data re-use

Reduce access overhead

14 of 36

HAWC: Optimizations for Transformations

  • Pre-defined transformation codelets for input, kernel, and output transformations
  • Use of NEON intrinsics
  • Use of C++ template
  • Scattered store for matrix multiplication

#include <arm_neon.h>
#include <type_traits>  // for std::enable_if

// Input-transformation codelet, enabled when the tile size M + R - 1 is 4,
// i.e., F(2,3). long_t is assumed to be a project-defined integer typedef.
// OS/IS are the output/input strides measured in float16x8_t vectors; each
// vector carries 8 independent tiles, one per FP16 lane.
template <long_t M, long_t R, long_t OS, long_t IS>
inline __attribute__((always_inline))
typename std::enable_if<(M + R - 1) == 4>::type
transform_image(float16x8_t* __restrict out, float16x8_t* __restrict in) {
    out[0]      = vsubq_f16(in[0],      in[IS * 2]);  // d0 - d2
    out[OS * 1] = vaddq_f16(in[IS],     in[IS * 2]);  // d1 + d2
    out[OS * 2] = vsubq_f16(in[IS * 2], in[IS]);      // d2 - d1
    out[OS * 3] = vsubq_f16(in[IS * 3], in[IS]);      // d3 - d1
}
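A hypothetical call site, only to show how the template parameters map onto strides (M=2, R=3 so M+R-1 == 4; the stride values are illustrative, not HAWC's actual layout):

// Apply the 1D pass to the first column of a 4x4 tile of vectors stored
// row-major (16 float16x8_t); with IS = OS = 4 it reads and writes indices
// 0, 4, 8, 12. Each FP16 lane carries an independent tile.
void transform_first_column(float16x8_t* out_buf, float16x8_t* in_buf) {
    transform_image<2, 3, /*OS=*/4, /*IS=*/4>(out_buf, in_buf);
}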

15 of 36

HAWC: GEMM Kernel Generator

Fast

Flexible

Adaptive

16 of 36

HAWC: Scattered Store

  • After the matrix multiplication, the elementwise-multiplied results must be rearranged back into tile order before the output transformation can be applied (see the sketch after the figure).

[Figure: normal workflow: matrix multiplication, then a gather step, then output transformation. HAWC workflow: matrix multiplication with a scattered store, then output transformation, so no separate gather is needed. The bracket on the slide marks the scope of the output transformation.]
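A minimal sketch of the idea; names and strides are hypothetical, not HAWC's actual code. Instead of writing the GEMM tile contiguously and gathering it later, each accumulator vector is stored directly at the strided location the output transformation will read:

#include <arm_neon.h>

// Hypothetical scattered store: write one column of GEMM accumulators straight
// into a transform-ready buffer with stride `os` (in vectors), so the separate
// gather pass before the output transformation disappears.
inline void store_scattered(float16x8_t* __restrict dst,
                            const float16x8_t acc[4], long os) {
    for (int i = 0; i < 4; ++i)
        dst[i * os] = acc[i];
}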

17 of 36

HAWC: Parallelization

  • Minimal parallel scheduler; the work inside each stage is parallelized, with synchronization at the stage boundaries (sketched below)

[Figure: each stage (input transformation, matrix multiplication, output transformation) is split across threads 1-3, with boundaries between the stages]
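A minimal sketch of per-stage parallelism with barriers at the stage boundaries. The slide does not name the threading mechanism, so OpenMP is used here purely for illustration, and the work inside each loop is a placeholder:

#include <omp.h>

// Each stage is parallelized over tiles; the implicit barrier at the end of
// each "omp for" plays the role of the stage boundary on the slide.
void run_winograd_layer(int num_tiles) {
    #pragma omp parallel
    {
        #pragma omp for  // stage 1: input transformation
        for (int t = 0; t < num_tiles; ++t) { /* transform input tile t */ }

        #pragma omp for  // stage 2: matrix multiplication
        for (int t = 0; t < num_tiles; ++t) { /* GEMM for tile t */ }

        #pragma omp for  // stage 3: output transformation
        for (int t = 0; t < num_tiles; ++t) { /* transform output tile t */ }
    }
}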

18 of 36

Implementation

  • Implemented in C++
  • Targets ARM CPUs with FP16 ASIMD support
  • Relies on the ARM compiler toolchain
  • Compiled with GCC (g++)
  • Built with Make

19 of 36

Experiments & Analysis

20 of 36

Experiments Setup

NCNN

Source: https://github.com/Tencent/ncnn

MNN

Source: https://github.com/alibaba/MNN

  • Amazon EC2 m6g.metal instance
  • Graviton2, 64 cores
  • Ubuntu 18.04

  • Compare latency on representative layers of CNN models
  • Compare with NCNN and MNN

21 of 36

Accuracy

  • The Winograd algorithm is numerically unstable
  • For F(m,r), a larger m yields fewer operations but lower accuracy (m: output size, a hyperparameter; r: kernel size)
  • We calculate the maximum element error and the average element error (see the sketch below)
  • An average error below 1e-2 will not influence stability*

*: Gupta et al. 2015. Deep Learning with Limited Numerical Precision. ICML’15.
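A small sketch of the two metrics named above, comparing the FP16 Winograd output against a higher-precision reference; the function and variable names are illustrative:

#include <algorithm>
#include <cmath>
#include <cstddef>

// Maximum and average absolute element error between a reference output and
// the FP16 Winograd output (both widened to float for the comparison).
void element_errors(const float* ref, const float* out, std::size_t n,
                    float& max_err, float& avg_err) {
    max_err = 0.0f;
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        float e = std::fabs(ref[i] - out[i]);
        max_err = std::max(max_err, e);
        sum += e;
    }
    avg_err = static_cast<float>(sum / n);
}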

22 of 36

Performance

 

23 of 36

Multi-Batch

Performance

 

24 of 36

GEMM Performance

Achieves ~70%-90% of the theoretical maximum of 5.12 TFLOPS (FP16)
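Where the 5.12 TFLOPS figure comes from, assuming 64 Graviton2 cores at 2.5 GHz with two 128-bit FP16 FMA pipelines per core (8 lanes, 2 FLOPs per lane per FMA); this derivation is ours, not shown on the slide:

\[
64 \times 2.5\,\text{GHz} \times 2\,\text{pipes} \times 8\,\text{lanes} \times 2\,\tfrac{\text{FLOPs}}{\text{FMA}} = 5.12\,\text{TFLOPS}
\]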

25 of 36

Computation Time Breakdown

Our scattered-store design reduces the time spent in the output transformation

26 of 36

Case Study: Graviton 3

  • AWS Graviton3 instances were released in May 2022 with new features.

27 of 36

Conclusion

28 of 36

Contributions

HAWC

    • Efficient implementation of FP16 Winograd convolution optimized for ARM many-core processors.

Design

    • Applies various optimizations across the Winograd pipeline.
    • A custom JIT-compiled matrix multiplication kernel for Winograd convolution on the ARM NEON ISA.

Performance

    • In our experiments, HAWC achieves a 10.74× speedup on average and up to 27.56×.

29 of 36

Future work

  • Autotune the selection of GEMM parameters
  • Longer vector registers: 256-bit, 512-bit, …
  • Different data types: BF16, INT8, …

30 of 36

Thank you

13th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2022)

August 24th, 2022

  • Thank you for listening!

  • Questions and comments?

31 of 36

32 of 36

HAWC: Workflow

33 of 36

Data Layout

34 of 36

JIT-Compiled GEMM Kernel Generator

Sub-Matrix Blocks

    • Further break the sub-matrices held in cache into smaller blocks held in registers
    • Use of prefetching

Generator of Assembly Code

    • Parameters are passed at (kernel) compilation time
    • Kernel code is generated at runtime (JIT)
    • Pseudo-intrinsic assembly code
    • Makes the code adaptive
    • Possible to extend to other platforms

Compile and Link into the Executable

    • The compiled GEMM component is linked into the program
    • Enables plug-in GEMM kernels and re-use of kernels (a toy sketch of this flow follows)
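A toy, self-contained sketch of the generate, compile, and link flow described above; every name here is illustrative, and HAWC's real generator emits specialized NEON assembly rather than the trivial C kernel used below:

#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>
#include <string>

// 1) Emit source specialized to a parameter known at generation time,
// 2) compile it at runtime, 3) dlopen and link the plug-in kernel.
using kernel_fn = void (*)(const float*, const float*, float*, int);

kernel_fn jit_compile_kernel(int unroll) {
    std::string u = std::to_string(unroll);
    std::string src =
        "void kernel(const float* a, const float* b, float* c, int n) {\n"
        "  for (int i = 0; i + " + u + " <= n; i += " + u + ")\n"
        "    for (int j = 0; j < " + u + "; ++j)\n"
        "      c[i + j] += a[i + j] * b[i + j];\n"
        "}\n";
    std::FILE* f = std::fopen("/tmp/kernel.c", "w");
    std::fwrite(src.data(), 1, src.size(), f);
    std::fclose(f);
    std::system("cc -O3 -shared -fPIC /tmp/kernel.c -o /tmp/kernel.so");
    void* handle = dlopen("/tmp/kernel.so", RTLD_NOW);
    return handle ? reinterpret_cast<kernel_fn>(dlsym(handle, "kernel")) : nullptr;
}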

35 of 36

GEMM Kernel Generator

[Figure: Layer Specifications, an Operation Translation List, and Register Bank Descriptions feed the Generator (JIT-compile), which produces the GEMM Kernel; the kernel is linked with the other components into the HAWC executable. Orange boxes in the figure change with each layer.]

36 of 36

GEMM Performance

  • Fusion 1.1
  • 64 channels, 640 × 640 images
  • This layer has a small channel count, so its computation is bounded by I/O