Optimizing Half-precision Winograd Algorithm on ARM Many-core Processors
Presenter: Dedong Xie
Date: 2022.08.24
Dedong Xie
University of Toronto
Zhen Jia
Amazon Web Services
Zili Zhang
Peking University
Xin Jin
Peking University
13th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2022)
CONTENTS
01 Introduction
02 Design
03 Experiments
04 Conclusion
Introduction
Graviton CPU
ARM NEON Vector Instructions
[Figure: NEON vector registers R1, R2, and R3; a single instruction operates on all lanes of a register at once]
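As an illustration (a minimal sketch; the function and pointer names are illustrative, not from the paper), one NEON intrinsic operates on all eight fp16 lanes of a 128-bit register at once; build with half-precision support, e.g. -march=armv8.2-a+fp16:

#include <arm_neon.h>

// R3 = R1 + R2: one vaddq_f16 adds eight half-precision lanes per instruction.
void add_fp16x8(const __fp16* r1, const __fp16* r2, __fp16* r3) {
    float16x8_t a = vld1q_f16(r1);    // load 8 fp16 values into one register
    float16x8_t b = vld1q_f16(r2);
    vst1q_f16(r3, vaddq_f16(a, b));   // elementwise add across all lanes
}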
Convolution
[Figure: an input convolved with a kernel produces the output]
Convolution F(2,3)
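For reference, F(2,3) computes two outputs from a 3-tap kernel; a direct plain-C++ evaluation (names illustrative) needs 6 multiplications:

// Direct 1-D convolution F(2,3): 4 inputs, 3-tap kernel, 2 outputs.
// Each output needs 3 multiplications, 6 in total.
void conv_f23_direct(const float d[4], const float g[3], float y[2]) {
    y[0] = d[0] * g[0] + d[1] * g[1] + d[2] * g[2];
    y[1] = d[1] * g[0] + d[2] * g[1] + d[3] * g[2];
}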
Winograd Algorithm for Convolution
[Figure: Winograd pipeline: the input passes through the input transformation and the kernel through the kernel transformation; the transformed tiles are multiplied elementwise, and the output transformation produces the output]
Winograd Algorithm
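A scalar sketch of the same F(2,3) computed with the Winograd algorithm (the standard Lavin-Gray formulation, not code from the paper): 4 multiplications instead of 6, at the cost of a few extra additions.

// Winograd F(2,3): input transform B^T d, kernel transform G g,
// 4 elementwise products, then output transform A^T m.
void conv_f23_winograd(const float d[4], const float g[3], float y[2]) {
    // Kernel transform G g (can be precomputed once per kernel).
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];
    // Input transform B^T d.
    float v0 = d[0] - d[2];
    float v1 = d[1] + d[2];
    float v2 = d[2] - d[1];
    float v3 = d[1] - d[3];
    // Elementwise multiplication: only 4 products.
    float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    // Output transform A^T m.
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}

In a convolution layer these per-tile elementwise products are accumulated over all channels, so they become batched matrix multiplications (GEMMs); that is where the half-precision GEMM kernels described later come in.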
Design
HAWC: Design
We present HAWC, a half-precision Winograd convolution algorithm for ARM many-core processors.
The circled parts mark where we apply our specific optimizations.
HAWC: Main Components
HAWC: Data Layout
Apply vectorization
Maximize data re-use
Reduce access overhead
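A minimal packing sketch (the layout and names are assumptions for illustration, not necessarily HAWC's exact format): grouping 8 consecutive channels contiguously lets every float16x8_t load in the transforms come from one contiguous, reusable block.

#include <cstddef>

// NCHW -> (C/8)HW8 for one image: the innermost 8 channels are contiguous,
// so one vld1q_f16 fills a whole vector (assumes C is a multiple of 8).
void pack_chw_to_c8hw8(const __fp16* src, __fp16* dst,
                       std::size_t C, std::size_t H, std::size_t W) {
    const std::size_t HW = H * W;
    for (std::size_t c = 0; c < C; c += 8)
        for (std::size_t hw = 0; hw < HW; ++hw)
            for (std::size_t v = 0; v < 8; ++v)
                dst[(c / 8) * HW * 8 + hw * 8 + v] = src[(c + v) * HW + hw];
}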
HAWC: Optimizations for Transformations
#include <arm_neon.h>
#include <cstdint>
#include <type_traits>

using long_t = std::int64_t;  // assumed typedef; not shown on the slide

// Input transform B^T d for F(2,3), enabled only when the tile size
// M + R - 1 == 4. IS and OS are the input/output strides in vectors.
template <long_t M, long_t R, long_t OS, long_t IS>
inline __attribute__((always_inline))
typename std::enable_if<(M + R - 1) == 4>::type
transform_image(float16x8_t* __restrict out, float16x8_t* __restrict in) {
    out[0]      = vsubq_f16(in[0],      in[IS * 2]);  // d0 - d2
    out[OS * 1] = vaddq_f16(in[IS],     in[IS * 2]);  // d1 + d2
    out[OS * 2] = vsubq_f16(in[IS * 2], in[IS]);      // d2 - d1
    out[OS * 3] = vsubq_f16(in[IS * 3], in[IS]);      // d3 - d1
}
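A hedged call-site sketch (the tile layout and template arguments are assumptions about the surrounding code): with a 4x4 tile of vectors stored row-major, IS and OS are the strides in vectors between rows, and the 1-D transform above is applied along each dimension in turn to form the full 2-D input transform.

// Hypothetical usage: M = 2, R = 3, so the tile size M + R - 1 = 4.
float16x8_t in_tile[16], out_tile[16];
// Transform the first column of the tile (stride 4 vectors between rows).
transform_image<2, 3, /*OS=*/4, /*IS=*/4>(out_tile, in_tile);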
HAWC: GEMM Kernel Generator
Fast
Flexible
Adaptive
HAWC: Scattered Store
Normal workflow: matrix multiplication, then gather and output transformation (the gather falls within the scope of the output transformation).
HAWC workflow: matrix multiplication with a scattered store, then the output transformation alone.
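A minimal sketch of the idea (function and buffer names are assumptions, not the paper's code): instead of storing GEMM results contiguously and gathering them later inside the output transformation, the GEMM epilogue writes each accumulator straight to the slot the output transform will read.

#include <arm_neon.h>
#include <cstddef>

// Scattered store: one store per accumulator, directly into the
// output-transform working buffer, so no separate gather pass is needed.
void store_scattered(float16x8_t acc, std::size_t tile, std::size_t pos,
                     float16x8_t* transform_buf, std::size_t tile_stride) {
    transform_buf[tile * tile_stride + pos] = acc;
}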
HAWC: Parallelization
[Figure: the input transformation, matrix multiplication, and output transformation are each split across Thread 1, Thread 2, Thread 3, ..., with a synchronization boundary between consecutive stages]
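A hedged OpenMP-style outline of the stage-parallel structure (the per-tile functions are placeholders, not HAWC's API): each stage is split across threads over tiles, and the implicit barrier at the end of each parallel loop is the boundary between stages.

// Placeholder per-tile work items.
void input_transform_tile(int)   { /* ... */ }
void gemm_tile(int)              { /* ... */ }
void output_transform_tile(int)  { /* ... */ }

void run_layer(int num_tiles) {
    #pragma omp parallel for          // input transformation
    for (int t = 0; t < num_tiles; ++t) input_transform_tile(t);
                                      // implicit barrier = stage boundary
    #pragma omp parallel for          // batched matrix multiplications
    for (int t = 0; t < num_tiles; ++t) gemm_tile(t);
                                      // implicit barrier = stage boundary
    #pragma omp parallel for          // output transformation
    for (int t = 0; t < num_tiles; ++t) output_transform_tile(t);
}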
Implementation
Experiments & Analysis
Experiments Setup
NCNN (source: https://github.com/Tencent/ncnn)
MNN (source: https://github.com/alibaba/MNN)
Accuracy
*: Gupta et al. 2015. Deep Learning with Limited Numerical Precision. ICML’15.
Performance
Multi-Batch
Performance
GEMM Performance
Achieves ~70%-90% of theoretical maximum TFLOPS
5.12 TFLOPS
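One way to reproduce the 5.12 TFLOPS peak (assuming a 64-core Graviton 2 at 2.5 GHz with two 128-bit NEON pipes per core): 64 cores x 2.5 GHz x 2 pipes x 8 fp16 lanes x 2 FLOP per FMA = 5.12 TFLOPS.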
Computation Time Breakdown
Our scattered-store design reduces the time spent in the output transformation.
Case Study: Graviton 3
Conclusion
Contributions
HAWC
Design
Performance
Future work
Thank you
13th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys 2022)
August 24th, 2022
HAWC: Workflow
Data Layout
JIT-Compiled GEMM Kernel Generator
Sub-matrix blocks
Assembly code generator
Compile and link into an executable
GEMM Kernel Generator
[Figure: the layer specifications, operation translation list, and register bank descriptions feed the generator (JIT compile), which produces the GEMM kernel; linking it with the other components produces the HAWC executable. Orange elements change with each layer.]
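A hedged sketch of the JIT idea (not the paper's generator; register assignments and tile shape are assumptions): from a register bank description, emit the fused multiply-accumulate body of an mr x 8 fp16 micro-kernel as AArch64 assembly text, which would then be assembled and linked into the running process.

#include <cstdio>
#include <string>

// Emit the accumulation body of an mr x 8 fp16 GEMM micro-kernel.
// Assumed register bank: v0 holds one 8-wide fp16 vector of B,
// v1 holds up to 8 broadcastable A scalars, v8..v(8+mr-1) accumulate C.
std::string emit_fmla_block(int mr) {
    std::string body;
    char line[64];
    for (int i = 0; i < mr; ++i) {
        // C[i] += A[i] * B   (FMLA by element, ARMv8.2-A FP16)
        std::snprintf(line, sizeof(line),
                      "    fmla v%d.8h, v0.8h, v1.h[%d]\n", 8 + i, i);
        body += line;
    }
    return body;
}

int main() {
    std::fputs(emit_fmla_block(4).c_str(), stdout);  // 4x8 tile: four FMLAs
}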
GEMM Performance