This spreadsheet compares the single-thread performance of several GEMM kernels on several ARMv7 and ARMv8 cores, using the Android NDK r12 standalone toolchain ("clang version 3.8.256229").
This benchmarks only the GEMM kernel, not a whole GEMM. The focus here is on pure arithmetic.
The benchmark code is at https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc
An 'op' is a single scalar arithmetic op such as mul or add. So a SIMD multiply-add op on 4 lanes counts as 4x2=8 ops.
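
As a minimal sketch of this counting convention (the function and constant names are illustrative and do not come from the benchmark code), the op count of one SIMD instruction is the number of lanes times the scalar ops performed per lane:

```python
# Illustrative helper, not part of the benchmark: counts scalar ops
# for a single SIMD instruction under the convention described above.
def ops_per_instruction(lanes: int, ops_per_lane: int) -> int:
    """Return the number of scalar ops performed by one SIMD instruction."""
    return lanes * ops_per_lane


# A multiply-add does one mul and one add per lane, so 2 scalar ops per lane.
MULTIPLY_ADD_OPS_PER_LANE = 2

# A multiply-add on 4 lanes counts as 4 x 2 = 8 ops.
print(ops_per_instruction(lanes=4, ops_per_lane=MULTIPLY_ADD_OPS_PER_LANE))  # 8
```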
9
Below, the data is given first as raw speed (Gop/s), then divided by the clock speed (GHz) to get ops/cycle.
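
The conversion can be sketched as follows, using one figure from the throughput table (NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits on the Pixel XL big core). The function name is illustrative. Because Gop/s and GHz both carry a factor of 10^9, the quotient comes out directly in ops/cycle:

```python
# Illustrative helper: Gop/s divided by GHz gives ops/cycle,
# since the two factors of 1e9 cancel.
def ops_per_cycle(gops: float, clock_ghz: float) -> float:
    """Convert raw throughput in Gop/s to efficiency in ops/cycle."""
    return gops / clock_ghz


# 22.61 Gop/s at 2.15 GHz (Pixel XL big core).
print(round(ops_per_cycle(22.61, 2.15), 2))  # 10.52
```

Some entries in the efficiency table differ from this calculation by 0.01, presumably because the spreadsheet divides unrounded values.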
Most of these kernels are written in 32-bit or 64-bit ARM inline assembly. The rows ending in '_intrinsics' are C++/intrinsics versions of the asm kernel in the row just above them.
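
One way to read these row pairs is to divide the '_intrinsics' figure by the asm figure in the row above it. The sketch below does this for three 64-bit pairs on the Pixel XL big core, with values copied from the throughput table. The dictionary and function names are illustrative only:

```python
# (asm Gop/s, intrinsics Gop/s) on the Pixel XL big core, copied from
# the throughput table. Keys are the asm kernel names.
PIXEL_XL_BIG_PAIRS = {
    "NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits": (22.61, 15.60),
    "NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators": (16.05, 10.87),
    "NEON_64bit_GEMM_Float32_WithScalar": (14.78, 6.82),
}


def intrinsics_fraction(asm_gops: float, intrinsics_gops: float) -> float:
    """Fraction of the asm kernel's throughput reached by the intrinsics version."""
    return intrinsics_gops / asm_gops


for name, (asm_gops, intrinsics_gops) in PIXEL_XL_BIG_PAIRS.items():
    fraction = intrinsics_fraction(asm_gops, intrinsics_gops)
    print(f"{name}: {fraction:.2f}")
```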
Throughput in Gop/s

| Device Name / core type | Pixel XL big core | Pixel XL little core | Nexus 5 | Nexus 5X big core | Nexus 5X little core | Android One (old) | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU Core | Kryo | Kryo | Krait | Cortex-A57 | Cortex-A53 | Cortex-A7 | |
| Clock (GHz) | 2.15 | 1.60 | 2.26 | 1.82 | 1.44 | 1.30 | |
| NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits | 22.61 | 16.74 | | 22.61 | 8.87 | | This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams |
| NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 15.60 | 11.55 | | 11.05 | 4.27 | | |
| NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators | 16.05 | 12.08 | | 13.39 | 9.37 | | This is gemmlowp with DefaultL8R8BitDepthParams |
| NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 10.87 | 8.03 | | 6.90 | 3.19 | | |
| NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57 | 16.93 | 12.33 | | 18.17 | 6.38 | | |
| NEON_64bit_GEMM_Int32_WithScalar | 14.79 | 10.87 | | 6.80 | 6.45 | | |
| NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar | 14.23 | 10.42 | | 9.59 | 4.20 | | This is more or less Eigen/float. |
| NEON_64bit_GEMM_Float32_WithScalar | 14.78 | 10.87 | | 12.29 | 6.03 | | This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers. |
| NEON_64bit_GEMM_Float32_WithScalar_intrinsics | 6.82 | 5.06 | | 3.54 | 1.33 | | |
| NEON_64bit_GEMM_Float32_WithScalar_A57 | 16.58 | 12.22 | | 14.28 | 6.29 | | |
| NEON_64bit_GEMM_Float32_WithScalar_A53 | 16.42 | 12.19 | | 11.47 | 8.15 | | |
| NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits | 22.69 | 16.79 | 23.63 | 21.49 | 8.24 | 4.04 | This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams |
| NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 19.08 | 14.40 | 20.67 | 13.90 | 5.14 | 2.72 | |
| NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators | 17.09 | 12.65 | 15.13 | 13.16 | 8.57 | 3.70 | This is gemmlowp with DefaultL8R8BitDepthParams |
| NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 15.78 | 11.67 | 14.62 | 11.55 | 7.63 | 3.76 | |
| NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand | 15.86 | 11.73 | 17.80 | 14.96 | 6.00 | 2.97 | |
| NEON_32bit_GEMM_Int32_WithScalar | 13.21 | 9.77 | 17.09 | 6.59 | 5.42 | 2.02 | |
| NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar | 9.55 | 6.97 | 15.26 | 7.52 | 3.59 | 1.51 | This is more or less Eigen/float. |
| NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar | 13.16 | 9.75 | 16.49 | 7.52 | 3.59 | 1.51 | |
| NEON_32bit_GEMM_Float32_MLA_WithScalar | 12.31 | 9.13 | 15.60 | 11.21 | 4.85 | 1.93 | This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers. |
| NEON_32bit_GEMM_Float32_WithScalar_intrinsics | 7.46 | 5.59 | 4.84 | 4.49 | 1.99 | 0.77 | |
| NEON_32bit_GEMM_Float32_WithScalar_A53 | 11.66 | 8.65 | 14.03 | 9.88 | 8.31 | 1.55 | |
| NEON_32bit_GEMM_Float32_WithScalar_A53_depth2 | 11.94 | 8.89 | 14.94 | 10.17 | 8.31 | 1.55 | |
| NEON_32bit_GEMM_Float32_MLA_Rotating | 9.89 | 7.36 | 14.79 | 7.67 | 3.94 | 1.39 | |
| NEON_32bit_GEMM_Float32_FMA_Rotating | 13.79 | 10.22 | 15.97 | 7.66 | 3.94 | 1.39 | |
Efficiency in ops/cycle

| Device Name / core type | Pixel XL big core | Pixel XL little core | Nexus 5 | Nexus 5X big core | Nexus 5X little core | Old Android One | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CPU Core | Kryo | Kryo | Krait | Cortex-A57 | Cortex-A53 | Cortex-A7 | |
| Clock (GHz) | 2.15 | 1.60 | 2.26 | 1.82 | 1.44 | 1.30 | |
| NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits | 10.52 | 10.47 | | 12.42 | 6.16 | | This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams |
| NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 7.26 | 7.22 | | 6.07 | 2.97 | | |
| NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators | 7.47 | 7.55 | | 7.36 | 6.51 | | This is gemmlowp with DefaultL8R8BitDepthParams |
| NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 5.06 | 5.02 | | 3.79 | 2.21 | | |
| NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57 | 7.88 | 7.71 | | 9.98 | 4.43 | | |
| NEON_64bit_GEMM_Int32_WithScalar | 6.88 | 6.79 | | 3.74 | 4.48 | | |
| NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar | 6.62 | 6.51 | | 5.27 | 2.91 | | This is more or less Eigen/float. |
| NEON_64bit_GEMM_Float32_WithScalar | 6.87 | 6.80 | | 6.75 | 4.19 | | This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers. |
| NEON_64bit_GEMM_Float32_WithScalar_intrinsics | 3.17 | 3.16 | | 1.95 | 0.92 | | |
| NEON_64bit_GEMM_Float32_WithScalar_A57 | 7.71 | 7.64 | | 7.84 | 4.36 | | |
| NEON_64bit_GEMM_Float32_WithScalar_A53 | 7.64 | 7.62 | | 6.30 | 5.66 | | |
| NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits | 10.55 | 10.49 | 10.46 | 11.81 | 5.72 | 3.11 | This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams |
| NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 8.88 | 9.00 | 9.15 | 7.64 | 3.57 | 2.09 | |
| NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators | 7.95 | 7.90 | 6.70 | 7.23 | 5.95 | 2.85 | This is gemmlowp with DefaultL8R8BitDepthParams |
| NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 7.34 | 7.30 | 6.47 | 6.34 | 5.30 | 2.89 | |
| NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand | 7.38 | 7.33 | 7.88 | 8.22 | 4.17 | 2.29 | |
| NEON_32bit_GEMM_Int32_WithScalar | 6.14 | 6.11 | 7.56 | 3.62 | 3.76 | 1.55 | |
| NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar | 4.44 | 4.36 | 6.75 | 4.13 | 2.49 | 1.16 | This is more or less Eigen/float. |
| NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar | 6.12 | 6.09 | 7.30 | 4.13 | 2.49 | 1.16 | |
| NEON_32bit_GEMM_Float32_MLA_WithScalar | 5.73 | 5.70 | 6.90 | 6.16 | 3.37 | 1.49 | This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers. |
| NEON_32bit_GEMM_Float32_WithScalar_intrinsics | 3.47 | 3.49 | 2.14 | 2.47 | 1.38 | 0.59 | |
| NEON_32bit_GEMM_Float32_WithScalar_A53 | 5.43 | 5.41 | 6.21 | 5.43 | 5.77 | 1.19 | |
| NEON_32bit_GEMM_Float32_WithScalar_A53_depth2 | 5.56 | 5.56 | 6.61 | 5.59 | 5.77 | 1.19 | |
| NEON_32bit_GEMM_Float32_MLA_Rotating | 4.60 | 4.60 | 6.54 | 4.21 | 2.73 | 1.07 | |
| NEON_32bit_GEMM_Float32_FMA_Rotating | 6.41 | 6.38 | 7.07 | 4.21 | 2.73 | 1.07 | |