ARM GEMM kernel single-thread benchmark

	A	B	C	D	E	F	G	H
1	This spreadsheet compares the single-thread performance of several GEMM kernels on some ARMv7 and ARMv8 cores, using the Android NDK r12 standalone toolchain ("clang version 3.8.256229").
2
3	This benchmarks only the GEMM kernel, not a whole GEMM. The focus here is on pure arithmetic.
4
5	The benchmark code is at
6	https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc
7
8	An 'op' is a single scalar arithmetic op such as mul or add. So a SIMD multiply-add op on 4 lanes counts as 4x2=8 ops.
9	Below, the data is given first as raw throughput (Gop/s), then divided by clock speed (GHz) to give ops/cycle.
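As a quick illustration of the two definitions above, here is a minimal Python sketch that counts the ops in a SIMD multiply-add and converts Gop/s to ops/cycle. The function names are hypothetical and are not part of the benchmark code. The sample figures are the Pixel XL big core values for NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits from the tables in this sheet.

```python
def simd_multiply_add_ops(lanes):
    """Scalar ops in one SIMD multiply-add: one mul and one add per lane."""
    return lanes * 2


def ops_per_cycle(gops_per_second, clock_ghz):
    """Convert throughput in Gop/s to ops/cycle by dividing by the clock in GHz."""
    return gops_per_second / clock_ghz


# A multiply-add on 4 lanes counts as 4x2 = 8 ops.
print(simd_multiply_add_ops(4))

# 22.61 Gop/s at 2.15 GHz, rounded to two decimals as in the efficiency table.
print(round(ops_per_cycle(22.61, 2.15), 2))
```

Dividing rounded Gop/s values can differ from the sheet in the last digit, since the sheet presumably divides unrounded measurements.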
10
11	Most of these kernels are written in 32-bit or 64-bit ARM inline assembly. The rows ending in '_intrinsics' are C++/intrinsics versions of the asm kernel in the row just above them.
12
13	Throughput in Gop/s
14	Device Name / core type	Pixel XL big core	Pixel XL little core	Nexus 5	Nexus 5X big core	Nexus 5X little core	Android One (old)
15	CPU Core	Kryo	Kryo	Krait	Cortex-A57	Cortex-A53	Cortex-A7
16	Clock (GHz)	2.15	1.60	2.26	1.82	1.44	1.30
17	NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits	22.61	16.74		22.61	8.87		This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams
18	NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics	15.60	11.55		11.05	4.27
19	NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators	16.05	12.08		13.39	9.37		This is gemmlowp with DefaultL8R8BitDepthParams
20	NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics	10.87	8.03		6.90	3.19
21	NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57	16.93	12.33		18.17	6.38
22	NEON_64bit_GEMM_Int32_WithScalar	14.79	10.87		6.80	6.45
23	NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar	14.23	10.42		9.59	4.20		This is more or less Eigen/float.
24	NEON_64bit_GEMM_Float32_WithScalar	14.78	10.87		12.29	6.03		This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers.
25	NEON_64bit_GEMM_Float32_WithScalar_intrinsics	6.82	5.06		3.54	1.33
26	NEON_64bit_GEMM_Float32_WithScalar_A57	16.58	12.22		14.28	6.29
27	NEON_64bit_GEMM_Float32_WithScalar_A53	16.42	12.19		11.47	8.15
28	NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits	22.69	16.79	23.63	21.49	8.24	4.04	This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams
29	NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics	19.08	14.40	20.67	13.90	5.14	2.72
30	NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators	17.09	12.65	15.13	13.16	8.57	3.70	This is gemmlowp with DefaultL8R8BitDepthParams
31	NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics	15.78	11.67	14.62	11.55	7.63	3.76
32	NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand	15.86	11.73	17.80	14.96	6.00	2.97
33	NEON_32bit_GEMM_Int32_WithScalar	13.21	9.77	17.09	6.59	5.42	2.02
34	NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar	9.55	6.97	15.26	7.52	3.59	1.51	This is more or less Eigen/float.
35	NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar	13.16	9.75	16.49	7.52	3.59	1.51
36	NEON_32bit_GEMM_Float32_MLA_WithScalar	12.31	9.13	15.60	11.21	4.85	1.93	This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers.
37	NEON_32bit_GEMM_Float32_WithScalar_intrinsics	7.46	5.59	4.84	4.49	1.99	0.77
38	NEON_32bit_GEMM_Float32_WithScalar_A53	11.66	8.65	14.03	9.88	8.31	1.55
39	NEON_32bit_GEMM_Float32_WithScalar_A53_depth2	11.94	8.89	14.94	10.17	8.31	1.55
40	NEON_32bit_GEMM_Float32_MLA_Rotating	9.89	7.36	14.79	7.67	3.94	1.39
41	NEON_32bit_GEMM_Float32_FMA_Rotating	13.79	10.22	15.97	7.66	3.94	1.39
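The notes on the Float32 rows distinguish 'WithScalar' kernels, which multiply-accumulate a vector against one lane of another vector, from 'WithVectorDuplicatingScalar' kernels. The Python sketch below emulates both forms on plain lists to show that they compute the same result. It assumes, from the kernel name only, that the duplicating variant first broadcasts one lane into a full vector. The helper names are hypothetical, and this is an emulation of the arithmetic, not the benchmark's NEON code.

```python
def dup_lane(vec, lane):
    """Emulate broadcasting one lane of a vector to all lanes (an extra step)."""
    return [vec[lane]] * len(vec)


def mla(acc, a, b):
    """Emulate a lane-wise vector multiply-accumulate: acc[i] + a[i] * b[i]."""
    return [c + x * y for c, x, y in zip(acc, a, b)]


def mla_lane(acc, a, b, lane):
    """Emulate multiply-accumulate of a vector against one lane: acc[i] + a[i] * b[lane]."""
    return [c + x * b[lane] for c, x in zip(acc, a)]


acc = [0.0, 0.0, 0.0, 0.0]
lhs = [1.0, 2.0, 3.0, 4.0]
rhs = [10.0, 20.0, 30.0, 40.0]

with_dup = mla(acc, lhs, dup_lane(rhs, 1))  # two steps: duplicate, then multiply-accumulate
with_lane = mla_lane(acc, lhs, rhs, 1)      # one step: multiply-accumulate against lane 1

print(with_dup == with_lane)
print(with_lane)
```

The two forms agree numerically; the difference the sheet measures is the cost of the extra duplication step.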
42
43
44
45	Efficiency in ops/cycle
46	Device Name / core type	Pixel XL big core	Pixel XL little core	Nexus 5	Nexus 5X big core	Nexus 5X little core	Android One (old)
47	CPU Core	Kryo	Kryo	Krait	Cortex-A57	Cortex-A53	Cortex-A7
48	Clock (GHz)	2.15	1.60	2.26	1.82	1.44	1.30
49	NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits	10.52	10.47		12.42	6.16		This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams
50	NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics	7.26	7.22		6.07	2.97
51	NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators	7.47	7.55		7.36	6.51		This is gemmlowp with DefaultL8R8BitDepthParams
52	NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics	5.06	5.02		3.79	2.21
53	NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57	7.88	7.71		9.98	4.43
54	NEON_64bit_GEMM_Int32_WithScalar	6.88	6.79		3.74	4.48
55	NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar	6.62	6.51		5.27	2.91		This is more or less Eigen/float.
56	NEON_64bit_GEMM_Float32_WithScalar	6.87	6.80		6.75	4.19		This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers.
57	NEON_64bit_GEMM_Float32_WithScalar_intrinsics	3.17	3.16		1.95	0.92
58	NEON_64bit_GEMM_Float32_WithScalar_A57	7.71	7.64		7.84	4.36
59	NEON_64bit_GEMM_Float32_WithScalar_A53	7.64	7.62		6.30	5.66
60	NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits	10.55	10.49	10.46	11.81	5.72	3.11	This is gemmlowp with L8R8WithLhsNonzeroBitDepthParams
61	NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics	8.88	9.00	9.15	7.64	3.57	2.09
62	NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators	7.95	7.90	6.70	7.23	5.95	2.85	This is gemmlowp with DefaultL8R8BitDepthParams
63	NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics	7.34	7.30	6.47	6.34	5.30	2.89
64	NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand	7.38	7.33	7.88	8.22	4.17	2.29
65	NEON_32bit_GEMM_Int32_WithScalar	6.14	6.11	7.56	3.62	3.76	1.55
66	NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar	4.44	4.36	6.75	4.13	2.49	1.16	This is more or less Eigen/float.
67	NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar	6.12	6.09	7.30	4.13	2.49	1.16
68	NEON_32bit_GEMM_Float32_MLA_WithScalar	5.73	5.70	6.90	6.16	3.37	1.49	This is the 'sane' way to do a float kernel on ARM. It uses multiply-accumulate-vector-against-one-lane instructions, which are not reflected in Eigen's SIMD wrappers.
69	NEON_32bit_GEMM_Float32_WithScalar_intrinsics	3.47	3.49	2.14	2.47	1.38	0.59
70	NEON_32bit_GEMM_Float32_WithScalar_A53	5.43	5.41	6.21	5.43	5.77	1.19
71	NEON_32bit_GEMM_Float32_WithScalar_A53_depth2	5.56	5.56	6.61	5.59	5.77	1.19
72	NEON_32bit_GEMM_Float32_MLA_Rotating	4.60	4.60	6.54	4.21	2.73	1.07
73	NEON_32bit_GEMM_Float32_FMA_Rotating	6.41	6.38	7.07	4.21	2.73	1.07
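To compare an '_intrinsics' row with the asm kernel in the row just above it, one option is to take the ratio of the two throughputs. The Python sketch below does this for three Pixel XL big core pairs copied from the throughput table above. The function name and the choice of rows are illustrative assumptions.

```python
# (asm Gop/s, intrinsics Gop/s) on the Pixel XL big core, copied from the throughput table.
pairs = {
    "NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits": (22.61, 15.60),
    "NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators": (16.05, 10.87),
    "NEON_64bit_GEMM_Float32_WithScalar": (14.78, 6.82),
}


def intrinsics_fraction(asm_gops, intrinsics_gops):
    """Fraction of the asm kernel's throughput reached by the intrinsics version."""
    return intrinsics_gops / asm_gops


for name, (asm_gops, intrinsics_gops) in pairs.items():
    print(f"{name}: {intrinsics_fraction(asm_gops, intrinsics_gops):.2f}")
```

The same ratio is obtained from the ops/cycle table, up to rounding, because both rows in a pair are divided by the same clock speed.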