This spreadsheet compares the single-thread performance of several GEMM kernels on some ARMv7 and ARMv8 cores.
The benchmark code is at https://github.com/google/gemmlowp/blob/master/standalone/neon-gemm-kernel-benchmark.cc
An 'op' is a single scalar arithmetic operation, such as a multiply or an add, so a SIMD multiply-add on 4 lanes counts as 4 x 2 = 8 ops.
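To make this counting convention concrete, here is a minimal C++ sketch. The function names ScalarOps and GemmOps are hypothetical and are not taken from the benchmark source. GemmOps assumes that each multiply-accumulate in an (m x k) by (k x n) product counts as 2 ops, which follows from the definition of an op but is an assumption about how the benchmark forms its totals.

```cpp
#include <cassert>
#include <cstdint>

// Scalar ops performed by one SIMD instruction.
// Each lane contributes `ops_per_lane` scalar ops; a multiply-add is
// one multiply plus one add, so ops_per_lane == 2 in that case.
std::int64_t ScalarOps(int lanes, int ops_per_lane) {
  assert(lanes > 0 && ops_per_lane > 0);
  return static_cast<std::int64_t>(lanes) * ops_per_lane;
}

// Scalar ops in a dense (m x k) by (k x n) matrix product, assuming
// (hypothetically) that every multiply-accumulate is counted as 2 ops.
std::int64_t GemmOps(std::int64_t m, std::int64_t n, std::int64_t k) {
  assert(m > 0 && n > 0 && k > 0);
  return 2 * m * n * k;
}
```

For example, ScalarOps(4, 2) gives the 8 ops quoted for a 4-lane multiply-add.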
Below we give the same data twice: first as raw throughput (Gop/s), then divided by the clock speed (GHz) to give efficiency in ops/cycle.
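The conversion between the two tables is a single division. A minimal sketch follows; the function name OpsPerCycle is hypothetical and not part of the benchmark code.

```cpp
#include <cassert>

// Converts throughput in Gop/s to efficiency in ops/cycle.
// Gop/s divided by GHz (Gcycles/s) leaves ops per cycle.
double OpsPerCycle(double gops_per_second, double clock_ghz) {
  assert(clock_ghz > 0.0);
  return gops_per_second / clock_ghz;
}
```

As a check against the data: the first throughput value in the NEON_64bit_GEMM_Int425Operands_intrinsics row is 160.93 Gop/s and the first clock value is 2.66 GHz, so OpsPerCycle(160.93, 2.66) is about 60.50, which is the first value in the same row of the ops/cycle table.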
Most of these kernels are written in 32-bit or 64-bit ARM inline assembly. Rows ending in '_intrinsics' are C++ intrinsics versions of the assembly kernel in the row just above them.
Note: in 32-bit mode, we enable ARM VFPv4 instructions (FMA). That corresponds to -mfpu=neon-vfpv4 on Android toolchains, and "armv7s" on iOS toolchains.
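To illustrate what the FMA note means at the intrinsics level, here is a small sketch of a 4-lane multiply-accumulate. It is not one of the benchmarked kernels; the function name MultiplyAdd4 is hypothetical. It assumes the compiler defines the ACLE macro __ARM_FEATURE_FMA when fused multiply-add is enabled (for example with -mfpu=neon-vfpv4 in 32-bit mode), and it falls back to the non-fused vmlaq_f32 on NEON without FMA, or to scalar code on non-ARM hosts.

```cpp
#include <cmath>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Computes acc[i] += a[i] * b[i] for i in 0..3.
void MultiplyAdd4(const float* a, const float* b, float* acc) {
#if (defined(__ARM_NEON) || defined(__ARM_NEON__)) && defined(__ARM_FEATURE_FMA)
  // Fused multiply-add: a single rounding step per lane.
  float32x4_t vacc = vld1q_f32(acc);
  vacc = vfmaq_f32(vacc, vld1q_f32(a), vld1q_f32(b));
  vst1q_f32(acc, vacc);
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  // Non-fused multiply-accumulate (MLA): the product is rounded before the add.
  float32x4_t vacc = vld1q_f32(acc);
  vacc = vmlaq_f32(vacc, vld1q_f32(a), vld1q_f32(b));
  vst1q_f32(acc, vacc);
#else
  // Portable scalar fallback so the sketch also builds on non-ARM hosts.
  for (int i = 0; i < 4; ++i) {
    acc[i] = std::fma(a[i], b[i], acc[i]);
  }
#endif
}
```

Under the op definition used here, one call performs 4 x 2 = 8 ops on every path.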
Throughput in Gop/s
Device (and contributor) | CPU Core | Clock (GHz) | Year
iPhone 11 Pro (Mobile Harness) | Apple A13 | 2.66 | 2019
iPhone Xs (equiv. Xr) (bweiss@) | Apple A12 | 2.49 | 2018
iPhone X (equiv. 8, 8+) (bweiss@) | Apple A11 | 2.39 | 2017
iPhone 7 (equiv. 7+) (bweiss@, hsiu@) | Apple A10 | 2.34 | 2016
iPhone 6+ (bweiss@, hsiu@) | Apple A8 | 1.40 | 2014
iPhone 5 (bweiss@, hsiu@) | Apple A6 | 1.30 | 2012
LG G8 pre-release, big core (taskset f0, as 80 does not work) | Qualcomm S855, custom ARM Cortex-A76 | 2.84 | 2018
LG G8 pre-release, medium core (taskset 70) | Qualcomm S855, custom ARM Cortex-A76 | 2.42 | 2018
LG G8 pre-release, little core (taskset 0f) | Qualcomm S855, custom ARM Cortex-A55r1 | 1.79 | 2018
Huawei M20 Pro fbarchard@, big core | Kirin 980 ARM Cortex-A76 | 2.60 | 2018
Huawei M20 Pro fbarchard@, little core | Kirin 980 ARM Cortex-A55r1 | 1.80 | 2018
Pixel 3 XL, fbarchard@, big core | Qualcomm S845 custom ARM Cortex-A75 | 2.80 | 2018
Pixel 3 XL, fbarchard@, little core | Qualcomm S845 custom ARM Cortex-A55 | 1.77 | 2018
Samsung Galaxy S9 Qualcomm big core | Qualcomm S845 ARM Cortex-A75 | 2.80 | 2018
Samsung Galaxy S9 Qualcomm little core | Qualcomm S845 ARM Cortex-A55 | 1.77 | 2018
Samsung Galaxy S9 prototype (non-final), fbarchard@, big core | Samsung Exynos M3 "mongoose" | 2.70 | 2018
Samsung Galaxy S9 prototype (non-final), fbarchard@, little core | ARM Cortex-A55 | 1.79 | 2018
Pixel 2 "walleye" (equivalently, mbridges@ North American version of Samsung Galaxy S8) big core | Qualcomm S835 custom Cortex-A73 | 2.46 | 2017
Pixel 2 "walleye" (equivalently, mbridges@ North American version of Samsung Galaxy S8) little core | Qualcomm S835 custom Cortex-A53 | 1.90 | 2017
Pixel XL big core | Qualcomm Kryo | 2.15 | 2016
Pixel XL little core | Qualcomm Kryo | 1.60 | 2016
Samsung Galaxy S7 big core | Samsung Exynos M1 "mongoose" | 2.30 | 2016
Samsung Galaxy S7 little core | ARM Cortex-A53 | 1.60 | 2016
Nexus 5X big core | ARM Cortex-A57 | 1.82 | 2015
Nexus 5X little core | ARM Cortex-A53 | 1.44 | 2015
Nexus 5 | Qualcomm Krait 400 | 2.26 | 2013
NVIDIA Jetson TX2 (rocky@) | ARM Cortex-A57 | 2.04 | 2017
NVIDIA Jetson TX2 (rocky@) | NVIDIA Denver2 | 2.04 | 2017
Nexus 9 | NVIDIA Denver | 2.30 | 2014
Nexus 10 | ARM Cortex-A15 (custom/early revision) | 1.70 | 2012
Old Android One (newer ones are A53-based) | ARM Cortex-A7 | 1.30 | 2014
NEON_64bit_GEMM_Int425Operands | bug | bug (146.69) | bug (132.24) | 44.42 | 37.73 | 15.06 | 41.01 | 13.85 | 41.46 | 15.05 | 43.11 | 15.12 | 37.03 | 15.26 | 35.93 | 15.45
NEON_64bit_GEMM_Int425Operands_intrinsics | 160.93 | 145.01 | 127.99 | 43.51 | 36.96 | 13.87 | 40.17 | 12.89 | 33.24 | 13.98 | 34.57 | 14.09 | 35.95 | 14.28 | 25.66 | 15.24
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits | 110.27 | 103.17 | 91.70 | 44.10 | 37.44 | 13.32 | 40.88 | 12.31 | 39.93 | 13.38 | 41.46 | 13.45 | 24.75 | 13.61 | 33.15 | 13.30
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits_intrinsics | 97.48 | 58.95 | 58.76 | 32.70 | 27.78 | 7.43 | 30.44 | 6.72 | 20.45 | 7.29 | 21.26 | 6.31 | 19.90 | 6.28 | 12.86 | 6.57
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits (this is what TFLite uses in quantized inference when dotprod is not available) | 84.30 | 78.69 | 73.88 | 71.11 | 25.60 | 44.59 | 37.87 | 10.90 | 41.16 | 9.92 | 37.27 | 10.84 | 38.78 | 10.93 | 18.90 | 11.06 | 27.54 | 11.63 | 22.51 | 15.99 | 25.35 | 9.75 | 22.55 | 8.82 | 24.95 | 15.54 | 17.30
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 71.05 | 61.36 | 40.68 | 43.75 | 13.96 | 33.01 | 28.04 | 4.95 | 30.57 | 4.49 | 23.49 | 4.94 | 24.34 | 4.91 | 18.04 | 5.05 | 14.38 | 5.24 | 15.41 | 11.02 | 18.91 | 4.67 | 11.03 | 4.25 | 13.86 | 10.32 | 9.91
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators | 57.66 | 53.77 | 47.72 | 44.49 | 21.36 | 22.44 | 19.05 | 11.99 | 20.71 | 11.12 | 19.89 | 11.38 | 20.64 | 11.48 | 13.21 | 11.66 | 17.61 | 12.18 | 16.21 | 11.62 | 16.29 | 10.39 | 13.36 | 9.23 | 14.78 | 12.34 | 12.93
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 34.63 | 32.44 | 26.95 | 29.92 | 14.36 | 14.73 | 12.68 | 3.44 | 13.49 | 3.12 | 13.47 | 3.39 | 14.02 | 3.97 | 11.55 | 4.02 | 9.03 | 4.03 | 10.77 | 7.71 | 12.26 | 3.48 | 6.86 | 3.16 | 10.00 | 9.00 | 5.37
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57 | 62.50 | 58.22 | 55.61 | 53.33 | 20.11 | 43.95 | 37.23 | 7.71 | 40.55 | 7.08 | 28.36 | 7.71 | 29.53 | 7.76 | 14.05 | 7.86 | 22.16 | 8.34 | 16.82 | 11.97 | 19.26 | 7.02 | 17.77 | 6.35 | 20.10 | 11.88 | 12.84
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct (this is what TFLite uses on big cores in quantized inference when dotprod is available) | 250.70 | 177.41 | 150.61 | 38.46 | 163.71 | 35.07
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct_A55r1 (this is what TFLite uses on little cores in quantized inference when dotprod is available) | 199.53 | 112.38 | 95.37 | 52.98 | 103.78 | 49.56
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct_narrow | 228.05 | 157.51 | 133.73 | 25.70 | 145.41 | 23.52
NEON_64bit_GEMM_Int32_WithScalar | 62.82 | 58.57 | 55.98 | 54.18 | 21.46 | 11.22 | 9.52 | 8.59 | 10.36 | 7.97 | 10.57 | 8.59 | 10.99 | 8.61 | 14.18 | 8.78 | 9.58 | 8.35 | 14.51 | 10.53 | 20.48 | 7.25 | 6.78 | 6.23 | 7.51 | 12.82 | 13.71
NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar | 47.23 | 44.15 | 31.74 | 32.80 | 21.44 | 26.76 | 22.74 | 7.06 | 24.73 | 6.57 | 14.64 | 7.10 | 15.24 | 7.17 | 21.26 | 7.30 | 11.13 | 5.45 | 14.11 | 10.48 | 20.45 | 4.68 | 9.56 | 4.10 | 10.59 | 15.55 | 16.17
NEON_64bit_GEMM_Float32_WithScalar | 62.71 | 58.44 | 55.95 | 54.10 | 21.54 | 44.35 | 37.68 | 8.59 | 40.95 | 7.93 | 20.06 | 8.56 | 20.86 | 8.65 | 33.74 | 8.80 | 16.81 | 7.80 | 14.71 | 10.82 | 16.41 | 6.77 | 12.25 | 5.84 | 13.82 | 15.97 | 16.62
NEON_64bit_GEMM_Float32_WithScalar_intrinsics | 26.78 | 32.72 | 17.63 | 15.77 | 6.91 | 12.90 | 10.97 | 2.93 | 11.92 | 2.72 | 7.90 | 2.95 | 8.20 | 2.97 | 12.23 | 3.02 | 4.90 | 3.10 | 6.75 | 5.00 | 7.98 | 1.48 | 3.51 | 1.31 | 6.95 | 14.54 | 7.72
NEON_64bit_GEMM_Float32_WithScalar_A57 | 62.58 | 57.73 | 56.02 | 53.46 | 21.62 | 44.34 | 37.63 | 9.54 | 40.89 | 8.83 | 21.18 | 9.58 | 22.00 | 9.59 | 33.49 | 9.82 | 15.78 | 8.15 | 16.59 | 12.14 | 17.00 | 7.08 | 14.24 | 6.05 | 15.77 | 15.45 | 17.53
NEON_64bit_GEMM_Float32_WithScalar_A53 | untested | bug | bug | bug | bug | 26.10 | 22.16 | 10.70 | 24.09 | 9.98 | 21.16 | 10.14 | 21.50 | 10.22 | 24.14 | 10.40 | 14.39 | 10.54 | 16.36 | 12.04 | 14.49 | 9.24 | 11.43 | 7.77 | 12.67 | 15.04 | 16.22
NEON_64bit_GEMM_Float32_WithScalar_A55r1 | 49.59 | bug | bug | 26.37 | 22.39 | 13.29 | 24.97 | 12.35 | 21.13 | 9.58 | 21.63 | 9.65 | 23.46 | 9.83 | 13.55 | 6.89
NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits (this is what TFLite uses in quantized inference) | 32-bit unsupported | 32-bit unsupported | 32-bit unsupported | 34.37 | 26.32 | 17.32 | 39.50 | 33.53 | 11.52 | 36.45 | 10.83 | 36.67 | 9.93 | 38.17 | 10.02 | 18.79 | 10.18 | 25.71 | 9.99 | 22.51 | 16.01 | 25.22 | 9.03 | 20.44 | 8.19 | 23.18 | 15.33 | 19.50 | 4.04
NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 47.40 | 26.57 | 9.47 | 29.08 | 24.73 | 5.60 | 28.14 | 5.16 | 29.37 | 5.65 | 30.53 | 6.25 | 15.91 | 6.35 | 18.18 | 6.29 | 19.33 | 13.62 | 14.26 | 5.63 | 13.47 | 5.10 | 19.08 | 9.33 | 11.53 | 2.72
NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators | 33.05 | 21.64 | 9.68 | 22.44 | 19.06 | 11.13 | 20.71 | 10.38 | 19.50 | 10.37 | 20.27 | 10.42 | 13.11 | 10.59 | 16.66 | 10.49 | 16.95 | 11.81 | 14.15 | 9.42 | 13.36 | 8.50 | 15.18 | 12.60 | 12.40 | 3.70
NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 14.00 | 6.66 | 3.58 | 22.40 | 19.02 | 9.66 | 20.67 | 9.23 | 17.48 | 9.22 | 18.19 | 9.00 | 12.89 | 9.16 | 14.79 | 9.20 | 15.63 | 10.88 | 14.06 | 8.39 | 11.94 | 7.55 | 14.60 | 8.85 | 10.78 | 3.76
NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand | 34.03 | 20.14 | 8.80 | 35.45 | 30.11 | 7.20 | 32.71 | 6.68 | 28.13 | 7.26 | 29.07 | 7.29 | 14.08 | 7.40 | 19.73 | 7.10 | 15.70 | 11.26 | 19.19 | 6.58 | 14.92 | 5.94 | 17.83 | 11.87 | 13.25 | 2.97
NEON_32bit_GEMM_Int32_WithScalar | 32.73 | 20.96 | 5.35 | 11.23 | 9.54 | 6.98 | 10.37 | 6.53 | 9.98 | 7.11 | 10.37 | 7.10 | 7.02 | 7.29 | 8.80 | 6.43 | 13.28 | 9.67 | 10.59 | 5.94 | 6.64 | 5.36 | 16.83 | 15.60 | 6.17 | 2.02
NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar | 10.86 | 8.63 | 3.49 | 20.12 | 17.11 | 4.55 | 18.48 | 4.15 | 12.94 | 4.56 | 13.43 | 4.60 | 3.38 | 4.68 | 8.03 | 4.28 | 9.60 | 6.97 | 5.02 | 3.94 | 7.49 | 3.55 | 15.19 | 11.93 | 6.86 | 1.51
NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar | 12.72 | 14.42 | 3.48 | 20.30 | 17.24 | 4.94 | 18.74 | 4.52 | 13.08 | 4.98 | 13.61 | 5.02 | 4.13 | 5.10 | 7.93 | 4.27 | 13.38 | 9.59 | 6.16 | 3.95 | 7.49 | 3.55 | 16.24 | 15.52 | 6.86 | 1.51
NEON_32bit_GEMM_Float32_MLA_WithScalar | 22.61 | 12.61 | 5.35 | 40.14 | 34.24 | 6.22 | 37.85 | 5.82 | 19.04 | 6.32 | 19.78 | 6.36 | 5.26 | 6.46 | 14.78 | 5.82 | 12.47 | 9.10 | 7.63 | 5.33 | 11.40 | 4.79 | 15.63 | 12.38 | 10.59 | 1.93
NEON_32bit_GEMM_Float32_WithScalar_intrinsics | 15.09 | 10.66 | 4.23 | 41.91 | 35.60 | 5.66 | 38.67 | 5.23 | 7.98 | 5.66 | 8.31 | 2.11 | 4.23 | 2.14 | 5.22 | 2.36 | 7.62 | 5.34 | 3.53 | 2.19 | 4.56 | 1.97 | 4.85 | 7.69 | 4.23 | 0.77
NEON_32bit_GEMM_Float32_WithScalar_A53 | 15.02 | 12.82 | 7.36 | 16.78 | 14.26 | 7.48 | 15.48 | 6.80 | 21.13 | 7.44 | 21.30 | 7.52 | 3.93 | 7.64 | 13.86 | 8.82 | 11.75 | 8.39 | 6.99 | 7.68 | 10.07 | 7.35 | 14.05 | 11.33 | 9.63 | 1.55
NEON_32bit_GEMM_Float32_WithScalar_A53_depth2 | 14.99 | 12.83 | 7.31 | 17.14 | 14.56 | 8.61 | 15.82 | 7.93 | 21.03 | 8.59 | 20.95 | 8.22 | 3.89 | 8.34 | 11.70 | 9.23 | 12.13 | 8.83 | 7.19 | 8.32 | 10.36 | 7.86 | 14.94 | 11.10 | 9.93 | 1.55
NEON_32bit_GEMM_Float32_MLA_Rotating | 18.70 | 10.09 | 3.77 | 26.44 | 22.50 | 5.07 | 24.66 | 4.69 | 12.88 | 5.07 | 13.41 | 5.10 | 4.80 | 5.19 | 10.22 | 4.71 | 9.96 | 7.29 | 7.37 | 4.33 | 7.85 | 3.90 | 14.77 | 12.80 | 6.88 | 1.39
NEON_32bit_GEMM_Float32_FMA_Rotating | 27.53 | 20.13 | 3.80 | 25.01 | 21.24 | 5.61 | 23.08 | 5.15 | 12.92 | 5.58 | 13.36 | 5.62 | 6.43 | 5.70 | 10.21 | 4.78 | 13.99 | 9.95 | 10.09 | 4.33 | 7.85 | 3.89 | 16.03 | 15.54 | 7.52 | 1.39
Efficiency in ops/cycle
Device Name / core type | CPU Core | Clock (GHz) | Year
iPhone 11 Pro (Mobile Harness) | Apple A13 | 2.66 | 2019
iPhone Xs (equiv. Xr) (bweiss@) | Apple A12 | 2.49 | 2018
iPhone X (equiv. 8, 8+) (bweiss@) | Apple A11 | 2.39 | 2017
iPhone 7 (equiv. 7+) (bweiss@, hsiu@) | Apple A10 | 2.34 | 2016
iPhone 6+ (bweiss@, hsiu@) | Apple A8 | 1.40 | 2014
iPhone 5 (bweiss@, hsiu@) | Apple A6 | 1.30 | 2012
LG G8 pre-release, big core (taskset f0, as 80 does not work) | Qualcomm S855, custom ARM Cortex-A76 | 2.84 | 2018
LG G8 pre-release, medium core (taskset 70) | Qualcomm S855, custom ARM Cortex-A76 | 2.42 | 2018
LG G8 pre-release, little core (taskset 0f) | Qualcomm S855, custom ARM Cortex-A55r1 | 1.79 | 2018
Huawei M20 Pro fbarchard@, big core | Kirin 980 ARM Cortex-A76 | 2.60 | 2018
Huawei M20 Pro fbarchard@, little core | Kirin 980 ARM Cortex-A55r1 | 1.80 | 2018
Pixel 3 XL, fbarchard@, big core | Qualcomm S845 custom ARM Cortex-A75 | 2.80 | 2018
Pixel 3 XL, fbarchard@, little core | Qualcomm S845 custom ARM Cortex-A55 | 1.77 | 2018
Samsung Galaxy S9 Qualcomm big core | Qualcomm S845 ARM Cortex-A75 | 2.80 | 2018
Samsung Galaxy S9 Qualcomm little core | Qualcomm S845 ARM Cortex-A55 | 1.77 | 2018
Samsung Galaxy S9 prototype (non-final), fbarchard@, big core | Samsung Exynos M3 "mongoose" | 2.70 | 2018
Samsung Galaxy S9 prototype (non-final), fbarchard@, little core | ARM Cortex-A55 | 1.79 | 2018
Pixel 2 "walleye" (equivalently, mbridges@ North American version of Samsung Galaxy S8) big core | Qualcomm S835 custom Cortex-A73 | 2.46 | 2017
Pixel 2 "walleye" (equivalently, mbridges@ North American version of Samsung Galaxy S8) little core | Qualcomm S835 custom Cortex-A53 | 1.90 | 2017
Pixel XL big core | Qualcomm Kryo | 2.15 | 2016
Pixel XL little core | Qualcomm Kryo | 1.60 | 2016
Samsung Galaxy S7 big core | Samsung Exynos M1 "mongoose" | 2.30 | 2016
Samsung Galaxy S7 little core | ARM Cortex-A53 | 1.60 | 2016
Nexus 5X big core | ARM Cortex-A57 | 1.82 | 2015
Nexus 5X little core | ARM Cortex-A53 | 1.44 | 2015
Nexus 5 | Qualcomm Krait 400 | 2.26 | 2013
NVIDIA Jetson TX2 (rocky@) | ARM Cortex-A57 | 2.04 | 2017
NVIDIA Jetson TX2 (rocky@) | NVIDIA Denver2 | 2.04 | 2017
Nexus 9 | NVIDIA Denver | 2.30 | 2014
Nexus 10 | ARM Cortex-A15 (custom/early revision) | 1.70 | 2012
Old Android One (newer ones are A53-based) | ARM Cortex-A7 | 1.30 | 2014
NEON_64bit_GEMM_Int425Operands | bug | bug | bug | 15.63 | 15.60 | 8.44 | 15.77 | 7.69 | 14.81 | 8.50 | 15.40 | 8.54 | 13.72 | 8.52 | 14.61 | 8.13
NEON_64bit_GEMM_Int425Operands_intrinsics | 60.50 | 58.24 | 53.55 | 15.31 | 15.28 | 7.77 | 15.45 | 7.16 | 11.87 | 7.90 | 12.34 | 7.96 | 13.31 | 7.98 | 10.43 | 8.02
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits | 41.46 | 41.43 | 38.37 | 15.52 | 15.48 | 7.46 | 15.72 | 6.84 | 14.26 | 7.56 | 14.81 | 7.60 | 9.17 | 7.60 | 13.47 | 7.00
NEON_64bit_GEMM_Int7Operands_AccumEightWithin16Bits_intrinsics | 36.65 | 23.67 | 24.59 | 11.51 | 11.48 | 4.16 | 11.71 | 3.73 | 7.31 | 4.12 | 7.59 | 3.56 | 7.37 | 3.51 | 5.23 | 3.46
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits (this is what TFLite uses in quantized inference when dotprod is not available) | 31.69 | 31.60 | 30.91 | 30.39 | 18.29 | 15.69 | 15.66 | 6.10 | 15.83 | 5.51 | 13.31 | 6.12 | 13.85 | 6.17 | 7.00 | 6.18 | 11.19 | 6.12 | 10.47 | 9.99 | 11.02 | 6.09 | 12.39 | 6.13 | 12.26 | 7.63 | 7.52
NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 26.71 | 24.64 | 17.02 | 18.70 | 9.97 | 11.62 | 11.59 | 2.77 | 11.76 | 2.50 | 8.39 | 2.79 | 8.69 | 2.77 | 6.68 | 2.82 | 5.84 | 2.76 | 7.17 | 6.88 | 8.22 | 2.92 | 6.06 | 2.95 | 6.81 | 5.07 | 4.31
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators | 21.68 | 21.59 | 19.97 | 19.01 | 15.26 | 7.90 | 7.88 | 6.71 | 7.96 | 6.18 | 7.10 | 6.43 | 7.37 | 6.48 | 4.89 | 6.51 | 7.16 | 6.41 | 7.54 | 7.26 | 7.08 | 6.49 | 7.34 | 6.41 | 7.26 | 6.06 | 5.62
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 13.02 | 13.03 | 11.28 | 12.79 | 10.26 | 5.18 | 5.24 | 1.93 | 5.19 | 1.73 | 4.81 | 1.92 | 5.01 | 2.24 | 4.28 | 2.24 | 3.67 | 2.12 | 5.01 | 4.82 | 5.33 | 2.18 | 3.77 | 2.19 | 4.91 | 4.42 | 2.34
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand_A57 | 23.50 | 23.38 | 23.27 | 22.79 | 14.36 | 15.47 | 15.39 | 4.32 | 15.60 | 3.93 | 10.13 | 4.36 | 10.55 | 4.38 | 5.20 | 4.39 | 9.01 | 4.39 | 7.82 | 7.48 | 8.37 | 4.39 | 9.76 | 4.41 | 9.88 | 5.84 | 5.58
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct (this is what TFLite uses on big cores in quantized inference when dotprod is available) | 94.25 | 62.43 | 62.26 | 21.54 | 62.97 | 19.48
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct_A55r1 (this is what TFLite uses on little cores in quantized inference when dotprod is available) | 75.01 | 39.55 | 39.42 | 29.67 | 39.92 | 27.53
NEON_64bit_GEMM_Uint8Operands_Uint32Accumulators_dotproduct_narrow | 85.73 | 55.43 | 55.28 | 14.39 | 55.93 | 13.07
NEON_64bit_GEMM_Int32_WithScalar | 23.62 | 23.52 | 23.42 | 23.15 | 15.33 | 3.95 | 3.94 | 4.81 | 3.98 | 4.43 | 3.77 | 4.85 | 3.93 | 4.86 | 5.25 | 4.91 | 3.89 | 4.40 | 6.75 | 6.58 | 8.90 | 4.53 | 3.72 | 4.33 | 3.69 | 6.30 | 5.96
NEON_64bit_GEMM_Float32_WithVectorDuplicatingScalar | 17.76 | 17.73 | 13.28 | 14.02 | 15.31 | 9.42 | 9.40 | 3.95 | 9.51 | 3.65 | 5.23 | 4.01 | 5.44 | 4.05 | 7.87 | 4.08 | 4.52 | 2.87 | 6.56 | 6.55 | 8.89 | 2.93 | 5.25 | 2.84 | 5.21 | 7.64 | 7.03
NEON_64bit_GEMM_Float32_WithScalar | 23.57 | 23.47 | 23.41 | 23.12 | 15.39 | 15.61 | 15.58 | 4.81 | 15.75 | 4.40 | 7.16 | 4.84 | 7.45 | 4.89 | 12.50 | 4.92 | 6.83 | 4.10 | 6.84 | 6.76 | 7.13 | 4.23 | 6.73 | 4.06 | 6.79 | 7.85 | 7.23
NEON_64bit_GEMM_Float32_WithScalar_intrinsics | 10.07 | 13.14 | 7.38 | 6.74 | 4.94 | 4.54 | 4.54 | 1.64 | 4.59 | 1.51 | 2.82 | 1.67 | 2.93 | 1.68 | 4.53 | 1.68 | 1.99 | 1.63 | 3.14 | 3.12 | 3.47 | 0.92 | 1.93 | 0.91 | 3.41 | 7.15 | 3.36
NEON_64bit_GEMM_Float32_WithScalar_A57 | 23.53 | 23.18 | 23.44 | 22.85 | 15.44 | 15.60 | 15.55 | 5.34 | 15.73 | 4.91 | 7.56 | 5.41 | 7.86 | 5.42 | 12.40 | 5.49 | 6.42 | 4.29 | 7.72 | 7.59 | 7.39 | 4.43 | 7.82 | 4.20 | 7.75 | 7.59 | 7.62
NEON_64bit_GEMM_Float32_WithScalar_A53 | untested | bug | bug | bug | bug | 9.18 | 9.16 | 5.99 | 9.26 | 5.55 | 7.56 | 5.73 | 7.68 | 5.77 | 8.94 | 5.81 | 5.85 | 5.55 | 7.61 | 7.53 | 6.30 | 5.77 | 6.28 | 5.39 | 6.23 | 7.39 | 7.05
NEON_64bit_GEMM_Float32_WithScalar_A55r1 | 18.64 | bug | bug | 9.28 | 9.26 | 7.44 | 9.60 | 6.86 | 7.55 | 5.41 | 7.72 | 5.45 | 8.69 | 5.49 | 5.51 | 3.63
NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits (this is what TFLite uses in quantized inference) | 14.69 | 18.80 | 13.32 | 13.90 | 13.86 | 6.45 | 14.02 | 6.02 | 13.10 | 5.61 | 13.63 | 5.66 | 6.96 | 5.69 | 10.45 | 5.26 | 10.47 | 10.01 | 10.96 | 5.65 | 11.23 | 5.69 | 10.26 | 6.66 | 11.47 | 3.11
NEON_32bit_GEMM_Int8Operands_AccumTwoWithin16Bits_intrinsics | 20.25 | 18.98 | 7.28 | 10.23 | 10.22 | 3.13 | 10.82 | 2.87 | 10.49 | 3.19 | 10.90 | 3.53 | 5.89 | 3.55 | 7.39 | 3.31 | 8.99 | 8.51 | 6.20 | 3.52 | 7.40 | 3.54 | 8.44 | 4.06 | 6.78 | 2.09
NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators | 14.13 | 15.46 | 7.45 | 7.90 | 7.88 | 6.24 | 7.97 | 5.77 | 6.96 | 5.86 | 7.24 | 5.89 | 4.85 | 5.91 | 6.77 | 5.52 | 7.89 | 7.38 | 6.15 | 5.89 | 7.34 | 5.90 | 6.72 | 5.48 | 7.30 | 2.85
NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_intrinsics | 5.98 | 4.76 | 2.75 | 7.88 | 7.86 | 5.41 | 7.95 | 5.13 | 6.24 | 5.21 | 6.50 | 5.09 | 4.77 | 5.12 | 6.01 | 4.84 | 7.27 | 6.80 | 6.11 | 5.24 | 6.56 | 5.24 | 6.46 | 3.85 | 6.34 | 2.89
NEON_32bit_GEMM_Uint8Operands_Uint32Accumulators_noexpand | 14.54 | 14.39 | 6.77 | 12.47 | 12.44 | 4.03 | 12.58 | 3.71 | 10.05 | 4.10 | 10.38 | 4.12 | 5.21 | 4.14 | 8.02 | 3.74 | 7.30 | 7.04 | 8.34 | 4.12 | 8.20 | 4.13 | 7.89 | 5.16 | 7.79 | 2.29
NEON_32bit_GEMM_Int32_WithScalar | 13.99 | 14.97 | 4.12 | 3.95 | 3.94 | 3.91 | 3.99 | 3.63 | 3.56 | 4.02 | 3.70 | 4.01 | 2.60 | 4.07 | 3.58 | 3.39 | 6.18 | 6.05 | 4.60 | 3.71 | 3.65 | 3.72 | 7.45 | 6.78 | 3.63 | 1.55
NEON_32bit_GEMM_Float32_MLA_WithVectorDuplicatingScalar | 4.64 | 6.16 | 2.68 | 7.08 | 7.07 | 2.55 | 7.11 | 2.31 | 4.62 | 2.58 | 4.80 | 2.60 | 1.25 | 2.62 | 3.26 | 2.25 | 4.46 | 4.36 | 2.18 | 2.46 | 4.11 | 2.47 | 6.72 | 5.19 | 4.04 | 1.16
NEON_32bit_GEMM_Float32_FMA_WithVectorDuplicatingScalar | 5.44 | 10.30 | 2.68 | 7.14 | 7.13 | 2.77 | 7.21 | 2.51 | 4.67 | 2.81 | 4.86 | 2.84 | 1.53 | 2.85 | 3.22 | 2.25 | 6.22 | 5.99 | 2.68 | 2.47 | 4.11 | 2.47 | 7.19 | 6.75 | 4.04 | 1.16
NEON_32bit_GEMM_Float32_MLA_WithScalar | 9.66 | 9.01 | 4.12 | 14.13 | 14.15 | 3.48 | 14.56 | 3.23 | 6.80 | 3.57 | 7.06 | 3.59 | 1.95 | 3.61 | 6.01 | 3.07 | 5.80 | 5.69 | 3.32 | 3.33 | 6.27 | 3.33 | 6.91 | 5.38 | 6.23 | 1.49
NEON_32bit_GEMM_Float32_WithScalar_intrinsics | 6.45 | 7.61 | 3.25 | 14.75 | 14.72 | 3.17 | 14.87 | 2.91 | 2.85 | 3.20 | 2.97 | 1.19 | 1.57 | 1.20 | 2.12 | 1.24 | 3.54 | 3.34 | 1.54 | 1.37 | 2.50 | 1.37 | 2.15 | 3.34 | 2.49 | 0.59
NEON_32bit_GEMM_Float32_WithScalar_A53 | 6.42 | 9.16 | 5.66 | 5.90 | 5.89 | 4.19 | 5.95 | 3.78 | 7.55 | 4.20 | 7.61 | 4.25 | 1.45 | 4.27 | 5.63 | 4.64 | 5.47 | 5.25 | 3.04 | 4.80 | 5.53 | 5.10 | 6.22 | 4.93 | 5.67 | 1.19
NEON_32bit_GEMM_Float32_WithScalar_A53_depth2 | 6.41 | 9.16 | 5.62 | 6.03 | 6.02 | 4.82 | 6.08 | 4.41 | 7.51 | 4.85 | 7.48 | 4.65 | 1.44 | 4.66 | 4.76 | 4.86 | 5.64 | 5.52 | 3.13 | 5.20 | 5.69 | 5.46 | 6.61 | 4.83 | 5.84 | 1.19
NEON_32bit_GEMM_Float32_MLA_Rotating | 7.99 | 7.21 | 2.90 | 9.30 | 9.30 | 2.84 | 9.49 | 2.60 | 4.60 | 2.86 | 4.79 | 2.88 | 1.78 | 2.90 | 4.16 | 2.48 | 4.63 | 4.55 | 3.20 | 2.71 | 4.31 | 2.71 | 6.54 | 5.57 | 4.05 | 1.07
NEON_32bit_GEMM_Float32_FMA_Rotating | 11.77 | 14.38 | 2.92 | 8.80 | 8.78 | 3.14 | 8.88 | 2.86 | 4.61 | 3.15 | 4.77 | 3.17 | 2.38 | 3.19 | 4.15 | 2.52 | 6.50 | 6.22 | 4.39 | 2.71 | 4.31 | 2.70 | 7.09 | 6.76 | 4.42 | 1.07