Data from official TF benchmark
https://www.tensorflow.org/performance/benchmarks
Data format: NCHW
NCCL: false
Variable update: parameter server
PS: CPU
Dataset shape: imagenet
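
| batch size | | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dataset | | synth | synth | synth | synth | real | real | real | real | synth | synth | synth | synth | real | real | real | real | |
| model | | inception3 | inception3 | resnet50 | resnet50 | inception3 | inception3 | resnet50 | resnet50 | inception3 | inception3 | resnet50 | resnet50 | inception3 | inception3 | resnet50 | resnet50 | |
| machine | | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | |
| gpu type | gpus | P100 | K80 | P100 | K80 | P100 | K80 | P100 | K80 | P100 | K80 | P100 | K80 | P100 | K80 | P100 | K80 | median |
| images/s | 1 | 142 | 30 | 219 | 52 | 142 | 30.6 | 218 | 51.2 | 128 | 29.3 | 195 | 49.5 | 130 | 29.5 | 193 | 49.3 | |
| images/s | 2 | 284 | 58 | 422 | 99 | 278 | 58.4 | 425 | 98.8 | 259 | 55 | 368 | 95.4 | 257 | 55.4 | 369 | 95.3 | |
| images/s | 4 | 596 | 116 | 852 | 195 | 551 | 115 | 853 | 194 | 520 | 109 | 768 | 183 | 507 | 110 | 760 | 186 | |
| images/s | 8 | 1131 | 227 | 1734 | 387 | 1079 | 225 | 1630 | 381 | 995 | 216 | 1485 | 362 | 966 | 216 | 1410 | 359 | |
| speedup | 1 | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x |
| speedup | 2 | 2.00x | 1.93x | 1.93x | 1.90x | 1.96x | 1.91x | 1.95x | 1.93x | 2.02x | 1.88x | 1.89x | 1.93x | 1.98x | 1.88x | 1.91x | 1.93x | 1.93x |
| speedup | 4 | 4.20x | 3.87x | 3.89x | 3.75x | 3.88x | 3.76x | 3.91x | 3.79x | 4.06x | 3.72x | 3.94x | 3.70x | 3.90x | 3.73x | 3.94x | 3.77x | 3.87x |
| speedup | 8 | 7.96x | 7.57x | 7.92x | 7.44x | 7.60x | 7.35x | 7.48x | 7.44x | 7.77x | 7.37x | 7.62x | 7.31x | 7.43x | 7.32x | 7.31x | 7.28x | 7.44x |
| efficiency | 1 | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% | 100.00% |
| efficiency | 2 | 100.00% | 96.67% | 96.35% | 95.19% | 97.89% | 95.42% | 97.48% | 96.48% | 101.17% | 93.86% | 94.36% | 96.36% | 98.85% | 93.90% | 95.60% | 96.65% | 96.42% |
| efficiency | 4 | 104.93% | 96.67% | 97.26% | 93.75% | 97.01% | 93.95% | 97.82% | 94.73% | 101.56% | 93.00% | 98.46% | 92.42% | 97.50% | 93.22% | 98.45% | 94.32% | 96.84% |
| efficiency | 8 | 99.56% | 94.58% | 98.97% | 93.03% | 94.98% | 91.91% | 93.46% | 93.02% | 97.17% | 92.15% | 95.19% | 91.41% | 92.88% | 91.53% | 91.32% | 91.02% | 93.02% |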
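
The speedup and efficiency rows appear to be derived from the images/s rows as throughput relative to the 1-GPU run, and that ratio divided by the GPU count. A minimal sketch under that assumption, using the dgx-1 / inception3 / synth / batch size 64 column (the function name `scaling` is made up for illustration):

```python
def scaling(images_per_sec):
    """Map {gpus: images/s} to {gpus: (speedup, efficiency)}.

    Assumption: speedup is throughput relative to the 1-GPU run,
    and efficiency is speedup divided by the number of GPUs.
    """
    base = images_per_sec[1]
    return {n: (rate / base, rate / base / n)
            for n, rate in images_per_sec.items()}


# dgx-1 / P100, inception3, synthetic data, batch size 64
dgx1_inception3 = {1: 142, 2: 284, 4: 596, 8: 1131}

result = scaling(dgx1_inception3)
for gpus, (speedup, efficiency) in result.items():
    print(f"{gpus} GPUs: speedup {speedup:.2f}x, efficiency {efficiency:.2%}")
```

For this column the sketch reproduces the sheet's values (4.20x and 104.93% at 4 GPUs, 7.96x and 99.56% at 8 GPUs).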
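
speedup from using synth vs. real dataset (ratio of images/s)

| batch size | | 64 | 64 | 64 | 64 | 32 | 32 | 32 | 32 | |
|---|---|---|---|---|---|---|---|---|---|---|
| model | | inception3 | inception3 | resnet50 | resnet50 | inception3 | inception3 | resnet50 | resnet50 | |
| machine | gpus | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | median |
| speedup | 1 | 1.00x | 0.98x | 1.00x | 1.02x | 0.98x | 0.99x | 1.01x | 1.00x | 1.00x |
| speedup | 2 | 1.02x | 0.99x | 0.99x | 1.00x | 1.01x | 0.99x | 1.00x | 1.00x | 1.00x |
| speedup | 4 | 1.08x | 1.01x | 1.00x | 1.01x | 1.03x | 0.99x | 1.01x | 0.98x | 1.01x |
| speedup | 8 | 1.05x | 1.01x | 1.06x | 1.02x | 1.03x | 1.00x | 1.05x | 1.01x | 1.02x |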
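
speedup from using batch size 64 vs. 32 (ratio of multi-GPU speedups, not of raw images/s)

| dataset | | synth | synth | synth | synth | real | real | real | real | |
|---|---|---|---|---|---|---|---|---|---|---|
| model | | inception3 | inception3 | resnet50 | resnet50 | inception3 | inception3 | resnet50 | resnet50 | |
| machine | gpus | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | median |
| speedup | 1 | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x |
| speedup | 2 | 0.99x | 1.03x | 1.02x | 0.99x | 0.99x | 1.02x | 1.02x | 1.00x | 1.01x |
| speedup | 4 | 1.03x | 1.04x | 0.99x | 1.01x | 0.99x | 1.01x | 0.99x | 1.00x | 1.01x |
| speedup | 8 | 1.02x | 1.03x | 1.04x | 1.02x | 1.02x | 1.00x | 1.02x | 1.02x | 1.02x |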
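
The 1-GPU row of the batch-size comparison is exactly 1.00x everywhere, which suggests these figures are ratios of scaling speedups rather than of raw throughput. The sketch below checks that reading for the dgx-1 / inception3 / synth column; the interpretation is an inference from the numbers, and the variable names are illustrative.

```python
# images/s at 1, 2, 4 and 8 GPUs (dgx-1 / P100, inception3, synthetic data)
gpus = [1, 2, 4, 8]
bs64 = [142, 284, 596, 1131]
bs32 = [128, 259, 520, 995]

# Ratio of speedups: each series is first normalised by its own 1-GPU run.
speedup_ratio = [
    round((a / bs64[0]) / (b / bs32[0]), 2) for a, b in zip(bs64, bs32)
]

# Ratio of raw throughput, for comparison.
throughput_ratio = [round(a / b, 2) for a, b in zip(bs64, bs32)]

for n, s, t in zip(gpus, speedup_ratio, throughput_ratio):
    print(f"{n} GPUs: speedup ratio {s:.2f}x, raw throughput ratio {t:.2f}x")
```

The speedup ratios come out as 1.00x, 0.99x, 1.03x and 1.02x, matching the sheet, while the raw throughput ratio is about 1.10x to 1.15x on this machine. So batch size 64 is faster in absolute terms here, but scales about the same.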
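
speedup from GPU P100 vs. K80 (ratio of images/s)

| batch size | | 64 | 64 | 64 | 64 | 32 | 32 | 32 | 32 | |
|---|---|---|---|---|---|---|---|---|---|---|
| dataset | | synth | synth | real | real | synth | synth | real | real | |
| model | gpus | inception3 | resnet50 | inception3 | resnet50 | inception3 | resnet50 | inception3 | resnet50 | median |
| speedup | 1 | 4.73x | 4.21x | 4.64x | 4.26x | 4.37x | 3.94x | 4.41x | 3.91x | 4.40x |
| speedup | 2 | 4.90x | 4.26x | 4.76x | 4.30x | 4.71x | 3.86x | 4.64x | 3.87x | |
| speedup | 4 | 5.14x | 4.37x | 4.79x | 4.40x | 4.77x | 4.20x | 4.61x | 4.09x | |
| speedup | 8 | 4.98x | 4.48x | 4.80x | 4.28x | 4.61x | 4.10x | 4.47x | 3.93x | |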
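
The single median of 4.40x next to the P100 vs. K80 figures seems to be taken over all 32 ratios (8 configurations at 1, 2, 4 and 8 GPUs), not over the first row alone. A quick check with Python's standard `statistics.median`, using the rounded values from the sheet, so the result can differ in the last digit from a median of unrounded ratios:

```python
from statistics import median

# P100 vs. K80 throughput ratios, one list per GPU count (1, 2, 4, 8)
p100_vs_k80 = [
    [4.73, 4.21, 4.64, 4.26, 4.37, 3.94, 4.41, 3.91],
    [4.90, 4.26, 4.76, 4.30, 4.71, 3.86, 4.64, 3.87],
    [5.14, 4.37, 4.79, 4.40, 4.77, 4.20, 4.61, 4.09],
    [4.98, 4.48, 4.80, 4.28, 4.61, 4.10, 4.47, 3.93],
]

# Median over every ratio in the block
overall = median(value for row in p100_vs_k80 for value in row)

# Median over the 1-GPU row only, for comparison
first_row = median(p100_vs_k80[0])

print(f"all rows: {overall:.3f}x, 1-GPU row only: {first_row:.3f}x")
```

The median over all rows lands within 0.01 of the reported 4.40x, while the 1-GPU row alone gives a lower value, which supports the all-rows reading.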
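
comparison of resnet50 vs. inception3 (ratio of images/s)

| batch size | | 64 | 64 | 64 | 64 | 32 | 32 | 32 | 32 | |
|---|---|---|---|---|---|---|---|---|---|---|
| dataset | | synth | synth | real | real | synth | synth | real | real | |
| machine | gpus | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | dgx-1 | gce | median |
| speedup | 1 | 1.54x | 1.73x | 1.54x | 1.67x | 1.52x | 1.69x | 1.48x | 1.67x | 1.61x |
| speedup | 2 | 1.49x | 1.71x | 1.53x | 1.69x | 1.42x | 1.73x | 1.44x | 1.72x | |
| speedup | 4 | 1.43x | 1.68x | 1.55x | 1.69x | 1.48x | 1.68x | 1.50x | 1.69x | |
| speedup | 8 | 1.53x | 1.70x | 1.51x | 1.69x | 1.49x | 1.68x | 1.46x | 1.66x | |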
Observations:
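- **scaling is close to linear**: median efficiency is 96.42% at 2 GPUs, 96.84% at 4 GPUs and 93.02% at 8 GPUs
- efficiency drops slightly with more GPUs but stays above 90% (lowest value: 91.02% at 8 GPUs)
- a few runs show superlinear speedup (e.g. 104.93% efficiency at 4 GPUs), probably due to noise in the 1-GPU measurement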
- **using real or synth dataset doesn't show any significant effect**, thus we can use synthetic dataset to estimate performance on real dataset
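- using batch size 64 or 32 doesn't show any significant effect on scaling (median speedup ratio 1.01x to 1.02x); raw throughput on P100 is roughly 10% higher with batch size 64 (e.g. 142 vs. 128 images/s for inception3 on 1 GPU)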
- this kind of training is 4.4x faster (median) on P100 than K80
- training resnet50 is 1.61x faster (median) than inception3 in this benchmark
- both architectures on this dataset have roughly 24-26 million parameters
- baseline performance on 1x Tesla K80 is 30 images/sec on InceptionV3 and 50 images/sec on Resnet50