Fast and Lean Data Science With TPUs,
Keras and Tensorflow 2.1
Martin Görner
Developer Advocate
@martin_gorner
Please reach out with questions
demo: RetinaNet
TPU hardware
architecture
01
Cloud TPU v2
180 teraflops - 64 GB High Bandwidth Memory (HBM)
Cloud TPU v3
420 teraflops - 128 GB High Bandwidth Memory (HBM)
Cloud TPU v2 Pod (public beta)
11.6 PFLOPS with up to 512 TPU cores
Cloud TPU v3 pod (public beta)
>100 PFLOPS with 32, 128, …, 2048 TPU cores
Adjust batch size: ideal 128 per TPU core, interesting from 8 per core
scales from 8 to 2048 cores
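The batch-size rule above is simple arithmetic. The sketch below multiplies the per-core batch size by the number of cores; the function and constant names are illustrative and not part of any TPU API.

```python
IDEAL_PER_CORE = 128  # per-core batch size the slide calls ideal
MIN_PER_CORE = 8      # per-core batch size from which TPUs become interesting


def global_batch_size(num_cores, per_core=IDEAL_PER_CORE):
    """Global batch size when every TPU core processes `per_core` examples."""
    return num_cores * per_core


# Core counts taken from the slides: one Cloud TPU (8 cores) up to a full v3 Pod.
for cores in (8, 32, 128, 2048):
    print(cores, "cores:", global_batch_size(cores, MIN_PER_CORE),
          "to", global_batch_size(cores), "examples per step")
```

In Keras code the same product is usually written as the per-core batch size times `strategy.num_replicas_in_sync`.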
TPU core
One TPU core
MXU: Matrix Multiply Unit, 128x128 bfloat16 matrices
Runs matrix multiplications: bfloat16 x bfloat16 = float32
VPU: Vector Processing Unit, float32, int32
Runs everything else (RELU, softmax, batch norm, …)
bfloat16
Automatically replacing float32 matrix multiplications by bfloat16 x bfloat16 => float32 works in neural networks
Format | Sign | Exponent | Fraction | Range |
bfloat16 | 1 bit | 8 bits | 7 bits | ~1e−38 to ~3e38 |
float32 | 1 bit | 8 bits | 23 bits | ~1e−38 to ~3e38 |
float16 | 1 bit | 5 bits | 10 bits | ~5.9e−8 to 6.5e4 |
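bfloat16 keeps the 8 exponent bits of float32 and shortens the fraction to 7 bits, so it amounts to the upper 16 bits of a float32. The sketch below emulates that conversion in plain Python with round-to-nearest-even. It is an illustration of the number format only, not the TPU's actual conversion logic; `round_to_bfloat16` is a hypothetical helper and NaN inputs are not handled.

```python
import struct


def round_to_bfloat16(value):
    """Round a Python float to the nearest bfloat16 value, returned as a float.

    Assumption: round-to-nearest-even on the 16 discarded bits. NaN is not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                      # rounding bias
    bits &= 0xFFFF0000                                       # keep sign, exponent, 7 fraction bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Same range as float32: a value near the float32 maximum stays finite.
big = round_to_bfloat16(3e38)

# Less precision: increments smaller than 2**-8 next to 1.0 are rounded away.
small_step = round_to_bfloat16(1.0 + 2**-9)   # 1.0
kept_step = round_to_bfloat16(1.0 + 2**-7)    # 1.0078125
```

The wide exponent is why bfloat16 can often replace float32 in matrix multiplications without the loss scaling that float16 tends to need.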
MXU: Systolic array architecture
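As a rough illustration of the systolic idea, the sketch below simulates a grid of multiply-accumulate cells in which operands arrive in a skewed, time-staggered order, so cell (i, j) receives its k-th operand pair at step i + j + k. This is a simplified output-stationary variant chosen for readability. It is an assumption for illustration and does not claim to reproduce the MXU's actual dataflow.

```python
def systolic_matmul(a, b):
    """Multiply matrices a (n x k) and b (k x m) on a simulated n x m cell grid.

    Returns the product and the number of time steps the grid needed.
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]          # one accumulator per cell
    steps = n + m + k_dim - 2                  # last operand reaches the far corner
    for t in range(steps):
        for i in range(n):
            for j in range(m):
                k = t - i - j                  # operand pair arriving at cell (i, j) now
                if 0 <= k < k_dim:
                    acc[i][j] += a[i][k] * b[k][j]
    return acc, steps


product, steps = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product, "in", steps, "steps")
```

Each cell only ever talks to its neighbours and performs one multiply-add per step, which is what makes the layout dense and power-efficient in hardware.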
Benchmark: planespotting
tensorflow-without-a-phd/planespotting
an airplane detection model
GitHub diff between master and TPU branches
github.com/GoogleCloudPlatform/tensorflow-without-a-phd/compare/master...tpu
Photo: US Geological Survey - public domain
Training hardware options on AI Platform
# config.yaml: single V100 GPU
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_v100
# config.yaml: cluster of 5 V100 GPUs (1 master + 4 workers) with 1 parameter server
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_v100
  parameterServerType: standard
  workerType: standard_v100
  parameterServerCount: 1
  workerCount: 4
# config.yaml: Cloud TPU
trainingInput:
  scaleTier: BASIC_TPU
Hardware | Training Time | Cost |
GPU - P100 | 5h50 | $11 |
GPU - V100 | 4h30 | $13 |
Cluster 5 GPUs - P100 | 1h15 | $11 |
Cluster 5 GPUs - V100 | 1h00 | $13 |
VM with 4 GPUs - V100 | 1h50 | $19 |
Cloud TPU v2 | 1h00 | $4.7 |
Training runs Jan 25th 2019 Tensorflow 1.12, pricing structure of May 1st 2019, time and cost to train model on same number of epochs with 4 evaluations per training
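To read the benchmark as ratios, the sketch below recomputes speedup and cost savings relative to a chosen baseline. The figures are copied from the table; the `runs` dictionary and `compare` helper are illustrative names.

```python
# (training time in minutes, cost in USD), copied from the benchmark table
runs = {
    "GPU - P100": (350, 11.0),
    "GPU - V100": (270, 13.0),
    "Cluster 5 GPUs - P100": (75, 11.0),
    "Cluster 5 GPUs - V100": (60, 13.0),
    "VM with 4 GPUs - V100": (110, 19.0),
    "Cloud TPU v2": (60, 4.7),
}


def compare(name, baseline="GPU - V100"):
    """Return (speedup, cost ratio) of `name` relative to `baseline`."""
    minutes, cost = runs[name]
    base_minutes, base_cost = runs[baseline]
    return base_minutes / minutes, base_cost / cost


speedup, cost_ratio = compare("Cloud TPU v2")
print(f"Cloud TPU v2 vs. single V100: {speedup:.1f}x faster, {cost_ratio:.1f}x cheaper")
```

Against the 5-GPU V100 cluster the training time is the same, so the difference there is cost alone.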
Keras / TPU integration
available in Tensorflow 2.1
02
TPU / GPU cross-compatible model code
import tensorflow as tf

try:  # detect TPUs
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:  # detect GPUs
    strategy = tf.distribute.MirroredStrategy()  # for GPU or multi-GPU machines

with strategy.scope():
    model = tf.keras.models.Sequential()  # standard tf.keras code from here
Try it: bit.ly/keras-TPU
tf.data.Dataset and Google Cloud Storage (GCS)
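AUTOTUNE = tf.data.experimental.AUTOTUNE  # let tf.data tune the parallelism
# TFREC_FILES (a GCS file pattern) and read_tfrecord_fn (a parsing function) are defined elsewhere
options = tf.data.Options()
options.experimental_deterministic = False  # allow out-of-order reads for speed
dataset = tf.data.Dataset.list_files(TFREC_FILES, shuffle=True).with_options(options)
dataset = tf.data.TFRecordDataset(dataset, num_parallel_reads=AUTOTUNE)
dataset = dataset.map(read_tfrecord_fn, num_parallel_calls=AUTOTUNE)
dataset = dataset.shuffle(1000)
dataset = dataset.batch(64)
dataset = dataset.repeat()
dataset = dataset.prefetch(AUTOTUNE)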
For top GCS throughput, load in parallel from a reasonable number (~100s) of reasonably large files (~100MB)
tutorial: codelabs.developers.google.com/codelabs/keras-flowers-data
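The file-size guidance can be turned into a quick sizing calculation before writing TFRecord files. The sketch below is an assumption-laden helper, not part of tf.data: `plan_shards`, `shard_names`, the bucket path and the `.tfrec` naming pattern are all hypothetical.

```python
import math

TARGET_SHARD_BYTES = 100 * 1024**2  # ~100MB per file, as suggested above


def plan_shards(total_bytes, target_shard_bytes=TARGET_SHARD_BYTES):
    """Number of files to write so that each is roughly `target_shard_bytes`."""
    return max(1, math.ceil(total_bytes / target_shard_bytes))


def shard_names(prefix, num_shards):
    """File names for the shards (the naming pattern is a hypothetical convention)."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}.tfrec" for i in range(num_shards)]


num_shards = plan_shards(20 * 1024**3)  # a 20 GiB dataset
names = shard_names("gs://my-bucket/train", num_shards)  # hypothetical bucket
print(num_shards, "files, first:", names[0])
```

A dataset that yields only a handful of shards at this size is small enough that parallel reads matter less; one that yields many thousands may be better served by larger files.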
Distributed custom training loop (1/2)
@tf.function
def train_step_fn(images, labels):
    with tf.GradientTape() as tape:
        probabilities = model(images, training=True)
        loss = loss_fn(labels, probabilities)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
Distributed custom training loop (2/2)
train_ds = strategy.experimental_distribute_dataset(training_dataset)
for images, labels in train_ds:
    loss_d = strategy.experimental_run_v2(train_step_fn,
                                          args=(images, labels))
    loss = strategy.reduce(tf.distribute.ReduceOp.MEAN,
                           loss_d, axis=None)
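The final `strategy.reduce(..., MEAN, ...)` call combines one loss value per replica into a single number. The toy below mimics that step in plain Python, without TensorFlow, to show that averaging per-replica mean losses matches the loss over the whole batch when every replica receives an equally sized slice. All function names here are illustrative stand-ins, not TensorFlow APIs.

```python
def mean_squared_error(labels, predictions):
    """Mean squared error over one (per-replica or global) batch."""
    return sum((l - p) ** 2 for l, p in zip(labels, predictions)) / len(labels)


def split(batch, num_replicas):
    """Cut a global batch into equally sized per-replica slices."""
    size = len(batch) // num_replicas
    return [batch[r * size:(r + 1) * size] for r in range(num_replicas)]


def reduce_mean(per_replica_values):
    """Stand-in for a MEAN reduction across replicas."""
    return sum(per_replica_values) / len(per_replica_values)


labels = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
predictions = [0.9, 0.2, 0.6, 0.8, 0.1, 0.4, 0.7, 0.3]

per_replica_losses = [
    mean_squared_error(l, p)
    for l, p in zip(split(labels, 4), split(predictions, 4))
]
print(reduce_mean(per_replica_losses), mean_squared_error(labels, predictions))
```

The equality relies on equal slice sizes; with a ragged last batch the two numbers can differ slightly.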
DEMO (Tensorflow 2.1)
Getting a Cloud TPU
Compute Engine
From cloud shell, type
> ctpu up
Provisions VM, with Tensorflow installed and a Cloud TPU
Cloud ML Engine
When launching job, use
--scale_tier=BASIC_TPU
Trains on one VM with an attached Cloud TPU
Kubernetes Engine
In .yaml config file, add
cloud-tpus.google.com/v2: 8
Automatically provisions Cloud TPUs with Kubernetes jobs
Reference models on TPU
03
Reference Models for Cloud TPUs
Machine translation & language modeling
Models: Machine translation, Language modeling, Sentiment analysis, Question-answering (all transformer-based)
Image recognition & object detection
Image recognition: AmoebaNet-D, ResNet-50/101/152/200, Inception v2/v3/v4, DenseNet, MNasNet
Object detection: RetinaNet
Low-resource models: MobileNet, SqueezeNet
Image generation
Models: Image Transformer, DCGAN
Speech recognition
Model: ASR Transformer (LibriSpeech)
RetinaNet
Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017
TPU v2 / v3 comparison on RetinaNet
Image size | TPU v3 vs. TPU v2 |
256 x 256 px | +22% |
512 x 512 px | +77% |
640 x 640 px | +85% |
TPU v3 POD scaling on RetinaNet
TPUv3-8: 2h10min, accuracy 0.72, $16
TPUv2-32: 58min, accuracy 0.72, $24
TPUv3-128: 20 min, accuracy 0.71
Thank you !
Keras TPU demo TF 2.1
Keras TPU end-to-end example in Colab
Dataset and TPU Keras tutorial
codelabs.developers.google.com/codelabs/keras-flowers-tpu/
TPU Reference models
This deck
TPUs on Kaggle
Martin Görner
@martin_gorner
the end
RetinaNet: multi-scale Feature Pyramid Network (FPN)
Figure: ResNet backbone → feature pyramid net → class+box subnets attached to each pyramid level
class subnet: WxHx256 → WxHx256 → WxHxKA
box subnet: WxHx256 → WxHx256 → WxHx4A
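The output shapes WxHxKA and WxHx4A mean K class scores and 4 box coordinates for each of A anchors at every feature-map position. The sketch below lists those shapes per pyramid level. The defaults (A = 9 anchors, strides 8 to 128 for levels P3 to P7) follow the cited Focal Loss paper rather than the slide, and `head_output_shapes` and `total_anchors` are hypothetical helper names.

```python
import math

PYRAMID_STRIDES = (8, 16, 32, 64, 128)  # levels P3..P7, per the cited paper


def head_output_shapes(image_size, num_classes, anchors=9, strides=PYRAMID_STRIDES):
    """Class (WxHxKA) and box (WxHx4A) output shapes for a square input image."""
    shapes = []
    for stride in strides:
        side = math.ceil(image_size / stride)  # W = H for a square image
        shapes.append({
            "stride": stride,
            "class": (side, side, num_classes * anchors),
            "box": (side, side, 4 * anchors),
        })
    return shapes


def total_anchors(image_size, anchors=9, strides=PYRAMID_STRIDES):
    """Number of candidate boxes scored per image across all pyramid levels."""
    return sum(math.ceil(image_size / s) ** 2 * anchors for s in strides)


shapes = head_output_shapes(640, num_classes=80)  # 80 classes, as in COCO
print(shapes[0], total_anchors(640))
```

The large number of candidate boxes per image, most of them background, is the class imbalance that the focal loss is designed to handle.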
RetinaNet: Focal Loss (FL)
Plot: loss vs. probability, highlighting well classified examples
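For reference, the focal loss from the cited paper is FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below is a minimal scalar version for a single binary prediction, using the paper's defaults γ = 2 and α = 0.25. It is meant as an illustration and is not the implementation used in the TPU reference model.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive class, y: label (1 or 0).
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain cross-entropy, for comparison."""
    return -math.log(p if y == 1 else 1.0 - p)


for p in (0.6, 0.9, 0.99):
    ratio = focal_loss(p, 1, alpha=1.0) / cross_entropy(p, 1)
    print(f"p_t={p}: focal loss is {ratio:.4f} x cross-entropy")
```

With γ = 2, an example classified at p_t = 0.9 contributes 100 times less than under cross-entropy, so the many easy background anchors no longer dominate training.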
RetinaNet
Chart: COCO AP (28 to 38) vs. inference time (50 to 250 ms), compared with [A] YOLOv2, [B] SSD321, [C] DSSD32, [D] R-FCN, [E] SSD5, [F] DSSD513, [G] FPN FRCN
Notebooks
on AI Platform
Gigster tip:
Use tf.data.Dataset
03
Notebooks (new)
TPU-enabled notebook VM
Create on the command-line for now
# Create VM: Tensorflow image; proxy-mode metadata marks it as a notebook instance;
# the startup script sets TPU_NAME so that TPUClusterResolver() works in your code
gcloud compute instances create my-machine \
    --machine-type n1-standard-8 \
    --image-project deeplearning-platform-release \
    --image-family tf-1-13-cpu \
    --scopes cloud-platform \
    --boot-disk-size=100GB \
    --metadata proxy-mode=project_editors,startup-script=\
"echo \"export TPU_NAME=my-machine\" > /etc/profile.d/tpu-env.sh;"
# Create TPU
gcloud compute tpus create my-machine \
    --network default \
    --range 192.168.44.0/29 \
    --version 1.13
Notebooks (new) with GPUs and TPUs
Notebooks (new): TPU training
Diagram: your VM sends the Tensorflow graph to the Cloud TPU; the Cloud TPU reads data from Cloud Storage
ML jobs with Kubeflow / Fairing
Gigster tip:
Don’t pay for a GPU/TPU machine to type code
04
Training jobs
Kubeflow Fairing: notebook as a job
import fairing
# deployer: 'gcp' for AI Platform Job or 'Job' for Kubeflow job; scale_tier is the hardware config
fairing.config.set_deployer('gcp', scale_tier='BASIC_TPU')
# preprocessor: 'full_notebook' runs and renders the whole notebook, including pre- and
# post-training code (or 'python' for a regular script); the notebook and input files are the payload
fairing.config.set_preprocessor('full_notebook',
                                notebook_file="code.ipynb",
                                input_files=my_files,
                                output_file='gs://out.ipynb')
fairing.config.run()
Getting stuff done (GSD)
Gigster tip:
Use Cloud TPUs and iterate fast
05
Gigster TPU learnings
Chart: time vs. epochs
Gigster TPU experiments
TPU | Image size | Epochs | Training time | mAP@0.5 accuracy |
v2-8 | 640 | 6 | 5h 44 min | 0.65 |
v3-8 | 640 | 6 | 1h 39 min | 0.65 |
v3-8 | 512 | 10 | 2h 00 min | 0.73 |
v3-8 | 512 | 6 | 1h 6 min | 0.68 |
v3-8 | 384 | 6 | 42 min | 0.65 |
v3-8 | 384 | 4 | 30 min | 0.64 |
v3-8 | 256 | 6 | 29 min | 0.62 |
v3-8 | 256 | 4 | 22 min | 0.59 |
$7
Highest ever accuracy
Gigster TPU learnings for inference: processing 3200 images by batches of 64
Accelerator, type | CPU, cores | Resource utilization, % | Time, sec | Images/s | $/hour for machine | Images/$ |
None (CPU) | 8 | 99 | 2091 | 1.53 | 0.28 | 20K |
None (CPU) | 64 | 93 | 421 | 7.60 | 2.27 | 3.6K |
Nvidia K80 | 4 | 99 | 667 | 4.80 | 0.64 | 27K |
Nvidia V100 | 4 | 99 | 169 | 18.93 | 2.67 | 26K |
TPU v3-8 | 8 | ? | 27 | 118.5 | 8.19 | 52K |
TPU v3-8 is both the fastest (27 sec) and the cheapest (52K images/$)
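The Images/s and Images/$ columns follow from the measured time and the hourly machine price. The sketch below recomputes them from the table's figures; the helper names are illustrative.

```python
NUM_IMAGES = 3200  # images processed in each run

# (configuration, time in seconds, machine price in $/hour), copied from the table
runs = [
    ("None (CPU), 8 cores", 2091, 0.28),
    ("None (CPU), 64 cores", 421, 2.27),
    ("Nvidia K80", 667, 0.64),
    ("Nvidia V100", 169, 2.67),
    ("TPU v3-8", 27, 8.19),
]


def images_per_second(seconds):
    return NUM_IMAGES / seconds


def images_per_dollar(seconds, dollars_per_hour):
    """Images processed for one dollar of machine time."""
    return images_per_second(seconds) * 3600 / dollars_per_hour


for name, seconds, price in runs:
    print(f"{name}: {images_per_second(seconds):.2f} images/s, "
          f"{images_per_dollar(seconds, price) / 1000:.1f}K images/$")
```

Four of the five rows reproduce the listed Images/$ values. With the listed time and price, the 64-core CPU row works out to roughly 12K rather than 3.6K, so one of the figures in that row may have been recorded differently.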
Pipelines
Gigster tip:
Standardization
Portability
Reuse
06
Pipelines
Training
Ephemeral machines for reproducible end-to-end training in a Docker container
CV API Demo
Latest version of all models for CV team sprint demos and QA
Minimal computing resources possible
Temporary high cost for load tests
App Demo
Integration
Stable version (models in the current release) integrated with mobile/web for user acceptance testing
Minimal computing resources possible
App Staging
Stable version (identical to Prod) integrated with mobile/web for smoke and reliability testing
Minimal computing resources possible
App Prod
Stable version, integrated with mobile/web, serving requests from real users
Full computing resources and auto-scaling
Kubeflow pipelines
Conclusion: neural network R&D
AI Platform
TensorFlow
Cloud TPU
One TPU core
MXU: Matrix Multiply Unit, 128x128 bfloat16 matrices
VPU: Vector Processing Unit, float32, int32
4 TPU chips
2 cores per chip