1 of 49

Fast and Lean Data Science With TPUs,

Keras and Tensorflow 2.1

Martin Görner

Developer Advocate

@martin_gorner

Please reach out with questions

2 of 49

demo: RetinaNet

[Image: RetinaNet object detection demo, detections labeled "Panda"]

3 of 49

TPU hardware

architecture

01

4 of 49

Cloud TPU v2

180 teraflops - 64 GB High Bandwidth Memory (HBM)

5 of 49

Cloud TPU v3

420 teraflops - 128 GB High Bandwidth Memory (HBM)

6 of 49

Cloud TPU v2 Pod (public beta)

11.6 PFLOPS with up to 512 TPU cores

7 of 49

Cloud TPU v3 Pod (public beta)

>100 PFLOPS with 32, 128, …, 2048 TPU cores

Adjust batch size: ideal 128 per TPU core, interesting from 8 per core

Scales from 8 to 2048 cores
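A quick sketch (added for this writeup, not from the original slide) of how the per-core recommendation translates into a global batch size; num_replicas_in_sync comes from the distribution strategy shown later in this deck:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # stand-in for illustration; on TPU, use the TPUStrategy shown later
PER_CORE_BATCH_SIZE = 128                     # ideal value per TPU core
GLOBAL_BATCH_SIZE = PER_CORE_BATCH_SIZE * strategy.num_replicas_in_sync
print(GLOBAL_BATCH_SIZE)                      # e.g. 128 x 8 = 1024 on a Cloud TPU v3-8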

8 of 49

TPU core

One TPU core

MXU: Matrix Multiply Unit, 128x128 bfloat16 matrices
Runs matrix multiplications: bfloat16 x bfloat16 = float32

VPU: Vector Processing Unit, float32, int32
Runs everything else (ReLU, softmax, batch norm, …)

9 of 49

bfloat16

Automatically replacing float32 matrix multiplications by bfloat16 x bfloat16 => float32 works in neural networks

[Diagram: floating-point formats, bit layout]
bfloat16: 1 sign bit, 8 exponent bits, 7 fraction bits (range: ~1e−38 to ~3e38)
float32: 1 sign bit, 8 exponent bits, 23 fraction bits (range: ~1e−38 to ~3e38)
float16: 1 sign bit, 5 exponent bits, 10 fraction bits (range: ~5.9e−8 to 6.5e4)
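A small illustration (added here, not TPU-specific) of the trade-off in the table above: bfloat16 keeps float32's 8-bit exponent range but has only 7 fraction bits, while float16 keeps more fraction bits but overflows much earlier:

import tensorflow as tf

x = tf.constant([3.0e38, 1.2345678], dtype=tf.float32)
print(tf.cast(x, tf.bfloat16))  # 3e38 is still representable; 1.2345678 rounds to roughly 1.234
print(tf.cast(x, tf.float16))   # 3e38 overflows to inf; 1.2345678 rounds to roughly 1.2344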

10 of 49

MXU: Systolic array architecture

  • Fixed-function architecture for 128x128 matrix multiplications
  • bfloat16 x bfloat16 => float32
  • Density and power advantage over GPUs on matrix multiplies
  • 32,000 mult.-add units vs. ~4,000 “CUDA cores” in typical large GPU

See animation at tpudemo.com
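A toy sketch (an illustration only, not the actual hardware) of the accumulation scheme the MXU implements: operands are rounded to bfloat16, products are accumulated in float32:

import tensorflow as tf

a = tf.random.normal([128, 128])
b = tf.random.normal([128, 128])
a16 = tf.cast(a, tf.bfloat16)   # inputs rounded to bfloat16
b16 = tf.cast(b, tf.bfloat16)
# multiply the rounded values, accumulating in float32
product = tf.matmul(tf.cast(a16, tf.float32), tf.cast(b16, tf.float32))
# the only error comes from rounding the inputs, not from the accumulation
print(tf.reduce_max(tf.abs(product - tf.matmul(a, b))))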

11 of 49

Benchmark: planespotting

tensorflow-without-a-phd/planespotting

an airplane detection model

GitHub diff between master and TPU branches

github.com/GoogleCloudPlatform/tensorflow-without-a-phd/compare/master...tpu

Photo: US Geological Survey - public domain

12 of 49

Training hardware options on AI Platform

# config.yaml (single VM with one V100 GPU)
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_v100

# config.yaml (cluster: master + 4 workers with V100 GPUs, 1 parameter server)
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_v100
  parameterServerType: standard
  workerType: standard_v100
  parameterServerCount: 1
  workerCount: 4

# config.yaml (one VM with an attached Cloud TPU)
trainingInput:
  scaleTier: BASIC_TPU

Hardware                  Training Time   Cost
GPU - P100                5h50            $11
GPU - V100                4h30            $13
Cluster 5 GPUs - P100     1h15            $11
Cluster 5 GPUs - V100     1h00            $13
VM with 4 GPUs - V100     1h50            $19
Cloud TPU v2              1h00            $4.7

Training runs Jan 25th 2019 on Tensorflow 1.12, pricing structure of May 1st 2019; time and cost to train the model for the same number of epochs with 4 evaluations per training.

13 of 49

Keras / TPU integration

available in Tensorflow 2.1

02

14 of 49

TPU / GPU cross-compatible model code

import tensorflow as tf

try: # detect TPUs
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver() # TPU detection
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError: # detect GPUs
    strategy = tf.distribute.MirroredStrategy() # for GPU or multi-GPU machines

with strategy.scope():
    model = tf.keras.models.Sequential() # standard tf.keras code from here
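A possible continuation of the snippet above, showing that the standard Keras workflow is unchanged under the strategy scope (the layer sizes, training_dataset, STEPS_PER_EPOCH and EPOCHS are placeholders for this sketch, not part of the original slide):

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
model.fit(training_dataset, steps_per_epoch=STEPS_PER_EPOCH, epochs=EPOCHS)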

15 of 49

tf.data.Dataset and Google Cloud Storage (GCS)

options = tf.data.Options()
options.experimental_deterministic = False

AUTOTUNE = tf.data.experimental.AUTOTUNE  # definition added; the slide assumes it
dataset = tf.data.Dataset.list_files(TFREC_FILES, shuffle=True).with_options(options)
dataset = tf.data.TFRecordDataset(dataset, num_parallel_reads=AUTOTUNE)
dataset = dataset.map(read_tfrecord_fn, num_parallel_calls=AUTOTUNE)
dataset = dataset.shuffle(1000)
dataset = dataset.batch(64)
dataset = dataset.repeat()
dataset = dataset.prefetch(AUTOTUNE)

For top GCS throughput, load in parallel from a reasonable number (~100s) of reasonably large files (~100MB).

tutorial: codelabs.developers.google.com/codelabs/keras-flowers-data
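The parsing function read_tfrecord_fn referenced above is not shown on the slide; here is a hypothetical version, assuming TFRecords that store a JPEG image and an integer class label:

def read_tfrecord_fn(example):
    features = {"image": tf.io.FixedLenFeature([], tf.string),  # JPEG bytes (assumed schema)
                "class": tf.io.FixedLenFeature([], tf.int64)}   # integer label (assumed schema)
    example = tf.io.parse_single_example(example, features)
    image = tf.image.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, [331, 331]) / 255.0          # arbitrary target size for this sketch
    return image, example["class"]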

16 of 49

Distributed custom training loop (1/2)

this is vanilla TF 2.0

demo: bit.ly/keras-tpu-tf21

@tf.function
def train_step_fn(images, labels):
    with tf.GradientTape() as tape:
        probabilities = model(images, training=True)
        loss = loss_fn(labels, probabilities)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

17 of 49

Distributed custom training loop (2/2)

try now in TF 2.1

demo: bit.ly/keras-tpu-tf21

train_ds = strategy.experimental_distribute_dataset(training_dataset)

for images, labels in train_ds:
    loss_d = strategy.experimental_run_v2(train_step_fn, args=(images, labels))
    loss = strategy.reduce(tf.distribute.ReduceOp.MEAN, loss_d, axis=None)
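The loop above runs forever on a repeated dataset; one way to bound it (a sketch; STEPS_PER_EPOCH and EPOCHS are placeholders, not part of the original demo):

for epoch in range(EPOCHS):
    for step, (images, labels) in enumerate(train_ds):
        loss_d = strategy.experimental_run_v2(train_step_fn, args=(images, labels))
        loss = strategy.reduce(tf.distribute.ReduceOp.MEAN, loss_d, axis=None)
        if step + 1 >= STEPS_PER_EPOCH:
            break
    print('epoch', epoch, 'loss:', float(loss))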

18 of 49

DEMO (Tensorflow 2.1)

19 of 49

Getting a Cloud TPU

Compute Engine:
From cloud shell, type
> ctpu up
Provisions a VM with Tensorflow installed, plus a Cloud TPU

Cloud ML Engine:
When launching the job, use
--scale_tier=BASIC_TPU
Trains on one VM with an attached Cloud TPU

Kubernetes Engine:
In the .yaml config file, add
cloud-tpus.google.com/v2: 8
Automatically provisions Cloud TPUs with Kubernetes jobs

20 of 49

Reference models on TPU

03

21 of 49

Reference Models for Cloud TPUs

Machine translation & language modeling
Models: Machine translation, Language modeling, Sentiment analysis, Question-answering (all transformer-based)

Image recognition & object detection
Image recognition: AmoebaNet-D, ResNet-50/101/152/200, Inception v2/v3/v4, DenseNet, MnasNet
Object detection: RetinaNet
Low-resource models: MobileNet, SqueezeNet

Image generation
Models: Image Transformer, DCGAN

Speech recognition
Model: ASR Transformer (LibriSpeech)

22 of 49

RetinaNet


Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

23 of 49

TPU v2 / v3 comparison on RetinaNet

[Chart] Training speedup of TPU v3 over TPU v2 on RetinaNet: +22% at 256 x 256 px, +77% at 512 x 512 px, +85% at 640 x 640 px

24 of 49

TPU v3 POD scaling on RetinaNet

TPUv3-8: 2h 10min, accuracy 0.72, $16
TPUv2-32: 58 min, accuracy 0.72, $24
TPUv3-128: 20 min, accuracy 0.71

25 of 49

Thank you !

Keras TPU demo TF 2.1

bit.ly/keras-tpu-tf21

Keras TPU end-to-end example in Colab

bit.ly/keras-flowers-tpu

Dataset and TPU Keras tutorial

codelabs.developers.google.com/codelabs/keras-flowers-tpu/

TPU Reference models

github.com/tensorflow/tpu

This deck

bit.ly/keras-tpu-presentation

TPUs on Kaggle

kaggle.com/docs/tpu

Martin Görner
Developer Advocate
@martin_gorner

Please reach out with questions

26 of 49

the end

27 of 49

RetinaNet: multi-scale Feature Pyramid Network (FPN)

Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

[Diagram] ResNet backbone feeding a Feature Pyramid Network; at each pyramid level, a class subnet and a box subnet operate on WxHx256 feature maps and output WxHxKA class predictions and WxHx4A box predictions

28 of 49

RetinaNet:

Focal Loss (FL)

Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

[Chart] Loss vs. probability of the ground-truth class: the focal loss down-weights well-classified examples relative to cross-entropy
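For reference, the focal loss as defined in the cited paper (γ = 2 in the paper's best setting, optionally combined with a class-balancing weight α_t):

FL(p_t) = −(1 − p_t)^γ · log(p_t)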

29 of 49

RetinaNet

Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

[Chart] COCO AP (28 to 38) vs. inference time (50 to 250 ms) for single-model detectors:
[A] YOLOv2, [B] SSD321, [C] DSSD321, [D] R-FCN, [E] SSD513, [F] DSSD513, [G] FPN FRCN

30 of 49

31 of 49

Notebooks

on AI Platform

Gigster tip:

Use tf.data.Dataset

03

32 of 49

Notebooks (NEW)

33 of 49

TPU-enabled notebook VM

Create on the command-line for now

# Create VM (Tensorflow image, marked as a notebook instance)
gcloud compute instances create my-machine \
  --machine-type n1-standard-8 \
  --image-project deeplearning-platform-release \
  --image-family tf-1-13-cpu \
  --scopes cloud-platform \
  --boot-disk-size=100GB \
  --metadata proxy-mode=project_editors,startup-script=\
"echo \"export TPU_NAME=my-machine\" > /etc/profile.d/tpu-env.sh;"

# Create TPU
gcloud compute tpus create my-machine \
  --network default \
  --range 192.168.44.0/29 \
  --version 1.13

The startup script sets TPU_NAME so that TPUClusterResolver() works in your code.

34 of 49

Notebooks (NEW) with GPUs and TPUs

35 of 49

Notebooks (NEW)

36 of 49

TPU training

[Diagram] Your VM sends the Tensorflow graph to the Cloud TPU; training data is read from Cloud Storage

37 of 49

ML jobs with Kubeflow Fairing

Gigster tip:

Don’t pay for a GPU/TPU machine to type code

04

38 of 49

Training jobs

39 of 49

Kubeflow Fairing: notebook as a job

import fairing

# 'gcp' for an AI Platform job, or 'Job' for a Kubeflow job; scale_tier is the hardware config
fairing.config.set_deployer('gcp', scale_tier='BASIC_TPU')

# 'full_notebook' runs and renders the whole notebook, including pre- and post-training code
# (or 'python' for a regular script); input_files is the payload shipped with the job
fairing.config.set_preprocessor('full_notebook',
                                notebook_file="code.ipynb",
                                input_files=my_files,
                                output_file='gs://out.ipynb')
fairing.config.run()

40 of 49

Getting stuff done (GSD)

Gigster tip:

Use Cloud TPUs and iterate fast

05

41 of 49

Gigster TPU learnings

[Chart with axes: TIME, EPOCHS]

42 of 49

Gigster TPU experiments

TPU     Image size   Epochs   Training time   mAP@0.5 accuracy
v2-8    640          6        5h 44min        0.65
v3-8    640          6        1h 39min        0.65
v3-8    512          10       2h 00min        0.73
v3-8    512          6        1h 06min        0.68
v3-8    384          6        42 min          0.65
v3-8    384          4        30 min          0.64
v3-8    256          6        29 min          0.62
v3-8    256          4        22 min          0.59

Slide callouts: "$7"; "Highest ever accuracy" (the 0.73 run is the highest in the table).

43 of 49

Gigster TPU learnings for inference: processing 3200 images in batches of 64

Accelerator    CPU cores   Utilization (%)   Time (s)   Images/s   $/hour for machine   Images/$
None (CPU)     8           99                2091       1.53       0.28                 20K
None (CPU)     64          93                421        7.60       2.27                 3.6K
Nvidia K80     4           99                667        4.80       0.64                 27K
Nvidia V100    4           99                169        18.93      2.67                 26K
TPU v3-8       8           ?                 27         118.5      8.19                 52K

The TPU v3-8 run was both the fastest and the cheapest per image.
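A rough sketch (added here, not the original benchmark code) of how such a measurement can be taken with Keras; model and images are placeholders:

import time

start = time.time()
probabilities = model.predict(images, batch_size=64)  # images: a [3200, H, W, 3] array (placeholder)
elapsed = time.time() - start
print('images/sec:', 3200 / elapsed)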

44 of 49

Pipelines

Gigster tip:

Standardization

Portability

Reuse

06

45 of 49

Pipelines

Training

Ephemeral machines for reproducible end-to-end training in a Docker container

CV API Demo

Latest version of all models for CV team sprint demos and QA

Minimal computing resources possible

Temporary high cost for load tests

App Demo

Integration

Stable version (models in the current release) integrated with mobile/web for user acceptance testing

Minimal computing resources possible

App Staging

Stable version (identical to Prod) integrated with mobile/web for smoke and reliability testing

Minimal computing resources possible

App Prod

Stable version integrated with mobile/web, serving requests from real users

Full computing resources and auto-scaling

46 of 49

Kubeflow pipelines

47 of 49

Conclusion: neural network R&D

AI Platform

TensorFlow

Cloud TPU

48 of 49

Your Feedback is Greatly Appreciated!

Complete the session survey in the mobile app

1-5 star rating system

Open field for comments

Rate icon in status bar

49 of 49

One TPU core

MXU: Matrix Multiply Unit, 128x128 bfloat16 matrices

VPU: Vector Processing Unit, float32, int32

4 TPU chips

2 cores per chip