1 of 49

Fast and Lean Data Science With TPUs,

Keras and Tensorflow 2.1

Martin Görner

Developer Advocate

@martin_gorner

Please reach out with questions

2 of 49

demo: RetinaNet

Panda

3 of 49

TPU hardware

architecture

01

4 of 49

Cloud TPU v2

180 teraflops - 64 GB High Bandwidth Memory (HBM)

5 of 49

Cloud TPU v3

420 teraflops - 128 GB High Bandwidth Memory (HBM)

6 of 49

Cloud TPU v2 Pod (public beta)

11.6 PFLOPS with up to 512 TPU cores

7 of 49

Cloud TPU v3 pod (public beta)

>100 PFLOPS with 32, 128, …, 2048 TPU cores

Adjust batch size: ideal 128 per TPU core, interesting from 8 per core

Scales from 8 to 2048 cores
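A minimal sketch of the batch-size arithmetic implied by the slide, assuming the global batch size is simply the per-core batch size multiplied by the number of cores. The helper name `global_batch_size` is hypothetical and not part of any TPU API.

```python
def global_batch_size(per_core_batch: int, num_cores: int) -> int:
    """Global batch size when every TPU core processes per_core_batch examples."""
    return per_core_batch * num_cores


# Per-core batch of 128 (the "ideal" value from the slide) at several pod sizes.
for cores in (8, 32, 128, 2048):
    print(cores, "cores ->", global_batch_size(128, cores), "examples per step")
```

With the lower bound of 8 examples per core, a single 8-core Cloud TPU would use a global batch of 64.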

8 of 49

TPU core

One TPU core

MXU: Matrix Multiply Unit, 128x128 bfloat16 matrices

Runs matrix multiplications: bfloat16 x bfloat16 => float32

VPU: Vector Processing Unit, float32, int32

Runs everything else (ReLU, softmax, batch norm, …)

9 of 49

bfloat16

Automatically replacing float32 matrix multiplications by bfloat16 x bfloat16 => float32 works in neural networks

Format	Sign (S)	Exponent (E)	Fraction (M)	Range
bfloat16	1 bit	8 bits	7 bits	~1e⁻³⁸ to ~3e³⁸
float32	1 bit	8 bits	23 bits	~1e⁻³⁸ to ~3e³⁸
float16	1 bit	5 bits	10 bits	~5.9e⁻⁸ to 6.5e⁴
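The bit layout above can be illustrated with a small sketch that converts a float32 value to bfloat16 by keeping only its top 16 bits (sign, 8 exponent bits, 7 fraction bits). This assumes plain truncation for simplicity; real hardware may round differently. The helper name `to_bfloat16` is hypothetical.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 by truncating its float32 bit pattern."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 as a 32-bit integer
    bits &= 0xFFFF0000  # keep sign (1 bit), exponent (8 bits), top 7 fraction bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(to_bfloat16(1.0))      # exactly representable
print(to_bfloat16(3.14159))  # 3.140625: precision is lost in the fraction
print(math.isfinite(to_bfloat16(1e38)))  # large values stay finite, same exponent as float32
```

The same value 1e38 would overflow in float16, whose largest finite value is about 6.5e4.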

10 of 49

MXU: systolic array architecture

Fixed-function architecture for 128x128 matrix multiplications
bfloat16 x bfloat16 => float32
Density and power advantage over GPUs on matrix multiplies
32,000 mult.-add units vs. ~4,000 “CUDA cores” in typical large GPU

See animation at tpudemo.com

The second architectural advantage is the fixed-function matrix multiply unit. It organizes 128*128=16,384 multiply-accumulate units into a “systolic array” architecture where they are directly connected to each other in exactly the right way for performing matrix multiplications. On a GPU, the same matrix multiply would be programmed onto the processing cores of the GPU (sometimes referred to as “CUDA cores”), which are typically much larger units than just a bfloat16 multiply-add. This MXU architecture gives TPUs an advantage in both density and power consumption over GPUs when performing matrix multiplications. There are a lot of matmuls when training neural networks, so TPUs make sense there, but they are less versatile than their GPU counterparts. A TPU would not be great at 3D rendering, for example.
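As a rough illustration of why the fixed 128x128 shape matters, the sketch below counts how many 128x128 tiles a matrix occupies and what fraction of those tiles holds real data. It assumes operands are padded up to multiples of 128, which is a simplification not stated on the slide; `mxu_tiles` and `mxu_utilization` are hypothetical helper names.

```python
import math

MXU_SIZE = 128  # side of the MXU systolic array, from the slide


def mxu_tiles(rows: int, cols: int, tile: int = MXU_SIZE) -> int:
    """Number of tile x tile blocks needed to cover a rows x cols matrix."""
    return math.ceil(rows / tile) * math.ceil(cols / tile)


def mxu_utilization(rows: int, cols: int, tile: int = MXU_SIZE) -> float:
    """Fraction of the padded area that holds real values (assumes zero padding)."""
    padded_area = mxu_tiles(rows, cols, tile) * tile * tile
    return rows * cols / padded_area


print(MXU_SIZE * MXU_SIZE)        # multiply-accumulate units in one array
print(mxu_utilization(128, 128))  # a full tile
print(mxu_utilization(64, 128))   # half of the tile is padding
```

Under this assumption, dimensions that are multiples of 128 waste no multiply-accumulate units, which is consistent with the advice to use 128 examples per core.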

11 of 49

Benchmark: planespotting

tensorflow-without-a-phd/planespotting

an airplane detection model

GitHub diff between master and TPU branches

github.com/GoogleCloudPlatform/tensorflow-without-a-phd/compare/master...tpu

Photo: US Geological Survey - public domain

12 of 49

Training hardware options on AI Platform

# config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_v100

# config.yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_v100
  parameterServerType: standard
  workerType: standard_v100
  parameterServerCount: 1
  workerCount: 4

# config.yaml
trainingInput:
  scaleTier: BASIC_TPU

Hardware	Training time	Cost
GPU - P100	5h50	$11
GPU - V100	4h30	$13
Cluster 5 GPUs - P100	1h15	$11
Cluster 5 GPUs - V100	1h00	$13
VM with 4 GPUs - V100	1h50	$19
Cloud TPU v2	1h00	$4.7

Training runs of Jan 25th 2019 with Tensorflow 1.12, pricing structure of May 1st 2019. Time and cost to train the model on the same number of epochs, with 4 evaluations per training.

13 of 49

Keras / TPU integration

available in Tensorflow 2.1

02

14 of 49

TPU / GPU cross-compatible model code

import tensorflow as tf

try: # detect TPUs
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver() # TPU detection
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError: # detect GPUs
    strategy = tf.distribute.MirroredStrategy() # for GPU or multi-GPU machines

with strategy.scope():
    model = tf.keras.models.Sequential() # standard tf.keras code from here

Try it: bit.ly/keras-TPU

15 of 49

tf.data.Dataset and Google Cloud Storage (GCS)

AUTOTUNE = tf.data.experimental.AUTOTUNE # let tf.data pick the parallelism

options = tf.data.Options()
options.experimental_deterministic = False # allow out-of-order reads for speed

dataset = tf.data.Dataset.list_files(TFREC_FILES, shuffle=True).with_options(options)
dataset = tf.data.TFRecordDataset(dataset, num_parallel_reads=AUTOTUNE) # read files in parallel
dataset = dataset.map(read_tfrecord_fn, num_parallel_calls=AUTOTUNE) # decode records in parallel
dataset = dataset.shuffle(1000)
dataset = dataset.batch(64)
dataset = dataset.repeat()
dataset = dataset.prefetch(AUTOTUNE) # overlap data loading with training

For top GCS throughput, load in parallel from: a reasonable number (~100s) of reasonably large files (~100MB)

tutorial: codelabs.developers.google.com/codelabs/keras-flowers-data
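The file-size guidance above can be turned into a quick shard-count estimate. This is a minimal sketch, assuming a target of roughly 100 MB per file; the helper name `num_shards` and the 20 GiB dataset size are made up for illustration.

```python
import math

TARGET_SHARD_BYTES = 100 * 1024**2  # ~100 MB per file, per the slide's guidance


def num_shards(total_bytes: int, target_shard_bytes: int = TARGET_SHARD_BYTES) -> int:
    """Number of files needed so that each one is at most target_shard_bytes."""
    return max(1, math.ceil(total_bytes / target_shard_bytes))


dataset_bytes = 20 * 1024**3  # hypothetical 20 GiB dataset
print(num_shards(dataset_bytes), "files of about 100 MB each")
```

A dataset of this size lands in the "hundreds of files" range that the slide recommends.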

16 of 49

Distributed custom training loop (1/2)

this is vanilla TF 2.0

demo: bit.ly/keras-tpu-tf21

@tf.function
def train_step_fn(images, labels):
    with tf.GradientTape() as tape:
        probabilities = model(images, training=True)
        loss = loss_fn(labels, probabilities)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

17 of 49

Distributed custom training loop (2/2)

try now in TF 2.1

demo: bit.ly/keras-tpu-tf21

train_ds = strategy.experimental_distribute_dataset(training_dataset)

for images, labels in train_ds:
    loss_d = strategy.experimental_run_v2(train_step_fn,
                                          args=(images, labels))
    loss = strategy.reduce(tf.distribute.ReduceOp.MEAN,
                           loss_d, axis=None)
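To make the data flow of this loop concrete, here is a pure-Python stand-in, not TensorFlow code: a global batch is split across replicas, each replica computes its own mean loss, and the per-replica values are averaged, which is roughly what the MEAN reduction does. The helper name and the loss values are invented for illustration, and equal-sized slices are assumed.

```python
from statistics import mean


def split_across_replicas(batch, num_replicas):
    """Split a global batch into equal per-replica slices (assumes even division)."""
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]


# Made-up per-example losses for a global batch of 8 examples.
per_example_losses = [0.5, 1.5, 1.0, 3.0, 2.0, 4.0, 0.0, 4.0]

shards = split_across_replicas(per_example_losses, num_replicas=4)
per_replica_loss = [mean(shard) for shard in shards]  # one value per replica
loss = mean(per_replica_loss)                         # MEAN reduction across replicas

print(per_replica_loss, loss)
```

With equal slice sizes, the reduced value matches the mean loss over the whole global batch.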

18 of 49

DEMO (Tensorflow 2.1)

bit.ly/keras-tpu-tf21

19 of 49

Getting a Cloud TPU

Compute Engine

From cloud shell, type

> ctpu up

Provisions VM, with Tensorflow installed and a Cloud TPU

Cloud ML Engine

When launching job, use

--scale_tier=BASIC_TPU

Trains on one VM with an attached Cloud TPU

Kubernetes Engine

In .yaml config file, add

cloud-tpus.google.com/v2: 8

Automatically provisions Cloud TPUs with Kubernetes jobs

20 of 49

Reference models on TPU

03

21 of 49

Reference Models for Cloud TPUs

Machine translation & language modeling

Models: machine translation, language modeling, sentiment analysis, question-answering (all transformer-based)

Image recognition & object detection

Image recognition: AmoebaNet-D, ResNet-50/101/152/200, Inception v2/v3/v4, DenseNet, MNasNet

Object detection: RetinaNet

Low-resource models: MobileNet, SqueezeNet

Image generation

Models: Image Transformer, DCGAN

Speech recognition

Model: ASR Transformer (LibriSpeech)

github.com/tensorflow/tpu

22 of 49

RetinaNet

Panda

Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

Before we go deep into the RetinaNet details, let me first explain how object detection algorithms work in general.

Typically, they generate a lot of candidate bounding boxes (bboxes, also called proposals) and then assign class labels and adjust the bbox coordinates using a neural network.

There are one-stage detectors and two-stage detectors.

One-stage detectors (such as OverFeat, YOLO, SSD) combine the candidate generation and detection steps and process the image only once. They are therefore faster, but less accurate because of the extreme foreground-background class imbalance: as you can see, there are a lot more blue boxes, which represent negative examples for the machine learning model, than red boxes, which are positive examples.

Two-stage detectors (e.g. R-CNN) separate the candidate box generation and detection phases and perform additional sampling and post-processing of candidate bboxes. In turn, they are more accurate but slower.

RetinaNet was designed to be both fast and accurate.

23 of 49

TPU v2 / v3 comparison on RetinaNet

[Chart: TPU v3 versus TPU v2 on RetinaNet, by image size]

256 x 256 px: +22%

512 x 512 px: +77%

640 x 640 px: +85%

Panda

24 of 49

TPU v3 POD scaling on RetinaNet

Panda

TPUv3-8

2h10min, accuracy 0.72, $16

TPUv2-32

58min, accuracy 0.72, $24

TPUv3-128

20 min, accuracy 0.71

25 of 49

Thank you !

Keras TPU demo TF 2.1

bit.ly/keras-tpu-tf21

Keras TPU end-to-end example in Colab

bit.ly/keras-flowers-tpu

Dataset and TPU Keras tutorial

codelabs.developers.google.com/codelabs/keras-flowers-tpu/

TPU Reference models

github.com/tensorflow/tpu

This deck

bit.ly/keras-tpu-presentation

TPUs on Kaggle

kaggle.com/docs/tpu

Martin Görner

@martin_gorner

26 of 49

the end

27 of 49

RetinaNet: multi-scale Feature Pyramid Network (FPN)

Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

ResNet backbone

Feature pyramid net

class subnet

box subnet

class+box subnets

WxHx256

WxHxKA

WxHx4A

28 of 49

RetinaNet:

Focal Loss (FL)

Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

loss

probability

well classified examples
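A minimal sketch of the focal loss from the cited paper, FL(p_t) = -(1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The optional class-balancing weight from the paper is left out, and γ = 2 is used as the default. The function name is chosen for illustration.

```python
import math


def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Focal loss for one example; p_t is the predicted probability of the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# With gamma = 0 the focal loss reduces to the usual cross-entropy.
for p_t in (0.1, 0.5, 0.9):
    print(p_t, focal_loss(p_t, gamma=0.0), focal_loss(p_t))
```

For a well-classified example with p_t = 0.9, the factor (1 - p_t)^2 shrinks the loss to about one hundredth of the cross-entropy, so the many easy background boxes contribute far less to training.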

29 of 49

RetinaNet

Lin, Tsung-Yi, Priya Goyal, Ross B. Girshick, Kaiming He and Piotr Dollár, “Focal Loss for Dense Object Detection”, In Proceedings of ICCV 2017

[Figure: COCO AP versus inference time (ms) for [A] YOLOv2, [B] SSD321, [C] DSSD32, [D] R-FCN, [E] SSD5, [F] DSSD513, [G] FPN FRCN]

30 of 49

31 of 49

Notebooks

on AI Platform

Gigster tip:

Use tf.data.Dataset

03

32 of 49

Notebooks (NEW)

33 of 49

TPU-enabled notebook VM

Create on the command-line for now

Create VM:

gcloud compute instances create my-machine \
    --machine-type n1-standard-8 \
    --image-project deeplearning-platform-release \
    --image-family tf-1-13-cpu \
    --scopes cloud-platform \
    --boot-disk-size=100GB \
    --metadata proxy-mode=project_editors,startup-script="echo \"export TPU_NAME=my-machine\" > /etc/profile.d/tpu-env.sh;"

Create TPU:

gcloud compute tpus create my-machine \
    --network default \
    --range 192.168.44.0/29 \
    --version 1.13

--image-project / --image-family: Tensorflow image

proxy-mode=project_editors: mark as notebook instance

startup-script: set TPU_NAME so that TPUClusterResolver() works in your code

34 of 49

Notebooks (NEW) with GPUs and TPUs

35 of 49

Notebooks (NEW)

36 of 49

TPU training

Your VM

data

Cloud TPU

Cloud Storage

Tensorflow graph

A “Cloud TPU” is a TPU board with 4 dual-core TPU chips connected through PCI to a host virtual machine. If you start a VM in GCE and request a “Cloud TPU” resource for your VM, your VM gets network access to this combination of hardware: a TPU board and a dedicated VM supporting it. The TPUs will typically be running your model training while the VM will be working on getting data to the TPU as fast as possible.

The tensorflow code you write produces internally a Tensorflow graph of operations. That is the way Tensorflow works on any platform. On TPUs specifically, the graph is first translated into the XLA intermediate representation and then compiled to TPU assembly code. One limitation here is that XLA is designed for linear algebra and vector computations. There are some constructs that it will never be able to compile efficiently which is why TPUs always come with a dedicated host VM that will run anything not fit for TPU acceleration. For example an Atari console emulator is the type of branch-heavy code that is best left for the CPU. Another example is decoding JPEG images. This could theoretically be written in Tensorflow in a way that would compile efficiently to TPUs but it has not been done yet. For now, if you have a tf.image.decode_jpeg in your data pipeline, it is better to leave that to the CPU.

37 of 49

ML jobs with Kubeflow/ fairing

Gigster tip:

Don’t pay for a GPU/TPU machine to type code

04

Having a powerful accelerator is great as we can iterate fast and speed is at the core of our business model, but we don’t want to pay for expensive machines just to type code and don’t want to chase around machines left running.

When we started, we were ahead of the curve with AI and there were no de facto solutions to address this problem.

We designed the Gigster Development Environment to speed up the way data scientists work and minimize the infrastructure cost:

We have a few machines for quick interactive testing of pipelines shared across the data scientists in the team to keep the infra cost low.
But for actual ML jobs that train models for production, we also have Terraform/Ansible scripts that provision machines, install necessary environment variables and drivers, start training out of a Docker container and save the model weights to Cloud Storage.

As Martin will demonstrate, it is great to see that with Google Cloud it comes out of the box.

38 of 49

Training jobs

Gigster: use case for jobs (don’t want to pay for expensive machine for typing code, don’t want to chase around machines left running, don’t want to wait too long...)

Initially, we took an open source Keras RetinaNet implementation and used GPUs for training. It was easy to start but:

Given a limited budget, we had to stay lean and make sure we did not spend significant amounts of money on costly GPU machines that sat idle or were used only for development. So we instrumented a workflow in which an infra automation script creates a machine, connects network-attached storage, configures the environment by installing the necessary CUDA version, nvidia-docker, etc., and then starts the training. It helps reproducibility and minimizes cost. Hence, welcome the serverless world that Google Cloud pushes forward…

Demo:

Starting a job from a notebook using fairing

39 of 49

Kubeflow Fairing: notebook as a job

import fairing

fairing.config.set_deployer('gcp', scale_tier='BASIC_TPU')
fairing.config.set_preprocessor('full_notebook',
                                notebook_file="code.ipynb",
                                input_files=my_files,
                                output_file='gs://out.ipynb')
fairing.config.run()

set_deployer: 'gcp' for AI Platform job, or 'Job' for Kubeflow job

scale_tier: hardware config

'full_notebook': runs and renders the whole notebook, including pre- and post-training code; or 'python' for a regular script

payload

40 of 49

Getting stuff done (GSD)

Gigster tip:

Use Cloud TPUs and iterate fast

05

41 of 49

Gigster TPU learnings

TIME

EPOCHS

42 of 49

Gigster TPU experiments

TPU	Image size	Epochs	Training time	mAP@0.5 accuracy
v2-8	640	6	5h 44 min	0.65
v3-8	640	6	1h 39 min	0.65
v3-8	512	10	2h 00 min	0.73
v3-8	512	6	1h 6 min	0.68
v3-8	384	6	42 min	0.65
v3-8	384	4	30 min	0.64
v3-8	256	6	29 min	0.62
v3-8	256	4	22 min	0.59

$7

Highest ever accuracy

By having access to TPUs, fairing and fast training times, the team was able to explore various hyper parameters for the neural network and accomplish higher quality compared to the original model.

The highest ever quality was achieved with TPUv3 (our favorite), 512px image size and 10 epochs. TPUs make it practical to train longer (previously we had explored only 3 epochs).

We also experimented with the image sizes and were able to decrease the training time down to 42 mins w/o a serious degradation of quality.

Possible gotchas to mention:

Could not easily port Keras Retinanet model to TPU because all the code running on TPU must be Tensorflow code and in that model some computations were happening in regular Python during the training.
Inference problem with the official TPU Retinanet model: part of the code was not written.

43 of 49

Gigster TPU learnings for inference: processing 3200 images in batches of 64

Accelerator, type	CPU, cores	Resource utilization, %	Time, sec	Images/s	$/hour for machine	Images/$
None (CPU)	8	99	2091	1.53	0.28	20K
None (CPU)	64	93	421	7.60	2.27	3.6K
Nvidia K80	4	99	667	4.80	0.64	27K
Nvidia V100	4	99	169	18.93	2.67	26K
TPU v3-8	8	?	27	118.5	8.19	52K

fastest

cheapest
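The derived columns of the table can be reproduced from the measured time and the hourly price. This is a sketch, assuming Images/s = 3200 / time and Images/$ = Images/s x 3600 / hourly price; the function names are illustrative, and only a few rows of the table are included.

```python
NUM_IMAGES = 3200  # images processed in the benchmark


def images_per_second(seconds: float) -> float:
    """Throughput for processing NUM_IMAGES in the given wall-clock time."""
    return NUM_IMAGES / seconds


def images_per_dollar(seconds: float, dollars_per_hour: float) -> float:
    """Images processed per dollar at the given hourly machine price."""
    return images_per_second(seconds) * 3600 / dollars_per_hour


# (time in seconds, $/hour) copied from the table above
rows = {
    "None (CPU), 8 cores": (2091, 0.28),
    "Nvidia K80": (667, 0.64),
    "Nvidia V100": (169, 2.67),
    "TPU v3-8": (27, 8.19),
}

for name, (seconds, price) in rows.items():
    print(f"{name}: {images_per_second(seconds):.2f} images/s, "
          f"{images_per_dollar(seconds, price) / 1000:.0f}K images/$")
```

Under these formulas the TPU v3-8 row comes out as both the fastest and the cheapest per image, even though it has the highest hourly price.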

44 of 49

Pipelines

Gigster tip:

Standardization

Portability

Reuse

06

45 of 49

Pipelines

Training

Ephemeral machines for reproducible end-2-end training in a Docker container

CV API Demo

Latest version of all models for CV team sprint demos and QA

Minimal computing resources possible

Temporary high cost for load tests

App Demo

Integration

Stable version (models in the current release) integrated with mobile/web for user accept test

Minimal computing resources possible

App Staging

Stable version (identical to Prod) integrated with a mobile/web for smoke and reliability testing

Minimal computing resources possible

App Prod

Stable version serving requests from real users integrated with a mobile/web

Full computing resources and auto-scaling

Delivering solutions for the enterprise is very different from prototyping a model in a Jupyter notebook. Data security, code transfer, scalable production workloads, solution monitoring and portability become important, and we instrumented a lot of processes and systems to help with that. Again, we had to build it from the ground up, and with GCP it comes almost out of the box.

Demo of AI Hub and Kubeflow

5 boxes above:

1)

Training machines: provisioned through Ansible scripts + Terraform

2 types of training machines:

One for interactive training
One for unattended training jobs: reproducible !

Client wanted to have the infrastructure for engineers completely lockin-free.

Sees value of AIHub in simplifying rollout of AI code. Especially if there is a lot of reuse between projects.

Kubeflow experience: project started in March 2018 last year. Kubeflow was very immature back then. Nov 2018: started looking again. 1 engineer worked on Kubeflow for 2-3 weeks but faced crashes constantly. Could look at it again.

Internally: Gigster dev environment with a layer on top of the cloud. Helps Gigster deliver mobile and web apps faster (ex: database provisioning)

Started developing a dev environment for AI with ready-made pipelines. Ex: pipeline for sentiment analysis, pipeline for recommendations.

=> these could be based on Kubeflow going forward

AIHub (repo of KF pipelines) = supermarket for Kubeflow pipelines. Sees lots of potential.

When engaging with customers who want to “adopt the AI DNA”, they want to start building the platform capabilities in their company. Pre-built pipelines and simplified interfaces are key to the penetration of these technologies in these companies. They are accessible to software engineers who have less AI experience.

AIHub: interested in both private and public repo.

2)

CV API Demo: latest version of all models.

3)

App demo: stable version of models with the web/mobile UI.

Models selected based on

decision to release/not to release model for legal (patenting) reasons.
Model features fit for app UI.
Cost reason: many models sitting unused

Model tests:

Load tests (manual)
API regression test (that the deployed test responds)
API prediction test (with just a couple of images)
Health checks on production clusters.
Using Prometheus for reporting. Used custom Kubernetes (legacy). Migrating to GKE + stackdriver.

Full end-to-end pipeline does not have testing integrated yet.

One engineer was only dedicated to starting/stopping machines (devops). Must automate the flow. Worked on deployment flow for a month.

Using Terraform to deploy models.

Deploying Keras models. Just deploys model.predict.

Not using Tensorflow serving.

70% of models in PyTorch: ML Engine did not have support for PyTorch for inference at the time.

Last 2 environments: running at the client

Crucial piece: go live: handoff and code transfer: this needs documentation, processes. Anything that can simplify/standardize this will help enormously.

4) Staging: same as prod with minimal machines

5) Prod

46 of 49

Kubeflow pipelines

47 of 49

Conclusion: neural network R&D

AI Platform

TensorFlow

Cloud TPU

48 of 49

Your Feedback is Greatly Appreciated!

Complete the session survey in mobile app

1-5 star rating system

Open field for comments

Rate icon in status bar

49 of 49

One TPU core

MXU: Matrix Multiply Unit, 128x128 bfloat16 matrices

VPU: Vector Processing Unit, float32, int32

4 TPU chips

2 cores per chip