XLA: TensorFlow’s compiler
Agenda
What’s XLA
Deep dive on XLA
How does it work?
The benefits
Agenda
What’s XLA
Deep dive on XLA
TensorFlow Strengths
Extensible
Flexible
Expressive
Interpreted
Dynamic
Stateful
"Black-Box" Modular
How's It Done?
How do we keep the strengths, but add more speed?!
Photo by Lwp Kommunikáció: https://goo.gl/NOqj68
Just-In-Time Compilation
via XLA, "Accelerated Linear Algebra" compiler
0x00000000 movq (%rdx), %rax
0x00000003 vmovaps (%rax), %xmm0
0x00000007 vmulps %xmm0, %xmm0, %xmm0
0x0000000b vmovaps %xmm0, (%rdi)
...
TF graphs go in,
Optimized & specialized assembly comes out.
Let's explain that!
Program built at runtime
Low-overhead compilation
Dim variables (e.g. batch size) can bind very late (see the sketch below)
Prototype with the freedom of TF development
What's JIT all about?
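As a rough sketch of what late binding looks like in practice (not from the original slides; it assumes a TF 1.x build with XLA enabled, and the names and layer sizes are made up), the batch dimension is left unknown when the graph is built and only binds when data is fed:

import numpy as np
import tensorflow as tf

# Batch dimension is unknown at graph-build time; XLA binds it late and
# specializes the compiled code for the shapes actually fed at run time.
jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(tf.float32, shape=[None, 128], name="x")
w = tf.Variable(tf.random_normal([128, 64]))
with jit_scope():
  y = tf.nn.relu(tf.matmul(x, w))  # compiled with XLA when executed

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  out = sess.run(y, {x: np.zeros((32, 128), np.float32)})  # batch size 32 binds here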
TF-Level Block Diagram
[Block diagram: TensorFlow on top of the existing TensorFlow Core (TF CPU Ops, TF GPU Ops, TF TPU Ops) plus XLA (XLA:CPU, XLA:GPU, XLA:TPU), reached through TF Auto-JIT.]
What has us excited?
Server-side speedups
XLA's JIT compilation and specialization
Model-shaped benchmark wins up to 60%
SyntaxNet latency reductions: 200µs ⇒ 5µs (extreme case)
XLA's Ahead-of-Time compilation
Turn models to executables
Eliminates much of TensorFlow runtime
Cross-compile for ARM, PPC, x86
LSTM model for mobile: 2.6MiB ⇒ <600KiB (4x reduction)
What has us excited?
Mobile footprint reductions
What has us excited?
Whole-Program Analysis made easy
XLA's High-Level Optimizer
Reusable toolkit of global optimizations
Layout (e.g. dim order, cache-line padding) is parameterized
Mix & match platform-agnostic & target specific passes
Caveats? It's still early days!
Not all TensorFlow ops compile
Perf improves daily, not everything is faster
Haven't devoted equal time to all platforms
With the community we believe we could do much more!
Open source: try it, file bugs, let us know!
Note: some won't compile by design
(e.g. DynamicStitch)
Compilation benefits
Specializes the code for your computation
Eliminates op dispatch overhead
Fuses ops: avoids round trips to memory (see the fusion sketch below)
Performs precise & global buffer analysis
Unrolls, vectorizes via known dimensions
Executable size: generate only what you need!
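To make the fusion point above concrete, here is a minimal sketch (not from the original slides; it assumes a TF 1.x build with XLA enabled) of an elementwise chain that XLA can fuse:

import numpy as np
import tensorflow as tf

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(tf.float32, shape=[1024])
with jit_scope():
  # Without XLA each op below is a separate kernel that writes its
  # intermediate result to memory; XLA can fuse the chain into one kernel.
  y = tf.tanh(x) * tf.sigmoid(x) + x

with tf.Session() as sess:
  print(sess.run(y, {x: np.ones(1024, np.float32)}))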
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
Agenda
What’s XLA
Deep dive on XLA
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
We just learned about XLA
XLA in one picture
[Diagram: XLA Graph → (lowering) → LLVM IR → (code gen) → x86 binary, PTX binary, arm binary, …]
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
From TensorFlow to XLA
TensorFlow in one picture
[Diagram: front ends (python, java, C, go) build a TensorFlow Graph of ops (add, softmax, …); the TensorFlow runtime (C++) executes it with the Executor, which dispatches each op to a Kernel (add, softmax, …).]
tf2xla: Symbolic graph execution
[Diagram: the TensorFlow runtime (C++) runs the TensorFlow Graph (add, softmax) through a Local Executor, but dispatches to tf2xla kernels, which build an XLA Graph instead of computing values.]
tf2xla: Symbolic graph execution
[Diagram: same setup; symbolically executing the TensorFlow Graph (add, softmax) through the tf2xla kernels yields the XLA Graph. Get XLA Graph!]
[Diagram (recap): XLA Graph → (lowering) → LLVM IR → (code gen) → x86 binary, PTX binary, arm binary, …]
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
Just-in-time compilation
JIT: Compile and run TF clusters
[Diagram: pre-clustering graph → identify a cluster of compilable ops → post-clustering graph → compile and run the cluster as a single op.]
JIT: Multiple clusters
[Diagram: the same flow with multiple clusters identified; each cluster is compiled and run separately.]
JIT: Avoid deadlock
[Diagram: pre-clustering graph → bad clustering → post-clustering graph with a cycle between clusters.]
Cycle == Deadlock! A compiled cluster runs as a single op, so if a bad clustering makes two clusters each depend on the other's output, neither can run to completion.
Turning on JIT compilation
Whole session
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)  # All supported ops compiled with XLA.
Manual scoped
jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
x = tf.placeholder(np.float32)
with jit_scope():
  y = tf.add(x, x)  # The "add" op will be compiled with XLA.
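Putting the pieces together, a minimal end-to-end sketch (shapes and values here are only illustrative; it assumes a build with XLA enabled):

import numpy as np
import tensorflow as tf

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.placeholder(np.float32, shape=[None, 4])
y = tf.reduce_sum(tf.add(x, x))  # supported ops get clustered and compiled by XLA

with tf.Session(config=config) as sess:
  print(sess.run(y, {x: np.ones((8, 4), np.float32)}))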
Challenge: TF + XLA
Build TensorFlow with XLA enabled (about 2 hours)
0: Environment
- GCE / Ubuntu 16.10 + Tesla K80
1: Install bazel and TF's Python dependencies
2: Install the CUDA libraries
- cuDNN requires applying to NVIDIA, so it can take about a day
3: Build
- nvcc only supports up to gcc 4 (no support for the new ABI), so install gcc-4.9
- Specify the following options in ./configure (all off by default)
- Set the gcc command path to gcc-4.9
- Turn CUDA support ON
- Turn the XLA just-in-time compiler ON
Challenge: TF + XLA
Build TensorFlow with XLA enabled (about 2 hours)
4: Use bazel build to build the tool that creates the pip package
5: Use that tool to build the wheel
6: Set up a virtualenv and pip install the wheel
7: Try the sample
- /tensorflow/tensorflow/examples/tutorials/mnist/mnist_softmax_xla.py
- Command-line options like --xla=false have no effect, so you have to edit the source
The documentation is also still being updated,
so not everything in it is necessarily correct
Challenge: TF + XLA
Results
Without XLA
With XLA
No change!! (There go my two hours!)
Challenge: TF + XLA
Apparently it is supposed to look like this…
Without XLA
With XLA
_XlaLaunch Op
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
Ahead-of-time compilation
TF-Level Block Diagram
[Block diagram (as before): TensorFlow over the existing TensorFlow Core (TF CPU / GPU / TPU Ops) and XLA (XLA:CPU, XLA:GPU, XLA:TPU) via TF Auto-JIT, now with an additional AOT path into XLA.]
AOT: Ahead-of-time compilation
[Diagram: TensorFlow Graph → XLA Graph → x86 binary, arm binary, …, produced ahead of time.]
tfcompile: Graph compiler
[Diagram: tfcompile takes a TensorFlow Graph plus a Config of feeds + fetches and produces an x86 binary and a C++ header exposing function(feed0, ..., feedN) -> (fetch0, ..., fetchN).]
tfcompile: Config
feed {
  id { node_name: "x_hold" }
  shape { dim { size: 2 } dim { size: 3 } }
}
feed {
  id { node_name: "y_hold" }
  shape { dim { size: 3 } dim { size: 2 } }
}
fetch {
  id { node_name: "x_y_prod" }
}
[Diagram: feeds + fetches. feed(x) is a 2x3 array (x00 … x12), feed(y) is a 3x2 array (y00 … y21), and the fetch x_y_prod is their matrix product.]
Configure feeds and fetches via tensorflow.tfcompile.Config proto
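For reference, a Python sketch of a graph matching this config (essentially what make_test_graphs.py does, shown later in this deck); the node names and shapes must line up with the feed/fetch entries:

import tensorflow as tf

with tf.Graph().as_default() as g:
  x = tf.placeholder(tf.float32, shape=[2, 3], name="x_hold")
  y = tf.placeholder(tf.float32, shape=[3, 2], name="y_hold")
  tf.matmul(x, y, name="x_y_prod")

with open("test_graph_tfmatmul.pb", "wb") as f:
  f.write(g.as_graph_def().SerializeToString())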
tfcompile: build
Compile your graph using the tf_library bazel build macro.
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")��tf_library(� name = "test_graph_tfmatmul",� graph = "test_graph_tfmatmul.pb", # graph in GraphDef format� config = "test_graph_tfmatmul.config.pbtxt", # configure feeds and fetches
cpp_class = "foo::bar::MatMulComp", # control header generation�)
tfcompile: use
Write code to call the computation:
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"��int main(int argc, char** argv) {� foo::bar::MatMulComp matmul;�� // Set up args and run the computation.� const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};� std::copy(args + 0, args + 6, matmul.arg0_data());� std::copy(args + 6, args + 12, matmul.arg1_data());� matmul.Run();�� // Check results
CHECK_EQ(matmul.result0(0, 0), 58);� return 0;�}
[Diagram: feed(x) = [[1, 2, 3], [4, 5, 6]], feed(y) = [[7, 8], [9, 10], [11, 12]]; matmul; fetch = [[58, 64], [139, 154]].]
Challenge: tfcompile
Build tfcompile (about 1.5 hours)
0: Environment
- GCE / Ubuntu16.10 + Tesla K80
1: Build with bazel
- bazel build --config=opt --config=cuda //tensorflow/compiler/aot:tfcompile
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
2: Sample program
- /tensorflow/tensorflow/compiler/aot/tests
- python make_test_graphs.py
- A script that builds several test graph patterns
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops

g = ops.Graph()
with g.as_default():
  x = array_ops.placeholder(dtypes.float32, name='x_hold')
  y = array_ops.placeholder(dtypes.float32, name='y_hold')
  math_ops.matmul(x, y, name='x_y_prod')
  math_ops.add(x, y, name='x_y_sum')
with open(filename, 'wb') as f:  # filename is the output path, e.g. 'test_graph_tfmatmul.pb'
  f.write(g.as_graph_def().SerializeToString())
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
3: Try the sample: build the library
$ bazel build :test_graph_tfmatmul
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
tf_library(
name = “test_graph_tfmatmul”,
graph = “test_graph_tfmatmul.pb”,
config = “test_graph_tfmatmul.config.pbtxt”,
cpp_class = “foo::bar::MatMulComp”
)
> //bazel-genfiles/tensorflow/compiler/aot/tests/test_graph_tfmatmul.{o|h}
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
4: Try the sample: use the library
my_code.cc
#include <algorithm>
#include <iostream>

#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"

int main(int argc, char** argv) {
  foo::bar::MatMulComp matmul;

  // Set up args and run the computation.
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  std::copy(args + 0, args + 6, matmul.arg0_data());
  std::copy(args + 6, args + 12, matmul.arg1_data());
  matmul.Run();

  // Check result
  if (matmul.result0(0, 0) == 58) {
    std::cout << "Success" << std::endl;
  } else {
    std::cout << "Failed. Expected value 58 at 0,0. Got:"
              << matmul.result0(0, 0) << std::endl;
  }
  return 0;
}
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
5: Try the sample: link the library into a binary
$ bazel build :my_binary
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
cc_binary(
name = “my_binary”,
srcs = [ “my_code.cc” ],
deps = [ “:test_graph_tfmatmul”, “//third_party/eigen3” ],
linkopts = [ “-lpthread” ]
)
> //bazel-bin/tensorflow/compiler/aot/tests/my_binary
Success
Document: JIT / AOT
tfcompile: Other languages
[Diagram: as before, tfcompile takes a TensorFlow Graph plus a feeds + fetches Config and emits an x86 binary, PTX binary, … together with a C++ header; the result can then be wrapped for Java via JNI, for Python via SWIG, and so on.]
Results
JIT model benchmarks GPU
Both training and inference
Faster is better!
JIT micro-benchmarks GPU
Some good speedups
… with room to improve
JIT model benchmarks CPU
Smaller binaries on mobile
Binary size reduction on android-arm (stacked LSTM, 3 deep, 60 wide)
Original: 2.6MB (1MB runtime + 1.6MB graph)
Compiled: 600KB (272KB code + 330KB weights)
Summary
Work in progress
Question
How many of you are interested in
what XLA actually does at the source-code level?
Thank you!