XLA: TensorFlow’s compiler
Agenda
What’s XLA
Deep dive on XLA
How does it work?
The benefits
Agenda
What’s XLA
Deep dive on XLA
TensorFlow Strengths
Extensible
Flexible
Expressive
Interpreted
Dynamic
Stateful
"Black-Box" Modular
How's It Done?
How do we keep the strengths, but add more speed?!
Photo by Lwp Kommunikáció: https://goo.gl/NOqj68
Just-In-Time Compilation
via XLA, "Accelerated Linear Algebra" compiler
0x00000000 movq (%rdx), %rax
0x00000003 vmovaps (%rax), %xmm0
0x00000007 vmulps %xmm0, %xmm0, %xmm0
0x0000000b vmovaps %xmm0, (%rdi)
...
TF graphs go in,
Optimized & specialized assembly comes out.
Let's explain that!
Program built at runtime
Low-overhead compilation
Dim variables (e.g. batch size) can bind very late (see the sketch below)
Prototype with the freedom of TF development
What's JIT all about?
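As a rough sketch of what late binding looks like in practice (not from the original slides; it assumes a TF 1.x build with XLA enabled, and the names and layer sizes are made up), the batch dimension is left unknown when the graph is built and only binds when data is fed:

import numpy as np
import tensorflow as tf

# Batch dimension is unknown at graph-build time; XLA binds it late and
# specializes the compiled code for the shapes actually fed at run time.
jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(tf.float32, shape=[None, 128], name="x")
w = tf.Variable(tf.random_normal([128, 64]))
with jit_scope():
  y = tf.nn.relu(tf.matmul(x, w))  # compiled with XLA when executed

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  out = sess.run(y, {x: np.zeros((32, 128), np.float32)})  # batch size 32 binds here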
TF-Level Block Diagram
[Block diagram: TensorFlow on top of the existing TensorFlow Core (TF CPU Ops, TF GPU Ops, TF TPU Ops) plus XLA (XLA:CPU, XLA:GPU, XLA:TPU), reached through TF Auto-JIT.]
What has us excited?
Server-side speedups
XLA's JIT compilation and specialization
Model-shaped benchmark wins up to 60%
SyntaxNet latency reductions: 200µs ⇒ 5µs (extreme case)
XLA's Ahead-of-Time compilation
Turn models to executables
Eliminates much of TensorFlow runtime
Cross-compile for ARM, PPC, x86
LSTM model for mobile: 2.6MiB ⇒ <600KiB (4x reduction)
What has us excited?
Mobile footprint reductions
What has us excited?
Whole-Program Analysis made easy
XLA's High-Level Optimizer
Reusable toolkit of global optimizations
Layout (e.g. dim order, cache-line padding) is parameterized
Mix & match platform-agnostic & target specific passes
Caveats? It's still early days!
Not all TensorFlow ops compile
Perf improves daily, not everything is faster
Haven't devoted equal time to all platforms
With the community we believe we could do much more!
Open source: try it, file bugs, let us know!
Note: some won't compile by design
(e.g. DynamicStitch)
Compilation benefits
Specializes the code for your computation
Eliminates op dispatch overhead
Fuses ops: avoids round trips to memory (see the fusion sketch below)
Performs precise & global buffer analysis
Unrolls, vectorizes via known dimensions
Executable size: generate only what you need!
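To make the fusion point above concrete, here is a minimal sketch (not from the original slides; it assumes a TF 1.x build with XLA enabled) of an elementwise chain that XLA can fuse:

import numpy as np
import tensorflow as tf

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope

x = tf.placeholder(tf.float32, shape=[1024])
with jit_scope():
  # Without XLA each op below is a separate kernel that writes its
  # intermediate result to memory; XLA can fuse the chain into one kernel.
  y = tf.tanh(x) * tf.sigmoid(x) + x

with tf.Session() as sess:
  print(sess.run(y, {x: np.ones(1024, np.float32)}))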
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
Agenda
What’s XLA
Deep dive on XLA
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
We just learned about XLA
XLA in one picture
[Diagram: XLA Graph → (lowering) → LLVM IR → (code gen) → x86 binary, PTX binary, arm binary, …]
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
From TensorFlow to XLA
TensorFlow in one picture
[Diagram: front ends (python, java, C, go) build a TensorFlow Graph of ops (add, softmax, …); the TensorFlow runtime (C++) executes it with the Executor, which dispatches each op to a Kernel (add, softmax, …).]
tf2xla: Symbolic graph execution
[Diagram: the TensorFlow runtime (C++) runs the TensorFlow Graph (add, softmax) through a Local Executor, but dispatches to tf2xla kernels, which build an XLA Graph instead of computing values.]
tf2xla: Symbolic graph execution
[Diagram: same setup; symbolically executing the TensorFlow Graph (add, softmax) through the tf2xla kernels yields the XLA Graph. Get XLA Graph!]
[Diagram (recap): XLA Graph → (lowering) → LLVM IR → (code gen) → x86 binary, PTX binary, arm binary, …]
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
Just-in-time compilation
JIT: Compile and run TF clusters
[Diagram: pre-clustering graph → identify a cluster of compilable ops → post-clustering graph → compile and run the cluster as a single op.]
JIT: Multiple clusters
[Diagram: the same flow with multiple clusters identified; each cluster is compiled and run separately.]
JIT: Avoid deadlock
[Diagram: pre-clustering graph → bad clustering → post-clustering graph with a cycle between clusters.]
Cycle == Deadlock! A compiled cluster runs as a single op, so if a bad clustering makes two clusters each depend on the other's output, neither can run to completion.
Turning on JIT compilation
Whole session
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)  # All supported ops compiled with XLA.
Manual scoped
jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
x = tf.placeholder(np.float32)
with jit_scope():
  y = tf.add(x, x)  # The "add" op will be compiled with XLA.
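Putting the pieces together, a minimal end-to-end sketch (shapes and values here are only illustrative; it assumes a build with XLA enabled):

import numpy as np
import tensorflow as tf

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.placeholder(np.float32, shape=[None, 4])
y = tf.reduce_sum(tf.add(x, x))  # supported ops get clustered and compiled by XLA

with tf.Session(config=config) as sess:
  print(sess.run(y, {x: np.ones((8, 4), np.float32)}))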
Challenge: TF + XLA
Build TensorFlow with XLA enabled (about 2 hours)
0: Environment
- GCE / Ubuntu 16.10 + Tesla K80
1: Install bazel and TF's Python dependencies
2: Install the CUDA libraries
- cuDNN requires applying to NVIDIA, so it can take about a day
3: Build
- nvcc only supports up to gcc 4 (no support for the new ABI), so install gcc-4.9
- Specify the following options in ./configure (all off by default)
- Set the gcc command path to gcc-4.9
- Turn CUDA support ON
- Turn the XLA just-in-time compiler ON
Challenge: TF + XLA
Build TensorFlow with XLA enabled (about 2 hours)
4: Use bazel build to build the tool that creates the pip package
5: Use that tool to build the wheel
6: Set up a virtualenv and pip install the wheel
7: Try the sample
- /tensorflow/tensorflow/examples/tutorials/mnist/mnist_softmax_xla.py
- Command-line options like --xla=false have no effect, so you have to edit the source
The documentation is also still being updated,
so not everything in it is necessarily correct
Challenge: TF + XLA
Results
Without XLA
With XLA
No change!! (There go my two hours!)
Challenge: TF + XLA
Apparently it is supposed to look like this…
Without XLA
With XLA
_XlaLaunch Op
//tensorflow/compiler/xla
//tensorflow/compiler/tf2xla
//tensorflow/compiler/jit
//tensorflow/compiler/aot
Ahead-of-time compilation
TF-Level Block Diagram
[Block diagram (as before): TensorFlow over the existing TensorFlow Core (TF CPU / GPU / TPU Ops) and XLA (XLA:CPU, XLA:GPU, XLA:TPU) via TF Auto-JIT, now with an additional AOT path into XLA.]
AOT: Ahead-of-time compilation
[Diagram: TensorFlow Graph → XLA Graph → x86 binary, arm binary, …, produced ahead of time.]
tfcompile: Graph compiler
[Diagram: tfcompile takes a TensorFlow Graph plus a Config of feeds + fetches and produces an x86 binary and a C++ header exposing function(feed0, ..., feedN) -> (fetch0, ..., fetchN).]
tfcompile: Config
feed {
  id { node_name: "x_hold" }
  shape { dim { size: 2 } dim { size: 3 } }
}
feed {
  id { node_name: "y_hold" }
  shape { dim { size: 3 } dim { size: 2 } }
}
fetch {
  id { node_name: "x_y_prod" }
}
[Diagram: feeds + fetches. feed(x) is a 2x3 array (x00 … x12), feed(y) is a 3x2 array (y00 … y21), and the fetch x_y_prod is their matrix product.]
Configure feeds and fetches via tensorflow.tfcompile.Config proto
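For reference, a Python sketch of a graph matching this config (essentially what make_test_graphs.py does, shown later in this deck); the node names and shapes must line up with the feed/fetch entries:

import tensorflow as tf

with tf.Graph().as_default() as g:
  x = tf.placeholder(tf.float32, shape=[2, 3], name="x_hold")
  y = tf.placeholder(tf.float32, shape=[3, 2], name="y_hold")
  tf.matmul(x, y, name="x_y_prod")

with open("test_graph_tfmatmul.pb", "wb") as f:
  f.write(g.as_graph_def().SerializeToString())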
tfcompile: build
Compile your graph using the tf_library bazel build macro.
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")��tf_library(� name = "test_graph_tfmatmul",� graph = "test_graph_tfmatmul.pb", # graph in GraphDef format� config = "test_graph_tfmatmul.config.pbtxt", # configure feeds and fetches
cpp_class = "foo::bar::MatMulComp", # control header generation�)
tfcompile: use
Write code to call the computation:
#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"��int main(int argc, char** argv) {� foo::bar::MatMulComp matmul;�� // Set up args and run the computation.� const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};� std::copy(args + 0, args + 6, matmul.arg0_data());� std::copy(args + 6, args + 12, matmul.arg1_data());� matmul.Run();�� // Check results
CHECK_EQ(matmul.result0(0, 0), 58);� return 0;�}
[Diagram: feed(x) = [[1, 2, 3], [4, 5, 6]], feed(y) = [[7, 8], [9, 10], [11, 12]]; matmul; fetch = [[58, 64], [139, 154]].]
Challenge: tfcompile
Build tfcompile (about 1.5 hours)
0: Environment
- GCE / Ubuntu16.10 + Tesla K80
1: Build with bazel
- bazel build --config=opt --config=cuda //tensorflow/compiler/aot:tfcompile
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
2: Sample program
- /tensorflow/tensorflow/compiler/aot/tests
- python make_test_graphs.py
- A script that builds several test graph patterns
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops

g = ops.Graph()
with g.as_default():
  x = array_ops.placeholder(dtypes.float32, name='x_hold')
  y = array_ops.placeholder(dtypes.float32, name='y_hold')
  math_ops.matmul(x, y, name='x_y_prod')
  math_ops.add(x, y, name='x_y_sum')
with open(filename, 'wb') as f:  # filename is the output path, e.g. 'test_graph_tfmatmul.pb'
  f.write(g.as_graph_def().SerializeToString())
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
3: Try the sample: build the library
$ bazel build :test_graph_tfmatmul
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
tf_library(
name = “test_graph_tfmatmul”,
graph = “test_graph_tfmatmul.pb”,
config = “test_graph_tfmatmul.config.pbtxt”,
cpp_class = “foo::bar::MatMulComp”
)
> //bazel-genfiles/tensorflow/compiler/aot/tests/test_graph_tfmatmul.{o|h}
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
4: Try the sample: use the library
my_code.cc
#include <algorithm>
#include <iostream>

#include "tensorflow/compiler/aot/tests/test_graph_tfmatmul.h"

int main(int argc, char** argv) {
  foo::bar::MatMulComp matmul;

  // Set up args and run the computation.
  const float args[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
  std::copy(args + 0, args + 6, matmul.arg0_data());
  std::copy(args + 6, args + 12, matmul.arg1_data());
  matmul.Run();

  // Check result
  if (matmul.result0(0, 0) == 58) {
    std::cout << "Success" << std::endl;
  } else {
    std::cout << "Failed. Expected value 58 at 0,0. Got:"
              << matmul.result0(0, 0) << std::endl;
  }
  return 0;
}
Challenge: tfcompile
Try tfcompile (about 1.2 hours)
5: Try the sample: link the library into a binary
$ bazel build :my_binary
load("//tensorflow/compiler/aot:tfcompile.bzl", "tf_library")
cc_binary(
name = “my_binary”,
srcs = [ “my_code.cc” ],
deps = [ “:test_graph_tfmatmul”, “//third_party/eigen3” ],
linkopts = [ “-lpthread” ]
)
> //bazel-bin/tensorflow/compiler/aot/tests/my_binary
Success
Document: JIT / AOT
tfcompile: Other languages
[Diagram: as before, tfcompile takes a TensorFlow Graph plus a feeds + fetches Config and emits an x86 binary, PTX binary, … together with a C++ header; the result can then be wrapped for Java via JNI, for Python via SWIG, and so on.]
Results
JIT model benchmarks GPU
Both training and inference
Faster is better!
JIT micro-benchmarks GPU
Some good speedups
… with room to improve
JIT model benchmarks CPU
Smaller binaries on mobile
Binary size reduction on android-arm (stacked LSTM, 3 deep, 60 wide)
Original: 2.6MB (1MB runtime + 1.6MB graph)
Compiled: 600KB (272KB code + 330KB weights)
Summary
Work in progress
Question
How many of you are interested in
what XLA actually does at the source-code level?
Thank you!