MLIR:
Presenting the work of many, many people!
CGO 2020: International Symposium on Code Generation and Optimization
Tatiana Shpeisman
shpeisman@google.com
Chris Lattner
clattner@sifive.com
Multi-Level Intermediate Representation Compiler Infrastructure
Overview
What is wrong with existing compilers?
LLVM circa CGO 2004
TL;DR: One true IR (operations and type system) solves all of:
All the input and output details are tiny arrows in the diagram above
“LLVM achieves this through ... a code representation with several novel features that serves as a common representation for analysis, transformation, and code distribution”
LLVM compiler today
LLVM IR is the centerpoint of mid-level and interprocedural optimizations:
LLVM IR is not enough!
Also various design mistakes that still persist
[Diagram: LLVM IR → SelectionDAG → Machine IR → MC IR → .o]
Clang compiler today
Abstraction gap between C++ and LLVM IR is huge:
Clang also needs "high level" dataflow analysis:
[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → Clang AST (a forest of trees) → LLVM IR → SelectionDAG → Machine IR → MC IR → .o; a separate Clang "CFG" feeds the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ...]
Better solution: Clang should have a CIR!
Integrate dataflow diagnostics with compilation flow:
Progressive lowering solves many problems:
[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → AST → CIR → LLVM IR → SelectionDAG → Machine IR → MC IR → .o; CIR feeds the Clang Static Analyzer, -Wunreachable-code, -Wuninitialized, ...]
Where does this end?
We need an IR for each abstraction level that needs analysis and transformations!
How can we afford this? How many inliners, PassManagers, InstCombiners, parsers/printers... do we end up with??
🤯
It gets worse: this is not Clang-specific!
Let’s zoom out...
Modern languages pervasively invest in high level IRs
[Diagram: per-language pipelines into LLVM IR:
  Swift: Swift AST → SIL IR → LLVM IR
  Rust: Rust AST → MIR IR → LLVM IR
  Julia: Julia AST → Julia IR → LLVM IR
  Clang (C, C++, ObjC, CUDA, OpenCL, ...): Clang AST → CIR IR → LLVM IR
  Fortran: Flang AST → ... → LLVM IR]
TensorFlow is basically a huge compiler ecosystem
These boxes are all different domain-specific compiler systems:
[Diagram: TensorFlow Graph feeding Grappler, TensorFlow Lite, NNAPI, Core ML, TensorRT, nGraph, TPU IR, LLVM IR, and many others]
Recap: Domain- and abstraction-specific compiler IRs
Great!
Not great!
MLIR: Multi-Level IR
Also: Mid Level, Moore's Law, Multidimensional Loop, Machine Learning, Modular Library, …
LLVM has only one expansion and it is wrong/misleading. Solution: have lots of ambiguous expansions so we can change our mind later :-)
Many similarities to LLVM
func @testFunction(%arg0: i32) -> i32 {
  %x = call @thingToCall(%arg0) : (i32) -> i32
  br ^bb1
^bb1:
  %y = addi %x, %x : i32
  return %y : i32
}
Syntactically similar:
Nesting structure: a Module contains Functions, each Function contains Blocks, and each Block contains Operations.
Type System - some examples
Scalars: i1, i8, i32, i64, f16, bf16, f32, f64, index, ...
Vectors: vector<4xf32>, vector<4x4xf16>, ...
Tensors, including dynamic shape and rank: tensor<4x4xf32>, tensor<4x?x?xf32>, tensor<*xf32>, ...
Others: memref<16x16xf32>, function types such as (i32, i32) -> i64, ...
MLIR Operations: an open ecosystem
No fixed / builtin list of globally known operations:
Passes are expected to conservatively handle unknown ops:
func @testFunction(%arg0: i32) -> i32 {
%x = "any_unknown_operation_here"(%arg0, %arg0) : (i32, i32) -> i32
%y = "my_increment"(%x) : (i32) -> i32
return %y : i32
}
Capabilities of MLIR Operations
Operations always have: opcode and source location info
Operations may have:
%2 = dim %1, 1 : tensor<1024x? x f32>
%x = alloc() : memref<1024x64 x f32>
%y = load %x[%a, %b] : memref<1024x64 x f32>
The dimension to extract is guaranteed to be an integer constant, an "attribute"
Complicated TensorFlow Example
func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
         {data_format: "NHWC", epsilon: 0.001, is_training: false}
       : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
      -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...
Complicated TensorFlow Example: Inputs, Results, Attributes
(The same code, highlighting in turn the tensor operands, the five results of %0, and the attribute dictionary.)
Nested Regions
%7 = tf.If(%arg0 : tensor<i1>, %arg1 : tensor<2xf32>) -> tensor<2xf32> {
  … "then" code...
  return ...
} else {
  … "else" code...
  return ...
}

affine.for %arg0 = 0 to %n {
  affine.for %arg1 = 0 to %n {
    %3 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
    %4 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
    %5 = mulf %3, %4 : f64
    affine.store %5, %0[%arg0, %arg1] : memref<3x2xf64>
  }
}
Extensible Operations Allow Multi-Level IR
TensorFlow:
%x = "tf.Conv2d"(%input, %filter)
     {strides: [1,1,2,1], padding: "SAME", dilations: [2,1,1,1]}
   : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xf32>

XLA HLO:
%m = "xla.AllToAll"(%z)
     {split_dimension: 1, concat_dimension: 0, split_count: 2}
   : (tensor<300x200x32xf32>) -> tensor<600x100x32xf32>

LLVM IR:
%a = llvm.load %p : !llvm<"float*">
%f = llvm.add %a, %b : !llvm.float

Polyhedral: (affine.for loops, as above)
Don’t we end up with the JSON of compiler IRs????
Lowering:
affine.for %i = 0 to %n {
  %v = affine.load %2[%i, %i] : memref<2x3xf64>
MLIR "Dialects": Families of defined operations
Dialects generally correspond to an abstraction level:
Dialects can define: (see the C++ sketch below)
Operations can define:
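As a rough illustration (not from the talk), a dialect is registered in C++ along these lines; this assumes the 2020-era constructor signature, and the "mychip" name and its ops/types are hypothetical:

#include "mlir/IR/Dialect.h"

namespace {
// Hypothetical hardware dialect owning its own operations and a custom type.
class MyChipDialect : public mlir::Dialect {
public:
  explicit MyChipDialect(mlir::MLIRContext *ctx)
      : mlir::Dialect("mychip", ctx) {
    addOperations<MatMulOp, EinsumOp>();   // hypothetical ops
    addTypes<TileType>();                  // hypothetical custom type
  }
};
} // namespace

// Makes the dialect available to tools such as mlir-opt (2020-era API).
static mlir::DialectRegistration<MyChipDialect> registration;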
MLIR: Infrastructure
A “Batteries Included” compiler infrastructure
Provide a lot of standard functionality in the box:
Creating a new IR is very easy, fast and correct:
Utilizes LLVM infrastructure (e.g. FileCheck, lit) and data structures (llvm/ADT)
Declarative Op definitions: TensorFlow LeakyRelu
def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",
                                [NoSideEffect, SameValueType]>,
                     Results<(outs TF_Tensor:$output)> {
  let arguments = (ins
    TF_FpTensor:$value,
    DefaultValuedAttr<F32Attr, "0.2">:$alpha
  );

  let summary = "Leaky ReLU operator";
  let description = [{
    The Leaky ReLU operation takes a tensor and returns
    a new tensor element-wise as follows:
      LeakyRelu(x) = x         if x >= 0
                   = alpha*x   otherwise
  }];
}
Learn more: Operation Definition Specification (ODS)
Generated documentation
Generated C++ Code: Verifier Implementation
namespace TF {
class LeakyReluOp
    : public Op<LeakyReluOp,
                OpTrait::OneResult,
                OpTrait::HasNoSideEffect,
                OpTrait::SameOperandsAndResultType,
                OpTrait::OneOperand> {
public:
  static StringRef getOperationName() {
    return "tf.LeakyRelu";
  }

  Value *value() { … }
  APFloat alpha() { … }
  static void build(…) { … }

  bool verify() const {
    if (…)
      return emitOpError(
          "requires 32-bit float attribute 'alpha'");
    return false;
  }
  ...
};
} // end namespace TF
Custom Op Printer and Parser Implementations
Declarative specification of a "pretty" textual format for an operation:
def CallOp : My_Op<"call", ...> {
let arguments = (ins FlatSymbolRefAttr:$callee, Variadic<AnyType>:$operands);
let results = (outs Variadic<AnyType>);
let assemblyFormat = [{
$callee `(` $operands `)` attr-dict `:` functional-type($operands, results)
}];
}
The default generic IR dump can be verbose and redundant:
%x = "my.call"(%arg0) { callee: @return_op: (i32) -> i32 }
   : (i32) -> i32
%z = "my.add"(%x, %y) : (i32, i32) -> i32

With the custom assembly format, the same ops print as:
%x = my.call @return_op(%arg0) : (i32) -> i32
%z = my.add %x, %y : i32
Passes, Walkers, Pattern Matchers
struct Vectorize : public FunctionPass<Vectorize> {
  void runOnFunction() override;
};

...
// e.g. in a fold hook: if the second operand is the constant 0, return the first
if (matchPattern(getOperand(1), m_Zero()))
  return getOperand(0);
...

...
// Walk every operation nested under the function
f->walk([&](Operation *op) {
  process(op);
});
...
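Putting these fragments together, a minimal pass might look like the sketch below; this is an illustration assuming the 2020-era pass API (FunctionPass, PassRegistration), whose spellings have since changed:

#include "mlir/IR/Function.h"
#include "mlir/Pass/Pass.h"

namespace {
// Walks every operation nested in the current function.
struct Vectorize : public mlir::FunctionPass<Vectorize> {
  void runOnFunction() override {
    getFunction().walk([&](mlir::Operation *op) {
      // Inspect or rewrite `op` here.
    });
  }
};
} // namespace

// Exposes the pass as `mlir-opt -vectorize`.
static mlir::PassRegistration<Vectorize> pass("vectorize",
                                              "Example vectorization pass");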
Declarative Rewrite Rules (DRR)
Declarative, reduces boilerplate, easy for everyone to express:
def : Pat<(TF_SqueezeOp StaticShapeTensor:$arg),
(TFL_ReshapeOp $arg)>;
def : Pat<(ReshapeOp(ReshapeOp $arg)),
(ReshapeOp $arg)>;
Pattern Rewrites via State Machine
Large numbers of pattern rewrites take time to search
Implemented in MLIR of course!
DialectConversion: Consistent Lowering Framework
Defining a lowering requires three things (sketched below): a conversion target, a set of rewrite patterns, and optionally a type converter
Supports partial lowering vs complete lowering
Supports transitive lowering: A->B->C
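A hedged C++ sketch of those three pieces; the dialect and pattern names (MyTargetDialect, LowerFooOp, LowerBarOp) are hypothetical, and the 2020-era class names such as OwningRewritePatternList have since been renamed:

#include "mlir/Transforms/DialectConversion.h"

mlir::LogicalResult runMyLowering(mlir::ModuleOp module) {
  mlir::MLIRContext *ctx = module.getContext();

  // 1. Conversion target: declares which dialects/ops are legal afterwards.
  mlir::ConversionTarget target(*ctx);
  target.addLegalDialect<MyTargetDialect>();      // hypothetical dialect

  // 2. Rewrite patterns that translate the remaining illegal ops.
  mlir::OwningRewritePatternList patterns;
  patterns.insert<LowerFooOp, LowerBarOp>(ctx);   // hypothetical patterns

  // 3. A TypeConverter would also be supplied here if types change.

  // Partial conversion: ops not marked illegal are left untouched.
  return mlir::applyPartialConversion(module, target, patterns);
}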
Learn More: Dialect Conversion Framework
Tutorial: Lowering to Lower-Level Dialects
mlir-opt
// RUN: mlir-opt %s -loop-unroll | FileCheck %s
func @loop_nest_simplest() {
// CHECK: affine.for %i0 = 0 to 100 step 2 {
affine.for %i = 0 to 100 step 2 {
// CHECK: %c1_i32 = constant 1 : i32
// CHECK-NEXT: %c1_i32_0 = constant 1 : i32
// CHECK-NEXT: %c1_i32_1 = constant 1 : i32
affine.for %j = 0 to 3 {
%x = constant 1 : i32
}
}
return
}
Integrated Source Location Tracking
API requires location information on each operation:
$ cat test/Transforms/memref-dependence-check.mlir
// Actual test is much longer...
func @test() {
%0 = alloc() : memref<100xf32>
affine.for %i0 = 0 to 10 {
%1 = load %0[%i0] : memref<100xf32>
store %1, %0[%i0] : memref<100xf32>
}
return
}
$ mlir-opt -memref-dependence-check memref-dependence-check.mlir
…
m-d-c.mlir:5:10: note: dependence from 0 to 0 at depth 1 = false
%1 = load %0[%i0] : memref<100xf32>
^
…
m-d-c.mlir:6:5: note: dependence from 1 to 0 at depth 1 = false
store %1, %0[%i0] : memref<100xf32>
^
Easy for passes to emit structured diagnostics:
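For instance (an illustrative sketch, not taken from the talk), a pass can report an error on one operation and attach a note pointing at another location:

#include "mlir/IR/Diagnostics.h"
#include "mlir/IR/Operation.h"

// Report a problem on `op` and point at the operation it conflicts with.
void reportConflict(mlir::Operation *op, mlir::Operation *conflictingOp) {
  mlir::InFlightDiagnostic diag =
      op->emitError("dependence analysis failed for this operation");
  diag.attachNote(conflictingOp->getLoc()) << "conflicting access is here";
}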
Learn more: MLIR Diagnostics and Location Tracking
Location Tracking: Great for Testing!
Test suite uses -verify mode just like Clang/Swift diagnostic test:
// RUN: mlir-opt %s -memref-dependence-check -verify
func @test() {
%0 = alloc() : memref<100xf32>
affine.for %i0 = 0 to 10 {
// expected-note @+1 {{dependence from 0 to 1 at depth 2 = true}}
%1 = load %0[%i0] : memref<100xf32>
store %1, %0[%i0] : memref<100xf32>
}
}
LLVM IR is a Dialect in MLIR
...
^bb2: // pred: ^bb1
%9 = llvm.constant(10) : !llvm.i64
%11 = llvm.mul %2, %9 : !llvm.i64
%12 = llvm.add %11, %6 : !llvm.i64
%13 = llvm.extractvalue %arg2[0] : !llvm<"{ float* }">
%14 = llvm.getelementptr %13[%12] :
(!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
llvm.store %8, %14 : !llvm<"float*">
...
Code lowered to the LLVM dialect in MLIR
Reuse standard passes and other dialects
Dialect independent passes:
Many dialects available with dialect specific passes:
Learn more: MLIR Passes
Example application:
Building the TensorFlow Backend Bridge
TensorFlow Compiler ecosystem
Many complex subsystems
[Diagram: TensorFlow Graph feeding Grappler, the TF/XLA bridge, TensorFlow Lite, NNAPI, Core ML, TensorRT, nGraph, TPU IR, LLVM IR, and many others]
TensorFlow Bridge with MLIR
All semantic transformations are done in MLIR
[Diagram: TensorFlow Graph →(Import: representation change)→ TF Graph .mlir →(Convert: abstraction change)→ HLO .mlir →(Export: representation change)→ XLA HLO]
TensorFlow Computational Graph Dialect
Compact textual representation isomorphic to computational graph
MLIR:
func @f(%arg0 : tensor<8xi32>,
%arg1 : tensor<8xi32>,
%arg2 : tensor<8xi32>) -> tensor<8xi32> {
%a = tf.Add(%arg0, %arg1) : …
%s = tf.Sub(%arg1, %arg2) : …
%m = tf.Mul(%a, %s) : …
return %m : tensor<8xi32>
}
Control Flow and Concurrency
tf_executor.graph () {
  %0:2 = tf_executor.island wraps "tf.Const"(...) : () -> tensor<i32>
  %1:2 = tf_executor.island wraps "tf.Const"(...) : () -> tensor<i1>
  %2:3 = tf_executor.Switch %0#0, %1#0 :
         (...) -> (tensor<i32>, tensor<i32>, !tf_executor.control) {...}
  …
}
Let’s Build the Bridge
Step 1. Pipeline of graph transformation and optimization passes
PassManager bridge(module.getContext());
...
bridge.addPass(createInlinerPass());
bridge.addPass(createTFShapeInfPass());
Step 2. Operation rewrite rules
def : Pat<(TF_ConjOp $v),
(HLO_ComplexOp (HLO_RealOp $v), (HLO_NegOp (HLO_ImagOp $v)))>;
Multi-level Operation Rewrite Interface
A new backend can reuse existing rules, create new ones, or use a combination of both
[Diagram: rewrite rules connecting an input tf.Einsum with tf.Reshape, tf.MatMul, and tf.Add, lowering to hlo.matmul, mychip.Einsum, or mychip.matmul]
Developer Benefits
More example MLIR users
(not an exhaustive list!)
A Compiler Intermediate Representation for Stencils (Jean-Michel Gorius, Tobias Wicky, Tobias Grosser, and Tobias Gysi)
An MLIR Dialect for High-Level Optimization of Fortran (Eric Schweitz, NVIDIA)
Compiling for Xilinx AI Engine using MLIR (Samuel Bayliss, Xilinx, C4ML 2020)
Utilizing MLIR in Clang
Disclaimer, very speculative: the Clang community hasn’t seriously discussed this or reached consensus
Clang IR Generation has poor separation of concerns
Abstraction gap between C++ and LLVM IR is huge:
[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → Clang AST → LLVM IR; a separate Clang "CFG" feeds the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ...]
How do we make incremental progress?
Recommendation: Move diagnostics first
Clang “CFG” path is (relatively) unloved:
Relatively few clients
[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → Clang AST → LLVM IR; the Clang "CFG" → Clang Static Analyzer / -Wunreachable-code, -Wuninitialized, ... path is highlighted]
Build a new "MLIR CIR" in tree
Build it up next to the existing path, in master:
Reimplement flow-sensitive dataflow diagnostics:
-Wunreachable-code, -Wuninitialized, ...
[Diagram: the existing C, C++, ObjC, CUDA, OpenCL, ... → Clang AST → LLVM IR path and the Clang "CFG" → Clang Static Analyzer / -W diagnostics path stay in place; a new MLIR CIR is generated from the AST alongside them]

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: …
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}
Cut over the static compiler to use MLIR CIR
Enable by default as soon as the flow-sensitive diagnostics are superior:
Clang "CFG" becomes an implementation detail of the CSA & analysis tools
[Diagram: -Wunreachable-code, -Wuninitialized, ... now run on MLIR CIR; the Clang "CFG" path remains only for the Clang Static Analyzer]
Add Lowering from CIR to LLVM IR, finish CIR coverage
Reuse most of the existing IR generation logic and helpers:
Iterate on this until it supports all language features and CIR starts to settle
[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → Clang AST → MLIR CIR → LLVM IR, with -Wunreachable-code, -Wuninitialized, ... running on CIR]
A unified path enables the fun part!
Start detangling one subsystem at a time:
Implement library-specific optimizations:
[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → Clang AST → MLIR CIR → LLVM IR, with -Wunreachable-code, -Wuninitialized, ... running on CIR]
Strawman example CIR lowering (sketch)
std::vector<int> foo() {
  std::vector<int> vec;
  // Insert: vec.reserve(100);
  for (int i = 0; i < 100; ++i)
    vec.push_back(i);
  return vec;
}

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: …
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}
Many new approaches to explore...
Maintain structured loops and control flow for OpenCL and CUDA
Preserve better alias analysis information
New generation of source tooling based on hybrid dataflow + source location info
Enable higher level domain specific optimizations:
OpenMP & Other Parallelism Dialects
OpenMP is mostly orthogonal to host language:
Model OpenMP as a dialect in MLIR:
Lower to the LLVM IR dialect as usual
int j = 4 + 5;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  stuff(i, j);
}

%c4 = cir.constant 4 : !cir.int
%c5 = cir.constant 5 : !cir.int
%j = cir.add %c4, %c5 : !cir.int
omp.parallel.for (...) {
^bb0(%i : !cir.int):
  cir.call @stuff(%i, %j)
}

SSA ConstProp:

omp.parallel.for (...) {
^bb0(%i : !cir.int):
  %c9 = cir.constant 9 : !cir.int
  cir.call @stuff(%i, %c9)
}
Utilizing MLIR for LLVM IR
Disclaimer, very speculative: the LLVM community hasn’t seriously discussed this or reached consensus
Why not use MLIR for LLVM IR?
Observation: MLIR already has an LLVM IR dialect used for codegen
...
^bb2: // pred: ^bb1
%9 = llvm.constant(10) : !llvm.i64
%11 = llvm.mul %2, %9 : !llvm.i64
%12 = llvm.add %11, %6 : !llvm.i64
%13 = llvm.extractvalue %arg2[0] : !llvm<"{ float* }">
%14 = llvm.getelementptr %13[%12] :
(!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
llvm.store %8, %14 : !llvm<"float*">
...
Port the existing LLVM IR passes to work on LLVM dialect!
MLIR’s impl is just better than the LLVM IR data structures!
“Why use MLIR for LLVM IR?”
And yes, I 💖 LLVM!
Implicitly Multithreaded PassManager
Multicore isn’t “the future” anymore:
MLIR PassManager runs passes on “isolated from above” regions in parallel
This can provide an easy 4-100x compile time speedup!
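For illustration (a sketch using 2020-era API names such as FuncOp, which have since moved), function passes are nested under the module pass manager, and MLIR can then run them on separate functions in parallel:

#include "mlir/IR/Function.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

mlir::LogicalResult runPipeline(mlir::MLIRContext &context,
                                mlir::ModuleOp module) {
  mlir::PassManager pm(&context);
  // Passes in this nested pipeline run per-function ("isolated from above"),
  // so the pass manager can schedule them across threads.
  mlir::OpPassManager &funcPm = pm.nest<mlir::FuncOp>();
  funcPm.addPass(mlir::createCanonicalizerPass());
  return pm.run(module);
}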
MLIR BB Arguments >>> LLVM PHINode
llvm::PHINode design has challenges:
br label %loop
loop:
  %x = phi i32 [ %in1, %entry ], [ %y, %loop ]
  %y = phi i32 [ %in2, %entry ], [ %x, %loop ]
  use(%x, %y)
  br i1 %cond, label %out, label %loop
out:
  ...

MLIR defines these away by using functional-style SSA form:

br ^loop(%in1 : i32, %in2 : i32)
^loop(%x: i32, %y: i32):
  use(%x, %y)
  cond_br %cond, ^out, ^loop(%y : i32, %x : i32)
^out:
  ...
PHINode predecessors must provide the same value
if (cond) {
  result = x
} else {
  result = y
}
...

  ...
  br i1 %cond,
      label %other, label %merge
other:
  br label %merge
merge:
  %result = phi [%y, %other],
                [%x, %start]
  ...

MLIR block arguments let each edge pass its own value, so no extra block is needed:

llvm.cond_br %cond,
    ^merge(%x: !llvm.i32), ^merge(%y: !llvm.i32)
^merge(%arg: !llvm.i32):
  ...
LLVM invoke dominance issues
  ...
  %result = invoke @foo(...)
      to label %normal
      unwind label %except
normal:
  use(%result)
  ...
except:
  %errval = landingpad ...
  ...
...
llvm.invoke @foo(...)
to ^normal unwind ^except
^normal(%result):
...
^except(%errval):
...
Better Location Tracking
LLVM metadata design is wrong-in-retrospect for debug information:
MLIR design strongly encourages pass authors to think about this by default:
Better Infrastructure
MLIR provides new opportunities!
“Why use MLIR for LLVM IR?”
We can make core LLVM even better!
Can now introduce higher level abstractions!
Parallelism abstractions:
Loop transformations:
...
Ok, but how? --> Incrementally
Upgrade LLVM in place, eliminate differences between the two:
Implement compatibility wrappers:
Dissolve away the wrappers over time
This will be a bit complex to phase in
⇒ Get experience with Clang first
Learning More
MLIR is part of the LLVM Project!
Code available in LLVM GitHub Monorepo:
Find lots of content on mlir.llvm.org:
Discussions on the MLIR Discourse Forum
MLIR within TensorFlow OSS
MLIR powers several TensorFlow subsystems
TensorFlow team hosts the MLIR Open Design Meeting:
“That which you can represent, you can transform!”
Questions?
Tatiana Shpeisman
shpeisman@google.com
Chris Lattner
clattner@sifive.com