1 of 90

MLIR: Multi-Level Intermediate Representation Compiler Infrastructure

Presenting the work of many, many people!

CGO 2020: International Symposium on Code Generation and Optimization

Tatiana Shpeisman
shpeisman@google.com

Chris Lattner
clattner@sifive.com

2 of 90

Overview

  • Context
  • About MLIR
  • A few users of MLIR
  • Provocative (?) proposal for Clang and LLVM
  • Conclusion

3 of 90

What is wrong with existing compilers?

4 of 90

LLVM circa CGO 2004

TL;DR: One true IR (operations and type system) solves all of:

  • Multi-targeting, Analysis, Optimization, Distribution

All the input and output details are tiny arrows in the diagram above

“LLVM achieves this through ... a code representation with several novel features that serves as a common representation for analysis, transformation, and code distribution”

5 of 90

LLVM compiler today

LLVM IR is the centerpoint of mid-level and interprocedural optimizations:

  • Proven as a very useful “C with Vectors” level of abstraction


6 of 90

LLVM compiler today

LLVM IR is the centerpoint of mid-level and interprocedural optimizations:

  • Proven as a very useful “C with Vectors” level of abstraction

LLVM IR is not enough!

  • Multiple other abstraction levels introduced over time
  • Poor representation for higher-level abstractions: parallelism, loops, etc.

Also various design mistakes that still persist

[Diagram: LLVM IR → SelectionDAG → Machine IR → MC IR → .o]

7 of 90

Clang compiler today

[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → AST → LLVM IR → SelectionDAG → Machine IR → MC IR → .o]

8 of 90

Clang compiler today

Abstraction gap between C++ and LLVM IR is huge:

  • Don’t forget OpenMP and the many, many other extensions to C
  • ABI lowering and many other things get included in this arrow as well

[Diagram: the same pipeline; the AST → LLVM IR arrow hides a forest of complexity]

9 of 90

Clang compiler today

Abstraction gap between C++ and LLVM IR is huge:

  • Don’t forget OpenMP and the many, many other extensions to C
  • ABI lowering and many other things get included in this arrow as well

Clang also needs “high level” dataflow analysis:

  • Canonicalize away syntactic redundancies, preserve high level semantics
  • Duplication with lowering path is … very bad

[Diagram: the same pipeline, plus a separate path from the AST through the Clang “CFG” to the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ...]

10 of 90

Better solution: Clang should have a CIR!

Integrate dataflow diagnostics with compilation flow:

  • Unified flow reduces diagnostic-only bugs and incompleteness

Progressive lowering solves many problems:

  • Simplifies IRGen by reducing responsibilities
  • Host new optimizations for std::vector, std::shared_ptr, std::string, …

[Diagram: AST → CIR → LLVM IR → SelectionDAG → Machine IR → MC IR → .o; the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ... hang off CIR]

11 of 90

Where does this end?

We need an IR for each abstraction level that needs analysis and transformations!

How can we afford this? How many inliners, PassManagers, InstCombiners, parsers/printers... do we end up with??

🤯

12 of 90

It gets worse: this is not Clang-specific!

Let’s zoom out...

13 of 90

Modern languages pervasively invest in high level IRs

  • Language specific optimizations
  • Dataflow driven type checking - e.g. definitive initialization, borrow checker
  • Progressive lowering from high level abstractions

[Diagram: language-specific IRs feeding LLVM IR: Swift AST → SIL IR, Rust AST → MIR IR, Julia AST → ..., Clang AST → CIR IR, Flang AST → FIR IR, and more]

14 of 90

TensorFlow is basically a huge compiler ecosystem

These boxes are all different domain-specific compiler systems:

  • Different limitations, challenges, owners, etc
  • No unifying theory and infrastructure to support this

[Diagram: TensorFlow Graph fanning out to many subsystems: Grappler, TensorFlow Lite, NNAPI, Core ML, TensorRT, nGraph, TPU IR, LLVM IR, and many others]

15 of 90

Recap: Domain- and abstraction-specific compiler IRs

Great!

  • High-level domain-specific optimizations
  • Progressive lowering encourages reuse between levels
  • Flow-sensitive “type checking” in language-specific IRs

Not great!

  • Industry is reimplementing all the same stuff:
    • pass managers, location tracking, use-def chains, inlining, constant folding, CSE, testing tools, ...
    • Testing infra and “quality of life” for compiler engineers is often not invested in
  • Huge expense to build this infrastructure:
    • Expense often warps compiler design, e.g. Clang not having a CIR
  • Innovations in one community don’t benefit the others

16 of 90

MLIR: Multi-Level IR

Also: Mid Level, Moore’s Law, Multidimensional Loop, Machine Learning, Modular Library, ...

LLVM has only one expansion and it is wrong/misleading. Solution: have lots of ambiguous expansions so we can change our mind later :-)

17 of 90

Many similarities to LLVM

func @testFunction(%arg0: i32) -> i32 {
  %x = call @thingToCall(%arg0) : (i32) -> i32
  br ^bb1
^bb1:
  %y = addi %x, %x : i32
  return %y : i32
}

  • SSA register based, explicitly typed
  • Module/Function/Block/Operation structure
  • Round-trippable textual form, FileCheck, etc.

Syntactically similar:

[Diagram: a Module contains Functions; each Function contains Blocks; each Block contains Operations]

18 of 90

Type System - some examples

Scalars:

  • f16, bf16, f32, … i1, i8, i16, i32, … i3, i4, i7, i57, …

Vectors:

  • vector<4 x f32> vector<4x4 x f16> etc.

Tensors, including dynamic shape and rank:

  • tensor<4x4 x f32> tensor<4x?x?x17x? x f32> tensor<* x f32>

Others:

  • functions, memory buffers, … fully extensible!
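
To make these concrete, here is a small illustrative signature (hypothetical function and value names; a sketch in the deck’s notation):

  func @example(%s: f32, %v: vector<4 x f32>,
                %t: tensor<8x? x f32>, %u: tensor<* x f32>) {
    return
  }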

19 of 90

MLIR Operations: an open ecosystem

No fixed / builtin list of globally known operations:

  • No “instruction” vs “target-indep intrinsic” vs “target-dep intrinsic” distinction
    • Why is “add” an instruction but “add with overflow” an intrinsic in LLVM? 😿

Passes are expected to conservatively handle unknown ops:

  • Just like LLVM does with unknown intrinsics

func @testFunction(%arg0: i32) -> i32 {
  %x = "any_unknown_operation_here"(%arg0, %arg0) : (i32, i32) -> i32
  %y = "my_increment"(%x) : (i32) -> i32
  return %y : i32
}

20 of 90

Capabilities of MLIR Operations

Operations always have: opcode and source location info

Operations may have:

  • Arbitrary number of SSA results and operands
  • Attributes: guaranteed constant values
  • Block operands: e.g. for branch instructions
  • Regions: discussed on a later slide
  • Custom printing/parsing - or use the more verbose generic syntax

%2 = dim %1, 1 : tensor<1024x? x f32>
%x = alloc() : memref<1024x64 x f32>
%y = load %x[%a, %b] : memref<1024x64 x f32>

The dimension to extract in the dim op is a guaranteed integer constant, an “attribute”.

21 of 90

Complicated TensorFlow Example

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

22 of 90

Complicated TensorFlow Example: Inputs

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

  • Input SSA values and corresponding type info

23 of 90

Complicated TensorFlow Example: Results

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

  • This op produces five results
  • Each result can be used independently with # syntax
  • No “tuple extracts” get in the way of transformations

24 of 90

Complicated TensorFlow Example: Attributes

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

  • Named attributes
  • “NHWC” is a constant, static entity, not an SSA value
  • Similar to LLVM “immarg”, but much richer vocabulary of constants

25 of 90

Nested Regions

  • Functional control flow, XLA fusion node, closures/lambdas, parallelism abstractions like OpenMP, etc.

%7 = tf.If(%arg0 : tensor<i1>, %arg1 : tensor<2xf32>) -> tensor<2xf32> {
  ... “then” code ...
  return ...
} else {
  ... “else” code ...
  return ...
}

affine.for %arg0 = 0 to %n {
  affine.for %arg1 = 0 to %n {
    %3 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
    %4 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
    %5 = mulf %3, %4 : f64
    affine.store %5, %0[%arg0, %arg1] : memref<3x2xf64>
  }
}

26 of 90

Extensible Operations Allow Multi-Level IR

TensorFlow:

  %x = "tf.Conv2d"(%input, %filter)
         {strides: [1,1,2,1], padding: "SAME", dilations: [2,1,1,1]}
       : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xf32>

XLA HLO:

  %m = "xla.AllToAll"(%z)
         {split_dimension: 1, concat_dimension: 0, split_count: 2}
       : (tensor<300x200x32xf32>) -> tensor<600x100x32xf32>

Polyhedral:

  affine.for %i = 0 to %n {
    %v = affine.load %2[%i, %i] : memref<2x3xf64>
    ...

LLVM IR:

  %a = llvm.load %p : !llvm<"float*">
  %f = llvm.add %a, %b : !llvm.float

Progressive lowering connects these levels, from TensorFlow down to LLVM IR.

Don’t we end up with the JSON of compiler IRs????

27 of 90

MLIR “Dialects”: Families of defined operations

Dialects generally correspond to an abstraction level:

  • LLVM IR, Fortran FIR, Swift SIL, XLA HLO, TensorFlow Graph, ...

Dialects can define:

  • Sets of defined operations
  • Entirely custom type system
  • Customization hooks - constant folding, decoding ...

Operation can define:

  • Invariants on # operands, results, attributes, etc
  • Custom parser, printer, verifier, ...
  • Constant folding, canonicalization patterns, …
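
Because ops from different dialects coexist in one module, IR can mix abstraction levels freely. A hedged sketch reusing ops seen elsewhere in this deck (details approximate):

  func @mixed(%t: tensor<8xi32>, %buf: memref<8xi32>) {
    %a = "tf.Add"(%t, %t) : (tensor<8xi32>, tensor<8xi32>) -> tensor<8xi32>
    affine.for %i = 0 to 8 {
      %v = affine.load %buf[%i] : memref<8xi32>
    }
    return
  }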

28 of 90

MLIR: Infrastructure

29 of 90

A “Batteries Included” compiler infrastructure

Provide a lot of standard functionality in the box:

  • Standard passes like CSE, Inlining, …
  • Lots of dialects that you can mix and match with
  • Testing tools, diagnostics and location info

Creating a new IR is very easy, fast and correct:

  • Encourage declarative approaches over imperative
  • Static checking of correctness and enforcement of invariants

Utilizes LLVM infrastructure (e.g. FileCheck, lit) and data structures (llvm/ADT)

30 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language

def TF_LeakyReluOp

31 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",

32 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"
  • Specify op properties (open ended)
    • e.g. side-effect free, commutative, ...

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",
    [NoSideEffect, SameValueType]>,

33 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"
  • Specify op properties (open ended)
    • e.g. side-effect free, commutative, ...
  • Name input and output operands
    • Named accessors created

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",
    [NoSideEffect, SameValueType]>,
  Results<(outs TF_Tensor:$output)> {
  let arguments = (ins
    TF_FpTensor:$value,
    DefaultValuedAttr<F32Attr, "0.2">:$alpha
  );

34 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"
  • Specify op properties (open ended)
    • e.g. side-effect free, commutative, ...
  • Name input and output operands
    • Named accessors created
  • Document along with the op

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",
    [NoSideEffect, SameValueType]>,
  Results<(outs TF_Tensor:$output)> {
  let arguments = (ins
    TF_FpTensor:$value,
    DefaultValuedAttr<F32Attr, "0.2">:$alpha
  );

  let summary = "Leaky ReLU operator";
  let description = [{
    The Leaky ReLU operation takes a tensor and returns
    a new tensor element-wise as follows:
      LeakyRelu(x) = x       if x >= 0
                   = alpha*x else
  }];

35 of 90

Generated documentation

36 of 90

Generated C++ Code: Verifier Implementation

  • C++ class TF::LeakyReluOp
  • Typed accessors:
    • value() and alpha()
  • IRBuilder constructor
    • builder->create<LeakyReluOp>(loc, …)
  • Verify function
    • Check number of operands, type of operands, compatibility of operands
    • Xforms can assume valid input!

namespace TF {
class LeakyReluOp
    : public Op<LeakyReluOp,
                OpTrait::OneResult,
                OpTrait::HasNoSideEffect,
                OpTrait::SameOperandsAndResultType,
                OpTrait::OneOperand> {
public:
  static StringRef getOperationName() {
    return "tf.LeakyRelu";
  }

  Value *value() { … }
  APFloat alpha() { … }
  static void build(…) { … }

  bool verify() const {
    if (…)
      return emitOpError(
          "requires 32-bit float attribute 'alpha'");
    return false;
  }
  ...
};
} // end namespace

37 of 90

Custom Op Printer and Parser Implementations

Declarative specification of “pretty” textual format for operation:

  • Can also be written manually with 10-40 lines of C++ code

def CallOp : My_Op<"call", ...> {
  let arguments = (ins FlatSymbolRefAttr:$callee, Variadic<AnyType>:$operands);
  let results = (outs Variadic<AnyType>);
  let assemblyFormat = [{
    $callee `(` $operands `)` attr-dict `:` functional-type($operands, results)
  }];
}

The default generic IR dump can be verbose and redundant:

  %x = "my.call"(%arg0) { callee: @return_op: (i32) -> i32 } : (i32) -> i32
  %z = "my.add"(%x, %y) : (i32, i32) -> i32

With the custom assembly format:

  %x = my.call @return_op(%arg0) : (i32) -> i32
  %z = my.add %x, %y : i32

38 of 90

Passes, Walkers, Pattern Matchers

  • Also: module and function passes, utility matching functions, nested loop matchers, ...

struct Vectorize : public FunctionPass<Vectorize> {
  void runOnFunction() override;
};

...
if (matchPattern(getOperand(1), m_Zero()))
  return getOperand(0);
...

...
f->walk([&](Operation *op) {
  process(op);
});
...

39 of 90

Declarative Rewrite Rules (DRR)

Declarative, reduces boilerplate, easy to express for all:

  • Support M-N patterns
  • Support constraints on operations, operands and attributes
  • Support specifying dynamic predicates
    • Similar to "Fast and Flexible Instruction Selection With Constraints", CC18
  • Support hand-written C++ rewriters
    • Always a long tail, don't make the common case hard for the tail!

def : Pat<(TF_SqueezeOp StaticShapeTensor:$arg),
          (TFL_ReshapeOp $arg)>;

def : Pat<(ReshapeOp (ReshapeOp $arg)),
          (ReshapeOp $arg)>;
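
For instance, the second pattern above folds a reshape of a reshape into a single op. Sketched in IR form (the "my.reshape" op and the types are hypothetical):

  %0 = "my.reshape"(%arg0) : (tensor<2x3xf32>) -> tensor<6xf32>
  %1 = "my.reshape"(%0) : (tensor<6xf32>) -> tensor<3x2xf32>

  // after applying the pattern:
  %1 = "my.reshape"(%arg0) : (tensor<2x3xf32>) -> tensor<3x2xf32>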

40 of 90

Pattern Rewrites via State Machine

Large numbers of pattern rewrites take time to search:

  • Build optimized state machine for matching
  • It works on any dialect, of course

Implemented in MLIR of course!

41 of 90

DialectConversion: Consistent Lowering Framework

A lowering is defined by three things:

  1. Conversion Target: Operations that are considered ‘legal’ in the result
  2. Rewrite Patterns: the lowerings to apply
  3. Type Converter: When lowering across type systems - arguments etc

Supports partial lowering vs complete lowering

Supports transitive lowering: A->B->C

  • Enables reuse of significant lowering infra
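
As a sketch of what such a lowering does, here is a standard-dialect float add converted into the LLVM dialect (syntax approximate; it has changed across MLIR versions):

  %sum = addf %a, %b : f32

  // after the standard-to-LLVM dialect conversion:
  %sum = llvm.fadd %a, %b : !llvm.float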

42 of 90

mlir-opt

  • A tool for testing compiler passes - just like LLVM’s opt
  • Every compiler transformation is unit testable:

// RUN: mlir-opt %s -loop-unroll | FileCheck %s
func @loop_nest_simplest() {
  // CHECK: affine.for %i0 = 0 to 100 step 2 {
  affine.for %i = 0 to 100 step 2 {
    // CHECK: %c1_i32 = constant 1 : i32
    // CHECK-NEXT: %c1_i32_0 = constant 1 : i32
    // CHECK-NEXT: %c1_i32_1 = constant 1 : i32
    affine.for %j = 0 to 3 {
      %x = constant 1 : i32
    }
  }
  return
}

43 of 90

Integrated Source Location Tracking

API requires location information on each operation:

  • File/line/column, op fusion, op fission
  • “Unknown” is allowed, but discouraged and must be explicit

$ cat test/Transforms/memref-dependence-check.mlir
// Actual test is much longer...
func @test() {
  %0 = alloc() : memref<100xf32>
  affine.for %i0 = 0 to 10 {
    %1 = load %0[%i0] : memref<100xf32>
    store %1, %0[%i0] : memref<100xf32>
  }
  return
}

Easy for passes to emit structured diagnostics:

$ mlir-opt -memref-dependence-check memref-dependence-check.mlir
m-d-c.mlir:5:10: note: dependence from 0 to 0 at depth 1 = false
  %1 = load %0[%i0] : memref<100xf32>
       ^
m-d-c.mlir:6:5: note: dependence from 1 to 0 at depth 1 = false
  store %1, %0[%i0] : memref<100xf32>
  ^

44 of 90

Location Tracking: Great for Testing!

Test suite uses -verify mode just like Clang/Swift diagnostic test:

  • Test analysis passes directly, by writing a client that emits diagnostics!

// RUN: mlir-opt %s -memref-dependence-check -verify
func @test() {
  %0 = alloc() : memref<100xf32>
  affine.for %i0 = 0 to 10 {
    // expected-note @+1 {{dependence from 0 to 1 at depth 2 = true}}
    %1 = load %0[%i0] : memref<100xf32>
    store %1, %0[%i0] : memref<100xf32>
  }
  return
}

45 of 90

LLVM IR is a Dialect in MLIR

  • LLVM IR is great at the “C with Vectors” abstraction level
  • Tremendous investment from a large community
  • Code generation for a wide range of architectures

Code lowered to the LLVM dialect in MLIR:

...
^bb2:  // pred: ^bb1
  %9 = llvm.constant(10) : !llvm.i64
  %11 = llvm.mul %2, %9 : !llvm.i64
  %12 = llvm.add %11, %6 : !llvm.i64
  %13 = llvm.extractvalue %arg2[0] : !llvm<"{ float* }">
  %14 = llvm.getelementptr %13[%12] :
          (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
  llvm.store %8, %14 : !llvm<"float*">
...

46 of 90

Reuse standard passes and other dialects

Dialect-independent passes: e.g. CSE, inlining, canonicalization

Many dialects available, with dialect-specific passes:

  • “Standard” dialect for LLVM-like scalar, vector, and memory abstractions
  • Affine dialect for polyhedral transforms - loop and memory hierarchy, etc.
  • GPU dialect, vector dialect, many others
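
For example, dialect-independent and dialect-specific passes compose in a single pipeline; a hedged test invocation in the deck’s own style (pass flags illustrative and version-dependent):

  // RUN: mlir-opt %s -canonicalize -cse -loop-unroll | FileCheck %s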

47 of 90

Example application:

Building the TensorFlow Backend Bridge

48 of 90

TensorFlow Compiler ecosystem

Many complex subsystems

  • Each with its own abstraction and representation
  • TF Backend Bridge = interop between TF and another backend

[Diagram: the TensorFlow ecosystem again (Grappler, TensorFlow Lite, NNAPI, Core ML, TensorRT, nGraph, TPU IR, LLVM IR, ...), with the TF/XLA bridge connecting TensorFlow Graph to the backends]

49 of 90

TensorFlow Bridge with MLIR

All semantic transformations are done in MLIR

  • Verifiable operations
  • Small unit tests using FileCheck

[Diagram: TensorFlow Graph → Import → TF Graph.mlir → Convert → HLO.mlir → Export → XLA HLO; Import/Export are representation changes, Convert is an abstraction change]

50 of 90

TensorFlow Computational Graph Dialect

Compact textual representation isomorphic to computational graph

MLIR:

  func @f(%arg0 : tensor<8xi32>,
          %arg1 : tensor<8xi32>,
          %arg2 : tensor<8xi32>) -> tensor<8xi32> {
    %a = tf.Add(%arg0, %arg1) : ...
    %s = tf.Sub(%arg1, %arg2) : ...
    %m = tf.Mul(%a, %s) : ...
    return %m : tensor<8xi32>
  }

[Diagram: the equivalent graph: Add(Arg0, Arg1) and Sub(Arg1, Arg2) feed Mul, which feeds Ret]

51 of 90

Control Flow and Concurrency

  • Separate TensorFlow executor dialect for more complex concepts
    • Data flow graph with side-effects, unstructured control flow
  • Composes with TensorFlow computational graph dialect

tf_executor.graph () {
  %0:2 = tf_executor.island wraps "tf.Const"(...) : () -> tensor<i32>
  %1:2 = tf_executor.island wraps "tf.Const"(...) : () -> tensor<i1>
  %2:3 = tf_executor.Switch %0#0, %1#0 :
           (...) -> (tensor<i32>, tensor<i32>, !tf_executor.control) {...}
}

52 of 90

Let’s Build the Bridge

Step 1. Pipeline of graph transformation and optimization passes

  • Can reuse many standard passes

  PassManager bridge(module.getContext());
  ...
  bridge.addPass(createInlinerPass());
  bridge.addPass(createTFShapeInfPass());

Step 2. Operation rewrite rules

  def : Pat<(TF_ConjOp $v),
            (HLO_ComplexOp (HLO_RealOp $v), (HLO_NegOp (HLO_ImagOp $v)))>;

53 of 90

Multi-level Operation Rewrite Interface

A new backend can reuse existing rules, create new ones, or use a combination of both

[Diagram: tf.Einsum over an Input can be rewritten through tf.MatMul / tf.Add / tf.Reshape down to hlo.matmul, or matched directly into mychip.Einsum / mychip.matmul]
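
Sketching the diagram in IR form (signatures hypothetical), the same op can be matched at either level:

  %0 = "tf.MatMul"(%a, %b) : (tensor<4x8xf32>, tensor<8x4xf32>) -> tensor<4x4xf32>

  // reusing the generic rules lowers it to:
  %0 = "hlo.matmul"(%a, %b) : (tensor<4x8xf32>, tensor<8x4xf32>) -> tensor<4x4xf32>

  // or a backend-specific rule targets the chip directly:
  %0 = "mychip.matmul"(%a, %b) : (tensor<4x8xf32>, tensor<8x4xf32>) -> tensor<4x4xf32>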

54 of 90

Developer Benefits

  • Writing an optimization pass takes days instead of weeks
  • Fewer bugs and less time spent debugging
  • Less custom code to write
  • Flexible pass ordering

55 of 90

More example MLIR users

(not an exhaustive list!)

56 of 90

A Compiler Intermediate Representation for Stencils
Jean-Michel Gorius, Tobias Wicky, Tobias Grosser, and Tobias Gysi

57 of 90

A Compiler Intermediate Representation for Stencils
Jean-Michel Gorius, Tobias Wicky, Tobias Grosser, and Tobias Gysi

58 of 90

59 of 90

60 of 90

Compiling for Xilinx AI Engine using MLIR
Samuel Bayliss, Xilinx, C4ML 2020

61 of 90

Compiling for Xilinx AI Engine using MLIR
Samuel Bayliss, Xilinx, C4ML 2020

62 of 90

Utilizing MLIR in Clang

Disclaimer, very speculative: the Clang community hasn’t seriously discussed this or reached consensus

63 of 90

Clang IR Generation has poor separation of concerns

Abstraction gap between C++ and LLVM IR is huge:

  • C++ as a language has many concepts!
  • OpenMP, OpenCL, Cuda, and the many other extensions to C
  • ABI lowering
  • Duplication between IR generation and diagnostics path

[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → AST → LLVM IR, with the Clang “CFG” feeding the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ...]

How do we make incremental progress?

64 of 90

Recommendation: Move diagnostics first

Clang “CFG” path is (relatively) unloved:

  • Missing support for many language features
  • Ad-hoc hybrid CFG/AST representation
  • Existing IR doesn’t round trip - difficult to test

Relatively few clients

[Diagram: the same pipeline; the Clang “CFG” path to the Clang Static Analyzer and the dataflow warnings is highlighted]

65 of 90

Build a new “MLIR CIR” in tree

Build it up next to the existing path, in master:

  • Iterate on design of CIR, starting with already-supported AST nodes

[Diagram: the pipeline grows a new MLIR CIR box alongside the Clang “CFG” path]

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: ...
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}

66 of 90

Build a new “MLIR CIR” in tree

Build it up next to the existing path, in master:

  • Iterate on design of CIR, starting with already-supported AST nodes

Reimplement flow-sensitive dataflow diagnostics

  • Likely to be simpler than existing implementations with better abstractions

[Diagram: -Wunreachable-code, -Wuninitialized, ... now reimplemented on top of MLIR CIR, alongside the existing Clang “CFG” path]

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: ...
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}

67 of 90

Cut over the static compiler to use MLIR CIR

Enable by default as soon as the flow-sensitive diagnostics are superior:

  • Then delete the old implementations of these warnings

[Diagram: the dataflow warnings now come from MLIR CIR; the Clang “CFG” still feeds the Clang Static Analyzer]

68 of 90

Cut over the static compiler to use MLIR CIR

Enable by default as soon as the flow-sensitive diagnostics are superior:

  • Then delete the old implementations of these warnings

[Diagram: the same pipeline; the Clang “CFG” and Clang Static Analyzer sit off to the side while MLIR CIR drives the dataflow warnings]

Clang “CFG” becomes an implementation detail of the CSA & analysis tools

  • “CFG” is no longer used by the normal compiler
  • Move the CSA whenever someone has time to port it over: not on the critical path!


69 of 90

Add Lowering from CIR to LLVM IR, finish CIR coverage

Reuse most of the existing IR generation logic and helpers:

  • Just walk CIR instead of AST
  • Feature gated under an ‘experimental’ flag

[Diagram: AST → MLIR CIR → LLVM IR; CIR also drives -Wunreachable-code, -Wuninitialized, ...]

70 of 90

Add Lowering from CIR to LLVM IR, finish CIR coverage

Reuse most of the existing IR generation logic and helpers:

  • Just walk CIR instead of AST
  • Feature gated under an ‘experimental’ flag

Iterate on this until it supports all language features and CIR is starting to settle

  • Each improvement will also improve diagnostics!
  • Turn it on by default and cut over when it is good enough

[Diagram: AST → MLIR CIR → LLVM IR, as on the previous slide]

71 of 90

A unified path enables the fun part!

Start detangling one subsystem at a time:

  • OpenMP, make ABI lowering Clang independent, …
  • Each of these can be implemented incrementally, and tested as you go

Implement library-specific optimizations:

  • Insert ‘reserve’ calls for std::vector, constant fold std::string, …
  • Refcount optimizations for std::shared_ptr, ...

[Diagram: AST → MLIR CIR → LLVM IR; the forest of complexity is thinning out]

72 of 90

Strawman example CIR lowering (sketch)

std::vector<int> foo() {
  std::vector<int> vec;
  // Insert: vec.reserve(100);
  for (int i = 0; i < 100; ++i)
    vec.push_back(i);
  return vec;
}

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: ...
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}

73 of 90

Many new approaches to explore...

Maintain structured loops and control flow for OpenCL and CUDA

Preserve better alias analysis information

New generation of source tooling based on hybrid dataflow + source location info

Enable higher level domain specific optimizations:

  • Optimize capture lists for C++ lambdas
  • C++ coroutines
  • Dataflow optimizations for OpenMP

74 of 90

OpenMP & Other Parallelism Dialects

OpenMP is mostly orthogonal to the host language:

  • Common runtime + semantics
  • Rich model, many optimizations are possible

Model OpenMP as a dialect in MLIR:

  • Share across Clang and Fortran
  • Region abstraction makes optimizations easy
    • Simple SSA intra-procedural optimizations

Lower to the LLVM IR dialect as usual

Source:

  int j = 4+5;
  #pragma omp parallel for
  for (i=0; i<N; i++) {
    stuff(i, j)
  }

CIR with the OpenMP dialect:

  %c4 = cir.constant 4 : !cir.int
  %c5 = cir.constant 5 : !cir.int
  %j = cir.add %c4, %c5 : !cir.int

  omp.parallel.for (...) {
  ^bb0(%i : !cir.int):
    cir.call @stuff(%i, %j)
  }

After SSA ConstProp:

  omp.parallel.for (...) {
  ^bb0(%i : !cir.int):
    %c9 = cir.constant 9 : !cir.int
    cir.call @stuff(%i, %c9)
  }

75 of 90

Utilizing MLIR for LLVM IR

Disclaimer, very speculative: the LLVM community hasn’t seriously discussed this or reached consensus

76 of 90

Why not use MLIR for LLVM IR?

Observation: MLIR already has an LLVM IR dialect used for codegen

  • Directly isomorphic to the LLVM IR abstraction level

...
^bb2:  // pred: ^bb1
  %9 = llvm.constant(10) : !llvm.i64
  %11 = llvm.mul %2, %9 : !llvm.i64
  %12 = llvm.add %11, %6 : !llvm.i64
  %13 = llvm.extractvalue %arg2[0] : !llvm<"{ float* }">
  %14 = llvm.getelementptr %13[%12] :
          (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
  llvm.store %8, %14 : !llvm<"float*">
...

Port the existing LLVM IR passes to work on LLVM dialect!

  • Drop-in replacement: retains exactly the same code quality, features, etc.

77 of 90

Why use MLIR for LLVM IR?

MLIR’s impl is just better than the LLVM IR data structures!

And yes, I 💖 LLVM!

78 of 90

Implicitly Multithreaded PassManager

Multicore isn’t “the future” anymore:

  • LLVM PassManager aimed to be multithreaded from the beginning
  • Representational issues with use/def chains prevented this from happening

MLIR PassManager runs passes on “isolated from above” regions in parallel

  • … including functions!

This can provide an easy 4-100x compile time speedup!
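
Concretely, each func below is “isolated from above” (it cannot reference SSA values defined outside itself), so the pass manager can process the two functions on different threads (sketch; names hypothetical):

  module {
    func @worker_a(%arg0: i32) -> i32 { ... }   // optimized on one thread
    func @worker_b(%arg0: i32) -> i32 { ... }   // optimized concurrently on another
  }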

79 of 90

MLIR BB Arguments >>> LLVM PHINode

llvm::PHINode design has challenges:

  • Must be kept at the top of the block - code to skip over them
  • Dominance: PHIs read their input values in predecessor blocks, not the current block
  • Scalability problems for blocks with high input degree - e.g. C++ EH blocks
  • Confusion about atomic assignment of multiple PHI nodes

LLVM:

  br label %loop
  loop:
    %x = phi i32 [ %in1, %entry ], [ %y, %loop ]
    %y = phi i32 [ %in2, %entry ], [ %x, %loop ]
    use(%x, %y)
    br i1 %cond, label %out, label %loop
  out:
    ...

MLIR defines these away by using functional-style SSA form:

  br ^loop(%in1 : i32, %in2 : i32)
  ^loop(%x: i32, %y: i32):
    use(%x, %y)
    cond_br %cond, ^out, ^loop(%y : i32, %x : i32)
  ^out:
    ...

80 of 90

PHINode Predecessors must provide same value

if (cond) {
  result = x
} else {
  result = y
}
...

LLVM IR:

  ...
  br i1 %cond,
    label %other, label %merge
  other:
    br label %merge
  merge:
    %result = phi [ %y, %other ],
                  [ %x, %start ]
    ...

MLIR:

  llvm.cond_br %cond,
    ^merge(%x : !llvm.i32), ^merge(%y : !llvm.i32)
  ^merge(%arg: !llvm.i32):
    ...

81 of 90

LLVM invoke dominance issues

  • Invoke (and catchswitch, ...) results only dominate their “normal” block!
  • No direct ability to model value live-in to the error block
  • Fragile requirements around block structure / landingpad placement

LLVM IR:

  %result = invoke @foo(...)
              to label %normal unwind label %except
  normal:
    use(%result)
    ...
  except:
    %errval = landingpad ...
    ...

MLIR:

  ...
  llvm.invoke @foo(...)
      to ^normal unwind ^except
  ^normal(%result):
    ...
  ^except(%errval):
    ...

82 of 90

Better Location Tracking

LLVM metadata design is wrong-in-retrospect for debug information:

  • Passes easily drop or corrupt location tracking, harming debugging experience

MLIR design strongly encourages pass authors to think about this by default:

  • Leads to much better user experience in practice
  • Lots of tooling and testing benefits, described earlier
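
In MLIR, the location is an explicit part of every operation and round-trips through the textual form. A sketch (hypothetical op name):

  %0 = "foo.op"() : () -> i32 loc("example.cpp":10:8)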

83 of 90

Better Infrastructure

  • InstCombine should use MLIR declarative pattern rewriting system
  • Auto-generate LangRef.html from the dialect description
  • Native support for multiple return values in functions/operations
  • Better structured representation for attributes/metadata
  • More sensible constant model - eliminate llvm::Constant::canTrap()!
  • ...

84 of 90

Why use MLIR for LLVM IR?

MLIR provides new opportunities!

We can make core LLVM even better!

85 of 90

Can now introduce higher level abstractions!

Parallelism abstractions:

  • Fork/join parallelism
  • Concurrent for loops

Loop transformations:

  • LLVM infers LoopInfo as an on-the-side analysis, rather than as a core IR construct
  • Polly has a different IR (using ISL), and reencodes some LLVM IR
  • In MLIR, you just add loops!
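
A hedged sketch of what a first-class concurrent loop can look like, loosely modeled on MLIR’s loop dialect (syntax approximate and version-dependent):

  loop.parallel (%i) = (%c0) to (%n) step (%c1) {
    // iterations may execute concurrently
    ...
  }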

...

86 of 90

Ok, but how? --> Incrementally

Upgrade LLVM in place, eliminate differences between the two:

  • Const model, “llvm::Value *x” -> “mlir::Value x”, PHINode -> BBArg?
  • Each change is low-risk, and makes the existing codebase better

Implement compatibility wrappers:

  • llvm::IRBuilder2 that creates MLIR instead of LLVM
  • Provide exactly the same API, reducing churn

Dissolve away the wrappers over time

This will be a bit complex to phase in

⇒ Get experience with Clang first

87 of 90

Learning More

88 of 90

MLIR is part of the LLVM Project!

Code available in the LLVM GitHub monorepo

Find lots of content on mlir.llvm.org

Discussions on the MLIR Discourse Forum

89 of 90

MLIR within TensorFlow OSS

MLIR powers several TensorFlow subsystems

  • Graph transformations
  • TensorFlow Lite integration
  • Code generation for accelerators

TensorFlow team hosts the MLIR Open Design Meeting:

  • A “weeklyish” video chat with topics from many people in the community
  • Notes, videos and slides available in the public Google doc

90 of 90

“That which you can represent, you can transform!”

Questions?

Tatiana Shpeisman

shpeisman@google.com

Chris Lattner

clattner@sifive.com