1 of 90

MLIR: Multi-Level Intermediate Representation Compiler Infrastructure

Presenting the work of many, many people!

CGO 2020: International Symposium on Code Generation and Optimization

Tatiana Shpeisman
shpeisman@google.com

Chris Lattner
clattner@sifive.com

2 of 90

Overview

  • Context
  • About MLIR
  • A few users of MLIR
  • Provocative (?) proposal for Clang and LLVM
  • Conclusion

3 of 90

What is wrong with existing compilers?

4 of 90

LLVM circa CGO 2004

TL;DR: One true IR (operations and type system) solves all of:

  • Multi-targeting, Analysis, Optimization, Distribution

All the input and output details are tiny arrows in the diagram above

“LLVM achieves this through ... a code representation with several novel features that serves as a common representation for analysis, transformation, and code distribution”

5 of 90

LLVM compiler today

LLVM IR is the centerpoint of mid-level and interprocedural optimizations:

  • Proven as a very useful “C with Vectors” level of abstraction


6 of 90

LLVM compiler today

LLVM IR is the centerpoint of mid-level and interprocedural optimizations:

  • Proven as a very useful “C with Vectors” level of abstraction

LLVM IR is not enough!

  • Multiple other abstraction levels introduced over time
  • Poor representation for higher-level abstractions: parallelism, loops, etc.

Also various design mistakes that still persist

[Diagram: LLVM IR → SelectionDAG → Machine IR → MC IR → .o]

7 of 90

Clang compiler today

[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → AST → LLVM IR → SelectionDAG → Machine IR → MC IR → .o]

8 of 90

Clang compiler today

Abstraction gap between C++ and LLVM IR is huge:

  • Don’t forget OpenMP and the many, many other extensions to C
  • ABI lowering and many other things get included in this arrow as well

[Diagram: the same pipeline; the AST → LLVM IR arrow hides a forest of complexity]

9 of 90

Clang compiler today

Abstraction gap between C++ and LLVM IR is huge:

  • Don’t forget OpenMP and the many, many other extensions to C
  • ABI lowering and many other things get included in this arrow as well

Clang also needs “high level” dataflow analysis:

  • Canonicalize away syntactic redundancies, preserve high level semantics
  • Duplication with lowering path is … very bad

[Diagram: the same pipeline, plus a separate path from the AST through the Clang “CFG” to the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ...]

10 of 90

Better solution: Clang should have a CIR!

Integrate dataflow diagnostics with compilation flow:

  • Unified flow reduces diagnostic-only bugs and incompleteness

Progressive lowering solves many problems:

  • Simplifies IRGen by reducing responsibilities
  • Host new optimizations for std::vector, std::shared_ptr, std::string, …

[Diagram: AST → CIR → LLVM IR → SelectionDAG → Machine IR → MC IR → .o; the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ... hang off CIR]

11 of 90

Where does this end?

We need an IR for each abstraction level that needs analysis and transformations!

How can we afford this? How many inliners, PassManagers, InstCombiners, parsers/printers... do we end up with??

🤯

12 of 90

It gets worse: this is not Clang-specific!

Let’s zoom out...

13 of 90

Modern languages pervasively invest in high level IRs

  • Language specific optimizations
  • Dataflow driven type checking - e.g. definitive initialization, borrow checker
  • Progressive lowering from high level abstractions

[Diagram: language-specific IRs feeding LLVM IR: Swift AST → SIL IR, Rust AST → MIR IR, Julia AST → ..., Clang AST → CIR IR, Flang AST → FIR IR, and more]

14 of 90

TensorFlow is basically a huge compiler ecosystem

These boxes are all different domain-specific compiler systems:

  • Different limitations, challenges, owners, etc
  • No unifying theory and infrastructure to support this

[Diagram: TensorFlow Graph fanning out to many subsystems: Grappler, TensorFlow Lite, NNAPI, Core ML, TensorRT, nGraph, TPU IR, LLVM IR, and many others]

15 of 90

Recap: Domain- and abstraction-specific compiler IRs

Great!

  • High-level domain-specific optimizations
  • Progressive lowering encourages reuse between levels
  • Flow-sensitive “type checking” in language-specific IRs

Not great!

  • Industry is reimplementing all the same stuff:
    • pass managers, location tracking, use-def chains, inlining, constant folding, CSE, testing tools, ...
    • Testing infra and “quality of life” for compiler engineers is often not invested in
  • Huge expense to build this infrastructure:
    • Expense often warps compiler design, e.g. Clang not having a CIR
  • Innovations in one community don’t benefit the others

16 of 90

MLIR: Multi-Level IR

Also: Mid Level, Moore’s Law, Multidimensional Loop, Machine Learning, Modular Library, ...

LLVM has only one expansion and it is wrong/misleading. Solution: have lots of ambiguous expansions so we can change our mind later :-)

17 of 90

Many similarities to LLVM

func @testFunction(%arg0: i32) -> i32 {
  %x = call @thingToCall(%arg0) : (i32) -> i32
  br ^bb1
^bb1:
  %y = addi %x, %x : i32
  return %y : i32
}

  • SSA register based, explicitly typed
  • Module/Function/Block/Operation structure
  • Round-trippable textual form, FileCheck, etc.

Syntactically similar:

[Diagram: a Module contains Functions; each Function contains Blocks; each Block contains Operations]

18 of 90

Type System - some examples

Scalars:

  • f16, bf16, f32, … i1, i8, i16, i32, … i3, i4, i7, i57, …

Vectors:

  • vector<4 x f32> vector<4x4 x f16> etc.

Tensors, including dynamic shape and rank:

  • tensor<4x4 x f32> tensor<4x?x?x17x? x f32> tensor<* x f32>

Others:

  • functions, memory buffers, … fully extensible!
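
To make these concrete, here is a small illustrative signature (hypothetical function and value names; a sketch in the deck’s notation):

  func @example(%s: f32, %v: vector<4 x f32>,
                %t: tensor<8x? x f32>, %u: tensor<* x f32>) {
    return
  }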

19 of 90

MLIR Operations: an open ecosystem

No fixed / builtin list of globally known operations:

  • No “instruction” vs “target-indep intrinsic” vs “target-dep intrinsic” distinction
    • Why is “add” an instruction but “add with overflow” an intrinsic in LLVM? 😿

Passes are expected to conservatively handle unknown ops:

  • Just like LLVM does with unknown intrinsics

func @testFunction(%arg0: i32) -> i32 {
  %x = "any_unknown_operation_here"(%arg0, %arg0) : (i32, i32) -> i32
  %y = "my_increment"(%x) : (i32) -> i32
  return %y : i32
}

20 of 90

Capabilities of MLIR Operations

Operations always have: opcode and source location info

Operations may have:

  • Arbitrary number of SSA results and operands
  • Attributes: guaranteed constant values
  • Block operands: e.g. for branch instructions
  • Regions: discussed on a later slide
  • Custom printing/parsing - or use the more verbose generic syntax

%2 = dim %1, 1 : tensor<1024x? x f32>
%x = alloc() : memref<1024x64 x f32>
%y = load %x[%a, %b] : memref<1024x64 x f32>

The dimension to extract in the dim op is a guaranteed integer constant, an “attribute”.

21 of 90

Complicated TensorFlow Example

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

22 of 90

Complicated TensorFlow Example: Inputs

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

  • Input SSA values and corresponding type info

23 of 90

Complicated TensorFlow Example: Results

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

  • This op produces five results
  • Each result can be used independently with # syntax
  • No “tuple extracts” get in the way of transformations

24 of 90

Complicated TensorFlow Example: Attributes

func @foo(%arg0: tensor<8x?x?x8xf32>, %arg1: tensor<8xf32>,
          %arg2: tensor<8xf32>, %arg3: tensor<8xf32>, %arg4: tensor<8xf32>) {
  %0:5 = "tf.FusedBatchNorm"(%arg0, %arg1, %arg2, %arg3, %arg4)
           {data_format: "NHWC", epsilon: 0.001, is_training: false}
         : (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
        -> (tensor<8x?x?x8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>, tensor<8xf32>)
  "use"(%0#2, %0#4 ...

  • Named attributes
  • “NHWC” is a constant, static entity, not an SSA value
  • Similar to LLVM “immarg”, but much richer vocabulary of constants

25 of 90

Nested Regions

  • Functional control flow, XLA fusion node, closures/lambdas, parallelism abstractions like OpenMP, etc.

%7 = tf.If(%arg0 : tensor<i1>, %arg1 : tensor<2xf32>) -> tensor<2xf32> {
  ... “then” code ...
  return ...
} else {
  ... “else” code ...
  return ...
}

affine.for %arg0 = 0 to %n {
  affine.for %arg1 = 0 to %n {
    %3 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
    %4 = affine.load %1[%arg0, %arg1] : memref<3x2xf64>
    %5 = mulf %3, %4 : f64
    affine.store %5, %0[%arg0, %arg1] : memref<3x2xf64>
  }
}

26 of 90

Extensible Operations Allow Multi-Level IR

TensorFlow:

  %x = "tf.Conv2d"(%input, %filter)
         {strides: [1,1,2,1], padding: "SAME", dilations: [2,1,1,1]}
       : (tensor<*xf32>, tensor<*xf32>) -> tensor<*xf32>

XLA HLO:

  %m = "xla.AllToAll"(%z)
         {split_dimension: 1, concat_dimension: 0, split_count: 2}
       : (tensor<300x200x32xf32>) -> tensor<600x100x32xf32>

Polyhedral:

  affine.for %i = 0 to %n {
    %v = affine.load %2[%i, %i] : memref<2x3xf64>
    ...

LLVM IR:

  %a = llvm.load %p : !llvm<"float*">
  %f = llvm.add %a, %b : !llvm.float

Progressive lowering connects these levels, from TensorFlow down to LLVM IR.

Don’t we end up with the JSON of compiler IRs????

27 of 90

MLIR “Dialects”: Families of defined operations

Dialects generally correspond to an abstraction level:

  • LLVM IR, Fortran FIR, Swift SIL, XLA HLO, TensorFlow Graph, ...

Dialects can define:

  • Sets of defined operations
  • Entirely custom type system
  • Customization hooks - constant folding, decoding ...

Operation can define:

  • Invariants on # operands, results, attributes, etc
  • Custom parser, printer, verifier, ...
  • Constant folding, canonicalization patterns, …
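
Because ops from different dialects coexist in one module, IR can mix abstraction levels freely. A hedged sketch reusing ops seen elsewhere in this deck (details approximate):

  func @mixed(%t: tensor<8xi32>, %buf: memref<8xi32>) {
    %a = "tf.Add"(%t, %t) : (tensor<8xi32>, tensor<8xi32>) -> tensor<8xi32>
    affine.for %i = 0 to 8 {
      %v = affine.load %buf[%i] : memref<8xi32>
    }
    return
  }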

28 of 90

MLIR: Infrastructure

29 of 90

A “Batteries Included” compiler infrastructure

Provide a lot of standard functionality in the box:

  • Standard passes like CSE, Inlining, …
  • Lots of dialects that you can mix and match with
  • Testing tools, diagnostics and location info

Creating a new IR is very easy, fast and correct:

  • Encourage declarative approaches over imperative
  • Static checking of correctness and enforcement of invariants

Utilizes LLVM infrastructure (e.g. FileCheck, lit) and data structures (llvm/ADT)

30 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language

def TF_LeakyReluOp

31 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",

32 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"
  • Specify op properties (open ended)
    • e.g. side-effect free, commutative, ...

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",
    [NoSideEffect, SameValueType]>,

33 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"
  • Specify op properties (open ended)
    • e.g. side-effect free, commutative, ...
  • Name input and output operands
    • Named accessors created

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",
    [NoSideEffect, SameValueType]>,
  Results<(outs TF_Tensor:$output)> {
  let arguments = (ins
    TF_FpTensor:$value,
    DefaultValuedAttr<F32Attr, "0.2">:$alpha
  );

34 of 90

Declarative Op definitions: TensorFlow LeakyRelu

  • Specified using TableGen
    • LLVM Data modelling language
  • Dialect can create own hierarchies
    • "tf.LeakyRelu" is a "TensorFlow unary op"
  • Specify op properties (open ended)
    • e.g. side-effect free, commutative, ...
  • Name input and output operands
    • Named accessors created
  • Document along with the op

def TF_LeakyReluOp : TF_UnaryOp<"LeakyRelu",
    [NoSideEffect, SameValueType]>,
  Results<(outs TF_Tensor:$output)> {
  let arguments = (ins
    TF_FpTensor:$value,
    DefaultValuedAttr<F32Attr, "0.2">:$alpha
  );

  let summary = "Leaky ReLU operator";
  let description = [{
    The Leaky ReLU operation takes a tensor and returns
    a new tensor element-wise as follows:
      LeakyRelu(x) = x       if x >= 0
                   = alpha*x else
  }];

35 of 90

Generated documentation

36 of 90

Generated C++ Code: Verifier Implementation

  • C++ class TF::LeakyReluOp
  • Typed accessors:
    • value() and alpha()
  • IRBuilder constructor
    • builder->create<LeakyReluOp>(loc, …)
  • Verify function
    • Check number of operands, type of operands, compatibility of operands
    • Xforms can assume valid input!

namespace TF {
class LeakyReluOp
    : public Op<LeakyReluOp,
                OpTrait::OneResult,
                OpTrait::HasNoSideEffect,
                OpTrait::SameOperandsAndResultType,
                OpTrait::OneOperand> {
public:
  static StringRef getOperationName() {
    return "tf.LeakyRelu";
  }

  Value *value() { … }
  APFloat alpha() { … }
  static void build(…) { … }

  bool verify() const {
    if (…)
      return emitOpError(
          "requires 32-bit float attribute 'alpha'");
    return false;
  }
  ...
};
} // end namespace

37 of 90

Custom Op Printer and Parser Implementations

Declarative specification of “pretty” textual format for operation:

  • Can also be written manually with 10-40 lines of C++ code

def CallOp : My_Op<"call", ...> {
  let arguments = (ins FlatSymbolRefAttr:$callee, Variadic<AnyType>:$operands);
  let results = (outs Variadic<AnyType>);
  let assemblyFormat = [{
    $callee `(` $operands `)` attr-dict `:` functional-type($operands, results)
  }];
}

The default generic IR dump can be verbose and redundant:

  %x = "my.call"(%arg0) { callee: @return_op: (i32) -> i32 } : (i32) -> i32
  %z = "my.add"(%x, %y) : (i32, i32) -> i32

With the custom assembly format:

  %x = my.call @return_op(%arg0) : (i32) -> i32
  %z = my.add %x, %y : i32

38 of 90

Passes, Walkers, Pattern Matchers

  • Also: module and function passes, utility matching functions, nested loop matchers, ...

struct Vectorize : public FunctionPass<Vectorize> {
  void runOnFunction() override;
};

...
if (matchPattern(getOperand(1), m_Zero()))
  return getOperand(0);
...

...
f->walk([&](Operation *op) {
  process(op);
});
...

39 of 90

Declarative Rewrite Rules (DRR)

Declarative, reduces boilerplate, easy to express for all:

  • Support M-N patterns
  • Support constraints on operations, operands and attributes
  • Support specifying dynamic predicates
    • Similar to "Fast and Flexible Instruction Selection With Constraints", CC18
  • Support hand-written C++ rewriters
    • Always a long tail, don't make the common case hard for the tail!

def : Pat<(TF_SqueezeOp StaticShapeTensor:$arg),
          (TFL_ReshapeOp $arg)>;

def : Pat<(ReshapeOp (ReshapeOp $arg)),
          (ReshapeOp $arg)>;
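
For instance, the second pattern above folds a reshape of a reshape into a single op. Sketched in IR form (the "my.reshape" op and the types are hypothetical):

  %0 = "my.reshape"(%arg0) : (tensor<2x3xf32>) -> tensor<6xf32>
  %1 = "my.reshape"(%0) : (tensor<6xf32>) -> tensor<3x2xf32>

  // after applying the pattern:
  %1 = "my.reshape"(%arg0) : (tensor<2x3xf32>) -> tensor<3x2xf32>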

40 of 90

Pattern Rewrites via State Machine

Large numbers of pattern rewrites take time to search:

  • Build optimized state machine for matching
  • It works on any dialect, of course

Implemented in MLIR of course!

41 of 90

DialectConversion: Consistent Lowering Framework

A lowering is defined by three things:

  1. Conversion Target: Operations that are considered ‘legal’ in the result
  2. Rewrite Patterns: the lowerings to apply
  3. Type Converter: When lowering across type systems - arguments etc

Supports partial lowering vs complete lowering

Supports transitive lowering: A->B->C

  • Enables reuse of significant lowering infra
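
As a sketch of what such a lowering does, here is a standard-dialect float add converted into the LLVM dialect (syntax approximate; it has changed across MLIR versions):

  %sum = addf %a, %b : f32

  // after the standard-to-LLVM dialect conversion:
  %sum = llvm.fadd %a, %b : !llvm.float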

42 of 90

mlir-opt

  • A tool for testing compiler passes - just like LLVM’s opt
  • Every compiler transformation is unit testable:

// RUN: mlir-opt %s -loop-unroll | FileCheck %s
func @loop_nest_simplest() {
  // CHECK: affine.for %i0 = 0 to 100 step 2 {
  affine.for %i = 0 to 100 step 2 {
    // CHECK: %c1_i32 = constant 1 : i32
    // CHECK-NEXT: %c1_i32_0 = constant 1 : i32
    // CHECK-NEXT: %c1_i32_1 = constant 1 : i32
    affine.for %j = 0 to 3 {
      %x = constant 1 : i32
    }
  }
  return
}

43 of 90

Integrated Source Location Tracking

API requires location information on each operation:

  • File/line/column, op fusion, op fission
  • “Unknown” is allowed, but discouraged and must be explicit

$ cat test/Transforms/memref-dependence-check.mlir
// Actual test is much longer...
func @test() {
  %0 = alloc() : memref<100xf32>
  affine.for %i0 = 0 to 10 {
    %1 = load %0[%i0] : memref<100xf32>
    store %1, %0[%i0] : memref<100xf32>
  }
  return
}

Easy for passes to emit structured diagnostics:

$ mlir-opt -memref-dependence-check memref-dependence-check.mlir
m-d-c.mlir:5:10: note: dependence from 0 to 0 at depth 1 = false
  %1 = load %0[%i0] : memref<100xf32>
       ^
m-d-c.mlir:6:5: note: dependence from 1 to 0 at depth 1 = false
  store %1, %0[%i0] : memref<100xf32>
  ^

44 of 90

Location Tracking: Great for Testing!

Test suite uses -verify mode just like Clang/Swift diagnostic test:

  • Test analysis passes directly, by writing a client that emits diagnostics!

// RUN: mlir-opt %s -memref-dependence-check -verify
func @test() {
  %0 = alloc() : memref<100xf32>
  affine.for %i0 = 0 to 10 {
    // expected-note @+1 {{dependence from 0 to 1 at depth 2 = true}}
    %1 = load %0[%i0] : memref<100xf32>
    store %1, %0[%i0] : memref<100xf32>
  }
  return
}

45 of 90

LLVM IR is a Dialect in MLIR

  • LLVM IR is great at the “C with Vectors” abstraction level
  • Tremendous investment from a large community
  • Code generation for a wide range of architectures

Code lowered to the LLVM dialect in MLIR:

...
^bb2:  // pred: ^bb1
  %9 = llvm.constant(10) : !llvm.i64
  %11 = llvm.mul %2, %9 : !llvm.i64
  %12 = llvm.add %11, %6 : !llvm.i64
  %13 = llvm.extractvalue %arg2[0] : !llvm<"{ float* }">
  %14 = llvm.getelementptr %13[%12] :
          (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
  llvm.store %8, %14 : !llvm<"float*">
...

46 of 90

Reuse standard passes and other dialects

Dialect-independent passes: e.g. CSE, inlining, canonicalization

Many dialects available, with dialect-specific passes:

  • “Standard” dialect for LLVM-like scalar, vector, and memory abstractions
  • Affine dialect for polyhedral transforms - loop and memory hierarchy, etc.
  • GPU dialect, vector dialect, many others
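
For example, dialect-independent and dialect-specific passes compose in a single pipeline; a hedged test invocation in the deck’s own style (pass flags illustrative and version-dependent):

  // RUN: mlir-opt %s -canonicalize -cse -loop-unroll | FileCheck %s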

47 of 90

Example application:

Building the TensorFlow Backend Bridge

48 of 90

TensorFlow Compiler ecosystem

Many complex subsystems

  • Each with its own abstraction and representation
  • TF Backend Bridge = interop between TF and another backend

[Diagram: the TensorFlow ecosystem again (Grappler, TensorFlow Lite, NNAPI, Core ML, TensorRT, nGraph, TPU IR, LLVM IR, ...), with the TF/XLA bridge connecting TensorFlow Graph to the backends]

49 of 90

TensorFlow Bridge with MLIR

All semantic transformations are done in MLIR

  • Verifiable operations
  • Small unit tests using FileCheck

[Diagram: TensorFlow Graph → Import → TF Graph.mlir → Convert → HLO.mlir → Export → XLA HLO; Import/Export are representation changes, Convert is an abstraction change]

50 of 90

TensorFlow Computational Graph Dialect

Compact textual representation isomorphic to computational graph

MLIR:

  func @f(%arg0 : tensor<8xi32>,
          %arg1 : tensor<8xi32>,
          %arg2 : tensor<8xi32>) -> tensor<8xi32> {
    %a = tf.Add(%arg0, %arg1) : ...
    %s = tf.Sub(%arg1, %arg2) : ...
    %m = tf.Mul(%a, %s) : ...
    return %m : tensor<8xi32>
  }

[Diagram: the equivalent graph: Add(Arg0, Arg1) and Sub(Arg1, Arg2) feed Mul, which feeds Ret]

51 of 90

Control Flow and Concurrency

  • Separate TensorFlow executor dialect for more complex concepts
    • Data flow graph with side-effects, unstructured control flow
  • Composes with TensorFlow computational graph dialect

tf_executor.graph () {
  %0:2 = tf_executor.island wraps "tf.Const"(...) : () -> tensor<i32>
  %1:2 = tf_executor.island wraps "tf.Const"(...) : () -> tensor<i1>
  %2:3 = tf_executor.Switch %0#0, %1#0 :
           (...) -> (tensor<i32>, tensor<i32>, !tf_executor.control) {...}
}

52 of 90

Let’s Build the Bridge

Step 1. Pipeline of graph transformation and optimization passes

  • Can reuse many standard passes

  PassManager bridge(module.getContext());
  ...
  bridge.addPass(createInlinerPass());
  bridge.addPass(createTFShapeInfPass());

Step 2. Operation rewrite rules

  def : Pat<(TF_ConjOp $v),
            (HLO_ComplexOp (HLO_RealOp $v), (HLO_NegOp (HLO_ImagOp $v)))>;

53 of 90

Multi-level Operation Rewrite Interface

A new backend can reuse existing rules, create new ones, or use a combination of both

[Diagram: tf.Einsum over an Input can be rewritten through tf.MatMul / tf.Add / tf.Reshape down to hlo.matmul, or matched directly into mychip.Einsum / mychip.matmul]
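
Sketching the diagram in IR form (signatures hypothetical), the same op can be matched at either level:

  %0 = "tf.MatMul"(%a, %b) : (tensor<4x8xf32>, tensor<8x4xf32>) -> tensor<4x4xf32>

  // reusing the generic rules lowers it to:
  %0 = "hlo.matmul"(%a, %b) : (tensor<4x8xf32>, tensor<8x4xf32>) -> tensor<4x4xf32>

  // or a backend-specific rule targets the chip directly:
  %0 = "mychip.matmul"(%a, %b) : (tensor<4x8xf32>, tensor<8x4xf32>) -> tensor<4x4xf32>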

54 of 90

Developer Benefits

  • Writing an optimization pass takes days instead of weeks
  • Fewer bugs and less time spent debugging
  • Less custom code to write
  • Flexible pass ordering

55 of 90

More example MLIR users

(not an exhaustive list!)

56 of 90

A Compiler Intermediate Representation for Stencils
Jean-Michel Gorius, Tobias Wicky, Tobias Grosser, and Tobias Gysi

57 of 90

A Compiler Intermediate Representation for Stencils
Jean-Michel Gorius, Tobias Wicky, Tobias Grosser, and Tobias Gysi

58 of 90

59 of 90

60 of 90

Compiling for Xilinx AI Engine using MLIR
Samuel Bayliss, Xilinx, C4ML 2020

61 of 90

Compiling for Xilinx AI Engine using MLIR
Samuel Bayliss, Xilinx, C4ML 2020

62 of 90

Utilizing MLIR in Clang

Disclaimer, very speculative: the Clang community hasn’t seriously discussed this or reached consensus

63 of 90

Clang IR Generation has poor separation of concerns

Abstraction gap between C++ and LLVM IR is huge:

  • C++ as a language has many concepts!
  • OpenMP, OpenCL, Cuda, and the many other extensions to C
  • ABI lowering
  • Duplication between IR generation and diagnostics path

[Diagram: C, C++, ObjC, CUDA, OpenCL, ... → AST → LLVM IR, with the Clang “CFG” feeding the Clang Static Analyzer and -Wunreachable-code, -Wuninitialized, ...]

How do we make incremental progress?

64 of 90

Recommendation: Move diagnostics first

Clang “CFG” path is (relatively) unloved:

  • Missing support for many language features
  • Ad-hoc hybrid CFG/AST representation
  • Existing IR doesn’t round trip - difficult to test

Relatively few clients

[Diagram: the same pipeline; the Clang “CFG” path to the Clang Static Analyzer and the dataflow warnings is highlighted]

65 of 90

Build a new “MLIR CIR” in tree

Build it up next to the existing path, in master:

  • Iterate on design of CIR, starting with already-supported AST nodes

[Diagram: the pipeline grows a new MLIR CIR box alongside the Clang “CFG” path]

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: ...
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}

66 of 90

Build a new “MLIR CIR” in tree

Build it up next to the existing path, in master:

  • Iterate on design of CIR, starting with already-supported AST nodes

Reimplement flow-sensitive dataflow diagnostics

  • Likely to be simpler than existing implementations with better abstractions

[Diagram: -Wunreachable-code, -Wuninitialized, ... now reimplemented on top of MLIR CIR, alongside the existing Clang “CFG” path]

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: ...
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}

67 of 90

Cut over the static compiler to use MLIR CIR

Enable by default as soon as the flow-sensitive diagnostics are superior:

  • Then delete the old implementations of these warnings

[Diagram: the dataflow warnings now come from MLIR CIR; the Clang “CFG” still feeds the Clang Static Analyzer]

68 of 90

Cut over the static compiler to use MLIR CIR

Enable by default as soon as the flow-sensitive diagnostics are superior:

  • Then delete the old implementations of these warnings

[Diagram: the same pipeline; the Clang “CFG” and Clang Static Analyzer sit off to the side while MLIR CIR drives the dataflow warnings]

Clang “CFG” becomes an implementation detail of the CSA & analysis tools

  • “CFG” is no longer used by the normal compiler
  • Move the CSA whenever someone has time to port it over: not on the critical path!


69 of 90

Add Lowering from CIR to LLVM IR, finish CIR coverage

Reuse most of the existing IR generation logic and helpers:

  • Just walk CIR instead of AST
  • Feature gated under an ‘experimental’ flag

[Diagram: AST → MLIR CIR → LLVM IR; CIR also drives -Wunreachable-code, -Wuninitialized, ...]

70 of 90

Add Lowering from CIR to LLVM IR, finish CIR coverage

Reuse most of the existing IR generation logic and helpers:

  • Just walk CIR instead of AST
  • Feature gated under an ‘experimental’ flag

Iterate on this until it supports all language features and CIR is starting to settle

  • Each improvement will also improve diagnostics!
  • Turn it on by default and cut over when it is good enough

[Diagram: AST → MLIR CIR → LLVM IR, as on the previous slide]

71 of 90

A unified path enables the fun part!

Start detangling one subsystem at a time:

  • OpenMP, make ABI lowering Clang independent, …
  • Each of these can be implemented incrementally, and tested as you go

Implement library-specific optimizations:

  • Insert ‘reserve’ calls for std::vector, constant fold std::string, …
  • Refcount optimizations for std::shared_ptr, ...

[Diagram: AST → MLIR CIR → LLVM IR; the forest of complexity is thinning out]

72 of 90

Strawman example CIR lowering (sketch)

std::vector<int> foo() {
  std::vector<int> vec;
  // Insert: vec.reserve(100);
  for (int i = 0; i < 100; ++i)
    vec.push_back(i);
  return vec;
}

func @_Z3foo() -> !cir.std::vector<int> {
  %vec = cir.alloc_stack : !cir.std::vector<int>
  cir.call @'std::vector<int>::vector()'(%vec)
  br ^loop
^loop: ...
  %i = ...
  cir.call @'std::vector<int>::push_back(int)'(%vec, %i)
  ...
  cond_br %done, ^loop, ^out
^out:
  %result = cir.load %vec : !cir.std::vector<int>
  cir.dealloc_stack %vec : !cir.std::vector<int>
  return %result
}

73 of 90

Many new approaches to explore...

Maintain structured loops and control flow for OpenCL and CUDA

Preserve better alias analysis information

New generation of source tooling based on hybrid dataflow + source location info

Enable higher level domain specific optimizations:

  • Optimize capture lists for C++ lambdas
  • C++ coroutines
  • Dataflow optimizations for OpenMP

74 of 90

OpenMP & Other Parallelism Dialects

OpenMP is mostly orthogonal to the host language:

  • Common runtime + semantics
  • Rich model, many optimizations are possible

Model OpenMP as a dialect in MLIR:

  • Share across Clang and Fortran
  • Region abstraction makes optimizations easy
    • Simple SSA intra-procedural optimizations

Lower to the LLVM IR dialect as usual

Source:

  int j = 4+5;
  #pragma omp parallel for
  for (i=0; i<N; i++) {
    stuff(i, j)
  }

CIR with the OpenMP dialect:

  %c4 = cir.constant 4 : !cir.int
  %c5 = cir.constant 5 : !cir.int
  %j = cir.add %c4, %c5 : !cir.int

  omp.parallel.for (...) {
  ^bb0(%i : !cir.int):
    cir.call @stuff(%i, %j)
  }

After SSA ConstProp:

  omp.parallel.for (...) {
  ^bb0(%i : !cir.int):
    %c9 = cir.constant 9 : !cir.int
    cir.call @stuff(%i, %c9)
  }

75 of 90

Utilizing MLIR for LLVM IR

Disclaimer, very speculative: the LLVM community hasn’t seriously discussed this or reached consensus

76 of 90

Why not use MLIR for LLVM IR?

Observation: MLIR already has an LLVM IR dialect used for codegen

  • Directly isomorphic to the LLVM IR abstraction level

...
^bb2:  // pred: ^bb1
  %9 = llvm.constant(10) : !llvm.i64
  %11 = llvm.mul %2, %9 : !llvm.i64
  %12 = llvm.add %11, %6 : !llvm.i64
  %13 = llvm.extractvalue %arg2[0] : !llvm<"{ float* }">
  %14 = llvm.getelementptr %13[%12] :
          (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
  llvm.store %8, %14 : !llvm<"float*">
...

Port the existing LLVM IR passes to work on LLVM dialect!

  • Drop-in replacement: retains exactly the same code quality, features, etc.

77 of 90

Why use MLIR for LLVM IR?

MLIR’s impl is just better than the LLVM IR data structures!

And yes, I 💖 LLVM!

78 of 90

Implicitly Multithreaded PassManager

Multicore isn’t “the future” anymore:

  • LLVM PassManager aimed to be multithreaded from the beginning
  • Representational issues with use/def chains prevented this from happening

MLIR PassManager runs passes on “isolated from above” regions in parallel

  • … including functions!

This can provide an easy 4-100x compile time speedup!
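
Concretely, each func below is “isolated from above” (it cannot reference SSA values defined outside itself), so the pass manager can process the two functions on different threads (sketch; names hypothetical):

  module {
    func @worker_a(%arg0: i32) -> i32 { ... }   // optimized on one thread
    func @worker_b(%arg0: i32) -> i32 { ... }   // optimized concurrently on another
  }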

79 of 90

MLIR BB Arguments >>> LLVM PHINode

llvm::PHINode design has challenges:

  • Must be kept at the top of the block - code to skip over them
  • Dominance: PHIs read their input values in predecessor blocks, not the current block
  • Scalability problems for blocks with high input degree - e.g. C++ EH blocks
  • Confusion about atomic assignment of multiple PHI nodes

LLVM:

  br label %loop
  loop:
    %x = phi i32 [ %in1, %entry ], [ %y, %loop ]
    %y = phi i32 [ %in2, %entry ], [ %x, %loop ]
    use(%x, %y)
    br i1 %cond, label %out, label %loop
  out:
    ...

MLIR defines these away by using functional-style SSA form:

  br ^loop(%in1 : i32, %in2 : i32)
  ^loop(%x: i32, %y: i32):
    use(%x, %y)
    cond_br %cond, ^out, ^loop(%y : i32, %x : i32)
  ^out:
    ...

80 of 90

PHINode Predecessors must provide same value

if (cond) {
  result = x
} else {
  result = y
}
...

LLVM IR:

  ...
  br i1 %cond,
    label %other, label %merge
  other:
    br label %merge
  merge:
    %result = phi [ %y, %other ],
                  [ %x, %start ]
    ...

MLIR:

  llvm.cond_br %cond,
    ^merge(%x : !llvm.i32), ^merge(%y : !llvm.i32)
  ^merge(%arg: !llvm.i32):
    ...

81 of 90

LLVM invoke dominance issues

  • Invoke (and catchswitch, ...) results only dominate their “normal” block!
  • No direct ability to model value live-in to the error block
  • Fragile requirements around block structure / landingpad placement

LLVM IR:

  %result = invoke @foo(...)
              to label %normal unwind label %except
  normal:
    use(%result)
    ...
  except:
    %errval = landingpad ...
    ...

MLIR:

  ...
  llvm.invoke @foo(...)
      to ^normal unwind ^except
  ^normal(%result):
    ...
  ^except(%errval):
    ...

82 of 90

Better Location Tracking

LLVM metadata design is wrong-in-retrospect for debug information:

  • Passes easily drop or corrupt location tracking, harming debugging experience

MLIR design strongly encourages pass authors to think about this by default:

  • Leads to much better user experience in practice
  • Lots of tooling and testing benefits, described earlier
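
In MLIR, the location is an explicit part of every operation and round-trips through the textual form. A sketch (hypothetical op name):

  %0 = "foo.op"() : () -> i32 loc("example.cpp":10:8)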

83 of 90

Better Infrastructure

  • InstCombine should use MLIR declarative pattern rewriting system
  • Auto-generate LangRef.html from the dialect description
  • Native support for multiple return values in functions/operations
  • Better structured representation for attributes/metadata
  • More sensible constant model - eliminate llvm::Constant::canTrap()!
  • ...

84 of 90

Why use MLIR for LLVM IR?

MLIR provides new opportunities!

We can make core LLVM even better!

85 of 90

Can now introduce higher level abstractions!

Parallelism abstractions:

  • Fork/join parallelism
  • Concurrent for loops

Loop transformations:

  • LLVM infers LoopInfo as an on-the-side analysis, rather than as a core IR construct
  • Polly has a different IR (using ISL), and reencodes some LLVM IR
  • In MLIR, you just add loops!
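
A hedged sketch of what a first-class concurrent loop can look like, loosely modeled on MLIR’s loop dialect (syntax approximate and version-dependent):

  loop.parallel (%i) = (%c0) to (%n) step (%c1) {
    // iterations may execute concurrently
    ...
  }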

...

86 of 90

Ok, but how? --> Incrementally

Upgrade LLVM in place, eliminate differences between the two:

  • Const model, “llvm::Value *x” -> “mlir::Value x”, PHINode -> BBArg?
  • Each change is low-risk, and makes the existing codebase better

Implement compatibility wrappers:

  • llvm::IRBuilder2 that creates MLIR instead of LLVM
  • Provide exactly the same API, reducing churn

Dissolve away the wrappers over time

This will be a bit complex to phase in

⇒ Get experience with Clang first

87 of 90

Learning More

88 of 90

MLIR is part of the LLVM Project!

Code available in the LLVM GitHub monorepo

Find lots of content on mlir.llvm.org

Discussions on the MLIR Discourse Forum

89 of 90

MLIR within TensorFlow OSS

MLIR powers several TensorFlow subsystems

  • Graph transformations
  • TensorFlow Lite integration
  • Code generation for accelerators

TensorFlow team hosts the MLIR Open Design Meeting:

  • A “weeklyish” video chat with topics from many people in the community
  • Notes, videos and slides available in the public Google doc

90 of 90

“That which you can represent, you can transform!”

Questions?

Tatiana Shpeisman

shpeisman@google.com

Chris Lattner

clattner@sifive.com