Thoughts on Tensor Code Generation in MLIR
January 23, 2020
Chris Lattner <clattner@google.com>
Sharing ideas from dozens of people
DISCLAIMERS
This is a public version of an internal vision discussion.
This isn’t a committed design, and there are no schedules.
Shared in case it is interesting and useful to others in the field.
This was put together quickly; it is not a polished talk.
Talk Overview
Discuss approaches to code generation from tensors to machine instructions:
This talk makes up a number of “descriptive” names to reduce bias in the conversation:
Talk is about architecture and project management, *not* codegen algorithm details
What is well understood?
Requirements & Goals
Lowering from “ops” to target-specific IR
Google’s requirements:
Goals:
Output of this lowering: “Target Specific IR”
Result is a target-specific IR talking to a target-specific code generator:
Character of these IRs:
MLIR provides a standardized framework to talk to these:
[Diagram: TSIR (e.g. LLVM)]
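To make the idea of a target-specific IR concrete, here is a purely hypothetical sketch in MLIR’s generic op syntax; the “acc” dialect, its ops, and the kernel shown are invented for illustration and do not correspond to any real target, and builtin syntax such as func.func has shifted across MLIR versions:

```mlir
// Hypothetical target-specific IR: low-level ops a target's code generator
// understands directly (DMA transfers, fixed-width vector math).
// Written in generic op syntax so it parses with --allow-unregistered-dialect.
func.func @scaled_copy(%src: memref<64xf32>, %dst: memref<64xf32>) {
  %v = "acc.dma_load"(%src) : (memref<64xf32>) -> vector<64xf32>
  %s = "acc.vmul_scalar"(%v) {scale = 2.0 : f32} : (vector<64xf32>) -> vector<64xf32>
  "acc.dma_store"(%s, %dst) : (vector<64xf32>, memref<64xf32>) -> ()
  return
}
```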
Input: “Graph” of Tensor Ops in MLIR
Can be many different dialects:
MLIR provides a standard dialect-independent IR + infra:
[Diagram: OpGraph → TSIR (e.g. LLVM)]
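As a baseline picture of that input, a tiny graph of tensor ops in MLIR might look like the following sketch; the dialect and op names (“tf.MatMul”, “tf.Relu”) are illustrative stand-ins for whatever frontend dialect actually feeds the pipeline:

```mlir
// A small "graph" of tensor ops: a matmul followed by a pointwise op,
// expressed in generic op syntax over SSA tensor values.
func.func @layer(%x: tensor<8x16xf32>, %w: tensor<16x4xf32>) -> tensor<8x4xf32> {
  %0 = "tf.MatMul"(%x, %w) : (tensor<8x16xf32>, tensor<16x4xf32>) -> tensor<8x4xf32>
  %1 = "tf.Relu"(%0) : (tensor<8x4xf32>) -> tensor<8x4xf32>
  return %1 : tensor<8x4xf32>
}
```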
Future: Would like to move beyond tensors as the only abstraction:
“Target Specific Ops With Buffers”
Produced by New TensorFlow Compiler Bridge + HLO passes
For Each Device:
[Diagram: OpGraph → TSOWB (e.g. late HLO) → TSIR (e.g. LLVM)]
“Compiler Bridge” needs its own MLIR ODM talk!
Provides incredible flexibility for target authors:
Result is a subgraph, expressing one “kernel”:
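One hypothetical way to picture such a subgraph: the same tensor computation after buffer assignment, where ops now read and write device-resident memrefs instead of producing SSA tensor values. The “late_hlo” op names below are invented for this sketch and are not the real late-HLO form:

```mlir
// One "kernel" subgraph on buffers: operands and results are memrefs
// that have already been assigned for a particular device.
func.func @layer_kernel(%x: memref<8x16xf32>, %w: memref<16x4xf32>,
                        %out: memref<8x4xf32>) {
  %tmp = memref.alloc() : memref<8x4xf32>
  "late_hlo.dot"(%x, %w, %tmp)
      : (memref<8x16xf32>, memref<16x4xf32>, memref<8x4xf32>) -> ()
  "late_hlo.relu"(%tmp, %out) : (memref<8x4xf32>, memref<8x4xf32>) -> ()
  memref.dealloc %tmp : memref<8x4xf32>
  return
}
```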
Current Approaches
XLA xPU Emitters
Lowers from Late HLO to xPU-specific instruction set
Proven!
[Diagram: OpGraph → TSOWB (e.g. late HLO) → TSIR (e.g. LLVM)]
Challenges:
Some implementation limitations: static shapes, custom ops, etc.
LinAlg / Structured Ops?
Very useful technologies and implementation approach for many problems:
Learn more: Dec 5, 2019 MLIR Open Design Meeting
[Diagram: OpGraph → TSOWB (e.g. late HLO) → TSIR (e.g. LLVM)]
Important ingredients, but not really a “framework”:
Common challenges
Spanning a huge abstraction gap all in one step:
Monolithic approach encourages conflation of multiple different concerns:
Effectively assumes that there is one codegen algorithm/approach per target
⇒ MLIR provides an answer to these sorts of challenges!
[Diagram: OpGraph → TSOWB (e.g. late HLO) → TSIR (e.g. LLVM)]
An xPU* Engineering Management Challenge
XLA/xPU is “really really good” at what it does:
Also not “general enough”, so we’d like to get to a more flexible design:
* Certainly not an xPU-specific challenge :-)
How do we bring up a new codegen architecture?
Suggested Approach
Incremental and Collaborative
High Level Target Specific IR
Target specific code generators often have a somewhat arbitrary abstraction boundary
MLIR allows defining higher level IRs:
[Diagram: OpGraph → TSOWB (e.g. late HLO) → HLTSIR (e.g. vector dialects) → TSIR (e.g. LLVM)]
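For concreteness, here is a rough sketch of what one such higher-level target-specific layer can look like, using MLIR’s vector dialect as an existing example; op availability and exact syntax differ across MLIR versions, and the kernel itself is illustrative:

```mlir
// Vector-level IR: explicit reads/writes from memory plus
// target-sized vector arithmetic (an axpy-style kernel).
func.func @axpy(%a: f32, %x: memref<64xf32>, %y: memref<64xf32>) {
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : f32
  %xv = vector.transfer_read %x[%c0], %pad : memref<64xf32>, vector<64xf32>
  %yv = vector.transfer_read %y[%c0], %pad : memref<64xf32>, vector<64xf32>
  %av = vector.broadcast %a : f32 to vector<64xf32>
  %r  = vector.fma %av, %xv, %yv : vector<64xf32>
  vector.transfer_write %r, %y[%c0] : vector<64xf32>, memref<64xf32>
  return
}
```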
Guiding principles for designing this layer:
Why is this useful?
“Memory Hierarchy Abstraction” - worst name in the talk :-)
Input is valid “High Level Target Specific IR” embedded into a “loop IR”:
Perform correctness-preserving optimizations:
[Diagram: OpGraph → TSOWB (e.g. late HLO) → MHA (e.g. stripe/affine) → HLTSIR (e.g. vector dialects) → TSIR (e.g. LLVM)]
These transformations are well understood:
Uday and Intel’s work here is very exciting!
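As a rough sketch of this level, here is a trivial loop nest in the affine dialect (one existing example of such a “loop IR”); transformations like tiling, fusion, and copy placement can restructure it without changing its meaning. The kernel is illustrative:

```mlir
// A loop nest over memrefs; correctness-preserving passes (e.g. affine
// loop tiling or fusion) rewrite the loop structure, not the semantics.
func.func @add2d(%A: memref<128x128xf32>, %B: memref<128x128xf32>,
                 %C: memref<128x128xf32>) {
  affine.for %i = 0 to 128 {
    affine.for %j = 0 to 128 {
      %a = affine.load %A[%i, %j] : memref<128x128xf32>
      %b = affine.load %B[%i, %j] : memref<128x128xf32>
      %sum = arith.addf %a, %b : f32
      affine.store %sum, %C[%i, %j] : memref<128x128xf32>
    }
  }
  return
}
```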
What is the right green arrow here?
Lots of different algorithms in this space - unlikely that there is “one true answer”:
⇒ Recommendation: optionality, not “one perfect answer”
[Diagram: OpGraph → TSOWB (e.g. late HLO) → MHA (e.g. stripe/affine) → HLTSIR (e.g. vector dialects) → TSIR (e.g. LLVM), with the lowering arrow in question marked “?”]
Other challenges:
“Ops” are not just linear algebra or dense math:
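For example, data-dependent shapes, gathers, and sorts do not fit a dense loop-nest model. A sketch (with invented “mytensor” op names) of the kind of ops a real graph also contains:

```mlir
// Ops with data-dependent output shapes and non-dense access patterns.
func.func @pick(%data: tensor<1024xf32>, %idx: tensor<?xi32>) -> tensor<?xf32> {
  %vals = "mytensor.gather"(%data, %idx)
      : (tensor<1024xf32>, tensor<?xi32>) -> tensor<?xf32>
  %sorted = "mytensor.sort"(%vals) : (tensor<?xf32>) -> tensor<?xf32>
  return %sorted : tensor<?xf32>
}
```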
CodeGenAlgorithm Selector
Input is a graph of ops that the target can support:
⇒ But possibly not supported by all code generators for the target
Enables a federated / mixture-of-experts approach with multiple CGAs for a target:
[Diagram: OpGraph → TSOWB (e.g. late HLO) → CGASel → MHA (e.g. stripe/affine) → HLTSIR (e.g. vector dialects) → TSIR (e.g. LLVM)]
“CGASel” is a graph partitioning algorithm:
Key is to make sure they all work on the same input/output abstractions:
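One hypothetical way to represent the result of such a partitioning: outline each claimed subgraph into its own function and record which code generation algorithm will lower it via an attribute. The “cga” attribute and its values are invented for this sketch:

```mlir
// Hypothetical output of CGASel: each kernel subgraph is outlined and
// tagged with the code generator responsible for lowering it.
func.func @kernel_0(%x: memref<8x16xf32>, %w: memref<16x4xf32>,
                    %o: memref<8x4xf32>) attributes {cga = "emitters"} {
  // ... dense ops handled by the existing XLA-style emitters ...
  return
}
func.func @kernel_1(%d: memref<1024xf32>, %o: memref<1024xf32>)
    attributes {cga = "mha"} {
  // ... ops routed through the loop-IR / memory-hierarchy path ...
  return
}
```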
How does this help the “XLA/xPU is mature” challenge?
Key requirements:
Continue incremental improvements to XLA emitters: makes the xPU product better
New algorithms can be explored next to them:
Emitters can be progressively simplified as MHA/HLTSIR expand and improve
[Diagram: OpGraph → TSOWB (e.g. late HLO) → CGASel → {Emitters | MHA (e.g. stripe/affine) → HLTSIR (e.g. vector dialects) | “New Thing” | Custom Op | ...} → TSIR (e.g. LLVM)]
Why is this nice for MLIR more generally?
Many people care about this problem space!
[Diagram: OpGraph → TSOWB (e.g. late HLO) → CGASel → {Emitters | MHA (e.g. stripe/affine) → HLTSIR (e.g. vector dialects) | Tile, TC, ... | TACO | TVM/Halide | Custom Op | ...} → TSIR (e.g. LLVM)]
This is “The MLIR approach”:
It isn’t reasonable to expect a “winner take all” algorithm:
There are lots of subdomains with different constraints:
What about retargetability, co-design, sharing?
Every part of this stack can be written in a target-specific or target-independent way:
No reason to block on “perfect target descriptions”:
[Diagram: OpGraph → TSOWB (e.g. late HLO) → CGASel → {Emitters | MHA (e.g. stripe/affine) → HLTSIR (e.g. vector dialects) | Tile, TC, ... | TACO | TVM/Halide | Custom Op | ...} → TSIR (e.g. LLVM), with a shared Target Description feeding the stages]
Progressively build this Target Description over time:
TLDR: Build infra for optionality on Code Gen Algorithms
Lots of well known algorithms with different tradeoffs
Lots of experts in the field
“Framework” approach will make it easier to experiment and do research
Not every chip can be targeted from every codegen approach
Many interesting things beyond dense tensors
Managing an “existing thing to MLIR” transition
Testing the water
Generic software management advice:
A starting point: Keep existing algorithms, use MLIR data structures
Walk, run, then jump!
Start benefiting from MLIR incrementally:
Extend your IR:
Things to consider:
Questions?