# The Transformative Impact of MLIR: Key Developments in AI Compilation and Hardware Co-Design (2022-2025)

# 1. Introduction: MLIR in the Era Beyond Moore's Law

#### The Compiler Challenge

The landscape of computing is undergoing a fundamental transformation. The relentless pace of Moore's Law, which dictated hardware scaling for decades, is demonstrably slowing, altering the trajectory of performance improvements.<sup>1</sup> Simultaneously, the complexity and scale of Artificial Intelligence (AI) models, particularly in areas like large language models (LLMs) and generative AI, are exploding. This confluence of factors has driven the proliferation of diverse and specialized hardware accelerators (Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), Neural Processing Units (NPUs), Intelligence Processing Units (IPUs), AMD's AI Engines (AIEs), Field-Programmable Gate Arrays (FPGAs), and custom Application-Specific Integrated Circuits (ASICs)), each designed to tackle specific computational bottlenecks.

This heterogeneity presents a profound challenge for compiler technology. Traditional monolithic compiler infrastructures, such as LLVM and GCC, while powerful for conventional CPU targets, struggle to effectively manage the complexity and diversity of these modern hardware targets and the high-level abstractions used in AI frameworks. The result was often a fragmented ecosystem, with different frameworks (TensorFlow, PyTorch, JAX) and hardware vendors developing bespoke, often incompatible, compiler stacks, leading to duplicated effort and hindering portability.<sup>3</sup>

#### Enter MLIR

Multi-Level Intermediate Representation (MLIR) emerged as a direct response to these challenges.<sup>1</sup> Conceived within Google and now part of the LLVM project, MLIR represents a paradigm shift in compiler infrastructure design.<sup>1</sup> It is not merely another Intermediate Representation (IR), but a flexible and extensible *framework* for building compilers.
Its core design principles (modularity, extensibility, and reusability) are manifest in its unique ability to represent and manipulate code at multiple levels of abstraction simultaneously.<sup>1</sup> This is achieved through the central concept of **dialects**: self-contained units that define domain-specific operations, types, and attributes.<sup>1</sup> These dialects allow MLIR to bridge the gap between high-level programming models used in AI frameworks and the low-level details of target hardware, facilitating progressive lowering and optimization across different abstraction layers.<sup>6</sup> MLIR's Static Single Assignment (SSA)-based structure provides a robust foundation for analysis and transformation.<sup>7</sup>

#### Report Scope and Purpose

This report synthesizes the most significant developments, breakthroughs, and impactful applications within the MLIR ecosystem over roughly the 2022-2025 timeframe. It provides an expert-level analysis focusing on MLIR's evolving role in AI software optimization, including the critical area of data processing and pipeline management, its integration with mainstream AI frameworks, its transformative influence on AI hardware acceleration and co-design, and the burgeoning research landscape surrounding it. Particular attention is given to the innovative 'Transform' dialect and its implications for compiler control. The objective is to present a comprehensive technical assessment of MLIR's impact and trajectory for an audience of technical leaders, engineers, and researchers engaged with AI systems and compiler technology. The analysis underscores that MLIR's rise is not merely incidental but a necessary evolutionary step in compiler technology, driven by the fundamental shifts in hardware capabilities and the computational demands of modern AI.

# 2. The Evolving MLIR Ecosystem: Growth, Principles, and Identity

## 2.1 Core Infrastructure Growth and Community Expansion

The period from 2022 to 2025 witnessed substantial growth and activity within the MLIR project, reflecting its increasing importance and adoption. As an integral part of the broader LLVM ecosystem, MLIR benefited from the significant development velocity of LLVM itself. In 2024 alone, the main LLVM repository saw nearly 37,500 commits, adding over 9.3 million lines of code, roughly in line with activity in 2022 and 2023. While these figures represent the entire LLVM project, they indicate a healthy and active development environment within which MLIR resides.

More telling is the dramatic expansion of the contributor base. The number of unique authors contributing to LLVM surged from 1,573 in 2022 and 1,932 in 2023 to an unprecedented 2,138 in 2024. This represents a more than six-fold increase compared to the 336 authors a decade prior in 2014. This rapid growth coincides with the emergence and maturation of MLIR, signaling widespread industry and academic investment. The increasing number of contributors from diverse organizations signifies that MLIR has transcended its origins as a Google-internal project and is becoming a de facto standard infrastructure for compiler development, particularly in the AI domain.
Key catalysts for this growth were the open-sourcing of MLIR and its contribution to the LLVM Foundation, which made the technology accessible to the entire industry and fostered collaborative development.<sup>3</sup> The community actively engages through platforms like the LLVM Discourse forums, which replaced the older mailing list system to provide better structure, searchability, and integration (e.g., with GitHub accounts).<sup>16</sup> Furthermore, events like the biannual LLVM Developers' Meetings, which explicitly include MLIR content and dedicated MLIR Workshops, serve as crucial hubs for knowledge sharing, discussion, and networking among developers, researchers, and users.<sup>17</sup>

#### 2.2 Foundational Concepts: Dialects, Operations, Extensibility

MLIR's power and flexibility stem from a set of well-defined core concepts.<sup>7</sup> At its heart, MLIR represents programs as a graph-like structure of **Operations**, which consume and produce **Values**. Each Value has a **Type** known at compile time, and Operations can possess **Attributes** representing compile-time constant information.<sup>8</sup> Operations are organized within **Blocks**, which in turn reside within **Regions**, allowing for nested structures essential for representing control flow or scoped computations.<sup>8</sup> MLIR utilizes an SSA (Static Single Assignment) form, simplifying dataflow analysis.<sup>7</sup>

The most fundamental concept enabling MLIR's modularity and extensibility is the **Dialect**.<sup>1</sup> Unlike traditional compiler infrastructures often centered around a single, monolithic IR,<sup>4</sup> MLIR allows developers to define custom dialects. Each dialect encapsulates a set of operations, types, and attributes relevant to a specific domain or abstraction level.
This allows MLIR to represent diverse concepts, from high-level framework operations (like TensorFlow or PyTorch ops) and mathematical abstractions (like linear algebra on tensors) down to hardware-specific instructions and control flow constructs, all within a unified infrastructure.<sup>1</sup>

Traditional IRs, operating at a single abstraction level, often face a dilemma when compiling complex domains like AI. Representing high-level constructs directly can make the IR unwieldy, while lowering too early ("premature lowering"<sup>7</sup>) discards valuable semantic information needed for effective optimization. MLIR's dialects circumvent this by enabling a multi-level representation. Compilers can start with high-level dialects preserving domain semantics and progressively lower the IR through a series of intermediate dialects, applying optimizations at the most appropriate level. This modularity significantly simplifies the construction of complex compilers, particularly for heterogeneous hardware targets.

Dialects are typically defined using **TableGen**, a configuration description language within LLVM, which generates much of the C++ boilerplate code for operations, types, and their interfaces.<sup>14</sup> The MLIR project includes a rich set of standard dialects that serve as building blocks, including `func` (functions), `arith` (standard arithmetic), `scf` (structured control flow), `affine` (polyhedral abstractions), `linalg` (linear algebra on tensors/memrefs), `tensor`, `memref` (memory buffers), `vector`, `gpu` (GPU abstractions), and `llvm` (for lowering to LLVM IR).<sup>8</sup> This dialect-based architecture represents a fundamental shift from monolithic IRs towards composable, multi-level abstractions, allowing compiler developers to tackle complexity by isolating concerns at appropriate levels.

#### 2.3 MLIR's Identity Crisis: General Infrastructure vs. AI Solution

MLIR originated within Google Brain specifically to address the fragmentation and complexity of AI compiler stacks.<sup>3</sup> The initial goal was to create a unified, reusable infrastructure to replace the myriad of incompatible graph technologies and compilers used across projects like TensorFlow, XLA, and TensorFlow Lite, particularly for targeting hardware like TPUs.<sup>3</sup>

However, MLIR was intentionally designed as a *general-purpose* compiler framework, not strictly limited to machine learning.<sup>3</sup> Its powerful abstractions and extensibility quickly attracted interest from other domains, leading to its adoption in areas like quantum computing (e.g., NVIDIA's CUDA Quantum<sup>27</sup>), hardware design and high-level synthesis (HLS) through projects like CIRCT,<sup>3</sup> homomorphic encryption (HEIR<sup>29</sup>), and potentially even database query compilation (Substrait-MLIR<sup>13</sup>).

This broad success, fueled by its open-sourcing and contribution to the LLVM Foundation,<sup>3</sup> created an "identity crisis". While MLIR excels as domain-agnostic, reusable infrastructure, the AI community simultaneously pushed for it to become an end-to-end AI compiler solution. This led to a "dialect explosion," where numerous AI-specific dialects (representing framework operations, intermediate optimizations, etc.) were added to the upstream MLIR project, sometimes with limited governance.

This situation conflates MLIR's core, general-purpose infrastructure with the specific AI solutions built upon it. It raises questions about what "MLIR" as a project truly encompasses and guarantees.<sup>3</sup> While MLIR forms the foundation for major AI compiler projects like OpenXLA and Triton, and is even used within parts of NVIDIA's CUDA stack,<sup>3</sup> the ambiguity between its role as a general framework and a specific AI solution persists.
This internal complexity risks creating fragmentation within the MLIR ecosystem, potentially mirroring the external fragmentation it was designed to solve. Recent efforts towards improved governance, such as establishing distinct Area Teams for MLIR Core and specific dialects, aim to address this challenge and clarify the project's structure and identity.<sup>3</sup>

# 3. MLIR for AI Data Pipelines: Pre-compilation, Structuring, and Wrangling

#### 3.1 The Critical Role of Data Pipelines in AI

Data is the fundamental input for machine learning models, and the process of preparing this data (often termed the Extract, Transform, Load (ETL) pipeline) is critical for successful model training and deployment.<sup>30</sup> These pipelines extract raw data from diverse sources (databases, files, APIs), transform it through cleaning, normalization, feature engineering, and structuring, and finally load it into a format suitable for consumption by ML frameworks.<sup>30</sup> Ensuring data quality, consistency, accuracy, and efficient processing is paramount, as the performance and reliability of ML models are directly dependent on the data they are trained on.<sup>30</sup>

Input data pipelines often represent a significant performance bottleneck, consuming substantial compute resources and potentially starving hardware accelerators like GPUs and TPUs, which can process training steps much faster than data can be supplied.<sup>32</sup> Efficient ETL involves complex operations like handling large data volumes, applying intricate transformations, managing data shuffling and batching, and overlapping communication with computation.<sup>32</sup> Common preprocessing steps include data cleaning (handling missing values, outliers, inconsistencies), integration from multiple sources, data reduction (aggregation, feature selection), data transformation (normalization, scaling, encoding categorical features), and discretization.<sup>36</sup>

#### 3.2 MLIR Approaches to Data Optimization

MLIR presents a compelling infrastructure for optimizing data pipelines due to its ability to represent diverse computations and data structures within a unified framework.<sup>8</sup> While traditional ML workflows often use separate tools for ETL (e.g., Spark, Pandas, Airflow<sup>19</sup>) and model training, MLIR offers the potential to represent and optimize both stages cohesively. This co-location enables cross-stage optimizations that are difficult to achieve otherwise.

Several MLIR-based approaches target data pipeline optimization:

- **Data Tiling and Packing:** Specialized hardware often requires data to be arranged in specific layouts or processed in tiles for optimal performance. MLIR can be used to model and optimize these data arrangements. For instance, work on targeting AMD's Ryzen AI NPUs uses MLIR-based techniques to derive optimal data tiling and packing strategies, managing data flow through the processor array and leveraging low-level DMA control for efficient data movement.<sup>42</sup>
- **Data Structuring and Wrangling:** MLIR dialects can effectively model various data structures beyond simple tensors, such as multi-dimensional arrays (memrefs)<sup>8</sup> and potentially even dataframes or semi-structured data like JSON.<sup>13</sup> Standard operations within dialects like `tensor`, `memref`, and `arith`, along with custom operations, can represent common data transformations like reshaping, transposing, type conversions, and element-wise operations.<sup>12</sup>
- **Optimization Techniques:** MLIR's pass infrastructure allows applying various optimizations to data manipulation code.
Canonicalization passes can simplify redundant operations (e.g., `transpose(transpose(x)) -> x`, or chains of reshapes).<sup>40</sup> Common subexpression elimination (CSE) can remove repeated calculations.<sup>44</sup> Pattern rewriting frameworks (both C++ based and declarative DRR) enable targeted optimizations like constant folding through reshape operations.<sup>40</sup> Furthermore, MLIR's ability to represent higher-level constructs facilitates advanced Data Layout Optimizations (DLO) potentially applied via Link-Time Optimization (LTO). Because MLIR can preserve information about structures (like C structs) across modules, LTO passes can perform optimizations like instance interleaving (rearranging fields of structs in an array for better cache locality) and dead field elimination (removing unused struct fields), which are simpler to implement in MLIR than in lower-level IRs like LLVM IR.<sup>43</sup> Bufferization passes manage the explicit allocation and deallocation of memory buffers,<sup>18</sup> and specialized primitives like `reuse_at` and `buffer_at` (demonstrated in HeteroCL) can explicitly manage on-chip memory hierarchies and reuse buffers to minimize off-chip memory access.<sup>41</sup>

The ability to represent both data preparation steps and model computation within the same MLIR framework opens possibilities for holistic optimizations, such as fusing data preprocessing operations directly into compute kernels or optimizing data layouts based on how the subsequent model consumes the data.<sup>43</sup>

#### 3.3 Case Studies and Projects

Several research projects and tools demonstrate MLIR's application to data-centric tasks:

- **DAPHNE:** This system explicitly targets integrated data analysis pipelines encompassing ETL, ML, and HPC.<sup>34</sup> It uses a custom MLIR dialect, DaphneIR, to represent operations on frames and matrices, along with standard dialects like SCF for control flow.
DAPHNE performs multi-level optimizations within MLIR before lowering to LLVM, aiming to optimize the entire workflow holistically.<sup>34</sup>
- **HeteroCL:** Originally based on Halide IR, HeteroCL migrated to MLIR for improved scalability and extensibility in defining hardware customizations.<sup>41</sup> Its HCL dialect allows specifying compute, data type, and memory customizations. Notably, it includes primitives like `.reuse_at()` and `.buffer_at()` to explicitly generate and manage reuse buffers and write buffers within custom memory hierarchies, crucial for optimizing data movement in data-intensive applications.<sup>41</sup>
- **Substrait-MLIR:** This ongoing project aims to create an MLIR dialect for Substrait, a cross-language format for database query plans.<sup>13</sup> By representing database operations within MLIR, it seeks to provide a common infrastructure for query optimization and potentially bridge the gap between traditional data processing systems and MLIR's AI/HPC capabilities.<sup>13</sup>
- **Noisy Arithmetic Example:** While focused on fully homomorphic encryption (FHE),<sup>29</sup> the example of tracking "noise" through integer arithmetic using an MLIR analysis pass demonstrates a relevant capability.<sup>45</sup> Similar analyses could track data quality metrics, distributions, or other properties through complex ETL transformations within MLIR.

These projects illustrate that MLIR's flexibility extends beyond tensor computations. Its application to data management, memory hierarchy optimization, and query plan representation signals a trend towards using MLIR as a unifying infrastructure for data engineering and AI engineering tasks.

#### 3.4 Comparison and Challenges

While MLIR offers significant potential for optimizing the *transformation* and *loading* stages of ETL, particularly when tightly coupled with ML model execution, it currently faces challenges compared to mature, dedicated ETL frameworks and libraries.
MLIR's strengths lie in its potential for deep hardware optimization, fusion of data preparation with compute kernels, and unified representation. However, established ETL tools like Apache Spark, Apache Airflow, Pandas, or commercial platforms<sup>19</sup> possess rich ecosystems with extensive connectors for diverse data sources (the 'Extract' stage), sophisticated orchestration and scheduling features, mature monitoring capabilities, and high-level APIs optimized for data manipulation productivity. Representing complex data cleaning logic (e.g., intricate validation rules, fuzzy matching) or stateful transformations might be less straightforward or efficient in MLIR's current dialects compared to specialized Python libraries like Pandas or data quality tools.<sup>36</sup>

Therefore, MLIR is unlikely to replace the entire data engineering stack in the near term. Its most promising role appears to be in accelerating the computationally intensive *transformation* steps within data pipelines and enabling tighter integration and co-optimization with downstream ML model execution, rather than managing the entire end-to-end ETL process.

# 4. The Transform Dialect: Unleashing Fine-Grained Compiler Control

#### 4.1 Motivation: Beyond Monolithic Passes

Traditional compiler optimization flows rely heavily on sequences of pre-defined passes (pass pipelines), often configured via command-line flags.<sup>46</sup> While effective for general-purpose optimization, this coarse-grained approach often lacks the precision required to optimize specific, critical sections of code for the diverse and specialized hardware prevalent today.<sup>46</sup> Source-level annotations or pragmas offer finer control but are typically limited to specific transformations anticipated by compiler developers and require invasive, non-modular compiler changes to implement.<sup>47</sup>

A significant limitation of this model is that much of the powerful transformation logic implemented within compiler helper functions (e.g., routines for tiling, unrolling, vectorizing specific loops) remains hidden or inaccessible to the end-user unless they are willing to write custom compiler passes in C++ and rebuild the compiler, a task requiring deep compiler expertise. Domain-specific scheduling languages like Halide and TVM address this by separating the algorithm from its optimization schedule, but they typically require reimplementing optimizations within their own frameworks and do not easily integrate with existing general-purpose compiler infrastructure. The MLIR Transform dialect was conceived to bridge this gap, providing a mechanism to expose and compose existing compiler capabilities with fine-grained precision directly within the MLIR framework.

#### 4.2 Core Concepts and Implementation

The Transform dialect is, itself, an MLIR dialect, but its purpose is meta-compilational: it defines operations that manipulate and transform *other* MLIR code (the "payload" IR). Instead of the compiler executing a fixed pipeline, it interprets a Transform dialect script provided by the user, which explicitly directs the optimization process.

Key concepts include:

- **Handles:** Transform operations operate on handles, which are standard MLIR SSA values. These handles represent lists of operations within the payload IR that are targeted by the transformation.<sup>49</sup> Handles can be produced by matching operations (e.g., `match.op`) or as results of other transform operations.
- **Transform Operations:** These are operations defined within the Transform dialect (e.g., `loop.tile`, `loop.unroll`, `bufferization.eliminate_alloc_tensor`).
Each transform operation typically takes one or more handles as input, applies a specific compiler transformation (often leveraging existing internal compiler functions) to the associated payload operations, and produces new handles corresponding to the newly created or modified payload operations.<sup>46</sup>
- **Payload IR:** The actual program code (e.g., user functions containing loops and computations) that is being optimized.
- **Transform IR:** The MLIR code written using the Transform dialect that specifies the sequence of optimizations to apply to the payload IR.

Consider this illustrative example, adapted from <sup>49</sup>. The syntax is schematic: exact op names and type annotations vary across MLIR releases, so it should be read as a sketch rather than as input for a specific `mlir-opt` build.

```mlir
// Transform IR script
transform.named_sequence @optimize_loops(%payload_func: !transform.any_op) {
  // Find the first 'scf.for' loop inside the payload function
  %outer_loop = transform.structured.match ops{["scf.for"]} in %payload_func
      -> !transform.op<"scf.for">

  // Define the tile size
  %c8 = transform.param.constant 8 : index

  // Tile the outer loop with size 8
  %tiled_loops:2 = transform.structured.tile_using_for %outer_loop tile_sizes [%c8]
      -> (!transform.op<"scf.for">, !transform.op<"scf.for">)

  // Unroll the inner tiled loop (handle: %tiled_loops#1) completely:
  // with a tile size of 8, an unroll factor of 8 removes the inner loop
  %unrolled_inner = transform.loop.unroll %tiled_loops#1 { factor = 8 }
      -> !transform.any_op

  transform.yield
}
```

This script finds a loop in the payload, tiles it, and then unrolls the inner loop resulting from the tiling.
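To make the effect of such a script concrete, the following plain-Python sketch mimics what tiling by 8 and then fully unrolling the inner loop do to a simple payload loop. It is not MLIR and uses no MLIR API; the function names (`original`, `tiled`, `tiled_and_unrolled`) and the loop body are hypothetical, chosen only for illustration, and the unrolled version assumes the trip count is a multiple of the tile size.

```python
def original(data):
    """Payload before transformation: one loop over all elements."""
    total = 0
    for i in range(len(data)):
        total += data[i] * 2
    return total


def tiled(data, tile=8):
    """After tiling: an outer loop over tiles, an inner loop within each tile."""
    total = 0
    n = len(data)
    for start in range(0, n, tile):                    # outer (tile) loop
        for i in range(start, min(start + tile, n)):   # inner (intra-tile) loop
            total += data[i] * 2
    return total


def tiled_and_unrolled(data):
    """After fully unrolling the inner loop for a fixed tile size of 8.

    Assumption of this sketch: len(data) is a multiple of 8, so no
    remainder loop is needed.
    """
    if len(data) % 8 != 0:
        raise ValueError("sketch assumes a trip count divisible by 8")
    total = 0
    for s in range(0, len(data), 8):   # only the tile loop remains
        total += data[s] * 2
        total += data[s + 1] * 2
        total += data[s + 2] * 2
        total += data[s + 3] * 2
        total += data[s + 4] * 2
        total += data[s + 5] * 2
        total += data[s + 6] * 2
        total += data[s + 7] * 2
    return total


if __name__ == "__main__":
    values = list(range(32))
    # All three variants compute the same result; only the loop structure differs.
    print(original(values), tiled(values), tiled_and_unrolled(values))
```

The point of the comparison is that each step changes only the loop structure, not the computed value, which is the property a transform script relies on when it chains tiling and unrolling.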
The Transform dialect is implemented within MLIR and features an extensible design, allowing new transform operations to be added easily.<sup>46</sup> An interface mechanism allows existing C++ helper functions within the compiler to be exposed as Transform dialect operations.<sup>47</sup>

#### 4.3 Key Capabilities

The Transform dialect provides several powerful capabilities:

- **Composition:** Simple, atomic transform operations can be chained together, using the handles produced by one transform as input to the next, allowing the construction of arbitrarily complex optimization pipelines.<sup>46</sup>
- **Extensibility:** New transformations can be exposed by defining new Transform dialect operations and associating them with the corresponding C++ implementation, without altering the core dialect or requiring users to rebuild the entire compiler for every new optimization strategy.<sup>46</sup>
- **Static Verification:** A crucial feature is the system of pre- and post-conditions associated with transform operations.<sup>46</sup> Handle types (e.g., `!transform.op<"scf.for">`) specify the expected type of payload operation a transform can be applied to. Attributes can add further constraints. This allows the MLIR infrastructure to statically verify the Transform script, catching errors like applying a loop transformation to a non-loop operation or applying a destructive transform twice to the same handle before executing the potentially expensive compilation.<sup>46</sup> The `transform.cast` operation allows explicit type checking between transforms.<sup>51</sup>
- **Parameterization:** Transformations can be configured using parameters, which can be compile-time constants (like tile sizes or unroll factors) or even values derived dynamically from the payload IR itself, enabling more adaptive optimization strategies.<sup>49</sup>

By representing the optimization strategy itself as MLIR IR, the Transform dialect elevates compiler control from a simple configuration task to a programmable one.
This "compiler programming" paradigm opens the door to analyzing, verifying, and potentially even automatically generating or optimizing the compilation strategy itself, integrating naturally with techniques like autotuning.

#### 4.4 Applications and Impact (Case Studies from CGO 2025 Paper)

The practical utility and impact of the Transform dialect were demonstrated through five case studies presented in the CGO 2025 paper<sup>46</sup>:

1. **Expressing Pass Pipelines:** This study showed that existing, coarse-grained pass pipelines can be faithfully replicated using Transform dialect scripts with negligible compilation time overhead, confirming its efficiency as a control mechanism.<sup>46</sup>
2. **Robust Lowering:** This case study focused on lowering IR containing a complex mix of dialects. It highlighted the critical role of the Transform dialect's static pre- and post-conditions in building robust and reliable lowering sequences, preventing errors that might occur in less explicitly controlled pass pipelines.<sup>46</sup>
3. **Debugging Performance:** The Transform dialect proved effective in diagnosing performance regressions. By enabling fine-grained control over which optimizations were applied where, developers could quickly isolate and disable counter-productive transformation patterns that were hurting performance.<sup>46</sup>
4. **Fine-Grained Optimization:** This study demonstrated the power of precise control. By meticulously applying loop tiling, vectorization, and importantly, integrating calls to specialized, highly optimized microkernel library functions (exposed via custom transform ops), significant performance improvements were achieved on relevant benchmarks, surpassing what standard pass pipelines could deliver.<sup>49</sup>
5. **Autotuning Integration:** The final case study showed the ease with which the Transform dialect integrates with state-of-the-art autotuning frameworks.
The parameterized nature of the Transform script allowed search algorithms to effectively explore the optimization space (e.g., different tile sizes, unroll factors) to find high-performing configurations automatically.<sup>48</sup>

These case studies collectively validate the Transform dialect's practical value across the compiler development and performance engineering lifecycle. They provide concrete evidence that it delivers on its promise of fine-grained control, reusability of compiler internals, improved robustness, and seamless integration with automated tuning methods, offering tangible advantages over traditional compiler control mechanisms.

# 5. MLIR Integration in Mainstream AI Frameworks: Bridging the Gap

#### 5.1 The Need for Framework Integration

Modern AI frameworks like TensorFlow, PyTorch, and JAX provide high-level, productive interfaces for defining complex machine learning models. However, translating these high-level descriptions into efficient code that runs optimally across a diverse landscape of hardware accelerators (CPUs, GPUs, TPUs, etc.) is a major challenge. This necessitates sophisticated compiler backends capable of understanding both the semantics of the ML framework and the intricacies of the target hardware. MLIR, along with related projects like XLA and IREE, has emerged as a critical technology for building these compiler backends, enabling performance optimization, hardware targeting, and improved portability.
#### 5.2 TensorFlow & OpenXLA

TensorFlow has long utilized XLA (Accelerated Linear Algebra) as a compiler backend to optimize performance, particularly on Google's TPUs and also for GPUs and CPUs.<sup>52</sup> XLA aims to improve execution speed by fusing operations and specializing code, enhance memory usage via buffer analysis, and reduce reliance on custom ops by optimizing fused low-level ops automatically.<sup>52</sup>

XLA's architecture heavily involves MLIR.<sup>52</sup> While historically using its own HLO (High Level Operations) representation, the modern XLA pipeline increasingly relies on MLIR dialects. The **StableHLO** dialect now serves as the primary, versioned interface layer between ML frameworks (including TensorFlow, PyTorch via Torch-MLIR, and JAX) and MLIR-based compilers like XLA and IREE.<sup>52</sup> Models are lowered from the framework's representation to StableHLO, which is then consumed by the compiler backend. XLA performs target-independent optimizations on StableHLO/HLO (like CSE, fusion) before invoking target-specific backends (e.g., GPU, CPU) for further optimization and code generation, often via the MLIR LLVM dialect.<sup>52</sup>

The **TOSA** (Tensor Operator Set Architecture) dialect is another MLIR dialect relevant to TensorFlow, particularly for TensorFlow Lite (TFLite) inference. However, the incremental upgrade of TOSA to v1.0 exposed significant compatibility challenges between TensorFlow/TFLite (which generated the older TOSA version) and downstream compilers like IREE (which adopted the newer v1.0). This breakage, occurring around late 2024 / early 2025, necessitated users pinning to older versions of TensorFlow, IREE, and associated tooling to maintain compatibility, highlighting the critical need for careful versioning and coordination across the decoupled components of the MLIR ecosystem.
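Of the target-independent optimizations mentioned above for XLA, common subexpression elimination (CSE) is the easiest to illustrate in isolation. The sketch below is a toy in plain Python, not XLA or MLIR code: the `Op` record, the `cse` function, and the SSA-style value names are all hypothetical, and it assumes every operation is pure (free of side effects) and that the list is already in definition-before-use order.

```python
from typing import Dict, List, NamedTuple, Tuple


class Op(NamedTuple):
    result: str                # SSA-style name defined by this op
    name: str                  # operation name, e.g. "add"
    operands: Tuple[str, ...]  # names of the values it consumes


def cse(ops: List[Op]) -> Tuple[List[Op], Dict[str, str]]:
    """Drop ops that recompute an already available (name, operands) pair.

    Returns the reduced op list and a map from each eliminated result
    to the surviving result that replaces it.
    """
    seen: Dict[Tuple[str, Tuple[str, ...]], str] = {}
    replaced: Dict[str, str] = {}
    kept: List[Op] = []
    for op in ops:
        # Rewrite operands first so later duplicates are still recognised.
        operands = tuple(replaced.get(v, v) for v in op.operands)
        key = (op.name, operands)
        if key in seen:
            replaced[op.result] = seen[key]
        else:
            seen[key] = op.result
            kept.append(Op(op.result, op.name, operands))
    return kept, replaced


if __name__ == "__main__":
    program = [
        Op("%0", "add", ("%a", "%b")),
        Op("%1", "add", ("%a", "%b")),   # same computation as %0
        Op("%2", "mul", ("%0", "%1")),
    ]
    reduced, mapping = cse(program)
    for op in reduced:
        print(op)
    print(mapping)
```

In this example the second `add` is removed and the `mul` is rewritten to use `%0` for both operands. Production compilers additionally have to reason about side effects, regions, and dominance, which this sketch deliberately ignores.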
To simplify the integration of diverse hardware backends with frameworks like TensorFlow and JAX, the **PJRT** interface, a plugin-based runtime API, was developed and open-sourced as part of OpenXLA.<sup>58</sup> PJRT provides a standardized API for frameworks to discover, load, and interact with different compiler runtimes and hardware devices dynamically. This allows hardware vendors to provide PJRT plugins for their devices, enabling framework support without requiring deep integration into the framework's core codebase.<sup>59</sup> Intel, for example, uses PJRT to provide its GPU backend for TensorFlow and JAX.<sup>59</sup>

The **OpenXLA** project represents a collaborative effort by Google and numerous industry partners (including AMD, NVIDIA, Intel, Arm, Meta, AWS) to develop an open-source ecosystem of ML compiler technologies, with XLA, StableHLO, IREE, and PJRT as key components, all leveraging MLIR.<sup>52</sup> This initiative aims to standardize interfaces, promote portability, and reduce the N\*M integration complexity between frameworks and hardware targets.

#### 5.3 PyTorch

PyTorch, known for its dynamic nature and Pythonic interface, presents different challenges for compiler integration compared to TensorFlow or JAX. The **Torch-MLIR** project serves as the primary bridge connecting the PyTorch ecosystem to MLIR-based backends.<sup>62</sup> It is designed as core infrastructure *for building* end-to-end compilation flows, rather than being a complete compiler itself.<sup>62</sup>

Torch-MLIR's architecture features a frontend and a backend. The frontend ingests various PyTorch program representations (primarily via PyTorch's JIT IR, which can be produced by TorchScript, TorchDynamo/`torch.compile`, `torch.fx`, etc.) and lowers them to the MLIR `torch` dialect. This dialect mirrors many PyTorch concepts, including its type system and operators. A critical stage in the frontend is lowering the `torch` dialect representation to conform to the "backend contract".
This contract defines a subset of the torch dialect with specific properties required by downstream MLIR backends: tensors must have value semantics (be immutable and non-aliased), and tensors must have known ranks (number of dimensions) and data types (dtypes), ideally with fully known shapes. Achieving this contract, especially when starting from TorchScript (which represents stateful nn.Module hierarchies and lacks static shape information), requires significant transformations handled by pipelines like torchscript-module-to-torch-backend-pipeline. These include functionalization (converting stateful modules to functional code), shape and dtype inference (often requiring user hints), and simplification of Pythonic constructs.<sup>62</sup> The impedance mismatch between PyTorch's dynamic, object-oriented nature and the typically static, functional nature expected by MLIR compiler backends necessitates this dedicated bridging infrastructure.

Once the IR conforms to the backend contract, Torch-MLIR's backend can lower it to various target MLIR dialects, including Linalg (for CPU/GPU codegen via LLVM), TOSA, and StableHLO (for integration with XLA/IREE).<sup>62</sup> This modular design allows different compiler backends to consume PyTorch models via the standardized backend contract provided by Torch-MLIR.
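
To make the functionalization step mentioned above more concrete, the following is a minimal, framework-free sketch in plain Python. It is not Torch-MLIR's implementation and uses no PyTorch APIs; `StatefulScale`, `functionalize`, and `pure_forward` are hypothetical names chosen for illustration. The idea it shows is only the general one: hidden, mutated state is turned into an explicit input and output of a pure function, which is the shape of program a value-semantics backend contract expects.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
State = Dict[str, float]


class StatefulScale:
    """Toy stand-in for a stateful module: holds a weight and a call counter."""

    def __init__(self, weight: float) -> None:
        self.weight = weight
        self.calls = 0.0  # hidden mutable state, updated on every forward()

    def forward(self, x: Vector) -> Vector:
        self.calls += 1.0  # in-place mutation: what functionalization removes
        return [self.weight * v for v in x]


def functionalize(
    module: StatefulScale,
) -> Tuple[Callable[[State, Vector], Tuple[State, Vector]], State]:
    """Split a module into (pure function, explicit initial state).

    Hypothetical helper for illustration only; real functionalization
    operates on compiler IR, not on Python objects.
    """
    initial_state: State = {"weight": module.weight, "calls": module.calls}

    def pure_forward(state: State, x: Vector) -> Tuple[State, Vector]:
        # No mutation: build a fresh state and return it next to the output.
        new_state = dict(state)
        new_state["calls"] = state["calls"] + 1.0
        return new_state, [state["weight"] * v for v in x]

    return pure_forward, initial_state


if __name__ == "__main__":
    fn, s0 = functionalize(StatefulScale(2.0))
    s1, y = fn(s0, [1.0, 2.0])
    print(y, s0["calls"], s1["calls"])
```

Because `pure_forward` never mutates its arguments, the old state `s0` remains valid after the call, which is the property that lets a compiler reorder or fuse such operations without alias analysis.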

#### **5.4 JAX**

JAX leverages a functional programming paradigm combined with transformations like jax.jit (just-in-time compilation), jax.grad (automatic differentiation), and jax.vmap (auto-vectorization).<sup>65</sup> For its JIT compilation capabilities, JAX relies heavily on the XLA compiler.<sup>65</sup>

The integration between JAX and MLIR-based backends like XLA and IREE is facilitated primarily through StableHLO and the PJRT runtime interface.<sup>52</sup> When jax.jit is invoked, the JAX function is traced and converted into JAX IR, which is then lowered to StableHLO.<sup>67</sup> This StableHLO representation is passed via the PJRT interface to the selected backend (e.g., XLA compiler for GPU/TPU, IREE compiler, or potentially other PJRT plugins) for optimization and code generation.<sup>52</sup>

JAX's functional nature generally maps more cleanly onto compiler IRs like StableHLO compared to the complexities of handling PyTorch's stateful modules. This relatively direct mapping simplifies the compiler integration task and likely contributes to JAX's strong performance and adoption on accelerators via XLA and IREE.<sup>54</sup> Research efforts also explore extending JAX's capabilities using MLIR backends, such as the experimental work on providing MLIR-based sparse tensor support for JAX.<sup>68</sup> JAX, combined with XLA's capabilities for automatic parallelization (GSPMD) <sup>54</sup> and PJRT's multi-device support, is widely used for large-scale model training.<sup>54</sup>

#### 5.5 Framework Integration Summary

The integration of MLIR into major AI frameworks is a dynamic and evolving process, aiming to provide portability and performance across diverse hardware. The move towards standardized interfaces like StableHLO and PJRT within the OpenXLA ecosystem represents a significant effort to create a more modular and interoperable landscape.
However, challenges related to dialect versioning, maintaining performance parity, and bridging the gap between dynamic framework features and static compiler requirements remain active areas of development.

Table 1: MLIR Integration in Major AI Frameworks (ca. 2022-2025)

| Framework | MLIR Integration Project(s) | Core Input Dialect(s) to Compiler | Integration Interface/Layer | Notable Successes/Capabilities | Key Challenges/Recent Issues |
|-----------|-----------------------------|-----------------------------------|-----------------------------|--------------------------------|------------------------------|
| TensorFlow | OpenXLA (XLA, IREE), TensorFlow Lite | StableHLO, TOSA | PJRT, TF C API | Strong TPU/GPU/CPU support via XLA, TFLite inference ecosystem | TOSA v1.0 compatibility issues <sup>61</sup>, historical complexity |
| PyTorch | Torch-MLIR, OpenXLA (IREE, XLA) | torch -> StableHLO, TOSA, Linalg | torch.compile, PJRT | Growing backend support (IREE, XLA), modular bridge design | Lowering complexity (state, dynamic shapes) <sup>62</sup>, performance tuning |
| JAX | OpenXLA (XLA, IREE) | StableHLO | jax.jit, PJRT | High performance on accelerators, clean functional mapping | Reliance on XLA/IREE backend maturity, custom op handling <sup>63</sup> |

# 6. MLIR Reshaping Hardware: AI Accelerators and Co-Design

#### 6.1 The Imperative for Hardware-Specific Compilation

The proliferation of specialized AI accelerators is a direct consequence of the need to overcome the limitations of general-purpose processors for demanding AI workloads.¹ Achieving peak performance on these diverse architectures (ranging from massively parallel GPUs to dataflow-oriented TPUs/IPUs and VLIW-based AIEs) requires compilers that can understand and exploit their unique features.² Generic compilation strategies are often insufficient.⁴ MLIR's extensible dialect system provides a powerful mechanism for hardware designers and compiler engineers to create domain-specific compilers that effectively map high-level AI models onto specialized hardware, significantly reducing the cost and complexity compared to building compilers from scratch.¹

#### **6.2 Tenstorrent**

Tenstorrent provides a compelling example of deep, native MLIR adoption for targeting specialized AI hardware. Their core compiler is **TT-Forge**, explicitly built upon MLIR.<sup>58</sup> The associated **TT-MLIR** open-source project defines a hierarchy of custom MLIR dialects to represent computations targeting Tenstorrent accelerators <sup>58</sup>:

- TTIR (Tenstorrent Intermediate Representation): A primary IR level for Tenstorrent hardware.
- TTNN (Tenstorrent Neural Network): Likely represents higher-level neural network constructs or fused operations suitable for their architecture.
- TTKernel: Represents lower-level kernel details.
- (Future dialects like .ttm, .ttnn mentioned in TT-Explorer roadmap <sup>73</sup>)

TT-Forge supports multiple frontends to ingest models from standard frameworks, leveraging open standards: tt-torch (using PyTorch 2.X/torch-mlir, outputting StableHLO), tt-forge-fe (using TVM to handle PyTorch, ONNX, TF), and tt-xla (using PJRT to ingest JAX via StableHLO).<sup>58</sup>

A particularly innovative aspect of Tenstorrent's toolchain is **TT-Explorer**, a graphical tool designed for "Human-In-Loop" compilation.<sup>58</sup> TT-Explorer allows users to visualize the TTIR graph, inspect operation attributes, view performance and accuracy metrics overlaid on the graph, edit parameters via an "Overrides" mechanism, trigger re-compilation, and observe the results.<sup>73</sup> Its roadmap includes support for more dialects, visualizing graph transformations, and integration with other tools.<sup>73</sup> This interactive approach, enabled by MLIR's structured IR, empowers developers to directly tune and optimize models for Tenstorrent hardware.

Tenstorrent's strategy showcases a full commitment to the MLIR philosophy, building a comprehensive, MLIR-native toolchain with custom dialects and novel interactive tooling, while also embracing interoperability through standards like StableHLO and PJRT.

#### 6.3 NVIDIA

NVIDIA's CUDA platform remains the dominant ecosystem for GPU computing.
While CUDA itself predates MLIR, NVIDIA is actively integrating MLIR into its compiler stack, leveraging its capabilities while building upon its existing, mature infrastructure.<sup>3</sup> NVIDIA contributes significantly to the LLVM project, upon which its CUDA Compiler (NVCC) is based.<sup>74</sup> MLIR's integration appears primarily as intermediate layers bridging high-level representations to NVIDIA's established backend:

- NVVM IR: This is NVIDIA's LLVM IR-based representation for GPU kernels, featuring specific conventions, address spaces (global, shared, constant), and intrinsic functions.<sup>74</sup> NVCC compiles source languages or higher-level IRs down to NVVM IR, which is then optimized and translated to PTX (Parallel Thread Execution) assembly.<sup>74</sup> NVVM IR has its own versioning and debug metadata specifications.<sup>75</sup>
- MLIR gpu Dialect: This standard MLIR dialect provides target-agnostic abstractions for common GPU programming concepts, such as kernel launches (gpu.launch), kernel functions (gpu.func), thread and block IDs (gpu.thread\_id, gpu.block\_id), barriers, and memory spaces (global, workgroup/shared).<sup>26</sup> A typical compilation pipeline involves outlining the body of a gpu.launch into a separate gpu.func kernel, attaching target-specific information (like the SM architecture, via the nvvm-attach-target pass), and then lowering the gpu dialect operations to the nvvm dialect using passes like convert-gpu-to-nvvm.<sup>26</sup>
- MLIR nvgpu Dialect: This dialect serves as a bridge between the target-agnostic gpu and vector dialects and the target-specific nvvm dialect.
It represents NVIDIA-specific hardware features and PTX-level operations directly in MLIR, such as asynchronous data copies between global and shared memory (nvgpu.device\_async\_copy, managed via groups), memory barriers (nvgpu.mbarrier.\*), matrix load operations (nvgpu.ldmatrix), Tensor Memory Accelerator (TMA) operations for efficient memory access (nvgpu.tma.\*), and Matrix Multiply-Accumulate (MMA) instructions, including support for sparse MMA and warpgroup-level operations targeting newer architectures. This allows optimizations related to these specific hardware features to be expressed and performed within MLIR before the final lowering to NVVM/PTX.
- CUDA Quantum: For its quantum computing platform, NVIDIA adopted a more MLIR-native approach from the outset.<sup>27</sup> The nvq++ compiler uses Clang to parse C++ code and then leverages custom MLIR dialects (Quake for quantum operations, CC for classical control) to represent the quantum kernels.<sup>28</sup> Tools like cudaq-quake perform the C++ AST to MLIR conversion, and cudaq-opt applies MLIR passes for optimization.<sup>28</sup> The platform even allows users to register and run their own custom MLIR passes on the Quake IR.<sup>27</sup>

Overall, NVIDIA's strategy appears evolutionary, integrating MLIR into its highly optimized CUDA/NVCC/LLVM toolchain primarily as intermediate abstraction layers (gpu, nvgpu) rather than replacing the entire backend. This leverages MLIR's strengths in handling higher-level structures while retaining the mature and performant NVVM/PTX code generation infrastructure. Newer initiatives like CUDA Quantum demonstrate a deeper, ground-up MLIR adoption.
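
The gpu-to-nvvm steps described for the gpu dialect above (kernel outlining, target attachment, conversion to nvvm) can be sketched as an `mlir-opt` pass pipeline string. The snippet below only assembles that string in Python and does not invoke any tool. The pass names `gpu-kernel-outlining`, `nvvm-attach-target`, and `convert-gpu-to-nvvm` and the `chip` option reflect recent upstream MLIR as far as I know, but they can change between LLVM releases, so treat them as assumptions to check against your build. The helper names and the `sm_80` default are hypothetical placeholders.

```python
from typing import List


def build_gpu_to_nvvm_pipeline(chip: str = "sm_80") -> str:
    """Return a textual MLIR pass pipeline for a partial gpu -> nvvm lowering.

    Assumed pass names (verify against your MLIR version):
      gpu-kernel-outlining : outlines gpu.launch bodies into gpu.func kernels
      nvvm-attach-target   : attaches an NVVM target (e.g. SM arch) to gpu.module
      convert-gpu-to-nvvm  : lowers gpu dialect ops inside gpu.module to nvvm
    """
    module_level = [
        "gpu-kernel-outlining",
        f"nvvm-attach-target{{chip={chip}}}",
    ]
    # convert-gpu-to-nvvm runs on gpu.module ops, so it is nested accordingly.
    kernel_level = ["convert-gpu-to-nvvm"]
    nested = f"gpu.module({','.join(kernel_level)})"
    return f"builtin.module({','.join(module_level + [nested])})"


def mlir_opt_argv(input_path: str, chip: str = "sm_80") -> List[str]:
    """Build (but do not run) an mlir-opt command line for the pipeline."""
    return [
        "mlir-opt",
        input_path,
        f"--pass-pipeline={build_gpu_to_nvvm_pipeline(chip)}",
    ]


if __name__ == "__main__":
    # "kernel.mlir" is a placeholder file name.
    print(" ".join(mlir_opt_argv("kernel.mlir", chip="sm_90")))
```

This is deliberately partial: a real flow would also lower the host side and any remaining arith, scf, or memref operations, and would finish by serializing the GPU module to a binary.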

Public details on future MLIR roadmap specifics beyond existing dialects are limited, though GTC presentations hint at ongoing work, potentially around runtime compilation or enhanced Python integration.<sup>77</sup>

#### **6.4 AMD**

AMD is actively utilizing and contributing to the LLVM/MLIR ecosystem to support its diverse range of hardware, including CPUs, ROCm-based GPUs, Ryzen AI NPUs, and Versal AI Engines (AIEs).

- Ryzen AI NPUs: For its NPUs based on XDNA architecture (found in Ryzen AI processors like Phoenix, Hawk Point, and the upcoming Strix Point with XDNA2), AMD open-sourced "Peano". Peano is an LLVM compiler backend designed specifically for these AI engines, enabling compilation for this specialized hardware within the standard LLVM/MLIR framework. Complementing this, work presented at FOSDEM 2025 focuses on using MLIR dialects and passes for optimizing data tiling and packing specifically for Ryzen AI NPUs, aiming to efficiently manage data movement and utilize DMA capabilities.<sup>42</sup>
- ROCm & GPUs: AMD continues to enhance its ROCm platform for GPU computing. Research efforts showcased include running standard, unmodified C/C++ code directly on AMD GPUs via LLVM/ROCm, bypassing the need for specific GPU languages.<sup>15</sup> The porting of the classic game DOOM to run almost entirely on the GPU using ROCm and LLVM libc serves as a demonstration of this capability.<sup>15</sup> Frameworks like IREE utilize MLIR to compile models (e.g., from PyTorch) for execution on AMD GPUs (often via SPIR-V or ROCm backends), offering an alternative to lower-level programming models like OpenCL or HIP.<sup>69</sup>
- AI Engines (AIEs): Targeting the complex, heterogeneous Versal ACAP devices containing AIE arrays requires sophisticated compilation flows.
The ARIES project, developed by researchers at Cornell and collaborators, provides an MLIR-based compilation flow specifically for AIE architectures.<sup>2</sup> It addresses limitations of previous AIE programming frameworks by introducing a novel tile-based programming model in Python that allows users to explicitly map tasks and exploit task-level, tile-level, and instruction-level parallelism (via primitives like .to(), .pipeline(), .vectorize()).<sup>2</sup> ARIES uses a unified MLIR representation, leveraging the existing AIEVec dialect for core-level intrinsics and introducing a new ADF (Adaptive Data Flow) dialect to model the inter-tile parallelism and dataflow connections within the AIE array. It performs multi-level optimizations (global: broadcast detection, data forwarding; local: DMA-to-IO conversion, core placement, vectorization, buffer management) before generating executable code (AIE intrinsics, ADF APIs, HLS C++, XRT host code).<sup>2</sup> ARIES demonstrates a deep integration of MLIR, using custom dialects to effectively manage the complexity and parallelism of the AIE architecture.

AMD's strategy involves leveraging MLIR across its hardware portfolio, developing targeted compiler solutions (Peano, ARIES) and contributing backends to the open-source ecosystem. This approach allows them to tailor compilation strategies to the specific needs of their NPUs, GPUs, and AIEs within a common infrastructure framework.
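
The data tiling and packing mentioned above for the Ryzen AI NPU work can be illustrated with a small, generic sketch. This is not AMD's implementation and does not use any MLIR or Peano API; `pack_tiles` and `unpack_tiles` are hypothetical helper names. It shows only the underlying layout idea: a row-major matrix is rearranged so that each tile occupies a contiguous run of the output buffer, which is the kind of layout that lets a DMA engine move one tile with a single linear transfer.

```python
from typing import List

Matrix = List[List[int]]


def pack_tiles(matrix: Matrix, tile_rows: int, tile_cols: int) -> List[int]:
    """Pack a row-major matrix into a flat buffer where each tile is contiguous.

    Tiles are emitted in row-major tile order, and elements inside a tile
    are emitted in row-major order.
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows % tile_rows or cols % tile_cols:
        raise ValueError("matrix dimensions must be multiples of the tile shape")
    packed: List[int] = []
    for tile_r in range(0, rows, tile_rows):
        for tile_c in range(0, cols, tile_cols):
            for r in range(tile_r, tile_r + tile_rows):
                packed.extend(matrix[r][tile_c:tile_c + tile_cols])
    return packed


def unpack_tiles(
    packed: List[int], rows: int, cols: int, tile_rows: int, tile_cols: int
) -> Matrix:
    """Inverse of pack_tiles: rebuild the row-major matrix from the packed buffer."""
    matrix = [[0] * cols for _ in range(rows)]
    values = iter(packed)
    for tile_r in range(0, rows, tile_rows):
        for tile_c in range(0, cols, tile_cols):
            for r in range(tile_r, tile_r + tile_rows):
                for c in range(tile_c, tile_c + tile_cols):
                    matrix[r][c] = next(values)
    return matrix


if __name__ == "__main__":
    m = [[4 * r + c for c in range(4)] for r in range(4)]
    buf = pack_tiles(m, 2, 2)
    print(buf)
    print(unpack_tiles(buf, 4, 4, 2, 2) == m)
```

In an MLIR-based flow, the same rearrangement would typically be expressed as IR transformations rather than host-side loops; the sketch is meant only to show what "packed" means for the data.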

#### 6.5 IPUs (Graphcore)

Graphcore's Intelligence Processing Unit (IPU) features a unique massively parallel architecture with numerous independent cores, each with fast local memory.<sup>71</sup> The software stack for the IPU is the **Poplar SDK**, which was co-designed with the hardware.<sup>79</sup> Poplar provides a C++ graph programming framework and libraries (PopLibs <sup>82</sup>), along with integrations for standard ML frameworks like TensorFlow, PyTorch, and ONNX.<sup>81</sup>

Poplar's relationship with LLVM and MLIR is that of using them as *components* within its larger, bespoke graph compilation system:

- **LLVM:** The Poplar graph compiler uses LLVM as a backend to generate code for the individual IPU cores.<sup>79</sup>
- **MLIR:** Poplar utilizes MLIR for *some high-level optimizations* within its graph compiler.<sup>71</sup> The specific nature and extent of these MLIR-based optimizations are internal details of the Poplar compiler.

The Poplar compiler itself manages the complex tasks of scheduling the computation graph across the IPU's parallel cores, partitioning work, managing data movement between tiles using the IPU's interconnect, and optimizing memory allocation.<sup>71</sup> Poplar's programming model is centered around computational graphs composed of fine-grained tasks (vertices).<sup>71</sup>

Unlike Tenstorrent's TT-Forge or AMD's ARIES, Poplar is not fundamentally an MLIR-based compiler. Instead, it incorporates MLIR technology for specific optimization tasks within its own established graph compilation framework, which ultimately relies on LLVM for final code generation for the IPU cores. No public, specific "IPU dialect" for MLIR is documented as part of the Poplar SDK, suggesting MLIR's role is more internal compared to other accelerator vendors who expose MLIR dialects as primary interfaces.

#### **6.6 Other Accelerator Projects & Trends**

The use of MLIR for targeting AI accelerators extends beyond the major players:

- **TPU-MLIR:** An open-source project specifically targeting Sophgo's TPUs.<sup>87</sup> It provides a full toolchain, converting models from ONNX, PyTorch, TFLite, and Caffe into an MLIR representation using a high-level TOP (Tensor Operation) dialect, which is then lowered to a device-specific TPU dialect. The toolchain includes quantization capabilities (F16, INT8 with calibration) and generates a final executable bmodel file.<sup>11</sup>
- ONNX-MLIR: This project focuses on providing a direct compilation path from ONNX models using an ONNX dialect within MLIR.<sup>88</sup> It supports code generation for generic CPUs and IBM's Telum AI accelerator, offering compiler interfaces and a runtime environment.<sup>88</sup>
- Intel Graph Compiler: Intel is developing an MLIR-based graph compiler designed to optimize deep learning workloads.<sup>89</sup> It accepts MLIR (primarily linalg on tensors) as input, applies optimizations, and generates code for Intel CPUs and GPUs (requiring OpenCL runtime).<sup>89</sup>
- Hardware/Software Co-design: MLIR's multi-level nature inherently facilitates hardware/software co-design.<sup>2</sup> By allowing hardware features, constraints, and specialized instructions to be represented in dedicated dialects early in the compilation flow (as seen in ARIES <sup>2</sup> or the nvgpu dialect <sup>76</sup>), MLIR enables tighter integration between software compilation strategies and hardware capabilities. This allows optimizations to be aware of hardware specifics much earlier than in traditional flows that only target hardware late in the process via low-level IR like LLVM IR.

The widespread development of MLIR-based compilers for a variety of accelerators (Sophgo TPU, IBM Telum, Intel GPU, AMD NPU/AIE, Tenstorrent accelerators) underscores MLIR's success as a foundational framework.
It significantly lowers the barrier for hardware vendors and researchers to build specialized, high-performance compiler toolchains, enabling faster support for standard ML frameworks on new and existing hardware compared to developing entirely new compiler infrastructures.

#### 6.7 Hardware Acceleration Summary

MLIR has become a central technology in the development of compilers for diverse AI hardware. Different vendors adopt varying strategies, from deep MLIR-native toolchains to using MLIR as a component within larger systems. The ability to define custom dialects is key to targeting specialized accelerator features effectively.

Table 2: MLIR Adoption in Hardware Acceleration (ca. 2022-2025)

| Vendor/Project | Key Compiler/Project(s) | Relevant MLIR Dialects (Standard & Custom) | Primary Use Case | Integration Approach | Key Frameworks Supported (via Compiler) |
|----------------|-------------------------|--------------------------------------------|------------------|----------------------|-----------------------------------------|
| NVIDIA | NVCC, CUDA Quantum | gpu, nvgpu, nvvm (target), Quake, CC (Quantum) | Mid/Low-level Opt & Codegen | Component in LLVM/CUDA stack | CUDA ecosystem, C++ (Quantum) |
| AMD (GPU/ROCm) | ROCm Compiler, IREE | gpu, rocdl (target), amdgpu, StableHLO (via IREE) | Backend Codegen, Framework Target | LLVM Backend, IREE Integration | HIP, OpenCL, PyTorch, TF, JAX (via IREE) |
| AMD (Ryzen AI NPU) | Peano | Custom (Peano backend), linalg, vector? | Backend Codegen | LLVM Backend (Open Source) | C/C++, Frameworks via higher layers? |
| AMD (AIE/Versal) | ARIES | AIEVec, ADF (custom), affine, scf, memref | Full Heterogeneous Compilation | MLIR-native Flow | Python (Custom API) |
| Tenstorrent | TT-Forge (TT-MLIR) | TTIR, TTNN, TTKernel (custom), StableHLO (input) | Full Compilation Toolchain | MLIR-native Flow | PyTorch, JAX, TF, ONNX (via TVM) |
| Graphcore | Poplar SDK | Standard MLIR dialects (internal use) | High-Level Graph Opt. | Component in Poplar stack | PyTorch, TF, ONNX (via Poplar) |
| Sophgo (TPU-MLIR) | TPU-MLIR | TOP, TPU (custom) | Full Compilation Toolchain | MLIR-native Flow (Open Source) | PyTorch, ONNX, TFLite, Caffe |
| Intel (Graph Comp.) | Intel Graph Compiler | linalg, vector, gpu, spirv? (target) | Graph Optimization & Codegen | MLIR-based Compiler | Frameworks emitting linalg? |
| IBM (ONNX-MLIR) | ONNX-MLIR | ONNX (custom) | ONNX Model Compilation | MLIR-based Compiler (Open Source) | ONNX |

# 7. Key Research Trends and Open Source Impact

#### 7.1 Influential Research Papers & Themes (Last 2-3 Years)

The academic and research communities have actively embraced MLIR, pushing its capabilities and exploring new application domains. Several key themes and influential papers have emerged in the 2022-2025 timeframe:

- Explicit Compiler Control (Transform Dialect): The work culminating in the CGO 2025 paper by Lücke, Zinenko, Moses, Steuwer, and Cohen formally introduced the Transform dialect.<sup>46</sup> This research provides a foundational mechanism for fine-grained, programmable control over the compilation process, moving beyond static pass pipelines.
- Heterogeneous System Compilation: Addressing the complexity of modern systems with multiple, diverse processing units is a major focus. The ARIES paper (Zhuang et al., FPGA'25) presented a comprehensive MLIR-based flow for AMD's AIEs, demonstrating custom dialects for managing parallelism and memory hierarchies.<sup>2</sup> Similarly, the HETOCompiler work (arXiv:2407.09333) introduced a generic hyper dialect within MLIR to abstract data management and parallel computation for general heterogeneous platforms.<sup>6</sup>
- Integrated Data Analysis Pipelines: The DAPHNE project (Damme et al., CIDR'22) pioneered the use of MLIR to build a unified system for pipelines combining ETL, ML, and HPC tasks, showcasing MLIR's potential to bridge data management and high-performance computation.<sup>34</sup>
- Modular Compiler Construction and Optimization: Research continues on leveraging MLIR for building more modular and reusable compiler components. The work by Vasilache et al. (LCPC'22/arXiv'22) focused on composable and modular code generation techniques within MLIR, particularly for tensor compilers.<sup>18</sup> Performance studies, such as achieving near-peak theoretical performance for DGEMM using MLIR-based code generation, demonstrate the effectiveness of these approaches.<sup>60</sup>
- Compiler Robustness and Testing: As MLIR's complexity grows, ensuring its correctness becomes crucial. Recent research has focused on developing specialized fuzzing and testing techniques tailored for MLIR's multi-dialect structure.
Projects like MLIRSmith, MLIRod, and DESIL aim to automatically generate or mutate MLIR code to uncover bugs, including challenging "silent bugs" (incorrect results without crashes) and undefined behavior (UB) arising from dialect interactions or lowering processes.<sup>5</sup>
- Hardware Synthesis: MLIR, particularly through the CIRCT project, is being explored for high-level synthesis (HLS), translating high-level languages like Julia directly into hardware description languages like Verilog.<sup>22</sup>

This research activity indicates a maturing MLIR ecosystem. While early work focused on establishing the core infrastructure and basic AI compilation, recent efforts are tackling more advanced challenges: managing heterogeneity, integrating data processing, enhancing compiler programmability and robustness, and extending MLIR's reach into adjacent domains like hardware design.

#### 7.2 Notable Open Source Projects & Libraries

MLIR's success is intrinsically linked to its vibrant open-source ecosystem.
Key projects and libraries leveraging MLIR include:

- Core Infrastructure: The upstream LLVM/MLIR project itself remains the central hub.<sup>15</sup>
- Framework Integration:
  - Torch-MLIR: Provides the bridge for lowering PyTorch models to MLIR dialects.<sup>62</sup>
  - OpenXLA: An ecosystem encompassing XLA (compiler), IREE (compiler+runtime), and StableHLO (portability dialect), heavily utilizing MLIR for compiling TensorFlow, PyTorch, and JAX.<sup>52</sup>
  - ONNX-MLIR: A dedicated project for compiling ONNX models via an MLIR ONNX dialect.<sup>88</sup>
- Hardware Backends & Toolchains:
  - **TPU-MLIR:** Open-source compiler for Sophgo TPUs.<sup>87</sup>
  - **tt-mlir:** Tenstorrent's open-source MLIR compiler components.<sup>58</sup>
  - Peano: AMD's open-source LLVM backend for Ryzen AI NPUs.<sup>15</sup>
  - CIRCT: A sub-project focused on MLIR dialects and tools for circuit design and HLS.<sup>22</sup>
- Specialized Domains:
  - HEIR: Developing MLIR dialects and tools for compiling Homomorphic Encryption computations.<sup>29</sup>
  - Substrait-MLIR: Building an MLIR dialect for the Substrait database query plan representation.<sup>13</sup>

The diversity of these projects, spanning framework integration, hardware enablement, and specialized computational domains, validates MLIR's role as a versatile and powerful foundational technology. It provides the essential building blocks <sup>1</sup> that enable various communities and companies to construct tailored compiler solutions, fulfilling its promise as a reusable and extensible infrastructure.<sup>1</sup>

#### 7.3 Community Engagement

A thriving community is essential for the continued development and adoption of an open-source project like MLIR. Key engagement mechanisms include:

- LLVM Developers' Meetings: These biannual conferences are major events for the entire LLVM community, including MLIR.
They feature technical talks, tutorials, workshops (often with dedicated MLIR tracks), panels, and networking opportunities.<sup>17</sup> Presentations cover topics ranging from core MLIR features like bufferization <sup>18</sup> and pattern rewriting <sup>18</sup> to specific applications and dialect developments.
- Open Design Meetings: Historically, regular online Open Design Meetings provided a forum for discussing MLIR's evolution and design proposals, fostering collaboration between Google's initial team and external contributors.<sup>3</sup>
- **LLVM Discourse:** The primary platform for online discussion, questions, proposals (RFCs), and announcements related to MLIR and LLVM.<sup>16</sup> This forum replaced older mailing lists, offering better organization and features.
- Tutorials and Documentation: While the official MLIR documentation provides language references and some tutorials (e.g., Toy language, mlir-opt usage, dialect creation) <sup>8</sup>, the rapid pace of development means documentation and introductory materials can sometimes lag.<sup>92</sup> Community members and projects like HEIR often contribute additional tutorials and talks.<sup>29</sup>

These avenues facilitate knowledge sharing, collaborative design, and the growth of the MLIR user and developer base.

# 8. Comparative Analysis and Future Outlook

#### 8.1 Comparing MLIR-based Approaches

The flexibility inherent in MLIR means that there isn't a single, monolithic "MLIR approach." Instead, different projects and vendors leverage the infrastructure in diverse ways, leading to varied architectural patterns:

- MLIR vs. Precursors/Alternatives (TVM, Glow): Projects like Apache TVM and Facebook's Glow were early pioneers in ML compilation, addressing the need for optimizing framework graphs for diverse hardware.
<sup>55</sup> TVM, in particular, popularized influential concepts in ML compilation like the separation of algorithm and schedule <sup>46</sup> and employed techniques like autotuning extensively.<sup>56</sup> However, TVM faced challenges in keeping pace with rapidly evolving hardware (especially specialized units like Tensor Cores), suffered from fragmentation as vendors created incompatible forks, and its development slowed relative to framework evolution.<sup>56</sup> MLIR, emerging slightly later, focused heavily on providing a robust, multi-level infrastructure with dialects, aiming for greater modularity and extensibility from the outset. While TVM and Glow were initially more focused on being end-to-end solutions, MLIR positioned itself as a framework for building such solutions.<sup>94</sup> There is potential for interoperability, perhaps by defining TVM dialects within MLIR or translating between their respective IRs.<sup>94</sup> Recent research also suggests that MLIR-based autotuning approaches (potentially leveraging the Transform dialect) might achieve comparable results with significantly fewer samples than TVM's methods.<sup>95</sup>
- Hardware Backend Strategies: Hardware vendors exhibit different MLIR adoption strategies. Tenstorrent represents a deep, MLIR-native approach, building its entire TT-Forge compiler around custom MLIR dialects.<sup>58</sup> NVIDIA integrates MLIR more incrementally, using standard (gpu) and custom (nvgpu) dialects as intermediate layers above its existing, mature NVVM IR and PTX generation backend.<sup>26</sup> Graphcore appears to use MLIR as a component for specific high-level optimizations within its broader, C++-based Poplar graph compiler framework, which relies on LLVM for core-level code generation.<sup>71</sup> AMD employs MLIR strategically across different product lines, developing specific backends (Peano for NPUs <sup>15</sup>) and full compilation flows (ARIES for AIEs <sup>2</sup>).
- Data Pipeline Strategies: For data processing, the DAPHNE project exemplifies an ambitious approach, using MLIR to build an integrated system covering ETL, ML, and HPC.<sup>34</sup> A more common, perhaps pragmatic, approach involves using MLIR to optimize specific compute-intensive kernels within a larger, traditional ETL workflow managed by tools like Spark or Airflow.

This diversity demonstrates MLIR's adaptability. It functions as a versatile toolkit rather than a prescriptive solution. Different users select and combine MLIR's components (dialects, passes, infrastructure) based on their specific needs, legacy systems, and target domains, leading to varied integration depths and architectural choices.

#### 8.2 Synthesizing Major Trends and Breakthroughs (Last 2-3 Years)

Analyzing the developments from 2022-2025 reveals several significant trends and breakthroughs shaping the MLIR landscape:

- Trend 1: Standardization via Interfaces: A clear trend is the push towards standardizing the interfaces between ML frameworks and MLIR-based compilers.
StableHLO is emerging as the de facto standard input dialect for compilers like XLA and IREE, promoting framework portability.<sup>52</sup> Simultaneously, PJRT is gaining traction as the standard runtime interface, allowing frameworks to dynamically load and interact with different hardware backends in a plug-and-play manner.<sup>59</sup>
- Trend 2: Proliferation of Hardware-Specific Dialects: As more hardware vendors adopt MLIR, there is a corresponding increase in the creation of custom, vendor-specific dialects (e.g., nvgpu, amdgpu, TTIR, TPU, ADF) designed to expose unique hardware features and enable targeted optimizations within the MLIR framework.<sup>2</sup>
- Trend 3: Rise of Explicit Compiler Control: The development and application of the Transform dialect represent a significant shift towards giving performance engineers direct, programmable control over the compilation process.<sup>46</sup> Its successful use in debugging, fine-grained optimization, and autotuning integration indicates growing adoption.
- Trend 4: Broadening Scope: MLIR's application space is expanding considerably beyond its initial focus on core ML model compilation. Active research and development are applying MLIR to data analysis pipelines (DAPHNE <sup>34</sup>), hardware design and synthesis (CIRCT <sup>22</sup>), quantum computing (CUDA Quantum <sup>27</sup>), and homomorphic encryption (HEIR <sup>29</sup>).
- Breakthrough 1: Achieving Critical Mass: MLIR has firmly established itself as the foundational compiler infrastructure underpinning major industry efforts in AI compilation, including OpenXLA, Torch-MLIR, and numerous vendor-specific toolchains. Its adoption by key players across the hardware and software spectrum signifies it has reached critical mass.
- Breakthrough 2: Demonstrating Performance: MLIR-based compilation techniques have proven capable of generating highly optimized code, achieving performance close to theoretical hardware peaks for critical computational kernels like GEMM, demonstrating MLIR's viability for high-performance computing tasks.<sup>60</sup>

### 8.3 Future Directions and Challenges

Despite its successes, MLIR faces ongoing challenges and has clear areas for future development:

- Addressing Fragmentation and Identity: The "dialect explosion" and the ambiguity between MLIR as core infrastructure versus an AI solution require careful management.<sup>3</sup> Continued efforts in community governance, potentially through mechanisms like the LLVM Area Teams, are needed to ensure coherence, manage dialect contributions effectively, and perhaps clarify the boundaries between the domain-agnostic core and domain-specific extensions.<sup>3</sup> Robust mechanisms for dialect versioning and ensuring compatibility between different MLIR components (framework frontends, dialects, backends) are crucial to avoid issues like the TOSA v1.0 breakage.<sup>61</sup>
- Improving Usability and Accessibility: While powerful, MLIR currently requires significant compiler expertise.<sup>47</sup> Making the infrastructure more accessible to domain experts (e.g., ML researchers, data scientists) who are not compiler specialists is important for broader adoption.
This could involve developing higher-level abstractions, improving tooling, enhancing documentation and tutorials, or further developing programmable interfaces like the Transform dialect.<sup>47</sup>
- Maturing Data Pipeline Integration: While projects like DAPHNE and Substrait-MLIR show promise, MLIR's capabilities for handling the full spectrum of ETL tasks (especially data extraction, complex cleaning, orchestration) need further development to compete with dedicated data engineering frameworks.<sup>13</sup> Defining more comprehensive dialects or libraries for common data processing tasks could be beneficial.
- Enhancing End-to-End Optimization: Realizing the full potential of MLIR requires enabling more holistic optimizations that span across different dialects, abstraction levels, and pipeline stages (e.g., co-optimizing data layout based on compute patterns, fusing data transformations with model layers). This requires sophisticated analyses and transformation capabilities that can operate across dialect boundaries.
- Debugging and Verification: As compilation flows become more complex, involving multiple dialects and intricate lowering paths, robust tools and techniques for debugging transformations and verifying the correctness of the generated code are essential.<sup>5</sup> Continued research in areas like MLIR fuzzing and formal verification is needed.

MLIR has successfully established a powerful and flexible foundation. The next phase of its evolution will likely focus on refining the ecosystem built upon this foundation, improving the developer experience, enhancing its capabilities in adjacent domains like data processing, and tackling the complexities arising from its own success to fully realize its potential as a unifying force in compilation technology.

# 9. Conclusion

Over the past three years, Multi-Level Intermediate Representation (MLIR) has rapidly transitioned from a promising research project to a cornerstone technology underpinning the modern AI compilation landscape. Its emergence was driven by the fundamental need for a more flexible, extensible, and modular compiler infrastructure capable of handling the growing complexity of AI models and the increasing diversity of hardware accelerators in the post-Moore's Law era.

MLIR's core contribution lies in its dialect-based architecture, which enables the representation of computation at multiple levels of abstraction within a single, unified framework. This has proven instrumental in bridging the gap between high-level AI frameworks (TensorFlow, PyTorch, JAX) and the specifics of hardware targets ranging from NVIDIA and AMD GPUs to specialized accelerators from Tenstorrent, AMD (NPUs and AIEs), and Sophgo (TPUs). Vendors are increasingly leveraging MLIR to build specialized compilers, often defining custom dialects to expose unique hardware capabilities, thereby accelerating the enablement of standard ML frameworks on their platforms.

Key trends during this period include a concerted effort towards standardization through common interfaces like StableHLO and PJRT, aiming to decouple frameworks from backends and enhance portability. Concurrently, the proliferation of hardware-specific dialects highlights MLIR's role in enabling hardware innovation. The development and application of the Transform dialect mark a significant advancement, offering performance engineers unprecedented fine-grained, programmable control over the compilation process itself. Furthermore, MLIR's scope has demonstrably broadened beyond core ML compilation, with active research and development extending its application to data analysis pipelines, hardware design, quantum computing, and cryptography.

While MLIR's foundational role appears secure, challenges remain.
Managing the complexity and potential fragmentation arising from its own extensibility, improving usability for a wider range of developers, deepening its integration with data processing workflows, and enhancing end-to-end optimization capabilities are critical areas for future work. Nonetheless, MLIR has fundamentally reshaped the compiler landscape for AI and heterogeneous computing. Its trajectory indicates it will continue to be a driving force behind innovation, enabling the efficient deployment of increasingly sophisticated AI models on current and future generations of computing hardware.

#### Works cited

1. MLIR: A Compiler Infrastructure for the End of Moore's Law AI Resources - Modular, accessed April 15, 2025, https://www.modular.com/ai-resources/mlir-a-compiler-infrastructure-for-the-end-of-moore-s-law
2. www.csl.cornell.edu, accessed April 15, 2025, https://www.csl.cornell.edu/~zhiruz/pdfs/aries-fpga2025.pdf
3. Democratizing AI Compute, Part 8: What about the MLIR compiler infrastructure? Modular, accessed April 15, 2025, https://www.modular.com/blog/democratizing-ai-compute-part-8-what-about-the-mlir-compiler-infrastructure
4. MLIR Part 1 Introduction to MLIR and Modern Compilers Stephen Diehl, accessed April 15, 2025, https://www.stephendiehl.com/posts/mlir_introduction/
5. MLIR generic representation for polynomial multiplication using affine... ResearchGate, accessed April 15, 2025, https://www.researchgate.net/figure/MLIR-generic-representation-for-polynomial-multiplication-using-affine-and-std-dialects-fig2-349993972
6. A Method for Efficient Heterogeneous Parallel Compilation: A Cryptography Case Study, accessed April 15, 2025, https://arxiv.org/html/2407.09333v2
7. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation Reliable Computer Systems University of Waterloo, accessed April 15, 2025, https://rcs.uwaterloo.ca/~ali/cs842-s23/papers/mlir.pdf
8. MLIR Language Reference, accessed April 15, 2025, https://mlir.llvm.org/docs/LangRef/
9. MLIR TensorFlow, accessed April 15, 2025, https://www.tensorflow.org/mlir
10. DESIL: Detecting Silent Bugs in MLIR Compiler Infrastructure arXiv, accessed April 15, 2025, https://arxiv.org/html/2504.01379v1
11. [2210.15016] TPU-MLIR: A Compiler For TPU Using MLIR ar5iv arXiv, accessed April 15, 2025, https://ar5iv.labs.arxiv.org/html/2210.15016
12. MLIR CodeGen Dialects for Machine Learning Compilers Lei.Chat(), accessed April 15, 2025, https://www.lei.chat/posts/mlir-codegen-dialects-for-machine-learning-compilers/
13. substrait-io/substrait-mlir-contrib GitHub, accessed April 15, 2025, https://github.com/substrait-io/substrait-mlir-contrib
14. Defining Dialects MLIR, accessed April 15, 2025, https://mlir.llvm.org/docs/DefiningDialects/
15. LLVM Had Another Exciting Year With More Than 37k Commits, 35.5 Million Lines, accessed April 15, 2025, https://www.phoronix.com/news/LLVM-Code-Activity-2024
16. Discourse Migration Guide LLVM 19.0.0git documentation, accessed April 15, 2025, https://rocm.docs.amd.com/projects/llvm-project/en/docs-6.4.0/LLVM/llvm/html/DiscourseMigrationGuide.html
17. 2025 European LLVM Developers' Meeting Swoogo, accessed April 15, 2025, https://llvm.swoogo.com/2025eurollvm/
18. Matthias Springer's Homepage, accessed April 15, 2025, https://m-sp.org/
19. Building a JSONiq Query Optimizer using MLIR Research Collection, accessed April 15, 2025, https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/460014/thesis-mfiebig.pdf
20. Dialects MLIR, accessed April 15, 2025, https://mlir.llvm.org/docs/Dialects/
21. Should Julia use MLIR in the future? Internals & Design, accessed April 15, 2025, https://discourse.julialang.org/t/should-julia-use-mlir-in-the-future/110459
22. Hardware.jl An MLIR-based Julia HLS Flow (Work in Progress) arXiv, accessed April 15, 2025, https://arxiv.org/html/2503.09463v1
23. Hardware.jl An MLIR-based Julia HLS Flow (Work in Progress) Capra, accessed April 15, 2025, https://capra.cs.cornell.edu/latte25/paper/5.pdf
24. Quickstart tutorial to adding MLIR graph rewrite, accessed April 15, 2025, https://mlir.llvm.org/docs/Tutorials/QuickstartRewrites/
25. Creating a Dialect MLIR, accessed April 15, 2025, https://mlir.llvm.org/docs/Tutorials/CreatingADialect/
26. 'gpu' Dialect MLIR, accessed April 15, 2025, https://mlir.llvm.org/docs/Dialects/GPU/
27. Create your Own MLIR Pass NVIDIA CUDA Quantum documentation GitHub Pages, accessed April 15, 2025, https://nvidia.github.io/cuda-quantum/0.4.0/using/advanced/mlir_pass.html
28. cuda-quantum/Overview.md at main GitHub, accessed April 15, 2025, https://github.com/NVIDIA/cuda-quantum/blob/main/Overview.md
29. Tutorials and Talks HEIR, accessed April 15, 2025, https://heir.dev/docs/tutorials/
30. Optimizing ETL Pipelines: Best Practices, Tools & Architecture for Efficient Data Workflow, accessed April 15, 2025, https://www.acceldata.io/blog/etl-pipelines-key-concepts-components-and-best-practices
31. How to Build ETL Data Pipeline in ML Neptune.ai, accessed April 15, 2025, https://neptune.ai/blog/build-etl-data-pipeline-in-ml
32. tf.data: A Machine Learning Data Processing Framework VLDB Endowment, accessed April 15, 2025, https://vldb.org/pvldb/vol14/p2945-klimovic.pdf
33. How to Optimize Your ETL Pipeline for Maximum Efficiency DEV Community, accessed April 15, 2025, https://dev.to/chainguns/how-to-optimize-your-etl-pipeline-for-maximum-efficiency-3b56
34. www.cidrdb.org, accessed April 15, 2025, https://www.cidrdb.org/cidr2022/papers/p4-damme.pdf
35. [2101.12127] tf.data: A Machine Learning Data Processing Framework, accessed April 15, 2025, https://ar5iv.labs.arxiv.org/html/2101.12127
36. Data Preprocessing and Data Cleaning. | by Prabesh Sharma | Medium, accessed April 15, 2025, https://medium.com/@sharmaprabesh027/data-preprocessing-and-data-cleaning. q-de318cb7b1b5
37. Multilingual Information Retrieval | PDF Scribd, accessed April 15, 2025, https://www.scribd.com/document/689704273/Multilingual-Information-Retrieval
38. Data Cleaning and Preprocessing in Machine Learning CodeSignal, accessed April 15, 2025, https://codesignal.com/learn/courses/data-cleaning-and-preprocessing-in-machine-learning
39. 9 Preprocessing Applied Machine Learning Using mlr3 in R, accessed April 15, 2025, https://mlr3book.mlr-org.com/chapters/chapter9/preprocessing.html
40. Chapter 3: High-level Language-Specific Analysis and Transformation MLIR, accessed April 15, 2025, https://mlir.llvm.org/docs/Tutorials/Toy/Ch-3/
41. Memory Optimization and Profiling for MLIR-Based HeteroCL CS@Cornell, accessed April 15, 2025, https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/hcl-mlir/
42. MLIR-based Data Tiling and Packing for Ryzen AI NPU FOSDEM 2025, accessed April 15, 2025, https://fosdem.org/2025/schedule/event/fosdem-2025-6641-mlir-based-data-tiling-and-packing-for-ryzen-ai-npu/
43. LTO and Data Layout Optimizations in MLIR LLVM.org, accessed April 15, 2025, https://llvm.org/devmtg/2021-02-28/slides/Prashantha-MLIR-LTO.pdf
44. Using `mlir-opt`, accessed April 15, 2025, https://mlir.llvm.org/docs/Tutorials/MlirOpt/
45. MLIR A Global Optimization and Dataflow Analysis Math ∩ Programming, accessed April 15, 2025, https://www.jeremykun.com/2023/11/15/mlir-a-global-optimization-and-dataflow-analysis/
46. The MLIR Transform Dialect arXiv, accessed April 15, 2025, https://arxiv.org/html/2409.03864v2
47. The MLIR Transform Dialect: Your Compiler Is More Powerful Than You Think Michel Steuwer, accessed April 15, 2025, https://michel.steuwer.info/files/publications/2025/CGO-2025-2.pdf
48. The MLIR Transform Dialect Your compiler is more powerful than you think CGO 2025, accessed April 15, 2025, https://2025.cgo.org/details/cgo-2025-papers/7/The-MLIR-Transform-Dialect-Your-compiler-is-more-powerful-than-you-think
49. www.arxiv.org, accessed April 15, 2025, https://www.arxiv.org/pdf/2409.03864v2
50. [2409.03864] The MLIR Transform Dialect. Your compiler is more powerful than you think, accessed April 15, 2025, https://arxiv.org/abs/2409.03864
51. 2023 EuroLLVM Tutorial: Controllable Transformations in MLIR YouTube, accessed April 15, 2025, https://www.youtube.com/watch?v=P4qUi3QtH_Y
52. XLA architecture OpenXLA Project, accessed April 15, 2025, https://openxla.org/xla/architecture
53. XLA OpenXLA Project, accessed April 15, 2025, https://openxla.org/xla
54. OpenXLA is available now to accelerate and simplify machine learning | Google Open Source Blog, accessed April 15, 2025, https://opensource.googleblog.com/2023/03/openxla-is-ready-to-accelerate-and-simplify-ml-development.html
55. Accelerating ML through Compilation: Building an ML Compiler that Works | HTEC, accessed April 15, 2025, https://htec.com/insights/accelerating-ml-through-compilation-building-ml-compiler-that-works/
56. Democratizing AI Compute, Part 6: What about AI compilers (TVM and XLA)? Modular, accessed April 15, 2025, https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers
57. openxla/xla: A machine learning compiler for GPUs, CPUs, and ML accelerators GitHub, accessed April 15, 2025, https://github.com/openxla/xla
58. TT-Forge™ Tenstorrent, accessed April 15, 2025, https://tenstorrent.com/en/software/tt-forge
59. PJRT: Simplifying ML Hardware and Framework Integration | Google Open Source Blog, accessed April 15, 2025, https://opensource.googleblog.com/2023/05/pjrt-simplifying-ml-hardware-and-framework-integration.html
60. MLIR-based Code Generation for High-Performance Machine Learning on AArch64 Lund University Publications, accessed April 15, 2025, https://lup.lub.lu.se/student-papers/record/9146373/file/9146374.pdf
61. Support for LiteRT (TensorFlow Lite, .tflite) with TOSA 1.0 · Issue ..., accessed April 15, 2025, https://github.com/iree-org/iree/issues/19777
62. torch-mlir/docs/architecture.md at main · llvm/torch-mlir · GitHub, accessed April 15, 2025, https://github.com/llvm/torch-mlir/blob/main/docs/architecture.md
63. JAX Integration Completeness Milestone GitHub, accessed April 15, 2025, https://github.com/openxla/iree/milestone/33
64. An introduction to Torch-MLIR FOSDEM 2025, accessed April 15, 2025, https://fosdem.org/2025/schedule/event/fosdem-2025-6643-an-introduction-to-torch-mlir/
65. Quickstart JAX documentation, accessed April 15, 2025, https://docs.jax.dev/en/latest/quickstart.html
66. JAX and OpenXLA Part 1: Run Process and Underlying Logic Intel, accessed April 15, 2025, https://www.intel.com/content/www/us/en/developer/articles/technical/jax-openxla-running-process-and-underlying-logic-1.html
67. JAX and OpenXLA Part 1: Run Process and Underlying Logic Intel, accessed April 15, 2025, https://www.intel.com/content/www/us/en/developer/articles/technical/jax-and-openxla-run-process-and-underlying-logic-1.html
68. MLIR Sparsifier MPACT Research Group | Google for Developers, accessed April 15, 2025, https://developers.google.com/mlir-sparsifier/colabs/Sparse_JAX_CPU_Benchmark_Colab
69. AMD Talks Up IREE/MLIR Programming For Ryzen AI NPUs Reddit, accessed April 15, 2025, https://www.reddit.com/r/Amd/comments/1ij56pd/amd_talks_up_ireemlir_programming_for_ryzen_ai/
70. Democratizing AI Compute, Part 4: CUDA is the incumbent, but is it any good? Modular, accessed April 15, 2025, https://www.modular.com/blog/democratizing-ai-compute-part-4-cuda-is-the-incumbent-but-is-it-any-good
71. C4ML 2021, accessed April 15, 2025, https://www.c4ml.org/c4ml-2021
72. tenstorrent/tt-mlir GitHub, accessed April 15, 2025, https://github.com/tenstorrent/tt-mlir
73. Roadmap tt-mlir documentation, accessed April 15, 2025, https://docs.tenstorrent.com/tt-mlir/tt-explorer-roadmap.html
74. CUDA LLVM Compiler NVIDIA Developer, accessed April 15, 2025, https://developer.nvidia.com/cuda-llvm-compiler
75. 1\. Introduction NVVM IR Specification 12.8 documentation NVIDIA Docs, accessed April 15, 2025, https://docs.nvidia.com/cuda/nvvm-ir-spec/
76. 'nvgpu' Dialect MLIR, accessed April 15, 2025, https://mlir.llvm.org/docs/Dialects/NVGPU/
77. Highlights NVIDIA, accessed April 15, 2025, https://images.nvidia.com/nvimages/gtc/pdf/GTC2025_Highlights.pdf
78. Nvidia adds native Python support to CUDA Hacker News, accessed April 15, 2025, https://news.ycombinator.com/item?id=43581584
79. LLVM DISTRIBUTORS CONFERENCE 2021 GitHub, accessed April 15, 2025, https://raw.githubusercontent.com/ClangBuiltLinux/llvm-distributors-conf-2021/main/slides/graphcore.pdf
80. IPU Processors Graphcore, accessed April 15, 2025, https://www.graphcore.ai/products/ipu
81. 1\. Introduction Poplar SDK Overview Graphcore Documents, accessed April 15, 2025, https://docs.graphcore.ai/projects/sdk-overview/en/latest/overview.html
82. Poplar® Software Graphcore, accessed April 15, 2025, https://www.graphcore.ai/products/poplar
83. POPLAR OVERVIEW Graphcore, accessed April 15, 2025, https://www.graphcore.ai/hubfs/assets/Poplar%C2%81%20technical%20overview%20NEW%20BRAND.pdf
84. graphcore/poplibs: Poplar libraries GitHub, accessed April 15, 2025, https://github.com/graphcore/poplibs
85. 5.1. Poplar Tutorial 1: Programs and Variables Graphcore Documents, accessed April 15, 2025, https://docs.graphcore.ai/projects/tutorials/en/latest/poplar/tut1_variables/README.html
86. 1\. Introduction Poplar and PopLibs User Guide Graphcore Documents, accessed April 15, 2025, https://docs.graphcore.ai/projects/poplar-user-guide/en/latest/introduction.html
87. Machine learning compiler based on MLIR for Sophgo TPU. GitHub, accessed April 15, 2025, https://github.com/sophgo/tpu-mlir
88. onnx/onnx-mlir: Representation and Reference Lowering of ... GitHub, accessed April 15, 2025, https://github.com/onnx/onnx-mlir
89. intel/graph-compiler: MLIR-based toolkit targeting intel heterogeneous hardware GitHub, accessed April 15, 2025, https://github.com/intel/graph-compiler
90. HETOCompiler: An MLIR-based crypTOgraphic Compilation Framework for HEterogeneous Devices arXiv, accessed April 15, 2025, https://arxiv.org/html/2407.09333v1
91. High Performance Code Generation in MLIR: An early case study with GEMM YouTube, accessed April 15, 2025, https://www.youtube.com/watch?v=boXl7rmaasU
92. MLIR Getting Started Math ∩ Programming, accessed April 15, 2025, https://www.jeremykun.com/2023/08/10/mlir-getting-started/
93. Compilers: Talking to The Hardware Unify AI, accessed April 15, 2025, https://unify.ai/blog/deep-learning-compilers
94. Google lasted work: MLIR Primer Development Apache TVM Discuss, accessed April 15, 2025, https://discuss.tvm.apache.org/t/google-lasted-work-mlir-primer/1721
95. ML2Tuner: Efficient Code Tuning via Multi-Level Machine Learning Models arXiv, accessed April 15, 2025, https://arxiv.org/html/2411.10764v1