Method Fusion in Graal
VM Team @ Twitter
August 2018
A common approach in Scala is to provide APIs that allow users to express computations in a composable way. For instance, the Scala Standard Library has collections that can be easily manipulated through a chain of transformations:
These APIs are convenient but introduce considerable performance overhead. In this example, a new collection is created for each `map` call. This is a design decision based on two characteristics of the the collection:
This efficiency issue is also present in APIs like Future, Try, and others. The implementation has no knowledge of the method calls that will immediately follow and has to fully apply each transformation in isolation. For instance, Future has to create new instances on each transformation and add them to the waitlist of the previous futures, which requires at least a volatile read, a compare-and-swap, and multiple other allocations for scheduling of the execution.
This issue can be avoided by fusing multiple transformations into one:
This version of the method is more efficient since it can avoid the intermediate collection allocations. The benchmark results confirm that:
(benchmarks using JMH 1.21, which supports Graal, and Graal 1.0.0 RC2 CE built from source)
This design doc presents a generic solution using the Graal JIT compiler to fuse transformations. This optimization will be highly beneficial to the Scala community since it enables more efficient execution of many high-level abstractions.
The Graal Just-In-Time Compiler (JIT) is a good place to implement such optimization since it can leverage the runtime method inlining and fuse transformations across different inlined method bodies. The optimization will be driven based on user-defined fusion classes so developers can introduce new fusions without having to change Graal. For example, given this method:
The JIT compiler will detect sequential method calls and transform them into an intermediary internal state that introduces an indirection using a fusion class:
The fusion happens for the methods in `ListFusion` that have the same name and parameter types as the target class method. If the JIT finds multiple sequential calls to methods that can be fused, it creates the fusion class instance, delegates the transformations to it, and calls `fuse` at the end to produce the result.
Note: The example uses only `map` for simplicity, but any other method with a fused counterpart would also be fused.
Note that the user won’t write code using the fusion class directly, but we can use this benchmark method to observe its performance characteristics as if it were generated by a JIT compilation phase.
Given that the fusion class never escapes the scope of the method, it’s optimized to an execution graph similar to the manually fused version. It incurs into some throughput overhead since the graph contains nodes to trigger deoptimization, but it has similar allocation rate. Benchmark results:
The user can make the following assumptions when implementing a fusion class:
These two assumptions open the possibility for a few different approaches to implementing the fusion. For instance, for list fusion, it’s possible to use an array to mutate it in place on each transformation or accumulate multiple transformations to apply them when the `fuse` method is called. For the purpose of this benchmark the latter approach is used:
For completeness, this is the JMH benchmark class:
Note that the examples are in Scala but the mechanism is generic to other JVM languages.
Method fusion is sometimes also known as “Operation Fusion”, “Operator Fusion”, or “Transformation Fusion”. The topic has been researched before and a few compilers are capable of performing such optimization.
A Transformation System for Developing Recursive Programs
Pioneering paper from 1977 that introduces fusion
Operation fusion and deforestation for Scala
Gives an overview of method fusion approaches in Haskell and explores generic algorithms to implement fusion in Scala
Shortcut Fusion of Monadic Programs
Describes the Haskell compiler mechanism that allows users to introduce fusion rules
Utilizing Static Analysis and Code Generation to Accelerate Neural Networks
Overview of a few optimizations for neural networks, including performance results for method fusion
Tensorflow has a JIT compiler that implements method fusion
An isolated jar without dependencies on Graal will be provided to users so they can implement custom method fusions. It’ll provide the `Fusion` interface:
All fusion definitions should implement this interface and also provide a static method (companion object method in Scala) called `apply` that receives the initial value of the method fusion chain.
The library will provide a class annotation to activate the fusion:
The user would activate the fusion through an annotation on the target class:
It’d be possible to implement a configuration-based mechanism to activate fusions, but a static annotation on the target class would force the API developer to ship the fusion logic together with the API implementation and maintain the fusion behavior compatible with the target class implementation. For example, a fusion implementation for the Scala 2.12 `List` might not work correctly with the Scala 2.13 `List`, since it was reimplemented from scratch.
A separate test module with Graal dependencies will be provided so developers can test their fusion logic. It’ll have a single class with two static methods:
Note: `Supplier` is the Java equivalent to `() => T` in Scala, and `BiConsumer` is equivalent to
`(T, T) => Unit`.
The fusion developer will use the `test` method to unit test that the fused execution matches the original behavior and returns the same value.
The `verify` method will perform an analysis of the target and fusion classes to detect inconsistencies that could make the fusion not be applied at runtime. It’ll check:
Eventually the `verify` method could be replaced by an annotation processor that performs these validations at compile time.
A new compilation phase will be introduced to apply the method fusions. It’ll be executed by the High Tier before the inlining phase and perform the following steps:
The phase will apply the method fusions sequentially according to the order that their target class appears in the graph. There won’t be a mechanism to prioritize fusions. Also, each fusion will be applied only once. In the future, these limitations can be lifted and a better iterative algorithm with prioritization could be developed to improve the interaction between different method fusions.
The following is outside of scope:
The fusion phase will be disabled by default. To enable it, users need to start the JVM with the option `-Dgraal.MethodFusion=true`. It won’t be possible to disable the method fusion after the JVM is running.
The method fusion phase will maintain debug counters to monitor the number of times each method fusion is applied. It’s an open question if these counters can be exported for monitoring.
Develop a few stub fusion classes and gather statistics on how many times they’d be applied to a Scala application.
Implement a working prototype that is capable of performing method fusion for a single target class. This prototype might not be fully customizable and won’t include the user-side libraries.
Create the user API library and implement the generic method fusion phase based on custom fusion classes.
Develop the Test module.
The fusion could be performed statically using a Scala compiler plugin, but it wouldn’t be as effective since the compiler plugin can only fuse method calls within a method body. The JIT is a better place to implement such optimization since it can leverage the runtime method inlining and fuse transformations across different inlined method bodies.
It’d be possible to implement a java agent that transforms the classes bytecode to perform method fusion. This approach would have the same limitation of the scala compiler plugin since it’d be able to do fusion only within a class method, without considering inlined methods.
Most method fusion papers explore generic method fusions that don’t require the user’s intervention. They make assumptions regarding the APIs to perform fusion. The approach presented by this design document is more flexible since it delegates to the user implementation of the fusion logic, which makes it easier to maintain the fusion behavior similar to the non-fused execution. Also, developers will ship their fusion class implementation with the original API, making it possible to have different fusion behaviors depending on the version of the API.
We explored an approach that would allow users to declare code patterns and provide the optimized version of them. This approach proved to be complicated because of the graph matching and graph transformation. It would also have performance issues since it’d have to match all graphs against all optimization rules.
A large number of unit tests, known internally at Twitter as the Sandbox, will be executed with method fusion enabled. Also, integration tests will be executed for the largest systems running in Scala.
The method fusion will be tested with a Scala application. To compare performance, two instances will be deployed using the same hardware platform: one with method fusion and another without method fusion for baseline comparison. Both CPU utilization and allocation rates should be reduced.
Once correctness and performance are validated for a specific system, a canary instance will be deployed to production. The canary will be observed for a few hours to verify behavior and performance.
If the method fusion fails, all other compilation phases should be executed as if the method fusion wasn’t enabled. The error should be logged for analysis.
There aren’t similar compilation phases that manipulate the graph as it is necessary for method fusion. The implementation of the compilation phase is challenging and we’ll probably need help from the Graal developers at Oracle.
Graal EE has an optimization that merges sequential calls to `StringBuilder.append`. It might be possible to reuse this optimization code to implement the method fusion, but its source code is closed.
[a]This phase changes the semantics of the `map` operation when closures have side effects. Are you going to detect if functions are side-effect free?
[b]Good question. Indeed, the `ListFusion` example here wouldn't keep the same behavior as `List` if side effects are involved. I'll add a note about it.
Keeping behavior will be a responsibility of the developer who works on the fusion class, the compilation phase wouldn't detect side effects.
I'd probably use arrays and apply transformations in place if I were implementing a list fusion.
[c]It is worth mentioning `list.iterator.map(f1).map(f2).map(f3).toList` as a simple/existing option for this.