Method Fusion in Graal

VM Team @ Twitter

August 2018

Overview

Problem Statement

A common approach in Scala is to provide APIs that allow users to express computations in a composable way. For instance, the Scala Standard Library has collections that can be easily manipulated through a chain of transformations:

These APIs are convenient but introduce considerable performance overhead. In this example, a new collection is created for each `map` call. This is a design decision based on two characteristics of the the collection:

`List` is immutable, so it can not apply the transformations in place without creating new collection instances
`List` is strict, so it needs to execute the transformation immediately and can’t accumulate multiple transformations to apply them efficiently in one pass (like Java streams)

This efficiency issue is also present in APIs like Future, Try, and others. The implementation has no knowledge of the method calls that will immediately follow and has to fully apply each transformation in isolation. For instance, Future has to create new instances on each transformation and add them to the waitlist of the previous futures, which requires at least a volatile read, a compare-and-swap, and multiple other allocations for scheduling of the execution.

This issue can be avoided by fusing multiple transformations into one:

This version of the method is more efficient since it can avoid the intermediate collection allocations. The benchmark results confirm that:

(benchmarks using JMH 1.21, which supports Graal, and Graal 1.0.0 RC2 CE built from source)

This design doc presents a generic solution using the Graal JIT compiler to fuse transformations. This optimization will be highly beneficial to the Scala community since it enables more efficient execution of many high-level abstractions.

High-Level Approach^[a]^[b]^[c]

The Graal Just-In-Time Compiler (JIT) is a good place to implement such optimization since it can leverage the runtime method inlining and fuse transformations across different inlined method bodies. The optimization will be driven based on user-defined fusion classes so developers can introduce new fusions without having to change Graal. For example, given this method:

The JIT compiler will detect sequential method calls and transform them into an intermediary internal state that introduces an indirection using a fusion class:

The fusion happens for the methods in `ListFusion` that have the same name and parameter types as the target class method. If the JIT finds multiple sequential calls to methods that can be fused, it creates the fusion class instance, delegates the transformations to it, and calls `fuse` at the end to produce the result.

Note: The example uses only `map` for simplicity, but any other method with a fused counterpart would also be fused.

Note that the user won’t write code using the fusion class directly, but we can use this benchmark method to observe its performance characteristics as if it were generated by a JIT compilation phase.

Given that the fusion class never escapes the scope of the method, it’s optimized to an execution graph similar to the manually fused version. It incurs into some throughput overhead since the graph contains nodes to trigger deoptimization, but it has similar allocation rate. Benchmark results:

The user can make the following assumptions when implementing a fusion class:

Instances of the fusion class don’t escape the scope of a method, so they don’t need to deal with concurrent access
Given that only sequential chains of method calls are fused, an instance of the fusion class is never reused

These two assumptions open the possibility for a few different approaches to implementing the fusion. For instance, for list fusion, it’s possible to use an array to mutate it in place on each transformation or accumulate multiple transformations to apply them when the `fuse` method is called. For the purpose of this benchmark the latter approach is used:

For completeness, this is the JMH benchmark class:

Note that the examples are in Scala but the mechanism is generic to other JVM languages.

Prior Art

Method fusion is sometimes also known as “Operation Fusion”, “Operator Fusion”, or “Transformation Fusion”. The topic has been researched before and a few compilers are capable of performing such optimization.

A Transformation System for Developing Recursive Programs

Pioneering paper from 1977 that introduces fusion

Operation fusion and deforestation for Scala

Gives an overview of method fusion approaches in Haskell and explores generic algorithms to implement fusion in Scala

Shortcut Fusion of Monadic Programs

Describes the Haskell compiler mechanism that allows users to introduce fusion rules

Utilizing Static Analysis and Code Generation to Accelerate Neural Networks

Overview of a few optimizations for neural networks, including performance results for method fusion

TensorFlow XLA-JIT

Tensorflow has a JIT compiler that implements method fusion

Components

User libraries

API Library

An isolated jar without dependencies on Graal will be provided to users so they can implement custom method fusions. It’ll provide the `Fusion` interface:

All fusion definitions should implement this interface and also provide a static method (companion object method in Scala) called `apply` that receives the initial value of the method fusion chain.

The library will provide a class annotation to activate the fusion:

The user would activate the fusion through an annotation on the target class:

It’d be possible to implement a configuration-based mechanism to activate fusions, but a static annotation on the target class would force the API developer to ship the fusion logic together with the API implementation and maintain the fusion behavior compatible with the target class implementation. For example, a fusion implementation for the Scala 2.12 `List` might not work correctly with the Scala 2.13 `List`, since it was reimplemented from scratch.

Test Library

A separate test module with Graal dependencies will be provided so developers can test their fusion logic. It’ll have a single class with two static methods:

Note: `Supplier` is the Java equivalent to `() => T` in Scala, and `BiConsumer` is equivalent to

`(T, T) => Unit`.

The fusion developer will use the `test` method to unit test that the fused execution matches the original behavior and returns the same value.

The `verify` method will perform an analysis of the target and fusion classes to detect inconsistencies that could make the fusion not be applied at runtime. It’ll check:

If the target class has the method fusion annotation
If the fusion class has only methods that match the target class methods
If the fusion class implements the `Fusion` interface and has a static `apply` method with the correct signature

Eventually the `verify` method could be replaced by an annotation processor that performs these validations at compile time.

JIT compilation phase

A new compilation phase will be introduced to apply the method fusions. It’ll be executed by the High Tier before the inlining phase and perform the following steps:

Collect all @MethodFusion annotations of the classes used by the graph
Call `setUseForInlining(false)` on all invoke nodes that have a fusion class counterpart, preventing their inlining
Run the inlining phase to inline non-fused methods
For each fusion class:

Find sequential chains of invoke nodes eligible for fusion
Introduce a call to the static `apply` method at the beginning of each chain
Reassign the invokes to the fusion class counterparts
Introduce a call to the `fuse` method at the end of the invoke sequences
Mark fused methods `setUseForInlining(true)`
Run the inlining phase

The phase will apply the method fusions sequentially according to the order that their target class appears in the graph. There won’t be a mechanism to prioritize fusions. Also, each fusion will be applied only once. In the future, these limitations can be lifted and a better iterative algorithm with prioritization could be developed to improve the interaction between different method fusions.

Non-Goals

The following is outside of scope:

Provide fusion classes for Scala APIs
Avoid the deoptimization overhead due to the usage of the fusion classes
Compile-time validation of the fusion class
Prioritization or custom scheduling of fusion transformations
Disabling the method fusion after the JVM startup

Runtime Configuration

The fusion phase will be disabled by default. To enable it, users need to start the JVM with the option `-Dgraal.MethodFusion=true`. It won’t be possible to disable the method fusion after the JVM is running.

Monitoring

The method fusion phase will maintain debug counters to monitor the number of times each method fusion is applied. It’s an open question if these counters can be exported for monitoring.

Milestones

M1 - Gather statistics

Develop a few stub fusion classes and gather statistics on how many times they’d be applied to a Scala application.

M2 - Implement prototype

Implement a working prototype that is capable of performing method fusion for a single target class. This prototype might not be fully customizable and won’t include the user-side libraries.

M3 - Implement API module and Method Fusion phase

Create the user API library and implement the generic method fusion phase based on custom fusion classes.

M4 - Implement the Test Module

Develop the Test module.

Alternatives Considered

Scala compiler plugin

The fusion could be performed statically using a Scala compiler plugin, but it wouldn’t be as effective since the compiler plugin can only fuse method calls within a method body. The JIT is a better place to implement such optimization since it can leverage the runtime method inlining and fuse transformations across different inlined method bodies.

Instrumentation using a java agent

It’d be possible to implement a java agent that transforms the classes bytecode to perform method fusion. This approach would have the same limitation of the scala compiler plugin since it’d be able to do fusion only within a class method, without considering inlined methods.

Generic fusion without user intervention

Most method fusion papers explore generic method fusions that don’t require the user’s intervention. They make assumptions regarding the APIs to perform fusion. The approach presented by this design document is more flexible since it delegates to the user implementation of the fusion logic, which makes it easier to maintain the fusion behavior similar to the non-fused execution. Also, developers will ship their fusion class implementation with the original API, making it possible to have different fusion behaviors depending on the version of the API.

Ad-hoc JIT optimizations

We explored an approach that would allow users to declare code patterns and provide the optimized version of them. This approach proved to be complicated because of the graph matching and graph transformation. It would also have performance issues since it’d have to match all graphs against all optimization rules.

Testing

Correctness testing

A large number of unit tests, known internally at Twitter as the Sandbox, will be executed with method fusion enabled. Also, integration tests will be executed for the largest systems running in Scala.

Performance testing

The method fusion will be tested with a Scala application. To compare performance, two instances will be deployed using the same hardware platform: one with method fusion and another without method fusion for baseline comparison. Both CPU utilization and allocation rates should be reduced.

Canary testing

Once correctness and performance are validated for a specific system, a canary instance will be deployed to production. The canary will be observed for a few hours to verify behavior and performance.

Graceful Degradation

If the method fusion fails, all other compilation phases should be executed as if the method fusion wasn’t enabled. The error should be logged for analysis.

Risks

Graph manipulation

There aren’t similar compilation phases that manipulate the graph as it is necessary for method fusion. The implementation of the compilation phase is challenging and we’ll probably need help from the Graal developers at Oracle.

Open Questions

StringBuilder optimization

Graal EE has an optimization that merges sequential calls to `StringBuilder.append`. It might be possible to reuse this optimization code to implement the method fusion, but its source code is closed.

[a]This phase changes the semantics of the `map` operation when closures have side effects. Are you going to detect if functions are side-effect free?

[b]Good question. Indeed, the `ListFusion` example here wouldn't keep the same behavior as `List` if side effects are involved. I'll add a note about it.

Keeping behavior will be a responsibility of the developer who works on the fusion class, the compilation phase wouldn't detect side effects.

I'd probably use arrays and apply transformations in place if I were implementing a list fusion.

[c]It is worth mentioning `list.iterator.map(f1).map(f2).map(f3).toList` as a simple/existing option for this.