Pratik Bhatu: Compile-Time Optimizations in Polly -- GSoC

Compile-Time Optimizations in Polly

Abstract

Improving the compile-time of polyhedral compilation tools and Polly has been an issue for some time. However, recently, lot of progress has been made. The aim of this project is to improve the compile-time of Polly by a large-factor. Some of the prominent changes I plan to make are: changes to the representations in the Integer Set Library (ISL) that Polly uses, and to modify the position of Polly in LLVM pass pipeline and it’s interactions with other passes.

Objective

The general objective of my project is to reduce the compile-time of Polly. A comparison of LLVM with options -O3 vs. -O3 -polly shows that there is scope for improvement in various aspects. This could include general work like changes to the ISL AST generator, or targeted improvements like the following:

The Integer Set Library that is used by Polly needs a multi-precision library for its arithmetic operations. Till recently, ISL relied upon the GMP library [8], while it currently uses the imath library [9]. Both these libraries use arbitrary precision integers. One objective of this project is to use standard fixed-width integers (long long/int_128) to implement these arithmetic operations and fall back to imath when these fixed width integers are insufficient.
This project would also improve the interactions of Polly with the other LLVM passes. Polly runs around 8 pre-optimization passes and so currently on non-loop programs, there is an overhead of around 2-10% of compile-time. The other objective of this project would be to schedule Polly later in the pipeline, when the necessary information is available from the pre-optimizations and it has not been invalidated by other transformations.

Criteria of Success

The following are some criteria of success I am aiming for:

Reduce compile time by 40-70% by using fixed-width integer for ISL operations.
On codes that are non-parallelizable by the methods in Polly, make the compile-time overhead close to zero.

Background

Polly [3] is a loop-optimization pass in the LLVM compiler that relies on Polyhedral Compilation techniques [1][2] to generate code for multi-core and GPU architectures. Broadly, Polly performs optimizations like tiling and scheduling on for-loop sections of programs. The work of Polly has been very successful [11] because it relies on the strengths of open-source compilation, flexibility of LLVM-IR, as well as the formalism in the Polyhedral Compilation framework.

In spite of the large performance benefits that it gives, Polly is already reasonably fast on a variety of programs. On loop programs, the compile-times of Polly are around 200ms on small loop kernels and at most 1-2 seconds for larger multi-loop kernels. Also, on non-loop programs using Polly incurs an overhead of around 2-10% increase in compile-time. Though these are good numbers for scientific experiments, but, still significant compile-time reductions are possible. I aim to tackle these by making changes to the libraries that Polly uses and other targeted improvements.

Improvements to the Polly-ISL implementation

Polly relies heavily on the Integer Set Library (ISL) [4] at its heart for the various operations involving rational/integer polyhedra. For purposes of precision, ISL internally uses a variety of multi-precision libraries, which cause performance bottlenecks in a compiler setting like Polly.

In the winter of 2014/2015, I worked on using long long type instead of the GNU-MP (GMP) type that ISL usually relied on. This improvement fetched around 60% compile-time improvements over imath and around 34% over GMP and a positive feedback from the community [6][7].

In certains computations however, using long long integers is insufficient as it leads to overflow. The natural extension is to use int_128 integers and ensure that in ScopDetection we only handle integers for which we are sure the computations will not overflow. This would enable us to see the full performance benefits without limiting the optimizations performed.

The other directions is to implement a hybrid approach as has been previously discussed on the polly mailing list [5] [6]. This involves keeping a struct of the following form:

struct hybrid{

union{

mp_int int_imath;

long long int_long;

}

int is_imath;

}

In the implementation which uses the above structure, all operations would be done on type long long by default. The code will perform range checks on each individual operations like isl_int_add, isl_int_sub, etc. It will fall back to imath when the long long/int_128 method is insufficient. This method would guarantee compile time improvements.

Latest performance results[7] on gemm.c from PolyBench 4.0 [12] benchmark show favourable results with around 50% improvement.

Polly Version	Real time	User time
Original Polly	0m0.207s	0m0.196s
Polly (long long)	0m0.118s	0m0.110s

We also see a significant speedup of 50-60% compile time reduction for a large number of benchmarks[15].

Interactions of LLVM passes with Polly and their schedules

Polly is currently outside LLVM. Before working on its first analysis pass, SCoPDetection, Polly calls various LLVM optimizations like Tail Call Elimination, CFG Simplification, etc. Traditionally, these optimizations help in better polyhedral analysis, but, they cause an additional overhead into the Polly pass for non-loop programs. The following table summarizes these passes and their characteristics.

Code Type/Passes	Polly Engine	Pre-optimization Passes
Loop Programs	Necessary	Necessary
Non-Loop Programs	Not run	Overhead

A key step to making Polly a standard optimization of LLVM is to find a location to call the Polly pass within the other existing passes in LLVM, such that it can use the previously computed analysis results that are still not invalidated. This step will avoid overhead.

However, the above process is not trivial. We need to consider the interactions of Polly with other LLVM passes. As an example, consider the interaction of the Loop Invariant Code Motion (LICM) and Polly; programs that have been subjected to LICM optimization are not readily amenable to optimizations in Polly. Hence we need to modify Polly so that it can optimize LICM'd code, before Polly can actually be moved back in the pass pipeline.

Pushing Polly further back in the pipeline helps us optimize C++ programs with templates where loops are only exposed after inlining. This scheduling of Polly within LLVM would improve the programs from libraries like boost::ublas.

Testing Methodology

Try Buildbots for automatically testing performance changes on standard benchmarks like Polybench.
Compile large applications like Chromium/firefox.

Timeline

Weeks 1-4: Implementing int_128 and hybrid approach and integrating it into Polly.
Week 5: Testing and benchmarking.
Weeks 6-7: Study scheduling of existing passes in LLVM infrastructure and their effects on Polly.
Week 8-10: Scheduling of Polly in the LLVM pipeline.
Week 11,12: Testing and Documentation.

Biography

I am 3rd year Computer Science undergraduate student from Indian Institute of Technology Hyderabad (IIT Hyderabad). My research areas include Compilers as well as Machine Learning. I started learning about compilers since a year and have gained a good amount of experience in this vast area by implementing various projects and studying the algorithms and implementations in LLVM.

As my first major compiler project I wrote a complete compiler for the COOL language[10] including an x86 backend. I have also written an LLVM pass for Register Allocation which prints an interference graph and have a good understanding of LLVM code-base. I have studied LLVM source-code related to SSA construction, Interval analysis, Alias Analysis and also various papers about alias analysis, DJ graphs, loop optimizations and loop identification. I am also proficient in C/C++.

I do not have any other commitments for this summer.

References

Polyhedral Compilation http://polyhedral.info/
Pluto -- An automatic parallelizer and locality optimizer for multicores http://pluto-compiler.sourceforge.net/
Tobias Grosser, Armin Groesslinger, Christian Lengauer. Polly - Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters 2012 22:04 http://polly.llvm.org/
Sven Verdoolaege: isl: An Integer Set Library for the Polyhedral Model. ICMS 2010: 299-302. http://repo.or.cz/w/isl.git
Tobias Grosser and Sven Verdoolaege. Speeding up isl_int operations. https://groups.google.com/d/topic/polly-dev/Gtv2ObkqaeM/discussion
Bhatu Pratik, Ajay Brahmakshatriya, Aditya Kamath and G Bhargav Reddy. Faster ISL operations using long long instead of using GMP. Dec 2014--Jan 2015. https://groups.google.com/d/topic/polly-dev/zkR3dxgRdkg/discussion
Pratik Bhatu. Additional Improvements to the Polly-ISL interface. March 2015. https://groups.google.com/d/topic/polly-dev/1biHxWrM_Js/discussion
GMP. The GNU Multiple Precision Arithmetic Library https://gmplib.org
IMath library: A library for arbitrary precision integer computations https://github.com/creachadair/imath
Cool: The Classroom Object-Oriented Language http://theory.stanford.edu/~aiken/software/cool/cool.html
Tobias Grosser. A Decoupled approach to High-Level Loop Optimization. PhD Thesis, 2014.
Louis-Noel Pouchet et al. The PolyBench Benchmarks 4.0. http://sourceforge.net/p/polybench/wiki/Home/
Compile-Times of Polly. http://llvm.org/perf/db_default/v4/nts/25101?compare_to=25121
Ether Hongbin. Pre-commit buildbots for Polly. https://groups.google.com/forum/#!topic/polly-dev/uGYgbUO2O2E
Performance results of various benchmarks. http://bhatuzdaname.github.io/