Compile-Time Optimizations in Polly

Abstract

Improving the compile-time of polyhedral compilation tools and Polly has been an issue for some time. However, recently, lot of progress has been made. The aim of this project is to improve the compile-time of Polly by a large-factor. Some of the prominent changes I plan to make are: changes to the representations in the Integer Set Library (ISL) that Polly uses, and to modify the position of Polly in LLVM pass pipeline and it’s interactions with other passes.

Objective

The general objective of my project is to reduce the compile-time of Polly. A comparison of LLVM with options -O3 vs. -O3 -polly shows that there is scope for improvement in various aspects. This could include general work like changes to the ISL AST generator, or targeted improvements like the following:

Criteria of Success

The following are some criteria of success I am aiming for:

Background

Polly [3] is a loop-optimization pass in the LLVM compiler that relies on Polyhedral Compilation techniques [1][2] to generate code for multi-core and GPU architectures. Broadly, Polly performs optimizations like tiling and scheduling on for-loop sections of programs. The work of Polly has been very successful [11] because it relies on the strengths of open-source compilation, flexibility of LLVM-IR, as well as the formalism in the Polyhedral Compilation framework.

In spite of the large performance benefits that it gives, Polly is already reasonably fast on a variety of programs. On loop programs, the compile-times of Polly are around 200ms on small loop kernels and at most 1-2 seconds for larger multi-loop kernels. Also, on non-loop programs using Polly incurs an overhead of around 2-10% increase in compile-time. Though these are good numbers for scientific experiments, but, still significant compile-time reductions are possible. I aim to tackle these by making changes to the libraries that Polly uses and other targeted improvements.

Improvements to the Polly-ISL implementation

Polly relies heavily on the Integer Set Library (ISL) [4] at its heart for the various operations involving rational/integer polyhedra. For purposes of precision, ISL internally uses a variety of multi-precision libraries, which cause performance bottlenecks in a compiler setting like Polly.

In the winter of 2014/2015, I worked on using long long type instead of the GNU-MP (GMP) type that ISL usually relied on. This improvement fetched around 60% compile-time improvements over imath and around 34% over GMP and a positive feedback from the community [6][7].

In certains computations however, using long long integers is insufficient as it leads to overflow. The natural extension is to use int_128 integers and ensure that in ScopDetection we only handle integers for which we are sure the computations will not overflow. This would enable us to see the full performance benefits without limiting the optimizations performed.

The other directions is to implement a hybrid approach as has been previously discussed on the polly mailing list [5] [6]. This involves keeping a struct of the following form:

struct hybrid{

union{

     mp_int int_imath;

                       long long int_long;

                    }

int is_imath;

}

In the implementation which uses the above structure, all operations would be done on type long long by default. The code will perform range checks on each individual operations like isl_int_add, isl_int_sub, etc. It will fall back to imath when the long long/int_128 method is insufficient. This method would guarantee compile time improvements.

Latest performance results[7] on gemm.c from PolyBench 4.0 [12] benchmark show favourable results with around 50% improvement.

Polly Version

Real time

User time

Original Polly

0m0.207s

0m0.196s

Polly (long long)

0m0.118s

0m0.110s

We also see a significant speedup of 50-60% compile time reduction for a large number of benchmarks[15].

Interactions of LLVM passes with Polly and their schedules

Polly is currently outside LLVM. Before working on its first analysis pass,  SCoPDetection, Polly calls various LLVM optimizations like Tail Call Elimination, CFG Simplification, etc. Traditionally, these optimizations help in better polyhedral analysis, but, they cause an additional overhead into the Polly pass for non-loop programs. The following table summarizes these passes and their characteristics.

Code Type/Passes

Polly Engine

Pre-optimization Passes

Loop Programs

Necessary

Necessary

Non-Loop Programs

Not run

Overhead

A key step to making Polly a standard optimization of LLVM is to find a location to call the Polly pass within the other existing passes in LLVM, such that it can use the previously computed analysis results that are still not invalidated. This step will avoid overhead.

However, the above process is not trivial. We need to consider the interactions of Polly with other LLVM passes. As an example, consider the interaction of the Loop Invariant Code Motion (LICM) and Polly; programs that have been subjected to LICM optimization are not readily amenable to optimizations in Polly. Hence we need to modify Polly so that it can optimize LICM'd code, before Polly can actually be moved back in the pass pipeline.

Pushing Polly further back in the pipeline helps us optimize C++ programs with templates where loops are only exposed after inlining. This scheduling of Polly within LLVM would improve the programs from libraries like boost::ublas.

Testing Methodology

Timeline

Biography

I am 3rd year Computer Science undergraduate student from Indian Institute of Technology Hyderabad (IIT Hyderabad). My research areas include Compilers as well as Machine Learning. I started learning about compilers since a year and have gained a good amount of experience in this vast area by implementing various projects and studying the algorithms and implementations in LLVM.

As my first major compiler project I wrote a complete compiler for the COOL language[10] including an x86 backend. I have also written an LLVM pass for Register Allocation which prints an interference graph and have a good understanding of LLVM code-base. I have studied LLVM source-code related to SSA construction, Interval analysis, Alias Analysis and also various papers about alias analysis, DJ graphs, loop optimizations and loop identification. I am also proficient in C/C++.

I do not have any other commitments for this summer.

References

  1. Polyhedral Compilation http://polyhedral.info/
  2. Pluto -- An automatic parallelizer and locality optimizer for multicores http://pluto-compiler.sourceforge.net/
  3. Tobias Grosser, Armin Groesslinger, Christian Lengauer. Polly - Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters 2012 22:04 http://polly.llvm.org/
  4. Sven Verdoolaege: isl: An Integer Set Library for the Polyhedral Model. ICMS 2010: 299-302. http://repo.or.cz/w/isl.git
  5. Tobias Grosser and Sven Verdoolaege. Speeding up isl_int operations. https://groups.google.com/d/topic/polly-dev/Gtv2ObkqaeM/discussion
  6. Bhatu Pratik, Ajay Brahmakshatriya, Aditya Kamath and G Bhargav Reddy. Faster ISL operations using long long instead of using GMP. Dec 2014--Jan 2015. https://groups.google.com/d/topic/polly-dev/zkR3dxgRdkg/discussion
  7. Pratik Bhatu. Additional Improvements to the Polly-ISL interface. March 2015. https://groups.google.com/d/topic/polly-dev/1biHxWrM_Js/discussion
  8. GMP. The GNU Multiple Precision Arithmetic Library https://gmplib.org
  9. IMath library: A library for arbitrary precision integer computations https://github.com/creachadair/imath
  10. Cool: The Classroom Object-Oriented Language http://theory.stanford.edu/~aiken/software/cool/cool.html
  11. Tobias Grosser. A Decoupled approach to High-Level Loop Optimization. PhD Thesis, 2014.
  12. Louis-Noel Pouchet et al. The PolyBench Benchmarks 4.0. http://sourceforge.net/p/polybench/wiki/Home/
  13. Compile-Times of Polly. http://llvm.org/perf/db_default/v4/nts/25101?compare_to=25121
  14. Ether Hongbin. Pre-commit buildbots for Polly. https://groups.google.com/forum/#!topic/polly-dev/uGYgbUO2O2E
  15. Performance results of various benchmarks. http://bhatuzdaname.github.io/