Software Prefetching for Pointer Variables in the LLVM Compiler

Abstract

Software prefetching can considerably bridge the speed gap between processor and memory. In this project, I aim to exploit this ISA feature, supported by most modern architectures, to minimize memory stalls and achieve higher performance.

Objective

The primary objective of this project is to extend the current compiler-assisted Software Prefetching pass in LLVM. It will address two important features:

  1. Enhance the existing Loop Data Prefetching in LLVM for PowerPC backend [8] to make it target independent.
  2. Implement Data Prefetching for Recursive Data Structures (Pointer Variables).

The proposed LLVM pass will insert compiler-assisted prefetch instructions into the IR in order to hide memory stalls. The following example (from [3]) illustrates recursive data structure (RDS) prefetching.

f(tree *t) {
    tree *q;
    PREFETCH(t->left);
    PREFETCH(t->right);
    if (test(t->data))
        q = t->left;
    else
        q = t->right;
    if (q != NULL)
        f(q);
}
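The same idea applies to iterative pointer chasing. The following is a minimal sketch of next-node prefetching for a linked-list traversal, using the `__builtin_prefetch` builtin that both GCC and Clang provide (it lowers to the target's prefetch instruction, or to nothing on targets without one); the `node` type here is hypothetical, for illustration only.

```c
#include <stddef.h>

/* Hypothetical node type for illustration. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* While processing the current node, issue a prefetch for the next
 * one so its cache line is (hopefully) resident by the time we
 * dereference it. Prefetch instructions do not fault, so prefetching
 * a NULL next pointer at the end of the list is harmless. */
long sum_list(const node *n) {
    long total = 0;
    while (n != NULL) {
        __builtin_prefetch(n->next, /*rw=*/0, /*locality=*/3);
        total += n->value;
        n = n->next;
    }
    return total;
}
```

This hides at most one node's worth of latency per step; the algorithms in [2, 3] generalize the idea (e.g. greedy and history-pointer prefetching) to look further ahead.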

Criteria of Success

The following are the three broad areas in which I aim to improve the current software prefetcher in LLVM.

  1. Increase the number of architectures on which the LLVM software prefetcher runs.
  2. Show improvements on practical benchmarks that directly benefit from the inserted prefetch instructions. The comparison baseline would be vanilla LLVM (no software prefetching) vs. LLVM with the proposed prefetching pass.
  3. Develop a cost model for one or more architectures so that a choice can be made between software and hardware prefetching.

Background

Prefetching is the process of speculatively loading instructions or data from a lower (slower) level to a higher (faster) level of the memory hierarchy. It was introduced [12] to hide the stalls caused by the speed gap between processors and memory. Examples include prefetching from main memory to cache, or from a lower-level cache to a higher-level cache. Because prefetching happens before the actual memory reference, the processor does not have to wait for the load to complete. Microprocessor manufacturers provide two kinds of prefetching: hardware prefetching and software prefetching.

In hardware prefetching, dedicated hardware components analyze memory access patterns and prefetch based on the detected pattern. However, hardware prefetching tends to increase power consumption. Although most modern processors ship with hardware prefetchers that work well for simple array accesses, they are either insufficient for recursive data structures with pointer variables (because of algorithmic complexity) or do not support prefetching for pointers at all.

Software prefetching is the other mechanism, provided by most modern microprocessors, which allows the software developer to add directives instructing the processor to load the desired data or instructions into the cache. Software prefetching has been a major research area [7, 9], and many compiler-assisted prefetching algorithms have been published, both for programs with simple array accesses [1] and for more complex pointer accesses [2, 3, 4, 5, 12]. These algorithms analyze the program and automatically insert prefetch instructions.

Software and hardware prefetching are not exclusive to each other. For programs with array variables that have simple (affine) memory accesses, hardware prefetching may be enough, while for programs with pointer variables and complex memory accesses, software prefetching may be needed.

The following are some of the scenarios when software prefetching is beneficial [6, Sec 3.1]:

  1. Number of streams vs. hardware stream capacity: The number of concurrent streams can exceed what the hardware prefetcher can track. For example, a program with a 3D data structure that references all 27 neighbors in a grid needs 27 streams, which exceeds the number of simultaneous stream prefetches supported in hardware such as Intel Haswell [9]. Software prefetching can instead insert prefetch instructions independently for each stream.
  2. Short streams: A hardware prefetcher typically needs at least two cache misses to recognize an access pattern, which is inefficient for short streams. Software prefetching has no such training overhead and remains effective even for very short streams.
  3. Cache-level limitations of hardware prefetching: Some hardware prefetchers bring data only into the L2 cache, rather than L1. Software prefetching can be used to prefetch data directly into the L1 cache.
  4. Irregular data-access patterns: Hardware prefetching works well for regular array accesses but performs poorly for irregular ones, such as the indirect access a[b[i]].
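The irregular case in item 4 above can still be handled in software, because the index array tells us exactly which element will be needed a few iterations ahead. A minimal sketch, where the prefetch distance DIST is an assumed value that would be tuned per target in practice:

```c
#include <stddef.h>

#define DIST 8 /* assumed prefetch distance, for illustration only */

/* Software prefetching for the irregular pattern a[b[i]]: a hardware
 * prefetcher cannot predict a[b[i]], but b[i + DIST] is a regular
 * access, so we can read it and explicitly prefetch the element of
 * a[] that will be needed DIST iterations from now. */
long sum_indirect(const long *a, const int *b, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[b[i + DIST]], /*rw=*/0, /*locality=*/1);
        total += a[b[i]];
    }
    return total;
}
```

Note that this trades extra loads of b[] (usually cheap, since b[] streams well) for hiding the expensive misses on a[].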

Software prefetching has been implemented in various compilers because of the benefits mentioned above. Of the two major open-source compilers, GCC already supports automatic software prefetching for loops via -fprefetch-loop-arrays [10], whereas LLVM currently has only a PowerPC-specific loop data prefetching pass [11].

So there is a need for more work on this front in LLVM for it to catch up with GCC.

Earlier work [13] built empirical models using machine-learning techniques (such as linear regression) in the GCC framework; these models also cover prefetching and its interaction with cache pollution. Another related work [6] uses a simple model to compute the prefetch distance. My work would be in a similar mould.
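The core of such a simple model is choosing the prefetch distance so that the prefetch completes before the data is needed: distance = ceil(l / s), where l is the prefetch (miss) latency and s is the cost of one loop iteration, both in cycles. A minimal sketch of this computation; the latency and iteration-cost values in the usage example are illustrative, not measured:

```c
/* Prefetch distance per the simple model in [6]:
 * distance = ceil(latency / loop_body_cycles), i.e. the number of
 * iterations ahead a prefetch must be issued so that the data
 * arrives before it is used. Computed with integer ceiling division. */
unsigned prefetch_distance(unsigned latency_cycles, unsigned loop_body_cycles) {
    return (latency_cycles + loop_body_cycles - 1) / loop_body_cycles;
}
```

For example, with an assumed 200-cycle miss latency and a 30-cycle loop body, the model yields a distance of 7 iterations. A real cost model would additionally account for cache pollution and prefetch-instruction overhead, as [13] does.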

Testing Methodology

Timeline

Biography

I am a first-year graduate student in the Computer Science and Engineering department at the Indian Institute of Technology Hyderabad (IITH). I completed a 4-year undergraduate degree in Electronics and Communication Engineering at the National Institute of Technology Jalandhar (NIT Jalandhar). I also have 4.5 years of industry experience as a Software Developer at Computer Sciences Corporation India Pvt. Ltd., where I worked mostly on Microsoft technologies such as VB6, .NET and ASP.NET.

I have always had the urge to contribute to the open-source community, and hence opted for a career switch to pursue a master's degree and become familiar with open-source tools. Since my first semester at IITH, I have been working with LLVM for the Advanced Compilers course's projects and mini-assignments. As part of the Advanced Compiler Optimization course this semester, I have studied and worked with various analysis and transformation passes, including alias analysis passes (aa-eval, count-aa, basicaa and CFL-AA) and transformation passes (licm, dce, constprop).

I am also proficient in C and C++.

I do not have any commitments for the summer.

References

  1. Todd C. Mowry, Monica S. Lam, and Anoop Gupta. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the fifth international conference on Architectural support for programming languages and operating systems (ASPLOS V), Richard L. Wexelblat (Ed.). ACM, New York, NY, USA, 62-73. DOI=10.1145/143365.143488 http://doi.acm.org/10.1145/143365.143488
  2. Chi-Keung Luk and Todd C. Mowry. 1996. Compiler-based prefetching for recursive data structures. In Proceedings of the seventh international conference on Architectural support for programming languages and operating systems (ASPLOS VII). ACM, New York, NY, USA, 222-233. DOI=10.1145/237090.237190 http://doi.acm.org/10.1145/237090.237190
  3. Chi-Keung Luk and Todd C. Mowry. 1999. Automatic Compiler-Inserted Prefetching for Pointer-Based Applications. IEEE Trans. Comput. 48, 2 (February 1999), 134-141. DOI=10.1109/12.752654 http://dx.doi.org/10.1109/12.752654
  4. Mikko H. Lipasti, William J. Schmidt, Steven R. Kunkel, and Robert R. Roediger. 1995. SPAID: software prefetching in pointer- and call-intensive environments. In Proceedings of the 28th annual international symposium on Microarchitecture (MICRO 28). IEEE Computer Society Press, Los Alamitos, CA, USA, 231-236.
  5. Chi-Keung Luk. Optimizing the Cache Performance of Non-Numeric Applications. Ph.D. Thesis, Department of Computer Science, University of Toronto, January 2000 http://www.ckluk.org/ck/papers/ckluk_phdthesis.pdf
  6. Jaekyu Lee, Hyesoon Kim, and Richard Vuduc. 2012. When Prefetching Works, When It Doesn’t, and Why. ACM Trans. Archit. Code Optim. 9, 1, Article 2 (March 2012), 29 pages. DOI=10.1145/2133382.2133384 http://doi.acm.org/10.1145/2133382.2133384
  7. Ulrich Drepper. 2007. Memory part 5: What programmers can do http://lwn.net/Articles/255364/
  8. Hal Finkel, 2013. Loop Data Software Prefetching in LLVM for PowerPC. http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20121231/160778.html
  9. Intel® 64 and IA-32 Architectures Optimization Reference Manual. http://www.intel.in/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
  10. Data Prefetch Support in the GCC compiler https://gcc.gnu.org/projects/prefetch.html
  11. Loop Data Prefetching for PowerPC: LLVM Source Code. http://llvm.org/docs/doxygen/html/PPCLoopDataPrefetch_8cpp_source.html
  12. David Callahan, Ken Kennedy, and Allan Porterfield. 1991. Software prefetching. In Proceedings of the fourth international conference on Architectural support for programming languages and operating systems (ASPLOS IV). ACM, New York, NY, USA, 40-52. DOI=10.1145/106972.106979 http://doi.acm.org/10.1145/106972.106979  
  13. Kapil Vaswani, Matthew J. Thazhuthaveetil, Y. N. Srikant, and P. J. Joseph. 2007. Microarchitecture Sensitive Empirical Models for Compiler Optimizations. In Proceedings of the International Symposium on Code Generation and Optimization (CGO '07). IEEE Computer Society, Washington, DC, USA, 131-143. DOI=10.1109/CGO.2007.25 http://dx.doi.org/10.1109/CGO.2007.25