1 of 55

Next-gen Haskell Compilation Techniques

Csaba Hruska

2 of 55

Overview

The GRIN compiler project & vision

My experience with GHC internals

My approach to Haskell compilation & GHC reuse

3 of 55

My approach to Haskell compilation

This talk is not a GHC Dev 101

GHC has its own design, decisions made decades ago

It is important to think freely about Haskell compilation

(IMO) GHC reached a local optimum many years ago; the compiler pipeline needs to be redesigned to make progress

GHC has a unique design, which makes it hard to utilize mainstream compiler technology

4 of 55

GHC history

Haskell origin

RAM size of the era

research questions/focus (90s, 2000s): pure vs IO, exceptions, concurrency, types

technology constraints

how it shaped the compiler architecture

5 of 55

GHC today

RAM size today and trends

CPU trends

today's feasibility constraints

which constraints are outdated?

6 of 55

GHC compiler pipeline

each compilation stage (IR) focuses on a single aspect

7 of 55

Aspects of PL/compiler development

Surface Language: PL research, type systems, module systems, syntax, metaprogramming

IR/optimizer/codegen: the IR type system's purpose is to catch bugs during IR transformations (reduce miscompilation); IR types help compiler engineers with compiler debugging

RTS: GC, profiler, threading, exceptions, multi-core evaluator support. Types can help here too: encode RTS implementation/design invariants. Use metaprogramming to generate specialized code instead of manual code specialization (multi-core × profiling × calling conventions × etc.)

8 of 55

Focus: Haskell compiler backend

In the community, most developers work on the surface language and the type system

Only a few devs work on the low-level code generator, optimizer, and RTS

Now I'll focus on the low-level part of the compiler pipeline

9 of 55

Compiler backend concerns

  • Compiler and RTS code maintainability
  • Compilation speed
  • Compiled code quality & program efficiency
  • Runtime tooling: profiling, debugging, program observation
  • Developer Experience: the GHC/compiler developer's user experience. NOTE: DX relates to the community and human resource management. Goal: GHC and RTS development must be easy for everyone
  • Efficient engineering: use de facto tech for general problems; develop custom solutions only for special features

10 of 55

Compiler technology advancements

Theory:

  • faster algorithms
  • new system programming languages
  • new type systems

Practice: efficient implementation of sub-problems (libraries, DSL compilers)

11 of 55

What areas could be improved in GHC

  • Developer Experience (DX): the community has only a few GHC developers; help them with automated code validation (using types) and reduce code complexity as much as possible
  • Analysis tech: faster compilation
  • Optimization: faster compiled programs
  • RTS tech: simpler and more maintainable implementation

The GHC developer experience can make a large impact on the future of Haskell

easier to develop → more developers → more issues solved → better user experience → more users → more GHC developers

12 of 55

Improve: Compiler Developer Experience (DX)

  • GHC compilation time
  • IR accessibility
  • program observation
  • well separated compilation stages
  • replayable compiler actions
  • build system / package management integration
  • transparent optimizer: show analysis results on the source code
  • support researchers and easy experimentation: allow work outside of the GHC code base (export/import IR)

13 of 55

DX: make reasoning easy, compositional design

IR semantics: interpreter

IR analysis: developer’s choice of technology

Important: do not push GHC's technical debt onto the curious mind

My approach

  • external STG IR
  • .fullpak (whole program IR) & .modpak (module IR)
  • external STG interpreter with full GHC primop/RTS/base lib support; this makes it easy to reason about the Haskell execution machinery (semantics vs implementation details)

14 of 55

Improve: compiled code quality & program efficiency

Memory management: ASAP & Micro-mitten, CIB, Perceus

Data representation: LoCal & Gibbon

Vectorisation: HRC, ISPC

Handling of laziness: HRC, GRIN

These need a different design in the low-level part of the GHC compiler pipeline

15 of 55

Lightweight change: external STG link-time opt

GHC's design: compilation unit = Haskell module

NOTE: incremental compilation also allows larger compilation units

Minimal pipeline change: external STG IR export / analyze / transform / import. Does not require RTS or codegen changes.

GHC does its main optimizations on the Core IR, but Core's type system does not allow expressing some low-level optimizations (e.g. those related to data representation)

The low-level optimizations could be done on the external STG IR, where the compilation unit size is flexible and may span several modules or packages.

16 of 55

Lazy Evaluation

  • GHC's approach: indirect jumps, STG machine
  • GRIN: defunctionalization

17 of 55

GHC's lazy evaluation: indirect jumps, STG machine

Problem: statically unknown control flow; the optimizer and code generator rely on control flow information

GHC's solution:

  • cheap strictness analysis
  • heavy inlining
  • fine tuned RTS to handle indirect jumps efficiently

Indirect jumps and the RTS/STG heap object manipulation decimate LLVM's optimization capabilities. LLVM optimizes direct calls/jumps much better.

18 of 55

GRIN: defunctionalization

Defunctionalization replaces indirect calls with direct calls. Good: direct calls unleash LLVM's optimization power. Bad: it needs whole-program analysis (according to Urban Boquist's PhD thesis)

IMO it is possible to relax the whole program analysis constraint

Vague idea:

  • control flow analysis on code clusters
  • defunctionalize the cluster internal calls
  • use STG's indirect call convention for cluster external calls
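As a hedged illustration (names and details are mine, not from the GRIN code base), defunctionalization can be sketched in Haskell itself: every function value that may reach a call site becomes a data constructor, and the indirect call becomes a case dispatch of direct calls.

```haskell
-- Illustrative sketch of defunctionalization.
-- Before: higher-order code with an indirect call site:
--   twice f x = f (f x)

-- After: each function that may flow into `f` gets a constructor.
data Fun = AddTo Int   -- closure of (\x -> x + n), capturing n
         | Negate      -- the function negate

-- The indirect call is replaced by `apply`, whose branches are
-- direct calls that a backend like LLVM can inline and optimize.
apply :: Fun -> Int -> Int
apply (AddTo n) x = x + n
apply Negate    x = negate x

twice :: Fun -> Int -> Int
twice f x = apply f (apply f x)
```

The control flow analysis mentioned above is what computes, for each call site, which constructors (functions) can reach it, so that `apply` can be kept small and cluster-local.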

19 of 55

GRIN: lazy evaluation + defunctionalization + LLVM

(code example: Haskell source and the corresponding GRIN code)

20 of 55

Note about lazy evaluation

Practical Haskell compilation is not about handling laziness. It is about supporting GHC's Haskell primops.

  • more than 400 primops
  • arrays, mutvars, number arithmetic
  • lightweight threads
  • sync and async exceptions
  • foreign function interface
  • stm

21 of 55

Static Analysis

  • GHC’s approach
  • Souffle Datalog compiler
  • Andersen style / inclusion based points-to analysis
  • P4F: control flow analysis

22 of 55

GHC's static analysis

demand analysis:

  • unboxing (CPR: Constructed Product Result)
  • strictness
  • liveness

Inter-modular via the .hi files

  • GHC specific
  • not a generic framework
  • works on single module IR (Core)
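As a hedged illustration of what demand analysis enables (the comments describe GHC's usual worker/wrapper behaviour in general terms, not the output for this exact function):

```haskell
-- `go` is strict in both arguments: strictness analysis can prove this,
-- and the worker/wrapper transformation can then pass them as unboxed
-- Int# values, avoiding heap allocation in the loop.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go acc i
      | i > n     = acc                 -- CPR: the boxed result is built once, at the end
      | otherwise = go (acc + i) (i + 1)
```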

23 of 55

Souffle Datalog compiler

Soufflé is a logic programming language inspired by Datalog

  • designed for crafting static analysis
  • static analysis is written declaratively in Souffle
  • compiles to parallel C++ (OpenMP)
  • memory efficient data structures

https://souffle-lang.github.io/

24 of 55

Souffle Datalog Example

input: edge

output: reachable

calculates the reachable points of a graph (transitive closure)
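The program itself (reconstructed here as a sketch; the slide showed it as an image) would look like:

```prolog
// Transitive closure: which nodes are reachable from which.
.decl edge(x:number, y:number)
.input edge

.decl reachable(x:number, y:number)
.output reachable

reachable(x, y) :- edge(x, y).                  // direct edge
reachable(x, z) :- edge(x, y), reachable(y, z). // extend by one edge
```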

25 of 55

Andersen style, inclusion based points-to analysis

Implemented in Souffle Datalog

Needed for GRIN (defunctionalization) to approximate the control flow

  • export IR to Datalog facts
  • calculates the value origins for variables, i.e. constructors, closures, primops
  • read back the result
  • transform IR

It is FAST! Analyzed >100 STG modules in 1.7 sec (quad-core i7 CPU)
https://github.com/grin-compiler/souffle-cfa-optimization-experiment

  • analyses can rely on each other, i.e. reachability, points-to
  • increased speed
  • increased accuracy
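A minimal Souffle sketch of the inclusion-based rules (simplified and with hypothetical relation names; real STG facts would also cover closures, application, and heap operations):

```prolog
// bind: a variable is bound directly to a value (constructor, closure, primop result)
.decl bind(var:symbol, val:symbol)
// move: a variable-to-variable copy (e.g. case binder, argument passing)
.decl move(to:symbol, from:symbol)
.input bind
.input move

// origin: the computed value origins for each variable
.decl origin(var:symbol, val:symbol)
.output origin

origin(v, h) :- bind(v, h).
origin(v, h) :- move(v, w), origin(w, h).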

26 of 55

P4F: control flow analysis

  • cutting edge control-flow analysis (CFA) for higher-order languages
  • accurate and fast
  • tracks connections between callers and callees accurately
  • takes stack (frame) configuration into account (perfect stack precision)

analysis result is useful for:

  • defunctionalization
  • inlining (function specialization)

Paper: Pushdown Control-Flow Analysis for Free

It has not been used in production yet. How would it deal with real-world Haskell programs (STG IR)?

https://github.com/csabahruska/p4f-control-flow-analysis

27 of 55

Memory Management

  • GHC's approach
  • Counting Immutable Beans & Perceus
  • ASAP & Micro-mitten

28 of 55

GHC RTS: memory management

Uses tracing & generational garbage collector

The garbage collector is built on top of a block layer that manages memory in units of blocks, where a block is a multiple of 4 KB in size

More info:
https://www.aosabook.org/en/ghc.html
http://ezyang.com/jfp-ghc-rts-draft.pdf

29 of 55

Counting Immutable Beans & Perceus

Counting Immutable Beans: Reference Counting Optimized for Purely Functional Programming
https://arxiv.org/abs/1908.05647

Perceus: Garbage Free Reference Counting with Reuse
https://www.microsoft.com/en-us/research/uploads/prod/2020/11/perceus-tr-v1.pdf

Perceus was inspired by the Counting Immutable Beans paper. Both systems use reference counting with static analysis to find reusable memory cells.

30 of 55

Perceus: reuse analysis example

The Perceus algorithm is used in the Koka language

The concept works for strict functional languages. (STG and GRIN are strict)

(code example: reuse insertion in Koka)

31 of 55

Perceus: benchmark

32 of 55

ASAP & Micro-mitten

ASAP (As Static As Possible) is a compile-time automatic memory management system using whole-program analysis.
https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-908.pdf

Micro-mitten: an ASAP implementation for a simple Rust-like language
https://github.com/doctorn/micro-mitten

ASAP is not ready for production use; it needs more research. But it fills a hole in the memory management design space.

33 of 55

Memory Management Design Space

ASAP: Static Automatic Memory Management

ASAP's static analyses can reveal data lifetime properties at compile time. This on its own is useful information for the programmer.

34 of 55

Data Representation

  • GHC's approach
  • LoCal & Gibbon

35 of 55

GHC RTS: data representation

Structs with pointers (boxed and tagged heap objects). Every heap object is represented uniformly.

36 of 55

GHC RTS: data representation examples

Complex data values have lots of indirections and lots of pointers!
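A hedged sketch of what the uniform boxed representation means in practice (the layout comments are simplified; real GHC heap objects carry info pointers and may be tagged):

```haskell
-- Every value is a heap object: a header word plus payload.
data Point = Point Int Int
-- `Point 1 2` is (at least) three heap objects:
--   Point closure: [header][ptr to I# 1][ptr to I# 2]
--   I# 1:          [header][1#]
--   I# 2:          [header][2#]
-- Two extra indirections just to reach the raw machine integers.

-- With strict, unpacked fields GHC can flatten this to one object:
data PackedPoint = PackedPoint {-# UNPACK #-} !Int {-# UNPACK #-} !Int
--   PackedPoint:   [header][1#][2#]

sumPoint :: PackedPoint -> Int
sumPoint (PackedPoint x y) = x + y
```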

37 of 55

LoCal & Gibbon

Gibbon is an experimental compiler that transforms high-level functional programs to operate on serialized data (without pointers/indirections).

Gibbon uses LoCal (Location Calculus). Its type system ensures transformation correctness. LoCal allows:

  • fully serialized
  • full pointers
  • anything in between
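The fully serialized end of that spectrum can be sketched in plain Haskell (this is my illustration, not Gibbon's actual encoding; field values are shortened to one byte for simplicity): a tree becomes a flat tag stream, and traversals walk the buffer left to right without chasing a single pointer.

```haskell
import Data.Word (Word8)

data Tree = Leaf Int | Node Tree Tree

-- Tag 0 = Leaf followed by its value, tag 1 = Node followed by
-- both children inline. No pointers, no per-node heap objects.
serialize :: Tree -> [Word8]
serialize (Leaf n)   = [0, fromIntegral n]
serialize (Node l r) = 1 : serialize l ++ serialize r

-- Sum the tree directly from the serialized buffer,
-- returning the remaining (unconsumed) bytes.
sumBuf :: [Word8] -> (Int, [Word8])
sumBuf (0 : n : rest) = (fromIntegral n, rest)
sumBuf (1 : rest)     = let (a, rest')  = sumBuf rest
                            (b, rest'') = sumBuf rest'
                        in (a + b, rest'')
sumBuf _              = error "malformed buffer"
```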

38 of 55

LoCal & Gibbon example

39 of 55

LoCal & Gibbon benchmark

Paper (PLDI'19): LoCal: A Language for Programs Operating on Serialized Data

Michael Vollmer, Chaitanya Koparkar, Mike Rainey, Laith Sakka, Milind Kulkarni, Ryan R. Newton

Gibbon2: based on LoCal

CNF: GHC's Compact Normal Form (Compact Region)

Cap'n Proto: very efficient C++ data interchange format

40 of 55

Vectorisation

  • HRC: Intel Labs Haskell Research Compiler
  • ISPC: Intel SPMD Compiler

41 of 55

HRC: Intel Labs Haskell Research Compiler

Uses GHC as frontend, has custom optimizing backend

  • specializes data representation
  • tracks function pointers, control flow analysis (approximation)
  • whole program compilation is not required
  • automatic SIMD vectorisation for Haskell (via slightly modified Array primops); no language modification is needed

Vectorisation makes Haskell suitable for high performance computing

https://github.com/IntelLabs/flrc
http://www.leafpetersen.com/leaf/publications.htm

HRC's RTS is not tuned for laziness (laziness was not its focus)

(screenshot: HRC benchmark)

(screenshot: measuring the haskell gap)

42 of 55

HRC + FLRC = frontend + optimizer

Functional Language Research Compiler

43 of 55

HRC benchmark: lazy evaluation + RTS

Paper: The Intel Labs Haskell Research Compiler. Hai Liu, Neal Glew, Leaf Petersen and Todd Anderson

HRC RTS is not tuned for laziness (it was not the focus)

44 of 55

Computation heavy benchmark, C vs Haskell

Vectorisation makes Haskell suitable for high performance computing

Paper: Measuring the Haskell Gap. Leaf Petersen, Todd Anderson, Hai Liu and Neal Glew

45 of 55

ISPC: Intel SPMD Compiler

ispc is a compiler for a variant of the C programming language, with extensions for "single program, multiple data" (SPMD) programming.

compiles SPMD programs to CPU code, utilizing its SIMD instructions

massive speedup: 3x or more

needs language extensions: distinct (SIMD) uniform and varying values

control flow level vectorisation (through multiple function calls)

https://ispc.github.io/
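A minimal ispc-flavoured sketch of the uniform/varying distinction (illustrative only; function and file names are mine):

```c
// Hypothetical scale.ispc: `uniform` values are one scalar shared by
// all SIMD lanes; plain (varying) values hold one element per lane.
export void scale(uniform int n,
                  uniform float vin[],
                  uniform float vout[]) {
    // foreach maps iterations onto SIMD lanes; `i` is varying,
    // so each lane processes a different array element.
    foreach (i = 0 ... n) {
        vout[i] = vin[i] * 2.0f;
    }
}
```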

46 of 55

ISPC: Intel SPMD Compiler

47 of 55

ISPC: Intel SPMD Compiler

Idea: SPMD for Haskell

  • Defunctionalization: represents laziness as data
  • LoCal: removes indirections (no pointers)

Would it vectorise?

48 of 55

Runtime System & Tooling

  • GHC RTS implementation
  • New options for implementation
  • Web tools for visualization

49 of 55

GHC RTS implementation

Complex primops are written in C-- (e.g. exception handling). The rest is written in C:

RTS subsystems: GC, thread scheduler, STM, I/O manager, logging

GHCi-specific code: linker, byte code interpreter (should not be in the RTS; it belongs to GHCi)

The RTS depends on base: functions, data types (e.g. exceptions)

50 of 55

GHC RTS implementation

uses lots of conventions and CPP macros for RTS heap object manipulation

almost a mini OS, C structures with complex invariants

Problem:

  • RTS developers must maintain complex invariants manually
  • the C type system is weak
  • complexity scares away developers
  • IMO it is impossible to make architectural changes (even for experts)

Goal: fearless refactoring

A sophisticated type system could help check RTS invariants.

51 of 55

New options for RTS implementation

Powerful system programming languages: Rust, ATS

Rust is more mainstream, but has a weaker type system

ATS is niche, but more powerful:

  • dependently typed language
  • compile time type/proof erasure, no runtime overhead
  • compiles to C
  • zero overhead interop with C
  • program invariants are expressed via dependent types
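As a hedged sketch of the "fearless refactoring" goal, here is one way an RTS invariant could be encoded in Rust's type system (the invariant and all names are hypothetical, chosen only to illustrate the typestate technique): an object must be scanned before it may enter the remembered set, and the types make the wrong order a compile error.

```rust
use std::marker::PhantomData;

// Typestate markers: zero-sized, erased at runtime.
struct Unscanned;
struct Scanned;

// A heap object tagged with its GC processing state.
struct HeapObj<State> {
    addr: usize,
    _state: PhantomData<State>,
}

impl HeapObj<Unscanned> {
    fn new(addr: usize) -> Self {
        HeapObj { addr, _state: PhantomData }
    }

    // Consumes the unscanned object; the only way to obtain a
    // HeapObj<Scanned> is to go through `scan`.
    fn scan(self) -> HeapObj<Scanned> {
        HeapObj { addr: self.addr, _state: PhantomData }
    }
}

// Accepts only scanned objects; passing an Unscanned one won't compile.
fn remember(obj: &HeapObj<Scanned>) -> usize {
    obj.addr
}
```

The same pattern scales to richer invariants (pinned vs movable objects, lock ordering), which is exactly where C's conventions-and-macros approach forces manual discipline.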

52 of 55

Web tools for visualization

HTML/CSS/JS is the most accessible UI technology.

Plenty of good libraries for visualization.

It has much better UX than printing to the terminal.

53 of 55

Web tools for visualization


54 of 55

Conclusion

make experimentation easy and fun

55 of 55

Contact

Twitter: @csaba_hruska�Patreon: patreon.com/csaba_hruska