1 of 28

Ada 202X Lightweight Parallelism and OpenMP

Tucker Taft

Ada Rap Group October 2019

based on earlier presentations from

Ada-Europe 2014 update,

Ada-Europe 2019 “20-20 view”

and others

Graphics by Raphaël and Tuck

2 of 28

Outline

Introduction to Ada 202X and its support for Light-weight Parallelism
Introduction to OpenMP and approach to Ada 202X binding
Proposed layering of Ada 202X support for LWT
Example to illustrate layering
Discussion of OpenMP and Ada 202X layering
Discussion of proposed Ada 202X compile-time Data-Race (Conflict) checking and OpenMP approaches

3 of 28

Ada 202X High-Level Story

Make Ada a great language for parallel programming
Other enhancements that build on Ada’s existing strengths:

Safety and Security
Contract-Based Programming
Expressivity (particularly when it furthers the previous goals)

4 of 28

Adding Support for Parallel Programming

Concurrent programming

�Multiple workers

Multiple computations

May need to synchronize across workers

Parallel programming

Many workers

One large computation

Synchronization only for work split/join

5 of 28

Adding Support for Parallel Programming

Concurrent programming

Ada has great building blocks for concurrent programming
Tasks, rendezvous, protected objects

Parallel programming

Ada 2012 has nothing built-in
Although concurrent building blocks can be used, they’re very heavyweight

6 of 28

Ada 202X Parallel Programming Goals

Make it easy and safe to write parallel algorithms

Which can be tuned for different target environments without starting over

Hide the housekeeping of dispatch/scheduling/data accumulation
Have the compiler detect and disallow/prevent data races
Enable use of various light-weight threading (LWT) systems

OpenMP is most important

but even OpenMP has multiple implementations

Provide a portable default LWT library that runs on Ada run-time

7 of 28

A reminder why this is important…

The Right Turn in Single-Processor Performance (14 years ago)

Courtesy IEEE Computer, January 2011, page 33.

Safe Parallel Ada 7

8 of 28

Parallel Loops (202X)

parallel (2*Num_CPUs) -- specify max level of parallelism

for I in 1 .. 1_000 loop

A(I) := B(I) + C(I);

end loop;

parallel for Elem of Arr loop

Elem := Elem * 2;

end loop;

9 of 28

Parallel Block (202X)

parallel do

handled_sequence_of_statements

{and

handled_sequence_of_statements}

end do;

From Ada 202x draft manual:

Each handled_sequence_of_statements represents a separate logical thread of control that proceeds independently and concurrently. The parallel_block_statement is complete once every one of the handled_sequence_of_statements has completed, either by reaching the end of its execution, or due to a transfer of control out of the construct by one of the handled_sequence_of_statements (see 5.1).

Parallel block is important for divide-and-conquer algorithms

such as sorting and searching
equivalent to parallel loop around a “case” statement

10 of 28

Map/Reduce Iterators (202X)

��-- A reduction expression to calculate the sum of elements of an array�Result : Integer := [for Element of Arr => Element]’Reduce(“+”, 0);

-- A reduction expression to create an unbounded string �-- containing the alphabet�Alphabet : Unbounded_String� := [for Letter in 'A' .. 'Z' => Letter]’Reduce(“&”, Null_Unbounded_String, “&”);

-- A reduction expression to determine how many �-- people in a database are 30-something�ThirtySomethings : constant Natural � := [for P of Personnel => (if Age(P) > 30 then 1 else 0)]’Reduce(“+”, 0);

11 of 28

Global contracts from SPARK (202X)�used for data race detection

Global => in out all -- default for non-pure pkgs

Global => null -- default for pure packages

-- Explicitly identified globals with modes

Global => (in P1.A, P2.B,

in out P1.C,

out P1.D, P2.E)

-- Pkg data, access collection, task/protected/atomic

Global => in out private of P3 -- pkg P3 data

Global => synchronized in out all -- prot/atomic

12 of 28

Nonblocking contract�used for deadlock detection

Ada 202X Nonblocking aspect

-- apply to one subprogram

procedure Suspend_Until_True

(S : in out Suspension_Object)

with Nonblocking => False;

-- apply to an entire package

package Ada.Characters.Handling

with Nonblocking => True is …

13 of 28

Ada 202x Syntactic Building Blocks for Parallelism

14 of 28

Ada 202X Building Blocks -- Iterators

Programmers Like Iterators

Looping semantics very visible -- no mystery
Ada 2012 iterators made containers significantly more useful

E.g. Newer AdaCore products make heavy use of iterators

In Ada 202X, we build on iterators

For array aggregates defined with “iterated component association”:

X : Int_Array := (for I in 1 .. N => I**2)

For aggregates defined by iterating over a container:

Y : Int_Array := (for E of C => E);

For “procedural iterators”:

Loop body becomes local anonymous procedure passed into existing iterator procedures:

such as Maps.Iterate and Environment_Variables.Iterate

For reduction expressions (see earlier examples)

15 of 28

Ada 202X Building Blocks -- Filters

Iterators sometimes generate too many values

Use filter to reduce to values of interest
for iterator when condition ...

Filters can be used in various kinds of iterators:

For aggregates defined by iterating over a container:

Odds : Int_Array := (for E of C when E mod 2 = 1 => E);

For procedural iterators:

for (Name, Value) of Environment_Variables.Iterate

when Name(Name’First) /= “_” loop

Put_Line (Name & “ => “ & Value);

end loop;

For reduction expressions:

[for P of Personnel when Age(P) > 30 => 1]’Reduce(“+”, 0);

16 of 28

Ada 202X Building Blocks -- “parallel”

Iterators can generate many values

Use “parallel” to spawn multiple logical threads of control

parallel … -- uses default amount of “chunking”
parallel (Num_Chunks) … -- specify a max number of chunks
parallel (Chunk in 1..Num_Chunks) … -- specify a chunk parameter

“parallel” can be used with various kinds of iterators:

For iterating over a large range:

parallel (Chunk in 1 .. Num_Chunks) -- named chunk parameter

for I in Arr’Range loop

. . . -- possibly do other stuff

Partial_Sum(Chunk) := @ + Arr(I); -- accumulator for each chunk

end loop;

return Partial_Sum’Reduce(“+”, 0.0); -- final reduction

For large reduction expressions over container iterator:

[parallel for P of Personnel when Age(P) > 30 => 1]’Reduce(“+”, 0);

User defined “split” iterators for containers -- like Java’s “spliterators”

Data race conflict checks provided at three levels -- All, Known, None

17 of 28

Ada 202x uses “building block” approach

Enabled by orthogonality of syntax
Eases use and readability relative to large library or multiple pragmas
Portable concepts that can be mapped to diverse targets
Compile-time conflict checking for safety

“parallel”

with chunks

Iterators

Filters

Quantified expression

Reduction expression

Aggregate

Loop body

18 of 28

Mapping Ada 202X to OpenMP & friends

OpenMP 1.0 -- 1997, OpenMP ARB

Heavy weight threads, SPMD model

CUDA -- 2007, NVIDIA

NVIDIA GPUs, explicit separated “kernel” code

OpenCL -- 2008, Apple, Khronos

Most GPUs, explicit separated “kernel”s

OpenMP 3.0 -- 2008, OpenMP ARB

Lighter weight “tasks” with work sharing

OpenACC -- 2011, Cray, NVIDIA, PGI

Many GPUs, no separate “kernel”
Extracts GPU “kernel” code from for-loop

OpenMP 4.0 -- 2013, OpenMP ARB

Adds “target” devices, begins to subsume OpenACC

OpenMP 5.0 -- 2018, OpenMP ARG

Largely subsumes OpenACC, and OpenCL to lesser extent

19 of 28

Mapping Ada 202X to OpenMP & friends

OpenMP originally designed in 1997

Initially supported only heavy-weight “threads”

mapped generally to “kernel” threads
analogous to Ada “tasks”

Thread ID used explicitly to compute what part of data to manipulate

SPMD -- “Single Program, Multiple Data” model

Programs had no visible structure that matched computation being performed

Pragmas used heavily to provide implicit structure
No data race checking provided

OpenMP evolved over time; OpenMP 5.0 is recently released (Nov 2018)

Early features mostly supplanted by newer notions based on lighter-weight “tasks” and work sharing.
Supports parallel loops of a very-specific structure (pragma + pattern)
Supports parallel blocks using an explicit “task” pragma
Incorporated OpenACC-like support for “target devices” such as GPUs
Preferred over OpenCL or CUDA because can debug parallel algorithm on host before inserting “target” directives
Still no data race checking in most implementations

Explicit dependence annotations could enable some checking, some day

20 of 28

Mapping Ada 202X to OpenMP & friends

For Ada 202X mapping, we will generally use newer OpenMP features

Rely on Ada 202x language syntax

for high-level parallel algorithm structure and correctness and tuning (e.g. chunking)

Rely on pragmas, aspects, and/or library calls for target-specific tuning:

Selecting number of heavy-weight threads
Data flushing and caching
Mapping to target devices

Examples of specific mappings:

Parallel region establishes initial number of (heavyweight) threads

Generally will create one region per Ada program
Want to minimize creating and releasing multiple heavyweight threads

For parallel block, tasks are generated

a “single” construct followed by two or more “task” pragmas (or API calls)
awaited at a “taskwait”

For parallel loop:

“parallel for” or “taskloop” pragma/API used for loops that match for-loop “pattern” supported by OpenMP
tasks spawned explicitly to handle other Ada 202X parallel loops, such as those for parallel container iterators, with explicit “taskwait”

21 of 28

Possible layering of Ada 202X light-weight parallelism support

Syntax

parallel for .. in/of, parallel do, parallel for (...), [parallel for …]

Optional: pragmas

Enabling support in earlier Ada versions -- equiv to syntax
For local control and tuning for particular bindings (e.g. OpenMP)

adds binding-specific option record to compiler-generated calls

Ada.Parallelism package

Called by compiler-generated code (or advanced user)
Includes “hook” for passing binding-specific option record

System.LWT interface called by Ada.Parallelism body (or advanced user)

Uses dispatch table to invoke particular binding’s LWT support

System.LWT.OpenMP, System.LWT.Default, … other bindings

Elaboration of binding package initializes dispatch table
User’s “main” procedure “with”s binding of interest

Can declare Scheduler-control object in main subprogram to specify tuning
Might get sequential semantics by default

22 of 28

Example to illustrate layering

with System.LWT.OpenMP; use System.LWT;

procedure Main is

Control : OpenMP.Scheduler_Control :=

OpenMP.Options (Option1 => …, Option2 => … );

Arr : Flt_Array (1 .. 1_000_000) := …

Partial_Sums : Flt_Array (1 .. Open_MP.Num_Core * 2) :=

(others => 0.0);

begin

parallel (Chunk in Partial_Sums’Range)

for I in Arr’Range loop

Partial_Sums (Chunk) := @ + Arr (I) ** 2;

end loop;

Put_Line (“Total SoS = “ & Partial_Sums’Reduce(“+”, 0.0)’Image);

end Main;

23 of 28

Translate to pragmas

with System.LWT.OpenMP; use System.LWT;

with Ada.Parallelism;

procedure Main is

Control : OpenMP.Scheduler :=

OpenMP.Options (Option1 => …, Option2 => … );

Arr : Flt_Array (1 .. 1_000_000) := …

Partial_Sums : Flt_Array (1 .. Open_MP.Num_Core * 2) :=

(others => 0.0);

begin

pragma Par_Loop (Partial_Sums’Length);

for I in Arr’Range loop

Partial_Sums (Ada.Parallelism.Chunk_Index) := @ + Arr (I) ** 2;

end loop;

Put_Line (“Total SoS = “ & Partial_Sums’Reduce(“+”, 0.0)’Image);

end Main;

24 of 28

2) Translate to calls

with System.LWT.OpenMP; use System.LWT;

with Ada.Parallelism; use Ada.Parallelism;

procedure Main is

…

procedure Loop_Body

(Low, High : Longest_Integer; Chunk_Index : Positive) is

begin

for I in Integer’Val (Low) .. Integer’Val (High) loop

Partial_Sums (Chunk_Index) := @ + Arr (I) ** 2;

end loop;

end Loop_Body;

begin

Par_Range_Loop (Integer’Pos (Arr’First), Integer’Pos (Arr’Last),

Num_Chunks => Partial_Sums’Length, Loop_Body => Loop_Body’Access);

Put_Line (...);

end Main;

25 of 28

3) Ada.Parallelism calls System.LWT

procedure Par_Range_Loop

(Low, High : Longest_Integer; Num_Chunks : Positive;

Loop_Body : access procedure

(Low, High : Longest_Integer; Chunk_Index : Positive)) is

Master : LWT.Thread_Master;

...

begin -- Par_Range_Loop

-- Spawn first thread for first chunk

LWT.Spawn_Thread (Master, new Thread_Data_Extension’

(LWT.Root_Data with

Low => Low,

High => Low + (High-Low+1) / Num_Chunks - 1,

Chunk_Index => 1));

-- Wait for all chunks to complete

LWT.Wait_For_Master (Master);

end Par_Range_Loop;

26 of 28

4) System.LWT dispatches to Scheduler

package System.LWT is

type Thread_Scheduler is abstract tagged limited null record;

…

private

procedure Spawn_Thread (S : in out Scheduler;

Master : in out Thread_Master; Data : Thread_Data_Ptr) is abstract;

procedure Wait_For_Thread (S : in out Scheduler;

Master : in out Thead_Master) is abstract;

…

end System.LWT;

package body System.LWT is

Scheduler : access Thread_Scheduler’Class := …; -- Installed thread scheduler

procedure Spawn_Thread

(Master : in out Thread_Master; Data : Thread_Data_Ptr) is

begin

Scheduler.Spawn_Thread (Master, Data); -- dispatch to installed scheduler

end Spawn_Thread;

...

end System.LWT;

27 of 28

5) System.LWT.OpenMP handles calls

package System.LWT.OpenMP is

type Scheduler is new LWT.Thread_Scheduler with record

Option1 : …

Option2 : …

end record;

private

overriding procedure Spawn_Thread

(S : in out Scheduler; Master : in out Thread_Master; Data : Thread_Data_Ptr);

overriding procedure Wait_For_Threads

(S : in out Scheduler; Master : in out Thread_Master);

…

end System.LWT.OpenMP;

package body System.LWT.OpenMP is

… -- Implementation of Spawn_Thread, Wait_For_Threads, etc. in terms of OpenMP

end System.LWT.OpenMP;

28 of 28

Proposed Ada 202X Conflict Checking

Three levels of conflict checking (data race checking) proposed for Ada 202X

No conflict checks
Disallow conflicts between known-to-be-aliased objects
Disallow any up-level use of non-synchronized objects

OpenMP has dependency annotations

Could be used to detect and/or prevent conflicts