1 of 28

Ada 202X Lightweight Parallelism and OpenMP

Tucker Taft

Ada Rap Group October 2019

based on earlier presentations from

Ada-Europe 2014 update,

Ada-Europe 2019 “20-20 view”

and others

Graphics by Raphaël and Tuck

1

2 of 28

Outline

  • Introduction to Ada 202X and its support for Light-weight Parallelism
  • Introduction to OpenMP and approach to Ada 202X binding
  • Proposed layering of Ada 202X support for LWT
  • Example to illustrate layering
  • Discussion of OpenMP and Ada 202X layering
  • Discussion of proposed Ada 202X compile-time Data-Race (Conflict) checking and OpenMP approaches

2

3 of 28

Ada 202X High-Level Story

3

  • Make Ada a great language for parallel programming
  • Other enhancements that build on Ada’s existing strengths:
    • Safety and Security
    • Contract-Based Programming
    • Expressivity (particularly when it furthers the previous goals)

4 of 28

Adding Support for Parallel Programming

4

Concurrent programming

�Multiple workers

Multiple computations

May need to synchronize across workers

Parallel programming

Many workers

One large computation

Synchronization only for work split/join

5 of 28

Adding Support for Parallel Programming

5

Concurrent programming

  • Ada has great building blocks for concurrent programming
  • Tasks, rendezvous, protected objects

Parallel programming

  • Ada 2012 has nothing built-in
  • Although concurrent building blocks can be used, they’re very heavyweight

6 of 28

Ada 202X Parallel Programming Goals

6

  • Make it easy and safe to write parallel algorithms
    • Which can be tuned for different target environments without starting over
  • Hide the housekeeping of dispatch/scheduling/data accumulation
  • Have the compiler detect and disallow/prevent data races
  • Enable use of various light-weight threading (LWT) systems
    • OpenMP is most important
      • but even OpenMP has multiple implementations
    • Provide a portable default LWT library that runs on Ada run-time

7 of 28

A reminder why this is important…

The Right Turn in Single-Processor Performance (14 years ago)

Courtesy IEEE Computer, January 2011, page 33.

Safe Parallel Ada 7

8 of 28

Parallel Loops (202X)

8

parallel (2*Num_CPUs) -- specify max level of parallelism

for I in 1 .. 1_000 loop

A(I) := B(I) + C(I);

end loop;

parallel for Elem of Arr loop

Elem := Elem * 2;

end loop;

9 of 28

Parallel Block (202X)

parallel do

handled_sequence_of_statements

{and

handled_sequence_of_statements}

end do;

From Ada 202x draft manual:

Each handled_sequence_of_statements represents a separate logical thread of control that proceeds independently and concurrently. The parallel_block_statement is complete once every one of the handled_sequence_of_statements has completed, either by reaching the end of its execution, or due to a transfer of control out of the construct by one of the handled_sequence_of_statements (see 5.1).

  • Parallel block is important for divide-and-conquer algorithms
    • such as sorting and searching
    • equivalent to parallel loop around a “case” statement

10 of 28

Map/Reduce Iterators (202X)

10

��-- A reduction expression to calculate the sum of elements of an array�Result : Integer := [for Element of Arr => Element]’Reduce(“+”, 0);

-- A reduction expression to create an unbounded string �-- containing the alphabetAlphabet : Unbounded_String� := [for Letter in 'A' .. 'Z' => Letter]’Reduce(“&”, Null_Unbounded_String, “&”);

-- A reduction expression to determine how many �-- people in a database are 30-somethingThirtySomethings : constant Natural � := [for P of Personnel => (if Age(P) > 30 then 1 else 0)]’Reduce(“+”, 0);

11 of 28

Global contracts from SPARK (202X)�used for data race detection

Global => in out all -- default for non-pure pkgs

Global => null -- default for pure packages

-- Explicitly identified globals with modes

Global => (in P1.A, P2.B,

in out P1.C,

out P1.D, P2.E)

-- Pkg data, access collection, task/protected/atomic

Global => in out private of P3 -- pkg P3 data

Global => synchronized in out all -- prot/atomic

12 of 28

Nonblocking contract�used for deadlock detection

  • Ada 202X Nonblocking aspect

-- apply to one subprogram

procedure Suspend_Until_True

(S : in out Suspension_Object)

with Nonblocking => False;

-- apply to an entire package

package Ada.Characters.Handling

with Nonblocking => True is

13 of 28

Ada 202x Syntactic Building Blocks for Parallelism

13

14 of 28

Ada 202X Building Blocks -- Iterators

14

  • Programmers Like Iterators
    • Looping semantics very visible -- no mystery
    • Ada 2012 iterators made containers significantly more useful
      • E.g. Newer AdaCore products make heavy use of iterators

  • In Ada 202X, we build on iterators
    • For array aggregates defined with “iterated component association”:
      • X : Int_Array := (for I in 1 .. N => I**2)
    • For aggregates defined by iterating over a container:
      • Y : Int_Array := (for E of C => E);
    • For “procedural iterators”:
      • Loop body becomes local anonymous procedure passed into existing iterator procedures:
        • such as Maps.Iterate and Environment_Variables.Iterate
    • For reduction expressions (see earlier examples)

15 of 28

Ada 202X Building Blocks -- Filters

15

  • Iterators sometimes generate too many values
    • Use filter to reduce to values of interest
    • for iterator when condition ...

  • Filters can be used in various kinds of iterators:
    • For aggregates defined by iterating over a container:
      • Odds : Int_Array := (for E of C when E mod 2 = 1 => E);
    • For procedural iterators:
      • for (Name, Value) of Environment_Variables.Iterate

when Name(Name’First) /= “_” loop

Put_Line (Name & “ => “ & Value);

end loop;

    • For reduction expressions:
      • [for P of Personnel when Age(P) > 30 => 1]’Reduce(“+”, 0);

16 of 28

Ada 202X Building Blocks -- “parallel”

16

  • Iterators can generate many values
    • Use “parallel” to spawn multiple logical threads of control
      • parallel-- uses default amount of “chunking”
      • parallel (Num_Chunks)-- specify a max number of chunks
      • parallel (Chunk in 1..Num_Chunks)-- specify a chunk parameter

  • parallel” can be used with various kinds of iterators:
    • For iterating over a large range:
      • parallel (Chunk in 1 .. Num_Chunks) -- named chunk parameter

for I in Arr’Range loop

. . . -- possibly do other stuff

Partial_Sum(Chunk) := @ + Arr(I); -- accumulator for each chunk

end loop;

return Partial_Sum’Reduce(“+”, 0.0); -- final reduction

    • For large reduction expressions over container iterator:
      • [parallel for P of Personnel when Age(P) > 30 => 1]’Reduce(“+”, 0);
        • User defined “split” iterators for containers -- like Java’s “spliterators”
  • Data race conflict checks provided at three levels -- All, Known, None

17 of 28

Ada 202x uses “building block” approach

  • Enabled by orthogonality of syntax
  • Eases use and readability relative to large library or multiple pragmas
  • Portable concepts that can be mapped to diverse targets
  • Compile-time conflict checking for safety

17

“parallel”

with chunks

Iterators

Filters

Quantified expression

Reduction expression

Aggregate

Loop body

18 of 28

Mapping Ada 202X to OpenMP & friends

18

  • OpenMP 1.0 -- 1997, OpenMP ARB
    • Heavy weight threads, SPMD model
  • CUDA -- 2007, NVIDIA
    • NVIDIA GPUs, explicit separated “kernel” code
  • OpenCL -- 2008, Apple, Khronos
    • Most GPUs, explicit separated “kernel”s
  • OpenMP 3.0 -- 2008, OpenMP ARB
    • Lighter weight “tasks” with work sharing
  • OpenACC -- 2011, Cray, NVIDIA, PGI
    • Many GPUs, no separate “kernel”
    • Extracts GPU “kernel” code from for-loop
  • OpenMP 4.0 -- 2013, OpenMP ARB
    • Adds “target” devices, begins to subsume OpenACC
  • OpenMP 5.0 -- 2018, OpenMP ARG
    • Largely subsumes OpenACC, and OpenCL to lesser extent

19 of 28

Mapping Ada 202X to OpenMP & friends

19

  • OpenMP originally designed in 1997
    • Initially supported only heavy-weight “threads”
      • mapped generally to “kernel” threads
      • analogous to Ada “tasks”
    • Thread ID used explicitly to compute what part of data to manipulate
      • SPMD -- “Single Program, Multiple Data” model
    • Programs had no visible structure that matched computation being performed
      • Pragmas used heavily to provide implicit structure
      • No data race checking provided

  • OpenMP evolved over time; OpenMP 5.0 is recently released (Nov 2018)
    • Early features mostly supplanted by newer notions based on lighter-weight “tasks” and work sharing.
    • Supports parallel loops of a very-specific structure (pragma + pattern)
    • Supports parallel blocks using an explicit “task” pragma
    • Incorporated OpenACC-like support for “target devices” such as GPUs
    • Preferred over OpenCL or CUDA because can debug parallel algorithm on host before inserting “target” directives
    • Still no data race checking in most implementations
      • Explicit dependence annotations could enable some checking, some day

20 of 28

Mapping Ada 202X to OpenMP & friends

20

  • For Ada 202X mapping, we will generally use newer OpenMP features
    • Rely on Ada 202x language syntax
      • for high-level parallel algorithm structure and correctness and tuning (e.g. chunking)
    • Rely on pragmas, aspects, and/or library calls for target-specific tuning:
      • Selecting number of heavy-weight threads
      • Data flushing and caching
      • Mapping to target devices

  • Examples of specific mappings:
    • Parallel region establishes initial number of (heavyweight) threads
      • Generally will create one region per Ada program
      • Want to minimize creating and releasing multiple heavyweight threads
    • For parallel block, tasks are generated
      • a “single” construct followed by two or more “task” pragmas (or API calls)
      • awaited at a “taskwait”
    • For parallel loop:
      • “parallel for” or “taskloop” pragma/API used for loops that match for-loop “pattern” supported by OpenMP
      • tasks spawned explicitly to handle other Ada 202X parallel loops, such as those for parallel container iterators, with explicit “taskwait”

21 of 28

Possible layering of Ada 202X light-weight parallelism support

  • Syntax
    • parallel for .. in/of, parallel do, parallel for (...), [parallel for …]
  • Optional: pragmas
    • Enabling support in earlier Ada versions -- equiv to syntax
    • For local control and tuning for particular bindings (e.g. OpenMP)
      • adds binding-specific option record to compiler-generated calls
  • Ada.Parallelism package
    • Called by compiler-generated code (or advanced user)
    • Includes “hook” for passing binding-specific option record
  • System.LWT interface called by Ada.Parallelism body (or advanced user)
    • Uses dispatch table to invoke particular binding’s LWT support
  • System.LWT.OpenMP, System.LWT.Default, … other bindings
    • Elaboration of binding package initializes dispatch table
    • User’s “main” procedure “with”s binding of interest
      • Can declare Scheduler-control object in main subprogram to specify tuning
      • Might get sequential semantics by default

21

22 of 28

Example to illustrate layering

with System.LWT.OpenMP; use System.LWT;

procedure Main is

Control : OpenMP.Scheduler_Control :=

OpenMP.Options (Option1 => …, Option2 => … );

Arr : Flt_Array (1 .. 1_000_000) := …

Partial_Sums : Flt_Array (1 .. Open_MP.Num_Core * 2) :=

(others => 0.0);

begin

parallel (Chunk in Partial_Sums’Range)

for I in Arr’Range loop

Partial_Sums (Chunk) := @ + Arr (I) ** 2;

end loop;

Put_Line (“Total SoS = “ & Partial_Sums’Reduce(“+”, 0.0)’Image);

end Main;

22

23 of 28

  1. Translate to pragmas

with System.LWT.OpenMP; use System.LWT;

with Ada.Parallelism;

procedure Main is

Control : OpenMP.Scheduler :=

OpenMP.Options (Option1 => …, Option2 => … );

Arr : Flt_Array (1 .. 1_000_000) := …

Partial_Sums : Flt_Array (1 .. Open_MP.Num_Core * 2) :=

(others => 0.0);

begin

pragma Par_Loop (Partial_Sums’Length);

for I in Arr’Range loop

Partial_Sums (Ada.Parallelism.Chunk_Index) := @ + Arr (I) ** 2;

end loop;

Put_Line (“Total SoS = “ & Partial_Sums’Reduce(“+”, 0.0)’Image);

end Main;

23

24 of 28

2) Translate to calls

with System.LWT.OpenMP; use System.LWT;

with Ada.Parallelism; use Ada.Parallelism;

procedure Main is

procedure Loop_Body

(Low, High : Longest_Integer; Chunk_Index : Positive) is

begin

for I in Integer’Val (Low) .. Integer’Val (High) loop

Partial_Sums (Chunk_Index) := @ + Arr (I) ** 2;

end loop;

end Loop_Body;

begin

Par_Range_Loop (Integer’Pos (Arr’First), Integer’Pos (Arr’Last),

Num_Chunks => Partial_Sums’Length, Loop_Body => Loop_Body’Access);

Put_Line (...);

end Main;

24

25 of 28

3) Ada.Parallelism calls System.LWT

procedure Par_Range_Loop

(Low, High : Longest_Integer; Num_Chunks : Positive;

Loop_Body : access procedure

(Low, High : Longest_Integer; Chunk_Index : Positive)) is

Master : LWT.Thread_Master;

...

begin -- Par_Range_Loop

-- Spawn first thread for first chunk

LWT.Spawn_Thread (Master, new Thread_Data_Extension’

(LWT.Root_Data with

Low => Low,

High => Low + (High-Low+1) / Num_Chunks - 1,

Chunk_Index => 1));

-- Wait for all chunks to complete

LWT.Wait_For_Master (Master);

end Par_Range_Loop;

25

26 of 28

4) System.LWT dispatches to Scheduler

package System.LWT is

type Thread_Scheduler is abstract tagged limited null record;

private

procedure Spawn_Thread (S : in out Scheduler;

Master : in out Thread_Master; Data : Thread_Data_Ptr) is abstract;

procedure Wait_For_Thread (S : in out Scheduler;

Master : in out Thead_Master) is abstract;

end System.LWT;

package body System.LWT is

Scheduler : access Thread_Scheduler’Class := …; -- Installed thread scheduler

procedure Spawn_Thread

(Master : in out Thread_Master; Data : Thread_Data_Ptr) is

begin

Scheduler.Spawn_Thread (Master, Data); -- dispatch to installed scheduler

end Spawn_Thread;

...

end System.LWT;

26

27 of 28

5) System.LWT.OpenMP handles calls

package System.LWT.OpenMP is

type Scheduler is new LWT.Thread_Scheduler with record

Option1 : …

Option2 : …

end record;

private

overriding procedure Spawn_Thread

(S : in out Scheduler; Master : in out Thread_Master; Data : Thread_Data_Ptr);

overriding procedure Wait_For_Threads

(S : in out Scheduler; Master : in out Thread_Master);

end System.LWT.OpenMP;

package body System.LWT.OpenMP is

… -- Implementation of Spawn_Thread, Wait_For_Threads, etc. in terms of OpenMP

end System.LWT.OpenMP;

27

28 of 28

Proposed Ada 202X Conflict Checking

  • Three levels of conflict checking (data race checking) proposed for Ada 202X
    • No conflict checks
    • Disallow conflicts between known-to-be-aliased objects
    • Disallow any up-level use of non-synchronized objects

  • OpenMP has dependency annotations
    • Could be used to detect and/or prevent conflicts

28