1 of 110

What Formal Approaches are Needed for Correctness Verification in HPC

Ganesh Gopalakrishnan

School of Computing, University of Utah

and

Center for Parallel Computing (CPU)

http://www.cs.utah.edu/fv

2 of 110

Ever-growing Demand for Compute Power!

  • Terascale
  • Petascale
  • Exascale
  • Zettascale

3 of 110

More compute power enables new discoveries, solves new problems

    • Molecular dynamics simulations
      • Better drug design facilitated
        • Sanbonmatsu et al., FSE 2010 keynote
      • 290 days of simulation to simulate 2 million atom interactions over 2 nanoseconds
    • Better “oil caps” can be designed if we have the right compute infrastructure
      • Gropp, SC 2010 panel

4 of 110

High-end Debugging Methods are often Expensive, Inconclusive

  • Expensive machines, resources
    • $3M electricity a year
    • $1B to install hardware
    • Months of planning to get runtime on cluster

  • Debugging tools/methods are primitive
    • Extreme-Scale goal unrealistic w/o better approaches
  • Inadequate attention from “CS”
    • Little/no Formal Software Engineering methods
    • Almost zero critical mass

5 of 110

Importance of Message Passing in HPC (MPI)

  • Born ~1994
      • The world’s fastest CPU ran at 68 MHz
      • The Internet had 600 sites then!
      • Java was still not around

  • Still dominant in 2011
    • Large investments in applications, tooling support
      • Credible FV research in HPC must include MPI

    • Use of message passing is growing
      • MCAPI, RCCE, streams in CUDA, Queues in OpenCL, .NET Async

6 of 110

Trend: Hybrid Concurrency

[Figure: the hybrid-concurrency hardware/software stack: Sandy Bridge (courtesy anandtech.com), GeForce GTX 480 (Nvidia), AMD Fusion APU; concurrent data structures; InfiniBand-style interconnect; monolithic large-scale MPI-based user applications; problem-solving environments (e.g., Uintah, Charm++, ADLB) and user applications built on them; high-performance MPI libraries]

7 of 110

Difficult Road Ahead wrt Debugging

  • Concurrent software debugging is hard
  • Gets harder as the degree of parallelism in applications increases
    • Node level: Message Passing Interface (MPI)
    • Core level: Threads, OpenMP, CUDA
  • Hybrid programming will be the future
    • MPI + Threads
    • MPI + OpenMP
    • MPI + CUDA
  • Yet tools are lagging behind!
    • Many tools cannot operate at scale and give measurable coverage

[Figure: HPC Apps vs. HPC Correctness Tools]

8 of 110

Focus of the rest of the talk: Mostly on MPI

Hybrid verification is future work

We will cover GPUs separately

9 of 110

MPI Verification approach depends on the type of determinism

  • Execution Deterministic
    • Basically one computation per input data
  • Value Deterministic
    • Multiple computations, but yield same “final answer”
  • Nondeterministic
    • Basically reactive programs built around message passing, possibly also using threads

Examples to follow

10 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied]

11 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied, annotated with MPI_Bcast, MPI_Send, and MPI_Recv]

12 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied, annotated with MPI_Send, MPI_Bcast, and MPI_Recv]

13 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the completed product (A x B = C), annotated with MPI_Send, MPI_Recv, and MPI_Bcast]

14 of 110

Unoptimized Initial Version : Execution Deterministic

[Figure: the master issues MPI_Send for each row and collects results with MPI_Recv(from: P0, P1, P2, P3, …); the next row is sent to the first worker, which by now must be free]

15 of 110

Later Optimized Version : Value Deterministic
Opportunistically Send Work to the Processor that Finishes First

[Figure: the master issues MPI_Send and a wildcard MPI_Recv(from: *); the next row goes to the first worker that returns an answer]
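Below is a minimal MPI/C sketch of this master-worker pattern (not the code behind the figure): the master hands out rows, posts a wildcard MPI_Recv with MPI_ANY_SOURCE, and gives the next row to whichever worker answered first. do_row_work, the tags, and NROWS are illustrative placeholders.

  /* Sketch of "send the next row to whichever worker answers first". */
  #include <mpi.h>

  #define NROWS    64
  #define TAG_WORK 1
  #define TAG_STOP 2

  static double do_row_work(double x) { return 2.0 * x; }   /* placeholder computation */

  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master */
      int next = 0, outstanding = 0, dummy = 0;
      double result;
      MPI_Status st;
      for (int w = 1; w < size; w++)               /* hand one row to each worker initially */
        if (next < NROWS) {
          MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
          next++; outstanding++;
        }
      while (outstanding > 0) {
        /* wildcard receive: whichever worker finishes first gets the next row */
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &st);
        outstanding--;
        if (next < NROWS) {
          MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
          next++; outstanding++;
        }
      }
      for (int w = 1; w < size; w++)               /* tell every worker to stop */
        MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                                       /* worker */
      MPI_Status st;
      int row;
      for (;;) {
        MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;
        double r = do_row_work((double)row);
        MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
      }
    }
    MPI_Finalize();
    return 0;
  }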

16 of 110

Still More Optimized Value-Deterministic Versions:
Communications are made Non-blocking and Software Pipelined
(still expected to remain value-deterministic)

[Figure: as before, the master sends work (MPI_Send) and posts wildcard receives (MPI_Recv(from: *)), now issued in a non-blocking, software-pipelined fashion; the next row still goes to the first worker that returns an answer]

17 of 110

Typical MPI Programs

  • Value-Nondeterministic MPI programs do exist
    • Adaptive Dynamic Load Balancing Libraries

  • But most are value deterministic or execution deterministic
    • Of course, one does not really know w/o analysis!
      • Detect replay non-determinism over schedule space
    • Races can creep into MPI programs
      • Forgetting to Wait for MPI non-blocking calls to finish
    • Floating point can make things non-deterministic

18 of 110

Gist of bug-hunting story

  • MPI programs “die by the bite of a thousand mosquitoes”
    • No major vulnerabilities one can focus on
      • E.g. in Thread Programming, focusing on races
      • With MPI, we need comprehensive “Bug Monitors”
  • Building MPI bug monitors requires collaboration
    • Lucky to have collaborations with DOE labs
    • The lack of FV critical mass hurts

19 of 110

A real-world bug

P0 P1 P2

--- --- ---

Send( (rank+1)%N ); Send( (rank+1)%N ); Send( (rank+1)%N );

Recv( (rank-1)%N ); Recv( (rank-1)%N ); Recv( (rank-1)%N );

  • Expected “circular” msg passing
  • Found that P0’s Recv entirely vanished !!
  • REASON : ??

    • In C, -1 % N is not N-1 but rather -1 itself
    • In MPI, “-1” is MPI_PROC_NULL
    • A Recv posted on MPI_PROC_NULL is silently ignored! (see the sketch below)
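A small illustrative sketch of this bug and its usual fix; the ring program below is hypothetical, not the original application.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    int rank, N, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &N);

    int right      = (rank + 1) % N;
    int buggy_left = (rank - 1) % N;      /* for rank 0 this is -1 in C, not N-1        */
    int left       = (rank - 1 + N) % N;  /* the fix: always a valid rank               */
    (void)buggy_left;

    /* NOTE: Send-before-Recv around a ring relies on the runtime buffering small
       messages; see the resource-dependent-bug discussion later in the talk.          */
    MPI_Send(&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
    /* With buggy_left, rank 0 would post its Recv on -1; if that value happens to be
       MPI_PROC_NULL, the receive silently "succeeds" without receiving anything.       */
    MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d got token from rank %d\n", rank, token);

    MPI_Finalize();
    return 0;
  }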


21 of 110

MPI Bugs – more anecdotal evidence

  • Bug encountered at large scale w.r.t. famous MPI library (Vo)
    • Bug was absent at a smaller scale
    • It was a concurrency bug
  • Attempt to implement collective communication (Thakur)
    • Bug exists for ranges of size parameters
  • Wrong assumption: that MPI barrier was irrelevant (Siegel)
    • It was not – a communication race was created
  • Other common bugs (we see them a lot; potentially concurrency dependent)
    • Forgetting to wait for a non-blocking receive to finish (see the sketch below)
    • Forgetting to free up communicators and type objects
  • Some codes may be considered buggy if non-determinism arises!
    • Use of MPI_Recv(*) often does not result in non-deterministic execution
    • Need something more than “superficial inspection”
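A minimal sketch of the “forgotten Wait” bug class mentioned above; the two-rank exchange, tag, and payload are illustrative and not taken from any of the cited incidents.

  #include <mpi.h>
  #include <stdio.h>

  /* If the MPI_Wait below is omitted, reading buf races with the MPI library
     still writing into it, and the request object leaks.  Run with two ranks. */
  int main(int argc, char **argv) {
    int rank, buf = 0, out = 42;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    MPI_Irecv(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);

    /* The common bug is to stop here and use buf right away.                  */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes the Irecv, frees the request;
                                            only now is buf valid to read       */
    printf("rank %d received %d\n", rank, buf);
    MPI_Finalize();
    return 0;
  }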

22 of 110

Real bug stories in the MPI-land

  • Typing a[i][i] = init instead of a[i][j] = init
  • Communication races
    • Unintended send matches “wildcard receive”
  • Bugs that show up when ported
    • Runtime buffering changes; deadlocks erupt
    • Sometimes, bugs show up when buffering added!
  • Misunderstood “Collective” semantics
    • Broadcast does not have “barrier” semantics
  • MPI + threads
    • Royal troubles await the newbies

23 of 110

How does one make progress?

Here are some thoughts..

Roughly justifies what we’ve done ☺

24 of 110

Our Research Agenda in HPC

  • Solve FV of Pure MPI Applications “well”
    • Progress in non-determinism coverage for fixed test harness
    • MUST integrate with good error monitors
  • (Preliminary) Work on hybrid MPI + Something
    • Something = Pthreads and CUDA so far
    • Evaluated heuristics for deterministic replay of Pthreads + MPI
  • Work on CUDA/OpenCL Analysis
    • Good progress on Symbolic Static Analyzer for CUDA Kernels
    • (Prelim.) progress on Symbolic Test Generator for CUDA Pgms
  • (Future) Symbolic Test Generation to “crash” hybrid pgms
    • Finding lurking crashes may be a communicable value proposition
  • (Future) Intelligent schedule-space exploration
    • Focus on non-monolithic MPI programs

25 of 110

Motivation for Coverage of Communication Nondeterminism

26 of 110

Eliminating wasted search in message passing verif.

P0:  MPI_Send(to P1, …);  MPI_Send(to P1, data=22);

P1:  MPI_Recv(from P0, …);  MPI_Recv(from P2, …);  MPI_Recv(*, x);
     if (x==22) then error1 else MPI_Recv(*, x);

P2:  MPI_Send(to P1, …);  MPI_Send(to P1, data=33);

27 of 110

@InProceedings{PADTAD2006:JitterBug,
  author    = {Richard Vuduc and Martin Schulz and Dan Quinlan and Bronis de Supinski and Andreas S{\ae}bj{\"o}rnsen},
  title     = {Improving distributed memory applications testing by message perturbation},
  booktitle = {Proc.~4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis},
  address   = {Portland, ME, USA},
  month     = {July},
  year      = {2006}
}

A frequently followed approach: “boil the whole schedule space” – often very wasteful

28 of 110

Eliminating wasted work in message passing verif.

P0:  MPI_Send(to P1, …);  MPI_Send(to P1, data=22);

P1:  MPI_Recv(from P0, …);  MPI_Recv(from P2, …);  MPI_Recv(*, x);
     if (x==22) then error1 else MPI_Recv(*, x);

P2:  MPI_Send(to P1, …);  MPI_Send(to P1, data=33);

But consider these two cases…

No need to play with schedules of deterministic actions
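The same example rendered as a compilable sketch (the payload values 11 and 12 for the unspecified sends are made up; run with three ranks). A dynamic verifier only has to explore the two possible matches of the wildcard receive, not every interleaving of the deterministic operations.

  #include <mpi.h>
  #include <assert.h>

  int main(int argc, char **argv) {
    int rank, x = 0, tmp = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      int a = 11, b = 22;
      MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      /* two deterministic receives, then the wildcard that matters */
      MPI_Recv(&tmp, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(&tmp, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (x == 22)
        assert(0 && "error1 reached");   /* only one of the two wildcard matches hits this */
      else
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
      int c = 12, d = 33;
      MPI_Send(&c, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      MPI_Send(&d, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
  }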

29 of 110

Need to detect Resource Dependent Bugs

30 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Recv(from:1);

P1:  Send(to:0);  Recv(from:0);

We know that this program may deadlock when the Sends have less buffering

31 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Recv(from:1);

P1:  Send(to:0);  Recv(from:0);

… and this program with more Send buffering may avoid a deadlock
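A sketch of this two-process pattern in MPI/C, with the message size as an illustrative knob: small messages are typically buffered eagerly and the code completes, while large messages force rendezvous sends and both ranks block. MPI_Sendrecv, shown in the comment, is the usual buffering-independent fix. Assumes exactly two ranks.

  #include <mpi.h>

  #define N (1 << 20)          /* large enough to push most MPI runtimes into
                                  rendezvous mode (no send buffering)          */
  int main(int argc, char **argv) {
    int rank;
    static double sendbuf[N], recvbuf[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    /* Head-to-head: both ranks Send first, then Recv.  Correct only while the
       runtime buffers the sends; deadlocks once buffering runs out.           */
    MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* A buffering-independent alternative:
       MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, peer, 0,
                    recvbuf, N, MPI_DOUBLE, peer, 0,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);                         */
    MPI_Finalize();
    return 0;
  }

Note that testing only the buffered configuration would miss the deadlock, which is exactly the misunderstanding called out a few slides later.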

32 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Send(to:2);

P1:  Send(to:2);  Recv(from:0);

P2:  Recv(from:*);  Recv(from:0);

… But this program deadlocks if Send(to:1) has more buffering!


36 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Send(to:2);

P1:  Send(to:2);  Recv(from:0);

P2:  Recv(from:*);  Recv(from:0);

… But this program deadlocks if Send(to:1) has more buffering!

Mismatched – hence a deadlock
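The same three-process scenario as an illustrative C sketch (payload values are made up; run with three ranks). The comment spells out why extra buffering on Send(to:1) is exactly what enables the deadlock.

  #include <mpi.h>

  /* With no buffering, P0 is held at Send(to:1), so P2's wildcard can only
     match P1's Send(to:2) and everything completes.  If Send(to:1) is
     buffered, P0 races ahead and its Send(to:2) may win the wildcard; then
     P1's Send(to:2) and P2's Recv(from:0) are left mismatched: deadlock.   */
  int main(int argc, char **argv) {
    int rank, v = 7, r = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
      MPI_Recv(&r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
      MPI_Recv(&r, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(&r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
  }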

37 of 110

Widely Publicized Misunderstandings

“Your program is deadlock free if you have successfully tested it under zero buffering”

38 of 110

MPI at fault?

  • Perhaps partly
    • Over 17 years of MPI, things have changed
    • Inevitable use of shared memory cores, GPUs, …
    • Yet, many of the issues seem fundamental to
      • Need for wide adoption across problems, languages, machines
      • Need to give programmer better handle on resource usage

  • How to evolve out of MPI?
    • Whom do we trust to reset the world?
    • Will they get it any better?
    • What about the train-wreck meanwhile?

  • Must one completely evolve out of MPI?


39 of 110

Our Impact So Far

[Figure: the hybrid hardware/software stack again, now annotated with our tools: PUG and GKLEE target GPU code (Sandy Bridge, GeForce GTX 480, AMD Fusion APU, concurrent data structures); ISP and DAMPI target monolithic large-scale MPI-based user applications; useful formalizations help test problem-solving environments (e.g., Uintah, Charm++, ADLB), the user applications built on them, and high-performance MPI libraries]

40 of 110

Rest of the talk

  • Dynamic formal verification of MPI
    • It is basically testing which discovers all alternate schedules
      • Coverage of communication non-determinism
    • Also gives us a “predictive theory” of MPI behavior

    • Centralized approach : ISP
    • GEM: Tool Integration within Eclipse Parallel Tools Platform

    • Distributed approach : DAMPI

  • CUDA verification support
    • SMT-supported static analysis of CUDA kernels
    • Symbolic test generation approach (in progress)


All our pubs + code at http://www.cs.utah.edu/fv

41 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

42 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Non-blocking Send – send lasts from Isend to Wait
  • Send buffer can be reclaimed only after Wait clears
  • Forgetting to issue Wait 🡺 MPI “request object” leak


44 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Non-blocking Receive – lasts from Irecv to Wait
  • Recv buffer can be examined only after Wait clears
  • Forgetting to issue Wait 🡺 MPI “request object” leak

45 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Blocking receive in the middle
  • Equivalent to Irecv followed by its Wait
  • The data fetched by Recv(2) is available before that of Irecv

46 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Since P0’s Isend and P1’s Irecv can be “in flight”, the barrier can be crossed
  • This allows P2’s Isend to race with P0’s Isend to match the Irecv(*) (see the sketch below)
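A C sketch of this example (tag and payloads are illustrative; run with exactly three ranks). If P2's Isend wins the wildcard match, P1's later Recv(2) has nothing left to match, which is the deadlock the scheduler animation later in the talk uncovers.

  #include <mpi.h>

  int main(int argc, char **argv) {
    int rank, out = 1, in1 = 0, in2 = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      /* the wildcard receive stays pending across the barrier, so both
         P0's and P2's Isend are candidates to match it                   */
      MPI_Irecv(&in1, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Recv(&in2, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
  }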

47 of 110

Testing Challenges

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Traditional testing methods may reveal only P0->P1 or P2->P1
  • P2->P1 may happen after the code is ported
  • Our tools ISP and DAMPI automatically discover and run both tests, regardless of the execution platform

48 of 110

Flow of ISP

[Figure: the MPI program is compiled against an interposition layer; the executable's processes Proc1, Proc2, …, Procn report their MPI calls to the scheduler, which controls what is actually run on the MPI runtime]

  • Scheduler intercepts MPI calls
  • Reorders and/or rewrites the actual calls going into the MPI Runtime
  • Discovers maximal non-determinism; plays through all choices
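ISP's real interposition layer is considerably more involved; the sketch below only shows the standard MPI profiling-interface (PMPI) trick that such layers are built on and is meant to be linked into the application. consult_scheduler is a placeholder, and the signatures follow the MPI-2 era prototypes.

  #include <mpi.h>
  #include <stdio.h>

  /* The tool provides its own MPI_Send / MPI_Barrier, decides (e.g., after
     consulting a verification scheduler) when to forward each call, and then
     invokes the real implementation through the PMPI_ entry points.          */
  static void consult_scheduler(const char *call, int peer) {
    /* a real tool would block here until the scheduler says "go" */
    fprintf(stderr, "intercepted %s(peer=%d)\n", call, peer);
  }

  int MPI_Send(void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm) {
    consult_scheduler("MPI_Send", dest);
    return PMPI_Send(buf, count, type, dest, tag, comm);
  }

  int MPI_Barrier(MPI_Comm comm) {
    consult_scheduler("MPI_Barrier", -1);
    return PMPI_Barrier(comm);
  }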

49 of 110

ISP Scheduler Actions (animation)

P0:  Isend(1, req);  Barrier;  Wait(req);
P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
P2:  Barrier;  Isend(1, req);  Wait(req);

[Frame 1: P0's Isend(1) and Barrier reach the scheduler, which replies sendNext; nothing has been issued to the MPI runtime yet]

50 of 110

ISP Scheduler Actions (animation)

[Frame 2: P1's Irecv(*) and Barrier reach the scheduler as well]

51 of 110

ISP Scheduler Actions (animation)

[Frame 3: all three Barriers have been collected; the Barrier match set is formed and issued to the MPI runtime]

52 of 110

ISP Scheduler Actions (animation)

[Frame 4: the remaining operations (Wait, Recv(2), Isend(1), Wait) are collected; when the wildcard is rewritten to Irecv(2), P0's Isend(1) and P1's Recv(2) are left without partners and no match set can be formed: Deadlock!]

53 of 110

Formalization of MPI Behavior to build Formal Dynamic Verifier (verification scheduler)

54 of 110

Formal Modeling of MPI Calls

  • MPI calls are modeled in terms of four salient events
    • Call issued
      • All calls are issued in program order
    • Call returns
      • The code after the call can now be executed
    • Call matches
      • An event that marks the call committing
    • Call completes
      • All resources associated with the call are freed


55 of 110

The Matches-Before Relation of MPI

(“X  MB  Y” below means X matches-before Y; in each pair, the first operation comes earlier in program order.)

  1. Isend(to: Proc k, …)   MB   Isend(to: Proc k, …)
  2. Irecv(from: Proc k, …)   MB   Irecv(from: Proc k, …)
  3. Isend( &h )   MB   Wait( h )
  4. Wait(..)   MB   AnyMPIOp
  5. Barrier(..)   MB   AnyMPIOp
  6. Irecv(from: Proc *, …)   conditionally MB   Irecv(from: Proc k, …)
  7. Irecv(from: Proc j, …)   conditionally MB   Irecv(from: Proc *, …)

56 of 110

The ISP Scheduler

  • Pick a process Pi and its instruction Op at PC n
  • If Op does not have an unmatched ancestor according to MB, then collect Op into the Scheduler’s Reorder Buffer
    • Stay with Pi, Increment n
  • Else Switch to Pi+1 until all Pi are “fenced”
    • “fenced” means all current Ops have unmatched ancestors
  • Form Match Sets according to priority order
    • If Match Set is {s1, s2, .. sK} + R(*), cover all cases using stateless replay
  • Issue an eligible set of Match Set Ops into the MPI runtime
  • Repeat until all processes are finalized or error encountered


Theorem: This Scheduling Method achieves ND-Coverage in MPI programs!
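A toy, single-process model of this scheduling loop, exercised on the “Auto-send” example of the next slides. Only the MB rules that program needs are encoded, and the match-set priority used here (send-receive pair, then barrier, then a ready wait) is a simplification of ISP's actual policy; this is an illustrative sketch, not ISP code.

  #include <stdio.h>

  typedef enum { R, B, S, W } Kind;                   /* Irecv, Barrier, Isend, Wait */
  typedef struct { Kind k; int h; int matched; } Op;  /* h = request handle          */

  /* P0: R(from:0,h1); B; S(to:0,h2); W(h1); W(h2);                                   */
  static Op prog[] = { {R,1,0}, {B,0,0}, {S,2,0}, {W,1,0}, {W,2,0} };
  static const char *txt[] = { "R(from:0,h1)", "B", "S(to:0,h2)", "W(h1)", "W(h2)" };
  enum { NOPS = 5 };

  /* op i is blocked if some earlier, still unmatched op matches-before it */
  static int blocked(int i) {
    for (int j = 0; j < i; j++) {
      if (prog[j].matched) continue;
      if (prog[j].k == B || prog[j].k == W) return 1;          /* Barrier/Wait MB anything */
      if (prog[i].k == W && prog[i].h == prog[j].h) return 1;  /* Isend/Irecv MB its Wait  */
    }
    return 0;
  }

  static int all_matched(void) {
    for (int i = 0; i < NOPS; i++) if (!prog[i].matched) return 0;
    return 1;
  }

  int main(void) {
    int collected = 0;                    /* ops moved into the reorder buffer */
    while (!all_matched()) {
      while (collected < NOPS && !blocked(collected)) {   /* collect until fenced */
        printf("collect  %s\n", txt[collected]);
        collected++;
      }
      /* fenced: form one match set among the collected, unmatched ops */
      int ri = -1, si = -1, bi = -1, wi = -1;
      for (int i = 0; i < collected; i++) {
        if (prog[i].matched) continue;
        if (prog[i].k == R) ri = i;
        if (prog[i].k == S) si = i;
        if (prog[i].k == B) bi = i;
        if (prog[i].k == W && wi < 0)     /* a Wait is ready once its request matched */
          for (int j = 0; j < i; j++)
            if (prog[j].k != W && prog[j].h == prog[i].h && prog[j].matched) wi = i;
      }
      if (ri >= 0 && si >= 0) {           /* send matches the (self-directed) receive */
        prog[ri].matched = prog[si].matched = 1;
        printf("fire     { %s , %s }\n", txt[ri], txt[si]);
      } else if (bi >= 0) {
        prog[bi].matched = 1;
        printf("fire     { %s }\n", txt[bi]);
      } else if (wi >= 0) {
        prog[wi].matched = 1;
        printf("fire     { %s }\n", txt[wi]);
      } else {
        printf("no match set can be formed: DEADLOCK\n");
        return 1;
      }
    }
    printf("all operations matched: no deadlock\n");
    return 0;
  }

Its printed trace (collect R, collect B, fire {B}, collect S, fire {R,S}, …) mirrors the animation on the following slides.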

57 of 110

How MB helps predict outcomes


Will this single-process example called “Auto-send” deadlock ?

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);


59 of 110

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

[Figure: the MB ordering over these five operations]

How MB helps predict outcomes

60 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect R(from:0, h1)

[Scheduler's Reorder Buffer: R(from:0, h1)]

61 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect B

[Scheduler's Reorder Buffer: R(from:0, h1), B]

62 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced; form match set { B } and send it to the MPI Runtime!

[Scheduler's Reorder Buffer: R(from:0, h1), B]

63 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect S(to:0, h2)

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)]

64 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {R, S} match-set. Fire!

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)]

65 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h1)

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)]

66 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {W(h1)} match set. Fire!

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)]

67 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h2)

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)]

68 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Fire W(h2)! The program finishes without deadlocks!

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)]

69 of 110

Dynamic Verification at Scale

70 of 110

Why do we need a Distributed Analyzer for MPI Programs (DAMPI)?

  • The ROB and the MB graph can get large
    • Limit of ISP: “32 Parmetis (15 K LOC) processes”
  • We need to dynamically verify real apps
    • Large data sets, 1000s of processes
    • High-end bugs often masked when downscaling
    • Downscaling is often IMPOSSIBLE in practice
    • ISP is too sequential! Employ parallelism of a large Cluster!
  • What do we give up?
    • We can’t do “what if” reasoning
      • What if a PARTICULAR Isend has infinite buffering?
    • We may have to give up precision for scalability


71 of 110

DAMPI Framework

[Figure: the MPI program is built into an executable whose processes Proc1, Proc2, …, Procn run over the DAMPI PnMPI modules and the MPI runtime; alternate matches discovered during a run feed a schedule generator, whose epoch decisions drive reruns]

DAMPI – Distributed Analyzer for MPI

72 of 110

Distributed Causality Tracking in DAMPI

  • Alternate matches are
    • co-enabled and concurrent actions
    • that can match according to MPI’s matching rules
  • DAMPI performs an initial run of the MPI program
  • Discovers alternative matches
  • We have developed an MPI-specific Sparse Logical Clock Tracking
    • Vector Clocks (no omissions, less scalable)
    • Lamport Clocks (no omissions in practice, more scalable)
      • We have gone up to 1000 processes (10K seems possible)

73 of 110

DAMPI uses Lamport clocks to build Happens-Before relationships

  • Use Lamport clock to track Happens-Before
    • Sparse Lamport Clocks – only “count” non-det events
    • MPI MB relation is “baked” into clock-update rules
    • The clock is incremented after completion of an MPI_Recv(ANY_SOURCE)
    • Nested blocking / non-blocking operations handled
    • Compare incoming clock to detect potential matches

[Figure: P0, P1, P2 all start with clock 0; P0 and P2 each post S(P1) piggybacking clock 0 (pb(0)); P1's wildcard receive R1(*) completes and P1's clock becomes 1]

74 of 110

How we use Happens-Before relationship to detect alternative matches

  • Question: could P2’s send have matched P1’s recv?
  • Answer: Yes!
  • Earliest Message with lower clock value than the current process clock is an eligible match

[Figure: the same run; P2's S(P1) carries pb(0), which is lower than P1's clock after the wildcard receive completed, so it is an eligible alternative match]
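A toy C sketch of the sparse-clock bookkeeping behind this check (illustrative only, not DAMPI code): clocks count completed wildcard receives, every message piggybacks the sender's clock, and a send to the same receiver with a lower piggybacked clock than the receiver's current clock is flagged as an alternative match.

  #include <stdio.h>

  typedef struct { int src, dst, pb; } Msg;     /* pb = piggybacked clock */

  int main(void) {
    int clk[3] = { 0, 0, 0 };                   /* sparse Lamport clocks  */

    Msg m0 = { 0, 1, clk[0] };                  /* P0: S(P1) with pb(0)   */
    Msg m2 = { 2, 1, clk[2] };                  /* P2: S(P1) with pb(0)   */

    /* P1's wildcard receive R1(*) completes; say the runtime matched m0. */
    int matched_src = m0.src;
    clk[1]++;                                   /* count the completed wildcard receive */

    /* Any other message to P1 with a lower piggybacked clock could have
       matched that receive instead: record it as an alternative match.   */
    if (m2.dst == 1 && m2.src != matched_src && m2.pb < clk[1])
      printf("alternative match: S(P1) from P%d (pb=%d) for R1(*)\n", m2.src, m2.pb);

    printf("P1 clock after the wildcard receive: %d\n", clk[1]);
    return 0;
  }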

75 of 110

DAMPI Algorithm Review:
(1) Discover alternative matches during the initial run

[Figure: the initial run, as in the previous slides; R1(*) on P1 matches one of the sends, and P2's S(P1) with pb(0) is recorded as an alternative match]

76 of 110

DAMPI Algorithm Review:
(2) Force alternative matches during replay

[Figure: during replay, the wildcard receive is reissued as R1(2), so P2's S(P1) is the match and P0's S(P1) is excluded (X)]

77 of 110

DAMPI maintains very good scalability vs ISP

78 of 110

DAMPI is also faster at processing interleavings

79 of 110

Results on large applications: SpecMPI2007 and NAS-PB

Slowdown is for one interleaving

No replay was necessary

80 of 110

The Devil in the Detail of DAMPI

  • Textbook Lamport/Vector Clocks Inapplicable
    • “Clocking” all events won’t scale
      • So “clock” only ND-events
    • When do non-blocking calls finish ?
      • Can’t “peek” inside MPI runtime
      • Don’t want to impede the execution
      • So have to infer when they finish
    • Later blocking operations can force Matches Before
      • So need to adjust clocks when blocking operations happen
  • Our contributions: Sparse VC and Sparse LC algos
  • Handling real corner cases in MPI
    • Synchronous sends “learning” about non-blocking Recv start!

81 of 110

Why develop FV methods for CUDA?

  • GPUs are key enablers of HPC
    • Many of the top 10 machines are GPU based
  • Interesting debugging challenges
    • Races
    • Barrier mismatches
    • Bank conflicts
    • Asynchronous MPI / CUDA interactions


82 of 110

What are GPUs (aka Accelerators)?

[Figure: Sandy Bridge (courtesy anandtech.com), the OpenCL compute model (courtesy Khronos group), GeForce GTX 480 (Nvidia), AMD Fusion APU]

  • Three of the world’s Top-5 supercomputers are built using GPUs
  • The world’s greenest supercomputers are also GPU based
  • CPU/GPU integration is a clear trend
  • Hand-held devices (iPhone) will also use them

83 of 110

Contrast between CPUs and GPUs
Example: Increment Array Elements

CPU program:

  void inc_cpu(float *a, float b, int N) {
    for (int idx = 0; idx < N; idx++)
      a[idx] = a[idx] + b;
  }

  void main() {
    .....
    inc_cpu(a, b, N);
  }

CUDA program:

  __global__ void inc_gpu(float *A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
      A[tid] = A[tid] + b;
  }

  void main() {
    .....
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
  }

84 of 110

Contrast between CPUs and GPUs
Example: Increment Array Elements (same code as on the previous slide)

Fine-grained threads scheduled to run like this: Tid = 0, 1, 2, 3, …

85 of 110

Why heed Many-core / GPU Compute?

EAGAN, Minn., May 3, 1991 (John Markoff, NY Times)

The Convex Computer Corporation plans to introduce its first supercomputer on Tuesday. But Cray Research Inc., the king of supercomputing, says it is more worried by "killer micros" -- compact, extremely fast work stations that sell for less than $100,000.

Take-away : Clayton Christensen’s “Disruptive Innovation”

GPU is a disruptive technology!

86 of 110

Other Reasons to study GPU Compute

GPUs offer an eminent setting in which to study heterogeneous CPU organizations and memory hierarchies

87 of 110

What bugs caught? What value?

  • GPU hardware still stabilizing
    • Characterize GPU Hardware Formally
      • Currently, program behavior may change with platform
    • Micro benchmarking sheds further light
      • www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

  • The software is the real issue!


88 of 110

GPU Software Bugs / Difficulties

  • SIMD model, unfamiliar
    • Synchronization, races, deadlocks, … all different!
  • Machine constants, program assumptions about problem parameters affect correctness / performance
    • GPU “kernel functions” may fail or perform poorly
  • Formal reasoning can help identify performance pitfalls
  • GPUs are still young, so emulators and debuggers may not match hardware behavior
  • Multiple memory sub-systems are involved, further complicating semantics / performance


89 of 110

Approaches/Tools for GPU SW Verif.

  • Instrumented Dynamic Verification on GPU Emulators
    • Boyer et al, STMCS 2008

  • Grace : Combined Static / Dynamic Analysis
    • Zheng et al, PPoPP 2011

  • PUG : An SMT based Static Analyzer
    • Lee and Gopalakrishnan, FSE 2010

  • GKLEE : An SMT based test generator
    • Lee, Rajan, Ghosh, and Gopalakrishnan (under submission)


90 of 110

Planned Overall Workflow of PUG

[Figure: the planned overall workflow of PUG, with the realized portion marked]

91 of 110

Workflow and Results from PUG


92 of 110

Examples of a Data Race

Increment N-element vector A by scalar b

[Figure: threads tid = 0 .. 15, each updating one element: A[0]+b, …, A[15]+b]

  __global__ void inc_gpu(float *A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
      A[tid] = A[tid - 1] + b;   /* race: reads A[tid-1], which thread tid-1 writes */
  }

93 of 110

Examples of a Data Race

(Same example as on the previous slide.)

SMT encoding generated for the race check:

  (and (and (/= t1.x t2.x)
       ......
       (or (and (bv-lt idx1@t1 N0)
                (bv-lt idx1@t2 N0)
                (= (bv-sub idx1@t1 0b0000000001) idx1@t2))   ; encoding for the Read-Write race
           (and (bv-lt idx1@t1 N0)
                (bv-lt idx1@t2 N0)
                (= idx1@t1 idx1@t2))))                        ; encoding for the Write-Write race


95 of 110

Real Example with Race

  __global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
    }
    __syncthreads();
    if (threadIdx.x % 2 == 0) {
      for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
        d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
    ...

  /* The counterexample given by PUG is:
       t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0
     Threads t1 and t2 are in iterations 1 and 0 respectively.
     t1 generates:  d_out[ 2 + 8*1 ] += d_out[ 2 + 8*1 + 1 ]   i.e.  d[10] += d[11]
     t2 generates:  d_out[ 10 + 8*0 ] += d_out[ 10 + 8*0 + 1 ] i.e.  d[10] += d[11]
     This is a real race on the writes to d[10] because of the += done by both threads. */


97 of 110

PUG Divides and Conquers Each Barrier Interval

[Figure: two threads t1 and t2 execute accesses a1 .. a6, separated by barriers into barrier intervals BI1 and BI3; accesses a4 and a5 are guarded by predicate p / ~p]

BI1 is conflict free iff, for all t1, t2: a1@t1, a2@t1, a1@t2 and a2@t2 do not conflict with each other.

BI3 is conflict free iff the following accesses do not conflict for all t1, t2:
  a3@t1, ~p : a4@t1, ~p : a5@t1,
  a3@t2, ~p : a4@t2, ~p : a5@t2

98 of 110

PUG Results (FSE 2010)

  Kernel (CUDA SDK)     loc    +O/+C/+R   B.C.   Time (pass)
  Bitonic Sort            65              HIGH   2.2s
  MatrixMult             102   * *        HIGH   < 1s
  Histogram64            136              LOW    2.9s
  Sobel                  130   *          HIGH   5.6s
  Reduction              315              HIGH   3.4s
  Scan                   255   * * *      LOW    3.5s
  Scan Large             237   * *        LOW    5.7s
  Nbody                  206   *          HIGH   7.4s
  Bisect Large          1400   * *        HIGH   44s
  Radix Sort            1150   * * *      LOW    39s
  Eigenvalues           2300   * * *      HIGH   68s

(A * in the +O/+C/+R column marks that some of these requirements apply to the kernel.)

  +O  : requires no bit-vector overflowing
  +C  : requires constraints on the input values
  +R  : requires manual refinement
  B.C.: measures how serious the bank conflicts are
  Time: SMT solving time

99 of 110

“GPU class” bugs

  Defects      Barrier Error or Race      Refinement
               benign        fatal        over #kernel    over #loop
  13 (23%)     3             2            17.5%           10.5%

Tested over 57 assignment submissions.

Defects: indicates how many kernels are not well parameterized, i.e., work only in certain configurations.
Refinement: measures how many loops need automatic refinement (by PUG).

100 of 110

PUG’s Strengths and Weaknesses

  • Strengths:
    • SMT-based Incisive SA avoids interleaving explosion
    • Still obtains coverage guarantees
    • Good for GPU Library FV
  • Weaknesses:
    • Engineering effort : C++, Templates, Breaks, ..
    • SMT “explosion” for value correctness
    • Does not help test the code on the actual hardware
    • False alarms require manual intervention
      • Handling loops, Kernel calling contexts


101 of 110

Two directions in progress

  • Parameterized Verification
    • See Li’s dissertation for initial results
  • Formal Testing
    • Brand new code-base called GKLEE
      • Joint work with Fujitsu Research


102 of 110

GKLEE: Formal Testing of GPU Kernels
(joint work with Li, Rajan, Ghosh of Fujitsu)

103 of 110

Evaluation of GKLEE vs. PUG

  Kernels             PUG n=4   GKLEE n=4       PUG n=16   GKLEE n=16      GKLEE n=64      GKLEE n=256   GKLEE n=1024
  Simple Reduction    2.8       < 0.1 (< 0.1)   T.O.       < 0.1 (< 0.1)   < 0.1 (< 0.1)   0.2 (0.3)     2.3 (2.9)
  Matrix Transpose    1.9       < 0.1 (< 0.1)   T.O.       < 0.1 (0.3)     < 0.1 (3.2)     < 0.1 (63)    0.9 (T.O.)
  Bitonic Sort        3.7       0.9 (1)         T.O.       T.O.            T.O.            T.O.          T.O.
  Scan Large                    < 0.1 (< 0.1)              < 0.1 (< 0.1)   0.1 (0.2)       1.6 (3)       22 (51)

Execution time of PUG and GKLEE on kernels from the CUDA SDK for functional correctness. n is the number of threads. T.O. means > 5 min.

104 of 110

Evaluation of GKLEE’s Test Coverage

  Kernels             sc cov.       bc cov.      min #test   avg. covt    max. covt    exec. time
  Bitonic Sort        100% / 100%   51% / 44%    5           78% / 76%    100% / 94%   1s
  Merge Sort          100% / 100%   70% / 50%    6           88% / 80%    100% / 95%   0.5s
  Word Search         100% / 92%    34% / 25%    2           100% / 81%   100% / 85%   0.1s
  Radix Sort          100% / 91%    54% / 35%    6           91% / 68%    100% / 75%   5s
  Suffix Tree Match   100% / 90%    38% / 35%    8           90% / 70%    100% / 82%   12s

Test generation results. Traditional metrics sc cov. and bc cov. give source-code and bytecode coverage respectively. Refined metrics avg. covt and max. covt measure the average and maximum coverage over all threads.

All coverages were far better than those achieved through random testing.

105 of 110

Conclusions, Future Work

  • ISP/DAMPI
    • Will move toward test generation, path coverage, crashing inputs, Integration with MUST
  • Parameterized Verification
  • Integration of PUG into the Eclipse Parallel Tools Platform to facilitate debugging
  • A version of GKLEE for OpenCL programs
  • PTP integration of GKLEE


106 of 110

Contrast with Shared Memory

  • In message passing systems, changes to an object are not automatically made visible to other threads
  • Msg passing helps isolate shared objects
    • Memory effects/errors do not propagate automatically
    • No shared memory races
    • Users control the visibility of changes across threads
  • Programming effort is higher – at least different
    • The overall semantics are often that of a sequential computation
      • The parallel threads (or processes) of a message passing system are often the result of decomposition of a single sequential computation along
        • The data dimension
        • The task dimension

107 of 110

Events of MPI Isend under Zero buffering

Isend(to: Proc I, data, &handle1);

…code in delay slot…

Wait(handle1);

108 of 110

Events of MPI Isend under Infinite buffering

Isend(to: Proc I, data, &handle1);

code in delay slot…

Wait(handle1);

  • Matches-Before serves as a one-stop-shopping solution to model how buffering affects MPI behavior!
  • Safety violations due to excessive buffering can also be studied using MB!

109 of 110

Crucial Ordering: Matches Before

  Isend(to: Proc k, …);
  Isend(to: Proc k, …);

Models non-overtaking of MPI messages (when the same destination is targeted, sends match receives in program order)

110 of 110

Crucial Ordering: Matches Before

  Isend(to: Proc j, …);
  Isend(to: Proc k, …);

Models relaxed ordering when the destinations differ (matches with receives are allowed out of program order)