1 of 110

What Formal Approaches are Needed for Correctness Verification in HPC

Ganesh Gopalakrishnan

School of Computing, University of Utah

and

Center for Parallel Computing (CPU)

http://www.cs.utah.edu/fv

2 of 110

Ever-growing Demand for Compute Power!

  • Terascale
  • Petascale
  • Exascale
  • Zettascale

3 of 110

More compute power enables new discoveries, solves new problems

    • Molecular dynamics simulations
      • Better drug design facilitated
        • Sanbonmatsu et al., FSE 2010 keynote
      • 290 days of simulation to simulate 2 million atom interactions over 2 nanoseconds
    • Better “oil caps” can be designed if we have the right compute infrastructure
      • Gropp, SC 2010 panel

4 of 110

High-end Debugging Methods are often Expensive, Inconclusive

  • Expensive machines, resources
    • $3M electricity a year
    • $1B to install hardware
    • Months of planning to get runtime on cluster

  • Debugging tools/methods are primitive
    • Extreme-Scale goal unrealistic w/o better approaches
  • Inadequate attention from “CS”
    • Little/no Formal Software Engineering methods
    • Almost zero critical mass

5 of 110

Importance of Message Passing in HPC (MPI)

  • Born ~1994
      • The world’s fastest CPU ran at 68 MHz
      • The Internet had 600 sites then!
      • Java was still not around

  • Still dominant in 2011
    • Large investments in applications, tooling support
      • Credible FV research in HPC must include MPI

    • Use of message passing is growing
      • MCAPI, RCCE, streams in CUDA, Queues in OpenCL, .NET Async

6 of 110

Trend: Hybrid Concurrency

[Figure: the hybrid-concurrency hardware/software stack: Sandy Bridge (courtesy anandtech.com), GeForce GTX 480 (Nvidia), AMD Fusion APU; concurrent data structures; InfiniBand-style interconnect; monolithic large-scale MPI-based user applications; problem-solving environments (e.g., Uintah, Charm++, ADLB) and user applications built on them; high-performance MPI libraries]

7 of 110

Difficult Road Ahead wrt Debugging

  • Concurrent software debugging is hard
  • Gets harder as the degree of parallelism in applications increases
    • Node level: Message Passing Interface (MPI)
    • Core level: Threads, OpenMP, CUDA
  • Hybrid programming will be the future
    • MPI + Threads
    • MPI + OpenMP
    • MPI + CUDA
  • Yet tools are lagging behind!
    • Many tools cannot operate at scale and give measurable coverage

[Figure: HPC Apps vs. HPC Correctness Tools]

8 of 110

Focus of the rest of the talk: Mostly on MPI

Hybrid verification is future work

We will cover GPUs separately

9 of 110

MPI Verification approach depends on the type of determinism

  • Execution Deterministic
    • Basically one computation per input data
  • Value Deterministic
    • Multiple computations, but yield same “final answer”
  • Nondeterministic
    • Basically reactive programs built around message passing, possibly also using threads

Examples to follow

10 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied]

11 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied, annotated with MPI_Bcast, MPI_Send, and MPI_Recv]

12 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the two matrices being multiplied, annotated with MPI_Send, MPI_Bcast, and MPI_Recv]

13 of 110

An example of parallelizing matrix multiplication using message passing

[Figure: the completed product (A x B = C), annotated with MPI_Send, MPI_Recv, and MPI_Bcast]

14 of 110

Unoptimized Initial Version : Execution Deterministic

[Figure: the master issues MPI_Send for each row and collects results with MPI_Recv(from: P0, P1, P2, P3, …); the next row is sent to the first worker, which by now must be free]

15 of 110

Later Optimized Version : Value Deterministic
Opportunistically Send Work to the Processor that Finishes First

[Figure: the master issues MPI_Send and a wildcard MPI_Recv(from: *); the next row goes to the first worker that returns an answer]
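Below is a minimal MPI/C sketch of this master-worker pattern (not the code behind the figure): the master hands out rows, posts a wildcard MPI_Recv with MPI_ANY_SOURCE, and gives the next row to whichever worker answered first. do_row_work, the tags, and NROWS are illustrative placeholders.

  /* Sketch of "send the next row to whichever worker answers first". */
  #include <mpi.h>

  #define NROWS    64
  #define TAG_WORK 1
  #define TAG_STOP 2

  static double do_row_work(double x) { return 2.0 * x; }   /* placeholder computation */

  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master */
      int next = 0, outstanding = 0, dummy = 0;
      double result;
      MPI_Status st;
      for (int w = 1; w < size; w++)               /* hand one row to each worker initially */
        if (next < NROWS) {
          MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
          next++; outstanding++;
        }
      while (outstanding > 0) {
        /* wildcard receive: whichever worker finishes first gets the next row */
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_WORK, MPI_COMM_WORLD, &st);
        outstanding--;
        if (next < NROWS) {
          MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
          next++; outstanding++;
        }
      }
      for (int w = 1; w < size; w++)               /* tell every worker to stop */
        MPI_Send(&dummy, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
    } else {                                       /* worker */
      MPI_Status st;
      int row;
      for (;;) {
        MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP) break;
        double r = do_row_work((double)row);
        MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
      }
    }
    MPI_Finalize();
    return 0;
  }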

16 of 110

Still More Optimized Value-Deterministic Versions:
Communications are made Non-blocking and Software Pipelined
(still expected to remain value-deterministic)

[Figure: as before, the master sends work (MPI_Send) and posts wildcard receives (MPI_Recv(from: *)), now issued in a non-blocking, software-pipelined fashion; the next row still goes to the first worker that returns an answer]

17 of 110

Typical MPI Programs

  • Value-Nondeterministic MPI programs do exist
    • Adaptive Dynamic Load Balancing Libraries

  • But most are value deterministic or execution deterministic
    • Of course, one does not really know w/o analysis!
      • Detect replay non-determinism over schedule space
    • Races can creep into MPI programs
      • Forgetting to Wait for MPI non-blocking calls to finish
    • Floating point can make things non-deterministic

18 of 110

Gist of bug-hunting story

  • MPI programs “die by the bite of a thousand mosquitoes”
    • No major vulnerabilities one can focus on
      • E.g. in Thread Programming, focusing on races
      • With MPI, we need comprehensive “Bug Monitors”
  • Building MPI bug monitors requires collaboration
    • Lucky to have collaborations with DOE labs
    • The lack of FV critical mass hurts

19 of 110

A real-world bug

P0 P1 P2

--- --- ---

Send( (rank+1)%N ); Send( (rank+1)%N ); Send( (rank+1)%N );

Recv( (rank-1)%N ); Recv( (rank-1)%N ); Recv( (rank-1)%N );

  • Expected “circular” msg passing
  • Found that P0’s Recv entirely vanished !!
  • REASON : ??

    • In C, -1 % N is not N-1 but rather -1 itself
    • In MPI, “-1” is MPI_PROC_NULL
    • A Recv posted on MPI_PROC_NULL is silently ignored! (see the sketch below)
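A small illustrative sketch of this bug and its usual fix; the ring program below is hypothetical, not the original application.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    int rank, N, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &N);

    int right      = (rank + 1) % N;
    int buggy_left = (rank - 1) % N;      /* for rank 0 this is -1 in C, not N-1        */
    int left       = (rank - 1 + N) % N;  /* the fix: always a valid rank               */
    (void)buggy_left;

    /* NOTE: Send-before-Recv around a ring relies on the runtime buffering small
       messages; see the resource-dependent-bug discussion later in the talk.          */
    MPI_Send(&rank, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
    /* With buggy_left, rank 0 would post its Recv on -1; if that value happens to be
       MPI_PROC_NULL, the receive silently "succeeds" without receiving anything.       */
    MPI_Recv(&token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank %d got token from rank %d\n", rank, token);

    MPI_Finalize();
    return 0;
  }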


21 of 110

MPI Bugs – more anecdotal evidence

  • Bug encountered at large scale w.r.t. famous MPI library (Vo)
    • Bug was absent at a smaller scale
    • It was a concurrency bug
  • Attempt to implement collective communication (Thakur)
    • Bug exists for ranges of size parameters
  • Wrong assumption: that MPI barrier was irrelevant (Siegel)
    • It was not – a communication race was created
  • Other common bugs (we see them a lot; potentially concurrency dependent)
    • Forgetting to wait for a non-blocking receive to finish (see the sketch below)
    • Forgetting to free up communicators and type objects
  • Some codes may be considered buggy if non-determinism arises!
    • Use of MPI_Recv(*) often does not result in non-deterministic execution
    • Need something more than “superficial inspection”
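A minimal sketch of the “forgotten Wait” bug class mentioned above; the two-rank exchange, tag, and payload are illustrative and not taken from any of the cited incidents.

  #include <mpi.h>
  #include <stdio.h>

  /* If the MPI_Wait below is omitted, reading buf races with the MPI library
     still writing into it, and the request object leaks.  Run with two ranks. */
  int main(int argc, char **argv) {
    int rank, buf = 0, out = 42;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    MPI_Irecv(&buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);

    /* The common bug is to stop here and use buf right away.                  */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completes the Irecv, frees the request;
                                            only now is buf valid to read       */
    printf("rank %d received %d\n", rank, buf);
    MPI_Finalize();
    return 0;
  }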

22 of 110

Real bug stories in the MPI-land

  • Typing a[i][i] = init instead of a[i][j] = init
  • Communication races
    • Unintended send matches “wildcard receive”
  • Bugs that show up when ported
    • Runtime buffering changes; deadlocks erupt
    • Sometimes, bugs show up when buffering added!
  • Misunderstood “Collective” semantics
    • Broadcast does not have “barrier” semantics
  • MPI + threads
    • Royal troubles await the newbies

23 of 110

How does one make progress?

Here are some thoughts..

Roughly justifies what we’ve done ☺

24 of 110

Our Research Agenda in HPC

  • Solve FV of Pure MPI Applications “well”
    • Progress in non-determinism coverage for fixed test harness
    • MUST integrate with good error monitors
  • (Preliminary) Work on hybrid MPI + Something
    • Something = Pthreads and CUDA so far
    • Evaluated heuristics for deterministic replay of Pthreads + MPI
  • Work on CUDA/OpenCL Analysis
    • Good progress on Symbolic Static Analyzer for CUDA Kernels
    • (Prelim.) progress on Symbolic Test Generator for CUDA Pgms
  • (Future) Symbolic Test Generation to “crash” hybrid pgms
    • Finding lurking crashes may be a communicable value proposition
  • (Future) Intelligent schedule-space exploration
    • Focus on non-monolithic MPI programs

25 of 110

Motivation for Coverage of Communication Nondeterminism

26 of 110

Eliminating wasted search in message passing verif.

P0:  MPI_Send(to P1, …);  MPI_Send(to P1, data=22);

P1:  MPI_Recv(from P0, …);  MPI_Recv(from P2, …);  MPI_Recv(*, x);
     if (x==22) then error1 else MPI_Recv(*, x);

P2:  MPI_Send(to P1, …);  MPI_Send(to P1, data=33);

27 of 110

@InProceedings{PADTAD2006:JitterBug,
  author    = {Richard Vuduc and Martin Schulz and Dan Quinlan and Bronis de Supinski and Andreas S{\ae}bj{\"o}rnsen},
  title     = {Improving distributed memory applications testing by message perturbation},
  booktitle = {Proc.~4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis},
  address   = {Portland, ME, USA},
  month     = {July},
  year      = {2006}
}

A frequently followed approach: “boil the whole schedule space” – often very wasteful

28 of 110

Eliminating wasted work in message passing verif.

P0:  MPI_Send(to P1, …);  MPI_Send(to P1, data=22);

P1:  MPI_Recv(from P0, …);  MPI_Recv(from P2, …);  MPI_Recv(*, x);
     if (x==22) then error1 else MPI_Recv(*, x);

P2:  MPI_Send(to P1, …);  MPI_Send(to P1, data=33);

But consider these two cases…

No need to play with schedules of deterministic actions
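The same example rendered as a compilable sketch (the payload values 11 and 12 for the unspecified sends are made up; run with three ranks). A dynamic verifier only has to explore the two possible matches of the wildcard receive, not every interleaving of the deterministic operations.

  #include <mpi.h>
  #include <assert.h>

  int main(int argc, char **argv) {
    int rank, x = 0, tmp = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      int a = 11, b = 22;
      MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      /* two deterministic receives, then the wildcard that matters */
      MPI_Recv(&tmp, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(&tmp, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (x == 22)
        assert(0 && "error1 reached");   /* only one of the two wildcard matches hits this */
      else
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
      int c = 12, d = 33;
      MPI_Send(&c, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      MPI_Send(&d, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
  }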

29 of 110

Need to detect Resource Dependent Bugs

30 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Recv(from:1);

P1:  Send(to:0);  Recv(from:0);

We know that this program may deadlock when the Sends have less buffering

31 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Recv(from:1);

P1:  Send(to:0);  Recv(from:0);

… and this program with more Send buffering may avoid a deadlock
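A sketch of this two-process pattern in MPI/C, with the message size as an illustrative knob: small messages are typically buffered eagerly and the code completes, while large messages force rendezvous sends and both ranks block. MPI_Sendrecv, shown in the comment, is the usual buffering-independent fix. Assumes exactly two ranks.

  #include <mpi.h>

  #define N (1 << 20)          /* large enough to push most MPI runtimes into
                                  rendezvous mode (no send buffering)          */
  int main(int argc, char **argv) {
    int rank;
    static double sendbuf[N], recvbuf[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;

    /* Head-to-head: both ranks Send first, then Recv.  Correct only while the
       runtime buffers the sends; deadlocks once buffering runs out.           */
    MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* A buffering-independent alternative:
       MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, peer, 0,
                    recvbuf, N, MPI_DOUBLE, peer, 0,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);                         */
    MPI_Finalize();
    return 0;
  }

Note that testing only the buffered configuration would miss the deadlock, which is exactly the misunderstanding called out a few slides later.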

32 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Send(to:2);

P1:  Send(to:2);  Recv(from:0);

P2:  Recv(from:*);  Recv(from:0);

… But this program deadlocks if Send(to:1) has more buffering!


36 of 110

Example of Resource Dependent Bug

P0:  Send(to:1);  Send(to:2);

P1:  Send(to:2);  Recv(from:0);

P2:  Recv(from:*);  Recv(from:0);

… But this program deadlocks if Send(to:1) has more buffering!

Mismatched – hence a deadlock
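The same three-process scenario as an illustrative C sketch (payload values are made up; run with three ranks). The comment spells out why extra buffering on Send(to:1) is exactly what enables the deadlock.

  #include <mpi.h>

  /* With no buffering, P0 is held at Send(to:1), so P2's wildcard can only
     match P1's Send(to:2) and everything completes.  If Send(to:1) is
     buffered, P0 races ahead and its Send(to:2) may win the wildcard; then
     P1's Send(to:2) and P2's Recv(from:0) are left mismatched: deadlock.   */
  int main(int argc, char **argv) {
    int rank, v = 7, r = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
      MPI_Send(&v, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
      MPI_Recv(&r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
      MPI_Recv(&r, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Recv(&r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
  }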

37 of 110

Widely Publicized Misunderstandings

“Your program is deadlock free if you have successfully tested it under zero buffering”

38 of 110

MPI at fault?

  • Perhaps partly
    • Over 17 years of MPI, things have changed
    • Inevitable use of shared memory cores, GPUs, …
    • Yet, many of the issues seem fundamental to
      • Need for wide adoption across problems, languages, machines
      • Need to give programmer better handle on resource usage

  • How to evolve out of MPI?
    • Whom do we trust to reset the world?
    • Will they get it any better?
    • What about the train-wreck meanwhile?

  • Must one completely evolve out of MPI?


39 of 110

Our Impact So Far

[Figure: the hybrid hardware/software stack again, now annotated with our tools: PUG and GKLEE target GPU code (Sandy Bridge, GeForce GTX 480, AMD Fusion APU, concurrent data structures); ISP and DAMPI target monolithic large-scale MPI-based user applications; useful formalizations help test problem-solving environments (e.g., Uintah, Charm++, ADLB), the user applications built on them, and high-performance MPI libraries]

40 of 110

Rest of the talk

  • Dynamic formal verification of MPI
    • It is basically testing which discovers all alternate schedules
      • Coverage of communication non-determinism
    • Also gives us a “predictive theory” of MPI behavior

    • Centralized approach : ISP
    • GEM: Tool Integration within Eclipse Parallel Tools Platform

    • Distributed approach : DAMPI

  • CUDA verification support
    • SMT-supported static analysis of CUDA kernels
    • Symbolic test generation approach (in progress)


All our pubs + code at http://www.cs.utah.edu/fv

41 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

42 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Non-blocking Send – send lasts from Isend to Wait
  • Send buffer can be reclaimed only after Wait clears
  • Forgetting to issue Wait 🡺 MPI “request object” leak


44 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Non-blocking Receive – lasts from Irecv to Wait
  • Recv buffer can be examined only after Wait clears
  • Forgetting to issue Wait 🡺 MPI “request object” leak

45 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Blocking receive in the middle
  • Equivalent to Irecv followed by its Wait
  • The data fetched by Recv(2) is available before that of Irecv

46 of 110

A Simple MPI Example

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Since P0’s Isend and P1’s Irecv can be “in flight”, the barrier can be crossed
  • This allows P2’s Isend to race with P0’s Isend to match the Irecv(*) (see the sketch below)
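A C sketch of this example (tag and payloads are illustrative; run with exactly three ranks). If P2's Isend wins the wildcard match, P1's later Recv(2) has nothing left to match, which is the deadlock the scheduler animation later in the talk uncovers.

  #include <mpi.h>

  int main(int argc, char **argv) {
    int rank, out = 1, in1 = 0, in2 = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      /* the wildcard receive stays pending across the barrier, so both
         P0's and P2's Isend are candidates to match it                   */
      MPI_Irecv(&in1, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &req);
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Recv(&in2, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 2) {
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
      MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
  }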

47 of 110

Testing Challenges

Process P0:  Isend(1, req);  Barrier;  Wait(req);
Process P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
Process P2:  Barrier;  Isend(1, req);  Wait(req);

  • Traditional testing methods may reveal only P0->P1 or P2->P1
  • P2->P1 may happen after the code is ported
  • Our tools ISP and DAMPI automatically discover and run both tests, regardless of the execution platform

48 of 110

Flow of ISP

[Figure: the MPI program is compiled against an interposition layer; the executable's processes Proc1, Proc2, …, Procn report their MPI calls to the scheduler, which controls what is actually run on the MPI runtime]

  • Scheduler intercepts MPI calls
  • Reorders and/or rewrites the actual calls going into the MPI Runtime
  • Discovers maximal non-determinism; plays through all choices
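ISP's real interposition layer is considerably more involved; the sketch below only shows the standard MPI profiling-interface (PMPI) trick that such layers are built on and is meant to be linked into the application. consult_scheduler is a placeholder, and the signatures follow the MPI-2 era prototypes.

  #include <mpi.h>
  #include <stdio.h>

  /* The tool provides its own MPI_Send / MPI_Barrier, decides (e.g., after
     consulting a verification scheduler) when to forward each call, and then
     invokes the real implementation through the PMPI_ entry points.          */
  static void consult_scheduler(const char *call, int peer) {
    /* a real tool would block here until the scheduler says "go" */
    fprintf(stderr, "intercepted %s(peer=%d)\n", call, peer);
  }

  int MPI_Send(void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm) {
    consult_scheduler("MPI_Send", dest);
    return PMPI_Send(buf, count, type, dest, tag, comm);
  }

  int MPI_Barrier(MPI_Comm comm) {
    consult_scheduler("MPI_Barrier", -1);
    return PMPI_Barrier(comm);
  }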

49 of 110

ISP Scheduler Actions (animation)

P0:  Isend(1, req);  Barrier;  Wait(req);
P1:  Irecv(*, req);  Barrier;  Recv(2);  Wait(req);
P2:  Barrier;  Isend(1, req);  Wait(req);

[Frame 1: P0's Isend(1) and Barrier reach the scheduler, which replies sendNext; nothing has been issued to the MPI runtime yet]

50 of 110

ISP Scheduler Actions (animation)

[Frame 2: P1's Irecv(*) and Barrier reach the scheduler as well]

51 of 110

ISP Scheduler Actions (animation)

[Frame 3: all three Barriers have been collected; the Barrier match set is formed and issued to the MPI runtime]

52 of 110

ISP Scheduler Actions (animation)

[Frame 4: the remaining operations (Wait, Recv(2), Isend(1), Wait) are collected; when the wildcard is rewritten to Irecv(2), P0's Isend(1) and P1's Recv(2) are left without partners and no match set can be formed: Deadlock!]

53 of 110

Formalization of MPI Behavior to build Formal Dynamic Verifier (verification scheduler)

54 of 110

Formal Modeling of MPI Calls

  • MPI calls are modeled in terms of four salient events
    • Call issued
      • All calls are issued in program order
    • Call returns
      • The code after the call can now be executed
    • Call matches
      • An event that marks the call committing
    • Call completes
      • All resources associated with the call are freed


55 of 110

The Matches-Before Relation of MPI

(“X  MB  Y” below means X matches-before Y; in each pair, the first operation comes earlier in program order.)

  1. Isend(to: Proc k, …)   MB   Isend(to: Proc k, …)
  2. Irecv(from: Proc k, …)   MB   Irecv(from: Proc k, …)
  3. Isend( &h )   MB   Wait( h )
  4. Wait(..)   MB   AnyMPIOp
  5. Barrier(..)   MB   AnyMPIOp
  6. Irecv(from: Proc *, …)   conditionally MB   Irecv(from: Proc k, …)
  7. Irecv(from: Proc j, …)   conditionally MB   Irecv(from: Proc *, …)

56 of 110

The ISP Scheduler

  • Pick a process Pi and its instruction Op at PC n
  • If Op does not have an unmatched ancestor according to MB, then collect Op into the Scheduler’s Reorder Buffer
    • Stay with Pi, Increment n
  • Else Switch to Pi+1 until all Pi are “fenced”
    • “fenced” means all current Ops have unmatched ancestors
  • Form Match Sets according to priority order
    • If Match Set is {s1, s2, .. sK} + R(*), cover all cases using stateless replay
  • Issue an eligible set of Match Set Ops into the MPI runtime
  • Repeat until all processes are finalized or error encountered


Theorem: This Scheduling Method achieves ND-Coverage in MPI programs!
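A toy, single-process model of this scheduling loop, exercised on the “Auto-send” example of the next slides. Only the MB rules that program needs are encoded, and the match-set priority used here (send-receive pair, then barrier, then a ready wait) is a simplification of ISP's actual policy; this is an illustrative sketch, not ISP code.

  #include <stdio.h>

  typedef enum { R, B, S, W } Kind;                   /* Irecv, Barrier, Isend, Wait */
  typedef struct { Kind k; int h; int matched; } Op;  /* h = request handle          */

  /* P0: R(from:0,h1); B; S(to:0,h2); W(h1); W(h2);                                   */
  static Op prog[] = { {R,1,0}, {B,0,0}, {S,2,0}, {W,1,0}, {W,2,0} };
  static const char *txt[] = { "R(from:0,h1)", "B", "S(to:0,h2)", "W(h1)", "W(h2)" };
  enum { NOPS = 5 };

  /* op i is blocked if some earlier, still unmatched op matches-before it */
  static int blocked(int i) {
    for (int j = 0; j < i; j++) {
      if (prog[j].matched) continue;
      if (prog[j].k == B || prog[j].k == W) return 1;          /* Barrier/Wait MB anything */
      if (prog[i].k == W && prog[i].h == prog[j].h) return 1;  /* Isend/Irecv MB its Wait  */
    }
    return 0;
  }

  static int all_matched(void) {
    for (int i = 0; i < NOPS; i++) if (!prog[i].matched) return 0;
    return 1;
  }

  int main(void) {
    int collected = 0;                    /* ops moved into the reorder buffer */
    while (!all_matched()) {
      while (collected < NOPS && !blocked(collected)) {   /* collect until fenced */
        printf("collect  %s\n", txt[collected]);
        collected++;
      }
      /* fenced: form one match set among the collected, unmatched ops */
      int ri = -1, si = -1, bi = -1, wi = -1;
      for (int i = 0; i < collected; i++) {
        if (prog[i].matched) continue;
        if (prog[i].k == R) ri = i;
        if (prog[i].k == S) si = i;
        if (prog[i].k == B) bi = i;
        if (prog[i].k == W && wi < 0)     /* a Wait is ready once its request matched */
          for (int j = 0; j < i; j++)
            if (prog[j].k != W && prog[j].h == prog[i].h && prog[j].matched) wi = i;
      }
      if (ri >= 0 && si >= 0) {           /* send matches the (self-directed) receive */
        prog[ri].matched = prog[si].matched = 1;
        printf("fire     { %s , %s }\n", txt[ri], txt[si]);
      } else if (bi >= 0) {
        prog[bi].matched = 1;
        printf("fire     { %s }\n", txt[bi]);
      } else if (wi >= 0) {
        prog[wi].matched = 1;
        printf("fire     { %s }\n", txt[wi]);
      } else {
        printf("no match set can be formed: DEADLOCK\n");
        return 1;
      }
    }
    printf("all operations matched: no deadlock\n");
    return 0;
  }

Its printed trace (collect R, collect B, fire {B}, collect S, fire {R,S}, …) mirrors the animation on the following slides.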

57 of 110

How MB helps predict outcomes


Will this single-process example called “Auto-send” deadlock ?

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);


59 of 110

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

[Figure: the MB ordering over these five operations]

How MB helps predict outcomes

60 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect R(from:0, h1)

[Scheduler's Reorder Buffer: R(from:0, h1)]

61 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect B

[Scheduler's Reorder Buffer: R(from:0, h1), B]

62 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced; form match set { B } and send it to the MPI Runtime!

[Scheduler's Reorder Buffer: R(from:0, h1), B]

63 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect S(to:0, h2)

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)]

64 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {R, S} match-set. Fire!

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2)]

65 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h1)

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)]

66 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

P0 is fenced. So form the {W(h1)} match set. Fire!

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1)]

67 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Collect W(h2)

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)]

68 of 110

How MB helps predict outcomes

P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);

Fire W(h2)! The program finishes without deadlocks!

[Scheduler's Reorder Buffer: R(from:0, h1), B, S(to:0, h2), W(h1), W(h2)]

69 of 110

Dynamic Verification at Scale

70 of 110

Why do we need a Distributed Analyzer for MPI Programs (DAMPI)?

  • The ROB and the MB graph can get large
    • Limit of ISP: “32 Parmetis (15 K LOC) processes”
  • We need to dynamically verify real apps
    • Large data sets, 1000s of processes
    • High-end bugs often masked when downscaling
    • Downscaling is often IMPOSSIBLE in practice
    • ISP is too sequential! Employ parallelism of a large Cluster!
  • What do we give up?
    • We can’t do “what if” reasoning
      • What if a PARTICULAR Isend has infinite buffering?
    • We may have to give up precision for scalability


71 of 110

DAMPI Framework

[Figure: the MPI program is built into an executable whose processes Proc1, Proc2, …, Procn run over the DAMPI PnMPI modules and the MPI runtime; alternate matches discovered during a run feed a schedule generator, whose epoch decisions drive reruns]

DAMPI – Distributed Analyzer for MPI

72 of 110

Distributed Causality Tracking in DAMPI

  • Alternate matches are
    • co-enabled and concurrent actions
    • that can match according to MPI’s matching rules
  • DAMPI performs an initial run of the MPI program
  • Discovers alternative matches
  • We have developed an MPI-specific Sparse Logical Clock Tracking
    • Vector Clocks (no omissions, less scalable)
    • Lamport Clocks (no omissions in practice, more scalable)
      • We have gone up to 1000 processes (10K seems possible)

73 of 110

DAMPI uses Lamport clocks to build Happens-Before relationships

  • Use Lamport clock to track Happens-Before
    • Sparse Lamport Clocks – only “count” non-det events
    • MPI MB relation is “baked” into clock-update rules
    • The clock is incremented after completion of an MPI_Recv(ANY_SOURCE)
    • Nested blocking / non-blocking operations handled
    • Compare incoming clock to detect potential matches

[Figure: P0, P1, P2 all start with clock 0; P0 and P2 each post S(P1) piggybacking clock 0 (pb(0)); P1's wildcard receive R1(*) completes and P1's clock becomes 1]

74 of 110

How we use Happens-Before relationship to detect alternative matches

  • Question: could P2’s send have matched P1’s recv?
  • Answer: Yes!
  • Earliest Message with lower clock value than the current process clock is an eligible match

[Figure: the same run; P2's S(P1) carries pb(0), which is lower than P1's clock after the wildcard receive completed, so it is an eligible alternative match]
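A toy C sketch of the sparse-clock bookkeeping behind this check (illustrative only, not DAMPI code): clocks count completed wildcard receives, every message piggybacks the sender's clock, and a send to the same receiver with a lower piggybacked clock than the receiver's current clock is flagged as an alternative match.

  #include <stdio.h>

  typedef struct { int src, dst, pb; } Msg;     /* pb = piggybacked clock */

  int main(void) {
    int clk[3] = { 0, 0, 0 };                   /* sparse Lamport clocks  */

    Msg m0 = { 0, 1, clk[0] };                  /* P0: S(P1) with pb(0)   */
    Msg m2 = { 2, 1, clk[2] };                  /* P2: S(P1) with pb(0)   */

    /* P1's wildcard receive R1(*) completes; say the runtime matched m0. */
    int matched_src = m0.src;
    clk[1]++;                                   /* count the completed wildcard receive */

    /* Any other message to P1 with a lower piggybacked clock could have
       matched that receive instead: record it as an alternative match.   */
    if (m2.dst == 1 && m2.src != matched_src && m2.pb < clk[1])
      printf("alternative match: S(P1) from P%d (pb=%d) for R1(*)\n", m2.src, m2.pb);

    printf("P1 clock after the wildcard receive: %d\n", clk[1]);
    return 0;
  }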

75 of 110

DAMPI Algorithm Review:
(1) Discover alternative matches during the initial run

[Figure: the initial run, as in the previous slides; R1(*) on P1 matches one of the sends, and P2's S(P1) with pb(0) is recorded as an alternative match]

76 of 110

DAMPI Algorithm Review:
(2) Force alternative matches during replay

[Figure: during replay, the wildcard receive is reissued as R1(2), so P2's S(P1) is the match and P0's S(P1) is excluded (X)]

77 of 110

DAMPI maintains very good scalability vs ISP

78 of 110

DAMPI is also faster at processing interleavings

79 of 110

Results on large applications: SpecMPI2007 and NAS-PB

Slowdown is for one interleaving

No replay was necessary

80 of 110

The Devil in the Detail of DAMPI

  • Textbook Lamport/Vector Clocks Inapplicable
    • “Clocking” all events won’t scale
      • So “clock” only ND-events
    • When do non-blocking calls finish ?
      • Can’t “peek” inside MPI runtime
      • Don’t want to impede the execution
      • So have to infer when they finish
    • Later blocking operations can force Matches Before
      • So need to adjust clocks when blocking operations happen
  • Our contributions: Sparse VC and Sparse LC algos
  • Handling real corner cases in MPI
    • Synchronous sends “learning” about non-blocking Recv start!

81 of 110

Why develop FV methods for CUDA?

  • GPUs are key enablers of HPC
    • Many of the top 10 machines are GPU based
  • Interesting debugging challenges
    • Races
    • Barrier mismatches
    • Bank conflicts
    • Asynchronous MPI / CUDA interactions


82 of 110

What are GPUs (aka Accelerators)?

[Figure: Sandy Bridge (courtesy anandtech.com), the OpenCL compute model (courtesy Khronos group), GeForce GTX 480 (Nvidia), AMD Fusion APU]

  • Three of the world’s Top-5 supercomputers are built using GPUs
  • The world’s greenest supercomputers are also GPU based
  • CPU/GPU integration is a clear trend
  • Hand-held devices (iPhone) will also use them

83 of 110

Contrast between CPUs and GPUs
Example: Increment Array Elements

CPU program:

  void inc_cpu(float *a, float b, int N) {
    for (int idx = 0; idx < N; idx++)
      a[idx] = a[idx] + b;
  }

  void main() {
    .....
    inc_cpu(a, b, N);
  }

CUDA program:

  __global__ void inc_gpu(float *A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
      A[tid] = A[tid] + b;
  }

  void main() {
    .....
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
  }

84 of 110

Contrast between CPUs and GPUs
Example: Increment Array Elements (same code as on the previous slide)

Fine-grained threads scheduled to run like this: Tid = 0, 1, 2, 3, …

85 of 110

Why heed Many-core / GPU Compute?

EAGAN, Minn., May 3, 1991 (John Markoff, NY Times)

The Convex Computer Corporation plans to introduce its first supercomputer on Tuesday. But Cray Research Inc., the king of supercomputing, says it is more worried by "killer micros" -- compact, extremely fast work stations that sell for less than $100,000.

Take-away : Clayton Christensen’s “Disruptive Innovation”

GPU is a disruptive technology!

86 of 110

Other Reasons to study GPU Compute

GPUs offer an eminent setting in which to study heterogeneous CPU organizations and memory hierarchies

87 of 110

What bugs caught? What value?

  • GPU hardware still stabilizing
    • Characterize GPU Hardware Formally
      • Currently, program behavior may change with platform
    • Micro benchmarking sheds further light
      • www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

  • The software is the real issue!


88 of 110

GPU Software Bugs / Difficulties

  • SIMD model, unfamiliar
    • Synchronization, races, deadlocks, … all different!
  • Machine constants, program assumptions about problem parameters affect correctness / performance
    • GPU “kernel functions” may fail or perform poorly
  • Formal reasoning can help identify performance pitfalls
  • GPUs are still young, so emulators and debuggers may not match hardware behavior
  • Multiple memory sub-systems are involved, further complicating semantics / performance


89 of 110

Approaches/Tools for GPU SW Verif.

  • Instrumented Dynamic Verification on GPU Emulators
    • Boyer et al, STMCS 2008

  • Grace : Combined Static / Dynamic Analysis
    • Zheng et al, PPoPP 2011

  • PUG : An SMT based Static Analyzer
    • Lee and Gopalakrishnan, FSE 2010

  • GKLEE : An SMT based test generator
    • Lee, Rajan, Ghosh, and Gopalakrishnan (under submission)


90 of 110

Planned Overall Workflow of PUG

[Figure: the planned overall workflow of PUG, with the realized portion marked]

91 of 110

Workflow and Results from PUG


92 of 110

Examples of a Data Race

Increment N-element vector A by scalar b

[Figure: threads tid = 0 .. 15, each updating one element: A[0]+b, …, A[15]+b]

  __global__ void inc_gpu(float *A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
      A[tid] = A[tid - 1] + b;   /* race: reads A[tid-1], which thread tid-1 writes */
  }

93 of 110

Examples of a Data Race

(Same example as on the previous slide.)

SMT encoding generated for the race check:

  (and (and (/= t1.x t2.x)
       ......
       (or (and (bv-lt idx1@t1 N0)
                (bv-lt idx1@t2 N0)
                (= (bv-sub idx1@t1 0b0000000001) idx1@t2))   ; encoding for the Read-Write race
           (and (bv-lt idx1@t1 N0)
                (bv-lt idx1@t2 N0)
                (= idx1@t1 idx1@t2))))                        ; encoding for the Write-Write race


95 of 110

Real Example with Race

  __global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
    d_out[threadIdx.x] = 0;
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
    }
    __syncthreads();
    if (threadIdx.x % 2 == 0) {
      for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
        d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
    ...

  /* The counterexample given by PUG is:
       t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0
     Threads t1 and t2 are in iterations 1 and 0 respectively.
     t1 generates:  d_out[ 2 + 8*1 ] += d_out[ 2 + 8*1 + 1 ]   i.e.  d[10] += d[11]
     t2 generates:  d_out[ 10 + 8*0 ] += d_out[ 10 + 8*0 + 1 ] i.e.  d[10] += d[11]
     This is a real race on the writes to d[10] because of the += done by both threads. */


97 of 110

PUG Divides and Conquers Each Barrier Interval

[Figure: two threads t1 and t2 execute accesses a1 .. a6, separated by barriers into barrier intervals BI1 and BI3; accesses a4 and a5 are guarded by predicate p / ~p]

BI1 is conflict free iff, for all t1, t2: a1@t1, a2@t1, a1@t2 and a2@t2 do not conflict with each other.

BI3 is conflict free iff the following accesses do not conflict for all t1, t2:
  a3@t1, ~p : a4@t1, ~p : a5@t1,
  a3@t2, ~p : a4@t2, ~p : a5@t2

98 of 110

PUG Results (FSE 2010)

  Kernel (CUDA SDK)     loc    +O/+C/+R   B.C.   Time (pass)
  Bitonic Sort            65              HIGH   2.2s
  MatrixMult             102   * *        HIGH   < 1s
  Histogram64            136              LOW    2.9s
  Sobel                  130   *          HIGH   5.6s
  Reduction              315              HIGH   3.4s
  Scan                   255   * * *      LOW    3.5s
  Scan Large             237   * *        LOW    5.7s
  Nbody                  206   *          HIGH   7.4s
  Bisect Large          1400   * *        HIGH   44s
  Radix Sort            1150   * * *      LOW    39s
  Eigenvalues           2300   * * *      HIGH   68s

(A * in the +O/+C/+R column marks that some of these requirements apply to the kernel.)

  +O  : requires no bit-vector overflowing
  +C  : requires constraints on the input values
  +R  : requires manual refinement
  B.C.: measures how serious the bank conflicts are
  Time: SMT solving time

99 of 110

“GPU class” bugs

  Defects      Barrier Error or Race      Refinement
               benign        fatal        over #kernel    over #loop
  13 (23%)     3             2            17.5%           10.5%

Tested over 57 assignment submissions.

Defects: indicates how many kernels are not well parameterized, i.e., work only in certain configurations.
Refinement: measures how many loops need automatic refinement (by PUG).

100 of 110

PUG’s Strengths and Weaknesses

  • Strengths:
    • SMT-based Incisive SA avoids interleaving explosion
    • Still obtains coverage guarantees
    • Good for GPU Library FV
  • Weaknesses:
    • Engineering effort : C++, Templates, Breaks, ..
    • SMT “explosion” for value correctness
    • Does not help test the code on the actual hardware
    • False alarms require manual intervention
      • Handling loops, Kernel calling contexts


101 of 110

Two directions in progress

  • Parameterized Verification
    • See Li’s dissertation for initial results
  • Formal Testing
    • Brand new code-base called GKLEE
      • Joint work with Fujitsu Research


102 of 110

GKLEE: Formal Testing of GPU Kernels
(joint work with Li, Rajan, Ghosh of Fujitsu)

103 of 110

Evaluation of GKLEE vs. PUG

  Kernels             PUG n=4   GKLEE n=4       PUG n=16   GKLEE n=16      GKLEE n=64      GKLEE n=256   GKLEE n=1024
  Simple Reduction    2.8       < 0.1 (< 0.1)   T.O.       < 0.1 (< 0.1)   < 0.1 (< 0.1)   0.2 (0.3)     2.3 (2.9)
  Matrix Transpose    1.9       < 0.1 (< 0.1)   T.O.       < 0.1 (0.3)     < 0.1 (3.2)     < 0.1 (63)    0.9 (T.O.)
  Bitonic Sort        3.7       0.9 (1)         T.O.       T.O.            T.O.            T.O.          T.O.
  Scan Large                    < 0.1 (< 0.1)              < 0.1 (< 0.1)   0.1 (0.2)       1.6 (3)       22 (51)

Execution time of PUG and GKLEE on kernels from the CUDA SDK for functional correctness. n is the number of threads. T.O. means > 5 min.

104 of 110

Evaluation of GKLEE’s Test Coverage

  Kernels             sc cov.       bc cov.      min #test   avg. covt    max. covt    exec. time
  Bitonic Sort        100% / 100%   51% / 44%    5           78% / 76%    100% / 94%   1s
  Merge Sort          100% / 100%   70% / 50%    6           88% / 80%    100% / 95%   0.5s
  Word Search         100% / 92%    34% / 25%    2           100% / 81%   100% / 85%   0.1s
  Radix Sort          100% / 91%    54% / 35%    6           91% / 68%    100% / 75%   5s
  Suffix Tree Match   100% / 90%    38% / 35%    8           90% / 70%    100% / 82%   12s

Test generation results. Traditional metrics sc cov. and bc cov. give source-code and bytecode coverage respectively. Refined metrics avg. covt and max. covt measure the average and maximum coverage over all threads.

All coverages were far better than those achieved through random testing.

105 of 110

Conclusions, Future Work

  • ISP/DAMPI
    • Will move toward test generation, path coverage, crashing inputs, Integration with MUST
  • Parameterized Verification
  • Integration of PUG into the Eclipse Parallel Tools Platform to facilitate debugging
  • A version of GKLEE for OpenCL programs
  • PTP integration of GKLEE


106 of 110

Contrast with Shared Memory

  • In message passing systems, changes to an object are not automatically made visible to other threads
  • Msg passing helps isolate shared objects
    • Memory effects/errors do not propagate automatically
    • No shared memory races
    • Users control the visibility of changes across threads
  • Programming effort is higher – at least different
    • The overall semantics are often that of a sequential computation
      • The parallel threads (or processes) of a message passing system are often the result of decomposition of a single sequential computation along
        • The data dimension
        • The task dimension

107 of 110

Events of MPI Isend under Zero buffering

Isend(to: Proc I, data, &handle1);

…code in delay slot…

Wait(handle1);

108 of 110

Events of MPI Isend under Infinite buffering

Isend(to: Proc I, data, &handle1);

code in delay slot…

Wait(handle1);

  • Matches-Before serves as a one-stop-shopping solution to model how buffering affects MPI behavior!
  • Safety violations due to excessive buffering can also be studied using MB!

109 of 110

Crucial Ordering: Matches Before

  Isend(to: Proc k, …);
  Isend(to: Proc k, …);

Models non-overtaking of MPI messages (when the same destination is targeted, sends match receives in program order)

110 of 110

Crucial Ordering: Matches Before

  Isend(to: Proc j, …);
  Isend(to: Proc k, …);

Models relaxed ordering when the destinations differ (matches with receives are allowed out of program order)