What Formal Approaches are Needed for Correctness Verification in HPC
Ganesh Gopalakrishnan
School of Computing, University of Utah
and
Center for Parallel Computing (CPU)
Ever-growing Demand for Compute Power!
More compute power enables new discoveries and solves new problems (e.g., simulating atom interactions over 2 nanoseconds); more such studies can be designed if we have the right compute infrastructure.
High-end Debugging Methods are often Expensive, Inconclusive
Importance of Message Passing in HPC (MPI)
Trend: Hybrid Concurrency
[Figure: the HPC hardware/software stack. Hardware: Sandy Bridge (courtesy anandtech.com), GeForce GTX 480 (Nvidia), AMD Fusion APU, Infiniband-style interconnect. Software: concurrent data structures; monolithic large-scale MPI-based user applications; problem-solving environments (e.g., Uintah, Charm++, ADLB) and the user applications built on them; high-performance MPI libraries.]
Difficult Road Ahead wrt Debugging
HPC correctness tools must keep pace with HPC apps and give measurable coverage.
Focus of the rest of the talk: Mostly on MPI
Hybrid verification is future work
We will cover GPUs separately
MPI Verification approach depends on type of determinism
Examples to follow
An example of parallelizing matrix multiplication using message passing
[Figure, built up over several slides: the master broadcasts one matrix to all workers (MPI_Bcast), sends rows of the other matrix to the workers (MPI_Send), and the workers return result rows (MPI_Send at the worker, MPI_Recv at the master).]
Unoptimized Initial Version: Execution Deterministic
MPI_Send;
MPI_Recv(from: P0, P1, P2, P3, …);
Send the next row to the first worker, which by now must be free.
Later Optimized Version: Value Deterministic
Opportunistically send work to the processor that finishes first
MPI_Send;
MPI_Recv(from: *);
Send the next row to the first worker that returns an answer!
Still More Optimized Value-Deterministic Versions: communications are made non-blocking and software-pipelined (still expected to remain value-deterministic)
MPI_Send;
MPI_Recv(from: *);
Send the next row to the first worker that returns an answer!
Typical MPI Programs
Gist of bug-hunting story
A real-world bug
P0 P1 P2
--- --- ---
Send( (rank+1)%N ); Send( (rank+1)%N ); Send( (rank+1)%N );
Recv( (rank-1)%N ); Recv( (rank-1)%N ); Recv( (rank-1)%N );
MPI Bugs – more anecdotal evidence
Real bug stories in the MPI-land
How does one make progress?
Here are some thoughts..
Roughly justifies what we’ve done ☺
Our Research Agenda in HPC
Motivation for Coverage of Communication Nondeterminism
Eliminating wasted search in message passing verif.
P0
---
MPI_Send(to P1…);
MPI_Send(to P1, data=22);
P1
---
MPI_Recv(from P0…);
MPI_Recv(from P2…);
MPI_Recv(*, x);
if (x==22) then error1
else MPI_Recv(*, x);
P2
---
MPI_Send(to P1…);
MPI_Send(to P1, data=33);
Richard Vuduc, Martin Schulz, Dan Quinlan, Bronis de Supinski, and Andreas Sæbjørnsen. Improving distributed memory applications testing by message perturbation. In Proc. 4th Workshop on Parallel and Distributed Testing and Debugging (PADTAD), held with the International Symposium on Software Testing and Analysis, Portland, ME, USA, July 2006.
A frequently followed approach: “boil the whole schedule space” – often very wasteful
Eliminating wasted work in message passing verif.
But consider these two cases…
No need to play with schedules of deterministic actions
Need to detect Resource-Dependent Bugs
Example of Resource Dependent Bug
P0
Send(to:1);
Recv(from:1);
P1
Send(to:0);
Recv(from:0);
We know that this program may deadlock when there is less Send buffering
Example of Resource Dependent Bug
P0
Send(to:1);
Recv(from:1);
P1
Send(to:0);
Recv(from:0);
… and this program with more Send buffering may avoid a deadlock
Example of Resource Dependent Bug
P0
Send(to:1);
Send(to:2);
P1
Send(to:2);
Recv(from:0);
P2
Recv(from:*);
Recv(from:0);
… But this program deadlocks if Send(to:1) has more buffering !
Mismatched – hence a deadlock
Widely Publicized Misunderstandings
“Your program is deadlock-free if you have successfully tested it under zero buffering.”
MPI at fault?
Our Impact So Far
[Figure: the same hardware/software stack, annotated with our tools. PUG and GKLEE target GPU code; ISP and DAMPI target monolithic large-scale MPI-based user applications, problem-solving environments (e.g., Uintah, Charm++, ADLB) and the applications built on them, and high-performance MPI libraries. Useful formalizations help test these.]
Rest of the talk
All our pubs + code at http://www.cs.utah.edu/fv
A Simple MPI Example
Process P0
Isend(1, req) ;
Barrier ;
Wait(req) ;
Process P1
Irecv(*, req) ;
Barrier ;
Recv(2) ;
Wait(req) ;
Process P2
Barrier ;
Isend(1, req) ;
Wait(req) ;
Testing Challenges
(Same example as above.) We must cover all of its possible behaviors regardless of the execution platform.
Flow of ISP
[Figure: the MPI program is instrumented with an interposition layer and compiled into an executable; processes Proc1 … Procn run against the MPI runtime while ISP's scheduler intercepts their MPI calls and controls each run.]
ISP Scheduler Actions (animation)
[Figure, animated over several slides: the scheduler collects the operations of P0 (Isend(1, req); Barrier; Wait(req)), P1 (Irecv(*, req); Barrier; Recv(2); Wait(req)), and P2 (Barrier; Isend(1, req); Wait(req)), forwarding them to the MPI runtime via sendNext. It first fires the match set of all three Barriers, then explores the possible matches of Irecv(*). Along the branch where Irecv(*) matches P2's Isend, P1's Recv(2) and the pending Waits can form no match set: Deadlock!]
Formalization of MPI Behavior to Build a Formal Dynamic Verifier (verification scheduler)
Formal Modeling of MPI Calls
The Matches-Before Relation of MPI
1. Isend(to: Proc k, …); … Isend(to: Proc k, …)
2. Irecv(from: Proc k, …); … Irecv(from: Proc k, …)
3. Isend( &h ); … Wait( h )
4. Wait(..); … AnyMPIOp
5. Barrier(..); … AnyMPIOp
6. Irecv(from: Proc *, …); … Irecv(from: Proc k, …)   (conditional matches-before)
7. Irecv(from: Proc j, …); … Irecv(from: Proc *, …)   (conditional matches-before)
The ISP Scheduler
Theorem: This Scheduling Method achieves ND-Coverage in MPI programs!
How MB helps predict outcomes
Will this single-process example, called “Auto-send”, deadlock?
P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
The scheduler's reorder buffer collects P0's operations and, whenever P0 is fenced, forms and fires match sets against the MPI runtime:
1. Collect R(from:0, h1).
2. Collect B. P0 is fenced; form the match set { B } and send it to the MPI runtime.
3. Collect S(to:0, h2). P0 is fenced; form the { R, S } match set and fire it.
4. Collect W(h1). P0 is fenced; form the { W } match set and fire it.
5. Collect W(h2). Fire W(h2): the program finishes without deadlocks!
Dynamic Verification at Scale
Why do we need a Distributed Analyzer for MPI Programs (DAMPI)?
DAMPI Framework
[Figure: the MPI program runs as an executable with processes Proc1 … Procn over the native MPI runtime, instrumented with DAMPI's PnMPI modules. Detected alternate matches and epoch decisions feed a schedule generator, which reruns the program to force the unexplored matches.]
DAMPI: Distributed Analyzer for MPI
Distributed Causality Tracking in DAMPI
DAMPI uses Lamport clocks to build Happens-Before relationships
[Figure: processes P0, P1, P2 each start with clock 0. P1 posts R1(*), which completes at clock 1. P0 and P2 each execute pb(0); S(P1), i.e., a send to P1 piggybacking clock value 0.]
How we use the Happens-Before relationship to detect alternative matches
DAMPI Algorithm Review: (1) Discover alternative matches during the initial run
DAMPI Algorithm Review: (2) Force alternative matches during replay
[Figure: the same run replayed with the wildcard rewritten as R1(2), forcing P1 to receive P2's S(P1); the match with P0's S(P1) is crossed out.]
DAMPI maintains very good scalability vs ISP
DAMPI is also faster at processing interleavings
Results on large applications: SpecMPI2007 and NAS-PB
Slowdown is for one interleaving
No replay was necessary
The Devil in the Details of DAMPI
Why develop FV methods for CUDA?
What are GPUs (aka Accelerators)?
[Figure: Sandy Bridge (courtesy anandtech.com); the OpenCL compute model (courtesy Khronos group); GeForce GTX 480 (Nvidia); AMD Fusion APU. Supercomputers are built using GPUs, and GPU-based desktop machines use them too.]
CPU program

void inc_cpu(float* a, float b, int N) {
  for (int idx = 0; idx < N; idx++)
    a[idx] = a[idx] + b;
}

void main() {
  .....
  inc_cpu(a, b, N);
}

CUDA program

__global__ void inc_gpu(float* A, float b, int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N)
    A[tid] = A[tid] + b;
}

void main() {
  .....
  dim3 dimBlock(blocksize);
  dim3 dimGrid(ceil(N / (float)blocksize));
  inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

Example: Increment Array Elements
Contrast between CPUs and GPUs
Fine-grained threads are scheduled to run like this: tid = 0, 1, 2, 3, …
Why heed Many-core / GPU Compute?
EAGAN, Minn., May 3, 1991 (John Markoff, NY Times): The Convex Computer Corporation plans to introduce its first supercomputer on Tuesday. But Cray Research Inc., the king of supercomputing, says it is more worried by "killer micros" -- compact, extremely fast work stations that sell for less than $100,000.
Take-away : Clayton Christensen’s “Disruptive Innovation”
GPU is a disruptive technology!
Other Reasons to study GPU Compute
GPUs offer an excellent setting in which to study heterogeneous processor organizations and memory hierarchies.
What bugs caught? What value?
GPU Software Bugs / Difficulties
Approaches/Tools for GPU SW Verif.
Planned Overall Workflow of PUG
Realized Workflow and Results from PUG
Examples of a Data Race
Increment N-element vector A by scalar b
[Figure: tids 0 … 15 each update one slot of A: A[0]+b … A[15]+b.]

__global__ void inc_gpu(float* A, float b, int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N)
    A[tid] = A[tid - 1] + b;   // race: reads the element that thread tid-1 writes
}
(and (and (/= t1.x t2.x)
……
(or (and (bv-lt idx1@t1 N0)
(bv-lt idx1@t2 N0)
(= (bv-sub idx1@t1 0b0000000001) idx1@t2))
(and (bv-lt idx1@t1 N0)
(bv-lt idx1@t2 N0)
(= idx1@t1 idx1@t2))))
Encoding for Write-Write Race
Encoding for Read-Write Race
Real Example with Race
__global__ void computeKernel(int *d_in, int *d_out, int *d_sum) {
  d_out[threadIdx.x] = 0;
  for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
    d_out[threadIdx.x] += compare(d_in[i*BLOCKSIZE + threadIdx.x], 6);
  }
  __syncthreads();
  if (threadIdx.x % 2 == 0) {
    for (int i = 0; i < SIZE/BLOCKSIZE; i++) {
      d_out[threadIdx.x + SIZE/BLOCKSIZE*i] += d_out[threadIdx.x + SIZE/BLOCKSIZE*i + 1];
...
/* The counter example given by PUG is :
t1.x = 2, t2.x = 10, i@t1 = 1, i@t2 = 0
Threads t1 and t2 are in iterations 1 and 0 respectively
t1 generates access d_out[ 2 + 8 * 1 ] += d_out[ 10 + 8 * 0 + 1] i.e. d[10] += d[11]
t2 generates this: d_out[ 10 + 8 * 0 ] += d_out [ 10 + 8 * 0 + 1] i.e. d[10] += d[11]
This causes a real race on writes to d[10] because of the += done by the threads
*/
PUG Divides and Conquers Each Barrier Interval
[Figure: threads t1 and t2 each execute accesses a1 … a6; a branch on predicate p guards some of them; barrier intervals BI1 and BI3 delimit the accesses.]
BI1 is conflict-free iff: ∀ t1, t2 : a1@t1, a2@t1, a1@t2, and a2@t2 do not conflict with each other.
BI3 is conflict-free iff the following accesses do not conflict for all t1, t2: a3@t1, ~p : a4@t1, ~p : a5@t1, a3@t2, ~p : a4@t2, ~p : a5@t2.
PUG Results (FSE 2010)
Kernels (in CUDA SDK ) | loc | +O | +C | +R | B.C. | Time (pass) |
Bitonic Sort | 65 | | | | HIGH | 2.2s |
MatrixMult | 102 | * | * | | HIGH | <1s |
Histogram64 | 136 | | | | LOW | 2.9s |
Sobel | 130 | * | | | HIGH | 5.6s |
Reduction | 315 | | | | HIGH | 3.4s |
Scan | 255 | * | * | * | LOW | 3.5s |
Scan Large | 237 | * | * | | LOW | 5.7s |
Nbody | 206 | * | | | HIGH | 7.4s |
Bisect Large | 1400 | * | * | | HIGH | 44s |
Radix Sort | 1150 | * | * | * | LOW | 39s |
Eigenvalues | 2300 | * | * | * | HIGH | 68s |
+O: requires absence of bit-vector overflow
+C: requires constraints on the input values
+R: requires manual refinement
B.C.: how serious the bank conflicts are
Time: SMT solving time
“GPU class” bugs
Defects | Barrier Error or Race: benign | fatal | Refinement: over #kernels | over #loops |
13 (23%) | 3 | 2 | 17.5% | 10.5% |
Tested over 57 assignment submissions
Defects: how many kernels are not well parameterized, i.e., work only in certain configurations.
Refinement: how many loops need automatic refinement (by PUG).
PUG’s Strengths and Weaknesses
Two directions in progress
GKLEE: Formal Testing of GPU Kernels (joint work with Li, Rajan, Ghosh of Fujitsu)
Evaluation of GKLEE vs. PUG
Kernels | n=4: PUG | GKLEE | n=16: PUG | GKLEE | n=64: GKLEE | n=256: GKLEE | n=1024: GKLEE |
Simple Reduction | 2.8 | < 0.1 (< 0.1) | T.O. | < 0.1 (< 0.1) | < 0.1 (< 0.1) | 0.2 (0.3) | 2.3 (2.9) |
Matrix Transpose | 1.9 | < 0.1 (< 0.1) | T.O. | < 0.1 (0.3) | < 0.1 (3.2) | < 0.1 (63) | 0.9 (T.O.) |
Bitonic Sort | 3.7 | 0.9 (1) | T.O. | T.O. | T.O. | T.O. | T.O. |
Scan Large | ▬ | < 0.1 (< 0.1) | ▬ | < 0.1 (< 0.1) | 0.1 (0.2) | 1.6 (3) | 22 (51) |
Execution time of PUG and GKLEE on kernels from the CUDA SDK
for functional correctness. n is the number of threads. T.O. means > 5 min.
Evaluation of GKLEE’s Test Coverage
Kernels | sc cov. | bc cov. | min #test | Avg. Covt | max. Covt | exec. time |
Bitonic Sort | 100% / 100% | 51% / 44% | 5 | 78% / 76% | 100% / 94% | 1s |
Merge Sort | 100% / 100% | 70% / 50% | 6 | 88% / 80% | 100% / 95% | 0.5s |
Word Search | 100% / 92% | 34% / 25% | 2 | 100% / 81% | 100% / 85% | 0.1s |
Radix Sort | 100% / 91% | 54% / 35% | 6 | 91% / 68% | 100% / 75% | 5s |
Suffix Tree Match | 100% / 90% | 38% / 35% | 8 | 90% / 70% | 100% / 82% | 12s |
Test generation results. Traditional metrics sc cov. and bc cov. give source code and bytecode coverage respectively. Refined metrics avg.covt and max.covt measure the average and maximum coverage of all threads.
All coverages were far better than achieved through random testing.
Conclusions, Future Work
Contrast with Shared Memory
Events of MPI Isend under Zero Buffering
Isend(to: Proc I, data, &handle1);
…code in delay slot…
Wait(handle1);
Events of MPI Isend under Infinite Buffering
Isend(to: Proc I, data, &handle1);
…code in delay slot…
Wait(handle1);
Goal: to model how buffering affects MPI behavior, also using MB!
Crucial Ordering: Matches-Before
Isend(to: Proc k, …);
…
Isend(to: Proc k, …)
Models non-overtaking of MPI messages
(sends targeting the same destination match receives in program order)
Crucial Ordering: Matches-Before
Isend(to: Proc j, …);
…
Isend(to: Proc k, …)
Models relaxed ordering when the destinations differ
(matches with receives are allowed out of program order)