1 of 79

Analysis of Shared Memory GPU Parallel Programs (talk from 13 years ago)

Material still relevant!

Ganesh Gopalakrishnan, Professor,

School of Computing, University of Utah,

Salt Lake City, UT 84112, USA

Acknowledgements : Students and Collaborators

Peng Li, Guodong Li, Geof Sawaya, Tyler Sorensen, Brandon Gibson, Mohammed Saeed Al-Mahfoudh, Ashwin Kumar, Zvonimir Rakamarić, Martin Berzins, Sriram Aananthakrishnan, Alan Humphrey, Diego Oliviera

External Collaborations: Lawrence Livermore National Laboratory, Argonne National Laboratory, Nvidia, Stephen Siegel (Delaware)

2 of 79

Part of this tutorial was updated in 2025 by adding the Colab GPU programming exercise linked below, courtesy of Mark Harris of NVIDIA.

https://colab.research.google.com/drive/1tW8uMgRNRwjpFDghysxpvvsThQN5GlEg?usp=sharing

3 of 79

Emerging Heterogeneous Parallel Systems, and Role of Formal Methods in them

(Slide figure: emerging heterogeneous hardware - Sandy Bridge (courtesy anandtech.com), GeForce GTX 480 (Nvidia), AMD Fusion APU, concurrent data structures, and an Infiniband-style interconnect.)

(Slide figure: the software stack and where formal methods apply - (1) formal methods at the user-application level, i.e. problem-solving environments (e.g. Uintah, Charm++, ADLB) and the user applications built on them; (2) formal dynamic methods for MPI, i.e. high-performance MPI libraries; (3) formal analysis for CUDA programs.)

4 of 79

Why conventional testing is inadequate

Exponential number of paths; the bug may be deeply hidden.

Exponential number of schedules (E1, E2, …, En); the bug may be triggered only by one particular interleaving.

You must be lucky to force such “bad paths”, and lucky to schedule the threads in just that manner. Some bad interaction, such as a data race, may depend on the schedule…

5 of 79

Good Halloween stuff from llama.cpp (to see the complexity explode in your face when you “open the lid” is sobering) - part-1

MY THOUGHTS: I have a huge amount of respect for those who write such code.

WISH-LIST, TOP ENTRY: I would like to see the day when such code emerges from a series of pokes administered to an initial SMT-based model of the programming intent, plus some generous profiling to fill in the missing pieces, with re-verification after the performance enhancements, and so on.

6 of 79

Good Halloween stuff from llama.cpp (to see the complexity explode in your face when you “open the lid” is sobering) - part-2

7 of 79

Ways to make formal testing scale

  • Employ structured designs

  • Avoid unpredictable / dangerous / undefined interactions between concurrency models

  • Verification must exploit system structure

8 of 79

ROADMAP for the TUTORIAL

  • GPU computing and formal methods

9 of 79

GPU COMPUTING AND FORMAL METHODS

10 of 79

GPU-based Computing

  • About 40 of the top 500 machines were (in 2012) GPU-based

  • Personal supercomputers used for scientific research (biology, physics, …) increasingly based on GPUs

(Images courtesy of AMD, Nvidia, and www.engadget.com)

11 of 79

Application areas, roadmap

  • Graphics, games, scientific computing, machine-learning algorithms, search,…
    • E.g. Bing powered by GTX 690 servers
  • Hand-held through high-end machines
  • Example hardware:
    • AMD, Nvidia, Samsung, …
  • Example programming methods:
    • CUDA, OpenCL, straight C/C++, special languages/libraries (e.g. OpenACC, OpenMP)
  • Companies such as Intel promote Xeon-Phi MICs
    • Intel does not believe in GPUs (I was told so by an Intel manager)
    • Intel MIC : Shared mem multicore, plenty of vector units
  • AMD promotes APUs
  • Automated synthesis through auto-tuning / search (Cuda-CHiLL)

12 of 79

Programming Approaches

  1. CUDA C -- WHAT WE WILL FOCUS ON IN THIS TUTORIAL
    • Runtime API
    • Many tedious and error-prone details (index calculations, deadlocks, data races, warp divergence, non-coalesced memory transfers, interleavings of multiple async streams, reordering of async kernel launches, ...)
  2. CopperHead
    • Data Parallel Python
    • Constraint programming
    • Based on Thrust (but, has own transformations)
  3. Thrust
    • C++-based
    • Abstract OO design (no indices, no array-bounds calculations, etc.)
    • Easy to extend/customize

13 of 79

Programming Approaches

  • OpenACC
    • C/C++ with annotations
    • Similar to OpenMP but for Heterogeneous architectures, including GPUs from different vendors
    • May be folded into OpenMP
  • PyCUDA
    • Python, with some CUDA C mixed in at times
    • Almost all the problems of CUDA C are still there
  • OpenCL
    • Less momentum behind OpenCL than CUDA (at least in the US)
    • Often lower level than CUDA
    • Non-proprietary

14 of 79

Key ideas in GPU-based computing

  • Wide (512 bits) execution units
  • Memory stall 🡪 switch threads
  • Regular on-chip structures
    • Short wires, more ALUs (instead of caches)
    • 7B transistors (Kepler)
  • Ideas gestated in Graphics Engines
    • People started “abusing” graphics engines
    • “Abuse” turned into opportunity 🡪 CUDA

15 of 79

CPU program

void inc_cpu(float *a, float b, int N) {
  for (int idx = 0; idx < N; idx++)
    a[idx] = a[idx] + b;
}

void main() {
  .....
  inc_cpu(a, b, N);
}

CUDA program

__global__ void inc_gpu(float *A, float b, int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N)
    A[tid] = A[tid] + b;
}

void main() {
  .....
  dim3 dimBlock(blocksize);
  dim3 dimGrid(ceil(N / (float)blocksize));
  inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}

Example: Increment Array Elements

Contrast between CPUs and GPUs

16 of 79

(Same CPU-vs-CUDA increment example as on the previous slide.)

Fine-grained threads are scheduled to run like this: tid = 0, 1, 2, 3, …
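For concreteness, here is a minimal host-side sketch of what the elided main() could look like; the values N = 1024, blocksize = 256, and b = 1.0f are illustrative assumptions, and error checking is omitted.

// Hedged sketch: a possible complete host program for the increment example.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void inc_gpu(float *A, float b, int N) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < N)
    A[tid] = A[tid] + b;
}

int main() {
  const int N = 1024, blocksize = 256;      // illustrative sizes
  float b = 1.0f;
  float *a = (float *)malloc(N * sizeof(float));
  for (int i = 0; i < N; ++i) a[i] = (float)i;

  float *dA;
  cudaMalloc((void **)&dA, N * sizeof(float));                  // device allocation
  cudaMemcpy(dA, a, N * sizeof(float), cudaMemcpyHostToDevice); // host -> device

  dim3 dimBlock(blocksize);
  dim3 dimGrid((N + blocksize - 1) / blocksize);                // ceil(N / blocksize)
  inc_gpu<<<dimGrid, dimBlock>>>(dA, b, N);                     // one fine-grained thread per element

  cudaMemcpy(a, dA, N * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
  cudaFree(dA);
  printf("a[0] = %f, a[N-1] = %f\n", a[0], a[N - 1]);
  free(a);
  return 0;
}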

17 of 79

CUDA Abstract Device Model

Device global memory (shared between blocks) - slowest

Per-block shared memory - faster

Per-thread private memory (T0, T1, T2, T3, …) - fastest
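To make the hierarchy concrete, here is a small illustrative kernel (not from the slides; the name scale_sum, the TILE size, and the launch shape are assumptions) that touches all three levels.

#define TILE 128

// Hedged sketch: each memory level of the abstract device model appears once.
// Launch as, e.g., scale_sum<<<numBlocks, TILE>>>(g_in, g_out, 2.0f) with
// numBlocks * TILE elements in g_in and g_out.
__global__ void scale_sum(const float *g_in, float *g_out, float factor) {
  __shared__ float tile[TILE];                 // per-block shared memory (faster)
  int gid = blockIdx.x * blockDim.x + threadIdx.x;

  float x = g_in[gid] * factor;                // 'x' lives in a per-thread register (fastest)
  tile[threadIdx.x] = x;                       // stage the value in shared memory
  __syncthreads();                             // make every write to 'tile' visible block-wide

  int next = (threadIdx.x + 1) % blockDim.x;   // neighbour read is race-free after the barrier
  g_out[gid] = tile[threadIdx.x] + tile[next]; // result goes back to global memory (slowest)
}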

18 of 79

Related Work

  • Boyer/Skadron – Instrumented Dynamic Verif.
  • Grace – PPoPP 2011 (combined static/dynamic)
  • PUG – Li, Gopalakrishnan, FSE 2010
  • GKLEE related papers (PPoPP’12, SC’12, NFM’13)
  • GPUVerify and related papers – SPLASH 2012, ESOP’13
  • Test Amplification – PLDI 2012
  • KLEE-FP, and OpenCL Verif. Framework – Cadar et al
  • Some conventional tools
    • Allinea DDT, RogueWave, Nvidia Parallel Nsight, CUDA GDB

19 of 79

Our FV method for GPUs

  • We will illustrate our approach to the verification of GPU parallel programs
  • It is based on the tool GKLEE
    • Concrete + symbolic execution based (“concolic”)

20 of 79

GKLEE: Symbolic Virtual GPU

(Slide figure: the standard CUDA hierarchy - the host launches Kernel 1 and Kernel 2 on the device; Kernel 1 runs as Grid 1 with Blocks (0,0)…(2,1); Kernel 2 runs as Grid 2; each block, e.g. Block (1,1), contains Threads (0,0)…(4,2).)

    • Flexible modeling of the “real GPU”
      • Vary warp-sizes

      • Incorporate device-specific rules (e.g. coalescing rules)

    • Be able to consider compiler optimizations (to some extent)

(GKLEE runs the host code on a virtual CPU and the kernel code on a virtual GPU.)

21 of 79

GKLEE (see PPoPP 2012)

__input__ int *values = (int *)malloc(sizeof(int) * NUM);

klee_make_symbolic(values, sizeof(int)*NUM, "values");

int *dvalues;

cudaMalloc((void **)&dvalues, sizeof(int) * NUM);

cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);

BitonicKernel <<< … >>> (dvalues);

(GKLEE tool flow: C++ CUDA programs with symbolic variable declarations → LLVM-GCC → LLVM byte-code instructions → symbolic analyzer and scheduler, with error monitors.)

  • Deadlocks
  • Data races
  • Concrete test inputs
  • Bank conflicts
  • Warp divergences
  • Non-coalesced memory accesses
  • Test Cases
    • Provide high coverage
    • Can be run on HW

22 of 79

Sequential programming examples

23 of 79

Basics of Concolic Execution

if (item1 > item2) {
  printf("Item1 > Item2\n");
  if (item2 > item3) {
    printf("Item2 > Item3\n");
    if (item3 > item1) {
      printf("Item3 > Item1\n");
    } else {
      printf("Item3 <= Item1\n");
    }
  } else {
    printf("Item2 <= Item3\n");
  }
} else {
  printf("Item1 <= Item2\n");
}
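A minimal driver for this snippet (an illustrative assumption, not part of the slide) shows how the three items could be marked symbolic so that a concolic engine such as KLEE explores every feasible path:

#include <stdio.h>
#include "klee.h"                 /* provides klee_make_symbolic, as in the slides */

int main() {
  int item1, item2, item3;
  klee_make_symbolic(&item1, sizeof(item1), "item1");
  klee_make_symbolic(&item2, sizeof(item2), "item2");
  klee_make_symbolic(&item3, sizeof(item3), "item3");

  if (item1 > item2) {
    printf("Item1 > Item2\n");
    if (item2 > item3) {
      printf("Item2 > Item3\n");
      if (item3 > item1) {
        printf("Item3 > Item1\n");   /* infeasible: contradicts item1 > item2 > item3 */
      } else {
        printf("Item3 <= Item1\n");
      }
    } else {
      printf("Item2 <= Item3\n");
    }
  } else {
    printf("Item1 <= Item2\n");
  }
  return 0;
}

At each branch the engine forks and asks a constraint solver whether both directions are satisfiable; every feasible path yields one concrete test input (e.g. item1 = 2, item2 = 1, item3 = 0), while the innermost “Item3 > Item1” path is proven unreachable.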

24 of 79

Basics of Concolic Execution

(Same snippet as on the previous slide.)

25 of 79

Our Formal Verification of CUDA builds on top of Sequential Program Verification by KLEE [OSDI ‘08]

#include "stdio.h"
#include "cutil.h"
#include "string.h"
#include "klee.h"

#define STRLEN 5

int main() {
  char item;
  char str[STRLEN];
  int lo = 0;
  int hi;
  int mid, found = 0;

  // printf("Please give char to be searched: \n");
  // scanf("%c", &item);
  klee_make_symbolic(&item, sizeof(item), "item");

  // printf("Please give string within which to search: \n");
  // scanf("%s", str);
  klee_make_symbolic(str, sizeof(str), "str");
  klee_assume(str[0] <= str[1]);
  klee_assume(str[1] <= str[2]);
  klee_assume(str[2] <= str[3]);
  klee_assume(str[3] <= str[4]);
  str[4] = '\0';

  hi = 4; // strlen(str)-1; strlen ignores the null at the end of the string,
          // so hi points to the last char in a 0-based array

  while (!found && lo <= hi) {
    mid = (lo + hi) / 2;
    if (item == str[mid]) {
      printf("*");
      found = 1;
    } else if (item < str[mid]) {
      hi = mid - 1;
      printf("L");
    } else {
      lo = mid + 1;
      printf("H");
    }
  }

  if (found) printf("found\n");     // printf("\nitem %c found at posn %d\n", item, mid);
  else       printf("not found\n"); // printf("\nitem %c not in str %s\n", item, str);
  return found;
}

A simple binary search being prepared for formal symbolic analysis. This will force path coverage!

26 of 79

Concurrency errors

27 of 79

Example: Increment Array Elements

Increment N-element array A by scalar b

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < 64) {
    A[idx] = A[idx] + b;
  }
}

(Figure: fine-grained threads t0, t1, … each update one element - t0 computes A[0]+b, t1 computes A[1]+b, …)

28 of 79

Data Races in GPU Programs

(Figure: conflicting access pairs - Write(a)/Write(a) and Write(a)/Read(a).)

The usual definition of a race: “Two accesses, one of which is a write, that occur without a happens-before edge between them.”

In GPUs, the happens-before is provided by (a sketch follows the list):

  • Barriers ( __syncthreads() )
  • Atomics
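A minimal sketch (an illustrative assumption, not from the slides) of a barrier supplying that happens-before edge: every thread writes its own slot, and only after __syncthreads() does its neighbour read it.

#define N 256

// Hedged sketch; launch as shift_right<<<1, N>>>(out) with out holding N ints.
__global__ void shift_right(int *out) {
  __shared__ int buf[N];
  buf[threadIdx.x] = threadIdx.x;     // write phase: each thread writes only its own slot
  __syncthreads();                    // happens-before edge: all writes above are ordered
                                      // before all reads below, for every thread of the block
  int left = (threadIdx.x + N - 1) % N;
  out[threadIdx.x] = buf[left];       // read phase: reading a neighbour's slot is now race-free
}

Removing the __syncthreads() turns the write/read pair on buf into exactly the kind of race defined above.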

29 of 79

Illustration of Race

Increment N-element vector A by scalar b

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < 64) {
    A[idx] = A[(idx - 1 + 64) % 64] + b;
  }
}

RACE! Thread t63 writes A[63] while thread t0 reads A[63].

30 of 79

Porting Race: A Race-type Identified by us

    • Some hardware platforms may evaluate “then” before “else” – others may reverse this order

If we have a divergent warp where a variable is READ in one path and WRITTEN in another path, then depending on the execution order the READ may land before or after the WRITE.

31 of 79

Example of Porting Race

#include <stdio.h>

#define NUM 32

__shared__ int v[NUM];

__global__ void PortingRace() {
  if (threadIdx.x % 2) {
    v[threadIdx.x] = v[ (NUM + threadIdx.x - 1) % NUM ] + 1;   // W: odd slots; R: even slots
  }
  else {
    v[threadIdx.x] = v[ (NUM + threadIdx.x + 1) % NUM ] - 1;   // W: even slots; R: odd slots
  }
}

int main() {
  PortingRace<<<1, NUM>>>();
  return 0;
}
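One way to remove the porting race, sketched here under the assumption that the intent is “combine the old neighbour value”: read into a register first, separate the read and write phases with barriers, and only then write.

#define NUM 32

// Hedged sketch of a race-free variant; the initialisation line is added so that
// every slot holds a defined value before it is read.
__global__ void PortingRaceFixed() {
  __shared__ int v[NUM];
  v[threadIdx.x] = threadIdx.x;                     // give every slot a defined value
  __syncthreads();

  int nbr;
  if (threadIdx.x % 2)
    nbr = v[ (NUM + threadIdx.x - 1) % NUM ] + 1;   // odd threads read their even neighbour
  else
    nbr = v[ (NUM + threadIdx.x + 1) % NUM ] - 1;   // even threads read their odd neighbour
  __syncthreads();                                  // all reads finish before any write starts

  v[threadIdx.x] = nbr;                             // write phase only: the result no longer
                                                    // depends on which branch runs first
}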

32 of 79

CUDA-based GPU Programming Basics

  • A simple dialect of C++ with CUDA directives
  • Synchronization within a block through barriers (__syncthreads())
  • Threads within block scheduled in warps
  • Don’t rely on warps for correctness!

(Memory-hierarchy figure as on the “CUDA Abstract Device Model” slide: device global memory - slowest; per-block shared memory - faster; per-thread private memory - fastest.)

33 of 79

Conventional GPU Debuggers are Inadequate

  • Based on platform testing
  • Full generality of executions not reflected
  • Hard to cause races (input selection)
  • Hard to observe races
    • vector-clock based solutions that run on GPUs exist
    • (last time we checked) only for shared memory races


34 of 79

Some of the GPU Program Bugs

  • Data races
    • Cause unpredictable outputs
    • Cause compilers to misbehave
  • Incorrectly used synchronizations
    • __syncthreads() not placed at textually aligned barriers
    • __syncthreads() causing deadlocks
  • Misunderstood memory consistency models
    • When updates are visible across threads
  • Incorrectly used atomics (synchronization)
    • Due to incorrect synchronization, invariants get broken
  • Erroneous computational results – esp. with floating-point
    • Due to unpredictable computational order, results may diverge

35 of 79

Why are Data Races Highly Problematic?

    • Testing is seldom conclusive
      • Must infer a race indirectly (e.g. corrupted results)
    • Races manifest when code is ported or optimized
      • Scheduling can change
    • Data races make compiled results suspect
      • Compilers optimize assuming the absence of races in the most general setting


36 of 79

Synchronization Primitives in CUDA

__global__ void deadlock() {
  if (threadIdx.x > 0) {
    v[threadIdx.x]++;
  }
  else {
    v[threadIdx.x]--;
  }
  __syncthreads(); // Barrier
}

(Memory-hierarchy figure as before.)

37 of 79

Synchronization Primitives in CUDA

(Same deadlock() kernel and memory-hierarchy figure as on the previous slide.)

Threads within a block can synchronize using __syncthreads(). (This call synchronizes the threads and also gives the effect of the stronger fence __threadfence().)

38 of 79

Synchronization Primitives in CUDA

(Same kernel, figure, and note as on the previous slide.)

For synchronization across blocks, we must use one of the CUDA atomics (Zvonimir will present our NFM’13 work).

39 of 79

Deadlocks caused by misused __syncthreads()

  • Cause a “hang” or unspecified behavior


40 of 79

Illustration of “Deadlock” (conceptual deadlock; behavior is really undefined)

__global__ void kernelName(args) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx == 3) {
    ….
    __syncthreads();   // only one thread reaches this barrier
  }
}

__global__ void deadlock() {
  if (threadIdx.x > 0) {
    v[threadIdx.x]++;
    __syncthreads();
  }
  else {
    v[threadIdx.x]--;
    __syncthreads();
  }
}

Suffers from textually non-aligned barriers (example Deadlock1.C).
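A sketch of the usual repair (under the assumption that the intent is simply “update v, then synchronize”): hoist the barrier out of the conditional so that every thread of the block reaches the same, textually aligned __syncthreads().

__global__ void no_deadlock(int *v) {
  if (threadIdx.x > 0) {
    v[threadIdx.x]++;
  } else {
    v[threadIdx.x]--;
  }
  __syncthreads();   // outside the branch: all threads of the block reach the same barrier
}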

41 of 79

Warp Divergence

__global__ void kernelName(args) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if ( odd(idx) ) {
    <Statements1>
  } else {
    <Statements2>
  }
}

This is a performance bug - but one that can be detected through formal symbolic analysis. Since odd and even idx values alternate within a warp, every warp executes both branches one after the other.
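One common way to reduce such divergence is to change the work mapping so that whole warps take the same branch; a sketch, assuming the branch assignment really can be made per-warp (the parameters data and N are illustrative):

__global__ void kernelName(int *data, int N) {
  int idx  = blockIdx.x * blockDim.x + threadIdx.x;
  int warp = threadIdx.x / warpSize;        // 32 consecutive threads form one warp
  if (warp % 2) {
    // <Statements1> - executed by whole warps, so no intra-warp divergence
    if (idx < N) data[idx] += 1;
  } else {
    // <Statements2>
    if (idx < N) data[idx] -= 1;
  }
}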

42 of 79

Memory-bank Conflicts

(Figure: shared-memory banks.)

This is a performance bug - but one that can be detected through formal symbolic analysis.

Bank conflicts happen when two threads generate addresses that fall into the same memory bank.
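An illustrative sketch (not from the slides) of a worst-case conflict and the classic padding fix; it assumes the usual layout of 32 banks of 4-byte words and a 32x32 thread block.

#define TILE 32

// Hedged sketch; launch as bank_conflict_demo<<<1, dim3(TILE, TILE)>>>(out)
// with out holding TILE*TILE floats.
__global__ void bank_conflict_demo(float *out) {
  __shared__ float a[TILE][TILE];       // column reads below all hit the same bank
  __shared__ float b[TILE][TILE + 1];   // the +1 padding shifts each row to a new bank

  a[threadIdx.y][threadIdx.x] = threadIdx.x;
  b[threadIdx.y][threadIdx.x] = threadIdx.x;
  __syncthreads();

  // Reading a column: the 32 threads of a warp (fixed threadIdx.y) access words that are
  // 32 words apart in 'a', so they all map to one bank (a 32-way conflict); in 'b' the
  // extra padding column spreads them across all 32 banks (no conflict).
  float slow = a[threadIdx.x][threadIdx.y];
  float fast = b[threadIdx.x][threadIdx.y];
  out[threadIdx.y * TILE + threadIdx.x] = slow + fast;
}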

43 of 79

Non-coalesced Memory Fetches

(Figure: global memory and the memory bus.)

This is a performance bug - but one that can be detected through formal symbolic analysis (roughly 10x more severe than bank conflicts).

Non-coalesced fetches happen when threads generate global memory addresses that do not fully utilize the bus.
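An illustrative sketch (an assumption, not from the slides) contrasting a coalesced access pattern with a strided one that wastes most of each memory transaction.

// Hedged sketch: both kernels copy data, but only the first uses the bus well.
__global__ void coalesced_copy(const float *in, float *out, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads touch consecutive addresses
  if (i < N) out[i] = in[i];                       // one wide transaction serves a whole warp
}

#define STRIDE 32

__global__ void strided_copy(const float *in, float *out, int N) {
  // consecutive threads end up 128 bytes apart, so each thread needs its own
  // memory transaction and most of the fetched bus width is wasted (non-coalesced)
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * STRIDE;
  if (i < N) out[i] = in[i];
}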

44 of 79

Two approaches to modeling

  • Through interleaving
    • Followed in most works

  • Through lock-step synchronous execution
    • Pioneered in GPUVerify

45 of 79

How bad is the interleaving problem?

It is like shuffling decks of cards:

more than 6 × 10^14 interleavings for 5 threads with 5 instructions each, since (5·5)! / (5!)^5 = 25! / (5!)^5 ≈ 6.2 × 10^14.

46 of 79

Avoiding Interleaving Explosion

47 of 79

Solution to the interleaving problem: Find representative interleavings

For example: if the green dots are local thread actions, then all schedules that arrive at the “cut line” are equivalent!

48 of 79

Solution to the interleaving problem: Find representative interleavings

(Same figure and caption as on the previous slide.)

49 of 79

Solution to the interleaving problem: Find representative interleavings

(Same figure and caption as on the previous slide.)

50 of 79

GKLEE Avoids Examining Exp. Schedules !!

Instead of considering all schedules and all potential races…

51 of 79

GKLEE Avoids Examining Exp. Schedules !!

…instead of considering all schedules and all potential races, consider JUST THIS SINGLE CANONICAL SCHEDULE!!

Folk theorem (proved in our paper): “We will find A RACE if there is ANY race”!!

52 of 79

Why does a sequential run suffice?

If the red pair is the only race, then it occurs after the cut-line. So all the ways to reach the cut-line are EQUIVALENT!!

53 of 79

So this sequential run is equivalent too.

You may complain: “But the sequential run inches past the cut-line!!”

But hey, you told me that this is the ONLY race! If some other race were encountered earlier, it would be caught too… there is always a “first race”, and that one will be caught!

54 of 79

Concolic execution helps find witnesses to data races

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos = ((threadIdx.x & (~63)) >> 0)
                      | ((threadIdx.x & 15) << 2)
                      | ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    ...
    addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
  }
  __syncthreads(); ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[ threadPos + IMUL(data, THREAD_N) ]++; }

“GKLEE: Is there a Race ?”

55 of 79

(Same histogram64Kernel / addData64 code as on the previous slide.)

GKLEE: Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0.

Concolic execution helps find witnesses to data races

56 of 79

Two data-points for GKLEE

  • Modeling all threads explicitly, we can push toward ~1000 threads and modestly-sized kernels – demo radixsort.C

  • For much larger kernels, we have a parameterized debugging method

    • scales much more than GKLEE, and has found bugs in much larger kernels / instances

    • static analysis for obtaining coverage guarantees is being researched

  • GPUVerify has a theorem of the form “two threads are sufficient”

  • Parameterized verification is crucial in practice, as many routines are poorly parameterized, defying manual down-scaling.

57 of 79

Scaling of GKLEEp

58 of 79

Lower-level issues: (1) the dangers of assuming warps; (2) the weak memory model

59 of 79

Race hidden by the warp-assumption or a low degree of compiler optimization

__global__ void kernel(int *x, int *y)
{
  int index = threadIdx.x;
  y[index] = x[index] + y[index];
  if (index != 63 && index != 31)
    y[index + 1] = 1111;
}

Initially : x[i] == y[i] == i

Warp-size = 32

The hardware schedules these instructions in “warps” (SIMD groups).

However, the compiler performs transformations, assuming that there are no data races even under warp-size = 1

Result: 0, 1111, 1111, …, 1111, 64, 1111, …

60 of 79

Possible transformation under various optimizations, using NVCC 4.1 and 4.2

__global__ void kernel(int *x, int *y)
{
  int index = threadIdx.x;
  y[index] = x[index] + y[index];
  if (index != 63 && index != 31)
    y[index + 1] = 1111;
}

Initially : x[i] == y[i] == i

Warp-size = 32

HW Result under optimization: 0, 2, 4, 6, 8, … (elimination of the 1111 assignment likely..)

HW Result under no optimization: 0, 1111, 1111, …, 64, 1111, 1111, …

61 of 79

Race hidden by the warp-assumption or a low degree of compiler optimization

__global__ void kernel(int *x, int *y)
{
  int index = threadIdx.x;
  y[index] = x[index] + y[index];
  __syncthreads(); // Remove?
  if (index != 63 && index != 31)
    y[index + 1] = 1111;
}

Initially : x[i] == y[i] == i

Warp-size = 32

REASON: We can’t assume the warp model, so there IS a data race without __syncthreads()!!

(Figure: warp tid = 0, 1, …, 31 and warp tid = 32, 33, …, 63 each execute “y[index] = x[index] + y[index]” followed by “y[index + 1] = 1111”.)

Without __syncthreads(), the behavior depends on the compiler:

0, 1112, 1113, …, 1142, 64, 1144, …

or

0, 1111, 1111, …, 1111, 64, 1111, …

(With __syncthreads(), always the latter.)

62 of 79

Three ways to avoid compiler transformations…

(Same kernel as on the previous slides.)

Initially : x[i] == y[i] == i

Warp-size = 32

  1. Declare x, y to be “volatile” -- VERY kludgy (but widely used)

  2. Insert a __threadfence_block() -- we have observed that the results become correct after this

  3. Insert a __syncthreads() -- expensive

(options 2 and 3 are sketched below)
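A sketch of options 2 and 3 applied to this kernel, using the actual CUDA spellings __threadfence_block() and __syncthreads(); per the slide, results were observed to become correct with the fence inserted, but only the barrier genuinely restores a happens-before edge between the two statements.

__global__ void kernel_fenced(int *x, int *y) {       // option 2
  int index = threadIdx.x;
  y[index] = x[index] + y[index];
  __threadfence_block();            // keeps the compiler/HW from moving the store below
                                    // ahead of the store above, as seen within the block
  if (index != 63 && index != 31)
    y[index + 1] = 1111;
}

__global__ void kernel_synced(int *x, int *y) {       // option 3
  int index = threadIdx.x;
  y[index] = x[index] + y[index];
  __syncthreads();                  // a real barrier: race-free, but costlier
  if (index != 63 && index != 31)
    y[index + 1] = 1111;
}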

63 of 79

Data Race (this error-prone code existed in a real HPC library; we’ve communicated with the developers)

if (tx < 32)
{
  sdata[tx] += sdata[tx + 32];
}
if (tx == 0)
{
  for (int i = 1; i < 32; i++)
  {
    sdata[tx] += sdata[tx + i];
  }
}

64 of 79

Data Race (this error-prone code existed in a real HPC library; we’ve communicated with the developers)

(Same code as above.)

One possible interleaving:

T0 (tx == 0): sdata[0] += sdata[32]; then, in the second if, sdata[0] += sdata[1]; …
T1 (tx == 1): sdata[1] += sdata[33];

T0’s read of sdata[1] races with T1’s write of sdata[1].

65 of 79

Does this rescue the code? (Supposition shown below…) Answer: No!

if (tx < 32)
{
  sdata[tx] += sdata[tx + 32];
}
__threadfence_block();
if (tx == 0)
{
  for (int i = 1; i < 32; i++)
  {
    sdata[tx] += sdata[tx + i];
  }
}

T0 (tx == 0): sdata[0] += sdata[32]; __threadfence_block(); then sdata[0] += sdata[1]; …
T1 (tx == 1): sdata[1] += sdata[33]; __threadfence_block();

User thinks: these are BOTH done SIMD… even the FENCE is done SIMD… so it is safe… eh?

66 of 79

Fences only have a local ordering effect. They do not establish the needed inter-thread happens-before.

(Same code and interleaving as on the previous slide.)

Still concurrent! CUDA’s weak memory semantics means that the updates done by T1 are not instantly visible to T0. T0 can get to its read of sdata[1] before T1’s fence finishes execution. This is true whether or not the fences are “executed” SIMD.

67 of 79

Correct fix for this example

if (tx < 32)
{
  sdata[tx] += sdata[tx + 32];
}
__syncthreads();
if (tx == 0)
{
  for (int i = 1; i < 32; i++)
  {
    sdata[tx] += sdata[tx + i];
  }
}

68 of 79

GKLEE can find races in code that tries to exploit warps to avoid data races

GKLEE includes a flag called -pure-canonical-schedule that can examine kernels under this more general scheduling method.

69 of 79

CUDA Examples (demo)

  • dot_buggy_CUDAByExampleP88.C
  • RaceTest.C
  • DeadlockTest1.C
  • matrixMul_Kernel.C
  • Radixsort.C

70 of 79

Recent Developments

  • GKLEE front-end supported by NVCC and Clang
    • Without such a flow, benchmarking is difficult
    • Unfortunately convoluted flow due to lack of access to Nvidia compiler innards

We are happy to release gklee-clang – a lot of work has gone into it! Also help from Nvidia is acknowledged!

  • Web portal (needs more work)

(http://www.cs.utah.edu/fv/GKLEE)

  • GIG : Graphical Inquirer of GPU programs (Eclipse Integration) – needs more work

71 of 79

Glimpse of the CUDA memory model (at the PTX level… under construction..)

  • Two types of fences
    • TFB (thread_fence_block) and TF (thread_fence)
  • Disagreements (so far) on “how coherent”
  • Write atomicity is not guaranteed
  • Student to spend time at Nvidia to help formalize

T00: St(h,1) ; TFB ; St(b,2)

T01: Ld(b,2) ; TF ; St(g,3)

T10: Ld(g,3) ; TF ; St(a,4)

T11: Ld(a,4) ; TFB ; Ld(h,1) ← this is not guaranteed

72 of 79

Other Issues: Performance vs. Precision

  • Performance improvement may destroy precision

  • Prelim results point to the need to study this more

73 of 79

Naïve vs Work-Efficient Scan

74 of 79

EXTRA SLIDES

75 of 79

Thrust Example (taken from Thrust website)

Mostly uses the default synchronous stream (slower than async streams)

#include <thrust/host_vector.h>

#include <thrust/device_vector.h>

#include <thrust/generate.h>

#include <thrust/reduce.h>

#include <thrust/functional.h>

#include <algorithm>

#include <cstdlib>

int main(void)
{
  // generate random data serially
  thrust::host_vector<int> h_vec(100);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer to device and compute sum
  thrust::device_vector<int> d_vec = h_vec;
  int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
  return 0;
}

76 of 79

CopperHead example (Taken from CopperHead website)

Uses Thrust as backend, but performs better!

from copperhead import *

import numpy as np

@cu
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(100, dtype=np.float64)
y = np.arange(100, dtype=np.float64)

with places.here:    # run on the Python interpreter
    gpu = axpy(2.0, x, y)

with places.gpu0:    # run on the GPU
    gpu = axpy(2.0, x, y)

with places.openmp:  # run using OpenMP on a multicore CPU
    cpu = axpy(2.0, x, y)

77 of 79

SDK Kernel Example: race checking

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos = ((threadIdx.x & (~63)) >> 0)
                      | ((threadIdx.x & 15) << 2)
                      | ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    ...
    addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
  }
  __syncthreads(); ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[ threadPos + IMUL(data, THREAD_N) ]++; }

Per-thread actions (both t1 and t2 execute):

threadPos = …
data = (data4 >> 26) & 0x3FU
s_Hist[threadPos + data*THREAD_N]++;

78 of 79

SDK Kernel Example: race checking

(Per-thread actions of t1 and t2, as on the previous slide: threadPos = …; data = (data4 >> 26) & 0x3FU; s_Hist[threadPos + data*THREAD_N]++;)

RW set:

t1 writes: s_Hist[ (((t1 & ~63) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64 ], …

t2 writes: s_Hist[ (((t2 & ~63) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64 ], …

Race query: ∃ t1, t2, d_Data :
  (t1 ≠ t2) ∧
  (((t1 & ~63) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64
    == (((t2 & ~63) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64 ?

79 of 79

SDK Kernel Example: race checking

(Same per-thread actions and RW set as on the previous slide.)

GKLEE indicates that these two addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a write-write race.