Analysis of Shared Memory GPU Parallel Programs (talk from 13 years ago)
Material still relevant!
Ganesh Gopalakrishnan, Professor,
School of Computing, University of Utah,
Salt Lake City, UT 84112, USA
Acknowledgements : Students and Collaborators
Peng Li, Guodong Li, Geof Sawaya, Tyler Sorensen, Brandon Gibson, Mohammed Saeed Al-Mahfoudh, Ashwin Kumar, Zvonimir Rakamaric’, Martin Berzins, Sriram Aananthakrishnan, Alan Humphrey, Diego Oliviera
External Collaborations: Lawrence Livermore National Laboratory, Argonne National Laboratory, Nvidia, Stephen Siegel (Delaware)
Part of this tutorial was updated in 2025 by adding the Colab GPU programming exercise linked below, courtesy of Mark Harris of NVIDIA.
https://colab.research.google.com/drive/1tW8uMgRNRwjpFDghysxpvvsThQN5GlEg?usp=sharing
Emerging Heterogeneous Parallel Systems, and Role of Formal Methods in them
(Figure: examples of emerging heterogeneous hardware - Sandy Bridge (courtesy anandtech.com), GeForce GTX 480 (Nvidia), AMD Fusion APU - tied together by concurrent data structures and Infiniband-style interconnects.)
(1) Formal methods at the user application level: problem-solving environments, e.g. Uintah, Charm++, ADLB, and the user applications built on them
(2) Formal dynamic methods for MPI: high-performance MPI libraries
(3) Formal analysis for CUDA programs
Why conventional testing is inadequate
Exponential number of paths: the bug may be deeply hidden.
Exponential number of schedules: the bug may be triggered only by one particular interleaving.
(Figure: a branching tree of executions E1, E2, …, En.)
One must be lucky to force such "bad paths," and lucky to schedule the threads in just the right manner: some bad interaction, such as a data race, may depend on the schedule…
Good Halloween stuff from llama.cpp (seeing the complexity explode in your face when you "open the lid" is sobering) - part 1
MY THOUGHTS: I have a huge amount of respect for those who write such code.
WISH-LIST - TOP ENTRY: I would like to see the day when such code emerges from a series of pokes administered to an initial SMT-based model of the programming intent, plus some generous profiling to fill in the missing pieces, followed by re-verification after the performance enhancements, and so on.
Good Halloween stuff from llama.cpp (seeing the complexity explode in your face when you "open the lid" is sobering) - part 2
Ways to make formal testing scale
ROADMAP for the TUTORIAL
GPU COMPUTING AND FORMAL METHODS
GPU-based Computing
(Figures courtesy of AMD, Nvidia, and www.engadget.com.)
Application areas, roadmap
Programming Approaches
Key ideas in GPU-based computing
CPU program
void inc_cpu(float* a, float b, int N) {
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}
void main() {
    .....
    inc_cpu(a, b, N);
}
CUDA program
__global__ void inc_gpu(float* A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        A[tid] = A[tid] + b;
}
void main() {
    .....
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a, b, N);
}
Example: Increment Array Elements
Contrast between CPUs and GPUs
On the GPU, the fine-grained threads are scheduled to run like this: tid = 0, 1, 2, 3, …
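A hedged, self-contained version of the host side that the "....." above elides (the array size, the block size, and the initialization are our choices, not from the slides):

#include <cuda_runtime.h>
#include <math.h>
#include <stdlib.h>

__global__ void inc_gpu(float* A, float b, int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        A[tid] = A[tid] + b;
}

int main() {
    const int N = 1 << 20;                        // 1M elements (our choice)
    const int blocksize = 256;                    // threads per block (our choice)
    float b = 1.0f;
    float *a = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) a[i] = (float)i;

    float *dA;
    cudaMalloc((void **)&dA, N * sizeof(float));  // device global memory
    cudaMemcpy(dA, a, N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize);
    dim3 dimGrid((int)ceilf(N / (float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(dA, b, N);     // one thread per element

    cudaMemcpy(a, dA, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    free(a);
    return 0;
}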
CUDA Abstract Device Model
Device global memory (shared between all blocks) - slowest
Per-block shared memory - faster
Per-thread private memory (T0, T1, T2, T3, … each have their own registers/local storage) - fastest
(A minimal kernel sketch touching all three levels follows.)
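A minimal kernel sketch (ours, not from the slides) that touches all three levels above; it assumes it is launched with at most 256 threads per block:

__global__ void memorySpaces(float *g_data) {     // g_data: device global memory (slowest)
    __shared__ float s_buf[256];                  // per-block shared memory (faster)
    float r = g_data[threadIdx.x];                // r: a register / private memory (fastest)
    s_buf[threadIdx.x] = 2.0f * r;
    __syncthreads();                              // make the shared-memory writes visible to the block
    g_data[threadIdx.x] = s_buf[(threadIdx.x + 1) % blockDim.x];
}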
Related Work
Our FV method for GPUs
GKLEE: Symbolic Virtual GPU
(Figure: GKLEE runs the host code on a virtual CPU and the kernel code on a virtual GPU. Each kernel launch by the host creates a grid on the device, e.g. Grid 1 with Blocks (0,0) through (2,1), and each block contains a 2-D array of threads, e.g. Block (1,1) with Threads (0,0) through (4,2).)
GKLEE (see PPoPP 2012)
__input__ int *values = (int *)malloc(sizeof(int) * NUM);
klee_make_symbolic(values, sizeof(int)*NUM, "values");
int *dvalues;
cudaMalloc((void **)&dvalues, sizeof(int) * NUM);
cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);
BitonicKernel <<< … >>> (dvalues);
(Toolflow: C++ CUDA programs with symbolic variable declarations are compiled by LLVM-GCC into LLVM byte-code instructions, which are run by GKLEE's symbolic analyzer and scheduler under a set of error monitors.)
Sequential programming examples
Basics of Concolic Execution
if (item1 > item2) {
    printf("Item1 > Item2\n");
    if (item2 > item3) {
        printf("Item2 > Item3\n");
        if (item3 > item1)
            printf("Item3 > Item1\n");
        else
            printf("Item3 <= Item1\n");
    } else
        printf("Item2 <= Item3\n");
} else
    printf("Item1 <= Item2\n");
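A minimal KLEE-style harness (our sketch, not from the slides) that drives the comparison chain above down every feasible path by making all three inputs symbolic; note that the solver will discover that the innermost "Item3 > Item1" branch is infeasible given the two enclosing constraints:

#include "klee.h"
#include <stdio.h>

int main() {
    int item1, item2, item3;
    klee_make_symbolic(&item1, sizeof(item1), "item1");
    klee_make_symbolic(&item2, sizeof(item2), "item2");
    klee_make_symbolic(&item3, sizeof(item3), "item3");
    if (item1 > item2) {
        printf("Item1 > Item2\n");
        if (item2 > item3) {
            printf("Item2 > Item3\n");
            if (item3 > item1)
                printf("Item3 > Item1\n");    // infeasible: item1 > item2 > item3
            else
                printf("Item3 <= Item1\n");
        } else
            printf("Item2 <= Item3\n");
    } else
        printf("Item1 <= Item2\n");
    return 0;
}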
Our Formal Verification of CUDA builds on top of Sequential Program Verification by KLEE [OSDI ‘08]
#include "stdio.h"
#include "cutil.h"
#include "string.h"
#include "klee.h"
#define STRLEN 5
int main(){
char item;
char str[STRLEN];
int lo=0;
int hi;
int mid, found=0;
// printf("Please give char to be searched: \n");
// scanf("%c", &item);
klee_make_symbolic(&item, sizeof(item), "item");
// printf("Please give string within which to search: \n");
// scanf("%s", str);
klee_make_symbolic(str, sizeof(str), "str");
klee_assume(str[0] <= str[1]);
klee_assume(str[1] <= str[2]);
klee_assume(str[2] <= str[3]);
klee_assume(str[3] <= str[4]);
str[4] = '\0';
hi= 4; //strlen(str)-1; // strlen ignores null at the end of string
// hi points to last char in a 0 based array
while(!found && lo <= hi) {
mid = (lo+hi)/2;
if (item == str[mid])
{ printf("*");
found = 1; }
else
if (item < str[mid])
{ hi = mid-1;
printf("L"); }
else
{ lo = mid+1;
printf("H"); }
}
if (found) printf("found\n");//printf("\nitem %c found at posn %d\n", item, mid);
else printf("not found\n");//printf("\nitem %c not in str %s\n", item, str);
return found;
}
A simple binary search being prepared for formal symbolic analysis - this will force path coverage!
Concurrency errors
Example: Increment Array Elements
Increment an N-element array A by a scalar b:
__global__ void inc_gpu(int* A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 64) {
        A[idx] = A[idx] + b;
    }
}
(Figure: thread t0 computes A[0]+b, thread t1 computes A[1]+b, and so on, one thread per element.)
Data Races in GPU Programs
(Figure: conflicting accesses to a location a, e.g. Write(a)/Write(a) and Write(a)/Read(a) pairs.)
The usual definition of a race: "two accesses, one of which is a write, that occur without a happens-before edge between them."
In GPUs, the happens-before edges are provided by barriers such as __syncthreads() within a block and, across blocks, by CUDA atomics (discussed later).
Illustration of Race
Increment an N-element vector A by a scalar b:
__global__ void inc_gpu(int* A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 64) {
        A[idx] = A[(idx - 1 + 64) % 64] + b;
    }
}
RACE! Thread t63 writes A[63] while thread t0 reads A[63], with no happens-before edge between them.
Porting Race: A Race-type Identified by us
If we have a divergent warp where a variable is READ on one path and WRITTEN on the other, then, depending on the execution order, the READ may occur before or after the WRITE.
Example of Porting Race
#include <stdio.h>
#define NUM 32

__global__ void PortingRace() {
    __shared__ int v[NUM];
    if (threadIdx.x % 2) {
        v[threadIdx.x] = v[(NUM + threadIdx.x - 1) % NUM] + 1;  // odd thread writes its own (odd) slot, reads an even slot
    }
    else {
        v[threadIdx.x] = v[(NUM + threadIdx.x + 1) % NUM] - 1;  // even thread writes its own (even) slot, reads an odd slot
    }
}

int main() {
    PortingRace<<<1, NUM>>>();
    return 0;
}
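One way to remove the porting race (a sketch of ours, not from the slides): read the neighbor into a register first, then barrier, then write, so that every read happens-before every write.

__global__ void PortingRaceFixed() {
    __shared__ int v[NUM];
    int nbr = (threadIdx.x % 2)
                ? v[(NUM + threadIdx.x - 1) % NUM] + 1
                : v[(NUM + threadIdx.x + 1) % NUM] - 1;
    __syncthreads();                 // all reads complete before any write below
    v[threadIdx.x] = nbr;
}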
CUDA-based GPU Programming Basics
Conventional GPU Debuggers are Inadequate
Some of the GPU Program Bugs
Why are Data Races Highly Problematic?
Synchronization Primitives in CUDA
__global__ void deadlock() {   // v: a __shared__ array
    if (threadIdx.x > 0) {
        v[threadIdx.x]++;
    }
    else {
        v[threadIdx.x]--;
    }
    __syncthreads(); // Barrier
}
Threads within a block can synchronize using __syncthreads(). (Besides acting as an execution barrier, this call also has the effect of a memory fence, __threadfence_block(), for the threads of the block.)
For synchronization across blocks, we must use one of the CUDA atomics (Zvonimir will present our NFM'13 work); a small illustration follows.
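A hedged illustration of cross-block coordination through a CUDA atomic (the kernel and the counter below are ours, for illustration only): each block's thread 0 bumps a global counter.

__device__ unsigned int blocksDone = 0;

__global__ void countBlocks() {
    // ... per-block work ...
    __syncthreads();                   // intra-block ordering via the barrier
    if (threadIdx.x == 0)
        atomicAdd(&blocksDone, 1u);    // inter-block communication must go through atomics
}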
Deadlocks caused by misused __syncthreads()
Illustration of “Deadlock” (conceptual deadlock; behavior is really undefined)
__global__ void kernelName(args) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 3) {
        ….
        __syncthreads();   // only the threads that take this branch reach the barrier
    }
}
__global__ void deadlock() {
if (threadIdx.x > 0) {
v[threadIdx.x]++;
__syncthreads();
}
else {
v[threadIdx.x]--;
__syncthreads();
}
}
Suffers from textually non-aligned barriers (example Deadlock1.C); a repaired version is sketched below.
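The standard repair (sketched here by us) keeps the divergent work but hoists the barrier out of the branches, so that every thread of the block reaches the same, textually aligned __syncthreads(), exactly the shape shown earlier on the synchronization-primitives slide:

__global__ void no_deadlock() {    // v: the __shared__ array from the example above
    if (threadIdx.x > 0)
        v[threadIdx.x]++;
    else
        v[threadIdx.x]--;
    __syncthreads();   // outside the divergent branches: all threads reach it
}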
Warp Divergence
__global__ void kernelName(args) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (odd(idx)) {
        <Statements1>
    } else {
        <Statements2>
    }
}
This is a performance bug, but it can be detected through formal symbolic analysis; a branchless rewrite is sketched below.
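When both branches do similar work, one common way to avoid the divergence is to compute a per-thread value instead of branching. A small illustrative sketch (ours, not from the slides):

__global__ void addOrSub(int *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        int sign = (idx & 1) ? +1 : -1;   // replaces a divergent if/else on odd(idx)
        a[idx] += sign;                   // every thread executes the same instruction stream
    }
}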
Memory-bank Conflicts
(Figure: shared-memory banks.)
This is a performance bug, but it can be detected through formal symbolic analysis.
Bank conflicts happen when two threads generate addresses that fall into the same memory bank; see the sketch below.
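A sketch of the classic shared-memory pattern (illustrative, ours): with a 32x32 tile, a column access has stride 32, so all 32 threads of a warp hit the same bank; padding each row by one element spreads them across different banks.

__global__ void bankConflictDemo(float *out) {
    __shared__ float tile[32][32];      // column access tile[t][0]: stride 32, same bank for every t
    __shared__ float padded[32][33];    // +1 padding: stride 33, a different bank for every t
    int t = threadIdx.x;                // assume a single warp of 32 threads
    tile[t][0]   = (float)t;            // 32-way bank conflict
    padded[t][0] = (float)t;            // conflict-free
    __syncthreads();
    out[t] = tile[t][0] + padded[t][0];
}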
Non-coalesced Memory Fetches
(Figure: global memory and the memory bus.)
This is a performance bug, but it can be detected through formal symbolic analysis (and it is roughly 10x more severe than bank conflicts).
Non-coalesced fetches happen when threads generate global-memory addresses that do not fully utilize the bus; see the sketch below.
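An illustrative sketch (ours): consecutive threads touching consecutive addresses coalesce into a few wide transactions, while a large stride leaves most of each transaction unused.

__global__ void coalescingDemo(const float *in, float *out, int N, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = in[i];               // coalesced: thread i touches element i
    long j = (long)i * stride;
    if (j < N)
        out[j] = in[j] + 1.0f;        // strided: poor bus utilization when stride is large
}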
Two approaches to modeling
How bad is the interleaving problem?
It is like shuffling decks of cards: more than 6 * 10^14 interleavings for 5 threads with 5 instructions each (the count is worked out below).
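For the record, the quoted count is the multinomial coefficient for interleaving 5 straight-line threads of 5 instructions each:

\[ \frac{(5 \times 5)!}{(5!)^5} \;=\; \frac{25!}{120^5} \;=\; 623{,}360{,}743{,}125{,}120 \;\approx\; 6.2 \times 10^{14} \]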
Avoiding Interleaving Explosion
Solution to the interleaving problem: find representative interleavings.
For example: if the green dots are local thread actions, then all schedules that arrive at the "cut line" are equivalent!
GKLEE avoids examining exponentially many schedules!
Instead of considering all schedules and all potential races, consider JUST THIS SINGLE CANONICAL SCHEDULE!
Folk theorem (proved in our paper): "we will find A RACE if there is ANY race."
Why does a sequential run suffice?
If the red pair is the only race, then it occurs after the cut-line.
So all the ways to reach the cut-line are EQUIVALENT, and this sequential run is equivalent too.
You may complain: "But the sequential run inches past the cut-line!"
But hey, you told me that this is the ONLY race!
If some other race were encountered before it, that race would be caught: there is always a "first race," and that one will be caught!
Concolic execution helps find witnesses to data races
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
const int threadPos = ((threadIdx.x & (~63)) >> 0)
| ((threadIdx.x & 15) << 2)
| ((threadIdx.x & 48) >> 4);
...
__syncthreads();
for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
pos += IMUL(blockDim.x, gridDim.x)) {
unsigned data4 = d_Data[pos];
...
addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); }
__syncthreads(); ...
}
inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[ threadPos + IMUL(data, THREAD_N) ]++; }
“GKLEE: Is there a Race ?”
GKLEE: Threads 5 and 13 have a write-write (WW) race, e.g. when d_Data[5] = 0x04040404 and d_Data[13] = 0.
Two data-points for GKLEE
Scaling of GKLEEp
Lower-level issues: (1) the dangers of assuming warps; (2) the weak memory model
Race hidden by the warp assumption or by a low degree of compiler optimization
__global__ void kernel(int* x, int* y)
{
int index = threadIdx.x;
y[index] = x[index] + y[index];
if (index != 63 && index != 31)
y[index+1] = 1111;
}
Initially: x[i] == y[i] == i; warp-size = 32.
The hardware schedules these instructions in "warps" (SIMD groups). However, the compiler performs transformations assuming that there are no data races, even under warp-size = 1.
Result: 0, 1111, 1111, …, 1111, 64, 1111, …
Possible transformation under various optimizations, using NVCC 4.1 and 4.2: the guarded store

    if (index != 63 && index != 31)
        y[index+1] = 1111;

may be moved relative to, or even eliminated from, the rest of the kernel:

__global__ void kernel(int* x, int* y)
{
    int index = threadIdx.x;
    y[index] = x[index] + y[index];
}
Initially: x[i] == y[i] == i; warp-size = 32.
HW result under optimization: 0, 2, 4, 6, 8, … (elimination of the 1111 assignment is likely).
HW result under no optimization: 0, 1111, 1111, …, 64, 1111, 1111, …
Race hidden by the warp assumption or by a low degree of compiler optimization
__global__ void kernel(int* x, int* y)
{
int index = threadIdx.x;
y[index] = x[index] + y[index];
__syncthreads(); // Remove?
if (index != 63 && index != 31)
y[index+1] = 1111;
}
Initially: x[i] == y[i] == i; warp-size = 32.
REASON: We can't assume the warp model, so there IS a data race without __syncthreads()!
(Figure: the two warps, tid = 0, 1, …, 31 and tid = 32, 33, …, 63, each execute y[index] = x[index] + y[index] followed by y[index + 1] = 1111; the accesses from the two warps can interleave.)
Without __syncthreads(), the behavior depends on the compiler:
0, 1112, 1113, …, 1142, 64, 1144, … or 0, 1111, 1111, …, 1111, 64, 1111, …
(With __syncthreads(), always the latter.)
Three ways to avoid compiler transformations…
__global__ void kernel(int* x, int* y)
{
int index = threadIdx.x;
y[index] = x[index] + y[index];
if (index != 63 && index != 31)
y[index+1] = 1111;
}
Initially: x[i] == y[i] == i; warp-size = 32.
1. Declare the accessed data volatile
   -- VERY kludgy (but widely used)
2. Insert a __threadfence_block()
   -- we have observed that the results become correct after this (sketched after this list)
3. Insert a __syncthreads()
   -- expensive
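Sketching option 2 on the running example (the CUDA spelling of the fence is __threadfence_block(); as noted above, that this suffices is an empirical observation, not a guarantee):

__global__ void kernel(int* x, int* y) {
    int index = threadIdx.x;
    y[index] = x[index] + y[index];
    __threadfence_block();             // keeps the dependent store below from being moved across this point
    if (index != 63 && index != 31)
        y[index + 1] = 1111;
}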
Data Race (this error-prone code existed in a real HPC library; we've communicated with the developers)
if(tx < 32)
{
sdata[tx] += sdata[tx + 32];
}
if(tx == 0)
{
for(int i=1;i<32;i++)
{
sdata[tx] += sdata[tx + i];
}
}
For example, threads T0 and T1 execute:
T0:  sdata[0] += sdata[32];
     for (int i = 1; i < 32; i++) { sdata[0] += sdata[i]; }
T1:  sdata[1] += sdata[33];
T0's read of sdata[1] inside the loop races with T1's write of sdata[1].
Does this rescue the code? (Supposition shown below…) Answer: No!
if(tx < 32)
{
sdata[tx] += sdata[tx + 32];
}
__threadfence_block();
if(tx == 0)
{
for(int i=1;i<32;i++)
{
sdata[tx] += sdata[tx + i];
}
}
T0:  sdata[0] += sdata[32];  __threadfence_block();
     for (int i = 1; i < 32; i++) { sdata[0] += sdata[i]; }
T1:  sdata[1] += sdata[33];  __threadfence_block();
The user thinks: these BOTH are done SIMD… even the FENCE is done SIMD… so it's safe, eh?
Fences only have a local ordering effect. They don't establish the inter-thread happens-before.
Still concurrent!
CUDA's weak memory semantics means that the updates done by T1 are not instantly visible to T0: T0 can get to its read of sdata[1] before T1's fence finishes execution. This is true whether or not the fences are "executed" SIMD.
Correct fix for this example
if(tx < 32)
{
sdata[tx] += sdata[tx + 32];
}
__syncthreads();
if(tx == 0)
{
for(int i=1;i<32;i++)
{
sdata[tx] += sdata[tx + i];
}
}
GKLEE can find races in code that tries to exploit warps to avoid data races
GKLEE includes a flag called -pure-canonical-schedule that can examine kernels under this more general scheduling method.
CUDA Examples (demo)
Recent Developments
We are happy to release gklee-clang – a lot of work has gone into it! Also help from Nvidia is acknowledged!
(http://www.cs.utah.edu/fv/GKLEE)
Glimpse of the CUDA memory model (at the PTX level… under construction; TFB = __threadfence_block, TF = __threadfence)
T00: St(h,1) ; TFB ; St(b,2)
T01: Ld(b,2) ; TF ; St(g,3)
T10: Ld(g,3) ; TF ; St(a,4)
T11: Ld(a,4) ; TFB ; Ld(h,1) 🡨 this is not guaranteed
Other Issues: Performance vs. Precision
Naïve vs Work-Efficient Scan
EXTRA SLIDES
Thrust Example (taken from Thrust website)
Mostly uses the default synchronous stream (slower than async streams)
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>
int main(void)
{
// generate random data serially
thrust::host_vector<int> h_vec(100);
std::generate(h_vec.begin(), h_vec.end(), rand);
// transfer to device and compute sum
thrust::device_vector<int> d_vec = h_vec;
int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
return 0;
}
CopperHead example (Taken from CopperHead website)
Uses Thrust as backend, but performs better!
from copperhead import *
import numpy as np

@cu
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

x = np.arange(100, dtype=np.float64)
y = np.arange(100, dtype=np.float64)

with places.here:    # run on the Python interpreter
    gpu = axpy(2.0, x, y)
with places.gpu0:    # run on the GPU
    gpu = axpy(2.0, x, y)
with places.openmp:  # run using OpenMP on a multicore CPU
    cpu = axpy(2.0, x, y)
SDK Kernel Example: race checking
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
const int threadPos = ((threadIdx.x & (~63)) >> 0)
| ((threadIdx.x & 15) << 2)
| ((threadIdx.x & 48) >> 4);
...
__syncthreads();
for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN; pos += IMUL(blockDim.x, gridDim.x)) {
unsigned data4 = d_Data[pos];
...
addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU); }
__syncthreads(); ...
}
inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[threadPos + IMUL(data, THREAD_N)]++; }
t1:  threadPos = …;  data = (data4 >> 26) & 0x3FU;  s_Hist[threadPos + data*THREAD_N]++;
t2:  threadPos = …;  data = (data4 >> 26) & 0x3FU;  s_Hist[threadPos + data*THREAD_N]++;
RW set:
t1 writes s_Hist[(((t1 & ~63) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64], …
t2 writes s_Hist[(((t2 & ~63) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64], …
∃ t1, t2, d_Data:
  (t1 ≠ t2) ∧
  (((t1 & ~63) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64
  == (((t2 & ~63) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64 ?
GKLEE indicates that these two addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a write-write race.