Parallel Programming
Course Code: BDS701 | Semester: VII | Credits: 04
Teaching Hours: 3:0:2:0 | CIE Marks: 50 | SEE Marks: 50
Course Objectives
Explore Parallel Programming
Understand the fundamental need for parallel programming in modern computing
MIMD Systems
Explain how to parallelize applications on Multiple Instruction, Multiple Data systems
MPI Library
Demonstrate how to apply Message Passing Interface library to parallelize suitable programs
OpenMP
Demonstrate how to apply OpenMP pragma and directives to parallelize suitable programs
CUDA Programming
Demonstrate how to design CUDA programs for GPU-based parallel computing
Module 5 - GPU programming with CUDA
GPUs and GPGPU
CPU vs GPU: Flynn’s Taxonomy
SIMD Architecture Concept
A SIMD processor has a single control unit and multiple datapaths: the control unit broadcasts one instruction, and each datapath either applies it to its own data or remains idle.
Example operation:
if (x[i] >= 0)
   x[i] += 1;
else
   x[i] -= 2;
The datapaths with x[i] >= 0 execute the addition while the others remain idle; then the roles reverse for the subtraction.
➜ This shows branch divergence in SIMD: not all datapaths can work simultaneously.
GPU as a Collection of SIMD Processors
GPU Hardware Details (Example)
GPU Memory Hierarchy
Heterogeneous computing
Alternative processors to CPUs:
Threads, blocks, and grids
1. Understanding the Kernel Launch Syntax:
Hello<<<num_blocks, threads_per_block>>>();
Example: Hello<<<1, thread_count>>>();
→ Launches a single block with thread_count threads — all running on one Streaming Multiprocessor (SM).
If we use: Hello<<<2, thread_count / 2>>>();
→ Launches 2 blocks, each with thread_count / 2 threads — the work is divided across 2 SMs (if available).
So the total number of threads launched is:
total threads = num_blocks × threads_per_block = 2 × (thread_count / 2) = thread_count
2. CUDA Execution Model Overview
Hierarchy:
Grid → contains multiple Blocks
Block → contains multiple Threads
Thread → runs on a Streaming Processor (SP)
3. Built-in CUDA Variables
CUDA provides predefined structures to identify and organize threads and blocks:
threadIdx (.x, .y, .z) - Index (rank) of the thread within its block
blockIdx (.x, .y, .z) - Index (rank) of the block within the grid
blockDim (.x, .y, .z) - Dimensions of each block (number of threads per block in each dimension)
gridDim (.x, .y, .z) - Dimensions of the grid (number of blocks in each dimension)
All fields (x, y, z) are unsigned integers.
4. Example Program (Program 6.2 Summary)
#include <stdio.h>
#include <stdlib.h>

__global__ void Hello(void) {
   printf("Hello from thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main(int argc, char* argv[]) {
   int blk_ct = strtol(argv[1], NULL, 10);      /* Number of blocks  */
   int th_per_blk = strtol(argv[2], NULL, 10);  /* Threads per block */

   Hello<<<blk_ct, th_per_blk>>>();   /* Launch kernel          */
   cudaDeviceSynchronize();           /* Wait for GPU to finish */
   return 0;
}
This program launches a grid of blk_ct blocks, each containing th_per_blk threads, and prints messages from each thread.
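As a usage sketch (the source file name is an assumption, not from the text), the program is compiled with nvcc and the counts are passed on the command line:
nvcc -o cuda_hello cuda_hello.cu
./cuda_hello 2 4
With 2 blocks of 4 threads this prints 8 greeting lines, in no guaranteed order.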
5. Setting Multi-Dimensional Grids and Blocks
For 2D or 3D organization (useful in graphics/matrix computations):
dim3 grid_dims(2, 3, 1); // 6 blocks (2×3×1)
dim3 block_dims(4, 4, 4); // 64 threads per block (4×4×4)
Kernel<<<grid_dims, block_dims>>>();
Total threads = 6 × 64 = 384.
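As a minimal sketch (the kernel name and the row/column mapping are illustrative, not from the text), a kernel launched this way can combine the .x and .y fields of the built-in variables to get 2D coordinates:
__global__ void Kernel2D(void) {
   int row = blockIdx.y * blockDim.y + threadIdx.y;
   int col = blockIdx.x * blockDim.x + threadIdx.x;
   printf("row %d, col %d\n", row, col);
}
Kernel2D<<<grid_dims, block_dims>>>();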
6. Independence of Thread Blocks
7. Concept Summary
Concept | Description |
Thread | Smallest execution unit; runs on a single GPU core (SP). |
Block | Group of threads that share memory and synchronize. |
Grid | Collection of blocks; launched by a kernel call. |
SM (Streaming Multiprocessor) | Executes one or more blocks at a time. |
Independence | Each block must be able to run independently of others. |
Nvidia compute capabilities and device architectures
1. What is Compute Capability?
2. Major and Minor Versions
Major revisions so far:
1, 2, 3, 5, 6, 7, 8 (There’s no major version 4)
3. NVIDIA Architectures and Their Compute Capabilities
4. Thread and Block Limits by Compute Capability
5. Versions of CUDA vs Compute Capability
Example:
Feature | Compute Capability | Limit / Description |
Max threads per block | ≥ 2.x | 1024 threads |
Max threads per SM (Streaming Multiprocessor) | 2.x | 1536 threads |
Max threads per SM (for >2.x) | > 2.x | 2048 threads |
Max block dimensions (x, y, z) | > 1.x | (1024, 1024, 64) |
Supported compute capability in modern CUDA | ≥ 3.0 | Devices below 3.0 are no longer supported. |
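At run time a program can check the device it is using with cudaGetDeviceProperties (a minimal sketch; the variable prop and the chosen fields are illustrative):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
printf("Compute capability %d.%d, %d SMs, max %d threads per block\n",
      prop.major, prop.minor, prop.multiProcessorCount, prop.maxThreadsPerBlock);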
Vector addition
This example demonstrates data-parallel programming on a GPU. That means the same operation z[i] = x[i] + y[i] is performed on many data elements simultaneously, which is ideal for GPUs.
We have three vectors (arrays), x, y, and z, each of size n.
This is an embarrassingly parallel problem: each computation of z[i] is independent of the others.
Using float Instead of double
Reason:
CUDA Implementation Overview
We’ll need two key parts:
(a) Kernel Function (Device Code)
A kernel is a special GPU function declared with __global__.
__global__ void Vec_add(
      const float x[]  /* in  */,
      const float y[]  /* in  */,
      float       z[]  /* out */,
      const int   n    /* in  */) {
   int my_elt = blockDim.x * blockIdx.x + threadIdx.x;

   /* total threads = blk_ct * th_per_blk may be > n */
   if (my_elt < n)
      z[my_elt] = x[my_elt] + y[my_elt];
}  /* Vec_add */

int main(int argc, char* argv[]) {
   int n, th_per_blk, blk_ct;
   char i_g;                 /* Are x and y user input or random? */
   float *x, *y, *z, *cz;
   double diff_norm;

   /* Get the command line arguments, and set up vectors */
   Get_args(argc, argv, &n, &blk_ct, &th_per_blk, &i_g);
   Allocate_vectors(&x, &y, &z, &cz, n);
   Init_vectors(x, y, n, i_g);

   /* Invoke kernel and wait for it to complete */
   Vec_add<<<blk_ct, th_per_blk>>>(x, y, z, n);
   cudaDeviceSynchronize();

   /* Check for correctness */
   Serial_vec_add(x, y, cz, n);
   diff_norm = Two_norm_diff(z, cz, n);
   printf("Two-norm of difference between host and ");
   printf("device = %e\n", diff_norm);

   /* Free storage and quit */
   Free_vectors(x, y, z, cz);
   return 0;
}  /* main */
The kernel
Each thread computes its global element index:
int my_elt = blockDim.x * blockIdx.x + threadIdx.x;
Example: with gridDim.x = 4 and blockDim.x = 5, total threads = gridDim.x × blockDim.x = 4 × 5 = 20, and the indices are:
blockIdx.x | threadIdx.x = 0 | 1 | 2 | 3 | 4 |
0 | 0 | 1 | 2 | 3 | 4 |
1 | 5 | 6 | 7 | 8 | 9 |
2 | 10 | 11 | 12 | 13 | 14 |
3 | 15 | 16 | 17 | 18 | 19 |
Handling Cases Where n ≠ Total Threads
Sometimes, n (the number of elements in the vectors) is not equal to the total number of launched threads.
Example: n = 997. We might still launch 4 blocks × 256 threads = 1024 threads (because GPU threads come in block-sized chunks).
➡️ The extra 27 threads (indexes 997–1023) must do nothing to avoid out-of-bounds memory access.
Hence, in the kernel, the test if (my_elt < n) ensures that only threads with a valid index perform the addition.
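A common idiom (our sketch, not taken from the text) is to round the block count up so that blk_ct × th_per_blk ≥ n; the if (my_elt < n) test then silences the leftover threads:
int blk_ct = (n + th_per_blk - 1) / th_per_blk;   /* ceiling of n / th_per_blk */
Vec_add<<<blk_ct, th_per_blk>>>(x, y, z, n);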
Returning results from CUDA kernels
1. CUDA Kernels Have void Return Type
2. Problem with Passing Host Addresses to Device
3. Correct Ways to “Return” Results
(a) Using Unified Memory (cudaMallocManaged)
int* sum_p;
cudaMallocManaged(&sum_p, sizeof(int));   /* accessible from host and device */
Add<<<1,1>>>(2, 3, sum_p);
cudaDeviceSynchronize();
printf("%d\n", *sum_p);
cudaFree(sum_p);
No explicit memory copy is needed.
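For this call to work, the kernel must store its result through the pointer argument. A minimal sketch of such a kernel (its exact form is assumed from the call above):
__global__ void Add(int x, int y, int* sum_p) {
   *sum_p = x + y;   /* write the result into managed memory */
}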
(b) Using Separate Host and Device Memory + Explicit Copy
cudaMemcpy(hsum_p, dsum_p, sizeof(int), cudaMemcpyDeviceToHost);
Example:
Add<<<1,1>>>(2, 3, dsum_p);
cudaMemcpy(hsum_p, dsum_p, sizeof(int), cudaMemcpyDeviceToHost);
Argument | Meaning |
hsum_p | Host pointer (destination) — CPU memory |
dsum_p | Device pointer (source) — GPU memory |
sizeof(int) | Number of bytes to copy |
cudaMemcpyDeviceToHost | Direction of copy (Device → Host) |
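Putting the pieces together, a minimal sketch of the explicit-copy approach (the variables hsum, hsum_p, dsum_p and the three-argument Add kernel are assumptions consistent with the table above):
int  hsum;                                  /* result on the host              */
int* hsum_p = &hsum;                        /* host pointer (destination)      */
int* dsum_p;                                /* device pointer (source)         */
cudaMalloc((void**) &dsum_p, sizeof(int));  /* storage visible only to the GPU */
Add<<<1,1>>>(2, 3, dsum_p);                 /* kernel writes *dsum_p           */
cudaMemcpy(hsum_p, dsum_p, sizeof(int), cudaMemcpyDeviceToHost);
printf("%d\n", hsum);
cudaFree(dsum_p);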
4. Using Global Managed Variables (__managed__)
__managed__ int sum;
The kernel can write to it directly:
__global__ void Add(int x, int y) { sum = x + y; }
The host launches the kernel, waits for it, and then reads sum:
Add<<<1,1>>>(2, 3);
cudaDeviceSynchronize();
printf("%d\n", sum);
Limitations of __managed__ Variables
CUDA trapezoidal rule I
Concept of Trapezoidal Rule
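For reference, with h = (b − a)/n and x_i = a + i·h, the composite trapezoidal rule approximates the integral as
\[
\int_a^b f(x)\,dx \;\approx\; h\left[\frac{f(a)}{2} + \frac{f(b)}{2} + \sum_{i=1}^{n-1} f(x_i)\right]
\]
which is exactly what the serial code below computes.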
Serial Implementation
Main bottleneck: the for loop, especially for large n.
float Serial_trap(
      const float a  /* in */,
      const float b  /* in */,
      const int   n  /* in */) {
   float x, h = (b - a) / n;
   float trap = 0.5 * (f(a) + f(b));

   for (int i = 1; i <= n - 1; i++) {
      x = a + i * h;
      trap += f(x);
   }
   trap = trap * h;

   return trap;
}  /* Serial_trap */
Moving to CUDA — Parallelizing
Foster's methodology suggests the two major tasks to parallelize: evaluating f at each point x_i and adding the result into the running total trap.
Each thread can perform one iteration of the serial loop.
Directly transferring the loop body to CUDA causes five major problems:
Problem 1: Initialization
Variables such as h and trap must be initialized properly; since they are formal arguments of the kernel, the initialization has to happen on the host before the kernel is launched.
/* h and trap are formal arguments to the kernel */
int my_i = blockDim.x * blockIdx.x + threadIdx.x;
float my_x = a + my_i * h;
float my_trap = f(my_x);
trap += my_trap;
Issues When Implementing in CUDA
Problem 2: Thread index range mismatch
Problem 3: Race condition on trap
Problem 4: Kernel cannot return a value
Problem 5: Multiplying by h must happen after summation
5. What Program 6.12 Will Solve
Program 6.12: CUDA kernel and wrapper implementing trapezoidal rule.
__global__ void Dev_trap(
      const float  a       /* in     */,
      const float  b       /* in     */,
      const float  h       /* in     */,
      const int    n       /* in     */,
      float*       trap_p  /* in/out */) {
   int my_i = blockDim.x * blockIdx.x + threadIdx.x;

   /* f(x_0) and f(x_n) were computed on the host.  So */
   /* compute f(x_1), f(x_2), ..., f(x_(n-1))          */
   if (0 < my_i && my_i < n) {
      float my_x = a + my_i * h;
      float my_trap = f(my_x);
      atomicAdd(trap_p, my_trap);
   }
}  /* Dev_trap */

/* Host code */
void Trap_wrapper(
      const float a           /* in  */,
      const float b           /* in  */,
      const int   n           /* in  */,
      float*      trap_p      /* out */,
      const int   blk_ct      /* in  */,
      const int   th_per_blk  /* in  */) {

   /* trap_p storage allocated in main with cudaMallocManaged */
   *trap_p = 0.5 * (f(a) + f(b));
   float h = (b - a) / n;

   Dev_trap<<<blk_ct, th_per_blk>>>(a, b, h, n, trap_p);
   cudaDeviceSynchronize();

   *trap_p = h * (*trap_p);
}  /* Trap_wrapper */
CUDA trapezoidal rule II: improving performance
Why atomicAdd Makes CUDA Slow
In the basic CUDA version of the trapezoidal rule, each thread computes its my_trap value and then executes:
atomicAdd(trap_p, my_trap);
atomicAdd forces serialization.
Even though thousands of threads run in parallel, only one thread at a time is allowed to update *trap_p.
Example (from the textbook):
This destroys parallel performance.
We're assuming that f(x) = 2x + 1, a = 0, b = 8, and n = 8, so h = (8 − 0)/8 = 1, and the value referenced by trap_p at the start of the global sum is 0.5 × (f(a) + f(b)) = 0.5 × (1 + 17) = 9.
Tree-Structured Reduction: The Solution
Instead of letting all threads update a single variable, we let the threads combine their results among themselves like a binary tree.
The idea
In each step: each active thread adds in a partial sum from another thread, so the number of partial sums is halved.
Number of steps needed
Instead of t sequential additions, we now need:
➜ log₂(t) + 1 steps (for example, 11 steps instead of 1024 serialized additions when t = 1024).
Two standard CUDA implementations
(a) Shared-memory reduction, where each step halves the stride:
if (threadIdx.x < stride)
   shared[threadIdx.x] += shared[threadIdx.x + stride];
__syncthreads();
(b) Warp shuffle, where each step halves the offset:
value += __shfl_down_sync(0xffffffff, value, offset);
Warps and warp shuffles
Warp shuffle function
The mask 0xffffffff is a hex value with 32 binary 1s, meaning all 32 threads in the warp participate.
Important constraints to avoid undefined results
Why full warps matter
Why warp shuffle improves performance
Implementing tree-structured global sum with a warp shuffle
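A minimal sketch of the warp-level tree sum (the name Warp_sum is ours; it assumes all 32 threads of the warp are active):
__device__ float Warp_sum(float my_val) {
   /* offsets 16, 8, 4, 2, 1: each step halves the number of partial sums */
   for (int offset = warpSize/2; offset > 0; offset /= 2)
      my_val += __shfl_down_sync(0xffffffff, my_val, offset);
   return my_val;   /* lane 0 now holds the sum of the warp's 32 values */
}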
Shared memory and an alternative to the warp shuffle
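A minimal sketch of the shared-memory alternative at warp level (the name Shared_warp_sum is ours, shared_vals is assumed to hold one value per lane of the warp, and the __syncwarp() calls are our addition so the code stays correct on GPUs with independent thread scheduling):
__device__ float Shared_warp_sum(float shared_vals[]) {
   int my_lane = threadIdx.x % warpSize;
   for (int stride = warpSize/2; stride > 0; stride /= 2) {
      __syncwarp();   /* make the previous step's writes visible to the warp */
      if (my_lane < stride)
         shared_vals[my_lane] += shared_vals[my_lane + stride];
   }
   __syncwarp();
   return shared_vals[0];   /* total of the warp's values */
}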
CUDA trapezoidal rule III: blocks with more than one warp
1. Limiting thread blocks to only 32 threads is inefficient
2. Block-level sum using multiple warps
To sum values across up to 1024 threads:
Why warp 0? Because warp ID = threadIdx.x / warpSize, so warp 0 consists of the threads with indices 0–31.
3. Race condition without synchronization
4. Fix: Insert __syncthreads()
Modified pseudocode:
__syncthreads(); // ensures all warp sums are ready
Warp 0 adds all warp sums;
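A minimal sketch of this block-level sum (the name Block_sum is ours; it assumes blockDim.x is a multiple of warpSize, at most 1024 threads per block, and a Warp_sum function like the one sketched earlier):
__device__ float Block_sum(float my_val) {
   __shared__ float warp_sums[32];        /* one slot per warp (<= 1024/32)  */
   int my_warp = threadIdx.x / warpSize;
   int my_lane = threadIdx.x % warpSize;

   float my_warp_sum = Warp_sum(my_val);  /* shuffle reduction in each warp  */
   if (my_lane == 0) warp_sums[my_warp] = my_warp_sum;
   __syncthreads();                       /* all warp sums are now ready     */

   float result = 0.0f;
   if (my_warp == 0) {                    /* warp 0 adds all the warp sums   */
      int n_warps = blockDim.x / warpSize;
      float val = (my_lane < n_warps) ? warp_sums[my_lane] : 0.0f;
      result = Warp_sum(val);
   }
   return result;                         /* valid in thread 0 of the block  */
}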
5. Important caveat #1: All threads in the block MUST reach __syncthreads()
Incorrect example:
if (threadIdx.x < blockDim.x/2)
__syncthreads();
6. Important caveat #2: __syncthreads() syncs only within a block