1 of 44

Parallel Programming

Course Code: BDS701 | Semester: VII | Credits: 04

Teaching Hours: 3:0:2:0 | CIE Marks: 50 | SEE Marks: 50

2 of 44

Course Objectives

Explore Parallel Programming

Understand the fundamental need for parallel programming in modern computing

MIMD Systems

Explain how to parallelize applications on Multiple Instruction, Multiple Data systems

MPI Library

Demonstrate how to apply Message Passing Interface library to parallelize suitable programs

OpenMP

Demonstrate how to apply OpenMP pragma and directives to parallelize suitable programs

CUDA Programming

Demonstrate how to design CUDA programs for GPU-based parallel computing

3 of 44

Module 5 - GPU programming with CUDA

  • GPUs and GPGPU,
  • GPU architectures,
  • Heterogeneous computing, Threads, blocks, and grids
  • Nvidia compute capabilities and device architectures,
  • Vector addition,
  • Returning results from CUDA kernels,
  • CUDA trapezoidal rule I,
  • CUDA trapezoidal rule II: improving performance,
  • CUDA trapezoidal rule III: blocks with more than one warp.

4 of 44

GPUs and GPGPU

  • In the late 1990s and early 2000s, the demand for realistic computer games and animations led to the creation of powerful Graphics Processing Units (GPUs).
  • GPUs were originally designed specifically for rendering graphics—handling many detailed images efficiently.
  • Soon, programmers realized that the computational power of GPUs could be used for non-graphics tasks such as searching, sorting, and scientific computations.
  • This new approach was termed General Purpose computing on GPUs (GPGPU).
  • Early challenges:
    • GPUs could only be programmed using graphics APIs like Direct3D and OpenGL.
    • Programmers had to reformulate algorithms using graphics concepts (vertices, triangles, pixels), making development complex.
  • To simplify GPU programming, new languages and compilers were developed that allowed programmers to write general-purpose algorithms more naturally.


5 of 44


  • These efforts led to the creation of general-purpose GPU programming APIs:
    • CUDA (Compute Unified Device Architecture) – developed by Nvidia for its own GPUs.
    • OpenCL (Open Computing Language) – designed for portability across various hardware, including GPUs, FPGAs, and DSPs.
  • CUDA is simpler to set up since it targets only Nvidia GPUs, while OpenCL requires more setup to ensure portability.
  • Therefore, many developers use CUDA for learning and implementation.
  • GPU programming also brings SIMD (Single Instruction, Multiple Data) execution into play: branching (e.g., if conditions) affects performance when different data items take different paths.

6 of 44


7 of 44

CPU vs GPU: Flynn’s Taxonomy

  • CPU (Central Processing Unit) is typically a SISD (Single Instruction, Single Data) device.
    • Fetches one instruction and executes it on a small number of data items.
    • Complex architecture but executes sequential instructions efficiently.
  • GPU (Graphics Processing Unit) is composed of SIMD (Single Instruction, Multiple Data) processors.
    • Executes the same instruction on many data elements simultaneously — ideal for parallel data processing.


8 of 44

SIMD Architecture Concept

A SIMD processor has:

  • A single control unit (fetches and broadcasts instructions).
  • Multiple datapaths (also called processing elements) that execute the same instruction on different data.

Example operation:

if (x[i] >= 0)
   x[i] += 1;
else
   x[i] -= 2;

  • All datapaths test x[i] >= 0.
  • Those where the condition is true perform x[i] += 1, while the others remain idle.
  • Then the roles reverse: the idle datapaths execute x[i] -= 2.

➜ This shows branch divergence in SIMD: not all datapaths can work simultaneously.


9 of 44

GPU as a Collection of SIMD Processors

  • A GPU can be viewed as many SIMD processors working in parallel.
  • Nvidia GPUs are built from Streaming Multiprocessors (SM).
    • Each SM contains:
      • Several control units,
      • Many datapaths, also known as Streaming Processors (SPs) or cores.
    • Hence, an SM ≈ a collection of SIMD processors.
  • Asynchronous execution:
    • SMs operate independently
    • Example: if all the threads with x[i] >= 0 run on one SM and the rest on another, neither group has to sit idle while the other executes its branch.


10 of 44


11 of 44

GPU Hardware Details (Example)

  • Modern high-end Nvidia GPU:
    • 82 SMs,
    • Each SM has 128 SPs,
    • Total: 82 × 128 = 10,496 SPs.

  • Nvidia uses the term SIMT (Single Instruction, Multiple Thread) instead of SIMD.
    • Reason: not all threads on an SM execute at the same time — some pause to wait for memory, while others continue.
    • This helps hide memory latency and improve throughput.


12 of 44

GPU Memory Hierarchy

  • Each SM has a small, fast shared memory accessible by its SPs.
  • All SMs also share access to a larger global memory, which is slower to access.
  • Host vs Device terminology:
    • Host: CPU + its memory.
    • Device: GPU + its memory.
  • Earlier systems: required explicit data transfers between host and device (e.g., copying arrays).
  • Newer Nvidia systems (compute capability ≥ 3.0):
    • Allow unified memory, so explicit transfers aren’t mandatory (though still useful for performance tuning).


13 of 44

Heterogeneous computing

  • Writing programs for GPUs introduces heterogeneous computing because the program uses:
    • A host processor (CPU) – conventional processor
    • A device processor (GPU) – different architecture specialized for parallel data processing
  • Both CPU and GPU work together in a single program using the SPMD (Single Program, Multiple Data) model, but with separate functions for CPU and GPU execution.

Alternative processors to CPUs:

  • GPUs (Graphics Processing Units) – focus of CUDA programming.
  • FPGAs (Field Programmable Gate Arrays) – reconfigurable hardware with programmable logic blocks and interconnects.
  • DSPs (Digital Signal Processors) – designed for specialized signal processing tasks (e.g., compression, filtering, analog signal manipulation).


14 of 44

Threads, blocks, and grids

1. Understanding the Kernel Launch Syntax:

Hello<<<num_blocks, threads_per_block>>>();

  • The first value (num_blocks) → number of thread blocks in the grid.
  • The second value (threads_per_block) → number of threads within each block.

Example: Hello<<<1, thread_count>>>();

→ Launches a single block with thread_count threads — all running on one Streaming Multiprocessor (SM).

If we use: Hello<<<2, thread_count / 2>>>();

→ Launches 2 blocks, each with thread_count / 2 threads — the work is divided across 2 SMs (if available).

So the total number of threads launched is:

total threads = num_blocks × threads_per_block = 2 × (thread_count / 2) = thread_count


15 of 44


16 of 44

2. CUDA Execution Model Overview

  • A GPU consists of multiple Streaming Multiprocessors (SMs).
  • Each SM contains several Streaming Processors (SPs) (i.e., cores).
  • When a kernel runs:
    • Each thread runs on an SP.
    • Each thread block is assigned to a single SM.
    • A grid is composed of multiple blocks.

Hierarchy:

Grid → contains multiple Blocks

Block → contains multiple Threads

Thread → runs on a Streaming Processor (SP)


17 of 44

3. Built-in CUDA Variables

CUDA provides predefined structures to identify and organize threads and blocks:

  • threadIdx - index (rank) of the thread within its block; fields .x, .y, .z
  • blockIdx - index (rank) of the block within the grid; fields .x, .y, .z
  • blockDim - dimensions of a thread block (number of threads per block); fields .x, .y, .z
  • gridDim - dimensions of the grid (number of blocks in the grid); fields .x, .y, .z

All fields (x, y, z) are unsigned integers.
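As a quick illustration (a sketch, not from the slides), a kernel that uses all four structures to report each thread's global rank might look like this; the kernel name Print_rank is an assumption:

#include <stdio.h>

__global__ void Print_rank(void) {
   /* Blocks before this one contribute blockDim.x threads each */
   int global_rank = blockIdx.x * blockDim.x + threadIdx.x;
   int total = gridDim.x * blockDim.x;   /* total threads in the grid */
   printf("Thread %d of %d (block %d, local thread %d)\n",
         global_rank, total, blockIdx.x, threadIdx.x);
}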


18 of 44

4. Example Program (Program 6.2 Summary)

#include <stdio.h>
#include <stdlib.h>

__global__ void Hello(void) {
   printf("Hello from thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main(int argc, char* argv[]) {
   int blk_ct = strtol(argv[1], NULL, 10);      /* Number of blocks   */
   int th_per_blk = strtol(argv[2], NULL, 10);  /* Threads per block  */

   Hello<<<blk_ct, th_per_blk>>>();   /* Launch kernel              */
   cudaDeviceSynchronize();           /* Wait for GPU to finish     */

   return 0;
}

This program launches a grid of blk_ct blocks, each containing th_per_blk threads, and prints messages from each thread.


19 of 44

5. Setting Multi-Dimensional Grids and Blocks

For 2D or 3D organization (useful in graphics/matrix computations):

dim3 grid_dims(2, 3, 1);    /* 6 blocks (2 × 3 × 1)             */
dim3 block_dims(4, 4, 4);   /* 64 threads per block (4 × 4 × 4) */

Kernel<<<grid_dims, block_dims>>>();

Total threads = 6 × 64 = 384.
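As an illustration (a sketch, not from the slides), a kernel working on an m × n matrix stored in row-major order can recover its row and column from the 2D block and thread indices; the kernel name Scale_matrix and the bounds check are assumptions:

__global__ void Scale_matrix(float A[], int m, int n) {
   int row = blockIdx.y * blockDim.y + threadIdx.y;   /* which row    */
   int col = blockIdx.x * blockDim.x + threadIdx.x;   /* which column */

   if (row < m && col < n)          /* guard against extra threads */
      A[row * n + col] *= 2.0f;     /* e.g., scale each element    */
}

It could be launched with, for example, dim3 block_dims(16, 16); dim3 grid_dims((n + 15)/16, (m + 15)/16); so that the grid covers the whole matrix.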


20 of 44

6. Independence of Thread Blocks

  • CUDA requires blocks to be independent.
    • Each block must be able to finish execution without depending on any other block.
    • This allows the GPU to schedule blocks freely, in any order (sequential or parallel), based only on resource availability.

7. Concept Summary


  • Thread - smallest execution unit; runs on a single GPU core (SP).
  • Block - group of threads that share memory and can synchronize.
  • Grid - collection of blocks; launched by a kernel call.
  • SM (Streaming Multiprocessor) - executes one or more blocks at a time.
  • Independence - each block must be able to run independently of the others.

21 of 44

Nvidia compute capabilities and device architectures

1. What is Compute Capability?

  • Compute capability (CC) is a version number (written as a.b) that identifies the features and limits of a specific NVIDIA GPU architecture.
  • It defines:
    • How many threads and blocks can run.
    • What hardware features are available (e.g., unified memory, tensor cores, etc.).
    • Which CUDA features are supported by that GPU.

2. Major and Minor Versions

  • Format: a.b
    • a = major version (architecture generation)
    • b = minor version (hardware revision)
  • Example: Compute capability 7.5 → Turing architecture (major version 7, minor revision 5).

Major revisions so far:

1, 2, 3, 5, 6, 7, 8 (There’s no major version 4)

3. NVIDIA Architectures and Their Compute Capabilities


22 of 44

4. Thread and Block Limits by Compute Capability

5. Versions of CUDA vs Compute Capability

  • The CUDA Toolkit version (e.g., CUDA 9, 11, 12) is not the same as the compute capability.
  • CUDA versions support a range of compute capabilities, but each GPU has one fixed compute capability.

Example:

  • CUDA 9, together with compute capability 6.x (Pascal) or later, added the ability to synchronize threads across multiple thread blocks.


Thread and block limits by compute capability:

  • Max threads per block (compute capability ≥ 2.x): 1024 threads
  • Max threads per SM (compute capability 2.x): 1536 threads
  • Max threads per SM (compute capability > 2.x): 2048 threads
  • Max block dimensions x, y, z (compute capability > 1.x): (1024, 1024, 64)
  • Modern CUDA toolkits support compute capability ≥ 3.0; devices below 3.0 are no longer supported.

23 of 44

Vector addition

This example demonstrates data-parallel programming on a GPU: the same operation, z[i] = x[i] + y[i], is performed on many data elements simultaneously, which is ideal for GPUs.

We have three vectors (arrays), x, y, and z, each of size n.

Vector addition is an embarrassingly parallel problem: each computation of z[i] is independent of the others.

  • No communication or synchronization between threads is needed.

Using float Instead of double

Reason:

  • Many GPUs have more 32-bit (float) ALUs than 64-bit (double) ones.
  • This means float operations are faster and more efficient on most GPUs.

CUDA Implementation Overview

We’ll need two key parts:

  1. Host (CPU) Code → sets up memory, launches the kernel, checks results.
  2. Device (GPU) Code → the kernel that actually performs addition.

(a) Kernel Function (Device Code)

A kernel is a special GPU function declared with __global__.


24 of 44

__global__ void Vec_add(
      const float x[]  /* in  */,
      const float y[]  /* in  */,
      float z[]        /* out */,
      const int n      /* in  */) {
   int my_elt = blockDim.x * blockIdx.x + threadIdx.x;

   /* total threads = blk_ct * th_per_blk may be > n */
   if (my_elt < n)
      z[my_elt] = x[my_elt] + y[my_elt];
}  /* Vec_add */

int main(int argc, char* argv[]) {
   int n, th_per_blk, blk_ct;
   char i_g;                  /* Are x and y user input or random? */
   float *x, *y, *z, *cz;
   double diff_norm;

   /* Get the command line arguments, and set up vectors */
   Get_args(argc, argv, &n, &blk_ct, &th_per_blk, &i_g);
   Allocate_vectors(&x, &y, &z, &cz, n);
   Init_vectors(x, y, n, i_g);

   /* Invoke kernel and wait for it to complete */
   Vec_add<<<blk_ct, th_per_blk>>>(x, y, z, n);
   cudaDeviceSynchronize();

   /* Check for correctness */
   Serial_vec_add(x, y, cz, n);
   diff_norm = Two_norm_diff(z, cz, n);
   printf("Two-norm of difference between host and ");
   printf("device = %e\n", diff_norm);

   /* Free storage and quit */
   Free_vectors(x, y, z, cz);
   return 0;
}  /* main */
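The helper functions Get_args, Allocate_vectors, Init_vectors, Serial_vec_add, Two_norm_diff, and Free_vectors are not shown on the slides. A minimal sketch of the two memory helpers, assuming unified (managed) memory is used so the same pointers work on both host and device, might be:

#include <stdlib.h>

/* Sketch only: x, y, z are accessed by both host and device, so they are
 * allocated in unified memory; cz (the serial check result) is host-only. */
void Allocate_vectors(float** x_p, float** y_p, float** z_p,
      float** cz_p, int n) {
   cudaMallocManaged(x_p, n * sizeof(float));
   cudaMallocManaged(y_p, n * sizeof(float));
   cudaMallocManaged(z_p, n * sizeof(float));
   *cz_p = (float*) malloc(n * sizeof(float));
}  /* Allocate_vectors */

void Free_vectors(float* x, float* y, float* z, float* cz) {
   cudaFree(x);
   cudaFree(y);
   cudaFree(z);
   free(cz);
}  /* Free_vectors */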

25 of 44

The kernel

  • Each CUDA thread must know which element of the arrays x, y, and z it should process. That is done by computing a global thread index (rank):

int my_elt = blockDim.x * blockIdx.x + threadIdx.x;

    • blockDim.x → number of threads in one block
    • blockIdx.x → index (ID) of the current block
    • threadIdx.x → index (ID) of the current thread within its block

Example: with 4 blocks of 5 threads each, total threads = gridDim.x × blockDim.x = 4 × 5 = 20.

    • Each thread now has a unique ID from 0 to 19, computed by the formula.


Global ranks (my_elt) for 4 blocks of 5 threads (rows: blockIdx.x, columns: threadIdx.x):

blockIdx.x \ threadIdx.x     0    1    2    3    4
0                            0    1    2    3    4
1                            5    6    7    8    9
2                           10   11   12   13   14
3                           15   16   17   18   19

Handling Cases Where n ≠ Total Threads

Sometimes, n (the number of elements in the vectors) is not equal to the total number of launched threads.

Example: n = 997. We might still launch 4 blocks × 256 threads = 1024 threads (because GPU threads come in block-sized chunks).

➜ The extra 27 threads (indexes 997–1023) must do nothing, to avoid out-of-bounds memory accesses.

Hence the test in the kernel, if (my_elt < n), ensures that those extra threads simply do nothing.
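A common way to pick the block count for a given n and threads per block (a standard idiom, not stated on the slide) is ceiling division, so that blk_ct × th_per_blk ≥ n:

int blk_ct = (n + th_per_blk - 1) / th_per_blk;   /* e.g., (997 + 255)/256 = 4 */
Vec_add<<<blk_ct, th_per_blk>>>(x, y, z, n);      /* 4 * 256 = 1024 threads    */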

26 of 44

Returning results from CUDA kernels

1. CUDA Kernels Have void Return Type

  • All CUDA kernels must have a return type of void — they cannot directly return values to the host.
  • Standard C/C++ pass-by-reference does not work because host and device have separate memory spaces (host pointers are invalid on the device).

2. Problem with Passing Host Addresses to Device

  • Example: Add<<<1,1>>>(2, 3, &sum);
  • Here, &sum (a host address) is invalid on the device.
  • Result: Either incorrect output (like printing -5) or a device hang occurs because of an invalid memory access (*sum_p = x + y).


27 of 44

3. Correct Ways to “Return” Results

(a) Using Unified Memory (cudaMallocManaged)

  • Works on systems supporting Unified Memory.
  • Memory is automatically shared between host and device.

cudaMallocManaged(&sum_p, sizeof(int));
Add<<<1,1>>>(2, 3, sum_p);
cudaDeviceSynchronize();
printf("%d\n", *sum_p);

No explicit memory copy is needed.

(b) Using Separate Host and Device Memory + Explicit Copy

  • For systems without Unified Memory:
    • Allocate memory on both host and device.
    • After kernel execution, use cudaMemcpy() to copy data back:

cudaMemcpy(hsum_p, dsum_p, sizeof(int), cudaMemcpyDeviceToHost);

Example:

Add<<<1,1>>>(2, 3, dsum_p);
cudaMemcpy(hsum_p, dsum_p, sizeof(int), cudaMemcpyDeviceToHost);


  • hsum_p - host pointer (destination), CPU memory
  • dsum_p - device pointer (source), GPU memory
  • sizeof(int) - number of bytes to copy
  • cudaMemcpyDeviceToHost - direction of the copy (device → host)
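Putting the pieces together for the no-unified-memory case, a complete sketch (the Add kernel body follows the slide's description; the allocation and cleanup calls around the slide's two lines are assumptions):

#include <stdio.h>
#include <stdlib.h>

__global__ void Add(int x, int y, int* sum_p) { *sum_p = x + y; }

int main(void) {
   int *hsum_p, *dsum_p;
   hsum_p = (int*) malloc(sizeof(int));      /* host copy of the result   */
   cudaMalloc(&dsum_p, sizeof(int));         /* device storage for result */

   Add<<<1,1>>>(2, 3, dsum_p);               /* kernel writes *dsum_p     */
   cudaMemcpy(hsum_p, dsum_p, sizeof(int),
         cudaMemcpyDeviceToHost);            /* waits for kernel, copies  */
   printf("%d\n", *hsum_p);                  /* prints 5                  */

   cudaFree(dsum_p);
   free(hsum_p);
   return 0;
}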

28 of 44

4. Using Global Managed Variables (__managed__)

  • Declare a global managed variable accessible to both host and device:

__managed__ int sum;

Kernel can directly write to it:

__global__ void Add(int x, int y) { sum = x + y; }

  • Synchronize before accessing it on host:

cudaDeviceSynchronize();

printf("%d\n", sum);

Limitations of __managed__ Variables

  • Available only on GPUs with compute capability ≥ 3.0.
  • On GPUs with < 6.0, the host cannot access managed variables while a kernel is running.
  • Uses global variables, which can reduce modularity and code clarity.


29 of 44

CUDA trapezoidal rule I

Concept of Trapezoidal Rule

  • Used to approximate the area under a curve (the definite integral of f over [a, b]).
  • The interval [a, b] is divided into n equal subintervals, each of width h = (b − a)/n.
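For reference, the standard composite trapezoidal rule, with h = (b − a)/n and x_i = a + i·h, is:

$$\int_a^b f(x)\,dx \;\approx\; h\left[\frac{f(a)+f(b)}{2} + \sum_{i=1}^{n-1} f(x_i)\right]$$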


30 of 44

Serial Implementation

  • Computes the formula by:
    1. Initializing trap = 0.5 × (f(a) + f(b)) and h = (b − a)/n
    2. Looping from i = 1 to n − 1, evaluating f(x) at x = a + i·h and accumulating into the sum
    3. Finally multiplying the sum by h

Main bottleneck: the for loop, especially for large n.


float Serial_trap(
      const float a  /* in */,
      const float b  /* in */,
      const int   n  /* in */) {
   float x, h = (b - a) / n;
   float trap = 0.5 * (f(a) + f(b));

   for (int i = 1; i <= n - 1; i++) {
      x = a + i * h;
      trap += f(x);
   }
   trap = trap * h;

   return trap;
}  /* Serial_trap */

31 of 44

Moving to CUDA — Parallelizing


Foster’s methodology suggests the two major tasks to parallelize:

  1. Evaluate f(xi)
  2. Add the value to the total trap result

Each thread can perform one iteration of the serial loop.

Directly transferring the loop body to CUDA causes five major problems:

Problem 1: Initialization

In the serial code, h and trap are initialized before the loop; in the kernel they become formal arguments, so they must be set up on the host before the launch.

/* h and trap are formal arguments to the kernel */
int my_i = blockDim.x * blockIdx.x + threadIdx.x;
float my_x = a + my_i * h;
float my_trap = f(my_x);
trap += my_trap;

32 of 44

Issues When Implementing in CUDA


Problem 2: Thread index range mismatch

  • Serial loop: i = 1 → n−1
  • CUDA thread index: starts at 0 up to total threads – 1
  • Must ensure:
    • Thread 0 does NOT compute an invalid x0
    • Threads beyond n−1 do NOT compute anything

Problem 3: Race condition on trap

  • All threads attempt to update a single shared variable trap.
  • This causes data races and incorrect results.
  • Need reduction or atomic operations to fix this.

Problem 4: Kernel cannot return a value

  • Kernel return type MUST be void.
  • So kernel cannot return the final computed area.
  • We must return results through device→host memory copies.

Problem 5: Multiplying by h must happen after summation

  • The multiplication trap = trap × h must occur only after all threads finish adding their contributions.
  • Requires synchronization and host-side final computation.

33 of 44

5. What Program 6.12 Will Solve

  • Proper initialization of parameters.
  • Correct mapping of threads to trapezoid indices.
  • Avoiding race conditions (possibly via partial sums or shared memory).
  • Returning results correctly via memory instead of return value.
  • Ensuring final multiplication by h happens after parallel summation.


34 of 44

Program 6.12: CUDA kernel and wrapper implementing trapezoidal rule.


__global__ void Dev_trap(
      const float a       /* in     */,
      const float b       /* in     */,
      const float h       /* in     */,
      const int   n       /* in     */,
      float*      trap_p  /* in/out */) {
   int my_i = blockDim.x * blockIdx.x + threadIdx.x;

   /* f(x_0) and f(x_n) were computed on the host.  So */
   /* compute f(x_1), f(x_2), ..., f(x_(n-1))          */
   if (0 < my_i && my_i < n) {
      float my_x = a + my_i * h;
      float my_trap = f(my_x);
      atomicAdd(trap_p, my_trap);
   }
}  /* Dev_trap */

/* Host code */
void Trap_wrapper(
      const float a          /* in  */,
      const float b          /* in  */,
      const int   n          /* in  */,
      float*      trap_p     /* out */,
      const int   blk_ct     /* in  */,
      const int   th_per_blk /* in  */) {

   /* trap_p storage allocated in main with cudaMallocManaged */
   *trap_p = 0.5 * (f(a) + f(b));
   float h = (b - a) / n;

   Dev_trap<<<blk_ct, th_per_blk>>>(a, b, h, n, trap_p);
   cudaDeviceSynchronize();

   *trap_p = h * (*trap_p);
}  /* Trap_wrapper */

35 of 44

CUDA trapezoidal rule II: improving performance

Why atomicAdd Makes CUDA Slow

In the basic CUDA version of the trapezoidal rule, each thread computes its my_trap value and then executes:

atomicAdd(trap_p, my_trap);

atomicAdd forces serialization.

Even though thousands of threads run in parallel, only one thread at a time is allowed to update *trap_p.

Example from the text:

  • 8 threads → 8 atomicAdd operations performed one after another
  • GPU becomes effectively serial at this final addition step

This destroys parallel performance.


36 of 44


We're assuming that f(x) = 2x + 1, a = 0, b = 8, and n = 8, so h = (8 − 0)/8 = 1.

The value referenced by trap_p at the start of the global sum is

0.5 × (f(a) + f(b)) = 0.5 × (1 + 17) = 9.

37 of 44

Tree-Structured Reduction: The Solution

Instead of letting all threads update a single variable, we let the threads combine their results among themselves like a binary tree.

The idea

In each step:

  • Half of the threads add their result to the other half.
  • Then half of those active threads become inactive.
  • Repeat until only 1 thread has the final sum.

Number of steps needed

Instead of t sequential additions, we now need:

➜ log₂(t) + 1 steps


38 of 44

Two standard CUDA implementations

  1. Shared memory reduction
    • Best for older GPUs (compute capability < 3.0).
    • Process:
      • Each thread writes its partial result into shared memory.
      • Perform tree-structured reduction steps:

        if (threadIdx.x < stride)
           shared[threadIdx.x] += shared[threadIdx.x + stride];

      • Synchronize threads between steps with __syncthreads();

  2. Warp shuffle reduction
    • For modern GPUs (compute capability ≥ 3.0).
    • Uses warp-shuffle instructions (__shfl_down_sync) to let threads share values via registers.
    • Modern GPUs allow threads within a warp (32 threads) to directly read each other's registers, e.g.:

        value += __shfl_down_sync(0xffffffff, value, offset);

    • Faster: no shared memory traffic and no __syncthreads() between steps.


39 of 44

Warps and warp shuffles

  • A warp = 32 threads (currently) with consecutive ranks inside a thread block.
  • Warp size is stored in a built-in variable: int warpSize;
  • They execute in SIMD fashion:
    • Threads in different warps may diverge safely.
    • Threads within the same warp must follow the same execution path.
  • The rank of a thread inside its warp: lane = threadIdx.x % warpSize;

Warp shuffle function

  • Used for register-to-register communication:
    • __shfl_down_sync(mask, var, diff, width = warpSize);
  • Allows a thread to read var from another thread with lane = current_lane + diff.
  • Enables fast tree-structured reduction inside a warp.
  • Common mask: mask = 0xffffffff;

Hex value with 32 binary 1s, meaning all 32 threads in the warp participate.


40 of 44

Important constraints to avoid undefined results

  • To ensure correctness:
    • All threads in the warp must reach the shuffle call.
    • Thread block size must be a multiple of warpSize (32).
      • This ensures all warps are full warps.

Why full warps matter

  • Partial warps (when block size ≠ multiple of 32) cause undefined behavior in shuffle operations because some lanes may be inactive.

Why warp shuffle improves performance

  • Implementing a global sum using registers is faster than using shared memory or global memory.
  • Warp shuffle functions (available on GPUs with compute capability ≥ 3.0) allow threads in a warp to directly access each other's registers, eliminating the need for shared memory communication.


41 of 44

Implementing tree-structured global sum with a warp shuffle
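The slide's listing is not reproduced here; the following is a minimal sketch of a warp-level tree sum using the warp shuffle described above (the function name Warp_sum is an assumption):

__device__ float Warp_sum(float my_val) {
   /* Each pass halves the distance: lane i adds in the value from lane i + diff.
      After the loop, lane 0 holds the sum of all warpSize values. */
   for (int diff = warpSize/2; diff > 0; diff = diff/2)
      my_val += __shfl_down_sync(0xffffffff, my_val, diff);
   return my_val;
}  /* Warp_sum */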


42 of 44

Shared memory and an alternative to the warp shuffle
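The slide's listing is likewise not shown; a sketch of the shared-memory alternative (assuming blockDim.x is a power of two and shared_vals has one entry per thread) could be:

__device__ float Shared_mem_sum(float my_val, float shared_vals[]) {
   int my_i = threadIdx.x;
   shared_vals[my_i] = my_val;     /* each thread deposits its value    */
   __syncthreads();                /* all deposits must be visible      */

   /* Tree-structured reduction over the whole block */
   for (int stride = blockDim.x/2; stride > 0; stride /= 2) {
      if (my_i < stride)
         shared_vals[my_i] += shared_vals[my_i + stride];
      __syncthreads();             /* finish this level before the next */
   }
   return shared_vals[0];          /* the block total                   */
}  /* Shared_mem_sum */

In the kernel, shared_vals would be declared with the __shared__ qualifier, e.g. __shared__ float shared_vals[1024]; (the size 1024 here is an assumption matching the maximum block size).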


43 of 44

CUDA trapezoidal rule III: blocks with more than one warp

1. Limiting thread blocks to only 32 threads is inefficient

  • GPU thread blocks can have up to 1024 threads (compute capability ≥ 2.0).
  • If we use only 32 threads per block:
    • We waste CUDA’s ability to synchronize hundreds of threads efficiently with __syncthreads.
    • We lose flexibility and performance.

2. Block-level sum using multiple warps

To sum values across up to 1024 threads:

  1. Each thread computes its partial value.
  2. Within each warp, use a warp-level reduction (via __shfl_down_sync) → one sum per warp.
  3. With 1024 threads, there are 1024 / 32 = 32 warps → 32 warp sums.
  4. Use warp 0 (threads 0–31) to add these 32 warp sums into a block-wide total.

Why warp 0? Because warp ID = threadIdx.x / warpSize, so warp 0 consists of the threads with indices 0–31.

3. Race condition without synchronization

  • Warps are independent of each other.
  • If warp 0 finishes its warp sum early, it might start reading other warp sums before they are written.
  • This leads to incorrect block sums (race condition).


44 of 44

4. Fix: Insert __syncthreads()

Modified pseudocode:

Each thread computes its value;
Each warp computes the sum of its own values;
__syncthreads();   /* ensures all warp sums are ready */
Warp 0 adds all the warp sums;

  • __syncthreads() = a fast barrier for all threads in the block.
  • All threads wait until every thread reaches the barrier.
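A sketch of this block-level sum, reusing the Warp_sum sketch from above (the name Block_sum and the shared array warp_sums are assumptions; the block size is assumed to be a multiple of warpSize):

__device__ float Block_sum(float my_val) {
   __shared__ float warp_sums[32];          /* at most 1024/32 = 32 warps */
   int my_warp = threadIdx.x / warpSize;
   int my_lane = threadIdx.x % warpSize;

   /* Step 1: each warp reduces its own values */
   float my_warp_sum = Warp_sum(my_val);
   if (my_lane == 0) warp_sums[my_warp] = my_warp_sum;

   __syncthreads();                         /* all warp sums are now ready */

   /* Step 2: warp 0 adds the warp sums */
   float blk_sum = 0.0f;
   if (my_warp == 0) {
      int n_warps = blockDim.x / warpSize;
      blk_sum = (my_lane < n_warps) ? warp_sums[my_lane] : 0.0f;
      blk_sum = Warp_sum(blk_sum);          /* lane 0 ends up with the total */
   }
   return blk_sum;   /* meaningful only in lane 0 of warp 0 */
}  /* Block_sum */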

5. Important caveat #1: All threads in the block MUST reach __syncthreads()

Incorrect example:

if (threadIdx.x < blockDim.x/2)
   __syncthreads();

  • Only half the threads reach the barrier → deadlock forever.
  • Threads that enter __syncthreads() wait for threads that never call it.

6. Important caveat #2: __syncthreads() syncs only within a block

  • Threads in different blocks are never synchronized by __syncthreads().
  • Blocks always run independently.
  • You cannot synchronize an entire grid using __syncthreads().
