Parallel Programming
Course Code: BDS701 | Semester: VII | Credits: 04
Teaching Hours: 3:0:2:0 | CIE Marks: 50 | SEE Marks: 50
Course Objectives
Explore Parallel Programming
Understand the fundamental need for parallel programming in modern computing
MIMD Systems
Explain how to parallelize applications on Multiple Instruction, Multiple Data systems
MPI Library
Demonstrate how to apply Message Passing Interface library to parallelize suitable programs
OpenMP
Demonstrate how to apply OpenMP pragma and directives to parallelize suitable programs
CUDA Programming
Demonstrate how to design CUDA programs for GPU-based parallel computing
Module 4: Shared-Memory Programming with OpenMP
OpenMP Pragmas and Directives
1. Directives-Based API: OpenMP expresses parallelism by annotating ordinary C/C++ code with compiler directives, rather than through library calls alone.
2. Role of Pragmas: a pragma is an instruction to the compiler; a compiler that does not support OpenMP can simply ignore the pragmas and compile the serial code.
3. Syntax of Pragmas: OpenMP pragmas begin with #pragma omp, followed by a directive name and optional clauses, e.g. #pragma omp parallel num_threads(4).
"Hello, World" with OpenMP
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
void Hello(void);
int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);
    #pragma omp parallel num_threads(thread_count)
    Hello();
    return 0;
}

void Hello(void) {
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
Compiling and running OpenMP programs
$ gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
To run the program, we specify the number of threads on the command line.
$ ./omp_hello 4
If we do this, the output might be
Hello from thread 0 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 3 of 4
However, it should be noted that the threads are competing for access to stdout, so
there’s no guarantee that the output will appear in thread-rank order. For example, the
output might also be
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 0 of 4
Hello from thread 3 of 4
Running with one thread instead:
$ ./omp_hello 1
Hello from thread 0 of 1
Fork-Join Model
1. Program Startup and Thread Creation
The program begins as a single thread (the master). When the master reaches a parallel directive, it forks a team of threads that execute the following block; at the end of the block the threads join, and only the master continues.
e.g.: Each thread executes the Hello() function, and after returning, the extra threads terminate.
2. OpenMP Directives and Syntax
A directive applies to the structured block (a single statement, or a compound block with one entry point and one exit point) that immediately follows it.
3. Thread Concepts in OpenMP
The thread running before the fork is the master; the threads it starts are often called workers, and together they form a team.
4. Specifying Number of Threads
#pragma omp parallel num_threads(thread_count)
5. The OpenMP Thread Team Model
The master thread together with the worker threads it forks is called a team; every thread in the team executes the code in the parallel block.
6. Synchronization and Barriers
There is an implicit barrier at the end of a parallel block: each thread waits there until every thread in the team has finished the block.
7. Thread Functions and Variables
int omp_get_thread_num(void); // Thread rank (ID)
int omp_get_num_threads(void); // Total threads in team
The Trapezoidal Rule
The trapezoidal rule is a numerical method to estimate the area under a curve y=f(x) between two points a and b.
🔹 Idea
Instead of calculating the exact area, we divide [a, b] into n subintervals of equal width h = (b − a)/n and approximate the area over each subinterval by a trapezoid whose parallel sides are the values of f at the subinterval's endpoints; summing the trapezoid areas gives the estimate.
This is a standard technique for numerical integration.
🔹 Mathematical Formula
With h = (b − a)/n and x_i = a + i·h:
Approximate area = h [ f(x_0)/2 + f(x_1) + f(x_2) + · · · + f(x_{n−1}) + f(x_n)/2 ]
A first OpenMP version
Foster’s Parallel Program Design Methodology - steps for parallelization:
1. Partitioning
Identify independent tasks that can run concurrently.
2. Communication
Determine how the tasks need to communicate.
3. Agglomeration
Group smaller tasks into larger ones to reduce communication overhead.
4. Mapping
Assign each agglomerated task (each subinterval) to a processor/core.
Thread 0: [a, local_b0]
Thread 1: [local_a1, local_b1], …
where
local_a = a + my_rank * local_n * h
local_b = local_a + local_n * h
The Race Condition Problem
When multiple threads attempt to do:
global_result += my_result;
at the same time, a race condition occurs.
Only one thread should update global_result at a time.
For example, suppose thread 0's my_result is 1 and thread 1's is 2, with global_result initially 0. If both threads read global_result = 0 before either writes, thread 0 stores 1, thread 1 then stores 2, and the final value is 2 instead of the correct 3: one update is lost.
The OpenMP Solution — Critical Section
#pragma omp critical
global_result += my_result;
Parallel Execution (OpenMP Directive)
In the main program:
#pragma omp parallel num_threads(thread_count)
Trap(a, b, n, &global_result);
Load Division Issue
if (n % thread_count != 0) {
    fprintf(stderr, "n must be evenly divisible by thread_count\n");
    exit(EXIT_FAILURE);
}
Scope of variables
1. Concept of Variable Scope
In serial C, the scope of a variable is the part of the program in which it can be used; for example, a variable declared at the start of a function is usable throughout that function.
2. Variable Scope in OpenMP
In OpenMP, the scope of a variable refers to which threads in a parallel region can access it.
Example — Hello, World Program
In the Hello program, thread_count is declared in main before the parallel directive, so all threads can access it; variables declared inside Hello (such as my_rank) live on each thread's own stack, so each thread has its own copy.

| Type | Meaning |
| --- | --- |
| Shared scope | Accessible by all threads in a team. |
| Private scope | Accessible by only one thread (each thread has its own copy). |
Example — Trapezoidal Rule Program
#pragma omp parallel num_threads(thread_count)
Trap(a, b, n, &global_result);
5. Shared vs Private Example in Context
In the trapezoidal rule program, global_result is declared in main and is therefore shared; when the threads execute
*global_result_p += my_result;
they are updating the same shared variable (global_result), not separate copies.
The reduction clause
In the serial program, a cleaner interface would let Trap simply return its estimate:
double Trap(double a, double b, int n);
and be called as:
global_result = Trap(a, b, n);
In the parallel version, each thread computes part of the integral and must add its result to the global sum. Hence the function used a pointer to the shared variable:
void Trap(double a, double b, int n, double *global_result_p);
Each thread updates:
*global_result_p += my_result;
3. Alternative Approach — Return Local Result
To simplify logic, we can let each thread return its local partial result:
double Local_trap(double a, double b, int n);
Then, inside the parallel block:
global_result = 0.0;
#pragma omp parallel num_threads(thread_count)
{
#pragma omp critical
global_result += Local_trap(a, b, n);
}
4. Problem with This Version
Although it produces correct results, it is inefficient: the critical directive encloses the entire call to Local_trap, so the threads execute Local_trap one at a time and the computation is effectively serialized.
5. Improved Version — Use a Private Variable
declare a private variable inside the parallel block:
global_result = 0.0;
#pragma omp parallel num_threads(thread_count)
{
double my_result = 0.0; // private
my_result = Local_trap(a, b, n);
#pragma omp critical
global_result += my_result;
}
Now each thread executes Local_trap in parallel, and only the final (brief) addition to global_result happens sequentially.
6. Cleaner Solution — Using OpenMP Reduction Clause
OpenMP provides a reduction clause to handle such patterns automatically.
Example:
global_result = 0.0;
#pragma omp parallel num_threads(thread_count) reduction(+: global_result)
global_result += Local_trap(a, b, n);
The clause reduction(+: global_result) means: global_result is a reduction variable combined with the + operator. OpenMP gives each thread a private copy of global_result, each thread adds its contribution into its own copy, and when the parallel block ends the private copies are added into the shared global_result.
Behavior of Reduction Variables and Supported Reduction Operators in C
Each thread's private copy of a reduction variable is initialized to the identity value of the reduction operator (0 for +, 1 for *, and so on). The reduction operators supported in C are +, *, −, &, |, ^, &&, and ||.
Loop-Carried Dependences
1. Focus Only on Loop-Carried Dependences
When deciding whether a loop can be parallelized with parallel for, only dependences between different iterations (loop-carried dependences) matter; a dependence confined to a single iteration causes no problem.
🔹 2. Example Without Loop-Carried Dependence
for (i = 0; i < n; i++) {
    x[i] = a + i * h;   // data dependence between this line and the next,
    y[i] = exp(x[i]);   // but only within the same iteration: no cross-iteration dependence
}
Each iteration reads and writes different array elements (x[i], y[i]). Hence, this loop can safely be parallelized:
#pragma omp parallel for num_threads(thread_count)
for (i = 0; i < n; i++) {
x[i] = a + i * h;
y[i] = exp(x[i]);
}
3. Dependence Requires a Write Operation
Two iterations can conflict only if at least one of them writes a shared location; if both merely read it, there is no dependence.
🔹 4. How to Detect a Loop-Carried Dependence
To find whether a loop has a loop-carried dependence, examine each variable written in the loop body and check whether the same location is read or written in a different iteration.
Scheduling loops
The assignment of loop iterations to threads in a parallel for directive is system dependent, though most implementations use block partitioning.
Block Partitioning
In a block partition, each thread is assigned one contiguous block of roughly n/thread_count consecutive iterations.
Example of Load Imbalance
sum = 0.0;
for (i = 0; i <= n; i++)
    sum += f(i);
If the time to compute f(i) is proportional to i, then with a block partition the thread assigned the final block does far more work than the thread assigned the first block, which finishes early and sits idle.
Cyclic (Round-Robin) Partitioning
In a cyclic partition, iterations are dealt out one at a time, round-robin: thread q gets iterations q, q + thread_count, q + 2*thread_count, and so on. This balances the load when the cost of an iteration grows with i.
Experimental Results
The following function, whose cost grows roughly linearly with i, was used to compare the two partitionings:
double f(int i) {
    int j, start = i * (i + 1) / 2, finish = start + i;
    double return_val = 0.0;
    for (j = start; j <= finish; j++)
        return_val += sin(j);
    return return_val;
}
| Threads | Scheduling Type | Runtime (s) | Speedup |
| --- | --- | --- | --- |
| 1 | — | 3.67 | 1.00 |
| 2 | Block | 2.76 | 1.33 |
| 2 | Cyclic | 1.84 | 1.99 |
The schedule clause
🔹 Default Schedule in OpenMP
With no schedule clause, the assignment of iterations is implementation-defined; most implementations use a block partition:
sum = 0.0;
#pragma omp parallel for num_threads(thread_count) reduction(+:sum)
for (i = 0; i <= n; i++)
sum += f(i);
🔹 Cyclic Scheduling
sum = 0.0;
#pragma omp parallel for num_threads(thread_count) \
reduction(+:sum) schedule(static, 1)
for (i = 0; i <= n; i++)
sum += f(i);
General Syntax of schedule Clause
schedule(<type>[, <chunk_size>])
Types of Scheduling in OpenMP
| Type | Description |
| --- | --- |
| static | Iterations are divided and assigned before loop execution. (Can be block or cyclic depending on chunk size.) |
| dynamic | Iterations are assigned during execution; threads request new chunks after finishing their current ones. |
| guided | Similar to dynamic, but chunk size decreases over time for better load balancing. |
| auto | The compiler or runtime system decides the best scheduling policy. |
| runtime | Schedule type is determined at runtime via an environment variable (OMP_SCHEDULE). |
Static schedule type
e.g., suppose we have 12 iterations, 0, 1, ..., 11, and 3 threads.
schedule(static, total_iterations / thread_count) gives a block partition:
Thread 0: 0, 1, 2, 3    Thread 1: 4, 5, 6, 7    Thread 2: 8, 9, 10, 11
schedule(static, 1) gives a cyclic partition:
Thread 0: 0, 3, 6, 9    Thread 1: 1, 4, 7, 10    Thread 2: 2, 5, 8, 11
schedule(static, 2) deals out blocks of two iterations cyclically:
Thread 0: 0, 1, 6, 7    Thread 1: 2, 3, 8, 9    Thread 2: 4, 5, 10, 11
schedule(static, 4) is the same as the block partition:
Thread 0: 0, 1, 2, 3    Thread 1: 4, 5, 6, 7    Thread 2: 8, 9, 10, 11
Dynamic Schedule
With schedule(dynamic, chunk_size), the iterations are broken into chunks of chunk_size consecutive iterations (default 1). Each thread executes a chunk and, when it finishes, requests another from the runtime system until no chunks remain.
Guided Schedule
With schedule(guided, chunk_size), chunks are also handed out on request, but each new chunk is roughly the number of remaining iterations divided by the number of threads, so chunks shrink as the loop progresses (down to chunk_size, which defaults to 1).
Runtime schedule type
1. Environment Variables Overview
Environment variables are named values, set in the shell, that a running program can query.
2. The schedule(runtime) Clause
With schedule(runtime), the schedule is chosen when the program starts: the runtime system reads the environment variable OMP_SCHEDULE and uses whatever schedule type (and chunk size) it specifies.
3. Setting OMP_SCHEDULE
You can set OMP_SCHEDULE before running your OpenMP program.
$ export OMP_SCHEDULE="static,1"
$ ./my_openmp_program
This means: the loop will be executed with a cyclic (static, chunk size 1) schedule.
4. Why schedule(runtime) is Useful
You can experiment with different schedules just by changing OMP_SCHEDULE and rerunning the program, with no recompilation.
Producer–Consumer Problem
Queues — The Core Data Structure
A queue is a first-in, first-out (FIFO) list abstraction: new entries are inserted at the rear (enqueued), and entries are removed from the front (dequeued).
Why Queues Matter in Parallel Programs
In a producer–consumer program, producer threads enqueue work while consumer threads dequeue and carry it out; the queue decouples the two groups.
Example Scenario
Message-Passing on a Shared-Memory System
Each thread has its own message queue: a thread sends by enqueuing a message on the destination thread's queue and receives by dequeuing from its own queue. This models communication between threads without explicit MPI calls.
Overall Thread Behavior:
for (sent_msgs = 0; sent_msgs < send_max; sent_msgs++)
{
Send_msg(); // Send a random message to a random destination
Try_receive(); // Try to receive a message from your queue
}
while (!Done()) { // After sending all messages
Try_receive(); // Continue receiving until all threads finish
}
Sending Messages (Send_msg())
When sending, multiple threads may try to enqueue messages into the same queue concurrently — leading to race conditions. To prevent this, enqueueing must occur inside a critical section.
🔹 Pseudocode
mesg = random();
dest = random() % thread_count;
#pragma omp critical
Enqueue(queue, dest, my_rank, mesg);
⚠️ Why the critical section is needed: two threads enqueuing onto the same destination queue at the same time could both update the queue's rear and its enqueued count, corrupting the queue or losing a message.
Receiving Messages (Try_receive())
Each thread only dequeues from its own queue, so no two threads will dequeue from the same queue.
🔹 Queue Size Calculation
Instead of maintaining a single shared "queue size" variable (which causes data races), we track: queue_size = enqueued - dequeued;
where enqueued is the total number of messages ever added to the queue (updated by senders) and dequeued is the total number ever removed (updated only by the queue's owner).
if (queue_size == 0)
    return;                        /* nothing to receive */
else if (queue_size == 1) {
    /* a sender may be enqueuing concurrently, so the dequeue must be protected */
    #pragma omp critical
    Dequeue(queue, &src, &mesg);
} else {
    /* two or more messages: Dequeue touches only the front of the queue
       while Enqueue touches only the rear, so no conflict is possible */
    Dequeue(queue, &src, &mesg);
}
Print_message(src, mesg);
Termination detection
for (sent_msgs = 0; sent_msgs < send_max; sent_msgs++)
{ Send_msg();
Try_receive(); }
while (!Done())
Try_receive();
2. The "Obvious" but Incorrect Approach
queue_size = enqueued - dequeued;
if (queue_size == 0)
    return TRUE;
else
    return FALSE;
This is incorrect: a thread's queue can be momentarily empty while other threads are still sending to it. The thread would terminate, and any message enqueued for it afterwards would never be received.
3. The Correct Solution — Use a Global “Done Sending” Counter
A thread should only terminate when: (1) its own queue is empty, and (2) every thread has finished sending messages.
Implementation Details
Introduce a shared counter:
int done_sending = 0; // Shared variable among all threads
Each thread does:
#pragma omp atomic // After finishing all Send_msg() calls
done_sending++;
Then, the Done() function becomes:
queue_size = enqueued - dequeued;
if (queue_size == 0 && done_sending == thread_count)
return TRUE;
else
return FALSE;
Startup
1. Program Startup and Queue Allocation
When execution begins, the main thread allocates the shared array of queue pointers (one entry per thread); each thread then allocates and initializes its own queue inside the parallel region.
Because any thread can send to any other thread, every queue must be accessible by all threads.
queue_t **msg_queues; // shared among threads
3. Why Use an Array of Pointers (Not an Array of Structs)
queue_t **msg_queues; // array of pointers to queues
Reason: each thread can allocate and initialize its own queue independently; the shared array stores only addresses, so individual queues can be created, grown, or freed without copying or moving the others.
🧱 2. Message Queue Structure
Each queue will store, at minimum:
typedef struct {
    message_t *messages;  // storage for messages (e.g., a dynamic array)
    int front;            // index of the front message
    int rear;             // index of the rear message
    int enqueued;         // total messages ever enqueued
    int dequeued;         // total messages ever dequeued
    int capacity;         // optional: maximum size of the queue
} queue_t;
4. The Synchronization Problem
If one thread allocates its queue early and starts sending messages, but another thread's queue hasn't yet been allocated, then a Send_msg() to that uninitialized queue will dereference an invalid pointer and crash (e.g., with a segmentation fault).
Example of race:
Thread 0: allocates msg_queues[0], starts sending
Thread 1: not yet allocated msg_queues[1]
Thread 0: tries Send_msg(..., dest=1)
-> accesses uninitialized pointer msg_queues[1]
-> crash!
Solution: Explicit Barrier
OpenMP provides synchronization primitives. Here, the correct construct is:
#pragma omp barrier
This ensures all threads wait until everyone has allocated their queue before any proceed.
Behavior of #pragma omp barrier: no thread in the team may proceed past the barrier until every thread in the team has reached it.
The atomic directive
In the message-passing OpenMP program:
int done_sending = 0;
Each thread, after its for loop that sends messages, will do:
done_sending++;
But, since multiple threads can execute this simultaneously, this update is a race condition unless protected.
2. The Problem with Unprotected Increment
If multiple threads execute:
done_sending++;
simultaneously, they might each read the same old value, increment their private copy, and write back the same result, losing updates.
Example:
Initial done_sending = 0
Thread 0: reads 0 → increments → 1 → stores 1
Thread 1: reads 0 → increments → 1 → stores 1
Result: done_sending = 1 (should be 2!)
Safe Fix #1 — Using critical
We could protect it using a critical directive:
#pragma omp critical
done_sending++;
This guarantees mutual exclusion, but it's slower, because OpenMP must manage a general-purpose lock.
Safe Fix #2 — Using atomic
OpenMP provides a faster, lightweight directive:
#pragma omp atomic
done_sending++;
Behavior: only the update of done_sending itself is protected; the implementation can typically use a single hardware atomic instruction, which is much cheaper than the lock management a critical section requires.
5. What atomic Can Protect
An atomic directive only works if the critical operation is a simple assignment in one of these forms:

| Valid Pattern | Example |
| --- | --- |
| x <op>= expr; | x += y; |
| x++; or ++x; | done_sending++; |
| x--; or --x; | count--; |

Allowed operators: +, -, *, /, &, ^, |, <<, >>
⚠️ Restriction: the expression must not reference the same variable (x) — e.g. x += f(x) is invalid.
Caches, cache coherence, and false sharing
1. The Memory Hierarchy Problem
Processors are much faster than main memory (RAM). → Fetching data from memory becomes a bottleneck.
Solution: CPUs use cache memory — a small, fast memory located close to the core.
Caches rely on locality: temporal locality (recently accessed data is likely to be accessed again soon) and spatial locality (data near recently accessed data is likely to be accessed soon).
When a variable like x is loaded, the CPU loads an entire block of memory (a cache line, e.g., 64 bytes) into cache — not just the single variable.
2. The Cache Coherence Problem
In shared-memory systems, each thread runs on a different core, and each core has its own cache.
Example: int x = 5; // shared
my_y = x; // both threads read x
x++; // Thread 0 modifies x
my_z = x; // Thread 1 reads x again
Now there are three copies of x: one in main memory, one in Thread 0's cache, and one in Thread 1's cache.
When Thread 0 does x++, its cache updates x, but Thread 1’s copy of x is now stale.
To fix this, processors enforce cache coherence protocols — they mark stale cache lines as invalid, forcing other threads to reload from main memory.
✅ Correctness is preserved. ❌ But performance drops when threads repeatedly invalidate and reload the same cache line.
Matrix–Vector Multiplication and Cache Behavior
Serial version:
for (i = 0; i < m; i++) {
y[i] = 0.0;
for (j = 0; j < n; j++)
y[i] += A[i][j] * x[j];
}
Parallel version:
#pragma omp parallel for num_threads(thread_count) \
    default(none) private(i, j) shared(A, x, y, m, n)
for (i = 0; i < m; i++) {
    y[i] = 0.0;
    for (j = 0; j < n; j++)
        y[i] += A[i][j] * x[j];
}
Each thread updates its own segment of y, so there are no true data races.
But performance measurements show that the 8×8,000,000 matrix case performs much worse with multiple threads. Why?
4. False Sharing Explained
When multiple threads write to different elements that happen to lie in the same cache line, their caches constantly invalidate each other — even though they’re updating different variables.
That’s false sharing:
Threads aren’t sharing variables — they’re only sharing the same cache line.
Example:
Suppose a cache line stores 8 doubles (8 bytes × 8 = 64 bytes).
If the eight elements y[0], ..., y[7] all lie in a single cache line, and each of the eight threads repeatedly updates a different y[i], then every update invalidates the line in all the other cores' caches.
Result: massive slowdown even though logically the threads don’t interact.
How to Fix False Sharing
(1) Padding
Add dummy elements so that each thread's y[i] lies in a different cache line.
Example (a sketch for the 8×8,000,000 case, where each thread computes exactly one row):
#define CACHE_LINE_SIZE 64
#define PAD (CACHE_LINE_SIZE / sizeof(double))

double y_padded[MAX_THREADS][PAD];   // y[i] lives in y_padded[i][0]

#pragma omp parallel private(j)
{
    int tid = omp_get_thread_num();  // thread tid computes row tid
    y_padded[tid][0] = 0.0;
    for (j = 0; j < n; j++)
        y_padded[tid][0] += A[tid][j] * x[j];
}
Now each thread's accumulator y_padded[tid][0] is in a separate cache line — no invalidation.
Afterwards, copy the per-thread results into the shared y.
✅ (2) Private Buffers
Let each thread compute into a private (thread-local) array, and copy results to shared y at the end:
#pragma omp parallel private(i, j)
{
    double y_local[MAX_M];   // declared inside the block, so automatically private
    for (i = start; i < end; i++) {
        y_local[i] = 0.0;
        for (j = 0; j < n; j++)
            y_local[i] += A[i][j] * x[j];
    }
    // combine results: one write per element, so cache-line
    // invalidations happen only once, at the end
    for (i = start; i < end; i++)
        y[i] = y_local[i];
}
Tasking
1. The Problem: Fixed Parallel Structures
Most OpenMP constructs (like parallel for) assume: the amount of work is known before the loop starts and can be split into a fixed set of iterations.
However, some problems don't fit this model: recursive algorithms, while loops with unknown trip counts, and programs that traverse lists, trees, or graphs.
These can't be expressed easily with a fixed for loop or predetermined workload.
2. The Solution: OpenMP Tasks
To address dynamic and recursive workloads, OpenMP 3.0 introduced tasks.
✅ Basic Syntax
#pragma omp task
{
// Independent unit of work
}
Each task is an independent unit of computation that can be: executed immediately by the thread that created it, or deferred and executed later by any thread in the team.
Tasks are not tied to specific loop iterations — they represent arbitrary units of work.
3. Parallel + Single + Task Structure
Since tasks must be created inside a parallel region, the common pattern is:
#pragma omp parallel
#pragma omp single
{
// Only one thread launches tasks
#pragma omp task
{ ... }
}
If single were omitted, each thread would create tasks redundantly.
🔹 4. Example: Recursive Fibonacci
Common Issues
(a) Data Scoping: variables referenced inside a task are firstprivate by default (each task gets its own copy, initialized with the value at task creation) unless they are shared in the enclosing context or listed in a shared clause.
(b) Task Ordering: tasks are unordered; their execution is managed by the OpenMP runtime.
(c) Performance Overhead: each task creation has a runtime cost: allocating a task data environment, plus scheduling and synchronization.
Optimizing Task Creation
Two optimizations help: (1) use the if clause so tasks are created only while subproblems are large enough to be worth the overhead, and (2) execute one recursive call directly instead of as a task.
#pragma omp task shared(i) if(n > 20)
i = fib(n - 1);
Work Stealing / Mixed Execution
Let one recursive call execute directly (not as a task):
i = fib(n - 1); // executed directly
#pragma omp task shared(j)
j = fib(n - 2);
#pragma omp taskwait
return i + j;
This reduces unnecessary task creation.
Conceptual Analogy
Think of tasks as "to-do" items in a shared task queue: any idle thread can take the next item from the queue and execute it.
This is ideal for recursive algorithms, irregular or unpredictable workloads, and traversals of linked structures such as lists, trees, and graphs.
Thread-safety
The Context - writing a multi-threaded tokenizer using OpenMP, where each thread is assigned a subset of the input lines with:
#pragma omp for schedule(static, 1)
The goal: each thread independently extracts and prints the tokens (words) from its assigned lines.
2. The Problem: strtok Is Not Thread-Safe
We’re using the standard C function:
char *strtok(char *string, const char *separators);
How strtok works internally: the first call passes the string to tokenize; strtok caches its current position in an internal static pointer, and each subsequent call with NULL as the first argument resumes from that cached position.
Example (serial):
char str[] = "Pease porridge hot.";
char *token = strtok(str, " \t\n");
while (token != NULL) {
printf("%s\n", token);
token = strtok(NULL, " \t\n");
}
That works fine — in a single thread.
Why It Fails in Multi-threading
When multiple threads call strtok at the same time, they share the internal static variable inside strtok. So:
💥 Result: Race condition!
Threads overwrite each other's "progress" pointer, so tokens from different lines get interleaved, tokens are skipped, or the same token is returned twice.
That's exactly why the output was mixed up between threads: strtok's internal static data is shared among all of them.
The Concept: Thread Safety
A thread-safe function:
Can be safely called by multiple threads simultaneously, without corrupting shared data or producing wrong results.
❌ A non-thread-safe function:
Uses shared or static state that can be corrupted when accessed concurrently.
strtok is not thread-safe because of its internal static variable.
The Fix: Use strtok_r (Reentrant Version)
C provides a thread-safe version of strtok:
char *strtok_r(char *string, const char *separators, char **saveptr);
How It Works: the caller supplies its own save pointer; strtok_r stores its progress in *saveptr instead of in an internal static variable, so each thread, using its own saveptr, keeps independent tokenizing state.