1 of 57

Parallel Programming

Course Code: BDS701 | Semester: VII | Credits: 04

Teaching Hours: 3:0:2:0 | CIE Marks: 50 | SEE Marks: 50

2 of 57

Course Objectives

Explore Parallel Programming

Understand the fundamental need for parallel programming in modern computing

MIMD Systems

Explain how to parallelize applications on Multiple Instruction, Multiple Data systems

MPI Library

Demonstrate how to apply Message Passing Interface library to parallelize suitable programs

OpenMP

Demonstrate how to apply OpenMP pragma and directives to parallelize suitable programs

CUDA Programming

Demonstrate how to design CUDA programs for GPU-based parallel computing

3 of 57

Module 4: Shared-Memory Programming with OpenMP

  • OpenMP pragmas and directives
  • The trapezoidal rule
  • Scope of variables
  • The reduction clause
  • Loop-carried dependences, scheduling, producers and consumers
  • Caches, cache coherence, and false sharing in OpenMP
  • Tasking
  • Thread safety

4 of 57

OpenMP pragmas and directives

1. Directives-Based API

  • OpenMP provides a shared-memory parallel programming model using compiler directives.
  • In C and C++, these directives are implemented via preprocessor pragmas (#pragma).

2. Role of Pragmas

  • Pragmas let the programmer request compiler behaviors that are not part of standard C.
  • If a compiler doesn’t support OpenMP, it simply ignores these pragmas.
  • Hence, an OpenMP program can still run sequentially on any standard C compiler.

3. Syntax of Pragmas

  • Pragmas begin with #pragma.
  • If the directive is too long, use a backslash (\) to continue to the next line.
  • The syntax and behavior after #pragma depend on the compiler extension (e.g., OpenMP).


5 of 57

"Hello, World" with OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);   /* thread function */

int main(int argc, char* argv[]) {
    /* number of threads from the command line */
    int thread_count = strtol(argv[1], NULL, 10);

#pragma omp parallel num_threads(thread_count)
    Hello();

    return 0;
}

void Hello(void) {
    int my_rank = omp_get_thread_num();
    int thread_count = omp_get_num_threads();
    printf("Hello from thread %d of %d\n", my_rank, thread_count);
}


Compiling and running OpenMP programs

$ gcc -g -Wall -fopenmp -o omp_hello omp_hello.c

To run the program, we specify the number of threads on the command line.

$ ./omp_hello 4   (or, e.g., $ ./omp_hello 1)

If we do this, the output might be

Hello from thread 0 of 4

Hello from thread 1 of 4

Hello from thread 2 of 4

Hello from thread 3 of 4

However, it should be noted that the threads are competing for access to stdout, so

there’s no guarantee that the output will appear in thread-rank order. For example, the

output might also be

Hello from thread 1 of 4

Hello from thread 2 of 4

Hello from thread 0 of 4

Hello from thread 3 of 4


6 of 57

Fork-Join Model

1. Program Startup and Thread Creation

  • When an OpenMP program starts, the OS launches a single-threaded process executing the main() function.
  • On reaching #pragma omp parallel num_threads(thread_count), multiple threads are created.

e.g., each thread executes the Hello() function; after returning from it, the threads terminate.

2. OpenMP Directives and Syntax

  • All OpenMP pragmas start with: #pragma omp
  • The parallel directive specifies that the following structured block should run with multiple threads. Example: #pragma omp parallel
  • A structured block is a block with one entry and one exit point

3. Thread Concepts in OpenMP

  • Thread: a sequence of program instructions executed independently.
  • Threads share most process resources (e.g., memory, stdout, stdin) but have their own stack and program counter.
  • When threads complete, they join the parent process


7 of 57

4. Specifying Number of Threads

  • You can specify the number of threads using the num_threads clause

#pragma omp parallel num_threads(thread_count)

  • This creates a team of threads that executes the block.
  • The actual number may depend on system limits, but most systems can handle hundreds or thousands of threads.

5. The OpenMP Thread Team Model

  • When the program hits a parallel directive:
    • The original thread (the master) continues executing.
    • It forks (thread_count - 1) additional threads.
  • Terminology:
    • Master: The initial thread (thread 0).
    • Parent: Thread that encountered the parallel directive.
    • Children: Threads created by the parent.
  • Together, they form a team that executes the parallel block.


6. Synchronization and Barriers

  • At the end of a parallel region, OpenMP includes an implicit barrier:
    • All threads must complete the block before any can proceed.
    • After this, child threads terminate, and the parent continues.

7. Thread Functions and Variables

  • Each thread has its own private stack → separate local variables.
  • Functions to get thread info:

int omp_get_thread_num(void); // Thread rank (ID)

int omp_get_num_threads(void); // Total threads in team

8 of 57

The Trapezoidal Rule

The trapezoidal rule is a numerical method to estimate the area under a curve y=f(x) between two points a and b.

🔹 Idea

Instead of calculating the exact area, we:

  • Divide the interval [a,b] into n subintervals.
  • Approximate the area under f(x) in each subinterval by a trapezoid.

This is a standard technique for numerical integration.

🔹 Mathematical Formula

With h = (b - a) / n and x_i = a + i*h (so x_0 = a and x_n = b), the trapezoidal rule gives

Area ≈ h * [ f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_(n-1)) + f(x_n)/2 ]

11 of 57

A first OpenMP version

Foster’s Parallel Program Design Methodology - steps for parallelization:

1. Partitioning

Identify independent tasks that can run concurrently.

  • Tasks:
    a. Compute the areas of individual trapezoids. Each trapezoid area can be computed independently, which is ideal for parallel execution.
    b. Sum all trapezoid areas.

2. Communication

Determine how the tasks need to communicate.

  • During computation of trapezoid areas → no communication (independent).
  • During summation → all threads communicate to combine partial results.


12 of 57

3. Agglomeration

Group smaller tasks into larger ones to reduce communication overhead.

  • Since there are many more trapezoids than cores, each thread is assigned a block of trapezoids (a subinterval of [a, b]).
  • Each thread performs the serial trapezoidal rule on its subinterval.
  • Each thread computes its partial integral result (my_result).

4. Mapping

Assign each agglomerated task (each subinterval) to a processor/core.

  • Each thread runs on one core.
  • The interval [a,b] is partitioned into thread_count subintervals:

Thread 0: [a, local_b0]

Thread 1: [local_a1, local_b1] ...

where

local_n = n / thread_count   (trapezoids per thread)

local_a = a + my_rank * local_n * h

local_b = local_a + local_n * h


13 of 57

The Race Condition Problem

When multiple threads attempt to do:

global_result += my_result;

at the same time, a race condition occurs.

Only one thread should update global_result at a time.

For example, suppose global_result = 0, thread 0 has my_result = 1, and thread 1 has my_result = 2. The following interleaving is possible:

Thread 0: reads global_result = 0
Thread 1: reads global_result = 0
Thread 0: stores 0 + 1 = 1
Thread 1: stores 0 + 2 = 2

Final value: 2, instead of the correct 3.

14 of 57

The OpenMP Solution — Critical Section

  • A critical section is code executed by multiple threads that updates a shared resource, and the shared resource can only be updated by one thread at a time.
  • We prevent simultaneous updates using:

#pragma omp critical

global_result += my_result;

  • Ensures mutual exclusion: only one thread at a time can execute that statement.


15 of 57

Parallel Execution (OpenMP Directive)

In the main program:

#pragma omp parallel num_threads(thread_count)

Trap(a, b, n, &global_result);

  • Spawns multiple threads.
  • Each thread runs Trap() concurrently.

Load Division Issue

  • If n (number of trapezoids) is not evenly divisible by thread_count, the last n % thread_count trapezoids are never assigned to any thread, so part of the area is omitted.
  • Hence, the program must check:

if (n % thread_count != 0) {
    fprintf(stderr, "n must be evenly divisible by thread_count\n");
    exit(0);
}


16 of 57

Scope of variables


1. Concept of Variable Scope

  • In serial programming, the scope of a variable means where in the program the variable can be accessed.
    • Function-wide scope: Variable declared inside a function → accessible only within that function.
    • File-wide scope: Variable declared outside any function → accessible to all functions in that file.

2. Variable Scope in OpenMP

In OpenMP, the scope of a variable refers to which threads in a parallel region can access it.

Example — Hello, World Program

  • Variables like my_rank and thread_count are declared inside the Hello function, which is called inside the parallel region. Hence, each thread has its own private copy of these variables (stored on its private stack).
  • Conclusion: All variables in this program have private scope.

Type          | Meaning
Shared scope  | Accessible by all threads in a team.
Private scope | Accessible by only one thread (each thread has its own copy).

17 of 57

Example — Trapezoidal Rule Program

  • The parallel block only contains a function call:

#pragma omp parallel num_threads(thread_count)

Trap(a, b, n, &global_result);

  • Variables declared inside the Trap function (like local_a, local_b, my_result, etc.) are private to each thread and allocated on each thread's private stack.
  • Variables declared in main before the parallel region (a, b, n, global_result, and thread_count) are shared among all threads in the team.


18 of 57

5. Shared vs Private Example in Context

  • Each thread can access a, b, and n (shared variables) when calling Trap(a, b, n, &global_result).
  • Inside Trap, the pointer global_result_p is private, but it points to shared memory (the variable global_result in main).
  • Therefore, when threads execute:

*global_result_p += my_result;

they are updating the same shared variable (global_result), not separate copies.


19 of 57

The reduction clause

  1. Serial version (simpler design):

double Trap(double a, double b, int n);

and be called as:

global_result = Trap(a, b, n);

  2. Pointer Version (Parallel Design)

In the parallel version, each thread computes part of the integral and must add its result to the global sum. Hence the function used a pointer to the shared variable:

void Trap(double a, double b, int n, double *global_result_p);

Each thread updates:

*global_result_p += my_result;


20 of 57

3. Alternative Approach — Return Local Result

To simplify logic, we can let each thread return its local partial result:

double Local_trap(double a, double b, int n);

Then, inside the parallel block:

global_result = 0.0;
#pragma omp parallel num_threads(thread_count)
{
#pragma omp critical
    global_result += Local_trap(a, b, n);
}


4. Problem with This Version

Although it produces correct results, it is inefficient:

  • The entire statement global_result += Local_trap(a, b, n); is inside the critical section.
  • So only one thread at a time can execute Local_trap, forcing sequential execution.
  • This defeats parallelism and might even run slower than serial code.

21 of 57

5. Improved Version — Use a Private Variable

Declare a private variable inside the parallel block:

global_result = 0.0;
#pragma omp parallel num_threads(thread_count)
{
    double my_result = 0.0;   /* private */
    my_result = Local_trap(a, b, n);
#pragma omp critical
    global_result += my_result;
}

Now each thread executes Local_trap in parallel, and only the final (brief) addition to global_result happens sequentially.


22 of 57

6. Cleaner Solution — Using OpenMP Reduction Clause

OpenMP provides a reduction clause to handle such patterns automatically.

Example:

global_result = 0.0;
#pragma omp parallel num_threads(thread_count) reduction(+: global_result)
global_result += Local_trap(a, b, n);

The clause reduction(+: global_result) means:

  • Each thread gets a private copy of global_result.
  • Each private copy is initialized to 0 (identity for +).
  • Threads accumulate their local sums independently.
  • At the end, OpenMP combines all private values (using addition) into the shared global_result.


23 of 57

Behavior of Reduction Variables and Supported Reduction Operators in C

In C, OpenMP supports the reduction operators +, *, -, &, |, ^, &&, and ||. Each thread's private copy of the reduction variable is initialized to the operator's identity value (0 for +, 1 for *, and so on).

24 of 57

Loop-Carried Dependences

1. Focus Only on Loop-Carried Dependences

  • When using #pragma omp parallel for, the main concern is loop-carried dependences
  • A loop-carried dependence occurs when one iteration of the loop depends on data produced in another iteration.

🔹 2. Example Without Loop-Carried Dependence

for (i = 0; i < n; i++) {
    x[i] = a + i * h;    /* data dependence between this line and the next, */
    y[i] = exp(x[i]);    /* but within the same iteration: no cross-iteration dependence */
}

Each iteration reads and writes different array elements (x[i], y[i]). Hence, this loop can safely be parallelized:

#pragma omp parallel for num_threads(thread_count)
for (i = 0; i < n; i++) {
    x[i] = a + i * h;
    y[i] = exp(x[i]);
}


25 of 57

3. Dependence Requires a Write Operation

  • For two statements to be dependent, at least one of them must write (update) a variable.
  • Hence, only variables that are updated (written to) by the loop body can cause loop-carried dependences.

🔹 4. How to Detect a Loop-Carried Dependence

To find if a loop has a loop-carried dependence:

  1. Identify variables updated in the loop.
  2. Check if a variable is:
    • Written in one iteration, and
    • Read or written in a different iteration.
  3. If yes → 🚫 the loop cannot be safely parallelized. If no → ✅ the loop can be parallelized.


26 of 57

Scheduling loops

The assignment of loop iterations to threads in a parallel for directive is system dependent, though most implementations use block partitioning.

Block Partitioning

  • Iterations are divided into contiguous chunks:
    • Thread 0 → first n/thread_count iterations
    • Thread 1 → next n/thread_count iterations …and so on.
  • This can lead to load imbalance if different iterations take different amounts of time to execute.

Example of Load Imbalance

sum = 0.0;
for (i = 0; i <= n; i++)
    sum += f(i);

If f(i) takes time proportional to i, then:

  • Later iterations (with large i) take longer.
  • Threads handling higher indices (like thread_count − 1) do more work, causing inefficiency.


27 of 57

Cyclic (Round-Robin) Partitioning

  • A better approach for uneven workloads.
  • Iterations are assigned one at a time to threads in a round-robin manner:
  • Ensures a more balanced workload among threads.

Experimental Results

double f(int i) {
    int j, start = i*(i+1)/2, finish = start + i;
    double return_val = 0.0;
    for (j = start; j <= finish; j++)
        return_val += sin(j);
    return return_val;
}


Threads | Scheduling Type | Runtime (s) | Speedup
1       | -               | 3.67        | 1.00
2       | Block           | 2.76        | 1.33
2       | Cyclic          | 1.84        | 1.99

28 of 57

The schedule clause

  • Used to control iteration-to-thread assignment in parallel for or for directives.
  • Helps achieve better load balancing and performance.

🔹 Default Schedule in OpenMP

  • By default, most OpenMP implementations use static (block) scheduling when you write:

sum = 0.0;
#pragma omp parallel for num_threads(thread_count) reduction(+:sum)
for (i = 0; i <= n; i++)
    sum += f(i);

  • iterations are evenly divided into contiguous blocks among threads.


🔹 Cyclic Scheduling

  • To achieve a cyclic (round-robin) distribution, use:

sum = 0.0;
#pragma omp parallel for num_threads(thread_count) \
    reduction(+:sum) schedule(static, 1)
for (i = 0; i <= n; i++)
    sum += f(i);

  • Here, schedule(static, 1) assigns one iteration at a time to each thread in a round-robin pattern.

29 of 57

General Syntax of schedule Clause

schedule(<type>[, <chunk_size>])

Types of Scheduling in OpenMP


Type    | Description
static  | Iterations are divided and assigned before the loop executes. (Block or cyclic, depending on chunk size.)
dynamic | Iterations are assigned during execution; threads request new chunks after finishing their current ones.
guided  | Similar to dynamic, but chunk sizes decrease over time for better load balancing.
auto    | The compiler or runtime system decides the best scheduling policy.
runtime | The schedule type is determined at run time via an environment variable (OMP_SCHEDULE).

30 of 57

Static schedule type

  • For a static schedule, the system assigns chunks of chunksize iterations to each thread in a round-robin fashion.

E.g., suppose we have 12 iterations, 0, 1, ..., 11, and 3 threads.

  • The default schedule is defined by the particular implementation of OpenMP; it often behaves like

schedule(static, total_iterations / thread_count)

  • The static schedule is a good choice when each loop iteration takes roughly the same amount of time to compute.
  • Because the thread-to-iteration assignment is fixed and repeatable, it can also improve the speed of memory accesses, particularly on NUMA systems.


schedule(static, 1): Thread 0: 0, 3, 6, 9 | Thread 1: 1, 4, 7, 10 | Thread 2: 2, 5, 8, 11

schedule(static, 2): Thread 0: 0, 1, 6, 7 | Thread 1: 2, 3, 8, 9 | Thread 2: 4, 5, 10, 11

schedule(static, 4): Thread 0: 0, 1, 2, 3 | Thread 1: 4, 5, 6, 7 | Thread 2: 8, 9, 10, 11


32 of 57

Dynamic Schedule

  • Iterations are divided into chunks (of size chunksize).
  • Each thread executes one chunk and then requests another from the runtime system until all iterations are done.
  • Default behavior: if chunksize is not specified, it defaults to 1.
  • Key characteristics:
    • Iterations are assigned on a first-come, first-served basis.
    • Better load balancing when iteration times vary (some iterations take longer).
    • More runtime overhead due to dynamic task assignment.
    • Increasing chunksize reduces scheduling overhead but may reduce load balance.
  • Best used when: the workload per iteration is non-uniform or unpredictable.


33 of 57

Guided Schedule

  • Similar to dynamic scheduling: threads request chunks as they finish previous ones. However, chunk sizes start large and decrease over time.
  • Chunk size formula: roughly proportional to (remaining iterations) / (number of threads). E.g., 10,000 iterations and 2 threads → first chunk ≈ 5000, next ≈ 2500, etc.
  • Default behavior:
    • Without chunksize: chunk size decreases down to 1.
    • With chunksize: chunk size decreases down to that value, except possibly for the final chunk.
  • Key characteristics:
    • Reduces load imbalance dynamically.
    • Starts with large chunks to minimize overhead, then smaller chunks for fine-grained balancing.
    • Provides a compromise between static and dynamic schedules.
  • Best used when: later iterations are more compute-intensive or unevenly distributed.


34 of 57

Runtime schedule type

1. Environment Variables Overview

  • Environment variables are named values that a program can access while running.
  • They define the program’s environment, e.g., system paths or user info.
  • Common examples:
    • PATH → directories to search for executables
    • HOME → user’s home directory
    • SHELL → user’s shell program

2. The schedule(runtime) Clause

  • When a loop uses schedule(runtime), OpenMP defers the scheduling decision to run time.
  • The actual schedule type (static, dynamic, or guided) and chunk size are determined by the environment variable:

OMP_SCHEDULE

  • This allows you to change the scheduling policy without recompiling the code.


3. Setting OMP_SCHEDULE

You can set OMP_SCHEDULE before running your OpenMP program.

$ export OMP_SCHEDULE="static,1"

$ ./my_openmp_program

This means:

  • The program behaves as if it had #pragma omp for schedule(static,1)

4. Why schedule(runtime) is Useful

  • It enables flexible testing of different scheduling strategies (static, dynamic, guided) and chunk sizes without editing the source code.
  • Ideal for performance tuning: you can experiment interactively using environment variables.

35 of 57

Producer–Consumer Problem

  • A producer–consumer setup is a common parallel programming pattern where:
    • Producers generate or produce data (e.g., tasks, requests, or messages).
    • Consumers process or consume that data.
  • Both operate concurrently and communicate through a shared queue.

Queues — The Core Data Structure

  • A queue is a list-based data structure that follows the FIFO (First-In, First-Out) rule.
    • Elements are added with Enqueue and removed with Dequeue.
    • Analogy: like customers standing in line at a supermarket. New customers go to the end of the line; the next customer to be served is the one at the front.


Why Queues Matter in Parallel Programs

  • Queues provide a safe way to coordinate work between threads.
  • Used when:
    • Multiple producers create work or messages.
    • Multiple consumers pick up and process those messages.

Example Scenario

  • Producers: Threads that request or generate tasks (e.g., stock price requests).
  • Consumers: Threads that process those requests (e.g., retrieve stock prices).
  • The flow:
    1. A producer enqueues a request.
    2. A consumer dequeues the request and processes it.
    3. The result is returned or logged.

36 of 57

Message-Passing on a Shared-Memory System

  • Even in shared-memory systems, threads can simulate message passing using queues.
  • Each thread can maintain its own message queue:
    • To send a message → another thread enqueues a message into that thread’s queue.
    • To receive a message → the thread dequeues from its own queue.

This models communication between threads without explicit MPI calls.

  • Each thread interleaves sending and receiving messages
  • After sending all messages, it keeps receiving until the entire system finishes.
  • The Done() function indicates whether all threads have completed sending and receiving.


Overall Thread Behavior :

for (sent_msgs = 0; sent_msgs < send_max; sent_msgs++) {
    Send_msg();       /* send a random message to a random destination */
    Try_receive();    /* try to receive a message from this thread's queue */
}
while (!Done())       /* after sending all messages */
    Try_receive();    /* continue receiving until all threads finish */

37 of 57

Sending Messages (Send_msg())

When sending, multiple threads may try to enqueue messages into the same queue concurrently — leading to race conditions. To prevent this, enqueueing must occur inside a critical section.

  • Each thread creates a random message and a destination thread ID.
  • The enqueue operation is enclosed in a #pragma omp critical section:
    • Ensures that only one thread at a time updates the queue’s rear pointer.
    • Prevents message loss or corrupted queue structure.


🔹 Pseudocode

mesg = random();
dest = random() % thread_count;
#pragma omp critical
Enqueue(queue, dest, my_rank, mesg);

⚠️ Why critical section is needed

  • Queues typically track the rear (tail) of the list.
  • If two threads simultaneously modify the rear pointer, messages could overwrite or lose one another.

38 of 57

Receiving Messages (Try_receive())

Each thread only dequeues from its own queue, so no two threads will dequeue from the same queue.

🔹 Queue Size Calculation

Instead of maintaining a single shared “queue size” variable (which causes data races), we track: queue_size = enqueued - dequeued;

where:

  • enqueued → number of messages added to the queue (can be updated by multiple threads).
  • dequeued → number of messages removed (updated only by the queue’s owner thread).
  • If queue is empty: simply return — nothing to receive.
  • If only one message: use a critical section to prevent conflict with a concurrent Enqueue.
  • If more than one message: safe to Dequeue() without synchronization.
  • Print_message() displays who sent the message and its content.


if (queue_size == 0)
    return;
else if (queue_size == 1)
#pragma omp critical
    Dequeue(queue, &src, &mesg);
else
    Dequeue(queue, &src, &mesg);
Print_message(src, mesg);

39 of 57

Termination detection

  1. The Need for a Done() Function: each thread runs this logic:

for (sent_msgs = 0; sent_msgs < send_max; sent_msgs++) {
    Send_msg();
    Try_receive();
}
while (!Done())
    Try_receive();

  • After sending all its messages, a thread continues receiving until all work in the system is finished.
  • The challenge is knowing when it’s safe to stop receiving

2. The "Obvious" but Incorrect Approach

queue_size = enqueued - dequeued;
if (queue_size == 0)
    return TRUE;
else
    return FALSE;

This is wrong because a thread's queue can be momentarily empty while other threads are still sending: a message may be enqueued just after the thread returns TRUE, and that message would never be received.

3. The Correct Solution — Use a Global “Done Sending” Counter

A thread should only terminate when:

  1. Its queue is empty, and
  2. All threads have finished sending messages.


40 of 57

Implementation Details

Introduce a shared counter:

int done_sending = 0;   /* shared variable among all threads */

Each thread does:

#pragma omp atomic      /* after finishing all Send_msg() calls */
done_sending++;

Then, the Done() function becomes:

queue_size = enqueued - dequeued;
if (queue_size == 0 && done_sending == thread_count)
    return TRUE;
else
    return FALSE;


41 of 57

Startup

1. Program Startup and Queue Allocation

When execution begins:

  • A single master thread (thread 0) performs the setup:
    1. Reads command-line arguments.
    2. Allocates an array of message queues, one per thread.

Because any thread can send to any other thread, every queue must be accessible by all threads.

queue_t **msg_queues; // shared among threads

  • msg_queues is an array of pointers (one per thread).
  • Each pointer msg_queues[tid] will later point to a queue structure allocated by thread tid.

3. Why Use Array of Pointers (Not an Array of Structs)

queue_t **msg_queues; // array of pointers to queues

Reason:

  • Each thread dynamically allocates its own queue (with malloc).
  • Passing around pointers is cheaper than copying large structs.
  • It avoids false sharing — each queue can be on a separate memory block.


🧱 2. Message Queue Structure

Each queue will store, at minimum:

typedef struct {
    message_t *messages;  /* list of messages (e.g., a dynamic array) */
    int front;            /* index of front */
    int rear;             /* index of rear */
    int enqueued;         /* total messages enqueued */
    int dequeued;         /* total messages dequeued */
    int capacity;         /* optional: maximum size of queue */
} queue_t;

42 of 57

4. The Synchronization Problem

If one thread allocates its queue early and starts sending messages, but another thread's queue hasn't yet been allocated, then a Send_msg() to that uninitialized queue will cause a segmentation fault or crash.

Example of race:

Thread 0: allocates msg_queues[0], starts sending
Thread 1: has not yet allocated msg_queues[1]
Thread 0: tries Send_msg(..., dest = 1)
          -> accesses uninitialized pointer msg_queues[1]
          -> crash!


Solution: Explicit Barrier

OpenMP provides synchronization primitives. Here, the correct construct is:

#pragma omp barrier

This ensures all threads wait until everyone has allocated their queue before any proceed.

Behavior of #pragma omp barrier

  • All threads stop at the barrier.
  • When the last thread reaches it, all threads are released.
  • Afterward, all threads can safely assume that every queue pointer (msg_queues[i]) is allocated and initialized.

43 of 57

The atomic directive

In the message-passing OpenMP program:

  • Each thread sends a certain number of messages.
  • After finishing its sending phase, it must notify the system that it’s done.
  • A shared counter done_sending tracks how many threads have finished sending.

int done_sending = 0;

Each thread, after its for loop that sends messages, will do:

done_sending++;

But, since multiple threads can execute this simultaneously, this update is a race condition unless protected.


2. The Problem with Unprotected Increment

If multiple threads execute:

done_sending++;

simultaneously, they might:

  • Read the same old value,
  • Increment it locally,
  • Store it back — overwriting the other thread’s update.

Example: initial done_sending = 0

Thread 0: reads 0 → increments → 1 → stores 1
Thread 1: reads 0 → increments → 1 → stores 1

Result: done_sending = 1 (should be 2!)

44 of 57

Safe Fix #1 — Using critical

We could protect it using a critical directive:

#pragma omp critical

done_sending++;

This guarantees mutual exclusion, but it is slower, because OpenMP must manage a general lock.

Safe Fix #2 — Using atomic

OpenMP provides a faster, lightweight directive:

#pragma omp atomic

done_sending++;

Behavior

  • Protects a single memory operation (load-modify-store).
  • Ensures that the update to done_sending happens atomically.
  • No two threads can interfere with each other on that memory location.


5. What atomic Can Protect

An atomic directive only works if the critical operation is a simple assignment in one of these forms:

Valid Pattern  | Example
x <op>= expr;  | x += y;
x++; or ++x;   | done_sending++;
x--; or --x;   | count--;

Allowed operators: +, -, *, /, &, ^, |, <<, >>

⚠️ Restriction: the expression must not reference the same variable x; e.g., x += f(x) is invalid.

45 of 57

Caches, cache coherence, and false sharing

1. The Memory Hierarchy Problem

Processors are much faster than main memory (RAM), so fetching data from memory becomes a bottleneck.

Solution: CPUs use cache memory — a small, fast memory located close to the core.

Caches rely on:

  • Temporal locality: recently used data is likely to be reused soon.
  • Spatial locality: nearby memory addresses are likely to be used soon.

When a variable like x is loaded, the CPU loads an entire block of memory (a cache line, e.g., 64 bytes) into cache — not just the single variable.


46 of 57

2. The Cache Coherence Problem

In shared-memory systems, each thread runs on a different core, and each core has its own cache.

Example:

int x = 5;     /* shared */
my_y = x;      /* both threads read x */
x++;           /* thread 0 modifies x */
my_z = x;      /* thread 1 reads x again */

Now there are three copies of x:

  1. In main memory
  2. In Thread 0’s cache
  3. In Thread 1’s cache

When Thread 0 does x++, its cache updates x, but Thread 1’s copy of x is now stale.

To fix this, processors enforce cache coherence protocols — they mark stale cache lines as invalid, forcing other threads to reload from main memory.

✅ Correctness is preserved.
❌ But performance drops when threads repeatedly invalidate and reload the same cache line.


47 of 57

Matrix–Vector Multiplication and Cache Behavior

Serial version:

for (i = 0; i < m; i++) {
    y[i] = 0.0;
    for (j = 0; j < n; j++)
        y[i] += A[i][j] * x[j];
}

Parallel version:

#pragma omp parallel for num_threads(thread_count) \
    default(none) private(i, j) shared(A, x, y, m, n)
for (i = 0; i < m; i++) {
    y[i] = 0.0;
    for (j = 0; j < n; j++)
        y[i] += A[i][j] * x[j];
}


Each thread updates its own segment of y, so there are no true data races.

But performance measurements show that the 8×8,000,000 matrix case performs much worse with multiple threads. Why?

48 of 57

4. False Sharing Explained

When multiple threads write to different elements that happen to lie in the same cache line, their caches constantly invalidate each other — even though they’re updating different variables.

That’s false sharing:

Threads aren’t sharing variables — they’re only sharing the same cache line.

Example:

Suppose a cache line stores 8 doubles (8 bytes × 8 = 64 bytes).

If:

  • Thread 0 updates y[0]
  • Thread 1 updates y[1]

If both y[0] and y[1] lie in the same cache line, then every write causes an invalidation: both threads keep reloading the same line from memory.

Result: massive slowdown even though logically the threads don’t interact.


49 of 57

How to Fix False Sharing

(1) Padding

Add dummy elements so that each thread’s y[i] lies in a different cache line.

Example (pad y itself so that consecutive elements fall in different cache lines):

#define CACHE_LINE_SIZE 64
#define PAD (CACHE_LINE_SIZE / sizeof(double))

double y_padded[MAX_M][PAD];   /* y[i] lives at y_padded[i][0] */

#pragma omp parallel for private(i, j)
for (i = 0; i < m; i++) {
    y_padded[i][0] = 0.0;
    for (j = 0; j < n; j++)
        y_padded[i][0] += A[i][j] * x[j];
}

Now elements of y updated by different threads lie in separate cache lines, so their writes no longer invalidate each other's cached copies.


✅ (2) Private Buffers

Let each thread compute into a private (thread-local) array, and copy results to the shared y at the end:

#pragma omp parallel private(i, j)
{
    double y_local[MAX_M];   /* declared inside the region, so each thread has its own */
    for (i = start; i < end; i++) {
        y_local[i] = 0.0;
        for (j = 0; j < n; j++)
            y_local[i] += A[i][j] * x[j];
    }
    /* combine results into the shared y */
    for (i = start; i < end; i++)
        y[i] = y_local[i];
}

50 of 57

Tasking

1. The Problem: Fixed Parallel Structures

Most OpenMP constructs (like parallel for) assume:

  • A known, finite, and regular number of iterations or tasks.
  • Iterations that static or dynamic scheduling can easily divide among threads.

However, some problems don’t fit this model:

  • Recursive algorithms (e.g., Fibonacci, quicksort, graph traversal)
  • Dynamic workloads (e.g., a web server handling unpredictable HTTP requests)
  • Producer-consumer systems with continuous or irregular task streams

These can’t be expressed easily with a fixed for loop or predetermined workload


2. The Solution: OpenMP Tasks

To address dynamic and recursive workloads, OpenMP 3.0 introduced tasks.

✅ Basic Syntax

#pragma omp task
{
    // Independent unit of work
}

Each task is an independent unit of computation that can be:

  • Created by any thread within a parallel region.
  • Executed later by any available thread

Tasks are not tied to specific loop iterations — they represent arbitrary units of work.

51 of 57

3. Parallel + Single + Task Structure

Since tasks must be created inside a parallel region, the common pattern is:

#pragma omp parallel
#pragma omp single
{
    // Only one thread launches tasks
    #pragma omp task
    { ... }
}

  • The parallel directive creates a team of threads.
  • The single directive ensures only one thread launches tasks (to avoid redundant creation).
  • Other threads in the team execute tasks as they become available.

If single were omitted, each thread would create tasks redundantly.

🔹 4. Example: Recursive Fibonacci


52 of 57

Common Issues

(a) Data Scoping: variables referenced inside a task are firstprivate by default, so each task gets its own copy of their values at creation; results the parent needs must be listed in a shared clause.
(b) Task Ordering: tasks are unordered; their execution order is chosen by the OpenMP runtime.
(c) Performance Overhead: each task creation has a runtime cost (allocating a task data environment, scheduling, synchronization).

Optimizing Task Creation

Two optimizations help:

  1. Conditional Task Creation

Create a task only when the remaining work is large enough:

#pragma omp task shared(i) if (n > 20)
i = fib(n - 1);

If the if condition is false, the task is undeferred: the encountering thread executes it immediately instead of queueing it.

  2. Mixed Execution

Let one recursive call execute directly (not as a task):

i = fib(n - 1);          // executed directly by the current thread
#pragma omp task shared(j)
j = fib(n - 2);          // only the second call becomes a task
#pragma omp taskwait
return i + j;

This halves the number of tasks created and reduces scheduling overhead.


53 of 57

Conceptual Analogy

Think of tasks as "to-do" items in a shared task queue:

  • Any thread can create a task.
  • Any available thread can pick it up and execute it.
  • taskwait = wait until all your assigned subtasks are done.

This is ideal for:

  • Recursive divide-and-conquer algorithms (e.g., mergesort, quicksort)
  • Irregular parallelism (e.g., graph traversal)
  • Dynamic, unpredictable workloads (e.g., web servers, event handlers)


54 of 57

Thread-safety

1. The Context: writing a multi-threaded tokenizer using OpenMP, where:

  • Each thread is responsible for tokenizing certain lines of text.
  • Each line is stored in an array lines[].
  • The division of lines is done using:

#pragma omp for schedule(static, 1)

The goal: Each thread independently extracts and prints tokens (words) from its assigned lines.

2. The Problem: strtok Is Not Thread-Safe

We’re using the standard C function:

char *strtok(char *string, const char *separators);

How strtok works internally:

  • On the first call, it’s given a pointer to a string (string argument).
  • It finds the first token, replaces the separator (like a space) with '\0', and returns the token’s start address.
  • On subsequent calls, you pass NULL as the first argument — strtok remembers where it left off using an internal static pointer.


55 of 57

Example (serial):

char str[] = "Pease porridge hot.";
char *token = strtok(str, " \t\n");
while (token != NULL) {
    printf("%s\n", token);
    token = strtok(NULL, " \t\n");
}

That works fine — in a single thread.


56 of 57

Why It Fails in Multi-threading

When multiple threads call strtok at the same time, they share the internal static variable inside strtok. So:

  • Thread 0 calls strtok(lines[0], " \t\n")
  • Thread 1 calls strtok(lines[1], " \t\n")
  • Both calls modify the same internal static pointer.

💥 Result: Race condition!

Threads overwrite each other’s “progress” pointer, so:

  • Tokens from one line appear in another.
  • Some tokens get skipped or duplicated.
  • Output becomes unpredictable.

That’s exactly why the tokenizer’s output gets mixed up between threads: strtok’s internal static data is shared among all threads in the process.


57 of 57

The Concept: Thread Safety

A thread-safe function:

Can be safely called by multiple threads simultaneously, without corrupting shared data or producing wrong results.

❌ A non-thread-safe function:

Uses shared or static state that can be corrupted when accessed concurrently.

strtok is not thread-safe because of its internal static variable

The Fix: Use strtok_r (Reentrant Version)

POSIX provides a thread-safe, reentrant variant of strtok:

char *strtok_r(char *string, const char *separators, char **saveptr);

How It Works:

  • The third argument, saveptr, replaces the internal static pointer used by strtok.
  • Each thread (or each line) keeps its own save pointer.
  • Thus, strtok_r can be safely used by multiple threads at once.
