Introduction to Shared Memory Computing - OpenMP -
Mobeen Ludin and Olivia Irving
Basic Architecture: single core machine
[Diagram: motherboard with a single-core Von Neumann processor (CPU0) connected to the RAM sockets and I/O ports (VGA, USB, PS2) over the control, address, and data buses]
Older processors had only one CPU core to execute instructions.
Modern Architecture: multi-core machine (UMA)
[Diagram: motherboard with a multi-core processor (CPU0-CPU3) sharing an L3 cache and an interconnection network, connected to the RAM sockets and I/O ports (VGA, USB, PS2) over the control, address, and data buses; the package of cores is "The Processors" or a node]
Modern processors have 4 or more independent CPU cores to execute instructions.
Shared Memory Systems:
[Diagram: four CPUs (CPU0-CPU3) and four RAM modules connected through an interconnect (IBM: BlueGene/L, Cray: Gemini/Red Storm, QuNet)]
Figure 1.0: Shared-Memory MIMD Architecture
Socket in a Node:
Node: Multiprocessor Machine (NUMA)
[Diagram: two sockets, each with four cores (CPU0-CPU3 and CPU4-CPU7) sharing an L3 cache; each socket together with its local RAM forms a NUMA domain, and the sockets reach the other socket's RAM over the interconnect (IBM: BlueGene/L, Cray: Gemini/Red Storm, QuNet). With AMD Bulldozer cores this configuration provides 32 threads per node.]
Shared Address Space: Process vs. Thread

Multithreaded program address space (PID = 999):
- Text: program's executable code: main(){ add_vect(); mult_vect(); }
- Data: double vect_sum[ ];
- Heap: malloc(vec_a[ ]); malloc(vect_b[ ]);
- Thread 0, stack frame 0: add_vect(); int a, b, i;
- Thread 1, stack frame 1: mult_vect(); int a, b, i;
The text, data, and heap segments are shared by all threads; each thread has its own stack.

Single-threaded program address space (PID = 99):
- Text: program's executable code: main(){ add_vect(); mult_vect(); }
- Data: double vect_sum[ ];
- Heap: malloc(vec_a[ ]); malloc(vect_b[ ]);
- One stack: frame 0 for add_vect(); int a, b, i; then frame 1 for mult_vect(); int a, b, i;
How does this relate to OpenMP? OpenMP creates multiple threads inside one process's shared address space, exactly as in the multithreaded layout above; parallelizing a loop simply means those threads split the loop's iterations among themselves.
What is OpenMP:
OpenMP consists of three parts:
- Environment variables: OMP_NUM_THREADS, OMP_SCHEDULE (e.g. "static,10"), OMP_WAIT_POLICY, OMP_STACKSIZE
- Directives: omp parallel, omp parallel for, omp parallel sections, omp parallel ...
- Runtime library: omp_get_num_threads(), omp_set_num_threads(), omp_get_thread_num()
Fork-Join Execution Model:
[Diagram: a single master thread (t0 on CPU0) runs serially until a fork point, where a team of threads (t0-t3 on CPU0-CPU3, each with its own L1 cache) is created; the threads each do their work in parallel, then join back into the single master thread t0.]
OpenMP Directives:
Directives: OpenMP directives in C/C++ are based on the #pragma compiler directive. A directive consists of a directive name followed by an optional list of clauses:
#pragma omp directive_name [clause list]
Example: #pragma omp parallel
OpenMP programs execute serially until they encounter the parallel directive.
This directive is responsible for creating a group (team) of threads.
The exact number of threads can be specified using an environment variable, at runtime using OpenMP functions, or with a clause.
Parallel directive example [ omp_greetings.c ]:

#include <stdio.h>   // used for printf()
#include <omp.h>     // used for OpenMP routines

int main(){
    int tid = omp_get_thread_num();
    printf("Greetings from Master Thread: %d \n", tid);

    #pragma omp parallel
    {
        // declared inside the region, so each thread gets its own copy
        int tid = omp_get_thread_num();
        printf("\tGreetings from worker thread: %d.\n", tid);
    } // END: omp parallel

    printf("Back to just master thread: Goodbye. \n");
    return 0;
} // END: main()
[Diagram: execution of omp_greetings.c. Serial region: the master thread t0 runs int tid; tid=omp_get_thread_num(); printf("Greetings: %d", tid);. Parallel region: threads t0-t3 each call omp_get_thread_num() and print their own tid. After the join, the serial region resumes and t0 prints "goodbye".]
OpenMP Data-Sharing Clauses
Clauses are used to control data in a parallel region:
default(type):
Where type is one of: none, shared (Fortran additionally allows private).
The OpenMP default is shared, but you can use the default clause to modify that behaviour; default(none) forces you to state each variable's sharing explicitly.
shared(list):
Accessible by all threads in a parallel region.
Variables that don't get modified by threads are good candidates.
private(list):
Each thread gets its own local, uninitialized copy of the variable.
The copies are destroyed after the parallel region.
firstprivate(list):
Each thread's private copy is initialized to the value the variable held before the start of the parallel region.
Critical and Atomic Operations:
critical: the enclosed code block is executed by only one thread at a time, never simultaneously by multiple threads. It is often used to protect shared data from race conditions.
atomic: the memory update (write, or read-modify-write) in the next statement is performed atomically. It does not make the entire statement atomic; only the memory update is. A compiler might use special hardware instructions for better performance than critical.
atomic meaning: "atomic" in this context means "all or nothing": either we succeed in completing the operation with no interruptions, or we fail to even begin the operation (because someone else was doing an atomic operation). We really mean "atomic" and "isolated" from other threads.
OpenMP Synchronization: reduction(+:sum)
[Diagram: summing 1,000,000 iterations with four threads. Each thread starts with its own private sum = 0 and accumulates its quarter of the range (t0: i = 0..249999, t1: i = 250000..499999, t2: i = 500000..749999, t3: i = 750000..999999); at the join, the four partial sums (psum) are combined into the single shared sum held by t0.]
Schedule clause:
The schedule clause specifies how iterations of the loop are divided among the threads of the team.
schedule(scheduling_class [, parameter])
schedule(static [, chunk-size]):
Distributes the work evenly, or in units of the chunk size specified.
Best when there is a pre-determined, predictable amount of work per iteration; the assignment is fixed at compile time.
schedule(dynamic [, chunk-size]):
Distributes the work to available threads in chunks of the size specified.
Best when you have no idea how long each iteration will take; most of the work is done at runtime.
schedule(runtime):
The schedule is taken from the environment variable OMP_SCHEDULE, which is one of static, dynamic, or an appropriate pair like:
export OMP_SCHEDULE="static,500"
OpenMP Library Routines
Full List: https://gcc.gnu.org/onlinedocs/libgomp/Runtime-Library-Routines.html#Runtime-Library-Routines
| Name | Return type | Description |
| omp_get_thread_num() | int | Returns the id of the current thread. |
| omp_get_num_threads() | int | Returns the number of threads in the current team. |
| omp_get_num_procs() | int | Returns the number of processors available. |
| omp_get_num_devices() | int | Returns the number of target devices (e.g. accelerators). |
| omp_get_wtime() | double | Elapsed wall-clock time in seconds. |
| omp_get_max_threads() | int | Maximum number of threads used for a parallel region. |
| omp_set_num_threads(n) | void | Sets the number of threads used by default in subsequent parallel regions. |
OpenMP Environment Variables
OMP_DYNAMIC: Dynamic adjustment of threads
OMP_MAX_ACTIVE_LEVELS: Set the maximum number of nested parallel regions
OMP_MAX_TASK_PRIORITY: Set the maximum task priority value
OMP_NESTED: Nested parallel regions
OMP_NUM_THREADS: Specifies the number of threads to use
OMP_PROC_BIND: Whether threads may be moved between CPUs
OMP_PLACES: Specifies on which CPUs the threads should be placed
OMP_STACKSIZE: Set default thread stack size
OMP_SCHEDULE: How loop iterations are divided among threads when schedule(runtime) is used
OMP_THREAD_LIMIT: Set the maximum number of threads
OMP_WAIT_POLICY: How waiting threads are handled
Full List: https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html#Environment-Variables
References:
The Community of OpenMP:
http://www.compunity.org/
Sieve Module:
http://www.shodor.org/petascale/materials/UPModules/sieveOfEratosthenes/
OpenMP Documentation:
http://www.openmp.org/mp-documents/OpenMP3.1.pdf
Paper on common mistakes in OpenMP:
http://www.michaelsuess.net/publications/suess_leopold_common_mistakes_06.pdf
List Environment Variables GCC:
https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html#Environment-Variables
List Runtime Routines GCC:
https://gcc.gnu.org/onlinedocs/libgomp/Runtime-Library-Routines.html#Runtime-Library-Routines
Cray OpenMP C/C++ Reference Manual:
http://docs.cray.com/books/S-2179-52/html-S-2179-52/z1050591602oswald.html
OpenMP by Example:
http://openmp.org/mp-documents/OpenMP_Examples_4.0.1.pdf
Some citations:
https://courses.cs.washington.edu/courses/cse378/07au/lectures/L25-Atomic-Operations.pdf
http://sc.tamu.edu/shortcourses/SC-openmp/OpenMPSlides_tamu_sc.pdf