1 of 21

Introduction to Shared Memory Computing - OpenMP -

Mobeen Ludin and Olivia Irving

2 of 21

Basic Architecture: single core machine

[Figure: a single-processor (Von Neumann) motherboard with one CPU (CPU0) connected to the RAM sockets over control, address, and data buses, alongside VGA, USB, and PS2 ports.]

Older processors had only one CPU core to execute instructions.

3 of 21

Modern Architecture: multi-core machine (UMA)

[Figure: a multi-core (UMA) motherboard with four independent CPU cores sharing an L3 cache, connected to the RAM sockets over control, address, and data buses, alongside VGA, USB, and PS2 ports.]

Modern processors have four or more independent CPU cores to execute instructions.

4 of 21

5 of 21

Interconnection Network

“The Processors” or a Node

6 of 21

Shared Memory Systems:

[Figure: four CPUs (CPU0-CPU3), each attached to RAM, connected through an interconnect (IBM: BlueGene/L, Cray: Gemini/Red Storm, QuNet).]

Figure 1.0: Shared-Memory MIMD Architecture

  • Any computer composed of multiple processing elements (processors/cores) that share a global memory address space.
  • Popular shared-memory systems include multi-core CPUs and many-core GPUs.
  • Each CPU acts as if the whole main memory were attached directly to it via a bus or some other type of interconnect.

7 of 21

Socket in a Node:

Node: Multiprocessor Machine (NUMA)

[Figure: a NUMA node with two sockets; each socket holds four CPU cores sharing an L3 cache, and each socket together with its local RAM forms a NUMA domain. The sockets reach the RAM through an interconnect (IBM: BlueGene/L, Cray: Gemini/Red Storm, QuNet). Labels from the figure: "NUMA domain", "Bulldozer core", "32 Threads per node".]

8 of 21

Bulldozer Core

9 of 21

Shared Address Space: Process vs. Thread

Multithreaded program address space (PID = 999):

    Program's executable code:  main() { add_vect(); mult_vect(); }
    Data:                       double vect_sum[ ];
    Heap:                       malloc(vec_a[ ]); malloc(vect_b[ ]);
    Thread 0, stack frame 0:    add_vect(); int a, b, i;
    Thread 1, stack frame 1:    mult_vect(); int a, b, i;

Single-threaded program address space (PID = 99):

    Program's executable code:  main() { add_vect(); mult_vect(); }
    Data:                       double vect_sum[ ];
    Heap:                       malloc(vec_a[ ]); malloc(vect_b[ ]);
    Thread 0, stack frame 0:    add_vect(); int a, b, i;
              stack frame 1:    mult_vect(); int a, b, i;

How does this relate to OpenMP? The figure contrasts a multithreaded program with a single-threaded one, while OpenMP is usually presented as a way to parallelize loops; the sketch below connects the two.
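An OpenMP parallel region creates exactly the multithreaded address space pictured above: every thread shares the program's code, global data, and heap, while each thread gets its own private stack. A minimal sketch of this, using hypothetical names that echo the figure (vect_sum, vect_a); it is an illustration, not code from the slides:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double vect_sum = 0.0;                          /* global data: one copy, shared by all threads */

int main(void) {
    int n = omp_get_max_threads();
    double *vect_a = calloc(n, sizeof(double)); /* heap: one copy, shared by all threads */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();         /* declared inside the region: lives on each thread's private stack */
        if (tid < n)
            vect_a[tid] = (double)tid;          /* every thread writes into the same shared heap array */
    }

    for (int i = 0; i < n; i++)                 /* back on a single thread: read what the team wrote */
        vect_sum += vect_a[i];
    printf("sum of thread ids = %.1f\n", vect_sum);

    free(vect_a);
    return 0;
}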

10 of 21

What is OpenMP:

OpenMP

Environment Variables

OMP_NUM_THREADS

OMP_SCHEDULE (e.g., "static,10")

OMP_WAIT_POLICY

OMP_STACKSIZE

Directives

omp parallel

omp parallel for

omp parallel sections

omp parallel ....

Runtime Library

omp_get_num_threads()

omp_set_num_threads()

omp_get_thread_num()

  • OpenMP is a directive-based standard API for writing shared-memory parallel applications in C/C++ and Fortran.
  • It is most often used for parallelizing loops.
  • OpenMP 3.0 and later also allow for task parallelism (a minimal example using the runtime routines listed above is sketched below).
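A minimal sketch exercising a few of the runtime library routines listed above (omp_get_num_procs(), omp_get_max_threads(), omp_get_num_threads()); the printed numbers depend on the machine and on OMP_NUM_THREADS:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* runtime library calls made from the serial part of the program */
    printf("CPUs available: %d\n", omp_get_num_procs());
    printf("Max threads:    %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        /* 'single' lets exactly one thread of the team print the team size */
        #pragma omp single
        printf("Team size inside the parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}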

11 of 21

Fork-Join Model

[Figure: fork-join execution. Master thread t0 running on CPU0 forks a team of threads t0-t3, each mapped to its own core (CPU0-CPU3); each thread does its share of the work, and at the join all threads synchronize back into t0.]

Fork-Join Execution Model:

  • OpenMP follows the fork/join model.
  • OpenMP programs start with a single thread (t0).
  • At the start of a parallel region, the "master" thread creates a team of parallel "worker" threads (FORK).
  • OpenMP threads are mapped onto physical CPU cores.
  • At the end of the parallel region, all threads synchronize and join the master thread (JOIN).


12 of 21

OpenMP Directives:

Directives: OpenMP directives in C/C++ are based on #pragma compiler directives. A directive consists of a directive name followed by an optional list of clauses:

#pragma omp directive_name [clause list]

Example: #pragma omp parallel

OpenMP programs execute serially until they encounter the parallel directive.

This directive is responsible for creating a group (team) of threads.

The exact number of threads can be specified using an environment variable (OMP_NUM_THREADS), at runtime using OpenMP library functions (omp_set_num_threads()), or with a clause on the directive (num_threads), as sketched below.
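A minimal sketch of those three options; the thread counts (4, 2, 8) are arbitrary illustration values:

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);                /* runtime routine: applies to later parallel regions */

    #pragma omp parallel num_threads(2)    /* clause: overrides the setting for this region only */
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    /* The third option is the environment variable, e.g. OMP_NUM_THREADS=8,
       which is consulted when neither the routine nor the clause is used. */
    return 0;
}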

13 of 21

Parallel directive example [ omp_greetings.c ]:

#include <stdio.h>   // Used for printf()
#include <omp.h>     // Used for OpenMP routines

int main() {
    int tid;
    tid = omp_get_thread_num();
    printf("Greetings from Master Thread: %d \n", tid);

    #pragma omp parallel private(tid)   // private(tid): each thread writes its own copy, avoiding a data race
    {
        tid = omp_get_thread_num();
        printf("\tGreetings from worker thread: %d.\n", tid);
    } // END: omp parallel

    printf("Back to just master thread: Goodbye. \n");
    return 0;
} // END: main()
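To build and run this example with GCC, compile with the -fopenmp flag (for example: gcc -fopenmp omp_greetings.c -o omp_greetings) and set the OMP_NUM_THREADS environment variable before running to control how many worker threads the parallel region creates.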

14 of 21

[Figure: execution of omp_greetings.c. Serial region: t0 declares int tid, calls omp_get_thread_num(), and prints the greeting. Parallel region: threads t0-t3 each call omp_get_thread_num() and print their own id. Serial region: t0 prints "goodbye".]

15 of 21

OpenMP Data-Sharing Clauses

Clauses are used to control data in a parallel region:

default(type):
    Where type is one of: none, shared, or private (in C/C++ only none and shared are accepted; private is Fortran-only).
    The OpenMP default is shared, but you can use the default clause to modify that behaviour.

shared(list):
    Variables in the list are accessible by all threads in the parallel region.
    Variables that are not modified by the threads are good candidates.

private(list):
    Each thread gets its own local copy of the variable.
    The copies are destroyed after the parallel region.

firstprivate(list):
    Like private, but each copy is initialized to the value the variable held before the start of the parallel region (the sketch below combines these clauses).
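A minimal sketch that combines the clauses above; the variable names (n, offset, tid) are illustrative only:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 8;          /* read-only in the region: a good shared candidate   */
    int offset = 100;   /* each thread needs its starting value: firstprivate */
    int tid;            /* per-thread scratch value: private                  */

    #pragma omp parallel default(none) shared(n) firstprivate(offset) private(tid)
    {
        tid = omp_get_thread_num();     /* each thread writes its own private copy */
        offset += tid;                  /* every copy started at 100               */
        printf("thread %d working on n=%d items, offset=%d\n", tid, n, offset);
    }

    /* the private copies are gone here; the original offset is still 100 */
    printf("after the region offset = %d\n", offset);
    return 0;
}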

16 of 21

Critical and Atomic Operations:

critical: the enclosed code block will be executed by only one thread at a time, and not simultaneously executed by multiple threads. It is often used to protect shared data from race conditions.

atomic: the memory update (write, or read-modify-write) in the next instruction will be performed atomically. It does not make the entire statement atomic; only the memory update is atomic. A compiler might use special hardware instructions for better performance than when using critical.

atomic meaning: "atomic" in this context means "all or nothing": either the operation completes with no interruptions, or it fails to even begin (because another thread was performing an atomic operation). In practice we mean both "atomic" and "isolated" from other threads, as contrasted in the sketch below.
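A minimal sketch contrasting the two constructs; the counter and best variables are illustrative only:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;
    double best = 0.0;

    #pragma omp parallel
    {
        double mine = 1.0 + omp_get_thread_num();

        /* atomic: protects a single memory update (cheap, may map to a hardware instruction) */
        #pragma omp atomic
        counter++;

        /* critical: protects a whole block; only one thread executes it at a time */
        #pragma omp critical
        {
            if (mine > best)
                best = mine;
        }
    }

    printf("counter = %d, best = %.1f\n", counter, best);
    return 0;
}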

17 of 21

OpenMP Synchronization: reduction(+:sum)

[Figure: a loop over i = 0..999999 split into four equal chunks (0-249999, 250000-499999, 500000-749999, 750000-999999) across threads t0-t3; each thread starts its own private partial sum at 0, does its share of the work, and the partial sums are combined into the shared variable sum on t0 at the end of the parallel region.]
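A minimal code sketch of the reduction pictured above; the loop body is a stand-in for real per-iteration work:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    double sum = 0.0;

    /* each thread accumulates a private partial sum; OpenMP combines the
       partial sums into the shared sum at the end of the loop */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += 1.0;                     /* stand-in for real work */

    printf("sum = %.0f (expected %d)\n", sum, N);
    return 0;
}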

18 of 21

Schedule clause:

The schedule clause specifies how the iterations of a loop are divided among the threads of the team.

schedule(scheduling_class [, parameter])

schedule(static [, chunk-size ])
    Distributes the iterations evenly, or in chunks of the specified size.
    Best when each iteration does a pre-determined, predictable amount of work.
    The division of work can be determined at compile time.

schedule(dynamic [, chunk-size ])
    Hands out chunks of the specified size to threads as they become available.
    Useful when there is no way to know in advance how long each iteration will take.
    Most of the scheduling work is done at runtime.

schedule(runtime)
    The schedule is taken from the environment variable OMP_SCHEDULE, which is set to static, dynamic, or a class/chunk-size pair such as:

    export OMP_SCHEDULE="static,500"
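A minimal sketch showing the three schedule variants; the chunk sizes (100 and 10) are illustrative only:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int work[1000];

    /* static: iterations handed out in fixed chunks of 100 per thread */
    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < 1000; i++)
        work[i] = i;

    /* dynamic: threads grab chunks of 10 as they finish, useful when
       the cost of an iteration is unpredictable */
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < 1000; i++)
        work[i] += i;

    /* runtime: the schedule comes from OMP_SCHEDULE, e.g. export OMP_SCHEDULE="static,500" */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 1000; i++)
        work[i] += 1;

    printf("work[999] = %d\n", work[999]);
    return 0;
}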

19 of 21

OpenMP Library Routines

Full List: https://gcc.gnu.org/onlinedocs/libgomp/Runtime-Library-Routines.html#Runtime-Library-Routines

Name                      Return type    Description
omp_get_thread_num()      int            Returns the id of the current thread.
omp_get_num_threads()     int            Returns the number of threads in the current team.
omp_get_num_procs()       int            Returns the number of CPUs available.
omp_get_num_devices()     int            Returns the number of target (offload) devices available.
omp_get_wtime()           double         Returns the elapsed wall-clock time in seconds.
omp_get_max_threads()     int            Returns the maximum number of threads that would be used for a parallel region.
omp_set_num_threads()     void           Sets the number of threads used by default in subsequent parallel regions.
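A minimal sketch using omp_get_wtime() and omp_get_max_threads() from the table above to time a parallel loop:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();        /* wall-clock time in seconds */

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 10000000; i++)
        sum += (double)i;

    double t1 = omp_get_wtime();
    printf("sum = %.0f computed by up to %d threads in %.3f s\n",
           sum, omp_get_max_threads(), t1 - t0);
    return 0;
}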

20 of 21

OpenMP Environment Variables

OMP_DYNAMIC: Dynamic adjustment of threads

OMP_MAX_ACTIVE_LEVELS: Set the maximum number of nested parallel regions

OMP_MAX_TASK_PRIORITY: Set the maximum task priority value

OMP_NESTED: Nested parallel regions

OMP_NUM_THREADS: Specifies the number of threads to use

OMP_PROC_BIND: Whether threads may be moved between CPUs

OMP_PLACES: Specifies on which CPUs the threads should be placed

OMP_STACKSIZE: Set default thread stack size

OMP_SCHEDULE: How threads are scheduled

OMP_THREAD_LIMIT: Set the maximum number of threads

OMP_WAIT_POLICY: How waiting threads are handled

Full List: https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html#Environment-Variables

21 of 21

References:

The Community of OpenMP:

http://www.compunity.org/

Sieve Module:

http://www.shodor.org/petascale/materials/UPModules/sieveOfEratosthenes/

OpenMP Documentation:

http://www.openmp.org/mp-documents/OpenMP3.1.pdf

Paper on common mistakes in OpenMP:

http://www.michaelsuess.net/publications/suess_leopold_common_mistakes_06.pdf

List Environment Variables GCC:

https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html#Environment-Variables

List Runtime Routines GCC:

https://gcc.gnu.org/onlinedocs/libgomp/Runtime-Library-Routines.html#Runtime-Library-Routines

Cray OpenMP C/C++ Reference Manual:

http://docs.cray.com/books/S-2179-52/html-S-2179-52/z1050591602oswald.html

OpenMP by Example:

http://openmp.org/mp-documents/OpenMP_Examples_4.0.1.pdf

Some citations:

https://michaellindon.github.io/lindonslog/programming/openmp/openmp-tutorial-critical-atomic-and-reduction/

https://courses.cs.washington.edu/courses/cse378/07au/lectures/L25-Atomic-Operations.pdf

http://sc.tamu.edu/shortcourses/SC-openmp/OpenMPSlides_tamu_sc.pdf