0% found this document useful (0 votes)
3 views93 pages

OpenMP Basics and Parallel Processing

This document provides an overview of OpenMP, an API for multi-threaded parallelization, detailing its advantages, directives, and programming model. It discusses concurrency, synchronization constructs, and the various ways to manage threads and data in parallel computing. Additionally, it touches on the concept of message passing and the MPI interface for distributed memory systems.

Uploaded by

Marwan Gasser
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views93 pages

OpenMP Basics and Parallel Processing

This document provides an overview of OpenMP, an API for multi-threaded parallelization, detailing its advantages, directives, and programming model. It discusses concurrency, synchronization constructs, and the various ways to manage threads and data in parallel computing. Additionally, it touches on the concept of message passing and the MPI interface for distributed memory systems.

Uploaded by

Marwan Gasser
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Parallel Processing

Lecture 3: OpenMP
What is Concurrency?
• Two events are said to be concurrent if they occur within the same time interval.
• Two or more tasks executing over the same time interval are said to execute concurrently.
◦ Not necessarily mean at the same exact instant.

• Concurrent tasks can execute in a single (context switching ) or multiprocessing environment.


Advantages
• better Utilization:
◦ Do more work over the same time interval using the same resources

• Better throughput:
◦ Work faster or get work done sooner

• Capacity management:
◦ Handle larger work loads

• Naturally simple:
◦ Replacing one long sequential logic with a group of small concurrent tasks
Layers of Concurrency
• Instruction level
◦ Multiple parts of a single instruction can be executed simultaneously

• Sub routine level


◦ Functions are assigned to different threads and can be executed “simultaneously ” if enough processors
are available

• Object level
◦ Objects residing in different threads or processes may executed their methods concurrent

• Application level
◦ Tow or more applications can cooperatively work together to solve the same problem.
What is OpenMP?
• Application Programming Interface (API) for multi-threaded parallelization consisting of
◦ Source code directives
◦ Functions
◦ Environment variables
Thread
• One may think of the thread as an instance of a function
that returns before the function has finished executing.
• The stack corresponding to the function call is local to the
thread.
• logical machine model:
◦ global memory (default)
◦ local memory (stacks).

• Threads are scheduled at runtime (no a priori schedule of


their execution can be safely assumed)
Matrix-Matrix Multiplication

[Link]
What is OpenMP? (cont.)
• Advantages
◦ Easy to use
◦ Incremental parallelization
◦ Flexible
◦ Loop-level or coarse-grain
◦ Portable
◦ Since there’s a standard, will work on any SMP machine

• Disadvantage
◦ Shared-memory systems only
What is OpenMP? (cont.)
• OpenMP directives provide support for:
◦ Concurrency
◦ Synchronization
◦ Data handling
◦ Mutual exclusion
◦ Data scope and initialization
OpenMP Directives
• OpenMP directives in C and C++ are based on the #pragma compiler directives.
• A directive consists of a directive name followed by clauses.
OpenMP Rules
• Case sensitive
• Directives follow conventions of the C/C++ standards for compiler directives
• Only one directive name may be specified per directive
• Each directive applies to at most one succeeding statement, which must be a structured block.
• Long directive lines can be "continued" on succeeding lines by escaping the newline character
with a backslash (" ("\\") at the end of a directive line.
OpenMP Programming Model
• OpenMP programs execute serially until they encounter the parallel directive, which creates a
group of threads.
Clauses
• Conditional Parallelization: The clause if (scalar expression) determines whether
the parallel construct results in creation of threads.
• Degree of Concurrency: The clause num_threads (integer) specifies the number
of threads that are created.
• Data Handling:
◦ private (variable list) : variables local to each thread.
◦ firstprivate (variable list): local variables that are initialized to corresponding values before
the parallel directive.
◦ shared (variable list) : variables are shared across all the
OpenMP Example
//serial Segment
#pragma omp parallel
if (is_parallel == 1 ) num_threads (8 ) private(a) shared (b) firstprivate (c)
{
// parallel Segment

}
// rest serial Segment
OpenMP Example (cont.)
• omp_get_num_threads()
• omp_get_thread_num() : thread id 0 means the master thread
Work Sharing Construct
• A work sharing construct divides the execution of the enclosed code region
among the members of the team that encounter it:
◦ Do not launch new threads
◦ Must be enclosed in a parallel region
◦ Must be encountered by all threads in the team, or none at all
◦ No implied barrier on entry; implied barrier on exit (unless nowait is specified)
◦ A work sharing construct does not launch any new threads

• Directives to specify concurrency :


◦ sections concurrent tasks (functional parallelism).
◦ for concurrent iterations (data parallelism)
Section Directive
• a non iterative work sharing construct
• breaks work into separate, discrete sections.
• Each section is executed by a thread.
• It is possible that for a thread to execute more than one section if it is
quick enough and the implementation permits such.
• Can be used to implement a type of "functional parallelism".
Section Directive (cont.)
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
taskA();
}
#pragma omp section
{
taskB();
}
#pragma omp section
{
taskC();
}
}
}
int i;
float a[N],b[N],c[N],d[N];
//doSomeinitializations
for(i=0;i<N;i++){
a[i]=i*2;
b[i]=i+20;
}
//set number of threads to 10
omp_set_num_threads(10);
//start parallel region
#pragma omp parallel shared(a,b,c,d) private(i)
{
//start parallel sections
#pragma omp sections nowait
{
#pragma omp section
for(i=0;i<N;i++)
c[i]=a[i]+b[i];
#pragma omp section
for(i=0;i<N;i++)
d[i]=a[i]*b[i];
}/*end of sections*/
}/*end of parallel section*/
parallel for
• parallel for directives parallelize subsequent loop
• Implicit barrier at the end of each for directive: nowait breaks implicit barrier
• Clauses are:
◦ private
◦ firstprivate #pragma omp parallel for
◦ lastprivate for(i = 1; i <= maxi; i++){
◦ schedule a[i] = b[i] = c[i];
◦ reduction }
◦ nowait
◦ ordered
Reduction Clause
• The reduction clause specifies how multiple local copies of a variable at different threads are
combined into a single copy at the master when threads exit.
• reduction (operator: variable list)
• The variables in the list are implicitly private to threads.
• The operator can be one of +, *, --, &, |, ^, &&, and II
Schedule Clause
• Schedule refers to the way in which loop indices are distributed among threads default is static
• Each thread is assigned a contiguous range of indices in order of thread number
◦ called round robin

• Number of indices assigned to each thread is as equal as possible


Schedule Clause : static
• Loop runs from 1 to 51, 4 threads
• #pragma omp for schedule(static)

Thread Indices Chunk size

0 1-13 13

1 14-26 13

2 27-39 13

3 40-51 12
Schedule Clause : static (cont.)
• #pragma omp for schedule(static,5)

thread chunk 1 chunk 2 chunk 3 Chunk


indices indices indices size
0 1-5 21-25 41-45 15
1 6-10 26-30 46-50 15
2 11-15 31-35 51 11
3 16-20 36-40 - 10
Schedule Clause : dynamic
• schedule(dynamic) clause assigns chunks to threads dynamically as the threads become
available for more work
• default chunk size is 1
• higher overhead than STATIC
• #pragma omp for schedule(dynamic,5)
Schedule Clause : guided
• schedule(guided) clause assigns chunks automatically, exponentially decreasing chunk size
with each assignment
• specified chunk size is the minimum chunk size except for the last chunk, which can be of any
size
• default chunk size is 1
Schedule Clause : runtime
• schedule can be specified through omp_schedule environment variable
• setenv omp_schedule “dynamic,5”
• the schedule(runtime) clause tells it to set the schedule using the environment variable
Nested Parallelism
• Nested parallelism must be enabled using the OMP_NESTED environment variable.
• Generates a logical team of threads on encountering a nested parallel directive
• Split each of the for loops across various threads

#pragma omp parallel for


for(j=0; j<jmax; j++){
#pragma omp parallel for
for(i=0; i<imax; i++){
do_work(i,j);
}
}
Synchronization Constructs
• OpenMP provides a variety of Synchronization Constructs that control how the execution of
each thread proceeds relative to other team threads

both processor is trying to increment a variable x at the same time


Synchronization Constructs (cont.)
• OpenMP provides a variety of synchronization constructs:
• #pragma omp barrier
• #pragma omp single [clause list]
structured block
• #pragma omp master
◦ structured block

• #pragma omp critical [(name)]


◦ structured block

• #pragma omp ordered


◦ structured block
barrier
• a barrier holds a thread until all threads participating in the barrier have reached it.
• #pragma omp barrier
Barrier (cont.)
• Barriers implemented :
• A single integer is used to keep track of the number of threads that have reached the barrier.
• If the count is less than the total number of threads, the threads execute a condition wait.
• The last thread entering (and setting the count to the number of threads) wakes up all the
threads using a condition broadcast.
master
• specifies a region that is to be executed only by the master thread.
• All other threads on the team skip this section of code
• There is no implied barrier associated with this directive
• #pragma omp master
single
• serializes a section of code executed by only one thread in the team.
• Threads in the team that do not execute the SINGLE directive, wait at
the end of the enclosed code block, unless a nowait clause is specified.
• May be useful when dealing with sections of code that are not thread
safe (such as I/O)
• #pragma omp single [clause list]
critical
• Mutual Exclusion for Shared Variables:
◦ All threads in the team will attempt to execute in parallel,
◦ the CRITICAL construct surrounding the increment of x
◦ only one thread will be able to read/increment/write x at any time .

int x =0;
#pragma omp parallel shared(x)
{
#pragma omp critical
x = x + 1;
}
atomic
• provides a mini CRITICAL section.
• specifies that a specific memory location must be
updated atomically.
• expressions:
◦ x++
◦ ++x
◦ X--
◦ --X

• All atomic directives can be replaced by critical


directives
#pragma omp atomic
statement_expression
ordered
• The ordered directive specifies that iterations of the enclosed loop will be executed in the same
order as if they were executed on a serial processor.
• Threads will need to wait before executing their chunk of iterations if previous iterations
haven't completed yet.
• Used within a DO / for loop with an ORDERED clause
OpenMP Environment Variables
• OMP_NUM_THREADS : This environment variable specifies the default number of threads
created upon entering a parallel region.
• OMP_SET_DYNAMIC : Determines if the number of threads can be dynamically changed.
• OMP_NESTED : Turns on nested parallelism.
• OMP_SCHEDULE : Scheduling of for loops if the clause specifies runtime
Explicit Parallelism
• Same thing as multithreading for shared memory.
• Explicit parallelism is more common with message passing.
◦ User has explicit control over processes.
◦ Good: control can be used to performance benefit.
◦ Bad: user has to deal with it.
Distributed Memory - Message Passing

mem1 mem2 mem3 memN

proc1 proc2 proc3 procN

network
What does the user have to do?
• This is what we said for shared memory:
◦ Decide how to decompose the computation into parallel parts.
◦ Create (and destroy) processes to support that decomposition.
◦ Add synchronization to make sure dependences are covered.

• Is the same true for message passing?


Another Look at SOR Example
for some number of timesteps/iterations {
for (i=0; i<n; i++ )
for( j=0; j<n, j++ )
temp[i][j] = 0.25 *
( grid[i-1][j] + grid[i+1][j] +
grid[i][j-1] + grid[i][j+1] );
for( i=0; i<n; i++ )
for( j=0; j<n; j++ )
grid[i][j] = temp[i][j];
}
Shared Memory

grid temp
1 1
2 2
3 3
4 4

proc1 proc2 proc3 procN


Message-Passing Data Distribution (only middle processes)

grid grid
2 3
temp temp
2 3

proc2 proc3
Is this going to work?
Same code as we used for shared memory

for( i=from; i<to; i++ )


for( j=0; j<n; j++ )
temp[i][j] = 0.25*( grid[i-1][j] + grid[i+1][j]
+ grid[i][j-1] + grid[i][j+1]);

No, we need extra boundary elements for grid.


Data Distribution (only middle processes)

grid grid

2 3
temp temp
2 3

proc2 proc3
Is this going to work?
Same code as we used for shared memory
for( i=from; i<to; i++)
for( j=0; j<n; j++ )
temp[i][j] = 0.25*( grid[i-1][j] + grid[i+1][j]
+ grid[i][j-1] + grid[i][j+1]);

No, on the next iteration we need boundary elements from our neighbors.
Data Communication (only middle processes)

grid grid

proc2 proc3
Is this now going to work?
Same code as we used for shared memory
for( i=from; i<to; i++ )
for( j=0; j<n; j++ )
temp[i][j] = 0.25*( grid[i-1][j] + grid[i+1][j]
+ grid[i][j-1] + grid[i][j+1]);

No, we need to translate the indices.


Index Translation
for( i=0; i<n/p; i++)
for( j=0; j<n; j++ )
temp[i][j] = 0.25*( grid[i-1][j] + grid[i+1][j]
+ grid[i][j-1] + grid[i][j+1]);

Remember, all variables are local.


Index Translation is Optional
• Allocate the full arrays on each processor.
• Leave indices alone.
• Higher memory use.
• Sometimes necessary.
What does the user need to do?
• Divide up program in parallel parts.
• Create and destroy processes to do above.
• Partition and distribute the data.
• Communicate data at the right time.
• (Sometimes) perform index translation.
• Still need to do synchronization?
◦ Sometimes, but many times goes hand in hand with data communication.
Message Passing Systems
• Provide process creation and destruction.
• Provide message passing facilities (send and receive) to distribute and communicate data.
• Provide additional synchronization facilities.
MPI (Message Passing Interface)
• The Message Passing Interface (MPI) is an Application Program Interface that defines a model
of parallel computing where each parallel process has its own local memory.
• The data must be explicitly shared by passing messages between processes.
• Using MPI allows programs to scale beyond the processors and shared memory of a single
compute server, to the distributed memory and processors of multiple compute servers
combined together.
• Grew out of an earlier message passing system, PVM, now outdated.
MPI Process Creation/Destruction
MPI_Init( int argc, char **argv )
Initiates a computation.

MPI_Finalize()
Terminates a computation.
MPI Process Identification
MPI_Comm_size( comm, &size )
Determines the number of processes.
MPI_Comm_rank( comm, &pid )
Pid is the process identifier of the caller.
MPI Basic Send
MPI_Send(buf, count, datatype, dest, tag, comm)

buf: address of send buffer


count: number of elements
datatype: data type of send buffer elements
dest: process id of destination process
tag: message tag
comm: communicator
MPI Basic Receive
MPI_Recv(buf, count, datatype, source, tag, comm, &status)

buf: address of receive buffer


count: size of receive buffer in elements
datatype: data type of receive buffer elements
source: source process id or MPI_ANY_SOURCE
tag and comm: message tag and communicator
status: status object
MPI Matrix Multiply (without Index Translation)
main(int argc, char *argv[])
{
MPI_Init (&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
from = (myrank * n)/p;
to = ((myrank+1) * n)/p;
/* Data distribution */ ...
/* Computation */ ...
/* Result gathering */ ...
MPI_Finalize();
}
MPI Matrix Multiply (without Index Translation)
/* Data distribution */
if( myrank != 0 ) {
MPI_Recv( &a[from], n*n/p, MPI_INT, 0, tag,
MPI_COMM_WORLD, &status );
MPI_Recv( &b, n*n, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
} else {
for( i=1; i<p; i++ ) {
MPI_Send( &a[from], n*n/p, MPI_INT, i, tag,
MPI_COMM_WORLD );
MPI_Send( &b, n*n, MPI_INT, I, tag, MPI_COMM_WORLD );
}
}
MPI Matrix Multiply (without Index Translation)
/* Computation */

for ( i=from; i<to; i++)


for (j=0; j<n; j++) {
C[i][j]=0;
for (k=0; k<n; k++)
C[i][j] += A[i][k]*B[k][j];
}
MPI Matrix Multiply (without Index Translation)
/* Result gathering */
if (myrank!=0)
MPI_Send( &c[from], n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD);
else
for (i=1; i<p; i++)
MPI_Recv( &c[from], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD,
&status);
MPI Matrix Multiply (with Index Translation)
main(int argc, char *argv[])
{
MPI_Init (&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
from = (myrank * n)/p;
to = ((myrank+1) * n)/p;
/* Data distribution */ ...
/* Computation */ ...
/* Result gathering */ ...
MPI_Finalize();
}
MPI Matrix Multiply (with Index Translation)
/* Data distribution */
if( myrank != 0 ) {
MPI_Recv( &a, n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
MPI_Recv( &b, n*n, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
} else {
for( i=1; i<p; i++ ) {
MPI_Send( &a[from], n*n/p, MPI_INT, i, tag,
MPI_COMM_WORLD );
MPI_Send( &b, n*n, MPI_INT, I, tag, MPI_COMM_WORLD );
}
}
MPI Matrix Multiply (with Index Translation)
/* Computation */

for ( i=0; i<n/p; i++)


for (j=0; j<n; j++) {
C[i][j]=0;
for (k=0; k<n; k++)
C[i][j] += A[i][k]*B[k][j];
}
MPI Matrix Multiply (with Index Translation)
/* Result gathering */
if (myrank!=0)
MPI_Send( &c, n*n/p, MPI_INT, 0, tag, MPI_COMM_WORLD);
else
for( i=1; i<p; i++ )
MPI_Recv( &c[from], n*n/p, MPI_INT, i, tag, MPI_COMM_WORLD,
&status);
Running a MPI Program
• mpirun <program_name> <arguments>
• Interacts with a daemon process on the hosts.
• Causes a Unix process to be run on each of the hosts.
• May only run in interactive mode (batch mode may be blocked by ssh)
MPI Process Creation/Destruction
MPI_Init( int *argc, char **argv )
Initiates a computation.

MPI_Finalize()
Finalizes a computation.
MPI Process Identification
MPI_Comm_size( comm, &size )
Determines the number of processes.
MPI_Comm_rank( comm, &pid )
Pid is the process identifier of the caller.
Global Operations
• So far, we have only looked at point-to-point or one-to-one message passing
facilities.
• Often, it is useful to have one-to-many or many-to-one message
communication.
• This is what MPI’s global operations do.
Global Operations (cont.)
• MPI_Barrier
• MPI_Bcast
• MPI_Gather
• MPI_Scatter
• MPI_Reduce
• MPI_Allreduce
Barrier
MPI_Barrier(comm)

Global barrier synchronization, as before: all processes wait until all have arrived.
Broadcast
MPI_Bcast(inbuf, incnt, intype, root, comm)

inbuf: address of input buffer (on root); address of output buffer (elsewhere)
incnt: number of elements
intype: type of elements
root: process id of root process
Before Broadcast
inbuf

proc0 proc1 proc2 proc3

root
After Broadcast
inbuf

proc0 proc1 proc2 proc3

root
Scatter
MPI_Scatter(inbuf, incnt, intype, outbuf, outcnt, outtype, root, comm)
inbuf: address of input buffer
incnt: number of input elements
intype: type of input elements
outbuf: address of output buffer
outcnt: number of output elements
outtype: type of output elements
root: process id of root process
Before Scatter
inbuf
outbuf

proc0 proc1 proc2 proc3


root
After Scatter
inbuf
outbuf

proc0 proc1 proc2 proc3

root
Gather
MPI_Gather(inbuf, incnt, intype, outbuf, outcnt, outtype, root, comm)
inbuf: address of input buffer
incnt: number of input elements
intype: type of input elements
outbuf: address of output buffer
outcnt: number of output elements
outtype: type of output elements
root: process id of root process
Before Gather
inbuf
outbuf

proc0 proc1 proc2 proc3


root
After Gather
inbuf
outbuf

proc0 proc1 proc2 proc3

root
MPI Matrix Multiply (without Index Translation)
main(int argc, char *argv[])
{
MPI_Init (&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
for( i=0; i<p; i++ ) {
from[i] = (i * n)/p;
to[i] = ((i+1) * n)/p;
}
/* Data distribution */ ...
/* Computation */ ...
/* Result gathering */ ...
MPI_Finalize();
}
MPI Matrix Multiply (without Index Translation)
/* Data distribution */
if( myrank != 0 ) {
MPI_Recv( &a[from[myrank]], n*n/p, MPI_INT, 0, tag,
MPI_COMM_WORLD, &status );
MPI_Recv( &b, n*n, MPI_INT, 0, tag, MPI_COMM_WORLD, &status );
} else {
for( i=1; i<p; i++ ) {
MPI_Send( &a[from[i]], n*n/p, MPI_INT, i, tag,
MPI_COMM_WORLD );
MPI_Send( &b, n*n, MPI_INT, I, tag, MPI_COMM_WORLD );
}
}
MPI Matrix Multiply (without Index Translation)
/* Computation */

for ( i=from[myrank]; i<to[myrank]; i++)


for (j=0; j<n; j++) {
C[i][j]=0;
for (k=0; k<n; k++)
C[i][j] += A[i][k]*B[k][j];
}
MPI Matrix Multiply (without Index Translation)
/* Result gathering */
if (myrank!=0)
MPI_Send( &c[from[myrank]], n*n/p, MPI_INT, 0, tag,
MPI_COMM_WORLD);
else
for( i=1; i<p; i++ )
MPI_Recv( &c[from[i]], n*n/p, MPI_INT, i, tag,
MPI_COMM_WORLD, &status);
MPI Matrix Multiply Revised
main(int argc, char *argv[])
{
MPI_Init (&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &p);
from = (myrank * n)/p;
to = ((myrank+1) * n)/p;
MPI_Scatter (a, n*n/p, MPI_INT, a, n*n/p, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (b,n*n, MPI_INT, 0, MPI_COMM_WORLD);
...
MPI Matrix Multiply Revised (cont.)
...
for (i=from; i<to; i++)
for (j=0; j<n; j++) {
C[i][j]=0;
for (k=0; k<n; k++)
C[i][j] += A[i][k]*B[k][j];
}
MPI_Gather (C[from], n*n/p, MPI_INT, c[from], n*n/p,
MPI_INT, 0, MPI_COMM_WORLD);
MPI_Finalize();
}
Reduction
MPI_Reduce(inbuf, outbuf, count, type, op, root, comm)

inbuf: address of input buffer


outbuf: address of output buffer
count: number of elements in input buffer
type: datatype of input buffer elements
op: operation (MPI_MIN, MPI_MAX, etc.)
root: process id of root process
Global Reduction
MPI_Allreduce(inbuf, outbuf, count, type, op, comm)

inbuf: address of input buffer


outbuf: address of output buffer
count: number of elements in input buffer
type: datatype of input buffer elements
op: operation (MPI_MIN, MPI_MAX, etc.)
no root process

You might also like