BCS702 Module 4
BCS702 Module 4
MODULE 4
Shared-memory programming with OpenMP
OpenMP, short for Open Multi-Processing, is a set of compiler directives, runtime library routines, and
environment variables that allow us to write parallel programs in C, C++, and Fortran.
• It is designed for shared-memory systems, where multiple threads operate in a common address
space.
• OpenMP is portable, scalable, and easy to use, making it ideal for scientific computing, simulations,
and performance-critical applications.
.IN
C
N
SY
OpenMP provides a “directives-based” shared-memory API. In C and C++, this means that there are special
preprocessor instructions known as pragmas. Pragmas are typically added to a system to allow behaviours
that aren’t part of the basic C specification.
Pragmas in C and C++ start with
#pragma
.IN
C
N
SY
Line number 11 of our hello, world program, the word parallel is called directive.
# pragma omp parallel
U
The parallel directive, it specifies that the structured block of code that follows should be executed by multiple
threads. The number of threads that run the structured block of code will be determined by the run-time system.
VT
Recall that thread is short for thread of execution. The name is meant to suggest a sequence of statements
executed by a program. Threads are typically started or forked by a process, and they share most of the resources
of the process that starts them—for example, access to stdin and stdout—but each thread has its own stack and
program counter. When a thread completes execution, it joins the process that started it.
What actually happens when the program gets to the parallel directive?
Prior to the parallel directive, the program is using a single thread, the process started when the program
started execution. When the program reaches the parallel directive, the original thread continues executing and
thread_count − 1 additional threads are started.
In OpenMP parlance, the collection of threads executing the parallel block—the original thread and the new
threads—is called a team.
Each thread in the team executes the block following the parallel directive, so in our example, each thread
calls the Hello function.
.IN
OpenMP thread terminology includes the following:
Since each thread has its own stack, a thread executing the Hello function will create its own private, local
variables in the function. In our example, when the function is called, each thread will get its rank or ID and
the number of threads in the team by calling the OpenMP functions omp_get_thread_num and
omp_get_num_threads, respectively.
U
Note that the threads are competing for access to stdout, so there is no guarantee that the output will appear
in thread-rank order.
.IN
THE FUNCTIONS:
C
N
SY
The strtol function in C is used to convert a string to a long integer. It is part of the C standard
library, declared in <stdlib.h>.
U
VT
Use the trapezoidal rule to approximate the area under the curve of a function, y = f(x) (See Figure. 5.3 a.)
.IN
and the sum of the areas of the trapezoids—our approximation to the total area—is
C
Thus, pseudocode for a serial program might look something like this
N
SY
U
VT
Since we still need to add up the threads’ results. An obvious solution is to use a shared
variable for the sum of all the threads’ results, and each thread can add its (private) result into
the shared variable.
We would like to have each thread execute a statement that looks something like
.IN
This can result in an erroneous value for global_result—if two (or more) threads attempt to
C
simultaneously execute this statement, the result will be unpredictable.
N
SY
U
VT
We see that the value computed by thread 0 (my_result =1) is overwritten by thread 1.
Of course, the actual sequence of events might be different, but unless one thread finishes the computation
global_result += my_result before the other starts, the result will be incorrect.
Recall that this is an example of a race condition: multiple threads are attempting to access a shared resource,
at least one of the accesses is an update, and the accesses can result in an error. Also recall that the code that
causes the race condition, global_result += my_result, is called a critical section. A critical section
is code executed by multiple threads that updates a shared resource, and the shared resource can only be updated
by one thread at a time.
In OpenMP uses the critical directive
.IN
C
N
SY
U
VT
In the Trap function, each thread gets its rank and the total number of threads in the team started by the parallel
directive. Then each thread determines the following:
1. The length of the bases of the trapezoids (Line 33),
2. The number of trapezoids assigned to each thread (Line 34),
3. The left and right endpoints of its interval (Lines 35 and 36, respectively)
4. Its contribution to global_result (Lines 37–42).
The threads finish by adding in their individual results to global_result in Lines 44–45. We use the prefix
local_ for some variables to emphasize that their values may differ from the values of corresponding variables
in the main function—for example, local_a may differ from a, although it is the thread’s left endpoint.
Notice that unless n is evenly divisible by thread_count, we’ll use fewer than n trapezoids for
global_result. For example, if n = 14 and thread_count = 4, each thread will compute
Thus each thread will only use 3 trapezoids, and global_result will be computed with 4 ×3=12 trapezoids
.IN
instead of the requested 14.
Since each thread is assigned a block of local_n trapezoids, the length of each thread’s interval will be
local_n∗h, so the left endpoints will be C
N
SY
we assign
U
Furthermore, since the length of each thread’s interval will be local_n∗h, its right endpoint will just be
VT
If it were private to each thread, there would be no need for the critical directive. Furthermore, if it were
private, we would have a hard time determining the value of global_result in main after completion of the
parallel block.
To summarize
1. Variables that have been declared before a parallel directive have shared scope among the threads in the
team
2. Variables declared in the block (e.g., local variables in functions) have private scope.
.IN
3. The value of a shared variable at the beginning of the parallel block is the same as the value before the
block, and, after completion of the parallel block, the value of the variable is the value at the end of the
block.
4. The default scope of a variable can change with other directives
C
4.4 The reduc�on clause
N
In OpenMP, the reduc�on clause is used to safely perform a reducon operaon on variables shared among
SY
mulple threads. When mulple threads need to compute a collecve result, such as the sum of elements or
nding the maximum value, a reducon ensures that the nal result is computed correctly without race cond
ions.
U
A reduction operator is an associative binary operation (such as addition or multiplication), and a reduction
is a computation that repeatedly applies the same reduction operator to a sequence of operands to get a single
result. Furthermore, all of the intermediate results of the operation should be stored in the same variable: the
reduction variable.
When a variable is included in a reduction clause, the variable itself is shared. However, a private variable is
created for each thread in the team. In the parallel block each time a thread executes a statement involving the
variable, it uses the private variable. When the parallel block ends, the values in the private variables are
combined into the shared variable.
One final point to note is that the threads’ private variables are initialized to 0. This is analogous to our
initializing my_result to zero. In general, the private variables created for a reduction clause are initialized to
the identity value for the operator. For example, if the operator is multiplication, the private variables would be
initialized to 1. Table 4.1 for the entire list
.IN
Table 4.1 Identity values for the various reduction opera tors in OpenMP.
C
N
SY
U
The parallel for directive is used in OpenMP to parallelize loops, dividing loop iterations
among multiple threads so that work can be executed simultaneously.
Using it, we can parallelize the serial trapezoidal rule
• The parallel for directive forks a team of threads to execute the following structured block.
• However, the structured block following the parallel for directive must be a for loop.
• With the parallel for directive the system parallelizes the for loop by dividing the iterations of the loop
among the threads.
• In a for loop that has been parallelized with a parallel for directive, the default partitioning of the
iterations among the threads is up to the system. However, most systems use roughly a block
partitioning, that is, if there are m iterations, then roughly the first m/thread_count are assigned to
thread 0, the next m/thread_count are assigned to thread 1, and so on .
4.5.1 Caveats
• OpenMP will only parallelize for loops for which the number of iteraons can be determined
.IN
C
N
SY
• OpenMP will only parallelize for loops—it won’t parallelize while loops or do−while loops
U
directly.
• The “infinite loop” cannot be parallelized
VT
Let us try parallelizing the for loop with a parallel for directive:
The compiler will create an executable without complaint. However, if we try running it with more
.IN
than one thread, we may find that the results are, at best, unpredictable.
1 1 2 3 5 8 13 21 34 55,
C
which is correct. However, we also occasionally get
1 1 2 3 5 8 0 0 0 0.
N
We see two important points here:
SY
1. OpenMP compilers don’t check for dependences among iterations in a loop that is being
parallelized with a parallel for directive. It is up to the programmers to identify these
dependences.
2. A loop in which the results of one or more iterations depend on other iterations cannot, in
U
general, be correctly parallelized by OpenMP without using features such as the Tasking API.
(See Section 4.10).
VT
The dependence of the computation of fibo[i] on the computation of fibo[i-1] is called a data dependence.
Since the value of fibo[i-1] is calculated in one iteration, and the result is used in a subsequent iteration, the
dependence is sometimes called a loop-carried dependence.
Since the computation of x[i] and its subsequent use will always be assigned to the same thread, there is no
issue.
Also note that at least one of the statements must write or update the variable in order for the statements to
represent a dependence. Therefore, to detect a loop-carried dependence, we should only consider variables that
are updated within the loop body. Specifically, we should look for variables that are read or written in one
iteration and written again in another
.IN
%2==0), each odd-subscripted element, a[i], is compared to the element to its “left,” a[i−1], and if they’re
out of order, they’re swapped. During an “odd” phase, each odd-subscripted element is compared to the element
to its right, and if they are out of order, they are swapped. A theorem guarantees that after n phases, the list will
be sorted.
C
Example: Suppose a = {9, 7, 8, 6}. Then the phases are shown in Table 4.2. In this case, the final phase was
N
not necessary, but the algorithm does not bother checking whether the list is already sorted before carrying out
each phase.
SY
It is not hard to see that the outer loop has a loop-carried dependence. As an example, suppose as before that a
= {9, 7, 8, 6}. Then in phase 0 the inner loop will compare elements in the pairs (9,7) and (8,6), and both pairs
are swapped. So for phase 1, the list should be {7, 9, 6, 8}, and during phase 1 the elements in the pair (9,6)
should be compared and swapped. However, if phase 0 and phase 1 are executed simultaneously, the pair that’s
U
checked in phase 1 might be (7,8), which is in order. Furthermore, it is not clear how one might eliminate this
loop-carried dependence, so it would appear that parallelizing the outer for loop is not an option.
VT
The inner for loops, however, don’t appear to have any loop-carried dependences. For example, in an even
phase loop variable i will be odd, so for two distinct values of i, say i=j and i=k, the pairs {j−1,j} and {k−1,
k} will be disjoint. The comparison and possible swaps of the pairs (a[j−1], a[j]) and (a[k−1], a[k]) can
therefore proceed simultaneously.
.IN
C
N
SY
the parallel directive, the parallel for directive has an implicit barrier at the end of the loop.
As a result, no thread will proceed to the next phase (p+1) until all threads have finished the current
VT
phase (p).
2. The overhead associated with forking and joining threads. The OpenMP implementation may fork and
join thread_count threads during each pass through the body of the outer loop. The first row of
Table 4.3 shows the runtimes for 1, 2, 3, and 4 threads on one of the systems when the input list
contained 20,000 elements.
Table 4.3 Odd-even sort with two parallel for directives and two for directives. Times are in seconds.
These runtimes are not terrible, but we can try to improve them.
We can fork our team of thread_count threads before the outer loop using a parallel directive. Then,
instead of forking a new team of threads for each execution of an inner loop, we use a for directive, which
instructs OpenMP to parallelize the loop with the existing team of threads. This modification to the original
OpenMP implementation is shown in Program 4.5
.IN
C
N
Program 4.5: Second OpenMP implementation of odd-even sort.
SY
The parallel for directive, the exact assignment of loop iterations to threads is system dependent. However
VT
most OpenMP implementations use roughly a block partitioning: if there are n iterations in the se rial loop,
then in the parallel loop the first n/thread_count are assigned to thread 0, the next n/thread_count
are assigned to thread 1, and so on.
For example, suppose we want to parallelize the loop
A better assignment of work to threads might be obtained with a cyclic partitioning of the iterations among the
threads. In a cyclic partitioning, the iterations are assigned, one at a time, in a “round-robin” fashion to the
threads. Suppose t = thread_count. Then a cyclic partitioning will assign the iterations as follows:
In our example, we already know how to obtain the default schedule: we just add a parallel for directive
with a reduction clause:
To get a cyclic schedule, we can add a schedule clause to the parallel for directive:
.IN
The type can be any one of the following:
C
• static. The iterations can be assigned to the threads before the loop is executed.
N
• dynamic or guided. The iterations are assigned to the threads while the loop is executing, so
after a thread completes its current set of iterations, it can request more from the run-time system.
SY
• auto. The compiler and/or the run-time system determine the schedule.
• runtime. The schedule is determined at run-time based on an environment variable
U
The chunksize is a positive integer. In OpenMP parlance, a chunk of iterations is a block of iterations that
would be executed consecutively in the serial loop. The number of iterations in the block is the chunksize.
VT
Only static, dynamic, and guided schedules can have a chunksize. Fig. 4.5 provides a visualization
of how work is scheduled using the static, dynamic, and guided types.
.IN
C
N
SY
FIGURE 4.5 Scheduling visualization for the static, dynamic, and guided schedule types with 4 threads
U
and 32 iterations. The first static schedule uses the default chunksize, whereas the second uses a
chunksize of 2. The exact distribution of work across threads will vary between different executions of the
VT
program for the dynamic and guided schedule types, so this visualization shows one of many possible
scheduling outcomes
The default schedule is defined by your particular implementation of OpenMP, but in most
cases it is equivalent to the clause
Note that the chunksize can be omitted. If omitted, the chunksize is approximately
total_iterations / thread_count.
The static schedule is a good choice when each loop iteration takes roughly the same amount of time to
.IN
compute. It also has the advantage that threads in subsequent loops with the same number of iterations will be
assigned to the same ranges; this can improve the speed of memory accesses, particularly on NUMA systems
The primary di�erence between static and dynamic schedules is that the dynamic schedule assigns
ranges to threads on a rst-come, rst-served basis. This can be advantageous if loop iteraons do not take a
uniform amount of me to compute.
U
VT
Table 4.4 Assignment of trapezoidal rule iteraons 1–9999 using a guided schedule with two threads.
The guided schedule is similar to the dynamic schedule in that each thread executes a chunk of iterations
and then requests another chunk once it finishes. However, in a guided schedule, the size of the chunks
decreases as more chunks are completed.
For example, consider running the trapezoidal rule program with the parallel for directive and a
schedule(guided) clause. If � = 10,000and the thread count is 2, the iterations are assigned as shown in
Table 4.4. The size of each chunk is approximately equal to the number of remaining iterations divided by the
number of threads:
9999
• The first chunk size is ≈ 5000, since there are 9999 unassigned iterations.
2
4999
• The second chunk size is ≈ 2500.
2
• If no chunk size is specified, the chunk size decreases all the way down to 1.
.IN
• If a chunk size is specified, the size decreases down to that specified value, except that the very last
chunk may be smaller.
This scheduling method can improve load balancing across threads, especially when later iterations are
more computationally intensive.
C
N
4.7.4 The run�me schedule type
When schedule(runtime) is specified, the system uses the environment variable OMP_SCHEDULE to
SY
determine at run-time how to schedule the loop. The OMP_SCHEDULE environment variable can take on any
of the values that can be used for a static, dynamic, or guided schedule. For example, suppose we
have a parallel for directive in a program and it has been modified by schedule(runtime). Then if we
use the bash shell, we can get a cyclic assignment of iterations to threads by executing the command
U
$ export OMP_SCHEDULE="static,1"
VT
Now, when we start executing our program, the system will schedule the iterations of the for loop as if we had
the clause schedule(static,1) modifying the parallel for directive.
4.8.1 Queues
• A queue is a list abstract datatype in which new elements are inserted at the “rear” of the queue and
elements are removed from the “front” of the queue.
• When a new entryis added to the rear of a queue, we sometimes say that the entry has been
“enqueued,” and when an entry is removed from the front of a queue, we sometimes say that the
entry has been “dequeued.”
• Queues occur frequently in computer science. For example, if we have a number of processes, each
of which wants to store some data on a hard drive, then a natural way to ensure that only one process
writes to the disk at a time is to have the processes form a queue.
• A queue is also a natural data structure to use in many multithreaded applications. For example,
suppose we have several “producer” threads and several “consumer” threads.
• The producer threads might “produce” requests for data from a server— for example, current stock
prices—while the consumer threads might “consume” the request by finding or generating the
requested data—the current stock prices.
4.8.2 Message-passing
• Another natural application would be implementing message-passing on a shared memory system.
• Each thread could have a shared-message queue, and when one thread wanted to “send a message” to
another thread, it could enqueue the message in the destination thread’s queue.
• A thread could receive a message by dequeuing the message at the head of its message queue.
Let us consider a program that consists of multiple threads that communicate through message passing. Each
thread repeatedly generates a random integer message along with a random destination thread, then enqueues
the message into that destination’s message queue. After sending a message, the thread checks its own queue;
if a message is available, it dequeues and prints it. Each thread alternates between sending and receiving until
it has sent a user-specified number of messages. Once finished sending, a thread continues receiving messages
.IN
until all threads have completed their sending, after which all threads terminate.
Pseudocode for each thread might look something like this:
C
N
SY
that keeps track of the rear of the queue is necessary. For example, if we use a singly linked list with the tail of
the list corresponding to the rear of the queue, then, to efficiently enqueue, we would want to store a pointer to
VT
the rear. When we enqueue a new message, we will need to check and update the rear pointer. If two threads try
to do this simultaneously, we may lose a message that has been enqueued by one of the threads. The results of
the two operations will conflict, and hence enqueuing a message will form a critical section.
Pseudocode for the Send_msg() function might look something like this:
Now you may be thinking, “What about the variable storing the size of the queue?” This would be a problem
if we simply store the size of the queue. However, if we store two variables, enqueued and dequeued, then the
number of messages in the queue is
queue_size = enqueued – dequeued
and the only thread that will update dequeued is the owner of the queue. Observe that one thread can update
enqueued at the same time that another thread is using it to compute queue_size. To see this, let’s suppose
thread q is computing queue_size. It will either get the old value of enqueued or the new value. It may
therefore compute a queue_size of 0 or 1 when queue_size should actually be 1 or 2, respectively, but
in our program this will only cause a modest delay. Thread q will try again later if queue_size is 0 when it
should be 1, and it will execute the critical section directive unnecessarily if queue_size is 1 when it should
be 2
Thus we can implement Try_receive as follows:
.IN
C
N
4.8.5 Termina�on detec�on
We also need to think about implementation of the Done function.
SY
U
VT
If thread u executes this code, it’s entirely possible that some thread v will send a message to thread u after u
has computed queue_size =0. Of course, after thread u computes queue_size =0, it will terminate
and the message sent by thread v will never be received. However, in our program, after each thread has
completed the for loop, it won’t send any new messages. Thus if we add a counter done_sending, and each
thread increments this after completing its for loop, then we can implement Done as follows:
Unlike the critical direcve, it can only protect crical secons that consist of a single C assignment
statement. Further, the statement must have one of the following forms:
.IN
• Note that <expression> must not reference x. It should be noted that only the load and store of x
are guaranteed to be protected.
For example, in the code C
# pragma omp atomic
x += y++;
N
a thread’s update to x will be completed before any other thread can begin updating x. However, the
SY
sections.
VT
From OpenMP’s point of view our program has two distinct critical sections: the critical section
protected by the atomic directive, (done_sending++), and the “composite” critical section in which we
enqueue and dequeue messages.
Since enforcing mutual exclusion among threads serializes execution, this default behavior of OpenMP—
treating all critical blocks as part of one composite critical section—can be highly detrimental to our program’s
performance. OpenMP does provide the option of adding a name to a critical directive:
# pragma omp critical (name)
When we do this, two blocks protected with critical directives with different names can be executed
simultaneously. However, the names are set during compilation, and we want a different critical section
for each thread’s queue. Therefore we need to set the names at run-time, and in our setting, when we want to
allow simultaneous access to the same block of code by threads accessing different queues, the named critical
directive isn’t sufficient.
The alternative is to use locks.
A lock consists of a data structure and functions that allow the programmer to explicitly enforce mutual
exclusion in a critical section. The use of a lock can be described by the following pseudocode:
.IN
The lock data structure is shared among the threads that will execute the critical section. One of the
threads (e.g., the master thread) will initialize the lock, and when all the threads are done using the lock, one of
the threads should destroy it.
C
Before a thread enters the critical section, it attempts to set the lock by calling the lock function. If no
N
other thread is executing code in the critical section, it acquires the lock and proceeds into the critical section
past the call to the lock function. When the thread finishes the code in the critical section, it calls an unlock
SY
function, which releases or unsets the lock and allows another thread to acquire the lock.
While a thread owns the lock, no other thread can enter the critical section. If another thread attempts
to enter the critical section, it will block when it calls the lock function. If multiple threads are blocked in a call
to the lock function, then when the thread in the critical section releases the lock, one of the blocked threads
U
returns from the call to the lock, and the others remain blocked.
VT
OpenMP has two types of locks: simple locks and nested locks. A simple lock can only be set once
before it is unset, while a nested lock can be set multiple times by the same thread before it is unset. The type
of an OpenMP simple lock is omp_lock_t, and the simple lock functions that we’ll be using are
.IN
Now when a thread tries to send or receive a message, it can only be blocked by a thread attempting to
access the same message queue, since different message queues have different locks. In our original
implementation, only one thread could send at a time, regardless of the destination.
C
Note that it would also be possible to put the calls to the lock functions in the queue functions Enqueue
N
and Dequeue. However, to preserve the performance of Dequeue, we would also need to move the code that
determines the size of the queue (enqueued– dequeued) to Dequeue. Without it, the Dequeue function
SY
will lock the queue every time it is called by Try_receive. In the interest of preserving the structure of the
code we’ve already written, we’ll leave the calls to omp_set_lock and omp_unset_lock in the Send and
Try_receive functions.
U
4.10 Tasking
while and do−while loops cannot be parallelized with OpenMP, nor can for loops that have an unbounded
number of iterations. This poses potential issues for dynamic problems, including recursive algorithms, such as
graph traversals, or producer-consumer style programs like web servers.
To address these issues, OpenMP 3.0 introduced Tasking functionality. Tasking has been successfully applied
to a number of problems that were previously difficult to parallelize with OpenMP.
Tasking allows developers to specify independent units of computation with the task directive:
• When a thread reaches a block of code with this directive, a new task is generated by the OpenMP run-
time that will be scheduled for execution.
• It is important to note that the task will not necessarily be executed immediately, since there may be
other tasks already pending execution.
• Task blocks behave similarly to a standard parallel region, but can launch an arbitrary number of
tasks instead of only num_threads.
Therefore a majority of programs that use Tasking functionality will contain an outer region that looks
somewhat like:
where the parallel directive creates a team of threads and the single directive in structs the runtime to
only launch tasks from a single thread.
• If the single directive is omitted, subsequent task instances will be launched multiple times, one for
each thread in the team.
.IN
• To demonstrate OpenMP tasking functionality, consider Fibonacci sequence number generation
program. Due to the loop-carried dependence, results were unpredictable and, often incorrect.
• However, we can parallelize this algorithm with the task directive. First, let’s take a look at a recursive
C
serial implementation that stores the sequence in a global array called fibs:
N
SY
U
VT
the results in i and j are lost: i and j retain their values from the initializations
int i = 0;
int j = 0;
.IN
C
N
SY
U