0% found this document useful (0 votes)
16 views26 pages

BCS702 Module 4

Uploaded by

joshismita9857
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views26 pages

BCS702 Module 4

Uploaded by

joshismita9857
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

MODULE 4 Shared-memory programming with OpenMP

MODULE 4
Shared-memory programming with OpenMP

OpenMP, short for Open Multi-Processing, is a set of compiler directives, runtime library routines, and
environment variables that allow us to write parallel programs in C, C++, and Fortran.
• It is designed for shared-memory systems, where multiple threads operate in a common address
space.
• OpenMP is portable, scalable, and easy to use, making it ideal for scientific computing, simulations,
and performance-critical applications.

.IN
C
N
SY

Figure 4.1 : Shared memory system


U

4.1 A “hello, world” program uses OpenMP


VT

OpenMP provides a “directives-based” shared-memory API. In C and C++, this means that there are special
preprocessor instructions known as pragmas. Pragmas are typically added to a system to allow behaviours
that aren’t part of the basic C specification.
Pragmas in C and C++ start with
#pragma

• OpenMP pragmas always begin with


# pragma omp

Manjusha, Asst. Prof, CSE Dept, SVIT. 1

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

.IN
C
N
SY

Line number 11 of our hello, world program, the word parallel is called directive.
# pragma omp parallel
U

The parallel directive, it specifies that the structured block of code that follows should be executed by multiple
threads. The number of threads that run the structured block of code will be determined by the run-time system.
VT

Recall that thread is short for thread of execution. The name is meant to suggest a sequence of statements
executed by a program. Threads are typically started or forked by a process, and they share most of the resources
of the process that starts them—for example, access to stdin and stdout—but each thread has its own stack and
program counter. When a thread completes execution, it joins the process that started it.

Figure 4.2: A process forking and joining two threads.

Manjusha, Asst. Prof, CSE Dept, SVIT. 2

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
• A clause in OpenMP is just some text that modifies a directive. The num_threads clause can be
added to a parallel directive. It allows the programmer to specify the number of threads that should
execute the following block:

# pragma omp parallel num_threads(thread_count)

What actually happens when the program gets to the parallel directive?
Prior to the parallel directive, the program is using a single thread, the process started when the program
started execution. When the program reaches the parallel directive, the original thread continues executing and
thread_count − 1 additional threads are started.
In OpenMP parlance, the collection of threads executing the parallel block—the original thread and the new
threads—is called a team.
Each thread in the team executes the block following the parallel directive, so in our example, each thread
calls the Hello function.

.IN
OpenMP thread terminology includes the following:

• master: the first thread of execution, or thread 0. C


• parent: thread that encountered a parallel directive and started a team of threads. In many cases, the
parent is also the master thread.
• child: each thread started by the parent is considered a child thread
N
SY

Since each thread has its own stack, a thread executing the Hello function will create its own private, local
variables in the function. In our example, when the function is called, each thread will get its rank or ID and
the number of threads in the team by calling the OpenMP functions omp_get_thread_num and
omp_get_num_threads, respectively.
U

The rank or ID of a thread is an int that is in the range 0,1,...,thread_count−1


VT

Compiling and running OpenMP programs


To compile this with gcc :
$ gcc−g−Wall−fopenmp−o omp_hello omp_hello.c
To run the program, specify the number of threads on the command line. For example, run the
program with four threads
$ ./omp_hello 4

Note that the threads are competing for access to stdout, so there is no guarantee that the output will appear
in thread-rank order.

Manjusha, Asst. Prof, CSE Dept, SVIT. 3

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
For example, the output might also be

or any other permutation of the thread ranks.

.IN
THE FUNCTIONS:

C
N
SY

The strtol function in C is used to convert a string to a long integer. It is part of the C standard
library, declared in <stdlib.h>.
U
VT

4.2 The trapezoidal rule


Consider the trapezoidal rule for es⿿ma⿿ng the area under a curve.

FIGURE 5.3: The trapezoidal rule.

Use the trapezoidal rule to approximate the area under the curve of a function, y = f(x) (See Figure. 5.3 a.)

Manjusha, Asst. Prof, CSE Dept, SVIT. 4

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
The basic idea is to divide the interval on the x-axis into n equal subintervals. Then we approximate
the area lying between the graph and each subinterval by a trapezoid, whose base is the subinterval, whose
vertical sides are the vertical lines through the end points of the subinterval, and whose fourth side is the secant
line joining the points where the vertical lines cross the graph. (See Figure. 5.3b.)
If the endpoints of the subinterval are xi and xi+1, then the length of the subinterval is h = xi+1 − xi. Also, if the
lengths of the two vertical segments are f(xi) and f(xi+1), then the area of the trapezoid is
Since we chose the n subintervals so that they would all have the same length, we also know that if the vertical
lines bounding the region are x = a and x = b, then

Thus if x0 = a and xn = b, we have that

.IN
and the sum of the areas of the trapezoids—our approximation to the total area—is
C
Thus, pseudocode for a serial program might look something like this
N
SY
U
VT

FIGURE 5.4 Assignment of trapezoids to threads.

Manjusha, Asst. Prof, CSE Dept, SVIT. 5

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
Applying Foster’s parallel program design methodology to the trapezoidal rule as described in the
following list:
1. We identified two types of jobs:
a. Computation of the areas of individual trapezoids
b. Adding the areas of trapezoids.
2. There is no communication among the jobs in the first collection, but each job in the first
collection communicates with job 1b.
3. We assumed that there would be many more trapezoids than cores, so we aggregated jobs by
assigning a contiguous block of trapezoids to each thread (and a single thread to each core).
Effectively, this partitioned the interval [a, b] into larger subintervals, and each thread simply
applied the serial trapezoidal rule to its subinterval. See Fig. 5.4.

Since we still need to add up the threads’ results. An obvious solution is to use a shared
variable for the sum of all the threads’ results, and each thread can add its (private) result into
the shared variable.

We would like to have each thread execute a statement that looks something like

.IN
This can result in an erroneous value for global_result—if two (or more) threads attempt to
C
simultaneously execute this statement, the result will be unpredictable.
N
SY
U
VT

We see that the value computed by thread 0 (my_result =1) is overwritten by thread 1.

Of course, the actual sequence of events might be different, but unless one thread finishes the computation
global_result += my_result before the other starts, the result will be incorrect.

Recall that this is an example of a race condition: multiple threads are attempting to access a shared resource,
at least one of the accesses is an update, and the accesses can result in an error. Also recall that the code that
causes the race condition, global_result += my_result, is called a critical section. A critical section
is code executed by multiple threads that updates a shared resource, and the shared resource can only be updated
by one thread at a time.
In OpenMP uses the critical directive

Manjusha, Asst. Prof, CSE Dept, SVIT. 6

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

.IN
C
N
SY
U
VT

Manjusha, Asst. Prof, CSE Dept, SVIT. 7

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

In the Trap function, each thread gets its rank and the total number of threads in the team started by the parallel
directive. Then each thread determines the following:
1. The length of the bases of the trapezoids (Line 33),
2. The number of trapezoids assigned to each thread (Line 34),
3. The left and right endpoints of its interval (Lines 35 and 36, respectively)
4. Its contribution to global_result (Lines 37–42).

The threads finish by adding in their individual results to global_result in Lines 44–45. We use the prefix
local_ for some variables to emphasize that their values may differ from the values of corresponding variables
in the main function—for example, local_a may differ from a, although it is the thread’s left endpoint.

Notice that unless n is evenly divisible by thread_count, we’ll use fewer than n trapezoids for
global_result. For example, if n = 14 and thread_count = 4, each thread will compute

Thus each thread will only use 3 trapezoids, and global_result will be computed with 4 ×3=12 trapezoids

.IN
instead of the requested 14.
Since each thread is assigned a block of local_n trapezoids, the length of each thread’s interval will be
local_n∗h, so the left endpoints will be C
N
SY

we assign
U

Furthermore, since the length of each thread’s interval will be local_n∗h, its right endpoint will just be
VT

4.3 Scope of variables


In OpenMP, the scope of a variable refers to the set of threads that can access the variable in a parallel block. A
variable that can be accessed by all the threads in the team has shared scope, while a variable that can only be
accessed by a single thread has private scope.
In the “hello, world” program, the variables used by each thread (my_rank and thread_count) were
declared in the Hello function, which is called inside the parallel block. Consequently, the variables used by
each thread are allocated from the thread’s (private) stack, and hence all of the variables have private scope.
The variables that are declared in the main function (a, b, n, global_result, and thread_count)
are all accessible to all the threads in the team started by the parallel directive. Hence, the default scope for
variables declared before a parallel block is shared.

Manjusha, Asst. Prof, CSE Dept, SVIT. 8

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
In the Trap function, although global_result_p is a private variable, it refers to the variable
global_result which was declared in main before the parallel directive, and the value of
global_result is used to store the result that’s printed out after the parallel block.

Thus in the code

It is essential that ∗global_result_p have shared scope.

If it were private to each thread, there would be no need for the critical directive. Furthermore, if it were
private, we would have a hard time determining the value of global_result in main after completion of the
parallel block.
To summarize
1. Variables that have been declared before a parallel directive have shared scope among the threads in the
team
2. Variables declared in the block (e.g., local variables in functions) have private scope.

.IN
3. The value of a shared variable at the beginning of the parallel block is the same as the value before the
block, and, after completion of the parallel block, the value of the variable is the value at the end of the
block.
4. The default scope of a variable can change with other directives
C
4.4 The reduc�on clause
N
In OpenMP, the reduc�on clause is used to safely perform a reduc⿿on opera⿿on on variables shared among
SY

mul⿿ple threads. When mul⿿ple threads need to compute a collec⿿ve result, such as the sum of elements or
῿nding the maximum value, a reduc⿿on ensures that the ῿nal result is computed correctly without race cond
i⿿ons.
U

The syntax of the reduc⿿on clause is


VT

A reduction operator is an associative binary operation (such as addition or multiplication), and a reduction
is a computation that repeatedly applies the same reduction operator to a sequence of operands to get a single
result. Furthermore, all of the intermediate results of the operation should be stored in the same variable: the
reduction variable.

For example, if A is an array of n integer values, the computation

is a reduction in which the reduction operator is addition.


In OpenMP using reduction clause we can modify the trapezoidal global_result code as follows:

Manjusha, Asst. Prof, CSE Dept, SVIT. 9

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

When a variable is included in a reduction clause, the variable itself is shared. However, a private variable is
created for each thread in the team. In the parallel block each time a thread executes a statement involving the
variable, it uses the private variable. When the parallel block ends, the values in the private variables are
combined into the shared variable.

One final point to note is that the threads’ private variables are initialized to 0. This is analogous to our
initializing my_result to zero. In general, the private variables created for a reduction clause are initialized to
the identity value for the operator. For example, if the operator is multiplication, the private variables would be
initialized to 1. Table 4.1 for the entire list

.IN
Table 4.1 Identity values for the various reduction opera tors in OpenMP.

C
N
SY
U

4.5 The parallel for direc�ve


VT

The parallel for directive is used in OpenMP to parallelize loops, dividing loop iterations
among multiple threads so that work can be executed simultaneously.
Using it, we can parallelize the serial trapezoidal rule

by simply placing a directive immediately before the for loop:

Manjusha, Asst. Prof, CSE Dept, SVIT. 10

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

• The parallel for directive forks a team of threads to execute the following structured block.
• However, the structured block following the parallel for directive must be a for loop.
• With the parallel for directive the system parallelizes the for loop by dividing the iterations of the loop
among the threads.
• In a for loop that has been parallelized with a parallel for directive, the default partitioning of the
iterations among the threads is up to the system. However, most systems use roughly a block
partitioning, that is, if there are m iterations, then roughly the first m/thread_count are assigned to
thread 0, the next m/thread_count are assigned to thread 1, and so on .

4.5.1 Caveats
• OpenMP will only parallelize for loops for which the number of itera⿿ons can be determined

.IN
C
N
SY

Program 4.3: Legal forms for parallelizable for statements.

• OpenMP will only parallelize for loops—it won’t parallelize while loops or do−while loops
U

directly.
• The “infinite loop” cannot be parallelized
VT

• The for loop with break statements cannot be parallelized

Manjusha, Asst. Prof, CSE Dept, SVIT. 11

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
4.5.2 Data dependences
A more trivial problem occurs in loops in which the computation in one iteration depends on the results
of one or more previous iterations.
As an example, consider the following code, which computes the first n Fibonacci numbers:

Let us try parallelizing the for loop with a parallel for directive:

The compiler will create an executable without complaint. However, if we try running it with more

.IN
than one thread, we may find that the results are, at best, unpredictable.
1 1 2 3 5 8 13 21 34 55,
C
which is correct. However, we also occasionally get
1 1 2 3 5 8 0 0 0 0.
N
We see two important points here:
SY

1. OpenMP compilers don’t check for dependences among iterations in a loop that is being
parallelized with a parallel for directive. It is up to the programmers to identify these
dependences.
2. A loop in which the results of one or more iterations depend on other iterations cannot, in
U

general, be correctly parallelized by OpenMP without using features such as the Tasking API.
(See Section 4.10).
VT

The dependence of the computation of fibo[i] on the computation of fibo[i-1] is called a data dependence.
Since the value of fibo[i-1] is calculated in one iteration, and the result is used in a subsequent iteration, the
dependence is sometimes called a loop-carried dependence.

4.5.2 Finding loop-carried dependences


Perhaps the first thing to note is that when we try to use a parallel for directive, we only need to worry about
loop-carried dependences. We don’t need to consider more general data dependences. For example, in the
loop

Manjusha, Asst. Prof, CSE Dept, SVIT. 12

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
There is a data dependence between lines 2 and 3; however, it does not affect parallelization

Since the computation of x[i] and its subsequent use will always be assigned to the same thread, there is no
issue.
Also note that at least one of the statements must write or update the variable in order for the statements to
represent a dependence. Therefore, to detect a loop-carried dependence, we should only consider variables that
are updated within the loop body. Specifically, we should look for variables that are read or written in one
iteration and written again in another

4.6 Odd-even transposi�on sort in OpenMP


The list a stores n ints, and the algorithm sorts them into increasing order. During an “even phase” (phase

.IN
%2==0), each odd-subscripted element, a[i], is compared to the element to its “left,” a[i−1], and if they’re
out of order, they’re swapped. During an “odd” phase, each odd-subscripted element is compared to the element
to its right, and if they are out of order, they are swapped. A theorem guarantees that after n phases, the list will
be sorted.
C
Example: Suppose a = {9, 7, 8, 6}. Then the phases are shown in Table 4.2. In this case, the final phase was
N
not necessary, but the algorithm does not bother checking whether the list is already sorted before carrying out
each phase.
SY

It is not hard to see that the outer loop has a loop-carried dependence. As an example, suppose as before that a
= {9, 7, 8, 6}. Then in phase 0 the inner loop will compare elements in the pairs (9,7) and (8,6), and both pairs
are swapped. So for phase 1, the list should be {7, 9, 6, 8}, and during phase 1 the elements in the pair (9,6)
should be compared and swapped. However, if phase 0 and phase 1 are executed simultaneously, the pair that’s
U

checked in phase 1 might be (7,8), which is in order. Furthermore, it is not clear how one might eliminate this
loop-carried dependence, so it would appear that parallelizing the outer for loop is not an option.
VT

Table 4.2 Serial odd-even transposition sort

The inner for loops, however, don’t appear to have any loop-carried dependences. For example, in an even
phase loop variable i will be odd, so for two distinct values of i, say i=j and i=k, the pairs {j−1,j} and {k−1,
k} will be disjoint. The comparison and possible swaps of the pairs (a[j−1], a[j]) and (a[k−1], a[k]) can
therefore proceed simultaneously.

Manjusha, Asst. Prof, CSE Dept, SVIT. 13

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

.IN
C
N
SY

Program 4.4: First OpenMP implementation of odd-even sort.


Potential problems with the above code.
1. We must ensure that all threads complete phase p before any thread begins phase p+1. Similar to
U

the parallel directive, the parallel for directive has an implicit barrier at the end of the loop.
As a result, no thread will proceed to the next phase (p+1) until all threads have finished the current
VT

phase (p).
2. The overhead associated with forking and joining threads. The OpenMP implementation may fork and
join thread_count threads during each pass through the body of the outer loop. The first row of
Table 4.3 shows the runtimes for 1, 2, 3, and 4 threads on one of the systems when the input list
contained 20,000 elements.

Table 4.3 Odd-even sort with two parallel for directives and two for directives. Times are in seconds.

These runtimes are not terrible, but we can try to improve them.
We can fork our team of thread_count threads before the outer loop using a parallel directive. Then,
instead of forking a new team of threads for each execution of an inner loop, we use a for directive, which
instructs OpenMP to parallelize the loop with the existing team of threads. This modification to the original
OpenMP implementation is shown in Program 4.5

Manjusha, Asst. Prof, CSE Dept, SVIT. 14

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

.IN
C
N
Program 4.5: Second OpenMP implementation of odd-even sort.
SY

4.7 Scheduling loops


In OpenMP, assigning iterations to threads is called scheduling.
U

The parallel for directive, the exact assignment of loop iterations to threads is system dependent. However
VT

most OpenMP implementations use roughly a block partitioning: if there are n iterations in the se rial loop,
then in the parallel loop the first n/thread_count are assigned to thread 0, the next n/thread_count
are assigned to thread 1, and so on.
For example, suppose we want to parallelize the loop

A better assignment of work to threads might be obtained with a cyclic partitioning of the iterations among the
threads. In a cyclic partitioning, the iterations are assigned, one at a time, in a “round-robin” fashion to the
threads. Suppose t = thread_count. Then a cyclic partitioning will assign the iterations as follows:

Manjusha, Asst. Prof, CSE Dept, SVIT. 15

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
4.7.1 The schedule clause
In OpenMP, assigning iterations to threads is called scheduling, and the schedule clause can be used to
assign iterations in either a parallel for or a for directive.

In our example, we already know how to obtain the default schedule: we just add a parallel for directive
with a reduction clause:

To get a cyclic schedule, we can add a schedule clause to the parallel for directive:

.IN
The type can be any one of the following:
C
• static. The iterations can be assigned to the threads before the loop is executed.
N
• dynamic or guided. The iterations are assigned to the threads while the loop is executing, so
after a thread completes its current set of iterations, it can request more from the run-time system.
SY

• auto. The compiler and/or the run-time system determine the schedule.
• runtime. The schedule is determined at run-time based on an environment variable
U

The chunksize is a positive integer. In OpenMP parlance, a chunk of iterations is a block of iterations that
would be executed consecutively in the serial loop. The number of iterations in the block is the chunksize.
VT

Only static, dynamic, and guided schedules can have a chunksize. Fig. 4.5 provides a visualization
of how work is scheduled using the static, dynamic, and guided types.

Manjusha, Asst. Prof, CSE Dept, SVIT. 16

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

.IN
C
N
SY

FIGURE 4.5 Scheduling visualization for the static, dynamic, and guided schedule types with 4 threads
U

and 32 iterations. The first static schedule uses the default chunksize, whereas the second uses a
chunksize of 2. The exact distribution of work across threads will vary between different executions of the
VT

program for the dynamic and guided schedule types, so this visualization shows one of many possible
scheduling outcomes

4.7.2 The sta�c schedule type


For a static schedule, the system assigns chunks of chunksize iterations to each thread in a round-
robin fashion. As an example, suppose we have 12 iterations, 0,1,...,11, and three threads. Then if
schedule(sta�c , 1) is used, in the parallel for or for direc⿿ve, the iterations will be assigned as

If schedule(sta�c , 2)is used, then the iterations will be assigned as

Manjusha, Asst. Prof, CSE Dept, SVIT. 17

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

If schedule(static , 4)is used, the iterations will be assigned as

The default schedule is defined by your particular implementation of OpenMP, but in most
cases it is equivalent to the clause

schedule(static , total_iterations / thread_count)

Note that the chunksize can be omitted. If omitted, the chunksize is approximately
total_iterations / thread_count.

The static schedule is a good choice when each loop iteration takes roughly the same amount of time to

.IN
compute. It also has the advantage that threads in subsequent loops with the same number of iterations will be
assigned to the same ranges; this can improve the speed of memory accesses, particularly on NUMA systems

4.7.3 The dynamic and guided schedule types C


In a dynamic schedule, the itera⿿ons are also broken up into chunks of chunksize consecu⿿ve itera⿿ons.
Each thread executes a chunk, and when a thread ῿nishes a chunk, it requests another one from the run-⿿me
N
system. This con⿿nues un⿿l all the itera⿿ons are completed. The chunksize can be omi㿿ed. When it is
omi㿿ed, a chunksize of 1 is used.
SY

The primary di�erence between static and dynamic schedules is that the dynamic schedule assigns
ranges to threads on a ῿rst-come, ῿rst-served basis. This can be advantageous if loop itera⿿ons do not take a
uniform amount of ⿿me to compute.
U
VT

Table 4.4 Assignment of trapezoidal rule itera⿿ons 1–9999 using a guided schedule with two threads.

Manjusha, Asst. Prof, CSE Dept, SVIT. 18

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

The guided schedule is similar to the dynamic schedule in that each thread executes a chunk of iterations
and then requests another chunk once it finishes. However, in a guided schedule, the size of the chunks
decreases as more chunks are completed.
For example, consider running the trapezoidal rule program with the parallel for directive and a
schedule(guided) clause. If � = 10,000and the thread count is 2, the iterations are assigned as shown in
Table 4.4. The size of each chunk is approximately equal to the number of remaining iterations divided by the
number of threads:
9999
• The first chunk size is ≈ 5000, since there are 9999 unassigned iterations.
2
4999
• The second chunk size is ≈ 2500.
2

• Subsequent chunks continue to decrease in size following this pattern.


In a guided schedule:

• If no chunk size is specified, the chunk size decreases all the way down to 1.

.IN
• If a chunk size is specified, the size decreases down to that specified value, except that the very last
chunk may be smaller.
This scheduling method can improve load balancing across threads, especially when later iterations are
more computationally intensive.
C
N
4.7.4 The run�me schedule type
When schedule(runtime) is specified, the system uses the environment variable OMP_SCHEDULE to
SY

determine at run-time how to schedule the loop. The OMP_SCHEDULE environment variable can take on any
of the values that can be used for a static, dynamic, or guided schedule. For example, suppose we
have a parallel for directive in a program and it has been modified by schedule(runtime). Then if we
use the bash shell, we can get a cyclic assignment of iterations to threads by executing the command
U

$ export OMP_SCHEDULE="static,1"
VT

Now, when we start executing our program, the system will schedule the iterations of the for loop as if we had
the clause schedule(static,1) modifying the parallel for directive.

4.8 Producers and consumers


Let us take a look at a parallel problem that cannot be effectively parallelized using a parallel for or for
directive.

4.8.1 Queues
• A queue is a list abstract datatype in which new elements are inserted at the “rear” of the queue and
elements are removed from the “front” of the queue.
• When a new entryis added to the rear of a queue, we sometimes say that the entry has been
“enqueued,” and when an entry is removed from the front of a queue, we sometimes say that the
entry has been “dequeued.”
• Queues occur frequently in computer science. For example, if we have a number of processes, each
of which wants to store some data on a hard drive, then a natural way to ensure that only one process
writes to the disk at a time is to have the processes form a queue.

Manjusha, Asst. Prof, CSE Dept, SVIT. 19

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

• A queue is also a natural data structure to use in many multithreaded applications. For example,
suppose we have several “producer” threads and several “consumer” threads.
• The producer threads might “produce” requests for data from a server— for example, current stock
prices—while the consumer threads might “consume” the request by finding or generating the
requested data—the current stock prices.

4.8.2 Message-passing
• Another natural application would be implementing message-passing on a shared memory system.
• Each thread could have a shared-message queue, and when one thread wanted to “send a message” to
another thread, it could enqueue the message in the destination thread’s queue.
• A thread could receive a message by dequeuing the message at the head of its message queue.
Let us consider a program that consists of multiple threads that communicate through message passing. Each
thread repeatedly generates a random integer message along with a random destination thread, then enqueues
the message into that destination’s message queue. After sending a message, the thread checks its own queue;
if a message is available, it dequeues and prints it. Each thread alternates between sending and receiving until
it has sent a user-specified number of messages. Once finished sending, a thread continues receiving messages

.IN
until all threads have completed their sending, after which all threads terminate.
Pseudocode for each thread might look something like this:
C
N
SY

4.8.3 Sending messages


Note that accessing a message queue to enqueue a message is probably a critical section. A variable
U

that keeps track of the rear of the queue is necessary. For example, if we use a singly linked list with the tail of
the list corresponding to the rear of the queue, then, to efficiently enqueue, we would want to store a pointer to
VT

the rear. When we enqueue a new message, we will need to check and update the rear pointer. If two threads try
to do this simultaneously, we may lose a message that has been enqueued by one of the threads. The results of
the two operations will conflict, and hence enqueuing a message will form a critical section.

Pseudocode for the Send_msg() function might look something like this:

4.8.4 Receiving messages


The synchronization issues for receiving a message are a little different.
Only the owner of the queue(that is, the destination thread) will dequeue from a given message queue. As long
as we dequeue one message at a time, if there are at least two messages in the queue, a call to Dequeue can’t
possibly conflict with any calls to Enqueue. So if we keep track of the size of the queue, we can avoid any
synchronization (for example, critical directives), as long as there are at least two messages.

Manjusha, Asst. Prof, CSE Dept, SVIT. 20

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

Now you may be thinking, “What about the variable storing the size of the queue?” This would be a problem
if we simply store the size of the queue. However, if we store two variables, enqueued and dequeued, then the
number of messages in the queue is
queue_size = enqueued – dequeued

and the only thread that will update dequeued is the owner of the queue. Observe that one thread can update
enqueued at the same time that another thread is using it to compute queue_size. To see this, let’s suppose
thread q is computing queue_size. It will either get the old value of enqueued or the new value. It may
therefore compute a queue_size of 0 or 1 when queue_size should actually be 1 or 2, respectively, but
in our program this will only cause a modest delay. Thread q will try again later if queue_size is 0 when it
should be 1, and it will execute the critical section directive unnecessarily if queue_size is 1 when it should
be 2
Thus we can implement Try_receive as follows:

.IN
C
N
4.8.5 Termina�on detec�on
We also need to think about implementation of the Done function.
SY
U
VT

If thread u executes this code, it’s entirely possible that some thread v will send a message to thread u after u
has computed queue_size =0. Of course, after thread u computes queue_size =0, it will terminate
and the message sent by thread v will never be received. However, in our program, after each thread has
completed the for loop, it won’t send any new messages. Thus if we add a counter done_sending, and each
thread increments this after completing its for loop, then we can implement Done as follows:

Manjusha, Asst. Prof, CSE Dept, SVIT. 21

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
4.8.6 The atomic direc�ve
OpenMP provides a poten⿿ally higher performance direc⿿ve to implement mutual exclusion: the atomic
direc⿿ve:
# pragma omp atomic

Unlike the critical direc⿿ve, it can only protect cri⿿cal sec⿿ons that consist of a single C assignment
statement. Further, the statement must have one of the following forms:

Here <op> can be one of the binary operators +, ∗, −, / , &, ^, | , <>.

.IN
• Note that <expression> must not reference x. It should be noted that only the load and store of x
are guaranteed to be protected.
For example, in the code C
# pragma omp atomic

x += y++;
N
a thread’s update to x will be completed before any other thread can begin updating x. However, the
SY

update to y may be unprotected and the results may be unpredictable.


The idea behind the atomic directive is that many processors provide a special load-modify-store
instruction, and a critical section that only does a load-modify store can be protected much more efficiently
by using this special instruction rather than the constructs that are used to protect more general critical
U

sections.
VT

4.8.7 Cri�cal sec�ons and locks


If we simply look at the source code, we will see three blocks of code preceded by a critical or an
atomic directive:

From OpenMP’s point of view our program has two distinct critical sections: the critical section
protected by the atomic directive, (done_sending++), and the “composite” critical section in which we
enqueue and dequeue messages.
Since enforcing mutual exclusion among threads serializes execution, this default behavior of OpenMP—
treating all critical blocks as part of one composite critical section—can be highly detrimental to our program’s
performance. OpenMP does provide the option of adding a name to a critical directive:
# pragma omp critical (name)
When we do this, two blocks protected with critical directives with different names can be executed
simultaneously. However, the names are set during compilation, and we want a different critical section

Manjusha, Asst. Prof, CSE Dept, SVIT. 22

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

for each thread’s queue. Therefore we need to set the names at run-time, and in our setting, when we want to
allow simultaneous access to the same block of code by threads accessing different queues, the named critical
directive isn’t sufficient.
The alternative is to use locks.
A lock consists of a data structure and functions that allow the programmer to explicitly enforce mutual
exclusion in a critical section. The use of a lock can be described by the following pseudocode:

.IN
The lock data structure is shared among the threads that will execute the critical section. One of the
threads (e.g., the master thread) will initialize the lock, and when all the threads are done using the lock, one of
the threads should destroy it.
C
Before a thread enters the critical section, it attempts to set the lock by calling the lock function. If no
N
other thread is executing code in the critical section, it acquires the lock and proceeds into the critical section
past the call to the lock function. When the thread finishes the code in the critical section, it calls an unlock
SY

function, which releases or unsets the lock and allows another thread to acquire the lock.
While a thread owns the lock, no other thread can enter the critical section. If another thread attempts
to enter the critical section, it will block when it calls the lock function. If multiple threads are blocked in a call
to the lock function, then when the thread in the critical section releases the lock, one of the blocked threads
U

returns from the call to the lock, and the others remain blocked.
VT

OpenMP has two types of locks: simple locks and nested locks. A simple lock can only be set once
before it is unset, while a nested lock can be set multiple times by the same thread before it is unset. The type
of an OpenMP simple lock is omp_lock_t, and the simple lock functions that we’ll be using are

4.8.8 Using locks in the message-passing program


In our earlier discussion of the limitations of the critical directive, we saw that in the message-passing program,
we wanted to ensure mutual exclusion in each individual message queue, not in a particular block of source
code. Locks allow us to do this. If we include a data member with type omp_lock_t in our queue struct, we can
simply call omp_set_lock each time we want to ensure exclusive access to a message queue. So the code

Manjusha, Asst. Prof, CSE Dept, SVIT. 23

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP

.IN
Now when a thread tries to send or receive a message, it can only be blocked by a thread attempting to
access the same message queue, since different message queues have different locks. In our original
implementation, only one thread could send at a time, regardless of the destination.
C
Note that it would also be possible to put the calls to the lock functions in the queue functions Enqueue
N
and Dequeue. However, to preserve the performance of Dequeue, we would also need to move the code that
determines the size of the queue (enqueued– dequeued) to Dequeue. Without it, the Dequeue function
SY

will lock the queue every time it is called by Try_receive. In the interest of preserving the structure of the
code we’ve already written, we’ll leave the calls to omp_set_lock and omp_unset_lock in the Send and
Try_receive functions.
U

4.9 Caches, cache coherence, and false sharing


VT

4.10 Tasking
while and do−while loops cannot be parallelized with OpenMP, nor can for loops that have an unbounded
number of iterations. This poses potential issues for dynamic problems, including recursive algorithms, such as
graph traversals, or producer-consumer style programs like web servers.
To address these issues, OpenMP 3.0 introduced Tasking functionality. Tasking has been successfully applied
to a number of problems that were previously difficult to parallelize with OpenMP.
Tasking allows developers to specify independent units of computation with the task directive:

• When a thread reaches a block of code with this directive, a new task is generated by the OpenMP run-
time that will be scheduled for execution.
• It is important to note that the task will not necessarily be executed immediately, since there may be
other tasks already pending execution.
• Task blocks behave similarly to a standard parallel region, but can launch an arbitrary number of
tasks instead of only num_threads.

Manjusha, Asst. Prof, CSE Dept, SVIT. 24

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
• In fact, tasks must be launched from within a parallel region but generally by only one of the threads
in the team.

Therefore a majority of programs that use Tasking functionality will contain an outer region that looks
somewhat like:

where the parallel directive creates a team of threads and the single directive in structs the runtime to
only launch tasks from a single thread.

• If the single directive is omitted, subsequent task instances will be launched multiple times, one for
each thread in the team.

.IN
• To demonstrate OpenMP tasking functionality, consider Fibonacci sequence number generation
program. Due to the loop-carried dependence, results were unpredictable and, often incorrect.
• However, we can parallelize this algorithm with the task directive. First, let’s take a look at a recursive
C
serial implementation that stores the sequence in a global array called fibs:
N
SY
U
VT

This chain of recursive calls will be time-consuming.

the results in i and j are lost: i and j retain their values from the initializations
int i = 0;

int j = 0;

Manjusha, Asst. Prof, CSE Dept, SVIT. 25

Studied smart, not hard — thanks to [Link]


MODULE 4 Shared-memory programming with OpenMP
at the beginning of the function.
We can adjust the scope of i and j by declaring the variables to be shared in the tasks that execute the recursive
call. The problem here is that the order in which the various tasks execute isn’t specified. In other words, our
recursive function calls, fib(n − 1) and fib(n − 2) will be run eventually, but the thread executing the
task that makes the recursive calls can continue to run and simply return the current value of fibs[n] early.
We need to force this task to wait for its subtasks to complete with the taskwait directive, which operates as a
barrier for tasks. We’ve put this all together in Program 4.6

.IN
C
N
SY
U

Program 4.6: Computing the Fibonacci numbers using OpenMP tasks.


VT

4.11 Thread – safety


4.12

Manjusha, Asst. Prof, CSE Dept, SVIT. 26

Studied smart, not hard — thanks to [Link]

You might also like