Module-04 Bcs702 (Parallel Computing) Search Creators
Module-04 Bcs702 (Parallel Computing) Search Creators
The "MP" in OpenMP signifies "multiprocessing," which is synonymous with shared-memory MIMD
computing.
Shared-Memory Model:
OpenMP is designed for systems where all autonomous cores or CPUs have access to a common main
memory, as depicted in a shared-memory system (Fig. 5.1).
This means threads can directly read from and write to the same memory locations.
Pthreads:
Requires explicit programmer control over each thread's behavior, including thread creation, management, and
synchronization. It's a lower-level, function-based library.
Allows programmers to express parallelism using directives (pragmas). The compiler and run-time system
often handle the precise determination of tasks and which thread executes them.
This makes it simpler for certain parallel behaviors, especially parallelizing loops, but can make low-level
thread interactions more challenging to program.
Unlike Pthreads (which is a library that can be linked to any C compiler), OpenMP requires specific compiler
support for its pragmas.
A C compiler that doesn't support OpenMP will ignore the directives, resulting in sequential execution.
Incremental Parallelization:
A significant design goal of OpenMP was to enable programmers to incrementally parallelize existing serial
programs.
This is extremely difficult with MPI (which is for distributed memory) and often challenging with Pthreads
due to its low-level nature.
OpenMP uses a "directives-based" API. In C and C++, these directives are implemented as special
preprocessor instructions called pragmas.
Pragma Syntax: Pragmas start with #pragma. For OpenMP, they specifically start with #pragma omp.
Compiler Handling: Compilers that don't support OpenMP will simply ignore these pragmas. This
allows a single OpenMP program source file to be compiled and run either sequentially (if OpenMP is
not supported) or in parallel (if it is).
1 return 0;
} // main
void Hello(void) {
int my_rank = omp_get_thread_num(); // Get thread ID
int thread_count = omp_get_num_threads(); // Get total number of threads
Bash
Bash
$ ./omp_hello 4
Output Behavior: The output order is non-deterministic because threads contend for access to stdout. You
might see "Hello from thread 0", then "Hello from thread 2", etc., in any order.
1. omp.h Inclusion (Line 3): This header file provides prototypes for OpenMP functions and macro
definitions.
2. strtol (Line 9): Used to parse the number of threads from the command-line argument.
3. #pragma omp parallel num_threads(thread_count) (Line 11): This is the core OpenMP directive.
o #pragma omp parallel: Indicates that the immediately following structured block of code should be
executed in parallel by multiple threads.
o num_threads(thread_count): This is a clause that modifies the parallel directive, explicitly
requesting thread_count threads. While not guaranteed (system limitations may apply), in most
cases, the requested number of threads will be created.
4. Execution Flow at parallel directive:
o Initially, the program runs as a single master thread (often referred to as thread 0).
o When the master thread encounters the parallel directive, it forms a team of thread_count threads.
The original master thread continues as part of this team, and thread_count - 1 child threads are
created.
o Each thread in the team then executes the structured block that follows the directive (in this case,
the Hello() function call).
5. Hello() Function (Lines 17-24):
o omp_get_thread_num(): Returns the unique integer ID (rank) of the calling thread within its team
(0 to thread_count - 1).
o omp_get_num_threads(): Returns the total number of threads in the current team.
o Each thread calls printf independently, leading to the non-deterministic output order.
6. Implicit Barrier and Termination:
o When all threads in the team complete the structured block (e.g., Hello() returns), there is an
implicit barrier. This means any thread that finishes early will wait for all other threads in its team
to complete.
o After all threads have completed the block, the child threads are terminated, and only the parent
thread (the original master thread) continues execution with the code following the parallel region
(Line 14 in main).
Command-line Arguments: It's good practice to check if argv[1] exists and if thread_count is a
positive value.
Compiler Support (_OPENMP macro): To make code more portable to compilers that might not
support OpenMP, check for the preprocessor macro _OPENMP. This macro is defined by OpenMP-
compliant compilers.
#ifdef _OPENMP
# include <omp.h>
#endif
// ... in Hello function or wherever OpenMP calls are made ...
#ifdef _OPENMP
int my_rank = omp_get_thread_num();
int thread_count = omp_get_num_threads();
#else
int my_rank = 0; // Assume single-threaded
int thread_count = 1; // Assume single-threaded
#endif
This allows the program to gracefully compile and run sequentially if OpenMP support is absent. (The
textbook examples will often omit this for clarity).
/* Input: a, b, n */
h = (b - a) / n;
approx = (f(a) + f(b)) / 2.0;
for (i = 1; i <= n - 1; i++) {
x_i = a + i * h;
approx += f(x_i);
}
approx = h * approx;
To parallelize the trapezoidal rule using OpenMP, we apply Foster's parallel program design methodology:
1. Task Partitioning:
o Type A tasks: Computing the area of individual trapezoids.
o Type B task: Summing all the individual trapezoid areas.
2. Communication: No communication is needed among the Type A tasks. However, each Type A task must
communicate its computed area to the single Type B task (the global sum).
3. Aggregation and Mapping:
o Assuming many more trapezoids (n) than available cores (thread_count), we aggregate tasks by
assigning a contiguous block of trapezoids to each thread.
o Effectively, this partitions the overall interval [a,b] into smaller, contiguous subintervals, and each
thread applies the serial trapezoidal rule to its assigned subinterval to compute a my_result (local
integral contribution). (See Figure 5.4 for a visual representation of this assignment).
After each thread computes its my_result, these individual contributions need to be summed into a
global_result. A naive approach like global_result += my_result; leads to a race condition.
Race Condition:
Occurs when multiple threads attempt to access and modify a shared resource (here, global_result), and at least
one of the accesses is an update.
The outcome becomes unpredictable because the final value depends on the non-deterministic interleaving of
thread operations.
Example Scenario:
o global_result starts at 0.
o Thread 0 computes my_result = 1.
o Thread 1 computes my_result = 2.
o Problematic Timetable:
Time 0: Thread 0 loads global_result (0) into its register.
Time 1: Thread 1 loads global_result (0) into its register.
Time 2: Thread 0 adds its my_result (1) to its register (now 1).
Critical Section:
The code segment global_result += my_result; is a critical section. It's a block of code that updates a shared
resource and must only be executed by one thread at a time to ensure correctness.
OpenMP provides the #pragma omp critical directive to ensure mutually exclusive access to a structured
block of code. Only one thread can execute the code block immediately following this directive at any given
time.
Let's examine the structure of the OpenMP program for the trapezoidal rule:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h> // For OpenMP functions and directives
// Get input for a, b, and n from the user (executed by master thread)
printf("Enter a, b, and n\n");
scanf("%lf %lf %d", &a, &b, &n);
// After parallel region, master thread continues and prints the result
printf("With n = %d trapezoids, our estimate\n", n);
printf("of the integral from %f to %f = %.14e\n", a, b, global_result);
return 0;
} // main
// Add thread's local result to the shared global_result using a critical section
#pragma omp critical
*global_result_p += my_result;
} // Trap
Code Walkthrough:
main function:
Trap function:
This ensures that only one thread at a time can update the shared global_result_p, preventing the
race condition.
Important Details
Variable Naming:
The local_ prefix (e.g., local_a, local_b, local_n) is used to emphasize that these variables hold values specific
to each thread's portion of the work, distinct from the global a, b, n.
Divisibility Check:
The code assumes n is evenly divisible by thread_count. If not, local_n would be n / thread_count (integer
division), leading to fewer trapezoids being computed in total ((n / thread_count) * thread_count) than the
requested n.
Proper error checking (e.g., if (n % thread_count != 0) { ... exit(0); }) is recommended in practice to handle
this.
Interval Calculation:
The calculation of local_a and local_b correctly partitions the original interval [a, b] among the threads. For
example, thread 0 gets [a, a + local_n*h], thread 1 gets [a + local_n*h, a + 2*local_n*h], and so on.
In serial programming, the scope of a variable refers to the regions of the program text where that variable can
be accessed or "seen."
Function-wide scope: A variable declared inside a C function is only accessible within that function's
body. It's typically allocated on the function's call stack.
File-wide scope (Global/Static): A variable declared at the top of a .c file, outside any function, can be
accessed by any function within that same file. It's typically allocated in static memory.
In OpenMP, the scope of a variable in a parallel block refers to the set of threads that can access that
variable within the parallel region.
Shared Scope:
A variable with shared scope can be accessed by all threads in the OpenMP team.
Default Behavior: Variables that are declared before a #pragma omp parallel directive (i.e., in the enclosing
scope, like main or globally) typically have shared scope by default within the parallel region.
In the main function of the trapezoidal rule program, a, b, n, global_result, and thread_count are declared
before the #pragma omp parallel directive.
When Trap is called within the parallel region, these variables are implicitly shared.
Each thread needs to access the global a, b, and n to determine its local_a, local_b, and local_n. This
access happens through the function call's arguments, which effectively receive the shared values.
global_result is a crucial example. It's declared in main and needs to be accessible to all threads to
accumulate their my_result contributions.
If global_result were private, each thread would have its own independent copy, and the final sum in
main would be incorrect (likely just the initial value or the last thread's value). This is why the critical
directive is needed for updates to global_result – because it's shared.
A variable with private scope can only be accessed by a single thread. Each thread gets its own unique,
independent copy of the variable.
Default Behavior: Variables declared inside a structured block that is executed in parallel (e.g., local
variables within a function called from a parallel region, or variables declared directly within the parallel
region's block) typically have private scope. These variables are allocated from each thread's private stack.
In the Hello function, my_rank and thread_count are declared locally. Each thread calling Hello gets its own
private copies of these variables, which it initializes independently using omp_get_thread_num() and
omp_get_num_threads().
In the Trap function, variables like h, x, my_result, local_a, local_b, i, local_n, my_rank are all declared locally
within the function. Therefore, each thread has its own private version of these variables, residing on its own
stack. This is essential for independent computation (my_result).
While global_result_p itself is a private pointer variable within the Trap function (each thread has its own
global_result_p on its stack), what it points to (*global_result_p) is the global_result variable which has shared
scope. This indirection allows threads to access and update the single shared instance.
The reduction clause in OpenMP provides a powerful and convenient way to handle computations where
multiple threads contribute to a single, combined result using an associative binary operator.
Consider the attempt to use a simpler function prototype double Local_trap(double a, double b, int n); where
each thread calculates and returns its my_result. A naive approach might be:
The problem here is that the entire global_result += Local_trap(...) expression, including the call to Local_trap,
is inside the critical section.
This means only one thread can execute Local_trap at a time, effectively serializing the most
computationally intensive part of the program.
This defeats the purpose of parallelization and can even make the parallel version slower than the serial one
due to the overhead of mutex locking/unlocking.
A better approach using a private variable and a critical section would be:
global_result = 0.0;
#pragma omp parallel num_threads(thread_count)
{
double my_result = 0.0; // Private to each thread
// Each thread calculates its part independently (in parallel)
my_result = Local_trap(a, b, n); // Assuming a,b,n are properly passed/calculated for the local range
// (e.g., Local_trap needs to determine local_a, local_b, local_n based on thread ID)
This version works correctly and allows Local_trap to execute in parallel. Each thread computes its my_result
privately and then, in a very small critical section, adds its my_result to the shared global_result.
OpenMP offers a cleaner, more concise, and often more optimized way to achieve the same result using the
reduction clause.
Search Creators Page 16
PARALLEL COMPUTING|BCS702 |Module -04 |Search Creators
What is a Reduction?
A reduction is a computation that applies an associative binary operator repeatedly to a sequence of operands
to produce a single result.
<operator>: A supported reduction operator (e.g., +, *, -, &, |, ^, &&, ||, max, min).
<variable list>: One or more variables (separated by commas) that are involved in the reduction.
Here's what the OpenMP runtime system conceptually does with the reduction clause:
1. Private Copies: For each variable specified in the reduction clause (global_result in this case), OpenMP
automatically creates a private copy for each thread in the team.
2. Initialization: Each private copy is initialized to the identity value of the specified operator (e.g., 0 for +,
1 for *, etc.). (See Table 5.1 below).
3. Local Computation: Inside the parallel region, when a thread operates on global_result (e.g., global_result
+= Local_trap(...)), it is actually operating on its private copy of global_result. This means the calls to
Local_trap happen in parallel, without contention.
4. Combination (Implicit Synchronization): At the end of the parallel region (at the implicit barrier),
OpenMP automatically performs a final combination of all the private copies into the original shared
global_result variable, using the specified reduction operator. This combination is typically done
efficiently, often using optimized tree-based reductions or other methods that minimize serialization,
making it generally more performant than a manual critical section for summation.
global_result = 0.0;
#pragma omp parallel num_threads(thread_count) \
reduction(+: global_result)
global_result += Local_trap(double a, double b, int n);
global_result = 0.0;
#pragma omp parallel num_threads(thread_count)
{
double my_result = 0.0; /* private */ // This is what OpenMP implicitly creates
my_result += Local_trap(double a, double b, int n);
#pragma omp critical // This is what OpenMP implicitly manages
global_result += my_result;
}
Supported Operators: +, *, -, & (bitwise AND), | (bitwise OR), ^ (bitwise XOR), && (logical AND), ||
(logical OR), max (not explicitly listed in the text but commonly supported), min (not explicitly listed).
Subtraction (-): While subtraction is not associative, OpenMP handles it by effectively summing the
negated private results.
If result -= i is reduced, thread 0 might compute -(1+2)=-3 and thread 1 -(3+4)=-7. The final result will be
global_result + (-3) + (-7) = global_result - 10, which is the correct behavior for repeated subtraction.
Floating Point Precision: For float or double variables, the exact numerical result of a reduction might
differ slightly from a serial calculation, or even between different numbers of threads.
This is because floating-point addition (and multiplication) is not strictly associative due to limited
precision. For example, (a+b)+c might not be exactly equal to a+(b+c) for floating-point numbers.
When OpenMP creates private copies of reduction variables, they are initialized to an "identity value" (or
"neutral element") for the specified operation.
This ensures that when the partial results are combined, the initial state of the private variable does not alter the
final outcome.
When discussing parallel programming, especially in the context of loops, a loop-carried dependency (also
known as a loop dependence or inter-iteration dependency) refers to a situation where an iteration of a loop
depends on the result of a previous iteration of the same loop.
In simpler terms:
Iteration i needs something that was computed or changed by Iteration j, where j comes before i in
the loop's execution sequence.
This dependency is crucial because it prevents simple parallelization of the loop. If iterations are executed
concurrently without respecting the dependency, the results can be incorrect, leading to race conditions, data
inconsistencies, or wrong final values.
If Iteration i=2 starts before Iteration i=1 has finished computing and storing array[1], then Iteration i=2 will
use an old or incorrect value for array[1], leading to a wrong result.
o Iteration i reads a value that was written by Iteration j (where j < i).
o Example: A[i] = B[i] + C[i-1]; (Here, C[i-1] is written in the previous iteration and read in the current
one, assuming C is modified by the loop). Or the example above: array[i] = array[i-1] + some_value;
o Iteration i writes to a location that was read by Iteration j (where j < i). This is less common as a direct
problem for parallelization if variables are privatized, but it can still occur.
o Example: A[i+1] = ...; B[i] = A[i]; then later A[i+1] = ...; in the next iteration.
o Iteration i writes to the same location that was written by Iteration j (where j < i).
Example:
for (...) {
temp_result = ...;
If shared_variable or some_array[k] is the same for multiple iterations, the final value will be determined by
whichever iteration finishes last, leading to non-determinism.
The first step is to carefully analyze the loop and identify if any variables or memory locations are accessed in
a way that creates dependencies between iterations.
Serialization:
If a strong, unavoidable loop-carried dependency exists (like the array[i] = array[i-1] + ... example), the loop
cannot be parallelized directly without changing the algorithm or introducing explicit synchronization that
effectively serializes parts of it. In such cases, trying to parallelize can lead to incorrect results.
Restructuring/Transformation:
o Reduction: If the dependency is a cumulative operation (like summation, product, min, max), it can
often be handled using reduction clauses in OpenMP or similar constructs. As seen in the trapezoidal
rule example, global_result += my_result; is a reduction. Each thread works on a private copy, and these
copies are combined at the end.
o Privatization: If variables appear to cause a dependency but are actually used independently within
each iteration, they can be declared as private to each thread.
o Loop Transformations: Techniques like loop fission (splitting a loop into multiple loops), loop fusion,
or data restructuring might break or manage dependencies.
o Algorithmic Changes: Sometimes, a completely different algorithm is needed that is inherently
parallel. For example, some scan/prefix sum algorithms have loop-carried dependencies but can be re-
expressed in a parallel fashion.
For complex dependencies that cannot be easily transformed into reductions or private variables, explicit
synchronization mechanisms like mutexes, atomic operations, or barriers might be used.
However, this often introduces significant overhead and can largely negate the benefits of parallelism,
especially for fine-grained dependencies.
Search Creators Page 21
PARALLEL COMPUTING|BCS702 |Module -04 |Search Creators
The critical section used in the initial trapezoidal rule program before the reduction clause was introduced is an
example of such explicit synchronization.
Scheduling
When parallelizing for loops in OpenMP using the #pragma omp parallel for directive, the way in which loop
iterations are assigned to threads is called scheduling.
The default scheduling behavior (often a block partition) might not always be optimal, especially when the
work associated with each iteration varies.
OpenMP provides the schedule clause to give programmers control over this distribution.
By default, most OpenMP implementations use a block partitioning. If there are n iterations and thread_count
threads, thread 0 gets iterations 0 to n/thread_count - 1, thread 1 gets the next block, and so on.
The provided example with f(i) calling sin i times demonstrates this:
The schedule clause is used with #pragma omp for or #pragma omp parallel for directives to specify how
iterations should be distributed. Its general form is:
schedule([type][, chunksize])
Behavior: Iterations are assigned to threads before the loop begins. Chunks of chunksize iterations are
distributed in a round-robin fashion.
Default chunksize: If chunksize is omitted or set to the default, it's approximately total_iterations /
thread_count, which results in the block partitioning.
When to Use:
o Ideal when each loop iteration takes roughly the same amount of time.
o Offers minimal overhead because assignments are fixed at compile time (or early runtime).
o Can improve memory locality on NUMA systems if subsequent loops use the same schedule and data
is reused by the same threads.
These are run-time scheduling types, meaning iterations are assigned to threads while the loop is executing.
Threads request new chunks of work as they complete their current ones.
This helps with load balancing when iteration costs are unpredictable.
dynamic:
o Behavior: Iterations are broken into chunks of chunksize consecutive iterations. Threads acquire
chunks on a first-come, first-served basis.
o Default chunksize: If omitted, chunksize defaults to 1.
Search Creators Page 24
PARALLEL COMPUTING|BCS702 |Module -04 |Search Creators
o When to Use: When loop iterations have highly variable or unpredictable computational costs.
o Trade-off: Introduces overhead due to dynamic assignment requests from the runtime system. A
larger chunksize reduces this overhead but might worsen load imbalance for highly variable
workloads.
guided:
o Behavior: Similar to dynamic, but the size of the chunks decreases over time. Initially, chunks are
large, and they get progressively smaller until they reach a minimum chunksize (defaulting to 1, or the
specified chunksize). The idea is to assign large chunks initially to reduce overhead, then smaller
chunks later to fine-tune load balance.
o When to Use: When the computational cost of iterations decreases (or increases) non-linearly as the
loop progresses, or when there's an unknown but varying workload per iteration. It tries to quickly
dispatch the bulk of the work, then use smaller chunks to balance the tail end.
o Example (Table 5.4): If n=10,000 and thread_count=2, the first chunk might be ~5000 iterations, the
next ~2500, then ~1250, and so on.
Behavior: The schedule type and chunksize are determined at runtime by inspecting the
OMP_SCHEDULE environment variable.
How to Use:
1. In your C/C++ code, use schedule(runtime):
When to Use: Extremely useful for experimentation and tuning. It allows you to try different scheduling
strategies without recompiling the code.
If loop iterations have roughly the same computational cost, the default static schedule (with a large chunk
size, typically total_iterations / thread_count) is usually the best choice due to its minimal overhead. If
performance is satisfactory, don't change it.
If the cost of iterations decreases or increases linearly as the loop progresses (like the f(i) example), a static
schedule with a small chunksize (e.g., static,1 for cyclic) often yields the best performance by distributing the
varied workload more evenly.
If the cost of each iteration cannot be determined in advance or varies significantly, dynamic or guided
schedules are good candidates.
o Start with dynamic or guided without a specified chunksize (defaults to 1 for dynamic, adaptive for
guided).
o Experiment with larger chunksize values to find a balance between load balancing and scheduling
overhead.
o The runtime schedule is invaluable here for rapid testing.
Search Creators Page 26
PARALLEL COMPUTING|BCS702 |Module -04 |Search Creators
No Universal Best Schedule:
The optimal schedule can depend on the specific algorithm, the number of threads, the total number of
iterations, and the underlying hardware (e.g., cache sizes, NUMA architecture). Extensive experimentation
might be necessary for fine-tuning.
The "producers and consumers" problem is a classic example in parallel programming that illustrates the need
for explicit synchronization mechanisms beyond simple loop parallelization.
It's not amenable to a straightforward #pragma omp parallel for because the interactions between threads are
more complex than just independent loop iterations or simple reductions.
Queues
Queues are crucial in multithreaded applications for managing shared resources and inter-thread
communication.
In this model:
Example:
The text describes implementing a simple message-passing system where threads alternate between sending
and receiving messages using queues. Each thread has its own message queue.
Reason: If multiple threads try to enqueue simultaneously into the same queue, operations like updating the
"rear" pointer or managing the list structure can conflict, leading to lost messages or a corrupted queue.
mesg = random();
dest = random() % thread_count; // Destination thread's queue
#pragma omp critical // Protects the Enqueue operation
Enqueue(queue, dest, my_rank, mesg);
The critical directive ensures that only one thread can execute Enqueue at a time. This, however, poses
a performance issue: if any thread tries to send a message, it will block if any other thread (even to a
different destination queue) is currently in a critical section. This leads to global serialization.
Dequeuing is different:
The calculation queue_size = enqueued - dequeued might read enqueued while another thread is updating it.
This might lead to queue_size being momentarily incorrect (e.g., 0 instead of 1, or 1 instead of 2). However,
this is deemed acceptable here as it only causes a slight delay or an unnecessary critical section entry, not
correctness issues.
Termination Detection
A crucial aspect is determining when all threads are truly done, including processing all messages.
A simple Done() based only on queue_size == 0 is problematic: a thread might check, find the queue empty,
terminate, and then another thread sends it a message that will never be received.
Solution: Introduce a shared counter done_sending. Each thread increments this counter after completing its
sending loop.
// Done function
if (queue_size == 0 && done_sending == thread_count)
return TRUE;
else
return FALSE;
When a thread encounters this, it pauses until all other threads in the team also reach this point. Only
then do all threads proceed.
The atomic directive provides a potentially higher-performance way to protect simple critical sections.
Syntax: #pragma omp atomic followed by a single assignment statement of a specific form.
Forms:
Limitation: It can only protect single, specific assignment statements. The expr part must not reference x.
Mechanism: It leverages special "load-modify-store" hardware instructions, which are often much faster than
general-purpose mutexes used by critical.
Caveat: Only the operation on x is guaranteed atomic. For x += y++;, y++ might not be atomic.
The default critical directive (without a name) enforces global mutual exclusion across all unnamed critical
sections in the program. This is detrimental for the message-passing example because:
Named Critical Sections: #pragma omp critical (name) allows different critical sections to execute
concurrently if they have different names.
Locks (omp_lock_t): Locks provide the necessary flexibility to enforce mutual exclusion on a per-data-
structure basis, not just per code block.
To avoid global serialization, each queue struct should contain its own omp_lock_t member.
// q_p = msg_queues[dest]
omp_set_lock(&q_p->lock); // Acquire lock for the specific destination queue
Enqueue(q_p, my_rank, mesg); // Now safe to modify q_p
// q_p = msg_queues[my_rank]
omp_set_lock(&q_p->lock);
Dequeue(q_p, &src, &mesg);
omp_unset_lock(&q_p->lock);
This ensures that only threads accessing the same queue are serialized, maximizing concurrency.
atomic:
o Pros: Potentially fastest due to hardware support.
o Cons: Limited to specific single assignment statements. Unnamed atomic directives might
enforce global exclusion on some implementations (though not mandated by standard if
variables are different), so use with caution if multiple distinct atomic operations are present.
critical (unnamed):
o Pros: Very easy to use.
o Cons: Enforces global mutual exclusion across all unnamed critical directives, which can
severely limit concurrency if not all critical sections truly conflict.
critical (name):
o Pros: Allows distinct critical sections to execute concurrently.
o Cons: Names are compile-time. Not suitable when the "resource" to be protected depends on a
runtime value (like a queue index).
locks (omp_lock_t):
o Pros: Most flexible. Allows fine-grained, data-structure-specific mutual exclusion. Essential
when you need to protect instances of a data structure rather than just code blocks.
o Cons: More verbose (requires explicit init, set, unset, destroy calls).
General Guidelines:
1. atomic first: If your critical section fits the exact form of atomic, use it.
2. critical next: If atomic isn't suitable, but you need simple mutual exclusion for a code block (and don't
have many distinct, unrelated critical sections that would suffer from global serialization), critical is
simple. Use named critical sections if different code blocks can indeed run concurrently.
1. Mixing Mutual Exclusion Types: Don't mix different types of protection for the same critical section. For
example, if part of a variable's modification is atomic and another part is critical, there's no guarantee the
critical section will exclude the atomic one, leading to race conditions. Choose one consistent mechanism.
2. No Guarantee of Fairness (Starvation): Mutual exclusion constructs (especially general ones like
mutexes/locks) don't guarantee fairness. A thread can potentially be blocked forever waiting for a lock if
other threads repeatedly acquire and release it before the blocked thread gets a chance. This is called
starvation.
3. Deadlock: This is a common and severe problem when using locks. It occurs when two or more threads
are permanently blocked, each waiting for a resource that another blocked thread holds.
o Nested Locks: Trying to acquire a lock within another lock can lead to deadlock if the inner lock is
already held by the same thread or another thread that needs the outer lock.
o Circular Waiting: If threads attempt to acquire multiple locks in different orders (e.g., Thread U
acquires Lock A then tries Lock B; Thread V acquires Lock B then tries Lock A), they can get into
a state where U holds A, V holds B, and both are waiting for the other's lock.
o Prevention: The primary rule to prevent deadlock with multiple locks is to ensure that all threads
acquire locks in the same, consistent order.
Processor-Memory Speed Gap: Processors operate significantly faster than main memory can supply data.
Caches: To bridge this gap, fast, small memory blocks called caches are integrated into processors.
o Temporal Locality: If a memory location is accessed, it's likely to be accessed again soon.
o Spatial Locality: If a memory location is accessed, nearby locations are likely to be accessed soon.
Cache Lines (Cache Blocks): When a processor needs data from main memory, it doesn't just retrieve that
single data item. Instead, an entire block of memory containing the requested item is transferred to the cache.
This block is called a cache line (typically 64 bytes or more).
In shared-memory systems with multiple cores, each having its own cache, the cache coherence problem arises
when multiple copies of the same data exist in different caches and main memory.
Scenario:
The Problem: Without cache coherence, thread 1's cache copy of x would still be 5, leading to my_z being 5,
which is incorrect.
Solution (Cache Coherence Protocols): Most modern systems implement hardware-level cache coherence
protocols (e.g., MESI protocol).
o When Thread 0 modifies x in its cache, the cache coherence protocol detects this write to a
shared variable.
o It then invalidates (marks as stale) all other cached copies of the cache line containing x (e.g., in
Thread 1's cache).
o Before Thread 1 reads x, its cache controller sees that its copy is invalid. It then fetches the
most up-to-date value of x (either from main memory or directly from Thread 0's cache if it's
the "owner" of the modified line).
The text uses matrix-vector multiplication (y=Ax) to illustrate cache performance and false sharing.
Serial Code:
The outer loop is parallelized because each iteration i updates a unique element y[i], eliminating loop-carried
dependencies on y. A and x are read-only.
Performance Observations (Table 5.6): The total number of floating-point operations is constant
(64,000,000) across different matrix dimensions, but run-times vary significantly for the serial program:
False Sharing
Now, let's focus on the parallel performance, especially the 8x8,000,000 input where parallel efficiency is
significantly lower.
o Occurs when two or more threads access different variables that happen to reside in the same cache
line.
o At least one of the threads updates its variable.
o Even though there's no true data sharing (threads aren't accessing the same variable), the cache
coherence protocol invalidates the entire shared cache line.
o This forces other threads to reload the entire line from main memory, incurring significant
performance penalties, as if they were truly sharing the same variable. Hence, "false sharing."
o In the 8x8,000,000 case, y is a very short vector (8 doubles), easily fitting into one 64-byte cache
line. When threads update different y[i] elements, they are all contending for the same cache line.
Every write causes an invalidation for other cores that might have a copy.
o In 8,000,000x8 and 8000x8000 cases, y is a much longer vector. Each thread is assigned a large,
contiguous block of y elements (e.g., 2000 or 2,000,000 elements).
o While there might be false sharing at the boundaries between blocks assigned to different threads
(where their assigned elements fall into the same cache line), these occurrences are rare relative to
the total number of accesses.
o Most of the time, threads are accessing different cache lines entirely, or their access to the shared
boundary lines is sufficiently separated in time that invalidations don't cause significant contention.
1. Padding: Insert dummy variables (padding) between array elements or data structures to ensure
that variables accessed by different threads fall into different cache lines. This increases
memory usage but can dramatically improve performance.
2. Thread-Private Storage with Final Update: Each thread can compute its partial results into a
private array (allocated on its own stack or heap). After all threads complete their computations,
they then atomically update the shared global array. This completely eliminates false sharing
during the main computation phase. (Similar to how reduction works, but applied to an array).
False sharing is a subtle but critical performance trap in shared-memory programming. It makes correct
parallelization (threads working on different parts of the data) perform poorly due to the underlying hardware's
cache coherence mechanisms operating at the granularity of cache lines. Understanding cache line size and
data layout is essential for optimizing parallel code.
Tasking in OpenMP
The OpenMP constructs discussed so far (like parallel for) are excellent for parallelizing loops with a fixed and
predetermined number of iterations.
However, many parallel problems don't fit this model. This includes:
Dynamic problems: Where the workload size or arrival rate is unpredictable (e.g., a web server
handling incoming requests).
Recursive algorithms: Where the number and structure of sub-problems are defined at runtime (e.g.,
tree traversals, quicksort, Fibonacci).
Producer-Consumer patterns: Where tasks are generated and consumed asynchronously.
Loops with unbounded iterations: while or do-while loops, or for loops where the upper bound is not
fixed.
Task Generation:
When a thread encounters a task directive, a new independent unit of computation (a "task") is generated by
the OpenMP runtime.
Scheduling:
This task is then placed into a pool of tasks to be scheduled for execution by any available thread in the team.
It's crucial to understand that the task is not necessarily executed immediately by the thread that created it. It
might be executed by the same thread, or by a different idle thread.
Often, you only want one thread to be responsible for generating tasks, especially if the task generation logic is
inherently serial. The #pragma omp single directive can be used for this:
If single is omitted, every thread in the team will execute the task-generating code, potentially creating
thread_count identical tasks for each logical task you intended.
The Fibonacci sequence (F_n=F_n−1+F_n−2) is a classic recursive problem demonstrating tasking. A naive
recursive implementation has a natural tree structure of computations.
int fib(int n) {
int i = 0;
int j = 0;
if (n <= 1) {
fibs[n] = n; // fibs is a global array
return n;
}
1. Initial Attempt (Incorrect due to Data Scope): Adding #pragma omp task before each recursive call:
o Problem: The default data scope for variables in tasks is private. This means i and j inside the fib
function that creates the tasks are distinct from the i and j used by the tasks themselves when they
execute fib(n-1) and fib(n-2).
o The fib(n-1) and fib(n-2) calls return their results into the private i and j variables within the context
of their respective tasks. However, the parent fib call (the one that created these tasks) still has its
own i and j variables, which remain at their initial values (e.g., 0). Thus, fibs[n] ends up being 0 +
0.
2. Addressing Data Scope (shared clause):
To make the i and j variables of the parent task accessible and writable by the child tasks that compute
fib(n-1) and fib(n-2), we must declare them as shared within the task directive:
// Program 5.6
int fib(int n) {
int i = 0; // Local variables for current fib call
int j = 0;
if (n <= 1) {
fibs[n] = n;
// Problem: The parent task might continue and use stale i/j values
// fibs[n] = i + j;
// return fibs[n];
}
New Problem:
Even with shared(i) and shared(j), the order of execution is not guaranteed. The thread executing the fib(n) call
might continue and try to use i and j before the tasks computing fib(n-1) and fib(n-2) have completed and
updated them. This results in unpredictable and often incorrect results.
To guarantee that the parent task waits for its child tasks to complete before proceeding, OpenMP provides the
#pragma omp taskwait directive.
This acts as a barrier for tasks: the thread encountering taskwait will pause until all tasks generated by the
current task (or by tasks generated in the same task region) have completed.
if (n <= 1) {
fibs[n] = n;
return n;
}
#pragma omp taskwait // Wait for the two child tasks to complete
fibs[n] = i + j; // Now i and j will have the correct results
return fibs[n];
}
With taskwait, the parallel Fibonacci program now produces correct results.
While tasking solves correctness issues for recursive algorithms, it introduces overhead:
Each task directive involves overhead for creating the task's data environment, scheduling it, etc. For fine-
grained tasks (like fib(2), fib(3)), this overhead can easily outweigh the benefits of parallelism, leading to
significant slowdowns compared to the serial version.
Optimizations:
if Clause for Task Creation Threshold: The if clause on the task directive allows conditional task creation.
Tasks are only generated if the if condition evaluates to true. Below a certain problem size, it's often more
efficient to execute the work serially.
#pragma omp task shared(i) if(n > 20) // Create task only if n is large enough
i = fib(n - 1);
This ensures that very small, cheap fib calls are executed serially by the thread that encounters them, avoiding
tasking overhead. The threshold (e.g., 20) is typically determined through experimentation.
In the Fibonacci example, the parent task creates two child tasks and then just waits. This means the parent
task is mostly idle.
A common optimization is to have the parent task perform one of the recursive calls directly, without creating
a new task for it:
if (n <= 1) {
fibs[n] = n;
return n;
}
j = fib(n - 2); // Current thread directly executes fib(n-2) without creating a task
This reduces the number of tasks created, hence reducing overhead. On a 64-core system with n=35, these two
optimizations (if clause + parent doing work) can halve execution time.
Thread safety
What is Thread-Safety?
A block of code or a function is thread-safe if it can be simultaneously executed by multiple threads without
causing race conditions, data corruption, or incorrect results.
The text provides an excellent example using the strtok function from string.h.
o It uses the previously cached static pointer to continue tokenizing from the last position in the
original string.
The core issue is that strtok uses an internal static variable to store its state (the pointer to the current position
in the string).
This static variable is global (or file-scope static), meaning it's shared among all threads.
Race Condition:
When Thread 0 calls strtok(lines[0], ...) it sets this internal static pointer to lines[0]. If Thread 1 then calls
strtok(lines[1], ...) before Thread 0 completes its tokenization of lines[0], Thread 1 overwrites this shared
static pointer with lines[1].
Incorrect Results:
o Thread 0 then makes a subsequent call strtok(NULL, ...) expecting to continue from lines[0],
but now strtok is looking at lines[1] (due to Thread 1's overwrite).
Conclusion: strtok is not thread-safe because it relies on shared, mutable internal state without any
synchronization mechanisms.
The text notes that it's "not uncommon" for C library functions to be non-thread-safe. Other common examples
include:
rand() from stdlib.h: Often uses an internal seed that gets updated. Multiple threads calling rand()
simultaneously can lead to incorrect or non-random sequences.
localtime() from time.h: May use a static buffer to store the struct tm return value, which can be
overwritten by concurrent calls.
For common non-thread-safe functions, the C standard or specific library implementations often provide re-
entrant (and thus thread-safe) versions.
These versions typically take an extra argument that allows the caller to provide the "state" explicitly, rather
than relying on internal static storage.
o saveptr_p: This is the crucial third argument. It's a pointer to a char* variable that your thread
owns. strtok_r will use this char* variable (pointed to by saveptr_p) to store its internal state for
this specific tokenization operation.
o Since each thread will declare its own saveptr variable (which is a private variable on the
thread's stack), there's no shared state, and thus no race condition.
// In Tokenize function
char *my_token;
char *saveptr; // This variable is private to each thread due to OpenMP private clause
#pragma omp parallel for schedule(static, 1) // my_token and saveptr are private
for (i = 0; i < line_count; i++) {
// ...
my_token = strtok_r(lines[i], " \t\n", &saveptr); // First call with string
while (my_token != NULL) {
// ...
my_token = strtok_r(NULL, " \t\n", &saveptr); // Subsequent calls with NULL
// ...
}
}
Key Takeaways on Thread-Safety:
Any function or code block that modifies shared data without proper synchronization is inherently not thread-
safe. This includes global variables, static variables, or parameters passed by reference that are shared.
When parallelizing code, be vigilant about library functions. Consult documentation or assume functions using
internal state might not be thread-safe. Look for _r versions or other thread-safe alternatives.
Design for Thread-Safety: When writing your own functions that might be called concurrently:
o Minimize Shared State: Design functions to operate primarily on local, private data whenever
possible.
o Encapsulate State: If shared state is necessary, encapsulate it within a data structure and protect
access to that data structure with explicit synchronization primitives (like mutexes/locks, atomic
operations, or OpenMP critical sections/atomics).
o Immutable Data: Functions that only read immutable (unchanging) shared data are generally
thread-safe.