Module 4 Notes Dr. NML
Module 4 Notes Dr. NML
Synchronization Directives
Used to manage thread safety and avoid race conditions:
• #pragma omp critical: Ensures only one thread executes the enclosed code at a time.
• #pragma omp atomic: Performs atomic updates to shared variables.
• #pragma omp barrier: Synchronizes all threads at a specific point.
• #pragma omp flush: Ensures memory consistency across threads.
• #pragma omp ordered: Enforces sequential execution of loop iterations.
Usage Tips
• Always include #include <omp.h> in your source code.
• Compile with OpenMP support (e.g., -fopenmp for GCC).
• Use environment variables like OMP_NUM_THREADS to control thread count.
The trapezoidal rule:
Let’s take a look at a somewhat more useful (and more complicated) example: the
trapezoidal rule for estimating the area under a curve. Recall from Section 3.2 that
if y = f (x) is a reasonably nice function, and a < b are real numbers, then we canestimate the
area between the graph of f (x), the vertical lines x = a and x = b, andthe x-axis by dividing the
interval [a, b] into n subintervals and approximating thearea over each subinterval by the area
of a trapezoid. See Fig. 5.3.
Also recall that if each subinterval has the same length and if we define h =
(b −a)/n, xi = a +ih, i = 0, 1, . . . , n, then our approximation will be
h[ f (x0)/2 +f (x1)+f (x2)+· · ·+f (xn−1)+ f (xn)/2].
Thus we can implement a serial algorithm using the following code:
h = (b−a ) / n ;
approx = (f ( a) + f ( b ) ) / 2 . 0 ;
for ( i = 1; i <= n−1; i ++) {
x_i = a + i∗h ;
approx += f ( x_i ) ;
}
approx = h∗approx ;
Assignment of trapezoids to threads.
A first OpenMP version
Recall that we applied Foster’s parallel program design methodology to the trapezoidal
rule as described in the following list (see Section 3.2.2):
1. We identified two types of jobs:
a. Computation of the areas of individual trapezoids, and
b. Adding the areas of trapezoids.
2. There is no communication among the jobs in the first collection, but each job in
the first collection communicates with job 1b.
3. We assumed that there would be many more trapezoids than cores, so we aggregated
jobs by assigning a contiguous block of trapezoids to each thread (and a
single thread to each core).2 Effectively, this partitioned the interval [a, b] into
larger subintervals, and each thread simply applied the serial trapezoidal rule to
its subinterval. See Fig. 5.4.
Scope of variables:
In parallel programming, especially in shared memory models like OpenMP, variable scope
determines how variables behave across multiple threads. Here's a detailed breakdown:
#include <stdio.h>
#include <omp.h>
🔹 Main Function
• shared_var: Shared among all threads. Any thread can read/write it.
• private_var: Will be declared private inside the parallel region, meaning each thread
gets its own copy.
🔹 Parallel Region
shared_var += 1;
🔹 Output
• Each thread prints its own private_var and the current value of shared_var.
🔹 Final Output
🛡️
How to Fix the Race Condition
Method1
shared_var += 1;
shared_var += 1;
Method 2
🔧 Syntax (OpenMP)
#pragma omp parallel for reduction(op:variable)
• op: Operator like +, *, -, min, max, &, |, ^
• variable: The shared variable to be reduced
🔹 Headers
#include <stdio.h>
#include <omp.h>
• stdio.h: Enables input/output functions like printf.
• omp.h: Provides OpenMP functions and directives for parallel programming.
🔹 Main Function
int N = 1000;
int sum = 0;
• N is the upper limit of the summation.
• sum is the variable that will hold the final result.
🔹 Final Output
printf("Sum = %d\n", sum);
• Prints the final result of the summation: ( \sum_{i=0}^{999} i = \frac{999 \cdot
1000}{2} = 499500 )
🧠 Summary
Feature Purpose
Note:
✅ Purpose:
Ensures that only one thread executes the enclosed block at a time.
🧩 Syntax:
#pragma omp critical
{
shared_var += 1;
}
📌 Use Case:
• When multiple operations need to be performed atomically.
• Suitable for complex updates involving multiple statements.
⚠️ Downside:
• Can be slower due to thread queuing.
• Higher overhead than atomic.
🧩 Syntax:
#pragma omp atomic
shared_var += 1;
📌 Use Case:
• When updating a variable with a simple operation (e.g., +=, ++, --).
• Faster and more efficient than critical.
⚠️ Limitation:
• Only works with simple expressions (single memory location, single operation).
🧠 Summary Table
🧪 Example Comparison
Using critical:
#pragma omp critical
{
sum += value;
count++;
}
Using atomic:
#pragma omp atomic
sum += value;
🧠 Supported Operators
+ Sum
* Product
- Subtraction
| Bitwise OR
^ Bitwise XOR
loop-carried dependency :
A loop-carried dependency occurs in parallel computing when an iteration of a loop
depends on the result of a previous iteration, making it difficult or impossible to safely
execute iterations in parallel.
What Is a Loop-Carried Dependency?
In a loop, if the computation in iteration i depends on the result of iteration i-1 (or any earlier
iteration), this is called a loop-carried dependency. It creates a sequential dependency that
prevents parallel execution.
Example:
for (int i = 1; i < N; i++)
{
A[i] = A[i-1] + B[i]; // A[i] depends on A[i-1]
}
Dependency: A[i] uses the result of A[i-1], which was computed in the previous
iteration.
• Result: Cannot be parallelized without changing the algorithm.
Why It Matters?
• Prevents parallelization: Threads can't safely execute iterations independently.
• Limits performance: Reduces the benefits of multi-core systems.
• Requires transformation: To parallelize, you may need to restructure the algorithm.
Iterations are divided evenly before Best when work per iteration is
Static
execution. uniform.
Like dynamic, but chunk size decreases over Good for load balancing with
Guided
time. decreasing workload.
Syntax Example
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
work(i);
}
You can also specify chunk size:
#pragma omp parallel for schedule(dynamic, 4)
⚙️ Static Scheduling
🔹 How It Works:
🧩 Syntax:
✅ Best For:
⚠️ Drawback:
⚡ Dynamic Scheduling
🔹 How It Works:
• Iterations are assigned on-the-fly as threads become available.
• Threads request new chunks after finishing their current ones.
🧩 Syntax:
✅ Best For:
⚠️ Drawback:
Comparison Table
🧪 Example Code
🧭 Guided Scheduling
🔹 What It Does:
• Assigns large chunks of iterations to threads at first.
• Gradually reduces chunk size as threads finish work.
• Helps balance load when early iterations are heavier.
🧩 Syntax:
#pragma omp parallel for schedule(guided[,chunk_size])
• chunk_size is optional. If omitted, OpenMP chooses a default.
• Smaller chunks toward the end improve load balancing.
✅ Use Case:
• When iteration workload decreases over time.
• Example: processing a sorted list where early items are more complex.
🕹️ Runtime Scheduling
🔹 What It Does:
• Defers scheduling decision to runtime environment.
• Uses the value of the environment variable OMP_SCHEDULE.
🧩 Syntax:
#pragma omp parallel for schedule(runtime)
• Set scheduling policy via:
export OMP_SCHEDULE="dynamic,4"
✅ Use Case:
• When you want to experiment with different strategies without recompiling.
Chunk Size Starts large, shrinks over time Set via environment variable
Use Case Decreasing workload per iteration Dynamic tuning and experimentation
The synchronization issues for receiving a message are a little different. Only the
owner of the queue (that is, the destination thread) will dequeue from a given message
queue. As long as we dequeue one message at a time, if there are at least two messages
in the queue, a call to Dequeue can’t possibly conflict with any calls to Enqueue. So
if we keep track of the size of the queue, we can avoid any synchronization (for
example, critical directives), as long as there are at least two messages.
Now you may be thinking, “What about the variable storing the size of the
queue?” This would be a problem if we simply store the size of the queue. However, if we store
two variables, enqueued and dequeued, then the number of messages
in the queue is queue_size = enqueued − dequeued
and the only thread that will update dequeued is the owner of the queue. Observe that
one thread can update enqueued at the same time that another thread is using it to
compute queue_size. To see this, let’s suppose thread q is computing queue_size. It
will either get the old value of enqueued or the new value. It may therefore compute
a queue_size of 0 or 1 when queue_size should actually be 1 or 2, respectively, but
in our program this will only cause a modest delay. Thread q will try again later if
queue_size is 0 when it should be 1, and it will execute the critical section directive
unnecessarily if queue_size is 1 when it should be 2.
Thus we can implement Try_receive as follows:
queue_size = enqueued − dequeued ;
i f ( queue_size == 0) return ;
e l s e i f ( queue_size == 1)
# pragma omp c r i t i c a l
Dequeue ( queue , &src , &mesg ) ;
else
Dequeue ( queue , &src , &mesg ) ;
Print_message ( src , mesg ) ;
5.8.5 Termination detection
We also need to think about implementation of the Done function. First note that the
following “obvious” implementation will have problems:
queue_size = enqueued - dequeued;
if (queue_size == 0)
return TRUE;
else
return FALSE;
If thread u executes this code, it’s entirely possible that some thread—call it thread
v—will send a message to thread u after u has computed queue_size = 0. Of course,
after thread u computes queue_size = 0, it will terminate and the message sent by
thread v will never be received.
However, in our program, after each thread has completed the for loop, it won’t
send any new messages. Thus if we add a counter done_sending, and each thread
increments this after completing its for loop, then we can implement Done as follows:
queue_size = enqueued - dequeued;
if (queue_size == 0 && done_sending == thread_count)
return TRUE;
else
return FALSE;
5.8.6 Startup
When the program begins execution, a single thread, the master thread, will get
command-line arguments and allocate an array of message queues, one for each
thread. This array needs to be shared among the threads, since any thread can send to
any other thread, and hence any thread can enqueue a message in any of the queues.
Given that a message queue will (at a minimum) store
• a list of messages,
• a pointer or index to the rear of the queue,
• a pointer or index to the front of the queue,
• a count of messages enqueued, and
• a count of messages dequeued,
it makes sense to store the queue in a struct, and to reduce the amount of copying
when passing arguments, it also makes sense to make the message queue an array of
pointers to structs. Thus once the array of queues is allocated by the master thread, we
can start the threads using a parallel directive, and each thread can allocate storage
for its individual queue.
🧩 Key Concepts
🔹 Shared Buffer
• A finite-size queue or array used to hold data between producers and consumers.
• Acts as a communication channel between threads.
🔹 Synchronization
• Required to prevent race conditions and data corruption.
void consume() {
#pragma omp critical
{
if (count > 0) {
int item = buffer[--count];
printf("Consumed: %d\n", item);
}
}
}
int main() {
#pragma omp parallel sections
{
#pragma omp section
for (int i = 0; i < 20; i++) produce(i);
🧩 Code Explanation
🔹 Global Variables
#define SIZE 10
int buffer[SIZE];
int count = 0;
``- `buffer`: A fixed-size array acting as a shared buffer between producer and consumer.
- `count`: Tracks the number of items currently in the buffer.
### 🔹 **Producer Function**
```c
void produce(int item) {
#pragma omp critical
{
if (count < SIZE) {
buffer[count++] = item;
printf("Produced: %d\n", item);
}
}
}
🔹 Consumer Function
void consume() {
#pragma omp critical
{
if (count > 0) {
int item = buffer[--count];
printf("Consumed: %d\n", item);
}
}
}
🔹 Main Function
• #pragma omp parallel sections: Creates two threads—one for producing, one for
consuming.
• Producer Section: Produces 20 items.
• Consumer Section: Attempts to consume 20 items.
⚠️ Behavior Notes
• Buffer Size = 10, but producer tries to add 20 items. Only the first 10 will be added;
the rest are ignored due to the if (count < SIZE) check.
• Consumer tries to consume 20 items, but only 10 are available. The rest are ignored
due to if (count > 0).
🧠 Summary
Component Role
buffer Shared storage for items
count Tracks buffer occupancy
produce() Adds items if buffer isn't full
consume() Removes items if buffer isn't empty
critical Prevents race conditions
parallel sections Runs producer and consumer concurrently
Type Purpose
#define SIZE 5
int buffer[SIZE];
int in = 0, out = 0;
buffer[in] = i;
printf("Produced: %d\n", i);
in = (in + 1) % SIZE;
int main() {
pthread_t prod, cons;
sem_init(&empty, 0, SIZE); // SIZE empty slots
sem_init(&full, 0, 0); // 0 full slots
pthread_mutex_init(&mutex, NULL);
pthread_join(prod, NULL);
pthread_join(cons, NULL);
sem_destroy(&empty);
sem_destroy(&full);
pthread_mutex_destroy(&mutex);
return 0;
}
🧠 How It Works
• empty semaphore: Tracks available buffer slots.
• full semaphore: Tracks filled slots.
• mutex: Ensures mutual exclusion during buffer access.
• Circular buffer: Prevents overflow by wrapping around.
🔧 How It Works:
• Modern CPUs use cache coherence protocols like MESI (Modified, Exclusive,
Shared, Invalid).
• These protocols ensure that:
o If one core modifies a cache line, other cores’ copies are invalidated.
o Reads always return the most up-to-date value.
🧩 Example:
#pragma omp parallel num_threads(2)
{
static int x = 0;
if (omp_get_thread_num() == 0) x = 42;
else printf("x = %d\n", x); // May print stale value if cache is not coherent
}
Without proper synchronization (e.g., #pragma omp flush), the second thread might read an
outdated value of x.
⚠️ 3. False Sharing
False sharing occurs when threads access different variables that reside on the same cache
line, causing unnecessary cache invalidations—even though the variables are logically
independent.
🧩 Example:
int data[2];
PaddedInt data[2];
• Align data using compiler directives (e.g., __attribute__((aligned(64))) in GCC).
📌 Summary Table
Concept Description
Cache Fast memory near CPU cores for frequently used data
Cache Coherence Ensures all cores see the latest value of shared data
False Sharing Performance issue from multiple threads accessing same cache line
In parallel programming, locks are used to control access to shared resources and prevent race
conditions. They ensure that only one thread can access a critical section of code at a time.
int main() {
int counter = 0;
omp_lock_t lock;
omp_destroy_lock(&lock); // Clean up
printf("Final counter: %d\n", counter);
return 0;
}
Function Description
⚠️ Caution
• Overuse of locks can lead to performance bottlenecks.
• Improper use can cause deadlocks or starvation.
In parallel programming, unlocking refers to the action of releasing a lock that a thread
previously acquired. This allows other threads to enter a critical section or access a shared
resource that was protected by the lock.
🔓 What Is Unlocking?
• When a thread acquires a lock, it gains exclusive access to a resource.
• Once it's done, it must unlock to let other threads proceed.
• Failing to unlock can lead to deadlocks, where other threads wait indefinitely.
int main() {
int counter = 0;
omp_lock_t lock;
omp_init_lock(&lock); // Initialize lock
omp_destroy_lock(&lock); // Clean up
printf("Final counter: %d\n", counter);
return 0;
}
🔧 Key Functions:
Function Purpose
🔹 A Task Includes:
• A block of code to execute
• A data environment (variables captured at creation)
• Optional dependencies or synchronization
🧩 Basic Syntax
#pragma omp parallel { #pragma omp single { for (int i = 0; i < N; i++) { #pragma omp task
process(i); } } }
• #pragma omp parallel: Starts a team of threads.
• #pragma omp single: Ensures only one thread creates tasks.
• #pragma omp task: Defines a task to be executed by any thread.
🔄 Task Scheduling
• Tasks are queued and executed dynamically by available threads.
• OpenMP runtime handles scheduling and load balancing.
• Tasks can be nested, dependent, or grouped.
#include <stdio.h>
#include <omp.h>
int fib(int n) {
int x, y;
if (n < 2)
return n;
#pragma omp task shared(x)
x = fib(n - 1);
#pragma omp task shared(y)
y = fib(n - 2);
#pragma omp taskwait
return x + y;
}
int main() {
int n = 10, result;
#pragma omp parallel
{
#pragma omp single
result = fib(n);
}
printf("Fibonacci(%d) = %d\n", n, result);
return 0;
}
Explanation for the above code:
This C program uses OpenMP tasking to compute the Fibonacci number recursively in
parallel. Let’s walk through it step by step:
🧩 Function: fib(int n)
int fib(int n) {
int x, y;
if (n < 2)
return n;
• Base case: If n is 0 or 1, return n.
• Recursive case: Compute fib(n-1) and fib(n-2).
🔹 Task Creation
#pragma omp task shared(x)
x = fib(n - 1);
🔹 Task Synchronization
#pragma omp taskwait
return x + y;
• #pragma omp taskwait ensures both tasks finish before proceeding.
• After both results are available, it returns their sum.
🧩 Main Function
int main() {
int n = 10, result;
🔹 Explanation:
• #pragma omp parallel: Starts a team of threads.
• #pragma omp single: Ensures only one thread initiates the recursive task tree.
• result = fib(n): Starts the recursive computation using tasks.
• printf(...): Prints the final result.
Thread Safety:
Thread safety in parallel programming means ensuring that shared data is accessed and
modified correctly when multiple threads run concurrently, preventing race conditions,
data corruption, and unpredictable behavior.
🧩 Function Signature
void Tokenize(char* lines[], int line_count, int thread_count)
• lines[]: Array of strings to be tokenized.
• line_count: Number of lines in the array.
• thread_count: Number of threads to use for parallel execution.
🔧 Variable Declarations
int my_rank, i, j;
char* my_token;
• my_rank: Thread ID.
• i: Loop index for lines.
• j: Token index within a line.
• my_token: Pointer to each token extracted.
🔄 Parallel Region
#pragma omp parallel num_threads(thread_count) \
default(none) private(my_rank, i, j, my_token) \
shared(lines, line_count)
• Starts a parallel region with thread_count threads.
• default(none): Forces explicit declaration of shared/private variables.
• private(...): Each thread has its own copy of these variables.
• shared(...): All threads access the same lines[] and line_count.
🔁 Work Distribution
#pragma omp for schedule(static, 1)
for (i = 0; i < line_count; i++) {
• Distributes the loop over line_count lines among threads.
• schedule(static, 1): Each thread gets one line at a time in round-robin fashion.
🧠 Tokenization Logic
printf("Thread %d > line %d = %s", my_rank, i, lines[i]);
j = 0;
my_token = strtok(lines[i], " \t\n");
while (my_token != NULL) {
printf("Thread %d > token %d = %s\n", my_rank, j, my_token);
my_token = strtok(NULL, " \t\n");
j++;
}
• Each thread prints the line it's processing.
• Uses strtok() to split the line into tokens using space, tab, and newline as delimiters.
• Prints each token with its index and thread ID.
When werun the program with a single thread, it correctly tokenizes the input stream. The first
time we run it with two threads and the input
Pease porridge hot.
Pease porridge cold.
Pease porridge in the pot
Nine days old.
the output is also correct. However, the second time we run it with this input, we get
the following output
Thread 0 > line 0 = Pease porridge hot.
Thread 1 > line 1 = Pease porridge cold.
Thread 0 > token 0 = Pease
Thread 1 > token 0 = Pease
Thread 0 > token 1 = porridge
Thread 1 > token 1 = cold.
Thread 0 > line 2 = Pease porridge in the pot
Thread 1 > line 3 = Nine days old.
Thread 0 > token 0 = Pease
Thread 1 > token 0 = Nine
Thread 0 > token 1 = days
Thread 1 > token 1 = old.
to persist from one call to the next.
Unfortunately for us, this cached string is shared,not private. Thus it appears that thread 1’s
call to strtok with the second line hasapparently overwritten the contents of thread 0’s call with
the first line. Even worse,thread 0 has found a token (“days”) that should be in thread 1’s output.
The strtok function is therefore not thread-safe: if multiple threads call it simultaneously,the
output it produces may not be correct. Regrettably, it’s not uncommonfor C library functions
to fail to be thread-safe. For example, neither the random numbergenerator rand in stdlib.h nor
the time conversion function localtime in [Link] guaranteed to be thread-safe. In some cases,
the C standard specifies an alternate,thread-safe version of a function. In fact, there is a thread-
safe version of strtok:
char∗ strtok_r (
char∗ string / ∗ in / out ∗ / ,
const char∗ separators / ∗ i n ∗ / ,
char ∗∗ saveptr_p / ∗ in / out ∗ / ) ;
The “_r” is supposed to suggest that the function is re-entrant, which is sometimes
used as a synonym for thread-safe.9 The first two arguments have the same purpose
as the arguments to strtok. The saveptr argument is used by strtok_r for keeping
track of where the function is in the input string; it serves the purpose of the cached
pointer in [Link] can correct our original Tokenize function by replacing the calls
to strtok with calls to strtok_r.We simply need to declare a char∗ variable to pass in
for the third argument, and replace the calls in line 18 and line 22 with the following
calls:
my_token = strtok_r ( lines [ i ] , " \t\n" , &saveptr ) ;
...
my_token = strtok_r ( NULL , " \t\n" , &saveptr ) ;
respectively.
5.11.1 Incorrect programs can produce correct output
Notice that our original version of the tokenizer program shows an especially insidious form
of program error: The first time we ran it with two threads, the programproduced correct output.
It wasn’t until a later run that we saw an error. This, unfortunately,is not a rare occurrence in
parallel programs.
For example, we can’t say whenthread 1 will first call strtok. If its first call takes place after
thread 0 has tokenizedits first line, then the tokens identified for the first line should be correct.
However, ifthread 1 calls strtok before thread 0 has finished tokenizing its first line, it’s
entirelypossible that thread 0 may not identify all the tokens in the first line, so it’s especially
important in developing shared-memory programs to resist the temptation to assumethat since
a program produces correct output, it must be correct. We always need to
be wary of race conditions.