0% found this document useful (0 votes)
4 views41 pages

Module 4 Notes Dr. NML

The document provides an overview of shared-memory programming using OpenMP, detailing pragma directives for parallel regions, work-sharing, tasking, and synchronization. It discusses variable scope, including shared, private, firstprivate, lastprivate, and reduction variables, along with examples of their usage. Additionally, it covers loop-carried dependencies and scheduling strategies in OpenMP, emphasizing the importance of choosing the right scheduling method for performance and load balancing.

Uploaded by

Nagaraj Lutimath
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views41 pages

Module 4 Notes Dr. NML

The document provides an overview of shared-memory programming using OpenMP, detailing pragma directives for parallel regions, work-sharing, tasking, and synchronization. It discusses variable scope, including shared, private, firstprivate, lastprivate, and reduction variables, along with examples of their usage. Additionally, it covers loop-carried dependencies and scheduling strategies in OpenMP, emphasizing the importance of choosing the right scheduling method for performance and load balancing.

Uploaded by

Nagaraj Lutimath
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Shared-memory programming with OpenMP_Module4

openmp pragmas and directives:


OpenMP pragma directives are special compiler instructions used to enable parallel
programming in C, C++, and Fortran by leveraging shared memory multiprocessing.

Parallel Region Directives


These define blocks of code to be executed by multiple threads:
• #pragma omp parallel: Starts a parallel region where code is executed by multiple
threads.
• #pragma omp for: Distributes loop iterations among threads in a parallel region.
• #pragma omp sections: Divides code into sections, each executed by a different thread.
• #pragma omp single: Ensures a block of code is executed by only one thread.

Work-Sharing and Tasking


These control how work is divided among threads:
• #pragma omp task: Defines a task to be executed independently.
• #pragma omp taskwait: Waits for child tasks to complete.
• #pragma omp master: Restricts execution to the master thread only.

Synchronization Directives
Used to manage thread safety and avoid race conditions:
• #pragma omp critical: Ensures only one thread executes the enclosed code at a time.
• #pragma omp atomic: Performs atomic updates to shared variables.
• #pragma omp barrier: Synchronizes all threads at a specific point.
• #pragma omp flush: Ensures memory consistency across threads.
• #pragma omp ordered: Enforces sequential execution of loop iterations.

Usage Tips
• Always include #include <omp.h> in your source code.
• Compile with OpenMP support (e.g., -fopenmp for GCC).
• Use environment variables like OMP_NUM_THREADS to control thread count.
The trapezoidal rule:
Let’s take a look at a somewhat more useful (and more complicated) example: the
trapezoidal rule for estimating the area under a curve. Recall from Section 3.2 that
if y = f (x) is a reasonably nice function, and a < b are real numbers, then we canestimate the
area between the graph of f (x), the vertical lines x = a and x = b, andthe x-axis by dividing the
interval [a, b] into n subintervals and approximating thearea over each subinterval by the area
of a trapezoid. See Fig. 5.3.

Also recall that if each subinterval has the same length and if we define h =
(b −a)/n, xi = a +ih, i = 0, 1, . . . , n, then our approximation will be
h[ f (x0)/2 +f (x1)+f (x2)+· · ·+f (xn−1)+ f (xn)/2].
Thus we can implement a serial algorithm using the following code:
h = (b−a ) / n ;
approx = (f ( a) + f ( b ) ) / 2 . 0 ;
for ( i = 1; i <= n−1; i ++) {
x_i = a + i∗h ;
approx += f ( x_i ) ;
}
approx = h∗approx ;
Assignment of trapezoids to threads.
A first OpenMP version
Recall that we applied Foster’s parallel program design methodology to the trapezoidal
rule as described in the following list (see Section 3.2.2):
1. We identified two types of jobs:
a. Computation of the areas of individual trapezoids, and
b. Adding the areas of trapezoids.
2. There is no communication among the jobs in the first collection, but each job in
the first collection communicates with job 1b.
3. We assumed that there would be many more trapezoids than cores, so we aggregated
jobs by assigning a contiguous block of trapezoids to each thread (and a
single thread to each core).2 Effectively, this partitioned the interval [a, b] into
larger subintervals, and each thread simply applied the serial trapezoidal rule to
its subinterval. See Fig. 5.4.
Scope of variables:
In parallel programming, especially in shared memory models like OpenMP, variable scope
determines how variables behave across multiple threads. Here's a detailed breakdown:

🧠 Variable Scope in Parallel Programming


1. Shared Variables
• Definition: Accessible by all threads.
• Use Case: Common data like input arrays or configuration values.
• Risk: If multiple threads write to a shared variable without synchronization, it can cause
race conditions.
2. Private Variables
• Definition: Each thread has its own copy.
• Use Case: Loop counters, temporary variables.
• Benefit: Prevents interference between threads.
3. Firstprivate
• Definition: Like private, but initialized with the value from the master thread.
• Use Case: When each thread needs a copy of a variable with the same initial value.
4. Lastprivate
• Definition: Like private, but the value from the last iteration/thread is copied back to
the original variable.
• Use Case: Useful in loops where the final value is needed after parallel execution.
5. Reduction Variables
• Definition: Used for operations like sum, max, min across threads.
• Use Case: Accumulating results safely.
• Example: #pragma omp parallel for reduction(+:sum)

🔍 Example in OpenMP (C/C++)


int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += array[i]; // sum is a reduction variable
}
• sum is shared but handled safely via reduction.
• i is private by default in OpenMP loops.

Example program to show shared and private variables


#include <stdio.h>
#include <omp.h>
int main() {
int shared_var = 100; // Shared by default
int private_var = 0; // Will be made private
#pragma omp parallel private(private_var) shared(shared_var)
{
int thread_id = omp_get_thread_num();

// Each thread gets its own copy of private_var


private_var = thread_id * 10;

// All threads access shared_var


shared_var += 1;
printf("Thread %d: private_var = %d, shared_var = %d\n",
thread_id, private_var, shared_var);
}
printf("Final shared_var = %d\n", shared_var);
return 0;
}
• shared_var is shared across threads. Each thread increments it, leading to a race
condition (you’ll see inconsistent results).
• private_var is private to each thread. Each thread sets it independently.
• thread_id is also private by default in OpenMP.

Explanation for the above code

#include <stdio.h>

#include <omp.h>

• stdio.h is for input/output functions like printf.


• omp.h enables OpenMP functionality for parallel programming.

🔹 Main Function

int shared_var = 100; // Shared by default

int private_var = 0; // Will be made private

• shared_var: Shared among all threads. Any thread can read/write it.
• private_var: Will be declared private inside the parallel region, meaning each thread
gets its own copy.
🔹 Parallel Region

#pragma omp parallel private(private_var) shared(shared_var)

• This directive starts a parallel region.


• private(private_var): Each thread has its own independent copy of private_var.
• shared(shared_var): All threads share the same shared_var.

🔹 Inside the Parallel Block

int thread_id = omp_get_thread_num();

private_var = thread_id * 10;

shared_var += 1;

• thread_id: Gets the unique ID of the thread (0 to N-1).


• private_var = thread_id * 10: Each thread sets its own private_var based on its ID.
• shared_var += 1: All threads increment the shared variable. This causes a race condition
because multiple threads write to it simultaneously without synchronization.

🔹 Output

printf("Thread %d: private_var = %d, shared_var = %d\n",

thread_id, private_var, shared_var);

• Each thread prints its own private_var and the current value of shared_var.

🔹 Final Output

printf("Final shared_var = %d\n", shared_var);

• After all threads finish, the final value of shared_var is printed.


• Due to the race condition, this value may be inconsistent or unpredictable.

🛡️
How to Fix the Race Condition

Method1

To make shared_var += 1 thread-safe, use:

#pragma omp atomic

shared_var += 1;

Or wrap it in a critical section:

#pragma omp critical

shared_var += 1;

🧠 Summary of Variable Scope

Variable Scope Behavior


shared_var Shared All threads access and modify it
private_var Private Each thread has its own copy
thread_id Private Automatically private in OpenMP

Method 2

To avoid race conditions on shared_var, use:


#pragma omp atomic shared_var += 1;

Or use a reduction if you're accumulating:


int total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < 100; i++) { total += i; }
The reduction clause in parallel programming

The reduction clause in parallel programming (especially in OpenMP) is used to safely


perform operations like summation, multiplication, min/max, etc., across multiple
threads without race conditions.

🧮 What Is the Reduction Clause?


In parallel loops, multiple threads may need to update a shared variable. The reduction clause
ensures each thread works on a private copy and combines the results at the end.

🔧 Syntax (OpenMP)
#pragma omp parallel for reduction(op:variable)
• op: Operator like +, *, -, min, max, &, |, ^
• variable: The shared variable to be reduced

📌 Example: Sum of Array Elements


#include <stdio.h>
#include <omp.h>
int main() {
int N = 1000;
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += i;
}
printf("Sum = %d\n", sum);
return 0;
}
• Each thread computes a partial sum.
• OpenMP combines them using + after the loop.
Explanation for the above code:
🧩 Code Breakdown

🔹 Headers
#include <stdio.h>
#include <omp.h>
• stdio.h: Enables input/output functions like printf.
• omp.h: Provides OpenMP functions and directives for parallel programming.

🔹 Main Function
int N = 1000;
int sum = 0;
• N is the upper limit of the summation.
• sum is the variable that will hold the final result.

🔹 Parallel Loop with Reduction


#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += i;
}
This is the heart of the program:
• #pragma omp parallel for: Tells the compiler to parallelize the for loop.
• reduction(+:sum):
o Each thread gets a private copy of sum.
o Threads compute partial sums independently.
o After the loop, OpenMP combines all partial sums using the + operator and
stores the result in the original sum.
This avoids race conditions, which would occur if all threads tried to update sum
simultaneously.

🔹 Final Output
printf("Sum = %d\n", sum);
• Prints the final result of the summation: ( \sum_{i=0}^{999} i = \frac{999 \cdot
1000}{2} = 499500 )

✅ Why Use reduction?


Without reduction, you'd need to manually synchronize access to sum using #pragma
omp critical or #pragma omp atomic, which is slower and more error-prone.

🧠 Summary

Feature Purpose

parallel for Distributes loop iterations across threads

reduction(+:sum) Safely accumulates values across threads

Note:

🔒 #pragma omp critical

✅ Purpose:
Ensures that only one thread executes the enclosed block at a time.

🧩 Syntax:
#pragma omp critical
{
shared_var += 1;
}

📌 Use Case:
• When multiple operations need to be performed atomically.
• Suitable for complex updates involving multiple statements.

⚠️ Downside:
• Can be slower due to thread queuing.
• Higher overhead than atomic.

⚡ #pragma omp atomic


✅ Purpose:
Ensures that a single memory operation (like increment or assignment) is performed
atomically.

🧩 Syntax:
#pragma omp atomic
shared_var += 1;

📌 Use Case:
• When updating a variable with a simple operation (e.g., +=, ++, --).
• Faster and more efficient than critical.

⚠️ Limitation:
• Only works with simple expressions (single memory location, single operation).

🧠 Summary Table

Feature #pragma omp critical #pragma omp atomic

Scope Multiple statements Single memory operation

Performance Slower (more overhead) Faster (less overhead)

Flexibility High Limited

Use Case Complex updates Simple updates

🧪 Example Comparison
Using critical:
#pragma omp critical
{
sum += value;
count++;
}
Using atomic:
#pragma omp atomic
sum += value;
🧠 Supported Operators

Operator Use Case

+ Sum

* Product

- Subtraction

& Bitwise AND

| Bitwise OR

^ Bitwise XOR

min Minimum value

max Maximum value

🛡️ Why Use Reduction?


• Avoids race conditions
• Improves performance by parallelizing safely
• Simplifies code compared to manual synchronization

loop-carried dependency :
A loop-carried dependency occurs in parallel computing when an iteration of a loop
depends on the result of a previous iteration, making it difficult or impossible to safely
execute iterations in parallel.
What Is a Loop-Carried Dependency?
In a loop, if the computation in iteration i depends on the result of iteration i-1 (or any earlier
iteration), this is called a loop-carried dependency. It creates a sequential dependency that
prevents parallel execution.
Example:
for (int i = 1; i < N; i++)
{
A[i] = A[i-1] + B[i]; // A[i] depends on A[i-1]
}
Dependency: A[i] uses the result of A[i-1], which was computed in the previous
iteration.
• Result: Cannot be parallelized without changing the algorithm.
Why It Matters?
• Prevents parallelization: Threads can't safely execute iterations independently.
• Limits performance: Reduces the benefits of multi-core systems.
• Requires transformation: To parallelize, you may need to restructure the algorithm.

Scheduling in parallel program


Scheduling in parallel programming refers to how loop iterations or tasks are assigned to
threads. In OpenMP, scheduling affects performance, load balancing, and efficiency.

🧠 What Is Scheduling in Parallel Programming?


When you parallelize a loop, the system must decide which thread executes which iteration.
This decision is called scheduling, and it can be static, dynamic, guided, or runtime-based.

Types of Scheduling in OpenMP

Type Description Use Case

Iterations are divided evenly before Best when work per iteration is
Static
execution. uniform.

Threads grab chunks of iterations as they


Dynamic Ideal when work per iteration varies.
finish previous ones.

Like dynamic, but chunk size decreases over Good for load balancing with
Guided
time. decreasing workload.

Scheduling is determined by environment Useful for tuning without


Runtime
variables at runtime. recompiling.

Syntax Example
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
work(i);
}
You can also specify chunk size:
#pragma omp parallel for schedule(dynamic, 4)

📌 Why Scheduling Matters


• Performance: Poor scheduling can lead to idle threads and slow execution.
• Load Balancing: Dynamic and guided scheduling help balance uneven workloads.
• Scalability: Choosing the right strategy improves scalability across cores.

In parallel programming—especially with OpenMP—static and dynamic scheduling are


strategies for distributing loop iterations among threads. Choosing the right one can
significantly impact performance and load balancing.

⚙️ Static Scheduling

🔹 How It Works:

• Iterations are divided before execution.


• Each thread gets a fixed chunk of iterations.
• Distribution is predictable and repeatable.

🧩 Syntax:

#pragma omp parallel for schedule(static[,chunk_size])

• chunk_size (optional): Number of iterations per thread.


• If omitted, OpenMP divides iterations evenly.

✅ Best For:

• Uniform workloads: Each iteration takes roughly the same time.


• Low overhead: No runtime decision-making.

⚠️ Drawback:

• Poor load balancing if iterations vary in complexity.

⚡ Dynamic Scheduling

🔹 How It Works:
• Iterations are assigned on-the-fly as threads become available.
• Threads request new chunks after finishing their current ones.

🧩 Syntax:

#pragma omp parallel for schedule(dynamic[,chunk_size])

• chunk_size (optional): Number of iterations per request.


• Smaller chunks improve load balancing but increase overhead.

✅ Best For:

• Irregular workloads: Iteration time varies.


• Better load balancing: Threads stay busy.

⚠️ Drawback:

• Slightly higher overhead due to dynamic assignment.

Comparison Table

Feature Static Scheduling Dynamic Scheduling


Assignment Time Before loop starts During loop execution
Load Balancing Poor for uneven workloads Good for uneven workloads
Overhead Low Moderate
Predictability High Low
Use Case Uniform iteration cost Variable iteration cost

🧪 Example Code

#pragma omp parallel for schedule(static, 4)


for (int i = 0; i < N; i++) {
process(i);
}

#pragma omp parallel for schedule(dynamic, 2)


for (int i = 0; i < N; i++) {
process(i);
}
In parallel programming with OpenMP, guided and runtime scheduling strategies help control
how loop iterations are distributed among threads. These strategies are especially useful for
optimizing performance when workloads vary or need tuning.

🧭 Guided Scheduling

🔹 What It Does:
• Assigns large chunks of iterations to threads at first.
• Gradually reduces chunk size as threads finish work.
• Helps balance load when early iterations are heavier.

🧩 Syntax:
#pragma omp parallel for schedule(guided[,chunk_size])
• chunk_size is optional. If omitted, OpenMP chooses a default.
• Smaller chunks toward the end improve load balancing.

✅ Use Case:
• When iteration workload decreases over time.
• Example: processing a sorted list where early items are more complex.

🕹️ Runtime Scheduling

🔹 What It Does:
• Defers scheduling decision to runtime environment.
• Uses the value of the environment variable OMP_SCHEDULE.

🧩 Syntax:
#pragma omp parallel for schedule(runtime)
• Set scheduling policy via:
export OMP_SCHEDULE="dynamic,4"

✅ Use Case:
• When you want to experiment with different strategies without recompiling.

• Ideal for performance tuning and benchmarking.


📊 Comparison Table

Feature Guided Scheduling Runtime Scheduling

Chunk Size Starts large, shrinks over time Set via environment variable

Flexibility Adaptive but fixed at compile time Fully tunable at runtime

Performance Good for uneven workloads Depends on chosen runtime strategy

Use Case Decreasing workload per iteration Dynamic tuning and experimentation

Producer-Consumer in Parallel Programming


5.8 Producers and consumers
.
5.8.1 Queues
Recall that a queue is a list abstract datatype in which new elements are inserted
at the “rear” of the queue and elements are removed from the “front” of the queue.
A queue can thus be viewed as an abstraction of a line of customers waiting to pay
for their groceries in a supermarket. The elements of the list are the customers. New
customers go to the end or “rear” of the line, and the next customer to check out is
the customer standing at the “front” of the line.
When a new entry is added to the rear of a queue, we sometimes say that the entry
has been “enqueued,” and when an entry is removed from the front of a queue, we
sometimes say that the entry has been “dequeued.”
Queues occur frequently in computer science. For example, if we have a number
of processes, each of which wants to store some data on a hard drive, then a natural
way to ensure that only one process writes to the disk at a time is to have the processes
form a queue, that is, the first process that wants to write gets access to the drive first,
the second process gets access to the drive next, and so on.
A queue is also a natural data structure to use in many multithreaded applications.
For example, suppose we have several “producer” threads and several “consumer”
threads. The producer threads might “produce” requests for data from a server—
for example, current stock prices—while the consumer threads might “consume” the
request by finding or generating the requested data—the current stock prices. The
producer threads could enqueue the requested prices, and the consumer threads could
dequeue them. In this example, the process wouldn’t be completed until the consumer
threads had given the requested data to the producer threads.
5.8.2 Message-passing
Another natural application would be implementing message-passing on a sharedmemory
system. Each thread could have a shared-message queue, and when one
thread wanted to “send a message” to another thread, it could enqueue the message
in the destination thread’s queue. A thread could receive a message by dequeuing the
message at the head of its message queue.
Let’s implement a relatively simple message-passing program, in which each
thread generates random integer “messages” and random destinations for the messages.
After creating the message, the thread enqueues the message in the appropriate
5.8.3 Sending messages
Note that accessing a message queue to enqueue a message is probably a critical
section. Although we haven’t looked into the details of the implementation of the
message queue, it seems likely that we’ll want to have a variable that keeps track of
the rear of the queue. For example, if we use a singly linked list with the tail of the
list corresponding to the rear of the queue, then, to efficiently enqueue, we would
want to store a pointer to the rear. When we enqueue a new message, we’ll need to
check and update the rear pointer. If two threads try to do this simultaneously, we
may lose a message that has been enqueued by one of the threads. (It might help to
draw a picture!) The results of the two operations will conflict, and hence enqueuing
a message will form a critical section.
Pseudocode for the Send_msg() function might look something like this:
mesg = random ( ) ;
dest = random ( ) % thread_count ;
# pragma omp c r i t i c a l
Enqueue ( queue , dest , my_rank , mesg ) ;
Note that this allows a thread to send a message to itself.
5.8.4 Receiving messages

The synchronization issues for receiving a message are a little different. Only the
owner of the queue (that is, the destination thread) will dequeue from a given message
queue. As long as we dequeue one message at a time, if there are at least two messages
in the queue, a call to Dequeue can’t possibly conflict with any calls to Enqueue. So
if we keep track of the size of the queue, we can avoid any synchronization (for
example, critical directives), as long as there are at least two messages.
Now you may be thinking, “What about the variable storing the size of the
queue?” This would be a problem if we simply store the size of the queue. However, if we store
two variables, enqueued and dequeued, then the number of messages
in the queue is queue_size = enqueued − dequeued
and the only thread that will update dequeued is the owner of the queue. Observe that
one thread can update enqueued at the same time that another thread is using it to
compute queue_size. To see this, let’s suppose thread q is computing queue_size. It
will either get the old value of enqueued or the new value. It may therefore compute
a queue_size of 0 or 1 when queue_size should actually be 1 or 2, respectively, but
in our program this will only cause a modest delay. Thread q will try again later if
queue_size is 0 when it should be 1, and it will execute the critical section directive
unnecessarily if queue_size is 1 when it should be 2.
Thus we can implement Try_receive as follows:
queue_size = enqueued − dequeued ;
i f ( queue_size == 0) return ;
e l s e i f ( queue_size == 1)
# pragma omp c r i t i c a l
Dequeue ( queue , &src , &mesg ) ;
else
Dequeue ( queue , &src , &mesg ) ;
Print_message ( src , mesg ) ;
5.8.5 Termination detection
We also need to think about implementation of the Done function. First note that the
following “obvious” implementation will have problems:
queue_size = enqueued - dequeued;
if (queue_size == 0)
return TRUE;
else
return FALSE;
If thread u executes this code, it’s entirely possible that some thread—call it thread
v—will send a message to thread u after u has computed queue_size = 0. Of course,
after thread u computes queue_size = 0, it will terminate and the message sent by
thread v will never be received.
However, in our program, after each thread has completed the for loop, it won’t
send any new messages. Thus if we add a counter done_sending, and each thread
increments this after completing its for loop, then we can implement Done as follows:
queue_size = enqueued - dequeued;
if (queue_size == 0 && done_sending == thread_count)
return TRUE;
else
return FALSE;
5.8.6 Startup
When the program begins execution, a single thread, the master thread, will get
command-line arguments and allocate an array of message queues, one for each
thread. This array needs to be shared among the threads, since any thread can send to
any other thread, and hence any thread can enqueue a message in any of the queues.
Given that a message queue will (at a minimum) store
• a list of messages,
• a pointer or index to the rear of the queue,
• a pointer or index to the front of the queue,
• a count of messages enqueued, and
• a count of messages dequeued,
it makes sense to store the queue in a struct, and to reduce the amount of copying
when passing arguments, it also makes sense to make the message queue an array of
pointers to structs. Thus once the array of queues is allocated by the master thread, we
can start the threads using a parallel directive, and each thread can allocate storage
for its individual queue.

In parallel programming, the Producer-Consumer model is a classic synchronization


pattern where:
• Producers generate data and place it in a shared buffer.
• Consumers retrieve and process that data from the buffer.

🧩 Key Concepts

🔹 Shared Buffer
• A finite-size queue or array used to hold data between producers and consumers.
• Acts as a communication channel between threads.

🔹 Synchronization
• Required to prevent race conditions and data corruption.

Producer and consumer model:


#include <stdio.h>
#include <omp.h>
#define SIZE 10
int buffer[SIZE];
int count = 0;

void produce(int item) {


#pragma omp critical
{
if (count < SIZE) {
buffer[count++] = item;
printf("Produced: %d\n", item);
}
}
}

void consume() {
#pragma omp critical
{
if (count > 0) {
int item = buffer[--count];
printf("Consumed: %d\n", item);
}
}
}

int main() {
#pragma omp parallel sections
{
#pragma omp section
for (int i = 0; i < 20; i++) produce(i);

#pragma omp section


for (int i = 0; i < 20; i++) consume();
}
return 0;
}

This C program demonstrates a simple producer-consumer model using OpenMP in a


shared memory environment. Let's break it down step by step:

🧩 Code Explanation

🔹 Global Variables

#define SIZE 10
int buffer[SIZE];
int count = 0;
``- `buffer`: A fixed-size array acting as a shared buffer between producer and consumer.
- `count`: Tracks the number of items currently in the buffer.
### 🔹 **Producer Function**
```c
void produce(int item) {
#pragma omp critical
{
if (count < SIZE) {
buffer[count++] = item;
printf("Produced: %d\n", item);
}
}
}

• Purpose: Adds an item to the buffer if there's space.


• #pragma omp critical: Ensures only one thread accesses/modifies buffer and count at
a time to prevent race conditions.
• Condition: Only adds if count < SIZE (i.e., buffer isn't full).

🔹 Consumer Function

void consume() {
#pragma omp critical
{
if (count > 0) {
int item = buffer[--count];
printf("Consumed: %d\n", item);
}
}
}

• Purpose: Removes an item from the buffer if it's not empty.


• #pragma omp critical: Protects shared access to buffer and count.
• Condition: Only consumes if count > 0.

🔹 Main Function

#pragma omp parallel sections


{
#pragma omp section
for (int i = 0; i < 20; i++) produce(i);

#pragma omp section


for (int i = 0; i < 20; i++) consume();
}

• #pragma omp parallel sections: Creates two threads—one for producing, one for
consuming.
• Producer Section: Produces 20 items.
• Consumer Section: Attempts to consume 20 items.

⚠️ Behavior Notes

• Buffer Size = 10, but producer tries to add 20 items. Only the first 10 will be added;
the rest are ignored due to the if (count < SIZE) check.
• Consumer tries to consume 20 items, but only 10 are available. The rest are ignored
due to if (count > 0).

🧠 Summary

Component Role
buffer Shared storage for items
count Tracks buffer occupancy
produce() Adds items if buffer isn't full
consume() Removes items if buffer isn't empty
critical Prevents race conditions
parallel sections Runs producer and consumer concurrently

producer–consumer synchronization and Semaphores


In parallel programming, producer–consumer synchronization ensures that producers and
consumers coordinate safely when accessing a shared buffer. Semaphores are a powerful tool
for managing this coordination, especially in low-level threading models like POSIX threads
or systems programming.

🧩 Producer–Consumer Problem Overview


• Producer: Generates data and places it in a shared buffer.
• Consumer: Retrieves data from the buffer and processes it.
• Challenge: Prevent race conditions and buffer overflows/underflows.

🔐 Synchronization with Semaphores


Semaphores are integer counters used to control access to shared resources. They come in two
types:

Type Purpose

Counting Tracks available resources (e.g., buffer slots)

Binary Acts like a mutex (0 or 1)

🧪 Example in C with POSIX Semaphores


#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

#define SIZE 5
int buffer[SIZE];
int in = 0, out = 0;

sem_t empty, full;


pthread_mutex_t mutex;

void* producer(void* arg) {


for (int i = 0; i < 10; i++) {
sem_wait(&empty); // Wait for empty slot
pthread_mutex_lock(&mutex); // Lock buffer

buffer[in] = i;
printf("Produced: %d\n", i);
in = (in + 1) % SIZE;

pthread_mutex_unlock(&mutex); // Unlock buffer


sem_post(&full); // Signal full slot
}
return NULL;
}

void* consumer(void* arg) {


for (int i = 0; i < 10; i++) {
sem_wait(&full); // Wait for full slot
pthread_mutex_lock(&mutex); // Lock buffer

int item = buffer[out];


printf("Consumed: %d\n", item);
out = (out + 1) % SIZE;

pthread_mutex_unlock(&mutex); // Unlock buffer


sem_post(&empty); // Signal empty slot
}
return NULL;
}

int main() {
pthread_t prod, cons;
sem_init(&empty, 0, SIZE); // SIZE empty slots
sem_init(&full, 0, 0); // 0 full slots
pthread_mutex_init(&mutex, NULL);

pthread_create(&prod, NULL, producer, NULL);


pthread_create(&cons, NULL, consumer, NULL);

pthread_join(prod, NULL);
pthread_join(cons, NULL);
sem_destroy(&empty);
sem_destroy(&full);
pthread_mutex_destroy(&mutex);

return 0;
}

🧠 How It Works
• empty semaphore: Tracks available buffer slots.
• full semaphore: Tracks filled slots.
• mutex: Ensures mutual exclusion during buffer access.
• Circular buffer: Prevents overflow by wrapping around.

✅ Benefits of Using Semaphores


• Efficient synchronization between threads.
• Prevents busy-waiting.
• Scales well for multiple producers and consumers.
Would you like to see this implemented in Python using threading and Semaphore, or
visualized with a diagram?

Cache ,cache coherence and False Sharing:


In parallel programming, especially on multicore systems, caches, cache coherence, and false
sharing are critical concepts that directly impact performance and correctness. Let’s break
them down:

🧠 1. Caches in Parallel Systems


• Cache is a small, fast memory located close to the CPU core.
• Each core typically has its own private L1/L2 cache, and may share a L3 cache.
• Caches store frequently accessed data to reduce latency and improve performance.
🔄 2. Cache Coherence
When multiple threads (on different cores) access and modify shared data, cache coherence
ensures that all cores see the most recent value of that data.

🔧 How It Works:
• Modern CPUs use cache coherence protocols like MESI (Modified, Exclusive,
Shared, Invalid).
• These protocols ensure that:
o If one core modifies a cache line, other cores’ copies are invalidated.
o Reads always return the most up-to-date value.

🧩 Example:
#pragma omp parallel num_threads(2)
{
static int x = 0;
if (omp_get_thread_num() == 0) x = 42;
else printf("x = %d\n", x); // May print stale value if cache is not coherent
}
Without proper synchronization (e.g., #pragma omp flush), the second thread might read an
outdated value of x.

⚠️ 3. False Sharing
False sharing occurs when threads access different variables that reside on the same cache
line, causing unnecessary cache invalidations—even though the variables are logically
independent.

🧩 Example:
int data[2];

#pragma omp parallel num_threads(2)


{
int tid = omp_get_thread_num();
for (int i = 0; i < 1000000; i++) {
data[tid]++; // Each thread updates its own element
}
}
• Even though data[0] and data[1] are separate, they may share the same cache line.
• This leads to performance degradation due to frequent cache line bouncing between
cores.

🛠️ How to Avoid False Sharing:


• Padding: Add unused space between variables to ensure they occupy separate cache
lines.
typedef struct {
int value;
char padding[60]; // Assuming 64-byte cache line
} PaddedInt;

PaddedInt data[2];
• Align data using compiler directives (e.g., __attribute__((aligned(64))) in GCC).

📌 Summary Table

Concept Description

Cache Fast memory near CPU cores for frequently used data

Cache Coherence Ensures all cores see the latest value of shared data

False Sharing Performance issue from multiple threads accessing same cache line

In parallel programming, locks are used to control access to shared resources and prevent race
conditions. They ensure that only one thread can access a critical section of code at a time.

🔐 What Are Locks?


• A lock is a synchronization mechanism.
• When a thread acquires (locks) a resource, other threads must wait until it releases
(unlocks) it.
• This ensures mutual exclusion—only one thread executes the critical section at a time.
🧩 Example in C with OpenMP
#include <stdio.h>
#include <omp.h>

int main() {
int counter = 0;
omp_lock_t lock;

omp_init_lock(&lock); // Initialize the lock

#pragma omp parallel num_threads(4)


{
for (int i = 0; i < 1000; i++) {
omp_set_lock(&lock); // Acquire the lock
counter++; // Critical section
omp_unset_lock(&lock); // Release the lock
}
}

omp_destroy_lock(&lock); // Clean up
printf("Final counter: %d\n", counter);
return 0;
}

🔧 OpenMP Lock Functions

Function Description

omp_init_lock() Initializes a lock

omp_destroy_lock() Destroys a lock

omp_set_lock() Acquires the lock (waits if needed)


Function Description

omp_unset_lock() Releases the lock

omp_test_lock() Tries to acquire the lock (non-blocking)

✅ When to Use Locks


• When multiple threads read/write shared data.
• When critical sections involve more than one operation.
• When atomic or critical directives are too limited.

⚠️ Caution
• Overuse of locks can lead to performance bottlenecks.
• Improper use can cause deadlocks or starvation.

In parallel programming, unlocking refers to the action of releasing a lock that a thread
previously acquired. This allows other threads to enter a critical section or access a shared
resource that was protected by the lock.

🔓 What Is Unlocking?
• When a thread acquires a lock, it gains exclusive access to a resource.
• Once it's done, it must unlock to let other threads proceed.
• Failing to unlock can lead to deadlocks, where other threads wait indefinitely.

🧩 Example in OpenMP (C)


#include <stdio.h>
#include <omp.h>

int main() {
int counter = 0;
omp_lock_t lock;
omp_init_lock(&lock); // Initialize lock

#pragma omp parallel num_threads(4)


{
omp_set_lock(&lock); // Lock acquired
counter++; // Critical section
omp_unset_lock(&lock); // Lock released (unlocked)
}

omp_destroy_lock(&lock); // Clean up
printf("Final counter: %d\n", counter);
return 0;
}

🔧 Key Functions:

Function Purpose

omp_set_lock() Acquires the lock

omp_unset_lock() Releases (unlocks) the lock

omp_test_lock() Tries to acquire without blocking

omp_destroy_lock() Cleans up the lock

🧠 Why Unlocking Matters


• Prevents deadlocks and starvation
• Ensures fair access to shared resources
• Maintains program correctness in concurrent environments
TASKING:
asking in OpenMP allows you to define units of work (tasks) that can be executed
independently and potentially in parallel, enabling flexible and efficient parallelism
beyond simple loop-based constructs.

🧠 What Is Tasking in OpenMP?


Unlike traditional OpenMP constructs like parallel for, which divide loop iterations among
threads, tasking lets you create discrete chunks of work that can be executed asynchronously
by any thread in a team.

🔹 A Task Includes:
• A block of code to execute
• A data environment (variables captured at creation)
• Optional dependencies or synchronization

🧩 Basic Syntax
#pragma omp parallel { #pragma omp single { for (int i = 0; i < N; i++) { #pragma omp task
process(i); } } }
• #pragma omp parallel: Starts a team of threads.
• #pragma omp single: Ensures only one thread creates tasks.
• #pragma omp task: Defines a task to be executed by any thread.

🔄 Task Scheduling
• Tasks are queued and executed dynamically by available threads.
• OpenMP runtime handles scheduling and load balancing.
• Tasks can be nested, dependent, or grouped.

Example code for tasking:

#include <stdio.h>
#include <omp.h>
int fib(int n) {
int x, y;
if (n < 2)
return n;
#pragma omp task shared(x)
x = fib(n - 1);
#pragma omp task shared(y)
y = fib(n - 2);
#pragma omp taskwait
return x + y;
}
int main() {
int n = 10, result;
#pragma omp parallel
{
#pragma omp single
result = fib(n);
}
printf("Fibonacci(%d) = %d\n", n, result);
return 0;
}
Explanation for the above code:
This C program uses OpenMP tasking to compute the Fibonacci number recursively in
parallel. Let’s walk through it step by step:

🧩 Function: fib(int n)
int fib(int n) {
int x, y;

if (n < 2)
return n;
• Base case: If n is 0 or 1, return n.
• Recursive case: Compute fib(n-1) and fib(n-2).

🔹 Task Creation
#pragma omp task shared(x)
x = fib(n - 1);

#pragma omp task shared(y)


y = fib(n - 2);
• These directives create two tasks:
o One to compute fib(n-1) and store it in x.
o One to compute fib(n-2) and store it in y.
• shared(x) and shared(y) ensure that the variables are visible across tasks.

🔹 Task Synchronization
#pragma omp taskwait
return x + y;
• #pragma omp taskwait ensures both tasks finish before proceeding.
• After both results are available, it returns their sum.

🧩 Main Function
int main() {
int n = 10, result;

#pragma omp parallel


{
#pragma omp single
result = fib(n);
}

printf("Fibonacci(%d) = %d\n", n, result);


return 0;
}

🔹 Explanation:
• #pragma omp parallel: Starts a team of threads.
• #pragma omp single: Ensures only one thread initiates the recursive task tree.
• result = fib(n): Starts the recursive computation using tasks.
• printf(...): Prints the final result.

🧠 Why Use Tasks Here?


• Recursive Fibonacci is naturally parallelizable because each call is independent.
• OpenMP tasks allow dynamic scheduling and load balancing.
• This approach demonstrates fine-grained parallelism.

Thread Safety:
Thread safety in parallel programming means ensuring that shared data is accessed and
modified correctly when multiple threads run concurrently, preventing race conditions,
data corruption, and unpredictable behavior.

🧠 What Is Thread Safety?


In a multithreaded program, multiple threads may access the same memory or resources. A
function or code block is thread-safe if it behaves correctly regardless of the timing or
interleaving of thread execution.

🔹 Common Issues Without Thread Safety:


• Race conditions: Two threads modify shared data simultaneously.
• Deadlocks: Threads wait indefinitely for each other to release resources.
• Data inconsistency: Threads read stale or partially updated data.
Techniques to Achieve Thread Safety
1. Mutual Exclusion (Locks)
• Use mutexes, spinlocks, or OpenMP locks to ensure only one thread accesses critical
sections.
2. Atomic Operations
• Use #pragma omp atomic for simple updates to shared variables.
3. Critical Sections
• Use #pragma omp critical to protect blocks of code.
4. Thread-Local Storage
• Store data privately for each thread to avoid sharing.
4. Thread-Local Storage
• Store data privately for each thread to avoid sharing.
Best Practices
• Minimise shared data access.
• Prefer atomic operations over locks when possible.
• Avoid nested locks to prevent deadlocks.
• Profile and test for race conditions using tools like Thread Sanitizer or Intel Inspector.

Tokenization in thread programming and Thread Safety

Tokenization in thread programming refers to breaking down input data—typically


text—into smaller units (tokens) using multiple threads to improve performance and
scalability. This is especially useful in natural language processing (NLP), data
preprocessing, and compiler design.
This function Tokenize is a multi-threaded tokenizer written in C using OpenMP. It splits
an array of strings (lines[]) into tokens (words), and distributes the work across multiple threads
for parallel processing. Here's a detailed explanation:

🧩 Function Signature
void Tokenize(char* lines[], int line_count, int thread_count)
• lines[]: Array of strings to be tokenized.
• line_count: Number of lines in the array.
• thread_count: Number of threads to use for parallel execution.

🔧 Variable Declarations
int my_rank, i, j;
char* my_token;
• my_rank: Thread ID.
• i: Loop index for lines.
• j: Token index within a line.
• my_token: Pointer to each token extracted.

🔄 Parallel Region
#pragma omp parallel num_threads(thread_count) \
default(none) private(my_rank, i, j, my_token) \
shared(lines, line_count)
• Starts a parallel region with thread_count threads.
• default(none): Forces explicit declaration of shared/private variables.
• private(...): Each thread has its own copy of these variables.
• shared(...): All threads access the same lines[] and line_count.

🔁 Work Distribution
#pragma omp for schedule(static, 1)
for (i = 0; i < line_count; i++) {
• Distributes the loop over line_count lines among threads.
• schedule(static, 1): Each thread gets one line at a time in round-robin fashion.

🧠 Tokenization Logic
printf("Thread %d > line %d = %s", my_rank, i, lines[i]);
j = 0;
my_token = strtok(lines[i], " \t\n");
while (my_token != NULL) {
printf("Thread %d > token %d = %s\n", my_rank, j, my_token);
my_token = strtok(NULL, " \t\n");
j++;
}
• Each thread prints the line it's processing.
• Uses strtok() to split the line into tokens using space, tab, and newline as delimiters.
• Prints each token with its index and thread ID.

✅ Key Concepts Demonstrated

Concept Role in Program

OpenMP parallel Enables multi-threaded execution

Private variables Prevents race conditions

Static scheduling Ensures predictable workload distribution

strtok() Performs tokenization of each line

When werun the program with a single thread, it correctly tokenizes the input stream. The first
time we run it with two threads and the input
Pease porridge hot.
Pease porridge cold.
Pease porridge in the pot
Nine days old.
the output is also correct. However, the second time we run it with this input, we get
the following output
Thread 0 > line 0 = Pease porridge hot.
Thread 1 > line 1 = Pease porridge cold.
Thread 0 > token 0 = Pease
Thread 1 > token 0 = Pease
Thread 0 > token 1 = porridge
Thread 1 > token 1 = cold.
Thread 0 > line 2 = Pease porridge in the pot
Thread 1 > line 3 = Nine days old.
Thread 0 > token 0 = Pease
Thread 1 > token 0 = Nine
Thread 0 > token 1 = days
Thread 1 > token 1 = old.
to persist from one call to the next.
Unfortunately for us, this cached string is shared,not private. Thus it appears that thread 1’s
call to strtok with the second line hasapparently overwritten the contents of thread 0’s call with
the first line. Even worse,thread 0 has found a token (“days”) that should be in thread 1’s output.
The strtok function is therefore not thread-safe: if multiple threads call it simultaneously,the
output it produces may not be correct. Regrettably, it’s not uncommonfor C library functions
to fail to be thread-safe. For example, neither the random numbergenerator rand in stdlib.h nor
the time conversion function localtime in [Link] guaranteed to be thread-safe. In some cases,
the C standard specifies an alternate,thread-safe version of a function. In fact, there is a thread-
safe version of strtok:
char∗ strtok_r (
char∗ string / ∗ in / out ∗ / ,
const char∗ separators / ∗ i n ∗ / ,
char ∗∗ saveptr_p / ∗ in / out ∗ / ) ;
The “_r” is supposed to suggest that the function is re-entrant, which is sometimes
used as a synonym for thread-safe.9 The first two arguments have the same purpose
as the arguments to strtok. The saveptr argument is used by strtok_r for keeping
track of where the function is in the input string; it serves the purpose of the cached
pointer in [Link] can correct our original Tokenize function by replacing the calls
to strtok with calls to strtok_r.We simply need to declare a char∗ variable to pass in
for the third argument, and replace the calls in line 18 and line 22 with the following
calls:
my_token = strtok_r ( lines [ i ] , " \t\n" , &saveptr ) ;
...
my_token = strtok_r ( NULL , " \t\n" , &saveptr ) ;
respectively.
5.11.1 Incorrect programs can produce correct output
Notice that our original version of the tokenizer program shows an especially insidious form
of program error: The first time we ran it with two threads, the programproduced correct output.
It wasn’t until a later run that we saw an error. This, unfortunately,is not a rare occurrence in
parallel programs.
For example, we can’t say whenthread 1 will first call strtok. If its first call takes place after
thread 0 has tokenizedits first line, then the tokens identified for the first line should be correct.
However, ifthread 1 calls strtok before thread 0 has finished tokenizing its first line, it’s
entirelypossible that thread 0 may not identify all the tokens in the first line, so it’s especially
important in developing shared-memory programs to resist the temptation to assumethat since
a program produces correct output, it must be correct. We always need to
be wary of race conditions.

You might also like