Module 4
Q1. Define OpenMP and explain its programming model for shared-memory MIMD systems with a
neat diagram.
Definition of OpenMP
OpenMP (Open Multi-Processing) is an Application Programming Interface (API) used for developing
parallel programs for shared-memory systems.
It supports MIMD (Multiple Instruction, Multiple Data) execution and is mainly designed for multicore
and multiprocessor systems where all threads share a common global memory.
OpenMP provides:
Compiler directives (pragmas)
Runtime library routines
Environment variables
These features help programmers convert serial programs into parallel programs with minimal code
changes.
Shared-Memory MIMD Architecture
In a shared-memory architecture:
All processors (cores) share a single global address space
Any processor can directly access any memory location
Communication between processors occurs implicitly through shared variables
Synchronization is required to avoid race conditions
This model is simpler than distributed memory because explicit message passing is not required.
OpenMP Programming Model
OpenMP follows a fork–join parallel execution model:
1. Program execution starts with a single master thread
2. When a parallel directive is encountered:
o The master thread forks a team of threads
3. All threads execute the parallel region concurrently
4. At the end of the parallel region:
o Threads join back into the master thread
5. Execution continues sequentially
Key characteristics:
SPMD style execution (same program, multiple threads)
Threads share variables by default
Supports incremental parallelization
Parallelism is controlled using compiler directives
Neat Diagram: Fork–Join Model in OpenMP
Features of OpenMP Programming Model
Thread-based parallelism
Shared variables for communication
Private variables for thread-local computation
Built-in synchronization constructs (barrier, critical, atomic)
Portable across compilers and platforms
Advantages of OpenMP
Easy to use and learn
Less programming effort compared to MPI
Suitable for multicore systems
Incremental parallelization possible
Limitations
Limited scalability (shared memory size)
Performance affected by cache coherence and false sharing
Not suitable for distributed-memory systems
Conclusion
OpenMP provides a simple and efficient programming model for shared-memory MIMD systems. By
using the fork–join model, compiler directives, and shared variables, OpenMP enables developers to exploit
parallelism on multicore architectures with minimal changes to existing serial programs.
Q2. Explain OpenMP pragmas and directives and describe the process of compiling and running an
OpenMP program.
OpenMP Pragmas and Directives
OpenMP uses compiler directives (called pragmas) to specify parallel regions and control parallel
execution.
These directives are ignored by compilers that do not support OpenMP, making OpenMP programs
portable.
In C/C++, OpenMP directives begin with:
#pragma omp
In Fortran, directives begin with:
!$omp
Types of OpenMP Directives
OpenMP directives are broadly classified into four categories:
1. Parallel Directives
These directives define regions of code that will be executed by multiple threads.
parallel
Creates a team of threads.
#pragma omp parallel
{
// parallel code
}
parallel for
Combines thread creation and loop parallelization.
#pragma omp parallel for
for (i = 0; i < n; i++)
a[i] = b[i] + c[i];
2. Work-sharing Directives
These divide work among threads without creating new threads.
for – Distributes loop iterations
sections – Divides code into independent sections
single – Executed by only one thread
Example:
#pragma omp sections
{
#pragma omp section
task1();
#pragma omp section
task2();
}
3. Synchronization Directives
Used to control access to shared resources and avoid race conditions.
barrier – All threads wait
critical – Only one thread executes at a time
atomic – Atomic update of a variable
ordered – Enforces sequential order
Example:
#pragma omp critical
sum += value;
4. Data Scope Clauses
Used to define variable visibility among threads.
shared
private
firstprivate
lastprivate
reduction
Example:
#pragma omp parallel private(i)
Compiling an OpenMP Program
To compile an OpenMP program, the compiler must be instructed to enable OpenMP support.
Using GCC Compiler
gcc -fopenmp program.c -o program
-fopenmp enables OpenMP
Without this flag, OpenMP directives are ignored
For C++
g++ -fopenmp [Link] -o program
Running an OpenMP Program
Execution is similar to a normal program:
./program
The number of threads can be controlled using the environment variable:
export OMP_NUM_THREADS=4
or inside the program:
omp_set_num_threads(4);
Example OpenMP Program
#include <stdio.h>
#include <omp.h>
int main() {
#pragma omp parallel
{
printf("Hello from thread %d\n", omp_get_thread_num());
}
return 0;
}
Advantages of Using OpenMP Pragmas
Simple and readable syntax
Incremental parallelization
No need for explicit thread management
Portable across platforms
Conclusion
OpenMP pragmas and directives provide a powerful yet simple way to express parallelism in shared-
memory systems. By using compiler directives, programmers can parallelize loops and sections of code
efficiently. Proper compilation using OpenMP-enabled compilers ensures correct execution and optimal
performance.
Q3. Explain the trapezoidal rule and describe its parallel implementation using OpenMP with a neat
diagram.
Trapezoidal Rule
The trapezoidal rule is a numerical integration technique used to approximate the definite integral of a
function
b
∫ f (x ) dx
a
The interval [ a , b ]is divided into n equal subintervals, each of width:
b−a
h= Each subinterval is approximated by a trapezoid, and the total area is the sum of areas of all
n
trapezoids.
Mathematical Expression
The trapezoidal rule approximation is:
b
∫ f ( x ) dx ≈ h2 ¿
a
Serial Algorithm (Concept)
1. Divide the interval into n trapezoids
2. Evaluate the function at each point
3. Sum the trapezoidal areas
4. Multiply by h/2
Need for Parallelization
Each trapezoid area computation is independent
Suitable for data parallelism
Shared-memory systems can compute partial sums in parallel
OpenMP efficiently exploits this parallelism
Parallel Implementation Using OpenMP
In OpenMP:
The loop that computes the sum of trapezoid areas is parallelized
Each thread computes a partial sum
A reduction clause safely combines partial sums
Parallel Algorithm Using OpenMP
1. Compute step size h
2. Initialize sum with f(a) + f(b)
3. Parallelize the loop computing intermediate function values
4. Use reduction to accumulate the sum
5. Multiply final result by h/2
OpenMP Code Snippet
double Trap(double a, double b, int n) {
double h = (b - a) / n;
double sum = f(a) + f(b);
#pragma omp parallel for reduction(+:sum)
for (int i = 1; i < n; i++) {
sum += 2 * f(a + i * h);
}
return sum * h / 2;
}
Neat Diagram: Parallel Trapezoidal Rule
Explanation of Diagram
Integration interval is split into trapezoids
Trapezoids are divided among threads
Each thread computes its portion independently
Partial results are combined using reduction
Advantages of OpenMP Parallelization
Simple parallelization using parallel for
Automatic thread management
Reduction avoids race conditions
Efficient use of multicore processors
Conclusion
The trapezoidal rule is well-suited for parallel execution due to independent computations of trapezoids.
OpenMP enables efficient parallel implementation using loop parallelization and reduction, resulting in
improved performance on shared-memory systems with minimal programming effort.
Q4. Explain the scope of variables in OpenMP and distinguish between shared and private variables
with suitable examples
Scope of Variables in OpenMP
In OpenMP, variable scope defines how variables are visible and accessed by threads inside a parallel
region.
Since OpenMP follows a shared-memory model, improper handling of variable scope can lead to race
conditions and incorrect results.
OpenMP allows programmers to explicitly control variable scope using data-sharing clauses.
Data-Sharing Attributes in OpenMP
The two most important data-sharing attributes are:
1. Shared variables
2. Private variables
Other related clauses include: firstprivate, lastprivate, and reduction, but the core distinction is between
shared and private.
1. Shared Variables
Definition
A shared variable is common to all threads
All threads access the same memory location
Default for variables declared outside a parallel region
Characteristics
Changes made by one thread are visible to all others
Can lead to race conditions if multiple threads update simultaneously
Requires synchronization (critical, atomic, reduction)
Example
int sum = 0;
#pragma omp parallel
{
sum = sum + 1; // race condition
}
✔ sum is shared by default
✖ Multiple threads update it concurrently → incorrect result
2. Private Variables
Definition
Each thread gets its own private copy
Variable is uninitialized inside the parallel region
Changes are not visible to other threads
Characteristics
No interference between threads
Eliminates race conditions
Default for loop index variables in parallel for
Example
int i;
#pragma omp parallel private(i)
{
i = omp_get_thread_num();
printf("%d\n", i);
}
✔ Each thread has its own copy of i
✔ Safe parallel execution
Comparison: Shared vs Private Variables
Aspect Shared Variable Private Variable
Memory location Single shared memory Separate copy per thread
Visibility Visible to all threads Visible only to owning thread
Default behavior Yes (outside parallel region) No
Race condition Possible Not possible
Initialization Retains value Undefined inside parallel region
Example: Shared vs Private in a Loop
int i, sum = 0;
#pragma omp parallel for private(i) shared(sum)
for (i = 0; i < 100; i++) {
sum += i; // race condition
}
i → private (safe)
sum → shared (unsafe without synchronization)
Correct version:
#pragma omp parallel for reduction(+:sum)
Why Variable Scope Is Important
Prevents incorrect results
Improves program correctness
Avoids race conditions
Enhances parallel performance
Conclusion
In OpenMP, understanding variable scope is critical for writing correct parallel programs. Shared variables
allow communication between threads but require synchronization, while private variables ensure thread-
local computation without interference. Proper use of data-sharing clauses leads to safe and efficient parallel
execution.
Q5. Explain the reduction clause in OpenMP and discuss how it avoids race conditions in parallel
programs.
Reduction Clause in OpenMP
The reduction clause in OpenMP is used to perform parallel computations on shared variables where
each thread computes a partial result, and OpenMP automatically combines these partial results into a
single final value.
It is commonly used for operations like:
Sum (+)
Product (*)
Maximum (max)
Minimum (min)
Logical AND/OR
Why Reduction Is Needed
In shared-memory systems:
Multiple threads may try to update the same shared variable
This causes a race condition
Results become unpredictable and incorrect
Example (incorrect):
int sum = 0;
#pragma omp parallel for
for (int i = 0; i < n; i++) {
sum += a[i]; // race condition
}
Here, multiple threads update sum simultaneously.
Reduction Clause Syntax
#pragma omp parallel for reduction(operator:variable)
operator → operation to be applied (+, *, max, etc.)
variable → shared variable to be reduced
Working of Reduction Clause
1. OpenMP creates a private copy of the reduction variable for each thread
2. Each thread updates its own private copy
3. At the end of the parallel region:
o All private copies are combined using the specified operator
4. The final result is stored in the original shared variable
Correct Example Using Reduction
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++) {
sum += a[i];
}
✔ Each thread updates a private sum
✔ OpenMP safely combines results
✔ No race condition
Supported Reduction Operations
Operator Meaning
+ Addition
* Multiplication
max Maximum value
min Minimum value
&& Logical AND
Reduction vs Critical Section
Aspect Reduction Critical
Performance High Lower
Synchronization overhead Minimal High
Scalability Better Poor
Suitability Numeric accumulation General shared updates
Reduction is preferred when possible because it minimizes synchronization.
How Reduction Avoids Race Conditions
No shared updates during parallel execution
Threads work independently
Combining happens only once at the end
No need for explicit locks or critical sections
Applications of Reduction
Numerical integration (trapezoidal rule)
Sum of arrays
Statistical computations
Matrix and vector operations
Conclusion
The reduction clause is a powerful OpenMP feature that enables safe and efficient parallel accumulation. By
providing each thread with a private copy and combining results automatically, reduction eliminates race
conditions and improves performance compared to explicit synchronization techniques.
Q6. Explain loop-carried dependence and discuss its effect on parallelizing loops using the parallel for
directive.
Loop-Carried Dependence
A loop-carried dependence exists when an iteration of a loop depends on the result of a previous
iteration.
In such cases, the loop iterations cannot be executed independently, which restricts or prevents parallel
execution.
Formally, iteration i depends on iteration i−1 if:
Iteration i uses data produced by iteration i−1
Types of Loop-Carried Dependences
1. True (Flow) Dependence
o Later iteration reads a value written by an earlier iteration
2. Anti-Dependence
o Later iteration writes to a variable that was read earlier
3. Output Dependence
o Two iterations write to the same variable
Example of Loop-Carried Dependence (Not Parallelizable)
for (int i = 1; i < n; i++) {
a[i] = a[i-1] + b[i];
}
a[i] depends on a[i-1]
Each iteration needs the result of the previous one
Cannot be parallelized using parallel for
Effect on parallel for Directive
The OpenMP parallel for directive requires that:
Loop iterations are independent
No iteration depends on another
If loop-carried dependence exists:
Parallel execution produces incorrect results
The compiler may refuse parallelization or generate wrong output
Example of Parallelizable Loop (No Dependence)
#pragma omp parallel for
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
Each iteration is independent
Safe for parallel execution
Reducing Loop-Carried Dependence
Some techniques to handle or reduce dependence:
1. Rewriting the algorithm
2. Using temporary arrays
3. Applying prefix sum or reduction
4. Changing data access patterns
Example using reduction:
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++)
sum += a[i];
Impact on Performance
Loops with no dependence → high parallel efficiency
Loops with dependence → serialization
Dependence limits scalability on multicore systems
Compiler and Programmer Role
Compiler checks for obvious dependences
Programmer must ensure loop correctness
Incorrect assumptions may cause subtle bugs
Summary Table
Aspect Independent Loop Loop-Carried Dependence
Parallelizable Yes No
Execution order Any Fixed
parallel for Safe Unsafe
Performance High Low
Conclusion
Loop-carried dependence is a major limitation in parallel loop execution. OpenMP’s parallel for directive
can only be applied when loop iterations are independent. Identifying and eliminating such dependences is
essential for achieving correct and efficient parallel programs.
Q8. Explain cache coherence and false sharing in OpenMP and discuss techniques to reduce false
sharing.
Cache Coherence
In shared-memory multiprocessor systems, each processor has its own private cache, but all processors
access a common main memory.
Cache coherence ensures that all processors see a consistent view of shared data, even when multiple
caches store copies of the same memory location.
Need for Cache Coherence
One thread may update a shared variable in its cache
Other threads must see the updated value
Without coherence, programs produce incorrect results
Cache coherence is maintained using hardware coherence protocols such as MESI (Modified, Exclusive,
Shared, Invalid).
Cache Coherence Problem Example
#pragma omp parallel
{
x = x + 1;
}
Each thread may cache x
Updates must be propagated to all caches
Coherence protocol ensures correctness
False Sharing
False sharing occurs when:
Multiple threads access different variables
These variables lie on the same cache line
Even though variables are independent, cache coherence treats them as shared
This causes unnecessary cache invalidations, leading to:
Performance degradation
Increased memory traffic
False Sharing Example
int a[NUM_THREADS];
#pragma omp parallel
{
int id = omp_get_thread_num();
a[id] = id;
}
Each thread writes to a different array element
If a[id] values are on the same cache line
Cache line keeps invalidating → false sharing
Difference Between True Sharing and False Sharing
Aspect True Sharing False Sharing
Variables Same variable Different variables
Cache line Same Same
Correctness Required Not required
Performance Necessary overhead Unnecessary overhead
Techniques to Reduce False Sharing
1. Padding
Add unused space so that frequently updated variables occupy separate cache lines.
typedef struct {
int value;
char pad[64];
} padded_var;
2. Use Private Variables
Convert shared variables into private or reduction variables where possible.
#pragma omp parallel private(temp)
3. Use Local Arrays
Each thread works on its own local data instead of shared arrays.
4. Align Data Structures
Align variables to cache line boundaries using compiler directives.
5. Reduce Writes to Shared Variables
Accumulate results locally
Update shared variables only once
Impact of False Sharing on Performance
Increases cache misses
Causes frequent cache invalidations
Reduces scalability of OpenMP programs
Particularly harmful in loops with frequent updates
Conclusion
Cache coherence is essential for correctness in shared-memory systems, but it introduces performance
challenges. False sharing is a major performance bottleneck in OpenMP programs, caused by independent
variables sharing the same cache line. By using padding, private variables, and careful data placement, false
sharing can be significantly reduced, leading to better parallel performance.
Q7. Explain scheduling in OpenMP and describe static, dynamic, guided, and runtime scheduling with
suitable examples.
Scheduling in OpenMP
Scheduling in OpenMP determines how loop iterations are assigned to threads in a parallel region.
It plays a crucial role in:
Load balancing
Performance optimization
Reducing idle time of threads
Scheduling is specified using the schedule clause with the parallel for directive.
General Syntax
#pragma omp parallel for schedule(type, chunk_size)
type → scheduling strategy
chunk_size → number of loop iterations assigned at a time (optional)
Types of Scheduling in OpenMP
OpenMP supports four main scheduling policies:
1. Static Scheduling
Description
Loop iterations are divided equally among threads at compile time
Each thread gets a fixed chunk of iterations
Low overhead
Syntax
#pragma omp parallel for schedule(static)
or
#pragma omp parallel for schedule(static, chunk)
Example
#pragma omp parallel for schedule(static, 4)
for (int i = 0; i < 16; i++)
work(i);
Use Case
Best when each iteration takes equal time
Predictable workload
2. Dynamic Scheduling
Description
Iterations are assigned to threads at runtime
Threads request new chunks when they finish
Higher overhead than static scheduling
Syntax
#pragma omp parallel for schedule(dynamic, chunk)
Example
#pragma omp parallel for schedule(dynamic, 2)
for (int i = 0; i < n; i++)
work(i);
Use Case
Suitable when iteration execution time varies
Improves load balancing
3. Guided Scheduling
Description
Similar to dynamic scheduling
Chunk size decreases exponentially over time
Starts with large chunks, ends with small chunks
Syntax
#pragma omp parallel for schedule(guided)
Example
#pragma omp parallel for schedule(guided, 4)
for (int i = 0; i < n; i++)
work(i);
Use Case
Good compromise between load balance and overhead
Efficient for large loops
4. Runtime Scheduling
Description
Scheduling policy is decided at runtime
Controlled using environment variable OMP_SCHEDULE
Syntax
#pragma omp parallel for schedule(runtime)
Example
export OMP_SCHEDULE="dynamic,2"
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
work(i);
Use Case
Useful for performance tuning without recompiling
Comparison of Scheduling Types
Scheduling Type Assignment Time Overhead Load Balance
Static Compile time Very low Poor (if workload uneven)
Dynamic Runtime High Excellent
Guided Runtime Medium Very good
Runtime Runtime Depends Depends
Effect of Chunk Size
Small chunk → better load balance, higher overhead
Large chunk → lower overhead, possible imbalance
Conclusion
Scheduling in OpenMP directly affects the performance of parallel loops. Static scheduling is efficient for
uniform workloads, while dynamic and guided scheduling handle irregular workloads better. Runtime
scheduling offers flexibility for tuning performance. Selecting the right scheduling strategy is essential for
achieving optimal parallel performance.
Thread Safety
Thread safety refers to the property of a program or code segment to function correctly when executed by
multiple threads concurrently in a shared-memory environment. A thread-safe program produces correct
and consistent results regardless of the interleaving of thread execution.
In shared-memory programming models such as OpenMP, multiple threads access the same memory space,
making thread safety a critical concern.
Need for Thread Safety
Threads share global and heap variables
Concurrent access can cause race conditions
Results may vary between executions
Program behavior becomes unpredictable
Hence, mechanisms are required to control access to shared data.
Common Thread Safety Issues
1. Race Conditions
Occur when multiple threads read/write shared data simultaneously.
2. sum = sum + 1; // unsafe if executed by multiple threads
3. Critical Section Conflicts
Multiple threads entering code that updates shared variables.
4. Non-reentrant Functions
Functions using static or global variables internally.
5. False Sharing
Performance degradation due to cache-line sharing.
Techniques to Ensure Thread Safety
1. Mutual Exclusion
Ensures only one thread accesses shared data at a time.
#pragma omp critical
sum += value;
2. Atomic Operations
Used for simple updates on shared variables.
#pragma omp atomic
sum++;
3. Reduction Clause
Safely combines thread-local results.
#pragma omp parallel for reduction(+:sum)
4. Private Variables
Each thread maintains its own copy.
#pragma omp parallel private(i)
5. Barriers
Synchronize threads at specific points.
#pragma omp barrier
Thread-Safe vs Non-Thread-Safe Code
Aspect Thread-Safe Non-Thread-Safe
Shared data access Controlled Uncontrolled
Race conditions Avoided Possible
Output Consistent Unpredictable
Parallel correctness Guaranteed Not guaranteed
Importance of Thread Safety
Ensures correctness of parallel programs
Improves reliability and reproducibility
Enables efficient utilization of multicore systems
Prevents hard-to-debug concurrency errors
Conclusion
Thread safety is a fundamental requirement in shared-memory programming. By using synchronization
constructs such as critical sections, atomic operations, reduction clauses, and private variables,
programmers can prevent race conditions and ensure correct parallel execution. Proper thread-safe design
leads to reliable, scalable, and efficient parallel applications.