Module 4: Shared-memory programming with OpenMP
MODULE - 4
Syllabus
Shared-memory programming with OpenMP – openmp pragmas and directives, The
trapezoidal rule, Scope of variables, The reduction clause, loop carried dependency,
scheduling, producers and consumers, Caches, cache coherence and false sharing in openmp,
tasking, tasking, thread safety.
Introduction to OpenMP
What is OpenMP?
OpenMP (Open Multi-Processing) is a standardized API for shared-memory parallel
programming. It consists of:
• Compiler directives (pragmas) you add to C/C++/Fortran source to indicate parallel
regions.
• Runtime library routines (functions in omp.h) you can call from code.
• Environment variables (like OMP_NUM_THREADS) that control runtime
behaviour.
Key Features of OpenMP
1. Pragma-based: Uses compiler directives to parallelize code.
2. Fork–Join Model (master creates worker threads as needed).
3. Shared Memory (all threads can access the same memory space)
4. Incremental Parallelization (add parallelism gradually)
5. Portable (supported by most modern compilers)
1. Pragma-based: Uses compiler directives to parallelize code.
Syntax
#pragma omp directive [clauses]
structured-block
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 1
Module 4: Shared-memory programming with OpenMP
A pragma in C/C++ is a compiler directive: a hint embedded in source code that the
compiler recognizes and acts on. OpenMP uses pragmas so you can add parallel behavior to
an existing serial program without rewriting it into a threading library the compiler
#pragma omp
This marks the start of an OpenMP directive. The compiler recognizes pragmas that begin
with #pragma omp and applies OpenMP semantics to the following statement / block.
directive
The directive names the construct you want the runtime to apply. Examples:
• parallel — run the following block with a team of threads.
• for — partition a loop among threads that are already active.
• parallel for — combine parallel + for (fork threads, then split loop iterations among
them).
• critical, atomic, barrier — synchronization constructs.
[clauses]
Clauses modify the directive’s behavior (control variable scope, scheduling, number of
threads, reduction ops, etc.). Examples: private(...), shared(...), reduction(...), schedule(...),
num_threads(...), default(none), nowait, if(...), collapse(...)
structured-block
A structured block is the C statement (single statement or compound { ... }) that immediately
follows the pragma — it must have one entry and one exit
Example
#include <stdio.h>
#include <omp.h>
int main( )
int n = 8;
int A[8] = {1,2,3,4,5,6,7,8};
int sum = 0;
// OpenMP directive with a clause
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 2
Module 4: Shared-memory programming with OpenMP
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; i++)
{ // <-- structured block is this loop
sum += A[i];
printf("Total sum = %d\n", sum);
return 0;
pragma omp parallel for reduction(+:sum)
• directive: parallel for → run loop in parallel.
• clause: reduction(+:sum) → ensures all threads safely add to sum
Output
Total sum = 36
2. Fork–Join Model (master creates worker threads as needed)
Plain idea: your program normally runs with one thread (the master). When it reaches a
parallel region, the master temporarily creates (forks) several worker threads to run the work
in parallel. When the work is done the workers join back and the master continues alone.
#pragma omp parallel num_threads(4)
// this block is executed by 4 threads simultaneously
printf("Hello from thread %d\n", omp_get_thread_num());
What actually happens:
• Master reaches parallel → runtime starts 3 extra threads (so 4 total).
• All 4 execute the block at the same time.
• At the closing brace there's an implicit barrier: worker threads finish and are
destroyed (or returned to a pool), and the master continues.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 3
Module 4: Shared-memory programming with OpenMP
3. Shared Memory (all threads can access the same memory space)
Plain idea: threads live in the same process. They share global variables and heap
memory. Each thread has its own stack (local variables inside the thread), but memory
allocated outside (global arrays, malloc) is visible to all threads.
Example:
int shared_counter = 0; // accessible by all threads
#pragma omp parallel num_threads(4)
// each thread can read/write shared_counter -> careful!
#pragma omp atomic
shared_counter++;
Important details:
• Shared = same physical memory. No explicit message passing is needed; threads
read/write the same addresses.
• Local (private) variables: variables declared inside the parallel block or loop
iteration are private to each thread (each thread gets its own copy).
4. Incremental Parallelization (add parallelism gradually)
Plain idea: you don’t have to rewrite your whole program as threads from day one.
OpenMP’s pragmas let you add parallelism bit by bit to the places that benefit most
(loops, independent function calls), test, and expand.
Typical workflow:
1. Start with serial code that works and is well-tested.
2. Identify a heavy loop (hotspot) where iterations are independent.
3. Add #pragma omp parallel for above that loop and test results.
4. Fix correctness issues (scope, reductions).
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 4
Module 4: Shared-memory programming with OpenMP
5. Measure performance; tune schedule, chunk size, or move to a parallel region that
surrounds several loops to lower overhead.
Serial:
double sum = 0;
for (int i=0; i<n; ++i)
sum += a[i];
Add OpenMP:
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i=0; i<n; ++i)
sum += a[i];
5. Portable (supported by most modern compilers)
Plain idea: OpenMP is a standard; major C/C++/Fortran compilers support it. That
means code using OpenMP pragmas is portable across compilers and platforms (with the
usual caveats about compiler flags and OpenMP version).
What you need to do:
• Compile with the compiler’s OpenMP flag:
o GCC/Clang: -fopenmp
o Intel (oneAPI): -qopenmp or -fopenmp depending on version
• Run with desired thread count via:
o environment variable: export OMP_NUM_THREADS=4
o or per-region clause #pragma omp parallel num_threads(4)
Compilation:
// Compile with OpenMP support
gcc -fopenmp program.c -o program
icc -qopenmp program.c -o program
clang -fopenmp program.c -o program
// Set number of threads
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 5
Module 4: Shared-memory programming with OpenMP
export OMP_NUM_THREADS=4
./program
Applications:
Scientific computing,
image processing,
financial modeling,
weather simulation,
machine learning
OpenMP Pragmas and Directives
Pragma Syntax:
#pragma omp directive [clause [clause]...]
structured-block
Basic Example:
#include <omp.h>
#include <stdio.h>
int main( )
printf("Before parallel region\n");
#pragma omp parallel
int thread_id = omp_get_thread_num();
int total_threads = omp_get_num_threads();
printf("Hello from thread %d of %d\n", thread_id, total_threads);
printf("After parallel region\n");
return 0;
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 6
Module 4: Shared-memory programming with OpenMP
Execution Model:
1. Program starts with master thread (thread 0)
2. Master creates team of threads at parallel region
3. All threads execute structured block in parallel
4. Threads synchronize at end, all except master terminate
5. Execution continues serially with master thread
Common Functions:
int omp_get_num_threads(); // Get number of threads
int omp_get_thread_num(); // Get current thread ID
void omp_set_num_threads(int); // Set number of threads
double omp_get_wtime(); // Get wall clock time
Common functions
• omp_get_thread_num() → gives thread ID (0..N-1).
• omp_get_num_threads() → gives total threads in this region.
• omp_set_num_threads(int n) → set number of threads in code.
• omp_get_wtime() → gives wall-clock time (for timing programs)
Trapezoidal Rule - Theory
Numerical Integration Concept: The trapezoidal rule approximates definite
integrals by dividing the area under a curve into trapezoids.
Formula: ∫[a to b] f(x) dx ≈ (h/2)[f(a) + 2∑f(xi) + f(b)] where h = (b-a)/n and xi
= a + i*h
Sequential Implementation:
double f(double x)
return x * x; // f(x) = x²
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 7
Module 4: Shared-memory programming with OpenMP
double sequential_trap(double a, double b, int n)
double h = (b - a) / n;
double integral = (f(a) + f(b)) / 2.0;
for (int i = 1; i <= n-1; i++)
double x_i = a + i * h;
integral += f(x_i);
return integral * h;
Analysis:
• Time Complexity: O(n)
• Accuracy: Error decreases as O(h²)
• Parallelization Potential: Each iteration is independent.
Trapezoidal Rule - Parallel Implementation
Parallel Strategy: Distribute computation of trapezoids among threads, then
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 8
Module 4: Shared-memory programming with OpenMP
combine results.
Method 1: Using reduction
double parallel_trap_v1(double a, double b, int n)
double h = (b - a) / n;
double integral = 0.0;
#pragma omp parallel for reduction(+:integral)
for (int i = 0; i < n; i++)
double x_i = a + (i + 0.5) * h;
integral += f(x_i);
return integral * h;
Method 2: Manual distribution
double parallel_trap_v2(double a, double b, int n)
double h = (b - a) / n;
double global_integral = 0.0;
#pragma omp parallel
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int local_n = n / num_threads;
int start = thread_id * local_n;
int end = (thread_id == num_threads - 1) ? n : start + local_n;
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 9
Module 4: Shared-memory programming with OpenMP
double local_integral = 0.0;
for (int i = start; i < end; i++)
double x_i = a + (i + 0.5) * h;
local_integral += f(x_i);
#pragma omp critical
global_integral += local_integral;
return global_integral * h;
Performance Tip: Method 1 with reduction is more efficient
Code with small example
#include <stdio.h>
#include <omp.h>
double f(double x) {
return x * x; // function y = x^2
int main() {
int n = 4; // number of trapezoids
double a = 0.0, b = 4.0;
double h = (b - a) / n;
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 10
Module 4: Shared-memory programming with OpenMP
double sum = 0.0;
// Parallel loop: each thread works on part of the sum
#pragma omp parallel for reduction(+:sum)
for (int i = 1; i < n; i++) {
double x = a + i * h;
sum += f(x);
double result = (h/2) * (f(a) + 2*sum + f(b));
printf("Integral ≈ %f\n", result);
return 0;
Output:
Integral ≈ 22.000000
Scheduling
Purpose: Determines how loop iterations are distributed among threads.
Scheduling Types:
• Static:
❖ Iterations divided evenly at compile time .
❖ Iterations are divided before loop begins.
❖ Distribution can be:
❖ Equal contiguous blocks (default if chunk not given).
❖ Very low overhead because no runtime management is needed.
• Dynamic:
❖ Iterations assigned at runtime for load balancing.
❖ Iterations are divided into chunks of size chunk_size.
❖ When a thread finishes its chunk, it requests the next available chunk.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 11
Module 4: Shared-memory programming with OpenMP
❖ Achieves better load balancing but has higher scheduling overhead.
• Guided:
❖ Decreasing chunk sizes over time.
❖ Starts with large chunks → ends with small chunks.
❖ Useful for non-uniform workloads with decreasing computation cost.
• Auto:
❖ Compiler/runtime decides.
❖ Leaves decision of scheduling type and chunk size to the compiler/runtime
system.
// Static scheduling
#pragma omp parallel for schedule(static, 100)
for (int i = 0; i < n; i++)
process_data(i); // Even workload distribution
// Dynamic scheduling
#pragma omp parallel for schedule(dynamic, 10)
for (int i = 0; i < n; i++)
complex_computation(i); // Good for irregular workloads
// Guided scheduling
#pragma omp parallel for schedule(guided, 50)
for (int i = 0; i < n; i++)
variable_work(i); // Chunk sizes decrease over time
// Runtime scheduling
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 12
Module 4: Shared-memory programming with OpenMP
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
work(i); // Uses OMP_SCHEDULE environment variable
When to Use:
• Static: Equal workload per iteration
• Dynamic: Highly variable workload
• Guided: Moderate load imbalance
What is Variable Scoping in OpenMP?
• In OpenMP, when you create parallel regions or parallel loops, multiple threads run
at the same time.
• Each thread can share variables or have its own private copy.
Variable scoping = deciding whether a variable is shared among all threads or private to
each thread.
Types of Scoping
1. Shared
• Variable is visible to all threads.
• Only one memory location is used.
• Risk: multiple threads may write at the same time → race condition.
int x = 0;
#pragma omp parallel shared(x)
x += omp_get_thread_num(); // all threads update same x
int x = 0;
• Variable x is initialized to 0 in the main program.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 13
Module 4: Shared-memory programming with OpenMP
• Since shared(x) is specified, all threads will see and use the same single x.
#pragma omp parallel shared(x)
• Starts a parallel region.
• Suppose we have N threads (say 4).
• Then, 4 threads will execute the block simultaneously.
• Each thread has a unique ID from 0 to N-1, given by omp_get_thread_num().
x += omp_get_thread_num();
• Each thread adds its ID to the shared variable x.
Example with 4 threads:
• Thread 0 → x += 0 → no change.
• Thread 1 → x += 1 → increases x by 1.
• Thread 2 → x += 2 → increases x by 2.
• Thread 3 → x += 3 → increases x by 3.
So ideally, final x = 0 + 1 + 2 + 3 = 6.
2. Private
• Each thread gets its own copy of the variable.
• Copy is uninitialized (unless firstprivate is used).
• Changes are not reflected outside the parallel region.
int x = 5;
#pragma omp parallel private(x)
x = omp_get_thread_num(); // each thread has its own x
printf("Thread %d x=%d\n", omp_get_thread_num(), x);
printf("Outside x=%d\n", x); // original x unchanged
int x = 5;
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 14
Module 4: Shared-memory programming with OpenMP
• In the main thread (serial part), x is initialized to 5.
#pragma omp parallel private(x)
• A parallel region begins.
• Suppose 4 threads are created.
• Each thread gets its own private copy of x.
• Important: private variables are uninitialized inside the parallel region (they don’t
inherit the value 5 from outside).
• So each thread starts with garbage value in its private x (but then we overwrite it).
Inside parallel region
x = omp_get_thread_num();
printf("Thread %d x=%d\n", omp_get_thread_num(), x);
• Each thread assigns its ID to its private x.
• Example output (if 4 threads run):
• Thread 0 x=0
• Thread 1 x=1
• Thread 2 x=2
• Thread 3 x=3
• Note: Order may vary (threads run concurrently).
After parallel region
printf("Outside x=%d\n", x);
• Here, we are back in the main thread.
• The original x from before parallel region (x=5) is unchanged, because:
o The private(x) clause gave each thread its own copy.
o The main thread’s x was not affected inside parallel block.
3. Firstprivate
• Same as private, but each thread’s copy is initialized with the original value.
int x = 5;
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 15
Module 4: Shared-memory programming with OpenMP
#pragma omp parallel firstprivate(x)
printf("Thread %d starts with x=%d\n", omp_get_thread_num(), x);
int x = 5;
• The variable x is initialized to 5 in the main thread.
#pragma omp parallel firstprivate(x)
• A parallel region begins (say 4 threads).
• Each thread gets its own private copy of x.
• Difference from private(x):
o With private(x): each thread’s x is uninitialized (garbage).
o With firstprivate(x): each thread’s x is initialized with the value of x from
before parallel region → here, 5.
3. Inside parallel region
printf("Thread %d starts with x=%d\n", omp_get_thread_num(), x);
• Each thread prints its own copy of x.
• Since they were all initialized to 5, the output will be something like:
• Thread 0 starts with x=5
• Thread 1 starts with x=5
• Thread 2 starts with x=5
• Thread 3 starts with x=5
• Order may vary (threads run in parallel).
4. Lastprivate
• Like private, but the value from the last iteration is copied back to the original
variable after loop ends.
int x;
#pragma omp parallel for lastprivate(x)
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 16
Module 4: Shared-memory programming with OpenMP
for (int i=0; i<4; i++) {
x = i; // each thread has private x
printf("After loop, x=%d\n", x); // x = last iteration’s value (3)
int x;
• Variable x is declared (but not initialized) in the main program.
2. #pragma omp parallel for lastprivate(x)
• This parallelizes the for loop across threads.
• Clause lastprivate(x) means:
o Each thread gets its own private copy of x during the loop (just like
private(x)).
o BUT after the loop finishes, the value of x from the last loop iteration (i=3)
is copied back to the original x in the main program.
3. Inside the loop
x = i;
• Each thread executes some subset of loop iterations.
• For each iteration, that thread’s private x is set to the loop index i.
Example with 4 iterations (i = 0,1,2,3):
• If thread 0 executes iteration i=0, its private x=0.
• If thread 1 executes iteration i=1, its private x=1.
• If thread 2 executes iteration i=2, its private x=2.
• If thread 3 executes iteration i=3, its private x=3.
4. After the loop
printf("After loop, x=%d\n", x);
• Thanks to lastprivate, the value of x from the last iteration (i=3) is assigned to the
main x.
• So after the loop ends, x=3.
• Shared → Everyone writes on the same whiteboard.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 17
Module 4: Shared-memory programming with OpenMP
• Private → Everyone has their own notebook.
• Firstprivate → Everyone gets a notebook copied from the teacher’s notes.
• Lastprivate → Teacher collects the notebook of the student who finished last and
keeps it.
Variable Scoping Rules: Determines how variables are shared or distributed among threads.
Default Rules:
• Shared: Variables declared outside parallel region
• Private: Variables declared inside parallel region.
Example:
int main( )
int a = 10, b = 20, c = 30, d = 40;
#pragma omp parallel for private(a) firstprivate(b) lastprivate(c) shared(d)
for (int i = 0; i < 4; i++)
int thread_id = omp_get_thread_num();
a = thread_id;
b += thread_id;
c = thread_id * 100;
// private (uninitialized)
// firstprivate (initialized to 20)
// lastprivate (copied back)
// d is shared by all threads
return 0;
Variable Initialization (before parallel region)
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 18
Module 4: Shared-memory programming with OpenMP
int a = 10, b = 20, c = 30, d = 40;
• a = 10
• b = 20
• c = 30
• d = 40
These are defined in the serial (main) part of program.
2. OpenMP Directive
#pragma omp parallel for private(a) firstprivate(b) lastprivate(c) shared(d)
This means:
• private(a)
o Each thread gets its own copy of a,
o But it is uninitialized (the initial value 10 is lost).
o So each thread will set its own a.
• firstprivate(b)
o Each thread gets its own copy of b,
o It is initialized with value = 20 (from main program).
o Then each thread modifies its own copy, without affecting others.
• lastprivate(c)
o Each thread gets its own copy of c.
o At the end of the loop, the value of c from the last iteration (highest i) is
copied back into the original c in the main program.
• shared(d)
o All threads see and use the same single d.
o Any modification by one thread is visible to others. (◻ If modified, it can
cause race conditions unless handled).
3. Parallel Loop
for (int i = 0; i < 4; i++)
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 19
Module 4: Shared-memory programming with OpenMP
• Loop runs i = 0, 1, 2, 3.
• OpenMP divides these 4 iterations among available threads.
4. Inside the Loop (thread code)
int thread_id = omp_get_thread_num();
• Gets the ID of the thread (0, 1, 2, ...).
a = thread_id;
• Since a is private, each thread has its own a.
• Example: Thread 0 → a=0, Thread 1 → a=1, etc.
• After the loop ends, the original a=10 in main() remains unchanged.
b += thread_id;
• Since b is firstprivate, each thread starts with b=20.
• Example: Thread 0 → b=20+0=20, Thread 1 → b=20+1=21, etc.
• They are separate copies → no effect on original b=20 after loop.
c = thread_id * 100;
• Each thread has its own copy of c.
• Example: Thread 0 → c=0, Thread 1 → c=100, Thread 2 → c=200…
• Since lastprivate(c) is used → the value of c from the last iteration (i=3) is copied
back into main program.
o So after loop ends → c= thread_id_of_iteration3 * 100.
o If thread 2 executes i=3, then c=200.
// d is shared by all threads
• If any thread changes d, all threads will see the updated value.
• In this code, d is not changed, so d=40 after loop.
Final Values After Loop Ends
• a = 10 (unchanged, because it was private inside loop).
• b = 20 (unchanged, because each thread worked on its private copy initialized with
20).
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 20
Module 4: Shared-memory programming with OpenMP
• c = last thread’s result of iteration i=3 (depends which thread handled iteration 3).
• d = 40 (shared, but not modified).
Reduction Clause
What is Reduction? Common pattern where multiple threads contribute to single result
through commutative and associative operation.
How It Works:
1. Each thread gets private copy of reduction variable
2. Private copies initialized based on operation (0 for +, 1 for *)
3. Each thread operates on private copy
4. All private copies combined using specified operation
5. Final result stored in original variable
Supported Operations:
// Arithmetic
double sum = 0.0;
int product = 1;
#pragma omp parallel for reduction(+:sum) reduction(*:product)
for (int i = 0; i < n; i++) {
sum += array[i];
product *= array[i];
// Min/Max
double min_val = 1e9, max_val = -1e9;
#pragma omp parallel for reduction(min:min_val) reduction(max:max_val)
for (int i = 0; i < n; i++) {
if (array[i] < min_val) min_val = array[i];
if (array[i] > max_val) max_val = array[i];
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 21
Module 4: Shared-memory programming with OpenMP
// Logical
int all_positive = 1;
#pragma omp parallel for reduction(&&:all_positive)
for (int i = 0; i < n; i++) {
all_positive = all_positive && (array[i] > 0);
Practical Examples:
• Dot product of vectors
• Counting elements satisfying condition
• Computing vector norm
The reduction clause in OpenMP is used when multiple threads need to update a shared
variable (like sum, product, max, min, etc.) without causing race conditions.
It tells OpenMP to:
• Give each thread its own private copy of the variable.
• Perform updates locally in each thread.
• At the end of the parallel region, combine all private results into the original shared
variable using the given operator (like +, *, max, etc.).
Syntax
#pragma omp parallel for reduction(op : variable)
op → the operator (sum, product, max, min, etc.)
variable → the variable to be reduced
Example
Sum of numbers 1 to 10:
#include <stdio.h>
#include <omp.h>
int main()
int i, sum = 0;
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 22
Module 4: Shared-memory programming with OpenMP
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= 10; i++)
sum += i;
printf("Sum = %d\n", sum); // Output: 55
return 0;
Without reduction, multiple threads would write to sum at the same time → wrong result
(race condition).
With reduction, OpenMP ensures correctness.
Supported Operators
Arithmetic: +, -, *
Bitwise: &, |, ^
Logical: &&, ||
Min/Max (newer OpenMP): min, max
In simple words:
The reduction clause lets each thread calculate a partial result and then OpenMP
automatically reduces (combines) all partial results into one final value.
#include <stdio.h>
#include <omp.h>
int main()
int i, sum = 0;
#pragma omp parallel for reduction(+:sum) num_threads(4)
for (i = 1; i <= 10; i++)
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 23
Module 4: Shared-memory programming with OpenMP
sum += i;
printf("Final Sum = %d\n", sum);
return 0;
Step 1: Initialization
When OpenMP sees reduction(+:sum), it does:
Create a private copy of sum for each thread.
Initialize each private copy to 0 (because the operator is +).
So at the start:
Thread 0 → sum = 0
Thread 1 → sum = 0
Thread 2 → sum = 0
Thread 3 → sum = 0
Step 2: Divide Work
With 10 iterations (i = 1…10) and 4 threads, OpenMP distributes work (with static
scheduling by default):
Thread 0 → i = 1, 2, 3
Thread 1 → i = 4, 5, 6
Thread 2 → i = 7, 8
Thread 3 → i = 9, 10
Step 3: Each Thread Computes Locally
Each thread updates its own copy of sum:
Thread 0: sum = 1 + 2 + 3 = 6
Thread 1: sum = 4 + 5 + 6 = 15
Thread 2: sum = 7 + 8 = 15
Thread 3: sum = 9 + 10 = 19
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 24
Module 4: Shared-memory programming with OpenMP
At this stage, private sums are:
Thread 0 → sum = 6
Thread 1 → sum = 15
Thread 2 → sum = 15
Thread 3 → sum = 19
Step 4: Reduction (Combination)
At the end of the parallel loop, OpenMP reduces (combines) all private values using the +
operator:
Final sum = 6 + 15 + 15 + 19 = 55
The result is stored in the original shared variable sum.
Step 5: Final Output
Program prints:
Final Sum = 55
Loop Carried Dependencies
Definition: Loop-carried dependency exists when an iteration depends on
results of previous iteration. These loops cannot be parallelized directly.
Types of Dependencies:
• True Dependency (RAW): Read After Write
• Anti Dependency (WAR): Write After Read
• Output Dependency (WAW): Write After Write
Problematic Examples:
// Cannot parallelize - has dependencies
for (int i = 1; i < n; i++)
a[i] = a[i-1] + b[i]; // Depends on previous iteration
// Fibonacci - classic dependency
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 25
Module 4: Shared-memory programming with OpenMP
fib[0] = 0; fib[1] = 1;
for (int i = 2; i < n; i++)
fib[i] = fib[i-1] + fib[i-2]; // Cannot parallelize
Safe Examples:
// Can parallelize - no dependencies
#pragma omp parallel for
for (int i = 0; i < n; i++)
c[i] = a[i] + b[i]; // Independent iterations
What is a Loop-Carried Dependency?
A loop-carried dependency happens when an iteration of a loop depends on the result of a
previous iteration.
That means iteration i cannot run until iteration i-1 (or earlier) finishes.
This prevents parallelization because the iterations are linked like a chain.
Example 1: With Dependency
for (int i = 1; i < n; i++)
a[i] = a[i-1] + 5;
Here:
Iteration i=1 needs a[0].
Iteration i=2 needs a[1].
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 26
Module 4: Shared-memory programming with OpenMP
Iteration i=3 needs a[2], and so on.
Each step depends on the previous step → this is a loop-carried dependency.
This loop cannot be fully parallelized.
Example 2: No Dependency
for (int i = 0; i < n; i++)
b[i] = c[i] + 5;
Here:
Each b[i] only depends on c[i].
Iteration i=0 does not affect iteration i=1.
No dependency between iterations.
This loop has no loop-carried dependency → it can be parallelized.
Why It Matters in Parallel Programming
•Loops without loop-carried dependency can be executed by multiple threads at the
same time.
• Loops with loop-carried dependency must run sequentially, unless the dependency
can be removed or transformed.
1. True Dependency (Read After Write, RAW)
• Also called flow dependency.
• Occurs when an instruction/iteration needs a value that was produced earlier.
• This is a real data dependency and cannot be removed by just renaming
variables.
Example:
for (int i = 1; i < n; i++)
a[i] = a[i-1] + 5; // depends on result from previous iteration
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 27
Module 4: Shared-memory programming with OpenMP
• Iteration i reads a[i-1], which was written by iteration i-1.
• Must preserve order → hard to parallelize.
2. Anti-Dependency (Write After Read, WAR)
• Occurs when an instruction writes to a variable that was already read by a
previous iteration.
• The read must happen before the write, otherwise the read gets wrong data.
• This is not a true data need, but a name conflict → can often be fixed by
variable renaming.
Example:
for (int i = 0; i < n; i++)
y = a[i]; // read a[i]
a[i] = i; // then write a[i]
• Must ensure the read (y = a[i]) happens before the write (a[i] = i).
• Dependency exists, but can often be removed by using a temporary variable.
3. Output Dependency (Write After Write, WAW)
• Occurs when two iterations write to the same variable/memory location.
• The final value depends on the order of execution.
Example:
for (int i = 0; i < n; i++)
x = i; // write to x in every iteration
Every iteration writes to x, but only the last write (x = n-1) matters.
Note: Parallelizing may cause incorrect results unless writes are controlled.
Dependency Type Also Called Example Real Data Need? Parallelization Impact
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 28
Module 4: Shared-memory programming with OpenMP
Dependency Type Also Called Example Real Data Need? Parallelization Impact
Read After
True (RAW) a[i] = a[i-1] + 1; Yes Must preserve order
Write
Write After No (just name Often removable by
Anti (WAR) y = a[i]; a[i] = i;
Read conflict) renaming
Write After No (just order of
Output (WAW) x = i; inside loop May require synchronization
Write writes)
Caches, cache coherence
• When we run an OpenMP program, multiple threads run on different CPU cores.
• Each core has its own cache (L1, L2).
• All cores share the main memory (RAM).
• If one thread updates a shared variable, other cores must see the new value.
Cache coherence = mechanism that keeps copies of shared data in multiple caches
consistent.
Why is it needed?
int x = 0;
#pragma omp parallel num_threads(2)
int id = omp_get_thread_num();
if (id == 0) {
x = 10; // Thread 0 writes
} else {
printf("Thread 1 reads x = %d\n", x);
What happens?
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 29
Module 4: Shared-memory programming with OpenMP
• Thread 0 (Core 0) writes x = 10.
• But Thread 1 (Core 1) may still have old x = 0 in its cache.
• If caches are not coherent → Thread 1 prints 0 instead of 10.
Cache Coherence Problem in OpenMP
The issue becomes bigger when multiple threads frequently read/write shared variables.
Example:
int sum = 0;
#pragma omp parallel for
for (int i = 0; i < 100000; i++)
sum += i; // multiple threads update same variable
• All threads try to update sum.
• Each core caches sum.
• When one thread writes, other caches must invalidate their copies.
Constant invalidation → performance drop (though correctness is maintained).
This is why we use reduction clause in OpenMP (reduction(+:sum)) to avoid such conflicts.
MESI Protocol
• MESI = Modified, Exclusive, Shared, Invalid
• It is the most widely used cache coherence protocol in multi-core processors.
Example with Two Cores and One Variable x
Main memory: x = 5
Core 0 (Thread 0) and Core 1 (Thread 1) both want to use x.
Exclusive
Step 1: Core 0 reads x
Core 0 gets a fresh copy from memory.
State → Exclusive (E)
Core 0: E (x=5)
Core 1: I (no copy)
Memory: x=5
Shared
Step 2: Core 1 reads x
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 30
Module 4: Shared-memory programming with OpenMP
Now both have the same value.
State → Shared (S)
Core 0: S (x=5)
Core 1: S (x=5)
Memory: x=5
Modified
Step 3: Core 0 writes x = 10
Core 0 changes its copy.
It must tell Core 1: “Your copy is wrong → throw it away.”
State → Core 0 = Modified (M), Core 1 = Invalid (I)
Core 0: M (x=10)
Core 1: I (invalid)
Memory: x=5
Invalid
Step 4: Writeback to memory
Core 0 updates memory with new value 10.
Now memory has the latest version.
Core 0: M (x=10)
Core 1: I
Memory: x=10
No matter what, all cores will eventually see the same latest value of x.
That’s MESI in the simplest form:
E → only one has it, same as memory
S → many have it, same as memory
M → only one has it, changed, memory is old
I → invalid, must fetch again
Cache coherence problem happens when:
Two or more threads use variables stored close together in memory (same cache line).
One thread updates the variable → others’ caches get out of sync → CPU must synchronize
caches, causing delays.
Example (Without Padding)
#include <stdio.h>
#include <omp.h>
#define N 4
int counter[N]; // shared array
int main( )
{
#pragma omp parallel num_threads(N)
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 31
Module 4: Shared-memory programming with OpenMP
{
int id = omp_get_thread_num();
for (int i = 0; i < 1000000; i++)
{
counter[id]++; // each thread updates its own counter
}
}
for (int i = 0; i < N; i++)
{
printf("counter[%d] = %d\n", i, counter[i]);
}
}
What happens in memory?
counter[0], counter[1], counter[2], counter[3] may all lie in one cache line (e.g., 64 bytes).
When Thread 0 updates counter[0],
→ CPU marks the entire cache line invalid for other threads.
When Thread 1 updates counter[1],
→ again invalidates cache line → false sharing occurs.
So, even though threads are working on different array elements, they keep fighting for cache
line updates.
Diagram (False Sharing)
Memory Layout in one cache line (64 bytes):
| counter[0] | counter[1] | counter[2] | counter[3] |
Thread 0 ---> updates counter[0]
Thread 1 ---> updates counter[1]
Thread 2 ---> updates counter[2]
Thread 3 ---> updates counter[3]
All inside same cache line → causes cache invalidation & slow down
2. Padding Concept
Padding means adding extra unused space between variables so that each thread’s variable
sits in a different cache line.
This avoids false sharing → improves performance.
Example (With Padding)
#include <stdio.h>
#include <omp.h>
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 32
Module 4: Shared-memory programming with OpenMP
#define N 4
#define PAD 64 // assume cache line = 64 bytes
struct padded_int
{
int value;
char padding[PAD]; // unused space to push next variable into new cache line
};
struct padded_int counter[N];
int main( )
{
#pragma omp parallel num_threads(N)
{
int id = omp_get_thread_num();
for (int i = 0; i < 1000000; i++)
{
counter[id].value++; // now no false sharing
}
}
for (int i = 0; i < N; i++)
{
printf("counter[%d] = %d\n", i, counter[i].value);
}
}
Diagram (With Padding)
Memory Layout with padding:
| counter[0].value | PADDING | (fills 64 bytes) |
| counter[1].value | PADDING | (fills 64 bytes) |
| counter[2].value | PADDING | (fills 64 bytes) |
| counter[3].value | PADDING | (fills 64 bytes) |
Each thread writes to its own cache line → no invalidation → fast
Cache coherence: Ensures all caches see the same value of memory.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 33
Module 4: Shared-memory programming with OpenMP
False sharing: Happens when multiple threads update different variables in the same cache
line.
Padding: Adds dummy bytes to separate variables across cache lines → removes false
sharing.
False Sharing
What happens inside CPU caches?
• Memory is divided into cache lines (usually 64 bytes).
• When a thread updates a variable, the whole cache line containing that variable is
loaded into the thread’s core cache.
• If another thread updates a different variable in the same cache line, the CPU still
treats the entire cache line as "shared".
This triggers cache invalidations → meaning caches fight with each other even though the
variables are logically independent.
Example
int a[2] = {0,0};
// Thread 0 updates a[0]
// Thread 1 updates a[1]
Case: Both in the same cache line
Cache line in memory (64 bytes)
| a[0] | a[1] | ............... unused space ...... |
Thread 0: updates a[0]
Thread 1: updates a[1]
Step by Step
1. Thread 0 loads cache line (a[0], a[1]) into its core.
2. Thread 1 loads the same cache line into its core.
3. Thread 0 updates a[0].
o Its cache line is marked Modified (M) (MESI protocol).
o Other cores’ copies are marked Invalid (I).
4. Thread 1 tries to update a[1] (same line!).
o It must fetch the "latest" line from Thread 0’s cache.
o Now its line is Modified, and Thread 0’s becomes Invalid.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 34
Module 4: Shared-memory programming with OpenMP
5. They keep bouncing ownership of the cache line back and forth.
Even though a[0] and a[1] are independent, the cache system sees them as part of the
same block, so it forces communication.
This unnecessary "ping-pong" is the false sharing problem.
Diagram
False Sharing
Thread 0 Core Cache Thread 1 Core Cache
| a[0]=5 | a[1]=0 | <----> | a[0]=5 | a[1]=1 |
(updates a[0]) invalidates (updates a[1])
• Both fight over the same line.
• Constant invalidation + reload.
• Performance drops.
Fixed with Padding
Cache Line 1: | a[0] | pad.... | (only Thread 0 touches)
Cache Line 2: | a[1] | pad.... | (only Thread 1 touches)
Now:
• Thread 0 updates a[0] in its own line.
• Thread 1 updates a[1] in a different line.
• No invalidations → fast.
In short
• Problem: CPU caches work at cache line granularity, not variable granularity.
• If variables share a cache line → threads contend for it, causing false sharing.
• Fix: Align/pad data so each frequently updated variable is in a separate cache line.
Do you want me to also show you a timing experiment code (with and without false
sharing) so you can see the performance difference on your system?
False sharing Example
#include<stdio.h>
#include<omp.h>
int main( )
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 35
Module 4: Shared-memory programming with OpenMP
int a[2]={0,0};
#pragma omp parallel for num_threads(2)
int id=omp_get_thread_num();
a[id]++;
printf(“a[0]= and a[1]=”,%d %d);
return 0;
The Producer–Consumer Problem
1. What is it?
• It’s a classic synchronization problem.
• We have:
o Producer(s): generate data (items).
o Consumer(s): process that data.
o Shared buffer: a memory region used by both.
Problem: Without control, they can cause race conditions, overwrites, or invalid reads.
2. Constraints
• The buffer has a fixed size (say N).
• Producer rules:
o If buffer is full, the producer must wait (cannot add).
• Consumer rules:
o If buffer is empty, the consumer must wait (cannot remove).
• Mutual exclusion:
o Only one thread at a time may access the buffer (to prevent corruption).
Producer Function
void producer(int item)
{
while (count == BUFFER_SIZE)
{
// wait (buffer full)
}
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 36
Module 4: Shared-memory programming with OpenMP
• Producer generates an item and wants to put it into the buffer.
• If count == BUFFER_SIZE, it means the buffer is full, so producer must wait (busy
wait here).
• This prevents overflow (writing when no space).
#pragma omp critical(buffer_access)
{
buffer[in] = item;
in = (in + 1) % BUFFER_SIZE;
count++;
}
}
#pragma omp critical(buffer_access) → only one thread at a time can enter this block.
•
• Inside:
o Place the item at position in.
o Update in using modulo % BUFFER_SIZE → makes it a circular buffer
(when in reaches the end, it wraps around to 0).
o Increment count (one more item now stored).
Producer’s job: safely insert into buffer.
Consumer Function
int consumer()
{
int item;
while (count == 0)
{
// wait (buffer empty)
}
• Consumer wants to take an item from buffer.
• If count == 0, buffer is empty, so it must wait (busy wait).
• This prevents underflow (taking from empty buffer).
#pragma omp critical(buffer_access)
{
item = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
}
return item;
}
• Enter critical section to avoid race conditions.
• Take item from position out.
• Update out using modulo → so consumer also works in circular fashion.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 37
Module 4: Shared-memory programming with OpenMP
• Decrement count (one less item in buffer).
• Return the consumed item.
Buffer size = 5 (instead of 100, for simplicity).
• Initially:
o buffer = [ , , , , ] (empty)
o in = 0 (producer will insert here)
o out = 0 (consumer will remove here)
o count = 0 (nothing in buffer)
Step 1: Producer adds an item
Producer creates item 10 and calls producer(10).
• count == BUFFER_SIZE? → No (0 != 5) → continue.
• Critical section:
o buffer[in] = 10 → buffer[0] = 10.
o in = (0+1)%5 = 1.
o count = 1.
Now:
buffer = [10, , , , ]
in = 1, out = 0, count = 1
Step 2: Producer adds another item
Producer creates item 20 and calls producer(20).
• Buffer not full.
• buffer[in] = 20 → buffer[1] = 20.
• in = (1+1)%5 = 2.
• count = 2.
• Now:
buffer = [10, 20, , , ]
in = 2, out = 0, count = 2
Step 3: Consumer takes an item
Consumer calls consumer().
• count == 0? → No (2 != 0) → continue.
• Critical section:
o item = buffer[out] → item = buffer[0] = 10.
o out = (0+1)%5 = 1.
o count = 1.
Consumer gets 10.
Now:
buffer = [10, 20, , , ] // (10 is still in array, but logically removed)
in = 2, out = 1, count = 1
Step 4: Producer adds again
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 38
Module 4: Shared-memory programming with OpenMP
Producer creates item 30 and calls producer(30).
• Buffer not full.
• buffer[in] = 30 → buffer[2] = 30.
• in = (2+1)%5 = 3.
• count = 2.
Now:
buffer = [10, 20, 30, , ]
in = 3, out = 1, count = 2
Step 5: Consumer takes again
Consumer calls consumer().
• count == 0? → No.
• item = buffer[out] → buffer[1] = 20.
• out = (1+1)%5 = 2.
• count = 1.
Consumer gets 20.
Now:
buffer = [10, 20, 30, , ]
in = 3, out = 2, count = 1
How It Works
• in always points to next free slot (where producer will put data).
• out always points to next available item (where consumer will take data).
• count keeps track of how many items are in the buffer.
• while(count==BUFFER_SIZE) → stops producer when buffer is full.
• while(count==0) → stops consumer when buffer is empty.
• #pragma omp critical → ensures producer & consumer don’t update count/in/out at
the same time.
In parallel computing, thread safety means making sure that when multiple threads/processes
run at the same time, they do not corrupt shared data or produce wrong results.
Tasking in OpenMP
Normally in OpenMP we parallelize loops (parallel for). But what if:
• Work is not evenly divisible
• Work is recursive
• Work depends on conditions
Then tasking is used.
It allows you to create independent tasks (units of work). These tasks are stored in a task pool, and
any available thread in the team can pick them up.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 39
Module 4: Shared-memory programming with OpenMP
How It Works (Diagram)
Thread 0 encounters tasks:
┌─────────────────┐
│ Task A │
├─────────────────┤
│ Task B │
├─────────────────┤
│ Task C │
└─────────────────┘
(task pool)
Thread 1 → executes Task A
Thread 2 → executes Task B
Thread 3 → executes Task C
• Only one thread (inside #pragma omp single) creates the tasks.
• Other threads steal tasks from the pool and execute them.
• Scheduling is dynamic — order may change each run.
Example 1: Simple Tasking
#include <stdio.h>
#include <omp.h>
int main()
#pragma omp parallel
#pragma omp single // only one thread creates tasks
printf("Thread %d creating tasks...\n", omp_get_thread_num());
#pragma omp task
printf("Task A executed by thread %d\n", omp_get_thread_num());
#pragma omp task
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 40
Module 4: Shared-memory programming with OpenMP
printf("Task B executed by thread %d\n", omp_get_thread_num());
#pragma omp task
printf("Task C executed by thread %d\n", omp_get_thread_num());
return 0;
Thread 0 creating tasks...
Task B executed by thread 2
Task C executed by thread 3
Task A executed by thread 1
(Order changes every run because tasks are scheduled dynamically!)
Example 2: Tasking with taskwait
#include <stdio.h>
#include <omp.h>
int main()
#pragma omp parallel
#pragma omp single
int x=0, y=0;
#pragma omp task shared(x)
x = 10;
#pragma omp task shared(y)
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 41
Module 4: Shared-memory programming with OpenMP
y = 20;
#pragma omp taskwait // wait for x and y tasks
printf("Sum = %d (computed by thread %d)\n", x+y, omp_get_thread_num());
return 0;
Output:
Sum = 30 (computed by thread 0)
When to Use Tasking
• Recursive problems → Fibonacci, quicksort, tree traversal
• Irregular workloads → graph algorithms, sparse matrix
• Dynamic workloads → adaptive mesh, unpredictable work
Thread Safety
You can achieve thread safety in different ways depending on the type of data sharing. Let’s
go step by step:
1. Avoid Sharing Data (Best Option)
• If each thread has its own copy of data → no conflicts.
• Use private variables in OpenMP, or thread-local storage in other frameworks.
Example in OpenMP:
#pragma omp parallel private(x)
{
int x = omp_get_thread_num(); // each thread has its own 'x'
printf("Thread %d has x = %d\n", omp_get_thread_num(), x);
}
Safe because no thread touches another’s variable.
2. Synchronization (When Sharing is Needed)
• Sometimes threads must share data (e.g., updating a counter, writing into a shared
array).
• Then you need synchronization to protect critical sections.
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 42
Module 4: Shared-memory programming with OpenMP
a) Locks / Critical Sections
Only one thread executes the code block at a time.
int counter = 0;
#pragma omp parallel
{
#pragma omp critical
{
counter++;
}
}
b) Atomic Operations
For very small updates (like increment/decrement), atomics are faster than locks.
int counter = 0;
#pragma omp parallel
{
#pragma omp atomic
counter++;
}
c) Reduction (Built-in Safe Accumulation)
If threads are all contributing to a sum, product, min/max, etc.,
OpenMP provides reduction to safely combine results.
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<1000; i++)
{
sum += i;
}
No race conditions, OpenMP handles safety internally.
3. Padding & Alignment (Avoid False Sharing)
• Even if threads update different variables, they may be in the same cache line, causing
slowdown and inconsistencies.
Fix: padding
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 43
Module 4: Shared-memory programming with OpenMP
struct
{
int value;
char pad[64]; // separate cache lines
} counter[NUM_THREADS];
4. Message Passing (MPI Style)
Instead of sharing memory, each process/thread works on its own local data and exchanges
messages when needed.
No shared state → no race conditions.
Often used in MPI (distributed parallel computing).
Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 44