0% found this document useful (0 votes)
5 views44 pages

PPModule 4 Notes

Module 4 covers shared-memory programming using OpenMP, detailing its features, syntax, and key concepts like the fork-join model, variable scope, and scheduling. It includes practical examples, such as implementing the trapezoidal rule for numerical integration in both sequential and parallel forms. The module emphasizes the importance of incremental parallelization and provides insights into performance optimization techniques.

Uploaded by

vikassh9809
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views44 pages

PPModule 4 Notes

Module 4 covers shared-memory programming using OpenMP, detailing its features, syntax, and key concepts like the fork-join model, variable scope, and scheduling. It includes practical examples, such as implementing the trapezoidal rule for numerical integration in both sequential and parallel forms. The module emphasizes the importance of incremental parallelization and provides insights into performance optimization techniques.

Uploaded by

vikassh9809
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Module 4: Shared-memory programming with OpenMP

MODULE - 4
Syllabus

Shared-memory programming with OpenMP – openmp pragmas and directives, The


trapezoidal rule, Scope of variables, The reduction clause, loop carried dependency,
scheduling, producers and consumers, Caches, cache coherence and false sharing in openmp,
tasking, tasking, thread safety.

Introduction to OpenMP

What is OpenMP?
OpenMP (Open Multi-Processing) is a standardized API for shared-memory parallel
programming. It consists of:

• Compiler directives (pragmas) you add to C/C++/Fortran source to indicate parallel


regions.

• Runtime library routines (functions in omp.h) you can call from code.

• Environment variables (like OMP_NUM_THREADS) that control runtime


behaviour.

Key Features of OpenMP

1. Pragma-based: Uses compiler directives to parallelize code.


2. Fork–Join Model (master creates worker threads as needed).

3. Shared Memory (all threads can access the same memory space)

4. Incremental Parallelization (add parallelism gradually)

5. Portable (supported by most modern compilers)

1. Pragma-based: Uses compiler directives to parallelize code.

Syntax

#pragma omp directive [clauses]

structured-block

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 1


Module 4: Shared-memory programming with OpenMP

A pragma in C/C++ is a compiler directive: a hint embedded in source code that the
compiler recognizes and acts on. OpenMP uses pragmas so you can add parallel behavior to
an existing serial program without rewriting it into a threading library the compiler

#pragma omp

This marks the start of an OpenMP directive. The compiler recognizes pragmas that begin
with #pragma omp and applies OpenMP semantics to the following statement / block.

directive

The directive names the construct you want the runtime to apply. Examples:

• parallel — run the following block with a team of threads.

• for — partition a loop among threads that are already active.

• parallel for — combine parallel + for (fork threads, then split loop iterations among
them).

• critical, atomic, barrier — synchronization constructs.

[clauses]

Clauses modify the directive’s behavior (control variable scope, scheduling, number of
threads, reduction ops, etc.). Examples: private(...), shared(...), reduction(...), schedule(...),
num_threads(...), default(none), nowait, if(...), collapse(...)

structured-block

A structured block is the C statement (single statement or compound { ... }) that immediately
follows the pragma — it must have one entry and one exit

Example

#include <stdio.h>

#include <omp.h>

int main( )

int n = 8;

int A[8] = {1,2,3,4,5,6,7,8};

int sum = 0;

// OpenMP directive with a clause

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 2


Module 4: Shared-memory programming with OpenMP

#pragma omp parallel for reduction(+:sum)

for (int i = 0; i < n; i++)

{ // <-- structured block is this loop

sum += A[i];

printf("Total sum = %d\n", sum);

return 0;

pragma omp parallel for reduction(+:sum)

• directive: parallel for → run loop in parallel.

• clause: reduction(+:sum) → ensures all threads safely add to sum

Output

Total sum = 36

2. Fork–Join Model (master creates worker threads as needed)

Plain idea: your program normally runs with one thread (the master). When it reaches a
parallel region, the master temporarily creates (forks) several worker threads to run the work
in parallel. When the work is done the workers join back and the master continues alone.

#pragma omp parallel num_threads(4)

// this block is executed by 4 threads simultaneously

printf("Hello from thread %d\n", omp_get_thread_num());

What actually happens:

• Master reaches parallel → runtime starts 3 extra threads (so 4 total).


• All 4 execute the block at the same time.
• At the closing brace there's an implicit barrier: worker threads finish and are
destroyed (or returned to a pool), and the master continues.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 3


Module 4: Shared-memory programming with OpenMP

3. Shared Memory (all threads can access the same memory space)

Plain idea: threads live in the same process. They share global variables and heap
memory. Each thread has its own stack (local variables inside the thread), but memory
allocated outside (global arrays, malloc) is visible to all threads.

Example:

int shared_counter = 0; // accessible by all threads

#pragma omp parallel num_threads(4)

// each thread can read/write shared_counter -> careful!

#pragma omp atomic

shared_counter++;

Important details:

• Shared = same physical memory. No explicit message passing is needed; threads


read/write the same addresses.

• Local (private) variables: variables declared inside the parallel block or loop
iteration are private to each thread (each thread gets its own copy).

4. Incremental Parallelization (add parallelism gradually)

Plain idea: you don’t have to rewrite your whole program as threads from day one.
OpenMP’s pragmas let you add parallelism bit by bit to the places that benefit most
(loops, independent function calls), test, and expand.

Typical workflow:

1. Start with serial code that works and is well-tested.

2. Identify a heavy loop (hotspot) where iterations are independent.

3. Add #pragma omp parallel for above that loop and test results.

4. Fix correctness issues (scope, reductions).

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 4


Module 4: Shared-memory programming with OpenMP

5. Measure performance; tune schedule, chunk size, or move to a parallel region that
surrounds several loops to lower overhead.

Serial:

double sum = 0;

for (int i=0; i<n; ++i)

sum += a[i];

Add OpenMP:

double sum = 0;

#pragma omp parallel for reduction(+:sum)

for (int i=0; i<n; ++i)

sum += a[i];

5. Portable (supported by most modern compilers)

Plain idea: OpenMP is a standard; major C/C++/Fortran compilers support it. That
means code using OpenMP pragmas is portable across compilers and platforms (with the
usual caveats about compiler flags and OpenMP version).

What you need to do:

• Compile with the compiler’s OpenMP flag:

o GCC/Clang: -fopenmp

o Intel (oneAPI): -qopenmp or -fopenmp depending on version

• Run with desired thread count via:

o environment variable: export OMP_NUM_THREADS=4

o or per-region clause #pragma omp parallel num_threads(4)

Compilation:

// Compile with OpenMP support

gcc -fopenmp program.c -o program

icc -qopenmp program.c -o program

clang -fopenmp program.c -o program

// Set number of threads

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 5


Module 4: Shared-memory programming with OpenMP

export OMP_NUM_THREADS=4

./program

Applications:

Scientific computing,

image processing,

financial modeling,

weather simulation,

machine learning

OpenMP Pragmas and Directives

Pragma Syntax:

#pragma omp directive [clause [clause]...]

structured-block

Basic Example:

#include <omp.h>

#include <stdio.h>

int main( )

printf("Before parallel region\n");

#pragma omp parallel

int thread_id = omp_get_thread_num();

int total_threads = omp_get_num_threads();

printf("Hello from thread %d of %d\n", thread_id, total_threads);

printf("After parallel region\n");

return 0;

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 6


Module 4: Shared-memory programming with OpenMP

Execution Model:

1. Program starts with master thread (thread 0)

2. Master creates team of threads at parallel region

3. All threads execute structured block in parallel

4. Threads synchronize at end, all except master terminate

5. Execution continues serially with master thread

Common Functions:

int omp_get_num_threads(); // Get number of threads

int omp_get_thread_num(); // Get current thread ID

void omp_set_num_threads(int); // Set number of threads

double omp_get_wtime(); // Get wall clock time

Common functions

• omp_get_thread_num() → gives thread ID (0..N-1).

• omp_get_num_threads() → gives total threads in this region.

• omp_set_num_threads(int n) → set number of threads in code.

• omp_get_wtime() → gives wall-clock time (for timing programs)

Trapezoidal Rule - Theory

Numerical Integration Concept: The trapezoidal rule approximates definite

integrals by dividing the area under a curve into trapezoids.

Formula: ∫[a to b] f(x) dx ≈ (h/2)[f(a) + 2∑f(xi) + f(b)] where h = (b-a)/n and xi

= a + i*h

Sequential Implementation:

double f(double x)

return x * x; // f(x) = x²

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 7


Module 4: Shared-memory programming with OpenMP

double sequential_trap(double a, double b, int n)

double h = (b - a) / n;

double integral = (f(a) + f(b)) / 2.0;

for (int i = 1; i <= n-1; i++)

double x_i = a + i * h;

integral += f(x_i);

return integral * h;

Analysis:

• Time Complexity: O(n)

• Accuracy: Error decreases as O(h²)

• Parallelization Potential: Each iteration is independent.

Trapezoidal Rule - Parallel Implementation

Parallel Strategy: Distribute computation of trapezoids among threads, then

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 8


Module 4: Shared-memory programming with OpenMP

combine results.

Method 1: Using reduction

double parallel_trap_v1(double a, double b, int n)

double h = (b - a) / n;

double integral = 0.0;

#pragma omp parallel for reduction(+:integral)

for (int i = 0; i < n; i++)

double x_i = a + (i + 0.5) * h;

integral += f(x_i);

return integral * h;

Method 2: Manual distribution

double parallel_trap_v2(double a, double b, int n)

double h = (b - a) / n;

double global_integral = 0.0;

#pragma omp parallel

int thread_id = omp_get_thread_num();

int num_threads = omp_get_num_threads();

int local_n = n / num_threads;

int start = thread_id * local_n;

int end = (thread_id == num_threads - 1) ? n : start + local_n;

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 9


Module 4: Shared-memory programming with OpenMP

double local_integral = 0.0;

for (int i = start; i < end; i++)

double x_i = a + (i + 0.5) * h;

local_integral += f(x_i);

#pragma omp critical

global_integral += local_integral;

return global_integral * h;

Performance Tip: Method 1 with reduction is more efficient

Code with small example

#include <stdio.h>

#include <omp.h>

double f(double x) {

return x * x; // function y = x^2

int main() {

int n = 4; // number of trapezoids

double a = 0.0, b = 4.0;

double h = (b - a) / n;

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 10


Module 4: Shared-memory programming with OpenMP

double sum = 0.0;

// Parallel loop: each thread works on part of the sum

#pragma omp parallel for reduction(+:sum)

for (int i = 1; i < n; i++) {

double x = a + i * h;

sum += f(x);

double result = (h/2) * (f(a) + 2*sum + f(b));

printf("Integral ≈ %f\n", result);

return 0;

Output:

Integral ≈ 22.000000

Scheduling

Purpose: Determines how loop iterations are distributed among threads.

Scheduling Types:

• Static:

❖ Iterations divided evenly at compile time .


❖ Iterations are divided before loop begins.
❖ Distribution can be:
❖ Equal contiguous blocks (default if chunk not given).
❖ Very low overhead because no runtime management is needed.

• Dynamic:

❖ Iterations assigned at runtime for load balancing.


❖ Iterations are divided into chunks of size chunk_size.
❖ When a thread finishes its chunk, it requests the next available chunk.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 11


Module 4: Shared-memory programming with OpenMP

❖ Achieves better load balancing but has higher scheduling overhead.

• Guided:

❖ Decreasing chunk sizes over time.


❖ Starts with large chunks → ends with small chunks.
❖ Useful for non-uniform workloads with decreasing computation cost.

• Auto:

❖ Compiler/runtime decides.
❖ Leaves decision of scheduling type and chunk size to the compiler/runtime
system.

// Static scheduling

#pragma omp parallel for schedule(static, 100)

for (int i = 0; i < n; i++)

process_data(i); // Even workload distribution

// Dynamic scheduling

#pragma omp parallel for schedule(dynamic, 10)

for (int i = 0; i < n; i++)

complex_computation(i); // Good for irregular workloads

// Guided scheduling

#pragma omp parallel for schedule(guided, 50)

for (int i = 0; i < n; i++)

variable_work(i); // Chunk sizes decrease over time

// Runtime scheduling

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 12


Module 4: Shared-memory programming with OpenMP

#pragma omp parallel for schedule(runtime)

for (int i = 0; i < n; i++)

work(i); // Uses OMP_SCHEDULE environment variable

When to Use:

• Static: Equal workload per iteration

• Dynamic: Highly variable workload

• Guided: Moderate load imbalance

What is Variable Scoping in OpenMP?

• In OpenMP, when you create parallel regions or parallel loops, multiple threads run
at the same time.
• Each thread can share variables or have its own private copy.

Variable scoping = deciding whether a variable is shared among all threads or private to
each thread.

Types of Scoping

1. Shared

• Variable is visible to all threads.

• Only one memory location is used.

• Risk: multiple threads may write at the same time → race condition.

int x = 0;

#pragma omp parallel shared(x)

x += omp_get_thread_num(); // all threads update same x

int x = 0;

• Variable x is initialized to 0 in the main program.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 13


Module 4: Shared-memory programming with OpenMP

• Since shared(x) is specified, all threads will see and use the same single x.

#pragma omp parallel shared(x)

• Starts a parallel region.

• Suppose we have N threads (say 4).

• Then, 4 threads will execute the block simultaneously.

• Each thread has a unique ID from 0 to N-1, given by omp_get_thread_num().

x += omp_get_thread_num();

• Each thread adds its ID to the shared variable x.

Example with 4 threads:

• Thread 0 → x += 0 → no change.

• Thread 1 → x += 1 → increases x by 1.

• Thread 2 → x += 2 → increases x by 2.

• Thread 3 → x += 3 → increases x by 3.

So ideally, final x = 0 + 1 + 2 + 3 = 6.

2. Private

• Each thread gets its own copy of the variable.

• Copy is uninitialized (unless firstprivate is used).

• Changes are not reflected outside the parallel region.

int x = 5;

#pragma omp parallel private(x)

x = omp_get_thread_num(); // each thread has its own x

printf("Thread %d x=%d\n", omp_get_thread_num(), x);

printf("Outside x=%d\n", x); // original x unchanged

int x = 5;

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 14


Module 4: Shared-memory programming with OpenMP

• In the main thread (serial part), x is initialized to 5.

#pragma omp parallel private(x)

• A parallel region begins.

• Suppose 4 threads are created.

• Each thread gets its own private copy of x.

• Important: private variables are uninitialized inside the parallel region (they don’t
inherit the value 5 from outside).

• So each thread starts with garbage value in its private x (but then we overwrite it).

Inside parallel region

x = omp_get_thread_num();

printf("Thread %d x=%d\n", omp_get_thread_num(), x);

• Each thread assigns its ID to its private x.

• Example output (if 4 threads run):

• Thread 0 x=0

• Thread 1 x=1

• Thread 2 x=2

• Thread 3 x=3

• Note: Order may vary (threads run concurrently).

After parallel region

printf("Outside x=%d\n", x);

• Here, we are back in the main thread.

• The original x from before parallel region (x=5) is unchanged, because:

o The private(x) clause gave each thread its own copy.

o The main thread’s x was not affected inside parallel block.

3. Firstprivate

• Same as private, but each thread’s copy is initialized with the original value.

int x = 5;

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 15


Module 4: Shared-memory programming with OpenMP

#pragma omp parallel firstprivate(x)

printf("Thread %d starts with x=%d\n", omp_get_thread_num(), x);

int x = 5;

• The variable x is initialized to 5 in the main thread.

#pragma omp parallel firstprivate(x)

• A parallel region begins (say 4 threads).

• Each thread gets its own private copy of x.

• Difference from private(x):

o With private(x): each thread’s x is uninitialized (garbage).

o With firstprivate(x): each thread’s x is initialized with the value of x from


before parallel region → here, 5.

3. Inside parallel region

printf("Thread %d starts with x=%d\n", omp_get_thread_num(), x);

• Each thread prints its own copy of x.

• Since they were all initialized to 5, the output will be something like:

• Thread 0 starts with x=5

• Thread 1 starts with x=5

• Thread 2 starts with x=5

• Thread 3 starts with x=5

• Order may vary (threads run in parallel).

4. Lastprivate

• Like private, but the value from the last iteration is copied back to the original
variable after loop ends.

int x;

#pragma omp parallel for lastprivate(x)

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 16


Module 4: Shared-memory programming with OpenMP

for (int i=0; i<4; i++) {

x = i; // each thread has private x

printf("After loop, x=%d\n", x); // x = last iteration’s value (3)

int x;

• Variable x is declared (but not initialized) in the main program.

2. #pragma omp parallel for lastprivate(x)

• This parallelizes the for loop across threads.

• Clause lastprivate(x) means:

o Each thread gets its own private copy of x during the loop (just like
private(x)).

o BUT after the loop finishes, the value of x from the last loop iteration (i=3)
is copied back to the original x in the main program.

3. Inside the loop

x = i;

• Each thread executes some subset of loop iterations.

• For each iteration, that thread’s private x is set to the loop index i.

Example with 4 iterations (i = 0,1,2,3):

• If thread 0 executes iteration i=0, its private x=0.

• If thread 1 executes iteration i=1, its private x=1.

• If thread 2 executes iteration i=2, its private x=2.

• If thread 3 executes iteration i=3, its private x=3.

4. After the loop

printf("After loop, x=%d\n", x);

• Thanks to lastprivate, the value of x from the last iteration (i=3) is assigned to the
main x.

• So after the loop ends, x=3.

• Shared → Everyone writes on the same whiteboard.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 17


Module 4: Shared-memory programming with OpenMP

• Private → Everyone has their own notebook.

• Firstprivate → Everyone gets a notebook copied from the teacher’s notes.

• Lastprivate → Teacher collects the notebook of the student who finished last and
keeps it.

Variable Scoping Rules: Determines how variables are shared or distributed among threads.

Default Rules:

• Shared: Variables declared outside parallel region

• Private: Variables declared inside parallel region.

Example:

int main( )

int a = 10, b = 20, c = 30, d = 40;

#pragma omp parallel for private(a) firstprivate(b) lastprivate(c) shared(d)

for (int i = 0; i < 4; i++)

int thread_id = omp_get_thread_num();

a = thread_id;

b += thread_id;

c = thread_id * 100;

// private (uninitialized)

// firstprivate (initialized to 20)

// lastprivate (copied back)

// d is shared by all threads

return 0;

Variable Initialization (before parallel region)

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 18


Module 4: Shared-memory programming with OpenMP

int a = 10, b = 20, c = 30, d = 40;

• a = 10

• b = 20

• c = 30

• d = 40

These are defined in the serial (main) part of program.

2. OpenMP Directive

#pragma omp parallel for private(a) firstprivate(b) lastprivate(c) shared(d)

This means:

• private(a)

o Each thread gets its own copy of a,

o But it is uninitialized (the initial value 10 is lost).

o So each thread will set its own a.

• firstprivate(b)

o Each thread gets its own copy of b,

o It is initialized with value = 20 (from main program).

o Then each thread modifies its own copy, without affecting others.

• lastprivate(c)

o Each thread gets its own copy of c.

o At the end of the loop, the value of c from the last iteration (highest i) is
copied back into the original c in the main program.

• shared(d)

o All threads see and use the same single d.

o Any modification by one thread is visible to others. (◻ If modified, it can


cause race conditions unless handled).

3. Parallel Loop

for (int i = 0; i < 4; i++)

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 19


Module 4: Shared-memory programming with OpenMP

• Loop runs i = 0, 1, 2, 3.

• OpenMP divides these 4 iterations among available threads.

4. Inside the Loop (thread code)

int thread_id = omp_get_thread_num();

• Gets the ID of the thread (0, 1, 2, ...).

a = thread_id;

• Since a is private, each thread has its own a.

• Example: Thread 0 → a=0, Thread 1 → a=1, etc.

• After the loop ends, the original a=10 in main() remains unchanged.

b += thread_id;

• Since b is firstprivate, each thread starts with b=20.

• Example: Thread 0 → b=20+0=20, Thread 1 → b=20+1=21, etc.

• They are separate copies → no effect on original b=20 after loop.

c = thread_id * 100;

• Each thread has its own copy of c.

• Example: Thread 0 → c=0, Thread 1 → c=100, Thread 2 → c=200…

• Since lastprivate(c) is used → the value of c from the last iteration (i=3) is copied
back into main program.

o So after loop ends → c= thread_id_of_iteration3 * 100.

o If thread 2 executes i=3, then c=200.

// d is shared by all threads

• If any thread changes d, all threads will see the updated value.

• In this code, d is not changed, so d=40 after loop.

Final Values After Loop Ends

• a = 10 (unchanged, because it was private inside loop).

• b = 20 (unchanged, because each thread worked on its private copy initialized with
20).

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 20


Module 4: Shared-memory programming with OpenMP

• c = last thread’s result of iteration i=3 (depends which thread handled iteration 3).

• d = 40 (shared, but not modified).

Reduction Clause

What is Reduction? Common pattern where multiple threads contribute to single result
through commutative and associative operation.

How It Works:

1. Each thread gets private copy of reduction variable

2. Private copies initialized based on operation (0 for +, 1 for *)

3. Each thread operates on private copy

4. All private copies combined using specified operation

5. Final result stored in original variable

Supported Operations:

// Arithmetic

double sum = 0.0;

int product = 1;

#pragma omp parallel for reduction(+:sum) reduction(*:product)

for (int i = 0; i < n; i++) {

sum += array[i];

product *= array[i];

// Min/Max

double min_val = 1e9, max_val = -1e9;

#pragma omp parallel for reduction(min:min_val) reduction(max:max_val)

for (int i = 0; i < n; i++) {

if (array[i] < min_val) min_val = array[i];

if (array[i] > max_val) max_val = array[i];

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 21


Module 4: Shared-memory programming with OpenMP

// Logical

int all_positive = 1;

#pragma omp parallel for reduction(&&:all_positive)

for (int i = 0; i < n; i++) {

all_positive = all_positive && (array[i] > 0);

Practical Examples:

• Dot product of vectors

• Counting elements satisfying condition

• Computing vector norm

The reduction clause in OpenMP is used when multiple threads need to update a shared
variable (like sum, product, max, min, etc.) without causing race conditions.

It tells OpenMP to:

• Give each thread its own private copy of the variable.


• Perform updates locally in each thread.
• At the end of the parallel region, combine all private results into the original shared
variable using the given operator (like +, *, max, etc.).

Syntax

#pragma omp parallel for reduction(op : variable)

op → the operator (sum, product, max, min, etc.)

variable → the variable to be reduced

Example

Sum of numbers 1 to 10:

#include <stdio.h>

#include <omp.h>

int main()

int i, sum = 0;

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 22


Module 4: Shared-memory programming with OpenMP

#pragma omp parallel for reduction(+:sum)

for (i = 1; i <= 10; i++)

sum += i;

printf("Sum = %d\n", sum); // Output: 55

return 0;

Without reduction, multiple threads would write to sum at the same time → wrong result
(race condition).

With reduction, OpenMP ensures correctness.

Supported Operators

Arithmetic: +, -, *

Bitwise: &, |, ^

Logical: &&, ||

Min/Max (newer OpenMP): min, max

In simple words:

The reduction clause lets each thread calculate a partial result and then OpenMP
automatically reduces (combines) all partial results into one final value.

#include <stdio.h>

#include <omp.h>

int main()

int i, sum = 0;

#pragma omp parallel for reduction(+:sum) num_threads(4)

for (i = 1; i <= 10; i++)

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 23


Module 4: Shared-memory programming with OpenMP

sum += i;

printf("Final Sum = %d\n", sum);

return 0;

Step 1: Initialization

When OpenMP sees reduction(+:sum), it does:

Create a private copy of sum for each thread.

Initialize each private copy to 0 (because the operator is +).

So at the start:

Thread 0 → sum = 0

Thread 1 → sum = 0

Thread 2 → sum = 0

Thread 3 → sum = 0

Step 2: Divide Work

With 10 iterations (i = 1…10) and 4 threads, OpenMP distributes work (with static
scheduling by default):

Thread 0 → i = 1, 2, 3

Thread 1 → i = 4, 5, 6

Thread 2 → i = 7, 8

Thread 3 → i = 9, 10

Step 3: Each Thread Computes Locally

Each thread updates its own copy of sum:

Thread 0: sum = 1 + 2 + 3 = 6

Thread 1: sum = 4 + 5 + 6 = 15

Thread 2: sum = 7 + 8 = 15

Thread 3: sum = 9 + 10 = 19

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 24


Module 4: Shared-memory programming with OpenMP

At this stage, private sums are:

Thread 0 → sum = 6

Thread 1 → sum = 15

Thread 2 → sum = 15

Thread 3 → sum = 19

Step 4: Reduction (Combination)

At the end of the parallel loop, OpenMP reduces (combines) all private values using the +
operator:

Final sum = 6 + 15 + 15 + 19 = 55

The result is stored in the original shared variable sum.

Step 5: Final Output

Program prints:

Final Sum = 55

Loop Carried Dependencies

Definition: Loop-carried dependency exists when an iteration depends on

results of previous iteration. These loops cannot be parallelized directly.

Types of Dependencies:

• True Dependency (RAW): Read After Write

• Anti Dependency (WAR): Write After Read

• Output Dependency (WAW): Write After Write

Problematic Examples:

// Cannot parallelize - has dependencies

for (int i = 1; i < n; i++)

a[i] = a[i-1] + b[i]; // Depends on previous iteration

// Fibonacci - classic dependency

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 25


Module 4: Shared-memory programming with OpenMP

fib[0] = 0; fib[1] = 1;

for (int i = 2; i < n; i++)

fib[i] = fib[i-1] + fib[i-2]; // Cannot parallelize

Safe Examples:

// Can parallelize - no dependencies

#pragma omp parallel for

for (int i = 0; i < n; i++)

c[i] = a[i] + b[i]; // Independent iterations

What is a Loop-Carried Dependency?

A loop-carried dependency happens when an iteration of a loop depends on the result of a


previous iteration.

That means iteration i cannot run until iteration i-1 (or earlier) finishes.

This prevents parallelization because the iterations are linked like a chain.

Example 1: With Dependency

for (int i = 1; i < n; i++)

a[i] = a[i-1] + 5;

Here:

Iteration i=1 needs a[0].

Iteration i=2 needs a[1].

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 26


Module 4: Shared-memory programming with OpenMP

Iteration i=3 needs a[2], and so on.

Each step depends on the previous step → this is a loop-carried dependency.

This loop cannot be fully parallelized.

Example 2: No Dependency

for (int i = 0; i < n; i++)

b[i] = c[i] + 5;

Here:

Each b[i] only depends on c[i].

Iteration i=0 does not affect iteration i=1.

No dependency between iterations.

This loop has no loop-carried dependency → it can be parallelized.

Why It Matters in Parallel Programming

•Loops without loop-carried dependency can be executed by multiple threads at the


same time.
• Loops with loop-carried dependency must run sequentially, unless the dependency
can be removed or transformed.
1. True Dependency (Read After Write, RAW)
• Also called flow dependency.
• Occurs when an instruction/iteration needs a value that was produced earlier.
• This is a real data dependency and cannot be removed by just renaming
variables.

Example:

for (int i = 1; i < n; i++)

a[i] = a[i-1] + 5; // depends on result from previous iteration

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 27


Module 4: Shared-memory programming with OpenMP

• Iteration i reads a[i-1], which was written by iteration i-1.


• Must preserve order → hard to parallelize.

2. Anti-Dependency (Write After Read, WAR)

• Occurs when an instruction writes to a variable that was already read by a


previous iteration.
• The read must happen before the write, otherwise the read gets wrong data.
• This is not a true data need, but a name conflict → can often be fixed by
variable renaming.

Example:

for (int i = 0; i < n; i++)

y = a[i]; // read a[i]

a[i] = i; // then write a[i]

• Must ensure the read (y = a[i]) happens before the write (a[i] = i).
• Dependency exists, but can often be removed by using a temporary variable.

3. Output Dependency (Write After Write, WAW)

• Occurs when two iterations write to the same variable/memory location.


• The final value depends on the order of execution.

Example:

for (int i = 0; i < n; i++)

x = i; // write to x in every iteration

Every iteration writes to x, but only the last write (x = n-1) matters.

Note: Parallelizing may cause incorrect results unless writes are controlled.

Dependency Type Also Called Example Real Data Need? Parallelization Impact

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 28


Module 4: Shared-memory programming with OpenMP

Dependency Type Also Called Example Real Data Need? Parallelization Impact

Read After
True (RAW) a[i] = a[i-1] + 1; Yes Must preserve order
Write

Write After No (just name Often removable by


Anti (WAR) y = a[i]; a[i] = i;
Read conflict) renaming

Write After No (just order of


Output (WAW) x = i; inside loop May require synchronization
Write writes)

Caches, cache coherence


• When we run an OpenMP program, multiple threads run on different CPU cores.
• Each core has its own cache (L1, L2).
• All cores share the main memory (RAM).
• If one thread updates a shared variable, other cores must see the new value.

Cache coherence = mechanism that keeps copies of shared data in multiple caches
consistent.

Why is it needed?
int x = 0;

#pragma omp parallel num_threads(2)

int id = omp_get_thread_num();

if (id == 0) {

x = 10; // Thread 0 writes

} else {

printf("Thread 1 reads x = %d\n", x);

What happens?

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 29


Module 4: Shared-memory programming with OpenMP

• Thread 0 (Core 0) writes x = 10.


• But Thread 1 (Core 1) may still have old x = 0 in its cache.
• If caches are not coherent → Thread 1 prints 0 instead of 10.

Cache Coherence Problem in OpenMP

The issue becomes bigger when multiple threads frequently read/write shared variables.

Example:

int sum = 0;

#pragma omp parallel for

for (int i = 0; i < 100000; i++)

sum += i; // multiple threads update same variable

• All threads try to update sum.


• Each core caches sum.
• When one thread writes, other caches must invalidate their copies.

Constant invalidation → performance drop (though correctness is maintained).

This is why we use reduction clause in OpenMP (reduction(+:sum)) to avoid such conflicts.

MESI Protocol

• MESI = Modified, Exclusive, Shared, Invalid


• It is the most widely used cache coherence protocol in multi-core processors.
Example with Two Cores and One Variable x

Main memory: x = 5
Core 0 (Thread 0) and Core 1 (Thread 1) both want to use x.
Exclusive
Step 1: Core 0 reads x
Core 0 gets a fresh copy from memory.
State → Exclusive (E)
Core 0: E (x=5)
Core 1: I (no copy)
Memory: x=5
Shared
Step 2: Core 1 reads x

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 30


Module 4: Shared-memory programming with OpenMP

Now both have the same value.


State → Shared (S)
Core 0: S (x=5)
Core 1: S (x=5)
Memory: x=5

Modified
Step 3: Core 0 writes x = 10
Core 0 changes its copy.
It must tell Core 1: “Your copy is wrong → throw it away.”
State → Core 0 = Modified (M), Core 1 = Invalid (I)
Core 0: M (x=10)
Core 1: I (invalid)
Memory: x=5
Invalid
Step 4: Writeback to memory
Core 0 updates memory with new value 10.
Now memory has the latest version.
Core 0: M (x=10)
Core 1: I
Memory: x=10
No matter what, all cores will eventually see the same latest value of x.
That’s MESI in the simplest form:
E → only one has it, same as memory
S → many have it, same as memory
M → only one has it, changed, memory is old
I → invalid, must fetch again

Cache coherence problem happens when:


Two or more threads use variables stored close together in memory (same cache line).
One thread updates the variable → others’ caches get out of sync → CPU must synchronize
caches, causing delays.
Example (Without Padding)
#include <stdio.h>
#include <omp.h>

#define N 4

int counter[N]; // shared array

int main( )
{
#pragma omp parallel num_threads(N)

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 31


Module 4: Shared-memory programming with OpenMP

{
int id = omp_get_thread_num();
for (int i = 0; i < 1000000; i++)
{
counter[id]++; // each thread updates its own counter
}
}

for (int i = 0; i < N; i++)


{
printf("counter[%d] = %d\n", i, counter[i]);
}
}

What happens in memory?


counter[0], counter[1], counter[2], counter[3] may all lie in one cache line (e.g., 64 bytes).
When Thread 0 updates counter[0],
→ CPU marks the entire cache line invalid for other threads.

When Thread 1 updates counter[1],


→ again invalidates cache line → false sharing occurs.

So, even though threads are working on different array elements, they keep fighting for cache
line updates.
Diagram (False Sharing)
Memory Layout in one cache line (64 bytes):

| counter[0] | counter[1] | counter[2] | counter[3] |

Thread 0 ---> updates counter[0]


Thread 1 ---> updates counter[1]
Thread 2 ---> updates counter[2]
Thread 3 ---> updates counter[3]

All inside same cache line → causes cache invalidation & slow down
2. Padding Concept
Padding means adding extra unused space between variables so that each thread’s variable
sits in a different cache line.
This avoids false sharing → improves performance.
Example (With Padding)
#include <stdio.h>
#include <omp.h>

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 32


Module 4: Shared-memory programming with OpenMP

#define N 4
#define PAD 64 // assume cache line = 64 bytes

struct padded_int
{
int value;
char padding[PAD]; // unused space to push next variable into new cache line
};

struct padded_int counter[N];

int main( )
{
#pragma omp parallel num_threads(N)
{
int id = omp_get_thread_num();
for (int i = 0; i < 1000000; i++)
{
counter[id].value++; // now no false sharing
}
}

for (int i = 0; i < N; i++)


{
printf("counter[%d] = %d\n", i, counter[i].value);
}
}

Diagram (With Padding)


Memory Layout with padding:

| counter[0].value | PADDING | (fills 64 bytes) |

| counter[1].value | PADDING | (fills 64 bytes) |

| counter[2].value | PADDING | (fills 64 bytes) |

| counter[3].value | PADDING | (fills 64 bytes) |

Each thread writes to its own cache line → no invalidation → fast


Cache coherence: Ensures all caches see the same value of memory.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 33


Module 4: Shared-memory programming with OpenMP

False sharing: Happens when multiple threads update different variables in the same cache
line.
Padding: Adds dummy bytes to separate variables across cache lines → removes false
sharing.

False Sharing

What happens inside CPU caches?

• Memory is divided into cache lines (usually 64 bytes).


• When a thread updates a variable, the whole cache line containing that variable is
loaded into the thread’s core cache.
• If another thread updates a different variable in the same cache line, the CPU still
treats the entire cache line as "shared".

This triggers cache invalidations → meaning caches fight with each other even though the
variables are logically independent.

Example
int a[2] = {0,0};

// Thread 0 updates a[0]


// Thread 1 updates a[1]

Case: Both in the same cache line


Cache line in memory (64 bytes)

| a[0] | a[1] | ............... unused space ...... |

Thread 0: updates a[0]


Thread 1: updates a[1]

Step by Step

1. Thread 0 loads cache line (a[0], a[1]) into its core.


2. Thread 1 loads the same cache line into its core.
3. Thread 0 updates a[0].
o Its cache line is marked Modified (M) (MESI protocol).
o Other cores’ copies are marked Invalid (I).
4. Thread 1 tries to update a[1] (same line!).
o It must fetch the "latest" line from Thread 0’s cache.
o Now its line is Modified, and Thread 0’s becomes Invalid.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 34


Module 4: Shared-memory programming with OpenMP

5. They keep bouncing ownership of the cache line back and forth.

Even though a[0] and a[1] are independent, the cache system sees them as part of the
same block, so it forces communication.
This unnecessary "ping-pong" is the false sharing problem.

Diagram

False Sharing
Thread 0 Core Cache Thread 1 Core Cache

| a[0]=5 | a[1]=0 | <----> | a[0]=5 | a[1]=1 |


(updates a[0]) invalidates (updates a[1])

• Both fight over the same line.


• Constant invalidation + reload.
• Performance drops.

Fixed with Padding


Cache Line 1: | a[0] | pad.... | (only Thread 0 touches)
Cache Line 2: | a[1] | pad.... | (only Thread 1 touches)

Now:

• Thread 0 updates a[0] in its own line.


• Thread 1 updates a[1] in a different line.
• No invalidations → fast.

In short

• Problem: CPU caches work at cache line granularity, not variable granularity.
• If variables share a cache line → threads contend for it, causing false sharing.
• Fix: Align/pad data so each frequently updated variable is in a separate cache line.

Do you want me to also show you a timing experiment code (with and without false
sharing) so you can see the performance difference on your system?

False sharing Example

#include<stdio.h>

#include<omp.h>

int main( )

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 35


Module 4: Shared-memory programming with OpenMP

int a[2]={0,0};

#pragma omp parallel for num_threads(2)

int id=omp_get_thread_num();

a[id]++;

printf(“a[0]= and a[1]=”,%d %d);

return 0;

The Producer–Consumer Problem


1. What is it?
• It’s a classic synchronization problem.
• We have:
o Producer(s): generate data (items).
o Consumer(s): process that data.
o Shared buffer: a memory region used by both.
Problem: Without control, they can cause race conditions, overwrites, or invalid reads.

2. Constraints
• The buffer has a fixed size (say N).
• Producer rules:
o If buffer is full, the producer must wait (cannot add).
• Consumer rules:
o If buffer is empty, the consumer must wait (cannot remove).
• Mutual exclusion:
o Only one thread at a time may access the buffer (to prevent corruption).

Producer Function
void producer(int item)
{
while (count == BUFFER_SIZE)
{
// wait (buffer full)
}

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 36


Module 4: Shared-memory programming with OpenMP

• Producer generates an item and wants to put it into the buffer.


• If count == BUFFER_SIZE, it means the buffer is full, so producer must wait (busy
wait here).
• This prevents overflow (writing when no space).
#pragma omp critical(buffer_access)
{
buffer[in] = item;
in = (in + 1) % BUFFER_SIZE;
count++;
}
}
#pragma omp critical(buffer_access) → only one thread at a time can enter this block.

• Inside:
o Place the item at position in.
o Update in using modulo % BUFFER_SIZE → makes it a circular buffer
(when in reaches the end, it wraps around to 0).
o Increment count (one more item now stored).
Producer’s job: safely insert into buffer.

Consumer Function
int consumer()
{
int item;

while (count == 0)
{
// wait (buffer empty)
}
• Consumer wants to take an item from buffer.
• If count == 0, buffer is empty, so it must wait (busy wait).
• This prevents underflow (taking from empty buffer).
#pragma omp critical(buffer_access)
{
item = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
}

return item;
}
• Enter critical section to avoid race conditions.
• Take item from position out.
• Update out using modulo → so consumer also works in circular fashion.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 37


Module 4: Shared-memory programming with OpenMP

• Decrement count (one less item in buffer).


• Return the consumed item.

Buffer size = 5 (instead of 100, for simplicity).


• Initially:
o buffer = [ , , , , ] (empty)
o in = 0 (producer will insert here)
o out = 0 (consumer will remove here)
o count = 0 (nothing in buffer)

Step 1: Producer adds an item


Producer creates item 10 and calls producer(10).
• count == BUFFER_SIZE? → No (0 != 5) → continue.
• Critical section:
o buffer[in] = 10 → buffer[0] = 10.
o in = (0+1)%5 = 1.
o count = 1.
Now:
buffer = [10, , , , ]
in = 1, out = 0, count = 1
Step 2: Producer adds another item
Producer creates item 20 and calls producer(20).
• Buffer not full.
• buffer[in] = 20 → buffer[1] = 20.
• in = (1+1)%5 = 2.
• count = 2.
• Now:
buffer = [10, 20, , , ]
in = 2, out = 0, count = 2

Step 3: Consumer takes an item


Consumer calls consumer().
• count == 0? → No (2 != 0) → continue.
• Critical section:
o item = buffer[out] → item = buffer[0] = 10.
o out = (0+1)%5 = 1.
o count = 1.
Consumer gets 10.
Now:
buffer = [10, 20, , , ] // (10 is still in array, but logically removed)
in = 2, out = 1, count = 1
Step 4: Producer adds again

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 38


Module 4: Shared-memory programming with OpenMP

Producer creates item 30 and calls producer(30).


• Buffer not full.
• buffer[in] = 30 → buffer[2] = 30.
• in = (2+1)%5 = 3.
• count = 2.
Now:
buffer = [10, 20, 30, , ]
in = 3, out = 1, count = 2
Step 5: Consumer takes again
Consumer calls consumer().
• count == 0? → No.
• item = buffer[out] → buffer[1] = 20.
• out = (1+1)%5 = 2.
• count = 1.
Consumer gets 20.
Now:
buffer = [10, 20, 30, , ]
in = 3, out = 2, count = 1
How It Works
• in always points to next free slot (where producer will put data).
• out always points to next available item (where consumer will take data).
• count keeps track of how many items are in the buffer.
• while(count==BUFFER_SIZE) → stops producer when buffer is full.
• while(count==0) → stops consumer when buffer is empty.
• #pragma omp critical → ensures producer & consumer don’t update count/in/out at
the same time.
In parallel computing, thread safety means making sure that when multiple threads/processes
run at the same time, they do not corrupt shared data or produce wrong results.

Tasking in OpenMP
Normally in OpenMP we parallelize loops (parallel for). But what if:

• Work is not evenly divisible


• Work is recursive
• Work depends on conditions

Then tasking is used.


It allows you to create independent tasks (units of work). These tasks are stored in a task pool, and
any available thread in the team can pick them up.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 39


Module 4: Shared-memory programming with OpenMP

How It Works (Diagram)

Thread 0 encounters tasks:

┌─────────────────┐

│ Task A │

├─────────────────┤

│ Task B │

├─────────────────┤

│ Task C │

└─────────────────┘

(task pool)

Thread 1 → executes Task A

Thread 2 → executes Task B

Thread 3 → executes Task C

• Only one thread (inside #pragma omp single) creates the tasks.
• Other threads steal tasks from the pool and execute them.
• Scheduling is dynamic — order may change each run.

Example 1: Simple Tasking

#include <stdio.h>

#include <omp.h>

int main()

#pragma omp parallel

#pragma omp single // only one thread creates tasks

printf("Thread %d creating tasks...\n", omp_get_thread_num());

#pragma omp task

printf("Task A executed by thread %d\n", omp_get_thread_num());

#pragma omp task

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 40


Module 4: Shared-memory programming with OpenMP

printf("Task B executed by thread %d\n", omp_get_thread_num());

#pragma omp task

printf("Task C executed by thread %d\n", omp_get_thread_num());

return 0;

Thread 0 creating tasks...

Task B executed by thread 2

Task C executed by thread 3

Task A executed by thread 1

(Order changes every run because tasks are scheduled dynamically!)

Example 2: Tasking with taskwait

#include <stdio.h>

#include <omp.h>

int main()

#pragma omp parallel

#pragma omp single

int x=0, y=0;

#pragma omp task shared(x)

x = 10;

#pragma omp task shared(y)

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 41


Module 4: Shared-memory programming with OpenMP

y = 20;

#pragma omp taskwait // wait for x and y tasks

printf("Sum = %d (computed by thread %d)\n", x+y, omp_get_thread_num());

return 0;

Output:

Sum = 30 (computed by thread 0)

When to Use Tasking

• Recursive problems → Fibonacci, quicksort, tree traversal


• Irregular workloads → graph algorithms, sparse matrix
• Dynamic workloads → adaptive mesh, unpredictable work

Thread Safety
You can achieve thread safety in different ways depending on the type of data sharing. Let’s
go step by step:

1. Avoid Sharing Data (Best Option)


• If each thread has its own copy of data → no conflicts.
• Use private variables in OpenMP, or thread-local storage in other frameworks.
Example in OpenMP:

#pragma omp parallel private(x)


{
int x = omp_get_thread_num(); // each thread has its own 'x'
printf("Thread %d has x = %d\n", omp_get_thread_num(), x);
}
Safe because no thread touches another’s variable.

2. Synchronization (When Sharing is Needed)


• Sometimes threads must share data (e.g., updating a counter, writing into a shared
array).
• Then you need synchronization to protect critical sections.

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 42


Module 4: Shared-memory programming with OpenMP

a) Locks / Critical Sections


Only one thread executes the code block at a time.

int counter = 0;

#pragma omp parallel


{
#pragma omp critical
{
counter++;
}
}

b) Atomic Operations
For very small updates (like increment/decrement), atomics are faster than locks.
int counter = 0;

#pragma omp parallel


{
#pragma omp atomic
counter++;
}

c) Reduction (Built-in Safe Accumulation)


If threads are all contributing to a sum, product, min/max, etc.,
OpenMP provides reduction to safely combine results.
int sum = 0;

#pragma omp parallel for reduction(+:sum)


for(int i=0; i<1000; i++)
{
sum += i;
}

No race conditions, OpenMP handles safety internally.

3. Padding & Alignment (Avoid False Sharing)

• Even if threads update different variables, they may be in the same cache line, causing
slowdown and inconsistencies.

Fix: padding

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 43


Module 4: Shared-memory programming with OpenMP

struct
{
int value;
char pad[64]; // separate cache lines
} counter[NUM_THREADS];

4. Message Passing (MPI Style)


Instead of sharing memory, each process/thread works on its own local data and exchanges
messages when needed.
No shared state → no race conditions.
Often used in MPI (distributed parallel computing).

Prof. Amarnath B Patil, Dept. of CSE(DS), SVIT Page 44

You might also like