0% found this document useful (0 votes)
5 views19 pages

OpenMP Shared-Memory Programming Guide

This document provides an overview of shared-memory programming using OpenMP, including its directives, common clauses, and the Trapezoidal Rule for numerical integration. It explains the scope of variables in OpenMP, the reduction clause for thread-safe operations, and issues like loop-carried dependencies that can affect parallelization. The document also includes examples and explanations of how to implement these concepts in C/C++.

Uploaded by

thrishaspam18
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views19 pages

OpenMP Shared-Memory Programming Guide

This document provides an overview of shared-memory programming using OpenMP, including its directives, common clauses, and the Trapezoidal Rule for numerical integration. It explains the scope of variables in OpenMP, the reduction clause for thread-safe operations, and issues like loop-carried dependencies that can affect parallelization. The document also includes examples and explanations of how to implement these concepts in C/C++.

Uploaded by

thrishaspam18
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

|| Jai Sri Gurudev||

Sri Adichunchanagiri Shikshana Trust(R)

SJB INSTITUTE OF TECHNOLOGY


BGS Health & Education City, Dr. Vishnuvardhan Road, Kengeri, Bengaluru–560060
An Autonomous Institute under Visvesvaraya Technological University, Belagavi
Affiliated to Visvesvaraya Technological University, Belagavi & Approved by AICTE, New Delhi, Certified by ISO 9001-2015
Accredited by NBA & NAAC, New Delhi with ‘A+’ Grade, Recognized by UGC, New Delhi with 2(f) and 12(B)

Notes On
Course Name: Parallel Computing

Course Code: B C S 7 0 2

Module – 4

By

Faculty Name: Dr Ranjith J , Prof. Kiran Kumar V , Prof. Veeresh K M


Semester: VIIth Semester

Department of Information Science & Engineering

Aca. Year ODD SEM /2025-26


Chapter 4

Shared-memory programming with OpenMP

4.1 openmp pragmas and directives


OpenMP (Open Multi-Processing) is an API that supports multi-platform shared-memory
multiprocessing programming in C, C++, and Fortran. It uses compiler directives (pragmas in
C/C++), runtime library routines, and environment variables to specify parallel regions and
control their execution.

Basic Structure in C/C++

OpenMP directives in C/C++ use #pragma omp followed by a directive and optional clauses:

CopyEdit
#pragma omp directive [clauses]

Common OpenMP Directives (Pragmas)

1. #pragma omp parallel

Creates a parallel region where multiple threads execute the enclosed block.

c
CopyEdit
#pragma omp parallel
{
printf("Hello from thread %d\n", omp_get_thread_num());
}

2. #pragma omp for

Distributes loop iterations among threads in the team.

c
CopyEdit
#pragma omp parallel for
for (int i = 0; i < N; i++) {
a[i] = b[i] + c[i];
}
3. #pragma omp sections

Splits code into sections, each executed by a different thread.

c
CopyEdit
#pragma omp parallel sections
{
#pragma omp section
function1();

#pragma omp section


function2();
}

4. #pragma omp single

Specifies a block that is executed by only one thread.

c
CopyEdit
#pragma omp single
{
printf("Executed by a single thread.\n");
}

5. #pragma omp master

Only the master thread executes this block (thread ID 0).

c
CopyEdit
#pragma omp master
{
printf("Master thread only.\n");
}

6. #pragma omp critical

Defines a critical section to avoid race conditions.

c
CopyEdit
#pragma omp critical
{
shared_var += local_var;
}

7. #pragma omp barrier

Explicit barrier synchronization — all threads wait here.

c
CopyEdit
#pragma omp barrier

8. #pragma omp atomic

Ensures that a memory update is atomic (faster than critical for simple operations).

c
CopyEdit
#pragma omp atomic
sum += x;

9. #pragma omp task

Defines a task to be executed by any thread in the team.

c
CopyEdit
#pragma omp task
process_data();

Common Clauses

These modify the behavior of directives:

 private(var) – Each thread has its own instance.


 shared(var) – Variable is shared across threads.
 firstprivate(var) – Like private, but initialized from original.
 lastprivate(var) – Updates the variable after last iteration.
 reduction(op:var) – Combines private copies into a shared one using op (e.g., +, *, min,
max).
 schedule(type, chunk) – Controls how loop iterations are divided.

4.2 Trapezoidal Rule


The Trapezoidal Rule is a numerical method used to approximate the definite integral of a
function. It is particularly useful when the actual integral is difficult or impossible to compute
analytically.

Definition

The Trapezoidal Rule approximates the area under a curve by dividing it into trapezoids rather
than rectangles (as in the Riemann sum). The formula for the Trapezoidal Rule over a single
interval [a,b][a, b][a,b] is:

∫abf(x) dx≈b−a2[f(a)+f(b)]\int_a^b f(x)\,dx \approx \frac{b - a}{2} \left[f(a) + f(b)\right]∫ab


f(x)dx≈2b−a[f(a)+f(b)]

This is the area of a trapezoid with height b−ab - ab−a and bases f(a)f(a)f(a) and f(b)f(b)f(b).

Composite Trapezoidal Rule

To get better accuracy, we divide the interval [a,b][a, b][a,b] into nnn subintervals of equal width
h=b−anh = \frac{b - a}{n}h=nb−a, and apply the rule to each:

∫abf(x) dx≈h2[f(x0)+2f(x1)+2f(x2)+⋯+2f(xn−1)+f(xn)]\int_a^b f(x)\,dx \approx \frac{h}{2}


\left[f(x_0) + 2f(x_1) + 2f(x_2) + \dots + 2f(x_{n-1}) + f(x_n)\right]∫abf(x)dx≈2h[f(x0)+2f(x1
)+2f(x2)+⋯+2f(xn−1)+f(xn)]

Where:

 x0=ax_0 = ax0=a, xn=bx_n = bxn=b


 xi=a+ihx_i = a + ihxi=a+ih, for i=1,2,...,n−1i = 1, 2, ..., n-1i=1,2,...,n−1

Example

Approximate ∫02x2 dx\int_0^2 x^2 \, dx∫02x2dx using the trapezoidal rule with n=2n = 2n=2:

1. f(x)=x2f(x) = x^2f(x)=x2, a=0a = 0a=0, b=2b = 2b=2, n=2n = 2n=2, so h=1h = 1h=1
2. Points: x0=0x_0 = 0x0=0, x1=1x_1 = 1x1=1, x2=2x_2 = 2x2=2
3. Function values: f(0)=0f(0) = 0f(0)=0, f(1)=1f(1) = 1f(1)=1, f(2)=4f(2) = 4f(2)=4

Now apply the formula:


∫02x2dx≈12[0+2(1)+4]=12[6]=3\int_0^2 x^2 dx \approx \frac{1}{2} [0 + 2(1) + 4] =
\frac{1}{2} [6] = 3∫02x2dx≈21[0+2(1)+4]=21[6]=3

Exact value: ∫02x2dx=83≈2.67\int_0^2 x^2 dx = \frac{8}{3} \approx 2.67∫02x2dx=38≈2.67, so


trapezoidal rule overestimates here.

Error Formula

The error in the composite trapezoidal rule is approximately:

ET=−(b−a)312n2f′′(ξ)E_T = -\frac{(b - a)^3}{12n^2} f''(\xi)ET=−12n2(b−a)3f′′(ξ)

for some ξ∈(a,b)\xi \in (a, b)ξ∈(a,b). The smaller nnn is, the larger the error — so increasing n
improves accuracy.

4.3 The Scope of Variables


shared-memory programming with OpenMP, the scope of variables (i.e., how and where
they are accessible and modified) is a crucial concept for writing correct and efficient parallel
programs.

Variables in OpenMP can have two main types of scope:

1. Shared Variables

 Definition: Variables that are accessible by all threads in a team.


 Behavior: All threads can read and write to the same memory location.
 Default: Most variables are shared by default unless declared otherwise.
 Example:

c
CopyEdit
int x = 0;
#pragma omp parallel
{
x += omp_get_thread_num(); // All threads update the same x
}

 Risk: Race conditions if multiple threads write to the same variable without
synchronization.
2. Private Variables

 Definition: Each thread has its own local instance of the variable.
 Behavior: Modifications by one thread are not seen by others.
 Declared Using: private(var) clause.
 Example:

CopyEdit
int x = 0;
#pragma omp parallel private(x)
{
x = omp_get_thread_num(); // Each thread has its own x
}

3. Firstprivate Variables

 Definition: A special case of private. The variable is initialized with the value from the
master thread.
 Declared Using: firstprivate(var)
 Example:

c
CopyEdit
int x = 42;
#pragma omp parallel firstprivate(x)
{
// Each thread gets x = 42 initially
}

4. Lastprivate Variables

 Definition: Another special case of private. The last iteration's value is copied back to the
original variable (typically in loops).
 Declared Using: lastprivate(var)
 Example:

c
CopyEdit
int x;
#pragma omp parallel for lastprivate(x)
for (int i = 0; i < 10; i++) {
x = i;
}
// After loop, x == 9
5. Threadprivate Variables

 Definition: Similar to private, but used for global/static variables across parallel regions.
 Declared Using: #pragma omp threadprivate(var)
 Use Case: When static or global variables should be private to each thread across
multiple parallel regions.

Variable Scope Defaults

 Inside a parallel region:


o Global variables → shared by default.
o Local variables (defined inside the parallel region) → private by default.
o Loop index variables in for loops → private by default.

In shared-memory programming with OpenMP, the reduction clause is used to perform


operations like summation, multiplication, minimum, maximum, etc., in a thread-safe manner
when multiple threads work on a shared variable.

4.4 Reduction Clause

4.4.1 What Problem Does reduction Solve?

When multiple threads try to update a shared variable, such as in a sum:

c
CopyEdit
sum += array[i];

this can lead to race conditions. The reduction clause avoids this by:

1. Giving each thread its own private copy of the variable.


2. Combining those copies using the specified operator at the end of the parallel region.

�Syntax of reduction in OpenMP (C/C++)

c
CopyEdit
#pragma omp parallel for reduction(operator : variable)
for (int i = 0; i < N; i++) {
sum += array[i];
}

�Example: Sum of an Array

c
CopyEdit
#include <omp.h>
#include <stdio.h>

int main() {
int N = 1000;
double sum = 0.0;
double array[1000];

// Initialize array
for (int i = 0; i < N; i++)
array[i] = 1.0;

#pragma omp parallel for reduction(+:sum)


for (int i = 0; i < N; i++) {
sum += array[i];
}

printf("Sum = %f\n", sum);


return 0;
}

 reduction(+:sum) tells OpenMP to:


o Give each thread a private copy of sum.
o Combine them using + after the loop.

4.4.2 Supported Operators in reduction


Operator Operation
+ Addition
* Multiplication
- Subtraction
& Bitwise AND
` `
^ Bitwise XOR
&& Logical AND
`
min Minimum
Operator Operation
max Maximum

Key Notes

 The variable must be scalar (not arrays).


 Each thread’s temporary result is combined at the end.
 More efficient than manually creating critical sections or atomic operations.

Let me know if you want an example in Fortran or want to go deeper into how reductions are
implemented internally!

In parallel computing, the reduction clause is a feature used to perform a reduction


operation—like summing, multiplying, finding the maximum or minimum—across multiple
threads or processes in a thread-safe manner.

What is a Reduction?

A reduction combines values from multiple threads into a single result using an associative and
commutative operation. Common reduction operations include:

 + (sum)
 * (product)
 min (minimum)
 max (maximum)
 &, |, ^ (bitwise operations)

In Practice: OpenMP Example

In OpenMP (a popular API for shared-memory parallelism in C/C++/Fortran), the reduction clause
looks like this:

c
CopyEdit
#include <omp.h>
#include <stdio.h>

int main() {
int sum = 0;

#pragma omp parallel for reduction(+:sum)


for (int i = 1; i <= 100; i++) {
sum += i;
}
printf("Sum = %d\n", sum);
return 0;
}

How It Works

 Each thread has a private copy of the sum variable.


 Threads accumulate results in their own copy.
 At the end of the loop, the private copies are combined (reduced) into the original
shared variable (sum) using the specified operation (+ here).

Benefits of Using Reduction Clause

 Automatic thread safety: Avoids race conditions.


 Performance: Enables parallelization of otherwise sequential accumulation operations.
 Clarity: Reduces the need for manual synchronization or locks.

Without Reduction

If you try to update a shared variable like sum += i; inside a parallel loop without a reduction or
synchronization mechanism, it may lead to race conditions and incorrect results

4.5 Loop Carried Dependency

In shared memory parallel programming, especially using OpenMP, a loop-carried


dependency is a situation where an iteration of a loop depends on the result of a previous
iteration. This can hinder or prevent parallelization, because iterations are no longer
independent and cannot safely run concurrently.

What is Loop-Carried Dependency?

A loop-carried dependency exists when the computation in one loop iteration depends on data
from a previous iteration.

❌Example of Loop-Carried Dependency:


c
CopyEdit
for (int i = 1; i < N; i++) {
A[i] = A[i-1] + 1;
}

 A[i]depends on A[i-1], so each iteration must wait for the previous one.
 Parallelization with OpenMP would lead to incorrect results, as iterations would run
out of order.
Example Without Loop-Carried Dependency:
c
CopyEdit
#pragma omp parallel for
for (int i = 0; i < N; i++) {
B[i] = A[i] + 1;
}

 Here, each iteration is independent of others.


 Safe to parallelize using #pragma omp parallel for.

How to Handle Loop-Carried Dependencies in OpenMP?

1. Restructure the Algorithm: Try to eliminate the dependency.


2. Use OpenMP Tasks: If dependency is complex but can be expressed in a directed
acyclic graph.
3. Prefix Sums (Scan): For certain types of loop-carried dependencies, algorithms like
prefix sum can help.

Example: Parallel Prefix Sum (to eliminate loop-carried dependency)

This is a classic method for handling dependencies like cumulative sums.

Bad: Loop-Carried Dependency


c
CopyEdit
int sum = 0;
for (int i = 0; i < N; i++) {
sum += A[i];
B[i] = sum;
}

� Good: Use Inclusive Prefix Sum (Scan) Algorithm

You can use an algorithm that breaks this into log(N) steps using OpenMP tasks or tree-based
reductions.
4.6 OpenMP Reduction (when dependency is an accumulation)
Sometimes loop-carried dependencies are reductions and can be parallelized:

c
CopyEdit
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += A[i];
}

 OpenMP handles dependencies internally by combining partial results from threads.

Detecting Loop-Carried Dependencies

Look for:

 Variables updated in each loop iteration.


 Use of previous iteration’s result (A[i] = A[i-1] + ...).
 Accumulations or linked updates.

In shared memory programming using OpenMP, scheduling, producers, and consumers are
key concepts that relate to task coordination and parallel performance. Here’s a breakdown of
each:

1. Scheduling in OpenMP

Scheduling in OpenMP refers to how loop iterations are divided among threads in a parallel for
construct.

Types of Scheduling:

OpenMP offers several scheduling strategies that affect performance:

 static: Divides the iterations into equal-sized chunks, assigned to threads at compile time.

c
CopyEdit
#pragma omp for schedule(static)

 dynamic: Assigns a chunk of iterations to threads dynamically as they become available.


c
CopyEdit
#pragma omp for schedule(dynamic, chunk_size)

 guided: Similar to dynamic, but chunk sizes decrease over time to balance load better.

c
CopyEdit
#pragma omp for schedule(guided)

 auto: The compiler/runtime decides the best schedule.

c
CopyEdit
#pragma omp for schedule(auto)

 runtime: The schedule is set at runtime using the OMP_SCHEDULE environment variable.

c
CopyEdit
#pragma omp for schedule(runtime)

2. Producer-Consumer Pattern

In shared memory systems (like OpenMP), the Producer-Consumer pattern involves


coordination between threads: some threads produce data, others consume it.

Basic Scenario:

 Producer thread(s) generate data and place it in a shared buffer.


 Consumer thread(s) read from the buffer and process the data.
 Requires synchronization to avoid race conditions.

Example Using OpenMP:

Here's a simplified example with manual synchronization using #pragma omp critical.

c
CopyEdit
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10
int buffer[N];
int count = 0;

void producer() {
for (int i = 0; i < N; i++) {
#pragma omp critical
{
buffer[i] = i * 10;
printf("Produced: %d\n", buffer[i]);
count++;
}
}
}

void consumer() {
int local_count = 0;
while (local_count < N) {
#pragma omp critical
{
if (local_count < count) {
printf("Consumed: %d\n", buffer[local_count]);
local_count++;
}
}
}
}

int main() {
#pragma omp parallel sections
{
#pragma omp section
producer();

#pragma omp section


consumer();
}
return 0;
}

In OpenMP (and multithreaded programming in general), performance can be significantly


influenced by how the system handles caches, cache coherence, and false sharing. Let’s break
each of these down in the context of OpenMP.

4.7 Caches in OpenMP


Modern CPUs have a hierarchy of caches (L1, L2, L3) to reduce the latency of memory access.
Each CPU core typically has its own L1 and L2 cache, and shares the L3 cache with other cores.

 When OpenMP creates multiple threads, they often run on different cores.
 If multiple threads access the same memory, it may lead to cache misses or delays due to
coherence enforcement.
 To optimize performance, each thread should ideally access separate memory locations
to benefit from cache locality.
2. Cache Coherence

Cache coherence ensures that all CPUs see the same value for a shared variable.

 When a thread updates a variable in its cache, the system must propagate this change to
other caches or invalidate the outdated copies.
 Coherence protocols (e.g. MESI: Modified, Exclusive, Shared, Invalid) manage this.
 In OpenMP, if multiple threads write to the same variable (or close-by variables in the
same cache line), the cache coherence traffic increases.

Example:

c
CopyEdit
#pragma omp parallel for
for (int i = 0; i < N; i++) {
A[i] = A[i] * 2;
}

If A[i] is thread-private (each thread accesses different indices), cache coherence is not an issue.
But if they access overlapping data, coherence traffic increases.

3. False Sharing

False sharing happens when threads access different variables that reside in the same cache
line.

 Even though the variables are independent, a write by one thread invalidates the cache
line, forcing other threads to reload it, causing performance degradation.
 A cache line is usually 64 bytes. So two int or float values in the same cache line can be
falsely shared.

�Example of False Sharing:

c
CopyEdit
int A[NUM_THREADS];

#pragma omp parallel num_threads(NUM_THREADS)


{
int tid = omp_get_thread_num();
for (int i = 0; i < 1000000; ++i)
A[tid] += 1;
}

Even though A[0], A[1], etc. are separate elements, they may share the same cache line.
�Solution: Padding or using #pragma omp declare simd or compiler-specific attributes.

c
CopyEdit
typedef struct {
int value;
char padding[60]; // to avoid false sharing (assuming 64-byte cache line)
} PaddedInt;

PaddedInt A[NUM_THREADS];

4.8 OpenMP Tasking

� What is Tasking?

OpenMP tasking allows you to parallelize irregular or recursive workloads by defining tasks
— units of work that can be executed by threads independently.

Syntax:
c
CopyEdit
#pragma omp parallel
{
#pragma omp single
{
for (int i = 0; i < N; ++i) {
#pragma omp task
do_work(i);
}
}
}

Use Cases:

 Recursive algorithms (e.g., quicksort, Fibonacci)


 Dynamic or unbalanced workloads
 DAG (Directed Acyclic Graph) based workflows

Task Dependencies:

To express dependencies:

c
CopyEdit
#pragma omp task depend(in:a) depend(out:b)

Thread Safety in OpenMP


What is Thread Safety?

Thread-safe code ensures that shared resources (e.g., variables, arrays, files) are accessed in a
controlled manner to prevent data races and undefined behavior.

�Common ways to ensure thread safety:

1. Use #pragma omp critical sections:

c
CopyEdit
#pragma omp critical
{
shared_resource++;
}

2. Use atomic operations (faster than critical):

c
CopyEdit
#pragma omp atomic
shared_resource++;

3. Avoid shared variables or use private, firstprivate, lastprivate:

c
CopyEdit
#pragma omp parallel for private(i)

4. Use reduction for aggregations (like sum):

c
CopyEdit
#pragma omp parallel for reduction(+:sum)

5. Avoid false sharing (each thread modifying separate cache lines).

Putting It All Together

✔ Example: Thread-safe Tasking

c
CopyEdit
#pragma omp parallel
{
#pragma omp single
{
for (int i = 0; i < 10; ++i) {
#pragma omp task
{
int local = i * i; // thread-safe
#pragma omp critical
{
shared_array[i] = local;
}
}
}
}
}

***************************END**********************************************

You might also like