0% found this document useful (0 votes)
4 views43 pages

Lecture 10

The document provides an introduction to distributed and parallel computing, focusing on OpenMP, an API for shared-memory parallel programming in C, C++, and Fortran. It explains key features, execution models, and practical examples of using OpenMP for parallelizing code, including directives, thread management, and scheduling policies. Additionally, it discusses the importance of avoiding race conditions when multiple threads access shared data.

Uploaded by

deep03rao
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views43 pages

Lecture 10

The document provides an introduction to distributed and parallel computing, focusing on OpenMP, an API for shared-memory parallel programming in C, C++, and Fortran. It explains key features, execution models, and practical examples of using OpenMP for parallelizing code, including directives, thread management, and scheduling policies. Additionally, it discusses the importance of avoiding race conditions when multiple threads access shared data.

Uploaded by

deep03rao
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Introduction to Distributed and Parallel

Computing
CS-302

Dr. Sanjay Saxena


Assistant Professor, CSE, IIIT Vadodara
Post doc – University of Pennsylvania, USA
PhD – Indian Institute of Technology(BHU), Varanasi
1
Parallel Computing/Parallel Programming

• OpenMP → Shared memory (one machine, many cores)

• MPI → Distributed memory (many machines, connected via network)

• CUDA → GPU computing (thousands of lightweight cores)

2
What is OpenMP?
• OpenMP (Open Multi-Processing) is an API that supports multi-platform shared-memory parallel
programming in C, C++, and Fortran.

• It allows developers to parallelize code easily by adding simple compiler directives (like #pragma
omp).
Key Features
API → It’s not a new language; it’s a set of functions, compiler directives, and environment variables.
Directives → Written as #pragma omp ... in C/C++, Example: #pragma omp parallel
Supported Languages → C, C++, and Fortran (standard in scientific computing).
Execution Model →
1. Follows a fork-join model:
1. Master thread forks worker threads to do tasks in parallel.
2. They later join back at synchronization points.
Portability → Works across multiple platforms (Linux, Windows, macOS) with compilers like GCC, Clang. 3
Example
Why it is Popular
• Ease of use: Add pragmas to existing
#include <stdio.h>
sequential code.
#include <omp.h>
• Minimal code changes: Same program can run
serially or in parallel.
int main() {
• Great for shared-memory systems: Fits
#pragma omp parallel
laptops, desktops, and multi-core servers.
{
• Widely used in scientific & engineering
printf("Hello from thread %d out of %d\n",
applications: simulations, image processing,
omp_get_thread_num(), omp_get_num_threads());
machine learning preprocessing.
}
return 0;
}

With just one directive (#pragma omp parallel), the program runs in parallel.
4
OpenMP execution model
1. Master Thread
•The program always starts with one thread (the master).
•Master thread executes the code sequentially (like normal C/C++).
•When a parallel region is encountered (#pragma omp parallel), the
master forks new worker threads.

2. Worker Threads
•Created when the master enters a parallel region.
•Each worker executes the parallelized portion of the code.
•All threads run concurrently and may handle different parts of the data or tasks.

3. Fork-Join Model
[Link] → Master spawns a team of worker threads.
[Link] → Each thread does its assigned task.
[Link] → Workers finish and master continues execution.
5
Contd..

6
7
If we run a program on a machine with 8 CPU cores, will it
always create exactly 8 threads? Why or why not?

Why does the order of printed lines change every time we


run this program, even though the code is the same?

8
Example Execution Flow
#include <stdio.h>
#include <omp.h> What happens here?

int main() { • First line runs only by master thread.

printf("Program starts with Master thread only.\n"); • When #pragma omp parallel is reached →
threads fork.
#pragma omp parallel
• Each thread (master + workers) prints “Hello
{
from thread …”.
int tid = omp_get_thread_num();
• After parallel region, all workers join back →
int total = omp_get_num_threads();
only master continues.
printf("Hello from thread %d out of %d\n", tid, total);
}
printf("Back to Master thread only.\n");
return 0;
} 9
Syntax overview: #pragma omp ...
OpenMP in C/C++ uses compiler directives that begin with:
#pragma omp <directive> [clauses]
#pragma → tells compiler it’s a special instruction.
omp → specifies it’s an OpenMP directive.
<directive> → what kind of parallel action to take.
[clauses] → extra options to control behavior (optional).

General Structure
#pragma omp parallel [clauses]
{
// Code block to run in parallel
}
parallel → keyword meaning "run this block with multiple threads".
The block { ... } is executed by all threads.
Without clauses, the number of threads depends on system or OMP_NUM_THREADS env variable.

10
Parallel Region
#pragma omp parallel
{ Sections
#pragma omp parallel sections
printf("Thread ID: %d\n", omp_get_thread_num());
{
}
#pragma omp section
Creates multiple threads, each executes the block.
{ task1(); }

for / parallel for #pragma omp section


{ task2(); }
#pragma omp parallel for
}
for(int i=0; i<100; i++) {
a[i] = i*i;
}

11
Fibonacci, Factorial, and Prime – Using sections
#include <stdio.h> int main() {
#include <omp.h>
#pragma omp parallel sections
// Fibonacci {
int fibonacci(int n) { #pragma omp section
if (n <= 1) return n; {
return fibonacci(n-1) + fibonacci(n-2); printf("Thread %d: Fibonacci(10) = %d\n",
} omp_get_thread_num(), fibonacci(10));
}
// Factorial
int factorial(int n) { #pragma omp section
int fact = 1; {
for(int i = 1; i <= n; i++) printf("Thread %d: Factorial(5) = %d\n",
fact *= i; omp_get_thread_num(), factorial(5));
return fact; }
}
#pragma omp section
// Prime check {
int isPrime(int n) { printf("Thread %d: Prime check (29) = %s\n",
if(n <= 1) return 0; omp_get_thread_num(),
for(int i = 2; i*i <= n; i++) isPrime(29) ? "Prime" : "Not Prime");
if(n % i == 0) return 0; }
return 1; } 12
} return 0;
Sum of Array, Maximum Element, and Average (Independent Tasks)
#pragma omp section
#include <stdio.h> {
#include <omp.h> for(int i = 1; i < 10; i++)
if(a[i] > max)
int main() { max = a[i];
int a[10] = {2, 5, 8, 3, 10, 6, 4, 9, 1, 7}; printf("Thread %d computed Max = %d\n",
int sum = 0, max = a[0]; omp_get_thread_num(), max);
double avg; }

#pragma omp parallel sections #pragma omp section


{ {
#pragma omp section for(int i = 0; i < 10; i++)
{ avg += a[i];
for(int i = 0; i < 10; i++) avg = avg / 10;
sum += a[i]; printf("Thread %d computed Average = %.2f\n",
printf("Thread %d computed Sum = %d\n", omp_get_thread_num(), avg);
omp_get_thread_num(), sum); }
} }

return 0;
13
}
Each thread will execute single task(section)?

Answer - Not necessarily.

14
How OpenMP Handles parallel sections
•OpenMP creates a team of threads Case 2: Threads < Sections
•Each section is treated as an independent task Example:
3 sections, 2 threads
•Tasks are assigned to available threads Thread 0 → Section 1, then Section 3
So: Thread 1 → Section 2
•One section → exactly one thread Case 3: Threads = Sections
•One thread → may run one or multiple sections Example:
3 sections, 3 threads
Thread 0 → Section 1
Case 1: Threads ≥ Sections Thread 1 → Section 2
Example: Thread 2 → Section 3
3 sections, 4 threads

Thread 0 → Section 1
Thread 1 → Section 2
It is NOT fixed (not final).
Thread 2 → Section 3
It works dynamically based on thread availability.
Thread 3 → Idle

15
If four threads are created, will each thread receive the same set of
iterations?

16
Example Putting It Together
Useful Clauses #include <stdio.h>
• num_threads(n) → set number of threads. #include <omp.h>
• #pragma omp parallel num_threads(4) int main() {

• private(var) → each thread gets its own copy of int sum = 0;


variable. #pragma omp parallel for reduction(+:sum) num_threads(4)
• shared(var) → variable is shared across all threads. for(int i=1; i<=100; i++) {

• reduction(op:var) → safely combine results (like sum += i;


sum, min, max). }

#pragma omp parallel for reduction(+:sum) printf("Sum = %d\n", sum);


return 0;
reduction(+:sum)
}
reduction(*:prod)
reduction(max:maxval)
reduction(min:minval)

17
Explanation of output Possible Outputs
Run with 4 threads (export OMP_NUM_THREADS=4):
#include <stdio.h> Output (Run 1):
#include <omp.h> Hello from thread 0 out of 4 threads
Hello from thread 2 out of 4 threads
int main() { Hello from thread 1 out of 4 threads
#pragma omp parallel Hello from thread 3 out of 4 threads
Output (Run 2):
{
Hello from thread 1 out of 4 threads
int tid = omp_get_thread_num(); Hello from thread 3 out of 4 threads
int nthreads = omp_get_num_threads(); Hello from thread 0 out of 4 threads
Hello from thread 2 out of 4 threads
printf("Hello from thread %d out of %d
threads\n", tid, nthreads);
}
return 0;
}

18
Functions: omp_get_thread_num(), omp_get_num_threads()
1. omp_get_thread_num()
Definition: Returns the ID number of the thread within the parallel region.
Range: Always 0 to N-1 (where N = total number of threads).
Thread 0 is the master thread.
Usage: Identify which thread is doing the work.

2. omp_get_num_threads()
•Definition: Returns the total number of threads currently running in the parallel region.
•If called outside a parallel region → it will return 1 (only master thread is active).
•Usage: Know how many threads are participating.

19
Together
#include <stdio.h>
#include <omp.h>
int main() {
#pragma omp parallel
{
int tid = omp_get_thread_num(); // thread ID
int total = omp_get_num_threads(); // total threads
printf("Hello from thread %d out of %d\n", tid, total);
}
return 0;
}

20
for loop parallelization
Why Parallelize Loops?
• Many scientific/engineering programs spend most of their time in loops.
• If each iteration is independent, multiple threads can work on different iterations at the same time.
• OpenMP makes this very simple using:
#pragma omp parallel for
for (int i = 0; i < N; i++) {
// loop body (must be independent per iteration)
}

Each iteration of the loop is assigned to a thread.


Threads execute different parts of the loop concurrently.
At the end, there is an implicit barrier (all threads must finish before continuing).

21
Vector Addition
#include <stdio.h>
#include <omp.h>
int main() { Thread 0 computed c[0] = 0
int N = 10; Thread 1 computed c[1] = 3
int a[10], b[10], c[10]; Thread 3 computed c[3] = 9
// initialize arrays Thread 2 computed c[2] = 6
for (int i = 0; i < N; i++) { Thread 1 computed c[5] = 15
a[i] = i; Thread 0 computed c[4] = 12
b[i] = 2*i; ...
}
// parallel for loop
#pragma omp parallel for
for (int i = 0; i < N; i++) {
c[i] = a[i] + b[i];
printf("Thread %d computed c[%d] = %d\n",omp_get_thread_num(), i, c[i]);
}
return 0;
}

22
Scheduling policies: static, dynamic, guided
When you parallelize a loop using: #pragma omp parallel for
• OpenMP must decide which thread executes which iterations. This distribution is called a scheduling policy.
• OpenMP provides three main scheduling options: static, dynamic, and guided.

1. Static Scheduling
•How it works:
• The loop iterations are divided evenly into contiguous chunks before execution starts.
• Each thread gets a fixed set of iterations.
•Good for:
• Iterations that take equal time (uniform workload).
• Low scheduling overhead (assignments decided in advance).
•Directive: #pragma omp parallel for schedule(static, chunk_size)
If chunk_size is not specified → iterations are divided evenly among threads.
If specified → each thread gets chunks of that size.
23
Example
#pragma omp parallel for schedule(static, 2)
for (int i = 0; i < 10; i++) {
printf("Thread %d does iteration %d\n",
omp_get_thread_num(), i);
}

•With 4 threads and chunk_size=2:


• Thread 0 → iterations 0,1
• Thread 1 → iterations 2,3
• Thread 2 → iterations 4,5
• Thread 3 → iterations 6,7
• Remaining distributed similarly

24
2. Dynamic Scheduling
How it works:
•Iterations are given to threads in chunks at runtime.
•When a thread finishes its chunk, it requests the next available chunk.
Good for:
•Iterations that take different amounts of time.
•Helps balance workload across threads.
•Has higher overhead (threads ask for new work dynamically).
Directive: #pragma omp parallel for schedule(dynamic, chunk_size)
Default chunk size = 1 (each iteration assigned separately).
Example:
#pragma omp parallel for schedule(dynamic, 2)
for (int i = 0; i < 10; i++) {
printf("Thread %d does iteration %d\n",
omp_get_thread_num(), i);
}

Execution order will vary → whichever thread finishes first grabs the next chunk.

25
3. Guided Scheduling
How it works:
•Iterations are assigned in decreasing chunk sizes.
•Large chunks are assigned first, then smaller chunks later.
•This reduces overhead compared to dynamic (fewer assignments early).
Good for:
•Iterations with highly variable workload.
•Situations where early chunks can keep all threads busy, and later chunks are finer for balancing.
Directive: #pragma omp parallel for schedule(guided, chunk_size)
•Default chunk size = 1 (minimum chunk).
Example:
#pragma omp parallel for schedule(guided, 2)
for (int i = 0; i < 20; i++) {
printf("Thread %d does iteration %d\n", omp_get_thread_num(), i);
}
Distribution:
•First → large chunks (e.g., 10 iterations).
•Then → smaller chunks (e.g., 5, 3, 2…).
•Finally → chunk size not smaller than the specified minimum.
26
Comparison

Policy Assignment Overhead Best For


Static Pre-determined, equal blocks Very low Uniform workload
Dynamic Assigned on-the-fly Higher Irregular workload
Guided Large chunks first, smaller later Medium Highly variable workload

27
Race conditions (when multiple threads update shared data)
•A race condition occurs when two or more threads access and update shared data at the same time, and the final result
depends on the order of execution.
•Because threads run concurrently, the execution order is not guaranteed, so results can vary between runs.

#include <stdio.h> Final sum = 500500 // (1+2+...+1000)


#include <omp.h>

int main() {
int sum = 0;
Actual Results (varies):
#pragma omp parallel for Final sum = 471202
for (int i = 1; i <= 1000; i++) { Final sum = 498761
sum = sum + i; // shared variable! Final sum = 500500
}
printf("Final sum = %d\n", sum);
return 0;
}
28
Why Does This Happen?
•sum = sum + i; looks simple, but it’s not atomic.
•It expands into three steps:
• Load sum into register.
• Add i.
• Store result back to sum.
•If two threads execute these steps at the same time, they may overwrite each other’s updates.
•Result → lost updates → incorrect final sum.

29
Ways to Fix Race Conditions in OpenMP

• 1. Critical Section 2. Atomic Operation


#pragma omp parallel for
#pragma omp parallel for
for (int i = 1; i <= 1000; i++) {
for (int i = 1; i <= 1000; i++) {
#pragma omp atomic
#pragma omp critical
sum = sum + i;
{
}
sum = sum + i;
} Ensures the update to sum is performed atomically
} (without interruption).
Only one thread at a time can execute inside #pragma omp Faster than critical for simple updates.
critical.
Slower, but guarantees correctness.
30
3. Reduction Clause (Best for Summation)
int sum = 0;

#pragma omp parallel for reduction(+:sum)

for (int i = 1; i <= 1000; i++) {

sum = sum + i;

• Each thread keeps a private copy of sum.

• At the end, OpenMP automatically reduces (adds) all partial sums into the global sum.

• Efficient and clean.

31
Sections construct (different tasks in parallel)
•Used when threads should perform different tasks (work-sharing).
•Unlike parallel for, which splits iterations of one loop, sections assigns different code blocks to different threads.
•Each section is executed by only one thread.
•All sections inside a sections block are executed in parallel.

#pragma omp parallel sections


{
#pragma omp section
{
// Task 1
}
#pragma omp section
{
// Task 2
}
#pragma omp section
{
// Task 3
}
} 32
#include <stdio.h>
#include <omp.h>
Contd.. int main() {
#pragma omp parallel sections
{
#pragma omp section
{
printf("Task 1 executed by thread %d\n", omp_get_thread_num());
}

#pragma omp section


{
printf("Task 2 executed by thread %d\n", omp_get_thread_num());
}

#pragma omp section


{
printf("Task 3 executed by thread %d\n", omp_get_thread_num());
}
}
return 0;
} 33
Nested parallelism (parallel inside parallel)
In many OpenMP implementations, nested parallelism is disabled by default (inner parallel regions collapse into
serial execution) #include <stdio.h>
#include <omp.h>
omp_set_nested(1); // C function int main() {
omp_set_nested(1); // enable nested parallelism
Nested Parallel Regions
#pragma omp parallel num_threads(2)
{
int outer_tid = omp_get_thread_num();
printf("Outer thread %d\n", outer_tid);
Outer thread 0
Inner thread 0 (inside outer thread 0) #pragma omp parallel num_threads(3)
Inner thread 1 (inside outer thread 0) {
Inner thread 2 (inside outer thread 0) int inner_tid = omp_get_thread_num();
Outer thread 1 printf(" Inner thread %d (inside outer thread %d)\n",
Inner thread 0 (inside outer thread 1) inner_tid, outer_tid);
Inner thread 1 (inside outer thread 1) }
Inner thread 2 (inside outer thread 1) }
return 0; } 34
Collapse clause (nested loops parallelized)
• Used with nested loops inside a parallel for.
• It merges multiple loops into a single large iteration space.
• OpenMP then distributes these merged iterations across threads.
• This often improves load balancing and parallel efficiency.

#pragma omp parallel for collapse(n)


for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
// loop body
}
}

Here collapse(2) tells OpenMP:


•Combine both loops (i and j) into one big loop with N*M iterations.
•Distribute those iterations among threads.

35
Example 1: Without Collapse Example 2: With Collapse
#pragma omp parallel for #pragma omp parallel for collapse(2)

for (int i = 0; i < 4; i++) { for (int i = 0; i < 4; i++) {

for (int j = 0; j < 4; j++) { for (int j = 0; j < 4; j++) {

printf("Thread %d does iteration (%d,%d)\n", printf("Thread %d does iteration (%d,%d)\n",

omp_get_thread_num(), i, j); omp_get_thread_num(), i, j);

} }

} }

Now both loops are collapsed into 16 iterations total.


OpenMP parallelizes only the outer loop (i).
OpenMP distributes (i,j) pairs across all threads.
With 4 threads → each thread gets 1 full row of j
Much better load balancing since threads get mixed work.
iterations.
If workload inside inner loop varies, some threads may
finish earlier
36
Output Difference (with 4 threads)

Without Collapse: With Collapse:


Thread 0: (0,0) (0,1) (0,2) (0,3) Thread 0: (0,0) (0,1) (1,0) (1,1)
Thread 1: (1,0) (1,1) (1,2) (1,3) Thread 1: (0,2) (0,3) (1,2) (1,3)
Thread 2: (2,0) (2,1) (2,2) (2,3) Thread 2: (2,0) (2,1) (3,0) (3,1)
Thread 3: (3,0) (3,1) (3,2) (3,3) Thread 3: (2,2) (2,3) (3,2) (3,3)

37
Barrier, Master, Single directives (synchronization control)
Synchronization is the backbone of safe parallel programming.

#pragma omp parallel


{ What it does:
int tid = omp_get_thread_num(); •All threads stop here and wait.
printf("Thread %d reached step 1\n", tid);
•Execution resumes only when all threads reach the barrier.
#pragma omp barrier // all threads must arrive here Use case: To ensure all threads finish a phase before moving to

printf("Thread %d reached step 2\n", tid); the next.


}

38
Master

What it does:
Only the master thread executes this block.
Other threads skip it.
No implicit barrier at the end (different from single).
Use case: To perform tasks only once (like I/O, printing, initializing).

#pragma omp parallel


{
#pragma omp master
{
printf("This is executed only by master thread\n");
}
}

39
Single Directive
#pragma omp parallel
{
#pragma omp single
{
printf("This is executed by one thread only\n");
}
}
• What it does:
• Exactly one thread executes this block.
• Other threads skip it but wait at the implicit barrier (unless nowait is used).
• Use case: To ensure only one thread performs some task, but not necessarily the master.

40
Serial vs Parallel execution
Parallel Execution
Serial Execution
•Program runs using multiple threads/cores simultaneously.
•Program runs on a single thread (CPU core).
•Work is divided among threads, which execute concurrently.
•Each task is executed step by step, sequentially.
•Aim: reduce execution time and process larger datasets.
•Limitation: Only one task at a time → slow for large problems.
sum = 0; sum = 0;
for (i = 0; i < N; i++) { #pragma omp parallel for reduction(+:sum)
sum += a[i]; for (i = 0; i < N; i++) {
} sum += a[i];
}

Definition:
Speedup = (Execution time of Serial) ÷ (Execution time of Parallel)
Ideal Speedup:
•With P processors → maximum theoretical speedup = P.
•Example: If 4 processors → we hope to get ~4× faster.
Reality (Amdahl’s Law):
•Some part of the program is inherently serial (can’t be parallelized).
•Synchronization, communication, and overhead reduce actual speedup.
•Thus, speedup curve flattens as processors increase.
41
Assignment

• Parallelize sum of array


• Parallelize sorting
• Compare execution with different threads

42
Thank
You

Small aim is a crime; have great aim.


Bharat-Ratan A. P. J. Abdul Kalam 43

You might also like