OpenMP Shared-Memory Programming
Slide 1: Title Slide
OpenMP Shared-Memory Programming Parallel Programming with
OpenMP
Learning Objectives:
• Understand OpenMP programming model and execution
• Master pragma directives and their applications
• Implement parallel algorithms effectively
• Handle variable scoping and data sharing
• Optimize performance through proper scheduling
• Manage synchronization and avoid race conditions
• Understand cache effects and memory hierarchy
• Implement task-based parallelism
Prerequisites: Basic C/C++, multi-threading concepts, computer architecture
Introduction to OpenMP
What is OpenMP? OpenMP (Open Multi-Processing) is an API for shared-
memory parallel programming using compiler directives, library routines, and
environment variables.
Key Features:
• Pragma-based: Uses compiler directives to parallelize code
• Fork-Join Model: Master thread creates worker threads as needed
• Shared Memory: All threads can access the same memory space
• Incremental Parallelization: Add parallelism gradually to existing code
• Portable: Supported by most modern compilers
Compilation:
// Compile with OpenMP support
gcc -fopenmp program.c -o program
icc -qopenmp program.c -o program
clang -fopenmp program.c -o program
// Set number of threads
export OMP_NUM_THREADS=4
./program
Applications: Scientific computing, image processing, financial modeling,
weather simulation, machine learning
OpenMP Pragmas and Directives
Pragma Syntax:
#pragma omp directive [clause [clause]...]
structured-block
Basic Example:
#include <omp.h>
#include <stdio.h>
int main() {
printf("Before parallel region\n");
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
int total_threads = omp_get_num_threads();
printf("Hello from thread %d of %d\n", thread_id, total_threads);
}
printf("After parallel region\n");
return 0;
}
Execution Model:
1. Program starts with master thread (thread 0)
2. Master creates team of threads at parallel region
3. All threads execute structured block in parallel
4. Threads synchronize at end, all except master terminate
5. Execution continues serially with master thread
Common Functions:
int omp_get_num_threads(); // Get number of threads
int omp_get_thread_num(); // Get current thread ID
void omp_set_num_threads(int); // Set number of threads
double omp_get_wtime(); // Get wall clock time
Trapezoidal Rule - Theory
Numerical Integration Concept: The trapezoidal rule approximates definite
integrals by dividing the area under a curve into trapezoids.
Formula: ∫[a to b] f(x) dx ≈ (h/2)[f(a) + 2∑f(xi) + f(b)] where h = (b-a)/n and xi
= a + i*h
Sequential Implementation:
c
double f(double x) {
return x * x; // f(x) = x²
}
double sequential_trap(double a, double b, int n) {
double h = (b - a) / n;
double integral = (f(a) + f(b)) / 2.0;
for (int i = 1; i <= n-1; i++) {
double x_i = a + i * h;
integral += f(x_i);
}
return integral * h;
}
Analysis:
• Time Complexity: O(n)
• Accuracy: Error decreases as O(h²)
• Parallelization Potential: Each iteration is independent
Trapezoidal Rule - Parallel Implementation
Parallel Strategy: Distribute computation of trapezoids among threads, then
combine results.
Method 1: Using reduction
double parallel_trap_v1(double a, double b, int n) {
double h = (b - a) / n;
double integral = 0.0;
#pragma omp parallel for reduction(+:integral)
for (int i = 0; i < n; i++) {
double x_i = a + (i + 0.5) * h;
integral += f(x_i);
}
return integral * h;
}
Method 2: Manual distribution
double parallel_trap_v2(double a, double b, int n) {
double h = (b - a) / n;
double global_integral = 0.0;
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int local_n = n / num_threads;
int start = thread_id * local_n;
int end = (thread_id == num_threads - 1) ? n : start + local_n;
double local_integral = 0.0;
for (int i = start; i < end; i++) {
double x_i = a + (i + 0.5) * h;
local_integral += f(x_i);
}
#pragma omp critical
{
global_integral += local_integral;
}
}
return global_integral * h;
}
Performance Tip: Method 1 with reduction is more efficient
Variable Scoping
Variable Scoping Rules: Determines how variables are shared or distributed
among threads.
Default Rules:
• Shared: Variables declared outside parallel region
• Private: Variables declared inside parallel region
• Private: Loop index variables in parallel for loops
Scoping Clauses:
Clause Behavior Use Case
All threads access same
shared(var) Data accessed by all threads
memory
Each thread gets uninitialized
private(var) Temporary variables
copy
Private copy initialized from
firstprivate(var) Variables needing initial values
master
Private copy, final value copied Variables whose final value
lastprivate(var)
back needed
Example:
int main() {
int a = 10, b = 20, c = 30, d = 40;
#pragma omp parallel for private(a) firstprivate(b) lastprivate(c) shared(d)
for (int i = 0; i < 4; i++) {
int thread_id = omp_get_thread_num();
a = thread_id; // private (uninitialized)
b += thread_id; // firstprivate (initialized to 20)
c = thread_id * 100; // lastprivate (copied back)
// d is shared by all threads
}
return 0;
}
Reduction Clause
What is Reduction? Common pattern where multiple threads contribute to
single result through commutative and associative operation.
How It Works:
1. Each thread gets private copy of reduction variable
2. Private copies initialized based on operation (0 for +, 1 for *)
3. Each thread operates on private copy
4. All private copies combined using specified operation
5. Final result stored in original variable
Supported Operations:
c
// Arithmetic
double sum = 0.0;
int product = 1;
#pragma omp parallel for reduction(+:sum) reduction(*:product)
for (int i = 0; i < n; i++) {
sum += array[i];
product *= array[i];
}
// Min/Max
double min_val = 1e9, max_val = -1e9;
#pragma omp parallel for reduction(min:min_val) reduction(max:max_val)
for (int i = 0; i < n; i++) {
if (array[i] < min_val) min_val = array[i];
if (array[i] > max_val) max_val = array[i];
}
// Logical
int all_positive = 1;
#pragma omp parallel for reduction(&&:all_positive)
for (int i = 0; i < n; i++) {
all_positive = all_positive && (array[i] > 0);
}
Practical Examples:
• Dot product of vectors
• Counting elements satisfying condition
• Computing vector norm
Loop Carried Dependencies
Definition: Loop-carried dependency exists when an iteration depends on
results of previous iteration. These loops cannot be parallelized directly.
Types of Dependencies:
• True Dependency (RAW): Read After Write
• Anti Dependency (WAR): Write After Read
• Output Dependency (WAW): Write After Write
Problematic Examples:
c
// Cannot parallelize - has dependencies
for (int i = 1; i < n; i++) {
a[i] = a[i-1] + b[i]; // Depends on previous iteration
}
// Fibonacci - classic dependency
fib[0] = 0; fib[1] = 1;
for (int i = 2; i < n; i++) {
fib[i] = fib[i-1] + fib[i-2]; // Cannot parallelize
}
Safe Examples:
c
// Can parallelize - no dependencies
#pragma omp parallel for
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i]; // Independent iterations
}
// Matrix multiplication
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i++) {
for (int j = 0; j < cols; j++) {
double sum = 0.0;
for (int k = 0; k < inner; k++) {
sum += A[i][k] * B[k][j];
}
C[i][j] = sum;
}
}
Scheduling
Purpose: Determines how loop iterations are distributed among threads.
Scheduling Types:
• Static: Iterations divided evenly at compile time
• Dynamic: Iterations assigned at runtime for load balancing
• Guided: Decreasing chunk sizes over time
• Auto: Compiler/runtime decides
Examples:
c
// Static scheduling
#pragma omp parallel for schedule(static, 100)
for (int i = 0; i < n; i++) {
process_data(i); // Even workload distribution
}
// Dynamic scheduling
#pragma omp parallel for schedule(dynamic, 10)
for (int i = 0; i < n; i++) {
complex_computation(i); // Good for irregular workloads
}
// Guided scheduling
#pragma omp parallel for schedule(guided, 50)
for (int i = 0; i < n; i++) {
variable_work(i); // Chunk sizes decrease over time
}
// Runtime scheduling
#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++) {
work(i); // Uses OMP_SCHEDULE environment variable
}
When to Use:
• Static: Equal workload per iteration
• Dynamic: Highly variable workload
• Guided: Moderate load imbalance
Producers and Consumers
Producer-Consumer Pattern: Synchronization pattern where some threads
produce data while others consume it.
Implementation with Critical Sections:
c
#define BUFFER_SIZE 100
int buffer[BUFFER_SIZE];
int count = 0, in = 0, out = 0;
void producer(int item) {
#pragma omp critical(buffer_access)
{
while (count == BUFFER_SIZE) {
// Buffer full - wait
}
buffer[in] = item;
in = (in + 1) % BUFFER_SIZE;
count++;
}
}
int consumer() {
int item;
#pragma omp critical(buffer_access)
{
while (count == 0) {
// Buffer empty - wait
}
item = buffer[out];
out = (out + 1) % BUFFER_SIZE;
count--;
}
return item;
}
Using Atomic Operations:
c
int shared_counter = 0;
#pragma omp parallel
{
#pragma omp for
for (int i = 0; i < 1000; i++) {
#pragma omp atomic
shared_counter++;
}
}
Caches and Cache Coherence
Cache Coherence: Ensures all processors see the same value for shared
memory locations.
Memory Hierarchy Impact:
• L1 Cache: Fastest, private to each core
• L2 Cache: Shared between cores on same chip
• L3 Cache: Shared among all cores
• Main Memory: Slowest, shared by all
Cache-Friendly Programming:
c
// Good: Sequential access pattern
#pragma omp parallel for
for (int i = 0; i < n; i++) {
array[i] = compute(i); // Good cache locality
}
// Poor: Random access pattern
#pragma omp parallel for
for (int i = 0; i < n; i++) {
int idx = random_index[i];
array[idx] = compute(idx); // Poor cache locality
}
Avoiding Cache Line Conflicts:
c
// Use padding to avoid conflicts
struct {
int counter;
char pad[60]; // Padding to cache line boundary
} thread_data[MAX_THREADS];
#pragma omp parallel
{
int id = omp_get_thread_num();
for (int i = 0; i < iterations; i++) {
thread_data[id].counter++;
}
}
False Sharing
Definition: Multiple threads access different variables that share the same cache
line, causing unnecessary cache invalidations.
Problem Example:
c
// BAD: False sharing occurs
int counters[MAX_THREADS]; // Adjacent in memory
#pragma omp parallel
{
int id = omp_get_thread_num();
for (int i = 0; i < iterations; i++) {
counters[id]++; // False sharing between threads
}
}
Solutions:
1. Padding:
c
struct padded_counter {
int count;
char pad[64 - sizeof(int)]; // Cache line padding
};
struct padded_counter counters[MAX_THREADS];
2. Thread-local storage:
c
#pragma omp parallel
{
int local_counter = 0; // Private to each thread
for (int i = 0; i < iterations; i++) {
local_counter++;
}
#pragma omp critical
{
global_counter += local_counter;
}
}
3. Use reduction:
c
int total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < n; i++) {
total += work(i);
}
OpenMP Tasking
Purpose: Tasks enable irregular parallelism and dynamic work creation.
Basic Task Creation:
c
#pragma omp parallel
{
#pragma omp single
{
for (int i = 0; i < num_tasks; i++) {
#pragma omp task
{
process_item(i); // Each iteration becomes a task
}
}
}
}
Recursive Tasks - Fibonacci:
c
int fibonacci_task(int n) {
if (n < 2) return n;
int x, y;
#pragma omp task shared(x)
x = fibonacci_task(n - 1);
#pragma omp task shared(y)
y = fibonacci_task(n - 2);
#pragma omp taskwait // Wait for both tasks
return x + y;
}
int main() {
int result;
#pragma omp parallel
{
#pragma omp single
result = fibonacci_task(10);
}
return 0;
}
Task Dependencies:
c
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task depend(out:a)
a = compute_a();
#pragma omp task depend(out:b)
b = compute_b();
#pragma omp task depend(in:a,b) depend(out:c)
c = combine(a, b);
}
}
Thread Safety
Thread Safety Mechanisms:
• Critical Sections: Mutual exclusion
• Atomic Operations: Indivisible operations
• Locks: Explicit synchronization
• Barriers: Synchronization points
Synchronization Examples:
c
// Critical section
#pragma omp critical
{
shared_resource++;
}
// Named critical section
#pragma omp critical(update_max)
{
if (value > global_max) global_max = value;
}
// Atomic operations
#pragma omp atomic
counter++;
#pragma omp atomic read
local_value = shared_value;
#pragma omp atomic write
shared_value = new_value;
// Barriers
#pragma omp parallel
{
phase_1_work();
#pragma omp barrier // All threads wait here
phase_2_work();
}
Explicit Locks:
c
#include <omp.h>
omp_lock_t my_lock;
int main() {
omp_init_lock(&my_lock);
#pragma omp parallel
{
omp_set_lock(&my_lock);
shared_work(); // Critical section
omp_unset_lock(&my_lock);
}
omp_destroy_lock(&my_lock);
return 0;
}
Performance Hierarchy: Atomic > Critical > Locks (choose based on needs)
Best Practices and Summary
Best Practices:
1. Start Simple: Begin with basic parallel for loops
2. Minimize Synchronization: Use reduction instead of critical sections
3. Consider Data Locality: Keep related data close in memory
4. Avoid False Sharing: Use padding or thread-local storage
5. Choose Right Scheduling: Static for balanced loads, dynamic for
irregular
6. Profile Your Code: Measure performance improvements
7. Handle Race Conditions: Use proper synchronization
Common Pitfalls:
• Assuming linear speedup
• Overusing critical sections
• Ignoring cache effects
• Not considering overhead of thread creation
• Parallelizing too small loops
Performance Considerations:
• Thread creation overhead
• Memory bandwidth limitations
• Cache coherence costs
• Load balancing issues
• Amdahl's Law limitations
Summary: OpenMP provides powerful tools for shared-memory
parallelization. Success requires understanding of:
• Proper variable scoping
• Effective synchronization
• Cache-aware programming
• Task-based parallelism for irregular problems
• Performance measurement and optimization