Patterns
Based on Structured Parallel Programming
Dr.-Ing Frezewd Lemma
Goal
• Covers Patterns
1
Introduction to Patterns
What Are Patterns? i
• Codification of best practices in software engineering.
• Enable reusable solutions for specific problems.
• In parallel programming: Task distribution and data access
combinations that solves a specific problem in parallel
algorithm design.
• We give each pattern a specific name, which makes it much
easier to succinctly describe, discuss, and compare various
parallel algorithms.
• Algorithms are typically designed by composing patterns
• Provides you a high-level vocabulary for designing your own
algorithms and for understanding other people’s algorithms
2
Why Use Patterns? i
• Universality: Not tied to hardware or language.
• Simplicity: Provide a clear structure for designing algorithms.
• Determinism: Ensures repeatable results for easier debugging
and testing.
3
Nesting Patterns
Nesting Patterns i
• Fundamental compositional pattern in serial and parallel
programming.
• Enables hierarchical and recursive composition of patterns.
• Supports modular and scalable code structures.
• Most programming models, at present support only a fixed
number of nesting levels and may even map these levels
directly onto hardware components.
• Mapping the hierarchy of the program directly onto the
hardware hierarchy also makes code less future-proofed.
• Cilk Plus and TBB support arbitrary nesting while efficiently
mapping parallelism to physical hardware.
4
Nesting Patterns ii
• Examples of systems with limited nesting include:
◦ OpenCL, CUDA, C++ AMP, OpenMP
5
Nesting Example
Figure 1: Example of Nesting Patterns
6
Structured Serial Control Flow
Patterns
Control Flow Patterns: Sequence i
• Tasks execute in a specified order.
• Enforces ordered side effects (e.g., output).
• Example: ‘T = f(A); S = g(T); B = h(S);‘
• restrict the order to be the same as the texture order.
• T = f(A); S = g(A); B = h(S,T);
• the sequence pattern would still require executing g after f
• Superscalar Sequence : a parallel generalization of sequence
7
Control Flow Patterns: Selection i
• Executes one of two branches based on a condition.
• Exactly one branch is executed, ensuring mutual exclusivity.
• Example:
if (c) { a; } else { b; }
• There is a parallel generalization of selection, the speculative
selection pattern, which execute all of a, b, and c in parallel,
8
Control Flow Patterns: Iteration i
• Executes a task repeatedly while a condition holds true.
• The loop body typically modifies a state that the condition
uses for termination testing.
• Iteration assumes data dependencies between loop iterations
are manageable, although parallelization may be complicated
by these dependencies.
• Can use ‘for‘ or ‘while‘ loops.
• Example: ‘while (c) a; ‘
9
Control Flow Patterns: Recursion i
• A function calls itself, directly or indirectly, , enabling nested
tasks.
• It is usually associated with stack-based memory allocation or,
if higher-order functions are supported, closures (see Section
3.4.4) which are objects allocated on the heap.
• Tail recursion is a special form of recursion that can be
converted into iteration
• Examples:
◦ Factorial calculation
◦ Fibonacci sequence
◦ Divide-and-conquer algorithms (e.g., quicksort,
mergesort)
10
Control Flow Patterns: Recursion ii
• Dynamic form of nesting often implemented with stack-based
memory.
11
Parallel Control Patterns
Parallel Control Patterns: Introduction i
• Parallel control patterns extend the serial control patterns
• Parallel control patterns relaxes the assumptions of the serial
control patterns in various ways
12
Fork–Join i
• Splits control flow into parallel tasks and later joins them.
• Applications
◦ Used in frameworks like OpenMP and Cilk Plus.
◦ Divide and Conquer Algorithms: QuickSort: Divide the
array into parts, sort them in parallel, and join the
results. MergeSort: Recursively divide the array, sort the
halves, and merge them in parallel.
◦ Graph algorithms, Dynamic programming etc
13
Map Pattern i
• Applies a function to each element of a collection
independently.
Figure 2: The Map pattern
• Produces a new collection of the same shape.
• Embarrassingly parallel (no dependencies between operations).
• No communication or synchronization needed between tasks.
14
Map Pattern ii
• Ideal for SIMD (Single Instruction, Multiple Data)
architectures.
• Coarse-grained parallelism.
• Scales well with the number of processing units.
• Appllications
◦ Image Processing:
∗ Gamma correction.
∗ Thresholding.
∗ Color space conversions.
◦ Monte Carlo Sampling:
∗ Generating random samples for simulations.
◦ Ray Tracing:
15
Map Pattern iii
∗ Calculating pixel colors in rendering algorithms.
◦ Image filtering, array transformations, or functional
programming.
16
Stencil Pattern i
• Generalization of map pattern with neighborhood access.
• Applications:
◦ Image Filtering:
∗ Convolution.
∗ Median filtering.
◦ Video Processing:
∗ Motion estimation in video encoding.
◦ Scientific Simulations:
∗ Fluid flow simulations.
∗ Electromagnetic partial differential equation (PDE)
solvers.
17
Stencil Pattern ii
∗ Lattice quantum chromodynamics (QCD).
∗ Cellular automata (e.g., lattice Boltzmann methods).
◦ Noise Reduction:
∗ Isotropic diffusion.
18
Reduction i
• Combines all elements of a collection into a single value.
• Uses an associative combiner function (e.g., addition,
multiplication).
• Applications
◦ Numerical Applications:
∗ Summing elements of a collection.
∗ Dot products in matrix multiplication.
∗ Averaging Monte Carlo samples for integration.
◦ Convergence Testing:
∗ Used in iterative solutions of linear equations (e.g.,
conjugate gradient methods).
19
Reduction ii
◦ Image Processing:
∗ Image comparison metrics (e.g., in video encoding).
◦ Other Mathematical Operations:
∗ Maximum, minimum, multiplication, Boolean
AND/OR/XOR, etc.
Pseudocode Example
function reduce(array):
if [Link] == 1: return array[0]
else:
left = reduce(array[:mid])
right = reduce(array[mid:])
return combine(left, right)
20
Scan i
• Computes all partial reductions of a collection.
• Produces an array where each element is the reduction of all
previous elements.
• Applications:
◦ Integration:
∗ Computing cumulative sums or other cumulative
operations.
◦ Financial Modeling:
∗ Sequential decision simulations in option pricing.
◦ Random Number Generation:
21
Scan ii
∗ Generating sequences of random numbers (if
traditional pseudorandom generators are
parallelized).
◦ Data Packing:
∗ Combined with scatter for efficient memory
management.
• Associativity of the operation is crucial for parallelization.
22
Recurrence i
• Generalization of iteration where loop iterations depend on
previous computations.
• Causality: There must be a serial ordering for computation.
• Parallelization:
◦ Associative operations (e.g., prefix sums) can be
parallelized.
◦ Multi-dimensional recurrences are often parallelizable
using sweeps.
• Applications:
◦ Matrix Factorization:
∗ Numerical linear algebra.
23
Recurrence ii
◦ Image Processing:
∗ Advanced filtering operations.
◦ PDE Solvers:
∗ Solving partial differential equations in scientific
computations.
◦ Sequence Alignment:
∗ Used in bioinformatics and computational biology.
24
Serial Data Management Patterns
Random Read and Write i
• Uses pointers or indices for direct memory access.
• Challenges: Aliasing can complicate parallelization.
25
Heap and Stack Allocation i
• Stack: LIFO memory allocation (fast and locality-preserving).
• Heap: Dynamic allocation (slower, can fragment memory).
26
Closures and Objects i
• Closures
◦ Closures are function objects that can be constructed and
managed like data. Lambda functions
◦ When closures are built, they can often be used to
“capture” the state of non-local variables that they
reference
• Objects
◦ Objects are a language construct that associate data with
the code to act on and manage that data.
◦ Multiple functions may be associated with an object and
these functions are called the methods or member
functions of that object.
27
PARALLEL DATA MANAGEMENT
PATTERNS
Pack i
• Eliminates unused or irrelevant elements in a collection.
• Input elements marked with Boolean values:
◦ True: Element is retained.
◦ False: Element is discarded.
• Output: Contiguous sequence of selected elements in the
same order.
• Common Applications:
◦ Filtering results in image processing.
◦ Collision detection in simulations.
28
Pipeline i
• Connects tasks in a producer-consumer relationship.
• Each stage processes data and passes results to the next.
• All stages can run simultaneously, maintaining independent
states.
• Common Applications:
◦ Video encoding and decoding.
◦ Real-time data processing (e.g., audio streams).
29
Geometric Decomposition i
• Divides data into subcollections (overlapping or
non-overlapping).
• Example: Partitioning domains for stencil computations.
30
Partition Pattern i
• Special case of geometric decomposition with non-overlapping
regions.
• Common in matrix operations and image processing.
31
Gather i
• Collects elements from an input collection using an index
array.
• The shape of the output collection matches the index array.
• Optimizations:
◦ Coherent patterns (e.g., shifting or stencil) are more
efficient.
• Common Applications:
◦ Sparse matrix operations.
◦ Ray tracing and volume rendering.
32
Scatter i
• Writes input data to specific locations in a collection using an
index array.
• The inverse of the Gather pattern.
• Challenges:
◦ Handling collisions when multiple writes target the same
index.
◦ Requires a deterministic resolution strategy.
• Common Applications:
◦ Writing results of simulations.
◦ Updating sparse matrices.
33
Scatter ii
34
Advanced Patterns
Superscalar Sequences i
• Allows tasks to execute in parallel as long as data
dependencies are met.
• Example: Out-of-order execution in modern processors.
35
Futures and Speculative Selection i
• Futures: Non-hierarchical task management.
• Speculative Selection: Parallel execution of branches with
cancellation.
36
Workpile i
• Generalization of the Map pattern where tasks can
dynamically generate more tasks.
• Tasks are added to a “workpile” and processed as they arrive.
• Commonly used in:
◦ Recursive algorithms like tree searches.
◦ Dynamic workloads where task count is not predefined.
• Challenges:
◦ Efficient task scheduling.
◦ Avoiding bottlenecks in task generation.
37
Workpile ii
38
Search i
• Finds elements in a collection that meet specific criteria.
• Criteria can vary from simple key matches to complex
logical/arithmetic constraints.
• Applications:
◦ Databases: Relational queries (e.g., SQL).
◦ AI: Pathfinding and heuristic searches.
• Optimization:
◦ Sorting data for faster searches.
◦ Using parallel algorithms for large datasets.
39
Segmentation i
• Divides a collection into multiple segments for independent
processing.
• Each segment can be treated as a separate collection.
• Example:
◦ Summing elements in segments of an array.
• Applications:
◦ Multi-threaded data processing.
◦ Data-parallel algorithms in distributed systems.
40
Expand i
• Creates new data elements or collections based on an existing
one.
• The number of outputs is not predetermined and can vary
based on input.
• Common Applications:
◦ Generating neighboring states in simulations.
◦ Recursive enumeration (e.g., generating permutations or
combinations).
• Challenges:
◦ Managing memory for dynamically generated outputs.
◦ Ensuring scalability with irregular data generation.
41
Expand ii
42
Category Reduction i
• Groups data into categories and applies a reduction operation
within each category.
• Generalization of the Reduction pattern.
• Common Applications:
◦ Grouping sales data by region and computing total
revenue.
◦ Histograms and frequency counts.
• Challenges:
◦ Handling a large number of categories efficiently.
◦ Balancing load in parallel computations.
43
Category Reduction ii
44
PROGRAMMING MODEL
SUPPORT FOR PATTERNS
Programming Model Supports for Patterns i
• TBB: Nesting, Recursion, Fork-Join, Map, Workpile,
Reduction, Scan, Pipeline, Speculative Selection, Branch and
Bound
• OpenMP: Map, Workpile, Reduction, Fork-Join, Stencil,
Gemotric Decompostion, Gather, Scatter
• OpenCL: Map, Gather, Scatter, Reduction, Scan, Pack,
Expand, Stencil, Workpile, Superacalar Sequences, Geometric
Decompostion, Closures
45
Conclusion
Summary i
• Patterns provide reusable solutions for parallel programming.
• Enable scalable, maintainable, and deterministic designs.
• Foundation for developing high-performance systems.
46
Next Steps i
• Explore applications of patterns in real-world systems.
• Experiment with frameworks like OpenMP and TBB.
• Practice composing patterns for complex parallel algorithms.
47
The Map Pattern Implementation i
Definition:
• Applies a function to every element of a data collection in
parallel.
• Executes function invocations independently with no side
effects.
Key Characteristics:
• Foundation for parallelism with no interdependencies.
• Enables maximum concurrency with flexible scheduling.
• Basis of many efficient applications and patterns.
Applications:
48
The Map Pattern Implementation ii
• Embarrassingly parallel tasks and unrelated problem-solving.
• Combined with other patterns like gather, reduction, and scan.
Implementation Considerations:
• Avoids using separate threads for lightweight tasks.
• Parallelizes synchronization overhead efficiently.
• Supports vectorization and serial control flow nesting.
Examples:
• Simplified implementations using TBB, Cilk Plus, OpenMP,
etc.
• Map-Reduce as a practical extension (e.g., Google’s systems).
49
Implementing the Map Pattern Using Different Programming
Models i
• Example Problem: SCALED VECTOR ADDITION - SAXPY
◦ Definition: SAXPY stands for ”Single-Precision A·X
Plus Y.” It performs a scaled vector addition where a
vector x is scaled by a scalar a and added to another
vector y , elementwise.
y ← ax + y
◦ Components:
∗ Inputs: Vector x, scalar a, and vector y .
∗ Output: Updated vector y , with old values
overwritten.
50
Implementing the Map Pattern Using Different Programming
Models ii
◦ Use Cases: Common in linear algebra operations, such
as row cancellation in Gaussian elimination.
◦ Uniform and Varying Parameters:
∗ Uniform Parameters: Scalar a, which remains
constant across operations.
∗ Varying Parameters: Elements of vectors x and y ,
which vary across operations.
◦ Elemental Function Representation: The operation
for each element can be described as:
f (t, p, q) = tp + q
where t = a, p = xi , and q = yi .
51
Implementing the Map Pattern Using Different Programming
Models iii
◦ Parallelization:
∗ Suitable for parallel implementation using the map
pattern, as each iteration is independent.
∗ Arithmetic intensity is low, which limits scalability,
making SAXPY less suitable for highly parallel
systems compared to more complex operations.
◦ Scalability: As a Level 1 BLAS routine, SAXPY does
not scale well due to low arithmetic intensity.
Higher-level BLAS routines (Level 2 and 3) with more
work per unit of parallelism scale better.
52
Implementing the Map Pattern Using Different Programming
Models iv
◦ Implementation Examples: Demonstrated in multiple
parallel programming models, including TBB, OpenMP,
and OpenCL.
◦ Serial Implementation: Provides a baseline, where each
element is processed sequentially in a loop, with
independent iterations.
• Porgramming Models
◦ Serial
◦ TBB
◦ OpenMp
◦ OpenCl
53
Serial Implementation
Figure 3: SAXPY Serial Implementation
54
Map With TBB
Figure 4: SAXPY TBB Implementation 55
Map With OpenMP
Figure 5: SAXPY OpenMP Implementation
56
Map With OpenCl
Figure 6: SAXPY OpenCL Implementation
57