Parallel Architectures an
Algorithms
Dr. Hichem Talbi
[Link]@[Link]
Outline
• Introduction
• Parallel Models & Networks: Shared memory, PRAM, …
• Parallel Algorithm Design
• Parallel Algorithms Techniques
• Parallel Basic Algorithms: Reduce, Parallel Prefix Scan, …
• Parallel Sorting Algorithms: Odd-Even Sort, Bitonic Sort, Parallel Mergesort
• Parallel Graph Algorithms: Lattice Linear Predicates, Shortest Path
Algorithms, …
• Parallel Tree Algorithms: Euler Tour Technique, Least Common Ancestor
problem
• Parallel Matrix Algorithms: Matrix multiplication, Matrix inversion
Aims
- Provide students with the fundamental concepts of modern parallel
computing systems including architectures, models, languages, and
applications.
- Teach students how to design and analyze parallel algorithms for
variant scientific problems using different models and select the most
appropriate one.
Outcomes
• Understand the basic concepts of modern parallel computing
systems, and know different parallel models.
• Deploy appropriate models for the design and evaluation of parallel
computer algorithms for variant problems.
• Distinguish between the different techniques of designing parallel
algorithms.
• Apply appropriate computing tools, and design techniques for
modeling and analyzing specific scientific problem.
• Demonstrate and criticize competence in the use of computer science
parallel methods in solving basic problem.
What is parallelism?
• Von Neumann Architecture:
• Sequential processing of information.
• Fast accomplishment of many human tasks.
• Storage of large amounts of data.
What is parallelism?
• Parallelism: Splitting the problem into several sub-problems, and
solving them simultaneously using a parallel computer.
• Parallel Computer: a type of computer architecture designed to
execute multiple computational tasks or operations concurrently,
enabling the system to handle larger workloads and execute tasks
more efficiently compared to traditional sequential.
Supercomputers, computing clusters, and graphics processing units (GPUs) are
examples of systems that often employ parallel architectures.
• Parallel Algorithm: a computational algorithm designed to efficiently
perform tasks or solve problems by simultaneously executing multiple
operations or computations concurrently.
Why is parallelism important?
• Accelerating computing speed: for tasks that can be divided into
parallelisable subtasks.
• Providing multiplicity of data paths to increase accessibility to storage
elements.
• Increasing the performance and lowering cost through resource sharing
(cloud services).
• Solving difficult problems (e.g. training deep neural networks).
• Modelling and simulating complex systems (climate, crash tests, drug
design, computational chemistry, nuclear reactions, quantum systems, etc.).
• Scalability: Adding more processing units or nodes to a parallel system can
enhance its computational power, making it suitable for increasingly
complex tasks.
• Improved energy efficiency.
Importance for quantum computing
• Quantum Parallelism: Quantum computers exhibit inherent parallelism
due to superposition, enabling the representation of multiple states
simultaneously. Knowledge of parallel algorithms helps quantum
computing students harness this quantum parallelism to solve
problems more efficiently than classical algorithms.
• Quantum Speedup Analysis: Parallel algorithms provide a foundation
for analysing the potential speedup that quantum algorithms can
achieve compared to classical counterparts. Quantum computing
students need to understand how quantum parallelism influences the
computational complexity of various problems.
Importance for quantum computing
• Algorithm Design and Optimization: Quantum computing leverages
principles of superposition and entanglement to perform parallel
computations. Understanding parallel algorithms helps quantum
computing students design and optimize quantum algorithms that
exploit these principles for more efficient and faster computations.
• Hybrid Quantum-Classical Algorithms: Many quantum algorithms
operate in a hybrid fashion, combining quantum and classical
components. Understanding parallel algorithms aids in designing
efficient hybrid algorithms that leverage both quantum and classical
parallelism for improved performance.
Importance for quantum computing
• Quantum Simulation: Quantum computers are well-suited for
simulating quantum systems, which often exhibit complex parallel
behaviours. Understanding parallel algorithms is crucial for efficiently
simulating quantum systems and studying quantum phenomena.
• Quantum Error Correction: Quantum error correction involves the
parallel manipulation of qubits to protect quantum information from
errors. Parallel algorithms are essential for implementing efficient
error-correction codes and protocols in quantum computing.
Computer architectures classification
Architectures
SISD SIMD MISD MIMD Hybrid
Shared Distributed
memory memory
MIMD MIMD
Computer architectures classification
SIMT Architecture
• Used in many modern GPUs
(Graphics Processing Units)
• Share characteristics with both
SIMD and MIMD.
• Use a single instruction
stream, but threads within a
thread block can follow
different paths, providing
flexibility in handling divergent
workloads.
Modern multicore processor systems
• They typically fall under the category of Shared
Memory Multiple Instruction, Multiple Data
(SM-MIMD) architectures.
• Shared Memory (SM): In a multicore processor
system, all the cores share a common memory
space. Each core can access the same memory,
allowing for straightforward communication and
data sharing between cores.
• Multiple Instruction, Multiple Data (MIMD): Each
core in a multicore processor system can execute its
own set of instructions independently. Additionally,
each core can operate on different sets of data
simultaneously, providing true parallelism.
Examples of daily uses of parallel software
• Web Browsers: Modern web browsers often use parallel processing to
speed up the rendering of web pages. Different components of a
webpage, such as images, scripts, and stylesheets, can be loaded in
parallel to enhance the overall browsing experience.
• Operating Systems: Operating systems use parallelism for multitasking.
Concurrent execution of multiple processes or threads allows users to
run several applications simultaneously without one significantly
impacting the performance of others.
• Search Engines: Search engines employ parallel processing to quickly
retrieve and rank search results. The indexing and searching of vast
amounts of web content can be performed in parallel to provide users
with rapid and relevant results.
Examples of daily uses of parallel software
• Media and Entertainment: Video and audio processing applications
often use parallelism for tasks like video rendering, encoding, and
image processing. This accelerates the creation and manipulation of
multimedia content.
• Data Analysis and Analytics: Parallel computing is crucial in data
analytics and scientific simulations. Tools like Apache Hadoop and
Apache Spark use parallel processing to distribute and process large
datasets across multiple nodes, making it feasible to analyze big data
efficiently.
• Graphics and Gaming: Graphics processing units (GPUs) are designed
for parallel processing and are extensively used in gaming and graphics-
intensive applications. The parallel architecture of GPUs accelerates
rendering and enhances the visual experience in video games.
Examples of daily uses of parallel software
• Machine Learning and Artificial Intelligence: Training machine learning
models involves computationally intensive tasks that can benefit from
parallel processing. Frameworks like TensorFlow and PyTorch utilize
parallelism to accelerate the training of neural networks.
• Virtualization and Cloud Computing: Virtualization technologies and
cloud computing platforms use parallelism to efficiently allocate
resources and manage virtual machines. This enables multiple users or
applications to run concurrently on shared infrastructure.
• Database Management Systems: Parallel database systems divide
queries into parallel tasks, allowing for simultaneous execution and
faster retrieval of information. This is beneficial in scenarios where
large datasets need to be queried and analyzed.
Challenges
• Algorithm Design Complexity: Designing algorithms that effectively
exploit parallelism requires a deep understanding of the problem
domain and careful consideration of parallel patterns. Developing
parallel algorithms is often more complex than their sequential
counterparts.
• Heterogeneous Architectures: The increasing prevalence of
heterogeneous computing environments with diverse architectures
(e.g., CPUs, GPUs, accelerators) adds complexity to parallel algorithm
development. Effectively utilizing these heterogeneous resources
requires specialized approaches.
Challenges
• Data Dependency: Dependencies between data elements or tasks can
limit the degree of parallelism. Identifying and managing dependencies
is essential for ensuring that tasks can be executed concurrently
without violating data consistency.
• Communication and Synchronization: Efficient communication and
synchronization between processing units are crucial for proper
coordination. However, excessive communication or synchronization
can lead to performance bottlenecks and negate the benefits of
parallelism.
Challenges
• Granularity and Scalability: Determining the appropriate granularity of
tasks is challenging. Tasks that are too fine-grained can result in high
overhead, while tasks that are too coarse-grained may limit parallelism.
Additionally, ensuring scalability as the problem size increases or the
number of processing units grows can be complex.
• Load Balancing: Distributing the workload evenly among processing
units is crucial for efficient parallel computing. Load imbalances, where
some units have more work than others, can lead to underutilization of
resources and decreased overall performance.
Challenges
• Memory Access Patterns: Efficient memory access is critical in parallel
computing. Suboptimal memory access patterns, such as false sharing
or poor cache utilization, can degrade performance. Designing
algorithms with optimized memory access is a non-trivial task.
• Fault Tolerance: Parallel systems may be prone to failures, and ensuring
fault tolerance in the presence of hardware or software failures is a
significant challenge. Developing algorithms that can recover or adapt
to failures is crucial for the reliability of parallel applications.
• Performance modelling: Speedup, Efficiency.
The PRAM (Parallel Random Access Machine) model
• a theoretical model used in parallel
computing to describe and analyze
the behaviour of parallel algorithms.
• abstracts key aspects of parallel
computation.
• each processor has its own private
memory, and there is a global shared memory.
• each processor can read from or write to any location in the shared
memory in a single time step.
• all processors can execute their operations simultaneously.
• The model assumes perfect synchronization among processors.
The PRAM model variants
• based on how concurrent memory accesses are handled:
• EREW (Exclusive Read, Exclusive Write): At most one processor can read from
or write to a memory location in a single time step.
• CREW (Concurrent Read, Exclusive Write): Multiple processors can read from a
memory location in a single time step, but only one processor can write to it.
• CRCW (Concurrent Read, Concurrent Write): Multiple processors can read
from and write to a memory location in a single time step. This model further
subdivides into:
• common CRCW (all writes must be the same)
• arbitrary CRCW (writes can be different).
Other models
• Bulk Synchronous Parallel (BSP): The BSP model
focuses on practical aspects of parallel
computation, emphasizing synchronization and
communication. It has gained popularity for its
simplicity and applicability to a broad range of
parallel algorithms, making it a competitive
model for algorithmic analysis.
• LogP Model: The LogP model is specifically
focused on communication aspects in parallel
systems. It considers latency, overhead, and gap
between successive messages, providing
insights into the performance of parallel
algorithms in distributed memory
environments. While it may not cover all
aspects of computation, it complements other
models in understanding communication
efficiency.
Parallel algorithms design stages
Problem
Performance
Decomposition Optimization
Analysis
(partitioning)
Communication Testing and Documentation
Design Debugging and Reporting
Parallel
Synchronization Algorithm
Algorithm
Strategy Integration
Maintenance
Parallelization Task
of Subtasks Assignment
(agglomeration) (mapping)
Parallel algorithms design stages
[Link]: The computation that is to be performed and
the data operated on by this computation are decomposed
into small indivisible tasks. Practical issues such as the
number of processors in the target computer are ignored.
[Link]: The communication required to coordinate
tasks execution is determined based on their dependencies.
[Link]: Combines small tasks into larger tasks to
improve performance or to reduce communication/
development cost. Data and/or computation may be
replicated.
[Link]: Each task is assigned to a processor in a manner
that attempts to satisfy the competing goals of maximizing
processor utilization and minimizing communication costs.
Mapping can be specified statically or determined at runtime
by load-balancing algorithms.
Partitioning
• A good partition divides into small pieces both the computation associated with
a problem and the data on which this computation operates.
• Domain decomposition: Programmers, most commonly, determine first an
appropriate partition for the data, and then work out how to associate
computation with data.
• Functional decomposition: decomposing first the computation to be performed
and then dealing with the data.
• The two approaches are complementary; they may be applied to different
components of a single problem or even applied to the same problem to obtain
alternative parallel algorithms.
• We avoid replicating computation and data at this stage. Later, it can be
worthwhile replicating either computation or data if doing so allows us to reduce
communication requirements.
Partitioning
• Domain decomposition:
• Divide data into small pieces of approximately equal size.
• Partition the computation to be performed, typically by associating each
operation with the data on which it operates.
• The data that are decomposed may be the input to the program, the output,
or intermediate values.
• It is better to focus first on the largest data structure or on the data structure
that is accessed most frequently.
• Different phases of the computation may operate on different data structures
or demand different decompositions for the same data structures.
Partitioning
• Functional decomposition:
• The initial focus is on the computation that is to be performed rather than on
the data manipulated by the computation.
• If we are successful in dividing this computation into disjoint tasks, we
proceed to examine the data requirements of these tasks.
• These data requirements may be disjoint, in which case the partition is
complete. Alternatively, they may overlap significantly, in which case
considerable communication or data replication will be required.
Partitioning
• Partitioning design criteria
• For flexibility:
• the number of tasks>the number of PU.
• several alternative partitions.
• For scalability:
• no redundant computation and storage requirements.
• the number of tasks rather then their size scale with problem size.
• For balanced workload, comparable size tasks.
Communication
• The tasks generated by a partition are intended to execute
concurrently but cannot, in general, execute independently.
• The computation to be performed in one task will typically require data
associated with another task.
• Data must then be transferred between tasks so as to allow
computation to proceed: communication.
• To optimize performance, distributing communication operations over
many concurrent tasks.
Communication
• Communication could be local/global, structured/unstructured,
static/dynamic, and synchronous/asynchronous.
• In local communication, each task communicates with a small set of other
tasks (its “neighbors”); in contrast, global communication requires each task
to communicate with many tasks.
• In structured communication, a task and its neighbors form a regular
structure, such as a tree or grid; in contrast, unstructured communication
networks may be arbitrary graphs.
• In static communication, the identity of communication partners does not
change over time; in contrast, the identity of communication partners in
dynamic communication structures may be determined by data computed at
runtime and may be highly variable.
• In synchronous communication, producers and consumers execute in a
coordinated fashion
Agglomeration
• Aims at obtaining an algorithm that will
execute efficiently on some class of parallel
computer.
• By considering whether it is useful to combine,
or agglomerate, tasks identified by the
partitioning phase, so as to provide a smaller
number of tasks, each of greater size.
• By determining whether it is worthwhile to
replicate data and/or computation.
• On most parallel computers, we have to stop
computing in order to send and receive
messages.
Mapping
• The goal in developing mapping algorithms is normally to minimize
total execution time. We use two strategies to achieve this goal:
• We place tasks that are able to execute concurrently on different processors,
so as to enhance concurrency.
• We place tasks that communicate frequently on the same processor, so as to
increase locality.
Mapping
Mapping
Design Models
• Data parallel model
• Task parallel model
• Master slave model
• Pipeline/Producer-Consumer model
• Hybrid models
Performance analysis
• Parallel runtime (complexity): Tp
Tp=Tcomp.+Tcomm.
• Speedup: S
S=Ts/Tp
• Efficiency: E
E=S/P
• Cost: C
C=P*Tp
Optimal parallel algorithm: C asymptotically identical to the
serial cost
GPU programming terminology
• GPU (Graphics Processing Unit): The hardware component designed to
accelerate graphics rendering. In the context of GPU programming, it is
also used for general-purpose parallel computing.
• CUDA (Compute Unified Device Architecture): NVIDIA's parallel
computing architecture and programming model for NVIDIA GPUs. It
allows developers to use GPUs for general-purpose computing tasks.
• OpenCL (Open Computing Language): An open standard for parallel
programming across different platforms, including GPUs. It enables
developers to write programs that can execute on various GPUs, CPUs,
and other accelerators.
GPU programming terminology
• Kernel: A small, specialized program that runs on the GPU. It is the unit of
work in GPU programming and is executed in parallel by multiple threads.
• Thread: A single sequence of instructions in a program. In GPU
programming, threads execute in parallel, with many threads working on
different data elements simultaneously.
• Block (or Workgroup): A group of threads that can share data and
synchronize their execution. Threads within a block can communicate and
cooperate, making them an essential unit for certain algorithms.
• Grid: A collection of blocks. The grid represents the overall structure of the
parallel execution on the GPU.
GPU programming terminology
• Map Operation: It applies a given function to each element in a
collection, producing a new collection where each element is the result
of applying the function to the corresponding element in the original
collection.
• Reduce Operation: It combines elements of a collection to produce a
single result. It typically involves an associative and commutative
operation that is applied iteratively to pairs of elements until a final
result is obtained. Special techniques, such as parallel reduction
algorithms, are employed to optimize reduction operations on GPUs.
Strategies like thread cooperation and shared memory utilization are
often used to enhance performance.
GPU programming terminology
• Shared Memory: Memory that is shared among threads within a block.
It is faster to access than global memory but is limited in size.
• Global Memory: The main memory space accessible by all threads on
the GPU. It is used to store data that needs to be shared across
different blocks.
• Warps: Groups of threads that execute in lockstep on the GPU. In
NVIDIA GPUs, a warp typically consists of 32 threads.
GPU programming terminology
• CUDA Cores (or CUDA Stream Processors): The processing units within an
NVIDIA GPU that execute parallel threads. They are responsible for carrying
out the instructions in CUDA programs.
• Streaming Multiprocessor (SM): The basic building block of a GPU. It
consists of multiple CUDA cores and is responsible for executing parallel
threads.
• Texture Memory: A special type of memory optimized for graphics
applications. It is read-only and provides high-bandwidth access patterns.
• Texture Units: Units responsible for fetching texture data. They are part of
the GPU's processing pipeline.
GPU programming terminology
• Warp Scheduling: The process of organizing and scheduling warps for
execution on the GPU. Efficient warp scheduling is crucial for
maximizing GPU utilization.
• Thread Divergence: The situation where threads within a warp follow
different execution paths. It can lead to performance degradation due
to serialization of thread execution.
GPU programming basic instructions (using PyCUDA)
• Import PyCUDA:
import [Link]
import [Link] as cuda
from [Link] import SourceModule
• Allocate memory on the GPU using [Link].to_gpu:
import numpy as np
a_cpu = [Link]([1, 2, 3], dtype=np.float32)
a_gpu = [Link].to_gpu(a_cpu)
GPU programming basic instructions (using PyCUDA)
• Define CUDA Kernel in CUDA C/C++ syntax using the __global__
specifier:
kernel_code = """
__global__ void add_vectors(float *a, float *b, float *result) {
int idx = threadIdx.x;
result[idx] = a[idx] + b[idx];
}
"""
• Compile the CUDA kernel using [Link]:
mod = SourceModule(kernel_code)
GPU programming basic instructions (using PyCUDA)
• Get the compiled CUDA kernel function from the module:
kernel_code = """add_vectors =
mod.get_function("add_vectors")
• Launch the CUDA kernel with a specified grid and block size:
block_size = (len(a_cpu), 1, 1)
grid_size = (1, 1)
add_vectors(a_gpu, a_gpu, a_gpu, block=block_size, grid=grid_size)
• Retrieve Results from GPU:
result_gpu = a_gpu.get()
import [Link]
import [Link] as cuda
import numpy as np
from [Link] import SourceModule
# Allocate GPU memory
a_cpu = [Link]([1, 2, 3], dtype=np.float32)
a_gpu = [Link].to_gpu(a_cpu)
# Define CUDA kernel
kernel_code = """
__global__ void add_vectors(float *a, float *b, float *result) {
int idx = threadIdx.x;
result[idx] = a[idx] + b[idx];
}
"""
# Compile CUDA kernel
mod = SourceModule(kernel_code)
# Get CUDA kernel function
add_vectors = mod.get_function("add_vectors")
# Launch CUDA kernel
block_size = (len(a_cpu), 1, 1)
grid_size = (1, 1)
add_vectors(a_gpu, a_gpu, a_gpu, block=block_size, grid=grid_size)
# Retrieve results from GPU
result_gpu = a_gpu.get()
print("Result on GPU:", result_gpu)
Using numba
import numpy as np
from numba import cuda
# Define GPU-accelerated function
@[Link]
def add_vectors(a, b, result):
idx = [Link].x
result[idx] = a[idx] + b[idx]
# Allocate GPU memory
a_cpu = [Link]([1, 2, 3], dtype=np.float32)
b_cpu = [Link]([4, 5, 6], dtype=np.float32)
result_gpu = np.zeros_like(a_cpu)
a_gpu = cuda.to_device(a_cpu)
b_gpu = cuda.to_device(b_cpu)
result_gpu_device = cuda.to_device(result_gpu)
# Set up grid and block dimensions
block_size = a_cpu.shape[0]
grid_size = 1
# Launch GPU-accelerated function
add_vectors[grid_size, block_size](a_gpu, b_gpu, result_gpu_device)
# Retrieve results from GPU
result_gpu = result_gpu_device.copy_to_host()
print("Result on GPU:", result_gpu)