Understanding Multiprocessor Systems
Besides providing high performance, the multiprocessor also offers the following benefits:
Classification:
Modern CPUs operate considerably faster than the main memory they use. In the early days of
computing and data processing, the CPU generally ran slower than its own memory. The
performance lines of processors and memory crossed in the 1960s with the advent of the first
supercomputers. Since then, CPUs increasingly have found themselves "starved for data" and
having to stall while waiting for data to arrive from memory. Many supercomputer designs of the
1980s and 1990s focused on providing high-speed memory access as opposed to faster processors,
allowing the computers to work on large data sets at speeds other systems could not approach.
Limiting the number of memory accesses provided the key to extracting high performance from a
modern computer. For commodity processors, this meant installing an ever-increasing amount of
high-speed cache memory and using increasingly sophisticated algorithms to avoid cache misses.
But the dramatic increase in size of the operating systems and of the applications run on them has
generally overwhelmed these cache-processing improvements. Multi-processor systems without
NUMA make the problem considerably worse. Now a system can starve several processors at the
same time, notably because only one processor can access the computer's memory at a time.
NUMA attempts to address this problem by providing separate memory for each processor,
avoiding the performance hit when several processors attempt to address the same memory. For
problems involving spread data (common for servers and similar applications), NUMA can
improve the performance over a single shared memory by a factor of roughly the number of
processors (or separate memory banks). Another approach to addressing this problem, utilized
mainly by non-NUMA systems, is the multi-channel memory architecture, in which multiple memory
channels increase the number of simultaneous memory accesses.
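The scaling claim above can be illustrated with a back-of-the-envelope model. This is a toy sketch, not a measurement: it assumes every access costs one time unit, that a single shared memory serves one access at a time, and that under NUMA every access is local. The function names are hypothetical.

```python
def shared_memory_time(accesses_per_cpu: int, num_cpus: int) -> int:
    """Time units when one shared memory serves a single access at a time."""
    # All accesses from all CPUs are serialized on the one memory.
    return accesses_per_cpu * num_cpus


def numa_time(accesses_per_cpu: int, num_cpus: int) -> int:
    """Time units when each CPU has its own bank and all of its data is local."""
    # Banks work in parallel, so the total time equals that of one CPU;
    # num_cpus is kept only so both models share a signature.
    return accesses_per_cpu


def speedup(accesses_per_cpu: int, num_cpus: int) -> float:
    """Idealized NUMA speedup over a single shared memory."""
    return shared_memory_time(accesses_per_cpu, num_cpus) / numa_time(accesses_per_cpu, num_cpus)
```

Under these assumptions `speedup(1000, 4)` evaluates to 4.0, matching the "roughly the number of processors" figure. Any remote access in a real NUMA machine would reduce this.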
Fig. Architecture of a NUMA system.
No-Remote Memory Access (NORMA)
In the NORMA architecture, the address space is not globally unique, and memory is not globally
accessible by the processors.
Remote memory modules can be accessed only indirectly, by sending messages through the
interconnection network to other processors, which may in turn deliver the requested data in a
reply message.
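The request/reply pattern can be sketched in a few lines. This is an illustration only: the `Node` class, `remote_read`, and the use of in-process queues in place of an interconnection network are all assumptions made for the example.

```python
import queue
import threading


class Node:
    """A processor with private memory, reachable only through its inbox."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = dict(memory)   # local memory, not addressable by other nodes
        self.inbox = queue.Queue()   # stands in for the interconnection network

    def serve(self):
        """Answer read requests until a None shutdown message arrives."""
        while True:
            msg = self.inbox.get()
            if msg is None:
                break
            address, reply_to = msg
            reply_to.put(self.memory.get(address))  # reply message carries the data


def remote_read(target, address):
    """Read a remote address by sending a request and waiting for the reply."""
    reply = queue.Queue()
    target.inbox.put((address, reply))
    return reply.get(timeout=5)


def demo():
    node_b = Node("B", {0x10: 42})
    server = threading.Thread(target=node_b.serve)
    server.start()
    value = remote_read(node_b, 0x10)   # the caller never touches node_b.memory directly
    node_b.inbox.put(None)              # stop the server loop
    server.join()
    return value
```

Calling `demo()` returns the value stored at address 0x10 on node B, obtained purely through messages.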
Two categories of parallel computers are discussed below: shared (common) memory and
distributed (unshared) memory.
Categories of Multiprocessor
1. Shared Memory Multiprocessor
• Shared memory parallel computers vary widely, but generally have in common the ability
for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other
processors.
• Shared memory machines can be divided into three categories based upon memory access
times: UMA, NUMA and COMA.
Fig. Shared Memory (NUMA)
Advantages:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to
CPUs
Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more
CPUs can geometrically increase traffic on the shared memory-CPU path, and for cache
coherent systems, geometrically increase traffic associated with cache/memory
management.
• The programmer is responsible for synchronization constructs that ensure "correct" access to
global memory.
• Expense: it becomes increasingly difficult and expensive to design and produce shared
memory machines with ever increasing numbers of processors.
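The synchronization responsibility mentioned in the list above can be illustrated with threads sharing one counter. This is a minimal sketch, assuming Python threads as a stand-in for processors and a mutex as the synchronization construct; `parallel_count` is a hypothetical name.

```python
import threading


def parallel_count(num_threads: int, increments: int) -> int:
    """Have several threads increment one shared counter under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write on shared memory is guarded so that
            # no update from another thread can be lost.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place the result always equals `num_threads * increments`. Without it, the outcome would depend on how the threads interleave.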
2. Distributed Memory Multiprocessor
• Like shared memory systems, distributed memory systems vary widely but share a
common characteristic. Distributed memory systems require a communication network to
connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do not map
to another processor, so there is no concept of global address space across all processors.
• Because each processor has its own local memory, it operates independently.
• Changes it makes to its local memory have no effect on the memory of other processors.
Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated. Synchronization
between tasks is likewise the programmer's responsibility.
• Modern multicomputers use hardware routers to pass messages.
Advantages:
• Memory is scalable with number of processors. Increase the number of processors and
the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the
overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
• The programmer is responsible for many of the details associated with data
communication between processors.
• It may be difficult to map existing data structures, based on global memory, to this
memory organization.
Distributed Shared Memory (DSM) implements the shared memory model in a distributed system that
has no physically shared memory. The shared memory model provides a virtual address space
shared among all nodes. To overcome the high cost of communication in distributed systems, DSM
systems move data to the location of access. Data moves between main memory and secondary
memory (within a node) and between the main memories of different nodes. Every data object is
owned by a node. The initial owner is the node that created the object; ownership can change as
the object moves from node to node. When a process accesses data in the shared address space,
the mapping manager maps the shared memory address to physical memory (local or remote).
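The ownership rule described above can be sketched as a tiny directory that migrates an object to whichever node accesses it. This is a toy model under stated assumptions: `ToyDSM` and its methods are hypothetical, and it covers migration only. Real DSM systems typically work on pages and also handle replication and consistency.

```python
class ToyDSM:
    """Toy DSM: a directory maps each shared object to its current owner node."""

    def __init__(self, node_names):
        self.local = {name: {} for name in node_names}  # per-node physical memory
        self.owner = {}                                 # object key -> owning node

    def create(self, node, key, value):
        """The creating node becomes the initial owner."""
        self.local[node][key] = value
        self.owner[key] = node

    def access(self, node, key):
        """Return the object, migrating it to the accessing node if it is remote."""
        current = self.owner[key]
        if current != node:
            # Move the data to the location of access; ownership moves with it.
            self.local[node][key] = self.local[current].pop(key)
            self.owner[key] = node
        return self.local[node][key]
```

For example, after `dsm = ToyDSM(["A", "B"])` and `dsm.create("A", "x", 7)`, a call to `dsm.access("B", "x")` returns 7 and leaves node B as the owner of `x`.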
DSM permits programs running on separate computers to share data without the programmer
having to worry about sending messages; instead, the underlying technology sends the messages
needed to keep the DSM consistent between computers. DSM also permits programs that used to run
on a single computer to be easily adapted to run on separate computers. Programs access what
appears to them to be ordinary memory. Hence, programs that use DSM are usually shorter and
easier to understand than programs that use message passing. DSM is not appropriate for all
situations, however. Client-server systems are generally less suited to DSM, although a server
may be used to assist in providing DSM functionality for data shared between clients.
Key Characteristics:
Advantages:
2. Efficiency in Repetitive Tasks: Ideal for tasks with repetitive calculations, such as signal
processing and scientific simulations.
3. Scalability: Can scale by increasing the number of PEs, enhancing performance for large-
scale computations.
Applications:
1. Image and Signal Processing: Enhances the speed and efficiency of operations like filtering
and transformations.
2. Scientific Computing: Accelerates simulations and data analysis in fields like weather
modeling and fluid dynamics.
3. Cryptography: Speeds up encryption and decryption processes by handling multiple data
streams simultaneously.
Vector Processor:
Definition: A vector processor is a type of CPU that can execute a single instruction on multiple
data points simultaneously by leveraging vector registers and operations.
Key Characteristics:
1. Vector Instructions: Executes vector instructions that operate on entire vectors (arrays of
data) in a single operation.
2. SIMD (Single Instruction, Multiple Data): Operates on multiple data points with one
instruction, following the SIMD paradigm.
3. Vector Length: Performance can be influenced by the length of the vectors it can process
in one operation.
Advantages:
1. Efficiency in Linear Algebra: Highly efficient for linear algebra operations, crucial in
scientific and engineering applications.
2. Reduced Instruction Overhead: Fewer instructions are needed for operations on large data
sets, improving performance.
3. Simplified Programming: Vectorizing code can be simpler than parallelizing it for multiple
cores or threads.
Applications:
1. Scientific Computation: Ideal for operations involving large matrices and vectors, such as
matrix multiplications and transformations.
2. Graphics and Multimedia: Enhances performance in rendering and image processing by
handling pixel and vertex data in parallel.
3. Machine Learning: Speeds up training and inference phases by accelerating vector and
matrix operations in neural networks.
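The effect of vector length on instruction count can be sketched with strip mining, where a long array is processed in register-sized chunks. This is an illustration in plain Python: `MAX_VECTOR_LENGTH`, `vector_add`, and `strip_mined_add` are hypothetical names, and the list comprehension merely stands in for one hardware vector instruction.

```python
MAX_VECTOR_LENGTH = 4  # assumed number of elements one vector register holds


def vector_add(a, b):
    """Stand-in for one vector instruction: adds up to MAX_VECTOR_LENGTH pairs."""
    assert len(a) == len(b) <= MAX_VECTOR_LENGTH
    return [x + y for x, y in zip(a, b)]


def strip_mined_add(a, b):
    """Add two equal-length arrays; return (result, vector instructions issued)."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    result, instructions = [], 0
    for start in range(0, len(a), MAX_VECTOR_LENGTH):
        end = start + MAX_VECTOR_LENGTH
        # The last chunk may be shorter than the register; slicing handles that.
        result.extend(vector_add(a[start:end], b[start:end]))
        instructions += 1
    return result, instructions
```

Adding two 10-element arrays with a vector length of 4 issues 3 vector instructions instead of 10 scalar additions. This is the reduced instruction overhead the list above refers to.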
Comparison of Array Processors and Vector Processors
Systolic Architecture
In the ever-evolving landscape of computing, the demand for efficient, high-performance
processing has led to the development of various architectural paradigms. Among these, systolic
architecture stands out for its innovative approach to parallel computing, particularly in
applications requiring intensive data processing. Derived from the rhythmic, pulsing nature of its
data flow, much like the beating of a heart, systolic architecture offers a compelling solution for
tasks in digital signal processing and beyond. This essay delves into the key characteristics,
applications, advantages, and challenges of systolic architecture, underscoring its significance in
the realm of specialized computing.
At the core of systolic architecture lies its distinctive regular data flow. Unlike traditional
architectures where data might be fetched from and written back to global memory multiple times,
systolic arrays facilitate a seamless, predictable pattern of data movement through a network of
processing elements (PEs). Each PE performs a small, fixed operation on incoming data and
subsequently passes the result to the next PE in the network. This pipelined data processing model
is reminiscent of the heart's rhythmic pumping action, which efficiently circulates blood through
the body.
One of the hallmarks of systolic architecture is local communication. PEs within a systolic array
communicate solely with their immediate neighbors, significantly reducing the need for complex
communication pathways and extensive global memory access. This localized communication not
only enhances the speed and efficiency of data processing but also simplifies the design and
scalability of the architecture. The modular nature of systolic arrays allows for easy expansion;
larger arrays can be constructed by adding more PEs, thereby enhancing processing power without
complicating the overall design.
Synchrony in systolic architectures is maintained by a global clock that coordinates the operations
of all PEs. This synchronized approach ensures that data moves through the network in a
coordinated manner, minimizing timing issues and maximizing data throughput. The predictability
and regularity of data flow in systolic architectures are especially beneficial for real-time
processing applications, where latency and precise timing are critical.
Systolic architecture finds its most prominent applications in digital signal processing (DSP).
Tasks such as convolution, correlation, and filtering are particularly well-suited to the systolic
model, as the regular data flow and localized processing align perfectly with the repetitive nature
of these operations. Applications in audio and video processing, telecommunications, and radar
systems frequently leverage systolic arrays to achieve high performance and efficiency.
Beyond DSP, systolic architectures excel in matrix operations, making them invaluable for
scientific computing and simulations. The architecture's efficiency in matrix multiplication and
other linear algebra operations also lends itself well to graphics processing and machine learning,
where neural network operations benefit from the high throughput and parallel processing
capabilities of systolic arrays.
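To make the data flow concrete, the following is a cycle-by-cycle software simulation of matrix multiplication on a systolic array. It is a sketch under stated assumptions: it uses one common mapping (an output-stationary array in which PE (i, j) accumulates C[i][j]), rows of A enter from the left and columns of B from the top with a one-cycle skew per row or column, and `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(A, B):
    """Simulate C = A x B on an n-by-p grid of PEs; return (C, cycles used)."""
    n, m, p = len(A), len(B), len(B[0])
    a_reg = [[0] * p for _ in range(n)]  # value each PE currently holds from its left
    b_reg = [[0] * p for _ in range(n)]  # value each PE currently holds from above
    acc = [[0] * p for _ in range(n)]    # running sum kept inside each PE
    cycles = n + m + p - 2               # time for the last skewed operands to meet
    for t in range(cycles):
        new_a = [[0] * p for _ in range(n)]
        new_b = [[0] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                if j == 0:   # left edge: inject row i of A, delayed by i cycles
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < m else 0
                else:        # interior: take the value from the left neighbor
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:   # top edge: inject column j of B, delayed by j cycles
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < m else 0
                else:        # interior: take the value from the neighbor above
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b      # one global clock tick: all PEs update together
        for i in range(n):
            for j in range(p):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # local multiply-accumulate
    return acc, cycles
```

Each PE only ever reads from its left and upper neighbors. The skewed injection ensures that A[i][k] and B[k][j] arrive at PE (i, j) in the same cycle.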
Despite its many advantages, systolic architecture is not without its challenges. One of the primary
limitations is its specialization; while highly optimized for specific tasks, systolic architectures
lack the flexibility needed for general-purpose computing. Additionally, programming for systolic
arrays can be complex, requiring careful attention to data flow and timing to fully exploit the
architecture's potential.
Historically, the concept of systolic architecture was introduced by H.T. Kung and Charles E.
Leiserson in the late 1970s. Since then, it has evolved with advances in VLSI (Very Large-Scale
Integration) technology, allowing for more complex and capable arrays. These advancements have
expanded the applicability and performance of systolic architectures, solidifying their role in
specialized computing domains.
NOTE: Refer to the case study and activity in the Systolic Architecture PPT for practice
problems.
3. Efficiency: SIMD instructions help maximize the utilization of CPU resources by reducing
instruction overhead and improving data locality. This efficiency is crucial for real-time
multimedia applications that require high throughput and low latency.
Overall, SIMD extensions play a vital role in enhancing the performance and efficiency of
multimedia applications, enabling seamless multimedia experiences across a wide range of devices
and platforms.
SIMD (Single Instruction, Multiple Data) optimizations have shown significant performance
impacts across various multimedia workloads because they exploit data parallelism and
instruction-level parallelism and improve memory bandwidth utilization.
Data Parallelism: SIMD instructions allow multiple data elements to be processed simultaneously
within a single instruction. In multimedia workloads such as image processing, video
encoding/decoding, and audio processing, where operations often involve large sets of data (e.g.,
pixels, samples), SIMD can greatly accelerate computations. For instance, SIMD operations can
simultaneously process multiple pixels or audio samples, effectively increasing throughput and
reducing processing time.
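The idea of one operation acting on several pixels or samples at once can be imitated in software by packing four 8-bit values into one 32-bit word, a technique sometimes called SIMD within a register. This is an illustrative sketch with hypothetical helper names. It uses wrapping (modulo 256) addition, whereas multimedia SIMD instruction sets usually also offer saturating variants.

```python
def pack(lanes):
    """Pack four 8-bit values into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def packed_add_u8(a, b):
    """Add four 8-bit lanes at once, wrapping modulo 256 inside each lane."""
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)  # add low 7 bits; carries stay in their lane
    high = (a ^ b) & 0x80808080                # top bit of each lane, with no carry out
    return low ^ high
```

For example, `unpack(packed_add_u8(pack([250, 10, 100, 200]), pack([10, 20, 100, 100])))` gives `[4, 30, 200, 44]`. A single word-sized addition updates four pixel values, and overflow in one lane does not leak into its neighbor.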
Memory Bandwidth Utilization: SIMD optimizations can enhance memory bandwidth utilization
by fetching and processing multiple data elements in parallel. This reduces memory access
overhead and maximizes data throughput between the processor and memory. In multimedia
applications, which often involve intensive data transfer between main memory and processor
caches, efficient memory bandwidth utilization is crucial for maintaining high performance. SIMD
instructions can help minimize memory stalls by processing multiple data elements without
needing to fetch additional instructions from memory.
animations, as well as in accelerating scientific simulations, machine learning, and other data
processing tasks due to their parallel processing capabilities. They are essential components in
modern computers, especially for tasks that require intensive graphical rendering or complex
mathematical calculations.
GPUs (Graphics Processing Units) and CPUs (Central Processing Units) are both essential
components of modern computing systems, but they are optimized for different types of tasks due
to their architectural differences. Here's a comparison of their architectural features in terms of
suitability for parallel processing tasks and how GPU architecture exploits thread-level
parallelism for high performance:
Architecture:
CPU: CPU architecture typically consists of a few powerful cores optimized for sequential
processing. Each core in a CPU is capable of executing a wide range of instructions, including
complex branching and decision-making.
GPU: GPU architecture consists of thousands of smaller, less powerful cores optimized for
parallel processing. These cores are arranged in a highly parallel structure, allowing them to
execute multiple instructions simultaneously.
Parallelism:
CPU: CPUs are designed for task-level parallelism, where each core typically executes a single
thread of instructions at a time. While modern CPUs may have multiple cores to handle multiple
threads simultaneously, the number of cores is usually limited (e.g., 4, 8, or 16 cores).
GPU: GPUs excel at exploiting thread-level parallelism. They are designed to handle thousands
of threads simultaneously, with each thread executing a small portion of the overall task. This
massively parallel architecture allows GPUs to process a large amount of data in parallel, leading
to significant performance gains for parallelizable tasks.
Instruction Set:
CPU: CPUs support a wide range of general-purpose instructions, including arithmetic, logic,
branching, and data movement operations. They are optimized for handling diverse workloads
efficiently.
GPU: GPUs are optimized for handling specific types of computations commonly found in
graphics rendering and parallel processing tasks. They excel at executing arithmetic and memory
operations in parallel across thousands of threads.
Memory Hierarchy:
CPU: CPUs typically have a smaller number of high-speed caches (e.g., L1, L2, L3 caches)
optimized for low-latency access to frequently accessed data. They also have direct access to
system memory (RAM).
GPU: GPUs have their own dedicated memory called VRAM (Video Random Access
Memory), which is optimized for high-bandwidth, parallel access by multiple cores. They use a
hierarchical memory architecture, including registers, shared memory, and global memory, to
manage data access efficiently.
Programming Model:
CPU: Programming for CPUs usually involves sequential programming paradigms, such as
procedural programming or object-oriented programming. Parallelism is often achieved using
techniques like multithreading or multiprocessing.
GPU: Programming for GPUs typically involves parallel programming paradigms, such as
CUDA (Compute Unified Device Architecture) for NVIDIA GPUs or OpenCL (Open Computing
Language) for both NVIDIA and AMD GPUs. These programming models allow developers to
explicitly parallelize their algorithms across thousands of GPU cores, taking advantage of the
GPU's massively parallel architecture.
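The programming model can be imitated in plain Python to show how each thread derives the element it works on. This is a sequential sketch only: `launch`, `add_kernel`, and `gpu_style_add` are hypothetical names, and a real GPU would run the kernel invocations in parallel rather than in nested loops.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Run `kernel` once per (block, thread) pair, mimicking a kernel launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body executed by every thread: each handles one output element."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):                        # guard: the grid may overshoot the data
        out[i] = a[i] + b[i]


def gpu_style_add(a, b, block_dim=4):
    """Add two equal-length lists using the grid/block/thread decomposition."""
    out = [0] * len(a)
    grid_dim = (len(a) + block_dim - 1) // block_dim  # enough blocks to cover all elements
    launch(add_kernel, grid_dim, block_dim, a, b, out)
    return out
```

With five elements and four threads per block, two blocks are launched and the three surplus threads do nothing because of the bounds check. The same guard appears in typical GPU kernels.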
In summary, GPUs and CPUs have distinct architectural features that make them suitable for
different types of tasks. While CPUs excel at handling sequential tasks and diverse workloads,
GPUs are optimized for parallel processing tasks, leveraging their massively parallel architecture
to achieve high performance through thread-level parallelism.
GPU architectures exploit both thread-level parallelism and data parallelism to achieve high
throughput and performance:
Thread-Level Parallelism:
Massively Parallel Cores: GPUs consist of thousands of smaller processing cores, each capable
of executing its own thread. These cores are organized into streaming multiprocessors (SMs), with
each SM containing multiple cores. This architecture enables GPUs to simultaneously execute a
large number of threads.
Simultaneous Multithreading (SMT): Some GPU architectures support simultaneous
multithreading, allowing multiple threads to execute concurrently on each core. This further
increases the level of parallelism by enabling the core to switch between threads when one is
stalled, keeping the core busy with other threads.
Task Scheduling: GPU architectures employ efficient task scheduling mechanisms to maximize
the utilization of processing cores. Threads are scheduled in groups, called warps (NVIDIA) or
wavefronts (AMD), whose threads execute together on the cores of an SM. While one group waits
on memory, the scheduler can switch to another ready group, which helps hide memory latency and
maximize throughput.
Data Parallelism:
SIMD Execution: GPUs use SIMD (Single Instruction, Multiple Data) execution to perform the
same operation on multiple data elements simultaneously. Within a warp or wavefront, threads
execute the same instruction but operate on different data. This allows GPUs to exploit data
parallelism effectively.
Vectorization: Modern GPU architectures support vectorized instructions, which allow multiple
data elements to be processed in parallel using specialized vector units. Vectorization enables
efficient execution of arithmetic and logic operations on large arrays or matrices, common in many
parallel processing tasks.
Memory Coalescing: GPU memory subsystems are designed to efficiently handle data parallelism
by coalescing memory accesses from multiple threads into contiguous memory transactions. This
reduces memory access latency and increases memory throughput, particularly for memory-bound
workloads.
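The benefit of coalescing can be estimated by counting how many aligned memory segments one warp touches. This is a simplified model, not a description of any specific GPU: the 128-byte segment, 32-thread warp, and 4-byte element are assumed values, and the function names are hypothetical.

```python
def warp_addresses(base, stride_elems, elem_bytes=4, warp_size=32):
    """Byte address accessed by each thread of a warp for a given element stride."""
    return [base + t * stride_elems * elem_bytes for t in range(warp_size)]


def count_transactions(addresses, segment_bytes=128):
    """Number of distinct aligned segments needed to serve the warp's accesses."""
    return len({addr // segment_bytes for addr in addresses})
```

Under these assumptions, consecutive accesses (`stride_elems=1`) fit in 1 segment, a stride of 2 needs 2, and a stride of 32 needs 32, one per thread. This is why access patterns in which neighboring threads read neighboring addresses are preferred.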
Optimized Memory Hierarchy:
Shared Memory: GPUs feature fast, on-chip shared memory that allows threads within a thread
block to communicate and synchronize efficiently. Shared memory is typically used for inter-
thread communication and to cache frequently accessed data, reducing the need for expensive
global memory accesses.
Global Memory Access: While GPUs have fast on-chip memory, they also utilize global memory,
which is larger but slower. GPU architectures employ memory hierarchies and caching
mechanisms to minimize the impact of memory latency, such as caching frequently accessed data
in on-chip caches and using memory access patterns that maximize memory bandwidth.
Specialized Hardware Acceleration:
Tensor Cores (in some GPUs): Tensor cores are specialized hardware units designed for
accelerating tensor operations commonly used in deep learning algorithms. These units perform
matrix multiplication and accumulation operations with high throughput, significantly speeding
up deep learning training and inference tasks.
Ray Tracing Cores (in some GPUs): Ray tracing cores are specialized hardware units that
accelerate ray tracing, a rendering technique used to generate realistic lighting and reflections in
computer graphics. These cores optimize the ray tracing process by tracing rays and performing
intersection tests with scene geometry in parallel.
By leveraging thread-level parallelism, data parallelism, and specialized hardware acceleration,
GPU architectures can achieve high throughput and performance across a wide range of parallel
processing tasks, including graphics rendering, scientific simulations, machine learning, and data
analytics.
Loop-level Parallelism
Loop-level parallelism (LLP) enhances performance and efficiency by allowing concurrent
execution of loop iterations. This parallelism is crucial in modern computer architectures,
particularly in high-performance computing (HPC) environments where large-scale computations
are common. By distributing tasks across multiple cores, LLP scales computational workloads,
significantly reducing execution time. This is vital for applications demanding high computational
power, such as scientific simulations, data analysis, and machine learning.
Energy efficiency is another critical advantage of LLP. Parallel execution can lower overall energy
consumption by optimizing power usage based on workload. Moreover, distributing workloads
across multiple cores helps manage thermal output, improving heat dissipation and hardware
longevity. For real-time and interactive systems, LLP enhances responsiveness. Real-time
applications, like embedded systems and robotics, benefit from timely processing of critical tasks,
while user-facing applications see improved responsiveness and user experience.
Implementing LLP involves software and hardware techniques. Software techniques include
compiler optimizations and parallel programming models like OpenMP and CUDA. These tools
help developers write parallel code by providing abstractions and APIs. Hardware support includes
multi-core processors, which enable parallel execution, and SIMD (Single Instruction, Multiple
Data) instructions, allowing simultaneous operations on multiple data points. Modern CPUs also
support out-of-order execution, executing instructions as resources become available rather than
sequentially.
However, effective LLP utilization requires careful algorithm design to manage data dependencies
and create parallel-friendly structures. Algorithms must be decomposable into independent units,
and developers must handle true dependencies where one iteration depends on another. Debugging
parallel code is challenging due to concurrency issues like race conditions and deadlocks.
Specialized tools and frameworks, such as Intel Parallel Studio and Valgrind, assist in detecting
and resolving these issues.
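The difference between a parallelizable loop and one with a true dependence can be sketched as follows. This is a structural illustration with hypothetical function names. In standard CPython builds the global interpreter lock prevents threads from speeding up CPU-bound work, so the example shows how iterations are distributed, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Loop body with no dependence on any other iteration."""
    return x * x


def parallel_squares(values, workers=4):
    """Independent iterations: each may run on any worker, in any order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))  # map returns results in input order


def running_sum(values):
    """Iteration i needs the total from iteration i-1: a true (loop-carried) dependence."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out
```

`parallel_squares` can be split across workers as is, because no iteration reads another's result. `running_sum` cannot be split in this form and would first need to be restructured, for example as a parallel prefix sum.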
Overall, LLP is essential for achieving higher performance, energy efficiency, and responsiveness
in modern computing. It leverages both software and hardware advancements, driving significant
improvements and pushing the boundaries of computational capabilities.