
Parallel Computing and Supercomputers

The document discusses the shift from single processor performance to parallel computing due to limitations like the Power Wall, Memory Wall, and ILP Wall. It explains Amdahl's Law, which illustrates the speedup limitations in parallel processing based on the fraction of serial code, and categorizes parallel architectures using Flynn's Taxonomy. Additionally, it covers memory organization in parallel systems, cache coherence issues, and modern trends in interconnection networks for high-performance computing.

The quest for performance has shifted from making a single processor faster to
using multiple processors together. This is due to physical and economic limits:
• Power Wall: Increasing clock speed dramatically increases power
  consumption and heat.
• Memory Wall: The speed of memory has not kept up with processor speed.
• ILP Wall: It's becoming increasingly difficult to find more instructions to
  execute simultaneously in a single stream of code.
Solution: Divide a large problem into smaller sub-problems and solve them
concurrently using multiple processing elements.
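
The divide-and-solve-concurrently idea can be sketched in Python. This is a minimal illustration, not a benchmark: the names `chunked` and `parallel_sum` and the worker count of 4 are assumptions made for the example. Because CPython threads share one interpreter lock, the threads here show the decomposition pattern (split, solve concurrently, combine) rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into n_chunks contiguous, nearly equal sub-problems."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, n_workers=4):
    """Sum each chunk in its own worker, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunked(data, n_workers)))
    return sum(partial_sums)  # the final combine step is serial


numbers = list(range(1, 1001))
total = parallel_sum(numbers)
```

Note that the final combine step runs on a single worker, so even this small example contains a serial portion.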

What is Parallel Architecture?


A parallel computer is a collection of processing elements that cooperate to solve
large problems fast. The key components are:
• Processors: The units that perform computations.
• Memory: Where data and instructions are stored.
• Interconnection Network: The system that allows processors to
  communicate with each other and with memory.
The Fundamental Goal: To achieve a speedup, where the time on a parallel system
is less than the time on a single (sequential) processor.

Amdahl's Law Example (S = 0.3)


This example shows how Amdahl's Law limits the speedup of a program when 30% of
the code is serial (S = 0.3) and 70% is parallel (1-S = 0.7).

Formula
Amdahl’s Law states:

Speedup(P) = 1 / (S + (1-S)/P)

where:
• S = fraction of code that is serial
• (1-S) = fraction of code that is parallel
• P = number of processors

1. One Processor (P = 1)
Speedup(1) = 1 / (0.3 + 0.7/1)
= 1 / (0.3 + 0.7)
= 1 / 1
= 1

As expected, with only one processor there is no speedup.

2. Four Processors (P = 4)
Speedup(4) = 1 / (0.3 + 0.7/4)
= 1 / (0.3 + 0.175)
= 1 / 0.475
≈ 2.11

This is less than the ideal 4× speedup, because 30% of the program is still sequential.

3. One Hundred Processors (P = 100)


Speedup(100) = 1 / (0.3 + 0.7/100)
= 1 / (0.3 + 0.007)
= 1 / 0.307
≈ 3.26

Even with 100 processors, the speedup is only about 3.26×, far below the ideal 100×.

Maximum Speedup
As P → ∞ (infinite processors):

Speedup_max = 1 / S

If S = 0.3:
Speedup_max = 1 / 0.3 ≈ 3.33

Even with a million processors, you cannot exceed a 3.33× speedup when 30% of the
code is serial.

Conclusion
Amdahl’s Law shows that the serial portion (S) is the bottleneck in parallel computing.
Adding more processors improves performance only up to a limit. Reducing S is critical for
achieving higher speedup.
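
The worked values above can be checked with a few lines of Python. This is a small sketch of the formula as stated in this document; the function names `amdahl_speedup` and `max_speedup` are chosen here for illustration.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup(P) = 1 / (S + (1 - S) / P)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def max_speedup(serial_fraction):
    """Limit of the speedup as P grows without bound: 1 / S."""
    if serial_fraction == 0:
        return float("inf")  # a fully parallel program has no upper bound
    return 1.0 / serial_fraction


S = 0.3
for p in (1, 4, 100, 1_000_000):
    print(f"P = {p:>9,}: speedup = {amdahl_speedup(S, p):.2f}")
print(f"Limit as P grows: {max_speedup(S):.2f}")
```

Trying a smaller serial fraction, for example `amdahl_speedup(0.05, 100)`, shows how strongly the result depends on S.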
Parallel Architecture: Flynn's Taxonomy & The Memory Hierarchy
1- Classifying Parallel Systems: Flynn's Taxonomy
This is the classic model for categorizing computer architectures based on the number of
instruction and data streams.

Category   Instruction Streams   Data Streams   Example
SISD       Single                Single         Traditional Uniprocessor
SIMD       Single                Multiple       GPUs, Vector Processors
MISD       Multiple              Single         Rarely Used (Theoretical)
MIMD       Multiple              Multiple       Modern Multicore CPUs, Clusters

• SIMD (Single Instruction, Multiple Data): All processors execute the same
  instruction simultaneously, but on different data elements. Excellent for data-parallel
  tasks (e.g., image processing, scientific simulations).

• MIMD (Multiple Instruction, Multiple Data): Each processor executes its own
  instruction stream on its own data. This is the most common and flexible model,
  encompassing modern multicore CPUs and supercomputers.

2- Memory Organization

How do the multiple processors in a system share and access memory? This
leads to the two primary architectural models.

• Shared Memory (Tightly Coupled):
  o All processors share a single, unified address space.
  o Communication between processors is implicit; they simply read from and
    write to the same memory locations.
  o Advantage: Ease of programming.
  o Challenge: Requires cache coherency.

• Distributed Memory (Loosely Coupled):
  o Each processor has its own private memory.
  o There is no global address space.
  o Communication between processors is explicit, via passing messages over a
    network (e.g., using MPI - Message Passing Interface).
  o Advantage: Scalable to a very large number of processors.
  o Challenge: More complex programming.
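
The difference between implicit and explicit communication can be sketched in Python. This is only an analogy under stated assumptions: threads inside one process stand in for processors, a shared dictionary stands in for shared memory, and a `queue.Queue` stands in for the network. A real distributed-memory program would use a message-passing library such as MPI instead. All variable and function names are chosen for the example.

```python
import queue
import threading

# Shared-memory style: workers communicate implicitly through one shared
# variable, so access must be synchronized with a lock.
shared = {"total": 0}
lock = threading.Lock()


def shared_worker(values):
    for v in values:
        with lock:  # without the lock, concurrent updates could be lost
            shared["total"] += v


# Message-passing style: each worker keeps its data private and sends an
# explicit message (its partial result) to the receiver.
mailbox = queue.Queue()


def message_worker(values):
    private_total = sum(values)  # private data, nothing is shared
    mailbox.put(private_total)   # explicit send


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

threads = [threading.Thread(target=shared_worker, args=(part,)) for part in data]
threads += [threading.Thread(target=message_worker, args=(part,)) for part in data]
for t in threads:
    t.start()
for t in threads:
    t.join()

shared_total = shared["total"]
message_total = sum(mailbox.get() for _ in data)  # explicit receives
```

Both styles produce the same total. The shared version needs less code but depends on correct locking, while the message version makes every exchange of data visible in the program.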


Shared Memory Multiprocessors

Shared-memory multiprocessors are further divided based on how memory is
physically connected:

• UMA (Uniform Memory Access):
  o Access time to any memory location is the same for all processors.
  o Often connected via a bus or a crossbar.
  o Also known as SMP (Symmetric Multiprocessor).
  o Limitation: The shared bus becomes a bottleneck as more processors are
    added.

• NUMA (Non-Uniform Memory Access):
  o Access time depends on the memory location relative to the processor.
  o Physically distributed, but logically shared memory.
  o Example: A multi-socket server where each CPU socket has its own local
    memory. Accessing local memory is fast; accessing another socket's memory
    ("remote" memory) is slower.
  o Advantage: More scalable than UMA.
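
The cost of remote accesses on a NUMA system can be estimated with a weighted average. The sketch below is a simple model, not a measurement: the latencies of 100 ns (local) and 200 ns (remote) are hypothetical values chosen only for illustration, and the function name `average_access_time` is an assumption of this example.

```python
def average_access_time(local_ns, remote_ns, local_fraction):
    """Weighted average latency for a mix of local and remote accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, chosen only for illustration.
LOCAL_NS = 100.0
REMOTE_NS = 200.0

for fraction in (1.0, 0.9, 0.5):
    avg = average_access_time(LOCAL_NS, REMOTE_NS, fraction)
    print(f"{fraction:.0%} local accesses -> {avg:.0f} ns on average")
```

Under this model, keeping data in the memory attached to the socket that uses it lowers the average access time, which is why data placement matters on NUMA machines.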


The Cache Coherence Problem:
In shared memory systems, each processor has a cache. If Processor A writes to address X, how
does Processor B know its cached copy of X is now stale? This is solved by a cache coherence
protocol, most commonly MESI (Modified, Exclusive, Shared, Invalid).
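
The state changes behind MESI can be sketched with a toy model. This is a simplified illustration for a single memory address: it tracks only the M/E/S/I state held by each cache and omits bus transactions, data transfer, and write-back timing that a real protocol must handle. The class name `CacheLine` and its methods are assumptions of this example.

```python
class CacheLine:
    """Toy MESI tracker for one memory address shared by several caches."""

    def __init__(self, cpus):
        # Every cache starts without a valid copy of the line.
        self.state = {cpu: "I" for cpu in cpus}

    def read(self, cpu):
        if self.state[cpu] != "I":
            return  # read hit: M, E, and S copies are all readable
        others = [c for c in self.state if c != cpu and self.state[c] != "I"]
        for other in others:
            # An M copy would be written back first; M and E both drop to S.
            self.state[other] = "S"
        # With no other valid copy, the reader holds the line exclusively.
        self.state[cpu] = "S" if others else "E"

    def write(self, cpu):
        for other in self.state:
            if other != cpu:
                self.state[other] = "I"  # invalidate stale copies
        self.state[cpu] = "M"


line = CacheLine(["A", "B"])
line.read("A")    # A: E, B: I  (A holds the only copy)
line.read("B")    # A: S, B: S  (both share a clean copy)
line.write("A")   # A: M, B: I  (B's copy is invalidated)
after_write = dict(line.state)
line.read("B")    # A: S, B: S  (A supplies the data and downgrades)
```

The write by Processor A is the step that answers the question above: B's copy is marked Invalid, so B's next read misses and fetches the current value instead of using stale data.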

Distributed Memory Systems (Clusters)

• A cluster is a group of independent computers (nodes) connected by a
  high-speed network (e.g., InfiniBand, Ethernet).

• Each node runs its own operating system.

• This is the architecture of most of the world's Top500 supercomputers.

• Programming is typically done using the Message Passing Model.

The Hybrid Model


Modern high-performance computing (HPC) often uses a hybrid approach:

• Distributed Memory across Nodes: A cluster of many nodes.

• Shared Memory within a Node: Each node is a multicore, shared-memory
  computer (e.g., a dual-socket server with 64 cores per socket).

• Programming: Use MPI for communication between nodes and OpenMP (a
  shared-memory API) for parallelism within a node.

The Interconnection Network & Modern Trends

Interconnection Network:
When we have a large computing system such as a supercomputer or a cluster, the
processors or nodes need to communicate with each other.

This communication is done through special high-speed networks that connect the
processors, memory, and nodes.

Examples: Bus, Ring, Mesh, Torus, Hypercube, Fat-tree – these are all types of
interconnection networks.

The goal: Transfer data quickly and efficiently between the system’s components so that it
works as if it were one giant computer.
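
One way to compare these topologies is by diameter, the largest number of hops a message needs between any two nodes. The sketch below uses the standard diameter formulas for a ring, a square 2D mesh, a square 2D torus, and a hypercube; the function names and the 64-node sizes are chosen here for illustration.

```python
def ring_diameter(n):
    """Longest shortest path in a ring of n nodes."""
    return n // 2


def mesh_diameter(k):
    """k x k 2D mesh without wraparound links."""
    return 2 * (k - 1)


def torus_diameter(k):
    """k x k 2D torus: a mesh with wraparound links in both dimensions."""
    return 2 * (k // 2)


def hypercube_diameter(d):
    """d-dimensional hypercube with 2**d nodes."""
    return d


def hypercube_neighbors(node, d):
    """Neighbors differ from node in exactly one of the d address bits."""
    return [node ^ (1 << bit) for bit in range(d)]


# Compare 64-node networks built with each topology.
for name, hops in [
    ("ring (64 nodes)", ring_diameter(64)),
    ("2D mesh (8 x 8)", mesh_diameter(8)),
    ("2D torus (8 x 8)", torus_diameter(8)),
    ("hypercube (6 dims)", hypercube_diameter(6)),
]:
    print(f"{name:<20} diameter = {hops} hops")
```

A smaller diameter generally means fewer hops in the worst case, but it usually comes at the cost of more links per node, so topology choice is a trade-off between latency and hardware cost.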

Modern Trends:
This refers to the latest advancements in the design of interconnection networks:
• Very high speeds (such as InfiniBand, NVLink), which are much faster than
  regular Ethernet.

• Reduced latency, so that small messages can be delivered in microseconds
  rather than milliseconds.

• Scalability, meaning the network can support thousands or even millions of
  processors without performance collapse.

• Specialized networks for AI and GPU clusters, such as NVIDIA NVSwitch and
  Google TPU interconnect.
