Advanced Computer Architecture Overview
Advanced Computer Architecture Overview
io
INSTAGRAM: [Link]/hemanthraj_hemu/
INSTAGRAM: [Link]/futurevisionbie/
17CS72
MODULE-1
Theory of Parallelism
1 Conditions of Parallelism 1
Program Partitioning and Scheduling 1
Program Flow Mechanisms, System Interconnect
1
Architectures
Principles of Scalable Performance 1
Performance Metrics and Measures, Parallel Processing 1
Applications
Speedup Performance Laws, Scalability Analysis and 1
Approaches
[Link]
Department of Computer Science and Engg
10/1/2020 2
Theory of Parallelism
[Link]
Department of Computer Science and Engg
10/1/2020 3
THE STATE OF COMPUTING: Computer Development Milestones
• How it all started…
500 BC: Abacus (China) – The earliest mechanical computer
/calculating device.
Operated to perform decimal arithmetic with carry propagation digit by digit
1642: Mechanical Adder/Subtractor (Blaise Pascal)
1827: Difference Engine (Charles Babbage)
1941: First binary mechanical computer (Konrad Zuse; Germany)
1944: Harvard Mark I (IBM)
The very first electromechanical decimal computer as proposed by Howard Aiken
• Computer Generations
1 st 2 nd 3 rd 4 th 5 th
Division into generations marked primarily by changes in
hardware and software technologies
[Link]
Department of Computer Science and Engg
10/1/2020 4
THE STATE OF COMPUTING: Computer Generations
[Link]
Department of Computer Science and Engg
10/1/2020 5
THE STATE OF COMPUTING: Computer Generations
[Link]
Department of Computer Science and Engg
10/1/2020 6
THE STATE OF COMPUTING: Computer Generations
[Link]
Department of Computer Science and Engg
10/1/2020 7
THE STATE OF COMPUTING: Computer Generations
[Link]
Department of Computer Science and Engg
10/1/2020 8
THE STATE OF COMPUTING: Computer Generations
[Link]
Department of Computer Science and Engg
10/1/2020 10
THE STATE OF COMPUTING: Elements of Modern Computers
Computing problems
[Link]
Department of Computer Science and Engg
10/1/2020 11
THE STATE OF COMPUTING: Elements of Modern Computers
Algorithms and Data Structures
[Link]
Department of Computer Science and Engg
10/1/2020 12
THE STATE OF COMPUTING: Elements of Modern Computers
Hardware Resources
[Link]
Department of Computer Science and Engg
10/1/2020 13
THE STATE OF COMPUTING: Elements of Modern Computers
Operating System
• Operating systems manage the allocation and deallocation of resources during user
program execution.
• An OS plays a significant role in mapping hardware resources to algorithmic and data
structures.
– Implementation of Mapping:
• Relies on efficient Compiler and Operating system support.
• Parallelism can be exploited at Algorithm :
– Design Time
– Program Time
– Compile Time
– Run Time
[Link]
Department of Computer Science and Engg
10/1/2020 14
THE STATE OF COMPUTING: Elements of Modern Computers
System Software Support
[Link]
Department of Computer Science and Engg
10/1/2020 15
THE STATE OF COMPUTING: Elements of Modern Computers
Complier Support
[Link]
Department of Computer Science and Engg
10/1/2020 16
THE STATE OF COMPUTING: Evolution of Computer Architecture
[Link]
Department of Computer Science and Engg
10/1/2020 17
THE STATE OF COMPUTING: Evolution of Computer Architecture
[Link]
Department of Computer Science and Engg
10/1/2020 18
THE STATE OF COMPUTING: Evolution of Computer Architecture
Look-ahead, parallelism and pipelining
• Look-ahead
– Introduced to pre-fetch Instruction to overlap Instruction Fetch & Execute
– They enable function parallelism
• Parallelism
– Supported by two approaches:
• Use of multiple functional units simultaneously.
• Practicing pipelining at various processing levels Includes
– Pipelined Instruction execution
– Pipelined arithmetic computations
– Memory access operations
• Pipelining
– Performs identical operations repeatedly on vector data strings
– Uses Scalar pipeline processors
[Link]
Department of Computer Science and Engg
10/1/2020 19
THE STATE OF COMPUTING: Evolution of Computer Architecture
Flynn’s classification
• This taxonomy distinguishes multi-processor computer architectures according to two
independent dimensions :
• Instruction stream
– Sequence of instructions executed by machine
• Data stream.
– Sequence of data including input, partial or temporary results used by instruction stream
• Each of these dimensions can have only one of two possible states:
– Single or Multiple
• Flynn’s classification depends on the distinction between :
– The performance of control unit
– The data processing unit rather than its operational and structural interconnections.
• Following are the four category of Flynn classification:
– SISD(Single instruction stream, single data stream)
– SIMD(Single instruction stream, multiple data stream)
– MIMD(Multiple instruction stream, multiple data stream)
– MISD(Multiple instruction stream, single data stream)
[Link]
Department of Computer Science and Engg
10/1/2020 20
THE STATE OF COMPUTING: Evolution of Computer Architecture
[Link]
Department of Computer Science and Engg
10/1/2020 21
THE STATE OF COMPUTING: Evolution of Computer Architecture
[Link]
Department of Computer Science and Engg
10/1/2020 22
THE STATE OF COMPUTING: Evolution of Computer Architecture
Parallel/Vector Computers
Parallel computers are those that execute programs in MIMD mode.
• There are two major classes of parallel computers
– Shared-memory multiprocessors
– message-passing multicomputers.
• The major distinction between multiprocessors and multicomputers lies :
– Memory sharing
• Multiprocessors: Processors communicate through shared variable in a common memory
• Multicomputer: Each computer has a local memory unshared with other computers.
– Inter-processor communication
• Through message passing for a multi computer
Vector Processors:
• Equipped with multiple vector pipelines that can be concurrently used under
hardware or firmware control
• Two families :
– Memory-to-memory architecture
– Register-to-register architecture
[Link]
Department of Computer Science and Engg
10/1/2020 23
THE STATE OF COMPUTING: Evolution of Computer Architecture
Development Layers
Programmers View
Point these 2 layers
should be architecture
transparent
As a good computer architect One has to approach problem from both the ends
[Link]
Department of Computer Science and Engg
10/1/2020 24
THE STATE OF COMPUTING: System Attributes Affecting Performance
[Link]
Department of Computer Science and Engg
10/1/2020 25
THE STATE OF COMPUTING: System Attributes Affecting Performance
Performance Indicators:
• The simplest measure of program performance is the turnaround time , which
includes :
– Disk and memory accesses
– Input and output
– Compilation time
– Operating system overhead
– CPU time
• Since I/O and system overhead frequently overlaps processing by other programs,
it is fair to consider only :
– The CPU time used by a program.
– The user CPU time is the most important factor.
[Link]
Department of Computer Science and Engg
10/1/2020 26
THE STATE OF COMPUTING: System Attributes Affecting Performance
System Attributes:
Total CPU time can be used as a basis in estimating the execution rate of a
processor.
[Link]
Department of Computer Science and Engg
10/1/2020 29
THE STATE OF COMPUTING: System Attributes Affecting Performance
System Attributes:
[Link]
Department of Computer Science and Engg
10/1/2020 30
THE STATE OF COMPUTING: System Attributes Affecting Performance
MIPS Rate:
If C is the total number of clock cycles needed to execute a given
program =>
• Then total CPU time can be estimated as:
– T = C × τ = C / f.
• Other relationships are easily observed:
– CPI = C / Ic
– T =Ic × CPI × τ
– T =Ic × CPI / f
Processor speed is often measured in terms of millions of
instructions per second, frequently called the MIPS rate of the
processor
[Link]
Department of Computer Science and Engg
10/1/2020 31
THE STATE OF COMPUTING: System Attributes Affecting Performance
MIPS Rate:
Note:
All four system attributes (instruction set, compiler, processor, and memory
technologies) affect the MIPS rate, which varies also from program to program.
[Link]
Department of Computer Science and Engg
10/1/2020 32
THE STATE OF COMPUTING: System Attributes Affecting Performance
Throughput Rate:
The number of programs a system can execute per unit time,
Ws, in programs per second.
CPU throughput, Wp, is defined as :
Wp = f/(Ic * CPI)
In a multiprogrammed system, the system throughput is often
less than the CPU throughput.
[Link]
Department of Computer Science and Engg
10/1/2020 33
THE STATE OF COMPUTING: System Attributes Affecting Performance
Programming Enviroments:
• Programmability depends on the :
– Programming environment provided to the users.
• Conventional computers are used in a sequential programming
environment
– With tools developed for a uniprocessor computer.
• Parallel computers need parallel tools:
– That allow specification or easy detection of parallelism
– Operating systems that can perform parallel scheduling of:
• Concurrent events
• Shared memory allocation
• Shared peripheral
• Communication links.
[Link]
Department of Computer Science and Engg
10/1/2020 34
THE STATE OF COMPUTING: System Attributes Affecting Performance
Approach of Parallelism
[Link]
Department of Computer Science and Engg
10/1/2020 35
THE STATE OF COMPUTING: System Attributes Affecting Performance
Approach of Parallelism
[Link]
Department of Computer Science and Engg
10/1/2020 36
Multiprocessors and Multicomputers
Categories of Parallel Computers:
• Considering their architecture only, there are two main categories of
parallel computers:
– Systems with shared common memories(Shared Memory Multiprocessors)
• The UMA Model
• The NUMA Model
• The COMA Model
• The CC-NUMA Model
– Systems with unshared distributed memories(Distributed-Memory
Multicomputers)
• The NORMA Machines
• Message Passing multicomputers
– Taxonomy of MIMD Computers
– Representative Systems
• Multiprocessors: BBN TC-200, MPP, S-81, IBM ES/9000 Model 900/VF,
• [Link]
Multicomputers: Intel Paragon XP/S, nCUBE/2 6480, SuperNode 1000, CM5, KSR-1
Department of Computer Science and Engg
10/1/2020 37
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors( Uniform Memory Access Model UMA Model)
• Physical memory uniformly shared by
all processors
– Has equal access time to all words.
• Processors may have local cache
memories.
• Peripherals also shared in some
fashion.
• Tightly coupled systems use a :
– common bus, crossbar, or multistage
network to connect processors,
peripherals, and memories.
• Many manufacturers have
multiprocessor (MP) extensions of
uniprocessor (UP) product lines.
[Link]
Department of Computer Science and Engg
10/1/2020 38
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)
• Symmetric MP systems :
– All processors have access to all peripherals
– Any processor can run the OS and I/O device
drivers.
• Asymmetric MP systems :
– Not all peripherals accessible by all processors
– Kernel runs only on selected processors (master)
– Others are
[Link]
called attached processors (AP).
Department of Computer Science and Engg
10/1/2020 39
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)
[Link]
Department of Computer Science and Engg
10/1/2020 40
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)
[Link]
Department of Computer Science and Engg
10/1/2020 44
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(Cache Only Memory Access Model -COMA)
[Link]
Department of Computer Science and Engg
10/1/2020 45
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors
(Cache Coherent –Non Uniform Memory Access CC-NUMA)
[Link]
Department of Computer Science and Engg
10/1/2020 46
Multiprocessors and Multicomputers:
Distributed Memory Multicomputers
• Multicomputer consist of multiple
computers/nodes, interconnected by a
[Link]
Department of Computer Science and Engg
10/1/2020 49
Taxonomy of MIMD Computers
Scalable
Not Scalable
Scalable
[Link]
Department of Computer Science and Engg
10/1/2020 50
Multivector and SIMD Computers
[Link]
Department of Computer Science and Engg
10/1/2020 51
Multivector and SIMD Computers
Vector computer Is built on top of Scalar
Vector Supercomputers Processor.
[Link]
Department of Computer Science and Engg
10/1/2020 52
Multivector and SIMD Computers
Vector Supercomputers
Step 1-2: Program & data are first loaded into the Main Memory through a Host
computer.
Step 3: All instructions are first decoded by the Scalar Control Unit.
Step 5: If the instructions are decoded as a Vector operation, it will be sent to the vector
control unit.
Step 6: Vector control unit will supervise the flow of vector data between the main
memory and vector functional pipelines.
Note: A number of vector functional pipelines may be built into a Vector Processor.
[Link]
Department of Computer Science and Engg
10/1/2020 53
Multivector and SIMD Computers
Vector Supercomputers - 2 Models
Model 1:Register-to-register architecture
• Which are used to hold all vector operands, intermediate, and final vector
results.
– Stores into the main memory in large units superwords (e.g.512 bits)
[Link]
Department of Computer Science and Engg
10/1/2020 54
Multivector and SIMD Computers
SIMD Supercomputers
Single Instruction Stream and over multiple data stream
[Link]
Department of Computer Science and Engg
10/1/2020 55
Multivector and SIMD Computers
SIMD Supercomputers
An operational model of an SIMD computer is specified by a 5-tuple: (N, C, I, M, R)
(3) I = Set of instructions broadcast by the CU to all PEs for parallel execution.
• Include: Arithmetic, logic, data routing, masking,
• And other local operations executed by each active PE over data within that PE.
Objective:
[Link]
Department of Computer Science and Engg
10/1/2020 58
PRAM and VLSI Models
Parallel Random Access Machines :
o NP Completeness:
» An algorithm has polynomial complexity if there exits polynomial p(s) and its
time complexity O(p(s)) for a problem size s.
» P ⊂ NP
[Link]
Department of Computer Science and Engg
10/1/2020 59
PRAM and VLSI Models
PRAM Models
Conventional Uniprocessor computers
• Modelled as random access machines (RAM)
Parallel random access machines
• Modelled to idealize parallel computers with zero synchronization / memory access
overhead.
• Used parallel algorithm development, scalability & complexity analysis.
• Globally addressable shared memory distributed among n Processor Elements (PEs) /
centralized.
• PEs operate on synchronized read memory, compute, write memory cycle.
Four memory-update options in PRAMs:
Exclusive Read (ER): Read takes 1 Cycle atmost 1 Processor
Exclusive Write (EW): Write takes 1 Cycle atmost 1 Processor
Concurrent read (CR):Read of a memory location takes Same Cycle for multiple Processor
Concurrent Write (CW):Simultaneously Writes to a same memory location conflict need to be
avoided
[Link]
Department of Computer Science and Engg
10/1/2020 60
PRAM and VLSI Models
PRAM Variants:- Classified based on Memory read/write : Four memory-update
1. EREW PRAM model:
Prohibits more than 1 processor from reading or writing simultaneously to same cell.
2. CREW PRAM model:
Mutual exclusion used to avoid write conflicts.
Allows concurrent reads from same memory.
3. ERCW PRAM model:
Allows exclusive reads/ concurrent writes from same memory.
4. CRCW PRAM model:
Allows either concurrent reads/ concurrent writes from same memory.
[Link]
Department of Computer Science and Engg
10/1/2020 61
PRAM and VLSI Models
Discrepancy with Physical Models of PRAM Variants:-
[Link]
Department of Computer Science and Engg
10/1/2020 62
PRAM and VLSI Models
VLSI Complexity Model:-
Parallel computers rely on the use of VLSI chip
To fabricate components such as Processor arrays, Memory arrays, Switching networks
etc.
Nowadays, VLSI technologies are 2-dimensional.
The size of a VLSI chip is proportional to the amount of storage (memory) space
available in that chip.
We deal with AT2 model (2-dimension)
We can calculate the space complexity of an algorithm by :
The chip area (A) of the VLSI chip implementation of that algorithm.
If T is the time (latency) needed to execute the algorithm
Then A.T gives an upper bound on the total number of bits processed through the chip (or
I/O).
For certain computing, there exists a lower bound, f(s), such that
A.T2 >= O (f(s))
Where A=chip area and T=time
[Link]
Department of Computer Science and Engg
10/1/2020 63
PRAM and VLSI Models
VLSI Complexity Model:-
Memory Bound on Chip Area:
Many computations are memory bound has there will be a need in processing large data
sets
Memory requirement of computation sets a lower bound on the chip area (A)
Amount of information processed by the chip can be visualized as information flow
upward across the chip area
[Link]
Department of Computer Science and Engg
10/1/2020 64
PRAM and VLSI Models
VLSI Complexity Model:-
Bisection Communication Bound,
The bisection is represented by the vertical slice across the chip area A.
[Link]
Department of Computer Science and Engg
10/1/2020 65
Chapter-2
Program and Network Properties
[Link]
Department of Computer Science and Engg
10/1/2020 66
Conditions of Parallelism
Three key areas for significant progress in parallel processing:
Computational models for parallel computing
[Link]
Department of Computer Science and Engg
10/1/2020 67
Conditions of Parallelism
Data dependence and resource dependence
The ability to execute several program segments in parallel:
• Requires each segment to be independent of the other segments.
• We use a dependence graph to describe the relations.
• The nodes of a dependence graph correspond to the program statement
(instructions)
• Directed edges with different labels are used to represent the ordered
relations among the statements.
• The analysis of dependence graphs shows where opportunity exists for
parallelization.
[Link]
Department of Computer Science and Engg
10/1/2020 68
Conditions of Parallelism
Data dependence and resource dependence
Data Dependence : The ordering relationship between statements is
indicated by the data dependence.
Five Types:
1. Flow dependence
2. Antidependence
3. Output dependence
4. I/O dependence
5. Unknown dependence
[Link]
Department of Computer Science and Engg
10/1/2020 69
Conditions of Parallelism
Data dependence and resource dependence
1. Flow dependence : A statement S2 is flow dependent on S1 RAW
[Link]
Department of Computer Science and Engg
10/1/2020 70
Conditions of Parallelism
Data dependence and resource dependence
3. Output dependence : Two statements are output dependent if they produce
(write) the same output variable. Also called WAW hazard and denoted as
[Link]
Department of Computer Science and Engg
10/1/2020 71
Conditions of Parallelism
Data dependence and resource dependence
5. Unknown dependence: The dependence relation between two statements
cannot be determined in the following situations:
[Link]
Department of Computer Science and Engg
10/1/2020 72
Conditions of Parallelism
Data dependence and resource dependence
Control Dependence: Dependency is know at Run time
– Data and control dependencies are based on the independence of the work
to be done.
[Link]
Department of Computer Science and Engg
10/1/2020 74
Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 1 : Bernstein’s conditions are a set of conditions which
must exist if two processes can execute in parallel.
– Notation
[Link]
Department of Computer Science and Engg
10/1/2020 75
Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 2 : In terms of data dependencies, Bernstein’s conditions
imply that two processes can execute in parallel if they are:
(Pi || Pj implies Pj || Pi ),
[Link]
Department of Computer Science and Engg
10/1/2020 76
Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism : Hardware parallelism is defined by machine architecture
and hardware multiplicity.
– It can be characterized by the number of instructions that can be issued per
machine cycle.
– If a processor issues k instructions per machine cycle, it is called a k-issue
processor.
– Conventional processors are one-issue machines.
– A machine with n processors with k-issue should be able to handle a
maximum of nk threads simultaneously.
– Examples:
• Intel i960CA is a three-issue processor (arithmetic, memory access, branch)
• IBM RS-6000 is a four-issue processor (arithmetic, floating-point, memory access,
branch)
[Link]
Department of Computer Science and Engg
10/1/2020 77
Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism :
[Link]
Department of Computer Science and Engg
10/1/2020 78
Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism :
[Link]
Department of Computer Science and Engg
10/1/2020 79
Conditions of Parallelism
Hardware and Software Parallelism
Software Parallelism :
– Software parallelism is defined by the control and data dependence of programs,
– Is revealed in the program’s flow graph.
– It is a function of algorithm, programming style, and compiler optimization.
[Link]
Department of Computer Science and Engg
10/1/2020 80
Conditions of Parallelism
Hardware and Software Parallelism
Types of Software Parallelism :
1. Control Parallelism:
• Two or more operations can be performed simultaneously.
• This can be detected by a compiler, or a programmer can explicitly indicate control
parallelism
– By using special language constructs or dividing a program into multiple
processes.
2. Data parallelism:
• Multiple data elements have the same operations applied to them at the same
time.
• This offers the highest potential for concurrency (in SIMD and MIMD modes).
• Synchronization in SIMD machines handled by hardware.
[Link]
Department of Computer Science and Engg
10/1/2020 81
Conditions of Parallelism
Hardware and Software Parallelism
The Role of Compilers :
– Compilers used to exploit hardware features to improve performance.
– It is not necessarily the case that more software parallelism will improve performance
in conventional scalar processors.
[Link]
Department of Computer Science and Engg
10/1/2020 82
Conditions of Parallelism
Program Partitioning & Scheduling
Program Partitioning & Scheduling :
[Link]
Department of Computer Science and Engg
10/1/2020 83
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
– The sizes are roughly classified using the term “granule size,” or simply “granularity”
– The simplest measure, for example, is the number of instructions in a program part
– Grain sizes are usually described as fine, medium or coarse, depending on the level of
parallelism involved
[Link]
Department of Computer Science and Engg
10/1/2020 84
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Parallelism has exploited at various Processing levels:
[Link]
Department of Computer Science and Engg
10/1/2020 85
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Instruction Level Parallelism:
– Advantages:
• There are usually many candidates (Programs) for parallel execution
[Link]
Department of Computer Science and Engg
10/1/2020 86
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Loop Level Parallelism:
[Link]
Department of Computer Science and Engg
10/1/2020 87
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Procedure Level Parallelism:
[Link]
Department of Computer Science and Engg
10/1/2020 88
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
[Link]
Department of Computer Science and Engg
10/1/2020 89
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
[Link]
Department of Computer Science and Engg
10/1/2020 90
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Final Summary :
– Fine-grain parallelism is exploited at instruction or loop levels
• Assisted by the compiler.
– Medium-grain parallelism is At task or job step level
• Requires programmer and compiler support.
– Coarse-grain parallelism is At Program Level
• Relies heavily on effective OS support.
– Shared-variable communication used at fine- and medium-grain levels.
[Link]
Department of Computer Science and Engg
10/1/2020 92
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Inter-process Communication Latency:
– Example:
• n communicating tasks may require
[Link]
Department of Computer Science and Engg
10/1/2020 93
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Packing and Scheduling )
1. How can I partition a program into parallel “pieces” to yield the shortest
execution time?
• Time spent for scheduling and synchronizing parallel grains and the speedup
obtained by parallel execution.
[Link]
Department of Computer Science and Engg
10/1/2020 95
Conditions of Parallelism
Program Partitioning & Scheduling
(Grain Determination and Scheduling Optimization)
Includes 4 steps:
[Link]
Department of Computer Science and Engg
10/1/2020 96
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
A program graph is like a dependence graph
• Nodes= {(n, s)} where n= node name and s=
size
– Larger the s = larger grain size
– Edges= {(v,d)} where v= communicated variable
and d=communication delay
[Link]
Department of Computer Science and Engg
10/1/2020 98
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
– Each node in the program graph corresponds to a
computational unit in the program.
[Link]
Department of Computer Science and Engg
10/1/2020 100
Program Flow Mechanisms
Control flow mechanism
Control flow mechanism(used by conventional computers):
– Conventional von Neumann computers use a
– Control flow computers use shared memory to hold program instructions and data
objects.
– Control flow can be made parallel by using parallel language constructs or parallel
compiler.
– Control flow machines give complete control, but are less efficient than other
approaches.[Link]
Department of Computer Science and Engg
10/1/2020 101
Program Flow Mechanisms
Data flow mechanism
Data-driven mechanism(used by data-flow computers):
• High potential for parallelism and throughput and freedom from side effects,
• But have high control overhead, lose time waiting for unneeded arguments, and
difficulty in manipulating data structures.
– Special mechanisms are required to detect data availability, match data tokens
with instructions needing them, enable chain reaction of asynchronous instruction
execution [Link]
Department of Computer Science and Engg
10/1/2020 102
Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism(used by reduction computers):
• Eager Evaluation
• This triggers the execution of instructions that yield its operands, and so forth.
• Lazy Evaluation
[Link]
Department of Computer Science and Engg
10/1/2020 103
Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism(used by reduction computers): 2 Reduction Models:
1. String-reduction model:
2. Graph-reduction model:
• Processors
• Memory modules
[Link]
Department of Computer Science and Engg
10/1/2020 106
System Interconnect Architectures
[Link]
Department of Computer Science and Engg
10/1/2020 107
System Interconnect Architectures
Network Properties and Routing
– Static Networks:
• Point-to-Point direct connections which will not change during program
execution
– Dynamic Networks:
• Implemented with switched channels.
[Link]
Department of Computer Science and Engg
10/1/2020 108
System Interconnect Architectures
Network Properties and Routing
Goals and Analysis:
– Analysis includes
• Latency
• Bisection bandwidth
• Data-routing functions
• Scalability
[Link]
of parallel architecture
Department of Computer Science and Engg
10/1/2020 109
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:
– Diameter D of a network
• Is the maximum shortest path between any two nodes
– Wire (or channel) length length (e.g. weight) of edges between nodes.
– Network is symmetric if the topology is the same looking from any node;
these are easier to implement or to program.
[Link]
Department of Computer Science and Engg
10/1/2020 112
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Is used for inter-PE data exchange
– Shifting
– Shuffle
– Exchange Etc.
[Link]
Department of Computer Science and Engg
10/1/2020 113
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Permutation (one to one)
– Given n objects, there are n ! ways in which they can be reordered (one of
which is no reordering).
– For example: π = (a , b , c) (d , e)
• Possible mapping are: ab, bc, ca, de, ed in a circular fashion.
[Link]
Department of Computer Science and Engg
10/1/2020 115
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Perfect Shuffle and Exchange
ab…k to bc…ka
– That is, shifting 1 bit to the left and wrapping it around to the least
significant bit position.
– The inverse perfect shuffle reverses the effect of the perfect shuffle.
[Link]
Department of Computer Science and Engg
10/1/2020 116
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Perfect Shuffle and Exchange
Perfect Shuffle
(Left Shift)
Inverse Perfect
Shuffle
(Right Shift)
[Link]
Department of Computer Science and Engg
10/1/2020 117
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Hypercube Routing Functions
[Link]
Department of Computer Science and Engg
10/1/2020 118
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Hypercube Routing Functions
[Link]
Department of Computer Science and Engg
10/1/2020 119
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Broadcast and Multicast
[Link]
Department of Computer Science and Engg
10/1/2020 120
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Network Performance
– Functionality:-
• How the network supports data routing, interrupt handling, synchronization,
request/message combining, and coherence.
– Network latency:-
• Worst-case time for a unit message to be transferred.
– Bandwidth
• Maximum data rate
– Hardware complexity
• Implementation costs for wire, logic, switches, connectors, etc.
– Scalability
• How easily does the scheme adapt to an increasing number of processors,
memories, etc.?
[Link]
Department of Computer Science and Engg
10/1/2020 121
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Static networks use direct links which are fixed
once built.
– Barrel Shifter
– Fat Tree
[Link]
Department of Computer Science and Engg
10/1/2020 122
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Linear Array
– Diameter = n-1
– Bisection = 1
[Link]
Department of Computer Science and Engg
10/1/2020 124
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Ring, Chordal
Ring
[Link]
Department of Computer Science and Engg
10/1/2020 126
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Barrel Shifter
[Link]
Department of Computer Science and Engg
10/1/2020 127
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Barrel Shifter
[Link]
Department of Computer Science and Engg
10/1/2020 128
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Tree and Star
– The balanced binary tree is scalable, since it has a constant maximum node
degree.
[Link]
Department of Computer Science and Engg
10/1/2020 129
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Tree ,Star and Fat Tree
– A star is a two-level tree with a node degree d = N – 1 and a constant diameter
of 2.
– A fat tree is a tree in which the number of edges between nodes increases
closer to the root
– The edges represent communication channels (“wires”).
[Link]
Department of Computer Science and Engg
10/1/2020 130
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Mesh and Torus
– Pure mesh:
[Link]
Department of Computer Science and Engg
10/1/2020 131
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Mesh and Torus
– A torus mesh :
• Has ring connections in each dimension, and is symmetric.
[Link]
Department of Computer Science and Engg
10/1/2020 132
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Systolic Array
[Link]
Department of Computer Science and Engg
10/1/2020 133
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Hypercubes
[Link]
Department of Computer Science and Engg
10/1/2020 134
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks Cube-connected Cycles
– Price can be attributed to the cost of wires, switches, arbiters, and connectors.
[Link]
Department of Computer Science and Engg
10/1/2020 139
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Bus Systems
– The throughput of the network based on a bus can be increased by the use
of a multibus network shown in the figure below.
– In this network, processors connected to the busses can transmit data in
parallel (one for each bus) and many processors can read data from many
busses at a time.
[Link]
Department of Computer Science and Engg
10/1/2020 140
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Switch Modules
[Link]
Department of Computer Science and Engg
10/1/2020 141
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Switch Modules
– However, multiple inputs may not be connected to the same output. When
only one-to-one mappings are allowed, the switch is called a crossbar switch.
[Link]
Department of Computer Science and Engg
10/1/2020 142
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Single-stage networks:
• Perfect shuffle
operation: cyclic shift
1 place left, eg 101 -->
011
[Link]
Department of Computer Science and Engg
10/1/2020 144
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
[Link]
Department of Computer Science and Engg
10/1/2020 146
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
This type of networks can be classified into the following four categories:
– Nonblocking : A network is called strictly nonblocking if it can connect any
idle input to any idle output regardless of what other connections are
currently in process
[Link]
Department of Computer Science and Engg
10/1/2020 148
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Baseline Networks:
– A baseline network can be shown to be topologically equivalent to other
networks (including Omega), and has a simple recursive generation procedure.
– Stage k (k = 0, 1, …) is an m × m switch block (where m = N / 2k ) composed
entirely of 2 × 2 switch blocks, each having two configurations: straight
through and crossover.
– The network can be generated recursively
– The first stage N × N, the second (N/2) × (N/2).
– Networks are topologically equivalent if one network can be easily reproduced
from the other networks by simply rearranging nodes at each stage
[Link]
Department of Computer Science and Engg
10/1/2020 149
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Baseline Networks:
[Link]
Department of Computer Science and Engg
10/1/2020 150
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Crossbar Networks:
– A crossbar switch has a number of input and output data pins and a number
of control pins.
– In response to control instructions set to its control input, the crossbar
switch implements a stable connection of a determined input with a
determined output.
– A m × n crossbar network
• With m processors and n memories, one processor may be able to generate
requests for multiple memories in sequence; thus several switches might be set
in the same row.
• For m × m interprocessor communication, each PE is connected to both an input
and an output of the crossbar; only one switch in each row and column can be
turned on simultaneously. Additional control processors are used to manage the
crossbar itself.
[Link]
Department of Computer Science and Engg
10/1/2020 151
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks Multistage Interconnection Networks
Crossbar Networks:
– Each junction is a switching component – connecting the row to the column.
[Link]
Department of Computer Science and Engg
10/1/2020 152
Principles of Scalable Performance
Chapter-3
This chapter includes:
– Performance measures
– Speedup laws
– Scalability principles
– Scaling up vs. scaling down
[Link]
Department of Computer Science and Engg
10/1/2020 153
Principles of Scalable Performance
Chapter-3
Performance metrics and measures:-
– Parallelism profiles
[Link]
Department of Computer Science and Engg
10/1/2020 154
Principles of Scalable Performance
Parallelism Profile in Programs
Parallelism Profile in Programs:
– Degree of parallelism Reflects the matching of software and
hardware parallelism
• Average parallelism
• Available parallelism
[Link]
Department of Computer Science and Engg
10/1/2020 156
Principles of Scalable Performance
Parallelism Profile in Programs
Degree of parallelism(DOP): Factors affecting parallelism
profiles:
– Algorithm structure
– Program optimization
– Resource utilization
– Run-time conditions
• no overhead
[Link]
Department of Computer Science and Engg
10/1/2020 159
Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism:Is computed by
In discrete form,
we get
[Link]
Department of Computer Science and Engg
10/1/2020 160
Principles of Scalable Performance
Parallelism Profile in Programs
Example: Parallelism profile and average parallelism
[Link]
Department of Computer Science and Engg
10/1/2020 161
Principles of Scalable Performance
Parallelism Profile in Programs
Asymptotic speedup the amount of work executed
with DOP=i
For i=1(Single processor-sequential)
W.R.T
Response Time
[Link]
Department of Computer Science and Engg
10/1/2020 162
Principles of Scalable Performance
Parallelism Profile in Programs
Asymptotic speedup S∞ is defined as the ratio of
T(1) to T(∞)
In ideal case,
n=∞
n >> m
[Link]
Department of Computer Science and Engg
10/1/2020 163
Principles of Scalable Performance
Mean Performance
Performance measures:
– Consider n processors executing m programs in various modes
– Want to define the mean performance of these multimode
computers:
• Arithmetic mean performance
• Harmonic mean performance
[Link]
Department of Computer Science and Engg
10/1/2020 164
Principles of Scalable Performance
Mean Performance
Arithmetic mean performance:
Let {Ri} be the execution rate of programs i=1,2,3,….,m.
Arithmetic mean execution rate is defined as:
[Link]
Department of Computer Science and Engg
10/1/2020 166
Principles of Scalable Performance
Mean Performance
Harmonic mean performance:
[Link]
Department of Computer Science and Engg
10/1/2020 167
Principles of Scalable Performance
Mean Performance
Harmonic mean performance:
[Link]
Department of Computer Science and Engg
10/1/2020 168
Principles of Scalable Performance
Mean Performance
Harmonic mean Speedup:
[Link]
Department of Computer Science and Engg
10/1/2020 169
Principles of Scalable Performance
Mean Performance-Amdahl’s Law
[Link]
Department of Computer Science and Engg
10/1/2020 170
Principles of Scalable Performance
Mean Performance - Amdahl’s Law
[Link]
Department of Computer Science and Engg
10/1/2020 171
Principles of Scalable Performance
Efficiency, Utilization and Quality
[Link]
Department of Computer Science and Engg
10/1/2020 172
Principles of Scalable Performance
Efficiency, Utilization and Quality
[Link]
Department of Computer Science and Engg
10/1/2020 173
Principles of Scalable Performance
Efficiency, Utilization and Quality
[Link]
Department of Computer Science and Engg
10/1/2020 174
Principles of Scalable Performance
Efficiency, Utilization and Quality
[Link]
Department of Computer Science and Engg
10/1/2020 175
Principles of Scalable Performance
Efficiency, Utilization and Quality
KLIPS [Link]
kilo logic inference per second used to indicate the
10/1/2020
reasoning powerDepartment
of an AIof Computer
machine Science and Engg
176
Principles of Scalable Performance
Parallel Processing Applications
Few applications listed:-
– Drug design
– High-speed civil transport
– Ocean modelling
– Ozone depletion research
– Air pollution
– Digital anatomy
[Link]
Department of Computer Science and Engg
10/1/2020 177
Parallel Processing Applications
– Here we will discuss the grand challenges identified in the U.S. High
Performance Computer and Communication (HPCC) program.
[Link]
Department of Computer Science and Engg
10/1/2020 178
Parallel Processing Applications
[Link]
Department of Computer Science and Engg
10/1/2020 179
Parallel Processing Applications
• Very few processors exists which can execute 2 instructions in one cycle.
• Depends upon complier, OS incapability's and program flow built in modern computers.
– Data parallelism
• Higher than Instruction level parallelism.
[Link]
Department of Computer Science and Engg
10/1/2020 182
Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing: 3 Models Depends on limited
memory, limited latency tolerance , limited I/O bandwidth
The models are:
1. Fixed load model:
• Corresponds to constant workload (α)
[Link]
Department of Computer Science and Engg
10/1/2020 183
Parallel Processing Applications
( Scalability of Parallel Algorithms)
Algorithmic characteristics:-
• Computational granularity
• Parallelism Profile
h=h(s , n)
– Efficiency is given by : -
[Link]
Department of Computer Science and Engg
10/1/2020 186
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 187
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 188
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 189
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 190
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 191
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 192
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 193
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 194
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 195
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 196
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 197
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 198
Speedup Performance Laws
[Link]
Department of Computer Science and Engg
10/1/2020 199
Scalability Analysis and Approaches
[Link]
Department of Computer Science and Engg
10/1/2020 200
Scalability Analysis and Approaches
[Link]
Department of Computer Science and Engg
10/1/2020 201
Scalability Analysis and Approaches
Attained by varying
only the number of
processor(n)
[Link]
10/1/2020 Department of Computer Science and Engg
202
Scalability Analysis and Approaches
[Link]
10/1/2020 Department of Computer Science and Engg
203
Scalability Analysis and Approaches
Scalability Issues and Possible solutions:
[Link]
Department of Computer Science and Engg
10/1/2020 204
[Link]
10/1/2020 Department of Computer Science and Engg
205
Major advancements beyond the von Neumann architecture include the integration of look-ahead, parallel processing, pipelining, Flynn’s classification of parallel computers, and the development of parallel/vector computers. These developments have allowed for the handling of complex and large-scale computations more efficiently, addressing limitations of the traditional sequential computing approach .
Job or program-level parallelism is feasible for systems with a small number of powerful processors, facilitating the execution of independent program streams efficiently. However, in systems with many simple processors, it is impractical due to the extended processing time required per job, highlighting a mismatch in resources and task allocation capabilities .
Loop-level parallelism offers the advantage of executing independent iterations simultaneously, which is effective in pipelined or SIMD environments. However, recursive loops pose complexities due to interdependencies that hinder parallel execution. Proper scheduling and synchronization are critical to harnessing its benefits while overcoming challenges .
The design of hardware and compilers needs to be synchronized to exploit parallelism effectively in conventional scalar processors. This synchronization allows the identification and implementation of parallel constructs, enhancing program execution efficiency .
In multiprocessor systems, achieving a low-latency network design with high data transfer rates is essential for efficient operations. Low latency reduces the time required for data transmission between nodes, while high data transfer rates ensure that large volumes of data can be moved swiftly, enhancing the overall system throughput and performance .
Multistage interconnection networks use a combination of switch modules arranged in stages to provide dynamic routing and reordering of inputs for flexible communication. Their advantage lies in their ability to support various communication patterns and high connectivity, enhancing system flexibility and performance in parallel computing environments .
Inter-process communication latency is crucial to system scalability as it limits the effective size of parallel architectures. High latency leads to longer signal delays and inefficient communication, which become bottlenecks as the system grows. Minimizing latency is essential to maintaining performance and enabling scalable system expansion .
Grain packing involves determining the optimal size and number of parallel grains in a program to minimize execution time. The challenge lies in balancing the overhead of synchronizing and scheduling these grains against the performance gains from parallel execution. It requires careful consideration of both machine and problem-specific characteristics to achieve effective scheduling .
Grain sizes, or granularity, greatly influence the parallel execution of a program by determining the number of instructions per program part. They are classified as fine, medium, or coarse based on parallelism levels, impacting scheduling and optimization strategies. Fine-grain parallelism is more common at the instruction level, while coarse-grain parallelism suits program-level execution .
Static networks use fixed, point-to-point connections that do not change during program execution, providing consistent communication paths. In contrast, dynamic networks utilize switched channels that can be reconfigured based on user program demands, offering flexibility and potentially optimizing communication traffic according to current requirements .