0% found this document useful (0 votes)
28 views206 pages

Advanced Computer Architecture Overview

The document discusses the history and evolution of computing from ancient times to modern computers. It covers the five generations of computers from the first generation using vacuum tubes to the current fifth generation using advanced VLSI processors. It also describes the key technologies, architectures, software and representative systems of each generation. Finally, it discusses some of the elements that characterize modern computer systems such as computing problems, algorithms, hardware resources, operating systems and compiler support.

Uploaded by

omeshwar v
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views206 pages

Advanced Computer Architecture Overview

The document discusses the history and evolution of computing from ancient times to modern computers. It covers the five generations of computers from the first generation using vacuum tubes to the current fifth generation using advanced VLSI processors. It also describes the key technologies, architectures, software and representative systems of each generation. Finally, it discusses some of the elements that characterize modern computer systems such as computing problems, algorithms, hardware resources, operating systems and compiler support.

Uploaded by

omeshwar v
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Visit : [Link]

io

Join Telegram to get Instant Updates: [Link]

Contact: MAIL: futurevisionbie@[Link]

INSTAGRAM: [Link]/hemanthraj_hemu/

INSTAGRAM: [Link]/futurevisionbie/

WHATSAPP SHARE: [Link]


Advanced Computer Architecture

17CS72

MODULE-1

Theory of Parallelism

Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”,


Hwang & Jotwani
[Link]
Department of Computer Science and Engg
10/1/2020 1
MODULE-1

Syllabus: MODULE Theory of Parallelism

Parallel Computer Models 1


The State of Computing, Multiprocessors and
1
Multicomputer
Multivector and SIMD Computers 1
PRAM and VLSI Models, Program and Network
Properties 1

1 Conditions of Parallelism 1
Program Partitioning and Scheduling 1
Program Flow Mechanisms, System Interconnect
1
Architectures
Principles of Scalable Performance 1
Performance Metrics and Measures, Parallel Processing 1
Applications
Speedup Performance Laws, Scalability Analysis and 1
Approaches
[Link]
Department of Computer Science and Engg
10/1/2020 2
Theory of Parallelism

[Link]
Department of Computer Science and Engg
10/1/2020 3
THE STATE OF COMPUTING: Computer Development Milestones
• How it all started…
 500 BC: Abacus (China) – The earliest mechanical computer
/calculating device.
Operated to perform decimal arithmetic with carry propagation digit by digit
 1642: Mechanical Adder/Subtractor (Blaise Pascal)
 1827: Difference Engine (Charles Babbage)
 1941: First binary mechanical computer (Konrad Zuse; Germany)
 1944: Harvard Mark I (IBM)
The very first electromechanical decimal computer as proposed by Howard Aiken
• Computer Generations
 1 st 2 nd 3 rd 4 th 5 th
 Division into generations marked primarily by changes in
hardware and software technologies

[Link]
Department of Computer Science and Engg
10/1/2020 4
THE STATE OF COMPUTING: Computer Generations

• First Generation (1945 – 54)


o Technology & Architecture:
• Vacuum Tubes
• Relay Memories (Switch kind latch)
• CPU driven by PC and accumulator
• Fixed Point Arithmetic
o Software and Applications:
• Machine/Assembly Languages
• Single user
• No subroutine linkage
• Programmed I/O using CPU
o Representative Systems: ENIAC, Princeton IAS, IBM 701

[Link]
Department of Computer Science and Engg
10/1/2020 5
THE STATE OF COMPUTING: Computer Generations

• Second Generation (1955 – 64)


o Technology & Architecture:
• Discrete Transistors (Single Transistors)
• Core Memories
• Floating Point Arithmetic
• I/O Processors
• Multiplexed memory access
o Software and Applications:
• High level languages used with compilers
• Subroutine libraries
• Batch Processing Monitor
o Representative Systems: IBM 7090, CDC 1604, Univac LARC

[Link]
Department of Computer Science and Engg
10/1/2020 6
THE STATE OF COMPUTING: Computer Generations

• Third Generation (1965 – 74)


o Technology & Architecture:
• IC Chips (SSI/MSI)
• Microprogramming
• Pipelining
• Cache
• Look-ahead processors
o Software and Applications:
• Multiprogramming and Timesharing OS
• Multiuser applications
o Representative Systems: IBM 360/370, CDC 6600, T1-ASC, PDP-8

[Link]
Department of Computer Science and Engg
10/1/2020 7
THE STATE OF COMPUTING: Computer Generations

• Fourth Generation (1975 – 90)


o Technology & Architecture:
• LSI/VLSI
• Semiconductor memories
• Multiprocessors
• Multi-computers
• Vector supercomputers
o Software and Applications:
• Multiprocessor OS
• Languages, Compilers and environment for parallel
processing
o Representative Systems: VAX 9000, Cray X-MP, IBM 3090, BBN TC
2000

[Link]
Department of Computer Science and Engg
10/1/2020 8
THE STATE OF COMPUTING: Computer Generations

• Fifth Generation (1991 onwards)


o Technology & Architecture:
• Advanced VLSI Processors
• Memory and Switches
• Scalable Architectures
• High Density Packaging
o Software and Applications:
• Superscalar Processors
• Systems on a chip
• Massively parallel processing
• Grand challenge applications
• Heterogeneous processing
o Representative Systems: S-81, IBM ES/9000, Intel Paragon, nCUBE
6480, MPP, VPP500
[Link]
Department of Computer Science and Engg
10/1/2020 9
THE STATE OF COMPUTING: Elements of Modern Computers

• The hardware, software, and programming elements of modern computer


systems can be characterized by looking at a variety of factors, including:
– Computing problems
– Algorithms and data structures
– Hardware resources
– Operating systems
– System software support
– Compiler support

[Link]
Department of Computer Science and Engg
10/1/2020 10
THE STATE OF COMPUTING: Elements of Modern Computers
Computing problems

• A modern computer is an integrated system consisting of :


– Machine hardware
– An instruction set
– System software
– Application programs
– User interfaces.
• The use of a computer is driven by real-life problems demanding cost
effective solutions.
• For numerical problems in science and technology, the solutions demand
complex mathematical formulations and intensive integer or floating-point
computations.
• For artificial intelligence [Al] problems, the solutions demand logic inferences
and symbolic manipulations.

[Link]
Department of Computer Science and Engg
10/1/2020 11
THE STATE OF COMPUTING: Elements of Modern Computers
Algorithms and Data Structures

• Traditional algorithms and data structures are designed


– For sequential machines.
• New, specialized algorithms and data structures are needed
– To exploit the capabilities of parallel architectures.
• These often require interdisciplinary interactions
– Among theoreticians, experimentalists and programmers.

[Link]
Department of Computer Science and Engg
10/1/2020 12
THE STATE OF COMPUTING: Elements of Modern Computers
Hardware Resources

• Processors, memory, and peripheral devices form the hardware core of a


computer system.
• A modem computer system demonstrates its power through coordinated
efforts by :
– Hardware recourses
– An operating system
– Application software.
• Special hardware interfaces are often built into l-O devices such as display
terminals, workstations.
• In addition, software interface programs are needed. These software
interfaces include :
– File transfer systems, editors, word processors, device drivers, interrupt handlers,
network communication programs, etc.

[Link]
Department of Computer Science and Engg
10/1/2020 13
THE STATE OF COMPUTING: Elements of Modern Computers
Operating System

• Operating systems manage the allocation and deallocation of resources during user
program execution.
• An OS plays a significant role in mapping hardware resources to algorithmic and data
structures.
– Implementation of Mapping:
• Relies on efficient Compiler and Operating system support.
• Parallelism can be exploited at Algorithm :
– Design Time
– Program Time
– Compile Time
– Run Time

[Link]
Department of Computer Science and Engg
10/1/2020 14
THE STATE OF COMPUTING: Elements of Modern Computers
System Software Support

• System software is needed for the development of efficient programs in HLL.


• Few system software's are: Compliers, Loaders, Linkers, OS Kernel
• Parallel software can be developed:
– Using entirely new languages designed specifically with parallel support
– Using extensions to existing sequential languages.
• New languages have obvious advantages:
– Like new constructs specifically for parallelism
– But require additional programmer education and system software.
• The most common approach is to extend an existing language.

[Link]
Department of Computer Science and Engg
10/1/2020 15
THE STATE OF COMPUTING: Elements of Modern Computers
Complier Support

There Complier upgrade approaches:


1. Pre-processors
– Use existing sequential compilers and specialized libraries to implement parallel
constructs
2. Pre-compilers
– Perform some program flow analysis, dependence checking, and limited parallel
optimizations
3. Parallelizing Compilers
– Requires full detection of parallelism in source code, and transformation of
sequential code into parallel constructs
• Compiler directives are often inserted into source code to aid compiler
parallelizing efforts

[Link]
Department of Computer Science and Engg
10/1/2020 16
THE STATE OF COMPUTING: Evolution of Computer Architecture

• The study of computer architecture involves both the following:


o Hardware organization
o Programming/software requirements
• The evolution of computer architecture is believed to have started with von
Neumann architecture
o Built as a sequential machine
o Executing scalar data (Single value)
• Major leaps in this context came as…
o Look-ahead, parallelism and pipelining
o Flynn’s classification
o Parallel/Vector Computers
o Development Layers

[Link]
Department of Computer Science and Engg
10/1/2020 17
THE STATE OF COMPUTING: Evolution of Computer Architecture

[Link]
Department of Computer Science and Engg
10/1/2020 18
THE STATE OF COMPUTING: Evolution of Computer Architecture
Look-ahead, parallelism and pipelining

• Look-ahead
– Introduced to pre-fetch Instruction to overlap Instruction Fetch & Execute
– They enable function parallelism
• Parallelism
– Supported by two approaches:
• Use of multiple functional units simultaneously.
• Practicing pipelining at various processing levels Includes
– Pipelined Instruction execution
– Pipelined arithmetic computations
– Memory access operations

• Pipelining
– Performs identical operations repeatedly on vector data strings
– Uses Scalar pipeline processors

[Link]
Department of Computer Science and Engg
10/1/2020 19
THE STATE OF COMPUTING: Evolution of Computer Architecture
Flynn’s classification
• This taxonomy distinguishes multi-processor computer architectures according to two
independent dimensions :
• Instruction stream
– Sequence of instructions executed by machine
• Data stream.
– Sequence of data including input, partial or temporary results used by instruction stream
• Each of these dimensions can have only one of two possible states:
– Single or Multiple
• Flynn’s classification depends on the distinction between :
– The performance of control unit
– The data processing unit rather than its operational and structural interconnections.
• Following are the four category of Flynn classification:
– SISD(Single instruction stream, single data stream)
– SIMD(Single instruction stream, multiple data stream)
– MIMD(Multiple instruction stream, multiple data stream)
– MISD(Multiple instruction stream, single data stream)
[Link]
Department of Computer Science and Engg
10/1/2020 20
THE STATE OF COMPUTING: Evolution of Computer Architecture

Flynn’s classification SISD(Single instruction stream, single data stream)

Flynn’s classification SIMD(Single instruction stream, multiple data stream)

[Link]
Department of Computer Science and Engg
10/1/2020 21
THE STATE OF COMPUTING: Evolution of Computer Architecture

Flynn’s classification MIMD(Multiple instruction stream, multiple data stream)

Flynn’s classification MISD(Multiple instruction stream, single data stream)

[Link]
Department of Computer Science and Engg
10/1/2020 22
THE STATE OF COMPUTING: Evolution of Computer Architecture
Parallel/Vector Computers
Parallel computers are those that execute programs in MIMD mode.
• There are two major classes of parallel computers
– Shared-memory multiprocessors
– message-passing multicomputers.
• The major distinction between multiprocessors and multicomputers lies :
– Memory sharing
• Multiprocessors: Processors communicate through shared variable in a common memory
• Multicomputer: Each computer has a local memory unshared with other computers.
– Inter-processor communication
• Through message passing for a multi computer

Vector Processors:
• Equipped with multiple vector pipelines that can be concurrently used under
hardware or firmware control
• Two families :
– Memory-to-memory architecture
– Register-to-register architecture

[Link]
Department of Computer Science and Engg
10/1/2020 23
THE STATE OF COMPUTING: Evolution of Computer Architecture
Development Layers

Six layers of computer system development:

Programmers View
Point these 2 layers
should be architecture
transparent

This features are up to


the designer  should
match the target
application domain

As a good computer architect One has to approach problem from both the ends
[Link]
Department of Computer Science and Engg
10/1/2020 24
THE STATE OF COMPUTING: System Attributes Affecting Performance

• The ideal performance of a computer system demands a perfect match between


machine capability and program behavior.
• Performance depends on:
– Hardware technology
– Architectural features
– Efficient resource management
– Algorithm design
– Data structures
– Language efficiency
– Programmer skill
– Compiler technology

[Link]
Department of Computer Science and Engg
10/1/2020 25
THE STATE OF COMPUTING: System Attributes Affecting Performance

Performance Indicators:
• The simplest measure of program performance is the turnaround time , which
includes :
– Disk and memory accesses
– Input and output
– Compilation time
– Operating system overhead
– CPU time
• Since I/O and system overhead frequently overlaps processing by other programs,
it is fair to consider only :
– The CPU time used by a program.
– The user CPU time is the most important factor.

[Link]
Department of Computer Science and Engg
10/1/2020 26
THE STATE OF COMPUTING: System Attributes Affecting Performance

Clock Rate and CPI:


• CPU is driven by a clock with a constant cycle time τ (usually measured in
nanoseconds).
• The inverse of the cycle time is the clock rate (f = 1/ τ , measured in megahertz).
• The size of a program is determined by its instruction count, Ic , the number of
machine instructions to be executed by the program.
• Different machine instructions require different numbers of clock cycles to execute.
CPI (cycles per instruction) is thus an important parameter.
Average CPI:
• It is easy to determine the average number of cycles per instruction for a particular
processor if we know the frequency of occurrence of each instruction type.
• Of course, any estimate is valid only for a specific set of programs (which defines the
instruction mix), and then only if there are sufficiently large number of instructions.
• In general, the term CPI is used with respect to a particular instruction set and a given
program mix.
[Link]
Department of Computer Science and Engg
10/1/2020 27
THE STATE OF COMPUTING: System Attributes Affecting Performance

Performance Factors (1):


• The time required to execute a program containing Ic instructions is just
T = Ic * CPI * τ.
• Each instruction must be  fetched from memory  decoded  then operands
fetched from memory  the instruction executed  the results stored.
• The time required to access memory is called the memory cycle time,
– Which is usually k times the processor cycle time t.
– The value of k depends on the memory technology and the processor-memory
interconnection scheme.
Performance Factors (2):
• The processor cycles required for each instruction (CPI)is the total of
– cycles needed for instruction decode & execution (p)
– cycles needed for memory references (m * k)
• The total time needed to execute a program can then be rewritten as
(p + m * k) * τ
T = Ic *[Link]
Department of Computer Science and Engg
10/1/2020 28
THE STATE OF COMPUTING: System Attributes Affecting Performance

System Attributes:

The five performance factors (Ic , p, m, k, τ) are influenced by four system


attributes:

• Instruction-set architecture (affects Ic and p)


• Compiler technology (affects Ic and p and m)
• CPU implementation and control (affects p × τ)
• Cache and memory hierarchy (affects memory access latency, k × τ)

Total CPU time can be used as a basis in estimating the execution rate of a
processor.

[Link]
Department of Computer Science and Engg
10/1/2020 29
THE STATE OF COMPUTING: System Attributes Affecting Performance

System Attributes:

[Link]
Department of Computer Science and Engg
10/1/2020 30
THE STATE OF COMPUTING: System Attributes Affecting Performance

MIPS Rate:
If C is the total number of clock cycles needed to execute a given
program =>
• Then total CPU time can be estimated as:
– T = C × τ = C / f.
• Other relationships are easily observed:
– CPI = C / Ic
– T =Ic × CPI × τ
– T =Ic × CPI / f
Processor speed is often measured in terms of millions of
instructions per second, frequently called the MIPS rate of the
processor
[Link]
Department of Computer Science and Engg
10/1/2020 31
THE STATE OF COMPUTING: System Attributes Affecting Performance

MIPS Rate:

The MIPS rate is:


 Directly proportional ==> To the clock rate
 Inversely proportion to the CPI.

Note:
All four system attributes (instruction set, compiler, processor, and memory
technologies) affect the MIPS rate, which varies also from program to program.
[Link]
Department of Computer Science and Engg
10/1/2020 32
THE STATE OF COMPUTING: System Attributes Affecting Performance

Throughput Rate:
The number of programs a system can execute per unit time,
Ws, in programs per second.
CPU throughput, Wp, is defined as :
Wp = f/(Ic * CPI)
In a multiprogrammed system, the system throughput is often
less than the CPU throughput.

[Link]
Department of Computer Science and Engg
10/1/2020 33
THE STATE OF COMPUTING: System Attributes Affecting Performance

Programming Enviroments:
• Programmability depends on the :
– Programming environment provided to the users.
• Conventional computers are used in a sequential programming
environment
– With tools developed for a uniprocessor computer.
• Parallel computers need parallel tools:
– That allow specification or easy detection of parallelism
– Operating systems that can perform parallel scheduling of:
• Concurrent events
• Shared memory allocation
• Shared peripheral
• Communication links.

[Link]
Department of Computer Science and Engg
10/1/2020 34
THE STATE OF COMPUTING: System Attributes Affecting Performance
Approach of Parallelism

Source Program  Sequential coded

• Complier has to detect parallelism


assign machine resources
• Used in shared memory
multiprocessors
• Success relies on the intelligence
of the parallelizing complier

[Link]
Department of Computer Science and Engg
10/1/2020 35
THE STATE OF COMPUTING: System Attributes Affecting Performance
Approach of Parallelism

Source Program  Parallelism is


explicitly specified in the user program

• Complier needs to preserve


Parallelism assign machine
resources

[Link]
Department of Computer Science and Engg
10/1/2020 36
Multiprocessors and Multicomputers
Categories of Parallel Computers:
• Considering their architecture only, there are two main categories of
parallel computers:
– Systems with shared common memories(Shared Memory Multiprocessors)
• The UMA Model
• The NUMA Model
• The COMA Model
• The CC-NUMA Model
– Systems with unshared distributed memories(Distributed-Memory
Multicomputers)
• The NORMA Machines
• Message Passing multicomputers
– Taxonomy of MIMD Computers
– Representative Systems
• Multiprocessors: BBN TC-200, MPP, S-81, IBM ES/9000 Model 900/VF,
• [Link]
Multicomputers: Intel Paragon XP/S, nCUBE/2 6480, SuperNode 1000, CM5, KSR-1
Department of Computer Science and Engg
10/1/2020 37
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors( Uniform Memory Access Model UMA Model)
• Physical memory uniformly shared by
all processors
– Has equal access time to all words.
• Processors may have local cache
memories.
• Peripherals also shared in some
fashion.
• Tightly coupled systems use a :
– common bus, crossbar, or multistage
network to connect processors,
peripherals, and memories.
• Many manufacturers have
multiprocessor (MP) extensions of
uniprocessor (UP) product lines.
[Link]
Department of Computer Science and Engg
10/1/2020 38
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

• Synchronization and communication among


processors achieved through shared variables in
common memory.

• Two types of Multiprocessors:

• Symmetric MP systems :
– All processors have access to all peripherals
– Any processor can run the OS and I/O device
drivers.
• Asymmetric MP systems :
– Not all peripherals accessible by all processors
– Kernel runs only on selected processors (master)
– Others are
[Link]
called attached processors (AP).
Department of Computer Science and Engg
10/1/2020 39
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

Example: Performance Calculation:


Consider two loops:(Sequential Processor)
• The first loop  adds corresponding elements of two N-
element vectors to yield a third vector.
• The second loop  sums elements of the third vector.
• Assume each add/assign operation
– Takes 1 cycle,
– And ignore time spent on other actions (e.g. loop counter
incrementing/testing, instruction fetch, etc.).
• Assume interprocessor communication
– Requires k cycles.
• On a sequential system, each loop(I/J) will require N
cycles
• Hence a total of 2N cycles of processor time is required

[Link]
Department of Computer Science and Engg
10/1/2020 40
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

Example: Performance Calculation::(Parallel Processor)


• To execute a program On an M-processor system,
• We can partition each loop into M parts,
– Each having L = N / M (for each Loop section).
• As we have 2 loops The total time required is thus 2L.
– This leaves us with M partial sums that must be totalled.(Through Recursion Concept)
• Computing the final sum from the M partial sums requires
– To merge all partial sums of l level binary tree takes l= log2(M) additions
– The addition of each pair of partial sum requires k cycles through shared memory.
– 1 cycle (for the add/assign),
– Hence Total cycles  l × (k+1) cycles.
• The parallel computation thus requires
2N / M + (k + 1) log2(M) cycles
[Link]
Department of Computer Science and Engg
10/1/2020 41
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(UMA Model)

Example: Performance Calculation:


• Assume N = 220.
• Sequential execution requires 2N = 221 cycles.
• If processor synchronization requires k = 200 cycles, and we have M = 256
processors, parallel execution requires
2N / M + (k + 1) log2(M)
= 221 / 28 + 201 × 8
= 213 + 1608 = 9800 cycles
• Comparing results, the parallel solution is 214 times faster than the
sequential, with the best theoretical speedup being 256 (since there are
256 processors).
• Thus the efficiency of the parallel solution is 214 / 256 = 83.6 %
[Link]
Department of Computer Science and Engg
10/1/2020 42
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors( Non-Uniform Memory Access( NUMA ) Model )

• Shared memories, but access time depends on the


location of the data item.
• The shared memory is distributed among the :
– As local memories, but each of these is still
accessible by all processors (with varying access
times).
• Memory access is fastest from the :
– Locally-connected processor, with the
interconnection network adding delays for other
processor accesses.
• Additionally, there may be global memory in a
multiprocessor system:
– With two separate interconnection networks,
one for clusters of processors and their cluster
memories, and another for the global shared
memories[Link]
Department of Computer Science and Engg
10/1/2020 43
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(NUMA Model:Hierarchical Cluster Model)

[Link]
Department of Computer Science and Engg
10/1/2020 44
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors(Cache Only Memory Access Model -COMA)

• In the COMA model,


– Special kind of NUMA model
– Distributed memories of NUMA have
converted to cache memories;
– The caches, taken together, form a
global address space.
• Each cache has an associated
directory
– Remote machines in their lookups;
– Hierarchical directories may exist in
machines based on this model.
• Initial data placement is not critical, as
cache blocks will eventually migrate to
where they are needed

[Link]
Department of Computer Science and Engg
10/1/2020 45
Multiprocessors and Multicomputers:
Shared Memory Multiprocessors
(Cache Coherent –Non Uniform Memory Access CC-NUMA)

It uses the concept of both:


1. COMA
2. NUMA

[Link]
Department of Computer Science and Engg
10/1/2020 46
Multiprocessors and Multicomputers:
Distributed Memory Multicomputers
• Multicomputer consist of multiple
computers/nodes, interconnected by a

• Each node is autonomous, with its own


processor and local memory, and
sometimes local peripherals.
• The message-passing network provides
point-to-point static connections among
the nodes.
• Local memories are not shared,
– So traditional multicomputer are called
no-remote-memory access (or NORMA)
machines.
• Inter-node communication is achieved
by passing messages through the static
connection network.
[Link]
Department of Computer Science and Engg
10/1/2020 47
Multiprocessors and Multicomputers:
Multicomputer Generations
• Each multicomputer uses:
– Routers and channels in its interconnection network,
– Heterogeneous systems may involved mixed node types
– Uniform data representation and communication protocols.
• First generation:
– Hypercube architecture
– Software controlled message switching
– Processor boards
• Second generation:
– Mesh-connected architecture
– Hardware message switching
– Software for medium-grain distributed computing
• Third generation:
– Fine-grained distributed computing, with each VLSI chip containing the processor and
communication resources
[Link]
Department of Computer Science and Engg
10/1/2020 48
Multiprocessors and Multicomputer
Taxonomy of MIMD Computers
Parallel Computers
Parallel Computers : Exists
 SIMD or MIMD configuration
 SIMD configuration
 For special purpose applications
 CM - 2 (Connection Machine) on SIMD architecture
 MIMD configuration
 CM - 5 on MIMD architecture
 Having globally shared virtual address space
 Scalable multiprocessors or multicomputer
 Use distributed shared memory
 Un-scalable multiprocessors:
 Use centrally shared memory

[Link]
Department of Computer Science and Engg
10/1/2020 49
Taxonomy of MIMD Computers

Scalable

• Single Address Space


• Shared Memory
• Computation

Not Scalable

Scalable

[Link]
Department of Computer Science and Engg
10/1/2020 50
Multivector and SIMD Computers

Supercomputer (Have highest operational rate) Classification:

 Pipelined Vector machine/ Vector Supercomputers:


 Using a few powerful processors equipped with vector hardware
 Vector Processing

 SIMD Computers / Parallel Processors:


 Emphasizing massive data parallelism

[Link]
Department of Computer Science and Engg
10/1/2020 51
Multivector and SIMD Computers
Vector computer  Is built on top of Scalar
Vector Supercomputers Processor.

[Link]
Department of Computer Science and Engg
10/1/2020 52
Multivector and SIMD Computers
Vector Supercomputers
Step 1-2: Program & data are first loaded into the Main Memory through a Host
computer.

Step 3: All instructions are first decoded by the Scalar Control Unit.

Step 4: If the decoded instruction is a scalar operation or a program control operation, it


will be directly executed by the scalar processor using the Scalar Functional Pipelines.

Step 5: If the instructions are decoded as a Vector operation, it will be sent to the vector
control unit.

Step 6: Vector control unit will supervise the flow of vector data between the main
memory and vector functional pipelines.

Note: A number of vector functional pipelines may be built into a Vector Processor.
[Link]
Department of Computer Science and Engg
10/1/2020 53
Multivector and SIMD Computers
Vector Supercomputers - 2 Models
Model 1:Register-to-register architecture

– A fixed number of possibly reconfigurable registers

• Which are used to hold all vector operands, intermediate, and final vector
results.

– All registers are accessible in user instructions

– All vector registers length will be fixed(e.g.64 bits)

Model 2: Memory-to-memory architecture

– Primary memory holds operands and results

– A vector stream unit accesses memory for fetches

– Stores into the main memory in large units superwords (e.g.512 bits)
[Link]
Department of Computer Science and Engg
10/1/2020 54
Multivector and SIMD Computers
SIMD Supercomputers
Single Instruction Stream and over multiple data stream

[Link]
Department of Computer Science and Engg
10/1/2020 55
Multivector and SIMD Computers
SIMD Supercomputers
An operational model of an SIMD computer is specified by a 5-tuple: (N, C, I, M, R)

(1) N = No. of Processing Elements (PE) in the machine.

(2) C =Set of instructions directly executed by the control unit (CU).


• Scalar & Program Flow Control Instructions.

(3) I = Set of instructions broadcast by the CU to all PEs for parallel execution.
• Include: Arithmetic, logic, data routing, masking,

• And other local operations executed by each active PE over data within that PE.

(4) M = Set of Masking Schemes


• Each mask partitions the set of PEs into enabled and disabled subsets for a given cycle.

(5) R = Set of data-routing functions


• Specifying various patterns to be set up in the interconnection network for inter-PE communications.
[Link]
Department of Computer Science and Engg
10/1/2020 56
PRAM and VLSI Models
Parallel Random Access Machines :
o Time and Space Complexities
• Time complexity
» It is a function of problem size
» Asymptotic Time complexity of an algorithm like Big-O etc.
» Best case  Worst case
• Space complexity
» It is a function of problem size S
» Asymptotic Space complexity refers to data storage of large problems
» Program and input data storage is not considered.
• Serial and Parallel complexity 
» Time complexity of an algorithm
» Parallel complexity should be lower serial complexity.
• Deterministic and Non-deterministic algorithm
» In deterministic  every operational step is uniquely defined.
» Operations resulting in one outcome set of possible outcomes.
[Link]
Department of Computer Science and Engg
10/1/2020 57
PRAM and VLSI Models
Parallel Random Access Machines :
 Developed by Fortune and Wyllie (1978)

 Objective:

» Modeling idealized parallel computers with zero synchronization or memory


access overhead

 An n-processor PRAM has a globally addressable Memory

Each Processor can access any


memory location in unit time

[Link]
Department of Computer Science and Engg
10/1/2020 58
PRAM and VLSI Models
Parallel Random Access Machines :
o NP Completeness:
» An algorithm has polynomial complexity  if there exits polynomial p(s) and its
time complexity O(p(s)) for a problem size  s.

» Set of problems having polynomial complexity algorithm by deterministic way is


called P-class (for polynomial class)

» Set of problems solvable by nondeterministic algorithms in polynomial time is


called NP-class (for nondeterministic polynomial class)

» P ⊂ NP

» P is solvable where as NP is not solvable

» Note: P= NP/ P≠ NPIs still an open problem

[Link]
Department of Computer Science and Engg
10/1/2020 59
PRAM and VLSI Models
PRAM Models
Conventional Uniprocessor computers
• Modelled as random access machines (RAM)
Parallel random access machines
• Modelled to idealize parallel computers with zero synchronization / memory access
overhead.
• Used  parallel algorithm development, scalability & complexity analysis.
• Globally addressable shared memory distributed among n Processor Elements (PEs) /
centralized.
• PEs operate on synchronized read memory, compute, write memory cycle.
Four memory-update options in PRAMs:
Exclusive Read (ER): Read takes  1 Cycle atmost 1 Processor
Exclusive Write (EW): Write takes  1 Cycle atmost 1 Processor
Concurrent read (CR):Read of a memory location takes  Same Cycle for multiple Processor
Concurrent Write (CW):Simultaneously Writes to a same memory location  conflict need to be
avoided
[Link]
Department of Computer Science and Engg
10/1/2020 60
PRAM and VLSI Models
PRAM Variants:- Classified based on Memory read/write : Four memory-update
1. EREW PRAM model:
 Prohibits more than 1 processor from reading or writing simultaneously to same cell.
2. CREW PRAM model:
 Mutual exclusion used to avoid write conflicts.
 Allows concurrent reads from same memory.
3. ERCW PRAM model:
 Allows exclusive reads/ concurrent writes from same memory.
4. CRCW PRAM model:
 Allows either concurrent reads/ concurrent writes from same memory.

[Link]
Department of Computer Science and Engg
10/1/2020 61
PRAM and VLSI Models
Discrepancy with Physical Models of PRAM Variants:-

 Most popular variants: EREW and CRCW

 SIMD machine with shared architecture is closest architecture modelled by PRAM

 PRAM Allows different instructions to be executed on different processors simultaneously.


Thus, PRAM really operates in synchronized MIMD mode with shared memory

[Link]
Department of Computer Science and Engg
10/1/2020 62
PRAM and VLSI Models
VLSI Complexity Model:-
 Parallel computers rely on the use of VLSI chip 
 To fabricate components such as Processor arrays, Memory arrays, Switching networks
etc.
 Nowadays, VLSI technologies are 2-dimensional.
 The size of a VLSI chip is proportional to the amount of storage (memory) space
available in that chip.
We deal with AT2 model (2-dimension)
 We can calculate the space complexity of an algorithm by :
 The chip area (A) of the VLSI chip implementation of that algorithm.
 If T is the time (latency) needed to execute the algorithm
 Then A.T gives an upper bound on the total number of bits processed through the chip (or
I/O).
 For certain computing, there exists a lower bound, f(s), such that
A.T2 >= O (f(s))
Where A=chip area and T=time

[Link]
Department of Computer Science and Engg
10/1/2020 63
PRAM and VLSI Models
VLSI Complexity Model:-
Memory Bound on Chip Area:
 Many computations are memory bound has there will be a need in processing large data
sets
 Memory requirement of computation sets a lower bound on the chip area (A)
 Amount of information processed by the chip  can be visualized as information flow
upward across the chip area

Volume of rectangular AT: Product of AT

Note : As the information flows over the time :-


The number of input bits <= Volume AT

[Link]
Department of Computer Science and Engg
10/1/2020 64
PRAM and VLSI Models
VLSI Complexity Model:-
Bisection Communication Bound,

 The bisection is represented by the vertical slice across the chip area A.

The height of the cross section is


The bisection area represent  the
maximum amount of information
exchange between the two halves of
the chip  during the time period T

The distance of this dimension is

[Link]
Department of Computer Science and Engg
10/1/2020 65
Chapter-2
Program and Network Properties

In this chapter we will concentrate on following topics:

[Link]
Department of Computer Science and Engg
10/1/2020 66
Conditions of Parallelism
Three key areas for significant progress in parallel processing:
 Computational models for parallel computing

 Inter-processor communication in parallel architectures

 System integration for incorporating parallel system

Condition of parallelism Includes:


o Data dependence and resource dependence
o Hardware and software dependence
o The role of compiler

[Link]
Department of Computer Science and Engg
10/1/2020 67
Conditions of Parallelism
Data dependence and resource dependence
The ability to execute several program segments in parallel:
• Requires each segment to be independent of the other segments.
• We use a dependence graph to describe the relations.
• The nodes of a dependence graph correspond to the program statement
(instructions)
• Directed edges with different labels are used to represent the ordered
relations among the statements.
• The analysis of dependence graphs shows where opportunity exists for
parallelization.

[Link]
Department of Computer Science and Engg
10/1/2020 68
Conditions of Parallelism
Data dependence and resource dependence
Data Dependence : The ordering relationship between statements is
indicated by the data dependence.

Five Types:

1. Flow dependence

2. Antidependence

3. Output dependence

4. I/O dependence

5. Unknown dependence

[Link]
Department of Computer Science and Engg
10/1/2020 69
Conditions of Parallelism
Data dependence and resource dependence
1. Flow dependence : A statement S2 is flow dependent on S1 RAW

S1: ADD R1, R2, R3


ADD R1SUB R1
S2 : SUB R4, R1, R2

2. Antidependence : Statement S2 is antidependent on the statement


S1WAR
S1: ADD R1, R2, R3

S2 : SUB R4, R1, R2 S3 depends on S2 


i.e S3 should be executed after S2
S3: ADD R1, R2, R3

[Link]
Department of Computer Science and Engg
10/1/2020 70
Conditions of Parallelism
Data dependence and resource dependence
3. Output dependence : Two statements are output dependent if they produce
(write) the same output variable. Also called WAW hazard and denoted as

S1: ADD R1, R2, R3


S3 depends on S1 
S2 : SUB R4, R1, R2
i.e S3 and S1 writes to R1
S3: ADD R1, R2, R3

4. I/O dependence : Read and write are I/O statements.


• I/O dependence occurs not because the same variable is involved but because the same file
referenced by both I/O statement.

[Link]
Department of Computer Science and Engg
10/1/2020 71
Conditions of Parallelism
Data dependence and resource dependence
5. Unknown dependence: The dependence relation between two statements
cannot be determined in the following situations:

• The subscript of a variable is itself subscribed ( indirect addressing )

• The subscript does not contain the loop index variable

• A variable appears more than once with subscripts having different


coefficients of the loop variable

• The subscript is non linear in the loop index variable.

[Link]
Department of Computer Science and Engg
10/1/2020 72
Conditions of Parallelism
Data dependence and resource dependence
Control Dependence: Dependency is know at Run time

Control Dependency  Prohibits Parallelism


[Link]
Department of Computer Science and Engg
10/1/2020 73
Conditions of Parallelism
Data dependence and resource dependence
Resource Dependence:

– Data and control dependencies are based on the independence of the work
to be done.

– Resource independence is concerned with conflicts in using shared resources,


such as registers, integer and floating point ALUs, etc.

– ALU conflicts are called ALU dependence.

– Memory (storage) conflicts are called storage dependence.

[Link]
Department of Computer Science and Engg
10/1/2020 74
Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 1 : Bernstein’s conditions are a set of conditions which
must exist if two processes can execute in parallel.

– Notation

• Ii is the set of all input variables for a process Pi .

• Oi is the set of all output variables for a process Pi .

• If P1 and P2 can execute in parallel (which is written as P1 || P2), then:

[Link]
Department of Computer Science and Engg
10/1/2020 75
Conditions of Parallelism
Data dependence and resource dependence
Bernstein’s Conditions - 2 : In terms of data dependencies, Bernstein’s conditions
imply that two processes can execute in parallel if they are:

– Flow-independent, anti-independent, and output-independent.

– The parallelism relation || is commutative 

(Pi || Pj implies Pj || Pi ),

– But not transitive 

(Pi || Pj and Pj || Pk does not imply Pi || Pk )

– Therefore, || is not an equivalence relation.

[Link]
Department of Computer Science and Engg
10/1/2020 76
Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism : Hardware parallelism is defined by machine architecture
and hardware multiplicity.
– It can be characterized by the number of instructions that can be issued per
machine cycle.
– If a processor issues k instructions per machine cycle, it is called a k-issue
processor.
– Conventional processors are one-issue machines.
– A machine with n processors  with k-issue should be able to handle a
maximum of nk threads simultaneously.
– Examples:
• Intel i960CA is a three-issue processor (arithmetic, memory access, branch)
• IBM RS-6000 is a four-issue processor (arithmetic, floating-point, memory access,
branch)

[Link]
Department of Computer Science and Engg
10/1/2020 77
Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism :

[Link]
Department of Computer Science and Engg
10/1/2020 78
Conditions of Parallelism
Hardware and Software Parallelism
Hardware Parallelism :

[Link]
Department of Computer Science and Engg
10/1/2020 79
Conditions of Parallelism
Hardware and Software Parallelism
Software Parallelism :
– Software parallelism is defined by the control and data dependence of programs,
– Is revealed in the program’s flow graph.
– It is a function of algorithm, programming style, and compiler optimization.

[Link]
Department of Computer Science and Engg
10/1/2020 80
Conditions of Parallelism
Hardware and Software Parallelism
Types of Software Parallelism :
1. Control Parallelism:
• Two or more operations can be performed simultaneously.
• This can be detected by a compiler, or a programmer can explicitly indicate control
parallelism
– By using special language constructs or dividing a program into multiple
processes.
2. Data parallelism:
• Multiple data elements have the same operations applied to them at the same
time.
• This offers the highest potential for concurrency (in SIMD and MIMD modes).
• Synchronization in SIMD machines handled by hardware.

[Link]
Department of Computer Science and Engg
10/1/2020 81
Conditions of Parallelism
Hardware and Software Parallelism
The Role of Compilers :
– Compilers used to exploit hardware features to improve performance.

– Interaction between compiler and architecture design is a necessity in modern


computer development.

– It is not necessarily the case that more software parallelism will improve performance
in conventional scalar processors.

– The hardware and compiler should be designed at the same time.

[Link]
Department of Computer Science and Engg
10/1/2020 82
Conditions of Parallelism
Program Partitioning & Scheduling
Program Partitioning & Scheduling :

Mainly deals with:-


1. Grain Sizes and Latency:

2. Grain Packing and Scheduling

3. Grain determination and scheduling optimization

[Link]
Department of Computer Science and Engg
10/1/2020 83
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)

Grain Sizes and Latency:


– Size of a program for parallel execution can vary

– The sizes are roughly classified using the term “granule size,” or simply “granularity”

– The simplest measure, for example, is the number of instructions in a program part

– Grain sizes are usually described as fine, medium or coarse, depending on the level of
parallelism involved

[Link]
Department of Computer Science and Engg
10/1/2020 84
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Parallelism has exploited at various Processing levels:

[Link]
Department of Computer Science and Engg
10/1/2020 85
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Instruction Level Parallelism:

– This fine-grained, or smallest granularity level typically involves less than 20


instructions per grain.

– The number of candidates (Programs) for parallel execution varies from 2 to


thousands, with about five instructions or statements (on the average) being
the average level of parallelism.

– Advantages:
• There are usually many candidates (Programs) for parallel execution

• Compilers can usually do a reasonable job of finding this parallelism

[Link]
Department of Computer Science and Engg
10/1/2020 86
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Loop Level Parallelism:

– Typical loop has less than 500 instructions.

– If a loop operation is independent between iterations, it can be handled by a


pipeline, or by a SIMD machine.

– Most loop operations are self scheduled to execute on a parallel or vector


machine (MIMD Mode)

– Some loops (e.g. recursive) are difficult to handle.

– Loop-level parallelism is still considered fine grain computation.

[Link]
Department of Computer Science and Engg
10/1/2020 87
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Procedure Level Parallelism:

– It is Medium-sized grain  usually less than 2000 instructions.

– Detection of parallelism is more difficult than with smaller grains

– Interprocedural dependence analysis is difficult and history-sensitive

– Communication requirement less than instruction-level

– SPMD (single procedure multiple data) is a special case

– Multitasking belongs to this level.

[Link]
Department of Computer Science and Engg
10/1/2020 88
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)

Subprogram Level Parallelism:

– Job step level and related subprograms

– Grain typically has thousands of instructions medium- or coarse-grain level

– Job steps can overlap across different jobs

– Multiprograming conducted at this level

– No compilers available to exploit medium- or coarse-grain parallelism at


present.

[Link]
Department of Computer Science and Engg
10/1/2020 89
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)

Job or Program-Level Parallelism:

– Corresponds to execution of essentially independent jobs or programs on a


parallel computer.

– This is practical for a machine with a small number of powerful processors,


but impractical for a machine with a large number of simple processors (since
each processor would take too long to process a single job).

[Link]
Department of Computer Science and Engg
10/1/2020 90
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Final Summary :
– Fine-grain parallelism is exploited at instruction or loop levels
• Assisted by the compiler.
– Medium-grain parallelism is At task or job step level
• Requires programmer and compiler support.
– Coarse-grain parallelism is  At Program Level
• Relies heavily on effective OS support.
– Shared-variable communication used at fine- and medium-grain levels.

– Message passing can be used for medium- and coarse-grain communication

• But fine-grain really need better technique because of heavier


communication requirements.
[Link]
Department of Computer Science and Engg
10/1/2020 91
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Communication Latency:

– Balancing granularity and latency can yield better performance.

– Various latencies attributes  To machine architecture, technology, and


communication patterns used.

– Latency imposes a limiting factor On scalability of machine size.

– Ex. Memory latency increases as memory capacity increases 


• Limiting the amount of memory that can be used with a given tolerance for
communication latency.

[Link]
Department of Computer Science and Engg
10/1/2020 92
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Sizes and Latency)
Inter-process Communication Latency:

– Needs to be minimized by system designer

– Affected by signal delays and communication patterns

– Example:
• n communicating tasks may require 

• n (n - 1)/2 communication links

• The complexity grows quadratically , effectively limiting the number of processors


in the system.

[Link]
Department of Computer Science and Engg
10/1/2020 93
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Packing and Scheduling )

Grain Packing and Scheduling:

Two questions exists in parallel programming:

1. How can I partition a program into parallel “pieces” to yield the shortest
execution time?

2. What is the optimal size of parallel grains?

There is an obvious tradeoff between the:

• Time spent for scheduling and synchronizing parallel grains and the speedup
obtained by parallel execution.

• One approach to the problem is called “grain packing.”


[Link]
Department of Computer Science and Engg
10/1/2020 94
Conditions of Parallelism
Program Partitioning & Scheduling(Grain Packing and Scheduling )

Grain Packing and Scheduling:

• The grain size problem requires determination of both

• The number of partitions

• The size of grains in a parallel problem

• Solution is through problem dependent and machine dependent in


parallelism.

• A short schedule is required for fast execution of subdivided program


modules.

[Link]
Department of Computer Science and Engg
10/1/2020 95
Conditions of Parallelism
Program Partitioning & Scheduling
(Grain Determination and Scheduling Optimization)

Grain determination and scheduling optimization:

Includes 4 steps:

Step 1: Construct a fine-grain program graph

Step 2: Schedule the fine-grain computation

Step 3: Grain packing to produce coarse grains

Step 4: Generate a parallel schedule based on the packed graph

[Link]
Department of Computer Science and Engg
10/1/2020 96
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
A program graph is like a dependence graph
• Nodes= {(n, s)} where n= node name and s=
size
– Larger the s = larger grain size
– Edges= {(v,d)} where v= communicated variable
and d=communication delay

– Packing 2 or more nodes produces a larger grain


size node and possibly more edges to other nodes.

– Packing helps in eliminating unnecessary


communication delay or reduce overall scheduling
overhead.[Link]
Department of Computer Science and Engg
10/1/2020 97
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
• Nodes l, 2, 3, 4, 5, and 6 are memory reference
(data fetch} operations.
• Each takes one cycle to address and six cycles to
fetch from memory.

[Link]
Department of Computer Science and Engg
10/1/2020 98
Conditions of Parallelism
Program Partitioning & Scheduling(Program Graphs and Packing)
– Each node in the program graph corresponds to a
computational unit in the program.

– The grain size is measured by the number of basic


machine cycle (including both processor and
memory cycles) needed to execute all the
operations within the node.

– Grain size reflects the number of computations


involved in a program segment.

– Fine-grain nodes have a smaller grain size. and


coarse-grain nodes have a larger grain size.
[Link]
Department of Computer Science and Engg
10/1/2020 99
Program Flow Mechanisms

There are three flow mechanism :

– Control flow mechanism(used by conventional computers)


• Order of program execution  Stated in the user program

– Data-driven mechanism(used by data-flow computers)


• Order of program execution  Is driven by data(operand) availability

– Demand-driven mechanism(used by reduction computers)


• Order of program execution  Based on the demand for its result by other
computations

[Link]
Department of Computer Science and Engg
10/1/2020 100
Program Flow Mechanisms
Control flow mechanism
Control flow mechanism(used by conventional computers):
– Conventional von Neumann computers use a 

• Program counter to sequence the execution of instruction in a program.

• This sequential execution style is called control-driven.

– Control flow computers use shared memory to hold program instructions and data
objects.

– A uniprocessor computers is inherently sequential  Use control-driven mechanism.

– Control flow can be made parallel by using parallel language constructs or parallel
compiler.

– Control flow machines give complete control, but are less efficient than other
approaches.[Link]
Department of Computer Science and Engg
10/1/2020 101
Program Flow Mechanisms
Data flow mechanism
Data-driven mechanism(used by data-flow computers):

– Dataflow computers emphasize a high degree of parallelism at the fine-grain


instructional level.

– Data flow machines have 

• High potential for parallelism and throughput and freedom from side effects,

• But have high control overhead, lose time waiting for unneeded arguments, and
difficulty in manipulating data structures.

– No need for  shared memory, program counter, control sequencer

– Special mechanisms are required to detect data availability, match data tokens
with instructions needing them, enable chain reaction of asynchronous instruction
execution [Link]
Department of Computer Science and Engg
10/1/2020 102
Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism(used by reduction computers):

– Data-driven machines select instructions for execution based on the availability of


their operands  this is essentially a bottom-up approach.

• Eager Evaluation

– Demand-driven machines take a top-down approach, attempting to execute the


instruction (a demander) that yields the final result.

• This triggers the execution of instructions that yield its operands, and so forth.

• Lazy Evaluation

[Link]
Department of Computer Science and Engg
10/1/2020 103
Program Flow Mechanisms
Demand flow mechanism
Demand-driven mechanism(used by reduction computers): 2 Reduction Models:

1. String-reduction model:

– Each demander gets a separate copy of the expression string 


 To evaluate each reduction step has an operator

 Embedded reference to demand the corresponding input operands.

 Each operator is suspended while arguments are evaluated

2. Graph-reduction model:

– Expression graph reduced by evaluation of branches or subgraphs, possibly in parallel,


with demanders given pointers to results of reductions.

– Based on sharing of pointers to arguments; traversal and reversal of pointers


continues until constant arguments are encountered.
[Link]
Department of Computer Science and Engg
10/1/2020 104
[Link]
10/1/2020 Department of Computer Science and Engg
105
System Interconnect Architectures

– Direct networks for static connections

– Indirect networks for dynamic connections

– Networks are used for 

• Internal connections in a centralized system among

• Processors

• Memory modules

• I/O disk arrays

• Distributed networking of multicomputer nodes

[Link]
Department of Computer Science and Engg
10/1/2020 106
System Interconnect Architectures

– Static and dynamic networks are used

• For interconnecting computer subsystems

• For constructing multiprocessors or multicomputers

– Ideal: Construct a low-latency network with a high data transfer rate

[Link]
Department of Computer Science and Engg
10/1/2020 107
System Interconnect Architectures
Network Properties and Routing
– Static Networks:
• Point-to-Point direct connections which will not change during program
execution

– Dynamic Networks:
• Implemented with switched channels.

• They are dynamically configured to match the communication demand in user


programs

[Link]
Department of Computer Science and Engg
10/1/2020 108
System Interconnect Architectures
Network Properties and Routing
Goals and Analysis:

– The goals of an interconnection network are to provide


• Low-latency

• High data transfer rate

• Wide communication bandwidth

– Analysis includes
• Latency

• Bisection bandwidth

• Data-routing functions

• Scalability
[Link]
of parallel architecture
Department of Computer Science and Engg
10/1/2020 109
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:

– Network usually represented by a graph :


• With a finite number of nodes

• Linked by directed or undirected edges.

– Number of nodes in graph = network size .

– Number of edges (links or channels) incident on a node  node degree d


(also note in and out degrees when edges are directed).
• Node degree reflects number of I/O ports associated with a node, and should
ideally be small and constant.
[Link]
Department of Computer Science and Engg
10/1/2020 110
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:

– Diameter D of a network
• Is the maximum shortest path between any two nodes

• Measured by the number of links traversed

• This should be as small as possible (from a communication point of view)

– Channel bisection width b  minimum number of edges cut to split a


network into two parts each having the same number of nodes.
• Since each channel has w bit wires, the wire bisection width B = bw.
• Bisection width provides good indication of maximum communication
bandwidth along the bisection of a network,
• And all other cross sections should be bounded by the bisection width.
[Link]
Department of Computer Science and Engg
10/1/2020 111
System Interconnect Architectures
Network Properties and Routing
Terminologies used in Network Properties and Routing:

– Wire (or channel) length  length (e.g. weight) of edges between nodes.

– Network is symmetric  if the topology is the same looking from any node;
these are easier to implement or to program.

[Link]
Department of Computer Science and Engg
10/1/2020 112
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Is used for inter-PE data exchange

– Shifting

– Permutation (one to one)

– Broadcast (one to all)

– Multicast (many to many)

– Personalized broadcast (one to many)

– Shuffle

– Exchange Etc.

[Link]
Department of Computer Science and Engg
10/1/2020 113
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Permutation (one to one)

– Given n objects, there are n ! ways in which they can be reordered (one of
which is no reordering).

– For example: π = (a , b , c) (d , e)

• Possible mapping are: ab, bc, ca, de, ed in a circular fashion.

• The cycle (a , b , c) has 3 periods and (d , e)has 2 periods. Hence


permutation π has a period=3 * 2 = 6

• If one has to be permuted then it has identity mapping

I = (a) (b) (c) (d) (e)


[Link]
Department of Computer Science and Engg
10/1/2020 114
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Permutation (one to one)

– A permutation can be specified by giving the rule for reordering a group of


objects.

– Permutations can be implemented using crossbar switches, multistage


networks, shifting, and broadcast operations.

– The time required to perform permutations of the connections between


nodes often dominates the network performance when n is large.

[Link]
Department of Computer Science and Engg
10/1/2020 115
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Perfect Shuffle and Exchange

– Perfect Shuffle is a special permutation function by Harold Stone

– The entries according to the mapping of the k-bit binary number

ab…k to bc…ka
– That is, shifting 1 bit to the left and wrapping it around to the least
significant bit position.

– The inverse perfect shuffle reverses the effect of the perfect shuffle.

[Link]
Department of Computer Science and Engg
10/1/2020 116
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Perfect Shuffle and Exchange

– Perfect Shuffle is a special permutation function by Harold Stone

Perfect Shuffle
(Left Shift)

Inverse Perfect
Shuffle
(Right Shift)

[Link]
Department of Computer Science and Engg
10/1/2020 117
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Hypercube Routing Functions

A 3-cube with nodes C2C1C0

[Link]
Department of Computer Science and Engg
10/1/2020 118
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Hypercube Routing Functions

A 3-cube with nodes C2C1C0

[Link]
Department of Computer Science and Engg
10/1/2020 119
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Broadcast and Multicast

– Broadcast is one-to-all mapping


• This can be achieved in SIMD computers.

• Message passing multicomputers also has mechanisms to broadcast.

– Multicast is one-to-many mapping

• Is implemented by matching the destination nodes in the network.

[Link]
Department of Computer Science and Engg
10/1/2020 120
System Interconnect Architectures
Network Properties and Routing
Data Routing Functions:Network Performance
– Functionality:-
• How the network supports data routing, interrupt handling, synchronization,
request/message combining, and coherence.
– Network latency:-
• Worst-case time for a unit message to be transferred.
– Bandwidth
• Maximum data rate
– Hardware complexity
• Implementation costs for wire, logic, switches, connectors, etc.
– Scalability
• How easily does the scheme adapt to an increasing number of processors,
memories, etc.?
[Link]
Department of Computer Science and Engg
10/1/2020 121
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Static networks use direct links which are fixed
once built.

Few topologies are:


– Linear Array

– Ring and Chordal Ring

– Barrel Shifter

– Tree and Star

– Fat Tree

– Mesh and Torus

[Link]
Department of Computer Science and Engg
10/1/2020 122
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Linear Array

– It is one-dimensional network where 

• N nodes connected by n-1 links in a line.

– Internal nodes have degree 2;

– End nodes have degree 1.

– Diameter = n-1

– Bisection = 1

– For small n, this is economical, but for large


n, it is obviously inappropriate.
[Link]
Department of Computer Science and Engg
10/1/2020 123
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Ring, Chordal
Ring

– Like a linear array, but the two end nodes


are connected by an n th link;

– The ring can be uni-directional or


bidirectional.

– Diameter is n/2 for a bidirectional ring, or n


for a unidirectional ring.

[Link]
Department of Computer Science and Engg
10/1/2020 124
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Ring, Chordal
Ring

– By adding additional links (e.g. “chords” in a


circle)

– The node degree is increased, and we


obtain a chordal ring.

– This reduces the network diameter.

– In the limit, we obtain a fully-connected


network, with a node degree of n -1 and a
diameter of[Link]
1.
Department of Computer Science and Engg
10/1/2020 125
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Ring, Chordal
Ring

Other example for Chordal Ring

[Link]
Department of Computer Science and Engg
10/1/2020 126
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Barrel Shifter

Barrel shifter is obtained from Ring  By


adding extra lines from each node to those
nodes having a distance equal to an integer
powers of 2

[Link]
Department of Computer Science and Engg
10/1/2020 127
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Barrel Shifter

Barrel shifter is obtained from Ring  By


adding extra lines from each node to those
nodes having a distance equal to an integer
powers of 2

[Link]
Department of Computer Science and Engg
10/1/2020 128
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Tree and Star

– A k-level completely balanced binary tree will have N = 2k – 1 nodes,

– With maximum node degree of 3 and network diameter is 2(k – 1).

– The balanced binary tree is scalable, since it has a constant maximum node
degree.

[Link]
Department of Computer Science and Engg
10/1/2020 129
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Tree ,Star and Fat Tree
– A star is a two-level tree with a node degree d = N – 1 and a constant diameter
of 2.
– A fat tree is a tree in which the number of edges between nodes increases
closer to the root
– The edges represent communication channels (“wires”).

[Link]
Department of Computer Science and Engg
10/1/2020 130
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Mesh and Torus

– Pure mesh:

• N = nk nodes with links between each adjacent pair of nodes in a row or


column (or higher degree).

• Interior node degree d = 2k, diameter = k (n – 1).

• This is not a symmetric network


– degree and diameter doesn’t exist as per the above requirement

[Link]
Department of Computer Science and Engg
10/1/2020 131
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Mesh and Torus

– Illiac mesh (used in Illiac IV computer):


• Wraparound is allowed  thus reducing the network
diameter to about half that of the equivalent pure mesh.

– A torus mesh :
• Has ring connections in each dimension, and is symmetric.

• An n × n binary torus has node degree of 4 and a diameter


of 

[Link]
Department of Computer Science and Engg
10/1/2020 132
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Systolic Array

– A systolic array is an arrangement of processing elements and communication


links designed specifically to match the computation and communication
requirements of a specific algorithm (or class of algorithms).

– This specialized character may yield better performance than more


generalized structures, but also makes them more expensive, and more
difficult to program.

[Link]
Department of Computer Science and Engg
10/1/2020 133
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Hypercubes

– A binary n-cube architecture with N = 2n nodes spanning along n dimensions,


with two nodes per dimension.

– The hypercube scalability is poor, and packaging is difficult for higher-


dimensional Hypercubes

[Link]
Department of Computer Science and Engg
10/1/2020 134
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Cube-connected Cycles

– 3-cube connected cycles (CCC)  can be created from a 3-cube by replacing


each corner nodes (vertex) of the 3 cube by a ring of 3 nodes.

– k-cube connected cycles (CCC)  can be created from a k-cube by replacing


each vertex of the k dimensional hypercube by a ring of k nodes.
[Link]
Department of Computer Science and Engg
10/1/2020 135
System Interconnect Architectures
Network Properties and Routing
Static Connection Networks  Network Throughput

– Network throughput:- number of messages a network can handle in a unit


time interval.

– One way to estimate is to calculate the maximum number of messages that


can be present in a network at any instant (its capacity)  Throughput usually
is some fraction of its capacity.

– A hot spot is a pair of nodes that accounts for a disproportionately large


portion of the total network traffic (possibly causing congestion).

– Hot spot throughput is maximum rate at which messages can be sent


[Link]
between two specific nodes.
Department of Computer Science and Engg
10/1/2020 136
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks 

– Dynamic connection networks can implement all communication patterns


based on program demands.

– In increasing order of cost and performance, these include


• Bus systems

• Multistage interconnection networks

• Crossbar switch networks

– Price can be attributed to the cost of wires, switches, arbiters, and connectors.

– Performance is indicated by network bandwidth, data transfer rate, network


[Link]
latency, and communication patterns supported.
Department of Computer Science and Engg
10/1/2020 137
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Bus Systems

– A bus system (contention bus, time-sharing bus):


• Has a collection of wires and connectors
• Multiple modules (processors, memories, peripherals, etc.) which connect
to the wires
• Data transactions between pairs of modules
– Bus supports only one transaction at a time.

– Bus arbitration logic must deal with conflicting requests.

– Lowest cost and bandwidth of all dynamic schemes.

– Many bus standards are available.


[Link]
Department of Computer Science and Engg
10/1/2020 138
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Bus Systems
– A bus is the simplest type of dynamic interconnection networks.
– It constitutes a common data transfer path for many devices.
– Depending on the type of implemented transmissions we have serial
busses and parallel busses.
– The devices connected to a bus can be processors, memories, I/O units, as
shown in the figure below.

[Link]
Department of Computer Science and Engg
10/1/2020 139
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Bus Systems
– The throughput of the network based on a bus  can be increased by the use
of a multibus network shown in the figure below.
– In this network, processors connected to the busses can transmit data in
parallel (one for each bus) and many processors can read data from many
busses at a time.

[Link]
Department of Computer Science and Engg
10/1/2020 140
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Switch Modules

– An a × b switch module has a inputs and b outputs.

– A binary switch has a = b = 2.


• It is not necessary for a = b, but usually a = b = 2k , for some integer k.

[Link]
Department of Computer Science and Engg
10/1/2020 141
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Switch Modules

– In general, any input can be connected to one or more of the outputs.

– However, multiple inputs may not be connected to the same output. When
only one-to-one mappings are allowed, the switch is called a crossbar switch.

[Link]
Department of Computer Science and Engg
10/1/2020 142
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks

– In general, any multistage network is comprised of a 


• Collection of a × b switch modules and fixed network modules.

• The a × b switch modules are used to provide variable permutation or


other reordering of the inputs, which are then further reordered by the
fixed network modules.

– A generic multistage network consists of a sequence alternating dynamic


switches (with relatively small values for a and b) with static networks (with
larger numbers of inputs and outputs). The static networks are used to
implement[Link]
interstage connections (ISC).
Department of Computer Science and Engg
10/1/2020 143
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks

Single-stage networks:

• Perfect shuffle
operation: cyclic shift
1 place left, eg 101 -->
011

[Link]
Department of Computer Science and Engg
10/1/2020 144
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks

– Single-stage networks:Since a single-stage network was not sufficient to get


the desired output, a multistage is built.

– Omega Network : An example of a multistage network is the omega network.

– The capability of single stage networks are limited 


• But if we cascade enough of them together, they form a completely connected
MIN (Multistage Interconnection Network).

– Switches can perform their own routing or can be controlled by a central


router
[Link]
Department of Computer Science and Engg
10/1/2020 145
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks

[Link]
Department of Computer Science and Engg
10/1/2020 146
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
This type of networks can be classified into the following four categories:
– Nonblocking : A network is called strictly nonblocking if it can connect any
idle input to any idle output regardless of what other connections are
currently in process

– Rearrangeable nonblocking : In this case a network should be able to


establish all possible connections between inputs and outputs by rearranging
its existing connections.

– Blocking interconnection: A network is said to be blocking if it can perform


many, but not all, possible connections between terminals. Example: the
[Link]
Omega network
Department of Computer Science and Engg
10/1/2020 147
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Omega Networks:
A 2 × 2 switch can be configured for Straight-through, Crossover, Upper broadcast
(upper input to both outputs), Lower broadcast (lower input to both outputs)
– With four stages of eight 2 × 2 switches, and a static perfect shuffle for each of
the four ISCsA 16 by 16 Omega network can be constructed (but not all
permutations are possible).

[Link]
Department of Computer Science and Engg
10/1/2020 148
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Baseline Networks:
– A baseline network can be shown to be topologically equivalent to other
networks (including Omega), and has a simple recursive generation procedure.
– Stage k (k = 0, 1, …) is an m × m switch block (where m = N / 2k ) composed
entirely of 2 × 2 switch blocks, each having two configurations: straight
through and crossover.
– The network can be generated recursively
– The first stage N × N, the second (N/2) × (N/2).
– Networks are topologically equivalent if one network can be easily reproduced
from the other networks by simply rearranging nodes at each stage

[Link]
Department of Computer Science and Engg
10/1/2020 149
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Baseline Networks:

[Link]
Department of Computer Science and Engg
10/1/2020 150
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Crossbar Networks:
– A crossbar switch has a number of input and output data pins and a number
of control pins.
– In response to control instructions set to its control input, the crossbar
switch implements a stable connection of a determined input with a
determined output.
– A m × n crossbar network
• With m processors and n memories, one processor may be able to generate
requests for multiple memories in sequence; thus several switches might be set
in the same row.
• For m × m interprocessor communication, each PE is connected to both an input
and an output of the crossbar; only one switch in each row and column can be
turned on simultaneously. Additional control processors are used to manage the
crossbar itself.
[Link]
Department of Computer Science and Engg
10/1/2020 151
System Interconnect Architectures
Network Properties and Routing
Dynamic Connection Networks  Multistage Interconnection Networks
Crossbar Networks:
– Each junction is a switching component – connecting the row to the column.

[Link]
Department of Computer Science and Engg
10/1/2020 152
Principles of Scalable Performance
Chapter-3
This chapter includes:

– Performance measures
– Speedup laws
– Scalability principles
– Scaling up vs. scaling down

[Link]
Department of Computer Science and Engg
10/1/2020 153
Principles of Scalable Performance
Chapter-3
Performance metrics and measures:-

– Parallelism profiles

– Asymptotic speedup factor

– System efficiency, utilization and quality

– Standard performance measures

[Link]
Department of Computer Science and Engg
10/1/2020 154
Principles of Scalable Performance
Parallelism Profile in Programs
Parallelism Profile in Programs:
– Degree of parallelism  Reflects the matching of software and
hardware parallelism

– Under this topic we study:


• Degree of parallelism

• Average parallelism

• Available parallelism

• Asymptotic speedup factor


[Link]
Department of Computer Science and Engg
10/1/2020 155
Principles of Scalable Performance
Parallelism Profile in Programs

Degree of parallelism(DOP):  Reflects the matching of


software and hardware parallelism

– DOP is defined as:For each period, the number of processors


used to execute a program in parallelism profile
• This is a discrete time function only for non-negative integer values

– Parallelism profile of a given program  is a plot of the DOP as


a function of time Ideally have unlimited resources

[Link]
Department of Computer Science and Engg
10/1/2020 156
Principles of Scalable Performance
Parallelism Profile in Programs
Degree of parallelism(DOP): Factors affecting parallelism
profiles:

– Algorithm structure

– Program optimization

– Resource utilization

– Run-time conditions

– Realistically limited by  # of available processors, memory,


and other[Link]
non-processor resources
Department of Computer Science and Engg
10/1/2020 157
Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism:Its variables on which it depends-
– n  Homogeneous Processors

– m  Maximum Parallelism in a Profile


• In ideal case, n >> m

– ∆  Computing Capacity of a Single Processor


• Depends on  execution rate such as MIPS

• no overhead

– DOP = i  i processors busy during an observation period


[Link]
Department of Computer Science and Engg
10/1/2020 158
Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism: Total amount of work W (instructions or
computations) performed is proportional to the area under the
profile curve

This integral is often


computed by the following
discrete summation

ti  Is the total amount of time


that DOP=i

[Link]
Department of Computer Science and Engg
10/1/2020 159
Principles of Scalable Performance
Parallelism Profile in Programs
Average parallelism:Is computed by

In discrete form,
we get

[Link]
Department of Computer Science and Engg
10/1/2020 160
Principles of Scalable Performance
Parallelism Profile in Programs
Example: Parallelism profile and average parallelism

[Link]
Department of Computer Science and Engg
10/1/2020 161
Principles of Scalable Performance
Parallelism Profile in Programs
Asymptotic speedup  the amount of work executed
with DOP=i
For i=1(Single processor-sequential)

W.R.T
Response Time

For i=∞(Infinite number of processors)

[Link]
Department of Computer Science and Engg
10/1/2020 162
Principles of Scalable Performance
Parallelism Profile in Programs
Asymptotic speedup  S∞  is defined as the ratio of
T(1) to T(∞)

In ideal case,
n=∞
n >> m

[Link]
Department of Computer Science and Engg
10/1/2020 163
Principles of Scalable Performance
Mean Performance
Performance measures:
– Consider n processors executing m programs in various modes
– Want to define the mean performance of these multimode
computers:
• Arithmetic mean performance
• Harmonic mean performance

[Link]
Department of Computer Science and Engg
10/1/2020 164
Principles of Scalable Performance
Mean Performance
Arithmetic mean performance:
Let {Ri} be the execution rate of programs i=1,2,3,….,m.
Arithmetic mean execution rate is defined as:

Ex1: Average time to complete a program (10 secs, 20 secs, 15 secs)  15


secs •
Ex2:Consider the scores (1, 2, 2, 2, 3, 9). The arithmetic mean is 3.17, but five
out of six scores are below this.
[Link]
Department of Computer Science and Engg
10/1/2020 165
Principles of Scalable Performance
Mean Performance
Arithmetic mean performance:
Let {Ri} be the execution rate of programs i =1,2,3,….,m
Note: If the programs are weighted with a distribution
π={fi | i=1,2,…..m}
Weighted Arithmetic mean execution rate is defined as:
By replacing 1/m with fi

[Link]
Department of Computer Science and Engg
10/1/2020 166
Principles of Scalable Performance
Mean Performance
Harmonic mean performance:

[Link]
Department of Computer Science and Engg
10/1/2020 167
Principles of Scalable Performance
Mean Performance
Harmonic mean performance:

[Link]
Department of Computer Science and Engg
10/1/2020 168
Principles of Scalable Performance
Mean Performance
Harmonic mean Speedup:

[Link]
Department of Computer Science and Engg
10/1/2020 169
Principles of Scalable Performance
Mean Performance-Amdahl’s Law

[Link]
Department of Computer Science and Engg
10/1/2020 170
Principles of Scalable Performance
Mean Performance - Amdahl’s Law

[Link]
Department of Computer Science and Engg
10/1/2020 171
Principles of Scalable Performance
Efficiency, Utilization and Quality

[Link]
Department of Computer Science and Engg
10/1/2020 172
Principles of Scalable Performance
Efficiency, Utilization and Quality

[Link]
Department of Computer Science and Engg
10/1/2020 173
Principles of Scalable Performance
Efficiency, Utilization and Quality

[Link]
Department of Computer Science and Engg
10/1/2020 174
Principles of Scalable Performance
Efficiency, Utilization and Quality

[Link]
Department of Computer Science and Engg
10/1/2020 175
Principles of Scalable Performance
Efficiency, Utilization and Quality

Mflops  mega floating-point operations


per second

KLIPS [Link]
kilo logic inference per second used to indicate the
10/1/2020
reasoning powerDepartment
of an AIof Computer
machine Science and Engg
176
Principles of Scalable Performance
Parallel Processing Applications
Few applications listed:-

– Drug design
– High-speed civil transport
– Ocean modelling
– Ozone depletion research
– Air pollution
– Digital anatomy

[Link]
Department of Computer Science and Engg
10/1/2020 177
Parallel Processing Applications

Massively parallel processing has become one of the challenges in


supercomputer applications.

Massive Parallelism for Grand Challenges:Which varies with time.

– Any machine having hundreds and thousands of processors is a Massively


Parallel Processing (MPP) System.

– Here we will discuss the grand challenges identified in the U.S. High
Performance Computer and Communication (HPCC) program.

[Link]
Department of Computer Science and Engg
10/1/2020 178
Parallel Processing Applications

[Link]
Department of Computer Science and Engg
10/1/2020 179
Parallel Processing Applications

Exploiting Massive Parallelism:

– Instruction level parallelism:


• Very limited

• Very few processors exists which can execute 2 instructions in one cycle.

• Depends upon complier, OS incapability's and program flow built in modern computers.

– Data parallelism
• Higher than Instruction level parallelism.

• Deals with large array/operands

– On Message passing multicomputer  Parallelism is more scalable than


shared multiprocessor
[Link]
Department of Computer Science and Engg
10/1/2020 180
Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing:

– Keeping the workload as constant (α)  Efficiency E decreases rapidly as


Machine Size n increases

– Because: Overhead h increases faster than machine size


[Link]
Department of Computer Science and Engg
10/1/2020 181
Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing:
– To maintain the efficiency at desired level  we has to increase the machine
size and program size proportionally  Know as Scalable computers/Scaled
problems
– Workload with linear function of n (represented as γ)-linear scalability in
problem size.
– If linear not possible we can go to Sublinear (β) by keeping some constant value
– If there is enormous growth in workload (θ)  system is poorly scalable To
keep constant efficiency  Exceeds Memory and I/O limits

[Link]
Department of Computer Science and Engg
10/1/2020 182
Parallel Processing Applications ( Application Models)
Application Models for Parallel Processing: 3 Models  Depends on limited
memory, limited latency tolerance , limited I/O bandwidth
The models are:
1. Fixed load model:
• Corresponds to constant workload (α)

• The use is limited by the communication bound.


2. Fixed time model:
• Demands constant execution time

• Linear workload growth refers to this model (γ)


3. Fixed memory model:
• Is limited by memory bound

• Corresponds to workload curve between γ and θ

[Link]
Department of Computer Science and Engg
10/1/2020 183
Parallel Processing Applications
( Scalability of Parallel Algorithms)
Algorithmic characteristics:-

– Parallel algorithms are specially designed for parallel computers

– Important characteristics of parallel algorithms which are machine


implementable are:
• Deterministic versus Non-Deterministic

• Computational granularity

• Parallelism Profile

• Communicational patterns and synchronization requirements

• Uniformity of the operations

• Memory requirement and data structures.


[Link]
Department of Computer Science and Engg
10/1/2020 184
Parallel Processing Applications
( Scalability of Parallel Algorithms)
Isoefficiency:- It is the concept 

– Relating workload (w = w(s) // s-Problem size) to machine size n  needed


to maintain a fixed efficiency For a parallel algorithm on a parallel
computer

– Let h  Communication overhead of the algorithm implemented

 h=h(s , n)

– Efficiency is given by : -

– With Fixed problem size  Efficiency decreases as n increases

– With Fixed machine size  Efficiency increases as s increases with n fixed


[Link]
Department of Computer Science and Engg
10/1/2020 185
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 186
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 187
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 188
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 189
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 190
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 191
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 192
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 193
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 194
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 195
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 196
Speedup Performance Laws

G(n)Reflects the increase in workload has memory increases n times.

[Link]
Department of Computer Science and Engg
10/1/2020 197
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 198
Speedup Performance Laws

[Link]
Department of Computer Science and Engg
10/1/2020 199
Scalability Analysis and Approaches

[Link]
Department of Computer Science and Engg
10/1/2020 200
Scalability Analysis and Approaches

Scalability Metrics and Goals:

[Link]
Department of Computer Science and Engg
10/1/2020 201
Scalability Analysis and Approaches

Attained by varying
only the number of
processor(n)

[Link]
10/1/2020 Department of Computer Science and Engg
202
Scalability Analysis and Approaches

[Link]
10/1/2020 Department of Computer Science and Engg
203
Scalability Analysis and Approaches
Scalability Issues and Possible solutions:

[Link]
Department of Computer Science and Engg
10/1/2020 204
[Link]
10/1/2020 Department of Computer Science and Engg
205

Common questions

Powered by AI

Major advancements beyond the von Neumann architecture include the integration of look-ahead, parallel processing, pipelining, Flynn’s classification of parallel computers, and the development of parallel/vector computers. These developments have allowed for the handling of complex and large-scale computations more efficiently, addressing limitations of the traditional sequential computing approach .

Job or program-level parallelism is feasible for systems with a small number of powerful processors, facilitating the execution of independent program streams efficiently. However, in systems with many simple processors, it is impractical due to the extended processing time required per job, highlighting a mismatch in resources and task allocation capabilities .

Loop-level parallelism offers the advantage of executing independent iterations simultaneously, which is effective in pipelined or SIMD environments. However, recursive loops pose complexities due to interdependencies that hinder parallel execution. Proper scheduling and synchronization are critical to harnessing its benefits while overcoming challenges .

The design of hardware and compilers needs to be synchronized to exploit parallelism effectively in conventional scalar processors. This synchronization allows the identification and implementation of parallel constructs, enhancing program execution efficiency .

In multiprocessor systems, achieving a low-latency network design with high data transfer rates is essential for efficient operations. Low latency reduces the time required for data transmission between nodes, while high data transfer rates ensure that large volumes of data can be moved swiftly, enhancing the overall system throughput and performance .

Multistage interconnection networks use a combination of switch modules arranged in stages to provide dynamic routing and reordering of inputs for flexible communication. Their advantage lies in their ability to support various communication patterns and high connectivity, enhancing system flexibility and performance in parallel computing environments .

Inter-process communication latency is crucial to system scalability as it limits the effective size of parallel architectures. High latency leads to longer signal delays and inefficient communication, which become bottlenecks as the system grows. Minimizing latency is essential to maintaining performance and enabling scalable system expansion .

Grain packing involves determining the optimal size and number of parallel grains in a program to minimize execution time. The challenge lies in balancing the overhead of synchronizing and scheduling these grains against the performance gains from parallel execution. It requires careful consideration of both machine and problem-specific characteristics to achieve effective scheduling .

Grain sizes, or granularity, greatly influence the parallel execution of a program by determining the number of instructions per program part. They are classified as fine, medium, or coarse based on parallelism levels, impacting scheduling and optimization strategies. Fine-grain parallelism is more common at the instruction level, while coarse-grain parallelism suits program-level execution .

Static networks use fixed, point-to-point connections that do not change during program execution, providing consistent communication paths. In contrast, dynamic networks utilize switched channels that can be reconfigured based on user program demands, offering flexibility and potentially optimizing communication traffic according to current requirements .

You might also like