Chapter 8
Multivector and SIMD computers
-by
Prajwala T R
Dept of CSE
PESIT
Vector processing principles
Vector instruction types
• Vector: an ordered collection of scalar items of the
same type.
• Elements are addressed with a fixed increment, the stride.
• A vector processor is an ensemble of vector
registers, functional pipelines, other registers, and a vectorizer.
• Vector processing: arithmetic and logical
operators are applied to vectors.
• Vectorization: conversion of scalar code to vector code.
• Vector processors are faster and more efficient than scalar processors.
• Vector processing reduces software overhead.
Vector instruction types
• Vector-vector instructions
• Vector-scalar instructions
• Vector-memory instructions
• Gather and scatter instructions
– Gather: M → V1 × V0
– Scatter: V1 × V0 → M
• Masking instructions: use a mask vector to compress or expand the
vector
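The gather, scatter, and masking instructions above can be sketched in plain Python. This is only an illustration under assumptions: the function names (`gather`, `scatter`, `compress`, `expand`) and the list-based "memory" are hypothetical, and V0 is taken to be an index vector of offsets from a base address.

```python
def gather(memory, base, index_vector):
    """Gather (M -> V1 x V0): load memory[base + i] for each offset i in V0."""
    return [memory[base + i] for i in index_vector]


def scatter(memory, base, index_vector, values):
    """Scatter (V1 x V0 -> M): store each element of V1 at memory[base + offset]."""
    for offset, value in zip(index_vector, values):
        memory[base + offset] = value


def compress(vector, mask):
    """Masking: keep only the elements whose mask bit is set (shorter vector)."""
    return [x for x, m in zip(vector, mask) if m]


def expand(short_vector, mask, fill=0):
    """Masking: place elements back at the set mask positions (longer vector)."""
    it = iter(short_vector)
    return [next(it) if m else fill for m in mask]


memory = list(range(100, 120))      # toy memory: word at address a holds 100 + a
v0 = [4, 2, 7]                      # index vector (assumed offsets)
v1 = gather(memory, 0, v0)          # sparse load into a dense vector
scatter(memory, 10, v0, [1, 2, 3])  # sparse store of a dense vector
```

Gather turns a sparse access pattern into a dense vector register; scatter does the reverse.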
Vector instructions in Cray-like
computers
Vector access memory schemes
• Vector operand specification
– Base address
– Stride
– Length
– Access rate should match pipeline rate
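As a small illustration of the operand specification, the addresses of a vector's elements follow directly from base address, stride, and length. The helper name `vector_addresses` and the row-major matrix layout are assumptions made for this sketch.

```python
def vector_addresses(base, stride, length):
    """Addresses touched by a vector operand given (base, stride, length)."""
    return [base + i * stride for i in range(length)]


# Assumed example: a 3 x 4 matrix stored row-major starting at address 1000,
# one word per element.
ROWS, COLS, BASE = 3, 4, 1000

row_1 = vector_addresses(BASE + 1 * COLS, 1, COLS)   # a row has stride 1
col_2 = vector_addresses(BASE + 2, COLS, ROWS)       # a column has stride COLS
```

The same data structure yields different strides depending on the direction of traversal, which is why the memory system has to handle more than unit stride.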
C-access (concurrent access) memory
organization
• m-way low-order interleaved memory
structure
• If the stride is 1, successive addresses are accessed
one minor cycle apart.
• If the stride is 2, successive accesses are separated by 2
minor cycles.
• Maximum throughput of m words per memory cycle.
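The effect of stride on an m-way low-order interleaved memory can be sketched with a simplified model. It assumes that address a lives in module a mod m and that sustained throughput is proportional to the number of distinct modules a strided access stream visits. The function names are hypothetical.

```python
from math import gcd


def module_of(address, m):
    """Low-order interleaving: the low-order bits (address mod m) pick the module."""
    return address % m


def distinct_modules(stride, m):
    """Number of different modules visited by a stream with the given stride."""
    return m // gcd(stride, m)


def words_per_memory_cycle(stride, m):
    """Simplified model: one word per visited module per (major) memory cycle."""
    return distinct_modules(stride, m)


m = 8
stride_1 = [module_of(a, m) for a in range(0, 8, 1)]    # every module used once
stride_2 = [module_of(a, m) for a in range(0, 16, 2)]   # only even modules used
```

Under this model a stride of 1 reaches the peak of m words per memory cycle, a stride of 2 halves it, and an odd stride keeps all modules busy when m is a power of two.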
Low-order interleaving
S-access (simultaneous access) memory organization
C/S-access memory organization
• n buses and m memory modules per bus
• The n buses operate in parallel (S-access)
• The m modules are interleaved to allow C-access.
• Most popular memory access scheme in
vector computers
NEC SX vector supercomputer
Relative vector/scalar performance
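• Amdahl's law redefined
• P = 1 / ((1 - f) + f/r), where f is the vectorization ratio and r is the vector/scalar speed ratio
• P indicates the speedup of vector processing over scalar
processing.
• The hardware speed ratio r is the designer’s
choice.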
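A short numeric sketch of this speedup formula, taking f as the fraction of the work that is vectorized and r as the vector/scalar speed ratio. The function name `vector_speedup` and the sample values are illustrative assumptions.

```python
def vector_speedup(f, r):
    """Amdahl-style speedup P = 1 / ((1 - f) + f / r).

    f: fraction of the work that runs in vector mode (0 <= f <= 1)
    r: vector/scalar speed ratio (r > 0)
    """
    if not 0.0 <= f <= 1.0 or r <= 0:
        raise ValueError("need 0 <= f <= 1 and r > 0")
    return 1.0 / ((1.0 - f) + f / r)


# How the speedup grows with the vectorization ratio for a fixed r = 10.
for f in (0.0, 0.5, 0.9, 1.0):
    print(f"f = {f:.1f}  P = {vector_speedup(f, 10):.2f}")
```

Even with r = 10, the speedup stays well below 10 until f is close to 1, so the vectorization ratio matters as much as the hardware speed ratio.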
Performance directed design goals
• Architectural design goals
– Maintaining good vector to scalar performance
balance
– Supporting scalability
– Increasing memory system capacity and
performance
– Providing high-performance I/O and easy access to
networks
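Balanced vector/scalar ratio
• Scalar processing is an indispensable part of a
general-purpose architecture
• Vector balance point: the percentage of vector code needed for equal utilization of the vector and scalar hardware
• Vector performance example
– 9 MFLOPS in vector mode
– 1 MFLOPS in scalar mode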
• I/O and networking performance
– As supercomputer speeds increase, problem
sizes grow, and so do I/O bandwidth
requirements
– I/O rate
– Cray systems
– 100GBPS transfer rate
• Memory demand
– Latency and bandwidth
– Effective memory hierarchy
– The memory size available on chip is rapidly
increasing.
– Relative speed mismatch
• Scalability
– Support of shared memory with an increasing
number of processors and memory ports.
– Constraints
• Latency
• Communication overhead
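The vector balance point mentioned above can be worked out for the 9 MFLOPS vector / 1 MFLOPS scalar figures. This sketch assumes the usual definition: the fraction of vector work at which the machine spends equal time in vector and scalar mode. The function name is hypothetical.

```python
def vector_balance_point(vector_rate, scalar_rate):
    """Fraction f of vector work giving equal time in vector and scalar mode.

    Equal time means f / vector_rate == (1 - f) / scalar_rate,
    which solves to f = vector_rate / (vector_rate + scalar_rate).
    """
    return vector_rate / (vector_rate + scalar_rate)


f = vector_balance_point(9.0, 1.0)   # rates in MFLOPS
time_vector = f / 9.0                # time per unit of work in vector mode
time_scalar = (1.0 - f) / 1.0        # time per unit of work in scalar mode
```

For these rates the code must be 90% vector and 10% scalar for both kinds of hardware to be equally utilized.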
Table of comparison
Cray Y-MP 816 system organization
C-90 and clusters
Cray MPP systems
• Off-the-shelf components are not suitable.
• A balance of speed among processor, memory,
and I/O is required.
• RISC processors lack efficient memory operations such as
synchronization and communication.
• All of these led to the introduction of MPP systems.
• T3D
– 150 MHz clock; can be partitioned dynamically to emulate
SIMD or MIMD operation
– Distributed memory
– Mach-based microkernel operating system
– Program debugging and performance tools
Development phases
Fujitsu VP2000
Fujitsu 5000
Mainframe computers
LINPACK results
Compound vector processing
• CVF (compound vector function): a composite function of
vector operations, converted from a looping structure
of linked scalar operations
• Ex:
Do I=1,N
Load r1,x(I)
Load r2,y(I)
Mul r1,s
Add r2,r1
Store y(I),r2
continue
After vectorization:
M(x : x+N-1) → V1
M(y : y+N-1) → V2
S × V1 → V1
V2 + V1 → V2
V2 → M(y : y+N-1)
• CVF:
Y(I) = S × X(I) + Y(I)
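The scalar loop and its vectorized form compute the same compound vector function Y(I) = S × X(I) + Y(I). A minimal Python sketch of both follows; Python lists stand in for memory and vector registers, and the function names are assumptions.

```python
def saxpy_scalar_loop(s, x, y):
    """Scalar version: one load/multiply/add/store sequence per element."""
    y = list(y)                 # work on a copy of Y
    for i in range(len(x)):
        r1 = x[i]               # Load  R1, X(I)
        r2 = y[i]               # Load  R2, Y(I)
        r1 = r1 * s             # Mul   R1, S
        r2 = r2 + r1            # Add   R2, R1
        y[i] = r2               # Store Y(I), R2
    return y


def saxpy_vectorized(s, x, y):
    """Vector version: each step operates on a whole vector register."""
    v1 = list(x)                            # M(x : x+N-1) -> V1
    v2 = list(y)                            # M(y : y+N-1) -> V2
    v1 = [s * a for a in v1]                # S x V1 -> V1
    v2 = [b + a for a, b in zip(v1, v2)]    # V2 + V1 -> V2
    return v2                               # V2 -> M(y : y+N-1)


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
```

Five vector instructions replace N iterations of five scalar instructions plus the loop control.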
Compound vector functions
• Vector loops and chains
– The loop count is determined at compile time or
run time
• Strip mining: when a vector is longer than the
vector register, it is processed in register-length segments
– Vector registers are not allocated to any other
operation until all segments of the current vector are
handled.
• Functional unit independence
– Vector registers act as interface between pipeline
stages
– Vector registers and functional units must be
reserved before a vector chain is established
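Strip mining can be sketched as follows: a long vector is cut into segments no longer than the vector register, and the vector operation is applied one segment at a time. The register length of 64 and the function names are assumptions for illustration only.

```python
def strip_mine(n, register_length):
    """Yield (start, length) for each segment of an n-element vector."""
    start = 0
    while start < n:
        length = min(register_length, n - start)
        yield start, length
        start += length


def saxpy_strip_mined(s, x, y, register_length=64):
    """Compute Y = S * X + Y one register-length segment at a time."""
    result = list(y)
    for start, length in strip_mine(len(x), register_length):
        v1 = x[start:start + length]            # load one segment of X
        v2 = result[start:start + length]       # load one segment of Y
        v2 = [b + s * a for a, b in zip(v1, v2)]
        result[start:start + length] = v2       # store the segment back
    return result


segments = list(strip_mine(150, 64))
```

A 150-element vector with 64-element registers needs two full segments and one remainder segment of 22 elements.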
example
Timing diagram
Chaining limitations
• Number of vector operations
• Number of functional pipeline units
• Number of interfaces for adjacent pipelining
stages
• The degree of chaining depends on the number of
unary and binary operators
• and on the number of scalar and vector
operations
• Vector recurrence:
– The output of a functional pipeline feeds back into one of its own
source registers
– Ex: component counter
What is Systolic Computing?
A set of simple processing elements with local connections
that takes external inputs and processes them in a
predetermined, pipelined fashion.
Host Station in Systolic Architecture
• As a result of the local-communication scheme, a systolic network is easily
extended without adding any burden to the I/O.
• Systolic array
[Figure: control units and processing units connected by a local interconnection network]
• Systolic arrays usually pipe data from an outside host and also pipe the
results back to the host.
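The host-to-array data flow can be illustrated with a small cycle-by-cycle simulation. This is a sketch under assumptions, not a description of any particular machine: a linear array of n processing elements computes y = A·x, PE j holds x[j], partial sums move one PE to the right per cycle, and the host supplies matrix entries in skewed order. It assumes x has at least one element.

```python
def systolic_matvec(a, x):
    """Simulate a linear systolic array computing y = A x.

    pipe[j] holds (row index, partial sum) currently inside PE j, or None.
    """
    rows, n = len(a), len(x)
    pipe = [None] * n
    y = [0] * rows
    for cycle in range(rows + n):
        # A finished partial sum leaves the last PE and returns to the host.
        if pipe[-1] is not None:
            i, acc = pipe[-1]
            y[i] = acc
        # Local communication only: each PE passes its value to its neighbour.
        for j in range(n - 1, 0, -1):
            pipe[j] = pipe[j - 1]
        # The host injects a new row (initial partial sum 0) each cycle.
        pipe[0] = (cycle, 0) if cycle < rows else None
        # Every PE performs one multiply-accumulate in the same cycle.
        for j in range(n):
            if pipe[j] is not None:
                i, acc = pipe[j]
                pipe[j] = (i, acc + a[i][j] * x[j])
    return y


a = [[1, 2], [3, 4], [5, 6]]
x = [10, 1]
```

Row i enters at cycle i and its result is collected at cycle i + n, so after the pipeline fills the array delivers one result per cycle.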
Multipipeline networking
• Pipeline net: constructed by interconnecting
multiple functional pipelines through buffered crossbar networks (BCNs)
• Two-level pipelining architecture
Program graph transformation
• Rule 1: Add k delays to any node in the systolic
graph, then subtract k delays from all
incoming edges
• Rule 2: Multiply the delays on all edges by a scaling constant
• A 0-graph is called a systolic program graph
SIMD computer organization
• Distributed memory model
– Local memory
– scalar and vector control unit
– All processing elements are interconnected by
routing network
– Masking logic
Shared memory model
• Alignment network
– Must be properly set to avoid conflicts
• SIMD instructions
– All instructions must use vector operands of equal
length n
– Data routing functions
• Host and I/O
– Control memory
– Mass storage and graphics display results
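The SIMD execution model described above (one instruction applied to equal-length vector operands, masking logic, and data routing between PEs) can be sketched in a few lines. The function names and the ring-shift routing pattern are assumptions chosen for illustration.

```python
from operator import add


def simd_step(op, a, b, mask):
    """All PEs apply the same operation in lockstep; masked-off PEs keep a[i]."""
    if not (len(a) == len(b) == len(mask)):
        raise ValueError("SIMD operands must have equal length n")
    return [op(x, y) if m else x for x, y, m in zip(a, b, mask)]


def route_shift(values, distance):
    """Data routing: PE i sends its value to PE (i + distance) mod n."""
    n = len(values)
    return [values[(i - distance) % n] for i in range(n)]


a = [1, 2, 3, 4]
b = [10, 10, 10, 10]
masked_sum = simd_step(add, a, b, [1, 0, 1, 0])   # only PEs 0 and 2 are enabled
shifted = route_shift(a, 1)                       # each PE receives from its left neighbour
```

The mask lets a single instruction stream handle data-dependent conditions, and the routing function stands in for what the interconnection or alignment network does in hardware.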
CM-2 architecture
• Front end
• Sequencer
• Modes of communication
– Broadcasting
– Global combining
– Scalar memory bus
• Processing nodes
– 32 bit-slice processors
– Floating point accelerator
– Bit slice ALU
• Hypercube routers
• applications
MasPar MP architecture
• Array control unit
• Scalar RISC processor
• Uses demand paging
• Fetches and decodes the instructions
• PE array
• 1024 PEs
• 64 PE clusters, 16 PEs per cluster
• Multistage crossbar interconnection network
• Parallel disk arrays