8 TH

This document discusses the architectures of multivector supercomputers and SIMD array processors, highlighting their capabilities in vector processing over large data volumes. It covers vector processing principles, including instruction types, memory-access schemes, and the evolution of SIMD computers to hybrid systems. The document emphasizes the performance benefits of vector processing compared to scalar processing, as well as the necessary advancements in hardware and compilers for effective vectorization.

Uploaded by

tarunj4632

Multivector and SIMD Computers
By definition, supercomputers are the fastest computers available at any specific time. The value
of supercomputing was originally identified by Buzbee [1983] in three areas: knowledge acquisition,
computational tractability, and promotion of productivity. Computing demand, however, is always ahead of
computer capabilities. Today's supercomputers are still one generation behind the computing requirements
in most application areas, which have expanded enormously over the last two decades.
In this chapter, we study the architectures of pipelined multivector supercomputers and of SIMD
array processors. Both types of machines perform vector processing over large volumes of data. Besides
discussing basic vector processors, we describe compound vector functions and multipipeline chaining
and networking techniques for developing higher-performance vector multiprocessors.
The evolution from SIMD and MIMD computers to hybrid SIMD/MIMD computer systems is also
considered. The Connection Machine CM-5 reflected this architectural trend. This hybrid approach to
designing reconfigurable computers opened up new opportunities for exploiting coordinated parallelism
in complex application problems. Recent trends in this direction will be discussed in Chapter 13.

8.1 VECTOR PROCESSING PRINCIPLES

Vector instruction types, memory-access schemes for vector operands, and an overview of
supercomputer families are given in this section.

8.1.1 Vector Instruction Types

Basic concepts behind vector processing are defined below. Then we discuss major types of vector
instructions encountered in a typical vector processor. The intent is to acquaint the reader with the
instruction-set architectures of typical vector processors.
Vector Processing Definitions A vector is an ordered set of scalar data items, all of the same type, stored
in memory. Usually, the vector elements are ordered to have a fixed addressing increment between successive
elements, called the stride.
A vector processor is an ensemble of hardware resources, including vector registers, functional pipelines,
processing elements, and register counters, for performing vector operations. Vector processing occurs when
arithmetic or logical operations are applied to vectors. It is distinguished from scalar processing, which operates
on one datum or one pair of data. The conversion from scalar code to vector code is called vectorization.

In general, vector processing is faster and more efficient than scalar processing. Both pipelined processors
and SIMD computers can perform vector operations. Vector processing reduces software overhead incurred
in the maintenance of looping control, reduces memory-access conflicts, and above all matches nicely with
the pipelining and segmentation concepts to generate one result per clock cycle continuously.
Depending on the speed ratio between vector and scalar operations (including startup delays and other
overheads) and on the vectorization ratio in user programs, a vector processor executing a well-vectorized
code can easily achieve a speedup of 10 to 20 times, as compared with scalar processing on conventional
machines.
Of course, the enhanced performance comes with increased hardware and compiler costs, as expected.
A compiler capable of vectorization is called a vectorizing compiler or simply a vectorizer. For successful
vector processing, one needs to make improvements in vector hardware, vectorizing compilers, and
programming skills specially targeted at vector machines.

Vector Instruction Types We briefly introduced basic vector instructions in Chapter 4. What are
characterized below are vector instructions for register-based, pipelined vector machines. Six types of vector
instructions are illustrated in Figs. 8.1 and 8.2. We define these vector instruction types by mathematical
mappings between their working registers or memory where vector operands are stored.
Fig. 8.1 Vector instruction types in Cray-like computers: (a) vector-vector instruction, (b) vector-scalar instruction, (c) vector-memory instructions (vector load and vector store)

(1) Vector-vector instructions As shown in Fig. 8.1a, one or two vector operands are fetched from the
respective vector registers, enter through a functional pipeline unit, and produce results in another
vector register. These instructions are defined by the following two mappings:
$$f_1 : V_i \rightarrow V_j \tag{8.1}$$
$$f_2 : V_j \times V_k \rightarrow V_i \tag{8.2}$$
Examples are $V_1 = \sin(V_2)$ and $V_3 = V_1 + V_2$ for the mappings $f_1$ and $f_2$, respectively, where $V_i$ for
$i = 1, 2,$ and $3$ are vector registers.
(2) Vector-scalar instructions Figure 8.1b shows a vector-scalar instruction corresponding to the
following mapping:
$$f_3 : s \times V_k \rightarrow V_i \tag{8.3}$$

An example is a scalar product $s \times V_1 = V_3$, in which the elements of $V_1$ are each multiplied by a
scalar $s$ to produce vector $V_3$ of equal length.
(3) Vector-memory instructions This corresponds to vector load or vector store (Fig. 8.1c), element by
element, between the vector register ($V$) and the memory ($M$) as defined below:
$$f_4 : M \rightarrow V \quad \text{Vector load} \tag{8.4}$$
$$f_5 : V \rightarrow M \quad \text{Vector store} \tag{8.5}$$

(4) Vector reduction instructions These correspond to the following mappings:
$$f_6 : V_i \rightarrow s \tag{8.6}$$
$$f_7 : V_i \times V_j \rightarrow s \tag{8.7}$$
Examples of $f_6$ include finding the maximum, minimum, sum, and mean value of all elements in a
vector. A good example of $f_7$ is the dot product, which performs $s = \sum_{i=1}^{n} a_i \times b_i$ from two vectors
$A = (a_i)$ and $B = (b_i)$.
(5) Gather and scatter instructions These instructions use two vector registers to gather or to scatter
vector elements randomly throughout the memory, corresponding to the following mappings:
$$f_8 : M \rightarrow V_1 \times V_0 \quad \text{Gather} \tag{8.8}$$
$$f_9 : V_1 \times V_0 \rightarrow M \quad \text{Scatter} \tag{8.9}$$
Gather is an operation that fetches from memory the nonzero elements of a sparse vector using
indices that themselves are indexed. Scatter does the opposite, storing into memory a vector in a
sparse vector whose nonzero entries are indexed. The vector register $V_1$ contains the data, and the
vector register $V_0$ is used as an index to gather or scatter data from or to random memory locations, as
illustrated in Figs. 8.2a and 8.2b, respectively.
(6) Masking instructions This type of instruction uses a mask vector to compress or to expand a vector
to a shorter or longer index vector, respectively, corresponding to the following mapping:
$$f_{10} : V_0 \times V_m \rightarrow V_1 \tag{8.10}$$
The following example will clarify the meaning of gather, scatter, and masking instructions.

Example 8.1 Gather, scatter, and masking instructions in the Cray Y-MP (Cray Research, 1990)
The gather instruction (Fig. 8.2a) transfers the contents (600, 400, 250, 200) of nonsequential memory
locations (104, 102, 107, 100) to four elements of a vector register $V_1$. The base address (100) of the memory
is indicated by an address register $A_0$. The number of elements being transferred is indicated by the contents
(4) of a vector length register $VL$.
The offsets (indices) from the base address are retrieved from the vector register $V_0$. The effective memory
addresses are obtained by adding the base address to the indices.
Fig. 8.2 Gather, scatter and masking operations on the Cray Y-MP (Courtesy of Cray Research, 1990): (a) gather instruction, (b) scatter instruction, (c) masking instruction
The scatter instruction reverses the mapping operations, as illustrated in Fig. 8.2b. Both the $VL$ and $A_0$
registers are embedded in the instruction.
The masking instruction is shown in Fig. 8.2c for compressing a long vector into a short index vector. The
contents of vector register $V_0$ are tested for zero or nonzero elements. A masking register ($VM$) is used to
store the test results. After testing and forming the masking vector in $VM$, the corresponding nonzero indices
are stored in the $V_1$ register. The $VL$ register indicates the length of the vector being tested.
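
As a rough illustration of Example 8.1, the following Python sketch models the three operations in software. The function names (`gather`, `scatter`, `mask_compress`), the dictionary used as memory, the zero-based index convention, and the sample vector passed to `mask_compress` are assumptions made for illustration. Only the gather data (base address 100, offsets 4, 2, 7, 0, contents 600, 400, 250, 200) are taken from the example.

```python
def gather(memory, a0, v0, vl):
    """Load V1[i] = memory[A0 + V0[i]] for the first VL elements."""
    return [memory[a0 + v0[i]] for i in range(vl)]


def scatter(memory, a0, v0, v1, vl):
    """Store memory[A0 + V0[i]] = V1[i] for the first VL elements."""
    for i in range(vl):
        memory[a0 + v0[i]] = v1[i]


def mask_compress(v0, vl):
    """Test the first VL elements of V0 and return (VM, V1).

    VM holds one test bit per element (1 = nonzero); V1 holds the
    zero-based indices of the nonzero elements (an assumed convention).
    """
    vm = [1 if v0[i] != 0 else 0 for i in range(vl)]
    v1 = [i for i in range(vl) if vm[i]]
    return vm, v1


# Gather with the data of Example 8.1: A0 = 100, VL = 4.
memory = {100: 200, 102: 400, 104: 600, 107: 250}
offsets = [4, 2, 7, 0]                      # contents of V0
v1 = gather(memory, a0=100, v0=offsets, vl=4)
print(v1)                                   # [600, 400, 250, 200]

# Scatter reverses the mapping: the same data go back to the same addresses.
restored = {}
scatter(restored, a0=100, v0=offsets, v1=v1, vl=4)
print(restored == memory)                   # True

# Masking on a made-up vector: nonzero positions become an index vector.
vm, index_vector = mask_compress([0, -1, 0, 5, -15, 0, 24, 0], vl=8)
print(vm)                                   # [0, 1, 0, 1, 1, 0, 1, 0]
print(index_vector)                         # [1, 3, 4, 6]
```

In hardware the index vector, base address, and vector length live in registers; here they are ordinary arguments so that each step can be inspected.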

The gather, scatter, and masking instructions are very useful in handling sparse vectors or sparse matrices
often encountered in practical vector processing applications. Sparse matrices are those in which most of the
entries are zeros. Advanced vector processors implement these instructions directly in hardware.
The above instruction types cover the most important ones. A given specific vector processor may
implement an instruction set containing only a subset or even a superset of the above instructions.
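
The register-to-register mappings among these instruction types can also be read as ordinary element-wise and reduction operations. The sketch below is only a software analogy under that reading: Python lists stand in for vector registers, the helper names are hypothetical, and the load/store mappings are omitted because a list is already "in memory".

```python
import math
import operator


def vector_unary(op, v):
    """Vector-vector instruction with one operand: V -> V."""
    return [op(x) for x in v]


def vector_binary(op, va, vb):
    """Vector-vector instruction with two operands: V x V -> V."""
    return [op(x, y) for x, y in zip(va, vb)]


def vector_scalar(op, s, v):
    """Vector-scalar instruction: s x V -> V."""
    return [op(s, x) for x in v]


def dot(va, vb):
    """Two-operand reduction: V x V -> s."""
    return sum(x * y for x, y in zip(va, vb))


v2 = [0.0, 1.0, 2.0, 3.0]
v1 = vector_unary(math.sin, v2)                  # V1 = sin(V2)
v3 = vector_binary(operator.add, v1, v2)         # V3 = V1 + V2
scaled = vector_scalar(operator.mul, 2.0, v2)    # s x V with s = 2
total, largest = sum(v2), max(v2)                # one-operand reductions V -> s
mean = total / len(v2)
print(scaled, total, largest, mean, dot(v2, v2))
```

Each helper hides the loop inside a single call, which mirrors how one vector instruction replaces a scalar loop with its per-iteration control overhead.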

8.1.2 Vector-Access Memory Schemes

The flow of vector operands between the main memory and vector registers is usually pipelined with multiple
access paths. In this section, we specify vector operands and describe three vector-access schemes from
interleaved memory modules allowing overlapped memory accesses.

Vector Operand Specifications Vector operands may have arbitrary length. Vector elements are not
necessarily stored in contiguous memory locations. For example, the entries in a matrix may be stored in row
major or in column major order. Each row, column, or diagonal of the matrix can be used as a vector.
When row elements are stored in contiguous locations with a unit stride, the column elements are stored
with a stride of n, where n is the matrix order. Similarly, the diagonal elements are also separated by a stride
of n + 1.
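
To make the stride arithmetic concrete, the sketch below generates the addresses of a row, a column, and the main diagonal of an n × n matrix stored in row-major order. The function name `vector_addresses`, the zero-based addresses, and the choice n = 4 are illustrative assumptions.

```python
def vector_addresses(base, stride, length):
    """Addresses of a vector given its base address, stride, and length."""
    return [base + k * stride for k in range(length)]


n = 4
# Row-major storage: element (i, j) lives at address i * n + j.
row_1 = vector_addresses(base=1 * n, stride=1, length=n)      # unit stride
column_2 = vector_addresses(base=2, stride=n, length=n)       # stride n
diagonal = vector_addresses(base=0, stride=n + 1, length=n)   # stride n + 1

print(row_1)      # [4, 5, 6, 7]
print(column_2)   # [2, 6, 10, 14]
print(diagonal)   # [0, 5, 10, 15]
```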
To access a vector in memory, one must specify its base address, stride, and length. Since each vector
register has a fixed number of component registers, only a segment of the vector can be loaded into the vector
register in a fixed number of cycles. Long vectors must be segmented and processed one segment at a time.
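
The segmentation of a long vector can be sketched as follows. The function name `segments` and the default register length of 64 elements are assumptions chosen for illustration; the actual register length depends on the machine.

```python
def segments(base, stride, length, register_length=64):
    """Split a long vector into (base, stride, length) pieces that fit one register."""
    pieces = []
    done = 0
    while done < length:
        piece_length = min(register_length, length - done)
        pieces.append((base + done * stride, stride, piece_length))
        done += piece_length
    return pieces


for piece in segments(base=1000, stride=2, length=150):
    print(piece)
# (1000, 2, 64)
# (1128, 2, 64)
# (1256, 2, 22)
```

Each piece carries its own base address, so the same load instruction can be reissued per segment with only the base and length changed.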
Vector operands should be stored in memory to allow pipelined or parallel access. The memory system for
a vector processor must be specifically designed to enable fast vector access. The access rate should match
the pipeline rate. In fact, the access path is often itself pipelined and is called an access pipe. These
vector-access memory organizations are described below.
C-Access Memory Organization The m-way low-order interleaved memory structure shown in
Figs. 5.15a and 5.16 allows m memory words to be accessed concurrently in an overlapped manner. This
concurrent access has been called C-access, as illustrated in Fig. 5.16b.
The access cycles in different memory modules are staggered. The low-order a bits select the modules,
and the high-order b bits select the word within each module, where $m = 2^a$ and $a + b = n$ is the address length.
To access a vector with a stride of 1, successive addresses are latched in the address buffer at the rate of
one per cycle. Effectively it takes m minor cycles to fetch m words, which equals one (major) memory cycle
as stated in Eq. 5.4 and Fig. 5.16b.
If the stride is 2, the successive accesses must be separated by two minor cycles in order to avoid access
conflicts. This reduces the memory throughput by one-half. If the stride is 3, there is no module conflict and
the maximum throughput (m words) results. In general, C-access will yield the maximum throughput of m
words per memory cycle if the stride is relatively prime to m, the number of interleaved memory modules.
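
The effect of the stride on C-access throughput can be checked with a small model. The sketch below assumes a simplified timing model that is not taken from the text: addresses are issued in order, at most one per minor cycle, module selection is the address modulo m, and a module stays busy for m minor cycles (one major cycle) per access. The function names are hypothetical.

```python
from math import gcd


def words_per_memory_cycle(m, stride):
    """Steady-state words per major cycle: only m / gcd(stride, m) modules are visited."""
    return m // gcd(stride, m)


def simulate(m, stride, num_words=1024):
    """Issue accesses in order, at most one per minor cycle; a module is busy for m minor cycles."""
    free_at = [0] * m          # minor cycle at which each module is free again
    next_issue = 0
    for k in range(num_words):
        module = (k * stride) % m
        start = max(next_issue, free_at[module])
        free_at[module] = start + m
        next_issue = start + 1
    total_minor_cycles = max(free_at)
    return num_words * m / total_minor_cycles   # words per major cycle


for stride in (1, 2, 3, 4, 8):
    print(stride, words_per_memory_cycle(8, stride), round(simulate(8, stride), 2))
```

With m = 8, strides 1 and 3 approach 8 words per memory cycle while stride 2 approaches 4, which agrees with the halving described above; the closed form follows because a stride sharing a factor with m never touches the remaining modules.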

S-Access Memory Organization The low-order interleaved memory can be rearranged to allow
simultaneous access, or S-access, as illustrated in Fig. 8.3a. In this case, all memory modules are accessed
simultaneously in a synchronized manner. Again the high-order $(n - a)$ bits select the same offset word from
each module.

Fig. 8.3 The S-access interleaved memory for vector operands access: (a) S-access organization for an m-way interleaved memory, (b) successive vector accesses using overlapped fetch and access cycles

At the end of each memory cycle (Fig. 8.3b), $m = 2^a$ consecutive words are latched in the data buffers
simultaneously. The low-order a bits are then used to multiplex the m words out, one per each minor cycle.
If the minor cycle is chosen to be 1/m of the major memory cycle (Eq. 5.4), then it takes two memory cycles
to access m consecutive words.
However, if the access phase of the last access is overlapped with the fetch phase of the current access
(Fig. 8.3b), effectively m words take only one memory cycle to access. If the stride is greater than 1, then the
throughput decreases, roughly proportionally to the stride.

C/S-Access Memory Organization A memory organization in which the C-access and S-access are
combined is called C/S-access. This scheme is shown in Fig. 8.4, where n access buses are used with m
interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to
allow C-access. The n buses operate in parallel to allow S-access. In each memory cycle, at most $m \cdot n$ words
are fetched if the n buses are fully used with pipelined memory accesses.

Fig. 8.4 The C/S memory organization with m = n (Courtesy of D.K. Panda, 1990)

The C/S-access memory is suitable for use in vector multiprocessor configurations. It provides parallel
pipelined access of a vector data set with high bandwidth. A special vector cache design is needed within each
processor in order to guarantee smooth data movement between the memory and multiple vector processors.

8.1.3 Early Supercomputers

This section introduces five early supercomputer families, including the Cray Research series, the CDC/
ETA series, the Fujitsu VP series, the NEC SX series, and the Hitachi 820 series (Table 8.1). The relative
performance of these machines for vector processing is compared with scalar processing at the end.

The Cray Research Series Seymour Cray founded Cray Research, Inc. in 1972. Since then, hundreds
of units of Cray supercomputers have been produced and installed worldwide. As we shall see in Chapter 13,
the company has gone through a change of name and evolution of product line.
The Cray 1 was introduced in 1975. An enhanced version, the Cray 1S, was produced in 1979. It was the
first ECL-based supercomputer with a 12.5-ns clock cycle. High degrees of pipelining and vector processing
were the major features of these machines.

Table 8.1 Summary of Early Supercomputers

System model | Maximum configuration, clock rate, OS/compiler | Unique features and remarks
Cray 1S | Uniprocessor with 10 pipelines, 12.5 ns, COS/CFT 2.1 | First ECL-based super, introduced in 1976
Cray 2S/4-256 | 4 processors with 256M-word memory, 4.1 ns, COS or UNIX/CFT77 3.0 | 16K-word local memory, ported UNIX V, introduced in 1985
Cray X-MP 416 | 4 processors with 16M-word memory and 128M-word SSD, 8.5 ns, COS/CFT77 5.0 | Using shared register clusters for IPC, introduced in 1983
Cray Y-MP 832 | 8 processors with 128M-word memory, 6 ns, CFT77 5.0 | Enhanced from X-MP, introduced in 1988
Cray Y-MP C-90 | 16 processors with 2 vector pipes per processor, 4.2 ns, UNICOS/CFT77 5.0 | The largest Cray machine, introduced in 1991
CDC Cyber 205 | Uniprocessor with 4 pipelines, 20 ns, virtual OS/FTN 200 | Memory-to-memory architecture, introduced in 1982
ETA 10E | Uniprocessor with 10.5 ns, ETAV/FTN 200 | Successor to Cyber 205, introduced in 1985
NEC SX-X/44 | 4 processors with 4 sets of pipelines per processor, 2.9 ns, F77SX | Succeeded by SX-X Series, introduced in 1991
Fujitsu VP2600/10 | Uniprocessor with 5 vector pipes and dual scalar processors, 3.2 ns, MSP-EX/F77 EX/VP | Used reconfigurable vector registers and masking, introduced in 1991
Hitachi 820/80 | 18 functional pipelines in a uniprocessor with 512 Mbytes memory, 4 ns, FORT 77/HAP V23-0C | Introduced in 1987 with 64 I/O channels providing a maximum of 288 Mbytes/s transfer

Ten functional pipelines could run simultaneously in the Cray 1S to achieve a computing power equivalent
to that of 10 IBM 3033's or CDC Cyber 7600's. Only batch processing with a single user was allowed when
the Cray 1 was initially introduced using the Cray Operating System (COS) with a Fortran 77 compiler (CFT
Version 2.1).
The Cray X-MP Series introduced multiprocessor configurations in 1983. Steve Chen led the effort at Cray
Research in developing this series using one to four Cray 1-equivalent CPUs with shared memory. A unique
feature introduced with the X-MP models was shared register clusters for fast interprocessor communications
without going through the shared memory.
Besides 128 Mbytes of shared memory, the X-MP system had 1 Gbyte of solid-state storage (SSD) as
extended shared memory. The clock cycle was also reduced to 8.5 ns. The peak performance of the X-MP-416
was 840 Mflops when eight vector pipelines for add and multiply were used simultaneously across four
processors.
The successor to the Cray X-MP was the Cray Y-MP introduced in 1988 with up to eight processors in a
single system using a 6-ns clock and 256 Mbytes of shared memory.
The Cray Y-MP C-90 was introduced in 1990 to offer an integrated system with 16 processors using a
4.2-ns clock. We will study models Y-MP 816 and C-90 in detail in the next section.
Another product line was the Cray 2S introduced in 1985. The system allowed up to four processors with
2 Gbytes of shared memory and a 4.1-ns clock. A major contribution of the Cray 2 was to switch from the
batch processing COS to multiuser UNIX System V on a supercomputer. This led to the UNICOS operating
system, derived from the UNIX/V and Berkeley 4.3 BSD, variants of which are currently in use in some Cray
computer systems.

The Cyber/ETA Series Control Data Corporation (CDC) introduced its first supercomputer, the STAR-100,
in 1973. Cyber 205 was the successor produced in 1982. The Cyber 205 ran at a 20-ns clock rate, using up to
four vector pipelines in a uniprocessor configuration.
Different from the register-to-register architecture used in Cray and other supercomputers, the Cyber
205 and its successor, the ETA 10, had memory-to-memory architecture with longer vector instructions
containing memory addresses.
The largest ETA 10 consisted of 8 CPUs sharing memory and 18 I/O processors. The peak performance
of the ETA 10 was targeted for 10 Gflops. Both the Cyber and the ETA Series are no longer in production but
were in use for many years at several supercomputer centers.

Japanese Supercomputers NEC produced the SX-X Series with a claimed peak performance of 22 Gflops
in 1991. Fujitsu produced the VP-2000 Series with a 5-Gflops peak performance at the same time. These two
machines used 2.9- and 3.2-ns clocks, respectively.
Shared communication registers and reconfigurable vector registers were special features in these
machines. Hitachi offered the 820 Series providing a 3-Gflops peak performance. Japanese supercomputers
were at one time strong in high-speed hardware and interactive vectorizing compilers.
The NEC SX-X 44 NEC claimed that this machine was the fastest vector supercomputer (22 Gflops peak)
ever built up to 1992. The architecture is shown in Fig. 8.5. One of the major contributions to this performance
was the use of a 2.9-ns clock cycle based on VLSI and high-density packaging.
There were four arithmetic processors communicating through either the shared registers or via the shared
memory of 2 Gbytes. There were four sets of vector pipelines per processor, each set consisting of two add/
shift and two multiply/logical pipelines. Therefore, 64-way parallelism was obtained with four processors,
similar to that in the C-90.
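
The 64-way parallelism and the 22-Gflops peak figure are related by simple arithmetic. The sketch below assumes, for illustration only, that every pipeline delivers one result per clock cycle and that all results count as floating-point operations; the function name `peak_gflops` is hypothetical.

```python
def peak_gflops(processors, sets_per_processor, pipes_per_set, clock_ns):
    """Peak rate if every pipeline delivers one result per clock cycle."""
    results_per_cycle = processors * sets_per_processor * pipes_per_set
    # Results per nanosecond is numerically the same as Gflops.
    return results_per_cycle, results_per_cycle / clock_ns


# Two add/shift plus two multiply/logical pipelines per set.
parallelism, gflops = peak_gflops(processors=4, sets_per_processor=4,
                                  pipes_per_set=2 + 2, clock_ns=2.9)
print(parallelism, round(gflops, 2))   # 64 22.07
```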
Besides the vector unit, a high-speed scalar unit employed RISC architecture with 128 scalar registers.
Instruction reordering was supported to exploit higher parallelism. The main memory was 1024-way
interleaved. The extended memory of up to 16 Gbytes provided a maximum transfer rate of 2.75 Gbytes/s.
A maximum of four I/O processors could be configured to accommodate a 1-Gbyte/s data transfer rate per
I/O processor. The system could provide a maximum of 256 channels for high-speed network, graphics, and
peripheral operations. The support included 100-Mbytes/s channels.

Fig. 8.5 The NEC SX-X 44 vector supercomputer architecture (Courtesy of NEC, 1991). Captions: XMU, extended memory unit; IOP, I/O processors (4); DCP, data central processors (2); AP, arithmetic processors (4); MMU, main memory unit; CPM, data central processor memory. Each set consists of pipelines for add/shift and multiply/logical vector operations.

Relative Vector/Scalar Performance Let r be the vector/scalar speed ratio and f the vectorization ratio.
By Amdahl's law in Section 3.3.1, the following relative performance can be defined:

$$P = \frac{1}{(1 - f) + f/r} = \frac{r}{(1 - f)r + f} \tag{8.11}$$

This relative performance indicates the speedup performance of vector processing over scalar processing.
The hardware speed ratio r is the designer's choice. The vectorization ratio f reflects the percentage of code
in a user program which is vectorized.
The relative performance is rather sensitive to the value of f. This value can be increased by using a
better vectorizing compiler or through user program transformations. The following example shows the IBM
experience in vector processing with the 3090/VF computer system.

Example 8.2 The vector/scalar relative performance of the IBM 3090/VF
Figure 8.6 plots the relative performance P as a function of r with f as a running parameter. The higher the
value of f, the higher the relative speedup. The IBM 3090 with vector facility (VF) was a high-end mainframe
with add-on vector hardware.

Fig. 8.6 Speedup performance of vector processing over scalar processing in the IBM 3090/VF (Courtesy of IBM Corporation, 1986): relative performance P plotted against speed ratio r from 1 to 10, with one curve per vectorization ratio f from 30% to 90% and the 3090 VF design point marked

The designers of the 3090/VF chose a speed ratio in the range $3 \le r \le 5$ because IBM wanted a balance
between business and scientific applications. When the program is 70% vectorized, one expects a maximum
speedup of about 2.2. However, for $f \le 30\%$, the speedup is reduced to roughly 1.3 or less.

The IBM designers did not choose a high speed ratio because they did not expect user programs to be
highly vectorizable. When f is low, the speedup cannot be high, even with a very high r. In fact, the limiting
case is $P \rightarrow 1$ if $f \rightarrow 0$.
On the other hand, $P \rightarrow r$ when $f \rightarrow 1$. Scientific supercomputer designers like Cray and Japanese
manufacturers often chose a much higher speed ratio, say, $10 \le r \le 25$, because they expected a higher
vectorization ratio f in user programs, or they used better vectorizers to increase the ratio to a desired level.
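
The relative performance formula is easy to evaluate directly. The sketch below computes P from the two equivalent forms of Eq. 8.11; the function names and the sample values of f and r are chosen for illustration.

```python
def relative_performance(f, r):
    """Eq. 8.11: speedup of vector over scalar processing."""
    return 1.0 / ((1.0 - f) + f / r)


def relative_performance_alt(f, r):
    """Second form of Eq. 8.11, obtained by multiplying through by r."""
    return r / ((1.0 - f) * r + f)


for f in (0.3, 0.7, 0.9):
    row = [round(relative_performance(f, r), 2) for r in (3, 5, 10, 25)]
    print(f, row)
```

With r = 5, a 70% vectorized program gives P of about 2.27 and a 30% vectorized one about 1.32. Raising r to 25 at f = 0.3 lifts P only to about 1.40, which illustrates why a high speed ratio pays off only when the vectorization ratio is also high.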
Huge advances have taken place in the underlying technologies, and especially in VLSI technology,
over the last two decades. We shall see that these advances, summarized in brief in Chapter 13, have defined
the direction of advances in computer architecture over this period. Powerful single-chip processors, as
also multi-core systems-on-a-chip, provide High Performance Computing (HPC) today. Such HPC systems
typically make use of MIMD and/or SPMD configurations with a large number of processors.
Advent of superscalar processors has resulted in vector processing instructions being built into powerful
processors, rather than as specialized processors. Thus the ideas we have studied in this section have made
