Vector Processing
Vector Processing
Parallel Processing
• Parallel processing – denotes the use of techniques designed to
perform various data processing tasks simultaneously to
increase a computer's overall speed.
• These techniques can include:
– performing arithmetic or logical operations while fetching the next
instruction
– executing several instructions at the same time
– performing arithmetic or logical operations on multiple sets of
operands.
• While parallel processing can be more expensive,
technological advances have dropped to overall cost of
processor design enough to make it financially feasible.
1
8/4/2021
Processor
registers
Floating point
Adder/subtractor Logic unit Incrementer
multiply
2
8/4/2021
Flynn's Taxonomy
• Michael Flynn classified computers according to their
type of parallelism:
– SISD – Single Instruction Single Data – simple computers
that are essentially devoid of parallelism
– SIMD – Single Instruction Multiple Data – processors
capable of performing the same operation on multiple pairs
of operands
– MISD – Multiple Instruction Single Data – performing
several operations on the same set of data – only of
theoretical interest
– MIMD - Multiple Instruction Multiple Data – capable of
processing several programs simultaneously on different
sets of data
Pipelining
• Pipelining is a technique where sequential processes
are broken down into separate suboperations, each of
which being performed by its own hardware.
• Each computation is passed along to the next segment
in the pipeline, with the processes are carried in a
manner analogous to an assembly line.
• The fact that each suboperation is performed by
different hardware allows different stages of the
overall operation to be performed in parallel.
3
8/4/2021
Pipelining: An Example
• Imagine that we want to evaluate the following
expression for seven sets of values:
Ai*Bi + Ci, for i = 1, 2, 3, …, 7
• Each suboperation can be implemented by a different
segment within the pipeline.
• This can be decomposed into three segments:
R1 ← Ai, R2 ← Bi Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci Multiply and input Ci
R5 ← R3 + R4 Add Ci to the product
• The 5 registers are each loaded on a new clock pulse.
4
8/4/2021
Pipeline Processing
Ai Bi Ci
R1 R2
Multiplier
R3 R4
Adder
R5
Clock Pulse # R1 R2 R3 R4 R5
1 A1 B1 - - -
2 A2 B2 A1*B1 C1 -
3 A3 B3 A2*B2 C2 A1*B1 + C1
4 A4 B4 A3*B3 C3 A2*B2 + C2
5 A5 B5 A4*B4 C4 A3*B3 + C3
6 A6 B6 A5*B5 C5 A4*B4 + C4
7 A7 B7 A6*B6 C6 A5*B5 + C5
8 - - A7*B7 C7 A6*B6 + C6
9 - - - - A7*B7 + C7
5
8/4/2021
4-Segment Pipeline
Clock
Input S1 R1 S2 R2 S3 R3 S4 R4
6
8/4/2021
Space-Time Diagram
1 2 3 4 5 6 7 8 9
Clock
T1 T2 T3 T4 T5 T6 cycles
Segment:
T1 T2 T3 T4 T5 T6
T1 T2 T3 T4 T5 T6
T1 T2 T3 T4 T5 T6
Speedup
• Given a k-segment pipeline with a clock cycle of tp that is used
to execute n tasks.
– The first task requires ktp to complete the
operation.
– The remaining tasks are completed one per clock
cycle, requiring an additional (n-1)tp.
– The total execution time is (k + n-1)tp
• A nonpipelined unit would require ntn to complete these tasks.
• The speedup is the ratio
ntn
S = (k + n – 1)t
p
7
8/4/2021
Theoretical Speedup
• As the tasks increase n is much larger than k – 1 and
k + n – 1 → n. Thus the speedup becomes
S = tn / tp
• If we assume that the task takes as long with or
without pipelining, we have tn = ktp, which yields:
S=k
• Therefore k is the theoretical maximum speedup.
Speedup – An Example
8
8/4/2021
P1 P2 P3 P4
Applying Pipelining
• There are two areas where pipeline
organization is applicable:
– Arithmetic pipelining
• divides an arithmetic operation into suboperations for
execution in the pipeline segments.
– Instruction pipelining
• operates on a stream of instructions by overlapping the
fetch, decode, and execute phases of the instruction
cycles.
9
8/4/2021
Arithmetic Pipelining
• Pipelined arithmetic units are used to
implement floating point operations, fixed
point multiplication, etc.
• Floating point operations are readily
decomposed into suboperations that can be
handled separately.
10
8/4/2021
R R
Compare Difference
Segment 1:
exponents
Choose Align
Segment 2:
exponent mantissas
Segment 3: Add/subtract
mantissas
R R
R R
11
8/4/2021
Instruction Pipelining
• Pipeline processing can occur in the instruction
stream as well, with the processor fetches instruction
while the previous instruction are being executed.
• It’s possible for an instruction to cause a branch out
of sequence, and the pipeline must be cleared of all
the instructions after the branch.
• The instruction pipeline can read instructions from
memory and place them in a queue when the
processor is not accessing memory as part of the
execution of instructions.
12
8/4/2021
13
8/4/2021
yes
Branch?
no
Fetch operand
Segment 3:
from memory
Interrupt yes
handling Branch?
Update PC
no
Empty pipe
14
8/4/2021
Instruction: 1 FI DA FO EX
2 FI DA FO EX
(Branch) 3 FI DA FO EX
4 FI - - FI DA FO EX
Decoding
5 - - - FI DA FO EX
a branch
instruction
6 FI DA FO EX
7 FI DA FO EX
Pipeline Conflicts
• There are three major difficulties that cause the
instruction pipeline to deviate from its normal
operations
1. Resource conflicts – caused by access to memory by two
segments at the same time. Separate memory for data and
instructions resolves this.
2. Data dependency – an instruction depends on the result
of a previous instruction but this result is not yet
available.
3. Branch difficulties – branch and other instructions that
change the PC's value.
15
8/4/2021
Data Dependency
• A data dependency occurs when an instruction needs
data that is not yet available.
• An instruction may need to fetch an operand being
generated at the same time by an instruction that is
first being executed.
• An address dependency occurs when an address
cannot be calculated because the necessary
information is not yet available, e.g., an instruction
with an register indirect address cannot fetch the
operand because the address is not yet loaded into the
register.
16
8/4/2021
17
8/4/2021
RISC Pipeline
• The RISC architecture is able to use an efficient
pipeline that uses a small number of suboperations for
several reasons:
– Its fixed length format means that decoding can occur
during register selection
– Data manipulation are all done using register-to-register
operations, so there is no need to calculate effective
addresses.
– This means that instructions can be carried out in 3
suboperations, with the third used for storing the result in
the specified register.
18
8/4/2021
19
8/4/2021
1. Load R1 I A E
2. Load R2 I A E
3. Add R1 + R2 I A E
Data conflict
we're using R2 4. Store R3 I A E
before it is finished
loading
20
8/4/2021
1. Load R1 I A E
The compiler inserting
a NOP eliminates the 2. Load R2 I A E
data conflict without
needed extra hardware
support 3. No- I A E
operation
4. Add R1+ I A E
R2
5. Store R3 I A E
Delayed Branch
Clock 1 2 3 4 5 6 7 8 9 10
cycles:
1. Load I A E
2. I A E
Increment
The branch 3. Add I A E
instruction
here 4. Subtract I A E
5. Branch I A E
to X
No- I A E
makes these Nops operation
necessary No- I A E
operation
Instruction I A E
in X
21
8/4/2021
Delayed Branch
Clock 1 2 3 4 5 6 7 8 9 10
cycles:
1. Load I A E
Placing the
2. I A E
branch
Increment
instruction
here 3. Branch I A E
to X
4. Add I A E
Vector Processing
• There is a class of computation problems that
require far greater computation power than
many computers can provide. They can take
days or weeks to solve on conventional
computers.
• These problems can be formulated in terms of
vectors and matrices.
• Computers that can process vectors as a unit
can solve these problems much more easily.
22
8/4/2021
Applications
• Computers with vector processing capabilities are
useful in problems such as:
– Long range weather forecasting
– Petroleum exploration
– Seismic data analysis
– Medical diagnosis
– Aerodynamics and space flight simulation
– Artificial intelligence and expert systems
– Mapping the human genome
– Image processing
Vector Operations
• A vector is an ordered set of data items in a
one-dimensional array.
– A vector V of length n can be represented as a row
vector by V = [V1, V2, … , Vn]
– If these values were listed in a column, it would be
a column vector.
• On sequential computers, vector operations are
broken down into single computations on
subscripted variables. The operations on the
entire vector is done by iteration.
23
8/4/2021
24
8/4/2021
Matrix Multiplication
• Matrix multiplication is an extremely computationally
intensive operation.
– Multiplying 2 nn matrices requiring n2 inner products,
each of which requires n multiplications = n3
multiplications
– An n m matrix can be thought of as n row vectors or m
column vectors.
• Consider multiplying two 33 matrices:
3
where cij = aik bkj c11 = a11b11 + a12b21 + a13b31
k =1
25
8/4/2021
Source A
26
8/4/2021
AR AR AR AR
DR DR DR DR
Data bus
Memory Interleaving
• Multiple memory units allow the use of
memory interleaving, where different sets of
addresses are assigned to different modules.
• n-way interleaved memory fetches can be
staggered, reducing the effective memory
cycle time by factor that is close to n.
27
8/4/2021
Supercomputers
• Supercomputers are commercial computers with vector
instructions and pipeline floating-point arithmetic operations.
– Components are also placed in close proximity to speed
data transfers and require special cooling because of the
resultant heat build-up.
• Supercomputers have all the standard instructions that one
might expect as well as those for vector operations. They have
multiple functional units, each with their own pipeline.
• They also make heavy use of parallel processing and are
optimized for large-scale numerical operations.
28
8/4/2021
Cray 1
• The Cray 1 was the first supercomputer
(1976).
It used vector processing with 12 distinct functional
units in parallel, operating concurrently with
operands stored in over 150 registers.
It could perform a floating-point operation on 2 sets
of 64 operands in one 12.5 ns clock cycle,
translating to 80 megaflops.
Array Processors
• Array processors performs computations on
large arrays of data
• There are two different types of such
processors:
– Attached array processors, which are auxiliary
processors attached to a general-purpose computer
– SIMD array processors, which are processors with
an SIMD organization that uses multiple functional
units to perform vector operations.
29
8/4/2021
High-speed memory-
Main memory Local memory
to-memory bus
30
8/4/2021
PE1 M1
PE3 M3
Main Memory
PEn Mn
31