Parallel Processing and Pipelining
Parallel processing is a technique used in computer system to increase the computational speed by
performing simultaneous data processing tasks. In conventional computer, each instruction is
executed sequentially one by one i.e. first the instruction is fetched, then decoded, and then executed.
But in the parallel processing technique, the computer system is able to perform concurrent data
processing to achieve faster execution time.
For example, the computer while an instruction is being executed in the ALU, the next instruction
can be read from the memory. The computer system may have two more ALUs and be able to
execute two or more instructions at the same time. Furthermore the system may have two or more
processors operating concurrently. The purpose of the parallel processing is to increase speed up the
computer processing capability and increase its throughput i.e. the amount of processing that can be
accomplished during a given interval of time. The parallel processing will increase the hardware
complexity and thereby the cost of the system. The parallel processing can be achieved by
distributing the data among the multiple functional units. For example the arithmetic, logic, shift
operations can be separated into three units and the operands diverted to each unit under the
supervision of the control unit.
Parallel Processing and Pipelining
Parallel Processing and Pipelining
In fig. the execution unit is separated into eight functional unit operating in parallel. The operands in
the registers are applied to one of the units depending upon the operation specified by the instruction
associated with the operands. The operation performed in each functional unit is indicating each
block of the diagram. The adder and integer multiplier perform the arithmetic operations with
integer numbers. The floating point operations are separated into three circuits operating in parallel.
The logical, shift, and increment operations can be performed concurrently on different data. All the
units are independent to each other, so one number can be shifted while another number is being
incremented.
When several computations are performed concurrently in a computer, it is known as parallel
processing. A multiprocessor system performs several computations concurrently. Parallel
processing is also achieved in a uniprocessor system using pipelining. There are three types of
parallel processors/computers based on their architectural configuration:
(i) Pipelined processors (ii) Array processors (iii) Multiprocessor system
Parallel Processing Classifications
There are various ways that parallel processing can be classified. One classification was introduced
by M.J. Flynn based on the number of instructions and data items that are manipulated by the
computer simultaneously. For normal operation of a computer is to fetch the instructions from the
memory and then execute them in the processor. The sequence of instructions from read from
memory constitutes an instruction stream. The operations performed on the data in the processor
constitutes a data streams. Parallel processing may occur in the instruction stream, in the data
stream, or in both. Flynn’s classification divides computers into four major groups as follows:
Single Instruction Stream on Single Data Stream (SISD)
An SISD computing system is a uniprocessor machine which is capable of executing a single instruction,
operating on a single data stream. It consists of a control unit, a processor unit, and a memory unit. In
SISD, machine instructions are processed in a sequential manner and computers adopting this model
are popularly called sequential computers. Most conventional computers have SISD architecture. All the
instructions and data to be processed have to be stored in primary memory. The system may or may not
have the internal parallel processing capabilities. For parallel processing, the system uses multiple
functional unit or pipelining processing.
The speed of the processing element in the SISD model is limited(dependent) by the rate at which the
computer can transfer information internally. Dominant representative SISD systems are IBM PC,
workstations.
Parallel Processing Classifications
Single Instruction Stream over Multiple Data Streams (SIMD)
This organization has many processor units under the supervision of a single control unit. The SIMD
system is a multiprocessor machine capable of executing the same instruction on all the CPUs but
operating on different data streams. All processors receives the same instruction from the common
control unit but operate on different data items. All the processing elements, which are ALUs, receive
the same instruction broadcast from the control unit but operate on different data sets from distinct
data streams. As seen in the diagram, control unit sends common instruction stream to each processor
element. The shared memory sub-system containing multiple modules is very essential. Array
Processor
Parallel Processing Classifications
Multiple Instruction streams over a Single Data Stream (MISD)
An MISD computing system is a multiprocessor machine capable of executing different instructions on
different Processing unit but all of them operating on the same dataset .This is a class of machines in
which the same data stream flows through an array of processors that execute different instruction
streams. In MISD computer architecture, there are n processor elements. Each processor elements
receives distinct instructions to execute on the same data stream and its derivatives. Here the output of
one processor element becomes the input of the next processor element in the series.
Parallel Processing Classifications
Parallel Processing Classifications
Multiple Instruction streams over Multiple Data Streams (MIMD)
An MIMD system is a multiprocessor machine which is capable of executing multiple
instructions on multiple data sets. Each Processing element in the MIMD model has
separate instruction and data streams; therefore machines built using this model are
capable to any kind of application. Parallel computers are classified as MIMD in which as
the name suggests, multiple instructions streams process multiple data streams.
Pipelining Processing
What is Pipelining:
Pipelined processing is a technique of divided a sequential process into a number of sub-operations
or subtasks as shown in figure. Each subtask is executed by a specialized hardware unit. Hence, all
the functional units operates concurrently with the other units in the pipeline. So, pipelining can be
visualized as a collection of processing unit through which the binary information flows. The result
obtained from the each processing unit is transferred to the next processing unit. The final result is
obtained when the information is passed through the all functional units. Hence, successive tasks are
streamed into the pipeline and get executed in an overlapped manner at the subtask levels. In the
pipelined processing all the four functional units of Fig. 11.1(a) operate in parallel. The operation of
all the functional units is synchronized under a common clock. All the functional units are within a
processor and operate under a single control unit. A pipelined computer is a uniprocessor system.
Pipelining Processing
In Von Neumann type processor, the instruction is executed by using the following steps:
(i) Instruction fetched IIF) from the memory.
(ii) Instruction decoding (ID). After decoding an instruction, the processor comes to know what
operations are to be performed.
(iii) Operand fetched if any
(iv) Execution of the decoded instruction (Ex)
In a non-pipelined processor, in order to execute the instruction, all the above steps are to be
executed. After that the next instruction (2nd instruction) to be executed is fetched from the memory.
In a pipelined processor, while one instruction is being fetched, the 2nd is being decoded, the
operand is being fetched for the 3rd and the 4th is being executed. The instruction fetching unit, the
instruction decoding unit, operand fetching unit and the execution unit all operate simultaneously.
This type of overlapped operation is shown in Fig.. In first cycle instruction No. 1, that is, I1 is fetched
from the memory. In the second cycle another instruction I2 is fetched and simultaneously I1 is
decoded by the instruction decoding unit. In the third cycle the instruction I3 is fetched, I2 is decoded
and the operand for I1 is fetched. In the fourth cycle the instruction I4 is fetched, I3 is decoded, the
operand for I2 is fetched and I1 is executed. Such a cycle is called pipelined cycle. This type of
processing technique is called pipelining. In the fifth pipelined cycle the instruction I5 is fetched, I4 is
decoded, the operand for I3 is fetched and I2 is executed. In this way the execution of other
instructions proceeds.
Pipelining Processing
Pipelining Processing
Space time diagram of Pipelining:
Consider a four segment pipeline as shown in next slide. The operands pass through all four
segments in a fixed sequence. Each segment of a combinational circuit S that performs sub operation
over the data stream flowing through the pipe. The segments are separated by register R that holds
the intermediate results between the stages.
The behavior of the pipeline can be illustrated with space time diagram. The horizontal line displays
the time in clock cycles and the vertical axis gives the segment number. This diagram shows the six
tasks to be completed by the processing unit. Initially the task T1 is handled by S1. After the first
clock, segment 2 is busy with T1 while segment 1 is busy with T2. In this way, the first task is
completed after four clock pulses. From now, the pipe completes a task every clock cycles.
Hence time required to complete all the six task = Time for first task to complete + (6-1) clock pulses
= 4 + 5=9 clock pulses.
In generalized, if there are k segments pipeline with clock cycle time tp, and n no of task is to be
completed by this unit. Hence, time required to complete the first task=ktp. After that each clock
cycle time a task will be executed. Hence the remaining (n-1) task require (n-1)tp clock cycles.
Total time required= (k + n - 1)tp
But in case of non-pipeline architecture, total time required for n task = ntn where tn is the time
required to complete a task by non pipeline processing architecture.
Speed up (S) of pipe line processing over an non-pipeline processing is given by:
S=ntn/(k + n - 1)tp
Pipelining Processing
In the above equation, if the number of task n increases, then n>>(k-1). Hence (n + K -1) approaches
to n. Under this condition, speedup S= tn/tp
Now, if time required by a non-pipeline unit (tn) to complete a task is equal to the time taken by
pipeline unit (ktp), then above equation can be written as:
S=ktp/tp=k
This is the theoretical maximum speedup that a pipeline can provide is k where k is the number of
segments in the pipeline.
Ex: If each segment takes 20 nsec to complete a sub task (tp=20) and there are k=4 segments and
total n=100 tasks has to be completed. For pipeline processing, total time required = (k+n-
1)tp=103*20=2060 nsec
For non-pipeline unit, total time=(k*tp)*n=4*20*100=8000 nsec
S=8000/2060=3.88 which is close to 4 if n is increased.
Instruction Pipelining Processing
There are two areas of computer design where the pipeline organization is applicable. An arithmetic
pipeline divides an arithmetic operation into sub operations for execution in the pipeline segments.
An instruction pipeline operates on a stream of instructions by overlapping the fetch, decode, and
execute phases of the instruction cycle.
Instruction Pipeline:
Pipelining Hazards
Pipeline hazards are situations that prevent the next instruction in the instruction stream from
executing during its designated clock cycle. The instruction is said to be stalled. When an instruction
is stalled, all instructions later in the pipeline than the stalled instruction are also stalled. Instructions
earlier than the stalled one continue. No new instructions fetched during the stall. There are three
types of hazards:
i) Resource hazard,
ii) Data hazard, and
iii) Control Hazard
Resource Hazard
A resource hazard occurs when two (or more) instructions that are already in the pipeline need the
same resource. Hence, in this case, the instructions must be executed in serial rather than parallel for
a portion of the pipeline. A resource hazard is sometime referred to as a structural hazard.
Let us consider a simple example of a resource hazard. Assume a simplified five-stage pipeline,
Opcode Fetch (OF), Decode Instruction (DI), Fetch operand (FO), Execute Instruction (EI), and Writing
operand (WO). Each stage takes one clock cycle to complete its operation. In Figure a new instruction
enters the pipeline each clock cycle.
Now assume, from the main memory, all instructions are fetched one by one and also data is read
and written but one at a time. Hence, an operand read to or write from memory cannot be
performed in parallel with an instruction fetch. In fig., the source operand for instruction I1 is in
memory and has to be read from the memory in clock 3. Therefore, the fetch instruction stage of the
pipeline must idle for one cycle before beginning the instruction fetch for instruction I3. The figure
assumes that all other operands are in registers. Solved by separate data and instruction memory.
Pipelining Hazards
Data Hazards:
A data hazard occurs when two instructions in a program are to be executed in sequence and both
access a particular memory or register operand. A collision occurs when an instruction cannot
proceed because previous instructions did not complete certain operations. A data dependency
occurs when an instruction needs data that are not yet available. For, example, an instruction in the
FO segment may need to fetch an operand that is being generated at the same time by the previous
instruction in the segment EI. Therefore the 2 nd instruction must wait for data to become available by
the first instruction. Similarly an address dependency may occur when an operand address cannot be
calculated because the information need by the addressing mode is not available. For example, an
instruction with register indirect mode cannot proceed to fetch the operand if the previous
instruction is loading the address into the register. Therefore the operand access to memory must be
delayed until the required address is available.
Pipelining Hazards
ADD AX,BX
SUB [Link]
In this example, the first instruction, the first instruction add AX content with BX content and store
the result in AX. After that the 2nd instruction subtract the content CX from the previous addition
result store in AX. In the pipeline the ADD instruction does not upgrade the result until the end of the
clock 5. But the SUB instruction require that value at the 2nd stage which is the 4th clock cycle. To
maintain correct operation, the pipeline must stall for two clocks cycles. Thus, in the absence of
special hardware and specific avoidance algorithms, such a data hazard results in inefficient pipeline
usage. There are three types of data hazards
Read after write (RAW), or true dependency:.A hazard occurs if the read takes place before the write
operation is complete.
Write after read (RAW), or antidependency: A hazard occurs if the write operation completes before
the read operation takes place.
Write after write (RAW), or output dependency: Two instructions both write to the same location. A
hazard occurs if the write operations take place in the reverse order of the intended sequence.
ADD AX,BX
SUB AX,CX
Pipelining Hazards
Solution of data Hazards:
The pipelined computer deal the above conflicts in a various ways.
Hardware Interlocks:
The most straight forward method is to insert hardware interlocks. A interlock is a circuit that detects
instructions whose source operands are destinations of instructions farther up in the pipeline.
Detection of this situation causes the instruction whose source is not available to be delayed by
enough clock cycles to resolve the conflict.
Operand forwarding:
Operand forwarding technique uses special hardware to detect a conflict and then avoid it by routing
the data needed through special paths between pipeline segments. For, example, instead of
transferring an ALU result into a destination register, the hardware checks the destination operand
and if it is needed as a source in the next instruction, it passes the result directly into the ALU input,
bypassing the register file. This method requires additional hardware paths through multiplexers as
well as the circuit that detects the conflicts.
Delayed load:
In this technique the compiler of the computer is designed to detect a data conflict and reorder the
instructions as necessary to delay the loading of the conflicting data inserting no-operation
instructions.
Pipelining Hazards
Control Hazards:
One of the major problems in operating an instruction pipeline is the occurrence of branch
instructions.
An instruction in the sequence may be a program control type instruction that causes a branch out of
normal sequence i.e. the program is transferred into a new address. In that case all information
stored in the instruction buffer are useless and hence deleted. Hence, the next instruction can not be
initiated until the completion of current branch instruction. This causes the delay in execution of next
instruction in the pipeline.
In figure, assume that the instruction 3 is a branching instruction in 4 segments pipeline. As soon as
this instruction is decoded in segment DA(decode and calculate effective address) in step 4, the
transfer from FI to DA of the other instructions is halted until the branch instruction is executed in
the stp 6. If the branch is taken, a new instruction is fetched in the step 7. If the branch is not taken,
the instruction fetched previously in the step 4 can be used. The pipeline then continues until a new
branch instruction is encountered.
Control Hazards
Solution of Control Hazards
The pipeline d computer uses various hardware technique to minimize the performance degradation
caused by control hazards.
Prefetch target Instruction in Prefetch target buffer:
One way of handling a conditional branch is to prefetch the target instruction specified by the
branching in addition to the instruction following the branch. The memory used here is assumed
to be in multiple modules and all modules can be accessed concurrently. There are two prefetch
buffers (cache) of instructions used:
a)Sequential Prefetch buffer b) Target prefetch Buffer
The Sequential Prefetch Buffer holds instructions fetched during the sequential part of a program.
The Target Prefetch Buffer holds instructions fetched from the target of a conditional branch.
When a branch is successful, the entire sequential prefetch buffer is invalidated, but the other
buffer is validated. However, when a branch is unsuccessful, the reverse is true. If the instruction
requested by the decoder is available in the sequential buffer used for sequential instructions or is
available in the target buffer if a conditional branch has just been resolved and is successful, it
enters the decoder with no delay. Otherwise, the decoder is idle until the instruction returns from
memory.
Solution of Control Hazards
Solution of Control Hazards
What do you mean by multiprocessor system? What are the similarities and dissimilarities
between the multiprocessor system and multiple computer system?
Answer
A multiprocessor system is a single computer system consisting of multiple processors, which may
communicate and cooperate at different levels in solving a given problem. The communication may
occur by sending messages from one processor to the other or by sharing a common memory.
Similarities:
(a) Both systems consist of several processors inter-connected through networks.
(b) Both execute programs concurrently.
(c) Throughput, in both cases, is very high.
(d) Both are very expensive systems.
Dissimilarities:
(a) In multiprocessor systems, the processors usually communicate with each other. However, in
multiple computer systems, the autonomous processors may or may not communicate with
each other.
(b) A multiprocessor system is controlled by one operating system, which provides interaction
between processors; but in multiple computer system, different processors are controlled by
different operating systems.
(c) The degree of interactions among processors in multiprocessor systems is very high; but that in
multiple computer systems is very less.