0% found this document useful (0 votes)
7 views10 pages

Unit III Pipelining

The document discusses pipelining as a method to enhance CPU performance by allowing simultaneous execution of multiple instructions, illustrated through a water bottle packaging analogy. It outlines the design of a basic pipeline, execution sequences, and the performance metrics of pipelined versus non-pipelined processors. Additionally, it covers various parallel processor systems, cache coherence issues, and methods to resolve these issues, including different cache coherence protocols.

Uploaded by

mamta devi
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views10 pages

Unit III Pipelining

The document discusses pipelining as a method to enhance CPU performance by allowing simultaneous execution of multiple instructions, illustrated through a water bottle packaging analogy. It outlines the design of a basic pipeline, execution sequences, and the performance metrics of pipelined versus non-pipelined processors. Additionally, it covers various parallel processor systems, cache coherence issues, and methods to resolve these issues, including different cache coherence protocols.

Uploaded by

mamta devi
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

UNIT III

Pipelining
To improve the performance of a CPU we have two options: 1) Improve the
hardware by introducing faster circuits. 2) Arrange the hardware such that
more than one operation can be performed at the same time. Since there is a
limit on the speed of hardware and the cost of faster circuits is quite high, we
have to adopt the 2nd option.
Pipelining is a process of arrangement of hardware elements of the CPU
such that its overall performance is increased. Simultaneous execution of
more than one instruction takes place in a pipelined processor. Let us see a
real-life example that works on the concept of pipelined operation. Consider
a water bottle packaging plant. Let there be 3 stages that a bottle should
pass through, Inserting the bottle(I), Filling water in the bottle(F), and Sealing
the bottle(S). Let us consider these stages as stage 1, stage 2, and stage 3
respectively. Let each stage take 1 minute to complete its operation. Now, in
a non-pipelined operation, a bottle is first inserted in the plant, after 1 minute
it is moved to stage 2 where water is filled. Now, in stage 1 nothing is
happening. Similarly, when the bottle moves to stage 3, both stage 1 and
stage 2 are idle. But in pipelined operation, when the bottle is in stage 2,
another bottle can be loaded at stage 1. Similarly, when the bottle is in stage
3, there can be one bottle each in stage 1 and stage 2. So, after each
minute, we get a new bottle at the end of stage 3. Hence, the average time
taken to manufacture 1 bottle is:
Without pipelining = 9/3 minutes = 3m
I F S | | | | | |
| | | I F S | | |
| | | | | | I F S (9 minutes)
With pipelining = 5/3 minutes = 1.67m
I F S | |
| I F S |
| | I F S (5 minutes)
Thus, pipelined operation increases the efficiency of a system.
Design of a basic pipeline
• In a pipelined processor, a pipeline has two ends, the input end and the
output end. Between these ends, there are multiple stages/segments
such that the output of one stage is connected to the input of the next
stage and each stage performs a specific operation.
• Interface registers are used to hold the intermediate output between two
stages. These interface registers are also called latch or buffer.
• All the stages in the pipeline along with the interface registers are
controlled by a common clock.
Execution in a pipelined processor Execution sequence of instructions in
a pipelined processor can be visualized using a space-time diagram. For
example, consider a processor having 4 stages and let there be 2
instructions to be executed. We can visualize the execution sequence
through the following space-time diagrams:
Non-overlapped execution:
Stage / Cycle 1 2 3 4 5 6 7 8

S1 I1 I2

S2 I1 I2

S3 I1 I2

S4 I1 I2

Total time = 8 Cycle


Overlapped execution:
Stage / Cycle 1 2 3 4 5

S1 I1 I2

S2 I1 I2

S3 I1 I2

S4 I1 I2

Total time = 5 Cycle Pipeline Stages RISC processor has 5 stage


instruction pipeline to execute all the instructions in the RISC instruction set.
Following are the 5 stages of the RISC pipeline with their respective
operations:
• Stage 1 (Instruction Fetch) In this stage the CPU reads instructions from
the address in the memory whose value is present in the program
counter.
• Stage 2 (Instruction Decode) In this stage, instruction is decoded and
the register file is accessed to get the values from the registers used in
the instruction.
• Stage 3 (Instruction Execute) In this stage, ALU operations are
performed.
• Stage 4 (Memory Access) In this stage, memory operands are read and
written from/to the memory that is present in the instruction.
• Stage 5 (Write Back) In this stage, computed/fetched value is written
back to the register present in the instructions.
Performance of a pipelined processor Consider a ‘k’ segment pipeline
with clock cycle time as ‘Tp’. Let there be ‘n’ tasks to be completed in the
pipelined processor. Now, the first instruction is going to take ‘k’ cycles to
come out of the pipeline but the other ‘n – 1’ instructions will take only ‘1’
cycle each, i.e, a total of ‘n – 1’ cycles. So, time taken to execute ‘n’
instructions in a pipelined processor:
ETpipeline = k + n – 1 cycles
= (k + n – 1) Tp
In the same case, for a non-pipelined processor, the execution time of ‘n’
instructions will be:
ETnon-pipeline = n * k * Tp
So, speedup (S) of the pipelined processor over the non-pipelined processor,
when ‘n’ tasks are executed on the same processor is:
S = Performance of pipelined processor /
Performance of non-pipelined processor
As the performance of a processor is inversely proportional to the execution
time, we have,
S = ETnon-pipeline / ETpipeline
=> S = [n * k * Tp] / [(k + n – 1) * Tp]
S = [n * k] / [k + n – 1]
When the number of tasks ‘n’ is significantly larger than k, that is, n >> k
S = n * k / n
S = k
where ‘k’ are the number of stages in the pipeline. Also, Efficiency = Given
speed up / Max speed up = S / S max We know that Smax = k So, Efficiency =
S / k Throughput = Number of instructions / Total time to complete the
instructions So, Throughput = n / (k + n – 1) * Tp Note: The cycles per
instruction (CPI) value of an ideal pipelined processor is 1 Please see Set 2 for
Dependencies and Data Hazard and Set 3 for Types of pipeline and Stalling.
What are the types of Parallel Processor
System in Computer Architecture?
Parallel processing systems are created to speed up the implementation of programs
by breaking the program into several fragments and processing these fragments
together. Such systems are multiprocessor systems also referred to as tightly coupled
systems. Parallel processors can be divided into the following four groups based on
the number of instructions and data streams are as follows −
SISD Computer Organization(Single Instruction and
Single Data Stream)
SISD represents a computer organization with a control unit, a processing unit, and a
memory unit. SISD is like the serial computer in use. SISD executes instructions
sequentially and they may or may not have parallel processing capabilities.
Instructions executed sequentially may get overlapped in their execution stages. A
SISD computer can have greater than one functional unit in it. But all the functional
units are below the administration of one control unit. Parallel processing in such
systems can be attained by pipeline processing or by using multiple functional units.

SIMD Computer Organization(Single Instruction,


Multiple Data)
SIMD organization includes multiple processing elements. All these elements are
below the administration of a common control unit. All processors get identical
instruction from the control unit but work on multiple data items.
The shared subsystem contains multiple modules which help in communicating with
all the processors simultaneously. This is further divided into word slice and bit-slice
mode organizations.
MISD Computer Organization(Multiple Instruction and
Single Data stream)
MISD organization includes multiple processing units, each receiving separate
instructions operating over a similar data flow. The result of one processor becomes
the input of the next processor. The introduction of this organization received less
attention and was not practically implemented in architecture. The structure was of
only theoretical interest.

MIMD Computer Organization(multiple instruction,


multiple data)
A MIMD computer organization contains interactions among the multiprocessors
since all memory flows are changed from the common data area transmitted by all
processors. If the multi-data streams were derived from different shared memories
then it is a multiple SISD operation that is equal to a set of ‘n’ independent SISD
systems.

Cache Coherence
The cache coherence protocol is discussed in this article as a solution to the multicache
inconsistency issues.

Cache Coherence
A cache coherence issue results from the concurrent operation of several processors
and the possibility that various caches may hold different versions of the identical
memory block. The practice of cache coherence makes sure that alterations in the
contents of associated operands are quickly transmitted across the system.

Cache coherence has three different levels:


o Each writing operation seems to happen instantly.
o Each operand's value changes are seen in every processor in precisely the same order.
o Non-coherent behavior results from many processors interpreting the same action in
various ways.

Methods to resolve Cache Coherence


The two methods listed below can be used to resolve the cache coherence issue:

o Write Through
o Write Back
Write Through
The easiest and most popular method is to write through. Every memory write operation
updates the main memory. If the word is present in the cache memory at the requested
address, the cache memory is also updated simultaneously with the main memory.

The benefit of this approach is that the RAM and cache always hold the same information. In
systems with direct memory access transfer, this quality is crucial. It makes sure the
information in the main memory is up-to-date at all times so that a device interacting over
DNA can access the most recent information.

Advantage - It provides the highest level of consistency.

Disadvantage - It requires a greater number of memory access.

Write Back
Only the catch location is changed during a write operation in this approach. When the word
is withdrawn from the cache, the place is flagged, so it is replicated in the main memory. The
right-back approach was developed because words may be updated numerous times while
they are in the cache. However, as long as they are still there, it doesn't matter whether the
copy that is stored in the main memory is outdated because requests for words are fulfilled
from the cache.

An accurate copy must only be transferred back to the main memory when the word is
separated from the cache. According to the analytical findings, between 10% and 30% of all
memory references in a normal program are written into memory.

Advantage - A very small number of memory accesses and write operations.

Disadvantage - Inconsistency may occur in this approach.

The important terms related to the data or information stored in the cache as well as
in the main memory are as follows:

o Modified - The modified term signifies that the data stored in the cache and main
memory are different. This means the data in the cache has been modified, and the
changes need to be reflected in the main memory.
o Exclusive - The exclusive term signifies that the data is clean, i.e., the cache and the
main memory hold identical data.
o Shared - Shared refers to the fact that the cache value contains the most current data
copy, which is then shared across the whole cache as well as main memory.
o Owned - The owned term indicates that the block is currently held by the cache and
that it has acquired ownership of it, i.e., complete privileges to that specific block.
o Invalid - When a cache block is marked as invalid, it means that it needs to be fetched
from another cache or main memory.

Below is a list of the different Cache Coherence Protocols used in multiprocessor


systems:

o MSI protocol (Modified, Shared, Invalid)


o MOSI protocol (Modified, Owned, Shared, Invalid)
o MESI protocol (Modified, Exclusive, Shared, Invalid)
o MOESI protocol (Modified, Owned, Exclusive, Shared, Invalid)

These protocols are discussed below:

1. MSI Protocol

This is a fundamental cache coherence mechanism that is utilized in multiprocessor


systems. A cache may be in any state indicated by the protocol name's letters.
Therefore, for MSI, each block may be in one of the following states:

o Modified - In other words, the data in the cache is incompatible with the main memory,
and this status denotes that the block has been updated in the cache. Therefore, when
the data from the cache block is removed and is in the Modified (M) state, the cache is
responsible for writing the block to the main memory.
o Shared - At least one cache has at least one copy of this block, which has not been
updated. The cache might be removed without writing the data to the backup store.
o Invalid - If this block is going to be stored in this cache, it must be obtained from RAM
or from a different cache because it is invalid.

2. MOSI Protocol

It has one extra state than the MSI protocol, which is discussed below:

Owned - It is used to signify the ownership of the current processor to this block and
will respond to inquiries if another processor wants this block.

3. MESI Protocol

The protocol for cache coherence that is utilized the most is this one. Each cache line
bears a status indicating one of the following:
o Modified - As mentioned above, this term signifies that the data stored in the cache
and main memory are different. This means the data in the cache has been modified,
and the changes need to be reflected in the main memory.
o Exclusive - The exclusive term signifies that the data is clean, i.e., the cache and the
main memory hold identical data.
o Shared - This signifies that other caches on the computer may also hold this cache line.
o Invalid - This indicates that this cache line is marked as invalid by the word "invalid."

4. MOESI Protocol

This protocol provides comprehensive cache coherence, covering all potential states
that are frequently utilized in other protocols. There are one of the following statuses
for each cache line:

o Modified - While the copy in main memory is inaccurate and no other processors are
holding copies, a cache line in this condition contains the most recent, accurate copy
of the data.
o Owned - The most current, accurate copy of the data is stored in a cache line in this
state. In that other processors can store copies of the most recent, accurate data
comparable to the shared state; unlike the shared state, copies in main memory can be
inaccurate. One processor can only own the data at a time, and the remaining
processor can have the data in the shared state.
o Exclusive - The most current, accurate copy of the data is stored on a cache line in this
state. Since no other storage location has a copy of the data, the ram copy is also the
most recent and accurate copy.
o Shared - The most current, accurate copy of the data is stored on a cache line in this
condition. Additional system processors may also store data copies in the shared state.
If no other processor has ownership of the data, the copy in primary memory also
represents the most recent and accurate version of the data.
o Invalid - In this situation, a cache line doesn't contain a reliable copy of the data. Still,
valid data can be found in primary memory or another processor's cache.

Types of Coherence:
There exist three varieties of coherence referred to the coherency mechanisms, which
are listed below:
1. Directory Based - A directory-based system keeps the coherence amongst caches by
storing shared data in a single directory. In order to load an entry from primary memory
into its cache, the processor must request permission through the directory, which
serves as a filter. The directory either upgrades or devalues the other caches that
contain that record when a record is modified.
2. Snooping - Individual caches watch address lines during the snooping process to look
for accesses to memory locations that they have cached. A write invalidate protocol is
what it is known as. When a write activity is seen to a memory address for which a
cache maintains a copy, the cache controller invalidates its own copy of the snooped
memory location.
3. Snarfing - A cache controller uses this approach to try and update its own copy of a
memory location when a second master alters a place in the main memory by keeping
an eye on both the address and the contents. The cache controller updates its own
copy of the underlying memory location with the new data when a write action is
detected to a place of which a cache holds a copy.

You might also like