0% found this document useful (0 votes)
28 views6 pages

Cache and Pipeline Performance Analysis

Uploaded by

guttamaneesha456
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views6 pages

Cache and Pipeline Performance Analysis

Uploaded by

guttamaneesha456
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Q1.

How many total bits are required for a direct-mapped cache with 128 KB of data and 1-
word block size, assuming a 32-bit address?
Q2. How many total bits are required for a direct-mapped cache with 128 KB of data and 4-
word block size, assuming a 32-bit address?
Q3. Consider a cache with 64 blocks and a block size of 16 bytes. What block number does
byte address 1200 map to?
Q4. Assume for a given machine and program:
 instruction cache miss rate 2%
 data cache miss rate 4%
 miss penalty always 40 cycles
 CPI of 2 without memory stalls
 frequency of load/stores 36% of instructions
a) How much faster is a machine with a perfect cache that never misses?

b) What happens if we speed up the machine by reducing its CPI to 1 without changing
the clock rate?
c) What happens if we speed up the machine by doubling its clock rate, but if the
absolute time for a miss penalty remains same?
Q5. Find the number of misses for a cache with four 1-word blocks given the following
sequence of memory block accesses:0, 8, 0, 6, 8, for each of the following cache
configurations
a) direct mapped
b) 2-way set associative (use LRU replacement policy)
c) fully associative

Q6. Assume a 500 MHz machine with


 base CPI 1.0
 main memory access time 200 ns.
 miss rate 5%
How much faster will the machine be if we add a second-level cache with 20ns access time
that decreases the miss rate to 2%?
Q7. In this exercise, we examine how resource hazards, control hazards, and Instruction Set
Architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to
the following fragment of MIPS code:
sw r16,12(r6)
lw r16,8(r6)
beq r5,r4,Label # Assume r5!=r4
add r5,r1,r4
slt r5,r15,r4
Assume that individual pipeline stages have the following latencies:
IF: 200ps ID:120ps EX : 150ps MEM 190ps WB: 100ps
a) For this problem, assume that all branches are perfectly predicted (this eliminates all
control hazards) and that no delay slots are used. If we only have one memory (for
both instructions and data), there is a structural hazard every time we need to fetch an
instruction in the same cycle in which another instruction accesses data. To guarantee
forward progress, this hazard must always be resolved in favor of the instruction that
accesses data. What is the total execution time of this instruction sequence in the 5-
stage pipeline that only has one memory?
b) We have seen that data hazards can be eliminated by adding nops to the code. Can you
do the same with this structural hazard? Why?
c) For this problem, assume that all branches are perfectly predicted (this eliminates all
control hazards) and that no delay slots are used. If we change load/store instructions
to use a register (without an off set) as the address, these instructions no longer need
to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline
has only 4 stages. Change this code to accommodate this changed ISA. Assuming this
change does not affect clock cycle time, what speedup is achieved in this instruction
sequence?
d) Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if
branch outcomes are determined in the ID stage, relative to the execution where
branch outcomes are determined in the EX stage?
e) Given these pipeline stage latencies, repeat the speedup calculation from 4.10.2, but
take into account the (possible) change in clock cycle time. When EX and MEM are
done in a single stage, most of their work can be done in parallel. As a result, the
resulting EX/MEM stage has a latency that is the larger of the original two, plus 20 ps
needed for the work that could not be done in parallel.
f) Given these pipeline stage latencies, repeat the speedup calculation from 4.10.3,
taking into account the (possible) change in clock cycle time. Assume that the latency
ID stage increases by 50% and the latency of the EX stage decreases by 10ps when
branch outcome resolution is moved from EX to ID.

Q8. A processor X1 operating at 2 GHz has a standard 5-stage RISC instruction pipeline
having a base CPI (cycles per instruction) of one without any pipeline hazards. For a given
program P that has 30% branch instructions, control hazards incur 2 cycles stall for every
branch. A new version of the processor X2 operating at same clock frequency has an
additional branch predictor unit (BPU) that completely eliminates stalls for correctly
predicted branches. There is neither any savings nor any additional stalls for wrong
predictions. There are no structural hazards and data hazards for X1 and X2. If the BPU has a
prediction accuracy of 80% what will be the speed up (rounded off to two decimal places)
obtained by X2 over X1 in executing P?
Q9. Consider a non-pipelined processor operating at 2.5 GHz. It takes 5 clock cycles to
complete an instruction. You are going to make a 5-stage pipeline out of this processor.
Overheads associated with pipelining force you to operate the pipelined processor at 2 GHz.
In a given program, assume that 30% are memory instructions, 60% are ALU instructions and
the rest are branch instructions. 5% of the memory instructions cause stalls of 50 clock cycles
each due to cache misses and 50% of the branch instructions cause stalls of 2 cycles each.
Assume that there are no stalls associated with the execution of ALU instructions. For this
program,what will be the speedup achieved by the pipelined processor over the non-pipelined
processor (round off to 2 decimal places)?
Q10. Consider the following processors (ns stands for nanoseconds). Assume that the pipeline
registers have zero latency.
Four-stage pipeline with stage latencies 1ns, 2 ns, 2ns, 1ns
Four-stage pipeline with stage latencies 1ns, 1.5ns, 1.5ns, 1.5ns
Five-stage pipeline with stage latencies 0.5ns, 1ns, 1ns, 0.6ns, 1 ns
Five-stage pipeline with stage latencies 0.5ns, 0.5 ns, 1ns, 1ns, 1.1ns
Which processor has the highest peak clock frequency?
Q11. Consider an instruction pipeline with four stages (S1, S2, S3, and S4) and each with
combinational circuit only. The pipeline registers are required between each stage and at the
end of the last stage. Delays for the stages and for the pipeline registers are as given in the
figure.

What is the approximate speed up of the pipeline in steady state under ideal conditions when
compared to the corresponding non-pipeline implementation?
Q12. A -stage pipelined processor has Instruction Fetch (IF) Instruction Decode
(ID) Operand Fetch (OF) Perform Operation (PO) and Write Operand (WO) stages.
The IF, ID, OF and WO stages take 1 clock cycle each for ADD and SUB any instruction.
The PO stage takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles
for MUL instruction, and 6 clock cycles for DIV instruction respectively. Operand
forwarding is used in the pipeline. What is the number of clock cycles needed to execute the
following sequence of instructions?
Q13. Consider a 4 stage pipeline processor. The number of cycles needed by the four
instructions I1, I2, I3, and I4, in stages S1, S2, S3, S4 is shown below.

What is the number of cycles needed to execute the following loop?


For (i=1 to 2){I1; I2; I3; I4;}
Q14. Consider a pipelined processor with the following four stages
IF: Instruction Fetch
ID: Instruction Decode and Operand Fetch
EX: Execute
WB: Write Back
The IF, ID, and WB stages take one clock cycle each to complete the operation. The number
of clock cycles for the EX stage depends on the instruction. The ADD and SUB instructions
need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand
forwarding is used in the pipelined processor. What is the number of clock cycles taken to
complete the following sequence of instructions?

Q15. A CPU has five stages pipeline and runs at 1GHz frequency. Instruction fetch happens
in the first stage of the pipeline. A conditional branch instruction computes the target address
and evaluates the condition in the third stage of the pipeline. The processor stops fetching
new instruction following a conditional branch until the branch outcome is known. A program
executes 109 instructions out of which 20% are conditional branches. If each instruction takes
one cycle to complete on average, then what will be the total execution time of the program?
Q16. Consider an instruction pipeline with five stages without any branch prediction: Fetch
Instruction (FI) Decode Instruction (DI) Fetch Operand (FO) Execute Instruction (EI) and
Write Operand (WO) The stage delays for FI, DI, FO, EI, and WO are 5ns, 7ns, 10ns, 8ns,
and 6ns respectively. There are intermediate storage buffers after each stage and the delay of
each buffer is 1ns. A program consisting of 12 instructions I1, I2, I3, ………I12 is executed
in this pipelined processor. Instruction I4 is the only branch instruction and its branch target
is I9. If the branch is taken during the execution of this program, how much time (in ns) will
be needed to complete the program?
Q17. A processor with 16 general purpose registers uses a 32-bit instruction format. The
instruction format consists of an opcode field, an addressing mode field, two register operand
fields, and a 16-bit scalar field. If 8 addressing modes are to be supported, what is the
maximum number of unique opcodes possible for every addressing mode?
Q18. Consider a computer architecture where instructions are 16 bits long. The first 6 bits of
the instruction are reserved for the opcode, and the remaining 10 bits are used for the
operands. There are three addressing modes: immediate, direct, and register. For immediate
addressing, the operand is included in the instruction itself. For direct addressing, the operand
is a memory address. For register addressing, the operand is a register number. Write the
instruction format for each of the addressing modes.
Q19. A computer has a 256 KByte, 4-way set associative, write back data cache with block
size of 32 Bytes. The processor sends 32 bit addresses to the cache controller. Each cache tag
directory entry contains, in addition to address tag, 2 valid bits, 1 modified bit and 1
replacement bit. The number of bits in the tag field of an address is
Q20. An 8KB direct-mapped write-back cache is organized as multiple blocks, each of size
32-bytes. The processor generates 32-bit addresses. The cache controller maintains the tag
information for each cache block comprising of the following. 1 Valid bit 1 Modified bit As
many bits as the minimum needed to identify the memory block mapped in the cache. What is
the total size of memory needed at the cache controller to store meta-data (tags) for the
cache?
Q21. A computer system has an L1 cache, an L2 cache, and a main memory unit connected as
shown below. The block size in L1 cache is 4 words. The block size in L2 cache is 16 words.
The memory access times are 2 nanoseconds. 20 nanoseconds and 200 nanoseconds for L1
cache, L2 cache and main memory unit respectively.

When there is a miss in L1 cache and a hit in L2 cache, a block is transferred from L2 cache
to L1 cache. What is the time taken for this transfer?
Q22. A given program has 25% load/store instructions. Suppose the ideal CPI (cycles per
instruction) without any memory stalls is 2. The program exhibits 2% miss rate on instruction
cache and 8% miss rate on data cache. The miss penalty is 100 cycles. What is the
speedup (rounded off to two decimal places) achieved with a perfect cache (i.e., with NO data
or instruction cache misses)?
Q23. Explain in detail the Flynn’s classification of computer architectures. Differentiate
between SISD, SIMD, MISD, and MIMD architectures with suitable diagrams. Discuss how
instruction and data streams are handled in each category of Flynn’s taxonomy. What are the
limitations of Flynn’s classification in the context of modern parallel processing systems?
Q24. Describe the evolution of computers from the first generation to the fifth generation.
What were the main technological developments that distinguished each generation of
computers? Describe the differences in hardware, software, and performance among the five
generations of computers. Discuss the invention of transistors and how it revolutionized
computer design. Explain the role of integrated circuits (ICs) in the development of the third-
generation computers. Describe how microprocessors led to the birth of personal computing.
What are the main features and goals of the fifth-generation computers? How does artificial
intelligence fit into this generation?
Q25. Problems in this exercise assume that logic blocks needed to implement a processor’s
datapath have the following latencies:
I-Mem : 200 ps Add: 70ps Mux 20ps ALU 90ps Regs 90ps D-Mem 250ps Sign-Extend
15ps Shift-Left-2 10ps
a) If the only thing we need to do in a processor is fetch consecutive instructions, what
would the cycle time be?
b) Consider a datapath where the processor that only has one type of instruction:
unconditional PC-relative branch. What would the cycle time be for this datapath?
c) Repat b) but this time we need to support only conditional PC-relative branches.

You might also like