0% found this document useful (0 votes)
5 views17 pages

Cache Memory Formal Notes

The document provides an in-depth technical study of cache memory, detailing its characteristics, hierarchy, operation, and design parameters. It discusses various access methods, performance metrics, and cache mapping functions, as well as replacement algorithms and write policies. Additionally, it covers multi-level cache organizations and includes a case study on Intel Pentium 4 cache architecture.

Uploaded by

sanketkokate712
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views17 pages

Cache Memory Formal Notes

The document provides an in-depth technical study of cache memory, detailing its characteristics, hierarchy, operation, and design parameters. It discusses various access methods, performance metrics, and cache mapping functions, as well as replacement algorithms and write policies. Additionally, it covers multi-level cache organizations and includes a case study on Intel Pentium 4 cache architecture.

Uploaded by

sanketkokate712
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

CACHE MEMORY

A Comprehensive Technical Study


Based on: Computer Organization and Architecture, 8th Edition
William Stallings

Chapter 4
1. Memory System Characteristics
A memory system is characterised along multiple dimensions. Understanding these dimensions is
fundamental to analysing the performance and suitability of different memory technologies in a
computer architecture.

1.1 Location
Memory may reside in one of the following locations within a computer system:
• CPU Registers — Located directly within the processor; the fastest accessible storage.
• Internal (Main) Memory — Includes cache memory and RAM; accessible directly by the CPU.
• External Memory — Secondary or backing storage such as hard disks and optical media.

1.2 Capacity
Capacity is measured in two interrelated terms:
• Word Size — The natural unit of organisation within a processor; typically 32 or 64 bits.
• Number of Words or Bytes — The total storage volume the memory system can hold.

1.3 Unit of Transfer


The unit of transfer defines how much data moves between memory components in one operation:
• Internal Transfer — Governed by the data bus width; typically one word at a time.
• External Transfer — Conducted in blocks, which are substantially larger than a single word.
• Addressable Unit — The smallest location that can be uniquely addressed; internally a word,
and on disk-based systems a cluster.

1.4 Access Methods


Access method describes the mechanism by which a location is reached. Four distinct methods exist:

Sequential Access
Data is read in sequence from the beginning. Access time depends on the current and previous
location of data. Example: magnetic tape.

Direct Access
Individual blocks carry unique addresses. Access involves jumping to a general vicinity followed by
a sequential search. Example: magnetic disk.

Random Access
Each location is uniquely addressable and access time is independent of location. Example: RAM
(Random Access Memory).

Associative Access
Data is located by comparing contents against a stored portion. Access time is independent of
location. Example: cache memory.

1.5 Performance Metrics


• Access Time — The elapsed time between presenting an address and receiving valid data.
• Memory Cycle Time — Access time plus any recovery time required before the next access.
• Transfer Rate — The rate at which data can be moved to or from memory.

1.6 Physical Types


Type Examples Notes
Semiconductor SRAM, DRAM Main memory and cache
Magnetic Disk, Tape Secondary storage
Optical CD, DVD Removable media
Others Hologram, Bubble Specialised/experimental

1.7 Physical Characteristics


• Decay — Whether data deteriorates over time without refreshing.
• Volatility — Whether data is lost when power is removed.
• Erasability — Whether data can be overwritten.
• Power Consumption — Energy requirements of the memory type.

1.8 Organisation
Organisation refers to the physical arrangement of bits into words. This is not always straightforward;
for example, interleaved memory organisations distribute data across multiple modules to improve
bandwidth.
2. Memory Hierarchy
No single memory technology can simultaneously achieve high capacity, high speed, and low cost. The
memory hierarchy is a structured layering that exploits the trade-offs between these properties. Higher
levels are faster, smaller, and more expensive per bit; lower levels are slower, larger, and cheaper.

Level Type Speed Capacity Cost per Bit


1 (Fastest) CPU Registers Sub- Tens to hundreds Very High
nanosecond of bytes
2 L1 Cache ~1 ns 8 KB – 128 KB High
3 L2 Cache ~5 ns 256 KB – 4 MB Moderate-High
4 L3 Cache ~15 ns 4 MB – 36 MB Moderate
5 Main Memory ~50–100 ns GBs Low-Moderate
(RAM)
6 Disk Cache ~ms range MBs Low
7 External Storage ms to TBs Very Low
(Slowest) (Disk, Optical, seconds
Tape)

The principle that makes the hierarchy effective is Locality of Reference — during the execution of a
program, memory references tend to cluster temporally (the same locations are accessed repeatedly)
and spatially (nearby locations are accessed together, e.g., within a loop).
3. Cache Memory — Fundamentals
3.1 Definition and Purpose
Cache memory is a small but extremely fast memory that sits between the CPU and main memory. Its
purpose is to reduce the effective memory access time by holding copies of frequently used data,
exploiting the principle of locality of reference.
• Cache may be located on the CPU chip itself (on-chip cache) or on a separate module (off-chip
cache).
• The CPU first checks the cache for requested data. A successful lookup is called a cache hit; a
failure is a cache miss.
• On a miss, the required block is loaded from main memory into the cache, then delivered to the
CPU.
• Cache lines (slots) are tagged with identifiers indicating which block of main memory they
currently hold.

3.2 Cache Operation — Read Procedure


The following steps describe a typical cache read operation:
• Step 1: CPU presents a memory address (RA) to the cache controller.
• Step 2: Cache checks whether the block containing RA is present.
• Step 3 (Hit): The word at RA is fetched from cache and delivered to the CPU.
• Step 3 (Miss): Main memory is accessed. A cache line is allocated, the block is loaded into it,
and the required word is delivered to the CPU.

3.3 Cache Design Parameters


The design of a cache involves decisions across several dimensions:
Design Parameter Description
Addressing Logical (virtual) vs. Physical addressing of cache
Size Total data capacity of the cache
Mapping Function How main memory blocks map to cache lines
Replacement Algorithm Which line is evicted on a miss when cache is full
Write Policy How writes to cache are propagated to main memory
Block Size Number of words per cache line/block
Number of Caches Single (unified) vs. multiple levels (L1, L2, L3)
4. Cache Addressing
Cache placement relative to the Memory Management Unit (MMU) determines whether addresses
stored are virtual or physical.

4.1 Logical (Virtual) Cache


• Positioned between the processor and the MMU.
• Stores data using virtual addresses; the processor accesses cache before address translation
occurs.
• Advantage: Faster access (MMU translation bypassed on a hit).
• Disadvantage: Multiple processes share the same virtual address space, requiring the cache to
be flushed on each context switch.

4.2 Physical Cache


• Positioned between the MMU and main memory.
• Stores data using physical addresses produced after MMU translation.
• No ambiguity between processes; does not require flushing on context switches.
• Slightly slower because MMU translation must occur before cache lookup.
5. Cache Mapping Functions
The mapping function determines where a given main memory block can reside within the cache. Three
approaches are defined, each representing a different trade-off between simplicity, flexibility, and cost.
For the following examples, a reference configuration is used throughout:
• Cache size: 64 KB
• Cache block (line) size: 4 bytes → Cache has 2¹⁴ = 16,384 lines
• Main memory size: 16 MB → 24-bit address (2²⁴ = 16M)

5.1 Direct Mapping


Principle
Each block of main memory maps to exactly one specific cache line. The assignment is fixed and
determined by:
Cache Line = Block Number mod (Number of Cache Lines)

Address Structure (24-bit)


Field Bits Purpose
Tag 8 bits (s – r) Identifies which of the 256 memory blocks occupying this line is
currently stored
Line / Slot 14 bits (r) Selects which of the 16,384 cache lines to examine
Word 2 bits (w) Selects the specific byte within the 4-byte block

Advantages and Disadvantages


• Advantages: Simple, inexpensive, fast — only one line need be checked.
• Disadvantage: If a program repeatedly accesses two blocks that map to the same line
(thrashing), the cache miss rate becomes very high regardless of cache size.

5.2 Associative (Fully Associative) Mapping


Principle
A main memory block may be placed in any cache line. The address is divided into only two fields:
• Tag (22 bits): Uniquely identifies the memory block.
• Word (2 bits): Identifies the byte within the block.
To find a block, every cache line's tag must be compared simultaneously with the requested tag.

Advantages and Disadvantages


• Advantage: Maximum flexibility — no fixed location means fewer conflict misses.
• Disadvantage: Expensive and complex hardware required for simultaneous comparison of all
tags; impractical for large caches.
5.3 Set-Associative Mapping
Principle
Set-associative mapping is a compromise between direct and fully associative mapping. The cache
is divided into v sets, each containing k lines. A given block can be placed in any line within a
specific set.
Set Number = Block Number mod (Number of Sets)
When k = 1, the scheme degenerates to direct mapping. When k equals the total number of cache
lines, the scheme becomes fully associative.

Address Structure (24-bit, 2-way, 13-bit set field)


Field Bits Purpose
Tag 9 bits (s – d) Differentiates blocks mapping to the same set
Set 13 bits (d) Selects which of the 8,192 sets to examine
Word 2 bits (w) Selects byte within block

Performance Notes
• 2-way set associative provides significant improvement over direct mapping up to cache sizes of
at least 64 KB.
• Increasing associativity beyond 4-way yields diminishing returns relative to simply increasing
cache size.
• Above 32 KB, increasing associativity provides no measurable improvement (simulation
results).
• Cache hardware complexity increases proportionally with the degree of associativity.

Parameter Direct Mapping Associative Set


Mapping Associative
Mapping
Address Length (s + w) bits (s + w) bits (s + w) bits
Tag Size (s – r) bits s bits (s – d) bits
Lines in Cache m = 2r Undetermined kv = k × 2d
Flexibility None (fixed) Maximum (any Moderate (any
line) line in set)
Hardware Cost Low High Moderate
Miss Penalty Risk High (thrashing) Low Low–Moderate
6. Replacement Algorithms
When a cache miss occurs and the relevant set (or cache, for fully associative) is full, an existing line
must be evicted. The replacement algorithm determines which line is chosen.

6.1 Direct Mapping — No Choice


In direct mapping, each block has only one possible cache location. The existing occupant of that line is
unconditionally replaced.

6.2 Algorithms for Associative and Set-Associative Caches


These are implemented in hardware for speed:

Least Recently Used (LRU)


Evicts the line that was last accessed furthest in the past. It exploits temporal locality and is widely
regarded as the most effective policy in practice. Hardware maintains a usage counter or timestamp
per line.

First In, First Out (FIFO)


Evicts the line that has been in the cache the longest, regardless of how often it has been used.
Simpler to implement than LRU. Note: FIFO can evict frequently-used lines if they were loaded
early.

Least Frequently Used (LFU)


Evicts the line with the fewest hits since it was loaded. Requires per-line hit counters. May retain
stale lines that had many historical accesses but are no longer needed.

Random Replacement
Selects a random line for eviction. Simple to implement. Surprisingly competitive with more
sophisticated algorithms in some access patterns.

6.3 Victim Cache


A victim cache is a small, fully-associative cache (typically 4–16 lines) placed between a direct-mapped
L1 cache and the next memory level. When a line is evicted from the L1 cache, it is moved to the victim
cache rather than discarded. If the evicted line is needed again shortly after, it can be retrieved from the
victim cache with much lower penalty than accessing main memory. This mechanism reduces the
effective miss penalty of direct-mapped caches.
7. Write Policy
The write policy determines how and when data written to the cache is propagated back to main
memory. This is important for maintaining consistency, particularly in systems with multiple processors
or with I/O devices that access main memory directly.

7.1 Write Through


Every write operation updates both the cache and main memory simultaneously.
• Advantage: Main memory is always consistent and up to date. Multiple CPUs can monitor the
memory bus to maintain cache coherence.
• Disadvantage: Generates substantial memory bus traffic. Every write incurs the full main
memory latency, slowing write performance considerably.

7.2 Write Back (Copy Back)


Writes are made only to the cache. A dirty bit (update bit) is set in the cache line to record that the line
has been modified. Main memory is only updated when the modified line is evicted.
• Advantage: Dramatically reduces write traffic to main memory (~85% of memory operations are
reads; only ~15% are writes).
• Disadvantages: Main memory may be temporarily inconsistent. I/O devices that bypass the
cache may read stale data unless they access main memory through the cache. Multiple CPU
caches may hold differing versions of the same data (cache coherence problem).
Criterion Write Through Write Back
Main Memory Always consistent May be temporarily stale
Consistency
Memory Bus Traffic High Low
Write Performance Slower (every write hits RAM) Faster (write to cache only)
Multi-CPU Coherence Easier (bus snooping) Harder (explicit protocols required)
I/O Access Reads correct data directly Must route I/O through cache
8. Block (Line) Size
The block size is the number of bytes transferred between main memory and the cache in a single
operation. Selecting an appropriate block size involves balancing two competing effects:
• Increasing block size exploits spatial locality — nearby memory locations are likely to be
accessed soon, so pre-fetching them reduces future misses.
• Excessively large blocks reduce the number of lines that fit in the cache, increase the time to
load a block on a miss, and make it less likely that all fetched words will be used before eviction.
In practice, block sizes of 8 to 64 bytes are considered reasonable for most systems. For high-
performance computing (HPC) systems, 64- to 128-byte blocks are most common.

9. Multi-Level Cache Organisations


Modern processors employ multiple levels of cache to bridge the increasing speed gap between the
CPU and main memory.

9.1 Rationale
• Advances in semiconductor integration allow entire cache hierarchies to be placed on-chip.
• On-chip caches operate at processor clock speed and do not require the system bus.
• Off-chip caches, while slower, offer larger capacities using faster SRAM technology relative to
DRAM.

9.2 Levels of Cache


Level Typical Speed Typical Size Notes
Location
L1 On-chip (per- Fastest (~1 8–128 KB Often split: separate I-cache and D-
core) ns) cache
L2 On-chip or close Fast (~5 ns) 256 KB – 4 MB Feeds L1; often unified
to chip
L3 On-chip (shared) Moderate 4 MB – 36+ MB Shared across all cores
(~15 ns)

9.3 Unified vs. Split Caches


A unified cache holds both instructions and data in a single structure. A split cache maintains separate
instruction cache (I-cache) and data cache (D-cache).
Aspect Unified Cache Split Cache
Hit Rate Higher (dynamic load Lower per individual cache
balancing)
Design Complexity Single cache to design Two caches required
Pipeline Contention Possible (fetch and execute Eliminated (separate access paths)
compete)
Suitability General purpose Preferred in pipelined processors
10. Case Study — Intel Pentium 4 Cache Architecture
The Pentium 4 exemplifies the evolution of cache design in high-performance x86 processors.

10.1 Cache Configuration


Cache Level Size Line Size Associativity Notes
L1 Instruction 12K — — Holds decoded micro-operations,
Cache micro-ops not raw x86 instructions
L1 Data Cache 8 KB 64 bytes 4-way set assoc. Write-back policy (configurable to
write-through)
L2 Cache 256 KB 128 bytes 8-way set assoc. Feeds both L1 caches
L3 Cache On-chip 128 bytes 8-way set assoc. Added to handle large database
workloads

10.2 Design Rationale


• Instructions are decoded from complex x86 format into simpler, fixed-length RISC-like micro-
operations before being stored in the L1 instruction cache. This enables efficient superscalar
pipelining and out-of-order execution.
• Separating decode from scheduling eliminates decode from the critical scheduling path.
• The L2 and L3 caches use 8-way set associativity to reduce conflict misses for demanding
workloads.

10.3 Intel Cache Evolution Summary


Processor Innovation Problem Solved
386 External cache using faster External memory slower than system bus
memory
486 8 KB on-chip L1 cache External bus bottleneck on cache access
Pentium Split L1 I-cache and D-cache Contention between instruction fetch and data
execution
Pentium Pro Back-side bus for L2 cache L2 cache limited by front-side bus speed
Pentium II L2 cache moved on-chip Further eliminate L2 bus bottleneck
Pentium III External L3 cache added On-chip caches insufficient for large databases
Pentium 4 L3 cache moved on-chip Maximise memory access speed
11. Case Study — ARM Cache Architecture
ARM processors, designed for embedded and mobile applications, employ cache architectures
optimised for power efficiency alongside performance.

11.1 Key Features of ARM Caches


• ARM processors use split L1 caches — separate instruction and data caches — to avoid
contention in their pipelined architectures.
• Cache sizes are configurable across a range (e.g., 4 KB to 128 KB) in many ARM cores,
enabling optimisation for different application requirements.
• ARM employs logical (virtual) address caches to enable fast access before MMU translation.

11.2 ARM Write Buffer


ARM processors incorporate a small FIFO write buffer between the cache and main memory. Its
purpose is to improve write performance:
• The processor writes data to the write buffer at full clock speed and immediately resumes
execution.
• Simultaneously, the write buffer drains to main memory at memory bus speed.
• If the write buffer becomes full before it can drain, the processor stalls.
• The write buffer must be kept small — data written there is not yet in main memory and cannot
be read back until the write completes.
12. Worked Examples
Example 1 — Direct Mapping Cache Miss/Hit Sequence
Given: An empty 8-block direct-mapped cache. Sequence of memory references (decimal addresses):
22, 26, 22, 26, 16, 18, 16, 3
Cache block assignment: Block = Address mod 8

Address Binary Cache Block (Addr mod 8) Result


(Dec)
22 10110 22 mod 8 = 6 MISS
26 11010 26 mod 8 = 2 MISS
22 10110 22 mod 8 = 6 HIT
26 11010 26 mod 8 = 2 HIT
16 10000 16 mod 8 = 0 MISS
18 10010 18 mod 8 = 2 MISS (26 evicted)
16 10000 16 mod 8 = 0 HIT
3 00011 3 mod 8 = 3 MISS

Result: 4 hits out of 8 references → Hit Ratio = 0.50 (50%)

Example 2 — Number of Bits for Direct-Mapped Cache


Given: 16 KB cache, 4-word (16-byte) blocks, 32-bit address. Find the total number of bits required.
• Block size = 16 bytes → Block offset = 4 bits
• Number of cache lines = 16 KB / 16 B = 1024 = 2¹⁰ → Index = 10 bits
• Tag = 32 – 10 – 4 = 18 bits
• Each line stores: 18 (tag) + 1 (valid) + 16×8 (data) = 147 bits
• Total cache bits = 1024 × 147 = 150,528 bits ≈ 18.4 KB overhead for tags/valid bits

Example 3 — Set-Associative Cache: Number of Sets


Given: 16 KB cache, 128-byte line length. Find the number of sets for 2-way, 4-way, and 8-way set-
associative configurations.
• Total lines = 16 × 1024 / 128 = 128 lines
• 2-way: Sets = 128 / 2 = 64 sets
• 4-way: Sets = 128 / 4 = 32 sets
• 8-way: Sets = 128 / 8 = 16 sets
Example 4 — Two-Way Set-Associative: Address Bits for Set Selection
Given: 32 kB cache, 64-byte lines, two-way set-associative, 32-bit address.
• Total lines = 32 × 1024 / 64 = 512 lines
• Number of sets = 512 / 2 = 256
• Bits to select a set = log₂(256) = 8 bits
• 8-way set-associative (same capacity and line length): Sets = 512 / 8 = 64 → log₂(64) = 6 bits
13. Summary
Cache memory is a critical architectural component that bridges the substantial speed gap between fast
CPUs and relatively slow main memory. Its effectiveness rests on the principle of locality of reference
— programs repeatedly access a small subset of their address space.

Topic Key Takeaway


Memory Hierarchy Organises memory by speed, cost, and capacity into a layered
structure.
Cache Operation On a hit, data is served from cache; on a miss, a block is fetched from
RAM.
Direct Mapping Simple and fast but prone to thrashing when two working blocks share
a cache line.
Associative Mapping Maximally flexible but hardware-expensive; impractical for large
caches.
Set-Associative Mapping The dominant real-world approach; balances flexibility and hardware
cost.
Replacement Algorithms LRU is most effective in practice; FIFO and Random are simpler
alternatives.
Write Policies Write-through ensures consistency; write-back reduces bus traffic.
Block Size 8–128 bytes in practice; larger blocks exploit spatial locality but risk
pollution.
Multi-Level Caches L1/L2/L3 hierarchy amortises latency across multiple levels.
Unified vs. Split Split caches preferred in pipelined designs; unified caches give higher
hit rates.

As processor speeds continue to outpace main memory latency, the design of cache hierarchies —
encompassing mapping strategies, replacement policies, write policies, and block sizes — remains one
of the most consequential decisions in computer architecture.

You might also like