Nehalem Processor Architecture Overview
and
Nehalem-EP SMP Platforms
Michael E. Thomadakis, Ph.D.
Senior Lead Systems Engineer
Supercomputing Facility
miket AT tamu DOT edu
Texas A&M University
September 24, 2010
Abstract
Nehalem is an implementation of the CISC Intel64 instruction specification based on 45nm and high-k + metal
gate transistor technology. Nehalem micro-architectures and system platforms employ a number of state-of-the-art
technologies which enable high computation rates for scientific and other demanding workloads. Nehalem based
processors incorporate multiple cores, on-chip DDR3 memory controller, a shared Level 3 cache and high-speed
Quick-Path Interconnect ports for connectivity with other chips and the I/O sub-system. Each core has super-
scalar, out-of-order and speculative execution pipelines and supports 2-way simultaneous multi-threading. Each
core offers multiple functional units which can sustain high instruction level parallelism rates with the assistance
of program development tools, compilers or special coding techniques. A prominent feature of Intel64 is the
processing of SIMD instructions at a nominal rate of 4 double or 8 single precision floating-point operations per
clock cycle. Nehalem platforms are cc-NUMA shared-memory processor systems.
Complex processors and platforms, such as those based on Nehalem, present several challenges to application
developers as well as system level engineers. Developers are faced with the task of writing efficient code on
increasingly complex platforms. System engineers need to understand the system level bottlenecks in order to
configure and tune the system to yield good performance for the application mix of interest. This report discusses
technical details of the Nehalem µ-architecture and platforms with an emphasis on inner workings and the cost of
instruction execution. The discussion presented here can assist developers and engineers in their respective fields:
the former in producing efficient scalar and parallel code on the Nehalem platform, and the latter in configuring
and tuning a system to perform well under complex application workloads.
1 Introduction
Intel “Nehalem” is the nickname for the “Intel Micro-architecture”, where the latter is a specific implementation
of the “Intel64” Instruction Set Architecture (ISA) specification [1, 2, 3]. For this report, “Nehalem” refers to the
particular implementation where a processor chip contains four cores and the fabrication process is 45nm with high-k +
metal gate transistor technology. We further focus on the Nehalem-EP platform which has two processor sockets per
node and where the interconnection between sockets themselves and between processors and I/O is through Intel’s
Quick-Path Interconnect. Nehalem is the foundation of Intel Core i7 and Xeon processor 5500 series. Even though
Intel64 is a classic Complex-Instruction Set Computer (“CISC”) instruction set type, its Intel Micro-architecture
Intel Nehalem A Research Report
implementation shares many mechanisms in common with modern Reduced-Instruction Set Computer (“RISC”)
implementations.
Each Nehalem chip is a multi-core chip multiprocessor, where each core is capable of sustaining high degrees
of instruction-level parallelism. Nehalem based platforms are cache-coherent non-uniform memory access shared
memory multi-processors. However, to sustain the high instruction completion rates, the developer must have an
accurate appreciation of the cost of using the various system resources.
A processor may speculatively start fetching and executing instructions from a code path before the outcome of a
conditional branch is determined. Branch prediction is commonly used to “predict” the outcome and the target
of a branch instruction. However, when the path is determined not to be the correct one, the processor has to
cancel all intermediate results and start fetching instructions from the right path. Another mechanism relies on data
pre-fetching when it is determined that the code is retrieving data with a certain pattern. There are many other
mechanisms, but describing them is beyond the scope of this report.
Nehalem, like other modern processors, invests heavily in pre-fetching as many instructions as possible from a
predicted path and translating them into micro-ops. A dynamic scheduler then attempts to maximize the number
of concurrent micro-ops which can be in progress (“in-flight”) at a time, thus increasing the instruction completion
rate. Another interesting feature of Intel64 is the direct support for SIMD instructions which increase the effective
ALU throughput for FP or integer operations.
A. Nehalem Chip and DDR3 Memory Module. The processor chip contains four cores, a shared L3 cache and
DRAM controllers, and Quick-Path Interconnect ports.
B. Nehalem micro-photograph.
processor chips. The Northbridge has been one of the often cited bottlenecks in previous Intel architectures. Nehalem
has substantially increased the main memory bandwidth and shortened the latency to access main memory. However,
now that a separate DRAM is associated with every IMC and chip, platforms with more than one chip are
Non-Uniform Memory Access (“NUMA”). NUMA organizations have distinct performance advantages and disadvantages,
and with proper care multi-threaded computation can make efficient use of the available memory bandwidth. In
general, data and thread placement becomes an important part of the application design and tuning process.
• an out-of-order superscalar Execution Engine (EE) that can dynamically schedule and dispatch up to six
micro-ops per cycle to the execution units, as soon as source operands and resources are ready,
• an in-order Retirement Unit (RU) which ensures the results of execution of micro-ops are processed and the
“architected” state is updated according to the original program order, and
• multi-level cache hierarchy and address translation resources.
In the next two sub-sections we describe the front-end and the back-end of the core in detail.
allocated to executing a mispredicted code path. Instead, a new micro-op stream can start forward progress as soon as
the front end decodes the instructions in the architected code path. The BPU includes the following mechanisms:
• Return Stack Buffer (RSB). A 16-entry RSB enables the BPU to accurately predict RET instructions.
Renaming is supported with the return stack buffer to reduce mis-predictions of return instructions in the code.
• Front-End Queuing of BPU lookups. The BPU makes branch predictions for 32 bytes at a time, twice the
width of the IFU. Even though this enables taken branches to be predicted with no penalty, software should
regard taken branches as consuming more resources than do not-taken branches.
Instruction Length Decoder (ILD or “Pre-Decoder”) accepts 16 bytes from the L1 instruction cache or pre-fetch
buffers and prepares the Intel64 instructions found there for instruction decoding downstream. Specifically,
the ILD
• determines the length of the instructions,
• decodes all prefix modifiers associated with instructions, and
• notes properties of the instructions for the decoders, for example, the fact that an instruction is a branch.
The ILD can write at most 6 instructions per cycle into the downstream Instruction Queue (IQ). A
16-byte buffer containing more than 6 instructions will take 2 clock cycles. Intel64 allows modifier prefixes which
dynamically modify the instruction length. These length changing prefixes (LCPs) prolong the ILD process to up to
6 cycles instead of 1.
The Instruction Queue (IQ) buffers the ILD-processed instructions and can deliver up to five instructions in
one cycle to the downstream instruction decoder. The IQ can buffer up to 18 instructions.
The Instruction Decoding Unit (IDU) translates the pre-processed Intel64 macro-instructions into a stream
of micro-operations. It can handle several instructions in parallel for expediency.
The IDU has a total of four decoding units. Three units can decode one simple instruction each, per cycle. The
other decoder unit can decode one instruction every cycle, either a simple instruction or a complex instruction, that
is, one which translates into several micro-ops. Instructions made up of more than four micro-ops are delivered from
the micro-sequencer ROM (MSROM). All decoders support the common cases of single micro-op flows, including
micro-fusion, stack pointer tracking and macro-fusion. Thus, the three simple decoders are not limited to decoding
single micro-op instructions. Up to four micro-ops can be delivered each cycle to the downstream instruction decoder
queue (IDQ).
The IDU also parses the micro-op stream and applies a number of transformations to facilitate a more efficient
handling of groups of micro-ops downstream. It supports the following.
Loop Stream Detection (LSD). For small iterative segments of code whose micro-ops fit within the 28-slot Instruction
Decoder Queue (IDQ), the system only needs to decode the instruction stream once. The LSD detects
these loops (backward branches) which could be streamed directly from the IDQ. When such a loop is detected,
the micro-ops are locked down and the loop is allowed to stream from the IDQ until a mis-prediction ends it.
When the loop plays back from the IDQ, it provides higher bandwidth at reduced power, since much of the
rest of the front end pipeline is shut off. In the previous micro-architecture the loop detector was working with
the instructions within the IQ upstream. The LSD provides a number of benefits, including
• no loss of bandwidth due to taken branches,
• no loss of bandwidth due to misaligned instructions,
• no LCP penalties, as the pre-decode stage is used only once for the instruction stream within the loop, and
• reduced front-end power consumption, because the instruction cache, BPU and predecode unit can go
to idle mode.
However, note that loop unrolling and other code optimizations may make the loop too big to fit into the LSD.
For high performance code, loop unrolling is generally considered superior for performance even when it
overflows the loop cache capability.
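The trade-off between loop unrolling and the LSD can be made concrete with a small Python sketch. It is only a rough model under stated assumptions: the function names and the micro-op counts are hypothetical, unrolling is assumed to multiply the micro-op count of the loop body exactly, and the only condition checked is the 28-slot IDQ capacity quoted above. Real hardware may impose further conditions on loops that can stream from the IDQ.

```python
IDQ_SLOTS = 28  # IDQ capacity in micro-ops, as stated in the text


def fits_in_lsd(uops_per_iteration, unroll_factor=1):
    """Return True if the unrolled loop body fits in the IDQ.

    Simplified model: only the total micro-op count is checked.
    """
    return uops_per_iteration * unroll_factor <= IDQ_SLOTS


def max_unroll(uops_per_iteration):
    """Largest unroll factor that still lets the loop stream from the IDQ."""
    return IDQ_SLOTS // uops_per_iteration


# Hypothetical loop body of 6 micro-ops per iteration.
print(fits_in_lsd(6, 4))  # 24 micro-ops in total
print(fits_in_lsd(6, 8))  # 48 micro-ops in total
print(max_unroll(6))
```

A check of this kind is useful only as a first estimate; whether a given unroll factor pays off still has to be measured on the target code.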
Stack Pointer Tracking (SPT) implements the Stack Pointer Register (RSP) update logic of instructions which
manipulate the program stack (PUSH, POP, CALL, LEAVE and RET) within the IDU. These macro-instructions
were implemented by several micro-ops in previous architectures. The benefits of SPT include:
• using a single micro-op for these instructions improves decoder bandwidth,
• execution resources are conserved since RSP updates do not compete for them,
• parallelism in the execution engine is improved since the implicit serial dependencies have already been
taken care of,
• power efficiency improves since RSP updates are carried out by a small hardware unit.
Micro-Fusion The instruction decoder supports micro-fusion to improve pipeline front-end throughput and increase
the effective size of queues in the scheduler and re-order buffer (ROB). Micro-fusion fuses multiple micro-ops
from the same instruction into a single complex micro-op. The complex micro-op is dispatched in the out-of-order
execution core. This reduces power consumption as the complex micro-op represents more work in a
smaller format (in terms of bit density), and reduces overall “bit-toggling” in the machine for a given amount
of work. It virtually increases the amount of storage in the out-of-order execution engine. Many instructions
provide register and memory flavors. The flavor involving a memory operand decodes into a longer flow
of micro-ops than the register version. Micro-fusion enables software to use memory-to-register operations to
express the actual program behavior without worrying about a loss of decoder bandwidth.
Macro-Fusion The IDU supports macro-fusion which translates adjacent macro-instructions into a single micro-op
if possible. Macro-fusion allows logical compare or test instructions to be combined with adjacent conditional
jump instructions into one micro-operation.
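The pairing rule behind macro-fusion can be sketched as a single pass over a list of instruction mnemonics. This is an illustrative model only: the function name is hypothetical, instructions are reduced to bare mnemonics, and the set of conditional jumps below is a small sample chosen for the example, not a statement of exactly which jumps the hardware can fuse.

```python
COMPARE_OPS = {"CMP", "TEST"}
# Sample of conditional jumps used for illustration only (assumption).
CONDITIONAL_JUMPS = {"JE", "JNE", "JZ", "JNZ", "JA", "JAE", "JB", "JBE"}


def macro_fuse(mnemonics):
    """Fuse each compare/test that is immediately followed by a
    conditional jump into one entry, mimicking one fused micro-op."""
    fused = []
    i = 0
    while i < len(mnemonics):
        current = mnemonics[i]
        following = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        if current in COMPARE_OPS and following in CONDITIONAL_JUMPS:
            fused.append(current + "+" + following)
            i += 2  # both macro-instructions are consumed
        else:
            fused.append(current)
            i += 1
    return fused


stream = ["MOV", "CMP", "JNE", "ADD", "TEST", "JZ", "CMP", "MOV"]
print(macro_fuse(stream))
# ['MOV', 'CMP+JNE', 'ADD', 'TEST+JZ', 'CMP', 'MOV']
```

Note that the last CMP is not fused because the instruction that follows it is not a conditional jump; adjacency is what matters.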
Figure 5: High-level diagram of the out-of-order execution engine in the Nehalem core. All units are fully pipelined
and can operate independently.
Memory Order Buffer (MOB) – Supports speculative and out of order loads and stores and ensures that writes
to memory take place in the right order and with the right data.
Execution Units and Operand Forwarding Network The execution units are fully pipelined and can produce
a result for most micro-ops with a latency of 1 cycle.
The IDQ unit (see Fig. 4) delivers a stream of micro-ops to the allocation/renaming stage of the EE pipeline. The
execution engine of Nehalem supports up to 128 micro-ops in flight. The input data associated with a micro-op are
generally either read from the ROB or from the retired register file. When a “dependency chain” across micro-ops
causes the machine to wait for a “slow” resource (such as a data read from L2 data cache), the EE allows other
micro-ops to proceed. The primary objective of the execution engine is to increase the flow of micro-ops, maximizing
the overall rate of instructions reaching completion per cycle (IPC), without compromising program correctness.
Resource Allocation and Register Renaming for micro-ops The initial stages of the out of order core
advance the micro-ops from the front end to the ROB and RS. This process is called micro-op “issue”. The RRAU
in the out of order core carries out the following steps.
1. It allocates resources to micro-ops, such as,
3. It “renames” source and destination operands of micro-ops in-flight, enabling out of order execution. Operands
are registers or memory in general. Architected (program visible) registers are renamed onto a larger set
of “micro-architectural” (or “non-architectural”) registers. Modern processors contain a large pool of non-
architectural registers, that is, registers which are not accessible from the code. These registers are used
to capture results which are produced by independent computations but which happen to refer to the same
architected register as destination. Register renaming eliminates these false dependencies which are known as
“write-after-write” and “write-after-read” hazards. A “hazard” is any condition which could force a pipeline
to stall to avoid erroneous results.
4. It provides data to the micro-op when the data is either an immediate value (a constant) or a register value
that has already been calculated.
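The effect of register renaming on false dependencies can be illustrated with a toy renamer. This is a minimal sketch under several assumptions: the function name, the tuple encoding of instructions and the physical register names p0, p1, ... are all hypothetical, every destination simply receives a fresh physical register, and the reclamation of physical registers at retirement is not modeled.

```python
def rename(instructions):
    """Rename architected registers onto fresh physical registers.

    `instructions` is a list of (dest, sources) pairs in program order,
    where `sources` is a tuple of architected register names.
    """
    mapping = {}       # architected name -> current physical name
    next_physical = 0
    renamed = []
    for dest, sources in instructions:
        # Sources read the most recent mapping; unmapped sources still
        # refer to the architected (retired) value.
        physical_sources = tuple(mapping.get(s, s) for s in sources)
        physical_dest = "p%d" % next_physical
        next_physical += 1
        mapping[dest] = physical_dest
        renamed.append((physical_dest, physical_sources))
    return renamed


program = [
    ("rax", ("rbx", "rcx")),  # i0 writes rax
    ("rdx", ("rax",)),        # i1 reads rax: true dependency on i0
    ("rax", ("rsi", "rdi")),  # i2 rewrites rax: WAW with i0, WAR with i1
    ("r8", ("rax",)),         # i3 reads rax: true dependency on i2
]
for entry in rename(program):
    print(entry)
```

After renaming, i2 writes p2 instead of reusing the register written by i0 and read by i1, so i2 can execute without waiting for either of them. Only the true read-after-write dependencies (i1 on i0, i3 on i2) remain.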
Unified Reservation Station (URS) queues micro-ops until all source operands are ready, then it schedules
and dispatches ready micro-ops to the available execution units. The URS has 36 entries, that is, at any moment there
is a window of up to 36 micro-ops waiting in the EE to receive input. A single scheduler in the URS dynamically
selects micro-ops for dispatching to the execution units, for all operation types: integer,
FP, SIMD, branch, etc. In each cycle, the URS can dispatch up to six micro-ops which are ready to execute. A
micro-op is ready to execute as soon as its input operands become available. The URS dispatches micro-ops through
the 6 issue ports to the execution unit clusters. Fig. 5 shows the 6 issue ports in the execution engine. Each
cluster may contain a collection of integer, FP and SIMD execution units.
The results produced by an execution unit computing a micro-op are eventually written back to permanent storage.
Each clock cycle, up to 4 results may be either written back to the RS or to the ROB. New results can be forwarded
immediately through a bypass network to a micro-op in-flight that requires them as input. Results in the RS can be
used as early as in the next clock cycle.
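The way a reservation station lets independent micro-ops overtake stalled ones can be illustrated with a toy dataflow scheduler. This is a sketch under loud assumptions: the function name and the (latency, dependencies) encoding are hypothetical, issue ports and execution unit types are ignored, and the only limits modeled are the 36-entry window and the six dispatches per cycle quoted above.

```python
def schedule(uops, width=6, window=36):
    """Return the dispatch cycle of each micro-op.

    `uops` is a list of (latency, deps) in program order, where `deps`
    lists the indices of earlier micro-ops whose results are needed.
    """
    n = len(uops)
    dispatch_cycle = [None] * n
    result_cycle = [None] * n   # cycle at which the result is available
    waiting = []                # indices currently held in the window
    next_to_enter = 0
    cycle = 0
    while any(c is None for c in dispatch_cycle):
        # Fill the window in program order.
        while next_to_enter < n and len(waiting) < window:
            waiting.append(next_to_enter)
            next_to_enter += 1
        # A micro-op is ready once all of its inputs are available.
        ready = [
            i for i in waiting
            if all(result_cycle[d] is not None and result_cycle[d] <= cycle
                   for d in uops[i][1])
        ]
        for i in ready[:width]:
            dispatch_cycle[i] = cycle
            result_cycle[i] = cycle + uops[i][0]
            waiting.remove(i)
        cycle += 1
    return dispatch_cycle


uops = [
    (4, []),      # 0: load with a 4-cycle latency
    (1, [0]),     # 1: depends on the load
    (1, []),      # 2: independent
    (1, [2]),     # 3: depends on 2
    (1, [1, 3]),  # 4: depends on 1 and 3
]
print(schedule(uops))  # [0, 4, 0, 1, 5]
```

Micro-ops 2 and 3 are dispatched in cycles 0 and 1 while micro-op 1 is still waiting for the load, which is the behavior the text describes for dependency chains that wait on a slow resource.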
The EE schedules and executes common micro-operations as follows.
• Micro-ops with single-cycle latency can be executed by multiple execution units, enabling multiple streams of
dependent operations to be executed quickly.
• Frequently-used micro-ops with longer latency have pipelined execution units so that multiple micro-ops of
these types may be executing in different parts of the pipeline simultaneously.
• Operations such as division have data-dependent latencies. Integer division
parses the operands to perform the calculation only on significant portions of the operands, thereby speeding
up common cases of dividing by small numbers.
• Floating point operations have fixed latency for operands that meet certain restrictions. Operands that do not
meet these restrictions are considered exceptional cases and are executed with higher latency and reduced
throughput. The lower-throughput cases do not affect latency and throughput for more common cases.
• Memory operands have variable latency, even in the case of an L1 cache hit. Loads that are not known to be safe
from forwarding may wait until a store-address is resolved before executing. The memory order buffer (MOB)
accepts and processes all memory operations.
Nehalem Issue Ports and Execution Units The URS scheduler can dispatch up to six micro-ops per cycle
through the six issue ports to the execution engine which can execute up to 6 operations per clock cycle, namely
• 3 memory operations (1 integer and FP load, 1 store address and 1 store data) and
• 3 arithmetic/logic operations.
The ultimate goal is to keep the execution units utilized most of the time. Nehalem contains the following
components which are used to buffer micro-ops or intermediate results until the retirement stage:
• 36 reservation stations,
• 48 load buffers to track all allocated load operations,
• 32 store buffers to track all allocated store operations, and
• 10 fill buffers.
The execution core contains three execution clusters, namely, SIMD integer, regular integer and SIMD floating-point/x87
units. Each blue block in Fig. 5 is a cluster of execution units (EU) in the execution engine. All EUs are
fully pipelined which means they can deliver one result on each clock cycle. Latencies through the EU pipelines vary
with complexity of the micro-op from 1 to 5 cycles. Specifically, the EUs associated with each port are the following:
Port 0 supports
• Integer ALU and Shift Units
• Integer SIMD ALU and SIMD shuffle
• Single precision FP MUL, double precision FP MUL, FP MUL (x87), FP/SIMD/SSE2 Move and Logic
and FP Shuffle, DIV/SQRT
Port 1 supports
• Integer ALU, integer LEA and integer MUL
• Integer SIMD MUL, integer SIMD shift, PSAD and string compare, and
• FP ADD
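The nominal SIMD rates quoted in the abstract (4 double or 8 single precision operations per cycle) are consistent with the port assignment above, where a packed FP multiply can be dispatched on Port 0 and a packed FP add on Port 1 in the same cycle, each operating on a 128-bit XMM register. The short calculation below works this out. The function names are hypothetical, and the 2.8 GHz clock used in the last line is an assumed value chosen only for illustration.

```python
XMM_BITS = 128  # width of an XMM register


def flops_per_cycle(element_bits):
    """Nominal FP operations per core per cycle.

    One packed multiply (Port 0) plus one packed add (Port 1),
    each operating on all lanes of a 128-bit register.
    """
    lanes = XMM_BITS // element_bits
    return 2 * lanes


def peak_gflops(clock_ghz, cores, element_bits):
    """Nominal peak rate of a chip in GFLOP/s."""
    return clock_ghz * cores * flops_per_cycle(element_bits)


print(flops_per_cycle(64))  # 4 (double precision)
print(flops_per_cycle(32))  # 8 (single precision)
# Assumed 2.8 GHz clock and the four cores per chip stated in the text.
print(peak_gflops(2.8, 4, 64))
```

This is an upper bound that assumes a perfectly balanced mix of multiplies and adds with all operands available; sustained rates on real code are lower.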
• peak issue rate of one 128-bit (16 bytes) load and one 128-bit store operation per clock cycle
• deep buffers for data load and store operations:
– 48 load buffers,
– 32 store buffers and
– 10 fill buffers;
• fast unaligned memory access and robust handling of memory alignment hazards;
• improved store-forwarding for aligned and non-aligned scenarios, and
Figure 6: SIMD instructions apply the same FP or integer operation to collections of input data pairs simultaneously.
their temporary storage in a cache. This is an efficient way to retrieve a stream of sub-vector operands from memory
to XMM registers, carry out SIMD computation and then stream the results out directly to memory.
Overview of the SSE Instruction Set Intel introduced and extended the support for SIMD operations in stages
over time as new generations of micro-architectures and SSE instructions were released. Below we summarize the
main characteristics of the SSE instructions in the order of their appearance.
MMX(TM) Technology Support for SIMD computations was introduced to the architecture with the “MMX
technology”. MMX allows SIMD computation on packed byte, word, and doubleword integers. The integers are
contained in a set of eight 64-bit MMX registers (shown in Fig. 8).
Streaming SIMD Extensions (SSE) SSE instructions can be used for 3D geometry, 3D rendering, speech
recognition, and video encoding and decoding. SSE introduced 128-bit XMM registers, a 128-bit data type with four
packed single-precision floating-point operands, data prefetch instructions, non-temporal store instructions and other
cacheability and memory ordering instructions, and extra 64-bit SIMD integer support.
Streaming SIMD Extensions 2 (SSE2) SSE2 instructions are useful for 3D graphics, video decoding/encoding,
and encryption. SSE2 adds a 128-bit data type with two packed double-precision floating-point operands, 128-bit data
types for SIMD integer operations on 16 byte, 8 word, 4 doubleword, or 2 quadword integers, support for SIMD arithmetic
on 64-bit integer operands, instructions for converting between new and existing data types, extended support
for data shuffling and extended support for cacheability and memory ordering operations.
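The packed integer data types can be illustrated by emulating a lane-wise add on a 16-byte vector. This is only a software model for building intuition: the function name is hypothetical, Python's struct module stands in for an XMM register, and each lane wraps around on overflow in the manner of a non-saturating packed add.

```python
import struct

VECTOR_BYTES = 16  # one 128-bit XMM register


def packed_add(a, b, lane_code):
    """Lane-wise wrapping add of two 16-byte vectors.

    `lane_code` selects the lane type using struct format codes:
    'B' byte, 'H' word, 'I' doubleword, 'Q' quadword.
    """
    lane_bytes = struct.calcsize("<" + lane_code)
    lanes = VECTOR_BYTES // lane_bytes
    fmt = "<%d%s" % (lanes, lane_code)
    mask = (1 << (8 * lane_bytes)) - 1  # keep each sum inside its lane
    xs = struct.unpack(fmt, a)
    ys = struct.unpack(fmt, b)
    return struct.pack(fmt, *[(x + y) & mask for x, y in zip(xs, ys)])


# Eight 16-bit words per vector; the last lane overflows and wraps to 0.
a = struct.pack("<8H", 1, 2, 3, 4, 5, 6, 7, 65535)
b = struct.pack("<8H", 1, 1, 1, 1, 1, 1, 1, 1)
print(struct.unpack("<8H", packed_add(a, b, "H")))
# (2, 3, 4, 5, 6, 7, 8, 0)
```

The same 16 bytes can be interpreted as 16, 8, 4 or 2 lanes depending on the lane type, which is exactly the set of packed integer layouts listed above.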
Streaming SIMD Extensions 3 (SSE3) SSE3 instructions are useful for scientific, video and multi-threaded
applications. SSE3 adds SIMD floating-point instructions for asymmetric and horizontal computation, a special-purpose
128-bit load instruction to avoid cache line splits, an x87 FPU instruction to convert to integer independent
of the floating-point control word (FCW) and instructions to support thread synchronization.
Supplemental Streaming SIMD Extensions 3 (SSSE3) SSSE3 introduces 32 new instructions to accelerate
eight types of computations on packed integers.
SSE4.1 SSE4.1 introduces 47 new instructions to accelerate video, imaging and 3D applications. SSE4.1 also
improves compiler vectorization and significantly increases support for packed dword computation.
SSE4.2 During 2008 Intel introduced a new set of instructions collectively called SSE4.2. SSE4 has been
defined for Intel’s 45nm products including Nehalem. A set of 7 new SSE4.2 instructions was introduced with the
Nehalem architecture in 2008. The first version of SSE4, SSE4.1, was present in the Penryn processor. SSE4.2 instructions
are further divided into 2 distinct sub-groups, called “STTNI” and “ATA”.
• STring and Text New Instructions (STTNI) operate on strings of bytes or of 16-bit words. There are four new
STTNI instructions which accelerate string and text processing. For example, code can parse XML strings faster and
can carry out faster search and pattern matching. The implementation supports parallel data matching and comparison
operations.
• Application Targeted Accelerators (ATA) are instructions which can provide direct benefit to specific application targets.
There are two ATA instructions, namely “POPCNT” and “CRC32”.
– POPCNT is an ATA for fast pattern recognition while processing large data sets. It improves performance for
DNA/Genome Mining and handwriting/voice recognition algorithms. It can also speed up Hamming distance or
population count computation.
– CRC32 is an ATA which accelerates CRC calculation in hardware. This targets Network Attached Storage (NAS)
using iSCSI. It improves power efficiency and reduces time for software iSCSI, RDMA, and SCTP protocols by
replacing complex instruction sequences with a single instruction.
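To show what kind of software sequences the two ATA instructions replace, the sketch below gives plain Python reference versions of a Hamming distance based on population count and of a bit-by-bit CRC. The function names are hypothetical. The CRC uses the CRC-32C (Castagnoli) polynomial, which is the one used by iSCSI and by the CRC32 instruction; the loop is a textbook software formulation and makes no claim about how the hardware computes the result.

```python
def hamming_distance(a, b):
    """Number of differing bits: a population count of a XOR b."""
    return bin(a ^ b).count("1")


CRC32C_POLY_REFLECTED = 0x82F63B78  # CRC-32C polynomial, bit-reflected


def crc32c(data):
    """Bit-by-bit CRC-32C of a bytes object (software reference)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right; XOR in the polynomial when the dropped bit is 1.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & -(crc & 1))
    return crc ^ 0xFFFFFFFF


print(hamming_distance(0b1011, 0b0110))  # 3
print(hex(crc32c(b"123456789")))         # 0xe3069283
```

The inner loop performs eight shift-and-XOR steps per input byte. A single hardware instruction that consumes several bytes at once removes this loop entirely, which is where the time and power savings mentioned above come from.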
Intel Advanced Vector Extensions AVX comprises several vector SIMD instruction extensions of the Intel64
architecture that will be introduced to processors based on 32nm process technology. AVX will expand current
SIMD technology as follows.
• AVX introduces 256-bit vector processing capability and includes two components which will be introduced
on processors built on 32nm fabrication process and beyond:
– the first generation Intel AVX will provide 256-bit SIMD register support, 256-bit vector floating-point
instructions, enhancements to 128-bit SIMD instructions, and support for three and four operand syntax.
– FMA is a future extension of Intel AVX, which provides fused floating-point multiply-add instructions
supporting 256-bit and 128-bit SIMD vectors.
• General-purpose encryption and AES: 128-bit SIMD extensions targeted to accelerate high-speed block encryption
and cryptographic processing using the Advanced Encryption Standard.
AVX will be introduced with the new Intel 32nm micro-architecture called “Sandy Bridge”.
Compiler Optimizations for SIMD Support in Executables User applications can leverage the SIMD capa-
bilities of Nehalem through the Intel Compilers and various performance libraries which have been tuned up to take
advantage of this feature. On EOS, use the following compiler options and flags.
• -xHost (or -xSSE4.2) This option instructs the compiler to use the entire set of SSE instructions
in the generated binary.
• -vec This option enables “vectorization” (a better term would be “SIMDization”) and transformations enabled
for vectorization. This effectively asks the compiler to attempt to use the SIMD SSE instructions available in
Nehalem. Use the -vec-reportN option to see which lines could use SIMD and which could not and why.
• -O2 or -O3
Libraries Optimized for SIMD Support Intel provides user libraries tuned for SIMD computation. These
include Intel’s Math Kernel Library (MKL), Intel’s standard math library (libimf) and the Integrated Performance
Primitives library (IPP). Please review the “~/README” file in your EOS home directory for information on the
available software and instructions on how to access it. This document contains, among other things, a useful discussion
of compiler flags used for optimization of user code, including SIMD.
waiting for work. Overall, SMT is much more efficient in terms of power than adding another core. On Nehalem,
SMT is supported by the high memory bandwidth and the larger cache sizes.
• Given modern back-end engines, which ISA style is more efficient to capture at a higher-level the semantics of
applications?
• Is it more efficient to use a RISC back-end engine with a CISC or a RISC ISA and front-ends?
• It would be very interesting to see how well the Nehalem back-end execution engine would perform when fitted
to a RISC processor front-end, handling a classical RISC ISA. For instance, how would a classical RISC, such
as a Power5+, perform if the Nehalem execution engine were to replace its own?
• Conversely, how would the Nehalem perform if it were fitted with the back-end execution engine of a classical
RISC, such as that of an IBM Power5+ processor?
• From the core designer’s point of view, can one select different execution engines for the same ISA?
The old CISC vs. RISC debate is resurfacing as a question of how aptly and concisely a RISC or a CISC ISA can
express the semantics of applications, so that when the code is translated into micro-ops powerful back-end execution
engines can produce results at a lower cost, i.e., in a shorter amount of time and/or using less power.
Figure 10: Overview of Cache Memory Hierarchy and Data Flow Paths to and from Nehalem’s Core.
co-locate items which are likely to be accessed together within short time spans. Hardware logic detects sequential
memory access and attempts to pre-fetch subsequent blocks ahead of time. The cache memories eventually have to
evict least used contents to make room for incoming new ones.
an address with its 6 least-significant bits zero). A cache line can be filled from memory with an 8-transfer burst
transaction. The caches do not support partially-filled cache lines, so caching even a single doubleword requires
caching an entire line.
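The alignment rule can be checked with a few lines of address arithmetic. The sketch below assumes the 64-byte cache line implied by the six zero low-order address bits; the function names are hypothetical. It also shows why an access that straddles a line boundary costs two line fills even when only a few bytes are needed.

```python
LINE_BYTES = 64          # 2**6, matching six zero low-order address bits
TRANSFERS_PER_LINE = 8   # burst length stated in the text


def line_base(addr):
    """Address of the cache line containing `addr` (low 6 bits cleared)."""
    return addr & ~(LINE_BYTES - 1)


def line_offset(addr):
    """Byte offset of `addr` within its cache line."""
    return addr & (LINE_BYTES - 1)


def lines_touched(addr, size):
    """Number of cache lines covered by an access of `size` bytes."""
    first = line_base(addr)
    last = line_base(addr + size - 1)
    return (last - first) // LINE_BYTES + 1


print(LINE_BYTES // TRANSFERS_PER_LINE)  # 8 bytes moved per transfer
print(hex(line_base(0x1038)))            # 0x1000
print(line_offset(0x1038))               # 56
print(lines_touched(0x1030, 16))         # 1: bytes 48..63 of one line
print(lines_touched(0x1038, 16))         # 2: the access crosses a line
```

A 16-byte access that starts at offset 56 spans two lines, so both lines must be cached in full.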
L1 Cache At Level 1 (L1), separate instruction and data caches are part of the Nehalem core (called a “Harvard”
style). The instruction and the data cache are each 32 KiB in size. The L1 data-cache has a single access data port,
and a block size of 64 bytes. In SMT mode, the caches are shared by the two hardware threads running in the core.
The instruction and the data caches have 4-way and 8-way set associative organization, respectively. The access
latency to retrieve data already in L1 data-cache is 4 clocks and the “throughput” period is 1 clock. The write policy
is write-back and the cache is non-inclusive.
L2 Cache Each core also contains a private, 256KiB, 8-way set associative, unified level 2 (L2) cache (for both
instructions and data). L2’s block size is 64 bytes and access time for data already in the cache is 10 clocks. The
write policy is write-back and the cache is non-inclusive.
L3 Cache The Level 3 (L3) cache is a unified, 16-way set associative, 8 MiB cache shared by all four cores on
the Nehalem chip. The latency of L3 access may vary as a function of the frequency ratio between the processor and
the uncore sub-system. Access latency is around 35–40+ clock cycles.
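The sizes and associativities quoted above determine how many sets each cache has. The short calculation below derives them. It assumes a 64-byte line at every level (the text states this explicitly for the L1 data cache and the L2), and the set index function uses simple modulo indexing on the line address, which is an assumption made for illustration and not a description of the actual hardware indexing. All names are hypothetical.

```python
LINE_BYTES = 64  # assumed for every cache level

# (capacity in bytes, associativity) as given in the text
CACHES = {
    "L1I": (32 * 1024, 4),
    "L1D": (32 * 1024, 8),
    "L2": (256 * 1024, 8),
    "L3": (8 * 1024 * 1024, 16),
}


def num_sets(size_bytes, ways):
    """Number of sets = capacity / (ways * line size)."""
    return size_bytes // (ways * LINE_BYTES)


def set_index(addr, sets):
    """Set selected by an address under simple modulo indexing."""
    return (addr // LINE_BYTES) % sets


for name, (size, ways) in CACHES.items():
    print(name, num_sets(size, ways))
# L1I 128
# L1D 64
# L2 512
# L3 8192

print(set_index(0x12345, num_sets(*CACHES["L1D"])))  # 13
```

Under this model, addresses that are a multiple of 64 sets times 64 bytes (4 KiB) apart map to the same L1 data cache set, so more than eight such lines in active use would compete for the same eight ways.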
The L3 is inclusive (unlike L1 and L2), meaning that a cache line that exists in either L1 data or instruction, or
the L2 unified caches, also exists in L3. The L3 is designed to use the inclusive nature to minimize “snoop” traffic
between processor cores and processor sockets. A 4-bit valid vector indicates if a particular L3 block is already cached
in the L2 or L1 cache of a particular core in the socket. If the associated bit is not set, it is certain that this core
is not caching this block. A cache block in use by a core in a socket is also cached by that socket’s L3 cache, which can respond
to snoop requests by other chips, without disturbing (snooping into) L2 or L1 caches on the same chip. The write
policy is write-back.
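The role of the 4-bit valid vector as a snoop filter can be sketched with simple bit operations. The function names and the bit-to-core numbering are hypothetical; the sketch only models the property stated above, namely that a clear bit guarantees the corresponding core does not hold the block, so only cores with a set bit ever need to be probed.

```python
NUM_CORES = 4  # cores per Nehalem chip, as stated in the text


def mark_cached(valid_bits, core):
    """Set the bit for `core` when it brings the block into its L1 or L2."""
    return valid_bits | (1 << core)


def cores_to_snoop(valid_bits):
    """Cores whose bit is set; only these may hold the block in L1/L2."""
    return [core for core in range(NUM_CORES) if (valid_bits >> core) & 1]


bits = 0b0000
bits = mark_cached(bits, 1)
bits = mark_cached(bits, 3)
print(format(bits, "04b"))   # 1010
print(cores_to_snoop(bits))  # [1, 3]
```

A request for this block would probe cores 1 and 3 only, and a block whose vector is all zeros can be served from the L3 without disturbing any core.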
• peak issue rate of one 128-bit load and one 128-bit store operation per cycle from L1 cache,
• “deeper” buffers for load and store operations: 48 load buffers, 32 store buffers and 10 fill buffers,
• data prefetching to L1 caches,
• data prefetch logic for prefetching to the L2 cache,
• fast unaligned memory access and robust handling of memory alignment hazards,
• memory disambiguation,
• store forwarding for most address alignments and
Data Load and Stores Nehalem can execute up to one 128-bit load and up to one 128-bit store per cycle, each
to different memory locations. The micro-architecture enables execution of memory operations out-of-order with
respect to other instructions and with respect to other memory operations.
Loads can
• issue before preceding stores when the load address and store address are known not to conflict,
• be carried out speculatively, before preceding branches are resolved
• take cache misses out of order and in an overlapped manner
• issue before preceding stores, speculating that the store is not going to be to a conflicting address.
Loads cannot
• speculatively take any sort of fault or trap
• speculatively access the uncacheable memory type
Faulting or uncacheable loads are detected and wait until retirement, when they update the programmer-visible state. x87
and floating point SIMD loads add 1 clock of additional latency.
Stores to memory are executed in two phases:
Execution Phase Prepares the store buffers with address and data for store forwarding (see below). Consumes dispatch
ports 3 and 4.
Completion Phase The store is retired to programmer-visible memory. This may compete for cache banks with executing
loads. Store retirement is maintained as a background task by the Memory Order Buffer, moving the data from the
store buffers to the L1 cache.
Data Pre-fetching to L1 Caches Nehalem supports hardware logic (DPL1) for two data pre-fetchers in the L1
cache, namely:
Data Cache Unit Prefetcher DCU (also known as the “streaming prefetcher”), is triggered by an ascending access
to recently loaded data. The logic assumes that this access is part of a streaming algorithm and automatically
fetches the next line.
Instruction Pointer-based Strided Prefetcher IPSP keeps track of individual load instructions. When load
instructions have a regular stride, a prefetch is sent to the next address which is the sum of the current address
and the stride. This can prefetch forward or backward and can detect strides of up to half of a 4 KiB page, or
2 KiB.
Data prefetching works on loads only when the load is from the writeback memory type, the request is within the page
boundary of 4 KiB, no fence or lock is in progress in the pipeline, the number of outstanding load misses in progress
is below a threshold, the memory is not very busy, and there is no continuous stream of stores waiting to get
processed.
L1 prefetching usually improves the performance of the memory subsystem, but on rare occasions it may degrade
it. The key to success is to issue the pre-fetch to data that the code will use in the near future when the path from
memory to L1 cache is not congested, thus effectively spreading out the memory operations over time. Under these
circumstances pre-fetching improves performance by anticipating the retrieval of data in large sequential structures
in the program. However, it may cause some performance degradation due to bandwidth issues if access patterns
are sparse instead of having spatial locality.
On certain occasions, if the algorithm’s working set is tuned to occupy most of the cache and unneeded pre-fetches
evict lines required by the program, the hardware prefetcher may cause severe performance degradation because of the
limited capacity of the L1 cache.
In contrast to hardware pre-fetchers, software prefetch instructions rely on the programmer or the compiler to
anticipate data cache miss traffic. Software prefetches act as hints to bring a cache line of data into the desired levels
of the cache hierarchy.
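As a minimal sketch of a software prefetch hint, the loop below sums an array while asking for data a fixed distance ahead. It assumes a GCC or Clang compiler, whose `__builtin_prefetch(addr, rw, locality)` builtin emits a prefetch instruction where the target supports one. The function name `sum_with_prefetch` and the distance of 16 elements are hypothetical choices for illustration, not tuned values for Nehalem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current one
 * the hint is issued. A real value would have to be found by measurement. */
#define PREFETCH_DISTANCE 16

double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* rw = 0: prefetch for reading; locality = 3: keep the line in
         * all cache levels. The hint never changes program results. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

For a simple sequential loop like this one the hardware prefetchers described earlier are likely to do the same job; software hints are more plausible candidates for access patterns the hardware cannot predict.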
Data Pre-fetching to L2 Caches DPL2 pre-fetch logic brings data to the L2 cache based on past request patterns
of the L1 to the L2 data cache. DPL2 maintains two independent arrays to store addresses from the L1 cache, one for
upstreams (12 entries) and one for downstreams (4 entries). Each entry tracks accesses to one 4-KiB page. DPL2
pre-fetches the next data block in a stream. It can also detect more complicated data accesses when intermediate
data blocks are skipped. DPL2 adjusts its pre-fetching effort based on the utilization of the memory to cache paths.
Separate state is maintained for each core.
Memory Disambiguation A load instruction micro-op may depend on a preceding store. Many micro-architectures
block loads until all preceding store addresses are known. The memory disambiguator predicts which loads will not
depend on any previous stores. When the disambiguator predicts that a load does not have such a dependency, the
load takes its data from the L1 data cache. Eventually, the prediction is verified. If an actual conflict is detected,
the load and all succeeding instructions are re-executed.
Store Forwarding When a load follows a store and reloads the data the store just wrote to memory, the
microarchitecture can in many cases forward the data directly from the store to the load. This is called “store-to-load”
forwarding, and it saves several cycles by allowing a data requester to receive data already available on the processor
instead of waiting for a cache to respond. However, several conditions must be met for store-to-load forwarding to
proceed without delays:
• the store must be the last store to that address prior to the load,
• the store must be equal or greater in size than the size of data being loaded and
• the load data must be completely contained in the preceding store.
In previous micro-architectures specific address alignments and data sizes between the store and load operations would
determine whether a store-to-load forwarding might proceed directly or get delayed going through the cache/memory
sub-system. Intel microarchitecture (Nehalem) allows store-to-load forwarding to proceed regardless of store address
alignment.
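The size and containment conditions listed above can be expressed as a small predicate. This is a sketch only: the function name `load_contained_in_store` is hypothetical, and the check covers just the second and third conditions. It does not model whether the store is the last one to that address before the load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a load of load_size bytes at load_addr lies entirely
 * within the bytes written by a store of store_size bytes at store_addr,
 * and the store is at least as large as the load. */
bool load_contained_in_store(uint64_t store_addr, unsigned store_size,
                             uint64_t load_addr, unsigned load_size)
{
    assert(store_size > 0 && load_size > 0);
    return store_size >= load_size &&
           load_addr >= store_addr &&
           load_addr - store_addr <= store_size - load_size;
}
```

For example, a 4-byte load at offset 4 of an 8-byte store satisfies the predicate, while a 4-byte load at offset 6 of the same store does not, because its last two bytes fall outside the stored data.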
Efficient Access to Unaligned Data The cache and memory subsystems handle a significant amount of instructions
and data with different address alignment scenarios. Different address alignments have varying performance
impact on memory and cache operations, depending on the implementation of these subsystems. On Nehalem the data
path to the L1 caches is 16 bytes wide. The L1 data cache can deliver 16 bytes of data in every cycle, regardless
of how their addresses are aligned. However, if a 16-byte load spans a cache line boundary, the data transfer will
suffer a mild delay on the order of 4 to 5 clock cycles. Prior micro-architectures imposed much heavier delays.
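Whether a given access spans a cache line boundary follows directly from its address and size. The helper below is a minimal sketch assuming 64-byte cache lines; the name `crosses_cache_line` is hypothetical and the line size is stated here as an assumption, since this section does not give it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed cache line size in bytes */

/* Returns true when an access of `size` bytes starting at `addr`
 * touches two consecutive cache lines. */
bool crosses_cache_line(uint64_t addr, unsigned size)
{
    assert(size > 0 && size <= CACHE_LINE);
    return (addr % CACHE_LINE) + size > CACHE_LINE;
}
```

Under this assumption a 16-byte load at offset 48 within a line ends exactly at the line boundary and is not split, while the same load at offset 49 or 56 is split and would incur the delay mentioned above.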
Figure 11: Nehalem On-Chip Memory Hierarchy and Data Traffic through the Chip.
Figure 12: The State Transitions of Cache Blocks in the Basic MESI Cache-Coherence Protocol.
block to transition from one of these states to another. The current state of a block and the operation requested
against it determine the sequence of tasks the hardware must follow; these sequences provably maintain memory consistency.
state. The owning core can read and write to this block without having to notify the other cores. If any of the cores
previously sharing this block attempts to read this block, it will receive a cache-miss since the block is Invalid in that
core’s cache. Note that when a core attempts to modify data in an Exclusive state block, no “Request-for-Ownership”
transaction is necessary since it is certain that no other processor is caching copies of this block.
Because Nehalem is a multi-processor platform, the processors have the ability to “snoop” (eavesdrop on) the
address bus for other processors’ accesses to system memory and to their internal caches. They use this snooping
ability to keep their internal caches consistent both with system memory and with the caches in other interconnected
processors.
If through snooping one processor detects that another processor intends to write to a memory location that it
currently has cached in Shared state, the snooping processor will invalidate its cache block forcing it to perform a
cache line fill the next time it accesses the same memory location.
If a core detects that another core is trying to access a memory location that it has modified in its cache, but
has not yet written back to system memory, the owning core signals the requesting core (by means of the “HITM#”
signal) that the cache block is held in Modified state and will perform an implicit write-back of the modified data.
The implicit write-back is transferred directly to the requesting core and snooped by the memory controller to assure
that system memory has been updated. Here, the processor with the valid data can transfer the block directly to the
other core without actually writing it to system memory; however, it is the responsibility of the memory controller
to snoop this operation and update memory.
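The snooping behavior described above can be summarized as a state-transition function for one cache block in the basic MESI protocol of Figure 12. This is a textbook-style sketch, not a description of the Nehalem hardware implementation: the type names, the event names and the `others_have_copy` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_state;

/* Events seen by one cache for one block (hypothetical names). */
typedef enum {
    EV_LOCAL_READ,   /* this core reads the block */
    EV_LOCAL_WRITE,  /* this core writes the block */
    EV_SNOOP_READ,   /* another core is observed reading the block */
    EV_SNOOP_WRITE   /* another core is observed writing the block */
} mesi_event;

/* A local write needs a Request-for-Ownership only when other caches may
 * still hold the block, i.e. from the Invalid or Shared state. */
bool write_needs_rfo(mesi_state s)
{
    return s == MESI_INVALID || s == MESI_SHARED;
}

/* Next state of the block in this cache. others_have_copy matters only
 * for a read miss: it selects between Shared and Exclusive. */
mesi_state mesi_next(mesi_state s, mesi_event ev, bool others_have_copy)
{
    switch (ev) {
    case EV_LOCAL_READ:
        if (s == MESI_INVALID)
            return others_have_copy ? MESI_SHARED : MESI_EXCLUSIVE;
        return s;                  /* read hit: state unchanged */
    case EV_LOCAL_WRITE:
        return MESI_MODIFIED;      /* after an RFO if write_needs_rfo(s) */
    case EV_SNOOP_READ:
        if (s == MESI_MODIFIED || s == MESI_EXCLUSIVE)
            return MESI_SHARED;    /* from Modified: implicit write-back */
        return s;
    case EV_SNOOP_WRITE:
        return MESI_INVALID;       /* next local access takes a line fill */
    }
    assert(!"invalid MESI event");
    return s;
}
```

In this model a block held in Shared state becomes Invalid when another core's write is snooped, and a Modified block drops to Shared when another core reads it, matching the implicit write-back case in the text.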
Each memory block can be stored only in a specific set of cache locations, selected by a subset of the bits of its
memory block address. A cache memory with associativity K can store each memory block in up to K alternative locations.
If all K cache slots are occupied by memory blocks, the (K + 1)-th request will not have room to store this latest memory
block. One of the existing K blocks must then be evicted, and it has to be written out to memory (or the inclusive L3
cache) if it is in Modified state. Cache memories commonly use a Least Recently Used (LRU) cache replacement
strategy, where they evict the block which has been accessed least recently.
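LRU replacement within one set can be sketched with a logical clock per set. This is an illustrative model only: the `cache_set` structure, the `set_access` function and the associativity of 4 are hypothetical, and real caches typically use cheaper approximations of LRU rather than full timestamps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS 4  /* associativity K; 4 is an illustrative value */

typedef struct {
    uint64_t tag[WAYS];      /* block identification stored in each way */
    uint64_t last_use[WAYS]; /* logical time of last access; 0 = empty way */
    uint64_t clock;          /* incremented on every access to this set */
} cache_set;

/* Access the block with the given tag. Returns true on a hit. On a miss
 * the block is placed in an empty way if one exists, otherwise in the
 * way that was used least recently. */
bool set_access(cache_set *s, uint64_t tag)
{
    assert(s != NULL);
    s->clock++;
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (s->last_use[w] != 0 && s->tag[w] == tag) {
            s->last_use[w] = s->clock;  /* hit: mark as most recently used */
            return true;
        }
        if (s->last_use[w] < s->last_use[victim])
            victim = w;                 /* oldest (or empty) way so far */
    }
    s->tag[victim] = tag;               /* miss: replace the LRU way */
    s->last_use[victim] = s->clock;
    return false;
}
```

Starting from a zero-initialized set, accessing tags 1, 2, 3, 4, then 1 again and then 5 evicts tag 2, since tag 1 was refreshed by its second access.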
As we mentioned, before being written out to memory, data operands are first saved in a store buffer. They are then
written from the store buffer to memory when the system path to memory is available.
Note that when all 10 of the line-fill buffers in a core become occupied, outstanding data access operations queue
up in the load and store buffers and cannot proceed. When this happens the core’s front end suspends issuing
micro-ops to the RS and OOO engine to maintain pipeline consistency.
Write Queue (WQ): is a 16-entry queue for store (write) memory access operations from the local cores.
Load Queue (LQ): is a 32-entry queue for load (read) memory requests by the local cores.
QPI Queue (QQ): is a 12-entry queue for off-chip requests delivered by the QPI links.
When the GQ receives a cache line request from one of the cores, it first checks the on-chip Last Level Cache (L3) to
see if the line is already cached there. As the L3 is inclusive, the answer can be quickly determined. If the line is in
the L3 and was owned by the requesting core it can be returned to the core from the L3 cache directly. If the line is
being used by multiple cores, the GQ snoops the other cores to see if there is a modified copy. If so the L3 cache is
updated and the line is sent to the requesting core. In the event of an L3 cache miss, the GQ sends out requests for
the line. Since the cache line could be cached in the other Nehalem chip, a request through the QPI to the remote
L3 cache is made. As each Nehalem processor chip has its own local integrated memory controller, the GQ must
identify the “home” location of the requested cache line from the physical address. If the address identifies home as
being on the local chip, then the GQ makes a simultaneous request to the local IMC. If home belongs to the remote
chip, the request sent by the QPI will also be used to access the remote IMC.
This process can be viewed in the terms of the QPI protocol as follows. Each socket has a “Caching Agent” (CA)
which might be thought of as the GQ plus the L3 cache and a “Home agent” (HA) which is the IMC. An L3 cache
miss results in simultaneous queries for the line from all the CAs and the HA (wherever home is). In a Nehalem-EP
system there are 3 caching agents, namely the 2 sockets and an I/O hub. If none of the CAs has the cache line, the
home agent ultimately delivers it to the caching agent that requested it. Clearly, the IMC has queues for handling
Access to Local Memory DRAM Fig. 14 demonstrates access to a memory block whose home location is in
the directly (locally) attached DRAM. The sequence of steps is the following:
1. Proc0 requests a cache line which is not in its L1, L2 or shared L3 cache
- Proc0 requests data from its DRAM,
- Proc0 snoops Proc1 to check if data is present there.
2. Response
Access to Remote Memory DRAM Fig. 15 illustrates access to a memory block whose home location is the
remote DRAM memory (directly attached to the other Nehalem chip). The steps to access the remote memory block
are the following:
Each page can actually be stored anywhere in the physical memory, in any of the available main memory slots
known as page frames. We can consider the main memory as an array of page frames. As an example, a physical
memory with 4 GiB capacity has exactly 1 Mi page frames available for pages of 4 KiB size. Applications can
define vast amounts of memory, but they usually access only a very small subset of it. When a program references
for the first time a memory location (for instance a new subroutine call or a reference to a data item in an array)
the system selects a free page frame and “pages in” the corresponding page. Application pages already in page slots
which have not been recently used become candidates for eviction to make room for new pages.
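The page-frame arithmetic in the example above can be checked with a few lines of code. The helper names `page_frame_count`, `page_number` and `page_offset` are hypothetical; the sketch assumes 4 KiB pages, so the low 12 bits of an address are the offset within a page and the remaining bits select the page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12                           /* 4 KiB = 2^12 bytes */
#define PAGE_BYTES ((uint64_t)1 << PAGE_SHIFT)

/* Number of page frames in a physical memory of mem_bytes bytes. */
uint64_t page_frame_count(uint64_t mem_bytes)
{
    assert(mem_bytes % PAGE_BYTES == 0);
    return mem_bytes >> PAGE_SHIFT;
}

/* Page (or page frame) number that contains the given address. */
uint64_t page_number(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

/* Byte offset of the address within its page. */
uint64_t page_offset(uint64_t addr)
{
    return addr & (PAGE_BYTES - 1);
}
```

For 4 GiB of memory, 2^32 / 2^12 = 2^20, which is the 1 Mi page frames quoted in the text.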
The mechanism which dynamically manages the page frames and keeps track of the mapping between pages and
page frames is called the Virtual Memory (VM) management system.
While they execute, applications refer to memory using “Effective Addresses” (EA), which are virtual memory addresses.
“Physical Addresses” (PA) are the actual addresses the memory hardware uses to identify specific memory locations.
The VM system dynamically translates EAs into PAs, in a process called “Virtual Address Translation”. VM systems
keep track of the various program segments and corresponding pages within memory in data structures called VM
segment and page tables. These structures end up taking plenty of memory space. Multiple levels of page tables are
used to cut down on the actual space used. This multi-level indirection requires the traversal of multiple tables for
each address an application uses and it is in the critical path of the tasks the processor has to carry out in order to
retire each macro-instruction. For this reason, special hardware called “Translation Look-aside Buffers” (TLBs) is
used to speed up this process.
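To illustrate the multi-level indirection, the sketch below splits a 48-bit Intel64 virtual address into the four 9-bit table indices and the 12-bit page offset used with 4 KiB pages. The structure and function names are hypothetical, and the code only extracts the index fields; it does not walk any real page tables.

```c
#include <assert.h>
#include <stdint.h>

/* Index fields of a virtual address under 4-level paging with 4 KiB pages. */
typedef struct {
    unsigned pml4;    /* bits 47:39, index into the top-level table */
    unsigned pdpt;    /* bits 38:30, index into the second-level table */
    unsigned pd;      /* bits 29:21, index into the third-level table */
    unsigned pt;      /* bits 20:12, index into the last-level page table */
    unsigned offset;  /* bits 11:0, byte offset within the 4 KiB page */
} va_fields;

va_fields split_virtual_address(uint64_t va)
{
    /* Canonical addresses have bits 63:47 all zero or all one. */
    uint64_t upper = va >> 47;
    assert(upper == 0 || upper == 0x1FFFF);

    va_fields f;
    f.pml4   = (unsigned)((va >> 39) & 0x1FF);
    f.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    f.pd     = (unsigned)((va >> 21) & 0x1FF);
    f.pt     = (unsigned)((va >> 12) & 0x1FF);
    f.offset = (unsigned)(va & 0xFFF);
    return f;
}
```

Each of the four indices selects one entry in one table, so an address translation that misses in the TLBs requires up to four dependent memory reads before the physical address is known.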
UTLB1 for 4-KiB pages: 512 entries for both data and instruction look-ups.
A DTLB0 miss and UTLB1 hit causes a penalty of 7 cycles. Software only pays this penalty if the DTLB0 is
used in some dispatch cases. The delays associated with a miss to the UTLB1 and Page-Miss Handler are largely
non-blocking.