Multi-Core Computer Architecture
Lecture 3G
Introduction to GPU Architectures
John Jose
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati
What is a GPU?
❖ Graphics Processing Unit is a specialized electronic circuit designed to
rapidly manipulate and alter memory to accelerate the creation of images
in a frame buffer intended for output to display.
❖ Typically placed on a video card, which contains its own memory and
display interfaces(HDMI, DVI, VGA, etc)
❖ Video card connected to motherboard through
PCI-Express or AGP(Accelerated Graphics Port)
❖ Northbridge chip enables data transfer
between the CPU and GPU
Figure credit: ETechnoG
Why GPU ?
Motivation Example-1:
❖ Consider a 1920 x 1080 HD display, Refresh rate: 60 frames/second
❖ Roughly 125 million pixels have to be processed per second
❖ A 3 GHz processor with an IPC of 2 (medium to high ILP) can process 6
billion instructions per second.
❖ 48 instructions per pixel.
❖ Not enough for most high-intensity games
❖ Graphics effects, HD movies
Why GPU ?
Motivation Example-2:
Raster graphics
❖ The image was represented as an array of pixels.
❖ Very simple rules were used to create it.
❖ Earlier systems were simply showing images on the screen.
Why GPU ?
Motivation Example-2:
Vector graphics
❖ The programmer creates high level objects: shades, textures, characters,
❖ Images are created from basic rules specified by programmer
❖ The system generates the images: vector graphics
❖ If we zoom the image, clarity is not lost
Why GPU ?
Processing Requirements
❖ The processing requirements for games, high-intensity graphics is huge
❖ Aggressive OOO processors (even multicore ones) are insufficient.
❖ We need shader programs.
❖ Custom language to work on objects, vertices, and pixels
❖ Apply transformations to images: rotation, skewing, etc.
❖ Apply effects: textures, shading, and illumination
Shader Programs
❖ We need shader programs.
❖ Custom language to work on objects, vertices, and pixels
❖ Apply transformations to images: rotation, skewing, etc.
❖ Apply effects: textures, shading, and illumination
Graphics Pipeline
Why GPU ?
❖ Throughput more important than latency
❖ High throughput needed for the huge amount of computations required
for graphics
❖ Extremely parallel- different pixels and elements of the image can be
operated on independently
❖ Hundreds of cores executing at the same time to take advantage of this
fundamental parallelism
Figure credit: ETechnoG
CPU – GPU Interaction
❖ GPU houses a graphics pipeline
❖ GPU programs are written in two parts
❖ One part runs on a normal CPU
❖ Other part runs on the GPU
(compiled in a GPU-specific ISA)
CPU vs GPU
Figure credit: ETechnoG
Flynn’s Classification
Exploiting Parallelism
Scalar Sequential Code for (i=0; i < N; i++)
C[i] = A[i] + B[i];
load
Iter. 1 Vectorized Code
load
load load
add
load load
Time
store
add add
load
Iter. 2 load store store
Iter. 1 Vector Instruction
add Iter. 2
❖ Vectorization : Compile-time reordering of operation sequencing
store
❖ Requires extensive loop dependence analysis
Slide credit: Onur Mutlu, ETH Zurich
Vector Machine - Summary
❖ Vector/SIMD machines are good at exploiting regular data-level parallelism
❖ Same operation performed on many data elements
❖ Improve performance, (no intra-vector dependencies)
❖ Performance improvement limited by vectorizability of code
❖ Scalar operations limit vector machine performance
❖ Ref: Amdahl’s Law
❖ Many existing ISAs include (vector-like) SIMD operations
❖ Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD
14
GPUs are SIMD Engines Underneath
❖ The instruction pipeline operates like a SIMD pipeline
❖ Programming is done using threads, NOT SIMD instructions
❖ Programming Model (Software) vs. Execution Model (Hardware)
15
Programming vs Execution Model
❖ Programming Model refers to how the programmer expresses the code
❖ Ex: Sequential (von Neumann), Data Flow (SPMD)
❖ Execution Model refers to how the hardware executes the code underneath
❖ Ex: Out-of-order execution, Vector processor, Multithreaded processor
❖ Execution Model can be very different from the Programming Model
❖ Ex: von Neumann model implemented by an OoO processor
❖ Ex: SPMD model implemented by a SIMD processor (a GPU)
Programming vs Execution Model
Scalar Sequential Code Model 1: Sequential (SISD) for (i=0; i < N; i++)
load C[i] = A[i] + B[i];
Iter. 1 ❖ Pipelined processor
load
❖ Out-of-order execution processor
add
❖ Independent instructions executed when ready
store ❖ Different iterations are present in the instruction window
load and can execute in parallel in multiple functional units
Iter. 2 load
❖ Loop is dynamically unrolled by the hardware
add ❖ Superscalar or VLIW processor
store ❖ Can fetch and execute multiple instructions per cycle
Slide credit: Onur Mutlu, ETH Zurich
Programming vs Execution Model
Scalar Sequential Code
Model 2: Data Parallel (SIMD)
for (i=0; i < N; i++)
load C[i] = A[i] + B[i];
Iter. 1 load
load load VLD A 🡪 V1 Vectorized Code
add load load
VLD B 🡪 V2
Time
store
add add
VADD V1 + V2 🡪 V3
load
store store
Iter. 2 load VST Instruction
Vector V3 🡪 C
Iter. 1 Iter. 2
❖ Each iteration
add is independent
❖ Compiler
store
generates a SIMD instruction to execute the same instruction from all
iterations across different data
Slide credit: Onur Mutlu, ETH Zurich
Programming vs Execution Model
Model 3: Multithreadded
Scalar Sequential Code for (i=0; i < N; i++)
C[i] = A[i] + B[i];
load load
Iter. 1 load Iter. 2 load
add add
store store
This can be realized on
❖ SPMD: Single Program Multiple Data
❖ Each iteration is independent ❖ Single Instruction Multiple Thread
❖ Programmer or compiler generates a thread to execute each iteration.
❖ Each thread does the same thing (but on different data)
Slide credit: Onur Mutlu, ETH Zurich
GPU is a SIMT Machine
❖ Single instruction, multiple threads (SIMT) is an execution model used in
parallel computing where single instruction, multiple data (SIMD) is combined
with multithreading.
❖ Each thread executes the same code but operates a different piece of data
❖ Each thread has its own context (can be /restarted/executed independently)
❖ A set of threads executing the same instruction are dynamically grouped into a
warp (wavefront) by the hardware
Slide credit: Onur Mutlu, ETH Zurich
SIMT Illustration
for (i=0; i < N; i++)
load load Warp 0 at PC X
C[i] = A[i] + B[i];
load load Warp 0 at PC X+1
add add Warp 0 at PC X+2
store store Warp 0 at PC X+3
Iteration 1 Iteration 2
❖ Warp: A set of threads that execute the same instruction at the same PC.
❖ Programmer or compiler generates a thread to execute each iteration.
❖ Each thread does the same thing (but on different data)
Multithreaded Warps
❖ Warp has 32 threads, 32K iterations, and 1 iteration/thread 🡪 1K warps
❖ Warps can be interleaved on the same pipeline 🡪 Fine grained multithreading
load load 0 at PC X
Warp 1
load load
add add Warp 20 at PC X+2
store store
Iter.
Iter. Iter.
Iter. ❖ A GPU executes in SIMT model
120*32 + 1
33 220*32
34 +2 ❖ Single Instruction Mutiple Thread
Warp-Level FGMT
❖ Warp: A set of threads that execute the same instruction on different data
elements 🡪 SIMT
❖ All threads run the same code
Thread Warp 0
Thread Warp 1
Thread Common PC
Warp
Scalar Scalar Scalar Scalar Thread Warp 7
Thread Thread Thread Thread
P Q R Z
SIMD Pipeline
Warp-Level FGMT
32-thread warp
A[tid] + B[tid] 🡪 C[tid]
one pipelined functional unit four pipelined functional units
A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27]
A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23]
A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19]
A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15]
C[2] C[8] C[9] C[10] C[11]
C[1] C[4] C[5] C[6] C[7]
C[0] C[0] C[1] C[2] C[3]
SIMD Execution Unit
Warp Instruction Level Parallelism
❖ Overlap execution of multiple instructions [32 threads per warp, 8 lanes]
❖ Completes 24 operations/cycle issuing 1 warp/cycle
Load Unit
W0 Multiply Unit
W1 Add Unit
W2
time
W3
W4
W5
Warp issue
SIMT Memory Access
❖ Same instruction in different threads uses thread id to index and access
different data elements
Threads
N=16, 4 threads per warp 🡪 4 warps
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data elements
+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
+ + + +
Warp 0 Warp 1 Warp 2 Warp 3
johnjose@[Link]
[Link]