0% found this document useful (0 votes)
43 views27 pages

Introduction to GPU Architectures

Uploaded by

pulumatisahasra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views27 pages

Introduction to GPU Architectures

Uploaded by

pulumatisahasra
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Multi-Core Computer Architecture

Lecture 3G
Introduction to GPU Architectures

John Jose
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati
What is a GPU?
❖ Graphics Processing Unit is a specialized electronic circuit designed to
rapidly manipulate and alter memory to accelerate the creation of images
in a frame buffer intended for output to display.
❖ Typically placed on a video card, which contains its own memory and
display interfaces(HDMI, DVI, VGA, etc)
❖ Video card connected to motherboard through
PCI-Express or AGP(Accelerated Graphics Port)
❖ Northbridge chip enables data transfer
between the CPU and GPU

Figure credit: ETechnoG


Why GPU ?
Motivation Example-1:
❖ Consider a 1920 x 1080 HD display, Refresh rate: 60 frames/second
❖ Roughly 125 million pixels have to be processed per second
❖ A 3 GHz processor with an IPC of 2 (medium to high ILP) can process 6
billion instructions per second.
❖ 48 instructions per pixel.
❖ Not enough for most high-intensity games
❖ Graphics effects, HD movies
Why GPU ?
Motivation Example-2:

Raster graphics
❖ The image was represented as an array of pixels.
❖ Very simple rules were used to create it.
❖ Earlier systems were simply showing images on the screen.
Why GPU ?
Motivation Example-2:

Vector graphics
❖ The programmer creates high level objects: shades, textures, characters,
❖ Images are created from basic rules specified by programmer
❖ The system generates the images: vector graphics
❖ If we zoom the image, clarity is not lost
Why GPU ?
Processing Requirements
❖ The processing requirements for games, high-intensity graphics is huge
❖ Aggressive OOO processors (even multicore ones) are insufficient.
❖ We need shader programs.
❖ Custom language to work on objects, vertices, and pixels
❖ Apply transformations to images: rotation, skewing, etc.
❖ Apply effects: textures, shading, and illumination
Shader Programs
❖ We need shader programs.
❖ Custom language to work on objects, vertices, and pixels
❖ Apply transformations to images: rotation, skewing, etc.
❖ Apply effects: textures, shading, and illumination

Graphics Pipeline
Why GPU ?
❖ Throughput more important than latency
❖ High throughput needed for the huge amount of computations required
for graphics
❖ Extremely parallel- different pixels and elements of the image can be
operated on independently
❖ Hundreds of cores executing at the same time to take advantage of this
fundamental parallelism

Figure credit: ETechnoG


CPU – GPU Interaction
❖ GPU houses a graphics pipeline

❖ GPU programs are written in two parts


❖ One part runs on a normal CPU
❖ Other part runs on the GPU
(compiled in a GPU-specific ISA)
CPU vs GPU

Figure credit: ETechnoG


Flynn’s Classification
Exploiting Parallelism
Scalar Sequential Code for (i=0; i < N; i++)
C[i] = A[i] + B[i];
load
Iter. 1 Vectorized Code
load
load load
add
load load

Time
store
add add
load

Iter. 2 load store store

Iter. 1 Vector Instruction


add Iter. 2
❖ Vectorization : Compile-time reordering of operation sequencing
store
❖ Requires extensive loop dependence analysis
Slide credit: Onur Mutlu, ETH Zurich
Vector Machine - Summary
❖ Vector/SIMD machines are good at exploiting regular data-level parallelism
❖ Same operation performed on many data elements
❖ Improve performance, (no intra-vector dependencies)
❖ Performance improvement limited by vectorizability of code
❖ Scalar operations limit vector machine performance
❖ Ref: Amdahl’s Law
❖ Many existing ISAs include (vector-like) SIMD operations
❖ Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD
14

GPUs are SIMD Engines Underneath


❖ The instruction pipeline operates like a SIMD pipeline
❖ Programming is done using threads, NOT SIMD instructions
❖ Programming Model (Software) vs. Execution Model (Hardware)
15

Programming vs Execution Model


❖ Programming Model refers to how the programmer expresses the code
❖ Ex: Sequential (von Neumann), Data Flow (SPMD)
❖ Execution Model refers to how the hardware executes the code underneath
❖ Ex: Out-of-order execution, Vector processor, Multithreaded processor
❖ Execution Model can be very different from the Programming Model
❖ Ex: von Neumann model implemented by an OoO processor
❖ Ex: SPMD model implemented by a SIMD processor (a GPU)
Programming vs Execution Model
Scalar Sequential Code Model 1: Sequential (SISD) for (i=0; i < N; i++)
load C[i] = A[i] + B[i];
Iter. 1 ❖ Pipelined processor
load
❖ Out-of-order execution processor
add
❖ Independent instructions executed when ready
store ❖ Different iterations are present in the instruction window
load and can execute in parallel in multiple functional units
Iter. 2 load
❖ Loop is dynamically unrolled by the hardware
add ❖ Superscalar or VLIW processor

store ❖ Can fetch and execute multiple instructions per cycle

Slide credit: Onur Mutlu, ETH Zurich


Programming vs Execution Model
Scalar Sequential Code
Model 2: Data Parallel (SIMD)
for (i=0; i < N; i++)
load C[i] = A[i] + B[i];
Iter. 1 load
load load VLD A 🡪 V1 Vectorized Code

add load load


VLD B 🡪 V2
Time

store
add add
VADD V1 + V2 🡪 V3
load
store store
Iter. 2 load VST Instruction
Vector V3 🡪 C
Iter. 1 Iter. 2
❖ Each iteration
add is independent
❖ Compiler
store
generates a SIMD instruction to execute the same instruction from all
iterations across different data
Slide credit: Onur Mutlu, ETH Zurich
Programming vs Execution Model
Model 3: Multithreadded
Scalar Sequential Code for (i=0; i < N; i++)
C[i] = A[i] + B[i];
load load
Iter. 1 load Iter. 2 load

add add

store store
This can be realized on
❖ SPMD: Single Program Multiple Data
❖ Each iteration is independent ❖ Single Instruction Multiple Thread
❖ Programmer or compiler generates a thread to execute each iteration.
❖ Each thread does the same thing (but on different data)
Slide credit: Onur Mutlu, ETH Zurich
GPU is a SIMT Machine
❖ Single instruction, multiple threads (SIMT) is an execution model used in
parallel computing where single instruction, multiple data (SIMD) is combined
with multithreading.
❖ Each thread executes the same code but operates a different piece of data
❖ Each thread has its own context (can be /restarted/executed independently)
❖ A set of threads executing the same instruction are dynamically grouped into a
warp (wavefront) by the hardware

Slide credit: Onur Mutlu, ETH Zurich


SIMT Illustration
for (i=0; i < N; i++)
load load Warp 0 at PC X
C[i] = A[i] + B[i];

load load Warp 0 at PC X+1

add add Warp 0 at PC X+2

store store Warp 0 at PC X+3

Iteration 1 Iteration 2
❖ Warp: A set of threads that execute the same instruction at the same PC.
❖ Programmer or compiler generates a thread to execute each iteration.
❖ Each thread does the same thing (but on different data)
Multithreaded Warps
❖ Warp has 32 threads, 32K iterations, and 1 iteration/thread 🡪 1K warps
❖ Warps can be interleaved on the same pipeline 🡪 Fine grained multithreading

load load 0 at PC X
Warp 1

load load

add add Warp 20 at PC X+2

store store

Iter.
Iter. Iter.
Iter. ❖ A GPU executes in SIMT model
120*32 + 1
33 220*32
34 +2 ❖ Single Instruction Mutiple Thread
Warp-Level FGMT
❖ Warp: A set of threads that execute the same instruction on different data
elements 🡪 SIMT
❖ All threads run the same code

Thread Warp 0
Thread Warp 1
Thread Common PC
Warp
Scalar Scalar Scalar Scalar Thread Warp 7
Thread Thread Thread Thread
P Q R Z
SIMD Pipeline
Warp-Level FGMT
32-thread warp
A[tid] + B[tid] 🡪 C[tid]
one pipelined functional unit four pipelined functional units
A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27]
A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23]
A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19]
A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15]

C[2] C[8] C[9] C[10] C[11]

C[1] C[4] C[5] C[6] C[7]

C[0] C[0] C[1] C[2] C[3]


SIMD Execution Unit
Warp Instruction Level Parallelism
❖ Overlap execution of multiple instructions [32 threads per warp, 8 lanes]
❖ Completes 24 operations/cycle issuing 1 warp/cycle
Load Unit
W0 Multiply Unit
W1 Add Unit
W2
time
W3
W4
W5

Warp issue
SIMT Memory Access
❖ Same instruction in different threads uses thread id to index and access
different data elements
Threads
N=16, 4 threads per warp 🡪 4 warps
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Data elements
+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

+ + + +

Warp 0 Warp 1 Warp 2 Warp 3


johnjose@[Link]
[Link]

You might also like