Multi-Core Computer Architecture

Lecture 3G
Introduction to GPU Architectures

John Jose
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Guwahati
What is a GPU?
❖ Graphics Processing Unit is a specialized electronic circuit designed to
rapidly manipulate and alter memory to accelerate the creation of images
in a frame buffer intended for output to display.
❖ Typically placed on a video card, which contains its own memory and
display interfaces (HDMI, DVI, VGA, etc.)
❖ The video card connects to the motherboard through
PCI-Express or, on older systems, AGP (Accelerated Graphics Port)
❖ In a traditional chipset, the Northbridge chip enables data transfer
between the CPU and GPU

Figure credit: ETechnoG

Why GPU ?
Motivation Example-1:
❖ Consider a 1920 x 1080 HD display with a refresh rate of 60 frames/second
❖ Roughly 125 million pixels (1920 x 1080 x 60) have to be processed per second
❖ A 3 GHz processor with an IPC of 2 (medium to high ILP) can process 6
billion instructions per second.
❖ That leaves a budget of about 48 instructions per pixel.
❖ This is not enough for high-intensity games, graphics effects, or HD movies
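The arithmetic behind this example can be checked with a short C sketch. The function names are hypothetical and chosen only for illustration; the inputs are the slide's own figures (1920 x 1080 at 60 frames/second, 3 GHz, IPC of 2).

```c
#include <assert.h>
#include <stdint.h>

/* Pixels that must be produced every second for a given display. */
uint64_t pixels_per_second(uint64_t width, uint64_t height, uint64_t fps) {
    return width * height * fps;
}

/* Instructions the CPU can spend on each pixel, given its clock rate
   and its sustained instructions per cycle (IPC). */
double instructions_per_pixel(double clock_hz, double ipc, uint64_t pixels) {
    assert(pixels > 0);
    return (clock_hz * ipc) / (double)pixels;
}
```

Calling `pixels_per_second(1920, 1080, 60)` gives 124,416,000, and `instructions_per_pixel(3e9, 2.0, 124416000)` gives a little over 48, matching the slide's rounded numbers.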
Why GPU ?
Motivation Example-2:

Raster graphics
❖ The image is represented as an array of pixels.
❖ Very simple rules are used to create it.
❖ Earlier systems simply displayed such images on the screen.
Why GPU ?
Motivation Example-2:

Vector graphics
❖ The programmer creates high-level objects: shades, textures, characters, etc.
❖ Images are created from basic rules specified by the programmer
❖ The system generates the images from these rules: vector graphics
❖ If we zoom into the image, clarity is not lost
Why GPU ?
Processing Requirements
❖ The processing requirements for games and high-intensity graphics are huge
❖ Aggressive out-of-order (OoO) processors, even multicore ones, are insufficient.
Shader Programs
❖ We need shader programs.
❖ Custom language to work on objects, vertices, and pixels
❖ Apply transformations to images: rotation, skewing, etc.
❖ Apply effects: textures, shading, and illumination
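As a rough illustration of the kind of per-vertex work a shader program performs, here is a minimal C sketch that applies one 2x2 transformation to every vertex. It is not a real shader language; `Vec2`, `Mat2`, `transform_vertex`, and `transform_all` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y; } Vec2;               /* hypothetical vertex type */
typedef struct { float m00, m01, m10, m11; } Mat2; /* hypothetical 2x2 transform */

/* Apply the transform to a single vertex: out = m * v. */
Vec2 transform_vertex(Mat2 m, Vec2 v) {
    Vec2 out = { m.m00 * v.x + m.m01 * v.y,
                 m.m10 * v.x + m.m11 * v.y };
    return out;
}

/* Apply the same transform to every vertex. Each iteration touches only
   its own vertex, so the iterations are independent of one another. */
void transform_all(Mat2 m, Vec2 *verts, int n) {
    assert(n >= 0 && (n == 0 || verts != NULL));
    for (int i = 0; i < n; i++)
        verts[i] = transform_vertex(m, verts[i]);
}
```

A 90-degree rotation corresponds to the matrix `{0, -1, 1, 0}` and a horizontal skew by factor k to `{1, k, 0, 1}`. The point of the sketch is that the same small routine runs on every vertex independently.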

Graphics Pipeline
Why GPU ?
❖ Throughput is more important than latency
❖ High throughput is needed for the huge amount of computation required
for graphics
❖ Extremely parallel: different pixels and elements of the image can be
operated on independently
❖ Hundreds of cores execute at the same time to take advantage of this
fundamental parallelism

Figure credit: ETechnoG

CPU – GPU Interaction
❖ GPU houses a graphics pipeline

❖ GPU programs are written in two parts

❖ One part runs on a normal CPU
❖ The other part runs on the GPU
(compiled to a GPU-specific ISA)
CPU vs GPU

Figure credit: ETechnoG

Flynn’s Classification
Exploiting Parallelism
Scalar Sequential Code:
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: over time, the scalar sequential code executes load, load, add, store for Iter. 1 and then again for Iter. 2; the vectorized code executes each of these steps once as a vector instruction covering Iter. 1 and Iter. 2 together.]
❖ Vectorization: compile-time reordering of operation sequencing
❖ Requires extensive loop dependence analysis
Slide credit: Onur Mutlu, ETH Zurich
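The reordering that vectorization performs can be sketched in plain C by strip-mining the loop. This is only an emulation: `VLEN` is an assumed vector length, the function names are hypothetical, and each inner loop stands in for what a single vector instruction would do on real hardware.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed vector length, for illustration only */

/* Scalar version: load, load, add, store, one iteration at a time. */
void add_scalar(const int *A, const int *B, int *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

/* Strip-mined version: the same operations regrouped so that all loads
   of A, all loads of B, all adds, and all stores for VLEN iterations
   happen together. Each inner loop models one vector instruction. */
void add_vectorized(const int *A, const int *B, int *C, size_t n) {
    assert(n == 0 || (A && B && C));
    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN) {
        int v1[VLEN], v2[VLEN], v3[VLEN];
        for (int k = 0; k < VLEN; k++) v1[k] = A[i + k];       /* vector load A  */
        for (int k = 0; k < VLEN; k++) v2[k] = B[i + k];       /* vector load B  */
        for (int k = 0; k < VLEN; k++) v3[k] = v1[k] + v2[k];  /* vector add     */
        for (int k = 0; k < VLEN; k++) C[i + k] = v3[k];       /* vector store   */
    }
    for (; i < n; i++)  /* leftover iterations run as scalar code */
        C[i] = A[i] + B[i];
}
```

The regrouping is legal here only because no iteration reads a value written by another iteration, which is exactly what loop dependence analysis has to establish.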
Vector Machine - Summary
❖ Vector/SIMD machines are good at exploiting regular data-level parallelism
❖ Same operation performed on many data elements
❖ Improves performance (there are no intra-vector dependencies to check)
❖ Performance improvement is limited by the vectorizability of the code
❖ Scalar operations limit vector machine performance
(see Amdahl's Law)
❖ Many existing ISAs include (vector-like) SIMD operations
❖ Intel MMX/SSEn/AVX, PowerPC AltiVec, ARM Advanced SIMD
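The limit imposed by non-vectorizable code follows from Amdahl's Law: if a fraction f of the work is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). A minimal C sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction f of the execution
   (0 <= f <= 1) is accelerated by a factor s and the rest is unchanged. */
double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, if 90% of a program vectorizes, the overall speedup stays below 10x no matter how wide the vector unit is, because the remaining 10% still runs as scalar code.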

GPUs are SIMD Engines Underneath

❖ The instruction pipeline operates like a SIMD pipeline
❖ Programming is done using threads, NOT SIMD instructions
❖ Programming Model (Software) vs. Execution Model (Hardware)

Programming vs Execution Model

❖ Programming Model refers to how the programmer expresses the code
❖ Ex: Sequential (von Neumann), Data Parallel (SIMD), Dataflow, Multithreaded (MIMD, SPMD)
❖ Execution Model refers to how the hardware executes the code underneath
❖ Ex: Out-of-order execution, Vector processor, Multithreaded processor
❖ Execution Model can be very different from the Programming Model
❖ Ex: von Neumann model implemented by an OoO processor
❖ Ex: SPMD model implemented by a SIMD processor (a GPU)
Programming vs Execution Model
Model 1: Sequential (SISD)
Scalar Sequential Code:
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: over time, Iter. 1 executes load, load, add, store, followed by Iter. 2 executing load, load, add, store.]
❖ Pipelined processor
❖ Out-of-order execution processor
❖ Independent instructions executed when ready
❖ Different iterations are present in the instruction window
and can execute in parallel in multiple functional units
❖ Loop is dynamically unrolled by the hardware
❖ Superscalar or VLIW processor
❖ Can fetch and execute multiple instructions per cycle

Slide credit: Onur Mutlu, ETH Zurich

Programming vs Execution Model
Model 2: Data Parallel (SIMD)
Scalar Sequential Code:
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
Vectorized Code:
VLD A 🡪 V1
VLD B 🡪 V2
VADD V1 + V2 🡪 V3
VST V3 🡪 C
[Figure: over time, the scalar code executes load, load, add, store for Iter. 1 and then for Iter. 2; in the vectorized code each vector instruction performs the same operation for Iter. 1 and Iter. 2 together.]
❖ Each iteration is independent
❖ Compiler generates a SIMD instruction to execute the same instruction from all
iterations across different data
Slide credit: Onur Mutlu, ETH Zurich
Programming vs Execution Model
Model 3: Multithreaded
Scalar Sequential Code:
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: Iter. 1 and Iter. 2 run side by side as separate threads, each executing load, load, add, store.]
❖ Each iteration is independent
❖ Programmer or compiler generates a thread to execute each iteration.
❖ Each thread does the same thing (but on different data)
This can be realized as:
❖ SPMD: Single Program Multiple Data
❖ SIMT: Single Instruction Multiple Thread
Slide credit: Onur Mutlu, ETH Zurich
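The thread-per-iteration idea can be sketched in plain C by turning the loop body into a function of a thread id. This is an emulation under stated assumptions: `vec_add_thread` and `launch_threads` are hypothetical names, and the launcher runs the "threads" one after another where real hardware would run them in parallel.

```c
#include <assert.h>

/* The per-thread program: the loop body, with the loop index i
   replaced by the thread id. Every thread runs this same code. */
void vec_add_thread(int tid, const int *A, const int *B, int *C) {
    C[tid] = A[tid] + B[tid];
}

/* Hypothetical launcher: a sequential stand-in for hardware that would
   execute the n threads concurrently. The threads are deliberately run
   in reverse order to show that the result does not depend on the order,
   because the iterations are independent. */
void launch_threads(int n, const int *A, const int *B, int *C) {
    assert(n >= 0);
    for (int tid = n - 1; tid >= 0; tid--)
        vec_add_thread(tid, A, B, C);
}
```

The programmer writes only the per-thread function; how the threads are grouped and scheduled is left to the execution model underneath.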
GPU is a SIMT Machine
❖ Single instruction, multiple threads (SIMT) is an execution model used in
parallel computing where single instruction, multiple data (SIMD) is combined
with multithreading.
❖ Each thread executes the same code but operates on a different piece of data
❖ Each thread has its own context (can be restarted and executed independently)
❖ A set of threads executing the same instruction are dynamically grouped into a
warp (wavefront) by the hardware

Slide credit: Onur Mutlu, ETH Zurich

SIMT Illustration
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: Iteration 1 and Iteration 2 run as threads of Warp 0, which executes load at PC X, load at PC X+1, add at PC X+2, and store at PC X+3.]
❖ Warp: A set of threads that execute the same instruction at the same PC.
❖ Programmer or compiler generates a thread to execute each iteration.
❖ Each thread does the same thing (but on different data)
Multithreaded Warps
❖ Assume a warp of 32 threads. With 32K iterations and 1 iteration/thread 🡪 1K warps
❖ Warps can be interleaved on the same pipeline 🡪 Fine-grained multithreading
[Figure: warps interleaved on one pipeline, each executing load, load, add, store; one warp is at PC X while Warp 20 is at PC X+2. Iter. 33 and Iter. 34 belong to Warp 1; Iter. 20*32 + 1 and Iter. 20*32 + 2 belong to Warp 20.]
❖ A GPU executes in the SIMT model
❖ Single Instruction Multiple Thread
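The warp arithmetic on this slide can be written out as a minimal C sketch. The function names are hypothetical, thread ids are assumed to start at 0 (so the slide's Iter. 33 is thread 32), and `WARP_SIZE` is the slide's 32.

```c
#include <assert.h>

#define WARP_SIZE 32  /* threads per warp, as on the slide */

/* Number of warps needed to cover n_threads threads (rounded up). */
int num_warps(int n_threads) {
    assert(n_threads >= 0);
    return (n_threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp that a given thread id belongs to. */
int warp_of(int tid) {
    assert(tid >= 0);
    return tid / WARP_SIZE;
}

/* Position (lane) of a thread inside its warp. */
int lane_of(int tid) {
    assert(tid >= 0);
    return tid % WARP_SIZE;
}
```

With 32K threads, `num_warps(32 * 1024)` is 1024, and thread 32 (the 33rd iteration) falls into warp 1, consistent with the slide.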
Warp-Level FGMT
❖ Warp: A set of threads that execute the same instruction on different data
elements 🡪 SIMT
❖ All threads run the same code

[Figure: Thread Warp 0, Thread Warp 1, and Thread Warp 7 are scheduled onto a SIMD pipeline; a thread warp consists of scalar threads (P, Q, R, Z) that share a common PC.]
Warp-Level FGMT
32-thread warp executing A[tid] + B[tid] 🡪 C[tid]
[Figure: SIMD execution unit. With one pipelined functional unit, one result completes at a time (C[0], C[1], C[2], ...); with four pipelined functional units, four results complete together (C[0] to C[3], C[4] to C[7], C[8] to C[11], ...).]
Warp Instruction Level Parallelism
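❖ Overlap execution of multiple instructions
❖ Example machine: 32 threads per warp, 8 lanes
❖ Completes 24 operations/cycle (8 lanes x 3 functional units) while issuing 1 warp/cycle
[Figure: warps W0 to W5 are issued over time to the Load Unit, Multiply Unit, and Add Unit, and their execution overlaps.]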
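The numbers quoted for this example machine (32 threads per warp, 8 lanes, 24 operations/cycle) can be reproduced with a small C sketch. The function names are hypothetical, and the count of three functional units is taken from the Load, Multiply, and Add units named in the figure.

```c
#include <assert.h>

/* Cycles for which one functional unit is occupied by a single warp
   instruction, when it has fewer lanes than the warp has threads. */
int cycles_per_warp_instruction(int warp_size, int lanes) {
    assert(warp_size > 0 && lanes > 0);
    return (warp_size + lanes - 1) / lanes;
}

/* Peak operations completed per cycle when every functional unit is
   busy with a different warp instruction. */
int peak_ops_per_cycle(int lanes, int functional_units) {
    assert(lanes > 0 && functional_units > 0);
    return lanes * functional_units;
}
```

Here `cycles_per_warp_instruction(32, 8)` is 4 and `peak_ops_per_cycle(8, 3)` is 24. Because each warp instruction keeps its unit busy for several cycles, instructions from different warps can be in flight in different units at the same time.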
SIMT Memory Access
❖ Same instruction in different threads uses thread id to index and access
different data elements
Example: N=16, 4 threads per warp 🡪 4 warps
[Figure: threads 0 to 15 each add the data element with the matching index, 0 to 15. Threads 0 to 3 form Warp 0, threads 4 to 7 Warp 1, threads 8 to 11 Warp 2, and threads 12 to 15 Warp 3.]
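The N=16 example can be emulated in C as follows. This is a sketch under stated assumptions, not GPU code: `warp_add` and `simt_add` are hypothetical names, and the lanes of a warp are run by a loop where hardware would run them in lockstep.

```c
#include <assert.h>

#define N_ELEMS 16          /* N on the slide */
#define THREADS_PER_WARP 4  /* warp size used in the slide's example */

/* One warp executing one add instruction: every lane performs the same
   operation, using its own thread id to select its data element. */
static void warp_add(int warp, const int *A, const int *B, int *C) {
    for (int lane = 0; lane < THREADS_PER_WARP; lane++) {
        int tid = warp * THREADS_PER_WARP + lane;
        C[tid] = A[tid] + B[tid];
    }
}

/* Issue the add for every warp and return how many warps were used. */
int simt_add(const int *A, const int *B, int *C) {
    assert(N_ELEMS % THREADS_PER_WARP == 0);
    int warps = N_ELEMS / THREADS_PER_WARP;
    for (int w = 0; w < warps; w++)
        warp_add(w, A, B, C);
    return warps;
}
```

With these constants `simt_add` uses 4 warps, and the threads of a warp access neighbouring elements: warp 1, for example, covers elements 4 to 7.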

johnjose@[Link]