0% found this document useful (0 votes)

9 views25 pages

GPU SIMT Microarchitecture Overview

Chapter 19 discusses the GPU SIMT microarchitecture, highlighting the differences between CPU and GPU execution models, and the implications for performance and security. It covers topics such as control flow divergence, warp scheduling, memory access patterns, and various side-channel attacks that exploit GPU vulnerabilities. The chapter emphasizes the importance of understanding GPU architecture to mitigate potential security risks associated with parallel processing.

Uploaded by

taha.aimen

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views25 pages

GPU SIMT Microarchitecture Overview

Uploaded by

taha.aimen

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Chapter 19 GPU SIMT Microarchitecture

‫أهداف الفصل‬
MIMD ‫ و‬SIMD ‫ و ازاي يختلف عن‬SIMT execution model ‫ فهم‬.1
masking ‫ و‬divergence problem ‫ و‬warp scheduler ‫ تحليل‬.2
leak access patterns ‫ لـ‬memory coalescing side-channels ‫ استغالل‬.3
timing oracle ‫ كـ‬shared memory bank conflicts ‫ فهم‬.4
GPU-specific vulnerabilities: page table walks، TLB، cache timing ‫ تحليل‬.5
synchronization attacks ‫ و‬kernel launch overhead ‫ فهم‬.6

‫تقسيمة الموضوعات‬
1. GPU vs CPU: Execution Model Fundamentals
CPU Philosophy:

Latency-optimized
- Big caches (MB)
- Out-of-order execution
- Branch prediction
- Few powerful cores (4-64)

Goal: minimize latency of single thread

GPU Philosophy:

Throughput-optimized
- Small caches (KB per SM)
- In-order execution (mostly)
- NO branch prediction
- Thousands of threads (10K-100K)

Goal: maximize throughput via massive parallelism

Security Implication:

 CPU: complex microarchitecture = Spectre/Meltdown

 GPU: simpler, but massive shared resources = contention side-channels
The Parallelism Pyramid:

SISD: Single Instruction, Single Data (scalar CPU)

SIMD: Single Instruction, Multiple Data (SSE/AVX)
SIMT: Single Instruction, Multiple Threads (GPU)
MIMD: Multiple Instruction, Multiple Data (multi-core)

SIMT = SIMD + Multi-threading:

SIMD (AVX-512):
for (int i = 0; i < 16; i++) {
result[i] = a[i] + b[i]; // 16 operations in 1 instruction
}

SIMT (CUDA):
__global__ void add(float *a, float *b, float *result) {
int i = threadIdx.x; // Each thread has unique ID
result[i] = a[i] + b[i];
}
// Launch 1024 threads (not 16!)
// Hardware groups threads into warps (32 threads)

2. SIMT Execution: Warps & Thread Blocks

Thread Hierarchy:

Grid (entire kernel launch)

↓
Thread Blocks (up to 1024 threads per block)
↓
Warps (32 threads - hardware scheduling unit)
↓
Individual Threads

Example: 1024 threads

Grid: 1024 threads

Thread Block 0: threads 0-511
Thread Block 1: threads 512-1023

Warp 0: threads 0-31

Warp 1: threads 32-63
...
Warp 31: threads 992-1023

Warp Execution:

All 32 threads in warp execute SAME instruction

- Same PC (program counter)
- Different data (different registers)

Example:
Instruction: ADD R1, R2, R3
Thread 0: R1[0] = R2[0] + R3[0]
Thread 1: R1[1] = R2[1] + R3[1]
...
Thread 31: R1[31] = R2[31] + R3[31]

All execute in LOCKSTEP (1 cycle)

Hardware View: Streaming Multiprocessor (SM)

SM (NVIDIA Ampere):
- 64 CUDA cores (FP32 ALUs)
- 4 warp schedulers
- 128KB register file
- 64KB shared memory
- 128KB L1 cache
- Warp scheduler picks 4 warps to issue per cycle

Security Note:

 Warp scheduling = shared resource

 Warp execution timing = observable
 Warp divergence = massive timing leak

3. Control Flow Divergence - ‫ مشكلة الـ‬GPU ‫األساسية‬

The Problem:

global void kernel(int data, int output) {

int tid = threadIdx.x;
if (data[tid] > 100) { // Branch!
output[tid] = data[tid] * 2; // Path A
} else {
output[tid] = data[tid] + 1; // Path B
}
}

If all threads take same path: no problem

Threads 0-31: all data[tid] > 100

→ All take Path A
→ 1 warp execution (efficient!)

If threads diverge: DISASTER

Threads 0-15: data[tid] > 100 (Path A)

Threads 16-31: data[tid] <= 100 (Path B)

Hardware executes:
1. Execute Path A with threads 0-15 ACTIVE, 16-31 MASKED
2. Execute Path B with threads 0-15 MASKED, 16-31 ACTIVE

→ 2 warp executions (50% efficiency!)

Masking Implementation:

Active Mask (32-bit): 1 = active, 0 = masked

Path A execution:
Active Mask = 0x0000FFFF (threads 0-15)
Instruction: MUL R1, R2, 2
- Threads 0-15: execute MUL
- Threads 16-31: NOP (masked, but consume cycles!)

Path B execution:
Active Mask = 0xFFFF0000 (threads 16-31)
Instruction: ADD R1, R2, 1
- Threads 0-15: NOP
- Threads 16-31: execute ADD

Performance Impact:

Best case (no divergence): 1 warp = 1 cycle

Worst case (all threads diverge): 32 warps = 32 cycles
→ 32× slowdown!
Divergence Side-Channel Attack:

// Victim kernel (cryptographic operation)

__global__ void aes_encrypt(uint8_t *key, uint8_t *plaintext) {
int tid = threadIdx.x;
uint8_t k = key[tid % 16];

if (k & 0x01) { // Secret-dependent branch!

// Path A: ~10 instructions
sbox_lookup_even(plaintext[tid]);
} else {
// Path B: ~10 instructions
sbox_lookup_odd(plaintext[tid]);
}
}

// Attacker kernel (co-scheduled on same SM)

__global__ void attacker_probe() {
uint64_t start = clock64();
// Dummy computation (1 warp, no divergence)
for (int i = 0; i < 1000; i++) {
volatile int x = i * 2;
}
uint64_t end = clock64();
uint64_t latency = end - start;

// If victim has divergence → attacker delayed (SM busy)

// If no divergence → attacker fast
// → Leak victim's key bits!
}

Real Attack: CVE-2018-4437 (Apple GPU)

 Divergence timing leaks AES key

 Reconstruct key after ~1000 encryptions

4. Warp Scheduling & Occupancy

SM Resources:

NVIDIA A100 SM:

- 64 FP32 CUDA cores
- 65536 registers (32-bit each)
- 164KB shared memory
- Max 64 warps resident
- Max 2048 threads resident

Occupancy Calculation:

Example kernel:
- 128 threads per block
- 64 registers per thread
- 16KB shared memory per block

Shared memory limit:

164KB / 16KB per block = 10 blocks max

Actual: min(8, 10) = 8 blocks

→ 8 blocks × 128 threads = 1024 threads
→ Occupancy = 1024 / 2048 = 50%

Low Occupancy = Security Leak:

Attacker strategy:
1. Launch kernel with high register usage (low occupancy)
2. Co-schedule victim kernel
3. If victim doesn't launch → SM full
4. If victim launches → SM had space
5. Infer victim's resource usage → leak kernel config

Warp Scheduler Policies:

Round-Robin (NVIDIA default):

- Cycle through ready warps
- Fair, but no QoS

Greedy-Then-Oldest (GTO):
- Prioritize oldest warp
- Better cache locality
- But: unfair (newer warps starved)

Two-Level (TL):
- Group warps into fetch groups
- Round-robin between groups
- GTO within group

Scheduler Side-Channel:

// Attacker measures warp scheduling order

__global__ void attacker_probe() {
// Record timestamp when this warp runs
uint64_t timestamp = clock64();
atomicAdd(&global_timestamps[warpId], timestamp);
}

// Analysis:
// If victim warp scheduled before attacker → higher priority
// If attacker scheduled first → victim blocked/not ready
// → Infer victim's warp state (ready/blocked)

5. Memory System: Coalescing & Bank Conflicts

Global Memory Access:

Warp issues load: 32 threads, 32 addresses

Best case: addresses are COALESCED (contiguous)
→ 1 memory transaction (128 bytes)

Worst case: addresses are SCATTERED

→ 32 memory transactions (4096 bytes)
→ 32× bandwidth waste!

Coalesced Access:

global void coalesced(float *data) {

int tid = threadIdx.x;
float val = data[tid]; // Thread i reads address i
}

Memory access:
Thread 0: addr 0x1000
Thread 1: addr 0x1004
...
Thread 31: addr 0x107C
→ All in same 128-byte cache line
→ 1 transaction

Uncoalesced Access:

global void uncoalesced(float *data) {

int tid = threadIdx.x;
float val = data[tid * 1024]; // Stride = 1024
}

Memory access:
Thread 0: addr 0x1000
Thread 1: addr 0x1400 (1KB away!)
...
→ 32 different cache lines
→ 32 transactions

Coalescing Side-Channel:

Observation: Coalesced access = fast, Uncoalesced = slow

Attack:
1. Victim accesses array with secret-dependent stride
stride = (secret_bit == 1) ? 4 : 1024
2. Attacker measures victim's memory bandwidth
3. High bandwidth → stride=4 (coalesced) → secret_bit=1
4. Low bandwidth → stride=1024 (uncoalesced) → secret_bit=0

Shared Memory Bank Conflicts:

Shared Memory: 64KB, divided into 32 banks

Bank_ID = (address / 4) % 32

Conflict-free access:
Thread i accesses bank i
→ All 32 threads access different banks
→ 1 cycle

Bank conflict:
Threads 0 and 16 both access bank 0
→ Serialized: 2 cycles
Worst case: All threads access same bank
→ 32 cycles!

Example:

shared float shared_data[1024];

// Conflict-free:
int tid = threadIdx.x;
float val = shared_data[tid]; // Each thread → different bank

// Bank conflict:
float val2 = shared_data[tid * 32]; // All threads → same bank!

Bank Conflict Side-Channel:

// Victim kernel
__global__ void victim(uint8_t *secret) {
__shared__ float shmem[256];
int tid = threadIdx.x;

// Secret-dependent access pattern

int index = (secret[tid] & 0x1F) * 32; // Bank conflict if secret
odd
float val = shmem[index];
}

// Attacker probes shared memory latency

__global__ void attacker() {
__shared__ float dummy[256];
uint64_t start = clock64();
volatile float x = dummy[threadIdx.x]; // Conflict-free
uint64_t end = clock64();

// If victim has conflicts → shared memory busy → attacker delayed

}

6. Cache Hierarchy & Timing Attacks

GPU Cache Hierarchy:
Per-SM:
- L1 Data Cache: 128KB (Ampere)
- Shared Memory: 64KB-164KB (configurable partition)
- Texture Cache: read-only, optimized for 2D locality

Global:
- L2 Cache: 40MB (A100)
- HBM Memory: 40-80GB (high bandwidth)

L1 Cache = Timing Oracle:

Attack: Prime+Probe on GPU L1

1. Attacker fills L1 with known data (prime)
2. Victim executes (evicts some lines)
3. Attacker probes L1 (measures access time)
4. Slow access → evicted → victim accessed this line

GPU-Specific Challenge:

CPU Prime+Probe: clflush instruction

GPU: NO explicit cache flush instruction!

Workaround:
- Allocate large array (> L1 size)
- Access entire array sequentially
- Evicts all L1 lines (flush by capacity)

Implementation:

#define L1_SIZE (128 * 1024) // 128KB

#define LINE_SIZE 128
#define NUM_LINES (L1_SIZE / LINE_SIZE)

global void prime_l1(uint8_t *probe_array) {

int tid = threadIdx.x + blockIdx.x * blockDim.x;

// Prime: fill L1 with probe_array

for (int i = 0; i < NUM_LINES; i++) {
volatile uint8_t x = probe_array[i * LINE_SIZE];
}
}
__global__ void probe_l1(uint8_t *probe_array, uint64_t *timings) {
int tid = threadIdx.x;

// Probe: measure access time

for (int i = 0; i < NUM_LINES; i++) {
uint64_t start = clock64();
volatile uint8_t x = probe_array[i * LINE_SIZE];
uint64_t end = clock64();
timings[i] = end - start;
}
}

// Analysis:
// If timings[i] > threshold → cache miss → victim accessed line i

Real Attack: GPU-based AES Key Recovery

 Target: AES T-table lookups

 Prime+Probe on L1
 Leak table indices → recover key
 Success rate: 95% after 10K encryptions

7. TLB & Page Table Side-Channels

GPU TLB:

Per-SM TLB: ~16-64 entries (small!)

L2 TLB (global): 512-1024 entries

Why so small?
- GPUs have MASSIVE address spaces (100K threads)
- TLB thrashing is common

TLB Thrashing Attack:

// Attacker evicts victim's TLB entries

__global__ void evict_tlb(uint64_t *large_array) {
int tid = threadIdx.x;
// Access many different pages
for (int i = 0; i < 1024; i++) {
volatile uint64_t x = large_array[i * 4096 / 8]; // 1 page per
iteration
}
}

// Victim experiences TLB misses

__global__ void victim() {
// Every access now requires page table walk
// → 100-200 cycle penalty
}

Page Table Walk Latency:

TLB hit: ~1 cycle

TLB miss → Page table walk:
- 4-level page table (x86-64 style)
- 4 memory accesses
- ~100-200 cycles

Attacker measures victim's overall latency

→ High latency = TLB thrashing
→ Infer victim's memory access pattern

Unified Memory Side-Channel:

NVIDIA Unified Memory: CPU & GPU share virtual address space
Page migration: pages migrate between CPU & GPU memory

Attack:
1. Victim accesses page on CPU
2. Page migrates to GPU (on-demand)
3. Attacker measures page fault frequency
4. High page faults → victim accessing many unique pages
5. Low page faults → victim re-using pages
→ Leak working set size

8. Synchronization Primitives & Race Conditions

__syncthreads() Barrier:

global void kernel() {

__shared__ float shmem[256];

// Phase 1: each thread writes

shmem[threadIdx.x] = compute();
__syncthreads(); // All threads wait here

// Phase 2: each thread reads

float val = shmem[(threadIdx.x + 1) % 256];
}

Implementation:

Hardware barrier:
- Each thread sets "arrived" bit
- Warp scheduler stalls until all bits set
- Then releases all threads

Timing:
- Fast if all threads arrive together
- Slow if threads straggle (divergence)

Barrier Timing Side-Channel:

// Victim kernel (secret-dependent work)

__global__ void victim(uint8_t *secret) {
if (secret[threadIdx.x]) {
// Heavy computation (100 cycles)
expensive_function();
}
__syncthreads(); // Barrier

// More work...
}

// Attacker in same block

__global__ void attacker() {
uint64_t start = clock64();
__syncthreads(); // Same barrier!
uint64_t end = clock64();

// If victim threads diverge (secret-dependent) → longer barrier

wait
// → Leak secret via barrier timing
}

Atomic Operations:
__global__ void atomic_inc(int *counter) {
atomicAdd(counter, 1);
}

Serialization Problem:

All 32 threads in warp try to atomicAdd same location

→ Serialized: 32 cycles
→ Creates contention bottleneck

Atomic Contention Side-Channel:

Victim uses atomics for secret-dependent histogram:

for each element:
bin = hash(element, secret_key)
atomicAdd(&histogram[bin], 1)

Attacker measures atomic throughput:

- High contention → many threads hit same bins → reveals key

9. Kernel Launch Overhead & Timing

Launch Latency:

CPU launches kernel:

1. Driver prepares launch parameters (~5 μs)
2. Copy parameters to GPU (~1 μs)
3. GPU scheduler assigns SMs (~2 μs)
4. Kernel starts execution

Total: ~8-10 μs overhead (before any useful work!)

Inference Attack:

// Attacker measures victim's kernel launch rate

void monitor_launches() {
while (1) {
uint64_t start = clock();
// Wait for GPU idle
cudaDeviceSynchronize();
uint64_t end = clock();
printf("Time between kernels: %lu us\n", end - start);
}
}

// If victim launches kernels at secret-dependent rate:

// → Leak secret via launch timing

Persistent Kernels (Defense):

// Instead of many small kernel launches:

// → 1 persistent kernel that loops internally

global void persistent_kernel(WorkQueue *queue) {

while (1) {
Work work = dequeue(queue);
if (work == NULL) break;
process_work(work);
}
}

// Launch once, feed work via queue

// → No launch overhead per task
// → Harder to infer task boundaries

10. Multi-Tenancy & Isolation

GPU Virtualization:

Modern GPUs (A100, H100):

- Multi-Instance GPU (MIG): hardware partitioning
- Each instance gets dedicated SMs, memory, caches
- Isolated execution

Older GPUs:
- Spatial sharing: different kernels on different SMs
- Temporal sharing: time-slice GPU
- NO isolation!

Spatial Sharing Attack:

Attacker kernel on SM 0-3
Victim kernel on SM 4-7

Attack:
1. Attacker saturates L2 cache (shared!)
2. Victim's cache lines evicted
3. Victim performance degrades
4. Attacker measures victim's slowdown
→ Infer victim's memory access pattern

Temporal Sharing Attack:

GPU scheduler: preempts victim kernel, runs attacker

- Victim's state: registers, shared memory, L1 cache
- Lazy state save: GPU doesn't clear state (optimization!)

Attack:
1. Victim kernel runs (leaves data in registers)
2. GPU preempts victim
3. Attacker kernel runs on same SM
4. Attacker reads registers (might contain victim data!)

Real Vulnerability: CVE-2019-5700 (AMD GPU)

Incomplete context switch:

- Register file not cleared
- Attacker reads victim's registers
- Leak: crypto keys, passwords

Mitigation:

Scrubbing: zero all registers on context switch

Cost: 100-200 cycles per SM
→ 5-10% performance overhead

11. CUDA Driver & Runtime Attacks

Driver Privileges:

CUDA Driver: kernel mode (Ring 0)

User application: user mode (Ring 3)
Syscalls:
- cuLaunchKernel()
- cuMemcpyHtoD()
- cuMemAlloc()

Each syscall = potential vulnerability

Attack Surface:

// Malformed kernel launch

cuLaunchKernel(
kernel,
1, // gridDim
1, // blockDim
0, // sharedMem
0, // stream
(void **)0xDEADBEEF, // kernelParams (invalid pointer!)
NULL
);

// If driver doesn't validate:

// → Kernel dereferences 0xDEADBEEF
// → GPU page fault
// → Driver might crash / leak info

Driver Buffer Overflow:

// Copy large buffer to GPU

char large_buffer[1024 * 1024 * 1024]; // 1GB
cuMemcpyHtoD(gpu_ptr, large_buffer, sizeof(large_buffer));

// If driver allocates small buffer internally:

// → Buffer overflow in kernel driver
// → Potential privilege escalation

Real Exploits:

 CVE-2021-1056 (NVIDIA): driver buffer overflow → SYSTEM

 CVE-2020-5963 (NVIDIA): driver UAF → kernel code execution

12. Constant Memory & Read-Only Cache Exploitation

Constant Memory:

64KB per kernel

Cached in dedicated constant cache (64KB per SM)
Read-only
Optimized for broadcast (all threads read same value)

Timing Leak:

constant uint8_t aes_key[16]; // Secret key

global void encrypt(uint8_t *plaintext) {

int tid = threadIdx.x;
// All threads read same key byte (broadcast)
uint8_t k = aes_key[tid % 16];

// Use k...
}

Attack:

Attacker probes constant cache:

1. Fill constant cache with known data
2. Victim runs (loads key into cache)
3. Attacker measures constant cache state
(via timing of constant memory access)
4. Infer which cache lines were evicted
→ Leak key bytes

Read-Only Cache (Texture Cache):

Separate from L1
Optimized for spatial locality (2D arrays)
Used for read-only data

global void kernel(float *input) {

float val = __ldg(&input[idx]); // Load via texture cache
}

Same Attack:
Prime+Probe on texture cache
→ Leak victim's read patterns

13. Performance Counter Side-Channels

NVIDIA Profiler (nvprof):

$ nvprof --metrics all ./victim_program

==12345== Profiling result:

Kernel: aes_encrypt
l1_cache_global_hit_rate: 87.3%
shared_mem_bank_conflicts: 1234
warp_execution_efficiency: 45.2%
achieved_occupancy: 0.67

Attack Scenario:

Cloud GPU:
- Multiple users share GPU
- Each can run nvprof
- Counters are GLOBAL (not per-user!)

Attack:
1. Attacker runs: nvprof --metrics dram_read_bytes
2. Victim encrypts data
3. Attacker reads counter: +128KB
→ Victim read 128KB from DRAM
→ Infer data size / operation type

Mitigation:

GPU driver: restrict perf counter access

- Require root/admin privileges
- Per-process counters (hardware support needed)

// GOOD: branchless (masking)

__global__ void good_crypto(uint8_t *key) {
uint8_t mask = (key[threadIdx.x] & 0x01) ? 0xFF : 0x00;
uint8_t result_A = path_A_compute();
uint8_t result_B = path_B_compute();
uint8_t result = (result_A & mask) | (result_B & ~mask);
// Both paths execute, no divergence timing leak
}

Constant-Time Memory Access:

// BAD: secret-dependent stride

__global__ void bad_access(uint8_t *secret, float *data) {
int stride = secret[threadIdx.x] * 1024;
float val = data[stride]; // Coalescing depends on secret!
}

// GOOD: fixed stride, shuffle result

__global__ void good_access(uint8_t *secret, float *data) {
// Always coalesced access
float val = data[threadIdx.x];
// Shuffle locally (cheap)
int idx = secret[threadIdx.x];
__shared__ float shmem[256];
shmem[threadIdx.x] = val;
__syncthreads();
float result = shmem[idx]; // Secret-dependent, but local
}

Noise Injection:

// Add random delays to hide timing

__global__ void noisy_crypto(uint8_t *key) {
// Random delay (0-100 cycles)
int delay = random() % 100;
for (int i = 0; i < delay; i++) {
__threadfence(); // Dummy work
}

// Actual crypto work

compute(key);
}

Cost: 10-30% performance overhead

‫ترتيب المذاكرة‬
1. GPU vs CPU (Section 1): ‫فهم الفلسفة األساسية‬
2. SIMT & Warps (Section 2): execution model
3. Divergence (Section 3): ‫ المشكلة الرئيسية في‬GPU
4. Warp Scheduling (Section 4): ‫ ازاي الـ‬SM ‫ يدير‬warps
5. Memory Coalescing (Section 5): memory access patterns
6. Cache Timing (Section 6): Prime+Probe ‫ على‬GPU
7. TLB Attacks (Section 7): page table side-channels
8. Synchronization (Section 8): barriers ‫ و‬atomics
9. Kernel Launch (Section 9): overhead ‫ و‬timing
10. Multi-Tenancy (Section 10): isolation problems
11. Driver Attacks (Section 11): privilege escalation
12. Constant Memory (Section 12): specialized caches
13. Perf Counters (Section 13): monitoring attacks
14. Defenses (Section 14): constant-time programming

)!‫أهم المصطلحات (حفظ‬

1. SIMT: Single Instruction, Multiple Threads
2. Warp: 32 threads executing in lockstep
3. Streaming Multiprocessor (SM): GPU core cluster
4. Divergence: threads in warp take different paths
5. Active Mask: bitmask showing which threads are active
6. Coalescing: merging memory requests into single transaction
7. Bank Conflict: multiple threads access same shared memory bank
8. Occupancy: ratio of active warps to max warps per SM
9. Shared Memory: fast on-chip memory shared within thread block
10. Thread Block: group of threads (up to 1024)
11. Grid: collection of thread blocks
12. CUDA Core: scalar FP32 ALU
13. Texture Cache: read-only cache for spatial locality
14. Constant Memory: 64KB read-only broadcast memory
15. Warp Scheduler: hardware that picks warps to execute
16. Register Spilling: registers overflow to local memory
17. L1 Cache: per-SM cache (128KB)
18. L2 Cache: global cache (40MB)
19. HBM: High Bandwidth Memory
20. MIG: Multi-Instance GPU (hardware partitioning)

‫ أسئلة مراجعة‬/ ‫تمارين‬

‫ كبير؟‬SIMD ‫ مش مجرد‬GPU ‫؟ ليه‬SIMT ‫ و‬SIMD ‫ إيه الفرق بين‬.1
‫ ليه هو مشكلة أداء ضخمة؟‬.warp divergence ‫ اشرح‬.2
‫؟‬execution ‫ هيحصل إيه في‬،path B ‫ تاخد‬16 ‫ و‬path A ‫ تاخد‬threads 16 ‫ و‬if-else ‫ فيه‬kernel ‫ لو عندك‬.3
‫ يستغلده؟‬attacker ‫؟ إزاي‬uncoalesced memory access ‫ و‬coalesced ‫ إيه الفرق بين‬.4
‫ إزاي تتجنبه؟‬.shared memory bank conflict ‫ اشرح‬.5
‫؟‬clflush instruction ‫ بدون‬GPU L1 cache ‫ على‬Prime+Probe ‫ يعمل‬attacker ‫ ازاي‬.6
‫؟‬security implication ‫)؟ إيه الـ‬entries 64-16( ‫ صغير جدًا‬GPU TLB ‫ ليه‬.7
‫() يسرب معلومات؟‬syncthreads__ ‫ إزاي‬.barrier timing side-channel attack ‫ اشرح‬.8
‫؟‬GPU multi-tenancy ‫ في‬temporal sharing ‫ و‬spatial sharing ‫ إيه الفرق بين‬.9
‫؟‬AMD GPU ‫ في‬lazy context switching ‫ استغل‬CVE-2019-5700 ‫ ازاي‬.10
‫؟‬divergence timing leaks ‫ بدون‬constant-time GPU code ‫ إزاي تكتب‬.11
‫؟‬shared GPU environments ‫ خطر في‬performance counters ‫ ليه‬.12

Mini Project: GPU Cache Timing Attack on AES

:‫الهدف‬
.leak AES T-table accesses ‫ لـ‬GPU L1 cache ‫ على‬Prime+Probe attack ‫بناء‬

:‫المطلوب‬

Phase 1: Setup

 NVIDIA GPU (GTX 1080 ‫)أو أعلى‬

 CUDA Toolkit installed
 Victim AES implementation (T-table based)

Phase 2: Understand Victim

// Victim: AES T-table lookup (typical implementation)

__global__ void aes_encrypt(uint8_t *key, uint8_t *plaintext, uint8_t
*output) {
__shared__ uint32_t T0[256]; // T-tables in shared memory
__shared__ uint32_t T1[256];

// ... load T-tables ...

int tid = threadIdx.x;

uint8_t state[16];
// ... copy plaintext to state ...

// Round 1 (example)
uint8_t idx0 = state[0] ^ key[0];
uint32_t t0 = T0[idx0]; // Secret-dependent lookup!

// ... rest of AES ...

}

Phase 3: Build Prime Phase

#define L1_SIZE (128 * 1024)

#define LINE_SIZE 128
#define NUM_SETS (L1_SIZE / LINE_SIZE)

global void prime_l1(uint8_t *probe_array) {

int tid = threadIdx.x;
// Fill L1 with probe_array
for (int set = 0; set < NUM_SETS; set++) {
volatile uint8_t x = probe_array[set * LINE_SIZE];
}
}

Phase 4: Run Victim (Target)

// Launch victim kernel

aes_encrypt<<<1, 32>>>(dev_key, dev_plaintext, dev_output);
cudaDeviceSynchronize();

Phase 5: Build Probe Phase

global void probe_l1(uint8_t probe_array, uint64_t timings) {

int tid = threadIdx.x;
for (int set = 0; set < NUM_SETS; set++) {
uint64_t start = clock64();
volatile uint8_t x = probe_array[set * LINE_SIZE];
uint64_t end = clock64();
timings[set] = end - start;
}
}

Phase 6: Analysis

# Analyze timings to find evicted sets

threshold = mean(timings) + 2 * std(timings)
evicted_sets = [i for i, t in enumerate(timings) if t > threshold]

# Map evicted sets to T-table indices

# (requires knowing T-table address and L1 mapping function)
t_table_indices = map_sets_to_indices(evicted_sets)

# Repeat for many plaintexts

for plaintext in plaintexts:
run_attack(plaintext)
collect_t_table_indices()

# Statistical analysis: Differential Cache Analysis

# Correlate T-table indices with key bytes
key_bytes = recover_key_from_indices(all_indices, all_plaintexts)

Phase 7: Validation
# Test recovered key
recovered_key = bytes(key_bytes)
reference_ciphertext = reference_aes_encrypt(plaintext, actual_key)
test_ciphertext = reference_aes_encrypt(plaintext, recovered_key)

if test_ciphertext == reference_ciphertext:
print("SUCCESS: Key recovered!")
else:
print("FAIL: Key incorrect")

Expected Results:

Baseline (no attack): 100% AES correctness

After 1000 plaintexts: 30-40% key bytes recovered
After 10000 plaintexts: 95%+ key bytes recovered

Challenges:

 L1 cache mapping function unknown (reverse engineer)

 Noise ‫ من‬other SM activity (isolate using persistent kernels)
 Timing measurement precision (use clock64(), not wall time)

Bonus:

 Extend to Flush+Reload (if page deduplication available)

 Measure detection rate vs false positive rate
 Implement countermeasures (e.g., T-table in registers, not shared mem)

:‫ملحوظة‬
GPU cache timing, not weaponizing against production ‫ الهدف فهم‬.research-focused ‫المشروع ده‬
.systems

Key Takeaway:
GPU SIMT = massive parallelism + simple microarchitecture. But: shared resources (SM,
caches, memory) = huge side-channel surface. Divergence, coalescing, banks, TLB - ‫كلها‬
observable. Constant-time programming ‫ أصعب على‬GPU ‫ من‬CPU.

GPU AES Side-Channel Vulnerability Analysis
No ratings yet
GPU AES Side-Channel Vulnerability Analysis
8 pages
GPU Model Optimizations for Faster Inferencing
No ratings yet
GPU Model Optimizations for Faster Inferencing
8 pages
Pid-225 CRC-2
No ratings yet
Pid-225 CRC-2
1 page
NVIDIA Volta GPU Architecture Analysis
No ratings yet
NVIDIA Volta GPU Architecture Analysis
66 pages
Spectre Attacks: Exploiting _mm_clflush
No ratings yet
Spectre Attacks: Exploiting _mm_clflush
99 pages
GPU Architecture Case Study Overview
No ratings yet
GPU Architecture Case Study Overview
34 pages
Understanding OpenMP and CUDA Concepts
No ratings yet
Understanding OpenMP and CUDA Concepts
5 pages
qt1wb7f3h4 Nosplash
No ratings yet
qt1wb7f3h4 Nosplash
126 pages
Optimizing Multi-Kernel GPU Algorithms
No ratings yet
Optimizing Multi-Kernel GPU Algorithms
23 pages
GPU Computing Architecture Overview
No ratings yet
GPU Computing Architecture Overview
36 pages
Parallel AES Implementation with CUDA
No ratings yet
Parallel AES Implementation with CUDA
8 pages
SIMD vs MIMD: Key Differences Explained
No ratings yet
SIMD vs MIMD: Key Differences Explained
4 pages
Optimizing GPU Performance Metrics
No ratings yet
Optimizing GPU Performance Metrics
9 pages
GPU Programming with CUDA Basics
No ratings yet
GPU Programming with CUDA Basics
45 pages
Advanced CUDA Techniques for Many-Core Computing
No ratings yet
Advanced CUDA Techniques for Many-Core Computing
64 pages
HPC-DAG Model for Real-Time Systems
No ratings yet
HPC-DAG Model for Real-Time Systems
15 pages
Parallel Implementation of Cryptographic Algorithm Aes Using Opencl On Gpu
No ratings yet
Parallel Implementation of Cryptographic Algorithm Aes Using Opencl On Gpu
5 pages
SIMD vs MIMD: Parallel Computing Explained
No ratings yet
SIMD vs MIMD: Parallel Computing Explained
3 pages
Analyzing and Improving Hardware Modeling of Accel-Sim: Rodrigo Huerta Mojtaba Abaie Shoushtary Antonio González
No ratings yet
Analyzing and Improving Hardware Modeling of Accel-Sim: Rodrigo Huerta Mojtaba Abaie Shoushtary Antonio González
6 pages
HW 2
No ratings yet
HW 2
4 pages
CUDA Cooperative Groups Overview
No ratings yet
CUDA Cooperative Groups Overview
56 pages
asiaCCS21 GPUTLBChannel
No ratings yet
asiaCCS21 GPUTLBChannel
14 pages
Understanding CUDA Memory Management
No ratings yet
Understanding CUDA Memory Management
18 pages
Algorithms and Architectures For Parallel Processing 18th International Conference ICA3PP 2018 Guangzhou China November 15 17 2018 Proceedings Part IV Jaideep Vaidya Ebook Official Digital Version
100% (9)
Algorithms and Architectures For Parallel Processing 18th International Conference ICA3PP 2018 Guangzhou China November 15 17 2018 Proceedings Part IV Jaideep Vaidya Ebook Official Digital Version
56 pages
Cuda Python Handbook
No ratings yet
Cuda Python Handbook
50 pages
Memory Management in CUDA Programming
No ratings yet
Memory Management in CUDA Programming
65 pages
CUDA Execution Model and GPU Architecture
No ratings yet
CUDA Execution Model and GPU Architecture
7 pages
Spoiler
No ratings yet
Spoiler
18 pages
CUDA Cheat Sheet
No ratings yet
CUDA Cheat Sheet
2 pages
CUDA 9 Features and Tesla V100 Overview
100% (1)
CUDA 9 Features and Tesla V100 Overview
45 pages
Parallel Programming Overview for BCS702
No ratings yet
Parallel Programming Overview for BCS702
43 pages
Hybrid Distributed-Shared Memory Overview
No ratings yet
Hybrid Distributed-Shared Memory Overview
20 pages
Warp-Synchronous Programming in CUDA
No ratings yet
Warp-Synchronous Programming in CUDA
45 pages
GPU Unified Memory Support in OMPSS
No ratings yet
GPU Unified Memory Support in OMPSS
17 pages
GPU Performance and Attention Mechanisms
No ratings yet
GPU Performance and Attention Mechanisms
82 pages
Accelerating Graph Algorithms with CUDA
No ratings yet
Accelerating Graph Algorithms with CUDA
25 pages
CUDA Atomics and Thread Management
No ratings yet
CUDA Atomics and Thread Management
34 pages
Optimizing CUDA Memory Access Patterns
No ratings yet
Optimizing CUDA Memory Access Patterns
77 pages
ccs18 Gpu Side Channel
No ratings yet
ccs18 Gpu Side Channel
15 pages
Smack: Efficient Instruction Cache Attacks Via Self-Modifying Code Conflicts
No ratings yet
Smack: Efficient Instruction Cache Attacks Via Self-Modifying Code Conflicts
15 pages
GPU Computing Architecture Overview
No ratings yet
GPU Computing Architecture Overview
32 pages
Cheat Sheet
No ratings yet
Cheat Sheet
2 pages
Optimizing GPU Kernels for Deep Learning
No ratings yet
Optimizing GPU Kernels for Deep Learning
27 pages
GPU Performance in Supercomputing
No ratings yet
GPU Performance in Supercomputing
7 pages
Depth-Wise Convolution Optimization Guide
No ratings yet
Depth-Wise Convolution Optimization Guide
14 pages
Parallel Lec1-3 Summary
No ratings yet
Parallel Lec1-3 Summary
8 pages
Advanced CUDA Performance Optimization
100% (1)
Advanced CUDA Performance Optimization
127 pages
nVidia G80 GPU Architecture Overview
No ratings yet
nVidia G80 GPU Architecture Overview
25 pages
CUDA Programming Overview
No ratings yet
CUDA Programming Overview
28 pages
GPU Architecture and CUDA Performance Insights
No ratings yet
GPU Architecture and CUDA Performance Insights
26 pages
CUDA Programming for GPU Parallelization
No ratings yet
CUDA Programming for GPU Parallelization
25 pages
Latency Hiding Techniques in Architecture
No ratings yet
Latency Hiding Techniques in Architecture
28 pages
Inside NVidia GPUs - 2025
No ratings yet
Inside NVidia GPUs - 2025
34 pages
CUDA Programming and GPU Architecture Overview
No ratings yet
CUDA Programming and GPU Architecture Overview
52 pages
Introduction to GPGPU and CUDA
No ratings yet
Introduction to GPGPU and CUDA
26 pages
Optimizing GPU Data Transfers in CUDA
No ratings yet
Optimizing GPU Data Transfers in CUDA
30 pages
MIMD Execution Model on GPUs
No ratings yet
MIMD Execution Model on GPUs
15 pages
Pid-225 CRC-5
No ratings yet
Pid-225 CRC-5
1 page
CUDA Optimization Techniques Overview
No ratings yet
CUDA Optimization Techniques Overview
56 pages
Advanced Superscalar Processor Design
No ratings yet
Advanced Superscalar Processor Design
13 pages
Microprocessor Evolution Overview
No ratings yet
Microprocessor Evolution Overview
15 pages
Vector Computers: Programming & Security
No ratings yet
Vector Computers: Programming & Security
10 pages
VLIW-EPIC ILP Scheduling Techniques
No ratings yet
VLIW-EPIC ILP Scheduling Techniques
8 pages
Vector Programming Model Overview
No ratings yet
Vector Programming Model Overview
6 pages
Intel 4004: Microprocessor Evolution
No ratings yet
Intel 4004: Microprocessor Evolution
11 pages
Microarchitecture and Microprogramming Concepts
No ratings yet
Microarchitecture and Microprogramming Concepts
30 pages
Single-Cycle vs. Pipeline Architecture
No ratings yet
Single-Cycle vs. Pipeline Architecture
7 pages
Understanding Sequential Consistency in Memory Models
No ratings yet
Understanding Sequential Consistency in Memory Models
5 pages
Memory Technology Evolution Overview
No ratings yet
Memory Technology Evolution Overview
19 pages
Attack Surface Evolution and Mitigations
No ratings yet
Attack Surface Evolution and Mitigations
14 pages
Synchronization Issues in Parallel Processes
No ratings yet
Synchronization Issues in Parallel Processes
8 pages
Understanding CPU-Memory Gap Issues
No ratings yet
Understanding CPU-Memory Gap Issues
18 pages
Reducing Branch Penalty in Modern Pipelines
No ratings yet
Reducing Branch Penalty in Modern Pipelines
6 pages
ISA Implementation and Security Risks
No ratings yet
ISA Implementation and Security Risks
19 pages
E-Mail as a Secure Authentication Factor
No ratings yet
E-Mail as a Secure Authentication Factor
14 pages
IoT Firmware Vulnerabilities and Solutions
No ratings yet
IoT Firmware Vulnerabilities and Solutions
103 pages
Information-Theoretic Security in Cryptography
No ratings yet
Information-Theoretic Security in Cryptography
6 pages
Bypassing BitLocker: TPM Attack Analysis
No ratings yet
Bypassing BitLocker: TPM Attack Analysis
5 pages
Physical Side-Channel Attacks and Covert Communication On FPGAs
No ratings yet
Physical Side-Channel Attacks and Covert Communication On FPGAs
19 pages
IP Detection and Security Quiz
50% (2)
IP Detection and Security Quiz
4 pages
Cloud Computing Security Threats & Solutions
No ratings yet
Cloud Computing Security Threats & Solutions
15 pages
EM Security Testing for SoC Systems
No ratings yet
EM Security Testing for SoC Systems
4 pages
Side-Channel Testing Tools for Hardware
100% (1)
Side-Channel Testing Tools for Hardware
65 pages
Machine Learning in Side Channel Attacks
No ratings yet
Machine Learning in Side Channel Attacks
12 pages
Comprehensive Review of ABE Techniques
No ratings yet
Comprehensive Review of ABE Techniques
18 pages
An5156 Introduction To Security For Stm32 Mcus Stmicroelectronics
No ratings yet
An5156 Introduction To Security For Stm32 Mcus Stmicroelectronics
59 pages
Superscalar CPU Side-Channel Security
No ratings yet
Superscalar CPU Side-Channel Security
6 pages
Dummy Shuﬄing for White-box Security
No ratings yet
Dummy Shuﬄing for White-box Security
30 pages
Cryptography 08 00015 v2
No ratings yet
Cryptography 08 00015 v2
13 pages
Side Channel Spillway Overview
0% (1)
Side Channel Spillway Overview
12 pages
IoT Energy Monitoring with ESP32 System
No ratings yet
IoT Energy Monitoring with ESP32 System
10 pages
Hardware Security Syllabus Overview
No ratings yet
Hardware Security Syllabus Overview
2 pages
VeriSide: Efficient RTL Leakage Assessment
No ratings yet
VeriSide: Efficient RTL Leakage Assessment
2 pages
IoT and OT Security: Threats and Solutions
No ratings yet
IoT and OT Security: Threats and Solutions
8 pages
Power Modulation for IoT Security
100% (1)
Power Modulation for IoT Security
5 pages
KeeLoq Side-Channel Attack Guide
No ratings yet
KeeLoq Side-Channel Attack Guide
16 pages
Universal Fault-Injection Attack Detection
No ratings yet
Universal Fault-Injection Attack Detection
14 pages
Wireless Network Vulnerabilities Overview
No ratings yet
Wireless Network Vulnerabilities Overview
4 pages
Preserving Privacy Against Side-Channel Leaks: From Data Publishing To Web Applications 1st Edition Wen Ming Liu Ebook ARC Digital Version
100% (1)
Preserving Privacy Against Side-Channel Leaks: From Data Publishing To Web Applications 1st Edition Wen Ming Liu Ebook ARC Digital Version
62 pages
Understanding Side-Channel Attacks
100% (1)
Understanding Side-Channel Attacks
4 pages
332 Hayashi-Survey of Hardware Trojan Threats and Detection A12b0
No ratings yet
332 Hayashi-Survey of Hardware Trojan Threats and Detection A12b0
5 pages
Network and Information Security in Wireless Optical Network
No ratings yet
Network and Information Security in Wireless Optical Network
14 pages
Synopsys
No ratings yet
Synopsys
7 pages
Understanding Hardware Vulnerabilities
No ratings yet
Understanding Hardware Vulnerabilities
9 pages

GPU SIMT Microarchitecture Overview

Uploaded by

GPU SIMT Microarchitecture Overview

Uploaded by

Chapter 19 GPU SIMT Microarchitecture

Goal: minimize latency of single thread

Goal: maximize throughput via massive parallelism

 CPU: complex microarchitecture = Spectre/Meltdown

SISD: Single Instruction, Single Data (scalar CPU)

SIMT = SIMD + Multi-threading:

2. SIMT Execution: Warps & Thread Blocks

Grid (entire kernel launch)

Example: 1024 threads

Grid: 1024 threads

Warp 0: threads 0-31

All 32 threads in warp execute SAME instruction

All execute in LOCKSTEP (1 cycle)

Hardware View: Streaming Multiprocessor (SM)

 Warp scheduling = shared resource

3. Control Flow Divergence - ‫ مشكلة الـ‬GPU ‫األساسية‬

__global__ void kernel(int *data, int *output) {

If all threads take same path: no problem

Threads 0-31: all data[tid] > 100

If threads diverge: DISASTER

Threads 0-15: data[tid] > 100 (Path A)

→ 2 warp executions (50% efficiency!)

Active Mask (32-bit): 1 = active, 0 = masked

Best case (no divergence): 1 warp = 1 cycle

// Victim kernel (cryptographic operation)

if (k & 0x01) { // Secret-dependent branch!

// Attacker kernel (co-scheduled on same SM)

// If victim has divergence → attacker delayed (SM busy)

Real Attack: CVE-2018-4437 (Apple GPU)

 Divergence timing leaks AES key

4. Warp Scheduling & Occupancy

NVIDIA A100 SM:

Shared memory limit:

Actual: min(8, 10) = 8 blocks

Low Occupancy = Security Leak:

Warp Scheduler Policies:

Round-Robin (NVIDIA default):

// Attacker measures warp scheduling order

5. Memory System: Coalescing & Bank Conflicts

Warp issues load: 32 threads, 32 addresses

Worst case: addresses are SCATTERED

__global__ void coalesced(float *data) {

__global__ void uncoalesced(float *data) {

Observation: Coalesced access = fast, Uncoalesced = slow

Shared Memory Bank Conflicts:

Shared Memory: 64KB, divided into 32 banks

__shared__ float shared_data[1024];

Bank Conflict Side-Channel:

// Secret-dependent access pattern

// Attacker probes shared memory latency

// If victim has conflicts → shared memory busy → attacker delayed

6. Cache Hierarchy & Timing Attacks

L1 Cache = Timing Oracle:

Attack: Prime+Probe on GPU L1

CPU Prime+Probe: clflush instruction

#define L1_SIZE (128 * 1024) // 128KB

__global__ void prime_l1(uint8_t *probe_array) {

// Prime: fill L1 with probe_array

// Probe: measure access time

Real Attack: GPU-based AES Key Recovery

 Target: AES T-table lookups

7. TLB & Page Table Side-Channels

Per-SM TLB: ~16-64 entries (small!)

TLB Thrashing Attack:

// Attacker evicts victim's TLB entries

// Victim experiences TLB misses

Page Table Walk Latency:

TLB hit: ~1 cycle

Attacker measures victim's overall latency

Unified Memory Side-Channel:

8. Synchronization Primitives & Race Conditions

__global__ void kernel() {

// Phase 1: each thread writes

// Phase 2: each thread reads

Barrier Timing Side-Channel:

// Victim kernel (secret-dependent work)

global void kernel(int data, int output) {

global void coalesced(float *data) {

global void uncoalesced(float *data) {

shared float shared_data[1024];

global void prime_l1(uint8_t *probe_array) {

global void kernel() {

global void persistent_kernel(WorkQueue *queue) {

constant uint8_t aes_key[16]; // Secret key

global void encrypt(uint8_t *plaintext) {

global void kernel(float *input) {

global void prime_l1(uint8_t *probe_array) {

global void probe_l1(uint8_t probe_array, uint64_t timings) {