GPU Computing Evolution and Architecture
GPU Computing Evolution and Architecture
Characteristics
GPUs were built only for specific graphics operations.
Stages of rendering like transform, lighting, rasterization, texture mapping were
hardwired.
Very limited programmability.
Examples
Early NVIDIA RIVA series
3dfx Voodoo graphics cards
Limitations
Could not perform general computations.
Programmers had no control over pipeline stages.
Key Idea
GPU stages like vertex and pixel processing became programmable.
Developers could write shader programs (small programs running on the GPU).
Architectural Changes
Separate Vertex Shader Units
Separate Pixel/Fragment Shader Units
Technologies
DirectX 8/9
OpenGL Shader Language (GLSL)
High-Level Shader Language (HLSL)
Outcome
Enabled complex effects like:
per-pixel lighting
realistic shading
water/ smoke simulation
Breakthrough
Instead of separate vertex/pixel processors, GPUs moved to unified cores.
Any shader core could execute any type of workload.
Advantages
Better load balancing
Higher efficiency
Easier to program
Impact
Enabled GPGPU (General Purpose GPU Computing) using:
CUDA
OpenCL
Focus Shift
GPUs now optimized not only for graphics but also for parallel computing.
Architectural Enhancements
Thousands of cores organized into:
Streaming Multiprocessors (SMs – NVIDIA)
Compute Units (CUs – AMD)
Memory Enhancements
High Bandwidth Memory (HBM)
Improved caching
Faster interconnects (NVLink)
Applications
Scientific simulations
Deep learning
Cryptography
Image/video processing
Key Features
Introduction of Tensor Cores (NVIDIA Volta, 2017)
Accelerate matrix math (FP16, BF16, INT8, FP8)
Dedicated Ray Tracing Cores (NVIDIA Turing, 2018)
Hardware-accelerated ray tracing
Architectural Innovations
Mixed precision computing
Multi-instance GPU (MIG)
Improved L1/L2 cache
AI-specific instruction sets
AMD Developments
RDNA / RDNA2 / RDNA3 architectures
Better compute performance-per-watt
Infinity Fabric for multi-chip GPU designs
Applications
Large Language Models (LLMs)
Autonomous driving
Robotics
3D gaming with real-time ray tracing
Examples
AMD RDNA3 chiplets
NVIDIA Blackwell (2024) with modular scaling
Intel Ponte Vecchio (multi-tile GPU)
Benefits
Extreme scalability
Higher memory bandwidth
Efficient heat management
(a) Threads
Smallest unit of execution.
Thousands of threads run together.
(c) Blocks
Threads are grouped into blocks, which execute independently.
(d) Grids
Entire collection of blocks.
Hierarchy:
Grid → Blocks → Threads
CPU way
Uses a loop → processes one element at a time (or few at a time).
Sequential mostly.
GPU way
Launch 1 lakh threads.
Each thread handles one addition.
All 1 lakh additions happen simultaneously.
⏱ Result: GPU finishes in milliseconds.
✔️Matrix operations
✔️Deep learning model training
✔️Image and video processing
✔️Physics simulations
✔️Cryptography
✔️Data analytics (parallel tasks)
Anywhere you need to do the same task repeatedly for large data.
3.3. L2 Cache
Shared by all SMs.
Helps reduce global memory access latency.
5. Memory Interface
Very high bandwidth GDDR6/HBM memory.
Memory controllers distribute data across multiple channels.
CUDA
Hardware
Overview
CUDA
(Compute
Unified
Device
Architecture) is NVIDIA’s parallel computing platform and programming model. It enables general-
purpose computing on GPUs (GPGPU). Understanding CUDA hardware requires knowing how
GPUs are organized internally for massive parallelism.
Warp
Group of 32 threads
Executes instructions SIMT (Single Instruction Multiple Threads)
Thread Block
One or more warps (typically 32–1024 threads)
Scheduled on a single SM
Can use shared memory and synchronize
Grid
Group of multiple blocks
Independently scheduled across SMs
4.3 Registers
Fastest memory
Each thread gets its own set of registers
Warp Scheduling
SM hardware schedules warps
When one warp waits for memory, the scheduler switches to another warp → zero-
overhead context switching
Occupancy
Number of active warps per SM
Higher occupancy = better hide memory latency
1 Registers (Fastest)
1️⃣
Located inside GPU cores.
Automatically allocated by compiler.
Extremely fast (1 cycle access).
Limited — register spills go to local memory.
Example:
__shared__ float data[256];
4️⃣Local Memory
Per-thread storage.
Stored in global memory (slow).
Used when:
Register usage exceeds limit.
Large arrays declared inside kernels.
5️⃣Constant Memory
Cached read-only memory.
Fast if all threads read the same address.
Good for:
Coefficients
Filter kernels
Example
__constant__ float coeff[32];
6️⃣Texture Memory
Read-only.
Special cache optimized for 2D/3D locality.
Used in:
Image processing
Interpolation
🚀 Key Memory Optimization Techniques
✔ 2. Coalesced Access
Threads in the same warp must access contiguous global memory locations.
Bad:
A[ threadIdx.x * stride ]
Good:
A[ threadIdx.x ]
__syncthreads();
c[tid] = temp[threadIdx.x];
}
UNIT 2 CUDA PROGRAMMING
🚀 What is CUDA?
CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model created by NVIDIA.
It allows you to use the power of NVIDIA GPUs (Graphics Processing Units) to accelerate general-
purpose computing tasks — not just graphics.
Architecture Components
SM (Streaming Multiprocessor) → executes threads
Warp (32 threads) → smallest scheduling unit
Global memory → large memory for all threads
Shared memory → fast small memory per block
Registers → fastest storage for threads
🔹 Global Memory
Largest memory on GPU
Slowest
Accessible by all threads
Must use coalesced access for better performance
🔹 Shared Memory
Small, very fast
Shared by threads in the same block
Acts like a “software-managed cache”
Boosts performance significantly
🔹 Registers
Fastest memory
Private to each thread
Limited amount
🔹 Constant Memory
Read-only
Cached
Fast when all threads read the same value
🔹 Texture Memory
Read-only
Special hardware caching
Good for 2D/3D spatial data (images)
🔹 Thread
Smallest execution unit
Runs part of kernel code
🔹 Block
Group of threads
Threads within block share shared memory
🔹 Grid
Group of blocks
Entire kernel launch forms a grid
🔹 Warp
32 threads
Executed in lockstep
Warp divergence slows performance
Host code
int main() {
int a[5] = {1,2,3,4,5};
int b[5] = {10,20,30,40,50};
int c[5];
int *dA, *dB, *dC;
cudaMalloc(&dA, 5*sizeof(int));
cudaMalloc(&dB, 5*sizeof(int));
cudaMalloc(&dC, 5*sizeof(int));
🔹 SM (Streaming Multiprocessor)
Heart of the GPU
Contains:
CUDA cores
Warp schedulers
Shared memory
Registers
🔸 Memory Coalescing
Ensure consecutive threads access consecutive addresses
🔸 Use Shared Memory
To avoid repeated global memory access
🔸 Increase Occupancy
Use enough threads per block
Host (CPU)
Manages memory
Launches kernels
Controls device
Device (GPU)
Executes kernels
Runs thousands of threads in parallel
Kernel Launch
Uses syntax:
kernel<<<numBlocks, threadsPerBlock>>>();
Example:
myKernel<<<16, 256>>>();
Means:
16 blocks
256 threads per block
Total = 4096 threads running parallel
Why Multi-GPU?
Multi-GPU architectures are used when:
A single GPU is not sufficient for large workloads.
Applications require massive parallelism.
Training large AI/ML models.
Running scientific simulations, bioinformatics, fluid dynamics, or rendering.
Increasing system scalability without replacing existing hardware.
1. Data Parallelism
Same model/program runs on all GPUs.
Input data is partitioned across GPUs.
Common in deep learning (e.g., PyTorch Distributed, TensorFlow MirroredStrategy).
GPUs periodically synchronize weights.
Example:
4 GPUs each get 25% of a dataset batch.
2. Model Parallelism
The model is too large to fit on a single GPU.
Model layers or components are split across GPUs.
Example:
Transformer layers distributed across different GPUs in LLM training.
3. Pipeline Parallelism
Stages of the model or computation pipeline run on different GPUs.
Enables continuous data flow like an assembly line.
Often used in large-model training (Megatron-LM, DeepSpeed).
4. Hybrid Parallelism
Combination of data, model, and pipeline parallelism.
Used in GPT-3, Llama, BERT-large training.
Multi-GPU in CUDA
CUDA supports multi-GPU programming using:
✔ cudaGetDeviceCount()
Check number of GPUs.
✔ cudaSetDevice(device_id)
Select which GPU runs the kernel.
Memory Models:
Independent memory: each GPU has its own memory.
Unified Virtual Addressing (UVA): seamless addressing across GPUs/CPU.
Managed memory: CUDA automatically handles migrations.
1. Desktop Multi-GPU
Historically NVIDIA SLI / AMD CrossFire for gaming.
Now mainly used in:
Content creation
Video rendering
CUDA computation
2. Server-Grade Multi-GPU
Used in DGX/HGX servers:
4, 8, 16, or 32 GPUs in one node.
NVLink + NVSwitch for high-speed communication.
Used in supercomputing, HPC clusters, AI training.
AI/ML
Training LLMs (GPT, Llama, BERT)
Reinforcement learning
Scientific HPC
Molecular dynamics (GROMACS)
CFD simulations
Weather modeling
Engineering
Finite element analysis (ANSYS)
Rendering (Blender Cycles X)
Medical / Biology
Genomics
Protein folding (AlphaFold)
Example
Matrix addition:
Each element C[i][j]=A[i][j]+B[i][j]
→ Each thread computes one element.
This fits naturally into CUDA grid/block/thread mapping.
B. Task Decomposition
Break the application into independent operations (functions).
Examples:
Preprocessing → Kernel 1
Computation → Kernel 2
Reduction → Kernel 3
Useful for:
Pipeline processing
Independent algorithmic stages
Asynchronous streams
1. Thread-Level Decomposition
Each thread handles:
Single element
Single pixel
Single particle
One iteration of a loop
Tips
Ensure threads execute the same instruction path (avoid divergence)
Use enough threads to fill all SMs (typically thousands)
2. Block-Level Decomposition
Blocks should handle independent portions with optional shared memory.
Example:
A block handles a tile of a matrix
Shared memory stores tile for reuse
Threads cooperate using __syncthreads()
Benefits
Reduces global memory access
Enables tiling & cache efficiency
Improves memory coalescing
3. Grid-Level Decomposition
Grids allow splitting very large data sets:
Millions of threads
Adaptive grid-stride loops
Kernel launches based on problem size
Grid-Stride Loop Pattern
for (int i = tid; i < N; i += blockDim.x * gridDim.x) {
out[i] = in[i] * in[i];
}
1. Memory Coalescing
Arrange data so that:
Thread 0 → memory[0]
Thread 1 → memory[1]
→ contiguous accesses
Poor decomposition → scattered access → slow global loads.
Result
✔ 10× to 50× faster than naive version
✔ Due to better decomposition & memory reuse
✅ 5. Summary Table
Technique Purpose
Data Decomposition Split data → threads/blocks
Task Decomposition Split functions → kernels/streams
Tiling Reuse data via shared memory
Coalesced Access Efficient global memory use
Minimizing Divergence Better warp efficiency
Grid-Stride Loops Handle large data sets
Streams Pipeline parallelism
🎯 Final Summary
Optimizing CUDA applications starts with breaking the problem into parallelizable pieces. A
good decomposition ensures:
High occupancy
Memory efficiency
Low divergence
Maximum throughput
It’s the foundation of all other CUDA optimizations.
Memory
A. Registers
Fastest memory, private to each thread.
Allocated automatically by the compiler.
Excess register usage → spills into local memory (slow!)
Consideration:
✔ Optimize register usage; use --ptxas-options=-v to see register count.
B. Shared Memory
Very fast, on-chip memory.
Accessible by all threads in a block.
Lifetime = block duration.
Limited size: typically 48–100 KB per block depending on GPU.
Best use cases:
Caching frequently accessed data.
Reducing global memory accesses.
Tiling (e.g., matrix multiplication).
Considerations:
✔ Avoid bank conflicts.
✔ Divide data into tiles to improve locality.
C. Global Memory
Largest memory region, but high latency (~400–800 cycles).
Accessible by all threads and the host.
Performance considerations:
1. Memory Coalescing
Threads in the same warp should access contiguous memory locations.
2. Alignment
Align data to 32, 64, or 128 bytes for coalesced access.
3. Minimize transfers
Host ↔ Device transfers are slow:
Use pinned memory for faster transfers.
4. Reuse data
Load once → store in shared memory or registers.
D. Constant Memory
Read-only memory, cached.
Fast if all threads access the same address (broadcast).
Use cases:
Constants, coefficients, lookup tables.
E. Texture Memory
Read-only, cached memory with special spatial locality optimization.
Useful for:
Image processing
Interpolation
2D/3D texture access patterns
F. Local Memory
Misleading name: this is actually off-chip global memory.
Used when:
Compiler spills registers, or
You declare large per-thread arrays.
Avoid whenever possible!
G. Unified Memory
Single memory space accessible by CPU and GPU.
Simplifies programming but may incur page faults during migrations.
Suitable for prototyping and irregular memory access patterns.
✔ 1. Coalesced Access
Ensure warp threads access consecutive memory addresses.
✔ 2. Avoid Divergence
Avoid memory patterns that cause divergence in a warp.
cudaMemcpy()
Basic transfer function.
cudaMemcpy(dest, src, size, cudaMemcpyHostToDevice);
cudaMemcpy(dest, src, size, cudaMemcpyDeviceToHost);
cudaMemcpy(dest, src, size, cudaMemcpyDeviceToDevice);
cudaMemcpyAsync()
Asynchronous transfer using streams.
Allows:
Overlapping transfer with kernel execution
Non-blocking behavior
(a) Threads
The smallest unit of execution.
Each thread runs the same kernel code, but operates on different data.
Identified by threadIdx.x, threadIdx.y, threadIdx.z.
(c) Grid
A collection of multiple thread blocks.
Blocks in a grid run independently.
Important considerations:
More threads per block → better occupancy
But too many → register/shared memory limits exceeded
Launch configuration:
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
vectorAdd<<<numBlocks, blockSize>>>(A, B, C, N);
🔹 6. Thread Synchronization
Threads inside a block can synchronize:
__syncthreads();
1. Register Contention
Each Streaming Multiprocessor (SM) has a fixed number of registers.
If your kernel uses too many registers per thread:
fewer threads can run simultaneously (low occupancy)
performance drops
Occurs commonly with complex kernels or heavy math operations.
Solution:
Reduce register usage (__restrict__, compiler optimizations)
Launch fewer threads per block
Use -maxrregcount compiler flag (with care)
🔶 7. SM Resource Contentions
Each SM has limited:
warps
blocks
registers
shared memory
thread slots
If your kernel demands too much of one resource:
fewer blocks can be scheduled
GPU can't fully allocate SM resources → low occupancy
✅ 1. Basic Problems
a. Input/Output
Read a number and print it.
Print patterns (stars, numbers, pyramids).
Sum of digits.
b. Mathematical
Check prime number.
Find GCD / LCM.
Fibonacci series.
Factorial of a number.
Armstrong number.
Palindrome number.
✅ 2. Array Problems
Find maximum / minimum in an array.
Reverse an array.
Linear / Binary search.
Sum of array elements.
Remove duplicates.
Kadane’s algorithm (maximum subarray sum).
Move zeros to the end.
Find missing number in 1 to n.
Rotate an array (left/right rotation).
✅ 3. String Problems
Reverse a string.
Check palindrome string.
Count vowels and consonants.
Anagram checking.
Longest substring without repeating characters.
Count character occurrences.
Convert lower to upper case.
✅ 4. Recursion Problems
Factorial using recursion.
Fibonacci using recursion.
Tower of Hanoi.
Permutations of a string.
Sum of array using recursion.
✅ 5. Sorting Problems
Bubble sort.
Selection sort.
Insertion sort.
Merge sort.
Quick sort.
Count sort.
cudaGetLastError() returns the last error from runtime API or kernel launch.
cudaDeviceSynchronize() ensures all preceding work is completed and reports runtime
errors.
// Usage
CUDA_CHECK(cudaMalloc((void**)&d_array, size * sizeof(int)));
5. Device Reset
If a fatal error occurs, it’s good practice to reset the GPU:
cudaDeviceReset();
6. Debugging Tips
Use cuda-memcheck to detect memory errors:
cuda-memcheck ./my_program
Parallel programming can give huge speedups, but it comes with its own set of challenges and
issues. Here's a detailed overview:
1. Race Conditions
Occur when multiple threads access shared data concurrently and at least one thread
modifies it.
Results are non-deterministic depending on execution order.
Example:
int counter = 0;
#pragma omp parallel for
for(int i=0; i<100; i++) {
counter++; // Race condition: multiple threads increment simultaneously
}
Solution:
Use locks/mutexes (std::mutex in C++)
Atomic operations (std::atomic<int> counter;)
Thread-local storage
2. Deadlocks
Happens when two or more threads wait indefinitely for resources held by each other.
Common with nested locks or improper resource acquisition order.
Solution:
Acquire locks in a consistent order.
Use try-locks with timeout.
Minimize shared resources.
3. Livelocks
Threads are active but cannot make progress because they keep responding to each other.
Similar to deadlock but CPU is busy spinning.
Solution:
Introduce delays or back-off strategies.
4. Starvation
A thread never gets CPU time or resources because higher-priority threads dominate.
Solution:
Use fair scheduling policies or priority adjustments.
6. Load Balancing
Unequal distribution of work causes some threads to finish early while others are still busy.
Wastes parallelism and reduces speedup.
Solution:
Dynamic scheduling (#pragma omp parallel for schedule(dynamic))
Work-stealing techniques
7. Overhead of Synchronization
Locks, barriers, and atomic operations introduce performance overhead.
Too fine-grained parallelism can be slower than sequential code.
Solution:
Minimize shared data access
Reduce frequency of synchronization
Use lock-free or reduction patterns when possible
8. False Sharing
Occurs when multiple threads modify variables that reside on the same cache line.
Causes unnecessary cache coherence traffic.
Solution:
Pad shared variables to align with cache lines
Avoid putting frequently updated variables together
9. Scalability Limitations
Speedup is limited by Amdahl’s Law:
S=(1−P)+NP1
Where P is parallelizable fraction, N is number of threads.
Implication:
Beyond a certain number of threads, adding more cores gives diminishing returns.
1. Race Conditions
What it is: Occurs when the outcome of a program depends on the order or timing of thread
execution.
Example: Two threads updating a shared counter simultaneously might overwrite each
other’s updates.
Symptom: Inconsistent or incorrect results.
Solution: Use mutexes, locks, or atomic operations to ensure exclusive access.
2. Deadlocks
What it is: Two or more threads are waiting indefinitely for resources held by each other.
Example: Thread A holds Lock 1 and waits for Lock 2, while Thread B holds Lock 2 and
waits for Lock 1.
Symptom: Program hangs.
Solution:
Always acquire locks in a consistent order.
Use timeout-based locking.
Detect and recover from deadlocks.
3. Livelocks
What it is: Threads keep changing state in response to each other but make no progress.
Example: Two threads repeatedly yield to each other trying to avoid deadlock but never
proceed.
Symptom: Program runs but does nothing useful.
Solution: Introduce randomness or back-off strategies.
4. Starvation
What it is: A thread never gets CPU time or access to a resource because others
continuously take precedence.
Example: Low-priority thread never runs because high-priority threads dominate the CPU.
Symptom: Some tasks never complete.
Solution: Use fair scheduling or priority inheritance.
5. Priority Inversion
What it is: A low-priority thread holds a resource needed by a high-priority thread, causing
the high-priority thread to wait.
Symptom: High-priority tasks are blocked unexpectedly.
Solution: Use priority inheritance protocols to temporarily boost low-priority thread’s
priority.
6. Atomicity Violations
What it is: Non-atomic operations on shared resources are interrupted by other threads.
Example: x = x + 1 is not atomic; two threads doing it simultaneously can overwrite each
other’s increment.
Symptom: Counter or shared data becomes incorrect.
Solution: Use atomic operations or locks.
8. False Sharing
What it is: Threads access variables that reside on the same CPU cache line, causing
unnecessary cache invalidations.
Symptom: Performance degradation, even if logically independent.
Solution: Align variables to separate cache lines.
algorithmic issues and programming issues in software development. I’ll break it down
carefully:
1. Algorithmic Issues
Algorithmic issues are problems that arise due to the design, logic, or efficiency of algorithms.
These can affect the correctness, speed, and scalability of a program.
Common types of algorithmic issues:
1. Incorrect logic
The algorithm doesn’t produce the intended result.
Example: Miscalculating the sum of an array due to off-by-one errors.
2. Inefficient algorithms
Takes too long or uses too much memory.
Example: Using bubble sort for a very large dataset instead of merge sort.
3. Unoptimized recursion
Excessive recursive calls can lead to stack overflow or high memory use.
Example: A naive Fibonacci function using recursion.
4. Failure to handle edge cases
Algorithm fails on empty inputs, maximum/minimum values, or unexpected inputs.
5. Poor scalability
Algorithm works for small inputs but fails with large datasets due to high time or
space complexity.
2. Programming Issues
Programming issues are problems caused during implementation of code, not necessarily the
algorithm itself. These are more related to coding errors, environment, or integration.
Common programming issues:
1. Syntax errors
Code cannot compile or run.
Example: Missing semicolons, unmatched parentheses.
2. Runtime errors
Code crashes while running.
Example: Division by zero, null pointer exceptions.
3. Logical errors
Program runs but gives wrong results.
Example: Using < instead of <= in a loop condition.
4. Resource management issues
Memory leaks, file handle leaks, database connections not closed.
5. Concurrency issues
Race conditions, deadlocks, thread starvation when multiple threads access shared
resources.
6. Security vulnerabilities
Buffer overflows, SQL injection, improper input validation.
7. Integration issues
Problems when combining modules or APIs, often due to mismatched data formats
or protocols.
finding and avoiding errors are central programming issues, often discussed under debugging,
error handling, and software reliability. Let’s break it down carefully:
1. Purpose
OpenCL standardizes parallel programming across different types of hardware. Its main goals are:
Portability: Write code once, run it on multiple types of devices.
Performance: Enable efficient use of hardware parallelism.
Interoperability: Integrate with graphics APIs like OpenGL, Vulkan, or DirectX.
2. Architecture
OpenCL defines a platform model and a programming model.
A. Platform Model
Host: Usually the CPU, responsible for running the main application.
Compute Devices: Hardware capable of parallel execution (GPU, CPU cores, FPGA).
Compute Units (CUs): Each device is divided into CUs (e.g., GPU multiprocessors).
Processing Elements (PEs): Individual cores within a CU.
B. Memory Model
Global Memory: Accessible by all work-items and persistent across kernel executions.
Constant Memory: Read-only memory shared across work-items.
Local Memory: Shared among work-items within a workgroup; faster than global memory.
Private Memory: Each work-item has its own private memory.
C. Execution Model
Kernels: Functions written in OpenCL C executed on compute devices.
Work-items: Smallest unit of execution.
Work-groups: Groups of work-items that share local memory and can synchronize.
3. Programming Model
OpenCL programs are typically divided into two parts:
1. Host Program: Runs on CPU, manages devices, memory, and kernel execution.
2. Device Program (Kernel): Runs on compute devices in parallel.
Key Concepts:
NDRange (N-Dimensional Range): Defines the grid of work-items for execution.
Enqueueing: Commands like kernel execution, memory read/write are enqueued to a
command queue.
4. Key Features
Heterogeneous computing support.
Explicit control over memory and task parallelism.
Synchronization primitives for work-items and work-groups.
Interoperability with graphics APIs.
Supports vector and scalar data types.
5. Versions
OpenCL has evolved over time:
OpenCL 1.x: Initial support for CPUs, GPUs, and basic parallelism.
OpenCL 2.x: Added shared virtual memory, pipes, device-side enqueue.
OpenCL 3.0: Unified and modular; optional features to improve portability and adoption.
6. Use Cases
Scientific computing (simulation, weather modeling).
Image and signal processing.
Machine learning and AI inference.
Financial modeling.
Real-time rendering and graphics.
In OpenCL, a kernel is a fundamental concept: it’s the function that runs on a compute device
(GPU, CPU, or other accelerators) in parallel across many work-items. You can think of a
kernel as the “core computation” that you want to perform on large amounts of data.
Here’s a detailed breakdown:
1. Definition
A kernel is written in OpenCL C (a C-like language) and is executed on the device, not on the host
(CPU). The host program launches the kernel across a grid of work-items.
Example Kernel:
__kernel void vector_add(__global const float *A,
__global const float *B,
__global float *C) {
int i = get_global_id(0); // unique index for this work-item
C[i] = A[i] + B[i];
}
2. Execution Model
OpenCL executes kernels in parallel using NDRange (N-Dimensional Range):
Work-item: Smallest execution unit; runs one instance of the kernel.
Work-group: A group of work-items that can share local memory.
Global size: Total number of work-items.
Local size: Number of work-items in a work-group.
Example:
If you have 1024 data elements and a local size of 64, OpenCL creates 1024 work-items organized
into 16 work-groups (1024 ÷ 64).
4. Synchronization
Within a work-group, work-items can synchronize using:
barrier(CLK_LOCAL_MEM_FENCE);
5. Launching a Kernel
From the host program, you enqueue kernels like this:
size_t global_size = 1024;
size_t local_size = 64;
6. Key Points
A kernel runs in parallel across many work-items.
Each work-item has a unique ID used to access its portion of data.
Memory type and synchronization are important for performance and correctness.
Kernels are compiled at runtime or ahead-of-time and executed on devices, not the host.
host-
device
interaction, it generally refers to the way a host system (like a CPU-based computer)
communicates and coordinates with a device (like a GPU, FPGA, or peripheral hardware) in
computing systems. This concept is crucial in parallel programming, hardware acceleration,
and embedded systems. Let me break it down clearly:
Step 4: Synchronization
Host waits for the device to finish processing if needed.
cudaDeviceSynchronize();
Step 6: Cleanup
Free the allocated device memory.
cudaFree(d_array);
3. Communication Mechanisms
Direct Memory Access (DMA): Device can read/write memory directly with minimal CPU
involvement.
PCIe Bus: Standard for connecting host CPU to GPUs and other accelerators.
Unified Memory (in CUDA): Allows host and device to share memory space for easier
programming.
Command Queues: Host sends commands to device asynchronously.
execution
3. Memory Model
OpenCL defines a hierarchical memory model for efficient execution:
Memory
Scope Characteristics
Type
All work-items on all compute
Global High latency, visible to host and all kernels
units
All work-items on all compute
Constant Read-only, cached for efficiency
units
Work-items within a compute Shared among work-items of a workgroup, faster
Local
unit than global
Private Individual work-item Fastest, private to the work-item
4. Execution Model
Execution in OpenCL is based on kernels (functions executed on devices).
Kernels run across a NDRange (N-Dimensional range of work-items).
Work-items: Single execution instances of the kernel.
Work-groups: Group of work-items that share local memory.
Hierarchy:
NDRange (entire problem) → Work-groups → Work-items
5. Command-Queue Execution
The host controls execution via command queues:
1. Write data to device memory.
2. Enqueue kernel for execution.
3. Read results back from device memory.
Commands can be in-order or out-of-order, affecting scheduling.
In OpenCL, the memory model defines how memory is organized, how data is accessed by the
host and kernels, and how memory consistency and visibility are handled across multiple
compute units and work-items. Understanding this is crucial for writing efficient and correct
OpenCL programs. Let’s break it down clearly:
Atomic operations are used to prevent race conditions when multiple work-items update
shared memory.
basic OpenCL examples that demonstrate how to write and execute simple OpenCL
programs. I’ll include vector addition and scalar multiplication as typical starting points.
#define ARRAY_SIZE 8
int main() {
int A[ARRAY_SIZE] = {1,2,3,4,5,6,7,8};
int B[ARRAY_SIZE] = {8,7,6,5,4,3,2,1};
int C[ARRAY_SIZE];
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
printf("Result: ");
for(int i=0; i<ARRAY_SIZE; i++) printf("%d ", C[i]);
printf("\n");
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);
return 0;
}
Output:
Result: 9 9 9 9 9 9 9 9
2. Scalar Multiplication Example
Multiply each element of an array by a constant.
Kernel Code (scalar_mul.cl):
__kernel void scalar_mul(__global int* input, __global int* output, int scalar) {
int id = get_global_id(0);
output[id] = input[id] * scalar;
}
Host Code:
#include <CL/cl.h>
#include <stdio.h>
int main() {
int input[8] = {1,2,3,4,5,6,7,8};
int output[8];
int scalar = 5;
// Execute kernel
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
printf("Output: ");
for(int i=0;i<8;i++) printf("%d ", output[i]);
printf("\n");
}
Output:
Output: 5 10 15 20 25 30 35 40
parallel patterns in computing with convolution as an example. Let’s break it down clearly.
4. Visualization
Input Image (I) Kernel (K)
+---+---+---+ +---+---+
|a|b|c| |k|l|
|d|e|f| * |m|n|
|g|h|i|
+---+---+---+
Each output pixel calculation is independent, so you can compute many pixels in parallel.
GPU-based implementation of the prefix sum (scan) algorithm, which is a classic parallel
algorithm used in many GPU applications. Let’s go step by step.
Up-Sweep:
for d = 0 to log2(n)-1
for k = 0 to n-1 by 2^(d+1) // parallel
A[k + 2^(d+1) - 1] += A[k + 2^d - 1]
Down-Sweep:
A[n-1] = 0
for d = log2(n)-1 down to 0
for k = 0 to n-1 by 2^(d+1) // parallel
t = A[k + 2^d - 1]
A[k + 2^d - 1] = A[k + 2^(d+1) - 1]
A[k + 2^(d+1) - 1] += t
// Down-sweep phase
for (int d = 1; d < n; d *= 2) {
offset >>= 1;
__syncthreads();
if (thid < d) {
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
int t = temp[ai];
temp[ai] = temp[bi];
temp[bi] += t;
}
}
__syncthreads();
Notes:
n must be a power of 2 for this simple version.
temp is stored in shared memory for speed.
Each thread works on two elements.
For larger arrays, multiple blocks and a block-sum adjustment is needed.
Sparse Matrix
algorithms on GPU. Sparse matrices are matrices in which most elements are zero. Because of
the sparsity, storing and computing them efficiently requires special algorithms and data
structures. GPUs, with their massive parallelism, can accelerate sparse matrix computations
significantly.
a) Single GPU
Use CUDA or OpenCL.
Example CUDA workflow:
1. Allocate memory on device (GPU).
2. Copy matrices A and B to device.
3. Launch a kernel to compute C[i,j] per thread.
4. Use shared memory to optimize access for sub-blocks (tiling).
5. Copy result back to host.
CUDA kernel example (conceptual):
__global__ void matMulKernel(float* A, float* B, float* C, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
for (int k = 0; k < N; ++k)
sum += A[row * N + k] * B[k * N + col];
C[row * N + col] = sum;
}
b) Heterogeneous Clusters
A heterogeneous GPU cluster may consist of:
Different GPU types (e.g., NVIDIA + AMD).
CPUs + GPUs together.
Multiple nodes interconnected via MPI/InfiniBand.
Key considerations:
1. Workload partitioning: Divide the matrix into blocks assigned to different devices based
on computational power.
2. Device-specific kernels: May need to tune kernels per GPU type.
3. Communication: Use MPI or NCCL to transfer blocks of matrices and gather results.
4. Overlap communication and computation: Send next data blocks while computing
current ones to hide latency.
e) Optimizations
1. Tiling / blocking → minimize global memory access.
2. Asynchronous transfers → overlap host-device communication with computation.
3. Load balancing → important for heterogeneous clusters.
4. Use optimized libraries if available:
cuBLAS (NVIDIA)
clBLAS / oneAPI (Intel, AMD)
MAGMA for multi-GPU
CUDA leverages a sophisticated memory hierarchy consisting of registers, shared, global, constant, and texture memory, each serving specific roles to optimize performance. Registers, being the fastest, serve as temporary storage within threads, minimizing latency for frequently accessed data . Shared memory speeds up access times by allowing blocks of threads to share data effectively, reducing slow global memory accesses . Global memory, while being the largest storage, is slower and benefitted from coalesced access patterns to minimize latency . Constant memory allows efficient distribution of read-only data, suitable for broadcast among threads . Texture memory is employed for its caching mechanism benefiting spatial locality, particularly in image processing tasks . These types together underpin CUDA’s ability to handle large-scale computations efficiently by optimizing data movement and access patterns .
Key components of CUDA memory optimization include choosing the right memory type and maximizing memory coalescing. Shared memory is fast but limited and should be used to reduce global memory access . Registers are the fastest but have limited space per thread, requiring careful tracking to avoid spills into local memory . Efficient usage of memory involves aligning data for coalesced access in global memory, using shared memory effectively for caching and data reuse, and leveraging constant memory for broadcast operations . It is critical to choose the right memory type based on latency, scope, and bandwidth needs .
Shared memory in CUDA offers low-latency access and can significantly increase performance by reducing the need to repeatedly fetch data from the slower global memory . It allows for efficient data reuse through techniques like tiling, which partitions data into small, re-usable segments . However, its limited size (typically 48–100 KB per block) necessitates careful management to avoid conflicts and under-utilization . Additionally, shared memory is only accessible within a single block of threads, limiting its usability for cross-block data sharing .
Correct and efficient memory management in CUDA includes understanding the memory hierarchy and ensuring memory coalescing and proper alignment . Techniques include allocating memory using the appropriate API calls (`cudaMalloc`, `cudaMemcpy`) and releasing them appropriately (`cudaFree`) to prevent leaks and keep system resources optimized . Managing data reuse through shared memory and constant memory for read-only data can reduce repeated access to slower global memory . Moreover, balancing register use to avoid spilling into slower local memory improves performance by reducing latency .
Tiling optimizes matrix multiplication by partitioning matrices into smaller sub-blocks called tiles, which fit into fast shared memory, minimizing slow global memory accesses . Each tile is processed by a block of threads, allowing the reuse of data during computation and reducing the need for repeated global memory access . Tiling leverages shared memory to hold needed matrix segments for computation, significantly speeding up the process via localized computation and efficient synchronization and thus improving overall parallel throughput . This approach exploits both spatial and temporal data locality to enhance performance .
CUDA's architecture, with its many GPU cores and streaming multiprocessors (SMs), supports massive parallelism, contrasting sharply with the few, more complex cores in a CPU. Each SM in CUDA can execute thousands of lightweight threads simultaneously, taking advantage of data-level parallelism that is not feasible with CPU architectures . CUDA employs a memory hierarchy optimized for different speeds and usages, such as global, shared, and constant memory, which together improve overall throughput by reducing memory bottlenecks . Furthermore, CUDA allows more efficient handling of parallel workloads because it minimizes the context switch overhead and scales efficiently with workload size .
OpenCL and CUDA share commonalities in that they both enable parallel computing across multiple processing elements. However, OpenCL provides a platform-independent framework that supports execution across heterogeneous devices like GPUs, CPUs, and FPGAs . In contrast, CUDA is specifically optimized for NVIDIA GPUs, enabling deeper integration with NVIDIA's hardware architecture . OpenCL uses a command-queue-based execution model where the host enqueues commands, whereas CUDA uses a more direct launch of kernels with device-specific optimizations . This difference in model reflects OpenCL’s broader target device support but may result in non-optimal performance on any single device compared to CUDA’s GPU-centric design .
Constant memory in CUDA is advantageous for broadcasting read-only data to many threads, which can result in notable speedups when all threads read the same data location due to its cache-like low latency . It is best used for data that does not change across kernel calls, like coefficients in mathematical operations . Effective usage involves keeping constant memory usage within its small size constraints (64 KB) and ensuring cache efficiency by having all threads access it simultaneously . Such access patterns maximize cache hits and minimize global memory accesses .
Optimizing data transfers between CPU and GPU starts by minimizing unnecessary data movements and using asynchronous transfers where possible. Employing appropriate memory allocation strategies, like pinned memory, can accelerate data transfers . Overlapping computation and data transfer through CUDA streams can also optimize runtime . Furthermore, utilizing unified memory, which allows seamless data access from both host and device memory spaces, can simplify development but might not always be faster . Efficient use involves tuning data sizes and memory access patterns to exploit full bandwidth capabilities of the PCIe interface .
CUDA handles divergent execution paths within warps by serializing the execution of different paths, which reduces parallel efficiency . Each warp consists of 32 threads executed in lockstep, so when a conditional branch causes threads to diverge, CUDA executes each path serially, stalling others and thereby decreasing warp utilization . This branching divergence can significantly impact performance, especially if it leads to extensive serialization and stalls, making minimizing warp divergence an essential optimization goal .