0% found this document useful (0 votes)

41 views77 pages

GPU Computing Evolution and Architecture

Q: How does CUDA leverage different types of memory to improve performance, and what role does each play in the architectural design?

CUDA leverages a sophisticated memory hierarchy consisting of registers, shared, global, constant, and texture memory, each serving specific roles to optimize performance. Registers, being the fastest, serve as temporary storage within threads, minimizing latency for frequently accessed data . Shared memory speeds up access times by allowing blocks of threads to share data effectively, reducing slow global memory accesses . Global memory, while being the largest storage, is slower and benefitted from coalesced access patterns to minimize latency . Constant memory allows efficient distribution of read-only data, suitable for broadcast among threads . Texture memory is employed for its caching mechanism benefiting spatial locality, particularly in image processing tasks . These types together underpin CUDA’s ability to handle large-scale computations efficiently by optimizing data movement and access patterns .

Q: What are the key components and considerations involved in optimizing memory usage in CUDA programming?

Key components of CUDA memory optimization include choosing the right memory type and maximizing memory coalescing. Shared memory is fast but limited and should be used to reduce global memory access . Registers are the fastest but have limited space per thread, requiring careful tracking to avoid spills into local memory . Efficient usage of memory involves aligning data for coalesced access in global memory, using shared memory effectively for caching and data reuse, and leveraging constant memory for broadcast operations . It is critical to choose the right memory type based on latency, scope, and bandwidth needs .

Q: What are the potential benefits and downsides of using shared memory in CUDA programs?

Shared memory in CUDA offers low-latency access and can significantly increase performance by reducing the need to repeatedly fetch data from the slower global memory . It allows for efficient data reuse through techniques like tiling, which partitions data into small, re-usable segments . However, its limited size (typically 48–100 KB per block) necessitates careful management to avoid conflicts and under-utilization . Additionally, shared memory is only accessible within a single block of threads, limiting its usability for cross-block data sharing .

Q: What are the considerations and techniques used to ensure correct and efficient memory management in CUDA applications?

Correct and efficient memory management in CUDA includes understanding the memory hierarchy and ensuring memory coalescing and proper alignment . Techniques include allocating memory using the appropriate API calls (`cudaMalloc`, `cudaMemcpy`) and releasing them appropriately (`cudaFree`) to prevent leaks and keep system resources optimized . Managing data reuse through shared memory and constant memory for read-only data can reduce repeated access to slower global memory . Moreover, balancing register use to avoid spilling into slower local memory improves performance by reducing latency .

Q: What role does tiling play in optimizing matrix multiplication in CUDA, and how does it utilize the memory hierarchy?

Tiling optimizes matrix multiplication by partitioning matrices into smaller sub-blocks called tiles, which fit into fast shared memory, minimizing slow global memory accesses . Each tile is processed by a block of threads, allowing the reuse of data during computation and reducing the need for repeated global memory access . Tiling leverages shared memory to hold needed matrix segments for computation, significantly speeding up the process via localized computation and efficient synchronization and thus improving overall parallel throughput . This approach exploits both spatial and temporal data locality to enhance performance .

Q: How does CUDA's architecture enable efficient parallel computation compared to traditional CPU architectures?

CUDA's architecture, with its many GPU cores and streaming multiprocessors (SMs), supports massive parallelism, contrasting sharply with the few, more complex cores in a CPU. Each SM in CUDA can execute thousands of lightweight threads simultaneously, taking advantage of data-level parallelism that is not feasible with CPU architectures . CUDA employs a memory hierarchy optimized for different speeds and usages, such as global, shared, and constant memory, which together improve overall throughput by reducing memory bottlenecks . Furthermore, CUDA allows more efficient handling of parallel workloads because it minimizes the context switch overhead and scales efficiently with workload size .

Q: How does the execution model of OpenCL differ from CUDA when it comes to parallel computing tasks?

OpenCL and CUDA share commonalities in that they both enable parallel computing across multiple processing elements. However, OpenCL provides a platform-independent framework that supports execution across heterogeneous devices like GPUs, CPUs, and FPGAs . In contrast, CUDA is specifically optimized for NVIDIA GPUs, enabling deeper integration with NVIDIA's hardware architecture . OpenCL uses a command-queue-based execution model where the host enqueues commands, whereas CUDA uses a more direct launch of kernels with device-specific optimizations . This difference in model reflects OpenCL’s broader target device support but may result in non-optimal performance on any single device compared to CUDA’s GPU-centric design .

Q: What are the advantages of using constant memory in CUDA, and how can it be used effectively?

Constant memory in CUDA is advantageous for broadcasting read-only data to many threads, which can result in notable speedups when all threads read the same data location due to its cache-like low latency . It is best used for data that does not change across kernel calls, like coefficients in mathematical operations . Effective usage involves keeping constant memory usage within its small size constraints (64 KB) and ensuring cache efficiency by having all threads access it simultaneously . Such access patterns maximize cache hits and minimize global memory accesses .

Q: What strategies can be used to optimize the execution of a kernel that involves large data transfers between CPU and GPU?

Optimizing data transfers between CPU and GPU starts by minimizing unnecessary data movements and using asynchronous transfers where possible. Employing appropriate memory allocation strategies, like pinned memory, can accelerate data transfers . Overlapping computation and data transfer through CUDA streams can also optimize runtime . Furthermore, utilizing unified memory, which allows seamless data access from both host and device memory spaces, can simplify development but might not always be faster . Efficient use involves tuning data sizes and memory access patterns to exploit full bandwidth capabilities of the PCIe interface .

Q: How does CUDA handle divergent execution paths within warps and what is the impact on performance?

CUDA handles divergent execution paths within warps by serializing the execution of different paths, which reduces parallel efficiency . Each warp consists of 32 threads executed in lockstep, so when a conditional branch causes threads to diverge, CUDA executes each path serially, stalling others and thereby decreasing warp utilization . This branching divergence can significantly impact performance, especially if it leads to extensive serialization and stalls, making minimizing warp divergence an essential optimization goal .

The document outlines the evolution of GPU architectures from fixed-function graphics pipelines in the 1990s to modern multi-chip designs. It details key phases including the introduction of programmable shaders, unified shader architecture, GPGPU capabilities, and deep learning optimizations. Additionally, it discusses the parallelism of GPUs, their architecture, memory hierarchy, and the CUDA programming model for general-purpose computing.

Uploaded by

pavi250496

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOC, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

41 views77 pages

GPU Computing Evolution and Architecture

Uploaded by

pavi250496

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOC, PDF, TXT or read online on Scribd

IF4093 GPU Computing

Evolution of GPU Architectures – Detailed Notes

Graphics Processing Units (GPUs) have evolved from simple fixed-function graphics accelerators
into highly parallel, programmable processors used across graphics, AI, scientific computing, and
data analytics. Their evolution can be divided into several major phases:

1. Fixed-Function Graphics Pipelines (1990s – early 2000s)

Characteristics
 GPUs were built only for specific graphics operations.
 Stages of rendering like transform, lighting, rasterization, texture mapping were
hardwired.
 Very limited programmability.

Examples
 Early NVIDIA RIVA series
 3dfx Voodoo graphics cards

Limitations
 Could not perform general computations.
 Programmers had no control over pipeline stages.

2. Introduction of Programmable Shaders (2001 – 2005)

Key Idea
 GPU stages like vertex and pixel processing became programmable.
 Developers could write shader programs (small programs running on the GPU).

Architectural Changes
 Separate Vertex Shader Units
 Separate Pixel/Fragment Shader Units

Technologies
 DirectX 8/9
 OpenGL Shader Language (GLSL)
 High-Level Shader Language (HLSL)
Outcome
 Enabled complex effects like:
 per-pixel lighting
 realistic shading
 water/ smoke simulation

3. Unified Shader Architecture (2006 – 2012)

Breakthrough
 Instead of separate vertex/pixel processors, GPUs moved to unified cores.
 Any shader core could execute any type of workload.

Advantages
 Better load balancing
 Higher efficiency
 Easier to program

NVIDIA: Tesla architecture (2006)

AMD: Radeon R600, later GCN

Impact
 Enabled GPGPU (General Purpose GPU Computing) using:
 CUDA
 OpenCL

4. GPGPU and Massive Parallelism (2010 – 2017)

Focus Shift
 GPUs now optimized not only for graphics but also for parallel computing.

Architectural Enhancements
 Thousands of cores organized into:
 Streaming Multiprocessors (SMs – NVIDIA)
 Compute Units (CUs – AMD)
Memory Enhancements
 High Bandwidth Memory (HBM)
 Improved caching
 Faster interconnects (NVLink)

Parallel Programming Models

 CUDA
 OpenCL
 DirectCompute
 Vulkan Compute Shaders

Applications
 Scientific simulations
 Deep learning
 Cryptography
 Image/video processing

5. Deep Learning Optimized Architectures (2017 – Present)

AI workloads changed GPU design dramatically.

Key Features
 Introduction of Tensor Cores (NVIDIA Volta, 2017)
 Accelerate matrix math (FP16, BF16, INT8, FP8)
 Dedicated Ray Tracing Cores (NVIDIA Turing, 2018)
 Hardware-accelerated ray tracing

Architectural Innovations
 Mixed precision computing
 Multi-instance GPU (MIG)
 Improved L1/L2 cache
 AI-specific instruction sets

AMD Developments
 RDNA / RDNA2 / RDNA3 architectures
 Better compute performance-per-watt
 Infinity Fabric for multi-chip GPU designs
Applications
 Large Language Models (LLMs)
 Autonomous driving
 Robotics
 3D gaming with real-time ray tracing

6. Modern Multi-Chip GPU Architectures (2022 – Future)

Next Evolution Step

 Chiplet-based GPU designs:
 Multiple GPU dies interconnected
 Improved manufacturing yield
 Lower cost and higher scalability

Examples
 AMD RDNA3 chiplets
 NVIDIA Blackwell (2024) with modular scaling
 Intel Ponte Vecchio (multi-tile GPU)

Benefits
 Extreme scalability
 Higher memory bandwidth
 Efficient heat management

🚀 Understanding Parallelism with GPUs

GPUs (Graphics Processing Units) are designed to perform massively parallel computations,
meaning many operations happen at the same time. This makes them extremely fast for tasks
involving large amounts of similar, repetitive work.

🧠 1. CPU vs GPU Parallelism (Concept)

Feature CPU GPU
Architecture Few powerful cores Thousands of lightweight cores
Optimized for Complex sequential tasks Simple tasks repeated many times
Example Running OS, logic, decision-making Image processing, matrix multiplication
Analogy:
 CPU = A single teacher who is very smart but can handle only a few tasks one by one.
 GPU = A classroom of 1000 students doing the same activity simultaneously.

⚙️2. Why GPUs Are Parallel?

GPUs were originally built for graphics.
A screen has millions of pixels, and the same shader program must run on each pixel.
👉 So GPU designers created an architecture that can run the same instruction on many data
elements at once.
This is called SIMD – Single Instruction, Multiple Data.

💡 3. Core Parallelism Concepts in GPU

(a) Threads
 Smallest unit of execution.
 Thousands of threads run together.

(b) Warps / Wavefronts

 Group of ~32 threads that execute the same instruction simultaneously.

(d) Grids
 Entire collection of blocks.
Hierarchy:
Grid → Blocks → Threads

🔢 4. Example: Adding Two Arrays

Suppose you want to add two arrays of 1 lakh numbers.

CPU way
 Uses a loop → processes one element at a time (or few at a time).
 Sequential mostly.
GPU way
 Launch 1 lakh threads.
 Each thread handles one addition.
 All 1 lakh additions happen simultaneously.
⏱ Result: GPU finishes in milliseconds.

📊 5. Where GPU Parallelism Shines

✔️Matrix operations
✔️Deep learning model training
✔️Image and video processing
✔️Physics simulations
✔️Cryptography
✔️Data analytics (parallel tasks)
Anywhere you need to do the same task repeatedly for large data.

🚦 6. Limits of GPU Parallelism

GPUs are not good at:
❌ Branch-heavy logic (if-else)
❌ Tasks requiring sequential dependency
❌ Small workloads (CPU may be faster)
Typical GPU Architecture – Explained
A Graphics Processing Unit (GPU) is designed for massively parallel processing. Unlike a CPU,
which focuses on low-latency execution of a few complex threads, a GPU focuses on high-
throughput execution of thousands of lightweight threads.

1. High-Level Structure of a GPU

A typical GPU contains:

1.1. Graphics/Compute Command Processor

 Acts as the front-end.
 Receives commands from the CPU (kernel launches, memory transfers).
 Schedules work to GPU cores.

2. Streaming Multiprocessors (SMs) / Compute Units

This is the heart of the GPU.
Each GPU consists of dozens to hundreds of SMs (NVIDIA) or CUs (AMD).
Each SM contains:
2.1. Scalar Processors (Cores / CUDA Cores)
 Perform integer and floating-point operations.
 Usually grouped into SIMD units, meaning they execute the same instruction on multiple
data at once.

2.2. Warp/Wavefront Scheduler

 Warp (NVIDIA) = 32 threads
 Wavefront (AMD) = 64 threads
 Schedules and issues instructions to GPU cores.

2.3. Special Function Units (SFUs)

 Handle special math: sin, cos, sqrt, reciprocal, interpolation.

2.4. Tensor / Matrix Cores (in modern GPUs)

 High-speed matrix multiplication for AI/ML workloads.

2.5. Register File

 Very large compared to CPU.
 Each thread gets a set of registers.

3. GPU Memory Hierarchy

3.1. Global (Device) Memory

 Large (GBs), high-latency.
 Used for data storage accessible by all SMs.

3.2. Shared Memory / Local Data Store

 Fast on-chip memory inside each SM.
 Shared by threads in the same block.
 Acts like a user-controlled cache.

3.3. L2 Cache
 Shared by all SMs.
 Helps reduce global memory access latency.

3.4. L1 Cache + Texture Cache

 Per-SM.
 Helps fast access to frequently used data and texture information.
3.5. Constant & Instruction Memory
 Read-only optimized memory for constants and instructions.

4. Parallel Execution Model

4.1. SIMT (Single Instruction, Multiple Threads)

 Each warp executes one instruction across multiple threads.
 Threads in a warp run in lock-step.

4.2. Thread Hierarchy

 Thread → Warp → Block → Grid
 Thousands of threads active at once.

4.3. Latency Hiding

Instead of large caches (like CPUs), GPUs hide latency by:
 Switching between warps instantly
 Keeping many warps ready to execute

5. Memory Interface
 Very high bandwidth GDDR6/HBM memory.
 Memory controllers distribute data across multiple channels.

6. Typical GPU Architecture Diagram (Text Form)

CUDA
Hardware
Overview
CUDA
(Compute
Unified
Device

Architecture) is NVIDIA’s parallel computing platform and programming model. It enables general-
purpose computing on GPUs (GPGPU). Understanding CUDA hardware requires knowing how
GPUs are organized internally for massive parallelism.

1. GPU as a Parallel Processor

A CUDA-enabled GPU consists of:
 Streaming Multiprocessors (SMs) – the main compute units.
 Many lightweight cores inside each SM.
 Large memory bandwidth optimized for parallel workloads.
 Hierarchical memory architecture to manage data for thousands of threads.
GPUs focus on throughput (executing many operations in parallel) rather than low latency like
CPUs.

2. CUDA Core Architecture

2.1 Streaming Multiprocessor (SM)

The SM is the fundamental building block of a CUDA GPU.
Each SM contains:
 CUDA cores (integer + floating-point ALUs)
 Warp schedulers
 Register file
 Load/Store units
 Special Function Units (SFUs)
 Tensor Cores (in newer architectures like Volta, Turing, Ampere)

Typical Layout (simplified):

GPU
├─ SM0
│ ├─ CUDA Cores
│ ├─ Registers
│ ├─ Shared Memory
│ └─ Warp Scheduler
├─ SM1
│ └─ ...
└─ SMn

Each SM can manage thousands of lightweight threads.

3. Thread Hierarchy in Hardware

CUDA uses a three-level hardware execution model:

Warp
 Group of 32 threads
 Executes instructions SIMT (Single Instruction Multiple Threads)

Thread Block
 One or more warps (typically 32–1024 threads)
 Scheduled on a single SM
 Can use shared memory and synchronize

Grid
 Group of multiple blocks
 Independently scheduled across SMs

4. CUDA Memory Hierarchy

NVIDIA GPUs provide a layered memory architecture:

4.1 Global Memory

 Large (GBs)
 High latency
 Accessible by all threads

4.2 Shared Memory

 On-chip memory inside each SM
 Very fast (similar to L1 cache)
 User-controlled
 Used for inter-thread communication within a block

4.3 Registers
 Fastest memory
 Each thread gets its own set of registers

4.4 Constant & Texture Memory

 Cached read-only memory
 Optimized for specific access patterns

4.5 L1/L2 Cache

 Modern GPUs have:
 L1 cache per SM
 Unified L2 cache shared across SMs
5. CUDA Execution Model

Warp Scheduling
 SM hardware schedules warps
 When one warp waits for memory, the scheduler switches to another warp → zero-
overhead context switching

Occupancy
 Number of active warps per SM
 Higher occupancy = better hide memory latency

6. Tensor Cores (Modern GPUs)

 Specialized hardware for matrix operations
 Greatly accelerate AI/ML workloads, linear algebra
 Perform mixed-precision FMA operations (FP16, BF16, TF32, INT8)

✅ Memory Handling in CUDA

CUDA provides a hierarchical memory model designed to maximize GPU performance by handling
data efficiently across different memory spaces. Understanding these memory spaces is critical for
writing optimized CUDA programs.

🎯 Why Memory Handling Is Important?

 Global memory is slow (hundreds of cycles).
 Shared memory and registers are fast.
 Efficient memory use determines GPU performance far more than arithmetic operations.

🧱 CUDA Memory Hierarchy

CUDA provides different types of memory with different scopes, lifetimes, and speeds:

Memory Type Scope Lifetime Speed Size Use

Per Temporary thread
Registers Thread Fastest Very small
thread variables
Shared Per Limited (~48– Thread cooperation
Block Very fast
Memory block 100 KB) within block
Global Device Application Slow Large (GBs) Main data storage
Memory Type Scope Lifetime Speed Size Use
Memory
Per When registers
Local Memory Thread Slow Depends
thread overflow
Constant Fast for uniform
Device Application 64 KB Read-only constants
Memory access
Texture 2D/3D spatial data,
Device Application Cached Read-only
Memory images

⚙️Memory Types Explained

1 Registers (Fastest)
1️⃣
 Located inside GPU cores.
 Automatically allocated by compiler.
 Extremely fast (1 cycle access).
 Limited — register spills go to local memory.

2️⃣Shared Memory (Like L1 Cache)

 Shared among threads in the same block.
 Very low latency.
 Programmable by developer.
 Ideal for:
 Matrix multiplication
 Stencil operations
 Data reuse

Example:
__shared__ float data[256];

3️⃣Global Memory (Slowest but Largest)

 Accessible by all threads and CPU.
 400–800 cycles latency.
 Needs coalesced memory accesses for performance.
Allocation
cudaMalloc(&d_A, size);
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

4️⃣Local Memory
 Per-thread storage.
 Stored in global memory (slow).
 Used when:
 Register usage exceeds limit.
 Large arrays declared inside kernels.

5️⃣Constant Memory
 Cached read-only memory.
 Fast if all threads read the same address.
 Good for:
 Coefficients
 Filter kernels

Example
__constant__ float coeff[32];

6️⃣Texture Memory
 Read-only.
 Special cache optimized for 2D/3D locality.
 Used in:
 Image processing
 Interpolation
🚀 Key Memory Optimization Techniques

✔ 1. Use Shared Memory to Reduce Global Memory Access

Example: Tile-based matrix multiplication.

✔ 2. Coalesced Access
Threads in the same warp must access contiguous global memory locations.
Bad:
A[ threadIdx.x * stride ]

Good:
A[ threadIdx.x ]

✔ 3. Avoid Register Spilling

 Reduce local variable count.
 Use -maxrregcount wisely.

✔ 4. Use Constant Memory for Read-Only Data

GPU can broadcast same value to many threads.

✔ 5. Prefer Unified Memory for Ease (Not Always Fast)

cudaMallocManaged(&ptr, size);

📝 Simple Code Example (Shared Memory + Global Memory)

__global__ void addArray(int *a, int *b, int *c) {
__shared__ int temp[256];

int tid = threadIdx.x + blockIdx.x * blockDim.x;

temp[threadIdx.x] = a[tid] + b[tid];

__syncthreads();

c[tid] = temp[threadIdx.x];
}
UNIT 2 CUDA PROGRAMMING

🚀 What is CUDA?
CUDA (Compute Unified Device Architecture) is a parallel computing platform and
programming model created by NVIDIA.
It allows you to use the power of NVIDIA GPUs (Graphics Processing Units) to accelerate general-
purpose computing tasks — not just graphics.

🔍 Why CUDA? (Simple Explanation)

Normally, the CPU handles most tasks. But CPUs have only a few cores.
GPUs, on the other hand, have thousands of cores, perfect for tasks that can run in parallel.
So CUDA helps you:
 Run computations much faster
 Process large datasets (like images, scientific data)
 Run parallel algorithms easily
 Accelerate AI/ML, simulations, graphics, etc.

📦 Where is CUDA used?

 Deep Learning (TensorFlow, PyTorch use CUDA internally)
 Scientific simulations
 Image/Video processing
 Cryptography
 Database acceleration
 Engineering simulations (CFD, FEM)

⚙️How CUDA works (Simplified)

When you write a CUDA program:
1. CPU (Host) sends tasks to
2. GPU (Device) which runs thousands of threads
3. GPU returns the results to the CPU
You write special functions called kernels that run on the GPU.
Example kernel:
__global__ void add(int *a, int *b, int *c) {
int i = threadIdx.x;
c[i] = a[i] + b[i];
}

🧠 Key Concepts in CUDA

Concept Meaning
Thread Smallest GPU execution unit
Block Group of threads
Grid Group of blocks
Kernel Function that runs on GPU
Concept Meaning
Device Memory GPU memory
Host Memory CPU memory

Architecture Components
 SM (Streaming Multiprocessor) → executes threads
 Warp (32 threads) → smallest scheduling unit
 Global memory → large memory for all threads
 Shared memory → fast small memory per block
 Registers → fastest storage for threads

✅ 1. CUDA Memory Hierarchy

CUDA uses different memory types, each with different speed and usage:

🔹 Global Memory
 Largest memory on GPU
 Slowest
 Accessible by all threads
 Must use coalesced access for better performance

🔹 Shared Memory
 Small, very fast
 Shared by threads in the same block
 Acts like a “software-managed cache”
 Boosts performance significantly

🔹 Registers
 Fastest memory
 Private to each thread
 Limited amount

🔹 Constant Memory
 Read-only
 Cached
 Fast when all threads read the same value
🔹 Texture Memory
 Read-only
 Special hardware caching
 Good for 2D/3D spatial data (images)

✅ 2. CUDA Thread–Block–Grid Model

CUDA uses a hierarchy:

🔹 Thread
 Smallest execution unit
 Runs part of kernel code

🔹 Block
 Group of threads
 Threads within block share shared memory

🔹 Grid
 Group of blocks
 Entire kernel launch forms a grid

🔹 Warp
 32 threads
 Executed in lockstep
 Warp divergence slows performance

✅ 3. Write Your First CUDA Program

Kernel function (runs on GPU)

__global__ void add(int *a, int *b, int *c) {
int i = threadIdx.x;
c[i] = a[i] + b[i];
}

Host code
int main() {
int a[5] = {1,2,3,4,5};
int b[5] = {10,20,30,40,50};
int c[5];
int *dA, *dB, *dC;

cudaMalloc(&dA, 5*sizeof(int));
cudaMalloc(&dB, 5*sizeof(int));
cudaMalloc(&dC, 5*sizeof(int));

cudaMemcpy(dA, a, 5*sizeof(int), cudaMemcpyHostToDevice);

cudaMemcpy(dB, b, 5*sizeof(int), cudaMemcpyHostToDevice);

add<<<1,5>>>(dA, dB, dC);

cudaMemcpy(c, dC, 5*sizeof(int), cudaMemcpyDeviceToHost);

cudaFree(dA); cudaFree(dB); cudaFree(dC);

}

✅ 4. CUDA Hardware Architecture Overview

🔹 SM (Streaming Multiprocessor)
 Heart of the GPU
 Contains:
 CUDA cores
 Warp schedulers
 Shared memory
 Registers

🔹 GPU has many SMs

 Each SM runs many warps
 Enables massive parallelism

🔹 Memory hierarchy inside SM

 Registers (fastest)
 Shared memory (very fast)

✅ 5. CUDA Optimization Techniques

🔸 Memory Coalescing
 Ensure consecutive threads access consecutive addresses
🔸 Use Shared Memory
 To avoid repeated global memory access

🔸 Avoid Warp Divergence

 Avoid if-else branching inside warps

🔸 Increase Occupancy
 Use enough threads per block

🔸 Use constant memory

 For read-only data

✅ 6. CPU vs GPU Architecture

Feature CPU GPU
Cores Few (4–16) Thousands
Type Serial processing Parallel processing
Cache Large Small
Memory Lower bandwidth Very high bandwidth
Best for Single-thread tasks Parallel tasks

✅ 7. CUDA Programming Model – Summary

Host (CPU)
 Manages memory
 Launches kernels
 Controls device

Device (GPU)
 Executes kernels
 Runs thousands of threads in parallel

Kernel Launch
Uses syntax:
kernel<<<numBlocks, threadsPerBlock>>>();

Example:
myKernel<<<16, 256>>>();

Means:
 16 blocks
 256 threads per block
 Total = 4096 threads running parallel

Multi-GPU Solutions – Overview

A multi-GPU system uses two or more Graphics Processing Units to work together on the same
workload. Instead of relying on a single GPU's computational power, multiple GPUs split the data
or tasks to achieve higher throughput, faster computation, and larger memory capacity.

Why Multi-GPU?
Multi-GPU architectures are used when:
 A single GPU is not sufficient for large workloads.
 Applications require massive parallelism.
 Training large AI/ML models.
 Running scientific simulations, bioinformatics, fluid dynamics, or rendering.
 Increasing system scalability without replacing existing hardware.

Types of Multi-GPU Configurations

1. Data Parallelism
 Same model/program runs on all GPUs.
 Input data is partitioned across GPUs.
 Common in deep learning (e.g., PyTorch Distributed, TensorFlow MirroredStrategy).
 GPUs periodically synchronize weights.
Example:
4 GPUs each get 25% of a dataset batch.

2. Model Parallelism
 The model is too large to fit on a single GPU.
 Model layers or components are split across GPUs.
Example:
Transformer layers distributed across different GPUs in LLM training.

3. Pipeline Parallelism
 Stages of the model or computation pipeline run on different GPUs.
 Enables continuous data flow like an assembly line.
Often used in large-model training (Megatron-LM, DeepSpeed).

4. Hybrid Parallelism
Combination of data, model, and pipeline parallelism.
Used in GPT-3, Llama, BERT-large training.

Hardware and Interconnects for Multi-GPU

Effective multi-GPU systems depend on fast communication links:

Interconnect Features Use cases

PCIe Standard CPU ↔ GPU / GPU ↔ GPU Most systems
NVLink Very high bandwidth GPU ↔ GPU NVIDIA data centers
NVSwitch Full GPU mesh, scalable HGX servers, DGX
InfiniBand Connects GPUs across nodes (distributed clusters) HPC, cloud computing

Multi-GPU in CUDA
CUDA supports multi-GPU programming using:

✔ cudaGetDeviceCount()
Check number of GPUs.

✔ cudaSetDevice(device_id)
Select which GPU runs the kernel.

✔ Peer-to-peer Memory Copy

cudaDeviceEnablePeerAccess();
cudaMemcpyPeer();

Memory Models:
 Independent memory: each GPU has its own memory.
 Unified Virtual Addressing (UVA): seamless addressing across GPUs/CPU.
 Managed memory: CUDA automatically handles migrations.

Multi-GPU Approaches in Real Systems

1. Desktop Multi-GPU
Historically NVIDIA SLI / AMD CrossFire for gaming.
Now mainly used in:
 Content creation
 Video rendering
 CUDA computation

2. Server-Grade Multi-GPU
Used in DGX/HGX servers:
 4, 8, 16, or 32 GPUs in one node.
 NVLink + NVSwitch for high-speed communication.
Used in supercomputing, HPC clusters, AI training.

3. Distributed Multi-GPU Clusters

Multiple nodes, each with multiple GPUs.
Tools:
 Horovod
 NVIDIA NCCL
 MPI + CUDA
 PyTorch DistributedDataParallel (DDP)

Challenges in Multi-GPU Systems

❗ Memory bottlenecks
GPUs cannot share memory directly unless NVLink/NVSwitch is used.
❗ Communication overhead
Synchronization slows parallelism.
❗ Load imbalance
Unequal workloads reduce overall speed-up.
❗ Scalability issues
More GPUs ≠ linear speed increase.

Performance Scaling – Amdahl’s Law

Even with unlimited GPUs, performance is limited by:
 Serial parts of the program
 Communication time
 Synchronization
Multi-GPU efficiency depends on minimizing inter-GPU communication.

Applications of Multi-GPU Solutions

AI/ML
 Training LLMs (GPT, Llama, BERT)
 Reinforcement learning

Scientific HPC
 Molecular dynamics (GROMACS)
 CFD simulations
 Weather modeling

Engineering
 Finite element analysis (ANSYS)
 Rendering (Blender Cycles X)

Medical / Biology
 Genomics
 Protein folding (AlphaFold)

🚀 Optimizing CUDA Applications Through Problem Decomposition

CUDA performance depends heavily on how well you break down (decompose) your problem
into parallelizable pieces that fit the GPU's architecture. Proper decomposition ensures full SM
utilization, coalesced memory access, minimal divergence, and maximum throughput.
✅ 1. Why Problem Decomposition Matters in CUDA
GPUs excel when:
 Many independent computations can run in parallel
 Data can be split into contiguous chunks
 Workloads can be mapped to threads, warps, blocks, and grids efficiently
Poor decomposition leads to:
❌ Thread divergence
❌ Memory access inefficiency
❌ Underutilized SMs
❌ Kernel underperformance

✅ 2. Two Main Types of Decomposition

A. Data Decomposition (most common)

Split the input data into independent chunks and assign to:
 Threads
 Warps
 Blocks

Example
Matrix addition:
Each element C[i][j]=A[i][j]+B[i][j]
→ Each thread computes one element.
This fits naturally into CUDA grid/block/thread mapping.

B. Task Decomposition
Break the application into independent operations (functions).
Examples:
 Preprocessing → Kernel 1
 Computation → Kernel 2
 Reduction → Kernel 3
Useful for:
 Pipeline processing
 Independent algorithmic stages
 Asynchronous streams

✅ 3. Key CUDA-Specific Decomposition Strategies

1. Thread-Level Decomposition
Each thread handles:
 Single element
 Single pixel
 Single particle
 One iteration of a loop

Tips
 Ensure threads execute the same instruction path (avoid divergence)
 Use enough threads to fill all SMs (typically thousands)

2. Block-Level Decomposition
Blocks should handle independent portions with optional shared memory.
Example:
 A block handles a tile of a matrix
 Shared memory stores tile for reuse
 Threads cooperate using __syncthreads()

Benefits
 Reduces global memory access
 Enables tiling & cache efficiency
 Improves memory coalescing

3. Grid-Level Decomposition
Grids allow splitting very large data sets:
 Millions of threads
 Adaptive grid-stride loops
 Kernel launches based on problem size
Grid-Stride Loop Pattern
for (int i = tid; i < N; i += blockDim.x * gridDim.x) {
out[i] = in[i] * in[i];
}

✔ Handles arbitrarily large datasets

✔ Keeps kernel simple
✔ Ensures all threads participate

✅ 4. Techniques That Improve Performance After Decomposition

1. Memory Coalescing
Arrange data so that:
 Thread 0 → memory[0]
 Thread 1 → memory[1]
→ contiguous accesses
Poor decomposition → scattered access → slow global loads.

2. Using Shared Memory via Tiling

Break computation into tiles:
 Each block loads a tile
 Threads reuse data in shared memory
Useful for:
 Matrix multiplication
 Stencil computations
 2D/3D grids

3. Minimizing Thread Divergence

Each warp (32 threads) must follow same branch.
So:
 Decompose work to avoid differing paths within a warp.
Example issue:
if (threadIdx.x % 2 == 0) { ... } else { ... }

Avoid such divergence when possible.

4. Use Multiple Streams for Overlap
Task decomposition + CUDA streams enables:
 Kernel execution overlap
 Data transfer overlap
 Pipeline parallelism

5. Avoid Too Small or Too Large Blocks

Ideal block size is typically:
 128, 256, or 512 threads per block
Balance between:
 Occupancy
 Register usage
 Shared memory per block

🧠 Example: Decomposing Matrix Multiplication (Tiling)

Step 1: Data Decomposition

 Break matrices into small tiles (BLOCK_SIZE × BLOCK_SIZE)

Step 2: Map tiles to blocks

Each block computes one tile of output matrix.

Step 3: Use shared memory

Threads load parts of A and B into shared memory.

Step 4: Synchronize threads

Compute partial sums.

Result
✔ 10× to 50× faster than naive version
✔ Due to better decomposition & memory reuse
✅ 5. Summary Table
Technique Purpose
Data Decomposition Split data → threads/blocks
Task Decomposition Split functions → kernels/streams
Tiling Reuse data via shared memory
Coalesced Access Efficient global memory use
Minimizing Divergence Better warp efficiency
Grid-Stride Loops Handle large data sets
Streams Pipeline parallelism

🎯 Final Summary
Optimizing CUDA applications starts with breaking the problem into parallelizable pieces. A
good decomposition ensures:
 High occupancy
 Memory efficiency
 Low divergence
 Maximum throughput
It’s the foundation of all other CUDA optimizations.

Memory

Considerations in CUDA Programming

Efficient use of memory is critical to achieving high performance in CUDA applications. GPU
memory hierarchy is different from CPU memory, and understanding each type of memory, its
latency, bandwidth, and usage is essential.
1. CUDA Memory Hierarchy Overview
CUDA devices have several memory regions, each differing in:
 Latency
 Bandwidth
 Scope (thread, block, grid, device)
 Lifetime

Hierarchy (Fastest → Slowest)

1. Registers
2. Shared Memory
3. L1 Cache
4. L2 Cache
5. Global Memory
6. Constant Memory / Texture Memory / Surface Memory
7. Host Memory (CPU)

2. Key Memory Types and Their Considerations

A. Registers
 Fastest memory, private to each thread.
 Allocated automatically by the compiler.
 Excess register usage → spills into local memory (slow!)
Consideration:
✔ Optimize register usage; use --ptxas-options=-v to see register count.

B. Shared Memory
 Very fast, on-chip memory.
 Accessible by all threads in a block.
 Lifetime = block duration.
 Limited size: typically 48–100 KB per block depending on GPU.
Best use cases:
 Caching frequently accessed data.
 Reducing global memory accesses.
 Tiling (e.g., matrix multiplication).
Considerations:
✔ Avoid bank conflicts.
✔ Divide data into tiles to improve locality.

C. Global Memory
 Largest memory region, but high latency (~400–800 cycles).
 Accessible by all threads and the host.
Performance considerations:
1. Memory Coalescing
Threads in the same warp should access contiguous memory locations.
2. Alignment
Align data to 32, 64, or 128 bytes for coalesced access.
3. Minimize transfers
Host ↔ Device transfers are slow:
Use pinned memory for faster transfers.
4. Reuse data
Load once → store in shared memory or registers.

D. Constant Memory
 Read-only memory, cached.
 Fast if all threads access the same address (broadcast).
Use cases:
 Constants, coefficients, lookup tables.

E. Texture Memory
 Read-only, cached memory with special spatial locality optimization.
 Useful for:
 Image processing
 Interpolation
 2D/3D texture access patterns

F. Local Memory
 Misleading name: this is actually off-chip global memory.
 Used when:
 Compiler spills registers, or
 You declare large per-thread arrays.
Avoid whenever possible!

G. Unified Memory
 Single memory space accessible by CPU and GPU.
 Simplifies programming but may incur page faults during migrations.
 Suitable for prototyping and irregular memory access patterns.

3. Memory Transfer Considerations

Host ↔ Device Transfers

 Use pinned (page-locked) memory to speed up transfers.
 Use asynchronous memory copy (cudaMemcpyAsync) to overlap computation and data
transfer.
 Use streams to maximize concurrency.

4. Performance Optimization Techniques

✔ 1. Coalesced Access
Ensure warp threads access consecutive memory addresses.

✔ 2. Avoid Divergence
Avoid memory patterns that cause divergence in a warp.

✔ 3. Use Shared Memory Tiling

Reduce repeated global memory accesses.
✔ 4. Leverage Constant/Texture Memory
For read-only data with spatial or broadcast reuse.

✔ 5. Minimize Register Spilling

Reduce local memory usage.

✔ 6. Consider Memory Alignment

Align arrays for coalesced loads/stores.

✔ 7. Use Managed Memory Carefully

Avoid excessive page migrations.

5. Example: Matrix Multiplication Optimization With Shared Memory

Using shared memory can speed up matrix multiplication by 2–10× vs naïve GPU implementation.
Concept:
 Load tile of matrix A and B into shared memory.
 Synchronize threads.
 Compute partial results.
 Avoid repeated global memory loads.

6. Common Mistakes in CUDA Memory Usage

❌ Using global memory in inner loops
❌ Non-coalesced memory access
❌ Excess register usage leads to spills
❌ Using shared memory without avoiding bank conflicts
❌ Using unified memory without understanding migrations
❌ Frequent host-device copies

🚀 CUDA Programming: Transfers Explained

In CUDA programming, data transfers refer to moving data between the CPU (Host) and the
GPU (Device). Transfers are critical because they can significantly affect application performance.
✅ 1. Types of Transfers

1.1 Host → Device (H2D) Transfer

Data moves from the system memory (CPU RAM) to GPU global memory.
Example:
Copying input arrays before launching a kernel.

1.2 Device → Host (D2H) Transfer

Data moves from GPU memory back to CPU memory.
Example:
Copying results after kernel execution.

1.3 Device → Device (D2D) Transfer

Copying data between two memory locations within the GPU.
 Much faster than H2D or D2H transfers.
 Can be used for rearranging data or copying subarrays.

1.4 Peer-to-Peer (P2P) Transfer (Multi-GPU systems)

Data transfer between GPU A ↔ GPU B without using the CPU.
 Requires:
 GPUs supporting Unified Virtual Addressing (UVA)
 Enabling peer access

⚡ 2. CUDA APIs for Transfers

cudaMemcpy()
Basic transfer function.
cudaMemcpy(dest, src, size, cudaMemcpyHostToDevice);
cudaMemcpy(dest, src, size, cudaMemcpyDeviceToHost);
cudaMemcpy(dest, src, size, cudaMemcpyDeviceToDevice);

cudaMemcpyAsync()
Asynchronous transfer using streams.
Allows:
 Overlapping transfer with kernel execution
 Non-blocking behavior

🔍 3. How Transfers Affect Performance

Transfers across the PCIe bus are much slower than GPU computation.

Operation Typical Speed

GPU compute Hundreds of GB/s
GPU global memory ~500–1000 GB/s
PCIe host–device transfer ~12–16 GB/s
So, minimizing H2D and D2H transfers is key to optimization.

🚀 4. Optimizing Transfers (Important for Exams)

4.1 Use Page-Locked (Pinned) Memory

Pinned memory improves bandwidth and enables async transfers.
cudaMallocHost(&ptr, size);

4.2 Use cudaMemcpyAsync + Streams

Allows overlapping:
 Transfer + Kernel execution
 Transfer H2D + Transfer D2H

4.3 Use Unified Memory (cudaMallocManaged)

Single memory space shared by CPU and GPU.
 Easy to use
 Performance depends on access patterns

4.4 Reduce Number of Transfers

 Transfer large chunks at once
 Keep data on GPU as long as possible
 Perform multiple kernels before copying back
4.5 Use Peer-to-Peer Transfers for Multi-GPU
Avoids host bottleneck.

✅ Thread Usage in CUDA Programming

CUDA (Compute Unified Device Architecture) allows parallel execution using thousands of
lightweight threads. Understanding how these threads are organized and used is essential for writing
efficient GPU programs.

🔹 1. CUDA Thread Hierarchy

CUDA organizes its parallel execution into three levels:

(a) Threads
 The smallest unit of execution.
 Each thread runs the same kernel code, but operates on different data.
 Identified by threadIdx.x, threadIdx.y, threadIdx.z.

(b) Thread Blocks

 A group of threads (typically 32–1024 threads per block).
 Threads within a block can:
 Communicate through shared memory,
 Synchronize using __syncthreads().

🔹 2. Why So Many Threads?

GPUs are massively parallel. NVIDIA GPUs contain many Streaming Multiprocessors (SMs),
each capable of running thousands of threads simultaneously.
High thread count helps to:
 Maximize GPU utilization
 Hide memory latency
 Improve throughput
🔹 3. Warp: The Execution Unit
 Threads are executed in groups of 32, called a warp.
 All threads in a warp execute in lockstep (SIMT—Single Instruction, Multiple Threads).
🔸 Important note:
If threads within a warp take different branches (if/else), warp divergence occurs → performance
slows down.

🔹 4. Choosing the Correct Number of Threads

Threads per block

A common guideline:
 Use multiples of 32 (warp size).
 Typically 128, 256, 512, or 1024 threads per block.

Important considerations:
 More threads per block → better occupancy
 But too many → register/shared memory limits exceeded

🔹 5. Thread Index Calculation Example

To map each thread to an array element:
__global__ void vectorAdd(int *A, int *B, int *C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
C[idx] = A[idx] + B[idx];
}
}

Launch configuration:
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
vectorAdd<<<numBlocks, blockSize>>>(A, B, C, N);

🔹 6. Thread Synchronization
Threads inside a block can synchronize:
__syncthreads();

 Ensures all threads reach a point before continuing.

 Cannot synchronize threads across blocks.

🔹 7. Thread Resource Usage

Threads use:
 Registers
 Local memory
 Shared memory (per block)
If a kernel uses too many registers or shared memory:
 Fewer threads can run concurrently → occupancy drops.

Resource Contentions in CUDA Programming

In CUDA, thousands of threads run in parallel. But the GPU has limited hardware resources.
When many threads request the same resource at the same time, resource contention occurs —
leading to slowdowns and reduced performance.

🔶 1. Types of Resource Contentions

1. Register Contention
 Each Streaming Multiprocessor (SM) has a fixed number of registers.
 If your kernel uses too many registers per thread:
 fewer threads can run simultaneously (low occupancy)
 performance drops
 Occurs commonly with complex kernels or heavy math operations.
Solution:
 Reduce register usage (__restrict__, compiler optimizations)
 Launch fewer threads per block
 Use -maxrregcount compiler flag (with care)

2. Shared Memory Contention

 Shared memory is very fast but limited.
 When multiple threads access the same bank → bank conflicts
 Bank conflicts serialize memory accesses → slower execution.
Solution:
 Optimize shared memory layout
 Avoid strided memory access
 Use padding to avoid accessing the same bank

3. Global Memory Contention

 Global memory has high latency.
 If many threads try to read/write same/global memory location:
 requests queue
 memory throughput reduces
Solution:
 Use coalesced memory access
 Use shared memory as cache
 Avoid unnecessary global writes

4. Atomic Operation Contention

 Atomics serialize access.
 When multiple threads do atomic add/sub/compare on the same variable:
 Only one thread succeeds at a time → heavy slowdown.
Solution:
 Minimize atomic ops
 Use per-thread or per-block private buffers, then reduce
 Use warp-level primitives (__shfl_sync, warpReduce)

5. Warp Divergence Contention

 When threads in the same warp execute different branches (if-else),
CUDA serializes the paths.
 Not strictly a "resource" but causes execution units to stall.
Solution:
 Use branchless programming where possible
 Rearrange data to avoid divergent conditions

6. Texture/Constant Memory Contention

 These are cached.
 When too many requests miss the cache or collide:
 Cache thrashing occurs
 Performance decreases
Solution:
 Use these memories only for read-only, reused data
 Ensure spatial/temporal locality

🔶 7. SM Resource Contentions
Each SM has limited:
 warps
 blocks
 registers
 shared memory
 thread slots
If your kernel demands too much of one resource:
 fewer blocks can be scheduled
 GPU can't fully allocate SM resources → low occupancy

🟩 8. PCIe / NVLink Transfer Contention

If many kernels/streams simultaneously request host–device transfers:
 bandwidth becomes the bottleneck.
 overlapping transfers may not fully parallelize.
UNIT 3 PROGRAMMING ISSUES

✅ 1. Basic Problems

a. Input/Output
 Read a number and print it.
 Print patterns (stars, numbers, pyramids).
 Sum of digits.

b. Mathematical
 Check prime number.
 Find GCD / LCM.
 Fibonacci series.
 Factorial of a number.
 Armstrong number.
 Palindrome number.

✅ 2. Array Problems
 Find maximum / minimum in an array.
 Reverse an array.
 Linear / Binary search.
 Sum of array elements.
 Remove duplicates.
 Kadane’s algorithm (maximum subarray sum).
 Move zeros to the end.
 Find missing number in 1 to n.
 Rotate an array (left/right rotation).

✅ 3. String Problems
 Reverse a string.
 Check palindrome string.
 Count vowels and consonants.
 Anagram checking.
 Longest substring without repeating characters.
 Count character occurrences.
 Convert lower to upper case.

✅ 4. Recursion Problems
 Factorial using recursion.
 Fibonacci using recursion.
 Tower of Hanoi.
 Permutations of a string.
 Sum of array using recursion.
✅ 5. Sorting Problems
 Bubble sort.
 Selection sort.
 Insertion sort.
 Merge sort.
 Quick sort.
 Count sort.

✅ 6. Linked List Problems

 Insert node at beginning/end.
 Reverse linked list.
 Detect loop in linked list.
 Find middle element.
 Delete a node.

✅ 7. Stack & Queue Problems

 Implement stack using array/linked list.
 Implement queue using array/linked list.
 Parentheses balancing.
 Evaluate postfix/prefix expressions.

✅ 8. Tree & Graph Problems

 Inorder / Preorder / Postorder traversal.
 Level order traversal.
 Check if a tree is BST.
 Height of a tree.
 Graph BFS/DFS.
 Shortest path (Dijkstra).
 Minimum spanning tree (Kruskal/Prim).
✅ 9. Dynamic Programming Problems
 Fibonacci using DP.
 0/1 Knapsack.
 Coin change problem.
 Longest common subsequence (LCS).
 Longest increasing subsequence (LIS).
 Edit distance.

✅ 10. Object-Oriented Programming Problems

 Class creation.
 Inheritance demonstration.
 Polymorphism implementation.
 Interface usage.
 Encapsulation & abstraction examples.

✅ 11. Database Programming Problems

 CRUD operations (Create, Read, Update, Delete).
 Joins (inner, outer, cross).
 Normalization questions.
 SQL aggregation (GROUP BY, HAVING).
 Stored procedures & triggers.

1. Checking CUDA API Errors

Most CUDA runtime functions return a value of type cudaError_t. You should check it after every
API call.
Example:
cudaError_t err = cudaMalloc((void**)&d_array, size * sizeof(int));
if (err != cudaSuccess) {
printf("CUDA malloc failed: %s\n", cudaGetErrorString(err));
return -1;
}

 cudaGetErrorString(err) converts the error code to a human-readable string.

2. Checking Kernel Launch Errors
Kernel launches are asynchronous. Errors might not show up immediately. To catch them:
kernel<<<grid, block>>>(d_array);
cudaError_t err = cudaGetLastError(); // Check for launch errors
if (err != cudaSuccess) {
printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
}

cudaDeviceSynchronize(); // Catch errors during execution

err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Kernel execution failed: %s\n", cudaGetErrorString(err));
}

 cudaGetLastError() returns the last error from runtime API or kernel launch.
 cudaDeviceSynchronize() ensures all preceding work is completed and reports runtime
errors.

3. Common Error Types

 cudaErrorMemoryAllocation – failed memory allocation on GPU.
 cudaErrorInvalidValue – invalid argument in API call.
 cudaErrorLaunchOutOfResources – kernel uses too many threads or registers.
 cudaErrorInvalidDeviceFunction – kernel not compatible with GPU architecture.
 cudaErrorUnknown – unexpected/unknown runtime error.

4. Macro for Easier Error Handling

Many developers use a macro to wrap calls for brevity:
#define CUDA_CHECK(err) \
if (err != cudaSuccess) { \
printf("CUDA Error: %s (file %s, line %d)\n", cudaGetErrorString(err), __FILE__,
__LINE__); \
exit(EXIT_FAILURE); \
}

// Usage
CUDA_CHECK(cudaMalloc((void**)&d_array, size * sizeof(int)));
5. Device Reset
If a fatal error occurs, it’s good practice to reset the GPU:
cudaDeviceReset();

6. Debugging Tips
 Use cuda-memcheck to detect memory errors:
cuda-memcheck ./my_program

 Use nsight or nvprof for profiling and detecting kernel issues.

 Check grid and block dimensions to avoid out-of-bounds memory access.

Parallel programming can give huge speedups, but it comes with its own set of challenges and
issues. Here's a detailed overview:

1. Race Conditions
 Occur when multiple threads access shared data concurrently and at least one thread
modifies it.
 Results are non-deterministic depending on execution order.
Example:
int counter = 0;
#pragma omp parallel for
for(int i=0; i<100; i++) {
counter++; // Race condition: multiple threads increment simultaneously
}

Solution:
 Use locks/mutexes (std::mutex in C++)
 Atomic operations (std::atomic<int> counter;)
 Thread-local storage

2. Deadlocks
 Happens when two or more threads wait indefinitely for resources held by each other.
 Common with nested locks or improper resource acquisition order.
Solution:
 Acquire locks in a consistent order.
 Use try-locks with timeout.
 Minimize shared resources.

3. Livelocks
 Threads are active but cannot make progress because they keep responding to each other.
 Similar to deadlock but CPU is busy spinning.
Solution:
 Introduce delays or back-off strategies.

4. Starvation
 A thread never gets CPU time or resources because higher-priority threads dominate.
Solution:
 Use fair scheduling policies or priority adjustments.

5. Memory Consistency & Visibility

 Threads may see stale values due to CPU caches or compiler reordering.
 Example: Thread A updates a variable, but Thread B still sees the old value.
Solution:
 Use memory barriers (__threadfence() in CUDA)
 Volatile or atomic variables
 Proper synchronization primitives

6. Load Balancing
 Unequal distribution of work causes some threads to finish early while others are still busy.
 Wastes parallelism and reduces speedup.
Solution:
 Dynamic scheduling (#pragma omp parallel for schedule(dynamic))
 Work-stealing techniques
7. Overhead of Synchronization
 Locks, barriers, and atomic operations introduce performance overhead.
 Too fine-grained parallelism can be slower than sequential code.
Solution:
 Minimize shared data access
 Reduce frequency of synchronization
 Use lock-free or reduction patterns when possible

8. False Sharing
 Occurs when multiple threads modify variables that reside on the same cache line.
 Causes unnecessary cache coherence traffic.
Solution:
 Pad shared variables to align with cache lines
 Avoid putting frequently updated variables together

9. Scalability Limitations
 Speedup is limited by Amdahl’s Law:
S=(1−P)+NP1
Where P is parallelizable fraction, N is number of threads.
Implication:
 Beyond a certain number of threads, adding more cores gives diminishing returns.

10. Debugging Difficulty

 Bugs like race conditions, deadlocks, and memory corruption are hard to reproduce.
 Traditional debugging tools may fail with parallel execution.
Solution:
 Use specialized tools:
 Intel Inspector, ThreadSanitizer, cuda-memcheck (CUDA)
 Logging with thread IDs
Synchronization issues in programming usually arise when multiple threads or processes need
to access shared resources concurrently. If this access is not properly controlled, it can lead to
inconsistent data, unexpected behavior, or crashes. Here’s a clear breakdown of common
synchronization problems and their causes:

1. Race Conditions
 What it is: Occurs when the outcome of a program depends on the order or timing of thread
execution.
 Example: Two threads updating a shared counter simultaneously might overwrite each
other’s updates.
 Symptom: Inconsistent or incorrect results.
 Solution: Use mutexes, locks, or atomic operations to ensure exclusive access.

2. Deadlocks
 What it is: Two or more threads are waiting indefinitely for resources held by each other.
 Example: Thread A holds Lock 1 and waits for Lock 2, while Thread B holds Lock 2 and
waits for Lock 1.
 Symptom: Program hangs.
 Solution:
 Always acquire locks in a consistent order.
 Use timeout-based locking.
 Detect and recover from deadlocks.

3. Livelocks
 What it is: Threads keep changing state in response to each other but make no progress.
 Example: Two threads repeatedly yield to each other trying to avoid deadlock but never
proceed.
 Symptom: Program runs but does nothing useful.
 Solution: Introduce randomness or back-off strategies.

4. Starvation
 What it is: A thread never gets CPU time or access to a resource because others
continuously take precedence.
 Example: Low-priority thread never runs because high-priority threads dominate the CPU.
 Symptom: Some tasks never complete.
 Solution: Use fair scheduling or priority inheritance.

5. Priority Inversion
 What it is: A low-priority thread holds a resource needed by a high-priority thread, causing
the high-priority thread to wait.
 Symptom: High-priority tasks are blocked unexpectedly.
 Solution: Use priority inheritance protocols to temporarily boost low-priority thread’s
priority.

6. Atomicity Violations
 What it is: Non-atomic operations on shared resources are interrupted by other threads.
 Example: x = x + 1 is not atomic; two threads doing it simultaneously can overwrite each
other’s increment.
 Symptom: Counter or shared data becomes incorrect.
 Solution: Use atomic operations or locks.

7. Memory Visibility Issues

 What it is: Changes made by one thread may not be visible to other threads due to caching
or compiler optimizations.
 Example: One thread updates a flag, but another thread never sees the update.
 Symptom: Threads appear stuck or see stale data.
 Solution: Use volatile variables, memory barriers, or synchronized blocks (depending on
language).

8. False Sharing
 What it is: Threads access variables that reside on the same CPU cache line, causing
unnecessary cache invalidations.
 Symptom: Performance degradation, even if logically independent.
 Solution: Align variables to separate cache lines.
algorithmic issues and programming issues in software development. I’ll break it down
carefully:

1. Algorithmic Issues
Algorithmic issues are problems that arise due to the design, logic, or efficiency of algorithms.
These can affect the correctness, speed, and scalability of a program.
Common types of algorithmic issues:
1. Incorrect logic
 The algorithm doesn’t produce the intended result.
 Example: Miscalculating the sum of an array due to off-by-one errors.
2. Inefficient algorithms
 Takes too long or uses too much memory.
 Example: Using bubble sort for a very large dataset instead of merge sort.
3. Unoptimized recursion
 Excessive recursive calls can lead to stack overflow or high memory use.
 Example: A naive Fibonacci function using recursion.
4. Failure to handle edge cases
 Algorithm fails on empty inputs, maximum/minimum values, or unexpected inputs.
5. Poor scalability
 Algorithm works for small inputs but fails with large datasets due to high time or
space complexity.
2. Programming Issues
Programming issues are problems caused during implementation of code, not necessarily the
algorithm itself. These are more related to coding errors, environment, or integration.
Common programming issues:
1. Syntax errors
 Code cannot compile or run.
 Example: Missing semicolons, unmatched parentheses.
2. Runtime errors
 Code crashes while running.
 Example: Division by zero, null pointer exceptions.
3. Logical errors
 Program runs but gives wrong results.
 Example: Using < instead of <= in a loop condition.
4. Resource management issues
 Memory leaks, file handle leaks, database connections not closed.
5. Concurrency issues
 Race conditions, deadlocks, thread starvation when multiple threads access shared
resources.
6. Security vulnerabilities
 Buffer overflows, SQL injection, improper input validation.
7. Integration issues
 Problems when combining modules or APIs, often due to mismatched data formats
or protocols.
finding and avoiding errors are central programming issues, often discussed under debugging,
error handling, and software reliability. Let’s break it down carefully:

1. Types of Programming Errors

Programming errors can generally be categorized into three types:
1. Syntax Errors
 Mistakes in code grammar.
 Detected at compile time (for compiled languages).
 Example: Missing semicolon, wrong indentation, or misspelled keywords.
2. Runtime Errors
 Occur while the program is running.
 Often due to illegal operations, invalid input, or resource issues.
 Example: Division by zero, null pointer dereference.
3. Logical Errors
 The program runs without crashing but produces wrong output.
 Hardest to detect because the code “works” but behaves incorrectly.
 Example: Using i <= n instead of i < n in a loop.

2. Finding Errors (Debugging)

Finding errors involves identifying and correcting issues in code:
 Testing: Write test cases to verify program behavior.
 Code Reviews: Peer reviews can catch errors you might miss.
 Debuggers: Tools like GDB, Visual Studio Debugger, or IDE built-in debuggers help step
through code.
 Logging: Adding logs to trace program execution.

3. Avoiding Errors (Preventive Measures)

Preventing errors reduces time spent debugging:
 Good Programming Practices:
 Follow coding standards and conventions.
 Write modular, readable, and maintainable code.
 Input Validation: Always check input before using it.
 Use of Static Analysis Tools: Detect potential errors before running code.
 Error Handling Mechanisms: Use try-catch blocks, exception handling, and proper
resource management.
 Version Control: Helps avoid mistakes from overwriting or losing code.
UNIT 4 Open Computing Language
The OpenCL (Open Computing Language) standard is an open, royalty-free framework
designed for writing programs that can execute across heterogeneous platforms, which
include CPUs, GPUs, FPGAs, DSPs, and other processors. It is maintained by the Khronos
Group and enables developers to leverage parallel computing to improve performance in
compute-intensive applications.
Here’s a structured overview:

1. Purpose
OpenCL standardizes parallel programming across different types of hardware. Its main goals are:
 Portability: Write code once, run it on multiple types of devices.
 Performance: Enable efficient use of hardware parallelism.
 Interoperability: Integrate with graphics APIs like OpenGL, Vulkan, or DirectX.
2. Architecture
OpenCL defines a platform model and a programming model.
A. Platform Model
 Host: Usually the CPU, responsible for running the main application.
 Compute Devices: Hardware capable of parallel execution (GPU, CPU cores, FPGA).
 Compute Units (CUs): Each device is divided into CUs (e.g., GPU multiprocessors).
 Processing Elements (PEs): Individual cores within a CU.
B. Memory Model
 Global Memory: Accessible by all work-items and persistent across kernel executions.
 Constant Memory: Read-only memory shared across work-items.
 Local Memory: Shared among work-items within a workgroup; faster than global memory.
 Private Memory: Each work-item has its own private memory.
C. Execution Model
 Kernels: Functions written in OpenCL C executed on compute devices.
 Work-items: Smallest unit of execution.
 Work-groups: Groups of work-items that share local memory and can synchronize.

3. Programming Model
OpenCL programs are typically divided into two parts:
1. Host Program: Runs on CPU, manages devices, memory, and kernel execution.
2. Device Program (Kernel): Runs on compute devices in parallel.
Key Concepts:
 NDRange (N-Dimensional Range): Defines the grid of work-items for execution.
 Enqueueing: Commands like kernel execution, memory read/write are enqueued to a
command queue.

4. Key Features
 Heterogeneous computing support.
 Explicit control over memory and task parallelism.
 Synchronization primitives for work-items and work-groups.
 Interoperability with graphics APIs.
 Supports vector and scalar data types.

5. Versions
OpenCL has evolved over time:
 OpenCL 1.x: Initial support for CPUs, GPUs, and basic parallelism.
 OpenCL 2.x: Added shared virtual memory, pipes, device-side enqueue.
 OpenCL 3.0: Unified and modular; optional features to improve portability and adoption.

6. Use Cases
 Scientific computing (simulation, weather modeling).
 Image and signal processing.
 Machine learning and AI inference.
 Financial modeling.
 Real-time rendering and graphics.

In OpenCL, a kernel is a fundamental concept: it’s the function that runs on a compute device
(GPU, CPU, or other accelerators) in parallel across many work-items. You can think of a
kernel as the “core computation” that you want to perform on large amounts of data.
Here’s a detailed breakdown:

1. Definition
A kernel is written in OpenCL C (a C-like language) and is executed on the device, not on the host
(CPU). The host program launches the kernel across a grid of work-items.
Example Kernel:
__kernel void vector_add(__global const float *A,
__global const float *B,
__global float *C) {
int i = get_global_id(0); // unique index for this work-item
C[i] = A[i] + B[i];
}

 __kernel marks the function as an OpenCL kernel.

 __global indicates memory in global memory space accessible by all work-items.
 get_global_id(0) gives the unique work-item ID along the first dimension.

2. Execution Model
OpenCL executes kernels in parallel using NDRange (N-Dimensional Range):
 Work-item: Smallest execution unit; runs one instance of the kernel.
 Work-group: A group of work-items that can share local memory.
 Global size: Total number of work-items.
 Local size: Number of work-items in a work-group.
Example:
If you have 1024 data elements and a local size of 64, OpenCL creates 1024 work-items organized
into 16 work-groups (1024 ÷ 64).

3. Memory Spaces in Kernels

Kernels can access different types of memory:

Memory Type Scope Speed

__private Work-item only Fastest
__local Work-group shared Fast
__global Accessible by all work-items Slower
__constant Read-only, shared by all work-items Fast (cached)

4. Synchronization
Within a work-group, work-items can synchronize using:
barrier(CLK_LOCAL_MEM_FENCE);

 Ensures all work-items reach the barrier before continuing.

 Needed when sharing local memory to avoid race conditions.

5. Launching a Kernel
From the host program, you enqueue kernels like this:
size_t global_size = 1024;
size_t local_size = 64;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);

 global_size: Total number of work-items.

 local_size: Work-items per work-group.

6. Key Points
 A kernel runs in parallel across many work-items.
 Each work-item has a unique ID used to access its portion of data.
 Memory type and synchronization are important for performance and correctness.
 Kernels are compiled at runtime or ahead-of-time and executed on devices, not the host.

host-
device

interaction, it generally refers to the way a host system (like a CPU-based computer)
communicates and coordinates with a device (like a GPU, FPGA, or peripheral hardware) in
computing systems. This concept is crucial in parallel programming, hardware acceleration,
and embedded systems. Let me break it down clearly:

1. What is Host-Device Interaction?

 The host is usually the main CPU and its memory system.
 The device is a co-processor or specialized hardware (e.g., GPU, FPGA) that performs
computations or handles specific tasks.
 Interaction involves:
 Transferring data between host and device.
 Launching tasks or kernels on the device.
 Synchronizing execution between host and device.
2. Typical Steps in Host-Device Interaction (GPU Example)
Step 1: Memory Allocation
 The host allocates memory on the device for data it wants the device to process.
 Example in CUDA:
 cudaMalloc((void**)&d_array, size);

Step 2: Data Transfer

 Data is copied from the host memory to the device memory.
 cudaMemcpy(d_array, h_array, size, cudaMemcpyHostToDevice);

Step 3: Kernel Launch

 The host tells the device to execute a kernel (parallel function) on the data.
 myKernel<<<numBlocks, threadsPerBlock>>>(d_array);

Step 4: Synchronization
 Host waits for the device to finish processing if needed.
 cudaDeviceSynchronize();

Step 5: Result Retrieval

 Copy results from device memory back to host memory.
 cudaMemcpy(h_array, d_array, size, cudaMemcpyDeviceToHost);

Step 6: Cleanup
 Free the allocated device memory.
 cudaFree(d_array);

3. Communication Mechanisms
 Direct Memory Access (DMA): Device can read/write memory directly with minimal CPU
involvement.
 PCIe Bus: Standard for connecting host CPU to GPUs and other accelerators.
 Unified Memory (in CUDA): Allows host and device to share memory space for easier
programming.
 Command Queues: Host sends commands to device asynchronously.
execution

environment in OpenCL. Let’s go step by step with a clear explanation.

1. Overview of OpenCL Execution Environment

OpenCL (Open Computing Language) provides a platform-independent framework for writing
programs that execute across heterogeneous devices like CPUs, GPUs, and FPGAs. Its execution
environment describes how OpenCL programs run on hardware.
The execution environment can be understood in three layers:

2. OpenCL Platform Model

The platform model defines the relationship between the host (CPU) and devices
(GPUs/accelerators).
 Host: Usually a CPU that runs the main program and controls OpenCL devices.
 Devices: Compute resources (GPU, CPU, FPGA) where kernels execute.
 Compute Units (CUs): Devices are divided into compute units. Each CU can run multiple
processing elements (PEs) in parallel.
 Processing Elements (PEs): The smallest execution units (threads) that execute OpenCL
instructions.
Diagrammatically:
Host CPU
│
├── Device 0 (GPU/CPU/FPGA)
│ ├── Compute Unit 0
│ │ └── Processing Elements
│ ├── Compute Unit 1
│ └── ...
└── Device 1
└── ...

3. Memory Model
OpenCL defines a hierarchical memory model for efficient execution:

Memory
Scope Characteristics
Type
All work-items on all compute
Global High latency, visible to host and all kernels
units
All work-items on all compute
Constant Read-only, cached for efficiency
units
Work-items within a compute Shared among work-items of a workgroup, faster
Local
unit than global
Private Individual work-item Fastest, private to the work-item

4. Execution Model
Execution in OpenCL is based on kernels (functions executed on devices).
 Kernels run across a NDRange (N-Dimensional range of work-items).
 Work-items: Single execution instances of the kernel.
 Work-groups: Group of work-items that share local memory.
Hierarchy:
NDRange (entire problem) → Work-groups → Work-items

 Each work-item has a unique ID (get_global_id()), used to process part of data.

 Synchronization is allowed within a work-group using barrier().

5. Command-Queue Execution
The host controls execution via command queues:
1. Write data to device memory.
2. Enqueue kernel for execution.
3. Read results back from device memory.
Commands can be in-order or out-of-order, affecting scheduling.
In OpenCL, the memory model defines how memory is organized, how data is accessed by the
host and kernels, and how memory consistency and visibility are handled across multiple
compute units and work-items. Understanding this is crucial for writing efficient and correct
OpenCL programs. Let’s break it down clearly:

1. OpenCL Memory Types

OpenCL memory is hierarchical and divided into four main types:

Memory Scope / Accessible

Lifetime / Visibility Characteristics
Type By
Private Single work-item Exists for the lifetime of Fastest, registers on GPU, cannot be
Memory (thread) a work-item shared directly between work-items.
Shared among work-items in the same
Local All work-items in a Exists for the lifetime of
work-group. Stored in GPU’s on-chip
Memory work-group the work-group
memory (fast).
All work-items Exists for the lifetime of
Global Large but slower than local memory.
across all work- the buffer allocated by
Memory Resides in device DRAM.
groups the host
Exists for the lifetime of Read-only memory visible to all work-
Constant
All work-items the buffer allocated by items. Stored in global memory but
Memory
the host cached for fast access.

2. Memory Access Qualifiers

In OpenCL C kernels, memory pointers are qualified to specify which memory region they refer to:
 __private → private memory (default for variables in a kernel function)
 __local → local memory shared in a work-group
 __global → global memory accessible by all work-items
 __constant → read-only memory in global space
Example:
__kernel void example_kernel(__global int *global_data,
__local int *local_data) {
int private_var = 0; // private memory
int lid = get_local_id(0);

// Work-items can access shared local memory

local_data[lid] = global_data[lid];
barrier(CLK_LOCAL_MEM_FENCE); // synchronize work-items in the work-group
}

3. Memory Consistency and Synchronization

 Barriers are used to synchronize work-items in a work-group:
barrier(CLK_LOCAL_MEM_FENCE); // ensures all local memory writes are visible to all work-
items in the group
barrier(CLK_GLOBAL_MEM_FENCE); // ensures global memory writes are visible

 Atomic operations are used to prevent race conditions when multiple work-items update
shared memory.

basic OpenCL examples that demonstrate how to write and execute simple OpenCL
programs. I’ll include vector addition and scalar multiplication as typical starting points.

1. Vector Addition Example

This program adds two arrays element-wise using OpenCL.
Host Code (C/C++):
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE 8

const char* kernelSource =

"__kernel void vector_add(__global int* A, __global int* B, __global int* C) { \n"
" int id = get_global_id(0); \n"
" C[id] = A[id] + B[id]; \n"
"} \n";

int main() {
int A[ARRAY_SIZE] = {1,2,3,4,5,6,7,8};
int B[ARRAY_SIZE] = {8,7,6,5,4,3,2,1};
int C[ARRAY_SIZE];

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY |

CL_MEM_COPY_HOST_PTR,
sizeof(int) * ARRAY_SIZE, A, NULL);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR,
sizeof(int) * ARRAY_SIZE, B, NULL);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
sizeof(int) * ARRAY_SIZE, NULL, NULL);

cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, NULL);

clBuildProgram(program, 1, &device, NULL, NULL, NULL);

cl_kernel kernel = clCreateKernel(program, "vector_add", NULL);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);

clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

size_t globalSize = ARRAY_SIZE;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, sizeof(int) * ARRAY_SIZE, C, 0, NULL,
NULL);

printf("Result: ");
for(int i=0; i<ARRAY_SIZE; i++) printf("%d ", C[i]);
printf("\n");

clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);

return 0;
}

Output:
Result: 9 9 9 9 9 9 9 9
2. Scalar Multiplication Example
Multiply each element of an array by a constant.
Kernel Code (scalar_mul.cl):
__kernel void scalar_mul(__global int* input, __global int* output, int scalar) {
int id = get_global_id(0);
output[id] = input[id] * scalar;
}

Host Code:
#include <CL/cl.h>
#include <stdio.h>

int main() {
int input[8] = {1,2,3,4,5,6,7,8};
int output[8];
int scalar = 5;

// (Platform, device, context, queue setup same as vector addition)

// Create buffers, program, kernel as before
// Set kernel args:
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufInput);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufOutput);
clSetKernelArg(kernel, 2, sizeof(int), &scalar);

// Execute kernel
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

// Read buffer and print output

clEnqueueReadBuffer(queue, bufOutput, CL_TRUE, 0, sizeof(int)*8, output, 0, NULL, NULL);

printf("Output: ");
for(int i=0;i<8;i++) printf("%d ", output[i]);
printf("\n");
}

Output:
Output: 5 10 15 20 25 30 35 40

parallel patterns in computing with convolution as an example. Let’s break it down clearly.

1. What is a Parallel Pattern?

A parallel pattern is a common computational structure that naturally exposes opportunities for
parallel execution. Instead of thinking about individual threads, you think about the structure of the
computation. Examples:
 Map – Apply a function independently to all elements of a collection.
 Reduce – Combine elements of a collection into a single value (sum, max, etc.).
 Stencil – Each output element depends on a fixed neighborhood of input elements.
 Pipeline – Different stages of computation are executed in a sequence but can be
overlapped.
 Divide and Conquer – Problem is recursively split into independent subproblems.

2. Convolution as a Parallel Pattern

Convolution is widely used in image processing, signal processing, and deep learning.
Mathematically, a 2D convolution looks like:
O[i,j]=m∑n∑I[i+m,j+n]⋅K[m,n]
Where:
 I = input image
 K = kernel/filter
 O = output image
Parallelism in convolution:
1. Element-level parallelism
 Each output pixel O[i,j] can be computed independently.
 On a GPU, you can assign one thread per output pixel.
2. Kernel-level parallelism
 The sum over m and n (kernel elements) can be computed in parallel.
3. Tile-based parallelism
 Divide the input into tiles/blocks that fit into shared memory (for GPUs).
 Each block computes its part of the output independently.
4. Pipeline pattern
 In deep learning, multiple convolution layers can be pipelined, overlapping memory
access and computation.

3. Why Convolution is a Classic Parallel Pattern

 Locality: Each output depends on a small neighborhood (good for cache/shared memory).
 High compute-to-memory ratio: Each pixel requires multiple multiplications & adds,
suitable for SIMD/SIMT architectures.
 Massively parallel: Thousands of pixels can be computed simultaneously.

4. Visualization
Input Image (I) Kernel (K)
+---+---+---+ +---+---+
|a|b|c| |k|l|
|d|e|f| * |m|n|
|g|h|i|
+---+---+---+

Output pixel O[1,1] = ak + bl + dm + en

 Each output pixel calculation is independent, so you can compute many pixels in parallel.

GPU-based implementation of the prefix sum (scan) algorithm, which is a classic parallel
algorithm used in many GPU applications. Let’s go step by step.

1. What is a Prefix Sum (Scan)?

Given an array A of size n:
 Inclusive scan: B[i] = A[0] + A[1] + ... + A[i]
 Exclusive scan: B[i] = A[0] + A[1] + ... + A[i-1] (with B[0] = 0)
Example:
A = [3, 1, 7, 0, 4, 1, 6, 3]
Inclusive scan => B = [3, 4, 11, 11, 15, 16, 22, 25]
Exclusive scan => B = [0, 3, 4, 11, 11, 15, 16, 22]
2. Parallel GPU Approach
The GPU approach uses Blelloch’s scan algorithm (two phases):
1. Up-sweep (reduce) phase: build a sum tree.
2. Down-sweep phase: compute the prefix sums.
This allows O(log n) parallel time using O(n) processors.

2.1 Pseudocode for Blelloch Scan

For array of length n (assume n is a power of 2):

Up-Sweep:
for d = 0 to log2(n)-1
for k = 0 to n-1 by 2^(d+1) // parallel
A[k + 2^(d+1) - 1] += A[k + 2^d - 1]

Down-Sweep:
A[n-1] = 0
for d = log2(n)-1 down to 0
for k = 0 to n-1 by 2^(d+1) // parallel
t = A[k + 2^d - 1]
A[k + 2^d - 1] = A[k + 2^(d+1) - 1]
A[k + 2^(d+1) - 1] += t

 After the down-sweep, A contains the exclusive scan.

2.2 GPU Implementation (CUDA Example)

__global__ void prefix_sum(int *data, int n) {
extern __shared__ int temp[]; // shared memory
int thid = threadIdx.x;
int offset = 1;

// Load input into shared memory

temp[2*thid] = data[2*thid];
temp[2*thid+1] = data[2*thid+1];

// Up-sweep (reduce) phase

for (int d = n>>1; d > 0; d >>= 1) {
__syncthreads();
if (thid < d) {
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
temp[bi] += temp[ai];
}
offset *= 2;
}

// Clear the last element

if (thid == 0) { temp[n-1] = 0; }

// Down-sweep phase
for (int d = 1; d < n; d *= 2) {
offset >>= 1;
__syncthreads();
if (thid < d) {
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
int t = temp[ai];
temp[ai] = temp[bi];
temp[bi] += t;
}
}

__syncthreads();

// Write results back to global memory

data[2*thid] = temp[2*thid];
data[2*thid+1] = temp[2*thid+1];
}

Notes:
 n must be a power of 2 for this simple version.
 temp is stored in shared memory for speed.
 Each thread works on two elements.
 For larger arrays, multiple blocks and a block-sum adjustment is needed.

3. Optimization Tips for GPU

1. Use shared memory within a block for speed.
2. Handle non-power-of-two arrays by padding zeros.
3. Use warp-level primitives for small arrays (e.g., CUDA’s __shfl_up_sync).
4. For large arrays:
 Compute per-block prefix sums.
 Sum the block totals.
 Add block sums to each element.
4. Libraries
 CUDA Thrust: thrust::inclusive_scan() or thrust::exclusive_scan()
 OpenCL: You can implement the same up-sweep/down-sweep method or use
cl::sycl::inclusive_scan.

Sparse Matrix
algorithms on GPU. Sparse matrices are matrices in which most elements are zero. Because of
the sparsity, storing and computing them efficiently requires special algorithms and data
structures. GPUs, with their massive parallelism, can accelerate sparse matrix computations
significantly.

1. Sparse Matrix Storage Formats

Before running algorithms on GPU, you need an efficient storage format:
1. CSR (Compressed Sparse Row)
 Stores non-zero elements row-wise.
 Arrays: values, col_index, row_ptr.
 Popular for SpMV (Sparse Matrix-Vector multiplication).
2. CSC (Compressed Sparse Column)
 Stores non-zero elements column-wise.
 Arrays: values, row_index, col_ptr.
3. COO (Coordinate Format)
 Stores (row, col, value) for each non-zero element.
 Simple but not as memory-efficient for GPU.
4. ELL (Ellpack/ITPACK)
 Stores non-zeros in each row in fixed-length arrays.
 Better for GPU parallelism due to predictable memory access.
5. HYB (Hybrid)
 Combination of ELL and COO for GPUs (used in NVIDIA’s cuSPARSE library).

2. Sparse Matrix Algorithms on GPU

(a) Sparse Matrix-Vector Multiplication (SpMV)

 Goal: y = A * x where A is sparse.
 GPU Implementation Steps:
1. Load sparse matrix in CSR/ELL/COO format.
2. Assign one thread per row (CSR) or per element (COO/ELL).
3. Threads read non-zero elements and multiply by corresponding vector elements.
4. Use shared memory and coalesced access to optimize performance.
 Libraries: cuSPARSE, rocSPARSE, MAGMA.
Challenges:
 Load imbalance due to uneven number of non-zeros per row.
 Memory coalescing is critical for performance.

(b) Sparse Matrix-Matrix Multiplication (SpGEMM)

 Goal: C = A * B where A and B are sparse.
 GPU Approach:
1. Count non-zeros in each row of result C.
2. Allocate memory for C.
3. Multiply and sum contributions in parallel.
 Optimizations:
1. Use hash maps per thread block to accumulate results.
2. Use segmented scan for parallel reduction.

(c) Sparse Triangular Solve

 Solving Lx = b or Ux = b where L or U is sparse and triangular.
 Approach: Level-scheduling
1. Identify independent sets of rows.
2. Solve rows in parallel when dependencies are met.
 Challenges: Irregular dependencies reduce parallelism.

(d) Sparse Iterative Solvers (CG, GMRES, BiCGSTAB)

 Iterative methods for linear systems using SpMV as the core operation.
 GPU-friendly because:
 Most work is in SpMV.
 Vector operations are naturally parallel.
 Libraries: cuSPARSE, CUSP, PETSc GPU.

3. GPU Optimization Tips

1. Memory Coalescing
 Arrange non-zero elements so consecutive threads access consecutive memory.
2. Use Shared Memory
 Cache vector elements in SpMV.
3. Balance Work
 Avoid thread divergence in rows with varying numbers of non-zeros.
4. Use Libraries
 NVIDIA cuSPARSE, CUSP, Kokkos, rocSPARSE.

4. Example: GPU SpMV with CSR (CUDA)

__global__ void spmv_csr_kernel(int n, const int* row_ptr, const int* col_idx,
const float* values, const float* x, float* y) {
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < n) {
float sum = 0.0f;
for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++) {
sum += values[j] * x[col_idx[j]];
}
y[row] = sum;
}
}

 One thread per row.

 Optimized further using warp-level parallelism.
matrix multiplication on heterogeneous GPU clusters, including algorithms and programming
approaches. Let’s break this down step by step.

1. Matrix Multiplication Basics

Matrix multiplication for two matrices A(m×k) and B(k×n) is defined as:
Ci,j=l=1∑kAi,l⋅Bl,j
Key points for GPU optimization:
 Highly parallelizable, as each element of C can be computed independently.
 Efficient implementations use tiling to reduce global memory accesses.

2. GPU Programming Approaches

a) Single GPU
 Use CUDA or OpenCL.
 Example CUDA workflow:
1. Allocate memory on device (GPU).
2. Copy matrices A and B to device.
3. Launch a kernel to compute C[i,j] per thread.
4. Use shared memory to optimize access for sub-blocks (tiling).
5. Copy result back to host.
CUDA kernel example (conceptual):
__global__ void matMulKernel(float* A, float* B, float* C, int N) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
float sum = 0.0f;
for (int k = 0; k < N; ++k)
sum += A[row * N + k] * B[k * N + col];
C[row * N + col] = sum;
}

b) Heterogeneous Clusters
A heterogeneous GPU cluster may consist of:
 Different GPU types (e.g., NVIDIA + AMD).
 CPUs + GPUs together.
 Multiple nodes interconnected via MPI/InfiniBand.

Key considerations:
1. Workload partitioning: Divide the matrix into blocks assigned to different devices based
on computational power.
2. Device-specific kernels: May need to tune kernels per GPU type.
3. Communication: Use MPI or NCCL to transfer blocks of matrices and gather results.
4. Overlap communication and computation: Send next data blocks while computing
current ones to hide latency.

c) Example Algorithm for Multi-GPU / Heterogeneous Clusters

1. Split matrices into tiles:
 Row-wise split for matrix A
 Column-wise split for matrix B
2. Assign tiles to devices proportional to their performance:
 Faster GPU gets bigger tiles.
 CPU can handle small tiles if needed.
3. Local multiplication on each device:
1. C_tile = A_tile × B_tile

2. Reduce / sum tiles:

 Gather results from all devices.
 Assemble the final matrix C.
d) OpenCL for Heterogeneous Systems
OpenCL can target CPUs, GPUs, and FPGAs. Its model suits heterogeneous clusters.
Steps:
1. Query platforms and devices.
2. Create context with all devices.
3. Partition matrices for each device.
4. Build and execute kernels for each device.
5. Read back results and combine.
Advantages:
 Single code base for multiple architectures.
 Explicit control over data transfer and parallelism.

e) Optimizations
1. Tiling / blocking → minimize global memory access.
2. Asynchronous transfers → overlap host-device communication with computation.
3. Load balancing → important for heterogeneous clusters.
4. Use optimized libraries if available:
 cuBLAS (NVIDIA)
 clBLAS / oneAPI (Intel, AMD)
 MAGMA for multi-GPU

Common questions