GPU Simulation with Accel-Sim Guide

The document outlines the instructions for CPEN 411 Assignment 4, focusing on GPU simulation using the Accel-Sim framework. It includes preparatory steps for setting up the environment, using the simulator, running benchmarks, and processing results, along with evaluation criteria for grading the assignment. The assignment emphasizes understanding GPU architecture through practical simulation tasks and analysis of performance metrics.

CPEN 411: Assignment 4

Accel-Sim GPU Simulation

Tor M. Aamodt

November 26, 2025

Contents

1 Introduction and Setup [0 Points]

1.1 Preparatory Steps: Setup
1.2 Preparatory Steps: Accel-Sim
1.3 Preparatory Steps: Benchmarks
1.4 Preparatory Steps: Processing Results

2 Evaluation: [5 + 0.5 bonus points total]

2.1 Basic Understanding: Accel-Sim [0.5 point]
2.2 Using the Simulator: PTX versus SASS [1 point]
2.3 Modifying Configurations: CTA Scheduler [2 points]
2.4 Making Measurements: Control Flow Divergence [2 points]

3 Submission Instructions

Chapter 1

Introduction and Setup [0 Points]

This assignment is an introduction to GPU simulation using Accel-Sim [1] ([Link]),
a trace-based simulation framework.
You may choose to complete this assignment on your personal machine if you prefer. However, we do
not provide any additional support for that. The Accel-Sim documentation lists all dependencies and
requirements to run the simulator on your personal device.

1.1. Preparatory Steps: Setup

1. Navigate to the GitHub classroom page for this assignment ([Link] AaE0IAlo) and, as in
Assignment 1, click on ‘Accept this assignment’. Click on the URL under the
text ‘Your assignment repository has been created’ to go to your repository. Clone the repository to
your account on the ECE server <username>@[Link].

2. Although we will not be using a physical GPU for this assignment, the simulator still relies on several
CUDA tools to run. The CUDA toolkit is installed in /ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit
on the ECE server.
If you are not using the ECE server, please download and install the CUDA Toolkit. Note that you do
not need to install the CUDA driver. For this assignment, we will be using CUDA version 12.8.0.
wget [Link]
cuda_12.8.0_570.86.10_linux.run
mkdir cuda-toolkit
sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit --toolkitpath=`pwd`/cuda-toolkit --override

3. Add the CUDA Toolkit and xutils to your environment variables so that they can be easily accessed.
You can update setup_cuda.sh with the correct paths and use source setup_cuda.sh to quickly set
your environment.

export CUDA_INSTALL_PATH=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit
export CUDA_HOME=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit
export PTXAS_CUDA_INSTALL_PATH=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit
export LD_LIBRARY_PATH=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit/lib64/
export PATH=$CUDA_INSTALL_PATH/bin:`pwd`/xutils/usr/bin:$PATH

Once the CUDA Toolkit has been added to your PATH, you can run nvcc --version
to make sure the CUDA compiler is functional.
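Before building the simulator, it can help to confirm that the shell actually picked up the exports above. The following is a minimal Python sketch, not part of the assignment repository; the helper names missing_cuda_vars and nvcc_on_path are hypothetical, and the variable list is copied from the export lines in step 3.

```python
import os
import shutil

# Variable names copied from the export lines in step 3.
REQUIRED_VARS = (
    "CUDA_INSTALL_PATH",
    "CUDA_HOME",
    "PTXAS_CUDA_INSTALL_PATH",
    "LD_LIBRARY_PATH",
)


def missing_cuda_vars(env):
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def nvcc_on_path(env):
    """Return the full path of nvcc found via env's PATH, or None."""
    return shutil.which("nvcc", path=env.get("PATH", ""))


# Report on the current shell environment.
print("missing variables:", missing_cuda_vars(os.environ))
print("nvcc found at:", nvcc_on_path(os.environ))
```

An empty list of missing variables and a non-None nvcc path suggest that setup_cuda.sh was sourced in the current shell.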


1.2. Preparatory Steps: Accel-Sim

1. Clone Accel-Sim from the public repository [Link] tree/release.
git clone git@[Link]:accel-sim/[Link]

2. Source the setup script for Accel-Sim.

cd accel-sim-framework/gpu-simulator
source setup_environment.sh

Accel-Sim is a simulation framework that uses another simulator, GPGPU-Sim ([Link]),
as its core performance model. The first time you source the setup script, you will be prompted to clone the
GPGPU-Sim repository. Follow the default prompts to clone the dev branch of GPGPU-Sim from the main
distribution repo.


GPGPU-Sim operates by tricking CUDA executables into using the simulator instead of a real GPU. The
simulator itself is compiled as a Linux "shared object" ([Link]) that mirrors the [Link]
provided by NVIDIA. When you source setup_environment.sh, the path to the simulator's [Link] is added to
LD_LIBRARY_PATH. From then on, in that shell, any CUDA executable you run will use GPGPU-Sim.
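To build intuition for why directory order in LD_LIBRARY_PATH matters, here is a simplified Python sketch of a first-match search over a colon-separated path list. It is only a model: the real dynamic loader also consults other sources (for example run paths and the system cache), and the directory and library names below are hypothetical placeholders rather than the actual file names used by the simulator.

```python
import os
import posixpath


def first_dir_providing(ld_library_path, lib_name, exists=os.path.exists):
    """Return the first directory in a colon-separated path list that
    contains lib_name, or None if no directory provides it."""
    for directory in ld_library_path.split(":"):
        if directory and exists(posixpath.join(directory, lib_name)):
            return directory
    return None


# Hypothetical layout: both directories ship a library with the same name.
fake_files = {"/sim/lib/libexample.so", "/cuda/lib64/libexample.so"}
search_path = "/sim/lib:/cuda/lib64"
print(first_dir_providing(search_path, "libexample.so", fake_files.__contains__))
```

Because the simulator's directory comes first in the search path, its copy shadows the one in the toolkit directory, which is the effect the setup script relies on.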

3. Compile the GPU simulator, which also creates the [Link] object.
make -j

If you get an error relating to available resources, please use make without the parallel flag instead.
If the compilation is successful, you should not see any error messages. After compilation, you can find
the [Link] files in the lib folder under the CUDA and GCC versions (12.8 and 13.3) used during
the build. You will need to recompile the GPU simulator whenever you modify the simulator code.


There are many resources available to learn more about the Accel-Sim framework and the underlying
GPGPU-Sim simulator. Here are links to some helpful videos:

• [Link]

1.3. Preparatory Steps: Benchmarks

Accel-Sim models the GPU device. To run simulations, we also need GPU kernels to execute on Accel-Sim.
The Accel-Sim framework provides many GPU benchmarks that are commonly used to evaluate
microarchitecture optimizations. In this assignment, we will use select workloads from the Rodinia
benchmark suite [2] ([Link]).

1. The Rodinia traces are available from Accel-Sim using their provided script. If you are not using the
ECE servers, you will need to run this script to get your own copy of the Rodinia trace files.
./[Link]

These traces have also been downloaded to /ubc/ece/home/courses/cpen411/assignment4/rodinia-traces
on the ECE server.
Not all Rodinia workloads are supported by the simulator, and support varies by CUDA version. Copy the
modified list of Rodinia workloads we will run in this assignment to the Accel-Sim folder.
cp [Link] ../accel-sim-framework/util/job_launching/apps/.

2. Launch the simulations using the job launching script.

./util/job_launching/run_simulations.py -B rodinia-3.1 -T /ubc/ece/home/courses/cpen411/assignment4/rodinia-traces -N rodinia-test

This command runs the benchmark suite given by -B using the trace files given by -T and names the job
with -N. Depending on resource constraints, you may find it useful to use the -c flag to choose how
many simulations to run concurrently. You can learn more about these flags using --help. Since no
configuration is specified, the default GPU configuration (QV100) is used. All of the available
configurations are defined in a YAML file:
./util/job_launching/configs/[Link]

You can change the GPU model to simulate by using the -C flag and also create your own special
configuration GPU models by modifying the [Link] file and adding the definition to the
YAML list.
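When sweeping several configurations, it can be convenient to assemble the launch command programmatically. The sketch below only builds the argument list from the flags described above (-B, -T, -C, -c, -N); the helper name build_launch_command is hypothetical, and the assumption that -c takes an integer count should be checked against --help.

```python
import shlex

LAUNCHER = "./util/job_launching/run_simulations.py"


def build_launch_command(benchmark, job_name, trace_dir=None, config=None,
                         concurrent=None):
    """Build the argument list for one run_simulations.py invocation."""
    cmd = [LAUNCHER, "-B", benchmark]
    if trace_dir:
        cmd += ["-T", trace_dir]       # trace files (SASS mode)
    if config:
        cmd += ["-C", config]          # GPU configuration; default if omitted
    if concurrent is not None:
        cmd += ["-c", str(concurrent)]  # assumed to take an integer count
    cmd += ["-N", job_name]
    return cmd


cmd = build_launch_command(
    "rodinia-3.1",
    "rodinia-test",
    trace_dir="/ubc/ece/home/courses/cpen411/assignment4/rodinia-traces",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run from the accel-sim-framework directory, or simply printed and pasted into the shell.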


3. Once the simulations have been launched, you can monitor their progress using another script.
./util/job_launching/job_status.py -N rodinia-test
This script will show which benchmarks are currently being simulated and which are complete.

4. The simulations can be prematurely terminated with

./util/job_launching/[Link] -k


1.4. Preparatory Steps: Processing Results

Once the simulations are complete, we can analyze the simulation statistics and results to better
understand how the GPU executed each benchmark. These statistics also help identify inefficiencies and
bottlenecks in the GPU hardware, which allows researchers to introduce hardware modifications or
software optimizations.

1. All the output from the simulator will be placed in sim_run_12.8/(app)/(args)/(config)/. Inside
each of these directories, there will be files labeled *.o(jobId) and *.e(jobId) which store the stdout
(o) and stderr (e) for each particular run.
2. Use the get_stats.py script to collect the simulation statistics and save to a CSV file.
./util/job_launching/get_stats.py -N rodinia-test > [Link]
By default, the stats that are collected (represented as regexes matched against the output files) are
listed in the file below. You can add other stats you create to this list.
./util/job_launching/stats/example_stats.yml
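The regex-based collection can be illustrated with a small Python sketch that pulls the last reported value of one statistic out of a block of simulator output. The stat name gpu_tot_ipc and the "name = value" line format are assumptions about the output; the sample text is made up for illustration, and last_stat is a hypothetical helper, not part of get_stats.py.

```python
import re


def last_stat(text, pattern):
    """Return the last value captured by pattern's first group as a float,
    or None if the pattern never matches."""
    matches = re.findall(pattern, text)
    return float(matches[-1]) if matches else None


# Assumed line format; adjust the regex to match your actual output files.
IPC_PATTERN = r"gpu_tot_ipc\s*=\s*([0-9.]+)"

# Made-up sample output with two kernels.
sample_output = (
    "kernel 1 finished\n"
    "gpu_tot_ipc = 11.0\n"
    "kernel 2 finished\n"
    "gpu_tot_ipc = 12.25\n"
)
print(last_stat(sample_output, IPC_PATTERN))
```

Taking the last match reflects the assumption that the statistic is cumulative and printed once per kernel, so the final occurrence covers the whole run.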

3. To generate plots using the available Python scripts, you may need to install additional Python
packages. On the ECE server, you will need to use a virtual environment.
python3 -m venv ~/cpen411-a4
source ~/cpen411-a4/bin/activate
pip install plotly pyyaml numpy
Once all the necessary Python packages are installed, run the plotting script to visualize your results.
Alternatively, you can generate your own plots based on the data saved in the CSV file, which will likely
be more useful for your assignment report.
./util/plotting/[Link] -c [Link]
These plots are saved to ./util/plotting/htmls/*.html.

Chapter 2

Evaluation: [5 + 0.5 bonus points total]

This assignment will primarily be graded based on your submitted report. Your report should include
a section that addresses each of the tasks below.

2.1. Basic Understanding: Accel-Sim [0.5 point]

Write a paragraph in your report that explains what Accel-Sim does and how it can be used to support
GPU architecture research. Why would someone choose to use Accel-Sim instead of a physical GPU?

2.2. Using the Simulator: PTX versus SASS [1 point]

Accel-Sim offers two modes of operation: functional emulation using PTX and replay of SASS instruction
traces collected from hardware. PTX instructions are a virtual ISA, or an intermediate representation
of the program. SASS is the real machine ISA instructions executed on NVIDIA GPUs.
To enable PTX simulation, we need the source binaries for the Rodinia benchmarks instead of the traces
used for SASS. These are available in /ubc/ece/home/courses/cpen411/assignment4/rodinia-src/gpu-app-collection.
If you are not using the ECE servers, please clone these benchmarks from
git@[Link]:accel-sim/[Link] and follow their instructions to compile.
To execute the program in SASS mode:
./util/job_launching/run_simulations.py -C QV100-SASS -B rodinia-3.1 -T /ubc/ece/home/courses/cpen411/assignment4/rodinia-traces -N rodinia-sass
To execute the program in PTX mode:
source /ubc/ece/home/courses/cpen411/assignment4/rodinia-src/gpu-app-collection/src/setup_environment
./util/job_launching/run_simulations.py -C QV100-PTX -B rodinia-3.1 -N rodinia-ptx
Save the simulation statistics for both to [Link] and [Link].
Plot the execution time for the Rodinia benchmarks to compare PTX and SASS, and include the plot in
your report. Answer the following questions in your report, referring to your execution
time results.

• What do these numbers say about the design of the PTX versus SASS ISA?

• What do these numbers say about the optimization level of each code type?
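One way to prepare the comparison is to normalize PTX execution time to SASS execution time per benchmark. The Python sketch below assumes you have already loaded the cycle counts from your two CSV files into dictionaries keyed by benchmark name; the helper name cycle_ratios is hypothetical, and the numbers are placeholders, not measured results.

```python
def cycle_ratios(ptx_cycles, sass_cycles):
    """Return PTX/SASS cycle ratios for benchmarks present in both runs."""
    common = sorted(set(ptx_cycles) & set(sass_cycles))
    return {
        name: ptx_cycles[name] / sass_cycles[name]
        for name in common
        if sass_cycles[name] > 0
    }


# Placeholder values for illustration only.
ptx = {"bench_a": 150, "bench_b": 80}
sass = {"bench_a": 100, "bench_b": 100, "bench_c": 5}

for name, ratio in cycle_ratios(ptx, sass).items():
    print(f"{name}: PTX takes {ratio:.2f}x the SASS cycles")
```

A ratio above 1 means the PTX run took more cycles than the SASS run for that benchmark; benchmarks missing from either run are skipped rather than guessed.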

2.3. Modifying Configurations: CTA Scheduler [2 points]

Compare different modifications to the threadblock scheduler. By default, the scheduler will attempt
to involve as many SMs and SM clusters as possible – round-robin between clusters and SMs within
clusters before assigning more than one threadblock to a core. In this task, you will make modifications
to this scheduler and analyze the results.


1. In the default Volta configuration provided, each SM gets its own interconnect port. Consider how
the scheduler might change when there are multiple SMs clustered together. Add a new configuration
setting add-on that sets the gpgpu_n_cores_per_cluster to 8 in the configuration file:
./util/job_launching/configs/[Link]

2. The CTA-level scheduler is implemented in GPGPU-Sim:

./gpu-simulator/gpgpu-sim/src/gpgpu-sim/[Link]:
void gpgpu_sim::issue_block2core()

Modify the simulator to implement a Greedy CTA Scheduler. Make the scheduler pack one SM fully
(as much as the occupancy calculation allows) before moving on to the next SM.
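The difference between the two policies can be seen in a toy Python model that distributes the first wave of CTAs over SMs. This is not the simulator code: it ignores clusters, kernel launch timing, and CTAs that are issued later as others finish, and max_ctas_per_sm stands in for whatever the occupancy calculation allows. All function names are hypothetical.

```python
def round_robin_assign(num_ctas, num_sms, max_ctas_per_sm):
    """Spread CTAs across SMs one at a time, like the default policy."""
    counts = [0] * num_sms
    remaining = num_ctas
    sm = 0
    while remaining > 0 and any(c < max_ctas_per_sm for c in counts):
        if counts[sm] < max_ctas_per_sm:
            counts[sm] += 1
            remaining -= 1
        sm = (sm + 1) % num_sms
    return counts


def greedy_assign(num_ctas, num_sms, max_ctas_per_sm):
    """Fill each SM to its occupancy limit before using the next SM."""
    counts = [0] * num_sms
    remaining = num_ctas
    for sm in range(num_sms):
        take = min(remaining, max_ctas_per_sm)
        counts[sm] = take
        remaining -= take
    return counts


def mean_ctas_per_busy_sm(counts):
    """Average CTA count over SMs that received at least one CTA."""
    busy = [c for c in counts if c > 0]
    return sum(busy) / len(busy) if busy else 0.0


print(round_robin_assign(10, 4, 4))  # [3, 3, 2, 2]
print(greedy_assign(10, 4, 4))       # [4, 4, 2, 0]
```

With 10 CTAs, 4 SMs, and a limit of 4 CTAs per SM, the greedy policy leaves one SM idle while the round-robin policy keeps all four busy. Whether to average over all SMs or only busy ones is a definition you should state explicitly in your report.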

3. Add an additional metric for the average number of CTAs per SM in the simulator. Simulate the
Rodinia benchmark suite for each of the three configurations: baseline, 8-cluster, and greedy. Also
measure the IPC, L1 misses, and L2 misses for each configuration. Save these simulation statistics to
[Link] and [Link].
Compare these measurements in graphs and add them to your report. Explain what your results mean
for each scheduler configuration.

2.4. Making Measurements: Control Flow Divergence [2 points]

1. In SASS mode, collect and plot metrics on how much control-flow divergence (average SIMT efficiency,
which ranges between 1/32 and 32/32) and memory divergence (average number of memory accesses
per memory instruction) there is in each application. For memory, keep in mind that the Volta machine
has 128-byte cache lines with 32-byte sectors. Therefore, a perfectly coalesced memory instruction can
still generate four 32-byte accesses. In the worst case, a completely diverged access generates 32
32-byte memory transactions.
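The sector arithmetic can be checked with a short Python sketch that counts the distinct 32-byte sectors touched by one warp-wide memory instruction. It assumes each of the 32 threads accesses 4 bytes; the function name sectors_touched and the example address patterns are illustrative only.

```python
SECTOR_BYTES = 32
WARP_SIZE = 32


def sectors_touched(addresses, access_bytes=4):
    """Count distinct 32-byte sectors covered by the given byte addresses."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)


# Perfectly coalesced: 32 threads read consecutive 4-byte words (128 bytes).
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# Fully diverged: each thread lands in a different cache line.
strided = [128 * lane for lane in range(WARP_SIZE)]

print(sectors_touched(coalesced))  # 4
print(sectors_touched(strided))    # 32
```

Averaging this count over all global memory instructions gives the memory divergence metric described above.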
There is no prepared script to measure SIMT efficiency and memory divergence; you will need to
add these measurements yourself. For SIMT efficiency, look for the shader_core_ctx::issue_warp() function in
./gpu-simulator/gpgpu-sim/src/gpgpu-sim/[Link]. m_stats->shader_cycle_distro tracks a histogram
of the number of active threads in each warp when an instruction is issued. From this information,
you should be able to compute the overall SIMT efficiency. You may either add an aggregating print
to the simulator itself or simply write a post-processing script that computes the SIMT efficiency from
m_stats->shader_cycle_distro when it is printed as "Warp Occupancy Distribution". For memory
divergence, you will need to count both the number of global memory instructions and the
number of global memory accesses.
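As a starting point for the post-processing route, the Python sketch below computes SIMT efficiency from one histogram line. It assumes the line contains entries of the form W1:count through W32:count, alongside stall entries such as W0_Idle that should be ignored; verify this format against your own output before relying on it. The sample line is made up.

```python
import re

WARP_SIZE = 32


def simt_efficiency(distribution_line):
    """Return the average fraction of active lanes per issued instruction,
    or None if the line contains no issued-warp entries."""
    counts = {
        int(lanes): int(count)
        for lanes, count in re.findall(r"\bW(\d+):(\d+)", distribution_line)
        if int(lanes) >= 1
    }
    issued = sum(counts.values())
    if issued == 0:
        return None
    active_lanes = sum(lanes * count for lanes, count in counts.items())
    return active_lanes / (WARP_SIZE * issued)


# Made-up histogram line in the assumed format.
line = "Stall:5\tW0_Idle:7\tW0_Scoreboard:3\tW1:0\tW16:10\tW32:30"
print(simt_efficiency(line))  # 0.875
```

Here 10 instructions issued with 16 active threads and 30 with all 32, so the efficiency is (10*16 + 30*32) / (32*40) = 0.875. If the histogram is printed more than once per run, you will need to decide which occurrence to use.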
In your report, present your results for SIMT efficiency and memory divergence. Use this data to answer
the following questions:

• Is there a connection between these divergence metrics and the application’s instructions-per-cycle
(IPC)? Why or why not?

2. The default approach to control flow in the simulator uses SIMT stacks. You can read more about
this implementation in "Thread Block Compaction for Efficient SIMT Control Flow" [3]. The SIMT
stack is implemented in the simt_stack class and can be found in
./gpu-simulator/gpgpu-sim/src/abstract_hardware_model.cc.


Choose an application from the Rodinia benchmarks that demonstrates low SIMT efficiency (high
branch divergence). In PTX mode, simulate the benchmark and identify the instruction PCs causing
the most divergence. For each diverging PC, track the number of cycles from when the warp diverges
until reconvergence.
In your report, identify the benchmark you selected and present your data comparing diverging
instruction PCs and the length of divergence they cause. Explain what you could change to make this
benchmark faster and more efficient.
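The bookkeeping for this measurement can be prototyped in Python before it is written into the simulator. The sketch below assumes you can produce a log of divergence and reconvergence events, each with a warp ID, the PC of the diverging instruction, and a cycle number. The event hooks and the DivergenceTracker class are hypothetical and do not correspond to existing simt_stack methods.

```python
from collections import defaultdict


class DivergenceTracker:
    """Accumulate cycles spent diverged, keyed by the diverging PC."""

    def __init__(self):
        self._open = {}                       # (warp_id, pc) -> start cycle
        self.total_cycles = defaultdict(int)  # pc -> summed diverged cycles
        self.events = defaultdict(int)        # pc -> completed divergences

    def on_diverge(self, warp_id, pc, cycle):
        # Keep the earliest start if the same (warp, pc) diverges again
        # before it has reconverged.
        self._open.setdefault((warp_id, pc), cycle)

    def on_reconverge(self, warp_id, pc, cycle):
        start = self._open.pop((warp_id, pc), None)
        if start is not None:
            self.total_cycles[pc] += cycle - start
            self.events[pc] += 1

    def ranked(self):
        """Return (pc, total_cycles) pairs, most costly first."""
        return sorted(self.total_cycles.items(),
                      key=lambda item: item[1], reverse=True)


tracker = DivergenceTracker()
tracker.on_diverge(0, 0x40, 100)
tracker.on_reconverge(0, 0x40, 160)
tracker.on_diverge(1, 0x40, 110)
tracker.on_reconverge(1, 0x40, 130)
tracker.on_diverge(0, 0x80, 200)
tracker.on_reconverge(0, 0x80, 210)
for pc, cycles in tracker.ranked():
    print(hex(pc), cycles)
```

In this made-up event sequence, PC 0x40 accounts for 80 diverged cycles across two warps and PC 0x80 for 10, so 0x40 would be the first branch to examine.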
To re-run any individual benchmark, find the appropriate folder and use the ./[Link] script.
./sim_run_12.8/[application]/[data]/[config]/[Link]

The output will be printed on the console, or you may redirect it to a log file.

Chapter 3

Submission Instructions

This assignment is due 11:59PM Friday, December 5, 2025.

To submit, please push your changes to the GitHub repository. Change the README file in the repo
to include your student ID and your name and make sure your repository has the updated source code
files.
Please include the following files:

• [Link] of your final report for the assignment. There is no page limit, but please be concise
and keep the report to a reasonable length.

• Simulator result files added to the simulator-results folder:

– [Link] and [Link] for the PTX versus SASS results.

– [Link] and [Link] for the scheduler comparison.

• Simulator source files that you edit, copied to the modified-simulator folder:

– ./gpu-simulator/gpgpu-sim/src/gpgpu-sim/[Link] where you implement your greedy CTA scheduler.
– ./gpu-simulator/gpgpu-sim/src/abstract_hardware_model.cc where you track diverging
instruction PCs.

Be careful and double-check that all files have been included and that the source files are the
correct/latest version!

Bibliography

[1] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-Sim: An extensible simulation framework
for validated GPU modeling,” in ACM/IEEE International Symposium on Computer Architecture (ISCA).
IEEE, 2020, pp. 473–486.
[2] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark
suite for heterogeneous computing,” in IEEE International Symposium on Workload Characterization (IISWC).
IEEE, 2009, pp. 44–54.
[3] W. W. Fung and T. M. Aamodt, “Thread block compaction for efficient SIMT control flow,” in 2011 IEEE
17th International Symposium on High Performance Computer Architecture. IEEE, 2011, pp. 25–36.


GPU Training Optimization Assignment
No ratings yet
GPU Training Optimization Assignment
37 pages
Accel-Sim: An Extensible Simulation Framework For Validated GPU Modeling
No ratings yet
Accel-Sim: An Extensible Simulation Framework For Validated GPU Modeling
14 pages
Lab - 2 in - Lab Solution
No ratings yet
Lab - 2 in - Lab Solution
11 pages
GPU Acceleration in Simulation Optimization
No ratings yet
GPU Acceleration in Simulation Optimization
15 pages
GPU Performance Optimization Guide
No ratings yet
GPU Performance Optimization Guide
150 pages
Modelado de GPUs con Accel-sim
No ratings yet
Modelado de GPUs con Accel-sim
56 pages
2D Heat Diffusion Solver Assignment
No ratings yet
2D Heat Diffusion Solver Assignment
5 pages
GPU EstimationMachineLearning PDF
No ratings yet
GPU EstimationMachineLearning PDF
22 pages
Lab3 2
No ratings yet
Lab3 2
6 pages
Cuda Lab Manual
100% (1)
Cuda Lab Manual
22 pages
CUDA Parallel Programming Overview
No ratings yet
CUDA Parallel Programming Overview
93 pages
GPU Computing in Data Science
No ratings yet
GPU Computing in Data Science
34 pages
GPGPU-Sim 3.x Tutorial Overview
No ratings yet
GPGPU-Sim 3.x Tutorial Overview
27 pages
OpenACC SAXPY GPU Computing Guide
No ratings yet
OpenACC SAXPY GPU Computing Guide
40 pages
GPU Power and Performance Analysis Report
No ratings yet
GPU Power and Performance Analysis Report
14 pages
L21 GPU Programming
No ratings yet
L21 GPU Programming
6 pages
Optimizing GPU Performance Metrics
No ratings yet
Optimizing GPU Performance Metrics
9 pages
Lec 03
No ratings yet
Lec 03
9 pages
CS 179: Intro to GPU Programming
No ratings yet
CS 179: Intro to GPU Programming
26 pages
GPU Parallel Computing: Speedup Analysis
No ratings yet
GPU Parallel Computing: Speedup Analysis
29 pages
Running CUDA on Rocks Cluster
No ratings yet
Running CUDA on Rocks Cluster
17 pages
CS146 Computer Architecture Homework 2
No ratings yet
CS146 Computer Architecture Homework 2
5 pages
GPGPU Performance Analysis and Tuning
No ratings yet
GPGPU Performance Analysis and Tuning
100 pages
Data Extraction and Plotting Tutorial
No ratings yet
Data Extraction and Plotting Tutorial
1 page
Introduction to CUDA Parallel Computing
No ratings yet
Introduction to CUDA Parallel Computing
39 pages
Cython vs Numba Performance Comparison
No ratings yet
Cython vs Numba Performance Comparison
15 pages
Introduction to GPU Programming Basics
No ratings yet
Introduction to GPU Programming Basics
36 pages
GPU-Accelerated Trading Solutions
No ratings yet
GPU-Accelerated Trading Solutions
7 pages
CUDA Assignment Guidelines for HPC
No ratings yet
CUDA Assignment Guidelines for HPC
7 pages
GPU Acceleration in Scientific Computing
100% (2)
GPU Acceleration in Scientific Computing
96 pages
GPGPU-Sim Tutorial Overview
No ratings yet
GPGPU-Sim Tutorial Overview
35 pages
HPC Applications in R: Course Overview
No ratings yet
HPC Applications in R: Course Overview
68 pages
Case Study: CFD Dr. Graham Pullan University of Cambridge: Nvidia Tesla
No ratings yet
Case Study: CFD Dr. Graham Pullan University of Cambridge: Nvidia Tesla
56 pages
CS 239: CUDA Matrix Addition Exercise
No ratings yet
CS 239: CUDA Matrix Addition Exercise
2 pages
Comprehensive GPU Performance Modeling
No ratings yet
Comprehensive GPU Performance Modeling
8 pages
CUDA Programming for Beginners
No ratings yet
CUDA Programming for Beginners
21 pages
CUDA Programming: Saxpy Examples
No ratings yet
CUDA Programming: Saxpy Examples
57 pages
Accelerate R with CUDA Libraries
No ratings yet
Accelerate R with CUDA Libraries
59 pages
CUDA Programming Basics and Tools
100% (1)
CUDA Programming Basics and Tools
173 pages
CS 179: Introduction to GPU Programming
No ratings yet
CS 179: Introduction to GPU Programming
24 pages
GPU Computing: A Comprehensive Guide
No ratings yet
GPU Computing: A Comprehensive Guide
29 pages
Kernelgen Pavt 2012 Slides
No ratings yet
Kernelgen Pavt 2012 Slides
51 pages
Kruger, Westermann - Linear Algebra Operators For GPU Implementation of Numerical Algorithms
No ratings yet
Kruger, Westermann - Linear Algebra Operators For GPU Implementation of Numerical Algorithms
9 pages
GPU Performance Model for Optimization
No ratings yet
GPU Performance Model for Optimization
12 pages
Parallelizing A Modern GPU Simulator: Rodrigo Huerta Antonio González
No ratings yet
Parallelizing A Modern GPU Simulator: Rodrigo Huerta Antonio González
6 pages
GPU Computing Guide 2011
No ratings yet
GPU Computing Guide 2011
24 pages
nVidia G80 GPU Architecture Overview
No ratings yet
nVidia G80 GPU Architecture Overview
25 pages
GPU-Parallel C++ Programming Guide
0% (1)
GPU-Parallel C++ Programming Guide
71 pages
CUDA Fortran Implementation in WRF GPU
No ratings yet
CUDA Fortran Implementation in WRF GPU
22 pages
Algorithmic Trading and Machine Learning Based On GPU: Mantas Vaitonis Saulius Masteika Konstantinas Korovkinas
No ratings yet
Algorithmic Trading and Machine Learning Based On GPU: Mantas Vaitonis Saulius Masteika Konstantinas Korovkinas
5 pages
CUDA Programming: Vector Addition Guide
No ratings yet
CUDA Programming: Vector Addition Guide
5 pages
Ai Accelerator Nptel
No ratings yet
Ai Accelerator Nptel
5 pages
CUDA Programming for High Performance Computing
No ratings yet
CUDA Programming for High Performance Computing
77 pages
GPU Acceleration for Quantum Chemistry Integrals
No ratings yet
GPU Acceleration for Quantum Chemistry Integrals
10 pages
GPU Programming and Architecture Overview
No ratings yet
GPU Programming and Architecture Overview
56 pages
3D Finite Difference Computation On Gpus Using Cuda: Paulius Micikevicius
No ratings yet
3D Finite Difference Computation On Gpus Using Cuda: Paulius Micikevicius
6 pages
GPU Acceleration for RTL Simulation
No ratings yet
GPU Acceleration for RTL Simulation
5 pages
Whitepaper High Performance Modeling GPU Computing in Insurance
No ratings yet
Whitepaper High Performance Modeling GPU Computing in Insurance
11 pages
Tesla Personal Supercomputer Overview
No ratings yet
Tesla Personal Supercomputer Overview
16 pages
AI and Quantum Science Synergy
No ratings yet
AI and Quantum Science Synergy
94 pages
DVFS Impact on K20 GPU Performance
No ratings yet
DVFS Impact on K20 GPU Performance
8 pages

CPEN 411: Assignment 4

Accel-Sim GPU Simulation

November 26, 2025


Introduction and Setup [0 Points]

1.1. Preparatory Steps: Setup

1. Navigate to the GitHub classroom page for this assignment ([Link]

1.2. Preparatory Steps: Accel-Sim

1. Clone Accel-Sim from the public repository [Link]

2. Source the setup script for Accel-Sim.

Accel-Sim is a simulation framework that uses another simulator GPGPU-Sim ([Link]

1.3. Preparatory Steps: Benchmarks

These traces are also downloaded to /ubc/ece/home/courses/cpen411/assignment4/rodinia-traces

2. Launch the simulations using the job launching script.

4. The simulations can be prematurely terminated with

1.4. Preparatory Steps: Processing Results
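As a rough illustration of what processing results can involve, the sketch below pulls a few cumulative statistics out of simulator output text. It assumes the output contains GPGPU-Sim style lines of the form `name = value` (for example `gpu_tot_ipc`). The helper name `last_stats` and the sample log are made up for this example and are not part of the assignment's own scripts.

```python
import re

# Statistics of interest; GPGPU-Sim prints them as "name = value" lines.
STAT_RE = re.compile(
    r"^\s*(gpu_tot_sim_cycle|gpu_tot_sim_insn|gpu_tot_ipc)\s*=\s*([0-9.eE+-]+)",
    re.MULTILINE,
)


def last_stats(log_text):
    """Return the last reported value of each statistic found in log_text.

    The totals are reprinted after every kernel, so the final occurrence
    is the one that covers the whole run.
    """
    stats = {}
    for name, value in STAT_RE.findall(log_text):
        stats[name] = float(value)  # later matches overwrite earlier ones
    return stats


# Hypothetical log excerpt (invented numbers, two kernels).
SAMPLE_LOG = """
gpu_tot_sim_cycle = 1000
gpu_tot_sim_insn = 50000
gpu_tot_ipc =      50.0000
gpu_tot_sim_cycle = 3000
gpu_tot_sim_insn = 120000
gpu_tot_ipc =      40.0000
"""

stats = last_stats(SAMPLE_LOG)
print(stats)
```

Keeping only the last occurrence is a deliberate choice here: per-kernel values would need a different parser that splits the log at kernel boundaries.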

Evaluation: [5 + 0.5 bonus points total]

2.1. Basic Understanding: Accel-Sim [0.5 point]

2.2. Using the Simulator: PTX versus SASS [1 point]

2.3. Modifying Configurations: CTA Scheduler [2 points]

2. The CTA-level scheduler is implemented in GPGPU-Sim:

2.4. Making Measurements: Control Flow Divergence [2 points]
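One common way to quantify control flow divergence is SIMT efficiency: the fraction of warp lanes that are active, averaged over all issued warp instructions. The sketch below computes it from a list of active masks, assuming 32-lane warps and one mask per issued instruction. The function name and the example masks are hypothetical, and the assignment may define the measurement differently.

```python
WARP_SIZE = 32  # assumed warp width


def simt_efficiency(active_masks, warp_size=WARP_SIZE):
    """Average fraction of active lanes over all issued warp instructions.

    active_masks: iterable of integers, one bit per lane (1 = active).
    Returns 0.0 when no instructions were issued.
    """
    masks = list(active_masks)
    if not masks:
        return 0.0
    active_lanes = sum(bin(mask).count("1") for mask in masks)
    return active_lanes / (len(masks) * warp_size)


# Two fully converged instructions, then a branch that splits the warp
# into two halves, each executed with 16 of 32 lanes active.
masks = [0xFFFFFFFF, 0xFFFFFFFF, 0x0000FFFF, 0xFFFF0000]
efficiency = simt_efficiency(masks)
print(efficiency)
```

In this made-up trace, 96 of the 128 available lane slots do useful work, so the divergent branch costs a quarter of the issue bandwidth.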

Submission Instructions

This assignment is due 11:59 PM Friday, December 5, 2025.

• Simulator result files added to the simulator-results folder:

– [Link] and [Link] for the PTX versus SASS results.

– ./gpu-simulator/gpgpu-sim/src/gpgpu-sim/[Link] where you implement your greedy

Common questions

In the context of GPU simulations, what is the impact of different configurations of CTA scheduling on performance metrics like IPC, L1 misses, and L2 misses?

How do PTX and SASS differ in their execution within the Accel-Sim framework, and what are the implications of these differences on performance evaluation?

In what ways can modifying the gpgpu_n_cores_per_cluster in the simulator configuration influence the execution of a benchmark task?

What are the potential advantages and limitations of using trace-based simulation frameworks like Accel-Sim over traditional cycle-accurate simulations in GPU architecture studies?

In the context of analyzing GPU simulation results, what are the benefits of generating visual plots from raw data, and how can this practice enhance understanding?

Explain how the setup of environmental variables influences the functionality of the CUDA Toolkit in a GPU simulation environment.

How does control flow divergence affect the SIMT efficiency in GPU architectures as represented in the Accel-Sim framework, and what strategies can be used to mitigate its impact?

What challenges and considerations are involved in implementing a Greedy CTA Scheduler within GPGPU-Sim, and how does it affect performance metrics?

Why might researchers choose to simulate the Rodinia benchmark suite on Accel-Sim, and how can the results inform architectural research?

What are the key reasons for using the Accel-Sim GPU simulation framework instead of a physical GPU for architecture research?
