CPEN 411: Assignment 4
Accel-Sim GPU Simulation
Tor M. Aamodt
November 26, 2025
Contents
1 Introduction and Setup [0 Points]
1.1 Preparatory Steps: Setup
1.2 Preparatory Steps: Accel-Sim
1.3 Preparatory Steps: Benchmarks
1.4 Preparatory Steps: Processing Results
2 Evaluation: [5 + 0.5 bonus points total]
2.1 Basic Understanding: Accel-Sim [0.5 point]
2.2 Using the Simulator: PTX versus SASS [1 point]
2.3 Modifying Configurations: CTA Scheduler [2 points]
2.4 Making Measurements: Control Flow Divergence [2 points]
3 Submission Instructions
Chapter 1
Introduction and Setup [0 Points]
This assignment is an introduction to GPU simulation using Accel-Sim [1] ([Link]),
a trace-based simulation framework.
You may choose to complete this assignment on your personal machine if you prefer. However, we do
not provide any additional support for that. The Accel-Sim documentation lists all dependencies and
requirements to run the simulator on your personal device.
1.1. Preparatory Steps: Setup
1. Navigate to the GitHub classroom page for this assignment ([Link]
AaE0IAlo) and similar to Assignment 1, click on ‘Accept this assignment’. Click on the URL under the
text ‘Your assignment repository has been created’ to go to your repository. Clone the repository to
your account on the ECE server <username>@[Link].
2. Although we will not be using a physical GPU for this assignment, the simulator still relies on several
CUDA tools to run. The CUDA toolkit is installed in /ubc/ece/home/courses/cpen411/assignment4/
cuda-toolkit on the ECE server.
If you are not using the ECE server, please download and install the CUDA Toolkit. Note that you do
not need to install the CUDA driver. For this assignment, we will be using CUDA version 12.8.0.
wget [Link]
cuda_12.8.0_570.86.10_linux.run
mkdir cuda-toolkit
sh cuda_12.8.0_570.86.10_linux.run --silent --toolkit --toolkitpath=`pwd`/cuda-toolkit --override
3. Add the CUDA Toolkit and xutils to your environment variables so that they can be easily accessed.
You can update setup_cuda.sh with the correct paths and use source setup_cuda.sh to quickly set
your environment.
export CUDA_INSTALL_PATH=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit
export CUDA_HOME=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit
export PTXAS_CUDA_INSTALL_PATH=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit
export LD_LIBRARY_PATH=/ubc/ece/home/courses/cpen411/assignment4/cuda-toolkit/lib64/
export PATH=$CUDA_INSTALL_PATH/bin:`pwd`/xutils/usr/bin:$PATH
Once the CUDA Toolkit has been added to your environment PATH, you can run nvcc --version
to make sure the CUDA compiler is functional.
1.2. Preparatory Steps: Accel-Sim
1. Clone Accel-Sim from the public repository [Link]
tree/release.
git clone git@[Link]:accel-sim/[Link]
2. Source the setup script for Accel-Sim.
cd accel-sim-framework/gpu-simulator
source setup_environment.sh
Accel-Sim is a simulation framework that uses another simulator, GPGPU-Sim ([Link]),
as its core performance model. The first time you source the setup script, you will be prompted to
clone the GPGPU-Sim repository. Follow the default prompts to clone the dev branch of GPGPU-Sim
from the main distribution repo.
GPGPU-Sim operates by tricking CUDA executables into using the simulator instead of a real GPU. The
simulator itself is compiled as a "shared object" in Linux ([Link]) that mirrors the [Link]
provided by NVIDIA. When you source setup_environment.sh, the path to the [Link] is added to
LD_LIBRARY_PATH. From then on, in that shell, any CUDA executable you run will use GPGPU-Sim.
3. Compile the GPU simulator, which also creates the [Link] object.
make -j
If you get an error relating to available resources, please use make without the parallel flag instead.
If the compilation is successful, you should not see any error messages. After compilation, you can find
the [Link] files in the lib folder under the CUDA and GCC versions (12.8 and 13.3) used during
the build. You will need to recompile the GPU simulator whenever you modify the simulator code.
There are many resources available to learn more about the Accel-Sim framework and the underlying
GPGPU-Sim simulator. Here are links to some helpful videos:
• [Link]
• [Link]
1.3. Preparatory Steps: Benchmarks
Accel-Sim models the GPU device. To run simulations, we also need GPU kernels to execute on Accel-
Sim. The Accel-Sim framework provides many GPU benchmarks that are commonly used to evaluate
microarchitecture optimizations. In this assignment, we will use selected workloads from the Rodinia
benchmark suite [2] ([Link]).
1. The Rodinia traces are available from Accel-Sim using their provided script. If you are not using the
ECE servers, you will need to run this script to get your own copy of the Rodinia trace files.
./[Link]
A copy of these traces is also available on the ECE server in
/ubc/ece/home/courses/cpen411/assignment4/rodinia-traces.
Not all Rodinia workloads are supported by the simulator, and support varies by CUDA version. Copy the
modified list of Rodinia workloads we will run in this assignment to the Accel-Sim folder.
cp [Link] ../accel-sim-framework/util/job_launching/apps/.
2. Launch the simulations using the job launching script.
./util/job_launching/run_simulations.py -B rodinia-3.1 -T /ubc/ece/home/courses/cpen411/assignment4/rodinia-traces -N rodinia-test
This command runs the benchmark suite given by -B using the trace files given by -T, and names the
job with -N. Depending on resource constraints, you may find it useful to use the -c flag to choose
how many simulations to run concurrently. You can learn more about these flags using --help. Because
no GPU configuration is specified, the default (QV100) is used. All of the available configurations
are defined in a YAML file:
./util/job_launching/configs/[Link]
You can change the GPU model to simulate by using the -C flag and also create your own special
configuration GPU models by modifying the [Link] file and adding the definition to the
YAML list.
3. Once the simulations have been launched, you can monitor their progress using another script.
./util/job_launching/job_status.py -N rodinia-test
This script will show which benchmarks are currently being simulated and which are complete.
4. The simulations can be prematurely terminated with
./util/job_launching/[Link] -k
1.4. Preparatory Steps: Processing Results
Once the simulations are complete, we can analyze the simulation statistics and results to better
understand how the GPU executed each benchmark. These statistics also help identify inefficiencies
and bottlenecks in the GPU hardware, which allows researchers to introduce hardware modifications
or software optimizations.
1. All the output from the simulator will be placed in sim_run_12.8/(app)/(args)/(config)/. Inside
each of these directories, there will be files labeled *.o(jobId) and *.e(jobId) which store the stdout
(o) and stderr (e) for each particular run.
2. Use the get_stats.py script to collect the simulation statistics and save to a CSV file.
./util/job_launching/get_stats.py -N rodinia-test > [Link]
By default, the stats collected (represented as regexes matched against the output files) are those
listed in the following file. You can extend this list with other stats that you create.
./util/job_launching/stats/example_stats.yml
3. To generate plots using the available Python scripts, you may need to install additional Python
packages. On the ECE server, you will need to use a virtual environment.
python3 -m venv ~/cpen411-a4
source ~/cpen411-a4/bin/activate
pip install plotly pyyaml numpy
Once all the necessary Python packages are installed, run the plotting script to visualize your results.
Alternatively, you can generate your own plots based on the data saved in the CSV file, which will likely
be more useful for your assignment report.
./util/plotting/[Link] -c [Link]
These plots are saved to ./util/plotting/htmls/*.html.
Chapter 2
Evaluation: [5 + 0.5 bonus points total]
This assignment will primarily be graded based on your submitted report. Your report should include
a section that addresses each of the tasks below.
2.1. Basic Understanding: Accel-Sim [0.5 point]
Write a paragraph in your report that explains what Accel-Sim does and how it can be used to support
GPU architecture research. Why would someone choose to use Accel-Sim instead of a physical GPU?
2.2. Using the Simulator: PTX versus SASS [1 point]
Accel-Sim offers two modes of operation: functional emulation using PTX and replay of SASS instruction
traces collected from hardware. PTX is a virtual ISA, or an intermediate representation of the
program. SASS is the real machine ISA executed on NVIDIA GPUs.
To enable PTX simulation, we need the source binaries for the Rodinia benchmarks instead of the traces
used for SASS. These are available in
/ubc/ece/home/courses/cpen411/assignment4/rodinia-src/gpu-app-collection. If you are not using the
ECE servers, please clone these benchmarks from git@[Link]:accel-sim/[Link] and follow their
instructions to compile.
To execute the program in SASS mode:
./util/job_launching/run_simulations.py -C QV100-SASS -B rodinia-3.1 -T /ubc/ece/home/courses/cpen411/assignment4/rodinia-traces -N rodinia-sass
To execute the program in PTX mode:
source /ubc/ece/home/courses/cpen411/assignment4/rodinia-src/gpu-app-collection/src/setup_environment
./util/job_launching/run_simulations.py -C QV100-PTX -B rodinia-3.1 -N rodinia-ptx
Save the simulation statistics for both to [Link] and [Link].
Plot the execution time of the Rodinia benchmarks in PTX and SASS modes and include the comparison
in your report. Answer the following questions in your report, referring to your execution time
results.
• What do these numbers say about the design of the PTX versus SASS ISA?
• What do these numbers say about the optimization level of each code type?
2.3. Modifying Configurations: CTA Scheduler [2 points]
Compare different modifications to the threadblock scheduler. By default, the scheduler will attempt
to involve as many SMs and SM clusters as possible – round-robin between clusters and SMs within
clusters before assigning more than one threadblock to a core. In this task, you will make modifications
to this scheduler and analyze the results.
1. In the default Volta configuration provided, each SM gets its own interconnect port. Consider how
the scheduler might change when there are multiple SMs clustered together. Add a new configuration
setting add-on that sets the gpgpu_n_cores_per_cluster to 8 in the configuration file:
./util/job_launching/configs/[Link]
2. The CTA-level scheduler is implemented in GPGPU-Sim:
./gpu-simulator/gpgpu-sim/src/gpgpu-sim/[Link]:
void gpgpu_sim::issue_block2core()
Modify the simulator to implement a Greedy CTA Scheduler. Make the scheduler pack one SM fully
(as much as the occupancy calculation allows) before moving on to the next SM.
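To make the difference between the two policies concrete, here is a minimal Python sketch of a toy model. It is not the simulator's C++ code: it ignores clusters, CTA completion, and per-kernel resource limits. The names round_robin, greedy, avg_ctas_per_active_sm, num_sms, and max_ctas_per_sm are hypothetical, and max_ctas_per_sm stands in for whatever the occupancy calculation allows. Averaging over active SMs only is one possible definition of "CTAs per SM", chosen here as an assumption.

```python
def round_robin(num_ctas, num_sms, max_ctas_per_sm):
    """Toy baseline: hand out CTAs one SM at a time before doubling up."""
    per_sm = [0] * num_sms
    sm = 0
    # Never place more CTAs than the whole (toy) GPU can hold at once.
    for _ in range(min(num_ctas, num_sms * max_ctas_per_sm)):
        per_sm[sm] += 1
        sm = (sm + 1) % num_sms
    return per_sm


def greedy(num_ctas, num_sms, max_ctas_per_sm):
    """Toy greedy policy: fill one SM to its limit before using the next."""
    per_sm = [0] * num_sms
    remaining = num_ctas
    for sm in range(num_sms):
        per_sm[sm] = min(remaining, max_ctas_per_sm)
        remaining -= per_sm[sm]
    return per_sm


def avg_ctas_per_active_sm(per_sm):
    """Average CTA count over SMs that received at least one CTA."""
    active = [n for n in per_sm if n > 0]
    return sum(active) / len(active) if active else 0.0


if __name__ == "__main__":
    for policy in (round_robin, greedy):
        placement = policy(num_ctas=6, num_sms=4, max_ctas_per_sm=4)
        print(policy.__name__, placement, avg_ctas_per_active_sm(placement))
```

With 6 CTAs, 4 SMs, and a limit of 4 CTAs per SM, the toy round-robin policy uses every SM while the toy greedy policy leaves two SMs idle, which is the kind of contrast the measurements in this task should expose.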
3. Add an additional metric for the average number of CTAs per SM in the simulator. Simulate the
Rodinia benchmark suite for each of the three configurations: baseline, 8-cluster, and greedy. Also
measure the IPC, L1 misses, and L2 misses for each configuration. Save these simulation statistics to
[Link] and [Link].
Compare these measurements in graphs and add them to your report. Explain what your results mean
for each scheduler configuration.
2.4. Making Measurements: Control Flow Divergence [2 points]
1. In SASS mode, collect and plot metrics on how much control-flow divergence (average SIMT efficiency,
which ranges between 1/32 and 32/32) and memory divergence (average number of memory accesses
per memory instruction) there is in each application. For memory, keep in mind that the Volta machine
has 128-byte cache lines with 32-byte sectors. Therefore, a perfectly coalesced memory instruction can
still generate four 32-byte accesses. In the worst case, a completely diverged access generates 32
separate 32-byte transactions.
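The sector arithmetic above can be checked with a short Python sketch. It assumes each of the 32 threads in a warp issues a 4-byte access, and the helper name sectors_touched is hypothetical; it only counts distinct 32-byte sectors and does not model the simulator's coalescing logic.

```python
def sectors_touched(addresses, access_bytes=4, sector_bytes=32):
    """Count distinct sectors covered by one warp-wide memory instruction."""
    sectors = set()
    for addr in addresses:
        first = addr // sector_bytes
        last = (addr + access_bytes - 1) // sector_bytes
        sectors.update(range(first, last + 1))
    return len(sectors)


# Assumed access patterns for a 32-thread warp (illustrative only).
coalesced = [4 * lane for lane in range(32)]    # consecutive 4-byte words
diverged = [128 * lane for lane in range(32)]   # one cache line apart

if __name__ == "__main__":
    print(sectors_touched(coalesced))  # 32 threads * 4 bytes = 128 bytes
    print(sectors_touched(diverged))   # every thread lands in its own sector
```

Under these assumptions the coalesced pattern covers 128 contiguous bytes, which is four sectors, and the strided pattern touches 32 different sectors.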
No script is provided to measure SIMT efficiency or memory divergence; you will need to add these
measurements yourself. For SIMT efficiency, look for the shader_core_ctx::issue_warp() function in
./gpu-simulator/gpgpu-sim/src/gpgpu-sim/[Link]. The m_stats->shader_cycle_distro tracks a histogram
of the number of active threads in each warp when an instruction is issued. From this information,
you should be able to compute the overall SIMT efficiency. You may either add an aggregating print
to the simulator itself or simply write a post-processing script that computes the SIMT efficiency from
m_stats->shader_cycle_distro when it is printed as "Warp Occupancy Distribution". For memory
divergence, you will need to count both the number of global memory instructions and the
number of global memory accesses.
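A post-processing script along these lines might look like the following Python sketch. It assumes the line printed after "Warp Occupancy Distribution" contains tab-separated buckets of the form W<n>:<count> for n = 1..32, alongside entries such as Stall and W0_Idle that should be ignored; check this against your own simulator output before relying on it. The function names are hypothetical.

```python
import re

WARP_SIZE = 32
# Matches buckets such as "W16:2" but not "W0_Idle:5" or "W0_Scoreboard:3".
_BUCKET = re.compile(r"\bW(\d+):(\d+)")


def simt_efficiency(distribution_line):
    """Average active threads per issued instruction, divided by warp size."""
    issued = 0
    active = 0
    for width, count in _BUCKET.findall(distribution_line):
        width, count = int(width), int(count)
        if 1 <= width <= WARP_SIZE:
            issued += count
            active += width * count
    return active / (WARP_SIZE * issued) if issued else 0.0


def memory_divergence(num_accesses, num_instructions):
    """Average number of memory accesses per global memory instruction."""
    return num_accesses / num_instructions if num_instructions else 0.0


if __name__ == "__main__":
    # Made-up sample line in the assumed format, not real simulator output.
    sample = "Stall:10\tW0_Idle:5\tW0_Scoreboard:3\tW1:0\tW16:2\tW32:2"
    print(simt_efficiency(sample))
    print(memory_divergence(num_accesses=48, num_instructions=12))
```

The efficiency is the issue-weighted mean of active threads divided by 32, so a run that issues only full warps scores 32/32 and a run that always issues a single thread scores 1/32.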
In your report, present your results for SIMT efficiency and memory divergence. Use this data to answer
the following questions:
• Is there a connection between these divergence metrics and the application’s instructions-per-cycle
(IPC)? Why or why not?
2. The default approach to control flow in the simulator uses SIMT stacks. You can read more about
this implementation in "Thread Block Compaction for Efficient SIMT Control Flow" [3]. The SIMT
stack is implemented in the simt_stack class and can be found in
./gpu-simulator/gpgpu-sim/src/abstract_hardware_model.cc.
Choose an application from the Rodinia benchmarks that demonstrates low SIMT efficiency (high
branch divergence). In PTX mode, simulate the benchmark and identify the instruction PCs causing
the most divergence. For each diverging PC, track the number of cycles from when the warp diverges
until reconvergence.
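One way to organize this measurement is to have the simulator print an event whenever a warp diverges or reconverges and then aggregate the events offline. The Python sketch below assumes a hypothetical event format of (cycle, warp_id, pc, kind) tuples, where kind is "diverge" or "reconverge"; neither this format nor the function name divergence_cycles comes from the simulator. Nested divergence is handled with a per-warp stack, in the spirit of the SIMT stack.

```python
from collections import defaultdict


def divergence_cycles(events):
    """Total cycles spent diverged, attributed to the PC that diverged."""
    open_branches = defaultdict(list)  # warp id -> stack of (pc, start cycle)
    totals = defaultdict(int)          # diverging pc -> total diverged cycles
    for cycle, warp, pc, kind in events:
        if kind == "diverge":
            open_branches[warp].append((pc, cycle))
        elif kind == "reconverge" and open_branches[warp]:
            # The event's own pc is the reconvergence point; the cost is
            # charged to the innermost branch that is still open.
            start_pc, start = open_branches[warp].pop()
            totals[start_pc] += cycle - start
    return dict(totals)


if __name__ == "__main__":
    # Made-up events for two warps; warp 0 has a nested divergence.
    events = [
        (10, 0, 0x40, "diverge"),
        (12, 1, 0x40, "diverge"),
        (15, 0, 0x80, "diverge"),
        (20, 0, 0x90, "reconverge"),
        (22, 1, 0xA0, "reconverge"),
        (50, 0, 0xA0, "reconverge"),
    ]
    for pc, cycles in sorted(divergence_cycles(events).items()):
        print(hex(pc), cycles)
```

Charging the full interval of an outer branch, including time spent inside nested branches, is a design choice made here for simplicity; you may prefer to report exclusive time instead.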
In your report, identify the benchmark you selected and present your data comparing diverging
instruction PCs and the length of divergence they cause. Explain what you could change to make this
benchmark faster and more efficient.
To re-run any individual benchmark, find the appropriate folder and use the ./[Link] script.
./sim_run_12.8/[application]/[data]/[config]/[Link]
The output will be printed on the console, or you may redirect it to a log file.
Chapter 3
Submission Instructions
This assignment is due 11:59PM Friday, December 5, 2025.
To submit, please push your changes to the GitHub repository. Update the README file in the repo
to include your name and student ID, and make sure your repository contains the updated source code
files.
Please include the following files:
• [Link] of your final report for the assignment. There is no page limit, but please be concise
and keep the report to a reasonable length.
• Simulator result files added to the simulator-results folder:
– [Link] and [Link] for the PTX versus SASS results.
– [Link] and [Link] for the scheduler comparison.
• Simulator source files that you edit, copied to the modified-simulator folder:
– ./gpu-simulator/gpgpu-sim/src/gpgpu-sim/[Link] where you implement your greedy
CTA scheduler.
– ./gpu-simulator/gpgpu-sim/src/abstract_hardware_model.cc where you track diverging
instruction PCs.
Be careful and double-check that all files have been included and that the source files are the
correct/latest version!
Bibliography
[1] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers, “Accel-Sim: An extensible simulation framework
for validated GPU modeling,” in ACM/IEEE International Symposium on Computer Architecture (ISCA).
IEEE, 2020, pp. 473–486.
[2] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark
suite for heterogeneous computing,” in IEEE International Symposium on Workload Characterization (IISWC).
IEEE, 2009, pp. 44–54.
[3] W. W. Fung and T. M. Aamodt, “Thread block compaction for efficient SIMT control flow,” in 2011 IEEE
17th International Symposium on High Performance Computer Architecture. IEEE, 2011, pp. 25–36.