Virtuoso: Fast VM Simulation Framework

Virtuoso is a new simulation framework designed to enhance the evaluation of virtual memory (VM) systems by providing fast and accurate prototyping through an imitation-based operating system simulation methodology. It integrates a lightweight userspace OS kernel, MimicOS, with various architectural simulators, allowing researchers to assess the performance implications of VM designs with improved accuracy and reduced simulation time overhead. The framework has been validated against real CPU systems, demonstrating significant improvements in modeling VM operations while maintaining efficiency.

Virtuoso: Enabling Fast and Accurate Virtual Memory Research via an Imitation-based Operating System Simulation Methodology

Konstantinos Kanellopoulos¹, Konstantinos Sgouras¹, F. Nisa Bostanci¹, Andreas Kosmas Kakolyris¹, Berkin Kerim Konar¹, Rahul Bera¹, Mohammad Sadrosadati¹, Rakesh Kumar², Nandita Vijaykumar³, Onur Mutlu¹
¹ETH Zürich  ²Norwegian University of Science and Technology  ³University of Toronto
arXiv:2403.04635v2 [[Link]] 27 Mar 2025

Abstract

The unprecedented growth in data demand from emerging applications has turned virtual memory (VM) into a major performance bottleneck. VM’s overheads are expected to persist as memory requirements continue to increase. Researchers explore new hardware/OS co-designs to optimize VM across diverse applications and systems. To evaluate such designs, researchers rely on various simulation methodologies to model VM components. Unfortunately, current simulation tools (i) either lack the desired accuracy in modeling VM’s software components or (ii) are too slow and complex to prototype and evaluate schemes that span across the hardware/software boundary.

We introduce Virtuoso, a new simulation framework that enables quick and accurate prototyping and evaluation of the software and hardware components of the VM subsystem. The key idea of Virtuoso is to employ a lightweight userspace OS kernel, called MimicOS, that (i) accelerates simulation time by imitating only the desired kernel functionalities, (ii) facilitates the development of new OS routines that imitate real ones, using an accessible high-level programming interface, (iii) enables accurate and flexible evaluation of the application- and system-level implications of VM after integrating Virtuoso to a desired architectural simulator.

In this work, we integrate Virtuoso into five diverse architectural simulators, each specializing in different aspects of system design, and heavily enrich it with multiple state-of-the-art VM schemes. This way, we establish a common ground for researchers to evaluate current VM designs and to develop and test new ones. We demonstrate Virtuoso’s flexibility and versatility by evaluating five diverse use cases, yielding new insights into state-of-the-art VM techniques.

Our validation shows that Virtuoso ported on top of Sniper, a state-of-the-art microarchitectural simulator, models (i) the memory management unit of a real high-end server-grade CPU with 82% accuracy, and (ii) the page fault latency of a real Linux kernel with up to 79% accuracy. Consequently, Virtuoso models the IPC performance of a real high-end server-grade CPU with 21% higher accuracy than the baseline version of Sniper. Virtuoso’s accuracy benefits incur an average simulation time overhead of only 20%, on top of four baseline architectural simulators. The source code of Virtuoso is freely available at [Link]

1 Introduction

Virtual memory (VM) [1–23] is a cornerstone of modern computing systems, enabling application-transparent physical memory management, isolation and data sharing. Contemporary applications (e.g., [24–45]) exhibit different characteristics that stress the VM subsystem. We classify these workloads into two broad categories: (i) long-running workloads (i.e., execution time larger than 100s of seconds) [24, 28–31, 33–35] with large data footprints and irregular memory access patterns, that exhibit high address translation overheads, and (ii) short-running workloads (i.e., execution time often lower than 1 second) [36–45] whose execution time does not amortize the overheads of system software operations (e.g., physical memory allocation). Multiple prior works and industrial studies [46–57] have shown that address translation in long-running workloads and memory allocation in short-running workloads respectively account for up to 40% and 95% of the total execution time. As memory requirements continue to increase and systems transition to larger physical address spaces [58] (e.g., via hybrid memory systems with high-capacity non-volatile memories [59–64], memory disaggregation [65–94]), the overheads associated with VM operations are expected to increase.

To tackle these overheads, many research works take a hardware/OS co-design approach and revisit core aspects of VM such as page table structure [54, 95–103], virtual-to-physical mapping [104–111], physical memory allocation policy (e.g. transparent huge page mechanisms [112–116, 116, 117]) and Translation Lookaside Buffer (TLB) design [13, 118–125]. Evaluating such VM designs is not straightforward. The evaluation challenge primarily arises from the need to model the interplay between both the OS and HW components involved in VM. For example, in modern systems, the OS manages the allocation of large pages, which directly affects the effectiveness of the TLB [126–128], memory footprint
of the page table (PT), PT walk latency [97, 98, 103] and latency of page faults [42, 104, 112, 113, 129, 130]. Given this complex interplay, evaluating the strengths and weaknesses of existing and future VM designs becomes a challenging task without a comprehensive and robust simulation infrastructure.

Unfortunately, modern simulators are either (i) designed for different purposes (e.g., mainly focus on core microarchitecture [131–135]) and thus lack the ability and flexibility to accurately model the impact of the OS components involved in the VM subsystem (e.g., Sniper [133]) or (ii) are relatively slow and hard-to-develop (e.g., gem5 full-system execution mode [136]), which hinders rapid design space exploration. This dichotomy of simulators creates a significant gap in the field, compelling researchers to invest considerable time and effort in developing new custom tools or methodologies for each VM proposal [54, 101, 108, 127–129, 137–141].

Existing Simulation Methodologies. Many simulators (e.g., [131–136, 142–145]) are primarily designed to focus on and model microarchitectural CPU features. These simulators emulate basic OS functionalities and use simplified methods to estimate the implications of OS routines on performance. We classify these simulators as emulation-based. Emulation-based simulators often employ first-order approximations (e.g., fixed latencies) for OS routines and VM operations. As we show in §2, fixed latencies can lead to inaccurate estimation of VM overheads, which display high variability across diverse workloads and system states. Hence, these simulators are not suitable for (and are not primarily designed to be used for) the evaluation of new VM designs that rely on hardware/OS co-design. On the other hand, full-system simulators like gem5 [136] and QFlex [146] allow for detailed simulation of the entire OS, supporting realistic memory management for evaluating new VM architectures. However, such simulators suffer from significant drawbacks, including (i) low simulation speed, (ii) high memory consumption overhead, and (iii) substantial development effort. These drawbacks impede rapid prototyping of new VM schemes that rely on HW/OS co-design.

As we show in Table 1, our goal in this work is to design a simulation framework that (i) maintains the speed of emulation-based simulators while reaching the accuracy of full-system simulators and (ii) enables researchers to easily develop and evaluate new VM schemes. To this end, we present Virtuoso, a new simulation framework that enables fast and accurate prototyping and evaluation of the software and hardware components of the VM subsystem. The key idea of Virtuoso is to employ a lightweight userspace kernel, written in a high-level language (e.g., C++ [147]), that enables researchers to (i) isolate the functionality of only the desired kernel code (e.g., Transparent Huge Pages [114, 115]) to speed up simulation time, (ii) easily develop new OS routines (e.g., a modified physical memory allocator [112, 113, 117, 129]) without being kernel experts, and (iii) accurately evaluate the application- and system-level implications of the OS by integrating Virtuoso into an architectural simulator.

Simulator Type    OS          Speed   Accuracy    Development Effort
Emulation-based   N/A         Fast    Low         Low
Full-system       Realistic   Slow    Very High   High
Our methodology   Imitation   Fast    High        Low

Table 1. Comparison of existing VM simulation methodologies versus our proposed methodology for VM research.

Our proposed methodology involves dynamically instrumenting a userspace kernel that operates as a standalone program and communicates with an architectural simulator via two distinct channels: a functional channel and an instruction stream channel. The functional channel uses shared memory primitives and specialized ISA instructions to enable message exchanges between the kernel and the simulator for functional events (e.g., interrupts). For instance, when the simulator triggers a page fault, it communicates this event to the kernel. The kernel then handles the fault and reports the outcome back to the simulator using the shared memory region. Using the instruction stream channel, the kernel injects dynamically instrumented instruction streams (e.g., page fault handler instructions) into the simulator, enabling the simulator to accurately model the overheads introduced by OS routines (e.g., additional latency, memory interference).

Using this methodology we build MimicOS, a lightweight userspace kernel written in C++ [147] that imitates, but is not limited to, the basic memory management functionality of the Linux kernel [148]. MimicOS is portable and can be easily attached to the memory model of an architectural simulator (see §6.2). In this work, we integrate MimicOS with five architectural simulators, Sniper [133], ChampSim [132], Ramulator2 [142, 149], gem5-SE [136] and an SSD simulator, MQSim [150]. Using MimicOS and Sniper as a baseline, we build VirTool, a comprehensive toolset that contains both the HW and SW components that are required to evaluate many state-of-the-art VM schemes. By doing so, we aim to (i) unlock a wide range of new case studies ranging from low-level microarchitectural VM schemes to system software-level ones, and (ii) establish a common ground for researchers to evaluate current VM designs and to develop and test new ones. Table 2 provides a comprehensive overview of existing techniques that are included in VirTool.

Validation & Comparison. We validate the accuracy of MimicOS+Sniper against a real high-end server-grade processor (see §7.2) and demonstrate four key results. First, MimicOS+Sniper estimates the average L2 TLB misses per kilo instructions and PT walk latency, respectively, with 82% and 85% accuracy compared to the real system. Second, MimicOS+Sniper estimates the page fault latency with 66% (up to 79%) accuracy compared to the page fault latency measured by the Linux kernel running on a real machine. Third,
MimicOS+Sniper improves instructions per cycle (IPC) performance estimation accuracy by 21% (from 66% to 80%) while incurring 35% simulation time overhead compared to baseline Sniper. Fourth, MimicOS incurs only 20% simulation time overhead, averaged across four simulators, while enabling the full-system execution mode in gem5 leads to 77% simulation time overhead compared to gem5’s system call emulation mode.

Versatility & Use Cases. To illustrate the versatility of Virtuoso, we conduct five case studies that are time-consuming and difficult to assess accurately and rapidly using existing simulation tools. First, we analyze the performance of four different page table designs [54, 97] and draw key insights about their impact on page table walk latency, minor page fault latency and main memory interference (see §7.4). Second, we evaluate the overheads associated with different physical memory allocation policies across large language model inference workloads (see §7.5). Third, we draw key insights about the architectural trade-offs of restricting the virtual-to-physical address mapping across physical memory [105] (see §7.6.1). Fourth, we evaluate the benefits of contiguity-aware address translation [151] across different memory fragmentation levels (see §7.6.2). Fifth, we analyze the implications of employing an intermediate address space scheme [111] across workloads with different memory allocation patterns (see §7.6.3).

In this work, we make the following contributions:

• We propose Virtuoso, a new simulation framework that employs a new imitation-based OS simulation methodology. Virtuoso enables fast and accurate prototyping and evaluation of the hardware and software components of the virtual memory (VM) subsystem.
• We integrate our new methodology with five diverse architectural simulators and implement a comprehensive set of state-of-the-art VM techniques to provide a common ground for researchers to evaluate current and new VM designs.
• We validate Virtuoso against a real CPU system and demonstrate that it improves the accuracy of a state-of-the-art emulation-based simulator with only a modest increase in simulation time. We demonstrate that Virtuoso can bridge the gap between emulation-based and full-system simulators, enabling accurate exploration of VM designs in a fast and flexible way.
• We illustrate the versatility of Virtuoso by conducting five case studies that are time-consuming and difficult to accurately and rapidly assess using existing simulation tools.
• Virtuoso’s source code and integration with all five simulators is freely available at [Link]CMU-SAFARI/Virtuoso.

2 Background & Motivation

VM Overheads. Reducing the overheads of the VM subsystem is a long-standing challenge in computer architecture and OS research. Lately, emerging data-intensive workloads [24–35] turned VM overheads into a major performance bottleneck. As shown in multiple academic and industrial studies [46–57], address translation can significantly degrade the performance of applications, taking up to 40% of the total execution time [50, 51]. At the same time, OS routines responsible for allocating physical memory can cause high performance overheads, up to 95% [42, 130, 152].

Figure 1 shows the portion of the total execution time spent on address translation and allocating physical memory¹ for long-running (i.e., > 100 s) and short-running (i.e., < 1 s) workloads executed in a real high-end server-grade system (our evaluation methodology is described in detail in §7.1). We make two key observations. First, long-running workloads spend on average 25% (4.9%) of the total execution time on address translation (memory allocation). In contrast, in short-running workloads the overheads of memory allocation take a large portion of the total execution time, i.e., 32% on average, while the overheads of address translation are very small, i.e., less than 1% on average. This is because in long-running workloads, the overheads of physical memory allocation tend to be amortized over time, whereas in short-running workloads they are not. We conclude that the overheads of the VM subsystem can vary across different workloads and can heavily affect performance.

[Figure 1: bar chart. Y-axis: fraction of total execution time (%), 0 to 100. Series: Physical Memory Allocation, Address Translation. Workload groups: Long Running and Short Running, each with a GMEAN bar. Workload labels: BC, BFS, CC, GC, KCORE, PR, SSSP, TC, XS, RND, DB, 3D Transp, 2D-Sum, IMG-RES, JSON, Hadamard, WCNT, AES, Bagel, Mistral, Llama.]

Figure 1. Fraction of total execution time spent in address translation and physical memory allocation in long-running and short-running workloads executed on a real high-end server system [153].

The increasingly data-intensive nature of emerging applications and the transition towards large physical address spaces [58] (e.g., via compute-enabled memory modules [99, 154–158], large hybrid memory hierarchies [59–64], memory disaggregation [65–94], heterogeneous systems with unified virtual memory [159, 160]) is expected to increase the overheads caused by the VM subsystem [51, 70].

¹ We consider physical memory allocation as the total time spent in the page fault handler. We populate the page cache before the application starts executing to demonstrate the overheads of the page fault handler even in the absence of long-latency major page faults (i.e., disk accesses).
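
The amortization argument made in the discussion of Figure 1 can be illustrated with a rough first-order calculation. The sketch below is not part of the paper's methodology: the function name and every number in it (page count, per-fault cost, compute times) are hypothetical, chosen only to show why a fixed first-touch allocation cost weighs heavily on a sub-second run and becomes negligible over hundreds of seconds.

```python
def allocation_fraction(pages_touched, fault_cost_us, compute_time_s):
    """Fraction of total runtime spent in first-touch page faults.

    Assumes (hypothetically) that each page faults exactly once and that
    every fault costs the same; real fault latencies vary widely.
    """
    alloc_s = pages_touched * fault_cost_us * 1e-6
    return alloc_s / (alloc_s + compute_time_s)


# Hypothetical workloads with the same footprint but different run lengths.
short_run = allocation_fraction(100_000, 2.0, 0.5)    # 0.5 s of compute
long_run = allocation_fraction(100_000, 2.0, 200.0)   # 200 s of compute

print(f"short-running: {short_run:.1%}, long-running: {long_run:.2%}")
```

Such a fixed-cost model is exactly the kind of first-order approximation the paper argues is insufficient for accurate evaluation; it is shown here only to make the amortization intuition concrete.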

Hardware/OS Co-Design. A promising way to alleviate the overheads of VM is to co-design the hardware and OS. As shown in multiple prior works, VM can be improved via (i) designing more efficient page tables [54, 96, 97, 161, 162] (e.g., hash-based page tables [54, 97, 161]), (ii) enforcing and leveraging contiguity between virtual and physical addresses to increase the address translation reach of the processor [46, 50, 113, 128, 129, 151, 163–165] (e.g., range-based translation [151]), (iii) employing hash-based virtual-to-physical mappings to reduce the size of metadata used for address translation [105, 107, 109], (iv) introducing intermediate address spaces [106, 110, 111, 166] to delay address translation until a main memory access, (v) employing large OS-managed TLBs [118, 167] to improve the TLB hit rate, and (vi) accelerating OS routines that manage the VM subsystem by offloading them to specialized hardware [42, 130, 152].

Need for Detailed Simulation. Given the large VM overheads, it is critical to have methods for easily and quickly prototyping and evaluating existing and new VM ideas and techniques. However, such an evaluation is challenging since VM components (i) span across the hardware/software boundary, and (ii) are highly interdependent, which leads to significant variability in the overheads of the VM components across different workloads and system states. For example, the effectiveness of TLBs [128, 164] as well as the storage requirements, lookup latency and main memory contention caused by the page table heavily depend on the number of large pages (e.g., 2MB pages) that the OS’s physical memory allocator provides to user applications. At the same time, the physical memory allocation policy affects the latency of the page fault handler, which might heavily affect the tail latency of the application. Therefore, it is challenging to accurately model the overheads of the VM components with simple first-order models (e.g., those that assume a fixed latency). We use two example cases to showcase the variability in the overheads caused by the VM components.

Example: Variation of Minor Page Fault Latency. Fig. 2 shows the distribution of the minor page fault (MPF) latency using two OS page allocation policies (i.e., transparent huge pages (THP) [114, 115] enabled and disabled) across all workloads executed in a real high-end server-grade system (§7.1). We make two key observations. First, the latency of MPFs can vary significantly given a single physical memory allocation policy. With THP enabled, the average MPF latency is 2.2 µs while the standard deviation is larger than 50 µs. Second, the distribution of the MPF latency can change significantly when the physical memory allocation policy provides large pages. With THP enabled, the contribution of the outliers (i.e., MPFs with latency larger than 10 µs) to the total MPF latency is 67%, while with THP disabled, the contribution of the outliers to the total MPF latency is 25.5%. Prior works (e.g., [176, 177]) attribute this variability to the large number of different operations (e.g., page zeroing, fallback mechanism, huge page allocation, page table updates, memory reclamation) and pathological cases that might occur during page fault handling.

[Figure 2: box plots of minor page fault latency (µs, log scale) for THP enabled and THP disabled, marking the 25th percentile, median, and 75th percentile. Contribution of outliers to total minor page fault latency: 67% (THP enabled), 25% (THP disabled).]

Figure 2. Minor page fault latency distribution across two different physical memory allocation policies (i.e., THP [114, 115] enabled and disabled) measured in a real system [153].

[Figure 3: average PTW latency (cycles, 0 to 200) across 53 benchmarks with varying memory intensity levels, ordered from low to high memory intensity. Annotated points: I/O stressor, SSSP graph application.]

Figure 3. Average PTW latency across 53 different applications that exhibit varying levels of memory intensity, measured in a real high-end server system [153].

Example: Variation of Page Table Walk (PTW) Latency. Fig. 3 shows the average PTW latency across 45 applications executed in a real system that stress VM at different levels.² We observe that the PTW latency significantly varies across different applications. For example, the PTW latency of an application that performs large I/O allocations is 39 cycles while the PTW latency of the single-source shortest path workload (SSSP) from GraphBig [33] is larger than 180 cycles.

² We use different configurations of the stress-ng benchmarks [178] and the long-running workloads described in §7.1. We measure the page table walk latency using performance counters.

We conclude that the overhead of the VM subsystem significantly varies across different workloads and system configurations and thus cannot be accurately modeled with first-order approximations (e.g., assuming fixed latencies) but requires detailed simulation.

2.1 Existing Simulation Frameworks

We classify existing simulators (e.g., [131, 133–136, 143, 146, 149, 168]) into two broad categories: (i) simulators that emulate OS routines, and (ii) full-system simulators where a real full-blown OS is executed on top of a hardware simulator. Unfortunately, as we describe below, neither type of simulator is well-suited for evaluating VM schemes that rely on co-designing OS routines and hardware support, which hinders fast and accurate prototyping and evaluation of such
schemes. Table 2 summarizes the VM components supported by eleven existing simulators and by our proposed simulator, Virtuoso.

Emulating OS Routines. Many existing simulators (e.g., [131–136, 142–145]) are designed with a focus on accurately modeling the core, main memory or other hardware components that do not directly rely on or interact with the OS. Hence, these simulators lack (and some do not need for the use cases they are designed for) a methodology to accurately model the implications (e.g., latency, memory interference) of the OS components involved in the VM subsystem. For example, multiple simulators (e.g., [132, 133, 143]) model only the functional interactions of the application with a subset of OS routines (e.g., mmap() [179]) and typically use first-order approximations (e.g., Sniper [133] uses a fixed PTW latency and ChampSim [132] uses a fixed page fault latency) to model VM overheads. However, as we show in Fig. 2 and Fig. 3, the overheads of VM can significantly vary across different workloads and applications, and hence cannot be accurately modeled with static first-order approximations. In §7.2, we show that the baseline version of Sniper that uses a fixed PTW latency leads to 35% error in IPC estimation compared to the real system. Thus, such simulators are not a good fit for evaluating new VM schemes that require changes to the OS kernel code and new hardware support.

Full-System Simulation. Full-system simulators (e.g., [136, 146, 168, 180–183]) like the full-system execution mode provided by gem5 [136] and QEMU-based architectural simulators like QFlex [146] enable the execution of a full-blown OS, including realistic memory management and other OS routines, on top of a hardware simulator. Such a methodology is particularly valuable when evaluating VM designs that involve changes to the OS kernel code and require new hardware support. However, existing full-system simulation methodologies have three main limitations: (i) low simulation speed, (ii) high memory overheads, and (iii) high development time and effort. First, simulating a full-blown OS drastically increases simulation time and memory consumption, hindering rapid design space exploration. Simulating every single OS routine without the possibility of omitting those that are irrelevant to the desired evaluation can significantly increase simulation times. At the same time, spawning a full-blown OS significantly increases memory consumption per simulation task. In §7.3, we show that simulating a full-blown OS on top of gem5 [136] can increase simulation time by 77% and memory consumption by 1.69x (from 1GB to 1.69GB per simulation task) compared to the system call emulation mode of gem5 (gem5-SE). Second, evaluating new hardware/OS co-design schemes on top of full-system simulators necessitates (i) the modification of an already complex OS kernel code, (ii) its functional verification on top of simulated hardware, (iii) simulator extensions to support new hardware components (e.g., new TLB designs), and (iv) complex modifications to the interface between the OS routines and the hardware. This process requires significant development effort and time, especially for researchers who are not experts in OS development. We conclude that, while

Table 2. Virtual memory schemes supported by existing simulators and Virtuoso (our proposed simulator).

Type             Simulator           TLB Hierarchy               Page Table Design   Contiguity Schemes    Intermediate Address Space  Hash-based Translation  Memory Tagging
Emulation-based  SimpleScalar [134]  Generic TLB Controller      ✗                   ✗                     ✗                           ✗                       ✗
Emulation-based  Multi2Sim [135]     Generic TLB Controller      ✗                   ✗                     ✗                           ✗                       ✗
Emulation-based  Scarab [131]        ✗                           ✗                   ✗                     ✗                           ✗                       ✗
Emulation-based  Ramulator2 [142]    ✗                           ✗                   ✗                     ✗                           ✗                       ✗
Emulation-based  ZSim [143]          ✗                           ✗                   ✗                     ✗                           ✗                       ✗
Emulation-based  gem5-SE [136]       Generic TLB Controller      x86-64 & ARM PT     ✗                     ✗                           ✗                       ✗
Emulation-based  ChampSim [132]      Generic & TLB Prefetching   x86-64 PT           ✗                     ✗                           ✗                       ✗
Emulation-based  Sniper [133]        Generic TLB Controller      Fixed PTW latency   ✗                     ✗                           ✗                       ✗
Full-system      PTLsim [168]        Generic TLB Controller      x86-64 & ARM PT     Linux THP [114, 115]  ✗                           ✗                       ✗
Full-system      QFlex [146]         Generic TLB Controller      x86-64 & ARM PT     Linux THP [114, 115]  ✗                           ✗                       ✗
Full-system      gem5-FS [136]       Generic TLB Controller      x86-64 & ARM PT     Linux THP [114, 115]  ✗                           ✗                       ✗

Imitation-based  Virtuoso (this work), listed per column:
  TLB Hierarchy: Configurable TLB hierarchy; Multi-page size TLBs; Page-size prediction [127]; TLB prefetching [170]; Software-managed TLBs [118]; TLB entries stored in data caches [175]
  Page Table Design: Hash-based PTs: ECH [97], HDC [54]; Configurable Radix-PT + PWCs [48]; Support for nested TLB [172] and PTW [173]
  Contiguity Schemes: Direct Segments [108]; Range Translation & Eager Paging [151]; Linux-like [114, 115] & Reservation-based THP [174]
  Intermediate Address Space: Midgard [111]; Virtual Block Interface [106]
  Hash-based Translation: Hash-based translation [109]; Hybrid Restrictive & Flexible Physical Segments [105]
  Memory Tagging: Mondrian Data Protection [169]; Expressive Memory [171]

full-system simulators are indispensable tools in computer architecture research, they limit productivity and cause high simulation overheads, thereby hindering their practical utility in exploring and evaluating VM schemes that span across the hardware/software boundary.

Simulation Requirements. To evaluate new VM schemes accurately, efficiently and rapidly, a simulation framework needs to (i) enable fast prototyping of the required hardware and OS modifications, (ii) accurately and quickly estimate the overheads caused and the benefits provided by the new OS and hardware components, (iii) model the interaction of the VM components with the rest of the system and between each other.

[Figure 4: block diagram. Virtuoso’s MimicOS (lightweight userspace kernel with OS modules such as the page fault handler, plus an instrumentation tool; labeled benefits: fast simulation, quick development, OS emulation) connects through a communication interface (functional channel and instruction stream channel) to the architectural simulator (core model and memory model; labeled benefit: detailed OS overhead evaluation). Circled numbers 1 to 7 mark the workflow steps, including the page fault, the injection of disassembled userspace kernel instructions, and the PTW restart.]

Figure 4. Overview of Virtuoso’s Architecture.

4.1 Lightweight Userspace Kernel
3 Virtuoso: Overview Virtuoso employs a lightweight userspace kernel to imitate
the functionality of the desired OS kernel code. Such a design
We present Virtuoso, a new simulation framework that en-
decision enables researchers to (i) simulate only the relevant
ables fast and accurate prototyping and evaluation of the
OS routines to speed up simulation time, and (ii) quickly and
software and hardware components of the VM subsystem. The key idea of Virtuoso is to employ a lightweight userspace kernel, written in a high-level language (e.g., C++), that enables researchers to (i) isolate the functionality of only the desired kernel code to speed up simulation time, (ii) easily develop new OS routines using a high-level language without being kernel code experts, and (iii) accurately evaluate the application- and system-level implications of the OS by integrating Virtuoso into an architectural simulator.

Figure 4 illustrates a high-level overview of Virtuoso’s components and workflow. Virtuoso consists of two main components: (i) a lightweight userspace kernel, called MimicOS, that imitates the virtual memory subsystem of the OS, and (ii) a communication channel between MimicOS and the architectural simulator that Virtuoso is coupled with. When the architectural simulator executes an event that requires OS intervention (e.g., page fault, memory allocation) ①, the simulator forwards the event to MimicOS through the communication channel ②. MimicOS processes the event ③ and Virtuoso performs two operations. First, Virtuoso dynamically instruments MimicOS’s binary ④ and injects MimicOS’s disassembled instructions into the processor performance model of the simulator ⑤. This way, the simulator can accurately estimate the performance implications of the executed OS routines on the application. Second, when MimicOS resolves the event, it returns the functional response to the architectural simulator (e.g., signals the core to restart walking the page table ⑥) through the functional channel ⑦.

4 Imitation-Based Simulation Methodology
We describe the key components of Virtuoso’s simulation methodology, (i) the lightweight userspace kernel and (ii) the communication interface between the kernel and the architectural simulator, and provide a step-by-step example of the simulation flow of a page fault handling routine.

easily develop new OS modules.

Kernel Module Selection. Virtuoso’s kernel comprises different modules selected by the researcher to balance accuracy and simulation time depending on their research needs. For example, a kernel may consist solely of a page fault handler if the researcher wants to quickly evaluate the impact of different page fault handling mechanisms on system performance without taking irrelevant OS routines (e.g., the thread scheduler) into consideration. As we demonstrate in §7.3, executing a simulator paired with a userspace kernel that faithfully mimics the functionality of only the Linux memory management subsystem is 49% faster than simulating the entire Linux kernel.

Ease of Development. The userspace kernel can be written in a high-level language (e.g., Python, C++), which enables easier development of new OS routines without requiring expert knowledge. For example, the researcher can easily develop a new machine learning-based page replacement algorithm using a high-level library (e.g., mlpack [184], TensorFlow [185], PyTorch [186]) and integrate it with the kernel without needing to understand or modify the complex code of a production-grade OS. At the same time, Virtuoso’s modular design allows increasing the number of supported OS modules to closely mimic the functionality of a target kernel at the cost of increased simulation time.

4.2 Interface with the Architectural Simulator
To evaluate the impact of OS routines on the performance of a system, the userspace kernel needs to execute on top of an architectural simulator. To achieve this, Virtuoso (i) executes both processes (i.e., the userspace kernel and the simulator) as standalone applications and (ii) establishes a new communication interface between the userspace kernel and the simulator that consists of two new communication channels that employ synchronization primitives to orchestrate the execution flow between the kernel and the simulator.
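As a rough illustration of this interface, the following Python sketch models the kernel and the simulator as two threads connected by two queues, one carrying functional requests and responses and one carrying the kernel’s instruction stream. It is a toy under stated assumptions, not Virtuoso’s implementation: the real framework runs two separate processes that exchange messages over POSIX shared memory, and every name below (`ToyKernel`, `page_fault_module`, `END_OF_STREAM`, the instruction strings) is hypothetical.

```python
"""Toy model of a simulator <-> userspace-kernel interface with two channels."""
import queue
import threading
from dataclasses import dataclass

PAGE_SIZE = 4096        # assumption: 4 KB pages only
END_OF_STREAM = None    # stands in for the "magic" marker that ends a handler


@dataclass
class Request:
    kind: str           # e.g. "page_fault"
    vaddr: int          # faulting virtual address


class ToyKernel(threading.Thread):
    """Hypothetical kernel built only from the modules the user selects."""

    def __init__(self, modules, func_in, func_out, instr_out):
        super().__init__(daemon=True)
        self.modules = modules          # request kind -> handler(kernel, request)
        self.func_in = func_in          # functional channel, simulator -> kernel
        self.func_out = func_out        # functional channel, kernel -> simulator
        self.instr_out = instr_out      # instruction stream channel
        self.page_table = {}            # virtual page number -> frame number
        self.next_frame = 0

    def run(self):
        while True:
            req = self.func_in.get()    # block until the simulator posts a request
            if req is None:             # shutdown message
                return
            result = self.modules[req.kind](self, req)
            self.instr_out.put(END_OF_STREAM)   # handler done: hand control back
            self.func_out.put(result)


def page_fault_module(kernel, req):
    """Allocate a frame, update the page table, emit a placeholder instruction stream."""
    vpn = req.vaddr // PAGE_SIZE
    frame = kernel.next_frame
    kernel.next_frame += 1
    kernel.page_table[vpn] = frame
    # Placeholder strings; a real setup would obtain these via binary instrumentation.
    for text in ("mov eax, [0xA1]", "mov ebx, [0xFA]", "add eax, ebx"):
        kernel.instr_out.put(text)
    return {"action": "restart_ptw",
            "paddr": frame * PAGE_SIZE + req.vaddr % PAGE_SIZE}


def simulate_page_fault(vaddr):
    """Simulator side: post a request, consume the stream, read the response."""
    func_in, func_out, instr_out = queue.Queue(), queue.Queue(), queue.Queue()
    kernel = ToyKernel({"page_fault": page_fault_module}, func_in, func_out, instr_out)
    kernel.start()
    func_in.put(Request("page_fault", vaddr))
    executed = []
    while (instr := instr_out.get()) is not END_OF_STREAM:
        executed.append(instr)          # a core model would time these instructions
    response = func_out.get()
    func_in.put(None)                   # ask the kernel thread to exit
    kernel.join()
    return response, executed
```

Selecting modules amounts to choosing which handlers go into the `modules` dictionary; the one-entry dictionary used in `simulate_page_fault` corresponds to a kernel that contains only a page fault handler.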

Communication Channels. Virtuoso establishes two communication channels between the kernel and the simulator: (i) a functional channel and (ii) an instruction stream channel. Through the functional channel, the simulator communicates functional requests (e.g., page fault requests) to the kernel and the kernel communicates the emulated result of the request back to the simulator (e.g., a signal to restart the page table walk). However, the functional channel is not sufficient to estimate the impact of the OS routines on the performance of the system. For example, the architectural simulator cannot estimate the impact of the page fault handler on various system components (e.g., main memory controller contention) by using only the functional state (e.g., the physical address of the new page) of the userspace kernel. To address this issue, Virtuoso executes the userspace kernel on top of a binary instrumentation tool (e.g., Intel Pin [187], DynamoRIO [188]) to dynamically generate the kernel’s instruction stream (e.g., the page fault handler instructions) and communicates it to the simulator through a separate instruction stream channel.

Synchronization Primitives. To achieve high simulation speed while maximizing portability (i.e., porting the userspace kernel to many different architectural simulators with minimal changes), Virtuoso employs (i) POSIX-based [189] shared memory primitives to exchange messages between the kernel and the architectural simulator, and (ii) magic operations (e.g., m5ops in gem5 [136], xchg instructions in Sniper [133]) to synchronize the execution of the userspace kernel with the architectural simulator.³

Execution Flow. When the simulated application causes an interrupt or a system call, the architectural simulator performs two actions: (i) writes the interrupt/system call parameters to the functional channel (i.e., a POSIX-based shared memory segment [190]) and (ii) notifies the userspace kernel to read the parameters and start processing the request. While the userspace kernel processes the request, the binary instrumentation tool produces the instruction stream of the kernel’s code and sends it to the simulator through the instruction stream channel. The simulator consumes the instruction stream, feeds it to its core model, and estimates the impact of the kernel’s code on performance. The production and the consumption of the kernel’s instruction stream happen in parallel to avoid unnecessary latency in the simulation.⁴ When the userspace kernel resolves the request, it performs two actions: (i) writes the result of the request to the functional channel and (ii) executes a magic instruction to signal the simulator to continue the simulation of the application. When the simulator decodes the magic instruction, it pauses the instrumentation of userspace kernel instructions and switches back to the simulated application.

³ Magic operations are special instructions that may or may not be part of the ISA and are used to notify the simulator to perform a specific action. For example, when Sniper [133] decodes the xchg r1,r2 instruction and r1 is identical to r2, it treats it as a signal to perform a specific special action dictated by the content of r1 (e.g., start detailed simulation).
⁴ The latency for the production of the kernel’s instruction stream could be hidden by using a runahead thread [191, 192]. Such an optimization is useful especially when the simulator’s frontend is trace-based and all the instructions of the application are known in advance.

4.3 Multithreaded Userspace Kernel
Virtuoso’s userspace kernel supports multithreading to concurrently handle multiple system calls or interrupts from different processes. To achieve this, when an application being executed on the simulator issues a request to the kernel, the kernel spawns a new thread to handle the request or forwards the request to an available thread. The kernel uses synchronization primitives to guarantee the correctness of the kernel routines in multithreaded environments and model the performance overheads of atomic operations. For example, if multiple applications compete for physical memory resources, our methodology can capture the corresponding synchronization overheads.

4.4 Simulation Flow: Page Fault Handling Example
Figure 5 demonstrates the workflow of the proposed simulation methodology with an example case study of a page fault (PF) handler. First, the kernel and the simulator are launched as userspace processes. In this example, the kernel comprises a PF handler with multiple different modules ① (e.g., page table management, page cache [193] management). The simulated application is fed to the frontend (i.e., instruction format generator) of the simulator (e.g., trace-based, instrumentation-based, emulation-based) to generate the instruction stream ②. If an instruction contains a load or store memory operand, the frontend issues a memory access request to the core model of the simulator ③. The core model forwards the memory request to the memory management unit (MMU) model to perform address translation ④. If the MMU does not find the translation in the TLB hierarchy, it triggers a page table (PT) walk ⑤. In this scenario, the PT walker does not find the translation in the PT and triggers a PF ⑥. Through the functional channel Ⓐ, the simulator sends a request to the kernel to handle the PF ⑦. The kernel decodes the message and executes the PF handler code ⑧. The PF handler code is instrumented using a binary instrumentation tool (e.g., Intel Pin [187], DynamoRIO [188]) ⑨ and the instrumented disassembled instruction stream is sent to the simulator through the instruction stream channel Ⓑ. The PF handler’s instruction stream is forwarded ⑩ to the core model of the simulator and the simulator models the execution of the kernel’s instructions to estimate the impact of the PF handler on the microarchitectural state and performance (e.g., main memory contention, cache pollution) ⑪. When the PF handler completes executing, the kernel communicates the outcome of the PF (e.g., the physical address of the new page and the page size) to the simulator ⑫. The

simulator then re-walks the PT, and the core model adds the latency of the PF to the translation latency ⑬ and forwards the physical address to the memory hierarchy.

[Figure 5 depicts the lightweight userspace kernel modules (page fault handling, page table management, page cache, huge page policy), the binary instrumentation tool, and the architectural simulator (frontend, core model, MMU with TLB hierarchy and page table walker), connected through the functional channel Ⓐ and the instruction stream channel Ⓑ.]
Figure 5. Example page fault handling workflow of Virtuoso coupled with an architectural simulator.

5 MimicOS: A Lightweight Userspace Kernel for Memory Management
Using our new imitation-based simulation methodology (§4), we build MimicOS, a new lightweight kernel written in C++ that mimics, but is not limited to, the basic memory management functionality of the Linux kernel [148] for x86-64 systems [194].

5.1 Mimicking Linux Memory Management
As shown in Fig. 6, MimicOS employs a memory management scheme that mimics the one used by Linux. On a page fault, MimicOS checks whether the virtual memory area (VMA) [195] should be mapped through hugetlbfs⁵ [196] ① and, if so, updates the page table (PT). If not, MimicOS begins walking the PT. To allocate new PT frames (in case of a page fault), MimicOS requests new frames from the slab allocator [197] ②. If the 3rd-level PT entry is uninstantiated, MimicOS decides whether or not to allocate a 1GB physical page based on three conditions ③: (1) the VMA uses DAX [64] or is backed by a file, (2) 1GB allocation flags are set, and (3) a 1GB contiguous physical memory region is available in the buddy allocator’s free list. If all conditions are met, a 1GB page is allocated, data is fetched from the page cache (or disk), and the PT is updated. If not, MimicOS attempts to allocate smaller pages and resumes the PT walk. For empty 2nd-level PT entries, MimicOS attempts to allocate a 2MB page if the VMA is anonymous [195] ④. If a zeroed 2MB page is available, MimicOS allocates it and updates the PT. If not, a 4KB page is allocated, the final PT level is updated ⑤, and khugepaged [198] is notified to scan memory and merge 4KB pages into 2MB pages. If the PTE is allocated and corresponds to anonymous pages, MimicOS accesses the swap cache [199] to retrieve the location of the data in the swap file [200] ⑥. If the PTE is empty and corresponds to file-backed pages (e.g., data originates from files), MimicOS accesses the page cache [193] (a software data structure that resides in memory and stores recently-accessed file-backed pages) to retrieve the data ⑦. On a page cache miss or swap access, MimicOS fetches the data from disk (we simulate the disk access latency using an SSD simulator [150]) ⑧ and updates the PT ⑨.

⁵ hugetlbfs [196] is a Linux kernel policy responsible for reserving huge pages to ensure availability during allocation time. A virtual memory area is mapped through hugetlbfs only when large pages are explicitly requested via mmap() or shmget() calls.

[Figure 6 is a flowchart of the page fault path in MimicOS: VMA lookup, the hugetlbfs check, the PT walk with slab-allocated PT frames, 1GB/2MB/4KB page allocation, khugepaged scanning, and the swap cache, page cache, and disk accesses.]
Figure 6. MimicOS Memory Management Subsystem.

5.2 VirTool: A Toolset for VM Research
We integrated MimicOS with (i) four architectural simulators: Sniper [133], Ramulator [149], ChampSim [132], and gem5-SE [136], and (ii) an SSD simulator, MQSim [150], to enable the evaluation of storage device impact on VM. By doing so, we aim to unlock a wide range of new ideas and case studies ranging from low-level microarchitectural VM schemes to hardware/software/OS co-design VM solutions. Using MimicOS+Sniper as a baseline, we create VirTool, a comprehensive toolset of state-of-the-art VM techniques [133]. Table 2

provides an overview of the techniques included in VirTool. With VirTool we aim to provide a common ground for researchers to easily and consistently develop and evaluate existing and new VM techniques.

6 Extending Virtuoso

6.1 Support for Virtualized Environments
Virtuoso supports out-of-the-box simulation of virtualized execution environments (i.e., virtual machines running on top of a hypervisor (e.g., [12, 173])). To achieve this, Virtuoso spawns two userspace kernels (MimicOSes): 1) one that acts as and mimics the hypervisor (e.g., acting like KVM [201]) and 2) one that imitates the guest OS (e.g., Linux). When the guest OS needs to send requests to the hypervisor, the same process described in §5.1 is followed in a nested manner, so that the simulator captures the instruction stream of both the guest OS and the hypervisor. VirTool already provides support for nested address translation [173], which is a key feature for modeling virtualized environments.

6.2 Integration with Architectural Simulators
At a high level, integrating Virtuoso with an architectural simulator mainly requires three key steps: (i) using an emulation, instrumentation, or other tool (e.g., a custom tracer) to capture the instruction stream generated by MimicOS and convert it to the format used by the architectural simulator, (ii) establishing a bi-directional communication channel (e.g., POSIX-based shared memory [190]) between MimicOS and the memory model (e.g., MMU model) of the architectural simulator to exchange messages (e.g., signals for interrupt, system call output), and (iii) establishing a communication channel between MimicOS and the core model of the architectural simulator to inject the instruction stream generated by MimicOS. We already integrated Virtuoso with five different simulators: Sniper [133], Ramulator [142, 149], ChampSim [132], gem5-SE [136] and MQSim [150]. Table 3 shows the additional lines-of-code required for the integration.

Table 3. Additional lines of code and number of files modified in different simulators to integrate Virtuoso.

Simulator          Frontend   Core model   MMU model   Files
ChampSim [132]     56         45           22          6
Sniper [133]       46         35           180         9
Ramulator2 [142]   79         83           44          6
gem5-SE [136]      0          221          44          12

Simulators with Trace-based Frontend. Trace-based simulators (e.g., [132, 133, 142, 149, 150, 202]) typically simulate workloads using input trace files that represent the instructions and memory accesses of the workload generated by instrumentation and emulation tools (e.g., Intel Pin [187]) or other simulators. Virtuoso can be seamlessly integrated with trace-based simulators by following the steps described in Fig. 7. We use ChampSim [132] as an example trace-based simulator. First, MimicOS is booted in parallel with ChampSim and runs as a separate process on top of a binary instrumentation tool. ChampSim is modified in two ways: (i) the MMU model gets attached to MimicOS using a bi-directional communication channel to receive and send functional requests Ⓐ and (ii) the core model gets attached to a communication channel to receive MimicOS’s disassembled instruction stream Ⓑ. When the MMU model encounters a page fault, it sends a functional request to MimicOS to handle it ①. MimicOS starts executing the corresponding handler ③ and the binary instrumentation tool (e.g., Intel Pin [187]) generates the disassembled instruction stream ④. The instrumentation tool is modified to generate a trace that follows the format expected by ChampSim Ⓒ. The instructions from MimicOS’s trace ⑤ are streamed through the communication channel to ChampSim’s core model ⑥, which models their execution. When the page fault is resolved, MimicOS notifies the MMU to re-walk the page table ⑦ and ChampSim’s core model starts fetching instructions from the original application trace ⑧.

Figure 7. Integrating Virtuoso with trace-based simulators.

Simulators with Execution-driven Frontend. Execution-driven simulators, such as Sniper [133], Scarab [131] and ZSim [143], dynamically instrument [187, 188] the simulated application and generate the instruction stream on-the-fly without storing a trace file. Such a simulation methodology is particularly useful when the simulator manipulates the functional model (e.g., simulation of wrong path execution [131, 136, 203, 204]). Virtuoso can be integrated with these simulators the same way as trace-based simulators with one key difference: when the instrumentation tool generates MimicOS’s instruction stream, it directly injects it into the core model of the simulator without the need for an additional trace file. In this scenario, the core model of the simulator must be modified to dynamically switch between the instruction stream generated by MimicOS and the original instruction stream of the workload.

Simulators with Emulation-based Frontend. Simulators with an emulation-based frontend (e.g., gem5 [136], QFlex [146]) use an emulation tool to capture the instruction stream of the workload and then feed the instructions to the core model of the simulator. Integrating Virtuoso with these

simulators is straightforward, as the existing emulation tool can be reused to capture the instruction stream generated by MimicOS and feed it to the core model of the simulator. For example, in Virtuoso’s integration with gem5, when the MMU model encounters a page fault, it sends a request to MimicOS through shared memory and the emulation tool produces the instruction stream of MimicOS, feeding it to the core model of gem5.

6.3 Usage in Heterogeneous System Simulation
Virtuoso can be used to facilitate VM research in heterogeneous systems comprising accelerators managed by a host CPU. One such example could be Unified Virtual Memory (UVM) [159, 160], which enables the use of a shared virtual address space across GPUs and CPUs. UVM management operations are typically orchestrated by the device driver running on the CPU (Host), using an Input-Output Memory Management Unit (IOMMU) [205]. In this scenario, Virtuoso’s imitation-based methodology can be applied to model (i) functionalities provided by the OS and the device driver (e.g., host/device memory allocation, page migration) and (ii) functionalities of the IOMMU (e.g., page translation). Existing UVM-enabled GPU simulators [206, 207] emulate events (e.g., page allocation, migration and translation) using fixed latencies or analytical models. Consequently, integrating Virtuoso into such simulators requires (1) extending MimicOS to imitate the desired OS components (e.g., UVM driver [208]) and (2) establishing a communication channel between the host CPU simulator and the accelerator simulator to communicate the corresponding OS-related latency overheads.

6.4 Current Limitations
We believe that Virtuoso is a good fit for studies focusing on VM, which spans across the hardware and OS layers of the system stack. Virtuoso’s speed and accuracy in simulating the Linux memory subsystem and hardware MMU make it particularly useful for academic research, system optimization, and the preliminary testing of hardware/OS changes before deployment on actual systems. At the same time, researchers can expand MimicOS to incorporate more advanced OS functionality and adjust the accuracy and simulation time as per their research requirements. Hence, even though it provides a viable alternative to full-system simulators, we do not suggest that Virtuoso replaces them but rather complements them. In many cases, researchers need to simulate the entire system stack, including a real OS, to discover previously unknown performance bottlenecks or to evaluate the performance of a new hardware/OS cooperative technique in production-level OSes. In such cases, full-system simulators like gem5 [136] can provide a more accurate simulation of the entire system stack compared to Virtuoso. As Virtuoso evolves, further development could expand its capabilities, potentially bridging some of its current gaps with full-system simulators and enabling the modeling of more complex OS-level operations.

7 Virtuoso: Validation & Use Cases
We (i) validate Virtuoso’s accuracy against a real high-end server-grade CPU, (ii) evaluate Virtuoso’s simulation time overheads when integrated into four different architectural simulators, and (iii) conduct five diverse case studies to demonstrate Virtuoso’s versatility.

7.1 Evaluation Methodology
System Configuration. We use the version of Virtuoso integrated with Sniper [133] as our primary simulation tool. We chose Sniper for four key reasons: (1) it provides a good balance between the level of detail in its microarchitecture, cache hierarchy, interconnect, and main memory models (we heavily refactored and enhanced the baseline DRAM model, inspired by Ramulator [142, 149]) and simulation speed; (2) it is scalable in multi-core system simulation; (3) it is more programmer-friendly than gem5 [136]; and (4) it achieves higher IPC performance estimation accuracy over gem5-SE [136], as shown in prior studies [209] and as we also verified. Table 4 shows the configuration of the baseline simulated system, the configurations of all the schemes we evaluated in our case studies (§7.4-7.6.3) and the configuration of the real system we validated Virtuoso against. Virtuoso, along with all scripts, benchmarks, integration with five simulators and all techniques included in VirTool, is freely available at [Link]

Workloads. Table 5 shows the benchmarks we used to evaluate Virtuoso. We select short-running applications (< 1s) from various domains including Function-as-a-Service workloads [40, 41], Large Language Model (LLM) inference [37, 38, 217] and image processing [218]. We select long-running applications with high L2 TLB MPKI (> 5) from the GraphBIG [33], HPCC [31] and XSBench [32] benchmark suites, which are also used by multiple prior works (e.g., [96, 97, 101, 105, 111, 175]).

7.2 Validation of Virtuoso
IPC Validation. Figure 8 shows the IPC performance estimation accuracy of Virtuoso+Sniper and baseline Sniper compared to a real system (Table 4) across the long-running memory intensive workloads that are heavily affected by address translation. Virtuoso (baseline Sniper) achieves 80% (66%) average accuracy in IPC estimation compared to the real system. Virtuoso adapts to the dynamic characteristics of different workloads and achieves 21% higher accuracy in IPC estimation versus baseline Sniper, which uses a fixed PTW latency (set as the average PTW latency obtained from a real system) regardless of the workload characteristics.

Validation of Page Fault (PF) Latency. We compare the PF latency reported by Virtuoso+Sniper against the page fault

latency measured on the real system. We measure the real system PF latency at a fine granularity using ftrace and the handle_mm_fault() function tracer [221]. Figure 9 shows the cosine similarity [222] of the PF latency reported by Virtuoso and the real system.⁶ We use the short-running, page fault-bound workloads for which PF latency estimation is critical. Although MimicOS, Virtuoso’s userspace kernel, imitates only a subset of the Linux kernel’s memory management routines (§5.1), the cosine similarity of PF latency ranges from 60% to 79%, with an average of 66% across all workloads. We conclude that Virtuoso can approximate the PF latency with reasonable accuracy, even without modeling the entire Linux kernel.

Validation of MMU Performance. Figure 10 shows the L2 TLB misses per kilo instructions (MPKI) and the PTW latency of Virtuoso+Sniper compared to the real system. For this experiment, we use the long-running workloads that are heavily affected by address translation latency and thus by the effectiveness of the MMU. We observe that Virtuoso estimates the L2 TLB MPKI and the PTW latency with, on average, 82% and 85% accuracy, respectively. Virtuoso accurately models the MMU performance of the real system, which is essential for capturing the address translation overheads in data-intensive workloads.

[Figure 8 plots IPC (left axis) and accuracy (right axis) for the real system, Virtuoso+Sniper, and baseline Sniper across BC, BFS, CC, KCORE, GC, PR, SSSP, TC, XS, and GMEAN.]
Figure 8. IPC estimation accuracy of Virtuoso+Sniper and baseline Sniper compared to a real system.

Figure 9. Cosine similarity between the page fault latency values measured by Virtuoso and the real system.

Figure 10. (Top) L2 TLB MPKI and (Bottom) PTW latency reported by Virtuoso+Sniper compared to a real system.

Table 4. Simulation Configuration and Simulated Systems

Baseline Virtuoso+Sniper Configuration
Core          4-way Out-of-Order x86 2.9 GHz core
MMU           L1 I-TLB: 128-entry, 8-way assoc, 1-cycle latency
              L1 D-TLB (4 KB): 64-entry, 4-way assoc, 1-cycle latency; L1 D-TLB (2 MB): 32-entry, 4-way assoc, 1-cycle latency
              L2 TLB: 2048-entry, 16-way assoc, 12-cycle latency
              3-Page Walk Caches: 32-entry, 4-way, 2-cycle latency
L1 Cache      L1 I/D-Cache: 32 KB, 8-way assoc, 4-cycle access latency; LRU replacement policy; IP-stride prefetcher [210]
L2 Cache      2 MB, 16-way assoc, 16-cycle latency; SRRIP replacement policy [211]; Stream prefetcher [212]
L3 Cache      2 MB/core, 16-way assoc, 35-cycle latency
DRAM          256 GB, DDR4-2400, tRCD, tCL = 12.5 ns, tRP = 2.5 ns
MimicOS       Linux-like THP with 4 KB and 2 MB pages; HugeTLBFS; Swap: 4 GB; Swapping threshold: 90%; Baseline fragmentation: 80%

Real System (Validation)
              Linux 5.15.0-60 [213]; DDR4-2400 Memory: 256 GB; CPU: Intel Xeon Gold 6226R 2.90 GHz [153]

Simulated Systems Evaluated in Use Cases (§7.4-7.6.3)
Radix [49, 214]   4-level tree; 4 KB page table frames; 3-Page Walk Caches (Physical Indexing): 32-entry, 2-way, 2-cycle
ECH [97]          8K-entries/way; 4-way; Hash function: CITY [215] 2-cycle; Perfect Cuckoo Walk caches for inter-page walks: 2-cycle
HDC [54]          Size: 4 GB; Open addressing; 8 PTEs/entry
HT [216]          Size: 4 GB; Chain Table; 8 PTEs/entry
Utopia [105]      2 x 8 GB RestSegs: 1 × 4 KB pages and 1 × 2 MB pages; RestSegs: 16-way, SRRIP replacement policy [211]; 1x FlexSeg with 4-level radix PT; TAR Cache: 8 KB, 2-cycle; SF Cache: 8 KB, 2-cycle
Midgard [111]     64-entry L1 VLB: 1-cycle latency; 16-entry L2 Range-based VMA Lookaside Buffer: 4-cycle latency; B+ Tree for VMAs; 2-level MLB hierarchy; 6-level radix tree for M->P translation
RMM [151]         64-entry RLB: 9-cycle, Access in parallel with L2 TLB; Eager paging allocator with max order of 21; B+ Tree to store ranges

Table 5. Evaluated Workloads

Suite/Domain            Workload                                                                                                   Data Set
GraphBIG [33]           Betweenness Centrality (BC), Breadth-first search (BFS), Connected components (CC), Coloring (GC),         50-100GB
                        PageRank (PR), Triangle counting (TC), Shortest-path (SP), k-Core (KC)
HPC                     XSBench [32], randacc from GUPS [219]                                                                      10 GB
Function-as-a-Service   AES, Image Resizing (IMG-RES), Word count of a document (WCNT), Database filter query (DB),                <50MB
                        JSON deserialization (JS)
Large Language Models   Short-input short-output prompts using Llama 7B [39], Bagel [38] and Mistral [37] on top of [Link] [217]   <2GB
Image Processing        3D Hadamard Product [218], 3D Matrix Transposition [220], 2D Matrix Sum                                    <2GB

⁶ We use the cosine similarity instead of the mean absolute error to account for the variance and the fluctuations in the PF latency across time.
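The metrics named in this validation can be written down compactly. The sketch below is a minimal illustration with made-up numbers, not the paper’s measurement scripts: it computes the cosine similarity and the mean absolute error of two latency series, and an events-per-kilo-instruction rate such as MPKI. All function names and sample values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long series."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero series")
    return dot / (norm_a * norm_b)


def mean_absolute_error(a, b):
    """Average absolute per-sample difference between two series."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mpki(events, instructions):
    """Events (e.g., L2 TLB misses) per thousand executed instructions."""
    return events * 1000 / instructions


# Hypothetical page fault latencies (in cycles) sampled over time.
real = [1000.0, 4000.0, 1200.0, 3900.0]
simulated = [1100.0, 3600.0, 1500.0, 4200.0]

similarity = cosine_similarity(real, simulated)
error = mean_absolute_error(real, simulated)   # (100 + 400 + 300 + 300) / 4 = 275.0
```

Cosine similarity depends only on the direction of the two vectors, so it rewards a simulated series that rises and falls together with the measured one, whereas the mean absolute error also penalizes a constant offset or scaling between them.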

7.3 Simulation Time and Memory Overhead
Fig. 11 shows the simulation time and memory consumption overhead when we integrate MimicOS into Sniper, ChampSim, Ramulator, and gem5-SE compared to their baseline versions and gem5-FS. We report worst-case overheads using randacc, which incurs the highest number of page faults per kilo instructions (PFKI) and, consequently, frequent MimicOS-simulator communication. We make five key observations. First, integrating MimicOS increases simulation time by an average of 20% due to additional simulated instructions. Second, enabling full-system mode in gem5 leads to a 77% increase in simulation time compared to gem5’s syscall-emulation mode. Third, using MimicOS results in a 1.45x average increase in memory consumption across all simulators. Fourth, in ChampSim and Sniper, we observe nearly 2.1x memory overhead since we enable online binary instrumentation for MimicOS. On the contrary, in Ramulator, where we use offline binary instrumentation, and in gem5, where we reuse the existing binary emulation infrastructure, MimicOS leads to only 1.02x overhead. Last, in terms of raw memory usage, porting MimicOS to Sniper leads to 0.8GB memory usage, whereas gem5-FS consumes double (1.6GB), leading to up to 2x lower simulation job throughput when memory capacity is limited.

[Figure 11 has two panels: (top) slowdown over baseline simulators, with MimicOS and with a full-blown Linux kernel, and (bottom) memory overhead over baseline simulators, for ChampSim, Sniper, Ramulator, gem5-SE, and their average.]
Figure 11. Simulation time and memory usage overheads of integrating MimicOS into Sniper, ChampSim, Ramulator and gem5-SE compared to their baseline versions and gem5-FS.

Correlation Between Simulation Time and Number of MimicOS Instructions. Figure 12 shows the correlation between the number of MimicOS instructions and the simulation time overhead when we integrate MimicOS with Sniper. To perform this analysis, we crafted a microbenchmark where the number of MimicOS instructions is varied while keeping the total number of simulated instructions constant. We observe a strong correlation between the number of MimicOS instructions and the simulation time overhead across all simulation points. As the number of MimicOS instructions increases, the simulation time overhead also increases, by a factor of 1.5x on average. We also verify this trend for gem5-SE and gem5-FS (see extended version [223]).

[Figure 12 plots normalized simulation time (y-axis) against the fraction of instructions executed by MimicOS (x-axis), with a fitted line labeled y=1.5x.]
Figure 12. Correlation between the number of instructions executed by MimicOS and the simulation time overhead.

7.4 Use Case 1: Alternative Page Table Designs
We evaluate different page table (PT) designs to draw insights on the trade-offs between address translation latency, memory interference and page fault latency. We evaluate the following designs: (i) Radix: a 4-level radix-based PT design [194] with Linux-like THP enabled [114, 115], (ii) ECH: elastic cuckoo hash PT design [97], (iii) HDC: 4GB global open-addressing-based hash PT [54], and (iv) HT: 4GB global chain-based hash PT [216]. In this use case, we define memory fragmentation as the percentage of free 2MB pages compared to the total number of 2MB pages.

[Figure 13 plots the reduction in total PTW latency over Radix (y-axis) for ECH, HDC, and HT at memory fragmentation levels of 100%, 98%, 96%, 94%, 92%, and 90% (x-axis).]
Figure 13. Reduction in total PTW latency achieved by hash-based PTs compared to Radix across different memory fragmentation levels.

Effect of PT Design on Translation Latency & Memory Interference. Figure 13 shows the reduction in total PTW latency achieved by ECH, HDC and HT compared to Radix, across different memory fragmentation levels. We make two key observations. First, all three hash-based PT designs consistently reduce the total PTW latency over Radix across all memory fragmentation levels. Second, the reduction in total PTW latency achieved by all hash-based PT designs increases with decreasing fragmentation levels. To better understand the effect of PT design on the system, in Figure 14 we show the total DRAM row buffer conflicts (induced by activating rows that contain either data or page table entries) of ECH, HDC, and HT compared to Radix. We observe that ECH increases total DRAM row buffer conflicts by 52% over Radix while HDC and HT reduce DRAM row buffer conflicts by 5% and 7%, respectively. Probing ECH

during a PTW requires multiple memory accesses (one access for each Cuckoo nest in the hash table), causing high interference in the main memory.

[Figure 14 plot: x-axis workloads BC, BFS, CC, GC, KC, PR, RND, SP, TC, XS, GMEAN; y-axis "Normalized DRAM Row Buffer Conflicts"; series: ECH, HDC, HT.]
Figure 14. Normalized DRAM row buffer conflicts for ECH, HDC and HT over Radix.

Effect of PT Design on Minor Page Fault Latency (MPF). PT design can significantly impact MPF latency due to differences in PT update or insertion operations. For example, Radix requires up to 4 memory accesses to insert a new entry, while ECH may require 1 or more depending on load or insertion order. Figure 15 shows the reduction in total MPF latency achieved by the hash-based PTs over Radix. We make two key observations. First, ECH, HDC, and HT respectively reduce MPF latency by 9%, 18% and 19%, on average across all workloads. This occurs because hash-based PTs are allocated (or expanded) with large physical memory chunks, whereas Radix allocates 4KB frames on demand. Second, HDC and HT reduce MPF latency across all workloads, while ECH increases it in RND due to multiple memory accesses caused by hash collisions.
Obsv. Although ECH reduces the latency of PTWs, it causes higher main memory contention and sometimes increases the latency of MPFs compared to a radix-based baseline.

[Figure 15 plot: x-axis workloads BC, BFS, CC, GC, KC, PR, RND, SP, TC, XS, GMEAN; y-axis "Reduction in total MPF Latency over Radix"; series: ECH, HDC, HT.]
Figure 15. Reduction in total minor page fault (MPF) latency achieved by hash-based PTs compared to Radix.

7.5 Use Case 2: Physical Memory Allocation in LLMs
We examine the effect of different physical memory allocation policies: (i) BD: a buddy allocator that only provides 4KB pages and updates the PT accordingly, (ii) CR-THP: a conservative reservation-based THP allocator [174] that reserves a 2MB physical memory region upon the initial allocation of a 4KB page, and fully upgrades it to a 2MB page once over 50% of the 4KB pages within that region are allocated, (iii) AR-THP: an aggressive reservation-based THP allocator [174] that reserves a 2MB physical memory region upon the initial allocation of a 4KB page, and fully upgrades it to a 2MB page once over 10% of the 4KB pages within that region are allocated, and (iv) UT: a Utopia [105] system with memory segments of different sizes (4MB, 32MB, 512MB) and associativity (8, 16) that employ a restrictive hash-based virtual-to-physical address mapping.
Figure 16 shows the PF latency distribution across all allocation policies in three LLM inference workloads. We make three observations. First, THP-based allocators (CR-THP and AR-THP) show similar median latency to BD but with a >1000x increase in tail latency. Second, UT-32MB/16-way achieves the lowest PF latency as it provides large contiguous segments for fast hash-based page allocations. Third, as we increase the restrictive segment size (e.g., UT-512MB/16-way), both the total and tail PF latencies increase compared to UT-32MB/16-way. This is because allocating data in a very large segment limits the spatial locality of the data structure that stores the allocation metadata (i.e., virtual tags for each physical page), which in turn increases PF latency.
Obsv. Restricting the virtual-to-physical address mapping leads to faster page fault handling due to the lightweight hash-based page allocation routine.

[Figure 16 plot: one panel per workload (Bagel-2.8B, Llama-2-7B, Mistral-7B); y-axis "Page Fault Latency (ns)"; total page fault latency and the best-performing policy are marked in each panel.]
Figure 16. Page fault latency distribution with seven different physical memory allocation policies for three LLM workloads.

7.6 Evaluating Different MMU Designs
We draw insights into how different MMUs affect microarchitectural and system-level metrics. We evaluate the following designs: (i) Utopia [105]: a system equipped with a 16GB-large physical memory segment that employs a restrictive address mapping, (ii) RMM [151]: a system that employs, on the software side, eager paging to allocate large contiguous physical segments and, on the hardware side, a range lookaside buffer and range walker to quickly retrieve contiguity information, and (iii) Midgard [111]: a system that employs an intermediate address space and two-level address translation, with a frontend that employs two VMA lookaside buffers and a backend that employs a 4-level radix tree. We define memory fragmentation based on the underlying design: for Utopia, we define memory fragmentation as the number of available 2MB pages, including the contiguous 2MB pages needed to form the RestSeg, compared to the total number of 2MB pages. For RMM, we define memory fragmentation as the ratio of the total size of the top 50 largest unallocated contiguous segments to the total main memory size.
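
As a minimal sketch of the Utopia and RMM definitions above, both metrics can be computed as plain ratios. The function names, the list-of-free-extent-sizes input, and the example numbers are illustrative assumptions and are not taken from Virtuoso's code:

```python
MB = 1 << 20
HUGE_PAGE_BYTES = 2 * MB  # 2MB page size used in the definitions above


def free_2mb_page_fraction(available_2mb_pages: int, total_memory_bytes: int) -> float:
    """Utopia-style metric: available 2MB pages over the total number of 2MB pages."""
    total_2mb_pages = total_memory_bytes // HUGE_PAGE_BYTES
    return available_2mb_pages / total_2mb_pages


def top_k_free_segment_fraction(free_segment_bytes, total_memory_bytes: int, k: int = 50) -> float:
    """RMM-style metric: total size of the k largest free contiguous segments over memory size."""
    largest = sorted(free_segment_bytes, reverse=True)[:k]
    return sum(largest) / total_memory_bytes


# Hypothetical 1GB machine with 128 free 2MB pages and three free extents.
total = 1024 * MB
print(free_2mb_page_fraction(128, total))  # 0.25
print(top_k_free_segment_fraction([64 * MB, 256 * MB, 128 * MB], total, k=2))  # 0.375
```

The default `k=50` mirrors the "top 50 largest" segments in the RMM definition; the example passes `k=2` only to keep the arithmetic easy to check by hand.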
For Midgard, we define memory fragmentation as the number of 2MB pages that are available for allocation for the backend translation level compared to the total number of 2MB pages.

7.6.1 Use Case 3: Intermediate Address Space Schemes
Figure 17 shows the breakdown of address translation latency in Midgard [111] to understand the effects of frontend and backend address translation. We make two key observations. First, most workloads spend less than 20% of the total translation latency in the frontend translation since they use a small number of large VMAs. Hence, the frontend lookaside buffers can effectively cache all the VMA information. Second, we observe that BC spends more than 50% of the total translation latency in the frontend.

[Figure 17 plot: x-axis workloads BC, BFS, CC, GC, KC, PR, RND, SP, TC, XS; y-axis "Breakdown of Latency" (0% to 100%); series: Backend, Frontend.]
Figure 17. Breakdown of translation latency in Midgard.

To better understand this phenomenon, we investigate the number and size of virtual memory areas (VMA) [195] involved in BC. As shown in Figure 18, BC uses (i) one VMA occupying 77GB of VA space and (ii) 147 smaller VMAs ranging from 4KB to 1GB. While the large VMA is efficiently cached in the frontend VMA lookaside buffers, the 147 smaller VMAs are not covered efficiently by either the L1 or L2 VMA-LBs (3% hit ratio in L2 VLB), resulting in high frontend translation latency. We conclude that Midgard’s frontend design needs further optimization to handle workloads with many small VMAs, despite the large VMAs being efficiently cached.
Obsv. Schemes that employ intermediate address spaces can be further optimized to reduce the frontend translation latency for workloads with a large number of small VMAs.

[Figure 18 plot: x-axis "VMA Size" (4KB, <128KB, <256KB, <512KB, <1MB, <8MB, <16MB, <32MB, <1GB, >1GB); y-axis "Number of VMAs"; the >1GB bin is labeled 77GB.]
Figure 18. Number of VMAs of different sizes in BC.

7.6.2 Use Case 4: Restricting the VA-to-PA Mapping
We evaluate the effects of the size of the restrictive segment (RestSeg) in Utopia [105]. Figure 19 shows the increase in translation latency as we increase the Utopia RestSeg size up to 64GB compared to Utopia that employs an 8GB RestSeg. We draw the following insight: as we increase the size of the RestSeg, address translation latency increases, up to 10% for the largest RestSeg compared to the 8GB RestSeg. This is because a large RestSeg increases the latency of accessing address translation metadata (RSW as described in [105]).
Obsv. Selecting the size of a memory segment that enforces a restrictive VA-to-PA mapping poses a trade-off: larger segments reduce the frequency of page table walks for data within these segments, yet they may increase address translation latency.

[Figure 19 plot: x-axis workloads BC, BFS, CC, GC, KC, PR, RND, SP, TC, XS, GMEAN; y-axis "Increase in address translation latency over 8GB RestSeg"; series: 16GB, 32GB, 64GB.]
Figure 19. Increase in translation latency caused by increasing the RestSeg size over Utopia with an 8GB RestSeg.

Effect of Utopia on Swapping Activity. We evaluate the effect of Utopia on swapping activity using a setup where Virtuoso is integrated into Sniper [133] and MQSim [150]. In this setup, Utopia is configured with restrictive segments capturing large portions of main memory (>50%), and we measure the time spent swapping in/out of memory. When memory usage exceeds 90%, the system begins swapping pages to disk. Figure 20 shows the normalized time spent in swapping for different restrictive segment sizes compared to Radix. We observe that swapping time increases with larger restrictive segments, reaching up to 203x for the largest size compared to Radix. This occurs because restrictive segments cause hash collisions that prevent data from being stored in memory even in the presence of free space. Thus, careful selection of restrictive segment size is crucial to minimize swapping overheads.
Obsv. Enforcing a restrictive hash-based mapping across very large memory segments leads to increased swapping activity.

[Figure 20 plot: x-axis "Fraction of main memory covered by restrictive segment" (50% to 100%); y-axis "Normalized cycles spent on swapping"; labeled values: 1.001x and 203x.]
Figure 20. Time spent in swapping activity for different restrictive segment sizes (in Utopia), normalized to Radix.

7.6.3 Use Case 5: Exploiting Contiguity Information
We further explore the effect of memory fragmentation on exploiting virtual-to-physical address contiguity to reduce PTWs as described in RMM [104]. Figure 21 shows the reduction in DRAM row buffer conflicts caused by address
translation metadata (contiguity information and page table entries) achieved by RMM over Radix, across different fragmentation levels. We observe that even with 94% fragmentation, RMM reduces DRAM row buffer conflicts caused by address translation metadata by 90% on average over Radix due to the reduced number of PTWs.
Obsv. Even at mid-to-high memory fragmentation levels, employing contiguity-based schemes significantly reduces DRAM row buffer conflicts caused by page table accesses.

[Figure 21 plot: x-axis workloads BFS, CC, GC, KC, PR, RND, SP, TC, XS, GMEAN; y-axis "Reduction in DRAM row buffer conflicts over Radix"; series: fragmentation levels 94%, 92%, 90%, 80%, 70%, 60%, 50%, 40%.]
Figure 21. Reduction in DRAM row buffer conflicts (caused by address translation metadata) achieved by RMM, over Radix, across different memory fragmentation levels.

8 Related Work
To our knowledge, Virtuoso is the first simulator that bridges the gap between emulation-based and full-system simulators, enabling accurate exploration of VM designs in a fast and flexible way. Various simulators (e.g., [131–136, 142–146, 168, 180–183]) and simulation methodologies (e.g., [224–233]) have been developed to model different system components. In §2, we examine the key characteristics of emulation-based and full-system simulators and compare them against Virtuoso. In this section, we discuss other related simulation methodologies and provide a broad overview of works that focus on VM optimizations.

8.1 First-Order Models
First-order models, combined with instrumentation tools (e.g., BadgerTrap [233]), are used in prior VM research (e.g., [54, 104, 234]) to approximate VM overheads. These models are typically analytical (e.g., fixed latency for PTW), which makes them valuable for quickly estimating the performance impact of new VM features. However, they overlook critical dynamic effects arising from hardware and OS interactions, such as the volume of page table data stored in caches, DRAM contention due to page table accesses, and large page availability affected by fragmentation. These effects exhibit dynamic behavior and can significantly influence evaluation results.
In contrast, Virtuoso captures both first-order and dynamic effects in VM performance analysis. For instance, as demonstrated in §7.4, Virtuoso measures first-order metrics (e.g., page table walk latency, page fault latency) alongside dynamic effects (e.g., resource contention) of page table design. Thus, Virtuoso serves as an alternative for simulating hardware/OS interactions at higher detail when necessary.

8.2 FPGA-Accelerated Simulation
Several prior works explore FPGA-based approaches to accelerate system simulation (e.g., [235–240]). FireSim [240] is an FPGA-accelerated platform that enables fast, cycle-exact simulation of large-scale systems, such as server blades. FAST [236] is a hybrid FPGA-CPU simulator that offloads its timing model computation onto an FPGA while executing the functional model on a CPU.
FPGA-accelerated simulators come with notable challenges: (i) porting simulation models to register-transfer level (RTL) requires substantial development effort and time, (ii) compilation is slow due to RTL synthesis, and (iii) existing FPGA-based prototypes may not fully represent modern systems due to constraints such as discrepancies between FPGA and DRAM operating frequencies. While these simulators provide fast and accurate simulation, they can be impractical for rapid prototyping (and programming) in fast-evolving HW/SW environments, such as virtual memory solutions. Compared to FPGA-accelerated simulators, Virtuoso prioritizes ease of development, use and versatility while providing relatively high simulation speed and high accuracy.

8.3 Simulating Large-Scale Memory/Storage Systems
Prior works optimize how program values are stored by the simulator, enabling large-scale memory and storage system simulation (e.g., [241–243]). David [241] and Exalt [243] employ semantics-aware data representation schemes that lead to highly-efficient data compression, enabling large-scale storage simulation. Øsim [242] models large-scale memory systems on commodity hardware by leveraging the observation that most data-intensive workloads follow similar control flows, enabling efficient memory compression. Virtuoso can be integrated with these simulators to model real program values while optimizing memory usage.

8.4 Virtual Memory Optimizations
To improve VM, prior works explore several key approaches: (i) enabling large page sizes (e.g., [116, 126, 140, 234, 244–255]), (ii) enforcing virtual-to-physical address contiguity to increase the processor’s address translation reach (e.g., [46, 50, 113, 128, 129, 151, 163–165]), (iii) employing restrictive virtual-to-physical address mappings (e.g., [105, 107, 109]), (iv) designing alternative page table structures to reduce PT walk latency (e.g., [54, 95–103]), (v) employing TLB prefetching (e.g., [125, 138, 170, 256–258]), (vi) optimizing TLB replacement policies (e.g., [259, 260]), (vii) storing TLB entries in the cache to minimize PT walks (e.g., [167, 175, 261]), (viii) leveraging hardware support to reduce page fault handling latency (e.g., [42, 130, 152]), (ix) employing hardware mechanisms to accelerate PT walks (e.g., [48, 262, 263]), (x) optimizing VM components for efficient address translation
in virtualized environments (e.g., [140, 162, 264–270]) and (xi) employing intermediate address spaces to defer address translation (e.g., [106, 110, 111, 166]). Developing these techniques requires extensive simulation effort at both the OS and the hardware model levels. Virtuoso provides a comprehensive toolset of state-of-the-art VM techniques, offering a common ground that makes it easier to develop and evaluate existing and new VM solutions.

9 Conclusion
We introduced Virtuoso, a new simulation methodology that enables quick and accurate prototyping and evaluation of virtual memory (VM) schemes. Virtuoso’s key idea is to employ a lightweight userspace kernel written in a high-level language, which comprises a subset of the OS’s VM-related functionalities, to: (i) accelerate simulation, (ii) simplify the development of new OS routines, and (iii) accurately evaluate different VM schemes. We integrate Virtuoso with five architectural simulators and validate it against a real high-end server-grade CPU. To showcase Virtuoso’s versatility, we conduct five case studies demonstrating its applicability to various VM research areas. Our evaluation demonstrates that Virtuoso provides a new point in the design space of simulators that strikes a unique balance between simulation speed, accuracy, and versatility. We conclude that Virtuoso can become a useful platform for researchers to implement, compare and evaluate new and existing VM designs. To enable further research, we make Virtuoso freely available at [Link]

Acknowledgements
We thank the anonymous reviewers of MICRO 2024 and ASPLOS 2025 for their feedback and the SAFARI Research Group members for providing a stimulating intellectual environment. We thank Ian Ganz for his help during early stages of this work. We acknowledge the generous gifts from our industrial partners: Google, Huawei, Intel, Microsoft, and VMware, and the Semiconductor Research Corporation. This work was supported in part by the ETH Future Computing Laboratory.

References
[1] Abhishek Bhattacharjee. Breaking the Address Translation Wall By Accelerating Memory Replays. In IEEE Micro, 2018.
[2] Steven M Hand. Self-Paging in the Nemesis Operating System. In OSDI, 1999.
[3] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. In TOCS, 1989.
[4] Andrew W. Appel and Kai Li. Virtual Memory Primitives for User Programs. In ASPLOS, 1991.
[5] Richard Rashid, Avadis Tevanian, Michael Young, David Golub, Robert Baron, David Black, William Bolosky, and Jonathan Chew. Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures. In OSR, 1987.
[6] M. Satyanarayanan, Henry H. Mashburn, Puneet Kumar, David C. Steere, and James J. Kistler. Lightweight Recoverable Virtual Memory. In SOSP, 1993.
[7] E. Abrossimov, M. Rozier, and M. Shapiro. Generic Virtual Memory Management for Operating System Kernels. In SOSP, 1989.
[8] Richard W. Carr and John L. Hennessy. WSCLOCK – A Simple and Effective Algorithm for Virtual Memory Management. In SOSP, 1981.
[9] Ting Yang, Emery D. Berger, Scott F. Kaplan, and J. Eliot B. Moss. CRAMM: Virtual Memory Support for Garbage-Collected Applications. In OSDI, 2006.
[10] Peter J. Denning. Virtual Memory. In CSUR, 1970.
[11] Thomas Ahearn, Robert Capowski, Neal Christensen, Patrick Gannon, Arlin Lee, and John Liptay. Virtual Memory System, 1973.
[12] Robert P Goldberg. Survey of Virtual Machine Research. In IEEE Computer, 1974.
[13] Bruce L. Jacob and Trevor N. Mudge. A Look at Several Memory Management Units, TLB-Refill Mechanisms, and Page Table Organizations. In ASPLOS, 1998.
[14] A. J. Smith. A Comparative Study of Set Associative Memory Mapping Algorithms and Their Use for Cache and Main Memory. In IEEE TSE, 1978.
[15] D. A. Wood, S. J. Eggers, G. Gibson, M. D. Hill, and J. M. Pendleton. An In-Cache Address Translation Mechanism. In ISCA, 1986.
[16] J Bradley Chen, Anita Borg, and Norman P Jouppi. A Simulation Based Study of TLB Performance. In ISCA, 1992.
[17] Eric J. Koldinger, Jeffrey S. Chase, and Susan J. Eggers. Architecture Support for Single Address Space Operating Systems. In ASPLOS, 1992.
[18] Anders Lindstrom, John Rosenberg, and Alan Dearle. The Grand Unified Theory of Address Spaces. In HotOS, 1995.
[19] Bruce Jacob and Trevor Mudge. Virtual Memory in Contemporary Microprocessors. In IEEE Micro, 1998.
[20] D. R. Engler, S. K. Gupta, and M. F. Kaashoek. AVM: Application-Level Virtual Memory. In HotOS, 1995.
[21] Jerry Huck and Jim Hays. Architectural Support for Translation Table Management in Large Address Space Machines. In ISCA, 1993.
[22] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D. Lazowska. The Interaction of Architecture and Operating System Design. In ASPLOS, 1991.
[23] F. J. Corbató and V. A. Vyssotsky. Introduction and Overview of the Multics System. In AFIPS, 1965.
[24] Thomas N Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR, 2017.
[25] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph Neural Networks: A Review of Methods and Applications. In AI Open, 2020.
[26] Brad Fitzpatrick. Distributed Caching with Memcached. In Linux J., 2004.
[27] Redis. [Link]
[28] Graph 500. Graph 500 Large-Scale Benchmarks. [Link] org/.
[29] Mikko Rautiainen and Tobias Marschall. GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment. In Genome Biology, 2020.
[30] Damla Senol Cali, Konstantinos Kanellopoulos, Joël Lindegger, Zülal Bingöl, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi, Gagandeep Singh, Juan Gómez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu. SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping. In ISCA, 2022.
[31] Piotr R. Luszczek, David H. Bailey, Jack J. Dongarra, Jeremy Kepner, Robert F. Lucas, Rolf Rabenseifner, and Daisuke Takahashi. The HPC Challenge (HPCC) Benchmark Suite. In SC, 2006.
[32] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. XSBench - The Development and Verification of a Performance Abstraction for Monte Carlo Reactor Analysis. In PHYSOR, 2014.
[33] Lifeng Nai, Yinglong Xia, Ilie G. Tanase, Hyesoon Kim, and Ching-Yung Lin. GraphBIG: Understanding Graph Computing in the Context of Industrial Solutions. In SC, 2015.
[34] R. Hwang, T. Kim, Y. Kwon, and M. Rhu. Centaur: A Chiplet-Based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations. In ISCA, 2020.
[35] Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. The Architectural Implications of Facebook’s DNN-Based Personalized Recommendation. In HPCA, 2020.
[36] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. In SOSP, 2023.
[37] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. In arXiv, 2023.
[38] Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, and Kenton Lee. BAGEL: Bootstrapping Agents by Guiding Exploration with Language. In ICML, 2024.
[39] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. In arXiv, 2023.
[40] Dmitrii Ustiugov, Plamen Petrov, Marios Kogias, Edouard Bugnion, and Boris Grot. Benchmarking, Analysis, and Optimization of Serverless Function Snapshots. In ASPLOS, 2021.
[41] David Schall, Andreas Sandberg, and Boris Grot. Warming Up a Cold Front-End with Ignite. In MICRO, 2023.
[42] Ziqi Wang, Kaiyang Zhao, Pei Li, Andrew Jacob, Michael Kozuch, Todd Mowry, and Dimitrios Skarlatos. Memento: Architectural Support for Ephemeral Memory Management in Serverless Environments. In MICRO, 2023.
[43] Dong Du, Tianyi Yu, Yubin Xia, Binyu Zang, Guanglu Yan, Chenggang Qin, Qixuan Wu, and Haibo Chen. Catalyzer: Sub-Millisecond Startup for Serverless Computing with Initialization-Less Booting. In ASPLOS, 2020.
[44] Mohammad Shahrad, Jonathan Balkind, and David Wentzlaff. Architectural Implications of Function-as-a-Service Computing. In MICRO, 2019.
[45] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In ASPLOS, 2019.
[46] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. Efficient Virtual Memory for Big Memory Servers. In ISCA, 2013.
[47] Vasileios Karakostas, Osman S. Unsal, Mario Nemirovsky, Adrian Cristal, and Michael Swift. Performance Analysis of the Memory Management Unit Under Scale-Out Workloads. In IISWC, 2014.
[48] Thomas W. Barr, Alan L. Cox, and Scott Rixner. Translation Caching: Skip, Don’t Walk (the Page Table). In ISCA, 2010.
[49] Linux. 5 Level Paging. [Link] [Link], 2021.
[50] Kaiyang Zhao, Kaiwen Xue, Ziqi Wang, Dan Schatzberg, Leon Yang, Antonis Manousis, Johannes Weiner, Rik Van Riel, Bikash Sharma, Chunqiang Tang, and Dimitrios Skarlatos. Contiguitas: the Pursuit of Physical Memory Contiguity in Datacenters. In ISCA, 2023.
[51] Sandeep Kumar, Aravinda Prasad, Smruti R. Sarangi, and Sreenivas Subramoney. Radiant: Efficient Page Table Management for Tiered Memory Systems. In ISMM, 2021.
[52] Abhishek Bhattacharjee and Margaret Martonosi. Characterizing the TLB Behavior of Emerging Parallel Workloads On Chip Multiprocessors. In PACT, 2009.
[53] Swapnil Haria, Mark D. Hill, and Michael M. Swift. Devirtualizing Memory in Heterogeneous Systems. In ASPLOS, 2018.
[54] Idan Yaniv and Dan Tsafrir. Hash, Don’t Cache (the Page Table). In SIGMETRICS, 2016.
[55] Timothy Merrifield and H. Reza Taheri. Performance Implications of Extended Page Tables on Virtualized x86 Processors. In VEE, 2016.
[56] Peter Hornyack, Luis Ceze, Steve Gribble, Dan Ports, and Hank Levy. A Study of Virtual Memory Usage and Implications for Large Memory. Technical report, 2013.
[57] Nick Lindsay and Abhishek Bhattacharjee. Understanding Address Translation Scaling Behaviours Using Hardware Performance Counters. In IISWC, 2024.
[58] Intel Corp. 3rd Generation Intel® Xeon® Scalable Processors. https://[Link]/content/www/us/en/products/docs/processors/embedded/[Link].
[59] Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, and Onur Mutlu. Utility-Based Hybrid Memory Management. In CLUSTER, 2017.
[60] Jishen Zhao, Onur Mutlu, and Yuan Xie. FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems. In MICRO, 2014.
[61] Reza Salkhordeh, Onur Mutlu, and Hossein Asadi. An Analytical Model for Performance and Lifetime Estimation of Hybrid DRAM-NVM Main Memories. In TC, 2019.
[62] Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan. Enabling Efficient and Scalable Hybrid Memories using Fine-granularity DRAM Cache Management. In CAL, 2012.
[63] Sihang Liu, Korakit Seemakhupt, Gennady Pekhimenko, Aasheesh Kolli, and Samira Khan. Janus: Optimizing Memory and Storage Support for Non-Volatile Memory Systems. In ISCA, 2019.
[64] Chloe Alverti, Vasileios Karakostas, Nikhita Kunati, Georgios Goumas, and Michael Swift. DaxVM: Stressing the Limits of Memory as a File Interface. In MICRO, 2022.
[65] Shai Bergman, Priyank Faldu, Boris Grot, Lluís Vilanova, and Mark Silberstein. Reconsidering OS Memory Optimizations in the Presence of Disaggregated Memory. In ISMM, 2022.
[66] Hyungkyu Ham, Jeongmin Hong, Geonwoo Park, Yunseon Shin, Okkyun Woo, Wonhyuk Yang, Jinhoon Bae, Eunhyeok Park, Hyojin Sung, Euicheol Lim, and Gwangsun Kim. Low-Overhead General-Purpose Near-Data Processing in CXL Memory Expanders. In MICRO, 2024.
[67] Houxiang Ji, Srikar Vanavasam, Yang Zhou, Qirong Xia, Jinghan Huang, Yifan Yuan, Ren Wang, Pekon Gupta, Bhushan Chitlur, Ipoom Jeong, and Nam Sung Kim. Demystifying a CXL Type-2 Device: A Heterogeneous Cooperative Computing Perspective. In MICRO, 2024.
[68] Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, and Nam Sung Kim. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. In MICRO, 2023.
[69] Dimosthenis Masouros, Christian Pinto, Michele Gazzetti, Sotirios Xydis, and Dimitrios Soudris. Adrias: Interference-aware memory orchestration for disaggregated cloud infrastructures. In HPCA, 2023.
[70] Zhiyuan Guo, Yizhou Shan, Xuhao Luo, Yutong Huang, and Yiying Zhang. Clio: A Hardware-software co-designed Disaggregated Memory system. In ASPLOS, 2022.
[71] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. In OSDI, 2016.
[72] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In OSDI, 2018.
[73] Dario Korolija, Dimitrios Koutsoukos, Kimberly Keeton, Konstantin Taranov, Dejan S. Milojicic, and Gustavo Alonso. Farview: Disaggregated Memory with Operator Off-loading for Database Engines. In CIDR, 2022.
[74] Chenxi Wang, Haoran Ma, Shi Liu, Yuanqi Li, Zhenyuan Ruan, Khanh Nguyen, Michael D. Bond, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu. Semeru: A Memory-Disaggregated Managed Runtime. In OSDI, 2020.
[75] Pengfei Zuo, Jiazhao Sun, Liu Yang, Shuangwu Zhang, and Yu Hua. One-sided RDMA-Conscious Extendible Hashing for Disaggregated Memory. In ATC, 2021.
[76] Hasan Al Maruf and Mosharaf Chowdhury. Effectively Prefetching Remote Memory with Leap. In ATC, 2020.
[77] Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K. Reinhardt, and Thomas F. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In ISCA, 2009.
[78] Qizhen Zhang, Yifan Cai, Sebastian Angel, Vincent Liu, Ang Chen, and Boon Thau Loo. Rethinking Data Management Systems for Disaggregated Data Centers. In CIDR, 2020.
[79] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. Nimble Page Management for Tiered Memory Systems. In ASPLOS, 2019.
[80] Sebastian Angel, Mihir Nanavati, and Siddhartha Sen. Disaggregation and the Application. In HotCloud, 2020.
[89] Atul Adya, Robert Grandl, Daniel Myers, and Henry Qin. Fast Key-Value Stores: An Idea Whose Time Has Come and Gone. In HotOS, 2019.
[90] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, Junaid Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and Parthasarathy Ranganathan. Software-Defined Far Memory in Warehouse-Scale Computers. In ASPLOS, 2019.
[91] Christian Pinto, Dimitris Syrivelis, Michele Gazzetti, Panos Koutsovasilis, Andrea Reale, Kostas Katrinis, and H. Peter Hofstee. ThymesisFlow: A Software-Defined, HW/SW co-Designed Interconnect Stack for Rack-Scale Memory Disaggregation. In MICRO, 2020.
[92] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient Memory Disaggregation with Infiniswap. In NSDI, 2017.
[93] Dhantu Buragohain, Abhishek Ghogare, Trishal Patel, Mythili Vutukuru, and Purushottam Kulkarni. DiME: A Performance Emulator for Disaggregated Memory Architectures. In APSys, 2017.
[94] Georgios Zervas, Hui Yuan, Arsalan Saljoghei, Qianqiao Chen, and Vaibhawa Mishra. Optically Disaggregated Data Centers with Minimal Remote Memory Latency: Technologies, Architectures, and Resource Allocation. In JOCN, 2018.
[95] Swapnil Haria, Michael M. Swift, and Mark D. Hill. Devirtualizing Virtual Memory for Heterogeneous Systems. In ASPLOS, 2018.
[96] Chang Hyun Park, Ilias Vougioukas, Andreas Sandberg, and David Black-Schaffer. Every Walk’s a Hit: Making Page Walks Single-Access Cache Hits. In ASPLOS, 2022.
[97] Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, and Josep Torrellas. Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism. In ASPLOS, 2020.
[98] Jovan Stojkovic, Namrata Mantri, Dimitrios Skarlatos, Tianyin Xu, and Josep Torrellas. Memory-Efficient Hashed Page Tables. In HPCA, 2023.
[99] Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu. Accelerating Pointer Chasing in 3D-stacked Memory: Challenges, Mechanisms, Evaluation. In ICCD, 2016.
[81] Kevin Lim, Yoshio Turner, Jose Renato Santos, Alvin AuYoung, [100] Reto Achermann, Ashish Panwar, Abhishek Bhattacharjee, Timothy
Jichuan Chang, Parthasarathy Ranganathan, and Thomas F. Wenisch. Roscoe, and Jayneel Gandhi. Mitosis: Transparently Self-Replicating
System-Level Implications of Disaggregated Memory. In HPCA, 2012. Page-Tables for Large-Memory Machines. In ASPLOS, 2020.
[82] Ivy Peng, Roger Pearce, and Maya Gokhale. On the Memory Under- [101] Sam Ainsworth and Timothy M. Jones. Compendia: Reducing Virtual-
utilization: Exploring Disaggregated Memory on HPC Systems. In Memory Costs Via Selective Densification. In ISMM, 2021.
SBAC-PAD, 2020. [102] Hanna Alam, Tianhao Zhang, Mattan Erez, and Yoav Etsion. Do-It-
[83] Laurent Bindschaedler, Ashvin Goel, and Willy Zwaenepoel. Hail- Yourself Virtual Memory Translation. In ISCA, 2017.
storm: Disaggregated Compute and Storage for Distributed LSM- [103] Osang Kwon, Yongho Lee, Junhyeok Park, Sungbin Jang, Byungchul
Based Databases. In ASPLOS, 2020. Tak, and Seokin Hong. Distributed Page Table: Harnessing Physical
[84] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodor- Memory as an Unbounded Hashed Page Table. In MICRO, 2024.
opoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, [104] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley,
S. Lopez-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and M. Nemirovsky, M. M. Swift, and O. Ünsal. Redundant Memory
T. Berends. Rack-Scale Disaggregated Cloud Data Centers: The dReD- Mappings for Fast Access to Large Memories. In ISCA, 2015.
Box Project Vision. In DATE, 2016. [105] Konstantinos Kanellopoulos, Rahul Bera, Kosta Stojiljkovic, Nisa
[85] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Bostanci, Can Firtina, Rachata Ausavarungnirun, Rakesh Kumar, Nas-
Jayneel Gandhi, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, taran Hajinazar, Jisung Park, Mohammad Sadrosadati, Nandita Vi-
Rajesh Venkatasubramanian, and Michael Wei. Remote Memory in jaykumar, and Onur Mutlu. Utopia: Efficient Address Translation
the Age of Fast Networks. In SoCC, 2017. using Hybrid Virtual-to-Physical Address Mapping. In MICRO, 2023.
[86] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, [106] Nastaran Hajinazar, Pratyush Patel, Minesh Patel, Konstantinos
Jayneel Gandhi, Stanko Novakovic, Arun Ramanathan, Pratap Sub- Kanellopoulos, Saugata Ghose, Rachata Ausavarungnirun, Geraldo F.
rahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, Oliveira, Jonathan Appavoo, Vivek Seshadri, and Onur Mutlu. The
and Michael Wei. Remote Regions: A Simple Abstraction for Remote Virtual Block Interface: A Flexible Alternative to the Conventional
Memory. In ATC, 2018. Virtual Memory Framework. In ISCA, 2020.
[87] Pramod Subba Rao and George Porter. Is Memory Disaggregation [107] Krishnan Gosakan, Jaehyun Han, William Kuszmaul, Ibrahim Nael
Feasible? A Case Study with Spark SQL. In ANCS, 2016. Mubarek, Nirjhar Mukherjee, Guido Tagliavini, Evan West, Michael
[88] Irina Calciu, M. Talha Imran, Ivan Puddu, Sanidhya Kashyap, Bender, Abhishek Bhattacharjee, Alex Conway, Martin Farach-Colton,
Hasan Al Maruf, Onur Mutlu, and Aasheesh Kolli. Rethinking Soft- Jayneel Gandhi, Rob Johnson, Sudarsun Kannan, and Donald Porter.
ware Runtimes for Disaggregated Memory. In ASPLOS, 2021.

18
Mosaic Pages: Big TLB Reach with Small Pages. In ASPLOS, 2023. [130] Chandrahas Tirumalasetty, Chih Chieh Chou, Narasimha Reddy, Paul
[108] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Gratz, and Ayman Abouelwafa. Reducing Minor Page Fault Over-
Michael M. Swift. Efficient Virtual Memory for Big Memory Servers. heads through Enhanced Page Walker. In TACO, 2022.
In ISCA 2013. [131] HPS Research Group. “hpsresearchgroup/scarab: Joint HPS and ETH
[109] Javier Picorel, Djordje Jevdjic, and Babak Falsafi. Near-Memory repository to work towards open sourcing Scarab and Ramulator.”.
Address Translation. In PACT, 2017. [Link]
[110] Lixin Zhang, Evan Speight, Ram Rajamony, and Jiang Lin. Enigma: [132] ChampSim. [Link]
Architectural and Operating System Support for Reducing the Impact [133] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. Sniper:
of Address Translation. In ICS, 2010. Exploring the Level of Abstraction for Scalable and Accurate Parallel
[111] Siddharth Gupta, Atri Bhattacharyya, Yunho Oh, Abhishek Bhat- Multi-Core Simulations. In SC, 2011.
tacharjee, Babak Falsafi, and Mathias Payer. Rebooting Virtual Mem- [134] D. Ernst T. Austin, E. Larson. SimpleScalar: an infrastructure for
ory with Midgard. In ISCA, 2021. computer system modeling. In IEEE Computer, 2002.
[112] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, [135] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David
and Emmett Witchel. Coordinated and Efficient Huge Page Manage- Kaeli. Multi2Sim: A Simulation Framework for CPU-GPU Computing.
ment with Ingens. In OSDI, 2016. In PACT, 2012.
[113] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. [136] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Rein-
Translation Ranger: Operating System Support for Contiguity-Aware hardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower,
TLBs. In ISCA, 2019. Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell,
[114] Jonathan Corbet. Transparent Huge Pages in 2.6.38. [Link] Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood.
Articles/423584/, 2011. The gem5 Simulator. 2011.
[115] Jonathan Corbet. The Current State of Kernel Page-Table Isolation. [137] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal,
[Link] 2017. Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M.
[116] Venkat Sri Sai Ram, Ashish Panwar, and Arkaprava Basu. Trident: Swift, and Osman Ünsal. Redundant Memory Mappings for Fast
Harnessing Architectural Resources for All Page Sizes in X86 Proces- Access to Large Memories. In ISCA, 2015.
sors. In MICRO, 2021. [138] Artemiy Margaritov, Dmitrii Ustiugov, Edouard Bugnion, and Boris
[117] Stratos Psomadakis, Chloe Alverti, Vasileios Karakostas, Christos Kat- Grot. Prefetched Address Translation. In MICRO, 2019.
sakioris, Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas, [139] Guilherme Cox and Abhishek Bhattacharjee. Efficient Address Trans-
and Nectarios Koziris. Elastic Translations: Fast Virtual Memory with lation for Architectures with Multiple Page Sizes. In ASPLOS, 2017.
Multiple Translation Sizes. In MICRO, 2024. [140] Binh Pham, Ján Veselý, Gabriel H. Loh, and Abhishek Bhattacharjee.
[118] Jee Ho Ryoo, Nagendra Gulur, Shuang Song, and Lizy K. John. Re- Large Pages and Lightweight Memory Management in Virtualized
thInking TLB Designs in Virtualized Environments: A Very Large Environments: Can You Have It Both Ways? In MICRO, 2015.
Part-of-Memory TLB. In ISCA, 2017. [141] Thomas W. Barr, Alan L. Cox, and Scott Rixner. SpecTLB: A Mecha-
[119] Yashwant Marathe, Nagendra Gulur, Jee Ho Ryoo, Shuang Song, and nism for Speculative Address Translation. In ISCA, 2011.
Lizy K. John. CSALT: Context Switch Aware Large TLB. In MICRO, [142] Haocong Luo, Yahya Can Tuğrul, F. Nisa Bostancı, Ataberk Olgun,
2017. A. Giray Yağlıkçı, , and Onur Mutlu. Ramulator 2.0: A Modern,
[120] Yunfang Tai, Wanwei Cai, Qi Liu, Ge Zhang, and Wenzhi Wang. Modular, and Extensible DRAM Simulator. In CAL, 2023.
Comparisons of Memory Virtualization Solutions for Architectures [143] Daniel Sanchez and Christos Kozyrakis. ZSim: Fast and Accurate
with Software-Managed TLBs. In NAS, 2013. Microarchitectural Simulation of Thousand-Core Systems. In ISCA,
[121] Xiaotao Chang, Hubertus Franke, Yi Ge, Tao Liu, Kun Wang, Jimi 2013.
Xenidis, Fei Chen, and Yu Zhang. Improving Virtualization in the [144] Amit Puri, Kartheek Bellamkonda, Kailash Narreddy, John Jose,
Presence of Software Managed Translation Lookaside Buffers. In Venkatesh Tamarapalli, and Vijaykrishnan Narayanan. DRackSim:
ISCA, 2013. Simulating CXL-Enabled Large-Scale Disaggregated Memory Sys-
[122] Richard Uhlig, David Nagle, Tim Stanley, Trevor Mudge, Stuart tems. In PADS, 2024.
Sechrest, and Richard Brown. Design Tradeoffs for Software- [145] Aamer Jaleel, Robert S. Cohn, Chi-Keung Luk, and Bruce Jacob. CMP-
Managed TLBs. In TOCS, 1994. Sim: A Pin-Based On-The-Fly Multi-Core Cache Simulator. In Work-
[123] D. R. Cheriton, G. A. Slavenburg, and P. D. Boyle. Software-Controlled shop on Modeling, Benchmarking and Simulation, 2008.
Caches in the VMP Multiprocessor. In ISCA, 1986. [146] EPFL Parallel Systems Architecture Lab (PARSA). QFlex, 2020.
[124] David Nagle, Richard Uhlig, Tim Stanley, Stuart Sechrest, Trevor N. [147] Bjarne Stroustrup. The C++ Programming Language. 2013.
Mudge, and Richard B. Brown. Design Tradeoffs for Software- [148] Linus Torvalds. Linux (5.15) [operating system]. [Link]
managed TLBs. In ISCA, 1993. torvalds/linux/releases/tag/.
[125] Kavita Bala, M. Frans Kaashoek, and William E. Weihl. Software [149] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A Fast and
Prefetching and Caching for Translation Lookaside Buffers. In OSDI, Extensible DRAM Simulator. In CAL, 2015.
1994. [150] Arash Tavakkol, Juan Gómez-Luna, Mohammad Sadrosadati, Saugata
[126] Faruk Guvenilir and Yale N Patt. Tailored Page Sizes. In ISCA, 2020. Ghose, and Onur Mutlu. MQSim: A Framework for Enabling Realistic
[127] Misel-Myrto Papadopoulou, Xin Tong, André Seznec, and Andreas Studies of Modern Multi-Queue SSD Devices. In FAST, 2018.
Moshovos. Prediction-Based Superpage-Friendly TLB Designs. In [151] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal,
HPCA, 2015. Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M.
[128] Chang Hyun Park, Taekyung Heo, Jungi Jeong, and Jaehyuk Huh. Swift, and Osman Ünsal. Redundant Memory Mappings for Fast
Hybrid TLB Coalescing: Improving TLB Translation Coverage Under Access to Large Memories. In ISCA, 2015.
Diverse Fragmented Memory Allocations. In ISCA, 2017. [152] Gyusun Lee, Wenjing Jin, Wonsuk Song, Jeonghun Gong, Jonghyun
[129] Chloe Alverti, Stratos Psomadakis, Vasileios Karakostas, Jayneel Bae, Tae Jun Ham, Jae W. Lee, and Jinkyu Jeong. A Case for Hardware-
Gandhi, Konstantinos Nikas, Georgios Goumas, and Nectarios Koziris. Based Demand Paging. In ISCA, 2020.
Enhancing and Exploiting Contiguity for Fast Memory Virtualization. [153] Intel Xeon Gold 6226R. [Link]
In ISCA, 2020. 6226r.

19
[154] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiy- MICRO, 2023.
oung Choi. A Scalable Processing-in-Memory Accelerator for Parallel [176] Mark Mansi, Bijan Tabatabai, and Michael M. Swift. CBMM: Financial
Graph Processing. In ISCA, 2015. Advice for Kernel Memory Managers. In ATC, 2022.
[155] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. PIM- [177] Mark Mansi and Michael M. Swift. Characterizing Physical Memory
Enabled Instructions: A Low-Overhead, Locality-Aware Processing- Fragmentation. In arXiv, 2024.
in-Memory Architecture. In ISCA, 2015. [178] stress-ng. [Link]
[156] Saugata Ghose, Amirali Boroumand, Jeremie S Kim, Juan Gómez- [179] mmap() System Call. [Link]
Luna, and Onur Mutlu. Processing-in-Memory: A Workload-Driven [Link].
Perspective. In IBM Journal, 2019. [180] Nikolaos Hardavellas, Stephen Somogyi, Thomas F. Wenisch,
[157] Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Roland E. Wunderlich, Shelley Chen, Jangwoo Kim, Babak Falsafi,
Ausavarungnirun. Processing Data Where It Makes Sense: Enabling James C. Hoe, and Andreas Nowatzyk. SimFlex: A Fast, Accurate,
In-Memory Computation. In arXiv, 2019. Flexible Full-System Simulation Framework for Performance Evalua-
[158] Mingyu Gao and Christos Kozyrakis. HRL: Efficient and Flexible tion of Server Architecture. In SIGMETRICS, 2004.
Reconfigurable Logic for Near-Data Processing. In HPCA, 2016. [181] Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald,
[159] Yueqi Wang, Bingyao Li, Mohamed Tarek Ibn Ziad, Lieven Eeckhout, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant
Jun Yang, Aamer Jaleel, and Xulong Tang. OASIS: Object-Aware Page Agarwal. Graphite: A Distributed Parallel Simulator for Multicores.
Management for Multi-GPU Systems. In HPCA, 2025. In HPCA, 2010.
[160] Yueqi Wang, Bingyao Li, Aamer Jaleel, Jun Yang, and Xulong Tang. [182] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg,
GRIT: Enhancing Multi-GPU Performance with Fine-Grained Dy- J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A Full
namic Page Placement. In HPCA, 2024. System Simulation Platform. In IEEE Computer, 2002.
[161] Jovan Stojkovic, Namrata Mantri, Dimitrios Skarlatos, Tianyin Xu, [183] A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A Full System
and Josep Torrellas. Memory-Efficient Hashed Page Tables. In HPCA, Simulator for Multicore x86 CPUs. In DAC, 2011.
2023. [184] Ryan R. Curtin, Marcus Edel, Omar Shrit, Shubham Agrawal, Suryo-
[162] Jayneel Gandhi, Mark D. Hill, and Michael M. Swift. Agile Paging: day Basak, James J. Balamuta, Ryan Birmingham, Kartik Dutt, Dirk Ed-
Exceeding the Best of Nested and Shadow Paging. In ISCA, 2016. delbuettel, Rishabh Garg, Shikhar Jaiswal, Aakash Kaushik, Sangyeon
[163] Dongwei Chen, Dong Tong, Chun Yang, Jiangfang Yi, and Xu Cheng. Kim, Anjishnu Mukherjee, Nanubala Gnana Sai, Nippun Sharma,
FlexPointer: Fast Address TranslatiOn Based On Range TLB and Yashwant Singh Parihar, Roshan Swain, and Conrad Sanderson. ml-
Tagged Pointers. In TACO, 2023. pack 4: A Fast, Header-Only C++ Machine Learning Library. In
[164] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Journal of Open Source Software, 2023.
Bhattacharjee. CoLT: Coalesced Large-Reach TLBs. In MICRO, 2012. [185] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
[165] Jiyuan Zhang, Weiwei Jia, Siyuan Chai, Peizhe Liu, Jongyul Kim, Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean,
and Tianyin Xu. Direct Memory Translation for Virtualized Clouds. Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp,
ASPLOS, 2024. Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz
[166] B Frey. PowerPC Architecture Book 2003. [Link]/ Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat
developerworks/eserver/articles/[Link]. Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster,
[167] Aamer Jaleel, Eiman Ebrahimi, and Sam Duncan. DUCATI: High- Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul
Performance Address Translation by Extending TLB Reach of GPU- Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol
Accelerated Systems. In TACO, 2019. Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu,
[168] M.T. Yourst. PTLsim: A Cycle Accurate Full System x86-64 Microar- and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on
chitectural Simulator. In ISPASS, 2007. heterogeneous systems. In arXiv, 2015.
[169] Emmett Witchel, Josh Cates, and Krste Asanović. Mondrian Memory [186] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James
Protection. In ASPLOS, 2002. Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia
[170] Georgios Vavouliotis, Lluc Alvarez, Vasileios Karakostas, Konstanti- Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward
nos Nikas, Nectarios Koziris, Daniel A. Jiménez, and Marc Casas. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chil-
Exploiting Page Table Locality for Agile TLB Prefetching. In ISCA, amkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.
2021. PyTorch: An Imperative Style, High-Performance Deep Learning
[171] Nandita Vijaykumar, Abhilasha Jain, Diptesh Majumdar, Kevin Hsieh, Library. In NeurIPS, 2019.
Gennady Pekhimenko, Eiman Ebrahimi, Nastaran Hajinazar, Phillip B. [187] Pin - A Dynamic Binary Instrumentation Tool. [Link]
Gibbons, and Onur Mutlu. A Case for Richer Cross-Layer Abstrac- com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
tions: Bridging the Semantic Gap with Expressive Memory. In ISCA, [188] DynamoRio. [Link]
2018. [189] Fred Zlotnick. The POSIX.1 Standard: a Programmer’s guide. 1991.
[172] Longyu Zhao, Zongwu Wang, Fangxin Liu, and Li Jiang. Ninja: A [190] Posix shared memory. [Link]
hardware assisted system for accelerating nested address translation. shm_overview.[Link].
In ICCD, 2024. [191] Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt. Runahead
[173] Advanced Micro Devices. AMD-V Nested Paging, White Pa- Execution: An Effective Alternative to Large Instruction Windows.
per. [Link] In HPCA 2003.
1%[Link]. [192] Tanausu Ramirez, Alex Pajuelo, Oliverio J Santana, and Mateo Valero.
[174] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical, Runahead Threads to Improve SMT Performance. In HPCA, 2008.
Transparent Operating System Support for Superpages. In OSDI, [193] Kernel Development Community. The Linux Kernel 6.10 Manual.
2002. [Link]
[175] Konstantinos Kanellopoulos, Hong Chul Nam, F. Nisa Bostanci, Rahul [194] Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3:
Bera, Mohammad Sadrosadati, Rakesh Kumar, Davide Basilio Bar- System Programming Guide 3A 4-19.
tolini, and Onur Mutlu. Victima: Drastically Increasing Address [195] Anonymous Memory. [Link]
Translation Reach by Leveraging Underutilized Cache Resources. In [Link].

20
[196] Mike Kravetz. Hugetlbfs Reservation. [Link] Sadrosadati, Rakesh Kumar, Nandita Vijaykumar, and Onur Mutlu.
html/v4.20/vm/hugetlbfs_reserv.html, 2017. Virtuoso: Enabling fast and accurate virtual memory research via an
[197] Jeff Bonwick. The Slab Allocator: An Object-Caching Kernel Memory imitation-based os simulation methodology. In arXiv, 2025.
Allocator. In USTC, 1994. [224] Davy Genbrugge, Stijn Eyerman, and Lieven Eeckhout. Interval Sim-
[198] Khugepage Daemon. [Link] ulation: Raising the Level of Abstraction in Architectural Simulation.
vm/[Link]. In HPCA, 2010.
[199] Swap Management. [Link] [225] Frederick Ryckbosch, Stijn Polfliet, and Lieven Eeckhout. VSim:
understand/[Link]. Simulating Multi-Server Setups at Near Native Hardware Speed. In
[200] Raúl Cervera, Toni Cortes, and Yolanda Becerra. Improving Applica- TACO, 2012.
tion Performance Through Swap Compression. In ATC, 1999. [226] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. Sampled
[201] Linux KVM. [Link] Simulation of Multi-Threaded Applications. In ISPASS, 2013.
[202] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A [227] Trevor E. Carlson, Wim Heirman, Kenzo Van Craeynest, and Lieven
Cycle Accurate Memory System Simulator. volume 10, pages 16–19. Eeckhout. BarrierPoint: Sampled Simulation of Multi-Threaded Ap-
IEEE, 2011. plications. In ISPASS, 2014.
[203] Stijn Eyerman, Sam Van den Steen, Wim Heirman, and Ibrahim Hur. [228] Nikos Nikoleris, Lieven Eeckhout, Erik Hagersten, and Trevor E.
Simulating Wrong-Path Instructions in Decoupled Functional-First Carlson. Directed Statistical Warming through Time Traveling. In
Simulation. In ISPASS, 2023. MICRO, 2019.
[204] Onur Mutlu, Hyesoon Kim, David N Armstrong, and Yale N Patt. [229] Wenjie Liu, Wim Heirman, Stijn Eyerman, Shoaib Akram, and Lieven
An Analysis of the Performance Impact of Wrong-path Memory Eeckhout. Scale-Model Architectural Simulation. In ISPASS, 2022.
References on Out-of-order and Runahead Execution Processors. In [230] Changxi Liu, Alen Sabu, Akanksha Chaudhari, Qingxuan Kang, and
TACO, 2005. Trevor E. Carlson. Pac-Sim: Simulation of Multi-threaded Workloads
[205] Nadav Amit, Muli Ben-Yehuda, and Ben-Ami Yassour. IOMMU: strate- using Intelligent, Live Sampling. In TACO, 2023.
gies for mitigating the IOTLB bottleneck. In ISCA, 2010. [231] Harish Patil, Alexander Isaev, Wim Heirman, Alen Sabu, Ali Hajiabadi,
[206] Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem. In- and Trevor E. Carlson. ELFies: Executable Region Checkpoints for
terplay Between Hardware Prefetcher and Page Eviction Policy in Performance Analysis and Simulation. In CGO, 2021.
Cpu-Gpu Unified Virtual Memory. In ISCA, 2019. [232] Alen Sabu, Harish Patil, Wim Heirman, and Trevor E. Carlson. Loop-
[207] Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Xi- Point: Checkpoint-driven Sampled Simulation for Multi-threaded
ang Gong, Shane Treadway, Yuhui Bao, Spencer Hance, Carter Mc- Applications. In HPCA, 2022.
Cardwell, Vincent Zhao, Harrison Barclay, Amir Kavyan Ziabari, [233] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift.
Zhongliang Chen, Rafael Ubal, José L. Abellán, John Kim, Ajay Joshi, BadgerTrap: A Tool to Instrument x86-64 TLB Misses. In SIGARCH
and David Kaeli. MGPUSim: Enabling Multi-GPU Performance Mod- Comput. Archit. News, 2014.
eling and Optimization. In ISCA, 2019. [234] Mohammad Agbarya, Idan Yaniv, Jayneel Gandhi, and Dan Tsafrir.
[208] NVIDIA Linux Open GPU Kernel Module. [Link] Predicting Execution Times with Partial Simulations in Virtual Mem-
NVIDIA/open-gpu-kernel-modules. ory Research: Why and How. In MICRO, 2020.
[209] Ayaz Akram and Lina Sawalha. x86 Computer Architecture Simula- [235] J. Wawrzynek, M. Oskin, C. Kozyrakis, D. Chiou, D. A. Patterson,
tors: A Comparative Study. In ICCD, 2016. and S.-L. Lu. Ramp: A research accelerator for multiple processors.
[210] John W. C. Fu, Janak H. Patel, and Bob L. Janssens. Stride Directed In EECS Department, University of California, Berkeley, Tech. Rep.
Prefetching in Scalar Processors. In MICRO, 1992. UCB/EECS-2006-158, 2006.
[211] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, and Joel Emer. [236] D. Chiou, D. Sunwoo, J. Kim, N. A. Patil, W. Reinhart, and D. E.
High Performance Cache Replacement Using Re-Reference Interval Johnson. FPGA-Accelerated Simulation Technologies (FAST): Fast,
Prediction (RRIP). In ISCA, 2010. Full-System, Cycle-Accurate Simulators. In MICRO, 2007.
[212] Tien-Fu Chen and Jean-Loup Baer. Effective Hardware-based Data [237] M. Pellauer, M. Adler, M. Kinsy, A. Parashar, and J. Emer. Hasim:
Prefetching for High-performance Processors. In TC, 1995. FPGA-Based High-Detail Multicore Simulation Using Time-Division
[213] The linux kernel 5.15.0. [Link] Multiplexing. In HPCA, 2011.
[214] Intel. 5-Level Paging and 5-Level EPT, 2017. [238] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, K. Mai, and
[215] Google. CITY Hash. [Link] B. Falsafi. ProtoFlex: Towards Scalable, Full-System Multiprocessor
[216] May Cathy, Silha Ed, Simpson Rick, and Warren Hank. The PowerPC Simulations Using FPGAs. In ACM TRTS, 2009.
Architecture: A Specification for a New Family of RISC Processors. [239] Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, and D. Patterson.
1994. RAMP Gold: An FPGA-based Architecture Simulator for Multipro-
[217] [Link]. [Link] cessors. In DAC, 2010.
[218] Yuanyuan Wang, Xia Xie, Qiong He, Hongen Liao, Huabin Zhang, [240] Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin,
and Jianwen Luo. Hadamard-Encoded Synthetic Transmit Aper- Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin
ture Imaging for Improved Lateral Motion Estimation in Ultrasound Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic,
Elastography. In TUFFC, 2022. Randy Katz, Jonathan Bachrach, and Krste Asanović. FireSim: FPGA-
[219] Steven J. Plimpton, Ron Brightwell, Courtenay Vaughan, Keith Under- accelerated Cycle-exact Scale-out System Simulation in the Public
wood, and Mike Davis. A Simple Synchronous Distributed-Memory Cloud. In ISCA, 2018.
Algorithm for the HPCC RandomAccess Benchmark. In Cluster, 2006. [241] Nitin Agrawal, Leo Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H.
[220] F. Bureau, J. Robin, and A. Le Ber. Three-Dimensional Ultrasound Arpaci-Dusseau. Emulating Goliath Storage Systems with David. In
Matrix Imaging. In Nature Communications, 2023. ACM Trans. Storage, 2012.
[221] ftrace and Function Tracer. [Link] [242] Mark Mansi and Michael M. Swift. 0sim: Preparing System Software
trace/[Link]. for a World with Terabyte-scale Memories. In ASPLOS, 2020.
[222] Cosine Similarity. [Link] [243] Yang Wang, Manos Kapritsos, Lara Schmidt, Lorenzo Alvisi, and
[223] Konstantinos Kanellopoulos, Konstantinos Sgouras, F. Nisa Bostanci, Mike Dahlin. Exalt: Empowering Researchers to Evaluate Large-
Andreas Kosmas Kakolyris, Berkin K. Konar, Rahul Bera, Mohammad Scale Storage Systems. In NSDI, 2014.

21
[244] Chang Hyun Park, Sanghoon Cha, Bokyeong Kim, Youngjin Kwon, [258] Ashley Saulsbury, Fredrik Dahlgren, and Per Stenström. Recency-
David Black-Schaffer, and Jaehyuk Huh. Perforated Page: Supporting based TLB Preloading. In ISCA, 2000.
Fragmented Memory Allocation for Large Pages. In ISCA, 2020. [259] Chandrashis Mazumdar, Prachatos Mitra, and Arkaprava Basu. Dead
[245] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, Page and Dead Block Predictors: Cleaning TLBs and Caches Together.
and Emmett Witchel. Coordinated and Efficient Huge Page Manage- In HPCA, 2021.
ment with Ingens. In OSDI, 2016. [260] Samira Mirbagher-Ajorpaz, Elba Garza, Gilles Pokam, and Daniel A.
[246] Madhusudhan Talluri, Shing Kong, Mark D. Hill, and David A. Pat- Jiménez. CHiRP: Control-Flow History Reuse Prediction. In MICRO,
terson. Tradeoffs in Supporting Two Page Sizes. In ISCA, 1992. 2020.
[247] Ashish Panwar, Aravinda Prasad, and K Gopinath. Making Huge [261] Jagadish B. Kotra, Michael LeBeane, Mahmut T. Kandemir, and
Pages Actually Useful. In ASPLOS, 2018. Gabriel H. Loh. Increasing GPU Translation Reach by Leveraging
[248] Ashish Panwar, Sorav Bansal, and K Gopinath. Hawkeye: Efficient Under-Utilized On-Chip Resources. In MICRO, 2021.
Fine-grained OS Support for Huge Pages. In ASPLOS, 2019. [262] Abhishek Bhattacharjee. Large-Reach Memory Management Unit
[249] Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Caches. In MICRO, 2013.
Ghose, Jayneel Gandhi, Christopher J. Rossbach, and Onur Mutlu. [263] Albert Esteve, Maria Engracia Gómez, and Antonio Robles. Exploiting
Mosaic: A GPU Memory Manager with Application-Transparent Parallelization On Address Translation: Shared Page Walk Cache. In
Support for Multiple Page Sizes. In MICRO, 2017. OMHI, 2014.
[250] Zhen Fang, Lixin Zhang, J.B. Carter, W.C. Hsieh, and S.A. McKee. [264] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift.
Reevaluating Online Superpage Promotion with Hardware Support. Efficient Memory Virtualization: Reducing Dimensionality of Nested
In HPCA, 2001. Page Walks. In MICRO, 2014.
[251] Mark Swanson, Leigh Stoller, and John Carter. Increasing TLB Reach [265] Binh Pham, Jan Vesely, Gabriel H Loh, and Abhishek Bhattachar-
Using Superpages Backed By Shadow Memory. In ISCA, 1998. jee. Using TLB Speculation to Overcome Page Splintering in Virtual
[252] Yu Du, Miao Zhou, Bruce R Childers, Daniel Mossé, and Rami Melhem. Machines. Technical report, 2015.
Supporting Superpages in Non-Contiguous Physical Memory. In [266] Ravi Bhargava, Benjamin Serebrin, Francesco Spadini, and Srilatha
HPCA, 2015. Manne. Accelerating Two-Dimensional Page Walks for Virtualized
[253] Madhusudhan Talluri and Mark D. Hill. Surpassing the TLB Per- Systems. In ASPLOS, 2008.
formance of Superpages with Less Operating System Support. In [267] Zi Yan, Ján Veselỳ, Guilherme Cox, and Abhishek Bhattacharjee.
ASPLOS, 1994. Hardware Translation Coherence for Virtualized Systems. In ISCA,
[254] Mel Gorman and Patrick Healy. Supporting Superpage Allocation 2017.
Without Additional Hardware Support. In ISMM, 2008. [268] Dimitrios Skarlatos, Umur Darbaz, Bhargava Gopireddy, Nam Sung
[255] Narayanan Ganapathy and Curt Schimmel. General Purpose Operat- Kim, and Josep Torrellas. BabelFish: Fusing Address Translations for
ing System Support for Multiple Page Sizes. In ATC, 1998. Containers. In ISCA, 2020.
[256] Georgios Vavouliotis, Lluc Alvarez, Boris Grot, Daniel Jiménez, and [269] Artemiy Margaritov, Dmitrii Ustiugov, Amna Shahab, and Boris Grot.
Marc Casas. Morrigan: A Composite Instruction TLB Prefetcher. In PTEMagnet: FIne-graIned Physical Memory Reservation for Faster
MICRO, 2021. Page Walks in Public Clouds. In ASPLOS, 2021.
[257] Gokul B Kandiraju and Anand Sivasubramaniam. Going the Distance [270] Ashish Panwar, Reto Achermann, Arkaprava Basu, Abhishek Bhat-
for TLB Prefetching: An Application-driven Study. In ISCA, 2002. tacharjee, K Gopinath, and Jayneel Gandhi. Fast Local Page-tables
for Virtualized Numa Servers with vmitosis. In ASPLOS, 2021.

22

Common questions

Traditional simulation tools face several challenges: inaccurate modeling of software components, low simulation speed, high resource consumption, and the significant development effort that full-system simulators require. Emulation-based simulators are typically faster but sacrifice accuracy by using fixed latencies, an approach that does not suit VM research involving hardware/OS co-design. Virtuoso addresses these challenges by balancing speed and accuracy through its lightweight userspace kernel approach, which focuses on the relevant OS functionalities. It reduces complexity and resource demands while maintaining high simulation accuracy.

Virtuoso offers several advantages over traditional simulators. It combines the speed of emulation-based simulators with the accuracy of full-system simulators by using a lightweight userspace kernel. This design allows Virtuoso to simulate only the necessary OS routines, speeding up the simulation while still providing high accuracy in the evaluation of VM components. Additionally, Virtuoso requires low development effort compared to full-system simulators, making it easier for researchers to develop and test new VM designs across the software/hardware boundary.

MimicOS facilitates the development and testing of new OS routines by providing a high-level programming interface that does not require expertise in kernel development, so researchers can write new routines more easily and quickly. Furthermore, because MimicOS can isolate and simulate only the necessary OS functions, it can efficiently evaluate the impact of new routines in conjunction with the Virtuoso framework, improving both simulation speed and ease of development.

Virtuoso improves simulation accuracy and speed by employing a lightweight userspace kernel, MimicOS, that imitates only essential OS functionality. This lets it focus on the relevant parts of the virtual memory subsystem, reducing unnecessary complexity and speeding up simulation. A functional channel handles events, and an instruction stream channel dynamically injects instructions, which enables precise modeling of OS routine overheads without slowing down the simulation significantly.

Writing OS routines in a high-level language such as C++ makes the framework more accessible and easier to extend. Researchers can prototype and execute new OS routines rapidly without extensive kernel expertise. This increases flexibility and reduces development time and complexity, enabling more frequent testing and iteration of VM schemes. It also makes it easier to isolate the desired kernel functionality, which improves simulation speed and accuracy.

Virtuoso enables flexible evaluation of application- and system-level VM implications by integrating with various architectural simulators. This integration lets Virtuoso adapt its simulation focus to the VM component under test, whether hardware or software. The high-level programmable interface and MimicOS's ability to simulate the desired functionality independently enable comprehensive evaluations across different system configurations.

Virtuoso evaluates the performance implications of virtual memory schemes by dynamically instrumenting and injecting MimicOS binaries into the architectural simulator's processor performance model. This allows the simulator to estimate performance impacts of OS routines on applications accurately. For example, when a page fault occurs, Virtuoso uses MimicOS to handle the fault and returns the result to the simulator, providing an estimate of the overheads introduced by the OS routine.
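
The page-fault flow described above can be illustrated with a minimal C++ sketch. It is not Virtuoso's or MimicOS's actual code: every name here (`ToyKernel`, `ToySimulator`, `FaultRequest`, `FaultReply`) is hypothetical, the 4 KiB page size and the handler length of 64 instructions are made-up assumptions, and the binary instrumentation step is replaced by a plain vector of placeholder instruction ids. The sketch only shows the division of labor: the simulator forwards a fault to a userspace imitation kernel, receives the functional result (a frame number), and charges the handler's instructions to its own counters.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// All types below are hypothetical illustrations, not a real Virtuoso API.
constexpr std::uint64_t kPageShift = 12;  // assumption: 4 KiB pages

// Event sent from the simulator to the imitation kernel.
struct FaultRequest {
    std::uint64_t vaddr;
};

// Functional result plus the instructions the handler "executed".
struct FaultReply {
    std::uint64_t pfn;
    std::vector<std::uint32_t> handler_trace;  // placeholder instruction ids
};

// Userspace stand-in for the OS: it only knows how to map a faulting page.
class ToyKernel {
public:
    FaultReply handle_page_fault(const FaultRequest& req) {
        const std::uint64_t vpn = req.vaddr >> kPageShift;
        // In this sketch the simulator never reports the same page twice.
        assert(page_table_.count(vpn) == 0);
        FaultReply reply{};
        reply.pfn = next_free_pfn_++;       // trivial bump allocator
        page_table_[vpn] = reply.pfn;
        reply.handler_trace.assign(64, 0);  // made-up handler length
        return reply;
    }

private:
    std::unordered_map<std::uint64_t, std::uint64_t> page_table_;
    std::uint64_t next_free_pfn_ = 1;
};

// Stand-in for the architectural simulator's memory path.
class ToySimulator {
public:
    explicit ToySimulator(ToyKernel& kernel) : kernel_(kernel) {}

    // Translates vaddr; an unmapped page is treated as a page fault.
    std::uint64_t access(std::uint64_t vaddr) {
        const std::uint64_t vpn = vaddr >> kPageShift;
        auto it = mappings_.find(vpn);
        if (it == mappings_.end()) {
            FaultReply reply = kernel_.handle_page_fault({vaddr});
            // Charge the handler's instructions to the simulated core.
            os_instructions_ += reply.handler_trace.size();
            it = mappings_.emplace(vpn, reply.pfn).first;
        }
        const std::uint64_t offset = vaddr & ((1ull << kPageShift) - 1);
        return (it->second << kPageShift) | offset;
    }

    std::uint64_t os_instructions() const { return os_instructions_; }

private:
    ToyKernel& kernel_;
    std::unordered_map<std::uint64_t, std::uint64_t> mappings_;
    std::uint64_t os_instructions_ = 0;
};
```

Constructing a `ToyKernel`, wrapping it in a `ToySimulator`, and calling `access` twice on the same page triggers the handler only on the first call, so `os_instructions()` grows only when a fault is actually serviced. A real performance model would replay the injected instructions through the pipeline rather than simply counting them.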

The evaluation of current VM techniques using Virtuoso provides researchers with insights into the performance, accuracy, and development implications of various VM schemes. Virtuoso's implementation reveals strengths and weaknesses of different virtual memory strategies by simulating real-world scenarios and workloads. Researchers can identify areas where improvements in efficiency or design alterations are needed and gain an understanding of the interplay between VM components and system architecture.

Virtuoso supports the development of new VM schemes across different architectural simulators by providing a unified, flexible framework that can be easily integrated with these simulators. It uses a lightweight userspace kernel and a high-level programming approach to facilitate the rapid prototyping and testing of VM protocols and designs. This versatility allows researchers to evaluate the impact of various VM innovations consistently across a wide range of systems and configurations.

Virtuoso’s open-source availability is significant for the future of virtual memory research because it broadens access to advanced simulation tools and fosters collaboration across the research community. Open access allows researchers worldwide to use and extend the framework, which can accelerate progress in VM techniques. Transparency in its methodology also encourages validation and improvement, and building on a shared platform increases the scale and breadth of research.