GROUP 19 COE 453 - Distributed Computing
Distributed Systems Simulation and Modeling
Presented by Group 19

Table of Contents
01 Introduction to Distributed Systems Simulation and Modeling
02 Simulation Models for Distributed Systems
03 Distributed System Workload Generation and Modeling
04 Performance Metrics and Evaluation Techniques in Distributed Systems
05 Monte Carlo Simulation for Distributed Systems
06 Distributed System Profiling and Tracing
07 Simulators for Cloud Computing and Data Centers
08 Trace Analysis and Visualization Tools
09 Simulating IoT and Edge Computing Environments
10 Distributed System Failure and Anomaly Simulation
11 Validation and Verification of Simulation Models
12 Conclusion & Future Trends in Simulation

Introduction to Distributed Systems Simulation and Modeling

Distributed systems simulation and modeling is the use of computational methods to analyze, test, and evaluate distributed systems without physically implementing them.

Importance
• Predict system performance under different conditions.
• Assist in resource allocation and fault tolerance testing.
• Reduce deployment risks and costs.

Key Areas Covered
• Simulation models
• Workload generation & performance metrics
• Statistical techniques (Monte Carlo, profiling)
• Trace analysis & visualization
• Fault and anomaly simulation
• Specialized simulators (Cloud, IoT, Edge)
• Validation & future trends

Simulation Models for Distributed Systems

Components
• Entities
• Events
• Simulation Clock
• Simulation Engine

Types of Models
• Discrete Event Simulation
• Agent-Based Simulation
• Network Simulation
• Parallel and Distributed Simulation
Other model types include continuous, Monte Carlo, and stochastic simulation.

Purpose
• Performance Evaluation
• Fault Tolerance Testing
• Capacity Planning

Software for Simulation Modeling
• NS-3 (Network Simulator 3)
• OMNeT++
• MATLAB/Simulink
• AnyLogic
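
To make the components above concrete, here is a minimal sketch of a discrete event simulation written in plain Python rather than in any of the listed tools. It assumes a single server with a FIFO queue, exponential inter-arrival and service times, and a hypothetical function name `simulate_single_server`; none of these choices come from the slides. Requests are the entities, arrivals and departures are the events, `clock` is the simulation clock, and the loop over the priority queue is the simulation engine.

```python
import heapq
import itertools
import random
from collections import deque


def simulate_single_server(arrival_rate, service_rate, num_requests, seed=0):
    """Return the mean time a request spends in the system (waiting + service)."""
    rng = random.Random(seed)
    counter = itertools.count()   # tie-breaker so equal timestamps never compare further
    events = []                   # future event list: (time, tie, kind, request_id)
    arrival_time = {}

    # Schedule every arrival up front (assumed Poisson arrivals).
    t = 0.0
    for request_id in range(num_requests):
        t += rng.expovariate(arrival_rate)
        arrival_time[request_id] = t
        heapq.heappush(events, (t, next(counter), "arrival", request_id))

    clock = 0.0                   # simulation clock
    waiting = deque()             # requests queued behind the busy server
    server_busy = False
    total_time_in_system = 0.0

    # Simulation engine: always jump the clock to the earliest pending event.
    while events:
        clock, _, kind, request_id = heapq.heappop(events)
        if kind == "arrival":
            if server_busy:
                waiting.append(request_id)
            else:
                server_busy = True
                finish = clock + rng.expovariate(service_rate)
                heapq.heappush(events, (finish, next(counter), "departure", request_id))
        else:  # departure
            total_time_in_system += clock - arrival_time[request_id]
            if waiting:
                next_id = waiting.popleft()
                finish = clock + rng.expovariate(service_rate)
                heapq.heappush(events, (finish, next(counter), "departure", next_id))
            else:
                server_busy = False

    return total_time_in_system / num_requests


if __name__ == "__main__":
    print(simulate_single_server(arrival_rate=1.0, service_rate=2.0, num_requests=20000))
```

With an arrival rate of 1 and a service rate of 2, queueing theory for this kind of single-server model gives a mean time in system of 1/(2 - 1) = 1.0, so the simulated value should land near that.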

Workload Generation in Distributed Systems

Workload Generation refers to creating realistic workloads to mimic user interactions and system usage.

Types
• Interactive
• Batch
• IoT
• Cloud

Purpose
• Performance testing
• Bottleneck detection
• Scalability
• Fault tolerance
• Optimization

Techniques
• Replay-based
• Synthetic
• Benchmarking tools (e.g., JMeter, Locust)

Workload Modeling in Distributed Systems

Workload Modeling refers to mathematical models predicting workload behaviors.

Models Used
• Poisson processes
• Markov chains
• Gaussian distributions

Characteristics
• Request arrival rate
• Service time
• Concurrency
• Resource usage
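
As an illustration of the Poisson model listed above, the following sketch generates a synthetic workload as a list of request arrival timestamps. It assumes arrivals form a Poisson process, so the gaps between requests are exponentially distributed. The function name `poisson_arrivals` and its parameters are hypothetical.

```python
import random


def poisson_arrivals(rate_per_sec, duration_sec, seed=0):
    """Return sorted arrival timestamps (seconds) for a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    while True:
        # Inter-arrival gaps of a Poisson process are exponential with mean 1/rate.
        t += rng.expovariate(rate_per_sec)
        if t >= duration_sec:
            return times
        times.append(t)


if __name__ == "__main__":
    arrivals = poisson_arrivals(rate_per_sec=50, duration_sec=100)
    print(len(arrivals), "requests; observed rate:", len(arrivals) / 100)
```

The observed rate should be close to the configured request arrival rate. The timestamps could then drive a load generator or a simulator in place of a recorded trace.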

Performance Metrics and Evaluation Techniques

Why Evaluate?
- Ensure systems handle real-world conditions.

Key Metrics to Evaluate Performance:
• Throughput: Number of tasks completed per unit time.
• Latency: Time taken to complete a request.
• Scalability: Ability to handle increasing workloads.
• Reliability: How often the system remains operational.
• Energy Efficiency: Power consumption vs. performance.

Evaluation Techniques
1. Analytical Models:
- Mathematical tools (e.g., queuing theory).
- Early predictions but may lack real-world details.
2. Experimental Testing:
- Real condition deployment (e.g., AWS testing).
- Accurate but costly and time-consuming.
3. Simulation-Based Evaluation:
- Safe, flexible virtual testing.
- Cost-effective but depends on model quality.
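
The throughput and latency definitions above can be computed directly from a request log, whether it comes from a simulation or from an experiment. The sketch below assumes a hypothetical log format of `(start_time, end_time, succeeded)` tuples in seconds. It uses the success rate as a rough stand-in for reliability, which is a simplification.

```python
import math


def summarize(requests):
    """Summarize a list of (start_time, end_time, succeeded) tuples."""
    latencies = sorted(end - start for start, end, ok in requests if ok)
    completed = len(latencies)
    # Observation window: first start to last end across all requests.
    window = max(end for _, end, _ in requests) - min(start for start, _, _ in requests)
    # Nearest-rank 95th percentile of successful request latencies.
    p95_index = max(0, math.ceil(0.95 * completed) - 1)
    return {
        "throughput": completed / window,          # completed requests per second
        "mean_latency": sum(latencies) / completed,
        "p95_latency": latencies[p95_index],
        "success_rate": completed / len(requests),
    }


if __name__ == "__main__":
    log = [(0.0, 0.5, True), (1.0, 1.25, True), (2.0, 2.75, True), (3.0, 4.0, False)]
    print(summarize(log))
```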

Monte Carlo Simulation for Distributed Systems

Monte Carlo simulation is a computational method that uses random sampling to estimate results for problems that are too complex to solve exactly. It is widely used in computer science, networking, AI, and finance.

It works well in distributed computing because:
• The problem can be broken into smaller, independent tasks.
• Each task runs on a separate computer.
• More computers produce more samples in the same time, which gives faster and more accurate estimates.

Why Monte Carlo in Distributed Systems?
Imagine you are a cybersecurity analyst trying to estimate how long it takes before a hacker breaks into your system. There are too many factors to calculate exactly:
• Strength of passwords
• Computing power of the hacker's machine
• Number of login attempts per second

Monte Carlo Simulation for Distributed Systems (Cont’d)

Instead of solving it with a formula, we simulate millions of attacks using random values:
• Some attacks succeed early.
• Some take a long time.
We then estimate the average time to breach from these simulated attacks. Distributed computing makes this fast by running many attack simulations on multiple machines in parallel.

Example: Monte Carlo for Password Cracking
Let’s say we want to estimate how long it takes to brute force an 8-character password. A hacker tries random passwords until they get the correct one. The time depends on:
• Password complexity (e.g., numbers, letters, symbols)
• How fast the hacker’s computer guesses passwords

Monte Carlo Simulation for Distributed Systems (Cont’d)

Instead of solving a complicated math formula, we run a Monte Carlo simulation:
1. Randomly generate millions of fake hacking attempts.
2. Measure how long it takes to crack the password in each case.
3. Compute the average time across all simulations.

Formula:
\[
\text{Average time to crack} = \frac{\text{total time taken across all simulations}}{\text{total number of simulations}}
\]

Example Simulation:
Simulation # | Password Length | Speed (attempts/sec) | Time to Crack
1 | 8 chars | 1M attempts/sec | 5 hours
2 | 8 chars | 5M attempts/sec | 1 hour
3 | 8 chars | 10M attempts/sec | 30 mins

Monte Carlo lets us estimate how long real-world attacks take!
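
The three steps and the averaging formula can be sketched in a few lines. This is an illustrative model, not the one behind the table above: it assumes the attacker tries each candidate once in random order, so the number of guesses needed is uniformly distributed over the keyspace. The character set size (26 lowercase letters) and the function names are assumptions, and the output will not match the illustrative table values.

```python
import random


def simulate_crack_times(charset_size, length, attempts_per_sec, num_simulations, seed=0):
    """Return one simulated time-to-crack (seconds) per simulated attack."""
    rng = random.Random(seed)
    keyspace = charset_size ** length
    times = []
    for _ in range(num_simulations):
        # Assumption: the correct password is equally likely at any position
        # in the attacker's guessing order.
        guesses_needed = rng.randint(1, keyspace)
        times.append(guesses_needed / attempts_per_sec)
    return times


def average_time_to_crack(times):
    """Total time across all simulations divided by the number of simulations."""
    return sum(times) / len(times)


if __name__ == "__main__":
    times = simulate_crack_times(charset_size=26, length=8,
                                 attempts_per_sec=1_000_000, num_simulations=20000)
    print("average hours to crack:", average_time_to_crack(times) / 3600)
```

Under this simple model the exact answer is known (about half the keyspace divided by the guessing speed), which makes it a convenient sanity check before moving to models with no closed-form answer.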

Monte Carlo Simulation for Distributed Systems (Cont’d)

4. How Distributed Computing Helps
Instead of running this simulation on one computer, we can:
1. Split the work into multiple independent tasks.
2. Assign each task to a different computer.
3. Have each machine simulate random hacking attempts.
4. Combine the results at the end to estimate the final breach time.

Example
• Single machine: runs all 1 million simulated attempts on its own, which takes too long.
• 10 distributed machines: each runs 100,000 attempts in parallel, up to 10x faster.

5. Monte Carlo in Other Computer Applications
Monte Carlo is widely used in distributed computing systems, including:
• Network Security: simulating cyberattacks to test defense systems.
• Load Balancing: predicting web server loads by simulating user traffic.
• AI & Machine Learning: simulating different training scenarios.
• Cloud Computing: optimizing resource allocation in data centers.

6. Conclusion
• Monte Carlo is a powerful tool for solving complex problems using random sampling.
• Distributed computing makes it even better by running simulations in parallel.
• More computers mean faster results, and the extra samples they produce improve accuracy.
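
The split, assign, simulate, and combine steps can be sketched with Python's standard `concurrent.futures` module. In this sketch each worker thread stands in for a separate machine. Because of CPython's global interpreter lock, threads will not actually speed up this CPU-bound loop, so a real deployment would use processes or separate hosts. The function names and the uniform-guess model are assumptions for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_batch(task):
    """One independent task: simulate a batch of attacks, return (total_time, count)."""
    seed, num_simulations, keyspace, attempts_per_sec = task
    rng = random.Random(seed)          # separate seed per worker keeps batches independent
    total = 0.0
    for _ in range(num_simulations):
        total += rng.randint(1, keyspace) / attempts_per_sec
    return total, num_simulations


def distributed_average(num_workers, sims_per_worker, keyspace, attempts_per_sec):
    # 1. Split the work into independent tasks, one per worker.
    tasks = [(seed, sims_per_worker, keyspace, attempts_per_sec)
             for seed in range(num_workers)]
    # 2-3. Assign each task to a worker and let it simulate.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(run_batch, tasks))
    # 4. Combine partial sums into one overall average.
    total_time = sum(t for t, _ in partials)
    total_sims = sum(n for _, n in partials)
    return total_time / total_sims


if __name__ == "__main__":
    print(distributed_average(num_workers=10, sims_per_worker=5000,
                              keyspace=10 ** 8, attempts_per_sec=1_000_000))
```

Each worker returns a partial sum and a count rather than an average, so batches of different sizes can still be combined correctly.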

Distributed System Profiling and Tracing

Profiling measures where a system spends its resources (for example CPU time and memory), while tracing records events as they occur.

Why It’s Important
• Helps in debugging performance bottlenecks.
• Informs resource allocation and load balancing decisions.

Tools Used
• Perf, eBPF (for Linux systems)
• Google’s Dapper (used for tracing in large-scale distributed systems)

Distributed System Profiling and Tracing (Cont’d)

Feature | Perf (Linux Performance Tools) | eBPF (Extended Berkeley Packet Filter) | Dapper (Google's Distributed Tracing)
Purpose | CPU & system profiling | Kernel & user-space tracing | Distributed request tracing
Scope | Process & thread-level | System-wide observability | End-to-end tracing across services
Overhead | Moderate (sampling-based) | Low (event-driven) | Low to moderate
Instrumentation | Manual (command-line) | Dynamic (programmable in kernel) | Requires app-level integration
Best For | Performance tuning | Security, monitoring, and observability | Microservices tracing

Trace Analysis and Visualization Tools

Definition:
Distributed tracing is a technique used to track and profile the execution of requests as they travel across multiple services in a distributed architecture. It provides a detailed view of the path of a request through various microservices, which in turn allows developers and operations teams to pinpoint performance bottlenecks, latency, and errors across a system.

Types of Distributed Tracing
• Code tracing: inspecting the flow of source code in an application when it performs a specific function.
• Program tracing: developers examine the addresses of instructions and variables called by an active application.
• End-to-end tracing (main focus): developers track data along the service request path.

Trace Analysis and Visualization Tools (Cont’d)

How it works
• When a request is initiated, data is collected and a unique trace ID and an initial (parent) span ID are created.
  - A trace is the entire execution path of a request.
  - A span is a single unit of work along that path.
• When the request enters a service, a top-level child span is created.
• If multiple operations are performed within the same service, the top-level child span becomes the parent of multiple child spans underneath it.
• The platform encodes each child span with the original trace ID, a unique span ID, duration, error data, and other relevant metadata.
• All spans are then visualized in a flame graph with the parent span on top and child spans beneath.

Importance
• End-to-end visibility
• Performance monitoring
• Error diagnosis
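
The trace and span relationships described above can be modeled with a small data structure. This is a hypothetical sketch, not the data model of any specific tracing tool. The class name `Span`, its fields, and the text rendering are assumptions chosen to mirror the description: every span carries the original trace ID, its own span ID, a duration, and optional error data, and parents sit above their children.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                 # shared by every span in the same request
    span_id: str                  # unique per unit of work
    name: str
    start_ms: float
    duration_ms: float
    parent_id: Optional[str] = None
    error: Optional[str] = None
    children: list = field(default_factory=list)

    def child(self, name, start_ms, duration_ms, error=None):
        """Create a child span that inherits this span's trace ID."""
        span = Span(self.trace_id, uuid.uuid4().hex[:16], name,
                    start_ms, duration_ms, parent_id=self.span_id, error=error)
        self.children.append(span)
        return span


def start_trace(name, start_ms, duration_ms):
    """Create the root (parent) span with a fresh trace ID."""
    return Span(uuid.uuid4().hex, uuid.uuid4().hex[:16], name, start_ms, duration_ms)


def render(span, depth=0):
    """Text stand-in for a flame graph: parent on top, children indented beneath."""
    marker = " [error]" if span.error else ""
    lines = ["  " * depth + f"{span.name} ({span.duration_ms} ms){marker}"]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


if __name__ == "__main__":
    root = start_trace("GET /checkout", 0, 120)
    service = root.child("orders-service", 5, 100)
    service.child("db.query", 10, 40)
    service.child("cache.get", 55, 5, error="timeout")
    print("\n".join(render(root)))
```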

Trace Analysis and Visualization Tools (Cont’d)

Tools
Distributed tracing tools support three phases of request tracing:
• Instrumentation: modifying code so requests can be recorded as they pass through your stack.
• Data collection: collecting span data for each request.
• Analysis and visualization: encoding and tagging the spans for analysis and displaying them as flame graphs.

Examples of Tools
• OpenTelemetry: an industry-standard open-source platform for data instrumentation and collection. Offers vendor-neutral auto-instrumentation libraries and APIs that allow you to trace the end-to-end pathways and duration of requests.
• Jaeger: an open-source tool with a UI that visualizes distributed traces. It relies on sampling, so some problems may go unrecorded.
• Datadog: offers complete Application Performance Monitoring and distributed tracing for organizations operating at any scale.

Simulators for Cloud Computing and Data Centers

Definition:
Cloud simulators create a virtual environment that replicates cloud computing and data center operations. They allow researchers and engineers to experiment with different configurations, workloads, and resource management strategies without the high costs and risks associated with real-world testing.

Why Use Cloud Simulators?
• Cost-Effective Testing
• Time Efficiency
• Resource Optimization
• Risk Mitigation

Key Features of Cloud Simulators
• Resource Allocation Modeling
• Workload and Traffic Simulation
• Cost Prediction
• Energy Consumption Analysis
• Scalability Testing

How Cloud Simulators Work Together
• Scenario Testing
• Comparative Analysis
• Optimization

Simulators for Cloud Computing and Data Centers (Cont’d)

Popular Cloud Simulators
Simulator | Purpose | Key Features | Common Use Cases
CloudSim | Simulates cloud resource allocation and VM migrations | Models resource provisioning, scheduling, and migration | Academic research, performance evaluation
iCanCloud | Predicts cloud service costs for different configurations | Cost modeling, large scale cloud infrastructure simulation | Cost estimation, infrastructure planning
GreenCloud | Focuses on energy efficient cloud computing | Energy consumption modeling, network simulation | Analyzing energy efficiency, designing green data centers
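
To give a feel for what resource allocation modeling means in such simulators, here is a toy sketch of a first-fit VM placement policy. It does not use the API of CloudSim or any other listed tool. The function name `first_fit`, the host names, and the CPU-only capacity model are assumptions made for illustration.

```python
def first_fit(hosts, vms):
    """Place each VM on the first host with enough free CPUs.

    hosts: dict mapping host name -> CPU capacity
    vms:   list of (vm_name, cpus_required) in arrival order
    Returns (placement, free) where placement maps vm_name -> host or None.
    """
    free = dict(hosts)            # remaining capacity per host
    placement = {}
    for vm_name, cpus in vms:
        placement[vm_name] = None # stays None if no host can fit the VM
        for host, available in free.items():
            if available >= cpus:
                free[host] = available - cpus
                placement[vm_name] = host
                break
    return placement, free


if __name__ == "__main__":
    hosts = {"h1": 8, "h2": 4}
    vms = [("a", 4), ("b", 6), ("c", 4), ("d", 4)]
    placement, free = first_fit(hosts, vms)
    print(placement)
    print(free)
```

Swapping the inner loop for a different policy (for example best-fit) and comparing how many VMs end up unplaced is the kind of comparative analysis a full simulator automates at much larger scale.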

Simulating IoT and Edge Computing Environments

Definition:
Simulation tools help model IoT networks and edge computing scenarios.

Challenges in IoT Simulation
· Large-scale device connectivity
· Real-time data processing constraints

Simulation Tools for IoT & Edge Computing:
· IoTSim: Simulates IoT applications on cloud platforms.
· EdgeCloudSim: Models edge computing architectures.
· NS-3 (Network Simulator 3): Used for simulating IoT network traffic.

Simulating IoT and Edge Computing Environments (Cont’d)

Popular Simulation Tools for IoT & Edge Computing
Tool | Purpose | Key Features | Common Use Cases
IoTSim | Simulates IoT applications on cloud platforms | Models IoT data processing in cloud environments | Research on IoT-cloud integration
EdgeCloudSim | Models edge computing architectures | Simulates edge and cloud resource distribution | Testing edge offloading strategies
NS-3 | Simulates IoT network traffic | Models wireless networks and IoT communications | Studying IoT protocol performance and scalability

Distributed System Failure and Anomaly Simulation

Definition:
This involves intentionally causing failures in a distributed system to test how it responds. The goal is to make the system more robust and resilient to real-world disruptions.

Example: Imagine an online banking system where sudden server crashes could cause customers to lose access to their accounts. If the bank's IT team tests failure scenarios in advance, it can confirm that the system automatically switches to backup servers without affecting users.

Why simulate failures?
1. Enhancing fault tolerance, e.g. a social media platform like Facebook tests its ability to keep running even when one of its data centers goes down.
2. Preparing for real-world disruptions, e.g. e-commerce websites like Amazon simulate Black Friday traffic spikes to ensure their servers don’t crash during high demand.
3. Optimizing system recovery, e.g. cloud storage services like Google Drive can test data recovery by simulating a disk failure and measuring how fast lost files are restored.
4. Improving load balancing, e.g. a video streaming service like Netflix tests its servers to handle millions of users watching movies at the same time.

Distributed System Failure and Anomaly Simulation (Cont’d)

Types of Failures Simulated

1. Network partitions
This occurs when different parts of a distributed system lose communication. Common causes include software bugs (such as misconfigured network settings) and hardware failures (such as a router or switch not working).
Example: If WhatsApp's servers in different countries stop communicating, messages may not get delivered immediately.

2. Hardware crashes
This models failures such as server crashes, disk failures, and memory corruption.
Example: If a bank’s database server crashes, its backup system should automatically take over.

3. Load spikes and system overload
This simulates high traffic conditions and heavy resource consumption. Common causes include high user demand, DDoS attacks, and inefficient resource allocation.
Example: Before launching a new product, an online store may test what happens if 100,000 customers try to buy the same item at once.
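
The hardware-crash scenario, where a backup should take over when a server fails, can be explored with a small fault-injection sketch. This is a toy model under stated assumptions: every replica fails independently with the same probability on each request, and a request is served if any replica in the failover chain responds. The function name `served_fraction` and its parameters are hypothetical.

```python
import random


def served_fraction(num_requests, crash_probability, replicas, seed=0):
    """Return the fraction of requests served when crashes are injected at random."""
    rng = random.Random(seed)
    served = 0
    for _ in range(num_requests):
        # Try the primary first, then fail over to each backup in turn.
        for _ in range(replicas):
            replica_is_up = rng.random() >= crash_probability
            if replica_is_up:
                served += 1
                break
    return served / num_requests


if __name__ == "__main__":
    print("no backup: ", served_fraction(20000, 0.2, replicas=1))
    print("one backup:", served_fraction(20000, 0.2, replicas=2))
```

With a 20% injected failure rate, a single server should serve roughly 80% of requests and a primary plus one backup roughly 96%, which shows how failover changes the observed reliability.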

Distributed System Failure and Anomaly Simulation (Cont’d)

Summary Table
Type of Failure | What Happens? | Real-Life Example | How Systems Handle It
Network Partition | Some parts of the system can't communicate | WhatsApp message delays | Replication, retries, partition-tolerant databases
Hardware Crash | A server, disk, or memory fails | Bank ATM server crash | Backups, failover mechanisms, redundant power
Load Spike | System overload due to high traffic | Amazon Black Friday outage | Auto-scaling, load balancing, caching

Failure Simulation Tools (Cont’d)

Tool | Purpose | Key Features | Common Use Cases
Chaos Monkey | Randomly shuts down services | Microservices & cloud resilience | Shuts down transaction services to test failover mechanisms
Gremlin | Simulates network failures (latency, dropped connections) | Network stability testing | Delays customer transactions to check retry mechanisms
AWS FIS | Simulates AWS outages & resource failures | AWS cloud applications | Tests if the system recovers from an AWS S3 storage failure
LitmusChaos | Simulates container & Kubernetes failures | Kubernetes-based apps | Terminates transaction-processing containers to test auto-scaling

Distributed System Failure and Anomaly Simulation (Cont’d)
Example Use Case of Tools
Let’s consider an online banking system that handles transactions, account management, and fraud
detection. The bank’s infrastructure is cloud-based and runs on AWS and Kubernetes with multiple
microservices.
To test its fault tolerance, the bank can combine Chaos Monkey, Gremlin, AWS FIS, and LitmusChaos in
the following ways:
Step 1: Test Microservices Resilience with Chaos Monkey
Step 2: Simulate Network Failures Using Gremlin
Step 3: Test Cloud Service Failures with AWS FIS
Step 4: Test Kubernetes-Based Services with LitmusChaos

Final Outcome: Banking System Resilience Testing

Failure Scenario | Simulation Tool Used | Expected System Response
Microservices failure | Chaos Monkey | Backup services take over
Network delays | Gremlin | Requests retry automatically
Cloud service outage | AWS FIS | Failover to another AWS region
Kubernetes pod failure | LitmusChaos | Auto-restart failed containers

By combining these tools, the bank can verify that its system handles failures from multiple angles, from random shutdowns to cloud outages, keeping transactions secure and minimizing downtime.

Distributed System Failure and Anomaly Simulation (Cont’d)

Examples of Failure Simulation Tools
Tool | Purpose | Key Features | Common Use Cases
Chaos Monkey | Randomly shuts down services to test resilience | Injects controlled failures in production | Testing system robustness and auto-recovery
[Link] | Simulates cloud service failures | Models outages in cloud-based infrastructures | Evaluating cloud application reliability

Validation and Verification of Simulation Models

Definitions
Verification means checking if the simulation model is built correctly according to its design and specifications.
Validation means checking if the simulation model accurately represents the real-world system it is supposed to simulate.

Why is Validation Important?
• Ensures simulation results match real-world behavior
• Prevents incorrect conclusions from flawed models

Verification and Validation Methods
• Compare simulation output with real system logs
• Perform statistical accuracy testing
• Use benchmarks from existing distributed systems

Tools Used:
• Model checkers (e.g., UPPAAL, SPIN)
• Empirical comparison with real-world data

Validation and Verification of Simulation Models (Cont’d)
How Validation and Verification Work Together
Verification ensures the model is correctly built ("Did we build the model right?").
Validation ensures the model reflects real-world behavior ("Did we build the right model?")
Example Use Case
A cloud resource allocation simulator can be:
Verified by checking if it correctly implements scheduling algorithms.
Validated by running it against AWS cloud service logs and ensuring predicted resource usage matches
real-world observations.
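
The validation step in this use case, comparing predicted resource usage with observed values, can be sketched as a simple error check. The metric (mean absolute percentage error), the 10% tolerance, and the function names are assumptions chosen for illustration. The sample numbers are made up and do not come from any real log.

```python
def mean_absolute_percentage_error(observed, predicted):
    """Average of |observed - predicted| / |observed| over paired samples."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    return sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / len(observed)


def is_valid(observed, predicted, tolerance=0.10):
    """Treat the model as validated if its average error is within the tolerance."""
    return mean_absolute_percentage_error(observed, predicted) <= tolerance


if __name__ == "__main__":
    observed_cpu = [100.0, 200.0, 400.0]    # hypothetical values from real logs
    predicted_cpu = [110.0, 190.0, 400.0]   # hypothetical simulator output
    print(mean_absolute_percentage_error(observed_cpu, predicted_cpu))
    print(is_valid(observed_cpu, predicted_cpu))
```

What counts as an acceptable tolerance depends on how the simulation results will be used, so the threshold is left as a parameter.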

Validation and Verification of Simulation Models (Cont’d)

Tool/Method | Purpose | Key Features | Common Use Cases
Model Checkers (UPPAAL, SPIN) | Verifies correctness of system models | Formal verification, detects logical errors | Verifying distributed algorithms and protocols
Empirical Comparison with Real-World Data | Validates simulation accuracy by comparing with actual logs | Uses historical data for validation | Ensuring simulations reflect real-world performance

Conclusion & Future Trends in Simulation

Summary of Key Takeaways
· Simulation helps optimize, test, and analyze distributed systems before deployment
· Various models and tools exist for cloud computing, IoT, failure handling, and performance optimization

Future Trends:
· AI-driven simulations for predictive analytics
· Improved real-time monitoring and visualization tools
· Integration of digital twins for real-world system testing

Thank You