Production Incidents in GenAI Cloud Services
Abstract—The ever-increasing demand for generative artificial intelligence (GenAI) has motivated cloud-based GenAI services such as Azure OpenAI Service. Like any large-scale cloud service, failures are inevitable in cloud-based GenAI services, resulting in user dissatisfaction and significant monetary losses. However, GenAI cloud services, characterized by their massive parameter scales, hardware demands, and usage patterns, present unique challenges compared to traditional cloud services, including generated content quality issues and privacy concerns. To understand the production reliability of GenAI cloud services, we analyzed production incidents from Microsoft spanning the past four years. Our study (1) presents the general characteristics of GenAI cloud service incidents at different stages of the incident life cycle; (2) identifies the symptoms and impacts of these incidents on GenAI cloud service quality and availability; (3) uncovers why these incidents occurred and how they were resolved; (4) discusses open research challenges in terms of incident detection, triage, and mitigation, and sheds light on potential solutions.

Index Terms—Incident Management, Generative AI, Cloud Service Reliability, Empirical Study

* Haoran Yan and Yinfang Chen contributed equally. Work was performed during their internship at Microsoft.
† Corresponding author.

I. INTRODUCTION

In recent years, there have been significant advancements in generative artificial intelligence (GenAI), particularly in Large Language Models (LLMs) and their applications across various fields. Beyond natural language processing, these models have also shown new capabilities in image recognition [1], [2], data analysis [3], [4], software engineering [5]–[8], and more [9]–[11]. The emergence of models like the GPT-4 family marks a new era, with capabilities extending to complex reasoning [12], [13], creative thinking [14]–[16], and even surpassing human expertise in certain tasks [17]. This innovation has resulted in impactful research findings and practical applications with substantial implications for scientific research and socio-economic development.

The demands of GenAI come with requirements for unprecedented computational resources, including the hardware for operating the models as well as infrastructure systems for efficiently allocating and utilizing such resources [18]. However, both the acquisition of the required resources and their efficient management pose significant challenges to individuals and even enterprises. This motivates the development of GenAI cloud services, which offer a platform where developers and users can create, deploy, and utilize large models without substantial hardware and software investments, e.g., Cloud for AI (Cloud4AI), and also incorporate model APIs within cloud systems, e.g., AI for Cloud (AI4Cloud) [5]–[7], [19]–[21]. Popular GenAI cloud services include Azure OpenAI, Amazon Bedrock, IBM Watson, and Anthropic Claude. GenAI cloud services afford enterprises the infrastructure necessary for the deployment and maintenance of GenAI models, based on which users can further interact, analyze, and fine-tune such models. Moreover, they are crucial in promoting collaboration among researchers by providing shared access to advanced models and computational resources.

As with any large-scale cloud service, GenAI cloud services are not immune to occasional incidents. These events, while often unavoidable due to the complexity and scale of the systems involved, have the potential to impact user experience and, in some cases, result in challenges such as user dissatisfaction or economic implications. For example, OpenAI recently experienced an incident where request failures and high latency severely impacted ChatGPT’s API and functionalities [22]. However, despite the critical importance of reliability in GenAI cloud services, there is a notable lack of research focusing on their reliability and incident management. Therefore, understanding the characteristics of these incidents, including detection, triage, diagnosis, and mitigation, is crucial for enhancing the quality of GenAI cloud services.

Before the era of GenAI cloud services, traditional ML platforms like AzureML, AWS SageMaker, and Google Cloud ML were primarily used for tasks such as training, inference, and model fine-tuning [23]. These services have been well studied for issues like deployment challenges, fault taxonomy, and bug characteristics [24]–[26], while extensive research has similarly examined incident management practices, root causes, and triage procedures in conventional cloud services [6], [27]–[32]. However, GenAI cloud services fundamentally differ from these. Specifically, GenAI services such
as large language models (LLMs) rely on massive parameter scales and high hardware demands, and provide natural language-driven applications like text generation, summarization, and translation, which traditional cloud services do not [33], [34]. These services also allow users to fine-tune models using user-uploaded datasets [35], exposing risks from model-level behavior changes. Moreover, they provide intuitive conversational user interfaces, making them accessible to a broader audience while adding complexity and risks in managing user interactions. Such characteristics create new reliability issues related to model quality, privacy, and performance, layered atop conventional reliability concerns. Therefore, due to the distinctive challenges of GenAI cloud services, it is necessary to investigate GenAI incident patterns, impacts, and mitigation strategies to ensure dependable and reliable GenAI services in the future.

In this study, we examine incidents in the GenAI cloud service of Microsoft, a leader in the GenAI field, known for hosting GPT series models. Microsoft’s Incident Management system (IcM) documents a wide range of incident data, including root causes, mitigation steps, and detailed engineer discussions, enabling a comprehensive and comparative analysis of GenAI cloud service incidents alongside conventional cloud services. Our investigation reveals that while some traditional reliability challenges, such as system downtime or latency issues, remain relevant in GenAI services, new and unique challenges have emerged. For example, incidents like response quality degradation show that models can unexpectedly produce low-quality or even inappropriate output from simple prompts. We term these incidents GenAI incidents.

Our study leads to crucial findings. For instance, we find that (1) GenAI incidents manifest as performance degradation (49.8%), deployment failure (35.7%), and invalid inference (14.5%), significantly impacting both service reliability and user satisfaction; (2) GenAI cloud services have a higher share of incidents detected by humans rather than automated monitors (38.3%) than other services (13.7%), and a higher false alarm rate (11.0%) than other services (3.8%); (3) because many GenAI incidents are human-reported, they often need to be reassigned to different teams, and GenAI incidents need more time to mitigate (1.12 time units on average) than those in other services (0.65 time units on average); (4) during mitigation, a specific root cause is not tied to a single type of fix. For example, while code bugs account for 21.5% of the GenAI incidents, only 7.6% of fixes are code changes, with other strategies being employed. Given the tight deadlines for on-call engineers, quick approaches like rollback are prioritized to reduce downtime.

In summary, this paper makes the following main contributions:

• We make the first attempt to unravel the general behavior of incidents occurring in GenAI cloud services by collecting and analyzing a large number of GenAI-related incidents from Microsoft.¹
• We identify not only the symptoms and impacts of high-severity GenAI incidents but also uncover the root causes behind them and how they were mitigated, with many real-world incident cases.
• We reveal the challenges of handling GenAI incidents at different stages of the incident life cycle and provide insights into improving the reliability of large-scale GenAI cloud services.

¹ Due to company policy, we hide the actual numbers and present normalized numbers in this paper.

II. BACKGROUND AND MOTIVATION

In this section, we begin by introducing Large Language Model (LLM) cloud services and incident management, as illustrated in Figure 1. Subsequently, we outline the motivation behind our study.

Fig. 1: Incident management of GenAI cloud services. (Figure labels: GenAI cloud services such as Chat, Design, Analysis, Finetuning, Embedding, ...; incident management stages of detection, triage, root cause diagnosis, and mitigation/resolution; Time to Mitigation (TTM).)

A. GenAI Cloud Service

Given the substantial parameter scale of foundation models such as GPT-4, they are typically deployed in cloud systems like Azure OpenAI. This Cloud4AI service offers users a convenient means to access advanced language models without the complexities of managing infrastructure or undertaking extensive local computations. GenAI also has APIs for cloud services, as seen with Copilot [36], referred to as AI4Cloud. In our study, both Cloud4AI and AI4Cloud are the subjects of our investigation, which we collectively refer to as GenAI cloud services. In our study of GenAI cloud services’ incidents (termed GenAI incidents), we collect incident data from the Microsoft Incident Management system (IcM) [37] (Section III).

B. Incident Management

In cloud services, incidents are common and can lead to service disruptions, economic losses, and other unexpected severe consequences. To address such issues, major cloud providers like Microsoft typically follow four main procedures: detection, triage, diagnosis, and mitigation (Figure 1).

• Detection. This step detects service violations or performance issues and creates a ticket to record relevant information [38]–[50]. Such incidents can be detected manually (e.g., by customers or engineers) or automatically (e.g., by the service monitor) [38]–[50].
• Triage. This process assigns the detected incident to a responsible team [50]–[54]. Due to the complexity of cloud service systems, determining the appropriate team may require multiple rounds of discussion, and reassignment is sometimes necessary.
• Diagnosis. The assigned team analyzes the incident to determine its root cause by examining system logs and configuration settings to isolate the problem and identify corresponding factors [7], [50], [55]–[65].
• Mitigation. The mitigation step often accompanies the diagnosis, as engineers strive to promptly resolve the incident to minimize the Time to Mitigate (TTM) [66]–[69].

C. Motivation

GenAI, especially LLMs such as OpenAI’s ChatGPT, has witnessed a surge in popularity, with ChatGPT having over one million users in its debut week. However, such increased adoption has also unveiled potential risks, including outages and errors. Figure 2 showcases the variation in the number of GenAI-related incidents within Microsoft over the past four years, also highlighting changes in the number of incidents across different severity levels. The lower the severity level, the higher the impact to customers.

Fig. 2: Number of GenAI incidents at different times. (Y-axis: normalized number of incidents, 0 to 100; X-axis: date by month and year, June 2020 to February 2024; series: Total, High Severity, Medium Severity, Low Severity; annotated releases: GPT-3, GPT-3.5, ChatGPT, GPT-4.)

The total number of incidents, shown in gray, exhibits an upward trend. Specifically, before the release of the GPT-3.5 model in March 2022, GenAI-related incidents account for a mere 3% of the total incidents within the GenAI cloud service. After 2023, there is a significant increase in incidents, with a pronounced spike following the introduction of GPT-4 in March 2023. At this point, the volume of incidents had increased nearly tenfold relative to the figures reported during the GPT-3.5 era. This dramatic rise can be attributed to the global fame attained by the GPT model, which attracted millions of users. This trend also holds across all severity levels, with lower-impact incidents comprising most cases.

The proliferation of GenAI-related incidents affects both the associated cloud services and their end users. Unfortunately, the characteristics of the incidents of GenAI cloud services have not yet been comprehensively unveiled. This study aims to bridge this gap, thus providing insights for future research and practical guidance for the software engineering community maintaining GenAI cloud services.

III. METHODOLOGY

Microsoft, a leader in cloud computing, hosts the training and APIs for OpenAI and offers various GenAI cloud services, including Azure OpenAI, which utilizes Microsoft’s platform to provide access to the GPT series models. Incidents in these services are documented in a dedicated database. Prior studies [28], [30], [31], [70] have utilized similar databases to collect incidents and derive analytical insights. Consistent with this approach, this study leverages Microsoft’s database to collect GenAI-related incidents.

The database contains key details for each incident, including its description, root cause, mitigation steps, discussions by the on-call engineers (OCEs), and severity-level tags (high, medium, and low). To conduct our empirical study, we collect both GenAI-related and non-GenAI incidents as a comparative dataset. Following the methodology of previous research [28], we focus on significant incidents characterized by their high severity and detailed root cause descriptions, thereby facilitating an insightful qualitative analysis. The following shows the details of our methodology.

A. Data Collection

We first introduce the detailed procedures to collect the dataset. In particular, we collect two datasets that serve the two incident studies, respectively. The general incident study is designed to explore general characteristics of GenAI incidents within the incident management process, such as the distribution of incidents’ detection methods. It also aims to compare these incidents with those from other cloud services, analyzing the differences between the two. It requires a dataset with broad coverage. Therefore, we endeavor to collect all incidents that meet the criteria as comprehensively as possible. The in-depth incident study focuses on understanding the categories of an incident’s symptoms, root cause, and mitigation strategies based on detailed information, such as discussions by OCEs. Given the large volume of data, we opt to select only high-severity incidents as in-depth analysis cases, as these incidents have a more significant impact on the system and tend to attract greater attention. Through both studies, we can comprehensively understand the characteristics of GenAI incidents.

• GenAI incidents collection for general analysis. In this phase, our primary goal is to gather data on incidents, encompassing the period from June 2020, following the release date of the GPT-3 model by OpenAI, to February 2024. Here are the criteria we use to collect GenAI-related incidents:
1) We choose incidents that have been mitigated or resolved. The incident status is categorized within the “Status” field as “Active”, “Mitigated”, and “Resolved”. Our collection excludes incidents marked as “Active” due to the lack of comprehensive data, such as discussions by OCEs, root cause analysis, and mitigation steps;
2) The “Service” field indicates whether the incident is associated with a GenAI cloud service and its team or not. Incidents are considered GenAI-related if they are linked to a specific GenAI cloud service, such as Azure OpenAI;
3) Given the complex architecture and dependencies of GenAI cloud services, certain GenAI incidents may be managed by dependent (sub-)services and cannot be directly found by “Service”. Thus, we define a vocabulary of words related to GenAI (e.g., “gpt-3.5-turbo”, “LLM”, [Link]). Then we perform a case-insensitive search of these terms within the “Title” of an incident.

Following these criteria, we obtain hundreds of thousands of GenAI-related incidents.

• GenAI incidents collection for in-depth analysis. We meticulously select a subset of GenAI incidents based on three criteria:
1) The incident must be of high severity. Incidents of this nature typically result in significant service disruptions, affecting numerous tenants and customers;
2) The incident should include a detailed root cause analysis;
3) The incident must be valid. We deem an incident invalid if its mitigation steps are described as a “False Alarm”.

Following these criteria, we identified and selected many incidents for our detailed analysis. Given that high-severity incidents inherently constitute a smaller proportion of total incidents, the data collected at this step is significantly less than what is gathered for the general analysis.

• Other incidents collection. For discussion, especially a comparative mitigation analysis between GenAI incidents and those unrelated to GenAI, we collect the same number of other incidents using the same time frame and the same criteria as in the general analysis (omitting criteria 2 and 3) and the in-depth analysis.

B. Research Questions

In this study, we aim to reveal the behaviors of GenAI incidents in the incident management life cycle. Such insights are critical for the development, maintenance, and management of LLMs, aiming to improve the robustness and reliability of the LLM cloud systems. This exploration is pivotal for providing a scientific basis to prevent future incidents, thereby contributing valuable knowledge and experience to both research and practical applications in the field. In particular, we design the following research questions (RQs).

RQ1. What is the general behavior of GenAI incidents in terms of different incident life cycles?
RQ2. What are the symptoms of GenAI incidents?
RQ3. What are the root causes of GenAI incidents?
RQ4. How are GenAI incidents mitigated?

C. Categorization Strategy

While each incident is documented with detailed information, these records are typically composed by humans, e.g., OCEs’ discussion, and may contain images, URLs, and other elements that complicate automatic categorization. Therefore, we need to analyze all the incidents manually to further understand their symptoms, root causes, and mitigation strategies.

We randomly divide our dataset of incidents into three subsets: (1) taxonomy set: 40% of incidents, (2) validation set: 20% of incidents, and (3) test set: 40% of incidents. First, two authors independently follow the open coding strategy [71] to label symptoms, root causes, and mitigation strategies for the taxonomy set. Next, for categories with inconsistent classifications, a meeting involving other authors is convened to determine the final categorization. The two authors then label the validation set to check for the emergence of new categories and hold further discussions to refine their understanding of each category. Finally, they label the test set and employ Cohen’s kappa coefficient [72] to measure the consistency between annotators.

After multiple rounds of the labeling process described above, we ultimately adopt the best result, achieving near-perfect agreement across the three taxonomies: Symptom: 0.921, Root Cause: 0.930, Mitigation: 0.893. For incidents that can fit into multiple categories, e.g., multiple symptoms, disagreements are resolved by focusing on the category most prominently reflected in the incident and the OCEs’ discussions.

IV. RQ1: GENERAL STATISTICS

We explore the characteristics of GenAI incidents from three aspects, detection, triage, and mitigation, each corresponding to a phase of the incident life cycle.

A. Incident Detection

Detection is the initial step in incident management for a cloud service. Engineers can identify incidents by noticing unusual system behaviors [32], [73]–[75], while customers can also report issue tickets when encountering failure messages or experiencing delay [76]. To improve the efficiency of incident detection, automated monitoring tools are deployed [38]. These tools either passively collect real-time system telemetry data (e.g., CPU usage) and performance measures (e.g., response time and throughput), or proactively check the health of the system by periodically performing heartbeats or sanity checks. Figure 3 shows a monitor detecting the calling failure rate of a service.

Missing Alarms (False Negative): As shown in Figure 4, we find that 38.3% of the incidents related to GenAI are reported by humans, such as engineers and customers, instead of automated incident monitors. To explain this high human-reported percentage (such a ratio is only 13.7% for other cloud services, as will be further discussed in Section VIII-A), we find that 45.9% of GenAI cloud services are still under development or in the preview stage, while 54.1% of GenAI cloud services are in the General Availability status. Moreover, many GenAI cloud service monitors currently build on adaptations of existing frameworks designed for other types of cloud services, which may not yet fully align with the unique requirements of GenAI-specific scenarios. For instance, invalid inference incidents are often identified and reported by users, reflecting the collaborative effort to refine these systems further. Our study observes that there are around 25.9 unique monitors per 100 monitor-reported GenAI incidents, compared to 74.4 for other cloud services, offering an opportunity to enhance monitoring diversity. These insights highlight the ongoing evolution of GenAI monitoring approaches, as the
industry continues to refine automated detection capabilities and improve response efficiency.

Title: monitor evaluated high fail rate for scope [ServiceA], zone [WestRegion2]
Monitor Name: [Service] FailRate
Metric: DependencyCallCounter
Description: Marks the target as ‘Unhealthy’ and raises a high-severity incident if the failure rate exceeds 4% over the past 60 minutes.
Trouble-shooting Guide: [Link to the TSG]
Diagnostic Information:
  Count   Failure Description
  52      ServiceClient failure for [ServiceB]: Failed to call [ServiceB], ReasonPhrase=Failed Dependency
  13      RequestTimeout for [EncoderService]
  8       ServiceClient failure for ChatGPT: No service for ‘BotClientLibrary’ has been registered.
  3       ServiceClient failure for ChatGPT: Failed to call ‘ChatGPT’ at LoadBalancer, ErrorStatusCode=400
  6       ClientSecretCredential authentication failed: A configuration issue is preventing authentication. Details: The provided client secret keys for app [ApplicationA-UUID] are expired.
  ...     ...

Fig. 3: Incident detected by a monitor and the collected diagnostic information attached to the monitor.

Fig. 4: GenAI incident detection type and different stages of GenAI services. GA: General Availability, Dev: Development. (a) Detection type: Human 38.3%, Monitor 61.7%. (b) Service stage: Dev, Preview, and GA, with plotted shares of 54.1%, 36.9%, and 9.0%.

Wrong Alarms (False Positives): The false alarm rate for incidents detected by monitors in GenAI services is notably high at 11.0% (Table I), compared to 6.6% for incidents detected by humans. This higher false positive rate stems primarily from the sensitivity of the monitoring systems. For example, the monitor in Figure 3 issues an incident report if the failure rate exceeds 4% within one hour. If the failure rate threshold is set lower or the monitoring period is shortened, the monitor becomes more sensitive, possibly leading to more false positives. These false alarms burden engineers with unnecessary investigations, thus delaying the resolution of true incidents.

TABLE I: Detection type distribution and false alarm rate for GenAI and non-GenAI incidents.

          Detection Type        False Alarm Rate
          Human     Monitor     Human     Monitor
GenAI     38.3%     61.7%       6.6%      11.0%
Other     13.7%     86.3%       4.8%      3.8%

Finding 1: GenAI cloud services and their monitoring are still in an early stage. A high percentage (38.3%) of GenAI incidents are reported by humans. Besides, among the incidents detected by the automated monitors, there is an 11.0% false alarm rate, which points to opportunities for further enhancement in monitoring precision.

B. Incident Triage

Triage is a crucial component of the incident management life cycle, significantly affecting the Time-to-Mitigate (TTM) [32]. Incidents can be sent to incorrect teams or need collaborative efforts, leading to cases where they are reassigned between different teams. The process of reassigning an incident from one team to another is called a transfer hop.

Fig. 5: Transfer hops for incidents. (Slices: Hop=1, Hop=2, Hop>=3. Monitor-detected: 90.7% at Hop=1, with the remaining slices at 8.6% and 0.7%. Human-detected: 85.7% at Hop=1, with the remaining slices at 11.1% and 3.2%.)

As shown in Figure 5, incidents that are initially detected by monitoring systems are usually accurately triaged to the correct team on their first attempt (90.7%). However, the proportion of incidents needing reassignment increases when they are detected by humans: 14.3% of human-detected GenAI incidents undergo reassignment. This shows the effectiveness of using automatic monitors for triage. For example, the monitor-generated ticket title embeds the name of the service that leads to the incident, as shown in Figure 3, so the incident can be accurately triaged to the service team. Another factor in incident reassignment is the interdependency on other services. Resolving an incident might exceed the capabilities of a single team, and collaborative efforts across different
service domains are needed. Further details on the root causes of GenAI incidents are elaborated in Section VI.

Fig. 6: TTM distribution across different factors: Y-axis is the normalized TTM of all incidents (in time units, 0 to 5); the top whisker of each box plot represents the maximum value; the top and bottom edges of the box represent the upper quartile and the lower quartile, respectively, and the line inside the box represents the median value. (a) Different severity levels (High Sev, Medium Sev, Low Sev); (b) The presence of a TSG (HasTSG, NoTSG); (c) Detection types (Human, Monitor).

subsections are ordered based on their perceived impact on service operation and user experience.

A. Invalid Inference (14.5%)

While the model inference executes successfully and the service returns results to clients without errors, the model output can be invalid. Inaccuracies in the output directly affect the core functionality of GenAI services. (1) Response Quality Degradation (10.7%): Models can generate low-quality content even from simple user prompts. Another scenario involves the generation of invalid content, where the model fails to understand the user’s prompt, leading to invalid content creation [77]; (2) Prompt/Response Content Filter Malfunction (3.8%): GenAI cloud services deploy policy filters for both user prompts and model responses to prevent the generation of harmful content. However, these content filters can sometimes malfunction, resulting in inappropriate or harmful content from the model, as well as false alarms that incorrectly filter out valid prompts or responses.
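The two failure modes of a content filter described above, harmful content that slips through and valid content that is blocked, can be quantified on a labeled sample of filter decisions. The following is a minimal Python sketch under stated assumptions: the `FilterDecision` record, its field names, the `filter_error_rates` helper, and the sample data are all hypothetical and serve only to illustrate the bookkeeping; they are not taken from the studied services or their tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterDecision:
    """One labeled filter decision (hypothetical record layout)."""
    is_harmful: bool   # ground-truth label, e.g., from a human reviewer
    was_blocked: bool  # what the content filter actually did


def filter_error_rates(decisions):
    """Return (miss_rate, false_block_rate) for a list of decisions.

    miss_rate: fraction of harmful items that were NOT blocked.
    false_block_rate: fraction of valid items that WERE blocked.
    A rate is None when its denominator is zero.
    """
    harmful = [d for d in decisions if d.is_harmful]
    valid = [d for d in decisions if not d.is_harmful]
    missed = sum(1 for d in harmful if not d.was_blocked)
    false_blocks = sum(1 for d in valid if d.was_blocked)
    miss_rate = missed / len(harmful) if harmful else None
    false_block_rate = false_blocks / len(valid) if valid else None
    return miss_rate, false_block_rate


# Made-up sample: 4 harmful items (1 missed), 6 valid items (2 blocked).
sample = (
    [FilterDecision(True, True)] * 3
    + [FilterDecision(True, False)]
    + [FilterDecision(False, False)] * 4
    + [FilterDecision(False, True)] * 2
)
miss_rate, false_block_rate = filter_error_rates(sample)
```

Tracking the two rates separately matters because they trade off against each other: tightening a filter typically lowers the miss rate at the cost of more valid prompts or responses being filtered out.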
VI. RQ3: ROOT CAUSE

We categorize the root causes of GenAI incidents into five distinct types. The relationships between symptoms and root causes are shown in Figure 7. Each cell represents the percentage of a specific symptom associated with a particular root cause. We can observe that a single symptom can come from multiple root causes rather than a simple one-to-one relationship. This indicates that diagnosing the root cause from symptoms is not straightforward.

Fig. 7: Relationships between symptom and root cause (cell values in %).

Root Cause             Degraded Performance   Deployment Failure   Invalid Inference
Infrastructure Issue   14.6                   9.4                  3.2
Configuration Issue    13.2                   9.4                  1.9
Code Bug               9.8                    7.0                  4.7
External Usage Issue   8.0                    3.3                  2.8
Operation Error        4.2                    6.6                  1.9

A. Infrastructure Issue (27.2%)

GenAI cloud services are built upon a complex hierarchical infrastructure comprising VMs, nodes, clusters, and data centers that host tightly coupled resources, including CPU, memory, storage, and networks. We find that infrastructure issues are a major cause of degraded performance and deployment failure (Figure 7). Infrastructure issues are categorized into the following types: (1) Infrastructure Maintenance Issues (17.8%): Failures of hardware components, such as worn-out GPUs, can impact the fine-tuning and inference of GenAI services. For instance, faulty GPUs can process requests incorrectly, resulting in errors such as gibberish outputs. (2) Network Issues (4.7%): Besides network bandwidth, incidents can arise in the communication between VMs and nodes within clusters, including connectivity issues and DNS resolution failures. Such network problems can severely disrupt the performance and reliability of the service. (3) Storage Issues (4.7%): The management of vast amounts of data needs robust storage solutions. Failures in data storage or IO operations, such as data corruption or delays, can lead to service disruptions.

Finding 4: Infrastructure issues are a key area of focus for understanding and addressing incidents in GenAI cloud services, especially for degraded performance and deployment failure. To meet growing user demands, GenAI cloud services should not only scale up the size of GPU clusters but also prioritize robust infrastructure management.

B. Configuration Issue (24.5%)

GenAI cloud services rely on a multitude of configuration settings to ensure the seamless operation of their interconnected components. However, mismanagement of these configurations is occasionally observed. Incorrect or unsynchronized settings can ruin service functionality. We categorize these configuration issues into the following types: (1) Misconfiguration (13.1%): Operators may employ incorrect configurations or commit errors, typically due to human mistakes. For example, engineers might configure far fewer model instances than required during system maintenance, leading to an outage or degraded performance. (2) Configuration Update (6.4%): Changes in one cloud component’s configurations can lead to incompatibilities with other components due to the configuration dependencies among them. Additionally, version conflicts for the same configuration may result in one configuration overriding another, e.g., using a removed parameter in its latest version or using an added parameter in its previous version, leading to malfunctions. (3) Configuration Missing and Gaps (5.0%): Missing or disabled configurations can disrupt normal operations. Additionally, certain configurations impose range restrictions on values, such as timeout thresholds or maximum sizes for prompt tokens. Under unexpected circumstances, such as a sudden surge in user traffic, these static configurations can constrain system performance.

C. Code Bug (21.5%)

Code bugs are a primary cause of incidents, and prior work [30] has specifically investigated the code bugs leading to cloud incidents. We identify four types of code bugs for GenAI incidents: data constraint bugs, content filter bugs, exception handling bugs, and cross-system bugs. (1) Bugs Violating Data Constraints of the Model (6.7%): Bugs can arise due to inadequate validation of the data format or missing data that the model needs to consume. For example, a fine-tuning failure can be caused by the lack of validation of the dataset format in the FileUpload API: the malformed dataset was not rejected during the file upload stage and was delivered to the backend services; (2) Prompt/Response Content Filter Bugs (2.2%): Code defects can exist in the prompt or response filter. (3) Exception Handling Bugs (6.3%): Exceptions are a normal occurrence during code execution. However, the code can be unable to effectively handle certain exceptions or failures. For example, errors may occur during model deployment, such as an invalid model being deployed to an endpoint. Due to a code defect in processing such an error, e.g., simply swallowing the exception, the invalid model remains there and serves requests; (4) Cross-system Bugs (6.3%): These bugs are mostly caused by issues in the code across multiple components. To fix this type of bug, changes are needed in multiple services.

D. External Usage Issue (14.1%)

Incidents can arise from incorrect usage of a GenAI service by the customer. For example, a customer missed indexes when performing queries to the LLM, which caused high CPU usage in the service.

E. Operation Error (12.7%)

Operation errors in GenAI cloud services are typically caused by human errors during the management and operational processes. Such errors occur when operators mistakenly introduce erroneous or outdated dependencies, or use expired credentials.

VII. RQ4: MITIGATION

To answer RQ4, we delve into the common categories of mitigation strategies utilized to address GenAI incidents. Specifically, we inspect the title and the detailed description of the mitigation steps in each incident ticket and its corresponding postmortem report. From these descriptions, engineers’ discussion threads, and completed work bullets, we classify the

the third-party library. For example, an inference API error that was caused by a compatibility issue between fine-tuning code and inference code can be fixed by rolling back to a previous inference engine for users in specific regions; (2) Configuration Rollback (6.3%): This involves undoing bad configuration changes to alleviate the issue.

C. Configuration Fix (13.0%)

To address the majority of configuration errors, engineers often fix bugs in configuration files to reinstate the service. We identify two primary approaches to configuration fixes: (1) Add
mitigation methods into the following distinct types: ad-hoc or Disable Features (7.6%): Incidents can be mitigated either
fix, self-recover, rollback, configuration fix, infrastructure fix, by adding new features that enhance service stability or by
external fix, code fix, and others. disabling features that are causing failures, thus aiding in the
A. Code Fix (7.6%) swift resolution of the issue; (2) Increase the Configuration
Limit (5.4%): Besides the configuration issues, a number of
const getCommandText = () => incidents from resource capacity as mentioned in Section VI-A
featureFlags . enableRemoveUnicodeFromRequest can also be mitigated by configuration changes as a short-term
? removeUnicodeFromRequest (text) : text;
... strategy, such as increasing timeout thresholds.
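The feature-flag guard in the Code Fix listing, together with the "add or disable features" mitigation, can be illustrated in a small self-contained form. This is only a sketch under stated assumptions, not the service's implementation: the `FeatureFlags` shape, the explicit `text` parameter, and the body of `removeUnicodeFromRequest` (assumed here to strip everything except tab, newline, carriage return, and printable ASCII) are all hypothetical.

```typescript
// Hypothetical flag set; a real service would load this from its configuration system.
interface FeatureFlags {
  enableRemoveUnicodeFromRequest: boolean;
}

// Assumed behaviour: keep tab, newline, carriage return, and printable ASCII only.
function removeUnicodeFromRequest(text: string): string {
  return text.replace(/[^\x09\x0A\x0D\x20-\x7E]/g, "");
}

// Mirrors the guard in the listing: sanitize only while the flag is on,
// so the behaviour can be toggled through configuration.
function getCommandText(text: string, featureFlags: FeatureFlags): string {
  return featureFlags.enableRemoveUnicodeFromRequest
    ? removeUnicodeFromRequest(text)
    : text;
}

// Example request containing a non-breaking space and a check mark.
const request = "summarize\u00A0this \u2713";
const sanitized = getCommandText(request, { enableRemoveUnicodeFromRequest: true });
const untouched = getCommandText(request, { enableRemoveUnicodeFromRequest: false });
```

With the flag on, the two non-ASCII characters are dropped; with the flag off, the request passes through unchanged, which is what makes disabling the feature usable as a configuration-only mitigation.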
Fig. 8: The distribution of mitigation approaches. [Charts for GenAI incidents and other incidents; categories: Ad-hoc Fix, Self-recover, Rollback, Configuration Fix, Infrastructure Fix, External Fix, Code Fix.]

Fig. 9: Average TTM for different mitigation approaches. [Bar chart, GenAI vs. Other; y-axis: time unit; x-axis: Ad-hoc, Self-recover, Rollback, Configuration, Infrastructure, External, Code.]

and enforcing a maximum limit for the batch size. Also, in other cases where a single user’s request consumed too many background resources and resulted in service overload, the issue was mitigated by temporarily limiting the user’s request rate, adjusting the throttling from 10 seconds to one second for the customer with a high workload. Note that over half of the incidents from other cloud services are mitigated by ad-hoc fixes (Figure 8), while GenAI cloud services often require more development and deployment effort (the other mitigation approaches discussed in the following) to fully resolve the incidents. Consequently, the TTM of GenAI incidents is longer.

F. External Fix (10.0%)
GenAI cloud services support external company partners and customers, so some incidents are mitigated externally, including by Microsoft Partners and customers. For example, engineers recommend that customers modify their prompts, or switch to a stable model, when incorrect usage causes the model to return unexpected content.

G. Self-recover (19.7%)
These transient incidents are automatically mitigated as the service recovers on its own through its resilience mechanisms, for example, back-off retry, or when the monitoring system no longer detects abnormal indicators, e.g., the heartbeat detection rate returns to normal. Note that self-recovered incidents are not false alarms in our dataset.

VIII. DISCUSSION

A. Lessons Learned
Since the mitigation strategy categories for GenAI and non-GenAI incidents are highly similar, we further perform a comparative study to identify their distinctions. We find that GenAI incidents generally require more time to mitigate than other types. Specifically, on average, GenAI incidents take 1.12 time units to resolve, compared to 0.65 time units for non-GenAI incidents.

To reveal the underlying reasons: (1) We calculate the TTM for each mitigation category and find that the longer TTM for GenAI incidents holds across all mitigation categories, as shown in Figure 9, reflecting the complexity of solving various GenAI incidents. Additionally, across all factors we consider (severity levels, detection types, troubleshooting guides) in the general analysis in Section IV-C, the Time to Mitigation (TTM) for LLM incidents is consistently longer than for incidents in other services. (2) We compare the distribution of mitigation approaches, as depicted in Figure 8. Ad-hoc fixes (54.7%) account for the majority of mitigations for other cloud services, and they have a shorter TTM than any GenAI incident mitigation in Figure 9. The mitigation distribution of GenAI incidents is more balanced, with ad-hoc fixes comprising only 22.4%. This indicates that, for GenAI cloud services in their early development stage, more diverse, sophisticated, and time-consuming methods are required as opposed to applying ad-hoc fixes. (3) The current monitoring tools for GenAI cloud services are being continuously improved to better align with their unique requirements. Enhancements in accuracy and adequacy are expected to help reduce TTM and improve overall efficiency. Unlike conventional cloud services monitored by automated watchdogs, a high percentage of GenAI incidents are detected by humans. According to Table I in Section IV-A, only 13.7% of the incidents were detected by humans for non-GenAI cloud services in our dataset, compared to 38.3% for GenAI incidents. Furthermore, monitor-detected GenAI incidents have an 11.0% false positive alarm rate, significantly higher than the 3.8% observed in other services. This suggests that the current monitors for GenAI services are less mature than those for conventional services and require additional effort to improve.

Longer TTM is also attributed to the difficulty of performing root cause analysis for GenAI incidents. As discussed in Section VII, a single symptom can stem from multiple root causes, thus complicating the debugging of GenAI services. For example, diagnosing unexpected model outputs can be complex; potential causes include faulty hardware, misconfigurations, code defects, or misuse.

B. Implications
Our findings offer actionable insights for a wide range of stakeholders, including researchers, model providers, service maintainers, and developers.

Researchers. Our study highlights several avenues for future research, particularly in automated methods to detect invalid inference results. Currently, invalid outputs (14.5%), such as hallucinations or irrelevant responses, are challenging to detect. The current state-of-the-art detection methods generally include 1) self-judgment by the LLM, 2) fine-tuning another model with human-labeled data, or 3) calculating
consistency scores after multiple attempts. However, none of them is cost-efficient or fully effective. More robust research is needed to address these limitations and develop scalable validation algorithms that can operate across various GenAI applications.

Model Providers. Besides the high ratio of invalid inference results (14.5%) and the challenges in detecting hallucinations or invalid content, another notable finding is that 38% of GenAI incidents are reported by humans, reflecting that monitoring tools are underdeveloped. Moreover, many GenAI cloud services (45.9%) are still under development or in the preview stage, coupled with the scarcity of incident monitor types. Providers should enhance service observability to detect and diagnose issues more effectively, and provide better support and documentation to help users navigate the complexities of GenAI service integration and management.

Service Maintainers. Our study reveals that the Time-to-Mitigate (TTM) for GenAI incidents is 1.83 times longer than for non-GenAI incidents, highlighting the need for automation in incident mitigation. The complexity of GenAI systems, which involve vast and interconnected layers of infrastructure, dependencies, and configurations, is a significant factor. For example, GenAI cloud systems require 2.5x more infrastructure fixes, 3.0x more code changes, and 3.0x as many configuration updates compared to non-GenAI services. Despite this, the more straightforward ad-hoc fixes are applied in only 22.4% of GenAI incidents, compared to 54.7% in non-GenAI services, indicating a reliance on more complex, time-consuming fixes for GenAI systems. Furthermore, diagnosing root causes of GenAI incidents is often complex. A single symptom, such as poor performance (49.8%) or deployment failure (35.7%), can have multiple root causes, including infrastructure problems (27.2%), configuration problems (24.5%), or code bugs (22.5%). Services should provide observability from different dimensions to obtain granular insight into these symptoms and their underlying causes. Maintainers should consider 1) implementing more automation tools or agents for distinct mitigation approaches, 2) adopting more infrastructure-as-code practices to manage complex GenAI cloud infrastructure more effectively, and 3) integrating more automated rollback mechanisms to address compatibility issues swiftly.

Application Developers and Users. For developers, input validation and dynamic rate limiting are critical areas needing improvement. Incidents reveal that special characters, fragmented prompts, and excessive token usage, even within token limits, can disrupt model processing. Developers should implement input validation processes to prevent these issues and adopt dynamic rate-limiting strategies.

IX. RELATED WORK

Empirical studies on cloud incidents. A significant amount of prior work has been devoted to studying the characteristics of incidents occurring in production systems. Ganatra et al. [78] examined incident detection at Microsoft to identify monitoring gaps in cloud platforms. Chen et al. [32] studied incident triage in Microsoft’s online service systems to understand industry practices. Zhao et al. [70] explored change-induced incident lifecycles in large-scale online services, offering management insights. Wang et al. [31] analyzed the time-to-mitigation (TTM) of incidents across 20 Microsoft online services. Building on this, our study delves into incident characteristics, comparing incidents related to GenAI with those of other services. In related work, Liu et al. [30] investigated software bugs causing cloud incidents in Microsoft Azure and their resolutions. Ghosh et al. [28] analyzed incidents in Microsoft Teams, classifying root causes and mitigation steps. Di Martino et al. [79] characterized failures in a business data processing platform using event log data. Our work builds upon these studies by applying established approaches from traditional cloud incident analysis to the GenAI cloud services context. While following similar research methodologies, we highlight characteristics unique to GenAI incidents, such as symptoms that involve invalid inference, which are not commonly seen in conventional cloud environments.

Empirical studies on LLMs. In recent years, with the rise of large language models (LLMs), numerous related studies have emerged. Cui et al. [80] organize existing studies related to LLMs and propose a comprehensive taxonomy, which systematically analyzes potential risks in LLM systems and discusses corresponding mitigation strategies. Liu et al. [81] investigate the use of jailbreak prompts to bypass restrictions imposed on ChatGPT. They conduct an empirical study to evaluate the effectiveness and robustness of prompts collected from the real world. Zhuo et al. [82] present an empirical study on the adversarial robustness of a prompt-based semantic parser based on Codex. Yang et al. [83] conduct a study on GPT-3 in knowledge-based visual question answering (VQA), treating GPT-3 as a knowledge base (KB) and adapting GPT-3 to solve the VQA task in a few-shot manner. In contrast to these studies, which primarily focus on model behavior and robustness, our work centers on the reliability of LLM-related cloud services. Specifically, we present a novel empirical study of incidents in such services, offering insights into their design, operation, and maintenance.

X. THREATS TO VALIDITY

Internal threat. Subjectivity may occur during manual labeling as an internal threat. To mitigate this threat, our study goes through multiple rounds involving independent labeling, meetings to discuss categorization, and the calculation of Cohen’s kappa [72]. We ultimately select the round of labeling that is near-perfect as our final result, which demonstrates the highest consistency.

External threat. All incidents we collect come from Microsoft’s cloud systems. Given that Microsoft employs various effective tools and techniques to eliminate bugs and deploys multiple automated tools to mitigate some incidents before they impact customers, the incidents we collect may not fully represent the behavior of other GenAI cloud services. We plan to perform a larger-scale evaluation of GenAI cloud services from different companies in the future.
XI. CONCLUSION

In this paper, we present a comprehensive study of incidents from GenAI cloud services within Microsoft. We explore the symptoms, root causes, and mitigation strategies of GenAI incidents. Our findings reveal unique characteristics in GenAI cloud services. For example, we identify notable differences between incidents from LLM cloud services and other cloud services, such as significant disparities in the time to mitigation of incidents. Additionally, we find that the primary cause of incidents in LLM cloud services is related to infrastructure. These findings provide guidance for future academic and industrial research in the field of LLM cloud services. We hope to inspire the development of advanced, specialized tooling and raise discussions on GenAI incidents, so that our community can monitor the GenAI cloud system with early warnings, triage incidents to the correct teams with fewer hops, pinpoint root causes accurately, and mitigate the incidents with optimal plans.

ACKNOWLEDGEMENT

We sincerely thank all anonymous reviewers for their valuable feedback and guidance in improving this paper. This work was sponsored by National Natural Science Foundation of China (No.62372193 and No.U2436207).

REFERENCES

[1] S. Yang, Z. Shang, Y. Wang, D. Deng, H. Chen, Q. Cheng, and X. Wu, “Data-free multi-label image recognition via llm-powered prompt tuning,” CoRR, vol. abs/2403.01209, 2024.
[2] J. Han, R. Zhang, W. Shao, P. Gao, P. Xu, H. Xiao, K. Zhang, C. Liu, S. Wen, Z. Guo, X. Lu, S. Ren, Y. Wen, X. Chen, X. Yue, H. Li, and Y. Qiao, “Imagebind-llm: Multi-modality instruction tuning,” CoRR, vol. abs/2309.03905, 2023.
[3] C. Zhang, Z. Ma, Y. Wu, S. He, S. Qin, M. Ma, X. Qin, Y. Kang, Y. Liang, X. Gou, et al., “Allhands: Ask me anything on large-scale verbatim feedback via large language models,” in ICDE, 2025.
[4] B. Qiao, L. Li, X. Zhang, S. He, Y. Kang, C. Zhang, F. Yang, H. Dong, J. Zhang, L. Wang, et al., “Taskweaver: A code-first agent framework,” arXiv:2311.17541, 2023.
[5] Y. Jiang, C. Zhang, S. He, Z. Yang, M. Ma, S. Qin, Y. Kang, Y. Dang, S. Rajmohan, Q. Lin, et al., “Xpert: Empowering incident management with query recommendations via large language models,” in ICSE, 2024.
[6] Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen, et al., “Automatic root cause analysis via large language models for cloud incidents,” in EuroSys, 2024.
[7] P. Jin, S. Zhang, M. Ma, H. Li, Y. Kang, L. Li, Y. Liu, B. Qiao, C. Zhang, P. Zhao, et al., “Assess and summarize: Improve outage understanding with large language models,” in ESEC/FSE, 2023.
[8] C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y. Kang, Q. Lin, S. Rajmohan, et al., “Ufo: A ui-focused agent for windows os interaction,” in NAACL, 2025.
[9] S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, Y. Xiao, S. Yun, X. Huang, and Z. Wei, “Disc-lawllm: Fine-tuning large language models for intelligent legal services,” CoRR, vol. abs/2309.11325, 2023.
[10] M. M. Raza, K. P. Venkatesh, and J. C. Kvedar, “Generative AI and large language models in health care: pathways to implementation,” npj Digit. Medicine, vol. 7, 2024.
[11] L. Luceri, E. Boniardi, and E. Ferrara, “Leveraging large language models to detect influence campaigns in social media,” CoRR, vol. abs/2311.07816, 2023.
[12] J. Yu, R. He, and R. Ying, “Thought propagation: An analogical approach to complex reasoning with large language models,” in ICLR, 2024.
[13] S. P. Sharan, F. Pittaluga, V. K. B. G, and M. Chandraker, “Llm-assist: Enhancing closed-loop planning with language-based reasoning,” CoRR, vol. abs/2401.00125, 2024.
[14] Y. Zhao, R. Zhang, W. Li, and L. Li, “Assessing and understanding creativity in large language models,” Mach. Intell. Res., vol. 22, 2025.
[15] Y. Liu, S. Chen, H. Cheng, M. Yu, X. Ran, A. Mo, Y. Tang, and Y. Huang, “How AI processing delays foster creativity: Exploring research question co-creation with an llm-based agent,” in CHI, 2024.
[16] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, W. Zhang, S. Qin, S. Rajmohan, Q. Lin, and D. Zhang, “Everything of thoughts: Defying the law of penrose triangle for thought generation,” in ACL, 2024.
[17] S. Bubeck et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv:2303.12712, 2023.
[18] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia, “Towards efficient generative large language model serving: A survey from algorithms to systems,” arXiv:2312.15234, 2023.
[19] Y. Chen, M. Shetty, G. Somashekar, M. Ma, Y. Simmhan, J. Mace, C. Bansal, R. Wang, and S. Rajmohan, “AIOpslab: A holistic framework to evaluate AI agents for enabling autonomous clouds,” in MLSys, 2025.
[20] M. Shetty, Y. Chen, G. Somashekar, M. Ma, Y. Simmhan, X. Zhang, J. Mace, D. Vandevoorde, P. Las-Casas, S. M. Gupta, et al., “Building ai agents for autonomous clouds: Challenges and design principles,” in SoCC, 2024.
[21] Z. Yu, M. Ma, C. Zhang, S. Qin, Y. Kang, C. Bansal, S. Rajmohan, Y. Dang, C. Pei, D. Pei, et al., “Monitorassistant: Simplifying cloud service monitoring via large language models,” in FSE, 2024.
[22] “Elevated errors affecting api and chatgpt.” [Link] incidents/n38dwwksfkv9, 2024.
[23] D. Xin, H. Miao, A. Parameswaran, and N. Polyzotis, “Production machine learning pipelines: Empirical analysis and optimization opportunities,” in SIGMOD, 2021.
[24] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug characteristics,” in ESEC/FSE, 2019.
[25] N. Humbatova, G. Jahangirova, G. Bavota, V. Riccio, A. Stocco, and P. Tonella, “Taxonomy of real faults in deep learning systems,” in ICSE, 2020.
[26] Z. Chen, Y. Cao, Y. Liu, H. Wang, T. Xie, and X. Liu, “A comprehensive study on challenges in deploying deep learning based software,” in ESEC/FSE, 2020.
[27] Z. Chen, Y. Kang, L. Li, X. Zhang, H. Zhang, H. Xu, Y. Zhou, L. Yang, J. Sun, Z. Xu, et al., “Towards intelligent incident management: why we need it and how we make it,” in ESEC/FSE, 2020.
[28] S. Ghosh, M. Shetty, C. Bansal, and S. Nath, “How to fight production incidents? an empirical study on a large-scale cloud service,” in SoCC, 2022.
[29] P. Dogga, C. Bansal, R. Costleigh, G. Jayagopal, S. Nath, and X. Zhang, “Autoarts: Taxonomy, insights and tools for root cause labelling of incidents in microsoft azure,” in ATC, 2023.
[30] H. Liu, S. Lu, M. Musuvathi, and S. Nath, “What bugs cause production cloud incidents?,” in HotOS, 2019.
[31] W. Wang, J. Chen, L. Yang, H. Zhang, P. Zhao, B. Qiao, Y. Kang, Q. Lin, S. Rajmohan, F. Gao, Z. Xu, Y. Dang, and D. Zhang, “How long will it take to mitigate this incident for online service systems?,” in ISSRE, 2021.
[32] J. Chen, X. He, Q. Lin, Y. Xu, H. Zhang, D. Hao, F. Gao, Z. Xu, Y. Dang, and D. Zhang, “An empirical investigation of incident triage for online service systems,” in ICSE (SEIP), 2019.
[33] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?,” in EMNLP, 2022.
[34] E. D. Liddy, “Natural language processing,” 2001.
[35] “Fine-tuning.” [Link]
[36] “Microsoft copilot: Your everyday ai companion.” [Link] [Link]/, 2024.
[37] Z. Chen, Y. Kang, F. Gao, L. Yang, J. Sun, Z. Xu, P. Zhao, B. Qiao, L. Li, X. Zhang, et al., “Aiops innovations of incident management for cloud services,” 2020.
[38] C. Zhao, M. Ma, Z. Zhong, S. Zhang, Z. Tan, X. Xiong, L. Yu, J. Feng, Y. Sun, Y. Zhang, D. Pei, Q. Lin, and D. Zhang, “Robust multimodal failure detection for microservice systems,” in KDD, 2023.
[39] J. Huang, Y. Yang, H. Yu, J. Li, and X. Zheng, “Twin graph-based anomaly detection via attentive multi-modal learning for microservice system,” in ASE, 2023.
[40] C. Zhang, X. Peng, C. Sha, K. Zhang, Z. Fu, X. Wu, Q. Lin, and D. Zhang, “Deeptralog: Trace-log combined microservice anomaly detection through graph-based deep learning,” in ICSE, 2022.
[41] L. Li, X. Zhang, X. Zhao, H. Zhang, Y. Kang, P. Zhao, B. Qiao, S. He, P. Lee, J. Sun, et al., “Fighting the fog of war: Automated incident detection for cloud systems,” in ATC, 2021.
[42] H. Qiu, S. S. Banerjee, S. Jha, Z. T. Kalbarczyk, and R. K. Iyer, “FIRM: An intelligent fine-grained resource management framework for SLO-Oriented microservices,” in OSDI, 2020.
[43] M. Ma, S. Zhang, J. Chen, J. Xu, H. Li, Y. Lin, X. Nie, B. Zhou, Y. Wang, and D. Pei, “Jump-starting multivariate time series anomaly detection for online service systems,” in ATC, 2021.
[44] J. Zeng, Z. L. Chua, Y. Chen, K. Ji, Z. Liang, and J. Mao, “Watson: Abstracting behaviors from audit logs via aggregation of contextual semantics,” in NDSS, 2021.
[45] J. Zeng, X. Wang, J. Liu, Y. Chen, Z. Liang, T.-S. Chua, and Z. L. Chua, “Shadewatcher: Recommendation-guided cyber threat analysis using system audit records,” in S&P, 2022.
[46] S. Zhang, Y. Ji, J. Luan, X. Nie, Z. Chen, M. Ma, Y. Sun, and D. Pei, “End-to-end automl for unsupervised log anomaly detection,” in ASE, 2024.
[47] Z. Yu, C. Pei, X. Wang, M. Ma, C. Bansal, S. Rajmohan, Q. Lin, D. Zhang, X. Wen, J. Li, et al., “Pre-trained kpi anomaly detection model through disentangled transformer,” in KDD, 2024.
[48] Y. Liu, M. Ma, P. Zhao, T. Li, B. Qiao, S. Li, Z. Li, M. Chintalapati, Y. Dang, C. Bansal, S. Rajmohan, Q. Lin, and D. Zhang, “Early bird: Ensuring reliability of cloud systems through early failure prediction,” in ISSRE, 2024.
[49] J. Liu, C. Zhang, J. Qian, M. Ma, S. Qin, C. Bansal, Q. Lin, S. Rajmohan, and D. Zhang, “Large language models can deliver accurate and interpretable time series anomaly detection,” in KDD, 2025.
[50] Y. Sun, B. Shi, M. Mao, M. Ma, S. Xia, S. Zhang, and D. Pei, “Art: A unified unsupervised framework for incident management in microservice systems,” in ASE, 2024.
[51] C. Bansal, S. Renganathan, A. Asudani, O. Midy, and M. Janakiraman, “Decaf: Diagnosing and triaging performance issues in large-scale cloud services,” in ICSE (SEIP), 2020.
[52] J. Chen, X. He, Q. Lin, Y. Xu, H. Zhang, D. Hao, F. Gao, Z. Xu, Y. Dang, and D. Zhang, “An empirical investigation of incident triage for online service systems,” in ICSE (SEIP), 2019.
[53] J. Chen, X. He, Q. Lin, H. Zhang, D. Hao, F. Gao, Z. Xu, Y. Dang, and D. Zhang, “Continuous incident triage for large-scale online service systems,” in ASE, 2019.
[54] Z. Wang, J. Li, M. Ma, Z. Li, Y. Kang, C. Zhang, C. Bansal, M. Chintalapati, S. Rajmohan, Q. Lin, et al., “Large language models can provide accurate and interpretable incident triage,” in ISSRE, 2024.
[55] Z. Wang, Z. Liu, Y. Zhang, A. Zhong, L. Fan, L. Wu, and Q. Wen, “Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,” in CIKM, 2024.
[56] D. Zhang, X. Zhang, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, “Pace: Prompting and augmentation for calibrated confidence estimation with gpt-4 in cloud incident root cause analysis,” arXiv:2309.05833, 2023.
[57] G. Yu, P. Chen, Y. Li, H. Chen, X. Li, and Z. Zheng, “Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,” in ESEC/FSE, 2023.
[58] C. Lee, T. Yang, Z. Chen, Y. Su, and M. R. Lyu, “Eadro: An end-to-end troubleshooting framework for microservices on multi-source data,” in ICSE, 2023.
[59] S. Zhang, P. Jin, Z. Lin, Y. Sun, B. Zhang, S. Xia, Z. Li, Z. Zhong, M. Ma, W. Jin, et al., “Robust failure diagnosis of microservice system through multimodal data,” IEEE Trans. Serv. Comput., vol. 16, 2023.
[60] Y. Zhang, Z. Guan, H. Qian, L. Xu, H. Liu, Q. Wen, L. Sun, J. Jiang, L. Fan, and M. Ke, “Cloudrca: A root cause analysis framework for cloud computing platforms,” in CIKM, 2021.
[61] Z. Xie, S. Zhang, Y. Geng, Y. Zhang, M. Ma, X. Nie, Z. Yao, L. Xu, Y. Sun, W. Li, et al., “Microservice root cause analysis with limited observability through intervention recognition in the latent space,” in KDD, 2024.
[62] S. Zhang, S. Xia, W. Fan, B. Shi, X. Xiong, Z. Zhong, M. Ma, Y. Sun, and D. Pei, “Failure diagnosis in microservice systems: A comprehensive survey and analysis,” TOSEM, 2025.
[63] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, X. Wu, M. Zhang, Q. Chen, X. Gao, X. Gao, et al., “Tracediag: Adaptive, interpretable, and efficient root cause analysis on large-scale microservice systems,” in ESEC/FSE, 2023.
[64] X. Zhang, S. Ghosh, C. Bansal, R. Wang, M. Ma, Y. Kang, and S. Rajmohan, “Automated root causing of cloud incidents using in-context learning with GPT-4,” in FSE, 2024.
[65] M. Ma, Z. Yin, S. Zhang, S. Wang, C. Zheng, X. Jiang, H. Hu, C. Luo, Y. Li, N. Qiu, et al., “Diagnosing root causes of intermittent slow queries in cloud databases,” VLDB, vol. 13, 2020.
[66] J. Jiang, W. Lu, J. Chen, Q. Lin, P. Zhao, Y. Kang, H. Zhang, Y. Xiong, F. Gao, Z. Xu, et al., “How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems,” in ESEC/FSE, 2020.
[67] T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending root-cause and mitigation steps for cloud incidents using large language models,” in ICSE, 2023.
[68] C. Zhang, R. Yao, S. Qin, Z. Li, S. Agrawal, B. R. Mishra, T. Tran, M. Ma, Q. Lin, M. Chintalapati, et al., “Deoxys: A causal inference engine for unhealthy node mitigation in large-scale cloud infrastructure,” in SoCC, 2024.
[69] H. Li, M. Ma, Y. Liu, P. Zhao, S. Li, Z. Li, M. Chintalapati, Y. Dang, C. Bansal, S. Rajmohan, et al., “Can we trust auto-mitigation? improving cloud failure prediction with uncertain positive learning,” in ISSRE, pp. 499–510, 2024.
[70] Y. Zhao, L. Jiang, Y. Tao, S. Zhang, C. Wu, Y. Wu, T. Jia, Y. Li, and Z. Wu, “How to manage change-induced incidents? lessons from the study of incident life cycle,” in ISSRE, 2023.
[71] A. L. Strauss and J. M. Corbin, Grounded theory in practice. Sage, 1997.
[72] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and psychological measurement, vol. 20, 1960.
[73] Z. Wang, C. Pei, M. Ma, X. Wang, Z. Li, D. Pei, S. Rajmohan, D. Zhang, Q. Lin, H. Zhang, J. Li, and G. Xie, “Revisiting VAE for unsupervised time series anomaly detection: A frequency perspective,” in WWW, 2024.
[74] Y. Chen, C. Zhang, M. Ma, Y. Liu, R. Ding, B. Li, S. He, S. Rajmohan, Q. Lin, and D. Zhang, “Imdiffusion: Imputed diffusion models for multivariate time series anomaly detection,” VLDB, vol. 17, 2023.
[75] Z. Zeng, Y. Zhang, Y. Xu, M. Ma, B. Qiao, W. Zou, Q. Chen, M. Zhang, X. Zhang, H. Zhang, X. Gao, H. Fan, S. Rajmohan, Q. Lin, and D. Zhang, “Traceark: Towards actionable performance anomaly alerting for online service systems,” in ICSE (SEIP), 2023.
[76] J. Gu, J. Wen, Z. Wang, P. Zhao, C. Luo, Y. Kang, Y. Zhou, L. Yang, J. Sun, Z. Xu, B. Qiao, L. Li, Q. Lin, and D. Zhang, “Efficient customer incident triage via linking with system incidents,” in ESEC/FSE, 2020.
[77] J. Wester, T. Schrills, H. Pohl, and N. van Berkel, ““as an ai language model, i cannot”: Investigating llm denials of user requests,” in CHI, 2024.
[78] V. Ganatra, A. Parayil, S. Ghosh, Y. Kang, M. Ma, C. Bansal, S. Nath, and J. Mace, “Detection is better than cure: A cloud incidents perspective,” in ESEC/FSE, 2023.
[79] C. Di Martino, Z. Kalbarczyk, R. K. Iyer, G. Goel, S. Sarkar, and R. Ganesan, “Characterization of operational failures from a business data processing saas platform,” in ICSE, 2014.
[80] T. Cui, Y. Wang, C. Fu, Y. Xiao, S. Li, X. Deng, Y. Liu, Q. Zhang, Z. Qiu, P. Li, et al., “Risk taxonomy, mitigation, and assessment benchmarks of large language model systems,” arXiv:2401.05778, 2024.
[81] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang, and Y. Liu, “Jailbreaking chatgpt via prompt engineering: An empirical study,” arXiv:2305.13860, 2023.
[82] T. Y. Zhuo, Z. Li, Y. Huang, F. Shiri, W. Wang, G. Haffari, and Y.-F. Li, “On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on codex,” in EACL, 2023.
[83] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An empirical study of gpt-3 for few-shot knowledge-based vqa,” in AAAI, 2022.
In GenAI cloud services, incidents often stem from diverse root causes such as code defects or misconfigurations, which make diagnosis and resolution more complex than in traditional services. These systems require more infrastructure fixes, code changes, and configuration updates, which are 2.5 to 3 times more frequent than in non-GenAI services. The mismatch between causes and fixes (21.5% of incidents are due to code bugs, but only 7.6% are fixed with code changes) suggests an inherent complexity and the need for diverse remediation strategies beyond simple code adjustments.
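As one illustration of the code-defect root causes mentioned above, the data-constraint bug described in the root cause analysis (a malformed fine-tuning dataset that is not rejected at upload) could be guarded against with an upload-time check. The following is a minimal sketch under stated assumptions: the JSONL layout, the `prompt` and `completion` field names, and the function name `validateDatasetUpload` are hypothetical and are not taken from the FileUpload API.

```typescript
// Result of checking one uploaded dataset file.
interface ValidationResult {
  ok: boolean;
  errors: string[];
}

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0;
}

// Check an assumed JSONL dataset line by line so that a malformed file is
// rejected at upload time instead of reaching the backend fine-tuning job.
function validateDatasetUpload(content: string): ValidationResult {
  const errors: string[] = [];
  let records = 0;
  content.split(/\r?\n/).forEach((line, index) => {
    if (line.trim().length === 0) {
      return; // blank lines are ignored, but still count toward line numbers
    }
    records += 1;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      errors.push(`line ${index + 1}: not valid JSON`);
      return;
    }
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      errors.push(`line ${index + 1}: expected a JSON object`);
      return;
    }
    const fields = parsed as Record<string, unknown>;
    // Hypothetical required fields for a fine-tuning record.
    if (!isNonEmptyString(fields["prompt"])) {
      errors.push(`line ${index + 1}: missing or empty "prompt"`);
    }
    if (!isNonEmptyString(fields["completion"])) {
      errors.push(`line ${index + 1}: missing or empty "completion"`);
    }
  });
  if (records === 0) {
    errors.push("dataset contains no records");
  }
  return { ok: errors.length === 0, errors };
}
```

An upload handler built on this sketch would return the collected errors to the caller and refuse to forward the file when `ok` is false, which is a small code change compared with diagnosing the resulting fine-tuning failure downstream.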
The human-in-the-loop approach results in a higher detection rate for GenAI incidents, with 38.3% being identified by humans compared to 13.7% in traditional services . While this highlights the necessity of human oversight due to inadequate automated monitoring, it also suggests scalability issues. Improvements can be made by enhancing automated monitoring tools to better mimic and support human judgment, thereby reducing false positives and increasing detection accuracy, possibly through advanced machine learning techniques .
GenAI cloud services face unique challenges such as performance degradation, deployment failures, and invalid inferences, which impact the reliability and user satisfaction more significantly than traditional cloud services . The complexity and nature of these challenges result in GenAI incidents requiring more time to mitigate (1.12 time units on average) compared to non-GenAI services (0.65 time units). Additionally, the higher incidence of human-detected issues (38.3%) as opposed to automated detection indicates a maturity gap in monitoring capabilities for GenAI services compared to traditional services .
The complexity of GenAI systems, characterized by vast infrastructures and intricate dependencies, significantly influences their incident management life cycle by prolonging time to mitigate (TTM) and complicating root cause analysis. This complexity necessitates advanced diagnostic capabilities and more refined mitigation strategies. For development and maintenance, it implies the need for robust monitoring tools and a deep understanding of system behavior to preempt and manage incidents effectively, which in turn requires ongoing refinement of both infrastructure and processes.
Invalid or false alarm incidents in GenAI services are handled by reassigning them to the appropriate teams, and they require careful verification. These incidents, often detected manually due to insufficient automated tools, highlight the challenge of distinguishing genuine issues from false positives. The significant presence of false alarms (11.0% in GenAI compared to 3.8% in non-GenAI services) suggests an urgent need to improve the accuracy of monitoring systems, reducing unnecessary reassignments and focusing resources on legitimate incidents.
The TTM for GenAI incidents is longer partly due to the complex nature of these systems, which involve interconnected infrastructure layers and dependencies that make diagnosis and resolution more time-consuming. Strategies to address this include developing more sophisticated automation tools for incident mitigation, prioritizing quick-fix methods such as rollbacks to minimize downtime, and enhancing root cause analysis through advanced AI models.
Continuous improvement of monitoring tools is crucial in managing GenAI cloud service incidents, given the high rate of human-detected incidents and false positives. Tools need to address the specific demands of GenAI systems by improving accuracy and reducing detection times. Necessary improvements include developing scalable algorithms for automated anomaly detection, refining monitoring capabilities to reduce reliance on human detection, and incorporating advanced AI methodologies to handle the complex data environments of GenAI systems effectively.
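As a point of reference for what automated anomaly detection means at its simplest, the sketch below flags metric samples that deviate strongly from a trailing window. It is an illustrative baseline only, not a method described in this study. The function name, the window size, the threshold, and the sample latency values are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_zscore_alerts(values, window=5, threshold=3.0):
    """Return indices of samples that deviate from the trailing-window mean
    by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    alerts = []
    for i, v in enumerate(values):
        # Only score a sample once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            # Skip perfectly flat windows to avoid dividing by zero.
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                alerts.append(i)
        history.append(v)
    return alerts


# Hypothetical per-minute latency samples (ms) with one obvious spike.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 99]
print(rolling_zscore_alerts(latency_ms))
```

A fixed threshold like this is exactly the kind of rule that tends to produce false alarms or missed detections on noisy GenAI workloads. For example, a spike inflates the window's variance and can mask anomalies that follow it, which is one reason more refined detectors are needed.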
Research areas critical for improving GenAI cloud service incident management include developing automated methods that detect invalid inferences more effectively and at lower cost, as current approaches such as consistency scores and model fine-tuning are inadequate. Advancements in these areas could lead to more reliable detection and resolution of issues, reducing both false positives and the dependency on human oversight, thereby enhancing service reliability and decreasing TTM.
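The text does not define how a consistency score is computed. One common reading is to sample several responses to the same prompt and measure how much they agree. The sketch below follows that reading as an assumption, using plain lexical similarity from Python's standard library as a stand-in for whatever similarity measure a real system would use. The function name and the sample responses are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(responses):
    """Mean pairwise string similarity in [0, 1] across sampled responses.

    Low scores mean the samples disagree, which is treated here as a
    signal that the inference may be invalid.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Hypothetical samples for one prompt: two agree, one diverges.
samples = [
    "The capital of France is Paris.",
    "The capital of France is Paris.",
    "France's capital city is Lyon.",
]
print(f"{consistency_score(samples):.2f}")
```

A score of this kind needs several model calls per prompt, which illustrates the cost concern raised above, and lexical agreement alone cannot tell a consistently wrong answer from a correct one.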
GenAI cloud services exhibit a higher rate of human-detected incidents (38.3%, compared to 13.7% in non-GenAI services), indicating a reliance on human oversight. The false alarm rate for GenAI services is also higher (11.0%, against 3.8% in non-GenAI services), implying that current monitoring tools are not as mature or accurate in GenAI contexts. This underscores the need for improved and more tailored monitoring solutions to address the unique challenges faced by GenAI services.
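To make the size of these gaps concrete, the short sketch below computes the GenAI to non-GenAI ratio for the figures quoted in this section. The numbers are copied from the text; the dictionary layout and names are illustrative choices.

```python
# (GenAI value, non-GenAI value) pairs quoted in this section.
metrics = {
    "human-detected incidents (%)": (38.3, 13.7),
    "false alarms (%)": (11.0, 3.8),
    "mean time to mitigate (time units)": (1.12, 0.65),
}

# Ratio of the GenAI figure to the non-GenAI figure for each metric.
ratios = {name: genai / other for name, (genai, other) in metrics.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Human detection and false alarms are both close to three times as frequent for GenAI services, while mitigation takes somewhat under twice as long.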
Analyzing high-severity GenAI incidents separately is crucial because they cause significant service disruptions that affect numerous tenants and customers. Such analysis provides insights into the root causes and symptoms unique to severe incidents, facilitating targeted improvements in incident management and resolution strategies. It also uncovers patterns and commonalities that can inform preemptive measures, enhancing the overall reliability and robustness of GenAI cloud services.