Production Incidents in GenAI Cloud Services

Uploaded by

Parameswara Prasad

An Empirical Study of Production Incidents in Generative AI Cloud Services

Haoran Yan∗1, Yinfang Chen∗2, Minghua Ma†3, Ming Wen1, Shan Lu3, Shenglin Zhang4, Tianyin Xu2,
Rujia Wang3, Chetan Bansal3, Saravan Rajmohan3, Qingwei Lin5, Chaoyun Zhang5, Dongmei Zhang5

1 Huazhong University of Science and Technology, China, {haoran yan, mwenaa}@[Link]
2 University of Illinois Urbana-Champaign, USA, {yinfang3, tyxu}@[Link]
3 Microsoft, USA, {minghuama, shanlu, rujiawang, chetanb, [Link]}@[Link]
4 Nankai University, China, {zhangsl}@[Link]
5 Microsoft, China, {[Link], qlin, dongmeiz}@[Link]

arXiv:2504.08865v2 [[Link]] 14 Aug 2025

Abstract: The ever-increasing demand for generative artificial intelligence (GenAI) has motivated cloud-based GenAI services such as Azure OpenAI Service. As with any large-scale cloud service, failures are inevitable in cloud-based GenAI services, resulting in user dissatisfaction and significant monetary losses. However, GenAI cloud services, characterized by their massive parameter scales, hardware demands, and usage patterns, present unique challenges compared to traditional cloud services, including generated content quality issues and privacy concerns. To understand the production reliability of GenAI cloud services, we analyzed production incidents from Microsoft spanning the past four years. Our study (1) presents the general characteristics of GenAI cloud service incidents at different stages of the incident life cycle; (2) identifies the symptoms and impacts of these incidents on GenAI cloud service quality and availability; (3) uncovers why these incidents occurred and how they were resolved; and (4) discusses open research challenges in incident detection, triage, and mitigation, and sheds light on potential solutions.

Index Terms: Incident Management, Generative AI, Cloud Service Reliability, Empirical Study

* Haoran Yan and Yinfang Chen contributed equally. Work was performed during their internship at Microsoft.
† Corresponding author.

I. INTRODUCTION

In recent years, there have been significant advancements in generative artificial intelligence (GenAI), particularly in Large Language Models (LLMs) and their applications across various fields. Beyond natural language processing, these models have also shown new capabilities in image recognition [1], [2], data analysis [3], [4], software engineering [5]–[8], and more [9]–[11]. The emergence of models like the GPT-4 family marks a new era, with capabilities extending to complex reasoning [12], [13], creative thinking [14]–[16], and even surpassing human expertise in certain tasks [17]. This innovation has resulted in impactful research findings and practical applications with substantial implications for scientific research and socio-economic development.

GenAI demands unprecedented computational resources, including the hardware for operating the models as well as infrastructure systems for efficiently allocating and utilizing such resources [18]. However, both the acquisition of the required resources and their efficient management pose significant challenges to individuals and even enterprises. This motivates the development of GenAI cloud services, which offer a platform where developers and users can create, deploy, and utilize large models without substantial hardware and software investments, e.g., Cloud for AI (Cloud4AI), and which also incorporate model APIs within cloud systems, e.g., AI for Cloud (AI4Cloud) [5]–[7], [19]–[21]. Popular GenAI cloud services include Azure OpenAI, Amazon Bedrock, IBM Watson, and Anthropic Claude. GenAI cloud services afford enterprises the infrastructure necessary for the deployment and maintenance of GenAI models, based on which users can further interact with, analyze, and fine-tune such models. Moreover, they are crucial in promoting collaboration among researchers by providing shared access to advanced models and computational resources.

As with any large-scale cloud service, GenAI cloud services are not immune to occasional incidents. These events, while often unavoidable due to the complexity and scale of the systems involved, have the potential to impact user experience and, in some cases, result in challenges such as user dissatisfaction or economic implications. For example, OpenAI recently experienced an incident where request failures and high latency severely impacted ChatGPT’s API and functionalities [22]. However, despite the critical importance of reliability in GenAI cloud services, there is a notable lack of research focusing on their reliability and incident management. Therefore, understanding the characteristics of these incidents, including detection, triage, diagnosis, and mitigation, is crucial for enhancing the quality of GenAI cloud services.

Before the era of GenAI cloud services, traditional ML platforms like AzureML, AWS SageMaker, and Google Cloud ML were primarily used for tasks such as training, inference, and model fine-tuning [23]. These services have been well studied for issues like deployment challenges, fault taxonomy, and bug characteristics [24]–[26], while extensive research has similarly examined incident management practices, root causes, and triage procedures in conventional cloud services [6], [27]–[32]. However, GenAI cloud services fundamentally differ from these. Specifically, GenAI services such as large language models (LLMs) rely on massive parameter scales, have high hardware demands, and provide natural language-driven applications like text generation, summarization, and translation, which traditional cloud services do not [33], [34]. These services also allow users to fine-tune models using user-uploaded datasets [35], exposing risks from model-level behavior changes. Moreover, they provide intuitive conversational user interfaces, making them accessible to a broader audience while adding complexity and risks in managing user interactions. Such characteristics create new reliability issues related to model quality, privacy, and performance, layered atop conventional reliability concerns. Therefore, due to the distinctive challenges of GenAI cloud services, it is necessary to investigate GenAI incident patterns, impacts, and mitigation strategies to ensure dependable and reliable GenAI services in the future.

In this study, we examine incidents in the GenAI cloud service of Microsoft, a leader in the GenAI field, known for hosting GPT series models. Microsoft’s Incident Management system (IcM) documents a wide range of incident data, including root causes, mitigation steps, and detailed engineer discussions, enabling a comprehensive and comparative analysis of GenAI cloud service incidents alongside conventional cloud services. Our investigation reveals that while some traditional reliability challenges, such as system downtime or latency issues, remain relevant in GenAI services, new and unique challenges have emerged. For example, incidents like response quality degradation show that models can unexpectedly produce low-quality or even inappropriate output from simple prompts. We term these incidents GenAI incidents.

Our study leads to crucial findings. For instance, we find that (1) GenAI incidents manifest as performance degradation (49.8%), deployment failure (35.7%), and invalid inference (14.5%), significantly impacting both service reliability and user satisfaction; (2) GenAI cloud services experience a higher rate of incidents detected by humans rather than automated monitors (38.3%) compared to other services (13.7%). Also, there is a higher false alarm rate for GenAI (11.0%) versus other services (3.8%); (3) Because many GenAI incidents are reported by humans, they often need to be re-assigned to different teams, and GenAI incidents need more time to mitigate (1.12 time units on average) than those in other services (0.65 time units on average); (4) During mitigation, a specific root cause is not tied to a single type of fix. For example, while code bugs account for 21.5% of the GenAI incidents, only 7.6% of fixes are code changes, with other strategies being employed. Given the tight deadlines for on-call engineers, quick approaches like rollback are prioritized to reduce downtime.

In summary, this paper makes the following main contributions:
• We make the first attempt to unravel the general behavior of incidents occurring in GenAI cloud services by collecting and analyzing a large number of GenAI-related incidents from Microsoft.¹
• We identify not only the symptoms and impacts of high-severity GenAI incidents but also uncover the root causes behind them and how they were mitigated, with many real-world incident cases.
• We reveal the challenges of handling GenAI incidents at different stages of the incident life cycle and provide insights into improving the reliability of large-scale GenAI cloud services.

¹ Due to company policy, we hide the actual numbers and present normalized numbers in this paper.

II. BACKGROUND AND MOTIVATION

In this section, we begin by introducing Large Language Model (LLM) cloud services and incident management, as illustrated in Figure 1. Subsequently, we outline the motivation behind our study.

[Figure: GenAI cloud services (Chat, Design, Analysis, Finetuning, Embedding, ...) feeding an incident management cycle of Incident Detection, Triage after Localization, Root Cause Diagnosis, and Mitigation/Resolution, annotated with Time to Mitigation (TTM).]
Fig. 1: Incident management of GenAI cloud services.

A. GenAI Cloud Service

Given their substantial parameter scale, foundation models such as GPT-4 are typically deployed in cloud systems like Azure OpenAI. This Cloud4AI service offers users a convenient means to access advanced language models without the complexities of managing infrastructure or undertaking extensive local computations. GenAI also has APIs for cloud services, as seen with Copilot [36], referred to as AI4Cloud. In our study, both Cloud4AI and AI4Cloud are the subjects of our investigation, which we collectively refer to as GenAI cloud services. In our study of GenAI cloud services’ incidents (termed GenAI incidents), we collect incident data from the Microsoft Incident Management system (IcM) [37] (Section III).

B. Incident Management

In cloud services, incidents are common and can lead to service disruptions, economic losses, and other unexpected severe consequences. To address such issues, major cloud providers like Microsoft typically follow four main procedures: detection, triage, diagnosis, and mitigation (Figure 1).
• Detection. This step detects service violations or performance issues and creates a ticket to record relevant information [38]–[50]. Such incidents can be detected manually (e.g., by customers or engineers) or automatically (e.g., by the service monitor) [38]–[50].
• Triage. This process assigns the detected incident to a responsible team [50]–[54]. Due to the complexity of cloud service systems, determining the appropriate team may require multiple rounds of discussion, and reassignment may also be necessary.
• Diagnosis. The assigned team analyzes the incident to determine its root cause by examining system logs and configuration settings to isolate the problem and identify corresponding factors [7], [50], [55]–[65].
• Mitigation. The mitigation step often accompanies the diagnosis, as engineers strive to promptly resolve the incident to minimize the Time to Mitigate (TTM) [66]–[69].

C. Motivation

GenAI, especially LLMs such as OpenAI’s ChatGPT, has witnessed a surge in popularity, with ChatGPT having over one million users in its debut week. However, such increased adoption has also unveiled potential risks, including outages and errors. Figure 2 shows how the number of GenAI-related incidents within Microsoft has varied over the recent four years, also highlighting changes in the number of incidents across different severity levels. The lower the severity level, the higher the impact to customers.

[Figure: line chart of # Normalized Incidents (y-axis, 0 to 100) against Date (Month \ Year), from June 2020 to February 2024. Series: Total, High Severity, Medium Severity, Low Severity. Annotated releases: GPT-3, GPT-3.5, ChatGPT, GPT-4.]
Fig. 2: Number of GenAI incidents at different times.

The total number of incidents (shown in gray) follows an upward trend. Specifically, before the release of the GPT-3.5 model in March 2022, GenAI-related incidents account for a mere 3% of the total incidents within the GenAI cloud service. After 2023, there is a significant increase in incidents, with a pronounced spike following the introduction of GPT-4 in March 2023. At this point, the volume of incidents had increased nearly tenfold relative to the figures reported during the GPT-3.5 era. This dramatic rise can be attributed to the global fame attained by the GPT model, which attracted millions of users. This trend also holds across all severity levels, with lower-impact incidents comprising most cases.

The proliferation of GenAI-related incidents affects both the associated cloud services and their end users. Unfortunately, the characteristics of the incidents of GenAI cloud services have not yet been comprehensively unveiled. This study aims to bridge this gap, thus providing insights for future research and practical guidance for the software engineering community maintaining GenAI cloud services.

III. METHODOLOGY

Microsoft, a leader in cloud computing, hosts the training and APIs for OpenAI and offers various GenAI cloud services, including Azure OpenAI, which utilizes the Microsoft platform to provide access to the GPT series models. Incidents in these services are documented in a dedicated database. Prior studies [28], [30], [31], [70] have utilized similar databases to collect incidents and derive analytical insights. Consistent with this approach, this study leverages Microsoft’s database to collect GenAI-related incidents.

The database contains key details for each incident, including its description, root cause, mitigation steps, discussions by the on-call engineers (OCEs), and severity-level tags (high, medium, and low). To conduct our empirical study, we collect both GenAI-related incidents and non-GenAI incidents, the latter serving as a comparative dataset. Following the methodology of previous research [28], we focus on significant incidents characterized by their high severity and detailed root cause descriptions, thereby facilitating an insightful qualitative analysis. The details of our methodology follow.

A. Data Collection

We first introduce the detailed procedures used to collect the dataset. In particular, we collect two datasets that serve two incident studies, respectively. The general incident study is designed to explore general characteristics of GenAI incidents within the incident management process, such as the distribution of incidents’ detection methods. It also aims to compare these incidents with those from other cloud services, analyzing the differences between the two. It requires a dataset with broad coverage. Therefore, we endeavor to collect all incidents that meet the criteria as comprehensively as possible. The in-depth incident study focuses on understanding the categories of an incident’s symptoms, root causes, and mitigation strategies based on detailed information, such as discussions by OCEs. Given the large volume of data, we opt to select only high-severity incidents as in-depth analysis cases, as these incidents have a more significant impact on the system and tend to attract greater attention. Through both studies, we can comprehensively understand the characteristics of GenAI incidents.

• GenAI incidents collection for general analysis. In this phase, our primary goal is to gather data on incidents, encompassing the period from June 2020, following the release date of the GPT-3 model by OpenAI, to February 2024. Here are the criteria we use to collect GenAI-related incidents:
1) We choose incidents that have been mitigated or resolved. The incident status is categorized within the “Status” field as “Active”, “Mitigated”, and “Resolved”. Our collection excludes incidents marked as “Active” due to the lack of comprehensive data, such as discussions by OCEs, root cause analysis, and mitigation steps;
2) The “Service” field indicates whether the incident is associated with a GenAI cloud service and its team or not. Incidents are considered GenAI-related if they are linked to a specific GenAI cloud service, such as Azure OpenAI;
3) Given the complex architecture and dependencies of GenAI cloud services, certain GenAI incidents may be managed by dependent (sub-)services and cannot be directly found by “Service”. Thus, we define a vocabulary of words related to GenAI (e.g., “gpt-3.5-turbo”, “LLM”, [Link]). Then we perform a case-insensitive search of these terms within the “Title” of an incident.
Following these criteria, we obtain hundreds of thousands of GenAI-related incidents.

• GenAI incidents collection for in-depth analysis. We meticulously select a subset of GenAI incidents based on three criteria:
1) The incident must be of high severity. Incidents of this nature typically result in significant service disruptions, affecting numerous tenants and customers;
2) The incident should include a detailed root cause analysis;
3) The incident must be valid. We deem an incident invalid if its mitigation steps are described as a “False Alarm”.
Following these criteria, we identified and selected many incidents for our detailed analysis. Given that high-severity incidents inherently constitute a smaller proportion of total incidents, the data collected at this step is significantly less than what is gathered for the general analysis.

• Other incidents collection. For discussion, especially a comparative mitigation analysis between GenAI incidents and those unrelated to GenAI, we collect the same number of other incidents using the same time frame and the same criteria as in the general analysis (omitting criteria (2) and (3)) and the in-depth analysis.

B. Research Questions

In this study, we aim to reveal the behaviors of GenAI incidents in the incident management life cycle. Such insights are critical for the development, maintenance, and management of LLMs, aiming to improve the robustness and reliability of the LLM cloud systems. This exploration is pivotal for providing a scientific basis to prevent future incidents, thereby contributing valuable knowledge and experience to both research and practical applications in the field. In particular, we design the following research questions (RQs).
RQ1. What is the general behavior of GenAI incidents across the different stages of the incident life cycle?
RQ2. What are the symptoms of GenAI incidents?
RQ3. What are the root causes of GenAI incidents?
RQ4. How are GenAI incidents mitigated?

C. Categorization Strategy

While each incident is documented with detailed information, these records are typically composed by humans, e.g., OCEs’ discussions, and may contain images, URLs, and other elements that complicate automatic categorization. Therefore, we need to analyze all the incidents manually to further understand their symptoms, root causes, and mitigation strategies.

We randomly divide our dataset of incidents into three subsets: (1) taxonomy set: 40% of incidents, (2) validation set: 20% of incidents, and (3) test set: 40% of incidents. First, two authors independently follow the open coding strategy [71] to label symptoms, root causes, and mitigation strategies for the taxonomy set. Next, for categories with inconsistent classifications, a meeting involving other authors is convened to determine the final categorization. The two authors then label the validation set to check for the emergence of new categories and hold further discussions to refine their understanding of each category. Finally, they label the test set and employ Cohen’s kappa [72] coefficient to measure the consistency between annotators.

After multiple rounds of the labeling process described above, we ultimately adopt the best result, achieving near-perfect agreement across the three taxonomies: Symptom: 0.921, Root Cause: 0.930, Mitigation: 0.893. For incidents that can fit into multiple categories, e.g., multiple symptoms, disagreements are resolved by focusing on the category most prominently reflected in the incident and the OCEs’ discussions.

IV. RQ1: GENERAL STATISTICS

We explore the characteristics of GenAI incidents from three aspects, detection, triage, and mitigation, each corresponding to a phase of the incident life cycle.

A. Incident Detection

Detection is the initial step in incident management for a cloud service. Engineers can identify incidents by noticing unusual system behaviors [32], [73]–[75], while customers can also report issue tickets when encountering failure messages or experiencing delays [76]. To improve the efficiency of incident detection, automated monitoring tools are deployed [38]. These tools either passively collect real-time system telemetry data (e.g., CPU usage) and performance measures (e.g., response time and throughput), or proactively check the health of the system by periodically performing heartbeats or sanity checks. Figure 3 shows a monitor detecting the calling failure rate of a service.

Missing Alarms (False Negative): As shown in Figure 4, we find that 38.3% of the incidents related to GenAI are reported by humans, such as engineers and customers, instead of automated incident monitors. This human-reported percentage is high (the ratio is only 13.7% for other cloud services, as will be further discussed in Section VIII-A). To explain it, we find that 45.9% of GenAI cloud services are still under development or in the preview stage, while 54.1% of GenAI cloud services are in the General Availability status. Moreover, many GenAI cloud service monitors currently build on adaptations of existing frameworks designed for other types of cloud services, which may not yet fully align with the unique requirements of GenAI-specific scenarios. For instance, invalid inference incidents are often identified and reported by users, reflecting the collaborative effort to refine these systems further. Our study observes that there are around 25.9 unique monitors per 100 monitor-reported GenAI incidents, compared to 74.4 for other cloud services, offering an opportunity to enhance monitoring diversity. These insights highlight the ongoing evolution of GenAI monitoring approaches, as the
industry continues to refine automated detection capabilities and improve response efficiency.

  Title: monitor evaluated high fail rate for scope [ServiceA], zone [WestRegion2]
  Monitor Name: [Service] FailRate
  Metric: DependencyCallCounter
  Description: Marks the target as ‘Unhealthy’ and raises a high-severity incident if the failure rate exceeds 4% over the past 60 minutes.
  Trouble-shooting Guide: [Link to the TSG]
  Diagnostic Information:
    Failure Description | Count
    ServiceClient failure for [ServiceB]: Failed to call [ServiceB], ReasonPhrase=Failed Dependency | 52
    RequestTimeout for [EncoderService] | 13
    ServiceClient failure for ChatGPT: No service for ‘BotClientLibrary’ has been registered. | 8
    ServiceClient failure for ChatGPT: Failed to call ‘ChatGPT’ at LoadBalancer, ErrorStatusCode=400 | 3
    ClientSecretCredential authentication failed: A configuration issue is preventing authentication. Details: The provided client secret keys for app [ApplicationA-UUID] are expired. | 6
    ... | ...
Fig. 3: Incident detected by a monitor and the collected diagnostic information attached to the monitor.

[Figure: two pie charts. (a) Detection type: Human 38.3%, Monitor 61.7%. (b) Service stage (legend: Dev, Preview, GA): GA 54.1%, with the remaining slices of 36.9% and 9.0% split between Dev and Preview.]
Fig. 4: GenAI incident detection type and different stages of GenAI services. GA: General Availability, Dev: Development.

Wrong Alarms (False Positives): The false alarm rate for incidents detected by monitors in GenAI services is notably high at 11.0% (Table I), compared to 6.6% for those detected by humans. This higher false positive rate stems primarily from the sensitivity of the monitoring systems. For example, the monitor in Figure 3 issues an incident report if the failure rate exceeds 4% within one hour. If the failure rate threshold is set lower or the monitoring period is shortened, the monitor becomes more sensitive, possibly leading to more false positives. These false alarms burden engineers with unnecessary investigations, thus delaying the resolution of true incidents.

TABLE I: Detection type distribution and false alarm rate for GenAI and non-GenAI incidents.

           Detection Type        False Alarm Rate
           Human     Monitor     Human     Monitor
  GenAI    38.3%     61.7%       6.6%      11.0%
  Other    13.7%     86.3%       4.8%      3.8%

Finding 1: GenAI cloud services and their monitoring are still in an early stage. A high percentage (38.3%) of GenAI incidents are reported by humans. Besides, among the incidents detected by the automated monitors, there is an 11.0% false alarm rate, which points to opportunities for further enhancement in monitoring precision.

B. Incident Triage

Triage is a crucial component of the incident management life cycle, significantly affecting the Time-to-Mitigate (TTM) [32]. Incidents can be sent to incorrect teams or need collaborative efforts, leading to cases where they are re-assigned between different teams. The process of reassigning an incident from one team to another is called a transfer hop.

[Figure: two pie charts of transfer hops (legend: Hop=1, Hop=2, Hop>=3). Monitor-detected incidents: 90.7% at Hop=1, with remaining slices of 8.6% and 0.7%. Human-detected incidents: 85.7% at Hop=1, with remaining slices of 11.1% and 3.2%.]
Fig. 5: Transfer hops for incidents.

As shown in Figure 5, incidents that are initially detected by monitoring systems are usually accurately triaged to the correct team on their first attempt (90.7%). However, the proportion of incidents needing reassignment increases when they are detected by humans: 14.3% of GenAI incidents detected by humans undergo reassignment. This shows the effectiveness of using automatic monitors for triage. For example, the monitor-generated ticket title embeds the name of the service that leads to the incident, as shown in Figure 3, so the incident can be accurately triaged to the service team. Another factor in incident re-assignment is the interdependency on other services. Resolving an incident might exceed the capabilities of a single team, and collaborative efforts across different
5
5 5 5 subsections are ordered based on their perceived impact on
4 4 4 service operation and user experience.
Time Unit

Time Unit

Time Unit
3 3 3
2 2 2 A. Invalid Inference (14.5%)
1 1 1 While the model inference executes successfully and the
0 0 0 service returns results to clients without errors, the model
High Sev Medium Sev Low Sev HasTSG NoTSG Human Monitor
output can be invalid. Inaccuracies in the output directly affect
(a) (b) (c) the core functionality of GenAI services. (1) Response Quality
Fig. 6: TTM distribution across different factors: Y-axis is the Degradation (10.7%): Models can generate low quality con-
normalized TTM of all incidents; the top whisker of each box tent with even simple user prompt. Another scenario involves
plot represents the maximum value; the top and bottom edge the generation of invalid content, where the model could
of the box represent the upper quartile and the lower quartile, not understand the user’s prompt, leading to invalid content
respectively, and the line inside the box represents the median creation [77]; (2) Prompt/Response Content Filter Malfunction
value. (a) Different severity levels; (b) The presence of a TSG; (3.8%): GenAI cloud services deploy policy filters for both
(c) Detection types. user prompts and model responses to prevent the generation of
harmful content. However, these content filters can sometimes
malfunction, resulting in inappropriate or harmful content from
service domains are needed. Further details on the root causes the model, as well as false alarms that incorrectly filter out
of GenAI incidents will be elaborated in Section VI. valid prompts or responses.

C. Incident Mitigation B. Deployment Failure (35.7%)

Intuitively, we would expect incidents with higher severity Deployment failures reflect the impact on GenAI service
to have longer TTM, as these incidents usually require more continuity. We find: (1) Model deployment failure (12.0%):
extensive investigation and resolution efforts. For example, When users are training or fine-tuning large language models,
as shown in Figure 6a, high-severity incidents generally take the deployment failure may happen. For instance, all of the
longer to resolve than medium-severity. However, low-severity user fine-tuned models were not successfully deployed in
incidents exhibit a significantly longer TTM compared to other time for a specific deployment region; (2) Resource deploy-
severity levels because these lower-priority GenAI incidents ment failure (14.4%): GenAI cloud services heavily depend
often remain unresolved for extended periods due to their low on different types of resource deployment, like computing,
impact. networking and storage resources for consuming, transmitting
Incidents accompanied by a TSG are resolved more swiftly and storing vast volumes of data. Failed deployment of these
than those without one, as TSGs provide clear guidelines and can propagate exceptions to other parts of the GenAI services;
solutions that facilitate faster mitigation (Figure 6b). Further- (3) Fine-tune API failure (9.3%): GenAI cloud services offer
more, our analysis reveals that incidents generated by monitors are mitigated more quickly than those reported by humans (Figure 6c). This is partly because monitors, as illustrated in Figure 3, often include links to corresponding TSGs. By following the TSG instructions, diagnostic information is more readily collected. For example, it becomes immediately clear that the root cause of the incident in Figure 3 is expired secret keys of the application, thereby enabling quicker resolution.

Finding 2: Automatic monitors and troubleshooting guides (TSGs) can significantly speed up the mitigation process and reduce the Time to Mitigation for GenAI incidents.

V. RQ2: SYMPTOM OF GENAI INCIDENTS

We analyze the symptoms of GenAI incidents from available incident-related telemetry data (metrics, logs, traces, etc.) and discussion threads from on-call engineers. We categorize the symptoms into invalid inference, deployment failure, and degraded performance. Note that one incident may have multiple symptoms, and we choose the major symptom as its category as mentioned in Section III-C. The following

interfaces for uploading/downloading data, model selection, and parameter setting, which users can customize to fine-tune their own models. However, a failure may happen when calling such fine-tune REST APIs. For instance, a conflicting version requirement caused the failure of the fine-tune API calls.

C. Degraded Performance (49.8%)

There are two typical types of performance degradation: (1) Service-level Degradation (27.2%): Multiple APIs within a GenAI service can fail simultaneously, impacting the overall availability and performance of that service. Also, if multiple service nodes become unhealthy, e.g., out-of-memory or disk pressure, the performance of the whole service can be affected. (2) API-level Degradation (22.6%): A particular GenAI API can be delayed. Degraded performance is primarily due to infrastructure and configuration issues, as discussed in Section VI.

Finding 3: GenAI incidents manifest as three kinds of symptoms: invalid inference (14.5%), deployment failure (35.7%), and performance degradation (49.8%).
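
The labeling rule described in this section (an incident may show several symptoms, but it is counted once under its major symptom) can be sketched as a small tally. The TypeScript below is a minimal illustration under stated assumptions: the `LabeledIncident` shape, the convention that the first listed symptom is the major one, and the sample data are all hypothetical and are not taken from the study's dataset.

```typescript
// The three symptom categories named in this section.
const SYMPTOMS = ["invalid inference", "deployment failure", "degraded performance"] as const;
type Symptom = (typeof SYMPTOMS)[number];

// Hypothetical record shape: the first entry of `symptoms` is treated as the major symptom.
interface LabeledIncident {
  id: string;
  symptoms: Symptom[];
}

// Count each incident once, under its major symptom, and return percentage shares.
function symptomDistribution(incidents: LabeledIncident[]): Record<Symptom, number> {
  const counts: Record<Symptom, number> = {
    "invalid inference": 0,
    "deployment failure": 0,
    "degraded performance": 0,
  };
  let labeled = 0;
  for (const incident of incidents) {
    if (incident.symptoms.length === 0) continue; // skip incidents without a label
    counts[incident.symptoms[0]] += 1;
    labeled += 1;
  }
  const shares: Record<Symptom, number> = {
    "invalid inference": 0,
    "deployment failure": 0,
    "degraded performance": 0,
  };
  for (const symptom of SYMPTOMS) {
    shares[symptom] = labeled === 0 ? 0 : (100 * counts[symptom]) / labeled;
  }
  return shares;
}

// Toy data for illustration only.
const sample: LabeledIncident[] = [
  { id: "inc-1", symptoms: ["degraded performance", "deployment failure"] },
  { id: "inc-2", symptoms: ["degraded performance"] },
  { id: "inc-3", symptoms: ["deployment failure"] },
  { id: "inc-4", symptoms: ["invalid inference", "degraded performance"] },
];

const distribution = symptomDistribution(sample);
```

Because every incident contributes to exactly one category, the shares always sum to 100%, which is why the three symptom percentages reported in this section add up to the whole dataset.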

Root Cause              Degraded Performance   Deployment Failure   Invalid Inference
Infrastructure Issue    14.6                   9.4                  3.2
Configuration Issue     13.2                   9.4                  1.9
Code Bug                9.8                    7.0                  4.7
External Usage Issue    8.0                    3.3                  2.8
Operation Error         4.2                    6.6                  1.9

Fig. 7: Relationships between symptom and root cause. Rows are root causes, columns are symptoms, and values are percentages (%).

VI. RQ3: ROOT CAUSE

We categorize the root causes of GenAI incidents into five distinct types. The relationships between symptoms and root causes are shown in Figure 7. Each cell represents the percentage of incidents that pair a specific symptom with a particular root cause. We can observe that a single symptom can come from multiple root causes rather than following a simple one-to-one relationship. This indicates that diagnosing the root cause from symptoms is not straightforward.

A. Infrastructure Issue (27.2%)

GenAI cloud services are built upon a complex hierarchical infrastructure comprising VMs, nodes, clusters, and data centers that host tightly coupled resources, including CPU, memory, storage, and networks. We find that infrastructure issues are a major cause of degraded performance and deployment failure (Figure 7). Infrastructure issues are categorized into the following types: (1) Infrastructure Maintenance Issues (17.8%): Failures of hardware components, such as worn-out GPUs, can impact the fine-tuning and inference of GenAI services. For instance, faulty GPUs can process requests incorrectly, resulting in errors such as gibberish outputs. (2) Network Issues (4.7%): Besides the network bandwidth, incidents can arise in the communication between VMs and nodes within clusters, including connectivity issues and DNS resolution failures. Such network problems can severely disrupt the performance and reliability of the service. (3) Storage Issues (4.7%): The management of vast amounts of data needs robust storage solutions. Failures in data storage or IO operations, such as data corruption or delays, can lead to service disruptions.

Finding 4: Infrastructure issues are a key area of focus for understanding and addressing incidents in GenAI cloud services, especially for degraded performance and deployment failure. To meet the growing user demands, GenAI cloud services should not only scale up the size of GPU clusters but also prioritize robust infrastructure management.

B. Configuration Issue (24.5%)

GenAI cloud services rely on a multitude of configuration settings to ensure the seamless operation of their interconnected components. However, mismanagement of these configurations is occasionally observed. Incorrect or unsynchronized settings can break service functionality. We categorize these configuration issues into the following types: (1) Misconfiguration (13.1%): Operators may employ incorrect configurations or commit errors, typically due to human mistakes. For example, engineers might configure far fewer model instances than required during system maintenance, leading to an outage or degraded performance. (2) Configuration Update (6.4%): Changes in one cloud component’s configurations can lead to incompatibilities with other components due to the configuration dependencies among them. Additionally, version conflicts for the same configuration may result in one configuration overriding another, e.g., using a removed parameter in its latest version or using an added parameter in its previous version, leading to malfunctions. (3) Configuration Missing and Gaps (5.0%): Missing or disabled configurations can disrupt normal operations. Additionally, certain configurations impose range restrictions on values, such as timeout thresholds or maximum sizes for prompt tokens. Under unexpected circumstances, such as a sudden surge in user traffic, these static configurations can constrain system performance.

C. Code Bug (21.5%)

Code bugs are a primary cause of incidents, and prior work [30] has specifically investigated the code bugs leading to cloud incidents. We identify four types of code bugs for GenAI incidents: data constraints bugs, content filter bugs, exception handling bugs, and cross-system bugs. (1) Bugs Violating Data Constraints of the Model (6.7%): Bugs can arise due to inadequate validation of the data format or missing data that the model needs to consume. For example, a fine-tune failure can be caused by the lack of validation on the dataset format in the FileUpload API: the malformed dataset was not rejected during the file upload stage and was delivered to the backend services. (2) Prompt/Response Content Filter Bugs (2.2%): Code defects can exist in the prompt or response filter. (3) Exception Handling Bugs (6.3%): Exceptions are a normal occurrence during code execution. However, the code can be unable to effectively handle certain exceptions or failures. For example, errors may occur during model deployment, such as an invalid model being deployed to an endpoint. Due to a code defect in processing such an error, e.g., simply swallowing the exception, the invalid model remains there and serves requests. (4) Cross-system Bugs (6.3%): These bugs are mostly caused by issues in the code across multiple components. To fix this type of bug, changes are needed in multiple services.

D. External Usage Issue (14.1%)

Incidents can arise from incorrect usage of a GenAI service by the customer. For example, a customer missed indexes when performing queries to an LLM, which caused high CPU usage in the service.

E. Operation Error (12.7%)

Operation errors in GenAI cloud services are typically caused by human errors during the management and opera-
tional processes. This error occurs when operators mistakenly introduce erroneous or outdated dependencies, or use expired credentials.

VII. RQ4: MITIGATION

To answer RQ4, we delve into the common categories of mitigation strategies utilized to address GenAI incidents. Specifically, we inspect the title and the detailed description of the mitigation steps in each incident ticket and its corresponding postmortem report. Based on these descriptions, engineers’ discussion threads, and completed work bullets, we classify the mitigation methods into the following distinct types: ad-hoc fix, self-recover, rollback, configuration fix, infrastructure fix, external fix, code fix, and others.

A. Code Fix (7.6%)

This category addresses incidents by updating and fixing buggy code or by incorporating new code [30], such as adding exception handling mechanisms to improve resilience or implementing new features for specific purposes. For example, certain Unicode characters cannot be rendered in a font and thus do not appear in the user interface, resulting in what is called hidden text. However, the hidden text can still be understood and processed by the LLM. This could potentially be exploited as an attack surface to steer the response away from the user’s intent. The following code update adds a new feature to remove the Unicode characters (within the range of U+E0000 to U+E007F) that can be used as hidden text from the user’s request.

    const getCommandText = () =>
      featureFlags.enableRemoveUnicodeFromRequest
        ? removeUnicodeFromRequest(text) : text;
    ...

    // Remove characters in U+E0000..U+E007F, which can be used as hidden text.
    const removeUnicodeFromRequest = (msg: string) => {
      const unescapedMsg = unescapeUnicode(msg);
      const regex = /[\u{E0000}-\u{E007F}]/gu;
      return unescapedMsg.replace(regex, "");
    };

Finding 5: Given the tight deadlines for mitigating GenAI incidents, only a small proportion (7.6%) of incidents are resolved through code fixes. This approach is time-consuming, requiring more effort to design and implement the solution and to navigate an end-to-end CI/CD pipeline. Consequently, engineers prefer other mitigation strategies for their faster resolution times in the initial stages of mitigation.

B. Rollback (15.2%)

For incidents triggered by changes, such as configuration adjustments or code updates, rollback is a widely used and efficient mitigation strategy. Engineers revert these changes to a previous, stable version. Our study identifies two types: (1) Deployment Rollback (8.9%): Updates to code or third-party libraries can introduce bugs. These incidents can be addressed by reverting to a previous commit or an older stable build version of the third-party library. For example, an inference API error caused by a compatibility issue between fine-tuning code and inference code can be fixed by rolling back to a previous inference engine for users in specific regions. (2) Configuration Rollback (6.3%): This involves undoing bad configuration changes to alleviate the issue.

C. Configuration Fix (13.0%)

To address the majority of configuration errors, engineers often fix bugs in configuration files to reinstate the service. We identify two primary approaches to configuration fixes: (1) Add or Disable Features (7.6%): Incidents can be mitigated either by adding new features that enhance service stability or by disabling features that are causing failures, thus aiding in the swift resolution of the issue. (2) Increase the Configuration Limit (5.4%): Besides the configuration issues, a number of incidents caused by resource capacity, as mentioned in Section VI-A, can also be mitigated by configuration changes as a short-term strategy, such as increasing timeout thresholds.

D. Infrastructure Fix (12.1%)

For incidents caused by infrastructure issues, an infrastructure fix is a frequently utilized mitigation method. Common infrastructure fixes include scaling operations, component restarts or rebuilds, and traffic failovers. One of the following actions can be performed: (1) Scaling (6.3%): Due to infrastructure limitations, a service may not be able to handle a large volume of traffic, and simply raising the configured capacity does not work. Therefore, scaling out more instances or nodes to increase capacity is needed. For example, increasing the compute capacity allows the service to process more requests, thus avoiding an excessive number of request failures. (2) Restart or Rebuild (3.1%): This category involves mitigating incidents by restarting or rebuilding faulty components. (3) Traffic Failover (2.7%): This involves failing over traffic to another healthy service component, including nodes, clusters, or another cloud region.

Finding 6: In practice, increasing the limit of resource configuration (5.4%) is a straightforward mitigation strategy. However, when these configuration changes are insufficient because the allocated resources or infrastructure have reached their capacity, re-scaling (6.3%) becomes necessary to resolve the GenAI incidents, even though it may take a long time to deploy the additional infrastructure.

E. Ad-hoc Fix (22.4%)

LLM incidents can be complex, and engineers may not always be familiar with the root cause of GenAI incidents. Because following standardized procedures can be too costly when the impact must be addressed quickly, a series of improvised, situation-specific steps is applied to mitigate the symptom first. For instance, in response to a malicious user bypassing the batch size limitation, engineers mitigated it by identifying and blocking the malicious user, enabling the validation logic to check the ImageModelA-batch-size parameter in the request headers,
and enforcing a maximum limit for the batch size. Also, in other cases where a single user’s request consumed too many background resources and resulted in service overload, the issue was mitigated by temporarily limiting the user’s request rate, adjusting the throttling from 10 seconds to one second for the customer with a high workload. Note that over half of the incidents from other cloud services are mitigated by ad-hoc fixes (Figure 8), while GenAI cloud services often require more development and deployment effort (the other mitigation approaches discussed in this section) to fully resolve the incidents. Consequently, the TTM of GenAI incidents is longer.

Fig. 8: The distribution of mitigation approaches. [Two pie charts, GenAI Incidents and Other Incidents, over the categories Ad-hoc Fix, Self-recover, Rollback, Configuration Fix, Infrastructure Fix, External Fix, and Code Fix.]

Fig. 9: Average TTM for different mitigation approaches. [Bar chart of TTM in time units (0.0 to 2.0) for GenAI and Other incidents across Ad-hoc, Self-recover, Rollback, Configuration, Infrastructure, External, and Code fixes.]

F. External Fix (10.0%)

GenAI cloud services support external company partners and customers, so some incidents are mitigated externally, including by Microsoft Partners and customers. For example, engineers will recommend that customers modify their prompts when their wrong usage causes the model to return unexpected content, or that they switch to a stable model.

G. Self-recover (19.7%)

These transient incidents are automatically mitigated as the service recovers on its own due to its resilience mechanisms, for example, back-off retry, or when the monitoring system no longer detects abnormal indicators, e.g., the heartbeat detection rate returns to normal. Note that self-recovered incidents are not false alarms in our dataset.

VIII. DISCUSSION

A. Lessons Learned

Since the mitigation strategy categories for GenAI and non-GenAI incidents share high similarities, we further perform a comparative study to identify their distinctions. We find that GenAI incidents generally require more time to mitigate compared to other types. Specifically, on average, GenAI incidents take 1.12 time units to resolve, compared to 0.65 time units for non-GenAI incidents.

To reveal the underlying reason: (1) We calculate the TTM for each mitigation category, and find that the longer TTM for GenAI incidents holds across all mitigation categories, as shown in Figure 9, reflecting the complexity of solving various GenAI incidents. Additionally, across all factors we consider (severity levels, detection types, troubleshooting guides) in the general analysis in Section IV-C, the Time to Mitigation (TTM) for LLM incidents is consistently longer than for incidents in other services. (2) We compare the distribution of mitigation approaches, as depicted in Figure 8. The ad-hoc fix (54.7%) accounts for the majority of mitigations for other cloud services, and it has a shorter TTM than any GenAI incident mitigation in Figure 9. The mitigation distribution of GenAI incidents is more balanced, with ad-hoc fixes comprising only 22.4%. This indicates that, for GenAI cloud services in their early development stage, more diverse, sophisticated, and time-consuming methods are required as opposed to applying ad-hoc fixes. (3) The current monitoring tools for GenAI cloud services are being continuously improved to better align with their unique requirements. Enhancements in accuracy and adequacy are expected to help reduce TTM and improve overall efficiency. Unlike conventional cloud services monitored by automated watchdogs, a high percentage of GenAI incidents are detected by humans. According to Table I in Section IV-A, only 13.7% of the incidents were detected by humans for non-GenAI cloud services in our dataset, compared to 38.3% for GenAI incidents. Furthermore, monitor-detected GenAI incidents have an 11.0% false positive alarm rate, significantly higher than the 3.8% observed in other services. This suggests that monitoring for GenAI services is less mature than monitoring for conventional services and requires additional effort to improve.

Longer TTM is also attributed to the difficulty in performing root cause analysis for GenAI incidents. As discussed in Section VI, a single symptom can stem from multiple root causes, thus complicating the debugging of GenAI services. For example, diagnosing unexpected model outputs can be complex; potential causes include faulty hardware, misconfigurations, code defects, or misuse.

B. Implications

Our findings offer actionable insights for a wide range of stakeholders, including researchers, model providers, service maintainers, and developers.

Researchers. Our study highlights several avenues for future research, particularly in automated methods to detect invalid inference results. Currently, invalid outputs (14.5%), such as hallucinations or irrelevant responses, are challenging to detect. The current state-of-the-art detection methods generally include 1) self-judgment by the LLM, 2) fine-tuning another model with human-labeled data, or 3) calculating
consistency scores after multiple attempts. However, none of them is both cost-efficient and fully effective. More robust research is needed to address these limitations and develop scalable validation algorithms that can operate across various GenAI applications.

Model Providers. Besides the high ratio of invalid inference results (14.5%) and challenges in detecting hallucinations or invalid content, another notable finding is that 38% of GenAI incidents are reported by humans, reflecting that monitoring tools are underdeveloped. Moreover, many GenAI cloud services (45.9%) are still under development or in the preview stage, coupled with the scarcity of incident monitor types. Providers should enhance service observability to detect and diagnose issues more effectively, and provide better support and documentation to help users navigate the complexities of GenAI service integration and management.

Service Maintainers. Our study reveals that the Time to Mitigation (TTM) for GenAI incidents is 1.83 times longer than for non-GenAI incidents, highlighting the need for automation in incident mitigation. The complexity of GenAI systems, which involve vast and interconnected layers of infrastructure, dependencies, and configurations, is a significant factor. For example, GenAI cloud systems require 2.5x more infrastructure fixes, 3.0x more code changes, and 3.0x as many configuration updates compared to non-GenAI services. Meanwhile, the more straightforward ad-hoc fixes are applied in only 22.4% of GenAI incidents, compared to 54.7% in non-GenAI services, indicating a reliance on more complex, time-consuming fixes for GenAI systems. Furthermore, diagnosing root causes of GenAI incidents is often complex. A single symptom, such as poor performance (49.8%) or deployment failure (35.7%), can have multiple root causes, including infrastructure problems (27.2%), configuration problems (24.5%), or code bugs (21.5%). Services should provide observability from different dimensions to obtain granular insight into these symptoms and their underlying causes. Maintainers should consider 1) implementing more automation tools or agents for distinct mitigation approaches, 2) adopting more infrastructure-as-code practices to manage complex GenAI cloud infrastructure more effectively, and 3) integrating more automated rollback mechanisms to address compatibility issues swiftly.

Application Developers and Users. For developers, input validation and dynamic rate limiting are critical areas needing improvement. Incidents reveal that special characters, fragmented prompts, and excessive token usage, even within token limits, can disrupt model processing. Developers should implement input validation processes to prevent these issues and adopt dynamic rate-limiting strategies.

IX. RELATED WORK

Empirical studies on cloud incidents. A significant amount of prior work has been devoted to studying the characteristics of incidents occurring in production systems. Ganatra et al. [78] examined incident detection at Microsoft to identify monitoring gaps in cloud platforms. Chen et al. [32] studied incident triage in Microsoft’s online service systems to understand industry practices. Zhao et al. [70] explored change-induced incident lifecycles in large-scale online services, offering management insights. Wang et al. [31] analyzed the time-to-mitigation (TTM) of incidents across 20 Microsoft online services. Building on this, our study delves into incident characteristics, comparing incidents related to GenAI with those of other services. In related work, Liu et al. [30] investigated software bugs causing cloud incidents in Microsoft Azure and their resolutions. Ghosh et al. [28] analyzed incidents in Microsoft Teams, classifying root causes and mitigation steps. Martino et al. [79] characterized failures in a business data processing platform using event log data. Our work builds upon these studies by applying established approaches from traditional cloud incident analysis to the GenAI cloud services context. While following similar research methodologies, we highlight characteristics unique to GenAI incidents, such as symptoms that involve invalid inference, which are not commonly seen in conventional cloud environments.

Empirical studies on LLMs. In recent years, with the rise of large language models (LLMs), numerous related studies have emerged. Cui et al. [80] organize existing studies related to LLMs and propose a comprehensive taxonomy, which systematically analyzes potential risks in LLM systems and discusses corresponding mitigation strategies. Liu et al. [81] investigate the use of jailbreak prompts to bypass restrictions imposed on ChatGPT. They conduct an empirical study to evaluate the effectiveness and robustness of prompts collected from the real world. Zhuo et al. [82] present an empirical study on the adversarial robustness of a prompt-based semantic parser based on Codex. Yang et al. [83] conduct a study on GPT-3 in knowledge-based visual question answering (VQA), treating GPT-3 as a knowledge base (KB) and adapting GPT-3 to solve the VQA task in a few-shot manner. In contrast to these studies, which primarily focus on model behavior and robustness, our work centers on the reliability of LLM-related cloud services. Specifically, we present a novel empirical study of incidents in such services, offering insights into their design, operation, and maintenance.

X. THREATS TO VALIDITY

Internal threat. Subjectivity may occur during manual labeling as an internal threat. To mitigate this threat, our study goes through multiple rounds involving independent labeling, meetings to discuss categorization, and the calculation of Cohen’s kappa [72]. We ultimately select the round of labeling with near-perfect agreement as our final result, which demonstrates the highest consistency.

External threat. All incidents we collect come from Microsoft’s cloud systems. Given that Microsoft employs various effective tools and techniques to eliminate bugs and deploys multiple automated tools to mitigate some incidents before they impact customers, the incidents we collect may not fully represent the behavior of other GenAI cloud services. We plan to perform a larger-scale evaluation of GenAI cloud services from different companies in the future.
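
The agreement check used during labeling can be illustrated with a small sketch. The TypeScript below computes Cohen's kappa [72] for two raters from its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohensKappa` and the sample labels are hypothetical and are not taken from the study's labeling process.

```typescript
// Cohen's kappa for two raters who each assign exactly one category per item.
function cohensKappa(raterA: string[], raterB: string[]): number {
  if (raterA.length !== raterB.length || raterA.length === 0) {
    throw new Error("Both raters must label the same non-empty set of items");
  }
  const n = raterA.length;
  // Prototype-free objects so that any label string is a safe key.
  const countsA: Record<string, number> = Object.create(null);
  const countsB: Record<string, number> = Object.create(null);
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    countsA[raterA[i]] = (countsA[raterA[i]] || 0) + 1;
    countsB[raterB[i]] = (countsB[raterB[i]] || 0) + 1;
    if (raterA[i] === raterB[i]) agreements += 1;
  }
  const observed = agreements / n; // p_o
  let expected = 0; // p_e: sum over labels of P_A(label) * P_B(label)
  for (const label of Object.keys(countsA)) {
    expected += (countsA[label] / n) * ((countsB[label] || 0) / n);
  }
  if (expected === 1) return 1; // both raters used one identical label throughout
  return (observed - expected) / (1 - expected);
}

// Toy root-cause labels from two hypothetical annotators.
const annotatorA = ["infra", "infra", "config", "code", "config", "infra"];
const annotatorB = ["infra", "config", "config", "code", "config", "infra"];
const kappa = cohensKappa(annotatorA, annotatorB);
```

With these toy labels the raters agree on five of six items, so p_o = 5/6 and p_e = 13/36, giving kappa = 17/23, about 0.74. Values above 0.8 are conventionally read as almost perfect agreement; the exact threshold applied in the study is not stated.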

XI. CONCLUSION

In this paper, we present a comprehensive study of incidents from GenAI cloud services within Microsoft. We explore the symptoms, root causes, and mitigation strategies of GenAI incidents. Our findings reveal unique characteristics in GenAI cloud services. For example, we identify notable differences between incidents from LLM cloud services and other cloud services, such as significant disparities in the time to mitigation of incidents. Additionally, we find that the primary cause of incidents in LLM cloud services is related to infrastructure. These findings provide guidance for future academic and industrial research in the field of LLM cloud services. We hope to inspire the development of advanced, specialized tooling and raise discussions on GenAI incidents, so that our community can monitor the GenAI cloud system with early warnings, triage incidents to the correct teams with fewer hops, pinpoint root causes accurately, and mitigate the incidents with optimal plans.

ACKNOWLEDGEMENT

We sincerely thank all anonymous reviewers for their valuable feedback and guidance in improving this paper. This work was sponsored by National Natural Science Foundation of China (No.62372193 and No.U2436207).

REFERENCES

[1] S. Yang, Z. Shang, Y. Wang, D. Deng, H. Chen, Q. Cheng, and X. Wu, “Data-free multi-label image recognition via llm-powered prompt tuning,” CoRR, vol. abs/2403.01209, 2024.
[2] J. Han, R. Zhang, W. Shao, P. Gao, P. Xu, H. Xiao, K. Zhang, C. Liu, S. Wen, Z. Guo, X. Lu, S. Ren, Y. Wen, X. Chen, X. Yue, H. Li, and Y. Qiao, “Imagebind-llm: Multi-modality instruction tuning,” CoRR, vol. abs/2309.03905, 2023.
[3] C. Zhang, Z. Ma, Y. Wu, S. He, S. Qin, M. Ma, X. Qin, Y. Kang, Y. Liang, X. Gou, et al., “Allhands: Ask me anything on large-scale verbatim feedback via large language models,” in ICDE, 2025.
[4] B. Qiao, L. Li, X. Zhang, S. He, Y. Kang, C. Zhang, F. Yang, H. Dong, J. Zhang, L. Wang, et al., “Taskweaver: A code-first agent framework,” arXiv:2311.17541, 2023.
[5] Y. Jiang, C. Zhang, S. He, Z. Yang, M. Ma, S. Qin, Y. Kang, Y. Dang, S. Rajmohan, Q. Lin, et al., “Xpert: Empowering incident management with query recommendations via large language models,” in ICSE, 2024.
[6] Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen, et al., “Automatic root cause analysis via large language models for cloud incidents,” in EuroSys, 2024.
[7] P. Jin, S. Zhang, M. Ma, H. Li, Y. Kang, L. Li, Y. Liu, B. Qiao, C. Zhang, P. Zhao, et al., “Assess and summarize: Improve outage understanding with large language models,” in ESEC/FSE, 2023.
[8] C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y. Kang, Q. Lin, S. Rajmohan, et al., “Ufo: A ui-focused agent for windows os interaction,” in NAACL, 2025.
[9] S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, Y. Xiao, S. Yun, X. Huang, and Z. Wei, “Disc-lawllm: Fine-tuning large language models for intelligent legal services,” CoRR, vol. abs/2309.11325, 2023.
[10] M. M. Raza, K. P. Venkatesh, and J. C. Kvedar, “Generative AI and large language models in health care: pathways to implementation,” npj Digit. Medicine, vol. 7, 2024.
[11] L. Luceri, E. Boniardi, and E. Ferrara, “Leveraging large language models to detect influence campaigns in social media,” CoRR, vol. abs/2311.07816, 2023.
[12] J. Yu, R. He, and R. Ying, “Thought propagation: An analogical approach to complex reasoning with large language models,” in ICLR, 2024.
[13] S. P. Sharan, F. Pittaluga, V. K. B. G, and M. Chandraker, “Llm-assist: Enhancing closed-loop planning with language-based reasoning,” CoRR, vol. abs/2401.00125, 2024.
[14] Y. Zhao, R. Zhang, W. Li, and L. Li, “Assessing and understanding creativity in large language models,” Mach. Intell. Res., vol. 22, 2025.
[15] Y. Liu, S. Chen, H. Cheng, M. Yu, X. Ran, A. Mo, Y. Tang, and Y. Huang, “How AI processing delays foster creativity: Exploring research question co-creation with an llm-based agent,” in CHI, 2024.
[16] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, W. Zhang, S. Qin, S. Rajmohan, Q. Lin, and D. Zhang, “Everything of thoughts: Defying the law of penrose triangle for thought generation,” in ACL, 2024.
[17] S. Bubeck et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv:2303.12712, 2023.
[18] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia, “Towards efficient generative large language model serving: A survey from algorithms to systems,” arXiv:2312.15234, 2023.
[19] Y. Chen, M. Shetty, G. Somashekar, M. Ma, Y. Simmhan, J. Mace, C. Bansal, R. Wang, and S. Rajmohan, “AIOpslab: A holistic framework to evaluate AI agents for enabling autonomous clouds,” in MLSys, 2025.
[20] M. Shetty, Y. Chen, G. Somashekar, M. Ma, Y. Simmhan, X. Zhang, J. Mace, D. Vandevoorde, P. Las-Casas, S. M. Gupta, et al., “Building ai agents for autonomous clouds: Challenges and design principles,” in SoCC, 2024.
[21] Z. Yu, M. Ma, C. Zhang, S. Qin, Y. Kang, C. Bansal, S. Rajmohan, Y. Dang, C. Pei, D. Pei, et al., “Monitorassistant: Simplifying cloud service monitoring via large language models,” in FSE, 2024.
[22] “Elevated errors affecting api and chatgpt.” [Link] incidents/n38dwwksfkv9, 2024.
[23] D. Xin, H. Miao, A. Parameswaran, and N. Polyzotis, “Production machine learning pipelines: Empirical analysis and optimization opportunities,” in SIGMOD, 2021.
[24] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan, “A comprehensive study on deep learning bug characteristics,” in ESEC/FSE, 2019.
[25] N. Humbatova, G. Jahangirova, G. Bavota, V. Riccio, A. Stocco, and P. Tonella, “Taxonomy of real faults in deep learning systems,” in ICSE, 2020.
[26] Z. Chen, Y. Cao, Y. Liu, H. Wang, T. Xie, and X. Liu, “A comprehensive study on challenges in deploying deep learning based software,” in ESEC/FSE, 2020.
[27] Z. Chen, Y. Kang, L. Li, X. Zhang, H. Zhang, H. Xu, Y. Zhou, L. Yang, J. Sun, Z. Xu, et al., “Towards intelligent incident management: why we need it and how we make it,” in ESEC/FSE, 2020.
[28] S. Ghosh, M. Shetty, C. Bansal, and S. Nath, “How to fight production incidents? an empirical study on a large-scale cloud service,” in SoCC, 2022.
[29] P. Dogga, C. Bansal, R. Costleigh, G. Jayagopal, S. Nath, and X. Zhang, “Autoarts: Taxonomy, insights and tools for root cause labelling of incidents in microsoft azure,” in ATC, 2023.
[30] H. Liu, S. Lu, M. Musuvathi, and S. Nath, “What bugs cause production cloud incidents?” in HotOS, 2019.
[31] W. Wang, J. Chen, L. Yang, H. Zhang, P. Zhao, B. Qiao, Y. Kang, Q. Lin, S. Rajmohan, F. Gao, Z. Xu, Y. Dang, and D. Zhang, “How long will it take to mitigate this incident for online service systems?” in ISSRE, 2021.
[32] J. Chen, X. He, Q. Lin, Y. Xu, H. Zhang, D. Hao, F. Gao, Z. Xu, Y. Dang, and D. Zhang, “An empirical investigation of incident triage for online service systems,” in ICSE (SEIP), 2019.
[33] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?” in EMNLP, 2022.
[34] E. D. Liddy, “Natural language processing,” 2001.
[35] “Fine-tuning.” [Link]
[36] “Microsoft copilot: Your everyday ai companion.” [Link] [Link]/, 2024.
[37] Z. Chen, Y. Kang, F. Gao, L. Yang, J. Sun, Z. Xu, P. Zhao, B. Qiao, L. Li, X. Zhang, et al., “Aiops innovations of incident management for cloud services,” 2020.
[38] C. Zhao, M. Ma, Z. Zhong, S. Zhang, Z. Tan, X. Xiong, L. Yu, J. Feng, Y. Sun, Y. Zhang, D. Pei, Q. Lin, and D. Zhang, “Robust multimodal failure detection for microservice systems,” in KDD, 2023.
[39] J. Huang, Y. Yang, H. Yu, J. Li, and X. Zheng, “Twin graph-based anomaly detection via attentive multi-modal learning for microservice system,” in ASE, 2023.
[40] C. Zhang, X. Peng, C. Sha, K. Zhang, Z. Fu, X. Wu, Q. Lin, and D. Zhang, “Deeptralog: Trace-log combined microservice anomaly detection through graph-based deep learning,” in ICSE, 2022.
[41] L. Li, X. Zhang, X. Zhao, H. Zhang, Y. Kang, P. Zhao, B. Qiao, S. He, P. Lee, J. Sun, et al., “Fighting the fog of war: Automated incident detection for cloud systems,” in ATC, 2021.
[42] H. Qiu, S. S. Banerjee, S. Jha, Z. T. Kalbarczyk, and R. K. Iyer, “FIRM: An intelligent fine-grained resource management framework for SLO-Oriented microservices,” in OSDI, 2020.
[43] M. Ma, S. Zhang, J. Chen, J. Xu, H. Li, Y. Lin, X. Nie, B. Zhou, Y. Wang, and D. Pei, “Jump-starting multivariate time series anomaly detection for online service systems,” in ATC, 2021.
[44] J. Zeng, Z. L. Chua, Y. Chen, K. Ji, Z. Liang, and J. Mao, “Watson: Abstracting behaviors from audit logs via aggregation of contextual semantics,” in NDSS, 2021.
[45] J. Zeng, X. Wang, J. Liu, Y. Chen, Z. Liang, T.-S. Chua, and Z. L. Chua, “Shadewatcher: Recommendation-guided cyber threat analysis using system audit records,” in S&P, 2022.
[46] S. Zhang, Y. Ji, J. Luan, X. Nie, Z. Chen, M. Ma, Y. Sun, and D. Pei, “End-to-end automl for unsupervised log anomaly detection,” in ASE, 2024.
[47] Z. Yu, C. Pei, X. Wang, M. Ma, C. Bansal, S. Rajmohan, Q. Lin, D. Zhang, X. Wen, J. Li, et al., “Pre-trained kpi anomaly detection model through disentangled transformer,” in KDD, 2024.
[48] Y. Liu, M. Ma, P. Zhao, T. Li, B. Qiao, S. Li, Z. Li, M. Chintalapati, Y. Dang, C. Bansal, S. Rajmohan, Q. Lin, and D. Zhang, “Early bird: Ensuring reliability of cloud systems through early failure prediction,” in ISSRE, 2024.
[49] J. Liu, C. Zhang, J. Qian, M. Ma, S. Qin, C. Bansal, Q. Lin, S. Rajmohan, and D. Zhang, “Large language models can deliver accurate and interpretable time series anomaly detection,” in KDD, 2025.
[64] X. Zhang, S. Ghosh, C. Bansal, R. Wang, M. Ma, Y. Kang, and S. Rajmohan, “Automated root causing of cloud incidents using in-context learning with GPT-4,” in FSE, 2024.
[65] M. Ma, Z. Yin, S. Zhang, S. Wang, C. Zheng, X. Jiang, H. Hu, C. Luo, Y. Li, N. Qiu, et al., “Diagnosing root causes of intermittent slow queries in cloud databases,” VLDB, vol. 13, 2020.
[66] J. Jiang, W. Lu, J. Chen, Q. Lin, P. Zhao, Y. Kang, H. Zhang, Y. Xiong, F. Gao, Z. Xu, et al., “How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems,” in ESEC/FSE, 2020.
[67] T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending root-cause and mitigation steps for cloud incidents using large language models,” in ICSE, 2023.
[68] C. Zhang, R. Yao, S. Qin, Z. Li, S. Agrawal, B. R. Mishra, T. Tran, M. Ma, Q. Lin, M. Chintalapati, et al., “Deoxys: A causal inference engine for unhealthy node mitigation in large-scale cloud infrastructure,” in SoCC, 2024.
[69] H. Li, M. Ma, Y. Liu, P. Zhao, S. Li, Z. Li, M. Chintalapati, Y. Dang, C. Bansal, S. Rajmohan, et al., “Can we trust auto-mitigation? improving cloud failure prediction with uncertain positive learning,” in 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), pp. 499–510, IEEE, 2024.
[70] Y. Zhao, L. Jiang, Y. Tao, S. Zhang, C. Wu, Y. Wu, T. Jia, Y. Li, and Z. Wu, “How to manage change-induced incidents? lessons from the study of incident life cycle,” in ISSRE, 2023.
[71] A. L. Strauss and J. M. Corbin, Grounded theory in practice. Sage, 1997.
[72] J. Cohen, “A coefficient of agreement for nominal scales,” Educational
and psychological measurement, vol. 20, 1960.
[50] Y. Sun, B. Shi, M. Mao, M. Ma, S. Xia, S. Zhang, and D. Pei,
[73] Z. Wang, C. Pei, M. Ma, X. Wang, Z. Li, D. Pei, S. Rajmohan, D. Zhang,
“Art: A unified unsupervised framework for incident management in
Q. Lin, H. Zhang, J. Li, and G. Xie, “Revisiting VAE for unsupervised
microservice systems,” in ASE, 2024.
time series anomaly detection: A frequency perspective,” in WWW, 2024.
[51] C. Bansal, S. Renganathan, A. Asudani, O. Midy, and M. Janakiraman, [74] Y. Chen, C. Zhang, M. Ma, Y. Liu, R. Ding, B. Li, S. He, S. Rajmohan,
“Decaf: Diagnosing and triaging performance issues in large-scale cloud Q. Lin, and D. Zhang, “Imdiffusion: Imputed diffusion models for
services,” in ICSE (SEIP), 2020. multivariate time series anomaly detection,” VLDB, vol. 17, 2023.
[52] J. Chen, X. He, Q. Lin, Y. Xu, H. Zhang, D. Hao, F. Gao, Z. Xu, [75] Z. Zeng, Y. Zhang, Y. Xu, M. Ma, B. Qiao, W. Zou, Q. Chen, M. Zhang,
Y. Dang, and D. Zhang, “An empirical investigation of incident triage X. Zhang, H. Zhang, X. Gao, H. Fan, S. Rajmohan, Q. Lin, and
for online service systems,” in ICSE (SEIP), 2019. D. Zhang, “Traceark: Towards actionable performance anomaly alerting
[53] J. Chen, X. He, Q. Lin, H. Zhang, D. Hao, F. Gao, Z. Xu, Y. Dang, for online service systems,” in ICSE (SEIP), 2023.
and D. Zhang, “Continuous incident triage for large-scale online service [76] J. Gu, J. Wen, Z. Wang, P. Zhao, C. Luo, Y. Kang, Y. Zhou, L. Yang,
systems,” in ASE, 2019. J. Sun, Z. Xu, B. Qiao, L. Li, Q. Lin, and D. Zhang, “Efficient customer
[54] Z. Wang, J. Li, M. Ma, Z. Li, Y. Kang, C. Zhang, C. Bansal, M. Chinta- incident triage via linking with system incidents,” in ESEC/FSE, 2020.
lapati, S. Rajmohan, Q. Lin, et al., “Large language models can provide [77] J. Wester, T. Schrills, H. Pohl, and N. van Berkel, ““as an ai language
accurate and interpretable incident triage,” in ISSRE, 2024. model, i cannot”: Investigating llm denials of user requests,” in CHI,
[55] Z. Wang, Z. Liu, Y. Zhang, A. Zhong, L. Fan, L. Wu, and Q. Wen, 2024.
“Rcagent: Cloud root cause analysis by autonomous agents with tool- [78] V. Ganatra, A. Parayil, S. Ghosh, Y. Kang, M. Ma, C. Bansal, S. Nath,
augmented large language models,” in CIKM, 2024. and J. Mace, “Detection is better than cure: A cloud incidents perspec-
[56] D. Zhang, X. Zhang, C. Bansal, P. Las-Casas, R. Fonseca, and tive,” in ESEC/FSE, 2023.
S. Rajmohan, “Pace: Prompting and augmentation for calibrated con- [79] C. Di Martino, Z. Kalbarczyk, R. K. Iyer, G. Goel, S. Sarkar, and
fidence estimation with gpt-4 in cloud incident root cause analysis,” R. Ganesan, “Characterization of operational failures from a business
arXiv:2309.05833, 2023. data processing saas platform,” in ICSE, 2014.
[57] G. Yu, P. Chen, Y. Li, H. Chen, X. Li, and Z. Zheng, “Nezha: [80] T. Cui, Y. Wang, C. Fu, Y. Xiao, S. Li, X. Deng, Y. Liu, Q. Zhang,
Interpretable fine-grained root causes analysis for microservices on Z. Qiu, P. Li, et al., “Risk taxonomy, mitigation, and assessment
multi-modal observability data,” in ESEC/FSE, 2023. benchmarks of large language model systems,” arXiv:2401.05778, 2024.
[58] C. Lee, T. Yang, Z. Chen, Y. Su, and M. R. Lyu, “Eadro: An end-to-end [81] Y. Liu, G. Deng, Z. Xu, Y. Li, Y. Zheng, Y. Zhang, L. Zhao, T. Zhang,
troubleshooting framework for microservices on multi-source data,” in and Y. Liu, “Jailbreaking chatgpt via prompt engineering: An empirical
ICSE, 2023. study,” arXiv:2305.13860, 2023.
[59] S. Zhang, P. Jin, Z. Lin, Y. Sun, B. Zhang, S. Xia, Z. Li, Z. Zhong, [82] T. Y. Zhuo, Z. Li, Y. Huang, F. Shiri, W. Wang, G. Haffari, and Y.-F. Li,
M. Ma, W. Jin, et al., “Robust failure diagnosis of microservice system “On robustness of prompt-based semantic parsing with large pre-trained
through multimodal data,” IEEE Trans. Serv. Comput., vol. 16, 2023. language model: An empirical study on codex,” in EACL, 2023.
[60] Y. Zhang, Z. Guan, H. Qian, L. Xu, H. Liu, Q. Wen, L. Sun, J. Jiang, [83] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An
L. Fan, and M. Ke, “Cloudrca: A root cause analysis framework for empirical study of gpt-3 for few-shot knowledge-based vqa,” in AAAI,
cloud computing platforms,” in CIKM, 2021. 2022.
[61] Z. Xie, S. Zhang, Y. Geng, Y. Zhang, M. Ma, X. Nie, Z. Yao, L. Xu,
Y. Sun, W. Li, et al., “Microservice root cause analysis with limited
observability through intervention recognition in the latent space,” in
KDD, 2024.
[62] S. Zhang, S. Xia, W. Fan, B. Shi, X. Xiong, Z. Zhong, M. Ma, Y. Sun,
and D. Pei, “Failure diagnosis in microservice systems: A comprehensive
survey and analysis,” TOSEM, 2025.
[63] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, X. Wu, M. Zhang, Q. Chen,
X. Gao, X. Gao, et al., “Tracediag: Adaptive, interpretable, and efficient
root cause analysis on large-scale microservice systems,” in ESEC/FSE,
2023.

Common questions

In GenAI cloud services, incidents stem from diverse root causes such as code defects and misconfigurations, which makes diagnosis and resolution more complex than in traditional services. Mitigation relies more heavily on infrastructure fixes, code changes, and configuration updates, which are 2.5 to 3 times more frequent than in non-GenAI services. The mismatch between causes and fixes (21.5% of incidents are due to code bugs, but only 7.6% are fixed with code changes) suggests an inherent complexity and the need for diverse remediation strategies beyond simple code adjustments.

A larger share of GenAI incidents is detected by humans: 38.3%, compared to 13.7% in traditional services. While this highlights the necessity of human oversight where automated monitoring is inadequate, it also raises scalability concerns. Automated monitoring tools could be enhanced to better mimic and support human judgment, reducing false positives and increasing detection accuracy, possibly through advanced machine learning techniques.

GenAI cloud services face distinctive challenges such as performance degradation, deployment failures, and invalid inferences, which affect reliability and user satisfaction more significantly than in traditional cloud services. As a result, GenAI incidents take longer to mitigate (1.12 time units on average) than non-GenAI incidents (0.65 time units). Additionally, the higher share of human-detected incidents (38.3%) indicates a maturity gap in automated monitoring for GenAI services compared to traditional services.
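To make the mitigation-time gap concrete, the short sketch below works through the arithmetic using only the two averages quoted above. The constant and function names are illustrative, and the time units are whatever normalized units the figures are reported in.

```python
# Average time to mitigate (TTM), in the normalized time units quoted in the text.
GENAI_TTM = 1.12
NON_GENAI_TTM = 0.65


def relative_increase(value: float, baseline: float) -> float:
    """Return how much larger `value` is than `baseline`, as a fraction of the baseline."""
    return (value - baseline) / baseline


ratio = GENAI_TTM / NON_GENAI_TTM
increase = relative_increase(GENAI_TTM, NON_GENAI_TTM)

# Prints: GenAI TTM is 1.72x the non-GenAI TTM (72% longer)
print(f"GenAI TTM is {ratio:.2f}x the non-GenAI TTM ({increase:.0%} longer)")
```

In other words, the quoted averages imply that a GenAI incident takes roughly 1.7 times as long to mitigate as a non-GenAI one.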

The complexity of GenAI systems, characterized by vast infrastructures and intricate dependencies, significantly influences their incident management life cycle by prolonging the time to mitigate (TTM) and complicating root cause analysis. This complexity necessitates advanced diagnostic capabilities and more refined mitigation strategies. For development and maintenance, it implies the need for robust monitoring tools and a deep understanding of system behavior to preempt and manage incidents effectively, which in turn requires ongoing refinement of both infrastructure and processes.

Invalid or false-alarm incidents in GenAI services are handled by reassigning them to the appropriate teams, and they require careful verification. These incidents, often detected manually because automated tools are insufficient, highlight the challenge of distinguishing genuine issues from false positives. The significant share of false alarms (11.0% in GenAI compared to 3.8% in non-GenAI services) suggests an urgent need to improve the accuracy of monitoring systems, reducing unnecessary reassignments and focusing resources on legitimate incidents.

The TTM for GenAI incidents is longer partly because these systems involve interconnected infrastructure layers and dependencies, which makes diagnosis and resolution more time-consuming. Strategies to address this include developing more sophisticated automation for incident mitigation, prioritizing quick-fix methods such as rollbacks to minimize downtime, and enhancing root cause analysis through advanced AI models.

Continuous improvement of monitoring tools is crucial in managing GenAI cloud service incidents, given the high rate of human-detected incidents and false positives. Tools need to address the specific demands of GenAI systems by improving accuracy and reducing detection times. Necessary improvements include developing scalable algorithms for automated anomaly detection, refining monitoring capabilities to reduce reliance on human detection, and incorporating advanced AI methodologies to handle the complex data environments of GenAI systems effectively.

Research areas critical for improving GenAI cloud service incident management include developing automated methods to detect invalid inferences more cost-efficiently and effectively, as current approaches such as consistency scores and model fine-tuning are inadequate. Advances in these areas could lead to more reliable detection and resolution of issues, reducing both false positives and the dependency on human oversight, thereby enhancing service reliability and decreasing TTM.
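For readers unfamiliar with the idea, the sketch below shows one simple form a consistency score can take: sample several responses to the same prompt and measure how much they agree. This is an illustrative lexical variant, not the method evaluated in the study. The function names, the token-level Jaccard similarity, and the 0.5 threshold are all assumptions made for the example.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two responses (1.0 = same token set)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(responses: list[str]) -> float:
    """Mean pairwise similarity across responses sampled for one prompt."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical cut-off; a real deployment would have to calibrate this.
THRESHOLD = 0.5


def looks_inconsistent(responses: list[str], threshold: float = THRESHOLD) -> bool:
    """Flag a prompt whose sampled responses disagree more than the threshold allows."""
    return consistency_score(responses) < threshold


stable = ["paris is the capital", "the capital is paris", "paris is the capital"]
unstable = ["paris", "london", "berlin"]
print(consistency_score(stable), looks_inconsistent(stable))
print(consistency_score(unstable), looks_inconsistent(unstable))
```

Even this toy version needs several model calls per request, which illustrates why cost-efficiency is raised as a concern, and a purely lexical score can miss responses that agree with each other while all being wrong.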

GenAI cloud services exhibit a higher rate of human-detected incidents (38.3%, compared to 13.7% in non-GenAI services), indicating a reliance on human oversight. The false alarm rate for GenAI services is also higher (11.0%, against 3.8% in non-GenAI services), implying that current monitoring tools are not as mature or accurate in GenAI contexts. This underscores the need for improved and more tailored monitoring solutions to address the unique challenges faced by GenAI services.
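The two comparisons above can be put side by side with a few lines of arithmetic. The sketch below uses only the percentages quoted in the text; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateComparison:
    """One metric reported as a percentage of incidents for both service types."""
    name: str
    genai_pct: float
    non_genai_pct: float

    @property
    def ratio(self) -> float:
        """How many times higher the GenAI rate is than the non-GenAI rate."""
        return self.genai_pct / self.non_genai_pct

    @property
    def gap_points(self) -> float:
        """Absolute difference in percentage points."""
        return self.genai_pct - self.non_genai_pct


COMPARISONS = [
    RateComparison("human-detected incidents", 38.3, 13.7),
    RateComparison("false alarms", 11.0, 3.8),
]

for c in COMPARISONS:
    print(f"{c.name}: {c.ratio:.2f}x, +{c.gap_points:.1f} percentage points")
# Prints:
# human-detected incidents: 2.80x, +24.6 percentage points
# false alarms: 2.89x, +7.2 percentage points
```

Both rates are a little under three times their non-GenAI counterparts, which is consistent with reading them as symptoms of the same monitoring maturity gap.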

Analyzing high-severity GenAI incidents separately is crucial because of their significant impact on service availability, which affects numerous tenants and customers. Such analysis provides insights into the root causes and symptoms unique to severe incidents, facilitating targeted improvements in incident management and resolution strategies. It uncovers patterns and commonalities that can lead to preemptive measures, enhancing the overall reliability and robustness of GenAI cloud services.
