
AI Agent Index: Documenting Agentic Systems

The AI Agent Index is a comprehensive database documenting agentic AI systems that can perform complex tasks with minimal human intervention. It addresses the lack of structured information on the technical components, intended uses, and safety features of these systems, while highlighting their growing deployment and associated risks. The index aims to enhance understanding among users, developers, and policymakers regarding the capabilities, limitations, and safety measures of agentic AI systems.


The AI Agent Index

Stephen Casper * 1 Luke Bailey 2 Rosco Hunter 3 Carson Ezell 4 Emma Cabalé 5 Michael Gerovitch 1
Stewart Slocum 1 Kevin Wei 4 Nikola Jurkovic 4 Ariba Khan 1 Phillip Christoffersen 1 A. Pinar Ozisik 1
Rakshit Trivedi 1 Dylan Hadfield-Menell 1 Noam Kolt * 6

* Equal contribution. 1 Massachusetts Institute of Technology, 2 Stanford University, 3 University of Warwick, 4 Harvard University, 5 École Normale Supérieure Paris-Saclay, Université Paris-Saclay, 6 Hebrew University. Correspondence to: Stephen Casper <scasper@[Link]>, Noam Kolt <[Link]@[Link]>.

Preprint. arXiv:2502.01635v1 [[Link]] 3 Feb 2025

Abstract

Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system’s components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at [Link] with raw data at this link.

1. Introduction

‘Agentic’ AI systems that can be instructed to plan and directly execute complex tasks with only limited human involvement (Xi et al., 2023; Wang et al., 2024; Durante et al., 2024; Sager et al., 2025) are transitioning from research prototypes to real-world products (e.g., Devin, h2oGPTe, Simple AI, XBOW). These systems, which are generally comprised of foundation models augmented with scaffolding for reasoning, planning, memory, and tool use (Sumers et al., 2023; Zaharia et al., 2024; Yao, 2024; Su et al., 2024), are being deployed in a growing number of domains (see Figure 7).

The performance of agentic systems is steadily improving on benchmarks (Mialon et al., 2023b; Xie et al., 2024b; Zhou et al., 2023; Koh et al., 2024; Yoran et al., 2024; Xu et al., 2024), and these systems are being integrated into broader swathes of economic activity (Wang et al., 2024; Durante et al., 2024; Sager et al., 2025). As a result, their real-world impacts are mounting (Chan et al., 2023; Gabriel et al., 2024; Anwar et al., 2024; Kolt, 2025). Alongside the significant opportunities presented by agentic systems, researchers have also raised noteworthy concerns, including cybersecurity risks (Fang et al., 2024a;b), loss of control (Cohen et al., 2024; Bengio et al., 2025), and physical harm where agents operate robotic systems (Ruan et al., 2023).

Despite growing efforts to study trends in the development of agentic AI systems, including evaluating their performance and cost (Kapoor et al., 2024; Stroebl et al., 2025), assessing their potential harms (Andriushchenko et al., 2024; Kumar et al., 2024; U.S. AI Safety Institute, 2025), and increasing visibility into their operation (Shavit et al., 2023; Chan et al., 2024a;b; 2025; Kolt, 2025), many practical questions remain unanswered:

• Which organizations are developing agentic systems?
• In which domains are they being deployed?
• What infrastructure do agentic systems require?
• How are their performance and safety evaluated?
• What guardrails are used to mitigate risks?

To empirically answer these questions and improve public understanding of agentic AI systems, we introduce and release the AI Agent Index, a comprehensive sample of deployed agentic AI systems (n = 67). The index, which is constructed from a combination of publicly available data and correspondence with developers, documents publicly-available information on the intended uses of agentic systems, their technical components (including reasoning, planning, and memory implementation, base models, observation and action space, and user interface), safety features (including accessibility of system components, usage controls and restrictions, and red-teaming practices), and details regarding the organizations developing and deploying agentic systems (including entity type and country of origin).

In addition to collecting and systematizing information about agentic AI systems, the index also sheds light on the availability of such information. Specifically, we find that while relatively detailed information is available regarding the features and applications of agentic systems (Figure 1), strikingly limited information is available regarding their safety evaluations and guardrails (Figure 2).

Figure 1. Most AI agent developers in the index provide some public documentation (70.1%), while about half (49.3%) release their underlying code. [Bar chart, % publicly available: Code 49.3%, Documentation 70.1%.]

Figure 2. Only 19.4% of indexed agentic systems disclose a formal safety policy, and fewer than 10% report external safety evaluations. [Bar chart, % with safety measure: Publicly available formal safety policy 19.4%, Publicly reported external safety testing 7.5%, Publicly reported safety testing by developer 9.0%.]

In this paper, we make three contributions:

1. We introduce a structured framework for documenting the technical, safety, and policy-relevant features of agentic AI systems.
2. We identify currently deployed agentic systems that meet our criteria (described below) and publicly document these systems according to our framework.
3. We discuss key findings from the index, shedding light on geographic spread, academic vs. industry development, openness, and risk management of agentic systems.

The index is available on the web at [Link] with raw data accessible here.

2. Background

There is no widely accepted definition of “AI agent”. The notion of artificial agency has a long and contentious history, spanning multiple decades and diverse disciplines. These include cybernetics (Rosenblueth et al., 1943; Ashby, 1956; Wiener, 1961), artificial life (Maes, 1990; 1993; 1995), rational agency (Rao & Georgeff, 1991), software engineering (Wooldridge & Jennings, 1995; Jennings, 2000), reinforcement learning (Sutton & Barto, 2018), and philosophy (Dennett, 1989; Dung, 2024). While there have been notable attempts to define the term “agent”, including in the context of computational systems (Franklin & Graesser, 1996; Russell & Norvig, 2020; Kenton et al., 2023), we do not decide among these definitions or offer an alternative definition. Instead, we follow Chan et al. (2023), and loosely characterize agentic AI systems as ones that exhibit, to some significant degree, a combination of the following properties:

a) Underspecification: the system can accomplish a goal provided to it without a precise specification of how to do so.
b) Directness of impact: the system’s actions can affect the world with little to no human mediation.
c) Goal-directedness: the system acts as if in the pursuit of a particular objective.
d) Long-term planning: the system can solve problems by reasoning about how to approach them, constructing plans, and executing them step by step.

2.1. Agentic Architectures, Applications, and Opportunities

Contemporary AI agents are generally compound systems (Zaharia et al., 2024) comprised of a foundation model augmented by external resources, known as “scaffolding”, which enable effective planning, memory, and tool use (Wang et al., 2024; Xi et al., 2023; Durante et al., 2024). Planning of complex series of actions is typically facilitated through chain-of-thought-based reasoning processes (Wei et al., 2022; Yao et al., 2022c; 2023; Shinn et al., 2023; OpenAI, 2024). Memory relies on information stored in
the base model and/or in external storage modules (Sumers et al., 2023). Tool use is enabled through API calls and natural language dialogue between the base model and external software, databases, and other affordances (Schick et al., 2023; Mialon et al., 2023a; Qin et al., 2023).

These agentic architectures are increasingly applied to a variety of domains, including programming (Jimenez et al., 2023; Yang et al., 2024b), machine learning research (Huang et al., 2024; Wijk et al., 2024; Chan et al., 2024c), experimentation in the natural sciences (Boiko et al., 2023; Bran et al., 2024; Jansen et al., 2024), and consumer activities such as online retail (Yao et al., 2022a; Deng et al., 2023), travel planning (Xie et al., 2024a), and general-purpose web browsing (Gur et al., 2023; Wu et al., 2024). Progress in these applications is being evaluated by a growing suite of benchmarks, which measure performance in computer use (Mialon et al., 2023b; Xie et al., 2024b; Zhou et al., 2023; Koh et al., 2024; Yoran et al., 2024), software engineering (Jimenez et al., 2023; Yang et al., 2024b), and virtual work environments (Xu et al., 2024).

2.2. Safety Risks and Ethical Concerns

Given that agentic AI systems are built on foundation models, they are susceptible to many of the risks associated with such models, including harms arising from hallucinations, biased outputs, and leakage of private data (Bender et al., 2021; Weidinger et al., 2022; Solaiman et al., 2023). Agentic systems, however, also present new risks that stem specifically from their agentic properties, i.e., underspecification, directness of impact, goal-directedness, and long-term planning (Chan et al., 2023; Cohen et al., 2024; Ruan et al., 2023; Andriushchenko et al., 2024; Bengio et al., 2025). For example, while chatbots often cause harm by human users acting upon model outputs (e.g., deploying model-generated malicious code) (Phuong et al., 2024), agentic AI systems can directly cause harm (e.g., autonomously hacking websites) (Fang et al., 2024a; Jaech et al., 2024).

Additionally, as agentic AI systems undertake more complex and long-horizon tasks, with limited human oversight, users are likely to repose greater trust in those systems, potentially developing asymmetric relationships of dependence (Gabriel et al., 2024; Manzini et al., 2024b;a; Bengio et al., 2025). Moreover, agentic systems developed and operated by large platform companies could enable those companies to exert greater influence and control over users and third parties with whom they interact (e.g., vendors accessed through platform-controlled agents) (Lazar, 2024).

2.3. Documentation Frameworks

Many frameworks have been developed to document the features of AI systems, the resources used to build them, and the contexts in which they are deployed. These include datasheets (Gebru et al., 2018), model cards (Mitchell et al., 2019), reward reports (Gilbert et al., 2022), ecosystem graphs (Bommasani et al., 2023b), and data provenance cards (Longpre et al., 2023). In addition, several databases have been created to collect information regarding contemporary AI systems and their real-world impacts, such as the Foundation Model Transparency Index (Bommasani et al., 2023a), the AI Incident Database (McGregor, 2021), and the AI Risk Repository (Slattery et al., 2024). Currently, however, there are no equivalent frameworks for documenting agentic AI systems. This lack of structured information limits both researchers’ ability to study and build agentic systems and policymakers’ capacity to design appropriate governance mechanisms (Winecoff & Bogen, 2024).

The AI Agent Index fills this gap. By collecting and communicating technical, safety, and policy-relevant information concerning agentic AI systems, the index aims to inform different stakeholders in distinct ways. Specifically, the index:

1. Enables users to better understand the capabilities and limitations of agentic systems with which they interact.
2. Provides developers more comprehensive and granular information about currently deployed agentic systems.
3. Supports auditors and red-teams in deciding the scope and focus of their evaluations of agentic systems.
4. Offers an evidence base to policymakers designing governance mechanisms for agentic systems.
5. Improves public awareness and understanding of agentic systems.

3. Methods

What does the index include? As discussed in Section 2, there is no widely-accepted definition of “AI agent.” We do not propose one here. Given our focus on the societal impacts of agentic AI systems, we draw on the four characteristics introduced by Chan et al. (2023) discussed in Section 2. Importantly, to address the practical questions outlined in Section 1, we primarily document the features of agentic AI systems that are either deployed as products or available open source.

The full decision graph we used to determine inclusion in the index is shown in Figure 3. Notably, we restricted the index to agentic systems and did not include language models themselves, or agent development frameworks (unless the framework was built around a qualifying flagship system, in which case we indexed that system). We also created a single index entry per named and versioned system. Different releases (e.g., “HelpfulAgent1.1” vs “HelpfulAgent1.2”)
and different configurations (e.g., “HelpfulAgent-Claude3.5-Sonnet” vs. “HelpfulAgent-GPT4o”) were indexed under the same entry. The final node in our decision graph (Figure 3) facilitates the inclusion of systems that otherwise would not strictly fit the criteria at our discretion. In practice, we only invoked this for systems from leading companies that were announced but have not (yet) been externally deployed, such as OpenAI o3 or Project Mariner. In total, we indexed 67 systems. Limitations of our methods are discussed in Section 6.

Figure 3. Decision graph for determining inclusion in the index: We focused on indexing agentic systems (as opposed to models or development frameworks) and drew on the four characteristics of agency from Chan et al. (2023): underspecification, directness of impact, goal-directedness, and long-term planning. In total, we indexed 67 systems. [Flowchart starting from a named "agentic" AI system, with Yes/No branches leading to Include or Exclude. Decision nodes: "Can the system accomplish a diverse range of tasks, and does it have a meaningfully higher degree of agency than ChatGPT-4o?"; "Is the system a product deployed for commercial or other economically valuable consequential applications?"; "Is the system open source?"; "Could the system be used off the shelf or with very minor modifications to accomplish tasks competitively with other solutions?"; "Is there another compelling reason to track this system? E.g., it is a particularly informative case study or is exceptionally relevant in shaping the state of the art."]

The AI Agent Index represents a snapshot in time as of December 31, 2024. New developments in the AI agent research and product ecosystem occur weekly. To improve thoroughness and consistency, we only indexed systems announced by, and available in, 2024.

What does the index not include? Our selection criteria led us to exclude the following types of systems:

• Non-“agentic” models such as Llama-3.2-90B-Vision-Instruct (Dubey et al., 2024).
• Unnamed systems often comprised of simple baseline implementations introduced under frameworks or benchmarks such as CORE-Bench (Siegel et al., 2024), AgentHarm (Andriushchenko et al., 2024), or The Agent Company (Xu et al., 2024).
• Non-“agentic” development frameworks without a qualifying flagship model such as AutoGPT (Fırat & Kuleli, 2023), Beam, Dust, GumLoop, Lindy, OpenAI Swarm, Qwen-Agent, or Spell.
• Systems that cannot open-endedly accomplish a diverse range of tasks such as systems that propose solutions to git requests (e.g., MentatBot, Engine, Globant Code-Fixer Agent (Bel et al., 2024)).
• Systems that do not have a meaningfully higher degree of agency than ChatGPT-4o¹ (based on the four aspects of agency from Chan et al. (2023)) such as Taskade, Vonage AI Virtual Assistant, Talkdesk, IBM WatsonX, and ActionAgents.
• Systems that are not open source or products deployed for commercial or other consequential applications such as Falcon-UI (Shen et al., 2024) or HoneyComb (Zhang et al., 2024).
• Open source systems that could not be used competitively off the shelf, often due to age or narrow scope such as GeniA, ReAct (Yao et al., 2022b), Pearl (Sun et al., 2023), or Moatlesss.
• Systems deployed after the cutoff date of December 31, 2024 such as Deepseek-R1, Doubao-1.5-pro, or OpenAI Operator.

¹ ChatGPT-4o allows users to customize system prompts, can engage in open-ended dialogue, and can search/synthesize web searches when responding to users.

How was information collected? From August 2024 to January 2025, we identified agentic AI systems using web searches, academic literature review, benchmark leaderboards (e.g., SWE-bench (Jimenez et al., 2023) and GAIA (Mialon et al., 2023b)), and additional resources that compile lists of agentic systems (e.g., [Link] [Link] Survey/, and [Link]). On a rolling basis, we created the first drafts of agent cards according to the template outlined next in Section 4. After each first draft was completed, we contacted the developers of each agent to request feedback and potential corrections.

We received a 36% response rate. After editing each draft to incorporate feedback, we updated and finalized agent cards in January 2025 to ensure that all reflected the state of the field as of December 31, 2024. For all web sources cited in all agent cards (excluding stable papers, videos, and social media posts), we cited stable archived versions of websites preceding and as close to December 31, 2024 as possible using [Link] and [Link].

4. Agent Card Components

Each agent card contains 33 fields of information, divided into 6 categories:

1. Basic information
• Website
• Short description
• Intended uses: What does the developer state that the system is intended for?
• Date(s) deployed

2. Developer
• Website
• Legal name
• Entity type
• Country (location of developer or first author’s first affiliation)
• Safety policies: What safety and/or responsibility policies are in place?

3. System components
• Backend model: What model(s) are used to power the system?
• Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is designed to behave in them?
• Reasoning, planning, and memory implementation: How does the system ‘think’?
• Observation space: What is the system able to observe while ‘thinking’?
• Action space/tools: What direct actions can the system take?
• User interface: How do users interact with the system?
• Development cost and compute: What is known about the development costs?

4. Guardrails and oversight
• Accessibility of components:
– Weights: Are model parameters available?
– Data: Is data available?
– Code: Is code available?
– Scaffolding: Is system scaffolding available?
– Documentation: Is documentation available?
• Controls and guardrails: What notable methods are used to protect against harmful actions?
• Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers?
• Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be shut down if it is observed to behave harmfully?

5. Evaluations
• Notable benchmark evaluations (e.g., on SWE-Bench Verified)
• Bespoke testing (e.g., demos)
• Safety: Have safety evaluations been conducted by the developers? What were the results?
• Publicly reported external red-teaming or comparable auditing:
– Personnel: Who were the red-teamers/auditors?
– Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take?
– Findings: What did the red-teamers/auditors conclude?

6. Ecosystem
• Interoperability with other systems: What tools or integrations are available?
• Usage statistics and patterns: Are there any notable observations about usage?

7. Additional notes: If any

We populated each field in each card with written notes based on publicly available information. When no information was available, we recorded “None” or “Unknown.”

5. Findings

In addition to compiling specific information regarding each of the 67 indexed systems, the AI Agent Index offers a high-level perspective of this emerging field. Noting the limitations and biases discussed next (in Section 6), here, we offer a bird’s eye view of the state of the art for AI agents.

Agentic systems are being deployed at a steadily increasing rate. Systems that meet our criteria for inclusion in the index have had (initial) deployments dating back to early
2023. However, Figure 4 shows that they have been deployed at an increasing rate with approximately half of the indexed systems deployed in the second half of 2024.

Figure 4. Agentic systems are being deployed at a steadily increasing rate. [Bar chart "Timeline of Agent Releases"; x-axis: Release Date, 2023-03 to 2025-01; y-axis: Number of Agents.]

Most indexed systems are created by developers located in the USA. We considered the ‘developer country’ of each agent to be the national location of either (a) the developer organization if the developer was a company or (b) the first author’s first listed affiliation if the agent was created as part of an academic research collaboration. We plot the number of agents from each country in Figure 5. Of the 67 agents, 45 were created by developers in the USA.

Figure 5. Most agentic systems are created by developers in the USA. In this figure, some developers’ countries are counted multiple times due to producing multiple indexed systems. Google DeepMind is counted 3x, while OpenAI, National University of Singapore, UC Berkeley, and Stanford University are each counted 2x. [Chart "Country Distribution"; countries shown: USA, China, UK, Israel, Japan, Singapore, Canada, Sweden, France.]

While most agentic systems are developed by companies, a significant fraction are developed in academia. In Figure 6, we show the developers of agents broken down based on whether they are projects from academic labs or companies in industry. 18 (26.9%) are academic while 49 (73.1%) are from companies.

Figure 6. Most agentic systems are developed by companies. [Chart "Developer Types": Academic 26.9%, Industry 73.1%.]

The majority of indexed systems specialize in software engineering and/or computer use. We divided the 67 systems into 6 categories:

• Software: agents that assist in coding and software engineering (e.g., Yang et al., 2024a).
• Computer use: agents designed to open-endedly interact with computer interfaces (e.g., Yoran et al., 2024) (Sager et al., 2025).
• Universal: agents designed to be a general-purpose reasoning engine (e.g., OpenAI, 2024).
• Research: agents designed to assist with scientific research (e.g., Lu et al., 2024).
• Robotics: agents designed for robotic control (e.g., Kim et al., 2024).
• Other: systems that are designed for niche applications (e.g., LinkedIn Talent Agents).

We plot the breakdown by domain in Figure 7. 50 of the 67 agents (74.6%) specialize in either software engineering or computer use. We also note that there exist many ‘agentic’ systems for customer service, which do not meet our criteria for inclusion in the index. See Section 3 for discussion and examples.

Figure 7. The majority of indexed systems specialize in software engineering and/or computer use. [Chart "Application Domains"; categories: Software, Computer use, Universal, Research, Robotics, Other.]

The majority of indexed systems have released code and/or documentation. Developers are relatively publicly forthcoming about details related to usage and capabilities. In Figure 1, we show results: 33 (49.3%) release code, and 47 (70.1%) release documentation. We also observed that systems developed as academic projects are released with a high degree of openness, with 16 of the 18 (88.8%) releasing code.

There is limited publicly available information about safety testing and risk management practices. In contrast to the relatively high degree of openness that developers exercise around their systems’ capabilities and usage, we find scant public information about safety policies, external safety evaluations, and internal safety evaluations. In Figure 2, we show that only 13 (19.4%), 5 (7.5%), and 6 (9%) indexed systems have publicly available information on each of these, respectively. We also note that most of the systems that have undergone formal, publicly-reported safety testing are from a small number of large companies (e.g., Anthropic, Google DeepMind, OpenAI).

6. Limitations and Concerns

Defining agentic systems. The term “AI agent” is contentious, as discussed in Section 2. In particular, the term has been criticized for inappropriately anthropomorphizing certain AI systems (Weidinger et al., 2022; Mitchell, 2021), which could potentially lead to unrealistic expectations from, or over-reliance on, such systems (Gabriel et al., 2024; Manzini et al., 2024b). Recognizing this concern, we do not weigh in on this debate, advocate a particular definition of “AI agent”, or propose alternative terminology. Instead, we focus on empirically documenting a growing class of deployed AI systems that exhibit “agentic” characteristics (as described in Chan et al. (2023)) and have a potential for significant impact. Through the index, we communicate our findings as plainly and openly as possible.

Scope and timing of index. The index is not a comprehensive or exhaustive database of all agentic systems or related resources, such as language models and development frameworks for building agentic systems. The field of agentic AI is highly decentralized and poorly documented. Accordingly, there may also be systems that meet the selection criteria specified in Section 3 but do not appear in the index. In particular, the index is likely to disproportionately document agentic systems that are publicly available or publicly released, compared with systems used internally within organizations (which, by definition, are not publicly accessible). In addition, the index only includes systems described in the English language and includes relatively few systems from non-western developers. The index represents a snapshot in time on December 31, 2024 and does not include systems that were obsolete by this date or were released thereafter. Moreover, while the agent cards in the index collect 33 fields of information, these are not exhaustive and exclude, for example, records of real-world safety incidents (to the extent such incidents have occurred).

Incomplete or inaccurate information. In total, the index contains over 2,200 fields of information reviewed by multiple authors. Nonetheless, despite our best efforts to manually verify the completeness and accuracy of all agent cards, mistakes may have occurred. In addition, the response rate of developers to our requests for feedback was 36%. Accordingly, it is possible that some developers may, for example, have in place internal safety documents or practices that we could not discover from publicly available documentation, or were not informed about through correspondence. Recognizing these concerns, we have established a structured process for facilitating further corrections to the index. These can be submitted at this link.
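The counts and percentages quoted in Sections 5 and 6 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that assumes only the counts stated in the text (67 systems, 33 fields per card); the dictionary keys are shorthand labels chosen for illustration, and the two safety-testing rows are labeled as in Figure 2.

```python
N_SYSTEMS = 67        # systems in the index
FIELDS_PER_CARD = 33  # fields per agent card (Section 4)


def pct(count: int, total: int = N_SYSTEMS) -> float:
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


# label: (count stated in the text, percentage stated in the text)
reported = {
    "release code": (33, 49.3),
    "release documentation": (47, 70.1),
    "formal safety policy": (13, 19.4),
    "external safety testing": (5, 7.5),
    "developer safety testing": (6, 9.0),
    "academic developers": (18, 26.9),
    "industry developers": (49, 73.1),
    "software or computer use": (50, 74.6),
}

# Collect any entry whose recomputed percentage differs from the stated one.
mismatches = {
    label: (count, stated, pct(count))
    for label, (count, stated) in reported.items()
    if pct(count) != stated
}

# 67 cards x 33 fields, to compare against "over 2,200 fields".
total_fields = N_SYSTEMS * FIELDS_PER_CARD

# Share of the 18 academic systems that release code; the text gives 88.8%,
# which matches truncation rather than rounding of 88.88...%.
academic_code_share = pct(16, 18)

if __name__ == "__main__":
    print("mismatches:", mismatches)
    print("total fields:", total_fields)
    print("academic code share:", academic_code_share)
```

With 67 cards of 33 fields each, the product is 2,211, which is consistent with the "over 2,200 fields" figure above.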

Promoting problematic practices. The findings we present in Section 5, particularly the lack of transparency around the safety features of agentic systems, could arguably promote problematic risk management practices. For example, developers could choose to ‘game’ an index like ours through perfunctory, selective disclosure of information recorded in the index (Krawiec, 2003; Marquis et al., 2016). Due in part to this concern, we do not use this index to make developer scorecards. Instead, we see our findings as offering basic information to key stakeholders, including users, developers, auditors, and policymakers. In doing so, we hope to lay the groundwork for more targeted assessments of impacts and risks from agentic systems in future work.

7. Discussion and Future Work

The agentic AI ecosystem is difficult to document. The extensive data collection process undertaken for the current paper (see Section 3) sheds light on the significant challenges involved in documenting agentic AI systems. During this process, we encountered a diverse range of AI systems, across multiple domains, in different places in the research–product spectrum, and accompanied by varying levels of information and documentation. The differences were often most stark when comparing systems developed in industry and systems developed in academia, the latter of which are typically simpler and more open. On occasion, these features of the agentic AI ecosystem made it challenging to determine whether a particular system meets our criteria for inclusion in the index. Most importantly, the fact that we ultimately produced an “AI Agent Index” should not be taken to suggest that this ecosystem lends itself to clean taxonomization and indexing (it does not). We expect these documentation challenges to persist for the foreseeable future.

Future documentation work should be appropriately scoped. Our research design, including both the selection of information fields to be collected and the methods for collecting data, offers lessons for future attempts to document the agentic AI ecosystem. From the outset, we sought to collect information on agentic systems that had been generally overlooked by previous survey papers and overviews of the field, such as the accessibility of documentation and code, information regarding red-teaming and safety policies, and the country of developers (see Section 4). Future documentation work can build on this approach, examin-

Documentation can inform governance and policy. Our findings (discussed in Section 5) may inform the scope and methods of AI governance and policymaking:

• The majority of indexed agentic systems were developed in industry, suggesting that governance interventions should consider the incentives of corporate developers (distinct from those of academic labs).
• Most indexed systems were developed by US-based organizations, indicating that governance efforts focused on US contexts could have more leverage than efforts in other countries or regions.
• The prominence of software engineering and computer-use agents suggests that policy researchers and practitioners should prioritize these domains when designing governance frameworks.
• Very few developers disclose information about safety or risk management, underscoring the importance of establishing transparency and disclosure mechanisms as a key first step in the governance of agentic systems.

To address knowledge and accountability gaps uncovered by our findings, policymakers could consider:

• Structured bug bounties: Incentivizing external red-teaming promotes the proactive discovery of vulnerabilities, adapting approaches used in cybersecurity.
• Systematic testing of agents: Governance bodies and academic labs could coordinate risk assessments of agentic systems.
• Centralized oversight of indices: Regulatory or standard-setting institutions could establish and maintain indices of agentic systems like this one.
• Integration with model registries: Incorporate indices of agentic systems into broader registry frameworks (McKernon et al., 2024), ensuring unified reporting of agentic systems, common safety benchmarks, and clearer accountability mechanisms.

Impact Statement

This work was undertaken to improve our collective understanding of the emerging field of agentic AI. Its contributions revolve around the compilation and analysis of publicly available information, supplemented by correspon-
ing a broader range of technical, safety, and policy-relevant dence with developers. In Section 6, we discuss how trans-
features of agentic AI systems. To ensure tractability, we parency standards can be ‘gamed,’ and note that this was
recommend that future work surveying the agent ecosystem one reason that we did not score developers using the index.
be appropriately scoped either in breadth or depth. For ex- Taken together, we hope the methodology and findings intro-
ample, selection criteria could be revised to demand a high duced by the AI Agent Index inform progress toward better
threshold for “agency” or anticipated societal impact. risk management practices and governance frameworks for
agentic AI systems.


Acknowledgments

We are thankful to Alan Chan, Atoosa Kasirzadeh, Laker Newhouse, Gabe Mukobi, Rishi Bommasani, Peter Cihon, Merlin Stein, Greg Leppert, Jack Cushman, and Seth Lazar for discussions and feedback.

References

Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., et al. Agentharm: A benchmark for measuring harmfulness of llm agents. arXiv preprint arXiv:2410.09024, 2024.

Anwar, U., Saparov, A., Rando, J., Paleka, D., Turpin, M., Hase, P., Lubana, E. S., Jenner, E., Casper, S., Sourbut, O., et al. Foundational challenges in assuring alignment and safety of large language models. arXiv preprint arXiv:2404.09932, 2024.

Ashby, W. R. An Introduction to Cybernetics. Chapman & Hall, London, 1956.

Bel, M. A., Ríos, J. L., Carrasco, R. A. L., Michelini, J., Milano, G., Milano, G., Pérez, M., and Pasquero, G. Globant code fixer agent: Whitepaper, November 2024. URL [Link]com/wp-content/uploads/2024/11/Whitepaper-Globant-Code-Fixer-Agent.pdf. Accessed: 2025-01-18.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 610–623, 2021.

Bengio, Y., Mindermann, S., Privitera, D., Besiroglu, T., Bommasani, R., Casper, S., Choi, Y., Fox, P., Garfinkel, B., Goldfarb, D., Heidari, H., Ho, A., Kapoor, S., Khalatbari, L., Longpre, S., Manning, S., Mavroudis, V., Mazeika, M., Michael, J., Newman, J., Ng, K. Y., Okolo, C. T., Raji, D., Sastry, G., Seger, E., Skeadas, T., South, T., Strubell, E., Tramèr, F., Velasco, L., Wheeler, N., Acemoglu, D., Adekanmbi, O., Dalrymple, D., Dietterich, T. G., Felten, E. W., Fung, P., Gourinchas, P.-O., Heintz, F., Hinton, G., Jennings, N., Krause, A., Leavy, S., Liang, P., Ludermir, T., Marda, V., Margetts, H., McDermid, J., Munga, J., Narayanan, A., Nelson, A., Neppel, C., Oh, A., Ramchurn, G., Russell, S., Schaake, M., Schölkopf, B., Song, D., Soto, A., Tiedrich, L., Varoquaux, G., Yao, A., Zhang, Y.-Q., Albalawi, F., Alserkal, M., Ajala, O., Avrin, G., Busch, C., de Leon Ferreira de Carvalho, A. C. P., Fox, B., Gill, A. S., Hatip, A. H., Heikkilä, J., Jolly, G., Katzir, Z., Kitano, H., Krüger, A., Johnson, C., Khan, S. M., Lee, K. M., Ligot, D. V., Molchanovskyi, O., Monti, A., Mwamanzi, N., Nemer, M., Oliver, N., Portillo, J. R. L., Ravindran, B., Rivera, R. P., Riza, H., Rugege, C., Seoighe, C., Sheehan, J., Sheikh, H., Wong, D., and Zeng, Y. International ai safety report, 2025. URL [Link]

Boiko, D. A., MacKnight, R., Kline, B., and Gomes, G. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.

Bommasani, R., Klyman, K., Longpre, S., Kapoor, S., Maslej, N., Xiong, B., Zhang, D., and Liang, P. The foundation model transparency index. arXiv preprint arXiv:2310.12941, 2023a.

Bommasani, R., Soylu, D., Liao, T. I., Creel, K. A., and Liang, P. Ecosystem graphs: The social footprint of foundation models. arXiv preprint arXiv:2303.15772, 2023b.

Bran, M., Andres, Cox, S., Schilter, O., Baldassari, C., White, A. D., and Schwaller, P. Augmenting large language models with chemistry tools. Nature Machine Intelligence, pp. 1–11, 2024.

Chan, A., Salganik, R., Markelius, A., Pang, C., Rajkumar, N., Krasheninnikov, D., Langosco, L., He, Z., Duan, Y., Carroll, M., et al. Harms from increasingly agentic algorithmic systems. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 651–666, 2023.

Chan, A., Ezell, C., Kaufmann, M., Wei, K., Hammond, L., Bradley, H., Bluemke, E., Rajkumar, N., Krueger, D., Kolt, N., et al. Visibility into ai agents. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 958–973, 2024a.

Chan, A., Kolt, N., Wills, P., Anwar, U., de Witt, C. S., Rajkumar, N., Hammond, L., Krueger, D., Heim, L., and Anderljung, M. Ids for ai systems. arXiv preprint arXiv:2406.12137, 2024b.

Chan, A., Wei, K., Huang, S., Rajkumar, N., Perrier, E., Lazar, S., Hadfield, G. K., and Anderljung, M. Infrastructure for ai agents, 2025. URL [Link]abs/2501.10114.

Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., Starace, G., Liu, K., Maksin, L., Patwardhan, T., et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024c.

Cohen, M. K., Kolt, N., Bengio, Y., Hadfield, G. K., and Russell, S. Regulating advanced artificial agents. Science, 384(6691):36–38, 2024.


Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: towards a generalist agent for the web. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 28091–28114, 2023.

Dennett, D. C. The intentional stance. 1989.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Dung, L. Understanding artificial agency. The Philosophical Quarterly, pp. pqae010, 2024.

Durante, Z., Huang, Q., Wake, N., Gong, R., Park, J. S., Sarkar, B., Taori, R., Noda, Y., Terzopoulos, D., Choi, Y., et al. Agent ai: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024.

Fang, R., Bindu, R., Gupta, A., Zhan, Q., and Kang, D. Llm agents can autonomously hack websites. arXiv preprint arXiv:2402.06664, 2024a.

Fang, R., Bindu, R., Gupta, A., Zhan, Q., and Kang, D. Teams of llm agents can exploit zero-day vulnerabilities. arXiv preprint arXiv:2406.01637, 2024b.

Fırat, M. and Kuleli, S. What if gpt4 became autonomous: The auto-gpt project and use cases. Journal of Emerging Computer Technologies, 3(1):1–6, 2023.

Fourney, A., Bansal, G., Mozannar, H., Tan, C., Salinas, E., Niedtner, F., Proebsting, G., Bassman, G., Gerrits, J., Alber, J., et al. Magentic-one: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468, 2024.

Franklin, S. and Graesser, A. Is it an agent, or just a program?: A taxonomy for autonomous agents. In International workshop on agent theories, architectures, and languages, pp. 21–35. Springer, 1996.

Gabriel, I., Manzini, A., Keeling, G., Hendricks, L. A., Rieser, V., Iqbal, H., Tomašev, N., Ktena, I., Kenton, Z., Rodriguez, M., et al. The ethics of advanced ai assistants. arXiv preprint arXiv:2404.16244, 2024.

Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., and Crawford, K. Datasheets for datasets. arXiv preprint arXiv:1803.09010, 2018.

Gilbert, T. K., Lambert, N., Dean, S., Zick, T., and Snoswell, A. Reward reports for reinforcement learning. arXiv preprint arXiv:2204.10817, 2022.

Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y., Eck, D., and Faust, A. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023.

Huang, Q., Vora, J., Liang, P., and Leskovec, J. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024.

Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.

Jansen, P., Côté, M.-A., Khot, T., Bransom, E., Mishra, B. D., Majumder, B. P., Tafjord, O., and Clark, P. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. arXiv preprint arXiv:2406.06769, 2024.

Jennings, N. R. On agent-based software engineering. Artificial intelligence, 117(2):277–296, 2000.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023.

Kapoor, S., Stroebl, B., Siegel, Z. S., Nadgir, N., and Narayanan, A. Ai agents that matter. arXiv preprint arXiv:2407.01502, 2024.

Kenton, Z., Kumar, R., Farquhar, S., Richens, J., MacDermott, M., and Everitt, T. Discovering agents. Artificial Intelligence, 322:103963, 2023.

Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.

Koh, J. Y., Lo, R., Jang, L., Duvvur, V., Lim, M. C., Huang, P.-Y., Neubig, G., Zhou, S., Salakhutdinov, R., and Fried, D. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024.

Kolt, N. Governing ai agents. arXiv preprint arXiv:2501.07913, 2025.

Krawiec, K. D. Cosmetic compliance and the failure of negotiated governance. Wash. ULQ, 81:487, 2003.

Kumar, P., Lau, E., Vijayakumar, S., Trinh, T., Team, S. R., Chang, E., Robinson, V., Hendryx, S., Zhou, S., Fredrikson, M., et al. Refusal-trained llms are easily jailbroken as browser agents. arXiv preprint arXiv:2410.13886, 2024.


Lazar, S. Frontier ai ethics: Anticipating and evaluating the societal impacts of generative agents. arXiv preprint arXiv:2404.06750, 2024.

Longpre, S., Mahari, R., Chen, A., Obeng-Marnu, N., Sileo, D., Brannon, W., Muennighoff, N., Khazam, N., Kabbara, J., Perisetla, K., et al. The data provenance initiative: A large scale audit of dataset licensing & attribution in ai. arXiv preprint arXiv:2310.16787, 2023.

Lu, C., Lu, C., Lange, R. T., Foerster, J., Clune, J., and Ha, D. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.

Maes, P. Designing autonomous agents: Theory and practice from biology to engineering and back. MIT press, 1990.

Maes, P. Modeling adaptive autonomous agents. Artificial life, 1(1-2):135–162, 1993.

Maes, P. Artificial life meets entertainment: lifelike autonomous agents. Communications of the ACM, 38(11):108–114, 1995.

Manzini, A., Keeling, G., Alberts, L., Vallor, S., Morris, M. R., and Gabriel, I. The code that binds us: Navigating the appropriateness of human-ai assistant relationships. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, volume 7, pp. 943–957, 2024a.

Manzini, A., Keeling, G., Marchal, N., McKee, K. R., Rieser, V., and Gabriel, I. Should users trust advanced ai assistants? justified trust as a function of competence and alignment. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 1174–1186, 2024b.

Marquis, C., Toffel, M. W., and Zhou, Y. Scrutiny, norms, and selective disclosure: A global study of greenwashing. Organization Science, 27(2):483–504, 2016.

McGregor, S. Preventing repeated real world ai failures by cataloging incidents: The ai incident database. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 15458–15463, 2021.

McKernon, E., Glasser, G., Cheng, D., and Hadfield, G. Ai model registries: A foundational tool for ai governance. arXiv preprint arXiv:2410.09645, 2024.

Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023a.

Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., and Scialom, T. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023b.

Mitchell, M. Why ai is harder than we think. arXiv preprint arXiv:2104.12871, 2021.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp. 220–229, 2019.

OpenAI. Introducing openai o1-preview, September 2024. URL [Link]introducing-openai-o1-preview/. Accessed: 2025-01-19.

Phuong, M., Aitchison, M., Catt, E., Cogan, S., Kaskasoli, A., Krakovna, V., Lindner, D., Rahtz, M., Assael, Y., Hodkinson, S., et al. Evaluating frontier models for dangerous capabilities. arXiv preprint arXiv:2403.13793, 2024.

Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.

Rao, A. S. and Georgeff, M. P. Modeling rational agents within a bdi-architecture. In Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning, pp. 473–484, 1991.

Rosenblueth, A., Wiener, N., and Bigelow, J. Behavior, purpose and teleology. Philosophy of science, 10(1):18–24, 1943.

Ruan, Y., Dong, H., Wang, A., Pitis, S., Zhou, Y., Ba, J., Dubois, Y., Maddison, C. J., and Hashimoto, T. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023.

Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Pearson, USA, 4th edition, 2020.

Sager, P. J., Meyer, B., Yan, P., von Wartburg-Kottler, R., Etaiwi, L., Enayati, A., Nobel, G., Abdulkadir, A., Grewe, B. F., and Stadelmann, T. Ai agents for computer use: A review of instruction-based computer control, gui automation, and operator assistants. arXiv preprint arXiv:2501.16150, 2025.

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.


Shavit, Y., Agarwal, S., Brundage, M., Adler, S., O’Keefe, C., Campbell, R., Lee, T., Mishkin, P., Eloundou, T., Hickey, A., et al. Practices for governing agentic ai systems. Research Paper, OpenAI, December, 2023.

Shen, H., Liu, C., Li, G., Wang, X., Zhou, Y., Ma, C., and Ji, X. Falcon-ui: Understanding gui before following user instructions. arXiv preprint arXiv:2412.09362, 2024.

Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366, 2023.

Siegel, Z. S., Kapoor, S., Nagdir, N., Stroebl, B., and Narayanan, A. Core-bench: Fostering the credibility of published research through a computational reproducibility agent benchmark. ArXiv, abs/2409.11363, 2024. URL [Link]org/CorpusID:272694423.

Slattery, P., Saeri, A. K., Grundy, E. A., Graham, J., Noetel, M., Uuk, R., Dao, J., Pour, S., Casper, S., and Thompson, N. The ai risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. arXiv preprint arXiv:2408.12622, 2024.

Solaiman, I., Talat, Z., Agnew, W., Ahmad, L., Baker, D., Blodgett, S. L., Chen, C., Daumé III, H., Dodge, J., Duan, I., et al. Evaluating the social impact of generative ai systems in systems and society. arXiv preprint arXiv:2306.05949, 2023.

Stroebl, B., Kapoor, S., and Narayanan, A. Hal: A holistic agent leaderboard for centralized and reproducible agent evaluation. [Link]princeton-pli/hal-harness/, 2025.

Su, Y., Yang, D., Yao, S., and Yu, T. Language agents: Foundations, prospects, and risks. In Li, J. and Liu, F. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts, pp. 17–24, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/[Link]-tutorials.3. URL [Link]emnlp-tutorials.3/.

Sumers, T. R., Yao, S., Narasimhan, K., and Griffiths, T. L. Cognitive architectures for language agents. arXiv preprint arXiv:2309.02427, 2023.

Sun, S., Liu, Y., Wang, S., Zhu, C., and Iyyer, M. Pearl: Prompting large language models to plan and execute actions over long documents. ArXiv, abs/2305.14564, 2023. URL [Link]org/CorpusID:258866190.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.

U.S. AI Safety Institute. Technical blog: Strengthening ai agent hijacking evaluations, January 2025. Accessed: 2025-01-19.

Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P.-S., Mellor, J., Glaese, A., Cheng, M., Balle, B., Kasirzadeh, A., et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 214–229, 2022.

Wiener, N. Cybernetics: Or Control and Communication in the Animal and the Machine. MIT Press, Cambridge, MA, 1961.

Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., et al. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts. arXiv preprint arXiv:2411.15114, 2024.

Winecoff, A. A. and Bogen, M. Improving governance outcomes through ai documentation: Bridging theory and practice. arXiv preprint arXiv:2409.08960, 2024.

Wooldridge, M. and Jennings, N. R. Intelligent agents: Theory and practice. The knowledge engineering review, 10(2):115–152, 1995.

Wu, Z., Han, C., Ding, Z., Weng, Z., Liu, Z., Yao, S., Yu, T., and Kong, L. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024.

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.

Xie, J., Zhang, K., Chen, J., Zhu, T., Lou, R., Tian, Y., Xiao, Y., and Su, Y. Travelplanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024a.


Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024b.

Xu, F. F., Song, Y., Li, B., Tang, Y., Jain, K., Bao, M., Wang, Z. Z., Zhou, X., Guo, Z., Cao, M., et al. Theagentcompany: benchmarking llm agents on consequential real world tasks. arXiv preprint arXiv:2412.14161, 2024.

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024a.

Yang, J., Jimenez, C. E., Zhang, A. L., Lieret, K., Yang, J., Wu, X., Press, O., Muennighoff, N., Synnaeve, G., Narasimhan, K. R., et al. Swe-bench multimodal: Do ai systems generalize to visual software domains? arXiv preprint arXiv:2410.03859, 2024b.

Yao, S. Language Agents: From Next-Token Prediction to Digital Automation. PhD thesis, Princeton University, 2024.

Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022a.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. ArXiv, abs/2210.03629, 2022b. URL [Link]org/CorpusID:252762395.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022c.

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.

Yoran, O., Amouyal, S. J., Malaviya, C., Bogin, B., Press, O., and Berant, J. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024.

Zaharia, M., Khattab, O., Chen, L., Davis, J. Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N., and Ghodsi, A. The shift from models to compound ai systems. [Link]2024/02/18/compound-ai-systems/, 2024.

Zhang, H., Song, Y., Hou, Z., Miret, S., and Liu, B. Honeycomb: A flexible llm-based agent system for materials science. arXiv preprint arXiv:2409.00135, 2024.

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.


A. Sample Agent Card


Here, we provide a sample agent card for Microsoft’s Magentic One (Fourney et al., 2024). We selected it based on
its recency, degree of documentation, openness, generality, and noteworthy performance. No authors have conflicts of
interest related to Microsoft or Magentic One, and this example selection was made without correspondence with Microsoft.
Including Magentic One’s agent card as an example is not an endorsement of the system or developer.

Magentic One
1. Basic information
• Website: [Link]
for-solving-complex-tasks/
• Short description: A multiagent system introduced by Microsoft with general capabilities.
• Intended uses: What does the developer state that the system is intended for? It is used for “ad-hoc, open-ended
tasks such as browsing the web and interacting with web-based applications, handling files, and writing and
executing Python code” [source].
• Date(s) deployed: Announced November 4, 2024 [source].
2. Developer
• Website: [Link]
• Legal name: Microsoft Corporation [source].
• Entity type: Corporation [source].
• Country (location of developer or first author’s first affiliation): Incorporation: Washington, USA (Microsoft
Corporation (2357303)) [source]. Registration: Delaware, USA. HQ: Washington, USA [source].
• Safety policies: What safety and/or responsibility policies are in place? Model evaluations and red teaming;
model reporting and information sharing; security controls [source]. Microsoft’s safety policies are described
online [source].
3. System components
• Backend model: What model(s) are used to power the system? The default model used is gpt-4o-2024-05-13, but
they also experiment with using OpenAI o1 [source].
• Publicly available model specification: Is there formal documentation on the system’s intended uses and how it is
designed to behave in them? Available [source].
• Reasoning, planning, and memory implementation: How does the system ‘think’? The system contains multiple
subagents that work together to solve problems. An “Orchestrator” agent directs the work at a high level, and the
“WebSurfer,” “FileSurfer,” “Coder,” and “ComputerTerminal” agents carry it out [source].
• Observation space: What is the system able to observe while ‘thinking’? It has full access to a filesystem and web
browser.
• Action space/tools: What direct actions can the system take? It is able to surf (including posting) on the web,
execute file system commands, and write/execute code.
• User interface: How do users interact with the system? Users can configure and experiment with it using the
AutoGen package [source].
• Development cost and compute: What is known about the development costs? Unknown.
4. Guardrails and oversight
• Accessibility of components:
– Weights: Are model parameters available? N/A; the system can use various backend models.
– Data: Is data available? N/A; the system can use various backend models.
– Code: Is code available? Available on GitHub as part of Microsoft’s AutoGen project [source].
– Scaffolding: Is system scaffolding available? Available [source].
– Documentation: Is documentation available? Available on GitHub [source], see also the technical report
[source].


• Controls and guardrails: What notable methods are used to protect against harmful actions? The developers
recommend using containers, virtual environments, log monitoring, human oversight, access limitations, and data
safeguards.
• Customer and usage restrictions: Are there know-your-customer measures or other restrictions on customers?
None.
• Monitoring and shutdown procedures: Are there any notable methods or protocols that allow for the system to be
shut down if it is observed to behave harmfully? Logs are kept while the system runs.

5. Evaluations
• Notable benchmark evaluations (e.g., on SWE-Bench Verified): GAIA (38%), AssistantBench (27.7%), and
WebArena (32.8%) [source].
• Bespoke testing (e.g., demos): None.
• Safety: Have safety evaluations been conducted by the developers? What were the results? They report on ad-hoc
evaluations of failures and safety concerns in the technical report [source]. The developers claim: “We performed
testing for Responsible AI harm e.g., cross-domain prompt injection and all tests returned the expected results
with no signs of jailbreak” [source].
• Publicly reported external red-teaming or comparable auditing:
– Personnel: Who were the red-teamers/auditors? None.
– Scope, scale, access, and methods: What access did red-teamers/auditors have and what actions did they take?
None.
– Findings: What did the red-teamers/auditors conclude? None.
6. Ecosystem
• Interoperability with other systems: What tools or integrations are available? It was not explicitly designed to
interoperate with any particular systems other than the web browser and filesystem, but it could presumably be
integrated with others with little configuration.
• Usage statistics and patterns: Are there any notable observations about usage? Microsoft AutoGen has 36.9k
stars and 5.3k forks [source].
7. Additional notes: None.
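
To illustrate how an agent card like the one above might be stored and queried in structured form, the following is a minimal Python sketch. It is not part of the index's tooling: the `AgentCard` class, the `undisclosed_fields` helper, and the shortened field label "External red-teaming personnel" are hypothetical names introduced here for illustration. The recorded values are copied from a few fields of the sample card.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AgentCard:
    """Hypothetical container for one agent card (illustrative only)."""
    name: str
    # Maps a section title to {field label: recorded free-text value}.
    sections: Dict[str, Dict[str, str]] = field(default_factory=dict)


# Values that the sample card uses when nothing is reported for a field.
UNDISCLOSED = {"none", "unknown"}


def undisclosed_fields(card: AgentCard) -> List[Tuple[str, str]]:
    """Return (section, field) pairs whose value is 'None' or 'Unknown'."""
    return [
        (section, label)
        for section, fields in card.sections.items()
        for label, value in fields.items()
        if value.strip().rstrip(".").lower() in UNDISCLOSED
    ]


# A small subset of the sample card, with values as recorded above.
card = AgentCard(
    name="Magentic One",
    sections={
        "System components": {
            "Development cost and compute": "Unknown.",
        },
        "Guardrails and oversight": {
            "Customer and usage restrictions": "None.",
            "Monitoring and shutdown procedures":
                "Logs are kept while the system runs.",
        },
        "Evaluations": {
            "Bespoke testing": "None.",
            # Shortened, hypothetical label for the red-teaming field.
            "External red-teaming personnel": "None.",
        },
    },
)

for section, label in undisclosed_fields(card):
    print(f"{section}: {label}")
```

Keeping the values as free text mirrors the card's prose answers. A stricter schema with typed fields would make cross-system comparisons easier, at the cost of losing nuance in the recorded answers.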
