Safe Robust Multi-Agent RL Methods

Q: What role do control barrier functions play in the safety and optimization of multi-agent systems according to the document?

Control barrier functions (CBFs) in multi-agent systems serve the critical role of maintaining system safety by ensuring forward invariance of the safe set. They are used to find optimal or near-optimal control policies that keep the states of the system within the safe set, thus guaranteeing safety. The integration of robust neural CBFs in the system helps to overcome the difficulties in constructing general CBFs, particularly in complex environments. These functions are used to define safety constraints and form the basis for ensuring that control actions remain within safe boundaries .

Q: Why does integrating robust control barrier functions with multi-agent reinforcement learning enhance the safety of these systems?

Integrating robust control barrier functions (CBFs) with multi-agent reinforcement learning enhances the safety of the systems by providing a mechanism to enforce safety constraints consistently during the learning process. Robust CBFs accommodate uncertainties and allow for decentralized implementation in complex environments where handcrafted CBFs are insufficient. By ensuring that the control actions respect these safety certificates, the system maintains safe operation across all agents, preventing violations like collisions or unsafe interactions, even under adversarial conditions. This integration leads to the development of policies that not only optimize performance but also inherently incorporate safety constraints, thus offering a comprehensive approach to both robust learning and safe operation .

Q: Explain the theoretical guarantees provided by Theorem 1 regarding the convergence and optimality of robust multi-agent reinforcement learning algorithms.

Theorem 1 provides theoretical guarantees that the robust multi-agent reinforcement learning (MARL) algorithms will converge to an optimal solution as the number of iterations approaches infinity. It ensures that the Q values calculated by the algorithm gradually approximate the optimal Q values through consistent updates based on a well-defined Bellman optimality framework. This convergence is important to ensure that the learning process continues to refine the policy towards achieving both safety and optimal performance in unpredictable environments. The theorem also supports the inclusion of elements like replay buffers and action-output noise to stabilize training and handle the intrinsic stochasticity within the environment .

Q: How does the concept of adversarial robustness contribute to the safety of multi-agent reinforcement learning systems?

Adversarial robustness in multi-agent reinforcement learning systems is achieved by modeling adversaries that simulate modeling errors and external disturbances. These adversaries are integrated into the system to provide a robust adversarial setting, which helps in training the agents to withstand potential disruptions. The agents learn policies that are robust against the worst-case disturbances introduced by these adversaries, thus enhancing the system’s resilience to uncertainties and maintaining optimal performance while ensuring safety. The adversaries are factored into the Q value and Bellman optimality equations to refine the agent's decision-making process despite the interventions .

Q: What is the function of the shaping function in robust multi-agent reinforcement learning systems, and how is it implemented?

The shaping function in robust multi-agent reinforcement learning systems is designed to refine the reward structure, adding a layer that promotes certain behavior outcomes, such as prioritizing safety alongside performance. It is implemented by introducing the safety reward term, which transitions standard reward signals into an augmented form. This reshaping of rewards helps to influence agent policy development in a manner that is conducive to safe operations while maximizing overall returns. The shaping function effectively adjusts the reward landscape, directing learning towards desired states while ensuring compliance with safety conditions through continuous reinforcement .

Q: What are the assumptions about the state sets, and how do they relate to finding safe control policies in multi-agent systems?

The assumptions about the state sets in multi-agent systems include defining a safe set (C_s), dangerous set (C_d), and goal set (C_goal). The relationship between these sets is that the goal set is a subset of the safe set (C_goal ⊂ C_s), implying that every goal state is also a safe state. The objective is to develop control policies that maintain the forward invariance of the safe set, ensuring all states remain safe over time. Thus, the policies must guarantee that if the initial state is within the safe set (s_0 ∈ C_s), then the state will remain within it for all times t ≥ 0. This framework is essential to develop constrained optimization problems for maximizing the reward while adhering to safety constraints .

Q: How do augmented reward functions contribute to the optimal performance of robust multi-agent reinforcement learning systems?

Augmented reward functions contribute to the optimal performance of robust multi-agent reinforcement learning systems by modifying the reward structure to include safety rewards alongside standard performance rewards. By adding a safety reward term (r_t,s,i), the augmented reward function enhances the learning process by prioritizing actions that maintain safety while still pursuing performance optimization objectives. This modifies the agents' decision-making to balance between achieving high rewards and ensuring safe operation. The augmented reward function guides agents toward safer policies, thus aligning with the initial goals of achieving robust and safe solutions. Theoretical analyses affirm that the solution derived under the augmented reward scenario remains optimal for the original problem, meaning it does not compromise on reward maximization while ensuring safety .

Q: Why is the safety attention mechanism (SAM) important in the context of multi-agent reinforcement learning when dealing with safety-critical systems?

The safety attention mechanism (SAM) is significant in multi-agent reinforcement learning for safety-critical systems because it enhances the model's capability to focus on critical safety events. SAM works alongside robust neural control barrier functions to manage the safety of the system specifically when the number of agents and the dynamics involved introduce complexity. By increasing attention to potentially dangerous situations, it allows for better management of time-varying observable agents and enhances the safety policies of the system, thus preventing potential failures or collisions against unknown environmental disturbances .

Q: How does the proposed safe robust multi-agent reinforcement learning method address modeling errors and external disturbances in safety-critical systems?

The proposed safe robust multi-agent reinforcement learning method addresses modeling errors and external disturbances by treating them as adversaries for each agent. This approach allows each agent to learn a policy that is robust against the disturbances introduced by these adversaries. The use of decentralized robust neural control barrier functions ensures safety when handcrafted CBFs are challenging to construct. In conjunction with a safety attention mechanism, the method enhances focus on dangerous events and maintains system safety. The online fine-tuning of policies further improves the safety and robustness of the system .

This paper proposes a novel safe robust multi-agent reinforcement learning (MARL) method that integrates decentralized robust neural control barrier functions (CBFs) and a safety attention mechanism (SAM) for safety-critical multi-agent systems. The method addresses challenges such as modeling errors and external disturbances by treating these factors as adversaries, enabling agents to learn robust control policies while maintaining safety. Experiments demonstrate the effectiveness of the proposed approach in enhancing safety and performance in complex environments.

Information Sciences 690 (2025) 121567


Safe robust multi-agent reinforcement learning with neural control barrier functions and safety attention mechanism
Shihan Liu, Lijun Liu*, Zhen Yu*
Department of Automation, Xiamen University, Xiamen, 361103, China

ARTICLE INFO

Keywords:
Multi-agent reinforcement learning
Control barrier function
Safety attention mechanism
Robust

ABSTRACT

In this paper, a novel safe robust multi-agent reinforcement learning method integrated with
decentralized robust neural control barrier functions (CBFs) and a safety attention mechanism
(SAM) is proposed for the safety-critical multi-agent system (MAS). Safety is fundamental in the
safety-critical MAS but can be affected by factors such as modeling errors, external unknown
disturbances, and time-varying observable agents. Several appropriate measures are implemented
to address these issues. First, modeling errors and external disturbances are regarded as an
adversary for each agent. The agent learns a policy that is robust to disturbances created by the
adversary. Accordingly, decentralized robust neural CBFs are introduced to maintain the safety of
the MAS, particularly when the general handcrafted CBFs are difficult to construct. The SAM, in
combination with the robust neural CBFs, provides a control policy with the capacity to handle
time-varying observable agents and increases its attention to dangerous events. The online
fine-tuning procedure further enhances the safety. Finally, experiments demonstrate the safety and
effectiveness of the proposed method.

1. Introduction

In recent years, reinforcement learning (RL) [1] has made significant achievements in research and applications, such as
autonomous navigation [2], industrial robots [3], and resource scheduling [4]. In some real applications, the number of agents or
players often exceeds one, necessitating the agents to engage in competition or cooperation to achieve the optimal overall result
[5–9]. Multi-agent RL (MARL) methods have been applied extensively in various domains, including job shop scheduling [5],
bipartite containment control [6], traffic signal control [7], and autonomous mobile robots [9]. Safety is the foundation in the safety-critical
multi-agent system (MAS). The control method can potentially result in irreparable damage to the system without safety certificates.
For example, collision avoidance is critical for large-scale multi-robot platforms. Safe MARL should maximize the reward while
considering safety issues. Hence, the objective of this study is to determine the safe optimal policies for the safety-critical MAS.
Detrimental factors threaten the safety of the system, including modeling errors and external disturbances. In real-world scenarios,
the absence of accurate environmental models is a common challenge, and the system dynamics are completely unknown in certain
situations [8,10]. In addition, deviations between the approximate and real dynamics are inevitable due to the model uncertainties,
scarcity of data, and the sensor equipment [11–13]. Current RL agents are commonly trained on simulation platforms instead of in
real-world environments, primarily motivated by cost and safety considerations. The sim-to-real gap often hinders the transfer if the

* Corresponding authors.
E-mail addresses: 23220200156035@[Link] (S. Liu), liulijun@[Link] (L. Liu), yuzhen20@[Link] (Z. Yu).

[Link]
Received 19 March 2023; Received in revised form 16 October 2024; Accepted 16 October 2024
Available online 18 October 2024
0020-0255/© 2024 Elsevier Inc. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
S. Liu, L. Liu and Z. Yu Information Sciences 690 (2025) 121567

controller lacks robustness to modeling errors. Furthermore, external disturbances, such as ambient noise and control jitter, can affect
the control performance, and the control policy is sensitive to perturbations in the environment. Disturbance-observer-based methods
have been adopted in many applications to mitigate the impact of this situation [14,15]. Nevertheless, the accuracy of exosystem
interference estimation relies heavily on precise measurements. The errors in the estimation and measurement are multiplied as the
number of agents increases. Moreover, various types of environmental perturbations increase the difficulty of accurate estimation.
Hence, these adverse factors pose significant challenges in the design of robust and safe controllers.
In the RL field, some studies [11,12] have improved the robustness by introducing an adversary as a second player that
competes with the protagonist and thereby diminishes the overall system performance. The original controller
and adversary are considered maximizing and minimizing players, respectively. This problem is also formulated as an 𝐻∞ control
problem, which is a type of zero-sum game. The traditional zero-sum game mainly focuses on the system performance, which is
inherently an optimization problem. However, conflicts may exist between optimality and safety. The focus of this study is to achieve a
balance between optimality and safety for a model-free disturbed MAS with an imperative to prioritize safety. Our MARL method
is inspired by the principle of 𝐻∞ control, which aims to acquire control policies that are robust to modeling errors and external
disturbances.
For safety issues, control barrier functions (CBFs) [16] have been used extensively to design safe controllers with constraints and
have achieved success in some applications such as quadrotors [17] and robotic systems [18]. CBFs are intended to maintain the
forward invariance of a safe set. Zhao et al. [19] integrate the CBF and control Lyapunov function (CLF) methods in RL, requiring
knowledge of the dynamics to construct CBFs and CLFs. Qin et al. [20] propose a method for learning safe control policies and neural
CBF certificates without requiring an accurate system model. However, the discrepancy between the nominal and real dynamics
leads to a reduction in safety performance. To handle both model uncertainties and complicated environments, Dawson et al. [21]
develop a model-based learning approach using robust neural Lyapunov barrier functions (rCLBF). Nevertheless, this method only
considers the model uncertainty and the quadratic programming (QP) is computed at each training step which leads to a significant
computational burden. Qin et al. [22] incorporate the neural CBFs as safety certificates into multi-agent control policies. This method
is only suitable for ideal scenarios due to the lack of consideration of modeling errors and exosystem disturbances. In summary, the
above methods yield a safe policy based on an existing stable policy with the learned neural CBFs. However, several unresolved
issues remain in the existing neural CBF methods, such as optimality, modeling errors, and disturbed systems.
Facing the challenges above, we propose a safe robust MARL method with decentralized robust neural CBFs in this study. First,
the modeling errors and external unknown disturbances are transferred to the adversary of each agent. Since it is difficult to construct
universal and accurate CBFs in the MAS, robust neural CBFs are acquired by learning the data along the system trajectory. In the MAS,
the computation of the interaction information among agents becomes increasingly difficult as the number of agents increases. The
curse of dimensionality is a critical issue. In addition, due to the communication limitation, each agent has only partial observability
of states [23]. For example, one agent may only have a view of its neighbors and coordinate with them directly. Also, the connectivity
network of agents is time-varying due to their movements [24]. Scaling to a system of numerous agents poses a challenging task in
partially observable domains. To address these problems, we adopt the popular approach of centralized learning and distributed
policies [25], in which each agent takes actions based on its local observation. Accordingly, this study proposes the decentralized robust
neural CBFs in a decentralized execution fashion. The decentralized control method has demonstrated its advantages in addressing
the partial observability and dynamically changing connectivity network.
For a given agent, the neighboring agents have varying effects on its safety, implying that different danger levels are associated
with different agents. For example, in an autonomous driving scenario, a car approaching at a high speed from a far distance is
more hazardous than a nearby car driving away. To address this issue, we employ an attention mechanism [26] that assigns different
weights to the interacting agents. In some MARL studies [27,28], the attention mechanism is incorporated into the critic to process
the received information accurately. Furthermore, several studies [29–32] have incorporated an attention mechanism for inter-agent
communication. In particular, the approach presented in [29] allows agents to prioritize messages from their neighbors selectively
based on importance, and this method can dynamically adapt to a changing number of agents. In this study, principally concerned
with the safety of agents, a safety attention mechanism (SAM) is devised to assist a specific agent in identifying more hazardous
neighbors. The SAM is integrated with decentralized robust neural CBFs, and their training procedures are synchronized. In addition,
the SAM is designed as a shared module for all aspects of the algorithm, which is also regarded as an observation encoder. To handle
time-varying connectivity, we aggregate the states of all neighbors and weigh them according to the danger degree assigned by the
SAM. This approach enhances the safety performance and reduces the computational redundancy.
Based on the discussion of the aforementioned issues and relevant solutions, a safe robust MARL with decentralized robust neural
CBFs and SAM is proposed. The main contributions of the study are summarized as follows:

1. A novel safe robust MARL method is proposed for the safety-critical MAS. In contrast to prior works [24,33], our method considers
a scenario in which the dynamics of the MAS are unknown and the system is disturbed by unknown perturbations. It is a
data-driven method that mitigates the need for precise knowledge of the system dynamics. To counteract the effects of modeling errors
and external unknown disturbances, we employ the 𝐻∞ method, which treats these factors as adversaries for each agent. The
protagonist learns a robust control policy that performs well in the presence of various disturbances created by the adversary.
2. CBFs are introduced to guarantee the safety of the MAS. Due to the complexity and the various uncertainties of the environment,
a decentralized robust neural CBF is formulated using data collected along the system trajectory. This approach is more
universally applicable than traditional handcrafted CBFs [16,17,34]. The establishment of neural CBFs considers the influence of the
aforementioned adversary. Moreover, the decentralized execution mode of the neural CBF allows for scalability to an arbitrary
number of agents. Under the premise of safety, the integration of robust MARL with the robust neural CBF will balance optimality
and safety compared with the previous multi-agent control methods with CBFs [22,34].
3. In some works [22,35], all agents are treated equally. However, the influence of neighboring agents on a particular agent may
differ. To address this, the SAM is introduced to evaluate the varying levels of danger posed by neighboring agents. In addition, the
SAM adjusts the variable-dimensional received information to be invariant, which addresses the issue of time-varying observable
agents.

The remainder of the paper is organized as follows. The preliminaries and problem statements are introduced in Section 2. Next,
we propose a safe robust MARL with CBF safety certificates for the safety-critical MAS in Section 3. The simulation and the experimental
analysis are carried out in Section 4. Finally, we conclude with Section 5.

2. Preliminaries and problem statement

In this section, preliminaries and the problem statement are introduced, and we provide some fundamental concepts, including
MARL, CBF, and the attention mechanism.

2.1. MARL and Markov game

The MARL problem can be modeled as a multi-agent extension of partially observable Markov decision processes (POMDPs)
[36,37], which is also known as a partially observable Markov game [25,38]. A Markov game with $N_a$ agents can be described
as $\langle S, \{A_i\}_{i \in N_a}, P, \mathcal{O}, \{R_i\}_{i \in N_a}, \gamma \rangle$. $S$ is the state space shared by all agents. $A_i$ denotes the action space for the agent $i$.
$P: S \times A_1 \times \cdots \times A_{N_a} \times S \to [0, 1]$ represents the state transition probability. $\mathcal{O}$ denotes the observation space.
$R_i: S \times A_1 \times \cdots \times A_{N_a} \to \mathbb{R}$ denotes the reward function of the agent $i$. $\gamma$ is the discount factor that determines the current value of future rewards.
We consider a partially observable environment in which the agent $i$ only receives the observation $o_i^t \in \mathcal{O}$ that contains the local
state $s_i^t \in S$ and partial information from other agents. The joint control policy for the MAS is $\pi = \{\pi_1, \cdots, \pi_{N_a}\}$, where $\pi_i$ is the
control policy of the agent $i$. At time $t$, the agent $i$ selects the action $u_i^t \in A_i$ by the control policy $\pi_i$ given the current observation
$o_i^t$, obtains the reward $r_i^t \in \mathbb{R}$ calculated by $R_i$, and transitions to the next state $s_i^{t+1}$. The goal of the agent $i$ is to learn a policy that
maximizes its total expected return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_i^t\right]$.
We define the Q value function $Q_i(o_i, u_i, u_{-i})$ for the agent $i$ as follows:

\[
Q_i(o_i, u_i, u_{-i}) = \mathbb{E}\left[ \sum_{l=0}^{\infty} \gamma^l r_i^{t+l} \,\middle|\, o_i^t = o_i,\ u_i^t = u_i,\ u_{-i}^t = u_{-i} \right] \tag{1}
\]

where $-i$ represents other agents except for the agent $i$.

Accordingly, the Bellman optimality equation for the optimal Q-function is

\[
Q_i^*(o_i^t, u_i^t, u_{-i}^t) = \mathbb{E}\left[ r_i^t + \gamma \max_{u_i^{t+1},\, u_{-i}^{t+1}} Q_i^*(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}) \right]. \tag{2}
\]

2.2. Discrete-time control barrier function

Barrier functions based on control techniques have a significant effect on safety-critical applications.
We define a set $\mathcal{C}_s$ as

\[ \mathcal{C}_s = \{ s \in \mathbb{R}^n \mid h(s) \ge 0 \} \tag{3} \]
\[ \partial\mathcal{C}_s = \{ s \in \mathbb{R}^n \mid h(s) = 0 \} \tag{4} \]
\[ \mathrm{Int}\,\mathcal{C}_s = \{ s \in \mathbb{R}^n \mid h(s) > 0 \} \tag{5} \]

where $h(s) \in \mathbb{R}$ is a continuously differentiable function.
The discrete-time CBF is defined as follows:

Definition 1 (Discrete-time CBF) [39]: Let a function $h: \mathcal{C}_s \to \mathbb{R}$ be defined by (3)-(5). $h$ is a discrete-time CBF on the set $\mathcal{C}_s$ if there
exists a class $\mathcal{K}$ function $\alpha$ satisfying $\alpha(r) < r$ for all $r > 0$ such that

\[ \Delta h(s^t, u^t) \ge -\alpha(h(s^t)) \tag{6} \]

where $\Delta h(s^t, u^t) = h(s^{t+1}) - h(s^t)$.

Then, if a CBF exists for the discrete-time system, the set $\mathcal{C}_s$ is invariant, as proven in [40].
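
As an illustration of Definition 1, the following minimal sketch checks inequality (6) along a trajectory of a toy scalar system. The barrier `h`, the dynamics `step`, the linear class-K function `alpha`, and the controller `safe_controller` are all hypothetical choices made for this example and do not come from the paper.

```python
def h(s):
    """Hypothetical barrier: the safe set is the interval |s| <= 1."""
    return 1.0 - s**2

def alpha(r, eps=0.5):
    """Linear class-K function with alpha(r) < r for all r > 0."""
    return eps * r

def step(s, u):
    """Hypothetical dynamics s^{t+1} = s^t + u^t."""
    return s + u

def cbf_condition_holds(s, u):
    """Check Delta h(s, u) >= -alpha(h(s)), i.e. inequality (6)."""
    delta_h = h(step(s, u)) - h(s)
    return delta_h >= -alpha(h(s))

def safe_controller(s, gain=0.3):
    """Hypothetical policy that pushes the state toward the origin."""
    return -gain * s

# Roll out the controller from a safe initial state and check (6) at each step.
s = 0.9
trajectory = [s]
for _ in range(20):
    u = safe_controller(s)
    assert cbf_condition_holds(s, u)
    s = step(s, u)
    trajectory.append(s)

all_safe = all(h(x) >= 0 for x in trajectory)
```

In this toy case every action satisfies (6) and the state never leaves the safe set, whereas an aggressive action such as `u = 0.5` at `s = 0.9` violates the condition and would leave the set in one step.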


Table 1
Main symbols and descriptions.

Symbol                      Description
$i$                         the $i$-th agent
$-i$                        other agents except for the agent $i$
$\mathcal{N}_i$             the neighboring or communicating set
$-\bar{i}$                  other agents belonging to $\mathcal{N}_i$
$s_i$                       the local state
$o_i$                       the observation, which contains the local state and partial information from other agents
$\Delta_i$                  the uncertainty or error of the estimated model
$d_i$                       the external disturbance
$e_i$, $e_{x,i}$            the state encoding feature and the encoding feature for $x$, e.g. $e_{u,i}$ and $e_{w,i}$
$x_{atten,i}$               the attention encoding feature for $x$, e.g. $o_{atten,i}$, $u_{atten,i}$ and $w_{atten,i}$
$\pi_i$, $\pi_{s,i}$        the RL control policy and the safe control policy
$u_i$, $u_{s,i}$            the RL action and the safe action
$\nu_i$, $\nu_{s,i}$        the RL adversary policy and the safe adversary policy
$w_i$, $w_{s,i}$            the RL adversary action and the safe adversary action

Furthermore, to improve the applicability of barrier functions, the discrete-time exponential CBFs have been put forward in [41],
where $\alpha$ is considered as a linear function with a coefficient $\epsilon$, $0 < \epsilon \le 1$. The inequality (6) can be rewritten as follows:

\[ \Delta h(s^t, u^t) \ge -\epsilon h(s^t). \tag{7} \]
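
As a short worked step that is not part of the original text, unrolling (7) along a trajectory shows where the name "exponential" comes from and why a nonnegative initial barrier value stays nonnegative. It only uses the definition of $\Delta h$ given with (6) and the range $0 < \epsilon \le 1$:

```latex
\begin{align*}
h(s^{t+1}) - h(s^{t}) \ge -\epsilon\, h(s^{t})
&\;\Longrightarrow\; h(s^{t+1}) \ge (1-\epsilon)\, h(s^{t}) \\
&\;\Longrightarrow\; h(s^{t}) \ge (1-\epsilon)^{t}\, h(s^{0}) \ge 0
\quad \text{whenever } h(s^{0}) \ge 0 .
\end{align*}
```

The barrier value is therefore bounded below by a geometrically decaying sequence, and a smaller $\epsilon$ allows the state to approach the boundary of the safe set more slowly.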

2.3. Attention mechanism

The attention mechanism [26] has become a prominent topic in neural network research and has yielded significant achievements
in natural language processing (NLP) and computer vision (CV). The key point of attention is that the model learns to differentiate between agents in response to a
specific request. The attention function is calculated on sets of queries, keys, and values, which are packed together into matrices
$Q$, $K$, and $V$, respectively. The form of the function is expressed as follows:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V \tag{8} \]

where $Q$, $K$, and $V$ are obtained from the input data through learnable projection matrices, and $d_k$ is the dimension of queries and keys.
In this paper, the attention mechanism will be applied to determine the danger level of neighboring agents.
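
The following NumPy sketch of equation (8) may help make the weighting concrete. The projection matrices `W_q`, `W_k`, `W_v`, the feature sizes, and the random inputs are assumptions chosen for illustration only; the sketch does not reproduce the architecture of the SAM.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # stabilise the exponent
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, equation (8)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_in, d_k, d_v, n_neighbors = 4, 8, 8, 3

# Hypothetical learnable projections, randomly initialised for this sketch.
W_q = rng.normal(size=(d_in, d_k))
W_k = rng.normal(size=(d_in, d_k))
W_v = rng.normal(size=(d_in, d_v))

own_feature = rng.normal(size=(1, d_in))                   # one query: the agent itself
neighbor_features = rng.normal(size=(n_neighbors, d_in))   # one key/value per neighbor

out, weights = attention(own_feature @ W_q,
                         neighbor_features @ W_k,
                         neighbor_features @ W_v)
```

Here `weights` holds one nonnegative weight per neighbor, summing to one, and `out` has a fixed size regardless of how many neighbors are supplied, which is the property that makes attention convenient when the number of observable agents changes over time.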

2.4. Problem formulation

Consider an MAS with discrete-time dynamics as follows:

\[ s_i^{t+1} = f_i(s_i^t, u_i^t) + d_i^t, \quad i = 1, \cdots, N_a \tag{9} \]

where $f_i \in \mathbb{R}^n$ and $d_i^t \in \mathbb{R}^n$ are unknown continuously differentiable functions, and $d_i^t$ denotes the external unknown disturbance. The
communication among the agents is limited so that the agent $i$ can only obtain information about itself and its neighboring or
communicating agents. We set $\mathcal{N}_i$ as the communicating set, and the neighboring or communicating agents belong to $\mathcal{N}_i$. Then the
agent $i$ obtains the observation $o_i$, which contains the local state $s_i$ and the neighboring information from $\mathcal{N}_i$. Correspondingly, for
the agent $i$, $-\bar{i}$ represents the agents belonging to $\mathcal{N}_i$.
In the safety-critical MAS, we assume that $C_s$, $C_d$, and $C_{goal}$ denote the safe set, the dangerous set, and the goal set, respectively.
For example, in a multi-UAV system, if a collision occurs between agents or between agents and obstacles, the states belong to $C_d$;
otherwise, they belong to $C_s$. It is clear that $C_{goal} \subset C_s$. Considering safety, we should find control policies that guarantee the forward invariance
of $C_s$, i.e., $s^t \in C_s$, $\forall t \ge 0$ if $s^0 \in C_s$. This study aims to find the optimal (or near-optimal) and safe control policies $\pi$, and the
corresponding constrained optimization problem can be expressed as follows:

\[
\begin{aligned}
\text{maximize} \quad & J(\pi) = \mathbb{E}\left[ \sum_{i=1}^{N_a} \sum_{t=0}^{\infty} \gamma^t r_i^t \right] \\
\text{s.t.} \quad & s_1^t, \cdots, s_{N_a}^t \in C_s \\
& s_1^{end}, \cdots, s_{N_a}^{end} \in C_{goal}.
\end{aligned} \tag{10}
\]

For convenience, the primary mathematical symbols are listed in Table 1.


3. Safe robust MARL with CBF safety certificates

This section introduces a novel safe robust MARL method with CBF safety certificates. The proposed method is called
RMARL-CBF-SAM. In addition, we present a detailed algorithm implementation process.

3.1. Robust MARL

Because the dynamics of the MAS are unknown, they can be approximated by neural networks, Gaussian processes, or other
methods based on data sampled from the real dynamics. Nonetheless, regardless of the method employed, estimation errors are
inevitable. The estimated dynamics model is described as follows:

\[ s_i^{t+1} = \hat{f}_i(s_i^t, u_i^t) + \Delta_i^t + d_i^t, \quad i = 1, \cdots, N_a \tag{11} \]

where $\hat{f}_i$ is the estimated model, $\Delta_i^t$ is the modeling error, and $d_i^t$ represents the external unknown disturbance. Also, $\Delta_i^t$ and $d_i^t$ are
bounded, and $d_i^t \in L_2[0, \infty)$.
For system robustness, modeling errors and external unknown disturbances are regarded as $w_i^t \in \mathbb{R}^n$ generated by an adversary
$\nu_i$, which represents $\Delta_i^t + d_i^t$. The adversary aims to undermine the optimality and safety of the system. The joint adversary is
$\nu = \{\nu_1, \cdots, \nu_{N_a}\}$, and the objective is to prevent the protagonist from completing the task.
Subsequently, the Q value function and the Bellman optimality equation are rewritten as the following equations:

\[
Q_i(o_i, u_i, u_{-i}, w_i, w_{-i}) = \mathbb{E}\left[ \sum_{l=0}^{\infty} \gamma^l r_i^{t+l} \,\middle|\, o_i^t = o_i,\ u_i^t = u_i,\ u_{-i}^t = u_{-i},\ w_i^t = w_i,\ w_{-i}^t = w_{-i} \right] \tag{12}
\]

\[
Q_i^*(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t) = \mathbb{E}\left[ r_i^t + \gamma \max_{u_i^{t+1},\, u_{-i}^{t+1}} \min_{w_i^{t+1},\, w_{-i}^{t+1}} Q_i^*(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}) \right]. \tag{13}
\]

The whole Q function for the MAS is defined as follows:

\[
Q(o^t, u^t, w^t) = \sum_{i=1}^{N_a} Q_i(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t) \tag{14}
\]

where $o^t = (o_1^t, \cdots, o_{N_a}^t)$, $u^t = (u_1^t, \cdots, u_{N_a}^t)$, and $w^t = (w_1^t, \cdots, w_{N_a}^t)$.
The update mechanism can be presented as follows:

\[
\begin{aligned}
Q_i^{k+1}(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t) ={}& (1 - \alpha)\, Q_i^k(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t) \\
&+ \alpha \left( r_i^t + \gamma \max_{u_i^{t+1},\, u_{-i}^{t+1}} \min_{w_i^{t+1},\, w_{-i}^{t+1}} Q_i^k(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}) \right)
\end{aligned} \tag{15}
\]

where $k$ is the iterative step and $\alpha$ is the learning rate.
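
To make the max-min target in rule (15) concrete, here is a minimal tabular sketch. It assumes discrete observation and action sets and collapses the joint actions $(u_i, u_{-i})$ and $(w_i, w_{-i})$ into a single protagonist index and a single adversary index; these simplifications and the function name `robust_q_update` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def robust_q_update(Q, o, u, w, r, o_next, alpha=0.1, gamma=0.9):
    """One application of rule (15) for a single protagonist/adversary pair.

    Q has shape (n_obs, n_u, n_w): observation, protagonist action, adversary action.
    """
    # Inner min over the adversary action (axis 1 of the slice),
    # outer max over the protagonist action.
    target = r + gamma * np.max(np.min(Q[o_next], axis=1))
    Q[o, u, w] = (1.0 - alpha) * Q[o, u, w] + alpha * target
    return Q

# Toy table: rows of Q[1] are protagonist actions, columns are adversary actions.
Q = np.zeros((2, 2, 2))
Q[1] = np.array([[1.0, 3.0],
                 [2.0, 2.5]])

# The worst case per protagonist action is [1.0, 2.0], so the max-min value is 2.0
# and the target is 1.0 + 0.9 * 2.0 = 2.8.
Q = robust_q_update(Q, o=0, u=0, w=1, r=1.0, o_next=1, alpha=0.5, gamma=0.9)
```

With `alpha = 0.5` the updated entry `Q[0, 0, 1]` becomes `0.5 * 0 + 0.5 * 2.8 = 1.4`. The protagonist is evaluated against the adversary's best response, which is what distinguishes this target from the plain max in (2).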

Theorem 1. In the MAS (11), the robust MARL with the updating rule (15) will converge to the optimal solution if 0 < 1 − 𝛼 + 𝛼𝛾 < 1.

Proof. We define a mapping for the rule (15) as follows:

\[
\begin{aligned}
(H_{q,i} Q_i)(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t) ={}& (1 - \alpha)\, Q_i(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t) \\
&+ \alpha \left( r_i^t + \gamma \max_{u_i^{t+1},\, u_{-i}^{t+1}} \min_{w_i^{t+1},\, w_{-i}^{t+1}} Q_i(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}) \right) \\
={}& (1 - \alpha)\, Q_i(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t) + \alpha \left( r_i^t + \gamma \max_{u_i^{t+1},\, u_{-i}^{t+1}} \min_{w_i^{t+1},\, w_{-i}^{t+1}} Q_i(o_i^{t+1}, \cdot) \right)
\end{aligned} \tag{16}
\]

where $Q_i(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1})$ is abbreviated to $Q_i(o_i^{t+1}, \cdot)$.
The rule (15) can be written as follows:

\[ Q_i^{k+1} = H_{q,i} Q_i^k \tag{17} \]

where $k$ is the iterative step.
We define two Q value functions, namely $Q_{i,1}$ and $Q_{i,2}$, which are different estimated functions of $Q_i$ under Markov decision
processes.


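Then, we have

\[
\begin{aligned}
\left\| Q_{i,1}^{k+1} - Q_{i,2}^{k+1} \right\|_\infty
&= \left\| H_{q,i} Q_{i,1}^k - H_{q,i} Q_{i,2}^k \right\|_\infty \\
&= \left\| (1 - \alpha) \left( Q_{i,1}^k - Q_{i,2}^k \right) + \alpha \left[ \gamma \max_{u_i^{t+1},\, u_{-i}^{t+1}} \min_{w_i^{t+1},\, w_{-i}^{t+1}} Q_{i,1}^k(o_i^{t+1}, \cdot) - \gamma \max_{u_i^{t+1},\, u_{-i}^{t+1}} \min_{w_i^{t+1},\, w_{-i}^{t+1}} Q_{i,2}^k(o_i^{t+1}, \cdot) \right] \right\|_\infty \\
&\le (1 - \alpha + \alpha\gamma) \left\| Q_{i,1}^k - Q_{i,2}^k \right\|_\infty .
\end{aligned} \tag{18}
\]

The last step uses the triangle inequality together with the fact that the max-min operator is non-expansive in the supremum norm.
Therefore, the operator $H_{q,i}$ is a contraction operator according to the contraction mapping theorem [42] with $0 < 1 - \alpha + \alpha\gamma < 1$,
and has a unique fixed point $Q_i^*$. Thus, the following equation can be obtained

\[ H_{q,i} Q_i^* = Q_i^* . \tag{19} \]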

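Furthermore, we have

\[
\begin{aligned}
\left\| Q_i^{k+1} - Q_i^* \right\|_\infty
&= \left\| H_{q,i} Q_i^k - H_{q,i} Q_i^* \right\|_\infty \\
&\le \zeta \left\| Q_i^k - Q_i^* \right\|_\infty \\
&= \zeta \left\| Q_i^k - Q_i^{k+1} + Q_i^{k+1} - Q_i^* \right\|_\infty \\
&\le \zeta \left\| Q_i^{k+1} - Q_i^k \right\|_\infty + \zeta \left\| Q_i^{k+1} - Q_i^* \right\|_\infty
\end{aligned} \tag{20}
\]

where $\zeta = 1 - \alpha + \alpha\gamma$.
Therefore,

\[
\begin{aligned}
\left\| Q_i^{k+1} - Q_i^* \right\|_\infty
&\le \frac{\zeta}{1 - \zeta} \left\| Q_i^{k+1} - Q_i^k \right\|_\infty \\
&\le \frac{\zeta^2}{1 - \zeta} \left\| Q_i^k - Q_i^{k-1} \right\|_\infty \\
&\;\;\vdots \\
&\le \frac{\zeta^{k+1}}{1 - \zeta} \left\| Q_i^1 - Q_i^0 \right\|_\infty .
\end{aligned} \tag{21}
\]

Due to $0 < \zeta < 1$,

\[
\lim_{k \to \infty} \left\| Q_i^{k+1} - Q_i^* \right\|_\infty \le \lim_{k \to \infty} \frac{\zeta^{k+1}}{1 - \zeta} \left\| Q_i^1 - Q_i^0 \right\|_\infty = 0. \tag{22}
\]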
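Hence,

\[
\begin{aligned}
\lim_{k \to \infty} \left\| Q^{k+1} - Q^* \right\|_\infty
&= \lim_{k \to \infty} \left\| \sum_{i=1}^{N_a} Q_i^{k+1} - \sum_{i=1}^{N_a} Q_i^* \right\|_\infty \\
&\le \lim_{k \to \infty} \sum_{i=1}^{N_a} \left[ \left\| Q_i^{k+1} - Q_i^* \right\|_\infty \right] \\
&= 0.
\end{aligned} \tag{23}
\]

The result indicates that $Q^k \to Q^*$ as $k \to \infty$. □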
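
The contraction bound in (18) can also be checked numerically. The sketch below applies the operator of (16) to two arbitrary tables on a small, randomly generated problem and records the ratio by which their supremum-norm distance shrinks at each step. The deterministic toy rewards `R`, the next-observation table `NEXT`, and the collapsed single-pair action indexing are assumptions made for this demonstration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_u, n_w = 4, 3, 3
alpha, gamma = 0.5, 0.9
zeta = 1.0 - alpha + alpha * gamma   # contraction factor from the proof

# Hypothetical deterministic toy problem: fixed rewards and next observations.
R = rng.uniform(-1.0, 1.0, size=(n_obs, n_u, n_w))
NEXT = rng.integers(0, n_obs, size=(n_obs, n_u, n_w))

def H(Q):
    """Operator (16) applied to every (o, u, w) entry at once."""
    maxmin = np.max(np.min(Q, axis=2), axis=1)   # max over u of min over w, per observation
    return (1.0 - alpha) * Q + alpha * (R + gamma * maxmin[NEXT])

Q1 = rng.normal(size=(n_obs, n_u, n_w))
Q2 = rng.normal(size=(n_obs, n_u, n_w))
gap0 = np.max(np.abs(Q1 - Q2))

ratios = []
for _ in range(50):
    before = np.max(np.abs(Q1 - Q2))
    Q1, Q2 = H(Q1), H(Q2)
    after = np.max(np.abs(Q1 - Q2))
    ratios.append(after / before)

final_gap = np.max(np.abs(Q1 - Q2))
```

Every entry of `ratios` stays at or below `zeta`, so the two tables approach the same fixed point regardless of their initial values, in line with the argument leading to (22).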

Remark 1. Theorem 1 theoretically guarantees the convergence and optimality of the robust MARL, providing a theoretical basis for
the algorithm’s practical application. In practice, however, training is degraded by factors such as the stochasticity of the
environment and the limited amount of sampled data, so we typically add a replay buffer and target networks to the basic
framework of the algorithm. Additionally, noise is added to the action output by the deterministic policy to increase
exploration. These measures enhance the efficiency and stability of training.

3.2. Robust CBF safety certiﬁcates

In this section, we integrate the robust MARL with CBFs to guarantee the safety of the MAS. A bottleneck exists in the construction
of general CBFs, particularly in complex environments. Robust neural CBFs are introduced to overcome the limitations of handcrafted
CBFs.


For the agent 𝑖, the decentralized robust neural CBF conditions are expressed as follows:
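\[
\begin{aligned}
& h_i\big(o_i^t\big) < 0, && \forall s_i^t \in C_d && \text{(C1)} \\
& h_i\big(o_i^t\big) \ge 0, && \forall s_i^t \in C_s && \text{(C2)} \\
& \Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_{s,i}^t, w_{s,-i}^t\big) + \epsilon h_i\big(o_i^t\big) \ge 0, && \forall s_i^t \in C_s && \text{(C3)}
\end{aligned}
\tag{24}
\]
where $\Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_{s,i}^t, w_{s,-i}^t\big) = h_i\big(o_i^{t+1}\big) - h_i\big(o_i^t\big)$, $h_i$ is the neural CBF represented by a neural network, $u_{s,i}^t$ is the safe
action given by the safe control policy $\pi_{s,i}$ that keeps $C_s$ invariant, and $w_{s,i}^t$ is the anti-safety action given by the safe adversary policy $\nu_{s,i}$.
In addition, $\pi_{s,i}$ and $\nu_{s,i}$ are functions of $o_i^t$, $u_i^t$, and $w_i^t$. Furthermore, $w_{s,i}^t$ satisfies the following equation:
\[
w_{s,i}^t = \arg\min_{w_i^t} \Big[ \Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_i^t, w_{s,-i}^t\big) \Big] .
\tag{25}
\]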

Also, we define a new set $C_n$ in which $s_i^t \in C_s$ and the next state $s_i^{t+1} \in C_s$.
The adversary policy in the robust RL aims to cause the most severe disturbances for the protagonist, consequently compromising
both the optimality and safety of the system. To account for safety explicitly, the safety reward term $r_{s,i}^t$ is added to the reward
in the robust MARL:
\[ r_{s,i}^t = \gamma h_i\big(o_i^{t+1}\big) - h_i\big(o_i^t\big) . \tag{26} \]

Similar to [43], $r_{s,i}^t$ is the shaping function, and the augmented reward $r_{a,i}^t$ is denoted as follows:

\[ r_{a,i}^t = r_i^t + r_{s,i}^t . \tag{27} \]

Proposition 1. Given that the term 𝑟𝑡𝑠,𝑖 (26) is added to the original reward and the optimal solution exists in the robust MARL with 𝑟𝑡𝑎,𝑖
(27), then the solution is also the optimal solution to the original robust MARL with 𝑟𝑡𝑖 .

Proof. For the agent 𝑖, the augmented optimal Q value function 𝑄∗𝑎,𝑖 satisﬁes the following Bellman equation:
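\[
Q_{a,i}^{*}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big)
= \mathbb{E}\Big[ r_{a,i}^t + \gamma \max_{u_i^{t+1},u_{-i}^{t+1}} \min_{w_i^{t+1},w_{-i}^{t+1}} Q_{a,i}^{*}\big(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}\big) \Big] .
\tag{28}
\]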
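Subtracting $h_i\big(o_i^t\big)$ from both sides of the equation (13), we can derive
\[
\begin{aligned}
& Q_i^{*}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) - h_i\big(o_i^t\big) \\
&\quad = \mathbb{E}\Big[ r_i^t + \gamma h_i\big(o_i^{t+1}\big) - h_i\big(o_i^t\big) + \gamma \max_{u_i^{t+1},u_{-i}^{t+1}} \min_{w_i^{t+1},w_{-i}^{t+1}} \Big( Q_i^{*}\big(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}\big) - h_i\big(o_i^{t+1}\big) \Big) \Big] .
\end{aligned}
\tag{29}
\]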

We deﬁne
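\[
\hat{Q}_i\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) := Q_i^{*}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) - h_i\big(o_i^t\big) .
\tag{30}
\]
Then, combining (26), (27), (29) and (30), we can obtain
\[
\begin{aligned}
\hat{Q}_i\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big)
&= \mathbb{E}\Big[ r_i^t + r_{s,i}^t + \gamma \max_{u_i^{t+1},u_{-i}^{t+1}} \min_{w_i^{t+1},w_{-i}^{t+1}} \hat{Q}_i\big(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}\big) \Big] \\
&= \mathbb{E}\Big[ r_{a,i}^t + \gamma \max_{u_i^{t+1},u_{-i}^{t+1}} \min_{w_i^{t+1},w_{-i}^{t+1}} \hat{Q}_i\big(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}\big) \Big] .
\end{aligned}
\tag{31}
\]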

By comparing (28) with (31), we can derive

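\[
\begin{aligned}
Q_{a,i}^{*}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big)
&= \hat{Q}_i\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) \\
&= Q_i^{*}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) - h_i\big(o_i^t\big) .
\end{aligned}
\tag{32}
\]
Also, $h_i\big(o_i^t\big)$ is independent of $u_i^t$, $u_{-i}^t$, $w_i^t$, and $w_{-i}^t$, so subtracting it shifts all Q values at a given observation by the same
amount and does not change the maximizing control actions or the minimizing adversary actions. Therefore, the augmented optimal
solution is also the optimal solution to the original robust MARL with $r_i^t$. □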
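
Proposition 1 can be illustrated on a toy max-min problem. The sketch below is an assumption-laden example rather than the paper's algorithm: it uses a small tabular game with deterministic transitions, treats the observation as the state, draws an arbitrary potential `h` at random, and solves the Bellman equation by plain value iteration. It then compares the solution under the original reward with the solution under the shaped reward from (26)-(27).

```python
import random

def solve_max_min_q(reward, nxt, gamma, iters=500):
    """Tabular max-min value iteration: Q(s,u,w) = r(s,u,w) + gamma * max_u' min_w' Q(s',u',w')."""
    n_s, n_u, n_w = len(reward), len(reward[0]), len(reward[0][0])
    Q = [[[0.0] * n_w for _ in range(n_u)] for _ in range(n_s)]
    for _ in range(iters):
        values = [max(min(row) for row in Q[s]) for s in range(n_s)]
        Q = [[[reward[s][u][w] + gamma * values[nxt[s][u][w]]
               for w in range(n_w)] for u in range(n_u)] for s in range(n_s)]
    return Q

def greedy_controls(Q):
    """Index of the max-min control action in every state."""
    return [max(range(len(Q[s])), key=lambda u: min(Q[s][u])) for s in range(len(Q))]

# Toy game: sizes, rewards, transitions and the potential h are illustrative assumptions.
rng = random.Random(0)
N_S, N_U, N_W, GAMMA = 5, 3, 3, 0.9
reward = [[[rng.uniform(-1.0, 1.0) for _ in range(N_W)]
           for _ in range(N_U)] for _ in range(N_S)]
nxt = [[[rng.randrange(N_S) for _ in range(N_W)]
        for _ in range(N_U)] for _ in range(N_S)]
h = [rng.uniform(-1.0, 1.0) for _ in range(N_S)]

# Augmented reward r_a = r + gamma * h(s') - h(s), as in (26)-(27).
shaped = [[[reward[s][u][w] + GAMMA * h[nxt[s][u][w]] - h[s]
            for w in range(N_W)] for u in range(N_U)] for s in range(N_S)]

Q_star = solve_max_min_q(reward, nxt, GAMMA)
Q_aug = solve_max_min_q(shaped, nxt, GAMMA)

# Check (32): Q_aug equals Q_star shifted by -h(s), so the greedy actions coincide.
max_gap = max(abs(Q_aug[s][u][w] - (Q_star[s][u][w] - h[s]))
              for s in range(N_S) for u in range(N_U) for w in range(N_W))
same_policy = greedy_controls(Q_star) == greedy_controls(Q_aug)
```

Because the shift depends only on the state, the ranking of actions within each state is unchanged, which is the content of (32).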

Remark 2. According to Proposition 1, the optimal adversary policies minimize the safety reward term $r_{s,i}^t$ as far as possible through
the powerful learning ability of deep RL. In real applications, $\gamma$ approaches 1, and then minimizing $r_{s,i}^t$ by $\nu_i$ is equivalent to minimizing
$\Delta h_i\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big)$. From (25), it can be deduced that the acquisition of the optimal adversary contributes to the refinement of
the safe adversary policy. Additionally, the convergence of the reward shaping has theoretical support [44].


Proposition 2. For the agent $i$, if $s_i^0 \in C_s$ and the decentralized robust neural CBF $h_i$ satisfies the conditions (24), then for $\forall t$, $h_i\big(o_i^t\big) \ge 0$
holds under the safe action $u_{s,i}^t$. Hence, the MAS is safe.

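Proof. For the agent $i$, due to $s_i^0 \in C_s$, $h_i\big(o_i^0\big) \ge 0$. The adversary $\nu_i$ represents the unknown uncertainty, the approximation error, and
the external disturbance. For the adversarial action $w_i^t$, we can obtain
\[
\Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_i^t, w_{s,-i}^t\big) + \epsilon h_i\big(o_i^t\big)
\ge \Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_{s,i}^t, w_{s,-i}^t\big) + \epsilon h_i\big(o_i^t\big) \ge 0 .
\tag{33}
\]
The first inequality holds because, by (25), $w_{s,i}^t$ minimizes $\Delta h_i$ over $w_i^t$; the second is condition (C3) in (24).
Then, according to Proposition 4 in [40], we can obtain $h_i\big(o_i^{t+1}\big) \ge (1-\epsilon)^{t+1} h_i\big(o_i^0\big) \ge 0$, that is, $\forall t$, $h_i\big(o_i^t\big) \ge 0$. Therefore,
the entire MAS is safe. □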
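
The bound used at the end of the proof can be checked with a short simulation. This is a minimal sketch under stated assumptions: the barrier values are generated directly (no dynamics or neural network is involved), the non-negative slack added at each step is random, and `EPS` is taken from the CBF coefficient listed in Table 2. Each increment satisfies $\Delta h + \epsilon h \ge 0$, so the sequence should stay above $(1-\epsilon)^t h(o^0)$.

```python
import random

def rollout_barrier(h0, eps, steps, rng):
    """Generate barrier values whose increments satisfy dh + eps * h >= 0 at every step."""
    values = [h0]
    for _ in range(steps):
        slack = rng.uniform(0.0, 0.2)       # any non-negative margin is admissible (assumption)
        dh = -eps * values[-1] + slack      # the worst case allowed by (C3) is slack = 0
        values.append(values[-1] + dh)
    return values

EPS = 0.55                                   # CBF coefficient from Table 2
values = rollout_barrier(1.0, EPS, 50, random.Random(1))

# Geometric lower bound (1 - eps)^t * h(o^0) from the proof of Proposition 2.
lower_bounds = [(1 - EPS) ** t * values[0] for t in range(len(values))]
bound_holds = all(v >= lo - 1e-12 for v, lo in zip(values, lower_bounds))
```

Starting from a non-negative value, the sequence never becomes negative, which is the forward-invariance property the proposition relies on.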

During the online training procedure, the safe and dangerous data originate from the interaction with the environment generated
by the robust MARL. The powerful exploration performance of RL can provide rich information, which aids in learning general CBFs
and eﬀectively reduces the discrepancy between the training and test distributions.
For ℎ𝑖 , the empirical loss function can be represented as follows:

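\[ L_{h,i} = L_{hs,i} + L_{hd,i} + L_{hn,i} \tag{34} \]

\[ L_{hs,i} = \frac{1}{N_{train,s}} \sum_{s_i^t \in C_s} \max\Big(0, -h_i\big(o_i^t\big)\Big) \tag{35} \]

\[ L_{hd,i} = \frac{1}{N_{train,d}} \sum_{s_i^t \in C_d} \max\Big(0, h_i\big(o_i^t\big)\Big) \tag{36} \]

\[ L_{hn,i} = \frac{1}{N_{train,n}} \sum_{s_i^t \in C_n} \max\Big(0, -\Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_{s,i}^t, w_{s,-i}^t\big) - \epsilon h_i\big(o_i^t\big)\Big) \tag{37} \]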

where 𝑁𝑡𝑟𝑎𝑖𝑛,𝑠 , 𝑁𝑡𝑟𝑎𝑖𝑛,𝑑 , and 𝑁𝑡𝑟𝑎𝑖𝑛,𝑛 denote the training set sizes of 𝐶𝑠 , 𝐶𝑑 , and 𝐶𝑛 , respectively.
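
The three hinge terms in (35)-(37) can be sketched on plain lists of barrier values. The function below is illustrative only: it assumes the CBF network has already been evaluated, so `h_safe`, `h_danger`, `h_now` and `h_next` are hypothetical inputs holding $h_i$ on samples from $C_s$, $C_d$ and on consecutive pairs from $C_n$, and the numbers in the example call are made up.

```python
def hinge(x):
    """max(0, x), the penalty used for every violated condition."""
    return max(0.0, x)

def cbf_losses(h_safe, h_danger, h_now, h_next, eps):
    """Empirical CBF losses (35)-(37) and their sum (34).

    h_safe:   h values on samples from the safe set C_s
    h_danger: h values on samples from the dangerous set C_d
    h_now, h_next: h values at steps t and t+1 for samples from C_n
    """
    loss_s = sum(hinge(-h) for h in h_safe) / len(h_safe)                  # (35)
    loss_d = sum(hinge(h) for h in h_danger) / len(h_danger)               # (36)
    loss_n = sum(hinge(-(hn - hc) - eps * hc)                              # (37)
                 for hc, hn in zip(h_now, h_next)) / len(h_now)
    return loss_s, loss_d, loss_n, loss_s + loss_d + loss_n

# Made-up barrier values: one violation in each category.
loss_s, loss_d, loss_n, loss_total = cbf_losses(
    h_safe=[0.5, -0.2], h_danger=[-1.0, 0.3],
    h_now=[1.0, 0.4], h_next=[0.3, 0.5], eps=0.55)
```

Each term is zero exactly when the corresponding condition in (24) holds on every sample, which is why driving the three losses to zero is used as a proxy for satisfying (24).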
Considering optimality, the safe control policy 𝜋𝑠,𝑖 will not deviate too far from 𝜋𝑖 of the robust MARL; therefore, the empirical
loss function can be represented as follows:
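\[
L_{\pi_s,i} = \frac{1}{2N_{train,s}} \sum_{s_i^t \in C_s} \left\| u_{s,i}^t - u_i^t \right\|^2
+ \frac{\lambda_u}{N_{train,s}} \sum_{s_i^t \in C_s} \max\Big(0, -\Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_{s,i}^t, w_{s,-i}^t\big) - \epsilon h_i\big(o_i^t\big)\Big)
\tag{38}
\]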

where 𝜆𝑢 is a factor that balances optimality and safety.

For the safe adversary 𝜈𝑠,𝑖 , the loss function can be expressed as follows:
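\[
L_{\nu_s,i} = \frac{1}{2N_{train,s}} \sum_{s_i^t \in C_s} \left\| w_{s,i}^t - w_i^t \right\|^2
+ \frac{\lambda_w}{N_{train,s}} \sum_{s_i^t \in C_s} \Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_{s,i}^t, w_{s,-i}^t\big)
\tag{39}
\]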

where 𝜆𝑤 denotes a balanced factor.

Remark 3. To guarantee the efficacy of neural CBFs, it is crucial to ensure accurate training. Inspired by [45], we implement some
measures to enhance the accurate representation of neural CBFs. First, the constraint conditions (24) are relaxed into the loss function
(34) by a penalty method: constraint violations are expressed as scalar penalties to be minimized. The losses are
minimized via optimization algorithms, such as Adam and RMSprop. Second, the training data are randomly sampled from the replay
buffer built by the robust MARL. The replay buffer, a large finite set, stores the experiences and reduces the correlation
between samples, thereby improving the learning performance.

The learning processes of the neural network weights for CBFs and policies are interleaved at each iteration. After a sufficient
number of iterations, the loss gradually decreases. When the losses 𝐿ℎ𝑠,𝑖 , 𝐿ℎ𝑑,𝑖 , and 𝐿ℎ𝑛,𝑖 converge to zero, the conditions (24) are
satisfied.
However, in practical engineering applications, accidental failures may occur due to training errors and drift between training
and testing. Inspired by [22], an online fine-tuning method for the control policy and the adversary policy is provided. Let 𝑢𝑡𝑒,𝑖 and
𝑤𝑡𝑒,𝑖 be the adjusted inputs of the control policy and the adversary policy, respectively. In addition, the fine-tuning iterative number
can be set to a small value based on the previous training. The refinement procedure is outlined as follows:

S.1. Adjustment of the adversary policy.

The loss function can be deﬁned as follows:
\[
L_{we,i} = \Delta h_i\big(o_i^t, u_{s,i}^t, u_{s,-i}^t, w_{s,i}^t + w_{e,i}^t, w_{s,-i}^t\big) + \lambda_{aw} \left\| w_{e,i}^t \right\|^2 .
\tag{40}
\]
S.2. Adjustment of the control policy.


The loss function can be deﬁned as follows:

\[
L_{ue,i} = \max\Big(0, -\Delta h_i\big(o_i^t, u_{s,i}^t + u_{e,i}^t, u_{s,-i}^t, w_{s,i}^t + w_{e,i}^t, w_{s,-i}^t\big) - \epsilon h_i\big(o_i^t\big)\Big) + \lambda_{au} \left\| u_{e,i}^t \right\|^2 .
\tag{41}
\]
S.3. Stop criterion.
If $L_{ue,i} - \lambda_{au} \left\| u_{e,i}^t \right\|^2 = 0$ or the maximum number of fine-tuning iterations $N_f$ is reached, we obtain the final policies, $u_{s,i}^t \leftarrow u_{s,i}^t + u_{e,i}^t$ and
$w_{s,i}^t \leftarrow w_{s,i}^t + w_{e,i}^t$; otherwise, the fine-tuning procedure continues.

In the above procedure, 𝜆𝑎𝑤 and 𝜆𝑎𝑢 are regularization factors used to suppress the magnitude of the variables.
The reﬁnement procedure S.1-S.3 is also referred to as the safety regulator in this paper.

3.3. Safety attention mechanism

The algorithm incorporates an attention mechanism to adapt to the time-varying number of neighbors and differentiate the
dangers of different agents. This mechanism encodes the observation and converts the variable-dimensional information into a fixed-
dimensional input variable.
First, we feed the state 𝑠𝑖 into the feature encoder 𝐸𝑠 ,
\[ e_i = E_s\big(s_i\big) \tag{42} \]
where 𝐸𝑠 is a multi-layer perceptron (MLP) and 𝑒𝑖 is the state encoding feature.
Then, the similarity value between the agent 𝑖 and the agent 𝑗 is calculated by
\[ e_{ij} = f_s\big(e_j^T W_k^T W_q e_i\big) \tag{43} \]

where $j \in \mathcal{N}_i$ (the neighbor set of agent $i$), a linear learnable weight matrix $W_q$ transforms $e_i$ into a query, $W_k$ transforms $e_j$ into a key, and $f_s$ is a single-layer
MLP.
The attention weight, which is the normalized form of 𝑒𝑖𝑗 , is calculated by softmax as follows:
\[
a_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(e_{ij}\big)\big)}{\sum_{m \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(e_{im}\big)\big)}
\tag{44}
\]

where 𝐿𝑒𝑎𝑘𝑦𝑅𝑒𝐿𝑈 is the leaky version of ReLU.

The combined encoding feature $\bar{e}_{-i}$ is aggregated based on the attention weight $a_{ij}$ and the linear learnable weight matrix $W_v$:
\[ \bar{e}_{-i} = \sum_{m \in \mathcal{N}_i} a_{im} W_v e_m . \tag{45} \]

Regardless of the size of $\mathcal{N}_i$, the neighboring information can be transformed into a fixed dimension using the attention mechanism.
Subsequently, the attention encoding feature $\bar{e}_{-i}$ and the state encoding feature $e_i$ are combined to form a new observation $\big(e_i, \bar{e}_{-i}\big)$,
which is fed into the decentralized robust neural CBF $h_i$. The training of the parameters in the attention mechanism is synchronized
with the training of $h_i$. The integration of the attention mechanism and CBFs enables the policy to treat different agents differently,
in contrast to the max-pooling method [22] or the mean-field method [35]. These prior methods typically treat all agents equally
and potentially overlook valuable agent-specific information. This module is named the safety attention mechanism (SAM).
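
The pooling in (43)-(45) can be sketched without any deep learning library. The example below is a simplified illustration, not the trained module: the encodings are given directly as lists, the weight matrices are fixed rather than learned, the single-layer MLP $f_s$ is replaced by the identity, the LeakyReLU slope of 0.2 is an assumption, and at least one neighbor is assumed to exist.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def leaky_relu(x, slope=0.2):
    """LeakyReLU with an assumed negative slope of 0.2."""
    return x if x >= 0 else slope * x

def safety_attention(e_i, neighbors, W_q, W_k, W_v):
    """Return attention weights (44) and the pooled neighbor feature (45).

    The score e_j^T W_k^T W_q e_i is computed as (W_k e_j) . (W_q e_i);
    f_s is taken as the identity in this sketch.
    """
    query = matvec(W_q, e_i)
    scores = [leaky_relu(sum(k * q for k, q in zip(matvec(W_k, e_j), query)))
              for e_j in neighbors]
    top = max(scores)                                   # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    values = [matvec(W_v, e_m) for e_m in neighbors]
    pooled = [sum(a * v[d] for a, v in zip(weights, values))
              for d in range(len(W_v))]
    return weights, pooled

# Illustrative 2-D encodings with identity weight matrices.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
e_focus = [1.0, 0.0]
weights3, pooled3 = safety_attention(
    e_focus, [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], IDENTITY, IDENTITY, IDENTITY)
weights1, pooled1 = safety_attention(
    e_focus, [[0.0, 1.0]], IDENTITY, IDENTITY, IDENTITY)
```

The pooled feature has the same length for one neighbor and for three, which is the property that lets the downstream networks accept a time-varying number of neighbors.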

Remark 4. With the SAM, our proposed method has the following advantages:

1) The weighted sum-pooling in the SAM can handle an arbitrary number of agents.
2) The danger degree among the neighboring or communicating agents can be diﬀerentiated.
3) The attention can make up for the defects of partial observability and serves as an information ﬁlter since most of the information
is redundant.

To enhance its impact, the SAM operates as a shared module. In addition to the CBFs, it is embedded into the Q value function,
actors, and adversaries.

Remark 5. Since the inputs of the Q value function contain states and actions, the construction of the feature encoder needs to be
modified when using the SAM. Similar to $E_s$, we introduce $E_u$ and $E_w$, which are built as MLPs.
For the agent $i$, the attention observation is $o_{atten,i} = \big(e_i, \bar{e}_{-i}\big)$, where $e_i$ and $\bar{e}_{-i}$ are calculated by (42) and (45), respectively.
Simultaneously, the attention weights $\{a_{ij}\}_{j \in \mathcal{N}_i}$ are used to evaluate the neighboring control policies and adversaries, and the attention
weights for agents not belonging to $\mathcal{N}_i$ are set to zero. Then, in the RL procedure, the attention encoding features for the control policies
and the adversary policies are presented as follows:
\[ u_{atten,i} = \big(e_{u,i}, \bar{e}_{u,-i}\big) \tag{46} \]


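\[ w_{atten,i} = \big(e_{w,i}, \bar{e}_{w,-i}\big) \tag{47} \]
where
\[
\begin{aligned}
e_{u,i} &= E_u\big(u_i\big), & \bar{e}_{u,-i} &= \sum_{m \in \mathcal{N}_i} a_{im} E_u\big(u_m\big), \\
e_{w,i} &= E_w\big(w_i\big), & \bar{e}_{w,-i} &= \sum_{m \in \mathcal{N}_i} a_{im} E_w\big(w_m\big) .
\end{aligned}
\tag{48}
\]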

Here, the SAM is embedded in the neural network structures to assist with feature encoding of neighboring agents. Since the
SAM is part of the network structures, 𝑜𝑎𝑡𝑡𝑒𝑛,𝑖 , 𝑢𝑎𝑡𝑡𝑒𝑛,𝑖 , and 𝑤𝑎𝑡𝑡𝑒𝑛,𝑖 are not explicitly mentioned in the subsequent description of the
algorithm.

3.4. Implementation of safe algorithm

In the robust MARL, neural networks are applied to approximate the Q value function, the control policies, and the adversary
policies. The algorithm adopts the actor-adversary-critic framework and extends MADDPG by incorporating adversary policies.
Moreover, target networks with a soft updating rate $\iota$ are introduced to stabilize the training by providing slow-moving targets for the
actor, adversary, and critic networks. We randomly initialize the networks $Q_{a,i}$, $\pi_i$, and $\nu_i$ with weights $\theta_{q,i}$, $\theta_{\pi,i}$, and
$\theta_{\nu,i}$, $i = 1, 2, \cdots, N_a$. Correspondingly, the weights of the target networks $Q'_{a,i}$, $\pi'_i$, and $\nu'_i$ are initially set as
$\theta'_{q,i} \leftarrow \theta_{q,i}$, $\theta'_{\pi,i} \leftarrow \theta_{\pi,i}$, and $\theta'_{\nu,i} \leftarrow \theta_{\nu,i}$, $i = 1, 2, \cdots, N_a$.
For the agent 𝑖, the neural networks of the Q value function are updated by minimizing the loss function, which is given by
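\[
L_{\theta_{q,i}} = \mathbb{E}\Big[ \big( Q_{a,i}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) - y_i^t \big)^2 \Big]
\tag{49}
\]

where $y_i^t$ is the target value, as follows:

\[
y_i^t = r_{a,i}^t + \gamma\, Q'_{a,i}\big(o_i^{t+1}, u_i^{t+1}, u_{-i}^{t+1}, w_i^{t+1}, w_{-i}^{t+1}\big) \Big|_{u_j^{t+1} = \pi'_j\left(o_j^{t+1}\right),\; w_j^{t+1} = \nu'_j\left(o_j^{t+1}\right)} .
\tag{50}
\]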

The value function for the agent 𝑖 with the control policies 𝜋 and the adversary policies 𝜈 can be expressed as
\[ J_i = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_{a,i}^t \Big] . \tag{51} \]

The optimization objective of the actor network is to maximize 𝐽𝑖 , and the policy gradient is written as
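\[
\nabla_{\theta_{\pi,i}} J_i = \mathbb{E}\Big[ \nabla_{\theta_{\pi,i}} \pi_i\big(o_i^t\big)\, \nabla_{u_i^t} Q_{a,i}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) \Big|_{u_i^t = \pi_i\left(o_i^t\right)} \Big]
\tag{52}
\]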
Conversely, the optimization objective of the adversary network is to minimize 𝐽𝑖 , and the policy gradient is written as
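\[
\nabla_{\theta_{\nu,i}} J_i = \mathbb{E}\Big[ \nabla_{\theta_{\nu,i}} \nu_i\big(o_i^t\big)\, \nabla_{w_i^t} Q_{a,i}\big(o_i^t, u_i^t, u_{-i}^t, w_i^t, w_{-i}^t\big) \Big|_{w_i^t = \nu_i\left(o_i^t\right)} \Big]
\tag{53}
\]
where the updating direction is opposite to that of $\theta_{\pi,i}$.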
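
The opposite update directions in (52)-(53) and the soft target update with rate $\iota$ can be seen on a scalar toy problem. The sketch below is purely illustrative and makes strong assumptions: the critic is a fixed quadratic `q_value` instead of a learned network, each "policy" is a single scalar parameter, and the step size `LR` is arbitrary. Only the soft updating rate is taken from Table 2.

```python
def q_value(u, w):
    """Toy critic (assumption): concave in the control u, convex in the adversary w."""
    return -(u - 1.0) ** 2 + (w + 0.5) ** 2

def grad_u(u, w):
    """Partial derivative of the toy critic with respect to u."""
    return -2.0 * (u - 1.0)

def grad_w(u, w):
    """Partial derivative of the toy critic with respect to w."""
    return 2.0 * (w + 0.5)

def soft_update(target, online, iota):
    """theta' <- iota * theta + (1 - iota) * theta'."""
    return iota * online + (1.0 - iota) * target

LR, IOTA = 0.1, 1e-3                          # IOTA follows Table 2; LR is an assumption
theta_pi, theta_nu = 0.0, 0.0                 # scalar "policies": u = theta_pi, w = theta_nu
target_pi, target_nu = theta_pi, theta_nu     # target parameters start as copies

for _ in range(10000):
    theta_pi += LR * grad_u(theta_pi, theta_nu)   # actor: gradient ascent on Q
    theta_nu -= LR * grad_w(theta_pi, theta_nu)   # adversary: gradient descent on Q
    target_pi = soft_update(target_pi, theta_pi, IOTA)
    target_nu = soft_update(target_nu, theta_nu, IOTA)

saddle_value = q_value(theta_pi, theta_nu)
```

The online parameters reach the saddle point of the toy critic within a few dozen steps, while the target parameters follow slowly, which is the "slow-moving target" effect the soft update is meant to provide.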

Remark 6. In practical application, computing the joint action space for all agents can be computationally infeasible due to its
exponential growth with the number of agents. According to studies [25,46,47], during practical implementation, we can assume
that the actions of other agents remain ﬁxed while updating the policy parameters of agent 𝑖.

A block diagram of the training procedure is presented in Fig. 1. The training of the robust MARL and the safety module is
conducted in an interleaved manner. The safety module contains safe controllers, CBFs, and the SAM. In addition, the SAM is trained
synchronously with the decentralized robust neural CBF $h_i$ and embedded in the entire process. The SAM parameters are shared with
the other modules of the algorithm. The replay buffer stores and replays past experiences, thereby facilitating more efficient learning.
Furthermore, the real environment and the robust safe controllers are combined as an augmented environment. It is worth noting
that the robust MARL, initially intended for controlling the original system, can be effectively employed for the augmented system.
The augmented environment tends to create a safe environment. If the safety of the MAS cannot be guaranteed by the RL, the robust
safe controllers will provide safe actions with minimal intervention. In essence, the goal is to attain system safety with a minimal
performance cost. The method of directly applying the trained RL controllers to an augmented system is similar to that in a previous
work [48]. The study demonstrates that the training procedure is not negatively influenced and the performance can be improved.
The pseudocode is provided in Algorithm 1.
A block diagram of the implementation procedure is presented in Fig. 2. Similar to the training procedure, the SAM is incorporated
into each module of the algorithm.


Fig. 1. The block diagram of the training procedure.

4. Simulation

In this section, we evaluate the eﬀectiveness of our proposed method in a multi-agent navigation environment. Besides, we compare
our method with some baseline methods.

4.1. Environment setting

We construct a simulation environment for robot navigation. The initial and goal locations of the robots are positioned in either
a random or structured manner, and an arbitrary number of obstacles are randomly placed in the environment, as shown in Fig. 3.
From the illustration of the multi-agent navigation environment in Fig. 4, we can observe that each robot should navigate from the
initial location to the goal location while avoiding obstacles and other robots under the interference of unknown disturbances. In
addition, each robot has a speciﬁc communication and detection range within which it can obtain information about its neighbors,
including positions and velocities.
Moreover, the dynamics of the robot are set as a disturbed double integrator:
\[
\dot{s}_i = \begin{bmatrix} \dot{p}_i \\ \dot{v}_i \end{bmatrix}
= \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} \begin{bmatrix} p_i \\ v_i \end{bmatrix}
+ \frac{1}{m_i} \begin{bmatrix} 0 \\ I \end{bmatrix} u_i + d_i
\tag{54}
\]
where $p_i$, $v_i$, $u_i$, $d_i$ and $m_i$ are the position, the velocity, the force applied to the robot $i$, the unknown bounded disturbance, and
the unknown mass of the robot $i$, respectively. Then, $p_i = [p_{x,i}, p_{y,i}]^T$, $v_i = [v_{x,i}, v_{y,i}]^T$, $u_i = [u_{x,i}, u_{y,i}]^T$ and $\| d_i \| \le d_{iM}$. The sampling
interval $\Delta t$ is set to 0.1 s.
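
One way to simulate (54) at the stated sampling interval is a forward-Euler step. The sketch below is an assumption about the discretization, since the text gives the continuous-time model and the sampling interval but not the integration scheme; the function name and the layout of the disturbance `d` as four components (two acting on the position derivative, two on the velocity derivative) are also illustrative.

```python
def step_double_integrator(p, v, u, d, mass, dt=0.1):
    """One forward-Euler step of the disturbed double integrator (54).

    p, v, u: 2-D position, velocity and applied force.
    d: 4-D disturbance added to the state derivative [p_dot, v_dot].
    """
    p_dot = [v[k] + d[k] for k in range(2)]
    v_dot = [u[k] / mass + d[2 + k] for k in range(2)]
    p_next = [p[k] + dt * p_dot[k] for k in range(2)]
    v_next = [v[k] + dt * v_dot[k] for k in range(2)]
    return p_next, v_next

# Illustrative values: unit velocity along x, force along y, no disturbance.
p_next, v_next = step_double_integrator(
    p=[0.0, 0.0], v=[1.0, 0.0], u=[0.0, 2.0], d=[0.0, 0.0, 0.0, 0.0], mass=2.0)
```

With these inputs the robot advances 0.1 along x and gains 0.1 of velocity along y in one step.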


Fig. 2. The block diagram of the implementation procedure.

Fig. 3. The distribution of the environment: (a) random; (b) structured.

Fig. 4. Multi-agent navigation environment.

At time 𝑡, the reward for the robust MARL procedure is deﬁned as follows:

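\[
\begin{aligned}
r_{a,i}^t &= r_{dist,i}^t + r_{energy,i}^t + r_{s,i}^t \\
r_{dist,i}^t &= -\big(p_i^t - p_{goal,i}\big)^T \big(p_i^t - p_{goal,i}\big) \\
r_{energy,i}^t &= -\big(u_i^t\big)^T u_i^t + \big(w_i^t\big)^T w_i^t \\
r_{s,i}^t &= \gamma h_i\big(o_i^{t+1}\big) - h_i\big(o_i^t\big)
\end{aligned}
\tag{55}
\]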

where 𝑟𝑡𝑑𝑖𝑠𝑡,𝑖 is the goal distance reward, 𝑟𝑡𝑒𝑛𝑒𝑟𝑔𝑦,𝑖 is the energy reward that assesses the control cost, and 𝑟𝑡𝑠,𝑖 is the safety reward
related to the CBF being trained. When the robot is closer to the goal, 𝑟𝑡𝑑𝑖𝑠𝑡,𝑖 is greater.
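
The reward in (55) can be written as a small function. This is a minimal sketch with made-up inputs: the barrier values `h_now` and `h_next` are passed in as numbers rather than produced by the neural CBF, and the function name is hypothetical.

```python
def augmented_reward(p, p_goal, u, w, h_now, h_next, gamma):
    """Augmented reward (55): goal distance + energy + safety shaping terms."""
    r_dist = -sum((a - b) ** 2 for a, b in zip(p, p_goal))          # closer to the goal is better
    r_energy = -sum(x * x for x in u) + sum(x * x for x in w)       # control cost minus adversary cost
    r_safe = gamma * h_next - h_now                                  # shaping term (26)
    return r_dist + r_energy + r_safe

# Illustrative values; gamma follows the discount factor in Table 2.
reward_value = augmented_reward(
    p=[1.0, 2.0], p_goal=[0.0, 0.0], u=[1.0, 0.0], w=[0.0, 0.5],
    h_now=0.4, h_next=0.5, gamma=0.9999)
```

Note the sign of the adversary term: a larger disturbance effort raises the reward, so the minimizing adversary is penalized for it in the same way the maximizing controller is penalized for control effort.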
The configuration of the environment and main hyperparameters of the training procedure are listed in Table 2. To illustrate the
superiority of our method, it is compared with several adjusted baseline approaches: MADDPG-P, MAAC-P, M3DDPG-P, MARL-RCPO,
MACBF [22], MARL-BLAC, and MARL-Barrier. Among these methods, MADDPG-P, MAAC-P, and M3DDPG-P are based on MADDPG
[25], MAAC [27], and M3DDPG [49], respectively. The constraint violation is incorporated as a penalty term to augment the reward
function of these MARL methods; the suffix “P” denotes this added constraint penalty. MARL-RCPO
is adjusted based on the combination of MADDPG [25] and RCPO [50]. MARL-BLAC is an extension of the BLAC algorithm [19]
that adapts the algorithm from single-agent to multi-agent scenarios under CBF constraints. Similarly, MARL-Barrier is an extension
and refinement of the algorithm proposed in [18], which combines the MADDPG with the neural barrier certificates. Unlike the
aforementioned MARL methods, MACBF focuses more on combining multi-agent control policies and CBF certificates.
Moreover, to verify the effectiveness of the SAM, we replace it with the maximum pooling layer [22] in our method. The adjusted
method is considered a new comparison method, namely RMARL-CBF-MAX.

Algorithm 1 Training procedure of RMARL-CBF-SAM.

Randomly initialize networks $Q_{a,i}$, $\pi_i$, $\nu_i$, $\pi_{s,i}$, $\nu_{s,i}$ and $h_i$ with weights $\theta_{q,i}$, $\theta_{\pi,i}$, $\theta_{\nu,i}$, $\theta_{\pi_s,i}$, $\theta_{\nu_s,i}$ and $\theta_{h,i}$, $i = 1, 2, \cdots, N_a$.
Initialize target networks $Q'_{a,i}$, $\pi'_i$ and $\nu'_i$ with weights $\theta'_{q,i} \leftarrow \theta_{q,i}$, $\theta'_{\pi,i} \leftarrow \theta_{\pi,i}$ and $\theta'_{\nu,i} \leftarrow \theta_{\nu,i}$, $i = 1, 2, \cdots, N_a$.
Initialize replay buffer $\mathcal{D}$.
for episode = 1 to $N_e$ do
    Initialize a random process $\mathcal{N}$ for action exploration.
    Receive the initial states for each agent.
    for $t = 1$ to $N_s$ do
        Get RL control policies: $u_i^t = \pi_i\big(o_i^t\big) + \mathcal{N}^t$, $i = 1, 2, \cdots, N_a$.
        Get RL adversary policies: $w_i^t = \nu_i\big(o_i^t\big) + \mathcal{N}^t$, $i = 1, 2, \cdots, N_a$.
        Get safe control policies: $u_{s,i}^t = \pi_{s,i}\big(o_i^t, u_i^t, w_i^t\big)$, $i = 1, 2, \cdots, N_a$.
        Get safe adversary policies: $w_{s,i}^t = \nu_{s,i}\big(o_i^t, u_i^t, w_i^t\big)$, $i = 1, 2, \cdots, N_a$.
        Execute policies and observe the new state $\{o_i^{t+1}\}_{i=1}^{N_a}$ and the reward $\{r_{a,i}^t\}_{i=1}^{N_a}$.
        Store transition $\{(o_i^t, u_i^t, w_i^t, r_{a,i}^t, o_i^{t+1})\}_{i=1}^{N_a}$ into $\mathcal{D}$.
        if $t \bmod N_{rl} == 0$ then
            Sample a random minibatch of $N_{batch}$ transitions from $\mathcal{D}$.
            for $i = 1$ to $N_a$ do
                Update the critic network $\theta_{q,i}$ by minimizing the loss function (49).
                Update the actor network $\theta_{\pi,i}$ by the gradient (52).
                Update the adversary network $\theta_{\nu,i}$ by the gradient (53).
                Update target networks:
                    $\theta'_{q,i} \leftarrow \iota \theta_{q,i} + (1 - \iota)\, \theta'_{q,i}$
                    $\theta'_{\pi,i} \leftarrow \iota \theta_{\pi,i} + (1 - \iota)\, \theta'_{\pi,i}$
                    $\theta'_{\nu,i} \leftarrow \iota \theta_{\nu,i} + (1 - \iota)\, \theta'_{\nu,i}$
            end for
        end if
        if $t \bmod N_{cbf} == 0$ then
            Sample a random minibatch of $N_{batch}$ transitions from $\mathcal{D}$.
            for $i = 1$ to $N_a$ do
                Update the CBF network $\theta_{h,i}$ by minimizing the loss function (34).
                Update the safe actor network $\theta_{\pi_s,i}$ by minimizing the loss function (38).
                Update the safe adversary network $\theta_{\nu_s,i}$ by minimizing the loss function (39).
            end for
        end if
    end for
end for

4.2. Simulation results and analysis

The effectiveness of our method is investigated through quantitative analysis. Given the emphasis on safety and optimality in this
paper, the following indicators are selected:

1) Safety rate: It is the most important indicator for the safe MARL. The ideal scenario is that agents avoid other agents and
obstacles in the entire navigation process. It can be calculated by $\frac{1}{N_a} \sum_{i=1}^{N_a} \frac{1}{T} \int_0^T \mathbb{1}\big(s_i^t \in C_s\big)\, dt$.


Table 2
Training parameters of the algorithm.

Parameter                                              Value
Learning rate                                          1e-5
Fine-tuning rate                                       1e-1
Soft updating rate $\iota$                             1e-3
Replay buffer size                                     1,000,000
Discount factor $\gamma$                               0.9999
Batch size $N_{batch}$                                 128
Max trajectory step $N_s$                              200
Max episode $N_e$                                      4000
RL training interval $N_{rl}$                          5
CBF training interval $N_{cbf}$                        10
CBF coefficient $\epsilon$                             0.55
Balanced factors $\lambda_u$, $\lambda_w$              0.6
Regularization factors $\lambda_{au}$, $\lambda_{aw}$  0.1
Learning optimizer                                     Adam
Fine-tuning optimizer                                  RMSprop
Max fine-tuning iterative number $N_f$                 10

Fig. 5. Learning curves of diﬀerent MARL methods in terms of performance reward.

2) Risk margin rate: When two agents or an agent and an obstacle are nearby, this situation is referred to as the risk margin,
i.e., the distance Δ𝑑 between them is too close, Δ𝑑 < 0.2. The proportion of this situation in the entire process is called the risk
margin rate.
3) Performance reward: It evaluates the performance of the method primarily with a focus on optimality, encompassing both state
and energy costs, denoted as −(𝑝𝑡𝑖 − 𝑝𝑔𝑜𝑎𝑙,𝑖 )𝑇 (𝑝𝑡𝑖 − 𝑝𝑔𝑜𝑎𝑙,𝑖 ) − (𝑢𝑡𝑖 )𝑇 𝑢𝑡𝑖 of one step. The cumulative sum of these values for each step
constitutes the accumulated performance reward in one episode.
4) Navigation step: It represents the navigation steps undertaken by all agents to reach their target positions.
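
The first two indicators can be computed from logged trajectories roughly as follows. This sketch replaces the time integral in the safety rate with an average over discrete steps and averages the risk margin rate over agents in the same way; both choices, as well as the input layout (one list of per-step values per agent), are assumptions made for illustration.

```python
def safety_rate(safe_flags):
    """Average over agents of the fraction of steps spent in the safe set.

    safe_flags[i][t] is True when agent i is in the safe set at step t.
    """
    per_agent = [sum(1 for flag in flags if flag) / len(flags) for flags in safe_flags]
    return sum(per_agent) / len(per_agent)

def risk_margin_rate(nearest_distances, threshold=0.2):
    """Average over agents of the fraction of steps with a neighbor closer than the threshold.

    nearest_distances[i][t] is the distance from agent i to its nearest agent or obstacle.
    """
    per_agent = [sum(1 for dist in dists if dist < threshold) / len(dists)
                 for dists in nearest_distances]
    return sum(per_agent) / len(per_agent)

# Made-up logs for two agents over four steps.
example_safety = safety_rate([[True, True, True, False], [True, True, True, True]])
example_risk = risk_margin_rate([[0.5, 0.1, 0.3, 0.15], [1.0, 1.0, 1.0, 1.0]])
```

In this made-up log the first agent leaves the safe set once and is within the 0.2 margin twice, giving a safety rate of 0.875 and a risk margin rate of 0.25.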

The learning curves of the different MARL methods for the average performance reward are shown in Fig. 5. It can be observed
that all methods can converge to stable values, which also indicates that the training procedure of our method is stable.
In addition to the training procedure, the evaluation indicators based on 500 independent simulations are reported
in Table 3. The model is trained in a scenario that contains 10 agents and 10 obstacles. To verify the scalability of the
algorithm, we increase the numbers of agents and obstacles to 30 and then to 50 each. In general, apart from the performance
reward, our method outperforms the others in all test scenarios. For the performance reward, our method is close to the best-performing
method, MAAC-P. This is because, in our approach, the agents spend more energy or take longer routes for the sake of safety.
Although its performance reward is lower than that of MAAC-P, it is better than those of the other methods. The safety rate and
risk margin rate of our method outperform the others by a significant margin, which indicates that the decentralized robust neural
CBFs play an essential role in maintaining safety. Notably, as shown in Fig. 6, our method still maintains a high safety rate even as
the number of agents and obstacles increases. Simultaneously, compared with RMARL-CBF-MAX, the SAM demonstrates a notable
impact on enhancing the safety rate. Moreover, the indicator navigation step demonstrates that our method has better execution
efficiency than the others. It is noteworthy that although the MACBF method utilizes neural network CBFs, it fails to account for the
influence of modeling errors and unknown disturbances, overlooks performance optimization, and treats neighboring agents equally.
Consequently, both its safety and optimality performance are inferior to those of our method, as validated by the experimental results.


Table 3
Quantitative comparison of the safety and performance.

Columns, left to right: Method | Scenario 1: safety rate, risk margin rate, performance reward, navigation step | Scenario 2: safety rate, risk margin rate, performance reward, navigation step

𝑁𝑎 = 10, 𝑁𝑜 = 10 𝑁𝑎 = 10, 𝑁𝑜 = 10
RMARL-CBF-SAM 0.9998 0.0261 -106.9118 80 0.9998 0.0352 -160.2550 90
RMARL-CBF-MAX 0.9991 0.0273 -107.1875 88 0.9986 0.0412 -174.3799 110
MAAC-P 0.9900 0.0394 -96.9189 101 0.9868 0.0553 -150.1372 109
M3DDPG-P 0.9873 0.0399 -120.3926 108 0.9853 0.0470 -247.9340 122
MADDPG-P 0.9885 0.0389 -127.9314 111 0.9922 0.0415 -235.9178 116
MARL-RCPO 0.9793 0.0521 -226.3338 153 0.9872 0.0419 -169.5964 112
MACBF 0.9880 0.0530 -107.3338 100 0.9908 0.0406 -173.1560 95
MARL-BLAC 0.9914 0.0382 -115.0238 109 0.9899 0.0401 -170.8980 123
MARL-Barrier 0.9852 0.0399 -119.0166 122 0.9888 0.0455 -169.7595 124

𝑁𝑎 = 30, 𝑁𝑜 = 30 𝑁𝑎 = 30, 𝑁𝑜 = 30
RMARL-CBF-SAM 0.9996 0.0272 -212.6815 103 0.9994 0.0367 -466.3641 112
RMARL-CBF-MAX 0.9968 0.0330 -238.5108 111 0.9953 0.0488 -496.7043 114
MAAC-P 0.9836 0.0443 -205.9168 111 0.9739 0.0750 -445.7901 116
M3DDPG-P 0.9774 0.0526 -274.1299 130 0.9691 0.0758 -575.0621 127
MADDPG-P 0.9797 0.0514 -281.3259 139 0.9788 0.0574 -525.8341 119
MARL-RCPO 0.9724 0.0631 -612.3985 168 0.9764 0.0599 -471.4294 117
MACBF 0.9848 0.0629 -224.7866 108 0.9789 0.0423 -513.3392 114
MARL-BLAC 0.9857 0.0420 -245.7891 121 0.9772 0.0597 -497.2911 128
MARL-Barrier 0.9820 0.0472 -247.5354 128 0.9745 0.0687 -500.6940 127

𝑁𝑎 = 50, 𝑁𝑜 = 50 𝑁𝑎 = 50, 𝑁𝑜 = 50
RMARL-CBF-SAM 0.9992 0.0299 -350.5713 120 0.9988 0.0388 -812.8569 117
RMARL-CBF-MAX 0.9951 0.0348 -378.6206 127 0.9888 0.0510 -844.7378 119
MAAC-P 0.9798 0.0517 -342.5582 125 0.9642 0.0848 -793.0284 119
M3DDPG-P 0.9745 0.0594 -429.5557 134 0.9588 0.0956 -930.0082 132
MADDPG-P 0.9746 0.0614 -462.6930 172 0.9659 0.0869 -852.1013 120
MARL-RCPO 0.9695 0.0676 -711.8138 180 0.9670 0.0803 -829.8660 125
MACBF 0.9824 0.0745 -365.8480 121 0.9733 0.0547 -948.4012 118
MARL-BLAC 0.9821 0.0472 -381.6679 125 0.9681 0.0755 -849.0286 129
MARL-Barrier 0.9798 0.0530 -383.0098 133 0.9664 0.0808 -850.2107 129

Fig. 6. Comparison of diﬀerent methods in terms of safety rate. (a) Scenario 1; (b) Scenario 2.

Similarly, due to the lack of effective and sufficient consideration for modeling errors and unknown disturbances, both MARL-BLAC
and MARL-Barrier, despite incorporating CBFs, exhibit inferior performance compared to our method. Overall, our method effectively
balances safety and optimality.
In addition, the robustness of our method is investigated in scenarios with 30 agents and 30 obstacles. As described in Section 3.1,
$\Delta_i^t$ and $d_i^t$ are bounded. We set $\Delta s_i^{t+1} = \big\| \hat{s}_i^{t+1} - s_{i,real}^{t+1} \big\| \big/ \big( \big\| s_{i,real}^{t+1} \big\| + \kappa \big)$ as the error percentage of the estimated model, where
$\hat{s}_i^{t+1} = \hat{f}_i\big(s_i^t, u_i^t\big)$, $s_{i,real}^{t+1} = f_i\big(s_i^t, u_i^t\big)$, and $\kappa$ is an extremely small positive value. Then we have $\Delta s_i^{t+1} \le \Delta s_{iM}$ and $\big\| d_i^t \big\| \le d_{iM}$, where $\Delta s_{iM}$
represents the maximum percentage of modeling errors. We verify the robustness of the method by setting different values for $\Delta s_{iM}$


Fig. 7. The eﬀect of the model error and external disturbance on the safety rate. The percentage values in the legend represent the maximum percentage of modeling
errors Δ𝑠𝑖𝑀 . (a) Scenario 1; (b) Scenario 2.

Fig. 8. The detailed trajectories of the focus robot and neighboring robots, and the positions of the neighboring obstacles in Scenario 1. And dashed blue circles are
the detection areas for the focus robot.

and $d_{iM}$. The results are illustrated in Fig. 7, where the percentage values in the legend represent the maximum model error $\Delta s_{iM}$.
Although the indicator safety rate shows a downward trend with increasing model error and disturbance, it consistently maintains a
fairly high level. The experimental results demonstrate the strong robustness of the proposed method.
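
The model error percentage used in this robustness study can be computed as below. This is a minimal sketch: the Euclidean norm and the default value of `kappa` are assumptions, since the text only states that $\kappa$ is an extremely small positive value, and the function name is hypothetical.

```python
import math

def model_error_percentage(s_pred, s_real, kappa=1e-8):
    """Relative model error ||s_pred - s_real|| / (||s_real|| + kappa)."""
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(s_pred, s_real)))
    real_norm = math.sqrt(sum(b * b for b in s_real))
    return diff_norm / (real_norm + kappa)      # kappa avoids division by zero at the origin

# A prediction that is off by 10% of the true state norm, and the degenerate zero state.
error_ten_percent = model_error_percentage([1.1, 0.0], [1.0, 0.0])
error_at_origin = model_error_percentage([0.0, 0.0], [0.0, 0.0])
```

The small constant in the denominator keeps the ratio defined when the real next state is exactly zero.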
Furthermore, Figs. 8 and 9 display the detailed trajectories of multi-robots in Scenario 1 and Scenario 2, respectively. We primarily
focus on a randomly selected robot marked with a red star. The left panels show the global initial and final states of the MAS. The
right panels show the trajectories of the focus robot at different steps. The arrow in each circle indicates the direction of the velocity.
The safety attention weights of the robots and obstacles within the communication and detection range of the
focus robot are also displayed. Robots or obstacles closer to the focus robot receive higher scores. In addition, the score is higher when a
robot is approaching the focus robot, and lower otherwise. For example, at step 40 in Scenario 1, robot B is moving towards the focus robot
and thus receives a high score; conversely, the score of robot A is the lowest because it is the farthest from the focus robot. These
examples indicate that the safety attention weights assigned by the SAM to neighbors are reasonable and effective.
Additionally, to provide a more comprehensive representation of the role of neural CBFs, the contour maps are displayed at
different steps for the focus robot in Scenario 1, as shown in Fig. 10. The figure clearly demonstrates that neural CBFs dynamically
adapt in response to the status of the agents, thereby effectively identifying hazardous regions. For mobile robots, when they are
relatively close, the area between them is relatively dangerous; otherwise, it is safe. For stationary robots, dangerous areas mainly


Fig. 9. The detailed trajectories of the focus robot and neighboring robots, and the positions of the adjacent obstacles in Scenario 2. And dashed blue circles are the
detection areas for the focus robot.

Fig. 10. The contours of neural CBFs for the focus robot at diﬀerent steps in Scenario 1. The blue lines are zero-level contours of neural CBFs, the dashed yellow circles
are the detection areas for the focus robot, and the goals of other robots are not displayed.

surround their periphery. Moreover, through prior learning, neural CBFs can allocate a certain safety margin and provide agents with
adequate response time to adjust their strategies proactively.

5. Conclusion

In this paper, we have proposed a novel robust MARL method for the safety-critical MAS. Inspired by 𝐻∞ , the adversary is
introduced to address the issues of modeling errors and external unknown disturbances. CBFs have been verified to be valid for
designing controllers with state constraints. For extremely complex environments, neural CBFs are constructed based on the data
along the system trajectories to overcome the flaws of handcrafted CBFs. Moreover, the SAM is utilized to ensure scalability to an
arbitrary number of agents and to handle the time-varying set of observable agents. The SAM is generated through the training of the robust
neural CBFs, which can capture essential information from the neighboring agents and eliminate redundant data interference. Then,
the simulations confirm the superiority of our proposed method over the others. In the future, we intend to validate our method in
an actual multi-agent environment, such as a drone swarm or a fleet of cars. Furthermore, adaptive robust neural CBFs, which can
mitigate the excessive conservativeness induced by the adversaries, will be the focus of our next study.

CRediT authorship contribution statement

Shihan Liu: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Methodology, Investigation,
Formal analysis, Data curation, Conceptualization. Lijun Liu: Writing – review & editing, Validation, Resources, Funding acquisition.
Zhen Yu: Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[2] G. Yuan, J. Xiao, J. He, H. Jia, Y. Wang, Z. Wang, Multi-agent cooperative area coverage: a two-stage planning approach based on reinforcement learning, Inf.
Sci. 678 (2024) 121025.
[3] Z. Zhang, Y. Wang, Z. Zhang, L. Wang, H. Huang, Q. Cao, A residual reinforcement learning method for robotic assembly using visual and force information,
J. Manuf. Syst. 72 (2024) 245–262.
[4] F. Xue, Q. Hai, T. Dong, Z. Cui, Y. Gong, A deep reinforcement learning based hybrid algorithm for efficient resource scheduling in edge computing environment,
Inf. Sci. 608 (2022) 362–374.
[5] X. Jing, X. Yao, M. Liu, J. Zhou, Multi-agent reinforcement learning based on graph convolutional network for flexible job shop scheduling, J. Intell. Manuf.
35 (1) (2024) 75–93.
[6] S. Fan, F. Peng, X. Liu, T. Wang, J. Qiu, Bipartite containment control of multi-agent systems subject to adversarial inputs based on zero-sum game, Inf. Sci. 681
(2024) 121234.
[7] J. Liu, S. Qin, M. Su, Y. Luo, Y. Wang, S. Yang, Multiple intersections traffic signal control based on cooperative multi-agent reinforcement learning, Inf. Sci. 647
(2023) 119484.
[8] Y. Liu, D. Yao, H. Li, R. Lu, Distributed cooperative compound tracking control for a platoon of vehicles with adaptive NN, IEEE Trans. Cybern. 52 (7) (2021)
7039–7048.
[9] H. Cao, H. Xiong, W. Zeng, H. Jiang, Z. Cai, L. Hu, L. Zhang, W. Lu, Safe reinforcement learning-based motion planning for functional mobile robots suffering
uncontrollable mobile robots, IEEE Trans. Intell. Transp. Syst. (2023) 1–18.
[10] S. Tong, K. Sun, S. Sui, Observer-based adaptive fuzzy decentralized optimal control design for strict-feedback nonlinear large-scale systems, IEEE Trans. Fuzzy
Syst. 26 (2) (2017) 569–584.
[11] L. Pinto, J. Davidson, R. Sukthankar, A. Gupta, Robust adversarial reinforcement learning, in: International Conference on Machine Learning, PMLR, 2017,
pp. 2817–2826.
[12] J. Morimoto, K. Doya, Robust reinforcement learning, Neural Comput. 17 (2) (2005) 335–359.
[13] Y.-m. Li, X. Min, S. Tong, Adaptive fuzzy inverse optimal control for uncertain strict-feedback nonlinear systems, IEEE Trans. Fuzzy Syst. 28 (10) (2019)
2363–2374.
[14] V.-N. Giap, S.-C. Huang, Effectiveness of fuzzy sliding mode control boundary layer based on uncertainty and disturbance compensator on suspension active
magnetic bearing system, Meas. Control 53 (5–6) (2020) 934–942.
[15] Q. Yao, Dual-disturbance-observer-based robust finite-time trajectory tracking control for robotic surface vehicle under measurement uncertainties, Ocean Eng.
242 (2021) 110183.
[16] D.Q. Mayne, J.B. Rawlings, C.V. Rao, P.O. Scokaert, Constrained model predictive control: stability and optimality, Automatica 36 (6) (2000) 789–814.
[17] G. Wu, K. Sreenath, Safety-critical control of a planar quadrotor, in: 2016 American Control Conference, IEEE, 2016, pp. 2252–2258.
[18] Y. Yang, Y. Jiang, Y. Liu, J. Chen, S.E. Li, Model-free safe reinforcement learning through neural barrier certificate, IEEE Robot. Autom. Lett. 8 (3) (2023)
1295–1302.
[19] L. Zhao, K. Gatsis, A. Papachristodoulou, A barrier-Lyapunov actor-critic reinforcement learning approach for safe and stable control, arXiv preprint, arXiv:
2304.04066.
[20] Z. Qin, D. Sun, C. Fan, Sablas: learning safe control for black-box dynamical systems, IEEE Robot. Autom. Lett. 7 (2) (2022) 1928–1935.
[21] C. Dawson, Z. Qin, S. Gao, C. Fan, Safe nonlinear control using robust neural Lyapunov-barrier functions, in: Conference on Robot Learning, PMLR, 2022,
pp. 1724–1735.
[22] Z. Qin, K. Zhang, Y. Chen, J. Chen, C. Fan, Learning safe multi-agent control with decentralized neural barrier certificates, in: International Conference on
Learning Representations, 2020.
[23] T.T. Nguyen, N.D. Nguyen, S. Nahavandi, Deep reinforcement learning for multiagent systems: a review of challenges, solutions, and applications, IEEE Trans.
Cybern. 50 (9) (2020) 3826–3839.
[24] W. Zhou, Z. Liu, J. Li, X. Xu, L. Shen, Multi-target tracking for unmanned aerial vehicle swarms using deep reinforcement learning, Neurocomputing 466 (2021)
285–297.
[25] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, I. Mordatch, Multi-agent actor-critic for mixed cooperative-competitive environments, in: Advances in
Neural Information Processing Systems, vol. 30, Curran Associates, Inc., 2017.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information
Processing Systems, vol. 30, Curran Associates, Inc., 2017.
[27] S. Iqbal, F. Sha, Actor-attention-critic for multi-agent reinforcement learning, in: International Conference on Machine Learning, PMLR, 2019, pp. 2961–2970.
[28] K. Liu, Y. Zhao, G. Wang, B. Peng, Self-attention-based multi-agent continuous control method in cooperative environments, Inf. Sci. 585 (2022) 454–470.
[29] S. Nayak, K. Choi, W. Ding, S. Dolan, K. Gopalakrishnan, H. Balakrishnan, Scalable multi-agent reinforcement learning through intelligent information
aggregation, in: International Conference on Machine Learning, PMLR, 2023, pp. 25817–25833.
[30] J. Jiang, Z. Lu, Learning attentional communication for multi-agent cooperation, in: 32nd Conference on Neural Information Processing Systems, NeurIPS, 2018,
pp. 7254–7264.
[31] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, J. Pineau, TarMAC: targeted multi-agent communication, in: International Conference on Machine
Learning, PMLR, 2019, pp. 1538–1546.
[32] W.J. Yun, B. Lim, S. Jung, Y.-C. Ko, J. Park, J. Kim, M. Bennis, Attention-based reinforcement learning for real-time UAV semantic communication, in: 2021 17th
International Symposium on Wireless Communication Systems, IEEE, 2021, pp. 1–6.
[33] J. Fu, G. Wen, X. Yu, Z.-G. Wu, Distributed formation navigation of constrained second-order multiagent systems with collision avoidance and connectivity
maintenance, IEEE Trans. Cybern. 52 (4) (2022) 2149–2162.
[34] R. Cheng, M.J. Khojasteh, A.D. Ames, J.W. Burdick, Safe multi-agent interaction through robust control barrier functions with learned uncertainties, in: 2020
59th IEEE Conference on Decision and Control, IEEE, 2020, pp. 777–783.
[35] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, J. Wang, Mean field multi-agent reinforcement learning, in: International Conference on Machine Learning, PMLR,
2018, pp. 5571–5580.
[36] L.P. Kaelbling, M.L. Littman, A.R. Cassandra, Planning and acting in partially observable stochastic domains, Artif. Intell. 101 (1–2) (1998) 99–134.
[37] M.L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 2014.
[38] F.A. Oliehoek, C. Amato, et al., A Concise Introduction to Decentralized POMDPs, vol. 1, Springer, 2016.
[39] Y. Xiong, D.-H. Zhai, M. Tavakoli, Y. Xia, Discrete-time control barrier function: high-order case and adaptive case, IEEE Trans. Cybern. (2022) 1–9.
[40] A. Agrawal, K. Sreenath, Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation, in: Robotics:
Science and Systems, Cambridge, MA, USA, vol. 13, 2017.
[41] Q. Nguyen, K. Sreenath, Exponential control barrier functions for enforcing high relative-degree safety-critical constraints, in: 2016 American Control Conference,
IEEE, 2016, pp. 322–328.
[42] R.P. Agarwal, M. Meehan, D. O’Regan, Fixed Point Theory and Applications, vol. 141, Cambridge University Press, 2001.
[43] H. Sami, J. Bentahar, A. Mourad, H. Otrok, E. Damiani, Graph convolutional recurrent networks for reward shaping in reinforcement learning, Inf. Sci. 608
(2022) 63–80.
[44] A.Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations: theory and application to reward shaping, in: International Conference on
Machine Learning, vol. 99, 1999, pp. 278–287.
[45] C. Dawson, S. Gao, C. Fan, Safe control with learned certificates: a survey of neural Lyapunov, barrier, and contraction methods for robotics and control, IEEE
Trans. Robot. 39 (3) (2023) 1749–1767.
[46] M. Zhou, Y. Chen, Y. Wen, Y. Yang, Y. Su, W. Zhang, D. Zhang, J. Wang, Factorized Q-learning for large-scale multi-agent systems, in: Proceedings of the First
International Conference on Distributed Artificial Intelligence, 2019, pp. 1–7.
[47] C. Sun, W. Liu, L. Dong, Reinforcement learning with task decomposition for cooperative multiagent systems, IEEE Trans. Neural Netw. Learn. Syst. 32 (5) (2020)
2054–2065.
[48] K.P. Wabersich, M.N. Zeilinger, A predictive safety filter for learning-based control of constrained nonlinear dynamical systems, Automatica 129 (2021) 109597.
[49] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, S. Russell, Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient, in: Proceedings of
the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 4213–4220.
[50] C. Tessler, D.J. Mankowitz, S. Mannor, Reward constrained policy optimization, in: International Conference on Learning Representations, 2018.
