
End-to-End Safe Reinforcement Learning Through Barrier Functions For Safety-Critical Continuous Control Tasks

The document presents a framework for safe reinforcement learning (RL) that integrates model-free RL algorithms with control barrier functions (CBFs) to ensure safety during the learning process. The proposed RL-CBF algorithm guarantees high probability safety while improving policy exploration efficiency, demonstrated through applications on an inverted pendulum and autonomous car following tasks. This approach addresses the limitations of existing safe RL methods by allowing for effective exploration of safe policies while learning system dynamics online.


End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks

Richard Cheng,1 Gábor Orosz,2 Richard M. Murray,1 Joel W. Burdick1

1 California Institute of Technology, 2 University of Michigan, Ann Arbor
arXiv:1903.08792v1 [[Link]] 21 Mar 2019

Abstract

Reinforcement Learning (RL) algorithms have found limited success beyond simulated applications, and one main reason is the absence of safety guarantees during the learning process. Real world systems would realistically fail or break before an optimal controller can be learned. To address this issue, we propose a controller architecture that combines (1) a model-free RL-based controller with (2) model-based controllers utilizing control barrier functions (CBFs) and (3) online learning of the unknown system dynamics, in order to ensure safety during learning. Our general framework leverages the success of RL algorithms to learn high-performance controllers, while the CBF-based controllers both guarantee safety and guide the learning process by constraining the set of explorable policies. We utilize Gaussian Processes (GPs) to model the system dynamics and its uncertainties.

Our novel controller synthesis algorithm, RL-CBF, guarantees safety with high probability during the learning process, regardless of the RL algorithm used, and demonstrates greater policy exploration efficiency. We test our algorithm on (1) control of an inverted pendulum and (2) autonomous car-following with wireless vehicle-to-vehicle communication, and show that our algorithm attains much greater sample efficiency in learning than other state-of-the-art algorithms and maintains safety during the entire learning process.

Copyright © 2019, Association for the Advancement of Artificial Intelligence ([Link]). All rights reserved.

Introduction

Reinforcement learning (RL) focuses on finding an agent's policy (i.e. controller) that maximizes a long-term reward. It does this by repeatedly observing the agent's state, taking an action (according to a current policy), and receiving a reward. Over time, the agent modifies its policy to maximize its long-term reward. This method has been successfully applied to continuous control tasks (Duan et al. 2016; Lillicrap et al. 2015) where controllers have learned to stabilize complex robots (after many policy iterations).

However, since RL focuses on maximizing the long-term reward, it is likely to explore unsafe behaviors during the learning process. This feature is problematic for any RL algorithm that will be deployed on hardware, as unsafe learning policies could damage the hardware or bring harm to a human. As a result, most success in the use of RL for control of physical systems has been limited to simulations, where many failed iterations can occur before success.

Safe RL tries to learn a policy that maximizes the expected return, while also ensuring (or encouraging) the satisfaction of some safety constraints (García and Fernández 2015). Previous approaches to safe reinforcement learning include reward-shaping, policy optimization with constraints (Gaskett 2003; Moldovan and Abbeel 2012; Achiam et al. 2017; Wachi et al. 2018), or teacher advice (Abbeel and Ng 2004; Abbeel, Coates, and Ng 2010; Tang et al. 2010). However, these model-free approaches do not guarantee safety during learning; safety is only approximately guaranteed after a sufficient learning period. The fundamental issue is that without a model, safety must be learned through environmental interactions, which means it may be violated during initial learning interactions.

Model-based approaches have utilized Lyapunov-based methods or model predictive control to guarantee safety under system dynamics during learning (Wang, Theodorou, and Egerstedt 2017; Berkenkamp et al. 2017; Chow et al. 2018; Ohnishi et al. 2018; Koller et al. 2018), but they do not address the issue of exploration and performance optimization. Other works guarantee safety by switching between backup controllers (Perkins and Barto 2003; Mannucci et al. 2018), though this overly constrains policy exploration.

We draw inspiration from recent work that has incorporated model information into model-free RL algorithms to ensure safety during exploration (Fisac et al. 2018; Li, Kalabic, and Chu 2018; Gillula and Tomlin 2012). However, these approaches utilize backup safety controllers that do not guide the learning process (limiting exploration efficiency).

This paper develops a framework for integrating existing model-free RL algorithms with control barrier functions (CBFs) to guarantee safety and improve exploration efficiency in RL, even with uncertain model information. The CBFs require a (potentially poor) nominal dynamics model, but can ensure online safety of nonlinear systems during the entire learning process and help the RL algorithm efficiently search the policy space. This methodology effectively constrains the policy exploration process to a set of safe policies defined by the CBF. An on-line process learns the governing dynamical system over time, which allows the CBF controller to adapt and become less conservative over time. This general framework allows us to utilize any model-free
RL algorithm to learn a controller, with the CBF controller guiding policy exploration and ensuring safety.

Using this framework, we develop an efficient algorithm for controller synthesis, RL-CBF, with guarantees on safety (remaining within a safe set) and performance (reward-maximization). To test this approach, we integrated two model-free RL algorithms, trust region policy optimization (TRPO) (Schulman et al. 2015) and deep deterministic policy gradients (DDPG) (Lillicrap et al. 2015), with the CBF controllers and dynamical model learning. We tested the algorithms on two nonlinear control problems: (1) balancing of an inverted pendulum, and (2) autonomous car following with wireless vehicle-to-vehicle communication. For both tasks, our algorithm efficiently learned a high-performance controller while maintaining safety throughout the learning process. Furthermore, it learned faster than comparable RL algorithms due to inclusion of a model learning process, which constrains the space of explorable policies and guides the exploration process.

Our main contributions are: (1) we develop the first algorithm that integrates CBF-based controllers with model-free RL to achieve end-to-end safe RL for nonlinear control systems, and (2) we show improved learning efficiency by guiding the policy exploration with barrier functions.

Preliminaries

Consider an infinite-horizon discounted Markov decision process (MDP) with control-affine, deterministic dynamics (a good assumption when dealing with robotic systems), defined by the tuple $(S, A, f, g, d, r, \rho_0, \gamma)$, where $S$ is a set of states, $A$ is a set of actions, $f : S \to S$ is the nominal unactuated dynamics, $g : S \to \mathbb{R}^{n,m}$ is the nominal actuated dynamics, and $d : S \to S$ is the unknown system dynamics. The time evolution of the system is given by

\[ s_{t+1} = f(s_t) + g(s_t)\,a_t + d(s_t), \tag{1} \]

where $s_t \in S$, $a_t \in A$, $f$ and $g$ compose a known nominal model of the dynamics, and $d$ represents the unknown model. In practice, the nominal model may be quite bad (e.g. a robot model that ignores friction and compliance), and we must learn a much better dynamic model through data.

Furthermore $r : S \times A \to \mathbb{R}$ is the reward function, $\rho_0 : S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.

Reinforcement Learning

Let $\pi(a|s)$ denote a stochastic control policy $\pi : S \times A \to [0, 1]$ that maps states to distributions over actions, and let $J(\pi)$ denote the policy's expected discounted reward:

\[ J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t)\Big]. \tag{2} \]

Here $\tau \sim \pi$ is a trajectory $\tau = \{s_t, a_t, ..., s_{t+n}, a_{t+n}\}$ where the actions are sampled from policy $\pi(a|s)$. We use the standard definitions for the value function $V_\pi$, action-value function $Q_\pi$, and advantage function $A_\pi$ below:

\[
\begin{aligned}
Q_\pi(s_t, a_t) &= \mathbb{E}_{s_{t+1}, a_{t+1}, ...}\Big[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l}, a_{t+l})\Big], \\
V_\pi(s_t) &= \mathbb{E}_{a_t, s_{t+1}, a_{t+1}, ...}\Big[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l}, a_{t+l})\Big], \\
A_\pi(s_t, a_t) &= Q_\pi(s_t, a_t) - V_\pi(s_t),
\end{aligned}
\tag{3}
\]

where actions $a_i$ are drawn from distribution $a_i \sim \pi(a|s_i)$.

Most policy optimization RL algorithms attempt to maximize long-term reward $J(\pi)$ using (a) policy iteration methods (Bertsekas 2005), (b) derivative-free optimization methods that optimize the return as a function of policy parameters (Fu, Glover, and April 2005), or (c) policy gradient methods (Peters and Schaal 2008; Silver et al. 2014). Any of these methods can be rendered end-to-end safe using the RL-CBF control framework proposed in this work. However, we will focus mainly on policy gradient methods, due to their good performance on continuous control problems.

Policy Gradient-Based RL. Policy gradient methods estimate the gradient of the expected return $J(\pi)$ with respect to the policy based on sampled trajectories. They then optimize the policy using gradient ascent, allowing modification of the control law at episodic intervals. The DDPG and TRPO algorithms are examples of policy gradient methods, which we will use as benchmarks.

DDPG is an off-policy actor-critic method that computes the policy gradient based on sampled trajectories and an estimate of the action-value function. It alternately updates the action-value function and the policy as it samples more and more trajectories.

TRPO is an on-policy policy gradient method that maximizes a surrogate loss function, which serves as an approximate lower bound on the true loss function. It also ensures that the next policy distribution is within a "trust region". More precisely, it approximates the optimal policy update by iteratively solving the optimization problem:

\[ \pi_{i+1} = \arg\max_{\pi} \sum_{s} \rho_{\pi_i}(s) \sum_{a} \pi(a|s)\, A_{\pi_i}(s, a) \tag{4} \]

such that the Kullback-Leibler divergence $D_{KL}(\pi_i, \pi_{i+1}) \le \delta_p$. Here $\rho_{\pi_i}(s)$ represents the discounted visitation frequency of state $s$ under policy $\pi_i$, and $\delta_p$ is a constant defining the "trust region".

Though both DDPG and TRPO have learned good controllers on several benchmark problems, there is no guarantee of safety in these algorithms, nor any other model-free RL algorithm. Therefore, our objective is to complement model-free RL controllers with model-based CBF controllers (using a potentially poor nominal model), which can both improve search efficiency and ensure safety.

Gaussian Processes

We use Gaussian process (GP) models to estimate the unknown system dynamics, $d(s)$, from data. A Gaussian process is a nonparametric regression method for estimating
functions and their uncertain distribution from data (Rasmussen and Williams 2006). It describes the evolving model of the uncertain dynamics, $d(s)$, by a mean estimate, $\mu_d(s)$, and the uncertainty, $\sigma_d^2(s)$, which allows for high probability confidence intervals on the function:

\[ \mu_d(s) - k_\delta \sigma_d(s) \le d(s) \le \mu_d(s) + k_\delta \sigma_d(s), \tag{5} \]

with probability $(1 - \delta)$ where $k_\delta$ is a design parameter that determines $\delta$ (e.g. 95% confidence is achieved at $k_\delta = 2$). Therefore, by learning $\mu_d(s)$ and $\sigma_d(s)$ in tandem with the controller, we obtain high probability confidence intervals on the unknown dynamics, which adapt/shrink as we obtain more information (i.e. measurements) on the system.

A GP model is parameterized by a kernel function $k(s, s')$, which defines the similarity between any two states $s, s' \in S$. In order to make inferences on the unknown function $d(s)$, we need measurements, $\hat{d}(s)$, which are computed from measurements of $(s_t, a_t, s_{t+1})$ using the relation from Equation (1): $\hat{d}(s_t) = s_{t+1} - f(s_t) - g(s_t)a_t$. Since any finite number of data points form a multivariate normal distribution, we can obtain the posterior distribution of $d(s_*)$ at any query state $s_* \in S$ by conditioning on the past measurements. Given $n$ measurements $y_n = [\hat{d}(s_1), \hat{d}(s_2), ..., \hat{d}(s_n)]$ subject to independent Gaussian noise $\nu_{noise} \sim \mathcal{N}(0, \sigma_{noise}^2)$, the mean $\mu_d(s_*)$ and variance $\sigma_d^2(s_*)$ at the query state, $s_*$, are calculated to be,

\[
\begin{aligned}
\mu_d(s_*) &= k_*^T(s_*)\,(K + \sigma_{noise}^2 I)^{-1} y_n, \\
\sigma_d^2(s_*) &= k(s_*, s_*) - k_*^T(s_*)\,(K + \sigma_{noise}^2 I)^{-1} k_*(s_*),
\end{aligned}
\tag{6}
\]

where $K_{i,j} = k(s_i, s_j)$ is the kernel matrix, and $k_* = [k(s_1, s_*), k(s_2, s_*), ..., k(s_n, s_*)]$. As we collect more data, $\mu_d(s)$ becomes a better estimate of $d(s)$, and the uncertainty, $\sigma_d^2(s)$, of the dynamics decreases.

We note that in applications with large amounts of data, training the GP becomes problematic since computing the matrix inverse in Equation (6) scales poorly ($N^3$ in the number of data points). There are several methods to alleviate this issue, such as using sparse inducing inputs or local GPs (Snelson and Ghahramani 2007; Nguyen-Tuong, Seeger, and Peters 2009). In fact, our framework can use any model approximation method that provides quantifiable uncertainty bounds (e.g. neural networks with dropout). However, we bypass this issue in this work by batch training the GP model with only the latest batch of ≈ 1000 data points.

Control Barrier Functions

Consider an arbitrary safe set, $C$, defined by the super-level set of a continuously differentiable function $h : \mathbb{R}^n \to \mathbb{R}$,

\[ C : \{ s \in \mathbb{R}^n : h(s) \ge 0 \}. \tag{7} \]

To maintain safety during the learning process, the system state must always remain within the safe set $C$ (i.e. the set $C$ is forward invariant). Examples include keeping a manipulator within a given workspace, or ensuring that a quadcopter avoids obstacles. Essentially, the learning algorithm should learn/explore only in set $C$.

Control barrier functions utilize a Lyapunov-like argument to provide a sufficient condition for ensuring forward invariance of the safe set $C$ under controlled dynamics. Therefore, barrier functions are a natural tool to enforce safety throughout the learning process, and can be used to synthesize safe controllers for our systems.

Definition 1. Given a set $C \subset \mathbb{R}^n$ defined by (7), the continuously differentiable function $h : \mathbb{R}^n \to \mathbb{R}$ is a discrete-time control barrier function (CBF) for dynamical system (1) if there exists $\eta \in [0, 1]$ such that for all $s_t \in C$,

\[ \sup_{a_t \in A} \Big[ h\big(f(s_t) + g(s_t)a_t + d(s_t)\big) + (\eta - 1)\,h(s_t) \Big] \ge 0, \tag{8} \]

where $\eta$ represents how strongly the barrier function "pushes" the state inwards within the safe set (if $\eta = 0$, the barrier condition simplifies to the Lyapunov condition).

The existence of a CBF implies that there exists a deterministic controller $u^{CBF} : S \to A$ such that the set $C$ is forward invariant for system (1) (Agrawal and Sreenath 2017; Ames et al. 2017). In other words, if condition (8) is satisfied for all $s \in C$, then the set $C$ is rendered forward invariant. Our goal is to find a controller, $u^{CBF}$, that satisfies condition (8), so that safety is certified.

For this paper, we restrict our attention to affine barrier functions of form $h = p^T s + q$, ($p \in \mathbb{R}^n$, $q \in \mathbb{R}$), though our methodology could support more general barrier functions. This restriction means the set $C$ is composed of intersecting half spaces (i.e. polytopes).

Before we can formulate a tractable optimization problem that satisfies condition (8), we must have an estimate for $d(s)$. We use an updating GP model to estimate the mean and variance of the function, $\mu_d(s)$ and $\sigma_d^2(s)$, from measurement data. From equation (5), we know that $|\mu_d(s) - d(s)| \le k_\delta \sigma_d(s)$ with probability $(1 - \delta)$. Therefore, we can reformulate the CBF condition (8) into the following quadratic program (QP) that can be efficiently solved at each time step:

\[
\begin{aligned}
(a_t, \epsilon) = \operatorname*{argmin}_{a_t,\, \epsilon} \quad & \|a_t\|_2 + K_\epsilon\, \epsilon \\
\text{s.t.} \quad & p^T f(s_t) + p^T g(s_t)\,a_t + p^T \mu_d(s_t) - k_\delta |p|^T \sigma_d(s_t) + q \ge (1 - \eta)\,h(s_t) - \epsilon \\
& a^i_{low} \le a^i_t \le a^i_{high} \quad \text{for } i = 1, ..., M
\end{aligned}
\tag{9}
\]

where $\epsilon$ is a slack variable in the safety condition, $K_\epsilon$ is a large constant that penalizes safety violations, and $|p|$ denotes the element-wise absolute value of the vector $p$. The optimization is not sensitive to the $K_\epsilon$ parameter as long as it is very large (e.g. $10^{12}$), such that safety constraint violations are heavily penalized. The last constraint on $a^i_t$ encodes actuator constraints. The solution to this optimization problem (9) enforces the safety condition (8) as best as possible with minimum control effort, even with uncertain dynamics. Accounting for the dynamics uncertainty through GP models allows us to certify system safety, even with a poor nominal model.
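To make the safety filter in (9) more concrete, here is a minimal sketch for the special case of a single control input ($M = 1$), where the problem can be solved by hand in the limit of a very large penalty $K_\epsilon$. The function names `worst_case_barrier` and `cbf_filter_scalar`, the explicit requirement $\epsilon \ge 0$, and the numbers in the usage lines are assumptions made for illustration; none of them come from the authors' code, and a multi-input system would need an actual QP solver.

```python
def worst_case_barrier(p, q, f_s, g_s, mu, sigma, k_delta):
    """Split the left-hand side of the constraint in (9) into drift + c * a.

    drift = p^T f(s) + p^T mu_d(s) - k_delta * |p|^T sigma_d(s) + q
    c     = p^T g(s)   (a scalar, because there is a single input)
    """
    drift = sum(pi * (fi + mi) for pi, fi, mi in zip(p, f_s, mu))
    drift -= k_delta * sum(abs(pi) * si for pi, si in zip(p, sigma))
    drift += q
    c = sum(pi * gi for pi, gi in zip(p, g_s))
    return drift, c


def cbf_filter_scalar(h_s, drift, c, eta, a_low, a_high):
    """Large-K limit of (9) for one input: return (a, eps).

    Assumption: eps >= 0, and K is large enough that the slack is only
    used when the actuator limits make the barrier constraint infeasible.
    """
    b = (1.0 - eta) * h_s - drift        # constraint reads c * a + eps >= b
    a = min(max(0.0, a_low), a_high)     # smallest-magnitude admissible action
    if c * a < b and c != 0.0:
        # Move to the constraint boundary, then respect the actuator limits.
        a = min(max(b / c, a_low), a_high)
    eps = max(0.0, b - c * a)            # remaining violation, if any
    return a, eps


# Hypothetical two-state example with h(s) = s_1 + 1 evaluated at s = (0, 0).
drift, c = worst_case_barrier(
    p=[1.0, 0.0], q=1.0,
    f_s=[-1.0, 0.0], g_s=[0.5, 0.0],
    mu=[0.0, 0.0], sigma=[0.25, 0.0], k_delta=2.0,
)
print(cbf_filter_scalar(1.0, drift, c, 0.5, -15.0, 15.0))  # (2.0, 0.0)
print(cbf_filter_scalar(1.0, drift, c, 0.5, -1.0, 1.0))    # (1.0, 0.5)
```

In the second call the actuator limit prevents the constraint from being met, so the slack is nonzero; this is the graceful-degradation case in which the state may leave $C$ by a bounded amount.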
Let us define the set $C_\epsilon : \{ s \in \mathbb{R}^n : h(s) \ge -\frac{\epsilon}{\eta} \}$. Then we can prove the following lemma.

Lemma 1. For dynamical system (1), if there exists a solution to (9) for all $s \in C$ with $\epsilon = 0$, then the controller derived from (9) renders set $C$ forward invariant with probability $(1 - \delta)$.
However, suppose there exists $s \in C$ such that (9) has solution with $\epsilon = \epsilon_{max} > 0$. If for all $s \in C_\epsilon$, the solution to (9) satisfies $\epsilon \le \epsilon_{max}$, then the larger set $C_\epsilon$ is forward invariant with probability $(1 - \delta)$.

Proof. The first part of the lemma follows directly from Definition 1 and the probabilistic bounds on the uncertainty obtained from GPs shown in equation (5).

For the second part, the property of GPs in equation (5) implies that with probability $(1 - \delta)$, the following inequality is satisfied under the system dynamics (1):

\[ h(s_{t+1}) \ge p^T \big( f(s_t) + g(s_t)a_t + \mu_d(s_t) \big) - k_\delta |p|^T \sigma_d(s_t) + q. \tag{10} \]

Therefore, the constraint in problem (9) ensures that:

\[
\begin{aligned}
h(s_{t+1}) &\ge (1 - \eta)\,h(s_t) - \epsilon, \\
p^T s_{t+1} + q &\ge (1 - \eta)(p^T s_t + q) - \epsilon, \\
p^T s_{t+1} + q + \frac{\epsilon}{\eta} &\ge (1 - \eta)\Big(p^T s_t + q + \frac{\epsilon}{\eta}\Big).
\end{aligned}
\tag{11}
\]

Define $h_\epsilon(s) = q + \frac{\epsilon}{\eta} + p^T s$, so that (11) simplifies to

\[ h_\epsilon(s_{t+1}) \ge (1 - \eta)\,h_\epsilon(s_t). \tag{12} \]

By Definition 1, the set $C_\epsilon$ defined by $h_\epsilon(s) = h(s) + \frac{\epsilon}{\eta} \ge 0$ is forward invariant under system dynamics (1).

The CBF controllers that solve (9) provide deterministic control laws, $u^{CBF}(s)$, that naturally encode safety; they provide the minimal control intervention that maintains safety or provide graceful degradation (a small deviation from the safe set) when safety cannot be enforced (e.g. due to actuation constraints). Furthermore, even with dynamics uncertainty, we can make high-probability statements about system safety using GP models with CBFs.

Note that one can easily combine multiple CBF constraints in problem (9) to define polytopic safe regions.

CBF-Based Compensating Control with Reinforcement Learning

To illustrate our framework, we first propose the suboptimal controller in equation (13), which combines a model-free RL-based controller (parameterized by $\theta_k$) and a CBF-based controller in the architecture shown in Figure 1a.

\[ u_k(s) = u^{RL}_{\theta_k}(s) + u^{CBF}_k(s, u^{RL}_{\theta_k}). \tag{13} \]

The concept is akin to shielded RL (Alshiekh et al. 2017; Fisac et al. 2018), since the CBF controller compensates for the RL controller to ensure safety, but does not guide exploration of the RL algorithm. The next section will extend the CBF controller to improve RL policy exploration.

Note that since the RL policy $\pi^{RL}_{\theta_k}(a|s)$ is stochastic (see Preliminaries section on RL), the controller $u^{RL}_{\theta_k}(s)$ represents the realization (i.e. sampled control action) of the stochastic policy $\pi^{RL}_{\theta_k}(a|s)$ after policy iteration $k$.

Figure 1: Control architecture combining model-free RL controller with model-based CBF to guarantee safety. (a) Initial architecture that uses CBF to compensate for unsafe control actions, but does not guide learning and exploration. (b) Architecture that uses CBF to guide exploration and learning, as well as ensure safety.

The model-free RL controller, $u^{RL}_{\theta_k}(s)$, proposes a control action that attempts to optimize long-term reward, but may be unsafe. Before deploying the RL controller, a CBF controller $u^{CBF}_k(s, u^{RL}_{\theta_k})$ filters the proposed control action and provides the minimum control intervention needed to ensure that the overall controller, $u_k(s)$, keeps the system state within the safe set. Essentially, the CBF controller, $u^{CBF}_k(s, u^{RL}_{\theta_k})$, "projects" the RL controller $u^{RL}_{\theta_k}(s)$ into the set of safe policies. In the case of an autonomous car, this action may enforce a safe distance between nearby cars, regardless of the action proposed by the RL controller.

The CBF controller $u^{CBF}_k(s, u^{RL}_{\theta_k})$, which depends on the RL control, is defined by the following QP that can be efficiently solved at each time step:

\[
\begin{aligned}
(a_t, \epsilon) = \operatorname*{argmin}_{a_t,\, \epsilon} \quad & \|a_t\|_2 + K_\epsilon\, \epsilon \\
\text{s.t.} \quad & p^T f(s_t) + p^T g(s_t)\big( u^{RL}_{\theta_k}(s_t) + a_t \big) + p^T \mu_d(s_t) - k_\delta |p|^T \sigma_d(s_t) + q \ge (1 - \eta)\,h(s_t) - \epsilon \\
& a^i_{low} \le a^i_t + u^{RL(i)}_{\theta_k}(s_t) \le a^i_{high} \quad \text{for } i = 1, ..., M
\end{aligned}
\tag{14}
\]

The last constraint in (14) incorporates possible actuator limits of the system.

We must make clear the important distinction between the indexes $t$ and $k$. Note that $t$ indexes timesteps within each policy iteration or trial, whereas $k$ indexes the policy iterations (which contain trajectories with several timesteps). The CBF controller updates throughout the task (computed
at each time step, $t$), whereas the RL policy and GP model update at episodic policy iteration intervals indexed by $k$.

Let $\epsilon_{max} = \max_{s \in C} \epsilon$ from (14) represent the largest violation of the barrier condition (i.e. potential deviation from the safe set) for any $s \in C$. Lemma 1 extends to the modified optimization problem (14), implying that $u_k = u^{RL}_{\theta_k} + u^{CBF}_k$ satisfies the barrier certificate inequality (up to $\epsilon_{max}$) that guarantees forward invariance of $C$. Therefore, if there exists a solution to problem (14) such that $\epsilon_{max} = 0$, then controller (13) renders the safe set $C$ forward invariant with probability $(1 - \delta)$. However, if $\epsilon_{max} > 0$, but $\epsilon \le \epsilon_{max}$ for all $s \in C_\epsilon$, then the controller will render the set $C_\epsilon$ forward invariant with probability $(1 - \delta)$.

Intuitively, the RL controller provides a "feedforward control", and the CBF controller compensates with the minimum control necessary to render the safe set forward invariant. If such a control does not exist (e.g. due to torque constraints), then the CBF controller provides the control that keeps the state as close as possible to the safe set.

However, a significant issue is that controller (13) ensures safety, but does not actively guide policy exploration of the overall controller. This is because the RL policy being updated around, $u^{RL}_{\theta_k}(s)$, is not the policy deployed on the agent, $u_k(s)$. For example, suppose that in an autonomous driving task, the RL controller inadvertently proposes to collide with an obstacle. The CBF controller compensates to drive the car around the obstacle. The next learning iteration should update the policy around the safe deployed policy $u_k(s)$, rather than the unsafe policy $u^{RL}_{\theta_k}(s)$ (which would have led to an obstacle collision). However, the algorithm described in this section updates around the original policy, $u^{RL}_{\theta_k}(s)$, as illustrated in Figure 2a.

CBF-Based Guiding Control with Reinforcement Learning

In order to achieve safe and efficient learning, we should learn from the deployed controller $u_k$, since it operates in the safe region $C$, rather than learning around $u^{RL}_{\theta_k}$, which may operate in an unsafe, irrelevant area of state space. The RL-CBF algorithm described below incorporates this goal.

Recall that $u_k$, $u^{RL}_{\theta_k}$ represent the realized controllers sampled from stochastic policies $\pi_k$, $\pi^{RL}_{\theta_k}$. Consider an initial RL-based controller $u^{RL}_{\theta_0}(s)$ (for iteration $k = 0$). The CBF controller $u^{CBF}_0(s)$ is determined from (14) to obtain $u_0(s) = u^{RL}_{\theta_0}(s) + u^{CBF}_0(s)$. For every following policy iteration, let us define the overall controller to incorporate all previous CBF controllers, as in equation (15).

\[
u_k(s) = u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s, u^{RL}_{\theta_0}, ..., u^{RL}_{\theta_{j-1}}) + u^{CBF}_k\Big(s,\; u^{RL}_{\theta_k} + \sum_{j=0}^{k-1} u^{CBF}_j\Big).
\tag{15}
\]

The dependence of controller (15) on all prior CBF controllers (see Figure 1b) is critical to enhancing learning efficiency. Defining the controller in this fashion leads to policy updates around the previously deployed controller, which adds to the efficiency of the learning process by encouraging the policy to operate in desired areas of the state space. This idea is illustrated in Figure 2b.

The intuition is that at iteration $k = 0$, the RL policy proposed actions $u^{RL}_{\theta_0}(s)$, but it took safe actions $u^{RL}_{\theta_0}(s) + u^{CBF}_0(s)$. To update the policy based on the safe actions, the effective RL controller at the next iteration ($k = 1$) should be $u^{RL}_{\theta_1}(s) + u^{CBF}_0(s)$, which is then filtered by the CBF controller $u^{CBF}_1(s)$ (i.e. $u^{CBF}_0(s)$ is now part of the RL controller). Across multiple policy iterations, we can consider $u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s, u^{RL}_{\theta_0}, ..., u^{RL}_{\theta_{j-1}})$ to be the guided RL controller (proposing potentially unsafe actions), which is rendered safe by $u^{CBF}_k(s, u^{RL}_{\theta_k} + \sum_{j=0}^{k-1} u^{CBF}_j)$.

Figure 2: Illustration of policy iteration process, where we try to learn the optimal safe policy, $\pi_{opt}$. (a) Policy optimization with barrier-compensating controller. Next policy is updated around the previous RL controller, $\pi^{RL}_{\theta_k}$; (b) Policy optimization with barrier-guided controller. Next policy is updated around previous deployed controller, $\pi_k$.

To ensure safety after incorporating all prior CBF controllers, they must be included into the governing QP:

\[
\begin{aligned}
(a_t, \epsilon) = \operatorname*{argmin}_{a_t,\, \epsilon} \quad & \|a_t\|_2 + K_\epsilon\, \epsilon \\
\text{s.t.} \quad & p^T f(s_t) + p^T g(s_t)\Big( u^{RL}_{\theta_k}(s_t) + \sum_{j=0}^{k-1} u^{CBF}_j(s_t) + a_t \Big) + p^T \mu_d(s_t) - k_\delta |p|^T \sigma_d(s_t) + q \ge (1 - \eta)\,h(s_t) - \epsilon \\
& a^i_{low} \le a^i_t + u^{RL(i)}_{\theta_k}(s_t) + \sum_{j=0}^{k-1} u^{CBF(i)}_j(s_t) \le a^i_{high} \quad \text{for } i = 1, ..., M.
\end{aligned}
\tag{16}
\]
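As a rough illustration of how (15) and (16) fit together, the sketch below uses a hypothetical one-dimensional system $s_{t+1} = s_t + a_t$ (so $f(s) = s$, $g = 1$, $d = 0$) with safe set $s \le 1$, i.e. $h(s) = 1 - s$ with $p = -1$, $q = 1$. With no actuator limits and no model uncertainty, substituting into the constraint of (16) gives $a_t \le \eta\,h(s_t) - u_{prop}$, so the minimum-effort correction is $a_t = \min(0, \eta\,h(s_t) - u_{prop})$, where $u_{prop}$ stands for the guided RL action (the RL term plus the prior CBF terms). The system, the constant RL proposal, and all function names are assumptions chosen for illustration only.

```python
def cbf_correction(s, u_prop, eta):
    """Closed-form solution of (16) for the toy system with h(s) = 1 - s.

    Assumes no actuator limits and no uncertainty, so the slack is zero.
    """
    h = 1.0 - s
    return min(0.0, eta * h - u_prop)


def rollout(u_rl, u_prior_cbf, eta=0.5, s0=0.0, steps=8):
    """Deploy u_k(s) = u_rl(s) + u_prior_cbf(s) + u_cbf(s) as in (15)."""
    s = s0
    states, corrections = [s], []
    for _ in range(steps):
        u_prop = u_rl(s) + u_prior_cbf(s)       # guided RL action (may be unsafe)
        u_cbf = cbf_correction(s, u_prop, eta)  # minimal safe correction
        s = s + u_prop + u_cbf                  # toy dynamics: s_{t+1} = s_t + a_t
        states.append(s)
        corrections.append(u_cbf)
    return states, corrections


# A stand-in "RL policy" that always pushes toward the unsafe side,
# with no prior CBF controllers (iteration k = 0).
states, corrections = rollout(lambda s: 0.5, lambda s: 0.0)
print(states[:4])       # [0.0, 0.5, 0.75, 0.875]
print(corrections[:2])  # [0.0, -0.25]
```

The state approaches the boundary $s = 1$ but never crosses it, and $h$ shrinks by at most the factor $(1 - \eta)$ per step, which is the decay rate permitted by condition (8).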
The solution to (16) defines the CBF controller $u^{CBF}_k(s)$, which ensures safety by satisfying the barrier condition (8). Let $\epsilon_{max} = \max_{s \in C} \epsilon$ from (16) represent the largest violation of the barrier condition for any $s \in C$.

Theorem 2. Using the control law $u_k(s)$ from (15), if there exists a solution to problem (16) such that $\epsilon_{max} = 0$, then the safe set $C$ is forward invariant with probability $(1 - \delta)$. If $\epsilon_{max} > 0$, but the solution to problem (16) satisfies $\epsilon \le \epsilon_{max}$ for all $s \in C_\epsilon$, then the controller will render the larger set $C_\epsilon$ forward invariant with probability $(1 - \delta)$.
Furthermore, if we use TRPO for the RL algorithm, then the control law $u^{prop}_k(s) = u_k(s) - u^{CBF}_k(s)$ from (15) achieves the performance guarantee $J(\pi^{prop}_k) \ge J(\pi_{k-1}) - \frac{2\lambda\gamma}{(1-\gamma)^2}\,\delta_\pi$, where $\lambda = \max_s |\mathbb{E}_{a \sim \pi^{prop}_k}[A_{\pi_{k-1}}(s, a)]|$ and $\delta_\pi$ is chosen as in equation (4).

Proof. The first part of the theorem follows directly from Definition 1 and Lemma 1. The only difference from Lemma 1 is that the control includes the RL controller and all previous CBF controllers ($u^{CBF}_0, ..., u^{CBF}_{k-1}$).
The proof of the performance bound is given in the Appendix of this paper found at [Link] [Link]/files/[Link].

RL-CBF provides high-probability safety guarantees during the learning process and can maintain the performance guarantees of TRPO. If we have no uncertainty in the dynamics, then safety is guaranteed with probability 1. Note that the performance guarantee in Theorem 2 is for control law $u_k(s) - u^{CBF}_k(s)$, which is not the deployed controller, $u_k(s)$. However, this does not pose a significant issue, since $u^{CBF}_k(s)$ rapidly decays to 0 as we iterate. This is because the guided RL controller quickly learns to operate in the safe region, so the CBF controller $u^{CBF}_k(s)$ becomes inactive.

Computationally Efficient Algorithm

This section describes an efficient algorithm to implement the framework described above, since a naive approach would be too computationally expensive in many cases. To see this, recall the controller (15) we would ideally implement:

\[
u_k(s) = u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s, u^{RL}_{\theta_0}, ..., u^{RL}_{\theta_{j-1}}) + u^{CBF}_k\Big(s,\; u^{RL}_{\theta_k} + \sum_{j=0}^{k-1} u^{CBF}_j\Big).
\]

The first term may be represented by a neural network that is parameterized by $\theta_k$, which has a standard implementation. The third term is just a quadratic program with dependencies on the other terms; it does not pose a computational burden. However, the summation in the second term poses a challenge, since every term in $\sum_{j=0}^{k-1} u^{CBF}_j(s, u^{RL}_{\theta_0}, ..., u^{RL}_{\theta_{j-1}})$ depends on a different previous RL controller $u^{RL}_{\theta_j}$. Therefore, we would need to store $k - 1$ neural networks corresponding to each previous RL controller. In addition, we would have to solve $k - 1$ separate QPs in sequence to evaluate each CBF controller. Such a brute-force implementation would be impractical.

To overcome this issue, we approximate $u^{bar}_{\phi_k}(s) \approx \sum_{j=0}^{k-1} u^{CBF}_j(s, u^{RL}_{\theta_0}, ..., u^{RL}_{\theta_{j-1}})$, where $u^{bar}_{\phi_k}$ is a feedforward neural network (MLP) parameterized by $\phi$. We chose an MLP since they have been shown to be powerful function approximators. Thus, at each policy iteration, we fit the MLP $u^{bar}_{\phi_k}(s)$ to data of $\sum_{j=0}^{k-1} u^{CBF}_j(s, u^{RL}_{\theta_0}, ..., u^{RL}_{\theta_{j-1}})$ collected from trajectories of the previous policy iteration. Then we obtain the controller:

\[ u_k(s) = u^{RL}_{\theta_k}(s) + u^{bar}_{\phi_k}(s) + u^{CBF}_k(s, u^{RL}_{\theta_k} + u^{bar}_{\phi_k}). \]

Note that even with this approximation, safety with probability $(1 - \delta)$ is still guaranteed. This is because the above approximation only affects the guided RL term $u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s, u^{RL}_{\theta_0}, ..., u^{RL}_{\theta_{j-1}})$. The CBF controller $u^{CBF}_k(s, u^{RL}_{\theta_k} + u^{bar}_{\phi})$ still solves (16), which provides the safety guarantees in Theorem 2 by satisfying the CBF condition (8). Furthermore, we now have to store only two NNs and solve one QP for the controller. The tradeoff is that the performance guarantee in Theorem 2 does not necessarily hold with this approximation. The algorithm is outlined in Algorithm 1.

Experiments

We implement two versions of the RL-CBF algorithm with existing model-free RL algorithms: TRPO-CBF, derived from TRPO (Schulman et al. 2015), and DDPG-CBF, derived from DDPG (Lillicrap et al. 2015). The code for these examples can be found at: [Link] com/rcheng805/RL-CBF.

Inverted Pendulum

We first apply RL-CBF to the control of a simulated inverted pendulum from the OpenAI gym environment (pendulum-v0), which has mass $m$ and length $l$, and is actuated by torque $u$. We set the safe region to be $\theta \in [-1, 1]$ radians, and define the reward function $r = \theta^2 + 0.1\dot{\theta}^2 + 0.001u^2$ to learn a controller that keeps the pendulum upright. The true system dynamics are defined as follows,

\[
\begin{aligned}
\theta_{t+1} &= \theta_t + \dot{\theta}_t\,\delta t + \frac{3g}{2l}\sin(\theta_t)\,\delta t^2 + \frac{3}{ml^2}\,u\,\delta t^2, \\
\dot{\theta}_{t+1} &= \dot{\theta}_t + \frac{3g}{2l}\sin(\theta_t)\,\delta t + \frac{3}{ml^2}\,u\,\delta t,
\end{aligned}
\tag{17}
\]

with torque limits $u \in [-15, 15]$, and $m = 1$, $l = 1$. To introduce model uncertainty, our nominal model assumes $m = 1.4$, $l = 1.4$ (40% error in model parameters).

Figure 3 compares the accumulated reward achieved during each episode using TRPO, DDPG, TRPO-CBF, and DDPG-CBF. The two RL-CBF algorithms converge near the optimal solution very rapidly, and significantly outperform the corresponding baseline algorithms without the CBFs. We note that TRPO and DDPG sometimes converge on a high-performance controller (comparable to TRPO-CBF
Figure 3: (Top) Maximum angle (rad) of the pendulum throughout each episode. Values above the dashed black line represent exits from the safe set at some point during the episode. (Bottom) Comparison of accumulated reward from inverted pendulum problem using TRPO, DDPG, TRPO-CBF, and DDPG-CBF.

Algorithm 1 RL-CBF algorithm
1: Initialize RL Policy $\pi_0^{\mathrm{RL}}$, state $s_0 \sim \rho_0$, measurement array $\hat{D}$, action array $\hat{A}$
2: for $t = 1, \ldots, T$ do
3:     Sample (but do not deploy) control $u^{\mathrm{RL}}_{\theta_0}(s_t)$
4:     Solve for $u_0^{\mathrm{CBF}}(s_t)$ from optimization problem (16)
5:     Deploy controller $u_0(s_t) = u^{\mathrm{RL}}_{\theta_0}(s_t) + u_0^{\mathrm{CBF}}(s_t)$
6:     Store state-action pair $(s_t, u_0^{\mathrm{CBF}})$ in $\hat{A}$
7:     Observe $(s_t, u_0, s_{t+1}, r_t)$ and store in $\hat{D}$
8: Collect Episode Reward, $\sum_{t=1}^{T} r_t$
9: Update GP model using (6) and measurements $\hat{D}$
10: Set $k = 1$ (representing $k$-th policy iteration)
11: while $k <$ Episodes do
12:     Do policy iteration using RL algorithm based on previously observed episode/rewards to obtain $\pi^{\mathrm{RL}}_{\theta_k}$
13:     Train $u^{\mathrm{bar}}_{\phi_k}$ to approximate prior CBF controllers ($u^{\mathrm{bar}}_{\phi_k} = u_0^{\mathrm{CBF}} + \ldots + u_{k-1}^{\mathrm{CBF}}$) using $\hat{A}$
14:     Initialize state $s_0 \sim \rho_0$
15:     for $t = 1, \ldots, T$ do
16:         Sample control $u^{\mathrm{RL}}_{\theta_k}(s_t) + u^{\mathrm{bar}}_{\phi_k}(s_t)$
17:         Solve for $u_k^{\mathrm{CBF}}(s_t)$ from problem (16)
18:         Deploy controller $u_k(s_t) = u^{\mathrm{RL}}_{\theta_k}(s_t) + u^{\mathrm{bar}}_{\phi_k}(s_t) + u_k^{\mathrm{CBF}}(s_t)$
19:         Store state-action pair $(s_t, u^{\mathrm{bar}}_{\phi_k} + u_k^{\mathrm{CBF}})$ in $\hat{A}$
20:         Observe $(s_t, u_k, s_{t+1}, r_t)$ and store in $\hat{D}$
21:     Collect Episode Reward, $\sum_{t=1}^{T} r_t$
22:     Update GP model using (6) and measurements $\hat{D}$
23:     $k = k + 1$
24: return $\pi^{\mathrm{RL}}_{\theta_k}$, $u^{\mathrm{bar}}_{\phi_k}$, $u_k^{\mathrm{CBF}}$
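As an illustration only, steps 16 to 19 of Algorithm 1 can be sketched on the pendulum dynamics (17). This is a minimal sketch, not the authors' implementation: it replaces the GP-based quadratic program (16) with an assumed discrete-time barrier condition $h(s_{t+1}) \ge (1-\eta)\,h(s_t)$ for the two hypothetical barriers $h = 1 - \theta$ and $h = 1 + \theta$, evaluated on the nominal model. Because $\theta_{t+1}$ is affine in a scalar $u$, that simplified program reduces to clipping onto an interval. The values of `G`, `DT`, and `eta`, and all function names, are assumptions introduced here.

```python
import math

G, DT = 10.0, 0.05             # assumed gravity and time step (not stated in the text)
U_MAX, THETA_MAX = 15.0, 1.0   # torque limit and safe-set bound from the text
M_TRUE, L_TRUE = 1.0, 1.0      # true pendulum parameters
M_NOM, L_NOM = 1.4, 1.4        # nominal (mis-specified) model parameters


def step(theta, theta_dot, u, m=M_TRUE, l=L_TRUE):
    """One step of the discrete pendulum dynamics (17)."""
    acc = 3 * G / (2 * l) * math.sin(theta) + 3 / (m * l ** 2) * u
    return theta + theta_dot * DT + acc * DT ** 2, theta_dot + acc * DT


def cbf_correction(theta, theta_dot, u_guided, eta=0.5):
    """Smallest correction such that h(s') >= (1 - eta) * h(s) holds under the
    nominal model for h = THETA_MAX - theta and h = THETA_MAX + theta."""
    b = 3 / (M_NOM * L_NOM ** 2) * DT ** 2          # coefficient of u in theta'
    drift = (theta + theta_dot * DT
             + 3 * G / (2 * L_NOM) * math.sin(theta) * DT ** 2)
    u_hi = (THETA_MAX - drift - (1 - eta) * (THETA_MAX - theta)) / b
    u_lo = ((1 - eta) * (THETA_MAX + theta) - THETA_MAX - drift) / b
    u_safe = min(max(u_guided, u_lo), u_hi)         # 1-D QP solved by clipping
    u_safe = min(max(u_safe, -U_MAX), U_MAX)        # torque limits may relax the barrier
    return u_safe - u_guided


def run_episode(u_rl, u_bar, theta=0.0, theta_dot=0.0, horizon=100):
    """Inner loop of Algorithm 1 for one episode, returning the action array and angles."""
    action_array, thetas = [], []
    for _ in range(horizon):
        a_rl = u_rl(theta, theta_dot)
        a_bar = u_bar(theta, theta_dot)
        a_cbf = cbf_correction(theta, theta_dot, a_rl + a_bar)   # step 17
        u = a_rl + a_bar + a_cbf                                 # step 18
        action_array.append(((theta, theta_dot), a_bar + a_cbf)) # step 19
        theta, theta_dot = step(theta, theta_dot, u)
        thetas.append(theta)
    return action_array, thetas
```

Here `u_rl` and `u_bar` are any callables of the state; passing a zero function for `u_bar` corresponds to the first policy iteration, where no prior CBF controllers exist yet.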

[Figure panel B: overall controller composed from all 3 subcomponents.]

and DDPG-CBF), though this occurs less reliably and more slowly, resulting in the poorer learning curves. More importantly, the RL-CBF controllers maintain safety (i.e. never leave the safe region) throughout the learning process, as also seen in Figure 3. In contrast, TRPO and DDPG severely violate safety while learning the optimal policy.

Figure 4 shows the pendulum angle during a representative trial under the first policy versus the last learned policy deployed for TRPO-CBF and DDPG-CBF. For the first policy iteration, the pendulum angle is maintained near the edge of the safe region; the RL algorithm has proposed a poor controller so the CBF controller takes the minimal action necessary to keep the system safe. By the last iteration though, the CBF controller is completely inactive ($u^{\mathrm{CBF}} = 0$), since the guided RL controller ($u^{\mathrm{RL}}_{\theta_k}(s) + u^{\mathrm{bar}}_{\phi_k}(s)$) is already safe.

Figure 4: Representative pendulum trajectory (angle vs. time) using first policy vs last policy. The left plot and right plot show results from TRPO-CBF and DDPG-CBF, respectively. The trajectory for the first policy (blue) goes to edge of the safe region and stays there, while the trajectory for the last policy (red) quickly converges to the upright position.

Simulated Car Following

Consider a chain of five cars following each other on a straight road. We control the acceleration/deceleration of the 4th car in the chain, and would like to train a policy to maximize fuel efficiency during traffic congestion while avoiding collisions. Each car utilizes the dynamics shown in equation (18), and we attempt to optimize the reward function (19). The car dynamics and reward function are inspired by previous work (He, Ge, and Orosz 2018).

$$\begin{bmatrix} \dot{s}^{(i)} \\ \dot{v}^{(i)} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & -k_d \end{bmatrix} \begin{bmatrix} s^{(i)} \\ v^{(i)} \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} a, \qquad k_d = 0.1. \tag{18}$$

$$\begin{aligned}
r &= -\sum_{t=1}^{T} \Big[ v_t^{(4)} \max\big(a_t^{(4)}, 0\big) + \sum_{i=3}^{4} G_i\Big(\frac{500}{s_t^{(i)} - s_t^{(i+1)}}\Big) \Big], \\
G_m(x) &= \begin{cases} |x| & \text{if } s^{(m)} - s^{(m+1)} \le 3 \\ 0 & \text{otherwise} \end{cases}
\end{aligned} \tag{19}$$
The first term in the reward optimizes fuel efficiency, while the other term encourages the car to maintain a 3 meter distance from the other cars (soft constraint). For the RL-CBF controllers, the CBF enforces a 2 meter safe distance between cars (hard constraint). The behavior of cars 1, 2, 3, and 5 is described in the Appendix.
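To make the two terms of the reward concrete, the following is a minimal sketch of the car dynamics (18) and the per-step cost inside (19). The forward-Euler discretization, the step size `dt`, the handling of a zero gap, and all function names are assumptions introduced here for illustration; they are not taken from the paper.

```python
K_D = 0.1  # drag coefficient from equation (18)


def car_step(s, v, a, dt=0.1, k_d=K_D):
    """Forward-Euler step of (18): s_dot = v, v_dot = -k_d * v + a.
    The discretization and dt are assumptions."""
    return s + v * dt, v + (-k_d * v + a) * dt


def step_cost(v4, a4, s3, s4, s5):
    """Per-step term inside the sum of (19) for the controlled 4th car."""
    fuel = v4 * max(a4, 0.0)              # only positive acceleration costs fuel
    headway = 0.0
    for gap in (s3 - s4, s4 - s5):        # i = 3 and i = 4
        if gap <= 3.0:                    # G_i is active only within 3 meters
            # A zero gap is treated as an infinite penalty (assumption).
            headway += abs(500.0 / gap) if gap != 0 else float("inf")
    return fuel + headway


def episode_reward(steps):
    """Reward (19): negative sum of per-step costs over an episode.
    `steps` is an iterable of (v4, a4, s3, s4, s5) tuples."""
    return -sum(step_cost(*st) for st in steps)
```

For example, with both gaps at 5 meters only the fuel term contributes, while a 2 meter gap to the car ahead adds a headway penalty of 500 / 2 = 250 for that step.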
The 4th car has access to every other car's position, velocity, and acceleration, but it only has a crude model of its own dynamics ($k_d = 0$) and an inaccurate model of the drivers behind and in front of it. In addition, we add Gaussian noise to the acceleration of each car. The idea is that the 4th car can use its crude model to guarantee safety with high probability, and improve fuel efficiency by slowly building and
leveraging an implicit model of the other drivers' behaviors.

From Figure 5, we see that there were no safety violations between the cars during our simulated experiments when using either of the RL-CBF controllers. When using TRPO and DDPG alone without CBF safety, almost all trials had collisions, even in the later stages of learning. Furthermore, as seen in Figure 5, TRPO-CBF learns faster and outperforms TRPO (DDPG-CBF also outperforms DDPG though neither algorithm converged on a high-performance controller in our experiments). It is important to note that in some experiments, TRPO finds a comparable controller to TRPO-CBF, but this is often not the case due to randomness in seeds.

Although DDPG and DDPG-CBF failed to converge on a good policy, Figure 5 shows that DDPG-CBF (and TRPO-CBF) always maintained a safe controller. This is a crucial benefit of the RL-CBF approach, as it guarantees safety independent of the system's learning performance.

Figure 5: (Top) Minimum headway between cars during each learning episode using DDPG, TRPO, DDPG-CBF, and TRPO-CBF. Values below the dashed black line represent exits from the safe set, and values below 0 represent collisions. The curve for DDPG has high negative values throughout learning, and is not seen. (Bottom) Comparison of reward over multiple episodes from car-following problem using TRPO, TRPO-CBF, and DDPG-CBF (DDPG is excluded because it exhibits very poor performance).

Conclusion

Adding even crude model information and CBFs into the model-free RL framework allows us to improve the exploration of model-free learning algorithms while ensuring end-to-end safety. Therefore, we proposed the safe RL-CBF framework, and developed an efficient controller synthesis algorithm that guarantees safety and improves exploration. These features will be crucial in deploying reinforcement learning on physical systems, where problems require online computation and efficient learning with safety guarantees.

This framework, which combines model-free RL-based control, model-based CBF control, and model learning, has the additional advantages of being able to (1) easily integrate new RL algorithms (in place of TRPO/DDPG) as they are developed, and (2) incorporate better model information from measurements to improve the CBF controller online.

A significant assumption in this work is that we are given a valid safe set, $h(s)$, which can be rendered forward invariant. However, computing these valid safe sets is non-trivial and computationally intensive (Wang, Theodorou, and Egerstedt 2017; Wabersich and Zeilinger 2018; Fisac et al. 2018). If we are not given a valid safe set, we may reach states where it is not possible to remain safe (i.e. $\epsilon_{\max} \ge 0$). Although our controller achieves graceful degradation in these cases, in future work it will be important to learn the safe set in addition to the controller.

Acknowledgment

The authors would like to thank Hoang Le and Yisong Yue for helpful discussions.

References

[Abbeel and Ng 2004] Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Twenty-first international conference on Machine learning - ICML '04.
[Abbeel, Coates, and Ng 2010] Abbeel, P.; Coates, A.; and Ng, A. Y. 2010. Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research.
[Achiam et al. 2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. arXiv preprint arXiv:1705.10528.
[Agrawal and Sreenath 2017] Agrawal, A., and Sreenath, K. 2017. Discrete Control Barrier Functions for Safety-Critical
Control of Discrete Systems with Application to Bipedal Robot Navigation. Robotics science and systems (RSS).
[Alshiekh et al. 2017] Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; and Topcu, U. 2017. Safe Reinforcement Learning via Shielding. arXiv preprint arXiv:1708.08611.
[Ames et al. 2017] Ames, A. D.; Xu, X.; Grizzle, J. W.; and Tabuada, P. 2017. Control Barrier Function Based Quadratic Programs for Safety Critical Systems. IEEE Transactions on Automatic Control.
[Berkenkamp et al. 2017] Berkenkamp, F.; Turchetta, M.; Schoellig, A. P.; and Krause, A. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Neural Information Processing Systems.
[Bertsekas 2005] Bertsekas, D. 2005. Dynamic Programming and Optimal Control.
[Chow et al. 2018] Chow, Y.; Nachum, O.; Duenez-Guzman, E.; and Ghavamzadeh, M. 2018. A Lyapunov-based Approach to Safe Reinforcement Learning. arXiv preprint arXiv:1805.07708.
[Duan et al. 2016] Duan, Y.; Chen, X.; Schulman, J.; and Abbeel, P. 2016. Benchmarking Deep Reinforcement Learning for Continuous Control. arXiv.
[Fisac et al. 2018] Fisac, J. F.; Akametalu, A. K.; Zeilinger, M. N.; Kaynama, S.; Gillula, J.; and Tomlin, C. J. 2018. A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems. arXiv preprint arXiv:1705.01292.
[Fu, Glover, and April 2005] Fu, M.; Glover, F.; and April, J. 2005. Simulation optimization: a review, new developments, and applications. Proceedings of the Winter Simulation Conference, 2005.
[García and Fernández 2015] García, J., and Fernández, F. 2015. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research.
[Gaskett 2003] Gaskett, C. 2003. Reinforcement Learning in Circumstances Beyond its Control. In CIMCA.
[Gillula and Tomlin 2012] Gillula, J. H., and Tomlin, C. J. 2012. Guaranteed safe online learning via reachability: Tracking a ground target using a quadrotor. In Proceedings - IEEE International Conference on Robotics and Automation.
[He, Ge, and Orosz 2018] He, C. R.; Ge, J. I.; and Orosz, G. 2018. Data-based fuel-economy optimization of connected automated trucks in traffic. Annual American Control Conference (ACC).
[Koller et al. 2018] Koller, T.; Berkenkamp, F.; Turchetta, M.; and Krause, A. 2018. Learning-based Model Predictive Control for Safe Exploration and Reinforcement Learning. arXiv preprint arXiv:1803.08287.
[Li, Kalabic, and Chu 2018] Li, Z.; Kalabic, U.; and Chu, T. 2018. Safe Reinforcement Learning: Learning with Supervision Using a Constraint-Admissible Set. In Annual American Control Conference.
[Lillicrap et al. 2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[Mannucci et al. 2018] Mannucci, T.; Van Kampen, E. J.; De Visser, C.; and Chu, Q. 2018. Safe Exploration Algorithms for Reinforcement Learning Controllers. IEEE Transactions on Neural Networks and Learning Systems.
[Moldovan and Abbeel 2012] Moldovan, T. M., and Abbeel, P. 2012. Safe Exploration in Markov Decision Processes. arXiv preprint arXiv:1205.4810.
[Nguyen-Tuong, Seeger, and Peters 2009] Nguyen-Tuong, D.; Seeger, M.; and Peters, J. 2009. Local Gaussian Process Regression for Real Time Online Model Learning and Control. In Advances in neural information processing systems.
[Ohnishi et al. 2018] Ohnishi, M.; Wang, L.; Notomista, G.; and Egerstedt, M. 2018. Safety-aware Adaptive Reinforcement Learning with Applications to Brushbot Navigation. arXiv preprint arXiv:1801.09627.
[Perkins and Barto 2003] Perkins, T. J., and Barto, A. G. 2003. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research.
[Peters and Schaal 2008] Peters, J., and Schaal, S. 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks.
[Rasmussen and Williams 2006] Rasmussen, C. E., and Williams, C. K. 2006. Gaussian Processes for Machine Learning.
[Schulman et al. 2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015. Trust Region Policy Optimization. In International Conference on Machine Learning (ICML).
[Silver et al. 2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML-14).
[Snelson and Ghahramani 2007] Snelson, E., and Ghahramani, Z. 2007. Local and global sparse Gaussian process approximations. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
[Tang et al. 2010] Tang, J.; Singh, A.; Goehausen, N.; and Abbeel, P. 2010. Parameterized maneuver learning for autonomous helicopter flight. In Proceedings - IEEE International Conference on Robotics and Automation.
[Wabersich and Zeilinger 2018] Wabersich, K. P., and Zeilinger, M. N. 2018. Scalable synthesis of safety certificates from data with applications to learning-based control. arXiv preprint arXiv:1711.11417.
[Wachi et al. 2018] Wachi, A.; Sui, Y.; Yue, Y.; and Ono, M. 2018. Safe Exploration and Optimization of Constrained MDPs using Gaussian Processes. 32nd AAAI conference on Artificial Intelligence (AAAI).
[Wang, Theodorou, and Egerstedt 2017] Wang, L.; Theodorou, E. A.; and Egerstedt, M. 2017. Safe Learning of Quadrotor Dynamics Using Barrier Certificates. arXiv preprint arXiv:1710.05472.
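Appendix A: Proof of Theorem 2

Theorem 2. Using the control law $u_k(s)$ from (15), if there exists a solution to problem (16) such that $\epsilon_{\max} = 0$, then the safe set $C$ is forward invariant with probability $(1 - \delta)$. If $\epsilon_{\max} > 0$, but the solution to problem (16) satisfies $\epsilon \le \epsilon_{\max}$ for all $s \in C$, then the controller will render the set $C$ forward invariant with probability $(1 - \delta)$.

Furthermore, if we use TRPO for the RL algorithm, then the control law $u_k^{\mathrm{prop}}(s) = u_k(s) - u_k^{\mathrm{CBF}}(s)$ from (15) achieves the performance guarantee $J(\pi_k^{\mathrm{prop}}) \ge J(\pi_{k-1}) - \frac{2\lambda\gamma}{(1-\gamma)^2}\delta_\pi$, where $\lambda = \max_s |\mathbb{E}_{a \sim \pi_k^{\mathrm{prop}}}[A_{\pi_{k-1}}(s, a)]|$ and $\delta_\pi$ is chosen as in equation (4).

Proof. To prove the performance bound in the second part of the theorem, we use the property of the advantage function from equation (20) below:

$$J(\pi_k) = J(\pi_{k-1}) + \mathbb{E}_{\tau \sim \pi_k}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{k-1}}(s_t, a_t)\Big], \tag{20}$$

where $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. As derived in (Schulman et al. 2015), we can then obtain the following inequality:

$$J(\pi_k) \ge J(\pi_{k-1}) + \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s_t \sim \pi_{k-1} \\ a_t \sim \pi_k}}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{k-1}}(s_t, a_t) - \frac{2\gamma\lambda}{1-\gamma} D_{TV}(\pi_{k-1}, \pi_k)\Big], \tag{21}$$

where $D_{TV}(\pi_{k-1}, \pi_k)$ is the total variational distance between policies $\pi_{k-1}$ and $\pi_k$, and $\lambda = \max_s |\mathbb{E}_{a \sim \pi_k}[A_{\pi_{k-1}}(s, a)]|$. Note that our CBF controllers are all deterministic, so we can redefine $u^{\mathrm{barrier}}_{k-1} = \sum_{j=0}^{k-2} u_j^{\mathrm{CBF}} + u_{k-1}^{\mathrm{CBF}} = \sum_{j=0}^{k-1} u_j^{\mathrm{CBF}}$. Based on this definition and equation (15), we can rewrite/define the following controllers:

$$\begin{aligned}
u_{k-1}(s) &= u^{\mathrm{RL}}_{\theta_{k-1}}(s) + u^{\mathrm{barrier}}_{k-1}(s), \\
\pi_{k-1}(a \mid s) &= \pi^{\mathrm{RL}}_{\theta_{k-1}}\big(a - u^{\mathrm{barrier}}_{k-1}(s) \mid s\big),
\end{aligned} \tag{22}$$

$$\begin{aligned}
u_k^{\mathrm{prop}}(s) &= u^{\mathrm{RL}}_{\theta_k}(s) + u^{\mathrm{barrier}}_{k-1}(s), \\
\pi_k^{\mathrm{prop}}(a \mid s) &= \pi^{\mathrm{RL}}_{\theta_k}\big(a - u^{\mathrm{barrier}}_{k-1}(s) \mid s\big).
\end{aligned} \tag{23}$$

We can plug in the above relations for $\pi_{k-1}$ and $\pi_k^{\mathrm{prop}}$ into inequality (21), to obtain the following bound (we plug in $\pi_k^{\mathrm{prop}}$ for $\pi_k$):

$$J(\pi_k^{\mathrm{prop}}) \ge J(\pi_{k-1}) + \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s_t \sim \pi_{k-1} \\ a_t \sim \pi_k^{\mathrm{prop}}}}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{k-1}}(s_t, a_t) - \frac{2\gamma\lambda}{1-\gamma} D_{TV}\big(\pi^{\mathrm{RL}}_{\theta_{k-1}}(a - u^{\mathrm{barrier}}_{k-1}),\, \pi^{\mathrm{RL}}_{\theta_k}(a - u^{\mathrm{barrier}}_{k-1})\big)\Big], \tag{24}$$

where we drop the policies' dependency on the state $s$ for compactness. Due to the shift invariance of the total variational distance, $D_{TV}$, we can simplify this to:

$$J(\pi_k^{\mathrm{prop}}) \ge J(\pi_{k-1}) + \frac{1}{1-\gamma}\, \mathbb{E}_{\substack{s_t \sim \pi_{k-1} \\ a_t \sim \pi_k^{\mathrm{prop}}}}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{k-1}}(s_t, a_t) - \frac{2\gamma\lambda}{1-\gamma} D_{TV}\big(\pi^{\mathrm{RL}}_{\theta_{k-1}}, \pi^{\mathrm{RL}}_{\theta_k}\big)\Big]. \tag{25}$$

Because $\pi_{k-1}$ is a feasible point of the TRPO optimization problem (4) with objective value 0, we know that our solution $\pi_k^{\mathrm{prop}}$ satisfies the following:

$$\mathbb{E}_{\substack{s_t \sim \pi_{k-1} \\ a_t \sim \pi_k^{\mathrm{prop}}}}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{k-1}}(s_t, a_t)\Big] \ge 0.$$

Since the optimization problem (4) specifies the bound $D_{TV}(\pi^{\mathrm{RL}}_{\theta_{k-1}}, \pi^{\mathrm{RL}}_{\theta_k}) \le \delta_\pi$, then it follows that:

$$J(\pi_k^{\mathrm{prop}}) \ge J(\pi_{k-1}) - \frac{2\lambda\gamma}{(1-\gamma)^2}\,\delta_\pi, \tag{26}$$

where $\lambda = \max_s |\mathbb{E}_{a \sim \pi_k^{\mathrm{prop}}}[A_{\pi_{k-1}}(s, a)]|$. The realization of the policy $\pi_k^{\mathrm{prop}}(a \mid s)$ is:

$$u_k^{\mathrm{prop}}(s) = u^{\mathrm{RL}}_{\theta_k}(s) + u^{\mathrm{barrier}}_{k-1}(s) = u_k(s) - u_k^{\mathrm{CBF}}(s).$$

Therefore, if we utilize the policy $u_k(s) - u_k^{\mathrm{CBF}}(s)$, we can obtain the performance bound in equation (26).

Appendix B: Car-Following Problem

Driver Behavior and System Dynamics

In this section, we elaborate on the behavior of the cars in the car-following numerical experiment. The dynamics for the drivers follow equation (18), and their acceleration is described as follows:

$$\begin{aligned}
a^{(1)} &= v_{des} - 10\sin(0.2t) \\
a^{(i)} &= k_p\big(v_{des} - v^{(i)}\big) - k_b\, G_1\big(s^{(i-1)} - s^{(i)}\big) && \text{for } i = 2, 3 \\
a^{(5)} &= k_p\big(v_{des} - v^{(i)}\big) - \tfrac{1}{2} k_b\, G_2\big(s^{(3)} - s^{(5)}\big) && \text{for } i = 5 \\
G_1(x) &= \begin{cases} x & \text{if } x \le 6 \\ 0 & \text{otherwise} \end{cases}, \qquad
G_2(x) = \begin{cases} x & \text{if } x \le 12 \\ 0 & \text{otherwise} \end{cases} \\
k_p &= 4, \quad k_b = 20, \quad v_{des} = 30, \quad a \in [-100, 100]
\end{aligned} \tag{27}$$

where $a^{(i)}$ represents the acceleration for driver $i$. In addition, Gaussian noise is added to the acceleration of each driver. In driver four's nominal model of the other drivers' behavior, $k_p = 3.5$, $k_b = 18$, and $k_d = 0$.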
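The driver model (27) can be sketched in a few lines. This is only an illustrative reading of the equations: the function names are hypothetical, the Gaussian noise term is omitted, clipping every acceleration to $[-100, 100]$ is an assumption about how the stated range is enforced, and placing the factor 1/2 on the $k_b$ term for driver 5 is an assumed reading of (27).

```python
import math

K_P, K_B, V_DES = 4.0, 20.0, 30.0   # gains and desired speed from (27)
A_MIN, A_MAX = -100.0, 100.0        # acceleration range from (27)


def clip(a):
    """Clamp an acceleration to the allowed range (assumed enforcement)."""
    return max(A_MIN, min(A_MAX, a))


def gate(x, threshold):
    """G_1 (threshold 6) and G_2 (threshold 12): pass x through only below the threshold."""
    return x if x <= threshold else 0.0


def lead_accel(t):
    """Driver 1: a = v_des - 10 sin(0.2 t)."""
    return clip(V_DES - 10.0 * math.sin(0.2 * t))


def follower_accel(v, gap):
    """Drivers 2 and 3: gap is s^(i-1) - s^(i)."""
    return clip(K_P * (V_DES - v) - K_B * gate(gap, 6.0))


def rear_accel(v5, s3, s5):
    """Driver 5: reacts to the gap to car 3, with an assumed factor 1/2 on k_b."""
    return clip(K_P * (V_DES - v5) - 0.5 * K_B * gate(s3 - s5, 12.0))
```

For instance, a follower already at the desired speed with a 10 meter gap applies zero acceleration, since the gate $G_1$ is inactive above 6 meters.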


Explanation for High Reward of DDPG in Initial Trials

In Figure 5, the reward of DDPG-CBF starts very high for early trials, and then drops to lower values. This arises due to stochasticity in the drivers' behaviors, which makes certain bad control strategies perform well in rare specific cases. In most trials, our car must accelerate at certain points (decreasing reward) in order to avoid collision with the driver behind. However, if the rear driver significantly slows down during certain trials due to stochasticity in their behavior, our car can simply cruise with little acceleration throughout these trials (these correspond to the few, initial high reward trials).

This strategy of cruising (little/no acceleration) is generally bad because if the driver behind does not slow down, our car must accelerate heavily at the last second to avoid collision, accumulating heavy penalty. The DDPG algorithm learns to avoid this initial do-nothing strategy.
