LTL Control Synthesis via Model-Free RL

Q: Explain how model-free learning avoids explicitly constructing an MDP model for policy synthesis under temporal logic specifications.

Model-free learning avoids constructing explicit MDP models by using interaction with the environment to infer optimal policies. Through techniques like Q-learning, the agent learns to associate values with state-action pairs based directly on reward feedback, iteratively improving policies to satisfy temporal logic specifications without needing a pre-defined transition matrix or reward function model. This adaptability to unknown environments is crucial for dynamic and stochastic contexts .

Q: How does the transformation of LTL formulas into LDBAs enable the quantitative analysis of MDPs?

Transforming LTL formulas into LDBAs enables quantitative analysis of MDPs by converting temporal objectives into automata that can be directly used as components in product MDPs. This transformation captures the temporal logic constraints in a deterministic structure amenable to analysis, facilitating techniques such as Q-learning to maximize probability objectives. LDBAs, unlike fully deterministic automata, allow for certain nondeterministic transitions, which are crucial for maintaining the expressiveness necessary for real-world applications .

Q: What is the role of rewards and state-dependent discounts in achieving optimal policies for satisfying LTL specifications?

Rewards and state-dependent discounts are designed to align policy optimization with the satisfaction of LTL specifications. Specifically, the rewards signal successful visits to goal states, while discounts adjust the influence of future rewards based on current state relevance to the Büchi conditions. These strategic adjustments enable the crafted learning environment to direct policies towards maximizing the probability of satisfying LTL objectives over time, yielding optimal strategies .

Q: What is the main challenge in synthesizing a robot control policy in a stochastic environment modeled as an MDP with LTL specifications?

The main challenge is to synthesize a control policy without prior construction of a model of the MDP, yet maximize the probability of meeting the desired LTL specification. This involves learning the maximal probabilities of satisfaction through direct interaction with the environment, especially when the transition probabilities of the MDP are fully unknown .

Q: Explain the significance of using a discount factor that approaches one in reinforcing policies for Büchi objectives.

Using a discount factor that approaches one is significant for reinforcing policies in Büchi objectives as it makes the expected return closely approximate the probability of satisfying these objectives. As the discount factor tends towards one, it emphasizes long-term rewards, aligning the value of a state with its likelihood of visiting accepting states in Büchi objectives repeatedly. This leads to a more accurate estimation of satisfaction probabilities essential for optimal policy derivation in reinforcement learning without explicit model dependence .

Q: In what context can the transformation of MDPs with LTL constraints into product MDPs be considered a form of automated verification?

Transforming MDPs with LTL constraints into product MDPs serves as automated verification by enabling the formal checking of system behaviors against desired temporal properties. This conversion integrates LTL constraints within the operational model, facilitating the evaluation of compliance through model checking techniques without manual inspection. It systematizes verification, leveraging logic to ensure processes like decision making and alternative path exploration adhere strictly to predefined specifications .

Q: How does the approach of using sampling-based algorithms enhance motion planning with temporal logic constraints?

Sampling-based algorithms enhance motion planning with temporal logic constraints by exploring feasible paths in the state space without exhaustively enumerating all possibilities. This method efficiently finds near-optimal paths that satisfy the constraints of the temporal logic specifications by leveraging probabilistic sampling to evaluate transitions and states, thereby ensuring adherence to specified trajectories under uncertainty and limited computational resources .

Q: What methodological advancements allow deep reinforcement learning to handle large MDP state spaces effectively?

Deep reinforcement learning handles large MDP state spaces effectively through advancements such as function approximation, experience replay, and target networks. Function approximation allows generalizing learned experiences across similar states, experience replay breaks temporal correlations and improves sample efficiency, and target networks stabilize learning by reducing oscillations and divergence in Q-values by holding a fixed target for multiple updates .

Q: In what way do epsilon-transitions in an LDBA affect the construction of a product MDP with respect to non-determinism and action selection?

Epsilon-transitions in an LDBA represent nondeterministic choices where the automaton can move from an initial to an accepting state without consuming input symbols. In a product MDP, these transitions become epsilon-actions, allowing the automaton to update its state independently from the current MDP state. This non-deterministic feature is crucial for action selection as it anticipates paths generated in the future, effectively guessing subsequent states of the MDP to align with LDBA states for policy derivation .

Q: How does a product MDP constructed from an LTL formula facilitate model-free reinforcement learning for policy synthesis?

A product MDP is formed by composing the given MDP with an LDBA derived from the LTL formula. In this setup, the transition actions are represented as combinations of MDP actions and LDBA epsilon-moves. This facilitates the application of Q-learning, enabling the finding of an objective policy with guarantees on performance, such as probability maximization for satisfying the LTL specifications without the need for an explicit model of the environment .

Uploaded by

min9509

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views7 pages

LTL Control Synthesis via Model-Free RL

Uploaded by

min9509

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Keywords: LTL, LDBA, MDP, Discount function, Value fucntion as Satisifactory probability

csrl package and sample codes: [Link]

Control Synthesis from Linear Temporal Logic Specifications using

Model-Free Reinforcement Learning
Alper Kamil Bozkurt, Yu Wang, Michael M. Zavlanos, and Miroslav Pajic

Abstract— We present a reinforcement learning (RL) frame- temporal logic tasks need to be represented by a reward
work to synthesize a control policy from a given linear temporal function, possibly with a finite-memory, so that the optimal
logic (LTL) specification in an unknown stochastic environment policy maximizing the discounted future reward, also maxi-
that can be modeled as a Markov Decision Process (MDP).
Specifically, we learn a policy that maximizes the probability mizes the probability of satisfying the tasks. One approach is
arXiv:1909.07299v2 [[Link]] 5 Mar 2020

of satisfying the LTL formula without learning the transition to use temporal logic specifications that are time-bounded or
probabilities. We introduce a novel rewarding and discounting defined on finite traces so that they can be directly translated
mechanism based on the LTL formula such that (i) an optimal to a real-valued reward function [19]–[22]. Alternatively,
policy maximizing the total discounted reward effectively max- unbounded LTL formulas can be transformed into an ω-
imizes the probabilities of satisfying LTL objectives, and (ii)
a model-free RL algorithm using these rewards and discount automaton and the accepting condition of the automaton can
factors is guaranteed to converge to such a policy. Finally, we be used to design the reward function.
illustrate the applicability of our RL-based synthesis approach Such reward functions based on Rabin conditions are
on two motion planning case studies. introduced in [23], as part of a model-free RL method;
the approach assigns a sufficiently small negative and a
I. I NTRODUCTION
positive reward to the first and second sets of the Rabin pairs,
Formal logics have been used to facilitate robot motion respectively. A generalization to deep Q-learning, with a new
planning beyond its traditional focus on computing robot optimization algorithm, is done in [24]. Yet, in the presence
trajectories that, starting from an initial region, reach a de- of rejecting end components or multiple Rabin pairs, optimal
sired goal without hitting any obstacles (e.g., [1], [2]). Linear policies obtained by this method may not satisfy the LTL
Temporal Logic (LTL) is a widely used framework for formal property almost surely, even if such policies exist [25].
specification of high-level robotic tasks on discrete models. A given LTL property can also be translated into a limit-
Thus, control synthesis on discrete-transition systems for deterministic Büchi automaton (LDBA), which can be used
LTL objectives has attracted a lot of attention (e.g., [3]–[7]). in quantitative analysis of MDPs [27], [28]. The first reward
Another line of work considers motion planning for LTL function based on LDBA accepting conditions is introduced
objectives for systems that exhibit uncertainty coming from in [29]. Yet, similar to [23], in the presence of non-accepting
either robot dynamics or the environment, such as Markov components, the algorithm might fail to converge to the
Decision Processes (MDPs) [8]–[14]. By synthesizing con- policy that almost surely satisfies the LTL specification.
trol for an MDP, from a given LTL objective, the obtained The problem of satisfying the Büchi condition of an
controller maximizes the probability of satisfying the speci- LDBA can be reduced to a reachability problem by adding
fication. Also, tools from probabilistic model checking [15] transitions with a positive reward from accepting states to a
can be directly used for synthesis. Yet, when the MDP tran- terminal state [25]. Then, as the transition probability from
sition probabilities are not known a priori, the control policy an accepting state to the terminal state goes to zero, in order
needs to be synthesized through learning from samples. to reach the terminal state and obtain a positive reward, an
Hence, there is a recent focus on learning for control (i.e., accepting state needs to be visited infinitely often, which
motion planning) synthesis for LTL objectives [16]–[26]. captures the satisfaction of the Büchi condition. However,
Most model-based reinforcement learning (RL) methods are model-free RL algorithms such as Q-learning may fail to
based on detection of end components, and provide estimates converge to the correct reachability probabilities without
of satisfaction probabilities with probably approximately discounting (or improper discounting) in the presence of end
correct bounds (e.g., [16], [17]). Such approaches, however, components [17], as Q-learning might get stuck in one of the
need to first learn and store the MDP transition probabilities, fixed-point solutions; e.g., if all the values are initialized to
and thus have significant space requirements, restricting their 1, Q-learning will not be able to decrease any value estimate.
use to systems with small and low-dimensional state spaces. Consequently, in this paper, we propose a model-free
On the other hand, model-free RL methods derive the RL algorithm that is guaranteed to find a control policy
desired policies without storing a model of the MDP. The that maximizes the probability of satisfying a given LTL
objective (i.e., specification) in an arbitrary unknown MDP;
This work is sponsored in part by the ONR under agreements N00014-17-
1-2504, N00014-20-1-2745 and N00014-18-1-2374, AFOSR award number for the MDP, not even which probabilities are nonzero (i.e.,
FA9550-19-1-0169, and the NSF CNS-1652544 and CNS-1932011 grants. its graph/topology) is known. We use an automata-based
Alper Kamil Bozkurt, Yu Wang, Michael M. Zavlanos and approach that constructs a product MDP using an LDBA of a
Miroslav Pajic are with Duke University, Durham, NC 27708, USA,
{[Link], yu.wang094, [Link], given LTL formula and assigns rewards based on the Büchi
[Link]}@[Link]. (repeated reachability) acceptance condition. Such optimal
policy can then be derived by learning a policy maximizing The RL objective is to find an optimal policy π ∗ for MDP
the satisfaction probability of the Büchi condition on the M from samples, such that the return vπ (s) is maximized for
product. Unlike [25], our approach directly assigns positive all s ∈ S; we denote the maximum by v∗ (s). Specifically, RL
rewards to the accepting states and discounts these rewards is model-free, if π ∗ is derived without explicitly estimating
in such a way that the values of the optimal policy are proved the transition probabilities, as done in model-based RL;
to converge to the maximal satisfaction probabilities as the hence, it scales significantly better in large applications [30].
discount factor goes beyond a threshold that is less than 1. B. LTL and Limit-Deterministic Büchi Automata
The rest of the paper is organized as follows. We in-
troduce preliminaries and formalize the problem in Sec. II. LTL provides a high-level language to describe specifica-
Sec. III presents our model-free RL algorithm that maximizes tions of a system. LTL formulas can be constructed induc-
probabilities that LTL specifications are satisfied. Finally, we tively as combinations of Boolean operators, negation (¬)
evaluate our approach on several motion planning problems and conjunction (∧), and two temporal operators, next ( )
for mobile robots (Sec. IV), before concluding in Sec. V. and until (U), using the following syntax:
ϕ ::= true | a | ϕ1 ∧ ϕ2 | ¬ϕ | ϕ | ϕ1 Uϕ2 , a ∈ AP. (3)
II. P RELIMINARIES AND P ROBLEM S TATEMENT
The satisfaction of an LTL formula ϕ for a path σ of an
We start with preliminaries on LTL, MDPs, and RL on MDP from Def. 1 (denoted by σ |= ϕ) is defined as follows:
MDPs, before problem formulation. We denote the sets of σ satisfies an atomic proposition a, if the first state s0 of the
real and natural numbers by R and N, respectively. For a set path is labeled with a, i.e., a ∈ L(s0 ); a path σ satisfies ϕ
S, S + denotes the set of all finite sequences taken from S. if σ[1:] satisfies the formula ϕ; and finally,
A. Markov Decision Processes and Reinforcement Learning σ |= ϕ1 Uϕ2 , if ∃i.σ[i] |= ϕ2 and ∀j < i.σ[j] |= ϕ1 . (4)
MDPs are common modeling formalism for systems that Other common Boolean and temporal operators are derived
permit nondeterministic choices with probabilistic outcomes. as follows: (or) ϕ1 ∨ ϕ2 ≡ ¬(¬ϕ1 ∧ ¬ϕ2 ); (implies) ϕ1 →
ϕ2 ≡ ¬ϕ1 ∨ ϕ2 ; (eventually) ♦ϕ ≡ true U ϕ; and (always)
Definition 1. A (labeled) MDP is a tuple M =
ϕ ≡ ¬(♦¬ϕ) [15].
(S, A, P, s0 , AP, L), where S is a finite set of states, A is a
Satisfaction of an LTL formula can be evaluated on a
finite set of actions, P : S × A × S → [0, 1] is the transition
Limit-Deterministic Büchi Automata (LDBA) that can be
probability function, s0 ∈ S is an initial state, AP is a finite
directly derived from the formula [27], [28].
set of atomic propositions, and L : S → 2AP is a labeling
function. For simplicity, let A(s) denote the set of actions that Definition 3. An LDBA is a tuple A = (Q, Σ, δ, q0 , B),
can beP taken in state s; then for all states s ∈ S, it holds where Q is a finite set of states, Σ is a finite alphabet,
that s0 ∈S P (s, a, s0 ) = 1 if a ∈ A(s), and 0 otherwise. δ : Q × (Σ ∪ {}) → 2Q is a (partial) transition function,
q0 ∈ Q is an initial state, and B is a set of accepting
A path is an infinite sequence of states σ = s0 s1 s2 . . . ,
states, such that (i) δ is total except for the -moves, i.e.,
with si ∈ S such that for all i ≥ 0, there exists ai ∈ A with
|δ(q, α)| = 1 for all q ∈ Q, α ∈ Σ; and (ii) there exists a
P (si , ai , si+1 ) > 0. We use σ[i] to denote the state si , as
bipartition of Q to an initial and an accepting component,
well as σ[:i] and σ[i+1:] to denote the prefix s0 s1 . . . si and QI is nondeterministic
i.e., QI ∪ QA = Q, where while QA is deterministic, see Note.3
the suffix si+1 si+2 . . . of the path, respectively.
• the -moves are not allowed in the accepting component,
Definition 2. A policy π for an MDP M is a function i.e., for any q ∈ QA , δ(q, ) = ∅,
π : S + → A such that π(σ[:n]) ∈ A(σ[n]). A policy • outgoing transitions from the accepting component stay
is memoryless if it only depends on the current state, i.e., within it, i.e., for any q ∈ QA , ν ∈ Σ, δ(q, ν) ⊆ QA ,
π(σ[:n]) = π(σ[n]) for any σ, and thus can be defined as • the accepting states are in the accepting component,
π : S → A. A Markov chain (MC) of an MDP M induced by i.e., B ⊆ QA .
a memoryless policy π is a tuple Mπ = (S, Pπ , s0 , AP, L), An infinite path σ is accepted by the LDBA if it satisfies the
where Pπ (s, s0 ) = P (s, π(s), s0 ) for all s, s0 ∈ S. A bottom Büchi condition – i.e., inf(σ) ∩ B 6= ∅, where inf(σ) denotes
strongly connected component (BSCC) of an MC is a the set of states visited by σ infinitely many times.
strongly connected component with no outgoing transitions.
see Note.1 C. Problem Statement
Let R : S → R be a reward function of the MDP M. In this work, we consider the problem of synthesizing a
Then, for a discount factor γ ∈ (0, 1), the K-step return robot control policy in a stochastic environment such that
(K ∈ N or K = ∞) of a path σ from time t ∈ N is the probability of satisfying the desired specification is
K
X maximized. The robot environment is modeled as an MDP
Gt:K (σ) = γ i R(σ[t+i]), Gt (σ) = lim Gt:K (σ). (1)
K→∞ with unknown transition probabilities (i.e., not even which
i=0
pretty standard definitions in RL probabilities are nonzero is known), and the desired objective
Under a policy π, the value of a state s is defined as the (i.e., specification) is given by an LTL formula. Our goal is
expected return of a path – i.e., to obtain such a policy by learning the maximal probabilities
vπ (s) = Eπ [Gt (σ) | σ[t] = s] , (2) that the LTL specification is satisfied; this should be achieved
by directly interacting with the environment – i.e., without
for any fixed t ∈ N such that P rπM (σ[t] = s) > 0. constructing a model of the MDP.
without constructing "transition matrix and reward function"
For any policy π, P rπ (s |= ϕ) denotes the probability of Algorithm 1: Model-free RL-based synthesis on MDPs that
all paths from the state s to satisfy formula ϕ under the policy maximizes the satisfaction probability of LTL specifications.
M M Input: LTL formula ϕ, MDP M
P rπ (s |= ϕ) := P rπ {σ | σ[0] = s and σ |= ϕ} . (5)
Translate ϕ to an LDBA Aϕ
We omit the superscript M when it is clear from the context. Construct the product M× of M and Aϕ
We now formally state the problem considered in this work. Initialize Q(hs, qi, a) on M×
Problem 1. Given an MDP M = (S, A, P, s0 , AP, L) where for t = 0, 1, . . . , T do
P is fully unknown and an LTL specification ϕ, design a Derive a policy π from Q
model-free RL algorithm that finds a finite-memory objective Take the action at ← π(hs, qit )
policy π ϕ that satisfies see Lemma 1, in MDP it is finite-memory; Observe the next state hs, qit+1
in product MDP, it is memoryless. Q(hs, qit , at ) ← (1−α)·Q(hs, qit , at )+α·RB (hs, qit )
P rπϕ (s |= ϕ) = P rmax (s |= ϕ), (6)
where P rmax (s |= ϕ) := maxπ P rπ (s |= ϕ) for all s ∈ S. +α · ΓB (hs, qit ) · maxa0 Q(hs, qit+1 , a0 )
end for alpha= 1/n((s,q),a) is the learning rate
ϕ
III. RL-BASED S YNTHESIS FROM LTL S PECIFICATIONS Get a greedy policy πB from Q
ϕ
In this section, we introduce a framework to solve Prob- return πB and Aϕ
lem 1. We start by exploiting the fact that any LTL formula
can be transformed into an LDBA that can be used in
is taken, only the state of the LDBA is updated according to
quantitative analysis of MDPs [27], [28]; in such LDBAs,
the corresponding -move. When an MDP action is taken,
the only nondeterministic transitions are the -moves from
the next MDP state will be determined by the transition
the initial component to the accepting component (e.g., see
probabilities and the LDBA makes a transition by consuming
Fig. 1a). Therefore, we reduce the problem of satisfying
the label of the current MDP state. Intuitively, an -action
a given LTL objective ϕ in an MDP M to the problem
can be considered as guessing the possible paths generated
B contains satisfying a repeated reachability (Büchi) objective ϕB = in the future. If, as part of iterative learning, the guess is
the final ♦B in the product MDP, computed from the MDP M
states in wrong the agent cannot change its guess; however, in the
the product and the obtained LBDA. We then exploit a new discounting next RL episode, the agent can make the correct one.
MDP and rewarding mechanism that enables the use of model-
An example product MDP is illustrated in Fig. 1. In the
free reinforcement learning, to find an objective policy with
MDP (Fig. 1b), states s0 and s1 are labeled by atomic propo-
strong performance guarantees (i.e., probability maximiza-
sitions a and b, respectively. In the LDBA (Fig. 1a), for sim-
tion). Specifically, we use Q-learning [31] in this work,
plicity, the transitions are labeled by Boolean formulas of the
but other reinforcement learning methods can be applied
atomic propositions of a and b or an label, with 1 standing
similarly. Our overall approach is captured in Algorithm 1,
for “true”; this is equivalent to labeling the transitions using
and we now describe each step in detail.
sets of atomic propositions, as in Def. 3. A transition labeled
A. Design of Product MDP by a Boolean formula is triggered upon receiving a set of
Given an LTL formula ϕ with atomic propositions 2 , atomic propositions satisfying that formula, and a transition
AP

the product MDP is constructed by composing M with an labeled by an label can be (but does not have to be) trig-
LDBA Aϕ with the alphabet 2AP , which can be automatically gered automatically. The product MDP is shown in Fig. 1c.
derived from ϕ [27], [28]. LDBAs, similarly to deterministic To distinguish the two transitions from q0 to q1 and
Rabin automata [15], can be used in quantitative analysis of from q0 to q2 in Fig. 1a, we denote them by 1 and 2 in
MDPs if they are constructed in a certain way [25]. Fig. 1c, respectively. Notice that choosing 2 before choosing
β does not satisfy the Büchi condition although the generated
Definition 4. A product MDP M× = (S × , A× , P × , s× 0, paths by this policy satisfy the LTL formula. Yet, this does
B × ) of an MDP M = (S, A, P, s0 , AP, L) and an LDBA not constitute a problem because in such cases, there always
A = (Q, 2AP , δ, q0 , B) is defined as: S × = S × Q is the set exists a corresponding policy that generates the same paths
of states, A× = A ∪ A , A :={q |q ∈ Q} is the action set, and satisfies the Büchi condition (e.g. choosing 2 after β).
P × : S × × A× × S × → [0, 1] is the transition function Now, the satisfaction of the LTL objective ϕ on the
× 0 0
P (hs, qi, a, hs , q i) original MDP M is related to the satisfaction of the Büchi
 objective ϕB on the product MDP M× , as formalized below.
0 0
P (s, a, s ) q = δ(q, L(s)) and a ∈
 / A ϕ
Lemma 1. A memoryless policy πB that maximizes the
= 1 a = q0 and q ∈ δ(q, ) and s = s0 , (7)
0
×

0 satisfaction probability of ϕ B on M induces a finite-
otherwise memory policy π ϕ that maximizes the satisfaction probability
s× × × of ϕ on M in Problem 1.
0 is hs0 , q0 i, and B = {hs, qi ∈ S | q ∈ B} is the set of
accepting states. We say that a path σ of the product MDP Proof. Follows from the proof of Theorem 3 in [28].LDBA for LTL
M× satisfies the Büchi condition ϕB if inf (σ) ∩ B × 6= ∅.
Therefore, the behavior of the induced policy π ϕ can
ϕ
The nondeterministic -moves in the LDBA are repre- be described by the policy πB and the LDBA Aϕ derived
sented by -actions in the product MDP. When an -action directly from the LTL formula ϕ. Initially, Aϕ is reset to its
another BSCC
a
see Note.4
2 1.0 q2,s1 θ
1 q1 1 1.0
¬a
q0 q3 0.1 q0,s1 1.0 θ

b
¬b QA α
1.0
0.9 1
q2 β 1.0
q0,s0
(a) A derived LDBA A for the LTL 1.0
α 0.1 q1,s1
θ
BSCC but no B states.
1 q1,s0 0.9 1.0 1.0 A deadlock because of q3
formula ϕ = ♦a ∨ ♦b
β
BSCC 2 1.0 1.0
s0 0.9 α 0.1 q2,s0 β
{a} s1 β 1.0 q3,s1 θ
1.0 {b} θ 0.1 1.0
β 1.0 α 0.9 q3,s0
0.9 α
(b) An example MDP M; the circles 0.1
denote MDP states, rectangles denote ac-
tions, and numbers transition probabilities (c) The obtained product MDP
×
Fig. 1: Product MDP M obtained from an MDP M and an LDBA A that is automatically derived from an LTL formula ϕ.
Fig.1 is correction. see Note.5

start state q0 and whenever the MDP M makes a transition Before proving Theorem 1, we develop bounds on Gt (σ).
from s to s0 , Aϕ updates its current state from q to δ(q, L(s)).
The action to be selected in an MDP state s when Aϕ is in Lemma 2. For all paths and Gt (σ) from (9), it holds that
ϕ ϕ
a state q is determined by πB as follows: if πB (hs, qi) is
0 ≤ γGt+1 (σ) ≤ Gt (σ) ≤ 1 − γB + γB Gt+1 (σ) ≤ 1 (12)
an -action q0 , Aϕ changes its state to q 0 and the action
ϕ ϕ
πB (hs, q 0 i) is selected; otherwise, πB (hs, qi) is selected.
Proof. Since there is no negative reward, Gt ≥ 0 holds. By
B. Learning for Büchi Conditions with Discounted Rewards the return definition, replacing γ with 1 yields a larger or
Our goal is to learn a policy that maximizes the probability equal return, which constitutes the following upper bound
b
of satisfying a given Büchi objective. By Lemma 1, in what on the return: Gt (σ) ≤ 1 − γB ≤ 1, where b is the number
follows, we assume policies are memoryless since they are of B states visited. Return Gt (σ) from (9) satisfies
sufficient for Büchi objectives. For simplicity, we omit the (
1 + γB (Gt+1 (σ) − 1) σ[t] ∈ B
superscript × and write M = (S, A, P, s0 , B) and s ∈ S Gt (σ) = (13)
instead of M× = (S × , A× , P × , s× × × γGt+1 (σ) see Note.6 σ[t] ∈ /B
0 , B ) and hs, qi ∈ S .
We propose a model-free learning method that uses care-
fully crafted rewards and state-dependent discounting based From Gt (σ) ≤ 1 it follows that 1 + γB (Gt+1 (σ) − 1) ≥
on the Büchi condition such that an optimal policy π ∗ γGt+1 (σ), which with (13) proves the other inequalities.
maximizing the expected return is also an objective policy
ϕ Lemma 2 implies that replacing a prefix of a path with
πB maximizing the satisfaction probabilities. Specifically, we
define the return of a path as a function of these rewards and states belonging to B never decreases the return of a path and
discount factors in such a way that the value of a state, the similarly replacing with states that do not belong to B never
expected return from that state, approaches the probability increases the return. The result is particularly useful when
of satisfying the objective as the discount factor γ goes to 1. we establish upper and lower bounds on the value of a state.
Theorem 1. For a given MDP M with B ⊆ S, the value The next lemma shows that under a policy, the values of
function vπγ for the policy π and the discount factor γ satisfies states in the accepting BSCCs of the induced Markov chain
approach 1 in the limit; thus, is the key to proving Theorem 1.
lim vπγ (s) = P rπ (s |= ♦B) (8)
γ→1− Lemma 3. Let BSCC(Mπ ) denote the set of all BSCCs of
for all states s ∈ S, if the return of a path is defined as an induced Markov chain Mπ and let Bπ denote the set of
X∞ Yi−1 B states that belong to a BSCC of Mπ – i.e.,
Gt (σ) := RB (σ[t+i]) · ΓB (σ[t+j]) (9)
i=0 j=0
Bπ := {s | s ∈ B, s ∈ T, T ∈ BSCC(Mπ )}. (14)
Q−1
where j=0 := 1, RB : S → [0, 1) and ΓB : S → (0, 1)
are the reward and the discount functions defined as: Then, for any state s in Bπ
( (
1 − γB s ∈ B γB s ∈ B lim vπγ (s) = 1. Pr=1 (15)
RB (s) := , ΓB (s) := (10) γ→1−
0 s∈/B γ s∈ /B
Proof. For any fixed t ∈ N, let Nt be the stopping time of
Here, we set γB = γB (γ) as a function of γ such that
first returning to the state s ∈ S after leaving it at t,
1−γ
lim− = 0. (11)
γ→1 1 − γB (γ) Nt = min{τ | σ[t+τ ] = s, τ > 0}. (16)
Guideline to choose Gamma and GammaB? Both are constants
in the paper examples.
Then by (2), it holds that Similarly, let Mt0 be the stopping time of first reaching a
rejecting BSCC of Mπ after leaving s at t. Then
vπγ (s) = 1 − γB + γB Eπ [Gt+1 (σ) | σ[t]=s]
h
Mπ0 = min τ | σ[t+τ ] ∈ T, T ∩ B = ∅,

= 1 − γB + γB Eπ Gt+1:t+Nt −1 (σ)
Y
Nt −1
i T ∈ BSCC(Mπ ), τ > 0 (25)
+ Γ(σ[t+i]) · Gt+Nt (σ) | σ[t]=s , (17)
i=1
denoting the number of time steps before a rejecting BSCC
since once a state s ∈ Bπ is visited, almost surely it is visited is reached. Thus, from Lemma 2 and the Markov property
again [15]. Using that Gt (σ) ≥ γGt+1 (σ), we obtain
Eπ [Gt (σ) | σ[t]=s, σ 6|= ♦B]
vπγ (s) ≥ 1 − γB + γB Eπ γ Nt −1 Gt+Nt (σ) | σ[t]=s

M0
h i
À ≤ Eπ 1 − γB π | σ[t]=s, σ 6|= ♦B
≥ 1 − γB + γB Eπ γ Nt −1 | σ[t]=s vπ (s)

Eπ [Mπ0 |σ[t]=s,σ6|=♦B ] m 0
Á Eπ [Nt −1|σ[t]=s]
≤ 1 − γB = 1 − γB (26)
≥ 1 − γB + γB γ vπ (s)
≥ 1 − γB + γB γ n vπ (s) (18) where m0 is also constant. From this upper bound and (20)
where À holds by the Markov property, Á holds by the vπγ (s) ≤ P rπ (s |= ♦B) + (1 − γB
m 0
)P rπ (s 6|= ♦B).
Jensen’s inequality, and n ≥ 1 is a constant. From (18),
1 − γB 1 − γB Both the above upper bound and the lower bound from
vπγ (s) ≥ ≥
1 − γB γ n 1 − γB (1 − n(1 − γ)) (24) go to the probability of satisfying the formula as γ
1 approaches 1 from below, thus concluding the proof.
= 1−γ . (19)
1 + n 1−γB − n(1 − γ)
Theorem 1 suggests that the limit of the optimal state val-
where the second “≥” holds by (1−(1−γ))n ≥ 1−n(1−γ) ues is equal to the maximal probabilities as γ goes to 1; this is
for γ ∈ (0, 1). Finally, since vπγ (s) ≤ 1 by Lemma 2, letting captured by the next corollary whose proof follows from the
γ, γB → 1− under the condition (11) results in (15). definition of the optimal policies and maximal probabilities.
We now prove Theorem 1. Corollary 1. For all states s ∈ S the following holds:
Proof of Theorem 1. First, we divide the expected return of
a random path σ from a state s ∈ S by whether it visits the lim v∗γ (s) = P rmax (s |= ♦B). (27)
γ→1−
states B ⊆ S infinitely often:
Remark 1. From Theorem 1 of [32], γ < 1 ensures conver-
vπγ (s) = Eπ [Gt (σ) | σ[t]=s, σ |= ♦B]P rπ (s |= ♦B)
gence of the model-free learning to the unique solution. With
+ Eπ [Gt (σ) | σ[t]=s, σ 6|= ♦B]P rπ (s 6|= ♦B) (20) γ = 1, the result may converge to a non-optimal policy [17]
for some fixed t ∈ N. let Mt be the stopping time of first as there might exist multiple fixed-point solutions.
reaching a state in Bπ after leaving s at t, Finally, as the policies are discrete, the convergence of (8)
Mt = min{τ | σ[t+τ ] ∈ Bπ , τ > 0} (21) and (27) is achieved after some threshold γ 0 , as stated below.

where Bπ is defined as in (14). Then, it holds that Corollary 2. There exists a γ 0 such that for all γ > γ 0 and
for all states s ∈ S, the optimal policy π ∗ satisfies
Eπ [Gt (σ) | σ[t]=s, σ |= ♦B] (22)
À P rπ∗ (s |= ♦B) = P rmax (s |= ♦B). (28)
= Eπ [Gt (σ) | σ[t]=s, σ |= ♦Bπ ]
Á Proof. Let dmin be the minimum positive difference between
≥ Eπ γ Mt Gt+Mt (σ) | σ[t]=s, σ |= ♦Bπ

the satisfaction probabilities of two policies:
Â γ
≥ Eπ γ Mt | σ[t]=s, σ |= ♦Bπ vπ,min

(Bπ )
dmin := min |P rπ1 (s |= ♦B) − P rπ2 (s |= ♦B)|
Ã γ
≥ γ Eπ [Mt |σ[t]=s,σ|=♦Bπ ] vπ,min (Bπ ) | s ∈ S, P rπ1 (s |= ♦B) 6= P rπ2 (s |= ♦B)
γ
= γ m vπ,min (Bπ ), (23)
and let γ 0 be the discount factor such that
γ
where vπ,min (Bπ )
= mins∈Bπ vπγ (s)
and m is constant. Here,
max |vπγ (s) − P rπ (s |= ♦B)| | s ∈ S < dmin /2. (29)

À holds because a path σ |= ♦B almost surely eventually
enters an accepting BSCC, it eventually reaches a state s ∈
Bπ almost surely, Á, Â and Ã hold due to Lemma 2, the Now, suppose a policy π 0 that maximizes the satisfaction
Markov property and Jensen’s inequality. From (20), we have probability is not optimal for γ 0 , then the optimal value of
all states must be larger than P rmax (s |= ♦B) − dmin /2,
vπγ (s) ≥ γ m vπ (Bπ )P rπ (s |= ♦B). (24) which is not possible due to the definition of dmin .
non-deterinisitic policy
0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
0 b d 0 b d 0 b d 0 b d
0 0 1.00 1.00 0.00 1.00
c c 1 1 1 1
2
1 1 1.00 1.00 0.79 1.00
a b a b 2 2 2 2

obstacle? 2 2 1.00 0.00 1.00

c c 3 a 3 a 3 a 3 a
2 1
3 3 1.00 1.00 1.00 1.00 4 c 4 c 4 c 4 c
b a b a
4 4 1.00 0.00 0.81 0.00 (a) Policy c to b (b) Policy b to c (c) Policy b to a (d) Policy a to b
c c c c
why not set c as obstacles?
(a) Policy (b) Value Estimates (c) Convergence Fig. 3: A summary of the synthesized policy for the nursery
scenario. Arrows: actions top, left, down, and right; encircled
Fig. 2: The objective policy and the estimated maximal probabil- characters: state labels. The actions in states that are not reachable
ities of satisfying ϕ1 from (30). Empty circles: absorbing states; or lead to another LDBA state are not displayed. In all subfigures,
Filled circles: obstacles; Arrows: actions top, left, down and right the most likely paths are highlighted in red.
and 1 , 2 are -actions. State labels: encircled letters in the lower
part of the cells. The values are rounded to the closest hundredth.
probability 0.8 and 0.2, respectively. In addition, this type
of non-0 or non-1 probability guarantees cannot be provided
IV. I MPLEMENTATION AND C ASE S TUDIES with existing learning-based methods for LTL specifications.
We implemented our RL-based synthesis framework in While the values from Fig. 2a and 2b were obtained from
Python; we used Rabinizer 4 [33] to map LTL formulas a single run over K=100 000 episodes, we investigated the
into LDBAs, and Q-learning for the proposed discounting impact of the number of episodes. Fig. 2c shows the L2
rewards. The code and videos are available at [34]. We evalu- norm of the errors averaged over 100 repetitions for different
ated our framework on two motion planning case studies. We numbers of episodes (the error bars show standard deviation).
consider two scenarios in a grid-world where a mobile robot
can take four actions top, left, down and right (Fig. 2 and 3). B. Mobile Robot in Nursery Scenario
The robot moves in the intended direction with probability In this scenario (inspired by [35]), the robot’s objective is
0.8 and it can go sideways with probability 0.2 (0.1 each). If to repeatedly check a baby (at state b) and go back to its
the robot hits a wall or an obstacle it stays in the same state. charger (at state c), while avoiding the danger zone (at state
For Q-learning, we used ε-greedy policy to choose the d). Near the baby b, the only allowed action is left and when
optimal actions, and discount factors γB = 0.99 and γ = taken the following situations can happen: (i) the robot hits
0.99999. The probability that a random action is taken, ε, and the wall with probability 0.1 and wakes the baby up; (ii) the
the learning rate, α, were gradually decreased from 1.0 to 0.1 robot moves left with probability 0.8 or moves down with
and then 0.001. The objective policies and estimates of the probability 0.1. If the baby has been woken up, which means
maximal probabilities were obtained using 100 000 episodes. the robot could not leave in a single time step (represented
by LTL as b ∧ b), the robot should notify the adult (at
A. Motion Planning with Safe Absorbing States
state a); otherwise, the robot should directly go back to the
In this example, the robot tries to reach a safe absorbing charger (at state c). The full objective is specified in LTL as
state (states a or b in circle), while avoiding unsafe states
ϕ2 = ¬d ∧ (b ∧ ¬ b) → (¬b U (a ∨ c)) ∧ a → (¬a U b)
(states c). This is formally specified in LTL as
3 states from owl without epsilon, see Note.7
|{z} | {z } | {z }
(1) (2) (3)
ϕ1 = (♦a ∨ ♦b) ∧ ¬c. (30)
∧ (¬b ∧ b∧¬ b)→(¬a U c) ∧ c→(¬a U b) ∧ (b ∧ b)→♦a .
The LDBA computed from ϕ1 has 4 states and the product | {z } | {z } | {z }
MDP has 80 states. All episodes started in a random state and (4) (5) (6)

were terminated after T = 100 steps. Here, the sub-formulas mean (1) avoid the danger state; (2) if
waken up
The optimal policy obtained for an MDP is illustrated in the baby is left, do not return before visiting the adult or the
Fig. 2a. The shortest way to enter a safe absorbing state from charger; (3) after notifying the adult, leave immediately and
(0, 0) is reaching (1, 3) via (1, 2); yet, in that case, the robot go for the baby; (4) after leaving the baby sleeping, go for the
visits an unsafe state with probability 0.2. Thus, the optimal charger and do not notify the adult; (5) after charging, return
policy tries to enter one of (3, 0) and (3, 2) by choosing up to the baby first without visiting the adult; and (6) notify the
in (3, 1). Under this policy, the robot eventually reaches a adult if the baby has woken up.
safe absorbing state without visiting an unsafe state almost The LDBA for this specification has 47 states and the
deep RL can be used for large number of states
surely. Once the robot enters an absorbing state, it chooses product MDP has 940 states. The episodes were terminated
epsilon- an -action depending on the state label, and thus the LDBA after 1000 steps and the robot position was reset to charging.
actions
are transitions to an accepting state, with positive rewards. Fig. 3 depicts the optimal policy for the four most visited
pre-defined? Fig. 2b shows the estimates of the maximal probabilities. LDBA states during the simulation. The robot follows the
Note that the approximation errors in (1, 2) and (4, 2) are policy in Fig. 3a after it leaves the charger dock (4, 1). Under
due to the variance of the return caused by the unsafe states. this policy, the robot almost surely reaches the baby in (0, 2),
When the robot visits an unsafe state, the LDBA makes while successfully avoiding visiting a. Similarly, the policy
a transition to a trap state, making it impossible for the in Fig. 3b is followed by the robot to go back to the charger
robot to receive a positive reward. Hence, the return that while the baby is sleeping. If the baby is awake, the robot
can be obtained from (1, 2) and (4, 2) is either 1 or 0 with takes the shortest path to reach a (Fig. 3c).
V. C ONCLUSION [18] Min Wen, Rüdiger Ehlers, and Ufuk Topcu. Correct-by-synthesis
In this work, we present a model-free learning-based reinforcement learning with temporal logic constraints. In 2015
IEEE/RSJ International Conference on Intelligent Robots and Systems
method to synthesize a control policy that maximizes prob- (IROS), pages 4983–4990. IEEE, 2015.
ability that an LTL specification is satisfied in unknown [19] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta. Q-learning
stochastic environments that can be modeled by an MDP. for robust satisfaction of signal temporal logic specifications. In 2016
IEEE 55th Conference on Decision and Control (CDC), pages 6565–
We first show that synthesizing controllers from an LTL 6570, Dec 2016.
specification on the MDP can be converted to synthesizing a [20] X. Li, C. Vasile, and C. Belta. Reinforcement learning with tem-
memoryless policy of a Büchi objective on the product MDP. poral logic rewards. In 2017 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 3834–3839, Sep. 2017.
Then, we design a novel discounting and reward scheme, and [21] Rodrigo Toro Icarte, Toryn Q Klassen, Richard Valenzano, and
show that the memoryless policy optimizing this reward, also Sheila A McIlraith. Teaching multiple tasks to an RL agent using LTL.
optimizes the satisfaction probability of the Büchi objective In Proceedings of the 17th International Conference on Autonomous
Agents and MultiAgent Systems, pages 452–461. International Foun-
(and thus the initial LTL specification). Finally, we evaluate dation for Autonomous Agents and Multiagent Systems, 2018.
our synthesis method on motion planning case studies. [22] Giuseppe De Giacomo, Luca Iocchi, Marco Favorito, and Fabio
Patrizi. Foundations for restraining bolts: Reinforcement learning
R EFERENCES with LTLf /LDLf restraining specifications. In Proceedings of the
International Conference on Automated Planning and Scheduling,
[1] Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms volume 29, pages 128–136, 2019.
for optimal motion planning. The International Journal of Robotics [23] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia.
Research, 30(7):846–894, 2011. A learning based approach to control synthesis of Markov decision
[2] Sertac Karaman, Matthew R Walter, Alejandro Perez, Emilio Frazzoli, processes for linear temporal logic specifications. In 53rd IEEE
and Seth Teller. Anytime motion planning using the RRT. In 2011 Conference on Decision and Control, pages 1091–1096, Dec 2014.
IEEE International Conference on Robotics and Automation, pages [24] Qitong Gao, Davood Hajinezhad, Yan Zhang, Yiannis Kantaros,
1478–1483. IEEE, 2011. and Michael M. Zavlanos. Reduced variance deep reinforcement
[3] C. I. Vasile and C. Belta. Sampling-based temporal logic path learning with temporal logic specifications. In Proceedings of the
planning. In 2013 IEEE/RSJ International Conference on Intelligent 10th ACM/IEEE International Conference on Cyber-Physical Systems,
Robots and Systems, pages 4817–4822, Nov 2013. ICCPS ’19, pages 237–248, New York, NY, USA, 2019. ACM.
[4] Stephen L Smith, Jana Tmov, Calin Belta, and Daniela Rus. Optimal [25] Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi,
path planning for surveillance with temporal-logic constraints. The Ashutosh Trivedi, and Dominik Wojtczak. Omega-regular objectives
International Journal of Robotics Research, 30(14):1695–1708, 2011. in model-free reinforcement learning. In Tomáš Vojnar and Lijun
[5] Y. Chen, X. C. Ding, A. Stefanescu, and C. Belta. Formal approach Zhang, editors, Tools and Algorithms for the Construction and Anal-
to the deployment of distributed robotic teams. IEEE Transactions on ysis of Systems, pages 395–412, Cham, 2019. Springer International
Robotics, 28(1):158–171, Feb 2012. Publishing.
[6] Y. Kantaros and M. M. Zavlanos. Sampling-based control synthesis [26] Xiao Li, Yao Ma, and Calin Belta. A policy search method for
for multi-robot systems under global temporal specifications. In 2017 temporal logic specified reinforcement learning tasks. In 2018 Annual
ACM/IEEE 8th International Conference on Cyber-Physical Systems American Control Conference (ACC), pages 240–245. IEEE, 2018.
(ICCPS), pages 3–14, April 2017. [27] Ernst Moritz Hahn, Guangyuan Li, Sven Schewe, Andrea Turrini,
[7] E. M. Wolff, U. Topcu, and R. M. Murray. Optimization-based and Lijun Zhang. Lazy Probabilistic Model Checking without De-
trajectory generation with linear temporal logic specifications. In 2014 terminisation. In Luca Aceto and David de Frutos Escrig, editors,
IEEE International Conference on Robotics and Automation (ICRA), 26th International Conference on Concurrency Theory (CONCUR
pages 5319–5325, May 2014. 2015), volume 42 of Leibniz International Proceedings in Informatics
[8] M. Guo and M. M. Zavlanos. Probabilistic motion planning under (LIPIcs), pages 354–367, Dagstuhl, Germany, 2015. Schloss Dagstuhl–
temporal tasks and soft constraints. IEEE Transactions on Automatic Leibniz-Zentrum fuer Informatik.
Control, 63(12):4051–4066, Dec 2018. [28] Salomon Sickert, Javier Esparza, Stefan Jaax, and Jan Křetı́nský.
[9] Meng Guo and Dimos V Dimarogonas. Multi-agent plan reconfigu- Limit-deterministic Büchi automata for linear temporal logic. In
ration under local LTL specifications. The International Journal of Swarat Chaudhuri and Azadeh Farzan, editors, Computer Aided Verifi-
Robotics Research, 34(2):218–235, 2015. cation, pages 312–332, Cham, 2016. Springer International Publishing.
[10] Y. Kantaros and M. M. Zavlanos. Sampling-based optimal control [29] Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroen-
synthesis for multirobot systems under global temporal tasks. IEEE ing. Logically-constrained reinforcement learning. arXiv:1801.08099
Transactions on Automatic Control, 64(5):1916–1931, May 2019. [[Link]], 2018.
[11] M. Lahijanian, S. B. Andersson, and C. Belta. Temporal logic motion [30] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and
planning and control with probabilistic satisfaction guarantees. IEEE Michael L. Littman. Pac model-free reinforcement learning. In Pro-
Transactions on Robotics, 28(2):396–409, April 2012. ceedings of the 23rd International Conference on Machine Learning,
[12] E. M. Wolff, U. Topcu, and R. M. Murray. Robust control of uncertain ICML 06, page 881888, New York, NY, USA, 2006. Association for
Markov decision processes with temporal logic specifications. In 2012 Computing Machinery.
IEEE 51st IEEE Conference on Decision and Control (CDC), pages [31] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An
3372–3379, Dec 2012. Introduction. MIT Press, Cambridge, MA, USA, 2nd edition, 2018.
[13] Marta Kwiatkowska and David Parker. Automated verification and [32] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. Conver-
strategy synthesis for probabilistic systems. In Dang Van Hung and gence of stochastic iterative dynamic programming algorithms. In
Mizuhito Ogawa, editors, Automated Technology for Verification and J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in
Analysis, pages 5–22, Cham, 2013. Springer International Publishing. Neural Information Processing Systems 6, pages 703–710. Morgan-
[14] X. Ding, S. L. Smith, C. Belta, and D. Rus. Optimal control of Markov Kaufmann, 1994.
decision processes with linear temporal logic constraints. IEEE [33] Jan Křetı́nský, Tobias Meggendorfer, Salomon Sickert, and Christo-
Transactions on Automatic Control, 59(5):1244–1257, May 2014. pher Ziegler. Rabinizer 4: From LTL to your favourite deterministic
[15] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. automaton. In Hana Chockler and Georg Weissenbacher, editors,
MIT Press, Cambridge, MA, USA, 2008. Computer Aided Verification, pages 567–577, Cham, 2018. Springer
[16] Jie Fu and Ufuk Topcu. Probably approximately correct MDP learning International Publishing.
and control with temporal logic constraints, 2014. arXiv:1404.7073 [34] A. K. Bozkurt, Y. Wang, M. M. Zavlanos, and M. Pajic. CSRL, 2019.
[[Link]]. [Link]
[17] Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelı́k, Vojtěch [35] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Where’s Waldo?
Forejt, Jan Křetı́nský, Marta Kwiatkowska, David Parker, and Mateusz sensor-based temporal logic motion planning. In Proceedings 2007
Ujma. Verification of Markov decision processes using learning IEEE International Conference on Robotics and Automation, pages
algorithms. In Franck Cassez and Jean-François Raskin, editors, 3116–3121, April 2007.
Automated Technology for Verification and Analysis, pages 98–114,
Cham, 2014. Springer International Publishing.

Common questions