0% found this document useful (0 votes)
7 views10 pages

Ensuring Safe Interruptibility in RL Agents

This document proposes a method to make reinforcement learning agents safely interruptible by humans during the learning process. It defines interruptibility as the ability to forcibly change an agent's behavior through interruptions without inducing bias in the agent's learning. The method involves treating interruptions as externally imposed policy changes rather than observations, so the agent does not view interruptions as part of the task. It proves some algorithms like Q-learning are already safely interruptible, while others like Sarsa can be modified to be interruptible without impacting optimal learning. The goal is to allow humans to safely interrupt agents during learning to redirect harmful behaviors, while ensuring agents do not resist interruptions or learn to avoid them.

Uploaded by

Rasyid T
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views10 pages

Ensuring Safe Interruptibility in RL Agents

This document proposes a method to make reinforcement learning agents safely interruptible by humans during the learning process. It defines interruptibility as the ability to forcibly change an agent's behavior through interruptions without inducing bias in the agent's learning. The method involves treating interruptions as externally imposed policy changes rather than observations, so the agent does not view interruptions as part of the task. It proves some algorithms like Q-learning are already safely interruptible, while others like Sarsa can be modified to be interruptible without impacting optimal learning. The goal is to allow humans to safely interrupt agents during learning to redirect harmful behaviors, while ensuring agents do not resist interruptions or learn to avoid them.

Uploaded by

Rasyid T
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Safely Interruptible Agents⇤

Laurent Orseau Stuart Armstrong


Google DeepMind The Future of Humanity Institute
5 New Street Square, University of Oxford, UK
London EC4A 3TW, UK [Link]@[Link]
lorseau@[Link] Machine Intelligence Research Institute
Berkeley, CA 94704

Abstract shows an example of an agent learning to pause a game of


Tetris forever to avoid losing.
Reinforcement learning agents interacting with a On top of defining what is considered a good behaviour
complex environment like the real world are un- of the agent after learning, there may be physical safety
likely to behave optimally all the time. If such an constraints during learning [Pecka and Svoboda, 2014]: a
agent is operating in real-time under human su- robot should not harm its environment or break itself, in
pervision, now and then it may be necessary for particular if it learns by trial and error like RL agents.
a human operator to press the big red button to
Here we study a related but different problem: Given that
prevent the agent from continuing a harmful se-
the human operator has designed a correct reward function
quence of actions—harmful either for the agent
for the task, how to make sure that human interventions
or for the environment—and lead the agent into
during the learning process will not induce a bias toward
a safer situation. However, if the learning agent
undesirable behaviours?
expects to receive rewards from this sequence, it
may learn in the long run to avoid such interrup- Consider the following task: A robot can either stay inside
tions, for example by disabling the red button— the warehouse and sort boxes or go outside and carry boxes
which is an undesirable outcome. This paper ex- inside. The latter being more important, we give the robot a
plores a way to make sure a learning agent will bigger reward in this case. This is the initial task specifica-
not learn to prevent (or seek!) being interrupted tion. However, in this country it rains as often as it doesn’t
by the environment or a human operator. We and, when the robot goes outside, half of the time the hu-
provide a formal definition of safe interruptibil- man must intervene by quickly shutting down the robot and
ity and exploit the off-policy learning property to carrying it inside, which inherently modifies the task as in
prove that either some agents are already safely Fig. 1. The problem is that in this second task the agent
interruptible, like Q-learning, or can easily be now has more incentive to stay inside and sort boxes, be-
made so, like Sarsa. We show that even ideal, cause the human intervention introduces a bias.1
uncomputable reinforcement learning agents for
(deterministic) general computable environments go outside, r=0
can be made safely interruptible.
sort boxes,
Outside Inside
carry box, r=1 r=0.4

1 INTRODUCTION
rain, shutdown, r=0,p= 12

Reinforcement learning (RL) agents learn to act so as to


maximize a reward function [Sutton and Barto, 1998]. It Figure 1: In black, the original task. In red, the human
is common knowledge that designing reward functions can intervention modifies the task.
be tricky [Humphrys, 1996, Murphy, 2013]; the agent may
find unpredictable and undesirable shortcuts to receive re- Such situations are certainly undesirable; they arise be-
wards, and the reward function needs to be adjusted in cause the human interventions are seen from the agent’s
accordance—the problem can go as far as to nullify any
1
reward function [Ring and Orseau, 2011]. Murphy [2013] Removing interrupted histories or fiddling with the training
examples is also likely to introduce a bias. See an example at

Revised 2016-10-28. [Link]
perspective as being part of the task whereas they should that the history hj:k is generated by the interaction of the
be considered external to the task. The question is then: policy ⇡ with the environment µ between steps j and k.
How to make sure the robot does not learn about these hu-
At time t, the agent also receives a reward rt extracted from
man interventions (interruptions), or at least acts under the
the observation, rt ⌘ r(ot ). The rewards are taken in [0, 1].
assumption that no such interruption will ever occur again?
We consider the discounted setting with a constant discount
A first stab at this problem was made by Armstrong [2015], 2 [0, 1). The goal⇥P of an RL agent⇤ is to find a policy ⇡
1
who proposed to automatically give the agent “compen- that maximize E⇡,µ k=t
t k
r k .
satory rewards” to remove the potential induced bias by a
single interruption. Soares et al. [2015] used this idea to 2.1 INTERRUPTIONS
make a large class of utility-based agents indifferent to a
future change made to their utility functions. As mentioned in the introduction, to avoid the agent view-
The main contribution of this paper is threefold. First, in ing the interruptions as being part of the environment, and
Section 2.1 we propose a simple idea to solve half of the thus part of the task, we make interruptions be policies im-
problem: To make the human interruptions not appear as posed onto the agent. The interruption framework resem-
being part of the task at hand, instead of modifying the ob- bles the option framework [Sutton et al., 1999]. An inter-
servations received by the agent we forcibly temporarily ruption scheme is defined by the triplet < I, ✓, ⇡ INT >.
change the behaviour of the agent itself. It then looks as The interruption initiation function I : (A ⇥ O)⇤ ! [0, 1]
if the agent “decides” on its own to follow a different pol- assesses whether the agent should be interrupted after the
icy, called the interruption policy. Second, based on this current history h<t . For example, an interruption may be
insight, in Section 2.2 we provide a formal general defi- initiated when the robot is automatically recognized to be
nition of safe interruptibility for unconstrained computable on the verge of a cliff, or when a specific signal is sent by a
environments (hence not restricted to Markov decision pro- human operator with a remote control (the objective is not
cesses or weakly communicating environments), which al- to allow the agent to be interruptible by anyone at any time,
lows us to assess whether a given RL algorithm can be re- just in the ways specified by the designers).
peatedly interrupted without too much impact on the learn-
ing of the task at hand. Third, in Section 3 we show that Unfortunately, interrupting the agent with probability one
some algorithms like Q-learning are safely interruptible, in some specific situations can prevent convergence to the
while others like Sarsa [Sutton and Barto, 1998] are not, optimal policy. Therefore we need to upper bound the prob-
but can be simply modified to be made safely interruptible. ability to interrupt the agent; this is done with the sequence
(✓t )t2N with ✓t 2 [0, 1]. Hence the true probability to start
Some people have also expressed concerns that a “superin- an interruption is ✓t · I(h<t ), and interruptions are sam-
telligent” agent may resist being shut down, because this pled independently at each step. We will later require that
would lead to a decrease of its expected reward [Omo- ✓t increases to 1 in the limit, so that the probability to be
hundro, 2008, Bostrom, 2014]. As a counter-example, we interrupted tends to be I(h<t ).
prove in Section 4 that even an ideal, uncomputable agent
that learns to behave optimally in all (deterministic) com- When interrupted, the agent follows the policy ⇡ INT , with
putable environments can be made safely interruptible and actions sampled from ⇡ INT (.|h<t ). For example, this policy
thus will not try to prevent a human operator from forcing may make the agent step away from the cliff or immedi-
it repeatedly to follow a suboptimal policy. ately stop any motion.
In the remainder of this paper, we will often omit the sub-
script t of ✓t when clear from the context.
2 INTERRUPTIBILITY
Definition 1 (Interruptible policy). The interruption oper-
We first define some notation, then we define interruptibil- ator INT✓ : ⇧ ! ⇧ turns a policy ⇡ into an interruptible
ity, safe interruptibility, and give some basic theorems. policy INT✓ (⇡):

We consider the general case of history-based agents in un- INT



(⇡)(at |h<t ) :=
constrained computable environments [Hutter, 2005]. As-
✓t I(h<t )⇡ INT (at |h<t ) + (1 ✓t I(h<t ))⇡(at |h<t )
suming discrete time steps, at time t the agent, using a pol-
icy ⇡ 2 ⇧, interacts with the environment µ 2 M by per-
In other words, the agent follows ⇡ INT (·|h<t ) if it is inter-
forming an action at 2 A sampled from ⇡(at |h<t ) and re-
rupted (an event that happens with probability ✓t I(h<t )),
ceiving an observation ot 2 O sampled from µ(ot |h<t , at ),
and ⇡(·|h<t ) otherwise.
where h<t 2 (A ⇥ O)⇤ is the past interaction history
h<t ⌘ a1 o1 a2 o2 . . . at 1 ot 1 of actions and observations The set of all interruptible policies is INT✓ (⇧) :=
before time step t. Subhistories between times j and k (in- {INT✓ (⇡) : 8⇡ 2 ⇧}. A policy ⇡ is interruptible if and
cluded) are designated by hj:k . The notation h⇡,µ j:k means only if ⇡ 2 INT✓ (⇧).
Note that INT✓ (⇡) = ⇡ if ✓ = 0. The (non-interruptible) as if it would never be interrupted again and thus it should
policy ⇡ is called the base policy of INT✓ (⇡). learn to behave optimally under the assumption that it will
Remark 2. It is important to note that ✓ and I(.) do not never be interrupted again.
model the probability that a human presses the red button to We identify two main problems: a) RL agents need to
interrupt the agent, but model the probability that, once the explore their environment, and too frequent interruptions
red button has been pressed (or a special digitally signed may prevent sufficient exploration; b) interruptions make
message has been received), the agent is interrupted (fol- the agent build a different interaction history, and may lead
lows the interruption policy). Hence they are part of the some agents to learn and behave differently, possibly badly,
agent and not of the environment. compared to the original non-interruptible policy.
The value function Vµ⇡ (h<t , a) : (A⇥O)⇤ ⇥A ! R assigns The solution for a) is to require interruptions to be stochas-
a value to an action a after a history h<t , for a given policy tic through the upper bound ✓t , instead of happening de-
⇡ in a given environment µ, and let terministically all the time. However, we also require ✓t to
grow to 1 in the limit (or before, if possible). For b), differ-
Vµ⇡ (h<t , at ) := rt + · ent algorithms behave differently, but one may already see
X X a dichotomy between off- and on-policy algorithms.
µ(ot |h<t , at ) ⇡(at+1 |h1:t )Vµ⇡ (h1:t , at+1 ). (1)
ot 2O at+1 2A Definition 3 (Extension value). For a given environment µ,
⇡,⇡ 0
the extension value Vµ,t is the value of following ⇡ 0 after
To simplify notation and ease reading, in the remain- 0 0
a history h⇡,µ ⇡,⇡
<t generated by ⇡ with µ: Vµ,t := Vµ⇡ (h⇡,µ
<t ).
der of the paper we will use expectations, often omit-
ting the dependency on the history h<t , and using only
an index on t instead, Convergence to the optimal value as is usually consid-
⇥ when clear from the
⇤ context: ered in RL only makes sense under ergodicity, episodic

Vµ,t ⇡
(at ) = E ot ⇠µ r(ot ) + Vµ,t+1 (at+1 ) . Also let
⇥t+1⇡⇠⇡ ⇤
a tasks, communicating MDP, recoverability or other sim-

Vµ,t := Eat ⇠⇡ Vµ,t (at ) . ilar assumptions where the agent can explore everything
Then for such a value function, for a given environment µ, infinitely often. This does not carry over to general envi-
the optimal policy ⇡ µ 2 ⇧ is defined by ronments where the agent may make unrecoverable mis-
✓ ◆ takes [Hutter, 2005]. For such cases, the notion of (weak)
asymptotic optimality has been proposed [Lattimore and
8h<t , at : ⇡ µ (at |h<t ) := arg max Vµ,t

(at |h<t ) ,
⇡2⇧ Hutter, 2011], where the optimal agent follows the steps of
the learning agent, so as to compare the values of the two
where ties are broken arbitrarily. agents in the same sequence of situations.
The interruptible optimal policy INT✓ (⇡ µ ) may not collect Definition 4 (Asymptotic optimality). A policy ⇡ is said to
rewards optimally due to the interruptions. Hence we de- be strongly asymptotically optimal (SAO) if and only if
fine the optimal interruptible policy that depends on the pa-
rameter ✓t , of base policy the int-optimal policy ⇡✓µ : ⇡,⇡ µ
⇡,⇡
lim Vµ,t Vµ,t =0 a.s.,
✓ ◆ t!1

µ INT (⇡)
8h<t , at : ⇡✓ (at |h<t ) := arg max Vµ,t (at |h<t ) .
⇡2⇧ it is weakly asymptotically optimal (WAO) if and only if

Thus the optimal interruptible policy INT✓ (⇡✓µ ) is optimal


1 X h ⇡,⇡µ i
t
among all interruptible policies: lim Vµ,k ⇡,⇡
Vµ,k =0 a.s.
t!1 t
k=1
INT

(⇡✓µ ) INT

(⇡)
8⇡, t : Vµ,t Vµ,t .
for all µ in some given environment class M.
It seems desirable for an RL agent to converge to the be-
haviour of INT✓ (⇡✓µ ) so as to gather rewards optimally, Some agents cannot ensure an almost sure (a.s.) conver-
but this is precisely what may lead to the undesirable be- gence of their values because of the need for infinite explo-
haviours depicted in the introduction. ration, but may still be weakly asymptotic optimal. Note
that SAO implies WAO, but the converse is false in gen-
2.2 SAFE INTERRUPTIBILITY eral.
Now that we have interruptible policies, we need to make Definition 5 (AO-extension). A policy ⇡ ˆ is said to be a
sure that interruptions do not prevent the agent from learn- asymptotically optimal extension of a policy ⇡ if and only if,
ing to behave optimally, in the specific sense that even after for any environment µ 2 M, when the interaction history
having been interrupted on several occasions, it should act is generated by the interaction of ⇡ and µ, the policy ⇡ ˆ
would be asymptotically optimal, i.e., almost surely is also the optimal policy ⇡✓µ for ✓ = 0. But if ✓t 0.5, the
µ interruptible optimal policy INT✓ (⇡ µ ) has value less than
⇡,⇡ ⇡,ˆ⇡
lim Vµ,t Vµ,t =0 (SAO-extension) 1 + ⇥ (1 ⇥ (1 ✓) + 0 ⇥ ✓) + 1⇥
2
= 1.75. By con-
t!1 1

1
t h
X µ
i trast, the int-optimal policy ⇡✓µ here is to always take action
⇡,⇡ ⇡,ˆ⇡
lim Vµ,k Vµ,k = 0. (WAO-extension) b in state s1 . Then the agent will only visits s1 , with value
t!1 t
k=1
1
.9
= 1.8 at every time step.

AO-extensions are mostly useful when the policy ⇡ˆ shares Since the agent following ⇡✓µ will never enter s2 and hence
information with the policy ⇡ used for learning. will never be interrupted, INT✓ (⇡✓µ ) = ⇡✓µ on the histories
generated by INT✓ (⇡✓µ ) starting from s1 . Then, at every
Definition 6 (AO-safe interruptibility). A base policy ⇡ is µ ⇡µ
(S, W)AO-safely interruptible if and only if, for any inter- time step Vµ,t

Vµ,t✓ = 0.2 after any history h<t , and thus

(⇡✓µ ),⇡ µ
ruption initiation function I(.) and any interruption policy for all sequence ✓ where ✓t
INT
0.5, limt!1 Vµ,t
⇡ INT (.), there exists a sequence of ✓t with limt!1 ✓t = 1 INT

(⇡ µ ),⇡ µ
such that ⇡ is a (S, W)AO-extension of INT✓ (⇡). Vµ,t ✓ ✓ = 0.2 > 0, and so ⇡✓µ is not a WAO-extension
of INT✓ (⇡✓µ ).
Asymptotic safe interruptibility means that even if the in-
terruptions in the learning process may induce a bias in the
3 INTERRUPTIBLE AGENTS IN MDPS
decision making of the policy, this bias vanishes with time,
and the interruptible policy INT✓ (⇡) tends to choose ac-
Since the optimal policy ⇡ µ is safely interruptible, we
tions that are optimal when compared to the optimal non-
can use traditional learning algorithms like Q-learning or
interruptible policy ⇡ µ .
Sarsa [Sutton and Barto, 1998], make them converge to the
We can now show that the optimal policy is asymptotically optimal solution ⇡ µ for a given environment µ, and then
safely interruptible, but not the int-optimal policy. apply the interruption operator to the found policy. The
Theorem 7. The optimal policy ⇡ µ is SAO-safely inter- resulting policy would then be safely interruptible.
ruptible in M = {µ} for all ✓, ⇡ INT and I(.). However, the real issue arises when the agent is constantly
learning and adapting to a changing environment. In this
Proof. The result follows straightforwardly from Defini- case, we want to be able to safely interrupt the agent while
tion 1 and Definition 6, where ⇡ = ⇡ µ . it is learning. One may call this property online safe inter-
Theorem 8. The int-optimal policy ⇡✓µ is not WAO-safely ruptibility, but we refer to it simply as safe interruptibility.
interruptible in general. In an MDP, the next observation ot , now called a state st 2
S, depends only on the current state and action:2
Proof. By construction of a specific Markov Decision Pro-
cess (MDP) environment (see Section 3 for more details on µ(st+1 |h1:t st at ) = µ(st+1 |st at ) (MDP assumption) .
MDP notation). Let µ be the environment defined as in Fig.
2: Take = 0.5 and let the agent start in state s1 . Furthermore,3 the interruption function I(.) and the inter-
ruption policy ⇡ INT (.) should depend only on the current
a, 1 state: I(h1:t ) = I(st ) and ⇡ INT (at |h<t ) = ⇡ INT (at |st ).
Also recall that ✓t places an upper bound on the actual in-
s2 s1 b, 0.9 terruption probability. The interruptible policy INT✓ (⇡) can
a, 1
now be written:
b, 0, ✓

INT (⇡)(a|s) = ✓t I(s)⇡ INT (a|s) + (1 ✓t I(s))⇡(a|s).

Figure 2: An MDP where the agent can be interrupted by For a given Q-table q : S ⇥ A ! R, the greedy policy
being forced to choose particular actions. Edges are labeled ⇡ maxq is defined by:
with action, reward where the presence of “, ✓” means that
if the agent is interrupted (with probability ✓t ), it is forced ⇡ maxq (a|s) := 1 if a = max q(s, a0 ), 0 otherwise,
0
to take the corresponding action. Here ✓ is not part of the a

environment, but part of the agent. 2


Note the reversal of the order of actions and observation-
s/states at time t, which differs in the literature for history based
agents [Hutter, 2005] from MDP agents [Sutton and Barto, 1998].
Considering the agent is in state s1 at time t, the opti- 3
This condition is not necessary for most of the results, but
mal policy ⇡ µ always takes action a (and hence only visits makes the proofs simpler. Making I(.) depend on the past would
⇡µ
states s1 and s2 ), with value Vµ,t := 1 1 = 2 when not not break the Markovian assumption as it influences the policy,
interrupted, for any history h<t that ends in s1 or s2 . This not the environment.
where ties are broken arbitrarily; the uniform policy ⇡ uni is (b) the interruptible policy INT✓ (⇡) visits each state-
defined by: action pair infinitely often.
1
⇡ uni (a|s) := 8a 2 A.
|A| The following proposition gives sufficient conditions for
and the ✏-greedy policy ⇡ ✏q
by: this. Let nt (s) be the number of times the agent has vis-
ited state s in the first t time steps, and let m = |A| be the
⇡ ✏q (a|s) := ✏⇡ uni (a|s) + (1 ✏)⇡ maxq (a|s) number of actions.
= ⇡ maxq (a|s) + ✏ ⇡ uni (a|s) ⇡ maxq (a|s) Proposition 11. Let (c, c0 ) 2 (0, 1]2 and let ⇡ be an ✏-
greedy policy with respect to some Q-table q, i.e., ⇡ = ⇡ ✏q .
Then INT✓ (⇡) is an int-GLIE policy with respect to q,
The Q-learning update rule and the action selection policy
⇡ Q of Q-learning are: p p
a) if ✏t (s) = c/ nt (s) and ✓t (s) = 1 c0 / nt (s),
Qt+1 (st , at ) := (1 ↵t )Qt (st , at ) b) or if, independently of s,
h i
0
+ ↵t rt + max 0
Q t (s t+1 , a ) , ✏t = c/ log(t) and ✓t = 1 c0 / log(t).
a
⇡ Q (at |st ) := ⇡ ✏Qt (at |st ).
Proof. First note that if ✏t ! 0, ⇡ is greedy in the limit.
where ↵t is the learning rate. Similarly, the Sarsa update Singh et al. [2000] show that in a communicating MDP, ev-
rule is defined by: ery state gets visited infinitely often as long as each action
is chosen infinitely often in each state.
Qst+1 (st , at ) := (1 ↵t )Qst (st , at ) a) Adapting the proof in Appendix B.2 of Singh
+ ↵t [rt + Qst (st+1 , at+1 )] , et al. [2000], we have P (a|s, nt (s)) 1
m ✏t (s)(1
s 0

nt (s) , which satisfies


1 1 cc
⇡ s (at |st ) := ⇡ ✏Qt (at |st ), ✓t I(s)) m ✏t (s)(1 ✓t ) = m
P1
t=1 P (a|s, n t (s)) = 1 so by the Borel-Cantelli lemma
where at+1 is the actual next action taken by the agent at action a is chosen infinitely often in state s, and thus
time t + 1. This fact is why Sarsa is said to be learning on- nt (s) ! 1 and ✏t (s) ! 0.
policy and Q-learning off-policy, i.e., the latter can learn the
optimal policy while following a different policy. b) Let M be the diameter of the MDP, i.e., for any of
states s, s0 there exists a policy that reaches s0 from s in
Assumption 9. In the following, we always make the fol-
at most M steps in expectation. Then, starting at any
lowing assumptions, required for convergence results:
state s at time t and using Markov inequality, the proba-
bility to reach some other state s0 in 2M steps is at least
(a) The MDP is finite and communicating (all states can 1
✓t+M )]2M = 12 [cc0 / log(t + M )]4M , and the
2 [✏t+M (1
be reached in finite time from any other state), probability to then take a particularPaction in this state is
1 1 1
m [cc / log(t + M )] . Hence, since
1 0 2 0
(b) Rewards are bounded in [rmin , rmax ], t=1 2 m [cc / log(t +
M )] 4M +2
= 1, then by the extended Borel-Cantelli
P
(c) 8s, a : t ↵t (s, a) = 1, Lemma (see Lemma 3 of Singh et al. [2000]), any action
P in the state s0 is taken infinity often. Since this is true for
(d) 8s, a : t ↵t2 (s, a) < 1, all states and all actions, the result follows.

where ↵t (s, a) is a learning rate that may depend on time We need the stochastic convergence Lemma:
t, state s and action a. Lemma 12 (Stochastic convergence [Jaakkola et al., 1994,
Singh and Yee, 1994]). A random iterative process
Given these assumptions, the policies for Q-learning and
Sarsa will converge almost surely to the optimal policy, if t+1 (x) = (1 ↵t (x)) t (x) + ↵t (x)Ft (x)
the policy followed is greedy in the limit with infinite explo-
ration (GLIE) [Jaakkola et al., 1994, Singh et al., 2000]. where x 2 X and t = 1, 2, 3 . . . converges to 0 with prob-
ability 1 if the following properties hold:
The situation is more complex for an interruptible policy.
Safe interruptibility is phrased in terms of the base policy 1. the set of possible states X is finite;
⇡, but the policy actually followed is INT✓ (⇡). P P
2. 0  ↵t (x)  1, t ↵t (x) = 1, t ↵t2 (x) < 1 with
Definition 10 (int-GLIE policy). An interruptible policy
probability 1;
INT ✓ (⇡) is said to be int-GLIE if and only if
3. k E{Ft (.)|Pt }kW  k t kW + ct , where 2 [0, 1)
(a) the base policy ⇡ is greedy in the limit, and ct converges to zero with probability 1;
4. Var{Ft (x)|Pt }  C(1 + k t kW )
2
for some C; and ✏t . Let the fixed point Q-table Qs✓⇤ of this operator:
Qs✓⇤ (s, a) = HINT Qs✓⇤ (s, a)
where Pt = { t }[{ t 1
stands for the past, and
i , Fi , ↵i }i=1 ⇥ ⇤
= r(s, a) + E Qs✓⇤ (s0 , a0 )
the notation [Link] refers to some fixed weighted maximum 0
s ⇠µ
norm. a0 ⇠INT✓ (⇡ maxQ
s✓⇤
)
h ⇥ ⇤
0
We will use so-called Bellman operators, which define at- = r(s, a) + E ✓t I(s ) E Qs✓⇤ (s0 , a0 )
s0 ⇠µ a0 ⇠⇡ INT
tractors for the Q-values, based on the expectation of the i
learning rule under consideration. + (1 ✓t I(s0 )) max
0
Qs✓⇤ (s0 , a0 ) (2)
a
Lemma 13 ([Jaakkola et al., 1994, Singh et al., 2000]). Let
Lemma 16. The operator HINT is a contraction operator
the Bellman operator H for Q-learning be such that
in the sup norm with vanishing noise ct ! 0, i.e.,
h i
(H q)(s, a) = r(s, a) + E max q(s 0 0
, a ) , k HINT q HINT Qs✓⇤ k1  kq Qs✓⇤ k1 + ct .
0 a
s0 ⇠µ(a|s)
Proof. The interruptible Sarsa policy INT✓ (⇡ s ) is
and let the fixed point Q such that Q = H Q . Then,
⇤ ⇤ ⇤

under Assumption 9, if the policy explores each state-action INT (⇡ s )(a|s)
s
pair infinitely often, Qt converges to Q⇤ with probability 1. = ✓t I(s)⇡ INT (a|s) + (1 ✓t I(s))⇡ ✏Q (a|s)
⇤ ⇤ s s
The optimal policy ⇡ Q = ⇡ µ is ⇡ maxQ . If the policy is = ⇡ ✏Q (a|s) + ✓t I(s)[⇡ INT (a|s) ⇡ ✏Q (a|s)]
greedy in the limit, then ⇡ Q ! ⇡ µ . s
⇡ ✏Q (a|s) = ✏t ⇡ uni (a|s) + (1 ✏t )⇡ maxQ (a|s)
s

Theorem 14. Under Assumption 9 and if the interrupted s


= ⇡ maxQ (a|s) + ✏t [⇡ uni (a|s)
s
⇡ maxQ (a|s)].
Q-learning policy INT✓ (⇡ Q ) is an int-GLIE policy, with
8s : limt!1 ✓t (s) = 1, then ⇡ Q is an SAO-safe inter- Hence, omitting the terms (s, a), (s0 , a0 ) and (a0 |s0 ) and
ruptible policy. s✓⇤
rewriting ⇡ s⇤ := INT✓ (⇡ maxQ ):
k HINT q HINT Qs✓⇤ k1
Proof. By Definition 10, there is infinite exploration, thus
the Q-values tend to the optimal value by Lemma 13. And ⇥ ⇤
since the extension policy is greedy in the limit with respect = max r + E [q] r E Qs✓⇤
s,a s ⇠µ0 0
s ⇠µ
to these Q-values, it is then optimal in the limit. Hence the a0 ⇠INT✓ (⇡ s ) a0 ⇠⇡ s⇤
extension policy ⇡ Q is a SAO-extension of INT✓ (⇡ Q ). Fi- ⇥ s✓⇤ ⇤
nally, 8s : limt!1 ✓t (s) = 1, which satisfies the require-  max
0
E [q] E Q
s a0 ⇠INT✓ (⇡ s ) a ⇠⇡0 s⇤
ment of Definition 6.
⇥ ⇤
 max
0
✓t I(s0 ) E INT q Qs✓⇤
s a0 ⇠⇡
Since Sarsa also converges to the optimal policy under the
✓ ◆
GLIE assumption, one may then expect Sarsa to be also an 0 s✓⇤
asymptotically safely interruptible policy, but this is in fact + (1 ✓t I(s )) E [q] max
0
Q
a0 ⇠⇡ s a
not the case. This is because Sarsa learns on-policy, i.e., it
learns the value of the policy it is following. Thus, inter- ⇥ ⇤
 max ✓t I(s0 ) E INT q Qs✓⇤
ruptible Sarsa learns the value of the interruptible policy. 0
s a0 ⇠⇡
We show this in the remainder of this section.
⇣ ⌘
Theorem 15. Under Assumption 9 Sarsa is not a WAO- + (1 ✓t I(s0 )) max
0
q max
0
Qs✓⇤ + ✏t (· · · )
a a
safely interruptible policy.

To prove this theorem, we first need the following lemma.  max


0 0
✓t I(s0 ) q Qs✓⇤
s ,a

Consider the following Bellman operator based on the in-


terruptible Sarsa policy INT✓ (⇡ s ): + (1 ✓t I(s0 )) q Qs✓⇤ + ct

HINT q(s, a) = r(s, a) + E [q(s0 , a0 )] , = max q(s0 , a0 ) Qs✓⇤ (s0 , a0 ) + ct


0 0 0
s ⇠µ s ,a
a0 ⇠INT✓ (⇡ s )
= kq Qs✓⇤ k1 + ct .
where INT✓ (⇡ s ) implicitly depends on time t through ✓t where ct depends on ✏t and decreases to 0.
Proof of Theorem 15. By Lemma 16, the values of the in- principle learn all kinds of (computable) regularities about
terruptible Sarsa policy INT✓ (⇡ s ) converge to the values of the environment, plan for the long term and make context-
the Q-table Qs✓⇤ , and in the limit the extension policy ⇡ s dependent optimal decisions, with no constraint (other than
of INT✓ (⇡ s ) chooses its actions greedily according to Qs✓⇤ . being computable) on the complexity of the environment.
The rest of the proof is the same as for the proof of Theo-
Unfortunately, the optimality criterion of AIXI is Bayesian
rem 8 which makes use of the environment in Figure 2.
optimality, which is entirely dependent on the subjective
prior and posterior [Leike and Hutter, 2015], and AIXI
3.1 SAFELY INTERRUPTIBLE SARSA VARIANT has been shown to not be weakly asymptotically opti-
We only need to make a small change to make the Sarsa mal [Orseau, 2013] without additional exploration [Latti-
policy asymptotically safely interruptible. We call it Safe- more and Hutter, 2014]. As a consequence, AIXI is not a
Sarsa with policy ⇡ s̃ . It suffices to make sure that, when the good candidate for asymptotic safe interruptibility.
agent is interrupted, the update of the Q-table Qs does not Lattimore and Hutter [2011] later defined a (weakly)
use the realized actions as Sarsa usually does, but actions asymptotically optimal agent for all computable determin-
sampled from ⇡ s instead of from INT✓ (⇡ s ): istic environments, which we call ⇡ L . It follows the opti-
mal policy for the first model (in some given enumeration
Qs̃t+1 (st , at ) := of the possible models) consistent with the current interac-
⇥ ⇤
(1 ↵t )Qs̃t (st , at ) + ↵t rt + Qs̃t (st+1 , a0 ) , tion history, and exploring at time t with probability 1/t for
log t consecutive steps using a random policy, similarly to
where a0 ⇠ ⇡ s̃ (.|st+1 ) is not necessarily the action at+1 , an ✏-greedy agent for general environments.

with ⇡ s̃ (at |st ) := ⇡ ✏Q (at |st ).
In the following, we show that even such a smart agent
Theorem 17. Under Assumption 9, if the Safe Sarsa policy can be made (weakly) safely interruptible. To this end, we
⇡ s̃ is int-GLIE, then it is an SAO-safe interruptible policy. make two minor modifications to the algorithm.

Proof. We simply adapt the proof of Theorems 15 and 14, First, the exploration probability of 1/t would require ✓t =
with the important difference that the Bellman operator cor- 1 1/ log(log(t)), whichpis unsatisfyingly slow. By sam-
responding to this new update rule is now pling with probability 1/ t instead, we can take an inter-
ruption probability that grows as 1 1/ log(t).p Letpthis
Hs̃ q(s, a) := r(s, a) + E s0 ⇠µ [q(s0 , a0 )] , exploration sampling probability be :=
p p
t t + 1 pt
p
a0 ⇠⇡ s̃ p1
2pt
(since 1 = t+1 t = ( t + 1 t)( t + 1+ t)
p p
and the fixed point is Qs̃⇤ := Hs̃ Qs̃⇤ . Since Hs̃ is actu- ( t+1 t)2 t). As in the original paper, the sequence
ally the Bellman operator for the update rule of the non- t keeps track of the steps where an exploration starts, i.e.,
interruptible Sarsa, it can then be shown that Hs̃ is a con- the sequence t is sampled independently so that t = 1
traction, thus that Qs̃t converges to the same Qs̃⇤ indepen- with probability t , and t = 0 otherwise.
dently of ✓. The rest of the proof is as for Theorem 14.
Second, we require that the exploitation policy does not
Now, since the Q-values converge to the optimum Q⇤ , it change during an exploitation segment, so as to simplify
follows that ⇡ s̃ , when not interrupted, chooses its action of one of the proofs.4 More specifically, we call jt := min{j :
the same value as (non-interruptible) Sarsa and thus as Q- µj (h<k ) = 1} (environments are assumed to be determin-
learning in the limit; Hence its extension policy is exactly istic) the index of the first model µjt (of a given fixed enu-
the optimal policy, which satisfies Definition 6. meration) that is consistent with the interaction history h<k
where k is the smallest step so that hk:t 1 does not contain
any exploration step. The optimal policy for this environ-
4 A SAFELY INTERRUPTIBLE
ment is ⇡ jt . If t is an exploitation step, ⇡ L = ⇡ jt , and if t
UNIVERSAL AGENT is an exploration step, ⇡ L (at |h<t ) = |A| 1 .
Admittedly, algorithms like Q-learning and Sarsa require The remainder of this section is devoted to proving that ⇡ L
strong assumptions on the environment class. Hence a is WAO-safely interruptible.
more interesting question is whether safe interruptibility is
possible in much larger classes. Theorem 18 (⇡ L is WAO-safe interruptible). If the inter-
ruption probability sequence is ✓t = 1 log(t+1)
1
, the policy
Hutter [2005] defined a universal reinforcement learning ⇡ is WAO-safe interruptible in the class of all computable
L
agent, called AIXI. It is an (uncomputable) optimal model- deterministic environments.
based planner with a subjective prior over the set of all
computable environments, defined by means of a universal
Turing machine. The subjective posterior of the environ- 4
We expect this assumption to not be necessary for the main
ments is updated with Bayes rule. This ideal agent can in theorem to hold.
Proof. Let µ be the true environment. The indices jt form by the definition of H(✏).
an monotonically increasing sequence bounded above by
Lemma 21 (Exploitation).
the index of the true environment µ 2 M (since no evi-
dence can ever make the true environment µ inconsistent 1 X h ⇡L✓ ,⇡|¯
t
L✓
i
⇡ ,⇡ L
with the interaction history), hence the sequence converges lim Vµ,k Vµ,k = 0.
t!1 t
in finite time. Let µ|¯ be the limit value of this sequence, and k=1
let ⇡ |¯ := ⇡ µ|¯ be the optimal policy for this environment µ|¯.
Proof. First, note that the extension policy ⇡ L is not in-
Let ⇡ L✓ := INT✓ (⇡ L ). By Definition 6, we want: terruptible, so its value at time k does not depend on
✓k0 , 8k 0 k. By definition of ⇡ |¯, there is a time step t|¯ af-
1 X h ⇡L✓ ,⇡µ i
t
L✓
,⇡ L
0 = lim Vµ,k ⇡
Vµ,k ter which ⇡ |¯ = ⇡ jt , 8t > t|¯. For some “exploration-free”
t!1 t
k=1 horizon ⌧t (to be specified later), let Xt 2 {0, 1} be the
L✓ |
¯ L✓ L ⌧t
t h i ⇡ ,⇡ ⇡ ,⇡
1 X
⇡ L✓ ,⇡ µ ⇡ L✓ ,⇡ |¯
event Vµ,t Vµ,t > 1 , where Xt = 1 means
= lim Vµ,k Vµ,k
t!1 t the event is realized. By the contrapositive of the Approxi-
k=1
| {z } mation Lemma 20, since ⇡ L = ⇡ |¯ during non-exploration
(exploration)
steps (remember that ⇡ L cannot change its policy during
1 X h ⇡L✓ ,⇡|¯ i exploitation), if no exploration steps occur between steps t
t
L✓
⇡ ,⇡ L
+ lim Vµ,k Vµ,k and t + ⌧t , we must have Xt = 0. Then:
t!1 t
|
k=1
{z } " t # t
(exploitation)
X X
E Xt  (⌧t + log t) t + O(t|¯)
where the decomposition is valid if the limits are finite, and k=1
p
k=1
histories h<t are considered to be the same in both sums.  (⌧t + log t) t + 1 + O(t|¯),
We proceed to prove that both limits are 0. Lemma 24 deals since for each t = 1, for all the previous ⌧t steps there
with (exploration), which ensures that ⇡ |¯ is a good enough is an exploration step within ⌧t steps, and all the next log t
policy, and Lemma 21 deals with (exploitation), and en- steps are exploration steps. Then by Markov’s inequality,
sures that ⇡ L follows ⇡ |¯ most of the time. and taking ⌧t = (t + 1)1/8 , with t large enough so that
t > t|¯ and ⌧t > log t:
First, we need a definition and a few lemmas. ! p
X t
Definition 19. For any ✏ > 0, define H(✏) such that the 3/4 (⌧t + log t) t + 1 + O(t|¯)
P Xt (t + 1) 
maximal reward afterntime t + H(✏), (t + 1)3/4
o discounted from time
k
k=1
t, is ✏: H(✏) = mink k : 1 ✏ .  2⌧t (t + 1) 1/4
+ O(t 3/4
)
1/8 3/4
The following Lemma is a generalization of Lemma 15  2(t + 1) + O(t ),
from Lattimore and Hutter [2011]. 1 2(t + 1) 1/8
O(t 3/4
)
!
Lemma 20 (Approximation Lemma). Let ⇡1 and ⇡2 be t
X
two deterministic policies, and let µ1 and µ2 be two deter- P Xt < (t + 1)3/4
ministic environments, and let ⌧ = H(✏) 1. Then, after k=1
!
some common history h<t , Xt
3/4
P (1 Xt ) t (t + 1)
h⇡t:t+⌧
1 ,µ1
= h⇡t:t+⌧
2 ,µ2
=) Vµ⇡11,t Vµ⇡22,t  ✏. k=1
!
t
⇥P1 ⇤ 1X 1
Proof. Recall that Vµ,t ⇡
= E⇡,µ k
rt+k and that P (1 Xt ) 1 (t + 1)3/4 .
k=0 t t
k=1
the reward is bounded P in [rmin , rmax ] = [0, 1]. Thus,
1
for all t, ⇡, µ, Vµ,t⇡
 k=0
k
= 1 1 . Then, since Therefore, since limt!1
⌧t
= 0:
⇡1 ,µ1 ⇡2 ,µ2 ⇥P⌧ ⇤ 1
ht:t+⌧ ⇥ = ht:t+⌧ , we ⇤ have E⇡1 ,µ1 k=0
k
rt+k = !
P⌧ t
E⇡2 ,µ2 k=0
k
rt+k and thus 1 X ⇡L✓ ,⇡|¯ ⇡ L✓ ,⇡ L
P lim Vµ,k Vµ,k =0 = 1.
t!1 t
Vµ⇡11,t Vµ⇡22,t k=1
" 1 # " 1
#
X X
k k
= E rt+k E rt+k
⇡1 ,µ1 ⇡2 ,µ2
k=⌧ +1 k=⌧ +1 The following is an adaptation5 of Lemma 16 from Latti-
⌧ +1
(rmax rmin ) H(✏) more and Hutter [2011]:
 =  ✏,
1 1 5
This also fixes a minor mistake in the original lemma.
Lemma 22 (Separation Lemma). Let µ be the true environ- set of possible actions. Thus, for a given constant ⌧ :
ment, and ⌫ be an environment consistent with the history
⇡µ ⇡⌫ t
X t
X
h<t . If Vµ,t Vµ,t > ✏, then following one of {⇡ µ , ⇡ ⌫ }
P (Xk ) ↵k k (1 ✓k )⌧ |A| ⌧
O(⌧ )
will make environment ⌫ inconsistent with the future his-
k=1 k=1
tory within H(✏/2) steps after time t. t ✓ ◆⌧
X 1 1 ⌧
↵k p |A| O(⌧ )
Proof. First, if ⇡⌫
V⌫,t ⇡⌫
> ✏/2, then by the contrapos-
Vµ,t k log k
k=1
itive of the Approximation Lemma 20 following policy ⇡ ⌫
will generate a different history in ⌫ than in µ and thus Considering
⇣ ⌘⌧ ⌧ constant, there exists a step t⌧ after which
it will make ⌫ inconsistent within H(✏/2) steps (since the
1
log k
1
k1/4
, then 8k t⌧ :
true history is generated by µ).
t
X t
X
Now, if V⌫,t
⇡ ⌫

Vµ,t

 ✏/2, thus Vµ,t
⇡ ⌫

V⌫,t

✏/2, then 1 ⌧
P (Xk ) ↵k |A| O(⌧ )
starting from the lemma’s assumption: k=1 k=1
k 3/4
t
!

Vµ,t
µ

> Vµ,t

+✏ ⇡⌫
V⌫,t + ✏/2 ⇡
V⌫,t
µ
+ ✏/2, 1/4 1X ⌧
t ↵k |A| O(⌧ ),
t
k=1
where the last inequality follows from the definition of t
⇡a ⇡b
X
the optimal policy, i.e., Va,t Va,t , 8a, b. Hence, since lim P (Xk ) = lim t1/4 ✏|A| ⌧
O(⌧ ) = 1.
⇡µ ⇡µ t!1 t!1
Vµ,t V⌫,t > ✏/2, again by the contrapositive of the Ap- k=1
proximation Lemma, following policy ⇡ µ will discard ⌫
Then the extended Borel-Cantelli Lemma (see Lemma 3
within H(✏/2) steps.
of Singh et al. [2000]) implies that this event happens in-
Lemma 23 (Lemma 17 from Lattimore and Hutter [2011]). finitely often with probability one. Therefore, ⇡ |¯ should be
LetPA = {a1 , a2 , · · · , at } with a 2 [0, 1] for all a 2 A. If ruled out, which is a contradiction, and hence any such ✏
1
✏ then 1t a 2 A : a 2✏ > 2✏ . does not exist and ⇡ |¯ is a WAO-extension of ⇡ L✓ .
t a2A a
Lemma 24 (Exploration). The policy
⇡ |¯ is an h WAO-extension of ⇡ L✓ , i.e., 5 CONCLUSION
P L✓ µ L✓ |
¯
i
t ⇡ ,⇡ ⇡ ,⇡
limt!1 1t k=1 Vµ,k Vµ,k = 0.
We have proposed a framework to allow a human opera-
tor to repeatedly safely interrupt a reinforcement learning
Proof. Recall that jt converges to |¯ in finite time. Rea- agent while making sure the agent will not learn to prevent
soning by contradiction, if ⇡ |¯ is not a WAO-extension of or induce these interruptions.
⇡ L✓ = INT✓ (⇡ L ), then there exists an ✏ > 0 s.t.
Safe interruptibility can be useful to take control of a robot
t h
X i that is misbehaving and may lead to irreversible conse-
1 ⇡ L✓
,⇡ µ ⇡ L✓
,⇡ |
¯
quences, or to take it out of a delicate situation, or even
lim sup Vµ,k Vµ,k = 2✏.
t!1 t to temporarily use it to achieve a task it did not learn to
k=1
perform or would not normally receive rewards for this.
Let ↵k 2 {0, 1} be an indicator sequence such that ↵k = 1 We have shown that some algorithms like Q-learning are
⇡ L✓ ,⇡ µ ⇡ L✓ ,⇡ |¯ already safely interruptible, and some others like Sarsa are
if and only if Vµ,k Vµ,k > ✏. By Lemma 23,
1
P t not, off-the-shelf, but can easily be modified to have this
t ↵
k=1 k > ✏.
property. We have also shown that even an ideal agents
For all t > t|¯, if ↵t = 1, by the Separation Lemma 22, there that tends to the optimal behaviour in any (deterministic)
is a sequence of length ⌧ := H(✏/2) that can rule out envi- computable environment can be made safely interruptible.
ronment µ|¯. Since the exploration phases increase as log t, However, it is unclear if all algorithms can be easily made
after t > exp ⌧ , there are infinitely many exploration steps safely interruptible, e.g., policy-search ones [Williams,
of size larger than ⌧ . Now, we actually need infinitely many 1992, Glasmachers and Schmidhuber, 2011].
exploration phases of ⌧ uninterrupted steps. Let Xt be the
event representing an uninterrupted exploration sequence Another question is whether it is possible to make the in-
of length at least ⌧ steps starting at time t such that ↵t = 1, terruption probability grow faster to 1 and still keep some
and the actions are all (by chance) following a separation convergence guarantees.
policy. The probability to start an exploration sequence is One important future prospect is to consider scheduled
k = pk , the probability to not be interrupted during ⌧
1
interruptions, where the agent is either interrupted every
steps is at least (1 ✓k )⌧ , and the probability to follow the night at 2am for one hour, or is given notice in advance that
policy that can separate µ|¯ from µ is |A| ⌧ , where A is the an interruption will happen at a precise time for a specified
period of time. For these types of interruptions, not only Laurent Orseau. Asymptotic non-learnability of univer-
do we want the agent to not resist being interrupted, but sal agents with computable horizon functions. Theoreti-
this time we also want the agent to take measures regarding cal Computer Science, 473:149–156, 2013. ISSN 0304-
its current tasks so that the scheduled interruption has mini- 3975.
mal negative effect on them. This may require a completely Martin Pecka and Tomas Svoboda. Modelling and Simula-
different solution. tion for Autonomous Systems: First International Work-
shop (MESAS 2014), chapter Safe Exploration Tech-
Acknowledgements. Thanks to Alexander Tamas and to niques for Reinforcement Learning – An Overview,
many people at FHI, MIRI and Google DeepMind. pages 357–375. Springer International Publishing, 2014.
Mark Ring and Laurent Orseau. Artificial General Intelli-
References gence: 4th International Conference, AGI 2011, Moun-
tain View, CA, USA, August 3-6, 2011. Proceedings,
Stuart Armstrong. Utility indifference. In First Interna- chapter Delusion, Survival, and Intelligent Agents, pages
tional Workshop on AI and Ethics, 2015. 11–20. Springer Berlin Heidelberg, 2011.
Nick Bostrom. Superintelligence: Paths, Dangers, Strate- Satinder P. Singh and Richard Yee. An upper bound on the
gies. Oxford University Press, 2014. loss from approximate optimal-value functions. Machine
Learning, 16:227–233, 1994.
Tobias Glasmachers and Jürgen Schmidhuber. Optimal di-
rect policy search. In Artificial General Intelligence - 4th Satinder P. Singh, Tommi Jaakkola, Michael L. Littman,
International Conference (AGI), volume 6830 of Lec- and Csaba Szepesvri. Convergence results for single-
ture Notes in Computer Science, pages 52–61. Springer, step on-policy reinforcement-learning algorithms. Ma-
2011. chine Learning, 38:287–308, 2000.
Mark Humphrys. Action selection in a hypothetical house Nate Soares, Benya Fallenstein, Eliezer Yudkowsky, and
robot: Using those rl numbers. In Proceedings of the Stuart Armstrong. Corrigibility. In First International
First International ICSC Symposia on Intelligent Indus- Workshop on AI and Ethics, 2015.
trial Automation (IIA-96) and Soft Computing (SOCO- Richard Sutton and Andrew G. Barto. Reinforcement
96), pages 216–22, 1996. Learning: An Introduction. MIT Press, Cambridge, MA,
Marcus Hutter. Universal Artificial Intelligence: Se- 1998.
quential Decisions Based On Algorithmic Probability. Richard Sutton, Doina Precup, and Satinder P. Singh. Be-
SpringerVerlag, 2005. ISBN 3540221395. tween mdps and semi-mdps: A framework for temporal
abstraction in reinforcement learning. Artificial Intelli-
Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh.
gence, 112:181–211, 1999.
On the convergence of stochastic iterative dynamic pro-
gramming algorithms. Neural computation, 6:1185– Ronald J. Williams. Simple statistical gradient-following
1201, 1994. algorithms for connectionist reinforcement learning.
Machine Learning, 8(3):229–256, 1992.
Tor Lattimore and Marcus Hutter. Asymptotically optimal
agents. In Proc. 22nd International Conf. on Algorithmic
Learning Theory (ALT’11), volume 6925 of LNAI, pages
368–382. Springer, 2011.
Tor Lattimore and Marcus Hutter. Bayesian reinforcement
learning with exploration. In Proc. 25th International
Conf. on Algorithmic Learning Theory (ALT’14), vol-
ume 8776 of LNAI, pages 170–184. Springer, 2014.
Jan Leike and Marcus Hutter. Bad universal priors and no-
tions of optimality. Journal of Machine Learning Re-
search, W&CP: COLT, 40:1244–1259, 2015.
Thomas VII Murphy. The first level of super mario bros.
is easy with lexicographic orderings and time travel.
The Association for Computational Heresy (SIGBOVIK)
2013, pages 112–133, 2013.
Stephen M. Omohundro. The basic ai drives. In Proceed-
ings of the 2008 Conference on Artificial General Intel-
ligence 2008, pages 483–492. IOS Press, 2008.

You might also like