0% found this document useful (0 votes)
21 views10 pages

COMA: Counterfactual Multi-Agent RL

This document introduces a new multi-agent reinforcement learning method called Counterfactual Multi-Agent (COMA) Policy Gradients. COMA uses a centralized critic to estimate the Q-function during learning, while using decentralized actors to optimize each agent's policy. To address multi-agent credit assignment challenges, COMA uses a counterfactual baseline that marginalizes out a single agent's action while keeping other agents' actions fixed. COMA was evaluated on a StarCraft micromanagement task and achieved better performance than other multi-agent actor-critic methods.

Uploaded by

Kristine Wah
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views10 pages

COMA: Counterfactual Multi-Agent RL

This document introduces a new multi-agent reinforcement learning method called Counterfactual Multi-Agent (COMA) Policy Gradients. COMA uses a centralized critic to estimate the Q-function during learning, while using decentralized actors to optimize each agent's policy. To address multi-agent credit assignment challenges, COMA uses a counterfactual baseline that marginalizes out a single agent's action while keeping other agents' actions fixed. COMA was evaluated on a StarCraft micromanagement task and achieved better performance than other multi-agent actor-critic methods.

Uploaded by

Kristine Wah
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Counterfactual Multi-Agent Policy Gradients

Jakob N. Foerster1,† Gregory Farquhar1,†


[Link]@[Link] [Link]@[Link]

Triantafyllos Afouras1 Nantas Nardelli1 Shimon Whiteson1


afourast@[Link] nantas@[Link] [Link]@[Link]

1 †
University of Oxford, United Kingdom Equal contribution
arXiv:1705.08926v2 [[Link]] 14 Dec 2017

Abstract training of decentralised policies is a standard paradigm for


multi-agent planning (Oliehoek, Spaan, and Vlassis 2008;
Many real-world problems, such as network packet routing
and the coordination of autonomous vehicles, are naturally
Kraemer and Banerjee 2016) and has recently been picked
modelled as cooperative multi-agent systems. There is a great up by the deep RL community (Foerster et al. 2016; Jorge,
need for new reinforcement learning methods that can ef- Kågebäck, and Gustavsson 2016). However, the question of
ficiently learn decentralised policies for such systems. To how best to exploit the opportunity for centralised learning
this end, we propose a new multi-agent actor-critic method remains open.
called counterfactual multi-agent (COMA) policy gradients. Another crucial challenge is multi-agent credit assign-
COMA uses a centralised critic to estimate the Q-function ment (Chang, Ho, and Kaelbling 2003): in cooperative set-
and decentralised actors to optimise the agents’ policies. In tings, joint actions typically generate only global rewards,
addition, to address the challenges of multi-agent credit as- making it difficult for each agent to deduce its own con-
signment, it uses a counterfactual baseline that marginalises
out a single agent’s action, while keeping the other agents’
tribution to the team’s success. Sometimes it is possible to
actions fixed. COMA also uses a critic representation that al- design individual reward functions for each agent. However,
lows the counterfactual baseline to be computed efficiently in these rewards are not generally available in cooperative set-
a single forward pass. We evaluate COMA in the testbed of tings and often fail to encourage individual agents to sacri-
StarCraft unit micromanagement, using a decentralised vari- fice for the greater good. This often substantially impedes
ant with significant partial observability. COMA significantly multi-agent learning in challenging tasks, even with rela-
improves average performance over other multi-agent actor- tively small numbers of agents.
critic methods in this setting, and the best performing agents In this paper, we propose a new multi-agent RL method
are competitive with state-of-the-art centralised controllers
called counterfactual multi-agent (COMA) policy gradients,
that get access to the full state.
in order to address these issues. COMA takes an actor-critic
(Konda and Tsitsiklis 2000) approach, in which the actor,
1 Introduction i.e., the policy, is trained by following a gradient estimated
Many complex reinforcement learning (RL) problems such by a critic. COMA is based on three main ideas.
as the coordination of autonomous vehicles (Cao et al. First, COMA uses a centralised critic. The critic is only
2013), network packet delivery (Ye, Zhang, and Yang 2015), used during learning, while only the actor is needed during
and distributed logistics (Ying and Dayong 2005) are nat- execution. Since learning is centralised, we can therefore use
urally modelled as cooperative multi-agent systems. How- a centralised critic that conditions on the joint action and all
ever, RL methods designed for single agents typically fare available state information, while each agent’s policy condi-
poorly on such tasks, since the joint action space of the tions only on its own action-observation history.
agents grows exponentially with the number of agents. Second, COMA uses a counterfactual baseline. The idea
To cope with such complexity, it is often necessary to re- is inspired by difference rewards (Wolpert and Tumer 2002;
sort to decentralised policies, in which each agent selects its Tumer and Agogino 2007), in which each agent learns from
own action conditioned only on its local action-observation a shaped reward that compares the global reward to the re-
history. Furthermore, partial observability and communica- ward received when that agent’s action is replaced with a
tion constraints during execution may necessitate the use of default action. While difference rewards are a powerful way
decentralised policies even when the joint action space is not to perform multi-agent credit assignment, they require ac-
prohibitively large. cess to a simulator or estimated reward function, and in gen-
Hence, there is a great need for new RL methods that eral it is unclear how to choose the default action. COMA
can efficiently learn decentralised policies. In some settings, addresses this by using the centralised critic to compute an
the learning itself may also need to be decentralised. How- agent-specific advantage function that compares the esti-
ever, in many cases, learning can take place in a simulator mated return for the current joint action to a counterfactual
or a laboratory in which extra state information is avail- baseline that marginalises out a single agent’s action, while
able and agents can communicate freely. This centralised keeping the other agents’ actions fixed. This is similar to cal-
culating an aristocrat utility (Wolpert and Tumer 2002), but to be used during learning and do not address the multi-agent
avoids the problem of a recursive interdependence between credit assignment problem.
the policy and utility function because the expected contri- Gupta, Egorov, and Kochenderfer (2017) investigate
bution of the counterfactual baseline to the policy gradient actor-critic methods for decentralised execution with cen-
is zero. Hence, instead of relying on extra simulations, ap- tralised training. However, in their methods both the actors
proximations, or assumptions regarding appropriate default and the critic condition on local, per-agent, observations and
actions, COMA computes a separate baseline for each agent actions, and multi-agent credit assignment is addressed only
that relies on the centralised critic to reason about counter- with hand-crafted local rewards.
factuals in which only that agent’s action changes. Most previous applications of RL to StarCraft microman-
Third, COMA uses a critic representation that allows the agement use a centralised controller, with access to the full
counterfactual baseline to be computed efficiently. In a sin- state, and control of all units, although the architecture of
gle forward pass, it computes the Q-values for all the differ- the controllers exploits the multi-agent nature of the prob-
ent actions of a given agent, conditioned on the actions of all lem. Usunier et al. (2016) use a greedy MDP, which at each
the other agents. Because a single centralised critic is used timestep sequentially chooses actions for agents given all
for all agents, all Q-values for all agents can be computed in previous actions, in combination with zero-order optimisa-
a single batched forward pass. tion, while Peng et al. (2017) use an actor-critic method that
We evaluate COMA in the testbed of StarCraft unit mi- relies on RNNs to exchange information between the agents.
cromanagement1 , which has recently emerged as a challeng- The closest to our problem setting is that of Foerster et
ing RL benchmark task with high stochasticity, a large state- al. (2017), who also use a multi-agent representation and de-
action space, and delayed rewards. Previous works (Usunier centralised policies. However, they focus on stabilising ex-
et al. 2016; Peng et al. 2017) have made use of a centralised perience replay while using DQN and do not make full use
control policy that conditions on the entire state and can use of the centralised training regime. As they do not report on
powerful macro-actions, using StarCraft’s built-in planner, absolute win-rates we do not compare performance directly.
that combine movement and attack actions. To produce a However, Usunier et al. (2016) address similar scenarios
meaningfully decentralised benchmark that proves challeng- to our experiments and implement a DQN baseline in a
ing for scenarios with even relatively few agents, we propose fully observable setting. In Section 6 we therefore report our
a variant that massively reduces each agent’s field-of-view competitive performance against these state-of-the-art base-
and removes access to these macro-actions. lines, while maintaining decentralised control. Omidshafiei
Our empirical results on this new benchmark show that et al. (2017) also address the stability of experience replay in
COMA can significantly improve performance over other multi-agent settings, but assume a fully decentralised train-
multi-agent actor-critic methods, as well as ablated versions ing regime.
of COMA itself. In addition, COMA’s best agents are com- (Lowe et al. 2017) concurrently propose a multi-agent
petitive with state-of-the-art centralised controllers that are policy-gradient algorithm using centralised critics. Their ap-
given access to full state information and macro-actions. proach does not address multi-agent credit assignment. Un-
like our work, it learns a separate centralised critic for each
2 Related Work agent and is applied to competitive environments with con-
tinuous action spaces.
Although multi-agent RL has been applied in a variety of Our work builds directly off of the idea of difference
settings (Busoniu, Babuska, and De Schutter 2008; Yang and rewards (Wolpert and Tumer 2002). The relationship of
Gu 2004), it has often been restricted to tabular methods COMA to this line of work is discussed in Section 4.
and simple environments. One exception is recent work in
deep multi-agent RL, which can scale to high dimensional
input and action spaces. Tampuu et al. (2015) use a com-
3 Background
bination of DQN with independent Q-learning (Tan 1993; We consider a fully cooperative multi-agent task that can
Shoham and Leyton-Brown 2009) to learn how to play two- be described as a stochastic game G, defined by a tuple
player pong. More recently the same method has been used G = hS, U, P, r, Z, O, n, γi, in which n agents identified
by Leibo et al. (2017) to study the emergence of collabora- by a ∈ A ≡ {1, ..., n} choose sequential actions. The en-
tion and defection in sequential social dilemmas. vironment has a true state s ∈ S. At each time step, each
Also related is work on the emergence of communication agent simultaneously chooses an action ua ∈ U , forming
between agents, learned by gradient descent (Das et al. 2017; a joint action u ∈ U ≡ U n which induces a transition in
Mordatch and Abbeel 2017; Lazaridou, Peysakhovich, and the environment according to the state transition function
Baroni 2016; Foerster et al. 2016; Sukhbaatar, Fergus, and P (s0 |s, u) : S × U × S → [0, 1]. The agents all share the
others 2016). In this line of work, passing gradients between same reward function r(s, u) : S × U → R and γ ∈ [0, 1)
agents during training and sharing parameters are two com- is a discount factor.
mon ways to take advantage of centralised training. How- We consider a partially observable setting, in which
ever, these methods do not allow for extra state information agents draw observations z ∈ Z according to the obser-
vation function O(s, a) : S × A → Z. Each agent has an
1
StarCraft and its expansion StarCraft: Brood War are trade- action-observation history τ a ∈ T ≡ (Z × U )∗ , on which it
marks of Blizzard EntertainmentTM . conditions a stochastic policy π a (ua |τ a ) : T × U → [0, 1].
P∞ (n)
We denote joint quantities over agents in bold, and joint where y (λ) = (1 − λ) n=1 λn−1 Gt , and the n-step
quantities over agents other than a given agent a with the (n)
returns Gt are calculated with bootstrapped values esti-
superscript −a. P∞ mated by a target network (Mnih et al. 2015) with parame-
The discounted return is Rt = l=0 γ l rt+l . The agents’ ters copied periodically from θc .
joint policy induces a value function, i.e., an expectation
over Rt , V π (st ) = Est+1:∞ ,ut:∞ [Rt |st ], and an action- 4 Methods
value function Qπ (st , ut ) = Est+1:∞ ,ut+1:∞ [Rt |st , ut ].
The advantage function is given by Aπ (st , ut ) = In this section, we describe approaches for extending policy
Qπ (st , ut ) − V π (st ). gradients to our multi-agent setting.
Following previous work (Oliehoek, Spaan, and Vlassis
2008; Kraemer and Banerjee 2016; Foerster et al. 2016; Independent Actor-Critic
Jorge, Kågebäck, and Gustavsson 2016), our problem setting The simplest way to apply policy gradients to multiple
allows centralised training but requires decentralised execu- agents is to have each agent learn independently, with its
tion. This is a natural paradigm for a large set of multi-agent own actor and critic, from its own action-observation his-
problems where training is carried out using a simulator with tory. This is essentially the idea behind independent Q-
additional state information, but the agents must rely on lo- learning (Tan 1993), which is perhaps the most popular
cal action-observation histories during execution. To condi- multi-agent learning algorithm, but with actor-critic in place
tion on this full history, a deep RL agent may make use of a of Q-learning. Hence, we call this approach independent
recurrent neural network (Hausknecht and Stone 2015), typ- actor-critic (IAC).
ically with a gated model such as LSTM (Hochreiter and In our implementation of IAC, we speed learning by shar-
Schmidhuber 1997) or GRU (Cho et al. 2014). ing parameters among the agents, i.e., we learn only one ac-
In Section 4, we develop a new multi-agent policy gra- tor and one critic, which are used by all agents. The agents
dient method for tackling this setting. In the remainder of can still behave differently because they receive different ob-
this section, we provide some background on single-agent servations, including an agent-specific ID, and thus evolve
policy gradient methods (Sutton et al. 1999). Such methods different hidden states. Learning remains independent in the
optimise a single agent’s policy, parameterised by θπ , by sense that each agent’s critic estimates only a local value
performing gradient ascent on an estimate of the expected function, i.e., one that conditions on ua , not u. Though we
discounted total reward J = Eπ [R0 ]. Perhaps the simplest are not aware of previous applications of this specific algo-
form of policy gradient is REINFORCE (Williams 1992), in rithm, we do not consider it a significant contribution but
which the gradient is: instead merely a baseline algorithm.
We consider two variants of IAC. In the first, each agent’s
critic estimates V (τ a ) and follows a gradient based on
" T #
X
g = Es0:∞ ,u0:∞ Rt ∇θπ log π(ut |st ) . (1) the TD error, as described in Section 3. In the second,
t=0 each agent’s critic estimates Q(τ a , ua ) and follows a gra-
dient based on the advantage: A(τ a , ua ) = Q(τ a , ua ) −
In actor-critic approaches (Sutton et al. 1999; Konda and V (τ a ), where V (τ a ) = ua π(ua |τ a )Q(τ a , ua ). Indepen-
P
Tsitsiklis 2000; Schulman et al. 2015), the actor, i.e., the dent learning is straightforward, but the lack of information
policy, is trained by following a gradient that depends on sharing at training time makes it difficult to learn coordi-
a critic, which usually estimates a value function. In par- nated strategies that depend on interactions between multi-
ticular, Rt is replaced by any expression equivalent to ple agents, or for an individual agent to estimate the contri-
Q(st , ut ) − b(st ), where b(st ) is a baseline designed to re- bution of its actions to the team’s reward.
duce variance (Weaver and Tao 2001). A common choice is
b(st ) = V (st ), in which case Rt is replaced by A(st , ut ). Counterfactual Multi-Agent Policy Gradients
Another option is to replace Rt with the temporal difference
(TD) error rt + γV (st+1 ) − V (s), which is an unbiased es- The difficulties discussed above arise because, beyond pa-
timate of A(st , ut ). In practice, the gradient must be esti- rameter sharing, IAC fails to exploit the fact that learning
mated from trajectories sampled from the environment, and is centralised in our setting. In this section, we propose
the (action-)value functions must be estimated with function counterfactual multi-agent (COMA) policy gradients, which
approximators. Consequently, the bias and variance of the overcome this limitation. Three main ideas underly COMA:
gradient estimate depends strongly on the exact choice of 1) centralisation of the critic, 2) use of a counterfactual base-
estimator (Konda and Tsitsiklis 2000). line, and 3) use of a critic representation that allows efficient
In this paper, we train critics f c (·, θc ) on-policy to es- evaluation of the baseline. The remainder of this section de-
timate either Q or V , using a variant of TD(λ) (Sutton scribes these ideas.
1988) adapted for use with deep neural networks. TD(λ) First, COMA uses a centralised critic. Note that in IAC,
(n) Pn each actor π(ua |τ a ) and each critic Q(τ a , ua ) or V (τ a )
uses a mixture of n-step returns Gt = l=1 γ l−1 rt+l + conditions only on the agent’s own action-observation his-
γ n f c (·t+n , θc ). In particular, the critic parameters θc are up- tory τ a . However, the critic is used only during learning and
dated by minibatch gradient descent to minimise the follow- only the actor is needed during execution. Since learning is
ing loss: centralised, we can therefore use a centralised critic that con-
Lt (θc ) = (y (λ) − f c (·t , θc ))2 , (2) ditions on the true global state s, if it is available, or the joint
action-observation histories τ otherwise. Each actor condi- aristocrat utility using value-based methods creates a self-
tions on its own action-observation histories τ a , with param- consistency problem because the policy and utility function
eter sharing, as in IAC. Figure 1a illustrates this setup. depend recursively on each other. As a result, prior work
A naive way to use this centralised critic would be for focused on difference evaluations using default states and
each actor to follow a gradient based on the TD error esti- actions. COMA is different because the counterfactual base-
mated from this critic: line’s expected contribution to the gradient, as with other
policy gradient baselines, is zero. Thus, while the baseline
does depend on the policy, its expectation does not. Conse-
g = ∇θπ log π(u|τta ) (r + γV (st+1 ) − V (st )) . (3) quently, COMA can use this form of the advantage without
creating a self-consistency problem.
However, such an approach fails to address a key credit
While COMA’s advantage function replaces potential ex-
assignment problem. Because the TD error considers only
tra simulations with evaluations of the critic, those evalu-
global rewards, the gradient computed for each actor does
ations may themselves be expensive if the critic is a deep
not explicitly reason about how that particular agent’s ac-
neural network. Furthermore, in a typical representation, the
tions contribute to that global reward. Since the other agents
number of output nodes of such a network would equal |U |n ,
may be exploring, the gradient for that agent becomes very
the size of the joint action space, making it impractical to
noisy, particularly when there are many agents.
train. To address both these issues, COMA uses a critic rep-
Therefore, COMA uses a counterfactual baseline. The
resentation that allows for efficient evaluation of the base-
idea is inspired by difference rewards (Wolpert and Tumer
line. In particular, the actions of the other agents, u−a
t , are
2002), in which each agent learns from a shaped reward
part of the input to the network, which outputs a Q-value
Da = r(s, u) − r(s, (u−a , ca )) that compares the global
for each of agent a’s actions, as shown in Figure 1c. Conse-
reward to the reward received when the action of agent a is
quently, the counterfactual advantage can be calculated effi-
replaced with a default action ca . Any action by agent a that
ciently by a single forward pass of the actor and critic, for
improves Da also improves the true global reward r(s, u),
each agent. Furthermore, the number of outputs is only |U |
because r(s, (u−a , ca )) does not depend on agent a’s ac-
instead of (|U |n ). While the network has a large input space
tions.
that scales linearly in the number of agents and actions, deep
Difference rewards are a powerful way to perform multi- neural networks can generalise well across such spaces.
agent credit assignment. However, they typically require In this paper, we focus on settings with discrete actions.
access to a simulator in order to estimate r(s, (u−a , ca )). However, COMA can be easily extended to continuous ac-
When a simulator is already being used for learning, dif- tions spaces by estimating the expectation in (4) with Monte
ference rewards increase the number of simulations that Carlo samples or using functional forms that render it ana-
must be conducted, since each agent’s difference reward lytical, e.g., Gaussian policies and critic.
requires a separate counterfactual simulation. Proper and The following lemma establishes the convergence of
Tumer (2012) and Colby, Curran, and Tumer (2015) pro- COMA to a locally optimal policy. The proof follows di-
pose estimating difference rewards using function approxi- rectly from the convergence of single-agent actor-critic al-
mation rather than a simulator. However, this still requires a gorithms (Sutton et al. 1999; Konda and Tsitsiklis 2000),
user-specified default action ca that can be difficult to choose and is subject to the same assumptions.
in many applications. In an actor-critic architecture, this ap-
proach would also introduce an additional source of approx- Lemma 1. For an actor-critic algorithm with a compatible
imation error. TD(1) critic following a COMA policy gradient
" #
A key insight underlying COMA is that a centralised critic X
a a a a
can be used to implement difference rewards in a way that gk = Eπ ∇θk log π (u |τ )A (s, u) (5)
avoids these problems. COMA learns a centralised critic, a
Q(s, u) that estimates Q-values for the joint action u con- at each iteration k,
ditioned on the central state s. For each agent a we can then lim inf ||∇J|| = 0 w.p. 1. (6)
k
compute an advantage function that compares the Q-value
for the current action ua to a counterfactual baseline that Proof. The COMA gradient is given by
" #
marginalises out ua , while keeping the other agents’ actions X
a a a a
u−a fixed: g = Eπ ∇θ log π (u |τ )A (s, u) , (7)
X a
Aa (s, u) = Q(s, u) − π a (u0a |τ a )Q(s, (u−a , u0a )). Aa (s, u) = Q(s, u) − b(s, u−a ), (8)
u0a
(4) where θ are the parameters of all actor policies, e.g. θ =
Hence, Aa (s, ua ) computes a separate baseline for each {θ1 , . . . , θ|A| }, and b(s, u−a ) is the counterfactual baseline
agent that uses the centralised critic to reason about counter- defined in equation 4.
factuals in which only a’s action changes, learned directly First consider the expected contribution of the this base-
from agents’ experiences instead of relying on extra simula- line b(s, u−a ):
" #
tions, a reward model, or a user-designed default action. X
a a a −a
This advantage has the same form as the aristocrat util- gb = −Eπ ∇θ log π (u |τ )b(s, u ) , (9)
ity (Wolpert and Tumer 2002). However, optimising for an a
A1 t Critic A2 t a
= (hat, ) Aa t
t
1 2
(h , ) (h , )
( ) (uat, a
t
) COMA
1 1 2 2
h u t
u t
h
h at {Q(ua=1, u-at,..),. .,Q(ua=|U|, u-at,..)}
Actor 1 Actor 2
(hat-1) GRU (hat)
st rt
o 1t u1 t u2 t o 2t

Environment (oat, a, uat-1) (u-at, st, oat, a, ut-1)


(a) (b) (c)
Figure 1: In (a), information flow between the decentralised actors, the environment and the centralised critic in COMA; red
arrows and components are only required during centralised learning. In (b) and (c), architectures of the actor and critic.

where the expectation Eπ is with respect to the state-action 2. the update timescales for Q and π are sufficiently slow,
distribution induced by the joint policy π. Now let dπ (s) be and that π is updated sufficiently slower than Q, and
the discounted ergodic state distribution as defined by Sutton 3. Q uses a representation compatible with π,
et al. (1999):
X XX amongst several further assumptions. The parameterisation
gb = − dπ (s) π(u−a |τ −a) · of the policy (i.e., the single-agent joint-action learner is de-
s a u−a
X composed into independent actors) is immaterial to conver-
π (u |τ a )∇θ log π a (ua |τ a )b(s, u−a )
a a
(10) gence, as long as it remains differentiable. Note however
ua that COMA’s centralised critic is essential for this proof to
X XX hold.
=− dπ (s) π(u−a |τ −a) ·
s a u−a
X 5 Experimental Setup
∇θ π (ua |τ a )b(s, u−a )
a
(11)
In this section, we describe the StarCraft problem to which
ua
X XX we apply COMA, as well as details of the state features,
=− dπ (s) π(u−a |τ −a)b(s, u−a )∇θ 1 network architectures, training regimes, and ablations.
s a u−a Decentralised StarCraft Micromanagement. StarCraft
= 0. (12) is a rich environment with stochastic dynamics that cannot
be easily emulated. Many simpler multi-agent settings, such
Clearly, the per-agent baseline, although it reduces variance, as Predator-Prey (Tan 1993) or Packet World (Weyns, Helle-
does not change the expected gradient, and therefore does boogh, and Holvoet 2005), by contrast, have full simulators
not affect the convergence of COMA. with controlled randomness that can be freely set to any state
The remainder of the expected policy gradient is given by: in order to perfectly replay experiences. This makes it pos-
sible, though computationally expensive, to compute differ-
" #
X
a a a
g = Eπ ∇θ log π (u |τ )Q(s, u) (13) ence rewards via extra simulations. In StarCraft, as in the
a real world, this is not possible.
In this paper, we focus on the problem of microman-
" #
Y
a a a
= Eπ ∇θ log π (u |τ )Q(s, u) . (14) agement in StarCraft, which refers to the low-level con-
a trol of individual units’ positioning and attack commands
Writing the joint policy as a product of the independent ac- as they fight enemies. This task is naturally represented as
tors: a multi-agent system, where each StarCraft unit is replaced
by a decentralised controller. We consider several scenarios
Y
π(u|s) = π a (ua |τ a ), (15)
a
with symmetric teams formed of: 3 marines (3m), 5 marines
(5m), 5 wraiths (5w), or 2 dragoons with 3 zealots (2d 3z).
yields the standard single-agent actor-critic policy gradient:
The enemy team is controlled by the StarCraft AI, which
g = Eπ [∇θ log π(u|s)Q(s, u)] . (16) uses reasonable but suboptimal hand-crafted heuristics.
We allow the agents to choose from a set of discrete
Konda and Tsitsiklis (2000) prove that an actor-critic fol-
actions: move[direction], attack[enemy id],
lowing this gradient converges to a local maximum of the
stop, and noop. In the StarCraft game, when a unit selects
expected return J π , given that:
an attack action, it first moves into attack range before
1. the policy π is differentiable, firing, using the game’s built-in pathfinding to choose a
route. These powerful attack-move macro-actions make the and shield.2 All features are normalised by their maxi-
control problem considerably easier. mum values. We do not include any information about the
To create a more challenging benchmark that is mean- units’ current target.
ingfully decentralised, we impose a restricted field of view The global state representation consists of similar fea-
on the agents, equal to the firing range of ranged units’ tures, but for all units on the map regardless of fields of
weapons, shown in Figure 2. This departure from the stan- view. Absolute distance is not included, and x-y locations
dard setup for centralised StarCraft control has three effects. are given relative to the centre of the map rather than to
a particular agent. The global state also includes health
points and cooldown for all agents. The representa-
tion fed to the centralised Q-function critic is the concate-
nation of the global state representation with the local ob-
servation of the agent whose actions are being evaluated.
Our centralised critic that estimates V (s), and is therefore
x agent-agnostic, receives the global state concatenated with
all agents’ observations. The observations contain no new
information but include the egocentric distances relative to
that agent.
Architecture & Training. The actor consists of 128-bit
gated recurrent units (GRUs) (Cho et al. 2014) that use
fully connected layers both to process the input and to pro-
Figure 2: Starting position with example local field of view
duce the output values from the hidden state, hat . The IAC
for the 2d 3z map.
critics use extra output heads appended to the last layer of
the actor network. Action probabilities are produced from
First, it introduces significant partial observability. Sec- the final layer, z, via a bounded softmax distribution that
ond, it means units can only attack when they are in range lower-bounds the probability of any given action by /|U |:
of enemies, removing access to the StarCraft macro-actions. P (u) = (1 − )softmax(z)u + /|U |). We anneal  lin-
Third, agents cannot distinguish between enemies who are early from 0.5 to 0.02 across 750 training episodes. The cen-
dead and those who are out of range and so can issue in- tralised critic is a feedforward network with multiple ReLU
valid attack commands at such enemies, which results in no layers combined with fully connected layers. Hyperparame-
action being taken. This substantially increases the average ters were coarsely tuned on the 5m scenario and then used
size of the action space, which in turn increases the difficulty for all other maps. We found that the most sensitive param-
of both exploration and control. eter was TD(λ), but settled on λ = 0.8, which worked best
Under these difficult conditions, scenarios with even rela- for both COMA and our baselines. Our implementation uses
tively small numbers of units become much harder to solve. TorchCraft (Synnaeve et al. 2016) and Torch 7 (Collobert,
As seen in Table 1, we compare against a simple hand-coded Kavukcuoglu, and Farabet 2011). Pseudocode and further
heuristic that instructs the agents to run forwards into range details on the training procedure are in the appendix.
and then focus their fire, attacking each enemy in turn until We experimented with critic architectures that are fac-
it dies. This heuristic achieves a 98% win rate on 5m with tored at the agent level and further exploit internal parameter
a full field of view, but only 66% in our setting. To perform sharing. However, we found that the bottleneck for scalabil-
well in this task, the agents must learn to cooperate by posi- ity was not the centralisation of the critic, but rather the dif-
tioning properly and focussing their fire, while remembering ficulty of multi-agent exploration. Hence, we defer further
which enemy and ally units are alive or out of view. investigation of factored COMA critics to future work.
All agents receive the same global reward at each time Ablations. We perform ablation experiments to validate
step, equal to the sum of damage inflicted on the opponent three key elements of COMA. First, we test the importance
units minus half the damage taken. Killing an opponent gen- of centralising the critic by comparing against two IAC vari-
erates a reward of 10 points, and winning the game generates ants, IAC-Q and IAC-V . These critics take the same decen-
a reward equal to the team’s remaining total health plus 200. tralised input as the actor, and share parameters with the ac-
This damage-based reward signal is comparable to that used tor network up to the final layer. IAC-Q then outputs |U |
by Usunier et al. (2016). Unlike (Peng et al. 2017), our ap- Q-values, one for each action, while IAC-V outputs a sin-
proach does not require estimating local rewards. gle state-value. Note that we still share parameters between
State Features. The actor and critic receive different in- agents, using the egocentric observations and ID’s as part of
put features, corresponding to local observations and global the input to allow different behaviours to emerge. The coop-
state, respectively. Both include features for allies and ene- erative reward function is still shared by all agents.
mies. Units can be either allies or enemies, while agents are Second, we test the significance of learning Q instead of
the decentralised controllers that command ally units.
The local observations for every agent are drawn only 2
After firing, a unit’s cooldown is reset, and it must drop
from a circular subset of the map centred on the unit it before firing again. Shields absorb damage until they break, after
controls and include for each unit within this field of view: which units start losing health. Dragoons and zealots have shields
distance, relative x, relative y, unit type but marines do not.
COMA c e ntral-QV IAC-V IAC-Q
c e ntral-V he uris tic
90 90
80 80

Ave rag e Win %


Ave rag e Win % 70 70
60 60
50 50
40 40
30 30
20 20
10 10
0 0
20k 40k 60k 80k 100k 120k 140k 10k 20k 30k 40k 50k 60k 70k
# Epis o de s # Epis o de s

(a) 3m (b) 5m
90 70
80 60
Ave rag e Win %

Ave rag e Win %


70
50
60
50 40
40 30
30
20
20
10 10
0 0
5k 10k 15k 20k 25k 30k 35k 5k 10k 15k 20k 25k 30k 35k 40k
# Epis o de s # Epis o de s

(c) 5w (d) 2d 3z

Figure 3: Win rates for COMA and competing algorithms on four different scenarios. COMA outperforms all baseline methods.
Centralised critics also clearly outperform their decentralised counterparts. The legend at the top applies across all plots.

V . The method central-V still uses a central state for the Section 5), which could be expected to speed learning. How-
critic, but learns V (s), and uses the TD error to estimate the ever, these results suggest that the improved accuracy of pol-
advantage for policy gradient updates. icy evaluation made possible by conditioning on the global
Third, we test the utility of our counterfactual baseline. state outweighs the overhead of training a separate network.
The method central-QV learns both Q and V simultane- Furthermore, COMA strictly dominates central-QV , both
ously and estimates the advantage as Q − V , replacing in training speed and in final performance across all settings.
COMA’s counterfactual baseline with V . All methods use This is a strong indicator that our counterfactual baseline is
the same architecture and training scheme for the actors, and crucial when using a central Q-critic to train decentralised
all critics are trained with TD(λ). policies.
Learning a state-value function has the obvious advantage
6 Results of not conditioning on the joint action. Still, we find that
COMA outperforms the central-V baseline in final perfor-
Figure 3 shows average win rates as a function of episode for mance. Furthermore, COMA typically achieves good poli-
each method and each StarCraft scenario. For each method, cies faster, which is expected as COMA provides a shaped
we conducted 35 independent trials and froze learning ev- training signal. Training is also more stable than central-V ,
ery 100 training episodes to evaluate the learned policies which is a consequence of the COMA gradient tending to
across 200 episodes per method, plotting the average across zero as the policy becomes greedy. Overall, COMA is the
episodes and trials. Also shown is one standard deviation in best performing and most consistent method.
performance. Usunier et al. (2016) report the performance of their best
The results show that COMA is superior to the IAC base- agents trained with their state-of-the-art centralised con-
lines in all scenarios. Interestingly, the IAC methods also troller labelled GMEZO (greedy-MDP with episodic zero-
eventually learn reasonable policies in 5m, although they order optimisation), and for a centralised DQN controller,
need substantially more episodes to do so. This may seem both given a full field of view and access to attack-move
counterintuitive since in the IAC methods, the actor and macro-actions. These results are compared in Table 1 against
critic networks share parameters in their early layers (see the best agents trained with COMA for each map. Clearly,
Local Field of View (FoV) Full FoV, Central Control
COMA
map heur. IAC-V IAC-Q cnt-V cnt-QV heur. DQN GMEZO
mean best
3m 35 47 (3) 56 (6) 83 (3) 83 (5) 87 (3) 98 74 - -
5m 66 63 (2) 58 (3) 67 (5) 71 (9) 81 (5) 95 98 99 100
5w 70 18 (5) 57 (5) 65 (3) 76 (1) 82 (3) 98 82 70 743
2d 3z 63 27 (9) 19 (21) 36 (6) 39 (5) 47 (5) 65 68 61 90

Table 1: Mean win percentage averaged across final 1000 evaluation episodes for the different maps, for all methods and the
hand-coded heuristic in the decentralised setting with a limited field of view. The highest mean performances are in bold,
while the values in parentheses denote the 95% confidence interval, for example 87(3) = 87 ± 3. Also shown, maximum
win percentages for COMA (decentralised), in comparison to the heuristic and published results (evaluated in the centralised
setting).

in most settings these agents achieve performance compara- References


ble to the best published win rates despite being restricted to
Busoniu, L.; Babuska, R.; and De Schutter, B. 2008. A
decentralised policies and local fields of view.
comprehensive survey of multiagent reinforcement learning.
IEEE Transactions on Systems Man and Cybernetics Part C
7 Conclusions & Future Work Applications and Reviews 38(2):156.
This paper presented COMA policy gradients, a method Cao, Y.; Yu, W.; Ren, W.; and Chen, G. 2013. An overview
that uses a centralised critic in order to estimate a coun- of recent progress in the study of distributed multi-agent
terfactual advantage for decentralised policies in mutli- coordination. IEEE Transactions on Industrial informatics
agent RL. COMA addresses the challenges of multi-agent 9(1):427–438.
credit assignment by using a counterfactual baseline that Chang, Y.-H.; Ho, T.; and Kaelbling, L. P. 2003. All learning
marginalises out a single agent’s action, while keeping is local: Multi-agent learning in global reward games. In
the other agents’ actions fixed. Our results in a decen- NIPS, 807–814.
tralised StarCraft unit micromanagement benchmark show
that COMA significantly improves final performance and Cho, K.; van Merriënboer, B.; Bahdanau, D.; and Ben-
training speed over other multi-agent actor-critic methods gio, Y. 2014. On the properties of neural machine
and remains competitive with state-of-the-art centralised translation: Encoder-decoder approaches. arXiv preprint
controllers under best-performance reporting. Future work arXiv:1409.1259.
will extend COMA to tackle scenarios with large numbers Colby, M. K.; Curran, W.; and Tumer, K. 2015. Approx-
of agents, where centralised critics are more difficult to train imating difference evaluations with local information. In
and exploration is harder to coordinate. We also aim to de- Proceedings of the 2015 International Conference on Au-
velop more sample-efficient variants that are practical for tonomous Agents and Multiagent Systems, 1659–1660. In-
real-world applications such as self-driving cars. ternational Foundation for Autonomous Agents and Multia-
gent Systems.
Acknowledgements Collobert, R.; Kavukcuoglu, K.; and Farabet, C. 2011.
Torch7: A matlab-like environment for machine learning. In
This project has received funding from the European Re-
BigLearn, NIPS Workshop.
search Council (ERC) under the European Union’s Hori-
zon 2020 research and innovation programme (grant agree- Das, A.; Kottur, S.; Moura, J. M.; Lee, S.; and Batra, D.
ment number 637713). It was also supported by the Oxford- 2017. Learning cooperative visual dialog agents with deep
Google DeepMind Graduate Scholarship, the UK EPSRC reinforcement learning. arXiv preprint arXiv:1703.06585.
CDT in Autonomous Intelligent Machines and Systems, and Foerster, J.; Assael, Y. M.; de Freitas, N.; and Whiteson, S.
a generous grant from Microsoft for their Azure cloud com- 2016. Learning to communicate with deep multi-agent re-
puting services. We would like to thank Nando de Freitas, inforcement learning. In Advances in Neural Information
Yannis Assael, and Brendan Shillingford for helpful com- Processing Systems, 2137–2145.
ments and discussion. We also thank Gabriel Synnaeve, Foerster, J.; Nardelli, N.; Farquhar, G.; Torr, P.; Kohli, P.;
Zeming Lin, and the rest of the TorchCraft team at FAIR Whiteson, S.; et al. 2017. Stabilising experience replay for
for their work on the interface. deep multi-agent reinforcement learning. In Proceedings of
3
5w DQN and GMEZO benchmark performances are of a pol-
The 34th International Conference on Machine Learning.
icy trained on a larger map and tested on 5w Gupta, J. K.; Egorov, M.; and Kochenderfer, M. 2017. Coop-
erative multi-agent control using deep reinforcement learn- Sukhbaatar, S.; Fergus, R.; et al. 2016. Learning multia-
ing. gent communication with backpropagation. In Advances in
Hausknecht, M., and Stone, P. 2015. Deep recurrent Neural Information Processing Systems, 2244–2252.
q-learning for partially observable mdps. arXiv preprint Sutton, R. S.; McAllester, D. A.; Singh, S. P.; Mansour, Y.;
arXiv:1507.06527. et al. 1999. Policy gradient methods for reinforcement
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term learning with function approximation. In NIPS, volume 99,
memory. Neural computation 9(8):1735–1780. 1057–1063.
Jorge, E.; Kågebäck, M.; and Gustavsson, E. 2016. Learning Sutton, R. S. 1988. Learning to predict by the methods of
to play guess who? and inventing a grounded language as a temporal differences. Machine learning 3(1):9–44.
consequence. arXiv preprint arXiv:1611.03218. Synnaeve, G.; Nardelli, N.; Auvolat, A.; Chintala, S.;
Konda, V. R., and Tsitsiklis, J. N. 2000. Actor-critic algo- Lacroix, T.; Lin, Z.; Richoux, F.; and Usunier, N. 2016.
rithms. In Advances in neural information processing sys- Torchcraft: a library for machine learning research on real-
tems, 1008–1014. time strategy games. arXiv preprint arXiv:1611.00625.
Kraemer, L., and Banerjee, B. 2016. Multi-agent reinforce- Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus,
ment learning as a rehearsal for decentralized planning. Neu- K.; Aru, J.; Aru, J.; and Vicente, R. 2015. Multiagent coop-
rocomputing 190:82–94. eration and competition with deep reinforcement learning.
arXiv preprint arXiv:1511.08779.
Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2016.
Multi-agent cooperation and the emergence of (natural) lan- Tan, M. 1993. Multi-agent reinforcement learning: Inde-
guage. arXiv preprint arXiv:1612.07182. pendent vs. cooperative agents. In Proceedings of the tenth
international conference on machine learning, 330–337.
Leibo, J. Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; and
Graepel, T. 2017. Multi-agent reinforcement learning in se- Tumer, K., and Agogino, A. 2007. Distributed agent-based
quential social dilemmas. arXiv preprint arXiv:1702.03037. air traffic flow management. In Proceedings of the 6th inter-
national joint conference on Autonomous agents and multi-
Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and agent systems, 255. ACM.
Mordatch, I. 2017. Multi-agent actor-critic for mixed
cooperative-competitive environments. arXiv preprint Usunier, N.; Synnaeve, G.; Lin, Z.; and Chintala, S. 2016.
arXiv:1706.02275. Episodic exploration for deep deterministic policies: An ap-
plication to starcraft micromanagement tasks. arXiv preprint
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Ve- arXiv:1609.02993.
ness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.;
Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human- Weaver, L., and Tao, N. 2001. The optimal reward baseline
level control through deep reinforcement learning. Nature for gradient-based reinforcement learning. In Proceedings
518(7540):529–533. of the Seventeenth conference on Uncertainty in artificial in-
telligence, 538–545. Morgan Kaufmann Publishers Inc.
Mordatch, I., and Abbeel, P. 2017. Emergence of grounded
compositional language in multi-agent populations. arXiv Weyns, D.; Helleboogh, A.; and Holvoet, T. 2005. The
preprint arXiv:1703.04908. packet-world: A test bed for investigating situated multi-
agent systems. In Software Agent-Based Applications, Plat-
Oliehoek, F. A.; Spaan, M. T. J.; and Vlassis, N. 2008. Op- forms and Development Kits. Springer. 383–408.
timal and approximate Q-value functions for decentralized
POMDPs. 32:289–353. Williams, R. J. 1992. Simple statistical gradient-following
algorithms for connectionist reinforcement learning. Ma-
Omidshafiei, S.; Pazis, J.; Amato, C.; How, J. P.; and Vian, chine learning 8(3-4):229–256.
J. 2017. Deep decentralized multi-task multi-agent rl under
partial observability. arXiv preprint arXiv:1703.06182. Wolpert, D. H., and Tumer, K. 2002. Optimal payoff func-
tions for members of collectives. In Modeling complexity in
Peng, P.; Yuan, Q.; Wen, Y.; Yang, Y.; Tang, Z.; Long, H.; economic and social systems. World Scientific. 355–369.
and Wang, J. 2017. Multiagent bidirectionally-coordinated
Yang, E., and Gu, D. 2004. Multiagent reinforcement learn-
nets for learning to play starcraft combat games. arXiv
ing for multi-robot systems: A survey. Technical report,
preprint arXiv:1703.10069.
tech. rep.
Proper, S., and Tumer, K. 2012. Modeling difference re-
Ye, D.; Zhang, M.; and Yang, Y. 2015. A multi-agent frame-
wards for multiagent learning. In Proceedings of the 11th
work for packet routing in wireless sensor networks. sensors
International Conference on Autonomous Agents and Multi-
15(5):10026–10047.
agent Systems-Volume 3, 1397–1398. International Founda-
tion for Autonomous Agents and Multiagent Systems. Ying, W., and Dayong, S. 2005. Multi-agent framework
for third party logistics in e-commerce. Expert Systems with
Schulman, J.; Moritz, P.; Levine, S.; Jordan, M. I.; and
Applications 29(2):431–436.
Abbeel, P. 2015. High-dimensional continuous control using
generalized advantage estimation. CoRR abs/1506.02438.
Shoham, Y., and Leyton-Brown, K. 2009. Multiagent Sys-
tems: Algorithmic, Game-Theoretic, and Logical Founda-
tions. New York: Cambridge University Press.
Appendix
Training Details and Hyperparameters
Training is performed in batch mode, with a batch size of 30. Due to parameter sharing, all agents can be processed in parallel,
with each agent for each episode and time step occupying one batch entry. The training cycle progresses in three steps (com-
pletion of all three steps constitutes as one episode in our graphs): 1) collect data: collect 30
n episodes; 2) train critic: for each
time step, apply a gradient step to the feed-forward critic, starting at the end of the episode; and 3) train actor: fully unroll the
recurrent part of the actor, aggregate gradients in the backward pass across all time steps, and apply a gradient update.
We use a target network for the critic, which updates every 150 training steps for the feed-forward centralised critics and
every 50 steps for the recurrent IAC critics. The feed-forward critic receives more learning steps, since it performs a parameter
update for each timestep. Both the actor and the critic networks are trained using RMS-prop with learning rate 0.0005 and alpha
0.99, without weight decay. We set gamma to 0.99 for all maps.
Although tuning the skip-frame in StarCraft can improve absolute performance (Peng et al. 2017), we use a default value of
7, since the main focus is a relative evaluation between COMA and the baselines.

Algorithm

Algorithm 1 Counterfactual Multi-Agent (COMA) Policy Gradients


Initialise θ1c , θ̂1c , θπ
for each training episode e do
Empty buffer
for ec = 1 to BatchSizen do
s1 = initial state, t = 0, ha0 = 0 for each agent a
while st 6= terminal and t < T do
t=t+1
for each agent a do 
hat = Actor oat , hat−1 , uat−1 , a, u; θi
Sample uat from π(hat , (e))
Get reward rt and next state st+1
Add episode to buffer
Collate episodes in buffer into single batch
for t = 1 to T do // from now processing all agents in parallel via single batch
Batch unroll RNN using states, actions and rewards
Calculate TD(λ) targets yta using θ̂ic
for t = T down to 1 do 
∆Qat = yta − Q saj , u
∆θc = ∇θc (∆Qat )2 // calculate critic gradient
c
θi+1 = θic − α∆θc // update critic weights
Every C steps reset θ̂ic = θic
for t = T down to 1 do
Aa (sat , u) = Q(sat , u) − u Q(sat , u, u−a )π(u|hat ) // calculate COMA
P
∆θπ = ∆θπ + ∇θπ log π(u|hat )Aa (sat , u) // accumulate actor gradients
θi+1 = θiπ + α∆θπ // update actor weights
π

Common questions

Powered by AI

COMA uses a centralised critic which conditions on the joint action and global state information to estimate Q-values for the joint action, a departure from traditional actor-critic methods where each actor and critic conditions only on its own action-observation history . This centralised approach allows for a more effective reasoning about the contribution of individual agents to the global reward, addressing the problem of noisy gradients seen in traditional methods because agents' contributions to the global reward can be assessed more accurately . Furthermore, by computing an agent-specific advantage function that compares the Q-value for the current action to a counterfactual baseline, it avoids the need for simulations or default actions, improving credit assignment .

The counterfactual baseline in COMA helps solve the credit assignment problem by providing a method to evaluate the contribution of each agent to the global reward without requiring extra simulations. It does this by estimating the Q-value for the current action and comparing it to a counterfactual baseline that marginalizes out the agent's action while keeping the other agents' actions fixed . This baseline represents the reward expectation if the agent had taken a default action instead. This approach circumvents the need for a simulator or default action assumptions typical in difference rewards. By doing so, COMA can effectively determine the advantage of an agent's action, thus, more precisely assigning credit to individual agents in a multi-agent setting where rewards are otherwise joint and indiscrete .

Computing efficiency in counterfactual baselines is significant because it enables COMA to be applied feasibly in complex environments with many agents, where the joint action space is large . Efficiency is achieved by using a critic representation that allows all Q-values for different actions to be computed in a single batched forward pass, as opposed to computing each Q-value individually. This is facilitated by inputting the actions of other agents into the network and computing the Q-values for a given agent’s action in a single step . Consequently, the baseline evaluation avoids the computationally expensive task of simulating alternate realities for every possible action, instead relying on efficient neural network generalization .

In cooperative multi-agent settings, the main challenge in assigning rewards is that joint actions typically generate only global rewards, which makes it difficult for each agent to deduce its own contribution to the team's success. Individual reward functions can be designed for each agent, but these are often unavailable and fail to encourage individual agents to sacrifice for the greater good . COMA addresses these challenges by using a counterfactual multi-agent reinforcement learning method that incorporates an actor-critic approach with a centralised critic only used during learning. This critic conditions on joint actions and available state information, allowing it to estimate the global reward and compute an agent-specific advantage function. This function compares the estimated return of the current joint action with a counterfactual baseline that marginalises out a single agent's action, thus aiding in multi-agent credit assignment without requiring access to a simulator or assumed default actions .

COMA addresses the issue of recursive interdependence, a problem in value-based methods when the policy and utility functions depend on each other, by constructing a counterfactual baseline whose expected contribution to the policy gradient is zero . This is akin to other policy gradient baselines, ensuring that while the baseline itself may depend on the policy, its expectation does not. This non-recursiveness allows COMA to use the counterfactual advantage function without creating a self-consistency problem, a challenge encountered in earlier methods attempting to optimize aristocrat utilities with value-based approaches .

You might also like