COMA: Counterfactual Multi-Agent RL
COMA: Counterfactual Multi-Agent RL
1 †
University of Oxford, United Kingdom Equal contribution
arXiv:1705.08926v2 [[Link]] 14 Dec 2017
where the expectation Eπ is with respect to the state-action 2. the update timescales for Q and π are sufficiently slow,
distribution induced by the joint policy π. Now let dπ (s) be and that π is updated sufficiently slower than Q, and
the discounted ergodic state distribution as defined by Sutton 3. Q uses a representation compatible with π,
et al. (1999):
X XX amongst several further assumptions. The parameterisation
gb = − dπ (s) π(u−a |τ −a) · of the policy (i.e., the single-agent joint-action learner is de-
s a u−a
X composed into independent actors) is immaterial to conver-
π (u |τ a )∇θ log π a (ua |τ a )b(s, u−a )
a a
(10) gence, as long as it remains differentiable. Note however
ua that COMA’s centralised critic is essential for this proof to
X XX hold.
=− dπ (s) π(u−a |τ −a) ·
s a u−a
X 5 Experimental Setup
∇θ π (ua |τ a )b(s, u−a )
a
(11)
In this section, we describe the StarCraft problem to which
ua
X XX we apply COMA, as well as details of the state features,
=− dπ (s) π(u−a |τ −a)b(s, u−a )∇θ 1 network architectures, training regimes, and ablations.
s a u−a Decentralised StarCraft Micromanagement. StarCraft
= 0. (12) is a rich environment with stochastic dynamics that cannot
be easily emulated. Many simpler multi-agent settings, such
Clearly, the per-agent baseline, although it reduces variance, as Predator-Prey (Tan 1993) or Packet World (Weyns, Helle-
does not change the expected gradient, and therefore does boogh, and Holvoet 2005), by contrast, have full simulators
not affect the convergence of COMA. with controlled randomness that can be freely set to any state
The remainder of the expected policy gradient is given by: in order to perfectly replay experiences. This makes it pos-
sible, though computationally expensive, to compute differ-
" #
X
a a a
g = Eπ ∇θ log π (u |τ )Q(s, u) (13) ence rewards via extra simulations. In StarCraft, as in the
a real world, this is not possible.
In this paper, we focus on the problem of microman-
" #
Y
a a a
= Eπ ∇θ log π (u |τ )Q(s, u) . (14) agement in StarCraft, which refers to the low-level con-
a trol of individual units’ positioning and attack commands
Writing the joint policy as a product of the independent ac- as they fight enemies. This task is naturally represented as
tors: a multi-agent system, where each StarCraft unit is replaced
by a decentralised controller. We consider several scenarios
Y
π(u|s) = π a (ua |τ a ), (15)
a
with symmetric teams formed of: 3 marines (3m), 5 marines
(5m), 5 wraiths (5w), or 2 dragoons with 3 zealots (2d 3z).
yields the standard single-agent actor-critic policy gradient:
The enemy team is controlled by the StarCraft AI, which
g = Eπ [∇θ log π(u|s)Q(s, u)] . (16) uses reasonable but suboptimal hand-crafted heuristics.
We allow the agents to choose from a set of discrete
Konda and Tsitsiklis (2000) prove that an actor-critic fol-
actions: move[direction], attack[enemy id],
lowing this gradient converges to a local maximum of the
stop, and noop. In the StarCraft game, when a unit selects
expected return J π , given that:
an attack action, it first moves into attack range before
1. the policy π is differentiable, firing, using the game’s built-in pathfinding to choose a
route. These powerful attack-move macro-actions make the and shield.2 All features are normalised by their maxi-
control problem considerably easier. mum values. We do not include any information about the
To create a more challenging benchmark that is mean- units’ current target.
ingfully decentralised, we impose a restricted field of view The global state representation consists of similar fea-
on the agents, equal to the firing range of ranged units’ tures, but for all units on the map regardless of fields of
weapons, shown in Figure 2. This departure from the stan- view. Absolute distance is not included, and x-y locations
dard setup for centralised StarCraft control has three effects. are given relative to the centre of the map rather than to
a particular agent. The global state also includes health
points and cooldown for all agents. The representa-
tion fed to the centralised Q-function critic is the concate-
nation of the global state representation with the local ob-
servation of the agent whose actions are being evaluated.
Our centralised critic that estimates V (s), and is therefore
x agent-agnostic, receives the global state concatenated with
all agents’ observations. The observations contain no new
information but include the egocentric distances relative to
that agent.
Architecture & Training. The actor consists of 128-bit
gated recurrent units (GRUs) (Cho et al. 2014) that use
fully connected layers both to process the input and to pro-
Figure 2: Starting position with example local field of view
duce the output values from the hidden state, hat . The IAC
for the 2d 3z map.
critics use extra output heads appended to the last layer of
the actor network. Action probabilities are produced from
First, it introduces significant partial observability. Sec- the final layer, z, via a bounded softmax distribution that
ond, it means units can only attack when they are in range lower-bounds the probability of any given action by /|U |:
of enemies, removing access to the StarCraft macro-actions. P (u) = (1 − )softmax(z)u + /|U |). We anneal lin-
Third, agents cannot distinguish between enemies who are early from 0.5 to 0.02 across 750 training episodes. The cen-
dead and those who are out of range and so can issue in- tralised critic is a feedforward network with multiple ReLU
valid attack commands at such enemies, which results in no layers combined with fully connected layers. Hyperparame-
action being taken. This substantially increases the average ters were coarsely tuned on the 5m scenario and then used
size of the action space, which in turn increases the difficulty for all other maps. We found that the most sensitive param-
of both exploration and control. eter was TD(λ), but settled on λ = 0.8, which worked best
Under these difficult conditions, scenarios with even rela- for both COMA and our baselines. Our implementation uses
tively small numbers of units become much harder to solve. TorchCraft (Synnaeve et al. 2016) and Torch 7 (Collobert,
As seen in Table 1, we compare against a simple hand-coded Kavukcuoglu, and Farabet 2011). Pseudocode and further
heuristic that instructs the agents to run forwards into range details on the training procedure are in the appendix.
and then focus their fire, attacking each enemy in turn until We experimented with critic architectures that are fac-
it dies. This heuristic achieves a 98% win rate on 5m with tored at the agent level and further exploit internal parameter
a full field of view, but only 66% in our setting. To perform sharing. However, we found that the bottleneck for scalabil-
well in this task, the agents must learn to cooperate by posi- ity was not the centralisation of the critic, but rather the dif-
tioning properly and focussing their fire, while remembering ficulty of multi-agent exploration. Hence, we defer further
which enemy and ally units are alive or out of view. investigation of factored COMA critics to future work.
All agents receive the same global reward at each time Ablations. We perform ablation experiments to validate
step, equal to the sum of damage inflicted on the opponent three key elements of COMA. First, we test the importance
units minus half the damage taken. Killing an opponent gen- of centralising the critic by comparing against two IAC vari-
erates a reward of 10 points, and winning the game generates ants, IAC-Q and IAC-V . These critics take the same decen-
a reward equal to the team’s remaining total health plus 200. tralised input as the actor, and share parameters with the ac-
This damage-based reward signal is comparable to that used tor network up to the final layer. IAC-Q then outputs |U |
by Usunier et al. (2016). Unlike (Peng et al. 2017), our ap- Q-values, one for each action, while IAC-V outputs a sin-
proach does not require estimating local rewards. gle state-value. Note that we still share parameters between
State Features. The actor and critic receive different in- agents, using the egocentric observations and ID’s as part of
put features, corresponding to local observations and global the input to allow different behaviours to emerge. The coop-
state, respectively. Both include features for allies and ene- erative reward function is still shared by all agents.
mies. Units can be either allies or enemies, while agents are Second, we test the significance of learning Q instead of
the decentralised controllers that command ally units.
The local observations for every agent are drawn only 2
After firing, a unit’s cooldown is reset, and it must drop
from a circular subset of the map centred on the unit it before firing again. Shields absorb damage until they break, after
controls and include for each unit within this field of view: which units start losing health. Dragoons and zealots have shields
distance, relative x, relative y, unit type but marines do not.
COMA c e ntral-QV IAC-V IAC-Q
c e ntral-V he uris tic
90 90
80 80
(a) 3m (b) 5m
90 70
80 60
Ave rag e Win %
(c) 5w (d) 2d 3z
Figure 3: Win rates for COMA and competing algorithms on four different scenarios. COMA outperforms all baseline methods.
Centralised critics also clearly outperform their decentralised counterparts. The legend at the top applies across all plots.
V . The method central-V still uses a central state for the Section 5), which could be expected to speed learning. How-
critic, but learns V (s), and uses the TD error to estimate the ever, these results suggest that the improved accuracy of pol-
advantage for policy gradient updates. icy evaluation made possible by conditioning on the global
Third, we test the utility of our counterfactual baseline. state outweighs the overhead of training a separate network.
The method central-QV learns both Q and V simultane- Furthermore, COMA strictly dominates central-QV , both
ously and estimates the advantage as Q − V , replacing in training speed and in final performance across all settings.
COMA’s counterfactual baseline with V . All methods use This is a strong indicator that our counterfactual baseline is
the same architecture and training scheme for the actors, and crucial when using a central Q-critic to train decentralised
all critics are trained with TD(λ). policies.
Learning a state-value function has the obvious advantage
6 Results of not conditioning on the joint action. Still, we find that
COMA outperforms the central-V baseline in final perfor-
Figure 3 shows average win rates as a function of episode for mance. Furthermore, COMA typically achieves good poli-
each method and each StarCraft scenario. For each method, cies faster, which is expected as COMA provides a shaped
we conducted 35 independent trials and froze learning ev- training signal. Training is also more stable than central-V ,
ery 100 training episodes to evaluate the learned policies which is a consequence of the COMA gradient tending to
across 200 episodes per method, plotting the average across zero as the policy becomes greedy. Overall, COMA is the
episodes and trials. Also shown is one standard deviation in best performing and most consistent method.
performance. Usunier et al. (2016) report the performance of their best
The results show that COMA is superior to the IAC base- agents trained with their state-of-the-art centralised con-
lines in all scenarios. Interestingly, the IAC methods also troller labelled GMEZO (greedy-MDP with episodic zero-
eventually learn reasonable policies in 5m, although they order optimisation), and for a centralised DQN controller,
need substantially more episodes to do so. This may seem both given a full field of view and access to attack-move
counterintuitive since in the IAC methods, the actor and macro-actions. These results are compared in Table 1 against
critic networks share parameters in their early layers (see the best agents trained with COMA for each map. Clearly,
Local Field of View (FoV) Full FoV, Central Control
COMA
map heur. IAC-V IAC-Q cnt-V cnt-QV heur. DQN GMEZO
mean best
3m 35 47 (3) 56 (6) 83 (3) 83 (5) 87 (3) 98 74 - -
5m 66 63 (2) 58 (3) 67 (5) 71 (9) 81 (5) 95 98 99 100
5w 70 18 (5) 57 (5) 65 (3) 76 (1) 82 (3) 98 82 70 743
2d 3z 63 27 (9) 19 (21) 36 (6) 39 (5) 47 (5) 65 68 61 90
Table 1: Mean win percentage averaged across final 1000 evaluation episodes for the different maps, for all methods and the
hand-coded heuristic in the decentralised setting with a limited field of view. The highest mean performances are in bold,
while the values in parentheses denote the 95% confidence interval, for example 87(3) = 87 ± 3. Also shown, maximum
win percentages for COMA (decentralised), in comparison to the heuristic and published results (evaluated in the centralised
setting).
Algorithm
COMA uses a centralised critic which conditions on the joint action and global state information to estimate Q-values for the joint action, a departure from traditional actor-critic methods where each actor and critic conditions only on its own action-observation history . This centralised approach allows for a more effective reasoning about the contribution of individual agents to the global reward, addressing the problem of noisy gradients seen in traditional methods because agents' contributions to the global reward can be assessed more accurately . Furthermore, by computing an agent-specific advantage function that compares the Q-value for the current action to a counterfactual baseline, it avoids the need for simulations or default actions, improving credit assignment .
The counterfactual baseline in COMA helps solve the credit assignment problem by providing a method to evaluate the contribution of each agent to the global reward without requiring extra simulations. It does this by estimating the Q-value for the current action and comparing it to a counterfactual baseline that marginalizes out the agent's action while keeping the other agents' actions fixed . This baseline represents the reward expectation if the agent had taken a default action instead. This approach circumvents the need for a simulator or default action assumptions typical in difference rewards. By doing so, COMA can effectively determine the advantage of an agent's action, thus, more precisely assigning credit to individual agents in a multi-agent setting where rewards are otherwise joint and indiscrete .
Computing efficiency in counterfactual baselines is significant because it enables COMA to be applied feasibly in complex environments with many agents, where the joint action space is large . Efficiency is achieved by using a critic representation that allows all Q-values for different actions to be computed in a single batched forward pass, as opposed to computing each Q-value individually. This is facilitated by inputting the actions of other agents into the network and computing the Q-values for a given agent’s action in a single step . Consequently, the baseline evaluation avoids the computationally expensive task of simulating alternate realities for every possible action, instead relying on efficient neural network generalization .
In cooperative multi-agent settings, the main challenge in assigning rewards is that joint actions typically generate only global rewards, which makes it difficult for each agent to deduce its own contribution to the team's success. Individual reward functions can be designed for each agent, but these are often unavailable and fail to encourage individual agents to sacrifice for the greater good . COMA addresses these challenges by using a counterfactual multi-agent reinforcement learning method that incorporates an actor-critic approach with a centralised critic only used during learning. This critic conditions on joint actions and available state information, allowing it to estimate the global reward and compute an agent-specific advantage function. This function compares the estimated return of the current joint action with a counterfactual baseline that marginalises out a single agent's action, thus aiding in multi-agent credit assignment without requiring access to a simulator or assumed default actions .
COMA addresses the issue of recursive interdependence, a problem in value-based methods when the policy and utility functions depend on each other, by constructing a counterfactual baseline whose expected contribution to the policy gradient is zero . This is akin to other policy gradient baselines, ensuring that while the baseline itself may depend on the policy, its expectation does not. This non-recursiveness allows COMA to use the counterfactual advantage function without creating a self-consistency problem, a challenge encountered in earlier methods attempting to optimize aristocrat utilities with value-based approaches .