UAV Control for Target Tracking Using RL
Siyi Li1, Tianbo Liu2, Chi Zhang1, Dit-Yan Yeung1, Shaojie Shen2
1 Department of Computer Science and Engineering, HKUST
2 Department of Electronic and Computer Engineering, HKUST
{sliay, czhangbr, dyyeung}@[Link], {tliuam, eeshaojie}@[Link]
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)
jun Bai, 2017] are proposed, they are not designed for the quadrotor domain. Specific to quadrotors, the relative frailty of the underlying system makes a partially trained motor-level policy crash in the initial stage of training. A key question is how to enjoy the richness and flexibility of a self-improving policy obtained by model-free learning while preserving the stability of conventional controllers.
In this paper, we propose to combine the stability of conventional feedback PID controllers with the self-improving performance of model-free RL techniques so that the hybrid method can be practically applied to learning quadrotor control. Model-free methods are used to learn the high-level reference inputs to the PID control loop, while the PID controller maps the reference inputs to low-level motor actions. Consequently, both data efficiency and training stability are greatly improved. Moreover, the proposed hierarchical control system makes it easy to transfer the policy trained in simulators to real-world platforms, since both sides share similar high-level system dynamics. This transfer ability is crucial to many real-world control tasks since learning in simulated environments incurs only a low cost. Our experimental validation shows that (1) introducing the PID controller is essential for successful training of quadrotor control policies and (2) the learned high-level policy network can successfully generalize across different environments, including real-world scenarios.
2 Preliminaries
2.1 Formulation of Quadrotor Control Problem
The quadrotor target following task can be naturally formulated as a sequential decision making problem under the RL framework. At each time step $t$, the agent receives an observation $o_t$ from the environment (i.e., the onboard sensor), applies an action $a_t$ according to a policy $\pi$, and then obtains a reward feedback $r_t$. The goal of the agent is to learn an optimal policy that maximizes the expected sum of discounted rewards $R_t$
$$R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i, \qquad (1)$$
where $T$ is the terminal time step and $\gamma \in [0, 1]$ is a discount factor that determines the importance of future rewards. The underlying state $s_t$ of the system includes the physical state configuration (positions, velocities, etc.) of both the quadrotor and the target object (which generally need to be inferred from the observations $o_t$). Since the actions $a_t$ consist of low-level motor commands, the complex system dynamics (denoted as the distribution $p(s_{t+1} \mid s_t, a_t)$) make it difficult to learn a stable policy by directly applying existing model-free RL methods.
2.2 Deep Deterministic Policy Gradient
When the system dynamics $p(s_{t+1} \mid s_t, a_t)$ are not known, model-free RL methods such as policy gradient [Peters and Schaal, 2006] and Q-learning [Sutton et al., 1999] methods are often preferred. Assuming that the environment is fully observed so that $o_t = s_t$, the Q-function $Q^\pi(s_t, a_t)$ represents the expected return after taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$:
$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi[R_t \mid s_t, a_t]. \qquad (2)$$
Consider a Q-function approximator parameterized by $\theta^Q$. It can be optimized by minimizing the loss:
$$L(\theta^Q) = \mathbb{E}_\pi\left[\left(Q(s_t, a_t \mid \theta^Q) - y_t\right)^2\right], \qquad (3)$$
where $y_t = r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a \mid \theta^{Q'})$ is the target Q-value estimated by a target Q-network.
However, for continuous action problems Q-learning becomes difficult since it requires maximizing a complex, nonlinear function at each update to improve the current policy. Therefore continuous domains are often tackled by actor-critic methods, where a separate parameterized "actor" policy is learned in addition to the Q-function. The Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2016] algorithm, based on Deterministic Policy Gradient [Silver et al., 2014], maintains a parameterized actor function $\mu(s \mid \theta^\mu)$ which specifies the current policy by deterministically mapping each state to a unique action. The actor is updated by performing gradient ascent based on the following policy gradient:
$$\nabla_{\theta^\mu} \mu \approx \mathbb{E}_{\mu'}\left[\nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_t}\right]. \qquad (4)$$
By incorporating the replay buffer and the target network backup, two ideas that originated with DQN [Mnih et al., 2015], DDPG can use neural network function approximators for problems that involve continuous action domains.
3 Our Approach
Although model-free methods such as DDPG allow us to optimize complex policies based on raw image observations, they require massive amounts of data to achieve good performance. Besides, function approximators such as neural networks defined on high-dimensional observation spaces are very difficult to train in fragile physical systems such as quadrotors, since the agent can hardly find actions to reach any good state in the exploration process, especially in (infinite) continuous action domains. We now present our proposed approach, which is particularly suitable for this task.
3.1 Policy Network Architecture
In order to represent a policy that performs both perception and control, we use deep neural networks. As shown in Figure 1, the policy network maps monocular RGB images and quadrotor configurations to the actions. We discard pooling in the visual processing layers because pooling loses spatial information. Inspired by [Levine et al., 2016], we incorporate a spatial-argmax layer after the last convolutional layer to convert each pixel-wise feature map into spatial coordinates in the image space. The spatial-argmax layer consists of a spatial softmax function applied to the last convolutional feature map and a fixed sparse fully connected layer which calculates the expected image position of each feature map. The spatial feature points are then regressed by another fully connected layer to a three-dimensional vector, $s_{o,t} = (x_t, y_t, h_t)$, which represents the 2D position and scale (here we only keep the height information) of the target on the image plane. In order to achieve stable flight, it is essential to use the quadrotor configuration $s_{q,t} = (z_t, v_t, q_t, w_t)$, which includes the altitude, linear velocity, orientation, and angular velocity, as an additional input to the neural network. After the perception layers, the target related state $s_{o,t}$ is concatenated with the quadrotor state $s_{q,t}$, followed by fully connected layers to the actions.
Figure 1: Policy network architecture. The perception layers estimate the target state while the control layers learn the control behavior.
While we could choose to directly output low-level motor actions $a_t$ from the policy network, the model easily gets stuck in poor local optima and shows little performance improvement even after tens of thousands of sample experiences. We reason that pure model-free RL methods cannot effectively learn stable policies in such fragile systems. Therefore we introduce another set of high-level actions $u_t$ as the output of the policy network. The high-level actions are then mapped to low-level motor commands by a PID controller.
Figure 2: Hierarchical control system which combines the policy network and a PID controller.
3.2 Combination with PID Control Loop
The quadrotor features complex nonlinear aerodynamics which are hard to learn by model-free RL methods. Fortunately, this challenge can be tackled by incorporating a conventional PID controller. Figure 2 shows the proposed hierarchical control system. At each time step $t$, given the observed image, the policy network generates a four-dimensional high-level reference action $u_t$
$$u_t = (p_x, p_y, p_z, \varphi_{yaw}), \qquad (5)$$
which corresponds to the relative position offset in the x-, y-, z-directions and the relative angle offset around the yaw axis. Thanks to the differential flatness property proposed in [Mellinger and Kumar, 2011], the above high-level reference trajectories can be followed by simple cascade PID controllers with dynamical feasibility guarantee. While the quadrotor dynamics of simulated models are often significantly different from those of real-world platforms, their high-level decision strategies are generally very similar. Thus introducing high-level actions also makes it much easier to transfer a policy learned in a simulator domain to a real-world domain.
3.3 Reward Shaping
Due to the continuous nature of the target following task, an immediate reward feedback at every time step is essential. Besides, the instability and fragility of the quadrotor system also pose additional challenges to the design of the reward function. A naive reward function based solely on the target related state $s_{o,t}$ will lead to suboptimal policies that cannot guarantee flying stability. Therefore the reward function should consider both the quadrotor state and the target related state.
To that end, we design the reward function as a combination of the goal-oriented target reward and the auxiliary quadrotor reward:
$$r = r_g(s_g) + r_q(s_q). \qquad (6)$$
Note that for notational simplicity we omit the time step $t$ from the subscript here. The target reward is expressed as the sum of two parts as below:
$$r_g(s_g) = r_g(x, y) + r_g(h), \qquad (7)$$
which correspond to the position reward and scale reward, respectively. Let $s_{part}$ denote part of the state $s_g$, which corresponds to either $(x, y)$ or $h$. Then the corresponding reward takes the following form:
$$r_g(s_{part}) = \begin{cases} \exp(-\Delta s_{part}) & \Delta s_{part} \le \tau_1 \\ 0 & \tau_1 < \Delta s_{part} \le \tau_2 \\ -(\Delta s_{part} - \tau_2) & \Delta s_{part} > \tau_2 \end{cases}, \qquad (8)$$
where $\Delta s_{part} = \| s_{part} - s^*_{part} \|_2$ denotes the $\ell_2$-norm between the current state and the desired goal state, and $\tau_1$, $\tau_2$ denote different threshold values. The intuition is that the learner must observe variations in the reward signal in order to improve the policy. This hierarchical form essentially provides an intermediate goal to guide the learning process to find a reasonable solution step by step. Also with a slight abuse of notation, the auxiliary reward is expressed as follows:
$$r_q(s_q) = r_q(z) + r_q(q_1, q_2, q_3, q_4), \qquad (9)$$
which correspond to the quadrotor altitude and orientations (expressed as quaternions), respectively. Different from the target reward, the auxiliary reward is used to impose additional constraints on the flying posture. Only penalty terms are introduced. By using the same notations as in the target reward,
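As an illustration of the piecewise target reward in Eq. (8) and its sum over position and scale in Eq. (7), a minimal Python sketch is given below. It is not the implementation used in the experiments: the function names `l2_distance`, `target_reward` and `goal_reward` are hypothetical, and the threshold values in the usage lines are placeholders, since the thresholds are not specified in this section.

```python
import math


def l2_distance(state, goal):
    """Euclidean distance between a state part and its goal (Delta s_part)."""
    return math.sqrt(sum((s - g) ** 2 for s, g in zip(state, goal)))


def target_reward(delta, tau1, tau2):
    """Piecewise reward of Eq. (8) for one state part."""
    if delta <= tau1:
        return math.exp(-delta)   # close to the goal: positive, decaying reward
    if delta <= tau2:
        return 0.0                # intermediate band: neutral
    return -(delta - tau2)        # far from the goal: linear penalty


def goal_reward(target_state, goal_state, tau1, tau2):
    """Eq. (7): position reward on (x, y) plus scale reward on h."""
    (x, y, h), (gx, gy, gh) = target_state, goal_state
    position = target_reward(l2_distance((x, y), (gx, gy)), tau1, tau2)
    scale = target_reward(l2_distance((h,), (gh,)), tau1, tau2)
    return position + scale


# Hypothetical thresholds, chosen only for illustration.
TAU1, TAU2 = 0.1, 0.3
near = goal_reward((0.05, 0.0, 0.4), (0.0, 0.0, 0.4), TAU1, TAU2)
far = goal_reward((0.5, 0.0, 0.4), (0.0, 0.0, 0.4), TAU1, TAU2)
```

With these placeholder thresholds, a target near the goal earns a positive reward on both terms, while a target far off-center is penalized on the position term only.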
(a) Simulated scenario 1. (b) Simulated scenario 2. (c) Simulated scenario 3. (d) Real-world scenario.
Figure 3: Three simulated environments and one real-world environment. The blue, red and green annotations denote the quadrotor, the target and the sensor view, respectively. Object and person distractors are present in scenarios 2 and 3.
Figure 4: Training results of the end-to-end motor-level control system and the proposed hierarchical control system. The hierarchical control system is substantially easier to train.
Figure 5: Training results on three different training strategies. (Two plots over 0 to 150 iterations; legend: w/o pre-training, fine-tune all, fine-tune control.)
Figure 6: Comparison between separate learning and end-to-end learning. The first 150 iterations are trained by only fine-tuning the control layers in Figure 5.
the average return measures the precision of control, while the average episode length measures the number of successfully followed steps (robustness). In both methods, the network is trained for 100 iterations with 2000 time steps for each iteration. We can see that a direct application of DDPG to learn low-level motor commands cannot achieve any performance improvement, while the proposed hierarchical control approach achieves substantial improvement as learning proceeds. This large performance gap is mainly due to the complexity and fragility of the quadrotor system.
With pre-training vs. without pre-training. We can observe in Figure 4 that the agent suffers from severe oscillation in the learning process under the standard training strategy. To validate the necessity of supervised pre-training, we compare three different approaches here. The first one is the standard training without any pre-training (w/o pre-training). The second one is to fine-tune all layers after pre-training (fine-tune all). The last one is our strategy, which only fine-tunes the control layers after the pre-training initialization (fine-tune control). Figure 5 shows the performance of different methods. We show [...] only minor improvements afterwards. The results demonstrate that supervised pre-training of the perception module greatly increases the stability of the policy network, the lack of which has always been a major drawback of many RL algorithms. Actually, further experiments show that even with more training iterations, the pure DDPG algorithm still gets stuck in suboptimal policies. Another important observation is that jointly optimizing the whole network from the very beginning might hurt the overall performance since it leads the convolutional layers to forget useful features learned through pre-training.
End-to-end learning vs. separate learning. So far, the perception layers and the control layers have been learned separately. We now examine our design choice of end-to-end fine-tuning: does training the perception and control layers jointly provide better performance? After initializing the control layers as above, we fine-tune the whole network in an end-to-end manner (end-to-end learning). We also compare against a baseline in which only the control layers are fine-tuned (separate learning). As shown in Figure 6, the learning curves suggest that jointly training the perception and control layers end-to-end does further boost the performance.
4.3 Policy Evaluation in Simulators
To gain more insight into how the learned policy actually works, we further apply the trained policy network in a number of simulated testing environments in which the policy interacts with the environment until game termination. For testing environment initialization, we randomly set the positions of the quadrotor and the target, making the target appear at different corners of the camera view with different scales.
(a) Separate vs. end-to-end in scenario 1 (left to right). (b) Unseen scenarios 2 and 3 (left to right). (c) Real-world test.
Figure 7: Variations of the normalized (divided by image dimensions) true target state $s_{o,t} = (x, y, h)$ over time ($x, y \in [-0.5, 0.5]$, $h \in [0, 1]$). The dashed lines show the goal for each state: $(x^*, y^*, h^*) = (0, 0, 0.4)$. The top-right legend shows the average deviation.
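The average deviation reported in the legends of Figure 7 can be illustrated with a short Python sketch. The exact definition is not stated here, so the mean absolute deviation per state variable is an assumption, as are the function name `average_deviation` and the toy trajectory; only the goal (0, 0, 0.4) comes from the caption.

```python
import numpy as np

# Goal state (x*, y*, h*) taken from the Figure 7 caption.
GOAL = np.array([0.0, 0.0, 0.4])


def average_deviation(states, goal=GOAL):
    """Mean absolute deviation of each state variable from the goal.

    states: array-like of shape (T, 3) holding normalized (x, y, h) per step.
    Using the absolute deviation is an assumption made for this sketch.
    """
    states = np.asarray(states, dtype=float)
    return np.mean(np.abs(states - goal), axis=0)


# Toy trajectory, made up for illustration only.
trajectory = [(0.1, -0.1, 0.5), (0.0, 0.1, 0.3), (-0.1, 0.0, 0.4)]
deviation = average_deviation(trajectory)  # one value each for x, y, h
```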
The performance of the policy is measured by the deviation of the target state from our desired goal (which is specified in the reward function). We also average the results over 5 different runs.
We first compare two different policies trained by separate learning and end-to-end learning (as described in Figure 6), respectively. By applying the models in the testing environment, we can record the true target state variation, as shown in Figure 7(a), to show the quality of different policies. We also compute the average deviation of each state variable to give a quantitative analysis, as shown in the top-right legend. The result is consistent with our findings in the design evaluation. The end-to-end learning approach is clearly better. It is worth noting that, although the policies are only trained with a maximum of 1000 time steps, the agent can generalize well beyond that. Both policies can consistently follow the target for quite a long time, neither crashing the quadrotor nor losing the target from the camera view. However, the separate learning approach only manages to learn a suboptimal policy with which the resulting target state is relatively far from the desired goal state, as indicated by the dashed line in Figure 7(a). On the contrary, the end-to-end learning approach achieves a very stable policy that successfully maintains the target state within a small range of the goal state for up to 5000 time steps (and even more). For all subsequent experiments in this section we will stick to the model trained by the end-to-end learning approach.
Since training is only performed in scenario 1 (Figure 3), we further test the trained policy network in two unseen environments (scenarios 2 and 3) to evaluate the generalization ability. Scenario 2 has a background significantly different from the training setting and scenario 3 has very similar object distractors. We directly apply the trained policy network to the new testing environments without any adaptation. As shown in Figure 7(b), surprisingly, our policy network exhibits good generalization ability to unseen scenes. We find that in scenario 3 the scale drifts slightly. This is the result of occlusion by some similar distractors.
4.4 Policy Transfer to the Real World
Our quadrotor testbed is based on the DJI Matrice 100 platform equipped with an Intel NUC and a camera. We use DJI built-in functions to map the high-level actions to actual flight commands, which are significantly more complex than a simple PID controller. This mapping computation is done by the NUC. For speed, the policy network computation is deployed on the M1000 GPU of a ground laptop, which communicates with the onboard NUC via the Robot Operating System (ROS).
Figure 7(c) shows the results of the real-world flight test, where the true target states are labeled off-line by an object tracker. With a slight decrease in control precision, the transferred agent can still follow the moving target for up to 4000 time steps (approximately 3 minutes). It is worth noting that this real-world test bears several challenges, including the large gap between the low-level controllers, the data delay, and the state noise in the inertial measurement unit (IMU) and GPS. Without any adaptation, the learned policy can still exhibit reasonable following capability. This shows that the policy does have well-generalized performance. With more mature simulators, we believe these are promising directions to pursue in robotics learning.
5 Conclusion and Future Work
In this paper, we have explored the potential of applying a machine learning approach to the challenging autonomous UAV control problem. The policy is represented by a convolutional neural network which combines perception and control and thus can be trained end-to-end. Instead of directly predicting the low-level motor commands, the policy network is designed to produce high-level actions. This enables both stable learning and good generalization ability to different environments. Our training approach consists of supervised learning from raw images and reinforcement learning from games of self-play. This training decomposition greatly alleviates the instability of existing model-free policy gradient methods. Results from both simulated and real-world experiments show that our method can successfully perform the target following task with good generalization ability.
Currently the policy is directly transferred to the real-world quadrotor. This can also be used as an initialization scheme and the model can be further trained end-to-end [Zhu et al., 2017]. We will pursue these research directions in our future work.
Acknowledgments
This research has been supported by General Research Fund 16207316 from the Research Grants Council of Hong Kong.
Empirical findings indicate that end-to-end optimization from the start can hurt the pre-trained representation ability of the perception layers. Pre-training the perception layers separately results in greater stability and improves the overall performance of the policy network, as early joint optimization causes these layers to forget crucial features learned during pre-training.
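The perception layers discussed here end in the spatial-argmax layer of Section 3.1: a spatial softmax over each feature map followed by the expected image position. A minimal NumPy sketch of that operation is shown below. The function name `spatial_soft_argmax`, the (C, H, W) layout and the choice of normalizing coordinates to [-1, 1] are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def spatial_soft_argmax(features):
    """Expected (x, y) image position of each feature map.

    features: array of shape (C, H, W), the last convolutional activations.
    Returns an array of shape (C, 2) with coordinates in [-1, 1]
    (this coordinate convention is an assumption of the sketch).
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)       # spatial softmax per map
    probs = probs.reshape(c, h, w)

    xs = np.linspace(-1.0, 1.0, w)                  # column coordinates
    ys = np.linspace(-1.0, 1.0, h)                  # row coordinates
    expected_x = (probs.sum(axis=1) * xs).sum(axis=1)  # marginal over rows
    expected_y = (probs.sum(axis=2) * ys).sum(axis=1)  # marginal over columns
    return np.stack([expected_x, expected_y], axis=1)


# One feature map with a sharp peak in the top-right corner.
feature_maps = np.zeros((1, 5, 5))
feature_maps[0, 0, 4] = 50.0
points = spatial_soft_argmax(feature_maps)
```

Because the output is a smooth expectation rather than a hard argmax, the layer stays differentiable, which is what allows the feature points to be regressed to the target state by a following fully connected layer.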
Applying the trained policy network in diverse simulated environments demonstrates its generalization abilities and stability across different settings. The policy interacts with the environment until game termination without collapsing, highlighting the effectiveness of the training approach and its potential for diverse applications.
Future research directions include using the trained policy network as an initialization scheme for further end-to-end training and exploring its application to real-world quadrotor control beyond initial deployment. The authors plan to pursue these avenues to expand the real-world applicability of their approach.
The hierarchical control approach greatly outperforms direct application of DDPG due to the complexity and fragility associated with quadrotor systems. The hierarchical structure allows for a stable decomposition of the learning task, utilizing supervised pre-training for perception modules and reinforcement learning for control, leading to substantially improved performance and reduced training instabilities.
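For readers unfamiliar with the DDPG critic update referred to above, the following Python sketch shows how the temporal-difference target and the squared loss of Eq. (3) are typically formed. In DDPG the maximization over actions is replaced by evaluating the target critic at the action proposed by the target actor. The toy `actor_target` and `critic_target` functions, the scalar state and the numbers used are all hypothetical and stand in for neural networks.

```python
import numpy as np


def td_target(reward, next_state, actor_target, critic_target, gamma=0.99):
    """y_t = r + gamma * Q'(s_{t+1}, mu'(s_{t+1})) using the target networks."""
    next_action = actor_target(next_state)
    return reward + gamma * critic_target(next_state, next_action)


def critic_loss(q_values, targets):
    """Mean squared error between critic outputs and TD targets (Eq. (3))."""
    q_values = np.asarray(q_values, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((q_values - targets) ** 2))


# Toy stand-ins for the target actor and target critic (assumptions).
def actor_target(state):
    return -0.5 * state


def critic_target(state, action):
    return -(state ** 2) - action ** 2


y = td_target(reward=1.0, next_state=2.0, actor_target=actor_target,
              critic_target=critic_target, gamma=0.9)
loss = critic_loss([-3.0], [y])
```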
Key challenges in setting up a real-world game playing environment include the inability to efficiently access true target states online to compute rewards and difficulties in resetting the game state after termination. These challenges can be bypassed by using policies trained in simulators where true states are available, allowing direct learning of control layers with state input before merging with perception layers for application in real domains.
The introduction of a PID controller is shown to be essential for successful training, as it significantly aids in stabilizing the learning process when compared to standard end-to-end training methods. The PID controller helps address issues related to the complexity and system fragility of quadrotors.
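To make the role of the PID controller concrete, a minimal single-axis sketch in Python follows. It is a simplification under stated assumptions: the paper relies on cascade PID controllers that produce motor commands, whereas this example uses one textbook PID loop driving a toy plant whose position simply integrates the command. The class name `PID`, the gains and the plant are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID loop (a simplified stand-in for a cascade PID)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        """Return the control command for the current tracking error."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy usage: follow a 1 m position offset along one axis, as a high-level
# action component such as p_x might request. Gains are illustrative only.
pid = PID(kp=2.0, ki=0.5, kd=0.1)
position, target, dt = 0.0, 1.0, 0.1
for _ in range(500):
    command = pid.step(target - position, dt)  # treated here as a velocity
    position += command * dt                   # toy integrator plant
```

The point of the split is that the learned policy only has to choose the reference (the offset), while a well-understood feedback loop takes care of reaching it.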
The factorizing scheme involves pre-training the perception layers using a small dataset collected from real-world scenarios, which addresses the large gap between simulated and real-world images. By fixing the perception layers initially and only training the control layers, the learned policy can be applied to real-world domains, taking advantage of the availability of true states in simulations to optimize control layers while maintaining the pre-trained perception capabilities.
High-level actions in the policy network, as opposed to low-level motor commands, facilitate stable learning and enhance the policy's ability to generalize across different environments. This approach breaks down the learning process into supervised learning from images and reinforcement learning, reducing the instability often associated with model-free policy gradient methods.
Supervised pre-training enhances stability by mitigating the vibrations and oscillations that commonly occur during learning in standard training strategies. This approach stabilizes the learning curve by ensuring that the perception module already has a robust feature extraction capability before learning the control commands, thus preventing the network from being stuck in suboptimal policies.
The decomposition into supervised learning from raw images and reinforcement learning through self-play alleviates instability by simplifying the learning process into manageable parts. Supervised learning provides a robust starting point for perception, while reinforcement learning refines control strategies, thus reducing the chances of instability often seen in model-free policy gradient methods.