Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Learning Unmanned Aerial Vehicle Control for Autonomous Target Following

Siyi Li¹, Tianbo Liu², Chi Zhang¹, Dit-Yan Yeung¹, Shaojie Shen²
¹ Department of Computer Science and Engineering, HKUST
² Department of Electronic and Computer Engineering, HKUST
{sliay, czhangbr, dyyeung}@[Link], {tliuam, eeshaojie}@[Link]

Abstract

While deep reinforcement learning (RL) methods have achieved unprecedented successes in a range of challenging problems, their applicability has been mainly limited to simulation or game domains due to the high sample complexity of the trial-and-error learning process. However, real-world robotic applications often need a data-efficient learning process with safety-critical constraints. In this paper, we consider the challenging problem of learning unmanned aerial vehicle (UAV) control for tracking a moving target. To acquire a strategy that combines perception and control, we represent the policy by a convolutional neural network. We develop a hierarchical approach that combines a model-free policy gradient method with a conventional feedback proportional-integral-derivative (PID) controller to enable stable learning without catastrophic failure. The neural network is trained by a combination of supervised learning from raw images and reinforcement learning from games of self-play. We show that the proposed approach can learn a target following policy in a simulator efficiently and the learned behavior can be successfully transferred to the DJI quadrotor platform for real-world UAV control.

1 Introduction

The recent development of perception [Li and Yeung, 2017; Shen et al., 2013] and control [Lee, 2011] technologies for unmanned aerial vehicles (UAVs) enables autonomy in various complex environments, opening up a promising market with applications in aerial photography, monitoring, and inspection. Many of these applications require an aerial robot to autonomously follow a moving target.

In this paper, we consider the problem of controlling a quadrotor with very limited payload to autonomously follow a moving target. Most early research on quadrotor autonomy focused on a two-step pipeline. The first step is on perception where various vision algorithms are used to estimate the underlying state or map [Bachrach et al., 2012; Barry and Tedrake, 2015; Schmid et al., 2013; Bills et al., 2011; Sun et al., 2017]. The second step is to design and tune conventional feedback controllers, such as proportional-integral-derivative (PID) controllers, matching some manually defined rules. The whole system consists of several hard-coded components without any learning capability and relies heavily on tedious tuning by human experts to achieve good performance.

These limitations may be addressed by viewing the problem from the machine learning perspective. In essence, quadrotor control is a sequential prediction problem with the sensory information as input and the motor control commands as output. Standard supervised learning [Giusti et al., 2016] and imitation learning [Ross et al., 2013] have been used to learn various controllers for quadrotors. However, both methods require labeled datasets, which have to be provided by costly human experts. With human experts becoming the bottleneck, such approaches are often restricted to small datasets and thus cannot exploit high-capacity learning algorithms to train the policies.

Recent advances in reinforcement learning (RL) offer new promises for solving the problem. Instead of demanding explicitly labeled samples as in supervised learning, RL only requires a scalar reward function to guide the learning agent through a trial-and-error process interacting with the environment. Most RL approaches belong to either one of two categories: model-based methods and model-free methods. Model-based methods learn an explicit dynamical model of the environment and then optimize the policy under this model. They have been successfully applied to robotics for various applications such as object manipulation [Deisenroth et al., 2015; Levine et al., 2016], ground vehicles [Mueller et al., 2012], helicopters [Abbeel et al., 2010; 2007], and quadrotors [Mohajerin and Waslander, 2014; Punjani and Abbeel, 2015; Zhang et al., 2016]. Model-based methods tend to have a data-efficient learning process but suffer from significant bias since complex unknown dynamics cannot always be modeled accurately. In contrast, model-free methods have the ability of handling arbitrary dynamical systems with minimal bias. Several recent studies in model-free methods, especially deep RL [Mnih et al., 2015; Lillicrap et al., 2016; Mnih et al., 2016; Schulman et al., 2015; Gu et al., 2017], have shown that end-to-end model-free methods are capable of learning high-quality control policies using generic neural networks with minimal feature engineering. However, model-free methods generally require a data-hungry training paradigm which is costly for real-world physical systems. Although some hierarchical learning methods [Aijun Bai, 2017] are proposed, they are not designed for the quadrotor domain. Specific to the quadrotors, the relative frailty of the underlying system makes a partially trained motor-level policy crash in the initial stage of training. A key question is how to enjoy the richness and flexibility of a self-improving policy by model-free learning while preserving the stability of conventional controllers.

In this paper, we propose to combine the stability of conventional feedback PID controllers with the self-improving performance of model-free RL techniques so that the hybrid method can be practically applied to learning quadrotor control. Model-free methods are used to learn the high-level reference inputs to the PID control loop while the PID controller maps the reference inputs to low-level motor actions. Consequently, both data efficiency and training stability will be greatly improved. Moreover, the proposed hierarchical control system makes it easy to transfer the policy trained in simulators to real-world platforms, since both sides share similar high-level system dynamics. This transfer ability is crucial to many real-world control tasks since learning in simulated environments only incurs low cost. Our experimental validation shows that (1) introducing the PID controller is essential for successful training of quadrotor control policies and (2) the learned high-level policy network can successfully generalize across different environments including real-world scenarios.

2 Preliminaries

2.1 Formulation of Quadrotor Control Problem

The quadrotor target following task can be naturally formulated as a sequential decision making problem under the RL framework. At each time step $t$, the agent receives an observation $o_t$ from the environment (i.e., the onboard sensor), applies an action $a_t$ according to a policy $\pi$, and then obtains a reward feedback $r_t$. The goal of the agent is to learn an optimal policy that maximizes the expected sum of discounted rewards $R_t$

$$R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i, \tag{1}$$

where $T$ is the terminated time step and $\gamma \in [0, 1]$ is a discount factor that determines the importance of future rewards. The underlying state $s_t$ of the system includes the physical state configuration (positions, velocities, etc.) of both the quadrotor and the target object (which generally need to be inferred from the observations $o_t$). Since the actions $a_t$ consist of low-level motor commands, the complex system dynamics (denoted as the distribution $p(s_{t+1} \mid s_t, a_t)$) make it difficult to learn a stable policy by directly applying existing model-free RL methods.

2.2 Deep Deterministic Policy Gradient

When the system dynamics $p(s_{t+1} \mid s_t, a_t)$ are not known, model-free RL methods such as policy gradient [Peters and Schaal, 2006] and Q-learning [Sutton et al., 1999] methods are often preferred. Assuming that the environment is fully observed so that $o_t = s_t$, the Q-function $Q^\pi(s_t, a_t)$ represents the expected return after taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$:

$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi[R_t \mid s_t, a_t]. \tag{2}$$

Consider a Q-function approximator parameterized by $\theta^Q$. It can be optimized by minimizing the loss:

$$L(\theta^Q) = \mathbb{E}_\pi\left[\left(Q(s_t, a_t \mid \theta^Q) - y_t\right)^2\right], \tag{3}$$

where $y_t = r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a \mid \theta^{Q'})$ is the target Q-value estimated by a target Q-network.

However, for continuous action problems Q-learning becomes difficult since it requires maximizing a complex, nonlinear function at each update to improve the current policy. Therefore continuous domains are often tackled by actor-critic methods, where a separate parameterized “actor” policy is learned in addition to the Q-function. The Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2016] algorithm, based on Deterministic Policy Gradient [Silver et al., 2014], maintains a parameterized actor function $\mu(s \mid \theta^\mu)$ which specifies the current policy by deterministically mapping each state to a unique action. The actor is updated by performing gradient ascent based on the following policy gradient:

$$\nabla_{\theta^\mu} \mu \approx \mathbb{E}_{\mu'}\left[\nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t)} \, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_t}\right]. \tag{4}$$

By incorporating the ideas of sample replay buffer and target network backup originated from DQN [Mnih et al., 2015], DDPG can use neural network function approximators for problems that involve continuous action domains.

3 Our Approach

Although model-free methods such as DDPG allow us to optimize complex policies based on raw image observations, they require massive amounts of data to achieve good performance. Besides, function approximators such as neural networks defined on high-dimensional observation spaces are very difficult to train in fragile physical systems such as quadrotors, since the agent can hardly find actions to reach any good state in the exploration process, especially in (infinite) continuous action domains. We now present our proposed approach which is particularly suitable for this task.

3.1 Policy Network Architecture

In order to represent a policy that performs both perception and control, we use deep neural networks. As shown in Figure 1, the policy network maps monocular RGB images and quadrotor configurations to the actions. We discard pooling in the visual processing layers due to the loss of spatial information. Inspired by [Levine et al., 2016], we incorporate a spatial-argmax layer after the last convolutional layer to convert each pixel-wise feature map into spatial coordinates in the image space. The spatial-argmax layer consists of a spatial softmax function applied to the last convolutional feature map and a fixed sparse fully connected layer which calculates the expected image position of each feature map. The spatial feature points are then regressed to a three-dimensional vector, $s_{o,t} = (x_t, y_t, h_t)$, which represents the 2D position and scale (here we only keep the height information) of the target on the image plane, by another fully connected layer. In order to achieve stable flight, it is essential to use the quadrotor configuration $s_{q,t} = (z_t, v_t, q_t, w_t)$, which includes the altitude,


[Figure 1 diagram. Perception layers: RGB image (3 channels, 64 × 64), then conv1, conv2, conv3 (each 3x3 conv, stride 1, relu, 16 channels; outputs 62 × 62, 60 × 60, 58 × 58), then spatial softmax (16 channels, 58 × 58), then expected 2D pos (32), then a linear layer giving the target state [x, y, h]. Control layers: the target state is concatenated with the quadrotor configuration [z_t, v_t, q_t, w_t], followed by two relu layers of 128 units and a tanh layer giving the 4-dim actions.]
Figure 1: Policy network architecture. The perception layers estimate the target state while the control layers learn the control behavior.
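As a rough illustration of the "spatial softmax" and "expected 2D pos" stages in Figure 1, the following minimal sketch computes the expected image position of each feature map using only the Python standard library. The function names and the choice to normalise coordinates to [-1, 1] are assumptions made for this sketch; they are not taken from the paper.

```python
import math


def spatial_soft_argmax(feature_map):
    """Return the expected (x, y) position of one H x W feature map.

    Coordinates are normalised to [-1, 1]; that range is an assumption of this sketch.
    """
    h, w = len(feature_map), len(feature_map[0])
    peak = max(max(row) for row in feature_map)  # subtract the max for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    xs = [-1.0 + 2.0 * j / (w - 1) for j in range(w)]
    ys = [-1.0 + 2.0 * i / (h - 1) for i in range(h)]
    # Expectation of the pixel coordinates under the spatial softmax distribution.
    exp_x = sum(weights[i][j] * xs[j] for i in range(h) for j in range(w)) / total
    exp_y = sum(weights[i][j] * ys[i] for i in range(h) for j in range(w)) / total
    return exp_x, exp_y


def feature_points(feature_maps):
    """Concatenate the expected positions of all channels, e.g. 16 maps -> 32 values."""
    points = []
    for fmap in feature_maps:
        points.extend(spatial_soft_argmax(fmap))
    return points
```

With 16 feature maps of size 58 × 58 this yields the 32 values shown in the figure. A strongly peaked map returns approximately the location of its peak, while a flat map returns the image centre.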

linear velocity, orientation, and angular velocity, as an additional input to the neural network. After the perception layers, the target related state $s_{o,t}$ is concatenated with the quadrotor state $s_{q,t}$, followed by fully connected layers to the actions.

While we could choose to directly output low-level motor actions $a_t$ by the policy network, the model will easily get stuck in an unwanted local optimum and yield little performance improvement even after tens of thousands of sample experiences. We reason that pure model-free RL methods cannot effectively learn stable policies in such fragile systems. Therefore we introduce another set of high-level actions $u_t$ as the output of the policy network. The high-level actions are then mapped to low-level motor commands by a PID controller.

[Figure 2 diagram. Blocks: Policy Network, PID Controller, Quadrotor, Sensor. The signal labels were lost in extraction.]
Figure 2: Hierarchical control system which combines the policy network and a PID controller.

3.2 Combination with PID Control Loop

The quadrotor features complex nonlinear aerodynamics which are hard to learn by model-free RL methods. Fortunately, this challenge can be tackled by incorporating a conventional PID controller. Figure 2 shows the proposed hierarchical control system. At each time step $t$, given the observed image, the policy network generates a four-dimensional high-level reference action $u_t$

$$u_t = (p_x, p_y, p_z, \varphi_{yaw}), \tag{5}$$

which corresponds to the relative position offset in the x-, y-, z-directions and the relative angle offset around the yaw axis. Thanks to the differential flatness property proposed in [Mellinger and Kumar, 2011], the above high-level reference trajectories can be followed by simple cascade PID controllers with dynamical feasibility guarantee. While the quadrotor dynamics of simulated models are often significantly different from those of real-world platforms, their high-level decision strategies are generally very similar. Thus introducing high-level actions also makes it much easier to transfer a policy learned in a simulator domain to a real-world domain.

3.3 Reward Shaping

Due to the continuous nature of the target following task, an immediate reward feedback at every time step is essential. Besides, the instability and fragility of the quadrotor system also pose some additional challenges to the design of the reward function. A naive reward function based solely on the target related state $s_{o,t}$ will lead to suboptimal policies that cannot guarantee flying stability. Therefore the reward function should consider both the quadrotor state and the target related state.

To that end, we design the reward function as a combination of the goal-oriented target reward and the auxiliary quadrotor reward:

$$r = r_g(s_g) + r_q(s_q). \tag{6}$$

Note that for notational simplicity we omit the time step $t$ from the subscript here. The target reward is expressed as the sum of two parts as below:

$$r_g(s_g) = r_g(x, y) + r_g(h), \tag{7}$$

which correspond to the position reward and scale reward, respectively. Let $s_{part}$ denote part of the state $s_g$, which corresponds to either $(x, y)$ or $h$. Then the corresponding reward takes the following form:

$$r_g(s_{part}) = \begin{cases} \exp(-\Delta s_{part}) & \Delta s_{part} \le \tau_1 \\ 0 & \tau_1 < \Delta s_{part} \le \tau_2 \\ -(\Delta s_{part} - \tau_2) & \Delta s_{part} > \tau_2 \end{cases}, \tag{8}$$

where $\Delta s_{part} = \lVert s_{part} - s^*_{part} \rVert_2$ denotes the $\ell_2$ distance between the current state and the desired goal state, and $\tau_1$, $\tau_2$ denote different threshold values. The intuition is that the learner must observe variations in the reward signal in order to improve the policy. This hierarchical form essentially provides an intermediate goal to guide the learning process to find a reasonable solution step by step. Also with a slight abuse of notation, the auxiliary reward is expressed as follows:

$$r_q(s_q) = r_q(z) + r_q(q_1, q_2, q_3, q_4), \tag{9}$$

which correspond to the quadrotor altitude and orientations (expressed as quaternions), respectively. Different from the target reward, the auxiliary reward is used to impose additional constraints on the flying gesture. Only penalty terms are introduced. By using the same notations as in the target reward,


the auxiliary reward takes the following form:

$$r_q(s_{part}) = \begin{cases} -c\,(1 - \exp(-\Delta s_{part})) & \Delta s_{part} \le \tau_1 \\ -c & \Delta s_{part} > \tau_1 \end{cases}, \tag{10}$$

where $\tau_1$ denotes the same threshold as in equation (8) and $c$ denotes the penalty weight.

With the reward function defined above, existing policy gradient methods (such as DDPG) can be applied in a game playing environment to train the policy network.

3.4 Training Strategy

Although the policy network can be directly trained end-to-end by DDPG, our empirical finding shows that the agent will suffer severe vibration in the learning process due to the high sample complexity of model-free methods. Another drawback is that the perception layers cannot be guaranteed to accurately locate the target of interest.

We propose to introduce a supervised pre-training stage that allows us to initialize the perception layers of the policy network using a relatively small number of training iterations. The dataset can be easily collected by randomly moving the quadrotor and recording the camera image stream. In a simulator, we can directly get the accurate labels. In real-world domains, we can use existing model-free object trackers [Li and Yeung, 2017] to get the noisy labels. Both are sufficient to train a pose regression CNN. After training on the regression task, the weights of the perception layers can be transferred to the policy network. By factoring the perception and control tasks in learning, we can gain the image generalization power of CNNs across different environments (simulator and real-world). This is more reasonable than some existing approaches [Sadeghi and Levine, 2017] which directly transfer simulator perception to the real world.

After the supervised learning stage, we first fix the perception layers and learn only the weights of the control layers which are not initialized with pre-training. Then the entire policy network is further optimized in an end-to-end manner. Our empirical findings (in Section 4.2) show that jointly optimizing the whole network from the very beginning hurts the pre-trained representation ability of the perception layers.

3.5 Transfer to the Real World

As explained in Section 3.2, learning the high-level actions makes it easier to transfer a policy from the simulator domain to a real-world domain. Considering the large gap between the simulated images and the real-world images, we take the factorizing scheme during training as in Section 3.4. Namely, the perception layers are pre-trained by a small dataset collected in real-world scenarios. There are also several challenges to set up a game playing environment for real-world quadrotors. First, we cannot access the true target state efficiently online and thus cannot reliably compute the reward. Second, efficiently resetting the game state upon game termination is difficult. Fortunately we can bypass these issues to directly use the policy behavior trained in simulators. Since in simulators all the underlying true states are available, we can directly learn the control layers with the state input. Finally the perception layers and the control layers are merged to form the policy network which is applicable to real-world domains.

4 Experiments

In this section we present a series of experiments to answer the following research questions:
1. Is introducing the PID controller essential for successful training?
2. How does the training strategy work compared to standard end-to-end training?
3. How does the learned high-level policy network generalize across different environments?

To answer questions (1) and (2), we evaluate different variations of the proposed system, in Section 4.2, by training policies for the target following task in a simulated environment. We further evaluate the generalization ability of the learned policy in Section 4.3 by testing it in various simulated environments. Finally, we set up a real-world flight test in Section 4.4.

4.1 Simulator Settings

Environment. We set up the simulated target following task on the Virtual Robot Experimentation Platform (V-REP) using the built-in quadrotor model. The observed state of the quadrotor $s_{q,t} = (z_t, v_t, q_t, w_t) \in \mathbb{R}^{11}$ consists of the altitude, linear velocity, orientation, and angular velocity, where the velocities are expressed in the body frame. The state of the target object is unknown and must be inferred from the RGB image input with resolution 64 × 64. We require the target to be in the camera view of the quadrotor on initialization. At each time step, the target randomly chooses a direction to walk at a random speed. The game will terminate either when the target is out of the camera view or when the quadrotor crashes, as determined by using simple thresholds on the quadrotor’s altitude and orientation. The maximum episode length allowed is set to 1000. Three different simulated scenes are shown in the left part of Figure 3.

Implementation Details. We choose the off-policy actor-critic algorithm DDPG due to its sample efficiency over on-policy methods. Our implementation is based on rllab [Duan et al., 2016]. The Q-network shares the same architecture with the policy network, except that the last two fully connected layers have a smaller number of units (32) and the actions are included in the second to last layer. We use Adam [Kingma and Ba, 2015] for optimization with the hyperparameters set according to [Lillicrap et al., 2016]. For the reward setting, we use $\tau_1 = 0.05$, $\tau_2 = 0.2$, and $c = 0.5$.

4.2 System Design Evaluation

In this section we validate our design choices of both the hierarchical control system and the proposed training strategy. All of the results are averaged over 5 different runs.

Hierarchical control system vs. end-to-end control system. In this experiment, we show the effectiveness of introducing the PID controller to the control system. We compare two approaches: the first one being an end-to-end control system where the policy network directly outputs $a_t$ (the velocities of the 4 rotors) and the second one being the hierarchical control system with PID where the policy network outputs $u_t$. Both approaches are trained with standard training strategy (no pre-training and no hierarchical fine-tuning). Figure 4 shows the learning curves of these two different approaches. Intuitively,


[Figure 3 images: (a) Simulated scenario 1. (b) Simulated scenario 2. (c) Simulated scenario 3. (d) Real-world scenario.]
Figure 3: Three simulated environments and one real-world environment. The blue, red and green annotations denote the quadrotor, the target and the sensor view respectively. Object and person distractors are present in scenarios 2 and 3.

[Figure 4 plots: average episode return and average episode length versus iterations (0 to 100) for end-to-end control and hierarchical control.]
Figure 4: Training results of the end-to-end motor-level control system and the proposed hierarchical control system. The hierarchical control system is substantially easier to train.

[Figure 6 plots: average episode return and average episode length versus iterations (150 to 300) for separate learning and end-to-end learning.]
Figure 6: Comparison between separate learning and end-to-end learning. The first 150 iterations are trained by only fine-tuning the control layers in Figure 5.
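The staged schedule described in the Figure 6 caption can be summarised with a small sketch: after supervised pre-training of the perception layers, only the control layers are updated for the first 150 iterations, and the whole network is fine-tuned afterwards. The group names and the function below are hypothetical and serve only to illustrate the schedule; the paper does not describe its implementation at this level.

```python
# Hypothetical layer-group names; the paper only distinguishes perception and control layers.
PERCEPTION, CONTROL = "perception", "control"


def trainable_groups(iteration, control_only_iters=150):
    """Return which layer groups receive RL updates at a given DDPG iteration.

    Assumes the perception layers were already initialised by supervised pre-training.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration < control_only_iters:
        return {CONTROL}            # perception layers stay frozen
    return {PERCEPTION, CONTROL}    # end-to-end fine-tuning of the whole network
```

In a deep learning framework the returned set would typically decide which parameters are passed to the optimizer at each stage.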

the average return measures the precision of control, while the average episode length measures the successfully followed steps (robustness). In both methods, the network is trained for 100 iterations with 2000 time steps for each iteration. We can see that a direct application of DDPG to learn low-level motor commands cannot achieve any performance improvement while the proposed hierarchical control approach achieves substantial improvement as learning proceeds. This large performance gap is mainly due to the complexity and fragility of the quadrotor system.

With pre-training vs. without pre-training. We can observe in Figure 4 that the agent suffers severe vibration in the learning process under standard training strategy. To validate the necessity of supervised pre-training, we compare three different approaches here. The first one is the standard training without any pre-training (w/o pre-training). The second one is to fine-tune all layers after pre-training (fine-tune all). The last one is our strategy which only fine-tunes the control layers after the pre-training initialization (fine-tune control). Figure 5 shows the performance of different methods. We show the learning curves for up to 150 iterations since we observe only minor improvements afterwards. The results demonstrate that supervised pre-training of the perception module greatly improves the training stability of the policy network, the lack of which has long been a major drawback of many RL algorithms. Actually, further experiments show that even with more training iterations, the pure DDPG algorithm still gets stuck in suboptimal policies. Another important observation is that jointly optimizing the whole network from the very beginning might hurt the overall performance since it leads the convolutional layers to forget useful features learned through pre-training.

[Figure 5 plots: average episode return and average episode length versus iterations (0 to 150) for w/o pre-training, fine-tune all, and fine-tune control.]
Figure 5: Training results on three different training strategies.

End-to-end learning vs. separate learning. So far, the perception layers and the control layers have been learned separately. We now examine our design choice of end-to-end fine-tuning: does training the perception and control layers jointly provide better performance? After initializing the control layers as above, we fine-tune the whole network in an end-to-end manner (end-to-end learning). We also make comparison with a baseline in which only the control layers are fine-tuned (separate learning). As shown in Figure 6, the learning curve suggests that jointly training the perception and control layers end-to-end does further boost the performance.

4.3 Policy Evaluation in Simulators

To gain more insights into how the learned policy actually works, we further apply the trained policy network in a number of simulated testing environments in which the policy interacts with the environment until game termination. For testing environment initialization, we randomly set the positions of the quadrotor and the target, making the target appear at different corners of the camera view with different scales.


[Figure 7 plots: five panels showing the target x coordinate, target y coordinate, target scale, and goal states versus timestep. (a) Separate vs. end-to-end in scenario 1 (left to right). (b) Unseen scenarios 2 and 3 (left to right). (c) Real-world test. Average deviations as recovered from the extracted panel legends, in panel order, given as (x, y, h): separate approach (0.088, 0.076, 0.151); end-to-end approach (0.054, 0.111, 0.034); scenario 2 (0.059, 0.092, 0.075); scenario 3 (0.056, 0.074, 0.112); real-world transfer (0.150, 0.102, 0.095).]
Figure 7: Variations of the normalized (divided by image dimensions) true target state $s_{o,t} = (x, y, h)$ over time ($x, y \in [-0.5, 0.5]$, $h \in [0, 1]$). The dashed lines show the goal for each state: $(x^*, y^*, h^*) = (0, 0, 0.4)$. The top-right legend shows the average deviation.
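As a hedged illustration of the "average deviation" reported in the Figure 7 legends, the sketch below averages the per-component distance between a recorded target-state trajectory and the goal $(0, 0, 0.4)$. The paper does not spell out the exact formula, so the use of the mean absolute difference is an assumption, and the function name is hypothetical.

```python
GOAL = (0.0, 0.0, 0.4)  # (x*, y*, h*) from the Figure 7 caption


def average_deviation(states, goal=GOAL):
    """Mean absolute deviation of each state component from its goal value.

    `states` is a sequence of normalised (x, y, h) tuples, one per time step.
    Using the absolute difference is an assumption; the paper does not give the formula.
    """
    if not states:
        raise ValueError("need at least one time step")
    n = len(states)
    return tuple(sum(abs(s[k] - goal[k]) for s in states) / n for k in range(len(goal)))
```

For example, a two-step trajectory `[(0.1, -0.1, 0.5), (-0.1, 0.1, 0.3)]` deviates by about 0.1 in every component.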

The performance of the policy is measured by the deviation of the target state from our desired goal (which is specified in the reward function). We also average the results over 5 different runs.

We first compare two different policies trained by separate learning and end-to-end learning (as described in Figure 6), respectively. By applying the models in the testing environment, we can record the true target state variation, as shown in Figure 7(a), to show the quality of different policies. We also compute the average deviation of each state variable to give quantitative analysis, as shown in the top-right legend. The result is consistent with our findings in the design evaluation. The end-to-end learning approach is clearly better. It is worth noting that, although the policies are only trained with a maximum of 1000 time steps, the agent can generalize well beyond that. Both policies can consistently follow the target for quite a long time, neither crashing the quadrotor nor losing the target from the camera view. However, the separate learning approach only manages to learn a suboptimal policy with which the resulting target state is relatively far from the desired goal state, as indicated by the dashed line in Figure 7(a). On the contrary, the end-to-end learning approach achieves a very stable policy to successfully maintain the target state within a small range of the goal state for up to 5000 time steps (and even more). For all subsequent experiments in this section we will stick to the model trained by the end-to-end learning approach.

Since training is only performed in scenario 1 (Figure 3), we further test the trained policy network in two unseen environments (scenarios 2 and 3) to evaluate the generalization ability. Scenario 2 has a background significantly different from the training setting and scenario 3 has very similar object distractors. We directly apply the trained policy network to the new testing environments without any adaptation. As shown in Figure 7(b), surprisingly, our policy network exhibits good generalization ability to unseen scenes. We find that in scenario 3 the scale drifts a little bit. This is the result of occlusion by some similar distractors.

4.4 Policy Transfer to the Real World

Our quadrotor testbed is based on the DJI Matrice 100 platform equipped with an Intel NUC and a camera. We use DJI built-in functions to map the high-level actions to actual flight commands, which are significantly more complex than a simple PID controller. This mapping computation is done by the NUC. For speed consideration, the policy network computation is deployed on a ground laptop M1000 GPU which communicates with the onboard NUC via the Robot Operating System (ROS).

Figure 7(c) shows the results of the real-world flight test, where the true target states are labeled off-line by an object tracker. With a slight decrease in control precision, the transferred agent can still follow the moving target for up to 4000 time steps (approximately 3 minutes). It is worth noting that this real-world test bears several challenges, including the large gap between the low-level controllers, the data delay, and the state noise in the inertial measurement unit (IMU) and GPS. Without any adaptation, the learned policy can still exhibit reasonable following capability. This shows that the policy does have well-generalized performance. With more mature simulators, we believe these are promising directions to pursue in robotics learning.

5 Conclusion and Future Work

In this paper, we have explored the potential of applying a machine learning approach to the challenging autonomous UAV control problem. The policy is represented by a convolutional neural network which combines perception and control and thus can be trained end-to-end. Instead of directly predicting the low-level motor commands, the policy network is designed to produce high-level actions. This enables both stable learning and good generalization ability to different environments. Our training approach consists of supervised learning from raw images and reinforcement learning from games of self-play. This training decomposition greatly alleviates the instability of existing model-free policy gradient methods. Results from both simulated and real-world experiments show that our method can successfully perform the target following task with good generalization ability.

Currently the policy is directly transferred to the real-world quadrotor. This can also be used as an initialization scheme and the model can be further trained end-to-end [Zhu et al., 2017]. We will pursue these research directions in our future work.

Acknowledgments

This research has been supported by General Research Fund 16207316 from the Research Grants Council of Hong Kong.


References

[Abbeel et al., 2007] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. In NIPS, 2007.
[Abbeel et al., 2010] Pieter Abbeel, Adam Coates, and Andrew Y Ng. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 2010.
[Aijun Bai, 2017] Aijun Bai and Stuart Russell. Efficient reinforcement learning with hierarchies of machines by leveraging internal transitions. In IJCAI, 2017.
[Bachrach et al., 2012] Abraham Bachrach, Samuel Prentice, Ruijie He, Peter Henry, Albert S Huang, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy. Estimation, planning, and mapping for autonomous flight using an RGB-D camera in GPS-denied environments. The International Journal of Robotics Research, 31(11):1320–1343, 2012.
[Barry and Tedrake, 2015] Andrew J Barry and Russ Tedrake. Pushbroom stereo for high-speed navigation in cluttered environments. In ICRA, 2015.
[Bills et al., 2011] Cooper Bills, Joyce Chen, and Ashutosh Saxena. Autonomous MAV flight in indoor environments using single image perspective cues. In ICRA, 2011.
[Deisenroth et al., 2015] Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
[Duan et al., 2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016.
[Giusti et al., 2016] Alessandro Giusti, Jérôme Guzzi, Dan C Cireşan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2):661–667, 2016.
[Gu et al., 2017] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In ICRA, 2017.
[Kingma and Ba, 2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[Lee, 2011] Taeyoung Lee. Geometric tracking control of the attitude dynamics of a rigid body on SO(3). In ACC, 2011.
[Levine et al., 2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
[Li and Yeung, 2017] Siyi Li and Dit-Yan Yeung. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In AAAI, 2017.
[Lillicrap et al., 2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
[Mellinger and Kumar, 2011] Daniel Mellinger and Vijay Kumar. Minimum snap trajectory generation and control for quadrotors. In ICRA, 2011.
[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[Mnih et al., 2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
[Mohajerin and Waslander, 2014] Nima Mohajerin and Steven L Waslander. Modular deep recurrent neural network: Application to quadrotors. In SMC, 2014.
[Mueller et al., 2012] Fabian L Mueller, Angela P Schoellig, and Raffaello D’Andrea. Iterative learning of feed-forward corrections for high-performance tracking. In IROS, 2012.
[Peters and Schaal, 2006] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In IROS, 2006.
[Punjani and Abbeel, 2015] Ali Punjani and Pieter Abbeel. Deep learning helicopter dynamics models. In ICRA, 2015.
[Ross et al., 2013] Stéphane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J Andrew Bagnell, and Martial Hebert. Learning monocular reactive UAV control in cluttered natural environments. In ICRA, 2013.
[Sadeghi and Levine, 2017] Fereshteh Sadeghi and Sergey Levine. (CAD)²RL: Real single-image flight without a single real image. In RSS, 2017.
[Schmid et al., 2013] Korbinian Schmid, Teodor Tomic, Felix Ruess, Heiko Hirschmüller, and Michael Suppa. Stereo vision based indoor/outdoor navigation for flying robots. In IROS, 2013.
[Schulman et al., 2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.
[Shen et al., 2013] Shaojie Shen, Yash Mulgaonkar, Nathan Michael, and Vijay Kumar. Vision-based state estimation and trajectory control towards high-speed flight with a quadrotor. In RSS, 2013.
[Silver et al., 2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
[Sun et al., 2017] Ting Sun, Shengyi Nie, Dit-Yan Yeung, and Shaojie Shen. Gesture-based piloting of an aerial robot using monocular vision. In ICRA, 2017.
[Sutton et al., 1999] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
[Zhang et al., 2016] Tianhao Zhang, Gregory Kahn, Sergey Levine, and Pieter Abbeel. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In ICRA, 2016.
[Zhu et al., 2017] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.
