GPU-Accelerated Model Predictive Control
GPU-Accelerated Model Predictive Control
Authorized licensed use limited to: University of Michigan Library. Downloaded on March 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
Active Set and Interior Point [9] are two effective methods é B ù
to solve QP problems. The first is adopted to realize (3). In é x (k + 1| k ) ù é A ù ê ú
ê ú ê 2ú ê 1 ú
ê å i =0
i
each iteration the main purpose of Active Set is to find a ê x (k + 2 | k ) ú ê A ú A B ú
feasible point at which the effective constraints are used as ê ú ê ú ê ú
ê x (k + 3 | k ) ú = ê A3 ú x (k )+ ê 2 Ai B ú u(k -1)
equality constraints and the ineffective constraints are deleted. ê ú ê ú ê å i =0 ú
To find a feasible point, an equality constrained QP problem is ê ú ê ú ê ú
ê ú ê ú ê ú
solved. There are 4 steps as follows: ê x ( k + N | k )ú ê A N ú ê ú
ëê ûú ëê ûú ê -1 i ú
ëê å i =0
N
A B
Step 1. Let k = 1 and find an initial feasible point X (1) ûú
which satisfies:
é B 0 ù
ê ú
T
a X (1)
£ bi ,i Î E ê AB + B 0 ú é Du(k | k ) ù
i
(4) ê ú ê ú
T
a X (1)
= bi ,i Î I ê A2 B + AB + B 0 ú ê ú (8)
i +ê ú ê
ê ú ú
At point X (1) the active set is I ( X (1) ) = {i | aiT X (1) = bi ,i Î I } . ê ú êëDu(k + N -1| k )úû
ê N -1 N -m ú
ê å Ai B åi=0 Ai Bú
Step 2. Find searching direction d ( k ) by solving the ë i =0 û
equality constrained QP problem: The predictive output on horizon N is
1 é y (k + 1| k ) ù éC 0 0 ù é x (k + 1 | k ) ù
min d T Gd + f ( X ( k ) )T d ê ú ê úê ú
2 (5) ê y(k + 2 | k ) ú ê 0 C 0 ú ê x (k + 2 | k ) ú
ê ú=ê úê ú (9)
s.t. S T d = 0 ê ú ê úú êê ú
ê ú ê ú
ê y (k + N | k )ú ê 0 0 C úûú êëê x (k + N | k )úûú
Here, S = [a1 a2 am ] , n > m and rank( S ) = m ; G is a ëê ûú ëê
n ´ n positive semi-definite matrix. d * , l * are the globe Obviously, the number of system states and the length of
solution and related Lagrange multiplier respectively by prediction horizon are the main factors leading to
solving: computational burden. For example, implementing the
procedure above in MATLAB, take a small system (see (11))
é G S ù é d ù é-f ( X ( k ) )T ù with N=100 and shows the computation time given by Profiler
ê T úê ú=ê ú (6)
êS
ë 0 úû êël úû êë 0 ú
û
in MATLAB in detail as Table 1.
Step 3. Let d ( k ) = d * . If d ( k ) = 0 , find the Lagrange
TABLE I. CLASSICAL COMPUTATION TIME OF QP IN MPC
multiplier l ( k ) . Then if l ( k ) ³ 0 , X ( k ) is the optimal solution.
Otherwise find lq( k ) = min{li( k ) | i Î I ( X ( k ) )} and let Function name Calls Total [s] Self time [s]
( k +1) (k ) ( k +1) MPC_Control 1 2.619 0.141
X = X . The active set I ( X ) is replaced by
MPC_QP 1 2.478 0.076
I ( X ( k ) ) excluding from the equality constraints with respect
MPC_QP>Solve_L 17 2.117 2.117
to q. Let k = k + 1 , jump to Step 2.
MPC_QP>FindAciveSet 16 0.141 0.141
Step 4. If d ( k ) ¹ 0 , calculate the step size in search: MPC_QP>GetNewX 16 0.126 0.126
ì-a X
ï T (k )
ïü MPC_QP>FindInitPoint 1 0.016 0.016
aˆk = min ïí i
| aiT d ( k ) > 0, i Ï I ( X ( k ) )ïý The function Solve_L taking much time is to solve (6) by
ï
î a d
ï
T (k )
i
ïïþ
Gauss elimination method which has a complexity of O(n3 ) .
-a Tp X ( k ) When N=100, the dimension of coefficient matrix of predicted
= (7) output is 300´300 . One reason for low real time peculiarity
a Tp d ( k )
of MPC is to solve a large linear equation. Another is the call
Compute X ( k +1) = X ( k ) + ak d ( k ) , ak = min{aˆk ,1} ; If ak = aˆk , times of the solver.
add new constraints with respect to p to the active set, or let
III. PARALLEL COMPUTING ARCHITECTURE
I ( X ( k +1) ) = I ( X ( k ) ) , k = k + 1 and jump to Step 2.
In last chapter Gauss elimination method is implemented in
Before solving QP problem, the related matrixes need to be serial mode (running on a CPU) in MATLAB. This manner
compute first. In N horizon, the system model becomes takes much time to solve large equations. In this section, a
equality constraints like this: parallel one is presented based on a NVIDIA GPU.
All work is done in CUDA environment [10]. In CUDA,
the code contains two parts, the serial running on a host (CPU)
and the parallel running on a device (GPU). The former aims to
tasks’ collaboration and the later to acceleration of solution.
1002
Authorized licensed use limited to: University of Michigan Library. Downloaded on March 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
The parallel code running on device is called kernel function 453478569234567589 3
which is executed identically by all light threads. One kernel is 66789436457892148 4
with a corresponding grid which is made of thread blocks. 4785692345675893
234585696758938
5
3
Thread blocks are made of threads which are the basic 81569422634567 2
computational units of a device. The dimension of a grid and 4228156945675
694228155675
7
8
thread blocks can be one, two or three. A two dimension 75366942281 9
example is demonstrated in Fig. 1. Before the device starts to 2815366942
692836421
0
6
parallel compute, a host allocates the device memory for the 96283642 4
9
grid and copies the initial data from host to device. In the end, a 9628364
357283 6
host copies the results form device. 65483 0
4532 4
687 5
63 7
7 2
(c)
Figure 2. Demonstration of Gauss elimination method in parallel mode
ê ú ê ú (10)
ê ê úú
ê úú ê
êa an ,n úûú It is obvious that when m = (4t2 + 2t3 n) / (4k1 + t3 ) the
êa an ,1 an,n úúû a
ëê n ,0 n ,1
êë n ,0 total time is the minimum.
The flowchart of QP solver containing with acceleration is
showed as Figure 3.
a0,0 / a0,0 a1,0 / a0,0 a2,0 / a0,0 an,0 / a0,0
r0,0 r1,0 r2,0 rn,0
(a) (b)
1003
Authorized licensed use limited to: University of Michigan Library. Downloaded on March 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
10 0.0476 0.2447 -
21 0.1263 0.2812 -
42 0.4861 0.3241 1.4998
85 1.7052 0.5923 2.8789
170 8.9164 1.0620 8.3959
341 53.3854 2.1834 24.4506
682 402.1035 13.3325 30.1596
X ( k+1) 1365 3054.5284 88.6587 35.6593
For N < 42 , the acceleration is invalid. Because the
matrixes transferred to device is small. And the transferring
li( k ) < 0 time occupies much proportion than computing time.
For N ³ 42 , the effect of acceleration is displayed.
l (k ) ³ 0 ak For N = 682 , traditional serial mode needs 402.1035 s for the
solution of QP problem, and the parallel mode needs only
13.3325 s. Notice that the proposed method is a dense solver.
d (k ) = 0 ?
The computing power of the PC used in this paper is weaker
than the newest PC, so it is certain that large speedup and less
time will be achieved if a new PC is used.
V. CONCLUSIONS
This paper analyzes the computational burden caused by
Figure 3. Flowchart of QP solver accelerated by GPU MPC, and proposes a parallel method to accelerate the solution
of MPC based on GPU. The essential of this method is to
implement Gauss elimination in parallel mode. And simulation
IV. SIMULATIONS results show that the parallel method deduces much time for
Simulations are conducted on a PC. Hardware consists of calculation. The speedup reaches 30 for N = 682 . That allows
AMD Athlon64 3000+, DDR400 1 GB and NVIDIA Geforce MPC to be applied to much larger systems.
8800GT. Software environment consists of VC2005 and
CUDA 1.0. This GPU peak capability of single precision REFERENCES
floating point arithmetic is 336 GFLOPS, and the CPU’s is [1] J.B. Rawlings, “Tutorial overview of model predictive control. ieee
2.3346 GFLOPS. 8800GT consists of 16 SMs (Stream control systems magazine,” 20(3):38-52. 2000.
Multiprocessor). Each SM can process 786 threads. All the [2] D.Q. Mayne, J.B. Rawlings, C.V. Rao, et al., “Constrained model
registers accessed by threads of a SM are limited to 8192 predictive control: stability and optimality,” Automatica, 36(6):789-814,
words. For sake of convenience, the dimension of matrixes or 2000.
vectors to be calculated is augmented to the nth power of 2. [3] T.A. Johansen, W. Jackson, R. Schreiber, et al., “Hardware synthesis of
The elements added are set to zero. The length of prediction explicit model predictive controllers,” IEEE Trans. Control System
Technology, 15(1):191-197, 2007.
horizon is properly set to make dimension of the related
constant matrixes get close to 2n. This paper takes a model [4] P. Tondel, T.A. Johansen, A. Bemporad., “An algorithm for multi-
parametric quadratic programming and explicit MPC solutions,”
below to test the proposed method. Automatica, 39(3):489-497, 2003.
é x1 ù é 0 1 0 ù é x1 ù é 0 ù [5] Raphael Cagienard, Pascal Grieder, Eric C, Kerrigan, et al., “Moving
ê ú ê úê ú ê ú blocking strategies in receding horizon control,” 43rd IEEE Conference
ê x 2 ú = ê 0 -1.0860 4.287925ú ê x2 ú + ê 0 ú u on Decision and Control, 2004.
ê ú ê úê ú ê ú
ê x ú ê 0 -1.0909 -1.4545 ú ê x ú ê3.6363ú [6] Stefano Longo, Eric C. Kerrigan, Kecj Voonling, et al., “Parallel move
ë 3û ë û ë 3û ë û (11) blocking model predictive control,” 50th IEEE CDC-ECC, 2011.
[7] Adrian G. Wills, Geoff Knagge, Brett Ninness., “Fast linear model
x1 is system output. The serial and the parallel methods are predictive control via custom integrated circuit architecture,” IEEE
performed under different length of N. The calculation time Trans. Control Systems Techology 20(1): 59-71, 2012.
and speedup is given in Table 2. [8] J. L. Jerez, G. A. Constantinides, E. C. Kerrigan, et al., “Parallel MPC
for realtime FPGA-based implementation,” Proc. 18th IFAC World
Congress, 2011.
TABLE II. COMPARISON OF COMPUTING TIME FOR MPC [9] R. A. Bartlett, A. Wachter, L. T. Biegler., “Active Set vs. Interior Point
strategies for model predictive control,” Proceedings of the American
Serial Parallel Control Conference, 2000.
N Speedup
mode [s] mode [s]
[10] NVIDIA CUDA Programming Guide 2.0. NVIDIA Co., 2008.
5 0.0315 0.2138 -
1004
Authorized licensed use limited to: University of Michigan Library. Downloaded on March 05,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.