APTS StatComp2020
APTS StatComp2020
Statistical Computing
Notes originally by Simon Wood, University of Bath, 2012
[Link]
1 Introduction 4
1.1 Things can go wrong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Matrix Computation 9
2.1 Efficiency in matrix computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Flop counting in general . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 There is more to efficiency than flop counting . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Efficiency conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Some terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Cholesky decomposition: a matrix square root . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Eigen-decomposition (spectral-decomposition) . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Powers of matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Another matrix square root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 Matrix inversion, rank and condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.5 Asymmetric eigen-decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 The QR decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Woodbury’s formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8 Further reading on matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Optimisation 23
3.1 Local versus global optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Optimisation methods are myopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Look at your objective first . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Objective function transects are a partial view . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Some optimisation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Taylor’s Theorem (and conditions for an optimum) . . . . . . . . . . . . . . . . . . . . 29
3.4.2 Steepest descent (AKA gradient descent) . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.4 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.5 The Nelder–Mead polytope method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.6 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Constrained optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Modifying the objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Further reading on optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Calculus by computer 40
4.1 Cancellation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Approximate derivatives by finite differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Other FD formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Automatic Differentiation: code that differentiates itself . . . . . . . . . . . . . . . . . . . . . . 42
2
CONTENTS 3
6 Other topics 69
6.1 An example: sparse matrix computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Hashing: another example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 1
Introduction
In combination, modern computer hardware, algorithms and numerical methods are extraordinarily impres-
sive. Take a moment to consider some simple examples, all using R.
1. Simulate a million random numbers and then sort them into ascending order.
x <- runif(1000000)
[Link](xs <- sort(x))
The sorting operation took 0.1 seconds on the laptop used to prepare these notes. If each of the
numbers in x had been written on its own playing card, the resulting stack of cards would have been
40 metres high. Imagine trying to sort x without computers and clever algorithms.
[Link](2)
n <- 1000
A <- matrix(runif(n*n),n,n)
[Link](Ai <- solve(A)) # invert A
The matrix inversion took under half a second to perform, and the inverse satisfied its defining equa-
tions to around one part in 1011 . It involved around 2 billion (2 × 109 ) basic arithmetic calculations
(additions, subtractions multiplications or divisions). At a very optimistic 10 seconds per calculation
this would take 86 average working lives to perform without a computer (and that’s assuming we use
the same algorithm as the computer, rather than Cramer’s rule, and made no mistakes). Checking
the result by multiplying the A by the resulting inverse would take another 86 working lives.
3. The preceding two examples are problems in which smaller versions of the same problem can be
solved more or less exactly without computers and special algorithms or numerical methods. In these
cases, computers allow much more useful sizes of problem to be solved very quickly. Not all problems
are like this, and some problems can really only be solved by computer. For example, here is a scaled
4
CHAPTER 1. INTRODUCTION 5
version of a simple delay differential equation model of adult population size, n, in a laboratory blowfly
culture.
dn
= P TD n(t − 1)e−n(t−1) − δTD n(t).
dt
The dynamic behaviour of the solutions to this model is governed by parameter groups P TD and δTD ,
but the equation can not be solved analytically, from any interesting set of initial conditions. To see
what solutions actually look like we must use numerical methods.
library(PBSddesolve)
## set parameters...
parms <- list(P=150,delta=6,yinit=2.001)
## solve model...
yout <- dde(y=parms$yinit,times=seq(0,15,0.05),func=bfly,parms=parms)
## plot result...
plot(yout$t,yout$y,type="l",ylab="n",xlab="time")
0 2 4 6 8
n
0 5 10 15
time
Here the solution of the model (to an accuracy of 1 part in 108 ), took a quarter of a second.
1. The [Link] function in R measures how long a particular operation takes to execute. First
generate two 3000 × 3000 matrices, A and B and a vector, y.
n <- 3000
A <- matrix(runif(n*n),n,n)
B <- matrix(runif(n*n),n,n)
y <- runif(n)
f0 and f1 are identical, to machine precision, so why is one calculation so much quicker than the
other? Clearly anyone wanting to compute with matrices had better know the answer to this.
Note that [Link] is a very crude method for benchmarking code. There are various R
packages that can facilitate benchmarking and profiling of R code. Here we illustrate use of the
rbenchmark package, which is available on CRAN.
library(rbenchmark)
benchmark(
"ABy" = {
f0 <- A%*%B%*%y
},
"A(By)" = {
f1 <- A%*%(B%*%y)
},
replications = 10,
columns = c("test", "replications", "elapsed",
"relative", "[Link]", "[Link]"))
2. It is not unusual to require derivatives of complicated functions, for example when estimating models.
Such derivatives can be very tedious or difficult to calculate directly, but can be approximated quite
easily by ‘finite differencing’. i.e. if f is a sufficiently smooth function of x then
∂f f (x + ∆) − f (x)
'
∂x ∆
where ∆ is a small interval. Taylor’s theorem implies that the right hand side tends to the left hand
side, above, as ∆ → 0. It is easy to try this out on an example where the answer is known, so consider
differentiating sin(x) w.r.t. x.
CHAPTER 1. INTRODUCTION 7
x <- seq(1,2*pi,length=1000)
delta <- 1e-5 # FD interval
[Link] <- (sin(x+delta)-sin(x))/delta # FD derivative
error <- [Link]-cos(x) # error in FD derivative
The following plots the results. On the LHS sin(x) is dashed, it’s true derivative cos(x) is dotted and
the FD derivative is continuous. The RHS shows the difference between the FD approximation and
the truth.
1.0
2e−06
0.5
sin(x)
error
0.0
−4e−06
−1.0
1 2 3 4 5 6 1 2 3 4 5 6
x x
Note that the error appears to be of about the same magnitude as the finite difference interval ∆.
Perhaps we could reduce the error to around the machine precision by using a much smaller ∆. Here
is the equivalent of the previous plot, when
delta <- 1e-15
0.2
1.0
0.1
0.5
sin(x)
error
0.0
0.0
−1.0
−0.2
1 2 3 4 5 6 1 2 3 4 5 6
x x
Clearly there is something wrong with the idea that decreasing ∆ must always increase accuracy
here.
3. These data
0.5
0.4
0.3
0.2
y
0.1
0.0
−0.1
An appropriate and innocuous linear model for the data is yi = β0 + β1 xi + β2 x2i + i , where the i
are independent, zero mean constant variance r.v.s and the βj are parameters. This can be fitted in
R with
b <- lm(y~x+I(x^2))
From basic statistical modelling courses, recall that the least squares parameter estimates for a linear
model are given by β̂ = (XT X)−1 XT y where X is the model matrix. An alternative to use of lm would
be to use this expression for β̂ directly.
Something has gone badly wrong here, and this is a rather straightforward model. It’s important to
understand what has happened before trying to write code to apply more sophisticated approaches
to more complicated models.
In case you are wondering, I can make lm fail to. Mathematically it is easy to show that the model
fitted values should be unchanged if I simply add 900 to the x values, but if I refit the model to such
shifted data using lm, then the following fitted values result, which can not be correct.
0.5
0.4
0.3
0.2
y
0.1
0.0
−0.1
Matrix Computation
Matrices are ubiquitous in statistics. But most statistical models and methods are only useful once turned
into computer code, and this often means computing with matrices. It is easy to ruin a beautiful piece of
theory by poor use of numerical linear algebra. This section covers some basic numerical linear algebra,
including the important topics of efficiency and stability.
n <- 3000
A <- matrix(runif(n*n),n,n)
B <- matrix(runif(n*n),n,n)
y <- runif(n)
[Link](f0 <- A%*%B%*%y) # case 1
What happened here? The answer is all to do with how the multiplications were ordered in the two
cases, and the number of floating point operations (flop) required by the two alternative orderings.
1. In the first case AB was formed first, and the resulting matrix used to pre-multiply the vector y.
2. In the second case, the vector By was formed first, and was then pre-multiplied by A.
Now we need to count the floating point operations. Try a couple of simple examples first
1. How many flop (+, -, *, /) are involved in the multiplication
b11 b12 b13 y1
b21 b22 b23 y2 ?
b31 b32 b33 y3
9
CHAPTER 2. MATRIX COMPUTATION 10
Generalising
Now write down the flop cost of the two ways of forming ABy
1.
2.
Another simple matrix operation is the evaluation of the trace of a matrix product. Again different ways of
computing the same quantity lead to radically different computation times. Consider the computation of
tr (AB) where A is 5000 × 1000 and B is 1000 × 5000.
1. The first method forms AB, at a flop cost of 2n2 m and then extracts the leading diagonal and sums
it.
2. The second method uses the fact that tr (AB) = tr (BA). It forms BA at flop cost of 2nm2 and then
extracts the leading diagonal of the result and sums it.
P
3. The third method makes direct use of the fact that tr (AB) = ij Aij Bji , at a cost of 2nm. It is fastest
(in principle) as no effort is wasted in calculating off diagonal elements of the target matrix (but here
does come at the cost of computing a transpose).
Notice that method 1 is not just wasteful of flops, it also requires storage of an n × n matrix, which is much
larger than either A or B.
La_library()
## [1] "/usr/lib/x86_64-linux-gnu/openblas-pthread/[Link].3"
extSoftVersion()["BLAS"]
## BLAS
## "/usr/lib/x86_64-linux-gnu/openblas-pthread/[Link].3"
• If you are writing compiled code for manipulating matrices (or other arrays), then be aware of memory
management efficiency as well as flops.
• For optimal performance try and use libraries such as LAPACK, with a fast BLAS, where-ever possible.
x11 x12 x13 x11 x12 x13
x21 x22 x23 x12 x22 x23 yT Xy > 0 ∀ y 6= 0
yT Xy ≥ 0 ∀ y 6= 0
x31 x32 x33 x13 x23 x33
square symmetric positive (semi−)definite
x11 0 0 x11 x12 x13 x11 0 0
x21 x22 0 0 x22 x23 0 x22 0
x31 x32 x33 0 0 x33 0 0 x33
lower triangular upper triangular diagonal
Q is an orthogonal matrix iff QT Q = QQT = I.
If the component equations of this expression are written out and solved in the right order, then each
CHAPTER 2. MATRIX COMPUTATION 13
Working through these equations in row order, from row one, and starting each row from its leading diagonal
component ensures that all r.h.s. quantities are known at each step. Cholesky decomposition requires n3 /3
flops and n square roots. In R it is performed by function chol, which calls routines in LAPACK or LINPACK.
Example (continued): The following simulates a random draw from
1 2 −1 1
N −1 , −1 2 −1
3 1 −1 2
V <- matrix(c(2,-1,1,-1,2,-1,1,-1,2),3,3)
mu <- c(1,-1,3)
## [,1]
## [1,] 1.621485
## [2,] -2.069023
## [3,] 3.939876
The following simulates 1000 such vectors, and checks their observed mean and covariance.
(Y-mu)%*%t(Y-mu)/1000 ## observed V
As a second application of the Cholesky decomposition, consider evaluating the log likelihood of µ and
Σ
n 1 1
l = − log(2π) − log(|Σ|) − (y − µ)T Σ−1 (y − µ).
2 2 2
If we were simply to invert Σ to evaluate the final term, it would cost 2n3 flops, and we would still need
to evaluate the determinant. A Cholesky based approach is much better. It is easy to see that Σ−1 =
R−1 R−T , where R is the Cholesky factor of Σ. So the final term in the log likelihood can be written as zT z
where z = R−T (y−µ). Notice that we don’t actually need to evaluate R−T , but simply to solve RT z = y−µ
for z. To see how this is done, consider a 4 × 4 case again:
R11 0 0 0 z1 y1 − µ 1
z2 y2 − µ 2
R12 R22 0 0
R13 R23 R33 0 z3 = y3 − µ3
y <- 1:3
z <- forwardsolve(t(R),y-mu)
logLik <- -length(y)*log(2*pi)/2 -
sum(log(diag(R))) - sum(z*z)/2
logLik
## [1] -6.824963
Note that Cholesky decomposition of a matrix that is not positive definite will fail. Even positive semi-
definite will not work, since in that case a leading diagonal element of the Cholesky factor will become zero,
so that computation of the off diagonal elements on the same row is impossible. Positive semi-definite
matrices are reasonably common, so this is a practical problem. For the positive semi-definite case, it is
possible to modify the Cholesky decomposition by pivoting: that is by re-ordering the rows/columns of the
original matrix so that the zeroes end up at the end of the leading diagonal of the Cholesky factor, in rows
that are all zero. This will not be pursued further here. Rather we will consider a more general matrix
decomposition, which provides matrix square roots along with much else.
where the matrix U is orthogonal and Λ is a diagonal matrix, with ith leading diagonal element λi (conven-
tionally λi ≥ λi+1 ). Post-multiplying both sides of the decomposition by U we have
AU = UΛ.
Considering this system one column at a time and writing ui for the ith column of U we have:
Aui = λi ui .
i.e. the λi are the eigenvalues of A, and the columns of U are the corresponding eigenvectors. (2.1) is the
eigen-decomposition or spectral decomposition of A.
We will not go over the computation of the eigen-decomposition in detail, but it is not done using the
determinant and characteristic equation of A. One practical scheme is as follows: (i) the matrix A is first
reduced to tri-diagonal form using repeated pre and post multiplication by simple rank-one orthogonal ma-
trices called Householder rotations, (ii) an iterative scheme called QR-iteration then pre-and post-multiplies
the tri-diagonal matrix by even simpler orthogonal matrices, in order to reduce it to diagonal form. At this
point the diagonal matrix contains the eigenvalues, and the product of all the orthogonal matrices gives U.
Eigen-decomposition is O(n3 ), but a good symmetric eigen routine is around 10 times as computationally
costly as a Cholesky routine.
An immediate use of the eigen-decomposition is to provide an alternative characterisation of positive
(semi-) definite matrices. All the eigenvalues of a positive (semi-)definite matrix must be positive (non-
negative) and real. This is easy to see. Were some eigenvalue, λi to be negative (zero) then the corre-
sponding eigenvector ui would result in uT i Aui being negative (zero). At the same time the existence of
an x such that xT Ax is negative (zero) leads to a contradiction unless at least one eigenvalue is negative
(zero)* .
f 0 (A) ≡ Uf 0 (Λ)UT
where f 0 (Λ) denotes the diagonal matrix with ith leading diagonal element f (λi ). E.g. exp(A) = U exp(Λ)UT .
A−1 = UΛ−1 UT
can write x = Ub for some vector b. So xT Ax < 0 ⇒ bT Λb < 0 ⇒ b2i Λi < 0 ⇒ Λi < 0 for some i.
* We
P
CHAPTER 2. MATRIX COMPUTATION 16
where the diagonal matrix Λ−1 has ith leading diagonal element λ−1 i . Clearly we have a problem if any of
the λi are zero, for the matrix inverse will be undefined. A matrix with no zero eigen-values is termed full
rank. A matrix with any zero eigenvalues is rank deficient and does not have an inverse. The number of
non-zero eigenvalues is the rank of a matrix.
For some purposes it is sufficient to define a generalised inverse or pseudo-inverse when faced with
rank deficiency, by finding the reciprocal of the non-zero eigenvalues, but setting the reciprocal of the zero
eigenvalues to zero. We will not pursue this here.
It is important to understand the consequences of rank deficiency quite well when performing matrix
operations involving matrix inversion/matrix equation solving. This is because near rank deficiency is rather
easy to achieve by accident, and in finite precision arithmetic is as bad as rank deficiency. First consider
trying to solve
Ax = y
for x when A is rank deficient. In terms of the eigen-decomposition the solution is
x = UΛ−1 UT y
i.e. y is rotated to become y0 = UT y, the elements of y0 are then divided by the eigenvalues, λi , and the
reverse rotation is applied to the result. The problem is that yi0 /λi is not defined if λi = 0. This is just a
different way of showing something that you already know: rank deficient matrices can not be inverted. But
the approach also helps in the understanding of near rank deficiency and ill conditioning.
An illustrative example highlights the problem. Suppose that n × n symmetric matrix A has n − 1 distinct
eigenvalues ranging from 0.5 to 1, and one much smaller magnitude eigenvalue . Further suppose that
we wish to compute with A, on a machine that can represent real numbers to an accuracy of 1 part in −1 .
Now consider solving the system
Ax = u1 (2.2)
for x, where u1 is the dominant eigenvector of A. Clearly the correct solution is x = u1 , but now consider
computing the answer. As before we have a formal solution
x = UΛ−1 UT u1 ,
but although analytically u01 = UT u1 = (1, 0, 0, . . . , 0)T , the best we can hope for computationally is to get
u01 = (1 + e1 , e2 , e3 , . . . , en )T where the number ej are of the order of ±. For convenience, suppose that
en = . Then, approximately, Λ−1 u01 = (1, 0, 0, . . . , 0, 1)T , and x = UΛ−1 u01 = u1 + un , which is not correct.
Similar distortions would occur if we used any of the other first n − 1 eigenvectors in place of u1 : they all
become distorted by a spurious un component, with only un itself escaping.
Now consider an arbitrary P vector y on the r.h.s. of (2.2). We can always write it as some weighted sum
of the eigenvectors y = wi ui . This emphasises how bad the ill-conditioning problem is: all but one of
ys components are seriously distorted when multiplied by A−1 . By contrast multiplication by A itself would
lead only to distortion of the un component of y, and not the other eigen-vectors, but the un component is
the component that is so heavily shrunk by multiplication by A that it makes almost no contribution to the
result, unless we have the misfortune to choose a y that is proportional to un and nothing else.
A careful examination of the preceding argument reveals that what really matters in determining the
seriousness of the consequences of near rank deficiency is the ratio of the largest magnitude to the smallest
magnitude eigenvalues, i.e.
κ = max |λi |/ min |λi |.
This quantity is a condition number for A. Roughly speaking it is the factor by which errors in y will be
multiplied when solving Ax = y for x. Once κ begins to approach the reciprocal of the machine precision
we’re in trouble† . A system with a large condition number is referred to as ill-conditioned. Orthogonal
matrices have κ = 1, which is why numerical analysts like them so much.
Example. We are now in a position to understand what happened when we tried to use the ‘normal
equations’ to fit the simple quadratic model in the introduction. Let’s explicitly calculate the condition number
of XT X.
†
Because the condition number is so important in numerical computation, there are several methods for getting an approximate condition
number more cheaply than via eigen decomposition — e.g. see ?kappa in R.
CHAPTER 2. MATRIX COMPUTATION 17
[Link](1)
xx <- sort(runif(100))
x <- xx + 100 # regenerated same x data
X <- [Link](~x + I(x^2)) # and the model matrix
XtX <- crossprod(X) # form t(X)%*%X (with half the flop count)
lambda <- eigen(XtX)$values
lambda[1]/lambda[3] # the condition number of X’X
## [1] 2.355538e+18
Of course this raises a couple of obvious questions. Could we have diagnosed the problem directly
from X? And how does the lm function avoid this problem? Answers soon, but first consider a trick for
reducing κ.
2.4.4 Preconditioning
The discussion of condition numbers given above was for systems involving unstructured matrices (albeit
presented only in the context of symmetric matrices). Systems involving matrices with special structure
are sometimes less susceptible to ill-conditioning than naive computation of the condition number would
suggest. For example if D is a diagonal matrix, then we can accurately solve Dy = x for y, however large
κ(D) is: overflow or underflow are the only limits.
This basic fact can sometimes be exploited to re-scale a problem to improve computational stability. As
an example consider diagonal preconditioning of the computation of XT X, above. For the original XT X we
have
solve(XtX)
but (DXT XD)−1 turns out to have a much lower condition number than XT X. . .
D <- diag(1/diag(XtX)^.5)
DXXD <- D%*%XtX%*%D
lambda <- eigen(DXXD)$values
lambda[1]/lambda[3]
## [1] 429397494367
## (Intercept) x I(x^2)
## [1,] 9.999640e-01 -3.262288e-03 -0.352586841
## [2,] 2.920875e-07 1.000032e+00 0.002340679
## [3,] -1.151890e-09 -1.143870e-07 0.999992137
n = 1000
XA = matrix(rnorm(n*n),ncol=n)
XXt = XA %*% t(XA)
[Link](chol(XXt))
[Link](eigen(XXt, symmetric=TRUE))
[Link](eigen(XA))
[Link](svd(XA))
[Link](qr(XA))
The number of its non-zero singular values gives the rank of a matrix, and the SVD is the most reliable
method for numerical rank determination (by examination of the size of the singular values relative to the
largest singular value). In a similar vein, a general definition of condition number is the ratio of largest and
smallest singular values: κ = d1 /dc .
Example Continuing the simple regression disaster from the introduction, consider the singular values
of X
Clearly, numerically X is close to being rank 2, rather than rank 3. Turning to the condition number. . .
d[1]/d[3]
## [1] 1560769713
κ ≈ 2 × 109 is rather large, especially since it is easy to show that the condition number of XT X must
be the square of this. So we now have a pretty clear diagnosis of the cause of the original problem.
In fact the SVD not only provides a diagnosis of the problem, but one possible solution. We can re-write
the solution to the normal equations in terms of the SVD of X.
1. The condition number of the system that we have ended up with is exactly the condition number of
X, i.e. the square root of the condition number involved in the direct solution of the normal equations.
2. Comparing the final RHS expression to the representation of an inverse in terms of its eigen-decomposition,
it is clear that VD−1 UT is a sort of pseudo-inverse X.
The SVD has many uses. One interesting one is low rank approximation of matrices. In a well defined
sense, the best rank k ≤ rank(X) approximation to a matrix X can be expressed in terms of the SVD of X
as
X̃ = UD̃VT
where D̃ is D with all but the k largest singular values set to 0. Using this result to find low rank ap-
proximations to observed covariance matrices is the basis for several dimension reduction techniques in
multivariate statistics (although, of course, a symmetric eigen decomposition is then equivalent to SVD).
One issue with this sort of approximation is that the full SVD is computed, despite the fact that part of it is
then to be discarded (don’t be fooled by routines that ask you how many eigen or singular vectors to return
— I saved .1 of a second, out of 13 by getting R routine svd to only return the first columns of U and V).
Look up Lanczos methods and Krylov subspaces for approaches that avoid this sort of waste.
CHAPTER 2. MATRIX COMPUTATION 20
X = QR
where R is upper triangular and Q is of the same dimension as X, with orthogonal columns (so QT Q = I,
but QQT 6= I). The QR decomposition has a cost of O(rc2 ), but it is about 1/3 of the cost of SVD.
Consider the linear modelling example again
Now the singular values of R are the same as the singular values of X, so this system has the same
condition number as X. i.e. the QR method provides the stability of SVD without the computational cost
(although it still has about twice the cost of solving the normal equations via Cholesky decomposition).
In fact the QR method for solving least squares estimates can be derived from first principles, without
using the normal equations, in which case it also provides a very compact way of deriving much of the
distributional theory required for linear modelling. R routine lm uses the QR approach, which is why, in
the Introduction, it was able to deal with the case for which the normal equations failed. Of course it is not
magic: the subsequent failure even of lm arose because I had caused the model matrix condition number
to become so bad that even the QR approach fails (I should have centred x, by subtracting its mean, or
used orthogonal polynomials, via poly).
It is also worth noting the connection between the QR and Cholesky decompositions. If X = QR, then
X X = RT R, where R is upper triangular, so the QR decomposition of X gives the Cholesky factorisation
T
of XT X “for free”. Note that most QR routines don’t bother ensuring that the diagonal elements of R are
positive. If the positivity of the diagonal elements is an issue (it rarely matters), it can easily be fixed by
flipping the signs of the relevant rows of R. This relationship between the QR and Cholesky can be used to
directly construct the Cholesky decomposition of a covariance matrix in a stable way, without ever directly
constructing the covariance matrix.
Another application of the QR decomposition is determinant calculation. If A is square and A = QR
then Y
|A| = |Q||R| = |R| = Rii
i
formula, Sherman-Morrison formula, etc., are all common names for various versions and special cases.
Here I’ll just give the most commonly encountered variants.
The context is a (large) m × m matrix, A, and its inverse, A−1 . We suppose that A−1 is already known,
but that now A is subject to a low-rank update of the form A + UVT , for U, V both m × n with n << m.
The problem is how to compute the updated inverse (A + UVT )−1 in an efficient way. Woodbury’s formula
gives the answer.
(A + UVT )−1 = A−1 − A−1 U(In + VT A−1 U)−1 VT A−1
Note that crucially, the inverse on the LHS is m × m, whereas the new inverse required on the RHS is n × n.
This is what makes the result so useful.
The result is easy to prove by direct multiplication. First, pre-multiply the RHS by A + UVT , multiply
out, and simplify, to get Im . This tells us that the RHS is a right-inverse. Next post-multiply the RHS by
A + UVT , multiply out, simplify to get Im , thus confirming that the RHS is both a left- and right-inverse,
and hence the inverse.
The form given above is sufficiently general for most applications in statistical computing, but it is worth
noting a modest generalisation, often referred to as the Sherman-Morrison-Woodbury formula, which intro-
duces an additional n × n matrix, C.
This can be proved by direct multiplication, as before. There is also a very commonly encountered special
case, where U and V are just m × 1 vectors, u, v. In that case the identity is often referred to as the
Sherman-Morrison formula, and reduces to
• Golub, GH and CF Van Loan (1996) Matrix Computations 3rd ed. John Hopkins. This is the reference
for numerical linear algebra, but not the easiest introduction. It is invaluable if you are writing carefully
optimised matrix computations, directly calling BLAS and LAPACK functions.
• Watkins, DS (2002) Fundamentals of Matrix Computations 2nd ed. Wiley. This is a more accessible
book, but still aimed at those who really want to understand the material (the author encourages you
to write your own SVD routine, and with this book you can).
• Press WH, SA Teukolsky, WT Vetterling and BP Flannery (2007) Numerical Recipes (3rd ed), Cam-
bridge. Very easily digested explanation of how various linear algebra routines work, with code. But
there are better versions of most of the algorithms in this book.
• For state of the art numerical linear algebra use LAPACK: [Link]
CHAPTER 2. MATRIX COMPUTATION 22
• Matlab offers a convenient, but expensive, platform for matrix computation. For an introduction see
e.g. Pratap, R (2006) Getting Started with Matlab, Oxford.
• Monahan, JF (2001) Numerical Methods of Statistics provides a useful treatment of numerical linear
algebra for statistics.
• Harville, DA (1997) Matrix Algebra From a Statistician’s Perspective, Springer, provides a wealth of
helpful theoretical material.
Chapter 3
Optimisation
where f is a real scalar valued function of vector θ, usually known as the objective function. For example:
f might be the negative log-likelihood of the parameters θ of a statistical model, the maximum likelihood
estimates of which are required; in a Bayesian setting f might be a (negative) posterior distribution for θ,
the posterior modes of which are required; f might be a dissimilarity measure for the alignment of two
DNA sequences, where the alignment is characterised by a parameter vector θ; f might be total distance
travelled, and θ a vector giving the order in which a delivery van should visit a set of drop off points.
Note that maximising f is equivalent to minimising −f so we are not losing anything by concentrating on
minimisation here. In the optimisation literature, the convention is to minimise an objective, with the idea
being that the objective is some kind of cost or penalty to be minimised. Indeed, the objective function is
sometimes referred to as the cost function.
Let Θ denote the set of all possible θ values. Before discussing specific methods, note that in practice
we will only be able to solve (3.1) if one of two properties holds.
2. It is possible to put some structure onto Θ, giving a notion of distance between any two elements of
Θ, such that closeness in Θ implies closeness in f value.
In complex situations, requiring complicated stochastic approaches, it is easy to lose sight of point 2 (al-
though if it isn’t met we may still be able to find ways of locating θ values that reduce f ).
Optimisation problems vary enormously in difficulty. In the matrix section we saw algorithms with op-
eration counts which might be proportional to as much as n3 . There are optimisation problems whose
operations count is larger than O(nk ) for any k: in computer science terms there is no polynomial time al-
gorithm* for solving them. Most of these really difficult problems are discrete optimisation problems, in that
θ takes only discrete values, rather than consisting of unconstrained real numbers. One such problem is
the famous Travelling salesman problem, which concerns finding the shortest route between a set of fixed
points, each to be visited only once. Note however, that simply having a way of improving approximate
solutions to such problems is usually of considerable practical use, even if exact solution is very taxing.
Continuous optimisation problems in which θ is a real random vector are generally more straightforward
than discrete problems, and we will concentrate on such problems here. In part this is because of the
centrality of optimisation methods in practical likelihood maximisation.
* Computer scientists have an interesting classification of the ‘difficulty’ of problems that one might solve by computer. If n is the problem
‘size’ then the easiest problems can be solved in polynomial time: they require O(nk ) operations where k is finite. These problems are in the
P-class. Then there are problems for which a proposed solution can be checked in polynomial time, the NP class. NP-complete problems are
problems in NP which are at least as hard as all other problems in NP in a defined sense. NP hard problems are at least as hard as all problems
in NP, but are not in NP: so you can’t even check the answer to an NP hard problem in polynomial time. No one knows if P and NP are the
same, but if any NP-complete problem turned out to be in P, then they would be.
23
CHAPTER 3. OPTIMISATION 24
w.r.t. r and K. We should try to get a feel for the behaviour of f . To see how this can work consider some
simulated data examples. In each case I used n0 = 20, K = 50 and σ = 1 for the simulations, but I have
varied r between the cases.
• In the first instance data were simulated with r = 2. If we now pretend that we need to estimate r and
K from such data, then we might look at some r-transects and some K-transects through f . This
figure shows the raw data and an r-transect.
25000
26
population
15000
24
f(r,K=50)
22
5000
20
r-transects with other K values look equally innocuous, and K-transects also look benign over this
r range. So in this case f appears to be a nice smooth function of r and K, and any half decent
optimisation method ought to be able to find the optimum.
• In the second example data were simulated with r = 3.2. Again pretend that we only have the data
and need to estimate r and K from it. Examination of the left hand plot below, together with some
knowledge of the model’s dynamic behaviour, indicates that r is probably in the 3 to 4 range.
25000
40
35
15000
population
f(r,K=50)
30
5000
25
20
0
0 10 20 30 40 50 2.0 2.5 3.0 3.5 4.0
time r
Notice that f still seems to behave smoothly in the vicinity of its minimum, which is around 3.2, but
that something bizarre starts to happen above 3.6. Apparently we could be reasonably confident of
locating the function’s minimum if we start out below or reasonably close to 3.2, but if we stray into
the territory above 3.6 then there are so many local minima that locating the desired global minimum
would be very difficult.
f(r,K=50)
30
15000
20
5000
10
Now the objective has a minimum somewhere around 3.7, but it is surrounded by other local minima,
so that locating the actual minima would be a rather taxing problem. In addition it is now unclear
how we would go about quantifying uncertainty about the ‘optimal’ θ: it will certainly be of no use
appealing to asymptotic likelihood arguments in this case.
So, in each of these examples, simple transects through the objective function have provided useful
information. In the first case everything seemed OK. In the second we would need to be careful with the
optimisation, and careful to examine how sensible the optimum appeared to be. In the third case we would
CHAPTER 3. OPTIMISATION 26
need to think very carefully about the purpose of the optimisation, and about whether a re-formulation of
the basic problem might be needed.
Notice how the behaviour of the objective was highly parameter dependent here, something that em-
phasises the need to understand models quite well before trying to fit them. In this case the dynamic model,
although very simple, can show a wide range of complicated behaviour, including chaos. In fact, for this
simple example, it is possible to produce a plot of the fitting objective over pretty much the whole plausible
parameter space. Here is an example when data were simulated with r = 3.2
50
40
f(r,K)/
30
1000
20
10
70
60
50
k
40
4.0
3.5
3.0
2.5
2.0
30 1.5 r
1.0
−3 −2 −1 0 1 2 3
x
Apparently the function has many local minima, and optimisation will be difficult. Actually the function
is this. . .
CHAPTER 3. OPTIMISATION 27
y
x
It has one local minimum, its global minimum. Head downhill from any point x, y and you will eventually
reach the minimum.
The opposite problem can also occur. Here are x and y transects through a second function g(x, y).
0.5
0.5
0.4
0.4
0.3
0.3
g(x,y)
g(x,y)
0.2
0.2
0.1
0.1
0.0
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
x y
. . . apparently no problem. From these plots you would be tempted to conclude that g is a well behaved
and uni-modal. The problem is that g looks like this. . .
CHAPTER 3. OPTIMISATION 28
g(x,
y)
y
x
So, generally speaking, it is a good idea to plot transects through your objective function before optimi-
sation, and to plot transects passing through the apparent optimum after optimisation, but bear in mind that
transects only give partial views.
1. Evaluate f (θ [k] ), and possibly the first and second derivatives of f w.r.t. the elements of θ, at θ [k] .
2. Either
(a) use the information from step 1 (and possibly previous executions of step 1) to find a search
direction, ∆, such that for some α, f (θ [k] + α∆) will provide sufficient decrease relative to
f (θ [k] ). Search for such an α and set θ [k+1] = θ [k] + α∆. Or
(b) use the information from step 1 (and possibly previous executions of step 1) to construct a local
model of f , and find the θ minimising this model within a defined region around θ [k] . If this value
of θ leads to sufficient decrease in f itself, accept it as θ [k+1] , otherwise shrink the search region
until it does.
3. Test whether a minimum has yet been reached. If it has, terminate, otherwise increment k and return
to 1.
Methods employing approach (a) at step 2 are known as search direction methods, while those employing
approach (b) are trust region methods. In practice the difference between (a) and (b) may be rather small:
the search direction is often based on a local model of the function. Here we will examine some search
direction methods.
CHAPTER 3. OPTIMISATION 29
From Taylor’s theorem it is straightforward to establish that the conditions for f (θ ∗ ) to be a minimum of f
are that ∇f (θ ∗ ) = 0 and ∇2 f (θ ∗ ) is positive semi-definite.
Here we have given an exact form of Taylor’s theorem for a real-valued many-variable function, for a first
order approximation plus a second order remainder term in Langrange’s form. In practice, in applications,
we typically work with approximate versions of the above result. In some cases we work directly with a
linear (first-order) approximation, by setting the remainder term to zero. In other cases we work with the
quadratic (second-order) approximation, obtained by evaluating the remainder term at t = 0.
For ∆ of fixed length the greatest decrease will be achieved by minimising the inner product ∇f (θ [k] )T ∆,
which is achieved by making ∆ parallel to −∇f (θ [k] ), i.e. by choosing
∆ ∝ −∇f (θ [k] ).
To firm things up, let’s set ∆ = −∇f (θ [k] )/k∇f (θ [k] )k, and consider a new proposed value of the form
θ [k] + α∆ for α > 0. Then, by plugging back into our linear approximation, we have
∇f (θ)T ∇f (θ)
f (θ [k] + α∆) ' f (θ [k] ) − α
k∇f (θ [k] )k
= f (θ [k] ) − αk∇f (θ [k] )k,
and so as we increase α from zero, we decrease f (·) at a rate proportional to the magnitude of the gradient
vector. Any other choice of ∆ will decrease the function value at a slower rate.
Given a search direction, we need a method for choosing how far to move from θ [k] along this direction.
Here there is a trade-off. We need to make reasonable progress, but do not want to spend too much time
choosing the exactly optimal distance to move, since the point will only be abandoned at the next iteration
anyway. Usually it is best to try and choose the step length α so that f (θ [k] + α∆) is sufficiently much lower
that f (θ [k] ), and also so that the magnitude of the gradient of f in the ∆ direction is sufficiently reduced by
the step. See e.g. Section 3.1 of Nocedal and Wright (2006) for further details. For the moment presume
that we have a step selection method.
The following illustrates the progress of the steepest descent method on Rosenbrock’s function. The
contours are log10 of Rosenbrock’s function, and the thick line shows the path of the algorithm.
CHAPTER 3. OPTIMISATION 30
1.5
1.0
0.5
z
0.0
−0.5
It took 14,845 steps to reach the minimum. This excruciatingly slow progress is typical of the steepest
descent method. Another disadvantage of steepest descent is scale dependence: imagine linearly re-scaling
so that the z axis in the above plot ran from -5 to 15, but the contours were unchanged. This would change
all the steepest descent directions. By contrast, the methods considered next are scale invariant.
as the equation to solve for the step ∆ which will take us to the turning point of the (quadratic) Taylor
approximation of f . This turning point will be a minimum (of the approximation) if ∇2 f (θ [k] ) is positive
definite, and can also be derived directly without vector calculus by completing the square.
Notice how Newton’s method automatically suggests a search direction and a step length. Usually we
simply accept the Newton step, ∆, unless it fails to lead to sufficient decrease in f . In the latter case the
step length is repeatedly halved until sufficient decrease is achieved.
The second practical issue with Newton’s method is the requirement that ∇2 f (θ [k] ) is positive definite.
There is no guarantee that it will be, especially if θ [k] is far from the optimum. This would be a major
problem, were it not for the fact that the solution to
H∆ = −∇f (θ [k] )
is a descent direction for any positive definite H. To see this, put ∆ = −H−1 ∇f (θ [k] ) as a search direction
in our first-order approximation,
for small α > 0 due to the positive-definiteness of H−1 . This gives us the freedom to modify ∇2 f (θ [k] )
to make it positive definite, and still be guaranteed a descent direction. One approach is to take the
symmetric eigen-decomposition of ∇2 f (θ [k] ), reverse the sign of any negative eigenvalues, replace any
CHAPTER 3. OPTIMISATION 31
zero eigenvalues with a small positive number and then replace ∇2 f (θ [k] ) by the matrix with this modified
eigen-decomposition. Computationally cheaper alternatives simply add a multiple of the identity matrix to
∇2 f (θ [k] ), sufficiently large to make the result positive definite. Note that steepest descent corresponds
to the choice H = I, and so inflating the diagonal of the Hessian corresponds to nudging the search
direction from that of the Newton method towards that of steepest descent, and this is usually a fairly safe
and conservative thing to do.
Combining these steps gives a practical Newton algorithm. Starting with k = 0 and a guesstimate
[0]
θ ...
1. Evaluate f (θ [k] ), ∇f (θ [k] ) and ∇2 f (θ [k] ), implying a quadratic approximation to f , via Taylor’s theo-
rem.
3. If ∇2 f (θ [k] ) is not positive definite, perturb it so that it is (thereby modifying the quadratic model of f ).
5. If f (θ [k] + ∆) is not sufficiently lower than f (θ [k] ), repeatedly halve ∆ until it is.
The following figure illustrates the progress of the method on Rosenbrock’s function, working across
rows from top left. Grey contours show the function, dotted show the quadratic approximation at the current
best point, and, if present, black contours are the quadratic approximation before modification of the Hes-
sian. The thick black line tracks the steps accepted, while the thin black line shows any suggested steps
that were reduced. Note the fact that the method converged in 20 steps, compared to the 14,845 steps
taken by steepest descent.
CHAPTER 3. OPTIMISATION 32
dimension of θ is large, then solving for the Newton direction can be a substantial part of the computational
burden of optimisation. Is it possible to produce methods that only require first derivative information, but
with efficiency rivalling Newton’s method, rather than steepest descent?
The answer is yes. One approach is to replace the exact Hessian, of Newton’s method, with an approx-
imate Hessian based on finite differences of gradients. The finite difference intervals must be chosen with
care (see section 4.2), but the approach is reliable (recall that any positive definite matrix in place of the
Hessian will yield a descent direction, so an approximate Hessian should be unproblematic). However, fi-
nite differencing for the Hessian is often more costly than exact evaluation, and there is no saving in solving
for the search direction.
The second approach is to use a quasi-Newton method. The basic idea is to use first derivative infor-
mation gathered from past steps of the algorithm to accumulate an approximation to the Hessian at the
current best point. In principle, this approximation can be used exactly like the exact Hessian in Newton’s
method, but if structured carefully, such an algorithm can also work directly on an approximation to the
inverse of the Hessian, thereby reducing the cost of calculating the actual step. It is also possible to ensure
that the approximate Hessian is always positive definite.
Quasi-Newton methods were invented in the mid 1950’s by W.C. Davidon (a physicist). In the math-
ematical equivalent of not signing the Beatles, his paper on the method was rejected (it was eventually
published in 1991). There are now many varieties of Quasi-Newton method, but the most popular is the
BFGS method† , which will be briefly covered here.
We suppose that at step k we have already computed θ [k] and an approximate Hessian H[k] . So, just
as with a regular Newton method, we solve
to obtain a new search direction, leading either to a new θ [k+1] directly, or more likely via line search.
We now have the problem of how to appropriately update the approximate Hessian, H[k+1] . We want the
approximate Hessian to give us a good quadratic approximation of our function about θ [k+1] . If H[k+1] is
our new approximate positive definite Hessian at the (k + 1)th step, we have the function approximation
1
f (θ) ' f (θ [k+1] ) + ∇f (θ [k+1] )T (θ − θ [k+1] ) + (θ − θ [k+1] )T H[k+1] (θ − θ [k+1] ).
2
The basic requirement of a quasi Newton method is that the new Hessian should be chosen so that the
approximation should exactly match ∇f (θ [k] ) — i.e. the approximation should get the gradient vector at the
previous best point, θ [k] , exactly right. Taking the gradient of the above approximation gives
and hence
H[k+1] sk = yk (3.2)
where sk = θ [k+1] − θ [k] and yk = ∇f (θ [k+1] ) − ∇f (θ [k] ). The so-called secant equation (3.2) will only
be feasible for positive definite H[k+1] under certain conditions on sk and yk , but we can always arrange
for these to be met by appropriate step length selection (see Wolfe conditions, Nocedal & Wright, 2006,
§3.1). Note, in particular, that pre-multiplying (3.2) by sT T T [k+1] s > 0 for positive definite
k gives sk yk = sk H k
T
Hessian, so sk yk must be positive. Still, the secant equation is under-determined, so we need to constrain
the problem further in order to identify an appropriate unique solution. For quasi Newton methods, this is
typically done by proposing a low-rank update to the previous Hessian. Methods exist which use a rank 1
update, but BFGS uses a rank 2 update of the form
For (good) reasons that we don’t have time to explore here, the choice u = yk , v = H[k] sk is made, and
substituting this update into the secant equation allows solving for α and β, giving the BFGS update,
H[k] sk sT
kH
[k]
H[k+1] = H[k] + ρk yk ykT − ,
sT [k]
k H sk
where ρk = 1/sTk yk . By substitution, we can directly verify that this satisfies the secant equation. Now let’s
−1
work in terms of the inverse approximate Hessian, B[k] ≡ H[k] , since this will allow us to avoid solving
a linear system at each iteration. A naive re-write of our update would be
" #−1
(B[k] )−1 sk sT [k] −1
k (B )
B[k+1] = (B[k] )−1 + ρk yk ykT − ,
sT [k] −1
k (B ) sk
but this is a low-rank update of an inverse, and hence a good candidate for application of Woodbury’s
formula. It is actually a bit tedious to apply Woodbury directly, but if we can guess the solution,
we can verify that it is correct by direct multiplication. We can also double-check that this update satisfies
the modified secant equation, B[k+1] yk = sk (it does). The BFGS update is usually given in the form above,
but for actually implementation on a computer, it can be expanded as
h i
B[k+1] = B[k] − ρk sk ykT B[k] + B[k] yk sT T [k] T
k + ρk (1 + ρk yk B yk )sk sk ,
and then some careful thinking about the order and re-use of computational steps within the update can
lead to a very efficient algorithm.
The BFGS method then works exactly like Newton’s method, but with B[k] in place of the inverse of
∇ f (θ [k] ), and without the need for second derivative evaluation or for perturbing the Hessian to achieve
2
positive definiteness. It also allows direct computation of the search direction via
∆ = −B[k] ∇f (θ k ).
A finite difference approximation to the Hessian is often used to start the method. The following figure
illustrates its application (showing every second step of 38).
CHAPTER 3. OPTIMISATION 35
Notice how this quasi-Newton method has taken nearly twice as many steps as the Newton method, for
the same problem, and shows a greater tendency to ‘zig-zag’. However, it is still computationally cheaper
than finite differencing for the Hessian, which would have required an extra 20 gradient evaluations, com-
pared to quasi-Newton. Compared to steepest descent it’s miraculous.
CHAPTER 3. OPTIMISATION 36
L-BFGS
It is also worth being aware of a variant of BFGS, known as L-BFGS, with the “L” standing for limited mem-
ory. This algorithm is mainly intended for very high-dimensional functions, where working with a full inverse
Hessian would be an issue for either computational or memory reasons. Here, rather than propagating the
full inverse Hessian matrix at each step, a low rank approximation of it is implicitly updated and propagated
via the recent history of the process. The following Wikipedia pages provide a little additional context.
• [Link]
• [Link]
• [Link]
1. The search direction is defined as the vector from the worst point (vertex of the polytope with highest
objective value) through the average of the remaining n points.
2. The initial step length is set to twice the distance from the worst point to the centroid of the others. If
it succeeds (meaning that the new point is no longer the worst point) then a step length of 1.5 times
that is tried, and the better of the two accepted.
3. If the previous step did not find a successful new point then step lengths of half and one and a half
times the distance from the worst point to the centroid are tried.
4. If the last two steps failed to locate a successful point then the polytope is reduced in size, by linear
rescaling towards the current best point (which remains fixed.)
5. When a point is accepted, the current worst point is discarded, to maintain n + 1 points, and the
algorithm returns to step 1.
Variations are possible, in particular with regard to the step lengths and shrinkage factors.
The following figure illustrates the polytope method in action. Each polytope is plotted, with the line
style cycling through, black, grey and dashed black. The worst point in each polytope is highlighted with a
circle.
‡
This is also known as the downhill simplex method, which should not be confused with the completely different simplex method of linear
programming.
CHAPTER 3. OPTIMISATION 37
1.5
1.0
0.5
z
0.0
−0.5
In this case it took 79 steps to reach the minimum of Rosenbrock’s function. This is somewhat more
than Newton or BFGS, but given that we need no derivatives in this case, the amount of computation is
actually less.
On the basis of this example you might be tempted to suppose that Nelder-Mead is all you ever need,
but this is generally not the case. If you need to know the optimum very accurately then Nelder-Mead
will often take a long time to get an answer that Newton based methods would give very quickly. Also,
the polytope can get ‘stuck’, so that it is usually a good idea to restart the optimisation from any apparent
minimum (with a new polytope having the apparent optimum as one vertex), to check that further progress
is really not possible. Essentially, Nelder-Mead is good if the answer does not need to be too accurate, and
derivatives are hard to come by, but you wouldn’t use it as the underlying optimiser for general purpose
modelling software, for example.
• In likelihood maximisation contexts the method of scoring is often used. This replaces the Hessian in
Newton’s method with the expected Hessian.
• The Gauss-Newton method is another way of getting a Newton type method when only first deriva-
tives are available. It is applicable when the objective is somehow related to a sum of squares of
differences between some data and the model predictions of those data, and has a nice interpreta-
tion in terms of approximating linear models. It is closely related to the algorithm for fitting GLMs.
• Conjugate gradient methods are another way of getting a good method using only first derivative
information. The problem with steepest descent is that if you minimise along one steepest descent
direction then the next search direction is always at right angles to it. This can mean that it takes
a huge number of tiny zig-zagging steps to minimise even a perfectly quadratic function. If n is the
dimension of θ, then Conjugate Gradient methods aim to construct a set of n consecutive search
directions such that successively minimising along each will lead you exactly to the minimum of a
quadratic function.
CHAPTER 3. OPTIMISATION 38
• Simulated annealing is a gradient-free method for optimising difficult objectives, such as those arising
in discrete optimisation problems, or with very spiky objective functions. The basic idea is to propose
random changes to θ, always accepting a change which reduces f (θ), but also accepting moves
which increase the objective, with a probability which decreases with the size of increase, and also
decreases as the optimisation progresses.
• Stochastic gradient descent (SGD) is an increasingly popular approach to optimisation of likelihoods
arising from very large data sets. Each step of SGD is a gradient descent based on a new objective
derived from a random subset of the data. The stochastic aspect can speed up convergence and the
use of a relatively small subset of the data can speed up each iteration: [Link]
org/wiki/Stochastic_gradient_descent
• The EM algorithm is a useful approach to likelihood maximisation when there are missing data or
random effects that are difficult to integrate out.
3.7 Software
R includes a routine optim for general purpose optimisation by BFGS, Nelder-Mead, Conjugate Gradient
or Simulated Annealing. The R routine nlm uses Newton’s method and will use careful finite difference
approximations for derivatives which you don’t supply. R has a number of add on packages e.g. quadprog,
linprog, trust to name but 3. See [Link] for a really useful guide to what is more
generally available.
Should you ever implement your own optimisation code? Many people reply with a stern ‘no’, but I am
less sure. For the methods presented in detail above, the advantages of using your own implementation are
(i) the un-rivalled access to diagnostic information and (ii) the knowledge of exactly what the routine is doing.
However, if you do go down this route it is very important to read further about step length selection and
termination criteria. Where greater caution is needed is if you require constrained optimisation, large scale
optimisation or to optimise based on approximate derivatives. All these areas require careful treatment.
• Gill, P.E. , W. Murray and M.H. Wright (1981) Practical Optimization Academic Press, is a classic,
particularly on constrained optimisation. I’ve used it to successfully implement quadratic programming
routines.
• Dennis, J.E. and R.B. Schnabel (1983) Numerical Methods for Unconstrained Optimization and Non-
linear Equations (re-published by SIAM, 1996), is a widely used monograph on unconstrained meth-
ods (the nlm routine in R is based on this one).
• Press WH, SA Teukolsky, WT Vetterling and BP Flannery (2007) Numerical Recipes (3rd ed.), Cam-
bridge and Monahan, JF (2001) Numerical Methods of Statistics, Cambridge, both discuss Optimiza-
tion at a more introductory level.
Chapter 4
Calculus by computer
As we saw in the previous section, almost any time that we want to maximise a likelihood, we need to
differentiate it. Very often, merely evaluating a likelihood involves integration (to obtain the marginal dis-
tribution of the observable data). Unsurprisingly, there are times when it is useful to perform these tasks
by computer. Broadly speaking, numerical integration is computationally costly but stable, while numerical
differentiation is cheap but potentially unstable. We will also consider the increasingly important topic of
‘automatic differentiation’.
a <- 1e16
b <- 1e16 + pi
d <- b - a
d; pi
## [1] 4
## [1] 3.141593
## [1] 0.2732395
This is an example of cancellation error. Almost any time we take the difference of two floating point
numbers of similar magnitude (and sign) we find that the accuracy of the result is substantially reduced.
The problem is rounding error. You can see what happens by considering computation to 16 places of
decimals. Then
hence d = b − a = 3. (d was 4, above, because the computer uses a binary rather than a decimal
representation).
As we will see, derivative approximation is an area where we can not possibly ignoreP cancellation error,
2 2
but actually it is a pervasive issue. For example, cancellation error is the reason why i xi − nx̄ is not
2 2 2
P P
a good basis for a ‘one-pass’ method for calculating i (xi − x̄) . i (xi − x1 ) − n(x̄ − x1 ) is a simple
one pass alternative, and generally much better. It is also a substantial issue in numerical linear algebra.
40
CHAPTER 4. CALCULUS BY COMPUTER 41
For example, orthogonal matrix decompositions (used for QR, eigen and singular value decompositions),
usually involve an apparently arbitrary choice at each successive rotation involved in the decomposition,
about whether some element of the result of rotation should be positive or negative: one of the choices
almost always involves substantially lower cancellation error than the other, and is the one taken.
Since approximation of derivatives inevitably involves subtracting similar numbers, it is self evident
that cancellation error will cause some difficulty. The basic reason why numerical integration is a more
stable process (at least for statisticians who usually integrate only positive functions) is because numerical
integration is basically about adding things up, rather than differencing them, and this is more stable. e.g.
f <- a + b
(f-2e16-pi)/(2e16+pi) ## relative error in sum, is tiny...
## [1] 4.292037e-17
f (x + ∆ei ) − f (x) ∂f 1
− = ∆eT ∇2 f ei .
∆ ∂xi 2 i
Now suppose that L is an upper bound on the magnitude of ∂ 2 f /∂x2i , we have
f (x + ∆ei ) − f (x) ∂f L∆
− ≤ .
∆ ∂xi 2
That is to say, we have an upper bound on the finite difference truncation* error, assuming that the FD
approximation is calculated exactly, without round-off error.
In reality we have to work in finite precision arithmetic and can only calculate f to some accuracy.
Suppose that we can calculate f to one part in −1 (the best we could hope for here is that is the machine
precision). Now let Lf be an upper bound on the magnitude of f , and denote the computed value of f by
comp(f ). We have
|comp{f (x)} − f (x)| ≤ Lf
and
|comp{f (x + ∆ei )} − f (x + ∆ei )| ≤ Lf
combining the preceding two inequalities we get
an upper bound on the cancellation error that results from differencing two very similar quantities using
finite precision arithmetic. (It should now be clear what the problem was in section 1.1 example 2.)
Now clearly the dependence of the two types of error on ∆ is rather different. We want ∆ as small as
possible to minimise truncation error, and as large as possible to minimise cancellation error. Given that
the total error is bounded as follows,
L∆ 2Lf
[Link] ≤ +
2 ∆
it makes sense to chose ∆ to minimise the bound. That is we should choose
r
4Lf
∆≈ .
L
So if the typical size of f and its second derivatives are similar then
√
∆≈
will not be too far from optimal. This is why the square root of the machine precision is often used as the
finite difference interval. However, note the assumptions: we require that Lf ≈ L and f to be calculable to
a relative accuracy that is a small multiple of the machine precision for such a choice to work well. In reality
problems are not always well scaled and a complicated function may have a relative accuracy substantially
lower than is suggested by the machine precision. In such cases, see section 8.6 of Gill, Murray and Wright
(1981), or section 4.3.3 below.
which in the well scaled case will be most accurate for ∆ ≈ 1/4 .
(x1*x2*sin(x3) + exp(x1*x2))/x3
would evaluate the function, if x1, x2 and x3 were initialised to be floating point numbers.
Now define a new type of object of class "ad" which has a value (a floating point number) and a
"grad" attribute. In the current case this "grad" attribute will be a 3-vector containing the derivatives of
the value w.r.t. x1, x2 and x3. We can now define versions of the arithmetic operators and mathematical
functions which will return class "ad" results with the correct value and "grad" attribute, whenever they
are used in an expression.
Here is an R function to create and initialise a simple class "ad" object
Here it is in use, initialising x1 to the value 1, giving it a 3 dimensional "grad" attribute, and setting the
first element of grad to 1 since ∂x1 /∂x1 = 1.
x1 <- ad(1,c(3,1))
x1
## [1] 1
## attr(,"grad")
## [1] 1 0 0
## attr(,"class")
## [1] "ad"
Now the interesting part. I can define versions of mathematical functions and operators that are specific
to class "ad" objects and correctly propagate derivatives alongside values. Here is a sin function for class
"ad".
†
This example is taken from Nocedal and Wright, (2006).
CHAPTER 4. CALCULUS BY COMPUTER 44
sin(x1)
## [1] 0.841471
## attr(,"grad")
## [1] 0.5403023 0.0000000 0.0000000
## attr(,"class")
## [1] "ad"
i.e. the value of the result is sin(x1 ), while the first element of its "grad" contains the derivative of
sin(x1 ) w.r.t. x1 evaluated at x1 = 1.
Operators can also be overloaded in this way. e.g. here is multiplication operator for class "ad":
Continuing in the same way we can provide a complete library of mathematical functions and operators
for the "ad" class. Given such a library, we can obtain the derivatives of a function directly from the code
that would simply evaluate it, given ordinary floating point arguments. For example, here is some code
evaluating a function.
## [1] 5.977259
and here is the same code with the arguments replaced by "ad" objects
## [1] 5.977259
## attr(,"grad")
## [1] 10.681278 5.340639 -3.805241
## attr(,"class")
## [1] "ad"
you can check that these results are correct (actually to machine accuracy).
This simple propagation of derivatives alongside the evaluation of a function is known as forward mode
auto-differentiation. R is not the best language in which to try to do this, and if you need AD it is much
better to use existing software libraries in C++, for example, which have done all the function and operator
re-writing for you.
CHAPTER 4. CALCULUS BY COMPUTER 45
Any computer evaluating this must break the computation down into a sequence of elementary operations
on one or two floating point numbers. This can be thought of as a graph:
* x4 x6
exp
x2
* +
x3 x5 x7 x8 f
sin /
where the nodes x4 to x8 are the intermediate quantities that will have to be produced en route from the
input values x1 to x3 to the final answer, f . The arrows run from parent nodes to child nodes. No child can
be evaluated until all its parents have been evaluated.
Simple left to right evaluation of this graph results in this . . .
CHAPTER 4. CALCULUS BY COMPUTER 46
* x4 = 2 x6 = 7.39
exp
x2 = 2
* +
x3 = 1.57 x5 = 1 x7 = 2 x8 = 9.39 f = 5.98
sin /
Now, forward mode AD carries derivatives forward through the graph, alongside values. e.g. the deriva-
tive of a node with respect to input variable, x1 is computed using
(the r.h.s. being evaluated by overloaded functions and operators, in the object oriented approach). The
following illustrates this process, just for the derivative w.r.t. x1 .
x1 = 1
dx1 f=(x1x2sin x3+ ex1 x2)/x3
=1 dx
4 dx
dx1 1 = 2
x4 = 2 x6 = 7.39
dx6 dx4 = 7.39
* dx4 dx6
=1 =2 exp = 14.8
dx 2 dx1 dx1
dx 4
x2 = 2
dx2 dx dx
=0 7
dx
8
dx
dx1 4 6
= =
1 1
df dx3 = −3.81
. . . Again computation runs left to right, with evaluation of a node only possible once all parent values are
known.
Now, if we require derivatives w.r.t. several input variables, then each node will have to evaluate deriva-
tives w.r.t. each of these variables, and this becomes expensive (in terms of the previous graph, each node
would contain multiple evaluated derivatives). Reverse mode therefore does something ingenious. It first
executes a forward sweep through the graph, evaluating the function and all the derivatives of nodes w.r.t.
their parents, as follows:
CHAPTER 4. CALCULUS BY COMPUTER 47
x2 = 2 dx dx
7 8
dx dx
4 = 6 =
1 1
df dx3 = −3.81
The reverse sweep then works backwards through the graph, evaluating the derivative of f w.r.t. each
node, using the following
∂f X ∂xj ∂f
=
∂xk ∂xk ∂xj
j is child of k
x1 = 1
df f=(x1x2sin x3+ ex1 x2)/x3
= 10.7 dx
4 dx
dx1 1 = 2
x4 = 2 x6 = 7.39
dx6 dx4 = 7.39
* df df
= 1 = 5.34 exp = 0.64
dx 2 dx4 dx6
dx 4
x2 = 2
df dx dx
= 5.34 7
dx
8
dx
dx2 4 6
= =
1 1
df dx3 = −3.81
The derivatives in grey are those calculated on the reverse sweep. The point here is that there is only
one derivative to be evaluated at each node, but in the end we know the derivative of f w.r.t. every input
variable. Reverse mode can therefore save a large number of operations relative to finite differencing or
forward mode. Once again, general purpose AD libraries automate the process for you, so that all you
need to be able to write is the evaluation code.
Unfortunately reverse mode efficiency comes at a heavy price. In forward mode we could discard the
values and derivatives associated with a node as soon as all its children were evaluated. In reverse mode
the values of all nodes and the evaluated derivatives associated with every connection have to be stored
during the forward sweep, in order to be used in the reverse sweep. This is a heavy storage requirement.
CHAPTER 4. CALCULUS BY COMPUTER 48
For example, if f involved the inversion of a 1000 × 1000 matrix then we would have to store some 2 × 109
intermediate node values plus a similar number of evaluated derivatives. That is some 32 Gigabytes of
storage before we even consider the requirements for storing the structure of the graph.
4.3.2 A caveat
For AD to work it is not sufficient that the function being evaluated has properly defined derivatives at
the evaluated function value. We require that every function/operator used in the evaluation has prop-
erly defined derivatives at its evaluated argument(s). This can create a problem with code that executes
conditional on the value of some variable. For example the Box-Cox transformation of a positive datum y is
λ
(y − 1)/λ λ 6= 0
B(y; λ) =
log(y) λ=0
If you code this up in the obvious way then AD will never get the derivative of B w.r.t. λ right if λ = 0.
• Integration w.r.t. 1 or 2 (maybe even up to 4) variables is not too problematic. There are two ‘deter-
ministic’ lines of attack.
is the solution of
dy
= f (x), y(a) = 0
dx
integrated to b.
2. Quadrature rules. These evaluate the function to be integrated at some set of points in the
integration domain and combine the resulting values to obtain an approximation to the integral.
For example
Z b N
b−aX
f (x)dx ≈ f (a + (i − .5)(b − a)/N )
a N
i=1
is the Midpoint rule for approximate integration. Simpson’s Rule is a higher order variant, while
Gaussian Quadrature rules aim for substantially higher order accuracy by careful choice of the
locations at which to evaluate.
CHAPTER 4. CALCULUS BY COMPUTER 49
• Integration with respect to several-many variables is usually more interesting in statistics and it is
difficult. Maintenance of acceptable accuracy with quadrature rules or differential equation meth-
ods typically requires an unfeasible number of function evaluations for high dimensional domains of
integration. There are two main approaches.
b N
b−aX
Z
f (x)dx = f (a + (i − .5)(b − a)/N ) + O(h2 )
a N
i=1
where hR in the final error term is (b − a)/N . Let’s see what the error bound means in practice by approxi-
1
mating 0 ex dx.
A .04% error may be good enough for many purposes, but suppose we need higher accuracy. . .
N <- 100
[Link] <- sum(exp((1:N-.5)/N))/N
c([Link],[Link],([Link])/[Link])
N <- 1000
[Link] <- sum(exp((1:N-.5)/N))/N
c([Link],[Link],([Link])/[Link])
Exactly as the error bound indicated, each 10 fold increase in the number of function evaluations buys
us a 100 fold reduction in the approximation error. Incidentally, notice that in statistical terms this error is
not a random error — it is really a bias.
If f is cheap to evaluate, accuracy is not paramount and we are only interested in 1 dimensional inte-
grals then we might stop there (or more likely with the slightly more accurate but equally straightforward
‘Simpson’s Rule’), but life is not so simple. If one is very careful about the choice of x points at which to
CHAPTER 4. CALCULUS BY COMPUTER 50
evaluate f (x) and about the weights used in the integral approximating summation, then some very im-
pressive results can be obtained, at least for smooth enough functions, using Gauss(ian) Quadrature. The
basic idea is to find a set {xi , wi : i = 1, . . . , n} such that if w(x) is a fixed weight function and g(x) is any
polynomial up to order 2n − 1 then
Z b n
X
g(x)w(x)dx = wi g(xi )
a i=1
Note that the xi , wi set depends on w(x), a and b, but not on g(x). The hope is that the r.h.s. will still approxi-
mate the l.h.s. very accurately if g(x) is not a polynomial of the specified type, and there is some impressive
error analysis to back this up, for any function with 2n well behaved derivatives (see e.g. Monahan, 2001,
section 10.3).
The R package statmod has a nice [Link] function for producing the weights, wi and nodes,
xi , for some standard w(x)s. It assumes a = −1, b = 1, but we can just linearly re-scale our actual interval
to use this. Here is the previous example using Gauss Quadrature (assuming w(x) is a constant).
library(statmod)
N <- 10
gq <- [Link](N) # get the x_i, w_i
gq # take a look
## $nodes
## [1] -0.9739065 -0.8650634 -0.6794096 -0.4333954 -0.1488743 0.1488743
## [7] 0.4333954 0.6794096 0.8650634 0.9739065
##
## $weights
## [1] 0.06667134 0.14945135 0.21908636 0.26926672 0.29552422 0.29552422
## [7] 0.26926672 0.21908636 0.14945135 0.06667134
## [1] 2
For convenience, we can pack the rescaling logic into a simple function:
gqi(exp, 10, 0, 1)
## [1] 1.718282
. . . still more accurate than the midpoint rule with 100 evaluations. This example should be treated with
caution, however. ex is a very smooth function and such impressive results will not hold for less smooth
cases.
Clearly, in principle we can repeat this for as many dimensions as we like. In practice however, to maintain
a given level of accuracy, the number of function evaluations will have to rise as N d where N is the number
of nodes for each dimension, and d is the number of dimensions. Even if we can get away with a 3 point
Gaussian quadrature, we will need 1.5 million evaluations by the time we reach a 13 dimensional integral.
Here is a function which takes the node sequence x and weight sequence w of some quadrature rule
and uses them to produce the d dimensional set of nodes, and the corresponding weight vector resulting
from recursive application of the 1D rule.
mesh(c(1,2),3)
## $X
## [,1] [,2] [,3]
## [1,] 1 1 1
CHAPTER 4. CALCULUS BY COMPUTER 52
## [2,] 2 1 1
## [3,] 1 2 1
## [4,] 2 2 1
## [5,] 1 1 2
## [6,] 2 1 2
## [7,] 1 2 2
## [8,] 2 2 2
##
## $w
## [1] 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
## [1] 100
## [1] 1e+05
. . . which illustrates the enormous rate of increase in function evaluations required to achieve a rather
disappointing accuracy.
Of course Gauss Quadrature is a better bet
## [1] 16
## [1] 1024
. . . even so, by the time we reach 20 dimensions we would be requiring 1012 function evaluations, and
most functions we might want to integrate will not be so smooth and well behaved that 4 quadrature points
give us anything like the sort of accuracy seen in this example.
For a given y let b̂y be the value of b maximising f . Then Taylor’s theorem implies that
1
log{f (y, b)} ' log{f (y, b̂y )} − (b − b̂y )T H(b − b̂y )
2
and so Z
1 T
fy (y) ' f (y, b̂y ) exp − (b − b̂y ) H(b − b̂y ) db.
2
From the properties of normal p.d.f.s we have
where X ∼ uniform over Ω. Generating N uniform random vectors, xi , over Ω, we have the obvious
estimator
N
ˆ V (Ω) X
IΩ = f (xi ),
N
i=1
which defines basic Monte-Carlo Integration. ‡ Clearly IˆΩ is unbiased, and if the xi are independent then
V (Ω)2
var(IˆΩ ) = N var{f (X)}
N2
i.e. the variance scales as N −1 , and the standard deviation scales as N −1/2 (it’s Op (N −1/2 )). This is
not brilliantly fast convergence to IΩ , but compare it to quadrature. If we take the midpoint rule then
the (bias) error is O(h2 ) where h is the node spacing. In d dimensions then at best we can arrange
h ∝ N −1/d , implying that the bias is O(N −2/d ). i.e. once we get above d = 4 we would prefer the stochastic
convergence rate of Monte-Carlo Integration to the deterministic convergence rate of the midpoint rule.
Of course using Simpson’s rule would buy us a few more dimensions, and for smooth functions the
spectacular convergence rates of Gauss quadrature may keep it as the best option up to a rather higher d,
but the fact remains that its bias is still dependent on d, while the sampling error of the Monte Carlo method
is dimension independent, so Monte Carlo will win in the end.
which is usually lower than previously, because the range of variation of f within each Ωi is lower than
within the whole of Ω.
The underlying idea of trying to obtain the nice properties of uniform random variables, but with more
even spacing over the domain of integration, is pushed to the limit by Quasi-Monte-Carlo integration, which
uses deterministic space filling sequences of numbers as the basis for integration. For fixed N , the bias of
such schemes grows much less quickly with dimension than the quadrature rules.
where the xi are observations on r.v.s with p.d.f. f (x). A difficulty with this approach, as with simple
MC integration, is that a high proportion of the xi s often fall in places where φ(xi ) makes no, or negligible,
‡
Note that in practice, if Ω is a complicated shape, it may be preferable to generate uniformly over a simpler enclosing region Ω̃, and just
set the integrand to zero for points outside Ω itself.
CHAPTER 4. CALCULUS BY COMPUTER 55
contribution to the integral. For example, in Bayesian modelling, it is quite possible for the f (y|θ) to be have
non-vanishing support over only a tiny volume of the parameter space over which f (θ) has non-vanishing
support.
One approach to this problem is importance sampling. Rather than generate the xi from f (x) we
generate from a distribution with the same support as f , but with a p.d.f., g(x), which is designed to
generate points concentrated in the places where the integrand is non-negligible. Then the estimator is
Z Z n
f (x) 1X f (xi )
φ(x)f (x)dx = φ(x) g(x)dx ≈ Iˆis = φ(xi ) .
g(x) n g(xi )
i=1
The expected value of the r.h.s. of the above is its l.h.s., so this estimator is unbiased. R
What would make a good g? Suppose that g ∝ φf . Then g(x) = φ(x)f (x)/k, where k = φ(x)f (x)dx.
So Iˆis = nk/n = φ(x)f (x)dx, and the estimator has zero variance. Clearly this is impractical. If we knew
R
k then we would not need to estimate, but there are useful insights here anyway. Firstly, the more constant
is φf /g, the lower the variance of Iˆis . Secondly, the idea that the density of evaluation points should reflect
the contribution of the integrand in the local neighbourhood is quite generally applicable.
f (x̂ + R−1 z)
Z
1
φ(x̂ + R−1 z) h(z)dz,
|R| h(z)
where h(z) = (2π)−d/2 exp{−zT z/2} is the standard normal density for z, giving
−1
Z
1 −1 f (x̂ + R z)
φ(x)f (x)dx = Ez φ(x̂ + R z) =
|R| h(z)
(2π)d/2 h i
Ez φ(x̂ + R−1 z)f (x̂ + R−1 z) exp{zT z/2} ,
|R|
1. It might be desirable to make g a little more diffuse than the Laplace approximation suggests, to avoid
excessive importance weights being assigned to zi values in the tails of the target distribution.
2. If the integration is in order to obtain a likelihood, which is to be maximised, then it is usually best
to condition on a set of zi values, in order to avoid random variability in the approximate likelihood,
which will make optimisation difficult.
CHAPTER 4. CALCULUS BY COMPUTER 56
3. Usually the log of the integrand is evaluated rather than the integrand itself, and it is the log of the
integral that is required. To avoid numerical under/overflow problems note that
( ) " #
X X
log wi exp(ψi ) = log wi exp{ψi − max(ψi )} + max(ψi )
i i
the r.h.s. will not overflow or underflow when computed in finite precision arithmetic (for reasonable
wi ).
4. As with simple Monte-Carlo integration, some variance reduction is achievable by stratification. The zi
can be obtained by suitable transformation of random vectors from a uniform distribution on (0, 1)d . In-
stead (0, 1)d can be partitioned into equal volume parts, with one uniform vector generated in each. . .
Tweak 3. is an example of a general technique for avoiding numerical underflow and overflow which has
many applications in statistical computing. It is therefore worth examining more carefully.
without the risk of underflow. We can do this by first computing m = maxi li and then noting that
X X
L = log exp(li ) = log exp(m) exp(li − m)
i i
" #
X X
= log exp(m) exp(li − m) = m + log exp(li − m).
i i
We are now safe, since li − m cannot be bigger than zero, and hence no term can overflow. Also, since
we know that li − m = 0 for some i, we know that at least one term will not underflow to zero. So we are
guaranteed to get a sensible finite answer in a numerically stable way. Note that exactly the same technique
works for computing a sample mean (or other weighted sum) as opposed to just a simple sum (as above).
4.4.9 An example
Consider a Generalised Linear Mixed Model:
where Λ is diagonal and depends on a small number of parameters. Here, β are the fixed effects and b
are the random effects, with covariate matrices X and Z, respectively. From the model specification it is
CHAPTER 4. CALCULUS BY COMPUTER 57
straightforward to write down the joint probability function of y and b, based on the factorisation f (y, b) =
fy|b (y|b)fb (b). i.e.
X
log{f (y, b)} = {yi (Xi β + Zi b) − exp(Xi β + Zi b) − log(yi !)}
i
1X 1X 1X
− λj b2j + log(λj ) − log(2π)
2 2 2
j j j
but likelihood based inference actually requires that we can evaluate the marginal p.f. fy (y) for any values
of the model parameters. Now Z
fy (y) = f (y, b)db
[Link](0);n <- 50
x <- runif(n)
z1 <- sample(c(0,1),n,replace=TRUE)
z2 <- sample(c(0,1),n,replace=TRUE)
b <- rnorm(4) ## simulated random effects
X <- [Link](~x)
Z <- cbind(z1,1-z1,z2,1-z2)
eta <- X%*%c(0,3) + Z%*%b
mu <- exp(eta)
y <- rpois(mu,mu)
Here is a function evaluating log{f (y, b)} (log is used here to avoid undesirable underflow to zero)
theta contains the fixed effect parameters, i.e. β and the parameters of Λ. So in this example, theta
is of length 4, with the first 2 elements corresponding to β and the last two are σ1−1 , σ22 . lambda is the name
of a function for turning the parameters of Λ into its leading diagonal elements, for example
Actually, for integration purposes, lf.yb0 can usefully be re-written to accept a matrix b where each
column is a different b at which to evaluate the log p.d.f. Suppose [Link] is such a function. Lets try eval-
uating fy (y) at the simulated data, and the theta vector used for simulation, using the various integration
methods covered above.
CHAPTER 4. CALCULUS BY COMPUTER 58
Quadrature rules
First try brute force application of the midpoint rule. We immediately have the slight problem of the integra-
tion domain being infinite, but since the integrand will vanish at some distance from b = 0, we can probably
safely restrict attention to the domain [−5, 5]4 .
## [1] -116.5303
Of course, we don’t actually know how accurate this is, but we can get some idea by doubling nm, in
which case
log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
## [1] -118.6234
Clearly we are not doing all that well, even with 10000-160000 function evaluations.
On very smooth functions Gauss quadrature rules were a much better bet, so let’s try a naive application
of these again.
nm <- 10
gq <- [Link](nm)
w <- gq$weights*m.r/2 ## rescale to domain
x <- gq$nodes*m.r/2
bm <- mesh(x,4,w)
lf <- [Link](y,t(bm$X),theta,X,Z,[Link])
log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
## [1] -136.95
— this is a long way from the midpoint estimate: what happens if we double nm to 20?
log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
## [1] -125.3369
Increasing further still indicates problems. The difficulty is that the integrand is not very smooth, in
the way that ex was, and it is not well approximated by a tensor product of high order polynomials. The
midpoint rule actually works better here. In fact this integral could be performed reasonably accurately by
skilful use of Gauss quadrature, but the example illustrates that naive application can be much worse than
doing something simpler.
Laplace approximation
To use the Laplace approximation we need to supplement [Link] with functions [Link] and [Link]
which evaluate the gradient vector and Hessian of the log(f ) w.r.t. b. Given these functions we first find b̂
by Newton’s method
CHAPTER 4. CALCULUS BY COMPUTER 59
b <- rep(0,4)
while(TRUE) {
b0 <- b
g <- [Link](y, b, theta, X, Z, [Link])
H <- [Link](y, b, theta, X, Z, [Link])
b <- b - solve(H, g)
b <- [Link](b)
if (sum(abs(b-b0)) < 1e-5*sum(abs(b)) + 1e-4) break
}
after which b contains b̂ and H the negative of the required H. It remains to evaluate the approximation
lf.yb0(y,b,theta,X,Z,[Link]) + 2*log(2*pi) -
0.5*determinant(H,logarithm=TRUE)$modulus
## [1] -116.9646
## attr(,"logarithm")
## [1] TRUE
This is reasonably close to the truth, but the method itself provides us with no way of knowing how
close.
Monte Carlo
N <- 100000
bm <- matrix((runif(N*4)-.5)*m.r,4,N)
lf <- [Link](y, bm, theta, X, Z, [Link])
log(sum(exp(lf-max(lf)))) + max(lf) - log(N) + log(m.r^4)
## [1] -119.8479
## [1] -119.3469
The next replicate gave -119.3, and that degree of variability is with 105 function evaluations. Stratification
improves things a bit, but let’s move on.
Using the original factorisation of the joint likelihood, we can re-write our required marginal likelihood
Z Z
fy (y) = f (y, b)db = fy|b (y|b)fb (b)db = Eb fy|b (y|b)
as the expectation of the complete data likelihood wrt the random effects distribution. This gives us a
natural Monte Carlo approach to estimation.
}
N = 100000
lam = [Link](theta[3:4])
bm = matrix(rnorm(4*N, 0, sqrt(lam)), nrow=4)
llv = [Link](y, bm, theta, X, Z, [Link])
max(llv) + log(mean(exp(llv - max(llv))))
## [1] -119.2791
## [1] -119.0786
The second replicate is consistent with the first, suggesting relatively low variance. This approach
is simple and often works well, but it does rely on a degree of consistency between the complete data
likelihood and the (prior) random effects distribution. Laplace importance sampling does not need this.
Using Laplace importance sampling allows the construction of an efficient unbiased estimate of the marginal
likelihood. It is not strictly necessary here, since the previous Monte Carlo approach is arguably simpler.
But it is very useful for more challenging problems.
N <- 10000
R <- chol(-H)
z <- matrix(rnorm(N*4),4,N)
bm <- b + backsolve(R,z)
lf <- [Link](y,bm,theta,X,Z,[Link])
zz.2 <- colSums(z*z)/2
lfz <- lf + zz.2
log(sum(exp(lfz - max(lfz)))) + max(lfz) + 2 * log(2*pi) -
sum(log(diag(R))) - log(N)
## [1] -119.3229
## [1] -119.2882
## [1] -119.2774
Stratification can reduce this variability still further, and be aware that while we know that the estimator
is unbiased, it is not on the log scale, of course.
Conclusions
The example emphasises that integration is computationally expensive, and that multi-dimensional inte-
gration is best approached by approximating the integrand with something analytically integrable or by
constructing unbiased estimators of the integral. The insight that variance is likely to be minimised by con-
centrating the ‘design points’ of the estimator on where the integrand has substantial magnitude is strongly
supported by this example, and leads rather naturally, beyond the scope of this course, to MCMC methods.
Note that in likelihood maximisation settings you might also consider avoiding intractable integration via the
EM algorithm (see e.g. Lange, 2010 or Monahan, 2001).
CHAPTER 4. CALCULUS BY COMPUTER 61
• Gill, P.E. , W. Murray and M.H. Wright (1981) Practical Optimization Academic Press, Section 8.6
offers the treatment of finite differencing that you should read before using it on anything but the most
benign problems. This is one of the few references that tells you what to do if your objective is not
‘well scaled’.
• If you want to see AD in serious action, take a look at e.g. Kalnay, E (2003) Atmospheric Modelling,
Data Assimilation and Predictability.
• Although most widely-used AD libraries are based on object–oriented programming approaches, ar-
guably the most elegant approaches exploit functional programming languages with strong static type
systems. See, for example: Elliott, C. (2018) The simple essence of automatic differentiation. Proc.
ACM Program. Lang. 2, ICFP, Article 70. DOI: [Link]
• Chapter 10 of Monahan, J.F (2001) Numerical Methods of Statistics, offers more information on
numerical integration, including much of the detail skipped over here, although it is not always easy
to follow, if you don’t already have quite a good idea what is going on.
• Lange, K (2010) Numerical Analysis for Statisticians, second edition, Springer. Chapter 18 is good
on one dimensional quadrature.
• Ripley, B.D. (1987, 2006) Stochastic Simulation Chapters 5 and section 7.4 are good discussions of
stochastic integration and Quasi Monte Carlo integration.
• Robert, C.P. and G. Casella (1999) Monte Carlo Statistical Methods Chapter 3 (and beyond) goes
into much more depth than these notes on stochastic integration.
• Press et al. (2007) Numerical Recipes (3rd ed), Cambridge, has a nice chapter on numerical integra-
tion, but with emphasis on 1D.
Chapter 5
The stochastic integration methods of the last section took it for granted that we can produce random
numbers from various distributions. Actually we can’t. The best that can be done is to produce a completely
deterministic sequence of numbers which appears indistinguishable from a random sequence with respect
to any relevant statistical property that we choose to test. In other words we may be able to produce
a deterministic sequence of numbers that can be very well modelled as being a random sequence from
some distribution. Such deterministic sequences are referred to as sequences of pseudorandom numbers,
but the pseudo part usually gets dropped at some point.
The fundamental problem, for our purposes, is to generate a pseudorandom sequence that can be
extremely well modelled as i.i.d. U (0, 1). Given such a sequence, it is fairly straightforward to generate
deviates from other distributions, but the i.i.d. U (0, 1) generation is where the problems lie. Indeed if you
read around this topic then most books will pretty much agree about how to turn uniform random deviates
into deviates from a huge range of other distributions, but advice on how to get the uniform deviates in the
first place is much less consistent.
where b is 0 or 1, in practice. This is started with a seed X0 . The Xi are integers (< M , of course), but
we can define Ui = Xi /M . Now the intuitive hope that this recipe might lead to Ui that are reasonably well
modelled by i.i.d. U (0, 1) r.v.s is only realised for some quite special choices of a and M , and it takes some
number theory to give the generator any sort of theoretical foundation (see Ripley, 1987, Chapter 2).
An obvious property to try to achieve is full period. We would like the generator to visit all possible
integers between 1 − b and M − 1 once before it starts to repeat itself (clearly the first time it revisits a value,
it starts to repeat itself). We would also like successive Ui s to appear uncorrelated. A notorious and widely
used generator called RANDU, supplied at one time with IBM machines, met these basic considerations
with
Xi+1 = (65539Xi )mod231
This appears to do very well in 1 dimension.
62
CHAPTER 5. RANDOM NUMBER GENERATION 63
qqplot((1:n-.5)/n,sort(u))
1.0
0.8
0.6
U
0.4
0.2
0.0
plot(U$u1,U$u2,pch=".")
1.0
0.8
0.6
Ui
0.4
0.2
0.0
We can also check visually what the distribution of the triples (Ui , Ui−1 , Ui−2 ) looks like.
library(lattice)
cloud(u1~u2*u3,U,pch=".",col=1,screen=list(z=40,x=-70,y=0))
CHAPTER 5. RANDOM NUMBER GENERATION 64
u1
u3
u2
cloud(u1~u2*u3,U,pch=".",col=1,screen=list(z=40,x=70,y=0))
u1
u3 u2
i.e. the triples lie on one of 15 planes. Actually you can show that this must happen (see Ripley, 1987,
section 2.2).
Does this deficiency matter in practice? When we looked at numerical integration we saw that quadra-
ture rules based on evaluating integrands on fixed lattices become increasingly problematic as the di-
mension of the integral increases. Stochastic integration overcomes the problems by avoiding the biases
associated with fixed lattices. We should expect problems if we are performing stochastic integration on
the basis of random numbers that are in fact sampled from some coarse lattice.
So the first lesson is to use generators that have been carefully engineered by people with a good
understanding of number theory, and have then been empirically tested (Marsaglia’s Diehard battery of
tests provides one standard test set). For example, if we stick with simple congruential generators, then
is a much better bet. Here is its triples plot. No amount of rotating provides any evidence of structure.
CHAPTER 5. RANDOM NUMBER GENERATION 65
u1
u3 u2
Although this generator is much better than RANDU, it is still problematic. An obvious infelicity is the
fact that a very small Xi will always be followed by an unusually small Xi+1 (consider Xi = 1, for example).
This is not a property that would be desirable in a time series simulation, for example. Not quite so obvious
is the fact that for any congruential generator of period M , then k−tuples, Ui , Ui−1 , . . . , Ui−k+1 will tend to
lie on a finite number of k − 1 dimensional planes (e.g. for RANDU we saw 3-tuples lying on 2 dimensional
planes.) There will be at most M 1/k such planes, and as RANDU shows, there can be far fewer. The upshot
of this is that if we could visualise 8 dimensions, then the 8-tuple plot for (5.1) would be just as alarming as
the 3D plot was for RANDU. 8 is not an unreasonably large dimension for an integral.
Generally then, it might be nice to have generators with better behaviour than simple congruential
generators, and in particular we would like generators where k−tuples appear uniformly distributed on
[0, 1]k for as high a k as possible (referred to as having a high k-distribution).
x ← x ∧ (x >> a)
x ← x ∧ (x << b)
x ← x ∧ (x >> c)
each iteration generates a new random sequence of 0s and 1s. ∧ denotes ‘exclusive or’ (XOR) and >>
and << are right-shift and left shift respectively, with the integers a, b and c giving the distance to shift.
a = 21, b = 35 and c = 4 appear to be good constants (but see Press et al., 2007, for some others).
If you are a bit rusty on these binary operators then consider an 8-bit example where x=10011011 and
z=01011100.
• x<<1 is 00110110, i.e. the bit pattern is shifted leftwards, with the leftmost bit discarded, and the
rightmost set to zero.
• x<<2 is 01101100, i.e. the pattern is shifted 2 bits leftwards, which also entails discarding the 2
leftmost bits and zeroing the two rightmost.
• x^z is 11000111, i.e. a 1 where the bits in x and z disagree, and a 0 where they agree.
The Xorshift generator is very fast, has a period of 264 − 1, and passes the Diehard battery of tests
(perhaps unsurprising as Marsaglia is responsible for that too). These shift register generators suffer
similar granularity problems to congruential generators (there is always some k for which [0, 1]k can not be
very well covered by even 264 − 1 points), but tend to have all bit positions ‘equally random’, whereas lower
order bits from congruential generator sequences often have a good deal of structure.
Now we reach a fork in the road. To achieve better performance in terms of longer period, larger k-
distribution, and fewer low order correlation problems, there seem to be two main approaches, the first
pragmatic, and the second more theoretical.
1. Combine the output from several ‘good’, well understood, simple generators using operations that
maintain randomness (e.g. XOR and addition, but not multiplication). When doing this, the output
from the combined generators is never fed back into the driving generators. Preferably combine
rather different types of generator. Press et al. (2007) make a convincing case for this approach.
Wichmann-Hill, available in R is an example of such a combined generator, albeit based on 3 very
closely related generators.
2. Use more complicated generators — non-linear or with a higher dimensional state that just a single
Xi (see Gentle, 2003). For example, use a shift register type generator based on maintaining the
history of the last n bit-patterns, and using these in the bit scrambling operation. Matsumoto and
Nishimura’s (1998) Mersenne Twister is of this type. It achieves a period of 219937 − 1 (that’s not a
misprint: 219937 −1 is a ‘Mersenne prime’* ), and is 623-distributed at 32 bit accuracy. i.e. its 623-tuples
appear uniformly distributed (each appearing the same number of times in a full period) and spaced
2−32 apart (without the ludicrous period this would not be possible). It passes Diehard, is the default
generator in R, and C source code is freely available.
1. Avoid using black-box routines supplied with low level languages such as C, you don’t know what you
are getting, and there is a long history of these being botched.
2. Do make sure you know what method is being used to generate any uniform random deviates that
you use, and that you are satisfied that it is good enough for purpose.
3. For any random number generation task that relies on k−tuples having uniform distribution for high
k then you should be particularly careful about what generator you use. This includes any statistical
task that is somehow equivalent to high dimensional integration.
4. The Mersenne Twister is probably the sensible default choice in most cases, at present. For high
dimensional problems it remains a good idea to check answers with a different high-quality generator.
If results differ significantly, then you will need to find out why (probably starting with the ‘other’
generator).
Note that I haven’t discussed methods used by cryptographers. Cryptographers want to use (pseudo)random
sequences of bits (0s and 1s) to scramble messages. Their main concern is that if someone were to in-
tercept the random sequence, and guess the generator being used, they should not be able to infer the
state of the generator. Achieving this goal is quite computer intensive, which is why generators used for
cryptography are usually over engineered for simulation purposes.
* Numbers this large are often described as being ‘astronomical’, but this doesn’t really do it justice: there are probably fewer than 2270
atoms in the universe.
CHAPTER 5. RANDOM NUMBER GENERATION 67
For most standard distributions (except the exponential) there are better methods than inversion, and
the happy situation exists where textbooks tend to agree what these are. Ripley’s (1987) Chapter 3 is a
good place to start, while the lighter version is provided by Press et al. (2007) Chapter 7. R has many of
these methods built in. For example an alternative generation of 1 million i.i.d. N (0, 1) deviates would use.
Here the difference is marginal, due to the highly optimised qnorm function in R, but for other distribu-
tions the difference can be substantial. We have already seen how to generate multivariate normal deviates
given i.i.d. N (0, 1) deviates, and for other (simple) multivariate distributions see chapter 4 of Ripley (1987).
For more complicated distributions see the Computationally-Intensive Statistical Methods APTS module.
• DIEHARD is George Marsaglia’s Diehard battery of tests for random number generators: https:
//[Link]/wiki/Diehard_tests
• Press, W.H., S.A. Teukolsky, W.T. Vetterling and B.P Flannery (2007) Numerical Recipes (3rd ed.)
Cambridge. Chapter 7, on random numbers is good and reasonably up to date, but in this case you
do need the third edition, rather than the second (and certainly not the first).
• Ripley (1987, re-issued 2006) Stochastic Simulation, Wiley. Chapter’s 2 and 3 are where to go to
really get to grips with the issues surrounding random number generation. The illustrations alone are
CHAPTER 5. RANDOM NUMBER GENERATION 68
an eye opener. You probably would not want to use any of the actual generators discussed and the
remarks about mixed generators now look dated, but the given principles for choosing generators
remain sound (albeit with some updating of what counts as simple, and the integers involved). See
chapter 3 and 4 for simulating from other distributions, given uniform deviates.
• Gamerman, D. and H.F. Lopes (2006) Markov Chain Monte Carlo: Stochastic simulation for Bayesian
inference, CRC. Chapter 1 has a nice compact introduction to simulation, given i.i.d. U (0, 1) deviates.
• Devroye, L., (2003) Non-Uniform Random Variate Generation, Springer. This is a classic reference.
Now out of print, it is available on-line. [Link]
Chapter 6
Other topics
This module has concentrated on some topics in numerical analysis and numerical computation that are
quite important for statistics. It has not covered a great many other topics in numerical analysis that are also
important. Despite its name the module has made no attempt whatsoever to provide an overview of ‘statis-
tical computing’ more generally, which would surely have to talk about data handling, pattern recognition,
sorting, matching, searching, algorithms for graphs and trees, computational geometry, discrete optimi-
sation, the EM algorithm, dynamic programming, linear programming, fast Fourier transforms, wavelets,
support vector machines, function approximation, sparse matrices, huge data sets, graphical processing
units, multi-core processing and a host of other topics.
Rather than attempt the impossible task of trying to remedy this deficiency, I want to provide a couple
of examples of the ingenuity of the less numerical algorithms available, which serve, I hope, to illustrate the
value of looking to see what is available when confronted with any awkward computational problem.
## 326428560 bytes
## [1] 0.9906125
. . . so the simulated X is 100, 000 × 400 and is about 99% zeroes. It requires about 0.3Gb of computer
memory, if stored as a normal ‘dense’ matrix. The Matrix library, supplied with R, supports sparse matrix
69
CHAPTER 6. OTHER TOPICS 70
library(Matrix)
Xs <- Matrix(X) ## a sparse version of X
## much reduced memory footprint
[Link]([Link](Xs))/[Link]([Link](X))
## [1] 0.03349829
Because Xs doesn’t store the zeroes, it uses much less memory than the dense version X. Note that
although X is only 1% non-zeroes, Xs uses 3% of the storage of X, because it needs to store the locations
of the non-zero elements, in addition to their values. The sparse approach also saves computer time when
performing basic matrix operations, by avoiding the floating point operations that are not needed because
they involve a zero. For example. . .
Again computational overheads involved in only storing the non-zeroes, and their locations, means that
the cost of the sparse operation is more than 1% of the dense cost, but the saving is still substantial.
Any competent programmer could produce code for the examples given so far, but any serious applica-
tion will involve solving linear systems, and this is much less straightforward. Sparse versions of some of
the decompositions that we met in section 2 are needed, but naive implementations using sparse matrices
flounder on the problem of infill (or fill-in). You can see the problem in the following dense calculation. . .
[Link]({
XX <- t(X)%*%X
R <- chol(XX)
})
dim(XX)
sum(XX == 0)/prod(dim(XX))
## [1] 0.97935
image(XX == 0)
CHAPTER 6. OTHER TOPICS 71
1.0
0.8
0.6
0.4
0.2
0.0
range(t(R)%*%R-XX)
## [1] 80120
image(R == 0)
CHAPTER 6. OTHER TOPICS 72
1.0
0.8
0.6
0.4
0.2
0.0
## (Intercept) a2 a3 a4 a5
## (Intercept) 316.2278 7.712795 7.962615 8.149190 8.051159
## a2 0.0000 48.780250 -1.258994 -1.288493 -1.272993
## a3 0.0000 0.000000 49.527888 -1.342901 -1.326747
## a4 0.0000 0.000000 0.000000 50.071220 -1.378683
## a5 0.0000 0.000000 0.000000 0.000000 49.758389
Although XT X is 98% zeroes, the upper triangle of its Cholesky factor is 0% zeroes, when computed in
the usual way. i.e. we have lost all sparsity. Fortunately, it is possible to re-arrange the rows and columns
of XT X (or any other matrix for which we need a sparse Cholesky factor), in a way that reduces infill in
the Cholesky factor. This pivoting means that we get a re-ordered version of the solution of the system
of interest: a minor inconvenience. Unfortunately, finding the optimal pivoting in terms of infill reduction is
NP hard, but there are never the less practical computational pivoting strategies that work well in practice.
These rely on some rather ingenious graph theory based symbolic analysis of the matrix to be factorised,
which is applied before anything numerical is computed. See Davis (2006). Here’s what happens when a
sparse Cholesky factorisation, with infill reducing pivoting, is used on XT X. . .
[Link]({
XXs <- t(Xs)%*%Xs
Rs <- chol(XXs,pivot=TRUE) ## pivot to reduce infill
})
## user system elapsed
## 0.030 0.000 0.029
range(t(Rs)%*%Rs-XXs[Rs@pivot,Rs@pivot]) ## factorization works!
## [1] -5.684342e-14 5.684342e-14
sum(Rs!=0) ## the upper triangle is sparse (much better)
## [1] 1729
CHAPTER 6. OTHER TOPICS 73
image([Link](Rs) == 0)
1.0
0.8
0.6
0.4
0.2
0.0
with sparsity maintained in the factor, sparse backward or forward substitution is also very efficient.
Note that while the sparse Cholesky routines in the Matrix package are state-of-the-art, the QR rou-
tines are not, and carry heavy memory overheads associated with computation of Q. You’ll probably need
to look for sparse multi-frontal QR routines if you need sparse QR.
You could store some sort of directory of names, which would be searched through to find the location
of the data. But such a search is always going to take some time, which will increase with the number of
people’s data that is stored. Can we do better? With hashing the answer is yes. The pivotal idea is to
create a function which will turn an arbitrary key into an index which is either unique or shared with only a
few other keys, and has a modest predefined range. This index gives a location in a (hash) table which in
turn stores the location of the actual data (plus some information for resolving ‘clashes’). So given a key,
we use the (hash) function to get its index, and its index to look up the data location, with no, or very little
searching. The main problem is how to create a hash function.
An elegant general strategy is to use the ideas of pseudorandom number generation to turn keys into
random integers. For example, if the key were some 64bit quantity, we could use it to seed the Xorshift
random number generator given earlier, and run this forward for a few steps. More generally we would
need an approach that successively incorporated all the bits in a key into the ‘random number’ generation
process (see e.g. section 7.6 of Press et al., 2007, for example code). So, the hash function should return
numbers that appear random, which will mean that they tend to be nicely spread out, irrespective of what
the keys are like. Of course, despite this, the function always returns the same number for a given key.
It remains to turn the number returned by the hash function, h, into an index, ih , somewhere between
0 and the length, nh , of the hash table minus 1. ih = h mod nh is the usual approach. At position ih in the
hash table we might store the small number of actual keys (or possibly just their h values) sharing the ih
value, and the actual location in memory of the corresponding data.
The really neat thing here is that the randomness of the h’s means that whatever the keys look like, you
usually end up with a fairly uniform distribution of ih ’s, and only a small number of keys sharing each slot in
the hash table. It is this that makes data retrieval so efficient, relative to searching a directory of keys.
Note that R environments are themselves methods for associating values with lookup keys, and in fact,
R environments can use hash tables.
ht <- [Link](hash=TRUE)
ht[["foo"]] = "bar"
ht[["baz"]] = 1:10
ht[["foo"]]
## [1] "bar"
ht$baz
## [1] 1 2 3 4 5 6 7 8 9 10
If that doesn’t sound like it has any application in statistics, consider the construction of triangulations
in spatial statistics (a triangulation being a way of joining a set of points up with triangles, in such a way
that the whole complex hull of the points is covered in triangles). Efficient construction of triangulations is
tricky, and several efficient algorithms work by incrementally adding triangles. You usually know at least
one neighbour of each triangle at its creation, but you actually need to know all its neighbours in the existing
set. The task of finding those neighbours would be prohibitive if you had to search each existing triangle
for edges shared with the new triangle. Instead one can maintain a hash table for unpaired triangle edges,
where the hash key is based on the end co-ordinates of each edge. This allows any newly created edge to
be paired with any existing edge in little more than the length of time it takes to create the hash key.
2. Tim Davis’s home page is worth a look for more recent sparse matrix developments
[Link]
CHAPTER 6. OTHER TOPICS 75
3. If you think hashing is clever, take a look at the rest of Cormen, T., C.E. Leiserson, R.L. Rivest and C.
Stein (2001) Introduction to Algorithms (2nd ed.) MIT.
4. Once again, Press, W.H., S.A. Teukolsky, W.T. Vetterling and B.P Flannery (2007) Numerical Recipes
(3rd ed.) provides a readable introduction to a much wider range of topics in scientific computing than
has been covered here.