0% found this document useful (0 votes)
2 views75 pages

APTS StatComp2020

The document is a comprehensive guide on Statistical Computing, originally authored by Simon Wood and updated by Darren Wilkinson. It covers various topics including matrix computation, optimization, calculus by computer, and random number generation, providing detailed explanations and methodologies. The last update was made on December 16, 2020, and it includes online resources for further learning.

Uploaded by

anh nguyen
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views75 pages

APTS StatComp2020

The document is a comprehensive guide on Statistical Computing, originally authored by Simon Wood and updated by Darren Wilkinson. It covers various topics including matrix computation, optimization, calculus by computer, and random number generation, providing detailed explanations and methodologies. The last update was made on December 16, 2020, and it includes online resources for further learning.

Uploaded by

anh nguyen
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

APTS 2020/21

Statistical Computing
Notes originally by Simon Wood, University of Bath, 2012
[Link]

Updated by Darren Wilkinson, Newcastle University, 2018, 2019, 2020


[Link]

On-line resources: [Link]

Last updated: December 16, 2020


Contents

1 Introduction 4
1.1 Things can go wrong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Matrix Computation 9
2.1 Efficiency in matrix computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Flop counting in general . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 There is more to efficiency than flop counting . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Efficiency conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Some terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Cholesky decomposition: a matrix square root . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Eigen-decomposition (spectral-decomposition) . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Powers of matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Another matrix square root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 Matrix inversion, rank and condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.4 Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.5 Asymmetric eigen-decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Singular value decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 The QR decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Woodbury’s formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8 Further reading on matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Optimisation 23
3.1 Local versus global optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Optimisation methods are myopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Look at your objective first . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Objective function transects are a partial view . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Some optimisation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Taylor’s Theorem (and conditions for an optimum) . . . . . . . . . . . . . . . . . . . . 29
3.4.2 Steepest descent (AKA gradient descent) . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Newton’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.4 Quasi-Newton methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.5 The Nelder–Mead polytope method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.6 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Constrained optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 Modifying the objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.7 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Further reading on optimisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4 Calculus by computer 40
4.1 Cancellation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Approximate derivatives by finite differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.1 Other FD formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Automatic Differentiation: code that differentiates itself . . . . . . . . . . . . . . . . . . . . . . 42

2
CONTENTS 3

4.3.1 Reverse mode AD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


4.3.2 A caveat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.3 Using AD to improve FD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4.1 Quadrature rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4.2 Quadrature rules for multidimensional integrals . . . . . . . . . . . . . . . . . . . . . . 51
4.4.3 Approximating the integrand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.4 Monte-Carlo integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.5 Stratified Monte-Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.6 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.7 Laplace importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.8 The log–sum–exp trick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.9 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5 Further reading on computer integration and differentiation . . . . . . . . . . . . . . . . . . . 61

5 Random number generation 62


5.1 Simple generators and what can go wrong . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.2 Building better generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3 Uniform generation conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4 Other deviates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 Further reading on pseudorandom numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6 Other topics 69
6.1 An example: sparse matrix computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Hashing: another example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 1

Introduction

In combination, modern computer hardware, algorithms and numerical methods are extraordinarily impres-
sive. Take a moment to consider some simple examples, all using R.

1. Simulate a million random numbers and then sort them into ascending order.

x <- runif(1000000)
[Link](xs <- sort(x))

## user system elapsed


## 0.064 0.009 0.073

The sorting operation took 0.1 seconds on the laptop used to prepare these notes. If each of the
numbers in x had been written on its own playing card, the resulting stack of cards would have been
40 metres high. Imagine trying to sort x without computers and clever algorithms.

2. Here is a matrix calculation of comparatively modest dimension, by modern statistical standards.

[Link](2)
n <- 1000
A <- matrix(runif(n*n),n,n)
[Link](Ai <- solve(A)) # invert A

## user system elapsed


## 0.310 0.106 0.062

range(Ai%*%A-diag(n)) # check accuracy of result

## [1] -1.149747e-12 1.272094e-12

The matrix inversion took under half a second to perform, and the inverse satisfied its defining equa-
tions to around one part in 1011 . It involved around 2 billion (2 × 109 ) basic arithmetic calculations
(additions, subtractions multiplications or divisions). At a very optimistic 10 seconds per calculation
this would take 86 average working lives to perform without a computer (and that’s assuming we use
the same algorithm as the computer, rather than Cramer’s rule, and made no mistakes). Checking
the result by multiplying the A by the resulting inverse would take another 86 working lives.

3. The preceding two examples are problems in which smaller versions of the same problem can be
solved more or less exactly without computers and special algorithms or numerical methods. In these
cases, computers allow much more useful sizes of problem to be solved very quickly. Not all problems
are like this, and some problems can really only be solved by computer. For example, here is a scaled

4
CHAPTER 1. INTRODUCTION 5

version of a simple delay differential equation model of adult population size, n, in a laboratory blowfly
culture.
dn
= P TD n(t − 1)e−n(t−1) − δTD n(t).
dt
The dynamic behaviour of the solutions to this model is governed by parameter groups P TD and δTD ,
but the equation can not be solved analytically, from any interesting set of initial conditions. To see
what solutions actually look like we must use numerical methods.

[Link]("PBSddesolve") ## install package from CRAN

library(PBSddesolve)

bfly <- function(t,y,parms) {


## function defining rate of change of blowfly population, y, at
## time t, given y, t, lagged values of y form ‘pastvalue’ and
## the parameters in ‘parms’. See ‘ddesolve’ docs for more details.

## making names more useful ...


P <- parms$P;d <- parms$delta;y0 <- parms$yinit

if (t<=1) y1 <- P*y0*exp(-y0)-d*y else { ## initial phase


ylag <- pastvalue(t-1) ## getting lagged y
y1 <- P*ylag*exp(-ylag) - d*y ## the gradient
}
y1 ## return gradient
}

## set parameters...
parms <- list(P=150,delta=6,yinit=2.001)
## solve model...
yout <- dde(y=parms$yinit,times=seq(0,15,0.05),func=bfly,parms=parms)
## plot result...
plot(yout$t,yout$y,type="l",ylab="n",xlab="time")
0 2 4 6 8
n

0 5 10 15

time

Here the solution of the model (to an accuracy of 1 part in 108 ), took a quarter of a second.

1.1 Things can go wrong


The amazing combination of speed and accuracy apparent in many computer algorithms and numerical
methods can lead to a false sense of security and the wrong impression that in translating mathematical
and statistical ideas into computer code we can simply treat the numerical methods and algorithms we use
as infallible ‘black boxes’. This is not the case, as the following three simple examples should illustrate.
CHAPTER 1. INTRODUCTION 6

1. The [Link] function in R measures how long a particular operation takes to execute. First
generate two 3000 × 3000 matrices, A and B and a vector, y.

n <- 3000
A <- matrix(runif(n*n),n,n)
B <- matrix(runif(n*n),n,n)
y <- runif(n)

Now form the product ABy in two slightly different ways

[Link](f0 <- A%*%B%*%y)

## user system elapsed


## 2.713 0.168 0.377

[Link](f1 <- A%*%(B%*%y))

## user system elapsed


## 0.103 0.114 0.027

f0 and f1 are identical, to machine precision, so why is one calculation so much quicker than the
other? Clearly anyone wanting to compute with matrices had better know the answer to this.
Note that [Link] is a very crude method for benchmarking code. There are various R
packages that can facilitate benchmarking and profiling of R code. Here we illustrate use of the
rbenchmark package, which is available on CRAN.

library(rbenchmark)
benchmark(
"ABy" = {
f0 <- A%*%B%*%y
},
"A(By)" = {
f1 <- A%*%(B%*%y)
},
replications = 10,
columns = c("test", "replications", "elapsed",
"relative", "[Link]", "[Link]"))

## test replications elapsed relative [Link] [Link]


## 2 A(By) 10 0.299 1.000 1.527 0.841
## 1 ABy 10 4.076 13.632 29.478 2.583

2. It is not unusual to require derivatives of complicated functions, for example when estimating models.
Such derivatives can be very tedious or difficult to calculate directly, but can be approximated quite
easily by ‘finite differencing’. i.e. if f is a sufficiently smooth function of x then
∂f f (x + ∆) − f (x)
'
∂x ∆
where ∆ is a small interval. Taylor’s theorem implies that the right hand side tends to the left hand
side, above, as ∆ → 0. It is easy to try this out on an example where the answer is known, so consider
differentiating sin(x) w.r.t. x.
CHAPTER 1. INTRODUCTION 7

x <- seq(1,2*pi,length=1000)
delta <- 1e-5 # FD interval
[Link] <- (sin(x+delta)-sin(x))/delta # FD derivative
error <- [Link]-cos(x) # error in FD derivative

The following plots the results. On the LHS sin(x) is dashed, it’s true derivative cos(x) is dotted and
the FD derivative is continuous. The RHS shows the difference between the FD approximation and
the truth.
1.0

2e−06
0.5
sin(x)

error
0.0

−4e−06
−1.0

1 2 3 4 5 6 1 2 3 4 5 6
x x

Note that the error appears to be of about the same magnitude as the finite difference interval ∆.
Perhaps we could reduce the error to around the machine precision by using a much smaller ∆. Here
is the equivalent of the previous plot, when
delta <- 1e-15
0.2
1.0

0.1
0.5
sin(x)

error
0.0
0.0
−1.0

−0.2

1 2 3 4 5 6 1 2 3 4 5 6
x x

Clearly there is something wrong with the idea that decreasing ∆ must always increase accuracy
here.
3. These data
0.5
0.4
0.3
0.2
y
0.1
0.0
−0.1

100.0 100.2 100.4 100.6 100.8 101.0


x
CHAPTER 1. INTRODUCTION 8

were generated with this code. . .

[Link](1); n <- 100


xx <- sort(runif(n))
y <- .2*(xx-.5)+(xx-.5)^2 + rnorm(n)*.1
x <- xx+100

An appropriate and innocuous linear model for the data is yi = β0 + β1 xi + β2 x2i + i , where the i
are independent, zero mean constant variance r.v.s and the βj are parameters. This can be fitted in
R with

b <- lm(y~x+I(x^2))

which results in the following fitted curve.


0.5
0.4
0.3
0.2
y
0.1
0.0
−0.1

100.0 100.2 100.4 100.6 100.8 101.0


x

From basic statistical modelling courses, recall that the least squares parameter estimates for a linear
model are given by β̂ = (XT X)−1 XT y where X is the model matrix. An alternative to use of lm would
be to use this expression for β̂ directly.

X <- [Link](b) # get same model matrix used above


[Link] <- solve(t(X)%*%X,t(X)%*%y) # direct solution of ‘normal equations’

## Error in [Link](t(X) %*% X, t(X) %*% y): system is computationally


singular: reciprocal condition number = 3.98647e-19

Something has gone badly wrong here, and this is a rather straightforward model. It’s important to
understand what has happened before trying to write code to apply more sophisticated approaches
to more complicated models.
In case you are wondering, I can make lm fail to. Mathematically it is easy to show that the model
fitted values should be unchanged if I simply add 900 to the x values, but if I refit the model to such
shifted data using lm, then the following fitted values result, which can not be correct.
0.5
0.4
0.3
0.2
y
0.1
0.0
−0.1

1000.0 1000.2 1000.4 1000.6 1000.8 1001.0


x
Chapter 2

Matrix Computation

Matrices are ubiquitous in statistics. But most statistical models and methods are only useful once turned
into computer code, and this often means computing with matrices. It is easy to ruin a beautiful piece of
theory by poor use of numerical linear algebra. This section covers some basic numerical linear algebra,
including the important topics of efficiency and stability.

2.1 Efficiency in matrix computation


Recall the simple example

n <- 3000
A <- matrix(runif(n*n),n,n)
B <- matrix(runif(n*n),n,n)
y <- runif(n)
[Link](f0 <- A%*%B%*%y) # case 1

## user system elapsed


## 2.942 0.216 0.415

[Link](f1 <- A%*%(B%*%y)) # case 2

## user system elapsed


## 0.168 0.077 0.032

What happened here? The answer is all to do with how the multiplications were ordered in the two
cases, and the number of floating point operations (flop) required by the two alternative orderings.
1. In the first case AB was formed first, and the resulting matrix used to pre-multiply the vector y.

2. In the second case, the vector By was formed first, and was then pre-multiplied by A.
Now we need to count the floating point operations. Try a couple of simple examples first
1. How many flop (+, -, *, /) are involved in the multiplication
  
b11 b12 b13 y1
 b21 b22 b23   y2 ?
b31 b32 b33 y3

2. How many operations are involved in the multiplication


  
a11 a12 a13 b11 b12 b13
 a21 a22 a23   b21 b22 b23 ?
a31 a32 a33 b31 b32 b33

9
CHAPTER 2. MATRIX COMPUTATION 10

Generalising

1. In terms of n, what is the flop cost of forming AB?

2. What is the cost of forming By

Now write down the flop cost of the two ways of forming ABy

1.

2.

Another simple matrix operation is the evaluation of the trace of a matrix product. Again different ways of
computing the same quantity lead to radically different computation times. Consider the computation of
tr (AB) where A is 5000 × 1000 and B is 1000 × 5000.

n <- 5000; m <- 1000


A <- matrix(runif(n*m), n, m)
B <- matrix(runif(n*m), m, n)
benchmark(
"sdAB" = { sum(diag(A %*% B)) },
"sdBA" = { sum(diag(B %*% A)) },
"sABt" = { sum(A*t(B)) },
replications = 10, order=NULL,
columns = c("test", "replications", "elapsed",
"relative", "[Link]", "[Link]"))

## test replications elapsed relative [Link] [Link]


## 1 sdAB 10 6.019 9.317 41.795 5.491
## 2 sdBA 10 1.518 2.350 10.514 1.438
## 3 sABt 10 0.646 1.000 0.814 0.314

1. The first method forms AB, at a flop cost of 2n2 m and then extracts the leading diagonal and sums
it.

2. The second method uses the fact that tr (AB) = tr (BA). It forms BA at flop cost of 2nm2 and then
extracts the leading diagonal of the result and sums it.
P
3. The third method makes direct use of the fact that tr (AB) = ij Aij Bji , at a cost of 2nm. It is fastest
(in principle) as no effort is wasted in calculating off diagonal elements of the target matrix (but here
does come at the cost of computing a transpose).

Notice that method 1 is not just wasteful of flops, it also requires storage of an n × n matrix, which is much
larger than either A or B.

2.1.1 Flop counting in general


Clearly it is important to count flops, but it is also tedious to do a complete count for complex algorithms. In
many cases an exact count is not necessary. Instead we simply need to know how the computation scales
with problem size, as it becomes large. For this reason the leading order operation count is all that is usually
reported. For example the naive computation of tr (AB), above, required at least n2 m + n2 (m − 1) + n − 1
operations. As n and m tend to infinity this is dominated by a term proportional to n2 m, so we write that the
computational cost is O(n2 m) (‘order n2 m’).
Note also that flops are not always defined in the same way. For some authors one flop is one floating
point addition, subtraction, division or multiplication, but for others one flop is one multiplication/division
with one addition/subtraction.
CHAPTER 2. MATRIX COMPUTATION 11

2.1.2 There is more to efficiency than flop counting


Flop count is clearly important, but it is not the only aspect of efficiency. With current computing technology
(driven by the computer games market) a floating point operation often takes the same amount of time
as an operation on an integer or a memory address. Hence how matrices are stored and addressed in
memory is often as important as the flop count in determining efficiency.
As an example consider the multiplication of the elements of two matrices as part of a matrix multipli-
cation performed in C. The code
C[i][j] += A[i][k]*B[k][j]
requires at least 6 address/integer calculations in order to figure out where the relevant elements of A, B
and C are located, before the 2 floating point operations can be performed. To avoid this, efficient code
stores pointers which keep a record of the memory locations of the ‘current’ relevant elements of A, B and
C. The multiplication loops are then structured so as to minimise the work needed to update such pointers,
for each new element multiplication.
To further complicate matters, computer processors often contain features that can be exploited to in-
crease speed. For example, fetching data from the processor memory cache is much quicker than fetching
data from general memory (not to be confused with the yet slower disk). An algorithm structured to break
a problem into chunks that fit entirely into cache can therefore be very quick. Similarly, some processors
can execute multiple floating point operations simultaneously, but careful coding is required to use such a
feature.
To avoid having to re-write whole matrix algebra libraries to maximise speed on every new processor
that comes along, standard numerical linear algebra libraries, such as LAPACK (used by R), adopt a highly
structured design. Wherever possible, the high level routines in the library (e.g. for finding eigenvalues),
are written so that most of the flop intensive work is performed by calling a relatively small set of routines
for comparatively simple basic matrix operations (e.g. multiplying a matrix by a vector), known as the Basic
Linear Algebra Subprograms (BLAS). The BLAS can then be tuned to make use of particular features of
particular computers, without the library that depends on it needing to change. This provides one way of
exploiting multi-core computers. Multi-threaded BLAS can gain efficiency by using multiple CPU cores,
without any of the higher level code having to change.
Whatever language or library you are using for statistical computing, you should be using native BLAS
and LAPACK libraries “under–the–hood” for your matrix computations. But it is important to be aware that
not all BLAS and LAPACK libraries are the same, and “default” versions shipping with Linux distributions
and similar operating systems are not always the best. You can very often swap out a default BLAS library
and replace it with an optimised BLAS for your system without changing or recompiling any code. This can
easily make a factor of 2 or more difference to the running time of codes for which matrix computation is a
bottleneck, so it is well worth investigating how to do this for an easy way of speeding up your code.
On Linux, you can check which LAPACK and BLAS are being used by R using

La_library()

## [1] "/usr/lib/x86_64-linux-gnu/openblas-pthread/[Link].3"

extSoftVersion()["BLAS"]

## BLAS
## "/usr/lib/x86_64-linux-gnu/openblas-pthread/[Link].3"

but these commands don’t work on Windows.

2.1.3 Efficiency conclusions


In conclusion:
• Think about the flop count when computing with matrices (or computing in general).
CHAPTER 2. MATRIX COMPUTATION 12

• If you are writing compiled code for manipulating matrices (or other arrays), then be aware of memory
management efficiency as well as flops.

• For optimal performance try and use libraries such as LAPACK, with a fast BLAS, where-ever possible.

2.2 Some terminology


Now consider some more interesting things to do with matrices. First recall the following basic terminology.

   
x11 x12 x13 x11 x12 x13
 x21 x22 x23   x12 x22 x23  yT Xy > 0 ∀ y 6= 0
yT Xy ≥ 0 ∀ y 6= 0
x31 x32 x33 x13 x23 x33
square symmetric positive (semi−)definite
     
x11 0 0 x11 x12 x13 x11 0 0
 x21 x22 0   0 x22 x23   0 x22 0 
x31 x32 x33 0 0 x33 0 0 x33
lower triangular upper triangular diagonal
Q is an orthogonal matrix iff QT Q = QQT = I.

2.3 Cholesky decomposition: a matrix square root


Positive definite matrices are the ‘positive real numbers’ of matrix algebra. They have particular computa-
tional advantages and occur frequently in statistics, since covariance matrices are usually positive definite
(and always positive semi-definite). So let’s start with positive definite matrices, and their matrix square
roots. To see why matrix square roots might be useful, consider the following.
Example: Generating multivariate normal random variables. There exist very quick and reliable methods
for simulating i.i.d. N (0, 1) random deviates, but suppose that N (µ, Σ) random vectors are required?
Clearly we can generate vectors z from N (0, I). If we could find a square root matrix R such that RT R = Σ
then
y ≡ RT z + µ ∼ N (µ, Σ),
since the covariance matrix of y is RT IR = RT R = Σ and E(y) = E(RT z + µ) = µ.
In general the square root of a positive definite matrix is not uniquely defined, but there is a unique
upper triangular square root of any positive definite matrix: its Cholesky factor. The algorithm for finding
the Cholesky factor is easily derived. Consider a 4 × 4 example first. The defining matrix equation is
    
R11 0 0 0 R11 R12 R13 R14 A11 A12 A13 A14
 R12 R22 0 0   0 R22 R23 R24  =  A12 A22 A23 A24 
   

 R13 R23 R33 0   0 0 R33 R34   A13 A23 A33 A34 
R14 R24 R34 R44 0 0 0 R44 A14 A24 A34 A44

If the component equations of this expression are written out and solved in the right order, then each
CHAPTER 2. MATRIX COMPUTATION 13

contains only one unknown, as the following illustrates (unknown in grey)


A11 = R11 2
A12 = R11 R12
A13 = R11 R13
A14 = R11 R14
2
A22 = R12 + R22 2
A23 = R12 R13 + R22 R23
.
.
Generalising to the n × n case, and using the convention that 0k=1 xi ≡ 0, we have
P
v
u i−1 Pi−1
u X
2 Aij − k=1 Rki Rkj
Rii = Aii −
t Rki , and Rij = , j > i.
Rii
k=1

Working through these equations in row order, from row one, and starting each row from its leading diagonal
component ensures that all r.h.s. quantities are known at each step. Cholesky decomposition requires n3 /3
flops and n square roots. In R it is performed by function chol, which calls routines in LAPACK or LINPACK.
Example (continued): The following simulates a random draw from
   
1 2 −1 1
N  −1  ,  −1 2 −1 
3 1 −1 2

V <- matrix(c(2,-1,1,-1,2,-1,1,-1,2),3,3)
mu <- c(1,-1,3)

R <- chol(V) ## Cholesky factor of V


z <- rnorm(3) ## iid N(0,1) deviates
y <- t(R)%*%z + mu ## N(mu,V) deviates
y

## [,1]
## [1,] 1.621485
## [2,] -2.069023
## [3,] 3.939876

The following simulates 1000 such vectors, and checks their observed mean and covariance.

Z <- matrix(rnorm(3000),3,1000) ## 1000 N(0,I) 3-vectors


Y <- t(R)%*%Z + mu ## 1000 N(mu,V) vectors
## and check that they behave as expected...
rowMeans(Y) ## observed mu

## [1] 1.033903 -1.017923 2.970056

(Y-mu)%*%t(Y-mu)/1000 ## observed V

## [,1] [,2] [,3]


## [1,] 2.0160784 -1.008641 0.9942745
## [2,] -1.0086408 1.970159 -1.0416559
## [3,] 0.9942745 -1.041656 1.9563587
CHAPTER 2. MATRIX COMPUTATION 14

As a second application of the Cholesky decomposition, consider evaluating the log likelihood of µ and
Σ
n 1 1
l = − log(2π) − log(|Σ|) − (y − µ)T Σ−1 (y − µ).
2 2 2
If we were simply to invert Σ to evaluate the final term, it would cost 2n3 flops, and we would still need
to evaluate the determinant. A Cholesky based approach is much better. It is easy to see that Σ−1 =
R−1 R−T , where R is the Cholesky factor of Σ. So the final term in the log likelihood can be written as zT z
where z = R−T (y−µ). Notice that we don’t actually need to evaluate R−T , but simply to solve RT z = y−µ
for z. To see how this is done, consider a 4 × 4 case again:
    
R11 0 0 0 z1 y1 − µ 1
  z2   y2 − µ 2 
 R12 R22 0 0     
 R13 R23 R33 0   z3  =  y3 − µ3 

R14 R24 R34 R44 z4 y4 − µ 4


If this system of equations is solved from the top down, then there is only one unknown (shown in grey) at
each stage:
R11 z1 = y1 − µ1
R12 z1 + R22 z2 = y2 − µ2
R13 z1 + R23 z2 + R33 z3 = y3 − µ3
R14 z1 + R24 z2 + R34 z3 + R44 z4 = y4 − µ4
The generalisation of this forward substitution process to n dimensions is obvious, as is the fact that it
cost O(n2 ) flops — much cheaper than explicit formation of R−T , which would involve applying forward
substitution to find each column of the unknown R−T in the equation RT R−T = I, at O(n3 ) cost.
In R there is a routine forwardsolve for doing forward substitution with a lower triangular matrix
(and a routine backsolve for performing the equivalent back substitution with upper triangular matrices).
Before using it, we still need to consider |Σ|. Again the Cholesky factor helps. From general
Qn properties of
T T
determinants we know that |R ||R| = |Σ|, but because R is triangular |R | = |R| = i=1 Rii . So given
the Cholesky factor the calculation of the determinant is O(n).
Example: The following evaluates the log likelihood of the covariance matrix V and mean vector mu,
from the previous example, given an observed yT = (1, 2, 3).

y <- 1:3
z <- forwardsolve(t(R),y-mu)
logLik <- -length(y)*log(2*pi)/2 -
sum(log(diag(R))) - sum(z*z)/2
logLik

## [1] -6.824963

Note that Cholesky decomposition of a matrix that is not positive definite will fail. Even positive semi-
definite will not work, since in that case a leading diagonal element of the Cholesky factor will become zero,
so that computation of the off diagonal elements on the same row is impossible. Positive semi-definite
matrices are reasonably common, so this is a practical problem. For the positive semi-definite case, it is
possible to modify the Cholesky decomposition by pivoting: that is by re-ordering the rows/columns of the
original matrix so that the zeroes end up at the end of the leading diagonal of the Cholesky factor, in rows
that are all zero. This will not be pursued further here. Rather we will consider a more general matrix
decomposition, which provides matrix square roots along with much else.

2.4 Eigen-decomposition (spectral-decomposition)


Any symmetric matrix, A can be written as
A = UΛUT (2.1)
CHAPTER 2. MATRIX COMPUTATION 15

where the matrix U is orthogonal and Λ is a diagonal matrix, with ith leading diagonal element λi (conven-
tionally λi ≥ λi+1 ). Post-multiplying both sides of the decomposition by U we have

AU = UΛ.

Considering this system one column at a time and writing ui for the ith column of U we have:

Aui = λi ui .

i.e. the λi are the eigenvalues of A, and the columns of U are the corresponding eigenvectors. (2.1) is the
eigen-decomposition or spectral decomposition of A.
We will not go over the computation of the eigen-decomposition in detail, but it is not done using the
determinant and characteristic equation of A. One practical scheme is as follows: (i) the matrix A is first
reduced to tri-diagonal form using repeated pre and post multiplication by simple rank-one orthogonal ma-
trices called Householder rotations, (ii) an iterative scheme called QR-iteration then pre-and post-multiplies
the tri-diagonal matrix by even simpler orthogonal matrices, in order to reduce it to diagonal form. At this
point the diagonal matrix contains the eigenvalues, and the product of all the orthogonal matrices gives U.
Eigen-decomposition is O(n3 ), but a good symmetric eigen routine is around 10 times as computationally
costly as a Cholesky routine.
An immediate use of the eigen-decomposition is to provide an alternative characterisation of positive
(semi-) definite matrices. All the eigenvalues of a positive (semi-)definite matrix must be positive (non-
negative) and real. This is easy to see. Were some eigenvalue, λi to be negative (zero) then the corre-
sponding eigenvector ui would result in uT i Aui being negative (zero). At the same time the existence of
an x such that xT Ax is negative (zero) leads to a contradiction unless at least one eigenvalue is negative
(zero)* .

2.4.1 Powers of matrices


Consider raising A to the power m.

Am = AAA · · · A = UΛUT UΛUT · · · UΛUT = UΛΛ · · · ΛUT = UΛm UT

where Λm is just the diagonal matrix with λm th


i as the i leading diagonal element. This suggests that any
real valued function, f , of a real valued argument, which has a power series representation, has a natural
generalisation to a symmetric matrix valued function of a symmetric matrix argument, i.e.

f 0 (A) ≡ Uf 0 (Λ)UT

where f 0 (Λ) denotes the diagonal matrix with ith leading diagonal element f (λi ). E.g. exp(A) = U exp(Λ)UT .

2.4.2 Another matrix square root


For matrices with non-negative
√ √ eigenvalues we can generalise
√ √ to non-integer powers. √ For example it is
readily verified that A = U ΛUT has the property that A A = A. Notice (i) that A is not the same
as the Cholesky √ factor, emphasising the non-uniqueness of matrix square roots, and (ii) that, unlike the
Cholesky factor, A is well defined for positive semi-definite matrices (and can therefore be applied to any
covariance matrix). Also, the symmetry of this square root can sometimes be convenient.

2.4.3 Matrix inversion, rank and condition


Continuing in the same vein we can write

A−1 = UΛ−1 UT
can write x = Ub for some vector b. So xT Ax < 0 ⇒ bT Λb < 0 ⇒ b2i Λi < 0 ⇒ Λi < 0 for some i.
* We
P
CHAPTER 2. MATRIX COMPUTATION 16

where the diagonal matrix Λ−1 has ith leading diagonal element λ−1 i . Clearly we have a problem if any of
the λi are zero, for the matrix inverse will be undefined. A matrix with no zero eigen-values is termed full
rank. A matrix with any zero eigenvalues is rank deficient and does not have an inverse. The number of
non-zero eigenvalues is the rank of a matrix.
For some purposes it is sufficient to define a generalised inverse or pseudo-inverse when faced with
rank deficiency, by finding the reciprocal of the non-zero eigenvalues, but setting the reciprocal of the zero
eigenvalues to zero. We will not pursue this here.
It is important to understand the consequences of rank deficiency quite well when performing matrix
operations involving matrix inversion/matrix equation solving. This is because near rank deficiency is rather
easy to achieve by accident, and in finite precision arithmetic is as bad as rank deficiency. First consider
trying to solve
Ax = y
for x when A is rank deficient. In terms of the eigen-decomposition the solution is
x = UΛ−1 UT y
i.e. y is rotated to become y0 = UT y, the elements of y0 are then divided by the eigenvalues, λi , and the
reverse rotation is applied to the result. The problem is that yi0 /λi is not defined if λi = 0. This is just a
different way of showing something that you already know: rank deficient matrices can not be inverted. But
the approach also helps in the understanding of near rank deficiency and ill conditioning.
An illustrative example highlights the problem. Suppose that n × n symmetric matrix A has n − 1 distinct
eigenvalues ranging from 0.5 to 1, and one much smaller magnitude eigenvalue . Further suppose that
we wish to compute with A, on a machine that can represent real numbers to an accuracy of 1 part in −1 .
Now consider solving the system
Ax = u1 (2.2)
for x, where u1 is the dominant eigenvector of A. Clearly the correct solution is x = u1 , but now consider
computing the answer. As before we have a formal solution
x = UΛ−1 UT u1 ,
but although analytically u01 = UT u1 = (1, 0, 0, . . . , 0)T , the best we can hope for computationally is to get
u01 = (1 + e1 , e2 , e3 , . . . , en )T where the number ej are of the order of ±. For convenience, suppose that
en = . Then, approximately, Λ−1 u01 = (1, 0, 0, . . . , 0, 1)T , and x = UΛ−1 u01 = u1 + un , which is not correct.
Similar distortions would occur if we used any of the other first n − 1 eigenvectors in place of u1 : they all
become distorted by a spurious un component, with only un itself escaping.
Now consider an arbitrary P vector y on the r.h.s. of (2.2). We can always write it as some weighted sum
of the eigenvectors y = wi ui . This emphasises how bad the ill-conditioning problem is: all but one of
ys components are seriously distorted when multiplied by A−1 . By contrast multiplication by A itself would
lead only to distortion of the un component of y, and not the other eigen-vectors, but the un component is
the component that is so heavily shrunk by multiplication by A that it makes almost no contribution to the
result, unless we have the misfortune to choose a y that is proportional to un and nothing else.
A careful examination of the preceding argument reveals that what really matters in determining the
seriousness of the consequences of near rank deficiency is the ratio of the largest magnitude to the smallest
magnitude eigenvalues, i.e.
κ = max |λi |/ min |λi |.
This quantity is a condition number for A. Roughly speaking it is the factor by which errors in y will be
multiplied when solving Ax = y for x. Once κ begins to approach the reciprocal of the machine precision
we’re in trouble† . A system with a large condition number is referred to as ill-conditioned. Orthogonal
matrices have κ = 1, which is why numerical analysts like them so much.
Example. We are now in a position to understand what happened when we tried to use the ‘normal
equations’ to fit the simple quadratic model in the introduction. Let’s explicitly calculate the condition number
of XT X.

Because the condition number is so important in numerical computation, there are several methods for getting an approximate condition
number more cheaply than via eigen decomposition — e.g. see ?kappa in R.
CHAPTER 2. MATRIX COMPUTATION 17

[Link](1)
xx <- sort(runif(100))
x <- xx + 100 # regenerated same x data
X <- [Link](~x + I(x^2)) # and the model matrix
XtX <- crossprod(X) # form t(X)%*%X (with half the flop count)
lambda <- eigen(XtX)$values
lambda[1]/lambda[3] # the condition number of X’X

## [1] 2.355538e+18

Of course this raises a couple of obvious questions. Could we have diagnosed the problem directly
from X? And how does the lm function avoid this problem? Answers soon, but first consider a trick for
reducing κ.

2.4.4 Preconditioning
The discussion of condition numbers given above was for systems involving unstructured matrices (albeit
presented only in the context of symmetric matrices). Systems involving matrices with special structure
are sometimes less susceptible to ill-conditioning than naive computation of the condition number would
suggest. For example if D is a diagonal matrix, then we can accurately solve Dy = x for y, however large
κ(D) is: overflow or underflow are the only limits.
This basic fact can sometimes be exploited to re-scale a problem to improve computational stability. As
an example consider diagonal preconditioning of the computation of XT X, above. For the original XT X we
have

solve(XtX)

## Error in [Link](XtX): system is computationally singular: reciprocal


condition number = 3.98647e-19
p
But now suppose that we create a diagonal matrix D, with elements Dii = 1/ (XT X)ii . Clearly

(XT X)−1 = D(DXT XD)−1 D,

but (DXT XD)−1 turns out to have a much lower condition number than XT X. . .

D <- diag(1/diag(XtX)^.5)
DXXD <- D%*%XtX%*%D
lambda <- eigen(DXXD)$values
lambda[1]/lambda[3]

## [1] 429397494367

As a result we can now compute the inverse of XT X. . .

XtXi <- D%*%solve(DXXD,D) ## computable inverse of X’X


XtXi %*% XtX ## how accurate?

## (Intercept) x I(x^2)
## [1,] 9.999640e-01 -3.262288e-03 -0.352586841
## [2,] 2.920875e-07 1.000032e+00 0.002340679
## [3,] -1.151890e-09 -1.143870e-07 0.999992137

. . . not perfect, but better than no answer at all.


CHAPTER 2. MATRIX COMPUTATION 18

2.4.5 Asymmetric eigen-decomposition


If positive definite matrices are the positive reals of the square matrix system, and symmetric matrices are
the reals, then asymmetric matrices are the complex numbers. As such they have complex eigenvectors
and eigenvalues. It becomes necessary to distinguish right and left eigenvectors (one is no longer the
transpose of the other), and the right and left eigenvector matrices are no longer orthogonal matrices
(although they are still inverses of each other). Eigen-decomposition of asymmetric matrices is still O(n3 ),
but is substantially more expensive than the symmetric case. Using R on the laptop used to prepare
these notes, eigen-decomposition of a symmetric 1000 × 1000 matrix took 0.5 seconds. The asymmetric
equivalent took 3 seconds.
The need to compute with complex numbers somewhat reduces the practical utility of the eigen-
decomposition in numerical methods for statistics. It would be better to have a decomposition that provides
some of the useful properties of the eigen-decomposition without the inconvenience of complex numbers.
The singular value decomposition meets this need.

2.5 Singular value decomposition


The singular values of r × c matrix, A (r ≥ c) are the positive square roots of the eigenvalues of AT A.
If A is positive semi-definite then its singular values are just its eigenvalues, of course. For symmetric
matrices, eigenvalues and singular values differ only in sign, if at all. However the singular values are also
well defined, and real, for matrices that are not even square, let alone symmetric.
Related to the singular values is the singular value decomposition
A = UDVT
where U has orthogonal columns and is the same dimension as A, while D is a c × c diagonal matrix of
the singular values (usually arranged in descending order), and V is a c × c orthogonal matrix.
The singular value decomposition is computed using a similar approach to that used for the symmetric
eigen problem: orthogonal bi-diagonalization, followed by QR iteration at O(rc2 ) cost (it does not involve
forming AT A) . It is more costly than symmetric eigen-decomposition, but cheaper than the asymmetric
equivalent. Example R timings for a 1000 × 1000 matrix are 0.5 seconds for symmetric eigen, 1 seconds for
SVD and 3 seconds for asymmetric eigen.

n = 1000
XA = matrix(rnorm(n*n),ncol=n)
XXt = XA %*% t(XA)
[Link](chol(XXt))

## user system elapsed


## 0.039 0.037 0.010

[Link](eigen(XXt, symmetric=TRUE))

## user system elapsed


## 1.607 0.787 0.313

[Link](eigen(XA))

## user system elapsed


## 14.297 9.237 2.980

[Link](svd(XA))

## user system elapsed


## 3.343 1.752 0.679
CHAPTER 2. MATRIX COMPUTATION 19

[Link](qr(XA))

## user system elapsed


## 0.454 0.173 0.251

The number of its non-zero singular values gives the rank of a matrix, and the SVD is the most reliable
method for numerical rank determination (by examination of the size of the singular values relative to the
largest singular value). In a similar vein, a general definition of condition number is the ratio of largest and
smallest singular values: κ = d1 /dc .
Example Continuing the simple regression disaster from the introduction, consider the singular values
of X

d <- svd(X)$d # get the singular values of X


d

## [1] 1.010455e+05 2.662169e+00 6.474081e-05

Clearly, numerically X is close to being rank 2, rather than rank 3. Turning to the condition number. . .

d[1]/d[3]

## [1] 1560769713

κ ≈ 2 × 109 is rather large, especially since it is easy to show that the condition number of XT X must
be the square of this. So we now have a pretty clear diagnosis of the cause of the original problem.
In fact the SVD not only provides a diagnosis of the problem, but one possible solution. We can re-write
the solution to the normal equations in terms of the SVD of X.

(XT X)−1 XT y = (VDUT UDVT )−1 VDUT y


= (VD2 VT )−1 VDUT y
= VD−2 VT VDUT y
= VD−1 UT y

Notice two things. . .

1. The condition number of the system that we have ended up with is exactly the condition number of
X, i.e. the square root of the condition number involved in the direct solution of the normal equations.

2. Comparing the final RHS expression to the representation of an inverse in terms of its eigen-decomposition,
it is clear that VD−1 UT is a sort of pseudo-inverse X.

The SVD has many uses. One interesting one is low rank approximation of matrices. In a well defined
sense, the best rank k ≤ rank(X) approximation to a matrix X can be expressed in terms of the SVD of X
as
X̃ = UD̃VT
where D̃ is D with all but the k largest singular values set to 0. Using this result to find low rank ap-
proximations to observed covariance matrices is the basis for several dimension reduction techniques in
multivariate statistics (although, of course, a symmetric eigen decomposition is then equivalent to SVD).
One issue with this sort of approximation is that the full SVD is computed, despite the fact that part of it is
then to be discarded (don’t be fooled by routines that ask you how many eigen or singular vectors to return
— I saved .1 of a second, out of 13 by getting R routine svd to only return the first columns of U and V).
Look up Lanczos methods and Krylov subspaces for approaches that avoid this sort of waste.
CHAPTER 2. MATRIX COMPUTATION 20

2.6 The QR decomposition


The SVD provided a stable solution to the linear model fitting example, but at a rather high computational
cost, prompting the question of whether similar stability could be obtained without the full cost of SVD? The
QR decomposition provides a positive answer. We can write any r × c rectangular matrix X (r ≥ c) as the
product of columns of an orthogonal matrix and an upper triangular matrix.

X = QR

where R is upper triangular and Q is of the same dimension as X, with orthogonal columns (so QT Q = I,
but QQT 6= I). The QR decomposition has a cost of O(rc2 ), but it is about 1/3 of the cost of SVD.
Consider the linear modelling example again

(XT X)−1 XT y = (RT QT QR)−1 RT QT y


= (RT R)−1 RT QT y
= R−1 R−T RT QT y
= R−1 QT y

Now the singular values of R are the same as the singular values of X, so this system has the same
condition number as X. i.e. the QR method provides the stability of SVD without the computational cost
(although it still has about twice the cost of solving the normal equations via Cholesky decomposition).
In fact the QR method for solving least squares estimates can be derived from first principles, without
using the normal equations, in which case it also provides a very compact way of deriving much of the
distributional theory required for linear modelling. R routine lm uses the QR approach, which is why, in
the Introduction, it was able to deal with the case for which the normal equations failed. Of course it is not
magic: the subsequent failure even of lm arose because I had caused the model matrix condition number
to become so bad that even the QR approach fails (I should have centred x, by subtracting its mean, or
used orthogonal polynomials, via poly).
It is also worth noting the connection between the QR and Cholesky decompositions. If X = QR, then
X X = RT R, where R is upper triangular, so the QR decomposition of X gives the Cholesky factorisation
T

of XT X “for free”. Note that most QR routines don’t bother ensuring that the diagonal elements of R are
positive. If the positivity of the diagonal elements is an issue (it rarely matters), it can easily be fixed by
flipping the signs of the relevant rows of R. This relationship between the QR and Cholesky can be used to
directly construct the Cholesky decomposition of a covariance matrix in a stable way, without ever directly
constructing the covariance matrix.
Another application of the QR decomposition is determinant calculation. If A is square and A = QR
then Y
|A| = |Q||R| = |R| = Rii
i

since R is triangular, while Q is orthogonal with determinant 1. Usually we need


X
log |A| = log |Rii |
i

which underflows to zero much less easily than |A|.

2.7 Woodbury’s formula


There are a large number of matrix identities that are useful for simplifying computations, and the “matrix
cookbook” is an excellent source of these. However, there is a set of identities relating to the updating of
the inverse of a large matrix that is worthy of particular mention. There are several different variants, and
each goes by different names. But names such as the Woodbury identity, Sherman-Morrison-Woodbury
CHAPTER 2. MATRIX COMPUTATION 21

formula, Sherman-Morrison formula, etc., are all common names for various versions and special cases.
Here I’ll just give the most commonly encountered variants.
The context is a (large) m × m matrix, A, and its inverse, A−1 . We suppose that A−1 is already known,
but that now A is subject to a low-rank update of the form A + UVT , for U, V both m × n with n << m.
The problem is how to compute the updated inverse (A + UVT )−1 in an efficient way. Woodbury’s formula
gives the answer.
(A + UVT )−1 = A−1 − A−1 U(In + VT A−1 U)−1 VT A−1
Note that crucially, the inverse on the LHS is m × m, whereas the new inverse required on the RHS is n × n.
This is what makes the result so useful.
The result is easy to prove by direct multiplication. First, pre-multiply the RHS by A + UVT , multiply
out, and simplify, to get Im . This tells us that the RHS is a right-inverse. Next post-multiply the RHS by
A + UVT , multiply out, simplify to get Im , thus confirming that the RHS is both a left- and right-inverse,
and hence the inverse.
The form given above is sufficiently general for most applications in statistical computing, but it is worth
noting a modest generalisation, often referred to as the Sherman-Morrison-Woodbury formula, which intro-
duces an additional n × n matrix, C.

(A + UCVT )−1 = A−1 − A−1 U(C−1 + VT A−1 U)−1 VT A−1

This can be proved by direct multiplication, as before. There is also a very commonly encountered special
case, where U and V are just m × 1 vectors, u, v. In that case the identity is often referred to as the
Sherman-Morrison formula, and reduces to

A−1 uvT A−1


(A + uvT )−1 = A−1 − .
1 + vT A−1 u
Writing it this way makes it clear that no new matrix inversion is required to perform the update.
There are numerous applications of these identities in statistical computing, but updating a precision
matrix based on a low-rank update of a covariance matrix is a typical example.

2.8 Further reading on matrices


These notes can only scratch the surface of numerical linear algebra. Hopefully you now have a clearer idea
of the importance of stability and efficiency in matrix computation, and some understanding of a few key
decompositions and some of their applications. My hope is that when writing any code involving matrices,
you’ll at least think about flops and condition numbers. If you want to find out more, here are some starting
points. See section 6.3 as well.

• Golub, GH and CF Van Loan (1996) Matrix Computations 3rd ed. John Hopkins. This is the reference
for numerical linear algebra, but not the easiest introduction. It is invaluable if you are writing carefully
optimised matrix computations, directly calling BLAS and LAPACK functions.

• Watkins, DS (2002) Fundamentals of Matrix Computations 2nd ed. Wiley. This is a more accessible
book, but still aimed at those who really want to understand the material (the author encourages you
to write your own SVD routine, and with this book you can).

• Press WH, SA Teukolsky, WT Vetterling and BP Flannery (2007) Numerical Recipes (3rd ed), Cam-
bridge. Very easily digested explanation of how various linear algebra routines work, with code. But
there are better versions of most of the algorithms in this book.

• The GNU scientific library is at [Link] and is a better bet than


Numerical Recipes for actual code.

• For state of the art numerical linear algebra use LAPACK: [Link]
CHAPTER 2. MATRIX COMPUTATION 22

• Matlab offers a convenient, but expensive, platform for matrix computation. For an introduction see
e.g. Pratap, R (2006) Getting Started with Matlab, Oxford.

• Octave is Matlab like free software: [Link]

• Julia: [Link] is another dynamically typed scripting language aimed at scientific


computing.

• Monahan, JF (2001) Numerical Methods of Statistics provides a useful treatment of numerical linear
algebra for statistics.

• Harville, DA (1997) Matrix Algebra From a Statistician’s Perspective, Springer, provides a wealth of
helpful theoretical material.
Chapter 3

Optimisation

Many problems in statistics can be characterised as solving

minf (θ) (3.1)


θ

where f is a real scalar valued function of vector θ, usually known as the objective function. For example:
f might be the negative log-likelihood of the parameters θ of a statistical model, the maximum likelihood
estimates of which are required; in a Bayesian setting f might be a (negative) posterior distribution for θ,
the posterior modes of which are required; f might be a dissimilarity measure for the alignment of two
DNA sequences, where the alignment is characterised by a parameter vector θ; f might be total distance
travelled, and θ a vector giving the order in which a delivery van should visit a set of drop off points.
Note that maximising f is equivalent to minimising −f so we are not losing anything by concentrating on
minimisation here. In the optimisation literature, the convention is to minimise an objective, with the idea
being that the objective is some kind of cost or penalty to be minimised. Indeed, the objective function is
sometimes referred to as the cost function.
Let Θ denote the set of all possible θ values. Before discussing specific methods, note that in practice
we will only be able to solve (3.1) if one of two properties holds.

1. It is practical to evaluate f (θ) for all elements of Θ.

2. It is possible to put some structure onto Θ, giving a notion of distance between any two elements of
Θ, such that closeness in Θ implies closeness in f value.

In complex situations, requiring complicated stochastic approaches, it is easy to lose sight of point 2 (al-
though if it isn’t met we may still be able to find ways of locating θ values that reduce f ).
Optimisation problems vary enormously in difficulty. In the matrix section we saw algorithms with op-
eration counts which might be proportional to as much as n3 . There are optimisation problems whose
operations count is larger than O(nk ) for any k: in computer science terms there is no polynomial time al-
gorithm* for solving them. Most of these really difficult problems are discrete optimisation problems, in that
θ takes only discrete values, rather than consisting of unconstrained real numbers. One such problem is
the famous Travelling salesman problem, which concerns finding the shortest route between a set of fixed
points, each to be visited only once. Note however, that simply having a way of improving approximate
solutions to such problems is usually of considerable practical use, even if exact solution is very taxing.
Continuous optimisation problems in which θ is a real random vector are generally more straightforward
than discrete problems, and we will concentrate on such problems here. In part this is because of the
centrality of optimisation methods in practical likelihood maximisation.
* Computer scientists have an interesting classification of the ‘difficulty’ of problems that one might solve by computer. If n is the problem

‘size’ then the easiest problems can be solved in polynomial time: they require O(nk ) operations where k is finite. These problems are in the
P-class. Then there are problems for which a proposed solution can be checked in polynomial time, the NP class. NP-complete problems are
problems in NP which are at least as hard as all other problems in NP in a defined sense. NP hard problems are at least as hard as all problems
in NP, but are not in NP: so you can’t even check the answer to an NP hard problem in polynomial time. No one knows if P and NP are the
same, but if any NP-complete problem turned out to be in P, then they would be.

23
CHAPTER 3. OPTIMISATION 24

3.1 Local versus global optimisation


Minimisation methods generally operate by seeking a local minimum of f . That is they seek a point, θ ∗ ,
such that f (θ ∗ +∆) ≥ f (θ ∗ ), for any sufficiently small perturbation ∆. Unless f is a strictly convex function,
it is not possible to guarantee that such a θ ∗ is a global minimum. That is, it is not possible to guarantee
that f (θ ∗ + ∆) ≥ f (θ ∗ ) for any arbitrary ∆.

3.2 Optimisation methods are myopic


Before going further it is worth stressing another feature of optimisation methods: they are generally very
short-sighted. At any stage of operation all that optimisation methods ‘know’ about a function are a few
properties of the function at the current best θ, (and, much less usefully, those same properties at previous
best points). It is on the basis of such information that a method tries to find an improved best θ. The
methods have no ‘overview’ of the function being optimised.

3.3 Look at your objective first


Before attempting to optimise a function with respect to some parameters, it is a good idea to produce plots
in order to get a feel for the function’s behaviour. Of course if you could simply plot the function over the
whole of Θ it is unlikely that you would need to worry about numerical optimisation, but generally this is not
possible, and the best that can be achieved is to look at some one or two dimensional transects through f .
To fix ideas, consider the apparently innocuous example of fitting a simple dynamic model to a single
time series by least squares/maximum likelihood. The dynamic model is
nt+1 = rnt (1 − nt /K), t = 0, 1, 2, . . .
where r and K are parameters and we will assume that n0 is known. Further suppose that we have
observations yt = nt + t where t ∼ N (0, σ 2 ). Estimation of r and K by least squares (or maximum
i.i.d.
likelihood, in this case) requires minimisation of
X
f (r, K) = {yi − ni (r, K)}2
i

w.r.t. r and K. We should try to get a feel for the behaviour of f . To see how this can work consider some
simulated data examples. In each case I used n0 = 20, K = 50 and σ = 1 for the simulations, but I have
varied r between the cases.
• In the first instance data were simulated with r = 2. If we now pretend that we need to estimate r and
K from such data, then we might look at some r-transects and some K-transects through f . This
figure shows the raw data and an r-transect.
25000
26
population

15000
24

f(r,K=50)
22

5000
20

0 10 20 30 40 50 1.0 1.5 2.0 2.5 3.0


time r
CHAPTER 3. OPTIMISATION 25

r-transects with other K values look equally innocuous, and K-transects also look benign over this
r range. So in this case f appears to be a nice smooth function of r and K, and any half decent
optimisation method ought to be able to find the optimum.

• In the second example data were simulated with r = 3.2. Again pretend that we only have the data
and need to estimate r and K from it. Examination of the left hand plot below, together with some
knowledge of the model’s dynamic behaviour, indicates that r is probably in the 3 to 4 range.

25000
40
35

15000
population

f(r,K=50)
30

5000
25
20

0
0 10 20 30 40 50 2.0 2.5 3.0 3.5 4.0
time r

Notice that f still seems to behave smoothly in the vicinity of its minimum, which is around 3.2, but
that something bizarre starts to happen above 3.6. Apparently we could be reasonably confident of
locating the function’s minimum if we start out below or reasonably close to 3.2, but if we stray into
the territory above 3.6 then there are so many local minima that locating the desired global minimum
would be very difficult.

• Finally data were simulated with r = 3.8.


25000
40
population

f(r,K=50)
30

15000
20

5000
10

0 10 20 30 40 50 2.0 2.5 3.0 3.5 4.0


time r

Now the objective has a minimum somewhere around 3.7, but it is surrounded by other local minima,
so that locating the actual minima would be a rather taxing problem. In addition it is now unclear
how we would go about quantifying uncertainty about the ‘optimal’ θ: it will certainly be of no use
appealing to asymptotic likelihood arguments in this case.

So, in each of these examples, simple transects through the objective function have provided useful
information. In the first case everything seemed OK. In the second we would need to be careful with the
optimisation, and careful to examine how sensible the optimum appeared to be. In the third case we would
CHAPTER 3. OPTIMISATION 26

need to think very carefully about the purpose of the optimisation, and about whether a re-formulation of
the basic problem might be needed.
Notice how the behaviour of the objective was highly parameter dependent here, something that em-
phasises the need to understand models quite well before trying to fit them. In this case the dynamic model,
although very simple, can show a wide range of complicated behaviour, including chaos. In fact, for this
simple example, it is possible to produce a plot of the fitting objective over pretty much the whole plausible
parameter space. Here is an example when data were simulated with r = 3.2

50

40
f(r,K)/

30
1000

20

10

70

60

50
k

40
4.0
3.5
3.0
2.5
2.0
30 1.5 r
1.0

3.3.1 Objective function transects are a partial view


Clearly plotting some transects through your objective function is a good idea, but they can only give a
limited and partial view when the θ is multidimensional. Consider this x-transect through a function, f (x, y)
0.00
−0.02
−0.04
f(x,y)
−0.06
−0.08
−0.10
−0.12

−3 −2 −1 0 1 2 3
x

Apparently the function has many local minima, and optimisation will be difficult. Actually the function
is this. . .
CHAPTER 3. OPTIMISATION 27

y
x

It has one local minimum, its global minimum. Head downhill from any point x, y and you will eventually
reach the minimum.
The opposite problem can also occur. Here are x and y transects through a second function g(x, y).
0.5
0.5

0.4
0.4

0.3
0.3
g(x,y)

g(x,y)

0.2
0.2
0.1

0.1
0.0

0.0

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

x y

. . . apparently no problem. From these plots you would be tempted to conclude that g is a well behaved
and uni-modal. The problem is that g looks like this. . .
CHAPTER 3. OPTIMISATION 28

g(x,
y)

y
x

So, generally speaking, it is a good idea to plot transects through your objective function before optimi-
sation, and to plot transects passing through the apparent optimum after optimisation, but bear in mind that
transects only give partial views.

3.4 Some optimisation methods


Optimisation methods start from some initial guesstimate of the optimum, θ [0] . Starting from k=0, most
methods then iterate the following steps.

1. Evaluate f (θ [k] ), and possibly the first and second derivatives of f w.r.t. the elements of θ, at θ [k] .

2. Either

(a) use the information from step 1 (and possibly previous executions of step 1) to find a search
direction, ∆, such that for some α, f (θ [k] + α∆) will provide sufficient decrease relative to
f (θ [k] ). Search for such an α and set θ [k+1] = θ [k] + α∆. Or
(b) use the information from step 1 (and possibly previous executions of step 1) to construct a local
model of f , and find the θ minimising this model within a defined region around θ [k] . If this value
of θ leads to sufficient decrease in f itself, accept it as θ [k+1] , otherwise shrink the search region
until it does.

3. Test whether a minimum has yet been reached. If it has, terminate, otherwise increment k and return
to 1.

Methods employing approach (a) at step 2 are known as search direction methods, while those employing
approach (b) are trust region methods. In practice the difference between (a) and (b) may be rather small:
the search direction is often based on a local model of the function. Here we will examine some search
direction methods.
CHAPTER 3. OPTIMISATION 29

3.4.1 Taylor’s Theorem (and conditions for an optimum)


We need only a limited version of Taylor’s theorem. Suppose that f is a twice continuously differentiable
function of θ and that ∆ is of the same dimension as θ. Then
1
f (θ + ∆) = f (θ) + ∇f (θ)T ∆ + ∆T ∇2 f (θ + t∆)∆
2
for some t ∈ (0, 1), where
 

∂f
 ∂2f ∂2f
∂θ12 ∂θ1 ∂θ2 . .
∂θ1
∂f ∂2f ∂2f
   
∇f (θ) = 
 ∂θ2  2

 and ∇ f (θ) =  ∂θ2 ∂θ1 ∂θ22
. . 
.
 .  
 . .

. . 
. . . . .

From Taylor’s theorem it is straightforward to establish that the conditions for f (θ ∗ ) to be a minimum of f
are that ∇f (θ ∗ ) = 0 and ∇2 f (θ ∗ ) is positive semi-definite.
Here we have given an exact form of Taylor’s theorem for a real-valued many-variable function, for a first
order approximation plus a second order remainder term in Langrange’s form. In practice, in applications,
we typically work with approximate versions of the above result. In some cases we work directly with a
linear (first-order) approximation, by setting the remainder term to zero. In other cases we work with the
quadratic (second-order) approximation, obtained by evaluating the remainder term at t = 0.

3.4.2 Steepest descent (AKA gradient descent)


Given some parameter vector θ [k] , it might seem reasonable to simply ask which direction would lead to
the most rapid decrease in f for a sufficiently small step, and to use this as the search direction. If ∆ is
small enough then Taylor’s theorem implies that

f (θ [k] + ∆) ' f (θ [k] ) + ∇f (θ [k] )T ∆.

For ∆ of fixed length the greatest decrease will be achieved by minimising the inner product ∇f (θ [k] )T ∆,
which is achieved by making ∆ parallel to −∇f (θ [k] ), i.e. by choosing

∆ ∝ −∇f (θ [k] ).

To firm things up, let’s set ∆ = −∇f (θ [k] )/k∇f (θ [k] )k, and consider a new proposed value of the form
θ [k] + α∆ for α > 0. Then, by plugging back into our linear approximation, we have

∇f (θ)T ∇f (θ)
f (θ [k] + α∆) ' f (θ [k] ) − α
k∇f (θ [k] )k
= f (θ [k] ) − αk∇f (θ [k] )k,

and so as we increase α from zero, we decrease f (·) at a rate proportional to the magnitude of the gradient
vector. Any other choice of ∆ will decrease the function value at a slower rate.
Given a search direction, we need a method for choosing how far to move from θ [k] along this direction.
Here there is a trade-off. We need to make reasonable progress, but do not want to spend too much time
choosing the exactly optimal distance to move, since the point will only be abandoned at the next iteration
anyway. Usually it is best to try and choose the step length α so that f (θ [k] + α∆) is sufficiently much lower
that f (θ [k] ), and also so that the magnitude of the gradient of f in the ∆ direction is sufficiently reduced by
the step. See e.g. Section 3.1 of Nocedal and Wright (2006) for further details. For the moment presume
that we have a step selection method.
The following illustrates the progress of the steepest descent method on Rosenbrock’s function. The
contours are log10 of Rosenbrock’s function, and the thick line shows the path of the algorithm.
CHAPTER 3. OPTIMISATION 30

1.5
1.0
0.5
z
0.0
−0.5

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5


x

It took 14,845 steps to reach the minimum. This excruciatingly slow progress is typical of the steepest
descent method. Another disadvantage of steepest descent is scale dependence: imagine linearly re-scaling
so that the z axis in the above plot ran from -5 to 15, but the contours were unchanged. This would change
all the steepest descent directions. By contrast, the methods considered next are scale invariant.

3.4.3 Newton’s method


As steepest descent approaches the minimum, the first derivative term retained in the Taylor approximation
of f becomes negligible relative to the neglected second derivative term. This largely explains the method’s
poor performance. Retaining the second derivative term in the Taylor expansion yields the method of choice
for a wide range of problems: Newton’s method. Taylor’s theorem implies that
1
f (θ [k] + ∆) ' f (θ [k] ) + ∇f (θ [k] )T ∆ + ∆T ∇2 f (θ [k] )∆.
2
Differentiating the r.h.s. w.r.t. ∆ and setting the result to 0 we have

∇2 f (θ [k] )∆ = −∇f (θ [k] )

as the equation to solve for the step ∆ which will take us to the turning point of the (quadratic) Taylor
approximation of f . This turning point will be a minimum (of the approximation) if ∇2 f (θ [k] ) is positive
definite, and can also be derived directly without vector calculus by completing the square.
Notice how Newton’s method automatically suggests a search direction and a step length. Usually we
simply accept the Newton step, ∆, unless it fails to lead to sufficient decrease in f . In the latter case the
step length is repeatedly halved until sufficient decrease is achieved.
The second practical issue with Newton’s method is the requirement that ∇2 f (θ [k] ) is positive definite.
There is no guarantee that it will be, especially if θ [k] is far from the optimum. This would be a major
problem, were it not for the fact that the solution to

H∆ = −∇f (θ [k] )

is a descent direction for any positive definite H. To see this, put ∆ = −H−1 ∇f (θ [k] ) as a search direction
in our first-order approximation,

f (θ + α∆) ' f (θ) + α∇f (θ)T ∆


= f (θ) − α∇f (θ)T H−1 ∇f (θ)
< f (θ),

for small α > 0 due to the positive-definiteness of H−1 . This gives us the freedom to modify ∇2 f (θ [k] )
to make it positive definite, and still be guaranteed a descent direction. One approach is to take the
symmetric eigen-decomposition of ∇2 f (θ [k] ), reverse the sign of any negative eigenvalues, replace any
CHAPTER 3. OPTIMISATION 31

zero eigenvalues with a small positive number and then replace ∇2 f (θ [k] ) by the matrix with this modified
eigen-decomposition. Computationally cheaper alternatives simply add a multiple of the identity matrix to
∇2 f (θ [k] ), sufficiently large to make the result positive definite. Note that steepest descent corresponds
to the choice H = I, and so inflating the diagonal of the Hessian corresponds to nudging the search
direction from that of the Newton method towards that of steepest descent, and this is usually a fairly safe
and conservative thing to do.
Combining these steps gives a practical Newton algorithm. Starting with k = 0 and a guesstimate
[0]
θ ...

1. Evaluate f (θ [k] ), ∇f (θ [k] ) and ∇2 f (θ [k] ), implying a quadratic approximation to f , via Taylor’s theo-
rem.

2. Test whether θ [k] is a minimum, and terminate if it is.

3. If ∇2 f (θ [k] ) is not positive definite, perturb it so that it is (thereby modifying the quadratic model of f ).

4. The search direction ∆ is now defined by the solution to

∇2 f (θ [k] )∆ = −∇f (θ [k] ),

i.e. the ∆ minimising the current quadratic model.

5. If f (θ [k] + ∆) is not sufficiently lower than f (θ [k] ), repeatedly halve ∆ until it is.

6. Set θ [k+1] = θ [k] + ∆, increment k by one and return to step 1.

The following figure illustrates the progress of the method on Rosenbrock’s function, working across
rows from top left. Grey contours show the function, dotted show the quadratic approximation at the current
best point, and, if present, black contours are the quadratic approximation before modification of the Hes-
sian. The thick black line tracks the steps accepted, while the thin black line shows any suggested steps
that were reduced. Note the fact that the method converged in 20 steps, compared to the 14,845 steps
taken by steepest descent.
CHAPTER 3. OPTIMISATION 32

3.4.4 Quasi-Newton methods


Newton’s method usually performs very well, especially near the optimum, but to use it we require the
second derivative (Hessian) matrix of the objective. This can be tedious to obtain. Furthermore, if the
CHAPTER 3. OPTIMISATION 33

dimension of θ is large, then solving for the Newton direction can be a substantial part of the computational
burden of optimisation. Is it possible to produce methods that only require first derivative information, but
with efficiency rivalling Newton’s method, rather than steepest descent?
The answer is yes. One approach is to replace the exact Hessian, of Newton’s method, with an approx-
imate Hessian based on finite differences of gradients. The finite difference intervals must be chosen with
care (see section 4.2), but the approach is reliable (recall that any positive definite matrix in place of the
Hessian will yield a descent direction, so an approximate Hessian should be unproblematic). However, fi-
nite differencing for the Hessian is often more costly than exact evaluation, and there is no saving in solving
for the search direction.
The second approach is to use a quasi-Newton method. The basic idea is to use first derivative infor-
mation gathered from past steps of the algorithm to accumulate an approximation to the Hessian at the
current best point. In principle, this approximation can be used exactly like the exact Hessian in Newton’s
method, but if structured carefully, such an algorithm can also work directly on an approximation to the
inverse of the Hessian, thereby reducing the cost of calculating the actual step. It is also possible to ensure
that the approximate Hessian is always positive definite.
Quasi-Newton methods were invented in the mid 1950’s by W.C. Davidon (a physicist). In the math-
ematical equivalent of not signing the Beatles, his paper on the method was rejected (it was eventually
published in 1991). There are now many varieties of Quasi-Newton method, but the most popular is the
BFGS method† , which will be briefly covered here.
We suppose that at step k we have already computed θ [k] and an approximate Hessian H[k] . So, just
as with a regular Newton method, we solve

H[k] ∆ = −∇f (θ [k] )

to obtain a new search direction, leading either to a new θ [k+1] directly, or more likely via line search.
We now have the problem of how to appropriately update the approximate Hessian, H[k+1] . We want the
approximate Hessian to give us a good quadratic approximation of our function about θ [k+1] . If H[k+1] is
our new approximate positive definite Hessian at the (k + 1)th step, we have the function approximation
1
f (θ) ' f (θ [k+1] ) + ∇f (θ [k+1] )T (θ − θ [k+1] ) + (θ − θ [k+1] )T H[k+1] (θ − θ [k+1] ).
2
The basic requirement of a quasi Newton method is that the new Hessian should be chosen so that the
approximation should exactly match ∇f (θ [k] ) — i.e. the approximation should get the gradient vector at the
previous best point, θ [k] , exactly right. Taking the gradient of the above approximation gives

∇f (θ) ' ∇f (θ [k+1] ) + H[k+1] (θ − θ [k+1] ).

Evaluating this at θ [k] leads to our requirement

∇f (θ [k] ) = ∇f (θ [k+1] ) + H[k+1] (θ [k] − θ [k+1] ),

and hence
H[k+1] sk = yk (3.2)
where sk = θ [k+1] − θ [k] and yk = ∇f (θ [k+1] ) − ∇f (θ [k] ). The so-called secant equation (3.2) will only
be feasible for positive definite H[k+1] under certain conditions on sk and yk , but we can always arrange
for these to be met by appropriate step length selection (see Wolfe conditions, Nocedal & Wright, 2006,
§3.1). Note, in particular, that pre-multiplying (3.2) by sT T T [k+1] s > 0 for positive definite
k gives sk yk = sk H k
T
Hessian, so sk yk must be positive. Still, the secant equation is under-determined, so we need to constrain
the problem further in order to identify an appropriate unique solution. For quasi Newton methods, this is
typically done by proposing a low-rank update to the previous Hessian. Methods exist which use a rank 1
update, but BFGS uses a rank 2 update of the form

H[k+1] = H[k] + αuuT + βvvT .



BFGS is named after Broyden, Fletcher, Goldfarb and Shanno all of whom independently discovered and published it, around 1970. Big
Friendly Giant Steps is the way all Roald Dahl readers remember the name, of course ([Link], pers. com.).
CHAPTER 3. OPTIMISATION 34

For (good) reasons that we don’t have time to explore here, the choice u = yk , v = H[k] sk is made, and
substituting this update into the secant equation allows solving for α and β, giving the BFGS update,

H[k] sk sT
kH
[k]
H[k+1] = H[k] + ρk yk ykT − ,
sT [k]
k H sk

where ρk = 1/sTk yk . By substitution, we can directly verify that this satisfies the secant equation. Now let’s
−1
work in terms of the inverse approximate Hessian, B[k] ≡ H[k] , since this will allow us to avoid solving
a linear system at each iteration. A naive re-write of our update would be
" #−1
(B[k] )−1 sk sT [k] −1
k (B )
B[k+1] = (B[k] )−1 + ρk yk ykT − ,
sT [k] −1
k (B ) sk

but this is a low-rank update of an inverse, and hence a good candidate for application of Woodbury’s
formula. It is actually a bit tedious to apply Woodbury directly, but if we can guess the solution,

B[k+1] = (I − ρk sk ykT )B[k] (I − ρk yk sT T


k ) + ρk sk sk ,

we can verify that it is correct by direct multiplication. We can also double-check that this update satisfies
the modified secant equation, B[k+1] yk = sk (it does). The BFGS update is usually given in the form above,
but for actually implementation on a computer, it can be expanded as
h i
B[k+1] = B[k] − ρk sk ykT B[k] + B[k] yk sT T [k] T
k + ρk (1 + ρk yk B yk )sk sk ,

and then some careful thinking about the order and re-use of computational steps within the update can
lead to a very efficient algorithm.
The BFGS method then works exactly like Newton’s method, but with B[k] in place of the inverse of
∇ f (θ [k] ), and without the need for second derivative evaluation or for perturbing the Hessian to achieve
2

positive definiteness. It also allows direct computation of the search direction via

∆ = −B[k] ∇f (θ k ).

A finite difference approximation to the Hessian is often used to start the method. The following figure
illustrates its application (showing every second step of 38).
CHAPTER 3. OPTIMISATION 35

Notice how this quasi-Newton method has taken nearly twice as many steps as the Newton method, for
the same problem, and shows a greater tendency to ‘zig-zag’. However, it is still computationally cheaper
than finite differencing for the Hessian, which would have required an extra 20 gradient evaluations, com-
pared to quasi-Newton. Compared to steepest descent it’s miraculous.
CHAPTER 3. OPTIMISATION 36

L-BFGS

It is also worth being aware of a variant of BFGS, known as L-BFGS, with the “L” standing for limited mem-
ory. This algorithm is mainly intended for very high-dimensional functions, where working with a full inverse
Hessian would be an issue for either computational or memory reasons. Here, rather than propagating the
full inverse Hessian matrix at each step, a low rank approximation of it is implicitly updated and propagated
via the recent history of the process. The following Wikipedia pages provide a little additional context.

• [Link]

• [Link]

• [Link]

3.4.5 The Nelder–Mead polytope method


What if even gradient evaluation is too taxing, or if our objective is not smooth enough for Taylor approx-
imations to be valid (or perhaps even possible)? What can be done with function values alone? The
Nelder–Mead polytope‡ method provides a rather successful answer, and as with the other methods we
have met, it is beautifully simple.
Let n be the dimension of θ. At each stage of the method we maintain n + 1 distinct θ vectors, defining
a polytope in Θ (e.g. for a 2-dimensional θ, the polytope is a triangle). The following is iterated until a
minimum is reached/ the polytope collapses to a point.

1. The search direction is defined as the vector from the worst point (vertex of the polytope with highest
objective value) through the average of the remaining n points.

2. The initial step length is set to twice the distance from the worst point to the centroid of the others. If
it succeeds (meaning that the new point is no longer the worst point) then a step length of 1.5 times
that is tried, and the better of the two accepted.

3. If the previous step did not find a successful new point then step lengths of half and one and a half
times the distance from the worst point to the centroid are tried.

4. If the last two steps failed to locate a successful point then the polytope is reduced in size, by linear
rescaling towards the current best point (which remains fixed.)

5. When a point is accepted, the current worst point is discarded, to maintain n + 1 points, and the
algorithm returns to step 1.

Variations are possible, in particular with regard to the step lengths and shrinkage factors.
The following figure illustrates the polytope method in action. Each polytope is plotted, with the line
style cycling through, black, grey and dashed black. The worst point in each polytope is highlighted with a
circle.

This is also known as the downhill simplex method, which should not be confused with the completely different simplex method of linear
programming.
CHAPTER 3. OPTIMISATION 37

1.5
1.0
0.5
z
0.0
−0.5

−1.5 −1.0 −0.5 0.0 0.5 1.0 1.5


x

In this case it took 79 steps to reach the minimum of Rosenbrock’s function. This is somewhat more
than Newton or BFGS, but given that we need no derivatives in this case, the amount of computation is
actually less.
On the basis of this example you might be tempted to suppose that Nelder-Mead is all you ever need,
but this is generally not the case. If you need to know the optimum very accurately then Nelder-Mead
will often take a long time to get an answer that Newton based methods would give very quickly. Also,
the polytope can get ‘stuck’, so that it is usually a good idea to restart the optimisation from any apparent
minimum (with a new polytope having the apparent optimum as one vertex), to check that further progress
is really not possible. Essentially, Nelder-Mead is good if the answer does not need to be too accurate, and
derivatives are hard to come by, but you wouldn’t use it as the underlying optimiser for general purpose
modelling software, for example.

3.4.6 Other methods


These notes can only provide a brief overview of some important methods. There are many other optimi-
sation methods, and quite a few are actually useful. For example:

• In likelihood maximisation contexts the method of scoring is often used. This replaces the Hessian in
Newton’s method with the expected Hessian.

• The Gauss-Newton method is another way of getting a Newton type method when only first deriva-
tives are available. It is applicable when the objective is somehow related to a sum of squares of
differences between some data and the model predictions of those data, and has a nice interpreta-
tion in terms of approximating linear models. It is closely related to the algorithm for fitting GLMs.

• Conjugate gradient methods are another way of getting a good method using only first derivative
information. The problem with steepest descent is that if you minimise along one steepest descent
direction then the next search direction is always at right angles to it. This can mean that it takes
a huge number of tiny zig-zagging steps to minimise even a perfectly quadratic function. If n is the
dimension of θ, then Conjugate Gradient methods aim to construct a set of n consecutive search
directions such that successively minimising along each will lead you exactly to the minimum of a
quadratic function.
CHAPTER 3. OPTIMISATION 38

• Simulated annealing is a gradient-free method for optimising difficult objectives, such as those arising
in discrete optimisation problems, or with very spiky objective functions. The basic idea is to propose
random changes to θ, always accepting a change which reduces f (θ), but also accepting moves
which increase the objective, with a probability which decreases with the size of increase, and also
decreases as the optimisation progresses.
• Stochastic gradient descent (SGD) is an increasingly popular approach to optimisation of likelihoods
arising from very large data sets. Each step of SGD is a gradient descent based on a new objective
derived from a random subset of the data. The stochastic aspect can speed up convergence and the
use of a relatively small subset of the data can speed up each iteration: [Link]
org/wiki/Stochastic_gradient_descent
• The EM algorithm is a useful approach to likelihood maximisation when there are missing data or
random effects that are difficult to integrate out.

3.5 Constrained optimisation


Sometimes optimisation must be performed subject to constraints on θ. An obvious way to deal with con-
straints is to re-parameterise to arrive at an unconstrained problem. For example if θ1 > 0 we might choose
to work in terms of the unconstrained θ10 where θ1 = exp(θ10 ), and more complicated re-parameterisations
can impose quite complex constraints in this way. While often effective re-parameterisation suffers from at
least three problems:
1. It is not possible for all constraints that might be required.
2. The objective is sometimes a more awkward function of the new parameters than it was of the old.
3. There can be problems with the unconstrained parameter estimates tending to ±∞. For example is
θ1 above is best estimated to be zero then the best estimate of θ10 is −∞.
A second method is to add to the objective function a penalty function penalising violation of the con-
straints. The strength of such a penalty has to be carefully chosen, and of course the potential to make the
objective more awkward than the original applies here too.
The final set of approaches treat the constraints directly. For linear equality and inequality constraints
constrained Newton type methods are based on the methods of quadratic programming. Non-linear con-
straints are more difficult, but can often be discretised into a set of approximate linear constraints.

3.6 Modifying the objective


If you have problems optimising an objective function, then sometimes it helps to modify the function itself.
For example:
• Reparameterisation can turn a very unpleasant function into something much easier to work with
(e.g. it is often better to work with precision than with standard deviation when dealing with normal
likelihoods).
• Transform the objective. e.g. log(f ) will be ‘more convex’ than f : this is usually a good thing.
• Consider perturbing the data. For example an objective based on a small sub sample of the data
can sometimes be a good way of finding starting values for the full optimisation. Similarly resampling
the data can provide a means for escaping small scale local minima in an objective if the re-sample
based objective is alternated with the true objective as optimisation progresses, or if the objective is
averaged over multiple re-samples. cf. SGD.
• Sometimes a problem with optimisation indicates that the objective is not a sensible one. e.g. in
the third example in section 3.3, an objective based on getting the observed ACF right makes more
sense, and is much better behaved.
CHAPTER 3. OPTIMISATION 39

3.7 Software
R includes a routine optim for general purpose optimisation by BFGS, Nelder-Mead, Conjugate Gradient
or Simulated Annealing. The R routine nlm uses Newton’s method and will use careful finite difference
approximations for derivatives which you don’t supply. R has a number of add on packages e.g. quadprog,
linprog, trust to name but 3. See [Link] for a really useful guide to what is more
generally available.
Should you ever implement your own optimisation code? Many people reply with a stern ‘no’, but I am
less sure. For the methods presented in detail above, the advantages of using your own implementation are
(i) the un-rivalled access to diagnostic information and (ii) the knowledge of exactly what the routine is doing.
However, if you do go down this route it is very important to read further about step length selection and
termination criteria. Where greater caution is needed is if you require constrained optimisation, large scale
optimisation or to optimise based on approximate derivatives. All these areas require careful treatment.

3.8 Further reading on optimisation


• Nocedal and Wright (2006) Numerical Optimization 2nd ed. Springer, is a very clear and up to date
text. If you are only going to look at one text, then this is probably the one to go for.

• Gill, P.E. , W. Murray and M.H. Wright (1981) Practical Optimization Academic Press, is a classic,
particularly on constrained optimisation. I’ve used it to successfully implement quadratic programming
routines.

• Dennis, J.E. and R.B. Schnabel (1983) Numerical Methods for Unconstrained Optimization and Non-
linear Equations (re-published by SIAM, 1996), is a widely used monograph on unconstrained meth-
ods (the nlm routine in R is based on this one).

• Press WH, SA Teukolsky, WT Vetterling and BP Flannery (2007) Numerical Recipes (3rd ed.), Cam-
bridge and Monahan, JF (2001) Numerical Methods of Statistics, Cambridge, both discuss Optimiza-
tion at a more introductory level.
Chapter 4

Calculus by computer

As we saw in the previous section, almost any time that we want to maximise a likelihood, we need to
differentiate it. Very often, merely evaluating a likelihood involves integration (to obtain the marginal dis-
tribution of the observable data). Unsurprisingly, there are times when it is useful to perform these tasks
by computer. Broadly speaking, numerical integration is computationally costly but stable, while numerical
differentiation is cheap but potentially unstable. We will also consider the increasingly important topic of
‘automatic differentiation’.

4.1 Cancellation error


Consider the following R code

a <- 1e16
b <- 1e16 + pi
d <- b - a

so d should be π, right? Actually

d; pi

## [1] 4
## [1] 3.141593

i.e. d is out by 27% as the following shows. . .

(d-pi)/pi ## relative error in difference is huge...

## [1] 0.2732395

This is an example of cancellation error. Almost any time we take the difference of two floating point
numbers of similar magnitude (and sign) we find that the accuracy of the result is substantially reduced.
The problem is rounding error. You can see what happens by considering computation to 16 places of
decimals. Then

a = 1.0000000000000000 × 1016 and b = 1.0000000000000003 × 1016

hence d = b − a = 3. (d was 4, above, because the computer uses a binary rather than a decimal
representation).
As we will see, derivative approximation is an area where we can not possibly ignoreP cancellation error,
2 2
but actually it is a pervasive issue. For example, cancellation error is the reason why i xi − nx̄ is not
2 2 2
P P
a good basis for a ‘one-pass’ method for calculating i (xi − x̄) . i (xi − x1 ) − n(x̄ − x1 ) is a simple
one pass alternative, and generally much better. It is also a substantial issue in numerical linear algebra.

40
CHAPTER 4. CALCULUS BY COMPUTER 41

For example, orthogonal matrix decompositions (used for QR, eigen and singular value decompositions),
usually involve an apparently arbitrary choice at each successive rotation involved in the decomposition,
about whether some element of the result of rotation should be positive or negative: one of the choices
almost always involves substantially lower cancellation error than the other, and is the one taken.
Since approximation of derivatives inevitably involves subtracting similar numbers, it is self evident
that cancellation error will cause some difficulty. The basic reason why numerical integration is a more
stable process (at least for statisticians who usually integrate only positive functions) is because numerical
integration is basically about adding things up, rather than differencing them, and this is more stable. e.g.

f <- a + b
(f-2e16-pi)/(2e16+pi) ## relative error in sum, is tiny...

## [1] 4.292037e-17

4.2 Approximate derivatives by finite differencing


Consider differentiating a sufficiently smooth function f (x) with respect to the elements of its vector argu-
ment x. f might be something simple like sin(x) or something complicated like the mean global temperature
predicted by an atmospheric GCM, given an atmospheric composition, forcing conditions etc.
A natural way to approximate the derivatives is to use:
∂f f (x + ∆ei ) − f (x)
'
∂xi ∆
where ∆ is a small constant and ei is a vector of the same dimension as x, with zeroes for each element
except the ith , which is 1. In a slightly sloppy notation, Taylor’s theorem tells us that
1
f (x + ∆ei ) = f (x) + ∇f (x)T ei ∆ + ∆2 eT 2
i ∇ f ei
2
Re-arranging while noting that ∇f (x)T ei = ∂f /∂xi we have

f (x + ∆ei ) − f (x) ∂f 1
− = ∆eT ∇2 f ei .
∆ ∂xi 2 i
Now suppose that L is an upper bound on the magnitude of ∂ 2 f /∂x2i , we have

f (x + ∆ei ) − f (x) ∂f L∆
− ≤ .
∆ ∂xi 2

That is to say, we have an upper bound on the finite difference truncation* error, assuming that the FD
approximation is calculated exactly, without round-off error.
In reality we have to work in finite precision arithmetic and can only calculate f to some accuracy.
Suppose that we can calculate f to one part in −1 (the best we could hope for here is that  is the machine
precision). Now let Lf be an upper bound on the magnitude of f , and denote the computed value of f by
comp(f ). We have
|comp{f (x)} − f (x)| ≤ Lf
and
|comp{f (x + ∆ei )} − f (x + ∆ei )| ≤ Lf
combining the preceding two inequalities we get

comp{f (x + ∆ei ) − f (x)} f (x + ∆ei ) − f (x) 2Lf


− ≤ ,
∆ ∆ ∆
* So called because it is the error associated with truncating the Taylor series approximation to the function.
CHAPTER 4. CALCULUS BY COMPUTER 42

an upper bound on the cancellation error that results from differencing two very similar quantities using
finite precision arithmetic. (It should now be clear what the problem was in section 1.1 example 2.)
Now clearly the dependence of the two types of error on ∆ is rather different. We want ∆ as small as
possible to minimise truncation error, and as large as possible to minimise cancellation error. Given that
the total error is bounded as follows,
L∆ 2Lf
[Link] ≤ +
2 ∆
it makes sense to chose ∆ to minimise the bound. That is we should choose
r
4Lf
∆≈ .
L
So if the typical size of f and its second derivatives are similar then

∆≈ 

will not be too far from optimal. This is why the square root of the machine precision is often used as the
finite difference interval. However, note the assumptions: we require that Lf ≈ L and f to be calculable to
a relative accuracy that is a small multiple of the machine precision for such a choice to work well. In reality
problems are not always well scaled and a complicated function may have a relative accuracy substantially
lower than is suggested by the machine precision. In such cases, see section 8.6 of Gill, Murray and Wright
(1981), or section 4.3.3 below.

4.2.1 Other FD formulae


The FD approach considered above is forward differencing. Centred differences are more accurate, but
more costly. . .
∂f f (x + ∆ei ) − f (x − ∆ei )
'
∂xi 2∆
. . . in the well scaled case ∆ ≈ 1/3 is about right.
Higher order derivatives can also be useful. For example

∂2f f (x + ∆ei + ∆ej ) − f (x + ∆ei ) − f (x + ∆ej ) + f (x)


'
∂xi ∂xj ∆2

which in the well scaled case will be most accurate for ∆ ≈ 1/4 .

4.3 Automatic Differentiation: code that differentiates itself


Differentiation of complex models or other expressions is tedious, but ultimately routine. One simply applies
the chain rule and some known derivatives, repeatedly. The main problem with doing this is the human error
rate and the limits on human patience. Given that differentiation is just the repeated application of the chain
rule and known derivatives, couldn’t it be automated? The answer is yes. There are two approaches.
The first is symbolic differentiation using a computer algebra package such as Maple or Mathematica.
This works, but can produce enormously unwieldy expressions when applied to a complex function. The
problem is that the number of terms in a symbolic derivative can be very much larger than for the expression
being evaluated. It is also very difficult to apply to a complex simulation model, for example.
The second approach is automatic differentiation. This operates by differentiating a function based
directly on the computer code that evaluates the function. There are several different approaches to this,
but many use the features of object oriented programming languages to achieve the desired end. The key
feature of an object oriented language, from the AD perspective, is that every data structure, or object in
such a language has a class and the meaning of operators such as +, -, * etc depends on the class of the
objects to which they are applied. Similarly the action of a function depends on the class of its arguments.
CHAPTER 4. CALCULUS BY COMPUTER 43

Suppose then, that we would like to differentiate

f (x1 , x2 , x3 ) = (x1 x2 sin(x3 ) + ex1 x2 ) /x3

w.r.t. its real arguments x1 , x2 and x3 .† In R the code

(x1*x2*sin(x3) + exp(x1*x2))/x3

would evaluate the function, if x1, x2 and x3 were initialised to be floating point numbers.
Now define a new type of object of class "ad" which has a value (a floating point number) and a
"grad" attribute. In the current case this "grad" attribute will be a 3-vector containing the derivatives of
the value w.r.t. x1, x2 and x3. We can now define versions of the arithmetic operators and mathematical
functions which will return class "ad" results with the correct value and "grad" attribute, whenever they
are used in an expression.
Here is an R function to create and initialise a simple class "ad" object

ad <- function(x,diff = c(1,1)) {


## create class "ad" object. diff[1] is length of grad
## diff[2] is element of grad to set to 1.
grad <- rep(0,diff[1])
if (diff[2]>0 && diff[2]<=diff[1]) grad[diff[2]] <- 1
attr(x,"grad") <- grad
class(x) <- "ad"
x
}

Here it is in use, initialising x1 to the value 1, giving it a 3 dimensional "grad" attribute, and setting the
first element of grad to 1 since ∂x1 /∂x1 = 1.

x1 <- ad(1,c(3,1))
x1

## [1] 1
## attr(,"grad")
## [1] 1 0 0
## attr(,"class")
## [1] "ad"

Now the interesting part. I can define versions of mathematical functions and operators that are specific
to class "ad" objects and correctly propagate derivatives alongside values. Here is a sin function for class
"ad".

[Link] <- function(a) {


grad.a <- attr(a,"grad")
a <- [Link](a) ## avoid infinite recursion - change class
d <- sin(a)
attr(d,"grad") <- cos(a) * grad.a ## chain rule
class(d) <- "ad"
d
}

and here is what happens when it is applied to x1


This example is taken from Nocedal and Wright, (2006).
CHAPTER 4. CALCULUS BY COMPUTER 44

sin(x1)

## [1] 0.841471
## attr(,"grad")
## [1] 0.5403023 0.0000000 0.0000000
## attr(,"class")
## [1] "ad"

i.e. the value of the result is sin(x1 ), while the first element of its "grad" contains the derivative of
sin(x1 ) w.r.t. x1 evaluated at x1 = 1.
Operators can also be overloaded in this way. e.g. here is multiplication operator for class "ad":

"*.ad" <- function(a,b) { ## ad multiplication


grad.a <- attr(a,"grad")
grad.b <- attr(b,"grad")
a <- [Link](a)
b <- [Link](b)
d <- a*b ## evaluation
attr(d,"grad") <- a * grad.b + b * grad.a ## product rule
class(d) <- "ad"
d
}

Continuing in the same way we can provide a complete library of mathematical functions and operators
for the "ad" class. Given such a library, we can obtain the derivatives of a function directly from the code
that would simply evaluate it, given ordinary floating point arguments. For example, here is some code
evaluating a function.

x1 <- 1; x2 <- 2; x3 <- pi/2


(x1*x2*sin(x3)+ exp(x1*x2))/x3

## [1] 5.977259

and here is the same code with the arguments replaced by "ad" objects

x1 <- ad(1, c(3,1))


x2 <- ad(2, c(3,2))
x3 <- ad(pi/2, c(3,3))
(x1*x2*sin(x3) + exp(x1*x2))/x3

## [1] 5.977259
## attr(,"grad")
## [1] 10.681278 5.340639 -3.805241
## attr(,"class")
## [1] "ad"

you can check that these results are correct (actually to machine accuracy).
This simple propagation of derivatives alongside the evaluation of a function is known as forward mode
auto-differentiation. R is not the best language in which to try to do this, and if you need AD it is much
better to use existing software libraries in C++, for example, which have done all the function and operator
re-writing for you.
CHAPTER 4. CALCULUS BY COMPUTER 45

4.3.1 Reverse mode AD


If you require many derivatives of a scalar valued function then forward mode AD will have a theoretical
computational cost similar to finite differencing, since at least as many operations are required for each
derivative as are required for function evaluation. In reality the overheads associated with operator over-
loading make AD more expensive (alternative strategies also carry overheads). Of course the benefit of
AD is higher accuracy, and in many applications the cost is not critical.
An alternative with the potential for big computational savings over finite differencing is reverse mode
AD. To understand this it helps to think of the evaluation of a function in terms of a computational graph.
Again concentrate on the example function given in Nocedal and Wright (2006):

f (x1 , x2 , x3 ) = (x1 x2 sin(x3 ) + ex1 x2 ) /x3

Any computer evaluating this must break the computation down into a sequence of elementary operations
on one or two floating point numbers. This can be thought of as a graph:

x1 f=(x1x2sin x3+ ex1 x2)/x3

* x4 x6
exp

x2

* +
x3 x5 x7 x8 f
sin /

where the nodes x4 to x8 are the intermediate quantities that will have to be produced en route from the
input values x1 to x3 to the final answer, f . The arrows run from parent nodes to child nodes. No child can
be evaluated until all its parents have been evaluated.
Simple left to right evaluation of this graph results in this . . .
CHAPTER 4. CALCULUS BY COMPUTER 46

x1 = 1 f=(x1x2sin x3+ ex1 x2)/x3

* x4 = 2 x6 = 7.39
exp

x2 = 2

* +
x3 = 1.57 x5 = 1 x7 = 2 x8 = 9.39 f = 5.98
sin /

Now, forward mode AD carries derivatives forward through the graph, alongside values. e.g. the deriva-
tive of a node with respect to input variable, x1 is computed using

∂xk X ∂xk ∂xj


=
∂x1 ∂xj ∂x1
j parent of k

(the r.h.s. being evaluated by overloaded functions and operators, in the object oriented approach). The
following illustrates this process, just for the derivative w.r.t. x1 .

x1 = 1
dx1 f=(x1x2sin x3+ ex1 x2)/x3
=1 dx
4 dx
dx1 1 = 2
x4 = 2 x6 = 7.39
dx6 dx4 = 7.39
* dx4 dx6
=1 =2 exp = 14.8
dx 2 dx1 dx1
dx 4
x2 = 2
dx2 dx dx
=0 7
dx
8
dx
dx1 4 6
= =
1 1

x3 = 1.57 x5 = 1 x7 = 2 x8 = 9.39 f = 5.98


dx5 dx3 = 0 dx7 dx5 = 2 * dx8 dx7 = 1 + df dx8 = 0.64
dx3 dx5 dx7 dx8 df
=0 sin =0 =2 = 16.8 = 10.7
dx1 dx1 dx1 dx1 / dx1

df dx3 = −3.81

. . . Again computation runs left to right, with evaluation of a node only possible once all parent values are
known.
Now, if we require derivatives w.r.t. several input variables, then each node will have to evaluate deriva-
tives w.r.t. each of these variables, and this becomes expensive (in terms of the previous graph, each node
would contain multiple evaluated derivatives). Reverse mode therefore does something ingenious. It first
executes a forward sweep through the graph, evaluating the function and all the derivatives of nodes w.r.t.
their parents, as follows:
CHAPTER 4. CALCULUS BY COMPUTER 47

x1 = 1 f=(x1x2sin x3+ ex1 x2)/x3


dx
4 dx
1 = 2
dx6 dx4 = 7.39
* x4 = 2 x6 = 7.39
= 1 exp
dx 2
dx 4

x2 = 2 dx dx
7 8
dx dx
4 = 6 =
1 1

dx5 dx3 = 0 dx7 dx5 = 2 * dx8 dx7 = 1 + df dx8 = 0.64


x3 = 1.57 x5 = 1 x7 = 2 x8 = 9.39 f = 5.98
sin /

df dx3 = −3.81

The reverse sweep then works backwards through the graph, evaluating the derivative of f w.r.t. each
node, using the following
∂f X ∂xj ∂f
=
∂xk ∂xk ∂xj
j is child of k

and the fact that at the terminal node ∂f /∂f = 1.

x1 = 1
df f=(x1x2sin x3+ ex1 x2)/x3
= 10.7 dx
4 dx
dx1 1 = 2
x4 = 2 x6 = 7.39
dx6 dx4 = 7.39
* df df
= 1 = 5.34 exp = 0.64
dx 2 dx4 dx6
dx 4
x2 = 2
df dx dx
= 5.34 7
dx
8
dx
dx2 4 6
= =
1 1

x3 = 1.57 x5 = 1 x7 = 2 x8 = 9.39 f = 5.98


dx5 dx3 = 0 dx7 dx5 = 2 * dx8 dx7 = 1 + df dx8 = 0.64
df df df df df
= − 3.8 sin = 1.27 = 0.64 = 0.64 =1
dx3 dx5 dx7 dx8 / df

df dx3 = −3.81

The derivatives in grey are those calculated on the reverse sweep. The point here is that there is only
one derivative to be evaluated at each node, but in the end we know the derivative of f w.r.t. every input
variable. Reverse mode can therefore save a large number of operations relative to finite differencing or
forward mode. Once again, general purpose AD libraries automate the process for you, so that all you
need to be able to write is the evaluation code.
Unfortunately reverse mode efficiency comes at a heavy price. In forward mode we could discard the
values and derivatives associated with a node as soon as all its children were evaluated. In reverse mode
the values of all nodes and the evaluated derivatives associated with every connection have to be stored
during the forward sweep, in order to be used in the reverse sweep. This is a heavy storage requirement.
CHAPTER 4. CALCULUS BY COMPUTER 48

For example, if f involved the inversion of a 1000 × 1000 matrix then we would have to store some 2 × 109
intermediate node values plus a similar number of evaluated derivatives. That is some 32 Gigabytes of
storage before we even consider the requirements for storing the structure of the graph.

4.3.2 A caveat
For AD to work it is not sufficient that the function being evaluated has properly defined derivatives at
the evaluated function value. We require that every function/operator used in the evaluation has prop-
erly defined derivatives at its evaluated argument(s). This can create a problem with code that executes
conditional on the value of some variable. For example the Box-Cox transformation of a positive datum y is
 λ
(y − 1)/λ λ 6= 0
B(y; λ) =
log(y) λ=0

If you code this up in the obvious way then AD will never get the derivative of B w.r.t. λ right if λ = 0.

4.3.3 Using AD to improve FD


When fitting complicated or computer intensive models it can be that AD is too expensive to use for rou-
tine derivative calculation during optimisation. However, with the gradual emergence of highly optimised
libraries, this situation is changing, and in any case it can still provide a useful means for calibrating FD
intervals. A ‘typical’ model run can be auto-differentiated, and the finite difference intervals adjusted to
achieve the closest match to the AD derivatives. As optimisation progresses one or two further calibrations
of the FD intervals can be carried out as necessary.

4.4 Numerical Integration


Integration is at least as important as optimisation in statistics. Whether integrating random effects out
of a joint distribution to get a likelihood, or just evaluating a simpler expectation, it occurs everywhere.
Unfortunately integration is generally numerically expensive, and much statistical research is devoted to
clever ways of avoiding it. When it can’t be avoided then the game is typically to find ways of approximating
an integral of a function with the minimum number of function evaluations.

• Integration w.r.t. 1 or 2 (maybe even up to 4) variables is not too problematic. There are two ‘deter-
ministic’ lines of attack.

1. Use differential equation solving methods since, for example


Z b
y= f (x)dx
a

is the solution of
dy
= f (x), y(a) = 0
dx
integrated to b.
2. Quadrature rules. These evaluate the function to be integrated at some set of points in the
integration domain and combine the resulting values to obtain an approximation to the integral.
For example
Z b N
b−aX
f (x)dx ≈ f (a + (i − .5)(b − a)/N )
a N
i=1

is the Midpoint rule for approximate integration. Simpson’s Rule is a higher order variant, while
Gaussian Quadrature rules aim for substantially higher order accuracy by careful choice of the
locations at which to evaluate.
CHAPTER 4. CALCULUS BY COMPUTER 49

• Integration with respect to several-many variables is usually more interesting in statistics and it is
difficult. Maintenance of acceptable accuracy with quadrature rules or differential equation meth-
ods typically requires an unfeasible number of function evaluations for high dimensional domains of
integration. There are two main approaches.

1. Approximate the function to be integrated by one that can be integrated analytically.


2. Take a statistical view of the integration process, and construct statistical estimators of the inte-
gral. Note that, for a fixed number of function evaluations N ,
(a) any bias (such as the error in a quadrature rule) scales badly with dimension.
(b) any estimator variance need not depend on dimension at all.
(c) as usual, we can completely eliminate either bias or variance, but not both.
These considerations suggest using unbiased estimators of integrals, based on random samples
from the function being integrated. The main challenge is then to design statistical sampling
schemes and estimators which minimise the estimator variance.

4.4.1 Quadrature rules


In one dimension we might choose to approximate the integral of a continuous function f (x) by the area
under a set of step functions, with the midpoint of each matching f :

b N
b−aX
Z
f (x)dx = f (a + (i − .5)(b − a)/N ) + O(h2 )
a N
i=1

where hR in the final error term is (b − a)/N . Let’s see what the error bound means in practice by approxi-
1
mating 0 ex dx.

[Link] <- exp(1) - 1


N <- 10
[Link] <- sum(exp((1:N-.5)/N))/N
c([Link],[Link],([Link])/[Link])

## [1] 1.7175660865 1.7182818285 0.0004165452

A .04% error may be good enough for many purposes, but suppose we need higher accuracy. . .

N <- 100
[Link] <- sum(exp((1:N-.5)/N))/N
c([Link],[Link],([Link])/[Link])

## [1] 1.718275e+00 1.718282e+00 4.166655e-06

N <- 1000
[Link] <- sum(exp((1:N-.5)/N))/N
c([Link],[Link],([Link])/[Link])

## [1] 1.718282e+00 1.718282e+00 4.166667e-08

Exactly as the error bound indicated, each 10 fold increase in the number of function evaluations buys
us a 100 fold reduction in the approximation error. Incidentally, notice that in statistical terms this error is
not a random error — it is really a bias.
If f is cheap to evaluate, accuracy is not paramount and we are only interested in 1 dimensional inte-
grals then we might stop there (or more likely with the slightly more accurate but equally straightforward
‘Simpson’s Rule’), but life is not so simple. If one is very careful about the choice of x points at which to
CHAPTER 4. CALCULUS BY COMPUTER 50

evaluate f (x) and about the weights used in the integral approximating summation, then some very im-
pressive results can be obtained, at least for smooth enough functions, using Gauss(ian) Quadrature. The
basic idea is to find a set {xi , wi : i = 1, . . . , n} such that if w(x) is a fixed weight function and g(x) is any
polynomial up to order 2n − 1 then
Z b n
X
g(x)w(x)dx = wi g(xi )
a i=1

Note that the xi , wi set depends on w(x), a and b, but not on g(x). The hope is that the r.h.s. will still approxi-
mate the l.h.s. very accurately if g(x) is not a polynomial of the specified type, and there is some impressive
error analysis to back this up, for any function with 2n well behaved derivatives (see e.g. Monahan, 2001,
section 10.3).
The R package statmod has a nice [Link] function for producing the weights, wi and nodes,
xi , for some standard w(x)s. It assumes a = −1, b = 1, but we can just linearly re-scale our actual interval
to use this. Here is the previous example using Gauss Quadrature (assuming w(x) is a constant).

library(statmod)
N <- 10
gq <- [Link](N) # get the x_i, w_i
gq # take a look

## $nodes
## [1] -0.9739065 -0.8650634 -0.6794096 -0.4333954 -0.1488743 0.1488743
## [7] 0.4333954 0.6794096 0.8650634 0.9739065
##
## $weights
## [1] 0.06667134 0.14945135 0.21908636 0.26926672 0.29552422 0.29552422
## [7] 0.26926672 0.21908636 0.14945135 0.06667134

sum(gq$weights) # for an interval of width 2

## [1] 2

[Link] <- sum(gq$weights*exp((gq$nodes+1)/2))/2


c([Link], [Link], ([Link])/[Link])

## [1] 1.718282e+00 1.718282e+00 -1.292248e-15

For convenience, we can pack the rescaling logic into a simple function:

gqi <- function(g, N = 10, a = -1, b = 1) {


gq = [Link](N)
w = gq$weights*(b-a)/2
n = a + (b-a)*(gq$nodes + 1)/2
sum(w * g(n))
}

gqi(exp, 10, 0, 1)

## [1] 1.718282

Clearly 10 evaluation points was a bit wasteful here, so consider N = 5


CHAPTER 4. CALCULUS BY COMPUTER 51

[Link] <- gqi(exp, 5, 0, 1)


c([Link], [Link], ([Link])/[Link])

## [1] 1.718282e+00 1.718282e+00 3.805670e-13

even 5 seems slightly wasteful:

[Link] <- gqi(exp, 3, 0, 1)


c([Link], [Link], ([Link])/[Link])

## [1] 1.718281e+00 1.718282e+00 4.795992e-07

. . . still more accurate than the midpoint rule with 100 evaluations. This example should be treated with
caution, however. ex is a very smooth function and such impressive results will not hold for less smooth
cases.

4.4.2 Quadrature rules for multidimensional integrals


We can integrate w.r.t. several variables by recursive application of quadrature rules. For example consider
integrating f (x, z) over a square region, and assume we use basically the same node and weight sequence
for both dimensions. We have Z X
f (xj , z)dz ≈ wi f (xj , zi )
i
and so Z Z X X XX
f (x, z)dzdx ≈ wj wi f (xj , zi ) = wj wi f (xj , zi )
j i j i

Clearly, in principle we can repeat this for as many dimensions as we like. In practice however, to maintain
a given level of accuracy, the number of function evaluations will have to rise as N d where N is the number
of nodes for each dimension, and d is the number of dimensions. Even if we can get away with a 3 point
Gaussian quadrature, we will need 1.5 million evaluations by the time we reach a 13 dimensional integral.
Here is a function which takes the node sequence x and weight sequence w of some quadrature rule
and uses them to produce the d dimensional set of nodes, and the corresponding weight vector resulting
from recursive application of the 1D rule.

mesh <- function(x,d,w=1/length(x)+x*0) {


n <- length(x)
W <- X <- matrix(0,n^d,d)
for (i in 1:d) {
X[,i] <- x;W[,i] <- w
x <- rep(x,rep(n,length(x)))
w <- rep(w,rep(n,length(w)))
}
w <- exp(rowSums(log(W))) ## column product of W gives weights
list(X=X,w=w) ## each row of X gives co-ordinates of a node
}

Here’s how it works.

mesh(c(1,2),3)

## $X
## [,1] [,2] [,3]
## [1,] 1 1 1
CHAPTER 4. CALCULUS BY COMPUTER 52

## [2,] 2 1 1
## [3,] 1 2 1
## [4,] 2 2 1
## [5,] 1 1 2
## [6,] 2 1 2
## [7,] 1 2 2
## [8,] 2 2 2
##
## $w
## [1] 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125

Let’s look at it in action on f (x) = exp( xi ), integrated over [−1, 1]d .


P

fd <- function(x) {exp(rowSums(x))}

First try the midpoint rule over a two dimensional domain

N <- 10; d <- 2; N^d # number of function evaluations

## [1] 100

[Link] <- (exp(1)-exp(-1))^d


mmp <- mesh((1:N-.5)/N*2-1,d,rep(2/N,N))
[Link] <- sum(mmp$w*fd(mmp$X))
c([Link],[Link],([Link])/[Link])

## [1] 5.506013515 5.524391382 0.003326677

Now over a 5D domain

N <- 10; d <- 5; N^d # number of function evaluations

## [1] 1e+05

[Link] <- (exp(1)-exp(-1))^d


mmp <- mesh((1:N-.5)/N*2-1,d,rep(2/N,N))
[Link] <- sum(mmp$w*fd(mmp$X))
c([Link],[Link],([Link])/[Link])

## [1] 71.136612879 71.731695754 0.008295954

. . . which illustrates the enormous rate of increase in function evaluations required to achieve a rather
disappointing accuracy.
Of course Gauss Quadrature is a better bet

N <- 4; d <- 2; N^d

## [1] 16

[Link] <- (exp(1)-exp(-1))^d


gq <- [Link](N)
mgq <- mesh(gq$nodes,d,gq$weights)
[Link] <- sum(mgq$w*fd(mgq$X))
c([Link],[Link],([Link])/[Link])

## [1] 5.524390e+00 5.524391e+00 2.511325e-07


CHAPTER 4. CALCULUS BY COMPUTER 53

with the increase in dimension being slightly less painful . . .

N <- 4; d <- 5; N^d

## [1] 1024

[Link] <- (exp(1)-exp(-1))^d


gq <- [Link](N)
mgq <- mesh(gq$nodes,d,gq$weights)
[Link] <- sum(mgq$w*fd(mgq$X))
c([Link],[Link],([Link])/[Link])

## [1] 7.173165e+01 7.173170e+01 6.278311e-07

. . . even so, by the time we reach 20 dimensions we would be requiring 1012 function evaluations, and
most functions we might want to integrate will not be so smooth and well behaved that 4 quadrature points
give us anything like the sort of accuracy seen in this example.

4.4.3 Approximating the integrand


Beyond a few dimensions, quadrature rules really run out of steam. In statistical work, where we are
often interested in integrating probability density functions, or related quantities, it is sometimes possible to
approximate the integrand by a function with a known integral. Here, just one approach will be illustrated:
Laplace approximation. The typical application of the method is when we have a joint density f (y, b) and
want to evaluate the marginal p.d.f. of y,
Z
fy (y) = f (y, b)db.

For a given y let b̂y be the value of b maximising f . Then Taylor’s theorem implies that
1
log{f (y, b)} ' log{f (y, b̂y )} − (b − b̂y )T H(b − b̂y )
2

where H is the Hessian of − log(f ) w.r.t. b evaluated at y, b̂y . i.e.


 
1 T
f (y, b) ' f (y, b̂y ) exp − (b − b̂y ) H(b − b̂y )
2

and so Z  
1 T
fy (y) ' f (y, b̂y ) exp − (b − b̂y ) H(b − b̂y ) db.
2
From the properties of normal p.d.f.s we have

|H|1/2 − 1 (b−b̂y )T H(b−b̂y ) (2π)d/2


Z Z
− 21 (b−b̂y )T H(b−b̂y )
e 2 db = 1 ⇒ e db =
(2π)d/2 |H|1/2
so
(2π)d/2
fy (y) ' f (y, b̂y ) .
|H|1/2
Notice that the basic ingredients of this approximation are the Hessian of the log integrand w.r.t b and the
b̂y values. Calculating the latter is an optimisation problem, which can usually be solved by a Newton type
method, given that we can presumably obtain gradients as well as the Hessian. The highest cost here is
the Hessian calculation, which scales with the square of the dimension, at worst, or for cheap integrands
the determinant evaluation, which scales as the cube of the dimension. i.e. when it works, this approach is
much cheaper than brute force quadrature.
CHAPTER 4. CALCULUS BY COMPUTER 54

4.4.4 Monte-Carlo integration


Consider integrating f (x) over some region Ω of volume V (Ω). Clearly
Z
IΩ = f (x)dx = E{f (X)}V (Ω)

where X ∼ uniform over Ω. Generating N uniform random vectors, xi , over Ω, we have the obvious
estimator
N
ˆ V (Ω) X
IΩ = f (xi ),
N
i=1

which defines basic Monte-Carlo Integration. ‡ Clearly IˆΩ is unbiased, and if the xi are independent then

V (Ω)2
var(IˆΩ ) = N var{f (X)}
N2
i.e. the variance scales as N −1 , and the standard deviation scales as N −1/2 (it’s Op (N −1/2 )). This is
not brilliantly fast convergence to IΩ , but compare it to quadrature. If we take the midpoint rule then
the (bias) error is O(h2 ) where h is the node spacing. In d dimensions then at best we can arrange
h ∝ N −1/d , implying that the bias is O(N −2/d ). i.e. once we get above d = 4 we would prefer the stochastic
convergence rate of Monte-Carlo Integration to the deterministic convergence rate of the midpoint rule.
Of course using Simpson’s rule would buy us a few more dimensions, and for smooth functions the
spectacular convergence rates of Gauss quadrature may keep it as the best option up to a rather higher d,
but the fact remains that its bias is still dependent on d, while the sampling error of the Monte Carlo method
is dimension independent, so Monte Carlo will win in the end.

4.4.5 Stratified Monte-Carlo Integration


One tweak on basic Monte-Carlo is to divide (a region containing) Ω into N equal volumed sub-regions, Ωi ,
and to estimate the integral within each Ωi from a single Xi drawn from a uniform distribution over Ωi . The
estimator has the same form as before, but its variance is now
N
V (Ω)2 X
var(I˜Ω ) = var{f (Xi )}
N2
i=1

which is usually lower than previously, because the range of variation of f within each Ωi is lower than
within the whole of Ω.
The underlying idea of trying to obtain the nice properties of uniform random variables, but with more
even spacing over the domain of integration, is pushed to the limit by Quasi-Monte-Carlo integration, which
uses deterministic space filling sequences of numbers as the basis for integration. For fixed N , the bias of
such schemes grows much less quickly with dimension than the quadrature rules.

4.4.6 Importance sampling


In statistics an obvious variant on Monte-Carlo integration occurs when the integrand factors into the prod-
uct of a p.d.f. f (x) and another term in which case
Z n
1X
φ(x)f (x)dx = Ef [φ(X)] ≈ φ(xi )
n
i=1

where the xi are observations on r.v.s with p.d.f. f (x). A difficulty with this approach, as with simple
MC integration, is that a high proportion of the xi s often fall in places where φ(xi ) makes no, or negligible,

Note that in practice, if Ω is a complicated shape, it may be preferable to generate uniformly over a simpler enclosing region Ω̃, and just
set the integrand to zero for points outside Ω itself.
CHAPTER 4. CALCULUS BY COMPUTER 55

contribution to the integral. For example, in Bayesian modelling, it is quite possible for the f (y|θ) to be have
non-vanishing support over only a tiny volume of the parameter space over which f (θ) has non-vanishing
support.
One approach to this problem is importance sampling. Rather than generate the xi from f (x) we
generate from a distribution with the same support as f , but with a p.d.f., g(x), which is designed to
generate points concentrated in the places where the integrand is non-negligible. Then the estimator is
Z Z n
f (x) 1X f (xi )
φ(x)f (x)dx = φ(x) g(x)dx ≈ Iˆis = φ(xi ) .
g(x) n g(xi )
i=1

The expected value of the r.h.s. of the above is its l.h.s., so this estimator is unbiased. R
What would make a good g? Suppose that g ∝ φf . Then g(x) = φ(x)f (x)/k, where k = φ(x)f (x)dx.
So Iˆis = nk/n = φ(x)f (x)dx, and the estimator has zero variance. Clearly this is impractical. If we knew
R

k then we would not need to estimate, but there are useful insights here anyway. Firstly, the more constant
is φf /g, the lower the variance of Iˆis . Secondly, the idea that the density of evaluation points should reflect
the contribution of the integrand in the local neighbourhood is quite generally applicable.

4.4.7 Laplace importance sampling


We are left with the question of how to choose g? When the integrand is some sort of probability density
function, a reasonable approach is to approximate it by the normal density implied by the Laplace approxi-
mation. That is we take g to be the p.d.f. of the multivariate normal N (x̂, H−1 ) where x̂ denotes the x value
maximising φf , while H is the Hessian of − log(φf ) w.r.t. x, evaluated at x̂.
In fact, as we have seen previously, multivariate normal random vectors are generated by transformation
of standard normal random vectors z ∼ N (0, I). i.e. x = x̂ + R−1 z where R is a Cholesky factor of H
so that RT R = H. There is then some computational saving in using a change of variable argument to
re-express the importance sampling estimator in terms of the density of z, rather than g itself. Start with
Z Z
φ(x)f (x)dx = φ(x̂ + R−1 z)f (x̂ + R−1 z)|R−1 |dz =

f (x̂ + R−1 z)
Z
1
φ(x̂ + R−1 z) h(z)dz,
|R| h(z)

where h(z) = (2π)−d/2 exp{−zT z/2} is the standard normal density for z, giving

−1
Z  
1 −1 f (x̂ + R z)
φ(x)f (x)dx = Ez φ(x̂ + R z) =
|R| h(z)
(2π)d/2 h i
Ez φ(x̂ + R−1 z)f (x̂ + R−1 z) exp{zT z/2} ,
|R|

and leading in turn to the Laplace importance sampling estimator


n
(2π)d/2 X
Z
T
φ(x)f (x)dx ≈ φ(x̂ + R−1 zi )f (x̂ + R−1 zi )ezi zi /2 .
n|R|
i=1

Some tweaks may be required.

1. It might be desirable to make g a little more diffuse than the Laplace approximation suggests, to avoid
excessive importance weights being assigned to zi values in the tails of the target distribution.

2. If the integration is in order to obtain a likelihood, which is to be maximised, then it is usually best
to condition on a set of zi values, in order to avoid random variability in the approximate likelihood,
which will make optimisation difficult.
CHAPTER 4. CALCULUS BY COMPUTER 56

3. Usually the log of the integrand is evaluated rather than the integrand itself, and it is the log of the
integral that is required. To avoid numerical under/overflow problems note that
( ) " #
X X
log wi exp(ψi ) = log wi exp{ψi − max(ψi )} + max(ψi )
i i

the r.h.s. will not overflow or underflow when computed in finite precision arithmetic (for reasonable
wi ).

4. As with simple Monte-Carlo integration, some variance reduction is achievable by stratification. The zi
can be obtained by suitable transformation of random vectors from a uniform distribution on (0, 1)d . In-
stead (0, 1)d can be partitioned into equal volume parts, with one uniform vector generated in each. . .

Tweak 3. is an example of a general technique for avoiding numerical underflow and overflow which has
many applications in statistical computing. It is therefore worth examining more carefully.

4.4.8 The log–sum–exp trick


In statistical computing, we rarely want to work with raw probabilities and likelihoods, but instead work
with their logarithms. Log-likelihoods are much less likely to numerically underflow or overflow than raw
likelihoods. For likelihood computations involving iid observations, this is often convenient, since a product
of raw likelihood terms becomes a sum of log-likelihood terms. However, sometimes we have to evaluate
the sum of raw probabilities or likelihoods. The obvious way to do this would be to exponentiate the
logarithms, evaluate the sum, and then take the log of the result to get back to our log scale. However,
the whole point to working on the log scale is to avoid numerical underflow and overflow, so we know that
exponentiating raw likelihoods is a dangerous operation which is likely to fail in many problems. Can we do
the log–sum–exp operation safely? Yes, we can.
Suppose that we have log (unnormalised) weights (typically probabilities or likelihoods), li = log wi , but
we want to compute the (log of the) sum of the raw weights. We can obviously exponentiate the log weights
and sum them, but this carries the risk that all of the log weights will underflow (or, possibly, overflow). So,
we want to compute X
L = log exp(li ),
i

without the risk of underflow. We can do this by first computing m = maxi li and then noting that
X X
L = log exp(li ) = log exp(m) exp(li − m)
i i
" #
X X
= log exp(m) exp(li − m) = m + log exp(li − m).
i i

We are now safe, since li − m cannot be bigger than zero, and hence no term can overflow. Also, since
we know that li − m = 0 for some i, we know that at least one term will not underflow to zero. So we are
guaranteed to get a sensible finite answer in a numerically stable way. Note that exactly the same technique
works for computing a sample mean (or other weighted sum) as opposed to just a simple sum (as above).

4.4.9 An example
Consider a Generalised Linear Mixed Model:

yi |b ∼ Poi(µi ), log(µi ) = Xi β + Zi b, b ∼ N (0, Λ−1 )

where Λ is diagonal and depends on a small number of parameters. Here, β are the fixed effects and b
are the random effects, with covariate matrices X and Z, respectively. From the model specification it is
CHAPTER 4. CALCULUS BY COMPUTER 57

straightforward to write down the joint probability function of y and b, based on the factorisation f (y, b) =
fy|b (y|b)fb (b). i.e.
X
log{f (y, b)} = {yi (Xi β + Zi b) − exp(Xi β + Zi b) − log(yi !)}
i
1X 1X 1X
− λj b2j + log(λj ) − log(2π)
2 2 2
j j j

but likelihood based inference actually requires that we can evaluate the marginal p.f. fy (y) for any values
of the model parameters. Now Z
fy (y) = f (y, b)db

and evaluation of this requires numerical integration.


To tie things down further, suppose that the fixed effects part of the model is simply β0 + β1 x where x is
a continuous covariate, while the random effects part is given by two factor variables, z1 and z2 , each with
two levels. So a typical row of Z is a vector of 4 zeros and ones, in some order. Suppose further that the
first two elements of b relate to z1 and the remainder to z2 and that diag(Λ) = [σ1−2 , σ1−2 , σ2−2 , σ2−2 ]. The
following simulates 50 observations from such a model.

[Link](0);n <- 50
x <- runif(n)
z1 <- sample(c(0,1),n,replace=TRUE)
z2 <- sample(c(0,1),n,replace=TRUE)
b <- rnorm(4) ## simulated random effects
X <- [Link](~x)
Z <- cbind(z1,1-z1,z2,1-z2)
eta <- X%*%c(0,3) + Z%*%b
mu <- exp(eta)
y <- rpois(mu,mu)

Here is a function evaluating log{f (y, b)} (log is used here to avoid undesirable underflow to zero)

lf.yb0 <- function(y, b, theta, X, Z, lambda) {


beta <- theta[1:ncol(X)]
theta <- theta[-(1:ncol(X))]
eta <- X%*%beta + Z%*%b
mu <- exp(eta)
lam <- lambda(theta, ncol(Z))
sum(y*eta - mu - lfactorial(y)) - sum(lam*b^2)/2 +
sum(log(lam))/2 - ncol(Z)*log(2*pi)/2
}

theta contains the fixed effect parameters, i.e. β and the parameters of Λ. So in this example, theta
is of length 4, with the first 2 elements corresponding to β and the last two are σ1−1 , σ22 . lambda is the name
of a function for turning the parameters of Λ into its leading diagonal elements, for example

[Link] <- function(theta, nb) c(theta[1], theta[1],


theta[2], theta[2])

Actually, for integration purposes, lf.yb0 can usefully be re-written to accept a matrix b where each
column is a different b at which to evaluate the log p.d.f. Suppose [Link] is such a function. Lets try eval-
uating fy (y) at the simulated data, and the theta vector used for simulation, using the various integration
methods covered above.
CHAPTER 4. CALCULUS BY COMPUTER 58

Quadrature rules

First try brute force application of the midpoint rule. We immediately have the slight problem of the integra-
tion domain being infinite, but since the integrand will vanish at some distance from b = 0, we can probably
safely restrict attention to the domain [−5, 5]4 .

theta <- c(0,3,1,1)


nm <- 10;m.r <- 10
bm <- mesh((1:nm-(nm+1)/2)*m.r/nm,4,rep(m.r/nm,nm))
lf <- [Link](y,t(bm$X),theta,X,Z,[Link])
log(sum(bm$w*exp(lf-max(lf)))) + max(lf)

## [1] -116.5303

Of course, we don’t actually know how accurate this is, but we can get some idea by doubling nm, in
which case

log(sum(bm$w*exp(lf-max(lf)))) + max(lf)

## [1] -118.6234

Clearly we are not doing all that well, even with 10000-160000 function evaluations.
On very smooth functions Gauss quadrature rules were a much better bet, so let’s try a naive application
of these again.

nm <- 10
gq <- [Link](nm)
w <- gq$weights*m.r/2 ## rescale to domain
x <- gq$nodes*m.r/2
bm <- mesh(x,4,w)
lf <- [Link](y,t(bm$X),theta,X,Z,[Link])
log(sum(bm$w*exp(lf-max(lf)))) + max(lf)

## [1] -136.95

— this is a long way from the midpoint estimate: what happens if we double nm to 20?

log(sum(bm$w*exp(lf-max(lf)))) + max(lf)

## [1] -125.3369

Increasing further still indicates problems. The difficulty is that the integrand is not very smooth, in
the way that ex was, and it is not well approximated by a tensor product of high order polynomials. The
midpoint rule actually works better here. In fact this integral could be performed reasonably accurately by
skilful use of Gauss quadrature, but the example illustrates that naive application can be much worse than
doing something simpler.

Laplace approximation

To use the Laplace approximation we need to supplement [Link] with functions [Link] and [Link]
which evaluate the gradient vector and Hessian of the log(f ) w.r.t. b. Given these functions we first find b̂
by Newton’s method
CHAPTER 4. CALCULUS BY COMPUTER 59

b <- rep(0,4)
while(TRUE) {
b0 <- b
g <- [Link](y, b, theta, X, Z, [Link])
H <- [Link](y, b, theta, X, Z, [Link])
b <- b - solve(H, g)
b <- [Link](b)
if (sum(abs(b-b0)) < 1e-5*sum(abs(b)) + 1e-4) break
}

after which b contains b̂ and H the negative of the required H. It remains to evaluate the approximation

lf.yb0(y,b,theta,X,Z,[Link]) + 2*log(2*pi) -
0.5*determinant(H,logarithm=TRUE)$modulus

## [1] -116.9646
## attr(,"logarithm")
## [1] TRUE

This is reasonably close to the truth, but the method itself provides us with no way of knowing how
close.

Monte Carlo

Simple brute force Monte Carlo Integration follows

N <- 100000
bm <- matrix((runif(N*4)-.5)*m.r,4,N)
lf <- [Link](y, bm, theta, X, Z, [Link])
log(sum(exp(lf-max(lf)))) + max(lf) - log(N) + log(m.r^4)

## [1] -119.8479

## [1] -119.3469

The next replicate gave -119.3, and that degree of variability is with 105 function evaluations. Stratification
improves things a bit, but let’s move on.

Intelligent Monte Carlo

Using the original factorisation of the joint likelihood, we can re-write our required marginal likelihood
Z Z
 
fy (y) = f (y, b)db = fy|b (y|b)fb (b)db = Eb fy|b (y|b)

as the expectation of the complete data likelihood wrt the random effects distribution. This gives us a
natural Monte Carlo approach to estimation.

[Link] <- function(y, b, theta, X, Z, lambda) {


beta = theta[1:ncol(X)]
eta = Z%*%b + [Link](X%*%beta)
colSums(y*eta - exp(eta) - lfactorial(y))
CHAPTER 4. CALCULUS BY COMPUTER 60

}
N = 100000
lam = [Link](theta[3:4])
bm = matrix(rnorm(4*N, 0, sqrt(lam)), nrow=4)
llv = [Link](y, bm, theta, X, Z, [Link])
max(llv) + log(mean(exp(llv - max(llv))))

## [1] -119.2791

## [1] -119.0786

The second replicate is consistent with the first, suggesting relatively low variance. This approach
is simple and often works well, but it does rely on a degree of consistency between the complete data
likelihood and the (prior) random effects distribution. Laplace importance sampling does not need this.

Laplace Importance Sampling

Using Laplace importance sampling allows the construction of an efficient unbiased estimate of the marginal
likelihood. It is not strictly necessary here, since the previous Monte Carlo approach is arguably simpler.
But it is very useful for more challenging problems.

N <- 10000
R <- chol(-H)
z <- matrix(rnorm(N*4),4,N)
bm <- b + backsolve(R,z)
lf <- [Link](y,bm,theta,X,Z,[Link])
zz.2 <- colSums(z*z)/2
lfz <- lf + zz.2
log(sum(exp(lfz - max(lfz)))) + max(lfz) + 2 * log(2*pi) -
sum(log(diag(R))) - log(N)

## [1] -119.3229

## [1] -119.2882
## [1] -119.2774

Stratification can reduce this variability still further, and be aware that while we know that the estimator
is unbiased, it is not on the log scale, of course.

Conclusions

The example emphasises that integration is computationally expensive, and that multi-dimensional inte-
gration is best approached by approximating the integrand with something analytically integrable or by
constructing unbiased estimators of the integral. The insight that variance is likely to be minimised by con-
centrating the ‘design points’ of the estimator on where the integrand has substantial magnitude is strongly
supported by this example, and leads rather naturally, beyond the scope of this course, to MCMC methods.
Note that in likelihood maximisation settings you might also consider avoiding intractable integration via the
EM algorithm (see e.g. Lange, 2010 or Monahan, 2001).
CHAPTER 4. CALCULUS BY COMPUTER 61

4.5 Further reading on computer integration and differentiation


• Chapter 8 of Nocedal, J. & S.J. Wright (2006) Numerical Optimization (2nd ed.), Springer, offers a
good introduction to finite differencing and AD, and the AD examples above are adapted directly from
this source.

• Gill, P.E. , W. Murray and M.H. Wright (1981) Practical Optimization Academic Press, Section 8.6
offers the treatment of finite differencing that you should read before using it on anything but the most
benign problems. This is one of the few references that tells you what to do if your objective is not
‘well scaled’.

• Griewank, A. (2000) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation


SIAM, is the reference for Automatic Differentiation. If you want to know exactly how AD code works
then this is the place to start. It is very clearly written and should be consulted before using AD on
anything seriously computer intensive.

• [Link] is a great resource for information on AD and AD tools.

• If you want to see AD in serious action, take a look at e.g. Kalnay, E (2003) Atmospheric Modelling,
Data Assimilation and Predictability.

• Although most widely-used AD libraries are based on object–oriented programming approaches, ar-
guably the most elegant approaches exploit functional programming languages with strong static type
systems. See, for example: Elliott, C. (2018) The simple essence of automatic differentiation. Proc.
ACM Program. Lang. 2, ICFP, Article 70. DOI: [Link]

• Chapter 10 of Monahan, J.F (2001) Numerical Methods of Statistics, offers more information on
numerical integration, including much of the detail skipped over here, although it is not always easy
to follow, if you don’t already have quite a good idea what is going on.

• Lange, K (2010) Numerical Analysis for Statisticians, second edition, Springer. Chapter 18 is good
on one dimensional quadrature.

• Ripley, B.D. (1987, 2006) Stochastic Simulation Chapters 5 and section 7.4 are good discussions of
stochastic integration and Quasi Monte Carlo integration.

• Robert, C.P. and G. Casella (1999) Monte Carlo Statistical Methods Chapter 3 (and beyond) goes
into much more depth than these notes on stochastic integration.

• Press et al. (2007) Numerical Recipes (3rd ed), Cambridge, has a nice chapter on numerical integra-
tion, but with emphasis on 1D.
Chapter 5

Random number generation

The stochastic integration methods of the last section took it for granted that we can produce random
numbers from various distributions. Actually we can’t. The best that can be done is to produce a completely
deterministic sequence of numbers which appears indistinguishable from a random sequence with respect
to any relevant statistical property that we choose to test. In other words we may be able to produce
a deterministic sequence of numbers that can be very well modelled as being a random sequence from
some distribution. Such deterministic sequences are referred to as sequences of pseudorandom numbers,
but the pseudo part usually gets dropped at some point.
The fundamental problem, for our purposes, is to generate a pseudorandom sequence that can be
extremely well modelled as i.i.d. U (0, 1). Given such a sequence, it is fairly straightforward to generate
deviates from other distributions, but the i.i.d. U (0, 1) generation is where the problems lie. Indeed if you
read around this topic then most books will pretty much agree about how to turn uniform random deviates
into deviates from a huge range of other distributions, but advice on how to get the uniform deviates in the
first place is much less consistent.

5.1 Simple generators and what can go wrong


Since the 1950s there has been much work on linear congruential generators. The intuitive motivation is
something like this. Suppose I take an integer, multiply it by some enormous factor, re-write it in base -
‘something huge’, and then throw away everything except for the digits after the decimal point. Pretty hard
to predict the result, no? So, if I repeat the operation, feeding each step’s output into the input for the next
step, a more or less random sequence might result. Formally the pseudorandom sequence is defined by

Xi+1 = (aXi + b)modM

where b is 0 or 1, in practice. This is started with a seed X0 . The Xi are integers (< M , of course), but
we can define Ui = Xi /M . Now the intuitive hope that this recipe might lead to Ui that are reasonably well
modelled by i.i.d. U (0, 1) r.v.s is only realised for some quite special choices of a and M , and it takes some
number theory to give the generator any sort of theoretical foundation (see Ripley, 1987, Chapter 2).
An obvious property to try to achieve is full period. We would like the generator to visit all possible
integers between 1 − b and M − 1 once before it starts to repeat itself (clearly the first time it revisits a value,
it starts to repeat itself). We would also like successive Ui s to appear uncorrelated. A notorious and widely
used generator called RANDU, supplied at one time with IBM machines, met these basic considerations
with
Xi+1 = (65539Xi )mod231
This appears to do very well in 1 dimension.

n <- 100000 ## code NOT for serious use


x <- rep(1,n)
a <- 65539;M <- 2^31;b <- 0 ## Randu

62
CHAPTER 5. RANDOM NUMBER GENERATION 63

for (i in 2:n) x[i] <- (a*x[i-1]+b)%%M


u <- x/(M-1)

qqplot((1:n-.5)/n,sort(u))

1.0
0.8
0.6
U
0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0


Uniform quantiles

Similarly a 2D plot of Ui vs Ui−1 indicates no worries with serial correlation

## Create data frame with U at 3 lags...


U <- [Link](u1=u[1:(n-2)],u2=u[2:(n-1)],u3=u[3:n])

plot(U$u1,U$u2,pch=".")
1.0
0.8
0.6
Ui
0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0


Ui−1

We can also check visually what the distribution of the triples (Ui , Ui−1 , Ui−2 ) looks like.

library(lattice)
cloud(u1~u2*u3,U,pch=".",col=1,screen=list(z=40,x=-70,y=0))
CHAPTER 5. RANDOM NUMBER GENERATION 64

u1

u3
u2

. . . not quite so random looking. Experimenting a little with rotations gives

cloud(u1~u2*u3,U,pch=".",col=1,screen=list(z=40,x=70,y=0))

u1

u3 u2

i.e. the triples lie on one of 15 planes. Actually you can show that this must happen (see Ripley, 1987,
section 2.2).
Does this deficiency matter in practice? When we looked at numerical integration we saw that quadra-
ture rules based on evaluating integrands on fixed lattices become increasingly problematic as the di-
mension of the integral increases. Stochastic integration overcomes the problems by avoiding the biases
associated with fixed lattices. We should expect problems if we are performing stochastic integration on
the basis of random numbers that are in fact sampled from some coarse lattice.
So the first lesson is to use generators that have been carefully engineered by people with a good
understanding of number theory, and have then been empirically tested (Marsaglia’s Diehard battery of
tests provides one standard test set). For example, if we stick with simple congruential generators, then

Xi = (69069Xi−1 + 1)mod232 (5.1)

is a much better bet. Here is its triples plot. No amount of rotating provides any evidence of structure.
CHAPTER 5. RANDOM NUMBER GENERATION 65

u1

u3 u2

Although this generator is much better than RANDU, it is still problematic. An obvious infelicity is the
fact that a very small Xi will always be followed by an unusually small Xi+1 (consider Xi = 1, for example).
This is not a property that would be desirable in a time series simulation, for example. Not quite so obvious
is the fact that for any congruential generator of period M , then k−tuples, Ui , Ui−1 , . . . , Ui−k+1 will tend to
lie on a finite number of k − 1 dimensional planes (e.g. for RANDU we saw 3-tuples lying on 2 dimensional
planes.) There will be at most M 1/k such planes, and as RANDU shows, there can be far fewer. The upshot
of this is that if we could visualise 8 dimensions, then the 8-tuple plot for (5.1) would be just as alarming as
the 3D plot was for RANDU. 8 is not an unreasonably large dimension for an integral.
Generally then, it might be nice to have generators with better behaviour than simple congruential
generators, and in particular we would like generators where k−tuples appear uniformly distributed on
[0, 1]k for as high a k as possible (referred to as having a high k-distribution).

5.2 Building better generators


An alternative to the congruential generators are generators which focus on generating random sequences
of 0s and 1s. In some ways this seems to be the natural fundamental random number generation problem
when using modern digital computers, and at time of writing it also seems to be the approach that yields the
most satisfactory results. Such generators are often termed shift-register generators. The basic approach
is to use bitwise binary operations to make a binary sequence ‘scramble itself’. An example is Marsaglia’s
(2003) Xorshift generator as recommended in Press et al. (2007).
Let x be a 64-bit variable (i.e. an array of 64 0s or 1s). The generator is initialised by setting to any
value (other than 64 0s). The following steps then constitute one iteration (update of x)

x ← x ∧ (x >> a)
x ← x ∧ (x << b)
x ← x ∧ (x >> c)

each iteration generates a new random sequence of 0s and 1s. ∧ denotes ‘exclusive or’ (XOR) and >>
and << are right-shift and left shift respectively, with the integers a, b and c giving the distance to shift.
a = 21, b = 35 and c = 4 appear to be good constants (but see Press et al., 2007, for some others).
If you are a bit rusty on these binary operators then consider an 8-bit example where x=10011011 and
z=01011100.

• x<<1 is 00110110, i.e. the bit pattern is shifted leftwards, with the leftmost bit discarded, and the
rightmost set to zero.

• x<<2 is 01101100, i.e. the pattern is shifted 2 bits leftwards, which also entails discarding the 2
leftmost bits and zeroing the two rightmost.

• x>>1 is 01001101, same idea, but rightwards shift.


CHAPTER 5. RANDOM NUMBER GENERATION 66

• x^z is 11000111, i.e. a 1 where the bits in x and z disagree, and a 0 where they agree.

The Xorshift generator is very fast, has a period of 264 − 1, and passes the Diehard battery of tests
(perhaps unsurprising as Marsaglia is responsible for that too). These shift register generators suffer
similar granularity problems to congruential generators (there is always some k for which [0, 1]k can not be
very well covered by even 264 − 1 points), but tend to have all bit positions ‘equally random’, whereas lower
order bits from congruential generator sequences often have a good deal of structure.
Now we reach a fork in the road. To achieve better performance in terms of longer period, larger k-
distribution, and fewer low order correlation problems, there seem to be two main approaches, the first
pragmatic, and the second more theoretical.

1. Combine the output from several ‘good’, well understood, simple generators using operations that
maintain randomness (e.g. XOR and addition, but not multiplication). When doing this, the output
from the combined generators is never fed back into the driving generators. Preferably combine
rather different types of generator. Press et al. (2007) make a convincing case for this approach.
Wichmann-Hill, available in R is an example of such a combined generator, albeit based on 3 very
closely related generators.

2. Use more complicated generators — non-linear or with a higher dimensional state that just a single
Xi (see Gentle, 2003). For example, use a shift register type generator based on maintaining the
history of the last n bit-patterns, and using these in the bit scrambling operation. Matsumoto and
Nishimura’s (1998) Mersenne Twister is of this type. It achieves a period of 219937 − 1 (that’s not a
misprint: 219937 −1 is a ‘Mersenne prime’* ), and is 623-distributed at 32 bit accuracy. i.e. its 623-tuples
appear uniformly distributed (each appearing the same number of times in a full period) and spaced
2−32 apart (without the ludicrous period this would not be possible). It passes Diehard, is the default
generator in R, and C source code is freely available.

5.3 Uniform generation conclusions


I hope that the above has convinced you that random number generation, and use of pseudo-random
numbers are non-trivial topics that require some care. That said, most of the time, provided you pick a
good modern generator, you will probably have no problems. As general guidelines. . .

1. Avoid using black-box routines supplied with low level languages such as C, you don’t know what you
are getting, and there is a long history of these being botched.

2. Do make sure you know what method is being used to generate any uniform random deviates that
you use, and that you are satisfied that it is good enough for purpose.

3. For any random number generation task that relies on k−tuples having uniform distribution for high
k then you should be particularly careful about what generator you use. This includes any statistical
task that is somehow equivalent to high dimensional integration.

4. The Mersenne Twister is probably the sensible default choice in most cases, at present. For high
dimensional problems it remains a good idea to check answers with a different high-quality generator.
If results differ significantly, then you will need to find out why (probably starting with the ‘other’
generator).

Note that I haven’t discussed methods used by cryptographers. Cryptographers want to use (pseudo)random
sequences of bits (0s and 1s) to scramble messages. Their main concern is that if someone were to in-
tercept the random sequence, and guess the generator being used, they should not be able to infer the
state of the generator. Achieving this goal is quite computer intensive, which is why generators used for
cryptography are usually over engineered for simulation purposes.
* Numbers this large are often described as being ‘astronomical’, but this doesn’t really do it justice: there are probably fewer than 2270
atoms in the universe.
CHAPTER 5. RANDOM NUMBER GENERATION 67

5.4 Other deviates


Once you have a satisfactory stream of i.i.d. U (0, 1) deviates then generating deviates from other standard
distributions is much more straightforward. Conceptually, the simplest approach is inversion. We know that
if X is from a distribution with continuous c.d.f. F then F (X) ∼ U (0, 1). Similarly, if we define the inverse
of F by F − (u) = min(x|F (x) ≥ u), and if U ∼ U (0, 1), then F − (U ) has a distribution with c.d.f. F (this time
with not even a continuity restriction on F itself).
As an example here is inversion used to generate 1 million i.i.d. N (0, 1) deviates in R.

[Link](X <- qnorm(runif(1e6)))

## user system elapsed


## 0.072 0.000 0.072

For most standard distributions (except the exponential) there are better methods than inversion, and
the happy situation exists where textbooks tend to agree what these are. Ripley’s (1987) Chapter 3 is a
good place to start, while the lighter version is provided by Press et al. (2007) Chapter 7. R has many of
these methods built in. For example an alternative generation of 1 million i.i.d. N (0, 1) deviates would use.

[Link](Z <- rnorm(1e6))

## user system elapsed


## 0.068 0.000 0.068

Here the difference is marginal, due to the highly optimised qnorm function in R, but for other distribu-
tions the difference can be substantial. We have already seen how to generate multivariate normal deviates
given i.i.d. N (0, 1) deviates, and for other (simple) multivariate distributions see chapter 4 of Ripley (1987).
For more complicated distributions see the Computationally-Intensive Statistical Methods APTS module.

5.5 Further reading on pseudorandom numbers


• Gentle, J.E. (2003) Random Number Generation and Monte Carlo Methods (2nd ed), Springer, Chap-
ters 1 and 2 provides up to date summaries both of uniform generators, and of generator testing.
Chapter 3 covers Quasi Monte Carlo, and the remaining chapters cover simulation from other distri-
butions + software.

• Matsumoto, M. and Nishimura, T. (1998) Mersenne Twister: A 623-dimensionally equidistributed uni-


form pseudo-random number generator, ACM Transactions on Modeling and Computer Simulation,
8, 3-30. This gives details and code for the Mersenne Twister.

• [Link] gives the Mersenne Twister


source code.

• Marsaglia, G. (2003) Xorshift RNGs, Journal of Statistical Software, 8(14):1-6.

• DIEHARD is George Marsaglia’s Diehard battery of tests for random number generators: https:
//[Link]/wiki/Diehard_tests

• Press, W.H., S.A. Teukolsky, W.T. Vetterling and B.P Flannery (2007) Numerical Recipes (3rd ed.)
Cambridge. Chapter 7, on random numbers is good and reasonably up to date, but in this case you
do need the third edition, rather than the second (and certainly not the first).

• Ripley (1987, re-issued 2006) Stochastic Simulation, Wiley. Chapter’s 2 and 3 are where to go to
really get to grips with the issues surrounding random number generation. The illustrations alone are
CHAPTER 5. RANDOM NUMBER GENERATION 68

an eye opener. You probably would not want to use any of the actual generators discussed and the
remarks about mixed generators now look dated, but the given principles for choosing generators
remain sound (albeit with some updating of what counts as simple, and the integers involved). See
chapter 3 and 4 for simulating from other distributions, given uniform deviates.

• Gamerman, D. and H.F. Lopes (2006) Markov Chain Monte Carlo: Stochastic simulation for Bayesian
inference, CRC. Chapter 1 has a nice compact introduction to simulation, given i.i.d. U (0, 1) deviates.

• Devroye, L., (2003) Non-Uniform Random Variate Generation, Springer. This is a classic reference.
Now out of print, it is available on-line. [Link]
Chapter 6

Other topics

This module has concentrated on some topics in numerical analysis and numerical computation that are
quite important for statistics. It has not covered a great many other topics in numerical analysis that are also
important. Despite its name the module has made no attempt whatsoever to provide an overview of ‘statis-
tical computing’ more generally, which would surely have to talk about data handling, pattern recognition,
sorting, matching, searching, algorithms for graphs and trees, computational geometry, discrete optimi-
sation, the EM algorithm, dynamic programming, linear programming, fast Fourier transforms, wavelets,
support vector machines, function approximation, sparse matrices, huge data sets, graphical processing
units, multi-core processing and a host of other topics.
Rather than attempt the impossible task of trying to remedy this deficiency, I want to provide a couple
of examples of the ingenuity of the less numerical algorithms available, which serve, I hope, to illustrate the
value of looking to see what is available when confronted with any awkward computational problem.

6.1 An example: sparse matrix computation


Many statistical computations involve matrices which contain very high proportions of zeroes: these are
known as sparse matrices. Most of the computational time used by most numerical linear algebra routines
is devoted to additions, subtractions and multiplications of pairs of floating point numbers: if we know, in
advance, that one of the pair is a zero, we don’t actually need to perform the operation, since the answer
is also known in advance. In addition, there is little point in storing all the zeroes in a sparse matrix: it is
often much more efficient to store only the values of the non-zero elements, together with some encoding
of their location in the matrix.
Efficient storage and simple matrix manipulations are easy to implement. Here is an example, based
on simulating the model matrix for a linear model with 2 factors, and a large number of observations.

n <- 100000 ## number of obs


pa <- 40;pb <- 10 # numbers of factor levels
a <- factor(sample(1:pa,n,replace=TRUE))
b <- factor(sample(1:pb,n,replace=TRUE))
X <- [Link](~a*b) # model matrix a + b + a:b
y <- rnorm(n) # a random response
[Link](X) # X takes lots of memory!

## 326428560 bytes

sum(X==0)/prod(dim(X)) # proportion of zeros

## [1] 0.9906125

. . . so the simulated X is 100, 000 × 400 and is about 99% zeroes. It requires about 0.3Gb of computer
memory, if stored as a normal ‘dense’ matrix. The Matrix library, supplied with R, supports sparse matrix

69
CHAPTER 6. OTHER TOPICS 70

computations, so the following produces a sparse version of X. . .

library(Matrix)
Xs <- Matrix(X) ## a sparse version of X
## much reduced memory footprint
[Link]([Link](Xs))/[Link]([Link](X))

## [1] 0.03349829

Because Xs doesn’t store the zeroes, it uses much less memory than the dense version X. Note that
although X is only 1% non-zeroes, Xs uses 3% of the storage of X, because it needs to store the locations
of the non-zero elements, in addition to their values. The sparse approach also saves computer time when
performing basic matrix operations, by avoiding the floating point operations that are not needed because
they involve a zero. For example. . .

[Link](Xy <- t(X)%*%y)

## user system elapsed


## 0.282 0.065 0.228

[Link](Xsy <- t(Xs)%*%y)

## user system elapsed


## 0.004 0.004 0.008

Again computational overheads involved in only storing the non-zeroes, and their locations, means that
the cost of the sparse operation is more than 1% of the dense cost, but the saving is still substantial.
Any competent programmer could produce code for the examples given so far, but any serious applica-
tion will involve solving linear systems, and this is much less straightforward. Sparse versions of some of
the decompositions that we met in section 2 are needed, but naive implementations using sparse matrices
flounder on the problem of infill (or fill-in). You can see the problem in the following dense calculation. . .

[Link]({
XX <- t(X)%*%X
R <- chol(XX)
})

## user system elapsed


## 2.202 0.095 0.499

dim(XX)

## [1] 400 400

sum(XX == 0)/prod(dim(XX))

## [1] 0.97935

image(XX == 0)
CHAPTER 6. OTHER TOPICS 71

1.0
0.8
0.6
0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0

range(t(R)%*%R-XX)

## [1] -1.818989e-12 1.455192e-11

sum(R != 0) ## the upper triangle is dense (not good)

## [1] 80120

image(R == 0)
CHAPTER 6. OTHER TOPICS 72

1.0
0.8
0.6
0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0

R[1:5, 1:5] ## top left corner, for example

## (Intercept) a2 a3 a4 a5
## (Intercept) 316.2278 7.712795 7.962615 8.149190 8.051159
## a2 0.0000 48.780250 -1.258994 -1.288493 -1.272993
## a3 0.0000 0.000000 49.527888 -1.342901 -1.326747
## a4 0.0000 0.000000 0.000000 50.071220 -1.378683
## a5 0.0000 0.000000 0.000000 0.000000 49.758389

Although XT X is 98% zeroes, the upper triangle of its Cholesky factor is 0% zeroes, when computed in
the usual way. i.e. we have lost all sparsity. Fortunately, it is possible to re-arrange the rows and columns
of XT X (or any other matrix for which we need a sparse Cholesky factor), in a way that reduces infill in
the Cholesky factor. This pivoting means that we get a re-ordered version of the solution of the system
of interest: a minor inconvenience. Unfortunately, finding the optimal pivoting in terms of infill reduction is
NP hard, but there are never the less practical computational pivoting strategies that work well in practice.
These rely on some rather ingenious graph theory based symbolic analysis of the matrix to be factorised,
which is applied before anything numerical is computed. See Davis (2006). Here’s what happens when a
sparse Cholesky factorisation, with infill reducing pivoting, is used on XT X. . .

[Link]({
XXs <- t(Xs)%*%Xs
Rs <- chol(XXs,pivot=TRUE) ## pivot to reduce infill
})
## user system elapsed
## 0.030 0.000 0.029
range(t(Rs)%*%Rs-XXs[Rs@pivot,Rs@pivot]) ## factorization works!
## [1] -5.684342e-14 5.684342e-14
sum(Rs!=0) ## the upper triangle is sparse (much better)
## [1] 1729
CHAPTER 6. OTHER TOPICS 73

image([Link](Rs) == 0)
1.0
0.8
0.6
0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0

Rs[1:5,1:5] ## top left corner, for example

## 5 x 5 sparse Matrix of class "dtCMatrix"


## (Intercept) a2 a3 a4 a5
## (Intercept) 14.89966 . . . .
## a2 . 15.93738 . . .
## a3 . . 15.36229 . .
## a4 . . . 16.18641 .
## a5 . . . . 15.96872

with sparsity maintained in the factor, sparse backward or forward substitution is also very efficient.
Note that while the sparse Cholesky routines in the Matrix package are state-of-the-art, the QR rou-
tines are not, and carry heavy memory overheads associated with computation of Q. You’ll probably need
to look for sparse multi-frontal QR routines if you need sparse QR.

6.2 Hashing: another example


Suppose that you want to store data, which is to be accessed using arbitrary keys. For example, you want
to store data on a large number of people, using their names (as a text string) as the key for finding the
record in memory. i.e., data about someone is read in from somewhere, and is stored at a location in
computer memory, depending on what their name string is. Later you want to retrieve this information using
their name string. How can you actually do this?
CHAPTER 6. OTHER TOPICS 74

You could store some sort of directory of names, which would be searched through to find the location
of the data. But such a search is always going to take some time, which will increase with the number of
people’s data that is stored. Can we do better? With hashing the answer is yes. The pivotal idea is to
create a function which will turn an arbitrary key into an index which is either unique or shared with only a
few other keys, and has a modest predefined range. This index gives a location in a (hash) table which in
turn stores the location of the actual data (plus some information for resolving ‘clashes’). So given a key,
we use the (hash) function to get its index, and its index to look up the data location, with no, or very little
searching. The main problem is how to create a hash function.
An elegant general strategy is to use the ideas of pseudorandom number generation to turn keys into
random integers. For example, if the key were some 64bit quantity, we could use it to seed the Xorshift
random number generator given earlier, and run this forward for a few steps. More generally we would
need an approach that successively incorporated all the bits in a key into the ‘random number’ generation
process (see e.g. section 7.6 of Press et al., 2007, for example code). So, the hash function should return
numbers that appear random, which will mean that they tend to be nicely spread out, irrespective of what
the keys are like. Of course, despite this, the function always returns the same number for a given key.
It remains to turn the number returned by the hash function, h, into an index, ih , somewhere between
0 and the length, nh , of the hash table minus 1. ih = h mod nh is the usual approach. At position ih in the
hash table we might store the small number of actual keys (or possibly just their h values) sharing the ih
value, and the actual location in memory of the corresponding data.
The really neat thing here is that the randomness of the h’s means that whatever the keys look like, you
usually end up with a fairly uniform distribution of ih ’s, and only a small number of keys sharing each slot in
the hash table. It is this that makes data retrieval so efficient, relative to searching a directory of keys.
Note that R environments are themselves methods for associating values with lookup keys, and in fact,
R environments can use hash tables.

ht <- [Link](hash=TRUE)
ht[["foo"]] = "bar"
ht[["baz"]] = 1:10
ht[["foo"]]

## [1] "bar"

ht$baz

## [1] 1 2 3 4 5 6 7 8 9 10

If that doesn’t sound like it has any application in statistics, consider the construction of triangulations
in spatial statistics (a triangulation being a way of joining a set of points up with triangles, in such a way
that the whole complex hull of the points is covered in triangles). Efficient construction of triangulations is
tricky, and several efficient algorithms work by incrementally adding triangles. You usually know at least
one neighbour of each triangle at its creation, but you actually need to know all its neighbours in the existing
set. The task of finding those neighbours would be prohibitive if you had to search each existing triangle
for edges shared with the new triangle. Instead one can maintain a hash table for unpaired triangle edges,
where the hash key is based on the end co-ordinates of each edge. This allows any newly created edge to
be paired with any existing edge in little more than the length of time it takes to create the hash key.

6.3 Further reading


1. For a good introduction to the sparse methods underlying Martin Maechler and Douglas Bates’ R
package Matrix, see Davis, T. A. (2006) Direct Methods for Sparse Linear Systems, SIAM.

2. Tim Davis’s home page is worth a look for more recent sparse matrix developments
[Link]
CHAPTER 6. OTHER TOPICS 75

3. If you think hashing is clever, take a look at the rest of Cormen, T., C.E. Leiserson, R.L. Rivest and C.
Stein (2001) Introduction to Algorithms (2nd ed.) MIT.

4. Once again, Press, W.H., S.A. Teukolsky, W.T. Vetterling and B.P Flannery (2007) Numerical Recipes
(3rd ed.) provides a readable introduction to a much wider range of topics in scientific computing than
has been covered here.

You might also like