Statistical Computing in R
Statistical Computing in R
2. y = (1 2 3 5)T
Setting up
3. d = ∑ ( x − y )2
• Open the R script in an editor, eg notepad,
> m<-x-y
> n<-m^2
> o<-sum(n)
1
> d<-sqrt(o) • Mouse click to set cursor position
> 2*(y[1]+x[1])
• Delete to remove and re-type parts of
command
Creating vectors
The simplest way to creating a vector in R is done by using the c() command (short for concatenate).
For example, to create the vector v = [2 4 6.8 3]T
> v
[1] 2 4 6.8 3
If the elements of the vector are consecutive in nature, then there are more concise ways to
create vectors. Using the : command, a vector can be created from a starting point to an
ending point, with a step size of one. For example, the command 1:10 creates a vector from 1
to 8, stepping one at a time.
> x=c(1:10)
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x < 4
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2
> mean(x) # calculates the mean of the entries in x
[1] 5.5
> var(x) # calculates the variance of the entries in x
[1] 9.166667
> sum(x)
[1] 55
> cumsum(x) # check what this function is doing…
[1] 1 3 6 10 15 21 28 36 45 55
> median(x)
[1] 5.5
> length(x)
[1] 10
> sum(y) # logical values can be 'coerced' to numeric, where FALSE
= 0 and TRUE = 1
[1] 3
> # - here the sum is a count of 'TRUE's
The last example can be modified to calculate proportions or percentages. For example, if y is
a set of logical variables indicating whether plants are heavily infected,
> sum(y)/length(y)
[1] 0.3
gives the proportion of plants heavily infected. sum(y) gives the number of plants infected
and length(y) gives the total number of plants.
The help function gives information about how a function works. For example, help(mean).
ls()
# gives a list of all the objects in your working directory
rm()
# removes the objects indicated.
# For example the object x is removed by
rm(x)
You might want to clear out old objects that you don’t need any more to avoid confusion.
To create a matrix in R, you use the matrix() command. For instance, the matrix
1 4 7
A = 2 5 8 is created using either of the following commands:
3 6 9
> A<-matrix(c(1,2,3,4,5,6,7,8,9),3,3)
> A
[,1] [,2] [,3]
3
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Or
> A<-matrix(1:9,3,3)
> v<-c(1,2,4,2,3,4,4,32,33,2,4)
> v[5]
[1] 3
> v[5:9]
[1] 3 4 4 32 33
Extracting elements from a matrix is similar, but you specify two values instead of one. For
example, to extract the first column, the second row, or the element in the second row and the
third column of the matrix A
> A[,1]
[1] 1 2 3
> A[2,]
[1] 2 5 8
> A[2,3]
[1] 8
A[1:2,1]# check this?
Operations on Matrices
Common matrix operations are included in R, such as matrix addition, subtraction, or
multiplication, as well as other types of operations. R performs the common addition and
subtraction of matrices with the + and the - commands. However, matrix multiplication is not
performed by the * command, but rather the %*% command. The * command will perform
the element-wise multiplication.
> B<-matrix(2:5,2,2)
> C<-matrix(5:2,2,2)
> B
[,1] [,2]
[1,] 2 4
[2,] 3 5
> C
[,1] [,2]
4
[1,] 5 3
[2,] 4 2
> B-C #element-wise subtraction
[,1] [,2]
[1,] -3 1
[2,] -1 3
> B*C #element-wise multiplication
[,1] [,2]
[1,] 10 12
[2,] 12 10
> B+C #element-wise addition
[,1] [,2]
[1,] 7 7
[2,] 7 7
> B%*%C #matrix multiplication
[,1] [,2]
[1,] 26 14
[2,] 35 19
Other basic operations needed, such as the determinant, the transpose, the inverse, the
dimension of a matrix, augmenting matrices, or creating an identity matrix can be performed
in R. Examples of these commands are:
> det(B) #the determinant
[1] -2
> t(B) #the transpose
[,1] [,2]
[1,] 2 3
[2,] 4 5
> solve(B) #matrix inverse of B
[,1] [,2]
[1,] -2.5 2
[2,] 1.5 -1
> sum(diag(B)) #the trace
[1] 7
> dim(B) #matrix size
[1] 2 2
> cbind(B,C) #augment column-wise
[,1] [,2] [,3] [,4]
[1,] 2 4 5 3
[2,] 3 5 4 2
> rbind(B,C) #augment row-wise
[,1] [,2]
[1,] 2 4
[2,] 3 5
[3,] 5 3
[4,] 4 2
> diag(3) #a 3x3 identity
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
5
[3,] 0 0 1
Matrix Factorizations
Several matrix factorizations are possible using the current core version of R, with others
possible by installing new packages. The Eigen-decomposition is one of the most common
decompositions in a linear algebra course. The eigen() command extracts both the
eigenvectors and the eigenvalues. Storing the result of the eigen() command allows access
to both the eigenvectors and the eigenvalues without recomputing them:
> A<-matrix(1:9,3,3)
> temp<-eigen(A) #store both the eigenvalues and vectors in
temp
> temp #display everything in temp
$values
[1] 1.611684e+01 -1.116844e+00 -4.054215e-16
$vectors
[,1] [,2] [,3]
[1,] -0.4645473 -0.8829060 0.4082483
[2,] -0.5707955 -0.2395204 -0.8164966
[3,] -0.6770438 0.4038651 0.4082483
> temp$values #displays only the $values of the object temp
[1] 1.611684e+01 -1.116844e+00 -4.054215e-16
> temp$vectors #displays only the $vectors of the object temp
[,1] [,2] [,3]
[1,] -0.4645473 -0.8829060 0.4082483
[2,] -0.5707955 -0.2395204 -0.8164966
[3,] -0.6770438 0.4038651 0.4082483
Graphics
Simple Plotting
Producing a simple x-y plot, or scatter plot, is easy in R. Use help(plot) to get a description
of options available.
To draw a line through the points, from the smallest value of DisSev to the largest value of
DisSev, the necessary ordering of the values can be determined using the order command
and then DisSev and RelYield can be indexed by these ordered values. The first command
lines provide an introduction to order and the related command rank.
6
# Set up a simple example vector
order(ex8)
# First the 2nd entry must be drawn, then the 3rd, then the 1st.
# In contrast, the rank command gives the rank of the entries
# from smallest to largest.
rank(ex8)
# The first entry comes 3rd, the next entry comes first, and
# the next entry comes second.
# Now use the order command and plot the DisSev data.
Advanced plotting
Suppose you would like to use colour in your plots to make them easier to read or more
attractive. Colours are easily added to plots using the col='color name' statement in the
plot() command.
To find a list of all colors available for use in R graphics type colors() Likewise, calling
different numbers for pch in the plot command produces different shapes for the plotting
points.
To add subscripts to your x or y labels enclose the text in [] brackets, for example H[2]O. To
superscript text use a tilde, ~, before the text and use the ^ before the part of the text to be
raised, for example ~ cm^3.
Sometimes it's useful to specify the range for values represented in a graph rather than just
using the default ranges produced by R. You can control the range of values on the x
(horizontal) and y (vertical) axes using xlim and ylim.
Variations on the plot() function are called when it is applied to different types of objects.
For example, when plot() is applied to categorical data, a boxplot is produced. When
applied to the output from the linear regression function lm, plot() produces a set of four
regression diagnostic plots.
R has interactive features, too. To see the ID for three points, try clicking on three points on
the graph after entering the following:
identify(DisSev,RelYield, n=3)
To get the (x,y) coordinates for any four locations on the graph, try clicking any four places
on the graph after entering locator(4). The coordinates will appear after all the locations
have been clicked on.
You can also plot a particular function instead of just a set of variables.
curve(x^2,from=0,to=10,xlab='Time',ylab='Population size')
or
curve(exp(x),from=0,to=10,xlab='Time',ylab='Population size')
Loops
A loop can be used to perform the same function repeatedly. For example:
1. for(i in 1:10){} sets up the loop to run over ten values of i, 1, 2, 3, ...,and 10 for
whatever commands are entered in the curly brackets {}.
8
2. x[i] indicates the ith entry in x. When i = 1, x[i] is the same as x[1]. When i = 2,
x[i] is the same as x[2].
3. x[i] <- x[i] + 1
x <- x + 1
since R will perform the function on each entry of x.
The use of a temporary variable named i is arbitrary, another name could be used.
You might choose to apply a function only to entries for which a certain criterion is true.
Here the if() statement controls which entries in the vector x are modified.
In this case, only those entries that are less than 6 have 1 added.
check#
rbind(dai,pinf)
cbind(dai,pinf)
9
Compare to
rbind(c(dai,pinf))
z5 <- array(0,c(5,3))
z6 <- array(-1,c(5,3,2))
Data Structures
Venables et al. (2007) recommend keeping variables in data frames for easier organization.
See what df5 looks like
Data frames need to be composed of objects with the same length. If objects don't have the
same length, you might get an error or you may get surprising results. Check the following
example.
The above command assigns the sequence of numbers between 1 and 10 to the object named
var1.
Did df6 look the way you expected it to? This shows how different variable length can
produce confusing outcomes when the [Link] command is used.
You can indicate dai within df5 by df5$dai, df5[1], or df5[[1]]. The name dai is only
maintained with the call df5[1]
Lists can also be used to gather objects together and the objects don't need to be of the same
length.
[[2]]
[1] 1 2
The interactive R environment in Windows has a data editing window available through the
Menu under Edit.
R users can construct data sets in programs such as Microsoft Excel or Notepad and then
bring those files into R for further analyses. Data input in R is handled through a series of
options, including:
• [Link]
• [Link]
• read.csv2
• [Link]
• read.delim2
For further information on the differences amongst these methods, use the help() or ?
options in R, for example: ?[Link].
11
The next thing we do is take a look at the data. We first plot the data using a scatter plot and notice
that it looks linear. To confirm our suspicions we then find the correlation between the year and the
mean interest rates:
library(ggplot2)
scatterplots
Histogram
Eg
At this point we should be excited because associations that strong never happen in the real
world unless you cook the books or work with averaged data. The next question is what
straight line comes "closest" to the data? In this case we will use least squares regression as
one way to determine the line.
Before we can find the least square regression line we have to make some decisions. First we
have to decide which is the explanatory and which is the response variable. Here, we
arbitrarily pick the explanatory variable to be the year, and the response variable is the
interest rate. This was chosen because it seems like the interest rate might change in time
rather than time changing as the interest rate changes.
Since we specified that the interest rate is the response variable and the year is the
explanatory variable this means that the regression line can be written in slope-intercept
form:
12
rate = (intercept) + (slope) year
The way that this relationship is defined in the lm command is that you write the vector
containing the response variable, a tilde ("~"), and a vector containing the explanatory
variable:
Coefficients:
(Intercept) year
1419.208 -0.705
When you make the call to lm it returns a variable with a lot of information in it. If you are
just learning about least squares regression you are probably only interested in two things at
this point, the slope and the y-intercept. If you just type the name of the variable returned by
lm it will print out this minimal information to the screen. (See above.)
If you would like to know what else is stored in the variable you can use the attributes
command:
> attributes(fit)
$names
[1] "coefficients" "residuals" "effects" "rank"
[5] "[Link]" "assign" "qr" "[Link]"
[9] "xlevels" "call" "terms" "model"
$class
[1] "lm"
One of the things you should notice is the coefficients variable within fit. You can print out
the y-intercept and slope by accessing this part of the variable:
> fit$coefficients[1]
(Intercept)
1419.208
> fit$coefficients[[1]]
[1] 1419.208
> fit$coefficients[2]
year
-0.705
> fit$coefficients[[2]]
[1] -0.705
Note that if you just want to get the number you should use two square braces. So if you want
to get an estimate of the interest rate in the year 2015 you can use the formula for a line:
> fit$coefficients[[2]]*2015+fit$coefficients[[1]]
[1] -1.367
13
So if you just wait long enough, the banks will pay you to take a car!
A better use for this formula would be to calculate the residuals and plot them:
That is a bit messy, but fortunately there is an easier way to get the residuals:
> residuals(fit)
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212
If you want to plot the regression line on the same plot as your scatter plot you can use the
abline function along with your variable fit:
Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an
F-test by asking R for a summary of the fit variable:
> summary(fit)
Call:
lm(formula = rate ~ year)
Residuals:
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1419.20800 126.94957 11.18 0.00153 **
year -0.70500 0.06341 -11.12 0.00156 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Exercise
Load the data that you have (Oxygen and Hydrocarbon levels)
14
Reading Data into R
R users can construct data sets in programs such as Microsoft Excel or Notepad and then
bring those files into R for further analyses. Data input in R is handled through a series of
options, including:
• [Link]
• [Link]
• read.csv2
• [Link]
• read.delim2
For further information on the differences amongst these methods, use the help() or ?
options in R, for example: ?[Link].
or
> oxygen
> library(MASS)
The truehist() produces a smooth histrogram especially when you have large data sets other wise just use
hist() command
Scatter plot
15
Then fit the linear model
> summary(RegModel.1)
> RegModel.1$coefficients
Note that if you just want to get the number you should use two square braces.
A better use for this formula would be to calculate the residuals and plot them:
> res
That is a bit messy, but fortunately there is an easier way to get the residuals:
> residuals(RegModel.1)
If you want to plot the regression line on the same plot as your scatter plot you can use the abline
function along with your variable fit:
> plot(oxygen$[Link],oxygen$Purity)
> abline(RegModel.1)
> anova(RegModel.1)
Confidence intervals
Graphical diagnostics
> par(mfrow=c(2,2))
> plot(RegModel.1)
Exercise: Exclude the 9th point then fit again and examine the residuals
> library(scatterplot3d)
> summary(RegModel.2)
> summary(RegModel.3)
> summary(RegModel.4)
> summary(RegModel.5)
Confidence intervals
> confint(RegModel.2)
> confint(RegModel.3)
> confint(RegModel.4)
> confint(RegModel.5)
> confint(RegModel.6)
> anova(RegModel.2)
> anova(RegModel.3)
> anova(RegModel.4)
> anova(RegModel.5)
17
X1 captures more variations in the response than x2. The model with an interaction effect between x1
and x2 is even better.
Diagonstics
> plot(RegModel.2)
> par(mfrow=c(2,2))
> plot(RegModel.3)
> plot(RegModel.4)
> plot(RegModel.5)
> par(mfrow=c(1,1))
Matrix method
In linear regression, we must evaluate the equation b = (X'X)-1X'Y to obtain the estimated regression
parameters. Here Y is the vector of the values of the response variable, and X is the design matrix
consisting of a column of ones and a column for the values of each predictor variable. If X and Y
have already been created in R, then we can evaluate the equation in R like so:
> Y=([Link]$y)
> C0<-matrix(1,25,1)
> C1=([Link]$x1)
> C2=([Link]$x2)
> X<-cbind(C0,C1,C2)
> b
[,1]
2.26379143
C1 2.74426964
C2 0.01252781
18
Then the vector b will contain the estimated regression coefficients.
Dummy variables
> Y<-tool$yi
> C0<-matrix(1,20,1)
> C1.1<-matrix(0,10,1)
> C1.2<-matrix(1,10,1)
> C1<-tool$RPM
> C2<-rbind(C1.1,C1.2)
> C2
> X<-cbind(C0,C1,C2)
> b
19
Simulation
return (u[-1])
}
# * normal = rnorm
# * poisson = rpois
# * binomial = rbinom
# * gamma = rgamma
# * look at [Link](keyword="distribution") for more
#
# Always check with the documentation that the distribution
# is parameterised the way you expect. Eg. for normal, standard
# deviation, not variance.
replicate(100, mean(rnorm(100)))
replicate(100, mean(rnorm(10)))
20
# function to get a permutation of the indices 1 ... n
N <- 1000
n <- 10
table(replicate(N, H(perm(n))))/N
Write a code in R to show the areas above 1.96 the standard normal density function. The
diagram should:
Leave room on the left margin
1 − 2z2
Display the area between -4 and +4 on the x-axis and label it Z, label the y axis as φ ( z ) = e and
2π
the Main title: The Standard Normal Density Function φ ( z ) .
To the plot obtained in above, insert the grid line at x = 0 and y = 0 in gray colour, define the
area equal to or greater than 1.96 as well as shading it in gray.
abline(h=0, col="gray")
abline(v=0, col="gray")
z0 <- z[z >= 1.96] # define region to fill
z0 <- c(z0[1], z0)
p0 <- p[z >= 1.96]
p0 <- c(0, p0)
polygon(z0, p0, col="gray")
21
∞
Point and label the area shaded above as ∫1.96φ ( z ) dz = 0.025 with the text being above the arrow.
2
z
−
e
2π
0.2
1
φ (z) =
∞
⌠ φ (z)dz = 0.025
0.1
⌡1.96
0.0
-4 -2 0 2 4
Using the R language, write a function which computes cumulative logistic probabilities
plogit = 1/(1+exp(-x))and describe how you would generate the plot given below
in R
Graphics Solutions
The first graph:
plogit <- function(x) 1/(1+exp(-x))
eta <- seq(-10, 10, len=100)
p1 <- plogit(eta-1)
p2 <- plogit(eta+1)
p3 <- plogit(eta+4.5)
plot(c(-10,10), range(p1,p2,p3), type="n", axes=FALSE, xlab="x", ylab="Pr(y
> j)")
axis(2)
box()
abline(h=c(0,1), col="gray")
lines(eta, p1, lwd=2)
lines(eta, p2, lwd=2)
lines(eta, p3, lwd=2)
coords <- locator(2)
arrows(coords$x[1], coords$y[1], coords$x[2], coords$y[2], code=1,
length=0.125)
text(coords$x[2], coords$y[2], pos=3, "Pr(y > 1)")
coords <- locator(2)
22
arrows(coords$x[1], coords$y[1], coords$x[2], coords$y[2], code=1,
length=0.125)
text(coords$x[2], coords$y[2], pos=3, "Pr(y > 2)")
coords <- locator(2)
arrows(coords$x[1], coords$y[1], coords$x[2], coords$y[2], code=1,
length=0.125)
text(coords$x[2], coords$y[2], pos=3, "Pr(y > 3)")
1.0
0.8
Pr(y > 1)
Pr(y > 3)
0.6
Pr(y > 2)
Pr(y >j)
0.4
0.2
0.0
Examining Distributions
A common requirement is for biostatisticians to fit a probability distribution to some biological data.
This section provides some simple examples of how we might achieve this in R.
First we need some data. For this example we shall create a random sample (n=100,000) of data from
a Gamma distribution.
n <- 100000
sample <- rgamma(n, rate=0.5, shape=5)
We will then examine this distribution via a histogram. We could use the standard function hist,
to create the graph, but the rule that this function uses to produce “nice” cut-points between the
bins often produces too few bins. Whilst this is adjustable via the breaks parameter, I find the
function truehist in the MASS package produces better default results. By default the y-axis
shows the density of observations.
library(MASS)
truehist(sample, main="Histogram of sample")
23
Histogram of sample
0.10
0.08
0.06
0.04
0.02
0.00
0 10 20 30 40
sample
Observing that the data has a heavy tail, we might consider a simple transformation to normalise the
data, such as taking logarithms. If the data comes from a lognormal population, we would expect a
nice bell curve.
24
Histogram of log(sample)
0.8
0.6
0.4
0.2
0.0
-1 0 1 2 3
log(sample)
The log transform doesn’t appear to work well here. An alternative is to consider a power
transformation of the form xn, known as a Box-Cox transformation. We can find a suitable power via
the command: boxcox(sample ~ 1).
Instead we shall try fitting some other heavy tailed distributions such as the Weibull, Gamma. We
could fit distributions manually via method of moments or use maximum likelihood estimation.
Oddly, R doesn’t have standard functions for fitting distributions. However methods are available in
various add on packages.
• mle – in the stats4 package. Requires the user to define the log-likelihood function manually.
• fitdistr – in the MASS package.
As the fitdistr function requires less information from the user, we shall use this one. For
example:
fitdistr(sample, "Weibull")
25
shape scale
2.360351150 11.277036784
( 0.005528084) ( 0.015984577)
The output is an object of the “fitdistr” class. This is essentially a list containing two vectors,
“estimate” “sd”. We can plot density curves for the fitted distribution via the following code:
# Fit distributions
fit.w <- fitdistr(sample, "Weibull")
fit.g <- fitdistr(sample, "gamma")
Histogram of sample
0.10
Gamma distn
Weibull distn
0.08
0.06
0.04
0.02
0.00
0 10 20 30 40
sample
26
curve(pweibull(x, shape=fit.w$estimate[1],
scale=fit.w$estimate[2]), add=T, col=2)
curve(pgamma(x, shape=fit.g$estimate[1],
rate=fit.g$estimate[2]), add=T, col=3)
legend(30, 0.6, lwd=1, col=1:3,
legend=c("Empirical CDF", "Weibull CDF", "Gamma CDF"))
It is quite clear that the Gamma distribution is most appropriate for this data. However an
alternative graphical measure is to consider a QQ-plot, which plots the empirical quantiles of the
data to the theoretical quantiles of the fitted distribution. If the distribution is appropriate, we
would expect the points to follow a straight line.
This is very simple to achieve manually, but the function [Link] included in the car library is
quite neat. For example:
library(car)
par(mfrow = c(1, 2))
[Link](sample, distribution="weibull",
shape=fit.w$estimate[1], scale=fit.w$estimate[2])
title(main="QQ plot of sample against \n Weibull quantiles",
[Link]=0.9)
[Link](sample, distribution="gamma",
shape=fit.g$estimate[1], rate=fit.g$estimate[2])
title(main="QQ plot of sample against \n Gamma quantiles",
[Link]=0.9)
27
The par command, sets the graphical parameter mfrow, so that we are able to draw two graphs
side by side. See the help file for further details. As discussed in section 2.5, the “\n” character
represents a new line.
In the resulting graphs, below, 95% confidence intervals are included around the expected straight
line. As the points in the first graph clearly do not follow this line, the Weibull distribution is not
appropriate. However, the second graph indicates that the Gamma distribution is indeed a good fit
of the data.
28