0% found this document useful (0 votes)
2 views28 pages

Statistical Computing in R

This document serves as an introduction to the R programming language, covering fundamental concepts such as syntax, data manipulation, and basic graphics. It includes instructions for setting up R, creating vectors and matrices, performing statistical analyses, and generating plots. The document also highlights key functions and commands for data examination and manipulation within R.

Uploaded by

alvesphelix
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views28 pages

Statistical Computing in R

This document serves as an introduction to the R programming language, covering fundamental concepts such as syntax, data manipulation, and basic graphics. It includes instructions for setting up R, creating vectors and matrices, performing statistical analyses, and generating plots. The document also highlights key functions and commands for data examination and manipulation within R.

Uploaded by

alvesphelix
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Introduction to R The R language

Objective • Learning a new language: grammar,


vocabulary
Become computationally proficient to do
statistical data analysis • Loading, examining, summarizing data

• Getting Started & Setup • Creating data

• R basics (refresher) Learning a language

• Syntax You must think in that language

• Examining Objects Examples

• Extracting Parts 1. x=2/3 x <- 2 / 3


2. x sqrt(x)
• Basic Graphics: 3. a = 2(x + 3)2
a <- 2 *(x + 3)^2
Scatterplots, Histograms, Boxplots 4. y = (1 2 3 5)T
y <- c(1, 2, 3, 5)
5. y1 y[1]
Install R
• On your own machine, go to [Link]
6. ∑ y sum(y)
[Link]/ 7. 2y 2*y
8. f(y, 2) = 2y
• From CRAN, pick download site (ISU might f <- function(x, y)
be good) return(x*y)
f(y, 2)
• Download from base:

• Run the installation script


Try this on your own
• Start R by double-clicking the icon
1. x = (4 1 3 9)T

2. y = (1 2 3 5)T
Setting up
3. d = ∑ ( x − y )2
• Open the R script in an editor, eg notepad,

wordpad, emacs, ... 4. 2(y1 + x1)

• Edit lines solution

• Cut and paste lines of code into the R > x<-c(4,1,3,9)

interpreter window > y<-c(1,2,3,5)

> m<-x-y

> n<-m^2

> o<-sum(n)
1
> d<-sqrt(o) • Mouse click to set cursor position
> 2*(y[1]+x[1])
• Delete to remove and re-type parts of
command

Navigating the R interpreter window

• Up/down arrow keys to retrieve previous Getting Out


•q()
lines

• Left/right arrow keys to move cursor along


line

Creating vectors
The simplest way to creating a vector in R is done by using the c() command (short for concatenate).
For example, to create the vector v = [2 4 6.8 3]T

That is, v<- c(2, 4, 6.8, 3)

> v

[1] 2 4 6.8 3

A vector in R is always displayed as a row vector. However, R treats it as a column vector as


well. The “[1]" indicates the element number of the element directly to the right of [1].
When vectors span multiple lines, then each subsequent line will have a similar indicator.

If the elements of the vector are consecutive in nature, then there are more concise ways to
create vectors. Using the : command, a vector can be created from a starting point to an
ending point, with a step size of one. For example, the command 1:10 creates a vector from 1
to 8, stepping one at a time.

> x=c(1:10)

> x

[1] 1 2 3 4 5 6 7 8 9 10

Logical variables can also be assigned to objects, for example:

> x < 4

[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Define the object y<- x < 4

R includes many functions, especially for statistical summaries and analyses.

Try these functions on x and other objects that you create.

2
> mean(x) # calculates the mean of the entries in x
[1] 5.5
> var(x) # calculates the variance of the entries in x
[1] 9.166667
> sum(x)
[1] 55
> cumsum(x) # check what this function is doing…
[1] 1 3 6 10 15 21 28 36 45 55
> median(x)
[1] 5.5
> length(x)
[1] 10
> sum(y) # logical values can be 'coerced' to numeric, where FALSE
= 0 and TRUE = 1
[1] 3
> # - here the sum is a count of 'TRUE's

The last example can be modified to calculate proportions or percentages. For example, if y is
a set of logical variables indicating whether plants are heavily infected,

> sum(y)/length(y)

[1] 0.3

gives the proportion of plants heavily infected. sum(y) gives the number of plants infected
and length(y) gives the total number of plants.

The help function gives information about how a function works. For example, help(mean).

ls()
# gives a list of all the objects in your working directory
rm()
# removes the objects indicated.
# For example the object x is removed by
rm(x)

You might want to clear out old objects that you don’t need any more to avoid confusion.

To create a matrix in R, you use the matrix() command. For instance, the matrix
1 4 7 
A =  2 5 8  is created using either of the following commands:
 3 6 9 
> A<-matrix(c(1,2,3,4,5,6,7,8,9),3,3)
> A
[,1] [,2] [,3]
3
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Or
> A<-matrix(1:9,3,3)

Extracting Elements from a Vector or Matrix


Extracting elements is an important operation. There are several commands to do so. To
extract the eighth element of a vector, one would type

> v<-c(1,2,4,2,3,4,4,32,33,2,4)
> v[5]
[1] 3
> v[5:9]
[1] 3 4 4 32 33

Or a complement of a sequence of elements:


> v[-c(8,9,10)]
[1] 1 2 4 2 3 4 4 4
You can even select elements based on a comparison
> v[v>12]
[1] 32 33

Extracting elements from a matrix is similar, but you specify two values instead of one. For
example, to extract the first column, the second row, or the element in the second row and the
third column of the matrix A
> A[,1]
[1] 1 2 3
> A[2,]
[1] 2 5 8
> A[2,3]
[1] 8
A[1:2,1]# check this?

Operations on Matrices
Common matrix operations are included in R, such as matrix addition, subtraction, or
multiplication, as well as other types of operations. R performs the common addition and
subtraction of matrices with the + and the - commands. However, matrix multiplication is not
performed by the * command, but rather the %*% command. The * command will perform
the element-wise multiplication.

> B<-matrix(2:5,2,2)
> C<-matrix(5:2,2,2)
> B
[,1] [,2]
[1,] 2 4
[2,] 3 5
> C
[,1] [,2]

4
[1,] 5 3
[2,] 4 2
> B-C #element-wise subtraction
[,1] [,2]
[1,] -3 1
[2,] -1 3
> B*C #element-wise multiplication
[,1] [,2]
[1,] 10 12
[2,] 12 10
> B+C #element-wise addition
[,1] [,2]
[1,] 7 7
[2,] 7 7
> B%*%C #matrix multiplication
[,1] [,2]
[1,] 26 14
[2,] 35 19

Other basic operations needed, such as the determinant, the transpose, the inverse, the
dimension of a matrix, augmenting matrices, or creating an identity matrix can be performed
in R. Examples of these commands are:
> det(B) #the determinant
[1] -2
> t(B) #the transpose
[,1] [,2]
[1,] 2 3
[2,] 4 5
> solve(B) #matrix inverse of B
[,1] [,2]
[1,] -2.5 2
[2,] 1.5 -1
> sum(diag(B)) #the trace
[1] 7
> dim(B) #matrix size
[1] 2 2
> cbind(B,C) #augment column-wise
[,1] [,2] [,3] [,4]
[1,] 2 4 5 3
[2,] 3 5 4 2
> rbind(B,C) #augment row-wise
[,1] [,2]
[1,] 2 4
[2,] 3 5
[3,] 5 3
[4,] 4 2
> diag(3) #a 3x3 identity
[,1] [,2] [,3]
[1,] 1 0 0
[2,] 0 1 0
5
[3,] 0 0 1

Matrix Factorizations
Several matrix factorizations are possible using the current core version of R, with others
possible by installing new packages. The Eigen-decomposition is one of the most common
decompositions in a linear algebra course. The eigen() command extracts both the
eigenvectors and the eigenvalues. Storing the result of the eigen() command allows access
to both the eigenvectors and the eigenvalues without recomputing them:
> A<-matrix(1:9,3,3)
> temp<-eigen(A) #store both the eigenvalues and vectors in
temp
> temp #display everything in temp
$values
[1] 1.611684e+01 -1.116844e+00 -4.054215e-16

$vectors
[,1] [,2] [,3]
[1,] -0.4645473 -0.8829060 0.4082483
[2,] -0.5707955 -0.2395204 -0.8164966
[3,] -0.6770438 0.4038651 0.4082483
> temp$values #displays only the $values of the object temp
[1] 1.611684e+01 -1.116844e+00 -4.054215e-16
> temp$vectors #displays only the $vectors of the object temp
[,1] [,2] [,3]
[1,] -0.4645473 -0.8829060 0.4082483
[2,] -0.5707955 -0.2395204 -0.8164966
[3,] -0.6770438 0.4038651 0.4082483

Graphics
Simple Plotting

Producing a simple x-y plot, or scatter plot, is easy in R. Use help(plot) to get a description
of options available.

DisSev <- c(0,12,67,34,23,28)


RelYield <- c(100,103,79,89,85,97)
plot(x=DisSev, y=RelYield, xlab='Disease Severity (%)',ylab='Relative Yield
(%)')

To draw a line through the points, from the smallest value of DisSev to the largest value of
DisSev, the necessary ordering of the values can be determined using the order command
and then DisSev and RelYield can be indexed by these ordered values. The first command
lines provide an introduction to order and the related command rank.
6
# Set up a simple example vector

ex8 <- c(103,101,102)

# The order command give the order in which entries


# must be drawn to put them in order from smallest to largest

order(ex8)

# First the 2nd entry must be drawn, then the 3rd, then the 1st.
# In contrast, the rank command gives the rank of the entries
# from smallest to largest.

rank(ex8)

# The first entry comes 3rd, the next entry comes first, and
# the next entry comes second.

# Now use the order command and plot the DisSev data.

DisSevOrd <- order(DisSev)


plot(DisSev[DisSevOrd],RelYield[DisSevOrd], type='l')

Advanced plotting

Using color in plots

Suppose you would like to use colour in your plots to make them easier to read or more
attractive. Colours are easily added to plots using the col='color name' statement in the
plot() command.

plot(DisSev, RelYield, xlab='Disease Severity', ylab='Relative Yield


(%)', col='orange', pch=15, )

To find a list of all colors available for use in R graphics type colors() Likewise, calling
different numbers for pch in the plot command produces different shapes for the plotting
points.

Using subscripts and superscripts in graph labels

To add subscripts to your x or y labels enclose the text in [] brackets, for example H[2]O. To
superscript text use a tilde, ~, before the text and use the ^ before the part of the text to be
raised, for example ~ cm^3.

plot(x=DisSev, y=RelYield, xlab = quote(Size [code]), ylab = quote(Volume


~ cm^3), col='purple', pch=17)

# Including the main title


7
> plot(x=DisSev, y=RelYield, xlab = quote(Size [code]), ylab =
quote(Volume ~ cm^3), main="Disease Severity", col='purple', pch=17)

Interactive plotting features

Sometimes it's useful to specify the range for values represented in a graph rather than just
using the default ranges produced by R. You can control the range of values on the x
(horizontal) and y (vertical) axes using xlim and ylim.

plot(DisSev, RelYield, xlab='Disease Severity (%)',


ylab='Relative Yield (%)', xlim=c(0,100), ylim=c(0,100))

Variations on the plot() function are called when it is applied to different types of objects.
For example, when plot() is applied to categorical data, a boxplot is produced. When
applied to the output from the linear regression function lm, plot() produces a set of four
regression diagnostic plots.

R has interactive features, too. To see the ID for three points, try clicking on three points on
the graph after entering the following:

identify(DisSev,RelYield, n=3)

To get the (x,y) coordinates for any four locations on the graph, try clicking any four places
on the graph after entering locator(4). The coordinates will appear after all the locations
have been clicked on.

You can also plot a particular function instead of just a set of variables.

curve(x^2,from=0,to=10,xlab='Time',ylab='Population size')

or

curve(exp(x),from=0,to=10,xlab='Time',ylab='Population size')

Loops
A loop can be used to perform the same function repeatedly. For example:

for(i in 1:10){x[i] <- x[i] + 1}


> x
[1] 2 3 4 5 6 7 8 9 10 11

1. for(i in 1:10){} sets up the loop to run over ten values of i, 1, 2, 3, ...,and 10 for
whatever commands are entered in the curly brackets {}.

8
2. x[i] indicates the ith entry in x. When i = 1, x[i] is the same as x[1]. When i = 2,
x[i] is the same as x[2].
3. x[i] <- x[i] + 1

Here the value in x[i] is replaced by the value in x[i] plus 1.

In this case the same result is obtained more simply by:

x <- x + 1
since R will perform the function on each entry of x.

A generalization if the function is to be applied to each entry of x would be the following.

for(i in 1:length(x)){x[i] <- x[i] + 1}

Here i runs from one to the length of the x vector.

The use of a temporary variable named i is arbitrary, another name could be used.

for(Joe in 1:length(x)){x[Joe] <- x[Joe] + 1}


x

You might choose to apply a function only to entries for which a certain criterion is true.

for(i in 1:length(x)){if(x[i] < 6){x[i] <- x[i] + 1}}


x

Here the if() statement controls which entries in the vector x are modified.

In this case, only those entries that are less than 6 have 1 added.

Another way to accomplish the same thing would be as follows.

index <- x < 6


x[index] <- x[index] + 1

Working With Data Sets


Suppose you would like to combine two sets of values into a single object.

dai <- c(0,1,2,5,10,20) # dai might indicate days after inoculation


pinf <- c(0,0,0,5,25,80) # pinf might indicate percent infection

check#

rbind(dai,pinf)
cbind(dai,pinf)

9
Compare to

rbind(c(dai,pinf))

You can create arrays using the following commands

z5 <- array(0,c(5,3))
z6 <- array(-1,c(5,3,2))

Data Structures

Venables et al. (2007) recommend keeping variables in data frames for easier organization.
See what df5 looks like

df5 <- [Link](dai,pinf)

Data frames need to be composed of objects with the same length. If objects don't have the
same length, you might get an error or you may get surprising results. Check the following
example.

var1 <- 1:10

The above command assigns the sequence of numbers between 1 and 10 to the object named
var1.

var2 <- 1:2


df6 <- [Link](var1,var2)
> df6
var1 var2
1 1 1
2 2 2
3 3 1
4 4 2
5 5 1
6 6 2
7 7 1
8 8 2
9 9 1
10 10 2

Did df6 look the way you expected it to? This shows how different variable length can
produce confusing outcomes when the [Link] command is used.

You can indicate dai within df5 by df5$dai, df5[1], or df5[[1]]. The name dai is only
maintained with the call df5[1]

Lists can also be used to gather objects together and the objects don't need to be of the same
length.

list6 <- list(var1,var2)


10
> list6
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10

[[2]]
[1] 1 2

Consider list6[[1]] and list6[[2]]

The interactive R environment in Windows has a data editing window available through the
Menu under Edit.

Reading Data into R

R users can construct data sets in programs such as Microsoft Excel or Notepad and then
bring those files into R for further analyses. Data input in R is handled through a series of
options, including:

• [Link]
• [Link]
• read.csv2
• [Link]
• read.delim2

For further information on the differences amongst these methods, use the help() or ?
options in R, for example: ?[Link].

Duncan <- [Link]("C:/Users/Joseph/Documents/statistical


programming/DAta/[Link]", header=TRUE, sep="\t",
[Link]="NA", dec=".", [Link]=TRUE)

Linear Least Squares Regression

Load the following data

> year <- c(2000 , 2001 , 2002 , 2003 , 2004)

> rate <- c(9.34 , 8.50 , 7.62 , 6.93 , 6.60)

11
The next thing we do is take a look at the data. We first plot the data using a scatter plot and notice
that it looks linear. To confirm our suspicions we then find the correlation between the year and the
mean interest rates:

plot(year,rate,main="Commercial Banks Interest Rate for 4 Year Car Loan")


cor(year,rate) calculates the correlation between year and rate.

Using the ggplot2 library

Load the library by

library(ggplot2)

scatterplots

> qplot(year, rate, geom="point")

>qplot(year, rate, geom="point") + geom_smooth(method="lm")

There is a linear relationship between rate and year

Histogram

> qplot(rate, geom="histogram")

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Eg

> qplot(rate, geom="histogram", binwidth=1)

At this point we should be excited because associations that strong never happen in the real
world unless you cook the books or work with averaged data. The next question is what
straight line comes "closest" to the data? In this case we will use least squares regression as
one way to determine the line.

Before we can find the least square regression line we have to make some decisions. First we
have to decide which is the explanatory and which is the response variable. Here, we
arbitrarily pick the explanatory variable to be the year, and the response variable is the
interest rate. This was chosen because it seems like the interest rate might change in time
rather than time changing as the interest rate changes.

The command to perform the least square regression is the lm command.

Since we specified that the interest rate is the response variable and the year is the
explanatory variable this means that the regression line can be written in slope-intercept
form:

12
rate = (intercept) + (slope) year

The way that this relationship is defined in the lm command is that you write the vector
containing the response variable, a tilde ("~"), and a vector containing the explanatory
variable:

> fit <- lm(rate ~ year)


> fit
Call:
lm(formula = rate ~ year)

Coefficients:
(Intercept) year
1419.208 -0.705

When you make the call to lm it returns a variable with a lot of information in it. If you are
just learning about least squares regression you are probably only interested in two things at
this point, the slope and the y-intercept. If you just type the name of the variable returned by
lm it will print out this minimal information to the screen. (See above.)

If you would like to know what else is stored in the variable you can use the attributes
command:

> attributes(fit)
$names
[1] "coefficients" "residuals" "effects" "rank"
[5] "[Link]" "assign" "qr" "[Link]"
[9] "xlevels" "call" "terms" "model"

$class
[1] "lm"

One of the things you should notice is the coefficients variable within fit. You can print out
the y-intercept and slope by accessing this part of the variable:

> fit$coefficients[1]
(Intercept)
1419.208
> fit$coefficients[[1]]
[1] 1419.208
> fit$coefficients[2]
year
-0.705
> fit$coefficients[[2]]
[1] -0.705

Note that if you just want to get the number you should use two square braces. So if you want
to get an estimate of the interest rate in the year 2015 you can use the formula for a line:

> fit$coefficients[[2]]*2015+fit$coefficients[[1]]
[1] -1.367

13
So if you just wait long enough, the banks will pay you to take a car!

A better use for this formula would be to calculate the residuals and plot them:

> res <- rate - (fit$coefficients[[2]]*year+fit$coefficients[[1]])


> res
[1] 0.132 -0.003 -0.178 -0.163 0.212
> plot(year,res)

That is a bit messy, but fortunately there is an easier way to get the residuals:

> residuals(fit)
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212

If you want to plot the regression line on the same plot as your scatter plot you can use the
abline function along with your variable fit:

> plot(year,rate, main="Commercial Banks Interest Rate for 4 Year Car


Loan")
> abline(fit)

Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an
F-test by asking R for a summary of the fit variable:

> summary(fit)

Call:
lm(formula = rate ~ year)

Residuals:
1 2 3 4 5
0.132 -0.003 -0.178 -0.163 0.212

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1419.20800 126.94957 11.18 0.00153 **
year -0.70500 0.06341 -11.12 0.00156 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2005 on 3 degrees of freedom


Multiple R-Squared: 0.9763, Adjusted R-squared: 0.9684
F-statistic: 123.6 on 1 and 3 DF, p-value: 0.001559

Exercise

Load the data that you have (Oxygen and Hydrocarbon levels)

Perform the above tasks

14
Reading Data into R

R users can construct data sets in programs such as Microsoft Excel or Notepad and then
bring those files into R for further analyses. Data input in R is handled through a series of
options, including:

• [Link]
• [Link]
• read.csv2
• [Link]
• read.delim2

For further information on the differences amongst these methods, use the help() or ?
options in R, for example: ?[Link].

Duncan <- [Link]("C:/Users/Joseph/Documents/statistical


programming/DAta/[Link]", header=TRUE, sep="\t",
[Link]="NA", dec=".", [Link]=TRUE)

or

> oxygen <- [Link]("C:/Users/Joseph/Documents/statistical


programming/DAta/[Link]", header=TRUE, sep="\t", [Link]="NA",
dec=".", [Link]=TRUE)

> oxygen

You can first explore the data, by using various plots

> library(MASS)

> truehist(oxygen$[Link], col="darkgray")

> truehist(oxygen$Purity, col="darkgray")

The truehist() produces a smooth histrogram especially when you have large data sets other wise just use
hist() command

> hist(oxygen$Purity, col="darkgray")

> hist(oxygen$[Link], col="darkgray")

You can also have a box plot

> boxplot(oxygen$Purity, ylab="Purity") # identify the outlier by the


mouse

> identify(rep(1, length(oxygen$Purity)), oxygen$Purity, rownames(oxygen))

Scatter plot

> plot(Purity~[Link], data=oxygen)

Is there a linear relationship between the two variables?

15
Then fit the linear model

> RegModel.1 <- lm(Purity~[Link], data=oxygen)

> summary(RegModel.1)

> RegModel.1$coefficients

Note that if you just want to get the number you should use two square braces.

> RegModel.1$coefficients[[2]]*1.25+RegModel.1$coefficients[[1]] #the


fitted value at 1.25

A better use for this formula would be to calculate the residuals and plot them:

> res <- oxygen$Purity -


(RegModel.1$coefficients[[2]]+RegModel.1$coefficients[[1]])

> res

That is a bit messy, but fortunately there is an easier way to get the residuals:

> residuals(RegModel.1)

If you want to plot the regression line on the same plot as your scatter plot you can use the abline
function along with your variable fit:

> plot(oxygen$[Link],oxygen$Purity)

> abline(RegModel.1)

Testing for the adequacy of the model

> anova(RegModel.1)

Confidence intervals

> confint(RegModel.1, level=.95)

Graphical diagnostics

> par(mfrow=c(2,2))

> plot(RegModel.1)

Exercise: Exclude the 9th point then fit again and examine the residuals

Multiple linear regression

> [Link] <- [Link]("C:/Users/Joseph/Documents/statistical


programming/DAta/[Link]", header=TRUE, sep="\t",
[Link]="NA", dec=".", [Link]=TRUE)
16
Exploration

> boxplot([Link]$x1, ylab="x1")

> boxplot([Link]$x2, ylab="x2")

> library(scatterplot3d)

> scatterplot3d([Link]$x1, [Link]$y, [Link]$x2,


bg="white", grid=TRUE, xlab="x1", ylab="y", zlab="x2")

> RegModel.2 <- lm(y~x1+x2, data=[Link])

> summary(RegModel.2)

> RegModel.3 <- lm(y~x1, data=[Link])

> summary(RegModel.3)

> RegModel.4 <- lm(y~x2, data=[Link])

> summary(RegModel.4)

A model with interaction effects

> RegModel.5 <- lm(y~x1+x2+x1*x2, data=[Link])

> summary(RegModel.5)

> RegModel.6 <- lm(y~x1+x1*x2, data=[Link])

> summary(RegModel.6) #same as RegModel.5

Confidence intervals

> confint(RegModel.2)

> confint(RegModel.3)

> confint(RegModel.4)

> confint(RegModel.5)

> confint(RegModel.6)

Testing for the adequacy of the regressions

> anova(RegModel.2)

> anova(RegModel.3)

> anova(RegModel.4)

> anova(RegModel.5)

17
X1 captures more variations in the response than x2. The model with an interaction effect between x1
and x2 is even better.

Diagonstics

> par(mfrow=c(2,2)) #multiple rows of graphics

> plot(RegModel.2)

> par(mfrow=c(1,1)) #close the multiple rows

> par(mfrow=c(2,2))

> plot(RegModel.3)

There is some weak pattern in the residuals that looks quadratic

> plot(RegModel.4)

> plot(RegModel.5)

Model with an interaction effect performs better on the residuals.

> par(mfrow=c(1,1))

Matrix method

In linear regression, we must evaluate the equation b = (X'X)-1X'Y to obtain the estimated regression
parameters. Here Y is the vector of the values of the response variable, and X is the design matrix
consisting of a column of ones and a column for the values of each predictor variable. If X and Y
have already been created in R, then we can evaluate the equation in R like so:

> Y=([Link]$y)

> C0<-matrix(1,25,1)

> C1=([Link]$x1)

> C2=([Link]$x2)

> X<-cbind(C0,C1,C2)

> b <- solve( t(X) %*% X ) %*% t(X) %*% Y

> b

[,1]

2.26379143

C1 2.74426964

C2 0.01252781
18
Then the vector b will contain the estimated regression coefficients.

Dummy variables

Load the data tool

> tool <- [Link]("C:/Users/Joseph/Documents/statistical


programming/DAta/[Link]", header=TRUE, sep="\t", [Link]="NA",
dec=".", [Link]=TRUE)

Plot the box plot

> boxplot(tool$RPM, ylab="RPM")

> tool$Tool <- factor(tool$Tool, labels=c('302','416')) #convert


the variable tool into a factor.

> boxplot(RPM~Tool, ylab="RPM", xlab="Tool", data=tool) #plot the


boxplots for the two factors

> Y<-tool$yi

> C0<-matrix(1,20,1)

> C1.1<-matrix(0,10,1)

> C1.2<-matrix(1,10,1)

> C1<-tool$RPM

> C2<-rbind(C1.1,C1.2)

> C2

> X<-cbind(C0,C1,C2)

> b <- solve( t(X) %*% X ) %*% t(X) %*% Y

> b

19
Simulation

In R, write function lcg(n, seed, m, a, c) that produces a sequence of n pseudo


random numbers from a linear congruential generator with parameters (m, a, c) starting
with the specified seed.
lcg <- function(n, seed, m, a, c) {
x <- vector(length=n+1)
x[1] <- seed
for (i in 1:n) {
x[i+1] <- (a * x[i] + c) %% m
}
u <- x/m

return (u[-1])
}

For m = 2^31, a = 65539, c = 0 and seed =171717 produce 5000 random


numbers
randu <- lcg(5000, 171717, 2^31, 65539, 0)
> head(randu)

# Generating random numbers -------------------------


rnorm(10)
rnorm(100)
rnorm(100, mean=10, sd=5)
rpois(10, 10)

# * normal = rnorm
# * poisson = rpois
# * binomial = rbinom
# * gamma = rgamma
# * look at [Link](keyword="distribution") for more
#
# Always check with the documentation that the distribution
# is parameterised the way you expect. Eg. for normal, standard
# deviation, not variance.

# Repeating a task -------------------------


#
# replicate(n, task) repeats task n times and joins the
# result into a single vector.

replicate(100, mean(rnorm(100)))
replicate(100, mean(rnorm(10)))

# Display an empirical distribution ----------------


#
# An empirical distribution is one derived from data,
# as opposed to a theoretical distribution.

20
# function to get a permutation of the indices 1 ... n

perm <- function(n) {


return(sample(n,n, replace=F))
}

# number of indices left in place:


H <- function(p) {
return(sum(p==1:length(p)))
}

N <- 1000
n <- 10

table(replicate(N, H(perm(n))))/N

Plotting a Standard normal density function

Write a code in R to show the areas above 1.96 the standard normal density function. The
diagram should:
Leave room on the left margin

oldpar <- par(mar = c(5, 6, 4, 2) + 0.1) # leave room on the left


oldpar

1 − 2z2
Display the area between -4 and +4 on the x-axis and label it Z, label the y axis as φ ( z ) = e and

the Main title: The Standard Normal Density Function φ ( z ) .

z <- seq(-4, 4, length=1000)


p <- dnorm(z)
plot(z, p, type="l", lwd=2, main=expression("The Standard Normal Density
Function" ~~ phi(z)), ylab=expression(phi(z) == frac(1, sqrt(2*pi)) * ~~
e^- ~~ frac(z^2, 2)))

To the plot obtained in above, insert the grid line at x = 0 and y = 0 in gray colour, define the
area equal to or greater than 1.96 as well as shading it in gray.

abline(h=0, col="gray")
abline(v=0, col="gray")
z0 <- z[z >= 1.96] # define region to fill
z0 <- c(z0[1], z0)
p0 <- p[z >= 1.96]
p0 <- c(0, p0)
polygon(z0, p0, col="gray")

21

Point and label the area shaded above as ∫1.96φ ( z ) dz = 0.025 with the text being above the arrow.

coords <- locator(2) # locate head and tail of arrow


arrows(coords$x[1], coords$y[1], coords$x[2], coords$y[2], code=1,
length=0.125)
text(coords$x[2], coords$y[2], pos=3, # text above tail of arrow
expression(integral(phi(z)*dz, 1.96, infinity) == .025))
Note: use the mouse to locate both the head and tail of the arrows

The Standard Normal Density Function φ (z)


0.4
0.3
2

2
z

e

0.2
1
φ (z) =


⌠ φ (z)dz = 0.025
0.1

⌡1.96
0.0

-4 -2 0 2 4

Using the R language, write a function which computes cumulative logistic probabilities
plogit = 1/(1+exp(-x))and describe how you would generate the plot given below
in R

Graphics Solutions
The first graph:
plogit <- function(x) 1/(1+exp(-x))
eta <- seq(-10, 10, len=100)
p1 <- plogit(eta-1)
p2 <- plogit(eta+1)
p3 <- plogit(eta+4.5)
plot(c(-10,10), range(p1,p2,p3), type="n", axes=FALSE, xlab="x", ylab="Pr(y
> j)")
axis(2)
box()
abline(h=c(0,1), col="gray")
lines(eta, p1, lwd=2)
lines(eta, p2, lwd=2)
lines(eta, p3, lwd=2)
coords <- locator(2)
arrows(coords$x[1], coords$y[1], coords$x[2], coords$y[2], code=1,
length=0.125)
text(coords$x[2], coords$y[2], pos=3, "Pr(y > 1)")
coords <- locator(2)

22
arrows(coords$x[1], coords$y[1], coords$x[2], coords$y[2], code=1,
length=0.125)
text(coords$x[2], coords$y[2], pos=3, "Pr(y > 2)")
coords <- locator(2)
arrows(coords$x[1], coords$y[1], coords$x[2], coords$y[2], code=1,
length=0.125)
text(coords$x[2], coords$y[2], pos=3, "Pr(y > 3)")
1.0
0.8

Pr(y > 1)
Pr(y > 3)
0.6

Pr(y > 2)
Pr(y >j)

0.4
0.2
0.0

Examining Distributions

A common requirement is for biostatisticians to fit a probability distribution to some biological data.
This section provides some simple examples of how we might achieve this in R.

First we need some data. For this example we shall create a random sample (n=100,000) of data from
a Gamma distribution.

n <- 100000
sample <- rgamma(n, rate=0.5, shape=5)

We will then examine this distribution via a histogram. We could use the standard function hist,
to create the graph, but the rule that this function uses to produce “nice” cut-points between the
bins often produces too few bins. Whilst this is adjustable via the breaks parameter, I find the
function truehist in the MASS package produces better default results. By default the y-axis
shows the density of observations.

library(MASS)
truehist(sample, main="Histogram of sample")

23
Histogram of sample

0.10
0.08
0.06
0.04
0.02
0.00

0 10 20 30 40

sample

Observing that the data has a heavy tail, we might consider a simple transformation to normalise the
data, such as taking logarithms. If the data comes from a lognormal population, we would expect a
nice bell curve.

truehist(log(sample), main="Histogram of log(sample)")

24
Histogram of log(sample)

0.8
0.6
0.4
0.2
0.0

-1 0 1 2 3

log(sample)

The log transform doesn’t appear to work well here. An alternative is to consider a power
transformation of the form xn, known as a Box-Cox transformation. We can find a suitable power via
the command: boxcox(sample ~ 1).

Instead we shall try fitting some other heavy tailed distributions such as the Weibull, Gamma. We
could fit distributions manually via method of moments or use maximum likelihood estimation.

Oddly, R doesn’t have standard functions for fitting distributions. However methods are available in
various add on packages.

• mle – in the stats4 package. Requires the user to define the log-likelihood function manually.
• fitdistr – in the MASS package.

As the fitdistr function requires less information from the user, we shall use this one. For
example:

fitdistr(sample, "Weibull")
25
shape scale
2.360351150 11.277036784
( 0.005528084) ( 0.015984577)

The output is an object of the “fitdistr” class. This is essentially a list containing two vectors,
“estimate” “sd”. We can plot density curves for the fitted distribution via the following code:

# Fit distributions
fit.w <- fitdistr(sample, "Weibull")
fit.g <- fitdistr(sample, "gamma")

# Caclulate points for density curves


x <- seq(from=0, to=40, by=1)
y.w <- dweibull(x, shape=fit.w$estimate[1], scale=fit.w$estimate[2])
y.g <- dgamma(x, shape=fit.g$estimate[1], rate=fit.g$estimate[2])

# Draw histogram and density curves


truehist(sample, main="Histogram of sample")
lines(x, y.w)
lines(x, y.g, col="red")
legend(30, 0.1, legend=c("Gamma distn", "Weibull distn"),
lwd=1, col=c("red", "black"))

Histogram of sample
0.10

Gamma distn
Weibull distn
0.08
0.06
0.04
0.02
0.00

0 10 20 30 40

sample

Some people prefer to look at cumulative density functions. For example:

# Draw cumulative density functions


plot(ecdf(sample), cex=0)

26
curve(pweibull(x, shape=fit.w$estimate[1],
scale=fit.w$estimate[2]), add=T, col=2)
curve(pgamma(x, shape=fit.g$estimate[1],
rate=fit.g$estimate[2]), add=T, col=3)
legend(30, 0.6, lwd=1, col=1:3,
legend=c("Empirical CDF", "Weibull CDF", "Gamma CDF"))

It is quite clear that the Gamma distribution is most appropriate for this data. However an
alternative graphical measure is to consider a QQ-plot, which plots the empirical quantiles of the
data to the theoretical quantiles of the fitted distribution. If the distribution is appropriate, we
would expect the points to follow a straight line.

This is very simple to achieve manually, but the function [Link] included in the car library is
quite neat. For example:

library(car)
par(mfrow = c(1, 2))
[Link](sample, distribution="weibull",
shape=fit.w$estimate[1], scale=fit.w$estimate[2])
title(main="QQ plot of sample against \n Weibull quantiles",
[Link]=0.9)
[Link](sample, distribution="gamma",
shape=fit.g$estimate[1], rate=fit.g$estimate[2])
title(main="QQ plot of sample against \n Gamma quantiles",
[Link]=0.9)

27
The par command, sets the graphical parameter mfrow, so that we are able to draw two graphs
side by side. See the help file for further details. As discussed in section 2.5, the “\n” character
represents a new line.

In the resulting graphs, below, 95% confidence intervals are included around the expected straight
line. As the points in the first graph clearly do not follow this line, the Weibull distribution is not
appropriate. However, the second graph indicates that the Gamma distribution is indeed a good fit
of the data.

28

You might also like