Chapter 2: Introduction to R Programming
Edwin Fong
Department of Statistics and Actuarial Science
The University of Hong Kong
Email: chefong@[Link]
STAT3621
Types of Statistical Software
• Command-line software
• requires knowledge of syntax of commands
• reproducible results through scripts
• detailed analyses possible
• R, Python, SAS, SPSS, …
• GUI-based software
• does not require knowledge of commands
• actions are not reproducible
• JMP, Excel, …
• Hybrid types (both command-line and GUI)
Why R?
• It is free!
• It runs on a variety of platforms including Windows, Unix and MacOS
• maintained by top quality experts
• continuous improvement
• Many advanced statistical analysis packages
• Available through [Link]
• IDE: R Studio: [Link]
R Introduction
Your First Program
How to Get Help in R
Installing R Package
• install.packages("car") # download and install the package named "car"
• library(car) # load the "car" package
R: A Scientific Calculator
Mathematical Operations
• exp(pi*3)+1
• factorial(10)
• log(30)
• sqrt(2)^2 == 2
• all.equal(sqrt(2)^2, 2)
• isTRUE(1:5)^2>=16
• choose(5, 0:5)
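The two comparisons in the list above behave differently because of floating-point rounding. The following is a minimal sketch of that difference; the variable names are illustrative only.

```r
# Exact comparison fails because sqrt(2) is stored with finite precision
exact <- sqrt(2)^2 == 2
# all.equal() compares within a small tolerance; isTRUE() makes it safe in if()
approx <- isTRUE(all.equal(sqrt(2)^2, 2))
print(exact)    # FALSE
print(approx)   # TRUE

# choose() is vectorised over k, giving one row of Pascal's triangle
binom_row <- choose(5, 0:5)
print(binom_row)        # 1 5 10 10 5 1
print(factorial(10))    # 3628800
```

In practice, prefer isTRUE(all.equal(a, b)) over a == b whenever a or b is the result of a floating-point calculation.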
Arithmetic Operators
Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
x %% y modulus (x mod y) 5%%2 is 1
x %/% y integer division 5%/%2 is 2
Logical Operators
Operator Description
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x Not x
x|y x OR y
x&y x AND y
isTRUE(x) test if x is TRUE
R Objects
• Results of calculations can be stored in objects using
• An arrow (<-)
• The equal character (=)
• e.g., x <- c(1:5); y= c(2:10)
• These objects can then be used in other calculations.
• To print the object just enter the name of the object.
• To list the objects that you have in your current R session:
> ls()
[1] "x" "y"
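As a small illustration of storing and reusing objects, the sketch below assigns two vectors and uses them in a later calculation. The object z is a hypothetical name added for the example.

```r
# Both <- and = assign at the top level; <- is the more common convention
x <- c(1:5)
y = c(2:10)

# Stored objects can be reused in later calculations
z <- sum(x) + length(y)   # 15 + 9
print(z)                  # 24

# Entering the name of an object (or calling print) displays it
print(x)

# ls() lists the names of the objects in the current environment
print(ls())
```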
Restrictions for Object Name
• Object names cannot contain special symbols such as !,
+, -, #
• A dot (.) and an underscore (_) are allowed, and a
name may start with a dot (but not with an underscore)
• Object names can contain digits but cannot start
with a digit
• R is case sensitive
• X and x are two different objects, as well as temp
and temP
• FOO, Foo, and foo are three different objects
R Workspace
• Objects that you create during an R session are held in
memory; the collection of objects that you currently have is
called the workspace.
• This workspace is not saved on disk unless you tell R to do
so.
• This means that your objects are lost when you close R
without saving them, or worse, when R or your system
crashes during a session.
R Workspace
• When you close the RGui or the R console window, the
system will ask if you want to save the workspace image.
• If you choose to save the workspace image, all the
objects in your current R session are saved in a file named .RData.
• This is a binary file located in the working directory of R
(use getwd() to see where that is).
R Workspace
• During your R session you can also explicitly save the
workspace image. Go to the 'File' menu and then select 'Save
Workspace...', or use the save.image() function.
## save to the current working directory
save.image()
## just checking what the current working directory is
getwd()
## save to a specific file and location
save.image("C:\\Program Files\\[Link]")
R Datasets
• R comes with a number of sample datasets that you can
experiment with.
• Type
> data()
to see the available datasets.
The results will depend on which packages you have loaded.
• Type
> help(datasetname)
for details on a sample dataset.
Data Types
R has a wide variety of data types including
• Scalars
• Vectors (numerical, character, logical)
• Matrices
• Data frames
• Lists
Vector
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) #logical vector
Refer to elements of a vector using subscripts.
a[c(2,4)] # 2nd and 4th elements of vector
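vector("numeric", 5) # create a numeric vector of length 5 (all zeros)
vector("list", 5) # create an empty list with 5 components
vector("character", 5) # create a character vector of 5 empty strings
length(a) # the length/dimension of the vector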
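A short sketch of vector subscripting, reusing the numeric vector a from the slide. The negative and logical subscripts are extra illustrations that are not in the original slide.

```r
a <- c(1, 2, 5.3, 6, -2, 4)         # numeric vector

print(a[c(2, 4)])                   # 2 6  (2nd and 4th elements)
print(a[-1])                        # everything except the first element
print(a[a > 3])                     # 5.3 6 4  (logical subscript)

empty_num <- vector("numeric", 5)   # five zeros
print(empty_num)
print(length(a))                    # 6
```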
Matrix
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
dimnames=list(rnames, cnames))
# Identify rows, columns or elements using subscripts.
y[,4] # 4th column of matrix
y[3,] # 3rd row of matrix
y[2:4,1:3] # rows 2,3,4 of columns 1,2,3
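The sketch below runs the two matrices from the slide and shows what the subscripts return. Selecting by row and column name is an additional illustration.

```r
# matrix() fills column by column unless byrow = TRUE
y <- matrix(1:20, nrow = 5, ncol = 4)
print(y[, 4])              # 16 17 18 19 20
print(y[3, ])              # 3 8 13 18
print(dim(y[2:4, 1:3]))    # 3 3

mymatrix <- matrix(c(1, 26, 24, 68), nrow = 2, ncol = 2, byrow = TRUE,
                   dimnames = list(c("R1", "R2"), c("C1", "C2")))
print(mymatrix)
print(mymatrix["R2", "C1"])   # 24 (dimnames can be used as subscripts)
```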
Array
• Arrays are similar to matrices but can have more than two
dimensions. See help(array) for details.
Data frame
• A data frame is more general than a matrix, in that different columns
can have different modes (numeric, character, factor, etc.).
• d <- c(1,2,3,4)
• e <- c("red", "white", "red", NA)
• f <- c(TRUE,TRUE,TRUE,FALSE)
• mydata <- data.frame(d,e,f)
• names(mydata) <- c("ID","Color","Passed") #variable names
• There are a variety of ways to identify the elements of a data frame.
mydata[2:3] # columns 2 and 3 of the data frame
mydata[c("ID","Color")] # columns ID and Color from the data frame
mydata$Passed # variable Passed in the data frame
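To make the selection styles concrete, here is a minimal runnable sketch built on the three vectors from the slide. The final row filter is an extra illustration.

```r
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("ID", "Color", "Passed")   # variable names

print(mydata[2:3])                  # columns 2 and 3, still a data frame
print(mydata[c("ID", "Color")])     # columns selected by name
print(mydata$Passed)                # one column as a plain vector
print(mydata[mydata$Passed, "ID"])  # IDs of the rows where Passed is TRUE
```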
List
• An ordered collection of objects (components).
• A list allows you to gather a variety of (possibly
unrelated) objects under one name.
• # example of a list with 4 components -
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
• # example of combining two lists into one longer list
v <- c(list1,list2)
• Identify elements of a list using the [[]] convention.
• w[[2]] # 2nd component of the list
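A brief sketch of building and indexing a list. The small vector, matrix and the extra component are stand-ins chosen for the example, not values from the slides.

```r
a <- c(1, 2, 5.3)                 # stand-in numeric vector
y <- matrix(1:4, nrow = 2)        # stand-in matrix
w <- list(name = "Fred", mynumbers = a, mymatrix = y, age = 5.3)

print(w[[2]])          # the numeric vector itself
print(w$name)          # "Fred" (components can also be reached by name)
print(class(w[2]))     # "list": single brackets return a sub-list

# c() concatenates lists into one longer list
v <- c(w, list(extra = TRUE))     # 'extra' is a hypothetical component
print(length(v))       # 5
```

Use [[ ]] or $ when you want the component itself, and [ ] when you want a shorter list.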
Factor
• Tell R that a variable is nominal by making it a factor.
• The factor stores the nominal values as a vector of integers in the range [1...k],
and an internal vector of character strings (the original values) mapped to
these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 1s and 30 2s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
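The internal integer coding described above can be inspected directly. This is a minimal sketch using the same gender variable.

```r
gender <- c(rep("male", 20), rep("female", 30))
gender <- factor(gender)

print(levels(gender))             # "female" "male" (alphabetical order)
print(as.integer(gender)[1:3])    # 2 2 2: the first entries are "male"
print(summary(gender))            # female 30, male 20
```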
Environments and Functions
Useful Functions
length(object) # number of elements or components
str(object) # structure of an object
class(object) # class or type of an object
names(object) # names
c(object,object,...) # combine objects into a vector
cbind(object, object, ...) # combine objects as columns
rbind(object, object, ...) # combine objects as rows
ls() # list current objects
rm(object) # delete an object
rm(list = ls()) # empties the whole environment
newobject <- edit(object) # edit copy and save as newobject
fix(object) # edit in place
Importing Data
• # first row contains variable names, comma is separator
• # assign the variable id to row names
• mydata <- read.table("c:/[Link]", header=TRUE, sep=",",
row.names="id")
• Useful package: library(data.table); fread()
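Since the file path on the slide is machine-specific, the sketch below writes a tiny comma-separated file to a temporary location and reads it back. The file contents and column names are made up for illustration.

```r
# Create a small CSV file in a temporary location (illustrative data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,x1,x2", "a,1,10", "b,2,20", "c,3,30"), csv_path)

# First row holds the variable names; the id column becomes the row names
mydata <- read.table(csv_path, header = TRUE, sep = ",", row.names = "id")
print(mydata)
print(rownames(mydata))     # "a" "b" "c"
print(mydata["b", "x2"])    # 20

# read.csv() is a wrapper around read.table() with header = TRUE and sep = ","
same <- read.csv(csv_path, row.names = "id")
```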
Exporting Data
• To A Tab Delimited Text File
write.table(mydata, "c:/[Link]", sep="\t")
• To an Excel Spreadsheet
library(xlsReadWrite)
write.xls(mydata, "c:/[Link]")
Viewing Data
There are a number of functions for listing the contents of an object or
dataset.
# list objects in the working environment
ls()
# list the variables in mydata
names(mydata)
# list the structure of mydata
str(mydata)
# list levels of factor v1 in mydata
levels(mydata$v1)
# dimensions of an object
dim(object)
# class of an object (numeric, matrix, dataframe, etc)
class(object)
# print mydata
mydata
# print first 10 rows of mydata
head(mydata, n=10)
# print last 5 rows of mydata
tail(mydata, n=5)
# view mydata in new tab
View(mydata)
Value Labels
• To understand value labels in R, you need to understand the data
structure factor.
• You can use the factor function to create your own value labels.
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue,3=green
mydata$v1 <- factor(mydata$v1,levels = c(1,2,3),
labels = c("red", "blue", "green"))
Value Labels
# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
mydata$y <- ordered(mydata$y,
levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
Use the factor() function for nominal data and the ordered() function
for ordinal data. R statistical and graphic functions will then treat
the data appropriately.
Note: factor() and ordered() are used the same way, with the same
arguments. The former creates factors and the latter creates
ordered factors.
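The practical difference between the two is that ordered factors support comparisons between levels. The following sketch uses small made-up vectors to show this.

```r
# Nominal data: codes 1, 2, 3 labelled as colours (illustrative values)
v1 <- c(1, 3, 2, 1)
colour <- factor(v1, levels = c(1, 2, 3),
                 labels = c("red", "blue", "green"))
print(colour)                 # red green blue red

# Ordinal data: codes 1, 3, 5 labelled Low < Medium < High
y <- c(1, 5, 3, 3)
rating <- ordered(y, levels = c(1, 3, 5),
                  labels = c("Low", "Medium", "High"))
print(rating >= "Medium")     # FALSE TRUE TRUE TRUE
print(is.ordered(colour))     # FALSE
print(is.ordered(rating))     # TRUE
```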
Missing Data
• In R, missing values are represented by the symbol NA (not
available).
• Undefined numerical results (e.g., 0/0) are represented
by the symbol NaN (not a number); dividing a nonzero number by zero gives Inf or -Inf.
• Unlike SAS, R uses the same symbol for character and
numeric data.
• Testing for Missing Values
is.na(x) # returns TRUE if x is missing
y <- c(1,2,3,NA)
is.na(y) # returns a vector (F F F T)
Missing Data
• Recoding Values to Missing
# recode 99 to missing for variable v1
# select rows where v1 is 99 and recode column v1
mydata[mydata$v1==99,"v1"] <- NA
• Excluding Missing Values from Analyses
Arithmetic functions on missing values yield missing values.
x <- c(1,2,NA,3)
mean(x) # returns NA
mean(x, na.rm=TRUE) # returns 2
Missing Data
• The function complete.cases() returns a logical vector indicating
which cases are complete.
# list rows of data that have missing values
mydata[!complete.cases(mydata),]
• The function na.omit() returns the object with listwise deletion of
missing values.
# create new dataset without missing data
newdata <- na.omit(mydata)
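The missing-data tools from these slides can be tried together on a small data frame. The data below are invented for illustration, with 99 used as a missing-value code.

```r
mydata <- data.frame(v1 = c(1, 99, 3, 4),
                     v2 = c("a", "b", NA, "d"))

# Recode 99 to missing for variable v1
mydata[mydata$v1 == 99, "v1"] <- NA
print(is.na(mydata$v1))                    # FALSE TRUE FALSE FALSE

# Arithmetic on NA yields NA unless missing values are removed
print(mean(mydata$v1))                     # NA
print(mean(mydata$v1, na.rm = TRUE))       # 2.666667

# Rows with at least one missing value, then listwise deletion
print(mydata[!complete.cases(mydata), ])   # rows 2 and 3
newdata <- na.omit(mydata)
print(nrow(newdata))                       # 2

# NaN versus Inf
print(0 / 0)                               # NaN
print(1 / 0)                               # Inf
```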
Data Manipulation
Creating new variables
• Use the assignment operator <- to create new variables.
• # Three examples for doing the same computations
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)
• mydata <- transform( mydata, sum = x1 + x2, mean = (x1 + x2)/2 )
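The sketch below checks that the $ approach and transform() produce the same new columns. It uses with() in place of attach()/detach(), which is a substitution on my part: with() evaluates an expression inside the data frame without altering the search path. The data are made up.

```r
mydata <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))

# Approach 1: explicit $ references
by_dollar <- mydata
by_dollar$sum  <- by_dollar$x1 + by_dollar$x2
by_dollar$mean <- (by_dollar$x1 + by_dollar$x2) / 2

# Approach 2: with() looks up x1 and x2 inside the data frame
by_with <- mydata
by_with$sum  <- with(by_with, x1 + x2)
by_with$mean <- with(by_with, (x1 + x2) / 2)

# Approach 3: transform() adds both columns in one call
by_transform <- transform(mydata, sum = x1 + x2, mean = (x1 + x2) / 2)

print(by_transform)
```

One reason to avoid attach() is that assignments to mydata$... made while the data frame is attached do not update the attached copy, which can lead to confusing results.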
Creating new variables - Recoding variables
• In order to recode data, you will probably use one or more of R's
control structures.
• # create 2 age categories
mydata$agecat <- ifelse(mydata$age > 70, c("older"), c("younger"))
• # another example: create 3 age categories
attach(mydata)
mydata$agecat[age > 75] <- "Elder"
mydata$agecat[age > 45 & age <= 75] <- "Middle Aged"
mydata$agecat[age <= 45] <- "Young"
detach(mydata)
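Both recoding examples can be run on a small invented age variable. This sketch refers to the column as mydata$age instead of using attach(), and stores the results in two hypothetical columns so they can be compared side by side.

```r
mydata <- data.frame(age = c(30, 45, 50, 75, 80))

# Two categories with ifelse()
mydata$agecat2 <- ifelse(mydata$age > 70, "older", "younger")

# Three categories with logical subscripts
mydata$agecat3 <- NA_character_
mydata$agecat3[mydata$age > 75] <- "Elder"
mydata$agecat3[mydata$age > 45 & mydata$age <= 75] <- "Middle Aged"
mydata$agecat3[mydata$age <= 45] <- "Young"

print(mydata)
```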
Control Structures
• if-else
• if (cond) expr
if (cond) expr1 else expr2
• for
• for (var in seq) expr
• while
• while (cond) expr
• switch
• switch(expr, ...)
• ifelse
• ifelse(test,yes,no)
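Each of the structures listed above is shown once in the sketch below. The variable names and values are arbitrary.

```r
# if-else: chooses one branch based on a single TRUE/FALSE condition
x <- 7
if (x %% 2 == 0) parity <- "even" else parity <- "odd"
print(parity)                      # "odd"

# for: repeats the expression once per element of the sequence
total <- 0
for (i in 1:5) total <- total + i
print(total)                       # 15

# while: repeats as long as the condition holds
n <- 1
while (n < 100) n <- n * 2
print(n)                           # 128

# switch: picks a branch by name; the unnamed last value is the default
kind <- switch("b", a = "first", b = "second", "other")
print(kind)                        # "second"

# ifelse: vectorised, returns one result per element of the test
size <- ifelse(c(1, 5, 10) > 4, "big", "small")
print(size)                        # "small" "big" "big"
```

Note that if() expects a single condition, whereas ifelse() works element by element on a vector.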
Control Structures
• # transpose of a matrix
# a poor alternative to built-in t() function
mytrans <- function(x) {
  if (!is.matrix(x)) {
    warning("argument is not a matrix: returning NA")
    return(NA_real_)
  }
  y <- matrix(1, nrow=ncol(x), ncol=nrow(x))
  for (i in 1:nrow(x)) {
    for (j in 1:ncol(x)) {
      y[j,i] <- x[i,j]
    }
  }
  return(y)
}
Control Structures
• # try it
z <- matrix(1:10, nrow=5, ncol=2)
tz <- mytrans(z)
R built-in functions (Numeric Functions)
Function Description
abs(x) absolute value
sqrt(x) square root
ceiling(x) ceiling(3.475) is 4
floor(x) floor(3.475) is 3
trunc(x) trunc(5.99) is 5
round(x, digits=n) round(3.475, digits=2) is 3.48
signif(x, digits=n) signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x) also acos(x), cosh(x), acosh(x), etc.
log(x) natural logarithm
log10(x) common logarithm
exp(x) e^x
Character Functions
Function Description
substr(x, start=n1, stop=n2) Extract or replace substrings in a character vector.
x <- "abcdef"
substr(x, 2, 4) is "bcd"
substr(x, 2, 4) <- "22222" changes x to "a222ef"
grep(pattern, x, ignore.case=FALSE, fixed=FALSE) Search for pattern in x. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. Returns matching indices.
grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE) Find pattern in x and replace with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string.
sub("\\s",".","Hello There") returns "Hello.There"
strsplit(x, split) Split the elements of character vector x at split.
strsplit("abc", "") returns a list whose single element is the vector "a","b","c"
paste(..., sep=" ") Concatenate strings, using the sep string to separate them.
paste("x",1:3,sep="") returns c("x1","x2","x3")
paste("x",1:3,sep="M") returns c("xM1","xM2","xM3")
paste("Today is", date())
toupper(x) Uppercase
tolower(x) Lowercase
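The examples from the table can be run as a single script. This sketch only strings them together and prints the results.

```r
x <- "abcdef"
print(substr(x, 2, 4))                            # "bcd"
substr(x, 2, 4) <- "22222"                        # replaces 3 characters in place
print(x)                                          # "a222ef"

print(grep("A", c("b", "A", "c"), fixed = TRUE))  # 2
print(sub("\\s", ".", "Hello There"))             # "Hello.There"

# strsplit() returns a list, one element per input string
letters_abc <- strsplit("abc", "")[[1]]
print(letters_abc)                                # "a" "b" "c"

print(paste("x", 1:3, sep = ""))                  # "x1" "x2" "x3"
print(paste("x", 1:3, sep = "M"))                 # "xM1" "xM2" "xM3"
print(toupper("hello"))                           # "HELLO"
print(tolower("HELLO"))                           # "hello"
```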
Applied Statistical Computing and Graphics
Review Exercises
[Link]
Stat/Prob Functions
• The following table describes functions related to
probability distributions.
• For random number generators below, you can use
set.seed(1234) or some other integer to create
reproducible pseudo-random numbers.
Function Description
dnorm(x) normal density function (by default mean=0, sd=1)
# plot standard normal curve
x <- pretty(c(-3,3), 30)
y <- dnorm(x)
plot(x, y, type='l', xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q) cumulative normal probability for q
(area under the normal curve to the left of q)
pnorm(1.96) is 0.975
qnorm(p) normal quantile:
value at the p percentile of the normal distribution
qnorm(.9) is 1.28 # 90th percentile
rnorm(n, mean=0, sd=1) n random normal deviates with the given mean
and standard deviation sd.
# 50 random normal variates with mean=50, sd=10
x <- rnorm(50, mean=50, sd=10)
dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob), rbinom(n, size, prob)
binomial distribution where size is the number of trials
and prob is the probability of a heads (pi)
# prob of 0 to 5 heads of fair coin out of 10 flips
dbinom(0:5, 10, .5)
# prob of 5 or less heads of fair coin out of 10 flips
pbinom(5, 10, .5)
dpois(x, lambda), ppois(q, lambda), qpois(p, lambda), rpois(n, lambda)
Poisson distribution with mean = variance = lambda
# probability of 0, 1, or 2 events with lambda=4
dpois(0:2, 4)
# probability of at least 3 events with lambda=4
1 - ppois(2, 4)
dunif(x, min=0, max=1), punif(q, min=0, max=1), qunif(p, min=0, max=1), runif(n, min=0, max=1)
uniform distribution, follows the same pattern
as the normal distribution above.
# 10 uniform random variates
x <- runif(10)
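The d/p/q/r naming pattern and the effect of set.seed() can be checked with a few lines. This is a sketch; the seed value is arbitrary.

```r
# Setting the same seed reproduces the same pseudo-random draws
set.seed(1234)
x <- rnorm(50, mean = 50, sd = 10)
set.seed(1234)
x_again <- rnorm(50, mean = 50, sd = 10)
print(identical(x, x_again))        # TRUE

# p gives cumulative probability (area to the left), q is its inverse
print(pnorm(1.96))                  # about 0.975
print(qnorm(0.9))                   # about 1.28
print(qnorm(pnorm(1.96)))           # back to 1.96

# Cumulative binomial probability equals the sum of the densities
p_cum <- pbinom(5, 10, 0.5)
p_sum <- sum(dbinom(0:5, 10, 0.5))
print(c(p_cum, p_sum))              # both about 0.623

# At least 3 Poisson events with lambda = 4
p_at_least_3 <- 1 - ppois(2, 4)
print(p_at_least_3)                 # about 0.762
```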
Function Description
mean(x, trim=0, na.rm=FALSE) mean of object x
# trimmed mean, removing any missing values and
# 5 percent of highest and lowest scores
mx <- mean(x, trim=.05, na.rm=TRUE)
sd(x) standard deviation of object x. Also look at var(x) for variance and mad(x) for median absolute deviation.
median(x) median
quantile(x, probs) quantiles, where x is the numeric vector whose quantiles are desired and probs is a numeric vector with probabilities in [0,1].
# 30th and 84th percentiles of x
y <- quantile(x, c(.3,.84))
range(x) range (minimum and maximum)
sum(x) sum
diff(x, lag=1) lagged differences, with lag indicating which lag to use
min(x) minimum
max(x) maximum
scale(x, center=TRUE, scale=TRUE) column center or standardize a matrix.
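A quick sketch of the summary functions on a small invented vector with one missing value, plus scale() on a two-column matrix.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)

print(mean(x))                      # NA, because of the missing value
print(mean(x, na.rm = TRUE))        # 5
print(median(x, na.rm = TRUE))      # 4.5
print(sd(x, na.rm = TRUE))          # about 2.14
print(range(x, na.rm = TRUE))       # 2 9
print(quantile(x, c(.3, .84), na.rm = TRUE))

# Lagged differences of a sequence of squares
print(diff(c(1, 4, 9, 16)))         # 3 5 7

# scale() centres each column to mean 0 and rescales to sd 1
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
z <- scale(m)
print(colMeans(z))                  # both (numerically) 0
print(apply(z, 2, sd))              # both 1
```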
Other Useful Functions
Function Description
seq(from , to, by) generate a sequence
indices <- seq(1,10,2)
#indices is c(1, 3, 5, 7, 9)
rep(x, ntimes) repeat x n times
y <- rep(1:3, 2)
# y is c(1, 2, 3, 1, 2, 3)
cut(x, n) divide a continuous variable into a factor with n levels
y <- cut(x, 5)
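The three helpers can be tried as follows. The vector passed to cut() is made up, and the each argument of rep() is an extra illustration.

```r
indices <- seq(1, 10, 2)
print(indices)                 # 1 3 5 7 9

y <- rep(1:3, 2)
print(y)                       # 1 2 3 1 2 3
print(rep(1:3, each = 2))      # 1 1 2 2 3 3

# cut() splits the range of x into 5 equal-width intervals
x <- c(1, 12, 25, 37, 50)
bins <- cut(x, 5)
print(nlevels(bins))           # 5
print(table(bins))             # one observation per interval here
```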
Sorting
• To sort a dataframe in R, use the order( ) function. By default, sorting is
ASCENDING. Prepend the sorting variable by a minus sign to indicate
DESCENDING order.
• # sorting examples using the mtcars dataset
data(mtcars)
# sort by mpg
newdata = mtcars[order(mtcars$mpg),]
# sort by mpg and cyl
newdata <- mtcars[order(mtcars$mpg, mtcars$cyl),]
#sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mtcars$mpg, -mtcars$cyl),]
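The sorting calls above can be verified on mtcars itself. This sketch also shows the decreasing argument of order() as an alternative to the minus sign, which only works for numeric variables.

```r
data(mtcars)

# Ascending by mpg
by_mpg <- mtcars[order(mtcars$mpg), ]
print(head(by_mpg[, c("mpg", "cyl")], 3))

# Ascending by mpg, ties broken by descending cyl
by_mpg_cyl <- mtcars[order(mtcars$mpg, -mtcars$cyl), ]

# Descending by mpg using the decreasing argument
by_mpg_desc <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
print(rownames(by_mpg_desc)[1])    # "Toyota Corolla"
print(by_mpg_desc$mpg[1])          # 33.9
```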
Merging
• To merge two dataframes (datasets) horizontally, use the merge
function.
• In most cases, you join two dataframes by one or more common
key variables (i.e., an inner join).
# merge two dataframes by ID
total <- merge(dataframeA,dataframeB,by="ID")
# merge two dataframes by ID and Country
total <- merge(dataframeA,dataframeB,by=c("ID","Country"))
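To see the inner-join behaviour, the sketch below merges two small invented data frames. The all = TRUE call is an extra illustration of keeping unmatched rows (a full outer join).

```r
dataframeA <- data.frame(ID = c(1, 2, 3), score = c(90, 85, 70))
dataframeB <- data.frame(ID = c(2, 3, 4), grade = c("B", "C", "A"))

# Inner join: only IDs present in both data frames are kept
total <- merge(dataframeA, dataframeB, by = "ID")
print(total)                 # IDs 2 and 3

# all = TRUE keeps unmatched rows and fills the gaps with NA
all_rows <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)
print(all_rows)              # IDs 1 to 4
```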
Merging
• ADDING ROWS
To join two dataframes (datasets) vertically, use the rbind
function. The two dataframes must have the same variables,
but they do not have to be in the same order.
total <- rbind(dataframeA, dataframeB)
If dataframeA has variables that dataframeB does not, then either:
Delete the extra variables in dataframeA or
Create the additional variables in dataframeB and set them to NA (missing) before joining them with rbind.
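The second option (creating the missing variable and setting it to NA) can be sketched as follows, using invented data frames whose columns are deliberately in a different order.

```r
dataframeA <- data.frame(ID = 1:2, score = c(90, 85), note = c("x", "y"))
dataframeB <- data.frame(score = c(70, 60), ID = 3:4)   # no 'note' column

# Create the missing variable in dataframeB before stacking
dataframeB$note <- NA

# rbind() matches data frame columns by name, not by position
total <- rbind(dataframeA, dataframeB)
print(total)
```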
Aggregating
• It is relatively easy to collapse data in R using one or more
BY variables and a defined function.
• # aggregate dataframe mtcars by cyl, returning means
# for numeric variables
aggdata <- aggregate(mtcars, by=list(mtcars$cyl), FUN=mean, na.rm=TRUE)
print(aggdata)
• OR use apply()
Aggregating
• When using the aggregate() function, the by variables must
be in a list (even if there is only one).
• See also:
• summarize() in the Hmisc package
• summaryBy() in the doBy package
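As a minimal sketch of the by-list requirement, the code below computes the mean mpg per number of cylinders in three equivalent ways. The formula interface and tapply() are alternatives from base R that are not named on the slides.

```r
# by must be a list, even for a single grouping variable
aggdata <- aggregate(mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = mean)
print(aggdata)

# Formula interface to the same computation
by_formula <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# tapply() returns a named vector instead of a data frame
by_tapply <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(by_tapply)
```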
Data Type Conversion
• Type conversions in R work as you would expect. For
example, adding a character string to a numeric vector
converts all the elements in the vector to character.
• is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame()
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame()
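The coercion rule described above, and the is.*/as.* families of functions, can be illustrated with a few lines. The values are arbitrary.

```r
# Mixing numbers and a string coerces every element to character
mixed <- c(1, 2, "three")
print(class(mixed))            # "character"
print(is.numeric(mixed))       # FALSE

# Explicit conversion back to numeric
nums <- as.numeric(c("1", "2.5"))
print(nums)                    # 1.0 2.5

# Strings that are not numbers become NA (with a warning, suppressed here)
bad <- suppressWarnings(as.numeric("three"))
print(bad)                     # NA

# Converting between structures
df <- as.data.frame(matrix(1:4, nrow = 2))
print(is.data.frame(df))       # TRUE
```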
Advanced R
[Link]
us/articles/201057987-Quick-list-of-useful-R-
packages
Useful Websites
• [Link]
• [Link]
• [Link]
• [Link]
• [Link]
• Excellent tutorial on nearly every aspect of R
• Introduction to R by Vincent Zoonekynd
• KickStart
• Hints on plotting data in R
• P. Kuhnert & B. Venables, An Introduction to R: Software for Statistical Modeling & Computing
• J.H. Maindonald, Using R for Data Analysis and Graphics
• W.J. Owen, The R Guide
• D. Rossiter, Introduction to the R Project for Statistical Computing for Use at the ITC
• W.N. Venables & D. M. Smith, An Introduction to R
• Interpreting Output From lm()
• An Introduction to R
• Import / Export Manual
• R Reference Cards