
R with RStudio for Introductory Statistics
Copyright © 2020 Ramon C. Hernandez
Published by Ramon C. Hernandez and Gold Mountain Publishing

All Rights Reserved

No part of this book may be reproduced or transmitted in any form or
by any means, electronic or mechanical, including photocopying,
recording, or by any information storage or retrieval system, without
written permission from the author or Gold Mountain Publishing, except
for the inclusion of brief quotations in a review. No warranty liability
whatsoever is assumed concerning the use of the information contained
herein. Although every precaution has been taken in the preparation of
this book, the publisher and author assume no responsibility for errors or
omissions.

Date of Publication: March 2020


Language: English
PREFACE
With the advent of personal computers and the internet, the sheer volume
of data we have available has grown enormously and continues to grow with
each passing day. The science of data analysis (statistics, econometrics,
psychometrics, and machine learning, among others) must keep pace with
this explosion of data. Today’s statisticians must be able to access and
analyze data from a wide range of sources (database management systems,
text files, statistical packages, and spreadsheets, etc.), merge the pieces of
data, clean and annotate them, analyze them with the latest methods, present
the findings in meaningful and graphically appealing ways, and incorporate
the results into attractive reports that can be distributed to stakeholders and
the public. Here statistical computing comes into the picture to facilitate the
accomplishment of these goals.
Today, R (an open-source implementation of the S language, which also
underlies the commercial statistical package S-PLUS) runs on a range of
platforms, including Windows, Mac OS X, and Linux, and has become a
worldwide language for statistical computing, predictive analytics, and data
visualization. The language is used extensively by statisticians, scientists,
researchers, and academics around the world. RStudio is a widely used
graphical interface that sits on top of R and makes it more user-friendly.

No prior experience in programming or in using R is assumed or
necessary. The instructions are written in a reader/learner-friendly,
step-by-step, easy-to-follow manner, with many screen images that show the student
what the results should look like on the actual program screen. This makes it
easy for the student to follow along and to check for a correct understanding
of the material and methods every step of the way. The book is written in a
down-to-earth conversational style that greatly simplifies the concepts
encountered.
A bonus is that, not only is the student learning statistical computing but,
as the book progresses, Dr. Hernandez is also clearly explaining some of the
more difficult concepts in introductory statistics.
This first part of the course takes the student from coding 1 + 1 = 2 to
the basic methods of importing, cleaning, organizing, and managing data,
producing catchy, informative graphs, and carrying out statistical methods
and tests. After having acquired some measure of familiarity with R and its
basic functions, the student is then introduced to RStudio in Chapter 4. All
subsequent chapters and topics utilize RStudio as the GUI from which all
work is done.
This book is one of those that come from the author’s collection of
lecture notes. It is designed to take you, even if you are a raw beginner, and
show you statistics and statistical computing with R and RStudio in such a
way that you will be doing your projects with an admirable level of
expertise.
At the end of the book, I have recommended some resources for further
reading if it is your wish to go further (which I hope you will) and build on
the good foundation you received herein.
May the time you spend with this book be enlightening, enjoyable, and
edifying.
Acknowledgments
I must acknowledge several people for their help with this project over
the years it has been on the stovetop. I thank my wife Lisa and my daughter
Giselle for their careful proofreading and insightful comments. I would like
to thank those who reviewed the text in its various stages and who also gave
support and guidance: Lennox Celestin, MSc, and Neil Sylvester, MSc, of the
College of Science, Technology and Applied Arts of Trinidad and Tobago. I
also want to extend thanks to my special professors, Drs. Stuart Hurlbert,
David Futch, and Lars Hellberg (deceased) of San Diego State University and
Dr. John Oakes of the University of California San Diego, as well as the
students at Herbert Hoover High School and San Diego City College, both in
San Diego, CA, and the students at the College of Science, Technology and
Applied Arts of Trinidad and Tobago, who were all my focus groups and
“guinea pigs” through the years.

RCH
Chapter 1: Introducing R
What is R?
R is free, open-source software and one of the most popular platforms
for data analysis and visualization available today.

Why should I use R?


Although there exists a wide range of statistical and graphing packages,
such as Microsoft Excel, SPSS, SAS, and Minitab, R is recommended by the
many features that it possesses:
R is free! They say that the best things in life are free; so there you go.
It offers a comprehensive platform on which you can do just about any type of statistical analysis.
It has state-of-the-art graphing capabilities that can be used to visualize anything from the simplest to the most complex data.
It is a powerful platform for interactive data analysis.
It easily imports data from a wide variety of sources.
It is extensible, and therefore provides a natural language for quickly programming recently published methods.
It contains advanced statistical routines available as packages. New methods become available for download weekly.
There are even guides for installing R on smartphones.

Download and Installation


R is available freely for download on the Comprehensive R Archive
Network (CRAN) at
[Link]
Once you are at the CRAN site, click to choose between Linux, Mac
OS, and Windows, then follow the instructions for your platform. You
download the base product, called base R, directly to your device and
then run the setup program, which has a name like [Link] (for a PC) or [Link]
(for a Mac). Upon being asked whether you want to ‘Run’ or ‘Save’ the file,
click ‘Run’. It is all automatic from there. After installation, simply click on
the R icon to begin using R.
You can later extend R’s functionality by downloading optional modules
called packages (also from CRAN). The packages are freely contributed by
world-wide sources.

GUIs
Several Graphical User Interface (GUI) applications, such as R
Commander and RStudio, are available. They offer the power of R
through menus, graphical icons, and dialogs, and run on a wide variety of
platforms, including Windows, Unix, Linux, and Mac OS X. These GUIs are
point-and-click interfaces that sit on top of R and serve to simplify its use
by reducing the number of lines of code the user needs to write and, for
some operations, eliminating the writing of code altogether. In this course,
we will turn to RStudio and its use in a subsequent chapter, since it is best
practice to first learn base R, which we will do in the early chapters.

Using R
In this opening chapter, we will give a brief overview and rundown of
some of R’s functionality and in the following chapter, we will get into the
hands-on details.
R is case-sensitive and interpreted, so be careful with the case of your
letters. If you saved an object as “myFile” and later call it up with the
command “MyFile”, R will advise you that there is no such object.
Commands are entered one at a time at the command prompt (>). R uses
data structures that include vectors, matrices, data frames, and lists.
R stores and manipulates the data we create as objects,
as it is an object-oriented language. It gets much of its functionality through
built-in and user-created functions, and all objects are kept in memory
during an interactive session. Other functions are contained in packages that
can be attached to a current session and then detached when the session is
over.
On opening base R, you will see an interface like the one shown below.
Figure 1.1

The red “greater than” symbol > that you see at the end of the blue
writing is called the prompt. We enter our commands at the prompt.
A variable is used to store the results of computations. In other words,
after we make a computation, if we give a name to the result, then that
name is a variable. For example,
> x <- sqrt(64)
Here we are computing the square root of 64 by placing
“64” between the brackets of the square root function, sqrt(),
sqrt(64)
and then, by writing “x” and putting an arrow (a less-than symbol “<”
followed by a dash “-”) pointing from “sqrt(64)” to “x”, as shown below,
x <- sqrt(64)
we are assigning the name x to the result of the square root, which is 8.
Therefore, R will store in memory a variable (object) named “x” that has a
value of 8. If we now call up the variable x (by writing “x” at the prompt
and then pressing ENTER), R will return the value “8”, as shown below.
(Our input is placed after the prompt, and R’s output is what comes after
the number in square brackets.)
>x
[1] 8 #This is R’s response

*Note: Anything written after a pound sign (#) on a code line is a comment
and is not part of the code. The line above is an example.
To assign the value 15 to a variable y, we key
> y <- 15

Operators and Functions

We act upon objects in R’s memory, like the variables “x” and “y”
above, by way of operators and functions. Operators are symbols that call
for some action or operation to be performed on pieces of data. The symbols
“+”, “-”, “*”, and “/” are all operators that perform addition, subtraction,
multiplication, and division on numerical data. Operators can be arithmetical
(like those shown above), logical (like and &, or |, and not !), or
comparative (like greater than > and less than <). By keying in
>2*y + 3*x
we are performing operations on the numerical variables x and y, to
which we have already assigned values. We are now multiplying the variable
y by 2 and then adding the result to three times the variable x. Upon pressing
ENTER, R will respond with the answer 54.
[1] 54
We will see much more about the use of operators later.
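The short sketch below gathers the three kinds of operators in one place. It reuses the values of x and y from this chapter; the names total, bigger, and both are made up here purely for illustration.

```r
x <- sqrt(64)   # 8, as assigned earlier in the chapter
y <- 15

total  <- 2 * y + 3 * x          # arithmetic: 30 + 24 = 54
bigger <- y > x                  # comparison: is 15 greater than 8? TRUE
both   <- (y > 10) & (x > 10)    # logical "and": TRUE & FALSE gives FALSE

print(total)
print(bigger)
print(both)
```

Notice that a comparison returns a logical value (TRUE or FALSE), and that logical values can themselves be combined with & and |.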

Functions

We can also act upon objects by way of functions. A function is an object
that does something, for example, performs a specified calculation. We could
take the calculation we made with our variables x and y in the last section
and call the entire calculation a function, function1().
When a function is called, its name is followed by open and close round
brackets, (). When we call up the function and give it values for x and y, it
performs the entire calculation and gives us an answer.
Examples of functions include
scan()
sqrt()
plot()
sum()
Arguments are placed within the function’s brackets to give the function
specific instructions on how to do its job. For example, in the function we
called function1() above, we will place the values for x and y between the
brackets. The function will use these values to make its calculations. These
values we give to the function are called arguments. Another example is the
“64” we placed between the brackets of the square root function; the 64 is
the argument.
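To make the idea concrete, here is one way function1() could be written. This is only a sketch: the text does not show a definition, so the body below simply assumes that function1() repeats the calculation 2*y + 3*x from the previous section.

```r
# A hypothetical definition of function1(); x and y are its arguments
function1 <- function(x, y) {
  2 * y + 3 * x
}

function1(x = 8, y = 15)   # same calculation as before: 54
function1(1, 1)            # arguments can also be given by position: 5
```

Writing the calculation once as a function lets us reuse it with any values of x and y without retyping the formula.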
Functions also have options, in the form of additional arguments, that can
fine-tune their results. For example,

>plot(Depth ~ Pressure, data=DiveData, pch=16)

This plot() function plots Depth against Pressure by way of the argument
“Depth ~ Pressure”. The “data=DiveData” argument tells R that the variables
Depth and Pressure can both be found in a data frame called DiveData. We have
also included another argument, pch=16, which tells R to use a solid
dot as the plotting symbol. When we hit ENTER, R will execute the plot.
R’s graph will look something like this:

Figure 1.2
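If you wish to try the plot() command yourself, you first need a DiveData object in memory. The sketch below builds a small stand-in; the numbers are invented for illustration and are not the data behind the figure.

```r
# Made-up values for illustration only; the book's DiveData is not shown
DiveData <- data.frame(
  Pressure = c(1, 2, 3, 4, 5),
  Depth    = c(0, 10, 20, 30, 40)
)

# Depth on the y-axis, Pressure on the x-axis, solid dots as plotting symbols
plot(Depth ~ Pressure, data = DiveData, pch = 16)
```

Try changing pch to another whole number, such as 1 or 17, to see other plotting symbols.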

List of Objects in Memory

The list function, ls(), will output a list of all objects in memory. This
function does not usually need an argument. To remove an object, y, from
memory, use the remove function rm() and put “y” as the argument:
> rm(y)
Upon pressing ENTER, the object y will be removed from memory.
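A small sketch of ls() and rm() working together follows. The names before and after are introduced here only to hold the two listings so that they can be compared.

```r
x <- sqrt(64)
y <- 15

before <- ls()    # names of the objects in memory, including "x" and "y"
rm(y)             # remove y from memory
after <- ls()     # "y" is no longer listed

print(before)
print(after)
```

If you now key y at the prompt, R will report that the object is not found.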

The R Workspace
The workspace is the current R environment in which you are working,
and it includes any object you defined or to which you assigned values (e.g.
variables). At the end of an R session, you can save an image of the current
workspace, and that image is automatically reloaded the next time you start
R. You can use the up and down arrows to scroll through the commands you
keyed during your session. Here you can select an old command, edit it as
you wish and then resubmit it with the Enter key.
To display a given number of your last commands, say, your last four
commands, you can use the history() function and give it “4” as an
argument.
>history (4)

Input and Output


You can input from the keyboard or from a script file containing R statements
and commands. Your output is by default sent to the screen (console).
However, you can use the sink() function to direct your output to other
destinations. If, for example, you wish to direct the output of a calculation to
a file called [Link], which can be found by the path
C:/Users/Desktop/[Link], you can use the command

> sink("C:/Users/Desktop/[Link]")

This will send the output to the file [Link] only, and you will not
see your output on the screen. By including the option code,

> sink("C:/Users/Desktop/[Link]", split=TRUE)

your output will be sent to the specified file as well as to the screen.

One Caution: If there is pre-existing content in the file [Link], the
contents will be overwritten by the sink() function. However, if you include
the option,
> sink("C:/Users/Desktop/[Link]", split=TRUE, append=TRUE)

the output of your current session will be appended (added) to the file and
the contents of the file will not be overwritten.
The sink function will work on text output but not on graphic output. To
direct graphic output to a specific file, say, [Link], and text output to
myfile.Rdata, you would use

> sink("C:/Users/Desktop/[Link]", split=TRUE, append=TRUE)
> jpeg("[Link]")

There are several different functions for saving graphic output in base
R. Some of them are listed in the table below. Remember that R is
case-sensitive, so the function names are written in lowercase.
Function            Type of File Saved
pdf("[Link]")       A pdf file
jpeg("[Link]")      A jpeg file
bmp("[Link]")       A bitmap file

To “unsink”, that is, to stop text output from being redirected to the sink
file, you would use
> sink()
with no arguments.
To stop graphic output from being sent to a graph file, you would use
>[Link] ()
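The sketch below walks through a complete round trip: redirect text output, stop redirecting, then do the same for a graph. Two assumptions are made so that it runs on any computer: temporary files created with tempfile() stand in for the paths used in the text, and the pdf() device is used for the graph in place of jpeg(). The graphics device is closed with base R's dev.off() function.

```r
# Temporary files stand in for the paths used in the text
text_file  <- tempfile(fileext = ".txt")
graph_file <- tempfile(fileext = ".pdf")

sink(text_file, split = TRUE)   # text output goes to the file and the console
print(sqrt(64))
sink()                          # stop redirecting text output

pdf(graph_file)                 # open a graphics device that writes to a file
plot(1:10)
dev.off()                       # close the device so the file is completed

readLines(text_file)            # what was written to the text file
file.exists(graph_file)         # the graph file was created
```

Until the graphics device is closed, the graph file is not complete, so it is a good habit to close it as soon as the plot is done.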
Directories
The current working directory is where R will look for the files with which
to work, and where it will save files unless told otherwise. To find the
folder that R is currently using as the working directory, use
> getwd()
To get R to use a different directory, say, directory1, use
> setwd("directory1")
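Here is a minimal sketch of changing the working directory and then changing it back. The folder used is R's own temporary folder, tempdir(), chosen only because it exists on every computer; old_dir and new_dir are names made up for this example.

```r
old_dir <- getwd()      # remember where we started
setwd(tempdir())        # switch to a folder that is sure to exist
new_dir <- getwd()      # confirm the change
setwd(old_dir)          # return to the original working directory

print(new_dir)
```

Saving the old directory first makes it easy to return to it at the end of your work.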

Saving Your Work


To save a history of all commands used in your workspace to a file, say
myhistory, you can use the command
> savehistory("myhistory")
To save the current workspace to a file, say, myworkspace, you can use
the command
> [Link]("d:/[Link]")
To save specific objects to a file, say, myfile, you can use the command,

> save(objectlist, file="d:/[Link]")

In the examples above, d:/ stands for the working directory.
For example, you have in your workspace three objects, x, y, and z,
which you wish to save in a new file called [Link] in the working
directory. To do this, you enter

> save(x, y, z, file="d:/[Link]")

To access saved files, you can use the load() function to load Rdata files.

> load("d:/[Link]")
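The following sketch saves three objects, removes them, and then brings them back with load(). A temporary file created with tempfile() is assumed here in place of a file on the d:/ drive, so the example runs on any computer.

```r
x <- 1
y <- 2
z <- 3

saved_file <- tempfile(fileext = ".RData")   # stand-in for a path of your own
save(x, y, z, file = saved_file)

rm(x, y, z)                    # the objects are gone from memory
loaded <- load(saved_file)     # load() restores them and returns their names
print(loaded)
print(x + y + z)
```

Because load() restores objects under their original names, it will silently replace any objects of the same name already in your workspace.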

Alternatively, you can save files by clicking the File Menu and then
clicking save workspace. A dialog box will appear. Now you can browse to
the folder in which you want to save the file and give the file a name of your
choice and click Save.
You can also access a saved file by clicking the File Menu then clicking
load workspace. A dialog box will appear. Now you can browse to the folder
in which you saved the .Rdata file and click Open.
You can save commands made in your R session through the File Menu
by clicking file and then save history. A dialog box will appear. Browse to
the folder in which you want to save the file, name the file and then click
Save.

Packages
A “package” in R is a set of functions bundled together to perform a
certain group of operations. Packages are stored in “libraries”, which are
analogous to specific cabinet drawers in the filing cabinet that is your
computer. When you downloaded R for the first time, the package “base”
which contains all the basic functions was downloaded automatically. There
exist, however, many different packages with different functionalities that
are developed by the programming community and placed in the CRAN site
where they are made available for download and use by general R users. To
get a list of the standard packages loaded with base R, type
>search()
Below is a partial list of some base packages (there are hundreds of them)
and a brief description of their functions.

The complete list can be found at
[Link]

Installation of Packages

To install a new package for the first time, you would use the command

> [Link]()
This will bring up a list of CRAN mirror sites. You select a site and then
you will see a list of all packages on that site. You select a package and it
will be downloaded and installed on your computer. If you know the name
of the specific package you want, say the package BRUD, you could use
> [Link]("BRUD")
Loading a Package

When you install a package, say, BRUD, it is downloaded from a mirror
site somewhere in the world and placed in your library. If you now want to
use the package BRUD in a session, you will need to load the package from
the library into your session by using
> library(BRUD) #(NOTE: No quotation marks are needed this time)
Installing packages is a snap when using RStudio, which will be
introduced in an upcoming chapter of this book.
What Does a Particular Package Do?
After loading a package, say, BRUD, you could find out all about it by
using
> help(package="BRUD")
R will return a brief description of the package along with a list of the
functions and datasets included in it.
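Installing a package requires an internet connection, so the sketch below sticks to a package that already comes with R, the stats package. It shows loading a package and then checking that it is attached to the session; the names attached and is_installed are made up for this example.

```r
library(stats)                  # load a package that ships with R

attached <- search()            # everything attached to the current session
"package:stats" %in% attached   # TRUE once the package is loaded

# requireNamespace() reports whether a package is installed, without attaching it
is_installed <- requireNamespace("stats", quietly = TRUE)
print(is_installed)
```

The same check with the name of a package you have not installed returns FALSE, which tells you that the package must be installed before library() can load it.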

Getting Help

The help() Function


If you need to obtain help on a specific R function, say the [Link]()
function, simply type
> help([Link])
R will return the help page for that function. To see the documentation
for the help() function itself, type
> help()
By giving the help() function the name of a function as an argument, R
provides the appropriate documentation. For example, if you type
> help(anova)
R will provide the documentation for the anova() function, which computes
analysis of variance tables. (Note the lowercase name: R is case-sensitive,
so the name must be typed exactly as the function is spelled.)

The [Link]() Function

If we type
>[Link]()
R will open for us a window with much information on syntax, packages,
and functions.

The [Link]() and apropos() Functions

Two important help functions are [Link]() and apropos(). Both take
a word or phrase enclosed in quotation marks.
If you type
> [Link]("t-test")
R will return a list of help pages related to the t-test.
If you type
> apropos("table")
R will return a list of all objects whose names contain the word “table”.
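Since apropos() returns its answer as a character vector, we can store the result and work with it like any other object. In the sketch below, matches is a name made up for illustration.

```r
matches <- apropos("table")   # names of objects that contain "table"

head(matches)                 # look at the first few matches
length(matches)               # how many were found
"table" %in% matches          # the table() function itself is among them
```

The exact list you see depends on which packages are attached to your session.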

The example() Function

If you type, for instance,


>example([Link])
R will return an example, if available, of the use of the Wilcoxon test.

Ending an R Session
To end your session, simply use
>q ()
This is the quit function used with no parameters. If you have not saved
your workspace using the methods discussed above, you will be prompted to
do so after entering the quit function.

Mistakes to avoid in R Programming


It is easy to make errors in writing R code. The following are common
mistakes made in writing R.
Attempting to use functions from a package that is not loaded.
Using a single backslash in a pathname. The forward slash should be used instead.
Not enclosing character strings, such as file names and the names of packages to be installed, in quotation marks.
Not including the parentheses when calling a function.
Not using the proper case (upper or lower), as R is case-sensitive.
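Two of these mistakes can be seen safely in the sketch below. The object myValue is made up for the example, and tryCatch() is used only so that the error message is captured as text and the code keeps running.

```r
myValue <- 10

# Wrong case: MyValue was never created, so R signals an error
result <- tryCatch(MyValue, error = function(e) conditionMessage(e))
print(result)

# Without parentheses, R returns the function itself instead of calling it
is.function(getwd)      # TRUE: getwd is the function
is.character(getwd())   # TRUE: getwd() is the function's result
```

When R reports that an object is not found, checking the spelling and case of the name is a good first step.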
Exercises 1.
Get Used to Working with R

Open R.
Inquire of R about your current directory.
Redirect session output to a file called [Link] (pay attention to the
case), while having output also show on your console.
Redirect graphical output to a file called [Link].
Install the package AER (a package with functions, examples, datasets,
and demos). Load the package.
Get a list of the functions and datasets available in the AER package.
Get the details on the dataset CollegeDistance.
Output the dataset CollegeDistance.
Run the example in CollegeDistance (simply key:
example(CollegeDistance))
List the objects in your workspace.
List the last three commands you entered.
Run the example in CollegeDistance again. This time the graph should
show on your screen – as it is not being redirected to any file.
Quit.
Go into the current directory and check the file [Link]. You should
find the graph of the CollegeDistance dataset in this file.
Close R.
Chapter 2: Data Structures

Forms of Data
We now zoom in a little closer for a more detailed look at the workings of
R. In R, data is stored as objects. All objects have two basic attributes: the
mode of the object and the length of the object. The mode of the object is the
type of data that the object holds. There are four basic mode types: numeric,
character, complex, and logical (TRUE or FALSE).
Let us create an object we will call A. We are adding 3 + 4, and then calling
the answer “A”.
> A <- 3 + 4
After hitting ENTER, R now has in its memory an object called A, whose
value is 7. To verify that R has created the object A, we simply key the name
of the object and hit ENTER
> A
and R will return the value of A.
[1] 7
To find out the mode of the object named A, we key
> mode(A)
If A is numeric, as we know it is, R will return
[1] "numeric"
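The sketch below checks both basic attributes, mode and length, and shows one value of each of the four modes. The values used are chosen only for illustration.

```r
A <- 3 + 4

mode(A)           # "numeric"
length(A)         # 1, since A holds a single value

mode("Havana")    # "character"
mode(TRUE)        # "logical"
mode(2 + 3i)      # "complex"
```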

Naming Objects
We have already seen the naming of an object by our work with object A
above, but let us take a closer look at object naming. When naming an object
(with the <- assignment), we can use letters (A – Z or a – z), digits (0 to 9),
dots, and underscores, but the name must begin with a letter or a dot. R
discriminates between lowercase and uppercase letters in the names of
objects, so that “A” is not the same as “a”.

Some Basic Operations and Functions


Consider the one-column dataset, which represents the lengths of five
widgets.
To enter the data in the table above into R, you use the combine
function c(), which combines the arguments (15, 10, 6, 11, 9) into an R
vector. (The function c() accepts numeric, character, or logical
arguments.) After the prompt you type

> Length <- c(15, 10, 6, 11, 9)

Then press the Enter key.


You have now created a vector (15, 10, 6, 11, 9) and called this vector
Length, using the assignment symbol <-. The numbers 15, 10, 6, 11, and 9
are elements of the vector. The number 15 is the first element of the vector
and is referred to as Length[1]. Thus, if we key
> Length[1]
we are asking R for the first element of the object Length. R will respond
with
[1] 15
Now, to call up the entire vector Length to the screen, simply write
> Length
and then press Enter.
R will show
[1] 15 10  6 11  9

The number in the square brackets at the beginning of R’s output,
[1], indicates that the display begins at the first element of the object Length.
Were the number of measurements in the vector too many to fit on one line,
and the second line began with the 10th measurement, R would show
[1] 15 10 6 11 9 28 13 17 23
[10] 36 31 19
This means that the first line begins with the first element of the vector
and the second line begins with the 10th element.
So this vector is stored in memory and you can call it up at any time in your
session. If, however, you simply entered
>5*12
and then press the Enter key, R will return the value
[1] 60
But this will not be stored in memory because you did not assign it a
name with the <- symbol. Thus, R does not see it as an object - just a
calculation.
Create another vector, Weight, the vector of weight measurements of our
five widgets:

> Weight <- c (135, 66, 73, 120, 112)

To find the mean and standard deviation of the vector Weight, we write
>mean (Weight)
[1] 101.2
>sd(Weight)
[1] 30.19437
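It can be helpful to see what sd() is doing behind the scenes. The sketch below recomputes the sample standard deviation step by step (dividing by n - 1) and compares it with R's answer; n and by_hand are names made up for this example.

```r
Weight <- c(135, 66, 73, 120, 112)

mean(Weight)    # 101.2
sd(Weight)      # about 30.19

# The sample standard deviation by hand: squared distances from the mean,
# added up, divided by n - 1, then the square root
n <- length(Weight)
by_hand <- sqrt(sum((Weight - mean(Weight))^2) / (n - 1))

all.equal(by_hand, sd(Weight))   # TRUE: both routes give the same answer
```

Notice that Weight - mean(Weight) subtracts the mean from every element at once; R carries out arithmetic on whole vectors, element by element.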
When we write
>plot (Length, Weight)
we obtain a scatterplot of Length on the x-axis and Weight on the y-axis
as shown below.
Figure 2.1
This plot is basic and somewhat unattractive. Later you will learn to
create attractive, custom graphs to suit your needs.
If a variable, x, has the value of 15, then until you change it, x has the
value of 15. It can be used in subsequent mathematical calculations. For
example,
> x <- 15
>x/3
[1] 5
>x^2 #x raised to the second power
[1] 225

Objects of Character Mode


The value of an object with character mode is input between quotation
marks. For example,

> y <- c("Port-of-Spain", "Buenos Aires", "Havana")

The variable y is now a vector of character (or non-numeric) mode that
contains three elements: Port-of-Spain, Buenos Aires, and Havana. To look
at the object y, we key and enter:
> y
[1] "Port-of-Spain" "Buenos Aires"  "Havana"

Data Structures in R
Objects in R come in many forms and structures. When R works with
data, R first notes the structure of the data with which you are presenting it.
Working with data in R means, therefore, that we must first choose the
appropriate data structure to hold the data. R has many different data
structures for holding data. For our purposes, the most important of these are
numbers, vectors, matrices, data frames, lists, factors, and strings. We will
now look at each data structure individually.

Numbers
Numbers in R are usually dealt with the same as they are in ordinary
mathematics. One of the main differences is how R writes very large and
very small numbers. When we write “a e b” in R, where a and b are
numbers, we mean a × 10^b. Now let a = 6.3 and b = 13. Thus, a e b means
6.3e+13, which means 6.3 × 10^13.
In reverse, when we enter in R
> exp(40)
we get back
[1] 2.353853e+17
which means 2.353853 × 10^17.
Additionally, R denotes infinity by “Inf”. Strictly speaking, division by
zero is undefined in mathematics, but R follows the convention that a
positive number divided by zero is infinity, a number too large to be
written. So, if we type in
> 3/0
R will answer:
[1] Inf
Now, if we type in
> 0/0
we are trying to divide zero by zero, another mathematical anomaly. R
will tell us
[1] NaN
This means “Not a Number”.
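The following sketch pulls these special values together and adds two helper functions from base R, is.nan() and is.finite(), that let us test for them.

```r
all.equal(6.3e+13, 6.3 * 10^13)   # TRUE: "e" means "times ten to the power"

exp(40)            # the number e (about 2.718) raised to the power 40

3 / 0              # Inf
-3 / 0             # -Inf
0 / 0              # NaN

is.nan(0 / 0)      # TRUE
is.finite(3 / 0)   # FALSE
```

Checks like these are useful later on, when a calculation over a whole dataset unexpectedly produces Inf or NaN.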
Vectors
Vectors are the simplest data structure and they consist of a one-
dimensional array of data of any type: numeric, character, or logical. But all
data in a single vector must be of the same type. We can represent a vector
by a column of elements. Below, we see a vector represented by a column
with its elements all the same color to denote the fact that all elements are of
the same type.

A vector keeps track of the numeric position of each of its elements, so it is
an ordered data structure. The entry
> x <- c(1, 3, 6, 8, 11)
creates a vector named x with five ordered elements.
We can represent x as:

This denotes a vector of five elements with the position of each element
represented as a box in which we place the actual value of the element at
that position. The value of the element at the fifth position (fifth box) is 11.
The entry (with square brackets)
>x[4]
will get an output of the fourth element of the vector x, which is the value
8. Thus, if we hit the Enter Key, R will return
[1] 8
The entry with the colon
>x [3:5]
tells R to generate a sequence of the third through the fifth elements of
the vector x.
R will therefore output:
[1] 6 8 11
If we enter
>x[2] +x[4]
R will tell us
[1] 11
The following line changes the third element of x from 6 to 36:
>x[3] <- 36
>x
[1] 1 3 36 8 11
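The square brackets can do a little more than pick out one element or a run of elements. The sketch below repeats the examples above and adds two further forms: a vector of positions, and a negative position, which leaves an element out.

```r
x <- c(1, 3, 6, 8, 11)

x[4]           # the fourth element: 8
x[3:5]         # third through fifth: 6 8 11
x[2] + x[4]    # 3 + 8 = 11
x[c(1, 5)]     # first and fifth: 1 11
x[-1]          # everything except the first: 3 6 8 11

x[3] <- 36     # change the third element
x              # 1 3 36 8 11
```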

Using the assignment method, we can also create the original vector x by
creating an empty vector x and then adding values one at a time.
> x <- c() # creating the empty vector
> x[1] <- 1 # assigning the value 1 as the first element
> x[2] <- 3 # assigning 3 as the second element
> x[3] <- 6
> x[4] <- 8
> x[5] <- 11
# now we look at the vector we have created by assignment
> x
[1] 1 3 6 8 11

Appending elements to an existing vector

The function c() can also be used to append (add) elements to an existing
vector. Let us create a vector x with the elements 1 3 5 7.
>x <- c(1, 3, 5, 7)
Now enter the code:
>x <- c(x, 9)
This adds to the vector x, the element 9. Now call up x
>x
[1] 1 3 5 7 9

Other examples of vectors are


> y <- c("Port-of-Spain", "Buenos Aires", "Havana")
> z <- c(FALSE, TRUE, FALSE)
The vector y is of the character type and the vector z is of the logical type.
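What happens if we try to mix types in one vector? The sketch below shows that R does not refuse; it quietly converts every element to one common type. The vector name mixed is made up for this example.

```r
y <- c("Port-of-Spain", "Buenos Aires", "Havana")
z <- c(FALSE, TRUE, FALSE)

mode(y)    # "character"
mode(z)    # "logical"

mixed <- c(1, "Havana", TRUE)   # three different types in one call to c()
mode(mixed)                     # "character"
mixed                           # "1" "Havana" "TRUE"
```

This is why a column of numbers that contains even one piece of text will be treated by R as character data.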

Matrices
If a vector can be represented by the columnar structure

then a matrix can be represented by the structure

A matrix is two-dimensional and is made up of vectors that are all of the same
type. The figure shows four vectors, each of five elements, stuck together to
form a matrix. A matrix can contain any number of row vectors or column
vectors. A 4 × 3 matrix has 4 row vectors, each with 3 elements, or 3 column
vectors, each with 4 elements. The representation of a matrix in the figure
above is a 5 × 4 matrix.

The matrix() Function


Matrices can be created in several ways. The most basic method of
creating a matrix is by using the matrix function, matrix().
For example, suppose we want to create a matrix of the three vectors
Age, Years of Training, and Length of Employment, as shown in the figure
below.

We will create one long vector of all the values, column by column. We
call this vector values.
>values <- c(23, 41, 35, 33, 2, 3, 4, 8, 2, 11, 6, 7)
Next, we will create a vector called colnames, whose elements are the
names of our original vectors, and a vector called rownames, whose
elements are the names of the employees. Remember to enclose the names
in quotation marks, since they are of character type.
> colnames <- c("Age", "YearsTraining", "LengthEmploy")
> rownames <- c("Roy", "Sunny", "Stan", "Pat")

Next, we use the matrix function to create the matrix. We will call this
matrix, Matrix1.
>Matrix1 <- matrix(values, nrow=4, ncol=3, byrow=FALSE,
dimnames=list(rownames, colnames))

The values argument in the matrix function is the vector of numbers
that we called “values” in the code above. The arguments
nrow and ncol give the number of rows and columns, respectively, of our
matrix. The argument byrow is of logical type, and it tells the function how
the values (the numbers) will be written into the matrix. In our case, when
we were creating the vector values, we entered the numbers by column and,
as such, we will be filling our matrix column by column. So, we give the
byrow argument the value FALSE to indicate that we will not be filling
the matrix by rows.
The dimnames argument takes a list of the names that the matrix is to use
as its row and column names, in that order. When we press
the Enter key, R will create the matrix, Matrix1, and store it in memory. To
call up Matrix1 on screen, we key and enter
>Matrix1
R will output on the screen
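The whole construction can be run in one go, as sketched below. One small liberty is taken: the name vectors are called row_names and col_names here, rather than rownames and colnames, to avoid reusing the names of two functions that already exist in base R.

```r
values <- c(23, 41, 35, 33, 2, 3, 4, 8, 2, 11, 6, 7)

# row_names and col_names are names chosen for this sketch
col_names <- c("Age", "YearsTraining", "LengthEmploy")
row_names <- c("Roy", "Sunny", "Stan", "Pat")

Matrix1 <- matrix(values, nrow = 4, ncol = 3, byrow = FALSE,
                  dimnames = list(row_names, col_names))
Matrix1
#       Age YearsTraining LengthEmploy
# Roy    23             2            2
# Sunny  41             3           11
# Stan   35             4            6
# Pat    33             8            7

dim(Matrix1)    # 4 rows and 3 columns
```

Because the values were entered column by column, the first four numbers became the Age column, the next four the YearsTraining column, and the last four the LengthEmploy column.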

Accessing the Elements of a Matrix


We can access the elements of a matrix using square brackets, just as we
do with the elements of a vector. The subscripts inside the brackets give
the row and then the column. For example, to identify the value of Stan’s
length of employment, the value at row 3, column 3, we key
>Matrix1[3, 3]
R will give us the value
[1] 6
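As a minimal sketch, the whole matrix example can be run as one piece. Indexing by row and column name is an addition here, not something the text above uses, and it assumes the dimnames were set as shown.

```r
# Build the employee matrix column by column, as in the text.
values <- c(23, 41, 35, 33, 2, 3, 4, 8, 2, 11, 6, 7)
Matrix1 <- matrix(values, nrow = 4, ncol = 3, byrow = FALSE,
                  dimnames = list(c("Roy", "Sunny", "Stan", "Pat"),
                                  c("Age", "YearsTraining", "LengthEmploy")))

# Position-based indexing: row 3 is Stan, column 3 is LengthEmploy.
stan_length <- Matrix1[3, 3]

# Name-based indexing reaches the same cell (assumes dimnames are set).
stan_length_by_name <- Matrix1["Stan", "LengthEmploy"]

# Column 2 holds the years of training.
stan_training <- Matrix1[3, 2]

print(Matrix1)
print(c(stan_length, stan_length_by_name, stan_training))
```

Name-based indexing tends to be easier to read and keeps working if the columns are later reordered.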
If we have a matrix called X, then X[1, 2] indicates the element in the
first row and second column of the matrix.
X[1, ] indicates all elements of the first row, and
X[ ,1] indicates all elements of the first column.
Let us give X some elements.
>X <- matrix(c(1, 3, 5, 7), nrow=2, ncol=2, byrow=TRUE)
>X
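However, suppose we give the function only three values with which to
build a 2 x 2, four-element matrix:

>X <- matrix(c(1, 3, 5), nrow=2, ncol=2, byrow=TRUE)

R does not leave the fourth cell empty. It issues a warning and recycles
the values, so the fourth cell is filled with 1 again. To mark a cell as
missing, we enter NA ourselves:

>X <- matrix(c(1, 3, 5, NA), nrow=2, ncol=2, byrow=TRUE)

When we call up the matrix,

>X
we will see the following
     [,1] [,2]
[1,]    1    3
[2,]    5   NA

NA indicates that that element is “not available”. And asking R for the
elements of the second column,
>X[ ,2]
we will get
[1] 3 NA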
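The difference between a recycled value and a missing value can be checked directly. The sketch below is only an illustration; the names X_recycled and X_missing are hypothetical, and suppressWarnings() is used just to keep the output quiet.

```r
# Three values for four cells: R recycles the data (and normally warns).
X_recycled <- suppressWarnings(
  matrix(c(1, 3, 5), nrow = 2, ncol = 2, byrow = TRUE)
)

# Supplying NA explicitly marks the fourth cell as missing.
X_missing <- matrix(c(1, 3, 5, NA), nrow = 2, ncol = 2, byrow = TRUE)

second_col_recycled <- X_recycled[, 2]   # the 1 is reused here
second_col_missing  <- X_missing[, 2]    # the NA stays an NA

print(second_col_recycled)
print(second_col_missing)
print(is.na(X_missing))                  # TRUE only where a value is missing
```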

Creating and Filling Empty Matrices


Can you figure out what the following code will do?
>Y <- matrix(nrow=3, ncol=2)
>Y
Now, when this code is added, what will the result be?
>Y[ ,1] <- c(2, 4, 6)
>Y[ ,2] <- 1:3
>Y
In the first instance, we created a matrix Y with no values, so until we
give it values, R will place NA at each position. In the second instance, we
are giving Y three values, 2, 4, and 6, which R will fill into the first column.
Then we are giving Y three more values, the numbers 1 through 3 (1, 2, and
3), which R will fill into the second column.
Data Frames
Data frames are like matrices except that they can hold vectors of
different types. The figure below is the representation of a data frame made
up of four different vectors of different colors (different types) stuck
together.

Suppose you have the following dataset. Here we have four vectors: two
of numeric mode and two of character mode. Thus, the structure we will use
is a data frame.
Figure 2.2
One way to create the above data frame is to use the data.frame()
function to create an empty data frame and then use the data editor to enter
the data. In creating the empty data frame, we name each column vector,
state its type, and give it a length of zero. We will call the data frame
dataframe1.

>dataframe1<-data.frame(Name=character(0), Age=numeric(0),
Position=character(0), LengthEmploy=numeric(0))

This creates a data frame called dataframe1, which consists of 4 empty
vectors: Name, Age, Position, and LengthEmploy, two of which are of numeric
mode and two of character mode. Now, with the edit() function, we call up
the data editor to fill in the values.

>dataframe1<-edit(dataframe1)

The above code tells R that we will edit the empty dataframe1 we
created, and the result will be the new dataframe1. The data editor will
come up with the data frame and its variables (columns) already created
(Figure 2.3A). You may now enter the data directly as in a spreadsheet
(Figure 2.3B).
Figure 2.3
If, at this point, we want to change the name of a variable (column
vector), we simply click on the variable name in the data editor. A dialog
box like the one shown below will appear. We can now write in the new
name.

Figure 2.4

To call up dataframe1 on the screen, key


>dataframe1
R’s output will be
Another method of creating a data frame is to enter it directly into the
code instead of using the data editor. To use this method, we first create the
column vectors.

>Name<-c("Roy James", "Sunny Meza", "Stanley Shang", "Patricia Neeley")
>Age<-c(23, 41, 35, 33)
>Position<-c("Stock", "Mail Sup", "IT Sup", "Office Mgr")
>LengthEmpl<-c(2, 11, 6, 7)

Now, we create the data frame called dataframe1.

>dataframe1<-data.frame(Name, Age, Position, LengthEmpl)

That’s it. The data frame will now be created and stored in memory. To
call up dataframe1 to screen, we key
>dataframe1
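Once a data frame exists, its columns and rows can be pulled out by name or by condition. The sketch below is a small illustration under stated assumptions: the name staff is hypothetical, and stringsAsFactors=FALSE is set explicitly so that the text columns stay as character vectors regardless of the R version.

```r
# Column vectors of different types, as in the text.
Name <- c("Roy James", "Sunny Meza", "Stanley Shang", "Patricia Neeley")
Age <- c(23, 41, 35, 33)
Position <- c("Stock", "Mail Sup", "IT Sup", "Office Mgr")

# 'staff' is a hypothetical name used only for this sketch.
staff <- data.frame(Name, Age, Position, stringsAsFactors = FALSE)

ages <- staff$Age                    # one column, returned as a numeric vector
over_30 <- staff[staff$Age > 30, ]   # only the rows where Age exceeds 30

str(staff)
print(over_30)
```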

Factors
In R, categorical (non-numeric) variables can be stored as factors. The
categories of these variables are called the “levels” of the factor. In the
above example, the variable “Position” is non-numeric, so it can be treated
as a factor, and it contains four levels: Stock, Mail Sup, IT Sup, and
Office Mgr. R codes the levels of a factor as integers from 1 to k, where k
is the number of unique levels the factor has. Thus, the factor, Position,
has four levels, and they will be coded as
1 = Stock
2 = Mail Sup
3 = IT Sup
4 = Office Mgr
(Levels created with the factor() function are sorted alphabetically unless
the levels argument says otherwise; the order used here is the one R
reports in the output below.)
Now, suppose we added (using the data editor) the row
“Jim 27 Stock 4 M”,
then the data frame, dataframe1, will now look like this,

Let us now look at how R sees the object dataframe1 in memory.


To obtain information on an object in R, we use the function str(), with
the argument being the name of the object, in this case, dataframe1. We will,
therefore, key,
>str(dataframe1)
for which R will output the following

'data.frame': 5 obs. of 5 variables:
$ Name : Factor w/ 5 levels "Roy ","Sunny",..: 1 2 3 4 5
$ Age : num 23 41 35 33 27
$ Position: Factor w/ 4 levels "Stock","Mail sup",..: 1 2 3 4 1
$ LengthE : num 2 11 6 7 4
$ Gender : chr "M" "F" "M" "F" "M"
Let us look at the variable “Position”.
It is listed as a factor with 4 levels: “Stock”, “Mail Sup”, …. Here you
will also see a list of numbers, 1 2 3 4 1. These numbers indicate the
level of the factor “Position” for the five names. So, it is telling us that
the first name has a Position level of 1 = Stock; the second name has a
Position level of 2 = Mail Sup; the third has a Position level of
3 = IT Sup; the fourth has a Position level of 4 = Office Mgr; and the
fifth name has a Position level of 1 = Stock.
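The link between levels and integer codes can be inspected with levels() and as.integer(). This is only a sketch: the levels argument is supplied here as an assumption, so that the codes match the order listed above, because factor() would otherwise sort the levels alphabetically.

```r
# The five Position values, with the level order given explicitly.
Position <- factor(c("Stock", "Mail Sup", "IT Sup", "Office Mgr", "Stock"),
                   levels = c("Stock", "Mail Sup", "IT Sup", "Office Mgr"))

position_levels <- levels(Position)     # the four categories, in coding order
position_codes  <- as.integer(Position) # the integer code stored for each value

# Without a levels argument, the levels come out in alphabetical order.
alphabetical <- levels(factor(c("Stock", "Mail Sup", "IT Sup", "Office Mgr")))

print(position_levels)
print(position_codes)
print(alphabetical)
```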

For ordinal factors, we indicate to R that our data is ordered. For
example, suppose we had the following data:
Answer: Strongly Agree, Agree, Not Sure, Disagree, Strongly Disagree.
We could store this as an ordinal factor by keying:

>Answer <- c("Strongly Agree", "Agree", "Not Sure", "Disagree",
"Strongly Disagree")
>Answer <- factor(Answer, levels=Answer, ordered=TRUE)

The first line creates the vector Answer and the second line tells R to
store it as a factor of five ordered levels. The levels argument gives the
order of the levels; without it, R would sort the levels alphabetically.
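Once a factor is ordered, its values can be compared with each other. The sketch below assumes a scale running from Strongly Disagree (lowest) to Strongly Agree (highest); the responses vector and the names scale_levels and Response are hypothetical.

```r
# The scale, listed from lowest to highest (an assumed direction).
scale_levels <- c("Strongly Disagree", "Disagree", "Not Sure",
                  "Agree", "Strongly Agree")

# Hypothetical survey responses.
responses <- c("Agree", "Not Sure", "Strongly Agree", "Disagree")

Response <- factor(responses, levels = scale_levels, ordered = TRUE)

first_beats_second <- Response[1] > Response[2]  # is "Agree" above "Not Sure"?
highest <- as.character(max(Response))           # the highest response given

print(Response)
print(first_beats_second)
print(highest)
```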

Lists

A list is a collection of objects given a specific name. A list can be
represented by a single column of elements, all of which can be of different
types. Unlike an ordinary vector, whose elements must all be of the same
type, a list can mix types freely. The figure below represents a list of
five different elements (colors).
We create a list with the list() function, the arguments of which are the
objects being collected. Any number of objects can be collected as a list.
The objects of a list can be of different types and completely unrelated.
Suppose we have the following data:

For example, to create a list that we will call listA from the above data,
we will create two vectors, one called Kilometers and another called Time,
and then a factor (non-numeric vector) called DayofWeek.

>Kilometers <- c(3.3, 2.6, 4.0, 3.5, 2.9, 3.8, 3.1)
>Time <- c(28, 31, 39, 32, 24, 27, 29)
>DayofWeek <- c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")
>DayofWeek <- factor(DayofWeek, levels=DayofWeek, ordered=TRUE)

Now we can create a list called listA that consists of the factor,
DayofWeek, and the two vectors, Kilometers and Time. We name each element
so that we can later refer to it by name.

>listA <- list(DayofWeek=DayofWeek, Kilometers=Kilometers, Time=Time)

To access the objects in a list, we can refer to them by name or by
position number in the list. If we desire to access the object Kilometers
from listA, we can call it up by keying the following, using double square
brackets to access a vector in the list.

>listA[["Kilometers"]]
R will then output
[1] 3.3 2.6 4.0 3.5 2.9 3.8 3.1
Or, instead of keying “Kilometers”, we can simply write 2 inside of the
double square brackets.
>listA[[2]]
This asks R for the second object (Kilometers) in the list listA, to which
R will output
[1] 3.3 2.6 4.0 3.5 2.9 3.8 3.1
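Access by name only works when the list elements carry names. The following sketch names the elements when the list is built and then reaches the same vector three ways; the $ shortcut and the pace calculation are additions for illustration, not part of the text above.

```r
Kilometers <- c(3.3, 2.6, 4.0, 3.5, 2.9, 3.8, 3.1)
Time <- c(28, 31, 39, 32, 24, 27, 29)

# Name each element so it can be retrieved by name as well as by position.
listA <- list(Kilometers = Kilometers, Time = Time)

by_name     <- listA[["Kilometers"]]  # double brackets with a name
by_position <- listA[[1]]             # double brackets with a position
by_dollar   <- listA$Kilometers       # the $ shortcut for named elements

# Elements can be used like any other vector, e.g. minutes per kilometer.
pace <- listA$Time / listA$Kilometers

print(names(listA))
print(round(pace, 1))
```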

Strings
Strings are relatively short pieces of text consisting of a sequence of
characters, for example, the name of a respondent in a survey. Strings are
enclosed in quotation marks. One word of caution is in order here: the
number 38 is not the same as the string “38”. The string “38” is a text
sequence consisting of the digit 3 followed by the digit 8.
The operation
>3 * 4
is perfectly legal and R will output
[1] 12
However,
>"3" * "4"
will cause R to report an error, as well it should, because here we are
attempting to perform a numeric operation on two text strings.
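When digits arrive as text, they can be converted with as.numeric() before doing arithmetic. The sketch below also catches the error from multiplying two strings so that the script keeps running; the variable names are hypothetical.

```r
# Convert the strings to numbers first, then multiply.
product <- as.numeric("3") * as.numeric("4")

# Multiplying the strings directly raises an error; tryCatch() captures
# the error message instead of stopping the script.
error_message <- tryCatch("3" * "4",
                          error = function(e) conditionMessage(e))

print(product)
print(error_message)
```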

String concatenation with paste()


Just as the function c() concatenates numerical elements, the function
paste() strings together text elements.
>W <- paste("Time", "has", "come", "today.")
>W
[1] "Time has come today."
But if we use the sep="" argument
>U <- paste("Super", "cali", "fragilistic", "expi", "ali", "docius.",
sep="")
>U
[1] "Supercalifragilisticexpialidocius."
With the string W, “Time has come today.”, R kept the words separated.
However, with the string U, the words were all joined together. This is
because of the sep argument. The sep argument tells R the kind of separator
that is placed between the words. If no separator is specified, as in the
string W, R will place a single space between the words. This is the default
condition. If we write sep="", with nothing between the quotation marks,
this indicates that there should be no separation between the words, as in
the string U.
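Any string can serve as the separator, not only a space or nothing. A short sketch follows; the date pieces are hypothetical values, and paste0() is base R's shorthand for paste(..., sep="").

```r
W <- paste("Time", "has", "come", "today.")          # default sep is one space
U <- paste("Super", "cali", "fragilistic", sep = "") # no separator at all
V <- paste("2020", "03", "15", sep = "-")            # a hyphen as separator
U2 <- paste0("Super", "cali", "fragilistic")         # same result as U

print(W)
print(U)
print(V)
```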

Boolean Values
R uses TRUE and FALSE as Boolean values. These are mostly used with
comparison and logical operators. For example, if we write
>20 < 15
we are asking R whether 20 is less than 15.
R will tell us
[1] FALSE
We used “<”, a comparison operator that compared 20 to 15.
Below, we present a list of R’s comparison operators.
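The comparison operators can be tried directly at the console. The sketch below uses the six standard operators on a few hypothetical values and shows that a comparison applied to a vector returns one Boolean value per element.

```r
less_than     <- 20 < 15    # is 20 less than 15?
greater_equal <- 20 >= 15   # is 20 greater than or equal to 15?
equal_to      <- 20 == 20   # equality uses two equal signs
not_equal     <- 20 != 15   # "not equal to"
less_equal    <- 15 <= 15
greater_than  <- 15 > 20

# A comparison on a vector is carried out element by element.
scores <- c(3, 8, 12)       # hypothetical values
above_five <- scores > 5

print(c(less_than, greater_equal, equal_to, not_equal))
print(above_five)
```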

Data Entry
Data comes in many formats, in many different forms and from many
different sources. Data from any source and in any format can be imported
into R. We will only deal with a few formats and sources here. For a
complete guide in data import and export in R, see R Data Import/Export
from the CRAN website manuals.
Direct Entry from the Keyboard
If you are entering data directly from the keyboard, which you may do if
your dataset is small, the easiest method is through the use of the edit()
function, which we have already met. When you invoke the edit function,
the data editor will come up, and you can enter your data directly into
R’s memory. The simplest way to do this is to create an empty data frame
and then call up the editor to enter the values into the empty vectors. We
have already seen this method, but to solidify it in our minds, we will see
another example.
We are going to create a data frame called dataframe1, which will consist
of the data which we previously used in the section on lists, and we are
going to enter it directly from the keyboard.

First, we will create the empty data frame, dataframe1.

>dataframe1 <- data.frame(DayofWeek=character(0),
Kilometers=numeric(0),
Time=numeric(0))

The Data Editor

Now we will call up the data editor so that we can fill in the values.

>dataframe1 <-edit(dataframe1)

Note that we are altering the original object dataframe1 by our editing of
it. At the end of our editing we are assigning the results back to the same
object, dataframe1, so that it is then altered.
Another way to call up the data editor is to choose Data Editor from the
Edit menu at the main ribbon.
Below is a picture of the result of calling up the data editor.

Figure 2.5
The data editor allows you to change the name and type of a variable.
To do so, click on the variable (the column heading) and make the
changes in the box that appears. Close the box. The changes are made. We
can also edit cell values. Double-click on the cell you want to edit and make
the changes.
To enter data, you simply click in the appropriate cell and enter the value.
Additional columns can be added by simply clicking on the column title cell
of adjacent empty columns. Next, we will look at importing data and
creating a data frame from a text file that is already in existence.

Importing Data Into R

Importing Data from Excel

We have created the file Example in Excel and we want to import it into
R. For the file Example, we will use the same table below that we created
earlier in the section.

Before we can import it into R, we must save it in Tab Delimited (.txt) or
Comma Delimited (.csv) format in Excel. To do this, we simply choose that
format upon clicking Save As in Excel. The file will be saved as
Example.txt or Example.csv. (Some versions of R seem to work
better with .txt files and others with .csv.)
Now suppose we have saved it to our desktop, and it now has the address:
C:/Users/Hernandez/Desktop/Example.txt.
To import it into R and call it dataframe2 there, we use the command:

>dataframe2 <- read.table("C:/Users/Hernandez/Desktop/Example.txt",
header=TRUE, sep="\t")

If we saved the Excel file as comma-delimited (.csv), we write

>dataframe2 <- read.table("C:/Users/Hernandez/Desktop/Example.csv",
header=TRUE, sep=",")
The argument header=TRUE tells R that the first row contains the
column names as headers and not data values, so R will not treat them as
data. The argument sep="," tells R that the data values are separated by
commas. If a file is saved as tab-delimited (.txt), then you should key in
sep="\t". This tells R that the file uses a tab character to separate the
data values.
To now call up the file dataframe2, key
>dataframe2
Now if the file was saved in Excel as a comma-delimited file, we can
also import it into R using the read.csv() function as follows:

>dataframe2 <-
read.csv(file="C:/Users/Hernandez/Desktop/Example.csv",
header=TRUE, sep=",")
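The import steps can be rehearsed without Excel by writing a small file from R and reading it back. This is a sketch under stated assumptions: the two-row table and the temporary file names are hypothetical stand-ins for the exported file.

```r
# A tiny stand-in for the exported table (hypothetical values).
example <- data.frame(Name = c("Roy James", "Sunny Meza"),
                      Age = c(23, 41))

# Write a comma-delimited and a tab-delimited copy to temporary files.
csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")
write.csv(example, csv_file, row.names = FALSE)
write.table(example, txt_file, sep = "\t", row.names = FALSE)

# Read them back: sep must match the delimiter used in the file.
from_csv   <- read.table(csv_file, header = TRUE, sep = ",")
from_csv_2 <- read.csv(csv_file)    # header=TRUE and sep="," are its defaults
from_txt   <- read.table(txt_file, header = TRUE, sep = "\t")

print(from_csv)
print(from_txt)
```

Because the names contain spaces, reading the tab-delimited file with a space as the separator would split each name into two fields, which is why the tab character is given explicitly.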

As we will see in a subsequent chapter, this whole process can be done by
just clicking on a single icon if we are using RStudio.

Importing Data From Excel Using the Package RODBC (Optional)

We can also import Excel files, as well as MS Access files, into R by
using an extension package called “RODBC”. For example, suppose we
have a file in Excel called [Link], which is in sheet 1 of the Excel
workbook, and which we desire to import into R under the name
dataframe3. We would first download the package,
>install.packages("RODBC")
Now we load it into our library,
>library(RODBC)
Now we use the following commands to import the file into R using
RODBC,
>channel <- odbcConnectExcel("[Link]")
>dataframe3 <- sqlFetch(channel, "sheet1")
>odbcClose(channel)

Here we use the function odbcConnectExcel() to connect the file to an
object called “channel”, which, like a relay runner, will pass the baton to
another function called sqlFetch(), which will bring the file into R. The
final step closes the channel object after the file has been imported.
(The odbcConnectExcel() function is available only on Windows.)

Example: Importing a file from Excel


Consider the following Excel file, [Link]

Figure 2.6

The variable Sex is coded 1 = Male, 2 = Female.
The variable Smoker is coded 1 = Yes, 5 = No. We wish to import this file
from Excel into R.
The first thing we do is to save the file in Excel to the working directory
as a comma-delimited file. The extension for a comma-delimited file is csv.
The function
>getwd()
will tell you which folder is being used as your working directory.
R will reply
[1] "C:/Users/Bill/Documents"
We save smoke as a comma-delimited file in Excel, after which the full
address of your file will be
C:/Users/Bill/Documents/smoke.csv
Once the file has been saved in the working directory, we can now load it
into R with the read.table() function. We will call it smokedata in R.

>smokedata <- read.table("C:/Users/Bill/Documents/smoke.csv",
header=TRUE, sep=",")
We are telling R to read the file smoke.csv, found at
C:/Users/Bill/Documents, with the first row being the headings
(header=TRUE) and that the file is delimited or separated by a comma
(sep=",").
This line of code will import the file into R.
Now we simply type
>smokedata
to call up the file to the R screen.

Importing Data From SPSS


To import SPSS files, we can use the Hmisc package, from which we
will use the spss.get() function. Suppose we want to import a file
[Link] from SPSS into R, where we will call it dataframe4.

>install.packages("Hmisc")
>library(Hmisc)
>dataframe4 <- spss.get("[Link]", use.value.labels=TRUE)

We first install the package, load it into our library, then use its
spss.get() function to find [Link] and import it into R as dataframe4.
The logical argument use.value.labels=TRUE tells R to convert variables
with value labels into factors with the value labels as levels of the factor.

Importing Data from SAS


Data can be imported into R from SAS using the Hmisc package.

>install.packages("Hmisc")
>library(Hmisc)
>dataframe4 <- [Link]("[Link]", [Link]=TRUE)

You can also save the SAS dataset as a comma-delimited file in SAS
using the PROC EXPORT command.
SAS program:
Proc export data=datafile
Outfile="[Link]"
dbms=csv;
run;
Thus, we have the file saved in SAS as a comma-delimited file. We can
now import it into R using the read.table() function.

>dataframe1 <- read.table("[Link]", header=TRUE, sep=",")


We have now imported the SAS file into R.
However, when we get to RStudio, importing from outside sources can be
done with just one click.
Exercises 2
Use R to compute the following values:
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10
224
(x + 3)^3 + x^2

What would the values of x and y be after the following commands are
executed in R?
>x <-3
>y <- 7
>x <- x + y
> y <- x + y

Create the vector x, consisting of the numbers from 1 through 10 using
the code
>x <- c(1:10)
Create the vector y, consisting of the numbers 11 through 20.
Add the two vectors using the code
>x + y
Now create the vector z, consisting of the numbers 10 and 11.
Add the vectors x and z. Why did R give the results shown?

There is a distribution known as Benford’s Distribution that gives the
probability of first digits in numbers occurring in many kinds of data.
Financial data follow Benford’s Law to a high degree. This makes it useful
for the investigation of financial fraud. Consider the table given below,
which lists the first digits and their probabilities.
Create two vectors in R, vector D for the digits and vector P for their
probabilities.

The mean of a discrete distribution is $\mu = \sum x \, P(x)$. Use this
formula with R to find the mean of the given Benford’s distribution.

In this exercise, we are going to create a matrix by first creating a
vector with all the values of a three-column table, and a vector of the
column names. Then we will create the matrix, matrix1, out of the two
vectors.
Below is the three-column table from an entomology study.

Create a vector, entomology1, that consists of all the values in the table,
entered row by row. Then create the matrix, matrix1, using the matrix()
function with the argument byrow set to TRUE. Now create a character
(string) vector containing the column labels and apply it to matrix1 with
the colnames() function. Now call up matrix1 to see the completed matrix.
We will create the table above as a data frame in R using the R Data
Editor. Do so with the code below and when the data editor comes up, enter
the values.
The first line of code creates the data frame called entomology2,
consisting of three columns, all of which are of numeric mode.
>entomology2 <- data.frame(Weight=numeric(),
ThoraxLength1=numeric(), ThoraxLength2=numeric())
>entomology2 <- edit(entomology2)

What is the second line of code doing?


Now enter the values with the R Data Editor.

Enter the Entomology table above into Excel, and then import it from
Excel into R, calling it entomology3.
Chapter 3: Introduction to Graphs
R’s graphical abilities are among its strongest points. You can use R to
create some amazing graphs and charts. In this chapter, we will investigate
some of the basic graphing methods and techniques. Below are some
examples of R’s powerful graphics capabilities.

Figure A: From R for Multivariate Statistics, Ramon Hernandez


Figure B: From The R Book, Michael Crawley
Figure D: From R Graphics, Paul Murrell

Your First Graph


The techniques needed to create graphs like the foregoing will be
discussed in a subsequent book; but for now, we learn the basics. We have
created a file named ocean.csv in Excel as a comma-delimited file. To do
this, you choose “csv” in the Save window. We show this step in the figure
below.

ocean.csv is saved as the following table.

Now let us follow the lines of code below:

>ocean <- read.table("C:/Users/Hernandez/Documents/ocean.csv",
header=TRUE, sep=",")
>pdf("[Link]")
>attach(ocean)
>plot(Depth.m, [Link])
>title("Depth vs. Pressure")
>detach(ocean)
>dev.off()
On the first line we have the read.table() function. This function takes
ocean.csv, the comma-delimited file saved in Excel with the specified path,
and imports it into R as a data frame called “ocean”.
The second line opens a PDF graphics device, so that the graph will be
written to a document named [Link] in the current working directory. As we
saw in Chapter 1, we can also save graphs in other formats by using
functions such as jpeg(), png(), and bmp(). Another option to save a graph
is by selecting (in Windows) File > Save As from the graphics window and
then selecting a format and location. On a Mac, simply select File > Save As
from the menu bar. To save on a platform such as Unix or Linux, you must do
so by writing a line of code.
The third line is the attach() function. This function adds the data frame,
ocean, to R’s search path, so that this is one of the first places R will search
when attempting to follow a command.
The fourth line is the plot() function with the arguments Depth.m and
[Link]. This function simply creates a scatter plot of the variables
Depth.m and [Link], which it will find in the attached data frame,
ocean. Because a PDF device is open, the scatter plot is written to the PDF
file; without the pdf() line, it would appear in a separate graphics window.
If, in the plot line, we had instead written
>plot(Depth.m,[Link], type="b")
the type="b" argument would tell R to plot both the points and the lines
connecting them, instead of a scatter plot with points alone.
The fifth line instructs R to add “Depth vs. Pressure” as the title of the
graph. The sixth line detaches the data frame, that is, removes it from the
search path. Write this line only if you are ending the session or have no
further need for the data frame in this session.
To produce graphical output, R must open a graphics device, which may be
an extra window or a file. The seventh line, dev.off(), closes the device
upon which the scatter plot was written. When the device is a file, such as
a PDF, this line is needed to finish writing the file.
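The open-device, plot, close-device sequence can be tried without the Excel file. The sketch below is illustrative only: the data frame values, the column names Depth and Pressure, and the output file name are all hypothetical, and the columns are reached with $ rather than attach() so that nothing is left on the search path.

```r
# Hypothetical depth (m) and pressure (atm) readings.
ocean <- data.frame(Depth = c(0, 10, 20, 30, 40),
                    Pressure = c(1, 2, 3, 4, 5))

# Open a PDF device in a temporary folder (hypothetical file name).
out_file <- file.path(tempdir(), "depth_pressure.pdf")
pdf(out_file)

# Everything drawn now goes to the PDF, not to a window.
plot(ocean$Depth, ocean$Pressure, type = "b")
title("Depth vs. Pressure")

# Closing the device finishes writing the file.
invisible(dev.off())

print(file.exists(out_file))
```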

Parameters for Graphing


By using parameters, we can customize our graphs as we wish, including
the fonts, colors, axes, titles, and size.
If you wish to change the parameters of your current graph but retain
your original parameter settings for subsequent graphs, you can use the par()
function. The par() function, used with the no.readonly=TRUE option and
assigned to the name opar, as shown below,
>opar<-par(no.readonly=TRUE)
will save a copy of the current settings (the copy is called opar). You can
then modify the settings, with the knowledge that your original settings are
safe in memory, so that you can change back to them whenever you so choose.
Let’s say that in your original settings, the parameters for plotting were a
solid line type for plotting (lty=1) and an open circle for the plotted points or
plotting character (pch=21). A graph using these settings is shown below.

Figure 3.1

But now you want to use a dotted line (lty=3) and a solid square for the
points (pch=15), like the graph shown below.

Figure 3.2

You would use the following code:

>opar<-par(no.readonly=TRUE)
>par(lty=3, pch=15)
>plot(Depth.m,[Link], type="b")
>par(opar)
The first line saves your original settings (solid line and open circles) and
makes a copy of those settings that you can modify. The second line gives
the new parameters that you want to use to modify your graph (dotted line
and solid square). The third line plots the graph using the new settings, and
the fourth line restores the original settings.
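The save, change, restore pattern can be checked by querying par() before and after. The sketch below assumes a null PDF device (pdf(NULL)), so that no file or window is created, and uses hypothetical data for the plot.

```r
pdf(NULL)                          # a device that draws nowhere

opar <- par(no.readonly = TRUE)    # save the current settings
par(lty = 3, pch = 15)             # dotted line, solid square
changed <- par("lty", "pch")       # query the settings now in force

plot(1:5, c(2, 4, 3, 6, 5), type = "b")   # hypothetical data

par(opar)                          # restore the saved settings
restored <- par("lty", "pch")

invisible(dev.off())

print(changed$pch)
print(restored$pch)
```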
A second way to change or specify parameters or options is to do so
directly within the plot() function. This is shown below.

>plot(Depth.m,[Link], type="b", lty=3, pch=15)


The pch Option
The parameter pch means plotting character or the character the points
will be plotted as. Each different character is represented by a pch number,
e.g., pch=3. The following table gives the symbols for the pch numbers.

The Line Type, or lty= Option


The lty= option specifies the line type with which the graph will be
plotted. Some of the types and codes are given below.
The Line Width lwd = Option
This option specifies the width of the line. This width is expressed
relative to a width of 1, which is the default. Typing in lwd = 2, makes the
line twice as wide as the default.
The Symbol Size cex = Option
The cex= option allows you to change the size of the pch plotting
symbols. This size is expressed relative to 1, which is the default size.
Typing in cex =2 makes the symbol twice as large as the default, and cex =
1.5 makes it 50% larger than the default.

The Color Options


Some color parameters are shown in the table below.

Color is specified with the col= option by using the index for the
specific color (for example, col=1); by typing in the color name directly,
for example, col="green"; by using the hexadecimal code for the color, for
example, col="#8B7500"; or by using the RGB code for the color, for
example, col=rgb(139, 117, 0, maxColorValue=255).
The colors() function will give a list of all color names available.
For a comprehensive chart of all R colors, visit the web site:
[Link]
R also has many more sophisticated color functions that can be used to
create more striking effects: rainbow(), which produces the colors of the
rainbow; [Link](); [Link](); [Link]() are some of the more
popular color functions. Levels of gray can be produced with the gray()
function. For example, to generate six levels of gray (not 50 shades!),
from black to white, type
>gray(0:5/5)
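The different ways of naming a color can be compared at the console. In this sketch, rgb() is given maxColorValue=255 because its default scale runs from 0 to 1; the variable names are hypothetical.

```r
# The same color written as an RGB triple and as a hexadecimal string.
gold_hex <- rgb(139, 117, 0, maxColorValue = 255)

# Six gray levels, from black (0) to white (1) in steps of 1/5.
grays <- gray(0:5/5)

# colors() returns every color name that R recognizes.
green_is_known <- "green" %in% colors()

print(gold_hex)
print(grays)
print(length(colors()))
```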

To observe an example of the rainbow function, we will now code a pie
chart of 12 different colors using the rainbow function. Additionally, we will
label the colors in the default hexadecimal code.
>n<-12
>rainbowcolors<-rainbow(n)
>pie(rep(1, n), labels=rainbowcolors, col=rainbowcolors)
You will get the pie chart below.

(Don’t worry about the color codes currently. We will be seeing pie charts
up close in a subsequent chapter.)
To observe an example of the gray() function in action, we will code a
pie chart with 8 gray levels and label them with the hexadecimal code.
>x<-8
>shadesofgray<-gray(0:x/x)
>pie(rep(1,x), labels=shadesofgray, col=shadesofgray)
You will get the following pie chart.

Text Parameters and Options


We can use parameters and options to control and specify the size of
graph text, font family, and font style. Two text size parameters are given in
the table below.

After the equal sign, you will specify a number that specifies the font
style according to the scheme:
1 = plain text
2 = bold
3 = italics
4 = bold italics
5 = Adobe symbol encoding

Font family parameters


Specifying font family can be done by using the windowsFonts()
function. In this function, you specify the font family you want to use
and assign it a name (usually an uppercase letter) by which you will call it
up. For example, let us say that you want to use the Palatino Linotype font
family in your graph. You will type,
>windowsFonts(A=windowsFont("Palatino Linotype"))
Now you will use the opar to keep the original settings in memory, then
use the par() function to call out the new font family, and to set other
font options. For example
>opar<-par(no.readonly=TRUE)
>par(font.lab=2, font.main=4, cex.main=2, family="A")
>plot(Depth.m,[Link], type="b")
>par(opar)

If you don’t use the opar setting, then all graphs created after the
statement above will have axis labels in bold, with the default size of 1;
the main title in bold italic with a font size 2 times the default size of 1;
and the font family will be Palatino Linotype.
The windowsFonts() function will only work on Windows. If you are
using a Mac, then you use the quartzFonts() function, which serves the
same purpose for the Mac graphics device.

Graph Size and Margins


Consider the following code:
>par(pin=c(5, 6), mai=c(1, .2, 1.5, .3))
This code uses the par function with the pin and mai parameters. The
‘pin’ parameter takes a vector giving the width and height of the plot in
inches. In this case, the graph will be 5 inches wide by 6 inches high. The
‘mai’ parameter takes a vector giving the sizes of the four margins in
inches, in the order bottom, left, top, right. In this case, the bottom
margin will be 1 inch, the left margin will be 0.2 inches, the top margin
will be 1.5 inches and the right margin will be 0.3 inches.
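The margin settings can be queried back from par() to confirm what R is using. The sketch below is an illustration under assumptions: it opens a null PDF device of 7 by 7 inches, sets hypothetical margins, and prints the plot size R reports for that device.

```r
pdf(NULL, width = 7, height = 7)   # a 7 x 7 inch device that draws nowhere

opar <- par(no.readonly = TRUE)    # keep the original settings

# Margins in inches: bottom, left, top, right (hypothetical values).
par(mai = c(1, 0.5, 1, 0.5))

margins   <- par("mai")            # the margins now in force
plot_size <- par("pin")            # plot width and height R reports, in inches

par(opar)                          # restore the original settings
invisible(dev.off())

print(margins)
print(plot_size)
```

The plot region and the margins must fit inside the device together, so a large pin combined with large margins needs a correspondingly large device.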

Customizing Graphs by Adding Text, Axes Options, and Legends

The Title Function


To customize your graph title, the title() function can be used. In the title
function, you can specify the main title, the subtitle, the x- and y-axes
labels, and the color, size, and fonts of the title and label texts.
The title function is used in the following format:
>title(main="My Graph", col.main="red",
sub="My Sub Title", col.sub="blue",
xlab="X-Axis Label", ylab="Y-Axis Label",
col.lab="green", cex.lab=1.5)
Here the function parameters specify the main title and subtitle; the
colors of the texts for the main and subtitles; the labels for the x- and y-axes;
the color of the axes labels; and the size of the text for the axes labels, in one
line of code.

Now, let us construct a graph using some of the functions and techniques
we have seen so far: the opar<-par, the pin, mai, title, plot and finally
passing the opar back to par.
The two vectors are:
Year: 1790, 1820, 1850, 1880, 1910, 1940, 1970, 2000.
Number of New Species Described: 100, 120, 150, 170, 180, 195, 211,
215.
I will explain the code while you follow the code lines below. First, enter
the vectors Year and NumOfSpeciesDescribed, set the opar, and use the
par() function to set the pin, mai, and line width. Then use the plot()
function to include line type, plotting symbol, plot type, parameters for
colors for the lines, main title label, the color, size and font of the main title,
and the color of the axes titles. The code should end by sending the opar
setting (the original settings) back to par (the forefront) to make it current
once more. However, we will not end the opar at this time since we will be
plotting a second line on the same graph. After this second line and the
legend, we will then change the opar settings back to par.

>Year<-c(1790, 1820, 1850, 1880, 1910, 1940, 1970, 2000)
>NumOfSpeciesDescribed<-c(100, 120, 150, 170, 180, 195, 211, 215)
>opar<-par(no.readonly=TRUE)
>par(pin=c(4,3), mai=c(3,3,1,1), lwd=2.5)
>plot(Year, NumOfSpeciesDescribed, pch=15, col="red", type="b",
lty=5, fg="blue", main="Year vs. Number of Species Described",
col.main="blue",
font.main=2, cex.main=2, col.lab="blue")

The “font.main=2” indicates that the main title is in bold (1 = plain, 2 =
bold, 3 = italics, 4 = bold italics). The “cex.main=2” indicates that the size
of the main title is twice the default size. The “col.lab="blue"” indicates
that the axes titles are in blue.

The following graph is the result of the above code.


Plotting Two Lines on the Same Graph
At times we may need to plot two lines on the same graph for comparison
purposes or others. After the last line of code that plotted the first line above,
we continue by declaring a third variable, NumOfSpecies2, then we use the
points() and lines() functions. For the pch symbol for the second line, we
use pch = 17, we color the line blue, and for the line type, we use lty=2.

>NumOfSpecies2<-c(92, 110, 135, 160, 175, 195, 215, 225)
>points(Year, NumOfSpecies2, col="blue", pch=17)
>lines(Year, NumOfSpecies2, col="blue", lty=2)

This code will result in the following graph.


Adding a Legend
When there are different sets of data in one graph, it is useful to include a
legend to identify the different sets of data and the different graphs. You can
place a legend at the top left of your plot by the following code.

>legend("topleft", inset=0.10, title="Species", c("NumSpec1",
"NumSpec2"), lty=c(5, 2), col=c("red", "blue"))
>par(opar)

The “topleft” indicates the location of the legend. Other locations that
can be written in as keywords are: “topright”, “top”, “bottom”, “left”,
“right”, “bottomleft”, “bottomright”, “center”. If you use one of these
keywords, you can also specify how far you want to place the legend within
the plot by using the inset= parameter. The inset is given as a proportion
of the plot width; “0.10” means that the legend will be placed at a location
1/10 of the width of the plot area into the plot at the top left.
The title parameter gives a title to the legend. The first concatenation
gives the names of the two lines of the legend. The second concatenation
gives the line types you specified for the two lines when you plotted the
lines. The third gives the colors of the two lines. And finally, we end the
code by sending the opar setting (the original settings) back to par (the
forefront) to make it current once more.
Below is the graph after the legend code.
Exercises 3
Create a data frame called dataframe1, which holds the data given in the
table below, by first creating an empty data frame and then filling it in using
the data editor. Attach the data frame to the search path and save the
current graphical parameters in a temporary opar so that the settings you
change apply to this graph only.

Construct a graph of the two lines on the same plot. Set margins at 2
inches on the bottom and the left, with margins of 1 inch on the top and
right. Use different plotting symbols and different line types for each line
graph. Use colors to liven up the graph lines and labels. Add a legend at the
upper left of the plot area. End the code by passing the opar back to par to
revert to the original settings.
Chapter 4: Using RStudio
Now that we have familiarized ourselves with some of the features in the
working of base R, we will introduce an R-running environment, which
could serve to make our lives, as users of R, a lot easier. This chapter
introduces RStudio and visits a few of its basic features – all you need to
know to begin using RStudio like a pro.
RStudio is an integrated development environment (IDE) that was built
just for R. It includes a console, a syntax-highlighting editor that supports
direct code execution, and tools for plotting, history, debugging, and
workspace management. More commonly, it is called a GUI (Graphical User
Interface). A GUI is not the main application; it sits on top of a main
application such as R and makes it more user-friendly. For example, basic
tasks like opening a script, importing and exporting CSV files, managing
packages, and querying help can be done with the click of a mouse instead of
writing lines of code.

Figure 4.1

To be able to use RStudio on your system, you first need to have base R
(version 3.0.1 or later) installed from [Link]. Once you have R installed,
you can then download and install RStudio.
RStudio is available in open-source (free) and commercial editions (buy)
and runs on the desktop (Windows, Mac, and Linux) or in a browser
connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, Red
Hat/CentOS, and SUSE Linux).
Download and Installation of RStudio
At the following link, [Link] choose the appropriate version for your needs,
click on download, and then run the installer to install RStudio. When you
open RStudio you will see four windows as shown below.

Figure 4.2

The Windows of RStudio

The Script Editor Window


The window in the upper left is the script editor window. This is where
you write and edit your code.
Figure 4.3 The RStudio Script Editor

The Console Window


The bottom left window is the console. The RStudio console is analogous
to the console window in R. Calculations take place here. The script editor
sends your code to the console for processing and display of the results. You
can write your code directly into the console; however, mistakes in coding
are much more difficult to correct if the code was written directly into the
console.

Figure 4.4 The RStudio Console Window


Figure 4.5 The Console in base R

The Global Environment Window


The window in the upper right is the Global Environment Window. This
is the place where you will find a list of all the objects being used in your
session, including the results of calculations you have stored and the objects
you have created or imported. If you are using only base R and you want to
find out what objects are in memory, you use the ls() function. With
RStudio, however, the Global Environment window lets you see what is in
memory at a glance.

Figure 4.6 RStudio Global Environment Window

The Plots and Management Window


At the lower right, we have the Plots and Management Window. This
window consists of five tabs in one.

Figure 4.7 RStudio Plots and Management Window

Plots and Management Window: Files Tab


The first tab is the Files tab.

Figure 4.8 The Files Tab


Clicking on this tab gives you access to and allows you to manage all the
files you have created, stored, or imported into RStudio.

Plots and Management Window: Plots Tab


The second tab, Plots, opens a window in which plots and graphs you
have coded are shown. You can save, manage and export all graphs from
here by clicking on icons and not having to write code (as is the case for all
other tabs in RStudio).

Figure 4.9 The Plots Tab


Plots and Management Window: Packages Tab
The third tab is the Packages tab. This is the most important of the five
tabs. Clicking on this tab will bring up a list of all the packages already
installed into R with base R installation or packages you have installed
during your work sessions.

Figure 4.10 The Packages Tab

You can also install new packages by clicking on the Install sub-tab.

Figure 4.11
When you do, the dialog box shown in Figure 4.11 will appear. In this
box, you will enter the name of the package you wish to install. Multiple
packages may be installed from this one window by writing their names
separated by a space or a comma.
The list of packages shown by clicking the Packages tab is organized into
Libraries. The first time you open RStudio, you will see a “System Library”
list. This is a listing of all packages that have been automatically installed
with R. All the packages you, the user, have installed over time are listed in
the User Library.

Figure 4.12 The User Library

It is important to note that the packages in the libraries can only be used
after you have activated them by clicking in the checkbox to the left of the
package name.
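
For readers who prefer code, ticking a checkbox in the Packages tab
corresponds to attaching the package with library(). The lines below are a
minimal sketch; the stats package is used only because it ships with every R
installation, and "somePackage" is a placeholder rather than a real package
name.

```r
# Attaching a package from code has the same effect as ticking its checkbox.
# "stats" comes with base R, so nothing needs to be installed first.
library(stats)

# search() lists everything that is currently attached.
attached <- search()
print("package:stats" %in% attached)

# The code counterpart of the Install button would look like this;
# "somePackage" is a placeholder name, so the line is left commented out.
# install.packages("somePackage")
```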

Plots and Management Window: Help Tab


The fourth tab is the Help tab. This is where you would go for detailed
information on a specific function or documentation on R itself.

Figure 4.13
Understanding how to use the Help tab is important and can save you a
lot of time.
Plots and Management Window: Viewer Tab
This tab is used to display local web content. This is an advanced topic
that will not be looked at here.

Setting Up a Simple RStudio Script


Let us open a new Script editor window as shown in the figure below.
Click on the down arrow and then choose R Script. A new editor window
will appear.

Figure 4.14

This will open a second editor window. A tab for this new window is
created and shown as highlighted in yellow.
Figure 4.15
Now we will write and enter the following code to create a simple object,
Object1:
Object1<-3:6
Now, we will highlight the line of code and then click “Run”.

Figure 4.16
RStudio will respond as shown below.

As shown in red, the editor will send the script to the console.

Figure 4.17
As shown in green, the console will create the object and send the details
of the object to the Global Environment window.

In the Global Environment window, the object has been created as a
vector of type integer (int), of length 4, whose elements are the integers 3, 4,
5, and 6.
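
The same details can be checked from code. The short sketch below recreates
the object and asks R for its type, length, and structure; it is offered only
as a cross-check of what the Global Environment window displays.

```r
Object1 <- 3:6     # the same object created in the script editor

class(Object1)     # storage type, shown as "int" in the Environment window
length(Object1)    # number of elements
str(Object1)       # one-line summary similar to the Environment entry
```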

Now let us plot the elements of Object1 against their index numbers, that
is, their position numbers in the vector (the first element, 3, has the index
1, the second element, 4, has the index 2, etc.). We write the following code
in the editor (notice that there is no prompt in the script editor window):
plot(Object1)
After clicking Run, RStudio will respond as shown below. Note also that
we do not have to click RUN after every line of the script. We can write
several lines of script, highlight all the lines and then click RUN.

Figure 4.18 Plotting our script object.

On clicking on the Plots tab in the Plots window, we will see a scatterplot
of the points. We could maximize the window if we so desire.

Saving Your Workspace


To save the script, you click on the icon circled in green below in the
script editor. No writing of code is needed!
Figure 4.19 Saving the Script

When prompted, choose where you want to save it on your computer and
then name the file using the dialog box that appears. You can also save your
entire workspace by clicking on the ‘Session’ menu in the main menu bar
and choosing ‘Save Workspace As’.
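
If you would rather save the workspace from code, base R offers save() and
load(). The sketch below is one possible way to do it; the file name and the
use of tempdir() are assumptions made so that the example does not write into
your real folders.

```r
Object1 <- 3:6

# Hypothetical file location; tempdir() is used only for this illustration.
ws_file <- file.path(tempdir(), "my_workspace.RData")

# Save every object currently defined. At the top level of a session,
# save.image(file = ws_file) is a shorthand for the same thing.
save(list = ls(), file = ws_file)

rm(Object1)      # simulate starting with an empty workspace
load(ws_file)    # restore the saved objects
Object1
```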

Saving a Plot

To save or export a graph or plot from the Plots window, you simply click
on Export and then choose Save as Image, Save as PDF or Copy to
Clipboard, as shown in the figure below. In the appearing dialog box, you
can choose the location at which you wish to save the plot and give the plot
a name.

Figure 4.20
Plots can also be saved from the Script Editor window Plots tab.
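
A plot can also be written to a file from code by opening a graphics device,
drawing, and closing the device. The sketch below uses the pdf() device; the
file name and folder are illustrative assumptions.

```r
Object1 <- 3:6

# Hypothetical output path; replace tempdir() with a folder of your choice.
plot_file <- file.path(tempdir(), "object1_plot.pdf")

pdf(file = plot_file, width = 6, height = 4)  # open a PDF device (inches)
plot(Object1)                                 # the plot is drawn into the file
dev.off()                                     # close the device to finish the file

file.exists(plot_file)
```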

Figure 4.21

Setting the Working Directory


As you know, the working directory is where R will read and save files by
default unless told to do it differently. You can set the working directory
by clicking on the Session menu, and then choosing Set Working Directory
and then Choose Directory.

Figure 4.22
The dialog box that will come up enables you to choose a directory by
pointing and clicking instead of writing lines of code.
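
For comparison, the code route uses getwd() and setwd(). In this sketch
tempdir() stands in for a folder of your own; it is an assumption chosen so
the example runs on any computer.

```r
old_dir <- getwd()     # remember the current working directory

setwd(tempdir())       # tempdir() stands in for your own project folder
getwd()                # confirm that the working directory has changed

setwd(old_dir)         # switch back when you are finished
```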
Figure 4.23

Customizing the Appearance of Your Workspace
You can customize your workspace using color themes, zoom
percentages, font size and type, and other factors. To customize your
workspace, you first click on the “Tools” menu and then select “Global
Options”.

Figure 4.24
In the Options box that appears, click on Appearance. The Appearance
box will allow you to customize your workspace as you will. The first
section of the Appearance dialog box is the RStudio Theme. Here you can
change the theme of the window itself. The choices are Classic, Sky, or
Modern.

Figure 4.25

The Zoom section lets you change the percentage of zoom in or out. The
Editor Font and Font Size follow. After the Font Size, you will find the
Editor Theme. This section has many options for a theme: background color,
font color, colors for code levels and you can choose one to your taste.
Below, I have clicked on “Tomorrow Night Blue”.
Figure 4.26

This option gives me a navy background with basic white text and
different text colors for different functions and levels of code hierarchy.
After I click “Apply”, my workspace will appear as shown above in the
section at the right in Figure 4.30.

Importing Data into R with RStudio


While we can still import data using R commands like “[Link]” and
“[Link]”, RStudio greatly facilitates data import with its point-and-click
GUI capabilities. We can easily import data into R using the File/Import
Datasets menu at the Script Editor window.
Figure 4.27

We can also use the Import Datasets menu in the Environment window.
From there we can import from Excel, SPSS, SAS, or Stata with just the
click of a mouse and then choosing options from the ensuing dialog box.

The top portion of the dialog box that appears when the Import from
Excel button is clicked is shown below in Figure 4.28. In the highlighted
space you would put the URL for the file if you are downloading from the
web. If you are accessing it from your computer, you click on the Browse
button on the right.

Figure 4.28

Figure 4.29 below shows the bottom half of the dialog box.
Figure 4.29

In the Name box, you give your dataset a name that will be accessed in
R. In the Sheet box you choose the worksheet you wish to import from the
down arrow if your Excel spreadsheet has more than one worksheet. If there
are other worksheets in the file that you want to import, you can import them
one at a time. In the Range box, you give RStudio the range of the dataset
file that you are using. Sometimes the original Excel file might contain rows
of information that you do not wish to use in your R analysis. In the Max
Rows box, you can instruct R on the number of rows you wish to import. If
you enter “5” in this box, R will import only the first 5 rows of the Excel
file. In the Skip box, you will instruct R if you wish to skip any rows. If you
type “2” in this box, R will skip the first two rows of the Excel file; if you
type “3” the first three rows will be skipped and not appear in your imported
file. The NA box is perhaps the most important in this group. Missing values
are almost always present, and you need to instruct R how to handle them.
Handling missing values improperly could give you unreliable results in
your analysis.

Missing Values
If there are empty cells in the original file, R will automatically place
NA’s in those spaces when it imports the file. NA means “not available”. If
there are any other symbols in the cells apart from the symbols they are
supposed to contain, you will have to instruct R to treat these as missing
values. For example, if some cells that are supposed to contain numeric
values have the symbols “###” in them, as shown below,
Figure 4.30

then you will have to tell R to treat them as missing values by typing “###”
into the NA box. Wherever R finds this symbol in a cell, it will replace it
with an NA, making it an “official” missing value. If the first row of the file
contains the column names, then click in the checkbox “First Row as
Names” on the right.
Before you click “Import” you need to check the preview of your file in the
upper left window. If there is a variable in your dataset that is a character
variable but coded with numbers, then R will identify it as a numeric
variable. You will have to correct R by telling it that it is a character
variable. In the figure below, the variable in the last column is coded as 1’s
and 0’s. However, it is a character variable: it records whether the pharmacy
is in a shopping center or not, with 1 coded as “Yes” and 0 as “No”. To tell
R to identify this variable as a character variable, click on the drop-down
arrow near the name of the variable in Row 1. Clicking on this drop-down
arrow will bring up the list of options shown in the figure below.
Figure 4.31

Select “Character” from the list and R will identify the variable as such.
We are now ready to import the file. Click on the Import button on the
bottom right (Figure 4.29) and your file will be imported into the R
workspace.
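
The choices made in the import dialog have close counterparts in code. The
sketch below is a base R analogue for a CSV file rather than the Excel import
itself: the file, its column names, and its values are all invented for
illustration. It treats “###” as a missing value, uses the first row as
names, and reads the 0/1 column as character.

```r
# Build a small stand-in file; column names and values are invented.
csv_file <- file.path(tempdir(), "pharmacies.csv")
writeLines(c("Sales,InCenter",
             "120,1",
             "###,0",
             "95,1"), csv_file)

pharm <- read.csv(csv_file,
                  header = TRUE,                          # first row as names
                  na.strings = c("", "###"),              # symbols to treat as NA
                  colClasses = c(InCenter = "character")) # keep 0/1 as text
str(pharm)
```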

Projects
If you are using RStudio, you can create a new R project by File/New
Project in the script editor. A project is simply a special working directory
designated with an .Rproj file. When you open a project (using File/Open
Project in RStudio or by double-clicking on the .Rproj file outside of R), the
working directory will automatically be set to the directory that the .Rproj
file is located in.
Once you have created a new R project, let’s say you named the project
Proj1, the main folder will be [Link]. Within this folder, you could then
create a folder that will contain your R code, a folder for your data files,
folders for notes, a folder for your graphs, and other material relevant to
your project (you can do this outside of R on your computer, or in the Files
window of RStudio). For example, you could create a folder called
[Link] that contains all your R code, a folder called [Link] that
contains all your data (etc.).
Chapter 5: Data Management 1
Datasets from, for example, surveys, can be quite large and, as such, may
contain many areas that are problematic for the analyst. For example,
respondents often fail to respond to many questions, which means
that we will need to have a way to handle missing data. There may also be
many variables in the dataset, but only a few of interest to us. In this case,
we may need to create a simpler dataset containing only the variables of
interest. We may also need to recode values of a variable into new categories
because of the need for further study on the variable. These and many other
issues present themselves and thereby necessitate tools and methods of data
management to handle the problems that arise. The first aspect of data
management we will look at is that of missing values in a dataset.

Dealing with Missing Values


Missing values in a dataset represent non-responses in a survey or
untranscribed values in data entry and data recording.

R codes missing values of both character and numeric types as NA (not
available).
If the dataset is a small one, missing values can be easily spotted.
However, that may not be the case for larger datasets.
How do you know that your dataset contains missing values? The
function is.na() is used. This is a logical-valued function that reports
missing values as TRUE when invoked. All present values are reported as
FALSE.
For example, let us say that you have a vector (a column of values that
could be the measurements of a variable) called vector3 with 5 values, 14
through 18. Because the value (16) in the third position is missing, an NA
has been entered at that position.
vector3<-c(14, 15, NA, 17, 18)
vector3
[1] 14 15 NA 17 18

Let us invoke the function is.na(), writing in vector3 as its argument.

is.na(vector3)

R will respond:

[1] FALSE FALSE TRUE FALSE FALSE

By the TRUE in R’s response, it tells us that the third position is an NA.
Let us see how the function operates with a data frame, dataframe1. We
call it up in the Script Editor window.
dataframe1
Remember that in RStudio, you do not press ENTER, but instead, you
click RUN to execute the commands. So, upon clicking RUN, the following
will appear in the Console window.

We inquire if there are any missing values in dataframe1.

is.na(dataframe1)
R’s response will be:

The TRUE means that that place in the data frame, dataframe1[3, 2],
contains a missing value.
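
For larger data frames, reading a full grid of TRUEs and FALSEs is
impractical. The sketch below shows a few ways to summarize the result. The
data frame is an invented stand-in for dataframe1 (its column names and
values are assumptions), built so that the NA sits in row 3, column 2 as in
the text.

```r
# Invented stand-in for dataframe1; the NA is placed at [3, 2].
dataframe1 <- data.frame(ID    = 1:5,
                         Score = c(14, 15, NA, 17, 18),
                         Group = c("a", "b", "a", "b", "a"))

is.na(dataframe1)                     # TRUE/FALSE for every cell
has_na <- anyNA(dataframe1)           # TRUE if at least one NA exists
colSums(is.na(dataframe1))            # number of NAs in each column

# Row and column position of each missing value.
na_pos <- which(is.na(dataframe1), arr.ind = TRUE)
na_pos
```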

How Do You Fix Missing Values?


If your data contains missing values, one of the ways to deal with these
missing values before analyzing the data is to eliminate them. However, if
there is a lot of missing data in a relatively small dataset, then deleting all
missing values could cause a great deal of data and information to be lost.
For this and other cases in which it is not desirable to delete all missing data,
there are other ways to treat missing data. We will encounter some of these
other methods later.
Ordinarily, if you attempt to perform arithmetic and other types of
numerical data operations on data with missing values, the answer will be
NA. Let us look at an example.
x<-c(12,13,NA,15,16)
z<-sum(x)
z
# highlight all three lines and click RUN. R will answer:
[1] NA
For simple structures like vectors, we can use the na.rm=TRUE option
that comes with many functions to remove the missing data. For example, if
we want to sum the elements of the vector x above without the missing data,
we type,
z<-sum(x, na.rm=TRUE)
Now when we call up z, we see that R has summed the other elements of
the vector x and ignored the missing data, or the NA’s.
z
[1] 56
In the case of dataframe1 with the missing data above, we would use the
na.omit() function to omit the entire row that contains the NA. Let us use
na.omit() to remove the third observation (row), and then let us call the
new data frame (without the third observation) dataframe2.

dataframe2<-na.omit(dataframe1)
dataframe2
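
The row-wise deletion described above can be tried on a small example. The
data frame below is an invented stand-in for dataframe1, and dataframe3 is a
name introduced only for this sketch; complete.cases() is shown as an
alternative that makes the row filter explicit.

```r
# Invented stand-in for dataframe1 with one NA in the third row.
dataframe1 <- data.frame(ID = 1:5, Score = c(14, 15, NA, 17, 18))

dataframe2 <- na.omit(dataframe1)      # drop every row that contains an NA
nrow(dataframe1) - nrow(dataframe2)    # number of rows that were removed

# complete.cases() marks the rows with no NA; subsetting keeps the same rows.
dataframe3 <- dataframe1[complete.cases(dataframe1), ]
dataframe3
```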

Using Arithmetic Operators to Create New Variables
In many instances, you will want to create new variables by using
arithmetic or mathematical operators on existing variables. Let us create a
data frame called dataframe4, consisting of two vectors called X and Y.
dataframe4<-data.frame(X=c(0,1,2,3,4,5), Y=c(75,72,69,65,60,54))
dataframe4

Now, suppose we want to create two new vectors (variables): SumXY,
which will be the sum of the X and Y elements for each row, where

SumXY = X + Y

and ProdXY, which will be the product of the X and Y elements, where

ProdXY = X × Y

There are several ways to do this. First, the simplest way:


dataframe4$SumXY<-dataframe4$X + dataframe4$Y
dataframe4$ProdXY<-dataframe4$X * dataframe4$Y
The code dataframe4$X accesses the X variable (column) of the data
frame only. The code dataframe4$Y accesses the Y variable of the data
frame only.
So, dataframe4$X + dataframe4$Y, tells R to add X to Y (each value of X
is added to the corresponding value of Y). To complete the line,
dataframe4$SumXY<-dataframe4$X + dataframe4$Y says that the column
that is the sums of the X and Y values is then assigned as a new variable
(column) in dataframe4 called SumXY.
The product of X and Y is coded likewise.
Now, after adding the two new variables, SumXY and ProdXY, we call up
dataframe4.
dataframe4

Pretty simple, isn’t it?


Now let us look at a second method that uses the function transform().
Here we are going to add to our new dataframe4 a new variable, DiffXY,
which is

DiffXY = X - Y

dataframe4<-transform(dataframe4, DiffXY=X-Y)
Here, the transform() function has two parameters: the first is the name
of the data frame we are transforming (dataframe4), and the second is how
we are going to transform it. We are transforming it by adding a new
column, which is the difference X - Y. We then assign the transformed data
frame back to dataframe4.
We now call up the new dataframe4.
dataframe4
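
Putting the two methods together, the following sketch rebuilds dataframe4
from scratch and adds all three variables. The values in the comments were
worked out by hand from the X and Y columns given above.

```r
dataframe4 <- data.frame(X = c(0, 1, 2, 3, 4, 5),
                         Y = c(75, 72, 69, 65, 60, 54))

# Method 1: assign new columns with the $ operator.
dataframe4$SumXY  <- dataframe4$X + dataframe4$Y   # 75 73 71 68 64 59
dataframe4$ProdXY <- dataframe4$X * dataframe4$Y   # 0 72 138 195 240 270

# Method 2: transform() evaluates the expression inside the data frame,
# so X and Y can be written without the dataframe4$ prefix.
dataframe4 <- transform(dataframe4, DiffXY = X - Y) # -75 -71 -67 -62 -56 -49

dataframe4
```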

Renaming Variables

You can change the names of variables using several methods, two of
which we will visit here. The simplest method is that of using the data
editor. Suppose in dataframe4 you want to change the name of the variable
“DiffXY” to “[Link]” and you want to do so using the data editor. You
can call up the data editor in several ways. You can go to the Menu bar and
click Edit then find Data Editor.
Or you can call up the data editor by using the fix() function.
fix(dataframe4)
This will call up the data editor so that you can - guess what? That’s right
- fix dataframe4. You enter the name of your main file, dataframe4, in the
opening dialog box, then directly change the name of the variable “DiffXY”
by double-clicking on the name and typing in the new name, “[Link]”.

The second method of changing names is to use the names() function.
The variable DiffXY is the fifth variable [5] in dataframe4. So if we want to
change the name “DiffXY” to “[Link]” using the names() function, we
code as follows:
names(dataframe4)[5]<-"[Link]"
dataframe4
Now, suppose, instead of making that one change, we wanted to change
the names of the variables “SumXY”, “ProdXY” and “DiffXY” to
“[Link]”, “Interaction”, and “[Link]” respectively, using the
names() function, we would type the following code:
names(dataframe4)[3:5]<-
c("[Link]","Interaction","[Link]")
By now you should know that names(dataframe4)[3:5] refers to the names
of the third through fifth columns of dataframe4. Now we call up
dataframe4 with the new names.
dataframe4
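
Renaming by position works, but it breaks if the column order changes. A
possible alternative is to look the column up by its current name, as in the
sketch below. The shortened data frame and the new name “Difference” are
illustrative choices, not names taken from the text.

```r
dataframe4 <- data.frame(X = 0:2, Y = c(75, 72, 69))
dataframe4 <- transform(dataframe4, DiffXY = X - Y)

# Find the column called "DiffXY" and rename it; "Difference" is an
# illustrative name chosen for this sketch.
names(dataframe4)[names(dataframe4) == "DiffXY"] <- "Difference"
names(dataframe4)
```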

Date and Date Conversion


R returns the current date when the Sys.Date() function is called, and
it returns the current date and time when the date() function is called:
Sys.Date()
[1] "2019-10-11"

date()
[1] "Fri Oct 11 12:14:34 2019"
We can change the format of the current date by using the format()
function with the formats shown in the table below.

If we want the Sys.Date() function to return today’s date in the form of
Friday October 11 2019 instead of 2019-10-11, we code:
today<-Sys.Date()
format(today, format="%A %B %d %Y")
This tells R that we want the date in the format %A, which is the full
form day of the week, Friday. We then want %B, the full form of the month.
Then we want %d, the day number (11), then %Y, the full 4-digit year, 2019.
R will return,
[1] "Friday October 11 2019"
From the today vector, we can extract parts of the date.
format(today, format="%A")
[1] "Friday"
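
To experiment with formats without depending on today’s date, you can build a
date from text with as.Date(). The sketch below uses a fixed date so that the
results are repeatable; the second date is an arbitrary choice made to
illustrate date arithmetic.

```r
# A fixed date keeps the results repeatable; Sys.Date() would give today.
d <- as.Date("2019-10-11")

format(d, "%d/%m/%Y")      # day/month/4-digit year: "11/10/2019"
format(d, "%Y")            # the year only: "2019"
format(d, "%A %B %d %Y")   # weekday and month names follow your locale

# Subtracting two dates gives the number of days between them.
days_to_xmas <- as.Date("2019-12-25") - d
days_to_xmas
```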
Exercises 5.
Create the following data frame by either forming it on Excel and
importing it into R, or by entering it directly into R.

Use R’s function to check if there are any recorded NA’s.

Eliminate the record with the NA’s.

Create two new variables:


Q4 = Q1 +Q2+Q3 and
Q5 = Q2 × Q3

Change Q1, Q2, …Q5 to “Item1”, “Item2”, …, “Item5”.


Chapter 6: Data Management 2

Built-In Functions
For manipulating data, R has many built-in functions that can handle a
wide array of operations. We will visit several of the functions of common
usage for arithmetic, mathematical and statistical calculations. You need not
memorize these functions but use them as needed.

Basic Arithmetic Functions and Operators

x + y # addition
x - y # subtraction
x * y # multiplication
x / y # division
x^y # exponentiation (raising x to the power of y)
x %% y # x mod y
x %/% y # integer division
round(x, digits=n) # round x to n decimal places
trunc(x) # truncate x toward 0: trunc(8.95) returns 8
signif(x, digits=n) # round to n significant digits

Mathematical Functions

abs(-34) # absolute value of -34 will return 34
sqrt(36) # square root of 36 will return 6
floor(x) # largest integer not greater than x
ceiling(x) # smallest integer not less than x
beta(x, y) # returns the beta function value of two non-negative
           # numeric values or vectors
gamma(x) # returns the gamma function value of a numeric value or
         # vector
choose(n, k) # returns the number of combinations of k items chosen
             # in an unordered manner out of a set of n
factorial(x) # returns x factorial, x!

Trigonometric Functions

cos(x)
sin(x)
tan(x)
cosh(x)
sinh(x)
tanh(x)
acos(x)
asin(x)
atan(x)
acosh(x)
asinh(x)
atanh(x)

Logarithmic and Exponential Functions

log(x) # this gives the natural logarithm of x, ln(x)
log10(x) # this gives the common logarithm of x
log(x, base=y) # logarithm of x to the base y
exp(x) # gives e^x
These functions can operate on a single value, and they can also operate on a
vector, a matrix or a data frame. When operating on a vector, matrix or data
frame, they operate on each value in the object. For example,
x<-c(3, 5, 7, 9)
exp(x)
[1] 2.008554e+01 1.484132e+02 1.096633e+03 8.103084e+03

Relational Operators

x<y
x>y
x <= y
x >= y
x == y # equality test
x != y #non-equality test

These are binary operators that compare the values. They return a vector
of TRUEs or FALSEs that indicate the result of the individual
comparisons.

For example
4 == 5
[1] FALSE #4 is not equal to 5

4 != 5
[1] TRUE

Statistical and Probability Functions

mean(x)
median(x)
sd(x)
var(x)
mad(x) # median absolute deviation of x
range(x)
sum(x)
min(x)
max(x)
summary(x) # gives summary numbers for the dataset x.
For example,
y<- c(5:25)
summary(y)

scale(x) # This standardizes the values in x. It gives the z-scores for
         # each value in x.
To transform a vector, matrix, or data frame to a scale with a specified mean
and standard deviation, we code
new.x<-scale(x)*SD + M
In this code, SD and M are the new specified standard deviation and
mean respectively.
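
A quick check confirms that the rescaled values really have the requested
mean and standard deviation. In the sketch below the sample values and the
targets M = 100 and SD = 15 are invented for illustration.

```r
x <- c(23, 34, 45, 56, 46, 31, 57, 68)   # invented sample values

z <- as.vector(scale(x))   # z-scores: mean 0 and standard deviation 1

# Illustrative targets for the new scale.
M  <- 100
SD <- 15
new.x <- z * SD + M

c(mean(new.x), sd(new.x))  # should match M and SD
```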
quantile(x, probabilities)
This gives the quantiles where x is the numeric vector of which the quantiles
are desired and probabilities is a numeric vector giving the probabilities for
the quantiles desired.
For example, if we want the 45th and 80th percentiles of the vector
23,34,45,56,46,31,57,68,79,87,65,41,54,32,41
We would code,
x<-c(23,34,45,56,46,31,57,68,79,87,65,41,54,32,41)
y<-quantile(x, c(.45, .80))
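
Calling up y shows the two percentiles. The values in the comment below were
worked out by hand using R’s default interpolation rule (type 7), so treat
them as a check on that default rather than as the only possible answer;
other quantile types give slightly different numbers.

```r
x <- c(23,34,45,56,46,31,57,68,79,87,65,41,54,32,41)
y <- quantile(x, c(.45, .80))
y
# With the default method (type 7): 45% -> 45.3 and 80% -> 65.6
```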

You can also standardize a specified variable or column in a data frame.


Let’s say you have a data frame called dataframe1 which contains a variable
called weight and you want to standardize the variable weight. You code as
follows.
dataframe1<-transform(dataframe1, weight=scale(weight))

The following functions are probability functions and are used basically
to generate data from probability distributions with known characteristics.
The following table gives the probability distribution and its code
Along with the code for the distribution names, we can add a letter to the
beginning of the code name to indicate a special function.
d = density function
p = distribution function (cumulative)
q = quantile function
r = will generate a specified number of random numbers from the
distribution.
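
As an illustration of the four prefixes, the sketch below applies each one to
the normal distribution, whose code name is norm. The seed value and the
particular arguments are arbitrary choices made for the example.

```r
d0 <- dnorm(0)        # density of the standard normal at 0
p  <- pnorm(1.96)     # cumulative probability P(Z <= 1.96), about 0.975
q  <- qnorm(0.975)    # value with 97.5% of the area below it, about 1.96

set.seed(1)           # a fixed seed makes the random draws repeatable
r <- rnorm(5, mean = 110, sd = 15)   # five random values
r
```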

Probability Distributions: The Normal Distribution

What is a Normal Distribution?


Imagine a dump truck dumping coarse sand at a construction site. The
dumped sand will take the shape of a hill as shown below. But why is that?
Well, there is a central spot called the mean, around which most of the sand
will fall. The further you move away from the mean, the less likely you will
have sand falling there, hence the fat middle and trailing tails on both ends.
So, the sand falls in a mound, and from our vantage point, the mound is a
curve with a bell shape.

Figure 6.1

Now imagine a family having a reunion in a park. If you were to
take the ages of everyone at the reunion, you would find that most of the
people have ages that cluster around the average age. Let us say
that the average age was 40; the further away you get from 40, towards the
high end and the low end, the fewer people you will find with ages near that
measure. For example, there will tend to be fewer people present of ages two
and under, and there will be fewer people there who are 90 and over. Again,
the ages are falling in a bell curve with more ages around the center and
fewer the further away you get from the center.
This bell curve is called the normal distribution. We can ask the
hypothetical question: Suppose you were to go blindfolded and grab
someone at that reunion, how likely is it (what is the probability) that the
person you grabbed is between 45 and 55 years of age? The fact that the
ages fall in a normal curve makes the answer to this question possible.
To answer the question, we will need two quantities or measures that we
get from the list of ages: the mean and the standard deviation. The standard
deviation is a measure of the spread of the ages around the mean. The figure
below illustrates this.
Figure 6.2

In the normal curve to the left, the age measures are packed closer around
the mean, so the curve is skinny and tall. If the mean is 40 years of age, then
in this curve, you are less likely to find someone of age 80 or 90 – that is too
far from the mean for this curve.
The curve to the right is fat and wide, meaning that the age measures are
spread out further from the mean. If the age curve looks like this one, and if
the mean is 40 years of age, you are more likely to find an 80- or a 90-year-
old here. The skinny curve will have a smaller spread around the mean (a
smaller standard deviation) than the fat one. If a normal distribution has a
large standard deviation, then you know the scores are spread further away
from the distribution’s mean.
Once you know the mean and the standard deviation of the normal
distribution, a probability question like the one posed above can be easily
answered.
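
As a sketch of how such a question is answered in R, suppose the ages have a
mean of 40, as in the example, and a standard deviation of 15. The standard
deviation is an assumption made only for this illustration, since the text
does not give one. The probability of an age between 45 and 55 is the area
under the curve up to 55 minus the area up to 45.

```r
mean_age <- 40   # mean age from the reunion example
sd_age   <- 15   # assumed for illustration; the text gives no value

# P(45 <= age <= 55) = P(age <= 55) - P(age <= 45)
p <- pnorm(55, mean_age, sd_age) - pnorm(45, mean_age, sd_age)
p                # roughly 0.21 under these assumptions
```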

Plotting the Normal Distribution Curve


The curve of a normal distribution is called a probability density curve.
Suppose you want to plot the density curve of a normal distribution with a
mean of 110 and a standard deviation of 15. You can use the pretty() function
to give the x values. This function will give approximately n evenly spaced
numbers that cover the given range: for example, pretty(c(25,250),50) will
give about 50 evenly spaced numbers covering the range from 25 to 250.
For the y-axis, we will use the dnorm() function (“d” means density) for
the given x range and with the given mean and standard deviation.
So, we assign as follows
x<-pretty(c(25,250), 50)
y<-dnorm(x, 110, 15)
Then we use the plot function:
plot(x, y, col="darkgreen", pch=16)
RStudio will give the following normal distribution plot in the Plots
window:

The above graph plots each value of the random variable against its
probability density.

The Percentile Problem (Also Called the Reverse Normal Problem)
What is the score that is the 85th percentile (P85) of a normal distribution
with a mean of 110 and a standard deviation of 15?
Now, look at it this way: Suppose 200 students in a room took an exam.
After their scores came in, the average score was found to be 110 and the
standard deviation, 15. If a student gets an exam report and is told that he
or she scored higher than 85% of the other students who took the exam, then
that score is called the 85th percentile score.
To find the 85th percentile score of that distribution using R, we use the
function, qnorm(), and we give it the values 0.85 (for the percentile being
asked), 110 for the mean, and 15 for the standard deviation.
qnorm(.85, 110,15)
R will give us back the value
[1] 125.5465
This is the 85th Percentile value of the distribution. Thus, if someone
scored 125.5, that person would have scored higher than 85% of the other
candidates.
In using the normal distribution functions, if we do not indicate a mean
and a standard deviation, R will use the standard normal distribution, whose
mean = 0, and standard deviation = 1.
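The two points above can be checked with a short sketch. It uses only qnorm() and its companion pnorm(); the variable names p85 and z85 are our own.

```r
# Percentile (reverse normal) problem and its check with pnorm().
p85 <- qnorm(0.85, mean = 110, sd = 15)   # score below which 85% of scores fall
pnorm(p85, mean = 110, sd = 15)           # feeding the score back returns 0.85

# With no mean or sd supplied, R uses the standard normal (mean 0, sd 1).
z85 <- qnorm(0.85)                        # the 85th percentile as a z-score
110 + 15 * z85                            # rescaling z85 reproduces p85
```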

Probability Distributions: The Poisson Distribution

What Is a Poisson Distribution?


In a nutshell, a Poisson distribution is a distribution of the probabilities of
rare events occurring in some given time interval or some area or region.
Some examples of situations that are modeled by the Poisson distribution
include:
The number of telephone calls per hour received by an office
The number of days school is closed or a game postponed due to rain or
stormy weather
The number of field mice per acre in a certain forest area.
The number of bacteria in a culture
The number of typing errors per page
The number of fatal accidents per year on a given highway.

Example:
During a lab experiment, the average number of radioactive particles
passing through a counter in 1 millisecond is 4. This means that the passing
of a particle through the counter is a rare event. What is the probability that
6 particles will enter the counter in the next millisecond? Since we want the
probability of a rare event, we can use the Poisson distribution.
To answer probability questions using a normal distribution, two
quantities are needed: the mean and the standard deviation. To answer
probability questions using the Poisson distribution, only one quantity is
needed: the mean number of events in the time or region specified. For
example, in the radioactive particles question above, we are told that the
average number of radioactive particles passing through a counter in 1
millisecond is 4. Thus, the Poisson mean is 4. The Poisson mean is called
“lambda”, a Greek letter that is written as “λ”.
To find Poisson probabilities in R, we use the dpois() function. We give
the function two quantities, lambda, and the x for the question. Remember
that x is the value of the random variable, whose probability we are being
asked. For this question, then, x = 6. The form of the function is dpois(x,
lambda). Thus, we key
dpois(6, 4)
R will answer
[1] 0.1041956
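As a check, the same number can be obtained from the standard Poisson formula, P(X = x) = e^(-λ) λ^x / x!. The sketch below computes it both ways; the names by_formula and by_dpois are our own.

```r
lambda <- 4   # mean number of particles per millisecond
x      <- 6   # number of particles asked about

by_formula <- exp(-lambda) * lambda^x / factorial(x)   # e^(-lambda) * lambda^x / x!
by_dpois   <- dpois(x, lambda)

by_formula
by_dpois
```

Both lines print the same probability that dpois(6, 4) gave above.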

Plotting a Poisson Probability Distribution


Suppose we want to graph the probabilities that 0, 1, 2, …, 12 radioactive
particles pass through the counter in a millisecond. Here we are plotting the
Poisson probabilities for these 13 values of x, with lambda = 4. We first
create a sequence from 0 to 12 in steps of 1 (a count of particles is always
a whole number). We will call the sequence x. Then we use the dpois()
function, between whose brackets we put x for the sequence and the value of
lambda, which is 4. We use type="h" to indicate that our graph should not be
a line but a set of vertical, histogram-like lines, one for each value of x.
The lwd=4 tells R to draw each vertical line with a width of 4. The
col="blue", main, and col.main options speak for themselves.
x<-seq(0,12,1)
plot(x, dpois(x,4), type="h", lwd=4, col="blue", main="Poisson(12,4)",
col.main="blue")
This code will generate the following Poisson graph.
Using this graph, we could answer many questions about the distribution
of the radioactive particles.
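Questions of the "at most" or "more than" kind can also be answered numerically. The sketch below uses ppois(), the cumulative companion of dpois(); the three questions are our own examples, not ones posed in the text.

```r
lambda <- 4

p_at_most_2   <- ppois(2, lambda)                       # P(X <= 2)
p_more_than_6 <- ppois(6, lambda, lower.tail = FALSE)   # P(X > 6)
p_3_to_5      <- sum(dpois(3:5, lambda))                # P(3 <= X <= 5)

p_at_most_2
p_more_than_6
p_3_to_5
```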

Probability Distributions: The Binomial Distribution
Let us use a concrete example to show the use and code-writing for the
binomial distribution.
Example
It is believed that the probability of finding impurities in drinking wells in
a certain rural region is 0.35. That is saying that impurities exist in 35% of
all wells in this region. To gain some insight into this problem, it is
determined that some tests should be made. It is too expensive to test all the
many wells in the area, so 10 were randomly selected for testing. What is the
probability that:
None of the 10 wells carry impurities?
One of the 10 wells carries impurities?
Two of the 10 wells carry impurities?
Three …
Four …
Five …
Six …
Seven …
Eight …
Nine of the 10 wells carry impurities?
All of the 10 wells carry impurities?

How Do We Know This is a Binomial Situation?


The first point to be checked out is that we have ‘n’ independent trials. A
trial is done each time we take samples from a well to determine whether it
contains impurities or not. In our experiment, there are ten wells, therefore
we have n = 10 trials. Are the trials independent? They are because the
outcome of one trial does not depend on the outcome of any other trial. The
outcome that one well contains impurities does not depend on whether the
well before it contained impurities or didn’t. Therefore, our trials are
independent.
The second point is that each trial must be dichotomous, that is, for each
trial, there can be only two outcomes – a success or a failure. This point
checks out because the outcomes for each trial are either a well contains
impurities or it does not. If it does, we call that a success. If it does not, we
call that a failure.
The third point that must be checked out is that we should know
beforehand the probability of having a success. We are given that the
probability that a well contains impurities is 0.35. This is the probability of a
success. Therefore, this point checks out as well. This means that we can use
the binomial model to answer our probability questions. This model uses the
binomial formula, given below, to calculate the probabilities.

Since we are using R to do our calculations for us, all we need to do is to
give it the quantities x, n, and p (q is simply 1 – p) and it will do the
rest. The “b” in the formula above indicates that we are working with a
binomial distribution.
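For reference, the binomial formula in its standard textbook form, written with the quantities x, n, p, and q used in this section, is:

```latex
b(x;\, n, p) = \binom{n}{x}\, p^{x}\, q^{\,n-x},
\qquad q = 1 - p, \quad x = 0, 1, \dots, n
```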

The Parameters
Now, we must find the parameters, x, n, p, and q. The quantity n, we
know to be the number of trials, which is 10. The quantity p is the given
probability of a success. This is 0.35. The third quantity is q, and q = 1 – p.
Therefore, if
p = 0.35, then
q = 1 – 0.35 = 0.65.
The quantity x is the value of the random variable and is given in the
question. The random variable, in this case, is the number of wells out of the
ten that contain impurities. The first question asks us to find the probability
that 0 out of the 10 wells contain impurities. Therefore, for this question, x =
0.
Our parameters are n = 10, p = 0.35, and q = 0.65. Each question will
give us a different x. We are asked the probability that none of the wells
contain impurities, that is, the probability that x = 0. This is denoted
P(X = 0).
We can also find the probability that one out of the ten wells contains
impurities, that is, P(X = 1). Similarly, we could find P(X = 2), P(X = 3),
P(X = 4), P(X = 5), … , P(X = 10): eleven probabilities in all. Together, all
eleven probabilities make up the probability distribution of the random
variable X, using the binomial model. We are now ready to have R calculate
the probabilities for us.

The Binomial Distribution Function


The probabilities can be found using the binomial distribution function
dbinom(x, n, p, log=FALSE).
The first parameter of the function, the x in dbinom(x, n, p, log=FALSE),
is the range of the values of X, 0 through 10 or 0:10. The n in
dbinom(x, n, p, log=FALSE) is the size of the sample, which is 10 wells.
The p is the given probability, which is p = 0.35. The log=FALSE tells R not
to calculate the logarithm of the probabilities. The function dbinom(x, n, p,
log=FALSE) will give all 11 probabilities. We will call the vector of the 11
probabilities y, and then type y to see it. We write:
y<- dbinom(0:10,10,0.35, log=FALSE)
y
R will return:
[1] 1.346274e-02 7.249169e-02 1.756530e-01 2.522196e-01
2.376685e-01 1.535704e-01 6.890980e-02 2.120302e-02 4.281378e-03
5.123017e-04 2.758547e-05
I have underlined the values for clarity.
The Probability Distribution of X
The first value, 1.346274e-02, is 0.0135 to four decimal places. This is
the probability of finding no wells with impurities out of the sample of 10
wells. The second value, 0.0725, is the probability of getting one well with
impurities. The probability of two wells with impurities is 0.1757, etc.
All 11 of these probabilities make up the probability distribution of X.
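A short sketch can make the distribution easier to read and confirm that it behaves as a probability distribution should. It labels each probability with its value of x, recomputes the values from the standard binomial formula with choose(), and checks that they sum to 1. The names by_formula and dist are our own.

```r
n <- 10; p <- 0.35; q <- 1 - p
x <- 0:n

y <- dbinom(x, n, p)                           # R's binomial probabilities
by_formula <- choose(n, x) * p^x * q^(n - x)   # the same values from the formula

dist <- round(setNames(y, x), 4)               # label each probability with its x
dist
sum(y)                                         # a probability distribution sums to 1
```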

The Graph of the Binomial Probabilities for this Problem

We can plot all values of X against Y, their probabilities. This will be the
graph of our distribution. To do so, we first give a name to the vector of X
values, which is zero through 10, (0, 1, 2, …, 9, 10). We call this vector x.
Then we will plot x against y.
x<-c(0:10)
plot(x,y, pch=15, cex=1.5,col="red")
R will give the following graph.

If we want to find a single probability value, say, the probability that we
find seven out of the 10 wells with impurities, we use the following line:
dbinom(7,10,0.35, log=FALSE) # instead of the range 0:10, we use 7.
R will return:
[1] 0.02120302
This is 0.0212, the eighth value in the vector y above (the first value
corresponds to x = 0).
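Probabilities for a range of values, such as "at most three wells" or "at least seven wells", can be obtained with pbinom(), the cumulative companion of dbinom(). The questions in this sketch are our own examples.

```r
n <- 10; p <- 0.35

p_at_most_3  <- pbinom(3, n, p)                       # P(X <= 3)
p_at_least_7 <- pbinom(6, n, p, lower.tail = FALSE)   # P(X >= 7) = P(X > 6)
p_exactly_7  <- dbinom(7, n, p)                       # P(X = 7)

p_at_most_3
p_at_least_7
p_exactly_7
```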

Probability Distributions: The Hypergeometric Distribution

A Hypergeometric Problem

As an example of the use of the hypergeometric distribution to find
probabilities, suppose, in an area of the forest under a population biology
study, we have a population of 25 rabbits, 10 of which are brown and 15,
gray. Note that in the population there are two groups. We will call the gray
rabbits Group 1, and the brown rabbits, Group 2.
Now suppose we set 5 walk-in traps overnight to catch 5 rabbits at
random. What is the probability that in our sample of 5 rabbits, we find:
No gray rabbits?
One gray rabbit?
Two gray rabbits?
Three gray rabbits?
Four gray rabbits?
All five are gray rabbits?
The hypergeometric distribution is the one that is used to find these
probabilities.

The Hypergeometric Function


Here we use the dhyper(X, k, m, n) function. X is the range of the values
of x, k is the number of individuals in Group 1, m is the number of
individuals in Group 2, and n is the size of our sample. The formula for
finding hypergeometric probabilities is:

where N = k + m
But luckily, we are using R, so let us have R worry about the formula.
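For readers who want to see it, the hypergeometric formula in its standard form, written with the letters defined above, is:

```latex
h(x;\, k, m, n) = \frac{\binom{k}{x}\,\binom{m}{n-x}}{\binom{N}{n}},
\qquad N = k + m
```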

The Probability Distribution of X

Let X be the number of gray rabbits we could find in our sample of 5; then X
would range from 0 to 5. In R we write 0:5 for this range. Let k be the
number of gray rabbits in the population (Group 1, k = 15). Let m be the
number of brown rabbits in the population (Group 2, m = 10). Let n be the
size of our sample, n = 5. This is all we need to find the probabilities
using the hypergeometric function dhyper() in R. We code as follows:

dhyper(0:5,15, 10, 5)
R will return the following:
[1] 0.004743083 0.059288538 0.237154150 0.385375494 0.256916996
0.056521739
Again, I have underlined for clarity.
Using only 4 decimal places, the first value, 0.0047 is the probability that
we find no gray rabbits in the catch. The second value, 0.0593 is the
probability we find one gray rabbit, etc. These six values are the probability
distribution of X.
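The six probabilities can be checked against the standard hypergeometric formula with a brief sketch using choose(). The letters k, m, and n follow this section's notation; note that R's own help page for dhyper() names its arguments differently (x, m, n, k), although the order of the values passed is the same as used here.

```r
k <- 15   # gray rabbits in the population (Group 1)
m <- 10   # brown rabbits in the population (Group 2)
n <- 5    # rabbits caught
x <- 0:n  # possible numbers of gray rabbits in the catch

by_formula <- choose(k, x) * choose(m, n - x) / choose(k + m, n)
by_dhyper  <- dhyper(x, k, m, n)

round(by_dhyper, 4)
sum(by_dhyper)   # the six probabilities sum to 1
```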

The Graph of the Hypergeometric Probabilities for this Problem

We now graph the distribution using a bar graph. To do this, we first
make a table of the values of X with their probabilities, ProbX.

We will now code this data frame into R. We will call it Rabbits.
Rabbits<-data.frame(X = c(0,1,2,3,4,5), ProbX =
c(0.0047,0.0593,0.2372,0.3854,0.2569,0.0565))
We now call up our data frame to see what it looks like.
Rabbits
We will now plot a bar graph using ProbX, written as Rabbits$ProbX (more
about graphs in the next chapter). The values of X, that is, 0, 1, 2, 3, …
will be used as the categories or the names of the bars. This we will do
using the names.arg option, to which we give the labels "0" through "5".
barplot(Rabbits$ProbX, names.arg = c("0", "1", "2", "3", "4", "5"),
col="darkred")
R’s output will be

User-Defined Functions
A program is nothing but a set of instructions. When we write and save a
program, we are ensuring that we will not have to write the same set of
instructions over and over every time we must do the same task. A program
can be very simple or very complex. When we write a function, a macro, a
print instruction, a worksheet template, we are programming. In this section,
we are going to visit one of the most important aspects of programming in R
and that is writing your own functions. One of R’s prime strengths is that the user
can write user-defined functions that can expand the scope of the program.

A Simple Function: The While Loop


Let us say we begin with a vector we will call Vector1 and let us assign
some values to the vector. And then we will find the mean and variance of
Vector1.

Vector1<-c(5, 6, 7)
Vector1

[1] 5 6 7

mean(Vector1); var(Vector1)

[1] 6 #mean of Vector1
[1] 1 #variance of Vector1

Let us add 1 to each element of Vector1.

Vector1<-Vector1 + 1
Vector1

[1] 6 7 8

What will be the mean and variance of this new Vector1? And if we add 1
to this new Vector1, what will be the mean of the resulting vector? We will
add 1 to Vector1 repeatedly, and each time we will find the mean of the new
Vector1.
To do this we will use a while loop, which is written with the while(){}
construct. Note the curly brackets. A loop is a statement that keeps running
as long as a condition remains true.
The syntax of a while loop is while (this condition is true){execute this
statement}.
Vector1<-c(5,6,7,8)
while(Vector1[1]<=10) {cat("mean=",mean(Vector1),"\n");Vector1<-
Vector1+1}

Now, let us look at the code in detail. The first line, of course, creates
Vector1 and assigns it the values of 5, 6, 7, and 8 using the combine function
c().
The second line sets up the condition. This condition is that Vector1[1]
<=10. Remember that we are adding 1 to Vector1 repeatedly. The first time we
add 1, the elements of the new vector will be 6, 7, 8, and 9. The second
time we add 1, we will have a new vector, 7, 8, 9, 10. The fifth time we add
1, the vector will be 10, 11, 12, 13. R will give us the mean of the original
vector and then of each new vector we create. But how would R know when to
stop calculating means? This is the job of the condition. The condition tells
R to continue giving us the mean of each new vector as long as
Vector1[1]<=10, that is, as long as the first element in the vector is less
than or equal to 10. R will keep checking the first element of each new
vector to determine if it is less than or equal to 10. If it is, then R will
give us the mean of that vector. On the sixth time adding 1, the first
element will be 11. Therefore, R will stop giving us the mean and the loop
will end.
The cat() function prints its arguments one after another, much as c()
combines values into a vector. It joins together the string “mean=” and the
actual value of the mean that comes from mean(Vector1). The comma that
follows brings on the third thing that the cat function joins, “\n”. After R
has written the mean for the first vector, the “\n” tells R to go to a new
line. The semi-colon ends the cat() statement, but we are still inside the
while loop. So, R has given us the mean for the first vector; now we add 1 to
the first vector to create a second vector. This is the statement
Vector1<-Vector1+1, which tells R to add 1 to the vector it has just found
the mean for. The statement says, “the new Vector1 is the old Vector1 + 1”.
R then checks the condition again and, if it still holds, gives us the mean
for this new Vector1 on a new line. It will keep doing this until the first
element of the new vector is greater than 10; then the while loop will end.
Below is the code written again and R’s response.
> vector1<-c(5,6,7,8)
> while(vector1[1]<=10){cat("mean=",mean(vector1),"\n");vector1<-
vector1+1}

mean= 6.5
mean= 7.5
mean= 8.5
mean= 9.5
mean= 10.5
mean= 11.5
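To see exactly how many times the loop body ran, one option is to add a counter. The sketch below is a variation on the loop above, not the original code: the names passes and means are our own, and the means are collected in a vector instead of being printed with cat().

```r
Vector1 <- c(5, 6, 7, 8)
passes  <- 0                      # hypothetical counter, not in the original code
means   <- numeric(0)             # collects the mean from each pass

while (Vector1[1] <= 10) {
  passes  <- passes + 1
  means   <- c(means, mean(Vector1))
  Vector1 <- Vector1 + 1          # same update as in the loop above
}

passes    # number of times the loop body ran
means     # the means in the order they were computed
Vector1   # the vector that finally failed the condition
```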

To have the while loop return the variance of the vector as well as its mean, we
code as shown below.

vector1<-c(5,6,7,8)
while(vector1[1]<=10){cat("mean=",mean(vector1),"variance=",
var(vector1),"\n");
vector1<-vector1+1}

mean= 6.5 variance= 1.666667
mean= 7.5 variance= 1.666667
mean= 8.5 variance= 1.666667
mean= 9.5 variance= 1.666667
mean= 10.5 variance= 1.666667
mean= 11.5 variance= 1.666667

As we add 1 to the vectors, their means will increase by 1. However, their
variances will all be the same because variance is simply a measure of the
spread of the values of the vector around their mean. When we add 1 to each
value, the values change by 1, but their mean also changes by 1. Therefore,
the distances between the values and their mean do not change and, thus, the
spread of the values around the mean, which is the variance, does not change.
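This argument can be verified directly. The sketch below compares the deviations from the mean before and after the shift; the vector v and the extra shift of 100 are our own illustrative choices.

```r
v <- c(5, 6, 7, 8)

deviations_before <- v - mean(v)             # distances from the mean
deviations_after  <- (v + 1) - mean(v + 1)   # unchanged after adding 1

deviations_before
deviations_after

var(v)        # variance of the original vector
var(v + 1)    # same variance after the shift
var(v + 100)  # any constant shift leaves it unchanged
```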

Writing Your Own Function with an If-Else Statement

Let us begin by examining one of R’s built-in functions to see how a
function works as well as examining its coding. We have, so far, been using
functions that are built into the base R system or come as part of a
package.
Now let us create our own function. We are going to write a function that
will calculate and print the mean and variance of a given vector of values.
First, we will use function(), which is used to create a user-defined
function, and then we will assign a name, MyFunction, to our new function.
function() will have as its parameters the name of the vector of values, x,
and the print logical parameter. The code will be something like this:
MyFunction <- function(x, print=TRUE)
{
center<-mean(x)
variance<-var(x)
cat("Mean of the values is:", center, "\n", "Variance of the values is:",
variance, "\n")
}
Now to see how our function works we will assign a vector of values as
x, then call the function on x.
x<-c(23,34,45,56,67,78,89)
MyFunction(x)
R will print
Mean of the values is: 56
 Variance of the values is: 564.6667
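A function that only prints its results cannot pass them on to other calculations. One common alternative, sketched here under the hypothetical name MyStats (not part of the original text), is to return the values in a named list.

```r
# Hypothetical variant (MyStats is not in the original): return the results.
MyStats <- function(x) {
  list(center = mean(x), variance = var(x))   # a named list of the two results
}

x <- c(23, 34, 45, 56, 67, 78, 89)
result <- MyStats(x)
result$center          # the mean
result$variance        # the variance
sqrt(result$variance)  # the returned value can be reused, here for the sd
```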

Now let us include an error statement in our code with an if-else
construct. This is a conditional statement: its first branch is executed
only if the condition given is met. If the condition is not met, then the
else statement will be executed.

In the example below, we will create a variable for the mean of a vector
x, which we will call “center”, and a variable for the variance of the vector
x, which we will call variance. We will write the if-else statement that will
output the mean and variance of a given vector if the vector is numeric. If
the vector given is not numeric, then R will output the message, “Vector
must be of numeric type.”
This whole process we will put together in a function which we will call
MyFunction2. Remember, you must use function(x) to create MyFunction2.
After this assignment, you will write the body of the function between
curly brackets, that is, the if-else statement and other assignments.

MyFunction2<-function(x)
{
center<-mean(x)
variance<-var(x)
if(is.numeric(x))
cat("Mean:", center,"\n","Variance:",variance,"\n")
else print("Vector must be of numeric type.")
}

Now we will assign numeric values to x.
x<-c(345,398,309,321)
MyFunction2(x)
R will give the answers
Mean: 343.25
 Variance: 1556.25

Now let us assign a character vector as x.
x<-c("dog","cat","cow","snake")
MyFunction2(x)
To which R will reply,

[1] "Vector must be of numeric type."
Warning messages:
In mean.default(x): argument is not numeric or logical: returning NA
In var(x): NAs introduced by coercion
In var(if (is.vector(x)) x else as.double(x), na.rm = na.rm):
NAs introduced by coercion
The warnings appear because MyFunction2 computes mean(x) and var(x) before
the if statement checks whether x is numeric.
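One way to avoid the warning messages is to test the type of x before doing any arithmetic. The following is a minimal sketch under the hypothetical name MyFunction3 (not part of the original text); it leaves the function early when x is not numeric.

```r
# Hypothetical variant (MyFunction3 is not in the original): test the type first.
MyFunction3 <- function(x) {
  if (!is.numeric(x)) {
    print("Vector must be of numeric type.")
    return(invisible(NULL))          # leave before any arithmetic is attempted
  }
  cat("Mean:", mean(x), "\n", "Variance:", var(x), "\n")
  invisible(c(mean = mean(x), variance = var(x)))
}

MyFunction3(c(345, 398, 309, 321))             # prints the mean and variance
MyFunction3(c("dog", "cat", "cow", "snake"))   # prints the message, no warnings
```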

An excellent resource for learning all about user-written functions is the
book Software for Data Analysis: Programming with R, by J. M. Chambers
(Springer, 2008). Here the finer points of user-defined functions are
detailed, and you can learn to write professional-level code that you can
make available to others as packages.
Exercises 6
Plot the density curve for a normal distribution using 100 numbers evenly
spaced across a range of -3 to +3 on the x-axis and for the y axis, the normal
probabilities for standard normal distribution. Use line width and color
options and any other options you choose.

Plot 120 numbers from a Poisson distribution with a lambda = 6, using an
x range of 0 to 20.

Plot 20 numbers from a binomial distribution with p = 0.25 using an x
range of 0 to 20.

Plot the density of a Hypergeometric distribution of a population of 25
muskrats, 12 of which are brown and the others gray. You are sampling 10 at
random from the population. Use an x-range of 0 to 20.

Write a function that, when given a vector of values, will return the
Median,
First Quartile
Third Quartile
Variance
Standard Deviation
Mean Absolute Deviation
Upper Fence (RUB)
Lower Fence (RLB)
Midrange
Mid-quartile
Interquartile Range (IQR)
of the vector.
And will also give a warning message if the input vector is non-numeric.
Chapter 7: Basic Graphs of Statistics

A picture is worth a thousand words – and graphs are pictures of
functions and relations. In this section we will learn how to create some of
the most common graphs in statistics, thereby picturing your data, which is
an integral part of the descriptive statistics process.

Dot Plots
The first graph we will visit is the dot plot. Dot plots are used mainly
for quantitative variables. There are two kinds of dot plots: the Wilkinson
Dot Plot and the Cleveland Dot Plot.
The Wilkinson Dot Plot
In the Wilkinson plot, the horizontal axis is a scale for the quantities. The
numerical values of each measurement in the dataset are located on the
horizontal scale by a dot. When data values repeat, the dots are stacked
vertically above the scale value. A dot plot of random values is shown below.

Figure 7.1
To create a Wilkinson dot plot with R, we use the stripchart() function.
To demonstrate this, we will use the following example. Twenty people were
asked how many times they had ever visited a museum. Their answers are
given below.
1 3 1 4 2 5 1 1 2 1
4 1 1 2 1 2 1 2 1 2
We will call the data set museum, which we will enter directly into R
via the keyboard. Then we will create the dot plot with the stripchart()
function.
museum<-c(1,3,1,4,2,5,1,1,2,1,4,1,1,2,1,2,1,2,1,2)
stripchart(museum, main="Museum", method="stack",
pch=16, col="blue")
This will create the dot plot as shown below.

The option method="stack" ensures that dots representing the same value
are stacked vertically. Had the option not been used, dots for the same
value would be plotted on top of one another and the multiplicities would
not be seen.
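The heights of the stacks can be confirmed numerically with table(), which counts how often each value occurs. This is a small companion sketch; the name counts is our own.

```r
museum <- c(1,3,1,4,2,5,1,1,2,1,4,1,1,2,1,2,1,2,1,2)

counts <- table(museum)   # how many people gave each answer
counts
max(counts)               # height of the tallest stack of dots
```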
The Cleveland Dot Plot

If you have a group of labeled values on a horizontal scale, you can use
the dotchart() function to create a dot plot of the values. We have the
following data, which have been entered into R as a data frame called
windspeed, with two variables (or vectors): windspeed$CapitalCities and
windspeed$CurrentWindSpeed.
The following code will create the Cleveland dot plot for the dataset.
dotchart(windspeed$CurrentWindSpeed,
labels=windspeed$CapitalCities,
cex=.8, main="Current Wind Speed for 10 Capital Cities",
xlab="Wind Speed (km/hr)")
This will give the following dot plot
Stem-and-Leaf Plots
The stem-and-leaf plot is a type of graph that classifies items according to
their most significant numerical digits. This plot is a simple plot that serves
as a first-glance graph. However, it gives an idea as to the contours of the
distribution. The stem-and-leaf plot is created by the stem() function. The
parameter passed is a numerical vector of values. For example, suppose we
want to create a stem-and-leaf plot of the following data, which is waiting
times in minutes at a bank ATM (automatic teller machine):

We have imported the dataset as a data frame called waittime, which
contains only one column, waittime$WaitTime. In the code, we rename the
dataset times. The code that follows will create the stem-and-leaf plot of
the data.
times<-waittime$WaitTime
stem(times)
R’s output will be as follows:

I can read your mind on this. You are asking: How do I read this?
Let us take the second row as an example.
2|49
The stroke | is where the decimal point would go. Thus, 2|49
means we have the first value of 2.4 and another value of 2.9.
8|63 gives us two values: 8.6 and 8.3.

Pie Charts
The data table below gives the mint date and the number of pennies bearing
that mint date for a sample of 2000 pennies.

Source: Lu, S., and Skiena, S., “Filling a penny album,” Chance, Vol. 13,
No. 2, Spring 2000, p. 36.
We are going to create a data frame with the above data by typing in the
following code:
pennies<-data.frame(MintDate=c("pre 1960's",
"1960's","1970's","1980's","1990's"), Number=c(18, 125, 330, 727, 800))
Then we call up pennies to see what it looks like.

Now we use the pie() function to create a pie chart with rainbow colors
with the labels as per the table.
pie(pennies$Number,col=rainbow(length(pennies$Number)),
labels=pennies$MintDate)
R will output the graph below in the Plots window

Now there are times when we might want to use colors ideal for black
and white print, use percentages or proportions to compare the categories
and use a legend. The code below will do this. Here we first define the
range of grayscale colors, then we convert the number of pennies to a
percentage of the total rounded to 1 decimal place, and then we use the
paste() function to concatenate the “%” symbol to the numbers.
Don’t worry, just write the code as you see it here and you will get the
graphical results.
colors<-c("white","grey70","grey50","grey90","black")
pennies_labels<-round(pennies$Number/sum(pennies$Number)*100,1)
pennies_labels<-paste(pennies_labels,"%",sep="")
pie(pennies$Number,col=colors,labels=pennies_labels,
main="Percentage of Pennies with Given Mint Dates",
cex=.8)
legend(1.0,0.3,pennies$MintDate,cex=.8,fill=colors)
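To see what the label-building steps produce before they reach the chart, they can be run on their own. The sketch below re-creates the pennies data frame so that it is self-contained (reading the first category as "pre 1960's"); the intermediate name pct is our own.

```r
pennies <- data.frame(
  MintDate = c("pre 1960's", "1960's", "1970's", "1980's", "1990's"),
  Number   = c(18, 125, 330, 727, 800)
)

pct <- pennies$Number / sum(pennies$Number) * 100   # share of the total, in percent
pennies_labels <- paste(round(pct, 1), "%", sep = "")

data.frame(MintDate = pennies$MintDate, Label = pennies_labels)
sum(pct)   # the unrounded percentages add up to 100
```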

3-D Pie Charts

We can create a 3-D pie chart with the pie3D() function from the plotrix
package. First, we install the package. In RStudio, click on the Packages
tab and look down the list to see if the plotrix package is already
installed. If it is not, click on Install and install it.
R will install the package, after which you will see it appear in the User
Library. Find plotrix in the User Library and click the check box at its
left to load it (or run library(plotrix), as in the code below). Now you can
use the package. Now write the code to plot the chart.

library(plotrix)
pie3D(pennies$Number,labels=pennies$MintDate,explode=0.1,
col=rainbow(5),main="Pennies with Given Mint Dates")
R now outputs the following

Bar Graphs
A bar graph is a commonly used graph in statistics mainly because of the
ease with which it can be visualized. The height of each bar is proportional
to the amount of data in that category. To create a bar graph in R we use
the barplot() function.
We are going to produce a bar graph of the pennies data. Here we use the
rainbow of colors and use names.arg=pennies$MintDate to label the
categories.
barplot(pennies$Number,names.arg=pennies$MintDate, col=rainbow(5),
main="Number of Pennies with Mint Date as Given")
R’s output will be
We can produce the same bar graph but with a horizontal orientation
simply by including the option horiz=TRUE in our code line.
barplot(pennies$Number,names.arg=pennies$MintDate,
col=rainbow(5),
main="Number of Pennies with Mint Date as Given",horiz=TRUE)
Figure 7.8

To rotate the category labels (the years scale on the left), we simply add
the las=2 option.

barplot(pennies$Number,names.arg=pennies$MintDate,
col=rainbow(5),
main="Number of Pennies with Mint Date as Given", horiz=TRUE,
las=2)

Histograms
A histogram is a summary graph much like the bar graph. It shows a
count of the data points falling within various ranges. It gives a rough
approximation of the frequency distribution of the data. The groups or
classes of data are called “bins”, as they are like containers that accumulate
data according to the frequency of that data class.
We can create a simple histogram with the hist() function.
The command
hist(pennies$Number)
will give the following graphical output:

Figure 7.10

We use the option freq=FALSE to create a histogram based on probability
densities rather than frequencies. We use the option breaks to suggest to R
how many bins to use.

We are going to generate 100 random values from a Poisson distribution
with lambda = 5 and graph them in a histogram with about 10 bins.
x<-rpois(100,5)
hist(x,breaks=10,col="green")
Figure 7.11

In attempting to identify the type of distribution that underlies a dataset
using a histogram, we want to use the probability density plot and then
overlay it with a smooth approximation of the distribution of scores. This is
done by using the lines() function.
hist(x,freq=FALSE,breaks=10,col="green",xlab="Poisson Scores",
main="Histogram and Density Curve")
lines(density(x),col="brown",lwd=2)
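The hist() function also returns an object describing the bins, which helps in understanding what freq=FALSE and breaks actually do. The sketch below computes the bins without drawing anything; the seed is an assumed value used only so that the sketch is repeatable, and h is our own name.

```r
set.seed(1)                 # assumed seed, only so the sketch is repeatable
x <- rpois(100, 5)

h <- hist(x, breaks = 10, plot = FALSE)   # compute the bins without drawing
length(h$counts)                          # number of bins R actually used
sum(h$counts)                             # every value lands in exactly one bin
sum(h$density * diff(h$breaks))           # densities times bin widths total 1
```

Because R treats a single number given to breaks as a suggestion, the number of bins reported may differ from 10.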

Figure 7.12
Box Plots
A box plot provides a graph of the median, quartiles, maximum and
minimum of a data set. This graph can display a lot of information on one
plot. You can create a simple plot or a more complex plot of categories in
the dataset.
The basic command is boxplot() and to this, we can add axis labels, a
main label, color, etc. like the options in any other graphing function. Let us
create a simple box plot for the data set below.
(23, 25, 27, 30, 31, 32, 35, 36, 45, 47, 49, 51, 53)
We will call the vector z.
>z<-c(23, 25, 27, 30, 31, 32, 35, 36, 45, 47, 49, 51, 53)
>boxplot(z)

Figure 7.13

Not a very attractive box plot, but we can spruce it up a bit.
> boxplot(z,ylab="Univariate Dataset",xlab="Value Axis",horizontal=TRUE,
+ main="Simple Box Plot",col="green")
Figure 7.14

The plot will automatically show any outlier. If there is an outlier, the
whiskers will not reach the maximum or minimum, because by default R extends
each whisker only as far as the most extreme value lying within 1.5 times
the IQR of the box. Values beyond these upper and lower fences are drawn as
separate points. If we want the whiskers to cover the full range, we can use
the range option: setting range=0 gives the full range. Let us add an
outlier to the data set.
>z[14]<-138
>boxplot(z,col="lightblue")

Figure 7.15
The range is automatically made to 1.5(IQR) and so the outlier is shown.
If we want to show the full range with 138 being the maximum score, we
use the range=0 option.
>boxplot(z,range=0)

Figure 7.16 Here the outlier is not shown, and R uses the full range of
the data.
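The numbers behind these two plots can be inspected with boxplot.stats(), the function that boxplot() relies on for its calculations. In this sketch the vector z is re-created with the outlier included, and s is our own name.

```r
z <- c(23, 25, 27, 30, 31, 32, 35, 36, 45, 47, 49, 51, 53, 138)

s <- boxplot.stats(z)   # default coef = 1.5, the same rule boxplot() uses
s$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out                   # values drawn as separate outlier points

boxplot.stats(z, coef = 0)$out   # coef = 0 mirrors range = 0: nothing is flagged
```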

Side-by-Side Box Plots


The boxplot() function can also create side-by-side boxplots with the
code:
boxplot(X~Y, data=dataframe)
Consider the following dataset: the dependent variable is “[Link]”,
from the data frame “cars”.

We can create three separate box plots side by side: one for Mazda, one
for Nissan, and one for Toyota with the following code:
Mazda<-c(38.5, 33.6, 41.8, 46.4, 46.0, 48.7)
Nissan<-c(34.6, 36.5, 31.5, 30.8, 35.1, 36.1)
Toyota<-c(40.7, 38.2, 38.4, 38.1, 46.7, 39.6)
mpg<-data.frame(Mazda, Nissan, Toyota) #Remember, highlight lines then RUN
Now we call up mpg to see its form
mpg

Looks good. Next, to get our box plots, we code:

boxplot(mpg, col=c("red", "blue", "green"))

Figure 7.17
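The formula form boxplot(X~Y, data=dataframe) mentioned at the start of this section expects one column of values and one column of group labels. One way to get the mpg data into that shape is the stack() function, as in the sketch below; the name long is our own, and the column names values and ind are the ones stack() creates.

```r
Mazda  <- c(38.5, 33.6, 41.8, 46.4, 46.0, 48.7)
Nissan <- c(34.6, 36.5, 31.5, 30.8, 35.1, 36.1)
Toyota <- c(40.7, 38.2, 38.4, 38.1, 46.7, 39.6)
mpg    <- data.frame(Mazda, Nissan, Toyota)

long <- stack(mpg)   # two columns: values (the numbers) and ind (the make)
head(long)

boxplot(values ~ ind, data = long, col = c("red", "blue", "green"))
```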
Exercises 7
1. Create a Wilkinson dot plot using the GPA data given below.

Create a stem-and-leaf plot using the GPA file.
Use the GPA file to create a histogram with 5 bins and with its density
curve.
Use the GPA file with the added values 7.5 and 8.3 to create a box plot.
2. Use the Digits file, shown above, to create a simple pie chart with
labels and colors.
Use the Digits file to create a labeled grayscale pie chart.
Use the Digits file to create a labeled 3D pie chart with colors.
Use the Digits file to create a horizontal bar graph with the y-labels
rotated.
Use the Digits file to create a simple histogram with colors.

3. Use the Tire data given below to create side-by-side box plots. These
are the stopping distances for thirty cars, ten equipped with Michelin tires,
ten with Goodyear, and ten with Firestone.
Chapter 8: Basic Methods of Statistical Analysis
In this chapter, we will begin to explore descriptive statistics. We will test
hypotheses and answer questions about variables and their interrelationships.

Descriptive Statistics
We will begin with the summary() function. We have seen this function
before. Now we will look a little deeper into its meaning. The summary()
command will provide the minimum, maximum, quartiles (including the median),
and mean for a numerical vector and will give frequencies for factors
(categorical vectors). Let us create a numerical vector.
x<-c(12,23,34,45,56,67,78,89,90)
Now we ask R to send us a summary of the vector x.
summary(x)
R then tells us:
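The six numbers that summary() reports for this x can also be reproduced one
at a time, which is a handy check on what each label means. A small sketch:

```r
x <- c(12, 23, 34, 45, 56, 67, 78, 89, 90)

summary(x)          # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

# The same six numbers, computed individually
min(x)              # 12
quantile(x, 0.25)   # 34
median(x)           # 56
mean(x)             # 54.88889
quantile(x, 0.75)   # 78
max(x)              # 90
```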

If we have a numerical data frame or matrix, the summary() function will
give the summary numbers for any columns you specify or for all vectors in
the data frame.
I have created a three-vector file in Excel called Insects. The file contains a
first vector, Weight, a second [Link], and a third, [Link].1.
Now I will import Insects into RStudio and use the function summary() to
find the summary numbers for the second and third vectors; [Link]
and [Link].1, respectively.
I call up the Insects data frame to see what it looks like by just typing in
its name.

Insects

The data frame below will appear in the Console window after you
highlight the line and hit RUN.
Now call up the summary() function to find the summary numbers for
[Link] (column 2) and [Link].1 (column 3) of the Insects
data frame.

summary(Insects[2:3])

R will put the summary numbers in the Console window.

Let us now see what the summary numbers mean.

Means
The function colMeans(Insects[2:3]) will give the mean (the arithmetic
or common average) of the second and of the third vectors.
colMeans(Insects[2:3])

Median
Suppose we have 5 measurements and arrange them in numerical order
from the smallest to the largest (this is called a distribution of the data), then
the number at the center will be our median. For example, let’s say we have
the following dataset in numerical order:
11, 16, 21, 32, 43
Our median will be the measure 21.
If instead, we had a dataset of 6 numbers:
11, 16, 21, 32, 43, 62, our median will be the average of the two middle
numbers, 21 and 32. Thus, our median will be (21+32)/2 = 26.5.
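The two cases above can be confirmed with the median() function. The vector
names odd_set and even_set are mine, chosen only for this sketch.

```r
odd_set  <- c(11, 16, 21, 32, 43)
even_set <- c(11, 16, 21, 32, 43, 62)

median(odd_set)    # 21, the middle value
median(even_set)   # 26.5, the average of 21 and 32

# median() sorts the values itself, so the input order does not matter
median(c(43, 11, 32, 21, 16))   # 21
```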

Quartiles
Quantiles are numbers that partition, or divide, an ordered data set into
equal parts. As an example, let us begin with a distribution. All the numbers
of the distribution represent 100%.

Now let us find the number that divides the distribution into two parts of
25% of the distribution on the left and 75% on the right. This means that this
number is larger than 25% of the other numbers in the distribution and
smaller than 75% of them. This number is called the First Quartile (Q1) of
the distribution.

Now, let us find the number that divides the distribution into two equal
parts of 50% each. This means that this number is larger than 50% of the
other numbers in the distribution and smaller than 50% of them. This
number is called the Second Quartile, or the Median of the distribution.

Let us now find the number that is larger than 75% of the other numbers
and smaller than 25% of them. This number is called the Third Quartile (Q3)
of the distribution.
The quartiles are quantiles because they divide the distribution into equal
parts – four of them.
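As a quick sketch, the three quartiles of the vector x created earlier in
this chapter can be requested in one call to quantile(). The probs argument
lists the cut points as fractions.

```r
x <- c(12, 23, 34, 45, 56, 67, 78, 89, 90)

# Q1, Q2 (the median) and Q3
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q                  # 34, 56, 78

# Share of the data at or below each quartile
mean(x <= q[1])
mean(x <= q[2])
mean(x <= q[3])
```

With only nine values the shares cannot land exactly on 25%, 50%, and 75%,
but they move closer to those figures as the data set grows.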

Percentiles
Here is the distribution again - all 100% of it.

Now, let us divide our distribution into 100 equal parts.

Each part will be 1% of the distribution. The numbers that mark the
boundaries between these parts are called percentiles. Now let us find a
number located in the distribution with 17 of these parts below it and 83 of
them above it. It therefore will be larger than 17% of the other numbers in
the distribution and smaller than 83% of them. This number is called the
17th Percentile of the distribution.
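Any percentile can be requested from quantile() by giving the fraction in
probs. The sketch below uses 1000 simulated scores (made-up data, generated
with a fixed seed so the run is repeatable) and then checks what share of
them falls below the 17th percentile.

```r
set.seed(1)                                  # hypothetical data
scores <- rnorm(1000, mean = 70, sd = 10)

p17 <- quantile(scores, probs = 0.17)        # the 17th percentile
p17

mean(scores < p17)                           # about 0.17
```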

Going back to our dataset Insects, we want to get the medians of the
second and third vectors. Since the median function median() can only give
the median for single vectors at a time, we will have to use the apply()
function to give us the two medians at the same time.
apply(Insects[2:3],2,median)

The apply function can also give us the quartiles of the two vectors.
apply(Insects[2:3],2,quantile)

We see that the Third Quartile (Q3) of the [Link] variable is
17.1. This means that if an insect of this group has a thorax that measures
17.1, then its thorax length is greater than 75% of the thorax lengths of all
other insects in this group.
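For readers without the Insects file, the same calls can be tried on a small
stand-in. Everything in the sketch below is hypothetical: the data frame
bugs, its column names, and its values are made up for illustration only.

```r
# Hypothetical stand-in for the Insects data frame (made-up values)
bugs <- data.frame(
  Weight  = c(2.1, 2.4, 1.9, 2.8, 2.5, 2.2),
  ThoraxA = c(15.2, 16.1, 14.8, 17.0, 16.4, 15.5),
  ThoraxB = c(16.0, 16.8, 15.1, 17.9, 17.2, 16.3)
)

summary(bugs[2:3])                      # summary numbers for columns 2 and 3
colMeans(bugs[2:3])                     # column means
meds <- apply(bugs[2:3], 2, median)     # column medians
meds
qs <- apply(bugs[2:3], 2, quantile)     # min, Q1, median, Q3, max per column
qs

# sapply() is an alternative that also works column by column
sapply(bugs[2:3], median)
```

In apply(), the 2 tells R to work down the columns; a 1 would work across
the rows instead.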

Contingency and Frequency Tables

head() and tail()


Sometimes a dataset is large and could take up many pages. We might
just want to take a glimpse at the first few observations in the set instead of
calling up the entire dataset. For this, we use the head() function. By
default, it gives the first six observations. We use the head function
on the Insects dataset. Because Insects was imported into our workspace, we
can pass it to head() directly; the data() function is needed only for
datasets that come with a package.
head(Insects)
We can also get the last few observations on the list by calling up the
tail() function.
tail(Insects)
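Both functions accept an n argument when six rows is not what we want. A
short sketch, using a made-up data frame called obs:

```r
# Hypothetical data frame with 20 rows
obs <- data.frame(id = 1:20, value = (1:20)^2)

head(obs)          # first 6 rows (the default)
head(obs, n = 3)   # first 3 rows
tail(obs, n = 4)   # last 4 rows
```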

Frequency Tables
When we have large univariate data sets (datasets with one variable), one
of the frequently used methods to organize and display our data is using
frequency tables. In this method, we group our data into score intervals and
then construct a frequency table. Data collected into frequency intervals are
called grouped data. Grouping data into frequency tables is an important
step in univariate descriptive statistics and a vital method in preparing data
for analysis. We will use a real-life example to motivate and instruct in the
development of a frequency table.
The largest colony of Antarctic emperor penguins can be found on the
Ross Ice Shelf in Antarctica, where, at any given time, at least 80,000 of the
birds can be found lounging on the ice.

Figure 8.1
Dr. Schottenheimer is a biologist who is studying the Ross Island Colony
of the emperor penguins. One of the summary numbers he is trying to
determine is the average weight of the adult penguin, so he has taken a
sample of 45 of the adult birds and has recorded the weights of each one of
them. Given below is a list of the recorded weights to the nearest pound of
45 adult penguins from The Ross Ice Shelf.

To display the data in a more meaningful way, Dr. Schottenheimer would
now group the data into intervals and develop a frequency distribution,
which is an arrangement of the data points that shows the number of data
points that fall into any of several frequency groups or intervals. This is an
especially important step in data analysis. He will, therefore, group several
data points into an interval. As an example, he might place all data points
(weights) that fall between 60 and 69 into one interval, those between 70
and 79 into another interval, etc. These intervals are called class intervals,
or, in computer language, bins. Let us take the interval 60 to 69. The 60 we
call the lower limit of the interval and the 69 the upper limit of the interval.
Before the good doctor completes the table, however, he must decide just
how many bins he wants for his distribution. There is a simple formula that
enables him to do just that.
The number of bins = 1 + 3.3(log n), where log is the base-10 logarithm and
n is the number of data points in the dataset.
The doctor works out the formula using n = 45 because he has 45 penguin
weights.
1 + 3.3(log 45) = 1 + 3.3 (1.65) = 6.45. Rounded = 6.
He will, therefore, use 6 bins or intervals.
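The doctor's arithmetic can be checked in R. This is only a sketch of the
formula as stated above; the variable name bins_exact is mine.

```r
n <- 45

# 1 + 3.3 * log10(n), the rule of thumb used in the text
bins_exact <- 1 + 3.3 * log10(n)
bins_exact           # a little under 6.5
round(bins_exact)    # 6

# Base R's nclass.Sturges() uses ceiling(log2(n) + 1), which rounds up
nclass.Sturges(1:45) # 7
```

The two versions differ only in rounding, so it is not unusual for them to
disagree by one bin. Either is a starting point, not a rule that must be
obeyed.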
His next step is to work out the width of the bins. Once he has determined
how many bins, he can use another little formula to help him find the bin
widths:
w = r / b
where w = bin width
r = range of our data = highest data point – lowest data point
b = number of bins or class intervals to be used (this was just calculated
above).
Plugging values into our formula, we find that w = 9.
So, the bins will be 9 units wide. But where do we start? What are our
lower limits and upper limits? For the lower limit of the first interval, we use
the minimum score value. For the upper limit of interval 1, we simply add
the calculated width to the lower limit. This will give us a lower limit of 43
and an upper limit of 43 + 9 = 52 for interval 1.

Interval 2 will now begin at 53 as the lower limit; we then add the bin width
of 9 to find the upper limit. The upper limit of interval 2, therefore, is 62.
Using the same method of calculation, we find that interval 3 spans from 63
to 72; Interval 4 from 73 to 82; Interval 5 from 83 to 92; and Interval 6 from
93 to 102. We will arrange the data into a table called a frequency
distribution, shown below. This three-column table is a part of the full
seven-column frequency table.
The midpoint is the halfway point of the interval. For example, the
midpoint of interval 1 is 47.5; interval 2 is 57.5, interval 3 is 67.5, interval 4
is 77.5, interval 5 is 87.5 and interval 6 is 97.5.
The next column is the frequency column. Here, we place the number of
data points that fall into the range of each interval. For example, since
Interval 1 has a range of 43 to 52, all scores between 43 and 52, fall in the
Interval 1 range. We see that five of the penguins weigh between 43 and 52
pounds. Therefore, the frequency of Interval 1 is 5. Similarly, we find the
frequency of Interval 2 is 7; the frequency of Interval 3 is 10; the
frequency of Interval 4 is 12; the frequency of Interval 5 is 9, and the
frequency of Interval 6 is 2.
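The sorting of weights into intervals can be handed to R with the cut()
function. The sketch below uses a short made-up vector, weights, rather than
the doctor's 45 birds. The breaks are my own choice: they are placed so that
the whole-number intervals 43-52, 53-62, and so on (whose lower limits are 10
apart) come out as described above.

```r
# Hypothetical weights in pounds (not Dr. Schottenheimer's sample)
weights <- c(43, 48, 52, 53, 60, 62, 63, 70, 72, 75, 80, 82, 85, 92, 97)

# Right-closed intervals (42,52], (52,62], ... hold 43-52, 53-62, ...
breaks <- seq(42, 102, by = 10)
interval <- cut(weights, breaks = breaks,
                labels = c("43-52", "53-62", "63-72",
                           "73-82", "83-92", "93-102"))

freq <- table(interval)    # frequency of each interval
freq

# Midpoint = halfway between the lower and upper limit of each interval
lower <- breaks[-length(breaks)] + 1
upper <- breaks[-1]
midpoints <- (lower + upper) / 2
midpoints                  # 47.5 57.5 67.5 77.5 87.5 97.5
```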

Coding a Frequency Table (One-Way Contingency Table)


First, we will use the table() function to generate a simple frequency table
of the penguin weights, using the data from the PenguinWeights dataset.
Using the tail() function we can get a glimpse at the last few observations in
the dataset.
tail(PenguinWeights)

We know that we have six bins or intervals. The table above of the last
six weights in the dataset is telling us that the weight of penguin #40 which
is 63 pounds, belongs to Category (Interval) 3.
Now we are going to use the table() function to organize the
PenguinWeights data into a simple one-way frequency table which we will
name Penguinfreq.
Penguinfreq<-with(PenguinWeights, table(count))
Penguinfreq
R’s response will be:
This tells us that the “1” category, or the first bin, carries a frequency of
5; the “2” category, a frequency of 7, etc.
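To try table() without the PenguinWeights file, we can rebuild just the
interval codes from the six frequencies quoted earlier (5, 7, 10, 12, 9, and
2). This is a sketch: the individual weights are not reproduced, only the
category each bird falls in.

```r
# Interval codes rebuilt from the frequencies quoted in the text
count <- rep(1:6, times = c(5, 7, 10, 12, 9, 2))

freq <- table(count)
freq                 # one-way frequency table
sum(freq)            # 45 birds in all

prop.table(freq)     # relative frequencies
cumsum(freq)         # cumulative frequencies
```

Relative and cumulative frequencies are two of the extra columns that a
fuller frequency table usually carries.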

Two-Way Contingency Tables


Two-way contingency tables are tables that show the frequencies of the
elements of two variables. To investigate two-way contingency tables, we
import a file on smoking that gives the gender and smoking status of 10
respondents. The dataset smoke is shown below.

Here we have a four-column (variables) table, the first column of which
gives the index number of the respondent. The second column (Gen) gives
the gender of the respondent. The fourth column (smoke.stat) gives the
smoker status of the respondent; and the third column (Smoker) gives the
same information given by the fourth column but codes it as a number (1 =
smoker, 5 = non-smoker).
Now suppose we want a cross-tabulation that gives the number of
smokers and nonsmokers of each gender. We will need to create a simple
two-way contingency table, which we will call Newtable. We can do this
with the xtabs() function, using the Gen and smoke.stat columns.
Newtable<-xtabs(~Gen+smoke.stat, data=smoke)
Newtable
In the xtabs() function, the “~” precedes the first vector (Gen),
followed by a plus sign (+) and then the second vector (smoke.stat). The last
option, data, gives the source file, smoke. R’s two-way table is given below.
The results tell us that there are 2 female non-smokers and 3 female
smokers, while there is 1 male non-smoker and there are 4 male smokers.
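The smoke file itself is not reproduced here, but a stand-in with the same
counts is easy to build. In the sketch below the data frame smoke_demo is
hypothetical; the column names Gen and smoke.stat and the labels F, M, Yes,
and No are assumptions chosen to match the description above.

```r
# Ten respondents rebuilt from the counts reported above
smoke_demo <- data.frame(
  Gen        = rep(c("F", "M"), each = 5),
  smoke.stat = c("No", "No", "Yes", "Yes", "Yes",    # females: 2 No, 3 Yes
                 "No", "Yes", "Yes", "Yes", "Yes")   # males:   1 No, 4 Yes
)

# Formula interface: row variable + column variable
tab <- xtabs(~ Gen + smoke.stat, data = smoke_demo)
tab

# table() gives the same counts when handed the two columns directly
table(smoke_demo$Gen, smoke_demo$smoke.stat)
```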

Margin Totals

We can add margin totals to our two-way table by using the addmargins()
function.
addmargins(Newtable)

The table() and xtabs() functions ignore missing (NA) values by default.
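Two small sketches follow. The first adds margins to a table typed in by
hand from the counts above; the second shows how table() treats missing
values and how the useNA option brings them back into view. The vector
status is made up.

```r
# The 2 x 2 counts from the text, entered column by column
tab <- as.table(matrix(c(2, 1, 3, 4), nrow = 2,
                       dimnames = list(Gen = c("F", "M"),
                                       smoke.stat = c("No", "Yes"))))
with_totals <- addmargins(tab)
with_totals                       # adds a Sum row and a Sum column

# Hypothetical answers, two of them missing
status <- c("Yes", "No", NA, "Yes", "Yes", NA, "No")
table(status)                     # the two NAs are silently dropped
table(status, useNA = "ifany")    # the NAs get a column of their own
```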

Three-Way Contingency Tables

We can easily generate three-way contingency tables by using the same
xtabs() function.
We have imported a dataset called smoke2 which we will call up now.
smoke2
This file contains an additional column called AgeGroup, which carries
the codes:
1 = 25 and under
2 = 26 to 45
3 = over 45
We wish to create a three-way contingency table, organizing our data in a
table under the three variables Gen, smoke.stat, and AgeGroup.
We do this by placing the formula ~Gen+smoke.stat+AgeGroup in the
xtabs() function. We will call our three-way table table2.
table2<-xtabs(~Gen+smoke.stat+AgeGroup, data=smoke2)
table2
After highlighting the code lines and clicking our RUN button, we get:
We can now create a single frequency table out of the three sections with
the ftable() function.
ftable(table2)

Here we can see that in AgeGroup 1, there are two female non-smokers
and one female smoker, while there are no male non-smokers and one male
smoker. Now, let us add margin totals to the values.
ftable(addmargins(table2))
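The same steps can be tried on a small stand-in for smoke2. The data frame
demo3 below is hypothetical; only its AgeGroup 1 rows are arranged to match
the counts just described, and the rest is made up.

```r
# Hypothetical respondents (not the smoke2 file)
demo3 <- data.frame(
  Gen        = c("F", "F", "F", "M", "F", "M", "M", "F", "M", "M"),
  smoke.stat = c("No", "No", "Yes", "Yes", "Yes",
                 "No", "Yes", "No", "Yes", "Yes"),
  AgeGroup   = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
)

t3 <- xtabs(~ Gen + smoke.stat + AgeGroup, data = demo3)
dim(t3)                  # 2 genders x 2 statuses x 3 age groups

ftable(t3)               # the three layers flattened into one table
t3["F", "No", "1"]       # female non-smokers in AgeGroup 1
margin.table(t3, 3)      # respondents per age group
```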
Tests for Association

Two Non-Numeric Variables: Pearson’s Chi-Square Test

Suppose we wanted an answer to the question: Does gender play a part in
whether a person smokes or not? In statistics, we call that an association.
The question becomes: Are the variables Gender and smoker status
(smoke.stat) associated, or are they independent of each other? Both
variables, gender and smoker status, are non-numerical. The answer to the
question of the person’s gender could be either M or F, which are
non-numerical answers, and Smoker Status = Yes or No, also non-numerical.
Therefore, to determine whether there is an association, and acknowledging
the fact that the variables are both non-numerical, we can use a test called
the Chi-Squared Test for Independence.
To do this test, we set up two hypotheses. The first one is the research or
alternative hypothesis, which states what we want to determine: There is
some association between the two variables. We are looking for evidence in
favor of the research hypothesis. However, the tools of statistics are better
at showing that a claim is inconsistent with the data than they are at
proving a claim true. With this in mind, we set up a second hypothesis called
the Null Hypothesis, which states the opposite of the research hypothesis:
There is no association between the two variables, or, the variables are
independent of each other. If the data lead us to reject the null hypothesis,
we take that as evidence supporting the research hypothesis.
This test can be applied on a two-way table by using the function
chisq.test().
First, we will call up Newtable, which is the two-way contingency table
we made of the variables Gen and smoke.stat.

Newtable

Now we run the chi-square test for independence on the table data. R is
case-sensitive, so the table name must be typed exactly as it was created.
chisq.test(Newtable)
to which we will get:

The important figure is the p-value. A large p-value, one that is greater
than 0.05, means the data are consistent with the null hypothesis. The
p-value = 0.7418 means we cannot reject the null hypothesis. In other words,
these data give no evidence of an association between the variables; we have
no grounds to say that gender is related to smoker status.
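The pieces of a chi-square result can be pulled out by name once the result
is stored. The sketch below uses a hypothetical table of 120 respondents
(the counts are made up and are larger than our ten-person example) so that
the test has enough data to work with.

```r
# Hypothetical counts, entered column by column
counts <- matrix(c(30, 20, 25, 45), nrow = 2,
                 dimnames = list(Gen = c("F", "M"),
                                 Status = c("No", "Yes")))

result <- chisq.test(counts)
result$statistic    # the chi-square value
result$p.value      # compare this with 0.05
result$expected     # counts we would expect if the variables were independent
```

With only ten respondents, as in our smoke file, the expected counts fall
below 5 and R warns that the chi-squared approximation may be incorrect. In
such small tables, fisher.test() is the usual alternative.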

Two Numeric Variables: Correlation

In the previous section, we tested for independence or association between
two variables, both of which were non-numerical. Now suppose the two
variables for which you want to establish association are both numerical
variables. How do you test for an association between them? Let us
explore this question with an example. We are climbing up Mt. Everest and
recording the temperature at every 1000 meters we climb.
We go up to 6000 meters and come back down. When we arrive at base
camp, we have data that looks something like the table below.

Now while the temperature at the different altitudes is not realistic, the
table will serve our purpose well enough.
We can now ask the question: Are the variables, Altitude, and
Temperature, associated? We can see that as one variable increases the other
decreases, but are they doing it in such a way that if we are given the value
of one, we could predict the other with some accuracy? We could do this
only if the variables are sufficiently correlated.
Should we draw a scatter plot of the data, we would observe one of the
following:
In the first graph, the points are increasing together, and the variables are
said to be positively correlated. In the second graph, the points are moving
in opposite directions – one is increasing while the other is decreasing. Here
the variables are said to be negatively correlated. In the third graph, there is
no pattern and so the variables show little or no correlation.
Correlation is quantified by the correlation coefficient, whose value
ranges from -1 to +1. A correlation coefficient of +1 means the points are in
perfect positive correlation – the points are in a perfectly straight line with
an upward slope. A coefficient of -1 means the points are in a perfectly
straight line with a downward slope – a perfect negative correlation. A
coefficient of zero means that the points show no straight-line pattern, and
so there is no linear correlation between the variables. The closer to +1 or
-1 the
coefficient is, the stronger the association between the variables – either
negatively or positively. Thus, the coefficient of correlation is a measure of
the strength of association between numerical variables.
In plotting our graphs, you might have noticed that we placed Altitude on
the horizontal (X) axis and Temperature on the vertical (Y) axis. Could we
have placed Temperature on the horizontal axis instead? Well generally, you
want to place the independent variable on the X-axis and the dependent
variable on the Y. How do we know which is the independent and which is
the dependent? In our Mt Everest case, we were able to select the altitudes at
which to take our readings of temperature. However once the altitude was
selected, the temperature at that altitude is fixed – we don’t get to select the
temperature at that altitude. The temperature depends on the altitude we
select. As such, we say that Temperature is the dependent (Y) variable and
Altitude is the independent (X) variable. This distinction matters for how
we draw the plot and for prediction, but not for the correlation coefficient
itself: had we placed Temperature on the X-axis, the correlation would be
exactly the same. So how do we find the correlation coefficient with RStudio?
We have created an Excel spreadsheet with our Mt. Everest data and have
imported it into RStudio, where we have named it MtEverest1. We attach the
MtEverest1 file and then simply use the cor(x,y) function. The cor(x,y)
function takes the two variables to be tested, the independent (X) and the
dependent(Y) as its arguments. We code thus,
attach(MtEverest1)
cor(Altitude,Temp)
R will return the correlation coefficient:
[1] -0.94612
This tells us that the two variables, Altitude, and Temperature have a very
strong negative correlation or association. This means that as one variable
increases, the other decreases in such a way that, given the value of the
independent variable, we could predict the corresponding value of the
dependent variable.
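For readers without the MtEverest1 file, here is a sketch with made-up
readings. The numbers are hypothetical and will not reproduce the -0.94612
above; they only show the calls, and that the order of the two arguments
does not change the coefficient.

```r
# Hypothetical readings (not the MtEverest1 file)
Altitude <- c(0, 1000, 2000, 3000, 4000, 5000, 6000)   # meters
Temp     <- c(15, 9, 4, -4, -9, -17, -22)              # degrees

r <- cor(Altitude, Temp)
r                       # close to -1: strong negative correlation

cor(Temp, Altitude)     # same value with the arguments swapped

plot(Altitude, Temp)    # independent variable on X, dependent on Y
```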
Suppose we did a correlation study for two variables, IQ and the number
of beers drunk per day (Beers), and got a coefficient of 0.45346. This
number does not indicate a strong correlation, but is it significant enough to
deem the two variables correlated? To conduct such a test of the significance
of the correlation coefficient, we use the cor.test(x, y) function.

cor.test(IQ, Beers)
We will get in response,

We have a 95% confidence interval, so our level of significance is the
remaining 5% (0.05). This is the number against which we compare the
p-value. If the p-value is smaller than 0.05, then we will deem our
correlation coefficient sufficiently significant. The p-value = 0.03795 is
smaller than 0.05. Therefore, our coefficient of 0.45346 might not be a
strong correlation, but the test tells us that it is sufficiently significant
to deem the variables correlated.
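When the result of the test is stored in a variable, its parts can be read
off by name. The sketch below uses 20 simulated pairs (made-up data with a
fixed seed, not a real study of IQ and beer), so the numbers it prints are
of no interest in themselves.

```r
set.seed(42)                 # hypothetical data, repeatable run
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)

res <- cor.test(x, y)
res$estimate                 # the sample correlation coefficient
res$p.value                  # compare with 0.05
res$conf.int                 # 95% confidence interval for the correlation

res$p.value < 0.05           # TRUE means "significant at the 5% level"
```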

Tests of Significance

t-tests: One Sample


Suppose we want to test the hypothesis, with the Insects data, that the
mean (average) thorax length for group 2, the [Link].1 group, is less
than 16.5. We would use the t.test() function, in which we would specify the
hypothesized mean (μ (mu) = 16.5) and the direction of the test (in this case,
we are testing that the mean is less than 16.5). This is the contention of the
research (alternative) hypothesis. The null hypothesis would, of course, state
the opposite of the alternative hypothesis, which, in this case, would be that
the mean is not less than 16.5. We would write the following code.
t.test([Link].1, mu=16.5, alternative="less")

With a p-value of 0.7143, larger than our alpha = 0.05 (5% significance
level), we cannot reject the null hypothesis that the mean is not less
than 16.5.
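A one-sample test of this kind can be tried on any numerical vector. In the
sketch below, thorax is a made-up set of eight lengths whose average sits
above 16.5, so we should expect a large p-value for the "less" alternative.

```r
# Hypothetical thorax lengths (not the Insects file)
thorax <- c(16.9, 17.2, 16.4, 16.8, 17.5, 16.1, 17.0, 16.7)
mean(thorax)                # 16.825, above the hypothesized 16.5

res <- t.test(thorax, mu = 16.5, alternative = "less")
res$statistic               # positive t, because the mean is above mu
res$p.value                 # large, so we cannot reject the null
```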

t-tests: Two Independent Samples

The function t.test() is used to test hypotheses on two independent
numerical samples. It accepts either two vectors, t.test(x, y), or a formula
with a grouping variable, t.test(x~y, data).
We will use our Insects file to test the hypothesis that Group 2
([Link].1) has a greater average thorax length than Group 1
([Link]).
We add the option alternative="less" or alternative="greater" to specify
the direction of our alternative hypothesis. If no direction is given, then the
default is a two-tailed test. Here we are testing the alternative that says that
Group 2 ([Link].1) has a greater average length than Group 1
([Link]). As such we will use alternative = "greater". Because the
two groups are stored as separate columns, we pass them as two vectors. The
data option applies only to the formula form, so we wrap the call in with()
to tell R where the columns live.
with(Insects, t.test([Link].1, [Link], alternative="greater"))

The p-value (0.0321) is smaller than 0.05, which means that we reject the
null hypothesis; the evidence supports the alternative hypothesis that the
average thorax length for Group 2 is greater than that of Group 1.
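The two ways of calling t.test() can be compared on made-up data. In the
sketch below, group1, group2, and the data frame long are hypothetical. Note
that in the formula form the direction refers to the first group level minus
the second, so "greater" for group2 over group1 becomes "less" for g1 minus
g2.

```r
# Hypothetical lengths for two independent groups
group1 <- c(15.1, 15.8, 16.0, 15.4, 16.2, 15.6, 15.9, 15.3)
group2 <- c(16.4, 16.9, 17.1, 16.2, 17.4, 16.8, 16.6, 17.0)

# Two-vector form: is the mean of group2 greater than the mean of group1?
res_wide <- t.test(group2, group1, alternative = "greater")
res_wide$p.value

# Formula form: measurements in one column, group labels in another
long <- data.frame(len   = c(group1, group2),
                   group = rep(c("g1", "g2"), each = 8))
res_long <- t.test(len ~ group, data = long, alternative = "less")
res_long$p.value            # same p-value as the two-vector form
```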

t-tests: Two Dependent Samples

We have imported a file called Windspeed2 which gives the wind speed
in several capital cities taken at two different times of the year: January
([Link].1) and August ([Link].2). In this situation, we have two
groups of measurements, but they both come from the same city – just that
they are taken at different times. This means that the groups are dependent,
and we must use a paired t-test to test hypotheses on these groups. We want
to determine by a paired t-test, whether the average wind speed for time 1
([Link].1) is less than that of time 2 ([Link].2). To do this, we use
the option paired=TRUE in the [Link]() function.
We check the Global Environment window (upper right) to see if the file
Windspeed2 is in our workspace. If it is not, we must import it first. We
then “attach” it, which lets us refer to its columns by name and brings it
up to our desk, so to speak.

attach(Windspeed2)
t.test([Link].1, [Link].2, paired=TRUE, alternative="less")

R will return the following:

The exceedingly small p-value (9.31 × 10^-6) tells us that we must reject
the null hypothesis that the speeds are equal, and the evidence supports the
alternative hypothesis that the wind speed at Time 1 is less than the wind
speed at Time 2.
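A paired t-test is the same thing as a one-sample t-test on the differences
between the pairs. The sketch below shows this with made-up wind speeds for
eight cities; the vectors jan and aug are hypothetical.

```r
# Hypothetical wind speeds for the same eight cities at two times
jan <- c(8.2, 10.1, 7.5, 9.0, 11.3, 6.8, 9.7, 8.9)
aug <- c(9.6, 11.0, 9.1, 10.2, 12.5, 8.0, 10.4, 10.3)

# Paired test: is the January speed lower than the August speed?
res <- t.test(jan, aug, paired = TRUE, alternative = "less")
res$p.value

# The same test written as a one-sample test on the differences
res_diff <- t.test(jan - aug, mu = 0, alternative = "less")
res_diff$p.value            # identical to the paired result
```

The order of the values matters here: the first entry of jan and the first
entry of aug must belong to the same city, and so on down the list.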

Nonparametric Tests

Remember the normal distribution that many kinds of continuous data
approximately follow? If your data do not follow a normal
distribution, then they do not meet the parametric assumptions of the t-tests.
In cases like this, you can use nonparametric testing methods to test
hypotheses based on your data. We will now visit some of these
nonparametric methods.
Testing Two Groups
If you are testing two independent groups, you can use the Mann-Whitney
test, which R calls the Wilcoxon rank-sum test. The wilcox.test() function
will analyze the data and test the hypotheses. The form of this function is
wilcox.test(variable 1, variable 2). To do a Wilcoxon test on the variables
[Link].1 and [Link].2 we will code
wilcox.test([Link].1, [Link].2, alternative="less")

Again, we look at the p-value to determine the result of the test.


If the variables are paired (not independent), we use the same function
but add the paired=TRUE option.
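Both versions of the test are sketched below on made-up numbers. The vectors
a and b are hypothetical; the paired call is appropriate only when the two
vectors really are matched pairs, which we simply assume here for the sake
of the illustration.

```r
# Hypothetical measurements
a <- c(8.2, 10.1, 7.5, 9.0, 11.3, 6.8, 9.7, 8.9)
b <- c(12.6, 13.0, 11.8, 12.2, 14.6, 11.6, 12.4, 13.4)

# Independent groups: Wilcoxon rank-sum (Mann-Whitney) test
res_ind <- wilcox.test(a, b, alternative = "less")
res_ind$statistic           # W = 0, since every a is below every b
res_ind$p.value

# Matched pairs: Wilcoxon signed-rank test
res_pair <- wilcox.test(a, b, paired = TRUE, alternative = "less")
res_pair$p.value
```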

Testing More than Two Groups

If we have more than two groups, and the groups are independent, we can
apply the Kruskal-Wallis test with the function kruskal.test(). The data
must first be arranged with the measurements in one column and the group
labels in another. The test is then applied with the function
kruskal.test(dependent variable~grouping variable, data)
Example:
We have recorded the daily rainfall in Trinidad and Tobago for each day
in October, November, and December. We want to determine whether the
distributions of rainfall for October, November, and December are identical.
We use a non-parametric test because we will not be assuming that the
distributions are normal. We simply want to know if they are identical
without assuming normal parameters. We will, therefore, use the
Kruskal-Wallis Test (sometimes called a nonparametric ANOVA), by using the
kruskal.test() function.
Below is a part of the dataset of rainfall for October, November, and
December in a file called rainfall.

We now use the kruskal.test() function. If the column is named Rainfall(in),
with parentheses, the name must be wrapped in backticks inside the formula.

kruskal.test(`Rainfall(in)`~Month, data=rainfall)

With such a small p-value, we reject the null hypothesis and conclude
that the distributions of rainfall for October, November, and December are
not identical.
A drawback is that the test does not tell you just how they differ from
each other. This question can be answered by using a Mann-Whitney U test
or doing multiple comparison tests like the Scheffe or the Tukey. A package
called npmc can provide these multiple comparison tests.
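As a sketch of the whole sequence, the code below runs a Kruskal-Wallis test
on made-up rainfall figures and then follows it with pairwise Mann-Whitney
(Wilcoxon) comparisons, using pairwise.wilcox.test() from base R with the
Holm adjustment. The data frame rain_demo and its values are hypothetical,
and the choice of Holm is mine; it is one of several adjustments R offers.

```r
# Hypothetical daily rainfall, five days per month (not the rainfall file)
rain_demo <- data.frame(
  Rain  = c(0.1, 0.4, 0.2, 0.6, 0.3,
            1.2, 1.5, 0.9, 1.8, 1.1,
            2.4, 2.9, 2.1, 3.3, 2.6),
  Month = rep(c("Oct", "Nov", "Dec"), each = 5)
)

# Overall test: are the three distributions identical?
kw <- kruskal.test(Rain ~ Month, data = rain_demo)
kw$statistic
kw$p.value

# Follow-up: which pairs of months differ?
pw <- pairwise.wilcox.test(rain_demo$Rain, rain_demo$Month,
                           p.adjust.method = "holm")
pw$p.value                  # one adjusted p-value per pair of months
```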
We have seen some of the features of R with RStudio that we used for
doing basic descriptive and inferential statistics. Armed with this
knowledge, you could do a lot of statistical analysis. But this knowledge is
only the basics of both R and RStudio. They can do a lot more, and, I hope
that your appetite has been stimulated enough to make you want to
investigate further into statistics and R/RStudio.
Note at the End of the Book
You have made it to the end of the book, and I applaud you and thank
you for sticking with me. If you enjoyed the learning experience, won’t
you please take a minute to leave a review at your favorite retailer?
Many thanks,
RCH

About the author:
Ramon Hernandez is a mathematician and computer scientist and has
worked in the biomedical field and as a lecturer and mathematics
department head. He is also a biostatistical software package developer.
He has spent a career sharing his knowledge with students from all walks
of life and making statistics and statistical computing accessible to all. As
a teacher/lecturer, his clear explanation of complex mathematical and
statistical concepts and his ability to bring those difficult concepts to a
level that all could understand has been widely acclaimed. His online
courses in statistics and mathematics have been well received and sought
after.
This book is one of those that come from his collection of lecture
notes and could take you, even if you are a raw beginner, and show you
statistics and statistical computing with R and RStudio in such a way that
you will be doing your own projects almost like an expert.

Other Titles by This Author:

R with RStudio for Multivariate Exploratory Analysis


Excel Labs for Introductory Statistics
Excel Labs for Introductory Inferential Statistics
Excel Labs for Statistical Quality Control
