R With RStudio For Introductory Statistics
Copyright © 2020 Ramon C. Hernandez
Published by Ramon C. Hernandez and Gold Mountain Publishing
Chapter 1: Introducing R
What is R?
R is free, open-source software and one of the most popular platforms
for data analysis and visualization available today.
GUIs
Several Graphical User Interface (GUI) applications, such as R
Commander and RStudio, are available. They offer the power of R
through menus, graphical icons, and dialogs, and run on a wide variety of
platforms, including Windows, Unix, Linux, and Mac OS X. These GUIs are
point-and-click interfaces that sit on top of R and serve to simplify its use
by reducing the number of lines of code the user needs to write and, for
some operations, eliminating the writing of code altogether. In this course,
we will cover RStudio in a later chapter, since it is best practice to first
learn base R, which we will do in the early chapters.
Using R
In this opening chapter, we will give a brief overview of some of R’s
functionality, and in the following chapter, we will get into the
hands-on details.
R is case-sensitive and interpreted. So be careful with the cases of your
letters. If you saved a file as “myFile” then later you call up the file with the
command “MyFile”, R will advise you that there is no such file.
Commands are entered one at a time at the command prompt (>). R uses
data structures that include vectors, matrices, data frames, and lists.
R stores and manipulates the data we create as objects, as it is an
object-oriented language. It gets much of its functionality through
built-in and user-created functions, and all objects are kept in memory
during an interactive session. Other functions are contained in packages that
can be attached to the current session and then detached when they are no
longer needed.
On opening base R, you will see an interface like the one shown below.
Figure 1.1
The red “greater than” symbol > that you see at the end of the blue
writing is called the prompt. We enter our data at the prompt.
A variable is used to store the results of computations. In other words,
after we make a computation, if we give a name to the results, then that
name is a variable. For example
>x <- sqrt(64)
Here we are computing the square root of 64 by placing “64” between the
parentheses of the square root function, sqrt(),
sqrt(64)
and then, by writing “x” and putting an arrow (a less-than symbol “<”
followed by a dash “-”) pointing from “sqrt(64)” to “x” as shown below,
x <- sqrt(64)
we are assigning the name x to the result of the square root, which is 8.
Therefore, R will store in memory a variable (object) named “x” that has a
value of 8. If we now call up the variable, x, (by writing “x” at the prompt,
and then pressing ENTER), R will return the value “8”, as shown below.
(Our input is placed after the prompt and R’s output is what comes after
the square brackets.)
>x
[1] 8 #This is R’s response
*Note: Anything written after a pound sign (#) on a code line is a comment
and is not part of the code. The line above is an example.
To assign the value of 15 to a variable y, we key
> y <- 15
We act upon objects in R’s memory, like the variables “x” and “y”
above, by way of operators and functions. Operators are symbols that call
for some action or operation to be performed on pieces of data. The symbols
“+”, “-”, “*”, and “/” are operators that perform addition, subtraction,
multiplication, and division on numerical data. Operators can be
arithmetical (like those shown above), logical (like “&” for AND and “|”
for OR, which work on TRUE and FALSE values), or comparative (like greater
than >, less than <, etc.). By keying in
>2*y + 3*x
we are performing operations on the numerical variables x and y, to
which we have already assigned values. We are now multiplying the variable
y by 2 and then adding the result to three times the variable x. Upon pressing
ENTER, R will respond with the answer 54.
[1] 54
We will see much more about the use of operators later.
Functions
This plot() function plots Depth against Pressure with the argument
“Depth ~ Pressure”, which places Depth on the y-axis and Pressure on the
x-axis. The “data=DiveData” tells R that the variables Depth and Pressure
can both be found in a data frame called DiveData. We have also included
another argument, pch=16. The pch=16 code tells R to use a solid circle as
the plotting symbol. When we hit ENTER, R will execute the plot.
R’s graph will look something like this:
Figure 1.2
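The call being described can be sketched as follows. This is only a minimal illustration: the DiveData values and their units are made up, since the original data are not shown in this section, and the plot is sent to a temporary PDF file so that the sketch also runs without a screen.

```r
# Hypothetical data: the real DiveData values are not given in the text.
DiveData <- data.frame(
  Pressure = c(1, 2, 3, 4, 5),      # assumed units: atmospheres
  Depth    = c(0, 10, 20, 30, 40)   # assumed units: meters
)

# Send the plot to a temporary PDF file.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# Depth ~ Pressure puts Depth on the y-axis and Pressure on the x-axis;
# pch = 16 draws each point as a solid circle.
plot(Depth ~ Pressure, data = DiveData, pch = 16)

invisible(dev.off())   # close the PDF file
```

At the prompt, only the plot() line is needed once DiveData exists in memory.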
The list function, ls(), will output a list of all objects in memory. This
function does not usually need an argument. To remove an object, y, from
memory, use the remove function rm() and put “y” as the argument
>rm(y)
Upon pressing ENTER, the object y will be removed from memory.
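As a small sketch of listing and removing objects, the code below works in a separate, throwaway environment (the name demo_env is our own choice) so that it does not touch objects in your own workspace. At the prompt you would simply key ls() and rm(y).

```r
# Work in a fresh environment so the sketch leaves your workspace alone.
demo_env <- new.env()
assign("x", sqrt(64), envir = demo_env)
assign("y", 15, envir = demo_env)

before <- ls(demo_env)       # names of all objects: "x" and "y"
rm("y", envir = demo_env)    # remove the object y
after <- ls(demo_env)        # only "x" remains

print(before)
print(after)
```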
The R Workspace
The workspace is the current R environment in which you are working,
and it includes any object you defined or to which you assigned values (e.g.
variables). At the end of an R session, you can save an image of the current
workspace, and that image is automatically reloaded the next time you start
R. You can use the up and down arrows to scroll through the commands you
keyed during your session. Here you can select an old command, edit it as
you wish and then resubmit it with the Enter key.
To display a given number of your last commands, say, your last four
commands, you can use the history() function and give it “4” as an
argument.
>history (4)
To redirect text output to a file, use the sink() function.
>sink("C:/Users/Desktop/[Link]")
This will send the output to the file [Link] only, and you will not
see your output on the screen. By including the option split=TRUE,
your output will be sent to the specified file as well as to the screen.
By including the option append=TRUE,
the output of your current session will be appended (added) to the file and
the contents of the file will not be overwritten.
The sink function will work on text output but not on graphic output. To
direct graphic output to a specific file, say, [Link], and text output to
myfile.Rdata, you would use
There are several different functions for saving the graphic output in base
R. Some of them are listed in the table below.
Function            Type of File Saved To
pdf("[Link]")       A PDF file
jpeg("[Link]")      A JPEG file
bmp("[Link]")       A bitmap file
(R is case-sensitive, so these function names must be keyed in lowercase.)
To “unsink”, that is, to stop text output from being redirected to the sink
file, you would use
>sink()
with no arguments.
To stop graphic output from being sent to a graph file, you would use
>dev.off()
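The whole redirect-and-restore cycle can be sketched as below. Temporary files stand in for the file names used in the text, and the options split=TRUE (echo to the screen as well) and append=TRUE (add to the file rather than overwrite it) are standard arguments of sink(). Treat this as an illustration under those assumptions, not as the exact code of the original example.

```r
# Temporary files stand in for the file names used in the text.
text_file  <- tempfile(fileext = ".txt")
graph_file <- tempfile(fileext = ".pdf")

sink(text_file, append = TRUE, split = TRUE)  # text output: file and screen
pdf(graph_file)                               # graphic output: PDF file

print(summary(c(1, 3, 6, 8, 11)))             # goes to text_file (and screen)
plot(c(1, 3, 6, 8, 11))                       # goes to graph_file

sink()                                        # stop redirecting text output
invisible(dev.off())                          # close the graphics file
```

After the last two lines, text output appears only on the screen again and new graphs are no longer written to graph_file.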
Directories
The current working directory is where R will look for the files with which
to work and where it will save files by default. To find the directory
that R is currently using as the working directory, use
>getwd()
To get R to use a different directory, say, directory1, use
>setwd("directory1")
To access saved files you can use the load() function to load Rdata files.
>load(“d:/[Link]”)
Alternatively, you can save files by clicking the File Menu and then
clicking save workspace. A dialog box will appear. Now you can browse to
the folder in which you want to save the file and give the file a name of your
choice and click Save.
You can also access a saved file by clicking the File Menu then clicking
load workspace. A dialog box will appear. Now you can browse to the folder
in which you saved the .Rdata file and click Open.
You can save commands made in your R session through the File Menu
by clicking file and then save history. A dialog box will appear. Browse to
the folder in which you want to save the file, name the file and then click
Save.
Packages
A “package” in R is a set of functions bundled together to perform a
certain group of operations. Packages are stored in “libraries”, which are
analogous to specific cabinet drawers in the filing cabinet that is your
computer. When you downloaded R for the first time, the package “base”,
which contains all the basic functions, was installed automatically. There
exist, however, many different packages with different functionalities that
are developed by the programming community and placed on the CRAN site,
where they are made available for download and use by general R users. To
get a list of the standard packages loaded in your current session, type
>search()
Below is a partial list of some base packages (there are hundreds of them)
and a brief description of their functions.
Installation of Packages
To install a new package for the first time, you would use the command
>install.packages()
This will bring up a list of CRAN mirror sites. You select a site and then
you will see a list of all packages on that site. You select a package and it
will be downloaded and installed on your computer. If you know the name
of the specific package you want, say the package BRUD, you could use
>install.packages("BRUD")
Loading a Package
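An installed package is normally attached to the current session with the library() function. The sketch below uses the package stats only because it ships with R, so nothing has to be downloaded; the variable name pkg is our own.

```r
# "stats" is used only because it ships with R, so nothing is downloaded.
pkg <- "stats"

# Install only if the package is missing (this step would need internet).
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Attach the package so that its functions can be called by name.
library(pkg, character.only = TRUE)

# The attached package now appears on the search path.
print(search())
```

At the prompt, with a known package name, the shorter form library(stats) does the same thing.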
Getting Help
If we type
>help.start()
R will open for us a window with much information on syntax, packages,
and functions.
Ending an R Session
To end your session, simply use
>q ()
This is the quit function used with no parameters. If you have not saved
your workspace using the methods discussed above, you will be prompted to
do so after entering the quit function.
Exercises 1
Open R.
Inquire of R about your current directory.
Redirect session output to a file called [Link] (pay attention to the
case), while having output also show on your console.
Redirect graphical output to a file called [Link].
Install the package AER (a package with functions, examples, datasets,
and demos). Load the package.
Get a list of the functions and datasets available in the AER package.
Get the details on the dataset CollegeDistance.
Output the dataset CollegeDistance.
Run the example in CollegeDistance (simply key:
example(CollegeDistance))
List the objects in your workspace.
List the last three commands you entered.
Run the example in CollegeDistance again. This time the graph should
show on your screen – as it is not being redirected to any file.
Quit.
Go into the current directory and check the file [Link]. You should
find the graph of the CollegeDistance dataset in this file.
Close R.
Chapter 2: Data Structures
Forms of Data
We now zoom in a little closer for a more detailed look at the workings of
R. In R, data is stored as objects. All objects have two basic attributes: the
mode of the object and the length of the object. The mode of the object is the
type of data that the object holds, and the length is the number of elements
it contains. There are four basic mode types: numeric, character, complex,
and logical (TRUE or FALSE).
Let us create an object we will call A. We are adding 3 + 4, and then calling
the answer “A”.
A<-3 + 4
After hitting ENTER, R now has in its memory an object called A, whose
value is 7. To verify that R has created the object A, we simply key the name
of the object, hit ENTER
>A
and R will return the value of A.
[1] 7
To find out the mode of the object named A, we key
>mode (A)
If A is numeric, as we know it is, R will return
[1] “numeric”
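To see both attributes side by side, here is a short sketch. The objects B and C are our own additions for illustration; only A comes from the text above.

```r
A <- 3 + 4                       # a single number
B <- c("red", "green", "blue")   # three character strings (our own example)
C <- A > 5                       # the result of a comparison (our own example)

print(mode(A)); print(length(A))   # "numeric" and 1
print(mode(B)); print(length(B))   # "character" and 3
print(mode(C)); print(length(C))   # "logical" and 1
```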
Naming Objects
We have already seen the naming of an object by our work with object A
above, but let us take a closer look at object naming. When naming an object
(with the <- assignment), we can use letters (A – Z or a – z), digits (0 to 9),
dots, and underscores. R discriminates between lowercase and uppercase
letters in the names of objects so that “A” is not the same as “a”.
To find the mean and standard deviation of the vector Weight, we write
>mean (Weight)
[1] 101.2
>sd(Weight)
[1] 30.19437
When we write
>plot (Length, Weight)
we obtain a scatterplot of Length on the x-axis and Weight on the y-axis
as shown below.
Figure 2.1
This plot is basic and somewhat unattractive. Later you will learn to
create attractive, custom graphs to suit your needs.
If a variable, x, has the value of 15, then until you change it, x has the
value of 15. It can be used in subsequent mathematical calculations. For
example,
x<-15
>x/3
[1] 5
>x^2 #x raised to the second power
[1] 225
Data Structures in R
Objects in R come in many forms and structures. When R works with
data, R first notes the structure of the data you present to it.
Working with data in R means, therefore, that we must first choose the
appropriate data structure to hold the data. R has many different data
structures for holding data. For our purposes, the most important of these are
numbers, vectors, matrices, data frames, lists, factors, and strings. We will
now look at each data structure individually.
Numbers
Numbers in R are usually dealt with the same as they are in ordinary
mathematics. One of the main differences is how R writes very large and
very small numbers. When we write “aeb” in R, where a and b are
numbers, we mean a × 10^b. Now let a = 6.3 and b = 13. Thus, aeb means
6.3e+13, which means 6.3 × 10^13.
In reverse, when we enter in R
>exp(40)
we get back
[1] 2.353853e+17
which means 2.353853 × 10^17.
Additionally, infinity is denoted “Inf” in R. Strictly speaking, division
by zero is undefined in mathematics, but R returns infinity when a positive
number is divided by zero, since the quotient grows without bound as the
divisor approaches zero. So, if we type in
>3/0
R will answer:
[1] Inf
Now, if we type in
>0/0
we are trying to divide zero by zero, another mathematical anomaly. R
will tell us
[1] NaN
This means “Not a Number”.
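These special values can be tested for directly. The sketch below uses the base R functions is.finite() and is.nan(); the variable names are our own.

```r
a <- 3 / 0     # positive number divided by zero: Inf
b <- -3 / 0    # negative number divided by zero: -Inf
d <- 0 / 0     # zero divided by zero: NaN

print(c(a, b, d))
print(is.finite(a))   # FALSE, since Inf is not a finite number
print(is.nan(d))      # TRUE

# Scientific notation: 6.3e13 is the same number as 6.3 times 10^13.
print(isTRUE(all.equal(6.3e13, 6.3 * 10^13)))   # TRUE
```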
Vectors
Vectors are the simplest data structure and they consist of a one-
dimensional array of data of any type: numeric, character, or logical. But all
data in a single vector must be of the same type. We can represent a vector
by a column of elements. Below, we see a vector represented by a column
with its elements all the same color to denote the fact that all elements are of
the same type.
This denotes a vector of five elements with the position of each element
represented as a box in which we place the actual value of the element at
that position. The value of the element at the fifth position (fifth box) is 11.
The entry (with square brackets)
>x[4]
will get an output of the fourth element of the vector x, which is the value
8. Thus, if we hit the Enter Key, R will return
[1] 8
The entry with the colon
>x [3:5]
tells R to generate a sequence of the third through the fifth elements of
the vector x.
R will therefore output:
[1] 6 8 11
If we enter
>x[2] +x[4]
R will tell us
[1] 11
The following line changes the third element of x from 6 to 36
>x[3] <- 36
>x
[1] 1 3 36 8 11
Using the assignment method, we can also create the original vector x by
creating an empty vector x and then adding values one at a time.
>x <- c() # creating the empty vector
>x[1] <- 1 #assigning the value 1 as the first element
>x[2] <- 3 # assigning 3 as the second element
>x[3] <- 6
>x[4] <- 8
>x[5] <- 11
# now we look at the vector we have created by assignment
>x
[1] 1 3 6 8 11
The function c() can also be used to append (add) elements to an existing
vector. Let us create a vector x with the elements 1 3 5 7.
>x <- c(1, 3, 5, 7)
Now enter the code:
>x <- c(x, 9)
This adds to the vector x, the element 9. Now call up x
>x
[1] 1 3 5 7 9
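The indexing operations from this section can be collected into one short sketch. It starts again from the five-element vector 1 3 6 8 11 used earlier; the final appended value, 20, is our own choice for illustration.

```r
x <- c(1, 3, 6, 8, 11)

print(x[4])          # the fourth element: 8
print(x[3:5])        # the third through fifth elements: 6 8 11
print(x[2] + x[4])   # 3 + 8 = 11

x[3] <- 36           # replace the third element
x <- c(x, 20)        # append a new element at the end
print(x)             # 1 3 36 8 11 20
print(length(x))     # 6
```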
Matrices
If a vector can be represented by the columnar structure
We will create one long vector of all the values, column by column. We
call this vector values.
>values <- c(23, 41, 35, 33, 2, 3, 4, 8, 2, 11, 6, 7)
Next, we will create a vector called colnames, whose elements are the
names of our original vectors, and a vector called rownames, whose
elements are the names of the employees. Remember to enclose the names
in quotation marks, since they are of character type.
>colnames <- c("Age", "YearsTraining", "LengthEmploy")
>rownames <- c("Roy", "Sunny", "Stan", "Pat")
Next, we use the matrix function to create the matrix. We will call this
matrix, Matrix1.
>Matrix1 <- matrix(values, nrow=4, ncol=3, byrow=FALSE,
dimnames=list(rownames, colnames))
The values argument in the matrix function gives the vector of numbers
that we called “values” in the designation we coded above. The arguments
nrow and ncol give the number of rows and columns respectively of our
matrix. The argument byrow is a logical type and it tells the function how
the values (the numbers) will be written into the matrix. In our case, when
we were creating the vector values, we entered the numbers by column and,
as such, we will be filling our matrix column by column. So, we give the
byrow argument the value of FALSE to indicate that we will not be filling
the matrix by rows.
The dimnames argument takes a list of the names that the matrix is to use
as its row and column names, in that order. When we press
the Enter key, R will create the matrix, Matrix1, and store it in memory. To
call up Matrix1 on screen, we key and Enter
>Matrix1
R will output on the screen
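As a minimal sketch, the whole construction can be run in one go. The object names follow the text; note that the vectors called colnames and rownames share their names with built-in functions, which R tolerates.

```r
values   <- c(23, 41, 35, 33, 2, 3, 4, 8, 2, 11, 6, 7)
colnames <- c("Age", "YearsTraining", "LengthEmploy")
rownames <- c("Roy", "Sunny", "Stan", "Pat")

# byrow = FALSE fills the matrix one column at a time.
Matrix1 <- matrix(values, nrow = 4, ncol = 3, byrow = FALSE,
                  dimnames = list(rownames, colnames))

print(Matrix1)
print(dim(Matrix1))                 # 4 rows and 3 columns
print(Matrix1["Sunny", "Age"])      # 41
print(Matrix1[, "LengthEmploy"])    # the whole third column
```

Because the first four values fill the Age column, Sunny's age is the second value in the vector values.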
NA indicates that that element is “not available”. And asking R for the
elements of the second column,
>X[ ,2]
we will get
[1] 3 NA
Suppose you have the following dataset. Here we have four vectors: two
of numeric mode and two of character mode. Thus, the structure we will use
is a data frame.
Figure 2.2
One way to create the above data frame is to use the data.frame()
function to create an empty data frame and then use the data editor to enter
the data. In creating the empty data frame, we first name the column vectors
and their types and give each a length of zero. We will call the data frame
dataframe1.
>dataframe1 <- data.frame(Name=character(0), Age=numeric(0),
Position=character(0), LengthEmploy=numeric(0))
>dataframe1<-edit(dataframe1)
The above code tells R that we will edit the empty dataframe1 we
created, and the result will be the new dataframe1. The data editor will come
up with the data frame and its variables (columns) already created (Figure
2.3A). You may now enter the data directly as in a spreadsheet (Figure
2.3B).
Figure 2.3
If, at this point, we want to change the name of a variable (column
vector), we simply click on the variable name in the data editor. A dialog
box like the one shown below will appear. We can now write in the new
name.
Figure 2.4
That’s it. The data frame will now be created and stored in memory. To
call up dataframe1 to screen, we key
>dataframe1
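Because edit() needs an interactive session, a data frame can also be built directly from vectors. The sketch below does this with illustrative values only: the table in Figure 2.2 is not reproduced here, so the rows are assumptions pieced together from names and numbers used elsewhere in this chapter.

```r
# Illustrative values only: the table in Figure 2.2 is not reproduced here.
dataframe1 <- data.frame(
  Name         = c("Roy", "Sunny", "Stan", "Pat"),
  Age          = c(23, 41, 35, 33),
  Position     = c("Stock", "Mail Sup", "IT Sup", "OfficeMgr"),
  LengthEmploy = c(2, 11, 6, 7),
  stringsAsFactors = FALSE   # keep the text columns as character vectors
)

str(dataframe1)               # structure: 4 observations of 4 variables
print(dataframe1$Age)         # one column, extracted with $
print(mean(dataframe1$Age))   # 33
```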
Factors
In R, categorical (non-numeric) variables are stored as factors. The
categories of these variables are called “levels” of the factors. In the above
example, the variable “Position” is non-numeric, so it is a factor and it
contains four levels: Stock, Mail Sup, IT Sup, and OfficeMgr. R codes the
levels of a factor as integers from 1 to k, where k is the number of unique
levels the factor has. Thus, the factor Position has four levels. By default
R orders the levels alphabetically; if the levels are instead specified in the
order below, they will be coded as
1 = Stock
2 = Mail Sup
3 = IT Sup
4 = OfficeMgr
Now, suppose we added (using the data editor) the row
“Jim 27 Stock 4 M”,
then the data frame, dataframe1, will now look like this,
The first line creates the vector Answer and the second line tells R to
store it as a factor with five ordered levels.
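A two-line pattern of this kind can be sketched as follows. The survey responses and the five level names are hypothetical, since the original Answer values are not shown in this section.

```r
# Hypothetical survey responses; the original Answer values are not shown.
Answer <- c("Agree", "Neutral", "Strongly Agree", "Disagree", "Agree",
            "Strongly Disagree")

# Store Answer as an ordered factor, listing the levels from lowest to highest.
Answer <- factor(Answer, ordered = TRUE,
                 levels = c("Strongly Disagree", "Disagree", "Neutral",
                            "Agree", "Strongly Agree"))

print(levels(Answer))          # the five levels, in the order we specified
print(as.integer(Answer))      # the integer codes: 4 3 5 2 4 1
print(Answer[1] > Answer[2])   # TRUE: "Agree" ranks above "Neutral"
```

Listing the levels explicitly is what fixes the integer coding; without the levels argument, R would sort them alphabetically.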
Lists
For example, to create a list that we will call listA from the above data,
we will create two vectors, one called Kilometers and another called
Time, and then a factor (non-numeric vector) called DayofWeek.
Now we can create a list called listA that consists of the factor
DayofWeek and the two vectors Kilometers and Time.
>listA[[“Kilometers”]]
R will then output
[1] 3.3 2.6 4.0 3.5 2.9 3.8 3.1
Or, instead of keying “Kilometers”, we can simply write 2 inside of the
double square brackets.
>listA[[2]]
This asks R for the second component (Kilometers) of the list listA, to
which R will output
[1] 3.3 2.6 4.0 3.5 2.9 3.8 3.1
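Putting the pieces together, the list could be built as in the sketch below. The Kilometers values are taken from the output above; the day labels and the Time values are assumptions, since the original table is not shown in this section.

```r
# Day labels are assumed; Kilometers comes from the output shown above.
DayofWeek  <- factor(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
Kilometers <- c(3.3, 2.6, 4.0, 3.5, 2.9, 3.8, 3.1)
Time       <- c(20, 16, 25, 22, 18, 24, 19)   # hypothetical minutes

listA <- list(DayofWeek = DayofWeek, Kilometers = Kilometers, Time = Time)

print(listA[["Kilometers"]])   # access a component by name
print(listA[[2]])              # access the same component by position
print(listA$Time)              # the $ shortcut also works for named components
```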
Strings
Strings are relatively short pieces of text consisting of a sequence of
characters, for example, the name of a respondent in a survey. Strings are
enclosed in quotation marks. One word of caution is in order here: the number
38 is not the same as the string “38”. The string “38” is a text sequence
consisting of the digit 3 followed by the digit 8.
The operation
>3 * 4
is perfectly legal and R will output
[1] 12
However,
>"3" * "4"
will cause R to return an error, as well it should, because here we
are attempting to perform a numeric operation on two text strings.
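If the strings do hold digits, they can be converted with as.numeric() before doing arithmetic. In the sketch below, tryCatch() is used only to capture the error message so that the code keeps running; the variable name result is our own.

```r
# Multiplying two strings fails; tryCatch() captures the error message.
result <- tryCatch("3" * "4", error = function(e) conditionMessage(e))
print(result)

# Converting the strings to numbers first makes the operation legal.
print(as.numeric("3") * as.numeric("4"))   # 12

# The string "38" is two characters of text, not the number 38.
print(nchar("38"))          # 2
print(is.character("38"))   # TRUE
print(is.numeric(38))       # TRUE
```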
Boolean Values
R uses TRUE and FALSE as Boolean values. These are mostly used with
comparison and logical operators. For example, if we write
>20 < 15
we are asking R whether 20 is less than 15.
R will tell us
[1] FALSE
We used “<”, a comparison operator that compared 20 to 15.
Below, we present a list of R’s comparison operators.
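As a quick sketch, the six standard comparison operators in R can be tried at the prompt as follows. The vector x at the end is our own example, included to show that a comparison is applied to every element.

```r
print(20 < 15)    # less than: FALSE
print(20 > 15)    # greater than: TRUE
print(20 <= 20)   # less than or equal to: TRUE
print(20 >= 21)   # greater than or equal to: FALSE
print(20 == 15)   # equal to: FALSE
print(20 != 15)   # not equal to: TRUE

# Applied to a vector, a comparison returns one Boolean value per element.
x <- c(1, 3, 6, 8, 11)
print(x > 5)             # FALSE FALSE TRUE TRUE TRUE
print(x > 2 & x < 10)    # the logical operator & combines two comparisons
```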
Data Entry
Data comes in many formats and from many different sources. Data from
almost any source and in almost any format can be imported
into R. We will only deal with a few formats and sources here. For a
complete guide to data import and export in R, see R Data Import/Export
in the manuals on the CRAN website.
Direct Entry from the Keyboard
If you are entering data directly from the keyboard, which you may do if
your dataset is small, the easiest method is through the use of the edit()
function, which we have already met. When you invoke the edit function,
the data editor will come up and you can then enter your data directly into
R’s memory. The simplest way to do this is to create an empty data frame
and then call up the editor to enter the values into the empty vectors. We
have already seen this method, but to solidify it in our minds, we will see
another example.
We are going to create a data frame called dataframe1, which will consist
of the data which we previously used in the section on lists, and we are
going to enter it directly from the keyboard.
Now we will call up the data editor so that we can fill in the values.
>dataframe1 <-edit(dataframe1)
Note that we are altering the original object dataframe1 by our editing of
it. At the end of our editing we are assigning the results back to the same
object, dataframe1, so that it is then altered.
Another way to call up the data editor is to choose Data Editor from the
Edit menu at the main ribbon.
Below is a picture of the result of calling up the data editor.
Figure 2.5
The data editor allows you to change the name and type of a variable.
To do so, click on the variable (the column heading) and make the
changes in the box that appears. Close the box. The changes are made. We
can also edit cell values. Double-click on the cell you want to edit and make
the changes.
To enter data, you simply click in the appropriate cell and enter the value.
Additional columns can be added by simply clicking on the column title cell
of adjacent empty columns. Next, we will look at importing data and
creating a data frame from a text file that is already in existence.
We have created the file Example in Excel and we want to import it into
R. For the file Example, we will use the same table below that we created
earlier in the section.
>dataframe2 <- read.table("C:/Users/Hernandez/Desktop/[Link]",
header=TRUE, sep="\t")
>dataframe2 <- read.table("C:/Users/Hernandez/Desktop/[Link]",
header=TRUE, sep=",")
The argument header=TRUE tells R that the first row contains the
column names and not data values, so R will not try to do
calculations with them. The argument sep="," tells R that the data values
are separated by commas. If the file was saved as tab-delimited text
(.txt), then you should key in sep="\t" instead. This tells R that the file
uses a tab to separate the data values.
To now call up the file dataframe2, key
>dataframe2
Now if the file was saved in Excel as a comma-delimited file, we can
import it into R using the read.table() function as follows:
>dataframe2 <-
read.table(file="C:/Users/Hernandez/Desktop/[Link]",
header=TRUE, sep=",")
Figure 2.6
>smokedata <- read.table("C://Users/Bill/Documents/[Link]",
header=TRUE, sep=",")
We are telling R to read the file [Link], found at
C://Users/Bill/Documents, with the first row being the headings
(header=TRUE) and that the file is delimited or separated by a comma
(sep=",").
This line of code will import the file into R.
Now we simply type
>smokedata
to call up the file to the R screen.
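The import step can be tried without any existing file by first writing a small comma-delimited file to a temporary location. Everything in the sketch below is made up for illustration: the column names, the three rows, and the temporary file that stands in for the path used in the text.

```r
# Write a small, made-up comma-delimited file to a temporary location.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Smoker",
             "Ann,34,Yes",
             "Bob,51,No",
             "Cy,27,No"), csv_file)

# header = TRUE: the first row holds the column names.
# sep = ",": the values are separated by commas.
smokedata <- read.table(csv_file, header = TRUE, sep = ",")

print(smokedata)
print(names(smokedata))   # "Name" "Age" "Smoker"
print(nrow(smokedata))    # 3
```

For a tab-delimited file, the same call would be used with sep="\t".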
>install.packages("Hmisc")
>library(Hmisc)
>dataframe4 <- spss.get("[Link]", use.value.labels=TRUE)
We first install the package, load it from our library, then use its spss.get()
function to find [Link] and import it into R as dataframe4.
The logical argument use.value.labels=TRUE tells R to convert variables with
value labels into factors with the value labels as levels of the factor.
>install.packages("Hmisc")
>library(Hmisc)
>dataframe4 <- [Link]("[Link]", [Link]=TRUE)
You can also save the SAS dataset as a comma-delimited file in SAS
using the PROC EXPORT command.
SAS program:
Proc export data=datafile
Outfile=”[Link]”
dbms=csv;
run;
Thus, we have the file saved in SAS as a comma-delimited file. We can
now import it into R using the read.table() function.
Exercises 2
What would the values of x and y be after the following commands are
executed in R?
>x <-3
>y <- 7
>x <- x + y
> y <- x + y
Create a vector, entomology1, that consists of all the values in the table,
entered row by row. Then create the matrix, matrix1, using the matrix()
function with the argument byrow set to TRUE. Now apply the colnames()
function to create a character (string) vector containing the column labels.
Now call up matrix1 to see the completed matrix.
We will create the table above as a data frame in R using the R Data
Editor. Do so with the code below and when the data editor comes up, enter
the values.
The first line of code creates the data frame called entomology2
consisting of three columns all of which are of numerical mode.
>entomology2 <- data.frame(Weight=numeric(),
ThoraxLength1=numeric(), ThoraxLength2=numeric())
>entomology2 <- edit(entomology2)
Enter the Entomology table above into Excel, and then import it from
Excel into R, calling it entomology3.
Chapter 3: Introduction to Graphs
R’s graphical abilities are one of its strongest points. You can use R to
create some amazing graphs and charts. In this chapter, we will investigate
some of the basic graphing methods and techniques. Below are some
examples of R’s powerful graphics capabilities.
Figure 3.1
But now you want to use a dotted line (lty=3) and a solid square for the
points (pch=15), like the graph shown below.
Figure 3.2
Color is specified with the col parameter: by using the index for the
specific color (for example, col=1); by typing in the color name directly,
for example, col=“green”; by using the hexadecimal code for the color, for
example, col=“#8B7500”; or by using the RGB code for the color, for example,
col=rgb(139,117,0, maxColorValue=255).
The colors() function will give a list of all color names available.
For a comprehensive chart of all R colors, visit the web site:
[Link]
R also has many more sophisticated color functions that can be used to
create more striking effects: rainbow(), which produces the colors of the
rainbow; [Link](); [Link](); [Link]() are some of the more
popular color functions. Levels of gray can be produced with the gray()
function. For example, to generate six evenly spaced levels of gray, from
black to white (not 50 shades!), type
>gray(0:5/5)
(Don’t worry about the color codes currently. We will be seeing pie charts
up close in a subsequent chapter.)
To observe an example of the gray() function in action, we will code a
pie chart with 8 gray levels and label them with the hexadecimal code.
>x<-8
>shadesofgray<-gray(0:x/x)
>pie(rep(1,x), labels=shadesofgray, col=shadesofgray)
You will get the following pie chart.
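The color specifications discussed above can be checked at the prompt. In the sketch below, the variable names gold_hex and grays are our own; note that rgb() expects values between 0 and 1 unless maxColorValue is given.

```r
# An RGB triple on the 0 to 255 scale, converted to its hexadecimal code.
gold_hex <- rgb(139, 117, 0, maxColorValue = 255)
print(gold_hex)          # "#8B7500"

# gray() takes levels between 0 (black) and 1 (white).
grays <- gray(0:5/5)     # the levels 0, 0.2, 0.4, 0.6, 0.8, and 1
print(length(grays))     # 6
print(grays[c(1, 6)])    # "#000000" "#FFFFFF"

# The first few of the color names that R knows.
print(head(colors()))
```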
After the equal sign, you will specify a number that specifies the font
style according to the scheme:
1 = plain text
2 = bold
3 = italics
4 = bold italics
5 = Adobe symbol encoding
If you don’t use the opar setting, then all graphs created after the
statement above will have axis labels in bold, and with the default size of 1;
the main title in bold italic with a font size 2.5 times the default size of 1 and
the font family will be Palatino Linotype.
The windowsFonts() function will only work on Windows. If you are
using a Mac, then you use the quartzFonts() function, and it is used in the
same manner as the windowsFonts() function.
Now, let us construct a graph using some of the functions and techniques
we have seen so far: the opar<-par, the pin, mai, title, plot and finally
passing the opar back to par.
The two vectors are:
Year: 1790, 1820, 1850, 1880, 1910, 1940, 1970, 2000.
Number of New Species Described: 100, 120, 150, 170, 180, 195, 211,
215.
I will explain the code while you follow the code lines below. First, enter
the vectors: Year and NumofSpeciesDescribed, set the opar, and use the
par() function to set the pin, mai, and line width. Then use the plot()
function to include line type, plotting symbol, plot type, parameters for
colors for the lines, main title label, the color, size and font of the main title,
and the color of the axes titles. The code should end by sending the opar
setting (the original settings) back to par (the forefront) to make it current
once more. However, we will not end the opar at this time since we will be
plotting a second line on the same graph. After this second line and the
legend, we will then change the opar settings back to par.
The “topleft” indicates the location of the legend. Other locations that can
be written in as keywords are: “topright”, “top”, “bottom”, “left”, “right”,
“bottomleft”, “bottomright”, “center”. If you use one of these keywords,
you can also specify how far into the plot you want to place the legend by
using the inset= parameter. The inset is given as a proportion of the plot
width; “0.10” means that the legend will be placed 1/10 of the
width of the plot area into the plot from the top left.
The title parameter gives a title to the legend. The first concatenation
gives the names of the two lines of the legend. The second concatenation
gives the line types you specified for the two lines when you plotted the
lines. The third gives the colors of the two lines. And finally, we end the
code by sending the opar setting (the original settings) back to par (the
forefront) to make it current once more.
Below is the graph after the legend code.
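The steps described in this section can be sketched as follows. This is not the original listing: the second series, the titles, the colors, and the particular pin and mai values are all assumptions made for illustration, and the graph is written to a temporary PDF so that the sketch runs without a screen.

```r
Year <- c(1790, 1820, 1850, 1880, 1910, 1940, 1970, 2000)
NumofSpeciesDescribed <- c(100, 120, 150, 170, 180, 195, 211, 215)
# Hypothetical second series: its values are not given in the text.
NumofSpeciesRevised <- c(90, 105, 125, 140, 160, 170, 185, 200)

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 7, height = 7)

opar <- par(no.readonly = TRUE)    # save the original settings
par(pin = c(4, 3), mai = c(1, 1, 1, 0.5), lwd = 2)   # assumed sizes, in inches

# First line: dotted (lty = 3) with solid squares (pch = 15).
plot(Year, NumofSpeciesDescribed, type = "b", lty = 3, pch = 15,
     col = "blue", ylim = c(80, 220),
     main = "New Species Described", col.main = "darkgreen",
     cex.main = 1.5, font.main = 4, col.lab = "brown",
     xlab = "Year", ylab = "Number of species")

# Second line on the same graph: solid (lty = 1) with triangles (pch = 17).
lines(Year, NumofSpeciesRevised, type = "b", lty = 1, pch = 17, col = "red")

legend("topleft", inset = 0.10, title = "Series",
       legend = c("Described", "Revised (hypothetical)"),
       lty = c(3, 1), col = c("blue", "red"))

par(opar)                          # restore the original settings
invisible(dev.off())
```

Saving opar before changing anything and passing it back to par() at the end keeps the changed settings from leaking into later graphs.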
Exercises 3
Create a data frame called dataframe1, which holds the data given in the
table below, by first creating an empty data frame and then filling it in using
the data editor. Attach the data frame to the search path and create a
temporary opar in which to store the different parameters for this graphing
only.
Construct a graph of the two lines on the same plot. Set margins at 2
inches on the bottom and the left, with margins of 1 inch on the top and
right. Use different plotting symbols and different line types for each line
graph. Use colors to liven up the graph lines and labels. Add a legend at the
upper left of the plot area. End the code by passing the opar back to par to
revert to the original settings.
Chapter 4: Using RStudio
Now that we have familiarized ourselves with some of the features in the
working of base R, we will introduce an R-running environment, which
could serve to make our lives, as users of R, a lot easier. This chapter
introduces RStudio and visits a few of its basic features – all you need to
know to begin using RStudio like a pro.
RStudio is an integrated development environment (IDE) that was built
just for R. It includes a console and a syntax-highlighting editor that
supports direct code execution, as well as tools for plotting, history,
debugging, and workspace management. It is also commonly called a GUI
(Graphical User Interface). A GUI is not the main application but sits on
top of a main application like R and makes it more user-friendly. For
example, basic tasks like opening a script, importing and exporting CSV
files, managing packages, and making help queries can be done with a click
of the mouse instead of by writing lines of code.
Figure 4.1
To be able to use RStudio on your system, you first need to have base R
(version 3.0.1 or later) installed from [Link]. Once you have R installed,
you can then download and install RStudio.
RStudio is available in open-source (free) and commercial editions (buy)
and runs on the desktop (Windows, Mac, and Linux) or in a browser
connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, Red
Hat/CentOS, and SUSE Linux).
Download and Installation of RStudio
At the following link, [Link], choose the appropriate version for your
needs, click on download, and then run it to install RStudio. When you open
RStudio you will see four windows as shown below.
Figure 4.2
You can also install new packages by clicking on the Install sub-tab.
Figure 4.11
When you do, the dialog box shown in Figure 4.11 will appear. In this
box, you will enter the name of the package you wish to install. Multiple
packages may be installed from this one window by writing their names
separated by a space or a comma.
The list of packages shown by clicking the Packages tab is organized into
Libraries. The first time you open RStudio, you will see a “System Library”
list. This is a listing of all packages that have been automatically installed
with R. All the packages you, the user, have installed over time are listed in
the User Library.
It is important to note that the packages in the libraries can only be used
after you have activated them by clicking in the checkbox to the left of the
package name.
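Ticking the checkbox is the point-and-click counterpart of the library() function. The lines below are a minimal sketch of the same steps in code. The tools package is used only because it ships with every R installation; any installed package name could be substituted.

```r
# install.packages("plotrix")   # the code counterpart of the Install sub-tab
                                # (needs an internet connection, so it is
                                # left commented out here)

# Check whether a package is installed, then attach it for this session.
pkg_installed <- requireNamespace("tools", quietly = TRUE)
pkg_installed

library(tools)   # same effect as ticking the checkbox next to "tools"

# An attached package appears on the search path.
pkg_attached <- "package:tools" %in% search()
pkg_attached
```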
Figure 4.13
Understanding how to use the Help tab is important and can save you a
lot of time.
Plots and Management Window: Viewer Tab
This tab is used to display local web content. This is an advanced topic
that will not be looked at here.
Figure 4.14
This will open a second editor window. A tab for this new window is
created and shown as highlighted in yellow.
Figure 4.15
Now we will write and enter the following code to create a simple object,
Object1:
Object1 <- 3:6
Now, we will highlight the line of code and then click “Run”.
Figure 4.16
RStudio will respond as shown below.
As shown in red, the editor will send the script to the console.
Figure 4.17
As shown in green, the console will create the object and send the details
of the object to the Global Environment window.
Now let us plot the elements of Object1 against their index numbers, that
is, their position numbers in the vector (the first element, 3, has the index
1, the second element, 4, has the index 2, etc.). We write the following code
in the editor. (Notice that there is no prompt in the script editor window.)
plot(Object1)
After clicking Run, RStudio will respond as shown below. Note also that
we do not have to click RUN after every line of the script. We can write
several lines of script, highlight all the lines and then click RUN.
On clicking on the Plots tab in the Plots window, we will see a scatterplot
of the points. We could maximize the window if we so desire.
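To see what "plotting against the index" means, the sketch below builds the index explicitly with seq_along() and draws the same scatterplot both ways. The variable name index and the axis labels are illustrative choices, not part of the original script.

```r
Object1 <- 3:6                 # the vector 3, 4, 5, 6
index <- seq_along(Object1)    # the positions 1, 2, 3, 4

# With a single vector, plot() puts the index on the horizontal axis.
plot(Object1)

# The same scatterplot with the index supplied explicitly.
plot(index, Object1, xlab = "Index", ylab = "Object1")
```

Remember that R is case-sensitive: plot(object1) or Plot(Object1) would produce an error.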
When prompted, choose where you want to save it on your computer and
then name the file using the dialog box that appears. You can also save your
entire workspace by clicking on the ‘Session’ tab in the Script Editor
window and choosing ‘Save Workspace As’.
Saving a Plot
To save or export a graph or plot from the Plots window, you simply click
on Export and then choose Save as Image, Save as PDF or Copy to
Clipboard, as shown in the figure below. In the appearing dialog box, you
can choose the location at which you wish to save the plot and give the plot
a name.
Figure 4.20
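The Export button has a counterpart in code. As a rough sketch, the lines below save a plot as a PDF by opening a graphics device, drawing the plot, and closing the device. The file name is hypothetical, and the file is written to R's temporary directory so that nothing on your computer is overwritten.

```r
plot_file <- file.path(tempdir(), "Object1_plot.pdf")   # hypothetical file name

pdf(plot_file, width = 6, height = 4)   # open a PDF graphics device
plot(3:6, main = "Object1")             # the plot is drawn into the file
invisible(dev.off())                    # close the device to finish writing

file.exists(plot_file)                  # TRUE once the file has been written
```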
Plots can also be saved from the Script Editor window Plots tab.
Figure 4.21
Figure 4.22
The dialog box that will come up enables you to choose a directory by
pointing and clicking instead of writing lines of code.
Figure 4.23
Figure 4.24
In the Options box that appears, click on Appearance. The Appearance
box will allow you to customize your workspace as you will. The first
section of the Appearance dialog box is the RStudio Theme. Here you can
change the theme of the window itself. The choices are Classic, Sky, or
Modern.
Figure 4.25
The Zoom section lets you change the percentage of zoom in or out. The
Editor Font and Font Size follow. After the Font Size, you will find the
Editor Theme. This section has many options for a theme: background color,
font color, colors for code levels and you can choose one to your taste.
Below, I have clicked on “Tomorrow Night Blue”.
Figure 4.26
This option gives me a navy background with basic white text and
differently colored text for different functions and levels of code hierarchy.
After I click “Apply”, my workspace will appear as shown above in the
section at the right in Figure 4.30.
We can also use the Import Datasets menu in the Environment window.
From there we can import from Excel, SPSS, SAS, or Stata with just the
click of a mouse and then choosing options from the ensuing dialog box.
The top portion of the dialog box that appears when the Import from
Excel button is clicked is shown below in Figure 4.28. In the highlighted
space you would put the URL for the file if you are downloading from the
web. If you are accessing it from your computer, you click on the Browse
button on the right.
Figure 4.28
Figure 4.29 below shows the bottom half of the dialog box.
Figure 4.29
In the Name box, you give your dataset the name by which it will be
accessed in R. In the Sheet box, you use the drop-down arrow to choose the
worksheet you wish to import if your Excel spreadsheet has more than one
worksheet. If there are other worksheets in the file that you want to import,
you can import them one at a time. In the Range box, you give RStudio the
range of cells that you are using. Sometimes the original Excel file might
contain rows of information that you do not wish to use in your R analysis.
In the Max Rows box, you can instruct R on the number of rows you wish to
import. If you enter “5” in this box, R will import only the first 5 rows of
the Excel file. In the Skip box, you instruct R to skip rows. If you
type “2” in this box, R will skip the first two rows of the Excel file; if you
type “3”, the first three rows will be skipped and will not appear in your
imported file. The NA box is perhaps the most important in this group.
Missing values are almost always present, and you need to instruct R how to
handle them. Handling missing values improperly could give you unreliable
results in your analysis.
Missing Values
If there are empty cells in the original file, R will automatically place
NA’s in those spaces when it imports the file. NA means “not available”. If
there are any other symbols in the cells apart from the symbols they are
supposed to contain, you will have to instruct R to treat these as missing
values. For example, if some cells that are supposed to contain numeric
values have the symbols “###” in them, as shown below,
Figure 4.30
then you will have to tell R to treat them as missing values by typing “###”
into the NA box. Wherever R finds this symbol in a cell, it will replace it
with an NA, making it an “official” missing value. If the first row of the file
contains the column names, then click in the checkbox “First Row as
Names” on the right.
Before you click “Import”, you need to check the preview of your file in
the upper left window. If there is a variable in your dataset that is a
character variable but coded with numbers, then R will seek to identify it as
a numeric variable. You will have to correct R by telling it that it is a
character variable. In the figure below, the variable in the last column is
coded as 1’s and 0’s. However, it is a character variable: it records
whether the pharmacy is in a shopping center or not, with 1 coded as “Yes”
and 0 as “No”. To tell R to identify this variable as a character variable,
click on the drop-down arrow near the name of the variable in Row 1. Clicking
on this drop-down arrow will bring up the list of options shown in the figure
below.
Figure 4.31
Select “Character” from the list and R will identify the variable as such.
We are now ready to import the file. Click on the Import button on the
bottom right (Figure 4.29) and your file will be imported into the R
workspace.
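The choices made in the import dialog can also be written as code. The sketch below is not the code RStudio generates for Excel files; it uses base R's read.csv() on a small CSV typed inline, an assumption made so that the example runs on its own. The data values and column names are hypothetical. The na.strings argument plays the role of the NA box, and colClasses plays the role of the drop-down arrow used to mark a column as Character. The read.csv() arguments skip and nrows correspond to the Skip and Max Rows boxes.

```r
# A small CSV typed inline so the example is self-contained (hypothetical data).
csv_text <- "Pharmacy,Sales,ShoppingCenter
A,120,1
B,###,0
C,95,1
D,,0"

pharmacy <- read.csv(
  text = csv_text,
  header = TRUE,                                 # first row holds the names
  na.strings = c("###", ""),                     # treat "###" and empty cells as NA
  colClasses = c(ShoppingCenter = "character")   # keep the 1/0 codes as text
)

pharmacy
is.na(pharmacy$Sales)            # TRUE where a sales value is missing
class(pharmacy$ShoppingCenter)   # "character"
```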
Projects
If you are using RStudio, you can create a new R project by choosing File/New
Project. A project is simply a special working directory
designated with an .Rproj file. When you open a project (using File/Open
Project in RStudio or by double-clicking on the .Rproj file outside of R), the
working directory will automatically be set to the directory that the .Rproj
file is located in.
Once you have created a new R project, let’s say you named the project
Proj1, the main folder will be [Link]. Within this folder, you could then
create a folder that will contain your R code, a folder for your data files,
folders for notes, a folder for your graphs, and other material relevant to
your project (you can do this outside of R on your computer, or in the Files
window of RStudio). For example, you could create a folder called
[Link] that contains all your R code, a folder called [Link] that
contains all your data (etc.).
Chapter 5: Data Management 1
Datasets from, for example, surveys can be quite large and, as such, may
contain many areas that are problematic for the analyst. For
example, respondents often fail to respond to many questions, which means
that we will need to have a way to handle missing data. There may also be
many variables in the dataset, but only a few of interest to us. In this case,
we may need to create a simpler dataset containing only the variables of
interest. We may also need to recode values of a variable into new categories
because of the need for further study on the variable. These and many other
issues present themselves and thereby necessitate tools and methods of data
management to handle the problems that arise. The first aspect of data
management we will look at is that of missing values in a dataset.
is.na(vector3)
R will respond:
By the TRUE in R’s response, it tells us that the third position is an NA.
Let us see how the function operates with a data frame, dataframe1. We
call it up in the Script Editor window.
dataframe1
Remember that in RStudio, you do not press ENTER, but instead, you
click RUN to execute the commands. So, upon clicking RUN, the following
will appear in the Console window.
is.na(dataframe1)
R’s response will be:
The TRUE means that that place in the data frame, dataframe1[3, 2],
contains a missing value.
dataframe2<-na.omit(dataframe1)
dataframe2
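For readers who want to try these steps from scratch, here is a small self-contained sketch. The values in vector3 and dataframe1 are hypothetical, chosen so that the missing values fall in the same positions described in the text. The use of na.omit() to build dataframe2 is likewise an assumption about the step being carried out.

```r
vector3 <- c(4, 9, NA, 7)            # hypothetical values; the third is missing
is.na(vector3)                       # TRUE only in the third position

dataframe1 <- data.frame(X = c(1, 2, 3, 4),
                         Y = c(10, 20, NA, 40))   # hypothetical values
is.na(dataframe1)                    # TRUE marks dataframe1[3, 2]

dataframe2 <- na.omit(dataframe1)    # drop every row that contains an NA
dataframe2
```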
and ProdXY, which will be the product of the X and Y elements, where
dataframe4<-transform(dataframe4, DiffXY=X-Y)
Here, the transform() function has two parameters: the first is the name
of the data frame we are transforming (dataframe4), and the second is how
we are going to transform it. We are transforming it by adding a new
column, which is the difference X – Y. We then call the transformed
dataframe4, the new dataframe4.
We now call up the new dataframe4.
dataframe4
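The following sketch shows transform() at work on a tiny data frame. The X and Y values are hypothetical; only the column names DiffXY and ProdXY come from the text.

```r
dataframe4 <- data.frame(X = c(8, 6, 9), Y = c(3, 4, 2))   # hypothetical values

dataframe4 <- transform(dataframe4, DiffXY = X - Y)   # add the difference
dataframe4 <- transform(dataframe4, ProdXY = X * Y)   # add the product
dataframe4
```

Both new columns could also be created in a single call, transform(dataframe4, DiffXY = X - Y, ProdXY = X * Y).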
Renaming Variables
You can change the names of variables using several methods, two of
which we will visit here. The simplest method is that of using the data
editor. Suppose in dataframe4 you want to change the name of the variable
“DiffXY” to “[Link]” and you want to do so using the data editor. You
can call up the data editor in several ways. You can go to the Menu bar and
click Edit then find Data Editor.
Or you can call up the data editor by using the fix() function.
fix(dataframe4)
This will call up the data editor so that you can - guess what? That’s right
- fix dataframe4. You enter the name of your main file, dataframe4, in the
opening dialog box, then directly change the name of the variable “DiffXY”
by double-clicking on the name and typing in the new name, “[Link]”.
date()
[1] "Fri Oct 11 12:14:34 2019"
We can change the format of the current date by using the format()
function with the formats shown in the table below.
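As a brief illustration of format(), the sketch below applies a few common format codes to a fixed date. A fixed date is used, rather than today's date, so that the results can be reproduced. The codes shown (%d, %m, %y, %Y, %B) are standard R date codes, offered here as examples rather than as a copy of the table.

```r
d <- as.Date("2019-10-11")   # a fixed date so the results are reproducible

format(d, "%d/%m/%Y")    # day/month/four-digit year: "11/10/2019"
format(d, "%m-%d-%y")    # month-day-two-digit year: "10-11-19"
format(d, "%B %d, %Y")   # full month name; the name depends on your locale

format(Sys.Date(), "%d/%m/%Y")   # today's date in the first format
```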
Built-In Functions
For manipulating data, R has many built-in functions that can handle a
wide array of operations. We will visit several of the functions of common
usage for arithmetic, mathematical and statistical calculations. You need not
memorize these functions but use them as needed.
x + y # addition
x - y # subtraction
x * y # multiplication
x / y # division
x^y # exponentiation (raising x to the power of y)
x %% y # x mod y (the remainder when x is divided by y)
x %/% y # integer division
round(x, digits=n) # round x to n decimal places
trunc(x) # truncate the values in x toward 0: trunc(8.95) returns 8
signif(x, digits=n) # round x to n significant digits
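A few of these operators are easier to remember after seeing them on actual numbers. The values below are arbitrary and chosen only for illustration.

```r
x <- 17
y <- 5

x %% y     # remainder: 17 = 3 * 5 + 2, so this is 2
x %/% y    # integer quotient: 3
x^2        # 17 squared: 289

round(3.14159, digits = 2)   # two decimal places: 3.14
trunc(-8.95)                 # truncation moves toward 0, so this is -8
signif(123456, digits = 2)   # two significant digits: 120000
```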
Mathematical Functions
Trigonometric Functions
cos(x)
sin(x)
tan(x)
cosh(x)
sinh(x)
tanh(x)
acos(x)
asin(x)
atan(x)
acosh(x)
asinh(x)
atanh(x)
Relational Operators
x<y
x>y
x <= y
x >= y
x == y # equality test
x != y #non-equality test
These are binary operators that compare the values. They return a vector
of TRUEs or FALSEs that indicate the result of the individual
comparisons.
For example
4 == 5
[1] FALSE #4 is not equal to 5
4 != 5
[1] TRUE
mean(x)
median(x)
sd(x)
var(x)
mad(x) # median absolute deviation of x
range(x)
sum(x)
min(x)
max(x)
summary(x) # gives summary numbers for the dataset x.
For example,
y<- c(5:25)
summary(y)
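To connect summary() with the individual functions listed above, the sketch below computes the same quantities separately for the vector y. The name s is introduced here only to hold the result.

```r
y <- c(5:25)

s <- summary(y)   # minimum, quartiles, median, mean, and maximum
s

c(mean = mean(y), median = median(y), sd = sd(y), var = var(y))
range(y)          # smallest and largest values: 5 and 25
```

For this vector the minimum is 5, the first quartile 10, the median and the mean both 15, the third quartile 20, and the maximum 25.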
The following functions are probability functions and are used basically
to generate data from probability distributions with known characteristics.
The following table gives the probability distribution and its code
Along with the code for the distribution names, we can add a letter to the
beginning of the code name to indicate a special function.
d = density function
p = distribution function (cumulative)
q = quantile function
r = will generate a specified number of random numbers from the
distribution.
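As an illustration of the four prefixes, the sketch below attaches each of them to norm, R's code name for the normal distribution. The mean of 40 and standard deviation of 10 in the last call are arbitrary values chosen for the example.

```r
dnorm(0)       # d: height of the standard normal density at 0
pnorm(1.96)    # p: cumulative probability P(Z <= 1.96)
qnorm(0.975)   # q: the value with 97.5% of the distribution below it

set.seed(1)    # fix the random seed so the draw can be repeated
r <- rnorm(5, mean = 40, sd = 10)   # r: five random values
r
```

Notice that pnorm() and qnorm() undo each other: qnorm(0.975) is about 1.96, and pnorm(1.96) is about 0.975.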
Figure 6.1
In the normal curve to the left, the age measures are packed closer around
the mean, so the curve is skinny and tall. If the mean is 40 years of age, then
in this curve, you are less likely to find someone of age 80 or 90 – that is too
far from the mean for this curve.
The curve to the right is fat and wide, meaning that the age measures are
spread out further from the mean. If the age curve looks like this one, and if
the mean is 40 years of age, you are more likely to find an 80- or a 90-year-
old here. The skinny curve will have a smaller spread around the mean (a
smaller standard deviation) than the fat one. If a normal distribution has a
large standard deviation, then you know the scores are spread further away
from the distribution’s mean.
Once you know the mean and the standard deviation of the normal
distribution, a probability question like the one posed above can be easily
answered.
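As a rough illustration of the skinny and fat curves, the sketch below finds the probability of an age above 80 when the mean is 40. The two standard deviations, 10 and 25, are hypothetical values chosen only to contrast a narrow curve with a wide one.

```r
# P(age > 80) when the mean is 40, under two hypothetical standard deviations.
p_skinny <- pnorm(80, mean = 40, sd = 10, lower.tail = FALSE)
p_fat    <- pnorm(80, mean = 40, sd = 25, lower.tail = FALSE)

p_skinny
p_fat
p_fat > p_skinny   # the wider curve puts more probability far from the mean
```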
Example:
During a lab experiment, the average number of radioactive particles
passing through a counter in 1 millisecond is 4. This means that the passing
of a particle through the counter is a rare event. What is the probability that
6 particles will enter the counter in the next millisecond? Since we want the
probability of a rare event, we can use the Poisson distribution.
To answer probability questions using a normal distribution, two
quantities are needed: the mean and the standard deviation. To answer
probability questions using the Poisson distribution, only one quantity is
needed: the mean number of events in the time or region specified. For
example, in the radioactive particles question above, we are told that the
average number of radioactive particles passing through a counter in 1
millisecond is 4. Thus, the Poisson mean is 4. The Poisson mean is called
“lambda”, a Greek letter that is written as “λ”.
To find Poisson probabilities in R, we use the dpois() function. We give
the function two quantities, lambda, and the x for the question. Remember
that x is the value of the random variable, whose probability we are being
asked. For this question, then, x = 6. The form of the function is dpois(x,
lambda). Thus, we key
dpois(6, 4)
R will answer
[1] 0.1041956
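The value returned by dpois() can be checked against the Poisson formula, P(X = x) = e^(-λ) λ^x / x!. The short sketch below computes the probability both ways; the names by_formula and by_dpois are introduced only for this comparison.

```r
lambda <- 4
x <- 6

by_formula <- exp(-lambda) * lambda^x / factorial(x)   # the Poisson formula
by_dpois   <- dpois(x, lambda)                         # R's built-in function

by_formula
by_dpois
all.equal(by_formula, by_dpois)   # TRUE: the two agree
```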
The Parameters
Now, we must find the parameters, x, n, p, and q. The quantity n, we
know to be the number of trials, which is 10. The quantity p is the given
probability of a success. This is 0.35. The third quantity is q, and q = 1 – p.
Therefore, if
p = 0.35, then
q = 1 – 0.35 = 0.65.
The quantity x is the value of the random variable and is given in the
question. The random variable, in this case, is the number of wells out of the
ten that contain impurities. The first question asks us to find the probability
that 0 out of the 10 wells contain impurities. Therefore, for this question, x =
0.
Our parameters are n = 10, p = 0.35, and q = 0.65. Each question will
give us a different x. We are asked the probability that none of the systems
work, that is, the probability that x = 0. This is denoted P(X = 0).
We can also find the probability that one out of the ten systems work, that
is
P(X = 1). Similarly, we could find P(X = 2), P(X = 3), P(X = 4), P(X = 5),
… , P(X = 10): eleven probabilities in all. Together, all eleven probabilities
make up the probability distribution of the random variable X, using the
binomial model. We are now ready to have R calculate the probabilities for
us.
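One way to obtain all eleven probabilities at once is with dbinom(), the density function for the binomial distribution. This is a minimal sketch; storing the results in a vector named y is an assumption made so that the name matches the plotting code used in this section.

```r
n <- 10      # number of trials
p <- 0.35    # probability of a success
x <- 0:10    # every possible value of X

y <- dbinom(x, size = n, prob = p)   # P(X = 0), P(X = 1), ..., P(X = 10)
round(y, 4)

sum(y)                     # the eleven probabilities add up to 1
all.equal(y[1], 0.65^10)   # P(X = 0) is q multiplied by itself 10 times
```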
We can plot all values of X against Y, their probabilities. This will be the
graph of our distribution. To do so, we first give a name to the vector of X
values, which is zero through 10, (0, 1, 2, …, 9, 10). We call this vector x.
Then we will plot x against y.
x<-c(0:10)
plot(x,y, pch=15, cex=1.5,col="red")
R will give the following graph.
A Hypergeometric Problem
where N = k + m
But luckily, we are using R, so let us have R worry about the formula.
Let X be the number of rabbits we could find in the trap; then X would
range from 0 to 5. In R we write 0:5 for this range. Let k be the number of
gray rabbits in the population (Group 1, k = 15). Let m be the number of
brown rabbits in the population (Group 2, m = 10). Let n be the size of our
sample, n = 5. This is all we need to find the probabilities using the
hypergeometric function dhyper() in R. We code as follows:
dhyper(0:5,15, 10, 5)
R will return the following:
[1] 0.004743083 0.059288538 0.237154150 0.385375494 0.256916996
0.056521739
Again, I have underlined for clarity.
Using only 4 decimal places, the first value, 0.0047 is the probability that
we find no gray rabbits in the catch. The second value, 0.0593 is the
probability we find one gray rabbit, etc. These six values are the probability
distribution of X.
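Two quick checks can confirm that these six values behave like a probability distribution. The sketch below rounds them to 4 decimal places, confirms that they total 1, and recomputes the first value by direct counting. The name probs is introduced only for this check.

```r
probs <- dhyper(0:5, 15, 10, 5)   # x = 0..5, k = 15 gray, m = 10 brown, n = 5

round(probs, 4)   # the rounded values quoted in the text
sum(probs)        # a probability distribution must total 1

# P(X = 0) by direct counting: all 5 trapped rabbits are brown.
choose(10, 5) / choose(25, 5)
```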
We will now code this data frame into R. We will call it Rabbits.
Rabbits<-data.frame(X = c(0,1,2,3,4,5), ProbX =
c(0.0047,0.0593,0.2372,0.3854,0.2569,0.0565))
We now call up our data frame to see what it looks like.
Rabbits
We will now plot a bar graph using ProbX, written as Rabbits$ProbX (more
about graphs in the next chapter). The values of X, that is, 0, 1, 2, 3, …
will be used as the categories or the names of the bars. This we will do using
the names.arg option, names.arg = c("0", "1", "2", "3", "4", "5").
barplot(Rabbits$ProbX, names.arg = c("0", "1", "2", "3", "4", "5"),
col="darkred")
R’s output will be
User-Defined Functions
A program is nothing but a set of instructions. When we write and save a
program, we are ensuring that we will not have to write the same set of
instructions over and over every time we must do the same task. A program
can be very simple or very complex. When we write a function, a macro, a
print instruction, a worksheet template, we are programming. In this section,
we are going to visit one of the most important aspects of programming in R
and that is writing your own functions. One of R’s prime strengths is that the
user can write user-defined functions that can expand the scope of the program.
Vector1<-c(5, 6, 7)
Vector1
[1] 5 6 7
mean(Vector1); var(Vector1)
Vector1<-Vector1 + 1
Vector1
[1] 6 7 8
What will be the mean and variance of this new Vector1? And if we add 1
to this new Vector1, what will be the mean of the new vector? We will add 1
to Vector1 five times and each time we will find the mean of the new
Vector1.
To do this we will use a while loop. This loop uses the while() statement,
whose body is enclosed in curly brackets. A loop is a statement that keeps
running as long as a condition is satisfied.
The syntax of a while loop is while (this condition is true) {execute this
statement}.
Vector1<-c(5,6,7,8)
while(Vector1[1]<=10) {cat("mean=",mean(Vector1),"\n");Vector1<-Vector1+1}
Now, let us look at the code in detail. The first line, of course, creates
Vector1 and assigns it the values of 5, 6, 7, and 8 using the combine function
c().
The second line sets up the condition. This condition is that Vector1[1]
<=10. Remember that we are adding 1 to Vector1 repeatedly. Therefore,
the first time we add 1, the elements of the new vector will be 6, 7, 8, and 9.
The second time we add 1, we will have a new vector, 7, 8, 9, 10. The fifth
time we add 1, the vector will be 10, 11, 12, 13. R will give us the
mean of the original vector and of each new vector we create. But how would R
know when to stop calculating means? This is the job of the condition. The
condition tells R to continue giving us the mean of each new vector as long as
Vector1[1]<=10, that is, as long as the first element in the vector is less
than or equal to 10. R will keep checking the first element of each new vector
to determine if it is less than or equal to 10. If it is, then R will give us
the mean of that vector. On the sixth time adding 1, the first element will be
11. Therefore, R will stop giving us the mean and the loop will end.
The cat() function joins its arguments together and prints them, much like a
string version of the combine function c(). It joins together the string
“mean=” and the actual value of the mean that comes from mean(Vector1). The
comma that follows brings on the third thing that the cat() function joins,
“\n”. After R has written the mean for the first vector, the “\n” tells R to
go to a new line. The semi-colon ends the cat() call, but we are still in the
while loop. So, R has given us the mean for the first vector; now we are
adding 1 to the first vector to create a second vector. The next statement
tells R to add 1 to the vector it has just found the mean for. The statement
says, “the new Vector1 is the old Vector1 + 1”. R then checks the condition
again and, if it still holds, goes down to a new line and gives us the mean
for this new Vector1. It will keep doing this until the first element of the
new vector is greater than 10, then the while loop will end.
Below is the code written again and R’s response.
> vector1<-c(5,6,7,8)
> while(vector1[1]<=10){cat("mean=",mean(vector1),"\n");vector1<-vector1+1}
mean= 6.5
mean= 7.5
mean= 8.5
mean= 9.5
mean= 10.5
mean= 11.5
To have the while loop return the variance of the vector as well as its mean, we
code as shown below.
vector1<-c(5,6,7,8)
while(vector1[1]<=10){cat("mean=",mean(vector1),"variance=",
var(vector1),"\n");
vector1<-vector1+1}
In the example below, we will create a variable for the mean of a vector
x, which we will call “center”, and a variable for the variance of the vector
x, which we will call variance. We will write the if-else statement that will
output the mean and variance of a given vector if the vector is numeric. If
the vector given is not numeric, then R will output the message, “Vector
must be of numeric type.”
This whole process we will put together in a function which we will call
MyFunction2. Remember, you must use function(x) to create MyFunction2.
After this assignment, you will write the body of the function you created
between curly brackets, that is, the if-else statement and other assignments.
MyFunction2<-function(x)
{
# compute the mean and variance only when the input is numeric
if(is.numeric(x)) {
center<-mean(x)
variance<-var(x)
cat("Mean:", center, "\n", "Variance:", variance, "\n")
} else print("Vector must be of numeric type.")
}
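Once a function like this is defined, it can be called like any built-in function. The sketch below repeats the definition so that the block runs on its own and then tries one numeric and one non-numeric vector. Placing the mean and variance inside the numeric branch is an arrangement assumed here so that non-numeric input never reaches mean() or var(). The test vectors are arbitrary.

```r
MyFunction2 <- function(x) {
  if (is.numeric(x)) {
    center <- mean(x)       # computed only for numeric input
    variance <- var(x)
    cat("Mean:", center, "\n", "Variance:", variance, "\n")
  } else {
    print("Vector must be of numeric type.")
  }
}

MyFunction2(c(5, 6, 7))         # numeric input: mean 6, variance 1
MyFunction2(c("a", "b", "c"))   # character input: prints the message
```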
Write a function that, when given a vector of values, will return the
Median,
First Quartile
Third Quartile
Variance
Standard Deviation
Mean Absolute Deviation
Upper Fence (RUB)
Lower Fence (RLB)
Midrange
Mid-quartile
Interquartile Range (IQR)
of the vector.
And will also give a warning message if the input vector is non-numeric.
Chapter 7: Basic Graphs of Statistics
Dot Plots
The first graph we will visit is the dot plot. Dot plots are used mainly for
quantitative variables. There are two kinds of dot plots: the Wilkinson Dot
Plot and the Cleveland Dot Plot.
The Wilkinson Dot Plot
In the Wilkinson plot, the horizontal axis is a scale for the quantities. The
numerical value of each measurement in the dataset is located on the
horizontal scale by a dot. When data values repeat, the dots are stacked
vertically above the scale value. A dot plot of random values is shown below.
Figure 7.1
To create a Wilkinson dot plot with R, we use the stripchart() function.
To demonstrate this, we will use the following example. Twenty people were
asked how many times they had ever visited a museum. Their
answers are given below.
1 3 1 4 2 5 1 1 2 1
4 1 1 2 1 2 1 2 1 2
We will call the data set museum and enter it directly into R
via the keyboard. Then we will create the dot plot with the stripchart()
function.
museum<-c(1,3,1,4,2,5,1,1,2,1,4,1,1,2,1,2,1,2,1,2)
stripchart(museum, main="Museum", method="stack",
pch=16, col="blue")
This will create the dot plot as shown below.
The option method="stack" ensures that dots that represent observations
of the same value are stacked vertically. Had the option not been used,
dots for the same value would be plotted on top of one another and the
multiplicities would not be seen.
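A frequency table is a handy check on a stacked dot plot, since each stack should be as tall as the matching count. The sketch below uses table() for this purpose; the name counts is introduced only for the example.

```r
museum <- c(1,3,1,4,2,5,1,1,2,1,4,1,1,2,1,2,1,2,1,2)

counts <- table(museum)   # how many people gave each answer
counts

# Each stack in the dot plot should be as tall as the matching count.
stripchart(museum, method = "stack", pch = 16, col = "blue", main = "Museum")
```

Here ten people answered 1 and six answered 2, so those two stacks should hold ten and six dots.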
The Cleveland Dot Plot
If you have a group of labeled values on a horizontal scale you can use
the dotchart() function to create a dot plot of the values. We have the
following data, which have been entered into R as a data frame called
windspeed, with two variables (or vectors): windspeed$CapitalCities and
windspeed$CurrentWindSpeed.
The following code will create the Cleveland dot plot for the dataset.
dotchart(windspeed$CurrentWindSpeed,
labels=windspeed$CapitalCities,
cex=.8, main="Current Wind Speed for 10 Capital Cities",
xlab="Wind Speed (km/hr)")
This will give the following dot plot.
Stem-and-Leaf Plots
The stem-and-leaf plot is a type of graph that classifies items according to
their most significant numerical digits. This plot is a simple plot that serves
as a first-glance graph. However, it gives an idea as to the contours of the
distribution. The stem-and-leaf plot is created by the stem() function. The
parameter passed is a numerical vector of values. For example, suppose we
want to create a stem-and-leaf plot of the following data, which is waiting
times in minutes at a bank ATM (automatic teller machine):
I can read your mind on this. You are asking: How do I read this?
Let us take the second row as an example.
2|49
The stroke | is where the decimal point would go. Thus, 2|49
means we have the first value of 2.4 and another value of 2.9.
8|63 gives us two values: 8.6 and 8.3.
Pie Charts
The data table below is the mint date and number bearing that mint date
of a sample of 2000 pennies.
Now we use the pie() function to create a pie chart with rainbow colors
with the labels as per the table.
pie(pennies$Number,col=rainbow(length(pennies$Number)),
labels=pennies$MintDate)
R will output the graph below in the Plots window
Now there are times when we might want to use colors ideal for black
and white print, use percentages or proportions to compare the categories,
and use a legend. The code below will do this. Here we first define the
range of grayscale colors, then convert the number of pennies to a
percentage of the total rounded to 1 decimal place, and then use
the paste() function to append the “%” symbol to the numbers.
Don’t worry, just write the code as you see it here and you will get the
graphical results.
colors<-c("white","grey70","grey50","grey90","black")
pennies_labels<-round(pennies$Number/sum(pennies$Number)*100,1)
pennies_labels<-paste(pennies_labels,"%",sep="")
pie(pennies$Number,col=colors,labels=pennies_labels,
main="Percentage of Pennies with Given Mint Dates",
cex=.8)
legend(1.0,0.3,pennies$MintDate,cex=.8,fill=colors)
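To see what the label-building steps produce, the sketch below runs them on a small stand-in data frame. The mint dates and counts are hypothetical and are not the values from the table; only the column names MintDate and Number follow the text.

```r
# Hypothetical counts for a sample of 2000 pennies (not the book's table).
pennies <- data.frame(MintDate = c("1970s", "1980s", "1990s", "2000s", "2010s"),
                      Number   = c(150, 350, 500, 600, 400))

pct <- round(pennies$Number / sum(pennies$Number) * 100, 1)   # percentages
pennies_labels <- paste(pct, "%", sep = "")                   # append "%"
pennies_labels
```

With these counts the labels come out as "7.5%", "17.5%", "25%", "30%", and "20%".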
We can create a 3-D pie chart with the pie3D() function from the plotrix
package. First, we install the package. In the RStudio Plots window,
click on the Packages tab and look down the list to see if the plotrix package
is already installed. If it is, just click in its check box and write the
code beginning with library(plotrix) below. If it is not, then click on
Install.
R will install the package, after which you will see it appear in the User
Library. Find plotrix in the User Library and click in the check box at its
left. Now you can use the package. Now write
the code to plot the chart.
library(plotrix)
pie3D(pennies$Number,labels=pennies$MintDate,explode=0.1,
col=rainbow(5),main="Pennies with Given Mint Dates")
R now outputs the following
Bar Graphs
A bar graph is a commonly used graph in statistics mainly because of the
ease with which it can be read. The height of each bar is proportional to
the amount of data in that category. To create a bar graph in R we use the
barplot() function.
We are going to produce a bar graph of the pennies data. Here we use the
rainbow of colors and use names.arg=pennies$MintDate to label the
categories.
barplot(pennies$Number,names.arg=pennies$MintDate, col=rainbow(5),
main="Number of Pennies with Mint Date as Given")
R’s output will be
We can produce the same bar graph but with a horizontal orientation
simply by including the option horiz=TRUE in our code line.
barplot(pennies$Number,names.arg=pennies$MintDate,
col=rainbow(5),
main="Number of Pennies with Mint Date as Given",horiz=TRUE)
Figure 7.8
To rotate the category labels (the years scale on the left), we simply add
the las=2 option.
barplot(pennies$Number,names.arg=pennies$MintDate,
col=rainbow(5),
main="Number of Pennies with Mint Date as Given", horiz=TRUE,
las=2)
Histograms
A histogram is a summary graph much like the bar graph. It shows a
count of the data points falling within various ranges. It gives a rough
approximation of the frequency distribution of the data. The groups or
classes of data are called “bins”, as they are like containers that accumulate
data according to the frequency of that data class.
We can create a simple histogram with the hist() function.
The command
hist(pennies$Number)
will give the following graphical output:
Figure 7.10
Figure 7.12
Box Plots
A box plot provides a graph of the median, quartiles, maximum and
minimum of a data set. This graph can display a lot of information on one
plot. You can create a simple plot or a more complex plot of categories in
the dataset.
The basic command is boxplot() and to this, we can add axis labels, a
main label, color, etc. like the options in any other graphing function. Let us
create a simple box plot for the data set below.
(23, 25, 27, 30, 31, 32, 35, 36, 45, 47, 49, 51, 53)
We will call the vector z.
>z<-c(23, 25, 27, 30, 31, 32, 35, 36, 45, 47, 49, 51, 53)
>boxplot(z)
Figure 7.13
The plot will automatically show any outlier. By default, R extends each
whisker only as far as the most extreme data value within 1.5 times the IQR
of the box, so when there is an outlier, the whisker does not reach the
maximum or minimum and the outlier is plotted as a separate point. If we
want the whiskers to cover the full range, we can use the range option. If we
set range=0, then we will get the full range. Let us add an outlier to the
data set.
>z[14]<-138
>boxplot(z,col="lightblue")
Figure 7.15
The whisker range is automatically limited to 1.5(IQR), and so the outlier is
shown as a separate point. If we want to show the full range with 138 being
the maximum score, we use the range=0 option.
>boxplot(z,range=0)
Figure 7.16
Here the outlier is not shown separately, and R uses the full range of
the data.
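The numbers behind the default box plot can be inspected with boxplot.stats(), which uses the same 1.5 rule. The sketch below applies it to the vector z with the outlier included; the name b is introduced only to hold the result.

```r
z <- c(23, 25, 27, 30, 31, 32, 35, 36, 45, 47, 49, 51, 53, 138)

b <- boxplot.stats(z)   # the numbers behind the default box plot
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # values plotted as separate outlier points

# Upper limit for the whisker: upper hinge + 1.5 * (hinge spread).
b$stats[4] + 1.5 * (b$stats[4] - b$stats[2])
```

For these data the upper limit works out to 77.5, so the upper whisker stops at 53, the largest value below the limit, and 138 is flagged as an outlier.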
We can create three separate box plots side by side: one for Mazda, one
for Nissan, and one for Toyota with the following code:
Mazda<-c(38.5, 33.6, 41.8, 46.4, 46.0, 48.7)
Nissan<-c(34.6, 36.5, 31.5, 30.8, 35.1, 36.1)
Toyota<-c(40.7, 38.2, 38.4, 38.1, 46.7, 39.6)
mpg<-data.frame(Mazda, Nissan, Toyota) # Remember, highlight the lines, then RUN
Now we call up mpg to see its form
mpg
Figure 7.17
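When boxplot() is given a data frame, it draws one box for each column, which is one way to obtain the three side-by-side plots. In the sketch below, the title, axis label, and colors are illustrative choices and are assumptions, not the original settings.

```r
Mazda  <- c(38.5, 33.6, 41.8, 46.4, 46.0, 48.7)
Nissan <- c(34.6, 36.5, 31.5, 30.8, 35.1, 36.1)
Toyota <- c(40.7, 38.2, 38.4, 38.1, 46.7, 39.6)
mpg <- data.frame(Mazda, Nissan, Toyota)

# One box per column; the labels and colors are illustrative choices.
boxplot(mpg, main = "Comparison of Three Makes", ylab = "mpg",
        col = c("lightblue", "lightgreen", "lightyellow"))
```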
Exercises 7
1. Create a Wilkinson dot plot using the GPA data given below.
3. Use the Tire data given below to create side-by-side box plots. These
are the stopping distances for thirty cars, ten equipped with Michelin tires,
ten with Goodyear, and ten with Firestone.
Chapter 8: Basic Methods of Statistical Analysis
In this chapter, we will begin to explore descriptive statistics. We will test
hypotheses and answer questions about variables and their interrelationships.
Descriptive Statistics
We will begin with the summary() function. We have seen this function
before. Now we will look a little deeper into its meaning. For a numerical
vector, the summary() command provides the minimum, first quartile, median,
mean, third quartile, and maximum; for a factor, it gives the frequency of
each level. Let us create a numerical vector.
x<-c(12,23,34,45,56,67,78,89,90)
Now we ask R to send us a summary of the vector x.
summary(x)
R then tells us:
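The six values reported are the minimum, first quartile, median, mean, third quartile, and maximum. The following is a minimal sketch, using only base R, that prints the summary and then reproduces each value with its own function; the name s is chosen here for illustration.

```r
x <- c(12, 23, 34, 45, 56, 67, 78, 89, 90)

s <- summary(x)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s

# Each entry can be reproduced individually
min(x)
max(x)
median(x)
mean(x)
quantile(x, c(0.25, 0.75))   # first and third quartiles
```

For this vector the minimum is 12, the quartiles are 34 and 78, the median is 56, the maximum is 90, and the mean is 494/9, or about 54.89.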
Insects
The data frame below will appear in the Console window after you
highlight the line and hit RUN.
Now call up the summary() function to find the summary numbers for
[Link] (column 2) and [Link].1 (column 3) of the Insects
data frame.
summary(Insects[2:3])
Means
The function colMeans(Insects[2:3]) will give the mean (the arithmetic
or common average) of the second and of the third columns. Note the capital
I: R is case-sensitive, so insects and Insects are different names.
colMeans(Insects[2:3])
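The indexing Insects[2:3] keeps only columns 2 and 3, which matters because colMeans() needs numeric columns. The following is a minimal sketch with a small stand-in data frame; insects_demo, its column names, and its values are invented for illustration and are not the book's Insects data.

```r
# Hypothetical stand-in for the Insects data frame (invented values)
insects_demo <- data.frame(
  Group  = c("A", "A", "B", "B", "C"),
  Length = c(2.1, 2.4, 3.0, 2.8, 2.6),
  Width  = c(0.9, 1.1, 1.4, 1.2, 1.0)
)

insects_demo[2:3]                          # columns 2 and 3 only
col_means <- colMeans(insects_demo[2:3])   # one mean per column
col_means

# colMeans(insects_demo) would stop with an error,
# because the Group column is not numeric
```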
Median
Suppose we have 5 measurements and arrange them in numerical order
from the smallest to the largest (this is called a distribution of the data), then
the number at the center will be our median. For example, let’s say we have
the following dataset in numerical order:
11, 16, 21, 32, 43
Our median will be the measure 21.
If instead, we had a dataset of 6 numbers:
11, 16, 21, 32, 43, 62, our median will be the average of the two middle
numbers, 21 and 32. Thus, our median will be (21+32)/2 = 26.5.
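Both cases can be checked with the median() function. The following is a minimal sketch; the vector names odd_set and even_set are chosen here for illustration.

```r
odd_set  <- c(11, 16, 21, 32, 43)
even_set <- c(11, 16, 21, 32, 43, 62)

median(odd_set)    # 21, the single middle value
median(even_set)   # 26.5, the average of 21 and 32

# median() sorts internally, so the input order does not matter
median(c(43, 11, 32, 16, 21))   # 21
```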
Quartiles
Quantiles are numbers that partition, or divide, an ordered data set into
equal parts. As an example, let us begin with a distribution. All the numbers
of the distribution represent 100%.
Now let us find the number that divides the distribution into two parts of
25% of the distribution on the left and 75% on the right. This means that this
number is larger than 25% of the other numbers in the distribution and
smaller than 75% of them. This number is called the First Quartile (Q1) of
the distribution.
Now, let us find the number that divides the distribution into two equal
parts of 50% each. This means that this number is larger than 50% of the
other numbers in the distribution and smaller than 50% of them. This
number is called the Second Quartile, or the Median of the distribution.
Let us now find the number that is larger than 75% of the other numbers
and smaller than 25% of them. This number is called the Third Quartile (Q3)
of the distribution.
The quartiles are quantiles because they divide the distribution into equal
parts – four of them.
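In R the three quartiles come from the quantile() function. The following is a minimal sketch that reuses the vector x from earlier in the chapter; the name q is chosen here for illustration.

```r
x <- c(12, 23, 34, 45, 56, 67, 78, 89, 90)

# probs gives the fractions of the distribution to cut at
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q          # Q1, the median (Q2), and Q3

IQR(x)     # interquartile range, Q3 - Q1
```

For this vector the quartiles are 34, 56, and 78. Note that quantile() offers several calculation methods through its type argument, so for small data sets its answers can differ slightly from a hand calculation that uses another convention.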
Percentiles
Here is the distribution again - all 100% of it.
Going back to our dataset Insects, we want to get the medians of the
second and third columns. Since the median function median() works on only
one vector at a time, we will use the apply() function to give us the two
medians at the same time. Its second argument, 2, tells R to apply the
function to each column.
apply(Insects[2:3],2,median)
The apply function can also give us the quartiles of the two vectors.
apply(Insects[2:3],2,quantile)
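The shape of what apply() returns depends on the function it is given. The following is a minimal sketch with a small stand-in data frame; insects_demo, its column names, and its values are invented for illustration and are not the book's Insects data.

```r
# Hypothetical stand-in for the Insects data frame (invented values)
insects_demo <- data.frame(
  Group  = c("A", "A", "B", "B", "C"),
  Length = c(2.1, 2.4, 3.0, 2.8, 2.6),
  Width  = c(0.9, 1.1, 1.4, 1.2, 1.0)
)

# MARGIN = 2 applies the function to each column (1 would mean each row)
col_medians <- apply(insects_demo[2:3], 2, median)
col_medians        # a named vector: one median per column

# quantile() returns five numbers per column, so the result is a matrix
# with one row per quantile and one column per variable
col_quartiles <- apply(insects_demo[2:3], 2, quantile)
col_quartiles
```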
Frequency Tables
When we have large univariate data sets (datasets with one variable), one
of the frequently used methods to organize and display our data is using
frequency tables. In this method, we group our data into score intervals and
then construct a frequency table. Data collected into frequency intervals are
called grouped data. Grouping data into frequency tables is an important
step in univariate descriptive statistics and a vital method in preparing data
for analysis. We will use a real-life example to motivate and instruct in the
development of a frequency table.
The largest colony of Antarctic emperor penguins can be found on the
Ross Ice Shelf in Antarctica, where, at any given time, at least 80,000 of the
birds can be found lounging on the ice.
Figure 8.1
Dr. Schottenheimer is a biologist who is studying the Ross Island Colony
of the emperor penguins. One of the summary numbers he is trying to
determine is the average weight of the adult penguin, so he has taken a
sample of 45 of the adult birds and has recorded the weights of each one of
them. Given below is a list of the recorded weights to the nearest pound of
45 adult penguins from The Ross Ice Shelf.
widths:
$$w = \frac{r}{b}$$
Where w = bin width
r = range of our data = highest data point – lowest data point
b = number of bins or class intervals to be used (this was just calculated
above).
Plugging values into our formula, we find that w = 9.
So, the bins will be 9 units wide. But where do we start? What are our
lower limits and upper limits? For the lower limit of the first interval, we use
the minimum score value. For the upper limit of interval 1, we simply add
the calculated width to the lower limit. This will give us a lower limit of 43
and an upper limit of 43 + 9 = 52 for interval 1.
Interval 2 will now begin at 53 as the lower limit, and we add the bin width
of 9 to find the upper limit. The upper limit of interval 2, therefore, is 62.
Using the same method of calculation, we find that interval 3 spans from 63
to 72; Interval 4 from 73 to 82; Interval 5 from 83 to 92; and Interval 6 from
93 to 102. We will arrange the data into a table called a frequency
distribution, shown below. This three-column table is a part of the full
seven-column frequency table.
The midpoint is the halfway point of the interval. For example, the
midpoint of interval 1 is 47.5; interval 2 is 57.5, interval 3 is 67.5, interval 4
is 77.5, interval 5 is 87.5 and interval 6 is 97.5.
The next column is the frequency column. Here, we place the number of
data points that fall into the range of each interval. For example, since
Interval 1 has a range of 43 to 52, all scores between 43 and 52 fall in the
Interval 1 range. We see that five of the penguins weigh between 43 and 52
pounds. Therefore, the frequency of Interval 1 is 5. Similarly, we find the
frequency of Interval 2 is 7; the frequency of Interval 3 is 10; the
frequency of Interval 4 is 12; the frequency of Interval 5 is 9; and the
frequency of Interval 6 is 2. Together these account for all 45 penguins.
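The binning steps above can be sketched in R with cut() and table(). The weights below are invented for illustration and are not the 45 recorded values; only the number of bins and the interval limits are taken from the text. The names weights, breaks, interval, and freq are chosen here.

```r
# Hypothetical weights in pounds; NOT the 45 recorded values from the study
weights <- c(43, 50, 52, 53, 61, 63, 70, 75, 82, 88, 93, 97)

r <- max(weights) - min(weights)   # range of the data
b <- 6                             # number of bins, as in the text
r / b                              # bin width w = r / b

# Break points chosen so the intervals are 43-52, 53-62, ..., 93-102.
# cut() uses right-closed intervals (lower, upper], so the first break is 42.
breaks <- c(42, 52, 62, 72, 82, 92, 102)
interval <- cut(weights, breaks = breaks, labels = 1:6)

freq <- table(interval)            # frequency of each interval
freq

# Midpoints: halfway between the stated lower and upper limits
lower <- breaks[-length(breaks)] + 1
upper <- breaks[-1]
midpoint <- (lower + upper) / 2
midpoint
```

The midpoints come out as 47.5, 57.5, and so on up to 97.5, matching the values listed above. The frequencies depend on the data supplied, so with the real weights they would be 5, 7, 10, 12, 9, and 2.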
We know that we have six bins or intervals. The table above of the last
six weights in the dataset is telling us that the weight of penguin #40, which
is 63 pounds, belongs to Category (Interval) 3.
Now we are going to use the table() function to organize the
PenguinWeights data into a simple one-way frequency table which we will
name Penguinfreq.
Penguinfreq<-with(PenguinWeights, table(count))
Penguinfreq
R’s response will be:
This tells us that the “1” category, or the first bin, carries a frequency of
5; the “2” category, a frequency of 7, etc.
Margin Totals
Here we can see that in AgeGroup 1, there are two female non-smokers
and one smoker, while there are no male non-smokers and one smoker.
Now, let us add margin totals to the table:
ftable(addmargins(table2))
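The following is a minimal sketch of how such a table and its margin totals can be built from raw records. The data frame survey and its values are invented for illustration; they are arranged so that AgeGroup 1 matches the counts described above.

```r
# Hypothetical survey records (invented values)
survey <- data.frame(
  AgeGroup = c(1, 1, 1, 1, 2, 2, 2, 2),
  Gender   = c("F", "F", "F", "M", "F", "M", "M", "M"),
  Smoker   = c("No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes")
)

# Three-way table: AgeGroup x Gender x Smoker
table2 <- with(survey, table(AgeGroup, Gender, Smoker))

# addmargins() appends a "Sum" level to every dimension;
# ftable() flattens the result into one readable block
with_totals <- addmargins(table2)
ftable(with_totals)

with_totals["Sum", "Sum", "Sum"]   # grand total = number of records
```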
Tests for Association
newtable
Now we run the chi-square test for independence on the table data.
chisq.test(newtable)
to which we will get:
The important figure is the p-value. A large p-value, roughly one that is
greater than 0.05, means the data are consistent with the null hypothesis. The
p-value = 0.7418 means we cannot reject the null hypothesis of independence.
We therefore have no evidence of an association between the variables. In
other words, we find no evidence that smoker status depends on gender.
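The following is a minimal sketch of the same test on a small two-way table. The counts and the name newtable_demo are invented for illustration and are not the book's data.

```r
# Hypothetical counts; rows are Gender, columns are smoker status
newtable_demo <- matrix(c(30, 20,
                          28, 22),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Gender = c("Female", "Male"),
                                        Smoker = c("No", "Yes")))

result <- chisq.test(newtable_demo)
result$p.value     # compare with the chosen significance level, e.g. 0.05
result$expected    # counts expected if the two variables were independent

# For 2 x 2 tables chisq.test() applies Yates' continuity correction by default
```

Comparing result$expected with the observed counts shows where any departure from independence comes from; here the two are very close, so the p-value is large.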
Now while the temperature at the different altitudes is not realistic, the
table will serve our purpose well enough.
We can now ask the question: Are the variables Altitude and
Temperature associated? We can see that as one variable increases the other
decreases, but are they doing it in such a way that if we are given the value
of one, we could predict the other with some accuracy? We could do this
only if the variables are sufficiently correlated.
Should we draw a scatter plot of the data, we would observe one of the
following:
In the first graph, the two variables increase together, and the variables are
said to be positively correlated. In the second graph, the variables move
in opposite directions – one is increasing while the other is decreasing. Here
the variables are said to be negatively correlated. In the third graph, there is
no pattern and so the variables show little or no correlation.
Correlation is quantified by the correlation coefficient, whose value
ranges from -1 to +1. A correlation coefficient of +1 means the points are in
perfect positive correlation – the points are in a perfectly straight line with
an upward slope. A coefficient of -1 means the points are in a perfectly
straight line with a downward slope – a perfect negative correlation. A
coefficient of zero means that the points show no linear pattern, and so
there is no linear correlation between the variables. The closer to +1 or -1 the
coefficient is, the stronger the association between the variables – either
negatively or positively. Thus, the coefficient of correlation is a measure of
the strength of association between numerical variables.
In plotting our graphs, you might have noticed that we placed Altitude on
the horizontal (X) axis and Temperature on the vertical (Y) axis. Could we
have placed Temperature on the horizontal axis instead? Well generally, you
want to place the independent variable on the X-axis and the dependent
variable on the Y. How do we know which is the independent and which is
the dependent? In our Mt Everest case, we were able to select the altitudes at
which to take our readings of temperature. However once the altitude was
selected, the temperature at that altitude is fixed – we don’t get to select the
temperature at that altitude. The temperature depends on the altitude we
select. As such, we say that Temperature is the dependent (Y) variable and
Altitude is the independent (X) variable. This distinction is important for
plotting and for prediction. The correlation coefficient itself, however, is
symmetric: had we placed Temperature on the (X) axis, its value would be
the same. So how do we find the correlation coefficient with RStudio?
We have created an Excel spreadsheet with our Mt. Everest data and have
imported it into RStudio, where we have named it MtEverest1. We attach the
MtEverest1 file and then simply use the cor(x,y) function. The cor(x,y)
function takes the two variables to be tested, the independent (X) and the
dependent(Y) as its arguments. We code thus,
attach(MtEverest1)
cor(Altitude,Temp)
R will return the correlation coefficient:
[1] -0.94612
This tells us that the two variables, Altitude and Temperature, have a very
strong negative correlation or association. This means that as one variable
increases, the other decreases in such a way that, given the value of the
independent variable, we could predict the corresponding value of the
dependent variable with some accuracy.
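The following is a minimal sketch of the calculation without attach(), using vectors typed in directly. The readings are invented for illustration and are not the MtEverest1 data; the names r_xy and r_yx are chosen here.

```r
# Hypothetical readings (invented values)
Altitude <- c(1000, 2000, 3000, 4000, 5000, 6000)
Temp     <- c(  15,    9,    4,   -4,   -9,  -17)

plot(Altitude, Temp)          # independent variable on X, dependent on Y

r_xy <- cor(Altitude, Temp)   # Pearson correlation by default
r_yx <- cor(Temp, Altitude)   # swapping the arguments gives the same value
r_xy
all.equal(r_xy, r_yx)
```

The order of the arguments affects only which variable we think of as X; the coefficient is the same either way.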
Suppose we did a correlation study for two variables, IQ and the number
of beers drunk per day (Beers), and got a coefficient of 0.45346. This
number does not indicate a strong correlation, but is it significant enough to
deem the two variables correlated? To conduct such a test of the significance
of the correlation coefficient, we use the cor.test(x, y) function.
cor.test(IQ, Beers)
We will get in response,
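The result of the test is an object whose parts can be read off by name. The following is a minimal sketch; the IQ and Beers values are invented for illustration, so the numbers it produces are not those of the study described above.

```r
# Hypothetical data (invented values)
IQ    <- c(95, 100, 102, 108, 110, 115, 120, 125)
Beers <- c( 1,   3,   0,   2,   4,   1,   3,   2)

ct <- cor.test(IQ, Beers)   # Pearson's product-moment correlation by default
ct$estimate    # the sample correlation coefficient
ct$p.value     # small values are evidence against a true correlation of 0
ct$conf.int    # 95% confidence interval for the correlation
```

If the p-value is below the chosen significance level (commonly 0.05), the correlation is judged significantly different from zero.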
Tests of Significance
With a p-value of 0.7143, larger than our alpha = 0.05 (the 5% significance
level), we fail to reject the null hypothesis that the mean is not less
than 16.5.
The p-value (0.0321) is smaller than 0.05, which means that we reject the
null hypothesis and conclude in favor of the alternative hypothesis that the
average thorax length for Group 2 is greater than that of Group 1.
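Conclusions of this kind are typically reached with the t.test() function. The following is a minimal sketch of a one-sample call and a two-sample call. All measurements are invented for illustration; only the hypothesized mean of 16.5 is taken from the text, and the object names are chosen here.

```r
# Hypothetical measurements (invented values)
lengths_demo <- c(16.8, 17.1, 16.4, 16.9, 17.3, 16.6, 17.0, 16.7)

# One-sample test. H0: mean is not less than 16.5; H1: mean is less than 16.5
one_sample <- t.test(lengths_demo, mu = 16.5, alternative = "less")
one_sample$p.value

# Two independent groups. H1: the mean of group2 is greater than that of group1
group1 <- c(2.1, 2.3, 2.0, 2.4, 2.2, 2.1)
group2 <- c(2.5, 2.7, 2.4, 2.8, 2.6, 2.9)
two_sample <- t.test(group2, group1, alternative = "greater")   # Welch test by default
two_sample$p.value
```

The alternative argument states the direction of the alternative hypothesis for the first sample relative to mu or to the second sample.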
We have imported a file called Windspeed2 which gives the wind speed
in several capital cities taken at two different times of the year: January
([Link].1) and August ([Link].2). In this situation, we have two
groups of measurements, but they both come from the same city – just that
they are taken at different times. This means that the groups are dependent,
and we must use a paired t-test to test hypotheses on these groups. We want
to determine by a paired t-test, whether the average wind speed for time 1
([Link].1) is less than that of time 2 ([Link].2). To do this, we use
the option paired=TRUE in the t.test() function.
We check the Global Environment window (upper right) to see if the file
Windspeed2 is in our workspace. Once it is there, we “attach” it, which
brings it up to our desk, so to speak, and lets us refer to its columns by name.
attach(Windspeed2)
t.test([Link].1, [Link].2, paired=TRUE, alternative="less")
The exceedingly small p-value (9.31 × 10^-6) tells us that we must reject
the null hypothesis that the speeds are equal, and the evidence supports the
alternative hypothesis that the wind speed at Time 1 is less than the wind
speed at Time 2.
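The following is a minimal sketch of a paired test on vectors typed in directly. The wind speeds and the names january and august are invented for illustration and are not the Windspeed2 data.

```r
# Hypothetical wind speeds for six cities (invented values)
january <- c(10.2, 8.5, 12.1, 9.4, 11.0, 7.8)
august  <- c(12.0, 9.9, 13.8, 11.1, 12.4, 9.5)

# Paired test. H1: the January mean is less than the August mean
paired_result <- t.test(january, august, paired = TRUE, alternative = "less")
paired_result$p.value

# A paired test is a one-sample test on the within-city differences
diff_result <- t.test(january - august, mu = 0, alternative = "less")
all.equal(paired_result$p.value, diff_result$p.value)
```

The second call shows why pairing matters: the test works on the difference for each city, so city-to-city variation does not mask the seasonal effect.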
Nonparametric Tests
If we have more than two groups, and the groups are independent, we can
apply the Kruskal-Wallis test with the function kruskal.test(). After
grouping your data, the test is applied with the function
kruskal.test(dependent variable ~ grouping variable, data)
Remember to first group your data before applying the Kruskal-Wallis
test.
Example:
We have recorded the daily rainfall in Trinidad and Tobago for each day
in October, November, and December. We want to determine whether the
distributions of rainfall for October, November, and December are identical.
We use a non-parametric test because we will not be assuming that the
distributions are normal. We simply want to know if they are identical
without assuming normal parameters. We will, therefore, use the
Kruskal-Wallis test (sometimes called a non-parametric ANOVA), by using the
kruskal.test() function.
Below is a part of the dataset of rainfall for October, November, and
December in a file called rainfall
kruskal.test(`Rainfall(in)`~Month, data=rainfall)
With such a small p-value, we reject the null hypothesis and conclude
that the distributions of rainfall for October, November, and December are
not identical.
A drawback is that the test does not tell you just how they differ from
each other. This question can be answered by using a Mann-Whitney U test
or by running multiple comparison tests such as Scheffé's or Tukey's. A package
called npmc can provide these multiple comparison tests.
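The following is a minimal sketch of the test followed by pairwise Mann-Whitney comparisons, using pairwise.wilcox.test() from base R rather than an add-on package. The rainfall values and the names rainfall_demo and Rain are invented for illustration and are not the book's data.

```r
# Hypothetical daily rainfall in inches (invented values)
rainfall_demo <- data.frame(
  Rain  = c(0.1, 0.3, 0.2, 0.5, 0.4,
            0.9, 1.2, 0.8, 1.1, 1.0,
            1.6, 1.9, 1.7, 2.1, 1.8),
  Month = factor(rep(c("October", "November", "December"), each = 5))
)

# Kruskal-Wallis test: are the three distributions identical?
kw <- kruskal.test(Rain ~ Month, data = rainfall_demo)
kw$p.value

# Follow-up: pairwise Wilcoxon rank-sum (Mann-Whitney) tests,
# with p-values adjusted for multiple comparisons (Holm by default)
pw <- pairwise.wilcox.test(rainfall_demo$Rain, rainfall_demo$Month)
pw$p.value
```

The matrix in pw$p.value has one entry per pair of months, which shows which months differ from which.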
We have seen some of the features of R with RStudio that we used for
doing basic descriptive and inferential statistics. Armed with this
knowledge, you could do a lot of statistical analysis. But this knowledge is
only the basics of both R and RStudio. They can do a lot more, and I hope
that your appetite has been stimulated enough to make you want to
investigate further into statistics and R/RStudio.
Learning Resources and References
Chang, Winston (2013). R Graphics Cookbook. O’Reilly Media Inc.
Crawley, Michael J. (2013). The R Book. John Wiley and Sons Inc.
Zar, Jerrold H. (1999). Biostatistical Analysis, 4th Ed. Prentice-Hall Inc.
Note at the End of the Book
You have made it to the end of the book, and I applaud you and thank
you for sticking with me. If you enjoyed the learning experience, won’t
you please take a minute to leave a review at your favorite retailer?
Many thanks,
RCH