R Markdown Basics
R Markdown Basics
KENNEDY
1 Introduction 5
2 Why R? 7
4 R Markdown 17
4.1 Fixing Errors in an R Markdown file 17
4.2 The Components of an R Markdown File 18
4.3 R Markdown Chunk Options 22
4.4 General Guidelines for Writing R Markdown Files 22
4.5 Help -> Cheatsheets 23
4.6 Spell-check 23
7 Concluding Remarks 39
References 39
1
Introduction
This book was written to give people who are new to R, RStudio, and R Markdown the tools
they need to begin making their own research reproducible. R is an open-source program-
ming language that has seen its popularity grow tremendously in recent years, with developers
adding new functionality via packages on a daily basis. RStudio is a graphical development
environment that makes it easier to write and view the results of R code, and R Markdown
provides an easy way to produce rich, fully-documented, reproducible analyses.
Screenshots and screencasts (with no audio) are used throughout to illustrate key concepts,
but if you need further clarification on these or any other aspect, please create a GitHub issue
or email me with a reference to the area where more guidance is necessary. Pull requests on
GitHub for typos or improvements are also welcome, and you can easily do so by clicking on
the Edit button near Search at the top of the HTML version of the book.
In the HTML version, you can download the book as a PDF by clicking on the PDF button
in the toolbar at the top of the page. Given the heavy use of screencasts, HTML is the recom-
mended format for most readers, but the PDF is available for those who need it. Links to the
different YouTube videos directly found in the HTML version are provided in the PDF ver-
sion. You can also download the video files directly. Note that no audio is attached to these
screencast videos.
This book will evolve and be updated as needed based on reader feedback. You can see when
the book was last updated below.
I strongly recommend that you use R version 3.3.0 or higher, RStudio Desktop version 1.0
or higher, and rmarkdown R package version 1.0 or higher. This will ensure that your setup
matches the screenshots and recordings, making it easier to follow along. Additionally, you may
find that videos don’t load sometimes. I haven’t had any problems using Google Chrome and
recommend that as your browser to view this book if you have trouble with other browsers.
This book was last updated by [Link] on Tuesday, November 12, 2019 16:08:10 MST.
2
Why R?
If you are new to R and programming, you may be intimidated by the idea of writing code.
You probably aren’t used to having to type commands to tell the computer what to do. You
may be more comfortable using drop-down menus and other graphical user interfaces that
allow you to pick what you’d like to do. So, why are so many companies, colleges, universities,
and individuals of all disciplinary backgrounds shifting towards using R?
There are many answers to this question, but some of the most important are:
One of the biggest perks of working with R and RStudio is that both are available free of
charge. Whereas other, proprietary statistics packages are often stuck in the dark ages of
development (the 1990s, for example), and can be incredibly expensive to purchase, R is a
free alternative that allows users of all experience levels to contribute to its development.
As many scientific fields embrace the idea of reproducible analyses, proprietary point-and-
click systems actually serve as a hindrance to this process. If you need to re-run your anal-
ysis using one of these systems, you’ll need to carefully copy-and-paste your results into
your text editor, potentially from beginning to end. As anyone who has done this sort of
copy-and-pasting knows, this approach is both prone to errors and incredibly tedious.
If, on the other hand, you use the workflows described in this book, your analyses will be
reproducible, thus eliminating the copy-and-paste dance. And, as you can probably guess,
it is much better to be able to update your code and data inputs and then re-run all of
your analysis with the push of a button than to have to worry about manually moving your
results from one program to another. Reproducibility also helps you as a programmer, since
your most frequent collaborator is likely to be yourself a few months or years down the road.
Instead of having to carefully write down all the steps you took to find the correct drop-
down menu option, your entire code is stored, and immediately reusable.
This approach also helps with collaboration since, as you will see later, you can share a
single R Markdown file containing all of your analysis, documentation, comments, and code
with others. This reduces the time needed to work with others and reduces the likelihood of
errors being made in following along with point-and-click analyses. The mantra here is to
Say No to Copy-And-Paste! both for your sanity and for the sake of science.
If you have ever had an issue with software, you know how difficult it can be to find an-
swers to your questions. “How can I describe the process to someone else? Do I need to take
screenshots? Do I really need to call IT and wait for hours for someone to respond?” Be-
cause R is a programming language, it is much easier (after a bit of practice) to use Google
or Stack Overflow to find answers to your questions. You’ll be amazed at how many other
users have encountered the same sorts of errors you will see when you begin.
I frequently (almost daily) Google things like “How do I make a side-by-side boxplot in R
coloring by a third variable?”. You’ll become better at working with R by reaching out to
others for help and by answering questions that others have. In addition, Chapter 6 de-
scribes many common errors and explains how to fix them.
We all know that learning isn’t easy. Do you have trouble remembering how to follow a list
of more than 10 steps or so? Do you find yourself going back over and over again because
you can’t remember what step comes next in the process? This is extremely common, espe-
cially if you haven’t done the procedure in awhile. Learning by following a procedure is easy
in the short-term, but can be extremely frustrating to remember in the long-term. If done
well, programming promotes long-term thinking to short-term fixes.
One unfortunate thing that we frequently take for granted is that our brain tricks us into
picking the easy route. If you truly want to learn how to do something (like programming
with R), you’ll need to feel frustrated at times. Any time you learn something new, you’ve
been frustrated. We tend to forget all the frustration and only think about where we cur-
rently are. Although R still frustrates me from time to time, I grow through practice and I
look forward to the challenges. Garrett Grolemund encapsulated this phenomenon nicely in
the Prologue of his book “Hands-On Programming with R” (Grolemund, 2014):
As you learn to program, you are going to get frustrated. You are learning a new language,
and it will take time to become fluent. But frustration is not just natural, it’s actually a
positive sign that you should watch for. Frustration is your brain’s way of being lazy; it’s
trying to get you to quit and go do something easy or fun. If you want to get physically
fitter, you need to push your body even though it complains. If you want to get better at
programming, you’ll need to push your brain. Recognize when you get frustrated and see it
as a good thing: you’re now stretching yourself. Push yourself a little further every day, and
you’ll soon be a confident programmer.
3
R and RStudio Basics
3.1 What is R?
In Chapter 2, I discussed many of the reasons why you should begin doing your analyses (es-
pecially those of the data type) using R. If you skipped over that chapter in the hopes of just
diving in to learning about R, I suggest you go back and read it over carefully. As you be-
gin building fluency in working with R, it is especially important to review that introductory
chapter from time to time.
3.1.1 R beginnings
R was developed by a group of statisticians who wanted an open-source alternative to the
costly proprietary options that were (and still are) popular. Because it was created by statis-
ticians (instead of computer scientists), R has some quirky aspects to it that take some time
to get used to. We’ll see that many packages have been developed to help with this, and these
days, you don’t need an advanced degree in statistics to work with R.
Getting back to the development of R… R was created by Ross Ihaka and Robert Gentleman
in New Zealand at the University of Auckland. It is a spin-off of the S programming language
and was named partly after the first names of its developers (as you can see from the empha-
sis above). The beginning ideas for creating R came in 1992, and the first version of R was
released in 1994. You can find much more about the background of R, its features, and its
connections to the S language on Wikipedia.
3.1.2 R packages
I first learned to use R as a graduate student at Northern Arizona University from Dr. Philip
Turk in 2007. At the time, I never thought that R could have exploded in users as we have
seen since 2011. I never would have thought that students taking an introductory statistics
course would be encouraged to learn to use R.
In 2007, R was still largely an esoteric and tricky language used by statisticians to do analyses.
Getting used to the syntax for producing plots and working with data was especially tricky
10 chester ismay and patrick c. kennedy
for those with little to no programming experience. So what has changed since 2007 about
learning R?
I believe one of the biggest developments has been the creation of packages to make R easier to
work with for newbies. Packages are add-ons created by users of R to increase the functionality
of the base R installation. Packages created by Hadley Wickham and others recently have
greatly expanded the capabilities of R, while also working to make beginning with R simpler.
As of April 2017, more than 10,400 packages were available on common R repositories.1 1
You’ll see how to down-
Another great development is the graphical user interface called RStudio and a package de- load these packages via
[Link]("dplyr")
veloped by RStudio, Inc. called rmarkdown. We will discuss rmarkdown (also referred to as R and load them into
Markdown) in a Chapter 4, and will now focus on discussing RStudio. your current R working
environment via
library("dplyr"), for
example, in Chapter 5.
3.2 What is RStudio?
It is worth noting that you can’t just install RStudio Desktop without first installing R, as
RStudio needs to have R installed to run. Step-by-step guides to installing R and RStudio
Desktop with screenshots are available
• here for the Mac, and
• here for a PC.
Unless you plan to create PDF documents (which requires a multiple gigabyte download of
LaTeX), you can skip some of the later steps of the installation. It is recommended that you
select HTML as the Default Output Format for R Markdown. You’ll see more about this in
Chapter 4.
getting used to r, rstudio, and r markdown 11
The RStudio Server provides a web-based interface to run analyses in R. This means that you
will only need an internet connection and a web browser to run your analyses. Your professor
or administrator will provide you with a link to the web location of your RStudio Server. After
entering the link, you’ll see a page that looks something like:
After logging in with your username and password, you should see a layout similar to what
follows.
For reference, a screenshot of RStudio Desktop looks similar:
This makes switching between the two RStudio set-ups painless. A discussion of each of the
different RStudio panes and their corresponding tabs is in Chapter 4. You’ll find that a lot of
what follows also applies to RStudio Desktop (except for the Shared Projects feature), but it is
always recommended to create an RStudio project regardless of whether you are on the cloud
or working locally.
12 chester ismay and patrick c. kennedy
Initially, you may be a little overwhelmed by all the different panes and tabs that are available
in RStudio, but you will soon learn to appreciate this layout. We will begin with the top left
pane and proceed clockwise. The layout of these panes can be customized, but it is recom-
mended that beginning users keep the standard layout.
Note that, when you edit the file, the tab with the file name changes from black to red, and
an asterisk appears after the file name, indicating that the latest changes have not been saved.
You should get in the habit of saving files frequently. You can have multiple tabs open and
view different files in this pane. In Chapter 4, you’ll also see that you can View datasets in this
pane.
It may not be clear what these additions are really doing just yet. That’s ok! When finished,
you’ll be pressing the Knit HTML button near the top of the pane to put all of your text,
code, and its output together. We aren’t quite there just yet though!
By default, the top right pane includes an Environment tab and a History tab. To get a
sense of what these tabs provide, we’ll need to also use the bottom left pane and the Console
tab. I’ll show you how to create multiple objects in R using the Console. Initially, you will
see that the Environment tab tells us that the “Environment is empty.” This means there are
no objects (e.g., data) yet. If you click on the History tab, you should also see a blank screen
with a few icons. We won’t go over all of these buttons here, but I encourage you to hover over
them and click on them to get a sense for what they do. As I enter code into the Console,
watch to see how the Environment and History tabs change.
You can think of the Console as a place to play around. It is your R sandbox. You can test
your code to make sure it is working and then copy that text into your Rmd file once you are
satisfied with it. We’ll see more examples of this in the chapters to come.
Note that, by default, when you enter the name of an object into the console like I did with
sum_1_2 it displays the result. You’ve also been shown what is called the assignment operator
denoted by <-. You can read this as putting the contents of the right-hand-side into an object
named whatever appears on the left-hand-side. In the example, num1 is the name of an object
that stores the value 7.
A powerful feature of the R language has been introduced in the sum_1_2 <- sum(num1,
num2) line. sum is a function. Functions are denoted by their name and then a parenthesis, fol-
lowed by one or more arguments separated by commas, and then a closing parenthesis. You’ll
see many examples of these going forward. Of course, we’ll be using R for more than just a
basic calculator shown here but this should give you an idea of what the Environment and
History tabs store.
3.4.3 Console
You’ll frequently use the Console as a way to check your work or experiment with how to
solve a problem using R. Before RStudio, most users of R just had a window like the Console
provides where they entered their commands and then looked at the results in different win-
dows. We will see that the Console and the Code Editor/View Window will allow us to
store all of our code in a file and then “run” that code through the Console to check that it
getting used to r, rstudio, and r markdown 15
It is good practice to get in the habit of naming variables corresponding to what they actually
represent. If you are dividing two different sums of numbers, you might want to choose a name
like ratio_of_sums to refer to that object. R has a few restrictions regarding what can be
included in the name of R objects:
1. Object names cannot begin with a number.
2. Object names cannot contain symbols used for mathematics or to denote other operations
native to R. These symbols include $, @, !, ^, +, -, /, and *.
Another important property of R is that is case-sensitive. You’ll see what this means in the
video below that also includes examples of invalid names of objects. Note that we will con-
tinue to work inside the R chunk as we did in the last video. You’ll see that these code chunks
provide a nice way to keep track of our important analyses.
You may have noticed that R will do some checks and alert you to potential errors by placing
a red X to the left of lines of code with common syntax errors. These checks won’t catch all
errors, but this can be helpful. Note also that Name, name, and nAme refer to three different
values.
Another thing to notice is that you can store numbers into an object with a given name like we
did earlier with num1. If we’d like to store a character string such as “Chester”, we can provide
the name of the object on the left-hand-side of <- and the string in quotation marks on the
right-hand-side of <-. There are more complex object types that you will see in Chapter 5, but
it is always important to think about the difference between a number and a character in R.
It’s also a good idea to not call objects the names of functions that are built into R. You may
want to call the addition of two numbers sum and R will allow this, but it is HIGHLY recom-
mended that you create more descriptive names and not choose to name objects the same as
common R functions. Something like sum_densities is better and less likely to be the name of
a function.
Of course, you won’t know what all of the names of built-in functions are until you practice,
but it is something to think about. If you are ever wondering if a function is built-in or is
in a package you have included, you can use the ? function in the Console to check. Some
examples are shown below as a screencast.
16 chester ismay and patrick c. kennedy
Files
The leftmost tab here shows the file and folder structure for the current working directory. In
an RStudio project, this is the folder in which the project file was saved. This shows you where
the files are stored, what they are called, and any folders that may exist in your project folder.
This can be thought of as similar to going to My Computer on a PC or opening Finder on
a Mac. Similarly, this tab lists the file and directory structure either on the cloud for RStudio
Server or on your local machine for RStudio Desktop.
Plots / Viewer
You’ll see clearer examples of what the Plots and Viewer tabs provide in Chapter 4. As you
likely guessed Plots will show you the resulting graphs/figures that your R code has generated.
The Viewer tab can show you the resulting HTML file created from an R Markdown Knit.
Packages
The Packages tab lists all packages installed on your computer or cloud server. You can see
which packages are loaded in the current working environment by looking to see if a check-
mark exists next to the package name. Note that you may not have all of the packages loaded
onto your machine that I do below in the video. That’s OK. This is just an example of what
you may expect.
You’ll also notice a Description of the package as well as the Version number here. Packages
are frequently updated and improved, so this is a way to check whether you have the most up-
to-date version of a package. Remember, if you are using RStudio Server, this is likely taken
care of for you. If you are using RStudio Desktop, you might find the Install and Update
buttons useful for downloading new packages or updating currently installed ones.
Help
We also saw an example of using the Help tab when we invoked the ? function. This will show
you documentation on R functions, datasets, and packages. When (not if) you come across
code and you aren’t sure what it does, it is often helpful to enter a question mark followed by
the name, and see if the built-in documentation can help you out.
4
R Markdown
Important note: Remember that all of the R code you want to run needs to be stored in a
chunk (in the correct order) for your analysis to be reproducible AND for you not to receive
errors when you Knit. It is easy to do a lot of work in the R Console and then forget to add
that work into a chunk in your Rmd file. This is probably the number one error you will see
when you first begin working in RStudio. An example of this error is shown below.
The object not found errors are the most frequently encountered errors, and along with
misspellings and incomplete R code segments, represent the vast majority of issues with R.
This is covered in greater detail in Chapter 6.
4.2.1 YAML
The top part of the file is called the YAML header. YAML is a recursive acronym that stands
for “YAML Ain’t Markup Language” and is defined on its official website at [Link]
as:
YAML is a human friendly data serialization standard for all programming languages.
Essentially, the YAML header stores the metadata needed for the document. You can see an
example of a YAML header from our first_rmarkdown.Rmd file:
---
title: "First RMarkdown"
author: "Chester Ismay"
output: html_document
---
There are many other fields that can be customized in the YAML header. The important thing
to notice here are the three hyphens that begin and end the YAML header. Indentation also
has meaning in YAML, so take care when aligning text.
4.2.2 Headers
As you can see above, you can create many different sized headers by simply adding one or
more # in front of the text you’d like to denote the header.
4.2.3 Emphasis
Whenever you see a hash-tag in the text of your R Markdown document, you now know that
this will correspond to bolded, larger text1 that denotes the start of a section of your docu- 1
Unless you want to have
ment. a fourth, fifth, or sixth
level header, but these
This is one of the nice features of R Markdown. You can simply look at the plain text and are not common.
know what it will produce in the knitted document. We can also add different styles of empha-
sis to words, phrases, or sentences by surrounding them in matching symbols. Below are some
examples.
You are beginning to see how easy it is to customize your output. We’ll next discuss ways to
getting used to r, rstudio, and r markdown 19
add links to URLs, create ordered and unordered lists, and use other frequently used Mark-
down features.
4.2.4 Links
To add a link to a URL, you simply enclose the text you’d like displayed in the resulting
HTML file inside [ ] and then the link itself inside ( ) right next to each other with no space
in between.
4.2.5 Lists
The screencast below shows the process of creating both ordered and unordered lists.
Note that only numbers are needed as we saw by numbering “Warm up food” with a “1.” We
can also combine unordered and ordered lists by indenting the text two spaces.
In many of the examples that follow, you will see the actual text you’d type into your R Mark-
down document highlighted with a gray background and also the results of that text immedi-
ately below it.
1. Wake up
- Get out of bed
1. Warm up food
- Open kitchen door
- Get plate out of cupboard
2. Make coffee
i. Warm up water
ii. Grind beans
3. Make breakfast
We can have a paragraph (or two) here describing how we could go about making
breakfast. If we indent the paragraph a few spaces and create a newline, it
will indent below the item.
1. Wake up
• Get out of bed
2. Warm up food
• Open kitchen door
• Get plate out of cupboard
3. Make coffee
i. Warm up water
ii. Grind beans
4. Make breakfast
We can have a paragraph (or two) here describing how we could go about making breakfast.
If we indent the paragraph a few spaces it will indent below the item.
20 chester ismay and patrick c. kennedy
Here is an example of text with only a line break. You may expect this line to appear in a new
paragraph but it doesn’t.
In order to start a new paragraph, you need to add white space between the two paragraphs:
Here is an example of text with a line break and white space.
You may expect this line to appear in a new paragraph and it does.
---
Blockquotes
If you’d like to quote someone or produce an indented text block, you can do so by adding a >
before the passage:
> Reproducible research is the idea that data analyses, and more generally,
scientific claims, are published with their data and software code so that
others may verify the findings and build upon them. - Roger Peng
Reproducible research is the idea that data analyses, and more generally, scientific claims, are
published with their data and software code so that others may verify the findings and build
upon them. - Roger Peng
Commenting Text
There are times where you might want to comment out text inside an R Markdown document.
Say you wrote something that you don’t really want in the resulting knitted document, but you
aren’t quite sure if you should delete it completely. To create a comment, enclose the block of
text in <!-- --> as seen below:
<!--
I'd like to save this text for later and don't want to delete it yet.
getting used to r, rstudio, and r markdown 21
-->
You won’t see the commented out results here in the book.
Equations
If you’d like nice mathematical formulas in your document, you can add them between two
dollar signs:
$y = mx + b$
y = mx + b
4.2.7 R Chunks
Now that you have the basics down, we can get to what I believe is the best part about R
Markdown: the ability to include R code directly in the document which is compiled in the
resulting output. You’ve seen several examples of R chunks in the R Markdown file already.
These code blocks all share several properties in common which you should know:
• Code blocks always begin and end with three backticks ```.
• After the initial three backticks, the first line of an R chunk begins with {r, optionally
includes a name and/or other chunk options, and then ends with a }.
• The lines enclosed between the beginning and closing three backticks is valid R code that
you could execute in the console.
Note that including spaces in front of these backticks will produce an error.
In our first_rmarkdown.Rmd file, let’s explore an example of recognizing and creating our
own R chunks:
This example introduces you to two different ways to create a vector of values. You’ll see fur-
ther discussion of this in Chapter 5. You can see that the code was automatically executed
when we pressed the Knit HTML button, and its output was included in the knitted file.
Important note: Any other R chunks after this one will have access to the three variables
created here: count20, count100_by_5, and prod. Any chunks before the mult_vectors
named chunk WILL NOT have access to these variables. You can read the document like
a book, so it is important to add objects and work with them in the appropriate order. You’ll
receive errors from R if you don’t.
You also see that R gave an error when I didn’t include the third argument to ifelse in quo-
tation marks. However, when I fixed the code and pressed Knit HTML again, the error went
away.
As you saw in the previous example, it is a good habit to highlight the names of your R objects
to differentiate them from regular text. This can be done by enclosing the word in a single
backtick such as what we did with one_value.
You can set many options on a chunk by chunk basis. The most common R chunk options are
echo, eval, and include. By default, all three of these options are set to TRUE.
• echo dictates whether the code that produces the result should be printed before the corre-
sponding R output
• eval specifies whether the code should be evaluated or just displayed without its output
• include specifies whether the code AND its output should be included in the resulting
knitted document. If it is set to FALSE the code is run, but neither the code or its output
are included in the resulting document.
Because we specified that eval=FALSE and that chunk was where we declared the one_value
variable, we now obtain an error. You can include multiple chunk options by separating each
option with a comma.
White space is your friend. You should always include a blank white space between R chunks
and your Markdown text. It makes your document much more readable and can reduce some
potential errors. Also, leave a line of white space between header text and your paragraphs.
Commentary is always good. Explain yourself and your ideas whenever you can. Remember
that your most frequent collaborator is likely yourself a few months down the road. Be nice to
future you and explain what you are doing so that you can remember!
Remember that the Console and R Markdown environments (when Knitting) don’t interact
with each other. This forces you to include only the code in your R chunks that produces
exactly the results you want to share with others. Don’t inflate your document with extra
output. Be concise and clear in exactly what you are doing.
The chunk options can really beautify your documents and customize them exactly to what
you’d like the reader of your documents to see. You can find more information on all of the
available R chunk options here.
getting used to r, rstudio, and r markdown 23
[Link]
4.6 Spell-check
Near the top of your R Markdown editor window sits one of the more useful tools for writing
documents: the spell-check button. It is the green check-mark with “ABC” above it:
Before you submit a document or share it with someone else, you should spell check your doc-
ument. You may need to add some R commands to the dictionary or ignore them since those
may not be recognized as words, but it is easy to misspell words as we type and this feature
can really help.
5
Intro to R using R Markdown
In this chapter, you’ll see many of the ways that R stores objects and more details on how
you can use functions to solve problems in R. In doing so, you will be working with a common
dataset derived from something that you likely have encountered before: the periodic table of
elements from chemistry.
We will be discussing both strings and factors in the data structures section (Section 5.3)
and why this additional parameter stringsAsFactors set to FALSE is recommended.
It’s good practice to check out the data after you have loaded it in:
View(periodic_table)
The video below walks through downloading the CSV file and loading it into the data frame
getting used to r, rstudio, and r markdown 27
object named periodic_table. It also shows another way to view data frames that is built
into RStudio without having to run the View function. Note that a new R Markdown file is
created here as well called chemistry_example.Rmd. We are writing the commands directly
into R chunks here, but you may find it easier to play around in your R Console sandbox first.
I encourage you to take the R chunks that follow in this chapter, type/copy them into an R
Markdown file that has loaded the periodic_table data set, and see that your resulting out-
put matches the output that I have presented for you here. This book was written in R Mark-
down after all!
Each of the names of the variables/columns in the data frame are listed immediately after the
$. Then on each row of the str function call after the : we see what type of variable it is. The
periodic_table data set has four data types: int, chr, num, and logi:
• int corresponds to integer values
• chr corresponds to character string values
• num corresponds to numeric (not necessarily integer) values
• logi corresponds to logical values (TRUE or FALSE)
5.3.2 Vectors
Data frames can be thought of as many vectors of the same length put together into a single
object. In the periodic_table data frame, each row corresponds to a chemical element and
each column corresponds to a different measurement or characteristic of that element. There
are many different ways to create a vector that stands on its own outside of a data frame.
If you would like to list out many entries and put them into a vector object, you can do so via
the c function. If you enter ?c in the R Console, you can gain information about it. The “c”
stands for combine or concatenate.
Suppose we wanted a way to store four names:
friend_names <- c("Bertha", "Herbert", "Alice", "Nathaniel")
friend_names
28 chester ismay and patrick c. kennedy
## [1] "character"
Next suppose we wanted to put the ages of our friends in another vector. We can again use the
c function:
friend_ages <- c(25L, 37L, 22L, 30L)
friend_ages
## [1] 25 37 22 30
class(friend_ages)
## [1] "integer"
Note the use of the L value here. This tells R that the numbers entered have no decimal com-
ponents. If we didn’t designate the L we can see that the values are read in as "numeric" by
default:
ages_numeric <- c(25, 37, 22, 30)
class(ages_numeric)
## [1] "numeric"
From a user’s perspective, there is not a huge difference in how these values are stored, but
it is still a good habit to specify what class your variables are whenever possible to help with
collaboration and documentation.
The most likely way you will enter character values into a vector is via the c function. Numeric
values can be entered in a couple different ways. One is using the c function, as we saw above.
Because numbers have a natural order, we can also specify a sequence of numbers with a start-
ing value, an ending value, and the amount by which to increment each step in the sequence:
sequence_by_2 <- seq(from = 0L, to = 100L, by = 2L)
sequence_by_2
## [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
## [18] 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66
## [35] 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100
class(sequence_by_2)
## [1] "integer"
You should now have a better sense of what the numbers in the [ ] before the output refer to.
getting used to r, rstudio, and r markdown 29
This helps you keep track of where you are in the printing of the output. So the first element
denoted by [1] is 0, the 18th entry ([18]) is 34, and the 35th entry ([35]) is 68. This will
serve as a nice introduction into indexing and subsetting in Section 5.5.
We can also set the sequence to go by a negative number or a decimal value. We will do both
in the next example.
dec_frac_seq <- seq(from = 10, to = 3, by = -0.2)
dec_frac_seq
## [1] 10.0 9.8 9.6 9.4 9.2 9.0 8.8 8.6 8.4 8.2 8.0 7.8 7.6
## [14] 7.4 7.2 7.0 6.8 6.6 6.4 6.2 6.0 5.8 5.6 5.4 5.2 5.0
## [27] 4.8 4.6 4.4 4.2 4.0 3.8 3.6 3.4 3.2 3.0
class(dec_frac_seq)
## [1] "numeric"
A short-cut version of the seq version can be achieved using the : operator. If we are increas-
ing values by 1 (or -1), we can use the : operator to build our vector:
inc_seq <- 98:112
inc_seq
## [1] 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
dec_seq <- 5:-5
dec_seq
## [1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
If you aren’t reading in data from a file and you have some vectors of information you’d like
combined into a single data frame, you can use the [Link] function:
friends <- [Link](names = friend_names,
ages = friend_ages,
stringsAsFactors = FALSE)
friends
## names ages
## 1 Bertha 25
## 2 Herbert 37
## 3 Alice 22
## 4 Nathaniel 30
Here we have created a names variable in the friends data frame that corresponds to the
values in the friend_names vector and similarly an ages variable in friends that corresponds
30 chester ismay and patrick c. kennedy
5.3.3 Factors
If we have a strings vector/variable that has some sort of natural ordering to it, it frequently
makes sense to convert that vector/variable into a factor. The factor will convert the strings
to integers to keep track of which order you’d prefer, and also keep track of the original string
values as well.
Looking over our periodic_table data frame again via View(periodic_table), you can see
some good candidates for shifting from chr to Factor. If you remember your chemistry, you’ll
know that the natural ordering of the block variable is "s", "p", "d", and "f".
By default, R will organize character strings in alphabetical order. To see this, we’ll introduce
two new features: the table function and the $ operator.
table(periodic_table$block)
##
## d f p s
## 40 28 36 14
The table function provides a count of the number of elements that appear in each block. But
as you can see, the ordering is off. You may remember the $ displayed before variable names in
the str function. That isn’t a coincidence. To access specific variables inside a data frame, we
can do so by entering the name of the data frame followed by $ and the name of the variable.
(Note that spaces in variable names will not work. You’ll likely learn that the hard way, as I
have.)
To convert block into a factor, we use the aptly named factor function. Note that this is
“converting” by assigning the result of factor back to block in periodic_table.
periodic_table$block <- factor(periodic_table$block,
levels = c("s", "p", "d", "f"))
table(periodic_table$block)
##
## s p d f
## 14 36 40 28
You’ll find that this is an easy way to organize your data whenever you’d like to summarize it
or to plot it, but we’ll save that discussion for a different time and a different book.
R can work extremely quickly when provided with a vector or a collection of vectors like a data
frame. Instead of iterating through each element to perform an operation that we might need
to do in other programming languages, we can do something like this:
getting used to r, rstudio, and r markdown 31
## [1] 30 42 27 35
Just like that, every age is five more than where we started. This extends to adding two vec-
tors together1 . 1
Vectors of the same
size, of course…well,
actually R has a way of
5.5 Indexing and subsetting dealing with vectors of
different sizes and not
So we have a big data frame of information about the periodic table, but what if we wanted to giving errors, but let’s
ignore that for now.
extract smaller pieces of the data frame? You already saw that to focus on any specific variable
we can use the $ operator.
## atomic_number symbol
## 41 41 Nb
## 42 42 Mo
32 chester ismay and patrick c. kennedy
## 43 43 Tc
## 44 44 Ru
## 45 45 Rh
## 46 46 Pd
## 47 47 Ag
## 48 48 Cd
## 49 49 In
## 50 50 Sn
## name_origin
## 41 Niobe, daughter of king Tantalus from Greek mythology
## 42 the Greek molybdos meaning 'lead'
## 43 the Greek tekhn??tos meaning 'artificial'
## 44 Ruthenia, the New Latin name for Russia
## 45 the Greek rhodos, meaning 'rose coloured'
## 46 the then recently discovered asteroid Pallas, considered a planet at the time
## 47 English word (argentum in Latin)
## 48 the New Latin cadmia, from King Kadmos
## 49 indigo
## 50 English word (stannum in Latin)
## atomic_weight state_at_stp
## 1 1.008235 Gas
## 8 15.999000 Gas
The extra parentheses around periodic_table$name %in% c("Hydrogen", "Oxygen") are
a good habit to get into as they ensure everything before the comma is used to select spe-
cific rows matching that condition. For the columns, we can specify a vector of the column
names to focus on only those variables. The resulting table here gives the atomic_weight and
state_at_stp for "Hydrogen" and then for "Oxygen".
There are many more complicated ways to subset a data frame and one can use the subset
function built into R, but in my experience, whenever you want to do anything more compli-
cated than what we have done here, it is easier to use the filter and select functions in the
dplyr package.
5.6 Functions
You might not have noticed, but we have been using functions throughout this entire chapter.
The seq command we saw earlier is a function. It expects a few arguments: from, to, by, and
a few others that we didn’t specify. How do I know this and why didn’t we specify them?
Recall that you can look up the help documentation on any function by entering ? and the
function name in the R console. If we do this for seq with ?seq, we are given some examples
of what to expect under the Usage section. R allows for function arguments to take on default
values and that’s what we see:
seq(from = 1, to = 1, by = ((to - from)/([Link] - 1)),
[Link] = NULL, [Link] = NULL, ...)
By default, the sequence will both start and end at 1 (a not very interesting sequence). The
[Link] and [Link] arguments are set to NULL by default. NULL represents an empty
object in R, so they are essentially ignored, unless you specify values for them. The ... argu-
ment is beyond the scope of this book, but you can read more about it and other useful tips
about writing function at the NiceRCode page here.
Not all functions have all default arguments like seq. The mean function is one such example:
mean()
seq()
## [1] 1
To fix the error, we’ll need to specify which vector or object we want to compute the mean of.
Recall the ages_numeric vector. We can pass that into the mean function:
ages_numeric
## [1] 25 37 22 30
mean(x = ages_numeric)
## [1] 28.5
We can also skip specifying the name of the argument, as long as you follow the same order as
what is given in Usage in the documentation:
mean(ages_numeric)
## [1] 28.5
We can see that R expects the arguments to be x, then trim, and then [Link]. What happens
if we try to specify TRUE for [Link] without specifically saying [Link] = TRUE?
mean(ages_numeric, TRUE)
## [1] 28.5
As you begin to explore help documentation for different functions, you’ll notice that some ar-
guments require quotations around them, while others (like those for mean) don’t. This brings
us back to the discussion earlier about strings, logicals, and numeric/integer classes.
One example of a function that requires a character string (or vector) is the [Link]
function, which is at the heart of R’s ability to expand on its built-in functionality by import-
ing packages that include functions, templates, and data written by users of R. If you run
getting used to r, rstudio, and r markdown 35
?[Link], you’ll see that the first argument pkgs is expected to be a character
vector. You’ll therefore need to enter the packages you’d like to download and install inside
quotes.
Two useful packages that I recommend you install and download are "ggplot2" and "dplyr"
(if you are using RStudio Server, hopefully your instructor or server administrator has already
done so). The [Link] function has a large number of arguments, with all but the
pkgsargument set by default. We’ll pick a couple here to specify instead of using the defaults:
[Link](pkgs = c("ggplot2","dplyr"),
repos = "[Link]
dependencies = TRUE,
quiet = TRUE)
If you refer back to the help via ?[Link], you will see descriptions that what type
of argument is expected:
• pkgs expects a character vector
• repos expects a character vector
• dependencies expects a logical value (TRUE or FALSE)
• quiet expects a logical value
After you’ve downloaded the packages, you can load the package into your R environment
using the library function:
library("ggplot2")
library("dplyr")
For references on errors, check out the following two links by Noam Ross here and David Smith
here.
This error usually occurs when a package has not been loaded into R via library, so R does
not know where to find the specified function. It’s a good habit to use the library functions
on all of the packages you will be using in the top R chunk in your R Markdown file, which is
usually given the chunk name setup.
This error usually occurs when your R Markdown document refers to an object that has not
been defined in an R chunk at or before that chunk. You’ll frequently see this when you’ve
forgotten to copy code from your R Console sandbox back into a chunk in R Markdown.
6.3 Misspellings
One of the most frustrating errors you can encounter in R is when you misspell the name of
an object or function. R is not forgiving on this, and it won’t try to automatically figure out
what you are referring to. You’ll usually be able to quite easily figure out that you made a
typo because you’ll receive an object not found error.
Remember that R is also case-sensitive, so if you called an object Name and then try to call it
name later on without name being defined, you’ll receive an error.
Another common error is forgetting or neglecting to finish a call to a function with a closing ).
An example of this follows:
mean(x = c(1, 5, 10, 52)
38 chester ismay and patrick c. kennedy
My hope is that this book provides a nice introduction into understanding how statisticians,
data scientists, and many other professionals use RStudio and R Markdown to simplify their
analyses and ensure that their reports are computationally reproducible. Learning R is not
nearly as intimidating as it once was, and more and more industries are shifting towards free
open-source tools like R and RStudio. R Markdown provides an excellent way to document
your analyses and share it with others in a variety of formats.
Of course, this book is just the tip of the iceberg in terms of showing you what R can really
do. If you’d like to learn more, I encourage you to check out Modern Dive, a free, online, open-
source book Albert Kim and I wrote on using modern data analysis techniques and visualiza-
tion with R, RStudio, and R Markdown.
Additionally, Garrett Grolemund’s “Hands-On Programming with R” (Grolemund, 2014) is
an excellent resource and goes into much more depth than I do here on how to work with
more complicated objects in R. It also discusses concepts in a project-based framework that is
entertaining and easy-to-read.
As always, feel free to send me an email at [Link]@[Link] if you’d like any fur-
ther clarification or if you have suggestions on improvements. Thanks for taking the time to
read through this and best wishes to you on your next steps towards reproducible, thoughtful,
beautiful analyses!
- Chester
References