Data Wrangling with R Course Overview
Data Wrangling with R Course Overview
Course Overview
• Course Overview 10
• Course objectives 10
• Course Expectations 10
• Creating an R Script 15
• R Objects 16
• Things to note 18
• Using functions 19
• Navigating directories 19
• Acknowledgments 27
• Additional Resources 27
• Loading Tidyverse 29
• What is a tibble? 29
• An Example 36
• Data reshape 38
• Acknowledgements 45
• Additonal Resources 45
• Introduction to ggplot2 46
• Objectives 46
• Why ggplot2? 48
• Getting help 49
• Geom functions 52
• Mapping and aesthetics (aes()) 53
• Colors 57
• Facets 63
• Resource list 68
• Acknowledgements 69
• What is dplyr? 70
• Loading dplyr 70
• Importing data 71
• Excluding columns 75
• Helper functions 77
• Comparison operators 79
• Reordering rows 85
• Acknowledgments 86
• Loading dplyr 87
• Data 87
• Mutating joins 90
• Filtering joins 91
• Transforming variables 93
• mutate() 93
• Acknowledgments 102
• Objectives 103
• What is Bioconductor? 103
• Accessors 107
• Acknowledgements 116
• Objectives 118
• Step 5: Create a bar plot, using ggplot2 to plot the total gene counts per sample. 126
• Lesson 1 132
• Lesson 2 132
• Lesson 3 132
• Lesson 4 132
• Lesson 5 132
• Lesson 6 132
• Lesson 7 133
• Lesson 8 133
Practice Questions
• Acknowledgements 139
Course Overview
Course objectives
• Learn how to navigate RStudio.
• Learn how to load different types of data formats.
• Get acquainted with the tidyverse packages, especially dplyr.
• Become familiar with functions useful for cleaning, transforming, and summarizing data.
While this course will not make you an expert R programmer or full-fledged data analyst, it will
help you learn how to analyze real-life, messy data and prepare it for visualization and further
analyses.
Course Expectations
This course will include a series of eight, one hour lessons. Each lesson will be held virtually
using the Webex platform on Mondays / Wednesdays at 1 pm. Lessons will immediately be
followed by a one-hour help session. Help sessions will be structured around a set of practice
problems for you to test your new skills. Though, we welcome all questions!
DNAnexus
You will not be able to practice or work through this documentation using DNAnexus outside of class hours. If you
are using this documentation asynchronously, please install R and RStudio ([Link]
docs/rtools/[Link]).
In Lesson 5, we will learn how to improve code interpretability with the pipe %>% from the
magrittr package. We will also learn how to merge and filter data frames.
In Lesson 6, we will continue to wrangle data using dplyr. This lesson will focus on functions
such as group_by(), arrange(), summarize(), and mutate().
398 + 783
## [1] 1181
475 / 5
## [1] 95
2 * 8906
## [1] 17812
(1 + (5 ** 0.5))/2
## [1] 1.618034
As you can see, ** were used for exponentiation. There are a number of other special operators
used to perform math in R. Refer to this chart from [Link] ([Link]
genomics-r-intro/02-r-basics/[Link]) or an overview of mathematical operators used in R.
Creating an R Script
If all R could do was function as a calculator, it wouldn't be very useful. R can be used for
powerful analyses and visualizations.
As we learn more about R and begin implementing our first commands, we will keep a record of
our commands using an R script. Remember, good annotation is key to reproducible data
analysis. An R script can also be generated to run on its own without user interaction.
To create an R script, click File > New File > R Script or click on the new document icon (paper
with a +) and select R script. You can save your now Untitled script by selecting the floppy disk
icon. Give your script a meaningful name so that you can identify what it contains when
returning to it later. R scripts end in .R. Save your R script to your working directory, which will
be the default location on RStudio Server.
Important
Scripts are ordered. Running commands out of order will cause confusion later when you try to reproduce a given
analysis step.
R Objects
Now that we have an R script, let's begin to work with R objects. Everything assigned a value in
R is technically an object. Mostly we think of R objects as something in which a method (or
function) can act on; however, R functions, too, are R object. R objects are what gets assigned
to memory in R and are of a specific type or class. Objects include things like vectors, lists,
matrices, arrays, factors, and data frames. Don't get too bogged down by terminology. Many of
these terms will become clear as we begin to use them in our code. In order to be assigned to
memory, an r object must be created.
Therefore, objects are data structures with specific attributes and methods that can be applied
to them.
To create an R object, you need a name, a value, and an assignment operator (e.g., <- or =)
([Link] R is
case sensitive, so an object with the name "FOO" is not the same as "foo".
To run our code, we have a number of options. First, you can use the Run button above. This
will run highlighted or selected code. You may also use the source button to run your entire
script. My preferred method is to use keyboard shortcuts. Move your cursor to the code of
interest and use command + enter for macs or control + enter for PCs. If a command is
taking a long time to run and you need to cancel it, use control + c from the command line or
escape in RStudio. Once you run the command, you will see the command print to the console
in blue followed by the output. You do not need to highlight code to run it. If you do highlight
code, make sure you are highlighting everything you plan to run. Highlighting can be a great way
to only test small sections of nested code.
a<-1 #You can and should annotate your code with comments for better reproducib
a #Simply call the name of the object to print the value to the screen
## [1] 1
In this example, "a" is the name of the object, 1 is the value, and <- is the assignment operator.
We inspect objects simply by typing or running their name.
For example:
In contrast:
## [1] "apples"
What do you think would have happened if we didn't put 'apples' in quotes? Try it.
a<-apples
3. Avoid common names with special meanings or assigned to existing functions (These will
auto complete).
To view a list of the objects you have created, use ls() or look at your global environment pane.
## [1] "tp53"
## [1] "GH1"
Warning
R will not warn you when objects are being overwritten, so use caution.
Things to note
• R doesn't care about spaces in your code. However, it can vastly improve readability if
you include them. For example, "thisissohardtoread" but "this is fine".
• You can use tab completion to quickly type object or function names.
For example:
```{py3 hl_lines="1-100"}
clifford<-"a big red dog"
```
Now, type "clif" into the console and hit tab.
Using functions
A function in R (or any computing language) is a short program that takes some
input and returns some output.
• Functions have a name (e.g. dir, getwd); note that functions are case
sensitive!
• Following the name, functions have a pair of ()
• Inside the parentheses, a function may take 0 or more arguments ---
[Link] ([Link]
[Link])
Navigating directories
Now that we know what a function is, let's use them to navigate our directories.
Our first function will be getwd(). This simply prints your working directory (our default
directory for saving files) and is the R equivalent of pwd (if you know unix coding).
[1] "/home/rstudio/"
For details on function arguments and examples of how to use the function, we should check
the package / function documentation. We can get help by preceding a function with ? or ?? if
the package library has not been loaded. We can also use the function args().
Let's see this in action with setwd(). setwd() is used to change our working directory. If we
want to know what argument it takes, we can try the help documentation.
?setwd()
As we can see from the help documentation, setwd() requires the argument dir, which
requires a character string pointing to the correct directory. The path should be in quotes, and
you can use tab completion to fill in the path as needed.
Note
R uses unix formatting for directories, so regardless of whether you have a Windows computer or a mac, the way
you enter the directory information will be the same.
round()
rounds the values in its first argument to the specified number of decimal places
(default 0) --- R help.
This implies that the first argument should be the number you want to round.
## [1] 17.66
## [1] 2
## [1] 17.66
One of those is c(), which is used to combine its arguments to form a vector.
Vectors are probably the most used commonly used object type in R. A vector is a
collection of values that are all of the same type (numbers, characters, etc.). The
columns that make up a data frame, for example, are vectors of the same length. ---
[Link] ([Link]
[Link]).
sample_names<-c("Sample1","Sample2","Sample3")
more_samps<-c("Sample4","Sample5")
Here is a short list of functions that are commonly used and good to keep in mind:
You may also find this function reference card valuable: reference card ([Link]
[Link]/doc/contrib/[Link])
At times a function may be masked by another function. This can happen if two functions are
named the same (e.g., dplyr::filter() vs plyr::filter()). We can get around this by
explicitly calling a function from the correct package using the following syntax:
package::function().
The mode or type of an object can be examined using mode() or typeof(). Its class can be
examined using class(). Unlike modes and types, classes are unlimited and can be user
defined. Classes can often be more informative. For example, class() may return [Link],
array, matrix, or factor.
typeof(df)
## [1] "list"
class(df)
## [1] "[Link]"
Because data frames are made up of columns that can store different types of data, we can
examine the overall structure of a data frame using str().
str(df)
There are functions that can gage types and classes directly, for example, [Link](),
[Link](), [Link](), [Link](), [Link](), [Link]().
We will discuss factors more later when plotting. Functions most relevant to factors include
factor() and levels().
We can access a column of our data frame using [], [[]], or using the $. We can use
colnames() and rownames() to access the column names and row names of a data frame.
For example:
df[["cell"]]
df["cell"]
## cell
## 1 cell_line_A
## 2 cell_line_A
## 3 cell_line_A
## 4 cell_line_A
## 5 cell_line_A
## 6 cell_line_A
## 7 cell_line_A
## 8 cell_line_A
## 9 cell_line_A
## 10 cell_line_A
## 11 cell_line_B
## 12 cell_line_B
## 13 cell_line_B
## 14 cell_line_B
## 15 cell_line_B
## 16 cell_line_B
## 17 cell_line_B
## 18 cell_line_B
## 19 cell_line_B
## 20 cell_line_B
df$cell
colnames(df)
rownames(df)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "
## [16] "16" "17" "18" "19" "20"
Subsetting
Check out this chapter ([Link] of Advanced R for more on subsetting operators.
• Certain symbols in R always come in pairs, for example, parentheses and quotation
marks. If you do not provide a pair, R will return a continuation character +, meaning it is
waiting for more input to complete the command. You provide the missing information or
press ESCAPE.
For example,
Acknowledgments
Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for
Genomics lesson provided by [Link] ([Link]
[Link]).
Additional Resources
Hands-on Programming with R ([Link]
Github is another common source used to store R packages; though, these packages do not
necessarily meet CRAN standards so approach with caution. To install a Github package, use
library(devtools) followed by install_github(). devtools is a CRAN package. If
you have not installed it, you may use [Link]("devtools") prior to the
previous steps.
Many genomics and other packages useful to biologists / molecular biologists can be found on
Bioconductor. To install a Bioconductor package, you will first need to install BiocManager, a
CRAN package ([Link]("BiocManager")). You can then use BiocManager
to install the Bioconductor core packages or any specific package (e.g.,
BiocManager::install("DESeq2")).
First, the RStudio IDE has a drop down menu for data import. Simply go to File > Import
Dataset and select one of the options and follow the prompts.
We should pay close attention to the import functions and their arguments. Using the import
arguments correctly can save us from a headache later down the road. You will notice two
types of import functions under Import Dataset "from text": base R import functions and
readr import functions. We will use both in this course.
Note
Tidyverse packages are generally against assigning rownames and instead prefer that all column data are
treated the same, but there are times when this is beneficial and will be required for genomics data (e.g., See
SummarizedExperiment ([Link]
doc/[Link]) from Bioconductor).
Loading Tidyverse
library(tidyverse)
What is a tibble?
When loading tabular data with readr, the default object created will be a tibble. Tibbles are
like data frames with some small but apparent modifications. For example, they can have
numbers for column names, and the column types are immediately apparent when viewing.
Additionally, when you call a tibble by running the object name, the entire data frame does not
print to the screen, rather the first ten rows along with the columns that fit the screen are shown.
They produce tibbles, they don’t convert character vectors to factors, use row
names, or munge the column names. These are common sources of frustration with
the base R functions.
They are more reproducible. Base R functions inherit some behaviour from your
operating system and environment variables, so import code that works on your
computer might not work on someone else’s. ---R4DS ([Link]
[Link]#compared-to-base-r)
Importing excel files requires the R package readxl. While this is a tidyverse package, it is not
core and must be loaded separately.
library(readxl)
The functions to import excel files are read_excel(), read_xls(), and read_xlsx(). The
latter two are more specific based on file format, whereas the first will guess which format (.xls
or .xlsx) we are working with.
Let's look at its basic usage using an example data set from the readxl package. To access
the example data we use readxl_example().
## [1] "/Library/Frameworks/[Link]/Versions/4.3-x86_64/Resources/library/r
Now, let's read in the data. The only required argument is a path to the file to be imported.
irisdata<-read_excel(ex_xl)
irisdata
## # A tibble: 150 × 5
## [Link] [Link] [Link] [Link] Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
Notice that the resulting imported data is a tibble. This is a feature specific to tidyverse. Now,
let's check out some of the additional arguments. We can view the help information using ?
read_excel().
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
sum_air
## # A tibble: 11 × 4
## `Uses Airway Data` ...2 ...3 ...4
## <chr> <chr> <chr> <chr
## 1 Some RNA-Seq summary information <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
Upon importing this data, we can immediately see that something is wrong with the column
names.
colnames(sum_air)
There are some extra rows of information at the beginning of the data frame that should be
excluded. We can take advantage of additional arguments to load only the data we are
interested in. We are also going to tell read_excel() that we want the names repaired to
eliminate spaces.
sum_air<-read_excel("./data/RNASeq_totalcounts_vs_totaltrans.xlsx",
skip=3,.name_repair = "universal")
## New names:
## • `Sample Name` -> `[Link]`
## • `Number of Transcripts` -> `[Link]`
## • `Total Counts` -> `[Link]`
sum_air
## # A tibble: 8 × 4
## [Link] Treatment [Link] [Link]
## <chr> <chr> <dbl> <dbl>
## 1 GSM1275863 Dexamethasone 10768 18783120
## 2 GSM1275867 Dexamethasone 10051 15144524
## 3 GSM1275871 Dexamethasone 11658 30776089
## 4 GSM1275875 Dexamethasone 10900 21135511
## 5 GSM1275862 None 11177 20608402
Name repair
To import tab-delimited files there are several options. There are base R functions such as
[Link]() and [Link]() as well as the readr functions read_delim(),
read_tsv(), and read_table().
Let's take a look at ?[Link]() and ?read_delim(), which are most appropriate if you
are working with tab delimited data stored in a .txt file.
For [Link](), you will notice that the default separator (sep) is white space, which can
be one or more spaces, tabs, newlines. However, you could use this function to load a comma
separated file as well; you simply need to use sep = ",". The same is true of
read_delim(), except the argument is delim rather than sep.
Let's load sample information from the RNA-Seq project airway ([Link]
packages/release/data/experiment/html/[Link]). We will refer back to some of these data
frequently throughout our lessons. The airway data is from Himes et al. (2014) (https://
[Link]/24926665/). These data, which are available in R as a
RangedSummarizedExperiment object, are from a bulk RNAseq experiment. In the experiment,
the authors "characterized transcriptomic changes in four primary human ASM cell lines that
were treated with dexamethasone," a common therapy for asthma. The airway package
includes RNAseq count data from 8 airway smooth muscle cell samples. Each cell line includes
a treated and untreated negative control.
Using [Link]():
smeta<-[Link]("./data/airway_sampleinfo.txt")
head(smeta)
Using read_delim():
smeta2<-read_delim("./data/airway_sampleinfo.txt")
## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this mess
smeta2
## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0
To read comma separated files, we can use the specific functions ?[Link]() and ?
read_csv().
cexamp<-[Link]("./data/surveys_datacarpentry.csv")
head(cexamp)
cexamp2<-read_csv("./data/surveys_datacarpentry.csv")
cexamp2
## # A tibble: 35,549 × 9
## record_id month day year plot_id species_id sex hindfoot_length weig
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <db
## 1 1 7 16 1977 2 NL M 32
## 2 2 7 16 1977 3 NL M 33
## 3 3 7 16 1977 2 DM F 37
## 4 4 7 16 1977 7 DM M 36
## 5 5 7 16 1977 3 DM M 35
## 6 6 7 16 1977 1 PF M 14
## 7 7 7 16 1977 2 PE F NA
## 8 8 7 16 1977 1 DM M 37
## 9 9 7 16 1977 1 DM F 34
## 10 10 7 16 1977 6 PF F 20
## # ℹ 35,539 more rows
For information on importing other files types (e.g., json, xml, google sheets), check out this
chapter ([Link] from Tidyverse Skills for
Data Science ([Link] by Carrie Wright, Shannon E. Ellis,
Stephanie C. Hicks and Roger D. Peng.
An Example
Let's load in a count matrix from airway to work with and reshape.
aircount<-[Link]("./data/head50_airway_nonnorm_count.txt")
head(aircount)
## X Accession.SRR1039508 Accession.SRR1039509
## 1 ENSG00000000003.TSPAN6 679 448
## 2 [Link] 0 0
## 3 ENSG00000000419.DPM1 467 515
## 4 ENSG00000000457.SCYL3 260 211
## 5 ENSG00000000460.C1orf112 60 55
## 6 [Link] 0 0
## Accession.SRR1039512 Accession.SRR1039513 Accession.SRR1039516
## 1 873 408 1138
## 2 0 0 0
Because this is a count matrix, we want to save column 'X', which was automatically named, as
row names rather than a column.
aircount<-[Link]("./data/head50_airway_nonnorm_count.txt",
[Link] = 1)
head(aircount)
## Accession.SRR1039508 Accession.SRR1039509
## ENSG00000000003.TSPAN6 679 448
## [Link] 0 0
## ENSG00000000419.DPM1 467 515
## ENSG00000000457.SCYL3 260 211
## ENSG00000000460.C1orf112 60 55
## [Link] 0 0
## Accession.SRR1039512 Accession.SRR1039513
## ENSG00000000003.TSPAN6 873 408
## [Link] 0 0
## ENSG00000000419.DPM1 621 365
## ENSG00000000457.SCYL3 263 164
## ENSG00000000460.C1orf112 40 35
## [Link] 2 0
## Accession.SRR1039516 Accession.SRR1039517
## ENSG00000000003.TSPAN6 1138 1047
## [Link] 0 0
## ENSG00000000419.DPM1 587 799
## ENSG00000000457.SCYL3 245 331
## ENSG00000000460.C1orf112 78 63
## [Link] 1 0
## Accession.SRR1039520 Accession.SRR1039521
## ENSG00000000003.TSPAN6 770 572
## [Link] 0 0
Data reshape
What do we mean by reshaping data?
Data reshaping is one aspect of tidying our data. The shape of our data is determined by how
values are organized across rows and columns. When reshaping data we are most often
wrangling the data from wide to long format or vice versa. To tidy the data we will need to (1)
know the difference between observations and variables, and (2) potentially resolve cases in
which a single variable is spread across multiple columns or a single observation is spread
across multiple rows (R4DS ([Link]
It is difficult to provide a single definition for what is wide data vs long data, as both can take
different forms, and both can be considered tidy depending on the circumstance.
Important
Remember, while we are interested in getting data into a "tidy" format, your data should ultimately be wrangled into a
format that is going to work with downstream analyses.
In general, in wide data there is often a single metric spread across multiple columns. This type
of data often, but not always, takes on a matrix like appearance.
While in long data, each variable tends to have its own column.
However, these definitions depend on what you are ultimately considering a variable and what
you are considering an observation.
For example, which of the following data representations is the tidy option?
tibble(iris)
## # A tibble: 150 × 5
## [Link] [Link] [Link] [Link] Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
## # A tibble: 600 × 4
## Iris_id Species Measurement_location Measurement
## <chr> <fct> <chr> <dbl>
Regardless, you may want one format or the other depending on your analysis goals. Many of
the tidyverse tools (e.g., ggplot2) seem to work better with long format data.
The tools we use to go from wide to long and long to wide are from the package tidyr.
Because we already loaded the package tidyverse, we do not need to load tidyr, as it is a
core package.
If you haven't guessed already, our count matrix is currently in wide format. If we wanted to
merge these data with sample metadata and plot various aspects of the data using ggplot2, we
would likely want these data in long format.
Let's check out the help documentation ?pivot_longer(). This function requires the data
and the columns we want to combine. There are also a number of optional arguments involving
the name column and the value column.
l_air<-pivot_longer(aircount,1:length(aircount),names_to ="Sample",
values_to= "Count")
head(l_air)
## # A tibble: 6 × 2
## Sample Count
## <chr> <int>
## 1 Accession.SRR1039508 679
## 2 Accession.SRR1039509 448
## 3 Accession.SRR1039512 873
## 4 Accession.SRR1039513 408
## 5 Accession.SRR1039516 1138
## 6 Accession.SRR1039517 1047
Notice that the row names were dropped. While we would want to keep row names if we were
working with this matrix as is, because we want a long data frame, we will need to first put the
row names into a column. For this, we will use rownames_to_column() from the tidyverse
package tibble.
## Feature
## 1 ENSG00000000003.TSPAN6
## 2 [Link]
## 3 ENSG00000000419.DPM1
## 4 ENSG00000000457.SCYL3
## 5 ENSG00000000460.C1orf112
## 6 [Link]
#pivot longer...again
l_air<-pivot_longer(aircount,starts_with("Accession"),
names_to =c("Sample"),values_to= "Count")
head(l_air)
## # A tibble: 6 × 3
## Feature Sample Count
## <chr> <chr> <int>
## 1 ENSG00000000003.TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003.TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003.TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003.TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003.TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003.TSPAN6 Accession.SRR1039517 1047
How can we get this back to a wide format? We can use ?pivot_wider(). This requires two
additional arguments beyond the data argument: names_from and values_from. The first,
names_from should be the name of the column containing the new column names for your
wide data. values_from is the column that contains the values to fill the rows of your wide
data columns.
w_air<-tidyr::pivot_wider(l_air,names_from = Sample,
values_from = Count)
head(w_air)
## # A tibble: 6 × 9
## Feature Accession.SRR1039508 Accession.SRR1039509 Accession.SRR103
## <chr> <int> <int> <
## 1 ENSG0000000000… 679 448
## 2 ENSG0000000000… 0 0
## 3 ENSG0000000041… 467 515
## 4 ENSG0000000045… 260 211
## 5 ENSG0000000046… 60 55
## 6 ENSG0000000093… 0 0
## # ℹ 5 more variables: Accession.SRR1039513 <int>, Accession.SRR1039516 <int>,
## # Accession.SRR1039517 <int>, Accession.SRR1039520 <int>,
## # Accession.SRR1039521 <int>
Note
There are many optional arguments for both of these functions. These are there to help you reshape seemingly
complicated data schemes. Don't get discouraged. The examples in the help documentation are extremely helpful.
For example, you may have noticed that our feature column from our example data is really two
types of information combined (an Ensembl id and a gene abbreviation). If we want to separate
this column into two, we could easily do this with the help of separate().
Let's see this in action. We want to separate the column Feature at the first ..
head(l_air2)
## # A tibble: 6 × 4
## Ensembl_ID gene_abb Sample Count
## <chr> <chr> <chr> <int>
## 1 "" "" Accession.SRR1039508 679
## 2 "" "" Accession.SRR1039509 448
## 3 "" "" Accession.SRR1039512 873
## 4 "" "" Accession.SRR1039513 408
## 5 "" "" Accession.SRR1039516 1138
## 6 "" "" Accession.SRR1039517 1047
If . matches any character, how do you match a literal .? You need to use an
“escape” to tell the regular expression you want to match it exactly, not use its
special behaviour. Like strings, regexps use the backslash, \, to escape special
behaviour. So to match an ., you need the regexp \.. Unfortunately this creates a
problem. We use strings to represent regular expressions, and \ is also used as an
escape symbol in strings. So to create the regular expression \. we need the string
\\.. --- stringr vignette ([Link]
[Link])
Lost?
If this explanation of the \ makes little sense to you, check out this reddit post ([Link]
learnprogramming/comments/13bb5pa/why_double_slashes_in_regex_expressions/?rdt=59580) for more help.
head(l_air2)
## # A tibble: 6 × 4
## Ensembl_ID gene_abb Sample Count
## <chr> <chr> <chr> <int>
## 1 ENSG00000000003 TSPAN6 Accession.SRR1039508 679
unite() is simply the opposing function to separate(). Let's use unite() to combine our
columns (Ensemble_ID and gene_abb) back together. This time we will use a _ between our
ensembleID and gene abbreviations.
## # A tibble: 6 × 3
## Feature Sample Count
## <chr> <chr> <int>
## 1 ENSG00000000003_TSPAN6 Accession.SRR1039508 679
## 2 ENSG00000000003_TSPAN6 Accession.SRR1039509 448
## 3 ENSG00000000003_TSPAN6 Accession.SRR1039512 873
## 4 ENSG00000000003_TSPAN6 Accession.SRR1039513 408
## 5 ENSG00000000003_TSPAN6 Accession.SRR1039516 1138
## 6 ENSG00000000003_TSPAN6 Accession.SRR1039517 1047
Regular expressions can be exceedingly complicated and like anything require time and
practice. We will not take a deep dive into regular expressions in this course. A great place to
start with regular expressions is Chapter 14: Strings ([Link]
from R4DS. You may also find this stringr vignette ([Link]
stringr/vignettes/[Link]) helpful.
Acknowledgements
Material from this lesson was inspired by R4DS ([Link] and
Tidyverse Skills for Data Science ([Link] The survey data
([Link] loaded in
the section on comma separated files was taken from a [Link] lesson (https://
[Link]/R-ecology-lesson/[Link]).
Additonal Resources
readr / readxl cheatsheet
Tidyr cheatsheet
Stringr / regex cheatsheet
Introduction to ggplot2
Objectives
1. Learn the ggplot2 syntax.
By the end of the course, students should be able to create simple, pretty, and effective figures.
## New names:
## • `Sample Name` -> `[Link]`
## • `Number of Transcripts` -> `[Link]`
## • `Total Counts` -> `[Link]`
exdata
## # A tibble: 8 × 4
## [Link] Treatment [Link] [Link]
## <chr> <chr> <dbl> <dbl>
## 1 GSM1275863 Dexamethasone 10768 18783120
## 2 GSM1275867 Dexamethasone 10051 15144524
## 3 GSM1275871 Dexamethasone 11658 30776089
## 4 GSM1275875 Dexamethasone 10900 21135511
## 5 GSM1275862 None 11177 20608402
## 6 GSM1275866 None 11526 25311320
## 7 GSM1275870 None 11425 24411867
## 8 GSM1275874 None 11000 19094104
These data include total transcript read counts summed by sample and the total number of
transcripts recovered by sample that had at least 100 reads. These data derive from a bulk
RNAseq experiment described by Himes et al. (2014) ([Link]
24926665/) and introduced in lesson 3. As a reminder, the authors "characterized transcriptomic
changes in four primary human ASM cell lines that were treated with dexamethasone," a
common therapy for asthma. Each cell line included a treated and untreated negative control
resulting in a total sample size of 8.
We could plot this data in Excel and get something like this:
While this isn't bad, it took an unnecessary amount of time to create, and there weren't a lot of
options for customization.
RECOMMENDATION
You should save metadata or other tabular data as either comma separated files (.csv) or tab-delimited files (.txt,
.tsv). Using these file extensions will make it easier to use the data with bioinformatic programs. There are multiple
functions available to read in delimited data in R. We will see a few of these over the next few weeks.
Why ggplot2?
Outside of base R plotting, one of the most popular packages used to generate graphics in R is
ggplot2, which is associated with a family of packages collectively known as the tidyverse.
GGplot2 allows the user to create informative plots quickly by using a 'grammar of graphics'
implementation, which is described as "a coherent system for describing and building graphs"
R4DS ([Link]
[Link]#:~:text=ggplot2%20implements%20the%20grammar%20of,applying%20it%20in%20many%
We will see this in action shortly. The power of this package is that plots are built in layers and
few changes to the code result in very different outcomes. This makes it easy to reuse parts of
the code for very different figures.
Being a part of the tidyverse collection, ggplot2 works best with data organized so that
individual observations are in rows and variables are in columns.
Getting help
The R community is extensive and getting help is now easier than ever with a simple web
search. If you can't figure out how to plot something, give a quick web search a try. Great
resources include internet tutorials, R bookdowns, and stackoverflow. You should also use the
help features within RStudio to get help on specific functions or to find vignettes. Try entering
ggplot2 in the help search bar in the lower right panel under the Help tab.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
The only required components to begin plotting are the data we want to plot, geom function(s),
and mapping aesthetics. Notice the + symbol following the ggplot() function. This symbol will
precede each additional layer of code for the plot, and it is important that it is placed at the end
of the line. More on geom functions and mapping aesthetics to come.
What is the relationship between total transcript sums per sample and the number of recovered
transcripts per sample?
We can easily see that there is a relationship between the number of transcripts per sample and
the total transcripts recovered per sample. ggplot2 default parameters are great for
exploratory data analysis. But, with only a few tweaks, we can make some beautiful, publishable
figures.
The data we called was from the data frame exdata, which we created above. Next, we
provided a geom function (geom_point()), which created a scatter plot. This scatter plot
required mapping information, which we provided for the x and y axes. More on this in a
moment.
Geom functions
A geom is the geometrical object that a plot uses to represent data. People often
describe plots by the type of geom that the plot uses. --- R4DS (https://
[Link]/[Link]#geometric-objects)
There are multiple geom functions that change the basic plot type or the plot representation. We
can create scatter plots (geom_point()), line plots (geom_line(),geom_path()), bar plots
(geom_bar(), geom_col()), line modeled to fitted data (geom_smooth()), heat maps
(geom_tile()), geographic maps (geom_polygon()), etc.
ggplot2 provides over 40 geoms, and extension packages provide even more (see
[Link] ([Link]
gallery/) for a sampling). The best way to get a comprehensive overview is the
ggplot2 cheatsheet, which you can find at [Link]
cheatsheets. --- R4DS ([Link]
You can also see a number of options pop up when you type geom into the console, or you can
look up the ggplot2 documentation in the help tab.
We can see how easy it is to change the way the data is plotted. Let's plot the same data using
geom_line().
ggplot(data=exdata) +
geom_line(aes(x=[Link], y = [Link]))
The geom functions require a mapping argument. The mapping argument includes the aes()
function, which "describes how variables in the data are mapped to visual properties
(aesthetics) of geoms" (ggplot2 R Documentation). If not included it will be inherited from the
ggplot() function.
Let's return to our plot above. Is there a relationship between treatment ("dex") and the number
of transcripts or total counts?
There is potentially a relationship. ASM cells treated with dexamethasone in general have lower
total numbers of transcripts and lower total counts.
Notice how we changed the color of our points to represent a variable, in this case. To do this,
we set color equal to 'Treatment' within the aes() function. This mapped our aesthetic, color, to
a variable we were interested in exploring.
Aesthetics that are not mapped to our variables are placed outside of the aes() function. These aesthetics are
manually assigned and do not undergo the same scaling process as those within aes().
For example
We can also see from this that 'Treatment' could be mapped to other aesthetics. In the above
example, we see it mapped to shape rather than color. By default, ggplot2 will only map six
shapes at a time, and if your number of categories goes beyond 6, the remaining groups will go
unmapped. This is by design because it is hard to discriminate between more than six shapes
at any given moment. This is a clue from ggplot2 that you should choose a different aesthetic to
map to your variable. However, if you choose to ignore this functionality, you can manually
assign more than six shapes ([Link]
We could have just as easily mapped it to alpha, which adds a gradient to the point visibility by
category, or we could map it to size. There are multiple options, so feel free to explore a little
with your plots.
Defaults
The assignment of color, shape, or alpha to our variable was automatic, with a unique aesthetic level representing
each category (i.e., 'Dexamethasone', 'none') within our variable. You will also notice that ggplot2 automatically
created a legend to explain the levels of the aesthetic mapped. We can change aesthetic parameters - what colors
are used, for example - by adding additional layers to the plot.
scatter_plot<-ggplot(exdata) +
geom_point(aes(x=[Link], y = [Link],
color=Treatment))
scatter_plot
We can add additional layers directly to our object. We will see how this works by defining some
colors for our 'dex' variable.
Colors
ggplot2 will automatically assign colors to the categories in our data. Colors are assigned to
the fill and color aesthetics in aes(). We can change the default colors by providing an
additional layer to our figure. To change the color, we use the scale_color functions:
scale_color_manual(), scale_color_brewer() ([Link]
[Link]), scale_color_grey(), etc. We can also change the name of the
color labels in the legend using the labels argument of these functions
scatter_plot +
scale_color_manual(values=c("red","black"),
labels=c('treated','untreated'))
scatter_plot +
scale_color_grey()
scatter_plot +
scale_color_brewer(palette = "Paired")
Similarly, if we want to change the fill, we would use the scale_fill options. To apply scale_fill to
shape, we will have to alter the shapes, as only some shapes take a fill argument. Refer to the
shapes in the red box in the figure below.
ggplot(data=exdata) +
geom_point(aes(x=[Link], y = [Link],fill=Treatment),
shape=21,size=2) + #increase size and change points
scale_fill_manual(values=c("purple", "yellow"))
There are a number of ways to specify the color argument including by name, number, and hex
code. Here ([Link] is a great resource from the R
Graph Gallery ([Link] for assigning colors in R.
There are also a number of complementary packages in R that expand our color options. One
of my favorites is viridis, which provides colorblind friendly palettes. randomcoloR is a
great package if you need a large number of unique colors.
Paletteer contains a comprehensive set of color palettes, if you want to load the palettes
from multiple packages all at once. See the Github page ([Link]
paletteer) for details.
However, there are additional components that can be added to our core components to enable
us to generate even more diverse plot types.
• one or more geometric objects that serve as the visual representations of the
data, – for instance, points, lines, rectangles, contours,
• descriptions of how the variables in the data are mapped to visual properties
(aesthetics) of the geometric objects, and an associated scale (e. g., linear,
logarithmic, rank),
• optional parameters that affect the layout and rendering, such text size, font
and alignment, legend positions,
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>
) +
<FACET_FUNCTION> +
<COORDINATE SYSTEM> +
<THEME>
Facets
A way to add variables to a plot beyond mapping them to an aesthetic is to use facets or
subplots. There are two primary functions to add facets, facet_wrap() and facet_grid().
If faceting by a single variable, use facet_wrap(). If multiple variables, use facet_grid().
The first argument of either function is a formula, with variables separated by a ~ (See below).
Variables must be discrete (not continuous).
Let's return to the airway count data to see how facets are useful. Here, we are going to
compare scaled and unscaled count data using a density plot.
A density plot shows the distribution of a numeric variable. --- R Graph Gallery
([Link]
In our example data, density_data, the gene counts were scaled to account for technical
and composition differences using the trimmed mean of M values (TMM) from EdgeR (Robinson
and Oshlack 2010), but non-normalized values remained for comparison. Thus, we can
compare scaled vs unscaled counts by sample using faceting.
#density plot
head(density_data)
#plot
ggplot(data= density_data)+
aes(x=abundance,
color=SampleName)+ #initialize ggplot
geom_density() + #call density plot geom
facet_wrap(~source) + #use facet_wrap; see ~source
scale_x_log10()#scales the x axis using a base-10 log transformation
The distributions of sample counts did not differ greatly between samples before scaling, but
regardless, we can see that the distributions are more similar after scaling.
Here, faceting allowed us to visualize multiple features of our data. We were able to see count
distributions by sample as well as normalized vs non-normalized counts.
Note the help options with ?facet_wrap(). How would we make our plot facets vertical rather
than horizontal?
ggplot(exdata) +
geom_point(aes(x=[Link], y = [Link],
fill=Treatment),
shape=21,size=3) +
#can change labels of fill levels along with colors
scale_fill_manual(values=c("purple", "yellow"),
labels=c('treated','untreated'))+
ggsave("[Link]",width=5.5,height=3.5,units="in",dpi=300)
Resource list
1. ggplot2 cheatsheet
2. The R Graph Gallery ([Link]
3. The R Graphics Cookbook ([Link]
Acknowledgements
Material from this lesson was adapted from Chapter 3 of R for Data Science (https://
[Link]/[Link]) and from a 2021 workshop entitled Introduction to Tidy
Transciptomics ([Link]
[Link]) by Maria Doyle and Stefano Mangiola.
What is dplyr?
The package dplyr tries to provide easy tools for the most common data
manipulation tasks. It was built to work directly with data frames. The thinking
behind it was largely inspired by the package plyr which has been in use for some
time but suffered from being slow in some cases. --- [Link] (https://
[Link]/genomics-r-intro/05-dplyr/[Link])
Loading dplyr
We do not need to load the dplyr package separately, as it is a core tidyverse package. If
you need to install and load only dplyr, use [Link]("dplyr") and
library(dplyr).
library(tidyverse)
Importing data
For this lesson, we will use sample metadata and differential expression results from the
airway RNA-Seq project.
#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")
## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this mess
smeta
## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0
## 7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0
## 8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0
dexp
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG000… untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 3.1
## 2 ENSG000… untrt DPM1 hg38 TRUE 0.198 4.61 6.90 2.8
## 3 ENSG000… untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 7.6
## 4 ENSG000… untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 5.5
## 5 ENSG000… untrt CFH hg38 TRUE 0.417 8.09 29.3 4.6
## 6 ENSG000… untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 4.0
## 7 ENSG000… untrt GCLC hg38 TRUE -0.0581 4.84 0.167 6.9
## 8 ENSG000… untrt NFYA hg38 TRUE -0.509 4.13 44.9 1.0
## 9 ENSG000… untrt STPG1 hg38 TRUE -0.136 3.12 1.04 3.3
## 10 ENSG000… untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 5.6
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
We can get an idea of the structure of these data by using str() or glimpse(). glimpse(),
from tidyverse, is similar to str() but provides somewhat cleaner output.
glimpse(smeta)
## Rows: 8
## Columns: 9
## $ SampleName <chr> "GSM1275862", "GSM1275863", "GSM1275866", "GSM1275867", "
## $ cell <chr> "N61311", "N61311", "N052611", "N052611", "N080611", "N08
## $ dex <chr> "untrt", "trt", "untrt", "trt", "untrt", "trt", "untrt",
## $ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "un
## $ Run <chr> "SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513", "
## $ avgLength <dbl> 126, 126, 126, 87, 120, 126, 101, 98
## $ Experiment <chr> "SRX384345", "SRX384346", "SRX384349", "SRX384350", "SRX3
## $ Sample <chr> "SRS508568", "SRS508567", "SRS508571", "SRS508572", "SRS5
## $ BioSample <chr> "SAMN02422669", "SAMN02422675", "SAMN02422678", "SAMN0242
glimpse(dexp)
## Rows: 15,926
## Columns: 10
## $ feature <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457",
## $ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "un
## $ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "G
## $ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "
## $ .abundant <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU
## $ logFC <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.4
## $ logCPM <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.90966
## $ F <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2
## $ PValue <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0
## $ FDR <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.003
Now that we have some data to work with, let's start subsetting.
iris[1:5,1:3]
While this type of subsetting is useful, it is not always the most readable or easy to employ,
especially for beginners. This is where dplyr comes in. The dplyr package in the tidyverse
world simplifies data wrangling with easy to employ and easy to understand functions specific
for data manipulation in data frames.
To subset by column, we use the function select(). We can include and exclude columns,
reorder columns, and rename columns using select().
We can select the columns we are interested in by first calling the data frame object (dexp)
followed by the columns we want to select (transcript,logFC,FDR). All arguments are
separated by a comma. The order of the arguments will determine the order of the columns in
the new data frame.
## # A tibble: 15,926 × 3
## transcript logFC FDR
## <chr> <dbl> <dbl>
## 1 TSPAN6 -0.390 0.00283
## 2 DPM1 0.198 0.0770
## 3 SCYL3 0.0292 0.844
## 4 C1orf112 -0.124 0.682
## 5 CFH 0.417 0.00376
## 6 FUCA2 -0.250 0.0186
## 7 GCLC -0.0581 0.794
## 8 NFYA -0.509 0.00126
## 9 STPG1 -0.136 0.478
## 10 NIPAL3 -0.0500 0.695
## # ℹ 15,916 more rows
## # A tibble: 15,926 × 3
## gene logFoldChange FDRpvalue
## <chr> <dbl> <dbl>
## 1 TSPAN6 -0.390 0.00283
Note
If you want to retain all columns, you could also use rename() ([Link]
from dplyr to rename columns.
Excluding columns
We can select all columns, leaving out ones that do not interest us using a - sign. This is helpful
if the columns to keep far outweigh those to exclude. We can similarly use the ! to negate a
selection.
ex2<-select(dexp, -feature)
ex2
## # A tibble: 15,926 × 9
## albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <
## 1 untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.0
## 2 untrt DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0
## 3 untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.8
## 4 untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.6
## 5 untrt CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.0
## 6 untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0
## 7 untrt GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.7
## 8 untrt NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.0
## 9 untrt STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.4
## 10 untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.6
## # ℹ 15,916 more rows
ex2<-select(dexp, !feature)
ex2
## # A tibble: 15,926 × 9
## albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <
## 1 untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.0
## 2 untrt DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0
## 3 untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.8
## 4 untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.6
## 5 untrt CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.0
## 6 untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0
## 7 untrt GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.7
## 8 untrt NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.0
## 9 untrt STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.4
## 10 untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.6
## # ℹ 15,916 more rows
#you can reorder columns and call a range of columns using select().
ex3<-select(dexp, transcript:FDR,albut)
ex3
## # A tibble: 15,926 × 9
## transcript ref_genome .abundant logFC logCPM F PValue FDR a
## <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <
## 1 TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.00283 u
## 2 DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0770 u
## 3 SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.844 u
## 4 C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.682 u
## 5 CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.00376 u
## 6 FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0186 u
## 7 GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.794 u
## 8 NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.00126 u
## 9 STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.478 u
## 10 NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.695 u
## # ℹ 15,916 more rows
Note
Notice that we can select a range of columns using the :. We could also deselect a range of
columns or deselect a range of columns while adding a column back.
ex3<-select(dexp, -(albut:F),logFC)
ex3
## # A tibble: 15,926 × 4
## feature PValue FDR logFC
## <chr> <dbl> <dbl> <dbl>
## 1 ENSG00000000003 0.000312 0.00283 -0.390
## 2 ENSG00000000419 0.0281 0.0770 0.198
## 3 ENSG00000000457 0.763 0.844 0.0292
## 4 ENSG00000000460 0.555 0.682 -0.124
## 5 ENSG00000000971 0.000463 0.00376 0.417
## 6 ENSG00000001036 0.00405 0.0186 -0.250
## 7 ENSG00000001084 0.692 0.794 -0.0581
## 8 ENSG00000001167 0.000100 0.00126 -0.509
## 9 ENSG00000001460 0.335 0.478 -0.136
## 10 ENSG00000001461 0.569 0.695 -0.0500
## # ℹ 15,916 more rows
Helper functions
## # A tibble: 15,926 × 4
## transcript logFC logCPM FDR
## <chr> <dbl> <dbl> <dbl>
## 1 TSPAN6 -0.390 5.06 0.00283
## 2 DPM1 0.198 4.61 0.0770
## 3 SCYL3 0.0292 3.48 0.844
## 4 C1orf112 -0.124 1.47 0.682
## 5 CFH 0.417 8.09 0.00376
## 6 FUCA2 -0.250 5.91 0.0186
## 7 GCLC -0.0581 4.84 0.794
## 8 NFYA -0.509 4.13 0.00126
## 9 STPG1 -0.136 3.12 0.478
There are a number of other selection helpers. See the help documentation for select (https://
[Link]/reference/[Link]) for more information ?dplyr::select().
There are many other ways to select multiple columns. You may commonly be interested in
selecting all numeric columns or all factors. The syntax below can be used for this purpose.
## # A tibble: 15,926 × 5
## logFC logCPM F PValue FDR
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.390 5.06 32.8 0.000312 0.00283
## 2 0.198 4.61 6.90 0.0281 0.0770
## 3 0.0292 3.48 0.0969 0.763 0.844
## 4 -0.124 1.47 0.377 0.555 0.682
## 5 0.417 8.09 29.3 0.000463 0.00376
## 6 -0.250 5.91 14.9 0.00405 0.0186
## 7 -0.0581 4.84 0.167 0.692 0.794
## 8 -0.509 4.13 44.9 0.000100 0.00126
## 9 -0.136 3.12 1.04 0.335 0.478
## 10 -0.0500 7.04 0.350 0.569 0.695
## # ℹ 15,916 more rows
## # A tibble: 15,926 × 5
## logFC logCPM F PValue FDR
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.390 5.06 32.8 0.000312 0.00283
## 2 0.198 4.61 6.90 0.0281 0.0770
## 3 0.0292 3.48 0.0969 0.763 0.844
## 4 -0.124 1.47 0.377 0.555 0.682
## 5 0.417 8.09 29.3 0.000463 0.00376
## 6 -0.250 5.91 14.9 0.00405 0.0186
## 7 -0.0581 4.84 0.167 0.692 0.794
## 8 -0.509 4.13 44.9 0.000100 0.00126
## 9 -0.136 3.12 1.04 0.335 0.478
filter() only includes rows where the condition is TRUE; it excludes both FALSE and
NA values. ---R4DS ([Link]
Now let's filter the rows from smeta based on a condition. Let's look at only the treated samples
in dex (i.e., trt) using the function filter(). The first argument is the data frame (e.g.,
smeta) followed by the expression(s) to filter the data frame.
To complete these filter phrases you will often need to include comparison operators such as
the == above. These operators help us evaluate relations. For example, a == b is asking if a
and b are equivalent. It is a logical comparison that when evaluated will return TRUE or FALSE.
The filter function will then return rows that evaluate to TRUE.
a <- 1
b <- 1
a == b
## [1] TRUE
Comparison operators
Comparison Operator Description
> greater than
!= Not equal
== equal
a&b a and b
We may want to combine filtering parameters using AND or OR phrasing and the operators &
and |.
For example, if we only wanted to return rows where dex == trt and cell==N61311, we can
use:
## # A tibble: 1 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS50… SAMN0
## # A tibble: 1 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS50… SAMN0
## # A tibble: 4 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0
## 3 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0
## 4 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0
%in% returns a logical vector indicating if there is a match or not for its left operand.
--- match R Documentation.
The returned logical vector will be the length of the vector to the left. Its basic usage:
## # A tibble: 4 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0
## 2 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0
## 3 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0
## 4 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0
Past versions of dplyr included powerful variants of filter, select, and other functions to
help perform tasks across columns. You may see functions such as filter_all, filter_if,
and filter_at. Functions like these can still be used but have been superseded by across
([Link] However, across has been deprecated in
the case of filter and replaced by if_any() and if_all().
Both functions operate similarly to across() but go the extra mile of aggregating the
results to indicate if all the results are true when using if_all(), or if at least one is
true when using if_any() ---[Link] ([Link]
dplyr-1-0-4-if-any/)
Anonymous functions
The code above includes an anonymous function. Read more here ([Link]
[Link]#anonymous_function,_formula). You may also find this stackoverflow post (https://
[Link]/questions/56532119/dplyr-piping-data-difference-between-and-x) useful.
For example,
Let's explore how piping streamlines this. Piping (using %>%) allows you to employ multiple
functions consecutively, while improving readability. The output of one function is passed
directly to another without storing the intermediate steps as objects. You can pipe from the
beginning (reading in the data) all the way to plotting without storing the data or intermediate
objects, if you want. Pipes in R come from the magrittr package, which is a dependency of
dplyr.
Pipe
Read more info about the magrittr pipe here ([Link] . There is also a
native R pipe, |>, as of R 4.1.0. Read more about the difference between %>% and |> here (https://
[Link]/blog/2023/04/base-vs-magrittr-pipe/).
To pipe, we have to first call the data and then pipe it into a function. The output of each step is
then piped into the next step.
tspanDpm <- dexp %>% #call the data and pipe to select()
select(transcript, logFC, FDR) %>% #select columns of interest
filter(transcript == "TSPAN6" | transcript=="DPM1" ) #filter
Notice that the data argument has been dropped from select() and filter(). This is
because the pipe passes the input from the left to the right. The %>% must be at the end of each
line.
ggplot(aes(x=transcript,y=logFC,fill=FDR)) + #plot
geom_bar(stat = "identity") +
theme_classic() +
geom_hline(yintercept=0, linetype="dashed", color = "black")
The dplyr functions by themselves are somewhat simple, but by combining them
into linear workflows with the pipe, we can accomplish more complex manipulations
of data frames. ---[Link] ([Link]
dplyr/[Link])
Reordering rows
There are many steps that can be taken following subsetting (i.e., filtering by rows and
columns); one of which is reordering rows. In the tidyverse, reordering rows is largely done by
arrange(). Arrange will reorder a variable from smallest to largest, or in the case of
characters, alphabetically, from a to z.
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG0000… untrt A1BG-AS1 hg38 TRUE 0.513 1.02 9.22 1.4
## 2 ENSG0000… untrt A2M hg38 TRUE 0.528 10.1 3.57 9.2
## 3 ENSG0000… untrt A2M-AS1 hg38 TRUE -0.337 0.308 2.76 1.3
## 4 ENSG0000… untrt A4GALT hg38 TRUE 0.519 5.89 24.5 8.5
## 5 ENSG0000… untrt AAAS hg38 TRUE -0.0254 5.12 0.134 7.2
## 6 ENSG0000… untrt AACS hg38 TRUE -0.191 4.06 5.00 5.3
## 7 ENSG0000… untrt AADAT hg38 TRUE -0.642 2.67 16.9 2.7
## 8 ENSG0000… untrt AAGAB hg38 TRUE -0.165 5.08 5.82 3.9
## 9 ENSG0000… untrt AAK1 hg38 TRUE -0.188 3.82 2.29 1.6
## 10 ENSG0000… untrt AAMDC hg38 TRUE 0.447 2.42 8.52 1.7
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG000002… untrt LINC00906 hg38 TRUE -4.59 0.473 139. 1.1
## 2 ENSG000001… untrt LRRTM2 hg38 TRUE -4.00 1.24 127. 1.6
## 3 ENSG000001… untrt VASH2 hg38 TRUE -3.95 0.0171 152. 7.7
## 4 ENSG000001… untrt VCAM1 hg38 TRUE -3.66 4.60 565. 2.8
## 5 ENSG000001… untrt SLC14A1 hg38 TRUE -3.63 1.38 42.3 1.2
## 6 ENSG000002… untrt FER1L6 hg38 TRUE -3.13 3.53 238. 1.1
## 7 ENSG000001… untrt SMTNL2 hg38 TRUE -3.12 1.46 134. 1.2
## 8 ENSG000001… untrt WNT2 hg38 TRUE -3.07 3.99 521. 4.0
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG00000… untrt ALOX15B hg38 TRUE 10.1 1.62 554. 5.92
## 2 ENSG00000… untrt ZBTB16 hg38 TRUE 7.15 4.15 1429. 5.11
## 3 ENSG00000… untrt <NA> <NA> TRUE 6.17 1.35 380. 1.58
## 4 ENSG00000… untrt ANGPTL7 hg38 TRUE 5.68 3.51 483. 5.66
## 5 ENSG00000… untrt STEAP4 hg38 TRUE 5.22 3.66 445. 8.07
## 6 ENSG00000… untrt PRODH hg38 TRUE 4.85 1.29 253. 9.10
## 7 ENSG00000… untrt FAM107A hg38 TRUE 4.74 2.78 656. 1.51
## 8 ENSG00000… untrt LGI3 hg38 TRUE 4.68 -0.0503 106. 3.45
## 9 ENSG00000… untrt SPARCL1 hg38 TRUE 4.56 5.53 721. 1.00
## 10 ENSG00000… untrt KLF15 hg38 TRUE 4.48 4.69 479. 5.86
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
Acknowledgments
Some material from this lesson was either taken directly or adapted from the Intro to R and
RStudio for Genomics lesson provided by [Link] ([Link]
genomics-r-intro/01-introduction/[Link]). Additional content was inspired by Chapter 3,
Wrangling Data in the Tidyverse, ([Link]
[Link]#filtering-data) from Tidyverse Skills for Data Science and Suzan Baert's dplyr tutorials
([Link]
Loading dplyr
In this lesson, we are continuing with the package dplyr. We do not need to load the dplyr
package separately, as it is a core tidyverse package. Again, if you need to install and load
only dplyr, use [Link]("dplyr") and library(dplyr).
library(tidyverse)
Data
Let's load in some data to work with. In this lesson, we will continue to use sample metadata,
raw count data, and differential expression results from the airway RNA-Seq project.
#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")
## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this mess
smeta
## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSa
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0
## 7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0
## 8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0
## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR103951
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this messa
## • `` -> `...1`
acount
## # A tibble: 64,102 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR103
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <
## 1 ENSG000000… 679 448 873 408 1138
## 2 ENSG000000… 0 0 0 0 0
## 3 ENSG000000… 467 515 621 365 587
## 4 ENSG000000… 260 211 263 164 245
## 5 ENSG000000… 60 55 40 35 78
## 6 ENSG000000… 0 0 2 0 1
## 7 ENSG000000… 3251 3679 6177 4252 6721 1
## 8 ENSG000000… 1433 1062 1733 881 1424
## 9 ENSG000000… 519 380 595 493 820
## 10 ENSG000000… 394 236 464 175 658
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
dexp
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PV
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <
## 1 ENSG000… untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 3.1
## 2 ENSG000… untrt DPM1 hg38 TRUE 0.198 4.61 6.90 2.8
## 3 ENSG000… untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 7.6
## 4 ENSG000… untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 5.5
There are a series of functions from dplyr devoted to the purpose of joining data frames. There
are two types of joins: mutating joins ([Link]
and filtering joins ([Link]
Mutating joins
Imagine we have two data frames x and y. A mutating join will keep all columns from x and y
by adding columns from y to x.
return all rows from x, and all columns from x and y. Rows in x with no match in y
will have NA values in the new columns. If there are multiple matches between x
and y, all combinations of the matches are returned. --- R documentation, dplyr
(version 0.7.8) ([Link]
topics/join)
return all rows from y, and all columns from x and y. Rows in y with no match in x will
have NA values in the new columns. If there are multiple matches between x and y,
all combinations of the matches are returned. --- R documentation, dplyr (version
0.7.8) ([Link]
return all rows from x where there are matching values in y, and all columns from x
and y. If there are multiple matches between x and y, all combination of the
matches are returned. --- R documentation, dplyr (version 0.7.8) (https://
[Link]/packages/dplyr/versions/0.7.8/topics/join)
return all rows and all columns from both x and y. Where there are not matching
values, returns NA for the one missing. --- R documentation, dplyr (version 0.7.8)
([Link]
Note
The R documentation for dplyr has been updated with dplyr v1.0.9. However, these descriptions still stand and are
clearer (in my opinion) than the new documentation.
The most common type of join is the left_join(). Let's see this in action:
#reshape acount
acount_smeta<-acount %>% pivot_longer(where([Link]),names_to ="Sample",
values_to= "Count") %>% left_join(smeta, by=c("Sample"="Run
acount_smeta
## # A tibble: 512,816 × 11
## Feature Sample Count SampleName cell dex albut avgLength Experi
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 ENSG000000000… SRR10… 679 GSM1275862 N613… untrt untrt 126 SRX384
## 2 ENSG000000000… SRR10… 448 GSM1275863 N613… trt untrt 126 SRX384
## 3 ENSG000000000… SRR10… 873 GSM1275866 N052… untrt untrt 126 SRX384
## 4 ENSG000000000… SRR10… 408 GSM1275867 N052… trt untrt 87 SRX384
## 5 ENSG000000000… SRR10… 1138 GSM1275870 N080… untrt untrt 120 SRX384
## 6 ENSG000000000… SRR10… 1047 GSM1275871 N080… trt untrt 126 SRX384
## 7 ENSG000000000… SRR10… 770 GSM1275874 N061… untrt untrt 101 SRX384
## 8 ENSG000000000… SRR10… 572 GSM1275875 N061… trt untrt 98 SRX384
## 9 ENSG000000000… SRR10… 0 GSM1275862 N613… untrt untrt 126 SRX384
## 10 ENSG000000000… SRR10… 0 GSM1275863 N613… trt untrt 126 SRX384
## # ℹ 512,806 more rows
## # ℹ 2 more variables: Sample.y <chr>, BioSample <chr>
Notice the use of by in left_join. The argument by requires the column or columns that we
want to join by. If the column we want to join by has a different name, we can use the notation
above, which says to match Sample from acount to Run from smeta.
Filtering joins
Filtering joins result in filtered x data based on matching or non-matching with y. These joins do
not add columns from y to x.
semi_join()
return all rows from x where there are matching values in y, keeping just columns
from x.
A semi join differs from an inner join because an inner join will return one row of x
for each matching row of y, where a semi join will never duplicate rows of x. --- R
documentation, dplyr (version 0.7.8) ([Link]
dplyr/versions/0.7.8/topics/join)
anti_join()
return all rows from x where there are not matching values in y, keeping just
columns from x. --- R documentation, dplyr (version 0.7.8) (https://
[Link]/packages/dplyr/versions/0.7.8/topics/join)
#reshape acount
smeta_f<-smeta %>% filter(Run %in% c("SRR1039512","SRR1039508"))
semi_join(acount_L,smeta_f, by=c("Sample"="Run"))
Note
This example does not use the %>%. This was simply to demonstrate the different "strategies" that can be used to set
up and run code.
## # A tibble: 128,204 × 3
## Feature Sample Count
## <chr> <chr> <dbl>
## 1 ENSG00000000003 SRR1039508 679
## 2 ENSG00000000003 SRR1039512 873
## 3 ENSG00000000005 SRR1039508 0
## 4 ENSG00000000005 SRR1039512 0
## 5 ENSG00000000419 SRR1039508 467
## 6 ENSG00000000419 SRR1039512 621
## 7 ENSG00000000457 SRR1039508 260
## 8 ENSG00000000457 SRR1039512 263
## 9 ENSG00000000460 SRR1039508 60
## 10 ENSG00000000460 SRR1039512 40
## # ℹ 128,194 more rows
In this case, we could have used filter. However, if we were interested in filtering by multiple
variables simultaneously, it would be easier to employ a semi-join.
Transforming variables
Data wrangling often involves transforming one variable to another. For example, we may be
interested in log transforming a variable or adding two variables to create a third. In dplyr this
can be done with mutate() and transmute(). These functions allow us to create a new
variable from existing variables.
mutate()
mutate() adds new variables and preserves existing ones; transmute() adds new
variables and drops existing ones. New variables overwrite existing variables of the
same name. --- [Link] ([Link]
[Link])
Let's create a column in our original differential expression data frame denoting significant
transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater
than or equal to 2).
This creates a column named Significant that contains TRUE values where the expression
above was true (meaning significant in this case) and FALSE where the expression was FALSE.
Let's look at another example. This time let's log transform our FDR corrected p-values.
## # A tibble: 15,926 × 1
## logFDR
## <dbl>
## 1 -2.55
## 2 -1.11
## 3 -0.0735
## 4 -0.166
## 5 -2.42
## 6 -1.73
## 7 -0.100
## 8 -2.90
## 9 -0.320
## 10 -0.158
## # ℹ 15,916 more rows
What if we want to transform all of our counts spread across multiple columns in acount using
scale(), which applies a z-score transformation? In this case we use across() within
mutate(), which has replaced the scoped verbs (mutate_if,mutate_at, and
mutate_all).
Z-score
A z score tells us the number of standard deviations from the mean of a given value. This can be achieved by
scale(x, center = TRUE, scale = TRUE).
## # A tibble: 64,102 × 9
## Feature SRR1039508[,1] SRR1039509[,1] SRR1039512[,1] SRR1039513[,
## <chr> <dbl> <dbl> <dbl> <db
## 1 ENSG00000000003 0.103 0.0527 0.0991 0.06
## 2 ENSG00000000005 -0.0929 -0.100 -0.0821 -0.08
## 3 ENSG00000000419 0.0418 0.0756 0.0468 0.04
## 4 ENSG00000000457 -0.0179 -0.0281 -0.0275 -0.02
## 5 ENSG00000000460 -0.0756 -0.0814 -0.0738 -0.07
## 6 ENSG00000000938 -0.0929 -0.100 -0.0817 -0.08
## 7 ENSG00000000971 0.845 1.16 1.20 1.51
## 8 ENSG00000001036 0.321 0.262 0.278 0.24
## 9 ENSG00000001084 0.0568 0.0295 0.0414 0.09
## 10 ENSG00000001167 0.0208 -0.0196 0.0142 -0.02
## # ℹ 64,092 more rows
## # ℹ 4 more variables: SRR1039516 <dbl[,1]>, SRR1039517 <dbl[,1]>,
## # SRR1039520 <dbl[,1]>, SRR1039521 <dbl[,1]>
Mutate can also be used to coerce variables. Again, we need to use across() and where().
Let's also check out the difference between mutate() and transmute().
## Rows: 512,816
## Columns: 9
## $ Feature <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG00
## $ Sample <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR103951
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM127587
## $ cell <fct> N61311, N61311, N052611, N052611, N080611, N080611, N0610
## $ dex <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, tr
## $ albut <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, u
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SR
## $ Sample.y <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SR
## $ BioSample <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, S
#mutate
ex_coerce<-acount_smeta %>% mutate(across(where([Link]),[Link]))
glimpse(ex_coerce)
## Rows: 512,816
## Columns: 11
## $ Feature <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG00
## $ Sample <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR103951
## $ Count <dbl> 679, 448, 873, 408, 1138, 1047, 770, 572, 0, 0, 0, 0, 0,
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM127587
## $ cell <fct> N61311, N61311, N052611, N052611, N080611, N080611, N0610
## $ dex <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, tr
## $ albut <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, u
## $ avgLength <dbl> 126, 126, 126, 87, 120, 126, 101, 98, 126, 126, 126, 87,
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SR
## $ Sample.y <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SR
## $ BioSample <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, S
Notice that transmute dropped all columns that were not mutated.
Let's create a small data frame, and use mutate() to get the mean(). What happens when we
use mean as is?
df<-[Link](A=c(1,2,3),B=c(4,5,6),C=c(7,8,9))
df
## A B C
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
## A B C D
## 1 1 4 7 5
## 2 2 5 8 5
## 3 3 6 9 5
## A B C D
## 1 1 4 7 4
## 2 2 5 8 5
## 3 3 6 9 6
The first example simply gives us the mean of A, B, and C (not row wise). The second example
gave us what we wanted, but was more complicated. The alternative is to first group by row
using rowwise() and then use mutate().
## # A tibble: 3 × 4
## # Rowwise:
## A B C D
Let's get the top five genes with the greatest median raw counts by treatment.
## `summarise()` has grouped output by 'dex'. You can override using the `.grou
## argument.
## # A tibble: 10 × 3
## # Groups: dex [2]
## dex Feature median_counts
## <chr> <chr> <dbl>
## 1 trt ENSG00000115414 322164
## 2 trt ENSG00000011465 263587
## 3 trt ENSG00000156508 239676.
## 4 trt ENSG00000198804 230992
## 5 trt ENSG00000116260 187288.
## 6 untrt ENSG00000011465 336076
## `summarise()` has grouped output by 'dex'. You can override using the `.grou
## argument.
## # A tibble: 10 × 3
## # Groups: dex [2]
## dex Feature median_counts
## <chr> <chr> <dbl>
## 1 trt ENSG00000115414 322164
## 2 trt ENSG00000011465 263587
## 3 trt ENSG00000156508 239676.
## 4 trt ENSG00000198804 230992
## 5 trt ENSG00000116260 187288.
## 6 untrt ENSG00000011465 336076
## 7 untrt ENSG00000115414 302956.
## 8 untrt ENSG00000156508 294097
## 9 untrt ENSG00000164692 249846
## 10 untrt ENSG00000198804 249206
How many rows per sample are in the acount_smeta data frame? We can use count() or
summarize() paired with other functions (e.g., n(),tally()).
acount_smeta %>%
count(dex, Sample)
## # A tibble: 8 × 3
## dex Sample n
## <chr> <chr> <int>
## 1 trt SRR1039509 64102
## 2 trt SRR1039513 64102
## 3 trt SRR1039517 64102
acount_smeta %>%
group_by(dex, Sample) %>%
summarize(n=n()) #there are multiple functions that can be used here
## `summarise()` has grouped output by 'dex'. You can override using the `.grou
## argument.
## # A tibble: 8 × 3
## # Groups: dex [2]
## dex Sample n
## <chr> <chr> <int>
## 1 trt SRR1039509 64102
## 2 trt SRR1039513 64102
## 3 trt SRR1039517 64102
## 4 trt SRR1039521 64102
## 5 untrt SRR1039508 64102
## 6 untrt SRR1039512 64102
## 7 untrt SRR1039516 64102
## 8 untrt SRR1039520 64102
This output makes sense, as there are 64,102 unique Ensembl ids
(n_distinct(acount_smeta$Feature))
Missing Values
By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a
way to make sure that users know they have missing data, and make a conscious decision on how
to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the
missing data) is to use [Link] = TRUE (rm stands for remove). ---[Link] (https://
[Link]/genomics-r-intro/05-dplyr/[Link])
#let's view
fun_df
## genes counts
## 1 A NA
## 2 A 214
## 3 A NA
## 4 B 352
## 5 B 256
## 6 B NA
## 7 C 400
## 8 C 381
## 9 C 250
## 10 D 278
## 11 D NA
## 12 D 169
## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <int> <int> <int>
## 1 A NA NA NA NA
## 2 B NA NA NA NA
## 3 C 344. 381 250 400
## 4 D NA NA NA NA
#use [Link]
fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts, [Link]=TRUE),
median_count = median(counts, [Link]=TRUE),
min_count = min(counts, [Link]=TRUE),
max_count = max(counts, [Link]=TRUE))
## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <dbl> <int> <int>
## 1 A 214 214 214 214
## 2 B 304 304 256 352
## 3 C 344. 381 250 400
## 4 D 224. 224. 169 278
Similar to mutate, we can summarize across multiple columns at once using across(). For
example, let's get the mean of logFC and logCPM.
dexp %>%
summarize(across(starts_with("Log"), mean))
## # A tibble: 1 × 2
## logFC logCPM
## <dbl> <dbl>
## 1 -0.00859 3.71
Acknowledgments
Some material from this lesson was either taken directly or adapted from the Intro to R and
RStudio for Genomics lesson provided by [Link] ([Link]
genomics-r-intro/01-introduction/[Link]). Additional content was inspired by Chapter 13,
Relational Data, ([Link] from R for Data Science and Suzan
Baert's dplyr tutorials ([Link]
Objectives
1. To explore Bioconductor, a repository for R packages related to biological data analysis.
2. To better understand S4 objects as they relate to the Bioconductor core infrastructure.
3. To learn more about a popular Bioconductor S4 class, SummarizedExperiment.
What is Bioconductor?
Bioconductor is a repository for R packages related to biological data analysis, primarily
bioinformatics and computational biology, and as such it is a great place to search for -omics
packages and pipelines.
• software
• annotation data
• experiment data
• workflows.
As of November 2023, Bioconductor v3.18 included 2,266 software packages, 429 experiment
data packages, 920 annotation packages, 30 workflows, and 4 books.
To browse these packages, use BiocViews ([Link]
[Link]).
Core infrastructure
An important goal of the Bioconductor project is interoperability, or the ability of packages to
work together using shared data classes and methods. This is achieved through the use of
common data structures. These common data structures are "implemented as classes in the S4
object-oriented system of the R language" ([Link]
For more information comparing these paradigms, check out this brief article ([Link]
functional-programming-vs-object-oriented-programming-oop-which-is-better-82172e53a526AZ) from Medium.
Object-oriented programming allows the same function to be used for many different types of
input (e.g., summary()) (Advanced R).
Note
We work with S3 objects ([Link] all the time in R and these are prominent when using base
R programming.
Classes are organised in a hierarchy so that if a method does not exist for one
class, its parent’s method is used, and the child is said to inherit behaviour. For
example, in R, an ordered factor inherits from a regular factor, and a generalised
linear model inherits from a linear model. --- Advanced R ([Link]
[Link])
Understanding S4 classes
This lesson will not go into too much detail regarding the S4 class system. You do not need to know that much
about S4 systems to use S4 objects. However, some of the basics will be described below. If you would like to learn
more about S4 classes, see this lesson ([Link] with The
Carpentries and this chapter ([Link] of Advanced R.
S4 classes create the data structures that store complex information (e.g., biological assay data
and metadata) in one or more slots. The entire structure can then be assigned to an R object.
The types of information in each slot of the object is tightly controlled. S4 generics and methods
define functions that can be applied to these objects. See this lesson ([Link]
[Link]/bioc-project/[Link]#s4-classes-and-methods) from The Carpentries,
authored by the Bioconductor Project community, for more information.
In brief, S4 objects have slots that store defined types of information. The use of the object can
be extended by defining a new class, which will inherit the attributes of the old class, including
all of the slots, and add new slots.
Info
You can find a list of common classes used by the Bioconductor Project here (https://
[Link]/[Link]?q=classes#reusebioc).
The advantage of using a SummarizedExperiment is that when manipulating the object, the
data elements in each slot maintain consistency, meaning if a sample is removed, the
corresponding sample column in the expression data would also be removed. This keeps you
from accidentally including data that should have been excluded or vice versa.
We can use the airway package to see how this container works, including how to access and
subset the data.
There are data sets available in R to practice with or showcase different packages. The airway data is from Himes et
al. (2014) ([Link] . These data, which are contained within a
RangedSummarizedExperiment object, are from a bulk RNAseq experiment. In the experiment, the authors
"characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone,"
a common therapy for asthma. The airway package includes RNAseq count data from 8 airway smooth muscle cell
samples.
Note
Other popular classes that inherit from the SummarizedExperiment include the DESeqDataSet (https://
[Link]/packages/devel/bioc/vignettes/DESeq2/inst/doc/[Link]#the-deseqdataset) and the
SingleCellExperiment ([Link]
doc/[Link]).
library(SummarizedExperiment)
library(airway)
data(airway, package="airway")
se <- airway
se
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Accessors
As you can see from the image, there are several accessor functions to access the data from
the object:
• assays() - access matrix-like experimental data (e.g., count data). Rows are genomic
features (e.g., transcripts) and columns are samples.
This can hold more than one assay, and each assay should be called directly with the $
accessor.
## List of length 1
## names(1): counts
colData(se)
rowData(se)
## ENSG00000000460 NA C1orf112
## ... ... ...
## ENSG00000273489 NA RP11-180C16.1
## ENSG00000273490 NA TSEN34
## ENSG00000273491 NA RP11-138A9.2
## ENSG00000273492 NA AP000230.1
## ENSG00000273493 NA RP11-80H18.4
rowRanges(se)$ENSG00000000003
metadata(se)
## [[1]]
## Experiment data
## Experimenter name: Himes BE
## Laboratory: NA
## Contact information:
## Title: RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucoc
## URL: [Link]
## PMIDs: 24926665
##
## Abstract: A 226 word abstract is available. Use 'abstract' method.
```
Note
You can use str() to get an idea of how to access information from each slot. Notice the use of @.
Now that we know how to access the information stored in the object se, how can we subset it?
First, notice that you can easily access columns from the sample metadata (colData()) using
$.
se$SampleName
se$dex
levels(se$dex)
We can also use 2D subsetting like we would for a data frame (i.e., [] notation), where left of
the comma applies to the rows (features) and right of the comma applies to the columns
(samples).
If we only wanted the first 10 transcripts and the first three samples, we could use
se[1:10,1:3]
## class: RangedSummarizedExperiment
## dim: 10 3
## metadata(1): ''
## assays(1): counts
## rownames(10): ENSG00000000003 ENSG00000000005 ... ENSG00000001084
## ENSG00000001167
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(3): SRR1039508 SRR1039509 SRR1039512
## colData names(9): SampleName cell ... Sample BioSample
## class: RangedSummarizedExperiment
## dim: 63677 4
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
se[ ,se$dex=="trt"]$dex #we can check to see that only trt samples remain
We can do the same with feature information. For example, if we only want to include
protein_coding transcripts:
se[rowData(se)$gene_biotype == "protein_coding",]
## class: RangedSummarizedExperiment
## dim: 22810 8
## metadata(1): ''
## assays(1): counts
## rownames(22810): ENSG00000000003 ENSG00000000005 ... ENSG00000273482
## ENSG00000273490
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
OR
We could filter using the count assay. For example, keeping only transcripts that have a count
of at least 10 across a minimum of 4 samples.
## class: RangedSummarizedExperiment
## dim: 16139 8
## metadata(1): ''
## assays(1): counts
## rownames(16139): ENSG00000000003 ENSG00000000419 ... ENSG00000273486
## ENSG00000273487
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
Using tidySummarizedExperiment
Why did we focus so heavily on the tidyverse if it can't be used to manipulate Bioconductor
objects? Well, for one, regardless of whether you are a user of Bioconductor packages, you will
often manipulate data frames, which is where the tidyverse collection of packages is super
useful. Second, you can actually use tidyverse commands in certain contexts. The R package,
tidySummarizedExperiment allows you to view a SummarizedExperiment object as a
tibble (a tidyverse data frame) and provides other tidyverse compatible functions from the
packages dplyr, tidyr, ggplot, and plotly. ([Link]
tidySummarizedExperiment/)
library(tidySummarizedExperiment)
## # A tibble: 2 × 2
## dex total
## <fct> <int>
## 1 trt 85955244
## 2 untrt 89561179
## # A tibble: 267,752 × 24
## # Groups: .feature [33,469]
## .feature .sample counts SampleName cell dex albut Run avgLe
## <chr> <chr> <int> <fct> <fct> <fct> <fct> <fct> <
## 1 ENSG00000000003 SRR10395… 679 GSM1275862 N613… untrt untrt SRR1…
## 2 ENSG00000000419 SRR10395… 467 GSM1275862 N613… untrt untrt SRR1…
## 3 ENSG00000000457 SRR10395… 260 GSM1275862 N613… untrt untrt SRR1…
## 4 ENSG00000000460 SRR10395… 60 GSM1275862 N613… untrt untrt SRR1…
## 5 ENSG00000000938 SRR10395… 0 GSM1275862 N613… untrt untrt SRR1…
## 6 ENSG00000000971 SRR10395… 3251 GSM1275862 N613… untrt untrt SRR1…
## 7 ENSG00000001036 SRR10395… 1433 GSM1275862 N613… untrt untrt SRR1…
## 8 ENSG00000001084 SRR10395… 519 GSM1275862 N613… untrt untrt SRR1…
## 9 ENSG00000001167 SRR10395… 394 GSM1275862 N613… untrt untrt SRR1…
## 10 ENSG00000001460 SRR10395… 172 GSM1275862 N613… untrt untrt SRR1…
## # ℹ 267,742 more rows
## # ℹ 15 more variables: Experiment <fct>, Sample <fct>, BioSample <fct>,
## # gene_id <chr>, gene_name <chr>, entrezid <int>, gene_biotype <chr>,
## # gene_seq_start <int>, gene_seq_end <int>, seq_name <chr>, seq_strand <in
## # seq_coord_system <int>, symbol <chr>, GRangesList <list>, mean_exprs <db
1. tidySingleCellExperiment ([Link]
- for SingleCellExperiment ([Link]
vignettes/SingleCellExperiment/inst/doc/[Link]) objects
Note
Saving an R object
S4 objects store complex information that isn't necessarily simple to save. If you intend to work
with the object further, try using saveRDS.
Note
saveRDS() is used to save a single object. To save multiple objects, use save(). To save your entire R
environment, use [Link]().
Acknowledgements
Content from this lesson was inspired by or taken from the following sources:
1. [Link]
[Link] ([Link]
SummarizedExperiment/inst/doc/[Link])
2. [Link]
([Link]
3. [Link] ([Link]
[Link]/bioc-intro/[Link])
4. [Link] (https://
[Link]/compbio/bioc/[Link])
Objectives
1. Review important data wrangling functions
2. Put our wrangling skills to use on a realistic RNA-Seq data set
Importing and exporting data into the R environment is done using base R and readR (readxl
in the case of excel files) functions. Most of these functions begin with read. and read_ for
importing, or write. and write_ for exporting. You can use tab complete to find the
appropriate function.
rownames_to_column() - Creates a column in your data frame from existing row names.
There are 3 components required for all ggplot2 plots: DATA, GEOM_FUNCTION, and aes
MAPPINGS.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>))
◦ everything()
◦ last_col()
◦ starts_with()
◦ ends_with()
◦ contains()
◦ matches()
◦ num_range()
◦ all_of()
◦ any_of()
◦ where()
• filter() often uses relational operators. Type ?Comparison and ?match in the
console for more information.
• For filtering across multiple rows, check out if_all() and if_any().
Mutating Joins
Filtering joins
To mutate across multiple columns using a single phrase use across(). Because across()
requires you to select which columns you want to mutate, it can be used with the select()
helper functions.
To apply some function to smaller subsets (groups) within your data, and return a summarized
data frame, use group_by() with summarize().
group_by() - group a data frame by a categorical variable so that a given operation can be
performed per group / category.
Using R on Biowulf
If you wish to use R on Biowulf, you can view modules available using module -r avail
'^R$'. Loading the module requires an interactive session, sinteractive (unless submitting
a job).
For a list of currently installed packages in the default R environment, see [Link]
apps/[Link]#packages ([Link]
For instructions on using R interactively or via a batch job, see the R Biowulf help page (https://
[Link]/apps/[Link]). You may also find our course documentation (https://
[Link]/docs/reproducible-r-on-biowulf/) for Toward Reproducibility with R
on Biowulf helpful.
Provided Data
The provided data set is an example of a real count matrix returned from the NCI CCR
Sequencing Facility (CCR-SF). The provided file (./data/SF_example_RNASeq_1.txt)
contains RNAseq data for two sets of samples (Cell1 and Cell2) run in triplicate.
Data Challenges
1. The column names begin with numbers, which are not syntactic with R.
2. The gene names are hybrids of Ensembl ID and gene symbols and will match neither if
needed in the future. Therefore, these should be split into their own columns.
3. The values of the gene counts are in decimals. Most differential expression packages will
expect integers.
Our challenge
First, we will want to fix the three immediate problems discussed above. Once those have been
fixed, we are going to create a bar plot, using ggplot2 to plot the total gene counts per
sample. With that accomplished we are then going to format the data so that it can be used to
construct a DESeqDataSet object. This object is necessary to apply different functions from
DESeq2 including differential expression analysis.
Note: While the instructions below include some guidance on the different functions required,
you may need to include helper functions or modify function arguments to complete each task.
Expected output will be shown below each prompt.
{{Sdet}}
Solution}
library(tidyverse)
dmat<-read_delim("./data/SF_example_RNASeq_1.txt",col_names =TRUE)
{{Edet}}
Great! You should have a 10,000 x 7 tibble if you used a readr function for data loading.
dmat
## # A tibble: 10,000 × 7
## gene_id `1_Cell1_Rep1` `2_Cell1_Rep2` `3_Cell1_Rep3` `4_Cell2_R
## <chr> <dbl> <dbl> <dbl> <
## 1 ENSG00000001630.… 6877. 6614 7058. 11
## 2 ENSG00000002016.… 283. 287. 287.
## 3 ENSG00000002330.… 1946 1662 2121
## 4 ENSG00000002834.… 17636 19333 18917 4
## 5 ENSG00000003056.… 3874 4107 4005 5
## 6 ENSG00000003393.… 2041 2150 2141 1
## 7 ENSG00000003989.… 279 345 305 18
## 8 ENSG00000004534.… 2695. 3031 2871 1
## 9 ENSG00000004838.… 42 52 39
## 10 ENSG00000004848.… 1 0 0
## # ℹ 9,990 more rows
## # ℹ 2 more variables: `5_Cell2_Rep2` <dbl>, `6_Cell2_Rep3` <dbl>
Notice that the column names each begin with a number (e.g., 1_MB231_RNA1). Place an "S"
at the beginning of each sample name using rename_with(), which is similar to rename(),
but it uses a function to rename the columns. Look up the function ?paste() to use with
rename(). Overwrite dmat with dmat.
{{Sdet}}
Solution}
{{Edet}}
colnames(dmat)
Separate the gene abbreviation from the Ensembl ID in column 1 using separate(). Save the
output to a new object named dmat2.
{{Sdet}}
Solution}.
{{Edet}}
Store the newly separated columns (Ensembl_ID and gene_abb) in a new data frame named
gene_names and drop the gene abbreviations from our working data frame (dmat2) because
we ultimately want this to be a data matrix (overwrite dmat2 with dmat2).
{{Sdet}}
Solution}
{{Edet}}
gene_names
## # A tibble: 10,000 × 2
## Ensembl_ID gene_abb
## <chr> <chr>
## 1 ENSG00000001630.11 CYP51A1
## 2 ENSG00000002016.12 RAD52
## 3 ENSG00000002330.9 BAD
## 4 ENSG00000002834.13 LASP1
## 5 ENSG00000003056.3 M6PR
## 6 ENSG00000003393.10 ALS2
## 7 ENSG00000003989.12 SLC7A2
## 8 ENSG00000004534.10 RBM6
## 9 ENSG00000004838.9 ZMYND10
## 10 ENSG00000004848.6 ARX
## # ℹ 9,990 more rows
dmat2
## # A tibble: 10,000 × 7
## Ensembl_ID S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep
## <chr> <dbl> <dbl> <dbl> <dbl
## 1 ENSG00000001630.11 6877. 6614 7058. 11305
## 2 ENSG00000002016.12 283. 287. 287. 265
## 3 ENSG00000002330.9 1946 1662 2121 608
## 4 ENSG00000002834.13 17636 19333 18917 4583
## 5 ENSG00000003056.3 3874 4107 4005 5741
## 6 ENSG00000003393.10 2041 2150 2141 1687
## 7 ENSG00000003989.12 279 345 305 18586
## 8 ENSG00000004534.10 2695. 3031 2871 1948
## 9 ENSG00000004838.9 42 52 39 61
## 10 ENSG00000004848.6 1 0 0 18
## # ℹ 9,990 more rows
## # ℹ 2 more variables: S5_Cell2_Rep2 <dbl>, S6_Cell2_Rep3 <dbl>
Many of the packages that handle RNASeq count data do not work correctly with decimal
numbers. We need to convert these numbers to integers using mutate(). Save your
transformed data frame to an object named dmat3.
{{Sdet}}
Solution}
{{Edet}}
dmat3
## # A tibble: 10,000 × 7
## Ensembl_ID S1_Cell1_Rep1 S2_Cell1_Rep2 S3_Cell1_Rep3 S4_Cell2_Rep
## <chr> <int> <int> <int> <int
## 1 ENSG00000001630.11 6877 6614 7058 1130
## 2 ENSG00000002016.12 283 287 287 26
Step 5: Create a bar plot, using ggplot2 to plot the total gene counts per
sample.
In order to create a ggplot2 bar plot we will need to reshape the data. The sample names
should be in a single column named Sample and the gene counts in a single column named
Count. Use pivot_longer() and save the reshaped data to an object named ldmat.
{{Sdet}}
Solution}
{{Edet}}
ldmat
## # A tibble: 60,000 × 3
## Ensembl_ID Sample Count
## <chr> <chr> <int>
## 1 ENSG00000001630.11 S1_Cell1_Rep1 6877
## 2 ENSG00000001630.11 S2_Cell1_Rep2 6614
## 3 ENSG00000001630.11 S3_Cell1_Rep3 7058
## 4 ENSG00000001630.11 S4_Cell2_Rep1 11305
## 5 ENSG00000001630.11 S5_Cell2_Rep2 10761
## 6 ENSG00000001630.11 S6_Cell2_Rep3 10047
## 7 ENSG00000002016.12 S1_Cell1_Rep1 283
## 8 ENSG00000002016.12 S2_Cell1_Rep2 287
## 9 ENSG00000002016.12 S3_Cell1_Rep3 287
## 10 ENSG00000002016.12 S4_Cell2_Rep1 265
## # ℹ 59,990 more rows
{{Sdet}}
Solution}
ldmat %>%
group_by(Sample) %>%
summarize(Tcounts= sum(Count)) %>%
ggplot() +
geom_bar(aes(x=Sample,y=Tcounts),stat="identity") +
geom_text(aes(x=Sample,y=Tcounts,label = Tcounts), vjust = 0)+
theme_classic()+
theme([Link].x = element_text(angle=90,vjust=0.5)) +
labs(x=NULL,y="Total Counts")
{{Edet}}
The object class used by the DESeq2 package to store the read counts and the
intermediate estimated quantities during statistical analysis is the DESeqDataSet.
--- Analyzing RNA-seq data with DESeq2 ([Link]
bioc/vignettes/DESeq2/inst/doc/[Link]#the-deseqdataset)
Constructing this object from a count matrix requires count data, sample information, and an
experiment design. Our final objective is to prep a count matrix and sample information that can
be used to create a DESeqDataSet object.
The count data currently stored in a data frame will need to be a matrix. To do this, first convert
the current column Ensembl_ID to row names. Next, before converting to a matrix, filter genes
with low expression. Keep only genes with 10 or more reads across all samples (Hint: check out
the function rowSums()). Finally, convert the resulting data frame to a matrix (use
[Link]()). Save this object back to dmat3.
{{Sdet}}
Solution}
{{Edet}}
head(dmat3)
Next, we need to create some sample information to include with our count matrix. Because our
RNAseq data includes two sets of samples run in triplicate for two cell lines, Cell1 and Cell2, we
can create sample information from the sample names using the function [Link](). The
column names of our data matrix (dmat3) and the row names of our sample metadata
(samp_df) should be in the same order.
{{Sdet}}
Solution}
samp_df<-[Link](Sample=colnames(dmat3),Cell_line=rep(c("Cell1","Cell2"),eac
{{Edet}}
samp_df
## Cell_line Replicate
## S1_Cell1_Rep1 Cell1 1
## S2_Cell1_Rep2 Cell1 2
## S3_Cell1_Rep3 Cell1 3
## S4_Cell2_Rep1 Cell2 1
## S5_Cell2_Rep2 Cell2 2
## S6_Cell2_Rep3 Cell2 3
Now we have the files needed to create a DESeqDataSet object. Let's use the function
DESeqDataSetFromMatrix(). The general template is as follows:
For more information on constructing the DESeqDataSet object and using the package
DESeq2, check out this vignette ([Link]
DESeq2/inst/doc/[Link]).
BiocManager::install("DESeq2")
{{Sdet}}
Solution}
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = dmat3,
colData = samp_df,
design = ~ Cell_line)
dds
## class: DESeqDataSet
## dim: 3622 6
## metadata(1): version
## assays(1): counts
## rownames(3622): ENSG00000001630.11 ENSG00000002016.12 ...
## ENSGR0000169100.8 ENSGR0000223773.2
## rowData names(0):
## colnames(6): S1_Cell1_Rep1 S2_Cell1_Rep2 ... S5_Cell2_Rep2
## S6_Cell2_Rep3
## colData names(2): Cell_line Replicate
{{Edet}}
Lesson 1
No data available.
Lesson 2
No data available.
Lesson 3
Lesson 3 covered data import and reshaping. Data unavailable through base R or other R
packages, can be downloaded here. The survey / species data sets were obtained from a data
carpentry lesson Data Analysis and Visualization in R for Ecologists ([Link]
R-ecology-lesson/[Link]).
Lesson 4
Lesson 4 focused on data visualization with ggplot2. A summary RNA-Seq file
(RNASeq_totalcounts_vs_totaltrans.xlsx) was used for this lesson as well as the file
[Link]. Both files are also available in Lesson3_data.zip provided above.
Lesson 5
Lesson 5 introduced the package dplyr and the functions select(), filter(), and
arrange(). Data used in this lesson derived from the RNA-Seq experiment airway and can
be downloaded here
Lesson 6
Lesson 6 continued with dplyr, focusing on the functions mutate(), group_by(), and
summarize(). Data used in this lesson derived from the RNA-Seq experiment airway and
can be downloaded here
Lesson 7
No data available
Lesson 8
Data associated with the lesson 8 data wrangling challenge can be found here.
Practice problems
Which of the following will throw an error and why?
{{Sdet}}
Solution}
{{Edet}}
Create the following objects; give each object an appropriate name (your best guess at what
name to use is fine):
{{Sdet}}
Solution}
(fav_gene<-"GH1")
## [1] "GH1"
(vec1<-c(8:16))
## [1] 8 9 10 11 12 13 14 15 16
(vec2<-c(22:29))
## [1] 22 23 24 25 26 27 28 29
(vec3<-c(vec1,vec2))
## [1] 8 9 10 11 12 13 14 15 16 22 23 24 25 26 27 28 29
{{Edet}}
Create the following objects in R. What type of data are stored in these objects?
{{Sdet}}
Solution}
typeof(chromosome_name)
## [1] "character"
typeof(ODval)
## [1] "double"
typeof(chr_position)
## [1] "character"
typeof(question)
## [1] "logical"
typeof(irregular)
## [1] "logical"
{{Edet}}
If you combine, snp_positions with snp_chromosomes, what is the resulting data type?
{{Sdet}}
Solution}
typeof(c(snp_positions,snp_chromosomes))
## [1] "character"
{{Edet}}
{{Sdet}}
Solution}
{{Edet}}
df<-[Link](id=paste("Sample",1:10,sep="_"), cell=rep(factor(c("cell_line_A"
\
\
What are the column names? Rename the column "id" to "Sampleid". [You will likely need to look
up how to do this. How can you find help?]
{{Sdet}}
Solution}
colnames(df)
{{Edet}}
{{Sdet}}
Solution}
{{Edet}}
Duplicate the column 'cell' and save to a new column of df called 'cell_line'.
{{Sdet}}
Solution}
{{Edet}}
Look at the help documentation for [Link](). Use this function to determine whether there are
NAs in the following vector:
{{Sdet}}
Solution}
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [1] 6 9
{{Edet}}
Acknowledgements
Questions were inspired by or taken directly from [Link] ([Link]
genomics-r-intro/01-introduction/[Link])
Loading data
1. Import data from the sheet "iris_data_long" from the excel workbook (file_path = "./data/
iris_data.xlsx"). Make sure the column names are unique and do not contain spaces.
Save the imported data to an object called iris_long.
{{Sdet}}
Solution}
iris_long<-readxl::read_excel("../data/iris_data.xlsx",sheet="iris_data_lon
iris_long
## # A tibble: 600 × 4
## [Link] Species [Link] Measurement
## <dbl> <chr> <chr> <dbl>
## 1 1 setosa [Link] 5.1
## 2 1 setosa [Link] 3.5
## 3 1 setosa [Link] 1.4
## 4 1 setosa [Link] 0.2
## 5 2 setosa [Link] 4.9
## 6 2 setosa [Link] 3
## 7 2 setosa [Link] 1.4
## 8 2 setosa [Link] 0.2
## 9 3 setosa [Link] 4.7
## 10 3 setosa [Link] 3.2
## # ℹ 590 more rows
{{Edet}}
{{Sdet}}
Solution}
species<-readr::read_delim("../data/species_datacarpentry.txt",col_types
species
## # A tibble: 54 × 4
## species_id genus species taxa
## <chr> <fct> <fct> <fct>
## 1 AB Amphispiza bilineata Bird
## 2 AH Ammospermophilus harrisi Rodent
## 3 AS Ammodramus savannarum Bird
## 4 BA Baiomys taylori Rodent
## 5 CB Campylorhynchus brunneicapillus Bird
## 6 CM Calamospiza melanocorys Bird
## 7 CQ Callipepla squamata Bird
## 8 CS Crotalus scutalatus Reptile
## 9 CT Cnemidophorus tigris Reptile
## 10 CU Cnemidophorus uniparens Reptile
## # ℹ 44 more rows
{{Edet}}
3. Load in a comma separated file with row names present (file_path= "./data/[Link]")
and save to an object named countB.
{{Sdet}}
Solution}
countB<-[Link]("../data/[Link]",[Link]=1)
head(countB)
{{Edet}}
{{Sdet}}
Solution}
library(tidyverse)
read_delim("../data/WebexSession_report.txt",delim="\t",locale =
## Rows: 10 Columns: 21
## ── Column specification ────────────────────────────────────────────────
## Delimiter: "\t"
## chr (7): Name, Date, Invited, Registered, Duration, Network joined fro
## dbl (1): Participant
## lgl (11): Audio Type, Email, Company, Title, Phone Number, Address 1, A
## time (2): Start time, End time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this
## # A tibble: 10 × 21
## Participant `Audio Type` Name Email Date Invited Registered `Sta
## <dbl> <lgl> <chr> <lgl> <chr> <chr> <chr> <tim
## 1 1 NA Partici… NA 6/8/… No N/A 13:0
## 2 2 NA Partici… NA 6/9/… <NA> <NA> 13:0
head([Link]("../data/WebexSession_report.txt",
fileEncoding="UTF-16LE")) #via base R
## [Link]..[Link]..GMT.04.00. X
## 1 Session detail for 'A Webex Meeting of some type':
## 2 Participant Audio Type
## 3 1
## 4 2
## 5 3
## 6 4
## X.1 X.2 X.3 X.4 X.5 X.6 X.7
## 1
## 2 Name Email Date Invited Registered Start time End time Dur
## 3 Participant 1 <NA> 6/8/22 No N/A 1:00 PM 1:59 PM 59
## 4 Participant 2 <NA> 6/9/22 1:00 PM 1:59 PM 59
## 5 Participant 3 <NA> 6/10/22 No N/A 12:57 PM 2:06 PM 69
## 6 Participant 4 <NA> 6/11/22 12:57 PM 2:06 PM 69
## X.9 X.10 X.11 X.12 X.13 X.14 X.15
## 1
## 2 Company Title Phone Number Address 1 Address 2 City State/Province
## 3
## 4
## 5
## 6
## X.16 X.17 X.18 X
## 1
## 2 Zip/Postal Code Country/region Network joined from: Internal Participa
## 3 External
## 4
## 5 External
## 6
{{Edet}}
Reshaping data
1. Reshape iris_long to a wide format. Your new column names will contain names from
[Link]. Your wide data should look as follows:
## # A tibble: 150 × 6
## [Link] Species [Link] [Link] [Link] [Link]
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 setosa 5.1 3.5 1.4 0.2
## 2 2 setosa 4.9 3 1.4 0.2
## 3 3 setosa 4.7 3.2 1.3 0.2
## 4 4 setosa 4.6 3.1 1.5 0.2
## 5 5 setosa 5 3.6 1.4 0.2
## 6 6 setosa 5.4 3.9 1.7 0.4
## 7 7 setosa 4.6 3.4 1.4 0.3
## 8 8 setosa 5 3.4 1.5 0.2
## 9 9 setosa 4.4 2.9 1.4 0.2
## 10 10 setosa 4.9 3.1 1.5 0.1
## # ℹ 140 more rows
{{Sdet}}
Solution}
{{Edet}}
2. Let's use table4a from the tidyr package. Use pivot_longer() to place the year
columns in a column named year and their values in a column named cases.
library(tidyr)
data(table4a)
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
{{Sdet}}
Solution}
{{Edet}}
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
3. Separate the column rate from tidyr's table3 into two columns: cases and
population.
data(table3)
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
{{Sdet}}
Solution}
separate(table3, rate, into = c("cases", "population"))
{{Edet}}
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Reshape challenge
1. Use pivot_longer to reshape countB. Your reshaped data should look the same as the
data below.
{{Sdet}}
Solution}
library(tidyverse)
countB<-[Link]("../data/[Link]",[Link]=1) %>%
rownames_to_column("Feature")
countB_l<-pivot_longer(countB,
cols=2:length(countB),
names_to = c(".value", "Replicate"),
names_sep = "_") #reshaping data so that all replicates are stacke
tibble(countB_l)
{{Edet}}
## # A tibble: 27 × 4
## Feature Replicate SampleA SampleB
## <chr> <chr> <int> <int>
## 1 Tspan6 1 703 71
## 2 Tspan6 2 567 970
## 3 Tspan6 3 867 242
## 4 TNMD 1 490 342
The diamonds dataset comes in ggplot2 and contains information about ~54,000
diamonds, including the price, carat, color, clarity, and cut of each diamond. ---
R4DS ([Link]
library(ggplot2)
data(diamonds)
diamonds
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ℹ 53,930 more rows
Create a scatter plot demonstrating how carat (x axis) relates to price (y axis).
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
{{Edet}}
• Color the points above by clarity and scale the colors using the viridis package, option
"magma".
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price,color=clarity))
viridis::scale_color_viridis(discrete=TRUE,option="magma")
{{Edet}}
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price,color=clarity))
viridis::scale_color_viridis(discrete=TRUE,option="magma") +
theme_classic()
{{Edet}}
Create a bar chart displaying the number of diamonds per cut. Be sure to check out the help
documentation for geom_bar().
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
{{Edet}}
• Fill the bars by clarity. Modify the position of the bars so that they are set to dodge.
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill=clarity),position="dodge"
{{Edet}}
Examine how the price of a diamond changes across different diamond color categories
using a boxplot.
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price))
{{Edet}}
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price))+
theme_bw()
{{Edet}}
• Change the font of all text elements to "Times New Roman" and change the size of the
font to 12. Bold the x and y axis labels.
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = color, y = price))+
theme_bw()+
theme(text=element_text(family="Times New Roman",size=12),
[Link] = element_text(face="bold"))
{{Edet}}
Challenge Question
Using the boxplot you created above, reorder the x-axis so that color is organized from worst (J)
to best (D). There are multiple possible solutions. Hint: Check out functions in the forcats
package (a tidyverse core package)
{{Sdet}}
Solution}
ggplot(data = diamonds) +
geom_boxplot(mapping = aes(x = forcats::fct_rev(color), y = price))
labs(x="color",y="price")+
theme_bw()+
theme(text=element_text(family="Times New Roman",size=12),
[Link] = element_text(face="bold"))
{{Edet}}
{{Sdet}}
Solution}
gcounts<-readr::read_csv("../data/[Link]")
## New names:
## Rows: 9 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ","
## (1): ...1 dbl (6): SampleA_1, SampleA_2, SampleA_3, SampleB_1, SampleB_2
## SampleB_3
## ℹ Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this m
## • `` -> `...1`
colnames(gcounts)[1]<-"Gene"
gcounts
## # A tibble: 9 × 7
## Gene SampleA_1 SampleA_2 SampleA_3 SampleB_1 SampleB_2 SampleB_3
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tspan6 703 567 867 71 970 242
## 2 TNMD 490 482 18 342 935 469
## 3 DPM1 921 797 622 661 8 500
## 4 SCYL3 335 216 222 774 979 793
## 5 FGR 574 574 515 584 941 344
## 6 CFH 577 792 672 104 192 936
## 7 FUCA2 798 766 995 27 756 546
## 8 GCLC 822 874 923 705 667 522
## 9 NFYA 622 793 918 868 334 64
{{Edet}}
• Plot the values (gene counts) from Sample A on the y axis and sample B on the x axis.
Hint: you will need to reshape the data to accomplish this task.
{{Sdet}}
Solution}
library(tidyverse)
gcount2<-pivot_longer(gcounts,
cols=2:length(gcounts),
names_to = c(".value", "Replicate"),
names_sep = "_"
) #reshaping data so that all replicates are stacked in a single column by
ggplot(data=gcount2) +
geom_point(aes(x=SampleB,y=SampleA))
{{Edet}}
• Add a linear model to your scatter plot (See geom_smooth()). Also, the trend line should
be red and the confidence interval around the trend line should NOT be visible. Change
the panel background to white.
{{Sdet}}
Solution}
ggplot(data=gcount2,aes(x=SampleB,y=SampleA)) +
geom_point() +
geom_smooth(method="lm",color="red", se=FALSE)+
theme([Link] = element_rect(fill = "white", colour =
[Link] = element_line(color="black",size = 0.05
{{Edet}}
{{Sdet}}
Solution}
library(tidyverse)
sc<-read_delim("../data/filtlowabund_scaledcounts_airways.txt")
cnames<-c('sample', 'cell', 'dex', 'transcript', 'counts_scaled'
sc<-sc %>% select(all_of(cnames)) %>% filter(dex == "untrt" & (transcript
{{Edet}}
{{Sdet}}
Solution}
dexp<-read_delim("../data/diffexp_results_edger_airways.txt")
{{Edet}}
{{Sdet}}
Solution}
## # A tibble: 0 × 5
## # ℹ 5 variables: sample <dbl>, cell <chr>, dex <chr>, transcript <chr>,
## # counts_scaled <dbl>
{{Edet}}
{{Sdet}}
Solution}
sc %>% select(where([Link]))
## # A tibble: 8 × 3
## cell dex transcript
## <chr> <chr> <chr>
## 1 N61311 untrt ANAPC4
## 2 N61311 untrt ACTN1
## 3 N052611 untrt ANAPC4
## 4 N052611 untrt ACTN1
## 5 N080611 untrt ANAPC4
## 6 N080611 untrt ACTN1
## 7 N061011 untrt ANAPC4
## 8 N061011 untrt ACTN1
{{Edet}}
5. Select all columns from dexp except .abundant and PValue. Keep only rows with FDR
less than or equal to 0.01.
{{Sdet}}
Solution}
## # A tibble: 2,763 × 8
## feature albut transcript ref_genome logFC logCPM F
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <
## 1 ENSG00000000003 untrt TSPAN6 hg38 -0.390 5.06 32.8 0.002
## 2 ENSG00000000971 untrt CFH hg38 0.417 8.09 29.3 0.003
## 3 ENSG00000001167 untrt NFYA hg38 -0.509 4.13 44.9 0.001
## 4 ENSG00000002834 untrt LASP1 hg38 0.388 8.39 22.7 0.007
## 5 ENSG00000003096 untrt KLHL13 hg38 -0.949 4.16 84.8 0.000
## 6 ENSG00000003402 untrt CFLAR hg38 1.18 6.90 130. 0.000
## 7 ENSG00000003987 untrt MTMR7 hg38 0.993 0.341 24.7 0.005
## 8 ENSG00000004059 untrt ARF5 hg38 0.358 5.84 30.9 0.003
{{Edet}}
6. Import the file "./data/airway_rawcount.csv". Use the function rename() to rename the
first column. Use the pipe from magrittr to import and rename successively without
intermediate steps or function nesting. Save to an object named acount.
{{Sdet}}
Solution}
acount<-read_csv("../data/airway_rawcount.csv") %>%
dplyr::rename("Feature" = "...1")
## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: ","
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR10
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## Specify the column types or set `show_col_types = FALSE` to quiet this m
## • `` -> `...1`
{{Edet}}
7. Use filter on the object acount to keep only genes that had a count greater than 10 in at
least one sample
{{Sdet}}
Solution}
{{Edet}}
8. Challenge Question: Filter genes from acount that had a total count less than ten across
all samples. Hint: Use column_to_rownames and look up rowSums().
{{Sdet}}
Solution}
{{Edet}}
library(tidyverse)
acount_smeta<-read_tsv("../data/[Link]")
acount_smeta
{{Sdet}}
Solution}
## # A tibble: 48,176 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 0 0 0 0 0
## 2 ENSG000000… 0 0 2 0 1
## 3 ENSG000000… 10 2 9 2 10
## 4 ENSG000000… 3 0 3 1 4
## 5 ENSG000000… 2 0 1 0 0
## 6 ENSG000000… 0 0 1 0 0
## 7 ENSG000000… 4 6 22 10 2
## 8 ENSG000000… 0 0 0 1 1
## 9 ENSG000000… 3 1 0 0 3
## 10 ENSG000000… 0 0 0 0 0
## # ℹ 48,166 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
#OR
## # A tibble: 48,176 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 0 0 0 0 0
## 2 ENSG000000… 0 0 2 0 1
## 3 ENSG000000… 10 2 9 2 10
## 4 ENSG000000… 3 0 3 1 4
## 5 ENSG000000… 2 0 1 0 0
## 6 ENSG000000… 0 0 1 0 0
## 7 ENSG000000… 4 6 22 10 2
## 8 ENSG000000… 0 0 0 1 1
## 9 ENSG000000… 3 1 0 0 3
## 10 ENSG000000… 0 0 0 0 0
## # ℹ 48,166 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
{{Edet}}
{{Sdet}}
Solution}
acount_smeta %>%
group_by(dex,Feature) %>%
summarize(mean_counts=mean(Count)) %>%
slice_min(n=5, order_by=mean_counts) %>%
glimpse()
## `summarise()` has grouped output by 'dex'. You can override using the `.
## argument.
## Rows: 67,285
## Columns: 3
## Groups: dex [2]
## $ dex <chr> "trt", "trt", "trt", "trt", "trt", "trt", "trt", "tr
## $ Feature <chr> "ENSG00000000005", "ENSG00000000938", "ENSG000000027
## $ mean_counts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{{Edet}}
{{Sdet}}
Solution}
## # A tibble: 64,102 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 2.83 2.65 2.94 2.61 3.06
## 2 ENSG000000… 0 0 0 0 0
## 3 ENSG000000… 2.67 2.71 2.79 2.56 2.77
## 4 ENSG000000… 2.42 2.33 2.42 2.22 2.39
## 5 ENSG000000… 1.79 1.75 1.61 1.56 1.90
## 6 ENSG000000… 0 0 0.477 0 0.301
## 7 ENSG000000… 3.51 3.57 3.79 3.63 3.83
## 8 ENSG000000… 3.16 3.03 3.24 2.95 3.15
## 9 ENSG000000… 2.72 2.58 2.78 2.69 2.91
## 10 ENSG000000… 2.60 2.37 2.67 2.25 2.82
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
{{Edet}}
{{Sdet}}
Solution}
{{Edet}}
Challenge question:
1. Calculate the mean raw counts for each gene by treatment in acount_smeta. Combine
these results with the differential expression results. Your resulting data frame should
resemble the following:
{{Sdet}}
Solution}
a<-acount_smeta %>%
group_by(dex, Feature) %>%
summarise(mean_count = mean(Count)) %>%
pivot_wider(names_from=dex,values_from=mean_count,
names_prefix="Mean_Counts_") %>%
right_join(dexp, by=c("Feature" = "feature"))
## `summarise()` has grouped output by 'dex'. You can override using the `.
## argument.
{{Edet}}
## # A tibble: 15,926 × 12
## Feature Mean_Counts_trt Mean_Counts_untrt albut transcript re
## <chr> <dbl> <dbl> <chr> <chr> <c
## 1 ENSG00000000003 619. 865 untrt TSPAN6 hg
## 2 ENSG00000000419 547. 523 untrt DPM1 hg
## 3 ENSG00000000457 234. 250. untrt SCYL3 hg
## 4 ENSG00000000460 53.2 63.5 untrt C1orf112 hg
## 5 ENSG00000000971 6738. 5331. untrt CFH hg
## 6 ENSG00000001036 1123. 1487. untrt FUCA2 hg
## 7 ENSG00000001084 573. 658. untrt GCLC hg
## 8 ENSG00000001167 316 469 untrt NFYA hg
## 9 ENSG00000001460 168. 208 untrt STPG1 hg
## 10 ENSG00000001461 2545 3113. untrt NIPAL3 hg
## # ℹ 15,916 more rows
Additional Resources
Other Resources
1. Helpful search engine for R: rseek ([Link]
2. Test your regular expressions ([Link]
3. Troubleshooting Errors ([Link]
Troubleshooting_Rprog/)
To manage large data sets in R effectively, one can employ strategies such as using tibbles for clear outputs, leveraging dplyr for efficient manipulation with functions like mutate() and summarize(), and using the data.table package for its speed and memory efficiency. Writing modular and reusable functions, using piping for clarity and reducing computational redundancy, and employing lazy evaluation with packages like delayedassign or futures can further enhance performance. Also, breaking data into chunks, using memory-efficient data types, and performing in-memory operations with packages like Rcpp may improve both performance and readability .
dplyr facilitates data manipulation by providing a set of functions that are intuitive and well-suited for transforming data frames. Its core functions include select(), filter(), mutate(), summarize(), and arrange(), which allow for selecting specific columns, filtering rows based on conditions, adding new columns or modifying existing ones, summarizing data by groups, and sorting data, respectively. The use of these functions, often in conjunction with the piping operator (%>%), streamlines data manipulation tasks, making the code more readable and efficient .
Base R and ggplot2 differ significantly in their approach to plotting. Base R relies on individual plot functions for customization, often resulting in complex code for intricate plots. In contrast, ggplot2 uses the 'grammar of graphics', allowing users to layer plot elements, which makes adding complexity straightforward and organized. This leads to ggplot2's advantage in code reusability, ease of producing complex plots with minimal changes, and its integration with tidyverse for handling well-structured data formats. ggplot2's ability to integrate mapping, geoms, and aesthetics into a cohesive plotting system makes it superior for advanced visualizations .
Tibbles are a modern version of data frames provided by the tidyverse. They offer additional features that improve user experience, such as better default printing and the avoidance of partial matching in names. Unlike traditional data frames, tibbles do not modify variable names (e.g., no conversion to factors), which minimizes unexpected transformations. They also handle lists within columns gracefully. These characteristics make tibbles particularly beneficial when working with large data sets and interactive data analysis, as they provide clearer outputs and reduce potential errors .
Subsetting data using the dplyr select() function is a process of extracting specific columns from a data frame, which serves to focus on relevant data and streamline analysis. The function allows for selecting columns by name, excluding them with a minus sign, or even using helper functions like starts_with() for dynamic selection. For example, `ex1 <- select(dexp, gene = transcript, logFoldChange = logFC, FDRpvalue = FDR)` creates a simplified version of the data frame `dexp` with renamed columns for easier readability and analysis .
Using package libraries like ggplot2 offers substantial benefits, including access to sophisticated plotting functions and a 'grammar of graphics' approach that simplifies creating complex visuals. It enhances productivity through reusable code and consistency of aesthetic customizations. However, potential drawbacks include the need for continuous updates to maintain compatibility with R's evolving ecosystem and a reliance on the tidyverse framework, which may introduce a steep learning curve for beginners. Additionally, while ggplot2’s abstraction layer is powerful, it may initially mask lower level understanding of plotting mechanics, impacting users' adaptability to alternative frameworks or languages .
The function wrapper syntax in R (`function_name <- function(arg_1, arg_2, ...) { Function body }`) is crucial for writing robust, reusable code. It encapsulates logic and computation, allowing code modularity and repeated use across contexts without duplication. This syntax supports abstraction, making complex operations manageable and code maintenance easier. By isolating compartmental calculations or transformations in functions, developers can test and optimize components independently, enhance code readability, and facilitate collaboration. Functions also support argument validation and error handling, reinforcing reliability and usage consistency across diverse projects .
The getwd() function assists users by displaying the current working directory, simplifying file management by providing a reference point for file I/O operations. It is analogous to the pwd command from UNIX systems, which outputs the current directory path. This function is essential for verifying and changing directories in scripts where relative file paths are employed, ensuring operations are executed in the correct folder context. Knowing the working directory mitigates issues with file accessibility and paths, particularly across different sessions or collaborators' setups .
The 'grammar of graphics' in ggplot2 provides a coherent system for describing and building graphs, which simplifies the creation of complex plots by organizing visual elements into layers. This system allows for quick construction of informative plots with minimal code modifications, enhancing flexibility and reusability. By using this grammar, users can build plots incrementally, adding layers of data representation and aesthetics. This results in a more intuitive and efficient plotting process compared to base R graphics, particularly when dealing with layered data from the tidyverse framework .
In R, replacing or renaming columns in a data frame can be effectively performed using the select() and rename() functions from dplyr. Select() allows users to rename columns during the selection process, while rename() allows specific column names to be changed without altering their position. These methods improve code readability and data manipulation accuracy, essential for maintaining consistency in data analysis workflows. Correct labeling reflects the content accurately, reducing the chance of errors in downstream analysis and improving the clarity of code shared with collaborators .