Data Analysis Using R
Agenda
Why R?
What is R?
R Basics
Data Wrangling
Visualization
Why R?
Languages to analyze data with
How will we work with the data? So many options...
Excel, SAS, SPSS, Stata, R, Python, MATLAB, Gauss, C, C++,
Java, Scilab, ...
Our focus: R
So why R?
Simple languages (Excel, SAS, SPSS) are easy but give you
less flexibility
Complex languages (C, C++) are hard but give you lots of
flexibility
If you can learn R, you can learn other languages more easily
What is R?
R is a language for statistical
computing
Offers a wide variety of statistical tools
You can compute your own statistics
Or you can use what others have written (called packages!)
R is free
Free to download for various OS (Windows, Mac, Linux)
Users upload packages for everyone to use
Packages will be very useful for you!
R is an object-oriented language
Object-oriented programming (OOP) is a way to organize your
code
R structures its code around objects
But what is an object?
R Basics
RStudio
Two key components:
Tip: Preferences > Editing > Code > Soft-wrap R Source Code
Demonstration
Now we will look at our first R code
Go to class materials and download the R_workshop.R file
and open it in RStudio
I will walk you through the code
Please feel free to follow on your own screen
Quick Overview of the Demonstration
Objects
Arithmetic Operations
Vectors
Functions
Data Files
Saving Objects
Packages
Creating Objects
<- is the assignment operator that creates an object.
x <- 5
x
## [1] 5
x + 10
## [1] 15
y <- x + 1
y
## [1] 6
Different Classes of Objects
Numeric, logical (boolean), character (string), factor
x <- 5
class(x)
## [1] "numeric"
x <- "R"
class(x)
## [1] "character"
x + 10 ## Error in x + 10 : non-numeric argument to binary operator
Different Classes of Objects
The logical (boolean) class represents a true or false value.
x <- 1 < 3
x
## [1] TRUE
x <- factor(c("female","male","male","female"))
x
Creating Vectors
We can also store more than one number in an object.
one_to_six <- 1:6 # the integers 1 through 6
one_to_six^2
## [1] 1 4 9 16 25 36
Create a character vector.
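A minimal sketch of a character vector. The `friends` object and its values are made up for illustration and are not taken from the workshop script:

```r
# Character vectors hold text; each element is wrapped in quotes.
friends <- c("Jack", "Amanda", "Brian")

class(friends)    # "character"
length(friends)   # 3
nchar(friends)    # number of characters in each name: 4 6 5
toupper(friends)  # the same names in upper case
```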
A vector can contain only one class.
z1 <- c(1,2)
class(z1)
## [1] "numeric"
z2 <- c(1,2,"character")
z2
class(z2)
## [1] "character"
Vector operations
Numeric operations for vectors with numeric elements work
element-wise.
x1 <- c(1,3,5)
x2 <- c(4,1,2)
x1 + x2
## [1] 5 4 7
x1 * x2
## [1] 4 3 10
x1 / x2
Vector operations
Numeric operations on vectors of different lengths rarely do what you
want: R recycles the shorter vector and, if the lengths do not divide
evenly, issues a warning
x1 <- c(1,3,5)
x2 <- c(4,1)
x1 + x2
## [1] 5 4 9
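The result 5 4 9 comes from recycling: R reuses the shorter vector from its start until it matches the longer one. A short sketch of what R effectively computes (the `recycled` object is introduced here only for illustration):

```r
x1 <- c(1, 3, 5)
x2 <- c(4, 1)

# R treats x2 as c(4, 1, 4) before adding, and warns because
# length 3 is not a multiple of length 2.
recycled <- rep(x2, length.out = length(x1))
x1 + recycled                 # 5 4 9, the same values as x1 + x2

# When the longer length is a multiple of the shorter, there is no warning.
c(1, 2, 3, 4) + c(10, 20)     # 11 22 13 24
```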
Now Your Turn: Creating Vectors
Create a vector with the ages of everyone in your family.
Task: Creating Vectors
family_ages <- c(33, 37, 35) # My partner, me, my brother.
family_ages + 5 # In 5 years
## [1] 38 42 40
length(family_ages)
## [1] 3
mean(family_ages)
## [1] 35
sum(family_ages)
## [1] 105
c(min(family_ages), max(family_ages))
## [1] 33 37
Reading a Help Page
?mean
Description
Generic function for the (trimmed) arithmetic mean.
Usage
Indexing
x <- c(1,4,6,7,10)
x[2]
## [1] 4
x[length(x)]
## [1] 10
You can also use integers to get multiple elements from a vector.
x <- c(1,4,6,7,10)
idx <- c(1,3,5)
x[idx]
## [1] 1 6 10
Reference: Common Logical Expressions
== Equal to
!= Not equal to
%in% Is an element of
Indexing
You can also use a logical (boolean) vector to get elements from a vector.
x <- c(1,4,6,7,10)
x[x < 5]
## [1] 1 4
x[x >= 7]
## [1] 7 10
x[x %in% 1:5]
## [1] 1 4
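It can help to look at the logical vector itself before using it as an index. A short sketch with the same `x`:

```r
x <- c(1, 4, 6, 7, 10)

x < 5           # TRUE TRUE FALSE FALSE FALSE
x[x < 5]        # keeps the elements where the test is TRUE: 1 4
which(x < 5)    # positions of the TRUE values: 1 2
sum(x < 5)      # TRUE counts as 1, so this counts matches: 2
```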
Matrix
A matrix is a set of numbers arranged in rows and columns.
m1 <- matrix(1:6,nrow=2,ncol=3)
m1
m2 <- matrix(1:6,nrow=3,ncol=2)
m2
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
Matrix
If you stack vectors either row-wise or column-wise, you get a
matrix.
x1 <- c(3,4,5)
x2 <- c(11,12,13)
X <- rbind(x1,x2)
X
Y <- cbind(x1,x2)
Y
## x1 x2
## [1,] 3 11
## [2,] 4 12
## [3,] 5 13
Matrix
Just like vectors, a matrix can contain only one class of
elements.
x1 <- c(3,4,5)
x2 <- c("A","B","C")
X <- rbind(x1,x2)
X
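A small sketch of what this coercion means in practice: once numbers and text share a matrix, the numbers are stored as text. `is.numeric()` and `as.numeric()` are base R functions.

```r
x1 <- c(3, 4, 5)
x2 <- c("A", "B", "C")
X <- rbind(x1, x2)

class(X[1, 1])           # "character": the numbers were converted to text
is.numeric(X["x1", ])    # FALSE
as.numeric(X["x1", ])    # convert the first row back to numbers: 3 4 5
```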
Matrix
Indexing is done by specifying a row, a column, or both.
m1 <- matrix(1:6,nrow=2,ncol=3)
m1
m1[1,]
## [1] 1 3 5
m1[,1]
## [1] 1 2
m1[1,3]
## [1] 5
Data Frame
Another object we'll commonly work with is a data.frame
(think: spreadsheet).
test_df <- data.frame(id = 1:3, age = c(22,34,51), nationality = c("Korea", "Japan", "Canada"))
test_df
## id age nationality
## 1 1 22 Korea
## 2 2 34 Japan
## 3 3 51 Canada
str(test_df)
Data Frame
You can use $ to read or write a column (variable) in a data
frame
test_df$nationality # read a column
test_df$nationality <- c("China", "US", "Congo") # overwrite the column
test_df
## id age nationality
## 1 1 22 China
## 2 2 34 US
## 3 3 51 Congo
Data Frame
Append a new variable to a data frame
test_df$Q1 <- c("Agree", "Disagree", "Strongly Agree")
test_df
## id age nationality Q1
## 1 1 22 China Agree
## 2 2 34 US Disagree
## 3 3 51 Congo Strongly Agree
Indexing Data Frame
Indexing with a data frame is more or less the same as indexing
with a matrix. With data frames, however, you can use $ to
extract a column (or variable).
test_df
## id age nationality Q1
## 1 1 22 China Agree
## 2 2 34 US Disagree
## 3 3 51 Congo Strongly Agree
test_df$age
## [1] 22 34 51
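A brief sketch of matrix-style indexing on a data frame. It rebuilds a small `test_df` so that it runs on its own:

```r
test_df <- data.frame(id = 1:3,
                      age = c(22, 34, 51),
                      nationality = c("China", "US", "Congo"))

test_df[2, ]                   # second row, all columns
test_df[, 2]                   # second column as a vector: 22 34 51
test_df[test_df$age > 30, ]    # rows where age is greater than 30
test_df$age[test_df$id == 3]   # age of the person with id 3: 51
```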
Indexing Data Frame
Equivalently, you can use the variable name(s) to get columns.
test_df[,c("age","nationality")]
## age nationality
## 1 22 China
## 2 34 US
## 3 51 Congo
Useful Functions for Data Frames
dim(test_df) # You can separately find the number of rows or columns: nrow(test_df); ncol(test_df)
## [1] 3 4
names(test_df)
str(test_df)
Useful Functions for Data Frames
summary(test_df)
## id age nationality Q1
## Min. :1.0 Min. :22.00 Length:3 Length:3
## 1st Qu.:1.5 1st Qu.:28.00 Class :character Class :character
## Median :2.0 Median :34.00 Mode :character Mode :character
## Mean :2.0 Mean :35.67
## 3rd Qu.:2.5 3rd Qu.:42.50
## Max. :3.0 Max. :51.00
Your Turn: Working with Data Frame
Create a data frame with 4 rows and 3 columns
Your Turn: Working with Data Frame
mydata <- data.frame("v1"=1:4,
"v2"=c("dog","cat","eagle","monkey"),
"v3"=c("Jack","Amanda","Brian","Charles"))
mydata[1,2] <- "rabbit"
mydata$v4 <- 5:8
names(mydata) <- c("num1to4","animals","friends","num5to8")
mydata
Data Wrangling
What is a package in R?
An R package is a collection of functions, data, and
documentation
Package 1: tidyverse
We'll use a package from tidyverse, dplyr, to manipulate data
frames.
Loading Data
Loading data from your local folder requires setting the correct
path
getwd() # print the current working directory
## [1] "/Users/byungkookim/Dropbox/KDIS/Workshop/R"
setwd("~/Dropbox/KDIS/Workshop")
Loading Data
Once you have set the working directory, you can load the data.
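The loading code for this step is not shown, so the following is only a sketch using base R's `read.csv()` and a throwaway file (`example.csv` in a temporary directory is a hypothetical name, and its values are made up). With the tidyverse loaded, `read_csv()` is called the same way.

```r
# Write a tiny CSV to a temporary folder so the example is self-contained.
tmp_dir <- tempdir()
write.csv(data.frame(id = 1:3, price = c(28.6, 29.8, 31.5)),
          file.path(tmp_dir, "example.csv"), row.names = FALSE)

# Point R at that folder, read the file by name, then restore the old folder.
old_wd <- setwd(tmp_dir)
example <- read.csv("example.csv")
setwd(old_wd)

head(example)
mean(example$price)
```

Because the working directory is set first, the file can be read by its name alone instead of its full path.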
Your Turn: Loading a Data Frame
Download the cigar CSV file from the class materials and store it
in the directory of your choice
Your Turn: Loading a Data Frame
cigar <- read_csv("[Link]")
head(cigar)
## # A tibble: 6 × 10
## rownames state year price pop pop16 cpi ndi sales pimin
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 63 28.6 3383 2236. 30.6 1558. 93.9 26.1
## 2 2 1 64 29.8 3431 2277. 31 1684. 95.4 27.5
## 3 3 1 65 29.8 3486 2328. 31.5 1810. 98.5 28.9
## 4 4 1 66 31.5 3524 2370. 32.4 1915. 96.4 29.5
## 5 5 1 67 31.6 3533 2394. 33.4 2024. 95.5 29.6
## 6 6 1 68 35.6 3522 2405. 34.8 2202. 88.4 32
mean(cigar$price)
## [1] 68.69993
Subset a Data Frame
Pipes: the pipe operator (%>%) in the tidyverse lets you chain operations on a
data frame so that the code reads step by step, which improves legibility.
filter()
mutate()
select()
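As a rough sketch of how these verbs chain together, here is a toy data frame run through `filter()`, `mutate()`, and `select()`. The `prices` object and its values are invented for illustration, and the example assumes the dplyr package is installed.

```r
library(dplyr)

prices <- data.frame(state = c(1, 1, 16, 16),
                     year  = c(63, 64, 63, 64),
                     price = c(28.6, 29.8, 26.4, 27.9),
                     cpi   = c(30.6, 31.0, 30.6, 31.0))

result <- prices %>%
  filter(state == 16) %>%                    # keep rows for state 16
  mutate(adjusted_price = price / cpi) %>%   # add a price-to-cpi ratio
  select(year, adjusted_price)               # keep only two columns

result
```

Each `%>%` passes the data frame on its left to the function on its right, so the steps read in the order they happen.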
Subset a Data Frame
filter() subsets a data frame
## # A tibble: 6 × 10
## rownames state year price pop pop16 cpi ndi sales pimin
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 361 16 63 26.4 2758 1881. 30.6 2153. 115. 25.4
## 2 362 16 64 27.9 2763 1889. 31 2281. 110. 25.6
## 3 363 16 65 28.1 2758 1897. 31.5 2538. 116 26.1
## 4 364 16 66 31.6 2764 1912. 32.4 2722. 108. 26.2
## 5 365 16 67 32 2772 1931. 33.4 2745. 114. 27.5
## 6 366 16 68 36.3 2775 1950. 34.8 2918. 109. 29.2
Subset a Data Frame
expensive_cigar <- cigar %>% filter(price >= mean(price))
head(expensive_cigar)
## # A tibble: 6 × 10
## rownames state year price pop pop16 cpi ndi sales pimin
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 1 81 68.8 3917 2925. 90.9 7042. 120. 62.6
## 2 20 1 82 73.1 3943 2954. 96.5 7505. 119. 67.8
## 3 21 1 83 84.4 3959 2978. 99.6 7975. 116. 78.6
## 4 22 1 84 90.8 3990 3009. 104. 8693. 113 86.8
## 5 23 1 85 99 4020 3040. 108. 9059. 114. 90.7
## 6 24 1 86 103 4050 3072. 110. 9675. 116. 98.8
Your Turn: Subset a Data Frame
## install.packages("ggplot2")
library(ggplot2)
data(diamonds) # Test dataset that ships with ggplot2 package.
head(diamonds) # Shows us the first several observations of a data.frame.
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
nrow(diamonds); names(diamonds)
## [1] 53940
Your Turn: Subsetting a Data Frame
"Premium" cut OR are less than 0.5 carats?
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
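One way to approach this kind of question, sketched on a made-up four-row data frame rather than the full `diamonds` data (the `gems` object is hypothetical, and dplyr is assumed to be installed). The `|` operator means OR and `&` means AND.

```r
library(dplyr)

gems <- data.frame(carat = c(0.23, 0.70, 0.45, 1.01),
                   cut   = c("Ideal", "Premium", "Good", "Fair"))

# Rows that are "Premium" cut OR lighter than 0.5 carats.
either <- gems %>% filter(cut == "Premium" | carat < 0.5)
nrow(either)   # 3

# Rows that satisfy both conditions at once.
both <- gems %>% filter(cut == "Premium" & carat < 0.5)
nrow(both)     # 0
```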
Creating New Variables
mutate() changes/adds variables to a data frame
## # A tibble: 5 × 11
## carat cut color clarity depth table price x y z ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1417.
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 1552.
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1422.
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 1152.
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1081.
Creating New Variables
diamonds %>% arrange(ratio)
## # A tibble: 53,940 × 11
## carat cut color clarity depth table price x y z ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.43 Premium H I1 62 59 452 4.78 4.83 2.98 1051.
## 2 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68 1078.
## 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1081.
## 4 0.33 Ideal J SI2 62.4 54 366 4.43 4.45 2.77 1109.
## 5 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 1110.
## 6 0.32 Good D I1 64 54 361 4.33 4.36 2.78 1128.
## 7 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 1130
## 8 0.31 Very Good J SI1 59.4 62 353 4.39 4.43 2.62 1139.
## 9 0.31 Very Good J SI1 58.1 62 353 4.44 4.47 2.59 1139.
## 10 0.36 Premium J SI1 61.6 60 410 4.54 4.58 2.81 1139.
## # ℹ 53,930 more rows
Creating New Variables
diamonds %>% arrange(-ratio)
## # A tibble: 53,940 × 11
## carat cut color clarity depth table price x y z ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1.04 Very Good D IF 61.3 56 18542 6.53 6.55 4.01 17829.
## 2 1.07 Premium D IF 60.9 58 18279 6.67 6.57 4.03 17083.
## 3 1.03 Ideal D IF 62 56 17590 6.55 6.44 4.03 17078.
## 4 1.07 Very Good D IF 60.9 58 18114 6.57 6.67 4.03 16929.
## 5 1.02 Very Good D IF 61.7 59 17100 6.42 6.52 3.99 16765.
## 6 1.07 Very Good D IF 59 59 17909 6.63 6.72 3.94 16737.
## 7 1.09 Very Good D IF 61.7 58 18231 6.55 6.65 4.07 16726.
## 8 1 Ideal D IF 60.7 57 16469 6.44 6.48 3.92 16469
## 9 1.01 Premium D IF 61.6 56 16234 6.46 6.43 3.97 16073.
## 10 1 Very Good D IF 63.3 59 16073 6.37 6.33 4.02 16073
## # ℹ 53,930 more rows
Selecting Variables
names(diamonds)
## # A tibble: 53,940 × 3
## carat cut price
## <dbl> <ord> <int>
## 1 0.23 Ideal 326
## 2 0.21 Premium 326
## 3 0.23 Good 327
## 4 0.29 Premium 334
## 5 0.31 Good 335
## 6 0.24 Very Good 336
## 7 0.24 Very Good 336
## 8 0.26 Very Good 337
## 9 0.22 Fair 337
## 10 0.23 Very Good 338
## # ℹ 53,930 more rows
Selecting Variables
names(diamonds)
## # A tibble: 53,940 × 8
## carat cut color clarity depth table price ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 1417.
## 2 0.21 Premium E SI1 59.8 61 326 1552.
## 3 0.23 Good E VS1 56.9 65 327 1422.
## 4 0.29 Premium I VS2 62.4 58 334 1152.
## 5 0.31 Good J SI2 63.3 58 335 1081.
## 6 0.24 Very Good J VVS2 62.8 57 336 1400
## 7 0.24 Very Good I VVS1 62.3 57 336 1400
## 8 0.26 Very Good H SI1 61.9 55 337 1296.
## 9 0.22 Fair E VS2 65.1 61 337 1532.
## 10 0.23 Very Good H VS1 59.4 61 338 1470.
## # ℹ 53,930 more rows
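The two outputs above look like the result of keeping or dropping columns with `select()`. The exact calls are not shown, so the sketch below is an assumption: it uses a toy `gems` data frame (hypothetical) and requires dplyr.

```r
library(dplyr)

gems <- data.frame(carat = c(0.23, 0.21), cut = c("Ideal", "Premium"),
                   price = c(326, 326), x = c(3.95, 3.89),
                   y = c(3.98, 3.84), z = c(2.43, 2.31))

kept    <- gems %>% select(carat, cut, price)   # keep columns by name
dropped <- gems %>% select(-x, -y, -z)          # drop columns with a minus sign

names(kept)
names(dropped)
```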
Your Turn: Creating New Variables
Load an Excel file
library(readxl)
pwt <- read_excel("[Link]",
sheet = "Data", skip = 2)
pwt %>% slice(1:5)
## # A tibble: 5 × 47
## countrycode country currency_unit year rgdpe rgdpo pop emp avh hc
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW Aruba Aruban Guilder 1950 NA NA NA NA NA NA
## 2 ABW Aruba Aruban Guilder 1951 NA NA NA NA NA NA
## 3 ABW Aruba Aruban Guilder 1952 NA NA NA NA NA NA
## 4 ABW Aruba Aruban Guilder 1953 NA NA NA NA NA NA
## 5 ABW Aruba Aruban Guilder 1954 NA NA NA NA NA NA
## # ℹ 37 more variables: ccon <dbl>, cda <dbl>, cgdpe <dbl>, cgdpo <dbl>,
## # ck <dbl>, ctfp <dbl>, cwtfp <dbl>, rgdpna <dbl>, rconna <dbl>, rdana <dbl>,
## # rkna <dbl>, rtfpna <dbl>, rwtfpna <dbl>, labsh <dbl>, delta <dbl>,
## # xr <dbl>, pl_con <dbl>, pl_da <dbl>, pl_gdpo <dbl>, i_cig <chr>,
## # i_xm <chr>, i_xr <chr>, i_outlier <chr>, cor_exp <dbl>, statcap <dbl>,
## # csh_c <dbl>, csh_i <dbl>, csh_g <dbl>, csh_x <dbl>, csh_m <dbl>,
## # csh_r <dbl>, pl_c <dbl>, pl_i <dbl>, pl_g <dbl>, pl_x <dbl>, pl_m <dbl>, …
Your Turn: Creating New Variables
rgdpna is GDP
pop is population
Your Turn: Creating New Variables
pwt <- pwt %>% mutate(gdp_percap = rgdpna / pop) # Create a new variable
pwt_sub <- pwt %>% filter(year == 2014) # Subset to 2014
pwt_sub <- pwt_sub %>% select(country, year, gdp_percap) # Select columns
pwt_sub
## # A tibble: 182 × 3
## country year gdp_percap
## <chr> <dbl> <dbl>
## 1 Aruba 2014 36133.
## 2 Angola 2014 8533.
## 3 Anguilla 2014 20652.
## 4 Albania 2014 9965.
## 5 United Arab Emirates 2014 73433.
## 6 Argentina 2014 20200.
## 7 Armenia 2014 9586.
## 8 Antigua and Barbuda 2014 20230.
## 9 Australia 2014 47544.
## 10 Austria 2014 41582.
## # ℹ 172 more rows
Your Turn: Creating New Variables
pwt_sub <- pwt %>%
mutate(gdp_percap = rgdpna / pop) %>%
filter(year == 2014) %>%
select(country, year, gdp_percap)
pwt_sub
## # A tibble: 182 × 3
## country year gdp_percap
## <chr> <dbl> <dbl>
## 1 Aruba 2014 36133.
## 2 Angola 2014 8533.
## 3 Anguilla 2014 20652.
## 4 Albania 2014 9965.
## 5 United Arab Emirates 2014 73433.
## 6 Argentina 2014 20200.
## 7 Armenia 2014 9586.
## 8 Antigua and Barbuda 2014 20230.
## 9 Australia 2014 47544.
## 10 Austria 2014 41582.
## # ℹ 172 more rows
Visualization
Basic Visualizations
Principles
Basic Visualizations
What is the message you want to convey with visualization?
Basic Visualizations
Plot a continuous variable (carat) against a continuous variable
(price)
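A scatter plot is a common choice for two continuous variables. The sketch below uses a few made-up carat and price values so that it runs without ggplot2; with the real data you would pass `diamonds$carat` and `diamonds$price` instead.

```r
# Made-up values, for illustration only.
carat <- c(0.23, 0.31, 0.52, 0.70, 1.01, 1.50)
price <- c(326, 335, 1500, 2800, 5200, 9800)

plot(carat, price,
     xlab = "Carat", ylab = "Price",
     pch = 16, col = "steelblue")

cor(carat, price)   # strength of the linear association
```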
Basic Visualizations
Basic Visualizations
What is the message you want to convey with visualization?
Basic Visualizations
Plotting a categorical variable (cut) against a continuous
variable (price).
Basic Visualizations
Demo: Cars dataset (Motor Trend magazine)
Understand your data:
data(mtcars) ## loads motor cars data
glimpse(mtcars, width = 35)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0,…
## $ hp <dbl> 110, 110, 93, 110, 1…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.…
## $ wt <dbl> 2.620, 2.875, 2.320,…
## $ qsec <dbl> 16.46, 17.02, 18.61,…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4,…
Variables of interest:
wt: weight
Types of data and plots
Quantitative data
Qualitative data
Types of data and plots
Different visualization strategies for different
data types
Multiple plots in one page
par(mfrow=c(1,2))
hist(mtcars$wt,breaks=30) ## histogram
plot(density(mtcars$mpg)) ## density plot
mfrow=c(1,2) puts two plots on a grid with one row and two
columns
A box plot summarizes the distribution of one or more
variables in box format
The vertical bars (whiskers) above and below the box indicate roughly the top 25%
and bottom 25% of the data
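A minimal base R sketch of a box plot, using the built-in `mtcars` data. Grouping by `cyl` is an illustrative choice, not necessarily the one used in the workshop.

```r
# Distribution of fuel efficiency (mpg) for each number of cylinders.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon",
        col = "steelblue")

# The box spans the 25th to 75th percentiles; the line inside is the median.
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
mpg_quartiles
```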
A scatter plot shows how two variables vary together
plot(mtcars$wt, mtcars$hp,
xlab="Weight", ylab="Horse Power",
col="tomato", pch=16, cex=1.5) ## Car weights and horse power
A bar plot visualizes data as bars (suitable for categorical
and proportion data)
tb <- table(mtcars$vs, mtcars$gear)
barplot(tb, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("steelblue","tomato"),
legend = rownames(tb))
Adding lines to the plot
abline() adds vertical, horizontal, and diagonal lines to an
existing plot.
plot(mtcars$wt,mtcars$hp,pch=16,xlab="Weight",ylab="Horse Power")
abline(v=2.5,col="red",lwd=1,lty=2)
abline(h=100,col="steelblue",lwd=4)
abline(a=150,b=20,col="forest green",lwd=2)
Adding lines to the plot
You can add lines to an existing plot with the lines() function.
plot(mtcars$wt,mtcars$hp,pch=16,xlab="Weight",ylab="Horse Power")
lines(seq(1,6,length.out=100),100*sin(seq(-3,3,length.out=100))+150,lwd=2,col="brown")
Adding points to the plot
You can add points to an existing plot with the points()
function.
mtcars.0 <- mtcars[mtcars$vs == 0, ] ## assumed split: V-shaped engines
mtcars.1 <- mtcars[mtcars$vs == 1, ] ## assumed split: straight engines
plot(mtcars.0$wt, mtcars.0$hp, xlab="Weight", ylab="Horse Power", pch=16,
xlim=c(1,6), ylim=c(50,350))
points(mtcars.1$wt, mtcars.1$hp, col="tomato", pch=16)
legend("topleft",legend=c("V-shaped","Straight"),pch=16,col=c("black","tomato"),bty="n")
In-class task:
Use ?legend to find out what the legend() function does in this
plot
If your objective is to compare two distributions, a
quantile-quantile plot (QQ plot) can be useful
How many points lie below/above the red line? What does it
mean?
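The plot itself is not reproduced here. As a sketch under that caveat, base R's `qqplot()` compares the quantiles of two samples. Comparing `mpg` for straight versus V-shaped engines is an assumed example, not necessarily the one from the workshop.

```r
mpg_straight <- mtcars$mpg[mtcars$vs == 1]
mpg_vshaped  <- mtcars$mpg[mtcars$vs == 0]

qqplot(mpg_vshaped, mpg_straight,
       xlab = "V-shaped engine mpg quantiles",
       ylab = "Straight engine mpg quantiles", pch = 16)
abline(a = 0, b = 1, col = "red")   # 45-degree line: identical distributions

# Points above the red line mean the y-axis sample has larger values
# at that quantile than the x-axis sample.
```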
Adding Text to a Plot
plot(mtcars$wt,mtcars$hp,pch=16,xlab="Weight",ylab="Horse Power",
xlim=c(1,6), ylim=c(50,350),
type="n")
text(mtcars$wt,mtcars$hp,labels=rownames(mtcars),cex=0.7)
Saving Your Plot
Use pdf() to save your plot as a PDF
pdf("[Link]",width=5,height=5)
plot(1:10, 1:10, main="test plot")
dev.off()
Visualization Tips
Different types of data call for different visualization
strategies
What car should I buy?
Colors to mark Engine types (Red: V-shaped, Blue: Straight)
N <- nrow(mtcars)
mycolor <- rep("tomato",N) ## repeat "tomato" N number of times
mycolor[ mtcars$vs == 1 ] <- "steelblue" ## for Straight engines, replace "tomato" with "steelblue"
Redefine variables for better interpretation
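The definitions for this step are not shown, but the plotting code that follows uses `x`, `y`, `myshape`, `med.x`, and `med.y`. One plausible set of definitions, inferred from the axis labels and therefore an assumption (the use of `am` for `myshape` in particular is a guess):

```r
x <- mtcars$hp / mtcars$wt    # horse power per unit of weight
y <- mtcars$mpg               # gas efficiency (miles per gallon)

med.x <- median(x)            # reference lines at the medians
med.y <- median(y)

# Hypothetical: plotting symbol by transmission (16 = automatic, 17 = manual).
myshape <- ifelse(mtcars$am == 1, 17, 16)
```

Cars to the right of the vertical median line and above the horizontal one combine above-median power-to-weight with above-median gas efficiency.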
Now visualize data to help make your
decision
plot(x, y, xlab="Horse Power/Weight", ylab="Gas Efficiency",
col=mycolor,
pch=myshape)
abline(v=med.x, col="gray", lwd=2, lty=2)
abline(h=med.y, col="gray", lwd=2, lty=2)
Demo: Cigarette Prices &
Consumption
Define the question:
Demo: Cigarette Prices &
Consumption
Variables:
Preprocessing Data for Better
Interpretation
cig_df <- cig_df %>%
mutate(year = year + 1900) %>% ## year with four digits
mutate(adjusted_price = price/cpi) %>% ## adjust for inflation
mutate(adjusted_pimin = pimin/cpi) ## adjust for inflation
Questions: Cigarette Prices Over
Time
Did the price of cigarettes (adjusting for inflation) increase
between 1963 and 1992?
Did the price of cigarettes (adjusting for inflation) increase
over time?
Are cigarette sales lower in states where the price exceeds
the minimum price in neighboring states?
In which state was the absolute increase most dramatic?
(compare states 5, 25, and 50)
In which state was the absolute increase most dramatic?
(states 5, 25, and 50)
In-class Task: MA Test Score Data,
1997-8
MCAS <- read_csv("[Link]
In-class Task: MA Test Score Data,
1997-8
Codebook
Variables of interest:
In-class Task: MA Test Score Data,
1997-8
1. Define a question.
In-class Task: MA Test Score Data,
1997-8
Do schools with high teacher salaries have high 8th grade
scores?
plot(mass_df$avgsalary, mass_df$totsc8,
xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
pch=16)
In-class Task: MA Test Score Data,
1997-8
How do teacher salary and 8th grade scores interact with per capita
income?
par(mfrow=c(1,3))
plot(mass_df$avgsalary, mass_df$totsc8, cex=2,
xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
col="black", pch=16, ylim=c(630,750), main="All")
plot(mass_df.1$avgsalary, mass_df.1$totsc8, cex=2,
xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
col="steelblue", pch=16, ylim=c(630,750), main="Above Median Income")
plot(mass_df.2$avgsalary, mass_df.2$totsc8, cex=2,
xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
col="tomato", pch=16, ylim=c(630,750), main="Below Median Income")
Quick Tip: Quitting R
When you exit R, it will ask if you want to save your workspace.
Choose NO.