Data Analysis Using R

The document outlines a training session on data analysis using R, focusing on its advantages, basic functionalities, and data manipulation techniques. It covers the structure of R, including its object-oriented nature, and introduces essential concepts such as vectors, matrices, and data frames. Additionally, it discusses the use of R packages like tidyverse for enhanced data wrangling and visualization.


Data Analysis using R

2025 Capacity Building for Kurdish Experts


ByungKoo Kim
KDI School of Public Policy and Management
Friday, Sep 5, 2025

1 / 110
Agenda
Why R?

What is R?

R Basics

Data Wrangling

Visualization

2 / 110
Why R?

3 / 110
Languages to analyze data with
How will we work with the data? So many options...
Excel, SAS, SPSS, Stata, R, Python, MATLAB, GAUSS, C, C++,
Java, Scilab, ...
Our focus: R

4 / 110
5 / 110
So why R?
Simple languages (Excel, SAS, SPSS) are easy but give you
less flexibility

Complex languages (C, C++) are hard but give you lots of
flexibility

R is somewhere in between (not too hard, sufficiently flexible)

Increasingly useful and popular among social scientists

If you can learn R, you can learn other languages more easily

6 / 110
What is R?

7 / 110
R is a language for statistical
computing
Offers a wide variety of statistical tools
You can compute your own statistics
Or you can use what others have written (called packages!)

8 / 110
R is free
Free to download for various OS (Windows, Mac, Linux)
Users upload packages for everyone to use
Packages will be very useful for you!

9 / 110
R is an object-oriented language
Object-oriented programming (OOP) is a way to organize your
code
R structures its code around objects
But what is an object?

10 / 110
R Basics

11 / 110
RStudio
Two key components:

1. Source: Where you write code (.R or .Rmd files)

2. Console: Where you execute code

You can copy-paste code into the console, OR highlight code in the
source pane and press Cmd+Enter (Ctrl+Enter on Windows/Linux).

Tip: Preferences > Editing > Code > Soft-wrap R Source Code

12 / 110
Demonstration
Now we will look at our first R code
Go to class materials, download the R_workshop.R file,
and open it in RStudio
I will walk you through the code
Please feel free to follow on your own screen

13 / 110
Quick Overview of the Demonstration
Objects

Arithmetic Operations

Vectors

Functions

Data Files

Saving Objects

Packages

14 / 110
Creating Objects
<- is the assignment operator that creates an object.

x <- 5
x

## [1] 5

Once an object is created, you can use it for various operations.

x + 10

## [1] 15

An object can be used to create another object

y <- x + 1
y

## [1] 6

15 / 110
Different Classes of Objects
Numeric, logical (boolean), character (string), factor

x <- 5
class(x)

## [1] "numeric"

For characters, use quotation marks.

x <- "R"
class(x)

## [1] "character"

Character objects cannot be used for mathematical operations

x + 10

## Error in x + 10: non-numeric argument to binary operator

16 / 110
Different Classes of Objects
The logical (boolean) class represents a true or false value.

x <- 1 < 3
x

## [1] TRUE

Factor class represents unique categories.

x <- factor(c("female","male","male","female"))
x

## [1] female male male female


## Levels: female male

Note that c() stands for combine.

17 / 110
Creating Vectors
We can also store more than one number in an object.

one_to_six <- c(1, 2, 3, 4, 5, 6)


one_to_six <- 1:6 # Equivalent
one_to_six <- seq(from = 1, to = 6, by = 1) # Equivalent

one_to_six^2

## [1] 1 4 9 16 25 36

We'll call this a vector.

18 / 110
Create a character vector.

y1 <- "hello world"


y1

## [1] "hello world"

y2 <- c("hello", "world")


y2

## [1] "hello" "world"

What's the difference between the two vectors, y1 and y2?

19 / 110
A vector can contain only one class.

z1 <- c(1,2)
class(z1)

## [1] "numeric"

z2 <- c(1,2,"character")
z2

## [1] "1" "2" "character"

class(z2)

## [1] "character"
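
The coercion shown above can be explored a little further. The following is a minimal sketch in base R; the names z_num and mixed are illustrative and not part of the workshop file. It shows that converting the character vector back to numbers yields NA for text that is not a number, and that logical values mixed with numbers become numeric.

```r
# Mixing numbers and text turns every element into a character string
z2 <- c(1, 2, "character")

# Converting back: strings that are not numbers become NA
# (R would normally warn about this; the warning is silenced on purpose)
z_num <- suppressWarnings(as.numeric(z2))
print(z_num)

# Mixing logical and numeric values gives a numeric vector (TRUE becomes 1)
mixed <- c(TRUE, 2, 3.5)
print(class(mixed))
```

The general pattern is that c() picks the most flexible class present: logical, then numeric, then character.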

20 / 110
Vector operations
Numeric operations for vectors with numeric elements work
element-wise.

x1 <- c(1,3,5)
x2 <- c(4,1,2)
x1 + x2

## [1] 5 4 7

x1 * x2

## [1] 4 3 10

x1 / x2

## [1] 0.25 3.00 2.50

21 / 110
Vector operations
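Numeric operations don't work nicely for vectors of different
lengths: R recycles the shorter vector and warns when the longer
length is not a multiple of the shorter one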

x1 <- c(1,3,5)
x2 <- c(4,1)
x1 + x2

## Warning in x1 + x2: longer object length is not a multiple of shorter object
## length

## [1] 5 4 9
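
The result above comes from recycling: R reuses the shorter vector from its start until it matches the longer one. A small sketch (illustrative values, not from the workshop file) shows the case where the lengths divide evenly, so no warning is raised.

```r
x1 <- c(1, 3, 5, 7)
x2 <- c(10, 20)

# x2 is recycled as 10, 20, 10, 20; no warning because 4 is a multiple of 2
recycled <- x1 + x2
print(recycled)

# A single number is a length-1 vector, so it is recycled for every element
doubled <- x1 * 2
print(doubled)
```

This is also why family_ages + 5 works: the 5 is recycled for each family member.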

22 / 110
Now Your Turn: Creating Vectors
Create a vector with the ages of everyone in your family.

Add 5 years to your family's ages.

Feed your vector to the following functions and think about what they do:
length(), mean(), sum(), min(), and max()

Use Google or type ?length() into your R console to figure out what the function does.

23 / 110
Task: Creating Vectors
family_ages <- c(33, 37, 35) # My partner, me, my brother.
family_ages + 5 # In 5 years

## [1] 38 42 40

length(family_ages)

## [1] 3

mean(family_ages)

## [1] 35

sum(family_ages)

## [1] 105

c(min(family_ages), max(family_ages))

## [1] 33 37
24 / 110
Reading a Help Page
?mean

Description
Generic function for the (trimmed) arithmetic mean.

Usage

mean(x, trim = 0, na.rm = FALSE, ...)

Arguments of the Function


x -- An R object. Currently there are methods for numeric/logical
vectors and date...
na.rm -- a logical value indicating whether NA values should be
stripped...

Value (i.e., the output)


If trim is zero (the default), the arithmetic mean of the values in x
is computed...

25 / 110
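
To see what the trim and na.rm arguments from the help page do in practice, here is a short sketch with made-up numbers (the vector ages and the result names are assumptions for illustration).

```r
ages <- c(33, 37, NA, 35)

# With a missing value, the default result is NA
m_default <- mean(ages)

# na.rm = TRUE drops NA values before averaging
m_narm <- mean(ages, na.rm = TRUE)

# trim = 0.2 drops the lowest 20% and highest 20% of values before averaging,
# so here the extreme values 1 and 100 are ignored
m_trim <- mean(c(1, 2, 3, 4, 100), trim = 0.2)

print(c(m_default, m_narm, m_trim))
```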
Indexing
How do we access specific elements in a vector? We use
brackets.

x <- c(1,4,6,7,10)
x[2]

## [1] 4

x[length(x)]

## [1] 10

You can also use integers to get multiple elements from a vector.

x <- c(1,4,6,7,10)
idx <- c(1,3,5)
x[idx]

## [1] 1 6 10
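
Brackets accept a few more forms than single positions. The sketch below (base R; variable names are illustrative) shows a range of positions, a negative index that drops an element, and assignment through an index.

```r
x <- c(1, 4, 6, 7, 10)

middle <- x[2:4]          # elements 2 through 4
without_first <- x[-1]    # a negative index drops that position

x_new <- x
x_new[2] <- 40            # brackets can also be used to replace a value

print(middle)
print(without_first)
print(x_new)
```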

26 / 110
Reference: Common Logical
Expressions

==      Equal to
!=      Not equal to
%in%    Is contained in
>, <    Greater than, less than
>=, <=  Greater than or equal to, less than or equal to
&       AND
|       OR

27 / 110
Indexing
You can also use a logical (boolean) condition to get elements from a vector.

x <- c(1,4,6,7,10)
x[x < 5]

## [1] 1 4

x[x >= 7]

## [1] 7 10

x[x %in% 1:5]

## [1] 1 4
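
Logical conditions can be combined with & and |, and counted with sum(). A minimal sketch using the same vector (result names are illustrative):

```r
x <- c(1, 4, 6, 7, 10)

between <- x[x > 3 & x < 8]     # both conditions must hold
extremes <- x[x < 2 | x > 8]    # either condition may hold

n_large <- sum(x > 5)           # TRUE counts as 1, so this counts matches
positions <- which(x > 5)       # positions (not values) where the test is TRUE

print(between)
print(extremes)
print(n_large)
print(positions)
```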

28 / 110
Matrix
A matrix is a set of numbers arranged in rows and columns.

m1 <- matrix(1:6,nrow=2,ncol=3)
m1

## [,1] [,2] [,3]


## [1,] 1 3 5
## [2,] 2 4 6

m2 <- matrix(1:6,nrow=3,ncol=2)
m2

## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6

29 / 110
Matrix
If you stack vectors either row-wise or column-wise, you get a
matrix.

x1 <- c(3,4,5)
x2 <- c(11,12,13)
X <- rbind(x1,x2)
X

## [,1] [,2] [,3]


## x1 3 4 5
## x2 11 12 13

Y <- cbind(x1,x2)
Y

## x1 x2
## [1,] 3 11
## [2,] 4 12
## [3,] 5 13

30 / 110
Matrix
Just like vectors, a matrix can contain only one class of
elements.

x1 <- c(3,4,5)
x2 <- c("A","B","C")
X <- rbind(x1,x2)
X

## [,1] [,2] [,3]


## x1 "3" "4" "5"
## x2 "A" "B" "C"

In this sense, a matrix is essentially a storage for vectors.

31 / 110
Matrix
Indexing is done by specifying a row, a column, or both.

m1 <- matrix(1:6,nrow=2,ncol=3)
m1

## [,1] [,2] [,3]


## [1,] 1 3 5
## [2,] 2 4 6

m1[1,]

## [1] 1 3 5

m1[,1]

## [1] 1 2

m1[1,3]

## [1] 5
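
Two details of matrix indexing are worth a quick sketch (base R; names are illustrative). Selecting a single row returns a plain vector unless drop = FALSE is given, and matrix() fills column by column unless byrow = TRUE is set.

```r
m1 <- matrix(1:6, nrow = 2, ncol = 3)

sub <- m1[, 2:3]                    # columns 2 and 3: still a matrix (2 x 2)
row_vec <- m1[1, ]                  # a single row becomes a plain vector
row_mat <- m1[1, , drop = FALSE]    # drop = FALSE keeps it a 1 x 3 matrix

# Fill row by row instead of the default column by column
m_byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

print(dim(sub))
print(dim(row_mat))
print(m_byrow)
```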

32 / 110
Data Frame
Another object we'll commonly work with is a data.frame
(think: spreadsheet).

Row: Observation; Column: Variable

Data frames are similar to matrices in form, but can contain vectors of different classes

test_df <- data.frame(id = 1:3, age = c(22,34,51), nationality = c("Korea", "Japan", "Canada"))
test_df

## id age nationality
## 1 1 22 Korea
## 2 2 34 Japan
## 3 3 51 Canada

str(test_df)

## 'data.frame': 3 obs. of 3 variables:


## $ id : int 1 2 3
## $ age : num 22 34 51
## $ nationality: chr "Korea" "Japan" "Canada"

33 / 110
Data Frame
You can use $ to read or write a column (variable) in a data
frame

test_df$nationality

## [1] "Korea" "Japan" "Canada"

Overwrite an existing variable in a data frame

test_df$nationality <- c("China","US","Congo")


test_df

## id age nationality
## 1 1 22 China
## 2 2 34 US
## 3 3 51 Congo

34 / 110
Data Frame
Append a new variable to a data frame

test_df$Q1 <- c("Agree","Disagree","Strongly Agree")


test_df

## id age nationality Q1
## 1 1 22 China Agree
## 2 2 34 US Disagree
## 3 3 51 Congo Strongly Agree

35 / 110
Indexing Data Frame
Indexing with a data frame is more or less the same as indexing
with a matrix. With data frames, however, you can use $ to
extract a column (or variable).

test_df

## id age nationality Q1
## 1 1 22 China Agree
## 2 2 34 US Disagree
## 3 3 51 Congo Strongly Agree

test_df$age

## [1] 22 34 51

36 / 110
Indexing Data Frame
Equivalently, you can use the variable name(s) to get columns.

test_df[,c("age","nationality")]

## age nationality
## 1 22 China
## 2 34 US
## 3 51 Congo
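
Rows can be selected the same way, by putting a logical condition before the comma. The sketch below rebuilds test_df so it runs on its own; the names older, older_names, and age_vec are illustrative.

```r
test_df <- data.frame(id = 1:3, age = c(22, 34, 51),
                      nationality = c("China", "US", "Congo"),
                      stringsAsFactors = FALSE)

# Rows where age is above 30, all columns
older <- test_df[test_df$age > 30, ]

# Same rows, but only the nationality column (returned as a vector)
older_names <- test_df[test_df$age > 30, "nationality"]

# [[ ]] extracts one column as a vector, like $
age_vec <- test_df[["age"]]

print(older)
print(older_names)
```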

37 / 110
Useful Functions for Data Frames
dim(test_df) # You can separately find the number of rows or columns: nrow(test_df); ncol(test_df)

## [1] 3 4

names(test_df)

## [1] "id" "age" "nationality" "Q1"

str(test_df)

## 'data.frame': 3 obs. of 4 variables:


## $ id : int 1 2 3
## $ age : num 22 34 51
## $ nationality: chr "China" "US" "Congo"
## $ Q1 : chr "Agree" "Disagree" "Strongly Agree"

38 / 110
Useful Functions for Data Frames
summary(test_df)

## id age nationality Q1
## Min. :1.0 Min. :22.00 Length:3 Length:3
## 1st Qu.:1.5 1st Qu.:28.00 Class :character Class :character
## Median :2.0 Median :34.00 Mode :character Mode :character
## Mean :2.0 Mean :35.67
## 3rd Qu.:2.5 3rd Qu.:42.50
## Max. :3.0 Max. :51.00

39 / 110
Your Turn: Working with Data Frame
Create a data frame with 4 rows and 3 columns

The first column contains integers 1,2,3, and 4

The second column contains words "dog","cat","eagle", and "monkey"

The third column contains factor elements "Jack","Amanda","Brian", and "Charles"

Change the entry at row 1 and column 2 to "rabbit"

Append a new column that contains 5,6,7, and 8

[extra] Add variable names of your choice to the data frame

40 / 110
Your Turn: Working with Data Frame
mydata <- data.frame("v1"=1:4,
"v2"=c("dog","cat","eagle","monkey"),
"v3"=factor(c("Jack","Amanda","Brian","Charles"))) # factor(), as the task asks
mydata[1,2] <- "rabbit"
mydata$v4 <- 5:8
names(mydata) <- c("num1to4","animals","friends","num5to8")
mydata

## num1to4 animals friends num5to8


## 1 1 rabbit Jack 5
## 2 2 cat Amanda 6
## 3 3 eagle Brian 7
## 4 4 monkey Charles 8

41 / 110
Data Wrangling

42 / 110
What is a package in R
An R package is a collection of functions, data, and
documentation that

extends the capabilities of the R programming language

works as an "add-on" or "plugin" that provides new tools
ex1) tidyverse for data wrangling
ex2) ggplot2 for data visualization
ex3) MCMCpack for fitting Bayesian models with MCMC algorithms

43 / 110
Package 1: tidyverse
We'll use a package from tidyverse, dplyr, to manipulate data
frames.

# Install package (only needs to be done once per R installation)
install.packages("tidyverse")

# Load package (must be done in every new R session)
library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.3.3
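
Because installation is only needed once, some people guard it with a check. The following is one possible sketch, not part of the workshop file; is_installed is a hypothetical helper name, and the install line is commented out so the example does not download anything.

```r
# Hypothetical helper: TRUE if a package can be loaded, FALSE otherwise
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

print(is_installed("stats"))   # "stats" ships with R, so this is TRUE

# Install only when missing (uncomment to use):
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```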

44 / 110
Loading Data
Loading data in your local folder requires setting the correct
path

In R, this begins with setting the right working directory

getwd() ## check current working directory

## [1] "/Users/byungkookim/Dropbox/KDIS/Workshop/R"

You can set the path to any directory you want

setwd("~/Dropbox/KDIS/Workshop")
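
An alternative to changing the working directory is to build the full path to a file. The sketch below uses base R only and writes a small made-up file to a temporary folder, so the file name example.csv and its contents are assumptions for illustration.

```r
data_dir <- tempdir()                              # a folder that always exists
csv_path <- file.path(data_dir, "example.csv")     # joins folder and file name

# Write a tiny made-up data frame so there is something to read back
write.csv(data.frame(id = 1:3, value = c(2.5, 3.0, 4.5)),
          csv_path, row.names = FALSE)

print(file.exists(csv_path))   # check the path before trying to load it
example <- read.csv(csv_path)
print(example)
```

Checking file.exists() first is a quick way to tell a wrong path apart from a problem with the file itself.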

45 / 110
Loading Data
Once you have set the working directory, you can load the data.

cigar <- read_csv("[Link]")

## Rows: 1380 Columns: 10


## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): rownames, state, year, price, pop, pop16, cpi, ndi, sales, pimin
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

46 / 110
Your Turn: Loading a Data Frame
Download the cigar CSV file from class materials and store it
in the directory of your choice

Load the cigar CSV file into R

Compute the average price of cigarettes in the data

47 / 110
Your Turn: Loading a Data Frame
cigar <- read_csv("[Link]")

## Rows: 1380 Columns: 10


## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): rownames, state, year, price, pop, pop16, cpi, ndi, sales, pimin
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(cigar)

## # A tibble: 6 × 10
## rownames state year price pop pop16 cpi ndi sales pimin
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 63 28.6 3383 2236. 30.6 1558. 93.9 26.1
## 2 2 1 64 29.8 3431 2277. 31 1684. 95.4 27.5
## 3 3 1 65 29.8 3486 2328. 31.5 1810. 98.5 28.9
## 4 4 1 66 31.5 3524 2370. 32.4 1915. 96.4 29.5
## 5 5 1 67 31.6 3533 2394. 33.4 2024. 95.5 29.6
## 6 6 1 68 35.6 3522 2405. 34.8 2202. 88.4 32

mean(cigar$price)

## [1] 68.69993
48 / 110
Subset a Data Frame
*Pipes: The pipe (%>%) in tidyverse is R syntax that passes the object on its
left to the function on its right, so you can manage/manipulate data frames
with better legibility.

filter()

mutate()

select()
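
To make the idea of a pipe concrete, here is a minimal sketch that assumes the dplyr package is installed. It uses the built-in mtcars data rather than the workshop data, and writes the same two steps once as nested calls and once with pipes.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- select(filter(mtcars, cyl == 4), mpg, wt)

# Piped calls: read from top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, wt)

print(identical(nested, piped))   # both give the same data frame
print(nrow(piped))
```

Both versions do the same work; the piped form simply lists the steps in the order they happen.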

49 / 110
Subset a Data Frame
filter() subsets a data frame

cigar16 <- cigar %>% filter(state == 16)


head(cigar16)

## # A tibble: 6 × 10
## rownames state year price pop pop16 cpi ndi sales pimin
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 361 16 63 26.4 2758 1881. 30.6 2153. 115. 25.4
## 2 362 16 64 27.9 2763 1889. 31 2281. 110. 25.6
## 3 363 16 65 28.1 2758 1897. 31.5 2538. 116 26.1
## 4 364 16 66 31.6 2764 1912. 32.4 2722. 108. 26.2
## 5 365 16 67 32 2772 1931. 33.4 2745. 114. 27.5
## 6 366 16 68 36.3 2775 1950. 34.8 2918. 109. 29.2

50 / 110
Subset a Data Frame
expensive_cigar <- cigar %>% filter(price >= mean(price))
head(expensive_cigar)

## # A tibble: 6 × 10
## rownames state year price pop pop16 cpi ndi sales pimin
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 1 81 68.8 3917 2925. 90.9 7042. 120. 62.6
## 2 20 1 82 73.1 3943 2954. 96.5 7505. 119. 67.8
## 3 21 1 83 84.4 3959 2978. 99.6 7975. 116. 78.6
## 4 22 1 84 90.8 3990 3009. 104. 8693. 113 86.8
## 5 23 1 85 99 4020 3040. 108. 9059. 114. 90.7
## 6 24 1 86 103 4050 3072. 110. 9675. 116. 98.8

51 / 110
Your Turn: Subset a Data Frame
## install.packages("ggplot2")
library(ggplot2)
data(diamonds) # Test dataset that ships with ggplot2 package.
head(diamonds) # Shows us the first several observations of a data.frame.

## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

How many observations? What variables are included?

How many observations have a "Premium" cut?

How many observations have a "Premium" cut OR are less


than 0.5 carats?

What is the average price of diamonds in the data?


52 / 110
Your Turn: Subsetting a Data Frame
How many observations? What variables are included?

nrow(diamonds); names(diamonds)

## [1] 53940

## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"


## [8] "x" "y" "z"

How many observations have a "Premium" cut?

premium_diamonds <- diamonds %>% filter(cut == "Premium")
nrow(premium_diamonds) # Number of observations with a "Premium" cut

53 / 110
Your Turn: Subsetting a Data Frame
"Premium" cut OR are less than 0.5 carats?

premium_diamonds <- diamonds %>%


filter(cut == "Premium" | carat < 0.5)
head(premium_diamonds)

## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

54 / 110
Creating New Variables
mutate() changes/adds variables to a data frame

diamonds <- diamonds %>%


mutate(ratio = price/carat)
diamonds %>% slice(1:5)

## # A tibble: 5 × 11
## carat cut color clarity depth table price x y z ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1417.
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 1552.
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1422.
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 1152.
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1081.

55 / 110
Creating New Variables
diamonds %>% arrange(ratio)

## # A tibble: 53,940 × 11
## carat cut color clarity depth table price x y z ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.43 Premium H I1 62 59 452 4.78 4.83 2.98 1051.
## 2 0.32 Premium E I1 60.9 58 345 4.38 4.42 2.68 1078.
## 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1081.
## 4 0.33 Ideal J SI2 62.4 54 366 4.43 4.45 2.77 1109.
## 5 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71 1110.
## 6 0.32 Good D I1 64 54 361 4.33 4.36 2.78 1128.
## 7 0.3 Good J SI1 64 55 339 4.25 4.28 2.73 1130
## 8 0.31 Very Good J SI1 59.4 62 353 4.39 4.43 2.62 1139.
## 9 0.31 Very Good J SI1 58.1 62 353 4.44 4.47 2.59 1139.
## 10 0.36 Premium J SI1 61.6 60 410 4.54 4.58 2.81 1139.
## # ℹ 53,930 more rows

56 / 110
Creating New Variables
diamonds %>% arrange(-ratio)

## # A tibble: 53,940 × 11
## carat cut color clarity depth table price x y z ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1.04 Very Good D IF 61.3 56 18542 6.53 6.55 4.01 17829.
## 2 1.07 Premium D IF 60.9 58 18279 6.67 6.57 4.03 17083.
## 3 1.03 Ideal D IF 62 56 17590 6.55 6.44 4.03 17078.
## 4 1.07 Very Good D IF 60.9 58 18114 6.57 6.67 4.03 16929.
## 5 1.02 Very Good D IF 61.7 59 17100 6.42 6.52 3.99 16765.
## 6 1.07 Very Good D IF 59 59 17909 6.63 6.72 3.94 16737.
## 7 1.09 Very Good D IF 61.7 58 18231 6.55 6.65 4.07 16726.
## 8 1 Ideal D IF 60.7 57 16469 6.44 6.48 3.92 16469
## 9 1.01 Premium D IF 61.6 56 16234 6.46 6.43 3.97 16073.
## 10 1 Very Good D IF 63.3 59 16073 6.37 6.33 4.02 16073
## # ℹ 53,930 more rows
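
For comparison, sorting can also be sketched in base R with order(). This is not how arrange() is implemented, only an equivalent result on a small made-up data frame (toy and its values are assumptions for illustration).

```r
toy <- data.frame(item = c("a", "b", "c"),
                  ratio = c(1417, 1552, 1081),
                  stringsAsFactors = FALSE)

# order() returns the row positions that would sort the column
ascending <- toy[order(toy$ratio), ]

# Negating the column reverses the order, like arrange(-ratio)
descending <- toy[order(-toy$ratio), ]

# decreasing = TRUE is an equivalent way to ask for the same thing
descending2 <- toy[order(toy$ratio, decreasing = TRUE), ]

print(ascending)
print(descending)
```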

57 / 110
Selecting Variables
names(diamonds)

## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"


## [8] "x" "y" "z" "ratio"

diamonds %>% select(carat, cut, price)

## # A tibble: 53,940 × 3
## carat cut price
## <dbl> <ord> <int>
## 1 0.23 Ideal 326
## 2 0.21 Premium 326
## 3 0.23 Good 327
## 4 0.29 Premium 334
## 5 0.31 Good 335
## 6 0.24 Very Good 336
## 7 0.24 Very Good 336
## 8 0.26 Very Good 337
## 9 0.22 Fair 337
## 10 0.23 Very Good 338
## # ℹ 53,930 more rows

58 / 110
Selecting Variables
names(diamonds)

## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"


## [8] "x" "y" "z" "ratio"

diamonds %>% select(-c(x,y,z))

## # A tibble: 53,940 × 8
## carat cut color clarity depth table price ratio
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 1417.
## 2 0.21 Premium E SI1 59.8 61 326 1552.
## 3 0.23 Good E VS1 56.9 65 327 1422.
## 4 0.29 Premium I VS2 62.4 58 334 1152.
## 5 0.31 Good J SI2 63.3 58 335 1081.
## 6 0.24 Very Good J VVS2 62.8 57 336 1400
## 7 0.24 Very Good I VVS1 62.3 57 336 1400
## 8 0.26 Very Good H SI1 61.9 55 337 1296.
## 9 0.22 Fair E VS2 65.1 61 337 1532.
## 10 0.23 Very Good H VS1 59.4 61 338 1470.
## # ℹ 53,930 more rows

59 / 110
Your Turn: Creating New Variables
Load an Excel file

library(readxl)
pwt <- read_excel("[Link]",
sheet = "Data", skip = 2)
pwt %>% slice(1:5)

## # A tibble: 5 × 47
## countrycode country currency_unit year rgdpe rgdpo pop emp avh hc
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW Aruba Aruban Guilder 1950 NA NA NA NA NA NA
## 2 ABW Aruba Aruban Guilder 1951 NA NA NA NA NA NA
## 3 ABW Aruba Aruban Guilder 1952 NA NA NA NA NA NA
## 4 ABW Aruba Aruban Guilder 1953 NA NA NA NA NA NA
## 5 ABW Aruba Aruban Guilder 1954 NA NA NA NA NA NA
## # ℹ 37 more variables: ccon <dbl>, cda <dbl>, cgdpe <dbl>, cgdpo <dbl>,
## # ck <dbl>, ctfp <dbl>, cwtfp <dbl>, rgdpna <dbl>, rconna <dbl>, rdana <dbl>,
## # rkna <dbl>, rtfpna <dbl>, rwtfpna <dbl>, labsh <dbl>, delta <dbl>,
## # xr <dbl>, pl_con <dbl>, pl_da <dbl>, pl_gdpo <dbl>, i_cig <chr>,
## # i_xm <chr>, i_xr <chr>, i_outlier <chr>, cor_exp <dbl>, statcap <dbl>,
## # csh_c <dbl>, csh_i <dbl>, csh_g <dbl>, csh_x <dbl>, csh_m <dbl>,
## # csh_r <dbl>, pl_c <dbl>, pl_i <dbl>, pl_g <dbl>, pl_x <dbl>, pl_m <dbl>, …

60 / 110
Your Turn: Creating New Variables
rgdpna is GDP
pop is population

Create GDP per Capita variable


Create a new data frame by subsetting pwt to year 2014
Display country, year, gdp_percap variables of the new data

61 / 110
Your Turn: Creating New Variables
pwt <- pwt %>% mutate(gdp_percap = rgdpna / pop) # Create a new variable
pwt_sub <- pwt %>% filter(year == 2014) # Subset to 2014
pwt_sub <- pwt_sub %>% select(country, year, gdp_percap) # Select columns
pwt_sub

## # A tibble: 182 × 3
## country year gdp_percap
## <chr> <dbl> <dbl>
## 1 Aruba 2014 36133.
## 2 Angola 2014 8533.
## 3 Anguilla 2014 20652.
## 4 Albania 2014 9965.
## 5 United Arab Emirates 2014 73433.
## 6 Argentina 2014 20200.
## 7 Armenia 2014 9586.
## 8 Antigua and Barbuda 2014 20230.
## 9 Australia 2014 47544.
## 10 Austria 2014 41582.
## # ℹ 172 more rows

62 / 110
Your Turn: Creating New Variables
pwt_sub <- pwt %>%
mutate(gdp_percap = rgdpna / pop) %>%
filter(year == 2014) %>%
select(country, year, gdp_percap)
pwt_sub

## # A tibble: 182 × 3
## country year gdp_percap
## <chr> <dbl> <dbl>
## 1 Aruba 2014 36133.
## 2 Angola 2014 8533.
## 3 Anguilla 2014 20652.
## 4 Albania 2014 9965.
## 5 United Arab Emirates 2014 73433.
## 6 Argentina 2014 20200.
## 7 Armenia 2014 9586.
## 8 Antigua and Barbuda 2014 20230.
## 9 Australia 2014 47544.
## 10 Austria 2014 41582.
## # ℹ 172 more rows

63 / 110
Visualization

64 / 110
Basic Visualizations
Principles

What is the message you want to convey with visualization?

What quantity best represents your message?

What visualization strategy best displays your quantity?


(multiple try-outs!)

65 / 110
Basic Visualizations
What is the message you want to convey with visualization?

Are bigger diamonds always more expensive?

What quantity best represents your message?

price is a continuous variable that measures the value of the


diamonds.

carat is a continuous variable that measures the size of the


diamonds.

66 / 110
Basic Visualizations
Plot a continuous variable (carat) against a continuous variable
(price)

Scatter plot is ideal for continuous × continuous variables.

plot(diamonds$carat, diamonds$price, pch=16, xlab="Carat", ylab="Price")

67 / 110
Basic Visualizations

68 / 110
Basic Visualizations
What is the message you want to convey with visualization?

How does the quality of diamonds affect their price?

What quantity best represents your message?

price is a continuous variable that measures the value of the


diamonds.

cut is a categorical variable that measures the quality of the


diamonds.

69 / 110
Basic Visualizations
Plotting a categorical variable (cut) against a continuous
variable (price).

If one of the plotted variables is categorical, consider using the


box plot.

boxplot(price ~ cut, data=diamonds, pch=16, xlab="Quality", ylab="Price") ## box plot

70 / 110
Basic Visualizations

71 / 110
Demo: Cars dataset (Motor Trend
magazine)
Understand your data:
data(mtcars) ## loads motor cars data
glimpse(mtcars, width = 35)

## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0,…
## $ hp <dbl> 110, 110, 93, 110, 1…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.…
## $ wt <dbl> 2.620, 2.875, 2.320,…
## $ qsec <dbl> 16.46, 17.02, 18.61,…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4,…

72 / 110
Variables of interest:

mpg: miles per gallon

cyl: number of cylinders

hp: gross horsepower

wt: weight

vs: engine (0 = V-shaped, 1 = straight)

gear: number of forward gears

73 / 110
Types of data and plots
Quantitative data

Continuous: fractional numbers (e.g., price of products, weight of a car)
Discrete: integers (e.g., total counts, page number)

Qualitative data

Ordinal: categories with natural ordering (e.g., letter grades, ranking)
Nominal: categories without natural ordering (e.g., gender, hair color, nationality)

For more: [Link]

74 / 110
Types of data and plots
different visualization strategies for different
data types

Histogram best visualizes the distribution of one variable

Suitable for discrete (countable) data or binned continuous data. For
continuous data, a density plot gives a smoothed view

75 / 110
Multiple plots in one page
par(mfrow=c(1,2))
hist(mtcars$wt,breaks=30) ## histogram
plot(density(mtcars$mpg)) ## density plot

We can choose how to display multiple plots in one page with


par(mfrow)

mfrow decides the matrix grid where plots will be placed.

mfrow=c(1,2) puts two plots on a grid with one row and two
columns

mfrow=c(n,k) puts n × k plots on a grid with n rows and k


columns
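
A sketch of a 2 x 2 grid follows. One detail added here as a suggestion: par() returns the previous settings, so they can be stored and restored afterwards. The pdf(NULL) line opens an off-screen device only so the example runs without a display; in RStudio it and the dev.off() line can be left out.

```r
pdf(NULL)                          # off-screen device (assumption for this sketch)

old_par <- par(mfrow = c(2, 2))    # 2 rows x 2 columns; keeps the old settings
hist(mtcars$mpg, main = "mpg")
hist(mtcars$wt, main = "wt")
plot(density(mtcars$mpg), main = "mpg density")
boxplot(mtcars$hp, main = "hp")
par(old_par)                       # back to one plot per page

invisible(dev.off())
```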

76 / 110
par(mfrow=c(1,2))
hist(mtcars$wt,breaks=30) ## histogram
plot(density(mtcars$mpg)) ## density plot

77 / 110
Box plot best visualizes the distribution of one or more
variables in box format

boxplot(mpg~vs,data=mtcars,xlab="Engine Types",ylab="Miles per gallon") ## Fuel efficiency (mpg) b

xlab and ylab write the label of x-axis and y-axis


respectively.

A box plot displays the center 50% of data as a box


78 / 110
boxplot(mpg~vs,data=mtcars,xlab="Engine Types",ylab="Miles per gallon") ## Fuel efficiency (mpg) b

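Vertical bars (whiskers) above and below the box cover roughly the top 25%
and bottom 25% of data; points more than 1.5 times the interquartile range
away from the box are drawn separately as outliers

We can easily identify the skewness of a distribution with a box plot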
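
The numbers behind a box plot can be computed directly. The sketch below uses base R on mtcars$mpg; the result names are illustrative. Note that boxplot() uses "hinges", which are close to the quartiles but not always identical to what quantile() reports.

```r
x <- mtcars$mpg

# Quartiles: the box spans roughly from the 25% to the 75% quantile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range = height of the box
box_height <- IQR(x)

# The five numbers boxplot() draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
stats <- boxplot.stats(x)$stats

print(q)
print(box_height)
print(stats)
```

If the median sits much closer to one end of the box, or one whisker is much longer than the other, the distribution is skewed in that direction.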

79 / 110
Scatter plot best visualizes the relationship (covariation) between two variables

plot(mtcars$wt, mtcars$hp,
xlab="Weight", ylab="Horse Power",
col="tomato", pch=16, cex=1.5) ## Car weights and horse power

col decides the color of the points

pch decides the shape of the points

cex decides the size of the points

80 / 110
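
What the scatter plot shows visually can also be summarized as a number. A minimal sketch in base R (variable names are illustrative): covariance depends on the units of the variables, while correlation rescales it to lie between -1 and 1.

```r
wt <- mtcars$wt
hp <- mtcars$hp

cov_wt_hp <- cov(wt, hp)   # covariance: sign shows direction, size depends on units
cor_wt_hp <- cor(wt, hp)   # correlation: covariance divided by both standard deviations

print(cov_wt_hp)
print(cor_wt_hp)
print(cov_wt_hp / (sd(wt) * sd(hp)))   # same value as cor_wt_hp
```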


Bar plot visualizes data in bar format (suitable for categories
and proportions data)

tb <- table(mtcars$vs, mtcars$gear)


barplot(tb, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("steelblue","tomato"),
legend = rownames(tb))

81 / 110
Bar plot visualizes data in bar format (suitable for categories
and proportions data)

barplot(tb, main="Car Distribution by Gears and VS",


xlab="Number of Gears", col=c("steelblue","tomato"),
legend = rownames(tb),
beside = TRUE)

82 / 110
tb <- table(mtcars$vs, mtcars$gear)
barplot(tb, main="Car Distribution by Gears and VS",
xlab="Number of Gears", col=c("steelblue","tomato"),
legend = rownames(tb))

barplot(tb, main="Car Distribution by Gears and VS",


xlab="Number of Gears", col=c("steelblue","tomato"),
legend = rownames(tb),
beside = TRUE)

The legend argument adds a key identifying the groups shown in the plot

beside=TRUE puts bars of different groups side-by-side

83 / 110
Adding lines to the plot
abline() adds vertical, horizontal, and diagonal lines to an
existing plot.

You need to first generate a plot before using the abline() function.

84 / 110
plot(mtcars$wt,mtcars$hp,pch=16,xlab="Weight",ylab="Horse Power")
abline(v=2.5,col="red",lwd=1,lty=2)
abline(h=100,col="steelblue",lwd=4)
abline(a=150,b=20,col="forest green",lwd=2)

85 / 110
Adding lines to the plot
You can add lines to an existing plot with lines() function.

plot(mtcars$wt,mtcars$hp,pch=16,xlab="Weight",ylab="Horse Power")
lines(seq(1,6,length.out=100),100*sin(seq(-3,3,length.out=100))+150,lwd=2,col="brown")

86 / 110
Adding points to the plot
You can add points to an existing plot with points()
function.

mtcars.0 <- filter(mtcars,vs==0) ## Cars with V-shaped engines


mtcars.1 <- filter(mtcars,vs==1) ## Cars with straight engines

87 / 110
plot(mtcars.0$wt, mtcars.0$hp, xlab="Weight", ylab="Horse Power", pch=16,
xlim=c(1,6), ylim=c(50,350))
points(mtcars.1$wt, mtcars.1$hp, col="tomato", pch=16)
legend("topleft",legend=c("V-shaped","Straight"),pch=16,col=c("black","tomato"),bty="n")

In-class task:
Use ?legend to find out what the legend() function does in this
plot

88 / 110
If your objective is to compare two distributions, quantile-
quantile plot (QQ-plot) can be useful

qqplot(mtcars.0$wt, mtcars.1$wt, xlab="V-shaped", ylab="Straight",pch=16,


xlim=c(0,6),ylim=c(0,6))
abline(a=0,b=1,col="red",lwd=2) ## the line on which values from two distributions are equal

How many points lie below/above the red line? What does it
mean?
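
One way to answer this without counting by eye is to ask qqplot() for the points instead of the picture. The sketch below uses base R subsetting instead of filter() so it runs without extra packages; plot.it = FALSE is a documented argument of qqplot(), and the result names are illustrative.

```r
v_shaped <- mtcars$wt[mtcars$vs == 0]   # weights of cars with V-shaped engines
straight <- mtcars$wt[mtcars$vs == 1]   # weights of cars with straight engines

# plot.it = FALSE returns the sorted, matched quantiles without drawing
qq <- qqplot(v_shaped, straight, plot.it = FALSE)

below <- sum(qq$y < qq$x)   # points under the line y = x
above <- sum(qq$y > qq$x)   # points over the line y = x

print(c(below = below, above = above))
```

A point below the line means that, at that quantile, the group on the y-axis (straight engines) is lighter than the group on the x-axis (V-shaped engines).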
89 / 110
Adding Text to a Plot
plot(mtcars$wt,mtcars$hp,pch=16,xlab="Weight",ylab="Horse Power",
xlim=c(1,6), ylim=c(50,350),
type="n")
text(mtcars$wt,mtcars$hp,label=rownames(mtcars),cex=0.7)

90 / 110
Saving Your Plot
Use pdf() to save your plot in pdf

Use png() to save your plot in png

Type dev.off() to finish saving

Plots are saved in your working directory unless you specify a


path.

Make sure you set your working directory using setwd()

pdf("[Link]",width=5,height=5)
plot(1:10, 1:10, main="test plot")
dev.off()
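
A small sketch of the full save-and-check cycle follows. The file name test_plot.pdf and the use of a temporary folder are assumptions for illustration; replace them with your own path.

```r
out_file <- file.path(tempdir(), "test_plot.pdf")   # hypothetical output path

pdf(out_file, width = 5, height = 5)   # open the file; size is in inches
plot(1:10, 1:10, main = "test plot")   # everything drawn now goes to the file
invisible(dev.off())                   # close the file; nothing is saved until this runs

print(file.exists(out_file))           # confirm the file was written
```

Forgetting dev.off() is a common reason for an empty or unreadable file.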

91 / 110
Visualization Tips
Different types of data call for different visualization
strategies

Use shape, color, and text overlays to display more


information in one plot

Add lines to communicate/highlight target information

92 / 110
What car should I buy?
Colors to mark Engine types (Red: V-shaped, Blue: Straight)

N <- nrow(mtcars)
mycolor <- rep("tomato",N) ## repeat "tomato" N number of times
mycolor[ mtcars$vs == 1 ] <- "steelblue" ## for Straight engines, replace "tomato" with "steelblue

Shapes to mark Transmission (Circle: Auto, Triangle: Manual)

myshape <- rep(16,N) ## repeat 16 N number of times


myshape[ mtcars$am == 1 ] <- 2

93 / 110
Redefine variables for better interpretation

x <- mtcars$hp/mtcars$wt ## Horse power to Weight ratio


y <- mtcars$mpg ## Miles per gallon (Gas Efficiency)

Median values for x and y

med.x <- median(x)


med.y <- median(y)

94 / 110
Now visualize data to help make your
decision
plot(x, y, xlab="Horse Power/Weight", ylab="Gas Efficiency",
col=mycolor,
pch=myshape)
abline(v=med.x, col="gray", lwd=2, lty=2)
abline(h=med.y, col="gray", lwd=2, lty=2)
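
The two gray median lines split the plot into four quadrants. As a sketch of how to turn the picture into a shortlist (the names good and candidates are illustrative), the cars in the upper-right quadrant can be listed directly:

```r
x <- mtcars$hp / mtcars$wt   # horse power to weight ratio
y <- mtcars$mpg              # miles per gallon

# TRUE for cars above the median on both measures (upper-right quadrant)
good <- x > median(x) & y > median(y)

candidates <- rownames(mtcars)[good]
print(candidates)
```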

95 / 110
Demo: Cigarette Prices &
Consumption
Define the question:

Is cigarette consumption lower in states with higher cigarette


prices?

cig_df <- read_csv("[Link]

## Rows: 1380 Columns: 10


## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): rownames, state, year, price, pop, pop16, cpi, ndi, sales, pimin
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

96 / 110
Demo: Cigarette Prices & Consumption
Variables:

price: price per pack of cigarettes
cpi: consumer price index (1983=100)
ndi: per capita disposable income
sales: cigarette sales in packs per capita
pimin: minimum price in adjoining states per pack of cigarettes

97 / 110
Preprocessing Data for Better Interpretation
cig_df <- cig_df %>%
  mutate(year = year + 1900) %>% ## year with four digits
  mutate(adjusted_price = price/cpi) %>% ## adjust for inflation
  mutate(adjusted_pimin = pimin/cpi) ## adjust for inflation

98 / 110
Questions: Cigarette Prices Over Time
Did the price of cigarettes (adjusting for inflation) increase between 1963 and 1992?

Are cigarette sales lower in states where the price exceeds the minimum price in neighboring states?

Extra challenge: In which state was the absolute increase most dramatic? (compare states 5, 25, and 50)

99 / 110
Did the price of cigarettes (adjusting for inflation) increase over time?

plot(cig_df$year, cig_df$adjusted_price, xlab="Year", ylab="Price (Constant 1983)")
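A scatter plot of every state-year can be hard to read, so one option is to average the price within each year first. The sketch below uses a small made-up data frame (`toy`, with invented numbers) in place of the real data; only the column names `year` and `adjusted_price` follow the slides.

```r
## Sketch: average the adjusted price by year, then plot the trend.
## `toy` is invented for illustration only; it is NOT the real cigarette data.
toy <- data.frame(
  year           = rep(c(1963, 1978, 1992), each = 2),
  adjusted_price = c(0.9, 1.0, 0.8, 0.7, 1.3, 1.5)
)

## One row per year with the mean adjusted price
yearly <- aggregate(adjusted_price ~ year, data = toy, FUN = mean)

plot(yearly$year, yearly$adjusted_price, type = "b", pch = 16,
     xlab = "Year", ylab = "Mean Price (Constant 1983)")
```

With the real data, the same `aggregate()` call could be applied to `cig_df` after the preprocessing step.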

100 / 110
Are cigarette sales lower in states where the price exceeds the minimum price in neighboring states?

cig_df_expensive <- cig_df %>% filter(adjusted_price > adjusted_pimin)
cig_df_cheap <- cig_df %>% filter(adjusted_price < adjusted_pimin)
## Q-Q plot of sales in the two groups (both axes show sales quantiles);
## points above the red 45-degree line mean higher sales in the cheaper group
qqplot(cig_df_expensive$sales, cig_df_cheap$sales, main="Sales Comparison",
       xlab="Sales (More Expensive than Neighbors)", ylab="Sales (Cheaper than Neighbors)")
abline(a=0,b=1,col="red",lwd=2)

101 / 110
In which state was the absolute increase most dramatic? (compare states 5, 25, and 50)

state.5 <- filter(cig_df,state==5)
state.25 <- filter(cig_df,state==25)
state.50 <- filter(cig_df,state==50)

plot(state.5$year, state.5$adjusted_price, col="black", pch=16, xlab="Year", ylab="Price (Constant 1983)",
     ylim=c(0.5,1.5))
points(state.25$year, state.25$adjusted_price, col="tomato", pch=16)
points(state.50$year, state.50$adjusted_price, col="forest green", pch=16)
legend("topleft",legend=c(5,25,50),col=c("black","tomato","forest green"),pch=16)

102 / 110
In which state was the absolute increase most dramatic? (states 5, 25, and 50)

plot(state.5$year, state.5$adjusted_price, col="black", xlab="Year", ylab="Price (Constant 1983)",
     ylim=c(0.5,1.5),type="l")
lines(state.25$year, state.25$adjusted_price, col="tomato")
lines(state.50$year, state.50$adjusted_price, col="forest green")
legend("topleft",legend=c(5,25,50),col=c("black","tomato","forest green"),lty=1)
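The plot answers the question visually; the absolute increase can also be computed as the last price minus the first price in each state. This is a minimal sketch: `abs_increase()` is a hypothetical helper, and `toy` holds invented numbers for two states rather than the real data.

```r
## Sketch: absolute increase = price in the last year minus price in the first year.
## `abs_increase` is a hypothetical helper; `toy` is invented data, NOT the real data.
abs_increase <- function(df) {
  df <- df[order(df$year), ]                          ## sort by year
  df$adjusted_price[nrow(df)] - df$adjusted_price[1]  ## last minus first
}

toy <- data.frame(
  state          = rep(c(5, 25), each = 3),
  year           = rep(c(1963, 1978, 1992), times = 2),
  adjusted_price = c(0.8, 0.7, 1.2, 0.9, 0.8, 1.0)
)

## Apply the helper to each state separately
increase <- sapply(split(toy, toy$state), abs_increase)
increase
names(which.max(increase))   ## state with the largest absolute increase
```

With the real data, `split(cig_df, cig_df$state)` would give one data frame per state in the same way.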

103 / 110
In-class Task: MA Test Score Data, 1997-8
MCAS <- read_csv("[Link]

## Rows: 220 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): municipa, district
## dbl (16): rownames, code, regday, specneed, bilingua, occupday, totday, spc,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mass_df <- as_tibble(MCAS)
## glimpse(mass_df,width=40)

104 / 110
In-class Task: MA Test Score Data, 1997-8
Codebook

Variables of interest:

tchratio: students per teacher
percap: per capita income
totsc8: 8th grade score
avgsalary: average teacher salary

105 / 110
In-class Task: MA Test Score Data, 1997-8
1. Define a question.

2. Understand your data.

3. Sketch one or more hypothetical figures.

4. Convert this sketch to an R figure.

5. Try saving your figure as a PDF.

106 / 110
In-class Task: MA Test Score Data, 1997-8
Do schools with high teacher salaries have high 8th grade scores?

plot(mass_df$avgsalary, mass_df$totsc8,
     xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
     pch=16)

107 / 110
In-class Task: MA Test Score Data, 1997-8
How do teacher salary and 8th grade scores interact with per capita income?

mass_df.1 <- filter(mass_df, percap > median(percap))
mass_df.2 <- filter(mass_df, percap <= median(percap))

plot(mass_df.1$avgsalary, mass_df.1$totsc8, col="steelblue", pch=16, ylim=c(630,750),
     xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score")
points(mass_df.2$avgsalary, mass_df.2$totsc8, col="tomato", pch=16)
legend("bottomright", legend=c("Above Median Income","Below Median Income"), col=c("steelblue","tomato"), pch=16)

108 / 110
par(mfrow=c(1,3))
plot(mass_df$avgsalary, mass_df$totsc8, cex=2,
xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
col="black", pch=16, ylim=c(630,750), main="All")
plot(mass_df.1$avgsalary, mass_df.1$totsc8, cex=2,
xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
col="steelblue", pch=16, ylim=c(630,750), main="Above Median Income")
plot(mass_df.2$avgsalary, mass_df.2$totsc8, cex=2,
xlab="Avg. Teacher Salary", ylab="Avg. 8th Grade Score",
col="tomato", pch=16, ylim=c(630,750), main="Below Median Income")
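After `par(mfrow=c(1,3))`, every later plot is also drawn in a 1-by-3 grid until the setting is changed back. One common pattern, sketched here with the built-in `mtcars` data (the name `old_par` is hypothetical), is to store the old settings and restore them afterwards:

```r
## Sketch: draw three panels side by side, then restore the previous layout.
old_par <- par(mfrow = c(1, 3))   ## par() returns the settings it replaced

auto   <- mtcars$am == 0          ## automatic transmission
manual <- mtcars$am == 1          ## manual transmission

plot(mtcars$wt, mtcars$mpg, pch = 16, main = "All",
     xlab = "Weight", ylab = "Miles per gallon")
plot(mtcars$wt[auto], mtcars$mpg[auto], pch = 16, col = "steelblue",
     main = "Automatic", xlab = "Weight", ylab = "Miles per gallon")
plot(mtcars$wt[manual], mtcars$mpg[manual], pch = 16, col = "tomato",
     main = "Manual", xlab = "Weight", ylab = "Miles per gallon")

par(old_par)                      ## back to one plot per page
```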

109 / 110
Quick Tip: Quitting R
When you exit R, it will ask if you want to save your workspace. Choose NO.

Start with a fresh workspace and load or create objects as you need them.

110 / 110
