R Programming: Data Visualization with ggplot2

This document covers the basics of data visualization using the ggplot2 library in R, including how to create various types of plots and the syntax for ggplot2 functions. It discusses the importance of aesthetic mappings, geometric objects, and statistical transformations in visualizing data effectively. Additionally, it introduces data manipulation techniques using dplyr for filtering, arranging, selecting, and summarizing data.

MODULE 4

R PROGRAMMING BASICS FOR DATA ANALYTICS


DATA IMPORT AND EXPORT
VISUALIZATION, TRANSFORMATION, EXPLORATORY ANALYSIS, TIDYING, MODELLING

1
Data visualization with ggplot2
The simple graph has brought more information to the data analyst’s mind than
any other device.
ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
A data frame is a rectangular collection of variables (in the columns) and
observations (in the rows).

2
Installing and loading a Package
install.packages("ggplot2")
library(ggplot2)

3
The mpg Data Frame

4
Creating a ggplot
displ, a car’s engine size, in liters.
hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg).
To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other
words, cars with big engines use more fuel.
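The scatterplot described above can be sketched as follows, assuming ggplot2 is installed (the mpg data frame ships with the package). The object name p_basic is only for illustration.

```r
library(ggplot2)

# displ goes on the x-axis and hwy on the y-axis; each row of mpg becomes one point.
p_basic <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p_basic
```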

5
Continued..
ggplot(data = mpg) creates an empty graph.
The function geom_point() adds a layer of points to your plot, which creates a scatterplot.
ggplot2 comes with many geom functions that each add a different type of layer to a plot.
Each geom function in ggplot2 takes a mapping argument.
This defines how variables in your dataset are mapped to visual properties.
The mapping argument is always paired with aes(), and the x and y arguments of aes() specify
which variables to map to the x- and y-axes.
ggplot2 looks for the mapped variable in the data argument, in this case, mpg.

6
A Graphing Template
Syntax of ggplot2:
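A minimal sketch of the general pattern, consistent with the plots in this module. The angle-bracket placeholders are not valid R, so they appear only in comments; the object name template_demo is illustrative.

```r
library(ggplot2)

# General pattern (placeholders, not runnable as written):
#   ggplot(data = <DATA>) +
#     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# One concrete instance: DATA = mpg, GEOM_FUNCTION = geom_point,
# MAPPINGS = x = displ, y = hwy.
template_demo <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
template_demo
```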

7
Exercises
1. Run ggplot(data = mpg). What do you see?

2. How many rows are in mtcars? How many columns?

3. What does the drv variable describe? Read the help for ?mpg to find out.

4. Make a scatterplot of hwy versus cyl.

5. What happens if you make a scatterplot of class versus drv? Why is the plot not useful?

8
Aesthetic Mappings
In the following plot, one group of points (highlighted in red) seems to fall outside of the linear
trend.
These cars have a higher mileage than you might expect.

9
Continued..
Let’s hypothesize that the cars are hybrids.
One way to test this hypothesis is to look at the class value for each car. The class variable of the
mpg dataset classifies cars into groups such as compact, midsize, and SUV.
If the outlying points are hybrids, they should be classified as compact cars or, perhaps,
subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs
became popular).

10
Continued..
You can add a third variable, like class, to a two-dimensional scatterplot by mapping it to an
aesthetic.
An aesthetic is a visual property of the objects in your plot.
Aesthetics include things like the size, the shape, or the color of your points. You can display a
point in different ways by
changing the values of its aesthetic properties.

11
Continued..
You can map the color of your points to the class variable to reveal the class of each car:
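A minimal sketch of this mapping, assuming ggplot2 is installed. The object name p_color is illustrative.

```r
library(ggplot2)

# color = class sits inside aes(), so ggplot2 assigns one color per class
# and adds a legend automatically.
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
p_color
```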

12
Continued..
To map an aesthetic to a variable, associate the name of the aesthetic to the
name of the variable inside aes().
ggplot2 will automatically assign a unique level of the aesthetic to each unique
value of the variable, a process known as scaling.
ggplot2 will also add a legend that explains which levels correspond to which
values.

13
Continued..
We can map class to the size aesthetic:

14
15
You can also set the aesthetic properties of your geom manually.
For example, we can make all of the points in our plot blue:
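A minimal sketch, assuming ggplot2 is installed. The key point is that the manual setting goes outside aes(), as an argument of the geom function; p_blue is an illustrative name.

```r
library(ggplot2)

# color = "blue" is set outside aes(), so it is a fixed appearance,
# not a mapping to a variable, and no legend is drawn for it.
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
p_blue
```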

16
Facets
The facet approach partitions a plot into a matrix of panels.
Each panel shows a different subset of the data.

There are two main functions for faceting:
◦ facet_grid()
◦ facet_wrap()

17
facet_wrap()
The first argument of facet_wrap() should be a formula, which you create with ~ followed by a
variable name.
The variable that you pass to facet_wrap() should be discrete.
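A minimal sketch of faceting by one discrete variable, assuming ggplot2 is installed. The choice of class and nrow = 2 is illustrative.

```r
library(ggplot2)

# ~ class is the formula; nrow = 2 arranges the panels in two rows.
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
p_wrap
```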

18
facet_grid()
To facet the plot as the combination of two variables we use facet_grid().
The first argument of facet_grid() is also a formula.
This time the formula should contain two variable names separated by a ~.
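A minimal sketch of a two-variable facet, assuming ggplot2 is installed. The variables drv (rows) and cyl (columns) are chosen only for illustration.

```r
library(ggplot2)

# drv ~ cyl: one row of panels per drv value, one column per cyl value.
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
p_grid
```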

19
Geometric Objects
A geom is the geometrical object that a plot uses to represent data.
People often describe plots by the type of geom that the plot uses.
For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms,
and so on.
Scatterplots use the point geom.
You can use different geoms to plot the same data.
To change the geom in your plot, change the geom function that you add to ggplot().
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))

20
Not every aesthetic works with every geom: you can set the shape of a point, but you can’t set the
“shape” of a line. You can, however, set the linetype of a line.
geom_smooth() will draw a different line, with a different linetype, for each unique value of
the variable that you map to linetype:

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

21
ggplot2 provides over 30 geoms, and extension packages provide even more.
Many geoms, like geom_smooth(), use a single geometric object to display multiple rows of data.
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )

22
Continued..
To display multiple geoms in the same plot, add multiple geom functions to ggplot():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))

# Same plot: mappings passed to ggplot() are inherited by every geom.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

23
Continued..
It is possible to display different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()

24
Statistical Transformations
geom_bar() draws bar charts.
The diamonds dataset comes with ggplot2 and contains information about ~54,000 diamonds,
including the price, carat, color, clarity, and cut of each diamond.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))

25
Continued..
Scatterplots plot the raw values of the dataset.
Other graphs, like bar charts, calculate new values to plot:
• Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the
number of points that fall in each bin.
• Smoothers fit a model to your data and then plot predictions from the model.
• Boxplots compute a robust summary of the distribution and display a specially formatted box.

26
Continued..
The algorithm used to calculate new values for a graph is called a stat, short for statistical
transformation.

27
Continued..
The help page ?geom_bar shows that the default value for stat is “count,” which means that
geom_bar() uses stat_count(). The bar chart can therefore also be drawn with stat_count():

ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

28
stat_summary()
Summarizes the y values for each unique x value.
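ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    # fun.min, fun.max, and fun replace fun.ymin, fun.ymax, and fun.y,
    # which are deprecated since ggplot2 3.3.0.
    fun.min = min, fun.max = max, fun = median
  )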

29
Position Adjustments
You can color a bar chart using either the color aesthetic (bar outlines) or, more usefully, fill (bar interiors).
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))

30
Continued..
If you map the fill aesthetic to another variable, like clarity, the bars are automatically stacked.
Each colored rectangle represents a combination of cut and clarity.
The stacking is performed automatically by the position adjustment specified by the position
argument.
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity))
If you don’t want stacked bars, position takes three other options:
◦ "identity"
◦ "dodge"
◦ "fill"

31
identity
position = "identity" will place each object exactly where it falls in the context of the graph. This
is not very useful for bars, because it overlaps them.
To see that overlapping, either make the bars slightly transparent by setting alpha to
a small value, or make them completely transparent by setting fill = NA.
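Both approaches can be sketched as follows, assuming ggplot2 is installed. The alpha value 1/5 and the object names are illustrative choices.

```r
library(ggplot2)

# Slightly transparent bars: overlapping clarity bars show through each other.
p_alpha <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")

# Completely transparent bars: only the colored outlines remain.
p_outline <- ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
  geom_bar(fill = NA, position = "identity")

p_alpha
p_outline
```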

32
fill
position = "fill" works like stacking, but makes each set of stacked bars the same height.
This makes it easier to compare proportions across groups.
ggplot(data = diamonds) +
geom_bar( mapping = aes(x = cut, fill = clarity), position = "fill" )

33
dodge
position = "dodge" places overlapping objects directly beside one another.
This makes it easier to compare individual values.
ggplot(data = diamonds) +
geom_bar( mapping = aes(x = cut, fill = clarity), position = "dodge" )

34
geom_jitter()
The values of hwy and displ are rounded so the points appear on a grid and many points overlap
each other. This problem is known as overplotting.
This arrangement makes it hard to see where the mass of the data is.
You can avoid this gridding by setting the position adjustment to “jitter.”
position = "jitter" adds a small amount of random noise to each point.
This spreads the points out because no two points are likely to receive the same amount of
random noise.
Adding randomness makes your graph less accurate at small scales, but more revealing at large scales.
geom_jitter() is shorthand for geom_point(position = "jitter").

35
ggplot(data = mpg) +
geom_point( mapping = aes(x = displ, y = hwy), position = "jitter" )

36
Coordinate Systems
Coordinate systems are probably the most complicated part of ggplot2.
The default coordinate system is the Cartesian coordinate system, where the x and y positions act
independently to find the location of each point.

37
coord_flip()
Switches the x- and y-axes.
This is useful if you want horizontal boxplots.
It’s also useful for long labels, which are hard to fit on the x-axis without overlapping.

38
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()

39
coord_quickmap()
Sets the aspect ratio correctly for maps.
This is very important if you’re plotting spatial data with ggplot2.

40
coord_polar()
Uses polar coordinates.
Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()

41
Data Transformation with dplyr

42
Continued..
View(flights) opens the whole dataset in a spreadsheet-style viewer.

43
dplyr Basics
filter() - Pick observations by their values.
arrange() - Reorder the rows.
select() - Pick variables by their names.
mutate() - Create new variables with functions of existing variables.
summarize() - Collapse many values down to a single summary.

44
filter()
Filter Rows
Allows you to subset observations based on their values.
dplyr executes the filtering operation and returns a new data frame.
dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use
the assignment operator, <-
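A minimal sketch of filter(), assuming dplyr is installed. flights_demo is a small made-up stand-in for the flights data frame, so the values are illustrative only.

```r
library(dplyr)

# flights_demo is a hypothetical stand-in for flights.
flights_demo <- tibble(
  month     = c(1, 1, 11, 12),
  day       = c(1, 2, 15, 25),
  dep_delay = c(2, 7, -3, 45)
)

# filter() returns a new data frame; save it with <- to keep it.
jan1 <- filter(flights_demo, month == 1, day == 1)
jan1
```

The input is left untouched: flights_demo still has all four rows after the call.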

45
46
Comparisons
To use filtering effectively, select the observations using the comparison operators.
R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
When testing for equality, use == (not =).

47
Computers use finite-precision arithmetic, so numbers that are mathematically equal may differ by a tiny
amount when compared with ==.
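A short illustration of the effect, assuming dplyr is installed. near() is dplyr's tolerance-based comparison; the variable names are illustrative.

```r
library(dplyr)

# Exact comparison fails because of floating-point rounding.
exact_sqrt <- sqrt(2)^2 == 2
exact_frac <- 1 / 49 * 49 == 1

# near() compares with a small tolerance instead of exact equality.
near_sqrt <- near(sqrt(2)^2, 2)
near_frac <- near(1 / 49 * 49, 1)

c(exact_sqrt, exact_frac, near_sqrt, near_frac)
```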

48
Logical Operators

49
The following code finds all flights that departed in November or December:

A useful shorthand is x %in% y. This will select every row where x is one of the values
in y:
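The logical operators and the %in% shorthand can be sketched as follows, assuming dplyr is installed. flights_demo is a small made-up stand-in for the flights data frame.

```r
library(dplyr)

# flights_demo is a hypothetical stand-in for flights.
flights_demo <- tibble(
  month     = c(1, 6, 11, 12),
  dep_delay = c(2, 7, -3, 45)
)

# & is "and", | is "or", ! is "not".
nov_dec    <- filter(flights_demo, month == 11 | month == 12)
nov_dec_in <- filter(flights_demo, month %in% c(11, 12))   # same rows, shorter
not_late   <- filter(flights_demo, !(dep_delay > 5))       # delays of at most 5

nov_dec_in
```

Note that month == 11 | 12 would not work here: each side of | must be a complete comparison.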

50
Missing Values
One important feature of R that can make comparison tricky is missing values, or NAs
(“not availables”).
NA represents an unknown value, so missing values are “contagious”: almost any operation involving
an unknown value will also be unknown.
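A few base R expressions illustrate the point, including two of the exceptions behind the word “almost”. The variable names are illustrative.

```r
# Operations on an unknown value are themselves unknown.
cmp  <- NA > 5     # NA
add  <- NA + 10    # NA
same <- NA == NA   # NA: two unknown values may or may not be equal

# Exceptions: the result is the same whatever the unknown value is.
pow  <- NA^0       # 1
either <- NA | TRUE  # TRUE

list(cmp, add, same, pow, either)
```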

51
If you want to determine if a value is missing, use is.na():

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA
values.
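A minimal sketch of both points, assuming dplyr is installed. The data frame df is made up for illustration.

```r
library(dplyr)

df <- tibble(x = c(1, NA, 3))

is.na(df$x)                                # FALSE TRUE FALSE

# The NA row is dropped because x > 1 evaluates to NA, not TRUE.
only_true <- filter(df, x > 1)

# Ask for missing values explicitly to keep them.
with_na <- filter(df, is.na(x) | x > 1)
```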

52
arrange()
Change the order of rows.
It takes a data frame and a set of column names to order by.

53
Use desc() to reorder by a column in descending order:

Missing values are always sorted at the end:
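A minimal sketch of arrange() and desc(), assuming dplyr is installed. The data frame df is made up for illustration.

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA))

ascending  <- arrange(df, x)        # 2, 5, NA
descending <- arrange(df, desc(x))  # 5, 2, NA (NA still last)
```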

54
select()
Select Columns.
Use it to narrow the data set.
It lets you rapidly zoom in on a useful subset using operations based on the names of
the variables.
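Three common ways of naming columns in select() can be sketched as follows, assuming dplyr is installed. flights_demo is a small made-up stand-in for the flights data frame.

```r
library(dplyr)

# flights_demo is a hypothetical stand-in for flights.
flights_demo <- tibble(
  year = 2013, month = c(1, 2), day = c(1, 15),
  dep_delay = c(2, -4), arr_delay = c(11, -9)
)

by_name  <- select(flights_demo, year, month, day)  # columns listed by name
by_range <- select(flights_demo, year:day)          # all columns from year to day
dropped  <- select(flights_demo, -(year:day))       # everything except that range
```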

55
56
57
58
Helper functions of select
• starts_with("abc") matches names that begin with “abc”.
• ends_with("xyz") matches names that end with “xyz”.
• contains("ijk") matches names that contain “ijk”.
• matches("(.)\\1") selects variables that match a regular expression.
• num_range("x", 1:3) matches x1, x2, and x3.
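The helpers listed above can be tried on a small example, assuming dplyr is installed. The data frame df and its column names are made up so that each helper matches something.

```r
library(dplyr)

df <- tibble(dep_time = 1, dep_delay = 2, arr_time = 3, arr_delay = 4,
             x1 = 5, x2 = 6, x3 = 7)

starts  <- names(select(df, starts_with("dep")))    # dep_time, dep_delay
ends    <- names(select(df, ends_with("delay")))    # dep_delay, arr_delay
has     <- names(select(df, contains("time")))      # dep_time, arr_time
# "(.)\\1" matches any name with a repeated character, here the "rr" in arr_*.
regex   <- names(select(df, matches("(.)\\1")))     # arr_time, arr_delay
numbers <- names(select(df, num_range("x", 1:3)))   # x1, x2, x3
```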

59
rename()
rename() is a variant of select() that renames variables and keeps all the variables that aren’t explicitly mentioned.
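A minimal sketch, assuming dplyr is installed. The data frame and column names are made up; note that the new name goes on the left of the = sign.

```r
library(dplyr)

df <- tibble(tailnum = c("N1", "N2"), distance = c(100, 200))

# new_name = old_name; distance is kept even though it is not mentioned.
renamed <- rename(df, tail_num = tailnum)
names(renamed)
```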

60
everything()
everything() is a select() helper. It is useful when you want to move a handful of variables to the
start of the data frame.
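A minimal sketch of moving columns to the front, assuming dplyr is installed. The data frame df is made up for illustration.

```r
library(dplyr)

df <- tibble(year = 2013, dep_delay = 5, time_hour = 9, air_time = 120)

# Name the columns to bring forward, then let everything() add the rest.
moved <- select(df, time_hour, air_time, everything())
names(moved)
```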

61
mutate()
Adds new variables to a data frame.
The new columns are functions of existing columns.
mutate() always adds new columns at the end of your dataset.
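A minimal sketch of mutate(), assuming dplyr is installed. flights_sml and its values are made up; gain and speed are illustrative derived columns.

```r
library(dplyr)

# flights_sml is a hypothetical narrow data frame.
flights_sml <- tibble(
  distance  = c(100, 200),
  air_time  = c(40, 50),     # minutes
  dep_delay = c(5, -2),
  arr_delay = c(10, -12)
)

with_new <- mutate(
  flights_sml,
  gain  = arr_delay - dep_delay,      # 5, -10
  speed = distance / air_time * 60    # 150, 240 (distance units per hour)
)
with_new
```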

62
63
transmute():
Use transmute() if you want to keep only the new variables.
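A minimal sketch, assuming dplyr is installed. The data frame is made up for illustration.

```r
library(dplyr)

flights_sml <- tibble(dep_delay = c(5, -2), arr_delay = c(10, -12))

# Only the newly created column survives; dep_delay and arr_delay are dropped.
only_new <- transmute(flights_sml, gain = arr_delay - dep_delay)
names(only_new)
```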

64
Useful Creation Functions
1. Arithmetic operators +, -, *, /, ^
2. Modular arithmetic (%/% and %%)
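Modular arithmetic is handy for breaking integers into pieces. As a sketch in base R, assume a made-up dep_time vector that stores clock times as HHMM numbers:

```r
# Hypothetical departure times in HHMM format (517 means 5:17).
dep_time <- c(517, 1545, 2359)

hour   <- dep_time %/% 100   # integer division: 5, 15, 23
minute <- dep_time %% 100    # remainder: 17, 45, 59
```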

65
Continued..
3. Logs log(), log2(), log10()
4. Offsets
◦ lead()
◦ lag()
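A minimal sketch of the offset functions, assuming dplyr is installed. The dplyr:: prefix is used deliberately, because base R also has a different stats::lag().

```r
library(dplyr)

x <- 1:5

lagged  <- dplyr::lag(x)    # NA 1 2 3 4: each value shifted one step later
leading <- dplyr::lead(x)   # 2 3 4 5 NA: each value shifted one step earlier
diffs   <- x - dplyr::lag(x)  # running differences: NA 1 1 1 1
```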

66
Continued..
5. Cumulative and rolling aggregates.

◦ cumsum()
◦ cumprod()
◦ cummin()
◦ cummax()
◦ cummean()
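The cumulative functions can be compared on one small vector. cumsum(), cumprod(), cummin(), and cummax() are base R; cummean() comes from dplyr, which is assumed to be installed (version 1.0 or later).

```r
library(dplyr)

x <- c(3, 1, 4, 2)

running_sum  <- cumsum(x)    # 3  4  8 10
running_prod <- cumprod(x)   # 3  3 12 24
running_min  <- cummin(x)    # 3  1  1  1
running_max  <- cummax(x)    # 3  3  4  4
running_mean <- cummean(x)   # 3, 2, 8/3, 2.5
```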

67
Continued..
6. Logical comparisons <, <=, >, >=, !=, ==
7. Ranking
◦ min_rank()
◦ min_rank(desc(x)) to give the largest values the smallest ranks
◦ row_number()
◦ dense_rank()
◦ percent_rank()
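The ranking functions differ mainly in how they treat ties. A small sketch, assuming dplyr is installed; the vector y is made up and contains one tie and one NA.

```r
library(dplyr)

y <- c(1, 2, 2, NA, 3, 4)

r_min   <- min_rank(y)         # 1 2 2 NA 4 5: ties share the lowest rank
r_desc  <- min_rank(desc(y))   # 5 3 3 NA 2 1: largest value ranked first
r_row   <- row_number(y)       # 1 2 3 NA 4 5: ties broken by position
r_dense <- dense_rank(y)       # 1 2 2 NA 3 4: no gaps after ties
r_pct   <- percent_rank(y)     # 0 0.25 0.25 NA 0.75 1
```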

68
summarize()
It collapses a data frame to a single row.
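A minimal sketch, assuming dplyr is installed. The data frame is made up; na.rm = TRUE is needed because one delay is missing.

```r
library(dplyr)

df <- tibble(dep_delay = c(10, NA, 20))

# The whole data frame collapses to one row with one summary column.
overall <- summarize(df, delay = mean(dep_delay, na.rm = TRUE))
overall
```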

69
group_by():
This changes the unit of analysis from the complete dataset to individual groups.
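A minimal sketch of a grouped summary, assuming dplyr is installed. The data frame is made up for illustration.

```r
library(dplyr)

df <- tibble(month = c(1, 1, 2), dep_delay = c(10, 20, 30))

# After group_by(), summarize() returns one row per month instead of one overall.
by_month <- group_by(df, month)
monthly  <- summarize(by_month, delay = mean(dep_delay))
monthly
```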

70
Combining Multiple Operations with the Pipe (%>%)
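A minimal sketch of chaining steps with %>%, assuming dplyr is installed. Each step receives the result of the previous one as its first argument, which avoids naming intermediate data frames. flights_demo is a made-up stand-in for the flights data frame.

```r
library(dplyr)

# flights_demo is a hypothetical stand-in for flights.
flights_demo <- tibble(
  dest      = c("IAH", "IAH", "MIA", "ORD", "ORD"),
  arr_delay = c(10, 20, 5, NA, 30)
)

delays <- flights_demo %>%
  group_by(dest) %>%                                    # one group per destination
  summarize(count = n(),
            delay = mean(arr_delay, na.rm = TRUE)) %>%  # one row per group
  filter(count > 1)                                     # drop single-flight destinations
delays
```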

71
Useful Summary Functions
Means, counts, and sums
Measures of location: mean(x), median(x)
Measures of spread: sd(x), IQR(x), mad(x)
Measures of rank: min(x), quantile(x, 0.25), max(x)
Measures of position: first(x), nth(x, 2), last(x)
Counts
◦ sum(!is.na(x)) - To count the number of non-missing values
◦ n_distinct(x) - To count the number of distinct (unique) values
◦ n() - returns the size of the current group.
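Several of these functions can be tried on one small vector, assuming dplyr is installed. The vector x and the grouped data frame are made up; n() only works inside a dplyr verb such as summarize().

```r
library(dplyr)

x <- c(2, 4, 4, NA, 10)

non_missing <- sum(!is.na(x))          # 4
distinct    <- n_distinct(x)           # 4: NA counts as a distinct value by default
middle      <- median(x, na.rm = TRUE) # 4
positions   <- c(first(x), nth(x, 2), last(x))  # 2 4 10

# n() returns the size of the current group.
group_sizes <- tibble(g = c("a", "a", "b")) %>%
  group_by(g) %>%
  summarize(n = n())
```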

72
Exploratory Data Analysis
The goal of EDA is to develop an understanding of your data.
EDA is fundamentally a creative process.
And like most creative processes, the key to asking quality questions is to generate a
large quantity of questions.

73
Variation
Variation is the tendency of the values of a variable to change from measurement to
measurement.
Every variable has its own pattern of variation, which can reveal interesting information.
The best way to understand that pattern is to visualize the distribution of variables’
values.
If you measure any continuous variable twice, you will get two different results.
This is true even if you measure quantities that are constant, like the speed of light, because each
measurement includes a small amount of error.

74
Visualizing Distributions
A variable is categorical if it can only take one of a small set of values.
In R, categorical variables are usually saved as factors or character vectors.
To examine the distribution of a categorical variable, use a bar chart.

75
The height of the bars displays how many observations occurred with each x value.

You can compute these values manually with dplyr::count():
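A minimal sketch of the bar chart and the matching manual count, assuming ggplot2 and dplyr are installed. The cut variable of diamonds is used as the categorical example.

```r
library(ggplot2)
library(dplyr)

# Bar chart: one bar per cut, height = number of diamonds with that cut.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
p_bar

# The same heights, computed as a table.
cut_counts <- count(diamonds, cut)
cut_counts
```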

76
Visualizing Distributions
A variable is continuous if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables.
To examine the distribution of a continuous variable, use a histogram.
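A minimal sketch, assuming ggplot2 is installed. The carat variable of diamonds and the binwidth of 0.5 are illustrative choices.

```r
library(ggplot2)

# binwidth is measured in the units of the x variable (carats here).
p_hist <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
p_hist
```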

77
78
You can compute the histogram counts by hand by combining:
• dplyr::count()
• ggplot2::cut_width()
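A minimal sketch of the manual calculation, assuming ggplot2 and dplyr are installed. The carat variable and the width of 0.5 are illustrative.

```r
library(ggplot2)
library(dplyr)

# cut_width() assigns each carat value to a 0.5-wide interval;
# count() then tallies the diamonds in each interval.
bin_counts <- diamonds %>%
  count(cut_width(carat, 0.5))
bin_counts
```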

79
80
• geom_freqpoly():
• Use it to overlay multiple histograms in the same plot.
• It performs the same calculation as geom_histogram(), but displays the counts with
lines instead of bars.
• It’s much easier to understand overlapping lines than bars.
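A minimal sketch, assuming ggplot2 and dplyr are installed. Restricting the data to carat < 3 and using a binwidth of 0.1 are illustrative choices.

```r
library(ggplot2)
library(dplyr)

# Keep the bulk of the data so the lines are readable.
smaller <- filter(diamonds, carat < 3)

# One frequency polygon per cut, drawn in the same panel.
p_freq <- ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1)
p_freq
```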

81
Typical Values
In both bar charts and histograms, tall bars show the common values of a variable, and shorter
bars show less-common values.
Places that do not have bars reveal values that were not seen in your data.

82
Unusual Values
Outliers are observations that are unusual.
Data points that don’t seem to fit the pattern.
Sometimes outliers are data entry errors.
Other times outliers suggest important new science.
When you have a lot of data, outliers are sometimes difficult to see in a histogram.

83
Sometimes the only evidence of outliers in a histogram is unusually wide limits on the x-axis.

84
• coord_cartesian():
• To make it easy to see the unusual values, we need to zoom in to small values of
the y-axis.
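A minimal sketch of zooming, assuming ggplot2 is installed. The y variable of diamonds and the limit of 50 are illustrative choices.

```r
library(ggplot2)

# coord_cartesian() zooms the view without dropping any data,
# so the bin counts are the same as in the unzoomed histogram.
p_zoom <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
p_zoom
```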

85
Missing Values
If there are unusual values in your dataset we have two options:
1.

86
Missing Values
2. Replace the unusual values with missing values.
The easiest way to do this is to use mutate() to replace the variable with a modified
copy.
Use the ifelse() function to replace unusual values with NA.
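A minimal sketch of this replacement, assuming ggplot2 and dplyr are installed. The thresholds 3 and 20 for the y variable of diamonds are illustrative assumptions about what counts as unusual.

```r
library(ggplot2)  # for the diamonds data
library(dplyr)

# ifelse(test, yes, no): where the test is TRUE use NA, otherwise keep y.
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

sum(is.na(diamonds2$y))   # number of values replaced
```

All rows are kept; only the suspicious measurements become NA.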

87
ggplot2 doesn’t include missing values in the plot, but it does warn that they’ve been removed.

88
89
Covariation
Covariation describes the behavior between variables.
Covariation is the tendency for the values of two or more variables to vary
together in a related way.
The best way to spot covariation is to visualize the relationship between two or
more variables.

90
Covariation - continuous variable
Sometimes the pattern is hard to see because the overall counts differ greatly between groups. To make
comparison easier, display density instead of count on the y-axis. Density is the count standardized
so that the area under each frequency polygon is one.
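A minimal sketch, assuming ggplot2 3.3.0 or later is installed (for after_stat()). Price by cut in diamonds and a binwidth of 500 are illustrative choices.

```r
library(ggplot2)

# after_stat(density) puts the standardized count on the y-axis,
# so groups of very different sizes become comparable.
p_density <- ggplot(data = diamonds,
                    mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
p_density
```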

91
Covariation – Categorical variable
A boxplot is a type of visual shorthand for a distribution of values. Each boxplot consists of:
• A box that stretches from the 25th percentile of the distribution to the 75th percentile, a
distance known as the interquartile range (IQR). In the middle of the box is a line that displays
the median, i.e., the 50th percentile, of the distribution. These three lines give you a sense of
the spread of the distribution and whether or not the distribution is symmetric about the median
or skewed to one side.
• Visual points that display observations that fall more than 1.5 times the IQR from either edge of
the box. These outlying points are unusual, so they are plotted individually.
• A line (or whisker) that extends from each end of the box and goes to the farthest nonoutlier
point in the distribution.
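The 1.5 × IQR rule can be worked through by hand in base R. The vector x is made up, with one deliberately extreme value; this is a sketch of the rule as described above, not of ggplot2's internal code.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q   <- quantile(x, c(0.25, 0.75))   # box edges: 3.25 and 7.75
iqr <- q[[2]] - q[[1]]              # 4.5

lower_fence <- q[[1]] - 1.5 * iqr   # -3.5
upper_fence <- q[[2]] + 1.5 * iqr   # 14.5

# Points beyond the fences are drawn individually.
outliers <- x[x < lower_fence | x > upper_fence]               # 30

# Whiskers reach the farthest points that are still inside the fences.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])      # 1 and 9
```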

92
93
94
reorder()

95
96
97
Two Categorical Variables

To visualize the covariation between categorical variables, you will need to count
the number of observations for each combination. One way to do that is with
geom_count().

98
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

99
Another approach is to compute the count with dplyr:
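A minimal sketch of the dplyr count, assuming ggplot2 and dplyr are installed. The pairing of color and cut is an illustrative choice.

```r
library(ggplot2)  # for the diamonds data
library(dplyr)

# One row per (color, cut) combination, with n = number of diamonds.
combo_counts <- diamonds %>%
  count(color, cut)
combo_counts
```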

100
geom_tile()
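A minimal sketch of a tile plot built from counts, assuming ggplot2 and dplyr are installed. Mapping the count n to fill is an illustrative choice.

```r
library(ggplot2)
library(dplyr)

# Each tile is one (color, cut) combination; darker or lighter fill shows the count.
p_tile <- diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
p_tile
```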

101
Two Continuous Variables
To visualize the covariation between two continuous variables: draw a scatterplot with
geom_point().

102
One way to fix overplotting is to use the alpha aesthetic to add transparency.
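A minimal sketch, assuming ggplot2 is installed. Carat against price in diamonds and an alpha of 1/100 are illustrative choices.

```r
library(ggplot2)

# alpha = 1/100 means roughly 100 overlapping points are needed for a fully opaque spot,
# so dense regions appear darker than sparse ones.
p_alpha_pts <- ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
p_alpha_pts
```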

103