Data Visualization with ggplot2 Guide

The document discusses data visualization using the ggplot2 package in R. It covers: 1) The importance of data visualization and different types of visualization tools. 2) An overview of the grammar of graphics and how ggplot2 implements this to provide a flexible system for constructing plots. 3) The basics of creating plots with ggplot2, including specifying data, aesthetic mappings of variables to visual properties, and adding geometric objects like points and lines.

Data Visualization using ggplot2

Sayantan Banerjee

What we will learn in this session


Why visualizing data is important for any analysis
Learn how to visualize data using the ggplot2 package
Different types of data visualization tools

Data Visualization
Creating visualizations or graphical representations of data is a key step in being able to communicate information
and findings to others.
But poorly designed or misleading visualizations can cause harm.
We therefore need to produce accurate and clear visualizations.

Grammar of graphics

ggplot2

Grammar of Graphics: ggplot2


There are several systems for graphics in R
One of the most elegant systems is ggplot2
It implements the ‘grammar of graphics’
Very flexible and versatile
Helps us to construct graphical figures out of different visual elements.
This grammar opens up a conversation about parts of a plot: circles, lines, arrows, and words that are combined into
a diagram for visualizing data.
Helps us describe the various components of a plot

Grammar of Graphics
Components of a plot include

the data!
geometric objects (dots, circles, lines, etc.) appearing on the plot
a set of mappings from variables in the data to the aesthetics (appearance) of the geometric objects
statistical transformations used to calculate the data values used in the plot
position adjustments for locating each geometric object on the plot
scales (e.g., range of values) for each aesthetic mapping used
coordinate system used to organize the geometric objects
the facets or groups of data shown in different plots
These components are further organized into layers, where each layer has a single geometric object, statistical
transformation, and position adjustment.
Following this grammar, you can think of each plot as a set of layers of images, where each image’s appearance is
based on some aspect of the data set.
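
As a rough sketch of how these components line up in code, the plot below spells out each one explicitly instead of relying on defaults. It uses the mpg data that ships with ggplot2. The choice of variables (displ, hwy, drv) is only an assumption for illustration, and layer() is the lower-level function behind the geom_*() shortcuts.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # data + aesthetic mappings
  layer(
    geom = "point",         # geometric object
    stat = "identity",      # statistical transformation (none: use values as-is)
    position = "identity"   # position adjustment (none)
  ) +
  scale_x_continuous() +    # scales for the x and y aesthetics
  scale_y_continuous() +
  coord_cartesian() +       # coordinate system
  facet_wrap(~ drv)         # facets: one panel per drive type

p
```

In everyday use, geom_point() would replace the layer() call, and the scale and coordinate lines could be dropped because they are the defaults.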

Pre-requisites
You may install ggplot2 separately, but it is better to install the larger package ‘tidyverse’

We start with installing and loading the tidyverse package first

Install and load tidyverse


# install.packages("tidyverse")  # run once, if the package is not yet installed
library(tidyverse)

## -- Attaching packages ------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.0     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   0.8.5
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts --------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Basics of ggplot2: Example


Let us first load a dataset to explore different visualizations. Consider the mpg data, which contains observations collected by
the US Environmental Protection Agency on 38 models of car.

The mpg data frame has several variables, including

displ : a car’s engine size, in litres.

hwy : a car’s fuel efficiency on the highway, in miles per gallon (mpg).

A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same
distance.

mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l~ f        18    29 p     comp~
##  2 audi         a4         1.8  1999     4 manual~ f        21    29 p     comp~
##  3 audi         a4         2    2008     4 manual~ f        20    31 p     comp~
##  4 audi         a4         2    2008     4 auto(a~ f        21    30 p     comp~
##  5 audi         a4         2.8  1999     6 auto(l~ f        16    26 p     comp~
##  6 audi         a4         2.8  1999     6 manual~ f        18    26 p     comp~
##  7 audi         a4         3.1  2008     6 auto(a~ f        18    27 p     comp~
##  8 audi         a4 quat~   1.8  1999     4 manual~ 4        18    26 p     comp~
##  9 audi         a4 quat~   1.8  1999     4 auto(l~ 4        16    25 p     comp~
## 10 audi         a4 quat~   2    2008     4 manual~ 4        20    28 p     comp~
## # ... with 224 more rows
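
Before plotting, it can help to inspect the data directly. The following is a minimal sketch using base R functions only; it assumes nothing beyond the mpg data frame that ships with ggplot2.

```r
library(ggplot2)  # mpg ships with ggplot2

dim(mpg)                           # number of rows and columns
names(mpg)                         # variable names
length(unique(mpg$model))          # number of distinct car models
summary(mpg[, c("displ", "hwy")])  # quick numeric summary of engine size and highway mileage
```

Running ?mpg opens the help page describing every variable.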

Basics of ggplot2
For a basic plot, you need three primary steps

Create a blank canvas for your plot, using the ggplot() call
Specify aesthetic mappings, which define how variables map to visual properties of the plot.
Add layers of geometric objects

# create the blank canvas
ggplot(mpg)

# map the variables of interest to the x and y axes
ggplot(mpg, aes(x = displ, y = hwy))

# add a geom layer to draw the data as points
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Note: We added the geom layer using the addition (+) operator. New layers are always added to your
visualization with +.
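
Because + returns a new plot object, the three steps can also be kept in variables and built up incrementally. This is a small sketch of that pattern; the variable names p and p_points are hypothetical.

```r
library(ggplot2)

# Steps 1 and 2: canvas plus aesthetic mappings, stored in a variable
p <- ggplot(mpg, aes(x = displ, y = hwy))

# Step 3: add a geom layer; p itself is left unchanged
p_points <- p + geom_point()

p_points  # printing the object draws the plot
```

Storing the base plot this way makes it easy to try several geoms on the same canvas.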
Aesthetic mappings
An aesthetic is a visual property of the objects in your plot.
Aesthetics include things like the size, the shape, or the color of your points.
You can display a point in different ways by changing the values of its aesthetic properties.
Mappings from variables to aesthetics are specified in the aes() function call.
Each geom layer can have its own aes specifications.

Aesthetic mappings: examples


The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV.
We can add a third variable, like class , to a two dimensional scatterplot by mapping it to an aesthetic.
For example, we can map the colour of the points to the class variable to reveal the class of each car.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))

We could have mapped class to the alpha aesthetic, which controls the transparency of the points, or to the
shape aesthetic, which controls the shape of the points.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))

You can also set the aesthetic properties of your geom manually, by passing them as arguments outside aes().
For example, we can make all of the points in our plot red:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "red")
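
The difference between setting and mapping an aesthetic is easy to get wrong, so here is a hedged side-by-side sketch. The object names p_set and p_mapped are hypothetical. layer_data() is a ggplot2 helper that returns the data a layer actually draws.

```r
library(ggplot2)

# Setting: colour is a fixed argument of the geom, so every point is red
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "red")

# Mapping: inside aes(), "red" is treated as data (a one-level variable),
# so ggplot2 assigns a colour from its default palette and draws a legend
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "red"))

unique(layer_data(p_set)$colour)     # the colour actually drawn
unique(layer_data(p_mapped)$colour)  # a palette colour, not "red"
```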

Geometric Objects
ggplot2 supports different types of geometric objects, including:

geom_point : drawing individual points, like a scatter plot
geom_line : drawing lines
geom_smooth : drawing smoothed lines, like moving averages
geom_bar : drawing bars
geom_histogram : drawing binned values, like a histogram
geom_polygon : drawing arbitrary shapes
geom_map : drawing polygons in the shape of a map! (cool feature indeed)
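
geom_line is not used in the examples of this session, so here is a minimal sketch of it. It assumes the economics data frame, a monthly US time series that also ships with ggplot2, because mpg has no natural ordering to connect with a line.

```r
library(ggplot2)

# economics: monthly US economic time series bundled with ggplot2
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_line
```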

Geometric Objects: examples


ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


ggplot(data = mpg, aes(x = class)) +
geom_bar()

ggplot(data = mpg, aes(x = hwy)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The aesthetics for each geom can be different, so you could show several geoms on the same plot, each with its own
colors, styles, etc.
It is also possible to give each geom a different data argument, so that you can show multiple data sets in the same
plot.
If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer.
It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to
display different aesthetics in different layers.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue") +
  geom_smooth(color = "red")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can use the same idea to specify different data for each layer.
Let us say that we shall display the smooth line for just a subset of the mpg dataset, the subcompact cars.
The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'


Statistical transformations
Let us revisit the default bar chart we had shown before.

ggplot(data = mpg, aes(x = class)) +
  geom_bar()

Notice that the y axis was defined for us as the count of cars in each class.
This count isn’t part of the data set (it’s not a column in mpg ), but is instead a statistical transformation that the
geom_bar automatically applies to the data.
In particular, it applies the stat_count transformation.
You might want to override the default mapping from transformed variables to aesthetics. For example, you might
want to display a bar chart of proportion, rather than count.
ggplot2 provides over 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g.
?stat_bin.
To see a complete list of stats, try the ggplot2 cheatsheet.

# bar chart of proportions; after_stat(prop) replaces the older stat(prop) notation (ggplot2 >= 3.3.0)
ggplot(data = mpg, aes(x = class)) +
  geom_bar(mapping = aes(y = after_stat(prop), group = 1))
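
To make the link between geom_bar and stat_count concrete, the sketch below builds the same bar chart both ways and compares the computed counts with a plain tabulation. The object names are hypothetical, and layer_data() is used only to inspect what each layer computed.

```r
library(ggplot2)

p_geom <- ggplot(mpg) + geom_bar(aes(x = class))
p_stat <- ggplot(mpg) + stat_count(aes(x = class))

# Both layers compute the same per-class counts
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count

# These should agree with a direct tabulation of the class column
table(mpg$class)
```

Every geom has a default stat and every stat has a default geom, which is why the two calls can generally be used interchangeably here.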

Position adjustments
Each geom also has a default position adjustment which specifies a set of “rules” as to how different components
should be positioned relative to each other.
This position adjustment becomes noticeable in a geom_bar if you map a second variable to the fill aesthetic.

# bar chart of class, colored by drive (front, rear, 4-wheel)


ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar()
By default, geom_bar uses a position adjustment of "stack", which makes each rectangle's height proportional to
its value and stacks the rectangles on top of each other.
We can use the position argument to specify what position adjustment rules to follow.

# position = "dodge": values next to each other
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# position = "fill": percentage/proportions chart
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  ylab("proportion")
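
One way to see what position = "fill" does is to inspect the rectangles it draws. The sketch below is only an illustration with hypothetical variable names. It sums the heights of the stacked segments within each class, which should come to 1 for every bar.

```r
library(ggplot2)

p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

bars <- layer_data(p_fill)               # one row per drawn rectangle
segment_height <- bars$ymax - bars$ymin  # height of each stacked segment
total_height <- tapply(segment_height, as.integer(bars$x), sum)

total_height                             # total bar height for each class
```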
Coordinate systems
The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to
determine the location of each point.

There are a number of other coordinate systems that are occasionally helpful.

coord_flip() : switches the x and y axes. This is useful (for example) if you want horizontal boxplots. It is also useful
for long labels, which are hard to fit on the x-axis without overlapping.
coord_fixed() : a Cartesian system with a fixed ratio between the lengths of one unit on the y and x axes (the default ratio is 1)
coord_polar() : a plot using polar coordinates
coord_quickmap() : a coordinate system that approximates a good aspect ratio for maps

# flip x and y axis with coord_flip
ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()
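
The same base plot can be redrawn in different coordinate systems without touching the data or the geom. A minimal sketch, assuming a bar chart of class with full-width bars (the name bar and the styling choices are illustrative):

```r
library(ggplot2)

bar <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar(width = 1, show.legend = FALSE)

bar + coord_flip()             # horizontal bars

p_polar <- bar + coord_polar() # the same bars drawn in polar coordinates
p_polar
```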

Labels
We add labels with the labs() function. This example adds a plot title:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")
We can also use labs() to replace the axis and legend titles.
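
A short sketch of replacing the axis and legend titles is given here. The label texts are made up for illustration. The argument names x, y and colour refer to the aesthetics used in the plot.

```r
library(ggplot2)

p_labelled <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  labs(
    x = "Engine displacement (litres)",   # x axis title
    y = "Highway fuel efficiency (mpg)",  # y axis title
    colour = "Car type"                   # legend title for the colour aesthetic
  )

p_labelled
```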

Scales
Scales control the mapping from data values to things that you can perceive. ggplot2 automatically adds scales for you. For
example, when you type
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))

ggplot2 automatically adds default scales behind the scenes:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

Scales: Axis ticks and legend keys


There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend:
breaks and labels .

breaks controls the position of the ticks, or the values associated with the keys.

labels controls the text label associated with each tick/key.

The most common use of breaks is to override the default choice:

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))
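
The labels argument can be used together with breaks. The sketch below is an illustrative assumption rather than part of the original session. It attaches a unit to each tick label, with one label supplied per break.

```r
library(ggplot2)

tick_values <- seq(15, 40, by = 5)

p_ticks <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(
    breaks = tick_values,
    labels = paste(tick_values, "mpg")  # one label per break, in the same order
  )

p_ticks
```

Setting labels = NULL instead suppresses the tick labels altogether.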

Scales: Legend layout


We will most often use breaks and labels to tweak the axes.
While they both also work for legends, there are a few other techniques we are more likely to use.
To control the overall position of the legend, we need to use a theme() setting.
The theme setting legend.position controls where the legend is drawn:

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

base + theme(legend.position = "left") # "right" is the default
# "top" or "bottom" may be used as well
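
Two further legend layouts are sketched below as an illustration. Both rely on the legend.position setting of theme() in ggplot2. The single-row layout and the enlarged keys are assumptions chosen for the example, set through guides() and guide_legend().

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

# Legend under the plot, laid out in a single row with larger keys
p_bottom <- base +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))

# No legend at all
p_none <- base + theme(legend.position = "none")

p_bottom
```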

Facets
One way to add additional variables is with aesthetics.
Another way, particularly useful for categorical variables, is to split your plot into facets, subplots that each display
one subset of the data.
To facet your plot by a single variable, use facet_wrap() .
The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name
(here “formula” is the name of a data structure in R, not a synonym for “equation”).
The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
To facet your plot on the combination of two variables, add facet_grid() to your plot call.
The first argument of facet_grid() is also a formula. This time the formula should contain two variable names
separated by a ~ .

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
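
If only one dimension of the grid is wanted, a dot can stand in for the missing variable in the formula. A minimal sketch, assuming we want one column of panels per number of cylinders:

```r
library(ggplot2)

p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)  # a single row, one column of panels per value of cyl

p_cols
```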
Saving plots
Saving your plots is important, especially if you are going to use them in your analysis reports.
For plots generated using ggplot2, we can use ggsave() to save plots.
By default, ggsave() saves the most recent plot to disk and infers the file format from the file name extension.
See ?ggsave for finer control, such as the plot size.

ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("[Link]")

## Saving 7 x 5 in image
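
Relying on "the most recent plot" can be fragile in longer scripts, so the following sketch passes the plot object and the size explicitly. The file name mpg-scatter.pdf and the temporary directory are hypothetical choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output location; the .pdf extension selects the PDF device
out_file <- file.path(tempdir(), "mpg-scatter.pdf")

ggsave(out_file, plot = p, width = 7, height = 5, units = "in")

file.exists(out_file)  # check that the file was written
```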
