
Mod4 New Rashid

Module 4 covers R programming basics for data analytics, focusing on data visualization with ggplot2, data transformation with dplyr, and exploratory data analysis. Key topics include creating graphs, using aesthetic mappings, and employing various dplyr functions such as filter(), arrange(), and summarize() for data manipulation. The module emphasizes the importance of understanding data patterns through visualization and the handling of outliers in datasets.

Uploaded by

alansaniljacob95

MODULE 4

R PROGRAMMING BASICS FOR DATA ANALYTICS

SYLLABUS:
 R programming: basics
 Data visualization with ggplot2
 Data transformation with dplyr
 Exploratory data analysis in R
 Tidy data with tidyr
 Modelling

IMPORTANT QUESTIONS:

 Define ggplot2. What are the features provided by ggplot2? What are the problems faced while
using ggplot2 and how can we overcome them?
 Write the R code to import a .csv file, examine its contents and generate its descriptive statistics.
 With examples, illustrate how these R functions help in data analysis.
o filter()
o arrange()
o summarize()
o mutate()
o select()

SHADHA K|DEPT OF IT|MESCE 1


DATA VISUALIZATION WITH GGPLOT2

“The simple graph has brought more information to the data analyst’s mind than any other device.” (John Tukey)
ggplot2 implements the grammar of graphics. ggplot2 is an R package for describing and building graphs.
A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).

Installing and loading a Package


To install a package, use the command install.packages(); to load it, use library().
install.packages("ggplot2")
library(ggplot2)

The mpg Data Frame

The mpg data frame ships with ggplot2. ‘displ’ is a car’s engine size, in liters. ‘hwy’ is a car’s fuel
efficiency on the highway, in miles per gallon (mpg). To plot mpg with displ on the x-axis and hwy on the
y-axis, we combine several functions from ggplot2.



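 ggplot() creates a coordinate system that you can add layers to. Its first argument is the dataset
to use in the graph.

 ggplot(data = mpg) on its own creates an empty graph.
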
 geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with
many geom functions that each add a different type of layer to a plot.
 Each geom function in ggplot2 takes a mapping argument. This defines how variables in your
dataset are mapped to visual properties.
 The mapping argument is always paired with aes(), and the x and y arguments of aes() specify
which variables to map to the x- and y-axes.
 ggplot2 looks for the mapped variable in the data argument, in this case, mpg.
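
The points above can be put together in a short sketch. It assumes ggplot2 is installed and uses the mpg data frame that ships with it; the object name p is only for illustration.

```r
library(ggplot2)

# ggplot(data = mpg) sets up the empty graph; geom_point() adds a layer
# of points, with displ mapped to the x-axis and hwy to the y-axis.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

print(p)
```

Printing the object draws the scatterplot in the active graphics device.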

A Graphing Template

The following is a reusable template for making graphs with ggplot2. To make a graph, replace the
bracketed sections of the template with a dataset, a geom function, or a collection of mappings:
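
As a sketch, the template can be written as a comment and then filled in once. The choice of cty and hwy for the mappings is only an example, and the object name p_template is illustrative.

```r
library(ggplot2)

# Template (not runnable until the bracketed parts are replaced):
#   ggplot(data = <DATA>) +
#     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# One filled-in instance:
#   <DATA>          -> mpg
#   <GEOM_FUNCTION> -> geom_point
#   <MAPPINGS>      -> x = cty, y = hwy
p_template <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))
```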



Aesthetic Mappings

An aesthetic is a visual property of the objects in your plot. Aesthetic mappings describe how variables in
the data are mapped to visual properties (aesthetics) of geoms. Aesthetic mappings can be set in ggplot()
and in individual layers. In the following plot, one group of points (highlighted in red) seems to fall outside
of the linear trend. These cars have a higher mileage than you might expect.

Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value
for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize,
and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps,
subcompact cars. You can add a third variable, like class, to a two-dimensional scatterplot by mapping
it to an aesthetic.
Aesthetics include things like the size, the shape, or the color of your points. You can display a point in
different ways by changing the values of its aesthetic properties. For example, you can map the color of
the points to the class variable to reveal the class of each car:



To map an aesthetic to a variable, associate the name of the aesthetic to the name of the variable inside
aes(). ggplot2 will automatically assign a unique level of the aesthetic to each unique value of the
variable, a process known as scaling. ggplot2 will also add a legend that explains which levels
correspond to which values.
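
A minimal sketch of mapping a third variable to an aesthetic, assuming the mpg data frame from ggplot2. Here the color aesthetic is mapped to class, so ggplot2 picks one color per class and adds the legend.

```r
library(ggplot2)

# color = class sits inside aes(), so it is a mapping from a variable
# to an aesthetic, not a fixed color.
p_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

print(p_color)
```

Replacing color with size, shape, or alpha inside aes() maps class to those aesthetics instead.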

Common Problems
 R is extremely picky, and a misplaced character can make all the difference. Make sure that every
( is matched with a ) and every " is paired with another ". Sometimes you’ll run the code and
nothing happens. Check the left-hand side of your console: if it’s a +, it means that R doesn’t
think you’ve typed a complete expression and it’s waiting for you to finish it. In this case, it’s
usually easy to start from scratch again by pressing Esc to abort processing the current command.
 One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has
to come at the end of the line, not the start.
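
The placement of the + can be illustrated with a short sketch. The incorrect form is left commented out so that the block still runs; the object name p_ok is illustrative.

```r
library(ggplot2)

# Correct: the + ends the first line, so R knows the expression continues.
p_ok <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Incorrect (kept as a comment): the first line is already a complete
# expression, so R evaluates it on its own and then fails on the second
# line, which starts with a +.
# ggplot(data = mpg)
# + geom_point(mapping = aes(x = displ, y = hwy))
```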
Facets
The facet approach partitions a plot into a matrix of panels. Each panel shows a different subset of the
data.
There are two main functions for faceting :
◦ facet_grid()
◦ facet_wrap()
facet_wrap()
The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable
name. The variable that you pass to facet_wrap() should be discrete.
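
A minimal sketch of facet_wrap(), assuming the mpg data frame. The discrete variable class is used for the panels, and nrow = 2 is an arbitrary layout choice.

```r
library(ggplot2)

# One panel per value of class, arranged in two rows.
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

print(p_wrap)
```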



facet_grid()
To facet the plot as the combination of two variables we use facet_grid(). The first argument of
facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.
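
A minimal sketch of facet_grid(), assuming the mpg data frame. The variables drv and cyl are chosen only as an example of two discrete variables.

```r
library(ggplot2)

# Rows of panels are defined by drv, columns by cyl.
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

print(p_grid)
```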

Geometric Objects
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the
type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms,
boxplots use boxplot geoms, and so on. Scatterplots use the point geom. We can use different geoms to
plot the same data.
 geom_smooth() will draw a different line, with a different linetype, for each unique value of the
variable that you map to linetype.
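
A short sketch of this behaviour, assuming the mpg data frame. Mapping linetype to drv gives one smoothed line per drive type; the smoothing method is left at the ggplot2 default.

```r
library(ggplot2)

# One smooth line, with its own linetype, for each value of drv.
p_smooth <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

print(p_smooth)
```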



 To display multiple geoms in the same plot, add multiple geom functions to ggplot():

 If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It
will use these mappings to extend or overwrite the global mappings for that layer only. This makes it
possible to display different aesthetics in different layers:
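
A sketch of two geoms in one plot with global and local mappings, assuming the mpg data frame. The x and y mappings are set globally in ggplot(); color = class is a local mapping that applies to the point layer only.

```r
library(ggplot2)

p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +  # local mapping: points only
  geom_smooth()                               # uses the global x and y

print(p_layers)
```

Because the color mapping is local, the smooth layer draws a single line for the whole dataset.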



Statistical Transformations

 geom_bar(): used to draw bar charts.

The diamonds dataset comes with ggplot2 and contains information about ~54,000 diamonds, including the
price, carat, color, clarity, and cut of each diamond. A bar chart of cut shows how many diamonds fall
into each cut category.

Scatterplots plot the raw values of the dataset. Bar charts calculate new values to plot. Bar charts,
histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall
in each bin. Smoothers fit a model to your data and then plot predictions from the model. Boxplots
compute a robust summary of the distribution and display a specially formatted box. The algorithm used
to calculate new values for a graph is called a stat, short for statistical transformation.

The help page ?geom_bar shows that the default value for stat is “count”, which means that geom_bar()
uses stat_count().
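
A brief sketch of the link between geom_bar() and stat_count(), assuming the diamonds data frame from ggplot2. Both calls compute the same counts per cut; the object names are illustrative.

```r
library(ggplot2)

# Bar chart of cut: geom_bar() counts the rows in each category.
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# The same chart built from the stat instead of the geom.
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

# The computed values can be inspected with layer_data().
layer_data(p_geom)[, c("x", "count")]
```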



Position Adjustments
You can color a bar chart using either the color aesthetic or, more usefully, fill.

If you map the fill aesthetic to another variable, like clarity, the bars are automatically stacked. Each
colored rectangle represents a combination of cut and clarity. The stacking is performed automatically by
the position adjustment specified by the position argument.

If you don’t want a stacked bar chart, the other options for position are:
◦ identity
◦ dodge
◦ fill



Identity
position = "identity" will place each object exactly where it falls in the context of the graph. This is not
very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars
slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA. The
identity position adjustment is more useful for 2D geoms, like points, where it is the default.

Fill
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it
easier to compare proportions across groups.

dodge
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare
individual values.
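
The position adjustments described above can be compared in one sketch, assuming the diamonds data frame. The alpha value of 1/5 is an arbitrary choice for making overlapping bars visible.

```r
library(ggplot2)

base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

p_stack    <- base + geom_bar()                       # default: stacked
p_identity <- base + geom_bar(alpha = 1/5,
                              position = "identity")  # bars overlap
p_fill     <- base + geom_bar(position = "fill")      # equal-height stacks
p_dodge    <- base + geom_bar(position = "dodge")     # side by side

print(p_dodge)
```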



 geom_jitter()
The values of hwy and displ are rounded, so the points appear on a grid and many points overlap each
other. This problem is known as overplotting. It makes it hard to see where the mass of the data is. We
can avoid this gridding by setting the position adjustment to “jitter”. position = "jitter" adds a small
amount of random noise to each point. This spreads the points out because no two points are likely to
receive the same amount of random noise. Although this makes the graph less accurate at small scales, it
makes the graph more revealing at large scales. geom_jitter() is a shorthand for
geom_point(position = "jitter").
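
A minimal sketch of jittering, assuming the mpg data frame. Both forms below add random noise, so the exact point positions differ from run to run.

```r
library(ggplot2)

# Explicit position adjustment on a point layer.
p_jitter <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

# Shorthand with the same effect.
p_jitter2 <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy))

print(p_jitter)
```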



Coordinate Systems
Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is
the Cartesian coordinate system where the x and y position act independently to find the location of each
point.

 coord_flip()
This function switches the x- and y-axes. This is useful if you want horizontal boxplots. It’s also useful
for long labels, which are hard to fit on the x-axis without overlapping.

 coord_quickmap()
The function sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data
with ggplot2.

 coord_polar()
The function uses polar coordinates. Polar coordinates reveal an interesting connection between a bar
chart and a Coxcomb chart.
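
A sketch of two of these coordinate functions, assuming the mpg and diamonds data frames. coord_quickmap() is left out because it needs map data that is not part of ggplot2 itself.

```r
library(ggplot2)

# Horizontal boxplots: class on the x-axis, then flip the axes.
p_flip <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

# A bar chart of cut, redrawn in polar coordinates as a Coxcomb chart.
bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut),
           show.legend = FALSE, width = 1)
p_polar <- bar + coord_polar()

print(p_flip)
print(p_polar)
```
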

DATA TRANSFORMATION WITH DPLYR

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most
common data manipulation challenges:

 mutate() adds new variables that are functions of existing variables.
 select() picks variables based on their names.
 filter() picks cases based on their values.
 summarise() reduces multiple values down to a single summary.
 arrange() changes the ordering of the rows.

To explore the basic data manipulation verbs of dplyr, we’ll use nycflights13::flights. This data frame
contains all 336,776 flights that departed from New York City in 2013. The data comes from the US
Bureau of Transportation Statistics, and is documented in ?flights:
View(flights) : Shows whole dataset



 filter()

filter() is used to filter rows in the dataset. It allows you to subset observations based on their
values. dplyr executes the filtering operation and returns a new data frame. dplyr functions never
modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-.

To use filtering effectively, select the observations using the comparison operators. R provides the standard
suite: >, >=, <, <=, != (not equal), and == (equal). When testing for equality, use ==, not =. Logical
operators (& for “and”, | for “or”, ! for “not”) can be used to combine conditions in filter(). A useful
shorthand is x %in% y, which selects every row where x is one of the values in y. filter() only includes
rows where the condition is TRUE; it excludes both FALSE and NA values.
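
A short sketch of filter(), assuming the dplyr and nycflights13 packages are installed. The object names and the small data frame df are illustrative.

```r
library(dplyr)
library(nycflights13)

# Flights on January 1st; the result is saved with <-.
jan1 <- filter(flights, month == 1, day == 1)

# Flights in November or December, using the %in% shorthand.
nov_dec <- filter(flights, month %in% c(11, 12))

# Rows where the condition is NA are dropped unless asked for explicitly.
df <- tibble(x = c(1, NA, 3))
only_true <- filter(df, x > 1)             # keeps the row with x = 3
with_na   <- filter(df, is.na(x) | x > 1)  # keeps NA and 3
```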

 arrange()

This function changes the order of rows. It takes a data frame and a set of column names to order by. If
you provide more than one column name, each additional column will be used to break ties in the values
of preceding columns.



Use desc() to reorder by a column in descending order.

Missing values are always sorted at the end:
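
A sketch of arrange(), assuming dplyr and nycflights13. The small data frame df is illustrative and shows where missing values end up.

```r
library(dplyr)
library(nycflights13)

# Order by year, then month, then day (later columns break ties).
by_date <- arrange(flights, year, month, day)

# Largest departure delays first.
by_delay <- arrange(flights, desc(dep_delay))

# NA is placed last in both ascending and descending order.
df <- tibble(x = c(5, 2, NA))
asc  <- arrange(df, x)        # 2, 5, NA
dsc  <- arrange(df, desc(x))  # 5, 2, NA
```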

 select()

This function selects columns. It is used to narrow the data set, allowing you to rapidly zoom in on
a useful subset using operations based on the names of the variables.



There are a number of helper functions you can use within select():
 starts_with("abc") matches names that begin with “abc”.
 ends_with("xyz") matches names that end with “xyz”.
 contains("ijk") matches names that contain “ijk”.
 matches("(.)\\1") selects variables that match a regular expression.
 num_range("x", 1:3) matches x1, x2, and x3.

select() can be used to rename variables, but it’s rarely useful because it drops all of the variables not
explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that
aren’t explicitly mentioned.
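
A sketch of select(), some of its helpers, and rename(), assuming dplyr and nycflights13. The new column name tail_num is only an example.

```r
library(dplyr)
library(nycflights13)

ymd      <- select(flights, year, month, day)   # by name
ymd2     <- select(flights, year:day)           # a range of columns
rest     <- select(flights, -(year:day))        # everything except a range

dep_cols   <- select(flights, starts_with("dep"))
delay_cols <- select(flights, ends_with("delay"))
time_cols  <- select(flights, contains("time"))

# rename() keeps every column and changes only the named one.
renamed <- rename(flights, tail_num = tailnum)
```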

 mutate()

This function adds new variables to a data frame: new columns that are functions of existing columns.
The newly created columns are added at the end of your dataset.
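
A minimal sketch of mutate(), assuming dplyr and nycflights13. The narrower data frame flights_sml and the new columns gain and speed are illustrative names.

```r
library(dplyr)
library(nycflights13)

# A narrower data frame makes the new columns easier to see.
flights_sml <- select(flights, year:day, ends_with("delay"),
                      distance, air_time)

with_new <- mutate(flights_sml,
  gain  = arr_delay - dep_delay,    # delay added while in the air
  speed = distance / air_time * 60  # miles per hour
)
```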



Various functions can be used with mutate():
1. Arithmetic operators +, -, *, /, ^
2. Modular arithmetic (%/% and %%)
3. Logs log(), log2(), log10()
4. Offsets
◦ lead()
◦ lag()
5. Cumulative and rolling aggregates.
◦ cumsum()
◦ cumprod()
◦ cummin()
◦ cummax()
◦ cummean()
6. Logical comparisons <, <=, >, >=, !=, ==
7. Ranking
◦ min_rank()
◦ min_rank(desc(x)) (ranks in descending order)
◦ row_number()
◦ dense_rank()
◦ percent_rank()

transmute() keeps only the new variables in the data set.
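
A few of these functions can be tried on small vectors. This is a sketch assuming dplyr is loaded; the vectors x and y and the data frame times are made up for illustration.

```r
library(dplyr)

x <- 1:10
lagged  <- lag(x)     # NA 1 2 ... 9
led     <- lead(x)    # 2 3 ... 10 NA
running <- cumsum(x)  # running total

y <- c(1, 2, 2, NA, 3, 4)
ranks <- min_rank(y)  # ties share the smallest rank; NA stays NA

# transmute() keeps only the newly created columns.
times <- tibble(dep_time = c(517, 1345))
parts <- transmute(times,
  hour   = dep_time %/% 100,  # integer division
  minute = dep_time %% 100    # remainder
)
```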

 summarize()
The function collapses a data frame to a single row. summarise() is an equivalent spelling.

Useful summary functions are:

 Means, counts, and sums: mean(x), n(), sum(x)
 Measures of location: median(x)
 Measures of spread: sd(x), IQR(x), mad(x)
 Measures of rank: min(x), quantile(x, 0.25), max(x)
 Measures of position: first(x), nth(x, 2), last(x)
 Counts
o sum(!is.na(x)) - counts the number of non-missing values
o n_distinct(x) - counts the number of distinct (unique) values
o n() - returns the size of the current group
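
A sketch of summarise(), assuming dplyr and nycflights13. It also uses group_by(), which is not covered above, so that n() has groups to count; treat that part as an extension of the notes.

```r
library(dplyr)
library(nycflights13)

# Whole data frame collapsed to one row.
overall <- summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# One row per month once the data is grouped.
by_month <- group_by(flights, month)
monthly <- summarise(by_month,
  flights    = n(),                     # size of the current group
  non_miss   = sum(!is.na(dep_delay)),  # non-missing delays
  carriers   = n_distinct(carrier),     # distinct carriers
  median_dep = median(dep_delay, na.rm = TRUE)
)
```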



EXPLORATORY DATA ANALYSIS

The goal of EDA is to develop an understanding of your data. EDA is fundamentally a creative process and,
like most creative processes, the key to asking quality questions is to generate a large quantity of questions.

Variation
Variation is the tendency of the values of a variable to change from measurement to measurement. If you
measure any continuous variable twice, you will get two different results. This is true even if you measure
quantities that are constant, like the speed of light, because each measurement includes a small amount
of error. Every variable has its own pattern of variation, which can reveal interesting information. The
best way to understand that pattern is to visualize the distribution of the variable’s values.

Visualizing Distributions

A variable is categorical if it can only take one of a small set of values. In R, categorical variables are
usually saved as factors or character vectors. We use bar plots to visualize categorical values.



A variable is continuous if it can take any of an infinite set of ordered values. Numbers and date-times
are two examples of continuous variables. We use a histogram to plot continuous values.
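
A sketch of both kinds of distribution plot, assuming the diamonds data frame from ggplot2. cut is used as the categorical variable and carat as the continuous one; the binwidth of 0.5 is an arbitrary choice.

```r
library(ggplot2)

# Categorical variable: bar chart of counts per category.
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Continuous variable: histogram with equally wide bins.
p_hist <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

print(p_bar)
print(p_hist)
```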

Typical Values
In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show
less-common values. Places that do not have bars reveal values that were not seen in your data.

Unusual Values
Outliers are observations that are unusual: data points that don’t seem to fit the pattern. Sometimes
outliers are data entry errors; other times outliers suggest important new science. When you have a lot
of data, outliers are sometimes difficult to see in a histogram.
If there are unusual values in your dataset, we have two options:
 Drop the entire row with the strange values.
 Replace the unusual values with missing values.

The easiest way to replace them is to use mutate() to replace the variable with a modified copy. Use the
ifelse() function to replace unusual values with NA.
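
Both options can be sketched on the diamonds data frame, assuming dplyr and ggplot2. The variable y and the cut-offs 3 and 20 are assumptions chosen for illustration; the right thresholds depend on the data.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data frame

# Option 1: drop the rows with unusual y values.
dropped <- filter(diamonds, between(y, 3, 20))

# Option 2: keep every row, but replace unusual y values with NA.
flagged <- mutate(diamonds, y = ifelse(y < 3 | y > 20, NA, y))
```

The second option keeps the other measurements in those rows available for analysis.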



TIDY DATA WITH tidyr

Tidy data is a consistent way to organize your data in R. We can represent the same underlying data in
multiple ways, but only one of them, the tidy dataset, will be much easier to work with inside the
tidyverse. dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

There are three interrelated rules which make a dataset tidy:


1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

These three rules are interrelated because it’s impossible to satisfy only two of the three. That
interrelationship leads to an even simpler set of practical instructions:
1. Put each dataset in a tibble.
2. Put each variable in a column.

Advantages of tidy data


 If you have a consistent data structure, it’s easier to learn the tools that work with it because they
have an underlying uniformity.
 There’s a specific advantage to placing variables in columns because it allows R’s vectorized
nature to shine. Most built-in R functions work with vectors of values. That makes transforming
tidy data feel particularly natural.

Spreading and Gathering

Most data that you will encounter will be untidy. There are two main reasons:
 Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself
unless you spend a lot of time working with data.
 Data is often organized to facilitate some use other than analysis. For example, data is often
organized to make entry as easy as possible.

This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out
what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with
the people who originally generated the data. The second step is to resolve one of two common problems:
 One variable might be spread across multiple columns.
 One observation might be scattered across multiple rows.
Two functions in tidyr, gather() and spread(), can be used to solve these problems. gather() makes wide
tables narrower and longer; spread() makes long tables shorter and wider.

Gathering – gather()
A common problem is a dataset where some of the column names are not names of variables, but values
of a variable. To tidy a dataset like this, we need to gather those columns into a new pair of variables.
Consider the below table where the columns are the values of the attribute year.

To describe that operation we need three parameters:


 The set of columns that represent values, not variables. In this example, those are the columns
1999 and 2000.
 The name of the variable whose values form the column names. That is the key, and here it is
year.
 The name of the variable whose values are spread over the cells. That is value, and here it’s the
number of cases.
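
The three parameters can be seen in a sketch that assumes tidyr and its built-in example table table4a, whose columns are country, 1999 and 2000. gather() still works but is superseded by pivot_longer() in current versions of tidyr.

```r
library(tidyr)

# Columns to gather: `1999` and `2000` (backticks because the names
# do not start with a letter). key = "year", value = "cases".
tidy4a <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")

tidy4a
```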



Spreading – spread()

Spreading is the opposite of gathering. You use it when an observation is scattered across multiple rows.

To tidy this up, we first analyze the representation. This time we need only two parameters:

 The column that contains variable names, the key column. Here, it’s type.
 The column that contains values from multiple variables, the value column. Here, it’s count.
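
A sketch of spreading, assuming tidyr and its built-in example table table2, which has the columns country, year, type and count. spread() still works but is superseded by pivot_wider() in current versions of tidyr.

```r
library(tidyr)

# type supplies the new column names; count supplies their values.
tidy2 <- spread(table2, key = type, value = count)

tidy2
```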



separate()

separate() pulls apart one column into multiple columns by splitting wherever a separator character
appears. In the example, the rate column contains both the cases and population variables, and we need
to split it into two variables. separate() takes the name of the column to separate and the names of the
columns to separate into.
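
A sketch of separate(), assuming tidyr and its built-in example table table3, where rate holds values such as "745/19987071". Setting convert = TRUE is an optional extra that turns the new character columns into numbers.

```r
library(tidyr)

# Split rate at the "/" into two new columns.
tidy3 <- separate(table3, rate,
                  into = c("cases", "population"),
                  sep = "/", convert = TRUE)

tidy3
```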



unite()
unite() is the inverse of separate(): it combines multiple columns into a single column. We can use
unite() to rejoin the century and year columns in the example. unite() takes a data frame, the name of
the new variable to create, and a set of columns to combine.
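
A sketch of unite(), assuming tidyr and its built-in example table table5, which stores century and year in separate columns. The new column name full_year is illustrative; sep = "" overrides the default underscore separator.

```r
library(tidyr)

# Combine century ("19") and year ("99") into one column ("1999").
tidy5 <- unite(table5, col = "full_year", century, year, sep = "")

tidy5
```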

Missing Values
Changing the representation of a dataset brings up an important subtlety of missing values. A value can
be missing in one of two possible ways:
 Explicitly, i.e., flagged with NA.
 Implicitly, i.e., simply not present in the data.



An explicit missing value is the presence of an absence; an implicit missing value is the absence of a
presence. The way a dataset is represented can make implicit values explicit. For example, we can make
an implicit missing value explicit by putting years in the columns.

Because these explicit missing values may not be important in other representations of the data, you can
set na.rm = TRUE in gather() to turn explicit missing values implicit.
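
Both directions can be sketched with a small made-up data frame. The name stocks and all of its values are hypothetical: the return for 2015 Q4 is explicitly NA, and the row for 2016 Q1 is simply absent.

```r
library(tidyr)

stocks <- tibble::tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(1, 2, 3, 4, 2, 3, 4),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# Years in the columns: the missing 2016 Q1 value becomes an explicit NA.
wide <- spread(stocks, year, return)

# Back to long form, dropping every NA (explicit becomes implicit).
long <- gather(wide, year, return, `2015`:`2016`, na.rm = TRUE)
```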

complete()
complete() is used for making missing values explicit in tidy data. complete() takes a set of columns
and finds all unique combinations. It then ensures the original dataset contains all those values, filling in
explicit NAs where necessary.
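
A sketch of complete() on a small made-up data frame. The name stocks and its values are hypothetical; the row for 2016 Q1 is absent from the input.

```r
library(tidyr)

stocks <- tibble::tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(1, 2, 3, 4, 2, 3, 4),
  return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# Every combination of year and qtr gets a row; gaps are filled with NA.
completed <- complete(stocks, year, qtr)
```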



fill()
fill() can be used to fill in missing values. It takes a set of columns where you want missing values
to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
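
A sketch of fill() on a small made-up data frame. The name treatment, its columns and its values are hypothetical; a missing person means "same as the row above".

```r
library(tidyr)

treatment <- tibble::tribble(
  ~person, ~treatment, ~response,
  "A",     1,          7,
  NA,      2,          10,
  NA,      3,          9,
  "B",     1,          4
)

# Carry the last non-missing person forward into the NA rows.
filled <- fill(treatment, person)
```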
