Graphics in R:
Graphical facilities are an important and extremely versatile component of the R environment.
It is possible to use the facilities to display a wide variety of statistical graphs and also to
build entirely new types of graph.
R is capable of creating high-quality graphics. Graphs are typically created using a series of
high-level and low-level plotting commands. High-level functions create new plots and low-level
functions add information to an existing plot. Graphs can be customized (line style, symbols,
color, etc.) by specifying graphical parameters, which are set using the par() function.
Once the device driver is running, R plotting commands can be used to produce a variety of
graphical displays and to create entirely new kinds of display. Plotting commands are divided
into two basic groups.
• High-level plotting commands: High-level plotting functions create a new plot on the
graphics device, possibly with axes, labels, titles and so on. High-level plotting functions are
designed to generate a complete plot of the data passed as arguments to the function. Where
appropriate, axes, labels and titles are automatically generated (unless you request otherwise.)
High-level plotting commands always start a new plot, erasing the current plot if necessary.
plot() Scatter plot
hist() Histogram
boxplot() Boxplot
qqplot(), qqnorm(), qqline() Quantile plots
interaction.plot() Interaction plot
sunflowerplot() Sunflower scatter plot
pairs() Scatter plot matrix
symbols() Draw symbols on a plot
dotchart(), barplot(), pie() Dot chart, bar chart, pie chart
curve() Draw a curve from a given function
image() Create a grid of colored rectangles with colors based on the values of a third variable
• Low-level plotting commands: Low-level plotting functions add more information to an
existing plot, such as extra points, lines and labels. Sometimes the high-level plotting
functions don’t produce exactly the kind of plot you desire. In this case, low-level plotting
commands can be used to add extra information (such as points, lines or text) to the current
plot.
points() Add points to a figure
lines() Add lines to a figure
text() Insert text in the plot region
mtext() Insert text in the figure and outer margins
title() Add figure title or outer title
legend() Insert legend
axis(), [Link]() Customize axes
abline() Add horizontal and vertical lines or a single line
box() Draw a box around the current plot
polygon() Draw a polygon
rect() Draw a rectangle
arrows() Draw arrows
segments() Draw line segments
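As a minimal sketch of how the two groups of commands work together, the example below starts a plot with one high-level command and then adds to it with several low-level commands. The vectors x and y are made-up values used only for illustration.

```r
# Made-up data for illustration
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2)

# High-level command: starts a new plot with axes, labels and a title
plot(x, y, main = "High-level plus low-level commands",
     xlab = "x", ylab = "y")

# Low-level commands: each one adds to the existing plot
abline(h = mean(y), lty = 2)                            # dashed horizontal line at the mean of y
lines(x, 2 * x, col = "blue")                           # reference line y = 2x
points(x[which.max(y)], max(y), pch = 19, col = "red")  # highlight the largest value
text(3, mean(y), "mean of y", pos = 3)                  # label just above the dashed line
legend("topleft", legend = c("data", "y = 2x"),
       pch = c(1, NA), lty = c(NA, 1), col = c("black", "blue"))
```

Calling plot() again at this point would erase everything drawn so far, while each low-level call only adds to what is already there.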
Bar Chart:
A bar chart represents data as rectangular bars whose lengths are proportional to the values of
the variable. R uses the function barplot() to create bar charts. R can draw both vertical and
horizontal bars in a bar chart, and each bar can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used −
• H is a vector or matrix containing numeric values used in bar chart.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
Example
A simple bar chart is created using just the input vector and the name of each bar. The script
below will create the bar chart and save it in the current R working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
# Give the chart file a name
png(file = "[Link]")
# Plot the bar chart
barplot(H)
# Save the file
dev.off()
When we execute the above code, it saves a bar chart with one bar for each value in H.
Bar Chart Labels, Title and Colors
The features of the bar chart can be expanded by adding more parameters. The main parameter
is used to add a title. The col parameter is used to add colors to the bars. The names.arg parameter is a
vector with the same number of values as the input vector, describing the meaning of each bar.
Example
The below script will create and save the bar chart in the current R working directory.
# Create the data for the chart
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
# Give the chart file a name
png(file = "barchart_months_revenue.png")
# Plot the bar chart
barplot(H, names.arg=M, xlab="Month", ylab="Revenue", col="blue",
        main="Revenue chart", border="red")
# Save the file
dev.off()
We can also plot bars horizontally by providing the argument horiz = TRUE.
# Bar chart with added parameters
barplot([Link], main = "Maximum Temperatures in a Week",
        xlab = "Degree Celsius", ylab = "Day",
        names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
        col = "darkred", horiz = TRUE)
How to plot barplot with matrix?
As mentioned before, the barplot() function can take a vector as well as a matrix. If the input is
a matrix, a stacked bar chart is plotted. Each column of the matrix is represented by one stacked bar.
Let us consider the following matrix, which is derived from our Titanic dataset.
> [Link]
        Class
Survival 1st 2nd 3rd Crew
     No  122 167 528  673
     Yes 203 118 178  212
This data is plotted as follows.
barplot([Link], main = "Survival of Each Class", xlab = "Class",col = c("red","green") )
legend("topleft", c("Not survived","Survived"), fill = c("red","green") )
Instead of a stacked bar, we can draw the bars for each element in a column side by side
by specifying the parameter beside = TRUE.
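A short sketch of the beside = TRUE variant follows. It assumes the survival matrix is rebuilt from the Titanic data set that ships with R; the names counts and mids are chosen here only for illustration.

```r
# Survival by class, summed over sex and age from the built-in Titanic table
counts <- apply(Titanic, c(4, 1), sum)   # rows: Survived (No/Yes), columns: Class

# beside = TRUE draws the two bars of each class next to each other
mids <- barplot(counts, beside = TRUE,
                main = "Survival of Each Class", xlab = "Class",
                col = c("red", "green"))
legend("topleft", c("Not survived", "Survived"), fill = c("red", "green"))
```

barplot() returns the bar midpoints, which can be handy when adding text labels above the bars with a low-level command.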
Pie Chart:
A pie chart is a representation of values as slices of a circle with different colors. The slices are
labeled, and the number corresponding to each slice is also shown in the chart. In R the pie
chart is created using the pie() function, which takes a vector of positive numbers as input. The
additional parameters are used to control labels, color, title, etc.
Syntax:
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give description to the slices.
• radius indicates the radius of the circle of the pie chart (a value between −1 and +1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating whether the slices are drawn clockwise or
anticlockwise.
Example:
A very simple pie chart is created using just the input vector and labels. The script below will
create the pie chart and save it in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels<- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "[Link]")
# Plot the chart.
pie(x,labels)
# Save the file.
dev.off()
Pie Chart Title and Colors
We can expand the features of the chart by adding more parameters to the function. We will use
the parameter main to add a title to the chart and the parameter col, which will make use of the
rainbow color palette while drawing the chart. The length of the palette should be the same as the
number of values we have for the chart. Hence we use length(x).
Example
The below script will create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels<- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city_title_colours.png")
# Plot the chart with title and rainbow color palette.
pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
# Save the file.
dev.off()
Example 2: Pie chart with additional parameters
pie(expenditure, labels=[Link](expenditure),
    main="Monthly Expenditure Breakdown",
    col=c("red","orange","yellow","blue","green"), border="brown",
    clockwise=TRUE)
Box Plots:
A boxplot shows how the data in a data set are distributed. It divides the data set
into four parts using three quartiles. The graph represents the minimum, maximum, median, first quartile and
third quartile of the data set. It is also useful for comparing the distribution of data across data sets
by drawing a boxplot for each of them. The boxplot() function takes in any number of
numeric vectors, drawing a boxplot for each vector. You can also pass in a list (or data frame)
with numeric vectors as its components. Boxplots are created in R by using
the boxplot() function.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Following is the description of the parameters used −
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as TRUE to draw the width of each box proportional to the
square root of the sample size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.
Ex1:
Let us use the built-in dataset airquality which has “Daily air quality measurements in New
York, May to September 1973.”-R documentation.
> str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
Let us make a boxplot for the ozone readings.
>boxplot(airquality$Ozone)
We can see that data above the median is more dispersed. We can also notice two outliers at the
higher extreme.
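The numbers behind this picture can be inspected without reading them off the graph. The sketch below uses boxplot.stats() from base R; the variable name stats is chosen here only for illustration.

```r
# Summary statistics that boxplot() uses for the ozone readings (NA values are omitted)
stats <- boxplot.stats(airquality$Ozone)

stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values drawn as individual outlier points
```

The out component lists the points that lie beyond the whiskers, which are the outliers seen at the higher extreme of the plot.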
Ex2:
We can pass in additional parameters to control the way our plot looks. Some of the frequently
used ones are main to give the title, xlab and ylab to provide labels for the axes, and col to define
the color. Additionally, with the argument horizontal = TRUE we can plot it horizontally and
with notch = TRUE we can add a notch to the box.
boxplot(airquality$Ozone,
        main = "Mean ozone in parts per billion at Roosevelt Island",
        xlab = "Parts Per Billion",
        ylab = "Ozone",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE
)
Multiple Boxplots
We can draw multiple boxplots in a single plot, by passing in a list, data frame or multiple
vectors.
Let us consider the Ozone and Temp field of airquality dataset. Let us also generate normal
distribution with the same mean and standard deviation and plot them side by side for
comparison.
# prepare the data
>ozone <- airquality$Ozone
>temp <- airquality$Temp
# generate normal distribution with same mean and sd
>ozone_norm <- rnorm(200,mean=mean(ozone, na.rm=TRUE), sd=sd(ozone, na.rm=TRUE))
>temp_norm <- rnorm(200,mean=mean(temp, na.rm=TRUE), sd=sd(temp, na.rm=TRUE))
➢ rnorm(n, ...) generates n random values from the normal distribution. runif(n, ...) generates n random
values from the uniform distribution.
Now let us make four boxplots with this data. We use the arguments at and names to set the
position and label of each box.
>boxplot(ozone, ozone_norm, temp, temp_norm,
main = "Multiple boxplots for comparison",
at = c(1,2,4,5),
names = c("ozone", "normal", "temp", "normal"),
las = 2,
col = c("orange","red"),
border = "brown",
horizontal = TRUE,
notch = TRUE
)
Scatter Plot:
A scatter plot in R is very useful for visualizing the relationship between two sets
of data. The data is displayed as a collection of points, which shows the relation (for example, a linear one) between those
two data sets. For example, if we want to visualize Age against Weight, then we can use a
scatter plot.
Scatterplots show many points plotted in the Cartesian plane. Each point represents the values
of two variables. One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, type,main, xlab, ylab, xlim, ylim, axes)
Following is the description of the parameters used −
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• type specifies the type of plot to draw.
o To draw points, use type = “p”
o To draw lines, use type = “l”
o Use type = “h” for histogram-like vertical lines
o Use type = “s” for stair steps
o To draw over-plotted points and lines, use type = “o”
• main is the title of the graph.
• xlab is the label in the horizontal axis.
• ylab is the label in the vertical axis.
• xlim is the limits of the values of x used for plotting.
• ylim is the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's
use the columns "wt" and "mpg" in mtcars.
input<- mtcars[,c('wt','mpg')]
print(head(input))
When we execute the above code, it produces the following result −
                     wt  mpg
Mazda RX4         2.620 21.0
Mazda RX4 Wag     2.875 21.0
Datsun 710        2.320 22.8
Hornet 4 Drive    3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant           3.460 18.1
Creating the Scatterplot
The below script will create a scatterplot graph for the relation between wt(weight) and
mpg(miles per gallon).
# Get the input values.
input <- mtcars[, c('wt', 'mpg')]
# Give the chart file a name.
png(file = "[Link]")
# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(2.5, 5),
     ylim = c(15, 30),
     main = "Weight vs Mileage"
)
# Save the file.
dev.off()
# How to create a Scatter Plot in R Example
faithful
# Finding the Correlation
cor(faithful$eruptions, faithful$waiting)
# Drawing Scatter Plot
plot(faithful$eruptions, faithful$waiting)
The following statement finds the correlation between eruptions and waiting:
cor(faithful$eruptions, faithful$waiting)
Change Colors of Scatter plot in R
In this example we show how to change the scatter plot color using the col argument, and the
size of the character that represents each point using the cex (character expansion) argument.
• col: specifies the color to use for the scatter plot.
• cex: specifies the size of the point(s).
R CODE
# R Scatter Plot - Changing Color, Dot Size Example
# Drawing Scatter Plot
plot(faithful$eruptions, faithful$waiting,
     col = "chocolate",
     cex = 1.2,
     main = "R Scatter Plot",
     xlab = "Eruptions", ylab = "Waiting", las = 1)
Change Shapes and Axis limits of Scatter Plot in R
In this example we show how to change the shape of the points using the pch argument and how to set the axis limits.
• xlim: specifies the limits for the X-axis
• ylim: specifies the limits for the Y-axis
R CODE
# R Scatter Plot - Changing X, Y Limits, Dot Shape Example
faithful
# Drawing Scatter Plot
plot(faithful$eruptions, faithful$waiting,
col = "chocolate",
pch = 8,
main = "R Scatter Plot",
xlab = "Eruptions",
ylab = "Waiting",
las = 1,
xlim = c(1.5, 5.5),
ylim = c(40, 100))
Low-Level Graphics:
The plot( ) function in R:
The most used plotting function in R programming is the plot() function. It is a generic function,
meaning, it has many methods which are called according to the type of object passed to plot().
In the simplest case, we can pass in a vector and we will get a scatter plot of magnitude vs index.
But generally, we pass in two vectors and a scatter plot of these points is plotted.
For example, the command plot(c(1,2),c(3,5)) would plot the points (1,3) and (2,5).
Adding shapes to Graphs:
➢ Adding Titles and Labeling Axes
We can add a title to our plot with the parameter main. Similarly, xlab and ylab can be used to
label the x-axis and y-axis respectively.
x <- seq(-pi, pi, 0.1)  # x values to plot (an assumed range)
plot(x, sin(x),
main="The Sine Function",
ylab="sin(x)")
➢ Changing Color and Plot Type
We can see above that the plot is of circular points and black in color. This is the default color.
We can change the plot type with the argument type. It accepts the following strings and has the
given effect.
"p" - points
"l" - lines
"b" - both points and lines
"c" - empty points joined by lines
"o" - over plotted points and lines
"s" and "S" - stair steps
"h" - histogram-like vertical lines
"n" - does not produce any points or lines
Ex:
plot(x, sin(x),
main="The Sine Function",
ylab="sin(x)",
type="l",
col="blue")
➢ Overlaying Plots Using legend() and lines() functions:
Calling plot() multiple times will have the effect of plotting the current graph on the same
window replacing the previous one. However, sometimes we wish to overlay the plots in order to
compare the results. This is made possible with the functions lines() and points() to add lines and
points respectively, to the existing plot.
plot(x, sin(x),
main="Overlaying Graphs",
ylab="",
type="l",
col="blue")
lines(x,cos(x),
col="red")
legend("topleft",
c("sin(x)","cos(x)"),
fill=c("blue","red")
)
➢ Adding Other Shapes to a Plot
Using the following functions, we can add the extra graphical objects in plots:
• rect – For plotting rectangles – rect(xleft, ybottom, xright, ytop)
Ex:
plot(c(100, 250), c(300, 450), type = "n", xlab = "", ylab = "",
     main = "2 x 11 rectangles")
i <- 4 * (0:10)  # offsets for 11 shifted rectangles (assumed values)
rect(100+i, 300+i, 150+i, 380+i, col = rainbow(11, start = 0.7, end = 0.1))
rect(100, 400, 125, 450, col = "green", border = "blue")
o/p:
Using the locator() function, we can obtain the coordinates of the corners of the rectangle. But the rect function do
• arrows – For plotting arrows and headed bars – The arrows function draws a line from the
point (x0, y0) to the point (x1, y1) with the arrowhead, by default, at the “second” end (x1, y1).
arrows(x0, y0, x1, y1)
Adding code=3 produces a double-headed arrow. For example, the following draws a horizontal double-headed arrow from (1,9) to (5,9):
arrows(1,9,5,9,code=3)
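Ex:
x <- runif(12); y <- rnorm(12)          # sample data (assumed)
i <- order(x, y); x <- x[i]; y <- y[i]  # sort the points by x
plot(x, y, main = "arrows and segments")
## draw arrows from point to point:
s <- seq(length(x)-1) # one shorter than data
arrows(x[s], y[s], x[s+1], y[s+1], col = 1:3)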
polygon – For plotting more complicated filled shapes, including objects with curved sides. To draw a polygon i
Now you can draw a lavender-colored polygon by using the following command:
locations<-locator(6)
polygon(locations, col="lavender")
Ex:
n <- 100  # number of steps (assumed value)
xx <- c(0:n, n:0)
yy <- c(c(0,cumsum(rnorm(n))), rev(c(0,cumsum(rnorm(n)))))
plot(xx, yy, type="n", xlab="Time", ylab="Distance")
polygon(xx, yy, col="gray", border = "red")
• Plot symbols in R: Different plotting symbols are available in R. The graphical
argument used to specify point shapes is pch. By default pch=1 . The different points
symbols commonly used in R are shown in the figure below
Data analytics Using R Unit - IV
Example:
x<-c(2.2, 3, 3.8, 4.5, 7, 8.5, 6.7, 5.5)
y<-c(4, 5.5, 4.5, 9, 11, 15.2, 13.3, 10.5)
# Plot points
plot(x, y)
# Change plotting symbol
# Use solid circle
plot(x, y, pch = 19)
Downloaded by Gayathri T
Data Analysis using R
Saving Graphs to Files:
If you want to publish your results, you have to save your plot to a file in R and then import this
graphics file into another document. Much of the time however, you may simply want to use R
graphics in an interactive way to explore your data.
To save a plot to an image file, you have to do three things in sequence:
1. Open a graphics device.
➢ The default graphics device in R is your computer screen. To save a plot to an image file,
you need to tell R to open a new type of device — in this case, a graphics file of a
specific type, such as PNG, PDF, or JPG.
➢ The R function to create a PNG device is png(). Similarly, you create a PDF
device with pdf() and a JPG device with jpeg().
➢ The first step in deciding how to save plots is to decide on the output format that you want
to use. The following table lists some of the available formats, along with guidance as to
when they may be useful.
Format      Driver        Notes
JPG         jpeg          Can be used anywhere, but doesn't resize
PNG         png           Can be used anywhere, but doesn't resize
WMF         win.metafile  Windows only; best choice with Word; easily resizable
PDF         pdf           Best choice with pdflatex; easily resizable
Postscript  postscript    Best choice with latex and Open Office; easily resizable
2. Create the plot.
Methods to Save Graphs to Files in R
Below are the methods to save graphs to files in R.
i. A General Method
Here’s a general method that will work on any computer with R, regardless of operating system
or the way that we are connecting.
For Example:
If we want to save a plot as a JPG file, we use the jpeg driver. To save a jpg
file called “[Link]” containing a plot of x and y, we would type the following commands:
• Save as Jpeg image
> jpeg('[Link]')
> plot(x,y)
> dev.off()
• Save as png image
> png(file="C:/Datamentor/R-tutorial/saving_plot2.png", width=600, height=350)
> hist(Temperature, col="gold")
> dev.off()
ii. Another Approach
In R, the dev.copy command is used to copy the contents of the graph window to a file without
having to re-enter the commands.
For Example:
To create a png file called [Link] from a graph that is displayed by R, type
> dev.copy(png,'[Link]')
> dev.off()
3. Close the graphics device.
You do this with the dev.off() function.
Put this in action by saving a plot of faithful to the home folder on your computer. First set your
working directory to your home folder (or to any other folder you prefer).
Now you can check your file system to see whether the file [Link] exists. (It should!) The
result is a graphics file of type PNG that you can insert into a presentation, document, or website.
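The three steps can be sketched as follows. This is only an illustration: the file name faithful_plot.png and the use of a temporary folder are assumptions made so that the example runs anywhere, and it requires an R build with PNG support.

```r
# Write into a temporary folder; after setwd() a plain file name works the same way
out_file <- file.path(tempdir(), "faithful_plot.png")   # hypothetical file name

png(filename = out_file, width = 600, height = 400)  # 1. open a graphics device
plot(faithful, main = "Old Faithful eruptions")      # 2. create the plot
dev.off()                                            # 3. close the graphics device

file.exists(out_file)   # check that the file was written
```

Nothing is written to disk until dev.off() is called, so forgetting the third step usually leaves an empty or incomplete file.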
To save a plot as a jpeg image we would perform the following steps. Please note that we need to
call the function dev.off() after all the plotting, to save the file and return control to the screen.
Ex:
Temperature <- airquality$Temp  # assumed source of the Temperature vector
jpeg(file="saving_plot1.jpeg")
hist(Temperature, col="darkgreen")
dev.off()
Descriptive Statistics:
Statistical analysis in R is performed by using many built-in functions. Most of these functions
are part of the R base package. These functions take an R vector as input, along with other
arguments, and return the result.
The functions we are discussing in this chapter are mean, median and mode.
Mean
It is calculated by taking the sum of the values and dividing by the number of values in a data
series.
The function mean() is used to calculate this in R.
Syntax:
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
• x is the input vector.
• trim is the fraction of observations to drop from each end of the sorted vector.
• na.rm is used to remove the missing values from the input vector.
Ex1:
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
[Link] <- mean(x)
print([Link])
When we execute the above code, it produces the following result −
[1] 8.22
Ex2:
x <- c(12,7,3,4.2,18,2,54,-21,8,-5,NA)
# Find mean.
[Link] <- mean(x)
print([Link])
# Find mean dropping NA values.
[Link] <- mean(x, na.rm = TRUE)
print([Link])
When we execute the above code, it produces the following result −
[1] NA
[1] 8.22
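The trim parameter described above is not used in these examples, so here is a small sketch of its effect on the same vector. The names trimmed and by_hand are chosen only for illustration.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# trim = 0.3 drops floor(10 * 0.3) = 3 values from each end of the sorted vector
trimmed <- mean(x, trim = 0.3)

# The same calculation by hand: keep the 4th to 7th sorted values (3, 4.2, 7, 8)
by_hand <- mean(sort(x)[4:7])

print(trimmed)   # 5.55
print(by_hand)   # 5.55
```

Because the extreme values 54 and -21 are discarded, the trimmed mean (5.55) is noticeably lower than the ordinary mean (8.22).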
Median:
The middle-most value in a data series is called the median. The median() function is used in R to
calculate this value.
Syntax:
median(x, na.rm = FALSE)
Following is the description of the parameters used −
• x is the input vector.
• na.rm is used to remove the missing values from the input vector.
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find the median.
[Link] <- median(x)
print([Link])
When we execute the above code, it produces the following result −
[1] 5.6
Variance: The variance is a numerical measure of how far a set of data values is spread out
around its mean. In R it is calculated with the var() function, which returns the sample
variance, defined as:
s^2 = sum((x_i - m)^2) / (n - 1), where m is the mean of the n values x_1, ..., x_n.
Ex:
x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
[Link] = var(x) # calculate variance
> print ([Link])
[1] 2.484211
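As a quick check, the sample variance (the sum of squared deviations from the mean divided by n - 1) can be computed by hand and compared with var(). The names n and by_formula are chosen only for illustration.

```r
x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6)

n <- length(x)
by_formula <- sum((x - mean(x))^2) / (n - 1)   # sample variance from its definition

print(by_formula)
print(var(x))         # var() gives the same value
print(sqrt(var(x)))   # the square root of the variance equals sd(x)
```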
Standard Deviation: A measure that is used to quantify the amount of variation or dispersion of
a set of data values. It is the square root of the variance. The standard
deviation of a single variable can be computed with the sd(VAR) command, where VAR is the
name of the variable whose standard deviation you wish to retrieve.
Ex:
x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> [Link] = sqrt(var(x)) # calculate standard deviation
> print ([Link])
[1] 1.576138
Minimum and Maximum
Keeping with the pattern, a minimum can be computed on a single variable using the min(VAR)
command. The maximum, via max(VAR), operates identically. However, in contrast to the mean
and standard deviation functions, min(DATAVAR) or max(DATAVAR) will retrieve the
minimum or maximum value from the entire dataset, not from each individual variable.
Therefore, it is recommended that minimums and maximums be calculated on individual
variables, rather than entire datasets, in order to produce more useful information. The sample
code below demonstrates the use of the min and max functions.
Ex:
x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> y=min(x)
>y
[1] 1
> z=max(x)
>z
[1] 6
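The difference between calling min() or max() on a whole data set and on its individual variables can be seen in the following sketch, which uses two numeric columns of the built-in mtcars data set. The name d is chosen only for illustration.

```r
# Two numeric columns of the built-in mtcars data set
d <- mtcars[, c("mpg", "wt")]

min(d)           # a single value taken over the whole data frame
max(d)

sapply(d, min)   # one minimum per variable
sapply(d, max)   # one maximum per variable
```

Here min(d) simply returns the smallest weight, because all weights are smaller than all mileage values, which says nothing about the smallest mpg.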
Correlation and lines of Regression:
Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other variable is called the response variable, whose value is
derived from the predictor variable.
Linear Regression
Linear regression is one of the most commonly used predictive modelling techniques. The
aim of linear regression is to find a mathematical equation for a continuous response
variable Y as a function of one or more X variable(s), so that you can use this regression
model to predict Y when only X is known.
Mathematically, a linear relationship represents a straight line when plotted as a graph. A
non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting weight of a person when his height is known. To
do this we need to have the relationship between height and weight of a person.
The steps to create the relationship are −
• Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model created and create the mathematical equation using
these.
• Get a summary of the relationship model to know the average error in prediction, also
called residuals.
• To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
• formula is a symbol presenting the relation between x and y.
• data is the data frame containing the variables to which the formula will be applied.
Create Relationship Model & get the Coefficients
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
When we execute the above code, it produces the following result −
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
   -38.4551       0.6746
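The remaining steps listed earlier (checking the correlation, predicting a new value and drawing the regression line) can be sketched as follows. The height of 170 and the names new_height and predicted are chosen only for illustration.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # heights
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weights

relation <- lm(y ~ x)

coef(relation)   # intercept and slope of the fitted line
cor(x, y)        # correlation between height and weight

# Predict the weight of a person with a height of 170
new_height <- data.frame(x = 170)
predicted <- predict(relation, new_height)
print(predicted)

# Draw the observations and the fitted regression line
plot(x, y, xlab = "Height", ylab = "Weight", main = "Height & Weight Regression")
abline(relation, col = "blue")
```

The prediction is simply the fitted equation evaluated at the new height, that is, roughly -38.4551 + 0.6746 * 170.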
Nonlinear Regression:
Regression is nonlinear when at least one of its parameters appears nonlinearly. It is commonly
used to analyze data in various industries, such as the retail and banking sectors. It also helps to draw
conclusions and predict future trends on the basis of users’ activities on the net.
In non-linear regression the analyst specifies a function with a set of parameters to fit to the data.
The most basic way to estimate such parameters is to use a non-linear least squares approach
(function nls in R), which basically approximates the non-linear function using a linear one and
iteratively tries to find the best parameter values (wiki). A nice feature of non-linear regression in
an applied context is that the estimated parameters have a clear interpretation (Vmax in
a Michaelis-Menten model is the maximum rate), which would be harder to get using linear
models on transformed data.
# simulate some data
set.seed(20160227)
x <- seq(0, 50, 1)
y <- ((runif(1,10,20)*x)/(runif(1,0,10)+x)) + rnorm(51,0,1)
# for simple models nls finds good starting values for the parameters even if it throws a warning
m <- nls(y ~ a*x/(b+x))
#get some estimation of goodness of fit
cor(y,predict(m))
[1] 0.9496598
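A variant of this example with explicit starting values is sketched below. The true parameter values (15 and 5), the seed and the starting values are all assumptions chosen for illustration, so the numbers will differ from the output shown above.

```r
# Simulated Michaelis-Menten style data with known parameters (made up for illustration)
set.seed(1)
x <- seq(0, 50, 1)
a_true <- 15   # asymptote (Vmax)
b_true <- 5    # half-saturation constant
y <- a_true * x / (b_true + x) + rnorm(length(x), 0, 1)

# Supplying start values avoids the warning about missing starting values
m <- nls(y ~ a * x / (b + x), start = list(a = 12, b = 3))

coef(m)              # estimates of a and b
cor(y, predict(m))   # rough indication of goodness of fit
```

Since the data were generated with known parameters, the estimates returned by coef(m) can be compared directly with a_true and b_true.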
Multiple regression:
Multiple regression is an extension of linear regression to relationships between more than two
variables. In simple linear regression we have one predictor and one response variable, but in
multiple regression we have more than one predictor variable and one response variable.
The general mathematical equation for multiple regression is −
y = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used −
• y is the response variable.
• a, b1, b2...bn are the coefficients.
• x1, x2, ...xn are the predictor variables.
We create the regression model using the lm() function in R. The model determines the value of
the coefficients using the input data. Next we can predict the value of the response variable for a
given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
• formula is a symbol presenting the relation between the response variable and predictor
variables.
• data is the data frame containing the variables to which the formula will be applied.
Example
Input Data
Consider the data set "mtcars" available in the R environment. It gives a comparison between
different car models in terms of mileage per gallon ("mpg"), cylinder displacement ("disp"), horse
power ("hp"), weight of the car ("wt") and some more parameters.
The goal of the model is to establish the relationship between "mpg" as a response variable with
"disp","hp" and "wt" as predictor variables. We create a subset of these variables from the
mtcars data set for this purpose.
input <- mtcars[,c("mpg","disp","hp","wt")]
print(head(input))
When we execute the above code, it produces the following result −
                   mpg disp  hp    wt
Mazda RX4         21.0  160 110 2.620
Mazda RX4 Wag     21.0  160 110 2.875
Datsun 710        22.8  108  93 2.320
Hornet 4 Drive    21.4  258 110 3.215
Hornet Sportabout 18.7  360 175 3.440
Valiant           18.1  225 105 3.460
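With this subset in place, fitting the model and using it for prediction might look like the sketch below. The names model and new_car, and the values given for the new car, are chosen only for illustration.

```r
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Fit mpg as a linear function of disp, hp and wt
model <- lm(mpg ~ disp + hp + wt, data = input)
print(coef(model))   # intercept a and the coefficients b1, b2, b3

# Predict mpg for one hypothetical car
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
print(predict(model, new_car))
```

The coefficients printed by coef(model) are the values a, b1, b2 and b3 of the general equation given above, and predict() evaluates that equation for the new values.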
Logistic Regression
Logistic regression is a regression model in which the response variable (dependent
variable) has categorical values such as True/False or 0/1. It models the probability of
a binary response as a function of the predictor variables, based on the mathematical equation relating
the two.
The general mathematical equation for logistic regression is −
y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))
Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are the coefficients which are numeric constants.
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)
Following is the description of the parameters used −
• formula is the symbolic description of the relationship between the variables.
• data is the data set giving the values of these variables.
• family is an R object that specifies the details of the model. Its value is binomial for
logistic regression.
Example
The in-built data set "mtcars" describes different models of a car with their various engine
specifications. In "mtcars" data set, the transmission mode (automatic or manual) is described
by the column am which is a binary value (0 or 1). We can create a logistic regression model
between the columns "am" and 3 other columns - hp, wt and cyl.
# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
When we execute the above code, it produces the following result −
                  am cyl  hp    wt
Mazda RX4          1   6 110 2.620
Mazda RX4 Wag      1   6 110 2.875
Datsun 710         1   4  93 2.320
Hornet 4 Drive     0   6 110 3.215
Hornet Sportabout  0   8 175 3.440
Valiant            0   6 105 3.460
Create Regression Model
We use the glm() function to create the regression model and get its summary for analysis.
input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
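Once the model is fitted, predict() with type = "response" returns probabilities on the 0 to 1 scale. The sketch below is self-contained; the object names fit and new_car, and the values given for the new car, are hypothetical and chosen only for illustration.

```r
# Refit the same model so this sketch runs on its own.
input <- mtcars[, c("am", "cyl", "hp", "wt")]
fit <- glm(am ~ cyl + hp + wt, data = input, family = binomial)

# A hypothetical car that is not part of mtcars.
new_car <- data.frame(cyl = 4, hp = 100, wt = 2.5)

# type = "response" gives the estimated probability that am = 1 (manual).
p <- predict(fit, newdata = new_car, type = "response")
print(p)
```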
Time Series Analysis
A time series is a series of data points in which each data point is associated with a timestamp.
A simple example is the price of a stock at different points of time on a given day. Another
example is the amount of rainfall in a region in different months of the year. R provides many
functions to create, manipulate and plot time series data. The data for a time series is stored
in an R object called a time-series object. It is an R data object, like a vector or data frame.
The time series object is created by using the ts() function.
Syntax
The basic syntax for ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)
Following is the description of the parameters used −
• data is a vector or matrix containing the values used in the time series.
• start specifies the start time for the first observation in time series.
• end specifies the end time for the last observation in time series.
• frequency specifies the number of observations per unit time.
All parameters except "data" are optional.
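As a small illustration of how these parameters interact, the sketch below builds a quarterly series from made-up values and then queries it with start(), end() and frequency(). The variable name x and the data 1:8 are assumptions for this example only.

```r
# Eight made-up quarterly observations, the first one in 2020 Q3.
x <- ts(1:8, start = c(2020, 3), frequency = 4)
print(x)

# The end time is derived from start, frequency and the number of values.
print(start(x))      # year 2020, quarter 3
print(end(x))        # year 2022, quarter 2
print(frequency(x))  # 4 observations per year
```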
Example
Consider the monthly rainfall at a place, starting from January 2012. We create an R time
series object for a period of 12 months and plot it.
# Get the data points in the form of an R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall.png")
# Plot a graph of the time series.
plot(rainfall.timeseries)
# Save the file.
dev.off()
When we execute the above code, it produces the following result and chart −
        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep
2012  799.0 1174.8  865.1 1334.6  635.4  918.5  685.5  998.6  784.2
        Oct    Nov    Dec
2012  985.0  882.8 1071.0
The Time series chart −
Different Time Intervals
The value of the frequency parameter in the ts() function sets the time interval at which the
data points are measured. A value of 12 indicates monthly observations, that is, 12 per year.
Common values and their meanings are listed below −
• frequency = 12 pegs the data points for every month of a year.
• frequency = 4 pegs the data points for every quarter of a year.
• frequency = 6 pegs the data points for every 10 minutes of an hour.
• frequency = 24*6 pegs the data points for every 10 minutes of a day.
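The effect of frequency can be seen by feeding the same values to ts() twice. In this sketch the data 1:12 and the names monthly and quarterly are made up for illustration: read as monthly data the twelve values cover one year, and read as quarterly data they cover three.

```r
values <- 1:12  # made-up observations

# Same data, two different frequencies.
monthly   <- ts(values, start = c(2012, 1), frequency = 12)
quarterly <- ts(values, start = c(2012, 1), frequency = 4)

print(end(monthly))    # year 2012, period 12 (December)
print(end(quarterly))  # year 2014, period 4 (fourth quarter)
```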
Multiple Time Series
We can plot multiple time series in one chart by combining both the series into a matrix.
# Get the data points in the form of R vectors.
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <- c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)
# Convert them to a matrix.
combined.rainfall <- matrix(c(rainfall1,rainfall2),nrow = 12)
# Convert it to a time series object.
rainfall.timeseries <- ts(combined.rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall_combined.png")
# Plot a graph of the time series.
plot(rainfall.timeseries, main = "Multiple Time Series")
# Save the file.
dev.off()
The Multiple Time series chart −
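By default, plot() draws each column of a multivariate time series in its own panel. If both series should share a single set of axes, one possible approach is plot.type = "single". The sketch below assumes hypothetical column labels station1 and station2 and arbitrary colors; it draws to the current graphics device instead of a PNG file.

```r
rainfall1 <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall2 <- c(655,1306.9,1323.4,1172.2,562.2,824,822.4,1265.5,799.6,1105.6,1106.7,1337.8)

# cbind() keeps a name for each series (labels are hypothetical).
rain <- ts(cbind(station1 = rainfall1, station2 = rainfall2),
           start = c(2012, 1), frequency = 12)

# Draw both series on the same axes and label them.
plot(rain, plot.type = "single", col = c("blue", "red"), lty = 1:2,
     ylab = "Rainfall", main = "Multiple Time Series")
legend("topleft", legend = colnames(rain), col = c("blue", "red"), lty = 1:2)
```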
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic
options.
One way is to specify these options through the par( ) function. If you set parameter values
here, the changes will be in effect for the rest of the session or until you change them again. The
format is par(optionname=value, optionname=value, ...)
# Set a graphical parameter using par()
par() # view current settings
opar <- par(no.readonly = TRUE) # make a copy of the current settable settings
par(col.lab="red") # red x and y labels
hist(mtcars$mpg) # create a plot with these new settings
par(opar) # restore original settings
A second way to specify graphical parameters is by providing the optionname=value pairs
directly to a high level plotting function. In this case, the options are only in effect for that
specific graph.
# Set a graphical parameter within the plotting function
hist(mtcars$mpg, col.lab="red")
See the help for a specific high level plotting function (e.g. plot, hist, boxplot) to determine
which graphical parameters can be set this way.
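When par() is changed inside a function, a common pattern is to restore the previous settings automatically with on.exit(). The sketch below illustrates that pattern; the function name hist_red_labels is hypothetical. It relies on the fact that par() returns the old values of the parameters it changes.

```r
# Hypothetical helper: draws a histogram with red axis labels and
# leaves the session's graphical parameters as it found them.
hist_red_labels <- function(x) {
  opar <- par(col.lab = "red")  # par() returns the previous value(s)
  on.exit(par(opar))            # restore them when the function exits
  hist(x)
}

hist_red_labels(mtcars$mpg)
```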
The remainder of this section describes some of the more important graphical parameters that
you can set.
Text and Symbol Size
The following options can be used to control text and symbol size in graphs.
option     description
cex        number indicating the amount by which plotting text and symbols
           should be scaled relative to the default. 1=default, 1.5 is 50% larger,
           0.5 is 50% smaller, etc.
cex.axis   magnification of axis annotation relative to cex
cex.lab    magnification of x and y labels relative to cex
cex.main   magnification of titles relative to cex
cex.sub    magnification of subtitles relative to cex
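As an illustration of the size options, the sketch below passes several of them directly to plot(). The particular scaling values and the label text are arbitrary choices for this example.

```r
# Scatter plot of weight against mileage with scaled text and symbols.
res <- plot(mtcars$wt, mtcars$mpg,
            cex = 1.5,       # points 50% larger than the default
            cex.axis = 0.8,  # smaller axis annotation
            cex.lab = 1.2,   # larger x and y labels
            cex.main = 1.5,  # larger title
            main = "Mileage by weight",
            xlab = "Weight", ylab = "Miles per gallon")
```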
Plotting Symbols
Use the pch= option to specify symbols to use when plotting points. For symbols 21 through 25,
specify border color (col=) and fill color (bg=).
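The fillable symbols can be previewed with a short sketch such as the one below. The colors, positions and sizes are arbitrary choices for illustration.

```r
# Symbols 21 to 25 take a border color (col) and a fill color (bg).
symbols_to_show <- 21:25
plot(seq_along(symbols_to_show), rep(1, length(symbols_to_show)),
     pch = symbols_to_show, col = "black", bg = "orange", cex = 2,
     xlab = "", ylab = "", yaxt = "n", main = "pch 21 to 25")

# Label each symbol with its pch number (pos = 3 places text above).
text(seq_along(symbols_to_show), rep(1, length(symbols_to_show)),
     labels = symbols_to_show, pos = 3)
```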
Lines
You can change lines using the following options. This is particularly useful for reference lines,
axes, and fit lines.
option     description
lty        line type. see the chart below.
lwd        line width relative to the default (default=1). 2 is twice as wide.
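For example, a fit line and a reference line can be told apart by line type and width. The sketch below is one possible illustration using mtcars; the choice of variables and the object name fit are assumptions of this example.

```r
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")

# Fit line: solid (lty = 1) and twice the default width.
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, lty = 1, lwd = 2)

# Reference line at the mean mileage: dashed (lty = 2), default width.
abline(h = mean(mtcars$mpg), lty = 2)
```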
Colors
Options that specify colors include the following.
option     description
col        Default plotting color. Some functions (e.g. lines) accept a vector
           of values that are recycled.
col.axis   color for axis annotation
col.lab    color for x and y labels
col.main   color for titles
col.sub    color for subtitles
fg         plot foreground color (axes, boxes - also sets col= to same)
bg         plot background color
You can specify colors in R by index, name, hexadecimal, or RGB.
For example, with the default palette, col=1, col="black", and col="#000000" are equivalent.
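The different ways of naming a color can be converted into one another. The sketch below uses palette(), rgb() and col2rgb() from the grDevices package; the variable names are chosen only for this example.

```r
# An index refers to a position in the current palette.
first_color <- palette()[1]
print(first_color)

# rgb() builds a hexadecimal string from red, green and blue components.
hex_black <- rgb(0, 0, 0)
hex_white <- rgb(255, 255, 255, maxColorValue = 255)
print(c(hex_black, hex_white))

# col2rgb() goes the other way, from a name or hex string to components.
print(col2rgb("white"))
```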
The following chart was produced with code developed by Earl F. Glynn. See his Color Chart for
all the details you would ever need about using colors in R.
You can also create a vector of n contiguous colors using the
functions rainbow(n), heat.colors(n), terrain.colors(n), topo.colors(n), and cm.colors(n).
colors() returns all available color names.
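A quick way to inspect one of these palettes is to draw it as equal-height bars. The sketch below uses heat.colors(); the number of colors and the title are arbitrary choices for illustration.

```r
n <- 6
cols <- heat.colors(n)  # character vector of n hexadecimal colors
print(cols)

# One bar per color, all of the same height.
barplot(rep(1, n), col = cols, names.arg = seq_len(n),
        main = "heat.colors(6)")
```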
Fonts
You can easily set font size and style, but font family is a bit more complicated.
option      description
font        Integer specifying font to use for text.
            1=plain, 2=bold, 3=italic, 4=bold italic, 5=symbol
font.axis   font for axis annotation
font.lab    font for x and y labels
font.main   font for titles
font.sub    font for subtitles
ps          font point size (roughly 1/72 inch); text size = ps*cex
family      font family for drawing text. Standard values are "serif",
            "sans", "mono", "symbol". Mapping is device dependent.
On Windows, mono is mapped to "TT Courier New", serif is mapped to "TT Times New Roman",
sans is mapped to "TT Arial", and symbol is mapped to "TT Symbol" (TT = TrueType). You can
add your own mappings.
# Type family examples - creating new mappings
# windowsFonts() is available only on Windows
plot(1:10,1:10,type="n")
windowsFonts(
A=windowsFont("Arial Black"),
B=windowsFont("Bookman Old Style"),
C=windowsFont("Comic Sans MS"),
D=windowsFont("Symbol")
)
text(3,3,"Hello World Default")
text(4,4,family="A","Hello World from Arial Black")
text(5,5,family="B","Hello World from Bookman Old Style")
text(6,6,family="C","Hello World from Comic Sans MS")
text(7,7,family="D", "Hello World from Symbol")
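For a sketch that does not depend on Windows font mappings, the standard family names and the font style options can be combined as below. The text strings and positions are arbitrary, and how each family is rendered depends on the graphics device.

```r
# Switch the default family to serif, remembering the old setting.
opar <- par(family = "serif")

plot(1:10, 1:10, type = "n",
     main = "Font options",
     font.main = 4,  # bold italic title
     font.lab = 3,   # italic axis labels
     font.axis = 2)  # bold axis annotation

text(5, 7, "Plain serif text")
text(5, 5, "Bold serif text", font = 2)
text(5, 3, "Monospaced text", family = "mono")

par(opar)  # restore the previous family
```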
Margins and Graph Size
You can control the margin size using the following parameters.
option   description
mar      numerical vector indicating margin size c(bottom, left, top, right) in
         lines. default = c(5, 4, 4, 2) + 0.1
mai      numerical vector indicating margin size c(bottom, left, top, right)
         in inches
pin      plot dimensions (width, height) in inches
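As a brief illustration, the sketch below narrows the margins for one plot and then restores the previous values. The margin sizes chosen here are arbitrary.

```r
# Shrink the margins: c(bottom, left, top, right), measured in lines.
opar <- par(mar = c(4, 4, 2, 1))
print(par("mar"))

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon",
     main = "Narrow margins")

# Restore the earlier margins (the default is c(5, 4, 4, 2) + 0.1).
par(opar)
print(par("mar"))
```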
For complete information on margins, see Earl F. Glynn's margin tutorial.