* Basic Data Visualization
* Data visualization is an efficient technique for gaining insight about data
through a visual medium.
* With data visualization techniques, we can work with large datasets
and efficiently obtain key insights from them.
* Graphics play an important role in bringing out the important features of
the data.
Different Data Visualization charts
R Bar Charts
R Pie Charts
R Histogram
R Boxplot
R Scatterplots
* A) R Bar Charts: A bar chart is a pictorial representation in which
numerical values of variables are represented by the length or height
of lines or rectangles of equal width.
* A bar chart is used for summarizing a set of categorical data.
* In a bar chart, the data is shown through rectangular bars, with
the length of each bar proportional to the value of the variable.
Syntax:
barplot(H, xlab, ylab, main, names.arg, col)
No.  Parameter   Description
1.   H           A vector or matrix which contains the numeric values used in the bar chart.
2.   xlab        A label for the x-axis.
3.   ylab        A label for the y-axis.
4.   main        A title of the bar chart.
5.   names.arg   A vector of names that appear under each bar.
6.   col         It is used to give colors to the bars in the graph.
# Creating the data for the bar chart
H <- c(12, 35, 54, 3, 41)
M <- c("Feb", "Mar", "Apr", "May", "Jun")
# Giving the chart file a name
png(file = "bar_properties.png")
# Plotting the bar chart
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue", col = "green",
        main = "Revenue Bar chart", border = "red")
# Saving the file
dev.off()
*Output:
Group Bar Chart & Stacked Bar Chart
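* We can create bar charts with groups of bars or with stacked bars by passing a
matrix as the input. Each column of the matrix becomes one bar (or one group of
bars), and each row becomes one segment within it.
* By default, barplot() stacks the rows of the matrix; setting beside = TRUE draws
them side by side as a grouped bar chart.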
* Example:
months <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
# Creating the matrix of the values (one row per region, one column per month).
Values <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16),
                 nrow = 3, ncol = 5, byrow = TRUE)
# Giving the chart file a name
png(file = "stacked_chart.png")
# Creating the stacked bar chart
barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
        ylab = "Revenue", col = c("cadetblue3", "deeppink2", "goldenrod1"))
# Adding the legend to the chart
legend("topleft", regions, cex = 1.3,
       fill = c("cadetblue3", "deeppink2", "goldenrod1"))
# Saving the file
dev.off()
*Output
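The example above draws a stacked chart. As a minimal sketch of the grouped variant, the same matrix can be passed to barplot() with beside = TRUE. The variable name bar_mids is only illustrative, and the chart is drawn on whatever graphics device is currently active rather than saved to a named file.

```r
# Same revenue matrix as in the stacked example: rows are regions, columns are months.
months  <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
Values  <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16),
                  nrow = 3, ncol = 5, byrow = TRUE)

# beside = TRUE places the three regional bars next to each other for every month.
# legend.text builds the legend from the region names.
bar_mids <- barplot(Values, beside = TRUE, names.arg = months,
                    main = "Total Revenue", xlab = "Month", ylab = "Revenue",
                    col = c("cadetblue3", "deeppink2", "goldenrod1"),
                    legend.text = regions)

# barplot() returns the x positions of the bars (one row per region, one column per month).
print(dim(bar_mids))
```

The returned positions are useful when text or error bars need to be added on top of individual bars.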
R Pie Charts
* A pie-chart is a representation of values in the form of slices
of a circle with different colors.
* Slices are labeled with a description, and the numbers
corresponding to each slice are also shown in the chart.
* Pie charts are created with the help of the pie() function,
which takes positive numbers as vector input.
* Syntax:
* pie(x, labels, radius, main, col, clockwise)
* 1. x: is a vector that contains the numeric values
used in the pie chart.
* 2. labels: are used to give the description to the
slices.
* 3. radius: describes the radius of the pie chart.
* 4. main: describes the title of the chart.
* 5. col: defines the colour palette.
* 6. clockwise: is a logical value that indicates the
clockwise or anti-clockwise direction in which slices
are drawn.
Example:
# Creating data for the graph.
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
# Giving the chart file a name (the extension matches the png device).
png(file = "title_color.png")
# Plotting the chart.
pie(x, labels, main = "Country Pie Chart", col = rainbow(length(x)))
# Saving the file.
dev.off()
Output
* R Histogram
* A histogram is a type of bar chart that shows how many
values fall into each of a set of consecutive value
ranges (bins).
* A histogram is used to show a distribution, whereas
a bar chart is used for comparing different entities.
* In a histogram, the height of each bar represents
the number of values present in the given range.
* For creating a histogram, R provides the hist() function,
which takes a vector as input.
* Syntax:
* hist(v,main,xlab,ylab,xlim,ylim,breaks,col,border)
No.  Parameter  Description
1.   v          It is a vector that contains numeric values.
2.   main       It indicates the title of the chart.
3.   col        It is used to set the color of the bars.
4.   border     It is used to set the border color of each bar.
5.   xlab       It is used to describe the x-axis.
6.   ylab       It is used to describe the y-axis.
7.   xlim       It is used to specify the range of values on the x-axis.
8.   ylim       It is used to specify the range of values on the y-axis.
9.   breaks     It is used to specify the break points between bars (or the
                suggested number of bars), which determines the width of each bar.
Example:
# Creating data for the graph.
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)
# Giving a name to the chart file.
png(file = "histogram_chart.png")
# Creating the histogram.
hist(v, xlab = "Weight", ylab = "Frequency", col = "green", border = "red")
# Saving the file.
dev.off()
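To illustrate the breaks parameter, here is a small sketch that bins the same vector into ranges of width 10. The bin edges (0 to 60) and the names bins and h are choices made for this illustration, not part of the original example.

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

# Explicit break points: six bins of width 10 covering 0 to 60.
bins <- seq(0, 60, by = 10)

# plot = FALSE returns the binning without drawing, so the counts can be inspected.
h <- hist(v, breaks = bins, plot = FALSE)
print(h$counts)

# Drawing the histogram with the same break points.
hist(v, breaks = bins, xlab = "Weight", ylab = "Frequency",
     col = "green", border = "red")
```

Each count is the height of the corresponding bar, and the counts add up to the number of values in v.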
R Boxplot
* A boxplot summarizes how the values in a data set are distributed.
It is based on the quartiles, which divide the data set into four
parts. The graph shows the minimum, first quartile, median, third
quartile, and maximum.
* Boxplots are also useful for comparing the distribution of data
across several groups by drawing a boxplot for each of them.
* R provides the boxplot() function to create a boxplot.
* Syntax:
boxplot(x, data, notch, varwidth, names, main)
x         It is a vector or a formula.
data      It is the data frame.
notch     It is a logical value; set it to TRUE to draw a notch.
varwidth  It is also a logical value; set it to TRUE to draw each box
          with a width that grows with the sample size of its group.
names     It is the group of labels that will be printed under each boxplot.
main      It is used to give a title to the graph.
* Example: In the example below, we use the "mtcars"
dataset that ships with R. We use only two of its
columns, "mpg" and "cyl".
* The example creates a boxplot of mpg against cyl, i.e.,
miles per gallon for each number of cylinders.
# Giving a name to the chart file.
png(file = "[Link]")
# Plotting the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Quantity of Cylinders",
        ylab = "Miles Per Gallon", main = "R Boxplot Example")
# Saving the file.
dev.off()
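The numbers behind such a boxplot can also be inspected directly. The sketch below calls boxplot() with plot = FALSE and labels the rows of the returned statistics; the names bp and five_num are illustrative only.

```r
# plot = FALSE computes the boxplot statistics without drawing anything.
bp <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# bp$stats has one column per group and five rows, from bottom to top of each box.
five_num <- bp$stats
rownames(five_num) <- c("lower whisker", "lower hinge", "median",
                        "upper hinge", "upper whisker")
colnames(five_num) <- bp$names
print(five_num)
```

The whisker rows are the most extreme values that are not flagged as outliers, so they can differ from the raw minimum and maximum of a group.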
Boxplot using notch: In R, we can draw a notched boxplot by setting notch = TRUE.
Example:
# Giving a name to our chart.
png(file = "boxplot_using_notch.png")
# Plotting the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Quantity of Cylinders",
        notch = TRUE, varwidth = TRUE,
        col = c("green", "yellow", "red"),
        names = c("High", "Medium", "Low"))
# Saving the file.
dev.off()
R Scatterplots
* Scatter plots are used to compare variables. A
comparison between variables is required when we need to
see how much one variable is related to another
variable.
* In a scatterplot, the data is represented as a collection of
points. Each point on the scatterplot defines the values of
the two variables.
* One variable is selected for the vertical axis and other for
the horizontal axis.
* Syntax:
* plot(x, y, main, xlab, ylab, xlim, ylim, axes)
1. x It is the dataset whose values are
the horizontal coordinates.
2. y It is the dataset whose values are
the vertical coordinates.
3. main It is the title of the graph.
4. xlab It is the label on the horizontal axis.
5. ylab It is the label on the vertical axis.
6. xlim It gives the limits of the x values that are
used for plotting.
7. ylim It gives the limits of the y values that
are used for plotting.
8. axes It indicates whether both axes should be
drawn on the plot.
* Example: In our example, we will use the dataset
"mtcars", which is a predefined dataset available in the
R environment.
# Fetching two columns from mtcars
data <- mtcars[, c("wt", "mpg")]
# Giving a name to the chart file.
png(file = "[Link]")
# Plotting the chart for cars with weight between 2.5 and 5
# and mileage between 15 and 30.
plot(x = data$wt, y = data$mpg, xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5, 5), ylim = c(15, 30), main = "Weight vs Mileage")
# Saving the file.
dev.off()
Scatterplot using ggplot2
In R, another way to create a scatterplot is with the
help of the ggplot2 package.
* The ggplot2 package provides the ggplot() and geom_point() functions for
creating a scatterplot.
* The ggplot() function takes a series of inputs. The first
argument is the data frame, and the second is the aes() function, in
which we map columns to the x-axis and y-axis.
Example:
# Loading ggplot2 package
library(ggplot2)
# Giving a name to the chart file.
png(file = "scatterplot_ggplot.png")
# Plotting the chart using ggplot() and geom_point() functions.
# print() makes sure the plot is drawn when the code runs as a script.
print(ggplot(mtcars, aes(x = drat, y = mpg)) + geom_point())
# Saving the file.
dev.off()
Statistics
* Statistics is a form of mathematical analysis that
concerns the collection, organization, analysis,
interpretation, and presentation of data.
R – Statistics: R is a programming language and environment
for statistical computing and graphics.
Average/mean in R Programming: The average is calculated by
dividing the sum of the values in the set by the number of values.
# Calculate mean of vector
vec <- c(6, 7, 8, 9, 10, 11, 12)
mean(vec)
# Output
[1] 9
Variance in R Programming Language: Variance is the average of the
squared differences between each value and the mean. For a population of N values:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$
Example:
Find the population variance of the data [5, 7, 9, 10, 14, 15].
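One way to work this example in R is sketched below. Note that the built-in var() function divides by N - 1 (the sample variance), so the population formula with N in the denominator is computed by hand here; the variable names are illustrative.

```r
x <- c(5, 7, 9, 10, 14, 15)
n <- length(x)

# Population variance: mean of the squared deviations from the mean (divide by N).
pop_var <- sum((x - mean(x))^2) / n
print(pop_var)

# var() gives the sample variance instead (divide by N - 1).
sample_var <- var(x)
print(sample_var)
```

Here the mean is 10 and the squared deviations add up to 76, so the population variance is 76/6 while var() returns 76/5.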
* Standard Deviation in R Programming Language: Standard deviation
is the square root of the variance.
Find the standard deviation of the data
A) (2, 4, 6, 8, 10, 14)
B) (20.15, 10, 40)
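As a hedged sketch for data set A, the code below computes the standard deviation both ways. The built-in sd() function uses the sample formula (N - 1 in the denominator), so the population version is written out by hand; which of the two the exercise intends is an assumption left to the reader.

```r
a <- c(2, 4, 6, 8, 10, 14)

# Population standard deviation: square root of the mean squared deviation.
pop_sd <- sqrt(mean((a - mean(a))^2))
print(pop_sd)

# Sample standard deviation, as returned by the built-in sd().
sample_sd <- sd(a)
print(sample_sd)
```

The same two lines can be reused for data set B by replacing the vector.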
Random Variable
* A random variable is a function that assigns a real value to each
outcome of a random experiment.
The two types of random variables are:
1) Discrete Random Variable
2) Continuous Random Variable
1)Discrete Random Variable:
* In probability theory, a discrete random variable is a type
of random variable that can take on a finite or countable
number of distinct values.
* These values are often integers or whole numbers, but they
can also be other discrete values.
* For example, the number of heads obtained after flipping a
coin three times is a discrete random variable.
* The possible values of this variable are 0, 1, 2, or 3.
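The coin example can be tabulated in R. This sketch assumes a fair coin (probability of heads 0.5), which the text does not state explicitly, and uses dbinom() to get the probability of each possible number of heads.

```r
# Possible values of the discrete random variable: number of heads in 3 flips.
heads <- 0:3

# Probability of each value, assuming a fair coin.
pmf <- dbinom(heads, size = 3, prob = 0.5)
names(pmf) <- heads
print(pmf)

# The probabilities over all possible values add up to 1.
print(sum(pmf))
```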
2)Continuous Random Variable:
* Consider a generalized experiment rather than taking some
particular experiment.
* Suppose that in your experiment, the outcome of this
experiment can take values in some interval (a, b).
* That means that each and every single point in the interval
can be taken up as the outcome values when you do the
experiment.
* Thus, X= {x: x belongs to (a, b)}
* Example: The speed of a vehicle on a highway.
common probability distribution:
A common probability distribution is a well-known
mathematical function that describes how probabilities are
distributed over the values of a random variable.
* In simpler terms: A probability distribution tells you how
likely each possible outcome is.
* Describes outcomes of a random variable (either discrete or
continuous)
* Has a probability function (e.g., PMF for discrete, PDF for
continuous)
* The total probability across all outcomes is always 1
Types of Common Probability Distributions:
1. Common probability mass functions
2. Common probability density functions
1. Common probability mass function/ Probability Mass
Functions (PMFs):
* Used for Discrete random variables.
* Gives the probability that a discrete random variable is
exactly equal to some value.
* used for countable outcomes
a) Bernoulli Distribution
b)Binomial Distribution
c)Poisson Functions
* Bernoulli Distribution
* Bernoulli Distribution is a special case of Binomial distribution
where only a single trial is performed.
* It is a discrete probability distribution for a Bernoulli trial (a
trial that has only two outcomes i.e. either success or failure).
* For example, a coin toss is a Bernoulli trial in which
the probability of getting a head is 0.5 and the probability
of getting a tail is 0.5.
* It is the probability distribution of a random variable that takes
the value 1 with probability p and the value 0 with probability q = 1 - p.
The probability mass function f of this distribution, over
possible outcomes k, is given by:
$f(k; p) = p^k (1 - p)^{1 - k}, \quad k \in \{0, 1\}$
* In R, the Rlab package provides 4 functions
for the Bernoulli distribution.
They are:
dbern()
pbern()
qbern()
rbern()
A) dbern(): The dbern() function gives the probability mass
function of the Bernoulli distribution.
* Syntax: dbern(x, prob, log = FALSE)
* Parameter:
x: vector of quantiles
prob: probability of success on each trial
log: logical; if TRUE, probabilities p are given as log(p)
Example:
# Importing the Rlab library
library(Rlab)
# x values for the dbern() function
x <- c(0, 1, 3, 5, 7, 10)
# Using dbern() to obtain the corresponding Bernoulli probabilities
y <- dbern(x, prob = 0.5)
# Plotting dbern values
plot(x, y, type = "o")
Output:
B)pbern():
* The pbern() function gives the cumulative distribution
function of the Bernoulli distribution.
* The distribution function or cumulative distribution
function (CDF) or cumulative frequency function, describes
the probability that a variate X takes on a value less than
or equal to a number x.
* Syntax: pbern(q, prob, log.p = FALSE)
* Parameter:
* q: vector of quantiles
* prob: probability of success on each trial
* log.p: logical; if TRUE, probabilities p are given as log(p).
C) qbern(): qbern() gives the quantile function for the
Bernoulli distribution.
A quantile function in statistical terms specifies the value of
the random variable such that the probability of the variable
being less than or equal to that value equals the given
probability.
* Syntax: qbern(p, prob, log.p = FALSE)
* Parameter:
* p: vector of probabilities
* prob: probability of success on each trial
* log.p: logical; if TRUE, probabilities p are given as log(p).
D)rbern()
* rbern( ) function in R programming is used to generate a
vector of random numbers which are Bernoulli
distributed.
* Syntax: rbern(n, prob)
* Parameter:
* n: number of observations.
* prob: probability of success on each trial.
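If the Rlab package is not installed, the same quantities can be sketched with base R, because a Bernoulli variable is a binomial variable with a single trial. This substitution (dbinom(), pbinom(), qbinom(), rbinom() with size = 1) and the success probability 0.3 are assumptions made for this illustration, not the method used in the examples above.

```r
# Assumption: Bernoulli(p) is treated as Binomial(size = 1, prob = p),
# so only base R is needed (Rlab is not loaded here).
p <- 0.3  # illustrative probability of success

# Probability mass function at 0 (failure) and 1 (success).
pmf <- dbinom(c(0, 1), size = 1, prob = p)
print(pmf)

# The same values from the formula p^k * (1 - p)^(1 - k).
k <- c(0, 1)
manual <- p^k * (1 - p)^(1 - k)
print(manual)

# Cumulative distribution function P(X <= k).
cdf <- pbinom(c(0, 1), size = 1, prob = p)
print(cdf)

# Quantile function: smallest k with P(X <= k) >= 0.5.
med <- qbinom(0.5, size = 1, prob = p)
print(med)

# Ten random Bernoulli draws (each is 0 or 1).
set.seed(42)
draws <- rbinom(10, size = 1, prob = p)
print(draws)
```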
B)Binomial Distribution
* The binomial distribution is a discrete probability
distribution in statistics that models the number of
successes in a fixed number of independent Bernoulli trials
(yes/no experiments), each with the same probability of
success.
Key Characteristics of a Binomial Distribution
* Fixed number of trials (n): You repeat the experiment a set
number of times.
* Only two possible outcomes per trial: Success (e.g., heads)
or failure (e.g., tails).
* Constant probability of success (p): The chance of success is
the same for each trial.
* Independent trials: The outcome of one trial does not affect
another.
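* The probability of exactly k successes in n trials is
$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
* For example, with n = 5, k = 3 and p = 0.5:
P(X = 3) = 10 * 0.125 * 0.25
= 0.3125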
* In R, there are 4 built-in functions
for the binomial distribution. They are:
* dbinom()
* pbinom()
* qbinom()
* rbinom()
A) dbinom(): This function gives the probability of exactly k
successes, P(X = k).
EX: Probability of exactly 3 successes in 5 trials
dbinom(3, size = 5, prob = 0.5)
# Output: 0.3125
B) pbinom(): The function pbinom() is used to find the
cumulative probability of data following a binomial
distribution up to a given value, i.e., it finds P(X <= k).
Syntax:
pbinom(k, n, p)
Where,
n is the total number of trials, p is the probability of success, and k
is the value at which the probability has to be found.
EX: Probability of 3 or fewer successes
pbinom(3, size = 5, prob = 0.5)
# Output: 0.8125
C) qbinom(): This is the quantile function: given a probability P,
it finds the smallest k such that P(X <= k) >= P.
Syntax:
qbinom(P, n, p)
EX: Smallest x such that P(X ≤ x) ≥ 0.8
qbinom(0.8, size = 5, prob = 0.5)
# Output: 3
D) rbinom(): This function generates n random values from a
binomial distribution with N trials and success probability p.
Syntax:
rbinom(n, N, p)
EX: Generate 10 random binomial outcomes
rbinom(10, size = 5, prob = 0.5)
# Output: e.g., 2 3 1 4 3 2 3 2 4 3
Function               Description                                                    Example
dbinom(x, size, prob)  Probability mass function (PMF): P(X = x)                      dbinom(3, size = 5, prob = 0.5)
pbinom(q, size, prob)  Cumulative distribution function (CDF): P(X ≤ q)               pbinom(3, size = 5, prob = 0.5)
qbinom(p, size, prob)  Quantile function: finds the smallest x such that P(X ≤ x) ≥ p  qbinom(0.8, size = 5, prob = 0.5)
rbinom(n, size, prob)  Random generation of binomial values                           rbinom(10, size = 5, prob = 0.5)
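The relationship between these functions can be checked with a short sketch that uses the same numbers as the examples (5 trials, success probability 0.5). The variable names are illustrative.

```r
n <- 5    # number of trials
k <- 3    # number of successes
p <- 0.5  # probability of success on each trial

# PMF from the formula choose(n, k) * p^k * (1 - p)^(n - k) and from dbinom().
manual_pmf  <- choose(n, k) * p^k * (1 - p)^(n - k)
builtin_pmf <- dbinom(k, size = n, prob = p)
print(c(manual_pmf, builtin_pmf))

# The CDF is the running sum of the PMF: P(X <= 3) = P(0) + P(1) + P(2) + P(3).
cdf_from_pmf <- sum(dbinom(0:k, size = n, prob = p))
builtin_cdf  <- pbinom(k, size = n, prob = p)
print(c(cdf_from_pmf, builtin_cdf))

# qbinom() inverts pbinom(): the smallest x whose cumulative probability reaches 0.8.
print(qbinom(0.8, size = n, prob = p))
```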
C) Poisson Distribution: The Poisson distribution gives
the probability of a given number of events happening
in a fixed interval of space or time, if these events happen
at a known constant mean rate.
In statistics, the Poisson distribution is used to model
the number of times an event occurs in a fixed interval
of time or space, under the assumption that:
*Events occur independently,
*At a constant average rate,
*And two events cannot occur at exactly the same
instant.
* There are four Poisson functions available in R:
* A) dpois(): This function gives the Poisson probability mass
function and can be used to illustrate it in an R plot.
* The function dpois() calculates the probability that the
random variable is exactly equal to a given value,
P(X = x).
* Syntax:
* dpois(x, lambda, log = FALSE)
* where,
* x: number of events that happened in an interval
* lambda (λ): mean number of events per interval
* log: If TRUE, the function returns the probability as a
logarithm
B) ppois(): This function gives the cumulative probability
function and can be used to illustrate it in an R plot. The function
ppois() calculates the probability that a random variable
will be equal to or less than a number.
* Syntax:
ppois(q, lambda, lower.tail = TRUE, log.p = FALSE)
Where,
q: number of events that happened in an interval
lambda (λ): mean number of events per interval
lower.tail: If TRUE, the left tail P(X ≤ q) is considered;
if FALSE, the right tail P(X > q) is considered
log.p: If TRUE, the function returns the probability as
a logarithm
C) qpois(): The function qpois() is used for generating quantiles
of a given Poisson distribution.
In probability, quantiles are marked points that divide the
range of a probability distribution into intervals
which have equal probabilities.
* Syntax:
* qpois(p, lambda, lower.tail = TRUE, log.p = FALSE)
* where,
* p: vector of probabilities
* lambda (λ): mean number of events per interval
* lower.tail: If TRUE, the left tail is considered; if
FALSE, the right tail is considered
* log.p: If TRUE, the probabilities p are given as
logarithms
D) rpois(): The function rpois() is used for generating
random numbers from a given Poisson distribution.
Syntax:
rpois(n, lambda)
Where, n: number of random numbers needed
lambda (λ): mean number of events per interval
Function          Purpose                                             Example
dpois(x, lambda)  PMF: P(X = x)                                       dpois(2, lambda = 4)
ppois(q, lambda)  CDF: P(X ≤ q)                                       ppois(2, lambda = 4)
qpois(p, lambda)  Quantile: Find smallest x such that P(X ≤ x) ≥ p    qpois(0.9, lambda = 4)
rpois(n, lambda)  Random generation from Poisson                      rpois(10, lambda = 4)
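A minimal sketch of the four functions, using the same mean rate (lambda = 4) as the table. The manual formula exp(-lambda) * lambda^x / x! is the standard Poisson PMF; the variable names and the seed are illustrative.

```r
lambda <- 4  # mean number of events per interval

# PMF at x = 2, from the formula and from dpois().
manual_pmf  <- exp(-lambda) * lambda^2 / factorial(2)
builtin_pmf <- dpois(2, lambda = lambda)
print(c(manual_pmf, builtin_pmf))

# CDF: P(X <= 2), and the complementary right tail P(X > 2).
lower <- ppois(2, lambda = lambda)
upper <- ppois(2, lambda = lambda, lower.tail = FALSE)
print(c(lower, upper))

# Quantile: smallest x such that P(X <= x) >= 0.9.
q90 <- qpois(0.9, lambda = lambda)
print(q90)

# Ten random counts from the same distribution.
set.seed(1)
counts <- rpois(10, lambda = lambda)
print(counts)
```

The left and right tail probabilities always add up to 1, which is a quick check that lower.tail is being used as intended.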
2. Common Probability Density Functions /Continuous
Distributions: (used for measurable outcomes)
* Probability Density Functions (PDFs)
* Used for Continuous random variables.
* Describes the likelihood of a random variable falling within a
particular range of values.
* The probability of the variable taking on an exact value is
zero. Instead, we calculate the probability over an interval.
A)Uniform Distribution
B)Normal Distribution in R
C)Student’s t-distribution
A) Uniform Distribution: The continuous uniform distribution is
the probability distribution of a random number selected
from the continuous interval between a and b.
* A uniform distribution has the same probability density over the
entire interval.
* Thus, its plot is a rectangle, and therefore it is often
referred to as the rectangular distribution.
* Probability Density Function
dunif(): This function gives the density of the uniform distribution
on the specified interval (a, b), which is 1/(b - a) inside the
interval and 0 outside it.
Syntax:
* dunif(x, min = 0, max = 1, log = FALSE)
Parameter:
x: input sequence of values
min, max: range of values
log: logical; if TRUE, the densities are returned as
logarithms.
Cumulative probability distribution:
* The punif() function in R is used to calculate the uniform
cumulative distribution function, that is, the probability of
a variable X taking a value less than or equal to x, P(X <= x).
* Syntax:
punif(q, min = 0, max = 1, lower.tail = TRUE)
The runif() function in R programming language is used
to generate a sequence of random numbers following the uniform
distribution.
* Syntax:
* runif(n, min = 0, max = 1)
* Parameter:
* n= number of random samples
* min=minimum value(by default 0)
* max=maximum value(by default 1)
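The three uniform functions can be tried together in a short sketch. The interval (0, 2), the evaluation point 0.5, and the seed are arbitrary choices for illustration.

```r
a <- 0  # lower end of the interval
b <- 2  # upper end of the interval

# Density is constant inside the interval: 1 / (b - a).
d <- dunif(0.5, min = a, max = b)
print(d)

# Cumulative probability P(X <= 0.5) = (0.5 - a) / (b - a).
p <- punif(0.5, min = a, max = b)
print(p)

# Five random numbers drawn uniformly from the interval.
set.seed(123)
samples <- runif(5, min = a, max = b)
print(samples)
```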
In statistics, the normal distribution is a very common and
important probability distribution. It’s often called the bell
curve because of its characteristic bell-shaped graph.
* Symmetrical:
The left and right sides of the curve are mirror images.
* Mean = Median = Mode:
All three are located at the center of the distribution.
* Bell-shaped Curve:
Most of the data is clustered around the mean, with fewer
values as you move further away.
B) Normal Distribution in R: The normal distribution is a
probability distribution used in statistics that describes how
data values are distributed.
* For example, the heights of a population, shoe sizes, IQ
levels, and many more are approximately normally distributed.
* It is generally observed that data is approximately normally
distributed when it is a random collection of data from independent
sources.
* The graph produced by plotting the value of the variable
on the x-axis and the count of the values on the y-axis is a
bell-shaped curve.
* The peak of the graph is at the mean of the data set; half of
the values of the data set lie to the left of the mean and the
other half lie to the right of the mean.
In R, there are 4 built-in functions for the normal
distribution.
A)dnorm()
B) pnorm()
C) qnorm()
D) rnorm()
A) dnorm():
* The dnorm() function gives the density function of the
normal distribution.
* Syntax:
dnorm(x, mean, sd)
B) pnorm():
* The pnorm() function is the cumulative distribution function,
which gives the probability that a random number X
takes a value less than or equal to q.
* Syntax:
* pnorm(q, mean, sd)
C) qnorm():
* The qnorm() function is the inverse of the pnorm() function. It takes
a probability value and returns the value of the variable whose
cumulative probability equals it.
* It is useful in finding the percentiles of a normal distribution.
* Syntax:
qnorm(p, mean, sd)
D) rnorm():
* The rnorm() function is used to generate a
vector of n random numbers which are normally distributed.
* Syntax:
rnorm(n, mean, sd)
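A brief sketch of the four functions follows. It uses the standard normal distribution (mean 0, sd 1) for the first three calls and an arbitrary mean of 50 and sd of 5 for the random sample; these values and the seed are illustrative assumptions.

```r
# Density at the mean of the standard normal: the peak of the bell curve.
peak <- dnorm(0, mean = 0, sd = 1)
print(peak)

# Probability of falling within one standard deviation of the mean.
p_within_1sd <- pnorm(1) - pnorm(-1)
print(p_within_1sd)

# qnorm() is the inverse of pnorm(): the 97.5th percentile and back again.
q975 <- qnorm(0.975)
round_trip <- pnorm(q975)
print(c(q975, round_trip))

# 1000 random values from a normal distribution with mean 50 and sd 5.
set.seed(10)
sample_values <- rnorm(1000, mean = 50, sd = 5)
print(mean(sample_values))
```

Passing sample_values to hist() gives the bell-shaped histogram described above.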
C) Student’s t-distribution:
* The Student’s t-distribution is a continuous probability
distribution generally used when dealing with statistics
estimated from a sample of data.
* Any particular t-distribution looks a lot like the standard
normal distribution: it is bell-shaped, symmetric, and
centered on zero.
* The difference is that while a normal distribution is typically
used to deal with a population, the t-distribution deals with
a sample from a population.
Functions used:
A)dt()
B)pt()
C)qt()
A) dt(): To find the value of the probability density function (pdf) of the
Student’s t-distribution for a given value x, use the dt()
function in R.
* Syntax: dt(x, df)
Parameters:
x is the quantiles vector
df is the degrees of freedom
B) pt(): This function is used to get the cumulative distribution function
(CDF) of a t-distribution.
Syntax:
pt(q, df, lower.tail = TRUE)
Parameter:
q is the quantiles vector
df is the degrees of freedom
lower.tail: if TRUE (default), probabilities are P[X ≤ x],
otherwise, P[X > x].
C) qt(): This function is used to get the quantile function, or inverse
cumulative distribution function, of a t-distribution.
* Syntax: qt(p, df, lower.tail = TRUE)
Parameter:
p is the vector of probabilities
df is the degrees of freedom
lower.tail: if TRUE (default), probabilities are P[X ≤ x],
otherwise, P[X > x].
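The three functions can be sketched together as follows. The choice of 10 degrees of freedom and the 0.975 probability are illustrative assumptions; the comparison with qnorm() shows that the t-distribution has heavier tails than the standard normal.

```r
df <- 10  # illustrative degrees of freedom

# Density at zero, the center of the distribution.
density_at_0 <- dt(0, df = df)
print(density_at_0)

# 97.5th percentile of the t-distribution.
t_crit <- qt(0.975, df = df)
print(t_crit)

# pt() inverts qt(); lower.tail = FALSE gives the right tail P[X > x] instead.
round_trip <- pt(t_crit, df = df)
upper_tail <- pt(t_crit, df = df, lower.tail = FALSE)
print(c(round_trip, upper_tail))

# The matching standard normal percentile is smaller because its tails are lighter.
normal_crit <- qnorm(0.975)
print(normal_crit)
```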