R Programming for Statistical Analysis
Statistics:
Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. It
is divided into descriptive statistics and inferential statistics.
1. Descriptive Statistics:
Descriptive statistics summarize or describe the characteristics of a data set. They help in
understanding the basic features of data and give simple summaries about the sample and the
measures.
a. Measures of Central Tendency:
• Mean: The average of the data points. R uses the mean() function to calculate the mean of a
dataset.
• Median: The middle value of a sorted dataset. R uses the median() function.
• Mode: The most frequent value in a dataset. There is no built-in function for the mode in R,
but it can be calculated with a custom function.
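The three measures above can be sketched as follows. The sample vector x and the helper get_mode() are hypothetical, introduced only for illustration; get_mode() is one possible custom mode function, not a base R function.

```r
# Hypothetical sample data, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 9, 4)

mean_x   <- mean(x)    # arithmetic mean
median_x <- median(x)  # middle value of the sorted data

# get_mode() is a hypothetical helper, not part of base R.
# It returns every value that attains the highest frequency.
get_mode <- function(v) {
  uniq   <- unique(v)
  counts <- tabulate(match(v, uniq))
  uniq[counts == max(counts)]
}

mode_x <- get_mode(x)

print(c(mean = mean_x, median = median_x, mode = mode_x))
```

Note that base R does have a function called mode(), but it reports an object's storage type (for example "numeric"), not the statistical mode, which is why a custom function is needed.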
b. Measures of Dispersion:
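• Variance: The average squared deviation of the data points from the mean. Calculated using var(),
which returns the sample variance (n − 1 in the denominator).
• Standard Deviation: The square root of the variance, calculated using sd().
• Range: The difference between the maximum and minimum values. R's range() function returns the
minimum and maximum themselves, so the difference is obtained with diff(range(x)).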
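A minimal sketch of these dispersion measures, using a hypothetical sample vector x chosen only for illustration:

```r
# Hypothetical sample data, chosen only for illustration
x <- c(2, 4, 4, 5, 7, 9, 4)

var_x   <- var(x)          # sample variance (n - 1 in the denominator)
sd_x    <- sd(x)           # standard deviation, the square root of var(x)
range_x <- range(x)        # a vector holding the minimum and the maximum
width_x <- diff(range_x)   # maximum minus minimum

print(var_x)
print(sd_x)
print(range_x)
print(width_x)
```

The call range(x) yields two values rather than a single number, so diff() is applied when the difference itself is wanted.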
c. Summary Statistics:
• R’s summary() function provides a quick summary of a dataset, including the minimum, 1st
quartile, median, mean, 3rd quartile, and maximum.
• Chi-Square Tests: The chisq.test() function is used for tests of independence between
categorical variables.
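As a small sketch of summary() on a single numeric variable, the built-in iris dataset is assumed here as a convenient example; any numeric vector would do.

```r
# Six-number summary of one numeric column of the built-in iris data
s <- summary(iris$Sepal.Length)
print(s)

# Individual entries can be extracted by name
s[["Median"]]
s[["Mean"]]
```

Applied to a whole data frame, summary(iris) would produce the same six numbers for each numeric column and a count per level for each factor.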
b. Confidence Intervals:
These provide a range of values that likely contain the population parameter with a certain level
of confidence (e.g., 95% confidence interval).
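One common way to obtain such an interval in R is through t.test(), which reports a confidence interval for the mean. The sketch below assumes the built-in iris data and a t-based interval; other interval types would need other functions.

```r
# 95% confidence interval for the mean sepal length (t-based interval)
result <- t.test(iris$Sepal.Length, conf.level = 0.95)
ci <- result$conf.int

print(ci)           # lower and upper bound
ci[1]               # lower bound
ci[2]               # upper bound
```

The interval is centred on the sample mean, and a higher conf.level gives a wider interval.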
c. Regression Analysis:
Used to model relationships between variables. The simplest form is linear regression,
which examines the relationship between a dependent variable and one or more independent
variables.
• Linear Regression: Use the lm() function to fit linear regression models. For generalized
linear models (GLMs), use glm().
Probability:
Probability is the measure of the likelihood that an event will occur. R is well-equipped for
handling probability distributions, random variables, and probability computations.
Probability Distributions:
Probability distributions describe the likelihood of different outcomes for a random variable. R
provides built-in functions for a variety of distributions, including:
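a. Normal Distribution:
The probability density function (PDF) is given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where μ is the mean and σ is the standard deviation of the distribution.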
R provides four functions for the normal distribution: dnorm() for the probability density
function; pnorm() for the cumulative distribution function, which gives the probability that a
random variable is less than or equal to a given value; qnorm() for the quantile function, which
returns the point in the distribution corresponding to a given cumulative probability; and rnorm()
for generating random values from the distribution.
b. Binomial Distribution:
This distribution describes the number of successes in a fixed number of independent
Bernoulli trials (each with the same probability of success). The probability mass function (PMF)
is given by:
$$P(X = k) = \binom{n}{k} \, p^k (1 - p)^{n-k}$$
where n is the number of trials, k is the number of successes, and p is the probability of
success on a single trial.
R supports the binomial distribution with dbinom() for the probability mass function; pbinom() for
the cumulative distribution function, which gives the probability that a random variable is less
than or equal to a given value; qbinom() for the quantile function, which returns the smallest
number of successes whose cumulative probability reaches a given level; and rbinom() for
generating random values from the distribution.
Let’s use the dataset “iris”, which ships with R and is available by default. It consists of
measurements (sepal length and width, and petal length and width) of three different species of iris
flowers: Iris setosa, Iris versicolor, and Iris virginica.
The dataset contains 150 observations of 5 variables: the length and width of the sepal and petal,
and the species of each flower. The four measurements are numeric variables and the species is a
factor with 3 levels (shown as num and Factor w/ 3 levels after the variable names in the output of
str()).
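The structure described above can be inspected with a few base R calls. This is a minimal sketch; the variable names n_rows_cols and species_levels are introduced only for illustration.

```r
# Load the built-in dataset (already available by default; data() makes it explicit)
data(iris)

str(iris)    # variable names, types (num / Factor w/ 3 levels) and first values
head(iris)   # first six rows

n_rows_cols    <- dim(iris)             # number of observations and variables
species_levels <- levels(iris$Species)  # the three species names

print(n_rows_cols)
print(species_levels)
```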
9. Creating a Model:
In the last step, building a predictive model, such as a linear regression, helps explore
relationships between variables or classify species.
Now, we’ll predict Petal.Length based on Petal.Width and Sepal.Length using a linear
regression model.
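A minimal sketch of this model with lm(). The object names fit and new_flower, and the measurements given for the new flower, are hypothetical and serve only to illustrate the calls.

```r
# Fit a linear regression of petal length on petal width and sepal length
fit <- lm(Petal.Length ~ Petal.Width + Sepal.Length, data = iris)

summary(fit)   # coefficients, standard errors, R-squared
coef(fit)      # intercept and the two slopes

# Predict the petal length of a hypothetical new flower
new_flower <- data.frame(Petal.Width = 1.5, Sepal.Length = 6.0)
predicted  <- predict(fit, newdata = new_flower)
print(predicted)
```

The formula Petal.Length ~ Petal.Width + Sepal.Length places the response on the left of ~ and the predictors on the right.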
R has four built-in functions for working with the binomial distribution. They are as follows:
a. dbinom()
b. pbinom()
c. qbinom()
d. rbinom()
Let’s consider a scenario of flipping a coin 10 times (each flip is a trial) to explore probabilities
related to the outcomes of these flips (head/tail) using R's Binomial distribution functions.
a. dbinom():
This function calculates the probability of getting exactly a certain number of successes.
Syntax:
dbinom(x, size, prob)
• x: The number of successes you are interested in.
• size: The total number of trials.
• prob: The probability of success in a single trial.
Example:
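A minimal sketch for the coin-flip scenario, assuming a fair coin (prob = 0.5), which the text does not state explicitly. It computes the probability of exactly 5 heads in 10 flips.

```r
# Probability of exactly 5 heads in 10 flips of a (assumed) fair coin
p_exactly_5 <- dbinom(5, size = 10, prob = 0.5)
print(p_exactly_5)

# The same value from the PMF formula: choose(10, 5) * 0.5^5 * 0.5^5
p_by_formula <- choose(10, 5) * 0.5^5 * (1 - 0.5)^(10 - 5)
print(p_by_formula)
```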
b. pbinom():
This function calculates the probability of getting at most a certain number of successes.
Syntax:
pbinom(q, size, prob)
• q: The maximum number of successes you are considering.
• size: The total number of trials.
• prob: The probability of success in a single trial.
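Continuing the coin-flip scenario, and again assuming a fair coin (prob = 0.5), the sketch below computes the probability of at most 3 heads in 10 flips and checks it against the sum of the individual probabilities.

```r
# Probability of at most 3 heads in 10 flips of a (assumed) fair coin
p_at_most_3 <- pbinom(3, size = 10, prob = 0.5)
print(p_at_most_3)

# Equivalent: add up P(X = 0), P(X = 1), P(X = 2), P(X = 3)
p_summed <- sum(dbinom(0:3, size = 10, prob = 0.5))
print(p_summed)
```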
c. qbinom():
This function returns the smallest number of successes whose cumulative probability is at least
a given value.
Syntax:
qbinom(p, size, prob)
• p: The cumulative probability threshold.
• size: The total number of trials.
• prob: The probability of success in a single trial.
Example:
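A minimal sketch for the coin-flip scenario, assuming a fair coin (prob = 0.5). It finds the number of heads at which the cumulative probability first reaches 0.5, and confirms that with pbinom().

```r
# Smallest number of heads k such that P(X <= k) >= 0.5
k_median <- qbinom(0.5, size = 10, prob = 0.5)
print(k_median)

# Check: the cumulative probability reaches 0.5 at k_median but not one step earlier
pbinom(k_median, size = 10, prob = 0.5)       # at least 0.5
pbinom(k_median - 1, size = 10, prob = 0.5)   # below 0.5
```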
d. rbinom():
This function generates random outcomes from a binomial distribution.
Syntax:
rbinom(n, size, prob)
• n: The number of random outcomes you want to generate.
• size: The total number of trials in each experiment.
• prob: The probability of success in a single trial.
Example:
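A minimal sketch, assuming a fair coin (prob = 0.5): simulate 5 experiments of 10 flips each and record the number of heads in each. The seed value is arbitrary and only makes the run repeatable.

```r
set.seed(42)  # arbitrary seed so the simulation can be repeated

# Number of heads in each of 5 simulated experiments of 10 flips
heads <- rbinom(5, size = 10, prob = 0.5)
print(heads)

# Each value is a count between 0 and 10
range(heads)
```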
2. Normal distributions:
For example, the heights of people in a population often follow a normal distribution. If the
mean height is 170 cm with a standard deviation of 5 cm, the distribution can help estimate the
probability of a randomly chosen person having a height within a specific range, like between 165 cm
and 175 cm.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Where,
• x: The variable for which we are calculating the probability.
• μ: The mean (average) of the distribution.
• σ: The standard deviation, which measures the spread or dispersion of the data.
• π: The mathematical constant, approximately 3.14159.
• e: Euler's number, approximately 2.71828.
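The height example above (mean 170 cm, standard deviation 5 cm) can be worked through with pnorm(). This is a sketch under the stated assumption that heights are exactly normal; the variable names are introduced only for illustration.

```r
mu    <- 170  # mean height in cm
sigma <- 5    # standard deviation in cm

# P(165 <= X <= 175) = P(X <= 175) - P(X <= 165)
p_between <- pnorm(175, mean = mu, sd = sigma) - pnorm(165, mean = mu, sd = sigma)
print(p_between)

# The density formula evaluated by hand at x = 170 matches dnorm()
f_170 <- 1 / (sigma * sqrt(2 * pi)) * exp(-(170 - mu)^2 / (2 * sigma^2))
print(f_170)
```

Since 165 cm and 175 cm lie one standard deviation either side of the mean, the result is the familiar "about 68%" of the normal distribution.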
R has four built-in functions for working with the normal distribution. They are as follows:
a. dnorm()
b. pnorm()
c. qnorm()
d. rnorm()
As an example, imagine that the heights of people in a city follow a normal distribution with an
average height (mean) of 170 cm and a spread of heights (sd) of 10 cm:
a. dnorm():
This function calculates the probability density (height of the curve) at a given point for a
normal distribution.
Syntax:
dnorm(x, mean = 0, sd = 1)
• x: The point at which to evaluate the density.
• mean: The mean of the distribution.
• sd: The standard deviation of the distribution.
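A minimal sketch for the city-heights scenario (mean 170 cm, sd 10 cm), evaluating the density at the mean and at a point one standard deviation above it:

```r
# Height of the normal curve at the mean
d_at_mean <- dnorm(170, mean = 170, sd = 10)
print(d_at_mean)

# Height of the curve one standard deviation above the mean
d_at_180 <- dnorm(180, mean = 170, sd = 10)
print(d_at_180)
```

For a continuous variable these values are densities, not probabilities; the curve is highest at the mean and falls away symmetrically on both sides.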
b. pnorm():
This function computes the cumulative probability up to a given point.
Syntax:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
• q: The point up to which the cumulative probability is calculated.
• mean: The mean of the distribution.
• sd: The standard deviation of the distribution.
• lower.tail: If TRUE, computes P(X ≤ q); if FALSE, computes P(X > q).
Example:
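A minimal sketch for the city-heights scenario (mean 170 cm, sd 10 cm): the probability that a person is at most 180 cm tall, and the complementary probability of being taller than 180 cm.

```r
# P(X <= 180): lower tail (the default)
p_at_most_180 <- pnorm(180, mean = 170, sd = 10)
print(p_at_most_180)

# P(X > 180): upper tail
p_above_180 <- pnorm(180, mean = 170, sd = 10, lower.tail = FALSE)
print(p_above_180)

# The two tails always add up to 1
p_at_most_180 + p_above_180
```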
c. qnorm():
This function determines the quantile (value) corresponding to a given cumulative probability.
Syntax:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
• p: The cumulative probability.
• mean: The mean of the distribution.
• sd: The standard deviation of the distribution.
• lower.tail: If TRUE, finds x for P(X ≤ x); if FALSE, finds x for P(X > x).
Example:
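A minimal sketch for the city-heights scenario (mean 170 cm, sd 10 cm): the height below which 95% of people fall, with a round-trip check through pnorm().

```r
# Height h such that P(X <= h) = 0.95
h_95 <- qnorm(0.95, mean = 170, sd = 10)
print(h_95)

# qnorm() is the inverse of pnorm(): feeding the result back returns 0.95
pnorm(h_95, mean = 170, sd = 10)

# The median of a normal distribution equals its mean
qnorm(0.5, mean = 170, sd = 10)
```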
LBAS AND SBSC COLLEGE SAGAR
STATISTICAL COMPUTING AND R PROGRAMMING
d. rnorm():
This function generates random numbers following a normal distribution.
Syntax:
rnorm(n, mean = 0, sd = 1)
• n: The number of random values to generate.
• mean: The mean of the distribution.
• sd: The standard deviation of the distribution.
Example:
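A minimal sketch for the city-heights scenario (mean 170 cm, sd 10 cm): simulate 1000 heights and compare the sample mean and standard deviation with the parameters used. The seed and the sample size are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the simulation can be repeated

# Simulate the heights of 1000 people
heights <- rnorm(1000, mean = 170, sd = 10)

head(heights)   # first few simulated values
mean(heights)   # close to 170
sd(heights)     # close to 10
```

With larger n, the sample mean and standard deviation tend to move closer to the values passed as mean and sd.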
For the binomial distribution, R uses dbinom(), pbinom(), qbinom(), and rbinom() for the PMF, CDF, quantile calculation, and random generation, respectively. For the normal distribution, dnorm(), pnorm(), qnorm(), and rnorm() play the same roles for the density, cumulative probability, quantile, and random generation. These functions support statistical analysis by making probabilities under an assumed distribution directly computable, which in turn supports inferential statistics and modeling.
Examining relationships between variables is important for understanding how changes in one variable may affect another, which is crucial for modeling and prediction. In R, this is often done using the covariance (cov()) and correlation (cor()) functions. These metrics describe the direction and strength of the linear relationship between variables, which matters for data analysis and subsequent modeling tasks.
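A minimal sketch of cov() and cor(), assuming the built-in iris data and the petal measurements as the pair of interest; the object names are introduced only for illustration.

```r
# Covariance and correlation between petal length and petal width
cov_lw <- cov(iris$Petal.Length, iris$Petal.Width)
cor_lw <- cor(iris$Petal.Length, iris$Petal.Width)
print(cov_lw)
print(cor_lw)

# Correlation is covariance rescaled by the two standard deviations
cov_lw / (sd(iris$Petal.Length) * sd(iris$Petal.Width))

# Correlation matrix for all four numeric columns
cor_matrix <- cor(iris[, 1:4])
print(round(cor_matrix, 2))
```

Unlike covariance, correlation always lies between -1 and 1, which makes it easier to compare across pairs of variables measured on different scales.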
The process involves loading the 'iris' dataset, exploring its structure using str(), and viewing the initial rows with head(). Using summary(), summary statistics like the mean and median are calculated. Checking for missing values with is.na() and for outliers with boxplots helps refine the dataset. Insights like the central tendency and variability of the sepal and petal measurements can reveal patterns and help in predictive modeling.
R's built-in functions, dbinom() for the PMF, pbinom() for the CDF, qbinom() for quantiles, and rbinom() for random generation, support analysis of binomial distributions by providing tools to calculate probabilities and model outcomes of random discrete events with two possible outcomes. This is crucial for experiments like coin tosses, where the likelihood of varying counts of successes needs evaluation.
R handles the normal distribution primarily through four functions: dnorm() for the probability density function, pnorm() for calculating cumulative probability, qnorm() for finding quantiles, and rnorm() for generating random values that follow the distribution. These functions enable modeling and analysis of continuous data that tends to follow a bell-curve pattern.
In R, missing values can be identified using is.na() and handled by imputation or removal. Outliers can be detected using visualizations like boxplots. Handling involves deciding on techniques like replacing with mean/median values, interpolation, or exclusion, based on their influence on data integrity and on statistical justification.
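A minimal sketch of these steps on a small hypothetical vector (the values are invented for illustration). Median imputation and the 1.5 × IQR boxplot rule are used here as one possible choice each, not as the only valid approach.

```r
# Hypothetical measurements with two missing values and one extreme value
x <- c(5.1, 4.9, NA, 6.3, 5.8, NA, 20.0)

is.na(x)                 # TRUE where a value is missing
n_missing <- sum(is.na(x))

# Option 1: remove the missing values
x_removed <- x[!is.na(x)]

# Option 2: impute the missing values with the median of the observed ones
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Outliers according to the boxplot rule (beyond 1.5 * IQR from the hinges)
outliers <- boxplot.stats(x_removed)$out

print(n_missing)
print(x_imputed)
print(outliers)
```

boxplot.stats() computes the same statistics that boxplot() draws, so the flagged values are the points a boxplot would show beyond its whiskers.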
R supports hypothesis testing through functions like t.test() for conducting t-tests and chisq.test() for chi-squared tests between categorical variables. These functions help to verify assumptions about population parameters based on sample data. For example, t.test() might be used to determine whether there is a significant difference between the means of two groups.
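A minimal sketch of both tests on the built-in iris data. The choice of species to compare and the derived categorical variable long_sepal are assumptions made only for illustration.

```r
# Two-sample t-test: do setosa and versicolor differ in mean sepal length?
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

tt <- t.test(setosa, versicolor)  # Welch test (unequal variances) by default
print(tt$p.value)

# Chi-squared test of independence between two categorical variables.
# long_sepal is a hypothetical category: sepal length above the overall median.
long_sepal <- iris$Sepal.Length > median(iris$Sepal.Length)
tab <- table(Species = iris$Species, LongSepal = long_sepal)
print(tab)

ct <- chisq.test(tab)
print(ct$p.value)
```

In both cases a small p-value (commonly below 0.05) is taken as evidence against the null hypothesis of equal means or of independence.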
Data visualization in R is supported through tools like histograms, boxplots, and density plots, aiding in understanding data distribution. Visual representation helps identify patterns, outliers, and the shape of the data, crucial for making informed decisions about data manipulation, analysis, and interpretation related to statistical tests and models.
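A minimal sketch of the three plot types using base R graphics and the built-in iris data. The choice of variables, titles, and axis labels is an assumption made only for illustration.

```r
# Histogram: distribution of sepal length
h <- hist(iris$Sepal.Length,
          main = "Sepal length", xlab = "Sepal length (cm)")

# Boxplot: sepal length by species, useful for spotting outliers
b <- boxplot(Sepal.Length ~ Species, data = iris,
             ylab = "Sepal length (cm)")

# Density plot: smoothed shape of the petal length distribution
d <- density(iris$Petal.Length)
plot(d, main = "Density of petal length")
```

hist() and boxplot() also return the numbers behind the plot (bin counts and box statistics), which can be inspected when the picture alone is not enough.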
In R, building a predictive model with linear regression involves loading the dataset, exploring its structure, and handling missing values. After exploring, a linear regression model can be created using the lm() function. For instance, using the 'iris' dataset, one can predict Petal.Length based on Petal.Width and Sepal.Length by fitting these variables into a linear regression model.
R facilitates descriptive statistics through functions like mean(), median(), var(), sd(), and summary(), allowing for the calculation of measures of central tendency and dispersion. Descriptive statistics provide a basic summary of the data, offering insights into its central tendencies and variability, which are crucial for understanding the dataset's properties and preparing it for further analysis.