R Programming for Statistical Analysis

The document provides an overview of statistical computing and R programming, focusing on R's capabilities for descriptive and inferential statistics, as well as probability distributions. It details various statistical measures, functions in R for analysis, and the process of descriptive analysis, including data collection, exploration, and modeling. Additionally, it explains binomial and normal distributions, along with R functions for handling these distributions.

Uploaded by

srujan987.123

STATISTICAL COMPUTING AND R PROGRAMMING

Module 3: R as a set of statistical tables

Statistics and Probability:


R is a comprehensive statistical computing environment and programming language used
widely in data analysis, statistical modeling, and probability computations. It offers built-in
functions and additional packages to handle both simple and complex statistical operations.

Statistics:
Statistics is the science of collecting, analyzing, interpreting, presenting, and organizing data. It
is divided into descriptive statistics and inferential statistics.

1. Descriptive Statistics:
Descriptive statistics summarize or describe the characteristics of a data set. They help in
understanding the basic features of data and give simple summaries about the sample and the
measures.
a. Measures of Central Tendency:
• Mean: The average of the data points. R uses the mean() function to calculate the mean of a
dataset.
• Median: The middle value of a sorted dataset. R uses the median() function.
• Mode: The most frequent value in a dataset. There is no built-in function for the mode in R,
but it can be calculated with a custom function.

b. Measures of Dispersion:
• Variance: The spread of data points around the mean. Calculated using var().
• Standard Deviation: The square root of the variance, calculated using sd().
• Range: The difference between the maximum and minimum values. R's range() function returns
the minimum and maximum themselves, so the difference is obtained with diff(range(x)).

c. Summary Statistics:
• R’s summary() function provides a quick summary of a dataset, including the minimum, 1st
quartile, median, mean, 3rd quartile, and maximum.
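
As a minimal sketch, the measures above can be computed on a small vector. The sample values and the helper name get_mode are hypothetical, chosen only for illustration:

```r
# Hypothetical sample values, used only for illustration
x <- c(2, 4, 4, 5, 7, 9, 4)

# Custom mode helper: base R has no statistical mode function
# (the built-in mode() reports an object's storage type instead)
get_mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]  # first most-frequent value
}

mean(x)         # 5
median(x)       # 4
get_mode(x)     # 4
var(x)          # sample variance (divides by n - 1)
sd(x)           # square root of var(x)
range(x)        # 2 9 (minimum and maximum)
diff(range(x))  # 7 (maximum minus minimum)
summary(x)      # minimum, quartiles, median, mean, maximum
```

If several values tie for the highest frequency, this helper returns only the first one it encounters.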

LBAS AND SBSC COLLEGE SAGAR 1


2. Inferential Statistics:
Inferential statistics involve drawing conclusions about a population based on a sample. R
provides various tools for inferential statistics, including hypothesis testing, confidence intervals,
and regression analysis.
a. Hypothesis Testing:
Hypothesis testing assesses whether sample data are consistent with an assumption (the null
hypothesis) about a population. The most common methods for hypothesis testing include t-tests,
chi-squared tests, and analysis of variance (ANOVA).
• T-tests: The t.test() function is used for one-sample, two-sample, and paired t-tests to
compare means.

• Chi-Square Tests: The chisq.test() function is used for tests of independence between
categorical variables.
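
A short sketch of both tests, using the built-in iris and mtcars datasets. The choice of variables is an assumption made for illustration, not something prescribed by the notes:

```r
# Two-sample (Welch) t-test: do setosa and versicolor differ in mean sepal length?
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]
t_res <- t.test(setosa, versicolor)
t_res$p.value   # a small p-value is evidence against equal means

# Chi-squared test of independence between two categorical variables
# (transmission type 'am' and engine shape 'vs' in mtcars)
tab <- table(mtcars$am, mtcars$vs)
chi_res <- chisq.test(tab)
chi_res$p.value
```

Both functions return an object of class "htest", so the statistic, p-value, and other components can be extracted by name.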

b. Confidence Intervals:
These provide a range of values that likely contain the population parameter with a certain level
of confidence (e.g., 95% confidence interval).
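
As an illustration, a 95% confidence interval for a mean can be read from the result of t.test(), or built by hand from the t distribution. The variable chosen here (sepal length in iris) is only an example:

```r
x <- iris$Sepal.Length

# Confidence interval for the mean, taken from a one-sample t-test
ci <- t.test(x, conf.level = 0.95)$conf.int
ci

# The same interval computed manually: mean +/- t quantile * standard error
n <- length(x)
margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual_ci <- c(mean(x) - margin, mean(x) + margin)
manual_ci
```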

c. Regression Analysis:
Used to model relationships between variables. The simplest form is linear regression,
which examines the relationship between a dependent variable and one or more independent
variables.
• Linear Regression: Use the lm() function to fit linear regression models. For generalized
linear models (GLMs), such as logistic regression, use glm().
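
A minimal sketch of both functions on built-in datasets. The datasets and formulas are assumptions picked for illustration:

```r
# Simple linear regression: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept and slope

# Generalized linear model: logistic regression of engine shape on fuel economy
logit_fit <- glm(vs ~ mpg, data = mtcars, family = binomial)
coef(logit_fit)
```

Calling summary() on either fitted object prints standard errors, test statistics, and p-values for the coefficients.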

Probability:
Probability is the measure of the likelihood that an event will occur. R is well-equipped for
handling probability distributions, random variables, and probability computations.

Probability Distributions:
Probability distributions describe the likelihood of different outcomes for a random variable. R
provides built-in functions for a variety of distributions, including:

a. Normal Distribution:
One of the most important probability distributions, the normal distribution is symmetric
and describes many natural phenomena such as heights, test scores, and measurement
errors. It is defined by its mean (µ) and standard deviation (σ). The probability density function
(PDF) for the normal distribution is given by the equation:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

R provides four functions for the normal distribution: dnorm() evaluates the probability density
function; pnorm() evaluates the cumulative distribution function, which gives the probability
that a random variable is less than or equal to a certain value; qnorm() returns the quantile,
that is, the point in the distribution corresponding to a given cumulative probability; and
rnorm() generates random values from the distribution.

b. Binomial Distribution:
This distribution describes the number of successes in a fixed number of independent
Bernoulli trials (each with the same probability of success). The probability mass function (PMF)
is given by:
$$P(X = k) = \binom{n}{k} \, p^{k} (1 - p)^{n-k}$$
where n is the number of trials, k is the number of successes, and p is the probability of
success on a single trial.

R supports the binomial distribution with four functions: dbinom() evaluates the probability
mass function; pbinom() evaluates the cumulative distribution function, which gives the
probability that a random variable is less than or equal to a certain value; qbinom() returns the
quantile, that is, the smallest number of successes whose cumulative probability reaches a given
value; and rbinom() generates random values from the distribution.

Process of Descriptive Analysis:
Descriptive analysis is a fundamental aspect of data analysis aimed at summarizing,
organizing, and simplifying data to make it more interpretable and meaningful. It provides insights
into the distribution, central tendency, and variability of the data. The process of descriptive analysis
is as follows:
1. Collect and Organize the Data
2. Explore the Dataset
3. Check for Missing or Outlier Values
4. Calculate the Central Tendency
5. Calculate measures of Variability
6. Analyze Data Distribution
7. Examine Relationships between Variables
8. Summarize and Interpret Results
9. Creating a Model

1. Collect and Organize the Data:


The first step is gathering the raw data from surveys, experiments, or existing datasets and
organizing it into a structured format, such as a table, spreadsheet, or data frame in R.

Let’s use the dataset “iris”, which ships with R and is available by default. It consists of
measurements (sepal length and width, and petal length and width) of three different species of
iris flowers: Iris setosa, Iris versicolor, and Iris virginica.

Now, load the iris dataset into a data frame named ‘data’.
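
The loading step might look like this. The object name data comes from the text; note that it shadows R's data() function name, which is harmless here:

```r
# Copy the built-in iris dataset into a data frame called 'data'
data <- iris

class(data)   # "data.frame"
dim(data)     # 150 rows, 5 columns
```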

2. Explore the Dataset:


Before performing analysis, exploring the dataset is essential to understand the structure and
content of the data.
• Structure of the dataset: Use str() to examine the data types and layout.
• First few rows: Use head() to preview the data.
• Summary statistics: Use summary() to get an overview of each variable.
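
A sketch of the three exploration calls, assuming the iris data has been loaded into data as in step 1:

```r
data <- iris

str(data)       # column types: four numeric columns and one factor
head(data)      # first six rows
summary(data)   # per-column summary; counts per level for the factor
```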


The dataset contains 150 observations and 5 variables, representing the length and width of
the sepal and petal and the species of 150 flowers. Length and width of the sepal and petal are
numeric variables and the species is a factor with 3 levels (indicated by num and Factor w/ 3
levels after the name of the variables).

3. Check for Missing or Outlier Values:


Identify and handle missing data or outliers:
• Missing data: Check for NA values using is.na() and handle them using imputation or removal.
• Outliers: Detect outliers using visualizations (e.g., boxplots) or statistical techniques.
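
One possible way to run both checks on the iris data. Using boxplot.stats() for outliers is an assumption here; it applies the same 1.5 x IQR rule that boxplot() draws:

```r
data <- iris

# Missing values: count NAs per column, or test the whole data frame at once
colSums(is.na(data))
anyNA(data)   # FALSE for iris

# Outliers in one variable, using the boxplot rule (beyond 1.5 * IQR from the hinges)
outliers <- boxplot.stats(data$Sepal.Width)$out
outliers

# boxplot(data$Sepal.Width) would show the same points graphically
```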

4. Calculate the Central Tendency:
Central tendency measures provide a single value that represents the center of the data:
• Mean: Average value (mean ()).
• Median: Middle value (median () or quantile ()).
• Mode: Most frequent value (custom function).
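
For example, the three measures for sepal length could be computed as follows. The table-based mode calculation is one of several possible custom approaches:

```r
data <- iris
x <- data$Sepal.Length

mean(x)            # about 5.84
median(x)          # 5.8
quantile(x, 0.5)   # the 50% quantile equals the median

# Mode: tabulate the values and take the most frequent one
freq <- table(x)
mode_value <- as.numeric(names(freq)[which.max(freq)])
mode_value
```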

5. Calculate measures of Variability:


Variability measures describe the spread or dispersion of data:
• Range: Difference between the maximum and minimum values.
• Variance: Average squared deviation from the mean (var()).
• Standard Deviation: Square root of variance (sd()).
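
A sketch of the variability measures for the same variable, again assuming the iris data:

```r
data <- iris
x <- data$Sepal.Length

range(x)         # 4.3 7.9 (minimum and maximum)
diff(range(x))   # width of the range
var(x)           # sample variance
sd(x)            # sample standard deviation

# Standard deviation of every numeric column at once
sapply(data[, 1:4], sd)
```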

6. Analyze Data Distribution:
Understanding how the data are distributed is crucial for statistical inference. The distribution
can be examined with frequency tables (table()) or with visualization tools such as histograms,
boxplots, or density plots.
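
A minimal sketch of these tools on the iris data. The histogram is computed with plot = FALSE so that the bin counts can be inspected; dropping that argument draws the plot instead:

```r
data <- iris

# Frequency table of a categorical variable
table(data$Species)   # 50 flowers per species

# Histogram bins and counts for a numeric variable
h <- hist(data$Sepal.Length, plot = FALSE)
h$breaks
h$counts

# Kernel density estimate; plot(d) would draw the density curve
d <- density(data$Sepal.Length)
```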

7. Examine Relationships between Variables:


Descriptive analysis often explores relationships between variables, typically using
covariance and correlation:
• Covariance: Measures how two variables change together (cov()).
• Correlation: Measures the strength and direction of the linear relationship (cor()).
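
As an illustration, both measures for the two petal variables in iris, followed by the full correlation matrix of the numeric columns:

```r
data <- iris

cov(data$Petal.Length, data$Petal.Width)        # positive: the two grow together
r <- cor(data$Petal.Length, data$Petal.Width)   # about 0.96, a strong linear relationship
r

# Pairwise correlations between all four measurements
cor_matrix <- cor(data[, 1:4])
cor_matrix
```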

8. Summarize and Interpret Results:
Finally, summarize the findings to provide insights into the data:
• Summarize key metrics (mean, median, standard deviation, etc.).
• Describe trends, patterns, and outliers observed in the data.
• Use visualizations to communicate results effectively.

9. Creating a Model:
In the last step, building a predictive model, such as a linear regression, helps explore
relationships between variables and predict one variable from others.

Now, we’ll predict Petal.Length based on Petal.Width and Sepal.Length using a linear
regression model.
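
A sketch of such a model, assuming the response is Petal.Length and the predictors are Petal.Width and Sepal.Length. The new_flower measurements are hypothetical values used only to demonstrate predict():

```r
data <- iris

# Fit the linear regression model
model <- lm(Petal.Length ~ Petal.Width + Sepal.Length, data = data)
summary(model)   # coefficients, standard errors, R-squared

# Predict petal length for a new flower (hypothetical measurements)
new_flower <- data.frame(Petal.Width = 1.5, Sepal.Length = 6.0)
predicted <- predict(model, newdata = new_flower)
predicted
```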

Probability distributions in R:
1. Binomial distributions:
The binomial distribution models the number of successes in a fixed series of independent
experiments, each of which has only two possible outcomes. For example, tossing a coin
always gives a head or a tail. The probability of getting exactly 3 heads when tossing a coin 10
times can be computed from the binomial distribution.

R has four built-in functions for working with the binomial distribution. They are as follows:
a. dbinom ()
b. pbinom ()
c. qbinom ()
d. rbinom ()

Let’s consider a scenario of flipping a coin 10 times (each flip is a trial) to explore probabilities
related to the outcomes of these flips (head/tail) using R's Binomial distribution functions.
a. dbinom ():
This function calculates the probability of getting exactly a certain number of successes.
Syntax:
dbinom (x, size, prob)
• x: The number of successes you are interested in.
• size: The total number of trials.
• prob: The probability of success in a single trial.
Example:
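
A minimal sketch for the coin-flip scenario, assuming a fair coin (prob = 0.5), which the text does not state explicitly:

```r
# Probability of exactly 3 heads in 10 flips of a fair coin
p_exactly_3 <- dbinom(3, size = 10, prob = 0.5)
p_exactly_3   # 0.1171875, i.e. choose(10, 3) / 2^10
```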

b. pbinom ():
This function calculates the probability of getting at most a certain number of successes.
Syntax:
pbinom (q, size, prob)
• q: The maximum number of successes you are considering.
• size: The total number of trials.
• prob: The probability of success in a single trial.
Example:
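
Continuing the coin-flip scenario, again assuming a fair coin:

```r
# Probability of at most 3 heads in 10 flips: P(X <= 3)
p_at_most_3 <- pbinom(3, size = 10, prob = 0.5)
p_at_most_3   # 0.171875

# The same value as a sum of individual probabilities for 0, 1, 2, and 3 heads
sum(dbinom(0:3, size = 10, prob = 0.5))
```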

c. qbinom ():
This function tells you the number of successes corresponding to a given cumulative
probability.
Syntax:
qbinom (p, size, prob)
• p: The cumulative probability threshold.
• size: The total number of trials.
• prob: The probability of success in a single trial.
Example:
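
For the same fair-coin scenario (an assumption), qbinom() returns the smallest number of heads whose cumulative probability reaches the threshold:

```r
# Smallest k such that P(X <= k) >= 0.5
q_median <- qbinom(0.5, size = 10, prob = 0.5)
q_median   # 5

# Smallest k such that P(X <= k) >= 0.9
q_90 <- qbinom(0.9, size = 10, prob = 0.5)
q_90       # 7
```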

d. rbinom ():
This function generates random outcomes from a binomial distribution.
Syntax:
rbinom (n, size, prob)
• n: The number of random outcomes you want to generate.
• size: The total number of trials in each experiment.
• prob: The probability of success in a single trial.
Example:
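
A sketch that simulates five experiments of 10 fair-coin flips each. The seed value is arbitrary and is set only to make the output reproducible:

```r
set.seed(123)   # arbitrary seed for reproducibility

# Number of heads in each of 5 simulated experiments of 10 flips
heads <- rbinom(5, size = 10, prob = 0.5)
heads
```

Each element of the result is a count between 0 and 10.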

2. Normal distributions:
The normal distribution is a continuous probability distribution that is symmetric around its
mean, meaning that values near the mean occur more frequently than values far from the
mean. It is commonly referred to as the bell curve due to its shape.

For example, the heights of people in a population often follow a normal distribution. If the
mean height is 170 cm with a standard deviation of 5 cm, the distribution can help estimate the
probability of a randomly chosen person having a height within a specific range, like between 165 cm
and 175 cm.

The probability density function (PDF) of a normal distribution is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Where,
• x: The value at which the density is evaluated.
• μ: The mean (average) of the distribution.
• σ: The standard deviation, which measures the spread or dispersion of the data.
• π: The mathematical constant, approximately 3.14159.
• e: Euler's number, approximately 2.71828.

R has four built-in functions for working with the normal distribution. They are as follows:
a. dnorm ()
b. pnorm ()
c. qnorm ()
d. rnorm ()

Let’s consider an example. Imagine that the heights of people in a city follow a normal
distribution with an average height (mean) of 170 cm and a spread of heights (sd) of 10 cm:
a. dnorm ():
This function calculates the probability density (height of the curve) at a given point for a
normal distribution.
Syntax:
dnorm (x, mean = 0, sd = 1)
• x: The point at which to evaluate the density.
• mean: The mean (center) of the normal distribution (default is 0).
• sd: The standard deviation (spread) of the distribution (default is 1).
Example:
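
A minimal sketch using the height scenario above (mean 170 cm, sd 10 cm):

```r
# Density (height of the curve) at the mean, 170 cm
d_170 <- dnorm(170, mean = 170, sd = 10)
d_170   # about 0.0399, the peak of the curve

# Density one standard deviation above the mean, 180 cm
d_180 <- dnorm(180, mean = 170, sd = 10)
d_180   # about 0.0242
```

These values are densities, not probabilities; probabilities for a range of heights come from pnorm().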

b. pnorm ():
This function computes the cumulative probability up to a given point.
Syntax:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
• q: The point up to which the cumulative probability is calculated.
• mean: The mean of the distribution.
• sd: The standard deviation of the distribution.
• lower.tail: If TRUE, computes P(X ≤ q); if FALSE, computes P(X > q).
Example:
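
Continuing the height scenario (mean 170 cm, sd 10 cm), a sketch of the lower tail, the upper tail, and an interval probability:

```r
# P(height <= 180)
p_below_180 <- pnorm(180, mean = 170, sd = 10)
p_below_180   # about 0.8413

# P(height > 180), using the upper tail
p_above_180 <- pnorm(180, mean = 170, sd = 10, lower.tail = FALSE)
p_above_180   # about 0.1587

# P(160 <= height <= 180): difference of two cumulative probabilities
p_between <- pnorm(180, 170, 10) - pnorm(160, 170, 10)
p_between     # about 0.6827
```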

c. qnorm ():
This function determines the quantile (value) corresponding to a given cumulative probability.
Syntax:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
• p: The cumulative probability.
• mean: The mean of the distribution.
• sd: The standard deviation of the distribution.
• lower.tail: If TRUE, finds x such that P(X ≤ x) = p; if FALSE, finds x such that P(X > x) = p.
Example:
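
For the same height scenario, a sketch that finds the height below which 95% of people fall:

```r
# Height at the 95th percentile
h_95 <- qnorm(0.95, mean = 170, sd = 10)
h_95   # about 186.45 cm

# Equivalent call using the upper tail: the height exceeded by 5% of people
qnorm(0.05, mean = 170, sd = 10, lower.tail = FALSE)

# qnorm() is the inverse of pnorm()
pnorm(h_95, mean = 170, sd = 10)   # 0.95
```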

d. rnorm ():
This function generates random numbers following a normal distribution.
Syntax:
rnorm(n, mean = 0, sd = 1)
• n: The number of random values to generate.
• mean: The mean of the distribution.
• sd: The standard deviation of the distribution.
Example:
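
A sketch that simulates heights from the scenario above. The seed and the sample size of 1000 are arbitrary choices for illustration:

```r
set.seed(123)   # arbitrary seed for reproducibility

# Simulate the heights of 1000 people
heights <- rnorm(1000, mean = 170, sd = 10)

head(heights, 5)   # first five simulated heights
mean(heights)      # close to 170
sd(heights)        # close to 10
```

The sample mean and standard deviation will not match 170 and 10 exactly, but they get closer as the sample size grows.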

Visualizing above outputs:
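
The original figure is not reproduced here. One way to draw comparable plots with base graphics is sketched below; the helper name plot_height_distribution and the choice of three panels are assumptions, not the author's code:

```r
# Draw the density curve, the cumulative curve, and a histogram of simulated heights
plot_height_distribution <- function(mu = 170, sigma = 10, n = 1000) {
  x <- seq(mu - 4 * sigma, mu + 4 * sigma, length.out = 200)
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))   # restore the previous layout when done

  plot(x, dnorm(x, mu, sigma), type = "l",
       main = "Density (dnorm)", xlab = "Height (cm)", ylab = "Density")
  abline(v = qnorm(0.95, mu, sigma), lty = 2)   # 95th percentile from qnorm()

  plot(x, pnorm(x, mu, sigma), type = "l",
       main = "Cumulative probability (pnorm)", xlab = "Height (cm)", ylab = "P(X <= x)")

  hist(rnorm(n, mu, sigma),
       main = "Simulated heights (rnorm)", xlab = "Height (cm)")

  invisible(x)   # return the grid of x values without printing it
}

# Only draw automatically when running in an interactive session
if (interactive()) plot_height_distribution()
```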
