
Probability Distributions in R

Module 8 covers probability distributions in R, including binomial, Poisson, normal, exponential, uniform, and student t-distributions, with a focus on calculating probabilities and simulating data. It includes practical examples, videos, and quizzes for self-assessment. Learning outcomes involve calculating probabilities, interpreting results, and estimating confidence intervals using R software.

Module 8 - Probability distributions in R

Instructional Hours: 4
In this module, you will learn how to calculate probabilities and simulate data
in R for the binomial, Poisson, normal, exponential, uniform and Student's
t-distributions, and how to estimate confidence intervals using the normal and
t-distributions. Practical examples using R are provided, together with videos
and quizzes for self-assessment.

Module Learning Outcomes


• Show how to calculate probabilities using the R software for different
distributions
• Interpret the probabilities calculated from different distributions

• Evaluate the simulated data for the different distributions


• Estimate the confidence interval for the normal and t-distributions

Learning Activities
1. Read the lecture notes (PDF)
2. Read the assigned reference materials
3. Watch lecture videos

4. Complete the module assessment/Quiz

0.1 Binomial distribution


The binomial distribution is used to model the number of successes in a fixed
number of independent Bernoulli trials with the same probability of success.
The probability mass function of a binomial random variable X is given by:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$

where n is the number of trials, k is the number of successes, and p is the
probability of success on a single trial.
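As a quick check of the formula, the sketch below evaluates the mass function by hand with choose() and compares it with the built-in dbinom(). The values n = 10, p = 0.5 and k = 5 are assumed here purely for illustration.

```r
# Illustrative values (assumed for this sketch)
n <- 10   # number of trials
p <- 0.5  # probability of success
k <- 5    # number of successes

# Binomial PMF written out term by term from the formula
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

# Built-in binomial PMF
builtin <- dbinom(k, size = n, prob = p)

print(c(manual = manual, builtin = builtin))
print(isTRUE(all.equal(manual, builtin)))
```

Both numbers should agree, which confirms that dbinom() implements the formula above.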

0.1.1 Calculating Probabilities using R
In R, you can calculate binomial probabilities using the dbinom function. Below
is an example:

# Parameters
n <- 10 # number of trials
p <- 0.5 # probability of success

# Probability of getting exactly 5 successes


k <- 5
prob <- dbinom(k, n, p)
print(prob)

Example 1. Suppose a factory produces light bulbs, and the probability that a
randomly selected bulb is defective is 0.05. If a quality control inspector randomly
selects 20 bulbs from the production line, what is the probability that exactly 2
of the selected bulbs are defective?
We have n = 20, p = 0.05, k = 2. Our interest is P (X = 2) and in R, it
is expressed as:
# Parameters
n <- 20
p <- 0.05
k <- 2

# Probability of exactly k defective bulbs
prob <- dbinom(k, n, p)
print(prob)

# The probability of exactly 2 defective bulbs is
0.1886768
If our interest is P (X ≤ 2), we add the three probabilities for k = 0, 1, 2:
# Based on the information above
prob <- dbinom(0, 20, 0.05) + dbinom(1, 20, 0.05) + dbinom(2, 20, 0.05)

# The probability of 2 or fewer defective bulbs is


0.9245163
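The same cumulative probability can also be obtained in one call with pbinom(), which returns P (X ≤ q). The following is a minimal sketch using the values from Example 1; the variable names are chosen only for illustration.

```r
# Values from Example 1
n <- 20
p <- 0.05

# P(X <= 2) as a sum of individual probabilities
prob_sum <- sum(dbinom(0:2, size = n, prob = p))

# P(X <= 2) from the cumulative distribution function
prob_cdf <- pbinom(2, size = n, prob = p)

# Complement: P(X > 2), i.e. three or more defective bulbs
prob_more <- 1 - prob_cdf

print(c(sum = prob_sum, cdf = prob_cdf, more_than_2 = prob_more))
```

Using pbinom() avoids writing out each term, which matters when the upper limit is large.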

0.1.2 Simulation using R


You can also perform simulations to generate binomial random variables using
the rbinom function. Below is an example of simulating 1000 observations from
a binomial distribution with n = 10 trials and probability of success p = 0.5:

# Parameters
n <- 10 # number of trials
p <- 0.5 # probability of success

size <- 1000 # number of simulations

# Simulating binomial random variables


sim_data <- rbinom(size, n, p)

# Displaying the first 10 simulated values


print(head(sim_data, 10))

Example 2. Let’s calculate the probability of getting exactly 3 successes in 5
trials with a success probability of 0.7, and simulate 10,000 observations:

# Parameters for probability calculation


n <- 5
p <- 0.7
k <- 3

# Probability calculation
prob <- dbinom(k, n, p)
print(paste("Probability of exactly 3 successes:", prob))

# Parameters for simulation


size <- 10000

# Simulating binomial random variables


sim_data <- rbinom(size, n, p)

# Displaying the first 10 simulated values


print(head(sim_data, 10))
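One way to evaluate simulated data is to compare the observed proportions with the theoretical probabilities. The sketch below does this for the setting of Example 2; the seed value is an arbitrary assumption made so that the run is reproducible.

```r
set.seed(1)  # arbitrary seed (assumption) for reproducibility

n <- 5         # number of trials
p <- 0.7       # probability of success
size <- 10000  # number of simulated observations

sim_data <- rbinom(size, n, p)

# Proportion of simulated values equal to 3
empirical <- mean(sim_data == 3)

# Exact probability of 3 successes
theoretical <- dbinom(3, n, p)

print(c(empirical = empirical, theoretical = theoretical))

# Relative frequency of every outcome from 0 to 5
print(table(sim_data) / size)
```

With 10,000 observations the empirical proportion should lie close to the theoretical value of 0.3087.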

Practical Session
Create an interactive R studio in LMS for the student to practice the above
codes.

0.2 Poisson distribution


The Poisson distribution is a discrete probability distribution that expresses the
probability of a given number of events occurring in a fixed interval of time or
space. This document provides examples of how to calculate probabilities and
perform simulations for a Poisson distribution using R software.

0.2.1 Calculating probabilities in R


To calculate the probability of observing k events in a Poisson distribution
with parameter λ (the average rate of occurrence), we use the probability mass
function:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
In R, we use the dpois function to calculate these probabilities.

Example 3. Calculate the probability of observing exactly 3 events in a Poisson
distribution with an average rate of 2 events per interval.
# Parameters
lambda <- 2
k <- 3

# Calculate probability
probability <- dpois(k, lambda)
print(probability)
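The result of Example 3 can be checked against the mass function directly, and cumulative probabilities can be obtained with ppois(). This is a small sketch with the same λ = 2 and k = 3; the variable names are illustrative.

```r
lambda <- 2
k <- 3

# Poisson PMF written out from the formula
manual <- lambda^k * exp(-lambda) / factorial(k)

# Built-in Poisson PMF
builtin <- dpois(k, lambda)

# Cumulative probabilities: P(X <= 3) and P(X > 3)
at_most_3 <- ppois(3, lambda)
more_than_3 <- ppois(3, lambda, lower.tail = FALSE)

print(c(manual = manual, builtin = builtin))
print(c(at_most_3 = at_most_3, more_than_3 = more_than_3))
```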

0.2.2 Simulating Poisson distribution


To simulate random variables from a Poisson distribution, we use the rpois
function in R. This function generates random deviates.
Simulate 10 random variables from a Poisson distribution with an average
rate of 2 events per interval.

# Parameters
lambda <- 2
n <- 10

# Generate random variables


random_variables <- rpois(n, lambda)
print(random_variables)
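A Poisson distribution has mean and variance both equal to λ, so a larger simulated sample can be checked against that property. The sketch below assumes an arbitrary seed and a sample of 10,000 values rather than the 10 used above.

```r
set.seed(42)  # arbitrary seed (assumption) for reproducibility

lambda <- 2
x <- rpois(10000, lambda)

# Both values should be close to lambda = 2
sample_mean <- mean(x)
sample_var <- var(x)

print(c(mean = sample_mean, variance = sample_var))
```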

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

0.3 Normal distribution


The normal distribution is defined by two parameters: the mean (µ) and the
standard deviation (σ).

0.3.1 Standard Normal distribution


The standard normal distribution is a special case of the normal distribution
where the mean (µ) is 0 and the standard deviation (σ) is 1.

• The probability density function (PDF) of the standard normal distribution
is:
$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$$

• It is used as a reference distribution and is often denoted by Z.


• Any normal distribution can be converted to the standard normal distribution
through standardization using the formula:
$$Z = \frac{X - \mu}{\sigma}$$

• This process of standardization allows comparison between different normal
distributions.
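Standardization can be illustrated in R by computing the same probability in two ways. The sketch below assumes a hypothetical normal distribution with mean 100 and standard deviation 15; these numbers are not from the module.

```r
# Hypothetical parameters (assumed for illustration)
mu <- 100
sigma <- 15
x <- 120

# Standardize x
z <- (x - mu) / sigma

# P(X <= 120) computed on the original scale
p_direct <- pnorm(x, mean = mu, sd = sigma)

# P(Z <= z) computed on the standard normal scale
p_standard <- pnorm(z)

print(c(z = z, direct = p_direct, standardized = p_standard))
```

The two probabilities are identical, which is why tables of the standard normal distribution are sufficient for any normal distribution.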

0.3.2 Calculating probabilities in R


To calculate probabilities in a normal distribution, you can use the following R
functions:

• dnorm(x, mean, sd): Computes the density (height of the probability
density function) at x for a normal distribution with a specified mean and
sd (standard deviation).
• pnorm(q, mean, sd): Computes the cumulative probability up to q for a
normal distribution with a specified mean and sd.
• qnorm(p, mean, sd): Computes the quantile (the inverse of the cumulative
distribution function) at p for a normal distribution with a specified mean
and sd.
• rnorm(n, mean, sd): Generates n random samples from a normal distribution
with a specified mean and sd.

Example 4. Let’s calculate the following for a normal distribution with mean
= 0 and standard deviation = 1:

• The probability density at x = 1.


• The cumulative probability up to x = 1.
• The 95th percentile.
• Generate 10 random samples.

# Set parameters (named mu and sigma to avoid masking R's mean() and sd())
mu <- 0
sigma <- 1

# Calculate probability density at x = 1
density <- dnorm(1, mu, sigma)

# Calculate cumulative probability up to x = 1
cumulative_prob <- pnorm(1, mu, sigma)

# Calculate the 95th percentile
percentile_95 <- qnorm(0.95, mu, sigma)

# Generate 10 random samples
samples <- rnorm(10, mu, sigma)

# Print results
cat("Probability density at x = 1:", density, "\n")
cat("Cumulative probability up to x = 1:", cumulative_prob, "\n")
cat("95th percentile:", percentile_95, "\n")
cat("Random samples:", samples, "\n")

0.3.3 Simulation using Normal


Simulation involves generating random samples from a normal distribution.
This can be done using the rnorm function as shown above. We can also visualize
the distribution of these samples using a histogram.
Example 5. Let’s simulate 1000 random samples from a normal distribution
with mean = 0 and standard deviation = 1, and plot the histogram.

# Set parameters (named mu and sigma to avoid masking R's mean() and sd())
mu <- 0
sigma <- 1
n <- 1000

# Generate random samples
samples <- rnorm(n, mu, sigma)

# Plot histogram
hist(samples, breaks = 30, main = "Histogram of Random Samples",
     xlab = "Value", ylab = "Frequency", col = "lightblue")

Example 6. A company’s employees’ salaries follow a normal distribution with
a mean of $60,000 and a standard deviation of $15,000.

(a) What is the probability that a randomly selected employee earns more than
$75,000?
(b) What is the salary at the 90th percentile?
(c) Simulate the salaries of 1000 employees and plot a histogram of the
simulated salaries.

The solutions to the three questions above are provided below:

1. To find the probability that a randomly selected employee earns more than
$75,000, we use the cumulative distribution function (CDF) of the normal
distribution:

$$P(X > 75000) = 1 - P(X \le 75000) = 1 - P\left(Z \le \frac{75000 - 60000}{15000}\right) = 1 - P(Z \le 1)$$

In other words,
$$P(Z > 1) = 1 - P(Z \le 1)$$

Using R:

mean_salary <- 60000


sd_salary <- 15000
salary <- 75000

# Probability of earning more than 75,000


probability <- 1 - pnorm(salary, mean_salary, sd_salary)
print(probability)

2. To find the salary at the 90th percentile, we use the quantile function of
the normal distribution:
Using R:

percentile <- 0.90

# Salary at the 90th percentile


salary_90th <- qnorm(percentile, mean_salary, sd_salary)
print(salary_90th)

3. To simulate the salaries of 1000 employees and plot a histogram:


Using R:

n <- 1000

# Simulate salaries
simulated_salaries <- rnorm(n, mean_salary, sd_salary)

# Plot histogram
hist(simulated_salaries, breaks = 30, main = "Histogram of Simulated Salaries",
xlab = "Salary", col = "lightblue")
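The simulated salaries can be evaluated against the exact answers from parts 1 and 2. The following sketch repeats the simulation with an arbitrary seed (an assumption made for reproducibility) and compares the empirical results with the theoretical ones.

```r
set.seed(2024)  # arbitrary seed (assumption) for reproducibility

mean_salary <- 60000
sd_salary <- 15000
simulated_salaries <- rnorm(1000, mean_salary, sd_salary)

# Share of simulated employees earning more than 75,000
empirical_prob <- mean(simulated_salaries > 75000)

# Exact upper-tail probability, equivalent to 1 - pnorm(...)
theoretical_prob <- pnorm(75000, mean_salary, sd_salary, lower.tail = FALSE)

# Empirical and exact 90th percentiles
empirical_q90 <- unname(quantile(simulated_salaries, 0.90))
theoretical_q90 <- qnorm(0.90, mean_salary, sd_salary)

print(c(empirical = empirical_prob, theoretical = theoretical_prob))
print(c(empirical = empirical_q90, theoretical = theoretical_q90))
```

With only 1000 simulated values some sampling error is expected, so the empirical figures will be near the exact ones rather than equal to them.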

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

0.3.4 Watch the video

Watch the video and answer the questions that follow

Video Visit the URL below to view a video:

[Link]

Video by LawrenceStats - Standard Normal probabilities in R

1. (Insert after 1:50 minutes). If Z is standard normal, find

(a) P (Z > 0.9) (2 Minutes) (Answer is 0.1840601)


(b) P (Z < −1.1) (1 Minute) (Answer is 0.1356661)
2. (Insert after 2:45 minutes). If Z is standard normal, find
(a) P (Z > −0.5) (2 Minutes) (Answer is 0.6914625) (2 Marks)
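The answers quoted above can be reproduced with pnorm(), whose defaults are mean = 0 and sd = 1. This is a minimal sketch; the variable names are illustrative.

```r
# P(Z > 0.9): upper tail
p_1a <- pnorm(0.9, lower.tail = FALSE)

# P(Z < -1.1): lower tail
p_1b <- pnorm(-1.1)

# P(Z > -0.5): upper tail
p_2a <- pnorm(-0.5, lower.tail = FALSE)

print(c(p_1a = p_1a, p_1b = p_1b, p_2a = p_2a))
```

Setting lower.tail = FALSE is equivalent to computing 1 - pnorm(q).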

0.4 Student t-distribution


0.4.1 Calculating probabilities using R
To calculate the cumulative probability for the Student’s t-distribution, we use
the pt function in R. Below is an example:

# Calculate the cumulative probability P(T <= t) for t = 1.5 with df = 10


prob <- pt(1.5, df = 10)
prob

This code calculates the cumulative probability P (T ≤ 1.5) for a Student’s


t-distribution with 10 degrees of freedom.
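Upper-tail probabilities and critical values for the t-distribution follow the same pattern through the lower.tail argument and the qt function. The sketch below reuses t = 1.5 and df = 10; the 0.975 quantile is included as an illustration because it is the critical value used for 95% confidence intervals.

```r
df <- 10

# P(T <= 1.5) and P(T > 1.5)
lower <- pt(1.5, df = df)
upper <- pt(1.5, df = df, lower.tail = FALSE)

# Two-sided probability P(|T| > 1.5), using symmetry of the t-distribution
two_sided <- 2 * upper

# Critical value t such that P(T <= t) = 0.975
t_crit <- qt(0.975, df = df)

print(c(lower = lower, upper = upper, two_sided = two_sided))
print(t_crit)
```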

0.4.2 Simulation using R


To generate random samples from a Student’s t-distribution, we use the rt
function in R. Below is an example:

# Generate 1000 random samples from a Student's t-distribution with df = 10
set.seed(123) # Setting seed for reproducibility
samples <- rt(1000, df = 10)
hist(samples, main = "Histogram of Student's t-distribution samples",
     xlab = "Value", ylab = "Frequency")

This code generates 1000 random samples from a Student’s t-distribution


with 10 degrees of freedom and creates a histogram to visualize the samples.

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

0.5 Uniform distribution


The uniform distribution U (a, b) is a continuous probability distribution in
which all intervals of the same length within [a, b] are equally probable. This
section demonstrates how to calculate probabilities and simulate data using the
uniform distribution in R.

0.5.1 Calculating probabilities


To calculate probabilities in R for a uniform distribution U (a, b), you can use
the punif function.
Example 7. Consider a uniform distribution U (0, 1). What is the probability
that X (a random variable from U (0, 1)) is less than 0.5?

# Parameters
a <- 0
b <- 1

# Probability calculation
p <- punif(0.5, min = a, max = b)
p
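For a general U (a, b), the probability of an interval is its length divided by b − a, and punif() reproduces this. The sketch below assumes a hypothetical U (2, 10) and the interval from 4 to 7; these values are not from the module.

```r
# Hypothetical parameters (assumed for illustration)
a <- 2
b <- 10

# P(4 < X < 7) as a difference of two CDF values
p_interval <- punif(7, min = a, max = b) - punif(4, min = a, max = b)

# The same probability from the length of the interval
manual <- (7 - 4) / (b - a)

print(c(punif = p_interval, manual = manual))
```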

0.5.2 Simulating data in R


You can simulate data from a uniform distribution using the runif function.
Example 8. Simulate 1000 random numbers from a uniform distribution U (0, 1).

# Number of simulations
n <- 1000

# Simulating data
sim_data <- runif(n, min = 0, max = 1)

# Displaying first few values


head(sim_data)

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

0.6 Exponential distribution


The exponential distribution is often used to model the time between events
in a Poisson process. It is characterized by a single parameter, λ, which is
the rate parameter. The probability density function (PDF) of an exponential
distribution is given by:

$$f(x; \lambda) = \lambda e^{-\lambda x} \quad \text{for } x \ge 0$$

0.6.1 Calculating probabilities


To calculate probabilities for an exponential distribution in R, we use the pexp
function. The cumulative distribution function (CDF) for the exponential
distribution is given by:
$$F(x; \lambda) = 1 - e^{-\lambda x}$$
Example 9. Suppose λ = 0.5, and we want to find the probability that X ≤ 2.

lambda <- 0.5


x <- 2
prob <- pexp(x, rate = lambda)
prob
The output will be:

[1] 0.6321206
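The output of Example 9 can be verified by evaluating the CDF formula directly. The sketch below also shows the upper tail P (X > 2); the variable names are illustrative.

```r
lambda <- 0.5
x <- 2

# CDF written out from the formula
manual <- 1 - exp(-lambda * x)

# Built-in CDF
builtin <- pexp(x, rate = lambda)

# Upper tail P(X > 2), which equals exp(-lambda * x)
upper <- pexp(x, rate = lambda, lower.tail = FALSE)

print(c(manual = manual, builtin = builtin, upper = upper))
```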

0.6.2 Simulating data


To simulate data from an exponential distribution in R, we use the rexp
function. This function generates random numbers following an exponential
distribution.
Example 10. Suppose we want to simulate 1000 observations from an exponential
distribution with λ = 0.5.

lambda <- 0.5
n <- 1000
sim_data <- rexp(n, rate = lambda)
hist(sim_data, breaks = 50, main = "Histogram of Simulated Exponential Data",
     xlab = "Value")
This code will generate a histogram of the simulated data.
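Because the rate is the reciprocal of the mean, simulated exponential data with λ = 0.5 should average about 2. The sketch below checks this, and also compares the empirical share of values at or below 2 with pexp(); the seed and the larger sample size of 10,000 are assumptions made for a more stable comparison.

```r
set.seed(7)  # arbitrary seed (assumption) for reproducibility

lambda <- 0.5
sim_data <- rexp(10000, rate = lambda)

# Sample mean, expected to be near 1 / lambda = 2
sample_mean <- mean(sim_data)

# Empirical and exact P(X <= 2)
empirical <- mean(sim_data <= 2)
theoretical <- pexp(2, rate = lambda)

print(c(sample_mean = sample_mean, expected_mean = 1 / lambda))
print(c(empirical = empirical, theoretical = theoretical))
```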

Practical Session
Create an interactive R studio in LMS for the student to practice the
above codes.

Quiz
A customer service center receives calls at an average rate of 3 calls per minute.
Assume the time between consecutive calls follows an exponential distribution
with rate λ = 3 per minute. What is the probability that the time between two
consecutive calls is
1. more than 1 minute? (2 Marks) (Answer is 0.04978707)

2. between 1 minute and 3 minutes? (3 Marks)(Answer is 0.04966366)
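The quoted answers can be checked with pexp() using a rate of 3 per minute. This is a minimal sketch; the variable names are illustrative.

```r
rate <- 3  # calls per minute

# P(T > 1): waiting more than 1 minute
p_more_than_1 <- pexp(1, rate = rate, lower.tail = FALSE)

# P(1 < T < 3): difference of two CDF values
p_between_1_and_3 <- pexp(3, rate = rate) - pexp(1, rate = rate)

print(c(more_than_1 = p_more_than_1, between_1_and_3 = p_between_1_and_3))
```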

0.7 Confidence Interval


A confidence interval (CI) is a range of values used to estimate the true value
of a population parameter. It is calculated from the sample data and gives an
indication of the uncertainty associated with the estimate.
A confidence interval for a population mean can be calculated using the
formula:
$$CI = \bar{x} \pm z^{*} \frac{s}{\sqrt{n}}$$
Where:

• x̄ = sample mean
• z* = critical value from the standard normal distribution for the desired
confidence level
• s = sample standard deviation

• n = sample size

Interpretation of the confidence interval


A 95% confidence interval implies that if we were to take 100 different samples
and compute a CI for each sample, approximately 95 of those intervals would
contain the true population mean.
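This interpretation can be illustrated by simulation: draw many samples, build an interval from each, and count how often the true mean is covered. The sketch below assumes a hypothetical normal population with mean 50 and standard deviation 8, samples of size 30, 1000 repetitions and an arbitrary seed.

```r
set.seed(123)  # arbitrary seed (assumption) for reproducibility

# Hypothetical population and design (assumed for illustration)
mu <- 50
sigma <- 8
n <- 30
reps <- 1000

z_star <- qnorm(0.975)  # critical value for a 95% interval

# For each repetition, check whether the interval contains mu
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- z_star * sd(x) / sqrt(n)
  (mean(x) - half_width <= mu) && (mu <= mean(x) + half_width)
})

coverage <- mean(covered)
print(coverage)
```

The observed coverage should be close to 0.95. It tends to fall slightly below that level here because z* is combined with the sample standard deviation at a moderate sample size, which is the situation the t-distribution is designed to handle.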

Example 11. A company claims that the average time it takes to assemble a
product is 50 minutes with a known standard deviation of 8 minutes. A random
sample of 30 products is taken, and the average assembly time is found to be 52
minutes. Construct a 95% confidence interval for the true mean assembly time.
Given:

• Sample mean, x̄ = 52
• Population standard deviation, σ = 8
• Sample size, n = 30
• Confidence level = 95%

The formula for the confidence interval using the normal distribution is:
$$\bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

For a 95% confidence level, Z_{α/2} = 1.96. Calculate the margin of error:
$$\text{Margin of Error} = 1.96 \times \frac{8}{\sqrt{30}} = 1.96 \times 1.46 = 2.86$$
Construct the confidence interval:
$$52 \pm 2.86 \Rightarrow (49.14, 54.86)$$

Thus, the 95% confidence interval for the true mean assembly time is (49.14, 54.86)
minutes.
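The calculation in Example 11 can be reproduced in R, with qnorm() supplying the critical value. This is a sketch; the variable names are illustrative.

```r
# Values from Example 11
xbar <- 52    # sample mean
sigma <- 8    # known population standard deviation
n <- 30       # sample size
conf <- 0.95  # confidence level

# Critical value Z_{alpha/2}
z <- qnorm(1 - (1 - conf) / 2)

# Margin of error and interval
margin <- z * sigma / sqrt(n)
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(round(margin, 2))
print(round(ci, 2))
```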
Example 12. A researcher wants to estimate the average weight of a particular
species of fish in a lake. A sample of 15 fish is caught, and their weights are
recorded. The sample mean weight is 3.5 kg, and the sample standard deviation
is 0.5 kg. Construct a 95% confidence interval for the true mean weight using
the t-distribution.
Given:

• Sample mean, x̄ = 3.5


• Sample standard deviation, s = 0.5
• Sample size, n = 15
• Confidence level = 95%

The formula for the confidence interval using the t-distribution is:
$$\bar{x} \pm t_{\alpha/2,\,df} \frac{s}{\sqrt{n}}$$

Degrees of freedom, df = n − 1 = 14. For a 95% confidence level and df = 14,
the critical value t_{α/2,14} ≈ 2.145. Calculate the margin of error:
$$\text{Margin of Error} = 2.145 \times \frac{0.5}{\sqrt{15}} = 2.145 \times 0.129 = 0.277$$
Construct the confidence interval:
$$3.5 \pm 0.277 \Rightarrow (3.223, 3.777)$$
Thus, the 95% confidence interval for the true mean weight is (3.223, 3.777)
kg.
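Example 12 can be reproduced in R, with qt() supplying the critical value for n − 1 degrees of freedom. This is a sketch; the variable names are illustrative.

```r
# Values from Example 12
xbar <- 3.5   # sample mean (kg)
s <- 0.5      # sample standard deviation (kg)
n <- 15       # sample size
conf <- 0.95  # confidence level

# Critical value t_{alpha/2, df} with df = n - 1
t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)

# Margin of error and interval
margin <- t_crit * s / sqrt(n)
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(round(t_crit, 3))
print(round(ci, 2))
```

The interval runs from roughly 3.22 kg to 3.78 kg. It is wider than one based on the normal critical value 1.96 would be, reflecting the extra uncertainty from estimating the standard deviation with a small sample.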

Quiz on confidence interval


1. A quality control manager wants to estimate the average diameter of bolts
produced by a machine. The diameter is known to follow a normal
distribution with a standard deviation of 0.1 cm. A random sample of 40
bolts is taken, and the sample mean diameter is found to be 5.05 cm.
Construct a 95% confidence interval for the true mean diameter. (Answer
is (5.019, 5.081)) (4 Marks)
2. A researcher is studying the effect of a new fertilizer on plant growth. A
sample of 25 plants is treated with the fertilizer, and the sample mean
growth is measured as 15.3 cm with a sample standard deviation of 2.5
cm. Construct a 99% confidence interval for the true mean growth of
plants treated with the fertilizer. (Answer is (13.901, 16.699)) (4 Marks)

0.8 Reading Materials


1. Roger D. Peng (2015). R Programming for Data Science. Leanpub
(Pages 123-130). Read the selected pages.

0.9 Summary
1. Binomial distribution
• Calculate probabilities: Use the dbinom(x, size, prob) function where
x is the number of successes, size is the number of trials, and prob is
the probability of success in each trial.
• Simulate data: Use the rbinom(n, size, prob) function where n is the
number of observations.
2. Poisson distribution
• Calculate probabilities: Use the dpois(x, lambda) function where
x is the number of occurrences and lambda is the average rate of
occurrence.

• Simulate Data: Use the rpois(n, lambda) function where n is the
number of observations.
3. Normal distribution
• Calculate probabilities: Use the pnorm(q, mean, sd) function for the
cumulative probability P (X ≤ q), where mean is the mean of the
distribution and sd is the standard deviation. The dnorm(x, mean, sd)
function gives the density at x, not a probability.
• Simulate data: Use the rnorm(n, mean, sd) function where n is the
number of observations.
4. Exponential distribution
• Calculate probabilities: Use the pexp(q, rate) function for P (X ≤ q),
where rate is the rate parameter (1/mean). The dexp(x, rate) function
gives the density at x.
• Simulate data: Use the rexp(n, rate) function where n is the number
of observations.
5. Uniform distribution
• Calculate probabilities: Use the punif(q, min, max) function for
P (X ≤ q), where min is the minimum value and max is the maximum
value. The dunif(x, min, max) function gives the density at x.
• Simulate data: Use the runif(n, min, max) function where n is the
number of observations.
6. Student’s t-distribution
• Calculate probabilities: Use the pt(q, df) function for P (T ≤ q), where
df is the degrees of freedom. The dt(x, df) function gives the density
at x.
• Simulate data: Use the rt(n, df) function where n is the number of
observations.
