R Programming Essentials and Functions

R is a powerful programming language for statistical computing and graphics, widely used for data analysis and visualization. It features vectorized operations, extensive libraries, and strong graphics capabilities, along with support for various data types and manipulation techniques. The document also covers reserved words, operators, input/output functions, control structures, and statistical parameters, providing examples and solutions for common tasks in R.

R Programming

• R is a powerful programming language and environment for statistical computing and graphics.
• It is widely used in academia, industry, and research for data analysis, visualization, and modeling.
• R is open-source, cross-platform, and has a large and active community of users and developers.

Features of R Programming

• Vectorized Operations: R allows operations to be applied to entire vectors or matrices at once, making it efficient for data manipulation.
• Extensive Libraries: R has a vast ecosystem of packages for various tasks, including data manipulation, statistical analysis, machine learning, and visualization.
• Graphics Capabilities: R provides powerful tools for creating high-quality plots and visualizations, including scatter plots, histograms, bar charts, and more.
• Statistical Analysis: R offers a wide range of statistical functions and tests for analyzing data and making inferences.
• Integration with Other Languages: R can be integrated with other programming languages like C, C++, and Python.
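
The idea of vectorized operations can be sketched with a few lines of R. The data and variable names below are made up for illustration only.

```r
# Hypothetical data: prices and quantities for four items
prices <- c(10, 20, 30, 40)
quantities <- c(1, 2, 3, 4)

# Arithmetic is applied element by element, with no explicit loop
revenue <- prices * quantities
print(revenue)

# Functions such as sum() work on the whole vector at once
total_revenue <- sum(revenue)
cat("Total revenue:", total_revenue, "\n")
```

The same multiplication written with a for loop would need an index variable and a pre-allocated result vector, which is why the vectorized form is usually preferred.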

Reserved Words

• Reserved words are predefined keywords in R that have special meanings and cannot be used as identifiers (e.g., variable names or function names).
• Examples include if, else, repeat, while, for, in, function, next, break, NULL, NA, NaN, and Inf. Note that return is an ordinary function, not a reserved word.
• TRUE and FALSE are the logical constants in R; they are also reserved words.

Naming Variables

• Variable names in R can consist of letters, numbers, periods (.), and underscores (_). A name must start with a letter, or with a period that is not followed by a number.
• It's good practice to use descriptive names that reflect the purpose of the variable.
• Avoid using reserved words as variable names.
• R is a case-sensitive language. Example: TRUE and True are not the same. The first is a reserved word, while the second can be used as a variable name.
• Comments in R start with the '#' symbol.

Operators

Operators are symbols used to perform operations on variables and values. R supports various
types of operators, including arithmetic operators, relational operators, logical operators,
assignment operators, and special operators. Here are some examples:

Arithmetic Operators               Relational Operators

+         Addition                 ==   Equal to
-         Subtraction              !=   Not equal to
*         Multiplication           <    Less than
/         Division                 >    Greater than
^ or **   Exponentiation           <=   Less than or equal to
%%        Modulo (remainder)       >=   Greater than or equal to
%/%       Integer division

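Logical Operators
!          Negation
& or &&    Logical AND (& works element by element; && compares single values)
| or ||    Logical OR (| works element by element; || compares single values)
xor()      Exclusive OR
isTRUE()   Tests if an expression is TRUE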

Assignment Operators
<- or =    Assigns a value to a variable
<<-        Assigns a value to a variable in an enclosing (parent) environment, or in the global environment if no enclosing environment defines it

Special Operators
$ Extracts a component of a list or data frame
: Creates a sequence of numbers
%in% Tests if elements of one vector are in another vector
%*% Matrix multiplication
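
A few of the operators listed above can be tried directly at the console. The following is a minimal sketch with made-up values; the variable names are only illustrative.

```r
# Integer division and modulo
quotient <- 17 %/% 5
remainder <- 17 %% 5

# Element-wise versus single-value logical AND
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
elementwise <- a & b       # one result per element
single <- a[1] && b[1]     # one result for two single values

# Sequence operator and membership test
in_seq <- 3 %in% 1:5

# Matrix multiplication of a 2 x 2 matrix with the identity matrix
m <- matrix(1:4, nrow = 2)
product <- m %*% diag(2)

print(elementwise)
print(product)
```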

Input/Output Functions

• print(): Displays output on the console.
• cat(): Concatenates and prints objects to the console.
• scan(): Reads data from the keyboard or a file.
• read.table(): Reads data from a tabular file into a data frame.
• write.table(): Writes data from a data frame to a tabular file.
• read.csv(): Reads data from a CSV file into a data frame.
• write.csv(): Writes data from a data frame to a CSV file.
• readLines(): Reads lines from a text file.
• writeLines(): Writes lines to a text file.
• source(): Executes R code from a file.
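
As a small sketch of how the file functions fit together, the example below writes a made-up data frame to a temporary CSV file and reads it back. The data and file paths are hypothetical and exist only for this illustration.

```r
# Hypothetical data frame used only for this illustration
scores <- data.frame(name = c("Asha", "Ravi"), marks = c(82, 91))

# Write to a temporary CSV file, then read it back into a data frame
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)
scores_back <- read.csv(csv_path)
print(scores_back)

# writeLines() and readLines() work on plain text files
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)
cat(lines_back, sep = "\n")
```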

Data Types

• R supports several basic data types, including numeric, character, logical, integer, and complex.
• Vectors are one-dimensional arrays that can hold elements of the same data type.
• Matrices are two-dimensional arrays.
• Data frames are similar to matrices but can hold different types of data in each column.

Data Manipulation

• Use functions like subset(), merge(), transform(), and aggregate() for data manipulation.
• The dplyr package provides a set of functions for easy and efficient data manipulation.
• The tidyr package offers tools for tidying data, which involves restructuring datasets to facilitate analysis.
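
The base R functions named above can be sketched on a small, made-up data frame. The column names and values are assumptions chosen only for illustration.

```r
# Hypothetical data frame for illustration
sales <- data.frame(
  region = c("North", "South", "North", "South"),
  amount = c(100, 150, 200, 50)
)

# subset(): keep rows that satisfy a condition
large_sales <- subset(sales, amount >= 100)

# transform(): add a derived column
sales <- transform(sales, amount_thousands = amount / 1000)

# aggregate(): total amount per region
totals <- aggregate(amount ~ region, data = sales, FUN = sum)
print(totals)
```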

Data Visualization

• R offers several systems for data visualization, including ggplot2, lattice, and base graphics.
• Use functions like plot(), hist(), boxplot(), and barplot() for basic plotting.

Control Structures

• R supports various control structures such as if-else statements, for loops, while loops, and repeat loops.
• These control structures are used to control the flow of execution in a program.
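
A minimal sketch of the four control structures follows. The tasks (labelling numbers, summing integers, counting to three) are made up for illustration.

```r
# for loop with if-else: classify the numbers 1 to 4 as even or odd
labels <- character(0)
for (i in 1:4) {
  if (i %% 2 == 0) {
    labels <- c(labels, "even")
  } else {
    labels <- c(labels, "odd")
  }
}
print(labels)

# while loop: add integers until the running total exceeds 10
total <- 0
k <- 0
while (total <= 10) {
  k <- k + 1
  total <- total + k
}

# repeat loop: runs until it is left explicitly with break
count <- 0
repeat {
  count <- count + 1
  if (count == 3) break
}
cat("total:", total, "count:", count, "\n")
```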

Functions

• You can define your own functions in R using the function() keyword.
• Functions can take arguments and return values.
• R also has many built-in functions for common tasks, such as mathematical calculations, statistical analysis, and data manipulation.
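
As an illustration, here is a small user-defined function. The name standardize and its default argument are hypothetical choices, not part of base R.

```r
# A hypothetical user-defined function with a default argument
standardize <- function(x, center = mean(x)) {
  # Subtract the center and divide by the sample standard deviation;
  # the value of the last expression is returned
  (x - center) / sd(x)
}

z <- standardize(c(2, 4, 6))
print(z)
```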

Packages

1. R has a vast ecosystem of packages contributed by users from around the world.
2. Use the install.packages() function to install packages from CRAN, and then load them into your R session using the library() function.
3. Popular packages include tidyverse, caret, forecast, rmarkdown, and shiny.

Examples:

1. Write a program in R to calculate the sum of numbers.

# Define a vector of numbers
numbers <- c(1, 2, 3, 4, 5)
# Calculate the sum of numbers
total <- sum(numbers)
# Print the result
print(total)

Output: [1] 15

2. Write a program in R to display information about a person including their name, age,
and city. Use the cat() function to concatenate and display the information.

# Define variables
name <- "John"
age <- 30
city <- "New York"

# Concatenate and print the values using cat(), which separates its arguments with a space
cat("Name:", name, "\n")
cat("Age:", age, "\n")
cat("City:", city, "\n")

Output:
Name: John
Age: 30
City: New York
Unit I - INTRODUCTION

1. Using R, find the probabilities of tossing a pair of coins five times and obtaining: (i)
exactly 3 heads, (ii) exactly 3 tails, (iii) at least one head, (iv) not more than one tail.

Solution
# Define the number of trials (tosses)
n <- 5

# Probability of getting heads or tails on a single coin toss
p_head <- 0.5 # Probability of getting heads
p_tail <- 0.5 # Probability of getting tails

# Probability of getting exactly 3 heads
prob_3_heads <- dbinom(3, size = n, prob = p_head)

# Probability of getting exactly 3 tails
prob_3_tails <- dbinom(3, size = n, prob = p_tail)

# Probability of getting at least one head
prob_at_least_one_head <- 1 - dbinom(0, size = n, prob = p_head)

# Probability of getting not more than one tail
prob_not_more_than_one_tail <- pbinom(1, size = n, prob = p_tail)

# Print the probabilities
cat("Probability of getting exactly 3 heads:", prob_3_heads, "\n")
cat("Probability of getting exactly 3 tails:", prob_3_tails, "\n")
cat("Probability of getting at least one head:", prob_at_least_one_head, "\n")
cat("Probability of getting not more than one tail:", prob_not_more_than_one_tail, "\n")

Output

Probability of getting exactly 3 heads: 0.3125


Probability of getting exactly 3 tails: 0.3125
Probability of getting at least one head: 0.96875
Probability of getting not more than one tail: 0.1875
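
The values returned by dbinom() and pbinom() can be cross-checked against the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The helper binom_prob below is a hypothetical name used only for this check.

```r
n <- 5
p <- 0.5

# Binomial probability of exactly k successes in n trials
binom_prob <- function(k) choose(n, k) * p^k * (1 - p)^(n - k)

exactly_3 <- binom_prob(3)
at_least_one <- 1 - binom_prob(0)
at_most_one <- binom_prob(0) + binom_prob(1)

cat("Exactly 3:", exactly_3, "\n")
cat("At least one:", at_least_one, "\n")
cat("At most one:", at_most_one, "\n")
```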

2. For a triangular distribution with density

$$f(x) = \begin{cases} x, & 0 \le x \le 1 \\ 2 - x, & 1 \le x \le 2 \\ 0, & \text{otherwise} \end{cases}$$

find the mean and variance using R.
Solution

# Define the density function for the triangular distribution
triangular_density <- function(x) {
  ifelse(x >= 0 & x <= 1, x,
         ifelse(x > 1 & x <= 2, 2 - x, 0))
}

# Calculate mean
mean_triangular <- integrate(function(x) x * triangular_density(x), lower = 0, upper = 2)$value

# Calculate E[X^2]
mean_squared_triangular <- integrate(function(x) x^2 * triangular_density(x), lower = 0, upper = 2)$value

# Calculate variance
var_triangular <- mean_squared_triangular - (mean_triangular)^2

# Print the results
cat("Mean:", mean_triangular, "\n")
cat("Variance:", var_triangular, "\n")

Output

Mean: 1
Variance: 0.1666667

3. A random variable X has the following probability function:

X = xi       0    1    2    3    4
P(X = xi)    k    3k   5k   7k   9k

Find k, P[X ≥ 3] and P[0 < X < 4] using R.

Solution
# Define the values of X and corresponding probabilities
X <- c(0, 1, 2, 3, 4)

# Define a function to calculate the probability for each value of X
calc_prob <- function(k) {
  Prob_X <- c(k, 3*k, 5*k, 7*k, 9*k)
  return(Prob_X)
}

# Define a function to calculate the total probability
total_prob <- function(k) {
  Prob_X <- calc_prob(k)
  return(sum(Prob_X))
}

# Define an equation to solve for k (the probabilities must sum to 1)
equation <- function(k) {
  return(total_prob(k) - 1)
}

# Solve for k
k <- uniroot(equation, interval = c(0, 1))$root

# Display the value of k
cat("Value of k:", k, "\n")

# Calculate P[X ≥ 3]
Prob_X <- calc_prob(k)
P_X_ge_3 <- sum(Prob_X[X >= 3])

# Calculate P[0 < X < 4]
P_X_between_0_and_4 <- sum(Prob_X[X > 0 & X < 4])

# Display the results
cat("P[X ≥ 3]:", P_X_ge_3, "\n")
cat("P[0 < X < 4]:", P_X_between_0_and_4, "\n")

Output

Value of k: 0.04
P[X ≥ 3]: 0.64
P[0 < X < 4]: 0.6
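
Because the probabilities k, 3k, 5k, 7k, 9k must sum to 1, k can also be obtained in closed form as 1 divided by the sum of the multipliers, without a root finder. The sketch below is an alternative check, not the method used in the solution; the name multipliers is hypothetical.

```r
# Multipliers of k for X = 0, 1, 2, 3, 4
X <- 0:4
multipliers <- c(1, 3, 5, 7, 9)

# 25k = 1, so k = 1 / 25
k <- 1 / sum(multipliers)
prob <- k * multipliers

cat("k:", k, "\n")
cat("P[X >= 3]:", sum(prob[X >= 3]), "\n")
cat("P[0 < X < 4]:", sum(prob[X > 0 & X < 4]), "\n")
```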

4. A continuous random variable X has the probability density function

$$f(x) = \begin{cases} \dfrac{x^2}{3}, & -1 < x < 2 \\ 0, & \text{elsewhere} \end{cases}$$

Using R, find P(0 < X <= 1).

Solution
# Define the probability density function
f <- function(x) {
  ifelse(x >= -1 & x <= 2, x^2/3, 0)
}

# Find the probability P(0 < X <= 1)
P_0_to_1 <- integrate(f, lower = 0, upper = 1)$value

cat("P(0 < X <= 1):", P_0_to_1, "\n")

Output

P(0 < X <= 1): 0.1111111
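
Two quick checks are possible here: the density should integrate to 1 over (-1, 2), and since the antiderivative of x^2/3 is x^3/9, the requested probability equals 1/9. The sketch below uses hypothetical variable names for both checks.

```r
# Density from the problem statement
f <- function(x) ifelse(x > -1 & x < 2, x^2 / 3, 0)

# A valid density integrates to 1 over its support
total_area <- integrate(f, lower = -1, upper = 2)$value

# Numerical probability and the closed form (1^3 - 0^3) / 9
numeric_value <- integrate(f, lower = 0, upper = 1)$value
exact_value <- (1^3 - 0^3) / 9

cat("Total area:", total_area, "\n")
cat("Numerical:", numeric_value, " Exact:", exact_value, "\n")
```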


Unit II - STATISTICAL PARAMETERS

1. A company keeps a record of accidents during a recent safety review. A random sample of 60 accidents was selected and classified by day of the week.

Day                 Mon   Tue   Wed   Thurs   Fri
No. of accidents    8     12    9     14      17

Test whether accidents are more likely to occur on some days than on others using R.

Solution

# Define observed frequencies
observed <- c(8, 12, 9, 14, 17)

# Calculate the total number of accidents
total_accidents <- sum(observed)

# Calculate expected probabilities (assuming equal probability for each day)
expected_prob <- rep(1/5, 5)

# Perform chi-square goodness-of-fit test
chi_square_test <- chisq.test(observed, p = expected_prob)

# Print the test result
print(chi_square_test)

# Print the critical chi-square value
df <- length(observed) - 1
critical_chi_square <- qchisq(1 - 0.05, df)
cat("Critical chi-square value:", round(critical_chi_square, 3), "\n")

# Print the conclusion based on the p-value
if (chi_square_test$p.value < 0.05) {
  cat("Reject the null hypothesis. Accidents are more likely to occur on some days than others.\n")
} else {
  cat("Accept the null hypothesis. There is no significant difference in the likelihood of accidents on different days.\n")
}

Output

Chi-squared test for given probabilities


data: observed
X-squared = 4.5, df = 4, p-value = 0.3425

Critical chi-square value: 9.488


Accept the null hypothesis. There is no significant difference in the
likelihood of accidents on different days.
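
The statistic reported by chisq.test() can be reproduced by hand from the formula sum((O - E)^2 / E), with the p-value taken from the upper tail of the chi-square distribution. This is a sketch of that check; the variable names are illustrative.

```r
observed <- c(8, 12, 9, 14, 17)

# Under the null hypothesis every day has the same expected count
expected <- rep(sum(observed) / length(observed), length(observed))

# Chi-square statistic and upper-tail p-value
chi_sq <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

cat("Chi-square:", chi_sq, "\n")
cat("p-value:", round(p_value, 4), "\n")
```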

2. 1000 students at college level were graded according to their IQ level and economic condition. What conclusion can be drawn from the following data using R?

                      IQ Level
Economic condition    High    Low
Rich                  460     140
Poor                  240     160

Solution

# Create a matrix of observed frequencies
observed <- matrix(c(460, 140, 240, 160), nrow = 2, byrow = TRUE)

# Perform chi-square test of independence
chi_square_test <- chisq.test(observed)

# Print the test result
print(chi_square_test)

# Set the significance level
alpha <- 0.05

# Calculate the critical chi-square value
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
critical_chi_square <- qchisq(1 - alpha, df)

# Print the critical chi-square value
cat("Critical chi-square value:", round(critical_chi_square, 2), "\n")

# Print the conclusion
if (chi_square_test$p.value < alpha) {
  cat("Reject the null hypothesis. There is a significant association between economic condition and IQ level.\n")
} else {
  cat("Accept the null hypothesis. There is no significant association between economic condition and IQ level.\n")
}

Output

Pearson's Chi-squared test with Yates' continuity correction

data: observed
X-squared = 30.957, df = 1, p-value = 2.638e-08

Critical chi-square value: 3.84


Reject the null hypothesis. There is a significant association between
economic condition and IQ level.
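
For a test of independence, the expected count in each cell is (row total x column total) / grand total. The sketch below computes these counts with outer() and compares the uncorrected statistic with chisq.test(..., correct = FALSE); note that the output above uses Yates' continuity correction, so its statistic is slightly smaller. The dimnames are added here only for readability.

```r
observed <- matrix(c(460, 140, 240, 160), nrow = 2, byrow = TRUE,
                   dimnames = list(Economic = c("Rich", "Poor"),
                                   IQ = c("High", "Low")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(expected)

# Pearson statistic without the continuity correction
chi_sq_uncorrected <- sum((observed - expected)^2 / expected)
builtin <- chisq.test(observed, correct = FALSE)

cat("Manual statistic:", round(chi_sq_uncorrected, 3), "\n")
cat("chisq.test statistic:", round(unname(builtin$statistic), 3), "\n")
```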

3. The following table gives the number of aircraft accidents that occurred during the
various days of the week. Test whether the accidents are uniformly distributed over
the week using R programming.
Days               Sun   Mon   Tue   Wed   Thu   Fri   Sat
No. of accidents   14    16    8     12    11    9     14

Solution

# Given data
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
accidents <- c(14, 16, 8, 12, 11, 9, 14)

# Perform chi-square goodness-of-fit test (equal probabilities are the default)
chi_square_test <- chisq.test(accidents)

# Print the chi-square test result
print(chi_square_test)

# Calculate the critical chi-square value for alpha = 0.05
df <- length(accidents) - 1
critical_chi_square <- qchisq(1 - 0.05, df)

# Print the critical chi-square value
cat("Critical chi-square value:", round(critical_chi_square, 2), "\n")

# Determine the conclusion
if (chi_square_test$statistic > critical_chi_square) {
  cat("Reject the null hypothesis. Accidents are not uniformly distributed over the week.\n")
} else {
  cat("Accept the null hypothesis. Accidents are uniformly distributed over the week.\n")
}

Output
Chi-squared test for given probabilities

data: accidents
X-squared = 4.1667, df = 6, p-value = 0.6541

Critical chi-square value: 12.59


Accept the null hypothesis. Accidents are uniformly distributed over
the week.

4. Two independent samples of sizes 9 and 7 from a normal population are given below:

Sample I     18   13   12   15   12   14   16   14   15
Sample II    16   19   13   16   18   13   15   -    -

Do the estimates of the population variance differ significantly? Test using R.

Solution
# Define the two samples
sample1 <- c(18, 13, 12, 15, 12, 14, 16, 14, 15)
sample2 <- c(16, 19, 13, 16, 18, 13, 15)

# Perform F-test for equality of variances
var_test <- var.test(sample1, sample2)

# Print the results
print(var_test)

# Print the conclusion
if (var_test$p.value < 0.05) {
  cat("Reject the null hypothesis.\n")
} else {
  cat("Accept the null hypothesis.\n")
}

Output

F = 0.71591, num df = 8, denom df = 6, p-value = 0.6444


alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1278495 3.3301911
sample estimates:
ratio of variances
0.7159091

Accept the null hypothesis.
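
The F statistic is simply the ratio of the two sample variances, and the two-sided p-value is twice the smaller tail area of the F distribution. The following sketch reproduces both by hand; the variable names are illustrative.

```r
sample1 <- c(18, 13, 12, 15, 12, 14, 16, 14, 15)
sample2 <- c(16, 19, 13, 16, 18, 13, 15)

# F statistic: ratio of sample variances (each uses the n - 1 divisor)
f_stat <- var(sample1) / var(sample2)
df1 <- length(sample1) - 1
df2 <- length(sample2) - 1

# Two-sided p-value: twice the smaller of the two tail areas
p_lower <- pf(f_stat, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)

cat("F:", round(f_stat, 5), "\n")
cat("p-value:", round(p_value, 4), "\n")
```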


5. The mean life time of a sample of 100 light bulbs produced is computed to be 1570
hours with a standard deviation 120 hours. If μ is the mean life time of all the bulbs
produced by the company, test the hypothesis using R μ =1600 hours, against the
alternate hypothesis μ ≠ 1600 hours with α = 0.05 and 0.01.

Solution
# Given data
sample_mean <- 1570 # Mean lifetime of the sample bulbs
population_mean <- 1600 # Hypothesized mean lifetime of all bulbs
population_std <- 120 # Sample standard deviation, used as an estimate of the population standard deviation
sample_size <- 100 # Sample size

# Calculate the z-score


z_score <- round((sample_mean - population_mean) / (population_std /
sqrt(sample_size)), 2)

# Define significance levels


alpha_1 <- 0.05
alpha_2 <- 0.01

# Calculate critical z-values for two-tailed test


z_critical_1 <- round(qnorm(1 - alpha_1 / 2), 2)
z_critical_2 <- round(qnorm(1 - alpha_2 / 2), 2)

# Print z-score
cat("Z-score:", z_score, "\n")

cat("For alpha = 0.05:\n")


cat("Critical Z-value:", z_critical_1, "\n")
if (abs(z_score) > z_critical_1) {
cat("Reject the null hypothesis: μ ≠ 1600 hours\n")
} else {
cat("Accept the null hypothesis: μ = 1600 hours\n")
}

cat("\nFor alpha = 0.01:\n")


cat("Critical Z-value:", z_critical_2, "\n")
if (abs(z_score) > z_critical_2) {
cat("Reject the null hypothesis: μ ≠ 1600 hours\n")
} else {
cat("Accept the null hypothesis: μ = 1600 hours\n")
}
Output

Z-score: -2.5
For alpha = 0.05:
Critical Z-value: 1.96
Reject the null hypothesis: μ ≠ 1600 hours

For alpha = 0.01:


Critical Z-value: 2.58
Accept the null hypothesis: μ = 1600 hours
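
The same decision can be reached with a p-value instead of critical values: for a two-tailed z-test the p-value is 2 * P(Z <= -|z|), and the null hypothesis is rejected when it falls below alpha. A minimal sketch under the same data:

```r
# Test statistic from the problem data
z_score <- (1570 - 1600) / (120 / sqrt(100))

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_score))
cat("Two-sided p-value:", round(p_value, 4), "\n")

# Compare with each significance level
reject_at_05 <- p_value < 0.05
reject_at_01 <- p_value < 0.01
cat("Reject at 0.05:", reject_at_05, "\n")
cat("Reject at 0.01:", reject_at_01, "\n")
```

The p-value lies between 0.01 and 0.05, which is why the conclusion differs between the two significance levels.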

6. The mean breaking strength of the cables supplied by a manufacturer is 1800 with a
S.D. of 100. By a new technique in the manufacturing process, it is claimed that the
breaking strength of the cable has increased. In order to test this claim, a sample of 50
cables is tested and it is found that the mean breaking strength is 1850. Can we support
the claim at 1% level of significance (Use R)?

Solution
# Given data
population_mean <- 1800 # Mean breaking strength before the new technique
population_std <- 100 # Standard deviation of breaking strength before the new technique
sample_mean <- 1850 # Mean breaking strength observed in the sample
sample_size <- 50 # Sample size
alpha <- 0.01 # Significance level

# Calculate the z-score


z_score <- round((sample_mean - population_mean) / (population_std /
sqrt(sample_size)), 2)

# Calculate the critical z-value for a one-tailed test (upper tail)


z_critical <- round(qnorm(1 - alpha), 2)

# Print z-score and critical z-value


cat("Z-score:", z_score, "\n")
cat("Critical z-value:", z_critical, "\n")

# Perform the hypothesis test


if (z_score > z_critical) {
  cat("Reject the null hypothesis. We may support the claim that the breaking strength has increased.\n")
} else {
  cat("Accept the null hypothesis. There is not enough evidence to support the claim.\n")
}

Output

Z-score: 3.54
Critical z-value: 2.33
Reject the null hypothesis. We may support the claim that the breaking
strength has increased.

7. The average number of defective articles per day in a certain factory is claimed to be
less than the average of all the factories. The average of all the factories is 30.5. A
random sample of 100 days showed the following distribution.
Class limits    16-20   21-25   26-30   31-35   36-40
No. of days     12      22      20      30      16
Is the average less than the figure for all the factories (Use R)?

Solution
# Given data
sample_data <- c(18, 23, 28, 33, 38) # Midpoint of each class
frequency <- c(12, 22, 20, 30, 16)
n <- sum(frequency) # Total number of days
mean_all_factories <- 30.5 # Average of all factories

# Calculating the sample mean and standard deviation


sample_mean <- sum(sample_data * frequency) / n
sample_sd <- sqrt(sum(((sample_data - sample_mean) ^ 2) * frequency) / (n))

# Printing sample mean, sample standard deviation, and Z-score


cat("Sample Mean:", round(sample_mean, 2), "\n")
cat("Sample Standard Deviation:", round(sample_sd, 2), "\n")

# Performing Z-test
z_score <- (sample_mean - mean_all_factories) / (sample_sd / sqrt(n))
cat("Z-score:", round(z_score, 2), "\n")

# Define the critical Z-value for a one-tailed test with significance level α = 0.05
z_critical <- qnorm(0.05)
cat("Critical Z-value:", round(z_critical, 2), "\n")

# Check if the Z-score is less than the critical Z-value


if (z_score < z_critical) {
  cat("Reject the null hypothesis. The average number of defective articles per day in this factory is less than the average of all the factories.\n")
} else {
  cat("Accept the null hypothesis. There is not enough evidence to support the claim that the average number of defective articles per day in this factory is less than the average of all the factories.\n")
}

Output

Sample Mean: 28.8


Sample Standard Deviation: 6.35
Z-score: -2.68
Critical Z-value: -1.64
Reject the null hypothesis. The average number of defective articles
per day in this factory is less than the average of all the factories.
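
For grouped data, the mean can also be obtained with weighted.mean(), or by expanding the midpoints with rep() and using mean(). The sketch below also shows that sd() divides by n - 1, whereas the solution above divides by n, so the two standard deviations differ slightly. Variable names are illustrative.

```r
midpoints <- c(18, 23, 28, 33, 38)
frequency <- c(12, 22, 20, 30, 16)

# Mean of grouped data as a frequency-weighted mean of the midpoints
grouped_mean <- weighted.mean(midpoints, frequency)

# Equivalent: repeat each midpoint as many times as its frequency
expanded <- rep(midpoints, times = frequency)
expanded_mean <- mean(expanded)

# sd() uses the n - 1 divisor; the second line uses the n divisor
sd_n_minus_1 <- sd(expanded)
sd_n <- sqrt(mean((expanded - expanded_mean)^2))

cat("Mean:", grouped_mean, "\n")
cat("SD (n - 1):", round(sd_n_minus_1, 3), " SD (n):", round(sd_n, 3), "\n")
```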

8. The sales manager of a large company conducted a sample survey in two places A and
B taking 200 samples in each case. The results were the following table. Use R to test
whether the average sales is the same in the two areas at 5% level.
                 Place A     Place B
Average sales    Rs. 2000    Rs. 1700
S.D.             Rs. 200     Rs. 450

Solution
# Given data
average_sales_A <- 2000
average_sales_B <- 1700
sd_A <- 200
sd_B <- 450
n <- 200 # Number of samples in each case
alpha <- 0.05 # Significance level

# Compute the standard error of the difference between means


SE_diff <- sqrt((sd_A^2 / n) + (sd_B^2 / n))

# Compute the Z-score


Z_score <- round((average_sales_A - average_sales_B) / SE_diff, 2)
cat("Z-score:", Z_score, "\n")

# Compute the critical Z-value for a two-tailed test


critical_Z <- round(qnorm(1 - alpha/2), 2)
cat("Critical Z-value:", critical_Z, "\n")

# Check if the Z-score falls within the critical region
if (abs(Z_score) > critical_Z) {
  cat("Reject the null hypothesis. There is sufficient evidence to conclude that the average sales are different in the two areas.\n")
} else {
  cat("Accept the null hypothesis. There is not enough evidence to conclude that the average sales are different in the two areas.\n")
}

Output

Z-score: 8.62
Critical Z-value: 1.96
Reject the null hypothesis. There is sufficient evidence to conclude
that the average sales are different in the two areas.

9. Two samples drawn from two different populations gave the following results:
             Size   Mean   SD
Sample I     100    582    24
Sample II    100    540    28
Use R to test the hypothesis, at 5% level of significance, that the difference of the means
of the population is 35.

Solution
# Given data
mean_1 <- 582 # Mean of Sample I
sd_1 <- 24 # Standard deviation of Sample I
n_1 <- 100 # Sample size of Sample I
mean_2 <- 540 # Mean of Sample II
sd_2 <- 28 # Standard deviation of Sample II
n_2 <- 100 # Sample size of Sample II
population_mean_diff <- 35 # Hypothesized difference of means
alpha <- 0.05 # Significance level

# Calculate the standard error of the difference between means


SE_diff <- sqrt((sd_1^2 / n_1) + (sd_2^2 / n_2))

# Calculate the z-score


z_score <- (mean_1 - mean_2 - population_mean_diff) / SE_diff

# Define the critical z-value for a two-tailed test


z_critical <- qnorm(1 - alpha/2)

# Print the z-score


cat("Z-score:", round(z_score, 2), "\n")

# Print the critical z-value


cat("Critical Z-value:", round(z_critical, 2), "\n")

# Compare the z-score with the critical z-value


if (abs(z_score) > z_critical) {
  cat("Reject the null hypothesis. There is sufficient evidence to conclude that the difference of the means of the populations is not 35.\n")
} else {
  cat("Accept the null hypothesis. There is not enough evidence to conclude that the difference of the means of the populations is not 35.\n")
}

Output

Z-score: 1.9
Critical Z-value: 1.96
Accept the null hypothesis. There is not enough evidence to conclude
that the difference of the means of the populations is not 35.
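
Problems 8 and 9 use the same two-sample z-test and differ only in the hypothesized difference, so the calculation can be wrapped in one function. The helper two_sample_z below is a hypothetical name introduced for illustration; it is a sketch, not a function from any package.

```r
# Hypothetical helper: two-tailed z-test for the difference of two means
two_sample_z <- function(mean1, mean2, sd1, sd2, n1, n2,
                         delta0 = 0, alpha = 0.05) {
  # Standard error of the difference between the sample means
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  z <- (mean1 - mean2 - delta0) / se
  p_value <- 2 * pnorm(-abs(z))
  list(z = z, p_value = p_value, reject = p_value < alpha)
}

# Problem 8: equal average sales (hypothesized difference 0)
sales <- two_sample_z(2000, 1700, 200, 450, 200, 200)

# Problem 9: hypothesized difference of means equal to 35
diff35 <- two_sample_z(582, 540, 24, 28, 100, 100, delta0 = 35)

cat("Problem 8 z:", round(sales$z, 2), "reject:", sales$reject, "\n")
cat("Problem 9 z:", round(diff35$z, 2), "reject:", diff35$reject, "\n")
```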

10. Two horses A and B were tested according to the time (in seconds) to run a particular
race with the following results:
Horse A    28   30   32   33   33   29   34
Horse B    29   30   30   24   27   29

Use R to test whether horse A is running faster than horse B at the 5% level.

Solution
# Given data
times_horse_A <- c(28, 30, 32, 33, 33, 29, 34)
times_horse_B <- c(29, 30, 30, 24, 27, 29)

# Perform two-sample t-test


t_test_result <- t.test(times_horse_A, times_horse_B, alternative = "greater")

# Print the result


print(t_test_result)

# Determine the critical region value


alpha <- 0.05
critical_region <- qt(1 - alpha, df = t_test_result$parameter)

# Print the critical region value


cat("Critical region value:", round(critical_region, 2), "\n")

# Compare the test statistic with the critical region value.
# The data are running times, so a larger mean time means a slower horse.
if (t_test_result$statistic > critical_region) {
  cat("Reject the null hypothesis. Horse A's mean time is significantly greater than horse B's, so horse A is slower, not faster, at the 5% level of significance.\n")
} else {
  cat("Fail to reject the null hypothesis. There's not enough evidence that the mean times of the two horses differ in this direction at the 5% level of significance.\n")
}

Output

Welch Two Sample t-test

data: times_horse_A and times_horse_B
t = 2.4335, df = 10.652, p-value = 0.01693
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.8103788 Inf
sample estimates:
mean of x mean of y
31.28571 28.16667

Critical region value: 1.8

Reject the null hypothesis. Horse A's mean time is significantly greater
than horse B's, so horse A is slower, not faster, at the 5% level of
significance.
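
The Welch statistic and its degrees of freedom reported by t.test() can be reproduced by hand. The sketch below uses the standard Welch-Satterthwaite approximation for the degrees of freedom; the short variable names are illustrative.

```r
a <- c(28, 30, 32, 33, 33, 29, 34)  # times for horse A
b <- c(29, 30, 30, 24, 27, 29)      # times for horse B

# Variance of each sample mean
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (va + vb)^2 /
  (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

cat("t:", round(t_stat, 4), " df:", round(df_welch, 3), "\n")
```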

Unit III - REGRESSION AND CORRELATION ANALYSIS

1. From the following data, using R find (i) the two regression equations, (ii) the coefficient of correlation between the marks in economics and statistics, and (iii) the most likely marks in statistics when marks in economics are 30.

Marks in Economics (X)     25   28   35   32   31   36   29   38   34   32
Marks in Statistics (Y)    43   46   49   41   36   32   31   30   33   39

Solution
# Load the required libraries
library(stats)
# Enter the data
x <- c(25, 28, 35, 32, 31, 36, 29, 38, 34, 32)
y <- c(43, 46, 49, 41, 36, 32, 31, 30, 33, 39)

# Calculate regression equations


# Y on X
model_Y_on_X <- lm(y ~ x)
m_YX <- model_Y_on_X$coefficients[2]
b_YX <- model_Y_on_X$coefficients[1]

# X on Y
model_X_on_Y <- lm(x ~ y)
m_XY <- model_X_on_Y$coefficients[2]
b_XY <- model_X_on_Y$coefficients[1]

# Print regression equations


cat("Regression equation (Y on X): Y =", round(b_YX, 2), "+", round(m_YX, 2), "* X\n")
cat("Regression equation (X on Y): X =", round(b_XY, 2), "+", round(m_XY, 2), "* Y\n")

# Calculate coefficient of correlation


correlation <- cor(x, y)

# Print coefficient of correlation


cat("Coefficient of correlation:", round(correlation, 3), "\n")

# Predict the most likely marks in statistics when marks in economics are 30
predicted_y <- predict(model_Y_on_X, data.frame(x = 30))
cat("Predicted marks in statistics when marks in economics are 30:", round(predicted_y, 2), "\n")

Output

Regression equation (Y on X): Y = 59.26 + -0.66 * X


Regression equation (X on Y): X = 40.88 + -0.23 * Y
Coefficient of correlation: -0.394
Predicted marks in statistics when marks in economics are 30: 39.33
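
The two regression coefficients and the correlation coefficient are linked by r^2 = b_yx * b_xy, with r taking the sign of the slopes. The sketch below verifies this relation on the same data; the names b_yx, b_xy and r_from_slopes are illustrative.

```r
x <- c(25, 28, 35, 32, 31, 36, 29, 38, 34, 32)
y <- c(43, 46, 49, 41, 36, 32, 31, 30, 33, 39)

# Slopes of the two regression lines
b_yx <- coef(lm(y ~ x))[["x"]]  # regression of y on x
b_xy <- coef(lm(x ~ y))[["y"]]  # regression of x on y

# Correlation from cor() and from the product of the slopes
r <- cor(x, y)
r_from_slopes <- sign(b_yx) * sqrt(b_yx * b_xy)

cat("cor():", round(r, 3), "\n")
cat("From slopes:", round(r_from_slopes, 3), "\n")
```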

2. The joint probability mass function of x and y is given below. Find the correlation coefficient of x and y using R.

          x = -1    x = +1    Total
y = 0     1/8       3/8       4/8
y = 1     2/8       2/8       4/8
Total     3/8       5/8

Solution
# Define the joint probability mass function (rows: y = 0, 1; columns: x = -1, +1)
pmf <- matrix(c(1/8, 3/8, 2/8, 2/8), nrow = 2, byrow = TRUE)

# Define the values of x and y
x <- c(-1, 1)
y <- c(0, 1)

# Calculate the means (column sums give the marginal of x, row sums the marginal of y)
mu_x <- sum(x * colSums(pmf))
mu_y <- sum(y * rowSums(pmf))

# Calculate the standard deviations
sigma_x <- sqrt(sum((x - mu_x)^2 * colSums(pmf)))
sigma_y <- sqrt(sum((y - mu_y)^2 * rowSums(pmf)))

# Calculate the covariance (outer() is arranged to match pmf: rows y, columns x)
cov_xy <- sum(outer(y - mu_y, x - mu_x, "*") * pmf)

# Calculate the correlation coefficient
correlation_coefficient <- cov_xy / (sigma_x * sigma_y)

# Print the result
cat("Correlation Coefficient of x and y:", round(correlation_coefficient, 3), "\n")

Output

Correlation Coefficient of x and y: -0.258

3. Obtain the rank correlation coefficient for the following data using R.
X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70

Solution
# Given data
X <- c(68, 64, 75, 50, 64, 80, 75, 40, 55, 64)
Y <- c(62, 58, 68, 45, 81, 60, 68, 48, 50, 70)

# Calculate the rank correlation coefficient (Spearman's rho)


correlation_coefficient <- cor(X, Y, method = "spearman")

# Print the rank correlation coefficient


cat("Rank Correlation Coefficient (Spearman's rho):", round(correlation_coefficient,
3), "\n")

Output

Rank Correlation Coefficient (Spearman's rho): 0.556

4. Find the rank correlation coefficient from the following data using R:
Rank in X 1 2 3 4 5 6 7
Rank in Y 4 3 1 2 6 5 7

Solution
# Given ranks
rank_X <- c(1, 2, 3, 4, 5, 6, 7)
rank_Y <- c(4, 3, 1, 2, 6, 5, 7)

# Calculate the rank correlation coefficient


correlation <- cor(rank_X, rank_Y, method = "spearman")

# Print the rank correlation coefficient


cat("Rank Correlation Coefficient (Spearman's rho):", round(correlation, 4), "\n")

Output

Rank Correlation Coefficient (Spearman's rho): 0.6429
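
When there are no tied ranks, Spearman's coefficient reduces to rho = 1 - 6 * sum(d^2) / (n (n^2 - 1)), where d is the difference between paired ranks. The sketch below applies this formula to the same ranks as a cross-check of cor().

```r
rank_X <- c(1, 2, 3, 4, 5, 6, 7)
rank_Y <- c(4, 3, 1, 2, 6, 5, 7)

# Differences between paired ranks
d <- rank_X - rank_Y
n <- length(d)

# Spearman's formula for untied ranks
rho <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
cat("Sum of squared differences:", sum(d^2), "\n")
cat("Spearman's rho:", round(rho, 4), "\n")
```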

5. Ten participants were ranked according to their performance in a musical test by three judges, X, Y and Z, as shown in the following data.

Participant   1    2    3    4    5    6    7    8    9    10
Rank by X     1    6    5    10   3    2    4    9    7    8
Rank by Y     3    5    8    4    7    10   2    1    6    9
Rank by Z     6    4    9    8    1    2    3    10   5    7

Using the rank correlation method, discuss which pair of judges has the nearest approach to common likings of music using R.

Solution
# Given ranks
rank_X <- c(1, 6, 5, 10, 3, 2, 4, 9, 7, 8)
rank_Y <- c(3, 5, 8, 4, 7, 10, 2, 1, 6, 9)
rank_Z <- c(6, 4, 9, 8, 1, 2, 3, 10, 5, 7)
# Calculate rank correlation coefficients
correlation_XY <- cor(rank_X, rank_Y, method = "spearman")
correlation_XZ <- cor(rank_X, rank_Z, method = "spearman")
correlation_YZ <- cor(rank_Y, rank_Z, method = "spearman")

# Print the rank correlation coefficients


cat("Rank correlation coefficient between X and Y:", round(correlation_XY,3), "\n")
cat("Rank correlation coefficient between X and Z:", round(correlation_XZ,3), "\n")
cat("Rank correlation coefficient between Y and Z:", round(correlation_YZ,3), "\n")

# Interpretation: the pair with the largest coefficient agrees most closely
if (correlation_XY > correlation_XZ && correlation_XY > correlation_YZ) {
  cat("The pair of judges X and Y has the nearest approach to common likings of music.\n")
} else if (correlation_XZ > correlation_XY && correlation_XZ > correlation_YZ) {
  cat("The pair of judges X and Z has the nearest approach to common likings of music.\n")
} else {
  cat("The pair of judges Y and Z has the nearest approach to common likings of music.\n")
}

Output

Rank correlation coefficient between X and Y: -0.212


Rank correlation coefficient between X and Z: 0.636
Rank correlation coefficient between Y and Z: -0.297
The pair of judges X and Z has the nearest approach to common likings
of music.
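
With more than two judges, all pairwise coefficients can be computed in one call by passing a matrix to cor(). The sketch below builds such a matrix and picks the off-diagonal pair with the largest coefficient; the names ranks, rho_upper and best_pair are illustrative.

```r
ranks <- cbind(
  X = c(1, 6, 5, 10, 3, 2, 4, 9, 7, 8),
  Y = c(3, 5, 8, 4, 7, 10, 2, 1, 6, 9),
  Z = c(6, 4, 9, 8, 1, 2, 3, 10, 5, 7)
)

# Matrix of pairwise Spearman coefficients
rho <- cor(ranks, method = "spearman")
print(round(rho, 3))

# Keep only the upper triangle so each pair is considered once
rho_upper <- rho
rho_upper[lower.tri(rho_upper, diag = TRUE)] <- NA

# Locate the pair with the largest coefficient
best <- which(rho_upper == max(rho_upper, na.rm = TRUE), arr.ind = TRUE)
best_pair <- c(rownames(rho)[best[1, "row"]], colnames(rho)[best[1, "col"]])
cat("Closest pair of judges:", best_pair, "\n")
```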

6. Let X and Y be discrete random variables with joint probability function $f(x, y) = \dfrac{x + y}{21}$, $x = 1, 2, 3$; $y = 1, 2$. Using R, find (i) the mean and variance of X and Y, (ii) Cov(X, Y), (iii) the correlation of X and Y.

Solution
# Define the probability function f(x, y)
f <- function(x, y) {
return ((x + y) / 21)
}

# Define the possible values of X and Y


X <- c(1, 2, 3)
Y <- c(1, 2)

# Calculate the mean and variance of X


mean_X <- sum(X * sapply(X, function(x) sum(sapply(Y, function(y) f(x, y)))))
var_X <- sum(sapply(X, function(x) (x - mean_X)^2 * sum(sapply(Y, function(y) f(x,
y)))))

# Calculate the mean and variance of Y


mean_Y <- sum(Y * sapply(Y, function(y) sum(sapply(X, function(x) f(x, y)))))
var_Y <- sum(sapply(Y, function(y) (y - mean_Y)^2 * sum(sapply(X, function(x) f(x,
y)))))

# Calculate the covariance of X and Y


cov_XY <- sum(sapply(X, function(x) sum(sapply(Y, function(y) (x - mean_X) * (y -
mean_Y) * f(x, y)))))

# Calculate the standard deviations of X and Y


sd_X <- sqrt(var_X)
sd_Y <- sqrt(var_Y)

# Calculate the correlation coefficient of X and Y


corr_XY <- cov_XY / (sd_X * sd_Y)

# Print the results


cat("Mean of X:", mean_X, "\n")
cat("Variance of X:", var_X, "\n")
cat("Mean of Y:", mean_Y, "\n")
cat("Variance of Y:", var_Y, "\n")
cat("Covariance of X and Y:", cov_XY, "\n")
cat("Correlation coefficient of X and Y:", corr_XY, "\n")

Output

Mean of X: 2.190476
Variance of X: 0.6303855
Mean of Y: 1.571429
Variance of Y: 0.244898
Covariance of X and Y: -0.01360544
Correlation coefficient of X and Y: -0.03462717
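
The same quantities can be obtained from a joint probability table. The following is a minimal sketch, not the method used in the solution: it builds the table with outer(), takes the marginals with rowSums() and colSums(), and uses the shortcut formulas Var(X) = E[X^2] - (E[X])^2 and Cov(X, Y) = E[XY] - E[X]E[Y]. All variable names are introduced here for illustration.

```r
x_vals <- 1:3
y_vals <- 1:2

# Joint probability table: rows index x, columns index y
joint <- outer(x_vals, y_vals, function(x, y) (x + y) / 21)
stopifnot(isTRUE(all.equal(sum(joint), 1)))  # a valid pmf sums to 1

# Marginal distributions
p_x <- rowSums(joint)
p_y <- colSums(joint)

# Moments from the marginals and the joint table
e_x <- sum(x_vals * p_x)
e_y <- sum(y_vals * p_y)
var_x <- sum(x_vals^2 * p_x) - e_x^2
var_y <- sum(y_vals^2 * p_y) - e_y^2
e_xy <- sum(outer(x_vals, y_vals) * joint)
cov_xy <- e_xy - e_x * e_y
rho <- cov_xy / sqrt(var_x * var_y)

cat("E[X] =", e_x, " Var(X) =", var_x, "\n")
cat("E[Y] =", e_y, " Var(Y) =", var_y, "\n")
cat("Cov(X, Y) =", cov_xy, " Corr(X, Y) =", rho, "\n")
```

Keeping the joint table as a matrix makes the normalisation check and the marginals one-liners, which helps when the support of X or Y changes.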

7. The two lines of regression are
8x − 10y + 66 = 0 … (A)
40x − 18y − 214 = 0 … (B)
The variance of x is 9. Using R, find (i) the mean values of x and y, (ii) the
correlation coefficient between x and y.

Solution
# Define the regression lines, each solved for y
eq1 <- function(x) (8*x + 66)/10    # line (A)
eq2 <- function(x) (40*x - 214)/18  # line (B)

# Both regression lines pass through (mean of x, mean of y),
# so the means are the coordinates of their intersection point
intersection <- uniroot(function(x) eq1(x) - eq2(x), c(-20, 20))
mean_x <- intersection$root
mean_y <- eq1(mean_x)
cat("Mean value of x:", mean_x, "\n")
cat("Mean value of y:", mean_y, "\n")

# Regression coefficients: take (A) as the line of y on x
# and (B) as the line of x on y
b_yx <- 8 / 10    # from (A): y = 0.8 x + 6.6
b_xy <- 18 / 40   # from (B): x = 0.45 y + 5.35

# The product b_yx * b_xy equals r^2, so it cannot exceed 1;
# the opposite assignment of the two lines would violate this
stopifnot(b_yx * b_xy <= 1)

# r is the geometric mean of the two coefficients and carries their common sign
final_correlation <- sign(b_yx) * sqrt(b_yx * b_xy)
cat("Correlation coefficient:", final_correlation, "\n")

Output

Mean value of x: 13
Mean value of y: 17
Correlation coefficient: 0.6
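
Because both regression lines pass through the point of means, the means can also be found by solving the two equations as a linear system. The sketch below uses solve() for that, and additionally uses the given variance of x: assuming (A) is the line of y on x and (B) the line of x on y, the ratio of the two regression coefficients equals (sd_y / sd_x)^2, which yields the standard deviation of y. The names `coef_mat`, `rhs`, `b_yx` and `b_xy` are introduced here for illustration.

```r
# Lines (A) and (B) written as a*x + b*y = c
coef_mat <- matrix(c( 8, -10,
                     40, -18), nrow = 2, byrow = TRUE)
rhs <- c(-66, 214)

# The solution of the system is the point (mean of x, mean of y)
means <- solve(coef_mat, rhs)
names(means) <- c("x", "y")
print(means)

# Assumption: (A) is the regression of y on x, (B) the regression of x on y
b_yx <- 8 / 10
b_xy <- 18 / 40

# b_yx / b_xy = (sd_y / sd_x)^2, and var(x) = 9 is given
sd_x <- sqrt(9)
sd_y <- sd_x * sqrt(b_yx / b_xy)
cat("Standard deviation of y:", sd_y, "\n")
```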
8. For the following data, using R find the most likely price at Madras corresponding to
the price 70 at Bombay and that at Bombay corresponding to the price 68 at Madras.
The S.D. of the difference between the prices at Madras and Bombay is 3.1.
                 Madras   Bombay
Average price      65       67
S.D. of price      0.5      3.5

Solution
# Given data
avg_m <- 65
avg_b <- 67
sd_m <- 0.5
sd_b <- 3.5
sd_diff <- 3.1

# Calculate correlation coefficient
corr_coeff <- round((sd_m^2 + sd_b^2 - sd_diff^2) / (2 * sd_m * sd_b), 2)
cat("Correlation Coefficient:", corr_coeff, "\n")

# Given prices
x <- 68  # price at Madras
y <- 70  # price at Bombay

# Calculate predicted price at Madras corresponding to the price 70 at Bombay
pred_m <- round(avg_m + corr_coeff * (sd_m / sd_b) * (y - avg_b), 2)

# Calculate predicted price at Bombay corresponding to the price 68 at Madras
pred_b <- round(avg_b + corr_coeff * (sd_b / sd_m) * (x - avg_m), 2)

cat("Most likely price at Madras:", pred_m, "\n")
cat("Most likely price at Bombay:", pred_b, "\n")

Output

Correlation Coefficient: 0.83
Most likely price at Madras: 65.36
Most likely price at Bombay: 84.43
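
The formulas used in the solution follow from the variance of a difference and from the two regression lines. The derivation below is a worked sketch; the symbols M and B (prices at Madras and Bombay) are notation introduced here, and r is rounded to 0.83 before substitution, as in the code.

```latex
\begin{aligned}
\sigma_{M-B}^{2} &= \sigma_M^{2} + \sigma_B^{2} - 2\,r\,\sigma_M\sigma_B \\
r &= \frac{\sigma_M^{2} + \sigma_B^{2} - \sigma_{M-B}^{2}}{2\,\sigma_M\sigma_B}
   = \frac{0.25 + 12.25 - 9.61}{2(0.5)(3.5)} = \frac{2.89}{3.5} \approx 0.83 \\
\hat{M} &= \bar{M} + r\,\frac{\sigma_M}{\sigma_B}\,(B - \bar{B})
         = 65 + 0.83 \cdot \frac{0.5}{3.5} \cdot (70 - 67) \approx 65.36 \\
\hat{B} &= \bar{B} + r\,\frac{\sigma_B}{\sigma_M}\,(M - \bar{M})
         = 67 + 0.83 \cdot \frac{3.5}{0.5} \cdot (68 - 65) = 84.43
\end{aligned}
```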

9. Suppose that the two-dimensional random variable (X, Y) has the joint probability
density function
f(x, y) = x + y for 0 < x < 1 and 0 < y < 1, and f(x, y) = 0 otherwise.
Using R, obtain the correlation coefficient between X and Y. Check whether X and Y are
independent.
Solution
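# Define the joint probability density function
f <- function(x, y) {
  ifelse((0 < x & x < 1) & (0 < y & y < 1), x + y, 0)
}

# Expected value of g(X, Y): integrate g(x, y) * f(x, y) over the unit square
expect <- function(g) {
  inner <- function(x) {
    sapply(x, function(xi) integrate(function(y) g(xi, y) * f(xi, y), 0, 1)$value)
  }
  integrate(inner, 0, 1)$value
}

# Means, variances and covariance computed from the joint density
mean_x <- expect(function(x, y) x)
mean_y <- expect(function(x, y) y)
var_x <- expect(function(x, y) x^2) - mean_x^2
var_y <- expect(function(x, y) y^2) - mean_y^2
cov_xy <- expect(function(x, y) x * y) - mean_x * mean_y

# Calculate correlation coefficient
corr_coef <- cov_xy / sqrt(var_x * var_y)
cat("Correlation Coefficient:", corr_coef, "\n")

# Marginal densities of X and Y
f_X <- function(x) sapply(x, function(xi) integrate(function(y) f(xi, y), 0, 1)$value)
f_Y <- function(y) sapply(y, function(yi) integrate(function(x) f(x, yi), 0, 1)$value)

# Check for independence: X and Y are independent only if
# f(x, y) = f_X(x) * f_Y(y) at every point, so compare the two on a grid
grid <- expand.grid(x = seq(0.1, 0.9, by = 0.2), y = seq(0.1, 0.9, by = 0.2))
if (isTRUE(all.equal(f(grid$x, grid$y), f_X(grid$x) * f_Y(grid$y)))) {
  cat("X and Y are independent.", "\n")
} else {
  cat("X and Y are not independent.", "\n")
}
Output

Correlation Coefficient: -0.09090909
X and Y are not independent.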
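
A simulation-based check has to draw (X, Y) from the joint density itself; drawing X and Y as two separate uniform samples would describe a different, independent pair. The sketch below uses rejection sampling, which is a technique chosen here and not taken from the solution: since x + y is bounded by 2 on the unit square, a uniform proposal point is accepted with probability (x + y)/2. The sample correlation should then lie close to the exact value -1/11 ≈ -0.0909. The proposal count and variable names are illustrative assumptions.

```r
set.seed(123)  # for reproducibility

# Propose points uniformly on the unit square
n_prop <- 200000
x <- runif(n_prop)
y <- runif(n_prop)
u <- runif(n_prop)

# Accept each point with probability f(x, y) / 2, where 2 bounds f on the square
keep <- u < (x + y) / 2
xs <- x[keep]
ys <- y[keep]

cat("Accepted draws:", length(xs), "\n")
cat("Sample correlation:", cor(xs, ys), "\n")
```

About half of the proposals are accepted, because the density integrates to 1 over a square of area 1 while the bound is 2.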
Unit IV - ANALYSIS OF VARIANCE

1. A completely randomised design experiment with 10 plots and 3 treatments gave the
following results:
Plot No     1  2  3  4  5  6  7  8  9  10
Treatment   A  B  C  A  C  C  A  B  A  B
Yield       5  4  3  7  5  1  3  4  1  7
Analyse the results for treatment effect using R.

Solution
# Create a data frame with the data
data <- data.frame(
  Plot_No = 1:10,
  Treatment = c("A", "B", "C", "A", "C", "C", "A", "B", "A", "B"),
  Yield = c(5, 4, 3, 7, 5, 1, 3, 4, 1, 7)
)

# Perform ANOVA
result <- aov(Yield ~ Treatment, data = data)

# Summarize the ANOVA results
summary_result <- summary(result)

# Print ANOVA table
print(summary_result)

# Check for significant difference at the 5% level
if (summary_result[[1]][["Pr(>F)"]][1] < 0.05) {
  cat("There is a significant difference between the treatments.\n")
} else {
  cat("There is no significant difference between the treatments.\n")
}

Output

            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    2      6   3.000   0.618  0.566
Residuals    7     34   4.857
There is no significant difference between the treatments.
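
The entries of the ANOVA table can be checked by hand. The sketch below recomputes the sums of squares for the completely randomised design from treatment totals, using the usual correction-factor formulas; the variable names are introduced here for illustration.

```r
treatment <- c("A", "B", "C", "A", "C", "C", "A", "B", "A", "B")
yield <- c(5, 4, 3, 7, 5, 1, 3, 4, 1, 7)

n <- length(yield)               # number of plots
k <- length(unique(treatment))   # number of treatments

# Correction factor and total sum of squares
correction <- sum(yield)^2 / n
ss_total <- sum(yield^2) - correction

# Treatment sum of squares from treatment totals and replication counts
group_sum <- tapply(yield, treatment, sum)
group_n <- tapply(yield, treatment, length)
ss_treat <- sum(group_sum^2 / group_n) - correction
ss_error <- ss_total - ss_treat

# F ratio and its p-value on (k - 1, n - k) degrees of freedom
f_value <- (ss_treat / (k - 1)) / (ss_error / (n - k))
p_value <- pf(f_value, k - 1, n - k, lower.tail = FALSE)

cat("SS treatment:", ss_treat, " SS error:", ss_error, "\n")
cat("F value:", round(f_value, 3), " p-value:", round(p_value, 3), "\n")
```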

2. The following are the numbers of mistakes made in 5 successive days by 4 technicians
working for a photographic industry:
Technician I (X1)   Technician II (X2)   Technician III (X3)   Technician IV (X4)
        6                   14                    10                     9
       14                    9                    12                    12
       10                   12                     7                     8
        8                   10                    15                    10
       11                   14                    11                    11
Using R, test at the level of significance α = 0.01 whether the difference among the 4
sample means can be attributed to chance.

Solution
# Create a data frame with the provided data
data <- data.frame(
  Day = 1:5,
  Technician_I = c(6, 14, 10, 8, 11),
  Technician_II = c(14, 9, 12, 10, 14),
  Technician_III = c(10, 12, 7, 15, 11),
  Technician_IV = c(9, 12, 8, 10, 11)
)

# Reshape data into long format
library(reshape2)
data_long <- melt(data, id.vars = "Day", variable.name = "Technician",
                  value.name = "Mistakes")

# Perform ANOVA
result <- aov(Mistakes ~ Technician, data = data_long)

# Summary of ANOVA
summary(result)

# Extract p-value
p_value <- summary(result)[[1]]$`Pr(>F)`[1]

# Define significance level
alpha <- 0.01

# Conclusion
if (p_value < alpha) {
  cat("Reject the null hypothesis. There is a difference among the means of the four technicians' mistake rates.\n")
} else {
  cat("Accept the null hypothesis. There is no significant difference among the means of the four technicians' mistake rates.\n")
}

Output

            Df Sum Sq Mean Sq F value Pr(>F)
Technician   3  12.95   4.317    0.68  0.577
Residuals   16 101.60   6.350
Accept the null hypothesis. There is no significant difference among the means of the four technicians' mistake rates.
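
The reshaping step does not require an extra package. As a hedged alternative, the sketch below uses base R's stack() to produce the long format and compares the F ratio with the critical value from qf() at α = 0.01, which is equivalent to comparing the p-value with α. The object names are introduced here for illustration.

```r
mistakes <- data.frame(
  Technician_I = c(6, 14, 10, 8, 11),
  Technician_II = c(14, 9, 12, 10, 14),
  Technician_III = c(10, 12, 7, 15, 11),
  Technician_IV = c(9, 12, 8, 10, 11)
)

# stack() returns two columns: 'values' and 'ind' (the source column as a factor)
long <- stack(mistakes)
names(long) <- c("Mistakes", "Technician")

fit <- aov(Mistakes ~ Technician, data = long)
tab <- summary(fit)[[1]]

# Critical value of F at the 1% level on (3, 16) degrees of freedom
alpha <- 0.01
f_value <- tab[["F value"]][1]
f_crit <- qf(1 - alpha, df1 = tab[["Df"]][1], df2 = tab[["Df"]][2])

cat("F value:", round(f_value, 3), " critical value:", round(f_crit, 3), "\n")
if (f_value > f_crit) {
  cat("Reject the null hypothesis.\n")
} else {
  cat("Do not reject the null hypothesis.\n")
}
```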

3. There are three main brands of a certain powder. A set of 120 sample values is examined
and found to be allocated among four groups (A, B, C and D) and the three brands (I,
II, III) as shown hereunder:
Brands          Groups
          A     B     C     D
I         0     4     8    15
II        5     8    13     6
III       8    19    11    13

Use R to identify whether there is any significant difference in brand preference.
Answer at the 5% level.

Solution
# Create a data frame with the provided data
data <- data.frame(
  Brand = rep(c("I", "II", "III"), each = 4),
  Group = rep(c("A", "B", "C", "D"), times = 3),
  Count = c(0, 4, 8, 15, 5, 8, 13, 6, 8, 19, 11, 13)
)

# Perform one-way ANOVA
result <- aov(Count ~ Brand, data = data)

# Summary of ANOVA
summary_anova <- summary(result)
print(summary_anova)

# Extract p-value
p_value <- summary_anova[[1]]$`Pr(>F)`[1]

# Define significance level
alpha <- 0.05

# Conclusion
if (p_value < alpha) {
  cat("Reject the null hypothesis. There is a significant difference in brand preference.\n")
} else {
  cat("Accept the null hypothesis. There is no significant difference in brand preference.\n")
}

Output

            Df Sum Sq Mean Sq F value Pr(>F)
Brand        2  80.17   40.08     1.6  0.254
Residuals    9 225.50   25.06
Accept the null hypothesis. There is no significant difference in brand preference.
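
The solution treats the four groups as replicates within each brand. If the groups are instead treated as a second classification, which is an assumption made here and not part of the solution, the same counts can be analysed as a two-way layout so that group-to-group variation is removed from the error term. A minimal sketch, with names introduced for illustration:

```r
powder <- data.frame(
  Brand = factor(rep(c("I", "II", "III"), each = 4)),
  Group = factor(rep(c("A", "B", "C", "D"), times = 3)),
  Count = c(0, 4, 8, 15, 5, 8, 13, 6, 8, 19, 11, 13)
)

# Two-way ANOVA without interaction: one observation per Brand x Group cell
fit <- aov(Count ~ Brand + Group, data = powder)
tab <- summary(fit)[[1]]
print(tab)

# Rows of the table follow the order of terms in the formula: Brand, then Group
p_brand <- tab[["Pr(>F)"]][1]
if (p_brand < 0.05) {
  cat("Brands differ significantly at the 5% level.\n")
} else {
  cat("No significant difference between brands at the 5% level.\n")
}
```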

4. An experiment was designed to study the performance of 4 different detergents for
cleaning fuel injectors. The following “cleanliness” readings were obtained with
specially designed equipment for 12 tanks of gas distributed over 3 different models of
engines:
              Engine 1   Engine 2   Engine 3   Total
Detergent A      45         43         51       139
Detergent B      47         46         52       145
Detergent C      48         50         55       153
Detergent D      42         37         49       128
Total           182        176        207       565

Perform ANOVA using R and test at the 0.01 level of significance whether there are
differences in the detergents or in the engines.

Solution
# Create a data frame with the provided data
cleanliness <- data.frame(
  Engine = factor(rep(1:3, each = 4)),
  Detergent = factor(rep(c("A", "B", "C", "D"), times = 3)),
  Cleanliness = c(45, 47, 48, 42, 43, 46, 50, 37, 51, 52, 55, 49)
)

# Fit ANOVA model
model <- aov(Cleanliness ~ Engine + Detergent, data = cleanliness)

# Summary of ANOVA
anova_result <- summary(model)
print(anova_result)

# Extract p-values (rows follow the formula order: Engine first, Detergent second)
engine_p_value <- anova_result[[1]]$"Pr(>F)"[1]
detergent_p_value <- anova_result[[1]]$"Pr(>F)"[2]

# Conclusion
if (detergent_p_value > 0.01 && engine_p_value > 0.01) {
  cat("Accept the null hypothesis. There are no significant differences in cleanliness between detergents and engines.\n")
} else {
  cat("Reject the null hypothesis. There are significant differences in cleanliness between detergents or engines.\n")
}
Output

            Df Sum Sq Mean Sq F value  Pr(>F)
Engine       2 135.17   67.58   21.53 0.00183 **
Detergent    3 110.92   36.97   11.78 0.00631 **
Residuals    6  18.83    3.14
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reject the null hypothesis. There are significant differences in cleanliness between detergents or engines.
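
The question asks about detergents and engines separately, so it can help to report one conclusion per factor. The sketch below looks up each p-value by row name instead of by position, which avoids mixing up the two factors; trimws() is applied because the row names of a summary table may carry padding spaces. The names `p_values` and `verdict` are introduced here for illustration.

```r
cleanliness <- data.frame(
  Engine = factor(rep(1:3, each = 4)),
  Detergent = factor(rep(c("A", "B", "C", "D"), times = 3)),
  Cleanliness = c(45, 47, 48, 42, 43, 46, 50, 37, 51, 52, 55, 49)
)

tab <- summary(aov(Cleanliness ~ Engine + Detergent, data = cleanliness))[[1]]

# Named vector of p-values, keyed by the (trimmed) term names
p_values <- setNames(tab[["Pr(>F)"]], trimws(rownames(tab)))

alpha <- 0.01
for (term in c("Detergent", "Engine")) {
  verdict <- if (p_values[[term]] < alpha) "significant" else "not significant"
  cat(term, "effect:", verdict, "at the", alpha, "level\n")
}
```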

5. Analyse the following RBD and find your conclusion using R.

                Treatments
Blocks     T1    T2    T3    T4
B1         12    14    20    22
B2         17    27    19    15
B3         15    14    17    12
B4         18    16    22    12
B5         19    15    20    14

Solution
# Load dplyr for the %>% pipe (tidyr is called below with the tidyr:: prefix)
library(dplyr)

# Create a data frame with the provided data
data <- data.frame(
  Blocks = c("B1", "B2", "B3", "B4", "B5"),
  T1 = c(12, 17, 15, 18, 19),
  T2 = c(14, 27, 14, 16, 15),
  T3 = c(20, 19, 17, 22, 20),
  T4 = c(22, 15, 12, 12, 14)
)

# Reshape the data into long format suitable for ANOVA
data_long <- data %>%
  tidyr::pivot_longer(cols = -Blocks, names_to = "Treatment", values_to = "Value")

# Perform ANOVA
result <- aov(Value ~ Treatment + Blocks, data = data_long)

# Summarize the ANOVA results
summary_result <- summary(result)

# Print ANOVA table
print(summary_result)

# Check for significant difference in either factor
# (row 1 = Treatment, row 2 = Blocks)
p_values <- summary_result[[1]][["Pr(>F)"]][1:2]
if (any(p_values < 0.05)) {
  cat("Reject the null hypothesis. There is a significant difference between blocks and treatments.\n")
} else {
  cat("Accept the null hypothesis. There is no significant difference between blocks and treatments.\n")
}

Output

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    3   57.2   19.07   1.238  0.339
Blocks       4   50.0   12.50   0.812  0.541
Residuals   12  184.8   15.40
Accept the null hypothesis. There is no significant difference between blocks and treatments.
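
The conclusion can also be reached by comparing each F ratio with its tabulated critical value, as is done in hand calculations. The sketch below builds the long-format data directly in base R and obtains the 5% critical values from qf(); the object names are introduced here for illustration.

```r
rbd <- data.frame(
  Value = c(12, 17, 15, 18, 19,    # T1, blocks B1..B5
            14, 27, 14, 16, 15,    # T2
            20, 19, 17, 22, 20,    # T3
            22, 15, 12, 12, 14),   # T4
  Treatment = factor(rep(c("T1", "T2", "T3", "T4"), each = 5)),
  Blocks = factor(rep(c("B1", "B2", "B3", "B4", "B5"), times = 4))
)

tab <- summary(aov(Value ~ Treatment + Blocks, data = rbd))[[1]]

# Critical values at the 5% level; row 3 of the table holds the residual df
df_error <- tab[["Df"]][3]
f_crit <- qf(0.95, df1 = tab[["Df"]][1:2], df2 = df_error)
f_obs <- tab[["F value"]][1:2]

decision <- ifelse(f_obs > f_crit, "significant", "not significant")
names(decision) <- c("Treatment", "Blocks")
print(decision)
```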

6. The following is a Latin square of a design, when 4 varieties of seeds are being tested.
Use R to set up the analysis of variance table. You may carry out suitable change of
origin and scale.
A 105   B 95    C 125   D 115
C 115   D 125   A 105   B 105
D 115   C 95    B 105   A 115
B 95    A 135   D 95    C 115

Solution
# Data (entered row by row from the Latin square)
Yield <- c(105, 95, 125, 115,
           115, 125, 105, 105,
           115, 95, 105, 115,
           95, 135, 95, 115)
Row <- factor(rep(1:4, each = 4))
Col <- factor(rep(1:4, times = 4))
Trt <- c("A", "B", "C", "D",
         "C", "D", "A", "B",
         "D", "C", "B", "A",
         "B", "A", "D", "C")
df <- data.frame(Yield, Row, Col, Trt)

# Change of origin and scale: subtract 100 from each yield value and divide by 5
df$Adjusted_Yield <- (df$Yield - 100) / 5

# Linear model
fit_model <- lm(formula = Adjusted_Yield ~ Row + Col + Trt, data = df)
summary(fit_model)
anova_result <- anova(fit_model)
print(anova_result)

Output

Call:
lm(formula = Adjusted_Yield ~ Row + Col + Trt, data = df)

Residuals:
 Min   1Q Median   3Q  Max
-3.5 -1.5    0.0  1.5  3.5

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.500e+00  2.500e+00   1.000    0.356
Row2         5.000e-01  2.236e+00   0.224    0.830
Row3        -5.000e-01  2.236e+00  -0.224    0.830
Row4         3.336e-16  2.236e+00   0.000    1.000
Col2         1.000e+00  2.236e+00   0.447    0.670
Col3         4.759e-16  2.236e+00   0.000    1.000
Col4         1.000e+00  2.236e+00   0.447    0.670
TrtB        -3.000e+00  2.236e+00  -1.342    0.228
TrtC        -5.000e-01  2.236e+00  -0.224    0.830
TrtD        -5.000e-01  2.236e+00  -0.224    0.830

Residual standard error: 3.162 on 6 degrees of freedom
Multiple R-squared: 0.3182, Adjusted R-squared: -0.7045
F-statistic: 0.3111 on 9 and 6 DF, p-value: 0.9432

Analysis of Variance Table

Response: Adjusted_Yield
          Df Sum Sq Mean Sq F value Pr(>F)
Row        3      2  0.6667  0.0667 0.9756
Col        3      4  1.3333  0.1333 0.9367
Trt        3     22  7.3333  0.7333 0.5690
Residuals  6     60 10.0000
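
Two properties of this analysis are easy to verify in code. The sketch below first checks that the layout is a valid Latin square (each variety occurs exactly once in every row and every column), then confirms that the change of origin and scale leaves the F ratios unchanged by fitting the model to both the raw and the coded yields. The names `is_latin`, `coded`, `f_raw` and `f_coded` are introduced here for illustration.

```r
Yield <- c(105, 95, 125, 115,
           115, 125, 105, 105,
           115, 95, 105, 115,
           95, 135, 95, 115)
Row <- factor(rep(1:4, each = 4))
Col <- factor(rep(1:4, times = 4))
Trt <- factor(c("A", "B", "C", "D",
                "C", "D", "A", "B",
                "D", "C", "B", "A",
                "B", "A", "D", "C"))

# Each treatment must appear exactly once per row and once per column
is_latin <- all(table(Row, Trt) == 1) && all(table(Col, Trt) == 1)

# F ratios for rows, columns and treatments on raw and coded yields
coded <- (Yield - 100) / 5
f_raw <- anova(lm(Yield ~ Row + Col + Trt))[["F value"]][1:3]
f_coded <- anova(lm(coded ~ Row + Col + Trt))[["F value"]][1:3]

cat("Valid Latin square:", is_latin, "\n")
cat("F ratios unchanged by coding:", isTRUE(all.equal(f_raw, f_coded)), "\n")
```

Coding divides every sum of squares by the same factor (here 25), so the ratios of mean squares, and therefore the conclusions, are unaffected.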
