R Programming Essentials and Functions
Features of R Programming
Reserved Words
Reserved words are predefined keywords in R that have special meanings and cannot be
used as identifiers (e.g., variable names or function names).
Examples include if, else, repeat, while, for, in, function, next, break, TRUE, FALSE, NULL,
Inf, NaN, and NA. Note that return() is an ordinary function, not a reserved word.
TRUE and FALSE are the logical constants in R.
Naming Variables
Variable names in R can consist of letters, numbers, periods (.), and underscores (_). They
must start with a letter, or with a period that is not followed by a number.
It's good practice to use descriptive names that reflect the purpose of the variable.
Avoid using reserved words as variable names.
R is a case-sensitive language. For example, TRUE and True are not the same: the first
is a reserved word, while the second can be used as a variable name.
Comments in R start with the '#' symbol.
Operators
Operators are symbols used to perform operations on variables and values. R supports various
types of operators, including arithmetic operators, relational operators, logical operators,
assignment operators, and special operators. Here are some examples:
Logical Operators
! Negation
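& or && Logical AND (& works element-wise on vectors; && compares single values)
| or || Logical OR (| works element-wise on vectors; || compares single values)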
xor() Exclusive OR
isTRUE() Tests if an expression is TRUE
Assignment Operators
<- or = Assigns a value to a variable
<<- Assigns a value to a variable in an enclosing (parent) environment
Special Operators
$ Extracts a component of a list or data frame
: Creates a sequence of numbers
%in% Tests if elements of one vector are in another vector
%*% Matrix multiplication
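The operators listed above can be tried out with a short sketch like the following. All variable names (a, b, make_counter, person, m) are illustrative and not part of the original notes.

```r
# Logical operators: & and | work element-wise on vectors
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)
and_result <- a & b         # TRUE FALSE FALSE
or_result  <- a | b         # TRUE TRUE TRUE
xor_result <- xor(a, b)     # FALSE TRUE TRUE
not_result <- !a            # FALSE TRUE FALSE

# && and || compare single values, which suits if() conditions
both_first <- a[1] && b[1]  # TRUE
is_true <- isTRUE(both_first)

# <<- modifies a variable that lives in an enclosing environment
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
first_call  <- counter()    # 1
second_call <- counter()    # 2

# Special operators
person <- list(name = "John", age = 30)
person_name <- person$name               # "John"
one_to_five <- 1:5                       # 1 2 3 4 5
membership  <- c(2, 7) %in% one_to_five  # TRUE FALSE
m <- matrix(1:4, nrow = 2)
m_squared <- m %*% m                     # matrix product, not element-wise
print(m_squared)
```

Note that m * m would multiply element by element, whereas m %*% m follows the rules of matrix algebra.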
Input/Output Functions
Data Types
R supports several basic data types, including numeric, character, logical, integer, and
complex.
Vectors are one-dimensional arrays that can hold elements of the same data type.
Matrices are two-dimensional arrays.
Data frames are similar to matrices but can hold different types of data in each column.
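As a small illustration of the data types and structures described above, the following sketch creates one object of each kind. The object names and values are made up for the example.

```r
# Basic data types
num <- 3.14        # numeric
int <- 5L          # integer (note the L suffix)
chr <- "hello"     # character
lgl <- TRUE        # logical
cpx <- 2 + 3i      # complex
types <- c(class(num), class(int), class(chr), class(lgl), class(cpx))
print(types)

# A vector holds elements of one type; mixed input is coerced
v <- c(10, 20, 30)
mixed <- c(1, "a", TRUE)   # every element becomes a character string

# A matrix is a two-dimensional array of a single type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame can hold a different type in each column
students <- data.frame(
  name   = c("Asha", "Ravi"),
  marks  = c(82.5, 74),
  passed = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)
str(students)
```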
Data Manipulation
Use functions like subset(), merge(), transform(), and aggregate() for data
manipulation.
The dplyr package provides a set of functions for easy and efficient data manipulation.
The tidyr package offers tools for tidying data, which involves restructuring datasets
to facilitate analysis.
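The base R functions named above can be sketched on two small, made-up data frames (scores and info are hypothetical and only serve the illustration). The dplyr and tidyr packages offer equivalent tools but are not needed here.

```r
# Two small illustrative data frames
scores <- data.frame(
  id    = 1:4,
  group = c("a", "a", "b", "b"),
  score = c(10, 20, 30, 40)
)
info <- data.frame(id = 1:4, city = c("P", "Q", "R", "S"))

# subset(): keep rows that satisfy a condition
high <- subset(scores, score > 15)

# transform(): add or modify columns
scaled <- transform(scores, score_pct = score / max(score) * 100)

# merge(): join two data frames on a common key
joined <- merge(scores, info, by = "id")

# aggregate(): summarise a variable within groups
group_means <- aggregate(score ~ group, data = scores, FUN = mean)
print(group_means)
```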
Data Visualization
R offers several packages for data visualization, including ggplot2, lattice, and base
graphics.
Use functions like plot(), hist(), boxplot(), and barplot() for basic plotting.
Control Structures
R supports various control structures such as if-else statements, for loops, while loops,
and repeat loops.
These control structures are used to control the flow of execution in a program.
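A minimal sketch of the four control structures mentioned above follows. The variable names and limits are arbitrary choices for the illustration.

```r
# if-else: choose between two branches
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# for loop: add up the numbers 1 to 5
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while loop: smallest n with 2^n >= 100
n <- 0
while (2^n < 100) {
  n <- n + 1
}

# repeat loop: runs until break is reached
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

cat("Parity:", parity, "\n")
cat("Total:", total, "\n")
cat("n:", n, "\n")
cat("Count:", count, "\n")
```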
Functions
You can define your own functions in R using the function() keyword.
Functions can take arguments and return values.
R also has many built-in functions for common tasks, such as mathematical
calculations, statistical analysis, and data manipulation.
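For illustration, a user-defined function with a default argument might look like the sketch below. The names rectangle_area and describe are hypothetical; the built-in functions mean(), sd(), and sqrt() are part of base R.

```r
# A user-defined function with one required and one default argument
rectangle_area <- function(length, width = 1) {
  length * width   # the last evaluated expression is returned
}
area_default <- rectangle_area(3)      # width falls back to 1
area_custom  <- rectangle_area(3, 4)

# A function that returns several values in a list
describe <- function(x) {
  return(list(mean = mean(x), sd = sd(x)))
}
stats_summary <- describe(c(2, 4, 6))

# Built-in functions for common tasks
root <- sqrt(16)
cat("Areas:", area_default, area_custom, "\n")
cat("Mean:", stats_summary$mean, "SD:", stats_summary$sd, "\n")
```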
Packages
1. R has a vast ecosystem of packages contributed by users from around the world.
2. Use the install.packages() function to install packages from CRAN, and then load them
into your R session using the library() function.
3. Popular packages include tidyverse, caret, forecast, rmarkdown, and shiny.
Examples:
Output: 15
2. Write a program in R to display information about a person including their name, age,
and city. Use the cat() function to concatenate and display the information.
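# Define variables
name <- "John"
age <- 30
city <- "New York"
# Concatenate and display the information
cat("Name:", name, "\n")
cat("Age:", age, "\n")
cat("City:", city, "\n")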
Output:
Name: John
Age: 30
City: New York
Unit I - INTRODUCTION
1. Using R, find the probabilities of tossing a pair of coins five times and obtaining: (i)
exactly 3 heads, (ii) exactly 3 tails, (iii) at least one head, (iv) not more than one tail.
Solution
# Define the number of trials (tosses)
n <- 5
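The rest of this solution is not shown. One possible way to finish it, assuming the experiment is modelled as n = 5 independent tosses of a fair coin (as the n <- 5 above suggests), uses the binomial functions dbinom() and pbinom(). The variable names are illustrative.

```r
n <- 5     # number of tosses (from the solution above)
p <- 0.5   # probability of a head, assuming a fair coin

# (i) exactly 3 heads
p_three_heads <- dbinom(3, size = n, prob = p)

# (ii) exactly 3 tails (the number of tails is also binomial)
p_three_tails <- dbinom(3, size = n, prob = 1 - p)

# (iii) at least one head = 1 - P(no heads)
p_at_least_one_head <- 1 - dbinom(0, size = n, prob = p)

# (iv) not more than one tail = P(tails <= 1)
p_at_most_one_tail <- pbinom(1, size = n, prob = 1 - p)

cat("Exactly 3 heads:", p_three_heads, "\n")
cat("Exactly 3 tails:", p_three_tails, "\n")
cat("At least one head:", p_at_least_one_head, "\n")
cat("Not more than one tail:", p_at_most_one_tail, "\n")
```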
Output
2. For a triangular distribution with density
$$f(x) = \begin{cases} x, & 0 \le x \le 1 \\ 2 - x, & 1 \le x \le 2 \\ 0, & \text{otherwise} \end{cases}$$
find the mean and variance using R.
Solution
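# Define the triangular density function
triangular_density <- function(x) {
  ifelse(x >= 0 & x <= 1, x, ifelse(x > 1 & x <= 2, 2 - x, 0))
}
# Calculate mean
mean_triangular <- integrate(function(x) x * triangular_density(x),
                             lower = 0, upper = 2)$value
# Calculate E[X^2]
mean_squared_triangular <- integrate(function(x) x^2 * triangular_density(x),
                                     lower = 0, upper = 2)$value
# Calculate variance
var_triangular <- mean_squared_triangular - (mean_triangular)^2
cat("Mean:", mean_triangular, "\n")
cat("Variance:", var_triangular, "\n")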
Output
Mean: 1
Variance: 0.1666667
Solution
# Define the values of X and corresponding probabilities
X <- c(0, 1, 2, 3, 4)
# Solve for k
k <- uniroot(equation, interval = c(0, 1))$root
# Calculate P[X ≥ 3]
Prob_X <- calc_prob(k)
P_X_ge_3 <- sum(Prob_X[X >= 3])
Output
Value of k: 0.04
P[X ≥ 3]: 0.64
P[0 < X < 4]: 0.6
Solution
# Define the probability density function
f <- function(x) {
ifelse(x >= -1 & x <= 2, x^2/3, 0)
}
Output
1. A company keeps a record of accidents. During a recent safety review, a random
sample of 60 accidents was selected and classified by day of the week.
Solution
2. 1000 students at college level were graded according to their IQ level and economic
condition. What conclusion can be drawn from the following data using R?
                      IQ Level
Economic condition    High    Low
Rich                  460     140
Poor                  240     160
Solution
data: observed
X-squared = 30.957, df = 1, p-value = 2.638e-08
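The solution code for this problem is not shown. The printed statistic is consistent with chisq.test() applied to the 2 x 2 table with its default continuity correction, so a sketch along the following lines would reproduce it. The name observed is taken from the "data: observed" line; the dimnames are added only for readability.

```r
# Observed frequencies: rows = economic condition, columns = IQ level
observed <- matrix(c(460, 140,
                     240, 160),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Economic = c("Rich", "Poor"),
                                   IQ = c("High", "Low")))

# Chi-square test of independence (Yates' correction is applied by default for 2 x 2 tables)
result <- chisq.test(observed)
print(result)
```

Since the p-value is far below 0.05, the hypothesis that IQ level and economic condition are independent would be rejected.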
3. The following table gives the number of aircraft accidents that occurred during the
various days of the week. Test whether the accidents are uniformly distributed over
the week using R programming.
Days Sun Mon Tue Wed Thu Fri Sat
No of Accidents 14 16 8 12 11 9 14
Solution
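# Given data
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
accidents <- c(14, 16, 8, 12, 11, 9, 14)
# Chi-square goodness-of-fit test against a uniform distribution (the default)
chisq.test(accidents)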
data: accidents
X-squared = 4.1667, df = 6, p-value = 0.6541
Solution
# Define the two samples
sample1 <- c(18, 13, 12, 15, 12, 14, 16, 14, 15)
sample2 <- c(16, 19, 13, 16, 18, 13, 15)
Output
Solution
# Given data
sample_mean <- 1570 # Mean lifetime of the sample bulbs
population_mean <- 1600 # Hypothesized mean lifetime of all bulbs
population_std <- 120 # Population standard deviation
sample_size <- 100 # Sample size
# Compute the z-score
z_score <- (sample_mean - population_mean) / (population_std / sqrt(sample_size))
# Print z-score
cat("Z-score:", z_score, "\n")
Z-score: -2.5
For alpha = 0.05:
Critical Z-value: 1.96
Reject the null hypothesis: μ ≠ 1600 hours
6. The mean breaking strength of the cables supplied by a manufacturer is 1800 with a
S.D. of 100. By a new technique in the manufacturing process, it is claimed that the
breaking strength of the cable has increased. In order to test this claim, a sample of 50
cables is tested and it is found that the mean breaking strength is 1850. Can we support
the claim at 1% level of significance (Use R)?
Solution
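# Given data
population_mean <- 1800 # Mean breaking strength before the new technique
population_std <- 100 # Standard deviation of breaking strength before the new technique
sample_mean <- 1850 # Mean breaking strength observed in the sample
sample_size <- 50 # Sample size
alpha <- 0.01 # Significance level
# Right-tailed Z-test (the claim is that the strength has increased)
z_score <- (sample_mean - population_mean) / (population_std / sqrt(sample_size))
z_critical <- qnorm(1 - alpha)
cat("Z-score:", round(z_score, 2), "\n")
cat("Critical z-value:", round(z_critical, 2), "\n")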
Output
Z-score: 3.54
Critical z-value: 2.33
Reject the null hypothesis. We may support the claim that the breaking
strength has increased.
7. The average number of defective articles per day in a certain factory is claimed to be
less than the average of all the factories. The average of all the factories is 30.5. A
random sample of 100 days showed the following distribution.
Class limits 16-20 21-25 26-30 31-35 36-40
No:of days 12 22 20 30 16
Is the average less than the figure for all the factories (Use R)?
Solution
# Given data
sample_data <- c(18, 23, 28, 33, 38) # Midpoint of each class
frequency <- c(12, 22, 20, 30, 16)
n <- sum(frequency) # Total number of days
mean_all_factories <- 30.5 # Average of all factories
# Mean and standard deviation of the grouped sample
sample_mean <- sum(sample_data * frequency) / n
sample_sd <- sqrt(sum(frequency * (sample_data - sample_mean)^2) / n)
# Performing Z-test
z_score <- (sample_mean - mean_all_factories) / (sample_sd / sqrt(n))
cat("Z-score:", round(z_score, 2), "\n")
# Define the critical Z-value for a one-tailed (left-tailed) test with significance level α = 0.05
z_critical <- qnorm(0.05)
cat("Critical Z-value:", round(z_critical, 2), "\n")
Output
8. The sales manager of a large company conducted a sample survey in two places A and
B taking 200 samples in each case. The results were the following table. Use R to test
whether the average sales is the same in the two areas at 5% level.
Place A Place B
Average sales Rs. 2000 Rs. 1700
S.D Rs. 200 Rs. 450
Solution
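# Given data
average_sales_A <- 2000
average_sales_B <- 1700
sd_A <- 200
sd_B <- 450
n <- 200 # Number of samples in each case
alpha <- 0.05 # Significance level
# Two-tailed Z-test for the difference of two means
z_score <- (average_sales_A - average_sales_B) / sqrt(sd_A^2 / n + sd_B^2 / n)
z_critical <- qnorm(1 - alpha / 2)
cat("Z-score:", round(z_score, 2), "\n")
cat("Critical Z-value:", round(z_critical, 2), "\n")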
Output
Z-score: 8.62
Critical Z-value: 1.96
Reject the null hypothesis. There is sufficient evidence to conclude
that the average sales are different in the two areas.
9. Two samples drawn from two different populations gave the following results:
Size Mean SD
Sample I 100 582 24
Sample II 100 540 28
Use R to test the hypothesis, at 5% level of significance, that the difference of the means
of the population is 35.
Solution
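# Given data
mean_1 <- 582 # Mean of Sample I
sd_1 <- 24 # Standard deviation of Sample I
n_1 <- 100 # Sample size of Sample I
mean_2 <- 540 # Mean of Sample II
sd_2 <- 28 # Standard deviation of Sample II
n_2 <- 100 # Sample size of Sample II
population_mean_diff <- 35 # Hypothesized difference of means
alpha <- 0.05 # Significance level
# Two-tailed Z-test for the hypothesized difference of means
z_score <- ((mean_1 - mean_2) - population_mean_diff) / sqrt(sd_1^2 / n_1 + sd_2^2 / n_2)
z_critical <- qnorm(1 - alpha / 2)
cat("Z-score:", round(z_score, 2), "\n")
cat("Critical Z-value:", round(z_critical, 2), "\n")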
Output
Z-score: 1.9
Critical Z-value: 1.96
Accept the null hypothesis. There is not enough evidence to conclude
that the difference of the means of the populations is not 35.
10. Two horses A and B were tested according to the time (in seconds) to run a particular
race with the following results:
Horse A 28 30 32 33 33 29 34
Horse B 29 30 30 24 27 29
Use R to test whether the horse A is running faster than B at 5% level.
Solution
# Given data
times_horse_A <- c(28, 30, 32, 33, 33, 29, 34)
times_horse_B <- c(29, 30, 30, 24, 27, 29)
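The test itself is not shown in this solution. One way to carry it out, assuming equal population variances (a pooled two-sample t-test, which is an assumption and not stated in the problem), is sketched below. Because a faster horse has smaller times, the alternative hypothesis is that the mean time of A is less than that of B.

```r
times_horse_A <- c(28, 30, 32, 33, 33, 29, 34)
times_horse_B <- c(29, 30, 30, 24, 27, 29)
alpha <- 0.05   # significance level from the problem statement

# One-sided pooled t-test: H1 is mean(A) < mean(B), i.e. A is faster
t_result <- t.test(times_horse_A, times_horse_B,
                   alternative = "less", var.equal = TRUE)
print(t_result)

if (t_result$p.value < alpha) {
  cat("Reject the null hypothesis: horse A appears to run faster than horse B.\n")
} else {
  cat("Do not reject the null hypothesis: no evidence that horse A runs faster than horse B.\n")
}
```

Setting var.equal = FALSE would give Welch's test instead, which does not rely on the equal-variance assumption.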
Output
1. From the following data, use R to find (i) the two regression equations, (ii) the
coefficient of correlation between the marks in economics and statistics, and (iii) the
most likely marks in statistics when the marks in economics are 30.
Marks in Economics X 25 28 35 32 31 36 29 38 34 32
Marks in Statistics Y 43 46 49 41 36 32 31 30 33 39
Solution
# Load the required libraries (stats is attached by default)
library(stats)
# Enter the data
x <- c(25, 28, 35, 32, 31, 36, 29, 38, 34, 32)
y <- c(43, 46, 49, 41, 36, 32, 31, 30, 33, 39)
# Y on X
model_Y_on_X <- lm(y ~ x)
print(coef(model_Y_on_X))
# X on Y
model_X_on_Y <- lm(x ~ y)
m_XY <- model_X_on_Y$coefficients[2]
b_XY <- model_X_on_Y$coefficients[1]
# Coefficient of correlation
cat("Correlation coefficient:", round(cor(x, y), 4), "\n")
# Predict the most likely marks in statistics when marks in economics are 30
predicted_y <- predict(model_Y_on_X, data.frame(x = 30))
cat("Predicted marks in statistics when marks in economics are 30 is:",
    round(predicted_y, 2), "\n")
Output
2. The joint probability mass function of x and y is given below. Find the correlation
coefficient of x, y using R.
x
y -1 +1 Total
0 1/8 3/8 4/8
1 2/8 2/8 4/8
Total 3/8 5/8
Solution
# Define the joint probability mass function
pmf <- matrix(c(1/8, 3/8, 2/8, 2/8), nrow = 2, byrow = TRUE)
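The remaining steps are not shown. A sketch of how the correlation coefficient could be computed from the table follows, assuming the rows of the matrix correspond to y = 0, 1 and the columns to x = -1, +1 (as in the table above). The helper names are illustrative.

```r
# Joint pmf: rows are y = 0, 1; columns are x = -1, +1
x_vals <- c(-1, 1)
y_vals <- c(0, 1)
pmf <- matrix(c(1/8, 3/8, 2/8, 2/8), nrow = 2, byrow = TRUE)

# Marginal distributions
p_x <- colSums(pmf)
p_y <- rowSums(pmf)

# Means and variances
mean_x <- sum(x_vals * p_x)
mean_y <- sum(y_vals * p_y)
var_x <- sum(x_vals^2 * p_x) - mean_x^2
var_y <- sum(y_vals^2 * p_y) - mean_y^2

# E[XY]: outer(y_vals, x_vals) has the same row/column layout as pmf
e_xy <- sum(outer(y_vals, x_vals) * pmf)
cov_xy <- e_xy - mean_x * mean_y

cor_xy <- cov_xy / sqrt(var_x * var_y)
cat("Correlation coefficient:", cor_xy, "\n")
```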
Output
3. Obtain the rank correlation coefficient for the following data using R.
X 68 64 75 50 64 80 75 40 55 64
Y 62 58 68 45 81 60 68 48 50 70
Solution
# Given data
X <- c(68, 64, 75, 50, 64, 80, 75, 40, 55, 64)
Y <- c(62, 58, 68, 45, 81, 60, 68, 48, 50, 70)
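The computation is not shown here. A minimal sketch uses cor() with method = "spearman", which assigns average ranks to tied values and then computes the Pearson correlation of the ranks. This may differ slightly from a hand calculation that uses the tie-correction formula.

```r
X <- c(68, 64, 75, 50, 64, 80, 75, 40, 55, 64)
Y <- c(62, 58, 68, 45, 81, 60, 68, 48, 50, 70)

# Spearman rank correlation (ties receive average ranks)
rho <- cor(X, Y, method = "spearman")
cat("Rank correlation coefficient:", round(rho, 4), "\n")

# The same value obtained explicitly from the ranks
rho_from_ranks <- cor(rank(X), rank(Y))
```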
Output
4. Find the rank correlation coefficient from the following data using R:
Rank in X 1 2 3 4 5 6 7
Rank in Y 4 3 1 2 6 5 7
Solution
# Given ranks
rank_X <- c(1, 2, 3, 4, 5, 6, 7)
rank_Y <- c(4, 3, 1, 2, 6, 5, 7)
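The calculation is not shown here. Since the data are already ranks without ties, the coefficient can be sketched with the formula ρ = 1 − 6Σd² / (n(n² − 1)) and cross-checked against cor(). The names d, n, and rho_formula are illustrative.

```r
rank_X <- c(1, 2, 3, 4, 5, 6, 7)
rank_Y <- c(4, 3, 1, 2, 6, 5, 7)

# Spearman's formula for untied ranks
d <- rank_X - rank_Y
n <- length(rank_X)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Cross-check with the built-in function
rho_builtin <- cor(rank_X, rank_Y, method = "spearman")

cat("Rank correlation coefficient:", round(rho_formula, 4), "\n")
```

Here Σd² = 20 and n = 7, so ρ = 1 − 120/336 ≈ 0.6429.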
Output
5. Ten participants were ranked according to their performance in a musical test by three
judges (X, Y and Z), as shown in the following data.
1 2 3 4 5 6 7 8 9 10
Rank by X 1 6 5 10 3 2 4 9 7 8
Rank by Y 3 5 8 4 7 10 2 1 6 9
Rank by Z 6 4 9 8 1 2 3 10 5 7
Using rank correlation method, discuss which pair of judges has the nearest approach
to common likings of music using R.
Solution
# Given ranks
rank_X <- c(1, 6, 5, 10, 3, 2, 4, 9, 7, 8)
rank_Y <- c(3, 5, 8, 4, 7, 10, 2, 1, 6, 9)
rank_Z <- c(6, 4, 9, 8, 1, 2, 3, 10, 5, 7)
# Calculate rank correlation coefficients
correlation_XY <- cor(rank_X, rank_Y, method = "spearman")
correlation_XZ <- cor(rank_X, rank_Z, method = "spearman")
correlation_YZ <- cor(rank_Y, rank_Z, method = "spearman")
# Interpretation
if (correlation_XY > correlation_XZ && correlation_XY > correlation_YZ) {
  cat("The pair of judges X and Y has the nearest approach to common likings of music.\n")
} else if (correlation_XZ > correlation_XY && correlation_XZ > correlation_YZ) {
  cat("The pair of judges X and Z has the nearest approach to common likings of music.\n")
} else {
  cat("The pair of judges Y and Z has the nearest approach to common likings of music.\n")
}
Output
Solution
# Define the probability function f(x, y)
f <- function(x, y) {
return ((x + y) / 21)
}
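The problem statement and the rest of this solution are missing. The printed output is reproduced if X takes the values 1, 2, 3 and Y takes the values 1, 2; that support is an assumption inferred from the output, not something stated in the notes. Under that assumption the moments could be computed as follows.

```r
f <- function(x, y) (x + y) / 21

# Assumed support (inferred from the printed output)
x_vals <- 1:3
y_vals <- 1:2

# Joint probability table: rows are x, columns are y
joint <- outer(x_vals, y_vals, f)
stopifnot(isTRUE(all.equal(sum(joint), 1)))   # probabilities must sum to 1

# Marginal distributions
p_x <- rowSums(joint)
p_y <- colSums(joint)

mean_x <- sum(x_vals * p_x)
var_x  <- sum(x_vals^2 * p_x) - mean_x^2
mean_y <- sum(y_vals * p_y)
var_y  <- sum(y_vals^2 * p_y) - mean_y^2

# Covariance and correlation
cov_xy <- sum(outer(x_vals, y_vals) * joint) - mean_x * mean_y
cor_xy <- cov_xy / sqrt(var_x * var_y)

cat("Mean of X:", mean_x, "\n")
cat("Variance of X:", var_x, "\n")
cat("Mean of Y:", mean_y, "\n")
cat("Variance of Y:", var_y, "\n")
cat("Covariance of X and Y:", cov_xy, "\n")
cat("Correlation coefficient of X and Y:", cor_xy, "\n")
```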
Output
Mean of X: 2.190476
Variance of X: 0.6303855
Mean of Y: 1.571429
Variance of Y: 0.244898
Covariance of X and Y: -0.01360544
Correlation coefficient of X and Y: -0.03462717
Solution
# Define the equations of the regression lines
eq1 <- function(x) (8*x + 66)/10
eq2 <- function(x) (40*x - 214)/18
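# The two regression lines intersect at the point of means (search interval is an assumption)
x_intersection <- uniroot(function(x) eq1(x) - eq2(x), interval = c(0, 100))$root
# Calculate y value from the first equation and x value from the second equation
y_intersection <- eq1(x_intersection)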
y_x_on_y <- eq1(x_intersection)
x_y_on_x <- eq2(y_intersection)
Output
Mean value of x: 13
Mean value of y: 17
Correlation coefficient: 0.6566524
8. For the following data, using R find the most likely price at Madras corresponding to
the price 70 at Bombay and that at Bombay corresponding to the price 68 at Madras.
S.D of the difference between the price at Madras & Bombay is 3.1.
Madras Bombay
Average price 65 67
S.D. of price 0.5 3.5
Solution
# Given data
avg_m <- 65
avg_b <- 67
sd_m <- 0.5
sd_b <- 3.5
sd_diff <- 3.1
# Given prices
x <- 68
y <- 70
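The remaining steps are not shown. A possible completion first recovers the correlation coefficient from the identity Var(M − B) = σ_M² + σ_B² − 2rσ_Mσ_B and then applies the two regression equations. The result names are illustrative.

```r
avg_m <- 65; avg_b <- 67     # average prices at Madras and Bombay
sd_m <- 0.5; sd_b <- 3.5     # standard deviations of the prices
sd_diff <- 3.1               # S.D. of the difference between the prices

# Correlation coefficient from the variance of the difference
r <- (sd_m^2 + sd_b^2 - sd_diff^2) / (2 * sd_m * sd_b)

# Regression of Madras price on Bombay price, evaluated at Bombay = 70
madras_at_bombay_70 <- avg_m + r * (sd_m / sd_b) * (70 - avg_b)

# Regression of Bombay price on Madras price, evaluated at Madras = 68
bombay_at_madras_68 <- avg_b + r * (sd_b / sd_m) * (68 - avg_m)

cat("Correlation coefficient:", round(r, 4), "\n")
cat("Most likely price at Madras when Bombay price is 70:", round(madras_at_bombay_70, 2), "\n")
cat("Most likely price at Bombay when Madras price is 68:", round(bombay_at_madras_68, 2), "\n")
```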
Output
9. Suppose that the two-dimensional random variable (X, Y) has the joint probability density function
$$f(x, y) = \begin{cases} x + y, & 0 < x < 1 \text{ and } 0 < y < 1 \\ 0, & \text{otherwise} \end{cases}$$
Using R, obtain the correlation coefficient between X and Y. Check whether X and Y are
independent.
Solution
# Define the joint probability density function
f <- function(x, y) {
ifelse((0 < x & x < 1) & (0 < y & y < 1), x + y, 0)
}
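The rest of this solution is not shown. A sketch of one way to proceed with base R is given below: expectations are computed as double integrals over the unit square by nesting integrate(). The helper names f_xy, expect, and marginal_x are illustrative.

```r
# Joint density on the unit square (it is zero elsewhere, outside the integration limits)
f_xy <- function(x, y) x + y

# E[g(X, Y)] as a double integral over 0 < x < 1, 0 < y < 1
expect <- function(g) {
  inner <- function(y) {
    sapply(y, function(y_i) {
      integrate(function(x) g(x, y_i) * f_xy(x, y_i), lower = 0, upper = 1)$value
    })
  }
  integrate(inner, lower = 0, upper = 1)$value
}

e_x  <- expect(function(x, y) x)
e_y  <- expect(function(x, y) y)
e_x2 <- expect(function(x, y) x^2)
e_y2 <- expect(function(x, y) y^2)
e_xy <- expect(function(x, y) x * y)

cov_xy <- e_xy - e_x * e_y
rho <- cov_xy / sqrt((e_x2 - e_x^2) * (e_y2 - e_y^2))
cat("Correlation coefficient:", rho, "\n")

# Independence check: f(x, y) should equal f_X(x) * f_Y(y) everywhere
marginal_x <- function(x) integrate(function(y) f_xy(x, y), lower = 0, upper = 1)$value
marginal_y <- function(y) integrate(function(x) f_xy(x, y), lower = 0, upper = 1)$value
product_at_point <- marginal_x(0.2) * marginal_y(0.3)
independent_at_point <- isTRUE(all.equal(f_xy(0.2, 0.3), product_at_point))
cat("f(0.2, 0.3) =", f_xy(0.2, 0.3), " product of marginals =", product_at_point, "\n")
```

The joint density differs from the product of the marginals at the test point (and the correlation is non-zero), so X and Y are not independent.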
1. A completely randomised design experiment with 10 plots and 3 treatments gave the
following results:
Plot No 1 2 3 4 5 6 7 8 9 10
Treatment A B C A C C A B A B
Yield 5 4 3 7 5 1 3 4 1 7
Analyse the results for treatment effect using R.
Solution
# Create a data frame with the data
data <- data.frame(
  Plot_No = 1:10,
  Treatment = c("A", "B", "C", "A", "C", "C", "A", "B", "A", "B"),
  Yield = c(5, 4, 3, 7, 5, 1, 3, 4, 1, 7)
)
# Perform ANOVA
result <- aov(Yield ~ Treatment, data = data)
# Display the ANOVA table
summary(result)
Output
2. The following are the numbers of mistakes made in 5 successive days of 4 technicians
working for a photographic industry:
Technician I (X1) Technician II (X2) Technician III (X3) Technician IV (X4)
6 14 10 9
14 9 12 12
10 12 7 8
8 10 15 10
11 14 11 11
Using R, test at the level of significance α=0.01, whether the difference among the 4
sample means, can be attributed to chance.
Solution
# Create a data frame with the provided data
data <- data.frame(
  Day = 1:5,
  Technician_I = c(6, 14, 10, 8, 11),
  Technician_II = c(14, 9, 12, 10, 14),
  Technician_III = c(10, 12, 7, 15, 11),
  Technician_IV = c(9, 12, 8, 10, 11)
)
# Reshape to long format: one row per technician and day
data_long <- data.frame(
  Technician = factor(rep(names(data)[-1], each = nrow(data))),
  Mistakes = unlist(data[-1], use.names = FALSE)
)
alpha <- 0.01 # Significance level
# Perform ANOVA
result <- aov(Mistakes ~ Technician, data = data_long)
# Summary of ANOVA
summary(result)
# Extract p-value
p_value <- summary(result)[[1]]$`Pr(>F)`[1]
# Conclusion
if (p_value < alpha) {
  cat("Reject the null hypothesis. There is a difference among the means of the four technicians' mistake rates.\n")
} else {
  cat("Accept the null hypothesis. There is no significant difference among the means of the four technicians' mistake rates.\n")
}
Output
3. There are three main brands of a certain powder. A set of 120 sample values is examined
and found to be allocated among four groups (A, B, C and D) and the three brands (I,
II, III) as shown here under:
Brands    Groups
          A    B    C    D
I         0    4    8   15
II        5    8   13    6
III       8   19   11   13
Solution
# Create a data frame with the provided data
data <- data.frame(
  Brand = rep(c("I", "II", "III"), each = 4),
  Group = rep(c("A", "B", "C", "D"), times = 3),
  Count = c(0, 4, 8, 15, 5, 8, 13, 6, 8, 19, 11, 13)
)
alpha <- 0.05 # Assumed significance level (not stated in the problem above)
# Perform ANOVA (assumed one-way model on Brand; the original call is missing)
result <- aov(Count ~ Brand, data = data)
# Summary of ANOVA
summary_anova <- summary(result)
print(summary_anova)
# Extract p-value
p_value <- summary_anova[[1]]$`Pr(>F)`[1]
# Conclusion
if (p_value < alpha) {
  cat("Reject the null hypothesis. There is a significant difference in brand preference.\n")
} else {
  cat("Accept the null hypothesis. There is no significant difference in brand preference.\n")
}
Output
Perform ANOVA using R and test at 0.01 level of significance, whether there are
differences in the detergents or in the engines.
Solution
# Create a data frame with the provided data
cleanliness <- data.frame(
  Engine = factor(rep(1:3, each = 4)),
  Detergent = factor(rep(c("A", "B", "C", "D"), times = 3)),
  Cleanliness = c(45, 47, 48, 42, 43, 46, 50, 37, 51, 52, 55, 49)
)
# Perform two-way ANOVA (Detergent first, so that its p-value is in row 1)
model <- aov(Cleanliness ~ Detergent + Engine, data = cleanliness)
# Summary of ANOVA
anova_result <- summary(model)
print(anova_result)
# Extract p-values
detergent_p_value <- anova_result[[1]]$"Pr(>F)"[1]
engine_p_value <- anova_result[[1]]$"Pr(>F)"[2]
# Conclusion
if (detergent_p_value > 0.01 && engine_p_value > 0.01) {
  cat("Accept the null hypothesis. There are no significant differences in cleanliness between detergents and engines.\n")
} else {
  cat("Reject the null hypothesis. There are significant differences in cleanliness between detergents or engines.\n")
}
Output
Solution
# Load the required library for ANOVA
library(dplyr)
# Perform ANOVA
result <- aov(Value ~ Treatment + Blocks, data = data_long)
Output
6. The following is a Latin square of a design, when 4 varieties of seeds are being tested.
Use R to set up the analysis of variance table. You may carry out suitable change of
origin and scale.
A 105 B 95 C 125 D 115
C 115 D 125 A 105 B 105
D 115 C 95 B 105 A 115
B 95 A 135 D 95 C 115
Solution
# Data
Yield <- c(105, 95, 125, 115,
115, 125, 105, 105,
115, 95, 105, 115,
95, 135, 95, 115)
Row <- factor(rep(1:4, each = 4))
Col <- factor(rep(1:4, times = 4))
Trt <- c("A", "B", "C", "D",
"C", "D", "A", "B",
"D", "C", "B", "A",
"B", "A", "D", "C")
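df <- data.frame(Yield, Row, Col, Trt)
# Change of origin and scale: subtract 100 and divide by 5
df$Adjusted_Yield <- (df$Yield - 100) / 5
# Linear model
fit_model <- lm(formula = Adjusted_Yield ~ Row + Col + Trt, data = df)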
summary(fit_model)
anova_result <- anova(fit_model)
print(anova_result)
Output
Call:
lm(formula = Adjusted_Yield ~ Row + Col + Trt, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.5 -1.5 0.0 1.5 3.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.500e+00 2.500e+00 1.000 0.356
Row2 5.000e-01 2.236e+00 0.224 0.830
Row3 -5.000e-01 2.236e+00 -0.224 0.830
Row4 3.336e-16 2.236e+00 0.000 1.000
Col2 1.000e+00 2.236e+00 0.447 0.670
Col3 4.759e-16 2.236e+00 0.000 1.000
Col4 1.000e+00 2.236e+00 0.447 0.670
TrtB -3.000e+00 2.236e+00 -1.342 0.228
TrtC -5.000e-01 2.236e+00 -0.224 0.830
TrtD -5.000e-01 2.236e+00 -0.224 0.830
Residual standard error: 3.162 on 6 degrees of freedom
Multiple R-squared: 0.3182, Adjusted R-squared: -0.7045
F-statistic: 0.3111 on 9 and 6 DF, p-value: 0.9432
Response: Adjusted_Yield
Df Sum Sq Mean Sq F value Pr(>F)
Row 3 2 0.6667 0.0667 0.9756
Col 3 4 1.3333 0.1333 0.9367
Trt 3 22 7.3333 0.7333 0.5690
Residuals 6 60 10.0000