R Programming Lecture Overview

The document is a comprehensive lecture review on programming with R, covering topics such as installation, basic data types, statistical operations, data import/export, visualization, data cleaning, and advanced data manipulation. It includes practical programming exercises and examples to illustrate the concepts. The content is structured into sections that guide the reader through the essential aspects of using R for data science.

Uploaded by

Rashmi Hunnur
Copyright
© All Rights Reserved

Programming With R

Lecture Review

Dr. Kalyan N
Assistant Professor
Dept. of CSE (Data Science)
B.M.S College of Engineering
Bengaluru - 560019.
[Link]@[Link]
Homepage
October, 2024.
Contents

1 Introduction to R and RStudio
  1.1 Install R and RStudio
  1.2 Write and execute your first R script that includes basic arithmetic operations, variable assignments, and printing results.
  1.3 Document the steps to install R and RStudio and describe the purpose of each line of your script

2 Basic Data Types in R
  2.1 Vectors
  2.2 Matrices
  2.3 Lists
  2.4 Data Frames
    2.4.1 Conclusion
  2.5 Design an R program to create and manipulate vectors, matrices, lists, and data frames. Include operations such as indexing, subsetting, and applying functions like sum(), mean(), and length().
  2.6 Create a data frame from scratch, perform basic operations, and describe the structure and type of each element in the data frame.

3 Basic Statistical Operations
  3.1 Explanation of Statistical Operations
  3.2 Interpreting the Measures
  3.3 R Program to Calculate Statistical Measures
  3.4 Explanation of Results

4 Data Import and Export
  4.1 Available Libraries for Data Import and Export
    4.1.1 Examples of Open-Source Datasets
  4.2 Design an R Program to Import Data from a CSV File, Clean It, and Export the Cleaned Data
  4.3 Steps to Check the Structure, Summarize the Contents, and Verify the Export of the Cleaned Data

5 Basic Data Visualization
  5.1 Different Visualization Types
  5.2 Design an R Program to Create Basic Plots
  5.3 Explanation of the Plots
  5.4 Customizing and Saving Plots as Image Files
  5.5 Brief Interpretation of Each Plot in the Context of the Data

6 Data Cleaning and Preparation
  6.1 What is Data Cleaning and Why is it Needed?
    6.1.1 Common Problems with Real-World Datasets
  6.2 Design an R Program to Handle Missing Data, Filter Rows, and Select Specific Columns
    6.2.1 Sample Dataset
    6.2.2 R Code
  6.3 Document the Cleaning Process and Rationale

7 Advanced Data Manipulation using dplyr
  7.1 Overview of dplyr Functions
  7.2 Design an R Program to Use dplyr Functions
    7.2.1 Example Dataset
    7.2.2 R Code
    7.2.3 Selecting Columns
    7.2.4 Filtering Rows
    7.2.5 Creating New Columns with mutate
    7.2.6 Summarizing Data with summarize
    7.2.7 Arranging Rows
  7.3 Detailed Explanation of Each Operation

8 Data Visualization using ggplot2
  8.1 Overview of ggplot2
  8.2 Design an R Program to Create Advanced Plots Using ggplot2
    8.2.1 Example Dataset
    8.2.2 R Code
    8.2.3 Faceting
    8.2.4 Customizing Plot Aesthetics
    8.2.5 Adding Annotations
  8.3 Detailed Explanation of Each Plot

9 Descriptive Statistics and Data Summary
  9.1 Introduction to Descriptive Statistics
  9.2 Design an R Program to Generate Descriptive Statistics and Create a Data Summary Report
  9.3 Output and Interpretation

10 Linear Regression: An In-Depth Overview
  10.1 Introduction
  10.2 The Linear Regression Equation
    10.2.1 Simple Linear Regression
    10.2.2 Multiple Linear Regression
  10.3 Real-World Applications of Linear Regression
    10.3.1 1. Economics
    10.3.2 2. Healthcare
    10.3.3 3. Marketing
    10.3.4 4. Real Estate
    10.3.5 5. Education
  10.4 Example of Linear Regression in R
  10.5 Explanation and Analysis of Plots
    10.5.1 Scatter Plot with Regression Line
    10.5.2 Residuals vs Fitted Values Plot
    10.5.3 Q-Q Plot
    10.5.4 Scale-Location Plot
    10.5.5 5. Leverage vs Standardized Residuals Plot
  10.6 Conclusion

11 Advanced Data Analysis
  11.1 Multiple Linear Regression Analysis
    11.1.1 Explanation of Code and Results
  11.2 Checking for Multicollinearity and Model Validation
  11.3 Model Selection Using Stepwise Selection
  11.4 Cross-Validation
    11.4.1 Analysis

12 Introduction to Machine Learning with R
  12.1 Design an R Program to Implement K-means Clustering
  12.2 Optimal Number of Clusters and Practical Applications

13 Time Series Analysis
  13.1 Design an R Program to Analyze and Forecast Time Series Data using ARIMA Models
  13.2 Interpreting the Fitted ARIMA Model Output
  13.3 Forecasting Future Values Using the ARIMA Model
    13.3.1 Expected Output
  13.4 Use a Time Series Dataset, Perform Exploratory Data Analysis, Fit an ARIMA Model, and Make Future Forecasts
  13.5 Augmented Dickey-Fuller (ADF) Test for Stationarity
    13.5.1 ADF Test Code
    13.5.2 Test Output
    13.5.3 Interpretation of Results
  13.6 Include Steps to Check for Stationarity, Select Model Parameters, and Evaluate the Model's Forecasting Accuracy

14 Creating Interactive Visualizations
  14.1 Designing Interactive Plots with Plotly
    14.1.1 Installing and Loading Plotly
    14.1.2 Creating Interactive Scatter Plots
    14.1.3 Creating Interactive Line Charts
    14.1.4 Creating Interactive Bar Charts
    14.1.5 Customizing Interactivity
    14.1.6 Conclusion
  14.2 Figures
  14.3 Incorporating Tooltips, Hover Effects, and Interactive Legends
    14.3.1 Tooltips and Hover Effects
    14.3.2 Interactive Legends
    14.3.3 Customizing Hover Effects
  14.4 Advantages of Using Interactive Visualizations for Data Exploration

15 Data Reporting with RMarkdown
  15.1 Introduction
  15.2 Download R Markdown
  15.3 Getting started with R Markdown
  15.4 Syntax used in R Mark Down
  15.5 YAML
  15.6 Knitting the document

Module - 1
Lab Programs

1 Introduction to R and RStudio


R is a widely used programming language for statistical analysis, visualization, and data science. RStudio is an
Integrated Development Environment (IDE) for R that provides an easy-to-use interface for writing, executing,
and managing R code.

1.1 Install R and RStudio


To begin, you need to install both R and RStudio. Follow the steps below:

• Step 1: Install R
– Go to the official R website: [Link]
– Choose your operating system (Windows, macOS, or Linux).
– Download the appropriate R installer.
– Run the installer and follow the instructions.
• Step 2: Install RStudio
– Visit the RStudio website: [Link]
– Download the RStudio installer for your operating system.
– Run the installer and follow the prompts to complete the installation.
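
After both installers have finished, a quick check from the RStudio console can confirm that R is reachable. The following is a minimal sketch using only base R; the printed values depend on the installed version and operating system, so no output is shown.

```r
# Print the version string of the running R installation
print(R.version.string)

# Show the directory where R is installed on this machine
print(R.home())

# Confirm that the console evaluates a simple expression
print(1 + 1)
```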

1.2 Write and execute your first R script that includes basic arithmetic operations,
variable assignments, and printing results.
Once R and RStudio are installed, you can write your first R script. Below is a simple R script that includes basic
arithmetic operations, variable assignments, and printing results:
# Basic Arithmetic Operations
sum <- 10 + 5
difference <- 10 - 5
product <- 10 * 5
quotient <- 10 / 5

# Variable Assignments
x <- 25
y <- 5

# Performing Operations on Variables
z <- x + y
result <- x / y

# Printing Results
print(sum)
print(difference)
print(product)
print(quotient)
print(z)
print(result)
Listing 1: Simple R Script

1.3 Document the steps to install R and RStudio and describe the purpose of each
line of your script
Explanation of the Script
• # Basic Arithmetic Operations: This is a comment that explains what the following lines of code will do.
In R, comments begin with the # symbol.
• sum <- 10 + 5: This line performs the addition of two numbers (10 and 5) and stores the result in the variable
sum.
• difference <- 10 - 5: This line subtracts 5 from 10 and stores the result in the variable difference.
• product <- 10 * 5: This line multiplies 10 and 5 and stores the result in the variable product.
• quotient <- 10 / 5: This line divides 10 by 5 and stores the result in the variable quotient.
• x <- 25: This line assigns the value 25 to the variable x.
• y <- 5: This line assigns the value 5 to the variable y.
• z <- x + y: This line adds the variables x and y, and stores the result in z.
• result <- x / y: This line divides x by y, and stores the result in result.
• print(sum): This line prints the value of the variable sum.
• print(difference): This line prints the value of the variable difference.
• print(product): This line prints the value of the variable product.
• print(quotient): This line prints the value of the variable quotient.
• print(z): This line prints the value of the variable z.
• print(result): This line prints the value of the variable result.

This script demonstrates how to perform basic arithmetic operations, assign values to variables, and print the
results in R.
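
The script covers the four basic operators. As a small extension that is not part of the original listing, the sketch below shows three further arithmetic operators in base R: exponentiation, integer division, and the remainder. The value 27 is an arbitrary choice for illustration.

```r
x <- 25
y <- 5

power <- x ^ 2         # exponentiation: 25 squared is 625
int_div <- 27 %/% y    # integer division: 27 divided by 5 gives 5
remainder <- 27 %% y   # remainder of that division: 2

print(power)
print(int_div)
print(remainder)
```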

2 Basic Data Types in R


In R, fundamental data structures play a crucial role in the storage, manipulation, and analysis of data. The
primary data types in R include vectors, matrices, lists, and data frames. Each of these structures is tailored to
handle different types of data and operations. Understanding these data types is essential for efficiently managing
and processing data in R.

2.1 Vectors
A vector is the simplest data structure in R. It is a one-dimensional array that contains elements of the same type,
such as numeric, character, or logical values. Vectors are used for storing homogeneous data in a sequential manner.

• A numeric vector stores numbers.
• A character vector stores text (strings).
• A logical vector stores boolean values (TRUE/FALSE).

Example:
# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)

# Character vector
char_vector <- c("apple", "banana", "cherry")

# Logical vector
logical_vector <- c(TRUE, FALSE, TRUE)

# Accessing elements from a vector
second_element <- numeric_vector[2]  # Access the second element
Listing 2: Creating and manipulating vectors in R
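
Because a vector holds elements of a single type, mixing types in c() converts every element to a common type. The short sketch below illustrates that coercion, along with element-wise arithmetic on a numeric vector. The variable names mixed and doubled are chosen here for illustration.

```r
# Mixing a number, a string, and a logical coerces everything to character
mixed <- c(1, "apple", TRUE)
print(class(mixed))   # "character"

# Arithmetic on a numeric vector is applied element by element
numeric_vector <- c(1, 2, 3, 4, 5)
doubled <- numeric_vector * 2
print(doubled)        # 2 4 6 8 10
```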

2.2 Matrices
A matrix is a two-dimensional data structure in R where each element must be of the same type. Matrices are
essentially vectors organized into rows and columns. They are often used for mathematical operations such as matrix
multiplication and linear algebra computations.
Example:
# Create a 3x3 numeric matrix
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)

# Matrix operations
element <- matrix_data[2, 3]           # Access element in row 2, column 3
transpose_matrix <- t(matrix_data)     # Transpose the matrix
Listing 3: Creating and manipulating matrices in R
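
Two details are easy to miss when working with matrices: matrix() fills values column by column unless byrow = TRUE is given, and %*% performs matrix multiplication whereas * multiplies element by element. The sketch below illustrates both; by_row and identity_3 are names introduced here for illustration.

```r
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)            # filled column by column
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)   # filled row by row

print(matrix_data[2, 3])  # 8
print(by_row[2, 3])       # 6

# Multiplying by the 3x3 identity matrix leaves the values unchanged
identity_3 <- diag(3)
product <- matrix_data %*% identity_3
print(dim(product))
```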

2.3 Lists
A list is a versatile data structure that can contain elements of different types, such as numbers, strings, vectors,
and even other lists. Lists are used to store heterogeneous data, making them highly flexible for complex data
manipulation tasks.
Example:
# Create a list with different types of elements
my_list <- list(
  name = "John Doe",
  age = 30,
  scores = c(88, 92, 85),
  passed = TRUE
)

# Accessing elements from the list
name <- my_list$name            # Access the 'name' element
score_vector <- my_list$scores  # Access the 'scores' vector
Listing 4: Creating and manipulating lists in R
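
Besides the $ operator, list elements can be reached with brackets. The sketch below, using the same example list, shows the difference between double brackets (which return the element itself) and single brackets (which return a shorter list). The names scores_vec and scores_sub are introduced here for illustration.

```r
my_list <- list(name = "John Doe", age = 30, scores = c(88, 92, 85), passed = TRUE)

scores_vec <- my_list[["scores"]]  # double brackets return the numeric vector itself
scores_sub <- my_list["scores"]    # single brackets return a list of length 1

print(class(scores_vec))  # "numeric"
print(class(scores_sub))  # "list"

# Elements can be modified by name
my_list$age <- 31
print(my_list$age)
```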

2.4 Data Frames


A data frame is a two-dimensional table-like structure where each column can contain different types of data (numeric,
character, or logical). Data frames are used for handling datasets in a structured format, where rows represent
observations and columns represent variables. They are widely used for data analysis in R.
Example:
# Create a data frame with student information
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(22, 25, 24),
  GPA = c(3.8, 3.9, 3.7)
)

# Accessing columns and rows
student_ages <- students_df$Age     # Access the 'Age' column
second_student <- students_df[2, ]  # Access the entire second row

# Summary of the data frame
summary(students_df)  # Basic summary of the data frame
Listing 5: Creating and manipulating data frames in R
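
Rows of a data frame are often selected by a condition rather than by position. The following sketch, based on the same student data, filters rows with a logical condition and selects columns by name. The threshold 3.75 and the names high_gpa and name_gpa are arbitrary choices made for this illustration.

```r
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(22, 25, 24),
  GPA = c(3.8, 3.9, 3.7)
)

# Keep only the rows where GPA exceeds the threshold
high_gpa <- students_df[students_df$GPA > 3.75, ]
print(high_gpa)

# Select two columns by name
name_gpa <- students_df[, c("Name", "GPA")]
print(dim(name_gpa))  # 3 rows, 2 columns
```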

2.4.1 Conclusion
The fundamental data types in R (vectors, matrices, lists, and data frames) are essential for storing and
manipulating various forms of data. Each data type serves a specific purpose, from handling simple sequences of numbers
(vectors) to complex datasets with multiple variables (data frames). Learning how to create and manipulate these
data types is key to effective data analysis in R.

2.5 Design an R program to create and manipulate vectors, matrices, lists, and data
frames. Include operations such as indexing, subsetting, and applying functions
like sum(), mean(), and length().
The following R script demonstrates the creation and manipulation of vectors, matrices, lists, and data frames,
including basic operations such as indexing, subsetting, and applying functions like sum(), mean(), and length():
# Creating a vector
v <- c(10, 20, 30, 40)

# Indexing and subsetting the vector
v[2]                # Access the second element
subset_v <- v[2:4]  # Subset from index 2 to 4

# Applying functions to the vector
sum_v <- sum(v)        # Sum of vector elements
mean_v <- mean(v)      # Mean of vector elements
length_v <- length(v)  # Length of the vector

# Creating a matrix
m <- matrix(1:9, nrow = 3, ncol = 3)

# Indexing the matrix
element <- m[2, 3]  # Access element in row 2, column 3

# Applying functions to the matrix
sum_m <- sum(m)    # Sum of matrix elements
mean_m <- mean(m)  # Mean of matrix elements

# Creating a list
my_list <- list(name = "Alice", age = 25, scores = c(85, 90, 95))

# Indexing the list
name <- my_list$name  # Access the 'name' element

# Creating a data frame
df <- data.frame(
  Name = c("John", "Doe", "Jane"),
  Age = c(28, 25, 32),
  Score = c(88, 95, 91)
)

# Indexing the data frame
age_first <- df$Age[1]  # Access the 'Age' of the first entry

# Applying functions to the data frame
sum_score <- sum(df$Score)  # Sum of 'Score' column
mean_age <- mean(df$Age)    # Mean of 'Age' column
Listing 6: Creating and Manipulating Vectors, Matrices, Lists, and Data Frames
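
Subsetting is not limited to positions and ranges. As a brief supplement to the listing, the sketch below shows logical and negative indexing on the same vector, plus the base R helpers rowSums() and colMeans() on the same matrix. The result names are introduced here for illustration.

```r
v <- c(10, 20, 30, 40)

above_15 <- v[v > 15]    # logical indexing keeps elements that satisfy the condition
without_first <- v[-1]   # a negative index drops the element at that position

m <- matrix(1:9, nrow = 3, ncol = 3)
row_totals <- rowSums(m)  # one sum per row: 12 15 18
col_means <- colMeans(m)  # one mean per column: 2 5 8

print(above_15)
print(without_first)
print(row_totals)
print(col_means)
```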

2.6 Create a data frame from scratch, perform basic operations, and describe the
structure and type of each element in the data frame.
A data frame is a table-like structure in R, where each column contains values of one variable and each row contains
values for each observation. Below is an R program that creates a data frame from scratch and performs basic
operations:
# Creating a data frame from scratch
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(22, 25, 24),
  GPA = c(3.8, 3.9, 3.7)
)

# Displaying the data frame
print(students_df)

# Accessing individual columns
names_column <- students_df$Name
gpa_column <- students_df$GPA

# Applying functions to the data frame
mean_gpa <- mean(students_df$GPA)  # Mean of GPA column
sum_age <- sum(students_df$Age)    # Sum of Age column

# Structure of the data frame
str(students_df)

# Type of each element in the data frame
class(students_df$Name)  # Type of the Name column
class(students_df$Age)   # Type of the Age column
class(students_df$GPA)   # Type of the GPA column
Listing 7: Creating and Manipulating a Data Frame

Explanation of the Script


• v <- c(10, 20, 30, 40): This creates a vector v containing four numeric elements.
• v[2]: Indexing in R starts at 1, so this accesses the second element of the vector v.
• subset_v <- v[2:4]: This subsets the vector v to include elements from index 2 to 4.
• sum(v), mean(v), length(v): These functions calculate the sum, mean, and length of the vector v, respectively.
• m <- matrix(1:9, nrow=3, ncol=3): This creates a 3x3 matrix with elements from 1 to 9.
• element <- m[2, 3]: This accesses the element in the second row and third column of matrix m.
• my_list <- list(): This creates a list with elements of different data types (string, numeric, and vector).
• df <- data.frame(): This creates a data frame with three columns: Name, Age, and Score.
• students_df <- data.frame(): This creates a data frame with student information, including their name,
age, and GPA.
• str(students_df): This function displays the structure of the data frame.
• class(): This function is used to determine the data type of each column in the data frame (e.g., character,
numeric).

The script demonstrates how to work with various data types in R, such as vectors, matrices, lists, and data
frames. It also showcases key operations like indexing, subsetting, and applying functions such as sum(), mean(),
and length().
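
Calling class() once per column works, but the same information can be collected in one step. The sketch below is one possible approach using sapply() from base R; stringsAsFactors = FALSE is set explicitly as an assumption so that the Name column is character on older R versions as well.

```r
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(22, 25, 24),
  GPA = c(3.8, 3.9, 3.7),
  stringsAsFactors = FALSE  # keep Name as character on older R versions too
)

# Apply class() to every column; the result is a named character vector
column_types <- sapply(students_df, class)
print(column_types)

print(nrow(students_df))  # number of observations (rows)
print(ncol(students_df))  # number of variables (columns)
```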

3 Basic Statistical Operations


Statistical operations provide essential insights into data, helping to summarize, interpret, and make decisions based
on patterns and variability. Common statistical measures such as mean, median, mode, standard deviation, and
variance allow us to quantify key properties of a dataset. In this section, we explain these concepts with equations
and plots, demonstrating their relevance using the Iris dataset in R.

3.1 Explanation of Statistical Operations


• Mean:
The mean is the arithmetic average of all data points. It is calculated using the formula:
\[
\text{Mean} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]
where $x_i$ represents each individual data point, and $n$ is the total number of observations. The mean is
particularly sensitive to outliers, as extreme values can significantly affect it.
Relevance: The mean provides a quick measure of the central tendency, giving an idea of the "average" value
of the dataset. In the case of sepal lengths from the Iris dataset, the mean provides an overall sense of the
typical sepal length.
# Plotting Histogram for Sepal Length
hist(iris$Sepal.Length, main = "Histogram of Sepal Length",
     xlab = "Sepal Length", col = "lightblue")
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 2)

Listing 8: Mean Calculation Example

Figure 1: Histogram of Sepal Length with Mean Line (x-axis: Sepal Length; y-axis: Count)

• Median:
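The median is the middle value when data is sorted in ascending or descending order. For a dataset with an
odd number of observations, it is the central value, and for an even number of observations, it is the average
of the two middle values:
\[
\text{Median} =
\begin{cases}
x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\
\dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even}
\end{cases}
\]
where $x_k$ denotes the $k$-th value of the sorted data and $n$ is the number of observations.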

Relevance: The median is robust against outliers and skewed data, providing a better central tendency
measure when the data distribution is not symmetric. In the Iris dataset, if we suspect the presence of extreme
values, the median offers a reliable alternative to the mean.
# Plotting Boxplot for Sepal Length
boxplot(iris$Sepal.Length, main = "Boxplot of Sepal Length",
        ylab = "Sepal Length", col = "lightgreen")
abline(h = median(iris$Sepal.Length), col = "blue", lwd = 2)

Listing 9: Median Calculation Example

Figure 2: Boxplot of Sepal Length with Median Line (y-axis: Sepal Length)

• Mode:
The mode is the value that appears most frequently in the dataset. Unlike the mean and median, the mode is
more useful for categorical or discrete data. For continuous data, like in the Iris dataset, the mode might not
provide much insight.
Relevance: The mode is helpful for identifying the most common value in a dataset. In cases of categorical
data (e.g., species), the mode shows which category occurs most frequently.
# Custom function to calculate mode in R
get_mode <- function(v) {
  uniq_vals <- unique(v)
  uniq_vals[which.max(tabulate(match(v, uniq_vals)))]
}
mode_sepal <- get_mode(iris$Sepal.Length)

Listing 10: Mode Calculation Example

Figure 3: Histogram of Sepal Length with Mode Line (x-axis: Sepal Length; y-axis: Count)

• Standard Deviation:
The standard deviation measures how spread out the data points are relative to the mean. It is calculated as:
\[
\text{Standard Deviation} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

where x̄ is the mean of the data. A small standard deviation indicates that the data points are close to the
mean, while a large standard deviation indicates greater variability.
Relevance: The standard deviation provides an understanding of the consistency of the data. In the Iris
dataset, it can help us see whether the sepal lengths vary widely or are concentrated around the mean.
# Plotting Density for Sepal Length with Standard Deviation Lines
plot(density(iris$Sepal.Length), main = "Density Plot of Sepal Length",
     xlab = "Sepal Length")
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 2)
abline(v = mean(iris$Sepal.Length) + sd(iris$Sepal.Length), col = "blue", lty = 2)
abline(v = mean(iris$Sepal.Length) - sd(iris$Sepal.Length), col = "blue", lty = 2)

Listing 11: Standard Deviation Calculation Example

• Variance:
Variance is the square of the standard deviation and is calculated as:
\[
\text{Variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]

Figure 4: Density Plot of Sepal Length with Mean and Standard Deviation Lines (x-axis: Sepal Length; y-axis: Density)

Like the standard deviation, variance measures the spread of data points. However, because it is in squared
units, it is less intuitive than standard deviation, though it serves as an important statistical concept in many
models.
Relevance: Variance is useful in statistical modeling (e.g., regression, ANOVA) and other operations requiring
knowledge of data spread. A higher variance means greater variability in the data.
# Calculating variance
variance_sepal <- var(iris$Sepal.Length)

Listing 12: Variance Calculation Example
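
As a quick check of the two formulas above, the following sketch computes the variance and standard deviation of Sepal.Length by hand and compares them with R's built-in var() and sd(). The variable names (x, manual_var, manual_sd) are illustrative and not part of the original listings.

```r
# Sepal lengths from the built-in iris dataset
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Compare with the built-in functions (all.equal tolerates rounding error)
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```

Both comparisons should agree, since var() and sd() use the same n - 1 denominator as the formulas.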

3.2 Interpreting the Measures


• The mean provides an overall average value of the sepal length, which is useful for understanding the central
tendency of the dataset.
• The median shows the middle point of the dataset, offering an alternative to the mean, especially when the
data has outliers or skewed distributions.
• The mode is the most frequently occurring value and is particularly useful for identifying common values in
categorical datasets.
• The standard deviation and variance provide insight into the spread or variability of the data. Low values
indicate that the data points are closely clustered around the mean, while high values show greater spread,
indicating more variability within the dataset.
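
To make the point about outliers concrete, here is a minimal sketch using a small made-up vector (the values are hypothetical, not taken from the Iris dataset). Adding a single extreme value shifts the mean considerably, while the median barely moves.

```r
# Hypothetical measurements without and with one extreme value
x <- c(5.0, 5.1, 5.2, 5.3, 5.4)
x_out <- c(x, 50)

# Mean and median before adding the outlier
mean(x)
median(x)

# Mean and median after adding the outlier
mean(x_out)
median(x_out)
```

This is why the median is usually reported alongside the mean when the data may contain outliers or be skewed.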

3.3 R Program to Calculate Statistical Measures


The following R script uses the Iris dataset to calculate the mean, median, mode, standard deviation, and variance
of the Sepal.Length variable.

# Load the Iris dataset
data(iris)

# Extract the 'Sepal.Length' column
sepal_length <- iris$Sepal.Length

# Calculate Mean
mean_sepal <- mean(sepal_length)
print(paste("Mean of Sepal Length:", mean_sepal))

# Calculate Median
median_sepal <- median(sepal_length)
print(paste("Median of Sepal Length:", median_sepal))

# Calculate Mode (custom function, since R doesn't have a built-in
# function for the statistical mode)
get_mode <- function(v) {
  uniq_vals <- unique(v)
  # Count occurrences of each unique value and return the most frequent one
  uniq_vals[which.max(tabulate(match(v, uniq_vals)))]
}
mode_sepal <- get_mode(sepal_length)
print(paste("Mode of Sepal Length:", mode_sepal))

# Calculate Standard Deviation
std_dev_sepal <- sd(sepal_length)
print(paste("Standard Deviation of Sepal Length:", std_dev_sepal))

# Calculate Variance
variance_sepal <- var(sepal_length)
print(paste("Variance of Sepal Length:", variance_sepal))
Listing 13: R Script for Basic Statistical Operations using the Iris Dataset

3.4 Explanation of Results


• Mean: The mean is calculated by summing all values of Sepal.Length and dividing by the number of
observations (150 in the Iris dataset). It represents the average length of the sepals.
• Median: The median of Sepal.Length is the middle value when all lengths are sorted. It is a robust measure
of the central tendency, particularly useful when the data contains outliers.
• Mode: In the Iris dataset, the mode of Sepal.Length is the value that appears most frequently. For continuous
data like Sepal.Length, the mode may not always be meaningful but can help identify recurring patterns.
• Standard Deviation: The standard deviation indicates how much the lengths deviate from the mean. A
higher standard deviation would suggest greater variation in sepal length across the dataset.
• Variance: The variance is the square of the standard deviation, so it is expressed in squared units. It
summarizes the average squared distance of the data points from the mean.
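
One caveat about the mode: a function built on which.max() reports only the first value that reaches the highest count, so ties are hidden. The sketch below shows one possible variant that returns every tied value; get_all_modes is a hypothetical helper name introduced here for illustration.

```r
# Return every value that occurs with the maximum frequency
get_all_modes <- function(v) {
  uniq_vals <- unique(v)
  counts <- tabulate(match(v, uniq_vals))
  uniq_vals[counts == max(counts)]
}

# A small made-up vector with two modes (2 and 3 each occur twice)
get_all_modes(c(1, 2, 2, 3, 3))

# Applied to the sepal lengths
get_all_modes(iris$Sepal.Length)
```

For data with a single most frequent value, this returns the same result as a which.max()-based function.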

4 Data Import and Export


In R, handling data import and export is crucial for processing datasets for analysis. Several libraries in R make it
easy to load datasets from various sources (such as CSV, Excel, JSON, etc.) and save the cleaned data back into
desired formats. Common libraries used for importing/exporting datasets include readr, data.table, and rio. For
open-source datasets, websites such as Kaggle, UCI Machine Learning Repository, and government portals provide
a wide range of data.
Below, we will explore these libraries and functions using an open-source dataset (for example, the Iris dataset
in CSV format) and demonstrate a program for importing, cleaning, and exporting data in R.

4.1 Available Libraries for Data Import and Export


• readr: A fast and user-friendly package in the tidyverse collection for importing and writing data, particularly
CSV files.
library(readr)
df <- read_csv("path/to/data.csv")

• data.table: An efficient package for importing, manipulating, and exporting large datasets.
library(data.table)
df <- fread("path/to/data.csv")

• rio: A versatile package that supports importing and exporting data in multiple formats such as CSV, Excel,
JSON, and more. It simplifies the process with a single function for all file types.
library(rio)
df <- import("path/to/data.csv")
export(df, "cleaned_data.csv")

4.1.1 Examples of Open-Source Datasets


• UCI Machine Learning Repository - Contains numerous datasets for machine learning experiments.

• Kaggle Datasets - Large collection of open-source datasets for data science competitions.
• data.gov.in - Indian government open data portal with a range of real-world datasets.

4.2 Design an R Program to Import Data from a CSV File, Clean It, and Export the
Cleaned Data
The following R program demonstrates how to import data from a CSV file, perform basic cleaning (such as removing
missing values or NAs), and export the cleaned data into a new CSV file. We’ll also explain each function used in
detail.
# Load necessary libraries
library(readr)

# Step 1: Import the dataset
data <- read_csv("iris.csv")  # Replace with actual file path

# Step 2: Check the structure of the data
str(data)  # Displays the structure of the data (types of columns, data frame summary)

# Step 3: Remove rows with NA values
cleaned_data <- na.omit(data)

# Step 4: Export the cleaned dataset to a new CSV file
write_csv(cleaned_data, "cleaned_iris.csv")  # Replace with the desired file path
Listing 14: R Program to Import, Clean, and Export Data
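
The built-in iris data has no missing values, so the cleaning step has no visible effect on it. The following sketch, which is an illustration rather than part of the original program, injects a few NA values into a copy of iris, writes it to a temporary CSV file, and then shows na.omit() dropping the affected rows. It uses base R's read.csv() and write.csv() so that it runs without extra packages; the row positions chosen for the NA values and the temporary file path are arbitrary assumptions.

```r
# Copy iris and inject missing values at arbitrary (assumed) positions
iris_na <- iris
iris_na$Sepal.Length[c(3, 50, 120)] <- NA

# Write the modified data to a temporary CSV file
csv_path <- file.path(tempdir(), "iris.csv")
write.csv(iris_na, csv_path, row.names = FALSE)

# Import, inspect the number of missing values, and clean
raw <- read.csv(csv_path)
colSums(is.na(raw))     # missing values per column
cleaned <- na.omit(raw)

# Rows before and after cleaning
nrow(raw)
nrow(cleaned)
```

Three rows contain an injected NA, so the cleaned data has three fewer rows than the raw import.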

Explanation of Functions:


• read_csv(): This function, from the readr package, is used to import data from a CSV file. It automatically
detects column types and loads the data into a tibble, which is a type of R data frame.
• str(): This function provides an overview of the structure of the imported data, showing details such as
column names, data types (numeric, character, factor), and the number of rows and columns.
• na.omit(): A built-in function in R that removes rows with NA (missing) values from the dataset. This is a
simple data cleaning operation often performed on raw data.
• write_csv(): This function writes a data frame back into a CSV file. In this example, the cleaned data (with
missing values removed) is exported to a new file named cleaned_iris.csv.

4.3 Steps to Check the Structure, Summarize the Contents, and Verify the Export
of the Cleaned Data
Once the data has been imported, it is essential to understand its structure, clean it if necessary, and verify that it
has been exported correctly. Below are the steps to follow:

1. Check the structure of the imported data: Use the str() function to examine the data’s structure and
understand its composition, including the number of rows, columns, and the data types of each variable.
str(data)
# Example output (abridged; a tibble read with read_csv() prints a
# slightly different header line):
# 'data.frame': 150 obs. of 5 variables:
#  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 ...
#  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 ...
#  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 ...
#  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 ...
#  $ Species     : chr "setosa" "setosa" "setosa" "setosa" ...

2. Summarize the contents of the dataset: Use the summary() function to get a quick statistical summary
of each variable in the dataset (mean, median, min, max, etc.).
summary(data)
# Example output for a numeric column (Sepal.Length):
# Min.   :4.300
# 1st Qu.:5.100
# Median :5.800
# Mean   :5.843
# 3rd Qu.:6.400
# Max.   :7.900

3. Verify the successful export of the cleaned data: After exporting the cleaned data to a new CSV file,
confirm its successful export by re-importing the CSV and checking the structure again.
# Re-import the cleaned data to verify the export
cleaned_data_check <- read_csv("cleaned_iris.csv")
str(cleaned_data_check)
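
Beyond inspecting str() by eye, the verification can also be expressed as explicit checks. The sketch below is one possible approach, shown with base R functions and the built-in airquality dataset as a stand-in for the cleaned data; the output file name and the export_ok flag are assumptions made for this example.

```r
# Stand-in for the cleaned data: airquality with incomplete rows removed
cleaned_data <- na.omit(airquality)

# Export to a temporary CSV file and re-import it
out_path <- file.path(tempdir(), "cleaned_airquality.csv")
write.csv(cleaned_data, out_path, row.names = FALSE)
cleaned_data_check <- read.csv(out_path)

# The export is considered successful if the file exists, the shape and
# column names match, and no missing values were reintroduced
export_ok <- file.exists(out_path) &&
  identical(dim(cleaned_data_check), dim(cleaned_data)) &&
  identical(names(cleaned_data_check), names(cleaned_data)) &&
  !anyNA(cleaned_data_check)
export_ok
```

Checks like these are easy to rerun whenever the cleaning script changes.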

5 Basic Data Visualization


Data visualization is a crucial part of data analysis. It helps in understanding the distribution, patterns, and
relationships within the data. Common visualization techniques in R include histograms, bar plots, line plots, and
scatter plots. The ggplot2 package, part of the tidyverse collection, is one of the most powerful tools for creating
complex and customizable plots. Below, we explain different types of visualizations with sample R programs and
examples.

5.1 Different Visualization Types


• Histogram: A histogram is used to visualize the distribution of a continuous variable by dividing the data
into bins. It is useful for identifying the shape of the distribution and detecting outliers.
• Bar Plot: Bar plots display categorical data using rectangular bars. The height of each bar represents the
frequency or count of the category.
• Line Plot: Line plots are useful for visualizing data trends over time or a sequence of observations. Each
point is connected by a line, making it easier to see patterns.
• Scatter Plot: A scatter plot shows the relationship between two continuous variables. Each point represents
an observation, with one variable on the x-axis and the other on the y-axis.

5.2 Design an R Program to Create Basic Plots


Below is a sample R program that creates a histogram, bar plot, line plot, and scatter plot using the open-source
mtcars dataset (available in R).
# Load necessary libraries
library(ggplot2)

# Load the mtcars dataset
data("mtcars")

# Histogram of Miles per Gallon (mpg)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
  ggtitle("Histogram of MPG") +
  xlab("Miles per Gallon") +
  ylab("Count")

# Bar plot of Number of Cylinders
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "tomato", color = "black") +
  ggtitle("Bar Plot of Cylinder Count") +
  xlab("Number of Cylinders") +
  ylab("Count")

# Line plot of MPG over Index (sequential order of rows)
ggplot(mtcars, aes(x = seq_len(nrow(mtcars)), y = mpg)) +
  geom_line(color = "green") +
  ggtitle("Line Plot of MPG") +
  xlab("Index") +
  ylab("Miles per Gallon")

# Scatter plot of MPG vs Horsepower
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "purple") +
  ggtitle("Scatter Plot of MPG vs Horsepower") +
  xlab("Horsepower") +
  ylab("Miles per Gallon")
Listing 15: R Program for Basic Data Visualizations

5.3 Explanation of the Plots


• Histogram:
– Example: The histogram of mpg (Miles per Gallon) shows the frequency distribution of cars based on
their fuel efficiency.
– Customization: We can adjust the bin width to control the granularity of the histogram.
– Relevance: Useful for understanding the spread and central tendency of fuel efficiency.
• Bar Plot:

– Example: The bar plot displays the count of cars with different numbers of cylinders (cyl). It provides
insight into the most common cylinder configurations.
– Customization: Colors and labels are customized to improve readability.
– Relevance: It helps in understanding the distribution of categorical data.

• Line Plot:
– Example: The line plot shows the trend of mpg across the cars, indexed sequentially. It highlights any
observable pattern in fuel efficiency across the dataset.
– Customization: The line color is changed to make the plot visually appealing.
– Relevance: Useful for detecting trends or seasonal patterns over an ordered variable such as time. Here the
row index has no inherent order, so the plot mainly illustrates the technique.

• Scatter Plot:
– Example: The scatter plot illustrates the relationship between hp (horsepower) and mpg. Each point
represents a car, with its horsepower and fuel efficiency plotted.
– Customization: The point color is adjusted, and axis labels are provided to improve understanding.
– Relevance: Scatter plots are excellent for identifying correlations between two continuous variables.

5.4 Customizing and Saving Plots as Image Files


Once the plots are created, we can customize them by adding titles, axis labels, and adjusting colors. Additionally,
R allows saving the plots as image files in different formats, such as PNG, JPEG, and PDF.
Below is the code to customize the plots and save them as PNG images:
# Note: each ggplot is wrapped in print() so that it is drawn to the open
# device even when the script is run with source() or inside a function.

# Customize and save the Histogram plot
png("histogram_mpg.png")
print(
  ggplot(mtcars, aes(x = mpg)) +
    geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
    ggtitle("Histogram of MPG") +
    xlab("Miles per Gallon") +
    ylab("Count")
)
dev.off()

# Customize and save the Bar plot
png("barplot_cyl.png")
print(
  ggplot(mtcars, aes(x = factor(cyl))) +
    geom_bar(fill = "tomato", color = "black") +
    ggtitle("Bar Plot of Cylinder Count") +
    xlab("Number of Cylinders") +
    ylab("Count")
)
dev.off()

# Customize and save the Line plot
png("lineplot_mpg.png")
print(
  ggplot(mtcars, aes(x = seq_len(nrow(mtcars)), y = mpg)) +
    geom_line(color = "green") +
    ggtitle("Line Plot of MPG") +
    xlab("Index") +
    ylab("Miles per Gallon")
)
dev.off()

# Customize and save the Scatter plot
png("scatterplot_mpg_hp.png")
print(
  ggplot(mtcars, aes(x = hp, y = mpg)) +
    geom_point(color = "purple") +
    ggtitle("Scatter Plot of MPG vs Horsepower") +
    xlab("Horsepower") +
    ylab("Miles per Gallon")
)
dev.off()
Listing 16: Saving Plots as Image Files in R
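
As an alternative to opening and closing a graphics device by hand, ggplot2 provides ggsave(), which writes a plot object directly to a file and infers the format from the file extension. The sketch below is a minimal example; the output location (a temporary directory) and the width, height, and dpi values are assumptions chosen for illustration.

```r
library(ggplot2)

# Store the plot in a variable so it can be passed to ggsave()
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
  ggtitle("Histogram of MPG") +
  xlab("Miles per Gallon") +
  ylab("Count")

# Save as PNG; size is given in inches by default
out_file <- file.path(tempdir(), "histogram_mpg.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 150)

# Confirm that the file was written
file.exists(out_file)
```

Changing the extension (for example to .pdf) is usually enough to switch the output format.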

5.5 Brief Interpretation of Each Plot in the Context of the Data


• Histogram (MPG): The histogram shows that most cars have a fuel efficiency between 15 and 25 miles per
gallon, with fewer cars in the higher and lower mpg ranges. This is typical for the 1973-74 model-year cars
represented in the mtcars dataset.
• Bar Plot (Cylinders): The bar plot shows that the most common number of cylinders in the dataset is 8,
followed by 4 and 6. This indicates that high-powered cars with 8 cylinders were prevalent.

Figure 5: Histogram of Miles per Gallon (MPG) from the mtcars dataset.

Figure 6: Bar Plot showing the distribution of cylinders in the mtcars dataset.

• Line Plot (MPG Index): The line plot shows no clear trend across the index, as mtcars is a small, unordered
dataset of car performance metrics.

Figure 7: Line Plot of MPG across the index of the mtcars dataset.

• Scatter Plot (MPG vs Horsepower): The scatter plot reveals a negative correlation between mpg and hp.
As horsepower increases, fuel efficiency (mpg) tends to decrease, reflecting the trade-off between power and
efficiency in vehicles.
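
The visual impressions from the bar plot and the scatter plot can be backed up with numbers. The short sketch below counts the cars per cylinder class and computes the Pearson correlation between mpg and hp; the variable names are illustrative.

```r
# Number of cars for each cylinder count (4, 6, 8)
cyl_counts <- table(mtcars$cyl)
cyl_counts

# Pearson correlation between fuel efficiency and horsepower;
# a negative value matches the downward pattern in the scatter plot
mpg_hp_cor <- cor(mtcars$mpg, mtcars$hp)
mpg_hp_cor
```

A correlation describes the strength of a linear association only, so it complements the plot rather than replacing it.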

Figure 8: Scatter Plot of MPG vs Horsepower in the mtcars dataset.


Module - 2
Lab Programs

6 Data Cleaning and Preparation


6.1 What is Data Cleaning and Why is it Needed?
Data Cleaning is a crucial step in data analysis and involves the process of identifying and correcting (or removing)
inaccuracies and inconsistencies in a dataset. The goal of data cleaning is to ensure that the data is accurate, complete,
and useful for analysis.
Why is Data Cleaning Needed? Data cleaning is necessary because real-world data is often messy and incomplete.
Without proper cleaning, analyses can lead to misleading results, incorrect conclusions, or ineffective decision-making.
Clean data is vital for producing reliable and actionable insights.

6.1.1 Common Problems with Real-World Datasets


Real-world datasets often come with various issues that need addressing:

• Missing Values: Data may have missing values due to errors in data collection or entry. For instance, a
survey dataset might have unanswered questions.
– Example: A customer survey dataset where some respondents did not provide their age. This missing
information needs to be handled to avoid biases.
• Duplicate Entries: Duplicate rows can occur from repeated data entry or merging datasets. These duplicates
can skew analysis results.
– Example: An employee database where an employee is accidentally entered multiple times with different
IDs. This needs to be corrected to maintain data integrity.

• Inconsistent Formatting: Different formats for dates, text, or numerical values can make analysis challeng-
ing. For example, dates might be recorded as “MM/DD/YYYY” in some entries and “DD/MM/YYYY” in
others.
– Example: A sales dataset where some dates are in the format “01-12-2024” and others are “2024/01/12.”
Consistent formatting is essential for accurate time-based analysis.
• Outliers: Extreme values that deviate significantly from other observations can affect statistical measures.
These outliers might be due to data entry errors or actual rare events.
– Example: In a dataset of student exam scores, an entry of 200 in a 100-point exam could be an error
that needs addressing.

• Incorrect Data Types: Data might be recorded in incorrect formats, such as numerical data stored as text,
which can complicate analysis.
– Example: A dataset with a “Price” column containing numeric values stored as text (“$100”, “200”)
instead of pure numbers.
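
Several of these problems can be detected with a few lines of base R. The sketch below uses a small made-up data frame (all names and values are hypothetical) to count missing values, find duplicate rows, flag an out-of-range score, and convert a text price column to numbers.

```r
# Hypothetical toy data with typical real-world problems
df <- data.frame(
  name  = c("Asha", "Ravi", "Ravi", "Meena"),
  age   = c(29, 35, 35, NA),
  score = c(88, 200, 200, 76),
  price = c("$100", "200", "200", "$350"),
  stringsAsFactors = FALSE
)

# Missing values per column
colSums(is.na(df))

# Number of rows that exactly repeat an earlier row
sum(duplicated(df))

# Scores outside the valid 0-100 range (possible entry errors)
df[df$score > 100, ]

# Strip currency symbols and commas, then convert text to numeric
df$price_num <- as.numeric(gsub("[$,]", "", df$price))
df$price_num
```

Detection comes first; whether to remove, correct, or keep a flagged value depends on the dataset and the analysis.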


6.2 Design an R Program to Handle Missing Data, Filter Rows, and Select Specific
Columns
Objective: The objective of this exercise is to design an R program that:
• Handles missing data by imputing or removing it.
• Filters rows based on specific conditions to include only relevant data.

• Selects specific columns to focus on the necessary information.

6.2.1 Sample Dataset


We’ll use a sample dataset with missing values, named sample_data.csv, for this demonstration. The dataset
contains columns such as “Name,” “Age,” “Salary,” and “Department.”

6.2.2 R Code
Below is the R code for data cleaning and preparation. Comments explain each step of the process.
# Load necessary library for data manipulation
library(dplyr)

# Read the dataset from a CSV file
data <- read.csv("sample_data.csv")

# Display the first few rows of the dataset
head(data)

# Display summary statistics to understand the structure and missing values
summary(data)

# Handling Missing Data
# Remove rows with missing values
cleaned_data <- na.omit(data)

# Alternatively, you can impute missing values, for example, with the mean of
# the column (numeric columns only; mean() is not defined for text columns)
# cleaned_data <- data %>%
#   mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Filter Rows Based on Condition
# Example: Filter rows where 'Age' is greater than 25
filtered_data <- cleaned_data %>%
  filter(Age > 25)

# Select Specific Columns
# Example: Select only 'Name', 'Age', and 'Salary' columns
selected_data <- filtered_data %>%
  select(Name, Age, Salary)

# Save the cleaned and prepared data to a new CSV file
write.csv(selected_data, "cleaned_data.csv", row.names = FALSE)

# Display the cleaned and prepared data
head(selected_data)

Listing 17: R Code for Data Cleaning and Preparation
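
Mean imputation can also be done in base R, which makes the individual steps easy to see. The sketch below is an illustration under assumed data: the data frame contents are made up, with column names matching those described for the sample dataset.

```r
# Hypothetical data with missing values in the numeric columns
df <- data.frame(
  Name       = c("A", "B", "C", "D"),
  Age        = c(25, NA, 31, 40),
  Salary     = c(50000, 62000, NA, 58000),
  Department = c("HR", "IT", "IT", "Sales"),
  stringsAsFactors = FALSE
)

# Identify numeric columns; text columns are left untouched
num_cols <- vapply(df, is.numeric, logical(1))

# Replace each NA with the mean of the observed values in that column
df[num_cols] <- lapply(df[num_cols], function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

df
```

Imputation keeps every row, but it also reduces the variability of the imputed column, so it is worth noting in the analysis whenever it is used.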

6.3 Document the Cleaning Process and Rationale


• Loading the Dataset: The dataset is loaded using read.csv(). This function reads data from a CSV file
into a data frame for manipulation.
• Handling Missing Data: Missing values are addressed using na.omit(). This function removes rows with
any missing values. Alternatively, missing values can be imputed with the mean of the column if retaining data
is preferred.
• Filtering Rows: Rows are filtered based on a condition (e.g., Age > 25) to include only relevant observations.
This ensures that the dataset only contains data points that meet specific criteria.
• Selecting Columns: Specific columns are selected using select() to focus on relevant data. This step
simplifies the dataset by retaining only the necessary columns for further analysis.
• Saving Data: The cleaned and prepared dataset is saved to a new CSV file using write.csv(). This preserves
the data in its cleaned state for future use.

7 Advanced Data Manipulation using dplyr


7.1 Overview of dplyr Functions
The dplyr package in R is a powerful tool for data manipulation. It provides a set of functions to perform operations
on data frames in a clear and concise manner. Key functions include:

• select(): Chooses specific columns from a data frame.


• filter(): Filters rows based on certain conditions.
• mutate(): Adds new columns or modifies existing columns.
• summarize(): Aggregates data to provide summary statistics.

• arrange(): Sorts rows based on column values.

7.2 Design an R Program to Use dplyr Functions


Objective: The objective of this exercise is to use dplyr functions to manipulate a data frame, including selecting
columns, filtering rows, creating new columns, summarizing data, and arranging rows.

7.2.1 Example Dataset


We will use the mtcars dataset, a built-in dataset in R that contains various attributes of different car models.

7.2.2 R Code
Below is the R code demonstrating the use of dplyr functions. Each operation is explained, and expected outputs
are provided.
# Load the necessary library
library(dplyr)

# Load the mtcars dataset
data <- mtcars

# Display the first few rows of the dataset
head(data)
Listing 18: Advanced Data Manipulation with dplyr

Expected Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

7.2.3 Selecting Columns

# Select specific columns: mpg, cyl, and hp
selected_data <- data %>%
  select(mpg, cyl, hp)

# Display the first few rows of the selected data
head(selected_data)
Listing 19: Selecting Columns

Expected Output:

                   mpg cyl  hp
Mazda RX4         21.0   6 110
Mazda RX4 Wag     21.0   6 110
Datsun 710        22.8   4  93
Hornet 4 Drive    21.4   6 110
Hornet Sportabout 18.7   8 175
Valiant           18.1   6 105

7.2.4 Filtering Rows

# Filter rows where the number of cylinders (cyl) is 6
filtered_data <- data %>%
  filter(cyl == 6)

# Display the first few rows of the filtered data
head(filtered_data)
Listing 20: Filtering Rows

Expected Output:

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

7.2.5 Creating New Columns with mutate

# Create a new column 'hp_per_cyl' that is the horsepower divided by the number
# of cylinders
mutated_data <- data %>%
  mutate(hp_per_cyl = hp / cyl)

# Display the first few rows of the mutated data
head(mutated_data)
Listing 21: Creating New Columns

Expected Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb hp_per_cyl
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   18.33333
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   18.33333
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   23.25000
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   18.33333
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   21.87500
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   17.50000

7.2.6 Summarizing Data with summarize

# Summarize data to find the mean mpg and total horsepower
summary_data <- data %>%
  summarize(mean_mpg = mean(mpg), total_hp = sum(hp))

# Display the summary data
summary_data
Listing 22: Summarizing Data

Expected Output:

  mean_mpg total_hp
1 20.09062     4694
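
The same two summary values can be cross-checked with base R, without dplyr. This short sketch is only a sanity check; the variable names are illustrative.

```r
# Mean fuel efficiency across all 32 cars
mean_mpg <- mean(mtcars$mpg)
mean_mpg

# Total horsepower across all 32 cars
total_hp <- sum(mtcars$hp)
total_hp
```

For mtcars, the mean mpg is about 20.09 and the total horsepower is 4694.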

7.2.7 Arranging Rows

# Arrange rows by horsepower in descending order
arranged_data <- data %>%
  arrange(desc(hp))

# Display the first few rows of the arranged data
head(arranged_data)
Listing 23: Arranging Rows

Expected Output:

                     mpg cyl disp  hp drat    wt  qsec vs am gear carb
Maserati Bora       15.0   8  301 335 3.54 3.570 14.60  0  1    5    8
Ford Pantera L      15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4

7.3 Detailed Explanation of Each Operation


• Selecting Columns: The select() function is used to choose specific columns from a data frame. This
operation helps in focusing on relevant data and simplifying the dataset.

• Filtering Rows: The filter() function allows for extracting rows that meet certain criteria. This is useful
for narrowing down the data to include only the rows of interest.
• Creating New Columns: The mutate() function is used to add new columns or modify existing ones. It
allows for the transformation of data within the data frame, such as calculating new metrics.

• Summarizing Data: The summarize() function aggregates data to produce summary statistics. This is
useful for gaining insights into the overall characteristics of the dataset.
• Arranging Rows: The arrange() function sorts rows based on column values. This helps in organizing the
data, such as ordering by descending horsepower to easily identify the most powerful cars.
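
In practice these verbs are usually chained into a single pipeline rather than applied one at a time. The sketch below shows one possible combination; it also uses group_by(), a dplyr function not covered in the list above, and the 20 mpg cutoff is an arbitrary assumption for illustration.

```r
library(dplyr)

# Keep efficient cars, derive a metric, then summarize per cylinder group
pipeline_result <- mtcars %>%
  filter(mpg > 20) %>%                 # rows: cars above 20 mpg
  mutate(hp_per_cyl = hp / cyl) %>%    # new column
  group_by(cyl) %>%                    # one group per cylinder count
  summarize(
    n_cars = n(),
    mean_hp_per_cyl = mean(hp_per_cyl)
  ) %>%
  arrange(desc(n_cars))                # largest group first

pipeline_result
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.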

8 Data Visualization using ggplot2


8.1 Overview of ggplot2
The ggplot2 package in R is a powerful tool for creating a wide range of static and complex graphics. It implements
the Grammar of Graphics, which allows for creating detailed and aesthetically pleasing plots. Key concepts include:

• ggplot(): The base function to initialize a plot.


• geom_*() functions (e.g., geom_point(), geom_line(), geom_bar()): Define the type of plot, such as points, lines, or bars.
• facet_wrap(): Creates multiple plots based on a factor variable.
• theme(): Customizes the appearance of the plot.

• labs(): Adds labels and annotations to the plot.

8.2 Design an R Program to Create Advanced Plots Using ggplot2


Objective: The objective of this exercise is to create advanced plots using ggplot2, including faceting, customizing
aesthetics, and adding annotations to provide deeper insights into the data.

8.2.1 Example Dataset


We will use the mtcars dataset, which includes attributes of different car models.

8.2.2 R Code
Below is the R code demonstrating various advanced plotting techniques with ggplot2. Each plot is explained, and
expected outputs are provided.
# Load the necessary library
library(ggplot2)

# Load the mtcars dataset
data <- mtcars

# Display the first few rows of the dataset
head(data)
Listing 24: Load ggplot2 Library and Dataset


Expected Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

8.2.3 Faceting

# Faceted scatter plot of mpg vs hp, separated by number of cylinders
ggplot(data, aes(x = hp, y = mpg)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  facet_wrap(~ cyl) +
  labs(title = "Scatter Plot of MPG vs HP by Cylinder",
       x = "Horsepower",
       y = "Miles Per Gallon",
       color = "Number of Cylinders") +
  theme_minimal()
Listing 25: Creating Faceted Plots

Expected Output:
This plot will display scatter plots of horsepower vs. miles per gallon, faceted by the number of cylinders. Each
panel will represent a different number of cylinders, with points colored by cylinder count.

Figure 9: Faceted scatter plot of MPG vs HP, separated by number of cylinders. Each panel represents a different
number of cylinders, with points colored by cylinder count.


8.2.4 Customizing Plot Aesthetics

# Customizing the aesthetics of a scatter plot
# am is numeric (0 = automatic, 1 = manual); it is converted to a factor
# so that the manual (discrete) color scale can be applied
ggplot(data, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(am), shape = factor(cyl)), size = 4) +
  scale_color_manual(values = c("blue", "red"), labels = c("Automatic", "Manual")) +
  scale_shape_manual(values = c(16, 17, 18)) +
  labs(title = "Scatter Plot of MPG vs Weight",
       x = "Weight",
       y = "Miles Per Gallon",
       color = "Transmission",
       shape = "Number of Cylinders") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))
Listing 26: Customizing Plot Aesthetics

Expected Output:
This plot will display a scatter plot of weight vs. miles per gallon, with points colored by transmission type and
shaped by the number of cylinders. Custom colors and shapes enhance the visual differentiation.

Figure 10: Scatter plot of MPG vs Weight with customized aesthetics. Points are colored by transmission type and
shaped by the number of cylinders. Custom colors and shapes enhance visual differentiation.

8.2.5 Adding Annotations

# Scatter plot with annotations
ggplot(data, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(cyl)), size = 3) +
  geom_text(aes(label = row.names(data)), vjust = -0.5, hjust = 0.5) +
  labs(title = "Scatter Plot of HP vs MPG with Annotations",
       x = "Miles Per Gallon",
       y = "Horsepower",
       color = "Number of Cylinders") +
  theme_bw()
Listing 27: Adding Annotations

Expected Output:
This plot will show a scatter plot of horsepower vs. miles per gallon, with annotations displaying the row names
(car models) next to each point. This helps in identifying individual data points.

Figure 11: Scatter plot of HP vs MPG with annotations. Each data point is labeled with the car model name, aiding
in identifying individual observations.

8.3 Detailed Explanation of Each Plot


• Faceting: The facet_wrap() function creates multiple panels in the plot, each representing a subset of the
data based on a factor variable. This allows for comparing different groups side by side, such as comparing
cars with different numbers of cylinders.
• Customizing Plot Aesthetics: The scale_color_manual() and scale_shape_manual() functions customize
the colors and shapes used in the plot. This customization enhances clarity and visual appeal, making it easier
to differentiate between groups, such as automatic vs. manual transmissions.
• Adding Annotations: The geom_text() function adds text labels to the plot. Annotations are useful for
providing additional information about data points, such as car models in this case, helping viewers to easily
identify individual observations.

9 Descriptive Statistics and Data Summary


Descriptive statistics are fundamental in data analysis as they provide insights into the characteristics of the dataset.
These statistics summarize the central tendency, dispersion, and shape of the data distribution. This section covers
the generation of descriptive statistics and the creation of a data summary report using R.

9.1 Introduction to Descriptive Statistics


Descriptive statistics offer a snapshot of the data, making it easier to understand and interpret the underlying
patterns and distributions. Key measures include:

• Mean: The arithmetic average of all data points. It provides a measure of central tendency but can be affected
by outliers.
• Median: The middle value in a sorted dataset. It is less sensitive to outliers and skewed distributions compared
to the mean.

• Range: The difference between the maximum and minimum values. It indicates the spread of the data but
can be sensitive to extreme values.
• Quartiles: Values that divide the data into four equal parts. They provide insights into the spread and central
tendency, including the first quartile (Q1), median (Q2), and third quartile (Q3).

• Standard Deviation: Measures the typical distance of the data points from the mean (formally, the square
root of the average squared deviation). It indicates how spread out the data points are.
• Variance: The square of the standard deviation. It also measures the spread of data but is expressed in
squared units.
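The measures listed above can be tried on a small vector before moving to a full dataset. The following is a minimal sketch; the sample values in `x` are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Small illustrative sample (hypothetical values, not taken from mtcars)
x <- c(2, 4, 4, 5, 7, 9, 11)

mean_x   <- mean(x)                          # arithmetic average
median_x <- median(x)                        # middle value of the sorted data
range_x  <- diff(range(x))                   # maximum minus minimum
quart_x  <- quantile(x, c(0.25, 0.5, 0.75))  # Q1, Q2, Q3 (R's default method, type 7)
var_x    <- var(x)                           # sample variance (divides by n - 1)
sd_x     <- sd(x)                            # sample standard deviation

# The variance is the square of the standard deviation
all.equal(var_x, sd_x^2)

# Effect of an outlier: the mean shifts strongly, the median barely moves
x_out <- c(x, 100)
c(mean = mean(x_out), median = median(x_out))
```

Note that `var()` and `sd()` use the sample (n - 1) denominator, and that `quantile()` supports several interpolation methods, so other software may report slightly different quartiles.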

9.2 Design an R Program to Generate Descriptive Statistics and Create a Data Summary Report
We will use the ‘mtcars‘ dataset for this example. The ‘mtcars‘ dataset includes various attributes of different car
models, such as miles per gallon (mpg), number of cylinders, horsepower, etc. We will calculate descriptive statistics
and generate a summary report.
# Load necessary library
library(dplyr)

# Load the mtcars dataset
data <- mtcars

# Calculate mean, median, range, quartiles, standard deviation, and variance for MPG
mean_mpg <- mean(data$mpg)
median_mpg <- median(data$mpg)
range_mpg <- range(data$mpg)
quartiles_mpg <- quantile(data$mpg)
sd_mpg <- sd(data$mpg)
variance_mpg <- var(data$mpg)

# Create a summary table
summary_table <- data %>%
  summarise(
    Mean = mean(mpg),
    Median = median(mpg),
    Min = min(mpg),
    Max = max(mpg),
    SD = sd(mpg),
    Variance = var(mpg),
    Q1 = quantile(mpg, 0.25),
    Q3 = quantile(mpg, 0.75)
  )

# Print the results
print(paste("Mean of MPG:", mean_mpg))
print(paste("Median of MPG:", median_mpg))
print(paste("Range of MPG:", paste(range_mpg, collapse = " - ")))
print(paste("Quartiles of MPG:", paste(quartiles_mpg, collapse = ", ")))
print(paste("Standard Deviation of MPG:", sd_mpg))
print(paste("Variance of MPG:", variance_mpg))

# Print the summary table
print(summary_table)

# Save the summary table to a CSV file
write.csv(summary_table, "summary_statistics.csv")
Listing 28: Descriptive Statistics Calculation

9.3 Output and Interpretation


Running the code above produces the descriptive statistics summarized in the table below (values rounded to two
decimal places):

Statistic              Value
Mean of MPG            20.09
Median of MPG          19.20
Range of MPG           10.40 - 33.90
First Quartile (Q1)    15.43
Third Quartile (Q3)    22.80
Standard Deviation     6.03
Variance               36.32

Table 1: Summary of Descriptive Statistics for MPG

• Mean of MPG: The average miles per gallon for cars in the dataset is 20.09. This value represents the central
tendency of the data.
• Median of MPG: The median value of 19.20 is less influenced by extreme values compared to the mean. It
indicates the middle point of the dataset.
• Range of MPG: The range from 10.40 to 33.90 shows the spread of the MPG values across the dataset.
• Quartiles: The first quartile (Q1) at 15.43 and the third quartile (Q3) at 22.80 provide insights into the spread
of the middle 50% of the data.
• Standard Deviation: A standard deviation of 6.03 indicates the typical deviation of the MPG values from
the mean. A higher standard deviation means greater variability.
• Variance: The variance of 36.32, being the square of the standard deviation, also measures data dispersion
but in squared units.

The summary table consolidates these measures, offering a comprehensive view of the data’s central tendency
and variability. The table is saved as a CSV file for further analysis or reporting.

10 Linear Regression: An In-Depth Overview


10.1 Introduction
Linear regression is a powerful statistical method used to understand the relationship between a dependent variable
and one or more independent variables. This technique is fundamental in predictive modeling and data analysis.
It aims to find the best-fitting line through the data points in a way that minimizes the differences between the
observed values and the values predicted by the line.

10.2 The Linear Regression Equation


10.2.1 Simple Linear Regression
The simplest form of linear regression is simple linear regression, which involves two variables: one dependent
and one independent. The equation for a simple linear regression model is:

Y = β0 + β1 X + ϵ (1)
where:

• Y is the dependent variable (the outcome we want to predict).


• X is the independent variable (the predictor or feature).
• β0 is the y-intercept (the value of Y when X = 0).
• β1 is the slope of the line (the change in Y for a one-unit change in X).

• ϵ represents the error term (the difference between the observed and predicted values of Y ).
The goal of simple linear regression is to determine the values of β0 and β1 that minimize the sum of the squared
errors (Residual Sum of Squares, RSS). This is achieved using methods like Ordinary Least Squares (OLS).
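To make the least-squares idea concrete, the sketch below computes the slope and intercept from the standard closed-form OLS expressions and compares them with `lm()`. The choice of `wt` and `mpg` from `mtcars` is an assumption made for illustration; the variable names `beta0`, `beta1`, and `rss` are likewise illustrative.

```r
# Predictor and response (illustrative choice: car weight and fuel economy)
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form OLS estimates for simple linear regression
beta1 <- cov(x, y) / var(x)          # slope
beta0 <- mean(y) - beta1 * mean(x)   # intercept

# Residual Sum of Squares for the fitted line
rss <- sum((y - (beta0 + beta1 * x))^2)

# The same estimates obtained from lm()
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit)), c(beta0, beta1))
all.equal(rss, sum(resid(fit)^2))
```

Both comparisons should return TRUE, since `lm()` minimizes the same RSS criterion.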

10.2.2 Multiple Linear Regression


In multiple linear regression, the model involves more than one independent variable. The equation is extended
to:

Y = β0 + β1 X1 + β2 X2 + · · · + βp Xp + ϵ (2)
where X1 , X2 , . . . , Xp are the independent variables, and β1 , β2 , . . . , βp are the corresponding coefficients.
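For the multiple-regression case, the coefficients can be obtained by solving the normal equations (XᵀX)β = Xᵀy, a standard OLS result that is not stated in the text above. The sketch below assumes two predictors, `wt` and `hp` from `mtcars`, purely for illustration.

```r
# Design matrix with an explicit intercept column (illustrative predictors)
X <- cbind(Intercept = 1, wt = mtcars$wt, hp = mtcars$hp)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the coefficients reported by lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit)))
```

In practice `lm()` uses a QR decomposition rather than forming XᵀX directly, which is numerically more stable; the normal equations are shown here only because they mirror the formula.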

10.3 Real-World Applications of Linear Regression


Linear regression is used in various fields to model relationships and make predictions. Here are some examples:


10.3.1 Economics
In economics, linear regression can model the relationship between a country’s GDP and factors such as education
level, employment rate, and investment. For instance, a researcher might use linear regression to predict the GDP
based on the level of investment in education.

10.3.2 Healthcare
Healthcare professionals use linear regression to analyze the impact of various factors on health outcomes. For
example, linear regression can be used to predict blood pressure based on age, weight, and cholesterol levels. This
helps in identifying risk factors and improving patient care.

10.3.3 Marketing
In marketing, linear regression can help predict sales based on advertising expenditure, price changes, and other
marketing strategies. For instance, a company might use linear regression to forecast future sales based on past
advertising spend and product pricing.

10.3.4 Real Estate


Real estate analysts use linear regression to estimate property prices based on features like square footage, number
of bedrooms, and location. By fitting a linear model to historical sales data, they can predict property values for
new listings.

10.3.5 Education
In education, linear regression can analyze the impact of various factors on student performance. For example, it can
be used to predict student grades based on study hours, attendance, and participation in extracurricular activities.

10.4 Example of Linear Regression in R


Below is an R code example for performing simple linear regression on the ‘mtcars‘ dataset. The code fits a linear
model of ‘mpg‘ on ‘wt‘, prints the model summary, and produces five diagnostic plots, each saved as a PDF.
# Load necessary libraries
library(ggplot2)

# Load dataset
data(mtcars)

# Fit linear model
model <- lm(mpg ~ wt, data = mtcars)

# Print model summary
print(summary(model))

# Plot 1: Scatter Plot with Regression Line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Linear Regression of MPG on Weight",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_minimal()

# Save Plot 1 as PDF
ggsave("linear_regression_scatter_plot.pdf")

# Plot 2: Residuals vs Fitted Values
ggplot(data = model, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

# Save Plot 2 as PDF
ggsave("residuals_vs_fitted.pdf")

# Plot 3: Q-Q Plot
ggplot(data = model, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

# Save Plot 3 as PDF
ggsave("qq_plot.pdf")

# Plot 4: Scale-Location Plot
ggplot(data = model, aes(.fitted, sqrt(abs(.stdresid)))) +
  geom_point() +
  geom_smooth(se = FALSE, color = "red") +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = "Square Root of Standardized Residuals") +
  theme_minimal()

# Save Plot 4 as PDF
ggsave("scale_location_plot.pdf")

# Plot 5: Leverage vs Standardized Residuals
ggplot(data = model, aes(.hat, .stdresid)) +
  geom_point(aes(color = .cooksd)) +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "red") +
  labs(title = "Leverage vs Standardized Residuals",
       x = "Leverage",
       y = "Standardized Residuals") +
  theme_minimal()

# Save Plot 5 as PDF
ggsave("leverage_vs_residuals.pdf")
Listing 29: Simple Linear Regression Example


10.5 Explanation and Analysis of Plots


The following plots are generated to assess the fit and diagnostic performance of the linear regression model. Each
plot provides valuable insights into the model’s assumptions and performance.

10.5.1 Scatter Plot with Regression Line

# Plot 1: Scatter Plot with Regression Line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Linear Regression of MPG on Weight",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon") +
  theme_minimal()

# Save Plot 1 as PDF
ggsave("linear_regression_scatter_plot.pdf")

[Plot: Linear Regression of MPG on Weight. x-axis: Weight (1000 lbs); y-axis: Miles Per Gallon.]

Figure 12: Scatter plot of MPG versus Weight with the regression line.

Explanation:
• The scatter plot shows the relationship between the independent variable (Weight of the car) on the x-axis and
the dependent variable (Miles per Gallon) on the y-axis.
• The red line is the fitted linear regression line showing the predicted relationship between ‘wt‘ and ‘mpg‘.
Analysis:

• Advantages: The plot allows easy visualization of the relationship between variables. The slope of the
regression line helps understand the trend, i.e., as weight increases, MPG decreases.
• Disadvantages: If the relationship between the variables is nonlinear, this plot may not capture it accurately.


10.5.2 Residuals vs Fitted Values Plot

# Plot 2: Residuals vs Fitted Values
ggplot(data = model, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

# Save Plot 2 as PDF
ggsave("residuals_vs_fitted.pdf")

[Plot: Residuals vs Fitted. x-axis: Fitted Values; y-axis: Residuals.]

Figure 13: Residuals versus Fitted Values plot.

Explanation:
• This plot shows the residuals (difference between observed and predicted values) on the y-axis and the fitted
values (predicted by the model) on the x-axis.

• The horizontal dashed line at zero represents where residuals should ideally fall if the model fits the data well.
Analysis:
• Advantages: The residuals vs fitted values plot is useful for checking the assumption of linearity and ho-
moscedasticity. A random scatter of points around the zero line indicates that the model is appropriate.

• Disadvantages: A non-random pattern (e.g., curves) could indicate that the model does not fit well or there
is non-linearity that a linear model cannot capture.
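The quantities on the two axes of this plot can also be computed directly. The sketch below rebuilds the residuals as observed minus fitted values and checks two properties that hold for any least-squares fit with an intercept; the object name `res_manual` is illustrative.

```r
# Refit the simple model used in this section
fit <- lm(mpg ~ wt, data = mtcars)

# Residuals are observed values minus fitted values
res_manual <- mtcars$mpg - fitted(fit)
all.equal(unname(res_manual), unname(resid(fit)))

# With an intercept in the model, residuals sum to (numerically) zero
sum(res_manual)

# ... and are uncorrelated with the fitted values
cor(fitted(fit), res_manual)
```

Because these two properties hold by construction, the diagnostic value of the plot lies in the shape of the scatter (curvature, funnels), not in whether the points average to zero.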


10.5.3 Q-Q Plot

# Plot 3: Q-Q Plot
ggplot(data = model, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

# Save Plot 3 as PDF
ggsave("qq_plot.pdf")

[Plot: Q-Q Plot of Residuals. x-axis: Theoretical Quantiles; y-axis: Standardized Residuals.]

Figure 14: Q-Q plot of residuals.

Explanation:

• The Q-Q plot compares the standardized residuals to a theoretical normal distribution. If the residuals follow
a normal distribution, they will lie along the straight Q-Q line.
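As a rough illustration of what the Q-Q plot draws, the coordinates can be computed by hand: the sorted standardized residuals are paired with normal quantiles evaluated at evenly spaced probabilities. The sketch below uses base R's `ppoints()` and `qqnorm()`; the object names are illustrative.

```r
# Refit the simple model and extract standardized residuals
fit <- lm(mpg ~ wt, data = mtcars)
z <- rstandard(fit)
n <- length(z)

# Theoretical quantiles: standard normal quantiles at plotting positions
theoretical <- qnorm(ppoints(n))

# Sample quantiles: the standardized residuals in increasing order
sample_q <- sort(unname(z))

# qqnorm() computes the same coordinates (plot.it = FALSE suppresses drawing)
qq <- qqnorm(z, plot.it = FALSE)
all.equal(unname(sort(qq$x)), theoretical)

# First few (theoretical, sample) pairs that would be plotted
head(cbind(theoretical, sample_q))
```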
Analysis:
• Advantages: This plot checks the normality assumption of residuals. If the points follow the line, it suggests
that the residuals are normally distributed, which is a key assumption of linear regression.
• Disadvantages: Significant deviations from the line suggest that the residuals are not normally distributed,
which can make the p-values and confidence intervals of the regression unreliable, especially in small samples.

10.5.4 Scale-Location Plot


# Plot 4: Scale-Location Plot
ggplot(data = model, aes(.fitted, sqrt(abs(.stdresid)))) +
  geom_point() +
  geom_smooth(se = FALSE, color = "red") +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = "Square Root of Standardized Residuals") +
  theme_minimal()

# Save Plot 4 as PDF
ggsave("scale_location_plot.pdf")

[Plot: Scale-Location Plot. x-axis: Fitted Values; y-axis: Square Root of Standardized Residuals.]

Figure 15: Scale-Location plot.

Explanation:
• The Scale-Location plot (also known as the Spread-Location plot) helps check the assumption of homoscedas-
ticity, or equal variance of residuals.

• It plots the square root of the absolute standardized residuals versus the fitted values.
Analysis:
• Advantages: If the residuals show a random scatter around the line, it suggests that the variance of errors is
consistent across all fitted values.

• Disadvantages: A funnel shape indicates heteroscedasticity, meaning that the variance of the errors is not
constant, which could affect the reliability of the regression model.

10.5.5 Leverage vs Standardized Residuals Plot


# Plot 5: Leverage vs Standardized Residuals
ggplot(data = model, aes(.hat, .stdresid)) +
  geom_point(aes(color = .cooksd)) +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "red") +
  labs(title = "Leverage vs Standardized Residuals",
       x = "Leverage",
       y = "Standardized Residuals") +
  theme_minimal()

# Save Plot 5 as PDF
ggsave("leverage_vs_residuals.pdf")

[Plot: Leverage vs Standardized Residuals. x-axis: Leverage; y-axis: Standardized Residuals; points colored by Cook's distance (.cooksd).]

Figure 16: Leverage versus Standardized Residuals plot.

Explanation:

• This plot shows the leverage (a measure of how far an observation's predictor values lie from the mean of the
predictors) versus standardized residuals. Points outside the dashed lines at -2 and 2 have unusually large
residuals; those that also have high leverage may be influential points.
• It also colors the points by Cook’s distance, a measure that shows the influence of each point on the regression
model.
Analysis:

• Advantages: This plot identifies outliers or high-leverage points, which could disproportionately affect the
regression model.
• Disadvantages: If the plot identifies many influential points, the model may be unstable or require further
investigation to handle outliers.
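The same diagnostics can be tabulated rather than read off the plot. The sketch below collects leverage, standardized residuals, and Cook's distance for each car and flags candidates for closer inspection. The |standardized residual| > 2 cut-off matches the dashed lines in the plot; the leverage cut-off of 2p/n is a common rule of thumb and an assumption here, not something the text prescribes.

```r
# Refit the simple model used in this section
fit <- lm(mpg ~ wt, data = mtcars)

# One row of influence diagnostics per observation
diag_df <- data.frame(
  car       = rownames(mtcars),
  leverage  = hatvalues(fit),       # diagonal of the hat matrix
  std_resid = rstandard(fit),       # standardized residuals
  cooks_d   = cooks.distance(fit)   # Cook's distance
)

p <- length(coef(fit))   # number of estimated coefficients
n <- nrow(mtcars)        # number of observations

# Flag points with high leverage (rule of thumb: > 2p/n) or large residuals
flagged <- subset(diag_df, leverage > 2 * p / n | abs(std_resid) > 2)

# Most influential candidates first
flagged[order(-flagged$cooks_d), ]
```

A flagged point is a prompt for investigation (data entry error, unusual car, missing predictor), not an automatic reason to delete it.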

10.6 Conclusion
Linear regression is a versatile and widely used statistical method that helps in understanding and predicting re-
lationships between variables. By fitting a linear model to data, one can make informed decisions and predictions
based on empirical evidence. It is essential in various domains, including economics, healthcare, marketing, real
estate, and education.


Module - 3
Lab Programs

11 Advanced Data Analysis


Advanced data analysis involves using complex statistical techniques to uncover deeper insights from data. Unlike
basic descriptive statistics, advanced data analysis can model relationships between multiple variables, make predic-
tions, and test hypotheses. It typically includes techniques such as multiple linear regression, factor analysis, cluster
analysis, and advanced machine learning methods.
A key aspect of advanced data analysis is multiple linear regression, which extends simple linear regression by
modeling the relationship between a dependent variable and multiple independent variables. This technique is used
to understand how several predictor variables simultaneously affect a response variable and to determine the strength
and nature of these relationships.

11.1 Multiple Linear Regression Analysis


To perform a multiple linear regression analysis, we use a dataset with several predictor variables. For example, we’ll
use the open-source ‘Boston‘ dataset from the ‘MASS‘ package in R, which includes various housing attributes for
different suburbs of Boston. The goal is to predict the median value of owner-occupied homes (‘medv‘) based on
predictors such as average number of rooms (‘rm‘), crime rate (‘crim‘), and property tax rate (‘tax‘).
# Load necessary libraries
library(MASS)
library(ggplot2)

# Load dataset
data(Boston)

# Fit multiple linear regression model
model <- lm(medv ~ rm + crim + tax, data = Boston)

# Print model summary
print(summary(model))

# Plot residuals vs fitted values
ggplot(data = model, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

# Save plot as PDF
ggsave("residuals_vs_fitted_multiple.pdf")

# Plot Q-Q plot
ggplot(data = model, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()

# Save plot as PDF
ggsave("qq_plot_multiple.pdf")

# Plot Scale-Location plot
ggplot(data = model, aes(.fitted, sqrt(abs(.stdresid)))) +
  geom_point() +
  geom_smooth(se = FALSE, color = "red") +
  labs(title = "Scale-Location Plot",
       x = "Fitted Values",
       y = "Square Root of Standardized Residuals") +
  theme_minimal()

# Save plot as PDF
ggsave("scale_location_plot_multiple.pdf")

# Plot Leverage vs Standardized Residuals
ggplot(data = model, aes(.hat, .stdresid)) +
  geom_point(aes(color = .cooksd)) +
  geom_hline(yintercept = c(-2, 2), linetype = "dashed", color = "red") +
  labs(title = "Leverage vs Standardized Residuals",
       x = "Leverage",
       y = "Standardized Residuals") +
  theme_minimal()

# Save plot as PDF
ggsave("leverage_vs_residuals_multiple.pdf")
Listing 30: R Program for Multiple Linear Regression Analysis

[Plot: Residuals vs Fitted. x-axis: Fitted Values; y-axis: Residuals.]

Figure 17: Residuals vs Fitted Plot for Multiple Linear Regression

[Plot: Q-Q Plot of Residuals. x-axis: Theoretical Quantiles; y-axis: Standardized Residuals.]

Figure 18: Q-Q Plot for Multiple Linear Regression

[Plot: Scale-Location Plot. x-axis: Fitted Values; y-axis: Square Root of Standardized Residuals.]

Figure 19: Scale-Location Plot for Multiple Linear Regression

[Plot: Leverage vs Standardized Residuals. x-axis: Leverage; y-axis: Standardized Residuals; points colored by Cook's distance (.cooksd).]

Figure 20: Leverage vs Standardized Residuals Plot

11.1.1 Explanation of Code and Results


• Loading Libraries: We use ‘MASS‘ for the ‘Boston‘ dataset and ‘ggplot2‘ for plotting.
• Loading Dataset: The ‘Boston‘ dataset is loaded using the ‘data‘ function.

• Fitting the Model: The ‘lm‘ function fits a multiple linear regression model predicting ‘medv‘ from ‘rm‘,
‘crim‘, and ‘tax‘.
• Model Summary: The ‘summary‘ function provides coefficients, R-squared values, and significance levels.

The generated plots help in diagnosing the model:

• Residuals vs Fitted Plot: This plot helps identify heteroscedasticity (non-constant variance of residuals).
Ideally, residuals should be randomly scattered around zero.
• Q-Q Plot: This plot checks if residuals are normally distributed. Points should lie along the theoretical
quantiles line.

• Scale-Location Plot: This plot helps assess homoscedasticity. A horizontal line indicates equal variance.
• Leverage vs Standardized Residuals Plot: This plot helps detect influential data points. Points with high
leverage or large residuals might unduly influence the model.

11.2 Checking for Multicollinearity and Model Validation


Checking for Multicollinearity: Multicollinearity occurs when two or more predictor variables in a multiple linear
regression model are highly correlated with each other. This means that they provide redundant information about
the response variable, which can make the estimates of the coefficients unstable and unreliable. High multicollinearity
can inflate the variance of the coefficient estimates and make it difficult to determine the individual effect of each
predictor on the response variable.
Why Multicollinearity is Important in Advanced Data Analysis: In advanced data analysis, detecting and
addressing multicollinearity is crucial for several reasons:


• Coefficient Interpretation: High multicollinearity can make the coefficients of the predictors unstable and
sensitive to small changes in the data. This instability can lead to large standard errors for the coefficients,
which means that the confidence intervals for the predictors are wider and less informative.
• Model Performance: Multicollinearity mainly undermines inference rather than overall fit. The model might
still have a high R-squared value, but the inflated standard errors and unstable coefficients reduce its reliability
for making inferences and can hurt predictions when the relationships among predictors change in new data.
• Feature Selection: In the presence of multicollinearity, it becomes challenging to determine which predictor
variables are truly significant. This can impact feature selection processes and lead to incorrect conclusions
about which variables are important for predicting the response variable.
How Multicollinearity is Detected and Addressed:
• Variance Inflation Factor (VIF): The VIF quantifies how much the variance of a regression coefficient is
inflated due to multicollinearity. It is calculated for each predictor variable and is defined as:

\[ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} \]

where $R_i^2$ is the R-squared value obtained by regressing the i-th predictor on all other predictors. A VIF value
greater than 10 is often used as a threshold to indicate significant multicollinearity.
• Correlation Matrix: Examining the correlation matrix of the predictor variables can provide insights into
which variables are highly correlated. High pairwise correlations (close to ±1) may indicate potential multi-
collinearity issues.
• Condition Index: The condition index measures the sensitivity of the regression coefficients to small changes
in the predictor variables. A high condition index (typically above 30) suggests multicollinearity.
• Regularization Techniques: Techniques such as Ridge Regression or LASSO (Least Absolute Shrinkage and
Selection Operator) can be used to address multicollinearity by adding a penalty to the size of the coefficients,
thereby stabilizing the model.
In summary, addressing multicollinearity is a critical aspect of building reliable and interpretable regression models
in advanced data analysis. By identifying and mitigating multicollinearity, we can improve the stability and validity
of the regression coefficients and enhance the overall performance of the model.
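The VIF definition can be applied directly with base R, without any additional package. The sketch below regresses each predictor on the others and converts the resulting R-squared into a VIF; it assumes the same three predictors (`rm`, `crim`, `tax`) used in this section, and the name `vif_manual` is illustrative. The correlation matrix mentioned above is printed alongside for comparison.

```r
# Boston housing data from the MASS package
data("Boston", package = "MASS")
predictors <- c("rm", "crim", "tax")

# VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing predictor i on the others
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = Boston))$r.squared
  1 / (1 - r2)
})
vif_manual

# Pairwise correlations among the predictors
cor_mat <- cor(Boston[, predictors])
round(cor_mat, 2)

# Equivalent shortcut: VIFs are the diagonal of the inverse correlation matrix
diag(solve(cor_mat))
```

The two VIF computations should agree, which is a quick sanity check that the loop implements the formula correctly.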
# Load necessary library
library(car)

# Load Boston dataset
data("Boston", package = "MASS")

# Fit multiple linear regression model
model <- lm(medv ~ rm + crim + tax, data = Boston)

# Calculate VIF
vif(model)
Listing 31: Checking for Multicollinearity

Expected Output: The ‘vif‘ function returns one VIF value per predictor variable. The output has the following
form (the numbers shown are illustrative; the exact values depend on the data):
rm crim tax
1.500 1.200 1.800
VIF values such as these, well below 10, indicate no significant multicollinearity issues.

11.3 Model Selection Using Stepwise Selection


Model selection is a crucial step in building a regression model, aiming to identify the most significant predictors
while avoiding overfitting. Stepwise selection is an iterative approach that adds or removes predictors based on
specific criteria to find the most parsimonious model that best explains the variability in the response variable.

Stepwise Selection with ‘stepAIC‘:


In R, the ‘stepAIC()‘ function from the ‘MASS‘ package performs stepwise model selection by using Akaike Infor-
mation Criterion (AIC) as the selection criterion. The AIC measures the relative quality of a statistical model for
a given dataset, considering both the goodness of fit and the number of parameters. Lower AIC values indicate a
better balance between model fit and complexity.
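As a small illustration of the criterion itself, AIC can be computed by hand as 2k minus twice the log-likelihood, where k is the number of estimated parameters, and then compared across candidate models. The sketch below assumes two candidate formulas on the ‘Boston‘ data; the names `full`, `reduced`, and `aic_manual` are illustrative.

```r
# Boston housing data from the MASS package
data("Boston", package = "MASS")

# Two candidate models (illustrative choice of predictors)
full    <- lm(medv ~ rm + crim + tax, data = Boston)
reduced <- lm(medv ~ rm + tax, data = Boston)

# AIC = 2k - 2 * log-likelihood
ll <- logLik(full)
k  <- attr(ll, "df")   # coefficients plus the residual variance
aic_manual <- 2 * k - 2 * as.numeric(ll)
all.equal(aic_manual, AIC(full))

# Side-by-side comparison: the model with the lower AIC is preferred
AIC(full, reduced)
```

The AIC values printed by ‘stepAIC()‘ are computed up to an additive constant, so they may differ from those returned by `AIC()`; the ranking of models fitted to the same data is unaffected.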

The ‘stepAIC()‘ function supports three search strategies, selected with the ‘direction‘ argument:


• Forward Selection: Starts with an empty model and adds predictors one by one.
• Backward Elimination: Starts with the full model and removes predictors iteratively.
• Both Directions: Combines both forward and backward steps, adding and removing predictors to minimize
the AIC.
Here’s the R code for performing stepwise selection using ‘stepAIC‘:
# Load necessary library
library(MASS)

# Perform stepwise model selection
stepwise_model <- stepAIC(model, direction = "both")
summary(stepwise_model)
Listing 32: Model Selection Using Stepwise Selection
Expected Output:
The output from the ‘stepAIC()‘ function provides a summary of the final model after the stepwise selection
process. The listing below illustrates the form of the output; its numbers are illustrative and will not match an
actual run exactly:

Call:
lm(formula = medv ~ rm + tax, data = Boston)

Coefficients:
(Intercept) rm tax
3.234e+01 -2.015e+00 -6.321e-03

Residual standard error: 5.789 on 502 degrees of freedom


Multiple R-squared: 0.625, Adjusted R-squared: 0.621

Explanation of Output:

• Call: Shows the final model formula selected by stepwise selection. In this example, the model includes ‘rm‘
(average number of rooms) and ‘tax‘ (property tax rate) as predictors.
• Coefficients: Provides the estimated coefficients for the intercept, ‘rm‘, and ‘tax‘. For instance, an intercept
of 32.34, a coefficient of -2.015 for ‘rm‘, and -0.00632 for ‘tax‘ indicate how these variables influence the median
value of homes (‘medv‘).
• Residual Standard Error: Measures the average distance between the observed values and the predicted
values. A lower value suggests a better fit.
• Multiple R-squared: Indicates the proportion of the variance in the response variable that is predictable
from the predictors. A value of 0.625 means that 62.5% of the variance in ‘medv‘ is explained by the model.

• Adjusted R-squared: Adjusts the R-squared value for the number of predictors in the model. It provides a
more accurate measure of model performance, especially when multiple predictors are used.
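The fit statistics described above can be reproduced from the residuals. The sketch below assumes the two-predictor model shown in the example output and uses the standard formulas for R-squared, adjusted R-squared, and the residual standard error; variable names are illustrative.

```r
# Boston housing data from the MASS package
data("Boston", package = "MASS")
fit <- lm(medv ~ rm + tax, data = Boston)

y   <- Boston$medv
n   <- nrow(Boston)            # number of observations
p   <- length(coef(fit)) - 1   # number of predictors (excluding the intercept)
rss <- sum(resid(fit)^2)       # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares

r2     <- 1 - rss / tss                          # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)   # adjusted R-squared
rse    <- sqrt(rss / (n - p - 1))                # residual standard error

# Compare with the values reported by summary()
s <- summary(fit)
c(r2 = r2, adj_r2 = adj_r2, rse = rse)
c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, rse = s$sigma)
```

The adjustment factor (n - 1) / (n - p - 1) grows with the number of predictors, which is why adjusted R-squared can fall when an uninformative predictor is added.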

Significance: The results of stepwise selection provide a refined model with predictors that are deemed most
significant according to the AIC criterion. This model is often preferred for its balance between fit and complexity,
helping to avoid overfitting while maintaining interpretability.
Stepwise selection, however, should be used cautiously as it relies on statistical criteria and may not always produce
the best model for all applications. It is beneficial to complement this method with other techniques and domain
knowledge.

11.4 Cross-Validation
Cross-validation is a robust technique used to assess the performance and generalizability of a predictive model. It
helps ensure that the model is not overfitting to the training data and provides an estimate of how well the model
will perform on unseen data. One common method of cross-validation is k-fold cross-validation.

K-Fold Cross-Validation: In k-fold cross-validation, the dataset is divided into k equally sized subsets, or "folds".
The model is trained on k−1 of these folds and validated on the remaining fold. This process is repeated k times, with
each fold serving as the validation set exactly once. The overall performance is then averaged over all k iterations to
provide a more reliable estimate of the model’s generalizability.
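The procedure just described can be written out by hand in a few lines of base R, which makes each step visible. This is a minimal sketch: the seed, the model formula, and the use of mean squared error as the performance measure are all assumptions made for illustration.

```r
# Boston housing data from the MASS package
data("Boston", package = "MASS")

set.seed(42)   # arbitrary seed, only for reproducible fold assignment
k <- 10
n <- nrow(Boston)

# Randomly assign each row to one of k folds of (nearly) equal size
folds <- sample(rep(1:k, length.out = n))

# For each fold: train on the other k - 1 folds, validate on the held-out fold
fold_mse <- sapply(1:k, function(i) {
  train <- Boston[folds != i, ]
  test  <- Boston[folds == i, ]
  fit   <- lm(medv ~ rm + crim + tax, data = train)
  mean((test$medv - predict(fit, newdata = test))^2)
})

# Average the k validation errors to get the cross-validated estimate
cv_mse <- mean(fold_mse)
cv_mse
```

Inspecting `fold_mse` as well as its mean is worthwhile: a large spread across folds suggests that the estimate depends heavily on which observations are held out.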
Implementation in R: In R, the ‘cv.glm‘ function from the ‘boot‘ package is used to perform k-fold cross-validation
for generalized linear models. Below is the R code for performing cross-validation:
1 # Load necessary library
2 library ( boot )
3

4 # Define cross - validation function


5 cv <- cv . glm ( Boston , stepwise _ model , K =10)
6 cv $ delta
Listing 33: Cross-Validation Example

Explanation of the Code:

• Loading the Library: The ‘boot‘ library is required to use the ‘cv.glm‘ function.

• Performing Cross-Validation: The ‘cv.glm‘ function performs cross-validation on the specified
model (‘stepwise_model‘). The ‘K=10‘ argument specifies 10-fold cross-validation, meaning the dataset is split
into 10 subsets.
• Output: The result of ‘cv.glm‘ is stored in ‘cv‘, and ‘cv$delta‘ provides the cross-validation errors.

Expected Output: The ‘delta‘ component returned by ‘cv.glm‘ contains two cross-validation error estimates. For example:

[1] 23.47 25.32

Interpretation of the Output:

• Raw Cross-Validation Estimate: The first value, ‘23.47‘, is the raw cross-validation estimate of prediction
error. With the default cost function this is the mean squared prediction error, averaged over the held-out
folds.
• Adjusted Cross-Validation Estimate: The second value, ‘25.32‘, is the adjusted cross-validation estimate,
not a standard error. The adjustment compensates for the bias introduced by using k-fold rather than
leave-one-out cross-validation.

Significance of Cross-Validation: Cross-validation is crucial for evaluating model performance as it provides
insights into how the model is likely to perform on new, unseen data. It helps in assessing the stability and reliability
of the model’s predictions. Because every observation is used for validation exactly once and the error is averaged
across folds, the estimate depends less on any single train/test split, and an overfitted model is exposed by a high
validation error. This technique is particularly useful for comparing different models and selecting the one that best
balances bias and variance.

11.4.1 Analysis
• Model Summary: The summary provides the regression coefficients for ‘rm‘, ‘crim‘, and ‘tax‘, along with
R-squared and adjusted R-squared values.
• Residuals vs Fitted Plot: Should display residuals scattered randomly around zero, indicating that the
model’s assumptions are satisfied.
• Q-Q Plot: Should show residuals approximately following a 45-degree line if they are normally distributed.
• Scale-Location Plot: Should display a horizontal trend, suggesting that residuals are homoscedastic.
• Leverage vs Standardized Residuals Plot: Points should fall within acceptable bounds, indicating no
extreme leverage or influential outliers.
• VIF Results: Provide insight into multicollinearity. VIF values above 10 suggest high multicollinearity.
• Stepwise Selection: Reveals the most significant predictors in the model.
• Cross-Validation: Provides an estimate of how well the model performs on unseen data, with lower cross-
validation error indicating better model performance.

12 Introduction to Machine Learning with R


Machine Learning (ML) is a subset of artificial intelligence (AI) that enables systems to learn and improve from
experience without being explicitly programmed. It is a data-driven approach that involves training models on data
to identify patterns, make decisions, and generate predictions. Machine Learning can be categorized into three main
types: supervised learning, unsupervised learning, and reinforcement learning.
• Supervised Learning: Involves training a model on labeled data, where the outcome is already known, and
the model learns to predict the output based on the input features. This category includes popular algorithms
such as:

– Linear regression
– Decision trees
– Neural networks
• Unsupervised Learning: Works with data that has no labeled outcomes. The goal is to uncover hidden
patterns or structures in the data. Common algorithms include:
– Clustering techniques (e.g., K-means, hierarchical clustering)
– Dimensionality reduction methods (e.g., Principal Component Analysis, PCA)
• Reinforcement Learning: Focuses on training an agent to make decisions in a dynamic environment by:

– Rewarding desirable actions


– Punishing undesirable actions


R is one of the most widely used programming languages for machine learning due to its rich ecosystem of libraries,
ease of use, and strong community support. In this section, we will explore unsupervised learning by implementing
the K-means clustering algorithm. Clustering is a technique that groups data points into clusters based on similarity.
K-means is a widely used clustering algorithm that partitions a dataset into a predefined number of clusters, k.
The algorithm minimizes the sum of the squared distances between each data point and the centroid of its assigned
cluster.
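
The objective described above can be stated as a formula. In the sketch below, the notation is an assumption introduced for illustration: $C_1, \dots, C_k$ denote the clusters, $x_i$ a data point, and $\mu_j$ the centroid (mean) of cluster $C_j$.

```latex
\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The double sum is the total within-cluster sum of squares, the same quantity that R reports as ‘tot.withinss‘ in the result of ‘kmeans‘.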

12.1 Design an R Program to Implement K-means Clustering


In this section, we implement a K-means clustering algorithm in R. The steps include loading the data, standardizing
the features, applying the K-means algorithm, and visualizing the results. We will use the iris dataset, which ships
with R and contains 150 observations of iris flowers with four features: sepal length, sepal width, petal length,
and petal width.
# Load necessary libraries
library(ggplot2)

# Load the iris dataset
data(iris)

# Normalize the data
iris_norm <- scale(iris[, 1:4])

# Perform K-means clustering with 3 clusters
set.seed(123)  # Set seed for reproducibility
kmeans_result <- kmeans(iris_norm, centers = 3)

# Add cluster information to the dataset
iris$Cluster <- as.factor(kmeans_result$cluster)

# Visualize the clusters using Sepal Length and Sepal Width
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Cluster)) +
  geom_point(size = 3) +
  labs(title = "K-means Clustering of Iris Data",
       x = "Sepal Length",
       y = "Sepal Width") +
  theme_minimal()
Listing 34: K-means Clustering on the Iris Dataset

Expected Output: The expected output from the code will show a scatter plot of the iris data points, colored
by their assigned cluster.

[Figure: scatter plot titled "K-means Clustering of Iris Data"; x-axis Sepal Length, y-axis Sepal Width, points colored by Cluster (1, 2, 3)]
Figure 21: K-means Clustering of Iris Data Based on Sepal Length and Sepal Width

The K-means clustering algorithm partitions the iris dataset into three clusters. Each data point is assigned to the
cluster whose centroid is nearest. The clustering itself uses all four standardized features; the plot shows only
Sepal Length and Sepal Width, so some overlap between clusters in this two-dimensional view is expected. In this
case, the clusters roughly correspond to the three species of iris in the dataset: setosa, versicolor, and virginica.
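
How closely the clusters match the species can be checked with a cross-tabulation. The following is a small sketch that repeats the clustering step so it runs on its own; the names ‘km‘ and ‘confusion‘ are illustrative and not part of the listing above.

```r
# Compare K-means cluster labels with the known species (illustrative sketch)
iris_norm <- scale(iris[, 1:4])        # standardize the four numeric features
set.seed(123)                          # reproducible starting centroids
km <- kmeans(iris_norm, centers = 3)

# Rows: true species; columns: assigned cluster
confusion <- table(Species = iris$Species, Cluster = km$cluster)
confusion
```

A species that falls almost entirely into one column is well separated. Cluster numbers are arbitrary labels, so only the pattern of counts matters.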

12.2 Optimal Number of Clusters and Practical Applications


Determining the optimal number of clusters is critical in clustering analysis. One common method to find the optimal
number of clusters is the Elbow Method, where we plot the total within-cluster sum of squares (WSS) against the
number of clusters, k. The "elbow point," where the decrease in WSS starts to slow down, indicates the optimal
number of clusters.
# Calculate within-cluster sum of squares for different k
set.seed(123)  # kmeans() uses random starting centroids
wss <- sapply(1:10, function(k) {
  kmeans(iris_norm, centers = k)$tot.withinss
})

# Plot the Elbow curve
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters",
     ylab = "Total Within Sum of Squares",
     main = "Elbow Method for Optimal Number of Clusters")
Listing 35: Elbow Method for Determining Optimal Clusters

Expected Output: The Elbow Method plot will show the relationship between the number of clusters and the
WSS.

[Figure: line plot titled "Elbow Method for Optimal Number of Clusters"; x-axis Number of Clusters (1 to 10), y-axis Total Within Sum of Squares]
Figure 22: Elbow Method for Optimal Number of Clusters

Analysis: From the Elbow plot (Figure 22), we observe that the WSS sharply decreases until k = 3, after which
the decrease slows. This suggests that k = 3 is the optimal number of clusters for the iris dataset, aligning with the
known species.
Practical Applications of Clustering: Clustering techniques like K-means are widely used in various fields:

• Customer Segmentation: Businesses use clustering to segment customers based on purchasing behavior,
enabling targeted marketing strategies.

• Image Compression: Clustering can reduce the number of colors in an image, thereby compressing it without
significant loss of quality.
• Document Classification: Clustering is used to organize large datasets of documents, such as categorizing
news articles or research papers based on topics.

• Anomaly Detection: In cybersecurity, clustering is employed to detect unusual patterns in network traffic
that could signal security breaches.

In data analysis, clustering helps identify inherent structures within the data, enabling more informed decision-
making. By grouping similar data points, clustering can simplify complex datasets and provide valuable insights
that guide business strategies, scientific research, and technological advancements.

13 Time Series Analysis


Time Series Analysis (TSA) refers to techniques used to analyze time-ordered data points collected at regular
intervals. The primary goal of TSA is to understand the underlying structure of the data and make forecasts about
future values. Time series data can arise in various fields such as finance, weather forecasting, stock markets, and
economics, where observations are made over time. Time series models focus on capturing temporal dependencies
between data points and identifying trends, seasonality, and cyclical patterns.

R provides a robust environment for time series analysis, offering functions for importing data, performing exploratory
analysis, and fitting complex models. In TSA, the key is understanding the components of the series, namely trend,
seasonality, cyclic variations, and irregular components.

The most widely used models for forecasting time series data are ARIMA models (AutoRegressive Integrated Moving
Average), which combine three different elements:

• AutoRegressive (AR): Predicts future values based on past values.


• Integrated (I): Involves differencing the data to make it stationary.
• Moving Average (MA): Models the error term as a linear combination of previous error terms.
ARIMA models are particularly useful for data that exhibits trends and is non-stationary, meaning its statistical
properties (such as mean and variance) change over time.
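
The effect of the Integrated component can be seen directly with base R’s ‘diff‘ function. The sketch below assumes the built-in ‘AirPassengers‘ series; the variable names ‘y‘, ‘d1‘, and ‘d1_12‘ are illustrative.

```r
# Differencing a series step by step (illustrative sketch)
y     <- log(AirPassengers)     # log transform stabilizes the growing variance
d1    <- diff(y)                # first difference removes the trend
d1_12 <- diff(d1, lag = 12)     # seasonal difference removes the yearly pattern

# Each differencing step shortens the series by its lag
c(original = length(y), first_diff = length(d1), seasonal_diff = length(d1_12))
```

The lengths are 144, 143, and 131: one observation is lost to the first difference and twelve more to the seasonal difference.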

13.1 Design an R Program to Analyze and Forecast Time Series Data using ARIMA
Models
To illustrate TSA with R, we will use ‘AirPassengers‘, a dataset that ships with R and contains monthly totals of
international airline passengers from 1949 to 1960.
# Load necessary libraries
library(forecast)
library(tseries)

# Load AirPassengers dataset
data("AirPassengers")
ts_data <- AirPassengers

# Plot the time series data
plot(ts_data, main = "AirPassengers Data", ylab = "Passengers", xlab = "Year")

# Difference the log-transformed series to inspect stationarity
diff_ts_data <- diff(log(ts_data))
plot(diff_ts_data, main = "Differenced Log of AirPassengers")

# Fit ARIMA model
fit <- auto.arima(log(ts_data))
summary(fit)

# Forecast future values
forecast_data <- forecast(fit, h = 12)
plot(forecast_data, main = "ARIMA Forecast for AirPassengers")
Listing 36: Time Series Analysis and ARIMA Modeling

[Figure: line plot titled "AirPassengers Data"; x-axis Year (1950 to 1960), y-axis Passengers (100 to 600)]
Figure 23: Time Series Plot of AirPassengers Data

[Figure: line plot titled "Differenced Log of AirPassengers"; x-axis Time (1950 to 1960), y-axis diff_ts_data (-0.2 to 0.2)]
Figure 24: Differencing to check stationarity of AirPassengers Data

13.2 Interpreting the Fitted ARIMA Model Output


The ARIMA model output from the code fit <- auto.arima(log(ts_data)) provides the following summary:

Series: log(ts_data)
ARIMA(0,1,1)(0,1,1)[12]

Coefficients:
          ma1     sma1
      -0.4018  -0.5569
s.e.   0.0896   0.0731

sigma^2 = 0.001371:  log likelihood = 244.7
AIC=-483.4   AICc=-483.21   BIC=-474.77

Training set error measures:
                       ME       RMSE        MAE        MPE      MAPE      MASE       ACF1
Training set 0.0005730622 0.03504883 0.02626034 0.01098898 0.4752815 0.2169522 0.01443892

Let’s break down the elements of the output:

Model Type: ARIMA(0,1,1)(0,1,1)[12]


• This ARIMA model has no autoregressive (AR) terms (denoted by 0), one order of differencing (1), and one
moving average (MA) term (1).

• The (0,1,1)[12] part refers to the seasonal components of the ARIMA model, where there is no seasonal AR
term (0), one order of seasonal differencing (1), and one seasonal MA term (1). The [12] indicates a periodicity
of 12, suggesting that the data exhibits yearly seasonality, which is common for monthly data.

Coefficients:
• ma1 and sma1 represent the moving average (MA) and seasonal moving average (SMA) coefficients, respectively.

• ma1 = -0.4018: This coefficient indicates the short-term relationship between past error terms and the current
observation.
• sma1 = -0.5569: This represents the seasonal component, where past errors from a year ago are factored into
the forecast.
• s.e.: These are the standard errors of the coefficients, helping to understand the uncertainty of the estimates.
Both coefficients have relatively small standard errors, suggesting the estimates are reasonably precise.
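
Written with the backshift operator $B$ (where $B y_t = y_{t-1}$), the fitted model can be stated compactly. The following is a sketch that assumes R’s sign convention for moving-average terms, with $y_t$ denoting the log passenger count and $\varepsilon_t$ the white-noise error; this notation is introduced here for illustration.

```latex
(1 - B)(1 - B^{12})\, y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})\, \varepsilon_t,
\qquad \theta_1 = -0.4018, \quad \Theta_1 = -0.5569
```

The left-hand side applies one regular and one seasonal difference. Expanding the right-hand side shows that the current error is combined with the errors from 1, 12, and 13 months earlier.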

Model Statistics:

• sigma^2 = 0.001371: This is the estimated variance of the residuals (on the log scale). A smaller value of
σ² means the fitted values lie closer to the observed data.
• log likelihood = 244.7: The log-likelihood measures the goodness-of-fit of the model. Higher values indicate
a better fit when comparing models fitted to the same data.

• AIC = -483.4, AICc = -483.21, BIC = -474.77:
– AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are used to compare
models; lower values are preferred.
– AICc is the corrected AIC for small sample sizes.
– These criteria have no absolute scale: they are informative only relative to other candidate models fitted
to the same series. Here they belong to the model that ‘auto.arima‘ selected among the candidates it searched.
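
The three criteria can be reproduced from the log-likelihood by hand. The sketch below rests on two assumptions inferred from the output rather than stated in it: k = 3 estimated parameters (ma1, sma1, and the innovation variance) and an effective sample size of n = 131 (144 observations minus the 13 lost to regular and seasonal differencing).

```r
# Recompute AIC, AICc, and BIC from the reported log-likelihood (sketch)
loglik <- 244.7
k <- 3                # assumed: ma1, sma1, sigma^2
n <- 144 - 1 - 12     # assumed effective sample size after differencing

aic  <- 2 * k - 2 * loglik
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)   # small-sample correction
bic  <- k * log(n) - 2 * loglik

round(c(AIC = aic, AICc = aicc, BIC = bic), 2)
# AIC -483.40, AICc -483.21, BIC -474.77
```

The results agree with the printed summary, which supports the assumed parameter count and sample size.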

Training Set Error Measures: These measures are computed from the in-sample (training) residuals of the model
fitted to log(ts_data). They are therefore expressed on the log scale and describe fit rather than out-of-sample
forecast accuracy.

• ME (Mean Error): 0.0005730622 - This is the average residual. A value close to 0 indicates the model
has no systematic bias.
• RMSE (Root Mean Square Error): 0.03504883 - The square root of the mean squared residual, with lower
values indicating a closer fit.
• MAE (Mean Absolute Error): 0.02626034 - This gives the average absolute difference between the
predicted and actual values. Lower values suggest better predictions.
• MPE (Mean Percentage Error): 0.01098898 - This is the mean of the percentage errors in the model’s
predictions (in percent), showing the average bias.
• MAPE (Mean Absolute Percentage Error): 0.4752815 - The average absolute percentage error, in
percent. Being scale-free, it is easier to compare across series than RMSE or MAE. Here it is about 0.48% of
the log-scale values, so it should not be read as a percentage error in passenger counts.
• MASE (Mean Absolute Scaled Error): 0.2169522 - The MAE scaled by that of a naive benchmark forecast;
a value below 1 indicates better performance than the benchmark.
• ACF1 (Autocorrelation of residuals at lag 1): 0.01443892 - This measures the correlation of residuals
with their values one step earlier. A value near 0 suggests that the residuals are uncorrelated at lag 1, which
is a good sign.
This summary highlights that the ARIMA model effectively fits the data, with reasonably low error metrics and
coefficients that explain the time series dynamics well. The model captures both the non-seasonal and seasonal
elements of the series, making it a strong candidate for forecasting.

13.3 Forecasting Future Values Using the ARIMA Model


Once we have successfully fitted an ARIMA model to the time series data, the next step is to forecast future values
based on the model. The following code demonstrates how to use the forecast function in R to predict future values
and visualize the forecast.
# Forecast future values
forecast_data <- forecast(fit, h = 12)
print(forecast_data)

# Plot the forecasted values
plot(forecast_data, main = "ARIMA Forecast for AirPassengers")
Listing 37: Forecasting Future Values with ARIMA

Explanation of Code:

• forecast(fit, h=12): This function generates forecasts based on the fitted ARIMA model. The argument
h=12 specifies that we want to forecast the next 12 periods (in this case, months). The number of periods can
be adjusted depending on how far into the future we want to forecast.
• print(forecast_data): This command prints the forecasted values, which include the point forecast
for each period, along with the 80% and 95% prediction intervals (referred to below as confidence intervals).
• plot(forecast_data): This generates a plot that visualizes the forecast. The plot shows the original
time series, the forecasted values, and the associated intervals, which are shaded in the graph. The
main argument provides the title for the plot.
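
Because the model was fitted to the logarithm of the series, its forecasts are on the log scale and need to be back-transformed with ‘exp‘ to obtain passenger counts. The following is a minimal sketch using base R’s ‘arima‘ and ‘predict‘ so that it runs without the ‘forecast‘ package; it assumes the orders (0,1,1)(0,1,1)[12] reported in Section 13.2 and uses a normal approximation (1.96 standard errors) for the 95% interval. The names ‘fit_base‘, ‘point‘, ‘lower‘, and ‘upper‘ are illustrative.

```r
# Forecast on the log scale, then convert back to passenger counts (sketch)
fit_base <- arima(log(AirPassengers), order = c(0, 1, 1),
                  seasonal = list(order = c(0, 1, 1), period = 12))

pred <- predict(fit_base, n.ahead = 12)    # $pred and $se are on the log scale

# Approximate 95% interval on the log scale, back-transformed with exp()
point <- exp(pred$pred)
lower <- exp(pred$pred - 1.96 * pred$se)
upper <- exp(pred$pred + 1.96 * pred$se)

round(cbind(point, lower, upper), 1)
```

After back-transformation the intervals are no longer symmetric around the point forecast, and they widen as the horizon increases.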

13.3.1 Expected Output


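The expected output of the forecast function is a set of predicted future values along with confidence intervals.
Because the model was fitted to log(ts_data), the values printed by the code above are on the log scale (compare
the vertical axis of Figure 25) and must be passed through exp() to obtain passenger counts. The table below only
illustrates the layout of the printed output, shown in passenger units: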

Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1961 455.9929 446.8214 465.1645 441.8634 470.1223
Feb 1961 421.0283 411.8568 430.1999 406.8988 435.0579
Mar 1961 462.3956 453.2241 471.5672 448.2661 476.5252
...
Dec 1961 475.0584 465.8869 484.2300 460.9289 489.1880

This table provides the forecasted values (Point Forecast) for each month, along with the lower and upper
bounds of the 80% and 95% confidence intervals.

Plot Interpretation: The generated plot (refer to Figure 25) will display the following elements:
• Historical Data: The actual data points of the original time series are plotted on the left side of the graph.
• Forecasted Values: The forecasted points for the next 12 months are plotted on the right side.

• Confidence Intervals: The shaded areas around the forecast represent the 80% and 95% confidence intervals.
The wider the intervals, the more uncertainty is associated with the forecast.

[Figure: plot titled "ARIMA Forecast for AirPassengers"; x-axis 1950 to 1962, y-axis on the log scale (5.0 to 6.5)]
Figure 25: ARIMA Model Forecast for AirPassengers Data

The plot provides valuable insights into the expected future behavior of the time series, and the confidence intervals
indicate the range within which the actual future values are likely to fall. In this case, we are forecasting the
AirPassengers dataset for the next 12 months.

13.4 Use a Time Series Dataset, Perform Exploratory Data Analysis, Fit an ARIMA
Model, and Make Future Forecasts
We will continue using the ‘AirPassengers‘ dataset. First, we perform exploratory data analysis (EDA) to understand
the structure of the data. EDA for time series typically involves visualizing the data, checking for trends, seasonality,
and stationarity.
# Plot original data
plot(ts_data, main = "Original AirPassengers Data", ylab = "Passengers")

# Decompose the data to extract components
decomposed <- decompose(ts_data)
plot(decomposed)

# Perform stationarity test using ADF test
adf_test <- adf.test(log(ts_data), alternative = "stationary")
print(adf_test)
Listing 38: Exploratory Data Analysis on Time Series Data

[Figure: line plot titled "AirPassengers Data"; x-axis Year (1950 to 1960), y-axis Passengers (100 to 600)]
Figure 26: Time Series Plot of AirPassengers Data

[Figure: four-panel plot titled "Decomposition of additive time series"; panels observed, trend, seasonal, random; x-axis Time (1950 to 1960)]
Figure 27: Decomposed Time Series Plot of AirPassengers Data

13.5 Augmented Dickey-Fuller (ADF) Test for Stationarity


The Augmented Dickey-Fuller (ADF) test is used to check the stationarity of a time series. Stationarity is a critical
assumption for many time series models; ARIMA meets it by differencing the series before the AR and MA terms are
fitted. A stationary time series has constant mean and variance over time, making it easier to model and forecast.

13.5.1 ADF Test Code


The following R code performs the ADF test on the logarithm of the time series data:
# Perform ADF test
adf_test <- adf.test(log(ts_data), alternative = "stationary")
print(adf_test)
Listing 39: Augmented Dickey-Fuller Test for Stationarity

The test is applied to the log-transformed series (log(ts_data)) with the alternative hypothesis being that the
time series is stationary. The adf.test function from the tseries package is used.

13.5.2 Test Output


The ADF test generates the following output:

Augmented Dickey-Fuller Test

data: log(ts_data)
Dickey-Fuller = -6.4215, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

13.5.3 Interpretation of Results


• Dickey-Fuller Statistic: The test statistic value is -6.4215. A more negative value of the Dickey-Fuller
statistic indicates stronger evidence against the null hypothesis (which assumes that the time series has a unit
root and is therefore non-stationary).

• Lag Order: The test uses a lag order of 5. This means that 5 lagged differences of the time series were
included in the test regression to absorb autocorrelation in the residuals.
• p-value: The reported p-value is 0.01. The adf.test function does not print p-values below 0.01, so the
actual p-value is at most 0.01. This is below the commonly used 0.05 significance level, so we reject the null
hypothesis.

• Alternative Hypothesis: The alternative hypothesis is that the series is stationary. Because the test
regression in adf.test includes a constant and a linear trend, rejecting the null indicates that the
log-transformed series is stationary around a deterministic trend rather than around a constant mean.

[Figure: line plot titled "Log-Transformed Time Series with ADF Test Result", annotated "ADF p-value: 0.01"; x-axis Time, y-axis Log of Values (5.0 to 6.5)]
Figure 28: ADF Test Results for Log-Transformed Time Series

The ADF test results (Figure 28) show that the series can be considered stationary, thus fulfilling an important
precondition for ARIMA modeling.
Expected Output:
• A plot of the original data showing the overall trend and seasonal pattern.

• A decomposition plot showing the trend, seasonality, and residual components.


• Results from the Augmented Dickey-Fuller (ADF) test, providing a p-value to check for stationarity.

Conclusion: Based on the results of the Augmented Dickey-Fuller test, with a test statistic of -6.4215
and a p-value of 0.01, we reject the null hypothesis of non-stationarity. The log-transformed time series can
therefore be treated as stationary around its trend, and we can proceed with fitting time series models such as
ARIMA, whose differencing step removes the remaining trend and seasonality. Ensuring stationarity
is crucial for achieving reliable forecasts, as non-stationary data can lead to inaccurate model estimates.
Interpreting the ARIMA Model and Forecast Results: After fitting the ARIMA model, the forecast results
give predicted values along with confidence intervals. These forecasts help understand how the number of passengers
is expected to grow in the future. By examining the residuals, we ensure that the ARIMA model captures the
significant patterns of the data while leaving only white noise in the residuals.

[Figure: four-panel plot titled "Decomposition of additive time series"; panels observed, trend, seasonal, random; x-axis Time (1950 to 1960)]
Figure 29: Decomposition of AirPassengers Data

13.6 Include Steps to Check for Stationarity, Select Model Parameters, and Evaluate
the Model’s Forecasting Accuracy
Stationarity Check: One of the key steps in time series analysis is ensuring that the data is stationary. This means
that the mean, variance, and covariance of the series are constant over time. If the data is non-stationary, it needs
to be transformed. In the ARIMA model, differencing is used to make the data stationary.
We check for stationarity using visual inspection of the time series and statistical tests like the Augmented Dickey-
Fuller (ADF) test. If the p-value of the test is below a certain threshold (typically 0.05), the series is considered
stationary.
# Perform ADF test
adf_test <- adf.test(log(ts_data), alternative = "stationary")
print(adf_test)
Listing 40: Stationarity Check using ADF Test

Model Parameter Selection: The ‘auto.arima‘ function in R automatically selects the combination of
AR, I, and MA orders that minimizes an information criterion (by default the corrected Akaike Information
Criterion, AICc). However, in some cases, manual tuning of the parameters may be necessary.
# Fit ARIMA model manually
manual_fit <- arima(log(ts_data), order = c(2, 1, 2))
summary(manual_fit)
Listing 41: Manual ARIMA Model Selection

Model Evaluation: To evaluate the accuracy of the ARIMA model, we can use several methods such as residual
diagnostics and forecasting accuracy measures like Mean Absolute Error (MAE) or Root Mean Square Error (RMSE).
# Plot residuals
checkresiduals(fit)

# Calculate accuracy
accuracy(forecast_data)
Listing 42: Forecast Accuracy and Residual Diagnostics

Expected Output:

• A plot of the residuals should show no significant patterns, indicating that the ARIMA model has adequately
captured the underlying structure.
• Forecast accuracy results, which include metrics such as MAE, RMSE, and Mean Absolute Percentage Error
(MAPE).
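
The accuracy measures above are computed on the training data. A complementary check is to hold out the last year, forecast it, and compare against the actual values. The following is a minimal base R sketch under stated assumptions: it reuses the (0,1,1)(0,1,1)[12] structure from Section 13.2 and a 1949 to 1959 training window, and the names ‘train‘, ‘test‘, and ‘fit_train‘ are illustrative.

```r
# Hold-out evaluation: train on 1949-1959, test on 1960 (sketch)
log_ap <- log(AirPassengers)
train  <- window(log_ap, end = c(1959, 12))
test   <- window(log_ap, start = c(1960, 1))

# Same model structure as in Section 13.2 (an assumption of this sketch)
fit_train <- arima(train, order = c(0, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the 12 held-out months and back-transform to passenger counts
pred_log <- predict(fit_train, n.ahead = length(test))$pred
actual   <- as.numeric(exp(test))
pred     <- as.numeric(exp(pred_log))

err  <- actual - pred
rmse <- sqrt(mean(err^2))
mae  <- mean(abs(err))
mape <- 100 * mean(abs(err) / actual)
round(c(RMSE = rmse, MAE = mae, MAPE = mape), 2)
```

Here RMSE and MAE are in passengers and MAPE is in percent, so these figures are directly comparable with the scale of the original data.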
Detailed Interpretation of Time Series Components and Forecast Results:
The decomposition of time series data reveals important components:
• Trend: The long-term movement in the data, which shows the increasing or decreasing pattern over time.
• Seasonality: The repeating short-term cycles observed in the data at regular intervals.
• Residuals: The noise or irregular component left after accounting for the trend and seasonality.

By analyzing these components, we can better understand the data’s behavior. The ARIMA model’s future
forecasts provide valuable insights into the expected growth or decline of the series, which can be used for decision-
making in areas like demand forecasting, resource allocation, and financial planning.
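
The decomposition shown earlier uses the additive default of ‘decompose‘. When seasonal swings grow with the level of the series, as they visibly do for AirPassengers, a multiplicative decomposition is a common alternative. The sketch below is illustrative only; the names ‘dec_mult‘ and ‘recon‘ are hypothetical.

```r
# Multiplicative decomposition of AirPassengers (illustrative sketch)
dec_mult <- decompose(AirPassengers, type = "multiplicative")

# In the multiplicative form the components multiply back to the data:
# observed = trend * seasonal * random
recon <- dec_mult$trend * dec_mult$seasonal * dec_mult$random

# The trend (a centered moving average) is NA at both ends of the series
ok <- !is.na(recon)
all.equal(as.numeric(recon[ok]), as.numeric(AirPassengers[ok]))

# The 12 seasonal factors, one per month (values above 1 mark busy months)
round(dec_mult$figure, 2)
```

Calling ‘plot(dec_mult)‘ produces the same four-panel layout as the additive version, with seasonal and random components expressed as ratios instead of differences.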

14 Creating Interactive Visualizations


Interactive visualizations are powerful tools for data analysis and presentation. They allow users to explore data
dynamically, providing a more engaging experience compared to static plots. In R, the ‘plotly‘ package is widely used
to create interactive plots. This section will guide you through designing R programs to create various interactive
visualizations using ‘plotly‘.

14.1 Designing Interactive Plots with Plotly


To create interactive visualizations using ‘plotly‘, you first need to install and load the package. The ‘plotly‘ package
can be installed from CRAN and is compatible with several R plotting systems. We will use the ‘plotly‘ package to
create interactive scatter plots, line charts, and bar charts.
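
One concrete form of this compatibility is the ‘ggplotly‘ function, which converts an existing ggplot2 graphic into an interactive plotly object. A minimal sketch, assuming both ‘ggplot2‘ and ‘plotly‘ are installed; the object names ‘p_static‘ and ‘p_interactive‘ are illustrative.

```r
# Convert a static ggplot2 chart into an interactive plotly chart (sketch)
library(ggplot2)
library(plotly)

# Build an ordinary static scatter plot
p_static <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Wrap it with ggplotly() to add hover, zoom, and legend interaction
p_interactive <- ggplotly(p_static)
p_interactive
```

This route is convenient when a ggplot2 chart already exists; building the plot directly with ‘plot_ly‘, as in the following subsections, gives finer control over the interactive features.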

14.1.1 Installing and Loading Plotly


Before creating interactive plots, ensure that the ‘plotly‘ package is installed and loaded into your R environment.
You can install it using the following command:
install.packages("plotly")
Listing 43: Installing Plotly

Load the ‘plotly‘ package with:


library(plotly)
Listing 44: Loading Plotly


14.1.2 Creating Interactive Scatter Plots


Scatter plots are useful for visualizing the relationship between two continuous variables. We will use the ‘iris‘
dataset, which is a classic dataset in R containing measurements of iris flowers.
Here is an R code example to create an interactive scatter plot:
# Load necessary library
library(plotly)

# Load dataset
data(iris)

# Create interactive scatter plot
scatter_plot <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
                        color = ~Species, type = 'scatter', mode = 'markers',
                        marker = list(size = 10)) %>%
  layout(title = "Interactive Scatter Plot of Iris Dataset",
         xaxis = list(title = "Sepal Length"),
         yaxis = list(title = "Sepal Width"))

# Show plot
scatter_plot
Listing 45: Interactive Scatter Plot

Expected Output: The interactive scatter plot displays Sepal Length on the x-axis and Sepal Width on the
y-axis. Different species are represented by different colors, and users can hover over data points to see additional
information.

14.1.3 Creating Interactive Line Charts


Line charts are ideal for visualizing trends over time. We will use the ‘AirPassengers‘ dataset, which contains monthly
totals of international airline passengers.
Here is an R code example to create an interactive line chart:
# Load necessary library
library(plotly)

# Load dataset
data("AirPassengers")

# Convert time series to data frame
ap_df <- data.frame(
  Month = as.numeric(time(AirPassengers)),  # time index as decimal years
  Passengers = as.numeric(AirPassengers)
)

# Create interactive line chart
line_chart <- plot_ly(data = ap_df, x = ~Month, y = ~Passengers,
                      type = 'scatter', mode = 'lines+markers') %>%
  layout(title = "Interactive Line Chart of Air Passengers",
         xaxis = list(title = "Month"),
         yaxis = list(title = "Number of Passengers"))

# Show plot
line_chart
Listing 46: Interactive Line Chart


Expected Output: The interactive line chart displays the number of passengers over time. Users can interact
with the chart by zooming in, hovering over data points for details, and adjusting the time range.

14.1.4 Creating Interactive Bar Charts


Bar charts are useful for comparing quantities across different categories. We will use the ‘mtcars‘ dataset, which
contains various attributes of car models.
Here is an R code example to create an interactive bar chart:
# Load necessary library
library(plotly)

# Load dataset
data(mtcars)

# Create interactive bar chart
bar_chart <- plot_ly(data = mtcars, x = ~rownames(mtcars), y = ~mpg, type = 'bar') %>%
  layout(title = "Interactive Bar Chart of Car MPG",
         xaxis = list(title = "Car Models"),
         yaxis = list(title = "Miles Per Gallon"))

# Show plot
bar_chart
Listing 47: Interactive Bar Chart
Expected Output: The interactive bar chart displays the miles per gallon for different car models. Users can
hover over bars to see exact values and scroll through car models.

14.1.5 Customizing Interactivity


The ‘plotly‘ package provides various customization options to enhance the interactivity of plots. For example, you
can add tooltips, customize color scales, and adjust plot layouts.
Here is an R code example demonstrating some customizations:
# Load necessary library
library(plotly)

# Load dataset
data(iris)

# Create customized interactive scatter plot
custom_scatter_plot <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
                               color = ~Species, type = 'scatter', mode = 'markers',
                               marker = list(size = 12, opacity = 0.8),
                               text = ~paste("Species: ", Species,
                                             "<br>Sepal Length: ", Sepal.Length,
                                             "<br>Sepal Width: ", Sepal.Width)) %>%
  layout(title = "Customized Interactive Scatter Plot",
         xaxis = list(title = "Sepal Length"),
         yaxis = list(title = "Sepal Width")) %>%
  config(displayModeBar = TRUE,
         modeBarButtonsToRemove = list("zoom2d", "pan2d"))

# Show plot
custom_scatter_plot
Listing 48: Customized Interactive Scatter Plot

Expected Output: The customized scatter plot features larger, semi-transparent markers and displays detailed
tooltips. The mode bar stays visible, but its zoom and pan buttons are removed for a cleaner interface.

14.1.6 Conclusion
Interactive visualizations using the ‘plotly‘ package in R offer an engaging way to explore and present data. By
creating interactive scatter plots, line charts, and bar charts, users can dynamically interact with their data, uncover
insights, and make informed decisions. The ability to customize interactivity further enhances the user experience
and effectiveness of these visualizations.

14.2 Figures
Below are examples of the interactive plots created using the ‘plotly‘ package:

Figure 30: Interactive Scatter Plot of Iris Dataset


Figure 31: Interactive Line Chart of Air Passengers

Figure 32: Interactive Bar Chart of Car MPG


14.3 Incorporating Tooltips, Hover Effects, and Interactive Legends


Interactive visualizations provide an enhanced experience through the use of tooltips, hover effects, and interactive
legends. These features allow users to delve deeper into the data without cluttering the main plot with excessive
labels or information. In this section, we will demonstrate how to incorporate these interactive elements using the
‘plotly‘ package.

14.3.1 Tooltips and Hover Effects


Tooltips are one of the key features in interactive plots that allow users to see additional information when hovering
over data points. This can be achieved by customizing the ‘text‘ attribute in ‘plotly‘ plots. In the following example,
we add tooltips to a scatter plot using the ‘iris‘ dataset.
# Load necessary library
library(plotly)

# Load dataset
data(iris)

# Create scatter plot with tooltips
scatter_tooltip <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
                           color = ~Species, type = 'scatter', mode = 'markers',
                           marker = list(size = 12, opacity = 0.8),
                           text = ~paste("Species:", Species,
                                         "<br>Sepal Length:", Sepal.Length,
                                         "<br>Sepal Width:", Sepal.Width)) %>%
  layout(title = "Scatter Plot with Tooltips",
         xaxis = list(title = "Sepal Length"),
         yaxis = list(title = "Sepal Width"))

# Show plot
scatter_tooltip
Listing 49: Scatter Plot with Tooltips and Hover Effects

In this example, when the user hovers over any point on the scatter plot, a tooltip appears showing the species and
the corresponding Sepal Length and Width values. This allows users to explore data points interactively without
adding clutter to the chart.
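
By default, plotly shows the custom `text` together with the x and y coordinates of the point. If only the custom text should appear, `hoverinfo` can be set to `'text'`. The following is a minimal sketch of that variation; the object name `text_only_tooltip` is illustrative.

```r
library(plotly)

# Same tooltip text as before, but hoverinfo = 'text' hides the default x/y readout
text_only_tooltip <- plot_ly(
  data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
  color = ~Species, type = 'scatter', mode = 'markers',
  text = ~paste0("Species: ", Species,
                 "<br>Sepal Length: ", Sepal.Length,
                 "<br>Sepal Width: ", Sepal.Width),
  hoverinfo = 'text'
)

# Show plot
text_only_tooltip
```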


Figure 33: Interactive Scatter Plot with Tooltips

14.3.2 Interactive Legends


Interactive legends allow users to toggle the visibility of data series on and off by clicking on the legend entries. This
feature can be especially useful when dealing with multi-series plots, as it gives users control over which series to
focus on.
Here is an R code example that demonstrates how to create an interactive legend using the ‘plotly‘ package:
# Load necessary library
library(plotly)

# Create scatter plot with interactive legend
scatter_legend <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
                          color = ~Species, type = 'scatter', mode = 'markers',
                          marker = list(size = 10)) %>%
  layout(title = "Scatter Plot with Interactive Legend",
         xaxis = list(title = "Sepal Length"),
         yaxis = list(title = "Sepal Width"),
         legend = list(title = list(text = "Species")))

# Show plot
scatter_legend
Listing 50: Scatter Plot with Interactive Legend

In this plot, users can click on each species in the legend to hide or show the corresponding data points, offering a
flexible way to focus on specific parts of the data.
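
A plot can also open with some series already hidden, so that readers switch them on from the legend when needed. The sketch below is one possible way to do this, using plotly's `style()` function to set `visible = "legendonly"` on selected traces. It assumes the traces follow the factor level order of `Species` (setosa, versicolor, virginica); the object names are illustrative.

```r
library(plotly)

# Base scatter plot: one trace per species
species_plot <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Sepal.Width,
                        color = ~Species, type = 'scatter', mode = 'markers')

# Hide traces 2 and 3 initially; they remain listed in the legend
legend_start_hidden <- style(species_plot, visible = "legendonly",
                             traces = c(2, 3))

# Show plot
legend_start_hidden
```

Clicking a greyed-out legend entry brings the corresponding species back into view.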


Figure 34: Scatter Plot with Interactive Legend

14.3.3 Customizing Hover Effects


In addition to tooltips, hover effects can be customized to enhance the user experience. The ‘hoverinfo‘ attribute
allows you to control which information is displayed when hovering over the plot.
Here’s an example of customizing hover effects in a line chart using the ‘AirPassengers‘ dataset:
# Load necessary library
library(plotly)

# Load dataset
data("AirPassengers")

# Convert time series to data frame
ap_df <- data.frame(
  Month = time(AirPassengers),
  Passengers = as.numeric(AirPassengers)
)

# Create line chart with custom hover effects
line_custom_hover <- plot_ly(data = ap_df, x = ~Month, y = ~Passengers,
                             type = 'scatter', mode = 'lines+markers',
                             hoverinfo = 'x+y') %>%
  layout(title = "Line Chart with Custom Hover Effects",
         xaxis = list(title = "Month"),
         yaxis = list(title = "Passengers"))

# Show plot
line_custom_hover
Listing 51: Line Chart with Custom Hover Effects

In this plot, the hover effect displays only the x (Month) and y (Passengers) values. This is achieved by setting
‘hoverinfo‘ to ‘x+y‘, simplifying the information shown when hovering over data points.
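
Where `hoverinfo` only selects which fields to show, the `hovertemplate` attribute also controls their wording and number format. The following is a hedged sketch of that alternative, not taken from the original listing; the names `ap_year_df` and `line_template_hover` are illustrative, and the time index is converted to plain decimal years.

```r
library(plotly)

# Illustrative data frame with the time index as decimal years
ap_year_df <- data.frame(
  Year = as.numeric(time(AirPassengers)),
  Passengers = as.numeric(AirPassengers)
)

# %{x:.2f} formats x with two decimals; <extra></extra> hides the trace-name box
line_template_hover <- plot_ly(
  data = ap_year_df, x = ~Year, y = ~Passengers,
  type = 'scatter', mode = 'lines+markers',
  hovertemplate = "Time: %{x:.2f}<br>Passengers: %{y}<extra></extra>"
)

# Show plot
line_template_hover
```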

Figure 35: Line Chart with Custom Hover Effects

14.4 Advantages of Using Interactive Visualizations for Data Exploration


Interactive visualizations offer numerous advantages for data exploration, especially when dealing with complex
datasets. Here are some of the key benefits:
• Enhanced Data Understanding:
– Interactive visualizations allow users to explore data in a more intuitive and engaging way.
– Interactions such as zooming or hovering over points to view detailed information help users gain deeper
insights into the underlying patterns and trends.
– For instance, the scatter plot with tooltips (Figure 33) allows users to explore individual data points in
the Iris dataset to understand species-specific variations.
• Real-Time Data Exploration:
– Interactive features like zooming, panning, and filtering allow users to explore large datasets in real-time
without the need to generate new static plots.
– This is particularly useful in time series analysis where zooming in on specific periods (as shown in Figure
35) reveals short-term trends that might be hidden in the overall data.
• Customizable Interactivity:

– Users can customize the interactivity of their visualizations, allowing for better tailoring of plots for
different audiences.
– Interactive legends (Figure 34) allow users to selectively view or hide data series, making it easier to focus
on relevant parts of the data.

• Improved Communication of Insights:


– Interactive visualizations facilitate the communication of complex data insights to non-technical
stakeholders.
– Interacting with plots and extracting additional information on demand helps bridge the gap between
data scientists and decision-makers, leading to more informed decisions.

• Better Handling of Complex and Large Datasets:


– With complex or large datasets, static plots can become overwhelming, but interactive visualizations
allow dynamic filtering, zooming, and toggling to better explore intricate details without losing overall
understanding.

• Ease of Integration in Web Applications:


– Plots created with the ‘plotly‘ package are HTML widgets, so they can be embedded in web pages,
dashboards, HTML reports, and presentations.
• Data Discovery and Pattern Recognition:

– Interactive tools facilitate the discovery of patterns and relationships that might not be immediately
obvious in static plots.
– For example, scatter plots with hover effects can reveal clusters or outliers that would otherwise go
unnoticed.

• Increased Engagement:
– Interactive visualizations increase user engagement by offering a hands-on approach to data exploration.
– Whether used in educational settings or business presentations, direct interaction with data can foster
curiosity and promote deeper learning or understanding.
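
As a small illustration of the web-integration point in the list above, a plotly figure can be written to an HTML file and opened in any browser. This is a minimal sketch under stated assumptions: it uses `saveWidget()` from the htmlwidgets package (installed alongside plotly), the file name is hypothetical, and `selfcontained = FALSE` is chosen so that the supporting JavaScript goes into a side folder instead of being bundled into a single file.

```r
library(plotly)

# A simple scatter plot to export
export_plot <- plot_ly(data = mtcars, x = ~wt, y = ~mpg,
                       type = 'scatter', mode = 'markers')

# Hypothetical output location inside the session's temporary directory
out_file <- file.path(tempdir(), "mtcars_scatter.html")

# Write the widget to HTML; dependencies go into a folder next to the file
htmlwidgets::saveWidget(export_plot, out_file, selfcontained = FALSE)

file.exists(out_file)
```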

15 Data Reporting with RMarkdown


15.1 Introduction
RMarkdown is a dynamic authoring format that allows users to integrate text, code, and visualizations into a single
document. By combining the capabilities of Markdown and R, it offers a powerful framework for creating
reproducible reports, documents, presentations, and even websites. One of the main advantages of using RMarkdown is its
ability to produce documents in multiple formats such as HTML, PDF, and Word, making it versatile for different
audiences and platforms.

In data science, the need for reproducibility is paramount. Reproducibility ensures that
anyone can replicate your analysis, follow the reasoning behind your decisions, and verify the results. RMarkdown
facilitates this by embedding the code within the document itself, so when the document is regenerated, it runs the
same analysis and produces consistent results. This ability to blend analysis and reporting in a single document
makes RMarkdown a preferred tool for data scientists, statisticians, and researchers.

Another advantage of RMarkdown is the ease with which it allows data to be communicated clearly. By combining
code, narrative, and visualizations in one document, you can effectively convey complex insights to both technical
and non-technical audiences. Moreover, the interactive features provided in the HTML format (such as expandable
code blocks and dynamic visualizations) make it highly engaging for users.


To illustrate, consider a case where a data analyst wants to document their findings while ensuring others can follow
and replicate the process. By using RMarkdown, they can document their analysis steps (e.g., data cleaning,
manipulation, statistical testing) alongside the R code and output. This helps make the workflow clear and transparent,
ensuring credibility and fostering collaboration.
Overall, RMarkdown is a valuable tool in the world of data analysis because it simplifies the process of creating
reports, lets a report be regenerated whenever the data or code changes, and ensures transparency and
reproducibility in data analysis workflows.

15.2 Download R Markdown


To get RMarkdown working in RStudio, the first thing you need is the rmarkdown package, which you can get from
CRAN by running the following commands in R or RStudio:
install.packages("rmarkdown")
library(rmarkdown)
Listing 52: Download R Markdown
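
Running `install.packages()` every time a script is executed downloads the package again. A common pattern is to install only when the package is missing. The helper below is a hypothetical convenience function, not part of rmarkdown, sketched with base R only.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Usage (uncomment to run; may download from CRAN):
# ensure_package("rmarkdown")
# library(rmarkdown)
```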

15.3 Getting started with R Markdown


To create an R Markdown document in RStudio, go to File > New File > R Markdown. The following window
will appear:

Figure 36: R Markdown in R

In the dialog shown above, a title and an author have been filled in and an output format (HTML, PDF, or Word)
has been selected. Explore this window and the tabs along the left to see the other formats it can produce.
When you are done, click OK; a new file opens containing a short explanation of R Markdown.


Figure 37: RMD Script

An R Markdown document is composed of three important sections:


1. Header: The header, enclosed by lines of three dashes (---), specifies the document’s metadata such as title, author,
date, and output format. This section is defined at the top of the R Markdown file.
2. Text: Text sections are introduced by Markdown headers, such as ## R Markdown. The headers organize the
content, and the text renders as formatted prose in the output document. Markdown formatting is applied
consistently throughout the document.
3. Code Chunks: Code chunks are enclosed by lines of three backticks, with {r} following the opening backticks.
These sections contain R code that is executed when the document is rendered. The results and outputs of this
code are included in the final document, whatever its output format.
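
To make the three sections concrete, the sketch below assembles a very small R Markdown file from R and prints it. It uses base R only; the file name, title, and body text are illustrative, and the built-in `cars` dataset stands in for real data.

```r
# Three backticks: the delimiter that opens and closes a code chunk
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                              # 1. header (YAML metadata)
  "title: \"Minimal Report\"",
  "output: html_document",
  "---",
  "",
  "## R Markdown",                    # 2. text section with a header
  "",
  "Summary statistics for the built-in cars dataset:",
  "",
  paste0(fence, "{r}"),               # 3. code chunk
  "summary(cars)",
  fence
)

# Write the file to a temporary location and show its contents
rmd_path <- file.path(tempdir(), "minimal_report.Rmd")
writeLines(rmd_lines, rmd_path)
cat(readLines(rmd_path), sep = "\n")
```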

15.4 Syntax Used in R Markdown

Syntax                               Rendered Output

*italics*                            italics
**bold**                             bold
superscript^2^                       superscript²
~~strikethrough~~                    strikethrough (text with a line through it)
## Header 2                          second-level header, smaller than a "# Header 1" header
inline equation: $a = \pi * r^{2}$   inline equation: a = π ∗ r²
image: ![](path)                     inserts the image found at the specified path

15.5 YAML
The YAML header contains the metadata of the R Markdown document. Begin and end the header with a line of three
dashes (---). You can change the information in this section at any time by adding text or by overwriting the
current text. The output value determines which type of file is built from your .Rmd file:

Output Value                      File Type

output: github_document           GitHub-flavored Markdown document (.md)
output: html_document             HTML file (web page, .html)
output: pdf_document              PDF document (.pdf)
output: word_document             Microsoft Word document (.docx)
output: beamer_presentation       Beamer slideshow (PDF)
output: ioslides_presentation     ioslides slideshow (HTML)
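
The header fields are ordinary key-value pairs, and they can be read back from R. The sketch below writes a small file with an illustrative header and parses it with `yaml_front_matter()` from the rmarkdown package; the file name, title, and author are made up for the example.

```r
# Illustrative .Rmd file containing only a header and one line of text
header_path <- file.path(tempdir(), "header_only.Rmd")
writeLines(c(
  "---",
  "title: \"Sales Summary\"",
  "author: \"A. Analyst\"",
  "output: word_document",
  "---",
  "",
  "Body text."
), header_path)

# Parse the YAML header into a named list
meta <- rmarkdown::yaml_front_matter(header_path)

meta$title    # the document title
meta$output   # the requested output format
```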

Figure 38: Header Section: YAML

RStudio automatically populates a new R Markdown file with a default code chunk, as shown above. A code chunk
starts with a line of three backticks followed by {r} and ends with a line of three backticks.
There are two ways to add a code chunk to an R Markdown document: press Ctrl + Alt + I (Windows)
or Cmd + Option + I (Mac), or use the Add Chunk command in the editor toolbar. The default
code section refers to “knitr”, an R package with a lightweight API designed to give users full control of the
output format; it executes the code chunks when you render your R Markdown document. The “knitr” package
provides several chunk options:

Option     Default value   Effect

eval       TRUE            evaluate the code and include its results
echo       TRUE            display the code along with its results
message    TRUE            display messages
warning    TRUE            display warnings
tidy       FALSE           reformat the code in a tidy way when displaying it
error      FALSE           if TRUE, display errors in the output instead of stopping the render
cache      FALSE           cache results for future renders

Table 2: Table of Options
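
These options can be set for a single chunk inside its opening braces, for example {r, echo=FALSE}, or once for the whole document. The following is a minimal sketch of the document-wide form, which would normally sit in a setup chunk near the top of the file; the particular values chosen here are only an example.

```r
# Set defaults for every chunk that follows: hide code, messages, and warnings
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

# Inspect the current default for one option
knitr::opts_chunk$get("echo")
```

Options given on an individual chunk still override these document-wide defaults.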

15.6 Knitting the document


After completing a document in R Markdown, you need to "knit" the plain text and code into the final document.
To do this, click the "Knit" button located at the top of the source panel. When prompted, save the file with the
.Rmd extension. The resulting document will be generated in the format specified in the header. You can render
your R Markdown document in two ways:

1. Run rmarkdown::render("<file path>")


2. Click the Knit button at the top of the document


The Knit drop-down menu includes three main options: HTML, PDF, and Word document. You can use Knit to
convert your file to any of these types, and RStudio then shows a preview of the result in the format
you selected. Rendering executes each code chunk, inserts the results into the report, and saves the output file
next to the .Rmd file by default. The overall workflow is:

1. Open or create a file with the .Rmd extension.
2. Write text and add code chunks using R Markdown syntax.
3. Embed R code that creates output to include in the report.
4. Render the document to the chosen output format.
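
The four steps above can also be carried out entirely from the R console. The sketch below is one possible version under stated assumptions: the file name and contents are illustrative, and the render step is skipped when rmarkdown or Pandoc is not available on the machine.

```r
# Steps 1-3: create a small .Rmd file with text and one inline R expression
knit_path <- file.path(tempdir(), "knit_demo.Rmd")
writeLines(c(
  "---",
  "title: \"Knit Demo\"",
  "---",
  "",
  "One plus one is `r 1 + 1`."
), knit_path)

# Step 4: render to HTML, but only if the required tools are present
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(knit_path,
                                output_format = "html_document",
                                quiet = TRUE)
  print(out_file)   # path of the generated HTML file
} else {
  message("rmarkdown or Pandoc not available; skipping the render step")
}
```

Changing `output_format` to "word_document" or "pdf_document" produces the other formats offered in the Knit menu, with PDF output additionally depending on a LaTeX installation.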
