R Programming Basics and Overview

The document provides an introduction to R programming and RStudio, detailing their features, setup, and usage for data analysis. It covers basic concepts such as variables, data types, operators, control structures, and functions in R. Additionally, it highlights real-world applications of R in various fields like data science, finance, and healthcare.

Uploaded by

WhiteHatFarhan

​R PROGRAMMING​

​NOTES​
​Second Semester, DSAI, Department of Computer Science, University of kashmir​

UNIT 1: INTRODUCTION TO R
​1.​ ​Overview of R and RStudio​

​2.​ ​Basics of R programming​

​○​ ​Variables​

​○​ ​Data types​

​○​ ​Operators​

​3.​ ​Control structures​

​○​ ​if-else​

​○​ ​loops​

​4.​ ​Functions in R​

​○​ ​Defining functions​

​○​ ​Commonly used mathematical functions​

​○​ ​Commonly used string functions​

​5.​ ​User-defined functions​

​6.​ ​Local and global variables​

Overview of R and RStudio
1. Introduction to R

R is a powerful programming language and environment mainly used for data analysis, statistics, and visualization. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.

Think of R as a digital lab for data scientists: a place where you can experiment with data, analyze patterns, and visualize results.

​Key Features of R​

​●​ ​Open Source:​​Free to download and use — available​​for everyone.​

​●​ ​Cross-Platform:​​Works on Windows, macOS, and Linux.​

● Extensive Libraries: Thousands of built-in and external packages for data science, statistics, and machine learning.

● Strong Visualization Support: R creates high-quality plots and graphs with libraries like ggplot2 and lattice.

● Data Handling: Handles large datasets efficiently and supports data cleaning and transformation.

● Community Support: Large, active community providing help, tutorials, and open-source packages.

​Why R?​

​●​ ​Ideal for​​data science​​,​​machine learning​​,​​statistical​​modeling​​, and​​research​​.​

​●​ ​Preferred by statisticians and data analysts for its​​accuracy​​and​​statistical depth​​.​

​●​ ​Integrates easily with tools like​​Excel​​,​​SQL​​, and​​Python​​.​

​Example:​

# A simple R example
x <- c(2, 4, 6, 8, 10)
mean(x)

Explanation:

● c() creates a vector of numbers.

● mean() calculates the average of the given vector.
This simple code shows how quick and readable R is for basic data analysis.

​2. What is RStudio?​

RStudio is an Integrated Development Environment (IDE) for R.
Think of R as the engine, and RStudio as the car dashboard that makes driving easier.

It provides a clean and organized interface to write, test, and visualize your R code efficiently.

​Main Components of RStudio​

​1.​ ​Source Pane:​

​○​ ​This is where you​​write and edit​​your R scripts.​

○ Files are usually saved with the extension .R.

​2.​ ​Console Pane:​

​○​ ​The​​execution area​​where you run commands directly.​

​○​ ​Anything you type here runs immediately.​

​3.​ ​Environment / History Pane:​

​○​ ​Shows​​all active variables, datasets, and functions​​in memory.​

​○​ ​The​​History​​tab lists all commands you’ve executed.​

​4.​ ​Files / Plots / Packages / Help / Viewer Pane:​

​○​ ​Files:​​View files in your current working directory.​

​○​ ​Plots:​​Displays graphs and visualizations.​

​○​ ​Packages:​​Manage installed R packages.​

​○​ ​Help:​​Access R documentation.​

​○​ ​Viewer:​​Displays HTML outputs and interactive visuals.​

​3. How R and RStudio Work Together​

​●​ ​You write your R code inside RStudio.​

​●​ ​RStudio sends that code to the​​R interpreter​​, which​​processes and executes it.​

● The results (numbers, plots, or errors) appear in the RStudio Console or Plots window.

​It’s like RStudio being the​​user-friendly face​​of​​R.​

​4. Setting Up R and RStudio​

​1.​ ​Install R:​

○ Go to [Link] and download R for your OS.

2. Install RStudio:

○ Visit [Link] and install RStudio Desktop.

​3.​ ​Open RStudio:​

​○​ ​You’ll see four main panes as explained earlier.​

○ Start typing commands in the Console or create a new R Script via: File → New File → R Script.

​5. Real-World Uses of R​

​●​ D
​ ata Science:​​Used for analyzing datasets, making​​predictions, and creating​
​dashboards.​

​●​ ​Academia & Research:​​For running statistical tests​​and modeling data.​

​●​ ​Finance:​​Risk modeling, forecasting stock prices.​

​●​ ​Healthcare:​​Analyzing patient data and clinical trials.​

​●​ ​Marketing:​​Customer segmentation and trend analysis.​

​6. Quick Tips for Beginners​

​●​ ​Use the​​arrow keys​​in the Console to navigate through​​previous commands.​

● Save scripts with the .R extension for later use.

● Use # to write comments (ignored by R but useful for notes).

● Press Ctrl + Enter to run the current line or the selected code.

​✨ Quick Recall Box​

​●​ ​R​​= Language for data analysis and statistics.​

​●​ ​RStudio​​= User-friendly interface for R.​

● R scripts end with .R.

● Four main panes in RStudio: Source, Console, Environment, Files/Plots/Packages.

Common command example:

print("Hello, R World!")

​Basics of R Programming​
Every programming language begins with understanding its building blocks: how to store information, what kinds of information exist, and how to perform operations on them.
In R, these basic concepts revolve around variables, data types, and operators.

​Let’s explore each one clearly and step-by-step.​

1. Variables in R

A variable is like a container that holds data.
You can store a number, text, or even an entire dataset in a variable.

​How to Create a Variable​

In R, you can assign a value to a variable using any of the following operators:

x <- 10    # most common
y = 20     # also works
30 -> z    # less common but valid

​All three mean the same thing — they assign a value to a variable.​

​Variable Naming Rules​

● Must start with a letter (A–Z or a–z), or with a dot that is not followed by a digit.

● Can contain letters, numbers, dots, or underscores.

● Cannot start with a number or an underscore, and cannot contain spaces.

● R is case-sensitive → Data and data are two different variables.

​Example​
name <- "R Programming"
version <- 4.3
isFun <- TRUE

Explanation:

● name stores a text value (called a string)

● version stores a number

● isFun stores a logical value (TRUE/FALSE)

​You can check the value of a variable by just typing its name:​

​name​

​2. Data Types in R​

R supports several data types, each representing a different kind of data.
Let’s look at the most common ones:

Data Type    Example            Description
Numeric      12.5, -4, 7.0      Numbers with or without decimals
Integer      10L, -2L           Whole numbers (use L to specify integer)
Character    "Hello", 'Data'    Text or string data
Logical      TRUE, FALSE        Boolean values for conditions
Complex      2 + 3i             Numbers with real and imaginary parts
Raw          charToRaw("R")     Used for raw byte data

Example
a <- 15.7        # numeric
b <- 10L         # integer
c <- "R is fun"  # character
d <- TRUE        # logical
e <- 2 + 3i      # complex

You can check the type of data stored in a variable using:

class(a)
typeof(b)

​These functions help you understand what kind of data each variable holds.​
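
The difference between class() and typeof() is easiest to see side by side. The short sketch below uses illustrative variable names (num_val, int_val) that are not part of the notes; it only assumes base R.

```r
# Illustrative values; the variable names are hypothetical
num_val <- 15.7
int_val <- 10L

class(num_val)    # "numeric"
typeof(num_val)   # "double"  (how R stores the value internally)
class(int_val)    # "integer"
typeof(int_val)   # "integer"

# is.numeric() is TRUE for both doubles and integers
is.numeric(num_val)
is.numeric(int_val)
```

In everyday use class() is usually the more helpful of the two, while typeof() shows the underlying storage type.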

​3. Type Conversion in R​

Sometimes you may need to change one data type to another.
R provides simple conversion functions:

Function          Converts To
as.numeric()      Numeric
as.integer()      Integer
as.character()    Character
as.logical()      Logical

Example

x <- "25"
as.numeric(x)

Explanation:
This converts the string "25" into the number 25.
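
Conversions do not always succeed, so it helps to see a few edge cases. The following is a minimal sketch using only base R conversion functions; the variable name bad is made up for illustration.

```r
as.integer(3.9)        # 3: the decimal part is dropped, not rounded
as.numeric("25") + 5   # 30
as.logical("TRUE")     # TRUE
as.logical("yes")      # NA: "yes" is not a recognized logical string
as.character(42)       # "42"

# Text that is not a number becomes NA (R also issues a warning)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)             # TRUE
```

Checking for NA with is.na() after a conversion is a simple way to catch values that could not be converted.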

​4. Operators in R​

Operators help perform actions on variables and values.
They are divided into different categories:

A. Arithmetic Operators

Used for mathematical operations.

Operator    Meaning                Example     Output
+           Addition               5 + 3       8
-           Subtraction            5 - 2       3
*           Multiplication         4 * 2       8
/           Division               10 / 2      5
%%          Modulus (remainder)    10 %% 3     1
%/%         Integer division       10 %/% 3    3
^           Power                  2 ^ 3       8

​Example​

a <- 10
b <- 3
a + b
a %/% b
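
Integer division and modulus belong together: the quotient and the remainder recombine to the original number. The sketch below illustrates this with made-up values and also shows how R treats a negative left operand.

```r
a <- 17
b <- 5

quotient  <- a %/% b   # 3
remainder <- a %% b    # 2

# The two results always recombine to the original value
quotient * b + remainder == a   # TRUE

# With a negative left operand, the remainder takes the sign of the divisor
-7 %% 3    # 2
-7 %/% 3   # -3
```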

​B. Relational Operators​

​Used to​​compare​​values.​

Operator    Meaning                  Example    Output
==          Equal to                 5 == 5     TRUE
!=          Not equal to             5 != 3     TRUE
>           Greater than             7 > 3      TRUE
<           Less than                2 < 5      TRUE
>=          Greater than or equal    4 >= 4     TRUE
<=          Less than or equal       3 <= 2     FALSE

Example

x <- 10
y <- 20
x > y
x <= y

Explanation:

● The first expression returns FALSE

● The second returns TRUE

​C. Logical Operators​

​Used to combine multiple conditions.​

Operator    Meaning             Example              Output
&           AND (both true)     (5 > 2) & (3 < 6)    TRUE
|           OR (either true)    (5 > 2) | …
!           NOT (negation)      !(5 > 2)             FALSE

​D. Assignment Operators​

​Assign values to variables.​

Operator    Example    Equivalent To
<-          x <- 10    assigns 10 to x
->          10 -> x    assigns 10 to x
=           x = 10     assigns 10 to x

​E. Miscellaneous Operators​


Operator    Description              Example
:           Sequence generator       1:5 gives 1 2 3 4 5
%in%        Membership test          2 %in% c(1,2,3) → TRUE
%*%         Matrix multiplication    Used for multiplying matrices

​💡 Real-World Example​

​Imagine you’re analyzing exam scores:​

math <- 80
science <- 90
average <- (math + science) / 2
average

Explanation:
We created two numeric variables and calculated their average, a basic but common data analysis operation.

​✨ Quick Recall Box​

​●​ ​Variables​​store data and are case-sensitive.​

● Data Types: numeric, integer, character, logical, complex.

● Use <- for assignment.

● Check type: class(), typeof().

● Operators: arithmetic, relational, logical, assignment.

● : creates sequences, %in% checks membership.

​Control Structures in R​
Control structures help you control the flow of your program: deciding what to do next based on certain conditions or repeating actions multiple times.
In simple words, they make your R programs smarter and more dynamic.
​In simple words, they make your R programs​​smarter​​and​​more dynamic​​.​

​There are two main types you’ll learn here:​

​1.​ ​Conditional statements (if-else)​

​2.​ ​Loops​

​1. Conditional Statements — if, else if, and else​

Conditional statements allow your program to make decisions.
They check whether a condition is TRUE or FALSE and execute code accordingly.

Syntax
if (condition) {
  # code to run if condition is TRUE
} else if (another_condition) {
  # code to run if the above is FALSE but this is TRUE
} else {
  # code to run if none are TRUE
}

Example 1: Simple if Statement

x <- 5
if (x > 0) {
  print("Positive number")
}

Explanation:
Since x is greater than 0, the condition is TRUE, so the message “Positive number” is printed.

Example 2: if-else Statement

x <- -3
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}

Explanation:
Here, x is -3, so the condition x >= 0 is FALSE.
The else block runs and prints “Negative number”. (The first message says "Non-negative" because x >= 0 also covers zero.)

Example 3: if-else-if Ladder

x <- 0
if (x > 0) {
  print("Positive")
} else if (x < 0) {
  print("Negative")
} else {
  print("Zero")
}

Explanation:
This example checks multiple conditions: it first checks for positive, then negative, and finally prints “Zero” if neither is true.

Example 4: Nested if

You can also place one if inside another.

x <- 20
if (x > 10) {
  if (x < 30) {
    print("Between 10 and 30")
  }
}

Explanation:
The inner if only runs if the outer condition is true, making it a nested decision.
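
A nested if like this can often be written as a single condition. The sketch below is one possible alternative, not the only way to do it; it uses the && operator, which combines two single TRUE/FALSE values, and a hypothetical variable named label.

```r
x <- 20

# Same decision as the nested if, written as one combined condition
if (x > 10 && x < 30) {
  label <- "Between 10 and 30"
} else {
  label <- "Outside the range"
}
print(label)
```

The combined form is shorter, while the nested form is handy when the outer branch needs to do extra work before the inner check.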

​2. Loops in R​

Loops allow you to repeat a block of code multiple times.
This is especially useful when performing repetitive tasks like printing numbers, analyzing data rows, or performing calculations for many values.

​R provides three main types of loops:​

​●​ ​for loop​

​●​ ​while loop​

​●​ ​repeat loop​

​A. for Loop​

​Used when you know​​how many times​​you want to repeat​​something.​

​Syntax:​

for (variable in sequence) {
  # code to repeat
}

Example:

for (i in 1:5) {
  print(paste("This is loop number", i))
}

​Explanation:​

​●​ ​
1:5​​creates a sequence (1, 2, 3, 4, 5).​

​●​ ​The loop runs five times, printing the message each time with the loop number.​

​B. while Loop​

Used when you don’t know exactly how many times to loop: it runs as long as a condition remains TRUE.

​Syntax:​

while (condition) {
  # code to execute
}

Example:

count <- 1
while (count <= 5) {
  print(paste("Count is", count))
  count <- count + 1
}

Explanation:
The loop keeps printing until count becomes greater than 5.
If you forget to update count, this can lead to an infinite loop.
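
A while loop is a natural fit when the number of repetitions depends on the data. As a rough sketch with made-up numbers, the code below counts how many times a value can be halved before it drops below 1.

```r
value <- 100   # hypothetical starting value
steps <- 0

while (value >= 1) {
  value <- value / 2   # the update that moves the loop toward its exit
  steps <- steps + 1
}

steps   # 7
```

Here the update to value plays the same role as count <- count + 1: without it the condition would never become FALSE.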

​C. repeat Loop​

The repeat loop is an infinite loop that continues until you use the break statement to stop it.

Syntax:

repeat {
  # code
  if (condition) {
    break
  }
}

Example:

x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x > 5) {
    break
  }
}

Explanation:
The loop keeps printing x and increasing it by 1 until x becomes greater than 5, then stops.

​3. Loop Control Statements​

​Sometimes you may want to skip certain iterations or exit a loop early.​

​A. break​

Stops the loop entirely.
Example:

for (i in 1:10) {
  if (i == 6) {
    break
  }
  print(i)
}

Output (one value per line):
1 2 3 4 5

Stops when i equals 6.

​B. next​

Skips the current iteration and moves to the next one.
Example:

for (i in 1:5) {
  if (i == 3) {
    next
  }
  print(i)
}

Output (one value per line):
1 2 4 5

The loop skips printing 3.

​4. Combining Conditions and Loops​

Loops and conditions often work together in real tasks like data cleaning or summarizing values.

Example:

numbers <- c(2, 5, 7, 9, 12)
for (num in numbers) {
  if (num %% 2 == 0) {
    print(paste(num, "is even"))
  } else {
    print(paste(num, "is odd"))
  }
}

Explanation:
This loop checks each number in the vector and prints whether it’s even or odd.

​💡 Real-World Example​

​Imagine you’re analyzing student scores:​

scores <- c(85, 42, 90, 33, 76)
for (s in scores) {
  if (s >= 50) {
    print("Pass")
  } else {
    print("Fail")
  }
}

Explanation:
This loop goes through each score and checks if it’s a pass or fail, similar to an automated grading system.
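
The same grading can also be done without a loop. The sketch below uses base R's ifelse(), which is not covered in the notes above, to test every score at once; the variable name results is made up for illustration.

```r
scores <- c(85, 42, 90, 33, 76)

# ifelse() applies the test to every element at once
results <- ifelse(scores >= 50, "Pass", "Fail")
results          # "Pass" "Fail" "Pass" "Fail" "Pass"

# table() counts how often each label occurs
table(results)
```

The loop version is easier to extend with extra steps per score, while the vectorized version is shorter and keeps the results in a vector for later use.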

​✨ Quick Recall Box​

​●​ ​if-else​​handles decision-making.​

​●​ ​for​​,​​while​​, and​​repeat​​handle repetition.​

​●​ ​break​​exits a loop;​​next​​skips one iteration.​

​●​ ​Combine loops with conditions for flexible logic.​

​●​ ​Infinite loops happen if you forget to update your loop variable.​

​Functions in R​
Functions are the heart of R programming: they help you organize your code, reuse logic, and simplify complex tasks.
Think of a function as a mini-program inside your main program that performs a specific job whenever you call it.

For example, if you often need to calculate the average of numbers, instead of rewriting the same code again and again, you can just write a function once and reuse it whenever needed.

​1. Defining Functions in R​

​Syntax​
function_name <- function(arguments) {
  # body of the function
  # code to execute
  return(result)
}

Let’s break it down:

● function_name → the name you give to your function.

● function() → defines the function.

● arguments → inputs the function takes.

● return() → sends back the output (optional but good practice; without it, a function returns the value of its last evaluated expression).

​Example 1: Simple Function​


add_numbers <- function(a, b) {
  sum <- a + b
  return(sum)
}
add_numbers(5, 7)

Explanation:

● add_numbers is a user-defined function that takes two arguments.

● It adds them and returns the result → output will be 12.

​Example 2: Function Without Parameters​


say_hello <- function() {
  print("Hello from R!")
}
say_hello()

Explanation:
This function doesn’t take any arguments. It simply prints a message whenever you call it.

​Example 3: Function with Default Arguments​


greet <- function(name = "Student") {
  paste("Welcome,", name)
}
greet()
greet("Faisal")

Explanation:

● If no name is given, it uses "Student" as the default.

● If you pass "Faisal", it personalizes the message.

​Example 4: Function Returning Multiple Values​

​You can return multiple values by combining them in a list.​

calculate <- function(x, y) {
  sum <- x + y
  diff <- x - y
  return(list(Sum = sum, Difference = diff))
}
result <- calculate(10, 5)
result$Sum
result$Difference

Explanation:
This function returns both sum and difference in a list.
You can access each value using $.
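
The same list-returning pattern works for any group of related results. As an illustrative sketch, the hypothetical helper summarize_values() below returns three summary values for a numeric vector.

```r
# Hypothetical helper: summarize a numeric vector in one call
summarize_values <- function(values) {
  list(
    Minimum = min(values),
    Maximum = max(values),
    Average = mean(values)
  )
}

stats <- summarize_values(c(4, 8, 15, 16, 23, 42))
stats$Average        # 18
stats[["Maximum"]]   # 42, [[ ]] with a name works like $
names(stats)         # "Minimum" "Maximum" "Average"
```

Giving each list element a name is what makes the result readable at the call site.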

​2. Commonly Used Mathematical Functions​

​R comes with many​​built-in mathematical functions​​that make calculations super easy.​

Function                  Description          Example              Output
abs(x)                    Absolute value       abs(-5)              5
sqrt(x)                   Square root          sqrt(16)             4
exp(x)                    Exponential          exp(1)               2.718
log(x)                    Natural log          log(10)              2.303
log10(x)                  Base-10 log          log10(100)           2
round(x, n)               Round to n digits    round(3.14159, 2)    3.14
ceiling(x)                Round up             ceiling(2.3)         3
floor(x)                  Round down           floor(2.9)           2
sin(x), cos(x), tan(x)    Trigonometric        sin(pi/2)            1
sum(x)                    Sum of elements      sum(c(1,2,3))        6
mean(x)                   Average value        mean(c(2,4,6))       4
max(x)                    Maximum value        max(c(5,9,2))        9
min(x)                    Minimum value        min(c(5,9,2))        2

​Example​
nums <- c(2, 4, 6, 8)
mean(nums)
sd(nums)

Explanation:

● mean() finds the average.

● sd() gives the standard deviation, which measures how spread out the data is.
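
To see what sd() computes, the standard deviation can be rebuilt from the simpler functions in the table. This is a sketch for illustration; sd() uses the sample formula, which divides by n - 1.

```r
nums <- c(2, 4, 6, 8)

n         <- length(nums)
avg       <- mean(nums)                          # 5
manual_sd <- sqrt(sum((nums - avg)^2) / (n - 1))

manual_sd                        # about 2.58
all.equal(manual_sd, sd(nums))   # TRUE
```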

​3. Commonly Used String Functions​

Working with text (called strings) is common in R, for example when cleaning names, formatting output, or labeling graphs.

​Here are some useful functions:​

Function                  Description                  Example                           Output
nchar(x)                  Counts characters            nchar("Hello")                    5
toupper(x)                Converts to uppercase        toupper("data")                   "DATA"
tolower(x)                Converts to lowercase        tolower("RStudio")                "rstudio"
substr(x, start, stop)    Extracts part of a string    substr("Learning", 1, 4)          "Lear"
paste(x, y, sep=" ")      Joins strings                paste("R", "Language")            "R Language"
paste0(x, y)              Joins without space          paste0("Data", "Science")         "DataScience"
strsplit(x, split)        Splits a string              strsplit("R is fun", " ")         "R" "is" "fun"
grep(pattern, x)          Finds matching text          grep("R", c("R","Python","C"))    1

​Example: Using String Functions Together​


sentence <- "R programming is interesting"
words <- strsplit(sentence, " ")
toupper(words[[1]])

Explanation:

● strsplit() breaks the sentence into words and returns them inside a list.

● words[[1]] accesses the first element of that list, which is the vector of words.

● toupper() converts all words to uppercase.
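
Several string functions can be chained to clean up text. The sketch below capitalizes each part of a hypothetical name; it relies on the collapse argument of paste(), which joins the elements of one vector into a single string and is not shown in the table above.

```r
full_name <- "ada lovelace"   # hypothetical input

parts <- strsplit(full_name, " ")[[1]]   # "ada" "lovelace"

# Upper-case the first letter of each part and keep the rest unchanged
capitalized <- paste0(
  toupper(substr(parts, 1, 1)),
  substr(parts, 2, nchar(parts))
)

display_name <- paste(capitalized, collapse = " ")
display_name          # "Ada Lovelace"
nchar(display_name)   # 12
```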

​4. Why Functions Matter​

​Functions help make your code:​

​●​ ​Reusable​​→ write once, use anywhere.​

​●​ ​Readable​​→ organized and easy to understand.​

​●​ ​Maintainable​​→ easy to update or fix errors.​

They’re especially powerful in data analysis where repetitive operations are common, such as cleaning multiple datasets or computing statistical measures.

💡 Real-World Example

Let’s say you want to calculate total marks and percentage for a student:

calculate_result <- function(math, science, english) {
  total <- math + science + english
  percentage <- (total / 300) * 100
  return(list(Total = total, Percentage = percentage))
}
student1 <- calculate_result(80, 75, 90)
student1

Explanation:
This function computes both total and percentage for a student: practical, reusable, and easy to extend for more subjects.

​✨ Quick Recall Box​

● Define a function using function().

● Use return() to send back output.

● Mathematical functions like sum(), mean(), sqrt(), etc.

● String functions like nchar(), toupper(), paste(), etc.

​●​ ​Functions make code modular, clean, and reusable.​

​User-defined Functions​

User-defined functions in R are custom functions that you create to perform specific tasks not covered by R’s built-in functions. They help you make your programs modular, readable, and reusable.

Creating a User-defined Function

You can define your own function using the function() keyword.
Syntax:

function_name <- function(parameter1, parameter2, ...) {
  # function body
  # computations
  return(result)
}

​●​ ​function_name​​— name you assign to the function.​

​●​ ​parameters​​— values passed to the function (optional).​

​●​ ​return()​​— sends back the result (optional).​

Example 1: Function without Arguments

greet <- function() {
  print("Welcome to R Programming!")
}
greet()

Output:

[1] "Welcome to R Programming!"

Example 2: Function with Parameters

multiply <- function(a, b) {
  product <- a * b
  return(product)
}
multiply(6, 4)

Output:

[1] 24

Example 3: Function with Default Parameters

You can assign default values to parameters.

area_circle <- function(radius = 1) {
  area <- pi * radius^2
  return(area)
}
area_circle()   # uses default value
area_circle(3)  # uses user input

Output:

[1] 3.141593
[1] 28.27433

​Example 4: Function Returning Multiple Values​

​A function can return multiple values using a​​list​​.​

operations <- function(x, y) {
  result <- list(
    sum = x + y,
    difference = x - y,
    product = x * y,
    quotient = x / y
  )
  return(result)
}
output <- operations(10, 5)
print(output)

Output:

$sum
[1] 15

$difference
[1] 5

$product
[1] 50

$quotient
[1] 2

​Example 5: Function Calling Another Function​

​Functions can call other functions inside them.​

square <- function(x) {
  return(x^2)
}
sum_of_squares <- function(a, b) {
  return(square(a) + square(b))
}
sum_of_squares(3, 4)

Output:

[1] 25

​Example 6: Anonymous Functions (Lambda Functions)​

Anonymous functions are unnamed functions, often written on a single line and passed to the apply() family of functions (such as sapply()).

squared_values <- sapply(1:5, function(x) x^2)
print(squared_values)

Output:

[1] 1 4 9 16 25

User-defined functions give you complete control over what your code does, making them essential for structuring large projects or automating repetitive tasks.
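
An anonymous function behaves exactly like a named one; it simply is never stored under a name. The sketch below compares the two and also uses base R's Filter(), which is not covered in the notes above. All variable names are illustrative.

```r
values <- c(3, 8, 12)

# A named function and an anonymous function give the same result
double_it <- function(x) x * 2
named_result     <- sapply(values, double_it)
anonymous_result <- sapply(values, function(x) x * 2)

identical(named_result, anonymous_result)   # TRUE

# Filter() keeps the elements for which the function returns TRUE
evens <- Filter(function(x) x %% 2 == 0, values)
evens   # 8 12
```

A reasonable rule of thumb: name a function once it is used in more than one place, and keep it anonymous when it is a one-off.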

​Local and Global Variables​

In R, variables can exist inside or outside of functions. This determines their scope, meaning where a variable can be accessed or modified. Understanding this concept is very important when writing larger programs or working with multiple functions.

1. Local Variables

A local variable is one that’s created inside a function and can only be accessed within that function.
Once the function finishes running, the local variable disappears (it’s destroyed).

Example:

add_numbers <- function() {
  x <- 10
  y <- 20
  sum <- x + y
  print(sum)
}
add_numbers()
print(x) # trying to access x outside the function

Output:

[1] 30
Error: object 'x' not found

​🟢​​Explanation:​

● x and y are local to the function add_numbers().

● When you try to print x outside, R gives an error because it doesn’t exist in the global environment.

2. Global Variables

A global variable is declared outside any function and can be used both inside and outside of functions.

Example:

a <- 5
multiply <- function() {
  result <- a * 10
  print(result)
}
multiply()
print(a)

Output:

[1] 50
[1] 5

🟢 Explanation:

● a is a global variable, so it’s accessible inside the multiply() function as well as outside.

​3. Modifying Global Variables Inside Functions​

By default, if you assign a new value to a variable inside a function, R creates a new local variable with that name; it does not modify the global variable.

Example:

x <- 100
change_value <- function() {
  x <- 50
  print(paste("Inside function:", x))
}
change_value()
print(paste("Outside function:", x))

Output:

[1] "Inside function: 50"
[1] "Outside function: 100"

🟢 Explanation:

● Even though we changed x inside the function, it only changed locally.

● The global x remained the same.

4. Forcing a Function to Modify a Global Variable

If you really need to modify a global variable from inside a function, use the super-assignment operator <<-.

Example:

count <- 0
increment <- function() {
  count <<- count + 1
}
increment()
increment()
print(count)

Output:

[1] 2

🟢 Explanation:

● The <<- operator looks for count in the enclosing environments (here, the global environment) and updates it there instead of creating a new local variable.

● Each time the function runs, count is updated globally.
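
When the goal is only to remember a value between calls, a global variable is not strictly required. One common alternative, sketched below with the hypothetical names make_counter and counter, keeps the count inside the function that creates the counter.

```r
# Hypothetical counter factory: the count lives inside make_counter()
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in the enclosing function, not globally
    count
  }
}

counter <- make_counter()
counter()   # 1
counter()   # 2
```

Because <<- finds count in the enclosing function first, nothing in the global environment is touched, which avoids the side effects mentioned in the tip below.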

​Quick Summary​

​Type​ ​Defined In​ ​Accessible From​ ​Lifetime​

​Local Variable​ ​Inside a function​ ​Only inside that function​ ​Until function ends​

​Global Variable​ ​Outside all functions​ ​Everywhere​ ​Until program ends​

🧠 Tip for Exams:

● Always prefer local variables to avoid unwanted side effects in large programs.

● Use <<- only when you really need to modify global data from a function.

✅ Quick Recall Box

● Local → Inside function → Temporary

● Global → Outside function → Permanent

● Use <<- to modify global variables inside a function

UNIT 2: DATA HANDLING IN R
​1.​ ​Data structures in R​

​○​ ​Vectors​

​○​ ​Matrices​

​○​ ​Data frames​

​○​ ​Lists​

​2.​ ​Importing and exporting data​

​3.​ ​Importing data from Excel​

​4.​ ​Accessing databases​

​5.​ ​Saving data in R​

​6.​ ​Loading R data objects​

​7.​ ​Writing to files​

​8.​ ​Data cleaning and preparation​

​○​ ​Handling missing values​

​○​ ​Filtering data​

Data Structures in R

R provides a rich set of data structures to store and organize data efficiently. These structures are the building blocks for all data manipulation and analysis tasks in R.

​Let’s explore each one step by step.​

1. Vectors

A vector is the simplest and most common data structure in R.
It stores elements of the same data type: numeric, character, or logical.

Creating Vectors

You can create a vector using the c() (combine) function.
Example:

num_vector <- c(10, 20, 30, 40)
char_vector <- c("R", "is", "fun")
log_vector <- c(TRUE, FALSE, TRUE)

​Explanation:​

​●​ ​
c()​​combines values into a single sequence.​

​●​ ​Each element in a vector must have the​​same type​​.​

​Accessing Vector Elements​

Use square brackets [] with the element’s position (index).

num_vector[2]
num_vector[1:3]

Output:

[1] 20
[1] 10 20 30

Modifying Vectors
num_vector[2] <- 25

This replaces the second element with 25.

Vector Operations

R performs element-wise operations automatically.

a <- c(1, 2, 3)
b <- c(4, 5, 6)
a + b
a * b

Output:

[1] 5 7 9
[1] 4 10 18

​Useful Functions for Vectors​

Function     Description
length(x)    Number of elements
sum(x)       Sum of all elements
mean(x)      Average value
sort(x)      Sorts elements
rev(x)       Reverses order

2. Matrices

A matrix is a two-dimensional data structure where all elements are of the same type (numeric, character, or logical).

Creating a Matrix

Use the matrix() function.
Syntax:

matrix(data, nrow, ncol, byrow = FALSE)

Example:

mat <- matrix(1:9, nrow = 3, ncol = 3)
print(mat)

Output:

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

​Accessing Matrix Elements​


mat[1, 2]   # element in 1st row, 2nd column
mat[, 2]    # all elements in 2nd column
mat[2, ]    # all elements in 2nd row

Matrix Operations
A <- matrix(1:4, 2, 2)
B <- matrix(5:8, 2, 2)
A + B     # element-wise addition
A * B     # element-wise multiplication
A %*% B   # matrix multiplication

Naming Rows and Columns

rownames(mat) <- c("Row1", "Row2", "Row3")
colnames(mat) <- c("Col1", "Col2", "Col3")
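
The difference between * and %*% is a frequent source of confusion, so here is a small worked sketch. The variable names elementwise and product are made up for illustration; t() is base R's transpose function.

```r
A <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1 3) and (2 4)
B <- matrix(5:8, nrow = 2)   # rows are (5 7) and (6 8)

elementwise <- A * B     # multiplies matching positions
product     <- A %*% B   # rows of A combined with columns of B

elementwise[1, 2]   # 3 * 7 = 21
product[1, 2]       # 1 * 7 + 3 * 8 = 31

dim(product)        # 2 2
t(A)                # transpose: rows become columns
```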

3. Data Frames

A data frame is one of the most important structures in R.
It stores tabular data, similar to an Excel sheet, and can hold different data types in each column.

Creating a Data Frame

student_data <- data.frame(
  Name = c("Ali", "Sara", "John"),
  Age = c(21, 22, 20),
  Score = c(85, 90, 88)
)
print(student_data)

Output:

  Name Age Score
1  Ali  21    85
2 Sara  22    90
3 John  20    88

Accessing Data Frame Elements
student_data$Name
student_data[1, 2]
student_data[ , "Score"]

Adding and Removing Columns

student_data$Grade <- c("A", "A+", "B")
student_data$Score <- NULL # removes column

Useful Data Frame Functions

Function     Purpose
str(df)      Structure of data frame
nrow(df)     Number of rows
ncol(df)     Number of columns
names(df)    Column names
head(df)     First few rows
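
Row selection by condition is one of the most common data frame tasks. The sketch below rebuilds a small table under the hypothetical name students so that it runs on its own, then picks rows and adds a computed column.

```r
# Hypothetical data, similar in shape to the student table
students <- data.frame(
  Name  = c("Ali", "Sara", "John"),
  Age   = c(21, 22, 20),
  Score = c(85, 90, 88)
)

# Rows where Score is above 86, keeping all columns
high_scorers <- students[students$Score > 86, ]
high_scorers$Name   # Sara and John

# A new column computed from an existing one
students$Passed <- students$Score >= 50
nrow(students)   # 3
ncol(students)   # 4
```

The comma inside the square brackets matters: the condition goes before it (rows), and leaving the part after it empty keeps every column.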

4. Lists

A list can hold elements of different types: numbers, strings, vectors, even other lists or data frames.

Creating a List
my_list <- list(
  name = "R Programming",
  numbers = c(1, 2, 3),
  matrix_data = matrix(1:4, 2, 2)
)

Accessing List Elements

my_list$name
my_list[[2]]
my_list$matrix_data[1, 2]

Adding or Removing List Elements

my_list$new_item <- "Extra data"
my_list$numbers <- NULL

Why Lists Are Important

Lists are used to store complex results, such as outputs from models or multiple datasets in one object.
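
Single and double square brackets behave differently on lists, which is worth seeing once. The sketch below uses a made-up list named course.

```r
course <- list(
  title  = "R Programming",
  marks  = c(80, 75, 90),
  passed = TRUE
)

# [[ ]] and $ return the element itself
class(course[["marks"]])   # "numeric"
mean(course$marks)         # about 81.67

# [ ] returns a smaller list that still wraps the element
class(course["marks"])     # "list"

length(course)   # 3
str(course)      # compact overview of every element
```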

​Quick Summary​
Data Structure    Type     Stores                  Example
Vector            1D       Same data type          c(1, 2, 3)
Matrix            2D       Same data type          matrix(1:4, 2, 2)
Data Frame        2D       Different types         data.frame(Name, Age)
List              Mixed    Different structures    list(name, vector, matrix)

🧠 Tip:

● Use vectors for simple sequences, data frames for datasets, and lists for flexible combinations of objects.

​Vectors​

​ ​​vector​​in R is the simplest data structure that​​holds elements of the same data type​
A
​(numeric, character, logical, etc.). Vectors are used to store a sequence of data elements in​
​a single variable.​

​Creating Vectors​

Vectors can be created using the c() function (combine function).
Example:

x <- c(1, 2, 3, 4, 5)
y <- c("apple", "banana", "cherry")
z <- c(TRUE, FALSE, TRUE, TRUE)

​Accessing Vector Elements​

Elements of a vector can be accessed using square brackets [ ] with index positions (indexing in R starts from 1).
Example:

x[1]    # First element
x[3]    # Third element
x[2:4]  # Elements from 2nd to 4th

​Vector Operations​

Arithmetic operations are performed element-wise.
Example:

a <- c(10, 20, 30)
b <- c(1, 2, 3)
a + b  # Addition
a - b  # Subtraction
a * b  # Multiplication
a / b  # Division

​Vector Functions​

​Some common vector functions in R are:​

Function    Description                   Example
length(x)   Returns number of elements    length(a)
sum(x)      Returns sum of all elements   sum(a)
mean(x)     Returns average value         mean(a)
max(x)      Returns maximum value         max(a)
min(x)      Returns minimum value         min(a)
sort(x)     Sorts the vector              sort(a)

​Combining Vectors​

Vectors can be combined using c().
Example:

a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(a, b)
print(c)

​Logical Operations on Vectors​

You can compare elements of two vectors directly.
Example:

a <- c(1, 2, 3)
b <- c(3, 2, 1)
a == b  # Element-wise comparison
a > b
a < b

​Vector Recycling​

I​f two vectors of different lengths are operated on, R recycles the shorter vector.​
​Example:​

a <- c(1, 2, 3, 4)
​b <- c(10, 20)​
​a + b # b is recycled as (10, 20, 10, 20)​
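
Recycling is silent when the longer length is a multiple of the shorter one; otherwise R still recycles but issues a warning. A minimal sketch of both cases (the name short is used only for illustration, and the warning is suppressed here so the result can be inspected):

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)

# Length 4 is a multiple of length 2, so b is reused as (10, 20, 10, 20)
res_even <- a + b
print(res_even)

# Length 4 is not a multiple of length 3: R reuses short as (10, 20, 30, 10)
# and would normally warn that the lengths do not divide evenly
short <- c(10, 20, 30)
res_uneven <- suppressWarnings(a + short)
print(res_uneven)
```

In real code that warning is usually a sign of a length mismatch worth checking, rather than something to suppress.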

​Type Coercion​

I​f a vector has mixed data types, R automatically converts them to the same type following​
​this hierarchy:​
​Logical → Integer → Double → Character​
​Example:​

v <- c(1, TRUE, "R")
​print(v) # All elements converted to character​
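
The hierarchy can be checked one step at a time with typeof(). The sketch below uses illustrative vector names (v1, v2, v3, nums) that are not part of the notes:

```r
v1 <- c(TRUE, 2L)    # logical + integer   -> integer
v2 <- c(2L, 3.5)     # integer + double    -> double
v3 <- c(3.5, "R")    # double + character  -> character

typeof(v1)
typeof(v2)
typeof(v3)

# Moving back down the hierarchy has to be requested explicitly
nums <- as.numeric(c("10", "20"))
sum(nums)
```

Coercion only ever moves toward the more general type on its own; converting text back to numbers needs a call such as as.numeric().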

Vectors are fundamental in R and form the building blocks for more complex data structures like matrices and data frames.

​Matrices​

A matrix in R is a two-dimensional data structure that contains elements of the same data type (numeric, character, or logical). It’s essentially a collection of vectors arranged in rows and columns.

Creating a Matrix

You can create a matrix using the matrix() function.
Syntax:

​matrix(data, nrow, ncol, byrow = FALSE, dimnames = NULL)​

​●​ ​data​​→ the input vector of elements​

​●​ ​nrow​​→ number of rows​

​●​ ​ncol​​→ number of columns​

● byrow → if TRUE, fills the matrix by rows; if FALSE, fills by columns

​●​ ​dimnames​​→ optional names for rows and columns​

​Example:​

m1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(m1)

Output:

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

To fill by rows:

m2 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
print(m2)

​Accessing Matrix Elements​

Matrix elements are accessed using row and column indices.
Example:

m1[1, 2]  # Element in 1st row, 2nd column
m1[ , 2]  # All elements in 2nd column
m1[2, ]   # All elements in 2nd row

Matrix Operations

R allows arithmetic operations on matrices.
Example:

A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)

A + B # Addition
​A - B # Subtraction​
​A * B # Element-wise multiplication​
​A / B # Element-wise division​
​A %*% B # Matrix multiplication​
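
The difference between * and %*% is easy to miss, so here is a small sketch using the same A and B as above. The extra matrix C and the result names are illustrative additions:

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)

elementwise <- A * B     # multiplies values in matching positions
product     <- A %*% B   # rows of A combined with columns of B

print(elementwise)
print(product)

# %*% needs ncol(left) == nrow(right); the result is nrow(left) x ncol(right)
C <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
dim(A %*% C)
```

Element-wise operators need matrices of the same shape, while %*% only needs the inner dimensions to agree.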

​Matrix Functions​
Function      Description               Example
t(A)          Transpose of matrix       t(A)
nrow(A)       Number of rows            nrow(A)
ncol(A)       Number of columns         ncol(A)
dim(A)        Dimensions (rows, cols)   dim(A)
rowSums(A)    Sum of each row           rowSums(A)
colSums(A)    Sum of each column        colSums(A)
rowMeans(A)   Mean of each row          rowMeans(A)
colMeans(A)   Mean of each column       colMeans(A)

​Combining Matrices​

​You can combine matrices using:​

● rbind() → combines by rows

● cbind() → combines by columns

Example:

m1 <- matrix(1:6, nrow = 2)
m2 <- matrix(7:12, nrow = 2)
rbind(m1, m2)  # Combine by rows
cbind(m1, m2)  # Combine by columns

​Naming Rows and Columns​

You can assign names to matrix rows and columns.
Example:

m <- matrix(1:9, nrow = 3)
rownames(m) <- c("Row1", "Row2", "Row3")
colnames(m) <- c("Col1", "Col2", "Col3")
print(m)

​Matrix Indexing with Names​

After naming, you can access elements by name.
Example:

​m["Row2", "Col3"]​

​Checking Data Type​

​All elements in a matrix have the same type:​

class(m)
typeof(m)

Matrices are often used in mathematical computations, data transformations, and statistical modeling where uniform data types are required.

​Data Frames​

A data frame is one of the most commonly used data structures in R. It’s similar to a table in a spreadsheet or a dataset in Python’s pandas: it is made up of rows and columns, but unlike matrices, each column can contain a different data type (numeric, character, logical, etc.).

​You can think of a data frame as a​​collection of equal-length​​vectors​​combined together.​

​Creating a Data Frame​

You can create a data frame using the data.frame() function.

Syntax:

data.frame(column1, column2, column3, ...)

Example:

# Creating a simple data frame
students <- data.frame(
  Name = c("Ali", "Sara", "John"),
  Age = c(22, 21, 23),
  Score = c(88, 95, 79)
)
print(students)

Output:

  Name Age Score
1  Ali  22    88
2 Sara  21    95
3 John  23    79

👉 Here, each column (Name, Age, Score) is a vector, and all have equal lengths.

​Accessing Data Frame Elements​

​You can access elements in several ways:​

1. Using the $ operator

students$Name

→ Returns the entire “Name” column.

2. By column index or name

students[, 2]      # 2nd column
students["Score"]  # Column by name

3. By row and column position

students[1, 3]  # Element in 1st row, 3rd column
students[2, ]   # Entire 2nd row

​Adding and Removing Columns​

​Add a new column:​

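students$Grade <- c("B", "A", "C")
print(students)

Remove a column (shown here on a copy, so that students keeps its Score column for the examples that follow):

students_copy <- students
students_copy$Score <- NULL

Adding and Removing Rows

Add a row using rbind(). The new row must have the same columns as the existing data frame:

new_student <- data.frame(Name = "Emma", Age = 22, Score = 91, Grade = "A")
students <- rbind(students, new_student)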

​Remove a row:​

​students <- students[-2, ] # Removes 2nd row​

​Basic Operations on Data Frames​


Operation     Description                Example
nrow(df)      Number of rows             nrow(students)
ncol(df)      Number of columns          ncol(students)
dim(df)       Dimensions of data frame   dim(students)
names(df)     Column names               names(students)
str(df)       Structure of data frame    str(students)
summary(df)   Statistical summary        summary(students)
head(df)      First 6 rows               head(students)
tail(df)      Last 6 rows                tail(students)

​Changing Column Names​

​You can rename columns using:​

​colnames(students) <- c("StudentName", "Age", "Score", "Grade")​

​Or rename specific columns with:​

​names(students)[2] <- "StudentAge"​

​Filtering Data​

You can filter rows based on conditions using logical operators.
Example:

​students[students$Score > 85, ]​

​→ Returns all students with scores greater than 85.​
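
Conditions can be combined with & (and) and | (or). The sketch below rebuilds a small students data frame so it runs on its own; the result names high_young, either and same_as_first are illustrative:

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "John"),
  Age   = c(22, 21, 23),
  Score = c(88, 95, 79)
)

# Both conditions must hold: score above 85 AND younger than 22
high_young <- students[students$Score > 85 & students$Age < 22, ]

# Either condition may hold, keeping only two columns
either <- students[students$Score > 90 | students$Age > 22, c("Name", "Score")]

# subset() expresses the first filter without repeating students$
same_as_first <- subset(students, Score > 85 & Age < 22)

print(high_young)
print(either)
```

The comma after the condition matters: the condition selects rows, and whatever follows the comma selects columns.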

​Sorting Data​

You can sort data frames using the order() function.
Example:

​students[order(students$Score, decreasing = TRUE), ]​

​→ Sorts the data in descending order of scores.​

Merging Data Frames

You can merge two data frames using a common column with merge().
Example:

df1 <- data.frame(ID = c(1, 2, 3), Name = c("Ali", "Sara", "John"))
df2 <- data.frame(ID = c(1, 2, 3), Marks = c(88, 95, 79))
​merged_df <- merge(df1, df2, by = "ID")​
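
In the example above every ID appears in both data frames. When the keys only partly overlap, the all, all.x and all.y arguments of merge() decide which rows are kept. A small sketch with made-up IDs and marks:

```r
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Ali", "Sara", "John"))
df2 <- data.frame(ID = c(2, 3, 4), Marks = c(95, 79, 60))

inner <- merge(df1, df2, by = "ID")                # only IDs present in both
left  <- merge(df1, df2, by = "ID", all.x = TRUE)  # every row of df1
full  <- merge(df1, df2, by = "ID", all = TRUE)    # every row of both

print(inner)
print(left)
print(full)
```

Rows without a match on the other side are filled with NA.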

​Real-World Use Case​

​Data frames are used whenever you deal with​​structured​​datasets​​, such as:​

​●​ ​Storing survey results​

​●​ ​Analyzing CSV or Excel data​

​●​ ​Preparing data for visualization or machine learning​

​In short,​​data frames are the backbone of data analysis​​in R​​.​

​🧭 Quick Summary​

​●​ ​A​​data frame​​can contain columns of​​different data​​types​​.​

● Created using data.frame()

● Access using $, indices, or names.

● Rows → rbind(); Columns → cbind() or $.

● Use summary() and str() to understand data quickly.

​●​ ​Perfect for​​real-world tabular data analysis​​.​

​Lists​

A list in R is a flexible data structure that can hold different types of elements: numbers, strings, vectors, matrices, data frames, or even other lists!
Think of a list like a container that can store different kinds of objects together, unlike vectors or matrices which require all elements to be of the same type.

Creating a List

You can create a list using the list() function.

​Syntax:​

​list(item1, item2, item3, ...)​

​Example:​

​my_list <- list(​


​Name = "Sara",​
​Age = 21,​
​Scores = c(85, 90, 95),​
​Passed = TRUE​
​)​
​print(my_list)​

​Output:​

$Name
[1] "Sara"

$Age
[1] 21

$Scores
[1] 85 90 95

$Passed
[1] TRUE

Here, you can see that the list contains different data types: a string, a number, a vector, and a logical value.

​Accessing List Elements​

​There are multiple ways to access list elements:​

1. By name (using $):

my_list$Name

→ Returns "Sara"

2. By index (using [[ ]]):

my_list[[2]]

→ Returns 21 (the 2nd element)

3. By index (using [ ]):

my_list[2]

→ Returns a list containing the 2nd element, not just the value.

(Notice the difference between [ ] and [[ ]]: [ ] returns a sub-list, [[ ]] returns the actual element.)
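
The difference shows up as soon as the result is used. A minimal sketch with a shortened version of my_list (the names sub_list and value are illustrative):

```r
my_list <- list(Name = "Sara", Age = 21, Scores = c(85, 90, 95))

sub_list <- my_list[2]     # still a list, of length 1
value    <- my_list[[2]]   # the number itself

class(sub_list)
class(value)

value + 1                  # arithmetic works on the extracted value
# sub_list + 1 would stop with an error, because sub_list is a list

my_list[["Scores"]][2]     # [[ ]] by name, then ordinary vector indexing
```

A common rule of thumb: use [[ ]] or $ to work with one element, and [ ] to keep several elements together as a smaller list.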

​Modifying List Elements​

​Change a value:​

​my_list$Age <- 22​

​Add a new element:​

​my_list$City <- "Srinagar"​

​Remove an element:​

​my_list$Passed <- NULL​

​Combining Lists​

You can merge lists using the c() function:

list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)
combined_list <- c(list1, list2)
print(combined_list)

​Output:​

$a
[1] 1

$b
[1] 2

$c
[1] 3

$d
[1] 4

Nested Lists

Lists can contain other lists too!
Example:

​nested <- list(​


​student = list(Name = "John", Age = 22),​
​marks = c(88, 79, 93)​
​)​
​print(nested)​

​You can access nested elements like this:​

​nested$student$Name​

→ Returns "John"

​Useful List Functions​


Function       Description                      Example
length(list)   Number of elements in the list   length(my_list)
names(list)    Get or set names of elements     names(my_list)
str(list)      Structure of the list            str(my_list)
unlist(list)   Converts list into a vector      unlist(my_list)

​Unlisting a List​

You can flatten a list into a single vector using unlist().
Example:

scores_list <- list(A = 85, B = 90, C = 95)
unlist(scores_list)

Output:

 A  B  C
​85 90 95​
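
Because the result of unlist() is a vector, the usual type coercion applies when the list holds mixed types, and nested elements get combined names. A short sketch with illustrative lists mixed and nested:

```r
# Mixed types: everything becomes character in the flattened vector
mixed <- list(A = 85, B = "ninety", C = TRUE)
flat <- unlist(mixed)
typeof(flat)
print(flat)

# Nested list: names are built from the path to each value
nested <- list(x = c(1, 2), y = list(z = 3))
names(unlist(nested))
```

So unlist() is safest on lists whose elements already share one type, such as the scores above.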

​Real-World Analogy​

​Imagine a list as a​​backpack​​where you can put different​​things:​

​●​ ​Books (vectors)​

​●​ ​A lunch box (data frame)​

​●​ ​A water bottle (string)​

​●​ ​A notebook with notes (another list)​

​All different, yet stored together in one place — that’s how lists work in R!​

​🧭 Quick Summary​

​●​ ​Lists​​can store multiple data types, including other​​lists.​

● Created using the list() function.

● Access elements using $, [ ], or [[ ]].

● unlist() converts a list into a vector.

● Great for storing complex or hierarchical data (e.g., student info, model outputs, JSON-like data).

​Importing and Exporting Data​

One of R’s strongest features is its ability to import and export data from a wide variety of sources: text files, CSVs, Excel sheets, databases, and more. This allows you to bring external data into R for analysis and then export your results back out for reporting or sharing.

1. Importing Data

a) Importing CSV Files

CSV (Comma-Separated Values) files are the most common data format used for sharing datasets.

Function: read.csv()

Syntax:

read.csv(file, header = TRUE, sep = ",", stringsAsFactors = FALSE)

Example:

data <- read.csv("[Link]")
head(data)

​Explanation:​

● file → path to your CSV file

● header = TRUE → treats first row as column names

● sep = "," → separates columns using commas

● stringsAsFactors = FALSE → keeps text as characters instead of factors

Tip: Use setwd("path") to set your working directory before reading files.

Example:

setwd("C:/Users/Faisal/Documents/R_Projects")
data <- read.csv("[Link]")

b) Importing Text Files

If your data is stored in a plain text file (e.g., .txt), you can use read.table().

Syntax:

read.table(file, header = TRUE, sep = "\t")

Example:

text_data <- read.table("[Link]", header = TRUE, sep = "\t")

​c) Importing from URLs​

​You can even import data directly from the internet!​

​Example:​

url_data <- read.csv("[Link]")
head(url_data)

​2. Exporting Data​

​Once you’ve processed or analyzed your data, you can save it back to a file.​

​a) Exporting to CSV​

Function: write.csv()

Syntax:

write.csv(data, file, row.names = FALSE)

Example:

write.csv(data, "output_students.csv", row.names = FALSE)

Explanation:

● data → the data frame to be saved

● file → name or path of the output file

● row.names = FALSE → prevents row numbers from being written as an extra column

b) Exporting to Text File

Function: write.table()

Example:

write.table(data, "[Link]", sep = "\t", row.names = FALSE)

​3. Viewing and Understanding Imported Data​

​After importing, it’s a good idea to explore your data before analysis.​

​Common Functions:​

Function    Description            Example
head()      View first 6 rows      head(data)
tail()      View last 6 rows       tail(data)
str()       Structure of dataset   str(data)
summary()   Summary statistics     summary(data)
names()     Column names           names(data)
dim()       Dimensions             dim(data)

​4. Handling File Paths​

​If your files are not in your working directory, provide the​​full path​​:​

data <- read.csv("C:/Users/Faisal/Desktop/[Link]")

​To check your current working directory:​

​getwd()​

​To change it:​

​setwd("C:/Users/Faisal/Documents")​

​5. Reading Other Formats​

​R can also handle other common formats with the right packages:​

File Type       Package    Function
Excel (.xlsx)   readxl     read_excel()
SPSS (.sav)     haven      read_sav()
JSON            jsonlite   fromJSON()
XML             xml2       read_xml()

​Example (Excel):​

l​ibrary(readxl)​
​data <- read_excel("students_data.xlsx")​

🧭 Quick Summary

✅ Importing Data → read.csv(), read.table(), read_excel()
✅ Exporting Data → write.csv(), write.table()
✅ Check Data → head(), str(), summary()
✅ Working Directory → getwd(), setwd()
✅ File Formats Supported → CSV, TXT, Excel, JSON, XML, Databases

Real-World Example:
In a data analysis project, you might:

1. Import raw survey results from a .csv file

​2.​ ​Clean and transform the data​

​3.​ ​Export the final processed data for visualization or reporting​
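
The three steps above can be sketched end to end. Everything in this example is made up for illustration: the survey columns, the cleaning rule, and the use of tempfile() so that no real file paths are needed.

```r
# Stand-in for a raw survey file; tempfile() keeps the example self-contained
raw_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Rating", "Ali,21,4", "Sara,,5", "John,20,3"), raw_path)

# 1. Import
survey <- read.csv(raw_path, stringsAsFactors = FALSE)

# 2. Clean: drop rows with a missing Age, then add a derived column
clean <- survey[!is.na(survey$Age), ]
clean$HighRating <- clean$Rating >= 4

# 3. Export
out_path <- tempfile(fileext = ".csv")
write.csv(clean, out_path, row.names = FALSE)

# Reading the export back confirms what was written
check <- read.csv(out_path)
print(check)
```

In a real project the two tempfile() paths would be replaced by the actual input and output file names.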

​Importing Data from Excel​

While CSV files are the most common way to handle tabular data, Excel files (.xls or .xlsx) are equally popular, especially in business, research, and academic settings. R doesn’t read Excel files natively, but with the help of a few packages, it becomes very easy.

1. Using the readxl Package

The readxl package (developed by RStudio) is the most commonly used for importing Excel data. It works fast, doesn’t require Excel to be installed, and supports both .xls and .xlsx files.

Installing and Loading the Package

You need to install it once and then load it into R.

install.packages("readxl")
library(readxl)

Reading Excel Files

Use the read_excel() function.

​Syntax:​

​read_excel(path, sheet = 1, range = NULL, col_names = TRUE)​

​Example:​

l​ibrary(readxl)​
​students <- read_excel("students_data.xlsx")​
​head(students)​

​Explanation:​

● path → file path of your Excel sheet

● sheet → specify sheet name or index (default is first sheet)

● range → optional cell range like "A1:D10"

● col_names → if TRUE, first row is treated as column headers

​Example with Sheet Name​

​If your Excel workbook has multiple sheets:​

​students_scores <- read_excel("students_data.xlsx", sheet = "Scores")​

​You can even list all sheets using:​

​excel_sheets("students_data.xlsx")​

​2. Reading Specific Ranges​

​If you only want a specific part of the data:​

​marks <- read_excel("students_data.xlsx", range = "A1:C10")​

​→ Reads data from columns A to C and rows 1 to 10.​

​3. Viewing Imported Data​

​After importing, you can check what’s inside using:​

​ ead(students)​
h
​str(students)​
​summary(students)​

4. Using the openxlsx Package

Another popular package is openxlsx, which can both read and write Excel files without requiring external dependencies.

Install and Load

install.packages("openxlsx")
library(openxlsx)

Read Excel File

students <- read.xlsx("students_data.xlsx", sheet = 1)

Write Data Back to Excel

write.xlsx(students, "new_students.xlsx")

​5. Handling Excel File Paths​

​If your Excel file isn’t in your current working directory:​

​students <- read_excel("C:/Users/Faisal/Documents/R_Projects/students_data.xlsx")​

​To check or set your working directory:​

​ etwd()​
g
​setwd("C:/Users/Faisal/Documents/R_Projects")​

​6. Common Issues & Tips​

● ❗ File not found? Check file path and working directory.

● ⚠️ Column types mismatched? Use the col_types argument in read_excel().

● 🧾 Multiple sheets? Loop through them using lapply() or manually specify.

Example (read all sheets):

sheets <- excel_sheets("students_data.xlsx")
all_data <- lapply(sheets, read_excel, path = "students_data.xlsx")

​7. Exporting Data to Excel​

​You can save your processed data back to an Excel file using:​

library(openxlsx)
write.xlsx(students, "output_students.xlsx")

​🧭 Quick Summary​
Function         Package    Purpose
read_excel()     readxl     Read Excel files (.xls, .xlsx)
excel_sheets()   readxl     List all sheet names
read.xlsx()      openxlsx   Read Excel data
write.xlsx()     openxlsx   Write data to Excel file

✅ Key Takeaways

● Use readxl for fast Excel reading.

● Use openxlsx if you need both read/write capabilities.

● Always check data structure after importing using str() or head().

● Perfect for handling data from Excel-based reports, financial sheets, and surveys.

​Accessing Databases​

When working with large datasets, it’s often not practical to store all your data in files like CSV or Excel. Instead, data is usually stored in databases such as MySQL, PostgreSQL, or SQLite.
R can connect to these databases, run SQL queries, and import or export data directly, allowing smooth integration between R and database systems.

​1. Why Connect R to a Database?​

​●​ ​To handle​​large datasets​​efficiently​

​●​ ​To​​query specific data​​instead of loading the whole​​dataset​

​●​ ​To​​update, insert, or delete​​records directly from​​R​

​●​ ​To perform​​statistical analysis or visualization​​on​​live data​

​2. Database Connection in R​

​R provides multiple packages for connecting to databases. The most common are:​

Database            Package       Function
MySQL               RMySQL        dbConnect()
PostgreSQL          RPostgreSQL   dbConnect()
SQLite              RSQLite       dbConnect()
General interface   DBI           dbConnect()

The DBI package provides a common interface for working with any database package.

3. Installing Required Packages

You need to install and load the required packages depending on your database.

install.packages("DBI")
install.packages("RSQLite")  # for SQLite
install.packages("RMySQL")   # for MySQL
library(DBI)

​4. Connecting to a Database​

​Let’s see examples for the most common databases.​

​a) SQLite Database​

​SQLite is lightweight and perfect for practice or small projects.​

​Example:​

​library(DBI)​

# Connect to a SQLite database file
con <- dbConnect(RSQLite::SQLite(), "student_database.sqlite")

# List all tables in the database
dbListTables(con)

​If the file doesn’t exist, R will create it automatically.​

​b) MySQL Database​

​You can also connect to remote or local MySQL servers.​

​Example:​

​library(DBI)​

con <- dbConnect(
  RMySQL::MySQL(),
  dbname = "university",
  host = "localhost",
  port = 3306,
  user = "root",
  password = "your_password"
)

​5. Running SQL Queries from R​

​Once connected, you can run SQL queries directly using R functions.​

​a) Reading Data​

​ ata <- dbGetQuery(con, "SELECT * FROM students WHERE marks > 80;")​
d
​head(data)​

​b) Writing Data​

​dbWriteTable(con, "new_students", data)​

​c) Listing All Tables​

​dbListTables(con)​

​d) Reading a Whole Table​

​students <- dbReadTable(con, "students")​

​e) Removing a Table​

​dbRemoveTable(con, "old_data")​

​6. Disconnecting from Database​

​Always close the connection after use:​

​dbDisconnect(con)​
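
The calls from this section can be tried without creating any file by using an in-memory SQLite database. This is only a sketch: it assumes the DBI and RSQLite packages are installed, and the table, columns and threshold are made up. It also passes the threshold through the params argument of dbGetQuery() instead of pasting it into the SQL text.

```r
library(DBI)

# ":memory:" asks SQLite for a temporary database that vanishes on disconnect
con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(con, "students",
             data.frame(name = c("Ali", "Sara", "John"), marks = c(85, 90, 78)))

# The ? placeholder is filled from params
min_marks <- 80
top <- dbGetQuery(con, "SELECT name, marks FROM students WHERE marks > ?",
                  params = list(min_marks))
print(top)

tables <- dbListTables(con)
dbDisconnect(con)
```

Placeholders keep values separate from the query text, which matters when the value comes from user input.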

​7. Example Workflow​

​Here’s a full workflow connecting R to a SQLite database:​

library(DBI)
library(RSQLite)

# Connect
con <- dbConnect(RSQLite::SQLite(), "[Link]")

# Create a table
data <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 90, 78))
dbWriteTable(con, "Students", data)

# Query the table
result <- dbGetQuery(con, "SELECT * FROM Students WHERE Marks > 80")
print(result)

# Disconnect
dbDisconnect(con)

​Explanation:​

​1.​ ​Connects to a local SQLite database​

​2.​ ​Creates a new table “Students”​

​3.​ ​Retrieves records where marks > 80​

​4.​ ​Closes the connection​

​8. Real-World Applications​

​●​ U
​ niversities store student records in MySQL or PostgreSQL — you can directly fetch​
​data for analysis.​

​●​ ​Companies analyze sales or customer data stored in corporate databases.​

​●​ ​Research projects use databases to store large datasets for reproducibility.​

​🧭 Quick Summary​
Function          Purpose
dbConnect()       Connect to a database
dbListTables()    List all tables
dbReadTable()     Read an entire table
dbGetQuery()      Run SQL query
dbWriteTable()    Write data frame into database
dbRemoveTable()   Delete a table
dbDisconnect()    Close the connection

✅ Key Points

● Use DBI + database-specific packages (RSQLite, RMySQL, etc.)

● You can read and write data frames directly as tables.

● Always close connections with dbDisconnect().

​●​ ​Ideal for​​large datasets​​and​​dynamic data retrieval​​.​

​Saving Data in R​

When you’re working on a project in R, you often need to save your data so that you can reuse it later without re-running all your code. R provides multiple ways to store your data, from saving single variables to entire workspaces.

​Let’s explore each method step by step.​

​1. Why Save Data?​

​Saving data allows you to:​

​●​ ​Avoid re-running time-consuming operations.​

​●​ ​Reuse data in future sessions.​

​●​ ​Share data with others.​

​●​ ​Keep a record of your analysis or experiments.​

​2. Saving Individual Objects​

You can save specific variables, data frames, or vectors using the save() function.

Syntax:

save(object1, object2, ..., file = "[Link]")

Example:

x <- 10

y <- c(2, 4, 6, 8)

data <- data.frame(Name = c("Ali", "Sara"), Age = c(21, 22))

save(x, y, data, file = "[Link]")

Explanation:
This saves the variables x, y, and data in one file named [Link].
Later, you can load this file to restore those objects.

​3. Saving the Entire Workspace​

Sometimes you may want to save everything currently in your R session (all variables, functions, etc.).

Use:

save.image(file = "[Link]")

Explanation:
This command saves the entire workspace to a file.
By default, if you just run save.image(), R saves it as .RData in your current working directory.

When you reopen R in the same working directory, it automatically loads this .RData file, restoring the objects from your previous session.

4. Saving Data Frames as CSV Files

If you want to store data in a simple text format that can be read by other software (like Excel), use write.csv().

Example:

students <- data.frame(Name = c("John", "Aisha", "Ravi"), Marks = c(85, 90, 78))

write.csv(students, "[Link]", row.names = FALSE)

Explanation:
This saves the data frame students into a CSV file without row numbers.

​Key Parameters:​

● row.names = FALSE → avoids saving unnecessary row numbers

● sep → write.csv() always uses a comma and ignores sep (with a warning); to use another separator (like ; or \t), call write.table(), or write.csv2() for semicolons

5. Saving Text Files

You can also save plain text or vector data using write() or write.table().

Example:

numbers <- c(1, 2, 3, 4, 5)

write(numbers, file = "[Link]")

Explanation:
This saves the vector values into a simple text file named [Link].

6. Saving Data in RDS Format

The saveRDS() and readRDS() functions are useful when you want to save a single R object.

Example:

data <- data.frame(City = c("Srinagar", "Delhi", "Mumbai"), Temp = c(12, 25, 30))

saveRDS(data, "weather_data.rds")

To load it later:

loaded_data <- readRDS("weather_data.rds")

Difference from save():

● save() can store multiple objects.

● saveRDS() is meant for one object only, and you must assign it to a variable when reloading.

​7. Saving Plots or Graphs​

​If you’ve created a plot, you can save it using functions like:​

● png("[Link]")

● pdf("[Link]")

● jpeg("[Link]")

Example:

png("[Link]")

hist(c(2,4,6,8,10))

dev.off()

Explanation:

● png() starts saving the next plot as an image file.

● dev.off() stops the saving process.

​8. Real-World Use Case​

I​magine you’ve cleaned and processed a large dataset of​​COVID-19 cases​​.​


​Instead of cleaning it again every time, you can save the cleaned data using:​

​save(cleaned_data, file = "covid_clean.RData")​

​Next time, you can just load it instantly:​

​load("covid_clean.RData")​

​This saves both​​time​​and​​effort​​.​

​🧭 Quick Summary​

Function        Purpose
save()          Save specific objects
save.image()    Save all objects (workspace)
write.csv()     Save data frame as CSV
write.table()   Save data in tabular text format
saveRDS()       Save a single R object
readRDS()       Read a saved R object
png(), pdf()    Save plots or graphs

✅ Tips

● Always specify the correct file path before saving.

● Prefer .RData or .RDS for R-only projects, and .csv for sharing data with others.

● Use descriptive file names for easy identification.

​●​ ​Use descriptive file names for easy identification.​

​Loading R Data Objects​

After saving your work in R (like datasets, variables, or entire sessions), you’ll eventually need to load it back to continue your analysis. R provides simple functions to restore saved data and make it available again in your current session.

​Let’s explore how it works step by step.​

​1. Why Load Data?​

When you reopen R or RStudio, your workspace starts empty. If you want to reuse previously saved data, you must load it.
Loading data helps you:

● Restore variables and data frames quickly

● Continue your analysis from where you left off

● Avoid rerunning data-cleaning or preparation steps

2. Loading .RData or .rda Files

If you used the save() or save.image() function to store your workspace or selected objects, use the load() function to bring them back.

Syntax:

load("[Link]")

Example:

load("[Link]")

Explanation:
This will restore all the objects (x, y, data, etc.) that were saved inside [Link].
You can now use them directly, with no need to assign them to new variables.

​3. Loading RDS Files​

If you saved a single object using saveRDS(), you must use readRDS() to load it.

Syntax:

object_name <- readRDS("[Link]")

Example:

weather_data <- readRDS("weather_data.rds")

Explanation:
Unlike load(), this function doesn’t automatically create the object in your workspace; you decide what name to assign it.
​This makes it more flexible and safer when dealing with multiple datasets.​
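
The contrast between load() and readRDS() can be seen in a small round trip. This sketch uses tempfile() so it does not touch any real files; the object and path names are illustrative:

```r
scores <- c(85, 90, 78)

rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

save(scores, file = rdata_path)   # stores the object together with its name
saveRDS(scores, rds_path)         # stores only the object

rm(scores)

# load() recreates the object under its original name
# and returns the names it restored
restored_names <- load(rdata_path)
print(restored_names)

# readRDS() returns the object, so the name is chosen at load time
exam_scores <- readRDS(rds_path)
identical(exam_scores, scores)
```

Because load() writes objects under their saved names, it can silently replace an existing variable with the same name; readRDS() avoids that.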

4. Loading CSV or Text Files

If you saved data in CSV format, use read.csv() to load it.

Example:

students <- read.csv("[Link]")

Explanation:
This reads the file [Link] and stores it in a data frame named students.
You can then view it using:

​head(students)​

​5. Loading Excel Files​

For Excel files (saved as .xlsx or .xls), use the readxl package.

​Example:​

​library(readxl)​

​marks <- read_excel("student_marks.xlsx")​

Explanation:
​This reads data directly from Excel sheets.​
​You can specify sheet names if your Excel file has multiple sheets:​

​read_excel("student_marks.xlsx", sheet = "Semester1")​

​6. Loading Text Files​

If you saved plain text data (like .txt files), use read.table() or read.delim().

Example:

numbers <- read.table("[Link]")

or

data <- read.delim("[Link]")

Explanation:
These functions can handle tab-separated or space-separated text files easily.

​7. Loading Complete Workspaces Automatically​

When R starts, it can automatically load the previous session if a .RData file is found in the working directory.

Example:
If you closed R with a file named .RData in your working directory, R will automatically load it on restart.

8. Real-World Example

Imagine you analyzed customer data last week and saved it as customers_cleaned.RData.
Next week, you can simply type:

​load("customers_cleaned.RData")​

​All your cleaned data frames, summary tables, and variables come back — ready for use!​

​🧭 Quick Summary​

File Type        Function                      Description
.RData or .rda   load()                        Loads all saved R objects
.rds             readRDS()                     Loads one saved R object
.csv             read.csv()                    Loads data from CSV file
.xlsx            read_excel()                  Loads data from Excel file
.txt             read.table() / read.delim()   Loads data from text files

✅ Tips

● Always check your working directory using getwd() before loading files.

● Use file.exists() to confirm if the file exists in the directory.

● Use str(object_name) to inspect loaded data and ensure it’s in the expected format.

​Writing to Files in R​

Once you’ve created, cleaned, or analyzed data in R, you often need to export it, whether to share it with others, to use it in other software like Excel, or to keep it for future reference.
This process is called writing data to files, and R provides several convenient functions to handle it.

​Let’s explore them step-by-step.​

​1. Why Write Data to Files?​

​Writing data to files allows you to:​

​●​ ​Save your analysis results permanently​

​●​ ​Share output with collaborators​

​●​ ​Transfer data between R and other tools (like Python, Excel, or SQL)​

​●​ ​Keep backups of important data frames​

​2. Writing Data Frames to CSV Files​

​ he most common way to export data is to use​​CSV (Comma-Separated​​Values)​​format,​


T
​because it’s supported by almost every data tool.​

[Link]()​
​Function:​​

​Syntax:​

write.csv(object, file = "[Link]", row.names = TRUE/FALSE)

Example:

students <- data.frame(
  Name = c("Aisha", "Ravi", "Mehak"),
  Marks = c(88, 92, 79)
)

write.csv(students, "[Link]", row.names = FALSE)

Explanation:

● The data frame students is written to a file called [Link].

● row.names = FALSE prevents R from adding row numbers as an extra column.

Check the file:
You can find it in your working directory (getwd()).
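To see what the row-names argument actually changes, a small round trip helps. This is a sketch with made-up file names written to a temporary directory; it writes the same data frame twice and reads both files back.

```r
students <- data.frame(
  Name  = c("Aisha", "Ravi", "Mehak"),
  Marks = c(88, 92, 79)
)

# Hypothetical file names in a temporary directory
with_rn    <- file.path(tempdir(), "with_row_names.csv")
without_rn <- file.path(tempdir(), "without_row_names.csv")

write.csv(students, with_rn)                        # row names are written by default
write.csv(students, without_rn, row.names = FALSE)  # row names are left out

names(read.csv(with_rn))     # an extra first column holds the row numbers
names(read.csv(without_rn))  # only Name and Marks
```

The extra column is harmless inside R but is usually unwanted when the file is opened in Excel or another tool.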

​3. Writing Data with Custom Delimiters​

If your system or collaborators prefer a different separator (like a semicolon or tab), you can use write.table().

Syntax:

write.table(object, file = "[Link]", sep = "\t", row.names = FALSE)

Example:

write.table(students, "students_tab.txt", sep = "\t", row.names = FALSE)

Explanation:
This saves the data with tab-separated columns instead of commas.
This is useful when working with text editors or systems that expect tabular text.

​4. Writing Data to Excel Files​

If you want to write data directly into Excel sheets, use the writexl package.

​Example:​

​library(writexl)​

​write_xlsx(students, "students_data.xlsx")​

Explanation:
This creates an Excel file with one sheet containing your data frame.
You can easily open it in Excel or Google Sheets.

​5. Writing Text or Vector Data​

If you want to write plain text (like a vector of names, numbers, or results), use the write() function.

​Example:​

​names <- c("Ali", "Sara", "John")​

​write(names, file = "[Link]")​

Explanation:
This writes each value from the character vector names into a text file, one per line. For numeric vectors, write() puts five values on each line unless you set ncolumns.

​6. Writing Multiple Data Frames in One File​

You can append multiple data frames to the same file using the append = TRUE argument in write.table().

​Example:​

df1 <- data.frame(ID = 1:3, Score = c(90, 85, 88))
df2 <- data.frame(ID = 4:6, Score = c(78, 91, 83))

write.table(df1, "[Link]", sep = ",", row.names = FALSE)
write.table(df2, "[Link]", sep = ",", row.names = FALSE, append = TRUE, col.names = FALSE)

​Explanation:​

​●​ ​The first command writes the first data frame.​

​●​ ​The second appends the second one without repeating the column names.​
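A quick way to confirm that the append worked is to read the file back and count the rows. The sketch below repeats the two writes with a made-up file name in a temporary directory.

```r
df1 <- data.frame(ID = 1:3, Score = c(90, 85, 88))
df2 <- data.frame(ID = 4:6, Score = c(78, 91, 83))

path <- file.path(tempdir(), "scores_demo.csv")  # hypothetical file name

# First write creates (or overwrites) the file and includes the header
write.table(df1, path, sep = ",", row.names = FALSE)

# Second write adds rows to the end and leaves the header out
write.table(df2, path, sep = ",", row.names = FALSE,
            append = TRUE, col.names = FALSE)

combined <- read.csv(path)
nrow(combined)   # one row per original record if no header was repeated
combined$ID
```

If col.names = FALSE were forgotten in the second call, the header would appear again in the middle of the file and be read as data.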

​7. Writing Binary R Objects​

To save data in a compact format only readable by R, use:

● save() for multiple objects

● saveRDS() for one object

​Example:​

​saveRDS(students, "students_data.rds")​

This method is usually faster and produces smaller files than CSV, and it keeps column types exactly as they were.
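The difference between the two functions is easiest to see in a round trip. In this sketch the file names and the threshold variable are made up for illustration.

```r
students <- data.frame(
  Name  = c("Aisha", "Ravi", "Mehak"),
  Marks = c(88, 92, 79)
)

# saveRDS()/readRDS(): one object, restored under any name you choose
rds_path <- file.path(tempdir(), "students_demo.rds")
saveRDS(students, rds_path)
restored <- readRDS(rds_path)
identical(restored, students)

# save()/load(): several objects, restored under their original names
threshold <- 80
rdata_path <- file.path(tempdir(), "session_demo.RData")
save(students, threshold, file = rdata_path)
rm(threshold)        # remove it to show that load() brings it back
load(rdata_path)
threshold
```

Because load() silently overwrites objects with the same names, readRDS() is often the safer choice when only one object is needed.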

​8. Writing Output to a Text File​

You can also write console output (like printed summaries or results) to a text file using sink().

​Example:​

​sink("[Link]")​

​summary(students)​

​sink()​

Explanation:
Everything between the two sink() calls is redirected to [Link] instead of printing to the console.
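One caveat with sink(): if an error occurs before the second call, output stays redirected until sink() is called again. As an alternative sketch, base R's capture.output() writes the printed form of an expression to a file in a single call. The file name below is made up.

```r
students <- data.frame(
  Name  = c("Aisha", "Ravi", "Mehak"),
  Marks = c(88, 92, 79)
)

out_path <- file.path(tempdir(), "summary_demo.txt")  # hypothetical file name

# Writes what summary(students) would print into the file
capture.output(summary(students), file = out_path)

readLines(out_path)  # inspect what was written
```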

​9. Real-World Use Case​

Imagine you cleaned a large survey dataset in R and now need to send it to your team, who use Excel.
You can simply export it as:

write.csv(cleaned_survey, "final_survey_data.csv", row.names = FALSE)

​Now it’s easy to share and open anywhere!​

​🧭 Quick Summary​

| Function | Purpose |
|----------|---------|
| write.csv() | Write data frame to CSV file |
| write.table() | Write data with custom separators |
| write_xlsx() | Write data to Excel file |
| write() | Write simple vectors or text |
| saveRDS() | Save one R object (R’s own format) |
| sink() | Redirect console output to a file |

✅ Tips

● Use descriptive filenames like "sales_report_2025.csv".

● Always check getwd() to know where your file is saved.

● When sharing with non-R users, prefer .csv or .xlsx.

● For reproducibility, prefer .rds if only R users will use it.

​Data Cleaning and Preparation​

Before analyzing or visualizing data, you need to clean and prepare it properly. Real-world datasets often contain missing values, duplicates, inconsistent formats, or irrelevant entries, all of which can lead to misleading results if ignored.

​In this topic, we’ll explore how to​​handle missing​​values​​and​​filter data​​effectively in R.​

​1. What Is Data Cleaning?​

Data cleaning is the process of detecting and correcting errors or inconsistencies in data to improve its quality.
It ensures that your dataset is:

​●​ ​Accurate​​— no wrong or inconsistent values​

​●​ ​Complete​​— missing data handled properly​

​●​ ​Consistent​​— same formats for all columns​

​●​ ​Useful​​— only relevant rows and columns kept​

​2. Handling Missing Values​


Missing values are very common in datasets.
In R, they are represented as NA (Not Available).

2.1 Detecting Missing Values

Use the is.na() function to check for missing data.

​Example:​

​data <- c(10, 20, NA, 15, NA)​

is.na(data)

​Output:​

​[1] FALSE FALSE TRUE FALSE TRUE​

Explanation:
TRUE indicates missing values at those positions.

​To count how many missing values are in your dataset:​

sum(is.na(data))

​Output:​

​2​

​That means there are two missing entries.​

​2.2 Removing Missing Values​

​If you want to remove all missing values:​

clean_data <- na.omit(data)

Explanation:
This removes all NA values from a vector. Applied to a data frame, it removes every row that contains at least one NA.

Alternatively, you can use:

data[!is.na(data)]

​This returns only the non-missing values.​

​2.3 Replacing Missing Values​

Sometimes, deleting missing values isn’t ideal.
You can replace them with a constant value, the mean, median, or mode of the column.

​Example:​

​data <- c(10, 20, NA, 30)​

data[is.na(data)] <- mean(data, na.rm = TRUE)

Explanation:

● na.rm = TRUE ignores missing values while calculating the mean.

● Missing values are replaced by that mean value.

Output:

[1] 10 20 20 30
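The mean is sensitive to extreme values, so the median is sometimes a better fill value. The sketch below uses made-up numbers with one unusually large entry to show the difference.

```r
scores <- c(10, 20, NA, 30, 100)   # 100 is an unusually large value

mean(scores, na.rm = TRUE)    # pulled upward by the 100
median(scores, na.rm = TRUE)  # less affected by the 100

# Fill the gap with the median instead of the mean
scores[is.na(scores)] <- median(scores, na.rm = TRUE)
scores
```

Which statistic to use depends on the data; neither choice is correct in every case.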

​2.4 Handling Missing Values in Data Frames​

​If your dataset has multiple columns:​

df <- data.frame(
  Name = c("Ali", "Sara", "John"),
  Marks = c(85, NA, 90),
  Age = c(21, 22, NA)
)

To find missing values:

colSums(is.na(df))

Output:

 Name Marks   Age
    0     1     1

To fill missing numeric values with the column mean:

df$Marks[is.na(df$Marks)] <- mean(df$Marks, na.rm = TRUE)
df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
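With many numeric columns, repeating one line per column becomes tedious. A minimal sketch of the same idea as a loop over every numeric column, using the example data frame from above:

```r
df <- data.frame(
  Name  = c("Ali", "Sara", "John"),
  Marks = c(85, NA, 90),
  Age   = c(21, 22, NA)
)

# Names of the columns that hold numbers
numeric_cols <- names(df)[sapply(df, is.numeric)]

for (col in numeric_cols) {
  missing <- is.na(df[[col]])
  # Replace only the missing entries with that column's mean
  df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
}

colSums(is.na(df))  # every count should now be zero
```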

​3. Filtering Data​

Filtering means selecting only the rows that meet certain conditions.
This helps you focus on the relevant part of your dataset.

3.1 Basic Filtering with subset()

​Syntax:​

​subset(data_frame, condition)​

​Example:​

students <- data.frame(
  Name = c("Aisha", "Ravi", "Mehak", "Ali"),
  Marks = c(85, 90, 75, 60)
)

high_scorers <- subset(students, Marks > 80)

Explanation:
Only students with marks greater than 80 are selected.

3.2 Filtering Using Logical Operators

You can use logical operators like & (AND), | (OR), and ! (NOT).

​Example:​

​subset(students, Marks > 70 & Marks < 90)​

Explanation:
This filters rows where marks are strictly between 70 and 90 (70 and 90 themselves are excluded).

​3.3 Filtering with dplyr Package​

The dplyr package provides a cleaner way to filter data using the filter() function.

Example:

library(dplyr)

students <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 60, 95))

filtered_data <- filter(students, Marks > 80)

​Explanation:​

● The filter() function is easier to read and use for data analysis workflows.

● It works seamlessly with pipelines (%>%).

​Example using pipe:​

​students %>% filter(Marks >= 85)​

​4. Real-World Example​


​Imagine a hospital dataset:​

patients <- data.frame(
  Name = c("Aisha", "Ravi", "Mehak", "Ali"),
  Age = c(25, 30, NA, 28),
  Status = c("Recovered", "Recovered", "Sick", "Recovered")
)

To remove rows with missing data:

patients_clean <- na.omit(patients)

To keep only “Recovered” patients:

recovered <- subset(patients_clean, Status == "Recovered")

​This gives you a​​clean and filtered dataset​​ready​​for analysis.​

​🧭 Quick Summary​

| Concept | Function | Description |
|---------|----------|-------------|
| Check missing values | is.na() | Detects missing values |
| Remove missing values | na.omit() | Deletes rows with NAs |
| Replace missing values | data[is.na(data)] <- value | Fills missing entries |
| Count missing values | sum(is.na(data)) | Counts number of NAs |
| Filter data (base R) | subset() | Select rows based on conditions |
| Filter data (dplyr) | filter() | Clean, readable syntax for filtering |

​✅​​Tips​

​●​ ​Always check missing data before analysis.​

​●​ ​Replacing with mean or median is common for numeric columns.​

​●​ ​Use filtering to remove outliers or irrelevant records.​

​●​ ​Combine cleaning steps in a function for reusability.​
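The last tip can be sketched with the hospital example. The function name clean_patients and its keep_status argument are hypothetical, chosen only for this illustration.

```r
# Hypothetical helper: drop incomplete rows, then keep one status
clean_patients <- function(data, keep_status = "Recovered") {
  complete <- na.omit(data)                 # remove rows containing any NA
  subset(complete, Status == keep_status)   # keep only the requested status
}

patients <- data.frame(
  Name   = c("Aisha", "Ravi", "Mehak", "Ali"),
  Age    = c(25, 30, NA, 28),
  Status = c("Recovered", "Recovered", "Sick", "Recovered")
)

recovered <- clean_patients(patients)
recovered$Name
```

Wrapping the steps in a function means the same cleaning can be applied to next month's file with one call.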

UNIT 3: DATA MANIPULATION
​1.​ ​Data manipulation techniques​

​2.​ ​Selecting rows/observations​

​3.​ ​Selecting columns/fields​

​4.​ ​Merging data​

​5.​ ​Relabeling column names​

​6.​ ​Reshaping data​

​7.​ ​Centering, scaling, and normalizing data values​

​8.​ ​Converting variable types​

​9.​ ​Data sorting​

​10.​​Data aggregation​

​Data Manipulation Techniques​

Once your data is cleaned and ready, the next important step is manipulating it: organizing, transforming, and restructuring the data so it becomes suitable for analysis or visualization.

In R, data manipulation is one of the most frequently performed tasks, and packages like dplyr and tidyr from the tidyverse collection make it both powerful and easy.

​Let’s understand what data manipulation means and how to perform it efficiently in R.​

​1. What Is Data Manipulation?​


Data manipulation means modifying the structure, format, or values in a dataset to make it easier to work with.
It includes tasks like:

​●​ ​Selecting or rearranging rows and columns​

​●​ ​Adding or removing data​

​●​ ​Renaming columns​

​●​ ​Sorting, merging, and reshaping data​

​●​ ​Applying mathematical or logical operations to transform values​

​Basically, it’s the process of​​preparing your data​​for analysis​​.​

​2. Tools for Data Manipulation in R​


​You can manipulate data using:​

1. Base R functions: built-in commands like subset(), merge(), order(), etc.

2. dplyr package: a modern, human-friendly toolkit designed specifically for data manipulation.

The dplyr package provides functions that make code cleaner, faster, and more readable.

3. The Core dplyr Functions
Here are the most important verbs (functions) in dplyr, often called the “grammar of data manipulation”:

| Function | Purpose |
|----------|---------|
| select() | Choose specific columns |
| filter() | Select rows based on conditions |
| arrange() | Sort rows |
| mutate() | Add or modify columns |
| summarise() | Create summary statistics |
| group_by() | Group data for analysis |

These functions can be used individually or combined using the pipe operator (%>%), which passes the result of one function to the next.

​4. Example Dataset​


​Let’s create a small dataset to work with:​

library(dplyr)
students <- data.frame(
  Name = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age = c(20, 21, 22, 20, 23)
)

​5. Selecting Specific Columns​


Use select() to keep only certain columns.

​Example:​

​select(students, Name, Marks)​

Explanation:
Keeps only the Name and Marks columns, excluding others.

​6. Filtering Rows​


Use filter() to choose rows that meet a condition.

​Example:​

​filter(students, Marks > 85)​

Explanation:
Displays only students who scored more than 85 marks.

​You can also use logical operators:​

​filter(students, Marks > 80 & Age < 22)​

​This filters students with​​Marks > 80​​and​​Age < 22​​.​

​7. Arranging (Sorting) Data​


Use arrange() to sort rows based on one or more columns.

​Example:​

​arrange(students, Marks)​

Explanation:
Sorts students in ascending order of marks.

​To sort in descending order:​

​arrange(students, desc(Marks))​

8. Creating or Modifying Columns
Use mutate() to add new columns or modify existing ones.

​Example:​

​mutate(students, Grade = ifelse(Marks >= 90, "A", "B"))​

Explanation:
Adds a new column Grade: if Marks ≥ 90, it assigns “A”; otherwise “B”.
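ifelse() handles two outcomes. For more than two grade bands, base R's cut() is one option (dplyr's case_when() is another). The sketch below uses assumed bands that are not part of the original example: below 80 is "C", 80 to 89 is "B", and 90 or above is "A".

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95)
)

# Assumed grade bands, chosen only for illustration
students$Grade <- cut(
  students$Marks,
  breaks = c(-Inf, 80, 90, Inf),
  labels = c("C", "B", "A"),
  right  = FALSE   # intervals are [lower, upper), so exactly 90 falls in "A"
)

students
```

The same call works inside mutate(), for example mutate(students, Grade = cut(Marks, ...)).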

​9. Summarizing Data​


Use summarise() (or summarize()) to compute summary statistics.

​Example:​

​summarise(students, Avg_Marks = mean(Marks))​

​Output:​

  Avg_Marks
1      87.4

Explanation:
Calculates the average marks of all students.

​10. Grouping Data​


Use group_by() to group data based on one or more columns, often combined with summarise().

​Example:​

grouped_data <- students %>%
  mutate(Gender = c("M", "F", "M", "F", "M")) %>%
  group_by(Gender) %>%
  summarise(Average = mean(Marks))

Explanation:
This adds a Gender column, groups data by gender, and calculates the average marks for each group.
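summarise() can compute several statistics per group in one call. This sketch assumes the dplyr package is installed and reuses the illustrative Gender values from the example above.

```r
library(dplyr)

students <- data.frame(
  Name   = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks  = c(85, 92, 76, 89, 95),
  Gender = c("M", "F", "M", "F", "M")   # illustrative values
)

stats <- students %>%
  group_by(Gender) %>%
  summarise(
    Count   = n(),          # number of rows in each group
    Average = mean(Marks),  # mean marks in each group
    Highest = max(Marks)    # best score in each group
  )

stats
```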

11. Using the Pipe Operator (%>%)
The pipe makes code easier to read by chaining multiple steps.

Example:

students %>%
  filter(Marks > 80) %>%
  select(Name, Marks) %>%
  arrange(desc(Marks))

​Explanation:​

​●​ ​Filters students scoring more than 80​

​●​ ​Selects only their names and marks​

​●​ ​Sorts them in descending order of marks​

This is a clean, natural way to write a sequence of operations, almost like reading a sentence.

​12. Real-World Example​


​Imagine you have a sales dataset:​

sales <- data.frame(
  Region = c("North", "South", "East", "West", "North"),
  Sales = c(500, 400, 300, 450, 600)
)

You can find the average sales by region:

sales %>%
  group_by(Region) %>%
  summarise(Average_Sales = mean(Sales))

Output:

# A tibble: 4 × 2
  Region Average_Sales
  <chr>          <dbl>
1 East             300
2 North            550
3 South            400
4 West             450

​🧭 Quick Summary​
| Task | Function | Description |
|------|----------|-------------|
| Select columns | select() | Choose specific variables |
| Filter rows | filter() | Keep only rows meeting conditions |
| Sort data | arrange() | Order rows ascending/descending |
| Add new columns | mutate() | Create or modify variables |
| Summarize values | summarise() | Compute mean, sum, etc. |
| Group data | group_by() | Group by categories |
| Combine steps | %>% | Chain operations smoothly |

​✅​​Tips​

● Load dplyr before use: library(dplyr)

● Use %>% for clean and readable workflows

● Always check results using head() or View()

● Combine group_by() and summarise() for quick insights

​Selecting Rows/Observations​

Selecting rows or observations means choosing specific records from a dataset that satisfy a certain condition. In R, this can be done using base R techniques or using the dplyr package, which makes row selection more readable and powerful.

​1. Why Select Rows?​


​You often need to select rows to:​

​●​ ​Focus on a subset of data​

​●​ ​Remove unwanted or irrelevant entries​

​●​ ​Analyze data that meets certain conditions​

​●​ ​Prepare smaller datasets for testing or plotting​

For example, from a dataset of students, you might want to select only those with marks above 80 or those belonging to a certain age group.

​2. Selecting Rows Using Base R​


​a. Using Row Index Numbers​

You can select rows by their position (index).
Example:

students <- data.frame(
  Name = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age = c(20, 21, 22, 20, 23)
)

​Select the​​first three rows​​:​

​students[1:3, ]​

Explanation:

● 1:3 specifies the range of row indices

● The comma , indicates we want all columns for those rows

​b. Using Logical Conditions​

Select rows where Marks > 85:

​students[students$Marks > 85, ]​

Explanation:

● students$Marks > 85 returns TRUE for rows where Marks exceed 85

● Only those rows are displayed

​Select students aged​​less than or equal to 21​​:​

​students[students$Age <= 21, ]​

c. Combining Conditions

Use logical operators like & (AND) and | (OR).
Example:

​students[students$Marks > 80 & students$Age < 22, ]​

​This selects students with​​Marks > 80​​and​​Age < 22​​.​
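One pitfall with bracket-based selection is that a missing value in the condition produces a row of NAs rather than being dropped. The sketch below uses made-up data with one missing mark to show the behaviour and two ways around it.

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi"),
  Marks = c(85, NA, 76)     # Sara's mark is missing (made-up data)
)

with_na_row <- students[students$Marks > 80, ]         # Ali plus a row of NAs
using_which <- students[which(students$Marks > 80), ]  # which() skips the NA
using_subset <- subset(students, Marks > 80)           # subset() also skips it

nrow(with_na_row)
nrow(using_which)
nrow(using_subset)
```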

3. Selecting Rows Using subset() Function
The subset() function is simpler and more readable for conditional selection.

Syntax:

subset(dataframe, condition)

Examples:

subset(students, Marks > 85)
subset(students, Age == 20)
subset(students, Marks > 80 & Age < 22)

​Advantages:​

​●​ ​Easier to read​

● Doesn’t require the $ sign for each variable

​●​ ​Useful for multiple conditions​

4. Selecting Rows Using dplyr::filter()
The filter() function from dplyr is a modern, readable method for selecting rows.

​Syntax:​

​filter(dataframe, condition)​

​Examples:​

library(dplyr)
​filter(students, Marks > 85)​
​filter(students, Age < 22)​
​filter(students, Marks > 80 & Age < 22)​

Explanation:

● Keeps only rows that meet the specified condition

● You can use multiple conditions connected with & or |

​Multiple Conditions Example​


​filter(students, Marks > 80, Age < 22)​

​Here, both conditions must be true.​

​If you want to select either condition:​

​filter(students, Marks > 90 | Age < 21)​

5. Filtering with Membership (%in%)
If you want to select rows where a variable matches one of several values, use %in%.

​Example:​

​filter(students, Name %in% c("Ali", "Ravi"))​

​This selects rows where​​Name​​is either "Ali" or "Ravi".​

​6. Filtering Missing Values​


​To select rows​​without missing values​​, use:​

filter(students, !is.na(Marks))

This removes all rows with missing Marks.

To select rows with missing values:

filter(students, is.na(Marks))

7. Filtering Rows Using slice()
If you want to select specific rows by position (not condition), use slice() from dplyr.

​Example:​

​slice(students, 1:3)​

​Selects the first three rows.​

​You can also select:​

slice_head(students, n = 2)    # First 2 rows
slice_tail(students, n = 2)    # Last 2 rows
slice_sample(students, n = 2)  # Random 2 rows

8. Example with Pipes (%>%)
You can combine operations using the pipe operator:

students %>%
  filter(Marks > 80) %>%
  arrange(desc(Marks))

​Explanation:​

​●​ ​Filters students with marks greater than 80​

​●​ ​Sorts them in descending order of marks​

​🧭 Summary Table​
| Method | Function | Description |
|--------|----------|-------------|
| Base R (by index) | data[row_numbers, ] | Select rows by their index |
| Base R (by condition) | data[data$column > value, ] | Filter rows using condition |
| Simpler syntax | subset() | Readable way to filter rows |
| dplyr modern way | filter() | Fast and easy filtering |
| Row by position | slice() | Select rows by position |
| Missing values | is.na() / !is.na() | Handle missing rows |

​✅​​Tips​

● Always check results using head() or View()

● Combine filters with arrange() or select() for refined datasets

● Avoid using == for floating-point comparisons (use tolerance methods instead)
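The last tip is easy to demonstrate. The sketch below uses made-up readings and base R only; the tolerance of 1e-8 is an arbitrary choice for illustration, and dplyr also offers near() for the same purpose.

```r
0.1 + 0.2 == 0.3                    # FALSE because of binary rounding
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance

readings <- data.frame(ID = 1:3, Value = c(0.1 + 0.2, 0.3, 0.5))

exact_match <- readings[readings$Value == 0.3, ]             # misses row 1
close_match <- readings[abs(readings$Value - 0.3) < 1e-8, ]  # keeps rows 1 and 2

exact_match$ID
close_match$ID
```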

​Selecting Columns/Fields​

In R, selecting columns (fields) means choosing specific variables from a dataset. This is one of the most common tasks while analyzing data. For instance, you might want to extract only the columns Name and Marks from a large student dataset.

R provides multiple ways to select columns: through base R, the subset() function, or the dplyr package (which is often preferred for its clean syntax).

​1. Selecting Columns Using Base R​


​a. Using Column Names​

You can access a column using the $ operator.
Example:

students <- data.frame(
  Name = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age = c(20, 21, 22, 20, 23)
)

​students$Marks​

Explanation:

● $ is used to access a specific column.

● students$Marks returns only the “Marks” column as a vector.

​b. Using Column Index​

You can select columns by their position in the dataset.
Example:

students[, 1]    # Selects the first column (Name)
students[, 2:3]  # Selects 2nd and 3rd columns (Marks and Age)

Explanation:

● , separates rows and columns.

● Leaving the row part blank means “all rows.”

​c. Using Column Names in Square Brackets​

​If you want multiple specific columns by name:​

​students[, c("Name", "Marks")]​

​This returns only the columns “Name” and “Marks.”​
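One detail worth knowing about bracket selection: picking a single column this way returns a plain vector, not a data frame, unless drop = FALSE is added. A short sketch with a smaller version of the students data:

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi"),
  Marks = c(85, 92, 76),
  Age   = c(20, 21, 22)
)

as_vector <- students[, "Marks"]                # a numeric vector
as_frame  <- students[, "Marks", drop = FALSE]  # still a one-column data frame
also_frame <- students["Marks"]                 # single-bracket form keeps the data frame too

is.data.frame(as_vector)
is.data.frame(as_frame)
is.data.frame(also_frame)
```

This matters when later code expects a data frame, for example when passing the result to merge() or write.csv().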

2. Selecting Columns Using subset()
The subset() function can be used to choose specific columns easily.

​Syntax:​

​subset(dataframe, select = c(column1, column2, ...))​

​Example:​

​subset(students, select = c(Name, Marks))​

​You can also​​exclude​​certain columns using a negative​​sign:​

​subset(students, select = -Age)​

Explanation:

● select = -Age removes the Age column.

● You can exclude multiple columns as select = -c(Marks, Age)

3. Selecting Columns Using dplyr::select()
​This is the most popular and clean method for selecting columns.​

​Syntax:​

​select(dataframe, columns...)​

​Example:​

library(dplyr)
​select(students, Name, Marks)​

​Explanation:​

​●​ ​Selects only the​​Name​​and​​Marks​​columns.​

​●​ ​The original dataset remains unchanged unless reassigned.​

​a. Excluding Columns​

You can exclude columns using the - sign.

​select(students, -Age)​

​Removes the​​Age​​column and keeps the rest.​

​b. Selecting Columns by Range​


​select(students, Name:Marks)​

​Selects all columns​​from Name to Marks​​(based on position).​

​c. Selecting Columns by Name Pattern​

​You can select columns based on name patterns using helper functions:​

| Function | Description | Example |
|----------|-------------|---------|
| starts_with("A") | Selects columns starting with “A” | select(students, starts_with("A")) |
| ends_with("s") | Selects columns ending with “s” | select(students, ends_with("s")) |
| contains("ar") | Selects columns containing “ar” | select(students, contains("ar")) |
| matches("^[A-Z]") | Selects columns matching regex | select(students, matches("^[A-Z]")) |

d. Using the Pipe Operator (%>%)

students %>%
  select(Name, Marks)

Explanation:

● The pipe %>% passes the students dataframe into the select() function.

● This makes the code more readable and chainable with other operations like filter() or arrange().

​4. Selecting Columns by Data Type​


​You can select columns based on their​​data type​​using:​

select_if(students, is.numeric)

This selects only numeric columns (like Marks and Age).

To select only character columns:

select_if(students, is.character)
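In dplyr 1.0.0 and later, select_if() is superseded by select(students, where(is.numeric)), although the older form still works. The same idea can also be sketched in base R without any package, as below; stringsAsFactors = FALSE is set explicitly so the Name column is character on older R versions too.

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age   = c(20, 21, 22, 20, 23),
  stringsAsFactors = FALSE
)

# sapply() tests each column and returns one TRUE/FALSE per column
numeric_only   <- students[sapply(students, is.numeric)]
character_only <- students[sapply(students, is.character)]

names(numeric_only)
names(character_only)
```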

​5. Reordering Columns​


​To change the order of columns:​

​students %>%​
​select(Marks, Name, Age)​

​This rearranges the columns in the specified order.​

​6. Renaming Columns While Selecting​


You can rename columns directly inside select():

students %>%
  select(Student_Name = Name, Score = Marks)

​Explanation:​

​●​ ​Renames​​Name​​to​​Student_Name​​and​​Marks​​to​​Score​​.​

​🧠 Quick Recap Table​


| Method | Function | Description |
|--------|----------|-------------|
| Base R | data[, c()] | Select columns by index or name |
| $ operator | data$col | Access single column |
| subset() | subset(data, select = ...) | Choose or exclude columns easily |
| dplyr::select() | select(data, col1, col2) | Clean and readable method |
| Exclude columns | select(data, -colname) | Remove specific columns |
| Range select | select(data, col1:col3) | Select column range |
| Pattern select | starts_with(), contains() | Match by name pattern |
| Type-based select | select_if() | Select by data type |

​✅​​Tips​

● Use View() to verify selected columns quickly.

● Combine select() with filter() for focused sub-datasets.

● Prefer dplyr::select() for readability and cleaner syntax.

​Merging Data​

In R, merging means combining two or more datasets based on a common column (key), similar to joins in SQL. This is essential when data is stored in multiple sources and needs to be brought together for analysis.

​1. Why Merge Data?​


​Imagine you have two tables:​

​●​ ​One with​​student names and their marks​

​●​ ​Another with​​student names and their ages​

To analyze both marks and ages together, you must merge them into a single dataset using the Name column as the key.

​2. Types of Merging in R​


​R supports several ways to merge data:​

1. Using the merge() function (Base R)

2. Using dplyr join functions, like inner_join(), left_join(), etc.

3. Merging Using Base R merge() Function

​Syntax:​

​merge(x, y, by, all.x = FALSE, all.y = FALSE)​

Parameters:

● x, y → data frames to merge

● by → column(s) to merge on (common key)

● all.x → if TRUE, keeps all rows from x (Left Join)

● all.y → if TRUE, keeps all rows from y (Right Join)

​Example:​

data1 <- data.frame(
  Name = c("Ali", "Sara", "Ravi", "Mehak"),
  Marks = c(85, 92, 76, 89)
)

data2 <- data.frame(
  Name = c("Ali", "Sara", "John", "Mehak"),
  Age = c(20, 21, 23, 20)
)

​a. Inner Join (default)​

​Keeps only rows present in​​both​​datasets.​

​merge(data1, data2, by = "Name")​

​Result:​​Only​​Ali​​,​​Sara​​, and​​Mehak​​appear because they​​exist in both data frames.​

​b. Left Join​

Keeps all rows from data1 and adds matching rows from data2.

​merge(data1, data2, by = "Name", all.x = TRUE)​

​Result:​

● All students from data1 remain.

● If a match isn’t found in data2, missing values (NA) are added.

​c. Right Join​

Keeps all rows from data2.

​merge(data1, data2, by = "Name", all.y = TRUE)​

​Result:​

● All students from data2 remain.

● Records in data2 with no match in data1 get NA in the columns that come from data1.

​d. Full Join​

​Keeps​​all rows from both​​datasets.​

​merge(data1, data2, by = "Name", all = TRUE)​

Result:
Every name from both tables is included, and unmatched rows get NA.
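A compact way to compare the four join types is to count the rows each one returns. The sketch below reuses the data1 and data2 examples and only base R.

```r
data1 <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak"),
  Marks = c(85, 92, 76, 89)
)
data2 <- data.frame(
  Name = c("Ali", "Sara", "John", "Mehak"),
  Age  = c(20, 21, 23, 20)
)

inner <- merge(data1, data2, by = "Name")                # names in both
left  <- merge(data1, data2, by = "Name", all.x = TRUE)  # every name in data1
right <- merge(data1, data2, by = "Name", all.y = TRUE)  # every name in data2
full  <- merge(data1, data2, by = "Name", all = TRUE)    # every name in either

# Row counts per join type
sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```

Ravi appears only in data1 and John only in data2, which is why the counts differ.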

​e. Merging on Different Column Names​

If the key column has different names in each dataset (for example, if the name column in data1 were called Student instead of Name):

​merge(data1, data2, by.x = "Student", by.y = "Name")​

4. Merging Using dplyr Joins
The dplyr package offers more readable and powerful merging functions that are widely used in data science.

​Common Join Functions:​

| Function | Description | Similar to |
|----------|-------------|------------|
| inner_join(x, y, by) | Only matching rows | Inner join |
| left_join(x, y, by) | All rows from x | Left join |
| right_join(x, y, by) | All rows from y | Right join |
| full_join(x, y, by) | All rows from both | Full outer join |
| semi_join(x, y, by) | Rows in x that have a match in y | Filtered inner join |
| anti_join(x, y, by) | Rows in x with no match in y | Opposite of semi join |

Example using dplyr:

library(dplyr)

data1 <- data.frame(Name = c("Ali", "Sara", "Ravi", "Mehak"),
                    Marks = c(85, 92, 76, 89))

data2 <- data.frame(Name = c("Ali", "Sara", "John", "Mehak"),
                    Age = c(20, 21, 23, 20))

​Inner Join​

​inner_join(data1, data2, by = "Name")​

​Left Join​

​left_join(data1, data2, by = "Name")​

​Full Join​

​full_join(data1, data2, by = "Name")​

Anti Join (shows students in data1 but not in data2):

​anti_join(data1, data2, by = "Name")​
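For readers without dplyr installed, the effect of anti_join() and semi_join() can be approximated in base R with %in%. This is a sketch of equivalent filtering on a single key column, not how dplyr implements the joins.

```r
data1 <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak"),
  Marks = c(85, 92, 76, 89)
)
data2 <- data.frame(
  Name = c("Ali", "Sara", "John", "Mehak"),
  Age  = c(20, 21, 23, 20)
)

# Rows of data1 whose Name does NOT appear in data2 (like an anti join)
unmatched <- data1[!(data1$Name %in% data2$Name), ]

# Rows of data1 whose Name DOES appear in data2 (like a semi join)
matched <- data1[data1$Name %in% data2$Name, ]

unmatched$Name
matched$Name
```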

​Example Output (Inner Join):​

| Name | Marks | Age |
|------|-------|-----|
| Ali | 85 | 20 |
| Sara | 92 | 21 |
| Mehak | 89 | 20 |

​5. Merging on Multiple Columns​


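You can join based on multiple keys, provided both data frames contain every key column (the data1 example above has no Age column, so it would need one first):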

​merge(data1, data2, by = c("Name", "Age"))​

or using dplyr:

​inner_join(data1, data2, by = c("Name", "Age"))​

​6. Concatenating Data Vertically (Row-wise Merge)​


​When datasets have​​the same columns​​and you just want​​to stack them:​

● Use rbind() in base R

● Use bind_rows() in dplyr

​Example:​

df1 <- data.frame(Name = c("Ali", "Sara"), Marks = c(85, 92))
df2 <- data.frame(Name = c("Ravi", "Mehak"), Marks = c(76, 89))

combined <- rbind(df1, df2)

Result:
All four students in one dataset.

​7. Combining Columns (Column-wise Merge)​


​If both data frames have the same number of rows:​

​cbind(data1, data2)​

This merges them side-by-side by row position, without matching keys, so the rows must already be in the same order in both data frames.

​🧠 Quick Recap Table​

| Type | Base R Syntax | dplyr Syntax | Description |
|------|---------------|--------------|-------------|
| Inner Join | merge(x, y, by) | inner_join(x, y) | Only matching rows |
| Left Join | merge(x, y, by, all.x=TRUE) | left_join(x, y) | All from left table |
| Right Join | merge(x, y, by, all.y=TRUE) | right_join(x, y) | All from right table |
| Full Join | merge(x, y, by, all=TRUE) | full_join(x, y) | All from both |
| Row Bind | rbind(x, y) | bind_rows(x, y) | Stack vertically |
| Column Bind | cbind(x, y) | bind_cols(x, y) | Combine horizontally |

​✅​​Tips​

● Always verify merged results using View() or head()

● Ensure column names and types match before merging

● Use anti_join() to find unmatched rows (useful for data cleaning)

● Prefer dplyr joins for clean, readable, and efficient merging

​Relabeling Column Names​

Relabeling column names in R means renaming one or more columns of a dataset to make them more meaningful, readable, or standardized. Clean and consistent column names are crucial for easy coding, especially when working on large projects or with multiple datasets.

​1. Why Relabel Column Names?​


​Some reasons for renaming columns include:​

● Making names shorter or easier to type

● Replacing spaces or special characters

● Giving clearer, descriptive labels (e.g., changing V1 to Student_Name)

​●​ ​Standardizing column names before merging data​

​2. Viewing Current Column Names​


​Before renaming, you can check existing column names using:​

​colnames(dataframe)​

​Example:​

students <- data.frame(
  N = c("Ali", "Sara", "Ravi"),
  M = c(85, 92, 76),
  A = c(20, 21, 22)
)

colnames(students)

​Output:​

​[1] "N" "M" "A"​

​3. Renaming Columns in Base R​


​a. Rename All Columns​

​If you want to rename all columns at once:​

​colnames(students) <- c("Name", "Marks", "Age")​

​Now the dataset looks like:​

| Name | Marks | Age |
|------|-------|-----|
| Ali | 85 | 20 |
| Sara | 92 | 21 |
| Ravi | 76 | 22 |

​b. Rename Specific Columns​

​You can rename individual columns by referring to their index:​

​colnames(students)[2] <- "Score"​

​This renames only the​​second column​​(Marks → Score).​

c. Rename Using names() Function

The names() function works the same way. Starting again from the original column names N, M, and A:

names(students)[names(students) == "A"] <- "Age"

​Explanation:​

​●​ ​Finds the column named “A”​

​●​ ​Renames it to “Age”​

4. Renaming Columns Using dplyr::rename()
The rename() function from the dplyr package makes renaming easier and more readable.

​Syntax:​

​rename(dataframe, new_name = old_name)​

​Example:​

​library(dplyr)​

students <- data.frame(
  Name = c("Ali", "Sara", "Ravi"),
  Marks = c(85, 92, 76),
  Age = c(20, 21, 22)
)

students <- rename(students, Score = Marks)

Explanation:

● The old column name (Marks) comes after the =.

● The new name (Score) comes before the =.

● The dataset now has columns: Name, Score, and Age.

​Renaming Multiple Columns​

​students <- rename(students, Student_Name = Name, Years = Age)​

Result:

| Student_Name | Score | Years |
|--------------|-------|-------|
| Ali | 85 | 20 |
| Sara | 92 | 21 |
| Ravi | 76 | 22 |

5. Using setNames() Function
​This function creates a renamed copy of the dataset without changing the original directly.​

​Syntax:​

​new_df <- setNames(old_df, c("new1", "new2", "new3"))​

​Example:​

​new_students <- setNames(students, c("Student", "Marks", "Age"))​

​Explanation:​

​●​ ​Assigns the new names to the columns in order.​

​●​ ​Creates a new data frame with renamed columns.​

6. Using rename() with Pipes
You can use pipes to rename columns inline (this example assumes students still has its original Name, Marks, and Age columns):

students %>%
  rename(Score = Marks, Years = Age)

This is preferred for cleaner, chainable code.

7. Changing Column Names to Lowercase or Uppercase
You can standardize names easily:

colnames(students) <- tolower(colnames(students))  # all lowercase

colnames(students) <- toupper(colnames(students))  # all uppercase

Example (after toupper()):

Before: Name, Marks, Age

After: NAME, MARKS, AGE

8. Replacing Spaces or Special Characters in Column Names
​If your dataset has spaces in column names:​

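# check.names = FALSE keeps the spaces; by default data.frame() would turn them into dots
data <- data.frame("Student Name" = c("Ali", "Sara"), "Test Score" = c(85, 90), check.names = FALSE)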

​Replace spaces with underscores:​

​colnames(data) <- gsub(" ", "_", colnames(data))​

​Result:​

Student_Name  Test_Score

​Ali​ ​85​

​Sara​ ​90​
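The same idea extends to other special characters. The sketch below is one possible cleaning routine, not a standard function: it lowercases the names, collapses every run of non-alphanumeric characters into a single underscore, and trims leftover underscores. The column "Test Score (%)" is a hypothetical example.

```r
# Hypothetical data frame with awkward column names
data <- data.frame(
  "Student Name"   = c("Ali", "Sara"),
  "Test Score (%)" = c(85, 90),
  check.names = FALSE  # keep the names exactly as written
)

clean <- tolower(colnames(data))         # lowercase everything
clean <- gsub("[^a-z0-9]+", "_", clean)  # non-alphanumeric runs -> "_"
clean <- gsub("^_|_$", "", clean)        # drop leading/trailing "_"

colnames(data) <- clean
print(colnames(data))
```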

​🧭 Quick Recap Table​

| Method | Function | Use |
|--------|----------|-----|
| View column names | colnames(data) | Displays current names |
| Rename all columns | colnames(data) <- c(...) | Rename all at once |
| Rename one column | colnames(data)[i] <- "NewName" | Rename by position |
| Conditional rename | names(data)[names(data) == "Old"] <- "New" | Rename by name |
| Modern rename (dplyr) | rename(data, New = Old) | Clean and readable |
| Rename with order | setNames(data, c("a","b","c")) | Rename all with vector |
| Replace characters | gsub(" ", "_", colnames(data)) | Clean column labels |

​✅​​Tips​

● Always check renamed data using colnames() or head().

● Keep column names short, lowercase, and underscore-separated (e.g., student_name).

​●​ ​Avoid spaces, punctuation, and symbols in names — they make coding harder.​

​Reshaping Data​

Reshaping data in R refers to the process of changing the structure or layout of a dataset, for example converting data from wide format to long format or vice versa. This is essential for analysis, visualization, and efficient data management.

​1. Understanding Data Shapes​


● Wide Format:
Each subject or category has a single row, and multiple variables are represented as separate columns.
Example:
| Name | Math | Science | English |
|------|-------|----------|----------|
| Ali | 85 | 90 | 78 |
| Sara | 92 | 87 | 88 |

● Long Format:
Each row represents a single observation for a variable.
Example:
| Name | Subject | Marks |
|------|----------|-------|
| Ali | Math | 85 |
| Ali | Science | 90 |
| Ali | English | 78 |
| Sara | Math | 92 |
| Sara | Science | 87 |
| Sara | English | 88 |

​Converting between these two formats is what we call​​reshaping​​.​

2. Reshaping Using tidyr Package

The tidyr package provides two convenient functions to reshape data:

● pivot_longer(): converts data from wide to long format.

● pivot_wider(): converts data from long to wide format.

​Let’s understand both.​

3. Wide to Long Format using pivot_longer()

​Syntax:​

​pivot_longer(data, cols, names_to, values_to)​

​Parameters:​

● data: dataset

● cols: columns to convert into key-value pairs

● names_to: name of the new column that will store variable names

● values_to: name of the new column that will store corresponding values

​Example:​

​library(tidyr)​

marks <- data.frame(

​Name = c("Ali", "Sara"),​

​Math = c(85, 92),​

​Science = c(90, 87),​

​English = c(78, 88)​

​)​

​long_data <- pivot_longer(marks, cols = c(Math, Science, English),​

​names_to = "Subject", values_to = "Marks")​

​print(long_data)​

​Output:​

​Name​ ​Subject​ ​Marks​

​Ali​ ​Math​ ​85​

​Ali​ ​Science​ ​90​

​Ali​ ​English​ ​78​

​Sara​ ​Math​ ​92​

​Sara​ ​Science​ ​87​

​Sara​ ​English​ ​88​

​Explanation:​

​●​ ​The columns​​Math, Science, English​​become entries under a single column​​Subject​​.​

​●​ ​Their corresponding values go under​​Marks​​.​

4. Long to Wide Format using pivot_wider()

​Syntax:​

​pivot_wider(data, names_from, values_from)​

​Parameters:​

● data: dataset

● names_from: column whose values become new column names

● values_from: column whose values fill the new columns

​Example:​

​wide_data <- pivot_wider(long_data, names_from = Subject, values_from = Marks)​

​print(wide_data)​

​Output:​

Name  Math  Science  English

​Ali​ ​85​ ​90​ ​78​

​Sara​ ​92​ ​87​ ​88​

Explanation:
​The​​Subject​​column becomes new column names (​​Math,​​Science, English​​), and​​Marks​
​values fill those columns.​

​5. Using Base R for Reshaping​


Before tidyr, reshaping was commonly done using base R's reshape() or the melt() and dcast() functions from the reshape2 package.

a. Using melt()

​Converts wide data into long format.​

​Example:​

​library(reshape2)​

long_data <- melt(marks, id.vars = "Name",
                  variable.name = "Subject", value.name = "Marks")

b. Using dcast()

​Converts long data back into wide format.​

wide_data <- dcast(long_data, Name ~ Subject, value.var = "Marks")
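For comparison, the wide-to-long step can also be done without any extra package, using reshape() from base R. The sketch below rebuilds the marks data frame from the earlier example; the names subjects and long_base are introduced only for illustration.

```r
# Same wide data as in the pivot_longer() example
marks <- data.frame(
  Name    = c("Ali", "Sara"),
  Math    = c(85, 92),
  Science = c(90, 87),
  English = c(78, 88)
)
subjects <- c("Math", "Science", "English")

long_base <- reshape(
  marks,
  direction = "long",
  varying   = subjects,   # wide columns to stack
  v.names   = "Marks",    # name of the stacked value column
  timevar   = "Subject",  # column that records which wide column a value came from
  times     = subjects,   # labels written into 'Subject'
  idvar     = "Name"      # identifier column
)
rownames(long_base) <- NULL  # drop the auto-generated row names
print(long_base)
```

reshape() needs more arguments than pivot_longer(), which is one reason the tidyr functions are usually preferred for readability.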

​6. Combining Multiple Identifiers​


​If your data contains multiple identifier columns, you can specify them easily.​

​Example:​

data <- data.frame(

​Student = c("Ali", "Ali", "Sara", "Sara"),​

​Year = c(2022, 2023, 2022, 2023),​

​Math = c(85, 88, 92, 95),​

​Science = c(90, 91, 87, 89)​

​)​

​long <- pivot_longer(data, cols = c(Math, Science),​

​names_to = "Subject", values_to = "Marks")​

​print(long)​

​Output:​

Student  Year  Subject  Marks

​Ali​ ​2022​ ​Math​ ​85​

​Ali​ ​2022​ ​Science​ ​90​

​Ali​ ​2023​ ​Math​ ​88​

​Ali​ ​2023​ ​Science​ ​91​

​Sara​ ​2022​ ​Math​ ​92​

​Sara​ ​2022​ ​Science​ ​87​

​Sara​ ​2023​ ​Math​ ​95​

​Sara​ ​2023​ ​Science​ ​89​

7. Spread and Gather (Older Functions)

Before pivot_longer() and pivot_wider(), we used:

● gather() → to convert wide to long

● spread() → to convert long to wide

These are now superseded but still found in older scripts.

​Example:​

​library(tidyr)​

​long_data <- gather(marks, Subject, Marks, Math:English)​

​wide_data <- spread(long_data, Subject, Marks)​

​8. Advantages of Reshaping Data​


● Simplifies data visualization (especially in ggplot2).

​●​ ​Facilitates statistical modeling (models often need long format).​

​●​ ​Makes datasets tidy and consistent for analysis.​

​●​ ​Enables easier joining, filtering, and summarizing.​

​🧭 Quick Recap Table​

| Task | Function | Direction | Package |
|------|----------|-----------|---------|
| Wide → Long | pivot_longer() | Wide to Long | tidyr |
| Long → Wide | pivot_wider() | Long to Wide | tidyr |
| Wide → Long (older) | melt() / gather() | Wide to Long | reshape2 / tidyr |
| Long → Wide (older) | dcast() / spread() | Long to Wide | reshape2 / tidyr |

​✅​​Tips​

● Always inspect the dataset using head() before and after reshaping.

● Prefer pivot_longer() and pivot_wider(); they are simpler and more readable.

● For large datasets, use data.table::melt() and data.table::dcast(), which are faster versions.

​●​ ​Long format is better for visualizations; wide format is better for presentation.​

​Centering, Scaling, and Normalizing Data Values​

Before analyzing or modeling data, it's often important to make sure that all numeric variables are on a comparable scale. This helps improve the performance and interpretability of many algorithms, especially in machine learning and statistical modeling.

​This process involves​​centering​​,​​scaling​​, and​​normalizing​​data.​

​1. Why Scaling is Important​

Imagine you're analyzing a dataset with two features:

● Age (values like 20, 30, 40)

● Income (values like 20,000, 50,000, 90,000)

The Income variable dominates because its values are much larger. Scaling brings all variables to similar ranges, making them equally important in analysis and modeling.

​2. Centering Data​


Centering means subtracting the mean value of a variable from each of its data points.
This shifts the variable so that its mean becomes zero.

Formula:
\[
x_{centered} = x - \bar{x}
\]

​Example:​

​x <- c(10, 20, 30, 40, 50)​

​centered_x <- x - mean(x)​

​print(centered_x)​

​Output:​

​[1] -20 -10 0 10 20​

​Explanation:​

● The mean of x is 30.

​●​ ​Each value is reduced by 30, resulting in centered values.​


​ ​​Use Case:​​Centering is useful before performing​​Principal Component Analysis (PCA)​
​or regression, where the mean needs to be zero.​

3. Scaling Data

Scaling means dividing centered data by its standard deviation (SD) so that all variables have a standard deviation of 1.

Formula:
\[
x_{scaled} = \frac{x - \bar{x}}{s}
\]
where

● \( \bar{x} \) = mean of variable

● \( s \) = standard deviation

​Example:​

​x <- c(10, 20, 30, 40, 50)​

​scaled_x <- scale(x)​

​print(scaled_x)​

​Output:​

​[,1]​

​[1,] -1.2649111​

​[2,] -0.6324555​

​[3,] 0.0000000​

​[4,] 0.6324555​

​[5,] 1.2649111​

​Explanation:​

​●​ ​After scaling, the mean becomes 0 and SD becomes 1.​

​●​ ​This ensures all variables contribute equally to analysis.​

Use Case: Scaling is vital in machine learning (e.g., K-Means, SVM, PCA, regularized linear regression).

​4. Normalization (Min-Max Scaling)​


Normalization rescales data to a fixed range, usually between 0 and 1.
Like centering and scaling, it is a linear transformation, so it preserves the shape of the distribution and changes only the scale.

Formula:
\[
x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}
\]

​Example:​

​x <- c(10, 20, 30, 40, 50)​

​normalized_x <- (x - min(x)) / (max(x) - min(x))​

​print(normalized_x)​

​Output:​

​[1] 0.00 0.25 0.50 0.75 1.00​

​Explanation:​

​●​ ​The smallest value becomes 0, the largest becomes 1.​

​●​ ​All values are proportionally scaled in between.​


Use Case: Normalization is especially used for distance-based algorithms (like KNN) and for neural networks.

​5. Using Built-in R Functions​


​R provides handy built-in tools for centering and scaling:​

a. scale() Function

​Performs both centering and scaling.​

​Syntax:​

​scale(x, center = TRUE, scale = TRUE)​

​Example:​

data <- data.frame(

​Age = c(20, 25, 30, 35, 40),​

​Income = c(20000, 35000, 50000, 65000, 80000)​

​)​

​scaled_data <- scale(data)​

​print(scaled_data)​

​Explanation:​

​●​ ​Each column is centered (mean = 0) and scaled (SD = 1).​

● Returns a standardized version of the dataset as a matrix; wrap it in as.data.frame() if you need a data frame.

​b. Custom Normalization Function​

​If you want to normalize manually:​

​normalize <- function(x) {​

​return ((x - min(x)) / (max(x) - min(x)))​

​}​

normalized_data <- as.data.frame(lapply(data, normalize))

print(normalized_data)

​Explanation:​

● lapply() applies the normalization function to each column.

​●​ ​Output is a dataframe where all values are between 0 and 1.​

​6. Z-Score Normalization​


​A special case of scaling — also called​​standardization​​.​

Formula:
\[
z = \frac{x - \bar{x}}{s}
\]

You can calculate it manually or with the scale() function; they are equivalent.
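The equivalence can be checked directly. This is a small sketch using the same vector as the earlier examples; note that sd() in R uses the sample standard deviation (n - 1 in the denominator), which is also what scale() uses.

```r
x <- c(10, 20, 30, 40, 50)

# Manual z-score: subtract the mean, divide by the sample SD
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.vector() drops the matrix structure
z_scale <- as.vector(scale(x))

print(all.equal(z_manual, z_scale))  # do the two approaches agree?
print(mean(z_manual))                # should be (numerically) zero
print(sd(z_manual))                  # should be one
```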

​7. Comparison Between Scaling Methods​

| Method | Range | Mean | SD | Common Use |
|--------|-------|------|----|------------|
| Centering | Same as original | 0 | Same as original | Removes offset |
| Standard Scaling | -∞ to +∞ | 0 | 1 | ML algorithms |
| Min-Max Normalization | 0 to 1 | Depends | Depends | Neural networks |
| Z-score Normalization | -3 to +3 (approx.) | 0 | 1 | Statistical analysis |

​8. Practical Example: Standardizing a Dataset​
data <- data.frame(

​Height = c(150, 160, 170, 180, 190),​

​Weight = c(50, 60, 70, 80, 90)​

​)​

scaled_data <- as.data.frame(scale(data))

​print(scaled_data)​

​Output:​

​Height Weight​

​1 -1.2649111 -1.2649111​

​2 -0.6324555 -0.6324555​

​3 0.0000000 0.0000000​

​4 0.6324555 0.6324555​

​5 1.2649111 1.2649111​

​✅ Both Height and Weight are now on the same scale, ready for analysis.​

​9. Real-World Applications​


​●​ ​Machine Learning​​→ Algorithms like KNN, SVM, PCA need​​scaled data.​

​●​ ​Clustering​​→ Euclidean distance is sensitive to scale​​differences.​

​●​ ​Data Visualization​​→ Prevents distortion in plots.​

​●​ ​Regression Models​​→ Makes coefficients more interpretable.​

​🧭 Quick Recap​

| Concept | Purpose | Example Function |
|---------|---------|------------------|
| Centering | Shift mean to 0 | x - mean(x) |
| Scaling | Make SD = 1 | scale(x) |
| Normalization | Fit data between 0 and 1 | Custom: (x - min(x))/(max(x)-min(x)) |
| Z-score | Standardization | scale(x) |


​✅​​Tips​

​●​ ​Always scale​​numerical​​variables only, not categorical.​

● Scaling should be done after splitting data into train/test sets to prevent data leakage.

● Use caret::preProcess() for automated preprocessing pipelines.
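The tip about splitting before scaling can be sketched in base R as follows. The income values and the 4/2 split are hypothetical; the point is only that the mean and SD are estimated from the training part and then reused for the test part.

```r
# Hypothetical income values; the first four form the training set
income <- c(20000, 35000, 50000, 65000, 80000, 42000)
train  <- income[1:4]
test   <- income[5:6]

# Estimate the scaling parameters from the training data only
mu <- mean(train)
s  <- sd(train)

# Apply the same parameters to both sets
train_scaled <- (train - mu) / s
test_scaled  <- (test - mu) / s

print(train_scaled)
print(test_scaled)
```

Computing mu and s on the full data would let information from the test set influence the preprocessing, which is the leakage the tip warns about.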

​Converting Variable Types​

In R, each piece of data has a specific data type, such as numeric, character, factor, or logical. Sometimes, we need to convert variables from one type to another to perform certain operations correctly: for instance, converting a numeric column into a factor for categorical analysis, or changing character data into numeric for calculations.

​Let’s explore how and why this conversion is done.​

​1. Why Convert Variable Types?​

● Data Compatibility: Some functions work only with specific data types (e.g., statistical tests often require numeric data).

● Accurate Analysis: Converting categorical variables to factors helps R treat them properly during modeling.

● Avoid Errors: Incorrect data types may cause calculation or visualization errors.

● Data Cleaning: Imported data (especially from CSV/Excel) may misinterpret types (e.g., numbers read as characters).

​2. Checking Variable Types​


​Before converting, always check the data type using these functions:​

​class(x) # Returns the class of a variable​

​typeof(x) # Returns the internal storage type​

​str(data) # Displays structure of a dataset​

​Example:​

​x <- "25"​

​class(x)​

​Output:​

[1] "character"

This shows that "25" is a character, not numeric.

​3. Type Conversion Functions​


​R provides several built-in functions for conversion:​

| Function | Converts To | Example |
|----------|-------------|---------|
| as.numeric() | Numeric | as.numeric("25") → 25 |
| as.integer() | Integer | as.integer(3.8) → 3 |
| as.character() | Character | as.character(25) → "25" |
| as.logical() | Logical | as.logical(1) → TRUE |
| as.factor() | Factor | as.factor(c("A", "B", "A")) |
| as.Date() | Date | as.Date("2024-05-20") |

​4. Example: Converting Character to Numeric​


​x <- c("10", "20", "30")​

y <- as.numeric(x)

​print(y)​

​Output:​

​[1] 10 20 30​

Explanation:
​Each string element is converted into a numeric value, making it ready for mathematical​
​operations.​

​5. Example: Converting Numeric to Character​


​num <- c(100, 200, 300)​

char <- as.character(num)

​print(char)​

​Output:​

​[1] "100" "200" "300"​

Explanation:
​Numbers become strings, which are treated as text, not numeric values.​

​6. Example: Converting Character to Factor​


​colors <- c("Red", "Blue", "Red", "Green")​

fact_colors <- as.factor(colors)

​print(fact_colors)​

​Output:​

​[1] Red Blue Red Green​

​Levels: Blue Green Red​

Explanation:
​R recognizes unique categories as​​levels​​of the factor.​
​This is essential for categorical analysis (like grouping, plotting, or regression).​

7. Example: Converting Factor to Numeric

Direct conversion from factor to numeric can give wrong results because R stores factors as integer codes internally.
Use a two-step conversion instead.

​❌​​Incorrect way:​

f <- as.factor(c(10, 20, 30))

as.numeric(f)

​Output:​

​[1] 1 2 3​

​✅​​Correct way:​

as.numeric(as.character(f))

​Output:​

​[1] 10 20 30​

Explanation:
​First convert the factor to character, then to numeric to preserve original values.​
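A short sketch comparing the conversions side by side. The last line shows an alternative idiom, as.numeric(levels(f))[f], which converts the levels once and then indexes them by the factor's codes; it is offered here as an option, not as the method used in these notes.

```r
f <- factor(c("10", "20", "30", "20"))

# Direct conversion returns the internal integer codes, not the values
codes <- as.numeric(f)

# Two-step conversion recovers the original numbers
two_step <- as.numeric(as.character(f))

# Alternative: convert the levels once, then index by the factor
via_levels <- as.numeric(levels(f))[f]

print(codes)
print(two_step)
print(via_levels)
```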

​8. Converting to Logical​


​x <- c(1, 0, 1, 0)​

as.logical(x)

​Output:​

​[1] TRUE FALSE TRUE FALSE​

​Explanation:​

​●​ ​Non-zero values become​​TRUE​

​●​ ​Zero values become​​FALSE​

9. Converting to Date

R stores dates as a special class: "Date".

​Example:​

​date_char <- c("2025-01-10", "2025-02-15")​

date_real <- as.Date(date_char)

​print(date_real)​

​Output:​

​[1] "2025-01-10" "2025-02-15"​

Explanation:
​Converts text formatted as “YYYY-MM-DD” into date objects recognized by R.​

​If your date format is different (e.g., “10/01/2025”), specify the format:​

as.Date("10/01/2025", format="%d/%m/%Y")
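Once text is stored as a Date, date arithmetic works directly. The sketch below reuses the dates from the example and also shows what happens when the format string does not match the text; the variable names are illustrative.

```r
# Day/month/year text parsed with an explicit format
d1 <- as.Date("10/01/2025", format = "%d/%m/%Y")
d2 <- as.Date("2025-02-15")  # ISO format needs no format string

print(d1)
print(d2 - d1)  # difference between the two dates, in days

# A format string that does not match the text yields NA instead of a date
bad <- as.Date("10/01/2025", format = "%Y-%m-%d")
print(is.na(bad))
```

Checking for NA right after conversion is a quick way to catch a wrong format string.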

​10. Bulk Conversion Inside a Data Frame​


You can convert multiple columns at once using lapply() or mutate() from dplyr.

​Example:​

data <- data.frame(

​Age = c("20", "25", "30"),​

​Score = c("90", "85", "80")​

​)​

data[] <- lapply(data, as.numeric)

​str(data)​

​Output:​

'data.frame':	3 obs. of 2 variables:

​$ Age : num 20 25 30​

​$ Score: num 90 85 80​

​✅ Both columns are now numeric.​

11. Using mutate() (Tidyverse Way)

library(dplyr)

df <- data.frame(

​ID = c("1", "2", "3"),​

​Gender = c("M", "F", "M")​

​)​

​df <- df %>%​

​mutate(​

ID = as.integer(ID),

Gender = as.factor(Gender)

)

​str(df)​

​Output:​

'data.frame':	3 obs. of 2 variables:

​$ ID : int 1 2 3​

​$ Gender : Factor w/ 2 levels "F","M": 2 1 2​

​12. Common Conversion Issues​

| Issue | Cause | Solution |
|-------|-------|----------|
| NA values after conversion | Non-numeric characters | Use as.numeric() only on clean data |
| Factor levels lost | Skipping character conversion | Convert factor → character → numeric |
| Wrong date format | Incorrect format string | Specify format using format= parameter |
| Logical conversion fails | Text like “Yes”, “No” | Map manually using ifelse() |
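Two of the issues in the table can be reproduced with a few lines. This is a sketch with hypothetical input values; suppressWarnings() is used only to keep the output quiet, since as.numeric() warns when it introduces NA.

```r
# Hypothetical raw input containing one non-numeric entry
raw  <- c("10", "20", "abc", "30")
nums <- suppressWarnings(as.numeric(raw))

print(nums)                # the bad entry becomes NA
print(which(is.na(nums)))  # position of the entry that needs cleaning

# Text answers cannot be handled by as.logical(), so map them manually
answers <- c("Yes", "No", "Yes")
flags   <- ifelse(answers == "Yes", TRUE, FALSE)
print(flags)
```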

​13. Real-World Example​


​Imagine you import survey data from Excel:​

survey <- data.frame(

​Age = c("25", "30", "35"),​

​Gender = c("Male", "Female", "Male"),​

​Score = c("80", "90", "85")​

​)​

​Now convert it properly:​

survey$Age <- as.numeric(survey$Age)

survey$Gender <- as.factor(survey$Gender)

survey$Score <- as.numeric(survey$Score)

​str(survey)​

​✅ Now your dataset is clean and ready for analysis or visualization.​

​🧭 Quick Recap​

| Conversion | Function | Example |
|------------|----------|---------|
| Character → Numeric | as.numeric() | as.numeric("25") |
| Numeric → Character | as.character() | as.character(25) |
| Character → Factor | as.factor() | as.factor("Yes") |
| Factor → Numeric | as.numeric(as.character(f)) | Safe conversion |
| Character → Date | as.Date(x, format) | Date handling |

​✅​​Tips​

● Always use str() after conversion to confirm changes.

● Be careful with factor → numeric conversions.

● When importing from CSV/Excel, check the structure immediately.

● Use mutate(across()) in dplyr for efficient multiple conversions.

​Data Sorting​

Sorting means arranging data in a specific order (ascending or descending) based on one or more variables. It's one of the most common operations in R, especially during data cleaning and exploration. Sorting helps us see trends, identify outliers, and prepare data for reports or models.

​1. Why Sorting Matters​


​●​ ​To​​organize​​data for easy interpretation.​

​●​ ​To​​find highest or lowest​​values quickly.​

​●​ ​To​​rank​​or​​prioritize​​data.​

​●​ ​To prepare datasets before merging or summarizing.​

​2. Sorting Vectors in R​

R provides simple functions to sort numeric, character, or logical vectors.

Using sort()

​Syntax:​

​sort(x, decreasing = FALSE)​

​Parameters:​

● x → the vector to sort

● decreasing → set TRUE for descending order

​Example 1: Ascending Order​

​numbers <- c(15, 3, 20, 8, 10)​

​sort(numbers)​

​Output:​

​[1] 3 8 10 15 20​

​Example 2: Descending Order​

​sort(numbers, decreasing = TRUE)​

​Output:​

​[1] 20 15 10 8 3​

​3. Sorting Characters​


​names <- c("Zara", "Ali", "Hina", "Bilal")​

​sort(names)​

​Output:​

​[1] "Ali" "Bilal" "Hina" "Zara"​

​✅ R sorts strings alphabetically.​

4. Sorting Using order()

order() doesn't return sorted data directly; it returns the order of indices that can be used to rearrange data.

​Example:​

​numbers <- c(15, 3, 20, 8, 10)​

​order(numbers)​

​Output:​

​[1] 2 4 5 1 3​

​This means the 2nd element (3) should come first, then the 4th (8), and so on.​

​To sort the data:​

​numbers[order(numbers)]​

​Output:​

​[1] 3 8 10 15 20​

​Descending Order:​

​numbers[order(-numbers)]​

​Output:​

​[1] 20 15 10 8 3​

​5. Sorting Data Frames​


To sort a data frame, use order() on a column inside square brackets.

​Example:​

students <- data.frame(

​Name = c("Ali", "Sara", "Bilal", "Hina"),​

​Marks = c(85, 90, 80, 95)​

​)​

​students[order(students$Marks), ]​

​Output:​

​Name​ ​Marks​

​Bilal​ ​80​

​Ali​ ​85​

​Sara​ ​90​

​Hina​ ​95​

​✅ Data is now sorted by​​Marks​​in ascending order.​

​Descending Order:​

​students[order(-students$Marks), ]​

​Output:​

​Name​ ​Marks​

​Hina​ ​95​

​Sara​ ​90​

​Ali​ ​85​

​Bilal​ ​80​

​6. Sorting by Multiple Columns​


You can sort by multiple columns using a comma-separated list in order().

​Example:​

data <- data.frame(

​Name = c("Ali", "Sara", "Ali", "Hina"),​

​Marks = c(85, 90, 75, 90)​

​)​

​data[order(data$Name, -data$Marks), ]​

​Output:​

​Name​ ​Marks​

​Ali​ ​85​

​Ali​ ​75​

​Hina​ ​90​

​Sara​ ​90​

​Explanation:​

● First sorted by Name alphabetically.

● If names are the same, sorted by Marks in descending order.
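The minus sign only works on numeric columns. To sort a character column in descending order inside order(), one option is to wrap it in xtfrm(), which converts the values to sortable numbers first. The sketch below reverses the roles from the example above (Marks ascending, then Name descending) and is an illustration, not part of the original notes.

```r
data <- data.frame(
  Name  = c("Ali", "Sara", "Ali", "Hina"),
  Marks = c(85, 90, 75, 90),
  stringsAsFactors = FALSE
)

# Marks ascending; ties broken by Name in reverse alphabetical order.
# -data$Name would fail for text, so xtfrm() supplies numeric ranks to negate.
sorted <- data[order(data$Marks, -xtfrm(data$Name)), ]
print(sorted)
```

With dplyr, the same result is written as arrange(data, Marks, desc(Name)), since desc() accepts character columns.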

7. Using dplyr for Sorting

The dplyr package provides a more readable and modern syntax using arrange().

​Example 1: Ascending Order​

​library(dplyr)​

​students %>% arrange(Marks)​

​Example 2: Descending Order​

​students %>% arrange(desc(Marks))​

​Example 3: Multiple Columns​

​data %>% arrange(Name, desc(Marks))​

​✅​​Explanation:​

● arrange() sorts by columns.

● desc() specifies descending order.

​8. Sorting Rows by Row Names​


data <- data.frame(Score = c(88, 92, 76))

rownames(data) <- c("Ali", "Sara", "Hina")

# drop = FALSE keeps the one-column result as a data frame (with its row names)
data[order(rownames(data)), , drop = FALSE]

​Output:​

Score

​Ali​ ​88​

​Hina​ ​76​

​Sara​ ​92​

​9. Sorting Columns (Transposed Sorting)​


If you want to sort columns instead of rows, put order() in the column position of the brackets.

​matrix_data <- matrix(c(5,2,8,1,7,4), nrow=2)​

​colnames(matrix_data) <- c("C1", "C2", "C3")​

​matrix_data[, order(colMeans(matrix_data))]​

​✅ This sorts columns based on their mean values.​

10. Dealing with Missing Values (NA)

By default, sort() drops NA values, while order() places them at the end.
You can control this using the na.last argument.

​Example:​

​x <- c(3, NA, 5, 1)​

sort(x, na.last = TRUE)

​Output:​

​[1] 1 3 5 NA​

If you set na.last = FALSE, NA values come first.
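The different NA behaviours of sort() and order() can be compared in a few lines. This sketch uses the same vector as the example; the result names are illustrative.

```r
x <- c(3, NA, 5, 1)

dropped  <- sort(x)                   # default: NA is removed
na_last  <- sort(x, na.last = TRUE)   # NA kept, placed at the end
na_first <- sort(x, na.last = FALSE)  # NA kept, placed at the start
by_order <- x[order(x)]               # order() keeps NA last by default

print(dropped)
print(na_last)
print(na_first)
print(by_order)
```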

​11. Real-World Example​


Imagine sorting an employee dataset by department and then by salary.

employees <- data.frame(

​Name = c("Aisha", "Bilal", "Sara", "Hina"),​

​Department = c("IT", "HR", "IT", "HR"),​

​Salary = c(60000, 55000, 70000, 60000)​

​)​

​employees %>% arrange(Department, desc(Salary))​

​Output:​

​Name​ ​Department​ ​Salary​

​Hina​ ​HR​ ​60000​

​Bilal​ ​HR​ ​55000​

​Sara​ ​IT​ ​70000​

​Aisha​ ​IT​ ​60000​

​✅ Sorted first by department, then by salary (descending).​

​🧭 Quick Recap​

| Task | Function | Package |
|------|----------|---------|
| Sort a vector | sort() | Base R |
| Get order of indices | order() | Base R |
| Sort a data frame | df[order(df$col), ] | Base R |
| Sort with multiple columns | order(df$col1, df$col2) | Base R |
| Sort in modern syntax | arrange() | dplyr |
| Descending order | desc() | dplyr |

​✅​​Tips​

● Always check for NAs before sorting.

● For descending sort, use desc() or negative sign (-x).

● Sorting doesn't modify the original data unless reassigned.

● arrange() is cleaner and preferred for pipelines.

​Data Aggregation​

Data aggregation is the process of summarizing, grouping, and combining data to produce meaningful insights. It's one of the most important steps in data analysis, allowing you to compute totals, averages, counts, and other summary statistics across groups or entire datasets.

​1. What is Data Aggregation?​


​Data aggregation involves:​

​●​ ​Grouping data based on one or more attributes.​

● Applying summary functions such as sum(), mean(), min(), max(), or length().

​●​ ​Producing a compact representation of large datasets for easier interpretation.​

Example:
If you have a dataset of students' marks from different classes, aggregation can help find:

​●​ ​Average marks per class​

​●​ ​Total students per class​

​●​ ​Highest and lowest scores in each subject​

​2. Aggregation Using Base R Functions​

a. Using aggregate()

aggregate() is a powerful base R function for grouped summaries.


​Syntax:​

​aggregate(x, by, FUN)​

​Parameters:​

● x → data to summarize (numeric columns)

● by → list of grouping variables

● FUN → summary function (mean, sum, etc.)

The example below uses the equivalent formula interface, aggregate(formula, data, FUN).

​Example:​

data <- data.frame(

​Class = c("A", "A", "B", "B", "C"),​

​Marks = c(80, 85, 90, 75, 88)​

​)​

​aggregate(Marks ~ Class, data = data, FUN = mean)​

Output:

Class  Marks

​A​ ​82.5​

​B​ ​82.5​

​C​ ​88.0​

​✅ Aggregates mean marks for each class.​

​b. Using Multiple Summary Columns​

data <- data.frame(

​Class = c("A", "A", "B", "B", "C"),​

​Math = c(80, 85, 90, 75, 88),​

​Science = c(70, 75, 80, 85, 90)​

​)​

​aggregate(. ~ Class, data = data, FUN = mean)​

​Output:​

Class  Math  Science

A  82.5  72.5

​B​ ​82.5​ ​82.5​

​C​ ​88.0​ ​90.0​

✅ The . symbol means “apply to all other columns.”

3. Aggregation with tapply()

tapply() applies a function to subsets of a vector defined by one or more factors.

​Syntax:​

​tapply(X, INDEX, FUN)​

​Example:​

​Class <- c("A", "A", "B", "B", "C")​

​Marks <- c(80, 85, 90, 75, 88)​

​tapply(Marks, Class, mean)​

​Output:​

​A B C​

​82.5 82.5 88.0​

​✅ Returns a named vector with mean marks for each class.​

4. Aggregation with by()

by() is similar to tapply() but works with data frames.

​Example:​

data <- data.frame(

​Class = c("A", "A", "B", "B", "C"),​

​Marks = c(80, 85, 90, 75, 88)​

​)​

​by(data$Marks, data$Class, mean)​

​Output:​

​data$Class: A​

​[1] 82.5​

​data$Class: B​

​[1] 82.5​

​data$Class: C​

​[1] 88​

5. Aggregation Using dplyr

The dplyr package provides simple and readable functions for aggregation using pipes (%>%).

a. Using summarise() and group_by()

​library(dplyr)​

​data %>%​

​group_by(Class) %>%​

​summarise(Average = mean(Marks))​

Output:

Class  Average

​A​ ​82.5​

​B​ ​82.5​

​C​ ​88.0​

​b. Multiple Summary Statistics​

​data %>%​

​group_by(Class) %>%​

​summarise(​

​Avg = mean(Marks),​

​Min = min(Marks),​

​Max = max(Marks),​

​Count = n()​

​)​

​Output:​

Class  Avg  Min  Max  Count

A  82.5  80  85  2

​B​ ​82.5​ ​75​ ​90​ ​2​

​C​ ​88.0​ ​88​ ​88​ ​1​

✅ n() counts number of entries per group.

​c. Grouping by Multiple Columns​

sales <- data.frame(

​Region = c("North", "North", "South", "South", "East"),​

​Product = c("A", "B", "A", "B", "A"),​

​Sales = c(100, 150, 200, 180, 130)​

​)​

​sales %>%​

​group_by(Region, Product) %>%​

​summarise(TotalSales = sum(Sales))​

​Output:​

Region  Product  TotalSales

East  A  130

North  A  100

​North​ ​B​ ​150​

​South​ ​A​ ​200​

​South​ ​B​ ​180​

​✅ Aggregated total sales for each region-product pair.​

6. Using aggregate() for Multiple Functions

​You can use a custom function to compute multiple summaries.​

​Example:​

​aggregate(Marks ~ Class, data = data,​

​FUN = function(x) c(Mean = mean(x), Sum = sum(x)))​

​Output:​

Class  Marks.Mean  Marks.Sum

​A​ ​82.5​ ​165​

​B​ ​82.5​ ​165​

​C​ ​88.0​ ​88​
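One detail worth knowing: when FUN returns several values, aggregate() stores them as a single matrix column rather than as separate columns. The sketch below rebuilds the data from the example and flattens the result with do.call(data.frame, ...), a common base R idiom; the names res and flat are illustrative.

```r
data <- data.frame(
  Class = c("A", "A", "B", "B", "C"),
  Marks = c(80, 85, 90, 75, 88)
)

res <- aggregate(Marks ~ Class, data = data,
                 FUN = function(x) c(Mean = mean(x), Sum = sum(x)))

print(ncol(res))           # Class plus one matrix column called Marks
print(is.matrix(res$Marks))

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, res)
print(names(flat))
print(flat)
```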

7. Aggregation Using data.table

The data.table package is extremely efficient for large datasets.

​Example:​

library(data.table)

dt <- data.table(Class = c("A", "A", "B", "B", "C"),

​Marks = c(80, 85, 90, 75, 88))​

​dt[, .(Average = mean(Marks), Count = .N), by = Class]​

​Output:​

Class  Average  Count

​A​ ​82.5​ ​2​

​B​ ​82.5​ ​2​

​C​ ​88.0​ ​1​

✅ .N gives the number of rows in each group.

​8. Aggregating Missing Values​


Functions like mean() and sum() ignore NA values if you specify na.rm = TRUE.

​Example:​

data <- data.frame(

​Class = c("A", "A", "B"),​

​Marks = c(80, NA, 90)​

)

aggregate(Marks ~ Class, data, mean, na.rm = TRUE)

​Output:​

Class  Marks

​A​ ​80​

​B​ ​90​

​9. Real-World Example: Sales Data​


sales <- data.frame(

​Region = c("East", "East", "West", "West", "North"),​

​Sales = c(120, 100, 200, 180, 160)​

​)​

​sales %>%​

​group_by(Region) %>%​

​summarise(​

​Total = sum(Sales),​

​Average = mean(Sales),​

​Transactions = n()​

​)​

​Output:​

Region  Total  Average  Transactions

​East​ ​220​ ​110​ ​2​

​North​ ​160​ ​160​ ​1​

​West​ ​380​ ​190​ ​2​

​✅ Perfect for generating business summaries.​

​10. Quick Recap​

| Function | Package | Description |
|----------|---------|-------------|
| aggregate() | Base R | Summarizes data by groups |
| tapply() | Base R | Applies function to grouped vector |
| by() | Base R | Applies function to grouped data frame |
| group_by() + summarise() | dplyr | Modern and clean syntax |
| .N, mean(), etc. | data.table | High-performance aggregation |

​✅​​Tips:​

● Always handle missing values using na.rm = TRUE.

● Use dplyr for readable and chainable summaries.

● Use data.table for very large datasets.

​●​ ​Combine aggregation with filtering or sorting for complete insights.​

UNIT 4: BASIC STATISTICAL ANALYSIS
​1.​ ​Introduction to statistical inference​

​2.​ ​Hypothesis testing​

​○​ ​t-tests​

​○​ ​chi-square tests​

​3.​ ​Regression analysis​

​4.​ ​Creating plots using ggplot2​

​○​ ​Scatter plots​

​○​ ​Histograms​

○ Bar plots

5. Customizing plots

​○​ ​Titles​

​○​ ​Labels​

​○​ ​Legends​

​○​ ​Colors​

​○​ ​Themes​

​6.​ ​Exploratory data analysis with visualization techniques​

​7.​ ​Creating reproducible reports​

​○​ ​Generating HTML documents​

​○​ ​Generating PDF documents​

​○​ ​Generating Word documents​

​Introduction to Statistical Inference​
​1. What is Statistical Inference?​

Statistical inference is the process of drawing conclusions about a population based on data collected from a sample. Since analyzing an entire population is often impossible, we use statistics to estimate or test hypotheses about population parameters.

In simpler terms:
👉 We use data from a smaller group (sample) to make educated guesses about a larger group (population).

Example:
If we survey 200 students about their study habits, we can infer patterns for all students in the university.

​2. Key Terms in Statistical Inference​


| Term | Description |
|------|-------------|
| Population | The entire group of individuals or items of interest. |
| Sample | A subset of the population used for analysis. |
| Parameter | A numerical summary that describes a population (e.g., population mean). |
| Statistic | A numerical summary calculated from a sample (e.g., sample mean). |
| Estimation | Using sample data to estimate population parameters. |
| Hypothesis Testing | Using data to test assumptions about a population. |

​3. Two Main Approaches​

​a. Estimation​

​●​ ​Used to estimate unknown population parameters.​

● Example: Estimating the average height of all students using a sample of 100 students.

​b. Hypothesis Testing​

​●​ ​Used to test a claim about a population parameter.​

​●​ ​Example: Testing whether the average exam score is above 70.​

​4. Types of Estimates​

1. Point Estimate – A single value estimate of a population parameter.
Example: Sample mean x̄ = 72 is the point estimate for population mean μ.

2. Interval Estimate – A range of values (confidence interval) within which the true
parameter likely falls.
Example: “The average score is between 70 and 74 with 95% confidence.”
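
Both kinds of estimate can be obtained from one sample. The sketch below uses made-up exam scores (the `scores` vector is an assumption, not data from these notes) and base R's `t.test()` to get the interval.

```r
scores <- c(68, 74, 71, 77, 70, 73, 69, 75, 72, 71)   # hypothetical exam scores

point_estimate <- mean(scores)                  # single-value estimate of the population mean
interval_estimate <- t.test(scores)$conf.int    # 95% confidence interval (t-based)

point_estimate
interval_estimate
```

The point estimate always lies inside the interval, since the interval is built symmetrically around the sample mean.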

​5. Confidence Intervals​

​A​​confidence interval (CI)​​gives a range that likely​​contains the true population value.​

​Formula for a 95% CI for the mean:​

​x̄ ± z * (s / √n)​

​Where:​

● x̄ = sample mean

● s = sample standard deviation

● n = sample size

● z = z-value (1.96 for 95% confidence)

​Example in R:​

xbar <- 72                      # sample mean
s <- 8                          # sample standard deviation
n <- 50                         # sample size
error <- 1.96 * (s / sqrt(n))   # margin of error for 95% confidence
lower <- xbar - error
upper <- xbar + error
c(lower, upper)

Output:

[1] 69.78251 74.21749

✅ Interpretation: We’re 95% confident that the true mean lies between 69.78 and 74.22.
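
The formula uses the z-value 1.96, which is an approximation when the standard deviation is estimated from the sample. A hedged sketch of the t-based alternative, using the same summary values (the variable names are illustrative):

```r
xbar <- 72; s <- 8; n <- 50          # summary values from the example
se <- s / sqrt(n)                    # standard error of the mean

z_crit <- qnorm(0.975)               # about 1.96
t_crit <- qt(0.975, df = n - 1)      # slightly larger than z for a finite sample

ci_z <- xbar + c(-1, 1) * z_crit * se
ci_t <- xbar + c(-1, 1) * t_crit * se
rbind(z = ci_z, t = ci_t)
```

The t-based interval is a little wider; the two converge as the sample size grows.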

​6. Types of Errors​


| Error Type | Description |
| --- | --- |
| Type I Error (α) | Rejecting a true null hypothesis (false positive). |
| Type II Error (β) | Failing to reject a false null hypothesis (false negative). |

​Example:​

​●​ ​Type I: Concluding a medicine works when it doesn’t.​

​●​ ​Type II: Concluding a medicine doesn’t work when it actually does.​

​7. Levels of Significance​

The level of significance (α) is the probability of making a Type I error.
Common values: 0.05, 0.01, 0.10

Example:
If α = 0.05, it means we accept a 5% chance of rejecting the null hypothesis when it is
actually true.
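
The meaning of α can be checked by simulation. The sketch below (simulated data, illustrative names) repeatedly samples from a population where the null hypothesis is true and counts how often a t-test rejects it anyway.

```r
set.seed(1)
alpha <- 0.05
n_sims <- 2000

# Each sample comes from a population whose mean really is 0, so H0 is true.
p_values <- replicate(n_sims, t.test(rnorm(20, mean = 0, sd = 1), mu = 0)$p.value)

type1_rate <- mean(p_values < alpha)   # share of false rejections
type1_rate
```

The observed rate should land near 0.05, which is what "a 5% chance of a Type I error" means in practice.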

​8. Example: Testing a Claim About Average Marks​

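Problem:
A teacher claims that the average score of students is 75.
A sample of 25 students has a mean of 78 with a standard deviation of 10.
Test the claim at a 5% significance level.

Steps in R:

# t.test() needs raw observations; with summary values only, compute the statistic by hand
t_stat <- (78 - 75) / (10 / sqrt(25))          # (xbar - mu) / (s / sqrt(n)) = 1.5
p_value <- 2 * pt(-abs(t_stat), df = 25 - 1)   # two-sided p-value, 24 degrees of freedom
c(t = t_stat, p = p_value)

(With the raw scores available, t.test(scores, mu = 75) performs the same test directly.)

Explanation:
If the p-value < 0.05 → Reject the claim (significant difference).
If the p-value ≥ 0.05 → Fail to reject the claim (no significant difference).
Here t = 1.5 and the p-value is about 0.15, so we fail to reject the claim that the average is 75.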

​9. Real-World Applications​

​●​ ​Predicting population trends from surveys​

​●​ ​Quality control in manufacturing​

​●​ ​Measuring the effect of new medicines​

​●​ ​Market research and opinion polling​

​10. Quick Recap​


| Concept | Meaning |
| --- | --- |
| Statistical inference | Drawing conclusions about population from sample |
| Estimation | Finding unknown population values |
| Hypothesis testing | Checking assumptions about population |
| Confidence interval | Range containing true parameter |
| Significance level | Probability of Type I error |

​✅​​Tips for Exams​

​●​ ​Always define population, sample, parameter, and statistic clearly.​

​●​ ​Know the difference between Type I and Type II errors.​

● Remember that smaller p-values mean stronger evidence against the null
hypothesis.

​●​ ​Confidence interval = estimation; hypothesis testing = decision-making.​

​Hypothesis Testing​

Hypothesis testing is a statistical method used to make decisions or inferences about
population parameters based on sample data.
It helps us answer questions like:
“Is the average income of men and women the same?”
“Did a new drug actually improve recovery time?”

​Let’s learn this step by step in an intuitive way.​

​1. What is a Hypothesis?​


A hypothesis is an assumption or claim about a population parameter.
In statistics, we test whether the sample data are consistent with this claim.

​For example:​

​●​ ​A company claims the average battery life of their phones is​​10 hours​​.​

​●​ ​We collect a sample and test if this claim is true or not.​

​2. Types of Hypotheses​


​a. Null Hypothesis (H₀)​

This is the statement we start with. It assumes no effect or no difference.
It’s the hypothesis we try to disprove.

Example:
H₀: The mean battery life = 10 hours

​b. Alternative Hypothesis (H₁ or Hₐ)​

​This is what we want to prove — that there​​is​​a difference​​or effect.​

Example:
H₁: The mean battery life ≠ 10 hours

​3. Steps in Hypothesis Testing​
​1.​ ​State hypotheses​​(H₀ and H₁)​

​2.​ ​Choose significance level (α)​​– often 0.05​

​3.​ ​Select the appropriate test​​(like t-test or chi-square​​test)​

​4.​ ​Compute the test statistic and p-value​

​5.​ ​Make a decision:​

​○​ ​If​​p < α​​, reject H₀ (significant difference)​

​○​ ​If​​p ≥ α​​, fail to reject H₀ (no significant difference)​
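
The five steps can be traced in code. This is a sketch using the battery-life claim; the `battery` measurements are invented for illustration.

```r
# Step 1: H0: mean battery life = 10 hours; H1: mean battery life != 10 hours
battery <- c(9.6, 10.2, 9.8, 9.5, 10.1, 9.7, 9.9, 9.4)   # hypothetical measurements (hours)

# Step 2: choose the significance level
alpha <- 0.05

# Steps 3 and 4: a one-sample t-test gives the test statistic and p-value
result <- t.test(battery, mu = 10)
result$statistic
result$p.value

# Step 5: decision
decision <- if (result$p.value < alpha) "Reject H0" else "Fail to reject H0"
decision
```

Writing the decision as an explicit comparison with `alpha` keeps the chosen significance level visible in the analysis.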

​4. One-tailed vs. Two-tailed Tests​


| Type | When to Use | Example |
| --- | --- | --- |
| Two-tailed | You want to check if there’s any difference | Mean ≠ 10 |
| Left-tailed | You suspect the mean is less than a value | Mean < 10 |
| Right-tailed | You suspect the mean is greater than a value | Mean > 10 |
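
In R, the tail is chosen with the `alternative` argument of `t.test()`. A short sketch with made-up scores:

```r
scores <- c(78, 82, 75, 80, 77, 85, 79)   # hypothetical sample

two_tailed   <- t.test(scores, mu = 75, alternative = "two.sided")   # H1: mean != 75
right_tailed <- t.test(scores, mu = 75, alternative = "greater")     # H1: mean > 75
left_tailed  <- t.test(scores, mu = 75, alternative = "less")        # H1: mean < 75

c(two   = two_tailed$p.value,
  right = right_tailed$p.value,
  left  = left_tailed$p.value)
```

Because the sample mean is above 75, the right-tailed p-value is half the two-tailed one, and the left-tailed p-value is one minus the right-tailed one.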

​5. t-Tests​
​The​​t-test​​is used when:​

​●​ ​Sample size is small (n < 30)​

​●​ ​Population standard deviation is unknown​

​There are​​three main types​​of t-tests in R:​

​a. One-sample t-test​

​Used to compare a sample mean to a known value.​

​Example:​

scores <- c(78, 82, 75, 80, 77, 85, 79)
t.test(scores, mu = 75)

Explanation:

● scores → your sample data

● mu = 75 → population mean (claimed value)

​If​​p-value < 0.05​​, the sample mean is significantly​​different from 75.​

​b. Two-sample (Independent) t-test​

​Used to compare means of​​two different groups​​.​

​Example:​

group1 <- c(80, 82, 85, 83, 81)
group2 <- c(75, 78, 72, 77, 74)
t.test(group1, group2, var.equal = TRUE)

Explanation:

● group1 and group2 → two independent groups

● var.equal = TRUE → assumes equal variances

​If​​p < 0.05​​, the two group means are significantly​​different.​

​c. Paired t-test​

Used when the two samples are related, for example before and after measurements on
the same people.

Example:

before <- c(65, 70, 68, 72, 66)
after <- c(70, 74, 72, 76, 71)
t.test(before, after, paired = TRUE)

Explanation:
Checks whether the “after” values differ significantly from the “before” values.

​✅​​Use Case:​​Effect of a training program on performance.​
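
A paired t-test is the same as a one-sample t-test on the differences within each pair. The sketch below checks this with the before/after values from the example:

```r
before <- c(65, 70, 68, 72, 66)
after  <- c(70, 74, 72, 76, 71)

paired_result <- t.test(before, after, paired = TRUE)
diff_result   <- t.test(before - after, mu = 0)   # same test, written on the differences

c(paired = paired_result$p.value, differences = diff_result$p.value)
```

Both calls return the same p-value, which is why pairing only makes sense when each "before" value has a matching "after" value.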

​6. Interpreting p-value​


| p-value | Decision | Interpretation |
| --- | --- | --- |
| < 0.05 | Reject H₀ | Significant difference |
| ≥ 0.05 | Fail to reject H₀ | No significant difference |

Example Interpretation:
If p = 0.03, it means that if H₀ were true, a difference at least as large as the one
observed would occur only about 3% of the time. Since 0.03 < 0.05, we reject H₀ and
conclude there is a statistically significant effect.

​7. Chi-Square (χ²) Test​


The Chi-square test is used to analyze categorical data. It checks whether observed counts
match expected counts, or whether two categorical variables are related or independent.

​There are two main types:​

​1.​ ​Chi-square goodness of fit test​

​2.​ ​Chi-square test of independence​

​a. Chi-square Goodness of Fit​

​Used to test whether the observed frequencies match the expected frequencies.​

Example:
Suppose we expect equal distribution of students in 3 courses (Math, CS, Stats),
but the actual counts are different.

observed <- c(45, 30, 25)
expected <- c(33.3, 33.3, 33.3)
chisq.test(x = observed, p = expected/sum(expected))

Explanation:
If p < 0.05, the observed distribution significantly differs from what was expected.
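
To see what `chisq.test()` computes here, the statistic can be worked out by hand. A sketch using the observed counts from the example, with expected counts taken as an exact equal split:

```r
observed <- c(45, 30, 25)
expected <- rep(sum(observed) / 3, 3)            # equal split across the three courses

chi_sq <- sum((observed - expected)^2 / expected)          # chi-square statistic
df <- length(observed) - 1                                 # degrees of freedom
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)     # upper-tail probability

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The statistic is 6.5 on 2 degrees of freedom, and the p-value agrees with `chisq.test(observed)`.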

​b. Chi-square Test of Independence​

​Used to check if two categorical variables are​​independent​​.​

Example:
You want to see if gender is related to course preference.

data <- matrix(c(20, 30, 25, 25), nrow = 2)
colnames(data) <- c("Math", "Science")
rownames(data) <- c("Male", "Female")

chisq.test(data)

Explanation:
If p < 0.05, it means gender and course preference are not independent (they’re related).
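
The test compares the observed table with the counts expected under independence. A sketch that pulls both out of the result object, using the same counts as the example:

```r
data <- matrix(c(20, 30, 25, 25), nrow = 2,
               dimnames = list(c("Male", "Female"), c("Math", "Science")))

result <- chisq.test(data)   # for 2x2 tables R applies Yates' continuity correction by default
result$observed              # the counts we supplied
result$expected              # row total * column total / grand total
result$p.value
```

Comparing `observed` and `expected` cell by cell shows where any departure from independence comes from.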

​8. Real-World Examples​


​●​ ​t-test:​​Comparing average exam scores of two classes​

​●​ ​Paired t-test:​​Testing the effect of a new teaching​​method​

● Chi-square test: Analyzing survey results (e.g., “Is product preference linked to age
group?”)

​9. Quick Recap​


| Test | Used For | Example Question |
| --- | --- | --- |
| One-sample t-test | Compare sample mean to population mean | Is average weight = 60 kg? |
| Two-sample t-test | Compare two independent groups | Do males and females differ in height? |
| Paired t-test | Compare two related samples | Did training improve scores? |
| Chi-square test | Compare categorical data | Is gender related to department choice? |

​✅​​Tips for Exams​

​●​ ​Always write H₀ and H₁ clearly.​

​●​ ​Report​​test statistic​​,​​degrees of freedom​​, and​​p-value​​.​

​●​ ​Mention significance level (usually 0.05).​

​●​ I​nterpret the result in plain language: “There is a significant difference…” or “No​
​significant difference was found.”​

​Regression Analysis​

Regression analysis is a statistical technique used to study the relationship between one
dependent variable and one or more independent variables. It helps in predicting the value
of the dependent variable based on the independent variables. In R, regression analysis is
commonly performed using built-in functions such as lm() for linear regression and glm()
for generalized linear models.

Types of Regression in R:

1. Simple Linear Regression: Used when there is one independent variable.
Example: Predicting sales based on advertising budget.

# Simple Linear Regression
# (illustrative values, deliberately not a perfect straight line,
# so that summary() reports meaningful standard errors)
data <- data.frame(
  sales = c(12, 19, 31, 38, 52),
  budget = c(1, 2, 3, 4, 5)
)
model <- lm(sales ~ budget, data = data)
summary(model)

Output shows coefficients, R-squared value, and significance levels.

2. Multiple Linear Regression: Used when there are two or more independent variables.
Example: Predicting sales based on budget and number of employees.

# Multiple Linear Regression
# (illustrative values; employees is not an exact multiple of budget,
# otherwise lm() could not estimate both coefficients)
data <- data.frame(
  sales = c(12, 19, 31, 38, 52),
  budget = c(1, 2, 3, 4, 5),
  employees = c(2, 3, 5, 4, 7)
)
model <- lm(sales ~ budget + employees, data = data)
summary(model)

3. Polynomial Regression: Fits nonlinear relationships between variables.

# Polynomial Regression
x <- c(1, 2, 3, 4, 5)
y <- c(2, 6, 14, 28, 45)
model <- lm(y ~ poly(x, 2, raw = TRUE))
summary(model)

4. Logistic Regression: Used when the dependent variable is categorical (binary outcome like
0/1).

# Logistic Regression
# (illustrative values; passes and fails overlap in hours, because perfectly
# separated outcomes make glm() warn that fitted probabilities are 0 or 1)
data <- data.frame(
  pass = c(0, 0, 1, 0, 1, 1),
  hours = c(1, 2, 3, 4, 5, 6)
)
model <- glm(pass ~ hours, data = data, family = binomial)
summary(model)
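
A fitted logistic model is usually used to predict probabilities. The sketch below uses a slightly larger made-up data set (all values are assumptions for illustration) and `predict()` with `type = "response"`.

```r
# Hypothetical study-hours data; passes and fails overlap so the model can be estimated
data <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  pass  = c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1)
)
model <- glm(pass ~ hours, data = data, family = binomial)

# type = "response" returns probabilities instead of log-odds
new_students <- data.frame(hours = c(2, 5, 9))
probs <- predict(model, newdata = new_students, type = "response")
probs
```

Each value is the estimated probability of passing for that number of study hours; it rises with hours because the fitted slope is positive.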

​Model Evaluation Metrics:​

​●​ ​R-squared:​​Proportion of variance explained by the​​model.​

​●​ ​Adjusted R-squared:​​Adjusted for number of predictors.​

​●​ ​p-value:​​Checks significance of each predictor.​

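● Residuals: Difference between observed and predicted values.

Visualization of Regression Line:

# Visualization
# (refit the simple model first, since data and model were reassigned in the examples above)
data <- data.frame(sales = c(12, 19, 31, 38, 52), budget = c(1, 2, 3, 4, 5))
model <- lm(sales ~ budget, data = data)
plot(data$budget, data$sales, main = "Regression Line", xlab = "Budget", ylab = "Sales")
abline(model, col = "blue", lwd = 2)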
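
The evaluation metrics listed earlier can be read directly from a fitted model. A sketch with made-up sales figures (the data and variable names are assumptions):

```r
# Hypothetical data: sales against advertising budget
data <- data.frame(
  budget = c(1, 2, 3, 4, 5, 6),
  sales  = c(12, 19, 31, 38, 52, 58)
)
model <- lm(sales ~ budget, data = data)
s <- summary(model)

s$r.squared               # R-squared
s$adj.r.squared           # adjusted R-squared
coef(s)[, "Pr(>|t|)"]     # p-value for each coefficient
residuals(model)          # observed minus fitted values

predict(model, newdata = data.frame(budget = 7))   # prediction for a new budget
```

For a least-squares fit with an intercept the residuals sum to zero, which is a quick sanity check on the model.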

​Applications:​

​●​ ​Predicting sales, prices, or growth.​

​●​ ​Evaluating influence of multiple factors.​

​●​ ​Risk assessment and forecasting.​

​Creating Plots Using ggplot2​

The ggplot2 package in R is one of the most powerful and flexible tools for data
visualization. It follows the Grammar of Graphics concept, where a plot is built step-by-step
by adding layers such as data, aesthetics, geometries, and themes.

To use ggplot2, first install and load the package:

install.packages("ggplot2")
library(ggplot2)

​1. Basic Structure of a ggplot​

​The general syntax of ggplot2 is:​

​ggplot(data, aes(x, y)) + geom_<type>() + other layers​

​●​ ​data​​: The dataset used for plotting.​

​●​ ​aes()​​: Defines aesthetic mappings (x and y axes, color,​​size, etc.).​

​●​ ​geom_()​​: Adds a geometric layer such as points, lines,​​bars, etc.​
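
The layered idea is easiest to see by building a plot one piece at a time. A minimal sketch, assuming ggplot2 is installed; the object name `p` is arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg))   # data + aesthetic mapping (nothing drawn yet)
p <- p + geom_point()                       # add a geometric layer
p <- p + labs(title = "Weight vs MPG")      # add labels
p <- p + theme_minimal()                    # add a theme

p   # printing the object draws the plot
```

Because each `+` returns a new plot object, intermediate versions can be stored and reused.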

​2. Scatter Plots​

​A scatter plot shows the relationship between two numeric variables.​

# Scatter Plot
​ggplot(mtcars, aes(x = wt, y = mpg)) +​
​geom_point(color = "blue", size = 3) +​
​ggtitle("Scatter Plot of Weight vs MPG") +​
​xlab("Weight") + ylab("Miles per Gallon")​

Explanation:

● Each point represents a car.

● The plot shows how fuel efficiency tends to decrease as car weight increases.

​3. Histograms​

​Histograms are used to visualize the distribution of a single numeric variable.​

# Histogram
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  ggtitle("Histogram of Miles per Gallon")

Explanation:

● binwidth defines the width of each bar.

​●​ ​Helps identify the frequency distribution of data.​

​4. Bar Plots​

​Bar plots represent categorical data using rectangular bars.​

# Bar Plot
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "orange", color = "black") +
  ggtitle("Number of Cars by Cylinder Type") +
  xlab("Cylinders") + ylab("Count")

Explanation:

● factor(cyl) converts the numeric variable into a categorical one.

​●​ ​Each bar shows the number of cars for a specific cylinder count.​

​5. Line Plots​

​Line plots are used to show trends over a continuous variable (often time).​

# Line Plot
​ggplot(economics, aes(x = date, y = unemploy)) +​
​geom_line(color = "darkgreen", linewidth = 1) +​
​ggtitle("Unemployment Over Time") +​
​xlab("Date") + ylab("Number of Unemployed")​

​Explanation:​

​●​ ​Shows changes in unemployment over time using connected points.​

​6. Box Plots​

​Box plots summarize data using median, quartiles, and outliers.​

# Box Plot
​ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +​
​geom_boxplot(fill = "lightgreen") +​
​ggtitle("MPG by Cylinder Type") +​
​xlab("Cylinders") + ylab("Miles per Gallon")​

​Explanation:​

​●​ ​Displays spread and skewness of MPG for each cylinder category.​

​7. Density Plots​

​Density plots are smooth versions of histograms.​

# Density Plot
​ggplot(mtcars, aes(x = mpg)) +​
​geom_density(fill = "pink", alpha = 0.5) +​
​ggtitle("Density Plot of MPG")​

​Key Features of ggplot2​

​●​ ​Layered plotting (add or remove elements easily).​

​●​ ​Supports customization of themes, colors, and labels.​

● Compatible with various data transformation packages like dplyr.

​●​ ​Allows advanced statistical visualization (facets, trend lines, etc.).​

​Example: Adding Multiple Layers​

# Combined Plot
​ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +​
​geom_point(size = 3) +​
​geom_smooth(method = "lm", se = FALSE) +​
​ggtitle("Relationship Between Weight and MPG by Cylinder") +​
​xlab("Weight") + ylab("Miles per Gallon")​

​This adds both scatter points and a fitted regression line for each cylinder category.​
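
Each fitted line drawn by `geom_smooth(method = "lm")` corresponds to a separate linear model per group. A base R sketch of the same fits (the names `fits` and `per_group` are illustrative):

```r
# Fit mpg ~ wt separately within each cylinder group (4, 6, 8)
fits <- lapply(split(mtcars, mtcars$cyl),
               function(d) coef(lm(mpg ~ wt, data = d)))

per_group <- do.call(rbind, fits)   # one row per group: intercept and slope
per_group
```

All three slopes are negative, matching the downward lines in the plot.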

​Customizing Plots​

While ggplot2 provides beautiful default visuals, customizing your plots helps make them
clearer, more readable, and presentation-ready. You can change almost every element,
from titles and axis labels to colors, legends, and themes.

​Let’s explore the most common customizations step by step.​

​1. Adding Titles, Subtitles, and Captions​

​You can add descriptive titles and captions to make your plots more informative.​

​ggplot(mtcars, aes(x = wt, y = mpg)) +​


​geom_point(color = "blue") +​
​labs(​
​title = "Relationship Between Car Weight and Mileage",​
​subtitle = "Data from mtcars dataset",​
​x = "Weight (1000 lbs)",​
​y = "Miles per Gallon",​
​caption = "Source: R mtcars dataset"​
​)​

​Explanation:​

● title adds the main heading.

● subtitle gives additional context.

● x and y label the axes.

● caption appears at the bottom, useful for mentioning data sources.

​2. Customizing Axis Labels and Ticks​

You can rename the axes and control tick positions and axis ranges for better clarity.

​ggplot(mtcars, aes(x = wt, y = mpg)) +​


​geom_point() +​
​scale_x_continuous(name = "Car Weight", breaks = seq(2, 5, 0.5)) +​
​scale_y_continuous(name = "Fuel Efficiency (MPG)", limits = c(10, 35))​

​Explanation:​

● breaks defines where the tick marks are placed.

● limits restricts the axis range (points outside the limits are dropped with a warning).

​3. Customizing Colors​

​You can assign specific colors manually or use predefined color scales.​

​Example 1: Manual Colors for Categorical Variables​

​ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +​


​geom_bar() +​
​scale_fill_manual(values = c("4" = "skyblue", "6" = "orange", "8" = "green")) +​
​labs(fill = "Cylinders")​

​Example 2: Using a Gradient for Continuous Data​

​ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +​


​geom_point(size = 3) +​
​scale_color_gradient(low = "lightblue", high = "darkred")​

​Explanation:​

● scale_fill_manual() and scale_color_gradient() let you precisely define
color schemes.

​4. Customizing Legends​

​You can change legend position, title, and appearance.​

​ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +​


​geom_point(size = 3) +​
​labs(color = "Cylinders") +​
theme(legend.position = "bottom")

Tip: You can also use "top", "left", or "right", and "none" removes the legend.

​5. Customizing Themes​

​Themes control the​​overall appearance​​(background,​​grid lines, fonts).​

​Example:​

​ggplot(mtcars, aes(x = wt, y = mpg)) +​


​geom_point(size = 3, color = "darkblue") +​
​theme_minimal() +​
​labs(title = "Weight vs Mileage") +​
​theme(​
plot.title = element_text(size = 14, face = "bold", color = "darkred"),
axis.title = element_text(size = 12, face = "bold"),
panel.grid.major = element_line(color = "grey80")
​)​

​Popular Built-in Themes:​

● theme_bw() – black and white clean style.

● theme_minimal() – modern minimal look.

● theme_classic() – simple with axes and no grid lines.

● theme_light() – gentle background.

​6. Faceting (Multiple Plots by Category)​

​Faceting allows you to split data into multiple panels automatically.​

​ggplot(mtcars, aes(x = wt, y = mpg)) +​


​geom_point(color = "blue") +​
​facet_wrap(~ cyl) +​
​labs(title = "MPG vs Weight for Different Cylinders")​

Explanation:
Each panel shows data for one cylinder category, a great way to compare groups visually.

​7. Combining Multiple Customizations​

​Here’s how you can combine all these ideas in one polished plot:​

​ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +​


​geom_point(size = 3) +​
​geom_smooth(method = "lm", se = FALSE) +​
​scale_color_manual(values = c("red", "green", "blue")) +​
​labs(​
​title = "Effect of Weight on Mileage",​
​subtitle = "Comparison by Number of Cylinders",​
​x = "Car Weight (1000 lbs)",​
​y = "Miles per Gallon",​
​color = "Cylinders"​
​) +​
​theme_classic() +​
​theme(​
plot.title = element_text(face = "bold", color = "darkred", size = 14),
legend.position = "bottom"
​)​

Explanation:
This plot includes:

​●​ ​Title, labels, and legend.​

​●​ ​Custom colors.​

​●​ ​Trend lines (regression).​

​●​ ​Clean theme and professional layout.​

🧭 Quick Summary: Plot Customization Tips

● Use labs() for all text elements.

● Use theme() for fine styling (fonts, colors, positions).

● Use scale_...() functions for axis and color control.

● Combine multiple layers with + for complex, professional visuals.


​ ​​Exam Tip:​
​Questions often ask about​​plot customization functions​​and​​theme control in ggplot2​​.​
​Remember:​

● labs() for labels,

● theme() for style,

● scale_...() for colors/scales.

​Exploratory Data Analysis (EDA) with Visualization Techniques​

Exploratory Data Analysis (EDA) is the process of visually and statistically exploring data
to understand its structure, patterns, relationships, and anomalies before formal
modeling.
It’s one of the most important phases in any data science project because it helps you
discover what the data is trying to tell you.

EDA in R often combines summary statistics, data visualization, and data cleaning
techniques, and ggplot2 is one of the most powerful tools for this.

​1. Understanding EDA​

​Before running complex models, EDA helps answer questions like:​

​●​ ​What does the distribution of each variable look like?​

​●​ ​Are there outliers or missing values?​

​●​ ​How are different variables related?​

​●​ ​Are there hidden patterns or correlations?​

EDA is both a science and an art: it’s about asking the right questions and visually
exploring answers.
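
Several of these questions can be answered with a few base R calls before any plotting. A quick sketch on the built-in mtcars data:

```r
str(mtcars)                                          # variable types and dimensions
missing_per_column <- colSums(is.na(mtcars))         # missing values in each column
distinct_values <- sapply(mtcars, function(col) length(unique(col)))  # distinct values per column
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)             # linear relationship between weight and mileage

missing_per_column
distinct_values
wt_mpg_cor
```

Columns with only a handful of distinct values (such as cyl or gear) are good candidates for treating as categories in plots.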

​2. Common Visualization Tools for EDA​

​Let’s look at some visualization types commonly used in EDA.​

​a) Histograms — Understanding Data Distribution​

​Histograms help you visualize how data is distributed across ranges.​

​ggplot(mtcars, aes(x = mpg)) +​


​geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +​
​labs(title = "Distribution of Miles per Gallon", x = "MPG", y = "Frequency")​

I​nsight:​
​Check if the data is​​normally distributed​​,​​skewed​​,​​or has​​outliers​​.​

​b) Box Plots — Detecting Outliers​

​Box plots show spread and identify outliers easily.​

​ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +​


​geom_boxplot() +​
​labs(title = "Mileage by Cylinder Type", x = "Cylinders", y = "MPG")​

I​nsight:​
​Higher-cylinder cars generally have lower mileage, and box plots can confirm that visually.​

​c) Scatter Plots — Checking Relationships Between Variables​

​Scatter plots help identify​​correlations​​between two​​continuous variables.​

​ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +​


​geom_point(size = 3) +​
​labs(title = "Relationship Between Weight, Mileage, and Horsepower")​

I​nsight:​
​Cars with higher weight tend to have lower mileage, and horsepower also influences the​
​trend.​

​d) Pair Plots — Multiple Relationships at Once​

The GGally package allows creating pair plots (scatterplots for every numeric variable pair).

library(GGally)
ggpairs(mtcars[, c("mpg", "wt", "hp", "disp")])

I​nsight:​
​Quickly see correlations and patterns across several variables.​

​e) Correlation Heatmaps​

​Heatmaps visually display correlations between numeric variables.​

library(reshape2)
​cor_matrix <- cor(mtcars)​
​melted_cor <- melt(cor_matrix)​
​ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +​
​geom_tile() +​
​scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 0) +​
​labs(title = "Correlation Heatmap")​

I​nsight:​
​Strong correlations (positive or negative) appear darker, helping spot variable​
​dependencies.​
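
The numbers behind the heatmap can also be inspected directly. A sketch that lists the variables most strongly correlated with mpg (names such as `mpg_cor` are illustrative):

```r
cor_matrix <- cor(mtcars)

# Correlation of every other variable with mpg (mpg's correlation with itself is dropped)
mpg_cor <- cor_matrix["mpg", colnames(cor_matrix) != "mpg"]

# Three strongest relationships, ranked by absolute value
strongest <- sort(abs(mpg_cor), decreasing = TRUE)[1:3]
round(mpg_cor[names(strongest)], 2)
```

Ranking by absolute value matters because a strong negative correlation is as informative as a strong positive one.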

​f) Bar Charts — Understanding Categorical Data​

​Bar charts visualize the frequency of categories.​

​ggplot(mtcars, aes(x = factor(gear), fill = factor(gear))) +​


​geom_bar() +​
​labs(title = "Frequency of Gear Types", x = "Gears", y = "Count")​

I​nsight:​
​Helps in comparing categories like gear types, fuel type, or transmission.​

​3. Combining EDA with Summary Statistics​

​Visualization becomes more powerful when supported by summary functions:​

​summary(mtcars)​

​This gives quick stats like​​mean​​,​​median​​,​​min​​,​​max​​,​​and​​quartiles​​for every variable.​

​You can also group and summarize data:​

library(dplyr)
​mtcars %>%​
​group_by(cyl) %>%​
​summarise(avg_mpg = mean(mpg), avg_hp = mean(hp))​

Explanation:
Summarizes mileage and horsepower by the number of cylinders, a useful analytical
view.

​4. Outlier Detection​

Visual techniques such as boxplots and scatter plots are best for spotting outliers, but you
can also detect them programmatically.

outliers <- boxplot.stats(mtcars$mpg)$out
outliers

Explanation:
This returns the MPG values that fall more than 1.5 × IQR beyond the box (the same rule a
boxplot uses to draw outlier points). An empty result means no values were flagged.
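
The 1.5 × IQR rule can also be written out by hand. This sketch uses `quantile()`, whose quartiles can differ slightly from the hinges used by `boxplot.stats()`, so the two approaches may not flag exactly the same values.

```r
x <- mtcars$mpg
q1 <- quantile(x, 0.25)              # first quartile
q3 <- quantile(x, 0.75)              # third quartile
iqr <- q3 - q1                       # interquartile range, same as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

flagged <- x[x < lower_fence | x > upper_fence]   # values outside the fences
flagged
```

Writing the fences explicitly makes it easy to change the multiplier (for example 3 × IQR for extreme outliers only).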

​5. Combining Multiple Insights​

You can create dashboards of multiple visuals using packages like patchwork or cowplot
to combine plots for a holistic view.

library(patchwork)
p1 <- ggplot(mtcars, aes(mpg)) + geom_histogram(binwidth = 2, fill = "skyblue")
p2 <- ggplot(mtcars, aes(wt, mpg)) + geom_point(color = "darkgreen")
p1 + p2

​🧩 Summary of EDA Visualization Techniques​
| Visualization | Purpose |
| --- | --- |
| Histogram | Distribution of numeric data |
| Box Plot | Spread & outliers |
| Scatter Plot | Relationship between two variables |
| Heatmap | Correlations |
| Bar Chart | Frequency of categorical variables |
| Pair Plot | Multivariate relationships |

​💡 Quick Tips​

​●​ ​Always start EDA with​​summary statistics​​and​​basic​​plots​​.​

​●​ ​Use​​color and shape​​wisely to highlight relationships.​

​●​ ​Look for​​outliers​​,​​missing values​​, and​​skewed distributions​​.​

​●​ ​Combine visuals for deeper insights.​


​ ​​Exam Tip:​
​Common questions:​

​●​ ​What is EDA and why is it important?​

​●​ ​Which visualization tools are used for EDA?​

​●​ ​Explain how to detect outliers or correlations in data using ggplot2.​

​Creating Reproducible Reports​

When working with data analysis or research, it’s important to make your work reproducible,
meaning anyone can rerun your code and get the same results, along with all the visuals,
explanations, and outputs in one organized report.

In R, this is achieved using R Markdown, a powerful tool that combines code, output,
and text in a single document. From R Markdown, you can generate professional reports in
HTML, PDF, or Word formats.

​1. What is R Markdown?​

​R Markdown is a special type of document that lets you:​

​●​ ​Write​​normal text (like a report)​​in Markdown format.​

​●​ ​Insert​​R code chunks​​that execute and display results.​

​●​ ​Export your work as a formatted​​HTML​​,​​PDF​​, or​​Word​​document.​

You can create a new R Markdown file in RStudio by:
File → New File → R Markdown

​2. Basic Structure of an R Markdown File​

​A typical R Markdown file has three main parts:​

​1.​ ​YAML Header​​— defines title, author, and output type​

​2.​ ​Markdown Text​​— regular descriptive content​

​3.​ ​R Code Chunks​​— embedded executable code​

​Example:​

---
title: "My Analysis Report"
author: "Shah Faisal"
date: "2025-11-10"
output: html_document
---

## Introduction
This report explores the relationship between car weight and mileage.

```{r}
# R code chunk
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(color = "blue")
```

## Conclusion
We observe that heavier cars tend to have lower mileage.

3. Creating HTML Documents

If you select HTML as the output format, your report will be generated as a web page
that opens in any browser.

output: html_document

Then click Knit in RStudio (the blue yarn button 🧶).
It will create an .html file that can be opened in a browser.
​It will create an​​

​Advantages:​

● Supports hyperlinks and interactive content (for example, HTML widgets).

​●​ ​Attractive formatting and color themes.​

​●​ ​Easily shareable online.​

​Example Output:​

​ggplot(mtcars, aes(wt, mpg)) +​


​geom_point(color = "purple") +​
​geom_smooth(method = "lm", se = FALSE)​

​4. Creating PDF Documents​

​To create a​​PDF report​​, specify:​

​output: pdf_document​

​PDF reports are great for​​official documentation or​​academic submissions​​.​

​Note:​​You’ll need to have​​LaTeX​​installed (RStudio​​will guide you if it’s missing).​

​Advantages:​

​●​ ​Professional formatting​

​●​ ​Printable reports​

​●​ ​Widely accepted in research settings​

​5. Creating Word Documents​

​To export your report to Microsoft Word:​

​output: word_document​

This creates a .docx file that you can open and edit in Word.

​Advantages:​

​●​ ​Editable format​

​●​ ​Ideal for teamwork or assignments requiring revisions​

​6. Adding Plots and Tables​

​R Markdown automatically includes plots or tables generated by your R code chunks.​

​Example:​

summary(mtcars$mpg)
boxplot(mtcars$mpg, main = "Boxplot of MPG")

​You can also display data frames neatly:​

​head(mtcars)​

​7. Combining Text and Code​

​This is where R Markdown shines — you can mix your explanations and visuals together:​

​The dataset shows that cars with higher **weight** (`wt`) have lower **mileage** (`mpg`).​

```{r}
plot(mtcars$wt, mtcars$mpg)
```

This structure makes your report readable like a story: each code snippet is
immediately explained by the text around it.

8. Adding Inline Code

You can embed small code results directly inside your text using backticks and `r`.

Example:

The dataset contains `r nrow(mtcars)` observations and `r ncol(mtcars)` variables.

When knitted, R replaces each inline expression with its value (here, 32 and 11).

​9. Customizing Reports​

​You can style your report using various options:​

● Change themes (theme: cerulean or cosmo)

● Add table of contents (toc: true)

● Control figure size (chunk options fig.width, fig.height)

​Example YAML:​

output:
  html_document:
    theme: cerulean
    toc: true
    toc_float: true

​10. Benefits of Reproducible Reports​

​●​ ​Transparency​​: Every analysis step is documented.​

​●​ ​Reproducibility​​: Others can verify or build upon your​​work.​

​●​ ​Automation​​: Update one dataset, and the entire report​​updates automatically.​

● Professional Presentation: Clean, structured reports suitable for projects and
research.

​🧩 Summary Table: Report Types​


| Output Type | File Extension | Ideal For | Key Benefit |
| --- | --- | --- | --- |
| HTML | .html | Interactive sharing | Beautiful web-based layout |
| PDF | .pdf | Academic/official use | Print-ready format |
| Word | .docx | Editable reports | Easy to modify or annotate |

​💡 Quick Tips​

● Use ## and ### for section headings.

● Use triple backticks followed by {r} (```{r}) to open an R code chunk, and triple backticks to close it.

​●​ ​Always write meaningful section titles.​

​●​ ​Click​​Knit​​to render your report.​


​ ​​Exam Tip:​
​Common questions:​

​●​ ​What is R Markdown?​

​●​ ​How do you create reproducible reports in R?​

​●​ ​Differentiate between HTML, PDF, and Word outputs in R Markdown.​
