R Programming Basics and Overview
NOTES
Second Semester, DSAI, Department of Computer Science, University of Kashmir
UNIT 1: INTRODUCTION TO R
1. Overview of R and RStudio
○ Variables
○ Operators
○ if-else
○ loops
4. Functions in R
Overview of R and RStudio
1. Introduction to R
Think of R as a digital lab for data scientists: a place where you can experiment with data, analyze patterns, and visualize results beautifully.
Key Features of R
● Extensive Libraries: Thousands of built-in and external packages for data science, statistics, and machine learning.
● Strong Visualization Support: R creates high-quality plots and graphs with libraries like ggplot2 and lattice.
● Data Handling: Efficiently handles large datasets and supports data cleaning and transformation.
● Community Support: Large, active community providing help, tutorials, and open-source packages.
Why R?
Example:
# A simple R example
x <- c(2, 4, 6, 8, 10)
mean(x)
Explanation:
● c() creates a vector of numbers.
● mean() calculates the average of the given vector.
This simple code shows how quick and readable R is for basic data analysis.
RStudio provides a clean and organized interface to write, test, and visualize your R code efficiently.
○ Files are usually saved with the extension .R.
3. How R and RStudio Work Together
● RStudio sends that code to the R interpreter, which processes and executes it.
● The results (numbers, plots, or errors) appear in the RStudio Console or Plots window.
○ Start typing commands in the Console or create a new R Script via: File → New File → R Script.
● Data Science: Used for analyzing datasets, making predictions, and creating dashboards.
● Marketing: Customer segmentation and trend analysis.
● R scripts end with .R.
print("Hello, R World!")
Basics of R Programming
Every programming language begins with understanding its building blocks: how to store information, what kinds of information exist, and how to perform operations on them.
In R, these basic concepts revolve around variables, data types, and operators.
1. Variables in R
In R, you can assign a value to a variable using any of the following operators:
x <- 10 # most common
y = 20 # also works
30 -> z # less common but valid
All three mean the same thing: they assign a value to a variable.
● R is case-sensitive → Data and data are two different variables.
Example
name <- "R Programming"
version <- 4.3
isFun <- TRUE
Explanation:
● name stores a text value (called a string)
● version stores a number
● isFun stores a logical value (TRUE/FALSE)
You can check the value of a variable by just typing its name:
name
R supports several data types, each representing a different kind of data.
Let’s look at the most common ones:
Data Type    Example            Description
Numeric      12.5, -4, 7.0      Numbers with or without decimals
Integer      10L, -2L           Whole numbers (use L to specify integer)
Character    "Hello", 'Data'    Text or string data
Logical      TRUE, FALSE        Boolean values for conditions
Complex      2 + 3i             Numbers with real and imaginary parts
Example
a <- 15.7 # numeric
b <- 10L # integer
c <- "R is fun" # character
d <- TRUE # logical
e <- 2 + 3i # complex
class(a)
typeof(b)
These functions help you understand what kind of data each variable holds.
Function          Converts To
as.numeric()      Numeric
as.integer()      Integer
as.character()    Character
as.logical()      Logical
Example
x <- "25"
as.numeric(x)
Explanation:
This converts the string "25" into the number 25.
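The conversion functions in the table can be tried together. The sketch below uses made-up sample values (they are assumptions, not taken from the notes) and also shows what is likely to surprise a beginner: text that is not a number converts to NA.

```r
# Sample values below are assumed for illustration.
x <- "25"
num  <- as.numeric(x)        # character -> numeric
int  <- as.integer("42")     # character -> integer
txt  <- as.character(3.5)    # numeric -> character
flag <- as.logical("TRUE")   # character -> logical

# Text that does not look like a number becomes NA (R also raises a warning,
# silenced here so the script output stays clean).
bad <- suppressWarnings(as.numeric("abc"))

print(class(num))
print(is.na(bad))
```

Checking the result with class() or is.na() after a conversion is a cheap way to catch bad input early.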
4. Operators in R
Operator    Description            Example     Result
+           Addition               5 + 3       8
-           Subtraction            5 - 2       3
*           Multiplication         4 * 2       8
/           Division               10 / 2      5
%%          Modulus (remainder)    10 %% 3     1
%/%         Integer division       10 %/% 3    3
^           Power                  2 ^ 3       8
Example
a <- 10
b <- 3
a + b
a %/% b
Used to compare values.
Operator    Description              Example    Result
==          Equal to                 5 == 5     TRUE
!=          Not equal to             5 != 3     TRUE
>           Greater than             7 > 3      TRUE
<           Less than                2 < 5      TRUE
>=          Greater than or equal    4 >= 4     TRUE
<=          Less than or equal       3 <= 2     FALSE
Example
x <- 10
y <- 20
x > y
x <= y
Explanation:
● The first expression returns FALSE
● The second returns TRUE
Operator    Description        Example              Result
&           AND (both true)    (5 > 2) & (3 < 6)    TRUE
!           NOT (negation)     !(5 > 2)             FALSE
Operator    Example    Description
<-          x <- 10    assigns 10 to x
->          10 -> x    assigns 10 to x
=           x = 10     assigns 10 to x
Operator    Description              Example
:           Sequence generator       1:5 gives 1 2 3 4 5
%in%        Membership test          2 %in% c(1,2,3) → TRUE
%*%         Matrix multiplication    Used for multiplying matrices
💡 Real-World Example
Explanation:
We created two numeric variables and calculated their average, a basic but common data analysis operation.
● Use <- for assignment.
● Check type: class(), typeof().
● : creates sequences, %in% checks membership.
Control Structures in R
Control structures help you control the flow of your program: deciding what to do next based on certain conditions or repeating actions multiple times.
In simple words, they make your R programs smarter and more dynamic.
2. Loops
Syntax
if (condition) {
  # code to run if condition is TRUE
} else if (another_condition) {
  # code to run if the above is FALSE but this is TRUE
} else {
  # code to run if none are TRUE
}
Explanation:
Since x is greater than 0, the condition is TRUE, so the message “Positive number” is printed.
Explanation:
Here, x is -3, so the condition x >= 0 is FALSE. The else block runs and prints “Negative number”.
Explanation:
This example checks multiple conditions: it first checks for positive, then negative, and finally prints “Zero” if neither is true.
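The three explanations above can be tied together in one short sketch. The helper name classify() and the test values are assumptions chosen for illustration; only the printed messages come from the notes.

```r
# classify() is a hypothetical helper, not part of the original notes.
classify <- function(x) {
  if (x > 0) {
    "Positive number"
  } else if (x < 0) {
    "Negative number"
  } else {
    "Zero"
  }
}

print(classify(5))
print(classify(-3))
print(classify(0))
```

Wrapping the chain in a function makes it easy to try several inputs without retyping the if-else block.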
Example 4: Nested if
You can also place one if inside another.
x <- 20
if (x > 10) {
  if (x < 30) {
    print("Between 10 and 30")
  }
}
Explanation:
The inner if only runs if the outer condition is true, making it a nested decision.
2. Loops in R
Syntax:
Example:
for (i in 1:5) {
  print(paste("This is loop number", i))
}
Explanation:
● 1:5 creates a sequence (1, 2, 3, 4, 5).
● The loop runs five times, printing the message each time with the loop number.
Used when you don’t know exactly how many times to loop: it runs as long as a condition remains TRUE.
Syntax:
while (condition) {
# code to execute
}
Example:
count <- 1
while (count <= 5) {
  print(paste("Count is", count))
  count <- count + 1
}
Explanation:
The loop keeps printing until count becomes greater than 5.
If you forget to update count, this can lead to an infinite loop.
Syntax:
repeat {
  # code
  if (condition) {
    break
  }
}
Example:
x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x > 5) {
    break
  }
}
Explanation:
The loop keeps printing x and increases it by 1 until x becomes greater than 5, then stops.
Sometimes you may want to skip certain iterations or exit a loop early.
A. break
for (i in 1:10) {
if (i == 6) {
break
}
print(i)
}
Output:
1 2 3 4 5
Stops when i equals 6.
B. next
for (i in 1:5) {
if (i == 3) {
next
}
print(i)
}
Output:
1 2 4 5
The loop skips printing 3.
Loops and conditions often work together in real tasks like data cleaning or summarizing values.
Example:
Explanation:
This loop checks each number in the list and prints whether it’s even or odd.
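A loop of the kind described here might look like the following sketch. The input values and the variable names numbers and labels are assumptions for illustration.

```r
# Sample input, assumed for illustration.
numbers <- c(3, 8, 12, 7)
labels <- character(0)

for (n in numbers) {
  # %% gives the remainder; a remainder of 0 after dividing by 2 means even.
  if (n %% 2 == 0) {
    label <- paste(n, "is even")
  } else {
    label <- paste(n, "is odd")
  }
  print(label)
  labels <- c(labels, label)  # keep the results for later use
}
```

Collecting the messages in labels as well as printing them makes the result reusable after the loop ends.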
💡 Real-World Example
Explanation:
This loop goes through each score and checks if it’s a pass or fail, similar to an automated grading system.
● Infinite loops happen if you forget to update your loop variable.
Functions in R
Functions are the heart of R programming: they help you organize your code, reuse logic, and simplify complex tasks.
Think of a function as a mini-program inside your main program that performs a specific job whenever you call it.
For example, if you often need to calculate the average of numbers, instead of rewriting the same code again and again, you can just write a function once and reuse it whenever needed.
Syntax
function_name <- function(arguments) {
# body of the function
# code to execute
return(result)
}
● function_name → the name you give to your function.
● function() → defines the function.
● arguments → inputs the function takes.
● return() → sends back the output (optional but good practice).
Explanation:
● add_numbers is a user-defined function that takes two arguments.
● It adds them and returns the result → output will be 12.
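A function matching this description could be written as below. The name add_numbers and the result 12 come from the notes; the argument names and the inputs 5 and 7 are assumptions that happen to give that result.

```r
# Argument names a and b, and the inputs 5 and 7, are assumed.
add_numbers <- function(a, b) {
  result <- a + b
  return(result)
}

total <- add_numbers(5, 7)
print(total)
```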
Explanation:
This function doesn’t take any arguments. It simply prints a message whenever you call it.
Explanation:
● If you pass "Faisal", it personalizes the message.
Explanation:
This function returns both sum and difference in a list.
You can access each value using $.
Function                  Description         Example              Result
abs(x)                    Absolute value      abs(-5)              5
sqrt(x)                   Square root         sqrt(16)             4
exp(x)                    Exponential         exp(1)               2.718
log(x)                    Natural log         log(10)              2.303
log10(x)                  Base-10 log         log10(100)           2
round(x, n)               Round to n digits   round(3.14159, 2)    3.14
ceiling(x)                Round up            ceiling(2.3)         3
floor(x)                  Round down          floor(2.9)           2
sin(x), cos(x), tan(x)    Trigonometric       sin(pi/2)            1
sum(x)                    Sum of elements     sum(c(1,2,3))        6
mean(x)                   Average value       mean(c(2,4,6))       4
max(x)                    Maximum value       max(c(5,9,2))        9
min(x)                    Minimum value       min(c(5,9,2))        2
Example
nums <- c(2, 4, 6, 8)
mean(nums)
sd(nums)
Explanation:
● mean() finds the average.
● sd() gives the standard deviation: how spread out the data is.
Working with text (called strings) is common in R, like cleaning names, formatting output, or labeling graphs.
Function                  Description                  Example                            Result
nchar(x)                  Counts characters            nchar("Hello")                     5
toupper(x)                Converts to uppercase        toupper("data")                    "DATA"
tolower(x)                Converts to lowercase        tolower("RStudio")                 "rstudio"
substr(x, start, stop)    Extracts part of a string    substr("Learning", 1, 4)           "Lear"
paste(x, y, sep=" ")      Joins strings                paste("R", "Language")             "R Language"
paste0(x, y)              Joins without space          paste0("Data", "Science")          "DataScience"
strsplit(x, split)        Splits a string              strsplit("R is fun", " ")          "R" "is" "fun"
grep(pattern, x)          Finds matching text          grep("R", c("R","Python","C"))     1
Explanation:
● strsplit() breaks the sentence into words.
● words[[1]] accesses the vector of words inside the list returned by strsplit().
● toupper() converts all words to uppercase.
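The three steps in this explanation can be sketched as follows. The sentence is borrowed from the strsplit() row of the table; the variable names other than words are assumptions.

```r
sentence <- "R is fun"

# strsplit() returns a list with one character vector per input string.
words <- strsplit(sentence, " ")

# [[1]] pulls out the character vector for the first (and only) sentence.
first_sentence_words <- words[[1]]

upper_words <- toupper(first_sentence_words)
print(upper_words)
```

The double brackets matter here: words[1] would still be a list, while words[[1]] is the plain character vector that toupper() and most string functions expect.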
They’re especially powerful in data analysis where repetitive operations are common, such as cleaning multiple datasets or computing statistical measures.
💡 Real-World Example
Let’s say you want to calculate total marks and percentage for a student:
Explanation:
This function computes both total and percentage for a student: practical, reusable, and easy to extend for more subjects.
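One way such a function might be written is sketched below. The function name student_result(), the marks, and the assumption that every subject is out of 100 are all illustrative choices, not taken from the notes.

```r
# student_result() and max_per_subject are hypothetical names.
student_result <- function(marks, max_per_subject = 100) {
  total <- sum(marks)
  percentage <- total / (length(marks) * max_per_subject) * 100
  # Return both values together in a named list.
  list(total = total, percentage = percentage)
}

res <- student_result(c(80, 90, 70))  # three subjects, marks assumed
print(res$total)
print(res$percentage)
```

Because the function takes a vector of marks, adding more subjects only means passing a longer vector.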
● Define a function using function().
● Mathematical functions like sum(), mean(), sqrt(), etc.
● String functions like nchar(), toupper(), paste(), etc.
User-defined Functions
User-defined functions in R are custom functions that you create to perform specific tasks not covered by R’s built-in functions. They help you make your programs modular, readable, and reusable.
You can define your own function using the function() keyword.
Syntax:
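function_name <- function(arguments) {
  # function body
  # computations
  return(result)
}
greet <- function() {
  print("Welcome to R Programming!")
}
greet()
Output:
[1] "Welcome to R Programming!"
multiply <- function(a, b) {
  product <- a * b
  return(product)
}
multiply(6, 4)
Output:
[1] 24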
Example 3: Function with Default Parameters
return(area)
}
Output:
[1] 3.141593
[1] 28.27433
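The two output values above are what a circle-area function with a default radius of 1 would print when called once with no argument and once with a radius of 3. The sketch below is written under that assumption; the function name circle_area() and the parameter name r are hypothetical.

```r
# circle_area() and its parameter r are assumed names.
circle_area <- function(r = 1) {
  area <- pi * r^2
  return(area)
}

print(circle_area())   # uses the default r = 1
print(circle_area(3))  # overrides the default
```

A default value lets callers skip the argument in the common case while still allowing it to be overridden.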
sum = x + y,
difference = x - y,
product = x * y,
quotient = x / y
)
return(result)
}
print(output)
Output:
$sum
[1] 15
$difference
[1] 5
$product
[1] 50
$quotient
[1] 2
square <- function(x) {
  return(x^2)
}
sum_of_squares <- function(a, b) {
  return(square(a) + square(b))
}
sum_of_squares(3, 4)
Output:
[1] 25
Anonymous functions are unnamed, one-line functions often used with apply() functions.
print(squared_values)
Output:
[1] 1 4 9 16 25
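An anonymous function that gives this output could look like the sketch below. The use of sapply() and the input 1:5 are assumptions inferred from the printed vector; only the name squared_values appears in the notes.

```r
# function(x) x^2 has no name; it exists only for this sapply() call.
squared_values <- sapply(1:5, function(x) x^2)
print(squared_values)
```

sapply() is used here rather than lapply() because it simplifies the result to a vector, which matches the single-line output shown.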
User-defined functions give you complete control over what your code does, making them essential for structuring large projects or automating repetitive tasks.
A local variable is one that’s created inside a function and can only be accessed within that function.
Once the function finishes running, the local variable disappears (it’s destroyed).
Example:
add_numbers <- function() {
  x <- 10
  y <- 20
  sum <- x + y
  print(sum)
}
add_numbers()
Output:
[1] 30
🟢Explanation:
● x and y are local to the function add_numbers().
Example:
a <- 5
multiply <- function() {
  result <- a * 10
  print(result)
}
multiply()
print(a)
Output:
[1] 50
[1] 5
🟢Explanation:
● a is a global variable, so it’s accessible inside the multiply() function as well as outside.
By default, if you assign a new value to a variable inside a function, R creates a new local copy; it does not modify the global variable.
Example:
x <- 100
change_value <- function() {
  x <- 50
}
change_value()
print(x)
Output:
[1] 100
🟢Explanation:
● The assignment inside change_value() creates a local x, so the global x is still 100.
4. Forcing a Function to Modify a Global Variable
If you really need to modify a global variable from inside a function, use the
super-assignment operator <<-.
Example:
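count <- 0
increment <- function() {
  count <<- count + 1
}
increment()
increment()
print(count)
Output:
[1] 2
🟢Explanation:
● Each call to increment() uses <<- to update the global count, so after two calls it holds 2.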
Quick Summary
Local Variable Inside a function Only inside that function Until function ends
🧠Tip for Exams:
● Always prefer local variables to avoid unwanted side effects in large programs.
UNIT 2: DATA HANDLING IN R
1. Data structures in R
○ Vectors
○ Matrices
○ Lists
Data Structures in R
R provides a rich set of data structures to store and organize data efficiently. These structures are the building blocks for all data manipulation and analysis tasks in R.
1. Vectors
Creating Vectors
You can create a vector using the c() (combine) function.
Example:
Explanation:
● c() combines values into a single sequence.
num_vector[2]
num_vector[1:3]
Output:
[1] 20
[1] 10 20 30
Modifying Vectors
num_vector[2] <- 25
This replaces the second element with 25.
Vector Operations
R performs element-wise operations automatically.
Output:
[1] 5 7 9
[1] 4 10 18
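The two output lines above match element-wise addition and multiplication of the vectors 1 2 3 and 4 5 6. The sketch below uses those values as an assumption, with hypothetical variable names a and b.

```r
# Values chosen so the results match the output shown above.
a <- c(1, 2, 3)
b <- c(4, 5, 6)

vec_sum <- a + b      # adds matching positions: 1+4, 2+5, 3+6
vec_product <- a * b  # multiplies matching positions: 1*4, 2*5, 3*6

print(vec_sum)
print(vec_product)
```

No loop is needed: arithmetic operators in R work on whole vectors at once.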
Function    Description
sum(x)      Sum of all elements
rev(x)      Reverses order
2. Matrices
Creating a Matrix
Use the matrix() function.
Syntax:
Example:
Output:
Matrix Operations
A <- matrix(1:4, 2, 2)
B <- matrix(5:8, 2, 2)
A + B
A * B
A %*% B # matrix multiplication
Output:
Accessing Data Frame Elements
student_data$Name
student_data[1, 2]
student_data[ , "Score"]
Function    Purpose
str(df)     Structure of data frame
4. Lists
A list can hold elements of different types: numbers, strings, vectors, even other lists or data frames.
Creating a List
my_list <- list(
name = "R Programming",
numbers = c(1, 2, 3),
matrix_data = matrix(1:4, 2, 2)
)
my_list$matrix_data[1, 2]
Lists are used to store complex results, such as outputs from models or multiple datasets in one object.
Quick Summary
Data Structure Type Stores Example
🧠Tip:
● Use vectors for simple sequences, data frames for datasets, and lists for flexible combinations of objects.
Vectors
A vector in R is the simplest data structure that holds elements of the same data type (numeric, character, logical, etc.). Vectors are used to store a sequence of data elements in a single variable.
Creating Vectors
Vectors can be created using the c() function (combine function).
Example:
z <- c(TRUE, FALSE, TRUE, TRUE)
x[1] # First element
x[3] # Third element
x[2:4] # Elements from 2nd to 4th
Vector Operations
Vector Functions
Function     Description                    Example
length(x)    Returns number of elements     length(a)
sum(x)       Returns sum of all elements    sum(a)
max(x)       Returns maximum value          max(a)
min(x)       Returns minimum value          min(a)
Combining Vectors
Vectors can be combined using c().
Example:
Vector Recycling
If two vectors of different lengths are operated on, R recycles the shorter vector.
Example:
Type Coercion
If a vector has mixed data types, R automatically converts them to the same type following
this hierarchy:
Logical → Integer → Double → Character
Example:
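Recycling and type coercion can both be seen in a few lines. The values and variable names below are assumptions chosen for illustration.

```r
# Recycling: the shorter vector is repeated to match the longer one.
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)
recycled  <- long_vec + short_vec   # 1+10, 2+20, 3+10, 4+20

# Coercion: mixing types moves every element to the most general type.
mixed <- c(1, "a", TRUE)     # everything becomes character
nums  <- c(1.5, TRUE, 2L)    # logical and integer become double

print(recycled)
print(mixed)
print(typeof(nums))
```

Recycling is silent when the longer length is a multiple of the shorter one, as here; otherwise R still recycles but issues a warning.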
Vectors are fundamental in R and form the building blocks for more complex data structures like matrices and data frames.
Matrices
A matrix in R is a two-dimensional data structure that contains elements of the same data type (numeric, character, or logical). It’s essentially a collection of vectors arranged in rows and columns.
Creating a Matrix
You can create a matrix using the matrix() function.
Syntax:
● byrow → if TRUE, fills the matrix by rows; if FALSE, fills by columns
Example:
Output:
Matrix Operations
A + B # Addition
A - B # Subtraction
A * B # Element-wise multiplication
A / B # Element-wise division
A %*% B # Matrix multiplication
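The difference between the fill orders, and between * and %*%, is easiest to see on small matrices. A and B reuse the definitions from the earlier Matrix Operations example; the names by_col and by_row are assumptions.

```r
# Default fill is column by column; byrow = TRUE fills row by row.
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

A <- matrix(1:4, 2, 2)
B <- matrix(5:8, 2, 2)

elementwise <- A * B    # multiplies matching cells
mat_product <- A %*% B  # true matrix multiplication (rows times columns)

print(by_col)
print(by_row)
print(elementwise)
print(mat_product)
```

Mixing up * and %*% is a common source of wrong results, since both run without error on square matrices of the same size.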
Matrix Functions
Function    Description                Example
t(A)        Transpose of matrix        t(A)
nrow(A)     Number of rows             nrow(A)
ncol(A)     Number of columns          ncol(A)
dim(A)      Dimensions (rows, cols)    dim(A)
Combining Matrices
● rbind() → combines by rows
● cbind() → combines by columns
Example:
m["Row2", "Col3"]
class(m)
typeof(m)
Matrices are often used in mathematical computations, data transformations, and statistical modeling where uniform data types are required.
Data Frames
A data frame is one of the most commonly used data structures in R. It’s similar to a table in a spreadsheet or a dataset in Python’s pandas: made up of rows and columns, but unlike matrices, each column can contain a different data type (numeric, character, logical, etc.).
You can create a data frame using the data.frame() function.
Syntax:
Example:
Output:
👉 Here, each column (Name, Age, Score) is a vector, and all have equal lengths.
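A data frame with these three columns could be created as in the sketch below. The column names come from the notes; the row values are assumptions for illustration.

```r
# Row values are assumed; each column is an equal-length vector.
students <- data.frame(
  Name  = c("Ali", "Sara", "John"),
  Age   = c(20, 21, 19),
  Score = c(85, 90, 78)
)

str(students)            # column types and a preview of the values
print(students$Name)     # one column, by name
print(students[1, 3])    # row 1, column 3
```

If the vectors had different lengths, data.frame() would stop with an error unless the shorter ones could be recycled evenly.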
1. Using $ operator
students$Name
2. By row and column position
students[1, 3] # Element in 1st row, 3rd column
students[2, ] # Entire 2nd row
3.
Remove a column:
Add a row using rbind():
new_student <- data.frame(Name = "Emma", Age = 22, Score = 91, Grade = "A")
students <- rbind(students, new_student)
Remove a row:
Function     Description                Example
nrow(df)     Number of rows             nrow(students)
ncol(df)     Number of columns          ncol(students)
dim(df)      Dimensions of data frame   dim(students)
names(df)    Column names               names(students)
str(df)      Structure of data frame    str(students)
head(df)     First 6 rows               head(students)
tail(df)     Last 6 rows                tail(students)
Filtering Data
Sorting Data
You can sort data frames using the order() function.
Example:
Merging Data Frames
You can merge two data frames using a common column with merge().
Example:
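Filtering, sorting, and merging can be sketched together on one small data frame. All data values and the names high_scorers, sorted_students, ages, and merged are assumptions for illustration.

```r
# Sample data, assumed for illustration.
students <- data.frame(
  Name  = c("Ali", "Sara", "John"),
  Score = c(85, 60, 95)
)

# Filtering: keep rows where the condition is TRUE.
high_scorers <- students[students$Score > 80, ]

# Sorting: order() returns row positions; decreasing = TRUE puts highest first.
sorted_students <- students[order(students$Score, decreasing = TRUE), ]

# Merging: join two data frames on the shared Name column.
ages <- data.frame(Name = c("Ali", "Sara", "John"), Age = c(20, 21, 19))
merged <- merge(students, ages, by = "Name")

print(high_scorers)
print(sorted_students)
print(merged)
```

Note the trailing comma in students[condition, ]: the part before the comma selects rows, and leaving the part after it empty keeps every column.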
Data frames are used whenever you deal withstructureddatasets, such as:
🧭 Quick Summary
● Created using data.frame()
● Access using $, indices, or names.
● Rows → rbind(); Columns → cbind() or $.
● Use summary() and str() to understand data quickly.
Lists
A list in R is a flexible data structure that can hold different types of elements: numbers, strings, vectors, matrices, data frames, or even other lists!
Think of a list like a container that can store different kinds of objects together, unlike vectors or matrices which require all elements to be of the same type.
Creating a List
You can create a list using the list() function.
Syntax:
Example:
Output:
$Name
[1] "Sara"
$Age
[1] 21
$Scores
[1] 85 90 95
$Passed
[1] TRUE
Here, you can see that the list contains different data types: a string, a number, a vector, and a logical value.
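A list that prints as shown above could be built like this. The element names and values are read off that output; the variable name my_list is the one used in the access examples that follow.

```r
my_list <- list(
  Name   = "Sara",
  Age    = 21,
  Scores = c(85, 90, 95),
  Passed = TRUE
)
print(my_list)

# [[ ]] returns the element itself; [ ] returns a smaller list.
age_value   <- my_list[[2]]
age_sublist <- my_list[2]
print(class(age_value))
print(class(age_sublist))
```

The distinction between [[ ]] and [ ] matters when the result feeds into arithmetic: my_list[[2]] + 1 works, while my_list[2] + 1 fails because a list cannot be added to a number.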
By name (using $):
my_list$Name → Returns "Sara"
By index (using [[ ]]):
my_list[[2]]
By index (using [ ]):
my_list[2]
Change a value:
Remove an element:
Combining Lists
You can merge lists using the c() function:
combined_list <- c(list1, list2)
print(combined_list)
Output:
$a
[1] 1
$b
[1] 2
$c
[1] 3
$d
[1] 4
Nested Lists
nested$student$Name → Returns "John"
names(list)    Get or set names of elements    names(my_list)
Unlisting a List
You can flatten a list into a single vector using unlist().
Example:
Output:
 A  B  C
85 90 95
Real-World Analogy
All different, yet stored together in one place — that’s how lists work in R!
🧭 Quick Summary
● Created using list() function.
● Access elements using [ ], [[ ]], or $.
● unlist() converts a list into a vector.
● Great for storing complex or hierarchical data (e.g., student info, model outputs, JSON-like data).
One of R’s strongest features is its ability to import and export data from a wide variety of sources: text files, CSVs, Excel sheets, databases, and more. This allows you to bring external data into R for analysis and then export your results back out for reporting or sharing.
CSV (Comma-Separated Values) files are the most common data format used for sharing datasets.
Function: read.csv()
Syntax:
Example:
Explanation:
● file → path to your CSV file
● header = TRUE → treats first row as column names
● sep = "," → separates columns using commas
● stringsAsFactors = FALSE → keeps text as characters instead of factors
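The four arguments described above can be tried without needing a real file on disk. The sketch writes a tiny CSV to a temporary file first; the file contents and the variable name csv_path are assumptions for illustration.

```r
# Create a small CSV in a temporary location so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Marks", "Ali,85", "Sara,90"), csv_path)

data <- read.csv(
  file = csv_path,
  header = TRUE,             # first line holds the column names
  sep = ",",                 # columns are separated by commas
  stringsAsFactors = FALSE   # keep Name as plain text
)

str(data)
print(data)
```

In a real project, csv_path would simply be the path to your own file.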
Example:
setwd("C:/Users/Faisal/Documents/R_Projects")
data <- read.csv("[Link]")
If your data is stored in a plain text file (e.g., .txt), you can use read.table().
Syntax:
Example:
Example:
url_data <-
[Link]("[Link]
head(url_data)
Once you’ve processed or analyzed your data, you can save it back to a file.
Function: write.csv()
Syntax:
Example:
Explanation:
● data → the data frame to be saved
● file → name or path of the output file
● row.names = FALSE → prevents row numbers from being written as an extra column
[Link]()
Function:
Example:
After importing, it’s a good idea to explore your data before analysis.
Common Functions:
head()    View first 6 rows       head(data)
tail()    View last 6 rows        tail(data)
str()     Structure of dataset    str(data)
dim()     Dimensions              dim(data)
If your files are not in your working directory, provide thefull path:
getwd()
setwd("C:/Users/Faisal/Documents")
R can also handle other common formats with the right packages:
Excel (.xlsx)    readxl      read_excel()
SPSS (.sav)      haven       read_sav()
JSON             jsonlite    fromJSON()
XML              xml2        read_xml()
Example (Excel):
library(readxl)
data <- read_excel("students_data.xlsx")
🧭 Quick Summary
✅ Importing Data → read.csv(), read.table(), read_excel()
✅ Exporting Data → write.csv(), write.table()
✅ Check Data → head(), str(), summary()
✅ Working Directory → getwd(), setwd()
✅ File Formats Supported → CSV, TXT, Excel, JSON, XML, Databases
Real-World Example:
In a data analysis project, you might:
1. Import raw survey results from a .csv file
While CSV files are the most common way to handle tabular data, Excel files (.xls or .xlsx) are equally popular, especially in business, research, and academic settings. R doesn’t read Excel files natively, but with the help of a few packages, it becomes very easy.
1. Using the readxl Package
install.packages("readxl")
library(readxl)
Use the read_excel() function.
Syntax:
Example:
library(readxl)
students <- read_excel("students_data.xlsx")
head(students)
Explanation:
● path → file path of your Excel sheet
● sheet → specify sheet name or index (default is first sheet)
● range → optional cell range like "A1:D10"
● col_names → if TRUE, first row is treated as column headers
excel_sheets("students_data.xlsx")
3. Viewing Imported Data
head(students)
str(students)
summary(students)
4. Using the openxlsx Package
Another popular package is openxlsx, which can both read and write Excel files without requiring external dependencies.
getwd()
setwd("C:/Users/Faisal/Documents/R_Projects")
● ❗File not found?Check file path and working directory.
You can save your processed data back to an Excel file using:
library(openxlsx)
write.xlsx(students, "output_students.xlsx")
🧭 Quick Summary
Function          Package     Purpose
read_excel()      readxl      Read Excel files (.xls, .xlsx)
excel_sheets()    readxl      List all sheet names
read.xlsx()       openxlsx    Read Excel data
write.xlsx()      openxlsx    Write data to Excel file
✅Key Takeaways
● Always check data structure after importing using str() or head().
● Perfect for handling data from Excel-based reports, financial sheets, and surveys.
Accessing Databases
When working with large datasets, it’s often not practical to store all your data in files like CSV or Excel. Instead, data is usually stored in databases such as MySQL, PostgreSQL, or SQLite.
R can connect to these databases, run SQL queries, and import or export data directly, allowing smooth integration between R and database systems.
R provides multiple packages for connecting to databases. The most common are:
MySQL                RMySQL         dbConnect()
PostgreSQL           RPostgreSQL    dbConnect()
SQLite               RSQLite        dbConnect()
General interface    DBI            dbConnect()
The DBI package provides a common interface for working with any database package.
You need to install and load the required packages depending on your database.
install.packages("DBI")
install.packages("RSQLite") # for SQLite
install.packages("RMySQL") # for MySQL
library(DBI)
Example:
library(DBI)
Example:
library(DBI)
host = "localhost",
port = 3306,
user = "root",
password = "your_password"
)
Once connected, you can run SQL queries directly using R functions.
data <- dbGetQuery(con, "SELECT * FROM students WHERE marks > 80;")
head(data)
dbListTables(con)
dbRemoveTable(con, "old_data")
dbDisconnect(con)
library(DBI)
library(RSQLite)
# Connect
con <- dbConnect(RSQLite::SQLite(), "[Link]")
# Create a table
data <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 90, 78))
dbWriteTable(con, "Students", data)
# Disconnect
dbDisconnect(con)
Explanation:
● Universities store student records in MySQL or PostgreSQL, so you can directly fetch data for analysis.
● Research projects use databases to store large datasets for reproducibility.
🧭 Quick Summary
Function          Purpose
dbConnect()       Connect to a database
dbListTables()    List all tables
dbWriteTable()    Write data frame into database
✅Key Points
● You can read and write data frames directly as tables.
● Always close connections with dbDisconnect().
Saving Data in R
When you’re working on a project in R, you often need to save your data so that you can reuse it later without re-running all your code. R provides multiple ways to store your data, from saving single variables to entire workspaces.
● Reuse data in future sessions.
You can save specific variables, data frames, or vectors using the save() function.
Syntax:
Example:
x <- 10
Explanation:
This saves the variables x, y, and data in one file named [Link].
Later, you can load this file to restore those objects.
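A save-then-restore round trip might look like the sketch below. The object names x, y, and data follow the explanation above; their values, the temporary file path, and the name rdata_path are assumptions for illustration.

```r
x <- 10
y <- 20                          # value assumed
data <- data.frame(id = 1:3)     # contents assumed

# A temporary file keeps the example from touching your project folder.
rdata_path <- tempfile(fileext = ".RData")
save(x, y, data, file = rdata_path)

# Remove the objects, then bring them back from the file.
rm(x, y, data)
restored_names <- load(rdata_path)  # load() returns the names it restored
print(restored_names)
print(x)
```

Because load() recreates objects under their original names, it can silently overwrite variables that already exist with those names in your session.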
Use:
save.image(file = "[Link]")
Explanation:
This command saves the entire workspace to a file.
By default, if you just run save.image(), R saves it as .RData in your current working directory.
If you want to store data in a simple text format that can be read by other software (like Excel), use write.csv().
Example:
students <- data.frame(Name = c("John", "Aisha", "Ravi"), Marks = c(85, 90, 78))
Explanation:
This saves the data frame students into a CSV file without row numbers.
Key Parameters:
● row.names = FALSE → avoids saving unnecessary row numbers
● sep → allows specifying other separators (like ; or \t) when using write.table(); write.csv() always uses a comma
You can also save plain text or vector data using write() or write.table().
Example:
Explanation:
This saves the vector values into a simple text file named [Link].
6. Saving Data in RDS Format
The saveRDS() and readRDS() functions are useful when you want to save a single R object.
Example:
data <- data.frame(City = c("Srinagar", "Delhi", "Mumbai"), Temp = c(12, 25, 30))
saveRDS(data, "weather_data.rds")
Difference from save():
● save() can store multiple objects.
● saveRDS() is meant for one object only, and you must assign it to a variable when reloading.
If you’ve created a plot, you can save it using functions like:
●
png("[Link]")
●
pdf("[Link]")
●
jpeg("[Link]")
Example:
png("[Link]")
hist(c(2,4,6,8,10))
dev.off()
Explanation:
● png() starts saving the next plot as an image file.
● dev.off() stops the saving process.
load("covid_clean.RData")
🧭 Quick Summary
Function         Purpose
save()           Save specific objects
write.table()    Save data in tabular text format
saveRDS()        Save a single R object
readRDS()        Read a saved R object
png(), pdf()     Save plots or graphs
✅Tips
● Prefer .RData or .RDS for R-only projects, and .csv for sharing data with others.
After saving your work in R (like datasets, variables, or entire sessions), you’ll eventually need to load it back to continue your analysis. R provides simple functions to restore saved data and make it available again in your current session.
When you reopen R or RStudio, your workspace starts empty. If you want to reuse previously saved data, you must load it.
Loading data helps you:
● Avoid rerunning data-cleaning or preparation steps
2. Loading .RData or .rda Files
If you used the save() or save.image() function to store your workspace or selected objects, use the load() function to bring them back.
Syntax:
load("[Link]")
Example:
load("[Link]")
Explanation:
This will restore all the objects (x, y, data, etc.) that were saved inside [Link].
You can now use them directly; there’s no need to assign them to new variables.
If you saved a single object using saveRDS(), you must use readRDS() to load it.
Syntax:
Example:
Explanation:
Unlike load(), this function doesn’t automatically create the object in your workspace; you decide what name to assign it.
This makes it more flexible and safer when dealing with multiple datasets.
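The point about choosing the name on reload can be sketched as follows. The data values, the temporary file path, and the names weather, rds_path, and weather_copy are assumptions for illustration.

```r
weather <- data.frame(City = c("Srinagar", "Delhi"), Temp = c(12, 25))

rds_path <- tempfile(fileext = ".rds")
saveRDS(weather, rds_path)

# readRDS() returns the object; nothing appears in the workspace
# until you assign the result to a name of your choice.
weather_copy <- readRDS(rds_path)
print(identical(weather, weather_copy))
```

Since the reloaded object gets whatever name you assign, two saved versions of the same dataset can be loaded side by side without one overwriting the other.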
4. Loading CSV or Text Files
If you saved data in CSV format, use read.csv() to load it.
Example:
Explanation:
This reads the file [Link] and stores it in a data frame named students.
You can then view it using:
head(students)
For Excel files (saved as .xlsx or .xls), use the readxl package.
Example:
library(readxl)
Explanation:
This reads data directly from Excel sheets.
You can specify sheet names if your Excel file has multiple sheets:
If you saved plain text data (like .txt files), use read.table() or read.delim().
Example:
or
Explanation:
These functions can handle tab-separated or space-separated text files easily.
Example:
If you closed R with a file named .RData in your working directory, R will automatically load it on restart.
Imagine you analyzed customer data last week and saved it as customers_cleaned.RData.
Next week, you can simply type:
load("customers_cleaned.RData")
All your cleaned data frames, summary tables, and variables come back — ready for use!
🧭 Quick Summary
File Type         Function                       Purpose
.RData or .rda    load()                         Loads all saved R objects
.rds              readRDS()                      Loads one saved R object
.csv              read.csv()                     Loads data from CSV file
.xlsx             read_excel()                   Loads data from Excel file
.txt              read.table() / read.delim()    Loads data from text files
✅Tips
Writing to Files in R
Once you’ve created, cleaned, or analyzed data in R, you often need to export it: maybe to share it with others, to use it in another software like Excel, or to keep it for future reference.
This process is called writing data to files, and R provides several convenient functions to handle it.
● Transfer data between R and other tools (like Python, Excel, or SQL)
Function: write.csv()
Syntax:
Example:
)
Explanation:
● row.names = FALSE prevents R from adding row numbers as an extra column.
If your system or collaborators prefer a different separator (like a semicolon or tab), you can use write.table().
Syntax:
Example:
Explanation:
This saves the data with tab-separated columns instead of commas.
Perfect when working with text editors or systems that expect tabular text.
If you want to directly write data into Excel sheets, use the writexl package.
Example:
library(writexl)
write_xlsx(students, "students_data.xlsx")
Explanation:
This creates an Excel file with one sheet containing your data frame.
You can easily open it in Excel or Google Sheets.
If you want to write plain text (like a vector of names, numbers, or results), use the write() function.
Example:
Explanation:
This writes each value from the vector names into a text file, one per line.
You can append multiple data frames to the same file using the append = TRUE argument in write.table().
Example:
Explanation:
● The second call appends the second data frame without repeating the column names.
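Appending with write.table() might look like the sketch below. The data frames batch1 and batch2, their contents, and the temporary file path are assumptions for illustration.

```r
# Two small data frames with the same columns (values assumed).
batch1 <- data.frame(Name = c("Ali", "Sara"), Marks = c(85, 90))
batch2 <- data.frame(Name = "John", Marks = 78)

out_path <- tempfile(fileext = ".csv")

# First write: creates the file and includes the header row.
write.table(batch1, out_path, sep = ",", row.names = FALSE)

# Second write: append = TRUE adds to the end of the file, and
# col.names = FALSE avoids writing the header a second time.
write.table(batch2, out_path, sep = ",", row.names = FALSE,
            col.names = FALSE, append = TRUE)

combined <- read.csv(out_path)
print(combined)
```

write.table() is used rather than write.csv() because write.csv() does not honor append = TRUE; it ignores the argument with a warning.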
●
save()for multiple objects
●
saveRDS()for one object
Example:
saveRDS(students, "students_data.rds")
This method is faster and takes less space than CSV files.
8. Writing Output to a Text File
You can also write console output (like printed summaries or results) to a text file using sink().
Example:
sink("[Link]")
summary(students)
sink()
Explanation:
Everything between the two sink() calls is redirected to [Link] instead of printing to the console.
Imagine you cleaned a large survey dataset in R and now need to send it to your team who
uses Excel.
You can simply export it as:
🧭 Quick Summary
Function Purpose
write.table()    Write data with custom separators
write()          Write simple vectors or text
saveRDS()        Save one R object (R’s own format)
sink()           Redirect console output to a file
✅Tips
● Use descriptive filenames like "sales_report_2025.csv".
● When sharing with non-R users, prefer .csv or .xlsx.
Before analyzing or visualizing data, you need to clean and prepare it properly. Real-world datasets often contain missing values, duplicates, inconsistent formats, or irrelevant entries, all of which can lead to misleading results if ignored.
In this topic, we’ll explore how to handle missing values and filter data effectively in R.
77
ata cleaning is the process ofdetecting and correcting errors or inconsistenciesin
D
data to improve its quality.
It ensures that your dataset is:
Example:
is.na(data)
Output:
Explanation:
TRUE indicates missing values at those positions.
sum(is.na(data))
Output:
2
Explanation:
This removes all NA values from the vector or dataframe.
data[]
Example:
Explanation:
● na.rm = TRUE ignores missing values while calculating the mean.
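The missing-value steps in this topic (detect, count, remove, ignore, replace) can be sketched on one small vector. The vector marks is a hypothetical example, not the data used in the original notes:

```r
# Hypothetical vector with one missing value
marks <- c(10, NA, 20, 30)

is.na(marks)                    # logical vector marking the NA positions
n_missing <- sum(is.na(marks))  # number of missing values

# Remove missing values
clean <- as.numeric(na.omit(marks))

# Ignore missing values while computing the mean
avg <- mean(marks, na.rm = TRUE)

# Replace missing values with the mean of the observed values
filled <- marks
filled[is.na(filled)] <- avg
print(filled)
```

Replacing with the mean is only one possible strategy; whether it is appropriate depends on the dataset.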
Output:
[1] 10 20 20 30
)
colSums(is.na(df))
Output:
Marks Age
1 1
Filtering means selecting only the rows that meet certain conditions.
This helps you focus on the relevant part of your dataset.
Syntax:
subset(data_frame, condition)
Example:
)
Explanation:
Only students with marks greater than 80 are selected.
You can use logical operators like & (AND), | (OR), and ! (NOT).
Example:
Explanation:
This filters rows where marks are between 70 and 90.
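A short sketch of both filters with subset(). The students data frame below is a hypothetical stand-in for the one used in the notes:

```r
# Hypothetical student records
students <- data.frame(
  Name  = c("Ali", "Sara", "John", "Hina"),
  Marks = c(85, 60, 95, 72)
)

# Single condition: marks greater than 80
high_scorers <- subset(students, Marks > 80)

# Combined condition with & (AND): marks between 70 and 90
mid_range <- subset(students, Marks >= 70 & Marks <= 90)

print(high_scorers)
print(mid_range)
```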
Example:
library(dplyr)
students <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 60, 95))
Explanation:
)
🧭 Quick Summary
| Task | Function | Description |
|------|----------|-------------|
| Remove missing values | na.omit() | Deletes rows with NAs |
| Replace missing values | data[is.na(data)] <- value | Fills missing entries |
✅Tips
UNIT 3: DATA MANIPULATION
1. Data manipulation techniques
10. Data aggregation
Data Manipulation Techniques
Once your data is cleaned and ready, the next important step is manipulating it, that is,
organizing, transforming, and restructuring the data so it becomes suitable for analysis
or visualization.
In R, data manipulation is one of the most frequently performed tasks, and it’s made
incredibly powerful and easy by packages like dplyr and tidyr from the tidyverse
collection.
Let’s understand what data manipulation means and how to perform it efficiently in R.
1. Base R functions: built-in commands like subset(), merge(), order(), etc.
2. dplyr package: a modern, human-friendly toolkit designed specifically for data
manipulation.
3. The Core dplyr Functions
Here are the most important verbs (functions) in dplyr, often called the “grammar of data
manipulation”:
| Function | Purpose |
|----------|---------|
| select() | Choose specific columns |
| filter() | Select rows based on conditions |
| mutate() | Add or modify columns |
library(dplyr)
students <- data.frame(
Name = c("Ali", "Sara", "Ravi", "Mehak", "John"),
Marks = c(85, 92, 76, 89, 95),
Age = c(20, 21, 22, 20, 23)
)
Example:
Explanation:
Keeps only the Name and Marks columns, excluding others.
Example:
Explanation:
Displays only students who scored more than 85 marks.
Example:
arrange(students, Marks)
Explanation:
Sorts students in ascending order of marks.
arrange(students, desc(Marks))
8. Creating or Modifying Columns
Use mutate() to add new columns or modify existing ones.
Example:
Explanation:
Adds a new column Grade: if Marks ≥ 90, assigns “A”; otherwise “B”.
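The select(), filter(), and mutate() steps described in the explanations above can be sketched as follows, using the students data frame defined earlier in this section. The exact calls are a reconstruction from the explanations, so treat them as an illustration rather than the original listing:

```r
library(dplyr)

students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age   = c(20, 21, 22, 20, 23)
)

# Keep only the Name and Marks columns
name_marks <- select(students, Name, Marks)

# Keep only students who scored more than 85
top_students <- filter(students, Marks > 85)

# Add a Grade column: "A" if Marks >= 90, otherwise "B"
graded <- mutate(students, Grade = ifelse(Marks >= 90, "A", "B"))
print(graded)
```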
Example:
Output:
Avg_Marks
1 87.4
Explanation:
Calculates the average marks of all students.
Example:
Explanation:
This adds a Gender column, groups data by gender, and calculates the average marks for
each group.
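A sketch of summarise() on its own and combined with group_by(). The Gender values are hypothetical (the notes do not list them); the Marks values are the ones from the students data frame above:

```r
library(dplyr)

students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95)
)

# Overall average
overall <- summarise(students, Avg_Marks = mean(Marks))

# Add a hypothetical Gender column, then average per group
by_gender <- students %>%
  mutate(Gender = c("M", "F", "M", "F", "M")) %>%
  group_by(Gender) %>%
  summarise(Avg_Marks = mean(Marks))

print(overall)
print(by_gender)
```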
Example:
students %>%
filter(Marks > 80) %>%
select(Name, Marks) %>%
arrange(desc(Marks))
Explanation:
This is a clean, natural way to write a sequence of operations, almost like reading a
sentence.
sales %>%
group_by(Region) %>%
summarise(Average_Sales = mean(Sales))
Output:
# A tibble: 4 × 2
Region Average_Sales
<chr> <dbl>
1 East 300
2 North 550
3 South 400
4 West 450
🧭 Quick Summary
| Task | Function | Description |
|------|----------|-------------|
| Summarize values | summarise() | Compute mean, sum, etc. |
✅Tips
● Load dplyr before use: library(dplyr)
● Always check results using head() or View()
● Combine group_by() and summarise() for quick insights
Selecting Rows/Observations
Selecting rows or observations means choosing specific records from a dataset that
satisfy a certain condition. In R, this can be done using base R techniques or using the
dplyr package, which makes row selection more readable and powerful.
For example, from a dataset of students, you might want to select only those with marks
above 80 or those belonging to a certain age group.
students[1:3, ]
Explanation:
● 1:3 specifies the range of row indices
● The comma , indicates we want all columns for those rows
Explanation:
● students$Marks > 85 returns TRUE for rows where Marks exceed 85
Use logical operators like & (AND), | (OR).
Example:
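A minimal sketch of base R logical indexing with these operators. The students data frame repeats the one defined earlier in this unit; the specific conditions are illustrative assumptions:

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age   = c(20, 21, 22, 20, 23)
)

# Single condition: the logical vector picks the rows, the empty
# slot after the comma keeps all columns
above_85 <- students[students$Marks > 85, ]

# AND: both conditions must hold
young_high <- students[students$Marks > 80 & students$Age < 22, ]

# OR: at least one condition must hold
either <- students[students$Marks > 90 | students$Age == 20, ]
```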
3. Selecting Rows Using subset() Function
The subset() function is simpler and more readable for conditional selection.
Syntax:
subset(dataframe, condition)
Examples:
subset(students, Marks > 85)
subset(students, Age == 20)
subset(students, Marks > 80 & Age < 22)
Advantages:
4. Selecting Rows Using dplyr::filter()
The filter() function from dplyr is the most modern and clean method for selecting
rows.
Syntax:
filter(dataframe, condition)
Examples:
library(dplyr)
filter(students, Marks > 85)
filter(students, Age < 22)
filter(students, Marks > 80 & Age < 22)
Explanation:
● You can use multiple conditions connected with & or |
If you want to select either condition:
Example:
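filter(students, Marks > 90 | Age < 21) # illustrative OR condition; the original condition is missing from the source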
filter(students, is.na(Marks))
7. Filtering Rows Using slice()
If you want to select specific rows by position (not condition), use slice() from dplyr.
Example:
slice(students, 1:3)
slice_head(students, n = 2) # First 2 rows
slice_tail(students, n = 2) # Last 2 rows
slice_sample(students, n = 2) # Random 2 rows
students %>%
filter(Marks > 80) %>%
arrange(desc(Marks))
Explanation:
🧭 Summary Table
Method Function Description
✅Tips
● Always check results using head() or View()
● Combine filters with arrange() or select() for refined datasets
Selecting Columns/Fields
R provides multiple ways to select columns: through base R, the subset() function, or
the dplyr package (which is highly preferred for its clean syntax).
You can access a column using the $ operator.
Example:
students$Marks
Explanation:
● $ is used to access a specific column.
● students$Marks returns only the “Marks” column as a vector.
You can select columns by their position in the dataset.
Example:
Explanation:
● , separates rows and columns.
2. Selecting Columns Using subset()
The subset() function can be used to choose specific columns easily.
Syntax:
Example:
Explanation:
● select = -Age removes the Age column.
● You can exclude multiple columns as select = -c(Marks, Age)
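The column selection described above can be sketched with the select argument of subset(). The students data frame is the one used throughout this unit; the calls are reconstructed from the explanation bullets:

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age   = c(20, 21, 22, 20, 23)
)

# Keep specific columns
name_marks <- subset(students, select = c(Name, Marks))

# Drop one column
no_age <- subset(students, select = -Age)

# Drop several columns at once
only_name <- subset(students, select = -c(Marks, Age))
```

The result stays a data frame even when a single column remains, unlike students$Name, which returns a vector.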
3. Selecting Columns Using dplyr::select()
This is the most popular and clean method for selecting columns.
Syntax:
select(dataframe, columns...)
Example:
library(dplyr)
select(students, Name, Marks)
Explanation:
You can exclude columns using the - sign.
select(students, -Age)
c. Selecting Columns by Name Pattern
You can select columns based on name patterns using helper functions:
| Helper | Description | Example |
|--------|-------------|---------|
| starts_with("A") | Selects columns starting with “A” | select(students, starts_with("A")) |
| ends_with("s") | Selects columns ending with “s” | select(students, ends_with("s")) |
| matches("^[A-Z]") | Selects columns matching regex | select(students, matches("^[A-Z]")) |
Explanation:
● The pipe %>% passes the students dataframe into the select() function.
● This makes the code more readable and chainable with other operations like
filter() or arrange().
select_if(students, is.numeric)
select_if(students, is.character)
students %>%
select(Marks, Name, Age)
students %>%
select(Student_Name = Name, Score = Marks)
Explanation:
● RenamesNametoStudent_NameandMarkstoScore.
| Method | Function | Description |
|--------|----------|-------------|
| $ operator | data$col | Access single column |
| subset() | subset(data, select = ...) | Choose or exclude columns easily |
| dplyr::select() | select(data, col1, col2) | Clean and readable method |
| Exclude columns | select(data, -colname) | Remove specific columns |
| Range select | select(data, col1:col3) | Select column range |
| Type-based select | select_if() | Select by data type |
✅Tips
● Combine select() with filter() for focused sub-datasets.
Merging Data
To analyze both marks and ages together, you must merge them into a single dataset using
the Name column as the key.
1. Using the merge() function (Base R)
3. Merging Using Base R merge() Function
Syntax:
Parameters:
● x, y → data frames to merge
● by → column(s) to merge on (common key)
● all.x → if TRUE, keeps all rows from x (Left Join)
● all.y → if TRUE, keeps all rows from y (Right Join)
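The parameters above map onto the four join types as sketched below. The contents of data1 and data2 are hypothetical (the original example data is not recoverable); only the names data1, data2, and the key Name come from the notes:

```r
# Hypothetical tables sharing the key column Name
data1 <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 90, 78))
data2 <- data.frame(Name = c("Ali", "Sara", "Hina"), Age = c(20, 21, 22))

# Inner join (default): only names present in both tables
inner <- merge(data1, data2, by = "Name")

# Left join: all rows of data1, NA where data2 has no match
left <- merge(data1, data2, by = "Name", all.x = TRUE)

# Right join: all rows of data2, NA where data1 has no match
right <- merge(data1, data2, by = "Name", all.y = TRUE)

# Full join: every name from both tables
full <- merge(data1, data2, by = "Name", all = TRUE)
print(full)
```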
Example:
)
)
a. Inner Join (default)
Result:
● All students from data1 remain.
● If a match isn’t found in data2, missing values (NA) are added.
Keeps all rows from data2.
Result:
● All students from data2 remain.
● Non-matching records from data1 get NA in missing columns.
d. Full Join
Result:
Every name from both tables is included — unmatched rows get NA.
4. Merging Using dplyr Joins
The dplyr package offers more readable and powerful merging functions that are widely
used in data science.
| Function | Rows Returned | Description |
|----------|---------------|-------------|
| left_join(x, y, by) | All rows from x | Left join |
| full_join(x, y, by) | All rows from both | Full outer join |
| semi_join(x, y, by) | Rows in x that have a match in y | Filtered inner join |
| anti_join(x, y, by) | Rows in x with no match in y | Opposite of semi join |
Example using dplyr:
library(dplyr)
Inner Join
Left Join
Full Join
Anti Join: shows students in data1 but not in data2:
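The joins listed in this section can be sketched as follows, with a base R equivalent of the anti join for comparison. The table contents are hypothetical; only the names data1, data2, and the key Name come from the notes:

```r
library(dplyr)

data1 <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 90, 78))
data2 <- data.frame(Name = c("Ali", "Sara", "Hina"), Age = c(20, 21, 22))

inner <- inner_join(data1, data2, by = "Name")
left  <- left_join(data1, data2, by = "Name")
full  <- full_join(data1, data2, by = "Name")

# Anti join in base R: rows of data1 whose Name is absent from data2
anti_base <- data1[!(data1$Name %in% data2$Name), ]

# The same result with dplyr
anti_dplyr <- anti_join(data1, data2, by = "Name")
```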
dplyr
or using :
106
● Use rbind() in base R
● Use bind_rows() in dplyr
Example:
Result:
All four students in one dataset.
cbind(data1, data2)
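| Operation | Base R | dplyr | Description |
|-----------|--------|-------|-------------|
| Right Join | merge(x, y, by, all.y=TRUE) | right_join(x, y) | All from right table |
| Column Bind | cbind(x, y) | bind_cols(x, y) | Combine horizontally |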
✅Tips
● Always verify merged results using View() or head()
● Making names shorter or easier to type
● Giving clearer, descriptive labels (e.g., changing V1 to Student_Name)
colnames(dataframe)
Example:
)
colnames(students)
Output:
colnames(students) <- c("Name", "Marks", "Age")
c. Rename Using names() Function
Explanation:
4. Renaming Columns Using dplyr::rename()
The rename() function from the dplyr package makes renaming easier and more
readable.
Syntax:
Example:
library(dplyr)
)
Explanation:
Result:
5. Using setNames() Function
This function creates a renamed copy of the dataset without changing the original directly.
Syntax:
Example:
Explanation:
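The renaming approaches covered in this section can be sketched together. The data frame and the new names are hypothetical, chosen only to show each call:

```r
library(dplyr)

# Hypothetical data frame with uninformative default names
students <- data.frame(V1 = c("Ali", "Sara"), V2 = c(85, 90), V3 = c(20, 21))

# Rename all columns at once
colnames(students) <- c("Name", "Marks", "Age")

# Rename one column by matching its current name
names(students)[names(students) == "Marks"] <- "Score"

# dplyr::rename() uses new_name = old_name
renamed_dplyr <- rename(students, Student_Name = Name)

# setNames() returns a renamed copy; students itself is unchanged
renamed_copy <- setNames(students, c("student_name", "score", "age"))
```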
6. Using names() with Pipes
You can use pipes to rename columns inline:
students %>%
Example:
data <- data.frame("Student Name" = c("Ali", "Sara"), "Test Score" = c(85, 90), check.names = FALSE) # check.names = FALSE keeps the spaces in the names
Result:
Student_Name Test_Score
Ali 85
Sara 90
| Task | Code | Description |
|------|------|-------------|
| View column names | colnames(data) | Displays current names |
| Rename all columns | colnames(data) <- c() | Rename all at once |
| Rename one column | colnames(data)[i] <- "NewName" | Rename by position |
| Conditional rename | names(data)[names(data) == "Old"] <- "New" | Rename by name |
| Replace characters | gsub(" ", "_", colnames(data)) | Clean column labels |
✅Tips
● Always check renamed data using colnames() or head().
● Keep column names short, lowercase, and underscore-separated (e.g.,
student_name).
● Avoid spaces, punctuation, and symbols in names — they make coding harder.
Reshaping Data
● Long Format:
Each row represents a single observation for a variable.
Example:
| Name | Subject | Marks |
|------|----------|-------|
| Ali | Math | 85 |
| Ali | Science | 90 |
| Ali | English | 78 |
| Sara | Math | 92 |
| Sara | Science | 87 |
| Sara | English | 88 |
2. Reshaping Using tidyr Package
The tidyr package provides the most efficient functions to reshape data:
● pivot_longer(): converts data from wide to long format.
● pivot_wider(): converts data from long to wide format.
3. Wide to Long Format using pivot_longer()
Syntax:
Parameters:
● data: dataset
● cols: columns to convert into key-value pairs
● names_to: name of the new column that will store variable names
● values_to: name of the new column that will store corresponding values
Example:
library(tidyr)
Name = c("Ali", "Sara"),
)
print(long_data)
Output:
Explanation:
● The columnsMath, Science, Englishbecome entries under a single columnSubject.
4. Long to Wide Format using pivot_wider()
Syntax:
Parameters:
● data: dataset
● names_from: column whose values become new column names
● values_from: column whose values fill the new columns
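A round trip through both functions shows how the parameters fit together. The marks are taken from the long-format table earlier in this section; the object names are assumptions:

```r
library(tidyr)

# Wide format: one row per student, one column per subject
wide_data <- data.frame(
  Name    = c("Ali", "Sara"),
  Math    = c(85, 92),
  Science = c(90, 87),
  English = c(78, 88)
)

# Wide -> long: subject columns become rows
long_data <- pivot_longer(
  wide_data,
  cols      = c(Math, Science, English),
  names_to  = "Subject",
  values_to = "Marks"
)

# Long -> wide: Subject values become column names again
wide_again <- pivot_wider(
  long_data,
  names_from  = Subject,
  values_from = Marks
)

print(long_data)
print(wide_again)
```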
Example:
print(wide_data)
Output:
Explanation:
The Subject column becomes new column names (Math, Science, English), and Marks
values fill those columns.
a. Using melt()
Example:
library(reshape2)
b. Using dcast()
Example:
Science = c(90, 91, 87, 89)
)
print(long)
Output:
7. Spread and Gather (Older Functions)
Before pivot_longer() and pivot_wider(), we used:
● gather() → to convert wide to long
● spread() → to convert long to wide
Example:
library(tidyr)
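| Conversion | Function | Package | Description |
|------------|----------|---------|-------------|
| Wide → Long | pivot_longer() | tidyr | Wide to Long |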
✅Tips
● Prefer pivot_longer() and pivot_wider(): they are simpler and more
readable.
● For large datasets, use data.table::melt() and data.table::dcast(), which are
faster versions.
● Long format is better for visualizations; wide format is better for presentation.
Before analyzing or modeling data, it’s often important to make sure that all numeric
variables are on a comparable scale. This helps improve the performance and
interpretability of many algorithms, especially in machine learning and statistical
modeling.
Imagine you’re analyzing a dataset with two features:
● Age (values like 20, 30, 40)
● Income (values like 20,000, 50,000, 90,000)
The Income variable dominates because its values are much larger. Scaling brings all
variables to similar ranges, making them equally important in analysis and modeling.
Formula:
\[
x_{centered} = x - \bar{x}
\]
Example:
print(centered_x)
Output:
Explanation:
● The mean of x is 30.
✅ Use Case: Centering is useful before performing Principal Component Analysis (PCA)
or regression, where the mean needs to be zero.
3. Scaling Data
Scaling means dividing centered data by its standard deviation (SD) so that all variables
have a standard deviation of 1.
Formula:
\[
x_{scaled} = \frac{x - \bar{x}}{s}
\]
where s is the standard deviation of x.
Example:
print(scaled_x)
Output:
[,1]
[1,] -1.2649111
[2,] -0.6324555
[3,] 0.0000000
[4,] 0.6324555
[5,] 1.2649111
Explanation:
✅ Use Case: Scaling is vital in machine learning (e.g., K-Means, SVM, PCA, Linear
Regression).
Formula:
\[
x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}
\]
Example:
print(normalized_x)
Output:
Explanation:
✅ Use Case: Normalization is especially used for distance-based algorithms (like KNN,
neural networks).
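The three transformations above (centering, scaling, normalization) can be sketched directly from their formulas. The vector x is an assumption chosen so that its mean is 30, matching the explanation in the centering section:

```r
x <- c(10, 20, 30, 40, 50)

# Centering: subtract the mean
centered_x <- x - mean(x)

# Scaling (standardizing): centered values divided by the standard deviation
scaled_x <- (x - mean(x)) / sd(x)

# Min-max normalization: rescale to the range 0 to 1
normalized_x <- (x - min(x)) / (max(x) - min(x))

print(centered_x)
print(round(scaled_x, 7))
print(normalized_x)

# The built-in scale() performs the same centering and scaling
same_as_scale <- isTRUE(all.equal(as.numeric(scale(x)), scaled_x))
```

Note that scale() returns a one-column matrix rather than a plain vector, which is why the printed output in this section shows a [,1] column header.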
a. scale() Function
Performs both centering and scaling.
Syntax:
Example:
)
print(scaled_data)
Explanation:
}
print(normalized_data)
Explanation:
● lapply() applies the normalization function to each column.
● Output is a dataframe where all values are between 0 and 1.
Formula:
\[
z = \frac{x - \bar{x}}{s}
\]
8. Practical Example: Standardizing a Dataset
data <- data.frame(
)
print(scaled_data)
Output:
Height Weight
1 -1.2649111 -1.2649111
2 -0.6324555 -0.6324555
3 0.0000000 0.0000000
4 0.6324555 0.6324555
5 1.2649111 1.2649111
✅ Both Height and Weight are now on the same scale, ready for analysis.
🧭 Quick Recap
| Technique | Purpose | Formula | Function |
|-----------|---------|---------|----------|
| Normalization | Fit data between 0 and 1 | (x - min(x))/(max(x)-min(x)) | Custom |
✅Tips
● Scaling should be done after splitting data into train/test sets (compute the mean and SD
on the training set only, then apply them to the test set) to prevent data leakage.
In R, each piece of data has a specific data type, such as numeric, character, factor, or
logical. Sometimes, we need to convert variables from one type to another to perform
certain operations correctly, for instance, converting a numeric column into a factor for
categorical analysis, or changing character data into numeric for calculations.
● Data Compatibility: Some functions work only with specific data types (e.g.,
statistical tests often require numeric data).
● Accurate Analysis: Converting categorical variables to factors helps R treat them
properly during modeling.
● Avoid Errors: Incorrect data types may cause calculation or visualization errors.
● Data Cleaning: Imported data (especially from CSV/Excel) may misinterpret types
(e.g., numbers read as characters).
Example:
x <- "25"
class(x)
Output:
[1] "character"
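| Function | Converts To | Example |
|----------|-------------|---------|
| as.numeric() | Numeric | as.numeric("25") → 25 |
| as.integer() | Integer | as.integer(3.8) → 3 |
| as.character() | Character | as.character(25) → "25" |
| as.logical() | Logical | as.logical(1) → TRUE |
| as.factor() | Factor | as.factor(c("A", "B", "A")) |
| as.Date() | Date | as.Date("2024-05-20") |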
y <- as.numeric(x)
print(y)
Output:
[1] 10 20 30
Explanation:
Each string element is converted into a numeric value, making it ready for mathematical
operations.
print(char)
Output:
Explanation:
Numbers become strings, which are treated as text, not numeric values.
print(fact_colors)
Output:
Explanation:
R recognizes unique categories as levels of the factor.
This is essential for categorical analysis (like grouping, plotting, or regression).
7. Example: Converting Factor to Numeric
Direct conversion from factor to numeric can give wrong results because R stores factors
as integer codes internally.
Use a two-step conversion instead.
❌Incorrect way:
as.numeric(f)
Output:
[1] 1 2 3
✅Correct way:
as.numeric(as.character(f))
Output:
[1] 10 20 30
Explanation:
First convert the factor to character, then to numeric to preserve original values.
[Link](x)
Output:
Explanation:
Example:
print(date_real)
Output:
Explanation:
Converts text formatted as “YYYY-MM-DD” into date objects recognized by R.
If your date format is different (e.g., “10/01/2025”), specify the format:
as.Date("10/01/2025", format="%d/%m/%Y")
Example:
Age = c("20", "25", "30"),
)
str(data)
Output:
11. Using mutate() (Tidyverse Way)
library(dplyr)
)
mutate(
ID = as.integer(ID),
Gender = as.factor(Gender)
)
str(df)
Output:
$ ID : int 1 2 3
| Problem | Cause | Solution |
|---------|-------|----------|
| NA values after conversion | Non-numeric characters | Use as.numeric() only on clean data |
survey <- data.frame(
)
str(survey)
🧭 Quick Recap
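| Conversion | Function | Example |
|------------|----------|---------|
| Character → Numeric | as.numeric() | as.numeric("25") |
| Numeric → Character | as.character() | as.character(25) |
| Factor → Numeric | as.numeric(as.character(f)) | Safe conversion |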
✅Tips
● Use mutate(across()) in dplyr for efficient multiple conversions.
Data Sorting
● To rank or prioritize data.
R provides simple functions to sort numeric, character, or logical vectors.
Using sort()
Syntax:
Parameters:
● x → the vector to sort
● decreasing → set TRUE for descending order
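A sketch of sort() with both parameters, alongside order() for comparison. The vector numbers is inferred from the outputs printed in this section (sorted values 3 8 10 15 20 and index order 2 4 5 1 3), so treat it as a reconstruction:

```r
numbers <- c(15, 3, 20, 8, 10)

# Ascending (default) and descending sort
ascending  <- sort(numbers)
descending <- sort(numbers, decreasing = TRUE)

# order() returns the positions that would sort the vector
idx <- order(numbers)

# Indexing with those positions gives the sorted values
via_order <- numbers[idx]
```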
sort(numbers)
Output:
[1] 3 8 10 15 20
Output:
[1] 20 15 10 8 3
sort(names)
Output:
4. Sorting Using order()
order() doesn’t return sorted data directly; it returns the order of indices that can be
used to rearrange data.
Example:
order(numbers)
Output:
[1] 2 4 5 1 3
This means the 2nd element (3) should come first, then the 4th (8), and so on.
numbers[order(numbers)]
Output:
[1] 3 8 10 15 20
Descending Order:
numbers[order(-numbers)]
Output:
[1] 20 15 10 8 3
Example:
)
students[order(students$Marks), ]
Output:
Name Marks
Bilal 80
Ali 85
Sara 90
Hina 95
Descending Order:
students[order(-students$Marks), ]
Output:
Name Marks
Hina 95
Sara 90
Ali 85
Bilal 80
Example:
)
data[order(data$Name, -data$Marks), ]
Output:
Name Marks
Ali 85
Ali 75
Hina 90
Sara 90
Explanation:
● First sorted by Name alphabetically.
● If names are the same, sorted by Marks in descending order.
7. Using dplyr for Sorting
The dplyr package provides a more readable and modern syntax using arrange().
library(dplyr)
Example 3: Multiple Columns
✅ Explanation:
● arrange() sorts by columns.
● desc() specifies descending order.
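A multi-column sort can be sketched in both styles. The data frame is reconstructed from the output shown in the base R multi-column example (Ali 85, Ali 75, Hina 90, Sara 90); the original row order is an assumption:

```r
library(dplyr)

data <- data.frame(
  Name  = c("Sara", "Ali", "Hina", "Ali"),
  Marks = c(90, 75, 90, 85)
)

# Base R: Name ascending, then Marks descending within equal names
sorted_base <- data[order(data$Name, -data$Marks), ]

# dplyr: the same ordering expressed with arrange() and desc()
sorted_dplyr <- arrange(data, Name, desc(Marks))

print(sorted_dplyr)
```

Negating a column inside order() only works for numeric columns; desc() also handles character columns.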
data[order(rownames(data)), ]
Output:
Score
Ali 88
Hina 76
Sara 92
matrix_data <- matrix(c(5,2,8,1,7,4), nrow=2)
matrix_data[, order(colMeans(matrix_data))]
Example:
Output:
[1] 1 3 5 NA
If you set na.last = FALSE, NA values come first.
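The effect of na.last can be sketched on a small vector. The values are an assumption chosen to reproduce the output shown above (1 3 5 NA):

```r
x <- c(3, NA, 1, 5)

# By default sort() drops missing values
dropped <- sort(x)

# na.last = TRUE keeps NA and places it at the end
na_at_end <- sort(x, na.last = TRUE)

# na.last = FALSE keeps NA and places it at the start
na_at_start <- sort(x, na.last = FALSE)
```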
)
employees %>% arrange(Department, desc(Salary))
Output:
🧭 Quick Recap
| Sort with multiple columns | order(df$col1, df$col2) | Base R |
✅Tips
● Always check for NAs before sorting.
● arrange() is cleaner and preferred for pipelines.
Data Aggregation
● Applying summary functions such as mean(), min(), max(), length(), or sum().
Example:
If you have a dataset of students’ marks from different classes, aggregation can help find:
a. Using aggregate()
Syntax:
Parameters:
● x → data to summarize (numeric columns)
● by → list of grouping variables
● FUN → summary function (mean, sum, etc.)
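The parameters above can be sketched with both calling styles of aggregate(). The data frame is an assumption chosen to reproduce the class averages shown in this section (A 82.5, B 82.5, C 88.0):

```r
data <- data.frame(
  Class = c("A", "A", "B", "B", "C"),
  Marks = c(80, 85, 75, 90, 88)
)

# x / by / FUN style: the summarized column is named x in the result
by_list <- aggregate(x = data$Marks, by = list(Class = data$Class), FUN = mean)

# Formula style: "Marks grouped by Class"
by_formula <- aggregate(Marks ~ Class, data = data, FUN = mean)

print(by_formula)
```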
Example:
)
Output:
Class Marks
A 82.5
B 82.5
C 88.0
)
Output:
B 82.5 82.5
✅ The . symbol means “apply to all other columns.”
3. Aggregation with tapply()
tapply() applies a function to subsets of a vector defined by one or more factors.
Syntax:
Example:
Output:
A B C
4. Aggregation with by()
by() is similar to tapply() but works with data frames.
Example:
data <- data.frame(
)
Output:
data$Class: A
[1] 82.5
data$Class: B
[1] 82.5
data$Class: C
[1] 88
5. Aggregation Using dplyr
The dplyr package provides simple and readable functions for aggregation using pipes
(%>%).
a. Using summarise() and group_by()
library(dplyr)
data %>%
group_by(Class) %>%
summarise(Average = mean(Marks))
Output:
Class Average
A 82.5
B 82.5
C 88.0
data %>%
group_by(Class) %>%
summarise(
Avg = mean(Marks),
Min = min(Marks),
Max = max(Marks),
Count = n()
)
Output:
B 82.5 75 90 2
✅ n() counts number of entries per group.
)
sales %>%
summarise(TotalSales = sum(Sales))
Output:
North B 150
Example:
Output:
7. Aggregation Using data.table
The data.table package is extremely efficient for large datasets.
Example:
library(data.table)
Output:
✅ .N gives the number of rows in each group.
Example:
)
Output:
Class Marks
A 80
B 90
)
sales %>%
group_by(Region) %>%
summarise(
Total = sum(Sales),
Average = mean(Sales),
Transactions = n()
)
Output:
| Function | Package | Description |
|----------|---------|-------------|
| aggregate() | Base R | Summarizes data by groups |
| tapply() | Base R | Applies function to grouped vector |
| by() | Base R | Applies function to grouped data frame |
| group_by() + summarise() | dplyr | Modern and clean syntax |
| .N, mean(), etc. | data.table | High-performance aggregation |
✅ Tips:
● Always handle missing values using na.rm = TRUE.
UNIT 4: BASIC STATISTICAL ANALYSIS
1. Introduction to statistical inference
○ t-tests
○ Histograms
○ Bar plots
5. Customizing plots
○ Titles
○ Labels
○ Legends
○ Colors
○ Themes
Introduction to Statistical Inference
1. What is Statistical Inference?
Example:
If we survey 200 students about their study habits, we can infer patterns for all students in
the university.
Hypothesis Testing: Using data to test assumptions about a population.
a. Estimation
● Example: Estimating the average height of all students using a sample of 100
students.
● Used to test a claim about a population parameter.
● Example: Testing whether the average exam score is above 70.
1. Point Estimate – A single value estimate of a population parameter.
Example: Sample mean x̄ = 72 is the point estimate for population mean μ.
2. Interval Estimate – A range of values (confidence interval) within which the true
parameter likely falls.
Example: “The average score is between 70 and 74 with 95% confidence.”
A confidence interval (CI) gives a range that likely contains the true population value.
x̄ ± z * (s / √n)
Where:
● x̄ = sample mean
● s = sample standard deviation
● n = sample size
● z = z-value (1.96 for 95% confidence)
Example in R:
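A sketch of the interval formula x̄ ± z * (s / √n). The inputs (x̄ = 72, s = 8, n = 50) are assumptions, chosen because they reproduce the interval quoted in the interpretation (69.78 to 74.22); the original listing is not available:

```r
xbar <- 72   # sample mean (assumed)
s    <- 8    # sample standard deviation (assumed)
n    <- 50   # sample size (assumed)

# z-value for 95% confidence (about 1.96)
z <- qnorm(0.975)

# Margin of error and interval bounds
margin <- z * s / sqrt(n)
ci <- c(lower = xbar - margin, upper = xbar + margin)

print(round(ci, 2))
```

For small samples with an unknown population SD, qt(0.975, df = n - 1) would replace the z-value.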
Output:
✅ Interpretation: We’re 95% confident that the true mean lies between 69.78 and 74.22.
Example:
● Type II: Concluding a medicine doesn’t work when it actually does.
xample:
E
If α = 0.05, it means we accept a 5% chance of being wrong when rejecting the null
hypothesis.
Problem:
A teacher claims that the average score of students is 75.
A sample of 25 students has a mean of 78 with a standard deviation of 10.
Test the claim at a 5% significance level.
Steps in R:
xbar = 78, s = 10, n = 25)
Explanation:
If the p-value < 0.05 → Reject the claim (significant difference).
If p-value > 0.05 → Fail to reject the claim (no significant difference).
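Because only summary statistics are given in the problem, the test can be sketched by computing the t statistic by hand. This is one way to carry out the steps, not necessarily the approach of the original listing:

```r
mu0  <- 75   # claimed population mean
xbar <- 78   # sample mean
s    <- 10   # sample standard deviation
n    <- 25   # sample size

# One-sample t statistic and degrees of freedom
t_stat <- (xbar - mu0) / (s / sqrt(n))
df <- n - 1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

# Decision at the 5% significance level
reject_null <- p_value < 0.05

print(c(t = t_stat, df = df, p = p_value))
```

With these numbers the p-value is above 0.05, so the sample does not provide enough evidence against the teacher's claim.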
| Term | Meaning |
|------|---------|
| Statistical inference | Drawing conclusions about population from sample |
| Confidence interval | Range containing true parameter |
● Remember that smaller p-values mean stronger evidence against the null
hypothesis.
● Confidence interval = estimation; hypothesis testing = decision-making.
Hypothesis Testing
For example:
● A company claims the average battery life of their phones is 10 hours.
● We collect a sample and test if this claim is true or not.
Example:
H₀: The mean battery life = 10 hours
Example:
H₁: The mean battery life ≠ 10 hours
3. Steps in Hypothesis Testing
1. State hypotheses (H₀ and H₁)
| Test Type | When to Use | Example |
|-----------|-------------|---------|
| Left-tailed | You suspect the mean is less than a value | Mean < 10 |
| Right-tailed | You suspect the mean is greater than a value | Mean > 10 |
5. t-Tests
The t-test is used when:
Used to compare a sample mean to a known value.
Example:
Explanation:
●
scores→ your sample data
●
mu = 75→ population mean (claimed value)
Example:
Explanation:
● group1 and group2 → two independent groups
● var.equal = TRUE → assumes equal variances
Used when the two samples are related, for example, before and after measurements on
the same people.
Example:
t.test(before, after, paired = TRUE)
Explanation:
Checks whether the “after” values differ significantly from “before” values.
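The three t-test variants in this section can be sketched side by side. All data vectors are hypothetical; the argument names (mu, var.equal, paired) are the ones explained above:

```r
# Hypothetical sample data
scores <- c(72, 78, 81, 69, 75, 80, 77, 74)
group1 <- c(65, 70, 68, 72, 66)
group2 <- c(75, 78, 74, 80, 77)
before <- c(60, 65, 70, 58, 62)
after  <- c(66, 70, 73, 64, 65)

# One-sample: compare the sample mean with a claimed value
one_sample <- t.test(scores, mu = 75)

# Two-sample: compare two independent groups, assuming equal variances
two_sample <- t.test(group1, group2, var.equal = TRUE)

# Paired: compare before/after measurements on the same subjects
paired <- t.test(before, after, paired = TRUE)

# Each result stores the p-value and degrees of freedom
print(c(one_sample$p.value, two_sample$p.value, paired$p.value))
```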
Example Interpretation:
If p = 0.03, a difference at least this large would occur only about 3% of the time if the
null hypothesis were true, so at the 5% level we conclude there is a real effect.
Used to test whether the observed frequencies match the expected frequencies.
Example:
Suppose we expect equal distribution of students in 3 courses (Math, CS, Stats),
but the actual counts are different.
chisq.test(x = observed, p = expected/sum(expected))
Explanation:
If p < 0.05, the observed distribution significantly differs from what was expected.
Example:
You want to see if gender is related to course preference.
chisq.test(data)
Explanation:
If p < 0.05, it means gender and course preference are not independent (they’re related).
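Both chi-square tests can be sketched with small hypothetical tables. The counts below are assumptions; the calls mirror the chisq.test() lines in this section:

```r
# Goodness of fit: observed enrolment vs. an equal expected split
observed <- c(Math = 30, CS = 50, Stats = 20)
expected <- c(1, 1, 1)
gof <- chisq.test(x = observed, p = expected / sum(expected))

# Test of independence: gender vs. course preference (2 x 2 table)
data <- matrix(
  c(20, 30, 25, 25),
  nrow = 2,
  dimnames = list(Gender = c("Male", "Female"), Course = c("Math", "CS"))
)
indep <- chisq.test(data)

print(gof$p.value)
print(indep$p.value)
```

For 2 x 2 tables chisq.test() applies Yates' continuity correction by default; pass correct = FALSE to disable it.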
● Chi-square test: Analyzing survey results (e.g., “Is product preference linked to age
group?”)
| Test | Purpose | Example Question |
|------|---------|------------------|
| One-sample t-test | Compare sample mean to population mean | Is average weight = 60kg? |
| Two-sample t-test | Compare two independent groups | Do males and females differ in height? |
| Paired t-test | Compare two related samples | Did training improve scores? |
| Chi-square test | Compare categorical data | Is gender related to department choice? |
● Interpret the result in plain language: “There is a significant difference…” or “No
significant difference was found.”
Regression Analysis
Regression analysis is a statistical technique used to study the relationship between one
dependent variable and one or more independent variables. It helps in predicting the value
of the dependent variable based on the independent variables. In R, regression analysis is
commonly performed using built-in functions such as lm() for linear regression and
glm() for generalized linear models.
Types of Regression in R:
Multiple Linear Regression: Used when there are two or more independent variables.
Example: Predicting sales based on budget and number of employees.
# Illustrative data: employees is not an exact multiple of budget, so both coefficients can be estimated
data <- data.frame(
sales = c(12, 19, 33, 38, 52),
budget = c(1, 2, 3, 4, 5),
employees = c(2, 3, 7, 6, 10)
)
model <- lm(sales ~ budget + employees, data = data)
summary(model)
2.
# Polynomial Regression
x <- c(1, 2, 3, 4, 5)
y <- c(2, 6, 14, 28, 45)
model <- lm(y ~ poly(x, 2, raw = TRUE))
summary(model)
3.
Logistic Regression: Used when the dependent variable is categorical (binary outcome like
0/1).
# Logistic Regression
data <- data.frame(
pass = c(1, 0, 1, 0, 1),
hours = c(5, 1, 8, 2, 10)
)
model <- glm(pass ~ hours, data = data, family = binomial)
summary(model)
4.
● Residuals: Difference between observed and predicted values.
Visualization of Regression Line:
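# Visualization (illustrative data and a simple one-predictor model, since abline() needs a single slope)
sales_data <- data.frame(budget = c(1, 2, 3, 4, 5), sales = c(12, 19, 33, 38, 52))
simple_model <- lm(sales ~ budget, data = sales_data)
plot(sales_data$budget, sales_data$sales, main="Regression Line", xlab="Budget", ylab="Sales")
abline(simple_model, col="blue", lwd=2)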
Applications:
install.packages("ggplot2")
library(ggplot2)
# Scatter Plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(color = "blue", size = 3) +
ggtitle("Scatter Plot of Weight vs MPG") +
xlab("Weight") + ylab("Miles per Gallon")
Explanation:
● Each point represents a car.
● The plot shows how car weight affects fuel efficiency.
3. Histograms
# Histogram
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
ggtitle("Histogram of Miles per Gallon")
Explanation:
● binwidth defines the width of each bar.
# Bar Plot
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar(fill = "orange", color = "black") +
ggtitle("Number of Cars by Cylinder Type") +
xlab("Cylinders") + ylab("Count")
Explanation:
● factor(cyl) converts the numeric variable into a categorical one.
● Each bar shows the number of cars for a specific cylinder count.
Line plots are used to show trends over a continuous variable (often time).
# Line Plot
ggplot(economics, aes(x = date, y = unemploy)) +
geom_line(color = "darkgreen", linewidth = 1) +
ggtitle("Unemployment Over Time") +
xlab("Date") + ylab("Number of Unemployed")
Explanation:
# Box Plot
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "lightgreen") +
ggtitle("MPG by Cylinder Type") +
xlab("Cylinders") + ylab("Miles per Gallon")
Explanation:
● Displays spread and skewness of MPG for each cylinder category.
# Density Plot
ggplot(mtcars, aes(x = mpg)) +
geom_density(fill = "pink", alpha = 0.5) +
ggtitle("Density Plot of MPG")
● Compatible with various data transformation packages like dplyr.
# Combined Plot
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Relationship Between Weight and MPG by Cylinder") +
xlab("Weight") + ylab("Miles per Gallon")
This adds both scatter points and a fitted regression line for each cylinder category.
Customizing Plots
While ggplot2 provides beautiful default visuals, customizing your plots helps make them
clearer, more readable, and presentation-ready. You can change almost every element,
from titles and axis labels to colors, legends, and themes.
You can add descriptive titles and captions to make your plots more informative.
Explanation:
● title adds the main heading.
● subtitle gives additional context.
● x and y label the axes.
● caption appears at the bottom, useful for mentioning data sources.
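The label arguments listed above all belong to labs(). A minimal sketch on the built-in mtcars data follows; the label text itself is an illustrative assumption:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue") +
  labs(
    title    = "Fuel Efficiency vs Weight",        # main heading
    subtitle = "Heavier cars tend to use more fuel", # extra context
    x        = "Weight (1000 lbs)",                 # x-axis label
    y        = "Miles per Gallon",                  # y-axis label
    caption  = "Source: mtcars dataset"             # note at the bottom
  )

print(p)
```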
You can modify axis text, font size, or rotation for better clarity.
Explanation:
● breaks defines tick intervals.
● limits restricts the axis range.
You can assign specific colors manually or use predefined color scales.
Explanation:
● scale_fill_manual() and scale_color_gradient() let you precisely define
color schemes.
Tip: You can also use "top", "left", "right", or "none" to remove the legend.
Example:
● theme_bw() – black and white clean style.
● theme_minimal() – modern minimal look.
● theme_classic() – simple with axes and no grid lines.
● theme_light() – gentle background.
6. Faceting (Multiple Plots by Category)
Explanation:
Each panel shows data for one cylinder category — a great way to compare groups visually.
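Faceting combines naturally with the theme and legend options discussed earlier. The sketch below is one possible arrangement on mtcars, not the original listing:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  facet_wrap(~ cyl) +                 # one panel per cylinder count
  theme_minimal() +                   # built-in minimal theme
  theme(legend.position = "top") +    # move the legend above the panels
  labs(color = "Cylinders", x = "Weight", y = "Miles per Gallon")

print(p)
```

Setting legend.position = "none" instead would hide the legend, which is often reasonable when the facet labels already identify the groups.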
Here’s how you can combine all these ideas in one polished plot:
Explanation:
This plot includes:
🧭 Quick Summary: Plot Customization Tips
✅ Exam Tip:
Questions often ask about plot customization functions and theme control in ggplot2.
Remember:
● labs() for labels,
● theme() for style,
● scale_...() for colors/scales.
Exploratory Data Analysis (EDA) is the process of visually and statistically exploring data
to understand its structure, patterns, relationships, and anomalies before formal
modeling.
It’s the most important phase in any data science project because it helps you discover what
the data is trying to tell you.
● How are different variables related?
EDA is both a science and an art: it’s about asking the right questions and visually
exploring answers.
Insight:
Check if the data is normally distributed, skewed, or has outliers.
Insight:
Higher-cylinder cars generally have lower mileage, and box plots can confirm that visually.
Insight:
Cars with higher weight tend to have lower mileage, and horsepower also influences the
trend.
The GGally package allows creating pair plots (scatterplots for every numeric variable pair).
library(GGally)
ggpairs(mtcars[, c("mpg", "wt", "hp", "disp")])
Insight:
Quickly see correlations and patterns across several variables.
library(reshape2)
cor_matrix <- cor(mtcars)
melted_cor <- melt(cor_matrix)
ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap")
Insight:
Strong correlations (positive or negative) appear darker, helping spot variable
dependencies.
Insight:
Helps in comparing categories like gear types, fuel type, or transmission.
3. Combining EDA with Summary Statistics
summary(mtcars)
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), avg_hp = mean(hp))
Explanation:
Summarizes average mileage and horsepower by the number of cylinders, a useful analytical view.
Visual techniques such as boxplots and scatter plots are best for spotting outliers, but you can also detect them programmatically.
Explanation:
This finds MPG values that lie outside the normal range.
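One common programmatic approach is the 1.5 × IQR rule. The sketch below assumes that rule, since these notes do not state which method they use: values more than 1.5 interquartile ranges below the first quartile or above the third quartile are flagged.

```r
# Flag MPG values outside the 1.5 * IQR fences (assumed rule, see lead-in)
mpg <- mtcars$mpg

q1 <- quantile(mpg, 0.25)
q3 <- quantile(mpg, 0.75)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers <- mpg[mpg < lower | mpg > upper]
print(outliers)
```

Note that quantile() and boxplot() compute quartiles slightly differently, so a value flagged here may not always be drawn as an outlier point in a box plot.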
You can create dashboards of multiple visuals using packages like patchwork or cowplot to combine plots for a holistic view.
library(ggplot2)
library(patchwork)
p1 <- ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 10, fill = "skyblue")
p2 <- ggplot(mtcars, aes(wt, mpg)) + geom_point(color = "darkgreen")
p1 + p2  # patchwork places the two plots side by side
🧩 Summary of EDA Visualization Techniques
Visualization Purpose
Heatmap Correlations
💡 Quick Tips
✅ Exam Tip:
Common questions:
When working with data analysis or research, it’s important to make your work reproducible, meaning anyone can rerun your code and get the same results, along with all the visuals, explanations, and outputs in one organized report.
In R, this is achieved using R Markdown, a powerful tool that combines code, output, and text in a single document. From R Markdown, you can generate professional reports in HTML, PDF, or Word formats.
Example:
---
title: "My Analysis Report"
author: "Shah Faisal"
date: "2025-11-10"
output: html_document
---

## Introduction

This report explores the relationship between car weight and mileage.

```{r}
# R code chunk
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(color = "blue")
```

## Conclusion

We observe that heavier cars tend to have lower mileage.
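A document like this is usually turned into a report with the Knit button in RStudio or with rmarkdown::render(). The sketch below writes a small R Markdown file and renders it from code; the file name report.Rmd and the temporary directory are hypothetical choices for illustration, and rendering is skipped if the rmarkdown package or pandoc is not installed.

```r
# Hypothetical file name, written to a temporary directory for illustration
rmd_path <- file.path(tempdir(), "report.Rmd")

# Build the chunk delimiter programmatically to keep this example self-contained
fence <- strrep("`", 3)

writeLines(c(
  "---",
  "title: \"My Analysis Report\"",
  "output: html_document",
  "---",
  "",
  "## Introduction",
  "",
  paste0(fence, "{r}"),
  "plot(mtcars$wt, mtcars$mpg)",
  fence
), rmd_path)

# Render only when the rmarkdown package and pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, output_format = "html_document", quiet = TRUE)
  print(out_file)
}
```

Changing output_format to "pdf_document" or "word_document" produces the other formats, provided the required tools (for example a LaTeX installation for PDF) are present.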
output: html_document
Advantages:
Example Output:
output: pdf_document
Advantages:
● Professional formatting
output: word_document
Advantages:
Example:
summary(mtcars$mpg)
boxplot(mtcars$mpg, main = "Boxplot of MPG")
head(mtcars)
This is where R Markdown shines — you can mix your explanations and visuals together:
The dataset shows that cars with higher **weight** (`wt`) have lower **mileage** (`mpg`).
```{r}
plot(mtcars$wt, mtcars$mpg)
```
This structure makes your report readable like a story: each code snippet is immediately explained by the text around it.
Example:
The dataset contains `r nrow(mtcars)` observations and `r ncol(mtcars)` variables.
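When the document is knitted, each inline `r ...` expression is replaced by its value. The sketch below reproduces the resulting sentence in plain R, assuming the built-in mtcars dataset; the variable names are illustrative.

```r
# Evaluate the same expressions that the inline code would run
n_obs <- nrow(mtcars)
n_vars <- ncol(mtcars)

# Build the sentence as it would appear in the knitted report
sentence <- sprintf(
  "The dataset contains %d observations and %d variables.",
  n_obs, n_vars
)
print(sentence)
# → "The dataset contains 32 observations and 11 variables."
```

Because the numbers are computed at knit time, the sentence stays correct if the underlying data changes.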
Example YAML:
output:
  html_document:
    theme: cerulean
    toc: true
    toc_float: true
● Automation: Update one dataset, and the entire report updates automatically.
● Professional Presentation: Clean, structured reports suitable for projects and research.
Format   Extension   Best for                Key feature
HTML     .html       Interactive sharing     Beautiful web-based layout
PDF      .pdf        Academic/official use   Print-ready format
Word     .docx       Editable reports        Easy to modify or annotate
💡 Quick Tips
● Use ## and ### for section headings.
✅ Exam Tip:
Common questions: