R Programming Basics and Overview

The document provides an introduction to R programming and RStudio, detailing their features, setup, and usage for data analysis. It covers basic concepts such as variables, data types, operators, control structures, and functions in R. Additionally, it highlights real-world applications of R in various fields like data science, finance, and healthcare.

Uploaded by

WhiteHatFarhan

R PROGRAMMING NOTES
Second Semester, DSAI, Department of Computer Science, University of Kashmir

UNIT 1: INTRODUCTION TO R
1. Overview of R and RStudio

2. Basics of R programming

○ Variables

○ Data types

○ Operators

3. Control structures

○ if-else

○ loops

4. Functions in R

○ Defining functions

○ Commonly used mathematical functions

○ Commonly used string functions

5. User-defined functions

6. Local and global variables

Overview of R and RStudio
1. Introduction to R

R is a powerful programming language and environment mainly used for data analysis, statistics, and visualization. It was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.

Think of R as a digital lab for data scientists: a place where you can experiment with data, analyze patterns, and visualize results beautifully.

Key Features of R

● Open Source: Free to download and use, available to everyone.

● Cross-Platform: Works on Windows, macOS, and Linux.

● Extensive Libraries: Thousands of built-in and external packages for data science, statistics, and machine learning.

● Strong Visualization Support: R creates high-quality plots and graphs with libraries like ggplot2 and lattice.

● Data Handling: Handles large datasets efficiently and supports data cleaning and transformation.

● Community Support: Large, active community providing help, tutorials, and open-source packages.

Why R?

● Ideal for data science, machine learning, statistical modeling, and research.

● Preferred by statisticians and data analysts for its accuracy and statistical depth.

● Integrates easily with tools like Excel, SQL, and Python.

Example:

# A simple R example
x <- c(2, 4, 6, 8, 10)
mean(x)

Explanation:

● c() creates a vector of numbers.

● mean() calculates the average of the given vector.

This simple code shows how quick and readable R is for basic data analysis.

2. What is RStudio?

RStudio is an Integrated Development Environment (IDE) for R.
Think of R as the engine, and RStudio as the car dashboard that makes driving easier.

It provides a clean and organized interface to write, test, and visualize your R code
efficiently.

Main Components of RStudio

1. Source Pane:

○ This is where you write and edit your R scripts.

○ Files are usually saved with the extension .R.

2. Console Pane:

○ The execution area where you run commands directly.

○ Anything you type here runs immediately.

3. Environment / History Pane:

○ Shows all active variables, datasets, and functions in memory.

○ The History tab lists all commands you’ve executed.

4. Files / Plots / Packages / Help / Viewer Pane:

○ Files: View files in your current working directory.

○ Plots: Displays graphs and visualizations.

○ Packages: Manage installed R packages.

○ Help: Access R documentation.

○ Viewer: Displays HTML outputs and interactive visuals.

3. How R and RStudio Work Together

● You write your R code inside RStudio.

● RStudio sends that code to the R interpreter, which processes and executes it.

● The results (numbers, plots, or errors) appear in the RStudio Console or Plots window.

RStudio acts as the user-friendly face of R.

4. Setting Up R and RStudio

1. Install R:

○ Go to [Link] and download R for your OS.

2. Install RStudio:

○ Visit [Link] and install RStudio Desktop.

3. Open RStudio:

○ You’ll see four main panes as explained earlier.

○ Start typing commands in the Console, or create a new R script via File → New File → R Script.

5. Real-World Uses of R

● Data Science: Used for analyzing datasets, making predictions, and creating dashboards.

● Academia & Research: For running statistical tests and modeling data.

● Finance: Risk modeling, forecasting stock prices.

● Healthcare: Analyzing patient data and clinical trials.

● Marketing: Customer segmentation and trend analysis.

6. Quick Tips for Beginners

● Use the arrow keys in the Console to navigate through previous commands.

● Save scripts with the .R extension for later use.

● Use # to write comments (ignored by R but useful for notes).

● Press Ctrl + Enter to run the current line or the selected code.

✨ Quick Recall Box

● R = Language for data analysis and statistics.

● RStudio = User-friendly interface for R.

● R scripts end with .R.

● Four main panes in RStudio: Source, Console, Environment/History, Files/Plots/Packages/Help/Viewer.

● Common command example:

print("Hello, R World!")

Basics of R Programming
Every programming language begins with understanding its building blocks: how to store information, what kinds of information exist, and how to perform operations on them.
In R, these basic concepts revolve around variables, data types, and operators.

Let’s explore each one clearly and step-by-step.

1. Variables in R

A variable is like a container that holds data.
You can store a number, text, or even an entire dataset in a variable.

How to Create a Variable

In R, you can assign a value to a variable using any of the following operators:

x <- 10 # most common
y = 20 # also works
30 -> z # less common but valid

All three mean the same thing — they assign a value to a variable.

Variable Naming Rules

● Must start with a letter (A–Z or a–z), or with a dot that is not followed by a digit.

● Can contain letters, numbers, dots, or underscores.

● Cannot start with a number or an underscore, and cannot contain spaces.

● R is case-sensitive → Data and data are two different variables.
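
The naming rules above can be checked directly in R. The sketch below is one possible way to do it: it relies on base R's make.names(), which rewrites a string into a syntactically valid name, and treats a candidate as valid when it comes back unchanged. The candidate names are made up for illustration.

```r
# Candidate names (hypothetical, chosen only to illustrate the rules)
candidates <- c("score", "my_var", ".hidden", "2fast", "my var", "_temp")

# make.names() returns a syntactically valid version of each string;
# a name is already valid if it comes back unchanged
is_valid <- make.names(candidates) == candidates
print(data.frame(name = candidates, valid = is_valid))

# Case sensitivity: Data and data are separate variables
Data <- 1
data <- 2
Data == data   # FALSE, because the two variables hold different values
```

The first three candidates pass; the last three fail because they start with a digit, contain a space, or start with an underscore.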

Example
name <- "R Programming"
version <- 4.3
isFun <- TRUE

Explanation:

● name stores a text value (called a string)

● version stores a number

● isFun stores a logical value (TRUE/FALSE)

You can check the value of a variable by just typing its name:

name

2. Data Types in R

R supports several data types, each representing a different kind of data.
Let’s look at the most common ones:

Data Type    Example            Description

Numeric      12.5, -4, 7.0      Numbers with or without decimals

Integer      10L, -2L           Whole numbers (use L to specify integer)

Character    "Hello", 'Data'    Text or string data

Logical      TRUE, FALSE        Boolean values for conditions

Complex      2 + 3i             Numbers with real and imaginary parts

Raw          charToRaw("R")     Used for raw byte data

Example
a <- 15.7 # numeric
b <- 10L # integer
c <- "R is fun" # character
d <- TRUE # logical
e <- 2 + 3i # complex

You can check the type of data stored in a variable using:

class(a)
typeof(b)

These functions help you understand what kind of data each variable holds.
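
As a small illustration of how the two functions differ, the sketch below reuses the variables a and b from the example above. class() reports the high-level type, while typeof() reports how the value is stored internally.

```r
a <- 15.7
b <- 10L

class(a)    # "numeric"
typeof(a)   # "double" (how the number is stored internally)

class(b)    # "integer"
typeof(b)   # "integer"

# is.*() functions test for a type and return TRUE or FALSE
is.numeric(b)     # TRUE: integers also count as numeric
is.character(a)   # FALSE
```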

3. Type Conversion in R

Sometimes you may need to change one data type to another.
R provides simple conversion functions:

Function          Converts To

as.numeric()      Numeric

as.integer()      Integer

as.character()    Character

as.logical()      Logical

Example

x <- "25"
as.numeric(x)

Explanation:
This converts the string "25" into the number 25.
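
A few more conversions, sketched with base R only, show what happens at the edges: decimals are truncated, and text that does not look like a number becomes NA. The values are illustrative.

```r
as.numeric("25") + 5      # 30: the string is now a number
as.integer(3.9)           # 3: the decimal part is dropped, not rounded
as.character(42)          # "42"
as.logical("TRUE")        # TRUE
as.logical(0)             # FALSE: zero is FALSE, any other number is TRUE

# Text that does not look like a number becomes NA (R also gives a warning,
# which is silenced here so the example runs cleanly)
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)                # TRUE
```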

4. Operators in R

Operators help perform actions on variables and values.
They are divided into different categories:

A. Arithmetic Operators

Used for mathematical operations.

Operator    Meaning                Example     Output

+           Addition               5 + 3       8

-           Subtraction            5 - 2       3

*           Multiplication         4 * 2       8

/           Division               10 / 2      5

%%          Modulus (remainder)    10 %% 3     1

%/%         Integer division       10 %/% 3    3

^           Power                  2 ^ 3       8

Example

a <- 10
b <- 3
a + b
a %/% b

B. Relational Operators

Used tocomparevalues.

Operator    Meaning                   Example    Output

==          Equal to                  5 == 5     TRUE

!=          Not equal to              5 != 3     TRUE

>           Greater than              7 > 3      TRUE

<           Less than                 2 < 5      TRUE

>=          Greater than or equal     4 >= 4     TRUE

<=          Less than or equal        3 <= 2     FALSE

Example

x <- 10
y <- 20
x > y
x <= y

Explanation:

● The first expression returns FALSE

● The second returns TRUE

C. Logical Operators

Used to combine multiple conditions.

Operator    Meaning             Example               Output

&           AND (both true)     (5 > 2) & (3 < 6)     TRUE

|           OR (either true)    (5 > 2) | (3 > 6)     TRUE

!           NOT (negation)      !(5 > 2)              FALSE
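
The table uses single values, but & and | also work element by element on vectors. The sketch below adds two things that are not covered in the notes above and are included only as an aside: the scalar forms && and ||, which are the usual choice inside if conditions, and the helpers any() and all(). The variable names are illustrative.

```r
x <- c(1, 5, 10)

# & and | compare element by element and return a logical vector
x > 2 & x < 8      # FALSE  TRUE FALSE
x < 2 | x > 8      #  TRUE FALSE  TRUE

# && and || expect single TRUE/FALSE values, which suits if () conditions
age <- 20
if (age >= 18 && age < 65) {
  print("Working age")
}

# any() and all() collapse a logical vector to one value
any(x > 8)   # TRUE
all(x > 0)   # TRUE
```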

D. Assignment Operators

Assign values to variables.

Operator    Example     Meaning

<-          x <- 10     assigns 10 to x

->          10 -> x     assigns 10 to x

=           x = 10      assigns 10 to x

E. Miscellaneous Operators

Operator    Description              Example

:           Sequence generator       1:5 gives 1 2 3 4 5

%in%        Membership test          2 %in% c(1,2,3) → TRUE

%*%         Matrix multiplication    Used for multiplying matrices

💡 Real-World Example

Imagine you’re analyzing exam scores:

math <- 80
science <- 90
average <- (math + science) / 2
average

Explanation:
We created two numeric variables and calculated their average, a basic but common data analysis operation.

✨ Quick Recall Box

● Variables store data, and their names are case-sensitive.

● Data Types: numeric, integer, character, logical, complex.

● Use <- for assignment.

● Check type: class(), typeof().

● Operators: arithmetic, relational, logical, assignment.

● : creates sequences, %in% checks membership.

Control Structures in R
Control structures help you control the flow of your program: they decide what to do next based on certain conditions, or repeat actions multiple times.
In simple words, they make your R programs smarter and more dynamic.

There are two main types you’ll learn here:

1. Conditional statements (if-else)

2. Loops

1. Conditional Statements — if, else if, and else

Conditional statements allow your program to make decisions.
They check whether a condition is TRUE or FALSE and execute code accordingly.

Syntax
if (condition) {
  # code to run if condition is TRUE
} else if (another_condition) {
  # code to run if the above is FALSE but this is TRUE
} else {
  # code to run if none are TRUE
}

Example 1: Simple if Statement

x <- 5
if (x > 0) {
  print("Positive number")
}

Explanation:
Since x is greater than 0, the condition is TRUE, so the message “Positive number” is printed.

Example 2: if-else Statement

x <- -3
if (x >= 0) {
  print("Positive number")
} else {
  print("Negative number")
}

Explanation:
Here, x is -3, so the condition x >= 0 is FALSE.
The else block runs and prints “Negative number”.

Example 3: if-else-if Ladder

x <- 0
if (x > 0) {
  print("Positive")
} else if (x < 0) {
  print("Negative")
} else {
  print("Zero")
}

Explanation:
This example checks multiple conditions: it first checks for positive, then negative, and finally prints “Zero” if neither is true.

Example 4: Nested if

You can also place one if inside another.

x <- 20
if (x > 10) {
  if (x < 30) {
    print("Between 10 and 30")
  }
}

Explanation:
The inner if only runs if the outer condition is true, making it a nested decision.

2. Loops in R

Loops allow you to repeat a block of code multiple times.
This is especially useful when performing repetitive tasks like printing numbers, analyzing data rows, or performing calculations for many values.

R provides three main types of loops:

● for loop

● while loop

● repeat loop

A. for Loop

Used when you know how many times you want to repeat something.

Syntax:

for (variable in sequence) {
  # code to repeat
}

Example:

for (i in 1:5) {
  print(paste("This is loop number", i))
}

Explanation:

●
1:5creates a sequence (1, 2, 3, 4, 5).

● The loop runs five times, printing the message each time with the loop number.

B. while Loop

Used when you don’t know exactly how many times to loop: it runs as long as a condition remains TRUE.

Syntax:

while (condition) {
# code to execute
}

Example:

count <- 1
while (count <= 5) {
  print(paste("Count is", count))
  count <- count + 1
}

Explanation:
The loop keeps printing until count becomes greater than 5.
If you forget to update count, this can lead to an infinite loop.

C. repeat Loop

The repeat loop is an infinite loop that continues until you use the break statement to stop it.

Syntax:

repeat {
  # code
  if (condition) {
    break
  }
}

Example:

x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x > 5) {
    break
  }
}

Explanation:
The loop keeps printing x and increases it by 1 until x becomes greater than 5, then stops.

3. Loop Control Statements

Sometimes you may want to skip certain iterations or exit a loop early.

A. break

Stops the loop entirely.
Example:

for (i in 1:10) {
  if (i == 6) {
    break
  }
  print(i)
}

Output:
1 2 3 4 5

The loop stops when i equals 6.

B. next

Skips the current iteration and moves to the next one.
Example:

for (i in 1:5) {
  if (i == 3) {
    next
  }
  print(i)
}

Output:
1 2 4 5

The loop skips printing 3.

4. Combining Conditions and Loops

Loops and conditions often work together in real tasks like data cleaning or summarizing values.

Example:

numbers <- c(2, 5, 7, 9, 12)
for (num in numbers) {
  if (num %% 2 == 0) {
    print(paste(num, "is even"))
  } else {
    print(paste(num, "is odd"))
  }
}

Explanation:
This loop checks each number in the vector and prints whether it’s even or odd.
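
The same even/odd check can also be written without an explicit loop. The sketch below is an alternative, not a replacement for the loop above: it uses base R's ifelse(), which applies a condition to every element of a vector at once. The variable name labels is illustrative.

```r
numbers <- c(2, 5, 7, 9, 12)

# ifelse() tests every element at once and picks a label for each
labels <- ifelse(numbers %% 2 == 0, "even", "odd")
print(paste(numbers, "is", labels))
```

For small tasks the loop and the vectorized form give the same messages; the vectorized form is usually shorter when the same rule applies to every element.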

💡 Real-World Example

Imagine you’re analyzing student scores:

scores <- c(85, 42, 90, 33, 76)
for (s in scores) {
  if (s >= 50) {
    print("Pass")
  } else {
    print("Fail")
  }
}

Explanation:
This loop goes through each score and checks if it’s a pass or fail, similar to an automated grading system.

✨ Quick Recall Box

● if-else handles decision-making.

● for, while, and repeat handle repetition.

● break exits a loop; next skips one iteration.

● Combine loops with conditions for flexible logic.

● Infinite loops happen if you forget to update your loop variable.

Functions in R
Functions are the heart of R programming: they help you organize your code, reuse logic, and simplify complex tasks.
Think of a function as a mini-program inside your main program that performs a specific job whenever you call it.

For example, if you often need to calculate the average of numbers, instead of rewriting the same code again and again, you can just write a function once and reuse it whenever needed.

1. Defining Functions in R

Syntax
function_name <- function(arguments) {
  # body of the function
  # code to execute
  return(result)
}

Let’s break it down:

● function_name → the name you give to your function.

● function() → defines the function.

● arguments → inputs the function takes.

● return() → sends back the output (optional, because R returns the value of the last expression when return() is omitted, but good practice).

Example 1: Simple Function

add_numbers <- function(a, b) {
  sum <- a + b
  return(sum)
}
add_numbers(5, 7)

Explanation:

● add_numbers is a user-defined function that takes two arguments.

● It adds them and returns the result → output will be 12.

Example 2: Function Without Parameters

say_hello <- function() {
  print("Hello from R!")
}
say_hello()

Explanation:
This function doesn’t take any arguments. It simply prints a message whenever you call it.

Example 3: Function with Default Arguments

greet <- function(name = "Student") {
  paste("Welcome,", name)
}
greet()
greet("Faisal")

Explanation:

● If no name is given, it uses "Student" as the default.

● If you pass "Faisal", it personalizes the message.

Example 4: Function Returning Multiple Values

You can return multiple values by combining them in a list.

calculate <- function(x, y) {
  sum <- x + y
  diff <- x - y
  return(list(Sum = sum, Difference = diff))
}
result <- calculate(10, 5)
result$Sum
result$Difference

Explanation:
This function returns both sum and difference in a list.
You can access each value using $.

2. Commonly Used Mathematical Functions

R comes with many built-in mathematical functions that make calculations easy.

Function                  Description          Example              Output

abs(x)                    Absolute value       abs(-5)              5

sqrt(x)                   Square root          sqrt(16)             4

exp(x)                    Exponential          exp(1)               2.718

log(x)                    Natural log          log(10)              2.303

log10(x)                  Base-10 log          log10(100)           2

round(x, n)               Round to n digits    round(3.14159, 2)    3.14

ceiling(x)                Round up             ceiling(2.3)         3

floor(x)                  Round down           floor(2.9)           2

sin(x), cos(x), tan(x)    Trigonometric        sin(pi/2)            1

sum(x)                    Sum of elements      sum(c(1,2,3))        6

mean(x)                   Average value        mean(c(2,4,6))       4

max(x)                    Maximum value        max(c(5,9,2))        9

min(x)                    Minimum value        min(c(5,9,2))        2

Example
nums <- c(2, 4, 6, 8)
mean(nums)
sd(nums)

Explanation:

● mean() finds the average.

● sd() gives the standard deviation, which measures how spread out the data is.
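
To see what sd() computes, the sketch below works through the sample standard deviation by hand for the same nums vector and compares it with the built-in result. The intermediate names (n, avg, squared_dev, manual_sd) are illustrative.

```r
nums <- c(2, 4, 6, 8)

n <- length(nums)                 # 4 values
avg <- sum(nums) / n              # same as mean(nums): 5
squared_dev <- (nums - avg)^2     # 9 1 1 9

# sd() uses the sample formula, dividing by n - 1 rather than n
manual_sd <- sqrt(sum(squared_dev) / (n - 1))

manual_sd
all.equal(manual_sd, sd(nums))    # TRUE
```

In formula form this is sd = sqrt(sum((x - mean)^2) / (n - 1)); here that is sqrt(20 / 3), roughly 2.58.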

3. Commonly Used String Functions

Working with text (called strings) is common in R, for example when cleaning names, formatting output, or labeling graphs.

Here are some useful functions:

Function                  Description                           Example                           Output

nchar(x)                  Counts characters                     nchar("Hello")                    5

toupper(x)                Converts to uppercase                 toupper("data")                   "DATA"

tolower(x)                Converts to lowercase                 tolower("RStudio")                "rstudio"

substr(x, start, stop)    Extracts part of a string             substr("Learning", 1, 4)          "Lear"

paste(x, y, sep=" ")      Joins strings                         paste("R", "Language")            "R Language"

paste0(x, y)              Joins without space                   paste0("Data", "Science")         "DataScience"

strsplit(x, split)        Splits a string (returns a list)      strsplit("R is fun", " ")         "R" "is" "fun"

grep(pattern, x)          Finds matching text (as positions)    grep("R", c("R","Python","C"))    1

Example: Using String Functions Together

sentence <- "R programming is interesting"
words <- strsplit(sentence, " ")
toupper(words[[1]])

Explanation:

● strsplit() breaks the sentence into words.

● words[[1]] accesses the vector of words inside the returned list.

● toupper() converts all words to uppercase.
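
A short sketch of two details that are easy to miss: grep() returns positions rather than TRUE/FALSE, and strsplit() returns a list. The function grepl() and the collapse argument of paste() are base R features not shown in the table above; the langs vector is made up for illustration.

```r
langs <- c("R", "Python", "C", "Rust")

grep("R", langs)     # 1 4: positions of the matching elements
grepl("R", langs)    # TRUE FALSE FALSE TRUE: one answer per element

# strsplit() returns a list, so [[1]] pulls out the character vector
words <- strsplit("R programming is interesting", " ")[[1]]
length(words)                          # 4
paste(toupper(words), collapse = "-")  # "R-PROGRAMMING-IS-INTERESTING"
```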

4. Why Functions Matter

Functions help make your code:

● Reusable → write once, use anywhere.

● Readable → organized and easy to understand.

● Maintainable → easy to update or fix errors.

They’re especially powerful in data analysis where repetitive operations are common, such as cleaning multiple datasets or computing statistical measures.

💡 Real-World Example

Let’s say you want to calculate total marks and percentage for a student:

calculate_result <- function(math, science, english) {
  total <- math + science + english
  percentage <- (total / 300) * 100
  return(list(Total = total, Percentage = percentage))
}
student1 <- calculate_result(80, 75, 90)
student1

Explanation:
This function computes both total and percentage for a student — practical, reusable, and
easy to extend for more subjects.
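
One possible way to extend the function to more subjects is sketched below. It is an illustration only: the name calculate_result_any and the parameter max_per_subject (assumed to be 100 marks per subject) are not part of the notes above. The special argument ... collects however many marks are passed in.

```r
# Variant of calculate_result() that accepts any number of subject marks.
# max_per_subject is an assumed parameter (100 marks per subject).
calculate_result_any <- function(..., max_per_subject = 100) {
  marks <- c(...)                       # collect all marks into one vector
  total <- sum(marks)
  percentage <- total / (length(marks) * max_per_subject) * 100
  list(Total = total, Percentage = percentage)
}

three <- calculate_result_any(80, 75, 90)        # same inputs as student1
four  <- calculate_result_any(80, 75, 90, 85)    # one extra subject
three$Total        # 245
four$Percentage    # 82.5
```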

✨ Quick Recall Box

● Define a function using function().

● Use return() to send back output.

● Mathematical functions like sum(), mean(), sqrt(), etc.

● String functions like nchar(), toupper(), paste(), etc.

● Functions make code modular, clean, and reusable.

User-defined Functions

User-defined functions in R are custom functions that you create to perform specific tasks not covered by R’s built-in functions. They help you make your programs modular, readable, and reusable.

Creating a User-defined Function

You can define your own function using the function() keyword.
Syntax:

function_name <- function(parameter1, parameter2, ...) {
  # function body
  # computations
  return(result)
}

● function_name: name you assign to the function.

● parameters: values passed to the function (optional).

● return(): sends back the result (optional).

Example 1: Function without Arguments

greet <- function() {

print("Welcome to R Programming!")

}

greet()

Output:

[1] "Welcome to R Programming!"

Example 2: Function with Parameters

multiply <- function(a, b) {

product <- a * b

return(product)

}

multiply(6, 4)

Output:

[1] 24

Example 3: Function with Default Parameters

You can assign default values to parameters.

area_circle <- function(radius = 1) {

area <- pi * radius^2

return(area)

}

area_circle() # uses default value

area_circle(3) # uses user input

Output:

[1] 3.141593

[1] 28.27433

Example 4: Function Returning Multiple Values

A function can return multiple values using alist.

operations <- function(x, y) {

result <- list(

sum = x + y,

difference = x - y,

product = x * y,

quotient = x / y

)

return(result)

}

output <- operations(10, 5)

print(output)

Output:

$sum

[1] 15

$difference

[1] 5

$product

[1] 50

$quotient

[1] 2

Example 5: Function Calling Another Function

Functions can call other functions inside them.

square <- function(x) {

return(x^2)

}

sum_of_squares <- function(a, b) {

return(square(a) + square(b))

}

sum_of_squares(3, 4)

Output:

[1] 25

Example 6: Anonymous Functions (Lambda Functions)

Anonymous functions are unnamed functions, usually short, that are often used with the apply() family of functions.

squared_values <- sapply(1:5, function(x) x^2)

print(squared_values)

Output:

[1] 1 4 9 16 25
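
For comparison, the sketch below shows the same idea three ways: a named function passed to sapply(), an anonymous function that takes an extra argument, and vapply(), a stricter relative of sapply() from base R that is not covered above. The names square and powers are illustrative.

```r
# Named function passed to sapply()
square <- function(x) x^2
sapply(1:5, square)                       # 1 4 9 16 25

# An anonymous function with an extra argument p, supplied after the function
powers <- sapply(1:5, function(x, p) x^p, p = 3)
powers                                    # 1 8 27 64 125

# vapply() works like sapply() but also checks the type of each result
vapply(1:5, function(x) x^2, numeric(1))  # 1 4 9 16 25
```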

User-defined functions give you complete control over what your code does, making them essential for structuring large projects or automating repetitive tasks.

Local and Global Variables

In R, variables can exist inside or outside of functions. This determines their scope, meaning where a variable can be accessed or modified. Understanding this concept is very important when writing larger programs or working with multiple functions.

1. Local Variables

A local variable is one that’s created inside a function and can only be accessed within that function.
Once the function finishes running, the local variable disappears (it’s destroyed).

Example:

add_numbers <- function() {

x <- 10

y <- 20

sum <- x + y

print(sum)

}

add_numbers()

print(x) # trying to access x outside the function

Output:

[1] 30

Error: object 'x' not found

🟢Explanation:

● x and y are local to the function add_numbers().

● When you try to print x outside, R gives an error because it doesn’t exist in the global environment.

2. Global Variables

A global variable is declared outside any function and can be used both inside and outside of functions.

Example:

a <- 5

multiply <- function() {

result <- a * 10

print(result)

}

multiply()

print(a)

Output:

[1] 50

[1] 5

🟢Explanation:

● a is a global variable, so it’s accessible inside the multiply() function as well as outside.

3. Modifying Global Variables Inside Functions

By default, if you assign a value to a variable inside a function, R creates a new local variable with that name. It does not modify the global variable.

Example:

x <- 100

change_value <- function() {

x <- 50

print(paste("Inside function:", x))

}

change_value()

print(paste("Outside function:", x))

Output:

[1] "Inside function: 50"

[1] "Outside function: 100"

🟢Explanation:

● Even though we changed x inside the function, it only changed locally.

● The global x remained the same.

4. Forcing a Function to Modify a Global Variable

If you really need to modify a global variable from inside a function, use the
super-assignment operator <<-.

Example:

count <- 0

increment <- function() {

count <<- count + 1

}

increment()

print(count)

Output:

[1] 1

🟢Explanation:

● The <<- operator updates the global variable instead of creating a new local one.

● Each time the function runs, count is updated globally, so calling increment() once changes count from 0 to 1.
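
A detail worth knowing: <<- looks for the variable in the enclosing environments first and only falls back to the global environment when none of them has it. The sketch below illustrates this with a function defined inside another function, and then shows the plain alternative of returning a value. The names make_counter, counter, increment_value, and total are hypothetical helpers, not part of the notes above.

```r
# make_counter() is a hypothetical helper used only for illustration
make_counter <- function() {
  count <- 0                 # local to this call of make_counter()
  function() {
    count <<- count + 1      # updates count in the enclosing function, not globally
    count
  }
}

counter <- make_counter()
counter()   # 1
counter()   # 2

# A plain alternative: return the new value and reassign it yourself
increment_value <- function(n) n + 1
total <- 0
total <- increment_value(total)
total       # 1
```

Returning the value keeps the function free of side effects, which fits the advice to prefer local variables.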

Quick Summary

Type Defined In Accessible From Lifetime

Local Variable Inside a function Only inside that function Until function ends

Global Variable Outside all functions Everywhere Until program ends

🧠Tip for Exams:

● Always prefer local variables to avoid unwanted side effects in large programs.

● Use <<- only when you really need to modify global data from a function.

✅Quick Recall Box

● Local → Inside function → Temporary

● Global → Outside function → Permanent

● Use <<- to modify global variables inside a function

UNIT 2: DATA HANDLING IN R
1. Data structures in R

○ Vectors

○ Matrices

○ Data frames

○ Lists

2. Importing and exporting data

3. Importing data from Excel

4. Accessing databases

5. Saving data in R

6. Loading R data objects

7. Writing to files

8. Data cleaning and preparation

○ Handling missing values

○ Filtering data

Data Structures in R

R provides a rich set of data structures to store and organize data efficiently. These structures are the building blocks for all data manipulation and analysis tasks in R.

Let’s explore each one step by step.

1. Vectors

A vector is the simplest and most common data structure in R.
It stores elements of the same data type: numeric, character, or logical.

Creating Vectors

You can create a vector using the c() (combine) function.
Example:

num_vector <- c(10, 20, 30, 40)
char_vector <- c("R", "is", "fun")
log_vector <- c(TRUE, FALSE, TRUE)

Explanation:

● c() combines values into a single sequence.

● Each element in a vector must have thesame type.

Accessing Vector Elements

Use square brackets [] with the element’s position (index).

num_vector[2]
num_vector[1:3]

Output:

[1] 20
[1] 10 20 30

Modifying Vectors

num_vector[2] <- 25

This replaces the second element with 25.

Vector Operations

R performselement-wise operationsautomatically.

a <- c(1, 2, 3)
b <- c(4, 5, 6)
a + b
a * b

Output:

[1] 5 7 9
[1] 4 10 18

Useful Functions for Vectors

Function     Description

length(x)    Number of elements

sum(x)       Sum of all elements

mean(x)      Average value

sort(x)      Sorts elements

rev(x)       Reverses order

2. Matrices

A matrix is a two-dimensional data structure where all elements are of the same type (numeric, character, or logical).

Creating a Matrix

Use the matrix() function.
Syntax:

matrix(data, nrow, ncol, byrow = FALSE)

Example:

mat <- matrix(1:9, nrow = 3, ncol = 3)
print(mat)

Output:

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Accessing Matrix Elements

mat[1, 2] # element in 1st row, 2nd column
mat[, 2] # all elements in 2nd column
mat[2, ] # all elements in 2nd row

Matrix Operations

A <- matrix(1:4, 2, 2)
B <- matrix(5:8, 2, 2)
A + B
A * B # element-wise multiplication
A %*% B # matrix multiplication

Naming Rows and Columns

rownames(mat) <- c("Row1", "Row2", "Row3")
colnames(mat) <- c("Col1", "Col2", "Col3")

3. Data Frames

A data frame is one of the most important structures in R.
It stores tabular data, similar to an Excel sheet, and can hold different data types in each column.

Creating a Data Frame

student_data <- data.frame(
  Name = c("Ali", "Sara", "John"),
  Age = c(21, 22, 20),
  Score = c(85, 90, 88)
)
print(student_data)

Output:

  Name Age Score
1  Ali  21    85
2 Sara  22    90
3 John  20    88

Accessing Data Frame Elements

student_data$Name
student_data[1, 2]
student_data[ , "Score"]

Adding and Removing Columns

student_data$Grade <- c("A", "A+", "B")
student_data$Score <- NULL # removes column

Useful Data Frame Functions

Function     Purpose

str(df)      Structure of data frame

nrow(df)     Number of rows

ncol(df)     Number of columns

names(df)    Column names

head(df)     First few rows

4. Lists

A list can hold elements of different types: numbers, strings, vectors, even other lists or data frames.

Creating a List
my_list <- list(
name = "R Programming",
numbers = c(1, 2, 3),
matrix_data = matrix(1:4, 2, 2)
)

Accessing List Elements

my_list$name
my_list[[2]]
my_list$matrix_data[1, 2]

Adding or Removing List Elements

my_list$new_item <- "Extra data"
my_list$numbers <- NULL

Why Lists Are Important

Lists are used to store complex results, such as outputs from models or multiple datasets in one object.

Quick Summary
Data Structure    Type     Stores                  Example

Vector            1D       Same data type          c(1, 2, 3)

Matrix            2D       Same data type          matrix(1:4, 2, 2)

Data Frame        2D       Different types         data.frame(Name, Age)

List              Mixed    Different structures    list(name, vector, matrix)

🧠Tip:

● Use vectors for simple sequences, data frames for datasets, and lists for flexible combinations of objects.

Vectors

A vector in R is the simplest data structure that holds elements of the same data type (numeric, character, logical, etc.). Vectors are used to store a sequence of data elements in a single variable.

Creating Vectors

Vectors can be created using the c() function (combine function).
Example:

x <- c(1, 2, 3, 4, 5)
y <- c("apple", "banana", "cherry")
z <- c(TRUE, FALSE, TRUE, TRUE)

Accessing Vector Elements

Elements of a vector can be accessed using square brackets [ ] with index positions (indexing in R starts from 1).
Example:

x[1] # First element
x[3] # Third element
x[2:4] # Elements from 2nd to 4th

Vector Operations

Arithmetic operations are performed element-wise.
Example:

a <- c(10, 20, 30)
b <- c(1, 2, 3)
a + b # Addition
a - b # Subtraction
a * b # Multiplication
a / b # Division

Vector Functions

Some common vector functions in R are:

Function     Description                    Example

length(x)    Returns number of elements     length(a)

sum(x)       Returns sum of all elements    sum(a)

mean(x)      Returns average value          mean(a)

max(x)       Returns maximum value          max(a)

min(x)       Returns minimum value          min(a)

sort(x)      Sorts the vector               sort(a)

Combining Vectors

Vectors can be combined using c().
Example:

a <- c(1, 2, 3)
b <- c(4, 5, 6)
combined <- c(a, b) # a separate name avoids reusing c, the name of the combine function
print(combined)

Logical Operations on Vectors

You can compare elements of two vectors directly.
Example:

a <- c(1, 2, 3)
b <- c(3, 2, 1)
a == b # Element-wise comparison
a > b
a < b

Vector Recycling

If two vectors of different lengths are operated on, R recycles the shorter vector.
Example:

a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b # b is recycled as (10, 20, 10, 20)
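
The sketch below walks through recycling in three cases: a shorter vector whose length divides the longer one, a single value, and lengths that do not divide evenly. In the last case R still recycles but gives a warning, which is silenced here so the example runs cleanly. The name uneven is illustrative.

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)
a + b                 # 11 22 13 24: b is reused as 10, 20, 10, 20

# A single value is recycled across the whole vector
a * 2                 # 2 4 6 8

# Lengths 3 and 2 do not divide evenly: R still recycles, but warns
uneven <- suppressWarnings(c(1, 2, 3) + c(10, 20))
uneven                # 11 22 13
```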

Type Coercion

If a vector has mixed data types, R automatically converts them to the same type following
this hierarchy:
Logical → Integer → Double → Character
Example:

v <- c(1, TRUE, "R")
print(v) # All elements converted to character
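
Each step of the coercion hierarchy can be checked with typeof(). The sketch below also shows a common use of coercion: logical values count as 1 and 0 in arithmetic. The vector passed is made up for illustration.

```r
typeof(c(TRUE, 2L))     # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))      # "double"
typeof(c(3.5, "R"))     # "character"

# Logical values count as 1 and 0 in arithmetic
passed <- c(TRUE, FALSE, TRUE, TRUE)
sum(passed)             # 3: number of TRUE values
mean(passed)            # 0.75: share of TRUE values
```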

Vectors are fundamental in R and form the building blocks for more complex data structures like matrices and data frames.

Matrices

A matrix in R is a two-dimensional data structure that contains elements of the same data type (numeric, character, or logical). It’s essentially a collection of vectors arranged in rows and columns.

Creating a Matrix

You can create a matrix using the matrix() function.
Syntax:

matrix(data, nrow, ncol, byrow = FALSE, dimnames = NULL)

● data → the input vector of elements

● nrow → number of rows

● ncol → number of columns

● byrow → if TRUE, fills the matrix by rows; if FALSE (the default), fills by columns

● dimnames → optional names for rows and columns

Example:

m1 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(m1)

Output:

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

To fill by rows:

m2 <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE)
print(m2)

Accessing Matrix Elements

Matrix elements are accessed using row and column indices.
Example:

m1[1, 2] # Element in 1st row, 2nd column
m1[ ,2] # All elements in 2nd column
m1[2, ] # All elements in 2nd row

Matrix Operations

R allows arithmetic operations on matrices.
Example:

A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)

A + B # Addition
A - B # Subtraction
A * B # Element-wise multiplication
A / B # Element-wise division
A %*% B # Matrix multiplication
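
The difference between * and %*% is easiest to see side by side. The sketch below reuses A and B from the example above and shows both results; the names elementwise and product are illustrative.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)   # columns: (1, 2) and (3, 4)
B <- matrix(c(5, 6, 7, 8), nrow = 2)   # columns: (5, 6) and (7, 8)

elementwise <- A * B      # multiplies matching positions
product     <- A %*% B    # row-by-column matrix product

elementwise
#      [,1] [,2]
# [1,]    5   21
# [2,]   12   32

product
#      [,1] [,2]
# [1,]   23   31
# [2,]   34   46
```

For example, the top-left entry of the matrix product is 1*5 + 3*6 = 23, while the element-wise result there is simply 1*5 = 5.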

Matrix Functions
Function       Description                Example

t(A)           Transpose of matrix        t(A)

nrow(A)        Number of rows             nrow(A)

ncol(A)        Number of columns          ncol(A)

dim(A)         Dimensions (rows, cols)    dim(A)

rowSums(A)     Sum of each row            rowSums(A)

colSums(A)     Sum of each column         colSums(A)

rowMeans(A)    Mean of each row           rowMeans(A)

colMeans(A)    Mean of each column        colMeans(A)

Combining Matrices

You can combine matrices using:

●
rbind()→ combines by rows

●
cbind()→ combines by columns

Example:

m1 <- matrix(1:6, nrow = 2)
m2 <- matrix(7:12, nrow = 2)
rbind(m1, m2) # Combine by rows
cbind(m1, m2) # Combine by columns

Naming Rows and Columns

You can assign names to matrix rows and columns.
Example:

m <- matrix(1:9, nrow = 3)
rownames(m) <- c("Row1", "Row2", "Row3")
colnames(m) <- c("Col1", "Col2", "Col3")
print(m)

Matrix Indexing with Names

After naming, you can access elements by name.
Example:

m["Row2", "Col3"]

Checking Data Type

All elements in a matrix have the same type:

class(m)
typeof(m)

Matrices are often used in mathematical computations, data transformations, and statistical modeling where uniform data types are required.

Data Frames

A data frame is one of the most commonly used data structures in R. It is similar to a table in
a spreadsheet or a DataFrame in Python’s pandas: it is made up of rows and columns, but unlike
matrices, each column can contain a different data type (numeric, character, logical,
etc.).

You can think of a data frame as a collection of equal-length vectors combined together.

Creating a Data Frame

You can create a data frame using the data.frame() function.

Syntax:

data.frame(column1, column2, column3, ...)

Example:

# Creating a simple data frame
students <- data.frame(
  Name = c("Ali", "Sara", "John"),
  Age = c(22, 21, 23),
  Score = c(88, 95, 79)
)
print(students)

Output:

  Name Age Score
1  Ali  22    88
2 Sara  21    95
3 John  23    79

👉 Here, each column (Name, Age, Score) is a vector, and all have equal lengths.

Accessing Data Frame Elements

You can access elements in several ways:

1. Using the $ operator

students$Name

→ Returns the entire “Name” column.

2. By column index or name

students[, 2] # 2nd column
students["Score"] # Column by name

3. By row and column position

students[1, 3] # Element in 1st row, 3rd column
students[2, ] # Entire 2nd row

Adding and Removing Columns

Add a new column:

students$Grade <- c("B", "A", "C")
print(students)

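Remove a column (shown on a copy, because the examples that follow still use Score):

students_no_score <- students
students_no_score$Score <- NULL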

Adding and Removing Rows

Add a row using rbind():

new_student <- data.frame(Name = "Emma", Age = 22, Score = 91, Grade = "A")
students <- rbind(students, new_student)

Remove a row:

students <- students[-2, ] # Removes 2nd row

Basic Operations on Data Frames

Operation     Description               Example
nrow(df)      Number of rows            nrow(students)
ncol(df)      Number of columns         ncol(students)
dim(df)       Dimensions of data frame  dim(students)
names(df)     Column names              names(students)
str(df)       Structure of data frame   str(students)
summary(df)   Statistical summary       summary(students)
head(df)      First 6 rows              head(students)
tail(df)      Last 6 rows               tail(students)

Changing Column Names

You can rename columns using:

colnames(students) <- c("StudentName", "Age", "Score", "Grade")

Or rename specific columns with:

names(students)[2] <- "StudentAge"

Filtering Data

You can filter rows based on conditions using logical operators.

Example:

students[students$Score > 85, ]

→ Returns all students with scores greater than 85.

Sorting Data

You can sort data frames using the order() function.

Example:

students[order(students$Score, decreasing = TRUE), ]

→ Sorts the data in descending order of scores.
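
order() also accepts several keys, which helps when the first key has ties. A minimal sketch, assuming a small made-up students table (the extra "Emma" row is only for illustration):

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "John", "Emma"),
  Age   = c(22, 21, 22, 21),
  Score = c(88, 95, 79, 91)
)

# Sort by Age ascending, then by Score descending within each age;
# the minus sign reverses the order of a numeric column
sorted <- students[order(students$Age, -students$Score), ]
sorted$Name  # "Sara" "Emma" "Ali" "John"
```

The minus-sign trick only works for numeric columns; for text columns, order() takes a decreasing argument instead.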

Merging Data Frames

You can merge two data frames using a common column with merge().

Example:

df1 <- data.frame(ID = c(1, 2, 3), Name = c("Ali", "Sara", "John"))
df2 <- data.frame(ID = c(1, 2, 3), Marks = c(88, 95, 79))
merged_df <- merge(df1, df2, by = "ID")
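
In the example above every ID appears in both tables. When that is not the case, merge() drops unmatched rows by default. The sketch below uses slightly different, made-up data (ID 4 instead of 3 in df2) to show the all.x and all arguments:

```r
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Ali", "Sara", "John"))
df2 <- data.frame(ID = c(1, 2, 4), Marks = c(88, 95, 70))

inner <- merge(df1, df2, by = "ID")                # IDs 1 and 2 only
left  <- merge(df1, df2, by = "ID", all.x = TRUE)  # keeps ID 3, Marks is NA
full  <- merge(df1, df2, by = "ID", all = TRUE)    # IDs 1, 2, 3, 4

nrow(inner)  # 2
nrow(left)   # 3
nrow(full)   # 4
```

Comparing nrow() before and after a merge is a quick way to notice rows that were silently dropped.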

Real-World Use Case

Data frames are used whenever you deal with structured datasets, such as:

● Storing survey results

● Analyzing CSV or Excel data

● Preparing data for visualization or machine learning

In short, data frames are the backbone of data analysis in R.

🧭 Quick Summary

● A data frame can contain columns of different data types.

● Created using data.frame().

● Access using $, indices, or names.

● Rows → rbind(); Columns → cbind() or $.

● Use summary() and str() to understand data quickly.

● Perfect for real-world tabular data analysis.

Lists

A list in R is a flexible data structure that can hold different types of elements: numbers,
strings, vectors, matrices, data frames, or even other lists!
Think of a list as a container that can store different kinds of objects together, unlike
vectors or matrices, which require all elements to be of the same type.

Creating a List

You can create a list using the list() function.

Syntax:

list(item1, item2, item3, ...)

Example:

my_list <- list(
  Name = "Sara",
  Age = 21,
  Scores = c(85, 90, 95),
  Passed = TRUE
)
print(my_list)

Output:

$Name
[1] "Sara"

$Age
[1] 21

$Scores
[1] 85 90 95

$Passed
[1] TRUE

Here, you can see that the list contains different data types: a string, a number, a vector,
and a logical value.

Accessing List Elements

There are multiple ways to access list elements:

1. By name (using $):

my_list$Name

→ Returns "Sara"

2. By index (using [[ ]]):

my_list[[2]]

→ Returns 21 (the 2nd element)

3. By index (using [ ]):

my_list[2]

→ Returns a list containing the 2nd element, not just the value.

(Notice the difference between [ ] and [[ ]]: [ ] returns a sub-list, [[ ]]
returns the actual element.)
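
The [ ] versus [[ ]] distinction matters as soon as you compute with the result. A short sketch using a list like the one above (the variable names sub, elem, and field are only for illustration):

```r
my_list <- list(Name = "Sara", Age = 21, Scores = c(85, 90, 95))

sub  <- my_list[2]    # a list of length 1
elem <- my_list[[2]]  # the number 21 itself

class(sub)   # "list"
class(elem)  # "numeric"

elem + 1     # 22; arithmetic works on the element
# sub + 1 would stop with an error, because sub is still a list

# [[ ]] also accepts a name, which is handy when the name is in a variable
field <- "Scores"
mean(my_list[[field]])  # 90
```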

Modifying List Elements

Change a value:

my_list$Age <- 22

Add a new element:

my_list$City <- "Srinagar"

Remove an element:

my_list$Passed <- NULL

Combining Lists

You can merge lists using the c() function:

list1 <- list(a = 1, b = 2)
list2 <- list(c = 3, d = 4)
combined_list <- c(list1, list2)
print(combined_list)

Output:

$a
[1] 1

$b
[1] 2

$c
[1] 3

$d
[1] 4

Nested Lists

Lists can contain other lists too!

Example:

nested <- list(
  student = list(Name = "John", Age = 22),
  marks = c(88, 79, 93)
)
print(nested)

You can access nested elements like this:

nested$student$Name

→ Returns "John"

Useful List Functions

Function       Description                      Example
length(list)   Number of elements in the list   length(my_list)
names(list)    Get or set names of elements     names(my_list)
str(list)      Structure of the list            str(my_list)
unlist(list)   Converts list into a vector      unlist(my_list)

Unlisting a List

You can flatten a list into a single vector using unlist().

Example:

scores_list <- list(A = 85, B = 90, C = 95)
unlist(scores_list)

Output:

 A  B  C
85 90 95
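
unlist() works cleanly when every element has the same type. With mixed types, the result is coerced to a single type, which can be surprising. A minimal sketch with made-up lists:

```r
person <- list(Name = "Sara", Age = 21, Passed = TRUE)
flat <- unlist(person)

typeof(flat)  # "character": 21 became "21" and TRUE became "TRUE"

# Vector elements are spread out and given numbered names
scores <- list(Math = c(85, 90), Art = 95)
names(unlist(scores))  # "Math1" "Math2" "Art"
```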

Real-World Analogy

Imagine a list as a backpack where you can put different things:

● Books (vectors)

● A lunch box (data frame)

● A water bottle (string)

● A notebook with notes (another list)

All different, yet stored together in one place — that’s how lists work in R!

🧭 Quick Summary

● Lists can store multiple data types, including other lists.

● Created using the list() function.

● Access elements using $, [ ], or [[ ]].

● unlist() converts a list into a vector.

● Great for storing complex or hierarchical data (e.g., student info, model outputs,
JSON-like data).

Importing and Exporting Data

One of R’s strongest features is its ability to import and export data from a wide variety of
sources: text files, CSVs, Excel sheets, databases, and more. This allows you to bring
external data into R for analysis and then export your results back out for reporting or
sharing.

1. Importing Data

a) Importing CSV Files

CSV (Comma-Separated Values) files are the most common data format used for sharing
datasets.

Function: read.csv()

Syntax:

read.csv(file, header = TRUE, sep = ",", stringsAsFactors = FALSE)

Example:

data <- read.csv("[Link]")
head(data)

Explanation:

● file → path to your CSV file

● header = TRUE → treats first row as column names

● sep = "," → separates columns using commas

● stringsAsFactors = FALSE → keeps text as characters instead of factors

Tip: Use setwd("path") to set your working directory before reading files.

Example:

setwd("C:/Users/Faisal/Documents/R_Projects")
data <- read.csv("[Link]")

b) Importing Text Files

If your data is stored in a plain text file (e.g., .txt), you can use read.table().

Syntax:

read.table(file, header = TRUE, sep = "\t")

Example:

text_data <- read.table("[Link]", header = TRUE, sep = "\t")

c) Importing from URLs

You can even import data directly from the internet!

Example:

url_data <- [Link]("[Link]
head(url_data)

2. Exporting Data

Once you’ve processed or analyzed your data, you can save it back to a file.

a) Exporting to CSV

Function: write.csv()

Syntax:

write.csv(data, file, row.names = FALSE)

Example:

write.csv(data, "output_students.csv", row.names = FALSE)

Explanation:

● data → the data frame to be saved

● file → name or path of the output file

● row.names = FALSE → prevents row numbers from being written as an extra
column

b) Exporting to Text File

Function: write.table()

Example:

write.table(data, "[Link]", sep = "\t", row.names = FALSE)
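
A quick way to convince yourself that an export worked is to read the file straight back in. The sketch below writes to a temporary file, so the path and the small students table are stand-ins, not files from these notes:

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "John"),
  Score = c(88, 95, 79)
)

# tempfile() gives a throwaway path, so nothing is left in the project folder
path <- tempfile(fileext = ".csv")
write.csv(students, path, row.names = FALSE)

restored <- read.csv(path, stringsAsFactors = FALSE)

# Same column names and values after the round trip
identical(names(restored), names(students))
all(restored$Score == students$Score)

unlink(path)  # delete the temporary file
```

If row.names = FALSE is left out, the file gains an extra unnamed first column, and the column names no longer match after reading it back.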

3. Viewing and Understanding Imported Data

After importing, it’s a good idea to explore your data before analysis.

Common Functions:

Function    Description            Example
head()      View first 6 rows      head(data)
tail()      View last 6 rows       tail(data)
str()       Structure of dataset   str(data)
summary()   Summary statistics     summary(data)
names()     Column names           names(data)
dim()       Dimensions             dim(data)

4. Handling File Paths

If your files are not in your working directory, provide the full path:

data <- read.csv("C:/Users/Faisal/Desktop/[Link]")

To check your current working directory:

getwd()

To change it:

setwd("C:/Users/Faisal/Documents")
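
Hard-coded paths break easily when a project moves to another computer. As a hedged sketch, base R's file.path() and file.exists() can make path handling a little more robust; the folder and the file name example_students.csv below are placeholders for illustration only:

```r
# file.path() joins folder and file names with the right separator
folder <- tempdir()  # stand-in for a real project folder
path   <- file.path(folder, "example_students.csv")  # hypothetical file name

file.exists(path)  # FALSE until the file has been written

write.csv(data.frame(ID = 1:2), path, row.names = FALSE)
file.exists(path)  # TRUE

# Guard a read so a wrong path gives a clear message
if (file.exists(path)) {
  data <- read.csv(path)
} else {
  stop("File not found: ", path)
}

unlink(path)  # clean up the temporary file
```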

5. Reading Other Formats

R can also handle other common formats with the right packages:

File Type       Package    Function
Excel (.xlsx)   readxl     read_excel()
SPSS (.sav)     haven      read_sav()
JSON            jsonlite   fromJSON()
XML             xml2       read_xml()

Example (Excel):

library(readxl)
data <- read_excel("students_data.xlsx")

🧭 Quick Summary

✅ Importing Data → read.csv(), read.table(), read_excel()
✅ Exporting Data → write.csv(), write.table()
✅ Check Data → head(), str(), summary()
✅ Working Directory → getwd(), setwd()
✅ File Formats Supported → CSV, TXT, Excel, JSON, XML, Databases

Real-World Example:
In a data analysis project, you might:

1. Import raw survey results from a .csv file

2. Clean and transform the data

3. Export the final processed data for visualization or reporting

Importing Data from Excel

While CSV files are the most common way to handle tabular data, Excel files (.xls or
.xlsx) are equally popular, especially in business, research, and academic settings. Base R
doesn’t read Excel files natively, but with the help of a few packages, it becomes very easy.

1. Using the readxl Package

The readxl package (developed by RStudio) is the most commonly used for importing
Excel data. It works fast, doesn’t require Excel to be installed, and supports both
.xls and .xlsx files.

Installing and Loading the Package

You need to install it once and then load it in each R session.

install.packages("readxl")
library(readxl)

Reading Excel Files

Use the read_excel() function.

Syntax:

read_excel(path, sheet = 1, range = NULL, col_names = TRUE)

Example:

library(readxl)
students <- read_excel("students_data.xlsx")
head(students)

Explanation:

● path → file path of your Excel sheet

● sheet → specify sheet name or index (default is first sheet)

● range → optional cell range like "A1:D10"

● col_names → if TRUE, first row is treated as column headers

Example with Sheet Name

If your Excel workbook has multiple sheets:

students_scores <- read_excel("students_data.xlsx", sheet = "Scores")

You can even list all sheets using:

excel_sheets("students_data.xlsx")

2. Reading Specific Ranges

If you only want a specific part of the data:

marks <- read_excel("students_data.xlsx", range = "A1:C10")

→ Reads data from columns A to C and rows 1 to 10.

3. Viewing Imported Data

After importing, you can check what’s inside using:

head(students)
str(students)
summary(students)

4. Using the openxlsx Package

Another popular package is openxlsx, which can both read and write Excel files without
requiring external dependencies such as Java.

Install and Load

install.packages("openxlsx")
library(openxlsx)

Read Excel File

students <- read.xlsx("students_data.xlsx", sheet = 1)

Write Data Back to Excel

write.xlsx(students, "new_students.xlsx")

5. Handling Excel File Paths

If your Excel file isn’t in your current working directory:

students <- read_excel("C:/Users/Faisal/Documents/R_Projects/students_data.xlsx")

To check or set your working directory:

getwd()
setwd("C:/Users/Faisal/Documents/R_Projects")

6. Common Issues & Tips

● ❗ File not found? Check file path and working directory.

● ⚠️ Column types mismatched? Use the col_types argument in read_excel().

● 🧾 Multiple sheets? Loop through them using lapply() or manually specify.

Example (read all sheets):

sheets <- excel_sheets("students_data.xlsx")
all_data <- lapply(sheets, read_excel, path = "students_data.xlsx")

7. Exporting Data to Excel

You can save your processed data back to an Excel file using:

library(openxlsx)
write.xlsx(students, "output_students.xlsx")

🧭 Quick Summary
Function         Package    Purpose
read_excel()     readxl     Read Excel files (.xls, .xlsx)
excel_sheets()   readxl     List all sheet names
read.xlsx()      openxlsx   Read Excel data
write.xlsx()     openxlsx   Write data to Excel file

✅Key Takeaways

● Use readxl for fast Excel reading.

● Use openxlsx if you need both read/write capabilities.

● Always check data structure after importing using str() or head().

● Perfect for handling data from Excel-based reports, financial sheets, and surveys.

Accessing Databases

When working with large datasets, it’s often not practical to store all your data in files like
CSV or Excel. Instead, data is usually stored in databases such as MySQL, PostgreSQL, or
SQLite.
R can connect to these databases, run SQL queries, and import or export data directly,
allowing smooth integration between R and database systems.

1. Why Connect R to a Database?

● To handle large datasets efficiently

● To query specific data instead of loading the whole dataset

● To update, insert, or delete records directly from R

● To perform statistical analysis or visualization on live data

2. Database Connection in R

R provides multiple packages for connecting to databases. The most common are:

Database            Package       Function
MySQL               RMySQL        dbConnect()
PostgreSQL          RPostgreSQL   dbConnect()
SQLite              RSQLite       dbConnect()
General interface   DBI           dbConnect()

The DBI package provides a common interface for working with any database
package.

3. Installing Required Packages

You need to install and load the required packages depending on your database.

install.packages("DBI")
install.packages("RSQLite") # for SQLite
install.packages("RMySQL") # for MySQL
library(DBI)

4. Connecting to a Database

Let’s see examples for the most common databases.

a) SQLite Database

SQLite is lightweight and perfect for practice or small projects.

Example:

library(DBI)

# Connect to a SQLite database file
con <- dbConnect(RSQLite::SQLite(), "student_database.sqlite")

# List all tables in the database
dbListTables(con)

If the file doesn’t exist, R will create it automatically.

b) MySQL Database

You can also connect to remote or local MySQL servers.

Example:

library(DBI)

con <- dbConnect(
  RMySQL::MySQL(),
  dbname = "university",
  host = "localhost",
  port = 3306,
  user = "root",
  password = "your_password"
)

5. Running SQL Queries from R

Once connected, you can run SQL queries directly using R functions.

a) Reading Data

data <- dbGetQuery(con, "SELECT * FROM students WHERE marks > 80;")
head(data)

b) Writing Data

dbWriteTable(con, "new_students", data)

c) Listing All Tables

dbListTables(con)

d) Reading a Whole Table

students <- dbReadTable(con, "students")

e) Removing a Table

dbRemoveTable(con, "old_data")

6. Disconnecting from Database

Always close the connection after use:

dbDisconnect(con)

7. Example Workflow

Here’s a full workflow connecting R to a SQLite database:

library(DBI)
library(RSQLite)

# Connect
con <- dbConnect(RSQLite::SQLite(), "[Link]")

# Create a table
data <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 90, 78))
dbWriteTable(con, "Students", data)

# Query the table
result <- dbGetQuery(con, "SELECT * FROM Students WHERE Marks > 80")
print(result)

# Disconnect
dbDisconnect(con)

Explanation:

1. Connects to a local SQLite database

2. Creates a new table “Students”

3. Retrieves records where marks > 80

4. Closes the connection
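
The same workflow can be tried without creating any file. The sketch below assumes the DBI and RSQLite packages are installed; it uses SQLite's special ":memory:" database and passes the threshold through the params argument instead of pasting it into the SQL text. The variable names marks and cutoff are only for illustration.

```r
library(DBI)

# ":memory:" creates a temporary database that disappears on disconnect
con <- dbConnect(RSQLite::SQLite(), ":memory:")

marks <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 90, 78))
dbWriteTable(con, "Students", marks, overwrite = TRUE)

# The ? placeholder is filled from params, so the threshold is never
# pasted into the SQL text
cutoff <- 80
result <- dbGetQuery(
  con,
  "SELECT Name, Marks FROM Students WHERE Marks > ? ORDER BY Marks DESC",
  params = list(cutoff)
)
result$Name  # "Sara" "Ali"

dbDisconnect(con)
```

Using overwrite = TRUE makes the script safe to re-run; without it, dbWriteTable() stops if the table already exists.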

8. Real-World Applications

● Universities store student records in MySQL or PostgreSQL, so you can directly fetch
data for analysis.

● Companies analyze sales or customer data stored in corporate databases.

● Research projects use databases to store large datasets for reproducibility.

🧭 Quick Summary
Function          Purpose
dbConnect()       Connect to a database
dbListTables()    List all tables
dbReadTable()     Read an entire table
dbGetQuery()      Run SQL query
dbWriteTable()    Write data frame into database
dbRemoveTable()   Delete a table
dbDisconnect()    Close the connection

✅ Key Points

● Use DBI + database-specific packages (RSQLite, RMySQL, etc.)

● You can read and write data frames directly as tables.

● Always close connections with dbDisconnect().

● Ideal for large datasets and dynamic data retrieval.

Saving Data in R

When you’re working on a project in R, you often need to save your data so that you can
reuse it later without re-running all your code. R provides multiple ways to store your data,
from saving single variables to entire workspaces.

Let’s explore each method step by step.

1. Why Save Data?

Saving data allows you to:

● Avoid re-running time-consuming operations.

● Reuse data in future sessions.

● Share data with others.

● Keep a record of your analysis or experiments.

2. Saving Individual Objects

You can save specific variables, data frames, or vectors using the save() function.

Syntax:

save(object1, object2, ..., file = "[Link]")

Example:

x <- 10
y <- c(2, 4, 6, 8)
data <- data.frame(Name = c("Ali", "Sara"), Age = c(21, 22))

save(x, y, data, file = "[Link]")

Explanation:
This saves the variables x, y, and data in one file named [Link].
Later, you can load this file to restore those objects.

3. Saving the Entire Workspace

Sometimes you may want to save everything currently in your R session (all variables,
functions, etc.).

Use:

save.image(file = "[Link]")

Explanation:
This command saves the entire workspace to a file.
By default, if you just run save.image(), R saves it as .RData in your current working
directory.

When you start R in that directory, it loads this .RData file automatically (unless workspace
restoring is turned off), restoring your previous session.

4. Saving Data Frames as CSV Files

If you want to store data in a simple text format that can be read by other software (like
Excel), use write.csv().

Example:

students <- data.frame(Name = c("John", "Aisha", "Ravi"), Marks = c(85, 90, 78))

write.csv(students, "[Link]", row.names = FALSE)

Explanation:
This saves the data frame students into a CSV file without row numbers.

Key Parameters:

● row.names = FALSE → avoids saving unnecessary row numbers

● sep → cannot be changed in write.csv() (R ignores it with a warning); use write.csv2() for
semicolons, or write.table() with sep = "\t" for tabs

5. Saving Text Files

You can also save plain text or vector data using write() or write.table().

Example:

numbers <- c(1, 2, 3, 4, 5)

write(numbers, file = "[Link]")

Explanation:
This saves the vector values into a simple text file named [Link].

6. Saving Data in RDS Format

The saveRDS() and readRDS() functions are useful when you want to save a single R
object.

Example:

data <- data.frame(City = c("Srinagar", "Delhi", "Mumbai"), Temp = c(12, 25, 30))

saveRDS(data, "weather_data.rds")

To load it later:

loaded_data <- readRDS("weather_data.rds")

Difference from save():

● save() can store multiple objects.

● saveRDS() is meant for one object only, and you must assign it to a variable when
reloading.
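
The difference is easiest to see side by side. This sketch writes to temporary files (the paths and the scores vector are placeholders) and simulates a fresh session by removing the object first:

```r
scores <- c(85, 90, 95)

rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".rds")

save(scores, file = rdata_path)
saveRDS(scores, rds_path)

rm(scores)  # simulate a fresh session

# load() recreates the object under its original name and
# returns the names it restored
restored_names <- load(rdata_path)
restored_names  # "scores"

# readRDS() returns the value; the caller chooses the name
exam_scores <- readRDS(rds_path)
identical(exam_scores, scores)  # TRUE

unlink(c(rdata_path, rds_path))
```

Because load() reuses the saved names, it can silently overwrite an existing object with the same name; readRDS() avoids that by leaving the naming to you.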

7. Saving Plots or Graphs

If you’ve created a plot, you can save it using functions like:

● png("[Link]")

● pdf("[Link]")

● jpeg("[Link]")

Example:

png("[Link]")

hist(c(2,4,6,8,10))

dev.off()

Explanation:

● png() starts saving the next plot as an image file.

● dev.off() closes the graphics device and finishes writing the file.

8. Real-World Use Case

Imagine you’ve cleaned and processed a large dataset of COVID-19 cases.

Instead of cleaning it again every time, you can save the cleaned data using:

save(cleaned_data, file = "covid_clean.RData")

Next time, you can just load it instantly:

load("covid_clean.RData")

This saves both time and effort.

🧭 Quick Summary

Function        Purpose
save()          Save specific objects
save.image()    Save all objects (workspace)
write.csv()     Save data frame as CSV
write.table()   Save data in tabular text format
saveRDS()       Save a single R object
readRDS()       Read a saved R object
png(), pdf()    Save plots or graphs

✅ Tips

● Always specify the correct file path before saving.

● Prefer .RData or .RDS for R-only projects, and .csv for sharing data with others.

● Use descriptive file names for easy identification.

Loading R Data Objects

After saving your work in R (like datasets, variables, or entire sessions), you’ll eventually
need to load it back to continue your analysis. R provides simple functions to restore saved
data and make it available again in your current session.

Let’s explore how it works step by step.

1. Why Load Data?

When you reopen R or RStudio, your workspace starts empty unless a saved .RData file is
restored. If you want to reuse previously saved data, you must load it.
Loading data helps you:

● Restore variables and data frames quickly

● Continue your analysis from where you left off

● Avoid rerunning data-cleaning or preparation steps

2. Loading .RData or .rda Files

If you used the save() or save.image() function to store your workspace or selected
objects, use the load() function to bring them back.

Syntax:

load("[Link]")

Example:

load("[Link]")

Explanation:
This will restore all the objects (x, y, data, etc.) that were saved inside [Link].
You can now use them directly; there’s no need to assign them to new variables.

3. Loading RDS Files

If you saved a single object using saveRDS(), you must use readRDS() to load it.

Syntax:

object_name <- readRDS("[Link]")

Example:

weather_data <- readRDS("weather_data.rds")

Explanation:
Unlike load(), this function doesn’t automatically create the object in your workspace;
you decide what name to assign it.
This makes it more flexible and safer when dealing with multiple datasets.

4. Loading CSV or Text Files

If you saved data in CSV format, use read.csv() to load it.

Example:

students <- read.csv("[Link]")

Explanation:
This reads the file [Link] and stores it in a data frame named students.
You can then view it using:

head(students)

5. Loading Excel Files

For Excel files (saved as .xlsx or .xls), use the readxl package.

Example:

library(readxl)

marks <- read_excel("student_marks.xlsx")

Explanation:
This reads data directly from Excel sheets.
You can specify sheet names if your Excel file has multiple sheets:

read_excel("student_marks.xlsx", sheet = "Semester1")

6. Loading Text Files

If you saved plain text data (like .txt files), use [Link]() or [Link]().

Example:

numbers <- [Link]("[Link]")

or

data <- [Link]("[Link]")

Explanation:
These functions can handle tab-separated or space-separated text files easily.

7. Loading Complete Workspaces Automatically

When R starts, it can automatically load the previous session if a .RData file is found in the
working directory.

Example:
If you closed R with a file named .RData in your working directory, R will automatically load
it on restart.

8. Real-World Example

Imagine you analyzed customer data last week and saved it as customers_cleaned.RData.
This week, you can simply type:

load("customers_cleaned.RData")

All your cleaned data frames, summary tables, and variables come back — ready for use!

🧭 Quick Summary

File Type        Function              Description
.RData or .rda   load()                Loads all saved R objects
.rds             readRDS()             Loads one saved R object
.csv             read.csv()            Loads data from CSV file
.xlsx            read_excel()          Loads data from Excel file
.txt             [Link]() / [Link]()   Loads data from text files

✅Tips

● Always check your working directory using getwd() before loading files.

● Use file.exists() to confirm if the file exists in the directory.

● Use str(object_name) to inspect loaded data and ensure it’s in the expected
format.

Writing to Files in R

Once you’ve created, cleaned, or analyzed data in R, you often need to export it, maybe
to share it with others, to use it in another software like Excel, or to keep it for future
reference.
This process is called writing data to files, and R provides several convenient functions to
handle it.

Let’s explore them step-by-step.

1. Why Write Data to Files?

Writing data to files allows you to:

● Save your analysis results permanently

● Share output with collaborators

● Transfer data between R and other tools (like Python, Excel, or SQL)

● Keep backups of important data frames

2. Writing Data Frames to CSV Files

The most common way to export data is to use CSV (Comma-Separated Values) format,
because it’s supported by almost every data tool.

Function: write.csv()

Syntax:

write.csv(object, file = "[Link]", row.names = TRUE/FALSE)

Example:

students <- data.frame(
  Name = c("Aisha", "Ravi", "Mehak"),
  Marks = c(88, 92, 79)
)

write.csv(students, "[Link]", row.names = FALSE)

Explanation:

● The data frame students is written to a file called [Link].

● row.names = FALSE prevents R from adding row numbers as an extra column.

Check the file:
You can find it in your working directory (getwd()).

3. Writing Data with Custom Delimiters

If your system or collaborators prefer a different separator (like a semicolon or tab), you can
use write.table().

Syntax:

write.table(object, file = "[Link]", sep = "\t", row.names = FALSE)

Example:

write.table(students, "students_tab.txt", sep = "\t", row.names = FALSE)

Explanation:
This saves the data with tab-separated columns instead of commas.
Perfect when working with text editors or systems that expect tabular text.

4. Writing Data to Excel Files

If you want to directly write data into Excel sheets, use the writexl package.

Example:

library(writexl)

write_xlsx(students, "students_data.xlsx")

Explanation:
This creates an Excel file with one sheet containing your data frame.
You can easily open it in Excel or Google Sheets.

5. Writing Text or Vector Data

If you want to write plain text (like a vector of names, numbers, or results), use the write()
function.

Example:

names <- c("Ali", "Sara", "John")

write(names, file = "[Link]")

Explanation:
This writes each value from the character vector names into a text file, one per line.
(Numeric vectors are written five values per line by default; set ncolumns = 1 to get one per line.)

6. Writing Multiple Data Frames in One File

You can append multiple data frames to the same file using the append = TRUE argument
in write.table().

Example:

df1 <- data.frame(ID = 1:3, Score = c(90, 85, 88))

df2 <- data.frame(ID = 4:6, Score = c(78, 91, 83))

write.table(df1, "[Link]", sep = ",", row.names = FALSE)

write.table(df2, "[Link]", sep = ",", row.names = FALSE, append = TRUE, col.names = FALSE)

Explanation:

● The first command writes the first data frame.

● The second appends the second one without repeating the column names.
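
To check that appending produced one clean table, the file can be read back. A minimal sketch using a temporary path as a stand-in for the real output file:

```r
df1 <- data.frame(ID = 1:3, Score = c(90, 85, 88))
df2 <- data.frame(ID = 4:6, Score = c(78, 91, 83))

path <- tempfile(fileext = ".csv")
write.table(df1, path, sep = ",", row.names = FALSE)
write.table(df2, path, sep = ",", row.names = FALSE,
            append = TRUE, col.names = FALSE)

combined <- read.csv(path)
nrow(combined)  # 6
combined$ID     # 1 2 3 4 5 6

unlink(path)
```

If col.names = FALSE is forgotten on the second call, the header line is written again in the middle of the file and ends up as a data row when the file is read.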

7. Writing Binary R Objects

To save data in a compact format only readable by R, use:

● save() for multiple objects

● saveRDS() for one object

Example:

saveRDS(students, "students_data.rds")

For large data frames, this is typically faster to read and write than CSV, and the file is compressed by default.

8. Writing Output to a Text File

You can also write console output (like printed summaries or results) to a text file using
sink().

Example:

sink("[Link]")

summary(students)

sink()

Explanation:
Everything between the two sink() calls is redirected to [Link] instead of printing to
the console.
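
One risk with sink() is that an error between the two calls leaves output redirected. As an alternative sketch, capture.output() from base R's utils package writes the printed result to a file and restores the console on its own. The temporary path and the small students table are placeholders:

```r
students <- data.frame(Name = c("Aisha", "Ravi", "Mehak"), Marks = c(88, 92, 79))

path <- tempfile(fileext = ".txt")

# capture.output() writes the printed result to the file and
# restores the console automatically, even if an error occurs
capture.output(summary(students$Marks), file = path)

lines <- readLines(path)
lines  # the summary as it would have appeared in the console

unlink(path)
```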

9. Real-World Use Case

Imagine you cleaned a large survey dataset in R and now need to send it to your team who
uses Excel.
You can simply export it as:

[Link](cleaned_survey, "final_survey_data.csv", [Link] = FALSE)

Now it’s easy to share and open anywhere!

🧭 Quick Summary

Function        Purpose
write.csv()     Write data frame to CSV file
write.table()   Write data with custom separators
write_xlsx()    Write data to Excel file
write()         Write simple vectors or text
saveRDS()       Save one R object (R’s own format)
sink()          Redirect console output to a file

✅ Tips

● Use descriptive filenames like "sales_report_2025.csv".

● Always check getwd() to know where your file is saved.

● When sharing with non-R users, prefer .csv or .xlsx.

● For reproducibility, prefer .RDS if only R users will use it.

Data Cleaning and Preparation

Before analyzing or visualizing data, you need to clean and prepare it properly. Real-world
datasets often contain missing values, duplicates, inconsistent formats, or irrelevant
entries, all of which can lead to misleading results if ignored.

In this topic, we’ll explore how to handle missing values and filter data effectively in R.

1. What Is Data Cleaning?

Data cleaning is the process of detecting and correcting errors or inconsistencies in
data to improve its quality.
It ensures that your dataset is:

● Accurate: no wrong or inconsistent values

● Complete: missing data handled properly

● Consistent: same formats for all columns

● Useful: only relevant rows and columns kept

2. Handling Missing Values

Missing values are very common in datasets.
In R, they are represented as NA (Not Available).

2.1 Detecting Missing Values

Use the is.na() function to check for missing data.

Example:

data <- c(10, 20, NA, 15, NA)

is.na(data)

Output:

[1] FALSE FALSE TRUE FALSE TRUE

Explanation:
TRUE indicates missing values at those positions.

To count how many missing values are in your dataset:

sum(is.na(data))

Output:

[1] 2

That means there are two missing entries.

2.2 Removing Missing Values

If you want to remove all missing values:

clean_data <- na.omit(data)

Explanation:
This removes all NA values from the vector or data frame.

Alternatively, you can use:

data[!is.na(data)]

This returns only the non-missing values.
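
On a data frame, na.omit() removes a whole row if any column is missing, which can discard more data than intended. The sketch below uses a small made-up table to compare it with filtering on a single column:

```r
df <- data.frame(
  Name  = c("Ali", "Sara", "John"),
  Marks = c(85, NA, 90),
  Age   = c(21, 22, NA)
)

# na.omit() drops every row that has an NA in any column
nrow(na.omit(df))   # 1

# Keep rows where Marks is present, even if Age is missing
has_marks <- df[!is.na(df$Marks), ]
nrow(has_marks)     # 2

# complete.cases() returns one TRUE/FALSE per row
complete.cases(df)  # TRUE FALSE FALSE
```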

2.3 Replacing Missing Values

Sometimes, deleting missing values isn’t ideal.
You can replace them with a constant value, the mean, median, or mode of the column.

Example:

data <- c(10, 20, NA, 30)

data[is.na(data)] <- mean(data, na.rm = TRUE)

Explanation:

● na.rm = TRUE ignores missing values while calculating the mean.

● Missing values are replaced by that mean value.

Output:

[1] 10 20 20 30

2.4 Handling Missing Values in Data Frames

If your dataset has multiple columns:

df <- data.frame(
  Name = c("Ali", "Sara", "John"),
  Marks = c(85, NA, 90),
  Age = c(21, 22, NA)
)

To find missing values:

colSums(is.na(df))

Output:

 Name Marks   Age
    0     1     1

To fill missing numeric values with the column mean:

df$Marks[is.na(df$Marks)] <- mean(df$Marks, na.rm = TRUE)

df$Age[is.na(df$Age)] <- mean(df$Age, na.rm = TRUE)
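
Repeating the same line for every column gets tedious on wider tables. One possible approach is a small helper that loops over the numeric columns. The function name impute_median is hypothetical, and the median is used here instead of the mean simply because it is less sensitive to outliers:

```r
# Hypothetical helper: fills NAs in every numeric column with that
# column's median; non-numeric columns are returned unchanged
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

df <- data.frame(
  Name  = c("Ali", "Sara", "John", "Emma"),
  Marks = c(85, NA, 90, 70),
  Age   = c(21, 22, NA, 22)
)

filled <- impute_median(df)
filled$Marks  # 85 85 90 70
filled$Age    # 21 22 22 22
```

The helper returns a new data frame and leaves the original untouched, so the raw values stay available for comparison.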

3. Filtering Data

Filtering means selecting only the rows that meet certain conditions.
This helps you focus on the relevant part of your dataset.

3.1 Basic Filtering with subset()

Syntax:

subset(data_frame, condition)

Example:

students <- data.frame(
  Name = c("Aisha", "Ravi", "Mehak", "Ali"),
  Marks = c(85, 90, 75, 60)
)

high_scorers <- subset(students, Marks > 80)

Explanation:
Only students with marks greater than 80 are selected.

3.2 Filtering Using Logical Operators

You can use logical operators like & (AND), | (OR), and ! (NOT).

Example:

subset(students, Marks > 70 & Marks < 90)

Explanation:
This filters rows where marks are strictly between 70 and 90.
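
Filtering and missing values interact in a way that is worth seeing once. The sketch below, with a small made-up table, compares subset() with plain bracket indexing when the condition evaluates to NA:

```r
students <- data.frame(
  Name  = c("Aisha", "Ravi", "Mehak"),
  Marks = c(85, NA, 75)
)

# subset() silently drops rows where the condition is NA
nrow(subset(students, Marks > 80))            # 1

# Bracket indexing keeps such rows as all-NA rows
nrow(students[students$Marks > 80, ])         # 2

# which() keeps only the positions where the condition is TRUE
nrow(students[which(students$Marks > 80), ])  # 1
```

This is one reason to handle missing values before filtering: the row counts then agree regardless of which style is used.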

3.3 Filtering with dplyr Package

The dplyr package provides a cleaner way to filter data using the filter() function.

Example:

library(dplyr)

students <- data.frame(Name = c("Ali", "Sara", "John"), Marks = c(85, 60, 95))

filtered_data <- filter(students, Marks > 80)

Explanation:

● The filter() function is easier to read and use for data analysis workflows.

● It works seamlessly with pipelines (%>%).

Example using pipe:

students %>% filter(Marks >= 85)

4. Real-World Example

Imagine a hospital dataset:

patients <- data.frame(
  Name = c("Aisha", "Ravi", "Mehak", "Ali"),
  Age = c(25, 30, NA, 28),
  Status = c("Recovered", "Recovered", "Sick", "Recovered")
)

● To remove rows with missing data:

patients_clean <- na.omit(patients)

● To keep only “Recovered” patients:

recovered <- subset(patients_clean, Status == "Recovered")

This gives you a clean and filtered dataset ready for analysis.

🧭 Quick Summary

Concept                  Function                     Description
Check missing values     is.na()                      Detects missing values
Remove missing values    na.omit()                    Deletes rows with NAs
Replace missing values   data[is.na(data)] <- value   Fills missing entries
Count missing values     sum(is.na(data))             Counts number of NAs
Filter data (base R)     subset()                     Select rows based on conditions
Filter data (dplyr)      filter()                     Clean, readable syntax for filtering

✅Tips

● Always check missing data before analysis.

● Replacing with mean or median is common for numeric columns.

● Use filtering to remove outliers or irrelevant records.

● Combine cleaning steps in a function for reusability.
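As a sketch of the last tip, the two cleaning steps from the hospital example can be wrapped in one reusable function. The function name clean_patients and its status argument are hypothetical, chosen only for this illustration.

```r
# Hypothetical helper: drop incomplete rows, then keep one status group
clean_patients <- function(df, status = "Recovered") {
  df <- na.omit(df)                         # remove rows containing any NA
  df[df$Status == status, , drop = FALSE]   # keep rows with the requested status
}

patients <- data.frame(
  Name   = c("Aisha", "Ravi", "Mehak", "Ali"),
  Age    = c(25, 30, NA, 28),
  Status = c("Recovered", "Recovered", "Sick", "Recovered")
)

recovered <- clean_patients(patients)
print(recovered)
```

Keeping the steps in a function means the same cleaning rules can be applied to any later data frame with the same columns.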

UNIT 3: DATA MANIPULATION

1. Data manipulation techniques

2. Selecting rows/observations

3. Selecting columns/fields

4. Merging data

5. Relabeling column names

6. Reshaping data

7. Centering, scaling, and normalizing data values

8. Converting variable types

9. Data sorting

10. Data aggregation
Data Manipulation Techniques

Once your data is cleaned and ready, the next important step is manipulating it: organizing, transforming, and restructuring the data so it becomes suitable for analysis or visualization.

In R, data manipulation is one of the most frequently performed tasks, and it is made powerful and easy by packages like dplyr and tidyr from the tidyverse collection.

Let’s understand what data manipulation means and how to perform it efficiently in R.

1. What Is Data Manipulation?

Data manipulation means modifying the structure, format, or values in a dataset to make it easier to work with.
It includes tasks like:

● Selecting or rearranging rows and columns

● Adding or removing data

● Renaming columns

● Sorting, merging, and reshaping data

● Applying mathematical or logical operations to transform values

Basically, it’s the process of preparing your data for analysis.

2. Tools for Data Manipulation in R

You can manipulate data using:

1. Base R functions: built-in commands like subset(), merge(), order(), etc.

2. dplyr package: a modern, human-friendly toolkit designed specifically for data manipulation.

The dplyr package provides functions that make code cleaner, faster, and more readable.

3. The Core dplyr Functions

Here are the most important verbs (functions) in dplyr, often called the “grammar of data manipulation”:

| Function | Purpose |
|----------|---------|
| select() | Choose specific columns |
| filter() | Select rows based on conditions |
| arrange() | Sort rows |
| mutate() | Add or modify columns |
| summarise() | Create summary statistics |
| group_by() | Group data for analysis |

These functions can be used individually or combined using the pipe operator (%>%), which passes the result of one function to the next.

4. Example Dataset

Let’s create a small dataset to work with:

library(dplyr)
students <- data.frame(
Name = c("Ali", "Sara", "Ravi", "Mehak", "John"),
Marks = c(85, 92, 76, 89, 95),
Age = c(20, 21, 22, 20, 23)
)

5. Selecting Specific Columns

Use select() to keep only certain columns.

Example:

select(students, Name, Marks)

Explanation:
Keeps only the Name and Marks columns, excluding others.

6. Filtering Rows

Use filter() to choose rows that meet a condition.

Example:

filter(students, Marks > 85)

Explanation:
Displays only students who scored more than 85 marks.

You can also use logical operators:

filter(students, Marks > 80 & Age < 22)

This filters students with Marks > 80 and Age < 22.

7. Arranging (Sorting) Data

Use arrange() to sort rows based on one or more columns.

Example:

arrange(students, Marks)

Explanation:
Sorts students in ascending order of marks.

To sort in descending order:

arrange(students, desc(Marks))

8. Creating or Modifying Columns

Use mutate() to add new columns or modify existing ones.

Example:

mutate(students, Grade = ifelse(Marks >= 90, "A", "B"))

Explanation:
Adds a new column Grade: if Marks ≥ 90, assigns “A”; otherwise “B”.

9. Summarizing Data

Use summarise() (or summarize()) to compute summary statistics.

Example:

summarise(students, Avg_Marks = mean(Marks))

Output:

  Avg_Marks
1      87.4

Explanation:
Calculates the average marks of all students.

10. Grouping Data

Use group_by() to group data based on one or more columns, often combined with summarise().

Example:

grouped_data <- students %>%
  mutate(Gender = c("M", "F", "M", "F", "M")) %>%
  group_by(Gender) %>%
  summarise(Average = mean(Marks))

Explanation:
This adds a Gender column, groups data by gender, and calculates the average marks for each group.

11. Using the Pipe Operator (%>%)

The pipe makes code easier to read by chaining multiple steps.

Example:

students %>%
filter(Marks > 80) %>%
select(Name, Marks) %>%
arrange(desc(Marks))

Explanation:

● Filters students scoring more than 80

● Selects only their names and marks

● Sorts them in descending order of marks

This is a clean, natural way to write a sequence of operations, almost like reading a sentence.

12. Real-World Example

Imagine you have a sales dataset:

sales <- data.frame(
  Region = c("North", "South", "East", "West", "North"),
  Sales = c(500, 400, 300, 450, 600)
)

You can find the average sales by region:

sales %>%
group_by(Region) %>%
summarise(Average_Sales = mean(Sales))

Output:

# A tibble: 4 × 2
Region Average_Sales
<chr> <dbl>
1 East 300
2 North 550
3 South 400
4 West 450
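For comparison, the same grouped average can be sketched in base R with aggregate(), which needs no extra packages. The result name avg_by_region is an assumption made for this example.

```r
sales <- data.frame(
  Region = c("North", "South", "East", "West", "North"),
  Sales  = c(500, 400, 300, 450, 600)
)

# Formula interface: average Sales within each Region
avg_by_region <- aggregate(Sales ~ Region, data = sales, FUN = mean)
print(avg_by_region)
```

The result is an ordinary data frame with one row per region, so it can be filtered or merged like any other dataset.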

🧭 Quick Summary

| Task | Function | Description |
|------|----------|-------------|
| Select columns | select() | Choose specific variables |
| Filter rows | filter() | Keep only rows meeting conditions |
| Sort data | arrange() | Order rows ascending/descending |
| Add new columns | mutate() | Create or modify variables |
| Summarize values | summarise() | Compute mean, sum, etc. |
| Group data | group_by() | Group by categories |
| Combine steps | %>% | Chain operations smoothly |

✅ Tips

● Load dplyr before use: library(dplyr)

● Use %>% for clean and readable workflows

● Always check results using head() or View()

● Combine group_by() and summarise() for quick insights

Selecting Rows/Observations

Selecting rows or observations means choosing specific records from a dataset that satisfy a certain condition. In R, this can be done using base R techniques or using the dplyr package, which makes row selection more readable and powerful.

1. Why Select Rows?

You often need to select rows to:

● Focus on a subset of data

● Remove unwanted or irrelevant entries

● Analyze data that meets certain conditions

● Prepare smaller datasets for testing or plotting

For example, from a dataset of students, you might want to select only those with marks above 80 or those belonging to a certain age group.

2. Selecting Rows Using Base R

a. Using Row Index Numbers

You can select rows by their position (index).
Example:

students <- data.frame(
  Name = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age = c(20, 21, 22, 20, 23)
)

Select the first three rows:

students[1:3, ]

Explanation:

● 1:3 specifies the range of row indices

● The comma , indicates we want all columns for those rows

b. Using Logical Conditions

Select rows where Marks > 85:

students[students$Marks > 85, ]

Explanation:

● students$Marks > 85 returns TRUE for rows where Marks exceed 85

● Only those rows are displayed

Select students aged less than or equal to 21:

students[students$Age <= 21, ]

c. Combining Conditions

Use logical operators like & (AND), | (OR).
Example:

students[students$Marks > 80 & students$Age < 22, ]

This selects students with Marks > 80 and Age < 22.
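One detail worth knowing with bracket-based filtering is how missing values behave: a condition that evaluates to NA produces an all-NA row in the result. The sketch below uses a made-up three-row data frame to show the effect and one common workaround with which().

```r
students_na <- data.frame(
  Name  = c("Ali", "Sara", "Ravi"),
  Marks = c(85, NA, 76)
)

# The comparison gives TRUE, NA, FALSE; the NA yields an extra all-NA row
with_na_row <- students_na[students_na$Marks > 80, ]

# which() keeps only positions where the condition is TRUE
without_na <- students_na[which(students_na$Marks > 80), ]

print(nrow(with_na_row))
print(nrow(without_na))
```

subset() and dplyr::filter() drop rows where the condition is NA, which is one reason they are often preferred for conditional selection.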

3. Selecting Rows Using subset() Function

The subset() function is simpler and more readable for conditional selection.

Syntax:

subset(dataframe, condition)

Examples:

subset(students, Marks > 85)
subset(students, Age == 20)
subset(students, Marks > 80 & Age < 22)

Advantages:

● Easier to read

● Doesn’t require $ sign for each variable

● Useful for multiple conditions

4. Selecting Rows Using dplyr::filter()

The filter() function from dplyr is the most modern and clean method for selecting rows.

Syntax:

filter(dataframe, condition)

Examples:

library(dplyr)
filter(students, Marks > 85)
filter(students, Age < 22)
filter(students, Marks > 80 & Age < 22)

Explanation:

● Keeps only rows that meet the specified condition

● You can use multiple conditions connected with & or |

Multiple Conditions Example

filter(students, Marks > 80, Age < 22)

Here, both conditions must be true (comma-separated conditions are combined with AND).

If you want to select rows that meet either condition:

filter(students, Marks > 90 | Age < 21)

5. Filtering with Membership (%in%)

If you want to select rows where a variable matches one of several values, use %in%.

Example:

filter(students, Name %in% c("Ali", "Ravi"))

This selects rows where Name is either "Ali" or "Ravi".

6. Filtering Missing Values

To select rows without missing values, use:

filter(students, !is.na(Marks))

This keeps only rows where Marks is not missing.

To select rows with missing values:

filter(students, is.na(Marks))

7. Filtering Rows Using slice()

If you want to select specific rows by position (not condition), use slice() from dplyr.

Example:

slice(students, 1:3)

Selects the first three rows.

You can also select:

slice_head(students, n = 2)    # First 2 rows
slice_tail(students, n = 2)    # Last 2 rows
slice_sample(students, n = 2)  # Random 2 rows

8. Example with Pipes (%>%)

You can combine operations using the pipe operator:

students %>%
filter(Marks > 80) %>%
arrange(desc(Marks))

Explanation:

● Filters students with marks greater than 80

● Sorts them in descending order of marks

🧭 Summary Table

| Method | Function | Description |
|--------|----------|-------------|
| Base R (by index) | data[row_numbers, ] | Select rows by their index |
| Base R (by condition) | data[data$column > value, ] | Filter rows using condition |
| Simpler syntax | subset() | Readable way to filter rows |
| dplyr modern way | filter() | Fast and easy filtering |
| Row by position | slice() | Select rows by position |
| Missing values | is.na() / !is.na() | Handle missing rows |

✅ Tips

● Always check results using head() or View()

● Combine filters with arrange() or select() for refined datasets

● Avoid using == for floating-point comparisons (use tolerance methods instead)
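The floating-point tip can be illustrated with a small sketch. It uses only base R; the variable names and the tolerance 1e-9 are arbitrary choices for this example.

```r
x <- 0.1 + 0.2

# Exact comparison fails because 0.1 + 0.2 is not stored as exactly 0.3
exact_match <- (x == 0.3)

# all.equal() compares within a small tolerance; wrap it in isTRUE()
tolerant_match <- isTRUE(all.equal(x, 0.3))

# A manual tolerance check works the same way
manual_match <- abs(x - 0.3) < 1e-9

print(c(exact = exact_match, all_equal = tolerant_match, manual = manual_match))
```

Inside filter(), the dplyr helper near() serves the same purpose, for example filter(df, near(value, 0.3)) for some hypothetical numeric column value.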

Selecting Columns/Fields

In R, selecting columns (fields) means choosing specific variables from a dataset. This is one of the most common tasks while analyzing data. For instance, you might want to extract only the columns Name and Marks from a large student dataset.

R provides multiple ways to select columns: through base R, the subset() function, or the dplyr package (which is highly preferred for its clean syntax).

1. Selecting Columns Using Base R

a. Using Column Names

You can access a column using the $ operator.
Example:

students <- data.frame(
  Name = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age = c(20, 21, 22, 20, 23)
)

students$Marks

Explanation:

● $ is used to access a specific column.

● students$Marks returns only the “Marks” column as a vector.

b. Using Column Index

You can select columns by their position in the dataset.
Example:

students[, 1]    # Selects the first column (Name)
students[, 2:3]  # Selects 2nd and 3rd columns (Marks and Age)

Explanation:

● , separates rows and columns.

● Leaving the row part blank means “all rows.”

c. Using Column Names in Square Brackets

If you want multiple specific columns by name:

students[, c("Name", "Marks")]

This returns only the columns “Name” and “Marks.”
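A point that often surprises beginners is that selecting a single column with [ , ] returns a plain vector rather than a data frame. The sketch below, using the same students data, shows the drop = FALSE argument that keeps the data frame structure; the result names are illustrative.

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age   = c(20, 21, 22, 20, 23)
)

as_vector  <- students[, 2]                # numeric vector
as_frame   <- students[, 2, drop = FALSE]  # one-column data frame
also_frame <- students["Marks"]            # list-style indexing keeps a data frame

print(class(as_vector))
print(class(as_frame))
print(class(also_frame))
```

Keeping a data frame matters when the result is passed to functions that expect column names, such as merge() or summary tables.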

2. Selecting Columns Using subset()

The subset() function can be used to choose specific columns easily.

Syntax:

subset(dataframe, select = c(column1, column2, ...))

Example:

subset(students, select = c(Name, Marks))

You can also exclude certain columns using a negative sign:

subset(students, select = -Age)

Explanation:

● select = -Age removes the Age column.

● You can exclude multiple columns as select = -c(Marks, Age)

3. Selecting Columns Using dplyr::select()

This is the most popular and clean method for selecting columns.

Syntax:

select(dataframe, columns...)

Example:

library(dplyr)
select(students, Name, Marks)

Explanation:

● Selects only the Name and Marks columns.

● The original dataset remains unchanged unless reassigned.

a. Excluding Columns

You can exclude columns using the - sign.

select(students, -Age)

Removes the Age column and keeps the rest.

b. Selecting Columns by Range

select(students, Name:Marks)

Selects all columns from Name to Marks (based on position).

c. Selecting Columns by Name Pattern

You can select columns based on name patterns using helper functions:

| Function | Description | Example |
|----------|-------------|---------|
| starts_with("A") | Selects columns starting with “A” | select(students, starts_with("A")) |
| ends_with("s") | Selects columns ending with “s” | select(students, ends_with("s")) |
| contains("ar") | Selects columns containing “ar” | select(students, contains("ar")) |
| matches("^[A-Z]") | Selects columns matching regex | select(students, matches("^[A-Z]")) |

d. Using the Pipe Operator (%>%)

students %>%
  select(Name, Marks)

Explanation:

● The pipe %>% passes the students dataframe into the select() function.

● This makes the code more readable and chainable with other operations like filter() or arrange().

4. Selecting Columns by Data Type

You can select columns based on their data type using:

select_if(students, is.numeric)

This selects only numeric columns (like Marks and Age).

To select only character columns:

select_if(students, is.character)
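The same type-based selection can be sketched in base R by testing each column with sapply(). In recent dplyr versions (1.0.0 and later), select(students, where(is.numeric)) is the recommended spelling and select_if() is kept for older code. The stringsAsFactors argument below is set explicitly so the example behaves the same on older R versions.

```r
students <- data.frame(
  Name  = c("Ali", "Sara", "Ravi", "Mehak", "John"),
  Marks = c(85, 92, 76, 89, 95),
  Age   = c(20, 21, 22, 20, 23),
  stringsAsFactors = FALSE
)

# sapply() returns one TRUE/FALSE per column
is_num <- sapply(students, is.numeric)

numeric_cols   <- students[is_num]
character_cols <- students[sapply(students, is.character)]

print(names(numeric_cols))
print(names(character_cols))
```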

5. Reordering Columns

To change the order of columns:

students %>%
select(Marks, Name, Age)

This rearranges the columns in the specified order.

6. Renaming Columns While Selecting

You can rename columns directly inside select():

students %>%
  select(Student_Name = Name, Score = Marks)

Explanation:

● Renames Name to Student_Name and Marks to Score.

🧠 Quick Recap Table

| Method | Function | Description |
|--------|----------|-------------|
| Base R | data[, c()] | Select columns by index or name |
| $ operator | data$col | Access single column |
| subset() | subset(data, select = ...) | Choose or exclude columns easily |
| dplyr::select() | select(data, col1, col2) | Clean and readable method |
| Exclude columns | select(data, -colname) | Remove specific columns |
| Range select | select(data, col1:col3) | Select column range |
| Pattern select | starts_with(), contains() | Match by name pattern |
| Type-based select | select_if() | Select by data type |

✅ Tips

● Use View() to verify selected columns quickly.

● Combine select() with filter() for focused sub-datasets.

● Prefer dplyr::select() for readability and cleaner syntax.

Merging Data

In R, merging means combining two or more datasets based on a common column (key), similar to joins in SQL. This is essential when data is stored in multiple sources and needs to be brought together for analysis.

1. Why Merge Data?

Imagine you have two tables:

● One with student names and their marks

● Another with student names and their ages

To analyze both marks and ages together, you must merge them into a single dataset using the Name column as the key.

2. Types of Merging in R

R supports several ways to merge data:

1. Using the merge() function (Base R)

2. Using dplyr join functions, like inner_join, left_join, etc.

3. Merging Using Base R merge() Function

Syntax:

merge(x, y, by, all.x = FALSE, all.y = FALSE)

Parameters:

● x, y → data frames to merge

● by → column(s) to merge on (common key)

● all.x → if TRUE, keeps all rows from x (Left Join)

● all.y → if TRUE, keeps all rows from y (Right Join)

Example:

data1 <- data.frame(
  Name = c("Ali", "Sara", "Ravi", "Mehak"),
  Marks = c(85, 92, 76, 89)
)

data2 <- data.frame(
  Name = c("Ali", "Sara", "John", "Mehak"),
  Age = c(20, 21, 23, 20)
)

a. Inner Join (default)

Keeps only rows present in both datasets.

merge(data1, data2, by = "Name")

Result: Only Ali, Sara, and Mehak appear because they exist in both data frames.

b. Left Join

Keeps all rows from data1 and adds matching rows from data2.

merge(data1, data2, by = "Name", all.x = TRUE)

Result:

● All students from data1 remain.

● If a match isn’t found in data2, missing values (NA) are added.

c. Right Join

Keeps all rows from data2.

merge(data1, data2, by = "Name", all.y = TRUE)

Result:

● All students from data2 remain.

● Rows in data2 with no match in data1 get NA in the columns that come from data1.

d. Full Join

Keeps all rows from both datasets.

merge(data1, data2, by = "Name", all = TRUE)

Result:
Every name from both tables is included; unmatched rows get NA.

e. Merging on Different Column Names

If the key column has different names in each dataset (here, assuming it is called Student in the first data frame and Name in the second):

merge(data1, data2, by.x = "Student", by.y = "Name")
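Since data1 as defined earlier has no Student column, here is a self-contained sketch of the by.x / by.y case. The data frames roster and ages are hypothetical and exist only for this illustration.

```r
# Hypothetical tables whose key columns have different names
roster <- data.frame(
  Student = c("Ali", "Sara", "Ravi"),
  Marks   = c(85, 92, 76)
)
ages <- data.frame(
  Name = c("Ali", "Sara", "John"),
  Age  = c(20, 21, 23)
)

# Match roster$Student against ages$Name (inner join by default)
merged <- merge(roster, ages, by.x = "Student", by.y = "Name")
print(merged)
```

The key column in the result takes its name from the first data frame (Student here), and only Ali and Sara appear because they are present in both tables.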

4. Merging Using dplyr Joins

The dplyr package offers more readable and powerful merging functions that are widely used in data science.

Common Join Functions:

| Function | Description | Similar to |
|----------|-------------|------------|
| inner_join(x, y, by) | Only matching rows | Inner join |
| left_join(x, y, by) | All rows from x | Left join |
| right_join(x, y, by) | All rows from y | Right join |
| full_join(x, y, by) | All rows from both | Full outer join |
| semi_join(x, y, by) | Rows in x that have a match in y | Filtered inner join |
| anti_join(x, y, by) | Rows in x with no match in y | Opposite of semi join |

Example using dplyr:

library(dplyr)

data1 <- data.frame(Name = c("Ali", "Sara", "Ravi", "Mehak"),
                    Marks = c(85, 92, 76, 89))

data2 <- data.frame(Name = c("Ali", "Sara", "John", "Mehak"),
                    Age = c(20, 21, 23, 20))

Inner Join

inner_join(data1, data2, by = "Name")

Left Join

left_join(data1, data2, by = "Name")

Full Join

full_join(data1, data2, by = "Name")

Anti Join (shows students in data1 but not in data2):

anti_join(data1, data2, by = "Name")

Example Output (Inner Join):

| Name | Marks | Age |
|------|-------|-----|
| Ali | 85 | 20 |
| Sara | 92 | 21 |
| Mehak | 89 | 20 |

5. Merging on Multiple Columns

You can join based on multiple keys, provided both data frames contain every key column (for instance, if both had Name and Age columns):

merge(data1, data2, by = c("Name", "Age"))

or using dplyr:

inner_join(data1, data2, by = c("Name", "Age"))

6. Concatenating Data Vertically (Row-wise Merge)

When datasets have the same columns and you just want to stack them:

● Use rbind() in base R

● Use bind_rows() in dplyr

Example:

df1 <- data.frame(Name = c("Ali", "Sara"), Marks = c(85, 92))

df2 <- data.frame(Name = c("Ravi", "Mehak"), Marks = c(76, 89))

combined <- rbind(df1, df2)

Result:
All four students in one dataset.

7. Combining Columns (Column-wise Merge)

If both data frames have the same number of rows:

cbind(data1, data2)

This merges them side-by-side by row position, without matching keys.
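Because cbind() pairs rows purely by position, it can silently misalign records. The sketch below reuses the data1 and data2 tables from the merge examples, where the third row holds different students; the name side_by_side is illustrative.

```r
data1 <- data.frame(Name = c("Ali", "Sara", "Ravi", "Mehak"),
                    Marks = c(85, 92, 76, 89))
data2 <- data.frame(Name = c("Ali", "Sara", "John", "Mehak"),
                    Age = c(20, 21, 23, 20))

side_by_side <- cbind(data1, data2)

# Both Name columns are kept, and row 3 pairs Ravi's marks with John's age
print(names(side_by_side))
print(side_by_side[3, ])
```

When rows need to be matched on a key such as Name, merge() or a dplyr join is the safer choice; cbind() suits tables already known to be in the same row order.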

🧠 Quick Recap Table

| Type | Base R Syntax | dplyr Syntax | Description |
|------|---------------|--------------|-------------|
| Inner Join | merge(x, y, by) | inner_join(x, y) | Only matching rows |
| Left Join | merge(x, y, by, all.x=TRUE) | left_join(x, y) | All from left table |
| Right Join | merge(x, y, by, all.y=TRUE) | right_join(x, y) | All from right table |
| Full Join | merge(x, y, by, all=TRUE) | full_join(x, y) | All from both |
| Row Bind | rbind(x, y) | bind_rows(x, y) | Stack vertically |
| Column Bind | cbind(x, y) | bind_cols(x, y) | Combine horizontally |

✅ Tips

● Always verify merged results using View() or head()

● Ensure column names and types match before merging

● Use anti_join() to find unmatched rows (useful for data cleaning)

● Prefer dplyr joins for clean, readable, and efficient merging

Relabeling Column Names

Relabeling column names in R means renaming one or more columns of a dataset to make them more meaningful, readable, or standardized. Clean and consistent column names are crucial for easy coding, especially when working on large projects or with multiple datasets.

1. Why Relabel Column Names?

Some reasons for renaming columns include:

● Making names shorter or easier to type

● Replacing spaces or special characters

● Giving clearer, descriptive labels (e.g., changing V1 to Student_Name)

● Standardizing column names before merging data

● Standardizing column names before merging data

2. Viewing Current Column Names

Before renaming, you can check existing column names using:

colnames(dataframe)

Example:

students <- data.frame(
  N = c("Ali", "Sara", "Ravi"),
  M = c(85, 92, 76),
  A = c(20, 21, 22)
)

colnames(students)

Output:

[1] "N" "M" "A"

3. Renaming Columns in Base R

a. Rename All Columns

If you want to rename all columns at once:

colnames(students) <- c("Name", "Marks", "Age")

Now the dataset looks like:

| Name | Marks | Age |
|------|-------|-----|
| Ali | 85 | 20 |
| Sara | 92 | 21 |
| Ravi | 76 | 22 |

b. Rename Specific Columns

You can rename individual columns by referring to their index:

colnames(students)[2] <- "Score"

This renames only the second column (Marks → Score).

c. Rename Using names() Function

The names() function works the same way and lets you rename a column by its current name. For example, starting from the original data frame with columns N, M, and A:

names(students)[names(students) == "A"] <- "Age"

Explanation:

● Finds the column named “A”

● Renames it to “Age”

4. Renaming Columns Using dplyr::rename()

The rename() function from the dplyr package makes renaming easier and more readable.

Syntax:

rename(dataframe, new_name = old_name)

Example:

library(dplyr)

students <- data.frame(
  Name = c("Ali", "Sara", "Ravi"),
  Marks = c(85, 92, 76),
  Age = c(20, 21, 22)
)

students <- rename(students, Score = Marks)

Explanation:

● The old column name (Marks) comes after the =.

● The new name (Score) comes before the =.

● The dataset now has columns: Name, Score, and Age.

Renaming Multiple Columns

students <- rename(students, Student_Name = Name, Years = Age)

Result:

| Student_Name | Score | Years |
|--------------|-------|-------|
| Ali | 85 | 20 |
| Sara | 92 | 21 |
| Ravi | 76 | 22 |

5. Using setNames() Function

This function returns a renamed copy of the dataset and leaves the original unchanged.

Syntax:

new_df <- setNames(old_df, c("new1", "new2", "new3"))

Example:

new_students <- setNames(students, c("Student", "Marks", "Age"))

Explanation:

● Assigns the new names to the columns in order.

● Creates a new data frame with renamed columns.

6. Using rename() with Pipes

You can use pipes to rename columns inline:

students %>%
  rename(Score = Marks, Years = Age)

This is preferred for cleaner, chainable code.

7. Changing Column Names to Lowercase or Uppercase

You can standardize names easily:

colnames(students) <- tolower(colnames(students))

colnames(students) <- toupper(colnames(students))

Example (using toupper()):

Before: Name, Marks, Age

After: NAME, MARKS, AGE

8. Replacing Spaces or Special Characters in Column Names

If your dataset has spaces in column names (check.names = FALSE stops data.frame() from converting the spaces to dots):

data <- data.frame("Student Name" = c("Ali", "Sara"), "Test Score" = c(85, 90), check.names = FALSE)

Replace spaces with underscores:

colnames(data) <- gsub(" ", "_", colnames(data))

Result:

| Student_Name | Test_Score |
|--------------|------------|
| Ali | 85 |
| Sara | 90 |

🧭 Quick Recap Table

| Method | Function | Use |
|--------|----------|-----|
| View column names | colnames(data) | Displays current names |
| Rename all columns | colnames(data) <- c() | Rename all at once |
| Rename one column | colnames(data)[i] <- "NewName" | Rename by position |
| Conditional rename | names(data)[names(data) == "Old"] <- "New" | Rename by name |
| Modern rename | rename(data, New = Old) | Clean and readable (dplyr) |
| Rename with order | setNames(data, c("a","b","c")) | Rename all with vector |
| Replace characters | gsub(" ", "_", colnames(data)) | Clean column labels |

✅ Tips

● Always check renamed data using colnames() or head().

● Keep column names short, lowercase, and underscore-separated (e.g., student_name).

● Avoid spaces, punctuation, and symbols in names, since they make coding harder.
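These naming tips can be combined into a small helper. The sketch below is one possible approach in base R; the function name tidy_colnames and the sample column names are made up for illustration.

```r
# Hypothetical helper: lowercase names and replace anything that is not
# a letter or digit with a single underscore
tidy_colnames <- function(df) {
  nm <- tolower(names(df))
  nm <- gsub("[^a-z0-9]+", "_", nm)   # runs of other characters become "_"
  nm <- gsub("^_+|_+$", "", nm)       # trim leading/trailing underscores
  names(df) <- nm
  df
}

raw <- data.frame(
  "Student Name"   = c("Ali", "Sara"),
  "Test Score (%)" = c(85, 90),
  check.names = FALSE                 # keep the messy names as typed
)

cleaned <- tidy_colnames(raw)
print(names(cleaned))
```

This sketch does not handle duplicate names after cleaning; if two columns collapse to the same name, they would still need to be renamed by hand.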

Reshaping Data

Reshaping data in R refers to the process of changing the structure or layout of a dataset, for example converting data from wide format to long format or vice versa. This is essential for analysis, visualization, and efficient data management.

1. Understanding Data Shapes

● Wide Format:
Each subject or category has a single row, and multiple variables are represented as separate columns.
Example:
| Name | Math | Science | English |
|------|-------|----------|----------|
| Ali | 85 | 90 | 78 |
| Sara | 92 | 87 | 88 |

● Long Format:
Each row represents a single observation for a variable.
Example:
| Name | Subject | Marks |
|------|----------|-------|
| Ali | Math | 85 |
| Ali | Science | 90 |
| Ali | English | 78 |
| Sara | Math | 92 |
| Sara | Science | 87 |
| Sara | English | 88 |

Converting between these two formats is what we call reshaping.

2. Reshaping Using tidyr Package

The tidyr package provides two main functions to reshape data:

● pivot_longer(): converts data from wide to long format.

● pivot_wider(): converts data from long to wide format.

Let’s understand both.

3. Wide to Long Format using pivot_longer()

Syntax:

pivot_longer(data, cols, names_to, values_to)

Parameters:

● data: dataset

● cols: columns to convert into key-value pairs

● names_to: name of the new column that will store variable names

● values_to: name of the new column that will store corresponding values

Example:

library(tidyr)

marks <- data.frame(
  Name = c("Ali", "Sara"),
  Math = c(85, 92),
  Science = c(90, 87),
  English = c(78, 88)
)

long_data <- pivot_longer(marks, cols = c(Math, Science, English),
                          names_to = "Subject", values_to = "Marks")

print(long_data)

Output:

| Name | Subject | Marks |
|------|---------|-------|
| Ali | Math | 85 |
| Ali | Science | 90 |
| Ali | English | 78 |
| Sara | Math | 92 |
| Sara | Science | 87 |
| Sara | English | 88 |

Explanation:

● The columns Math, Science, English become entries under a single column Subject.

● Their corresponding values go under Marks.

4. Long to Wide Format using pivot_wider()

Syntax:

pivot_wider(data, names_from, values_from)

Parameters:

● data: dataset

● names_from: column whose values become new column names

● values_from: column whose values fill the new columns

Example:

wide_data <- pivot_wider(long_data, names_from = Subject, values_from = Marks)

print(wide_data)

Output:

| Name | Math | Science | English |
|------|------|---------|---------|
| Ali | 85 | 90 | 78 |
| Sara | 92 | 87 | 88 |

Explanation:
The Subject column becomes new column names (Math, Science, English), and Marks values fill those columns.

5. Using Base R for Reshaping

Before tidyr, reshaping was commonly done using base R's reshape() function or the melt() and dcast() functions from the reshape2 package.

a. Using melt()

Converts wide data into long format.

Example:

library(reshape2)

long_data <- melt(marks, id.vars = "Name",
                  variable.name = "Subject", value.name = "Marks")

b. Using dcast()

Converts long data back into wide format.

wide_data <- dcast(long_data, Name ~ Subject, value.var = "Marks")
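For completeness, the wide-to-long conversion can also be sketched with the reshape() function that ships with base R, so no package is required. The argument values below are one way to reproduce the Name / Subject / Marks layout; the result name long_base is an assumption for this example.

```r
marks <- data.frame(
  Name    = c("Ali", "Sara"),
  Math    = c(85, 92),
  Science = c(90, 87),
  English = c(78, 88)
)

long_base <- reshape(
  marks,
  direction = "long",
  varying   = c("Math", "Science", "English"),  # columns to stack
  v.names   = "Marks",                          # name of the value column
  timevar   = "Subject",                        # name of the label column
  times     = c("Math", "Science", "English"),  # labels for each stacked column
  idvar     = "Name"                            # identifier column
)
rownames(long_base) <- NULL   # drop the automatic "Ali.Math" style row names

print(long_base)
```

The argument names reflect reshape()'s origins in repeated-measures data (hence timevar and times), which is part of why pivot_longer() is usually considered easier to read.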

6. Combining Multiple Identifiers

If your data contains multiple identifier columns, you can specify them easily.

Example:

data <- data.frame(
  Student = c("Ali", "Ali", "Sara", "Sara"),
  Year = c(2022, 2023, 2022, 2023),
  Math = c(85, 88, 92, 95),
  Science = c(90, 91, 87, 89)
)

long <- pivot_longer(data, cols = c(Math, Science),
                     names_to = "Subject", values_to = "Marks")

print(long)

Output:

| Student | Year | Subject | Marks |
|---------|------|---------|-------|
| Ali | 2022 | Math | 85 |
| Ali | 2022 | Science | 90 |
| Ali | 2023 | Math | 88 |
| Ali | 2023 | Science | 91 |
| Sara | 2022 | Math | 92 |
| Sara | 2022 | Science | 87 |
| Sara | 2023 | Math | 95 |
| Sara | 2023 | Science | 89 |

7. Spread and Gather (Older Functions)

Before pivot_longer() and pivot_wider(), we used:

● gather() → to convert wide to long

● spread() → to convert long to wide

These are now superseded but still found in older scripts.

Example:

library(tidyr)

long_data <- gather(marks, Subject, Marks, Math:English)

wide_data <- spread(long_data, Subject, Marks)

8. Advantages of Reshaping Data

● Simplifies data visualization (especially in ggplot2).

● Facilitates statistical modeling (models often need long format).

● Makes datasets tidy and consistent for analysis.

● Enables easier joining, filtering, and summarizing.

🧭 Quick Recap Table

| Task | Function | Direction | Package |
|------|----------|-----------|---------|
| Wide → Long | pivot_longer() | Wide to Long | tidyr |
| Long → Wide | pivot_wider() | Long to Wide | tidyr |
| Wide → Long (older) | melt() / gather() | Wide to Long | reshape2 / tidyr |
| Long → Wide (older) | dcast() / spread() | Long to Wide | reshape2 / tidyr |

✅ Tips

● Always inspect the dataset using head() before and after reshaping.

● Prefer pivot_longer() and pivot_wider(); they are simpler and more readable.

● For large datasets, use data.table::melt() and data.table::dcast(), which are faster versions.

● Long format is better for visualizations; wide format is better for presentation.

Centering, Scaling, and Normalizing Data Values

Before analyzing or modeling data, it’s often important to make sure that all numeric variables are on a comparable scale. This helps improve the performance and interpretability of many algorithms, especially in machine learning and statistical modeling.

This process involves centering, scaling, and normalizing data.

1. Why Scaling is Important

Imagine you’re analyzing a dataset with two features:

● Age (values like 20, 30, 40)

● Income (values like 20,000, 50,000, 90,000)

The Income variable dominates any distance or variance calculation because its values are much larger. Scaling brings all variables to similar ranges, making them equally important in analysis and modeling.

2. Centering Data

Centering means subtracting the mean value of a variable from each of its data points.
This shifts the variable so that its mean becomes zero.

Formula:
\[
x_{\text{centered}} = x - \bar{x}
\]

Example:

x <- c(10, 20, 30, 40, 50)

centered_x <- x - mean(x)

print(centered_x)

Output:

[1] -20 -10 0 10 20

Explanation:

● The mean of x is 30.

● Each value is reduced by 30, resulting in centered values.

✅ Use Case: Centering is useful before Principal Component Analysis (PCA), which works on mean-centered variables, and in regression, where it makes the intercept easier to interpret.
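Centering can also be done with the built-in scale() function by switching off the scaling step. The short sketch below checks that it matches the manual subtraction; the variable names are illustrative.

```r
x <- c(10, 20, 30, 40, 50)

centered_manual <- x - mean(x)

# scale() returns a one-column matrix, so convert it back to a plain vector
centered_scale <- as.numeric(scale(x, center = TRUE, scale = FALSE))

print(centered_scale)
print(isTRUE(all.equal(centered_manual, centered_scale)))
```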

3. Scaling Data

Scaling means dividing centered data by its standard deviation (SD) so that all variables have a standard deviation of 1.

Formula:
\[
x_{\text{scaled}} = \frac{x - \bar{x}}{s}
\]
where

● \( \bar{x} \) = mean of variable

● \( s \) = standard deviation

Example:

x <- c(10, 20, 30, 40, 50)

scaled_x <- scale(x)

print(scaled_x)

Output:

[,1]

[1,] -1.2649111

[2,] -0.6324555

[3,] 0.0000000

[4,] 0.6324555

[5,] 1.2649111

Explanation:

● After scaling, the mean becomes 0 and SD becomes 1.

● This ensures all variables contribute equally to analysis.

✅ Use Case: Scaling is vital in machine learning (e.g., K-Means, SVM, PCA, Linear Regression).

4. Normalization (Min-Max Scaling)

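Normalization rescales data to a fixed range, usually between 0 and 1.
Like standardization, it is a linear transformation, so the shape of the distribution is preserved; what changes is that the values are bounded by the observed minimum and maximum.

Formula:
\[
x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]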

Example:

x <- c(10, 20, 30, 40, 50)

normalized_x <- (x - min(x)) / (max(x) - min(x))

print(normalized_x)

Output:

[1] 0.00 0.25 0.50 0.75 1.00

Explanation:

● The smallest value becomes 0, the largest becomes 1.

● All values are proportionally scaled in between.

✅ Use Case: Normalization is especially used for distance-based algorithms (like KNN) and for neural networks.

5. Using Built-in R Functions

R provides handy built-in tools for centering and scaling:

a. scale() Function

Performs both centering and scaling.

Syntax:

scale(x, center = TRUE, scale = TRUE)

Example:

data <- data.frame(
  Age = c(20, 25, 30, 35, 40),
  Income = c(20000, 35000, 50000, 65000, 80000)
)

scaled_data <- scale(data)

print(scaled_data)

Explanation:

● Each column is centered (mean = 0) and scaled (SD = 1).

● Returns a standardized version of the dataset.

b. Custom Normalization Function

If you want to normalize manually:

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

normalized_data <- as.data.frame(lapply(data, normalize))

print(normalized_data)

Explanation:

● lapply() applies the normalization function to each column.

● Output is a dataframe where all values are between 0 and 1.
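The simple normalize() function breaks in two common situations: a column containing NA (min() and max() return NA) and a constant column (the denominator becomes zero). The sketch below is one possible defensive variant; the name normalize_safe and the choice to map constant columns to 0 are assumptions, not a standard convention.

```r
# Hypothetical safer variant of normalize()
normalize_safe <- function(x) {
  rng <- range(x, na.rm = TRUE)          # min and max, ignoring NAs
  if (rng[1] == rng[2]) {
    # Constant column: avoid 0/0 and return 0 for every non-missing value
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

print(normalize_safe(c(10, 20, 30, 40, 50)))
print(normalize_safe(c(1, NA, 3)))
print(normalize_safe(c(5, 5, 5)))
```

This sketch assumes each column has at least one non-missing value; a column that is entirely NA would need separate handling.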

6. Z-Score Normalization

A special case of scaling, also called standardization.

Formula:
\[
z = \frac{x - \bar{x}}{s}
\]

You can calculate it manually or with the scale() function; they are equivalent.
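The equivalence can be checked directly with a short sketch. Note that sd() in R uses the sample standard deviation (dividing by n - 1), which is also what scale() uses.

```r
x <- c(10, 20, 30, 40, 50)

z_manual <- (x - mean(x)) / sd(x)   # sd() divides by n - 1
z_scale  <- as.numeric(scale(x))    # drop the matrix shape and attributes

print(round(z_manual, 7))
print(isTRUE(all.equal(z_manual, z_scale)))
```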

7. Comparison Between Scaling Methods

| Method | Range | Mean | SD | Common Use |
|--------|-------|------|----|------------|
| Centering | Shifted, same width as original | 0 | Same as original | Removes offset |
| Standard Scaling | -∞ to +∞ | 0 | 1 | ML algorithms |
| Min-Max Normalization | 0 to 1 | Depends | Depends | Neural networks |
| Z-score Normalization | -3 to +3 (approx.) | 0 | 1 | Statistical analysis |

8. Practical Example: Standardizing a Dataset

data <- data.frame(
  Height = c(150, 160, 170, 180, 190),
  Weight = c(50, 60, 70, 80, 90)
)

scaled_data <- as.data.frame(scale(data))

print(scaled_data)

Output:

Height Weight

1 -1.2649111 -1.2649111

2 -0.6324555 -0.6324555

3 0.0000000 0.0000000

4 0.6324555 0.6324555

5 1.2649111 1.2649111

✅ Both Height and Weight are now on the same scale, ready for analysis.

9. Real-World Applications

● Machine Learning → Algorithms like KNN, SVM, PCA need scaled data.

● Clustering → Euclidean distance is sensitive to scale differences.

● Data Visualization → Prevents distortion in plots.

● Regression Models → Makes coefficients more interpretable.

🧭 Quick Recap

| Concept | Purpose | Example Function |
|---------|---------|------------------|
| Centering | Shift mean to 0 | x - mean(x) |
| Scaling | Make SD = 1 | scale(x) |
| Normalization | Fit data between 0 and 1 | (x - min(x)) / (max(x) - min(x)) (custom) |
| Z-score | Standardization | scale(x) |

✅ Tips

● Always scale numerical variables only, not categorical.

● Scaling should be done after splitting data into train/test sets, using parameters
computed from the training set, to prevent data leakage.

● Use caret::preProcess() for automated preprocessing pipelines.
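The data-leakage tip can be sketched in base R as follows. The data and the split are hypothetical and chosen only for illustration; the point is that the mean and SD come from the training rows and are then reused on the test rows.

```r
# Hypothetical data and split, for illustration only
set.seed(1)
heights <- c(150, 160, 170, 180, 190, 155, 165, 175)
train_idx <- sample(seq_along(heights), size = 6)

train <- heights[train_idx]
test  <- heights[-train_idx]

# Learn the scaling parameters from the training set only
train_mean <- mean(train)
train_sd   <- sd(train)

train_scaled <- (train - train_mean) / train_sd
# Reuse the training parameters on the test set (do not recompute them)
test_scaled  <- (test - train_mean) / train_sd
```

The scaled test values will generally not have mean 0 and SD 1, which is expected.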

Converting Variable Types

In R, each piece of data has a specific data type, such as numeric, character, factor, or
logical. Sometimes, we need to convert variables from one type to another to perform
certain operations correctly; for instance, converting a numeric column into a factor for
categorical analysis, or changing character data into numeric for calculations.

Let’s explore how and why this conversion is done.

1. Why Convert Variable Types?

● Data Compatibility: Some functions work only with specific data types (e.g.,
statistical tests often require numeric data).

● Accurate Analysis: Converting categorical variables to factors helps R treat them
properly during modeling.

● Avoid Errors: Incorrect data types may cause calculation or visualization errors.

● Data Cleaning: Imported data (especially from CSV/Excel) may misinterpret types
(e.g., numbers read as characters).

2. Checking Variable Types

Before converting, always check the data type using these functions:

class(x) # Returns the class of a variable

typeof(x) # Returns the internal storage type

str(data) # Displays structure of a dataset

Example:

x <- "25"

class(x)

Output:

[1] "character"

This shows that "25" is a character, not numeric.

3. Type Conversion Functions

R provides several built-in functions for conversion:

Function         Converts To   Example
as.numeric()     Numeric       as.numeric("25") → 25
as.integer()     Integer       as.integer(3.8) → 3
as.character()   Character     as.character(25) → "25"
as.logical()     Logical       as.logical(1) → TRUE
as.factor()      Factor        as.factor(c("A", "B", "A"))
as.Date()        Date          as.Date("2024-05-20")

4. Example: Converting Character to Numeric

x <- c("10", "20", "30")

y <- as.numeric(x)

print(y)

Output:

[1] 10 20 30

Explanation:
Each string element is converted into a numeric value, making it ready for mathematical
operations.

5. Example: Converting Numeric to Character

num <- c(100, 200, 300)

char <- as.character(num)

print(char)

Output:

[1] "100" "200" "300"

Explanation:
Numbers become strings, which are treated as text, not numeric values.

6. Example: Converting Character to Factor

colors <- c("Red", "Blue", "Red", "Green")

fact_colors <- as.factor(colors)

print(fact_colors)

Output:

[1] Red Blue Red Green

Levels: Blue Green Red

Explanation:
R recognizes unique categories as levels of the factor.
This is essential for categorical analysis (like grouping, plotting, or regression).

7. Example: Converting Factor to Numeric
Direct conversion from factor to numeric can give wrong results because R stores factors
as integer codes internally.
Use a two-step conversion instead.

❌Incorrect way:

f <- as.factor(c(10, 20, 30))

as.numeric(f)

Output:

[1] 1 2 3

✅Correct way:

as.numeric(as.character(f))

Output:

[1] 10 20 30

Explanation:
First convert the factor to character, then to numeric to preserve original values.
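An alternative idiom, sketched below, converts each level once and then indexes the result by the factor. The factor `f` here repeats a value to show that duplicates are handled; the values are illustrative.

```r
f <- as.factor(c(10, 20, 30, 20))

# Convert each level once, then index the result by the factor's integer codes
values <- as.numeric(levels(f))[f]
values
```

This gives the same result as as.numeric(as.character(f)) and can be faster on long factors with few levels, since only the levels are converted.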

8. Converting to Logical

x <- c(1, 0, 1, 0)

as.logical(x)

Output:

[1] TRUE FALSE TRUE FALSE

Explanation:

● Non-zero values become TRUE

● Zero values become FALSE

9. Converting to Date

R stores dates as a special class: "Date".

Example:

date_char <- c("2025-01-10", "2025-02-15")

date_real <- as.Date(date_char)

print(date_real)

Output:

[1] "2025-01-10" "2025-02-15"

Explanation:
Converts text formatted as “YYYY-MM-DD” into date objects recognized by R.

If your date format is different (e.g., “10/01/2025”), specify the format:

as.Date("10/01/2025", format="%d/%m/%Y")

10. Bulk Conversion Inside a Data Frame

You can convert multiple columns at once using lapply() or mutate() from dplyr.

Example:

data <- data.frame(

Age = c("20", "25", "30"),

Score = c("90", "85", "80")

)

data[] <- lapply(data, as.numeric)

str(data)

Output:

'data.frame': 3 obs. of 2 variables:

$ Age : num 20 25 30

$ Score: num 90 85 80

✅ Both columns are now numeric.

11. Using mutate() (Tidyverse Way)
library(dplyr)

df <- data.frame(

ID = c("1", "2", "3"),

Gender = c("M", "F", "M")

)

df <- df %>%

mutate(

ID = as.integer(ID),

Gender = as.factor(Gender)

)

str(df)

Output:

'data.frame': 3 obs. of 2 variables:

$ ID : int 1 2 3

$ Gender : Factor w/ 2 levels "F","M": 2 1 2

12. Common Conversion Issues

Issue                        Cause                           Solution
NA values after conversion   Non-numeric characters          Use as.numeric() only on clean data
Factor levels lost           Skipping character conversion   Convert factor → character → numeric
Wrong date format            Incorrect format string         Specify format using format= parameter
Logical conversion fails     Text like “Yes”, “No”           Map manually using ifelse()
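Two of the issues in the table can be reproduced in a few lines. The vectors `age_raw` and `smoker_raw` are hypothetical survey columns invented for this sketch.

```r
# Hypothetical raw survey columns (assumed for illustration)
age_raw    <- c("25", "thirty", "35")
smoker_raw <- c("Yes", "No", "Yes")

# Non-numeric text becomes NA (R also emits a coercion warning here)
age <- suppressWarnings(as.numeric(age_raw))
which(is.na(age))          # positions that need cleaning

# "Yes"/"No" are not recognised by as.logical(), so map them manually
as.logical("Yes")          # NA
smoker <- ifelse(smoker_raw == "Yes", TRUE, FALSE)
```

Checking which(is.na(...)) right after a conversion is a cheap way to find the rows that need cleaning before analysis.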

13. Real-World Example

Imagine you import survey data from Excel:

survey <- data.frame(

Age = c("25", "30", "35"),

Gender = c("Male", "Female", "Male"),

Score = c("80", "90", "85")

)

Now convert it properly:

survey$Age <- as.numeric(survey$Age)

survey$Gender <- as.factor(survey$Gender)

survey$Score <- as.numeric(survey$Score)

str(survey)

✅ Now your dataset is clean and ready for analysis or visualization.

🧭 Quick Recap

Conversion            Function                      Example
Character → Numeric   as.numeric()                  as.numeric("25")
Numeric → Character   as.character()                as.character(25)
Character → Factor    as.factor()                   as.factor("Yes")
Factor → Numeric      as.numeric(as.character(f))   Safe conversion
Character → Date      as.Date(x, format)            Date handling

✅ Tips

● Always use str() after conversion to confirm changes.

● Be careful with factor → numeric conversions.

● When importing from CSV/Excel, check the structure immediately.

● Use mutate(across()) in dplyr for efficient multiple conversions.

Data Sorting

Sorting means arranging data in a specific order, ascending or descending, based on
one or more variables. It’s one of the most common operations in R, especially during data
cleaning and exploration. Sorting helps us see trends, identify outliers, and prepare data for
reports or models.

1. Why Sorting Matters

● To organize data for easy interpretation.

● To find highest or lowest values quickly.

● To rank or prioritize data.

● To prepare datasets before merging or summarizing.

2. Sorting Vectors in R

R provides simple functions to sort numeric, character, or logical vectors.

Using sort()

Syntax:

sort(x, decreasing = FALSE)

Parameters:

● x → the vector to sort

● decreasing → set TRUE for descending order

Example 1: Ascending Order

numbers <- c(15, 3, 20, 8, 10)

sort(numbers)

Output:

[1] 3 8 10 15 20

Example 2: Descending Order

sort(numbers, decreasing = TRUE)

Output:

[1] 20 15 10 8 3

3. Sorting Characters

names <- c("Zara", "Ali", "Hina", "Bilal")

sort(names)

Output:

[1] "Ali" "Bilal" "Hina" "Zara"

✅ R sorts strings alphabetically.

4. Sorting Using order()
order() doesn’t return sorted data directly; it returns the order of indices that can be
used to rearrange data.

Example:

numbers <- c(15, 3, 20, 8, 10)

order(numbers)

Output:

[1] 2 4 5 1 3

This means the 2nd element (3) should come first, then the 4th (8), and so on.

To sort the data:

numbers[order(numbers)]

Output:

[1] 3 8 10 15 20

Descending Order:

numbers[order(-numbers)]

Output:

[1] 20 15 10 8 3

5. Sorting Data Frames

To sort a data frame, use order() on a column inside square brackets.

Example:

students <- data.frame(

Name = c("Ali", "Sara", "Bilal", "Hina"),

Marks = c(85, 90, 80, 95)

)

students[order(students$Marks), ]

Output:

Name Marks

Bilal 80

Ali 85

Sara 90

Hina 95

✅ Data is now sorted by Marks in ascending order.

Descending Order:

students[order(-students$Marks), ]

Output:

Name Marks

Hina 95

Sara 90

Ali 85

Bilal 80

6. Sorting by Multiple Columns

You can sort by multiple columns using a comma-separated list in order().

Example:

data <- data.frame(

Name = c("Ali", "Sara", "Ali", "Hina"),

Marks = c(85, 90, 75, 90)

)

data[order(data$Name, -data$Marks), ]

Output:

Name Marks

Ali 85

Ali 75

Hina 90

Sara 90

Explanation:

● First sorted by Name alphabetically.

● If names are the same, sorted by Marks in descending order.
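The minus sign only works on numeric columns, so it cannot reverse a character column such as Name. The sketch below shows two base R ways to sort Name in descending order and Marks in ascending order; it reuses the example data from this section.

```r
data <- data.frame(
  Name  = c("Ali", "Sara", "Ali", "Hina"),
  Marks = c(85, 90, 75, 90)
)

# Option 1: one decreasing flag per sort key (requires method = "radix")
by_radix <- data[order(data$Name, data$Marks,
                       decreasing = c(TRUE, FALSE), method = "radix"), ]

# Option 2: xtfrm() turns the names into sortable numbers that can be negated
by_xtfrm <- data[order(-xtfrm(data$Name), data$Marks), ]

by_radix
```

Both options give the rows in the order Sara, Hina, Ali (75), Ali (85).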

7. Using dplyr for Sorting
The dplyr package provides a more readable and modern syntax using arrange().

Example 1: Ascending Order

library(dplyr)

students %>% arrange(Marks)

Example 2: Descending Order

students %>% arrange(desc(Marks))

Example 3: Multiple Columns

data %>% arrange(Name, desc(Marks))

✅ Explanation:

● arrange() sorts by columns.

● desc() specifies descending order.

8. Sorting Rows by Row Names

data <- data.frame(Score = c(88, 92, 76))

rownames(data) <- c("Ali", "Sara", "Hina")

# drop = FALSE keeps the one-column result as a data frame (with its row names)
data[order(rownames(data)), , drop = FALSE]

Output:

Score

Ali 88

Hina 76

Sara 92

9. Sorting Columns (Transposed Sorting)

If you want to sort columns instead of rows, compute a per-column summary and use order() on it as the column index.

matrix_data <- matrix(c(5,2,8,1,7,4), nrow=2)

colnames(matrix_data) <- c("C1", "C2", "C3")

matrix_data[, order(colMeans(matrix_data))]

✅ This sorts columns based on their mean values.

10. Dealing with Missing Values (NA)

By default, sort() drops NA values, while order() places them at the end.
You can control this using the na.last argument.

Example:

x <- c(3, NA, 5, 1)

sort(x, na.last = TRUE)

Output:

[1] 1 3 5 NA

If you set na.last = FALSE, NA values come first.
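The different NA behaviours of sort() and order() can be compared side by side. This is a small sketch using the same vector as the example in this section.

```r
x <- c(3, NA, 5, 1)

sort(x)                    # default na.last = NA: the NA is dropped
sort(x, na.last = TRUE)    # NA kept at the end
sort(x, na.last = FALSE)   # NA kept at the front
x[order(x)]                # order() keeps NA at the end by default
```

Because the default sort() silently shortens the vector, it is worth checking the length of the result when missing values are possible.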

11. Real-World Example

Imagine sorting an employee dataset by department and then by salary.

employees <- data.frame(

Name = c("Aisha", "Bilal", "Sara", "Hina"),

Department = c("IT", "HR", "IT", "HR"),

Salary = c(60000, 55000, 70000, 60000)

)

employees %>% arrange(Department, desc(Salary))

Output:

Name Department Salary

Hina HR 60000

Bilal HR 55000

Sara IT 70000

Aisha IT 60000

✅ Sorted first by department, then by salary (descending).

🧭 Quick Recap

Task                         Function                  Package
Sort a vector                sort()                    Base R
Get order of indices         order()                   Base R
Sort a data frame            df[order(df$col), ]       Base R
Sort with multiple columns   order(df$col1, df$col2)   Base R
Sort in modern syntax        arrange()                 dplyr
Descending order             desc()                    dplyr

✅ Tips

● Always check for NAs before sorting.

● For descending sort, use desc() or a negative sign (-x).

● Sorting doesn’t modify the original data unless reassigned.

● arrange() is cleaner and preferred for pipelines.

Data Aggregation

Data aggregation is the process of summarizing, grouping, and combining data to
produce meaningful insights. It’s one of the most important steps in data analysis, allowing
you to compute totals, averages, counts, and other summary statistics across groups or
entire datasets.

1. What is Data Aggregation?

Data aggregation involves:

● Grouping data based on one or more attributes.

● Applying summary functions such as sum(), mean(), min(), max(), or length().

● Producing a compact representation of large datasets for easier interpretation.

Example:
If you have a dataset of students’ marks from different classes, aggregation can help find:

● Average marks per class

● Total students per class

● Highest and lowest scores in each subject

2. Aggregation Using Base R Functions

a. Using aggregate()

aggregate() is a powerful base R function for grouped summaries.

Syntax:

aggregate(x, by, FUN)

Parameters:

● x → data to summarize (numeric columns)

● by → list of grouping variables

● FUN → summary function (mean, sum, etc.)

Example (using the equivalent formula interface, aggregate(formula, data, FUN)):

data <- data.frame(

Class = c("A", "A", "B", "B", "C"),

Marks = c(80, 85, 90, 75, 88)

)

aggregate(Marks ~ Class, data = data, FUN = mean)

Output:

Class Marks

A 82.5

B 82.5

C 88.0

✅ Aggregates mean marks for each class.

b. Using Multiple Summary Columns

data <- data.frame(

Class = c("A", "A", "B", "B", "C"),

Math = c(80, 85, 90, 75, 88),

Science = c(70, 75, 80, 85, 90)

)

aggregate(. ~ Class, data = data, FUN = mean)

Output:

Class Math Science

A 82.5 72.5

B 82.5 82.5

C 88.0 90.0

✅ The . symbol means “apply to all other columns.”

3. Aggregation with tapply()
tapply() applies a function to subsets of a vector defined by one or more factors.

Syntax:

tapply(X, INDEX, FUN)

Example:

Class <- c("A", "A", "B", "B", "C")

Marks <- c(80, 85, 90, 75, 88)

tapply(Marks, Class, mean)

Output:

A B C

82.5 82.5 88.0

✅ Returns a named vector with mean marks for each class.

4. Aggregation with by()
by() is similar to tapply() but works with data frames.

Example:

data <- data.frame(

Class = c("A", "A", "B", "B", "C"),

Marks = c(80, 85, 90, 75, 88)

)

by(data$Marks, data$Class, mean)

Output:

data$Class: A

[1] 82.5

data$Class: B

[1] 82.5

data$Class: C

[1] 88

5. Aggregation Using dplyr
The dplyr package provides simple and readable functions for aggregation using pipes
(%>%).

a. Using group_by() and summarise()

library(dplyr)

data %>%

group_by(Class) %>%

summarise(Average = mean(Marks))

Output:

Class Average

A 82.5

B 82.5

C 88.0

b. Multiple Summary Statistics

data %>%

group_by(Class) %>%

summarise(

Avg = mean(Marks),

Min = min(Marks),

Max = max(Marks),

Count = n()

)

Output:

Class Avg Min Max Count

A 82.5 80 85 2

B 82.5 75 90 2

C 88.0 88 88 1

✅ n() counts number of entries per group.

c. Grouping by Multiple Columns

sales <- data.frame(

Region = c("North", "North", "South", "South", "East"),

Product = c("A", "B", "A", "B", "A"),

Sales = c(100, 150, 200, 180, 130)

)

sales %>%

group_by(Region, Product) %>%

summarise(TotalSales = sum(Sales))

Output:

Region Product TotalSales

East A 130

North A 100

North B 150

South A 200

South B 180

✅ Aggregated total sales for each region-product pair.

6. Using aggregate() for Multiple Functions
You can use a custom function to compute multiple summaries.

Example:

aggregate(Marks ~ Class, data = data,

FUN = function(x) c(Mean = mean(x), Sum = sum(x)))

Output:

Class Marks.Mean Marks.Sum

A 82.5 165

B 82.5 165

C 88.0 88
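One detail worth knowing: when FUN returns several values, aggregate() stores them in a single matrix-valued column rather than in separate columns. The sketch below, which reuses the example data, shows this and one common way to flatten the result (the object names `res` and `flat` are chosen for this illustration).

```r
data <- data.frame(
  Class = c("A", "A", "B", "B", "C"),
  Marks = c(80, 85, 90, 75, 88)
)

res <- aggregate(Marks ~ Class, data = data,
                 FUN = function(x) c(Mean = mean(x), Sum = sum(x)))

ncol(res)              # 2: Class plus one matrix-valued Marks column
is.matrix(res$Marks)   # TRUE

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, the summaries can be accessed as ordinary columns such as flat$Marks.Mean.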

7. Aggregation Using data.table

The data.table package is extremely efficient for large datasets.

Example:

library(data.table)

dt <- data.table(Class = c("A", "A", "B", "B", "C"),

Marks = c(80, 85, 90, 75, 88))

dt[, .(Average = mean(Marks), Count = .N), by = Class]

Output:

Class Average Count

A 82.5 2

B 82.5 2

C 88.0 1

✅ .N gives the number of rows in each group.

8. Aggregating Missing Values

Functions like mean() and sum() ignore NA values if you specify na.rm = TRUE.

Example:

data <- data.frame(

Class = c("A", "A", "B"),

Marks = c(80, NA, 90)

)

aggregate(Marks ~ Class, data, mean, na.rm = TRUE)

Output:

Class Marks

A 80

B 90

9. Real-World Example: Sales Data

sales <- data.frame(

Region = c("East", "East", "West", "West", "North"),

Sales = c(120, 100, 200, 180, 160)

)

sales %>%

group_by(Region) %>%

summarise(

Total = sum(Sales),

Average = mean(Sales),

Transactions = n()

)

Output:

Region Total Average Transactions

East 220 110 2

North 160 160 1

West 380 190 2

✅ Perfect for generating business summaries.

10. Quick Recap

Function                   Package      Description
aggregate()                Base R       Summarizes data by groups
tapply()                   Base R       Applies function to grouped vector
by()                       Base R       Applies function to grouped data frame
group_by() + summarise()   dplyr        Modern and clean syntax
.N, mean(), etc.           data.table   High-performance aggregation

✅ Tips:

● Always handle missing values using na.rm = TRUE.

● Use dplyr for readable and chainable summaries.

● Use data.table for very large datasets.

● Combine aggregation with filtering or sorting for complete insights.

UNIT 4: BASIC STATISTICAL ANALYSIS
1. Introduction to statistical inference

2. Hypothesis testing

○ t-tests

○ chi-square tests

3. Regression analysis

4. Creating plots using ggplot2

○ Scatter plots

○ Histograms

○ Bar plots

5. Customizing plots

○ Titles

○ Labels

○ Legends

○ Colors

○ Themes

6. Exploratory data analysis with visualization techniques

7. Creating reproducible reports

○ Generating HTML documents

○ Generating PDF documents

○ Generating Word documents

Introduction to Statistical Inference
1. What is Statistical Inference?

Statistical inference is the process of drawing conclusions about a population based on
data collected from a sample. Since analyzing an entire population is often impossible, we
use statistics to estimate or test hypotheses about population parameters.

In simpler terms:

👉 We use data from a smaller group (sample) to make educated guesses about a larger
group (population).

Example:
If we survey 200 students about their study habits, we can infer patterns for all students in
the university.

2. Key Terms in Statistical Inference

Term                 Description
Population           The entire group of individuals or items of interest.
Sample               A subset of the population used for analysis.
Parameter            A numerical summary that describes a population (e.g., population mean).
Statistic            A numerical summary calculated from a sample (e.g., sample mean).
Estimation           Using sample data to estimate population parameters.
Hypothesis Testing   Using data to test assumptions about a population.

3. Two Main Approaches

a. Estimation

● Used to estimate unknown population parameters.

● Example: Estimating the average height of all students using a sample of 100
students.

b. Hypothesis Testing

● Used to test a claim about a population parameter.

● Example: Testing whether the average exam score is above 70.

4. Types of Estimates

1. Point Estimate – A single value estimate of a population parameter.
Example: Sample mean x̄ = 72 is the point estimate for population mean μ.

2. Interval Estimate – A range of values (confidence interval) within which the true
parameter likely falls.
Example: “The average score is between 70 and 74 with 95% confidence.”

5. Confidence Intervals

A confidence interval (CI) gives a range that likely contains the true population value.

Formula for a 95% CI for the mean:

x̄ ± z * (s / √n)

Where:

● x̄ = sample mean

● s = sample standard deviation

● n = sample size

● z = z-value (1.96 for 95% confidence)

Example in R:

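# Named sample_mean / sample_sd to avoid masking the built-in mean() and sd()
sample_mean <- 72
sample_sd <- 8
n <- 50
error <- 1.96 * (sample_sd / sqrt(n))
lower <- sample_mean - error
upper <- sample_mean + error
round(c(lower, upper), 2)
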
Output:

[1] 69.78 74.22

✅ Interpretation: We’re 95% confident that the true mean lies between 69.78 and 74.22.
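Because the standard deviation here is estimated from the sample, a t-based interval is a common alternative to the z-based one. The sketch below uses the same summary values and swaps 1.96 for the t critical value from qt(); this substitution is an addition for illustration, not part of the formula given above.

```r
sample_mean <- 72
sample_sd   <- 8
n           <- 50

# t critical value with n - 1 degrees of freedom (slightly larger than 1.96)
t_crit <- qt(0.975, df = n - 1)
error  <- t_crit * (sample_sd / sqrt(n))

round(c(lower = sample_mean - error, upper = sample_mean + error), 2)
```

The t-based interval is slightly wider than the z-based one, and the two converge as the sample size grows.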

6. Types of Errors

Error Type          Description
Type I Error (α)    Rejecting a true null hypothesis (false positive).
Type II Error (β)   Failing to reject a false null hypothesis (false negative).

Example:

● Type I: Concluding a medicine works when it doesn’t.

● Type II: Concluding a medicine doesn’t work when it actually does.

7. Levels of Significance

The level of significance (α) is the probability of making a Type I error.

Common values: 0.05, 0.01, 0.10

Example:
If α = 0.05, it means we accept a 5% chance of rejecting the null hypothesis when it is
actually true.

8. Example: Testing a Claim About Average Marks

Problem:
A teacher claims that the average score of students is 75.
A sample of 25 students has a mean of 78 with a standard deviation of 10.
Test the claim at a 5% significance level.

Steps in R:

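# t.test() in base R needs the raw sample values, so with only summary
# statistics we compute the test statistic and p-value manually.
xbar <- 78; s <- 10; n <- 25; mu <- 75
t_stat <- (xbar - mu) / (s / sqrt(n))           # t = 1.5
p_value <- 2 * pt(-abs(t_stat), df = n - 1)     # two-sided p-value, about 0.15
c(t = t_stat, p = p_value)

Explanation:
If the p-value < 0.05 → Reject the claim (significant difference).
If the p-value ≥ 0.05 → Fail to reject the claim (no significant difference).
Here t = 1.5 with 24 degrees of freedom gives p ≈ 0.15, so we fail to reject the claim that the average is 75.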

9. Real-World Applications

● Predicting population trends from surveys

● Quality control in manufacturing

● Measuring the effect of new medicines

● Market research and opinion polling

10. Quick Recap

Concept                 Meaning
Statistical inference   Drawing conclusions about population from sample
Estimation              Finding unknown population values
Hypothesis testing      Checking assumptions about population
Confidence interval     Range containing true parameter
Significance level      Probability of Type I error

✅Tips for Exams

● Always define population, sample, parameter, and statistic clearly.

● Know the difference between Type I and Type II errors.

● Remember that smaller p-values mean stronger evidence against the null
hypothesis.

● Confidence interval = estimation; hypothesis testing = decision-making.

Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions or inferences about
population parameters based on sample data.
It helps us answer questions like:
“Is the average income of men and women the same?”
“Did a new drug actually improve recovery time?”

Let’s learn this step by step in an intuitive way.

1. What is a Hypothesis?

A hypothesis is an assumption or claim about a population parameter.
In statistics, we test whether this claim is likely to be true based on sample data.

For example:

● A company claims the average battery life of their phones is 10 hours.

● We collect a sample and test if this claim is true or not.

2. Types of Hypotheses

a. Null Hypothesis (H₀)

This is the statement we start with; it assumes no effect or no difference.
It’s the hypothesis we try to disprove.

Example:
H₀: The mean battery life = 10 hours

b. Alternative Hypothesis (H₁ or Hₐ)

This is what we want to prove: that there is a difference or effect.

Example:
H₁: The mean battery life ≠ 10 hours

3. Steps in Hypothesis Testing
1. State hypotheses (H₀ and H₁)

2. Choose significance level (α) – often 0.05

3. Select the appropriate test (like t-test or chi-square test)

4. Compute the test statistic and p-value

5. Make a decision:

○ If p < α, reject H₀ (significant difference)

○ If p ≥ α, fail to reject H₀ (no significant difference)
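The five steps above can be sketched in R as follows. The sample values are the ones used in the one-sample t-test example in these notes; the variable names `alpha`, `result`, and `decision` are chosen for this illustration.

```r
# Step 1: H0: mu = 75, H1: mu != 75
scores <- c(78, 82, 75, 80, 77, 85, 79)

# Step 2: significance level
alpha <- 0.05

# Steps 3 and 4: a one-sample t-test gives the test statistic and p-value
result <- t.test(scores, mu = 75)
result$statistic
result$p.value

# Step 5: decision
decision <- if (result$p.value < alpha) "Reject H0" else "Fail to reject H0"
decision
```

Storing the test result in an object makes it easy to reuse the statistic, degrees of freedom, and p-value when reporting.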

4. One-tailed vs. Two-tailed Tests

Type           When to Use                                    Example
Two-tailed     You want to check if there’s any difference    Mean ≠ 10
Left-tailed    You suspect the mean is less than a value      Mean < 10
Right-tailed   You suspect the mean is greater than a value   Mean > 10

5. t-Tests
The t-test is used when:

● Sample size is small (n < 30)

● Population standard deviation is unknown

There are three main types of t-tests in R:

a. One-sample t-test

Used to compare a sample mean to a known value.

Example:

scores <- c(78, 82, 75, 80, 77, 85, 79)
t.test(scores, mu = 75)

Explanation:

● scores → your sample data

● mu = 75 → population mean (claimed value)

If p-value < 0.05, the sample mean is significantly different from 75.

b. Two-sample (Independent) t-test

Used to compare means of two different groups.

Example:

group1 <- c(80, 82, 85, 83, 81)
group2 <- c(75, 78, 72, 77, 74)
t.test(group1, group2, var.equal = TRUE)

Explanation:

● group1 and group2 → two independent groups

● var.equal = TRUE → assumes equal variances (the default, FALSE, runs Welch's t-test)

If p < 0.05, the two group means are significantly different.

c. Paired t-test

Used when the two samples are related, for example, before and after measurements on
the same people.

Example:

before <- c(65, 70, 68, 72, 66)
after <- c(70, 74, 72, 76, 71)
t.test(before, after, paired = TRUE)

Explanation:
Checks whether the “after” values differ significantly from “before” values.

✅ Use Case: Effect of a training program on performance.

6. Interpreting p-value

p-value Decision Interpretation

< 0.05 Reject H₀ Significant difference

≥ 0.05 Fail to reject H₀ No significant difference

Example Interpretation:
If p = 0.03, it means that if H₀ were true, there would be only a 3% chance of observing a
difference at least this large by random chance, so we reject H₀ and conclude there is evidence of a real effect.

7. Chi-Square (χ²) Test

The Chi-square test is used to compare categorical data; it checks if two variables are
related or independent.

There are two main types:

1. Chi-square goodness of fit test

2. Chi-square test of independence

a. Chi-square Goodness of Fit

Used to test whether the observed frequencies match the expected frequencies.

Example:
Suppose we expect equal distribution of students in 3 courses (Math, CS, Stats),
but the actual counts are different.

observed <- c(45, 30, 25)
expected <- c(33.3, 33.3, 33.3)
chisq.test(x = observed, p = expected/sum(expected))

Explanation:
If p < 0.05, the observed distribution significantly differs from what was expected.
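To see what chisq.test() computes for the goodness of fit case, the statistic can be worked out by hand. This sketch uses the same observed counts and assumes an exactly equal expected split; the names `chi_sq` and `p_value` are chosen for illustration.

```r
observed <- c(45, 30, 25)
expected <- rep(sum(observed) / 3, 3)   # equal split across the 3 courses

# Test statistic: sum of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# p-value from the chi-square distribution with k - 1 degrees of freedom
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

c(statistic = chi_sq, p = p_value)
```

The statistic works out to 6.5 with a p-value of roughly 0.039, so at the 5% level the observed distribution differs significantly from an equal split.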

b. Chi-square Test of Independence

Used to check if two categorical variables are independent.

Example:
You want to see if gender is related to course preference.

data <- matrix(c(20, 30, 25, 25), nrow = 2)
colnames(data) <- c("Math", "Science")
rownames(data) <- c("Male", "Female")

chisq.test(data)

Explanation:
If p < 0.05, it means gender and course preference are not independent (they’re related).

8. Real-World Examples

● t-test: Comparing average exam scores of two classes

● Paired t-test: Testing the effect of a new teaching method

● Chi-square test: Analyzing survey results (e.g., “Is product preference linked to age
group?”)

9. Quick Recap

Test                Used For                                 Example Question
One-sample t-test   Compare sample mean to population mean   Is average weight = 60kg?
Two-sample t-test   Compare two independent groups           Do males and females differ in height?
Paired t-test       Compare two related samples              Did training improve scores?
Chi-square test     Compare categorical data                 Is gender related to department choice?

✅Tips for Exams

● Always write H₀ and H₁ clearly.

● Report test statistic, degrees of freedom, and p-value.

● Mention significance level (usually 0.05).

● Interpret the result in plain language: “There is a significant difference…” or “No
significant difference was found.”

Regression Analysis

Regression analysis is a statistical technique used to study the relationship between one
dependent variable and one or more independent variables. It helps in predicting the value
of the dependent variable based on the independent variables. In R, regression analysis is
commonly performed using built-in functions such as lm() for linear regression and glm()
for generalized linear models.

Types of Regression in R:

Simple Linear Regression: Used when there is one independent variable.
Example: Predicting sales based on advertising budget.

# Simple Linear Regression

data <- data.frame(
sales = c(10, 20, 30, 40, 50),
budget = c(1, 2, 3, 4, 5)
)
model <- lm(sales ~ budget, data = data)
summary(model)

Output shows coefficients, R-squared value, and significance levels.
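Since prediction is the main use of a fitted model, here is a short sketch of predict() applied to the simple linear model above. The new budget value of 6 is a made-up input for illustration.

```r
data <- data.frame(
  sales  = c(10, 20, 30, 40, 50),
  budget = c(1, 2, 3, 4, 5)
)
model <- lm(sales ~ budget, data = data)

coef(model)   # intercept and slope

# Predict sales for a new budget value (6 is a made-up input)
new_data <- data.frame(budget = 6)
predict(model, newdata = new_data)
```

The column name in `new_data` must match the predictor name used in the model formula, otherwise predict() cannot find the variable.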

Multiple Linear Regression: Used when there are two or more independent variables.
Example: Predicting sales based on budget and number of employees.

# Multiple Linear Regression
# (values chosen so that employees is not an exact multiple of budget;
# perfectly collinear predictors would give an NA coefficient)
data <- data.frame(
sales = c(12, 19, 33, 38, 52),
budget = c(1, 2, 3, 4, 5),
employees = c(2, 3, 7, 6, 10)
)
model <- lm(sales ~ budget + employees, data = data)
summary(model)

Polynomial Regression: Fits nonlinear relationships between variables.

# Polynomial Regression
x <- c(1, 2, 3, 4, 5)
y <- c(2, 6, 14, 28, 45)
model <- lm(y ~ poly(x, 2, raw = TRUE))
summary(model)

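Logistic Regression: Used when the dependent variable is categorical (binary outcome like
0/1).

# Logistic Regression
# (outcomes overlap across hours so the classes are not perfectly separable,
# which would otherwise trigger convergence warnings in glm())
data <- data.frame(
pass = c(0, 0, 1, 0, 1, 1),
hours = c(1, 2, 3, 4, 5, 6)
)
model <- glm(pass ~ hours, data = data, family = binomial)
summary(model)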

Model Evaluation Metrics:

● R-squared: Proportion of variance explained by the model.

● Adjusted R-squared: Adjusted for number of predictors.

● p-value: Checks significance of each predictor.

● Residuals: Difference between observed and predicted values.

Visualization of Regression Line:

# Visualization (refit the simple linear model so the snippet runs on its own)
data <- data.frame(sales = c(10, 20, 30, 40, 50), budget = c(1, 2, 3, 4, 5))
model <- lm(sales ~ budget, data = data)
plot(data$budget, data$sales, main="Regression Line", xlab="Budget", ylab="Sales")
abline(model, col="blue", lwd=2)

Applications:

● Predicting sales, prices, or growth.

● Evaluating influence of multiple factors.

● Risk assessment and forecasting.

Creating Plots Using ggplot2

The ggplot2 package in R is one of the most powerful and flexible tools for data
visualization. It follows the Grammar of Graphics concept, where a plot is built step-by-step
by adding layers such as data, aesthetics, geometries, and themes.

To use ggplot2, first install and load the package:

install.packages("ggplot2")
library(ggplot2)

1. Basic Structure of a ggplot

The general syntax of ggplot2 is:

ggplot(data, aes(x, y)) + geom_<type>() + other layers

● data: The dataset used for plotting.

● aes(): Defines aesthetic mappings (x and y axes, color, size, etc.).

● geom_(): Adds a geometric layer such as points, lines, bars, etc.

2. Scatter Plots

A scatter plot shows the relationship between two numeric variables.

# Scatter Plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(color = "blue", size = 3) +
ggtitle("Scatter Plot of Weight vs MPG") +
xlab("Weight") + ylab("Miles per Gallon")

Explanation:

● Each point represents a car.

● The plot shows how car weight affects fuel efficiency.

3. Histograms

Histograms are used to visualize the distribution of a single numeric variable.

# Histogram
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
ggtitle("Histogram of Miles per Gallon")

Explanation:

● binwidth defines the width of each bar.

● Helps identify the frequency distribution of data.

4. Bar Plots

Bar plots represent categorical data using rectangular bars.

# Bar Plot
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar(fill = "orange", color = "black") +
ggtitle("Number of Cars by Cylinder Type") +
xlab("Cylinders") + ylab("Count")

Explanation:

● factor(cyl) converts the numeric variable into a categorical one.

● Each bar shows the number of cars for a specific cylinder count.

5. Line Plots

Line plots are used to show trends over a continuous variable (often time).

# Line Plot
ggplot(economics, aes(x = date, y = unemploy)) +
geom_line(color = "darkgreen", linewidth = 1) +
ggtitle("Unemployment Over Time") +
xlab("Date") + ylab("Number of Unemployed")

Explanation:

● Shows changes in unemployment over time using connected points.

6. Box Plots

Box plots summarize data using median, quartiles, and outliers.

# Box Plot
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
geom_boxplot(fill = "lightgreen") +
ggtitle("MPG by Cylinder Type") +
xlab("Cylinders") + ylab("Miles per Gallon")

Explanation:

● Displays spread and skewness of MPG for each cylinder category.

7. Density Plots

Density plots are smooth versions of histograms.

# Density Plot
ggplot(mtcars, aes(x = mpg)) +
geom_density(fill = "pink", alpha = 0.5) +
ggtitle("Density Plot of MPG")

Key Features of ggplot2

● Layered plotting (add or remove elements easily).

● Supports customization of themes, colors, and labels.

● Compatible with various data transformation packages like dplyr.

● Allows advanced statistical visualization (facets, trend lines, etc.).

Example: Adding Multiple Layers

# Combined Plot
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Relationship Between Weight and MPG by Cylinder") +
xlab("Weight") + ylab("Miles per Gallon")

This adds both scatter points and a fitted regression line for each cylinder category.

Customizing Plots

While ggplot2 provides beautiful default visuals, customizing your plots helps make them
clearer, more readable, and presentation-ready. You can change almost every element,
from titles and axis labels to colors, legends, and themes.

Let’s explore the most common customizations step by step.

1. Adding Titles, Subtitles, and Captions

You can add descriptive titles and captions to make your plots more informative.

ggplot(mtcars, aes(x = wt, y = mpg)) +

geom_point(color = "blue") +
labs(
title = "Relationship Between Car Weight and Mileage",
subtitle = "Data from mtcars dataset",
x = "Weight (1000 lbs)",
y = "Miles per Gallon",
caption = "Source: R mtcars dataset"
)

Explanation:

● title adds the main heading.

● subtitle gives additional context.

● x and y label the axes.

● caption appears at the bottom, useful for mentioning data sources.

2. Customizing Axis Labels and Ticks

You can modify axis text, font size, or rotation for better clarity.

ggplot(mtcars, aes(x = wt, y = mpg)) +

geom_point() +
scale_x_continuous(name = "Car Weight", breaks = seq(2, 5, 0.5)) +
scale_y_continuous(name = "Fuel Efficiency (MPG)", limits = c(10, 35))

Explanation:

● breaks defines tick intervals.

● limits restricts the axis range.

3. Customizing Colors

You can assign specific colors manually or use predefined color scales.

Example 1: Manual Colors for Categorical Variables

ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +

geom_bar() +
scale_fill_manual(values = c("4" = "skyblue", "6" = "orange", "8" = "green")) +
labs(fill = "Cylinders")

Example 2: Using a Gradient for Continuous Data

ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +

geom_point(size = 3) +
scale_color_gradient(low = "lightblue", high = "darkred")

Explanation:

● scale_fill_manual() and scale_color_gradient() let you precisely define
color schemes.
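
The predefined color scales mentioned at the start of this subsection ship with ggplot2 itself. A hedged sketch using two of them, scale_fill_brewer() for categories and scale_color_viridis_c() for continuous values; the palette name "Set2" is just one of the available ColorBrewer palettes, chosen here for illustration:

```r
library(ggplot2)

# Categorical data: a ColorBrewer palette instead of hand-picked colors
p_bar <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Cylinders", fill = "Cylinders")

# Continuous data: the viridis scale, which is perceptually uniform
p_pts <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point(size = 3) +
  scale_color_viridis_c()

p_bar
p_pts
```

Predefined palettes are usually a safer default than manual colors because they were designed so that neighboring categories remain distinguishable.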

4. Customizing Legends

You can change legend position, title, and appearance.

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +

geom_point(size = 3) +
labs(color = "Cylinders") +
theme(legend.position = "bottom")

Tip: You can also use "top", "left", or "right", or "none" to remove the legend.

5. Customizing Themes

Themes control the overall appearance (background, grid lines, fonts).

Example:

ggplot(mtcars, aes(x = wt, y = mpg)) +

geom_point(size = 3, color = "darkblue") +
theme_minimal() +
labs(title = "Weight vs Mileage") +
theme(
plot.title = element_text(size = 14, face = "bold", color = "darkred"),
axis.title = element_text(size = 12, face = "bold"),
panel.grid.major = element_line(color = "grey80")
)

Popular Built-in Themes:

● theme_bw() – black and white clean style.

● theme_minimal() – modern minimal look.

● theme_classic() – simple with axes and no grid lines.

● theme_light() – gentle background.

6. Faceting (Multiple Plots by Category)

Faceting allows you to split data into multiple panels automatically.

ggplot(mtcars, aes(x = wt, y = mpg)) +

geom_point(color = "blue") +
facet_wrap(~ cyl) +
labs(title = "MPG vs Weight for Different Cylinders")

Explanation:
Each panel shows data for one cylinder category — a great way to compare groups visually.

7. Combining Multiple Customizations

Here’s how you can combine all these ideas in one polished plot:

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +

geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_manual(values = c("red", "green", "blue")) +
labs(
title = "Effect of Weight on Mileage",
subtitle = "Comparison by Number of Cylinders",
x = "Car Weight (1000 lbs)",
y = "Miles per Gallon",
color = "Cylinders"
) +
theme_classic() +
theme(
plot.title = element_text(face = "bold", color = "darkred", size = 14),
legend.position = "bottom"
)

Explanation:
This plot includes:

● Title, labels, and legend.

● Custom colors.

● Trend lines (regression).

● Clean theme and professional layout.

🧭 Quick Summary: Plot Customization Tips

● Use labs() for all text elements.

● Use theme() for fine styling (fonts, colors, positions).

● Use scale_...() functions for axis and color control.

● Combine multiple layers with + for complex, professional visuals.

✅ Exam Tip:
Questions often ask about plot customization functions and theme control in ggplot2.
Remember:

● labs() for labels,

● theme() for style,

● scale_...() for colors/scales.

Exploratory Data Analysis (EDA) with Visualization Techniques

Exploratory Data Analysis (EDA) is the process of visually and statistically exploring data
to understand its structure, patterns, relationships, and anomalies before formal
modeling.
It is one of the most important phases in a data science project because it helps you
discover what the data is trying to tell you.

EDA in R often combines summary statistics, data visualization, and data cleaning
techniques, and ggplot2 is one of the most powerful tools for this.

1. Understanding EDA

Before running complex models, EDA helps answer questions like:

● What does the distribution of each variable look like?

● Are there outliers or missing values?

● How are different variables related?

● Are there hidden patterns or correlations?

EDA is both a science and an art: it’s about asking the right questions and visually
exploring answers.

2. Common Visualization Tools for EDA

Let’s look at some visualization types commonly used in EDA.

a) Histograms — Understanding Data Distribution

Histograms help you visualize how data is distributed across ranges.

ggplot(mtcars, aes(x = mpg)) +

geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
labs(title = "Distribution of Miles per Gallon", x = "MPG", y = "Frequency")

Insight:
Check if the data is normally distributed, skewed, or has outliers.
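
The visual check can be backed by a quick numeric one. A rough sketch in base R: it compares the mean with the median and computes a simple moment-based skewness estimate. The variable names are illustrative, and the formula is a rough estimate rather than the exact estimator of any particular package:

```r
# Compare mean and median, then compute a rough moment-based skewness
mpg <- mtcars$mpg

mean_mpg   <- mean(mpg)
median_mpg <- median(mpg)

# Third central moment divided by the cubed standard deviation
skewness <- mean((mpg - mean_mpg)^3) / sd(mpg)^3

cat("mean:", round(mean_mpg, 2), " median:", median_mpg, "\n")
cat("skewness:", round(skewness, 2), "\n")
```

A mean above the median together with a positive skewness value suggests a longer right tail, which should match what the histogram shows.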

b) Box Plots — Detecting Outliers

Box plots show spread and identify outliers easily.

ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +

geom_boxplot() +
labs(title = "Mileage by Cylinder Type", x = "Cylinders", y = "MPG")

Insight:
Higher-cylinder cars generally have lower mileage, and box plots can confirm that visually.

c) Scatter Plots — Checking Relationships Between Variables

Scatter plots help identify correlations between two continuous variables.

ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +

geom_point(size = 3) +
labs(title = "Relationship Between Weight, Mileage, and Horsepower")

Insight:
Cars with higher weight tend to have lower mileage, and horsepower also influences the
trend.
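
The strength of these relationships can be quantified with base R. A small sketch using cor() for the Pearson coefficient and cor.test() for a confidence interval and p-value; the variable names r_wt_mpg and r_hp_mpg are introduced here for illustration:

```r
# Pearson correlations behind the scatter plot (base R, no packages needed)
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
r_hp_mpg <- cor(mtcars$hp, mtcars$mpg)

cat("wt vs mpg:", round(r_wt_mpg, 2), "\n")
cat("hp vs mpg:", round(r_hp_mpg, 2), "\n")

# cor.test() adds a confidence interval and p-value for one pair
cor.test(mtcars$wt, mtcars$mpg)
```

Both coefficients are negative, which agrees with the downward trend in the scatter plot. Correlation alone does not show that weight or horsepower causes lower mileage.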

d) Pair Plots — Multiple Relationships at Once

The GGally package allows creating pair plots (scatter plots for every pair of numeric variables).

library(GGally)
ggpairs(mtcars[, c("mpg", "wt", "hp", "disp")])

Insight:
Quickly see correlations and patterns across several variables.

e) Correlation Heatmaps

Heatmaps visually display correlations between numeric variables.

library(reshape2)
cor_matrix <- cor(mtcars)
melted_cor <- melt(cor_matrix)
ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap")

Insight:
Strong correlations appear in more saturated colors (red for negative, blue for positive),
while values near zero fade to white, helping spot variable dependencies.
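
If reshape2 is not installed, the reshaping step can also be done in base R. A sketch under that assumption: as.table() followed by as.data.frame() produces one row per matrix cell with columns named Var1, Var2 and Freq, so the fill aesthetic uses Freq instead of value. The names long_cor and p_heat are illustrative:

```r
library(ggplot2)

cor_matrix <- cor(mtcars)

# One row per cell of the 11 x 11 correlation matrix
long_cor <- as.data.frame(as.table(cor_matrix))  # columns: Var1, Var2, Freq

p_heat <- ggplot(long_cor, aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0) +
  labs(title = "Correlation Heatmap", fill = "Correlation")
p_heat
```

The plot itself is unchanged; only the way the long-format table is built differs.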

f) Bar Charts — Understanding Categorical Data

Bar charts visualize the frequency of categories.

ggplot(mtcars, aes(x = factor(gear), fill = factor(gear))) +

geom_bar() +
labs(title = "Frequency of Gear Types", x = "Gears", y = "Count")

Insight:
Helps in comparing categories like gear types, fuel type, or transmission.

3. Combining EDA with Summary Statistics

Visualization becomes more powerful when supported by summary functions:

summary(mtcars)

This gives quick stats like mean, median, min, max, and quartiles for every numeric variable.

You can also group and summarize data:

library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(avg_mpg = mean(mpg), avg_hp = mean(hp))

Explanation:
Summarizes mileage and horsepower by the number of cylinders — a useful analytical
view.

4. Outlier Detection

Visual techniques such as boxplots and scatter plots are best for spotting outliers, but you
can also detect them programmatically.

outliers <- boxplot.stats(mtcars$mpg)$out
outliers

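Explanation:
This returns the MPG values that a box plot would draw as outliers: those lying more than
1.5 times the box length beyond the box. An empty result, numeric(0), means no value was flagged.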
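
The same rule can be written by hand. A minimal sketch of the 1.5 × IQR rule; find_outliers, its k argument and the sample vector are hypothetical names and data introduced here. Because quantile() computes quartiles slightly differently from the hinges used by boxplot.stats(), the two approaches can disagree on borderline values:

```r
# Flag values more than k * IQR outside the quartiles (Tukey's rule)
find_outliers <- function(x, k = 1.5) {
  q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# Made-up sample with one obviously extreme value
sample_values <- c(12, 14, 15, 15, 16, 17, 18, 95)
find_outliers(sample_values)

# The same function applied to a real column
find_outliers(mtcars$hp)
```

Writing the rule out makes the threshold explicit, so k can be raised (for example to 3) when only extreme values should be flagged.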

5. Combining Multiple Insights

You can create dashboards of multiple visuals using packages like patchwork or cowplot
to combine plots for a holistic view.

library(patchwork)
p1 <- ggplot(mtcars, aes(mpg)) + geom_histogram(fill="skyblue")
p2 <- ggplot(mtcars, aes(wt, mpg)) + geom_point(color="darkgreen")
p1 + p2

🧩 Summary of EDA Visualization Techniques
Visualization    Purpose

Histogram        Distribution of numeric data

Box Plot         Spread & outliers

Scatter Plot     Relationship between two variables

Heatmap          Correlations

Bar Chart        Frequency of categorical variables

Pair Plot        Multivariate relationships

💡 Quick Tips

● Always start EDA with summary statistics and basic plots.

● Use color and shape wisely to highlight relationships.

● Look for outliers, missing values, and skewed distributions.

● Combine visuals for deeper insights.

✅ Exam Tip:
Common questions:

● What is EDA and why is it important?

● Which visualization tools are used for EDA?

● Explain how to detect outliers or correlations in data using ggplot2.

Creating Reproducible Reports

When working with data analysis or research, it’s important to make your work reproducible,
meaning anyone can rerun your code and get the same results, along with all the visuals,
explanations, and outputs in one organized report.

In R, this is achieved using R Markdown, a powerful tool that combines code, output,
and text in a single document. From R Markdown, you can generate professional reports in
HTML, PDF, or Word formats.

1. What is R Markdown?

R Markdown is a special type of document that lets you:

● Write normal text (like a report) in Markdown format.

● Insert R code chunks that execute and display results.

● Export your work as a formatted HTML, PDF, or Word document.

You can create a new R Markdown file in RStudio by:
File → New File → R Markdown

2. Basic Structure of an R Markdown File

A typical R Markdown file has three main parts:

1. YAML Header: defines title, author, and output type

2. Markdown Text: regular descriptive content

3. R Code Chunks: embedded executable code

Example:

---
title: "My Analysis Report"
author: "Shah Faisal"
date: "2025-11-10"
output: html_document
---

## Introduction
This report explores the relationship between car weight and mileage.

```{r}
# R code chunk
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point(color = "blue")
```

## Conclusion
We observe that heavier cars tend to have lower mileage.

3. Creating HTML Documents

If you select HTML as the output format, your report will be generated as an interactive
webpage.

output: html_document

Then click Knit in RStudio (the blue yarn button 🧶).
It will create an .html file that can be opened in a browser.

Advantages:

● Interactive plots and hyperlinks.

● Attractive formatting and color themes.

● Easily shareable online.

Example Output:

ggplot(mtcars, aes(wt, mpg)) +

geom_point(color = "purple") +
geom_smooth(method = "lm", se = FALSE)

4. Creating PDF Documents

To create a PDF report, specify:

output: pdf_document

PDF reports are great for official documentation or academic submissions.

Note: You’ll need to have LaTeX installed (RStudio will guide you if it’s missing).

Advantages:

● Professional formatting

● Printable reports

● Widely accepted in research settings

5. Creating Word Documents

To export your report to Microsoft Word:

output: word_document

This creates a .docx file that you can open and edit in Word.

Advantages:

● Editable format

● Ideal for teamwork or assignments requiring revisions

6. Adding Plots and Tables

R Markdown automatically includes plots or tables generated by your R code chunks.

Example:

summary(mtcars$mpg)
boxplot(mtcars$mpg, main = "Boxplot of MPG")

You can also display data frames neatly:

head(mtcars)
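
For a more polished table in the knitted report, the knitr package (which R Markdown already relies on) provides kable(). A short sketch, assuming knitr is installed; the column selection and caption text are illustrative:

```r
library(knitr)

# kable() turns a data frame into a formatted table in the knitted report
tbl <- kable(
  head(mtcars[, c("mpg", "cyl", "wt", "hp")]),
  caption = "First rows of mtcars (selected columns)"
)
tbl
```

Inside a code chunk, the table is rendered in the style of the chosen output format instead of as raw console text.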

7. Combining Text and Code

This is where R Markdown shines — you can mix your explanations and visuals together:

The dataset shows that cars with higher **weight** (`wt`) have lower **mileage** (`mpg`).

```{r}
plot(mtcars$wt, mtcars$mpg)
```

This structure makes your report read like a story: each code snippet is
immediately explained by the text around it.

8. Adding Inline Code

You can embed small code results directly inside your text by wrapping an expression in
backticks that start with the letter r.

Example:

The dataset contains `r nrow(mtcars)` observations and `r ncol(mtcars)` variables.

When knitted, R will replace the inline code with the actual numbers (32 and 11 for mtcars).

9. Customizing Reports

You can style your report using various options:

● Change themes (theme: cerulean or cosmo)

● Add table of contents (toc: true)

● Control figure size (fig.width, fig.height)

Example YAML:

output:
  html_document:
    theme: cerulean
    toc: true
    toc_float: true

10. Benefits of Reproducible Reports

● Transparency: Every analysis step is documented.

● Reproducibility: Others can verify or build upon your work.

● Automation: Update the dataset, knit again, and the entire report updates automatically.

● Professional Presentation: Clean, structured reports suitable for projects and
research.
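
The automation benefit can also be driven from code instead of the Knit button. A hedged sketch that writes a tiny R Markdown file to a temporary folder and renders it with rmarkdown::render(); the file name demo_report.Rmd and its contents are made up for illustration, and rendering is skipped when rmarkdown or pandoc is not available:

```r
# Write a tiny R Markdown file to a temporary location (hypothetical content)
fence    <- strrep("`", 3)  # three backticks that open and close a code chunk
rmd_path <- file.path(tempdir(), "demo_report.Rmd")

writeLines(c(
  "---",
  'title: "Demo Report"',
  "output: html_document",
  "---",
  "",
  "## Summary",
  "",
  paste0(fence, "{r}"),
  "summary(mtcars$mpg)",
  fence
), rmd_path)

# Render only when rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_file)  # path of the generated .html file
}
```

Calling render() from a script makes it possible to regenerate a report on a schedule or whenever the underlying data changes.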

🧩 Summary Table: Report Types

Output Type    File Extension    Ideal For                Key Benefit

HTML           .html             Interactive sharing      Beautiful web-based layout

PDF            .pdf              Academic/official use    Print-ready format

Word           .docx             Editable reports         Easy to modify or annotate

💡 Quick Tips

● Use ## and ### for section headings.

● Use triple backticks (```) for R code chunks.

● Always write meaningful section titles.

● Click Knit to render your report.

✅ Exam Tip:
Common questions:

● What is R Markdown?

● How do you create reproducible reports in R?

● Differentiate between HTML, PDF, and Word outputs in R Markdown.
