Data structures in R
Arrays
• Data structures that can have more than one dimension (plain vectors are 1D)
• Store elements of the same data type
• Built on top of vectors by adding a dim (dimension) attribute
• Can be 1D, 2D (a matrix), 3D, or higher
• Useful for organizing complex datasets
Arrays
• Single-dimension arrays
a = c(10, 20, 30, 40, 50)
arr = array(a)
print(class(arr)) # "array"
print(arr)
[1] 10 20 30 40 50 # output
Arrays
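Multi-dimensional arrays
Syntax:
array(data, dim = c(nrow, ncol, nlayer))
a = c(10, 20, 30, 40, 50, 60)
n = array(a, dim = c(2, 2, 3)) # 2 rows, 2 columns, 3 layers
print(n)
• a has 6 values but the array has 2 x 2 x 3 = 12 cells, so R recycles a to fill it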
Arrays
• n = array(a, dim = c(2, 2, 3))
print(n[2, 1, 1]) # single element: row 2, column 1, layer 1 -> 20
print(n[2, , 1]) # entire 2nd row of layer 1 -> [1] 20 40
print(n[, c(1, 2), 3]) # all rows, columns 1-2, layer 3
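The indexing calls above can be checked with a short sketch. It reuses the same `a` and `n` from the slide; the result names `elem`, `row2`, and `layer3` are only illustrative. Because `a` holds 6 values and the array has 12 cells, the values are recycled, which is why layer 3 starts at 30.

```r
# Same data as above: 6 values recycled into 2 x 2 x 3 = 12 cells
a <- c(10, 20, 30, 40, 50, 60)
n <- array(a, dim = c(2, 2, 3))

elem   <- n[2, 1, 1]        # row 2, column 1, layer 1 -> 20
row2   <- n[2, , 1]         # row 2 of layer 1 -> 20 40
layer3 <- n[, c(1, 2), 3]   # all of layer 3, returned as a 2 x 2 matrix

print(elem)
print(row2)
print(layer3)
```

Leaving an index position empty, as in `n[2, , 1]`, selects everything along that dimension.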
Arrays
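a[row, column, layer]
a = array(c(10, 20, 30, 40, 50, 60), dim = c(2, 2, 1)) # only the first 4 values fit in 2 x 2 x 1
b = array(c(10, 20, 30, 40, 50, 60), dim = c(2, 2, 1))
apply()
apply(array_name, margin, function)
res = apply(a, 2, sum)
# 30 70; margin 1 applies the function to each row, margin 2 to each column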
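A minimal sketch of how the margin argument changes what `apply()` computes. The array holds only the four values that fit in a 2 x 2 x 1 shape; the names `row_sums`, `col_sums`, and `layer_sums` are illustrative.

```r
a <- array(c(10, 20, 30, 40), dim = c(2, 2, 1))

row_sums   <- apply(a, 1, sum)  # margin 1: one sum per row    -> 40 60
col_sums   <- apply(a, 2, sum)  # margin 2: one sum per column -> 30 70
layer_sums <- apply(a, 3, sum)  # margin 3: one sum per layer  -> 100

print(row_sums)
print(col_sums)
print(layer_sums)
```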
Arrays
• Using for loop to iterate through array elements:
for(i in a) { print(i) }
Search in an array
> 20 %in% a
[1] TRUE
Matrices
• Matrices in R are created using the matrix() function with three key
parameters:
• data (a vector), nrow (number of rows), and ncol (number of columns)
• For example, m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3, ncol = 2, byrow = FALSE)
fills the matrix column-wise (byrow = FALSE is the default), resulting in:
     [,1] [,2]
[1,]   10   40
[2,]   20   50
[3,]   30   60
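To see what `byrow` changes, the sketch below builds the same values both ways. The names `values`, `by_col`, and `by_row` are illustrative and not part of the slides.

```r
values <- c(10, 20, 30, 40, 50, 60)

by_col <- matrix(values, nrow = 3, ncol = 2)                # default: fill down each column
by_row <- matrix(values, nrow = 3, ncol = 2, byrow = TRUE)  # fill across each row

print(by_col[1, ])  # 10 40
print(by_row[1, ])  # 10 20
```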
Matrices
• Accessing elements in a matrix
• Single element with m[row, col]: print(m[2, 1]) gives [1] 20
• Whole row with m[row, ]: print(m[1, ]) gives [1] 10 40
• Whole column with m[, col]: print(m[, 1]) gives [1] 10 20 30
• Multiple rows with m[c(1, 2), ]
• Multiple columns with m[, c(1, 2)]
Matrices
• Multiple rows via m[c(1, 2), ]
print(m[c(1, 2), ])
     [,1] [,2]
[1,]   10   40
[2,]   20   50
• Multiple columns via m[, c(1, 2)]
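One detail worth knowing when extracting a single row or column: R drops the result to a plain vector unless `drop = FALSE` is given. A small sketch using the same `m` as above (the result names are illustrative):

```r
m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3, ncol = 2)

single   <- m[2, 1]               # 20
row_vec  <- m[1, ]                # plain vector: 10 40
row_mat  <- m[1, , drop = FALSE]  # still a matrix, 1 row x 2 columns
two_rows <- m[c(1, 2), ]          # 2 x 2 matrix

print(dim(row_vec))  # NULL, because the dimensions were dropped
print(dim(row_mat))  # 1 2
```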
Matrices
Operations on a matrix
Adding rows and columns to a matrix
To add a row to a matrix, use
rbind(matrix_name, data)
n = rbind(m, c(70, 80))
print(n)
Matrices
Operations on a matrix
To add a column, use
cbind(matrix_name, data)
• Note: the arguments to rbind() should have the same number of columns and
the arguments to cbind() should have the same number of rows.
• Since cbind() and rbind() work on vectors, you can use them to construct
matrices from vectors on a row by row, or column by column basis.
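A short sketch of both uses described above: extending an existing matrix and building a matrix directly from vectors. All result names are illustrative.

```r
m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3, ncol = 2)

with_row <- rbind(m, c(70, 80))      # new row needs 2 values    -> 4 x 2
with_col <- cbind(m, c(70, 80, 90))  # new column needs 3 values -> 3 x 3

from_rows <- rbind(c(1, 2, 3), c(4, 5, 6))  # each vector becomes a row    -> 2 x 3
from_cols <- cbind(c(1, 2, 3), c(4, 5, 6))  # each vector becomes a column -> 3 x 2

print(dim(with_row))
print(dim(with_col))
```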
Matrices
• Deleting rows and columns in a matrix (a negative index drops that position; use -c(i, j) to drop several)
• Single element: matrix_name[-index] (the result is a plain vector, not a matrix)
• Single row: matrix_name[-row_index, ]
• Single column: matrix_name[, -column_index]
• A row and a column together: matrix_name[-row_index, -column_index]
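The deletion forms above do not modify the matrix in place; each returns a new object that has to be assigned. A minimal sketch with the same `m` (result names are illustrative):

```r
m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3, ncol = 2)

no_row2     <- m[-2, ]                # rows 1 and 3 remain -> 2 x 2
no_col1     <- m[, -1]                # one column left, dropped to a vector: 40 50 60
no_col1_mat <- m[, -1, drop = FALSE]  # same values kept as a 3 x 1 matrix
no_first    <- m[-1]                  # drops the first element: 20 30 40 50 60

print(no_row2)
print(no_first)
```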
Matrices
• Searching for an element with the membership operator: element %in% matrix_name
• 10 %in% m gives [1] TRUE
• Dimensions of a matrix: dim(matrix_name)
• Number of elements in a matrix: print(length(matrix_name))
• Finding the number of rows and columns
• nrow(matrix_name) – returns the number of rows
• ncol(matrix_name) – returns the number of columns
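The size and search functions can be tried together. `%in%` only reports whether a value is present; as an addition not shown on the slide, `which(..., arr.ind = TRUE)` can report where it is.

```r
m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3, ncol = 2)

print(dim(m))     # 3 2
print(length(m))  # 6 (total number of elements)
print(nrow(m))    # 3
print(ncol(m))    # 2

found <- 10 %in% m                      # TRUE
pos   <- which(m == 50, arr.ind = TRUE) # row and column of the value 50
print(pos)
```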
Matrices
• Looping over matrix elements
for (row in 1:nrow(m)) {
  for (col in 1:ncol(m)) {
    print(m[row, col]) # visit each element, row by row
  }
}
Data frame
• A data frame is a 2-D data structure with rows (records) and columns
(variables)
• Each column represents one feature, each row represents one observation.
• Unlike a matrix, columns can have different data types (numeric, character,
factor, etc.)
• Internally, a data frame is a list of equal-length vectors, where each vector is
a column.
DATA FRAME
• Creating a data frame
• data.frame(col_name1, col_name2, col_name3)
a <- c(1, 2, 3, 4, 5)
b <- c("R", "Is", "Fun!", "Let's", "Learn")
c <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
my_frame <- data.frame(a, b, c)
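A variant of the same example with explicit column names, which also shows that a data frame is a list whose columns keep their own types. The column names `num`, `word`, and `flag` are illustrative; `stringsAsFactors = FALSE` is set so that `word` stays a character column on R versions before 4.0 as well.

```r
my_frame <- data.frame(
  num  = c(1, 2, 3, 4, 5),
  word = c("R", "Is", "Fun!", "Let's", "Learn"),
  flag = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

print(is.list(my_frame))                 # TRUE: a data frame is a list of columns
col_types <- sapply(my_frame, class)     # one class per column
print(col_types)
```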
DATA FRAME
ACCESSING A DATA FRAME
dataframe[index] – returns the column at the given index (as a data frame)
dataframe[c(col1, col2, …)] – returns multiple columns at the given
indices
dataframe[["col_name"]] – returns the column with the given name (as a vector)
or
dataframe$col_name
Data Frame
• dataframe$col_name[index] – returns the element at the given index in the specified column
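The difference between single and double brackets is easiest to see side by side. This sketch uses a small illustrative data frame `df`; none of these names come from the slides.

```r
df <- data.frame(
  num  = c(1, 2, 3),
  word = c("R", "Is", "Fun!"),
  stringsAsFactors = FALSE
)

one_col_df  <- df[2]          # single brackets: still a data frame (one column)
two_cols_df <- df[c(1, 2)]    # data frame with columns 1 and 2
word_vec    <- df[["word"]]   # double brackets: the column itself, a character vector
same_vec    <- df$word        # same result as df[["word"]]
second_word <- df$word[2]     # "Is"

print(class(one_col_df))
print(class(word_vec))
```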
DATA FRAME
• summary(data_frame)
• Gives a statistical overview of each column (min, max, mean for
numeric; counts for factors).
• length(data_frame) – number of columns
• dim(data_frame) – number of rows and number of columns
• nrow(data_frame) – number of rows
• ncol(data_frame) – number of columns
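A quick sketch of these inspection functions on an illustrative data frame `df`. Note that `length()` counts columns, not cells, because a data frame is a list of columns.

```r
df <- data.frame(
  num  = c(1, 2, 3, 4, 5),
  flag = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

print(length(df))  # 2 (columns)
print(dim(df))     # 5 2
print(nrow(df))    # 5
print(ncol(df))    # 2
print(summary(df)) # per-column overview
```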
Factor
A factor is a way of categorizing or labeling data that falls into different
groups or categories. Think of it like labels or tags for data points.
• gender_vector <- c(rep("male", 10), rep("female", 15)) # create a character vector
• gender_factor <- factor(gender_vector)
• print(gender_factor) # prints all 25 values, then Levels: female male
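The same example, extended slightly to show what the factor stores: the distinct levels (sorted alphabetically by default) and a count per level. The name `counts` is illustrative.

```r
gender_vector <- c(rep("male", 10), rep("female", 15))
gender_factor <- factor(gender_vector)

print(levels(gender_factor))  # "female" "male"
counts <- table(gender_factor)
print(counts)                 # female 15, male 10
```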
Factor
dat <- rep(c("very low", "low", "medium", "high", "very high"), 5)
dat_factor <- factor(dat,
                     levels = c("very low", "low", "medium", "high", "very high"),
                     ordered = TRUE)
print(dat_factor)
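An ordered factor allows comparisons that follow the order given in `levels`, which a plain factor does not. A minimal sketch of the example above; the helper names `lvls` and `n_above_medium` are illustrative.

```r
lvls <- c("very low", "low", "medium", "high", "very high")
dat <- rep(lvls, 5)
dat_factor <- factor(dat, levels = lvls, ordered = TRUE)

print(is.ordered(dat_factor))         # TRUE
print(dat_factor[1] < dat_factor[4])  # TRUE: "very low" comes before "high"

# Count the values ranked above "medium" ("high" and "very high", 5 of each)
n_above_medium <- sum(dat_factor > "medium")
print(n_above_medium)                 # 10
```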
Handling missing values
• data <- read.csv("C:/Users/madhu/Downloads/titanic/[Link]")
reads a CSV (comma-separated values) file from disk and loads it into R
as a data frame.
• # Dataset dimensions and structure
• dim(data) – returns the dimensions of the dataset: number of rows
and number of columns
• str(data) – displays the internal structure of the dataset, including
variable types, dimensions, and sample values.
Handling missing values
# Check if any missing values exist
anyNA(data)
# Identify the exact row and column positions of all missing values (NA) in the dataset
idx <- which(is.na(data), arr.ind = TRUE)
is.na()
• Definition
• Checks whether values are missing (NA).
• What it returns
• TRUE → value is NA
• FALSE → value is not NA
Handling missing values
is.null() – is.null(data$Age)
• Definition
• Checks whether an object itself is NULL (i.e., does not exist).
na.omit() – new_data <- na.omit(data)
• Definition
• Removes all rows that contain at least one missing value (NA).
complete.cases() – complete.cases(data)
• Definition
• Identifies rows with no missing values across all columns.
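The functions above can be compared on a small made-up data frame. `toy` and its values are hypothetical stand-ins for the Titanic data, chosen only so the results are easy to verify by eye; `Fare` is deliberately absent to show what `is.null()` detects.

```r
# Hypothetical stand-in for the Titanic data
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

print(anyNA(toy))                          # TRUE: at least one NA exists
idx <- which(is.na(toy), arr.ind = TRUE)   # one (row, col) pair per NA
print(idx)

print(complete.cases(toy))                 # TRUE FALSE FALSE FALSE
cleaned <- na.omit(toy)                    # keeps only the complete row
print(nrow(cleaned))                       # 1

print(is.null(toy$Age))                    # FALSE: the column exists (even with NAs)
print(is.null(toy$Fare))                   # TRUE: there is no such column
```

`is.na()` answers "is this value missing?", while `is.null()` answers "does this object exist at all?", so the two are not interchangeable.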
Handling missing values
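• table(data$Embarked)
• Counts the frequency of each unique value (NA is not counted unless useNA = "ifany" is given).
Numeric imputation and categorical imputation
• Definition
• Numeric: replace missing values with the median of the variable.
• Categorical: replace missing values with the most frequent category (the mode).
• Example
• data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)
• data$Embarked[is.na(data$Embarked)] <- "S" # assumes "S" is the most frequent value of Embarked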
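A sketch of both imputations on a made-up data frame. Instead of hard-coding "S", the mode is derived from `table()`. The names `toy`, `counts`, and `mode_value` and all values are hypothetical.

```r
# Hypothetical stand-in for the Titanic data
toy <- data.frame(
  Age      = c(22, NA, 35, 40, NA),
  Embarked = c("S", "C", NA, "S", "Q"),
  stringsAsFactors = FALSE
)

# Numeric column: fill NAs with the median of the observed values (22, 35, 40 -> 35)
toy$Age[is.na(toy$Age)] <- median(toy$Age, na.rm = TRUE)

# Categorical column: fill NAs with the most frequent category
counts     <- table(toy$Embarked)            # NA is excluded by default
mode_value <- names(counts)[which.max(counts)]
toy$Embarked[is.na(toy$Embarked)] <- mode_value

print(toy)
```

If several categories tie for the highest count, `which.max()` returns the first one, so the choice of fill value is then somewhat arbitrary.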
Exercise
• Q1. Data Loading & Inspection
• Write R code to:
• Load the Titanic dataset from a CSV file
• Display the number of rows and columns
• Display the internal structure of the dataset
Exercise
Q2. Missing Value Detection & Summary
• Write R code to:
• Check whether missing values exist in the dataset
• Count missing values for each column
• Compute the percentage of missing values
• Create a summary data frame with variable name, missing count, and
missing percentage
Exercise
• Q3. Locating Missing Values & Data Structures
• Write R code to:
• Identify the row and column indices of all missing values
• Display the first six such indices
• Convert the index object into a data frame
• Add a column with corresponding variable names
Exercise
Q4. Logical Matrix, Matrix Indexing & Array
• Write R code to:
• Create a logical matrix indicating missing values
• Extract a 5×5 subset from this matrix
• Convert the logical matrix into an array
Exercise
Q5. Handling Missing Values
• Write R code to:
• Replace missing values in a numeric column using median imputation
• Randomly introduce missing values into a categorical column
• Replace missing categorical values using mode imputation
Exercise
Q6. Final Cleaning & Validation
• Write R code to:
• Remove rows containing any remaining missing values
• Verify that all rows in the final dataset are complete