Essential R
FOR DATA ANALYTICS
Compiled by
Isi .A. Edeoghon (PhD)
About R
R is a free, open-source programming language and software environment designed
specifically for statistical computing, data analysis, and graphical presentation.
It is widely used by data scientists, statisticians, and researchers in academia and
industry.
i. The Comprehensive R Archive Network (CRAN) has thousands of user-created packages
that make R's functionality very flexible. Many people use popular packages like ggplot2
for making graphs and tidyverse for cleaning up data.
ii. Advanced Data Visualization: R is great at making high-quality, customizable charts,
graphs, and interactive dashboards, which is very important for sharing data insights.
Benefits of R
i. Cross-Platform Compatibility: It is compatible with the main operating systems,
such as Linux, macOS, and Windows.
ii. Active Community: Through forums and other projects, a sizable and vibrant
community provides support, tutorials, and documentation.
Applications of R
i. Statistical Modeling: Epidemiologists use R's statistical models to forecast trends and model
diseases.
ii. Finance and Economics: Used for stock trend forecasting, risk analysis, and portfolio management.
iii. Bioinformatics and Healthcare: Used in clinical trials and pharmaceutical research to analyze
genomic data.
iv. Data Visualization: R is used by data scientists to visualize information contained in data.
v. Machine Learning: Used to create predictive models with packages such as randomForest and caret.
How to install and use R (Step 1)
i. Go to the CRAN Website: [Link]
ii. Choose your OS: Select the link for Linux, macOS, or Windows.
iii. Windows users: Click "install R for the first time" and then "Download R for
Windows."
How to install and use R (Step 2)
Install RStudio (Highly Recommended):
While R comes with a basic interface, almost everyone uses RStudio (now part of Posit).
It makes writing code, viewing graphs, and managing files much easier.
Visit Posit: [Link]
Click the button under "2: Install RStudio Desktop".
Install: Run the setup file and follow the instructions.
Starting with R
R is like any other programming language, with a few syntax differences of its own.
We'll run through some basics quickly before we get to Statistics and Data Analytics.
Creating a variable: This is as simple as choosing a name for the variable and using the R assignment
operator: <-
Text is distinguished from numbers with quotation marks: " " or ' '
Example:
> name <- "Simon"
> age <- 12
> name #typing name at the R command prompt
[1] "Simon" #result of typing name at the R command prompt (in blue)
> age
[1] 12
Numbers
R can be used as a simple calculator:
> 11 + 3 * 5
[1] 26
We can also print to the screen using the print() function:
> print(name)
[1] "Simon"
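The basics above can be tried out with a short sketch like the one below. It assumes nothing beyond base R; the variable names result, grouped, name_class and age_class are only illustrative.

```r
# Multiplication is evaluated before addition, so this is 11 + (3 * 5)
result <- 11 + 3 * 5

# Parentheses change the order of evaluation
grouped <- (11 + 3) * 5

# class() reports the type R has given to a value
name <- "Simon"
age <- 12
name_class <- class(name)
age_class <- class(age)

print(result)
print(grouped)
print(name_class)
print(age_class)
```

Text in quotation marks is stored as "character" while plain numbers are stored as "numeric".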
Data Structures
R has the following built-in data structures:
1. Vectors
2. Lists
3. Matrices
4. Arrays
5. Data Frames
We'll quickly go through all of them in turn.
Vectors
A vector is a one-dimensional arrangement of data.
We declare a vector with the c() function:
> courses <- c("CPE333", "CPE305", "GET301")
> courses
[1] "CPE333" "CPE305" "GET301"
> ages <- c(21, 20, 19)
> ages
[1] 21 20 19
Vectors are the most common data structures in R.
All elements of a vector share one type, so if we mix strings and numbers the numbers are converted to strings:
> student <- c("amaka", 12, "ENG2109999")
> student
[1] "amaka"      "12"         "ENG2109999"
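A minimal sketch of how vectors behave in practice, assuming only base R. The names first_age, ages_next_year and mixed are illustrative.

```r
ages <- c(21, 20, 19)

# Indexing starts at 1 in R
first_age <- ages[1]

# Arithmetic is applied to every element at once
ages_next_year <- ages + 1

# Mixing types in c() converts everything to one common type
mixed <- c("amaka", 12)
mixed_class <- class(mixed)

print(first_age)
print(ages_next_year)
print(mixed_class)
```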
Lists
Lists are similar to vectors, but each element keeps its own type.
We declare a list with the list() function:
> courses <- list("CPE333", "CPE305", "GET301")
> courses
[[1]]
[1] "CPE333"

[[2]]
[1] "CPE305"

[[3]]
[1] "GET301"

> ages <- list(21, 20, 19)
> student <- list("amaka", 12, "ENG2109999")
The lists ages and student print in the same way, one element per index.
Unlike a vector, a list can hold a mix of strings and numbers: in student, 12 stays a number.
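List elements can also be given names, which makes them easier to retrieve. The sketch below assumes the element names name, age and matric, which are only illustrative labels.

```r
# A list whose elements keep their own types
student <- list(name = "amaka", age = 12, matric = "ENG2109999")

# Access an element by name with $ or by position with [[ ]]
student_age <- student$age
student_name <- student[[1]]

# The number is still a number, not a string
age_is_numeric <- is.numeric(student_age)

print(student_name)
print(student_age)
print(age_is_numeric)
```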
Matrices
A matrix is a two-dimensional data structure in which all elements have the same data type.
We declare a matrix with the matrix() function:
> Mymatrix <- matrix(c(10,20,30,40), nrow=2, ncol=2)
> Mymatrix
     [,1] [,2]
[1,]   10   30
[2,]   20   40
Here nrow is the number of rows and ncol is the number of columns. The values fill the matrix column
by column.
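As a small sketch of how the fill order and indexing work, assuming base R only. The names m_by_col, m_by_row and element are illustrative.

```r
# Default: values fill the matrix column by column
m_by_col <- matrix(c(10, 20, 30, 40), nrow = 2, ncol = 2)

# byrow = TRUE fills row by row instead
m_by_row <- matrix(c(10, 20, 30, 40), nrow = 2, ncol = 2, byrow = TRUE)

# Index with [row, column]
element <- m_by_col[1, 2]

# dim() returns the number of rows and columns
dims <- dim(m_by_col)

print(m_by_col)
print(m_by_row)
print(element)
print(dims)
```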
Arrays
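An array is like a matrix with more than two dimensions.
We declare it with the array() function:
> myarray <- array(c(1:10), dim = c(3, 2, 2))
> myarray
, , 1

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

, , 2

     [,1] [,2]
[1,]    7   10
[2,]    8    1
[3,]    9    2

The array has 12 cells but we supplied only 10 values, so R reuses the values from the start (1, 2) to fill
the last two cells.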
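Individual slices and cells of an array can be picked out with one index per dimension. This is a short sketch using base R only; second_slice and one_value are illustrative names.

```r
myarray <- array(1:10, dim = c(3, 2, 2))

# Leave an index empty to keep everything along that dimension
second_slice <- myarray[, , 2]

# [row, column, slice] selects a single cell
one_value <- myarray[2, 1, 2]

print(second_slice)
print(one_value)
```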
Data Frames
A Data Frame is used to store information in rows and columns, where rows represent
observations and columns represent variables.
We create a data frame with the data.frame() function.
> dataframe <- data.frame(name = c('Simon', 'Praise', 'Ufuoma'),
                          course = c('GET305', 'GET305', 'GET305'),
                          score = c(71, 73, 72))
> dataframe
    name course score
1  Simon GET305    71
2 Praise GET305    73
3 Ufuoma GET305    72
Accessing Data: To access a specific column we use the $ sign. To access rows or individual
cells we use the [ ] symbol, giving the row number and then the column number as
positional selectors.
> dataframe$name
[1] "Simon" "Praise" "Ufuoma"
> dataframe[3,2]
[1] "GET305"
> dataframe[3,3]
[1] 72
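Beyond fixed positions, rows can also be selected with a condition. The sketch below rebuilds the same data frame so that it runs on its own; the cut-off of 71 is an arbitrary value chosen for illustration.

```r
dataframe <- data.frame(name = c("Simon", "Praise", "Ufuoma"),
                        course = c("GET305", "GET305", "GET305"),
                        score = c(71, 73, 72),
                        stringsAsFactors = FALSE)

# Leave the column position empty to get a whole row
second_row <- dataframe[2, ]

# A logical condition in the row position keeps only matching rows
high_scores <- dataframe[dataframe$score > 71, ]

# A column name in the column position returns just that column
high_names <- dataframe[dataframe$score > 71, "name"]

print(second_row)
print(high_scores)
print(high_names)
```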
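We can add new rows to a data frame with the rbind() function. Passing the new row as a one-row data
frame keeps the score column numeric:
> rbind(dataframe, data.frame(name = "Brad", course = "GET305", score = 53))
    name course score
1  Simon GET305    71
2 Praise GET305    73
3 Ufuoma GET305    72
4   Brad GET305    53
Here we added a new student called "Brad". Note that rbind() returns a new data frame and leaves
dataframe itself unchanged unless we assign the result to a variable.
We can also add new columns with the cbind() function.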
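A minimal sketch of cbind() on the same data. The column name passed and the pass mark of 50 are assumptions made only for this illustration.

```r
dataframe <- data.frame(name = c("Simon", "Praise", "Ufuoma"),
                        course = c("GET305", "GET305", "GET305"),
                        score = c(71, 73, 72),
                        stringsAsFactors = FALSE)

# 'passed' is a hypothetical column; the pass mark of 50 is an assumption
with_result <- cbind(dataframe, passed = dataframe$score >= 50)

print(with_result)
```

Like rbind(), cbind() returns a new data frame, so the result is stored in a new variable here.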
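We can remove rows and columns by putting negative indices, built with the c() function, inside the index
selector [ ]. Starting from a copy of the data frame that includes Brad:
> extended <- rbind(dataframe, data.frame(name = "Brad", course = "GET305", score = 53))
> extended[-c(4), ]
    name course score
1  Simon GET305    71
2 Praise GET305    73
3 Ufuoma GET305    72
Here we removed the fourth row, which contains the new student called "Brad". To remove a column
instead, place the negative index after the comma, e.g. extended[, -c(2)].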
We can find out more about the data in a dataset by using the summary() function:
> summary(dataframe)
     name              course              score
 Length:3           Length:3           Min.   :71.0
 Class :character   Class :character   1st Qu.:71.5
 Mode  :character   Mode  :character   Median :72.0
                                       Mean   :72.0
                                       3rd Qu.:72.5
                                       Max.   :73.0
R has a number of built-in data frames such as mtcars, iris
and airquality.
e.g.
> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
…………………………………………
> iris
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
……………………………………………
Statistics
Statistics is the science of learning from data. It involves the collection, organization, analysis, interpretation, and
presentation of information to uncover patterns, make predictions, and handle uncertainty.
There are two main types of statistics:
Descriptive statistics summarize the features of a dataset without extending conclusions beyond the data.
Key measures include central tendency (mean, median, mode) and dispersion (range, variance, standard
deviation), often visualized through charts like histograms and pie charts.
Inferential statistics generalize from samples to populations, using hypothesis testing to assess statistical
significance, p-values to express the strength of evidence, and regression analysis to explore relationships between variables.
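To make the two types concrete, here is a minimal sketch on the built-in mtcars dataset. The choice of a two-sample t-test comparing mpg between automatic (am = 0) and manual (am = 1) cars is an assumption made for illustration, not a method prescribed by these notes.

```r
data(mtcars)

# Descriptive: summarize the sample itself
mpg_mean <- mean(mtcars$mpg)
mpg_sd <- sd(mtcars$mpg)

# Inferential: test whether mean mpg differs between
# automatic (am = 0) and manual (am = 1) cars
test <- t.test(mpg ~ am, data = mtcars)
p_value <- test$p.value

print(mpg_mean)
print(mpg_sd)
print(p_value)
```

A small p-value suggests that the difference seen in the sample is unlikely to be due to chance alone.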
Data Analytics
Data Analytics is the practical application of statistical principles to discover actionable insights or influence
decision making.
There are four types of Data Analytics:
Descriptive Analytics involves analyzing historical data to identify trends, such as a retail store reviewing last
month's sales.
Diagnostic Analytics explores the reasons behind trends, like understanding a sales drop in a region. It uses
tools like correlation, regression or comparison.
Predictive Analytics forecasts future outcomes by utilizing past data, often through Machine Learning
techniques.
Prescriptive Analytics recommends actions to achieve goals, e.g. an AI system that automatically
adjusts train ticket prices based on real-time demand and weather patterns.
Data Analytics Types
Data Analytics Process (Life Cycle)
Data Analytics follows five main steps:
Definition: Defining the research question
Data Collection: Collecting raw information
Data Preprocessing: Cleaning the data to eliminate inaccuracies and duplicates
Data Analysis: Applying analytical models
Data Visualization: Visualizing the results using tools like ggplot2 for R, Power BI, Tableau, or
Matplotlib.
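The first four steps of the life cycle can be sketched in a few lines of base R. The research question, the use of the built-in airquality dataset and the choice of a correlation as the "analytical model" are all assumptions made for illustration; the visualization step is left to the plotting functions covered elsewhere in these notes.

```r
# 1. Definition: does ozone rise with temperature? (illustrative question)

# 2. Data collection: load a built-in dataset
data(airquality)
raw <- airquality

# 3. Data preprocessing: drop rows with missing values, then duplicates
clean <- unique(na.omit(raw))

# 4. Data analysis: a simple correlation between the two variables
ozone_temp_cor <- cor(clean$Ozone, clean$Temp)

print(nrow(raw))
print(nrow(clean))
print(ozone_temp_cor)
```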
Importance of Data Analytics
Let's not forget our built-in datasets
R has a number of built-in data frames such as mtcars, iris
and airquality.
Data Analytics with R
Let's play around with the mtcars dataset using the following functions in R:
> ?mtcars
> summary(mtcars)
> mtcars$column_name #replace column_name with a real column such as mpg; try as many columns as you want
NOTE:
You can save a column as a variable e.g. newvar <- mtcars$hp
Write down your observations for each process.
Can you now tell more about this dataset?
What is a "dataset"?
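A few more base R functions help with a first look at a dataset. This is only a sketch of one possible starting point; the variable names dims, column_names and first_rows are illustrative.

```r
data(mtcars)

# Number of rows (observations) and columns (variables)
dims <- dim(mtcars)

# Names of the columns
column_names <- names(mtcars)

# First few rows only, instead of the whole table
first_rows <- head(mtcars, 3)

# Compact overview of every column and its type
str(mtcars)

print(dims)
print(column_names)
print(first_rows)
```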
Data visualization
Data visualization is the process of displaying data using visual components like graphs, charts, and
maps.
It makes large datasets easier to comprehend and helps reveal patterns and trends
that aid in improved decision-making.
Types of Data visualization in R
Barplot
>barplot(mtcars$hp,main ='Vehicle Horsepower',xlab ='horsepower', horiz = TRUE)
This plots a horizontal bar graph of vehicle horsepower. For a vertical graph use: horiz = FALSE
Histogram
> hist(mtcars$gear,main ='Vehicle Gears',xlab ='gears', ylab='frequency', col='blue')
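The bar heights in a histogram of gears are simply counts of how many cars have each number of gears. As a sketch, the same counts can be checked numerically with table(); gear_counts is an illustrative name.

```r
data(mtcars)

# Count how many cars have 3, 4 and 5 gears
gear_counts <- table(mtcars$gear)

print(gear_counts)
```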
Boxplot
A boxplot depicts information like the minimum and maximum data points,
the median value, the first and third quartiles and the interquartile range.
> boxplot(mtcars$gear, main = 'Vehicle Gears', xlab = 'gears', col = 'blue',
          horizontal = FALSE, notch = FALSE)
This plots a vertical boxplot of vehicle gears. For a horizontal boxplot use: horizontal = TRUE
(boxplot() spells this argument horizontal, unlike the horiz argument of barplot().)
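The numbers a boxplot is drawn from can also be computed directly. The sketch below uses the base R functions fivenum() and IQR(); the variable names are illustrative.

```r
data(mtcars)

# Minimum, lower hinge, median, upper hinge, maximum
gear_five <- fivenum(mtcars$gear)

# Interquartile range: third quartile minus first quartile
gear_iqr <- IQR(mtcars$gear)

print(gear_five)
print(gear_iqr)
```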
Scatter Plot
Scatter plots are used to show whether there is a correlation between
two variables and to gauge the direction and strength of that
relationship.
> plot(mtcars$gear, mtcars$am, main = 'Vehicle Gears', xlab = 'gears',
       ylab = 'Automatic or Manual', col = 'blue')
NOTE:
Scatter plots can draw the dots together with a connecting line: just add type = "b" to the arguments of the
plot() function (type = "l" draws the line only).
Exercise
Write a little on these other types of visualization in R:
i. Heat Maps
ii. 3D graphs
Apply them to the mtcars dataset
Write down your observations
Can you now tell more about this dataset?
Other Statistical Metrics in R
Mean, Minimum and Maximum
> mean(mtcars$gear)
[1] 3.6875
> min(mtcars$gear)
[1] 3
> max(mtcars$gear)
[1] 5
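The other descriptive measures can be obtained in the same way. Base R has no built-in function for the statistical mode, so the sketch below takes the most frequent value from table(); that approach is one common workaround, not the only one.

```r
data(mtcars)

# Central tendency
gear_median <- median(mtcars$gear)

# Mode: the most frequent value, taken from a frequency table
gear_mode <- names(which.max(table(mtcars$gear)))

# Dispersion
gear_range <- range(mtcars$gear)
gear_var <- var(mtcars$gear)
gear_sd <- sd(mtcars$gear)

print(gear_median)
print(gear_mode)
print(gear_range)
print(gear_var)
print(gear_sd)
```

Note that names() returns the mode as text, so it prints with quotation marks.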
Visualization with ggplot2
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.
You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes
care of the details.
Installation
# The easiest way to get ggplot2 is to install the whole tidyverse:
> install.packages("tidyverse")
# Alternatively, install just ggplot2:
> install.packages("ggplot2")
After installation you should see something like this:
The downloaded binary packages are in
C:\Users\isied\AppData\Local\Temp\RtmpEh8qkN\downloaded_packages
How to use ggplot2:
After installation use the following:
>library(ggplot2)
Let’s apply it to our mtcars dataset and observe what happens:
>ggplot(mtcars, aes(x = wt, y = mpg))+ geom_point()
We can see the plot on the right
We can take it further by enhancing the information presented on
the graph. Use this:
> ggplot(mtcars, aes(x = wt, y = mpg))+ geom_point()+
labs(title="MPG vs Weight", x = "Weight (1000 lbs)",
y = "Miles per Gallon")
We can see the plot on the right
We can again take it further by enhancing the information
presented on the graph. Use this:
> ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() +
    labs(title = "MPG vs Weight", x = "Weight (1000 lbs)",
         y = "Miles per Gallon") +
    geom_smooth(method = "lm", col = "blue")
Each + must end a line rather than start the next one; otherwise R treats the first line as a
complete command.
We can plot a trend line by passing method = "lm" to the geom_smooth() function.
We can see from the plot on the right that there is a clear negative correlation between the weight
of the vehicles and the miles per gallon (mpg): heavier vehicles travel fewer miles per gallon.
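The relationship seen in the plot can also be checked numerically. This sketch uses only base R: cor() for the correlation and lm() for the slope of the straight line that method = "lm" fits. The variable names are illustrative.

```r
data(mtcars)

# Correlation between weight and miles per gallon
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)

# Slope of the straight-line fit of mpg on wt
trend <- lm(mpg ~ wt, data = mtcars)
trend_slope <- coef(trend)["wt"]

print(wt_mpg_cor)
print(trend_slope)
```

A correlation close to -1 and a negative slope both say the same thing: mpg falls as weight rises.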
Multiple Linear Regression
Multiple linear regression models the linear
relationship between a continuous dependent variable and multiple independent
variables. In R we fit it with the lm() function (linear model).
Let us apply Multiple Linear Regression to our mtcars dataset:
# Load the built-in mtcars dataset
>data(mtcars)
# Model 'mpg' (miles per gallon) as a function of 'hp' (horsepower) and 'wt' (weight)
>model_regression <- lm(mpg ~ hp + wt, data = mtcars)
# View the summary of the model
>summary(model_regression)
The output of this summary is:
Call:
lm(formula = mpg ~ hp + wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.941 -1.600 -0.182  1.050  5.854

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
hp          -0.03177    0.00903  -3.519  0.00145 **
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
Exercise:
Explain the following from the preceding output:
i. Estimate and Std. Error
ii. Multiple R-squared
iii. Adjusted R-squared
iv. F-statistic
v. p-value
What do they tell us about our model on the mtcars dataset?
# Create new data to predict on
> new_data <- data.frame(hp = c(100, 150), wt = c(2.5, 3.1))
# Predict mpg for new data
> predictions_regression <- predict(model_regression, new_data)
> print(predictions_regression)
OUTPUT
       1        2
24.35540 20.44005
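To see where these predictions come from, the sketch below rebuilds them by hand from the fitted coefficients: intercept + (hp coefficient × hp) + (wt coefficient × wt). It refits the same model so that it runs on its own; the names b and manual are illustrative.

```r
data(mtcars)
model_regression <- lm(mpg ~ hp + wt, data = mtcars)
new_data <- data.frame(hp = c(100, 150), wt = c(2.5, 3.1))

# Fitted coefficients: intercept, hp and wt
b <- coef(model_regression)

# Prediction by hand from the regression equation
manual <- unname(b["(Intercept)"] + b["hp"] * new_data$hp + b["wt"] * new_data$wt)

# Prediction with predict(), for comparison
predicted <- unname(predict(model_regression, new_data))

print(manual)
print(predicted)
```

Both approaches give the same values, which shows that predict() simply applies the regression equation to the new rows.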
Decision Trees
A decision tree is a supervised learning method that acts like a flowchart.
It works by asking a series of binary (yes/no) questions to split your data into increasingly smaller, more
"pure" groups.
It is mostly used for classification tasks.
The rpart and rpart.plot packages are commonly used.
Install rpart and rpart.plot
> install.packages("rpart")
> install.packages("rpart.plot")
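The idea of splitting data into "purer" groups can be illustrated without any extra packages. The sketch below asks one yes/no question of mtcars (is the car manual?) and checks how well the most common number of gears describes each resulting group. The share of the most common value is used here as a simple stand-in for purity; it is an assumption for illustration and not the exact measure that rpart computes.

```r
data(mtcars)

# Cross-tabulate the question (am: 0 = automatic, 1 = manual)
# against the value we want to predict (gear)
split_table <- table(am = mtcars$am, gear = mtcars$gear)

# Before the split: share of the most common gear value overall
purity_before <- max(table(mtcars$gear)) / nrow(mtcars)

# After the split: share of the most common gear value in each group
purity_after <- apply(split_table, 1, max) / rowSums(split_table)

print(split_table)
print(purity_before)
print(purity_after)
```

Both groups are more uniform than the full dataset, which is what makes am a useful question for a tree to ask.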
Let's make use of these libraries:
> library(rpart)
> library(rpart.plot)
Let's try and see if we can predict the number of gears (gear) of a car from whether it is automatic or manual (am)
> model_tree <- rpart(gear ~ am, data = mtcars, method = "class")
> # Plot the decision tree
> rpart.plot(model_tree, box.palette = "auto", nn = TRUE)
Find the output on the next page
We can also make predictions on the data:
# Make predictions on the original data (or a separate test set)
> predictions_tree <- predict(model_tree, mtcars, type = "class")
> # View predictions
> print(predictions_tree)
OUTPUT
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive
                  4                   4                   4                   3
  Hornet Sportabout             Valiant          Duster 360           Merc 240D
                  3                   3                   3                   3
           Merc 230            Merc 280           Merc 280C          Merc 450SE
                  3                   3                   3                   3
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental
                  3                   3                   3                   3
……………………………..
Conclusion
In conclusion, we have explored the basics of R as it applies to Data Analytics.
Exercise:
Generate a confusion matrix for the predicted values vs the actual values from our decision tree model.
There are many other machine learning algorithms, such as:
Random Forest
Support Vector Machine (SVM)
Logistic Regression
i. Apply any or all of them to our mtcars dataset
ii. Write down your observations
iii. What are your insights from the dataset?