Decision Trees in Data Analytics with R

The document discusses the concepts of classification and clustering within the context of machine learning, detailing three types of learning: supervised, unsupervised, and reinforcement learning. It explains decision trees and random forests as key algorithms for classification, including their structures, advantages, and disadvantages, as well as the processes for building and optimizing them. Additionally, it covers important metrics for decision tree algorithms, such as information gain and Gini index, and provides examples of their application.

SNS COLLEGE OF ENGINEERING

Kurumbapalayam (Po), Coimbatore – 641 107

An Autonomous Institution
Accredited by NBA – AICTE and Accredited by NAAC – UGC with ‘A’ Grade
Approved by AICTE, New Delhi & Affiliated to Anna University, Chennai

DEPARTMENT OF COMPUTER SCIENCE AND TECHNOLOGY

COURSE NAME : 19CS407-DATA ANALYTICS WITH R

II YEAR /IV SEMESTER

Unit II – Statistics and Prescriptive Analytics


Topic : Random forest and Decision Trees

November 4, 2025 Big data and Data science/ 19CS407-Data Analytics wit R/KALPANA C/CST/SNSCE 1/20
Brain Storming

What is classification and clustering?

Forms of Learning
At a broad level, machine learning can be classified into three types:
• Supervised learning
• Unsupervised learning
• Reinforcement learning

1. Supervised Learning
• In supervised learning, we train on known, labeled data. Because the correct answers are known,
the learning is supervised, i.e., directed toward successful execution.
• The input data goes through the machine learning algorithm and is used to train the model. Once
the model is trained on the known data, you can feed unknown data into the model and get
a new response.
For example, the model tries to figure out whether the data is an apple or another fruit.
Once the model has been trained well, it identifies the data as an apple and gives
the desired response.
The most widely used supervised learning algorithms are:
Polynomial regression
Random forest
Linear regression
Logistic regression
Decision trees
K-nearest neighbors
Naive Bayes
2. Unsupervised Learning
In unsupervised learning, the training data is unknown and unlabeled, meaning that no one has
looked at the data before. Without known data, the algorithm cannot be guided toward
a correct output, which is where the term unsupervised originates.

Algorithms:
Partial least squares
Fuzzy c-means
Singular value decomposition
K-means clustering
Apriori
Hierarchical clustering
Principal component analysis
3. Reinforcement Learning
• As in traditional types of data analysis, the algorithm here discovers data through a process of
trial and error and then decides which action results in higher rewards.
• Three major components make up reinforcement learning: the agent, the environment, and the
actions.
• The agent is the learner or decision-maker, the environment includes everything that the agent
interacts with, and the actions are what the agent does.
• Reinforcement learning occurs when the agent chooses actions that maximize the expected
reward over a given time. This is easiest to achieve when the agent is working within a sound
policy framework.
Regression algorithms are used to predict continuous values such as price, salary, age, etc.
Classification algorithms are used to predict (classify) discrete values such as Male or Female,
True or False, Spam or Not Spam, etc.
• Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for solving
classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further
branches.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
• The diagram below explains the general structure of a decision tree:
• A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model.
Below are two reasons for using a decision tree:
•Decision trees usually mimic the way humans think when making a decision, so they
are easy to understand.
•The logic behind a decision tree can be easily followed because it is laid out as
a tree-like structure.
Decision Tree Terminologies
•Root Node: The root node is where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
•Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split
further after reaching a leaf node.
•Splitting: Splitting is the process of dividing a decision node/root node into
sub-nodes according to the given conditions.
•Branch/Sub Tree: A subtree formed by splitting the tree.
•Pruning: Pruning is the process of removing unwanted branches from the tree.
•Parent/Child node: A node that is divided into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes.
How does the Decision Tree algorithm Work?

To predict the class of a given record, the algorithm starts
from the root node of the tree.
It compares the value of the root attribute with the corresponding attribute of the record
and, based on the comparison, follows the branch and jumps to the next
node. The same comparison is repeated at each node until a leaf node is reached.
Decision Trees Algorithm
The complete process can be better understood using the algorithm below:
•Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
•Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
•Step-3: Divide S into subsets that contain the possible values of the best attribute.
•Step-4: Generate the decision tree node that contains the best attribute.
•Step-5: Recursively build new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further;
such a final node is called a leaf node.
Example: Suppose a candidate has a job offer and wants to decide whether to
accept it or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by ASM). The root node splits into the next decision node (distance from the office) and one leaf node, based
on the corresponding labels. That decision node splits further into one decision node (Cab facility) and one leaf
node. Finally, the Cab facility decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the
diagram below:
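The job-offer example can be sketched in Python as a nested structure that is walked from the root to a leaf. This is only an illustration: the yes/no feature names (`salary_acceptable`, `office_nearby`, `cab_facility`) and the choice of which branch leads to which leaf are assumptions, since the text does not give the exact conditions.

```python
# Each decision node is a dict; each leaf is a plain string label.
# Feature names and branch outcomes are hypothetical, chosen to match the
# shape described in the text (Salary -> Distance -> Cab facility).
job_offer_tree = {
    "question": "salary_acceptable",          # root node
    "no": "Declined offer",                   # leaf
    "yes": {
        "question": "office_nearby",          # decision node
        "no": "Declined offer",               # leaf
        "yes": {
            "question": "cab_facility",       # decision node
            "yes": "Accepted offer",          # leaf
            "no": "Declined offer",           # leaf
        },
    },
}


def predict(tree, record):
    """Start at the root and follow one branch per answer until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        answer = "yes" if record[node["question"]] else "no"
        node = node[answer]
    return node


offer = {"salary_acceptable": True, "office_nearby": True, "cab_facility": False}
decision = predict(job_offer_tree, offer)
```

Here `decision` ends up as the label of the leaf reached by answering each question in turn, which is the traversal described in "How does the Decision Tree algorithm Work?".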
Consider a weather dataset on the basis of which we will
determine whether or not to play football.
Decision tree for the above dataset
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the
best attribute for the root node and for the sub-nodes. To solve this problem,
there is a technique called the Attribute Selection Measure, or ASM.
• With this measure, we can easily select the best attribute for the nodes of the
tree.
There are two popular techniques for ASM:
•Information Gain
•Gini Index
1. Information Gain:
•Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
•It calculates how much information a feature provides us about a class.
•According to the value of information gain, we split the node and build
the decision tree.
•A decision tree algorithm always tries to maximize the value of information
gain, and the node/attribute having the highest information gain is split first. It can
be calculated using the formula below:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy:
Entropy is a metric that measures the impurity of a given attribute.
It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,
•S = the set of all samples
•P(yes)= probability of yes
•P(no)= probability of no
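As a minimal sketch, the two-class entropy formula above can be written as a small Python function that takes the yes and no counts. The function name and the convention of treating 0 * log2(0) as 0 are choices made for this illustration.

```python
import math


def entropy(yes_count, no_count):
    """Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no), computed from counts."""
    total = yes_count + no_count
    result = 0.0
    for count in (yes_count, no_count):
        if count:  # skip empty classes: 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result


pure = entropy(4, 0)     # all samples in one class, no impurity
mixed = entropy(7, 7)    # evenly split classes, maximum impurity
```

A pure node gives an entropy of 0 and an evenly split node gives 1, which are the two extremes of the measure for a yes/no target.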
2. Gini Index:
•Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
•An attribute with a low Gini index should be preferred over one with a high
Gini index.
•The CART algorithm uses the Gini index to create its splits, and it creates only
binary splits.
•Gini index can be calculated using the below formula:
Gini Index = 1 - ∑j (Pj)²
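The Gini index formula can be sketched in Python as follows, taking a list of per-class sample counts. The function name and the decision to return 0 for an empty node are assumptions of this illustration.

```python
def gini(class_counts):
    """Gini index = 1 - sum over classes j of Pj squared."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty node is treated as pure in this sketch
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


pure = gini([4, 0])    # one class only
mixed = gini([5, 5])   # two classes, evenly split
```

For two classes the value ranges from 0 (pure node) to 0.5 (evenly split node), so a lower Gini index means a purer split.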
Pruning: Getting an Optimal Decision tree
Pruning is the process of deleting unnecessary nodes from a tree in order to get
the optimal decision tree.
A tree that is too large increases the risk of overfitting, and a small tree may not capture
all the important features of the dataset. A technique that decreases the
size of the learned tree without reducing accuracy is therefore known as pruning. There
are mainly two types of tree pruning techniques:
•Cost Complexity Pruning
•Reduced Error Pruning.
Advantages of the Decision Tree
•It is simple to understand because it follows the same process that a human follows
while making a decision in real life.
•It can be very useful for solving decision-related problems.
•It helps to think about all the possible outcomes for a problem.
•There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
•The decision tree contains lots of layers, which makes it complex.
•It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
•For more class labels, the computational complexity of the decision tree may
increase.
There are several algorithms for building a
decision tree. Two of them are:
1. CART (Classification and Regression Trees): this uses
Gini impurity as the metric.
2. ID3 (Iterative Dichotomiser 3): this uses entropy and information
gain as the metric.
Problem on Decision Tree - Classification using the ID3 algorithm

First, find the entropy of the class variable:
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Logarithms are taken to base 2. In total there are 14 yes/no outcomes, of which 9 are yes
and 5 are no, so P(yes) = 9/14 and P(no) = 5/14.
E(S) = -[(9/14)log2(9/14) + (5/14)log2(5/14)] = 0.94
From the data for Outlook, we can easily arrive at the yes/no counts for each of its three
values: (3, 2), (4, 0), and (2, 3).

Now we have to calculate the average weighted entropy, i.e., the sum over the values of the
feature of each group's weight (its share of the samples) multiplied by that group's entropy.
E(S, Outlook) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
= (5/14)(-(3/5)log2(3/5) - (2/5)log2(2/5)) + (4/14)(0) + (5/14)(-(2/5)log2(2/5) - (3/5)log2(3/5))
= 0.693
The next step is to find the information gain. It is the difference between the parent
entropy and the average weighted entropy found above.
IG(S, Outlook) = 0.940 - 0.693 = 0.247
Similarly, find the information gain for Temperature, Humidity, and Windy.
IG(S, Temperature) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.892 = 0.048
Now select the feature with the largest information gain. Here it is Outlook, so it
forms the first node (root node) of our decision tree.
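The Outlook calculation can be checked with a short Python sketch that works directly from the (yes, no) counts quoted in the worked example: (9, 5) for the whole dataset and (3, 2), (4, 0), (2, 3) for the three Outlook groups. The function names are illustrative, and the counts for the other attributes are not given in the text, so only Outlook is reproduced here.

```python
import math


def entropy(counts):
    """Entropy of a node from its per-class counts; empty classes contribute 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the sample-weighted average entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted


outlook_groups = [(3, 2), (4, 0), (2, 3)]   # (yes, no) counts per Outlook value
gain_outlook = information_gain((9, 5), outlook_groups)
print(round(gain_outlook, 3))  # 0.247
```

The same two functions would give the gains for Temperature, Humidity, and Windy once their per-value counts are read off the dataset table.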
Classification using CART algorithm

Classification using CART is similar to the ID3 procedure, but instead of entropy we use Gini
impurity.
As the first step, we find the root node of our decision tree. For that,
calculate the Gini index of the class variable:
Gini(S) = 1 - [(9/14)² + (5/14)²] = 0.459
As the next step, we calculate the Gini gain. For that, we first find the
average weighted Gini impurity of Outlook, Temperature, Humidity, and Windy.
First, consider the case of Outlook:
Gini(S, Outlook) = (5/14)*gini(3,2) + (4/14)*gini(4,0) + (5/14)*gini(2,3)
= (5/14)(1 - (3/5)² - (2/5)²) + (4/14)*0 + (5/14)(1 - (2/5)² - (3/5)²)
= 0.171 + 0 + 0.171 = 0.342
Gini gain(S, Outlook) = 0.459 - 0.342 = 0.117
Gini gain(S, Temperature) = 0.459 - 0.4405 = 0.0185
Gini gain(S, Humidity) = 0.459 - 0.3674 = 0.0916
Gini gain(S, Windy) = 0.459 - 0.4286 = 0.0304
Choose the attribute that has the highest Gini gain. The Gini gain is highest for Outlook,
so we can choose it as our root node.
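A minimal Python sketch of the same Gini gain calculation for Outlook, again using only the counts quoted in the worked example, is shown below. Function names are illustrative. Note that the unrounded result is about 0.116, slightly below the 0.117 obtained above, because the hand calculation subtracts intermediate values that were already rounded to three decimals.

```python
def gini(counts):
    """Gini impurity of a node from its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_gain(parent_counts, child_counts):
    """Parent Gini impurity minus the sample-weighted Gini impurity of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * gini(child) for child in child_counts)
    return gini(parent_counts) - weighted


gain_outlook = gini_gain((9, 5), [(3, 2), (4, 0), (2, 3)])
print(round(gain_outlook, 3))  # 0.116
```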
Random Forest Algorithm

• "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve
the predictive accuracy of that dataset."
• The random forest takes the prediction from each tree and predicts the final output
based on the majority vote of those predictions.
• A greater number of trees in the forest generally leads to higher accuracy and
reduces the risk of overfitting.

The diagram below explains the working of the Random Forest algorithm:
How does the Random Forest algorithm work?
Random Forest works in two phases: the first creates the random forest by combining N
decision trees, and the second makes a prediction with each tree created in the first phase.

The working process can be explained in the steps below:

Step-1: Select K random data points from the training set.
Step-2: Build the decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree, and assign the new
data point to the category that wins the majority vote.
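The five steps can be sketched in plain Python as shown below. This is a simplified illustration, not a full random forest: each "tree" is a one-split stump on a single numeric feature, the K points are drawn with replacement, and the toy dataset, the seed, and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def train_stump(points):
    """Fit a one-split tree on (value, label) pairs; a stand-in for a full decision tree."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        left = [label for x, label in points if x <= threshold]
        right = [label for x, label in points if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def train_forest(points, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):                             # Step-3 and Step-4
        subset = [rng.choice(points) for _ in range(k)]  # Step-1: K random points
        forest.append(train_stump(subset))               # Step-2: one tree per subset
    return forest


def predict(forest, x):
    votes = Counter(tree(x) for tree in forest)          # Step-5: collect the votes
    return votes.most_common(1)[0][0]                    # majority wins


# Hypothetical toy data: small values are labeled "no", large values "yes".
points = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]
forest = train_forest(points, n_trees=25, k=6)
```

Individual stumps trained on an unlucky subset can be wrong, but the majority vote over many trees smooths this out, which is the motivation for combining trees.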

Example: Suppose there is a dataset that contains multiple fruit images, and this dataset is
given to the Random Forest classifier. The dataset is divided into subsets, and each subset is given to one
decision tree.
During the training phase, each decision tree produces a prediction result. When a new
data point arrives, the Random Forest classifier predicts the final decision based on the majority of
those results. Consider the image below:

Applications of Random Forest

Random Forest is mostly used in four sectors:
Banking: The banking sector mostly uses this algorithm to identify loan
risk.
Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
Land Use: We can identify the areas of similar land use by this algorithm.
Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

• Random Forest is capable of performing both classification and regression
tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and reduces the overfitting issue.
Disadvantages of Random Forest
• Although random forest can be used for both classification and regression
tasks, it is less suitable for regression tasks.

Time Series Analysis

What is Time Series Analysis?


• Definition: A time series is a sequence of data points recorded
in successive order over a given period of time.

Significance of Time Series and its types

• TSA is the backbone for prediction and forecasting analysis, specific to the
time-based problem statements.
• Analyzing the historical dataset and its patterns
• Understanding and matching the current situation with patterns derived from
the previous stage.
• Understanding the factor or factors influencing certain variable(s) in different
periods.

• With the help of time series, we can prepare numerous time-based analyses and
results:
o Forecasting
o Segmentation
o Classification
o Descriptive analysis
o Intervention analysis

Linear Regression
It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age,
product price, etc.
Definition:
The linear regression algorithm models a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression.
Because the relationship is linear, the model shows how the
value of the dependent variable changes with the value of the
independent variable.

The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:

The linear regression equation is of the form:
y = mx + c
y = dependent variable
x = independent variable
m = slope of the line
c = intercept

Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.

How Does Linear Regression Work?
1. With the available data, we plot a graph with Area on the X-axis and Rent on the Y-axis.
2. Next, we find the mean of Area and Rent.
3. We then plot the mean on the graph.
4. We draw a line of best fit that passes through the mean.
5. But we encounter a problem: multiple lines can be drawn through the mean.
6. To overcome this problem, we keep moving the line until the best-fit line has the least
square distance from the data points.
7. The least-square distance is found by adding the squares of the residuals.
8. A residual is the distance between Y-actual and Y-pred.
9. The values of m and c for the best-fit line, y = mx + c, can be calculated using these
formulas:
10. This helps us find the corresponding values:
11. With that, we can obtain the values of m and c.
12. Now, we can find the value of Y-pred.
13. After calculating, we find that the least square value for the line below is 3.02.
14. Finally, we plot Y-pred, which turns out to be the best-fit line.
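The procedure above can be sketched in Python using the standard closed-form least-squares formulas, m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and c = ȳ - m·x̄. The slide's own formulas are not reproduced in the text, so this form is an assumption about what was shown; it is consistent with the line passing through the mean point. The area and rent values are hypothetical and are not the data behind the 3.02 figure.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x   # forces the line through (mean_x, mean_y)
    return m, c


def sum_squared_residuals(xs, ys, m, c):
    """Add up the squares of the residuals (Y-actual minus Y-pred)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only.
area = [1, 2, 3, 4, 5]
rent = [3, 4, 8, 9, 11]

m, c = fit_line(area, rent)
y_pred = [m * x + c for x in area]
least_square_value = sum_squared_residuals(area, rent, m, c)
```

For this toy data the fit gives m ≈ 2.1 and c ≈ 0.7, with a sum of squared residuals of about 1.9; any other line through the data would have a larger value.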

Model Performance:
Goodness of fit can be assessed with the following method:
R-squared method:
• R-squared is a statistical measure of goodness of fit.
• It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
• A high value of R-squared indicates a small difference between the predicted values
and the actual values and hence represents a good model.
• It is also called the coefficient of determination, or the coefficient of multiple determination
for multiple regression.
It can be calculated from the below formula:
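The formula itself is not reproduced in the text. A common way to write it, assumed here, is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the actual values from their mean. A minimal Python sketch with hypothetical values:

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual values and predictions from a fitted line.
actual = [3, 4, 8, 9, 11]
predicted = [2.8, 4.9, 7.0, 9.1, 11.2]
score = r_squared(actual, predicted)
```

A perfect prediction gives a score of 1, while always predicting the mean of the actual values gives 0; the example above lands at roughly 0.96.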

Multiple Linear Regression

Multiple Linear Regression formula:
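The formula is not reproduced in the text. The usual form, stated here as an assumption about what the slide showed, extends y = mx + c to n independent variables, with b_0 as the intercept and b_1 ... b_n as the coefficients (symbol names chosen for this sketch):

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
```

With n = 1 this reduces to the simple linear regression equation, with b_0 playing the role of c and b_1 the role of m.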

Logistic Regression

• Logistic regression is a supervised learning technique.
• It is used for predicting a categorical dependent variable from a given set of
independent variables. Therefore the outcome must be a categorical or discrete
value.
• The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
values 0 and 1, the model gives probabilistic values that lie between 0 and 1.


• Logistic regression is very similar to linear regression except in how they
are used. Linear regression is used for solving regression problems, whereas
logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function, whose output is bounded by two limiting values (0 and 1).

Applications of Logistic Regression

• Using the logistic regression algorithm, banks can predict whether a customer
would default on loans or not
• To predict the weather conditions of a certain place (sunny, windy, rainy, humid,
etc.)
• Ecommerce companies can identify buyers if they are likely to purchase a certain
product
• Companies can predict whether they will gain or lose money in the next quarter,
year, or month based on their current performance
• To classify objects based on their features and attributes
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The output of logistic regression must lie between 0 and 1 and cannot go beyond
these limits, so it forms an S-shaped curve. The S-shaped curve is called the sigmoid
function or the logistic function.

In logistic regression, we use the concept of a threshold value, which decides
whether a predicted probability is mapped to 0 or 1: values above the threshold are
assigned to 1, and values below the threshold are assigned to 0.
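The sigmoid mapping and the threshold rule can be sketched in Python as follows. The threshold of 0.5 is a common default and an assumption here, since the text does not fix a value; the function names are also illustrative.

```python
import math


def sigmoid(z):
    """Map any real value z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that avoids overflow in exp().
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(z, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


midpoint = sigmoid(0)   # 0.5
label_high = classify(2.0)
label_low = classify(-2.0)
```

Raising the threshold makes the classifier more reluctant to predict class 1, and lowering it has the opposite effect.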

Logistic Regression Equation:
The logistic regression equation can be obtained from the linear regression
equation. The mathematical steps to get the logistic regression equation are
given below:
We know the equation of the straight line can be written as:

In logistic regression, y can be between 0 and 1 only, so we divide
the above equation by (1-y):

But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
equation, and it becomes:

The above equation is the final equation for Logistic Regression.
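The equations for these steps are not reproduced in the text. A reconstruction in the usual notation, offered as an assumption about what the slides showed (with b_0 ... b_n as hypothetical coefficient names), is given below. The first line is the straight-line equation; from the second line on, y is read as a probability between 0 and 1. The last line solves the final equation for y and recovers the sigmoid function described earlier.

```latex
\begin{aligned}
y &= b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \\
\frac{y}{1-y} &\in [0, \infty) \quad \text{for } y \in [0, 1) \\
\log\!\left(\frac{y}{1-y}\right) &= b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \\
y &= \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \dots + b_n x_n)}}
\end{aligned}
```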

Type of Logistic Regression:
On the basis of the categories, Logistic Regression can be classified into three types:

• Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".
Survival Analysis

• Survival analysis is a field of statistics that focuses on analyzing the
expected time until a certain event happens.
• For example, imagine a group of cancer patients who are administered a
certain new form of treatment. Survival analysis can be used for analyzing
the results of that treatment in terms of the patients’ life expectancy.
• Used for determining the time until a machine fails or


The conditions that apply to the most popular methods of survival analysis
are:
• the event of interest is clearly defined and well-specified, so there is no
ambiguity about whether it happened or not,
• the event can occur only once for each subject.
Survival analysis is used for modeling time-to-event data.


Survival analysis tries to answer questions like:
• what percentage of the population will survive past a certain time?
• of the survivors, what will be their death/failure rate?
• how do particular characteristics (for example, such features as age, gender,
geographical location, etc.) affect the probability of survival?

Censoring
Censoring can be described as the missing data problem in the domain of
survival analysis.
• Subjects are said to be censored if:
– they are lost to follow-up,
– they drop out of the study, or
– the study ends before they die or have the outcome of interest.
• In simple words, some important information required to make a
calculation is not available to us, i.e. it is censored.

Types of Censoring:
There are different kinds of censoring, such as:
• right-censoring,
• interval-censoring,
• left-censoring.


Right-censoring:
• Right censoring is the most common type of censoring.
• It means that we are not certain what happened to people after some point
in time.
• This happens when some people cannot be followed the entire time
because they died or were lost to follow-up or withdrew from the study.


Left-censoring:
• Left censoring is when we are not certain what happened to people before
some point in time.
• The most common example is when people already have the disease of interest
when the study starts.

Interval/Random Censoring
• Interval/random censoring is when we know that something happened within an
interval (i.e. not before the start and not after the end of the
study), but do not know exactly when in that interval it happened.
• For example, we know that the patient was well at time of start of the study
and was diagnosed with disease at time of end of the study, so when did the
disease actually begin?

Types of Survival Analysis


The Survival Function

The survival function is a function of time (t) and can be represented as
S(t) = \Pr(T > t)
where Pr() stands for the probability and T for the time of the event of interest for a
random observation from the sample.
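As a rough illustration of this definition, the sketch below estimates Pr(T > t) as the fraction of subjects whose event time exceeds t. It assumes every event time is fully observed (no censoring), and the event times are hypothetical values invented for the example.

```python
def empirical_survival(event_times, t):
    """Fraction of subjects whose event time T is strictly greater than t.

    Only valid when every event time is fully observed (no censoring).
    """
    return sum(1 for T in event_times if T > t) / len(event_times)

# Hypothetical event times in months for 10 subjects (assumed data).
times = [2, 3, 3, 5, 8, 8, 9, 12, 15, 20]

for t in (0, 5, 10, 20):
    print(f"S({t}) = {empirical_survival(times, t):.2f}")
```

With censored observations this simple fraction is no longer appropriate, which is why estimators such as Kaplan-Meier are used instead.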

Different approaches to Survival Analysis

Non-parametric
Semi-parametric
Parametric


Non-parametric
• With these approaches, we make no assumptions about the underlying
distribution of data.
• Perhaps the most popular example from this group is the Kaplan-Meier
curve, which in short is a method of estimating and plotting the survival
probability as a function of time.

Semi-parametric:
• Makes very few assumptions.
• Most importantly, there are no assumptions about the shape of the
hazard function/rate.
• The most popular method from this group is the Cox regression, which
we can use to identify the relationship between the hazard function and
a set of explanatory variables (predictors).


Parametric
• The idea is to use some statistical distributions to estimate how long a
subject will survive.
• Often, we use maximum likelihood estimation (MLE) to fit the
distribution (or actually the distribution’s parameters) to the data for the
best performance.
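A minimal sketch of the parametric idea, assuming the simplest possible model: an exponential distribution with constant rate, fitted by maximum likelihood to right-censored data. For this model the MLE of the rate has a closed form (number of observed events divided by total follow-up time). The data values and function names are hypothetical.

```python
import math

def fit_exponential(times, events):
    """MLE of the exponential rate under right-censoring.

    times  : follow-up time of each subject
    events : 1 if the event was observed, 0 if the subject was censored
    The closed-form MLE is (number of events) / (total follow-up time).
    """
    return sum(events) / sum(times)

def exponential_survival(rate, t):
    """Survival function of the exponential model: S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)

# Hypothetical follow-up times (months) and event indicators (assumed data).
times = [4, 6, 6, 8, 10, 12]
events = [1, 1, 0, 1, 0, 1]

rate = fit_exponential(times, events)
median = math.log(2) / rate  # time at which the fitted S(t) equals 0.5

print(f"estimated rate      = {rate:.4f} per month")
print(f"estimated median    = {median:.2f} months")
print(f"fitted S(12 months) = {exponential_survival(rate, 12):.3f}")
```

Other distributions such as the Weibull have no closed-form solution and are fitted by numerical optimization of the likelihood.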

Survival Analysis – methods

• Non-parametric estimation
– Within-group survival: Kaplan-Meier
– Between-group comparison: Log-rank Test

• Semi-parametric estimation model
– Cox proportional hazard model (allows explanatory variables)
– Hazard: the instantaneous risk of the event of interest. The event is usually believed to be harmful, e.g. death, relapse of a disease, re-hospitalization, failure of the product or part, etc.

• Parametric models: Exponential, Weibull distribution, etc. (won’t cover today)


The Kaplan-Meier estimator is used to estimate the survival function.
The Kaplan-Meier curve is the visual representation of this function; it
shows the probability of not yet having experienced the event at each point in time.
It is usually used to compare two groups in a study (like a group that got treatment
A vs a group that got treatment B).

Median survival: the time at which the estimated survival probability falls to 50%.
Here the median survival time is 30 months for Treatment A and about 47 months for Treatment B.
Treatment B seems to be doing better than Treatment A (median survival time of about 47 months vs 30 months, with a significant p-value).

Table for Kaplan Meier curve
• The follow-up time can be measured in any time unit: minutes, days, months, years.
• An event type of 1 means the event was observed. A typical event in a cancer trial is death, but Kaplan-Meier curves can also be used in other types of studies. Ann, for example, participated in this fictional study for a new cancer drug but died after 4 months.
• An event type of 0 means a right-censored observation.
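The follow-up time and event type columns described above could be derived from raw study records roughly as follows. This is a sketch under assumptions: the 12-month study length, the `encode` helper and every subject other than Ann are hypothetical and do not come from the slide's table.

```python
STUDY_END = 12  # hypothetical study length in months (assumption)

def encode(event_month=None, dropout_month=None, study_end=STUDY_END):
    """Return (follow_up_time, event_type) for one subject.

    event_type 1 = event observed, 0 = right-censored.
    The earliest of event, dropout and study end decides the record;
    if an event and a censoring time coincide, the event is kept.
    """
    candidates = [(study_end, 0)]
    if dropout_month is not None:
        candidates.append((dropout_month, 0))
    if event_month is not None:
        candidates.append((event_month, 1))
    return min(candidates, key=lambda c: (c[0], -c[1]))

records = {
    "Ann": encode(event_month=4),          # from the slide: died after 4 months
    "Subject B": encode(dropout_month=7),  # hypothetical: withdrew at month 7
    "Subject C": encode(),                 # hypothetical: event-free at study end
    "Subject D": encode(event_month=15),   # hypothetical: event after study end
}

for name, (months, event_type) in records.items():
    print(f"{name}: follow-up = {months} months, event type = {event_type}")
```

A subject whose event happens after the study has closed is recorded as censored at the study end, because the event was never observed.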

With Censored data

Kaplan-Meier Survival Method

• Non-parametric estimate of survival probability
• Commonly used to describe the survivorship of a study population
• Intuitive graphical presentation
• Cumulative survival characteristics
• Estimation of median survival time
• Commonly used to compare two study populations
Kaplan-Meier Plot

Kaplan-Meier Survival Probability
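Survival Function:
\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)
where d_i is the number of events at time t_i and n_i is the number of subjects still at risk just before t_i.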
Survival Probability at
1-month = 1-1/10 = 0.9
3-month = (1-1/10)*(1-1/8)=0.788
4-month = (1-1/10)*(1-1/8)*(1-2/7) = 0.56
1-year survival rate = 28%
Median survival time= 6 months
The K-M curve takes a step down when there is an
event.
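The product above can be sketched in plain Python. The data set below is an assumption: the slide's underlying table is not reproduced here, so the ten records were chosen only so that the risk sets match the worked figures (10, 8 and 7 subjects at risk at months 1, 3 and 4) and so that the tail is consistent with the quoted 28% one-year survival and 6-month median.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) steps.

    times  : follow-up time of each subject
    events : 1 if the event was observed, 0 if right-censored
    A step is recorded only at times where at least one event occurs.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every subject whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects both leave the risk set
    return curve

def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # survival never reaches 50% within the follow-up

# Hypothetical data in months (assumed, see the note above).
times = [1, 2, 3, 4, 4, 5, 6, 9, 14, 15]
events = [1, 0, 1, 1, 1, 0, 1, 1, 0, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: S = {s:.4f}")
print("median survival:", median_survival(curve), "months")
```

Censored subjects never cause a step down; they only shrink the number at risk, which makes each later event produce a larger drop.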

Comparison of groups-Logrank Test

Logrank Test :
– For comparison of survival distributions between groups
– The groups are defined by categorical covariates. Can be more than 2 groups.
e.g. Therapy : treatment, placebo
Gender : male, female
Age group : ≤40, >40

– The logrank test performs poorly when the two survival curves cross.
– The logrank test has better performance under the assumption of proportional hazards.

Proportional hazards: the ratio of the hazard functions for any two individuals is constant, i.e. it
does not change with time t.
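For two groups, the logrank statistic compares the observed number of events in one group with the number expected if both groups shared the same survival curve. The sketch below is a plain-Python illustration of that calculation with a chi-square p-value on 1 degree of freedom; the function name and the two small data sets are hypothetical.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test. Returns (chi-square statistic, p-value), 1 df."""
    # Distinct times at which an event occurred in either group.
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n       # observed minus expected in A
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    stat = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical follow-up times in months; every event is observed (assumed data).
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
all_events = [1, 1, 1, 1, 1]

stat, p_value = logrank_test(group_a, all_events, group_b, all_events)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
```

Two identical groups give a statistic of 0 and a p-value of 1, while clearly separated groups such as the ones above give a small p-value.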

Logrank Test-Example
Comparison of post-transplant death
and graft failure probability in IPF
patients with lung transplantation

Comparison group:
single transplantation
vs.
double transplantation

Logrank test:
p-value < 0.001

Single- vs Double-Lung Transplantation in Patients With Chronic Obstructive Pulmonary Disease and Idiopathic Pulmonary Fibrosis Since the Implementation
of Lung Allocation Based on Medical Need
JAMA. 2015;313(9):936-948. doi:10.1001/jama.2015.1175

Log-rank Test- Another example

Comparison of composite of
cardiovascular death, MI, or
severe recurrent ischemia in
patients with acute MI:

Comparison group:
Losmapimod vs. Placebo

Logrank test:
p-value =0.24

Effect of Losmapimod on Cardiovascular Outcomes in Patients With Acute MI. JAMA. 2016;315(15):1591-1599

Assessment 1

1. What is univariate statistics?

Ans : _______________________________________________________________________

2. What is dispersion in univariate statistics?

Ans : _______________________________________________________________________

Thank You