Decision Trees in Data Analytics with R
An Autonomous Institution
Accredited by NBA – AICTE and Accredited by NAAC – UGC with ‘A’ Grade
Approved by AICTE, New Delhi & Affiliated to Anna University, Chennai
November 4, 2025 Big data and Data science/ 19CS407-Data Analytics with R/KALPANA C/CST/SNSCE
Brain Storming
Forms of Learning
At a broad level, machine learning can be classified into three types:
• Supervised learning
• Unsupervised learning
• Reinforcement learning
1. Supervised Learning
• In supervised learning, we use known, labeled data as the training data. Because the correct
outputs are known, the learning is supervised, i.e., directed toward the correct result.
• The input data goes through the machine learning algorithm and is used to train the model. Once
the model is trained on the known data, you can feed unknown data into the model and get
a new response.
For example, suppose the model must figure out whether an input is an apple or another fruit.
Once the model has been trained well, it will identify that the input is an apple and give
the desired response.
The most widely used algorithms for supervised learning include:
Polynomial regression
Random forest
Linear regression
Logistic regression
Decision trees
K-nearest neighbors
Naive Bayes
2. Unsupervised Learning
In unsupervised learning, the training data is unknown and unlabeled, meaning that no one has
assigned the correct outputs beforehand. Without known outputs, the algorithm cannot be guided
toward a correct answer, which is where the term "unsupervised" originates.
Algorithms:
Partial least squares
Fuzzy c-means
Singular value decomposition
K-means clustering
Apriori
Hierarchical clustering
Principal component analysis
3. Reinforcement Learning
• As in traditional types of data analysis, the algorithm here explores the data through a process of
trial and error and then decides which action results in higher rewards.
• Three major components make up reinforcement learning: the agent, the environment, and the
actions.
• The agent is the learner or decision-maker, the environment includes everything that the agent
interacts with, and the actions are what the agent does.
• Reinforcement learning occurs when the agent chooses actions that maximize the expected
reward over a given time. This is easiest to achieve when the agent is working within a sound
policy framework.
Regression algorithms are used to predict continuous values such as price, salary, age, etc.
Classification algorithms are used to predict or classify discrete values such as Male or Female,
True or False, Spam or Not Spam, etc.
• Decision Tree is a supervised learning technique that can be used for both
classification and regression problems, but it is mostly preferred for solving
classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
• In a decision tree, there are two types of nodes: the decision node and the leaf
node. Decision nodes are used to make a decision and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further
branches.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
• The diagram below shows the general structure of a decision tree:
• A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model.
Below are two reasons for using a decision tree:
• Decision trees usually mimic the way humans think when making a decision, so they
are easy to understand.
• The logic behind a decision tree can be easily followed because it is laid out as
a tree-like structure.
Decision Tree Terminologies
• Root Node: The root node is where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split
any further after reaching a leaf node.
• Splitting: Splitting is the process of dividing a decision node or the root node into
sub-nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• Parent/Child node: A node that is split into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes. The root node is the parent of the first split.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree.
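The algorithm compares the value of the root attribute with the corresponding attribute of the
record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next
node. It repeats this comparison at each sub-node until it reaches a leaf node.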
Decision Trees Algorithm
The complete process can be better understood using the algorithm below:
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified any further;
such a final node is called a leaf node.
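The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the CART implementation itself: it assumes categorical attributes, uses information gain as the attribute selection measure, and creates one branch per attribute value. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute with the highest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Steps 1-5: split recursively until a node is pure or no attributes remain."""
    if len(set(labels)) == 1:            # pure node: becomes a leaf
        return labels[0]
    if not attributes:                   # nothing left to split on: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)        # Step 2
    node = {attr: {}}                                      # Step 4
    remaining = [a for a in attributes if a != attr]
    for value in sorted(set(row[attr] for row in rows)):   # Step 3
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        node[attr][value] = build_tree(                    # Step 5
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node

# Hypothetical toy data, invented for illustration only.
rows = [
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "near"},
    {"salary": "high", "distance": "far"},
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "near"},
    {"salary": "low", "distance": "far"},
]
labels = ["accept", "accept", "decline", "decline", "decline", "decline"]
tree = build_tree(rows, labels, ["salary", "distance"])
print(tree)
```

The result is a nested dictionary: each key is the attribute chosen at a decision node, and each string value is a leaf.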
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by ASM). The root node splits into the next decision node (distance from the office) and one leaf node, based
on the corresponding labels. That decision node splits further into one decision node (Cab facility) and one leaf
node. Finally, the last decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the
diagram below:
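The job-offer tree described above can be written down as a small data structure and traversed from the root to a leaf. This is only a sketch: the salary and distance thresholds and the field names are assumptions invented for illustration, since the text does not give them.

```python
# Each decision node is (question, yes_branch, no_branch); a leaf is a plain string.
# The thresholds (50_000 salary, 20 km) are hypothetical values for illustration.
JOB_OFFER_TREE = (
    lambda c: c["salary"] >= 50_000,                 # root node: Salary
    (
        lambda c: c["distance_km"] <= 20,            # decision node: distance from the office
        (
            lambda c: c["cab_facility"],             # decision node: Cab facility
            "Accepted offer",
            "Declined offer",
        ),
        "Declined offer",
    ),
    "Declined offer",
)

def predict(node, candidate):
    """Start at the root, follow the matching branch, and stop at a leaf."""
    while not isinstance(node, str):
        question, yes_branch, no_branch = node
        node = yes_branch if question(candidate) else no_branch
    return node

offer = {"salary": 60_000, "distance_km": 5, "cab_facility": True}
print(predict(JOB_OFFER_TREE, offer))
```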
Consider a dataset based on which we will
determine whether to play football or not.
Decision tree for the above data set
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the
best attribute for the root node and for the sub-nodes. To solve this problem,
there is a technique called the Attribute Selection Measure, or ASM.
• With this measure, we can easily select the best attribute for the nodes of the
tree.
There are two popular techniques for ASM:
•Information Gain
•Gini Index
1. Information Gain:
• Information gain is the measurement of the change in entropy after the
segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build
the decision tree.
• A decision tree algorithm always tries to maximize the value of information
gain, and the node/attribute having the highest information gain is split first. It can
be calculated using the formula below:
• Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy:
Entropy is a metric to measure the impurity in a given attribute.
It specifies randomness in data. Entropy can be calculated as:
Entropy(S) = - P(yes) log2 P(yes) - P(no) log2 P(no)
Where,
• S = the set of all samples
•P(yes)= probability of yes
•P(no)= probability of no
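As a quick check of the entropy formula, here is a minimal Python sketch for the two-class (yes/no) case. The function name is hypothetical; the convention that a zero-probability term contributes 0 is the usual one.

```python
import math

def entropy(p_yes):
    """Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no) for a two-class set."""
    p_no = 1.0 - p_yes
    # By convention 0 * log2(0) is treated as 0, so zero-probability terms are skipped.
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

print(entropy(0.5))      # an evenly mixed set is maximally impure
print(entropy(1.0))      # a pure set has no impurity
print(entropy(9 / 14))   # 9 "yes" and 5 "no" samples out of 14
```

Entropy is highest when the two classes are evenly mixed and falls to zero when every sample belongs to the same class.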
2. Gini Index:
•Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high
Gini index.
• The CART algorithm creates only binary splits, and it uses the Gini index to
choose them.
• The Gini index can be calculated using the formula below, where P_j is the proportion of samples in class j:
$$\text{Gini Index} = 1 - \sum_j P_j^2$$
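The Gini index formula translates directly into code. A minimal sketch, assuming the input is a list of per-class sample counts (the function name is hypothetical):

```python
def gini_index(counts):
    """Gini index = 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)

print(gini_index([9, 5]))   # mixed node: 9 samples of one class, 5 of the other
print(gini_index([4, 0]))   # pure node: impurity is zero
```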
Pruning: Getting an Optimal Decision tree
Pruning is the process of deleting unnecessary nodes from a tree in order to get
the optimal decision tree.
A tree that is too large increases the risk of overfitting, and a tree that is too small may not capture
all the important features of the dataset. A technique that decreases the
size of the learning tree without reducing accuracy is therefore known as pruning. There
are mainly two types of tree pruning techniques:
• Cost Complexity Pruning
• Reduced Error Pruning
Advantages of the Decision Tree
• It is simple to understand because it follows the same process that a human follows
when making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes of a problem.
• It requires less data cleaning than other algorithms.
Disadvantages of the Decision Tree
• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be mitigated using the Random
Forest algorithm.
• With more class labels, the computational complexity of the decision tree may
increase.
Several algorithms can be used to build a
decision tree. They are
Now we calculate the average weighted entropy, i.e., the sum over the values of the
feature of each subset's entropy multiplied by the fraction of samples in that subset (logarithms are base 2).
E(S, outlook) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
= (5/14)(-(3/5)log(3/5) - (2/5)log(2/5)) + (4/14)(0) +
(5/14)(-(2/5)log(2/5) - (3/5)log(3/5))
= 0.693
The next step is to find the information gain. It is the difference between parent
entropy and average weighted entropy we found above.
IG(S, outlook) = 0.94 - 0.693 = 0.247
Similarly, find the information gain for Temperature, Humidity, and Windy.
IG(S, Temperature) = 0.940 - 0.911 = 0.029
IG(S, Humidity) = 0.940 - 0.788 = 0.152
IG(S, Windy) = 0.940 - 0.892 = 0.048
Now select the feature having the largest information gain. Here it is Outlook, so it
forms the first node (root node) of our decision tree.
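The Outlook calculation can be reproduced with a short Python sketch. It uses only the class counts that appear in the worked example, (3,2), (4,0) and (2,3) for the three Outlook values and (9,5) for the whole dataset; the variable names are hypothetical. The results agree with the values above up to rounding in the third decimal place.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

parent = [9, 5]                            # class counts in the full dataset
outlook_groups = [[3, 2], [4, 0], [2, 3]]  # class counts for each Outlook value

n = sum(parent)
# Average weighted entropy: each subset's entropy times its share of the samples.
weighted_entropy = sum(sum(g) / n * entropy(g) for g in outlook_groups)
information_gain = entropy(parent) - weighted_entropy

print(entropy(parent), weighted_entropy, information_gain)
```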
Classification using CART algorithm
Classification using CART follows the same procedure, but instead of entropy we use Gini
impurity.
As the first step, we find the root node of our decision tree. For that,
calculate the Gini index of the class variable:
Gini(S) = 1 - [(9/14)² + (5/14)²] = 0.459
As the next step, we calculate the Gini gain. For that, we first find the
average weighted Gini impurity of Outlook, Temperature, Humidity, and Windy.
First, consider the case of Outlook:
Gini(S, outlook) = (5/14)*gini(3,2) + (4/14)*gini(4,0) + (5/14)*gini(2,3)
= (5/14)(1 - (3/5)² - (2/5)²) + (4/14)*0 + (5/14)(1 - (2/5)² - (3/5)²)
= 0.171 + 0 + 0.171 = 0.342
Gini gain(S, outlook) = 0.459 - 0.342 = 0.117
Gini gain(S, Temperature) = 0.459 - 0.4405 = 0.0185
Gini gain(S, Humidity) = 0.459 - 0.3674 = 0.0916
Gini gain(S, windy) = 0.459 - 0.4286 = 0.0304
Choose the attribute that has the highest Gini gain. The Gini gain is highest for Outlook,
so we choose it as our root node.
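The Gini gain for Outlook can be checked the same way. This sketch uses only the class counts given in the worked example; the names are hypothetical, and the printed values differ from the hand calculation only by rounding.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = [9, 5]                            # class counts in the full dataset
outlook_groups = [[3, 2], [4, 0], [2, 3]]  # class counts for each Outlook value

n = sum(parent)
# Average weighted Gini impurity across the Outlook subsets.
weighted_gini = sum(sum(g) / n * gini(g) for g in outlook_groups)
gini_gain = gini(parent) - weighted_gini

print(gini(parent), weighted_gini, gini_gain)
```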
Random Forest Algorithm
The below diagram explains the working of the Random Forest algorithm:
How does Random Forest algorithm work?
Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions with each tree created in the first phase.
The working process can be explained in the steps and diagram below:
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
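The majority-vote step can be illustrated with a small Python sketch. The per-tree predictions below are hypothetical values standing in for the outputs of already trained trees.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the category predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from five decision trees for one new data point.
tree_predictions = ["apple", "banana", "apple", "apple", "cherry"]
print(majority_vote(tree_predictions))
```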
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree.
During the training phase, each decision tree produces a prediction result. When a new
data point arrives, the Random Forest classifier predicts the final decision based on the
majority of those results. Consider the image below:
Applications of Random Forest
There are mainly four sectors where Random Forest is used:
Banking: The banking sector mostly uses this algorithm to identify loan
risk.
Medicine: With the help of this algorithm, disease trends and disease risks
can be identified.
Land Use: We can identify areas of similar land use with this algorithm.
Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest
Time Series Analysis
Significance of Time Series and its types
• TSA is the backbone for prediction and forecasting analysis, specific to the
time-based problem statements.
• Analyzing the historical dataset and its patterns
• Understanding and matching the current situation with patterns derived from
the previous stage.
• Understanding the factor or factors influencing certain variable(s) in different
periods.
• With the help of time series, we can prepare numerous time-based analyses and
results.
o Forecasting
o Segmentation
o Classification
o Descriptive analysis
o Intervention analysis
Linear Regression
Linear regression is a statistical method used for predictive analysis. It makes
predictions for continuous (real or numeric) variables such as sales, salary, age,
product price, etc.
Definition:
The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression.
Because linear regression models a linear relationship, it finds how the
value of the dependent variable changes according to the value of the
independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:
The linear regression equation is of the form:
y = mx + c
y = dependent variable
x = independent variable
m = slope
c = intercept
Types of Linear Regression
Linear regression can be further divided into two types:
Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple
Linear Regression.
How Does Linear Regression Work?
1. With the available data, we plot a graph with Area on the X-axis and Rent on the Y-axis.
2. Next, we find the mean of Area and Rent.
3. We then plot the mean on the graph.
4. We draw a line of best fit that passes through the mean.
5. But we encounter a problem. As you can see below, multiple lines can be drawn
through the mean:
6. To overcome this problem, we keep moving the line until the best-fit
line has the least squared distance from the data points.
7. We now arrive at the relation that the residual is the distance between Y-actual and Y-pred.
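8. The values of m and c for the best-fit line, y = mx + c, can be calculated using these
formulas:
$$m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad c = \bar{y} - m\,\bar{x}$$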
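The standard least-squares estimates of m and c can be computed directly from the means of x and y. The sketch below assumes simple linear regression with one independent variable; the Area and Rent values are hypothetical, invented only to make the example runnable.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for the line y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    # The best-fit line passes through the point of means, so c = y_mean - m * x_mean.
    c = y_mean - m * x_mean
    return m, c

area = [1, 2, 3, 4, 5]   # hypothetical Area values
rent = [2, 4, 5, 4, 5]   # hypothetical Rent values
m, c = fit_line(area, rent)
print(m, c)
```

Because c is derived from the means, the fitted line always passes through the mean point plotted in the earlier steps.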
13. After calculating, we find that the least-squares
value for the line below is 3.02.
14. Finally, we are able to plot the Y-pred and this is found out to be the best fit line.
Model Performance:
The goodness of fit of the model can be assessed with the method below:
R-squared method:
• R-squared is a statistical measure of goodness of fit.
• It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
• A high value of R-squared indicates a small difference between the predicted values
and the actual values and hence represents a good model.
• It is also called the coefficient of determination, or the coefficient of multiple determination
for multiple regression.
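R-squared is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch under that definition, with hypothetical actual and predicted values:

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot

actual = [2, 4, 5, 4, 5]                # hypothetical observed values
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]   # hypothetical predictions from a fitted line
print(r_squared(actual, predicted))
```

A value near 1 (100%) means the predictions track the actual values closely; a value near 0 means the model explains little of the variation.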
Multiple Linear Regression
Logistic Regression
• Logistic regression is very similar to linear regression except in how it is
used. Linear regression is used for solving regression problems, whereas
logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function whose output is bounded by two extreme values (0 and 1).
Applications of Logistic Regression
• Using the logistic regression algorithm, banks can predict whether a customer
would default on loans or not
• To predict the weather conditions of a certain place (sunny, windy, rainy, humid,
etc.)
• E-commerce companies can identify buyers who are likely to purchase a certain
product
• Companies can predict whether they will gain or lose money in the next quarter,
year, or month based on their current performance
• To classify objects based on their features and attributes
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within the range of 0 to 1.
• The output of logistic regression must lie between 0 and 1 and cannot go beyond
these limits, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid
function or the logistic function.
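The sigmoid function is 1 / (1 + e^(-z)). The sketch below is one way to implement it; the two-branch form is a design choice to avoid numeric overflow for large negative inputs, and the function name is hypothetical.

```python
import math

def sigmoid(z):
    """Map any real value z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) so exp() cannot overflow.
    e = math.exp(z)
    return e / (1.0 + e)

for z in (-5, 0, 5):
    print(z, sigmoid(z))
```

An input of 0 maps to exactly 0.5, large positive inputs approach 1, and large negative inputs approach 0, which produces the S-shaped curve.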
In logistic regression, we use the concept of a threshold value, which decides whether
a predicted probability is mapped to 0 or 1: values above the threshold are assigned
to 1, and values below the threshold are assigned to 0.
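Applying a threshold is a one-line comparison. In this sketch the default threshold of 0.5 and the choice to assign a probability exactly equal to the threshold to class 1 are assumptions; the text does not specify either.

```python
def classify(probability, threshold=0.5):
    """Turn a predicted probability into a class label, 0 or 1."""
    # Assumption: a probability exactly at the threshold is assigned to class 1.
    return 1 if probability >= threshold else 0

print(classify(0.83))                 # above the default threshold
print(classify(0.12))                 # below the default threshold
print(classify(0.60, threshold=0.7))  # a stricter, hypothetical threshold
```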
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are
given below:
We know the equation of the straight line can be written as:
In Logistic Regression y can be between 0 and 1 only, so for this let's divide
the above equation by (1-y):
But we need a range from -∞ to +∞, so we take the logarithm of the
equation, which becomes:
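The derivation can be sketched as follows. It assumes the straight-line form y = mx + c used in the linear regression section; the symbols on the original slides may differ. On the left-hand side, y is the predicted probability; the right-hand side is the linear predictor.

```latex
\begin{aligned}
\text{Straight line:}\quad & mx + c \in (-\infty, +\infty) \\
\text{Odds:}\quad & \frac{y}{1-y} \in [0, +\infty) \quad \text{for } y \in [0, 1) \\
\text{Log-odds (logit):}\quad & \log\!\left(\frac{y}{1-y}\right) = mx + c \\
\text{Solving for } y\text{:}\quad & y = \frac{1}{1 + e^{-(mx + c)}}
\end{aligned}
```

The last line is the sigmoid function applied to mx + c, which is why the fitted curve is S-shaped and bounded by 0 and 1.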
Type of Logistic Regression:
On the basis of the categories of the dependent variable, logistic regression can be classified into three types:
• Binomial: In binomial logistic regression, the dependent variable can take only two
possible values, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, the dependent variable can take 3 or more
possible unordered values, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, the dependent variable can take 3 or more
possible ordered values, such as "Low", "Medium", or "High".
Survival Analysis
The conditions that apply to the most popular methods of survival analysis
are:
• the event of interest is clearly defined and well-specified, so there is no
ambiguity about whether it happened or not,
• the event can occur only once for each subject.
Survival analysis is used for modeling time-to-event data.
Censoring
Censoring can be described as the missing-data problem in the domain of
survival analysis.
• Subjects are said to be censored if:
o they are lost to follow-up,
o they drop out of the study, or
o the study ends before they die or have an outcome of interest.
• In simple words, some important information required to make a
calculation is not available to us, i.e., it is censored.
Types of Censoring:
There are different kinds of censoring, such as:
• right-censoring,
• interval-censoring,
• left-censoring.
Right-censoring:
• Right-censoring is the most common type and the one of most concern.
• It means that we are not certain what happened to people after some point
in time.
• This happens when some people cannot be followed for the entire time
because they died, were lost to follow-up, or withdrew from the study.
Left-censoring:
• Left-censoring is when we are not certain what happened to people before
some point in time.
• The most common example is when people already have the disease of interest
when the study starts.
Interval/Random Censoring
• Interval/random censoring is when we know that something happened in an
interval (i.e. not before starting time and not after ending time of the
study ), but do not know exactly when (in the interval) it happened.
• For example, we know that the patient was well at time of start of the study
and was diagnosed with disease at time of end of the study, so when did the
disease actually begin?
Types of Survival Analysis
The survival function is S(t) = Pr(T > t),
where Pr() stands for the probability and T for the time of the event of interest for a
random observation from the sample.
Different approaches to Survival Analysis
Non-parametric
Semi-parametric
Parametric
Non-parametric
• With these approaches, we make no assumptions about the underlying
distribution of data.
• Perhaps the most popular example from this group is the Kaplan-Meier
curve, which in short is a method of estimating and plotting the survival
probability as a function of time.
Semi-parametric:
• Makes very few assumptions.
• Most importantly, there are no assumptions about the shape of the
hazard function/rate.
• The most popular method from this group is the Cox regression, which
we can use to identify the relationship between the hazard function and
a set of explanatory variables (predictors).
Parametric
• The idea is to use some statistical distributions to estimate how long a
subject will survive.
• Often, we use maximum likelihood estimation (MLE) to fit the
distribution (or actually the distribution’s parameters) to the data for the
best performance.
Survival Analysis – methods
Non-parametric estimation
Within-group survival: Kaplan-Meier
Between-group comparison: Log-rank Test
Median survival: the median survival time for each treatment.
Here, Treatment A gives a median survival time of 30 months and Treatment B gives about 47 months.
Treatment B seems to be doing better than Treatment A, and the difference has a significant p-value.
Table for Kaplan Meier curve
• The follow-up time can be measured in any time unit: minutes, days, months, or years.
• An event type of 1 equals an event. A typical event in a cancer trial can be
death, but Kaplan-Meier curves can also be used in other types of studies.
Ann, for example, participated in this fictional study for a new cancer drug
but died after 4 months.
• An event type of 0 equals a right-censored observation.
With Censored data
Kaplan-Meier Survival Method
Kaplan-Meier Plot
Kaplan-Meier Survival Probability
Survival Function:
Survival Probability at
1-month = 1-1/10 = 0.9
3-month = (1-1/10)*(1-1/8)=0.788
4-month = (1-1/10)*(1-1/8)*(1-2/7) = 0.56
1-year survival rate = 28%
Median survival time= 6 months
The K-M curve takes a step down when there is an
event.
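The survival probabilities above follow the Kaplan-Meier product-limit rule: at each event time, multiply the running survival probability by (1 - events / number at risk). A minimal sketch, using the at-risk and event counts that appear in the calculation above (10 and 1, 8 and 1, 7 and 2); the function name and input layout are hypothetical.

```python
def kaplan_meier(steps):
    """Product-limit estimate.

    steps: list of (time, number_at_risk, number_of_events), ordered by time.
    Returns a list of (time, survival_probability).
    """
    survival = 1.0
    curve = []
    for time, at_risk, events in steps:
        # The curve steps down only when an event occurs. Censored subjects
        # lower the number at risk at later times without causing a step.
        survival *= 1 - events / at_risk
        curve.append((time, survival))
    return curve

# (month, at risk, events) as read from the worked calculation above.
steps = [(1, 10, 1), (3, 8, 1), (4, 7, 2)]
for month, prob in kaplan_meier(steps):
    print(month, round(prob, 3))
```

The drop in the number at risk from 9 to 8 between months 1 and 3 without an event reflects a censored subject.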
Comparison of groups-Logrank Test
Logrank Test:
– For comparison of survival distributions between groups
– The groups are defined by categorical covariates. There can be more than 2 groups.
e.g. Therapy: treatment, placebo
Gender: male, female
Age group: ≤40, >40
Proportional hazards: The hazard functions for any two individuals are proportional at every
point in time, i.e., their ratio does not change with time t.
Logrank Test-Example
Example 1: Comparison of post-transplant death and graft failure probability in IPF
patients with lung transplantation
Comparison groups: single transplantation vs. double transplantation
Logrank test: p-value < 0.001
Source: Single- vs Double-Lung Transplantation in Patients With Chronic Obstructive Pulmonary Disease and Idiopathic Pulmonary Fibrosis Since the Implementation
of Lung Allocation Based on Medical Need.
JAMA. 2015;313(9):936-948. doi:10.1001/jama.2015.1175
Example 2: Comparison of the composite of cardiovascular death, MI, or severe recurrent
ischemia in patients with acute MI
Comparison groups: Losmapimod vs. placebo
Logrank test: p-value = 0.24
Assessment 1
Ans : _______________________________________________________________________
Ans : _______________________________________________________________________