Comprehensive Notes on Decision Tree Algorithm
Based on Krish Naik Day 5 Live Session and Intellipaat Decision Tree Lectures
Introduction
Decision Trees are supervised learning models used for both classification and
regression tasks. They model decisions and their possible consequences in a
tree-like structure. Internal nodes represent tests on features, branches repre-
sent decision rules, and leaf nodes represent outcomes (class labels or predicted
values).
These notes synthesize explanations from:
• Krish Naik – Live Day 5: Ensemble Techniques (Bagging, Boosting, Ran-
dom Forest, AdaBoost, XGBoost), where Decision Trees appear as base
learners and Random Forest is explained as a bagging-based ensemble of
trees.
• Intellipaat – Decision Tree Algorithm lectures (concepts of entropy, in-
formation gain, Gini impurity, and practical implementation in Python /
R).
Basic Intuition of Decision Trees
Real-world Motivation
A decision tree mimics human decision-making: at each step you ask a question
about the input features and branch according to the answer until you reach a
decision.
Examples:
• Telecom Churn: Predict whether a customer will churn based on fea-
tures like gender, tenure, and monthly charges.
• Restaurant Dessert Prediction: Predict whether a customer will order
dessert based on age and whether they ordered a main course.
• Play Tennis: Predict whether a person will play tennis based on weather
conditions like outlook, temperature, humidity, and wind.
1
In each case, the decision tree splits the feature space into regions where the
target is as homogeneous as possible.
Terminology
• Root Node: The top-most node of the tree. It represents the first feature
and condition used to split the dataset. Choosing this node is a key
optimization problem (based on Information Gain or Gini decrease).
• Internal / Decision Node: A node that splits into further branches
based on a test on a feature (e.g. Price < 90).
• Leaf / Terminal Node: A node with no further children. It represents
a final prediction: a class label (for classification) or a real value (for
regression).
• Branch: The edge connecting two nodes, representing the outcome of a
test (e.g. Yes, No, or an interval condition like x < 5).
• Subtree: Any node together with all its descendants can be viewed as a
subtree.
• Depth of Tree: Length of the longest path from the root node to any
leaf node.
Types of Decision Trees
Classification Trees
Used when the target variable is categorical (e.g. churn vs no-churn, play vs
not-play, drug A/B/C). The objective is to maximize the purity of class labels
in each leaf.
The most common impurity measures for classification are:
• Entropy
• Gini impurity
Regression Trees
Used when the target variable is continuous (e.g., price, sales, salary, house
value). Instead of measuring class impurity, we typically minimize variance or
squared error in each leaf.
Common criteria for regression trees:
• Mean Squared Error (MSE)
• Mean Absolute Error (MAE)
2
Core Idea: Recursive Partitioning of Feature
Space
The decision tree recursively partitions the feature space:
1. Start with the entire training dataset at the root node.
2. At each node, consider a set of possible splits of the data using feature
thresholds.
3. Choose the split that most reduces impurity (for classification) or error
(for regression).
4. Recurse on each child node until a stopping criterion is met (e.g., maxi-
mum depth, minimum samples per leaf, or pure node).
Each path from root to leaf corresponds to a conjunction of conditions and thus
defines a decision rule (e.g., Outlook = Sunny & Humidity = High & Windy
= False ⇒ Don’t Play).
Mathematical Foundations: Entropy, Informa-
tion Gain, and Gini Impurity
Entropy
Entropy measures the amount of disorder or uncertainty in a set of labels. For
a node containing examples of 𝐾 classes, with class probabilities 𝑝1 , 𝑝2 , … , 𝑝𝐾 ,
the entropy is:
𝐾
𝐻(𝑆) = − ∑ 𝑝𝑖 log2 𝑝𝑖 .
𝑖=1
Properties:
• 𝐻(𝑆) = 0 when all samples belong to a single class (node is pure).
• 𝐻(𝑆) is maximized when all classes are equally likely, i.e. 𝑝𝑖 = 1/𝐾 for
all 𝑖.
Information Gain
Information Gain (IG) measures how much a feature 𝐹 reduces entropy when
we split the node 𝑆 based on 𝐹 .
Let the original node be 𝑆 with entropy 𝐻(𝑆). Suppose feature 𝐹 splits 𝑆 into
subsets 𝑆1 , 𝑆2 , … , 𝑆𝑚 (for example, for Outlook we might have Sunny, Overcast,
and Rainy). Then the Information Gain of splitting on feature 𝐹 is:
𝑚
|𝑆𝑗 |
Gain(𝑆, 𝐹 ) = 𝐻(𝑆) − ∑ 𝐻(𝑆𝑗 ).
𝑗=1
|𝑆|
3
Interpretation:
• 𝐻(𝑆): entropy before the split (parent node).
|𝑆𝑗 |
• ∑𝑗 𝐻(𝑆𝑗 ): weighted average entropy after the split (children nodes).
|𝑆|
• Higher Gain(𝑆, 𝐹 ) means greater reduction in uncertainty, so 𝐹 is a better
splitting feature at node 𝑆.
Example: Play Tennis Dataset
Consider the classical tennis dataset with target Play having values Yes or No.
Suppose at the root we have 9 Yes and 5 No out of 14 samples.
Entropy at root:
9 9 5 5
𝐻(𝑆) = − log2 − log2 ≈ 0.94.
14 14 14 14
If we split by Outlook with categories Sunny, Overcast, and Rain, we com-
pute entropy for each subset and then compute Information Gain. The feature
(Outlook, Temperature, Humidity, Wind) with the highest gain becomes the
root node.
Gini Impurity
Gini impurity is an alternative measure of impurity used primarily in the CART
(Classification and Regression Trees) algorithm. For a node with class probabil-
ities 𝑝1 , … , 𝑝𝐾 :
𝐾
𝐺(𝑆) = 1 − ∑ 𝑝𝑖2 .
𝑖=1
Properties:
• 𝐺(𝑆) = 0 when the node is pure (only one class present).
• Gini impurity is maximized when the classes are equally likely.
• The range of Gini impurity is [0, 1 − 1/𝐾].
Splitting criterion using Gini:
𝑚
|𝑆𝑗 |
Δ𝐺(𝑆, 𝐹 ) = 𝐺(𝑆) − ∑ 𝐺(𝑆𝑗 ).
𝑗=1
|𝑆|
The feature and threshold that maximizes Δ𝐺 is chosen at each split.
4
Entropy vs. Gini Impurity
• Both are measures of node impurity; in practice they often lead to very
similar trees.
• Entropy involves 𝑝 log 𝑝 and has an information-theoretic interpretation.
• Gini is computationally simpler (no logarithms) and is the default in many
libraries (e.g. scikit-learn).
Decision Tree Construction Algorithm
The high-level algorithm (Top-Down Induction of Decision Trees) is as follows:
1. Input: Training set 𝐷 with features 𝑋1 , … , 𝑋𝑑 and target 𝑌 .
2. If all examples in 𝐷 belong to the same class 𝑐, create a leaf node labeled
𝑐 and stop.
3. If 𝐷 is empty or no features remain, create a leaf node labeled with the
majority class in the parent node and stop.
4. Otherwise:
(a) For each candidate feature 𝐹 (and possible split thresholds, if nu-
meric), compute the impurity reduction (Information Gain or Gini
decrease).
(b) Select feature 𝐹 ∗ and threshold 𝑡∗ that maximize impurity reduction.
(c) Create a decision node that splits on 𝐹 ∗ (e.g., 𝐹 ∗ ≤ 𝑡∗ vs 𝐹 ∗ > 𝑡∗ or
categorical branches).
(d) Recurse on each child subset 𝐷𝑗 to build subtrees.
This greedy algorithm does not guarantee a globally optimal tree, but in practice
works well.
Root Node Selection
From the Intellipaat explanation, the key interview-ready statement is:
“In decision trees, the root node is selected based on the concept
of entropy or Information Gain (or equivalently, by minimizing Gini
impurity). The feature with the highest Information Gain (or highest
Gini decrease) is chosen as the root.”
Formally:
• Compute entropy 𝐻(𝑆) at the root using the target distribution.
• For each feature 𝐹 , compute Gain(𝑆, 𝐹 ).
5
• Choose 𝐹 ∗ = arg max𝐹 Gain(𝑆, 𝐹 ).
Subsequent internal nodes are chosen similarly but on the respective subsets.
Stopping Criteria and Pruning
Stopping Criteria (Pre-pruning)
To prevent the tree from growing too deep (and overfitting), we may stop split-
ting when:
• Node depth reaches a maximum (max_depth).
• Number of samples at a node is below a threshold (min_samples_split
or min_samples_leaf).
• Impurity decrease from any split is below a minimum threshold.
• Node is already sufficiently pure (e.g., above a purity threshold).
Post-pruning (Cost-Complexity Pruning)
Krish Naik’s R example demonstrates post-pruning:
• First, grow a full tree (possibly overfitted) on the training data.
• Use cross-validation to compute misclassification error for different subtree
sizes.
• Plot tree size vs. error and identify the subtree size with minimal error
(or where error stops decreasing significantly).
• Prune back the full tree to this optimal size (using functions like
[Link] in R or cost-complexity pruning in CART).
Conceptually, cost-complexity pruning minimizes:
𝑅𝛼 (𝑇 ) = 𝑅(𝑇 ) + 𝛼|𝑇 |,
where 𝑅(𝑇 ) is the empirical error of tree 𝑇 , |𝑇 | is the number of leaf nodes, and
𝛼 ≥ 0 is a complexity penalty. Increasing 𝛼 prefers smaller trees.
Decision Trees for Regression
For regression, the idea is similar but with continuous targets:
• At each node, choose the split that minimizes the sum of squared errors
(SSE) or variance in the child nodes.
• The prediction at a leaf is typically the mean (or median) of target values
in that leaf.
6
Split quality (using MSE) can be defined as:
|𝑆𝑗 |
ΔMSE(𝑆, 𝐹 ) = Var(𝑆) − ∑ Var(𝑆𝑗 ).
𝑗
|𝑆|
Advantages and Disadvantages
Advantages
• Interpretability: Easy to visualize and interpret; paths from root to leaf
correspond to explicit decision rules.
• Handles Mixed Data Types: Can handle both numerical and categor-
ical features without much preprocessing.
• Non-linear Decision Boundaries: Can capture complex, non-linear
relationships by hierarchical splitting.
• Little Data Preparation: Requires less feature scaling or normalization
compared to models like SVM or logistic regression.
• Versatility: Works for both classification and regression tasks (Decision-
TreeClassifier and DecisionTreeRegressor).
Disadvantages
• Overfitting: Unpruned trees tend to overfit training data, learning noise
and leading to poor generalization.
• Instability: Small changes in data can lead to very different tree struc-
tures (high variance model).
• Bias towards Features with More Levels: With some criteria, fea-
tures with many distinct values may be favored, which can lead to subop-
timal splits.
• Limited Expressiveness for Very Complex Problems: A single tree
may not perform well on highly complex tasks or high-dimensional contin-
uous spaces; ensembles like Random Forest and Gradient Boosted Trees
perform better.
Most disadvantages are mitigated by ensemble methods, where trees are used
as base learners (weak learners) and combined into a stronger model.
Decision Trees in Ensemble Methods
From Krish Naik’s Day 5 session:
7
Bagging and Random Forest
• Bagging (Bootstrap Aggregation):
– Draw multiple bootstrap samples (with replacement) from the train-
ing data.
– Train a separate model (e.g., decision tree) on each bootstrap sample
in parallel.
– For classification: combine predictions by majority voting.
– For regression: combine predictions by averaging.
• Random Forest:
– A bagging method where each tree is trained on a bootstrap sample.
– At each split, only a random subset of features is considered, intro-
ducing additional randomness.
– Reduces correlation between trees and often improves generalization
compared to plain bagging.
In Random Forest, decision trees are used as base learners and are typically
grown deep without pruning (high variance individually, but variance is reduced
by averaging).
Boosting (AdaBoost, Gradient Boosting, XGBoost)
• Boosting: Sequentially combine many weak learners (often shallow deci-
sion trees called decision stumps) into a strong learner.
• AdaBoost:
– Start with uniform weights on training samples.
– Train a weak learner (e.g., depth-1 tree) and compute its error.
– Increase weights of misclassified points and train the next weak
learner focused on those hard examples.
– Final prediction is a weighted sum (or vote) of all weak learners.
• Gradient Boosting / XGBoost:
– Fit successive trees to the residual errors (negative gradients) of pre-
vious trees.
– Optimizes a differentiable loss function using gradient descent in func-
tion space.
8
Implementation Details in Python (scikit-learn)
Key Classes
• [Link]
• [Link]
Important Hyperparameters
For DecisionTreeClassifier:
• criterion: "gini" (default) or "entropy" (or "log_loss" in newer ver-
sions).
• max_depth: Maximum depth of the tree.
• min_samples_split: Minimum number of samples required to split an
internal node.
• min_samples_leaf: Minimum samples required at a leaf node.
• max_features: Number of features to consider when looking for the best
split.
• random_state: Seed for reproducibility.
Typical Workflow (Classification)
1. Load data using pandas.
2. Handle missing values, duplicates, and encode categorical variables (e.g.,
using LabelEncoder or one-hot encoding).
3. Split into features 𝑋 and target 𝑦.
4. Perform train–test split: train_test_split(X, y, test_size=0.2,
random_state=...).
5. Initialize classifier, e.g.:
from [Link] import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=None,
random_state=0)
6. Fit model: [Link](X_train, y_train).
7. Predict on test data: y_pred = [Link](X_test).
8. Evaluate performance using accuracy, confusion matrix, F1-score, etc.
9. Optionally visualize the tree using [Link].plot_tree or export to
Graphviz.
9
Example Sketch
import pandas as pd
from sklearn.model_selection import train_test_split
from [Link] import DecisionTreeClassifier, plot_tree
from [Link] import accuracy_score
# 1. Load data
df = pd.read_csv('drug_data.csv') # example from Intellipaat
# 2. Preprocess (encode categorical features, handle missing values)
# ... (LabelEncoder or OneHotEncoder for 'Sex', 'BP', 'Cholesterol', 'Drug')
X = [Link]('Drug', axis=1)
y = df['Drug']
# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=0
)
# 4. Create and train classifier
clf = DecisionTreeClassifier(criterion='gini', random_state=0)
[Link](X_train, y_train)
# 5. Predict and evaluate
y_pred = [Link](X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
# 6. Visualize the tree
import [Link] as plt
[Link](figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=[Link], class_names=clf.classes_)
[Link]()
Common Interview Points
• How is the root node selected in a decision tree?
– By choosing the feature (and threshold) that maximizes Information
Gain (entropy-based) or Gini impurity decrease at the root.
• Difference between entropy and Gini impurity?
– Both measure impurity; entropy comes from information theory, Gini
is simpler computationally. Both often lead to similar splits.
10
• Why do decision trees overfit?
– They can grow deep and memorize training data, capturing noise and
variance. Unpruned trees have low bias but very high variance.
• How to prevent overfitting?
– Pre-pruning (max_depth, min_samples_leaf, impurity thresholds)
and post-pruning (cost-complexity pruning). Also, use ensembles
(Random Forest, Gradient Boosting).
• Can decision trees handle both regression and classification?
– Yes. For regression, the split criterion is based on variance/MSE; leaf
prediction is often the mean of targets.
• What are weak learners in boosting?
– Typically shallow decision trees (e.g. depth-1 or depth-2) that indi-
vidually have performance slightly better than random but combine
to form a strong learner.
Summary
Decision Trees are fundamental models in machine learning, both as standalone
interpretable models and as building blocks of powerful ensembles like Ran-
dom Forests and Gradient Boosted Trees. Understanding entropy, Information
Gain, Gini impurity, and pruning is essential not only for theoretical mastery
but also for practical tuning and interpretation. These notes consolidate con-
ceptual explanations, mathematical foundations, and implementation patterns
as presented in Krish Naik’s ensemble-learning session (where trees act as base
learners) and Intellipaat’s detailed decision tree tutorials.
11