ML Mid-Term Notes (Complete)
ML Mid-Term Notes (Complete)
Example: A spam filter learns from past emails marked as “spam” or “not spam” and
automatically classifies new emails accordingly.
Image Recognition
Facial Recognition: Used in smartphones and social media (e.g., Facebook photo tagging).
Medical Imaging: Identifies diseases or abnormalities in X-rays, MRIs, and CT scans.
Object Detection: Used in autonomous vehicles, security, and retail systems.
Virtual Assistants: Siri, Alexa, and Google Assistant use ML to process voice commands.
Speech-to-Text: Converts spoken words into text (e.g., Google Voice Typing).
Voice Biometrics: Used for secure identity verification based on voice patterns.
Natural Language Processing (NLP)
Recommendation Systems
Fraud Detection
Autonomous Vehicles
Self-Driving Cars: Use ML to detect pedestrians, traffic signs, and navigate roads safely.
Drones: Use ML for route planning and obstacle avoidance.
Healthcare
Supervised Learning
The model learns from labeled data, each example has both input and correct output.
The goal is to learn the mapping between inputs and outputs.
Example: Email spam detec on (input: email content; output: spam or not spam).
Unsupervised Learning
The model learns from unlabeled data and discovers hidden structures or pa erns.
Semi-supervised Learning
Uses a small amount of labeled data combined with a large amount of unlabeled data to
improve accuracy.
Example: Image classifica on when only a few images are labeled but many are
unlabeled.
Reinforcement Learning
An agent interacts with an environment and learns through rewards or penal es.
The agent’s goal is to maximize total rewards over me.
Example: Training a robot to navigate a maze, rewarded when it reaches the goal.
Another Example: Ad placement on a webpage, an agent decides how many ads to
show based on reward (revenue) feedback.
Model & Algorithm: The model represents what the system learns; algorithms are the
methods used to train models (e.g., Linear Regression, Decision Trees).
Training & Testing: Data is split into training data (to teach the model) and testing data
(to check performance).
Loss Function: Measures how far off predictions are from actual results (e.g., Mean
Squared Error).
Learning Process
Steps:
1. Collect and prepare data
2. Train the model using an algorithm
3. Test the model with new data
4. Evaluate the results with performance metrics
Evaluation Metrics:
Accuracy: Overall correctness
Precision: Correct positive predictions among predicted positives
Recall: Ability to find all relevant cases
F1 Score: Harmonic mean of precision and recall
Overfitting: Model memorizes data → performs well on training but poorly on new data.
Underfitting: Model is too simple → fails to capture data patterns.
Fundamental Algorithms Overview
1. Linear Regression: Predicts continuous values (e.g., house prices).
2. Logistic Regression: For binary classification (e.g., pass/fail, spam/not spam).
3. Decision Trees: Splits data into branches for decision-making.
4. k-Nearest Neighbors (k-NN): Classifies based on similarity to nearby samples.
5. Neural Networks: Mimic the human brain; used in deep learning (e.g., image or speech recognition).
6. Support Vector Machines (SVM): Finds the best dividing boundary between different
classes.
Examples:
Regression → Predicting a car’s price based on its mileage, brand, and age.
Classification → Spam detection (classifying emails as spam or not spam).
What is Regression?
Regression is a type of supervised learning where the goal is to predict con nuous output
values (e.g., temperature, price, or age) based on one or more input variables.
Example: Predic ng the price of a house based on features like size, loca on, and
number of rooms.
Objec ve:
To find the best-fit line (or curve) that represents the rela onship between input variables
(X) and the output variable (Y).
This line should minimize the difference (error) between predicted and actual values.
Types of Regression
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Linear Regression
4. Logistic Regression
Equation:
𝒚 = 𝒃𝟎 + 𝒃𝟏 𝒙
Where:
Example: Predicting a student’s exam score (Y) based on hours studied (X):
𝒚 = 𝟓𝟎 + 𝟓𝒙
𝑰𝒇 𝒙 = 𝟒 𝒉𝒐𝒖𝒓𝒔:
𝒚 = 𝟓𝟎 + 𝟓 × 𝟒 = 𝟕𝟎
Real-world uses:
Predicting sales based on advertising spend.
Predicting temperature based on time of day.
Multiple Linear Regression
When there are two or more independent variables, the model is called Mul ple Linear
Regression.
Equation:
𝒚 = 𝒃𝟎 + 𝒃𝟏 𝒙𝟏 + 𝒃𝟐 𝒙𝟐 + 𝒃𝟑 𝒙𝟑 + … … + 𝒃𝒏 𝒙𝒏
Where:
Interpretation:
Intercept (2000): Base salary for zero education and experience.
Prediction Example:
𝑭𝒐𝒓 𝟏𝟔 𝒚𝒆𝒂𝒓𝒔 𝒐𝒇 𝒆𝒅𝒖𝒄𝒂𝒕𝒊𝒐𝒏 𝒂𝒏𝒅 𝟓 𝒚𝒆𝒂𝒓𝒔 𝒐𝒇 𝒆𝒙𝒑𝒆𝒓𝒊𝒆𝒏𝒄𝒆:
Real-world uses:
Predicting house prices using multiple features (area, location, number of rooms).
Forecasting revenue using advertising, market conditions, and product pricing.
Polynomial Linear Regression
Polynomial Regression models the rela onship between X and Y as an nth-degree
polynomial. Although the curve is non-linear, the model is linear in coefficients.
Equation:
𝒚 = 𝒃𝟎 + 𝒃𝟏 𝒙 + 𝒃𝟐 𝒙𝟐 𝟐 + 𝒃𝟑 𝒙𝟑 𝟑 + … … + 𝒃𝒏 𝒙𝒏 𝒏 𝟏
Where:
Use Case:
When data shows a curved relationship instead of a straight line.
Real-world uses:
Predicting population growth or stock trends where change is not constant.
Equation:
1
𝑦= (𝒃𝟎 𝒃𝟏 𝒙)
1+𝑒
In other words:
Example:
Actual (Y) Predicted (Ŷ) Error (Y - Ŷ)
50 48 +2
60 63 -3
70 68 +2
So, the model makes small mistakes — these differences are what we call errors.
The goal of training a regression model is to minimize the cost function — meaning, we
want to make predictions as close as possible to the real values.
Goal: Minimize the sum of squared errors (SSE) using op miza on techniques like Gradient
Descent.
Why it ma ers: A well-fit line accurately represents the rela onship between variables,
improving predic on accuracy.
𝒚 = 𝜽𝟏 + 𝜽𝟐 𝒙𝒊
𝒏
𝟏
𝑴𝑺𝑬/𝑱 = (𝒚 − 𝒚𝒊 )𝟐
𝒏
𝒊 𝟏
Where:
Example:
Actual (Y) Predicted (Ŷ) Error (Y - Ŷ) Error²
4 5 -1 1
3 2.5 0.5 0.25
5 4 1 1
6 5.5 0.5 0.25
So, the mean squared error = 0.625. A smaller MSE means a be er-fi ng regression model.
Root Mean Squared Error (RMSE)
RMSE is simply the square root of MSE, which brings the error value back to the same unit
as the output variable.
𝑹𝑴𝑺𝑬 = √𝑴𝑺𝑬
𝒏
𝟏
𝑴𝑨𝑬 = |𝒚 − 𝒚𝒊 |
𝒏
𝒊 𝟏
Example:
Actual (Y) Predicted (Ŷ) |Y − Ŷ|
4 5 1
3 2.5 0.5
5 4 1
6 5.5 0.5
1 + 0.5 + 1 + 0.5 3
𝑀𝐴𝐸 = = = 0.75
4 4
That line should make predictions that are as close as possible to the actual data points.
To achieve this, we adjust the parameters (𝜽𝟐 = slope, 𝜽𝟏 = intercept) until the cost function
(error) is minimized.
Gradient Descent helps us find those optimal parameter values step by step.
𝝏𝑱
𝜽𝟏 = 𝜽𝟏 − 𝜶
𝝏𝜽𝟏
𝝏𝑱
𝜽𝟐 = 𝜽𝟐 − 𝜶
𝝏𝜽𝟐
Where:
𝒏 𝒏
𝟏 𝝏 𝟏 𝝏
= 𝟐(𝒚 − 𝒚𝒊 ) (𝒚 − 𝒚𝒊 ) ∵ 𝒚 = 𝜽𝟏 + 𝜽𝟐 𝒙 𝒊 = 𝟐(𝒚 − 𝒚𝒊 ) (𝒚 − 𝒚𝒊 )
𝒏 𝝏𝜽𝟏 𝒏 𝝏𝜽𝟐
𝒊 𝟏 𝒊 𝟏
𝒏 𝒏
𝟏 𝝏 𝟏 𝝏
= 𝟐(𝒚 − 𝒚𝒊 ) (𝜽 + 𝜽𝟐 𝒙𝒊 − 𝒚𝒊 ) = 𝟐(𝒚 − 𝒚𝒊 ) (𝜽 + 𝜽𝟐 𝒙𝒊 − 𝒚𝒊 )
𝒏 𝝏𝜽𝟏 𝟏 𝒏 𝝏𝜽𝟐 𝟏
𝒊 𝟏 𝒊 𝟏
𝒏 𝒏
𝟏 𝟏
= 𝟐(𝒚 − 𝒚𝒊 ) (𝟏 + 𝟎 − 𝟎) = 𝟐(𝒚 − 𝒚𝒊 ) (𝟎 + 𝒙𝒊 − 𝟎)
𝒏 𝒏
𝒊 𝟏 𝒊 𝟏
𝒏 𝒏
𝟐 𝟐
𝑱 𝜽𝟏 = (𝒚 − 𝒚𝒊 ) … … … … (𝒊) 𝑱 𝜽𝟐 = (𝒚 − 𝒚𝒊 ) 𝒙𝒊 … … … … (𝒊𝒊)
𝒏 𝒏
𝒊 𝟏 𝒊 𝟏
𝒏
𝟐 𝒏
𝜽𝟏 = 𝜽𝟏 − 𝜶 × (𝒚𝒊 − 𝒚𝒊 ) 𝟐
𝒏 𝜽𝟐 = 𝜽𝟐 − 𝜶 × (𝒚𝒊 − 𝒚𝒊 ) 𝒙𝒊
𝒊=𝟏 𝒏
𝒊=𝟏
These formulas iteratively adjust θ1 and θ2 to reduce the error and reach the minimum point of the cost
function.
Problem 1: We are given the following dataset showing the rela onship between
Age (X) and Salary (Y):
A linear regression model is used to predict salary from age using the hypothesis
func on:
𝒚 = 𝜽𝟏 + 𝜽𝟐 𝒙𝒊
Solution
(1)
Is given,
We know,
𝒏
𝟏 1
𝑱= (𝒚 − 𝒚𝒊 )𝟐 = (𝑦 − 𝑦 ) … … … (𝑖)
𝒏 7
𝒊 𝟏
(𝑦 − 𝑦 ) = (300 + 10 × 30 − 800)2 + (300 + 10 × 37 − 950)2 + (300 + 10 × 25 − 600)2 + (300 + 10 × 43 − 1050)2
1 1
𝐽= (𝑦 − 𝑦 ) = (𝟓𝟐𝟏𝟑𝟗𝟗. 𝟗𝟗𝟖) = 𝟕𝟒𝟒𝟖𝟓. 𝟕𝟏𝟒
7 7
(2)
Gadients 𝑱 𝜽𝟏 Gradients 𝑱 𝜽𝟐
𝒏 𝒏
𝟐 𝟐
𝑱 𝜽𝟏 = (𝒚 − 𝒚𝒊 ) 𝑱 𝜽𝟐 = (𝒚 − 𝒚𝒊 ) 𝒙𝒊
𝒏 𝒏
𝒊 𝟏 𝒊 𝟏
𝒏 𝒏
𝟐 𝟐
𝑱 𝜽𝟏 = (𝟑𝟎𝟎 + 𝟏𝟎𝒙𝒊 − 𝒚𝒊 ) 𝑱 𝜽𝟐 = (𝟑𝟎𝟎 + 𝟏𝟎𝒙𝒊 − 𝒚𝒊 ) 𝒙𝒊
𝒏 𝒏
𝒊 𝟏 𝒊 𝟏
(3)
NEW 𝜽𝟏 NEW 𝜽𝟐
𝜽𝟏 = 𝜽𝟏 − 𝜶 × 𝑱′ 𝜽𝟏 𝜽𝟐 = 𝜽𝟐 − 𝜶 × 𝑱′ 𝜽𝟐
= 𝟑𝟎𝟎 − 𝟎. 𝟎𝟎𝟎𝟏 × −𝟒𝟗𝟕. 𝟏𝟒𝟏𝟐 = 𝟏𝟎 − 𝟎. 𝟎𝟎𝟎𝟏 × −𝟐𝟎𝟑𝟖𝟖. 𝟓𝟕𝟒𝟏𝟐
= 𝟑𝟎𝟎. 𝟎𝟒𝟗𝟕 = 𝟏𝟐. 𝟎𝟑
(4)
Is given,
(5)
We know,
𝒏
𝟏 1
𝑱= (𝒚 − 𝒚𝒊 )𝟐 = (𝑦 − 𝑦 ) = 39084.8289
𝒏 7
𝒊 𝟏
This makes computa on simpler and more efficient — especially when using tools like
NumPy, MATLAB, or TensorFlow.
Equation:
𝒀 = 𝑿𝒘
Where:
𝟏 𝟏 𝟐 𝒘 = (𝑿𝑻 𝑿) 𝟏 𝑿𝑻 𝒀
𝑿= , 𝒀=
𝟏 𝟐 𝟑
Where:
The first column of 1’s in X corresponds to the
𝑿𝑻 = Transpose of X
intercept term θ0
(𝑿𝑻 𝑿) 𝟏 = Inverse of the product 𝑿𝑻 𝑿
𝒘 = Parameter vector [θ0,θ1]T
1 1 1 1 2 3 1 1 2 5
𝑿𝑻 𝑿 = = 𝑿𝑻 𝒀 = =
1 2 1 2 3 5 1 2 3 8
−1 𝟓 −𝟑 𝟓 𝟏
(𝑿𝑻 𝑿) 𝟏
= 2 3 = 5 −3 𝒘= × =
3 5 −3 2 −𝟑 𝟐 𝟖 𝟏
𝒀 = 𝟏 + 𝟏𝑿
Problem 2: You are given the following dataset showing the rela onship between
hours studied (X) and exam scores (Y):
a) Using Linear Regression in Matrix Form, find the best-fit line equa on Y=θ0+θ1X
b) by compu ng all steps manually: 𝛉 = (𝑿𝑻 𝑿) 𝟏 𝑿𝑻 𝒀, Then, predict the score when a
student studies for 4 hours.
Solutions
(a)
Step1: Represen ng Data in Matrix Form Step2: is given Normal Equa on:
1 0.86 2.49
⎡1 0.09 ⎤ ⎡ 0.83 ⎤
⎢1 −0.85⎥ ⎢−0.25⎥ 𝜽 = (𝑿𝑻 𝑿) 𝟏 𝑿𝑻 𝒀
⎢1 0.87 ⎥ ⎢ 3.10 ⎥
⎢1 −0.44⎥ ⎢ 0.87 ⎥ Where:
𝑿 = ⎢1 , 𝒀=⎢
−0.43⎥ 0.02 ⎥
⎢1 −1.10⎥ ⎢−0.12⎥ 𝑿𝑻 = Transpose of X
⎢1 (𝑿𝑻 𝑿) 𝟏 = Inverse of the product 𝑿𝑻 𝑿
0.40 ⎥ ⎢ 1.81 ⎥
⎢1 𝜽 = Parameter vector [θ0,θ1]T
−0.96⎥ ⎢−0.83⎥
⎣1 0.17 ⎦ ⎣ 0.43 ⎦
Step3: Step-by-Step Calcula on
Step3.3: Compute 𝜽 = (𝑿𝑻 𝑿) 𝟏 𝑿𝑻 𝒀 Step4: Final Model Equa on for best fit line
Step5: figure
(b)
𝒀 = 𝟏. 𝟔𝟎 + 𝟏. 𝟎𝟓𝑿
𝒀 = 𝟏. 𝟔𝟎 + 𝟏. 𝟎𝟓 × 𝟒 = 𝟓. 𝟖𝟎𝟎
Logistic Regression
Used to predict the probability that something belongs to a certain class.
Example: Is an email spam (1) or not spam (0)?
Binary Classification:
How it works:
𝒚 = 𝜽𝟏 + 𝜽𝟐 𝒙𝒊 becomes 𝒚 = 𝝈(𝜽𝟏 + 𝜽𝟐 𝒙𝒊 )
2. Then, instead of giving the sum directly, it passes it through the sigmoid (logistic)
function:
𝟏
𝝈(𝒕) =
𝟏 𝒆 𝒕
Key Features
Term Meaning
Instance-based / Lazy learner Does not build a model; stores training data
Neighbor Class
1 Cat
2 Dog
3 Cat
𝒅= (𝒙𝟐 − 𝒙𝟏 )𝟐 + (𝒚𝟐 − 𝒚𝟏 )𝟐
Weight (grams)
Sweetness level (1–10)
Solutions:
Compute Euclidean Distance:
(𝒙𝟐 − 𝒙𝟏 )𝟐 + (𝒚𝟐 − 𝒚𝟏 )𝟐
height (cm)
Weight (kg)
a) Using the above dataset and KNN algorithm with K = 3, classify whether the person is
Underweight or Normal.
Solutions:
Compute Euclidean Distance:
(𝒙𝟐 − 𝒙𝟏 )𝟐 + (𝒚𝟐 − 𝒚𝟏 )𝟐
Coordinates Distance
(167, 51) (170 − 167) + (57 − 51) = 6.70
(182, 62) (170 − 182) + (57 − 62) = 13.0
(176, 69) (170 − 176) + (57 − 69) = 13.4
(173, 64) (170 − 173) + (57 − 64) = 7.6
(172,65) (170 − 172) + (57 − 65) = 8.2
(174, 56) (170 − 174) + (57 − 56) = 4.1
(169, 58) (170 − 169) + (57 − 58) = 1.4
(173, 57) (170 − 173) + (57 − 57) = 3
(170, 55) (170 − 170) + (57 − 55) = 2
Majority Vote
Class Count
Normal 3
Underweight 0
They collect data from six previously released movies, each described by:
Now, a new movie, M7 — “Love & Fury”, has just been added to the platform.
It has the following attributes:
a) Based on the given data, use the K-Nearest Neighbors (KNN) algorithm with K = 5 to
determine the most suitable genre for the new movie “Love & Fury”. Show your reasoning or
steps briefly.
Solutions:
Compute Euclidean Distance:
(𝒙𝟐 − 𝒙𝟏 )𝟐 + (𝒚𝟐 − 𝒚𝟏 )𝟐 + (𝒛𝟐 − 𝒛𝟏 )𝟐
Coordinates Distance
(150, 12, 2) (135 − 150) + (6 − 12) + (3 − 2) = 16.19
(140, 10, 1) (135 − 140) + (6 − 10) + (3 − 1) = 6.71
(120, 3, 10) (135 − 120) + (6 − 3) + (3 − 10) = 16.82
(110, 2, 9) (135 − 110) + (6 − 2) + (3 − 9) = 26.02
(130 ,5, 4) (135 − 130) + (6 − 5) + (3 − 4) = 5.20
(125, 4, 5) (135 − 125) + (6 − 4) + (3 − 5) = 10.39
Majority Vote
Class Count
Comedy 2
Action 2
Romantic 1
Handle the tie (Comedy vs Action): Use the smaller K (e.g., K=3) as a ebreaker
Class Count
Comedy 2
Action 1
Romantic 0
Disadvantages of KNN
Issue Explanation
Slow for large datasets Calculates distance to every point
Needs memory to store data All data must be saved
Sensitive to noise/outliers Wrong results if data is messy
Feature scaling required Variables with high values dominate distance
Important Concepts
Term Meaning
It works by splitting data into branches based on certain decision rules, forming a structure
that looks like a tree.
Example: If you want to predict whether someone will buy a computer based on features like
age, income, student status, and credit rating,
Purpose of ID3:
Build a decision tree that classifies data with the least uncertainty.
Selects attributes that provide the highest information gain.
Splits the dataset recursively until:
o All instances are classified, or
o No attributes remain.
Where, Where,
𝒏 D = Original dataset.
pi = proportion of class i, 𝒑𝒊 = 𝒊 A = Attribute being evaluated.
𝑵
Dv = Subset of D where attribute A takes value v.
N = total sample
𝒏𝒊 = Samples belonging to class i
The a ribute with the highest informa on gain becomes the decision node.
Step 5: Recursion:
Repeat until:
All data in subset belongs to the same class
No attributes left
Dataset becomes empty
Problem1: Consider the following dataset of 14 customers with attributes age, income,
student, credit_rating, and the target class buys_computer.
Dataset:
age income student credit_rating buys_computer
≤ 𝟑𝟎 high no fair no
≤ 𝟑𝟎 high no excellent no
𝟑𝟏 … 𝟒𝟎 high no fair yes
> 𝟒𝟎 medium no fair yes
> 𝟒𝟎 low yes fair yes
> 𝟒𝟎 low yes excellent no
𝟑𝟏 … 𝟒𝟎 low yes excellent yes
≤ 𝟑𝟎 medium no fair no
≤ 𝟑𝟎 low yes fair yes
> 𝟒𝟎 medium yes fair yes
≤ 𝟑𝟎 medium yes excellent yes
𝟑𝟏 … 𝟒𝟎 medium no excellent yes
𝟑𝟏 … 𝟒𝟎 high yes fair yes
> 𝟒𝟎 medium no excellent no
Is given,
𝑵 = 𝟏𝟒
𝒏 𝟏 = 𝟗
𝒏 𝟐 = 𝟓
We Know,
𝒏𝒊
proportion of class 𝑖 = 𝒑𝒊 =
𝑵
𝒏𝟏 𝟗
proportion of class 1 = 𝒑𝟏 = = = 𝟎. 𝟔𝟒𝟑
𝑵 𝟏𝟒
𝒏𝟐 𝟓
proportion of class 2 = 𝒑𝟐 = = = 𝟎. 𝟑𝟓𝟕
𝑵 𝟏𝟒
now,
𝒏
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
(2)
We Know,
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐀) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − × 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
|𝑫|
age income student credit_rating buys_computer
≤ 𝟑𝟎 high no fair no
≤ 𝟑𝟎 high no excellent no
𝟑𝟏 … 𝟒𝟎 high no fair yes
For Age (values: <=30, 31..40, >40): > 𝟒𝟎 medium no fair yes
Is given, For Age ≤ 𝟑𝟎: > 𝟒𝟎 low yes fair yes
> 𝟒𝟎 low yes excellent no
𝟑𝟏 … 𝟒𝟎 low yes excellent yes
Totals for ≤ 𝟑𝟎 : 5 datasets → 2 𝒚𝒆𝒔, 3 𝒏𝒐 .
≤ 𝟑𝟎 medium no fair no
𝒄𝒍𝒂𝒔𝒔 𝟏 𝒄𝒍𝒂𝒔𝒔 𝟐 ≤ 𝟑𝟎 low yes fair yes
𝑵 𝟑𝟎 = 𝟓 > 𝟒𝟎 medium yes fair yes
≤ 𝟑𝟎 medium yes excellent yes
𝒏 𝟏 = 𝟐 𝟑𝟏 … 𝟒𝟎 medium no excellent yes
𝒏 𝟐 = 𝟑 𝟑𝟏 … 𝟒𝟎 high yes fair yes
> 𝟒𝟎 medium no excellent no
We Know,
𝒏𝒊
proportion of class 𝑖 = 𝒑𝒊 =
𝑵
𝒏𝟏 𝟐
proportion of class 1 = 𝒑𝟏 = = = 𝟎. 𝟒𝟎𝟎
𝑵 𝟓
𝒏𝟐 𝟑
proportion of class 2 = 𝒑𝟐 = = = 𝟎. 𝟔𝟎𝟎
𝑵 𝟓
now,
𝒏
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝟑𝟎 ) = 0.971
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝟒𝟎 ) = 0.971
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.694
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐀𝐠𝐞) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.911
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐈𝐧𝐜𝐨𝐦𝐞) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
3 3 4 4 6 6 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒚𝒆𝒔 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
7 7 7 7 7 7 7 7
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐 ) = 0.985 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒚𝒆𝒔 = 0.592
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.789
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐒𝐭𝐮𝐝𝐞𝐧𝐭) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
6 6 2 2 3 3 3 3
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒇𝒂𝒊𝒓 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒆𝒙𝒄𝒆𝒍𝒍𝒆𝒏𝒕 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
8 8 8 8 6 6 6 6
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒇𝒂𝒊𝒓 = 0.811 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒆𝒙𝒄𝒆𝒍𝒍𝒆𝒏𝒕 ) = 1.000
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.892
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐜𝐫𝐞𝐝𝐢𝐭_𝐫𝐚𝐭𝐢𝐧𝐠 ) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
We have,
𝐈𝐆(𝐃, 𝐀𝐠𝐞) = 0.940 − 0.694 = 0.246
𝐈𝐆(𝐃, 𝐈𝐧𝐜𝐨𝐦𝐞) = 0.940 − 0.911 = 0.029
𝐈𝐆(𝐃, 𝐒𝐭𝐮𝐝𝐞𝐧𝐭) = 0.940 − 0.789 = 0.152
𝐈𝐆(𝐃, 𝐜𝐫𝐞𝐝𝐢𝐭_𝐫𝐚𝐭𝐢𝐧𝐠 ) = 0.940 − 0.892 = 0.048
We Know,
(4)
age income student credit_rating buys age income student credit_rating buys
≤ 𝟑𝟎 high no fair no >40 medium no fair yes
≤ 𝟑𝟎 high no excellent no >40 low yes fair yes
≤ 𝟑𝟎 medium no fair no >40 low yes excellent no
≤ 𝟑𝟎 low yes fair yes >40 medium yes fair yes
≤ 𝟑𝟎 medium yes excellent yes >40 medium no excellent no
For next left side root node:
We Know,
𝒏𝒊
proportion of class 𝑖 = 𝒑𝒊 =
𝑵
𝒏𝟏 𝟐
proportion of class 1 = 𝒑𝟏 = = = 𝟎. 𝟒𝟎𝟎
𝑵 𝟓
𝒏𝟐 𝟑
proportion of class 2 = 𝒑𝟐 = = = 𝟎. 𝟔𝟎𝟎
𝑵 𝟓
now,
𝒏
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
age income student credit_rating buys age income student credit_rating buys
≤ 𝟑𝟎 high no fair no ≤ 𝟑𝟎 high no fair no
≤ 𝟑𝟎 high no excellent no ≤ 𝟑𝟎 high no excellent no
≤ 𝟑𝟎 medium no fair no ≤ 𝟑𝟎 medium no fair no
≤ 𝟑𝟎 low yes fair yes ≤ 𝟑𝟎 low yes fair yes
≤ 𝟑𝟎 medium yes excellent yes ≤ 𝟑𝟎 medium yes excellent yes
2 2 0 0 1 1 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒉𝒊𝒈𝒉 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒎𝒆𝒅𝒊𝒖𝒎 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
2 2 2 2 2 2 2 2
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒉𝒊𝒈𝒉 = 0.000 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒎𝒆𝒅𝒊𝒖𝒎 ) = 1.000
1 1 0 0
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒍𝒐𝒘 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
1 1 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒍𝒐𝒘 ) = 0.000
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.400
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐈𝐧𝐜𝐨𝐦𝐞) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
age income student credit_rating buys age income student credit_rating buys
≤ 𝟑𝟎 high no fair no ≤ 𝟑𝟎 high no fair no
≤ 𝟑𝟎 high no excellent no ≤ 𝟑𝟎 high no excellent no
≤ 𝟑𝟎 medium no fair no ≤ 𝟑𝟎 medium no fair no
≤ 𝟑𝟎 low yes fair yes ≤ 𝟑𝟎 low yes fair yes
≤ 𝟑𝟎 medium yes excellent yes ≤ 𝟑𝟎 medium yes excellent yes
3 3 0 0 2 2 0 0
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒚𝒆𝒔 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
3 3 3 3 2 2 2 2
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐 ) = 0.000 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒚𝒆𝒔 = 0.000
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.000
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐒𝐭𝐮𝐝𝐞𝐧𝐭) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
age income student credit_rating buys age income student credit_rating buys
≤ 𝟑𝟎 high no fair no ≤ 𝟑𝟎 high no fair no
≤ 𝟑𝟎 high no excellent no ≤ 𝟑𝟎 high no excellent no
≤ 𝟑𝟎 medium no fair no ≤ 𝟑𝟎 medium no fair no
≤ 𝟑𝟎 low yes fair yes ≤ 𝟑𝟎 low yes fair yes
≤ 𝟑𝟎 medium yes excellent yes ≤ 𝟑𝟎 medium yes excellent yes
1 1 2 2 1 1 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒇𝒂𝒊𝒓 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 excellent ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
3 3 3 3 2 2 2 2
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒇𝒂𝒊𝒓 = 0.918 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 excellent ) = 1.000
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.951
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐜𝐫𝐞𝐝𝐢𝐭_𝐫𝐚𝐭𝐢𝐧𝐠 ) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
We have,
𝐈𝐆(𝐃, 𝐈𝐧𝐜𝐨𝐦𝐞) = 0.940 − 0.911 = 0.571
𝐈𝐆(𝐃, 𝐒𝐭𝐮𝐝𝐞𝐧𝐭) = 0.940 − 0.789 = 0.971
𝐈𝐆(𝐃, 𝐜𝐫𝐞𝐝𝐢𝐭_𝐫𝐚𝐭𝐢𝐧𝐠 ) = 0.940 − 0.892 = 0.020
We Know,
We Know,
𝒏𝒊
proportion of class 𝑖 = 𝒑𝒊 =
𝑵
𝒏𝟏 𝟑
proportion of class 1 = 𝒑𝟏 = = = 𝟎. 𝟔𝟎𝟎
𝑵 𝟓
𝒏𝟐 𝟐
proportion of class 2 = 𝒑𝟐 = = = 𝟎. 𝟒𝟎𝟎
𝑵 𝟓
now,
𝒏
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
age income student credit_rating buys age income student credit_rating buys
>40 medium no fair yes >40 medium no fair yes
>40 low yes fair yes >40 low yes fair yes
>40 low yes excellent no >40 low yes excellent no
>40 medium yes fair yes >40 medium yes fair yes
>40 medium no excellent no >40 medium no excellent no
2 2 1 1 1 1 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒎𝒆𝒅𝒊𝒖𝒎 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒍𝒐𝒘 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
3 3 3 3 2 2 2 2
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒎𝒆𝒅𝒊𝒖𝒎 ) = 0.918 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒍𝒐𝒘 ) = 1.000
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.951
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐈𝐧𝐜𝐨𝐦𝐞) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
𝐈𝐆(𝐃, 𝐈𝐧𝐜𝐨𝐦𝐞) = 0.971 − 0.951 = 0.020
For Student (values: yes, no):
age income student credit_rating buys age income student credit_rating buys
>40 medium no fair yes >40 medium no fair yes
>40 low yes fair yes >40 low yes fair yes
>40 low yes excellent no >40 low yes excellent no
>40 medium yes fair yes >40 medium yes fair yes
>40 medium no excellent no >40 medium no excellent no
1 1 1 1 2 2 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒚𝒆𝒔 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
2 2 2 2 3 3 3 3
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐 ) = 1.000 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒚𝒆𝒔 = 0.918
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.951
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐒𝐭𝐮𝐝𝐞𝐧𝐭) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.000
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.971
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐜𝐫𝐞𝐝𝐢𝐭_𝐫𝐚𝐭𝐢𝐧𝐠 ) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
We have,
𝐈𝐆(𝐃, 𝐈𝐧𝐜𝐨𝐦𝐞) = 0.940 − 0.911 = 0.020
𝐈𝐆(𝐃, 𝐒𝐭𝐮𝐝𝐞𝐧𝐭) = 0.940 − 0.789 = 0.020
𝐈𝐆(𝐃, 𝐜𝐫𝐞𝐝𝐢𝐭_𝐫𝐚𝐭𝐢𝐧𝐠 ) = 0.940 − 0.892 = 0.971
We Know,
Is given,
𝒂𝒈𝒆 = ≤ 𝟑𝟎
𝒊𝒏𝒄𝒐𝒎𝒆 = 𝒍𝒐𝒘
𝒔𝒕𝒖𝒅𝒆𝒏𝒕 = 𝒚𝒆𝒔
𝒄𝒓𝒆𝒅𝒊𝒕_𝒓𝒂𝒕𝒊𝒏𝒈 = 𝒇𝒂𝒊𝒓
So, Predict the value of the target attribute:
𝒃𝒖𝒚𝒔_𝒄𝒐𝒎𝒑𝒖𝒕𝒆𝒓 = 𝒀𝑬𝑺
Problem2: Consider the following dataset of 14 Records with attributes Outlook,
Temp, Humidity, Wind, and the target class Play Tennis.
Dataset:
Day Outlook Temp Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
3. Determine the best attribute for the root node based on highest Information Gain.
4. Using your root node, perform the next split on the corresponding subsets and show
the resulting smaller decision tree.
5. Finally, use the constructed decision tree to classify the following new instance:
Outlook = Sunny
Temp = Cool
Humidity = High
Wind = Weak
Predict the value of the target attribute: Play Tennis?
(1)
Is given,
𝑵 = 𝟏𝟒
𝒏 𝟏 = 𝟗
𝒏 𝟐 = 𝟓
We Know,
𝒏𝒊
proportion of class 𝑖 = 𝒑𝒊 =
𝑵
𝒏𝟏 𝟗
proportion of class 1 = 𝒑𝟏 = = = 𝟎. 𝟔𝟒𝟑
𝑵 𝟏𝟒
𝒏𝟐 𝟓
proportion of class 2 = 𝒑𝟐 = = = 𝟎. 𝟑𝟓𝟕
𝑵 𝟏𝟒
now,
𝒏
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
Entropy of the en re dataset for the target a ribute Play Tennis, 𝑫 = 0.940
(2)
We Know,
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐀) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − × 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
|𝑫|
For Outlook (values: Sunny, Overcast, Rain):
Day Outlook Temp Humidity Wind Play Tennis
D1 Sunny Hot High Weak No(sunny)
D2 Sunny Hot High Strong No(sunny)
D3 Overcast Hot High Weak Yes (Overcast)
D4 Rain Mild High Weak Yes (Rain)
D5 Rain Cool Normal Weak Yes (Rain)
D6 Rain Cool Normal Strong No (Rain)
D7 Overcast Cool Normal Strong Yes (Overcast)
D8 Sunny Mild High Weak No(sunny)
D9 Sunny Cool Normal Weak Yes(sunny)
D10 Rain Mild Normal Weak Yes (Rain)
D11 Sunny Mild Normal Strong Yes(sunny)
D12 Overcast Mild High Strong Yes (Overcast)
D13 Overcast Hot Normal Weak Yes (Overcast)
D14 Rain Mild High Strong No (Rain)
2 2 3 3 4 4 0 0
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒔𝒖𝒏𝒏𝒚 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒐𝒗𝒆𝒓𝒄𝒂𝒔𝒕 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
5 5 5 5 4 4 4 4
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒔𝒖𝒏𝒏𝒚 = 0.971 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒐𝒗𝒆𝒓𝒄𝒂𝒔𝒕 ) = 0.000
3 3 2 2
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒓𝒂𝒊𝒏 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
5 5 5 5
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒓𝒂𝒊𝒏 ) = 0.971
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.694
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐎𝐮𝐭𝐥𝐨𝐨𝐤 ) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
2 2 2 2 4 4 2 2
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑯𝒐𝒕 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑴𝒊𝒍𝒅 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
4 4 4 4 6 6 6 6
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑯𝒐𝒕 ) = 1.000 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑴𝒊𝒍𝒅 ) = 0.918
3 3 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑪𝒐𝒐𝒍 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
4 4 4 4
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑪𝒐𝒐𝒍 ) = 0.811
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.911
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐓𝐞𝐦𝐩 ) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
3 3 4 4 6 6 1 1
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒉𝒊𝒈𝒉 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐𝒓𝒎𝒂𝒍 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
7 7 7 7 7 7 7 7
𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝒉𝒊𝒈𝒉 = 0.985 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝒏𝒐𝒓𝒎𝒂𝒍 ) = 0.592
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.789
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐇𝐮𝐦𝐢𝐝𝐢𝐭𝐲 ) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
6 6 2 2 3 3 3 3
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑾𝒆𝒂𝒌 ) = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝑺𝒕𝒓𝒐𝒏𝒈 = − 𝑙𝑜𝑔 − 𝑙𝑜𝑔
8 8 8 8 6 6 6 6
𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝑾𝒆𝒂𝒌 ) = 0.811 𝑬𝒏𝒕𝒓𝒐𝒑𝒚 𝑫𝒗 𝑺𝒕𝒓𝒐𝒏𝒈 = 1.000
|𝑫𝒗 |
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 ) = 0.892
|𝑫|
So,
We, have 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫) = 0.940
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐖𝐢𝐧𝐝 ) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − ∑ |𝑫|
× 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
We have,
𝐈𝐆(𝐃, 𝐎𝐮𝐭𝐥𝐨𝐨𝐤 ) = 0.940 − 0.694 = 0.246
𝐈𝐆(𝐃, 𝐓𝐞𝐦𝐩 ) = 0.940 − 0.911 = 0.029
𝐈𝐆(𝐃, 𝐇𝐮𝐦𝐢𝐝𝐢𝐭𝐲 ) = 0.940 − 0.789 = 0.151
𝐈𝐆(𝐃, 𝐖𝐢𝐧𝐝 ) = 0.940 − 0.892 = 0.048
We Know,
(4)
Outlook Temp Humidity Wind Play Tennis Outlook Temp Humidity Wind Play Tennis
Sunny Hot High Weak No Rain Mild High Weak Yes
Sunny Hot High Strong No Rain Cool Normal Weak Yes
Sunny Mild High Weak No Rain Cool Normal Strong No
Sunny Cool Normal Weak Yes Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes Rain Mild High Strong No
Next, repeat the same process with the other nodes with smaller dataset
FINAL DECISION TREE
(5)
Is given,
𝑶𝒖𝒕𝒍𝒐𝒐𝒌 = 𝑺𝒖𝒏𝒏𝒚
𝑻𝒆𝒎𝒑 = 𝑪𝒐𝒐𝒍
𝑯𝒖𝒎𝒊𝒅𝒊𝒕𝒚 = 𝑯𝒊𝒈𝒉
𝑾𝒊𝒏𝒅 = 𝑾𝒆𝒂𝒌
So, Predict the value of the target attribute:
𝑷𝒍𝒂𝒚 𝑻𝒆𝒏𝒏𝒊𝒔 = 𝑵𝑶
Gain Ratio
Gain Ratio is an attribute selection measure used in the C4.5 Decision Tree algorithm.
It is an improvement over Information Gain (IG).
To fix this problem, C4.5 uses Gain Ratio, which adjusts IG by considering how broadly an
attribute splits the data.
𝐈. 𝐆𝐚𝐢𝐧(𝐀)
𝐆𝐚𝐢𝐧𝐑𝐚𝐭𝐢𝐨(𝐀) =
𝐒𝐩𝐥𝐢𝐭𝐈𝐧𝐟𝐨(𝐀)
Where:
𝒗
|𝑫𝒗 |
𝐈. 𝐆𝐚𝐢𝐧(𝐀) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − × 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
|𝑫|
𝒊 𝟏
𝒗
|𝑫𝒗 | |𝑫𝒗 |
𝐒𝐩𝐥𝐢𝐭𝐈𝐧𝐟𝐨(𝐀) = − × 𝐥𝐨𝐠 𝟐
|𝑫| |𝑫|
𝒊 𝟏
Problem3: Compute the Gain Ratio for the attribute Outlook using the Play Tennis
dataset. Provide the Information Gain, SplitInfo, and the final Gain Ratio.
We have,
𝐈𝐆(𝐃, 𝐎𝐮𝐭𝐥𝐨𝐨𝐤 ) = 0.940 − 0.694 = 0.246
We Know,
Outlook Count Proportion (p)
Sunny 5 5/14 = 0.3571
Overcast 4 4/14 = 0.2857
Rain 5 5/14 = 0.3571
𝒗
|𝑫𝒗 | |𝑫𝒗 |
𝐒𝐩𝐥𝐢𝐭𝐈𝐧𝐟𝐨(𝐀) = − × 𝐥𝐨𝐠 𝟐
|𝑫| |𝑫|
𝒊 𝟏
5 5 4 4 5 5
𝐒𝐩𝐥𝐢𝐭𝐈𝐧𝐟𝐨(Outlook) = − log − log − log
14 14 14 14 14 14
𝐒𝐩𝐥𝐢𝐭𝐈𝐧𝐟𝐨(Outlook) = 1.577
So,
𝐈. 𝐆𝐚𝐢𝐧(𝐀)
𝐆𝐚𝐢𝐧𝐑𝐚𝐭𝐢𝐨(𝐎𝐮𝐭𝐥𝐨𝐨𝐤 ) =
𝐒𝐩𝐥𝐢𝐭𝐈𝐧𝐟𝐨(𝐀)
0.246
𝐆𝐚𝐢𝐧𝐑𝐚𝐭𝐢𝐨(𝐎𝐮𝐭𝐥𝐨𝐨𝐤 ) = = 0.156
1.577
Gini Index
The Gini Index (or Gini Impurity) is a measure used in CART (Classification and
Regression Trees) to select the best attribute for splitting data.
𝐆𝐢𝐧𝐢(𝐒) = 𝟏 − (𝒑𝒊 )𝟐
𝒊 𝟏
Where:
𝒑𝒊 = proportion of class i
𝒄 = number of classes
𝒗
|𝑺𝒗 |
𝐒𝐩𝐥𝐢𝐭𝐈𝐧𝐟𝐨(𝐒) = 𝐆𝐢𝐧𝐢(𝐒𝒗 )
|𝑺|
𝒊 𝟏
Problem4: Using the Play Tennis dataset below, compute the Gini Index to determine
the best attribute for splitting the dataset. Show all steps clearly.
Dataset:
Day Outlook Temp Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
(a)
Is given,
𝑵 = 𝟏𝟒
𝒏 𝟏 = 𝟗
𝒏 𝟐 = 𝟓
We Know,
𝒏𝒊
proportion of class 𝑖 = 𝒑𝒊 =
𝑵
𝒏𝟏 𝟗
proportion of class 1 = 𝒑𝟏 = = = 𝟎. 𝟔𝟒𝟑
𝑵 𝟏𝟒
𝒏𝟐 𝟓
proportion of class 2 = 𝒑𝟐 = = = 𝟎. 𝟑𝟓𝟕
𝑵 𝟏𝟒
Now,
𝒄
𝐆𝐢𝐧𝐢(𝐒) = 𝟏 − (𝒑𝒊 )𝟐
𝒊 𝟏
𝟐
𝐆𝐢𝐧𝐢(𝐒) = 𝟏 − (𝒑𝒊 )𝟐
𝒊 𝟏
𝐆𝐢𝐧𝐢(𝐒) = 1 − {𝑝 +𝑝 }
𝐆𝐢𝐧𝐢(𝐒) = 0.459
(b)
Compute Gini Index for a Split for Outlook:
4 0
Gini(Overcast) = 1 − + = 0.000
4 4
3 3
Gini(Rain) = 1 − + = 0.480
5 5
Hot 2 2 4
Mild 4 2 6
Cool 3 1 4
High 3 4 7
Normal 6 1 7
(c)
We have,
Attribute Gini Split
Outlook 0.343
Temperature 0.441
Humidity 0.367
Wind 0.428
We Know,
𝐵𝐸𝑆𝑇 𝐴𝑇𝑇𝑅𝐼𝐵𝑈𝑇𝐸 = 𝑂𝑈𝑇𝐿𝑂𝑂𝐾
Because it has the lowest weighted Gini Index
Thus, CART Decision Tree will split on Outlook first.
(d)
Weighted Gini Index for Outlook:
5 4 5
Gini = (0.48) + (0) + (0.48) = 0.343
14 14 14
Overfitting in Decision Trees
What is it?
Tree becomes too complex.
Fits noise instead of patterns.
Symptoms:
o High training accuracy.
o Low test accuracy.
Causes
Tree grown too deep.
Too many branches capturing random noise.
Result
Poor generalization.
Stopping criteria
Max tree depth reached.
Minimum samples at leaf.
Minimum info gain threshold.
Example
If IG < 0.01 → stop splitting.
Advantages
Prevents overfitting
Faster training
Simple, interpretable tree
Disadvantages
May stop early and miss good splits
Hard to choose thresholds
Post-Pruning (Cost Complexity Pruning)
First grow full tree, then prune unnecessary branches.
Steps
1. Grow full tree.
2. Starting from leaves, remove branches that don’t improve accuracy.
3. Use validation set or cross-validation.
Example
In spam classification, deep branches with rare word patterns are pruned.
Advantages
o Better generalization
o Simplifies tree
o Data-driven pruning
Disadvantages
o Extra computation
o Needs validation set
o Time-consuming on large trees