ML Mid-Term Notes (Complete)

The document provides comprehensive notes on Machine Learning, covering its introduction, applications, types, key concepts, and fundamental algorithms. It details supervised, unsupervised, semi-supervised, and reinforcement learning, along with regression techniques and their equations. Additionally, it discusses challenges in ML and the importance of data quality, model training, and evaluation metrics.

Mohsin Iban Hossain

AIUB, Machine Learning Notes


Table of Contents
1. Introduction to Machine Learning
2. Regression
3. K-Nearest Neighbors (KNN)
4. Decision Tree
Introduction to Machine Learning
 Machine Learning (ML) is a branch of Artificial Intelligence (AI) that focuses on building
systems that can learn from data, recognize patterns, and make decisions with minimal human
intervention.
Instead of being explicitly programmed, ML models improve automatically through
experience.

 Example: A spam filter learns from past emails marked as “spam” or “not spam” and
automatically classifies new emails accordingly.

Applications of Machine Learning


 ML is used in many real-world applications across industries:

 Image Recognition

 Facial Recognition: Used in smartphones and social media (e.g., Facebook photo tagging).
 Medical Imaging: Identifies diseases or abnormalities in X-rays, MRIs, and CT scans.
 Object Detection: Used in autonomous vehicles, security, and retail systems.

 Speech and Voice Recognition

 Virtual Assistants: Siri, Alexa, and Google Assistant use ML to process voice commands.
 Speech-to-Text: Converts spoken words into text (e.g., Google Voice Typing).
 Voice Biometrics: Used for secure identity verification based on voice patterns.
 Natural Language Processing (NLP)

 Chatbots: Provide customer support by understanding and responding to queries.


 Language Translation: Google Translate converts text or speech between languages.
 Sentiment Analysis: Helps companies analyze customer opinions on social media or
reviews.

 Recommendation Systems

 E-commerce: Amazon suggests products based on previous purchases.


 Social Media: Facebook and Instagram suggest friends or content you might like.
 Streaming Platforms: Netflix and YouTube recommend shows or videos using viewing
history.

 Fraud Detection

 Banking: Detects suspicious transactions (e.g., unusual credit card activity).


 Insurance: Flags fraudulent claim patterns.

 Autonomous Vehicles

 Self-Driving Cars: Use ML to detect pedestrians, traffic signs, and navigate roads safely.
 Drones: Use ML for route planning and obstacle avoidance.

 Healthcare

 Disease Prediction: Predicts risks for diseases like cancer or diabetes.


 Personalized Medicine: Creates tailored treatment plans based on medical data.
 Drug Discovery: Speeds up the process of identifying new drugs.
Types of Machine Learning
1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Reinforcement Learning

Supervised Learning
 The model learns from labeled data: each example has both an input and the correct output.
The goal is to learn the mapping between inputs and outputs.

 Example: Email spam detection (input: email content; output: spam or not spam).
Unsupervised Learning
 The model learns from unlabeled data and discovers hidden structures or patterns.

 Example: Customer segmentation, grouping customers by similar purchasing habits
without predefined labels.

Semi-supervised Learning
 Uses a small amount of labeled data combined with a large amount of unlabeled data to
improve accuracy.

 Example: Image classification when only a few images are labeled but many are
unlabeled.
Reinforcement Learning
 An agent interacts with an environment and learns through rewards or penalties.
The agent’s goal is to maximize total reward over time.

 Example: Training a robot to navigate a maze, rewarded when it reaches the goal.
 Another Example: Ad placement on a webpage, where an agent decides how many ads to
show based on reward (revenue) feedback.

Difference Between Learning Types

Aspect      Supervised         Unsupervised      Reinforcement
Data Type   Labeled            Unlabeled         Interaction-based
Goal        Predict outcomes   Find patterns     Maximize rewards
Example     Spam filter        Customer groups   Game-playing robot


Key Concepts in Machine Learning
 Data & Features: Raw data is the input; features are measurable attributes (e.g., height,
weight). Labels are the expected outputs for supervised learning.

 Model & Algorithm: The model represents what the system learns; algorithms are the
methods used to train models (e.g., Linear Regression, Decision Trees).

 Training & Testing: Data is split into training data (to teach the model) and testing data
(to check performance).

 Loss Function: Measures how far off predictions are from actual results (e.g., Mean
Squared Error).

 Optimization: Techniques like Gradient Descent adjust model parameters to minimize
loss and improve accuracy.

Learning Process
 Steps:
1. Collect and prepare data
2. Train the model using an algorithm
3. Test the model with new data
4. Evaluate the results with performance metrics

 Evaluation Metrics:
 Accuracy: Overall correctness
 Precision: Correct positive predictions among predicted positives
 Recall: Ability to find all relevant cases
 F1 Score: Harmonic mean of precision and recall
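
The four metrics above can be computed directly from the counts of true/false positives and negatives. The following is a minimal sketch, assuming binary labels with 1 as the positive class; the function name `classification_metrics` and the sample labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted/present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical spam labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this made-up sample there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics happen to coincide.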

Overfitting: Model memorizes data → performs well on training but poorly on new data.
Underfitting: Model is too simple → fails to capture data patterns.
Fundamental Algorithms Overview
1. Linear Regression: Predicts continuous values (e.g., house prices).
2. Logistic Regression: For binary classification (e.g., pass/fail, spam/not spam).
3. Decision Trees: Splits data into branches for decision-making.
4. k-Nearest Neighbors (k-NN): Classifies based on similarity to nearby samples.
5. Neural Networks: Mimic the human brain; used in deep learning (e.g., image or speech recognition).
6. Support Vector Machines (SVM): Finds the best dividing boundary between different
classes.

Challenges in Machine Learning


 Data Quality & Quantity: Poor or limited data reduces model accuracy.
 Feature Engineering: Choosing the right input variables greatly affects model performance.
 Interpretability: Some models (like deep learning) act as “black boxes” whose decisions are hard to explain.
 Scalability: Handling massive datasets efficiently is a challenge.
 Performance: Balancing accuracy, speed, and memory usage for real-time systems.
Supervised Learning Overview
 In supervised learning, the model is trained using labeled data — meaning we know
both the input and the desired output.
There are two main types of supervised learning tasks:

 Regression: Predicting a numeric (continuous) value.


 Classification: Predicting a category or class.

Examples:
 Regression → Predicting a car’s price based on its mileage, brand, and age.
 Classification → Spam detection (classifying emails as spam or not spam).

Fig: Supervised Learning: Classification

What is Regression?
 Regression is a type of supervised learning where the goal is to predict continuous output
values (e.g., temperature, price, or age) based on one or more input variables.

 Example: Predicting the price of a house based on features like size, location, and
number of rooms.

 Objective:
To find the best-fit line (or curve) that represents the relationship between input variables
(X) and the output variable (Y).
This line should minimize the difference (error) between predicted and actual values.
Types of Regression
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Linear Regression
4. Logistic Regression

Simple Linear Regression


 Simple Linear Regression models the relationship between one independent variable (X) and
one dependent variable (Y) using a straight line.

 Equation:
𝒚 = 𝒃𝟎 + 𝒃𝟏 𝒙

Where:

 y → Predicted output (dependent variable)


 x → Input variable (independent variable)
 b1 → Slope of the line (rate of change)
 b0 → Intercept (value of y when x = 0)

 Example: Predicting a student’s exam score (Y) based on hours studied (X):
𝒚 = 𝟓𝟎 + 𝟓𝒙

 b1 = 5 → Each extra hour of study increases score by 5 points.


 b0 = 50 → Even with 0 study hours, expected score = 50.

𝑰𝒇 𝒙 = 𝟒 𝒉𝒐𝒖𝒓𝒔:

𝒚 = 𝟓𝟎 + 𝟓 × 𝟒 = 𝟕𝟎

So, the predicted score is 70.

Real-world uses:
 Predicting sales based on advertising spend.
 Predicting temperature based on time of day.
Multiple Linear Regression
 When there are two or more independent variables, the model is called Multiple Linear
Regression.

 Equation:
$$ y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n $$

Where:

 y → Predicted output (dependent variable)
 x1, x2, …, xn → Input variables (independent variables)
 b1, b2, …, bn → Coefficients (change in y per unit change in the corresponding x, with the other inputs held fixed)
 b0 → Intercept (value of y when all inputs are 0)

 Example Scenario: Predicting a person’s monthly salary (Y) using:


o 𝒙𝟏 : Years of education
o 𝒙𝟐 : Years of work experience

𝒚 = 𝟐𝟎𝟎𝟎 + 𝟏𝟓𝟎𝟎𝒙𝟏 + 𝟖𝟎𝟎𝒙𝟐

Interpretation:
 Intercept (2000): Base salary for zero education and experience.

 b₁ = 1500: Each additional year of education increases salary by $1500.


 b₂ = 800: Each additional year of experience increases salary by $800.

Prediction Example:
𝑭𝒐𝒓 𝟏𝟔 𝒚𝒆𝒂𝒓𝒔 𝒐𝒇 𝒆𝒅𝒖𝒄𝒂𝒕𝒊𝒐𝒏 𝒂𝒏𝒅 𝟓 𝒚𝒆𝒂𝒓𝒔 𝒐𝒇 𝒆𝒙𝒑𝒆𝒓𝒊𝒆𝒏𝒄𝒆:

𝒚 = 𝟐𝟎𝟎𝟎 + 𝟏𝟓𝟎𝟎 × 𝟏𝟔 + 𝟖𝟎𝟎 × 𝟓 = 𝟑𝟎, 𝟎𝟎𝟎

Predicted salary = $30,000.
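
A multiple linear regression prediction is just an intercept plus a dot product of coefficients and inputs. The sketch below reuses the salary coefficients from the example; the function name `predict_salary` is hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Coefficients from the salary example above
b0 = 2000.0                       # intercept
b = np.array([1500.0, 800.0])     # [per year of education, per year of experience]


def predict_salary(features):
    """Return b0 + b1*x1 + b2*x2 for one row or for a 2-D array of rows."""
    return b0 + np.asarray(features, dtype=float) @ b


salary = predict_salary([16, 5])                 # one person
batch = predict_salary([[16, 5], [12, 0]])       # several people at once
print(salary, batch)
```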

Real-world uses:
 Predicting house prices using multiple features (area, location, number of rooms).
 Forecasting revenue using advertising, market conditions, and product pricing.
Polynomial Linear Regression
 Polynomial Regression models the relationship between X and Y as an nth-degree
polynomial. Although the curve is non-linear, the model is linear in coefficients.

 Equation:
$$ y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + \dots + b_n x^n $$

Where:

 y → Predicted output (dependent variable)
 x → Input variable (independent variable)
 b1, b2, b3, …, bn → Coefficients of the powers of x
 b0 → Intercept (value of y when x = 0)

Use Case:
 When data shows a curved relationship instead of a straight line.

Real-world uses:
 Predicting population growth or stock trends where change is not constant.
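
"Linear in coefficients" means the powers of x can be treated as extra input columns and fitted with ordinary least squares. The sketch below illustrates this on made-up, noise-free data generated from y = 1 + 2x + 3x²; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x + 3x^2 (no noise)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x + 3 * x**2

# Design matrix with columns [1, x, x^2]: the model is linear in b0, b1, b2
X_poly = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded features
coeffs, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(coeffs)  # approximately [b0, b1, b2]
```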

Logistic Regression (Overview)


 Despite its name, Logistic Regression is used for classification, not regression. It predicts
categorical outcomes (e.g., Yes/No, 0/1) using a sigmoid function to map values between 0
and 1.

 Equation:
$$ y = \frac{1}{1 + e^{-(b_0 + b_1 x)}} $$

Example: Predicting whether an email is spam (1) or not spam (0).


Error and Cost Function in Linear Regression
What is an Error in Regression?
 When a regression model makes predictions, the error (also called residual) is the difference
between the actual output and the predicted output.

𝐄𝐫𝐫𝐨𝐫 (𝐑𝐞𝐬𝐢𝐝𝐮𝐚𝐥) = 𝐀𝐜𝐭𝐮𝐚𝐥 𝐕𝐚𝐥𝐮𝐞 − 𝐏𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝 𝐕𝐚𝐥𝐮𝐞

In other words:

 If the prediction is too high, the error is negative.


 If the prediction is too low, the error is positive.

Example:
Actual (Y) Predicted (Ŷ) Error (Y - Ŷ)
50 48 +2
60 63 -3
70 68 +2

So, the model makes small mistakes — these differences are what we call errors.

Why Do We Need a Cost Function?


 A cost function measures how well (or poorly) a regression model fits the data.
It calculates the overall error of all predictions in the dataset.

The goal of training a regression model is to minimize the cost function — meaning, we
want to make predictions as close as possible to the real values.

The Concept of the Best-Fit Line


 The best-fit line minimizes the error between actual and predicted values.

Goal: Minimize the sum of squared errors (SSE) using optimization techniques like Gradient
Descent.
Why it matters: A well-fit line accurately represents the relationship between variables,
improving prediction accuracy.

$$ \hat{y}_i = \theta_1 + \theta_2 x_i $$

Mean Squared Error (MSE) – The Most Common Cost Function


 The Mean Squared Error (MSE) is the most widely used cost function in regression.

$$ MSE = J = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$

Where:

 n = number of data points
 y_i = actual value
 ŷ_i = predicted value

Example:
Actual (Y) Predicted (Ŷ) Error (Y - Ŷ) Error²
4 5 -1 1
3 2.5 0.5 0.25
5 4 1 1
6 5.5 0.5 0.25

$$ MSE = \frac{1 + 0.25 + 1 + 0.25}{4} = \frac{2.5}{4} = 0.625 $$

So, the mean squared error = 0.625. A smaller MSE means a better-fitting regression model.
Root Mean Squared Error (RMSE)
 RMSE is simply the square root of MSE, which brings the error value back to the same unit
as the output variable.

𝑹𝑴𝑺𝑬 = √𝑴𝑺𝑬

Example: If MSE = 0.625 →

𝑅𝑀𝑆𝐸 = √0.625 = 0.79

So, the average prediction error is around 0.79 units.

Mean Absolute Error (MAE)


 MAE is another way to measure error — instead of squaring, it takes the absolute difference
between actual and predicted values.

$$ MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i| $$

Example:
Actual (Y) Predicted (Ŷ) |Y − Ŷ|
4 5 1
3 2.5 0.5
5 4 1
6 5.5 0.5

$$ MAE = \frac{1 + 0.5 + 1 + 0.5}{4} = \frac{3}{4} = 0.75 $$

So, on average, the model’s prediction is off by 0.75 units.
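
The three error measures can be checked with a few lines of plain Python. This is a small sketch using the four actual/predicted pairs from the examples; the variable names are illustrative.

```python
import math

actual = [4, 3, 5, 6]
predicted = [5, 2.5, 4, 5.5]
n = len(actual)

# Residuals follow the definition Error = Actual - Predicted
errors = [y - y_hat for y, y_hat in zip(actual, predicted)]

mse = sum(e**2 for e in errors) / n      # mean of squared errors
rmse = math.sqrt(mse)                    # same unit as the output variable
mae = sum(abs(e) for e in errors) / n    # mean of absolute errors

print(mse, round(rmse, 2), mae)
```

Because MSE squares each residual, a single large error raises MSE (and RMSE) more than it raises MAE.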


Gradient Descent for Linear Regression
 Gradient Descent is an optimization algorithm used to minimize the cost function in
machine learning models — especially in linear regression.

The main goal of linear regression is to find the best-fit line:

$$ \hat{y}_i = \theta_1 + \theta_2 x_i $$

That line should make predictions that are as close as possible to the actual data points.
To achieve this, we adjust the parameters (θ2 = slope, θ1 = intercept) until the cost function
(error) is minimized.

Gradient Descent helps us find those optimal parameter values step by step.

The Gradient Descent Update Rule


 We update both parameters iteratively:

$$ \theta_1 = \theta_1 - \alpha \frac{\partial J}{\partial \theta_1} $$

$$ \theta_2 = \theta_2 - \alpha \frac{\partial J}{\partial \theta_2} $$

Where:

 α = Learning rate (controls how big each update step is)
 ∂J/∂θ1, ∂J/∂θ2 = gradients of the cost function
Derivation of the Gradients
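 Let’s differentiate the cost function with respect to each parameter, using $\hat{y}_i = \theta_1 + \theta_2 x_i$.

Derivative w.r.t. θ1:

$$ \frac{\partial J}{\partial \theta_1} = \frac{\partial}{\partial \theta_1} \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$
$$ = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \frac{\partial}{\partial \theta_1} (\hat{y}_i - y_i) $$
$$ = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \frac{\partial}{\partial \theta_1} (\theta_1 + \theta_2 x_i - y_i) $$
$$ = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) (1 + 0 - 0) $$
$$ \frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \quad \ldots (i) $$

Derivative w.r.t. θ2:

$$ \frac{\partial J}{\partial \theta_2} = \frac{\partial}{\partial \theta_2} \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$
$$ = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \frac{\partial}{\partial \theta_2} (\hat{y}_i - y_i) $$
$$ = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) \frac{\partial}{\partial \theta_2} (\theta_1 + \theta_2 x_i - y_i) $$
$$ = \frac{1}{n} \sum_{i=1}^{n} 2(\hat{y}_i - y_i) (0 + x_i - 0) $$
$$ \frac{\partial J}{\partial \theta_2} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i \quad \ldots (ii) $$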

Final Gradient Descent Update Formulas:

$$ \theta_1 = \theta_1 - \alpha \times \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) $$

$$ \theta_2 = \theta_2 - \alpha \times \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i $$

These formulas iteratively adjust θ1 and θ2 to reduce the error and move toward the minimum of the cost
function.
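
Repeating the two updates many times drives the parameters toward the best-fit line. The sketch below is one possible plain-Python loop; the function name `gradient_descent`, the learning rate, the iteration count, and the noise-free data (generated from score = 50 + 5 × hours) are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, theta1=0.0, theta2=0.0, alpha=0.05, iterations=5000):
    """Fit y_hat = theta1 + theta2 * x by repeated gradient steps on the MSE."""
    n = len(xs)
    for _ in range(iterations):
        # Residuals (y_hat_i - y_i) under the current parameters
        residuals = [theta1 + theta2 * x - y for x, y in zip(xs, ys)]
        grad1 = (2 / n) * sum(residuals)                              # dJ/dtheta1
        grad2 = (2 / n) * sum(r * x for r, x in zip(residuals, xs))   # dJ/dtheta2
        # Simultaneous update: both gradients use the old parameters
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return theta1, theta2


# Hypothetical noise-free data: score = 50 + 5 * hours
hours = [0, 1, 2, 3, 4]
scores = [50, 55, 60, 65, 70]
theta1, theta2 = gradient_descent(hours, scores)
print(round(theta1, 4), round(theta2, 4))
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, many more iterations are needed.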
 Problem 1: We are given the following dataset showing the relationship between
Age (X) and Salary (Y):

Age (X) Salary (Y)
30 800
37 950
25 600
43 1050
50 1200
29 740
46 1100

A linear regression model is used to predict salary from age using the hypothesis
function:

$$ \hat{y}_i = \theta_1 + \theta_2 x_i $$

θ1 = 300, θ2 = 10, α = 0.0001

Perform one iteration of the Gradient Descent algorithm, and compute:

1. The initial cost J (Mean Squared Error).
2. The gradients ∂J/∂θ1 and ∂J/∂θ2.
3. The updated values of θ1 and θ2.
4. The updated regression equation.
5. The new cost J_new after the parameter updates.

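Solution
(1)
Given:

Hypothesis function: $\hat{y}_i = \theta_1 + \theta_2 x_i$

Substituting the given values: $\hat{y}_i = 300 + 10 x_i$

n = 7

We know,

$$ J = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{7} \sum_{i=1}^{7} (\hat{y}_i - y_i)^2 \quad \ldots (i) $$

$$ \sum_{i=1}^{7} (\hat{y}_i - y_i)^2 = (300 + 10 \times 30 - 800)^2 + (300 + 10 \times 37 - 950)^2 + (300 + 10 \times 25 - 600)^2 + (300 + 10 \times 43 - 1050)^2 + (300 + 10 \times 50 - 1200)^2 + (300 + 10 \times 29 - 740)^2 + (300 + 10 \times 46 - 1100)^2 = 521400 $$

Putting this value in equation (i):

$$ J = \frac{1}{7} (521400) = 74485.714 $$

(2)

Gradient w.r.t. θ1:

$$ \frac{\partial J}{\partial \theta_1} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) = \frac{2}{7} \sum_{i=1}^{7} (300 + 10 x_i - y_i) = \frac{2}{7} (-1740) = -497.1429 $$

Gradient w.r.t. θ2:

$$ \frac{\partial J}{\partial \theta_2} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i = \frac{2}{7} \sum_{i=1}^{7} (300 + 10 x_i - y_i) x_i = \frac{2}{7} (-71360) = -20388.5714 $$

(3)

New θ1:

$$ \theta_1 = \theta_1 - \alpha \times \frac{\partial J}{\partial \theta_1} = 300 - 0.0001 \times (-497.1429) = 300.0497 $$

New θ2:

$$ \theta_2 = \theta_2 - \alpha \times \frac{\partial J}{\partial \theta_2} = 10 - 0.0001 \times (-20388.5714) = 12.0389 $$

(4)
Given the hypothesis function $\hat{y}_i = \theta_1 + \theta_2 x_i$, the updated equation is:

$$ \hat{y}_i = 300.0497 + 12.0389 x_i $$

(5)
We know,

$$ J_{new} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{1}{7} \sum_{i=1}^{7} (300.0497 + 12.0389 x_i - y_i)^2 \approx 38956.40 $$

(value computed with the unrounded updated parameters). The cost has dropped from 74485.714 to about 38956.40 after one iteration.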
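
The single gradient descent iteration for Problem 1 can be reproduced numerically. The following is a minimal sketch assuming NumPy is available; it uses the dataset and starting values from the problem statement, while the helper name `cost` and the variable names are illustrative.

```python
import numpy as np

ages = np.array([30, 37, 25, 43, 50, 29, 46], dtype=float)
salaries = np.array([800, 950, 600, 1050, 1200, 740, 1100], dtype=float)
theta1, theta2, alpha = 300.0, 10.0, 0.0001


def cost(t1, t2):
    """Mean squared error of the line y_hat = t1 + t2 * age."""
    return np.mean((t1 + t2 * ages - salaries) ** 2)


# Residuals (y_hat_i - y_i) at the starting parameters
residuals = theta1 + theta2 * ages - salaries
grad1 = 2 * residuals.mean()             # (2/n) * sum(y_hat_i - y_i)
grad2 = 2 * (residuals * ages).mean()    # (2/n) * sum((y_hat_i - y_i) * x_i)

# One update step, using unrounded values throughout
new_theta1 = theta1 - alpha * grad1
new_theta2 = theta2 - alpha * grad2

initial_cost = cost(theta1, theta2)
new_cost = cost(new_theta1, new_theta2)
print(initial_cost, grad1, grad2, new_theta1, new_theta2, new_cost)
```

Rounding the updated parameters before recomputing the cost shifts the result noticeably, because the cost is very sensitive to the slope on this unscaled data.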

Linear Regression in Matrix Form


 In machine learning, especially when working with multiple variables, it’s more convenient to
represent linear regression using matrix notation.

This makes computation simpler and more efficient, especially when using tools like
NumPy, MATLAB, or TensorFlow.

 Equation:
𝒀 = 𝑿𝒘

Where:

 Y → Output (dependent variable)


 X → Input features (independent variables)
 w → Weight vector (parameters, including intercept)

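If we include an intercept term θ0 (bias), we add a column of 1’s to X:

$$ Y = Xw = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix} = \begin{bmatrix} \theta_0 + \theta_1 x_1 \\ \theta_0 + \theta_1 x_2 \\ \vdots \\ \theta_0 + \theta_1 x_n \end{bmatrix} $$

So, each row represents one training example.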


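 Example Dataset: Let’s take a small example with two data points:

X (Hours Studied) Y (Score)
1 2
2 3

Here, we are trying to find the best-fit line Y = θ0 + θ1X

Step 1: Representing Data in Matrix Form

$$ X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}, \quad Y = \begin{bmatrix} 2 \\ 3 \end{bmatrix} $$

The first column of 1’s in X corresponds to the intercept term θ0.

Step 2: To find the best-fit parameters (weights), we use the Normal Equation:

$$ w = (X^T X)^{-1} X^T Y $$

Where:
 X^T = Transpose of X
 (X^T X)^{-1} = Inverse of the product X^T X
 w = Parameter vector [θ0, θ1]^T

Step 3: Step-by-Step Calculation

Step 3.1: Compute X^T X

$$ X^T X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix} $$

Step 3.2: Compute X^T Y

$$ X^T Y = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 5 \\ 8 \end{bmatrix} $$

Step 3.3: Compute (X^T X)^{-1}

$$ (X^T X)^{-1} = \begin{bmatrix} 2 & 3 \\ 3 & 5 \end{bmatrix}^{-1} = \begin{bmatrix} 5 & -3 \\ -3 & 2 \end{bmatrix} $$

Step 3.4: Compute w = (X^T X)^{-1} X^T Y

$$ w = \begin{bmatrix} 5 & -3 \\ -3 & 2 \end{bmatrix} \begin{bmatrix} 5 \\ 8 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$

Step 4: Final Model Equation for best fit line

$$ Y = 1 + 1X $$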
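
The same Normal Equation steps can be mirrored in NumPy. This is a small sketch for the two-point example only; forming an explicit inverse is fine at this size, although a solver is usually preferred for larger or ill-conditioned problems.

```python
import numpy as np

# Two training examples; the first column of ones carries the intercept
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
Y = np.array([2.0, 3.0])

XtX = X.T @ X                     # X^T X
XtY = X.T @ Y                     # X^T Y
w = np.linalg.inv(XtX) @ XtY      # (X^T X)^{-1} X^T Y

print(XtX, XtY, w)                # w holds [theta0, theta1]
```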
 Problem 2: You are given the following dataset showing the relationship between
hours studied (X) and exam scores (Y); the ten (X, Y) pairs are the rows of the matrices in Step 1 below.

a) Using Linear Regression in Matrix Form, find the best-fit line equation Y = θ0 + θ1X
by computing all steps manually: $\theta = (X^T X)^{-1} X^T Y$.
b) Then, predict the score when a student studies for 4 hours.

Solutions

(a)
Step 1: Representing Data in Matrix Form

$$ X = \begin{bmatrix} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{bmatrix}, \quad Y = \begin{bmatrix} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{bmatrix} $$

Step 2: Normal Equation (given):

$$ \theta = (X^T X)^{-1} X^T Y $$

Where:
 X^T = Transpose of X
 (X^T X)^{-1} = Inverse of the product X^T X
 θ = Parameter vector [θ0, θ1]^T

Step 3: Step-by-Step Calculation

Step 3.1: Compute X^T X

$$ X^T X = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} = \begin{bmatrix} 10 & -1.39 \\ -1.39 & 4.95 \end{bmatrix} $$

Step 3.2: Compute X^T Y

$$ X^T Y = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix} = \begin{bmatrix} 8.34 \\ 6.49 \end{bmatrix} $$

Step 3.3: Compute θ = (X^T X)^{-1} X^T Y

$$ \theta = \frac{1}{10 \times 4.95 - (-1.39)^2} \begin{bmatrix} 4.95 & 1.39 \\ 1.39 & 10 \end{bmatrix} \begin{bmatrix} 8.34 \\ 6.49 \end{bmatrix} = \frac{1}{47.57} \begin{bmatrix} 50.30 \\ 76.49 \end{bmatrix} \approx \begin{bmatrix} 1.06 \\ 1.61 \end{bmatrix} $$

Step 4: Final Model Equation for best fit line

$$ Y = 1.06 + 1.61X $$

(b)

Predict the Score when X = 4:

$$ Y = 1.06 + 1.61X $$

$$ Y = 1.06 + 1.61 \times 4 = 7.50 $$
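
For a dataset of this size it is common to let a least-squares solver do the work instead of inverting X^T X by hand. The sketch below uses the ten (X, Y) pairs as they appear, rounded to two decimals, in the matrices of Step 1, so its sums may differ slightly in the last digit from hand calculations based on unrounded data; the variable names are illustrative.

```python
import numpy as np

# The ten (x, y) pairs from the matrices in Step 1 (rounded to two decimals)
x = np.array([0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17])
y = np.array([2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43])

# Column of ones first, so theta[0] is the intercept and theta[1] the slope
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of X @ theta ~= y (same minimizer as the Normal Equation)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

prediction = theta[0] + theta[1] * 4
print(theta, prediction)
```

Keeping the column order of X and the entry order of θ aligned matters: the first entry multiplies the column of ones (intercept) and the second multiplies x (slope).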
Logistic Regression
 Used to predict the probability that something belongs to a certain class.
 Example: Is an email spam (1) or not spam (0)?

Binary Classification:

 Predicts two classes only:


o Positive class → 1
o Negative class → 0
 Uses a threshold (usually 0.5) to decide the class:
o Probability ≥ 0.5 → predict 1
o Probability < 0.5 → predict 0

How it works:

1. Like linear regression, it computes a weighted sum of inputs + bias:

𝒚 = 𝜽𝟏 + 𝜽𝟐 𝒙𝒊 becomes 𝒚 = 𝝈(𝜽𝟏 + 𝜽𝟐 𝒙𝒊 )

2. Then, instead of giving the sum directly, it passes it through the sigmoid (logistic)
function:

$$ \sigma(t) = \frac{1}{1 + e^{-t}} $$

3. The output is a probability between 0 and 1.

Key points about the sigmoid function:

 𝝈(𝒕) < 𝟎. 𝟓 → 𝒕 < 𝟎 → predicts 0


 𝝈(𝒕) ≥ 𝟎. 𝟓 → 𝒕 ≥ 𝟎 → predicts 1
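
The weighted sum, sigmoid, and threshold steps fit in a few lines. This is a minimal sketch; the parameter values θ1 = −4 and θ2 = 2 and the function name `predict_class` are hypothetical and chosen only to show one prediction on each side of the threshold.

```python
import math


def sigmoid(t):
    """Map any real number t to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_class(x, theta1, theta2, threshold=0.5):
    """Return (probability of class 1, predicted class)."""
    probability = sigmoid(theta1 + theta2 * x)
    return probability, int(probability >= threshold)


# Hypothetical parameters, for illustration only
prob_pos, label_pos = predict_class(x=3.0, theta1=-4.0, theta2=2.0)   # t = 2
prob_neg, label_neg = predict_class(x=1.0, theta1=-4.0, theta2=2.0)   # t = -2
print(prob_pos, label_pos, prob_neg, label_neg)
```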
K-Nearest Neighbors (KNN)
 KNN is a simple classification algorithm that works by finding the nearest data points and
choosing the most common class among them.

Key Features
Term                            Meaning
Non-parametric                  Does not assume any data pattern (flexible)
Instance-based / Lazy learner   Does not build a model; stores training data
Memory-based                    Needs the data stored for prediction

How KNN Works (Step-by-Step)


1. Choose the value of K (number of neighbors, like K=3 or K=5)
2. Calculate distance between new point & all training points
(Common distances: Euclidean, Manhattan)
3. Pick K nearest neighbors
4. Majority Vote → The class most common among the K neighbors → result

 Example Dataset: Suppose K = 3 and among the 3 closest neighbors:

Neighbor Class
1 Cat
2 Dog
3 Cat

Result = Cat (because Cat appears more)


Distance Formula (Euclidean)

$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$

 Problem 1: You are given a dataset of fruits based on two features:

 Weight (grams)
 Sweetness level (1–10)

Fruit Weight Sweetness


Apple 180 7
Apple 160 6
Orange 200 4
Orange 220 3
Banana 120 8
Banana 130 9

A new fruit arrives with: Weight = 170 g, Sweetness = 7

a) Use KNN with K = 3 to classify which fruit it is.

Solutions:
 Compute the Euclidean distance from the new fruit (170, 7) to every training point:
$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$

Fruit    Coordinates   Distance
Apple    (180, 7)      √((170 − 180)² + (7 − 7)²) = 10.00
Apple    (160, 6)      √((170 − 160)² + (7 − 6)²) = 10.05
Orange   (200, 4)      √((170 − 200)² + (7 − 4)²) = 30.15
Orange   (220, 3)      √((170 − 220)² + (7 − 3)²) = 50.16
Banana   (120, 8)      √((170 − 120)² + (7 − 8)²) = 50.01
Banana   (130, 9)      √((170 − 130)² + (7 − 9)²) = 40.05

Pick 3 Nearest Neighbors:
Fruit Distance
Apple (180,7) 10.00
Apple (160,6) 10.05
Orange (200,4) 30.15

Final Prediction: Predicted fruit = Apple
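
The four KNN steps (choose K, measure distances, pick the K nearest, majority vote) can be sketched in plain Python. The function name `knn_classify` is made up for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(training, query, k):
    """training: list of (features, label). Returns (predicted label, k nearest items)."""
    # Sort all training points by Euclidean distance to the query
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    nearest = by_distance[:k]
    # Majority vote among the k nearest labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


# Fruit dataset: (weight in grams, sweetness) -> fruit
fruits = [((180, 7), "Apple"), ((160, 6), "Apple"),
          ((200, 4), "Orange"), ((220, 3), "Orange"),
          ((120, 8), "Banana"), ((130, 9), "Banana")]

label, nearest = knn_classify(fruits, (170, 7), k=3)
print(label, [name for _, name in nearest])
```

Note that weight spans a much larger numeric range than sweetness here, so it dominates the distance unless the features are scaled first.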

 Problem 2: You are given a dataset of people described by two features:

 Height (cm)
 Weight (kg)

Height (cm) Weight (kg) Class


167 51 Underweight
182 62 Normal
176 69 Normal
173 64 Normal
172 65 Normal
174 56 Underweight
169 58 Normal
173 57 Normal
170 55 Normal

A new person arrives with: Height = 170 cm, Weight = 57 kg

a) Using the above dataset and KNN algorithm with K = 3, classify whether the person is
Underweight or Normal.

Solutions:
 Compute the Euclidean distance from the new person (170, 57) to every training point:
$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$

Coordinates   Distance
(167, 51)     √((170 − 167)² + (57 − 51)²) = 6.71
(182, 62)     √((170 − 182)² + (57 − 62)²) = 13.00
(176, 69)     √((170 − 176)² + (57 − 69)²) = 13.42
(173, 64)     √((170 − 173)² + (57 − 64)²) = 7.62
(172, 65)     √((170 − 172)² + (57 − 65)²) = 8.25
(174, 56)     √((170 − 174)² + (57 − 56)²) = 4.12
(169, 58)     √((170 − 169)² + (57 − 58)²) = 1.41
(173, 57)     √((170 − 173)² + (57 − 57)²) = 3.00
(170, 55)     √((170 − 170)² + (57 − 55)²) = 2.00

Arrange the distances in ascending order:


Height Weight Class Distance
169 58 Normal 1.41
170 55 Normal 2.00
173 57 Normal 3.00
174 56 Underweight 4.12
167 51 Underweight 6.71
173 64 Normal 7.62
172 65 Normal 8.25
182 62 Normal 13.00
176 69 Normal 13.42

Majority Vote
Class Count
Normal 3
Underweight 0

Final Prediction: Predicted Class = Normal


 Problem 3: A movie streaming platform, CineMatch, is developing a recommendation
system that classifies new movies into genres based on their characteristics. The data
science team decides to use the K-Nearest Neighbors (KNN) algorithm for genre
classification.

They collect data from six previously released movies, each described by:

 Length of the movie (in minutes)


 Number of action scenes
 Number of romantic scenes

Each movie also has a known Genre.

Movie Length (min) Action Scenes Romantic Scenes Genre


M1 150 12 2 Action
M2 140 10 1 Action
M3 120 3 10 Romantic
M4 110 2 9 Romantic
M5 130 5 4 Comedy
M6 125 4 5 Comedy

Now, a new movie, M7 — “Love & Fury”, has just been added to the platform.
It has the following attributes:

 Length = 135 minutes


 Action Scenes = 6
 Romantic Scenes = 3

a) Based on the given data, use the K-Nearest Neighbors (KNN) algorithm with K = 5 to
determine the most suitable genre for the new movie “Love & Fury”. Show your reasoning or
steps briefly.

Solutions:
 Compute the Euclidean distance from the new movie (135, 6, 3) to every training point:
$$ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2} $$

Coordinates     Distance
(150, 12, 2)    √((135 − 150)² + (6 − 12)² + (3 − 2)²) = 16.19
(140, 10, 1)    √((135 − 140)² + (6 − 10)² + (3 − 1)²) = 6.71
(120, 3, 10)    √((135 − 120)² + (6 − 3)² + (3 − 10)²) = 16.82
(110, 2, 9)     √((135 − 110)² + (6 − 2)² + (3 − 9)²) = 26.02
(130, 5, 4)     √((135 − 130)² + (6 − 5)² + (3 − 4)²) = 5.20
(125, 4, 5)     √((135 − 125)² + (6 − 4)² + (3 − 5)²) = 10.39

Arrange the distances in ascending order:


Length (min) Action Scenes Romantic Scenes Genre Distance
130 5 4 Comedy 5.20
140 10 1 Action 6.71
125 4 5 Comedy 10.39
150 12 2 Action 16.19
120 3 10 Romantic 16.82
110 2 9 Romantic 26.02

Majority Vote
Class Count
Comedy 2
Action 2
Romantic 1

Handle the tie (Comedy vs Action): Use a smaller K (e.g., K = 3) as a tiebreaker
Class Count
Comedy 2
Action 1
Romantic 0

Final Prediction: Predicted Genre = Comedy
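
One way to automate the tiebreak is to shrink K one step at a time until a single class has strictly the most votes. This is a sketch of that one strategy (others, such as distance-weighted voting, are also used); the function name `knn_with_tiebreak` is hypothetical.

```python
import math
from collections import Counter


def knn_with_tiebreak(training, query, k):
    """Majority vote among the k nearest; on a tie, retry with k - 1."""
    ranked = sorted(training, key=lambda item: math.dist(item[0], query))
    while k > 1:
        counts = Counter(label for _, label in ranked[:k]).most_common()
        # Accept the vote only if the top class strictly beats the runner-up
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0], k
        k -= 1
    # With k = 1 the single nearest neighbor decides
    return ranked[0][1], 1


# (length in minutes, action scenes, romantic scenes) -> genre
movies = [((150, 12, 2), "Action"), ((140, 10, 1), "Action"),
          ((120, 3, 10), "Romantic"), ((110, 2, 9), "Romantic"),
          ((130, 5, 4), "Comedy"), ((125, 4, 5), "Comedy")]

genre, k_used = knn_with_tiebreak(movies, (135, 6, 3), k=5)
print(genre, k_used)
```

Here K = 5 and K = 4 both end in a Comedy/Action tie, so the vote is settled at K = 3.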


Advantages of KNN
Benefit                                       Why
Simple & easy to understand                   No complex math
No training time                              Only stores data
Can be used for classification & regression   Flexible
No assumptions on data                        Works for many data types

Disadvantages of KNN
Issue                         Explanation
Slow for large datasets       Calculates distance to every point
Needs memory to store data    All data must be saved
Sensitive to noise/outliers   Wrong results if data is messy
Feature scaling required      Variables with high values dominate distance

When to Use KNN


 Use KNN when:
 Dataset is small to medium-sized
 Data is labeled
 No strong assumption about data pattern

 Don't use when:


 Dataset is very large
 Real-time fast prediction is required

Important Concepts
Term              Meaning
Lazy learner      No training phase
Majority vote     Most neighbors' class wins
Distance metric   Way of measuring closeness


Introduction to Decision Trees
 A Decision Tree is a supervised machine learning algorithm used for classification and
regression tasks.

 It works by splitting data into branches based on certain decision rules, forming a structure
that looks like a tree.

 Each node represents a decision based on a feature.


 Each branch represents an outcome of that decision.
 Each leaf node represents a final prediction or class label.
 The root node is the starting point (the top of the tree).

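Example: If you want to predict whether someone will buy a computer based on features like
age, income, student status, and credit rating, a decision tree tests one feature at each node
and follows the matching branch until it reaches a leaf that gives the prediction.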

How Decision Trees Work


 Decision Trees follow a Top-Down Recursive Divide-and-Conquer approach:
1. Start with all data at the root.
2. Select the best feature to split the data.
3. Split the dataset into subsets based on feature values.
4. Repeat the process for each subset.
5. Stop when:
o All samples in a node belong to the same class.
o No more features to split.
o A stopping condition is met.
ID3 Algorithm (Iterative Dichotomiser 3)
 ID3 is one of the earliest and most famous algorithms for building classification trees.

 Purpose of ID3:
 Build a decision tree that classifies data with the least uncertainty.
 Selects attributes that provide the highest information gain.
 Splits the dataset recursively until:
o All instances are classified, or
o No attributes remain.

Steps in the ID3 Algorithm


Step 1: Calculate Entropy
 Measures impurity or randomness.

$$ Entropy(D) = -\sum_{i=1}^{n} p_i \log_2(p_i) $$

Where,
p_i = proportion of class i, p_i = n_i / N
n_i = Samples belonging to class i
N = total samples
n = number of classes

Step 2: Compute Information Gain
 Measures the reduction in entropy after splitting the dataset.

$$ IG(D, A) = Entropy(D) - \sum_{v} \frac{|D_v|}{|D|} \times Entropy(D_v) $$

Where,
D = Original dataset.
A = Attribute being evaluated.
D_v = Subset of D where attribute A takes value v.

Step 3: Select Attribute with Highest Information Gain

 The attribute with the highest information gain becomes the decision node.

Step 4: Split Dataset

 Partition data by attribute values.

Step 5: Recursion:
 Repeat until:
 All data in subset belongs to the same class
 No attributes left
 Dataset becomes empty
 Problem1: Consider the following dataset of 14 customers with attributes age, income,
student, credit_rating, and the target class buys_computer.

Dataset:
age income student credit_rating buys_computer
≤ 𝟑𝟎 high no fair no
≤ 𝟑𝟎 high no excellent no
𝟑𝟏 … 𝟒𝟎 high no fair yes
> 𝟒𝟎 medium no fair yes
> 𝟒𝟎 low yes fair yes
> 𝟒𝟎 low yes excellent no
𝟑𝟏 … 𝟒𝟎 low yes excellent yes
≤ 𝟑𝟎 medium no fair no
≤ 𝟑𝟎 low yes fair yes
> 𝟒𝟎 medium yes fair yes
≤ 𝟑𝟎 medium yes excellent yes
𝟑𝟏 … 𝟒𝟎 medium no excellent yes
𝟑𝟏 … 𝟒𝟎 high yes fair yes
> 𝟒𝟎 medium no excellent no

 Your tasks are:


1. Calculate the entropy of the entire dataset for the target attribute buys_computer.
2. Compute the Information Gain (IG) for each attribute:
o age
o income
o student
o credit_rating
3. Determine the best attribute for the root node based on highest Information Gain.
4. Using your root node, perform the next split on the corresponding subsets and show the
resulting smaller decision tree.
5. Finally, use the constructed decision tree to classify the following new instance:
age = <=30
income = low
student = yes
credit_rating = fair
Predict the value of the target attribute: buys_computer?.
(1)

Is given,

Totals: 14 samples → 9 yes, 5 no.

N = 14
n_1 = 9 (yes)
n_2 = 5 (no)

We know,
proportion of class i: p_i = n_i / N
proportion of class 1: p_1 = n_1 / N = 9/14 = 0.643
proportion of class 2: p_2 = n_2 / N = 5/14 = 0.357

Now,

$$ Entropy(D) = -\sum_{i=1}^{2} p_i \log_2(p_i) = -p_1 \log_2(p_1) - p_2 \log_2(p_2) $$

$$ Entropy(D) = -0.643 \log_2(0.643) - 0.357 \log_2(0.357) = 0.940 $$

Entropy of the entire dataset for the target attribute buys_computer: Entropy(D) = 0.940
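
The entropy calculation depends only on the class counts, so it can be checked with a short helper. This is a minimal sketch; the function name `entropy` and its list-of-counts interface are assumptions for illustration.

```python
import math


def entropy(class_counts):
    """Entropy in bits for a list of per-class sample counts."""
    total = sum(class_counts)
    # Classes with zero samples contribute nothing (0 * log2(0) is taken as 0)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


dataset_entropy = entropy([9, 5])   # 9 yes, 5 no
print(round(dataset_entropy, 3))
```

A pure node such as `entropy([4, 0])` gives 0, and an evenly split two-class node such as `entropy([7, 7])` gives the maximum of 1 bit.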

(2)

We Know,
|𝑫𝒗 |
𝐈𝐆(𝐃, 𝐀) = 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝐃) − × 𝐄𝐧𝐭𝐫𝐨𝐩𝐲(𝑫𝒗 )
|𝑫|
age income student credit_rating buys_computer
≤ 𝟑𝟎 high no fair no
≤ 𝟑𝟎 high no excellent no
𝟑𝟏 … 𝟒𝟎 high no fair yes
 For Age (values: <=30, 31..40, >40): > 𝟒𝟎 medium no fair yes
Is given, For Age ≤ 𝟑𝟎: > 𝟒𝟎 low yes fair yes
> 𝟒𝟎 low yes excellent no
𝟑𝟏 … 𝟒𝟎 low yes excellent yes
Totals for ≤ 𝟑𝟎 : 5 datasets → 2 𝒚𝒆𝒔, 3 𝒏𝒐 .
≤ 𝟑𝟎 medium no fair no
𝒄𝒍𝒂𝒔𝒔 𝟏 𝒄𝒍𝒂𝒔𝒔 𝟐 ≤ 𝟑𝟎 low yes fair yes
𝑵 𝟑𝟎 = 𝟓 > 𝟒𝟎 medium yes fair yes
≤ 𝟑𝟎 medium yes excellent yes
𝒏 𝟏 = 𝟐 𝟑𝟏 … 𝟒𝟎 medium no excellent yes
𝒏 𝟐 = 𝟑 𝟑𝟏 … 𝟒𝟎 high yes fair yes
> 𝟒𝟎 medium no excellent no

We Know,
𝒏𝒊
proportion of class 𝑖 = 𝒑𝒊 =
𝑵
𝒏𝟏 𝟐
proportion of class 1 = 𝒑𝟏 = = = 𝟎. 𝟒𝟎𝟎
𝑵 𝟓
𝒏𝟐 𝟑
proportion of class 2 = 𝒑𝟐 = = = 𝟎. 𝟔𝟎𝟎
𝑵 𝟓
now,
𝒏

𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝟑𝟎 ) =− 𝒑𝒊 𝒍𝒐𝒈𝟐 (𝒑𝒊 )


𝒊 𝟏
𝟐

𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝟑𝟎 ) =− 𝒑𝒊 𝒍𝒐𝒈𝟐 (𝒑𝒊 )


𝒊 𝟏

𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝟑𝟎 ) = −𝒑𝟏 𝒍𝒐𝒈𝟐 (𝒑𝟏 ) − 𝒑𝟐 𝒍𝒐𝒈𝟐 (𝒑𝟐 )

𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝟑𝟎 ) = −0.400𝑙𝑜𝑔 (0.400) − 0.600𝑙𝑜𝑔 (0.600)

𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒗 𝟑𝟎 ) = 0.971

Given, for Age 31…40:

Totals for 31…40: 4 records → 4 yes (class 1), 0 no (class 2).
$N_{31 \ldots 40} = 4$
$n_1 = 4$
$n_2 = 0$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{4}{4} = 1.000$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{0}{4} = 0.000$
Now,
$\mathrm{Entropy}(D_{31 \ldots 40}) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{31 \ldots 40}) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{31 \ldots 40}) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D_{31 \ldots 40}) = -1.000 \log_2(1.000) - 0.000 \log_2(0.000)$

$\mathrm{Entropy}(D_{31 \ldots 40}) = 0.000$ (pure; the term $0 \log_2 0$ is taken as 0 by convention)

Given, for Age > 40:

Totals for >40: 5 records → 3 yes (class 1), 2 no (class 2).
$N_{>40} = 5$
$n_1 = 3$
$n_2 = 2$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{3}{5} = 0.600$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{2}{5} = 0.400$
Now,
$\mathrm{Entropy}(D_{>40}) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{>40}) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{>40}) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D_{>40}) = -0.600 \log_2(0.600) - 0.400 \log_2(0.400)$

$\mathrm{Entropy}(D_{>40}) = 0.971$

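$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\le 30}|}{|D|} \mathrm{Entropy}(D_{\le 30}) + \frac{|D_{31 \ldots 40}|}{|D|} \mathrm{Entropy}(D_{31 \ldots 40}) + \frac{|D_{>40}|}{|D|} \mathrm{Entropy}(D_{>40})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0.000 + \frac{5}{14} \times 0.971$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.694$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{Age}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Age}) = 0.940 - 0.694 = 0.246$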
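The weighted-entropy and Information Gain steps for Age can be reproduced from the (yes, no) counts alone. The sketch below is illustrative; the helper names `entropy_from_counts` and `information_gain` are assumptions, not part of the notes.

```python
import math

def entropy_from_counts(counts):
    """Entropy (base 2) from a list of per-class counts, e.g. [yes, no]."""
    total = sum(counts)
    # Zero counts are skipped, which matches the 0 * log2(0) = 0 convention.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """IG = Entropy(parent) - sum(|Dv| / |D| * Entropy(Dv)) over the subsets."""
    total = sum(parent_counts)
    weighted = sum(
        sum(child) / total * entropy_from_counts(child) for child in child_counts
    )
    return entropy_from_counts(parent_counts) - weighted

# Age split of buys_computer: (yes, no) counts for <=30, 31...40, >40
age_ig = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(age_ig, 3))
```

With unrounded intermediate values the gain comes out at about 0.247; the hand calculation gives 0.246 because each entropy is rounded to three decimals first.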


 For Income (values: high, medium, low):
Given, for Income = high:
Totals for high: 4 records → 2 yes (class 1), 2 no (class 2).
$N_{\mathrm{high}} = 4$
$n_1 = 2$
$n_2 = 2$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{2}{4} = 0.500$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{2}{4} = 0.500$
Now,
$\mathrm{Entropy}(D_{\mathrm{high}}) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{\mathrm{high}}) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{\mathrm{high}}) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D_{\mathrm{high}}) = -0.500 \log_2(0.500) - 0.500 \log_2(0.500)$

$\mathrm{Entropy}(D_{\mathrm{high}}) = 1.000$

Given, for Income = medium:

Totals for medium: 6 records → 4 yes (class 1), 2 no (class 2).
$N_{\mathrm{medium}} = 6$
$n_1 = 4$
$n_2 = 2$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{4}{6} = 0.667$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{2}{6} = 0.333$
Now,
$\mathrm{Entropy}(D_{\mathrm{medium}}) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{\mathrm{medium}}) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{\mathrm{medium}}) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D_{\mathrm{medium}}) = -0.667 \log_2(0.667) - 0.333 \log_2(0.333)$

$\mathrm{Entropy}(D_{\mathrm{medium}}) = 0.918$

Given, for Income = low:

Totals for low: 4 records → 3 yes (class 1), 1 no (class 2).
$N_{\mathrm{low}} = 4$
$n_1 = 3$
$n_2 = 1$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{3}{4} = 0.750$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{1}{4} = 0.250$
Now,
$\mathrm{Entropy}(D_{\mathrm{low}}) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{\mathrm{low}}) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D_{\mathrm{low}}) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D_{\mathrm{low}}) = -0.750 \log_2(0.750) - 0.250 \log_2(0.250)$

$\mathrm{Entropy}(D_{\mathrm{low}}) = 0.811$

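$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{high}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{high}}) + \frac{|D_{\mathrm{medium}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{medium}}) + \frac{|D_{\mathrm{low}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{low}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{4}{14} \times 1.000 + \frac{6}{14} \times 0.918 + \frac{4}{14} \times 0.811$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.911$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{Income}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Income}) = 0.940 - 0.911 = 0.029$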


 For Student (values: yes, no):
Student = no: 7 records → 3 yes, 4 no.
Student = yes: 7 records → 6 yes, 1 no.

$\mathrm{Entropy}(D_{\mathrm{no}}) = -\frac{3}{7} \log_2\frac{3}{7} - \frac{4}{7} \log_2\frac{4}{7} = 0.985$

$\mathrm{Entropy}(D_{\mathrm{yes}}) = -\frac{6}{7} \log_2\frac{6}{7} - \frac{1}{7} \log_2\frac{1}{7} = 0.592$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{no}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{no}}) + \frac{|D_{\mathrm{yes}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{yes}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{7}{14} \times 0.985 + \frac{7}{14} \times 0.592$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.789$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{Student}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Student}) = 0.940 - 0.789 = 0.151$


 For credit_rating (values: fair, excellent):
credit_rating = fair: 8 records → 6 yes, 2 no.
credit_rating = excellent: 6 records → 3 yes, 3 no.

$\mathrm{Entropy}(D_{\mathrm{fair}}) = -\frac{6}{8} \log_2\frac{6}{8} - \frac{2}{8} \log_2\frac{2}{8} = 0.811$

$\mathrm{Entropy}(D_{\mathrm{excellent}}) = -\frac{3}{6} \log_2\frac{3}{6} - \frac{3}{6} \log_2\frac{3}{6} = 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{fair}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{fair}}) + \frac{|D_{\mathrm{excellent}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{excellent}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{8}{14} \times 0.811 + \frac{6}{14} \times 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.892$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{credit\_rating}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{credit\_rating}) = 0.940 - 0.892 = 0.048$


(3)

We have,
$\mathrm{IG}(D, \mathrm{Age}) = 0.940 - 0.694 = 0.246$
$\mathrm{IG}(D, \mathrm{Income}) = 0.940 - 0.911 = 0.029$
$\mathrm{IG}(D, \mathrm{Student}) = 0.940 - 0.789 = 0.151$
$\mathrm{IG}(D, \mathrm{credit\_rating}) = 0.940 - 0.892 = 0.048$

We know,

Root attribute = attribute with the highest Information Gain

Root attribute = age
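The root selection can also be checked directly from the 14 records. The following is a minimal sketch: the names `ROWS`, `DATA`, `entropy` and `information_gain` are illustrative, and the age values are written in ASCII (`<=30`, `31...40`, `>40`) purely for convenience.

```python
import math
from collections import Counter

COLUMNS = ["age", "income", "student", "credit_rating", "buys_computer"]
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]

def entropy(rows, target):
    """Entropy (base 2) of the target column over the given rows."""
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attr, target):
    """Entropy of the rows minus the size-weighted entropy of each subset."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder

gains = {a: information_gain(DATA, a, "buys_computer") for a in COLUMNS[:-1]}
root = max(gains, key=gains.get)
for attr, gain in gains.items():
    print(attr, round(gain, 3))
print("root:", root)
```

The gains may differ from the hand-computed ones in the last decimal place because no intermediate rounding is done here, but the ranking, and therefore the choice of age as root, is the same.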

(4)

Subset for age ≤ 30:
age   income  student  credit_rating  buys
≤30   high    no       fair           no
≤30   high    no       excellent      no
≤30   medium  no       fair           no
≤30   low     yes      fair           yes
≤30   medium  yes      excellent      yes

Subset for age > 40:
age   income  student  credit_rating  buys
>40   medium  no       fair           yes
>40   low     yes      fair           yes
>40   low     yes      excellent      no
>40   medium  yes      fair           yes
>40   medium  no       excellent      no

The age 31…40 subset is pure (4 yes, 0 no), so that branch becomes a leaf predicting yes.

 For the next node on the left side (age ≤ 30):

Given,
Totals: 5 records → 2 yes, 3 no.

$N = 5$
$n_1 = 2$
$n_2 = 3$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{2}{5} = 0.400$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{3}{5} = 0.600$
Now,
$\mathrm{Entropy}(D) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D) = -0.400 \log_2(0.400) - 0.600 \log_2(0.600)$

$\mathrm{Entropy}(D) = 0.971$

Entropy of the age ≤ 30 subset for the target attribute buys_computer: $\mathrm{Entropy}(D) = 0.971$


 For Income (values: high, medium, low):

Income = high: 2 records → 0 yes, 2 no.
Income = medium: 2 records → 1 yes, 1 no.
Income = low: 1 record → 1 yes, 0 no.

$\mathrm{Entropy}(D_{\mathrm{high}}) = -\frac{2}{2} \log_2\frac{2}{2} - \frac{0}{2} \log_2\frac{0}{2} = 0.000$

$\mathrm{Entropy}(D_{\mathrm{medium}}) = -\frac{1}{2} \log_2\frac{1}{2} - \frac{1}{2} \log_2\frac{1}{2} = 1.000$

$\mathrm{Entropy}(D_{\mathrm{low}}) = -\frac{1}{1} \log_2\frac{1}{1} - \frac{0}{1} \log_2\frac{0}{1} = 0.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{high}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{high}}) + \frac{|D_{\mathrm{medium}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{medium}}) + \frac{|D_{\mathrm{low}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{low}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{2}{5} \times 0.000 + \frac{2}{5} \times 1.000 + \frac{1}{5} \times 0.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.400$

So, we have $\mathrm{Entropy}(D) = 0.971$

$\mathrm{IG}(D, \mathrm{Income}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Income}) = 0.971 - 0.400 = 0.571$


 For Student (values: yes, no):

Student = no: 3 records → 0 yes, 3 no.
Student = yes: 2 records → 2 yes, 0 no.

$\mathrm{Entropy}(D_{\mathrm{no}}) = -\frac{3}{3} \log_2\frac{3}{3} - \frac{0}{3} \log_2\frac{0}{3} = 0.000$

$\mathrm{Entropy}(D_{\mathrm{yes}}) = -\frac{2}{2} \log_2\frac{2}{2} - \frac{0}{2} \log_2\frac{0}{2} = 0.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{no}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{no}}) + \frac{|D_{\mathrm{yes}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{yes}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{3}{5} \times 0.000 + \frac{2}{5} \times 0.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.000$

So, we have $\mathrm{Entropy}(D) = 0.971$

$\mathrm{IG}(D, \mathrm{Student}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Student}) = 0.971 - 0.000 = 0.971$

 For credit_rating (values: fair, excellent):

credit_rating = fair: 3 records → 1 yes, 2 no.
credit_rating = excellent: 2 records → 1 yes, 1 no.

$\mathrm{Entropy}(D_{\mathrm{fair}}) = -\frac{1}{3} \log_2\frac{1}{3} - \frac{2}{3} \log_2\frac{2}{3} = 0.918$

$\mathrm{Entropy}(D_{\mathrm{excellent}}) = -\frac{1}{2} \log_2\frac{1}{2} - \frac{1}{2} \log_2\frac{1}{2} = 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{fair}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{fair}}) + \frac{|D_{\mathrm{excellent}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{excellent}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{3}{5} \times 0.918 + \frac{2}{5} \times 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.951$

So, we have $\mathrm{Entropy}(D) = 0.971$

$\mathrm{IG}(D, \mathrm{credit\_rating}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{credit\_rating}) = 0.971 - 0.951 = 0.020$

We have, for the age ≤ 30 subset,
$\mathrm{IG}(D, \mathrm{Income}) = 0.971 - 0.400 = 0.571$
$\mathrm{IG}(D, \mathrm{Student}) = 0.971 - 0.000 = 0.971$
$\mathrm{IG}(D, \mathrm{credit\_rating}) = 0.971 - 0.951 = 0.020$

We know,

Split attribute = attribute with the highest Information Gain

Split attribute for the age ≤ 30 node = student
 For the next node on the right side (age > 40):

age   income  student  credit_rating  buys
>40   medium  no       fair           yes
>40   low     yes      fair           yes
>40   low     yes      excellent      no
>40   medium  yes      fair           yes
>40   medium  no       excellent      no

Given,
Totals: 5 records → 3 yes, 2 no.

$N = 5$
$n_1 = 3$
$n_2 = 2$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{3}{5} = 0.600$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{2}{5} = 0.400$
Now,
$\mathrm{Entropy}(D) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D) = -0.600 \log_2(0.600) - 0.400 \log_2(0.400)$

$\mathrm{Entropy}(D) = 0.971$

Entropy of the age > 40 subset for the target attribute buys_computer: $\mathrm{Entropy}(D) = 0.971$

 For Income (values: medium, low):

Income = medium: 3 records → 2 yes, 1 no.
Income = low: 2 records → 1 yes, 1 no.

$\mathrm{Entropy}(D_{\mathrm{medium}}) = -\frac{2}{3} \log_2\frac{2}{3} - \frac{1}{3} \log_2\frac{1}{3} = 0.918$

$\mathrm{Entropy}(D_{\mathrm{low}}) = -\frac{1}{2} \log_2\frac{1}{2} - \frac{1}{2} \log_2\frac{1}{2} = 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{medium}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{medium}}) + \frac{|D_{\mathrm{low}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{low}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{3}{5} \times 0.918 + \frac{2}{5} \times 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.951$

So, we have $\mathrm{Entropy}(D) = 0.971$

$\mathrm{IG}(D, \mathrm{Income}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Income}) = 0.971 - 0.951 = 0.020$

 For Student (values: yes, no):
Student = no: 2 records → 1 yes, 1 no.
Student = yes: 3 records → 2 yes, 1 no.

$\mathrm{Entropy}(D_{\mathrm{no}}) = -\frac{1}{2} \log_2\frac{1}{2} - \frac{1}{2} \log_2\frac{1}{2} = 1.000$

$\mathrm{Entropy}(D_{\mathrm{yes}}) = -\frac{2}{3} \log_2\frac{2}{3} - \frac{1}{3} \log_2\frac{1}{3} = 0.918$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{no}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{no}}) + \frac{|D_{\mathrm{yes}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{yes}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{2}{5} \times 1.000 + \frac{3}{5} \times 0.918$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.951$

So, we have $\mathrm{Entropy}(D) = 0.971$

$\mathrm{IG}(D, \mathrm{Student}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Student}) = 0.971 - 0.951 = 0.020$

 For credit_rating (values: fair, excellent):


credit_rating = fair: 3 records → 3 yes, 0 no.
credit_rating = excellent: 2 records → 0 yes, 2 no.

$\mathrm{Entropy}(D_{\mathrm{fair}}) = -\frac{3}{3} \log_2\frac{3}{3} - \frac{0}{3} \log_2\frac{0}{3} = 0.000$

$\mathrm{Entropy}(D_{\mathrm{excellent}}) = -\frac{2}{2} \log_2\frac{2}{2} - \frac{0}{2} \log_2\frac{0}{2} = 0.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{fair}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{fair}}) + \frac{|D_{\mathrm{excellent}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{excellent}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{3}{5} \times 0.000 + \frac{2}{5} \times 0.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.000$

So, we have $\mathrm{Entropy}(D) = 0.971$

$\mathrm{IG}(D, \mathrm{credit\_rating}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{credit\_rating}) = 0.971 - 0.000 = 0.971$

We have, for the age > 40 subset,
$\mathrm{IG}(D, \mathrm{Income}) = 0.971 - 0.951 = 0.020$
$\mathrm{IG}(D, \mathrm{Student}) = 0.971 - 0.951 = 0.020$
$\mathrm{IG}(D, \mathrm{credit\_rating}) = 0.971 - 0.000 = 0.971$

We know,

Split attribute = attribute with the highest Information Gain

Split attribute for the age > 40 node = credit_rating
(5)

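Given,
age = ≤ 30
income = low
student = yes
credit_rating = fair
Following the tree, age ≤ 30 leads to the student node, and student = yes leads to a leaf labelled yes. So, the predicted value of the target attribute is:

buys_computer = YES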
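The tree built in steps (3) and (4) can be written as a nested dictionary and walked to classify the new instance. This is only a sketch: the `TREE` layout and the `predict` helper are illustrative assumptions, and the age values are written in ASCII.

```python
# Internal nodes are dicts with the attribute tested and one branch per value;
# leaves are plain class labels.
TREE = {
    "attr": "age",
    "branches": {
        "<=30": {"attr": "student", "branches": {"yes": "yes", "no": "no"}},
        "31...40": "yes",
        ">40": {
            "attr": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}

def predict(tree, instance):
    """Follow the branch matching the instance until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][instance[node["attr"]]]
    return node

instance = {
    "age": "<=30",
    "income": "low",
    "student": "yes",
    "credit_rating": "fair",
}
print(predict(TREE, instance))  # yes
```

Note that income is never consulted: it does not appear on any path of this tree.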
 Problem2: Consider the following dataset of 14 Records with attributes Outlook,
Temp, Humidity, Wind, and the target class Play Tennis.

Dataset:
Day Outlook Temp Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

 Your tasks are:


1. Calculate the entropy of the entire dataset for the target attribute Play Tennis.
2. Compute the Information Gain (IG) for each attribute:
o Outlook
o Temp
o Humidity
o Wind

3. Determine the best attribute for the root node based on highest Information Gain.
4. Using your root node, perform the next split on the corresponding subsets and show
the resulting smaller decision tree.
5. Finally, use the constructed decision tree to classify the following new instance:
Outlook = Sunny
Temp = Cool
Humidity = High
Wind = Weak
Predict the value of the target attribute: Play Tennis?
(1)

Given,

Totals: 14 records → 9 yes, 5 no.

$N = 14$
$n_1 = 9$
$n_2 = 5$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{9}{14} = 0.643$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{5}{14} = 0.357$
Now,
$\mathrm{Entropy}(D) = -\sum_{i=1}^{n} p_i \log_2(p_i)$

$\mathrm{Entropy}(D) = -\sum_{i=1}^{2} p_i \log_2(p_i)$

$\mathrm{Entropy}(D) = -p_1 \log_2(p_1) - p_2 \log_2(p_2)$

$\mathrm{Entropy}(D) = -0.643 \log_2(0.643) - 0.357 \log_2(0.357)$

$\mathrm{Entropy}(D) = 0.940$

Entropy of the entire dataset for the target attribute Play Tennis: $\mathrm{Entropy}(D) = 0.940$

(2)

We know,
$\mathrm{IG}(D, A) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$
 For Outlook (values: Sunny, Overcast, Rain):
Outlook   Yes  No  Total
Sunny     2    3   5
Overcast  4    0   4
Rain      3    2   5

$\mathrm{Entropy}(D_{\mathrm{Sunny}}) = -\frac{2}{5} \log_2\frac{2}{5} - \frac{3}{5} \log_2\frac{3}{5} = 0.971$

$\mathrm{Entropy}(D_{\mathrm{Overcast}}) = -\frac{4}{4} \log_2\frac{4}{4} - \frac{0}{4} \log_2\frac{0}{4} = 0.000$

$\mathrm{Entropy}(D_{\mathrm{Rain}}) = -\frac{3}{5} \log_2\frac{3}{5} - \frac{2}{5} \log_2\frac{2}{5} = 0.971$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{Sunny}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Sunny}}) + \frac{|D_{\mathrm{Overcast}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Overcast}}) + \frac{|D_{\mathrm{Rain}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Rain}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0.000 + \frac{5}{14} \times 0.971$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.694$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{Outlook}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Outlook}) = 0.940 - 0.694 = 0.246$


 For Temp (values: Hot, Mild, Cool):
Temp  Yes  No  Total
Hot   2    2   4
Mild  4    2   6
Cool  3    1   4

$\mathrm{Entropy}(D_{\mathrm{Hot}}) = -\frac{2}{4} \log_2\frac{2}{4} - \frac{2}{4} \log_2\frac{2}{4} = 1.000$

$\mathrm{Entropy}(D_{\mathrm{Mild}}) = -\frac{4}{6} \log_2\frac{4}{6} - \frac{2}{6} \log_2\frac{2}{6} = 0.918$

$\mathrm{Entropy}(D_{\mathrm{Cool}}) = -\frac{3}{4} \log_2\frac{3}{4} - \frac{1}{4} \log_2\frac{1}{4} = 0.811$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{Hot}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Hot}}) + \frac{|D_{\mathrm{Mild}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Mild}}) + \frac{|D_{\mathrm{Cool}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Cool}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{4}{14} \times 1.000 + \frac{6}{14} \times 0.918 + \frac{4}{14} \times 0.811$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.911$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{Temp}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Temp}) = 0.940 - 0.911 = 0.029$


 For Humidity (values: High, Normal):
Humidity  Yes  No  Total
High      3    4   7
Normal    6    1   7

$\mathrm{Entropy}(D_{\mathrm{High}}) = -\frac{3}{7} \log_2\frac{3}{7} - \frac{4}{7} \log_2\frac{4}{7} = 0.985$

$\mathrm{Entropy}(D_{\mathrm{Normal}}) = -\frac{6}{7} \log_2\frac{6}{7} - \frac{1}{7} \log_2\frac{1}{7} = 0.592$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{High}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{High}}) + \frac{|D_{\mathrm{Normal}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Normal}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{7}{14} \times 0.985 + \frac{7}{14} \times 0.592$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.789$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{Humidity}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Humidity}) = 0.940 - 0.789 = 0.151$


 For Wind (values: Weak, Strong):
Wind    Yes  No  Total
Weak    6    2   8
Strong  3    3   6

$\mathrm{Entropy}(D_{\mathrm{Weak}}) = -\frac{6}{8} \log_2\frac{6}{8} - \frac{2}{8} \log_2\frac{2}{8} = 0.811$

$\mathrm{Entropy}(D_{\mathrm{Strong}}) = -\frac{3}{6} \log_2\frac{3}{6} - \frac{3}{6} \log_2\frac{3}{6} = 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{|D_{\mathrm{Weak}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Weak}}) + \frac{|D_{\mathrm{Strong}}|}{|D|} \mathrm{Entropy}(D_{\mathrm{Strong}})$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = \frac{8}{14} \times 0.811 + \frac{6}{14} \times 1.000$

$\sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v) = 0.892$

So, we have $\mathrm{Entropy}(D) = 0.940$

$\mathrm{IG}(D, \mathrm{Wind}) = \mathrm{Entropy}(D) - \sum_{v} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{IG}(D, \mathrm{Wind}) = 0.940 - 0.892 = 0.048$


(3)

We have,
$\mathrm{IG}(D, \mathrm{Outlook}) = 0.940 - 0.694 = 0.246$
$\mathrm{IG}(D, \mathrm{Temp}) = 0.940 - 0.911 = 0.029$
$\mathrm{IG}(D, \mathrm{Humidity}) = 0.940 - 0.789 = 0.151$
$\mathrm{IG}(D, \mathrm{Wind}) = 0.940 - 0.892 = 0.048$

We know,

Root attribute = attribute with the highest Information Gain

Root attribute = Outlook

(4)

Subset for Outlook = Sunny:
Outlook  Temp  Humidity  Wind    Play Tennis
Sunny    Hot   High      Weak    No
Sunny    Hot   High      Strong  No
Sunny    Mild  High      Weak    No
Sunny    Cool  Normal    Weak    Yes
Sunny    Mild  Normal    Strong  Yes

Subset for Outlook = Rain:
Outlook  Temp  Humidity  Wind    Play Tennis
Rain     Mild  High      Weak    Yes
Rain     Cool  Normal    Weak    Yes
Rain     Cool  Normal    Strong  No
Rain     Mild  Normal    Weak    Yes
Rain     Mild  High      Strong  No

The Overcast subset is pure (4 yes, 0 no), so that branch becomes a leaf predicting Yes.

Next, repeat the same process on the Sunny and Rain nodes using these smaller subsets.

[Figure: final decision tree]

(5)

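Given,
Outlook = Sunny
Temp = Cool
Humidity = High
Wind = Weak
In the Sunny subset, every record with Humidity = High has Play Tennis = No, so the instance follows Outlook = Sunny and then Humidity = High to a leaf labelled No. So, the predicted value of the target attribute is:

Play Tennis = NO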
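The whole procedure of Problem 2 (choose the attribute with the highest Information Gain, split, repeat on each subset) can be sketched as a short recursive function. This is an ID3-style sketch under stated assumptions, not a full implementation: the names `build`, `gain` and `classify` are illustrative, there is no pruning, and attribute values unseen during training are not handled.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temp", "Humidity", "Wind", "PlayTennis"]
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]
TARGET = "PlayTennis"

def entropy(rows):
    counts = Counter(r[TARGET] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    """Information Gain of splitting rows on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def build(rows, attrs):
    """Return a leaf label, or {attribute: {value: subtree}}."""
    labels = [r[TARGET] for r in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # pure node or no attributes left
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {
        best: {
            value: build([r for r in rows if r[best] == value], rest)
            for value in sorted({r[best] for r in rows})
        }
    }

def classify(tree, instance):
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][instance[attr]]
    return tree

tree = build(DATA, COLUMNS[:-1])
print(tree)
new_instance = {"Outlook": "Sunny", "Temp": "Cool", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, new_instance))  # No
```

On this dataset the sketch splits on Outlook at the root, then on Humidity under Sunny and on Wind under Rain, with Overcast as a pure Yes leaf.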
Gain Ratio
 Gain Ratio is an attribute selection measure used in the C4.5 Decision Tree algorithm.
 It is an improvement over Information Gain (IG).

Why do we need Gain Ratio?


 Information Gain sometimes has a bias:

 It favors attributes with many distinct values


 Example: A unique ID attribute (like Customer ID) would split each row separately, giving
very high IG — but it’s useless for prediction.

 To fix this problem, C4.5 uses Gain Ratio, which adjusts IG by considering how broadly an
attribute splits the data.

Formula for Gain Ratio

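$\mathrm{GainRatio}(A) = \frac{\mathrm{I.Gain}(A)}{\mathrm{SplitInfo}(A)}$

Where:
$\mathrm{I.Gain}(A) = \mathrm{Entropy}(D) - \sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \times \mathrm{Entropy}(D_v)$

$\mathrm{SplitInfo}(A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \times \log_2\left(\frac{|D_v|}{|D|}\right)$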
 Problem3: Compute the Gain Ratio for the attribute Outlook using the Play Tennis
dataset. Provide the Information Gain, SplitInfo, and the final Gain Ratio.

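We have,
$\mathrm{IG}(D, \mathrm{Outlook}) = 0.940 - 0.694 = 0.246$

We know,
Outlook   Count  Proportion (p)
Sunny     5      5/14 = 0.3571
Overcast  4      4/14 = 0.2857
Rain      5      5/14 = 0.3571

$\mathrm{SplitInfo}(A) = -\sum_{v \in \mathrm{Values}(A)} \frac{|D_v|}{|D|} \times \log_2\left(\frac{|D_v|}{|D|}\right)$

$\mathrm{SplitInfo}(\mathrm{Outlook}) = -\frac{5}{14} \log_2\frac{5}{14} - \frac{4}{14} \log_2\frac{4}{14} - \frac{5}{14} \log_2\frac{5}{14}$

$\mathrm{SplitInfo}(\mathrm{Outlook}) = 1.577$

So,
$\mathrm{GainRatio}(\mathrm{Outlook}) = \frac{\mathrm{I.Gain}(\mathrm{Outlook})}{\mathrm{SplitInfo}(\mathrm{Outlook})}$

$\mathrm{GainRatio}(\mathrm{Outlook}) = \frac{0.246}{1.577} = 0.156$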
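The SplitInfo and Gain Ratio arithmetic of Problem 3 can be reproduced as follows. The function names `split_info` and `gain_ratio` are illustrative, and the Information Gain of 0.246 is simply taken from the hand calculation.

```python
import math

def split_info(sizes):
    """SplitInfo from the sizes of the subsets produced by a split."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(info_gain, sizes):
    """Gain Ratio = Information Gain / SplitInfo."""
    return info_gain / split_info(sizes)

# Outlook splits the 14 records into subsets of size 5, 4 and 5.
print(round(split_info([5, 4, 5]), 3))         # 1.577
print(round(gain_ratio(0.246, [5, 4, 5]), 3))  # 0.156
```

One design caveat: an attribute with a single value gives SplitInfo = 0, so `gain_ratio` would divide by zero. A practical implementation would need to skip such attributes.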
Gini Index
 The Gini Index (or Gini Impurity) is a measure used in CART (Classification and
Regression Trees) to select the best attribute for splitting data.

 It measures how impure a dataset is.

 Gini = 0 → perfectly pure node (all samples belong to one class)


 Higher Gini → more mixed classes (more impurity)

Why use Gini Index?


 Faster to compute than entropy (no logarithms)
 Works well in practice
 Like Information Gain, it can still favor attributes with many distinct values, so it does not remove that bias
 Used in CART, Random Forest, and modern tree-based models

Gini Index Formula


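$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} (p_i)^2$

Where:
 $p_i$ = proportion of class $i$
 $c$ = number of classes

Weighted Gini Index of a split of $S$ on attribute $A$:
$\mathrm{Gini}_{\mathrm{split}}(S, A) = \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Gini}(S_v)$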
 Problem4: Using the Play Tennis dataset below, compute the Gini Index to determine
the best attribute for splitting the dataset. Show all steps clearly.

Dataset:
Day Outlook Temp Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Your tasks are:

a) Compute the Gini Index of the entire dataset.


b) Compute the Gini Index for each attribute (Outlook, Temp, Humidity, Wind).
c) Determine which attribute provides the best split according to Gini Index.
d) Show the weighted Gini Index calculation for the chosen attribute.

(a)

Given,

Totals: 14 records → 9 yes, 5 no.

$N = 14$
$n_1 = 9$
$n_2 = 5$

We know,
proportion of class $i$: $p_i = \frac{n_i}{N}$
proportion of class 1: $p_1 = \frac{n_1}{N} = \frac{9}{14} = 0.643$
proportion of class 2: $p_2 = \frac{n_2}{N} = \frac{5}{14} = 0.357$

Now,
$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{c} (p_i)^2$

$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{2} (p_i)^2$

$\mathrm{Gini}(S) = 1 - \{p_1^2 + p_2^2\}$

$\mathrm{Gini}(S) = 1 - \{(0.643)^2 + (0.357)^2\}$

$\mathrm{Gini}(S) = 0.459$
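The Gini computation above needs no logarithms, which makes it easy to check in code. A minimal sketch, with `gini` as an illustrative helper name:

```python
def gini(counts):
    """Gini impurity from a list of per-class counts, e.g. [yes, no]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Play Tennis: 9 yes, 5 no
print(round(gini([9, 5]), 3))  # 0.459
```

A pure node such as `gini([4, 0])` evaluates to 0, and an evenly mixed two-class node such as `gini([3, 3])` to 0.5, the maximum for two classes.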

(b)
 Compute Gini Index for a Split for Outlook:

 Outlook has three values:

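Outlook   Yes  No  Total
Sunny     2    3   5
Overcast  4    0   4
Rain      3    2   5

 Gini for each branch:

$\mathrm{Gini}(\mathrm{Sunny}) = 1 - \left[\left(\frac{2}{5}\right)^2 + \left(\frac{3}{5}\right)^2\right] = 0.480$

$\mathrm{Gini}(\mathrm{Overcast}) = 1 - \left[\left(\frac{4}{4}\right)^2 + \left(\frac{0}{4}\right)^2\right] = 0.000$

$\mathrm{Gini}(\mathrm{Rain}) = 1 - \left[\left(\frac{3}{5}\right)^2 + \left(\frac{2}{5}\right)^2\right] = 0.480$

 Weighted Gini Index for Outlook:

$\mathrm{Gini}_{\mathrm{split}}(\mathrm{Outlook}) = \frac{5}{14}(0.480) + \frac{4}{14}(0.000) + \frac{5}{14}(0.480) = 0.343$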
 Compute Gini Index for a Split for TEMP:

 TEMP has three values:

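Temp  Yes  No  Total
Hot   2    2   4
Mild  4    2   6
Cool  3    1   4

 Gini for each branch:

$\mathrm{Gini}(\mathrm{Hot}) = 1 - \left[\left(\frac{2}{4}\right)^2 + \left(\frac{2}{4}\right)^2\right] = 0.500$

$\mathrm{Gini}(\mathrm{Mild}) = 1 - \left[\left(\frac{4}{6}\right)^2 + \left(\frac{2}{6}\right)^2\right] = 0.444$

$\mathrm{Gini}(\mathrm{Cool}) = 1 - \left[\left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2\right] = 0.375$

 Weighted Gini Index for Temp:

$\mathrm{Gini}_{\mathrm{split}}(\mathrm{Temp}) = \frac{4}{14}(0.500) + \frac{6}{14}(0.444) + \frac{4}{14}(0.375) = 0.440$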

 Compute Gini Index for a Split for Humidity:

 Humidity has two values:

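Humidity  Yes  No  Total
High      3    4   7
Normal    6    1   7

 Gini for each branch:

$\mathrm{Gini}(\mathrm{High}) = 1 - \left[\left(\frac{3}{7}\right)^2 + \left(\frac{4}{7}\right)^2\right] = 0.490$

$\mathrm{Gini}(\mathrm{Normal}) = 1 - \left[\left(\frac{6}{7}\right)^2 + \left(\frac{1}{7}\right)^2\right] = 0.245$

 Weighted Gini Index for Humidity:

$\mathrm{Gini}_{\mathrm{split}}(\mathrm{Humidity}) = \frac{7}{14}(0.490) + \frac{7}{14}(0.245) = 0.367$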
 Compute Gini Index for a Split for Wind:

 Wind has two values:

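Wind    Yes  No  Total
Weak    6    2   8
Strong  3    3   6

 Gini for each branch:

$\mathrm{Gini}(\mathrm{Weak}) = 1 - \left[\left(\frac{6}{8}\right)^2 + \left(\frac{2}{8}\right)^2\right] = 0.375$

$\mathrm{Gini}(\mathrm{Strong}) = 1 - \left[\left(\frac{3}{6}\right)^2 + \left(\frac{3}{6}\right)^2\right] = 0.500$

 Weighted Gini Index for Wind:

$\mathrm{Gini}_{\mathrm{split}}(\mathrm{Wind}) = \frac{8}{14}(0.375) + \frac{6}{14}(0.500) = 0.429$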

(c)

We have,
Attribute  Weighted Gini (Gini split)
Outlook    0.343
Temp       0.440
Humidity   0.367
Wind       0.429

We know,
Best attribute = Outlook
because it has the lowest weighted Gini Index.
Thus, a CART decision tree will split on Outlook first.
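The comparison in part (c) can be sketched by computing the weighted Gini Index of every attribute from its (yes, no) counts and taking the minimum. The names `gini_split` and `SPLITS` are illustrative, and the counts are copied from the tables in part (b).

```python
def gini(counts):
    """Gini impurity from per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(branches):
    """Size-weighted average Gini over the branches of a split."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini(b) for b in branches)

# (yes, no) counts per attribute value, taken from the tables in part (b)
SPLITS = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],
    "Temp": [(2, 2), (4, 2), (3, 1)],
    "Humidity": [(3, 4), (6, 1)],
    "Wind": [(6, 2), (3, 3)],
}

scores = {attr: gini_split(branches) for attr, branches in SPLITS.items()}
best = min(scores, key=scores.get)
for attr, score in scores.items():
    print(attr, round(score, 3))
print("best:", best)  # best: Outlook
```

Lower is better here, which is the opposite direction from Information Gain, where the highest value wins.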
(d)
 Weighted Gini Index for Outlook:
$\mathrm{Gini}_{\mathrm{split}}(\mathrm{Outlook}) = \frac{5}{14}(0.480) + \frac{4}{14}(0.000) + \frac{5}{14}(0.480) = 0.343$
Overfitting in Decision Trees

 What is it?
 Tree becomes too complex.
 Fits noise instead of patterns.
 Symptoms:
o High training accuracy.
o Low test accuracy.

 Causes
 Tree grown too deep.
 Too many branches capturing random noise.

 Result
 Poor generalization.

Pre-pruning (Early Stopping)


 Stop tree growth before it becomes too complex.

 Stopping criteria
 Max tree depth reached.
 Minimum samples at leaf.
 Minimum info gain threshold.

 Example
 If IG < 0.01 → stop splitting.

 Advantages
 Prevents overfitting
 Faster training
 Simple, interpretable tree

 Disadvantages
 May stop early and miss good splits
 Hard to choose thresholds
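The stopping criteria listed above can be collected into one check that a tree builder would call before each split. This is a hedged sketch: the function `should_stop`, its parameter names, and the default thresholds (apart from the 0.01 gain threshold used in the example above) are assumptions made for illustration.

```python
def should_stop(depth, n_samples, best_gain,
                max_depth=3, min_samples_leaf=2, min_gain=0.01):
    """Return True when any pre-pruning (early stopping) criterion is met."""
    if depth >= max_depth:
        return True  # maximum tree depth reached
    if n_samples < 2 * min_samples_leaf:
        return True  # a split could not leave enough samples in both children
    if best_gain < min_gain:
        return True  # best available split gains too little information
    return False

print(should_stop(depth=1, n_samples=10, best_gain=0.005))  # True
print(should_stop(depth=1, n_samples=10, best_gain=0.246))  # False
```

The second disadvantage noted above shows up directly in the signature: each default is a threshold that has to be tuned, typically against a validation set.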
Post-Pruning (Cost Complexity Pruning)
 First grow full tree, then prune unnecessary branches.

 Steps
1. Grow full tree.
2. Starting from leaves, remove branches that don’t improve accuracy.
3. Use validation set or cross-validation.

 Example
 In spam classification, deep branches with rare word patterns are pruned.

 Advantages
o Better generalization
o Simplifies tree
o Data-driven pruning

 Disadvantages
o Extra computation
o Needs validation set
o Time-consuming on large trees
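The steps above (grow fully, then remove branches that do not help on a validation set) can be sketched as follows. Note the swap: this sketch uses reduced-error pruning, a simpler validation-based variant, rather than cost-complexity pruning itself. The tree layout, the `majority` field, and the tiny example data are all hypothetical.

```python
def predict(tree, row):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, dict):
        branch = tree["branches"].get(row[tree["attr"]])
        if branch is None:
            return tree["majority"]
        tree = branch
    return tree

def prune(tree, val_rows, target):
    """Bottom-up: replace a subtree by its majority-class leaf whenever that
    does not lower accuracy on the validation rows reaching the node."""
    if not isinstance(tree, dict) or not val_rows:
        return tree
    for value, child in tree["branches"].items():
        subset = [r for r in val_rows if r[tree["attr"]] == value]
        tree["branches"][value] = prune(child, subset, target)
    leaf = tree["majority"]
    leaf_correct = sum(r[target] == leaf for r in val_rows)
    tree_correct = sum(predict(tree, r) == r[target] for r in val_rows)
    return leaf if leaf_correct >= tree_correct else tree

# Hypothetical over-grown tree: the wind test under "sunny" fits noise.
tree = {
    "attr": "outlook", "majority": "yes",
    "branches": {
        "sunny": {"attr": "wind", "majority": "no",
                  "branches": {"weak": "no", "strong": "yes"}},
        "overcast": "yes",
    },
}
validation = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
]
pruned = prune(tree, validation, "play")
print(pruned["branches"]["sunny"])  # no
```

The wind subtree is collapsed because a plain "no" leaf classifies both sunny validation rows correctly, while the root is kept because removing it would lower validation accuracy.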
