Supervised Learning in Machine Learning

Uploaded by

study2722004
Copyright
© All Rights Reserved
MACHINE LEARNING

Unit 3: Supervised Learning


Classification and Regression, Some Sample Datasets, k-Nearest
Neighbours, Linear Models, Naive Bayes Classifiers, Decision Trees.

Supervised Learning
Supervised learning is a type of machine learning in which machines are
trained on well-labelled training data and, on the basis of that data, predict
the output.
The aim of a supervised learning algorithm is to find a mapping function that
maps the input variable (x) to the output variable (y).
Data that contains both the features and the target is known as labelled data.
Once trained, the algorithm can predict or classify new, unseen data based on its
learned knowledge.
Advantages of Supervised learning:
o With the help of supervised learning, the model can predict the output on
the basis of prior experiences.
o In supervised learning, we can have an exact idea about the classes of
objects.
o Supervised learning models help us to solve various real-world problems
such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models can struggle with very complex tasks.
o Supervised learning may not predict the correct output if the test data
differs substantially from the training data.
o Training can require a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of
objects.
Applications :
1. Healthcare
2. Finance
3. E-commerce
4. Natural Language Processing
5. Image & Speech recognition
6. Cyber Security
7. Autonomous Vehicles

Types of Supervised Learning

Classification:
Classification is a type of supervised machine learning where algorithms learn
from a set of labeled training data and then make predictions on new, unseen data.
In machine learning, classification involves teaching a computer how to sort or
group different items into categories.

Example:
Imagine teaching a computer to identify spam emails. You start by showing it
examples of emails, some marked as spam and some not. The computer learns
from these examples, recognizing patterns in the features of the emails. Once
trained, it can predict whether a new email is spam or not by applying what it
learned. This is how classification works: sorting things into groups based on
their characteristics.
Applications:
1. Spam Detection.
2. Medical diagnosis
3. Sentiment Analysis
4. Image recognition.
5. Fraud detection etc…

Classification Algorithms:
There are several common classification algorithms in machine learning, each
with its strengths and suitable use cases. Here are some widely used
classification algorithms:
Logistic Regression
Decision Trees
Random Forest
Support Vector Machines (SVM)
K-Nearest Neighbors (KNN)
Naive Bayes

Benefits of Classification
Fast Decisions: These algorithms can quickly sort information or data, helping
in areas like spotting fake bank transactions or figuring out a patient's illness,
making things faster and more efficient.
Really Accurate: If you teach these algorithms well with lots of examples, they
can get really good at their jobs, like recognizing faces or understanding spoken
words.
Can Handle Lots of Data: They're great at dealing with loads of information,
growing with your needs, whether you're analyzing tweets or tracking customer
habits.
Flexible: You can use them for many different things, from understanding
text messages to figuring out what's in a photo.
Easy to Understand Some Choices: Algorithms like decision trees show you
clearly why they made a certain choice, which can be very important in areas
like banking or health.
Good with Complicated Stuff: Some types can sort through complex patterns,
even when things aren't straightforward or clear-cut.
Keep Learning: They can keep getting better over time, learning from new
information without starting over from scratch.
Saves Time and Effort: By doing the sorting work, these algorithms let people
focus on more creative tasks, saving time and effort.
Saves Money: Automating these sorting tasks can cut costs by reducing the need
for manual work and making processes quicker.

Classification performance metrics


The machine learning model is built using training data (which has inputs as well
as outputs). Predictions are then made with the same model on the test data
(unseen data whose output labels are withheld from the model).
But how do you figure out the effectiveness of the model?
There must be some measures that will evaluate the performance of the model.
There are many performance metrics to evaluate model performance, such
as accuracy, precision, recall, F1-score, ROC curve, etc., each having its
advantages and disadvantages.
A performance measure evaluates the performance of the model on the test
dataset. Here, we will use binary classification (where only two
classes are present) since it is simple to understand.

Accuracy
It is the ratio of correctly classified points (prediction) to the total number of
predictions. Its value ranges between 0 and 1.
$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
Confusion Matrix:
Confusion Matrix helps us to display the performance of a model or how a
model has made its prediction in Machine Learning.

In any machine learning model, we usually focus on accuracy. But if you are
dealing with a classification problem, you also need to worry about the
percentage of correct classification and misclassification. So we need a
mechanism that not only provides accuracy but also helps in estimating correct
classification and misclassification. The confusion matrix serves this purpose. It
is an NxN matrix which helps to evaluate the performance of machine learning
model for classification problem.

For a binary problem, this matrix consists of 4 main elements that count the
numbers of correct and incorrect predictions. The name of each element combines
two words:
• True or False
• Positive or Negative
If the predicted and true labels match, the prediction is said to be correct;
when they do not match, the prediction is said to be incorrect. Further,
positive and negative represent the predicted labels in the matrix.
There are four metrics combinations in the confusion matrix, which are as
follows:
• True Positive: This combination tells us how many times a model
correctly classifies a positive sample as Positive.
• False Negative: This combination tells us how many times a model
incorrectly classifies a positive sample as Negative.
• False Positive: This combination tells us how many times a model
incorrectly classifies a negative sample as Positive.
• True Negative: This combination tells us how many times a model
correctly classifies a negative sample as Negative.

Hence, in binary classification problems, a confusion matrix sorts every
prediction into one of these four outcomes.
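The four counts described above can be tallied directly from a list of true labels and a list of predicted labels. The snippet below is a minimal sketch; the function name `confusion_counts` and the label lists are hypothetical, made up purely for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN for a binary classification problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive sample correctly classified as Positive
        elif actual == positive:
            fn += 1  # positive sample incorrectly classified as Negative
        elif predicted == positive:
            fp += 1  # negative sample incorrectly classified as Positive
        else:
            tn += 1  # negative sample correctly classified as Negative
    return tp, fn, fp, tn


# Hypothetical labels (1 = positive, 0 = negative), not taken from any real model.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```

The four counts always add up to the total number of predictions, which is a quick sanity check on any confusion matrix.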
Precision
Precision is defined as the ratio of correctly classified positive samples (True
Positive) to the total number of samples classified as positive (either correctly
or incorrectly).
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} = \frac{TP}{TP + FP}$$
Precision helps us to gauge the reliability of the machine learning model when it
classifies a sample as positive.
Recall
The recall is calculated as the ratio of the number of Positive samples
correctly classified as Positive to the total number of Positive samples. The
recall measures the model's ability to detect positive samples. The higher the
recall, the more positive samples detected.
$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} = \frac{TP}{TP + FN}$$
Recall is independent of the number of negative sample classifications. Further,
if the model classifies all positive samples as positive, then Recall will be 1.
Sensitivity
Sensitivity is the metric that evaluates a model's ability to predict true positives
of each available category.
Sensitivity is a measure of how well a machine learning model can detect
positive instances. It is also known as the true positive rate (TPR) or recall.
Sensitivity is used to evaluate model performance because it allows us to see
how many positive instances the model was able to correctly identify.
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity
Specificity is the metric that evaluates a model's ability to predict true negatives
of each available category.
Specificity is defined as the proportion of actual negatives, which got predicted
as the negative (or true negative).
For example, in disease screening, specificity is the proportion of people not
suffering from the disease who are correctly predicted as not suffering from
it. In other words, specificity is the rate at which healthy people are
predicted as healthy.
$$\text{Specificity} = \frac{TN}{TN + FP}$$
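The accuracy, precision, recall (sensitivity) and specificity formulas above can be sketched as one small function of the four confusion-matrix counts. The function name and the counts passed to it are hypothetical, chosen only so the arithmetic is easy to follow; returning 0.0 when a denominator is zero is a convention assumed here, not something the notes specify.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the basic binary-classification metrics from the four counts."""
    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is zero.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),  # correct / total
        "precision": ratio(tp, tp + fp),                # TP / (TP + FP)
        "recall": ratio(tp, tp + fn),                   # TP / (TP + FN), also sensitivity
        "specificity": ratio(tn, tn + fp),              # TN / (TN + FP)
    }


# Hypothetical counts for 100 test samples.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, recall depends only on the 50 actual positives (40 found, 10 missed), which illustrates why recall is independent of how the negative samples are classified.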

ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the
performance of a classification model at all classification thresholds. This curve
plots two parameters:
• True Positive Rate
• False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as
follows:
$$TPR = \frac{TP}{TP + FN}$$
False Positive Rate (FPR) is defined as follows:
$$FPR = \frac{FP}{TN + FP}$$

An ROC curve plots TPR vs. FPR at different classification thresholds.


Lowering the classification threshold classifies more items as positive, thus
increasing both False Positives and True Positives. The following figure shows
a typical ROC curve.

To compute the points in an ROC curve, we could evaluate a logistic regression
model many times with different classification thresholds, but this would be
inefficient. Fortunately, there's an efficient algorithm that can provide this
information for us, called AUC.

AUC: Area Under the ROC curve


AUC stands for Area Under the ROC Curve. As its name suggests, AUC
calculates the two-dimensional area under the entire ROC curve, ranging from
(0,0) to (1,1).

In the ROC curve, AUC computes the performance of the binary classifier
across different thresholds and provides an aggregate measure. The value of
AUC ranges from 0 to 1, which means an excellent model will have AUC near
1, and hence it will show a good measure of Separability.
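
To make the link between thresholds, the ROC curve and AUC concrete, here is a minimal sketch that sweeps every distinct score as a threshold, records the (FPR, TPR) point for each, and then estimates the area with the trapezoidal rule. This is the simple threshold-by-threshold approach, not an optimised one; the function names, labels and scores are hypothetical and made up for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per threshold, running from (0,0) to (1,1)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    # Lowering the threshold step by step classifies more items as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels and model scores (higher score = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
area = auc(points)
print(points)
print("AUC:", round(area, 3))
```

In this toy data, 8 of the 9 possible (positive, negative) pairs are ranked correctly by the scores, and the computed area matches that fraction.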

Bias Variance decomposition


The components of any predictive error are Noise, Bias, and Variance. The
bias-variance decomposition aims to measure the bias and variance of a given
model and to observe their behaviour for various models, such as Linear Regression,
Decision Tree, Bagging, and Random Forest, for a given number of sample sizes.
• Bias: Difference between the prediction of the true model and that of the
average model (the models are built on n samples obtained from the
population).
• True Model: Model built on the population data.
• Average Model: Average of all the predictions obtained from the models
built on the various samples drawn from the population.
• Variance: Difference between the predictions of the individual models
built on the samples and the prediction of the average model.
• Noise: The irreducible error that a model cannot predict.
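
For squared-error loss, the three components listed above add up to the expected prediction error. The following is a sketch of that standard decomposition under stated assumptions: the target is $y = f(x) + \varepsilon$ with noise of mean zero and variance $\sigma^2$, $f$ plays the role of the true model, $\hat{f}$ is a model fitted on one random sample, and $\mathbb{E}[\hat{f}(x)]$ is the average model. These symbols are notation introduced here and do not appear in the original notes.

```latex
\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Noise}}
```

The expectation is taken over both the random training samples and the noise, so only the first two terms can be reduced by choosing a different model.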

Regression
• Regression is a type of supervised machine learning where algorithms
learn from the data to predict continuous values.
• The goal is to establish a mathematical relationship between the
independent variables (features) and the dependent variable (output)
using labeled training data.
• The trained regression model can then make predictions for new, unseen
data, providing a way to estimate or forecast numerical outcomes.
• In technical terms, during the regression process, the model tries to
learn a function that maps the features to the target as accurately as
possible. When predicting house prices, for example, the features could be
size, location and number of bedrooms, and the target is the price.
• This involves a lot of trial and error, adjusting the model's internal
parameters based on the feedback it gets (how far off its guesses are from
the real prices).
• The ultimate goal is to reduce the error of these guesses to a minimum, so
that when presented with a new house, the model can predict its price as
accurately as possible, based on what it has learned from the training data.
• Common applications include predicting prices, sales, temperatures, or
any other continuous variable.

Example:
For example, weather forecasting: imagine you're planning a picnic and want
to know how much rain to expect in your area so you can decide whether to go
ahead or cancel. You've been keeping a diary of the weather: how much it rained
each day, what the temperature was, how cloudy it was, and the wind speed.
Using regression for weather forecasting is like having a crystal ball that
uses your diary to predict the weather. You tell the crystal ball, "Today, it's 75
degrees, 50% cloudy, and the wind is blowing at 5 miles per hour." The crystal
ball looks at all the similar days in your diary and predicts how much rain is
likely today, helping you plan your picnic with more confidence. (Predicting
only whether it will rain, yes or no, would be a classification problem rather
than regression.)
Common Regression Algorithms:
Linear Regression
Decision Tree Regression.
K Nearest Neighbor Regression
Random Forest Regression
Neural Networks
Difference between Classification and Regression

K-Nearest Neighbour(KNN):
• The K-Nearest Neighbors (KNN) algorithm is a simple, easy-to-
understand machine learning algorithm used for both classification &
regression tasks.
• It works by finding the 'K' nearest data points to a given data point and
then making predictions based on their labels (for classification) or values
(for regression).
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case and the
available cases, and puts the new case into the category it is most
similar to.
• The K-NN algorithm stores all the available data and classifies a new data
point based on similarity. This means that when new data appears, it
can easily be classified into a well-suited category using the K-NN
algorithm.
• The K-NN algorithm can be used for Regression as well as for Classification,
but it is mostly used for Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately; instead, it stores the dataset and, at the time of
classification, performs an action on the dataset.
• The choice of 'K' determines how many neighboring points are
considered when making predictions.
• KNN is simple and intuitive, making it easy to understand and
implement.

Working of KNN:
The K-NN algorithm looks at a new piece of data and checks how similar it is to
the data we already have.
It does this by comparing the new data to its K closest neighbors.
Then, based on these similarities, it decides which group or category the new
data belongs to in our dataset.

Algorithm:
Step 1 - Assign a value to K. (This value tells us how many nearby
neighbors we'll look at.)
Step 2 - Calculate the distance between the new data entry and all other existing
data entries. Arrange them in ascending order.
Step 3 - Find the K nearest neighbors to the new entry based on the calculated
distances.
Step 4 - Assign the new data point to the category that has the largest number
of neighbors among them.
Step 5 - Our model is ready.
In its simplest version, the k-NN algorithm only considers exactly one nearest
neighbor, which is the closest training data point to the point we want to make
a prediction for. The prediction is then simply the known output for this training
point.
Advantages of KNN
1. Simple and Easy to Implement: KNN is straightforward to understand and
implement, making it a good starting point for beginners in machine learning.
2. No Model Training Required: KNN is a lazy learning algorithm, meaning it
doesn't require training a model. The "training" phase is essentially just storing
the dataset, which makes it fast to adapt to new data.
3. Versatile: It can be used for both classification and regression tasks.
4. Adaptive: Since KNN makes predictions based on the nearest neighbors, it
can adapt well if the dataset changes over time, provided you re-run the
algorithm with the updated data.
5. No Assumptions about Data: Unlike many algorithms that require data to
follow a certain distribution, KNN doesn't make any strong assumptions about
the underlying data distribution, making it useful in real-world scenarios where
such assumptions often don't hold.

Disadvantages of KNN
1. Computationally Expensive: As the dataset grows, the algorithm becomes
slower. It needs to compute the distance of a new point to every other
point in the dataset, which can be slow for large datasets.
2. High Memory Requirement: Since the algorithm stores all of the training
data, it can require a lot of memory for large datasets.
3. Sensitive to Irrelevant Features: KNN can perform poorly if there are a
lot of irrelevant or redundant features in the dataset because it treats all
features with equal importance.
4. Sensitive to the Scale of Data: Features that are on larger scales can
unduly influence the algorithm. Data often need to be normalized or
standardized to make sure each feature contributes equally to the distance
calculations.
5. Choosing the Right k Value: Selecting the optimal number of neighbors,
k, can be challenging. A value too small can make the algorithm sensitive
to noise, while a value too large can make it overly generalize, possibly
leading to incorrect predictions.
KNN Problem
Apply K- Nearest neighbor classifier to predict the diabetic patient with the
given features BMI, Age. If the training examples are,
BMI Age Sugar
33.6 50 1
26.6 30 0
23.4 40 0
43.1 67 0
35.3 23 1
35.9 67 1
36.7 45 1
25.7 46 0
23.3 29 0
31 56 1

Assume k=3. Does the patient with age=40 and BMI=43.6 have Sugar or not?

$$\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

BMI   Age  Sugar  Formula                               Distance  Rank
33.6  50   1      sqrt((43.6 - 33.6)^2 + (40 - 50)^2)   14.14     2
26.6  30   0      sqrt((43.6 - 26.6)^2 + (40 - 30)^2)   19.72
23.4  40   0      sqrt((43.6 - 23.4)^2 + (40 - 40)^2)   20.20
43.1  67   0      sqrt((43.6 - 43.1)^2 + (40 - 67)^2)   27.00
35.3  23   1      sqrt((43.6 - 35.3)^2 + (40 - 23)^2)   18.92
35.9  67   1      sqrt((43.6 - 35.9)^2 + (40 - 67)^2)   28.08
36.7  45   1      sqrt((43.6 - 36.7)^2 + (40 - 45)^2)   8.52      1
25.7  46   0      sqrt((43.6 - 25.7)^2 + (40 - 46)^2)   18.88     3
23.3  29   0      sqrt((43.6 - 23.3)^2 + (40 - 29)^2)   23.09
31    56   1      sqrt((43.6 - 31)^2 + (40 - 56)^2)     20.37

The three nearest neighbours (ranks 1 to 3) have Sugar values 1, 1 and 0. The
majority class is 1, so the prediction is:

Test Example BMI=43.6, Age=40, Sugar=1
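
The hand calculation above can be checked with a few lines of Python. This is a minimal sketch of the same procedure (Euclidean distance, sort, majority vote among the k nearest); the function name `knn_predict` is hypothetical, and, as in the worked example, the features are used unscaled.

```python
import math
from collections import Counter

# (BMI, Age, Sugar) rows copied from the training table of the problem.
training = [
    (33.6, 50, 1), (26.6, 30, 0), (23.4, 40, 0), (43.1, 67, 0), (35.3, 23, 1),
    (35.9, 67, 1), (36.7, 45, 1), (25.7, 46, 0), (23.3, 29, 0), (31.0, 56, 1),
]


def knn_predict(training, query, k):
    """Return the majority label among the k nearest rows, plus those rows."""
    # Distance from the query point to every training point, in ascending order.
    distances = sorted(
        (math.dist(query, (bmi, age)), label) for bmi, age, label in training
    )
    nearest = distances[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


prediction, nearest = knn_predict(training, query=(43.6, 40), k=3)
for distance, label in nearest:
    print(f"distance={distance:.2f}  Sugar={label}")
print("Predicted Sugar:", prediction)
```

Because Age spans a wider range than BMI in this table, it has more influence on the distances; scaling the features first could change which neighbours are selected.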


Linear Models:
Linear models are a fundamental class of algorithms used in machine learning
for both regression and classification tasks.
They model the relationship between input features and target variables using
linear functions.
A linear regression model, the most common linear model, is a
statistical method used to describe the association between a dependent
variable and one or more independent variables.
Linear models generate a formula to create a best-fit line to predict unknown
values.

Types of Linear Models:


Linear Regression
Logistic Regression
Ridge Regression
Lasso Regression

Linear Regression:
• Linear regression is a statistical method used to model the relationship
between a dependent variable and one or more independent variables.
• It assumes a linear relationship between the independent variables and the
dependent variable.
• The goal of linear regression is to fit a straight line to the data that best
represents the relationship between the variables.
• This line is determined by estimating the coefficients that minimize the
difference between the observed values and the values predicted by the
linear model
• Linear regression is commonly used for prediction and forecasting in
various fields such as economics, finance, and social sciences.
Diagram:
The linear regression model simply shows how two variables are related by
hawing a straight line between them, with a certain slope.

Linear regression is represented by

$$Y = a_0 + a_1 X + \varepsilon$$
Here,
Y represents the dependent variable (also known as the target variable)
X represents the independent variable (also referred to as the predictor variable)
$a_0$ is the intercept
$a_1$ is the slope
$\varepsilon$ refers to the random error
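
As an illustration of how the intercept and slope can be estimated, here is a minimal sketch using the ordinary least squares formulas for a single independent variable. The notes do not name a specific fitting method, so least squares is an assumption here, and the function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate intercept a0 and slope a1 by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical data lying exactly on Y = 1 + 2X (no random error).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a0, a1 = fit_line(xs, ys)
print("intercept a0:", a0)
print("slope a1:", a1)
print("prediction for X = 6:", a0 + a1 * 6)
```

With noisy real data the points would not sit exactly on the line, and the fitted coefficients would be the ones that minimise the sum of squared differences between observed and predicted values.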

Types of Linear regression


1. Simple Linear Regression: If a single independent variable is used
to predict the value of a numerical dependent variable, then such a
Linear Regression algorithm is called Simple Linear Regression.
2. Multiple Linear Regression: If more than one independent variable is
used to predict the value of a numerical dependent variable, then
such a Linear Regression algorithm is called Multiple Linear
Regression.
Advantages of Linear Regression
• Simple to Understand: Linear regression is easy to grasp, showing how
changes in one thing relate to changes in another.
• Easy to Use: It's straightforward to apply with most software doing the
heavy lifting for you.
• Fast: It works quickly, even with lots of data, saving time and computing
power
• Works Well with Small Data: You don't need tons of data to get good
results, making it cost-effective.
• Great for Straight-Line Relationships: When things increase or decrease
in a direct line, it's very accurate.
• Helps Make Decisions: You can figure out if and how much things
matter, like how sales relate to advertising.
• Flexible: Can be tweaked (with versions like ridge or LASSO) to work
better under different situations.

Disadvantages of Linear Regression


• Linear Only: Best for straight-line relationships; struggles with complex
patterns.
• Outlier Sensitive: Extreme data points can throw off the entire model.
• Overfitting Risk: Too many variables can make it perform poorly on new
data.
• No Multicollinearity: Problems arise if independent variables are too
similar.
• Needs Independent Observations: Assumes each data point
doesn't affect the others.
• Constant Error Variance: Assumes the spread of errors stays the same
throughout.
• Struggles with Curves: Can't accurately model trends that aren't straight
lines.
• Continuous Data Only: Not ideal for predicting categories or groups.
Naïve Bayes Algorithm
Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
Naïve Bayes Classifier is one of the simple and most effective Classification
algorithms which helps in building the fast machine learning models that can
make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
Some popular applications of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
The name Naïve Bayes is made up of two words, Naïve and Bayes, which
can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge. It
depends on the conditional probability.
o The formula for Bayes' theorem is given as:

$$P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}$$

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that
the hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the
evidence.
P(B) is Marginal Probability: Probability of Evidence.

Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:

No.  Outlook   Play
0    Rainy     Yes
1    Sunny     Yes
2    Overcast  Yes
3    Overcast  Yes
4    Sunny     No
5    Rainy     Yes
6    Sunny     Yes
7    Overcast  Yes
8    Rainy     No
9    Sunny     No
10   Sunny     Yes
11   Rainy     No
12   Overcast  Yes
13   Overcast  Yes

Frequency table for the Weather Conditions:

Weather   Yes  No
Overcast  5    0
Rainy     2    2
Sunny     3    2
Total     10   4

Likelihood table weather condition:

Weather   No         Yes         P(Weather)
Overcast  0          5           5/14=0.35
Rainy     2          2           4/14=0.29
Sunny     2          3           5/14=0.35
All       4/14=0.29  10/14=0.71

Applying Bayes'theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5

P(No)= 0.29

P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a Sunny day, the Player can play the game.
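
The calculation above can be reproduced from the 14-row dataset with exact fractions. This is a minimal sketch applying Bayes' theorem to the single Outlook feature; the function name `posterior` is hypothetical. Working with exact fractions gives P(Yes|Sunny) = 3/5 and P(No|Sunny) = 2/5; the hand-worked values differ slightly in the second decimal because they use rounded intermediate probabilities.

```python
from fractions import Fraction

# Outlook and Play columns copied row by row from the dataset (rows 0 to 13).
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]


def posterior(weather, label):
    """P(label | weather) = P(weather | label) * P(label) / P(weather)."""
    n = len(play)
    label_count = play.count(label)
    both = sum(1 for o, p in zip(outlook, play) if o == weather and p == label)
    likelihood = Fraction(both, label_count)         # P(weather | label)
    prior = Fraction(label_count, n)                 # P(label)
    evidence = Fraction(outlook.count(weather), n)   # P(weather)
    return likelihood * prior / evidence


p_yes = posterior("Sunny", "Yes")
p_no = posterior("Sunny", "No")
print("P(Yes|Sunny) =", p_yes, "=", float(p_yes))
print("P(No|Sunny)  =", p_no, "=", float(p_no))
print("Decision:", "Play" if p_yes > p_no else "Do not play")
```

The two exact posteriors sum to 1, as they should for a two-class problem, and the decision agrees with the hand calculation.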

Advantages of Naive Bayes Classifier


• Simple and Easy to Implement: Naive Bayes is straightforward to
understand and implement, making it accessible even for beginners in
data science.
• Efficient on Large Datasets: It can quickly process large amounts of data,
making it suitable for big data applications.
• Performs Well with Small Data: Even with a small dataset, Naive Bayes
can produce good classification results.
• Handles Missing Values Well: The algorithm can still function effectively
even if some of the data is missing.
• Good for Multi-class Prediction: It is capable of handling problems with
multiple classes effectively.
• Requires Less Training Data: Unlike more complex models, it doesn't
require as much training data to make predictions.
• Works Well with Text Data: Particularly effective for text classification
tasks, like spam detection or sentiment analysis.

Disadvantages of Naive Bayes Classifier
• Independence Assumption: It assumes that all features are independent of
one another, which rarely holds exactly in real data, so it cannot learn
relationships between features.
• Zero-Frequency Problem: If a feature value never occurs with a class in the
training data (for example, Overcast with No in the frequency table above),
its estimated probability for that class is zero unless smoothing is applied.
Decision tree:
A decision tree is a supervised learning technique.
It is used for both classification and regression.
These trees are used to either classify data or predict what will come next.
It is a tree-like structure which contains a root node, branches, internal nodes and
leaf nodes.
Basic terminologies:
Root node: The root node is the initial node in a decision tree. It is where the
dataset begins to split into groups based on different features or conditions.
Decision Nodes: Decision nodes are nodes that emerge from
the division of the root node. These nodes represent intermediate decisions or
conditions within the tree.
Leaf Nodes: Nodes that cannot be further divided; they frequently represent a
final classification or result. Another name for leaf nodes is terminal nodes.
Branch or subtree: A branch, also known as a sub-tree, is a portion of the
decision tree that is smaller than the complete tree. Within the tree, it indicates a
specific path of decisions and outcomes.

Parent and Child Node:


In a decision tree, a node that is divided into sub-nodes is known as a parent
node and the sub-nodes emerging from it are referred to as child nodes.
A decision or conditions is represented by the parent node, and possible
outcomes of additional decisions based on that situation are represented by the
child nodes.
Pruning: The process of removing branches or nodes from a decision tree to
improve its generalization and prevent overfitting.
Advantages:
• Compared to other algorithms, decision trees require less effort for data
preparation during pre-processing.
• A decision tree does not require normalization of data.
• A decision tree does not require scaling of data either.
• Missing values in the data also do not affect the process of building a
decision tree to any considerable extent.
• A Decision tree model is very intuitive and easy to explain to technical
teams as well as stakeholders.

Disadvantages:
• A small change in the data can cause a large change in the structure of the
decision tree, causing instability.
• Calculations for a decision tree can sometimes become far more complex
than for other algorithms.
• Decision trees often take longer to train, which makes training
relatively expensive.
• Decision trees are less well suited to regression: they predict continuous
values only as piecewise-constant estimates.

Diagram:
The diagrammatical representation of decision tree
Basic algorithm (ID3)
• ID3 stands for Iterative Dichotomiser 3 and is named such because the
algorithm iteratively (repeatedly) dichotomizes(divides) features into two
or more groups at each step.
• ID3 uses a top-down greedy approach to build a decision tree. In simple
words, the top-down approach means that we start building the tree from
the top and the greedy approach means that at each iteration we select the
best feature at the present moment to create a node.
• Most generally ID3 is only used for classification problems with nominal
features only.
What is ID3 algorithm?
• The ID3 algorithm selects the best feature at each step while building a
Decision tree.
• It is a classification algorithm that follows a greedy approach by selecting
the best attribute, the one that yields maximum Information Gain (IG) or
minimum Entropy (H).
• ID3 uses Information Gain or just Gain to find the best feature.
What is Entropy and Information gain?
Entropy is a measure of the amount of uncertainty in the dataset S.
Mathematical Representation of Entropy is shown here
$$H(S) = \sum_{c \in C} -p(c) \log_2 p(c)$$
Where,
S - The current dataset for which entropy is being calculated (changes every
iteration of the ID3 algorithm).
C - Set of classes in S (example: C = {yes, no}).
p(c) - The proportion of the number of elements in class c to the number of
elements in set S.
In ID3, entropy is calculated for each remaining attribute. The attribute with the
smallest entropy is used to split the set S on that particular iteration.
Entropy=0 implies it is of pure class, that means all are of same category.
Information Gain IG(A, S) tells us how much uncertainty in S was reduced after
splitting set S on attribute A. Mathematical representation of Information gain is
shown here
$$IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t)$$

Where,
• H(S) - Entropy of set S.
• T - The subsets created from splitting set S by attribute A such that
$S = \bigcup_{t \in T} t$.
• p(t) - The proportion of the number of elements in t to the number of
elements in set S.
• H(t) - Entropy of subset t.
In ID3, information gain can be calculated (instead of entropy) for each
remaining attribute. The attribute with the largest information gain is used to
split the set S on that particular iteration.

What are the steps in ID3 algorithm?


The steps in the ID3 algorithm are as follows:
1. Calculate entropy for the dataset.
2. For each attribute/feature:
   a. Calculate entropy for all its categorical values.
   b. Calculate information gain for the feature.
3. Find the feature with maximum information gain.
4. Repeat until we get the desired tree.
Decision Tree Problem
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

Here, the dataset has binary classes (yes and no), where 9 out of 14 are "yes" and 5
out of 14 are "no".
Complete entropy of dataset is:
H(S) = -p(yes) log2(p(yes)) - p(no) log2(p(no))
= -(9/14) log2(9/14) - (5/14) log2(5/14)
= -(-0.41) - (-0.53)
= 0.94
For each attribute of the dataset, let's follow step 2 of the algorithm:

First Attribute - Outlook


Categorical values - sunny, overcast and rain
H(Outlook=sunny)=-(2/5)*log(2/5)-(3/5)*log(3/5)=0.971
H(Outlook=rain)=-(3/5)*log(3/5)-(2/5)*log(2/5)=0.971
H(Outlook=overcast)=-(4/4)*log(4/4)-0=0
Average Entropy Information for Outlook -
I(Outlook)=p(sunny)*H(Outlook=sunny) + p(rain)*H(Outlook=rain) +
p(overcast)*H(Outlook=overcast)
=(5/14)*0.971+(5/14)*0.971+(4/14)*0
=0.693
Information Gain=H(S)-I(Outlook)
=0.94-0.693
=0.247
Second Attribute - Temperature
Categorical values - hot, mild, cool
H(Temperature=hot)=-(2/4)*log(2/4)-(2/4)*log(2/4)=1
H(Temperature=cool)=-(3/4)*log(3/4)-(1/4)*log(1/4)=0.811
H(Temperature=mild)=-(4/6)*log(4/6)-(2/6)*log(2/6)=0.9179
Average Entropy Information for Temperature -
I(Temperature)=p(hot)*H(Temperature=hot) + p(mild)*H(Temperature=mild) +
p(cool)*H(Temperature=cool)
=(4/14)*1+(6/14)*0.9179+ (4/14)*0.811
=0.9108
Information Gain=H(S) - I(Temperature)
=0.94-0.9108
= 0.0292
Third Attribute - Humidity
Categorical values - high, normal
H(Humidity=high)=-(3/7)*log(3/7)-(4/7)*log(4/7) =0.985
H(Humidity=normal)=-(6/7)*log(6/7)-(1/7)*log(1/7)=0.591
Average Entropy Information for Humidity –
I(Humidity)=p(high)*H(Humidity=high) + p(normal)*H(Humidity=normal)
=(7/14)*0.985+ (7/14)*0.591
= 0.788
Information Gain=H(S) - I(Humidity)
=0.94-0.788
=0.152
Fourth Attribute - Wind
Categorical values - weak, strong
H(Wind=weak)=-(6/8)*log(6/8)-(2/8)*log(2/8) =0.811
H(Wind=strong)=-(3/6)*log(3/6)-(3/6)*log(3/6)=1
Average Entropy Information for Wind -
I(Wind)=p(weak)*H(Wind=weak) + p(strong)*H(Wind=strong)
=(8/14)*0.811+(6/14)*1
=0.892
Information Gain=H(S)-I(Wind)
=0.94-0.892
= 0.048
Here, the attribute with maximum information gain is Outlook. So, the decision
tree built so far -

Here, when Outlook=overcast, it is of pure class(Yes).
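The four hand calculations above can be cross-checked with a short Python sketch. The tuple layout of the table and the helper names `entropy` and `gain` are assumptions made for this illustration; the rows themselves are copied from the Play Tennis table.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting rows on column index col (class is last)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"{name}: {value:.3f}")
print("Root attribute:", best)
```

The computed gains agree with the hand-worked values to within rounding of the intermediate results, and Outlook comes out as the root attribute.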


Now, we have to repeat the same procedure for the rows whose Outlook value is
Sunny, and then for the rows whose Outlook value is Rain.
Now, finding the best attribute for splitting the data with Outlook=Sunny
(dataset rows = [1, 2, 8, 9, 11]).
Complete entropy of Sunny is -
H(Sunny)=-p(yes)*log2(p(yes)) - p(no)*log2(p(no))
=-(2/5)*log2(2/5)-(3/5)*log2(3/5)
= 0.971
First Attribute-Temperature
Categorical values - hot, mild, cool
H(Sunny, Temperature=hot)=-0-(2/2)*log(2/2)=0
H(Sunny, Temperature=cool)=-(1)*log(1)-0=0
H(Sunny, Temperature=mild)= -(1/2)*log(1/2)-(1/2)*log(1/2)=1
Average Entropy Information for Temperature -
I(Sunny, Temperature)= p(Sunny, hot)*H(Sunny, Temperature=hot)+p(Sunny,
mild)*H(Sunny, Temperature=mild)+p(Sunny, cool)*H(Sunny, Temperature=cool)
=(2/5)*0+(2/5)*1+(1/5)*0
=0.4
Information Gain =H(Sunny) - I(Sunny, Temperature)
=0.971-0.4
=0.571
Second Attribute - Humidity
Categorical values - high, normal
H(Sunny, Humidity=high)=-0-(3/3)*log(3/3)=0
H(Sunny, Humidity=normal) =-(2/2)*log(2/2)-0=0
Average Entropy Information for Humidity -
I(Sunny, Humidity)= p(Sunny, high)*H(Sunny, Humidity=high) + p(Sunny,
normal)*H(Sunny, Humidity=normal)
=(3/5)*0+ (2/5)*0
=0
Information Gain=H(Sunny) - I(Sunny, Humidity)
=0.971-0
=0.971

Third Attribute - Wind


Categorical values - weak, strong
H(Sunny, Wind=weak)=-(1/3)*log(1/3)-(2/3)*log(2/3)=0.918
H(Sunny, Wind=strong)=-(1/2)*log(1/2)-(1/2)*log(1/2)=1
Average Entropy Information for Wind-
I(Sunny, Wind)= p(Sunny, weak)*H(Sunny, Wind=weak) + p(Sunny, strong)
*H(Sunny, Wind=strong)
=(3/5)*0.918+(2/5)*1
=0.9508
Information Gain =H(Sunny)- I(Sunny, Wind)
=0.971-0.9508
=0.0202
Here, the attribute with maximum information gain is Humidity. So, the decision
tree built so far –

Here, when Outlook=Sunny and Humidity=High, it is a pure class of category
"no". And when Outlook=Sunny and Humidity=Normal, it is again a pure class
of category "yes". Therefore, we don't need to do further calculations.
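A quick Python check of the Sunny branch, using only dataset rows 1, 2, 8, 9 and 11. The tuple layout and the helper names are assumptions made for this sketch.

```python
from collections import Counter, defaultdict
from math import log2

# Rows with Outlook = Sunny: (Temperature, Humidity, Wind, Play Tennis)
COLUMNS = ("Temperature", "Humidity", "Wind")
SUNNY = [
    ("Hot", "High", "Weak", "No"),       # day 1
    ("Hot", "High", "Strong", "No"),     # day 2
    ("Mild", "High", "Weak", "No"),      # day 8
    ("Cool", "Normal", "Weak", "Yes"),   # day 9
    ("Mild", "Normal", "Strong", "Yes"), # day 11
]

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, col):
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gains = {name: gain(SUNNY, i) for i, name in enumerate(COLUMNS)}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"{name}: {value:.3f}")
print("Best attribute under Outlook=Sunny:", best)
```

Humidity splits the Sunny rows into two pure subsets, so its gain equals the full entropy of the subset.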
Now, finding the best attribute for splitting the data with Outlook=Rain
(dataset rows = [4, 5, 6, 10, 14]).
Complete entropy of Rain is -
H(Rain)=-p(yes)*log2(p(yes))-p(no)*log2(p(no))
=-(3/5)*log(3/5)-(2/5)*log(2/5)
=0.971
First Attribute - Temperature
Categorical values - mild, cool
H(Rain, Temperature=cool)=-(1/2)*log(1/2)-(1/2)*log(1/2)=1
H(Rain, Temperature=mild)=-(2/3)*log(2/3)-(1/3)*log(1/3)=0.918
Average Entropy Information for Temperature -
I(Rain, Temperature) = p(Rain, mild)*H(Rain, Temperature=mild) + p(Rain,
cool)*H(Rain, Temperature=cool)
=(3/5)*0.918+(2/5)*1
=0.9508
Information Gain =H(Rain)- I(Rain, Temperature)
=0.971-0.9508
=0.0202
Second Attribute - Wind
Categorical values - weak, strong
H(Rain, Wind=weak)=-(3/3)*log(3/3)-0=0
H(Rain, Wind=strong)=-0-(2/2)*log(2/2)=0
Average Entropy Information for Wind -
I(Rain, Wind)= p(Rain, weak)*H(Rain, Wind=weak) + p(Rain, strong)*H(Rain,
Wind=strong)
=(3/5)*0+ (2/5)*0
=0
Information Gain=H(Rain)-I(Rain, Wind)
=0.971-0
=0.971
Here, the attribute with maximum information gain is Wind. (Humidity, which is
not worked out above, gives a gain of only 0.0202 on these rows.) So, the decision
tree built so far -

Here, when Outlook=Rain and Wind=Strong, it is a pure class of category "no".
And when Outlook=Rain and Wind=Weak, it is again a pure class of category
"yes".
And this is our final desired tree for the given dataset.
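One possible way to write down the final tree is as a nested Python dictionary, with a small function that walks it to classify an example. The dictionary layout and the names `TREE` and `classify` are assumptions made for this sketch; the branches are the ones derived in the worked example.

```python
TREE = {
    "Outlook": {
        "Overcast": "Yes",
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(tree, example):
    """Follow the branches matching the example until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))             # attribute tested at this node
        tree = tree[attribute][example[attribute]]
    return tree

COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

examples = [dict(zip(COLUMNS, row)) for row in ROWS]
correct = sum(classify(TREE, e) == e["Play"] for e in examples)
print(f"{correct} of {len(examples)} training rows classified correctly")
```

The tree never tests Temperature, and it reproduces the label of every row in the training table.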

Characteristics of ID3 Algorithm are as follows:


1. ID3 uses a greedy approach, so it does not guarantee an optimal solution; it
can get stuck in local optima.
2. ID3 can overfit the training data (to avoid overfitting, smaller decision
trees should be preferred over larger ones).
3. The algorithm usually produces small trees, but it does not always produce
the smallest possible tree.
4. ID3 is harder to use on continuous data (if the values of an attribute are
continuous, there are many more places to split the data on that attribute,
and searching for the best value to split by can be time consuming).
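To illustrate point 4, the sketch below shows one common workaround for a continuous attribute: try a threshold halfway between each pair of neighbouring values and keep the one with the highest information gain. This is the kind of thresholding used by successors of ID3 such as C4.5, not part of ID3 itself; the function name `best_threshold` and the temperature readings are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split value <= threshold."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    base = entropy(labels)
    best_cut, best_gain = None, 0.0
    # One candidate threshold between every pair of neighbouring values.
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2
        left = [y for x, y in pairs if x <= cut]
        right = [y for x, y in pairs if x > cut]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_cut, best_gain = cut, base - remainder
    return best_cut, best_gain

# Hypothetical numeric temperature readings with their class labels.
temps = [64, 65, 68, 69, 70, 71]
play = ["yes", "yes", "yes", "no", "no", "no"]
cut, g = best_threshold(temps, play)
print(cut, g)
```

With n distinct values there are n - 1 candidate thresholds per attribute at every node, which is why continuous data makes the search more expensive.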
Issues in learning decision trees include
• Determining how deeply to grow the decision tree
• Handling continuous attributes
• Choosing an appropriate attribute selection measure
• Handling training data with missing attribute values
• Handling attributes with differing costs
• Improving computational efficiency.

