REGRESSION
Regression in machine learning refers to a supervised learning technique where the goal is to
predict a continuous numerical value based on one or more independent features. It finds
relationships between variables so that predictions can be made.
Two types of variables are present in regression:
Dependent Variable (Target): The variable we are trying to predict, e.g., house price.
Independent Variables (Features): The input variables that influence the prediction, e.g.,
locality, number of rooms.
Linear Regression in Machine learning
“Linear regression is a type of Supervised Machine Learning Algorithm that learns from the
labeled datasets and maps the data points with most optimized linear functions which can be
used for prediction on new datasets.”
It computes the linear relationship between the dependent variable and one or more
independent features by fitting a linear equation with observed data.
It predicts a continuous output variable based on the independent input variables.
E.g.: to predict a house price we consider various factors such as house age, distance
from the main road, location, area and number of rooms. Linear regression uses all these
features to predict the house price because it assumes a linear relationship between the
features and the price of the house.
Best Fit Line
The best Fit Line equation provides a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
Our primary objective in linear regression is to locate the best-fit line, that is, the line
for which the error between the predicted and actual values is as small as possible.
Fig: Linear Regression
Here, Y= Dependent/Target variable
X=Independent variable/Predictor of Y.
In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The
regression line is the best-fit line for our model.
Types of Linear Regression
1. Simple Linear Regression: Involves only one independent variable and one dependent
variable. Y = β0 + β1X + ϵ
Where:
Y = dependent variable (target)
X = independent variable (predictor)
β0 = intercept
β1 = coefficient (slope)
ϵ = error term
2. Multiple Linear Regression: Involves multiple independent variables and one
dependent variable. Y = β0 + β1X1 + β2X2 + ... + βnXn + ϵ
Where:
Y = dependent variable (target)
X1, X2, ..., Xn = independent variables (predictors)
β0 = intercept
β1, β2, ..., βn = coefficients (slopes)
ϵ = error term
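The simple linear regression equation can be fitted by least squares with a few lines of code. The sketch below uses the standard closed-form estimates for the slope and intercept; the function name and the experience/salary numbers are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical data: years of experience vs. salary (in thousands)
experience = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]
b0, b1 = fit_simple_linear_regression(experience, salary)
print(b0, b1)  # 30.0 5.0
```

Here the data lie exactly on a line, so the fitted intercept and slope reproduce it; with noisy data the same formulas give the line with the smallest sum of squared errors.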
Assumptions of Linear Regression
1. Linearity: The relationship between independent and dependent variables is linear.
2. Independence: Observations are independent of each other.
3. Homoscedasticity: Constant variance of residuals (error terms).
4. Normality: The residuals should be normally distributed.
5. No Multicollinearity (for multiple regressions): Independent variables should not be
highly correlated with each other.
Applications of Linear Regression
Predicting sales, stock prices, or house prices
Analyzing trends and relationships in data
Assessing the impact of variables in scientific and business research
Logistic Regression in Machine learning
Logistic regression is a supervised machine learning algorithm used for classification
tasks where the goal is to predict the probability that an instance belongs to a given
class or not.
Unlike linear regression, which predicts an unbounded continuous value, it predicts a
probability.
The output is transformed using the sigmoid (logistic) function, which ensures the
prediction is between 0 and 1.
Logistic regression is used for binary classification: the sigmoid function takes a linear
combination of the independent variables as input and produces a probability value
between 0 and 1.
For example, with two classes, Class 0 and Class 1, if the value of the logistic function
for an input is greater than 0.5 (the threshold value) then the input belongs to Class 1;
otherwise it belongs to Class 0.
Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value.
It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value
as 0 or 1, it gives probabilistic values which lie between 0 and 1.
In logistic regression, instead of fitting a regression line, we fit an “S”-shaped logistic
function, whose output approaches the two limiting values 0 and 1.
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
1. Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
Examples:
(i) Deciding on whether or not to offer a loan to a bank customer: Outcome = yes or no.
(ii) Evaluating the risk of cancer: Outcome = high or low.
(iii) Predicting a team’s win in a football match: Outcome = yes or no.
2. Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
Examples:
(i) Let’s say you want to predict the most popular transportation type for 2040. Here,
transport type equates to the dependent variable, and the possible outcomes can be
electric cars, electric trains, electric buses, and electric bikes.
(ii) Predicting whether a student will join a college, vocational/trade school, or corporate
industry.
(iii) Estimating the type of food consumed by pets, the outcome may be wet food, dry
food, or junk food.
3. Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as “Low”, “Medium”, or “High”.
Examples: Dependent variables represent,
(i) Formal shirt size: Outcomes = XS/S/M/L/XL
(ii) Survey answers: Outcomes = Disagree/Unsure/Agree
(iii) Scores on a math test: Outcomes = Poor/Average/Good
Assumptions of Logistic Regression
1. Independent observations: Each observation is independent of the others, meaning there is
no correlation between observations.
2. Binary dependent variable: Binomial logistic regression assumes that the dependent variable
is binary or dichotomous, meaning it can take only two values. For more than two
categories, the softmax function is used.
3. Linearity relationship between independent variables and log odds: The relationship
between the independent variables and the log odds of the dependent variable should
be linear.
4. No outliers: The dataset should not contain extreme outliers.
5. Large sample size: The sample size should be sufficiently large.
Sigmoid Function
The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
It maps any real value into another value within the range of 0 to 1. The output of
logistic regression must lie between 0 and 1 and cannot go beyond these limits, so the
function forms an “S”-shaped curve.
The S-shaped curve is called the sigmoid function or the logistic function.
In logistic regression, we use the concept of a threshold value to turn the probability
into a class of either 0 or 1: values above the threshold are assigned to class 1, and
values below the threshold are assigned to class 0.
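The sigmoid mapping and the threshold rule can be sketched in a few lines. The code below assumes the standard logistic formula sigmoid(z) = 1 / (1 + e^(-z)); the function names and the default threshold of 0.5 are illustrative choices, not part of any particular library.

```python
import math

def sigmoid(z):
    """Map a real number z to a value between 0 and 1."""
    # Two branches avoid overflow in math.exp for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_class(z, threshold=0.5):
    """Return 1 if the predicted probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0

print(sigmoid(0))          # 0.5
print(predict_class(2.0))  # 1
print(predict_class(-2.0)) # 0
```

In a fitted logistic regression model, z would be the linear combination of the input features; here it is passed in directly to keep the sketch short.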
Difference between Linear and Logistic Regression
Linear Regression | Logistic Regression
Used to predict a continuous dependent variable from a given set of independent variables. | Used to predict a categorical dependent variable from a given set of independent variables.
Used for solving regression problems. | Used for solving classification problems.
We predict the value of a continuous variable. | We predict the value of a categorical variable.
We find the best-fit line. | We find the S-curve.
The least squares method is used to estimate the parameters. | The maximum likelihood method is used to estimate the parameters.
The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc.
Requires a linear relationship between the dependent and independent variables. | Does not require a linear relationship between the dependent and independent variables (only between the independent variables and the log odds).
Independent variables should not be highly correlated with each other (see the No Multicollinearity assumption). | There should be little to no collinearity between the independent variables.
BAYESIAN LEARNING
Conditional Probability
Definition: Conditional probability is the probability of an event occurring given that another
event has already occurred. It is denoted as:
P(A∣B) = P(A∩B) / P(B)
P(A∣B)= Conditional Probability
P(A∩B)=Joint Probability
P(B)=Marginal Probability
Where:
P(A∣B) is the probability of event A occurring given that B has occurred.
P(A∩B) is the probability of both A and B happening together.
P(B) is the probability of event B occurring (must be greater than 0).
Example: Suppose a deck of 52 playing cards is given, and we randomly pick one card. Let:
Event A: The card is a King. P(A) = 4/52
Event B: The card is a face card (King, Queen, or Jack). P(B) = 12/52
P(A∩B) = Probability that the card is both a King and a face card. Since all Kings
are face cards, P(A∩B) = 4/52.
P(A∣B) = P(A∩B) / P(B) = (4/52) / (12/52) = 4/12 = 1/3
Thus, the probability of drawing a King given that the card is a face card is 1/3 or 33.3%.
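The card example can be checked by enumerating the deck. This is a small sketch using exact fractions; the helper `prob` and the rank and suit labels are names introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def prob(event):
    """Probability of an event as (favourable cards) / (all cards)."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_king(card):
    return card[0] == "K"

def is_face(card):
    return card[0] in {"J", "Q", "K"}

p_b = prob(is_face)                                   # P(B) = 12/52
p_a_and_b = prob(lambda c: is_king(c) and is_face(c)) # P(A∩B) = 4/52
p_a_given_b = p_a_and_b / p_b                         # P(A|B)
print(p_a_given_b)  # 1/3
```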
Bayes Theorem / Bayes Rule / Bayes Law
According to Conditional Probability:
P(A∣B) = P(A∩B) / P(B)
P(A∩B) = P(A∣B) · P(B) ……….eq(i)
P(B∣A) = P(B∩A) / P(A)
P(B∩A) = P(B∣A) · P(A) ……….eq(ii)
Since P(A∩B) = P(B∩A), the left-hand sides of eq(i) and eq(ii) are equal, so we can equate
the right-hand sides:
P(A∣B) · P(B) = P(B∣A) · P(A)
P(A∣B) = P(B∣A) · P(A) / P(B)
This is Bayes Theorem.
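As a quick numeric check of the theorem, the sketch below applies it to the King and face-card events: every King is a face card, so P(B∣A) = 1. The function name `bayes` is hypothetical.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return p_b_given_a * p_a / p_b

# A = card is a King, B = card is a face card
p_a = Fraction(4, 52)
p_b = Fraction(12, 52)
p_b_given_a = Fraction(1)  # every King is a face card
print(bayes(p_b_given_a, p_a, p_b))  # 1/3
```

The result agrees with the value obtained directly from the definition of conditional probability.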
Fig: Bayes Theorem with Terminology
Concept Learning
Concept learning is the process of inferring a function from labeled training data in supervised
learning. It involves identifying patterns or rules that correctly classify instances into predefined
categories, using methods like decision trees or neural networks to search through the possible
hypotheses and select the best one.
Concept learning involves learning a general rule from a set of observed instances.
Instead of memorizing each example, it creates a broader understanding that can be applied
to unseen situations.
Maximally specific hypothesis: A maximally specific hypothesis is the most specific hypothesis
that can be derived from a given set of training data.
Find S Algorithm
The Find-S algorithm is a basic concept learning algorithm in machine learning. It finds
the most specific hypothesis that fits all the positive examples. Note that the algorithm
considers only the positive training examples. It starts with the most specific hypothesis
and generalizes this hypothesis each time it fails to classify an observed positive training
example. Hence, the Find-S algorithm moves from the most specific hypothesis towards
more general hypotheses.
Important Representation:
1. ? Indicates that any value is acceptable for the attribute.
2. A single required value (e.g., Cold) indicates that only that value is acceptable for the attribute.
3. ϕ indicates that no value is acceptable.
4. The most general hypothesis is represented by: {?, ?, ?, ?, ?, ?}
5. The most specific hypothesis is represented by: {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
Steps Involved In Find-S :
1. Start with the most specific hypothesis. h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. Take the next example and if it is negative, then no changes occur to the hypothesis.
3. If the example is positive and we find that our current hypothesis is too specific, then we
update the hypothesis to a more general condition.
4. Keep repeating the above steps until all the training examples have been processed.
5. After we have processed all the training examples, we will have the final hypothesis, which
we can use to classify new examples.
Algorithm:
1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   For each attribute constraint a_i in h
      If the constraint a_i is satisfied by x
         Then do nothing
         Else replace a_i in h by the next more general constraint that is satisfied by x
3. Output hypothesis h
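The algorithm above can be sketched in Python as follows. This is a minimal version that assumes every attribute is categorical and that the only generalization step is replacing a mismatched value with "?"; the function name `find_s` and the use of `None` to stand for the all-ϕ hypothesis are choices made here for illustration.

```python
def find_s(examples):
    """Find-S over (attributes, is_positive) pairs.

    Returns the most specific hypothesis consistent with the positive
    examples, or None (the all-ϕ hypothesis) if there are no positives.
    """
    hypothesis = None  # stands for {ϕ, ϕ, ..., ϕ}
    for attributes, is_positive in examples:
        if not is_positive:
            continue  # negative examples are ignored
        if hypothesis is None:
            # The first positive example replaces every ϕ with its own values
            hypothesis = list(attributes)
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:
                hypothesis[i] = "?"  # generalize the mismatched attribute
    return hypothesis

# Enjoy Sport training data: (Sky, Air Temp, Humidity, Wind, Water, Forecast)
training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
print(find_s(training))  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```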
Limitations of Find-S Algorithm
1. There is no way to determine if the hypothesis is consistent throughout the data.
2. Inconsistent training sets can actually mislead the Find-S algorithm, since it ignores the
negative examples.
3. Find-S algorithm does not provide a backtracking technique to determine the best
possible changes that could be done to improve the resulting hypothesis.
Two important points for solving numerical problems on the Find-S algorithm:
1. If the example has a negative outcome, neglect the example; our hypothesis remains
the same.
2. If the example has a positive outcome, compare every attribute with the current
hypothesis, and if any mismatch is found, replace that particular attribute with the general
case (“?”).
Numerical Examples on Concept Learning
Example 1: Apply concept learning to determine a hypothesis that describes when "Enjoy
Sport" is Yes or No based on the given attributes.
Example Sky Air Temp Humidity Wind Water Forecast Enjoy Sport
1 Sunny Warm Normal Strong Warm Same Yes
2 Sunny Warm High Strong Warm Same Yes
3 Rainy Cold High Strong Warm Change No
4 Sunny Warm High Strong Cool Change Yes
Step 1: Identifying Attributes and Possible Values
The dataset consists of the following attributes:
Attribute Possible Values
Sky Sunny, Rainy
Air Temp Warm, Cold
Humidity Normal, High
Wind Strong
Water Warm, Cool
Forecast Same, Change
Enjoy Sport (Target) Yes, No
Step 2: Find-S Algorithm (Finding the Most Specific Hypothesis)
The Find-S algorithm finds the most specific hypothesis that covers all positive examples.
1. Initialize the hypothesis : h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ}
2. E1=(Sunny, Warm, Normal, Strong, Warm, Same)------ positive[Consider]
h= {Sunny, Warm, Normal, Strong, Warm, Same}
3. E2=(Sunny, Warm, High, Strong, Warm, Same)------ positive[Consider]
h = {Sunny, Warm, ?, Strong, Warm, Same}
4. E3=(Rainy, Cold, High, Strong, Warm, Change)------ Negative[Ignore]
5. E4=(Sunny, Warm, High, Strong, Cool, Change) ------ positive[Consider]
h = {Sunny, Warm, ?, Strong, ?, ?}
Step 3: Final Hypothesis
The most specific hypothesis learned from the positive examples is:
h = (Sunny, Warm, ?, Strong, ?, ?)
Step 4: Conclusion
The concept learned suggests that
o if the Sky is Sunny,
o Air Temperature is Warm, and
o Wind is Strong,
o then "Enjoy Sport" is likely Yes,
o regardless of Humidity, Water, and Forecast[any value].
This means Sunny & Warm weather with Strong Wind is the most crucial condition
for Enjoying Sport.
Example 2: Apply concept learning to determine a hypothesis that describes when "Smile"
is Yes or No based on the given attributes.
Example Eyes Nose Head Fcolor Hair Smile
1 Round Triangle Round Purple Yes Yes
2 Square Square Square Green Yes No
3 Square Triangle Round Yellow Yes Yes
4 Round Triangle Round Green No No
5 Square Square Round Yellow Yes Yes
Step 1: Identifying Attributes and Values
The dataset contains five attributes and one target:
Attribute Possible Values
Eyes Round, Square
Nose Triangle, Square
Head Round, Square
Fcolor (Face Color) Purple, Green, Yellow
Hair Yes, No
Smile (Target) Yes, No
Step 2: Find-S Algorithm (Finding the Most Specific Hypothesis)
1. Initialize the hypothesis : h = {ϕ, ϕ, ϕ, ϕ, ϕ}
2. E1=(Round, Triangle, Round, Purple, Yes) ------ positive[Consider]
h = (Round, Triangle, Round, Purple, Yes)
3. E2=(Square, Square, Square, Green, Yes) ------ Negative[Ignore]
4. E3=(Square, Triangle, Round, Yellow, Yes) ------ positive[Consider]
h = (?, Triangle, Round, ?, Yes)
5. E4=(Round, Triangle, Round, Green, No) ------ Negative[Ignore]
6. E5=(Square, Square, Round, Yellow, Yes) ------ positive[Consider]
h = (?, ?, Round, ?, Yes)
Step 3: Final Hypothesis
The most specific hypothesis that covers all positive examples is:
h = (?, ?, Round, ?, Yes)
Step 4: Conclusion
The necessary conditions for a positive outcome, i.e. Smile = Yes:
o Head must be Round
o Hair must be Yes
o The values for Eyes, Nose, and Face Color do not matter.
Example 3: Apply concept learning to determine a hypothesis that describes when "Buy" is
Yes or No based on the given attributes.
Example Citation Size In Library Price Editions Buy
1 Some Small No Affordable Many No
2 Many Big No Expensive One Yes
3 Some Big Always Expensive Few No
4 Many Medium No Expensive Many Yes
5 Many Small No Affordable Many Yes
Step 1: Identifying Attributes and Values
The dataset consists of five attributes and one target:
Attribute Possible Values
Citation Some, Many
Size Small, Medium, Big
In Library No, Always
Price Affordable, Expensive
Editions One, Few, Many
Buy (Target) Yes, No
Step 2: Apply Find-S Algorithm
1. Initialize the hypothesis : h = {ϕ, ϕ, ϕ, ϕ, ϕ}
2. E1=(Some, Small, No, Affordable, Many) -------Negative[Ignore]
3. E2=(Many, Big, No, Expensive, One) ------ positive[Consider]
h = (Many, Big, No, Expensive, One)
4. E3=(Some, Big, Always, Expensive, Few)-------Negative[Ignore]
5. E4=(Many, Medium, No, Expensive, Many) ------ positive[Consider]
h = (Many, ?, No, Expensive, ?)
6. E5=(Many, Small, No, Affordable, Many) ------ positive[Consider]
h = (Many, ?, No, ?, ?)
Step 3: Final Hypothesis
After processing all positive examples, we get:
h = (Many, ?, No, ?, ?)
Step 4: Conclusion: Conditions necessary for "Buy = Yes":
Citation must be "Many"
In Library must be "No"
Size, Price, and Editions do not matter[any value]
Thus, the model has learned that if a book has many citations and is not in the library, it is
more likely to be bought, regardless of size, price, or editions.
Example 4: Consider the following data set having the data about which particular seeds are
poisonous. Apply concept learning to determine a hypothesis that describes when
"POISONOUS" is Yes or No based on the given attributes.
Step 1: Identifying Attributes and Values
The dataset consists of four attributes and one target:
Attribute Possible Values
Color GREEN, BROWN, ORANGE
Toughness HARD, SOFT
Fungus YES, NO
Appearance WRINKLED, SMOOTH
Poisonous(Target) YES, NO
Step 2: Apply Find-S Algorithm
1. Initialize the hypothesis : h = {ϕ, ϕ, ϕ, ϕ}
2. E1=(Green, Hard, No, Wrinkled ) ------- positive[Consider]
h={ Green, Hard, No, Wrinkled}
3. E2=(Green, Hard, Yes, Smooth) ------Negative[Ignore]
4. E3=( Brown, Soft, No, Wrinkled) ------Negative[Ignore]
5. E4=( Orange, Hard, No, Wrinkled) ------- positive[Consider]
h={ ?, Hard, No, Wrinkled}
6. E5=( Green, Soft, Yes, Smooth) ------- positive[Consider]
h={ ?, ?, ?, ?}
Since every attribute in our hypothesis is now the general condition "?", the remaining
examples (example 6 and example 7) would leave the hypothesis unchanged, with all attributes
general.
Step 3: Final Hypothesis
After processing all positive examples, we get:
h = ( ?, ?, ?, ?)
Step 4: Conclusion: The learned hypothesis places no constraint on Color, Toughness, Fungus,
or Appearance [Any Value], so it classifies every seed as Poisonous.
Example 5: Apply concept learning to determine a hypothesis that describes when a person
“goes out” (Yes or No) based on the given attributes.
Time Weather Temperature Company Humidity Wind Goes
Morning Sunny Warm Yes Mild Strong Yes
Evening Rainy Cold No Mild Normal No
Morning Sunny Moderate Yes Normal Normal Yes
Evening Sunny Cold Yes High Strong Yes
Step 1: Identify Attributes and Values
The dataset consists of six attributes:
Attribute Possible Values
Time Morning, Evening
Weather Sunny, Rainy
Temperature Warm, Cold, Moderate
Company Yes, No
Humidity Mild, Normal, High
Wind Strong, Normal
Goes (Target) Yes, No
Step 2: Apply Find-S Algorithm
1. Initialize the hypothesis: h = {ϕ, ϕ, ϕ, ϕ, ϕ, ϕ }
2. E1=(Morning, Sunny, Warm, Yes, Mild, Strong) ------- positive[Consider]
h = (Morning, Sunny, Warm, Yes, Mild, Strong)
3. E2=(Evening, Rainy, Cold, No, Mild, Normal) ------Negative[Ignore]
4. E3=(Morning, Sunny, Moderate, Yes, Normal, Normal) ------- positive[Consider]
h = (Morning, Sunny, ?, Yes, ?, ?)
5. E4=(Evening, Sunny, Cold, Yes, High, Strong) ------- positive[Consider]
h = (?, Sunny, ?, Yes, ?, ?)
Step 3: Final Hypothesis
After processing all positive examples, we get:
h = (?, Sunny, ?, Yes, ?, ?)
Step 4: Conclusion: Conditions necessary for "Goes = Yes":
o Weather must be "Sunny"
o Company must be "Yes"
o Time, Temperature, Humidity, and Wind do not matter[Any value]
Thus, the model has learned that if the weather is sunny and there is company, the person will
go out, regardless of the time, temperature, humidity, or wind conditions.
Bayes Optimal Classifier
The Bayes Optimal Classifier is a theoretical model that provides the most accurate
classification of a new instance based on the training data.
It operates under the principles of Bayes' theorem, calculating the conditional probabilities of
different outcomes and selecting the one with the highest probability.
This classifier is often referred to as the Bayes optimal learner, and it serves as a benchmark
for evaluating the performance of other classifiers in machine learning.
Key Concepts
1. Bayes' Theorem: At the core of the Bayes Optimal Classifier is Bayes' theorem, which describes
how to update the probability of a hypothesis based on new evidence. The theorem is
expressed mathematically as:
P(H∣E) = P(E∣H) · P(H) / P(E)
Where:
o P(H∣E) is the posterior probability of the hypothesis H given evidence E.
o P(E∣H) is the likelihood of observing evidence E given hypothesis H.
o P(H) is the prior probability of hypothesis H.
o P(E) is the marginal likelihood of evidence E.
2. Maximum A Posteriori (MAP): This is a probabilistic framework that seeks to find the most
probable hypothesis given the training data. It is closely related to the Bayes Optimal Classifier
but focuses on selecting a single hypothesis rather than making a prediction based on all
possible hypotheses.
3. Hypothesis Space: The set of all possible hypotheses that can be used to classify the data. The
Bayes Optimal Classifier evaluates each hypothesis and combines their predictions based on
their posterior probabilities.
The Mechanics of the Bayes Optimal Classifier
The Bayes Optimal Classifier answers the question: "What is the most probable classification
of a new instance given the training data?"
This is achieved by combining the predictions of all hypotheses weighted by their posterior
probabilities.
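The weighting of hypotheses can be illustrated with a small sketch. It assumes each hypothesis predicts a single label for the new instance, so its vote is simply its posterior probability; the three hypotheses and their posteriors (0.4, 0.3, 0.3) are illustrative numbers, not values from a real dataset.

```python
def bayes_optimal_label(posteriors, predictions):
    """Combine hypothesis predictions weighted by their posteriors.

    posteriors[h]  = P(h | training data)
    predictions[h] = label that hypothesis h assigns to the new instance
    """
    votes = {}
    for h, p in posteriors.items():
        label = predictions[h]
        votes[label] = votes.get(label, 0.0) + p
    best = max(votes, key=votes.get)
    return best, votes

# Illustrative values only
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}

label, votes = bayes_optimal_label(posteriors, predictions)
map_hypothesis = max(posteriors, key=posteriors.get)
print(label)                        # -
print(predictions[map_hypothesis])  # +
```

With these numbers the single most probable hypothesis (h1) predicts "+", yet the combined vote favours "-" with total weight 0.6. This is the difference between using only the MAP hypothesis and combining all hypotheses.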
Mathematical Representation
The Bayes Optimal Classifier is a probabilistic model that assigns a new instance x to the class
y that has the highest probability, given the evidence x. Mathematically, it chooses:
y∗ = argmax_{y∈Y} P(y∣x)
where:
P(y∣x) is the posterior probability of class y given the instance x.
Y is the set of possible class labels.
Using Bayes’ theorem, we can rewrite the posterior probability as:
P(y∣x) = P(x∣y) · P(y) / P(x)
where:
P(x∣y) is the likelihood (probability of observing x given class y).
P(y) is the prior probability of class y occurring.
P(x) is the evidence (probability of observing x across all classes).
Since P(x) is the same for all classes, it does not affect which class has the highest probability.
So, the Bayes Optimal Classifier simplifies to:
y∗ = argmax_{y∈Y} P(x∣y) · P(y)
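The following sketch shows why the evidence P(x) can be dropped: dividing every score by the same constant does not change which class has the largest value. The priors and likelihoods are hypothetical numbers for a single fixed instance x, and the function name is an illustrative choice.

```python
def bayes_classify(priors, likelihoods):
    """Pick the class maximizing P(x|y) * P(y) for one instance x."""
    scores = {y: likelihoods[y] * priors[y] for y in priors}
    evidence = sum(scores.values())  # P(x), the same for every class
    posteriors = {y: s / evidence for y, s in scores.items()}
    best = max(scores, key=scores.get)
    return best, scores, posteriors

# Hypothetical values, chosen only for illustration
priors = {"spam": 0.3, "ham": 0.7}         # P(y)
likelihoods = {"spam": 0.05, "ham": 0.01}  # P(x|y)

best, scores, posteriors = bayes_classify(priors, likelihoods)
print(best)  # spam
```

The unnormalized scores and the normalized posteriors rank the classes identically, so the division by P(x) is only needed when the actual probability values are required.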
Why is it "Optimal"?
Minimizes Error: The Bayes classifier minimizes the expected classification error,
meaning it is the best possible classifier assuming the true probabilities are known.
No Bias or Overfitting: Because it uses the true probabilities rather than estimates, it
does not suffer from overfitting or bias, unlike classifiers trained on limited data.
Theoretical Upper Bound: It defines the best performance a classifier can achieve,
serving as a benchmark for other models.
Limitations
1. Requires True Probabilities: The Bayes Optimal Classifier assumes we know the exact
probabilities P(y) and P(x∣y), which is rarely possible in real-world
applications.
2. Computationally Infeasible: Even if we estimate the probabilities, computing them
accurately for all possible x can be expensive or impossible.
3. Curse of Dimensionality: When features are high-dimensional, estimating probabilities
becomes impractical.
Advantages
Theoretical Foundation: The Bayes Optimal Classifier is grounded in solid statistical
principles, making it a reliable benchmark for classification tasks.
Optimal Performance: It provides the best possible classification accuracy under the
given conditions, outperforming other classifiers on average.
Difference between Bayes Optimal Classifier & Naïve Bayes Classifier
Feature | Bayes Optimal Classifier | Naïve Bayes Classifier
Assumption | No assumptions; considers all hypotheses. | Assumes feature independence given the class.
Computation | Intractable for large datasets. | Efficient and fast.
Accuracy | Theoretically optimal (Bayes error rate). | Good but can be inaccurate due to independence assumption.
Usage | Not used in real-world applications. | Commonly used in text classification, spam filtering, etc.
Naïve Bayes Classifier
Naïve Bayes classifiers are supervised machine learning algorithms used for classification
tasks, based on Bayes’ Theorem to find probabilities.
The main idea behind the Naïve Bayes classifier is to use Bayes’ Theorem to classify data
based on the probabilities of different classes given the features of the data. It is used mostly
in high-dimensional text classification.
It is a probabilistic classifier, and it is called naïve because it assumes that each feature in
the model is independent of the others given the class. In other words, each feature contributes
to the prediction independently of the other features.
The Naïve Bayes algorithm is used in spam filtering, sentiment analysis, article classification
and many more.
The dataset is divided into two parts, namely, feature matrix and the response vector.
Feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of
the values of the features (the independent variables).
Response vector contains the value of class variable (prediction or output) for each row of
feature matrix.
Why it is Called Naïve Bayes?
Naïve: It is called “Naïve” because it assumes that the presence of one feature does not affect
other features.
Bayes: The “Bayes” part of the name refers to its basis in Bayes’ Theorem.
Assumption of Naive Bayes
The Naïve Bayes classifier makes the following assumptions:
Feature independence: When we are trying to classify something, we assume that each
feature (or piece of information) in the data does not affect any other feature, given
the class.
Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is
assumed to have a multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
No missing data: The data should not contain any missing values.
Understanding Bayes’ Theorem for Naïve bayes
Using Bayes’ theorem, we can write the posterior probability as:
P(y∣x) = P(x∣y) · P(y) / P(x) ………..eq(i)
Let’s suppose we have 2 classes “yes” and “no” and ‘n’ evidences/features. That means in eq(i)
1. ‘y’ has 2 values i.e. “yes” and “no”
2. ‘x’ has ‘n’ values i.e. x1, x2, x3…………, xn
So using above values in eq(i)
P(yes | x1, x2, x3, …, xn) = [P(x1|yes) · P(x2|yes) · P(x3|yes) ⋯ P(xn|yes) · P(yes)] / [P(x1) · P(x2) · P(x3) ⋯ P(xn)] ……eq(ii)
P(no | x1, x2, x3, …, xn) = [P(x1|no) · P(x2|no) · P(x3|no) ⋯ P(xn|no) · P(no)] / [P(x1) · P(x2) · P(x3) ⋯ P(xn)] ……eq(iii)
Now, as the denominator is the same for both classes for a given input, we can drop it and
compare only the numerators:
Removing denominator from eq(ii):
P(yes | x1, x2, x3, …, xn) ∝ P(x1|yes) · P(x2|yes) · P(x3|yes) ⋯ P(xn|yes) · P(yes) ……eq(iv)
Removing denominator from eq(iii):
P(no | x1, x2, x3, …, xn) ∝ P(x1|no) · P(x2|no) · P(x3|no) ⋯ P(xn|no) · P(no) ……eq(v)
So, we calculate the values in eq(iv) and eq(v); the new data point is assigned to the class
with the higher value. It means:
1. If P(yes| x1, x2, x3…………, xn )> P(no| x1, x2, x3…………, xn )
New Data will be classified to class “yes”
2. If P(yes| x1, x2, x3…………, xn )< P(no| x1, x2, x3…………, xn )
New Data will be classified to class “no”
Numerical Examples on Naïve Bayes Classifier
Example 1: X= {Red, SUV, Domestic}. Classify X using Naïve Bayes classifiers.
No Color Type Origin Stolen
1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
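One way to work through Example 1 is to count directly from the table. The sketch below assumes that Stolen is the class label and estimates every probability as a simple relative frequency, with no smoothing; the function name `naive_bayes_scores` is introduced here for illustration.

```python
from fractions import Fraction

# (Color, Type, Origin, Stolen) rows copied from the table above
rows = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]

def naive_bayes_scores(rows, x):
    """Unnormalized score P(label) * product of P(x_i | label) per label."""
    scores = {}
    for label in sorted({row[-1] for row in rows}):
        subset = [row for row in rows if row[-1] == label]
        score = Fraction(len(subset), len(rows))  # prior P(label)
        for i, value in enumerate(x):
            matches = sum(1 for row in subset if row[i] == value)
            score *= Fraction(matches, len(subset))  # P(x_i | label)
        scores[label] = score
    return scores

scores = naive_bayes_scores(rows, ("Red", "SUV", "Domestic"))
prediction = max(scores, key=scores.get)
print(scores["Yes"], scores["No"], prediction)  # 3/125 9/125 No
```

Under these assumptions the score for Stolen = No (9/125 = 0.072) is larger than for Stolen = Yes (3/125 = 0.024), so X is classified as not stolen.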
Example 2: X= {Covid=Yes, Flu=Yes}. Classify X using Naïve Bayes classifiers.
Person Covid Flu Fever
1 Yes No Yes
2 No Yes Yes
3 Yes Yes Yes
4 No No No
5 Yes No Yes
6 No No Yes
7 Yes No Yes
8 Yes No No
9 No Yes Yes
10 No Yes No
Example 3: X= {Yellow, Sweet, Long}. Classify X using Naïve Bayes classifiers
Fruit Yellow Sweet Long Total
Mango 350 450 0 650
Banana 400 300 350 400
Others 50 100 50 150
Total 800 850 400 1200
Solution:
X={Yellow,Sweet,Long}
We’ll classify X as either Mango, Banana, or Others.
Step 1: Given Data
Fruit Yellow Sweet Long Total
Mango 350 450 0 650
Banana 400 300 350 400
Others 50 100 50 150
Total 800 850 400 1200
Step 2: Prior Probabilities
P(Mango)=650/1200=0.5417
P(Banana)=400/1200=0.3333
P(Others)=150/1200=0.125
Step 3: Conditional Probabilities
For Mango:
P(Yellow∣Mango)=350/650=0.5385
P(Sweet∣Mango)=450/650=0.6923
P(Long∣Mango)=0/650=0
For Banana:
P(Yellow∣Banana)=400/400=1
P(Sweet∣Banana)=300/400=0.75
P(Long∣Banana)=350/400=0.875
For Others:
P(Yellow∣Others)=50/150=0.3333
P(Sweet∣Others)=100/150=0.6667
P(Long∣Others)=50/150=0.3333
Step 4: Calculate Posterior Probabilities (unnormalized)
P(Fruit∣Yellow,Sweet,Long) ∝ P(Fruit)⋅P(Yellow∣Fruit)⋅P(Sweet∣Fruit)⋅P(Long∣Fruit)
Mango:
P(Mango∣X) ∝ 0.5417⋅0.5385⋅0.6923⋅0=0
Note: Mango is ruled out because the probability of it being Long is 0 (no mangoes are long).
Banana:
P(Banana∣X) ∝ 0.3333⋅1⋅0.75⋅0.875 ≈ 0.21875
Others:
P(Others∣X) ∝ 0.125⋅0.3333⋅0.6667⋅0.3333 ≈ 0.00926
Step 5: Normalize
Total=0+0.21875+0.00926=0.22801
P(Banana∣X)=0.21875/0.22801≈0.9594
P(Others∣X)=0.00926/0.22801≈0.0406
P(Mango∣X)=0
Final Classification:
Given X={Yellow,Sweet,Long}, the fruit is most likely a Banana, with a ~95.9% probability.
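The arithmetic in Steps 2 to 5 can be reproduced from the count table. This is a small sketch in which the dictionary layout and the function name `fruit_posteriors` are illustrative choices.

```python
# Counts copied from the table in Step 1
counts = {
    "Mango":  {"Yellow": 350, "Sweet": 450, "Long": 0,   "Total": 650},
    "Banana": {"Yellow": 400, "Sweet": 300, "Long": 350, "Total": 400},
    "Others": {"Yellow": 50,  "Sweet": 100, "Long": 50,  "Total": 150},
}

def fruit_posteriors(counts, features):
    """Normalized Naive Bayes posteriors for the given list of features."""
    grand_total = sum(c["Total"] for c in counts.values())
    unnormalized = {}
    for fruit, c in counts.items():
        score = c["Total"] / grand_total   # prior P(fruit)
        for f in features:
            score *= c[f] / c["Total"]     # P(feature | fruit)
        unnormalized[fruit] = score
    total = sum(unnormalized.values())
    return {fruit: s / total for fruit, s in unnormalized.items()}

result = fruit_posteriors(counts, ["Yellow", "Sweet", "Long"])
best = max(result, key=result.get)
print(best, round(result[best], 4))  # Banana 0.9594
```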
Bayesian Belief Network
A Bayesian Belief Network is a probabilistic graphical model that is used for representing
uncertain knowledge and making decisions on the basis of that knowledge.
It depicts the probabilistic relationships between a set of variables as a graph.
Bayesian Belief network is also known as belief network, Bayes Network, Bayesian Model, and
Decision Network.
Bayesian belief networks are probabilistic; the reason behind this is that these networks are
designed from a probability distribution and also use theory of probability for anomaly
detection and prediction.
Parts of Bayesian Network
A Bayesian belief network can be used for creating models from data and experts' opinions, and it
comprises two parts:
1. Directed Acyclic Graph [DAG]
2. Conditional Probability Table [CPT]
1. Directed Acyclic Graph[DAG]
This is a visual representation of the network's variables and their relationships with
one another. In a Directed Acyclic Graph, the nodes indicate variables, and the edges indicate
the dependencies between them. The arrows in the graph depict the direction of causality.
2. Conditional Probability Table [CPT]
For every node in the directed acyclic graph, there is a corresponding table of conditional
probabilities that states the probability of every possible value of the node given the values of
its parents in the directed acyclic graph. These tables show the probabilistic connection
between the variables in the network.
A Bayesian Belief Network Graph
A Bayesian belief network graph is made up of arcs (directed links) and nodes, where:
Each node corresponds to a random variable, which can be discrete or
continuous
Arcs (directed arrows) indicate the causal relationships or conditional dependencies between
random variables. These arrows or directed links connect pairs of nodes in the graph
A link shows that one node directly affects the other node; if there is no directed
link between two nodes, neither node directly influences the other
Components of Bayesian Network
The Bayesian Belief Network has mainly two components: the causal component and the numerical
component (actual numbers).
1. Causal Component
The causal component of a Bayesian belief network indicates the causal relationships
between variables in the system
It consists of a directed acyclic graph that represents the direction of causal relationships
among the variables
The causal component of a Bayesian belief network is important for understanding how
the system's variables are connected
It gives us a graphical representation of the causal relationships between the variables.
This representation can be used to make predictions and also to understand how changes
in one variable will influence the other
2. Actual Numbers
The numerical component of a Bayesian belief network consists of conditional probability
tables for each node in the directed acyclic graph
These tables specify the probability of each variable given the values of its parent
variables
The numerical component of a Bayesian belief network gives us the actual numbers that
are used to make predictions and calculate probabilities
Every node in the network contains a conditional probability table that specifies the
probability of that node given the values of its parent nodes
These probabilities are used for calculating the overall probabilities of the system given
several inputs or observations
Joint probability distribution
In the Bayesian network model, the joint probability distribution describes the probability of
all possible configurations of the variables of the network. It is the product of the conditional
probabilities of each node given its parents in the network. This means that the joint probability
distribution gives us a complete description of the probability distribution of all the variables in
the network.
Representation:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the various combinations of
x1, x2, x3, ..., xn are known as the joint probability distribution.
By the chain rule, P[x1, x2, x3, ..., xn] can be written in the following way:
= P[x1 | x2, x3, ..., xn] · P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] · P[x2 | x3, ..., xn] · P[x3 | x4, ..., xn] ... P[xn-1 | xn] · P[xn]
In a Bayesian network, each variable Xi depends only on its parents, so in general:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
Example of Bayesian Belief Network
Henry installed a new burglar alarm at his home to detect burglary. The alarm consistently
reacts to a break-in, but it also reacts to minor earthquakes. Henry has two neighbors, John and
Olivia, who have taken responsibility for calling Henry at work when they hear an alarm.
John always calls Henry when he hears the alarm, but sometimes he gets distracted by the
phone ringing and calls even when the alarm has not sounded. On the other hand, Olivia likes to
listen to loud music, so sometimes she misses hearing the alarm. Here, we would like to compute
the probability of a burglary alarm.
Problem
Determine the probability that the alarm has sounded but that neither a burglary nor an
earthquake has occurred and that John and Olivia have phoned Henry.
Solution
The Bayesian belief network for the above problem is provided below. The network
structure shows that burglary and earthquake are the parent nodes of the alarm
and directly affect the probability of the alarm going off, while John's and Olivia's calls
depend only on the alarm
The network encodes the assumptions that John and Olivia do not directly observe the
break-in, do not notice the minor earthquake, and do not consult each other
before calling
The conditional distributions for each node are provided as a conditional probability
table (CPT)
Each row in a conditional probability table must sum to 1 because the entries in a row
cover an exhaustive set of cases for the variable
In a conditional probability table, a boolean variable with k boolean parents contains 2^k
probabilities (one per combination of parent values). Hence, if there are two parents, the
table will contain 4 probability values
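The count of CPT rows can be illustrated by enumerating the parent configurations. This is a minimal sketch; the helper name `parent_configurations` is a hypothetical choice made for this example.

```python
from itertools import product

def parent_configurations(parents):
    """Return every True/False assignment to the given boolean parent variables."""
    return [dict(zip(parents, values))
            for values in product([True, False], repeat=len(parents))]

# The Alarm node has two boolean parents, so its CPT has 2**2 = 4 rows.
configs = parent_configurations(["Burglary", "Earthquake"])
for config in configs:
    print(config)
print("Rows in the CPT:", len(configs))
```

Each added boolean parent doubles the number of rows, which is why nodes with many parents quickly become expensive to specify.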
List of all Occurring events in this network:
Burglary(B)
John Calls(J)
Olivia Calls (O)
Alarm (A)
Earthquake(E)
We can write the events of the problem statement in the form of a probability, P[J, O, A, B, E],
and rewrite it using the joint probability distribution:
P[J, O, A, B, E] = P[J | O, A, B, E] · P[O, A, B, E]
= P[J | O, A, B, E] · P[O | A, B, E] · P[A, B, E]
= P[J | A] · P[O | A, B, E] · P[A, B, E]
= P[J | A] · P[O | A] · P[A | B, E] · P[B, E]
= P[J | A] · P[O | A] · P[A | B, E] · P[B | E] · P[E]
= P[J | A] · P[O | A] · P[A | B, E] · P[B] · P[E] (since burglary and earthquake are independent)
Let's take the prior probabilities for the burglary and earthquake components:
The probability of burglary is P(B= True) = 0.002
The probability of no burglary is P(B= False)= 0.998
The probability of a minor earthquake is P(E= True)= 0.001
The probability that an earthquake has not occurred is P(E= False)= 0.999
Conditional probability table for Alarm (A):
The conditional probability of Alarm A depends on Burglary and Earthquake:
B       E       P(A=True)   P(A=False)
True    True    0.94        0.06
True    False   0.95        0.05
False   True    0.31        0.69
False   False   0.001       0.999
Conditional probability table for John Calls:
The conditional probability that John calls depends on the state of the Alarm.
A       P(J=True)   P(J=False)
True    0.91        0.09
False   0.05        0.95
Conditional probability table for Olivia Calls:
The conditional probability that Olivia calls depends on its parent node "Alarm."
A       P(O=True)   P(O=False)
True    0.75        0.25
False   0.02        0.98
Using the formula for the joint probability distribution, we can write the problem statement
in the form of a probability:
P(O, J, A, ~B, ~E)
= P(O | A) · P(J | A) · P(A | ~B ∧ ~E) · P(~B) · P(~E)
= 0.75 × 0.91 × 0.001 × 0.998 × 0.999
= 0.00068045
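The same calculation can be sketched in Python. The variable and function names below (`P_A_TRUE`, `bernoulli`, `joint`) are illustrative choices and not part of any library; the numbers come from the tables above, and only the P(... = True) columns are stored because the False entries follow as 1 minus the True entries.

```python
# Prior probabilities and CPT entries copied from the tables above.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
# P(A=True | B, E), keyed by (B, E)
P_A_TRUE = {(True, True): 0.94, (True, False): 0.95,
            (False, True): 0.31, (False, False): 0.001}
P_J_TRUE = {True: 0.91, False: 0.05}  # P(J=True | A)
P_O_TRUE = {True: 0.75, False: 0.02}  # P(O=True | A)

def bernoulli(p_true, value):
    """Probability of a boolean value, given the probability that it is True."""
    return p_true if value else 1.0 - p_true

def joint(j, o, a, b, e):
    """P(J=j, O=o, A=a, B=b, E=e) = P(j|a) * P(o|a) * P(a|b,e) * P(b) * P(e)."""
    return (bernoulli(P_J_TRUE[a], j)
            * bernoulli(P_O_TRUE[a], o)
            * bernoulli(P_A_TRUE[(b, e)], a)
            * P_B[b]
            * P_E[e])

# Alarm sounded, no burglary, no earthquake, and both neighbors called.
p = joint(j=True, o=True, a=True, b=False, e=False)
print(f"P(J, O, A, ~B, ~E) = {p:.8f}")
```

Because every factor is read from a CPT, the same `joint` function can evaluate any of the 32 possible assignments, and those 32 values sum to 1.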
Applications of Bayesian Belief Networks
1. Medical Diagnosis – Predict diseases based on symptoms.
2. Spam Filtering – Classify emails as spam or not.
3. Fraud Detection – Identify suspicious transactions.
4. Autonomous Systems – AI decision-making in robots and self-driving cars.
5. Risk Analysis – Assess probabilities of failures in engineering.
EM Algorithm
The Expectation-Maximization (EM) algorithm is an iterative method used in Unsupervised
Machine Learning to estimate unknown parameters in statistical models.
What problem does the EM algorithm solve? [Imp]
It helps find the best values for unknown parameters, especially when some data is missing
or hidden.
It works in two steps:
1. E-step (Expectation Step): Estimates missing or hidden values using current parameter
estimates.
2. M-step (Maximization Step): Updates model parameters to maximize the likelihood based
on the estimated values from the E-step.
This process repeats until the model reaches a stable solution; the likelihood of the
observed data never decreases from one iteration to the next. EM is widely used in clustering
(e.g., Gaussian Mixture Models), where the latent variables are the cluster assignments, and
in handling missing data. It has applications in various fields, including machine learning,
computer vision, and natural language processing.
Key Terms in Expectation-Maximization (EM) Algorithm
Latent Variables: These are hidden or unmeasured variables that affect what we can
observe in the data. We can’t directly see them, but we can make educated guesses about
them based on the data we can see.
Likelihood: This refers to the probability of seeing the data we have, based on certain
assumptions or parameters. The EM algorithm tries to find the best parameters that make
the data most likely.
Log-Likelihood: This is just the natural log of the likelihood function. It’s used to make
calculations easier and measure how well the model fits the data. The EM algorithm tries
to maximize the log-likelihood to improve the model fit.
Maximum Likelihood Estimation (MLE): This is a technique for estimating the parameters
of a model. It does this by finding the parameter values that make the observed data most
likely (maximizing the likelihood).
Posterior Probability: In Bayesian methods, this is the probability of an unknown quantity
given both prior knowledge and the observed data. In EM, the E-step computes the posterior
probability of the latent variables given the observed data and the current parameters.
Expectation (E) Step: In this step, the algorithm estimates the missing or hidden
information (latent variables) based on the observed data and current parameters. It
calculates probabilities for the hidden values given what we can see.
Maximization (M) Step: This step updates the parameters by finding the values that
maximize the likelihood, based on the estimates from the E-step. It often involves running
optimization methods to get the best parameters.
Convergence: Convergence happens when the algorithm has reached a stable point. This
is checked by seeing if the changes in the model’s parameters or the log-likelihood are
small enough to stop the process.
How Expectation-Maximization (EM) Algorithm Works:
Fig: EM Algorithm Flowchart
1. Initialization:
o Start with initial parameter values and assume the observed data comes from a
specific model.
2. E-Step (Expectation Step):
o Estimate the missing or hidden data based on the current parameters.
o Calculate the posterior probability (responsibility) of each latent variable given
the observed data.
o Compute the expected log-likelihood of the data using the current parameter
estimates.
3. M-Step (Maximization Step):
o Update the model parameters by maximizing the expected log-likelihood computed in
the E-step.
o This involves solving an optimization problem to find parameter values that
improve the model fit.
4. Convergence:
o Check if the model parameters are stable (converging).
o If the changes in log-likelihood or parameters are below a set threshold, stop. If
not, repeat the E-step and M-step until convergence is reached.
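The steps above can be sketched for a one-dimensional Gaussian mixture, where the hidden variable is the component each point belongs to. This is a minimal illustration rather than a production implementation: the function name `em_gmm_1d`, the toy data, the starting values, and the small variance floor are all assumptions made for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, tol=1e-6, max_iter=200):
    """Fit a 1-D Gaussian mixture with EM; returns parameters and log-likelihoods."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, prev_ll, history = len(means), -math.inf, []
    for _ in range(max_iter):
        # E-step: responsibility of each component for each data point.
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                     for c in range(k)]
            total = sum(joint)
            ll += math.log(total)  # log-likelihood of the observed data
            resp.append([j / total for j in joint])
        history.append(ll)
        # Convergence: stop once the log-likelihood barely changes.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # M-step: closed-form updates weighted by the responsibilities.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var, 1e-6)  # assumed floor to avoid zero variance
            weights[c] = n_c / len(data)
    return means, variances, weights, history

# Toy data: two groups of points, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, history = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print("means:", [round(m, 3) for m in means])
print("weights:", [round(w, 3) for w in weights])
print("iterations:", len(history))
```

The `history` list records the log-likelihood at every iteration, which makes it easy to confirm that the value does not decrease as the algorithm runs.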
Advantages of EM algorithm
Improves with every iteration – Each iteration is guaranteed not to decrease the
likelihood of the observed data.
Simple to implement – The two steps (E-step and M-step) are often easy to code for many
problems.
Quick math solutions – In many cases, the M-step has a direct mathematical solution
(closed-form), making it efficient
Disadvantages of EM algorithm
Takes time to finish – It can converge slowly, meaning it may take many iterations to reach
a stable solution.
Gets stuck in local best – Instead of finding the global optimum, it may settle at a local
optimum, so the result depends on the initial parameter values.
Needs extra probabilities – Unlike some optimization methods that only need forward
probability, EM requires both forward and backward probabilities, making it slightly more
complex.
SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks.
While it can handle regression problems, SVM is particularly well-suited for classification
tasks.
The goal of SVM is to find the best line or decision boundary. This decision boundary is
called a hyperplane.
It segregates n-dimensional space into classes so that we can easily put a new data
point in the correct category in the future.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors.
SVM can be used for Face Detection, Image Classification, Text Classification etc.
Support Vector Machine (SVM) Terminology
Hyperplane: A decision boundary separating different classes in feature space,
represented by the equation w · x + b = 0 in linear classification.
Support Vectors[Most Imp]: The closest data points to the hyperplane, crucial for
determining the hyperplane and margin in SVM.
Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
Kernel: A function that maps data to a higher-dimensional space, enabling SVM to handle
non-linearly separable data.
Hard Margin: A maximum-margin hyperplane that perfectly separates the data without
misclassifications.
Soft Margin: Allows some misclassifications by introducing slack variables (a slack variable
is a non-negative variable that measures how far a data point violates the margin
constraint), balancing margin maximization and misclassification penalties when data is
not perfectly separable.
C: A regularization term balancing margin maximization and misclassification penalties. A
higher C value enforces a stricter penalty for misclassifications.
Hinge Loss: A loss function penalizing misclassified points or margin violations, combined
with regularization in SVM.
Dual Problem: Involves solving for Lagrange multipliers associated with support vectors,
facilitating the kernel trick and efficient computation.
How does Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best separates two
classes by maximizing the margin between them. This margin is the distance from the
hyperplane to the nearest data points (support vectors) on each side.
Fig : Multiple hyperplanes separate the data from two classes
The best hyperplane, also known as the “hard margin” hyperplane, is the one that maximizes
the distance between the hyperplane and the nearest data points from both classes. This
ensures a clear separation between the classes. So, from the above figure, we choose L2 as
the hard-margin hyperplane.
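The margin described above is the distance from the hyperplane to the nearest points. A minimal sketch of that calculation follows; the hyperplane x1 + x2 - 3 = 0 and the four points are hypothetical values chosen for illustration, not the output of a trained SVM.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w·x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical hyperplane and toy points (assumed values).
w, b = [1.0, 1.0], -3.0
points = {"p1": [1.0, 1.0], "p2": [0.0, 0.0], "p3": [2.0, 2.0], "p4": [4.0, 3.0]}

distances = {name: signed_distance(w, b, x) for name, x in points.items()}
margin = min(abs(d) for d in distances.values())
# Points that lie exactly at the margin play the role of support vectors.
support = [name for name, d in distances.items() if math.isclose(abs(d), margin)]

for name, d in distances.items():
    print(f"{name}: distance = {d:+.4f}")
print("margin:", round(margin, 4), "closest points:", support)
```

The sign of the distance tells which side of the hyperplane a point falls on, and only the closest points determine the margin.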
Let’s consider a scenario like shown below:
Fig : Selecting hyperplane for data with outlier
Here, we have one blue ball inside the region of the red balls.
How does SVM classify the data?
It’s simple! The blue ball among the red ones is an outlier of the blue balls. With a soft
margin, the SVM algorithm can ignore this outlier and find the hyperplane that maximizes the
margin for the remaining points, which makes it robust to such outliers.
Fig : Hyperplane which is the most optimized one
A soft margin allows for some misclassifications or violations of the margin to improve
generalization. The SVM optimizes the following equation to balance margin maximization
and penalty minimization:
Objective Function = (1 / margin) + λ ∑ penalty
The penalty used for violations is often hinge loss, which has the following behavior:
If a data point is correctly classified and lies on or outside the margin, there is no
penalty (loss = 0).
If a point is incorrectly classified or violates the margin, the hinge loss increases
proportionally to the distance of the violation.
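The hinge loss behavior can be sketched as follows. The code uses the common textbook form of the soft-margin objective, 0.5·‖w‖² + C·∑ hinge loss, where minimizing ‖w‖ corresponds to maximizing the margin and C plays the role of λ above; the weights and data points are assumed values for illustration.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w·x + b)) for a label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, data, C=1.0):
    """0.5 * ||w||^2 + C * (sum of hinge losses), a common soft-margin form."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    return regularizer + C * sum(hinge_loss(w, b, x, y) for x, y in data)

# Assumed hyperplane x1 = 0 with margin boundaries at x1 = -1 and x1 = +1.
w, b = [1.0, 0.0], 0.0
data = [([2.0, 0.0], +1),   # correct side, outside the margin: no penalty
        ([0.5, 0.0], +1),   # correct side, but inside the margin
        ([1.0, 0.0], -1)]   # wrong side of the hyperplane

losses = [hinge_loss(w, b, x, y) for x, y in data]
print("hinge losses:", losses)
print("objective:", soft_margin_objective(w, b, data, C=1.0))
```

A larger C makes each violation more expensive, which pushes the solution toward fewer margin violations at the cost of a narrower margin.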
Till now, we were talking about linearly separable data (the group of blue balls and the
group of red balls can be separated by a straight line).
Kernel
A kernel is a function that maps data points into a higher-dimensional space without explicitly
computing the coordinates in that space. In other words, a kernel is a function that is used
to transform the original feature space into a higher-dimensional space.
This allows SVM to work efficiently with non-linear data by implicitly performing the
mapping.
For example, consider data points that are not linearly separable. By applying a kernel
function, SVM transforms the data points into a higher-dimensional space where they
become linearly separable.
Fig : After applying the kernel function, 1-D data is mapped to 2-D space and 2-D data is
mapped to 3-D space
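The idea of mapping without explicitly computing the coordinates can be checked numerically. The sketch below uses the degree-2 kernel K(x, y) = (x · y)², chosen here as an illustrative example, and compares it with an explicit map `phi` from 2-D to 3-D space; both names are assumptions made for this example.

```python
import math

def dot(u, v):
    """Inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(x):
    """Explicit 2-D to 3-D feature map matching the kernel (x·y)^2."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def poly2_kernel(x, y):
    """Same inner product, computed without building phi(x) or phi(y)."""
    return dot(x, y) ** 2

x, y = [1.0, 2.0], [3.0, 4.0]
explicit = dot(phi(x), phi(y))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, y)   # kernel evaluated in the original 2-D space
print("explicit:", explicit, "implicit:", implicit)
```

Both routes give the same value, but the kernel never constructs the higher-dimensional vectors, which is what makes very high-dimensional (even infinite-dimensional) feature spaces practical.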
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:
Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A hyperplane
that maximizes the margin between the classes is the decision boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel functions,
nonlinear SVMs can handle nonlinearly separable data. The original input data is
transformed by these kernel functions into a higher-dimensional feature space, where the
data points can be linearly separated. A linear SVM is used to locate a nonlinear decision
boundary in this modified space.
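A linear SVM can be sketched in a few lines by minimizing the regularized hinge loss with batch subgradient descent. This is a simplified stand-in for illustration, not how SVM libraries typically solve the problem (they usually solve a quadratic program or its dual); the function names, learning rate, regularization strength, and toy data are all assumptions.

```python
def decision(w, b, x):
    """Value of w·x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(data, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    dim, n = len(data[0][0]), len(data)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in data:
            if y * decision(w, b, x) < 1:  # point violates the margin
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Linearly separable toy data with labels in {-1, +1} (assumed values).
data = [([2.0, 2.0], +1), ([3.0, 3.0], +1), ([3.0, 2.0], +1),
        ([0.0, 0.0], -1), ([1.0, 0.0], -1), ([0.0, 1.0], -1)]
w, b = train_linear_svm(data)
predictions = [1 if decision(w, b, x) >= 0 else -1 for x, _ in data]
print("w:", [round(wi, 2) for wi in w], "b:", round(b, 2))
print("predictions:", predictions)
```

For a non-linear SVM the same idea applies after replacing inner products with kernel evaluations, which is normally done in the dual formulation.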
Types of SVM Kernel Functions
1. Linear Kernel
A linear kernel is the simplest form of kernel used in SVM. It is suitable when the data is
linearly separable meaning that a straight line (or hyperplane in higher dimensions) can
effectively separate the classes.
It is represented as: K(x, y) = x · y
It is used for text classification problems such as spam detection
2. Polynomial Kernel
The polynomial kernel allows SVM to model more complex relationships by introducing
polynomial terms. It is useful when the data is not linearly separable but still follows a
pattern.
The formula of the polynomial kernel is:
K(x, y) = (x · y + c)^d
where c is a constant and d is the polynomial degree.
It is used in complex problems like image recognition where relationships between
features can be non-linear.
3. Radial Basis Function (RBF) Kernel
The RBF kernel is the most widely used kernel in SVM. It maps the data into an
infinite-dimensional space, making it highly effective for complex classification problems.
The formula of the RBF kernel is:
K(x, y) = exp(−γ‖x − y‖²)
where x and y are the input points and γ is a parameter that controls the influence of each
training example.
We use the RBF kernel when the decision boundary is highly non-linear and no prior
knowledge about the data’s structure is available.
4. Gaussian Kernel
The Gaussian kernel is a special case of the RBF kernel and is widely used for non-linear
data classification. It provides smooth and continuous transformations of data into higher
dimensions. It can be represented by:
K(x, y) = exp(−‖x − y‖² / (2σ²)), i.e. the RBF kernel with γ = 1 / (2σ²)
It is used when data has a smooth, continuous distribution and requires a flexible
boundary.
5. Sigmoid Kernel
The sigmoid kernel is inspired by neural networks and behaves similarly to the activation
function of a neuron. It is based on the hyperbolic tangent function and is suitable for
neural networks and other non-linear classifiers. It is represented as:
K(x, y) = tanh(γ · xᵀy + r)
It is often used in neural networks and non-linear classifiers.
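The kernels listed above can be written as small Python functions. This is a minimal sketch: the default parameter values and the name `coef0` for the polynomial constant are illustrative assumptions, and the Gaussian kernel is covered by `rbf_kernel` with gamma = 1 / (2σ²).

```python
import math

def dot(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=0.1, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)

x, y = [1.0, 2.0], [2.0, 0.0]
print("linear:    ", linear_kernel(x, y))
print("polynomial:", polynomial_kernel(x, y))
print("rbf:       ", round(rbf_kernel(x, y), 4))
print("sigmoid:   ", round(sigmoid_kernel(x, y), 4))
```

Note that the RBF kernel depends only on the distance between the two points, so it equals 1 when they coincide and decays toward 0 as they move apart.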
Some more kernels
1. Polynomial kernel
It is popular in image processing.
Equation is:
k(xi, xj) = (xi · xj + 1)^d
where d is the degree of the polynomial.
2. Gaussian kernel
It is a general-purpose kernel; used when there is no prior knowledge about the data. Equation is:
k(x, y) = exp(−‖x − y‖² / (2σ²))
3. Gaussian radial basis function (RBF)
It is a general-purpose kernel; used when there is no prior knowledge about the data.
Equation is:
k(xi, xj) = exp(−γ‖xi − xj‖²)
for:
γ > 0
Sometimes parametrized using:
γ = 1 / (2σ²)
4. Laplace RBF kernel
It is a general-purpose kernel; used when there is no prior knowledge about the data.
Equation is:
k(x, y) = exp(−‖x − y‖ / σ)
5. Hyperbolic tangent kernel
We can use it in neural networks.
Equation is:
k(xi, xj) = tanh(κ xi · xj + c)
for some (not every) κ > 0 and c < 0.
6. Sigmoid kernel
We can use it as the proxy for neural networks. Equation is
k(x, y) = tanh(α xᵀy + c)
7. Bessel function of the first kind Kernel
We can use it to remove the cross term in mathematical functions. Equation is :
k(x, y) = J_(v+1)(σ‖x − y‖) / ‖x − y‖^(−n(v+1))
where J is the Bessel function of the first kind.
8. ANOVA radial basis kernel
We can use it in regression problems. Equation is:
k(x, y) = ∑ (k = 1 to n) exp(−σ(x^k − y^k)²)^d
9. Linear splines kernel in one-dimension
It is useful when dealing with large sparse data vectors. It is often used in text categorization.
The splines kernel also performs well in regression problems. Equation is:
k(x, y) = 1 + xy + xy·min(x, y) − ((x + y) / 2)·min(x, y)² + (1/3)·min(x, y)³
Choosing the Right Kernel for SVM
Picking the right kernel for an SVM (Support Vector Machine) model is very important
because it affects how well the model works. Here’s a simple guide to help you choose the
right kernel:
1. What the Data Looks Like:
If the data can be separated by a straight line we use a linear kernel.
If the data is messy and needs a more complex boundary use a non-linear
kernel like RBF (Radial Basis Function) or polynomial kernels.
2. How Fast You Need the Model:
Linear kernels are faster and use less computer power.
Non-linear kernels like RBF take more time and resources.
3. How Easy It Is to Understand the Model:
Linear kernels are easier to understand because the boundary is simple.
Non-linear kernels create complex boundaries and make the model harder to
understand.
4. Tuning the Model:
Each kernel has special settings called hyperparameters that you can adjust to get the
best performance.
You will need to try different combinations of these settings using cross-validation to
find the best one.
Properties of SVM
Support Vector Machines (SVM) have several important properties that contribute to their
effectiveness and popularity as a machine learning algorithm:
1. Maximal Margin: SVM aims to find a hyperplane that maximizes the margin, which is
the distance between the hyperplane and the closest data points from each class.
2. Non-linearity with Kernels: SVM can handle non-linearly separable data by utilizing
kernel functions. Kernels transform the data into a higher-dimensional feature space
where a linear decision boundary can be found.
3. Support Vectors: SVM uses a subset of the training data called support vectors. These
are the data points that are closest to the decision boundary.
4. Sparsity: SVM often has a sparse solution, meaning that the decision boundary is
determined by only a small number of support vectors rather than the entire training
dataset.
5. Versatility: SVM can be applied to both classification and regression tasks.
6. Control of Complexity: SVM provides control over the model's complexity through the
choice of hyperparameters, such as the regularization parameter C and the kernel parameters.
Issues in SVM
1. Scalability with Large Datasets: SVMs can become computationally expensive and
memory-intensive, particularly with large-scale datasets.
2. Lack of Probabilistic Interpretation: SVMs inherently provide a decision boundary
that separates classes, but they do not directly provide probabilistic outputs.
3. Imbalanced Data: When dealing with imbalanced datasets where the number of samples
in different classes is significantly unequal, SVMs may be biased towards the majority
class.
4. Interpretability: SVMs tend to provide good predictive performance, but they may not
offer direct interpretability of the learned model.
Advantages of Support Vector Machine (SVM)
1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making it
suitable for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM
effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers, enhancing
robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary
classification and multiclass classification, suitable for applications in text classification.
5. Memory Efficiency: SVM focuses on support vectors, making it memory efficient
compared to other algorithms.
Disadvantages of Support Vector Machine (SVM)
1. Slow Training: SVM can be slow to train on large datasets, which limits its use in
large-scale data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters
like C requires careful tuning, and poor choices noticeably reduce model performance.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes, limiting
effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions makes
SVM less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential; otherwise, SVM models
may perform poorly.
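Feature scaling matters because SVM relies on distances and inner products, so a feature with a large numeric range can dominate the others. One common remedy is standardization (z-scores), sketched below; the helper name `standardize` and the toy rows are assumptions made for this example.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance (z-scores)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Two features on very different scales (assumed values).
rows = [[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]
scaled = standardize(rows)
for row in scaled:
    print([round(v, 4) for v in row])
```

After scaling, both features contribute on the same footing. In practice the means and standard deviations should be computed on the training data only and then reused for new data.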