
Naive Bayes Classifier Overview

Naive Bayes Classifiers are probabilistic algorithms used for classification tasks, particularly effective in text classification due to their speed and efficiency. They operate on Bayes' theorem, assuming feature independence, and can be applied in various domains such as spam filtering and medical diagnosis. The main types include Gaussian, Multinomial, and Bernoulli Naive Bayes, each suited for different types of data.

Uploaded by

Pragati Dagale
Copyright
© © All Rights Reserved

UNIT 2

Naive Bayes Classifiers


What are Naive Bayes Classifiers?
The Naïve Bayes algorithm is used for classification problems and is widely
used in text classification. In text classification tasks the data is
high-dimensional, because each word represents one feature. Typical uses are
spam filtering, sentiment detection, and rating classification. The main
advantage of Naïve Bayes is speed: training is fast, and prediction stays
cheap even when the data has many dimensions.
The model predicts the probability that an instance belongs to a class, given
a set of feature values, so it is a probabilistic classifier. It uses Bayes'
theorem for both training and prediction.
Why is it Called Naive Bayes?
It is called naive because it assumes that each feature is independent of
every other feature once the class is known. In other words, each feature
contributes to the prediction with no relation to the others. In the real
world this condition is rarely satisfied.


Bayes' Theorem
• Bayes' theorem describes the probability of an event based on prior
knowledge of conditions that may be related to the event.
• It gives a way to compute a conditional probability.
Assume we have a hypothesis (H) and evidence (E). According to Bayes'
theorem, the probability of the hypothesis before seeing the evidence, P(H),
and the probability of the hypothesis after seeing the evidence, P(H|E), are
related by:
P(H|E) = P(E|H) * P(H) / P(E)
• Prior probability = P(H) is the probability before getting the evidence.
• Posterior probability = P(H|E) is the probability after getting the evidence.
• In general,
P(class|data) = (P(data|class) * P(class)) / P(data)
Bayes' Theorem Example
Suppose we want the probability that a randomly picked card is a king, given
that it is a face card.
There are 4 kings in a deck of 52 cards, so
P(King) = 4/52
All kings are face cards, so
P(Face|King) = 1
There are 3 face cards in a suit of 13 cards and 4 suits in total, so
P(Face) = 12/52
Therefore,
P(King|Face) = P(Face|King) * P(King) / P(Face) = (1 * 4/52) / (12/52) = 1/3
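The card calculation can be checked with exact fractions. This is a minimal sketch; the helper name `bayes` is an illustrative choice and does not come from the notes.

```python
from fractions import Fraction

def bayes(likelihood, prior, evidence):
    """Return P(H|E) = P(E|H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_king = Fraction(4, 52)          # 4 kings in a 52-card deck
p_face_given_king = Fraction(1)   # every king is a face card
p_face = Fraction(12, 52)         # 3 face cards per suit, 4 suits

p_king_given_face = bayes(p_face_given_king, p_king, p_face)
print(p_king_given_face)          # 1/3
```

Using `Fraction` avoids rounding, so the result matches the hand calculation exactly.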

Here are some details about Bayes classifiers:

Naive Bayes Classifier: A Simple Yet Powerful Algorithm


Naive Bayes is a probabilistic machine learning algorithm that's particularly
effective for classification tasks. It's based on Bayes' theorem with a strong
independence assumption between features. Despite this "naive" assumption,
Naive Bayes often performs surprisingly well in practice, especially with text
data.

How Naive Bayes Works


1. Bayes' Theorem:
o The core of Naive Bayes is Bayes' theorem, which relates conditional
probabilities:
P(A|B) = P(B|A) * P(A) / P(B)
o In the context of classification, this translates to:
P(Class|Features) = P(Features|Class) * P(Class) / P(Features)
2. Naive Assumption:
o Naive Bayes assumes that all features are independent of each other given
the class. This simplifies the calculation of probabilities:
P(Features|Class) = P(Feature1|Class) * P(Feature2|Class) * ... * P(FeatureN|Class)
3. Classification:
o To classify a new instance, Naive Bayes calculates the probability of the
instance belonging to each class.
o The class with the highest probability is assigned to the instance.
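The three steps above can be sketched for categorical features as follows. This is an illustrative sketch, not a reference implementation: the fruit data and the function names `train` and `predict` are made up for the example. P(Features) is left out because it is the same for every class and does not change which class scores highest.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class frequencies of each feature value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict(row, class_counts, value_counts):
    """Score each class as P(Class) * product of P(Feature_i|Class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total                                  # prior P(Class)
        for i, value in enumerate(row):
            score *= value_counts[(i, label)][value] / n_label   # P(Feature_i|Class)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical training data: (color, shape) -> fruit
rows = [("red", "round"), ("red", "round"), ("green", "round"),
        ("yellow", "long"), ("yellow", "long"), ("green", "long")]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]

model = train(rows, labels)
best, scores = predict(("red", "round"), *model)
print(best)   # apple
```

Note that an unseen feature value drives the whole product to zero, which is the zero-frequency problem that smoothing addresses.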

Types of Naive Bayes Classifiers

• Gaussian Naive Bayes: Assumes that features are continuous and normally
distributed.
• Multinomial Naive Bayes: Suitable for discrete features, often used for
text classification.
• Bernoulli Naive Bayes: Similar to Multinomial but treats features as
binary (present or absent).

Advantages of Naive Bayes

• Simplicity: Easy to understand and implement.
• Efficiency: Fast training and prediction times.
• Handles high-dimensional data: Works well with many features.
• Effective for text classification: Often achieves high accuracy in tasks
like spam filtering and sentiment analysis.

Disadvantages of Naive Bayes

• Naive assumption: The independence assumption may not always hold in
real-world data.
• Zero-frequency problem: If a feature value doesn't appear in the training
data, its probability will be zero, affecting the overall probability.
Smoothing techniques like Laplace smoothing can help mitigate this issue.
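Laplace (add-one) smoothing can be sketched as below. The function name and the counts are illustrative assumptions: a feature with 3 possible values, observed 0, 2 and 2 times in a class with 4 training examples.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """P(value|class) with additive smoothing: (count + alpha) / (total + alpha * n_values)."""
    return (count + alpha) / (class_total + alpha * n_values)

# Hypothetical counts of three feature values inside one class of 4 examples
counts = {"Overcast": 0, "Rainy": 2, "Sunny": 2}
class_total = sum(counts.values())

unsmoothed = {v: c / class_total for v, c in counts.items()}
smoothed = {v: smoothed_likelihood(c, class_total, len(counts)) for v, c in counts.items()}

print(unsmoothed["Overcast"])   # 0.0, which would zero out the whole product
print(smoothed["Overcast"])     # 1/7, small but non-zero
```

The smoothed probabilities still sum to 1, so they remain a valid distribution over the feature values.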

Applications of Naive Bayes

• Text classification: Spam filtering, sentiment analysis, topic modeling
• Image classification: Facial recognition, object detection
• Medical diagnosis: Disease prediction
• Recommendation systems: Product recommendations


Bayes Theorem Formula

The formula for Bayes' theorem can be written in several ways. The
following is the most common version:
P(A ∣ B) = P(B ∣ A) P(A) / P(B)
P(A ∣ B) is the conditional probability of event A occurring, given that B is true.
P(B ∣ A) is the conditional probability of event B occurring, given that A is true.
P(A) and P(B) are the marginal probabilities of A and B, each taken without
reference to the other. P(B) must be greater than zero.

Bayes Theorem Formula Solved Examples


Example 1. A certain disease affects 2% of the population. A diagnostic test for
the disease has a sensitivity of 95% (i.e., the probability of testing positive
given that the disease is present is 0.95), and the false positive rate is 3% (i.e.,
the probability of testing positive given that the disease is absent is 0.03). If a
randomly selected person tests positive, what is the probability that they
actually have the disease?
Example 2. A box contains 3 fair coins and 1 biased coin with two heads. A
coin is randomly selected and tossed, and it shows heads. What is the
probability that the chosen coin is biased?
Example 3. A certain drug test correctly identifies drug users 98% of the time
and gives false positives for 3% of non-drug users. If 1% of the population are
drug users, what is the probability that a person who tests positive is actually a
drug user?
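The three examples can be worked with one small helper that expands P(E) using the law of total probability. This is a sketch under stated assumptions: the function name `posterior` is illustrative, and in Example 3 the 3% figure is read as the false positive rate among non-users.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H|E) = P(E|H) P(H) / [P(E|H) P(H) + P(E|not H) P(not H)]."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence

# Example 1: disease prevalence 2%, P(+|disease) = 0.95, P(+|no disease) = 0.03
ex1 = posterior(0.02, 0.95, 0.03)

# Example 2: 1 of 4 coins is two-headed; P(heads|biased) = 1, P(heads|fair) = 0.5
ex2 = posterior(1 / 4, 1.0, 0.5)

# Example 3: 1% users, P(+|user) = 0.98, assumed P(+|non-user) = 0.03
ex3 = posterior(0.01, 0.98, 0.03)

print(round(ex1, 3), round(ex2, 3), round(ex3, 3))   # 0.393 0.4 0.248
```

In Examples 1 and 3 the posterior stays well below the test's hit rate because the condition is rare, so false positives from the large unaffected group dominate.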
Naïve Bayes Classifier Algorithm
o The Naïve Bayes algorithm is a supervised learning algorithm that is based
on Bayes' theorem and used for solving classification problems.
o It is mainly used in text classification, which involves a high-dimensional
training dataset.
o The Naïve Bayes classifier is one of the simplest and most effective
classification algorithms. It helps in building fast machine learning
models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability that an object belongs to a class.
o Some popular applications of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?


The name Naïve Bayes is made up of two words, Naïve and Bayes, which
can be described as:

o Naïve: It is called naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features. For
example, if a fruit is identified on the basis of color, shape, and taste, then
a red, spherical, and sweet fruit is recognized as an apple. Each feature
individually contributes to identifying it as an apple without depending
on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used
to determine the probability of a hypothesis with prior knowledge, and it
depends on conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the
observed event B.
P(B|A) is the likelihood: the probability of the evidence given that the
hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before
observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier:
The working of the Naïve Bayes classifier can be understood with the help of
the example below.
Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should
play or not on a particular day according to the weather conditions. To solve
this problem, we follow these steps:

1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset below:

      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes

Frequency table for the Weather Conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4

Likelihood table for the weather conditions:

Weather     No             Yes             P(Weather)
Overcast    0              5               5/14 = 0.36
Rainy       2              2               4/14 = 0.29
Sunny       2              3               5/14 = 0.36
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.36
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = (3/10) * (10/14) / (5/14) = 3/5 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.36
So P(No|Sunny) = (2/4) * (4/14) / (5/14) = 2/5 = 0.40
As the calculation shows, P(Yes|Sunny) > P(No|Sunny), and the two posteriors
sum to 1 when exact fractions are used instead of rounded decimals.
Hence on a sunny day, the player can play the game.
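The same result can be reproduced directly from the 14-row dataset. The sketch below builds the frequency counts and applies Bayes' theorem with exact fractions; the variable and function names are illustrative choices.

```python
from collections import Counter
from fractions import Fraction

# The 14 rows of the weather dataset, in order (rows 0 to 13)
outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
        "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

n = len(play)
class_counts = Counter(play)                 # Yes: 10, No: 4
outlook_counts = Counter(outlook)            # Sunny: 5, Overcast: 5, Rainy: 4
joint_counts = Counter(zip(outlook, play))   # the frequency table

def posterior(label, weather):
    """P(label|weather) = P(weather|label) * P(label) / P(weather)."""
    likelihood = Fraction(joint_counts[(weather, label)], class_counts[label])
    prior = Fraction(class_counts[label], n)
    evidence = Fraction(outlook_counts[weather], n)
    return likelihood * prior / evidence

p_yes = posterior("Yes", "Sunny")
p_no = posterior("No", "Sunny")
print(p_yes, p_no)   # 3/5 2/5
```

Because only one feature is used here, the posterior reduces to the share of sunny days on which play was "Yes", which is 3 out of 5.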
Types of Naïve Bayes Model:
There are three types of Naive Bayes model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means that if predictors take continuous values instead of
discrete ones, the model assumes that these values are sampled from a
Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the
data follows a multinomial distribution. It is primarily used for document
classification problems, that is, deciding which category a particular document
belongs to, such as Sports, Politics, or Education.
The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean
variables, such as whether a particular word is present in a document or not.
This model is also well known for document classification tasks.
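To make the Gaussian variant concrete, here is a minimal one-feature sketch. It is an illustration only, not the scikit-learn implementation: the ages, class names and function names are hypothetical, and the variance is the plain population variance with no smoothing term.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(values_by_class):
    """Estimate (prior, mean, variance) for each class from 1-D samples."""
    total = sum(len(v) for v in values_by_class.values())
    return {c: (len(v) / total, mean(v), pvariance(v))
            for c, v in values_by_class.items()}

def predict(x, params):
    """Pick the class with the largest prior * Gaussian likelihood."""
    scores = {c: prior * gaussian_pdf(x, mu, var)
              for c, (prior, mu, var) in params.items()}
    return max(scores, key=scores.get)

# Hypothetical ages of customers in two classes
params = fit({"not_purchased": [22, 25, 30, 28],
              "purchased": [45, 50, 48, 55]})
print(predict(27, params))   # not_purchased
print(predict(52, params))   # purchased
```

With several features, the per-feature densities would be multiplied together, following the independence assumption.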
Python Implementation of the Naïve Bayes algorithm:
Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.

1) Data Pre-processing step:


In this step, we pre-process the data so that we can use it efficiently in our
code, in the same way as in the earlier data pre-processing topic. The code is
given below:
    # Importing the libraries
    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd

    # Importing the dataset
    dataset = pd.read_csv('user_data.csv')
    # Columns 2 and 3 are the features (plotted later as Age and Estimated Salary)
    x = dataset.iloc[:, [2, 3]].values
    y = dataset.iloc[:, 4].values

    # Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

    # Feature Scaling (fit on the training set only, then apply to the test set)
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using
"dataset = pd.read_csv('user_data.csv')". The loaded dataset is divided into a
training set and a test set, and then the feature variables are scaled.
The output for the dataset is given as:

2) Fitting Naive Bayes to the Training Set:


After the pre-processing step, we fit the Naive Bayes model to the training
set. Below is the code for it:
    # Fitting Naive Bayes to the Training set
    from sklearn.naive_bayes import GaussianNB
    classifier = GaussianNB()
    classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier and fitted it to the
training dataset. Other Naive Bayes classifiers can be used as required.
Output:

    Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)

3) Prediction of the test set result:


Now we will predict the test set result. For this, we create a new predictor
variable y_pred and use the predict function to make the predictions.
    # Predicting the Test set results
    y_pred = classifier.predict(x_test)
Output:
The output shows the prediction vector y_pred next to the real vector y_test.
Some predictions differ from the real values; these are the incorrect
predictions.
4) Creating Confusion Matrix:
Now we will check the accuracy of the Naive Bayes classifier using the
confusion matrix. Below is the code for it:
    # Making the Confusion Matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)
Output:

As the confusion matrix output shows, there are 7+3 = 10 incorrect
predictions and 65+25 = 90 correct predictions.
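How a 2x2 confusion matrix is built, and how accuracy follows from it, can be sketched in plain Python. The function names and the small label vectors are illustrative. The placement of the 7 and the 3 in the off-diagonal cells of `cm_reported` is an assumption, since the output image is not reproduced here; the accuracy does not depend on that placement.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels (0, 1); columns are predicted labels (0, 1)."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Share of predictions on the diagonal (correct predictions)."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

# Tiny hypothetical example
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
cm_small = confusion_matrix_2x2(y_true, y_pred)
print(cm_small)                # [[2, 1], [1, 2]]

# Counts quoted in the text: 65 and 25 correct, 7 and 3 incorrect
cm_reported = [[65, 7], [3, 25]]   # off-diagonal placement is assumed
print(accuracy(cm_reported))   # 0.9
```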
5) Visualizing the training set result:
Next we will visualize the training set result of the Naïve Bayes classifier.
Below is the code for it:

    # Visualising the Training set results
    from matplotlib.colors import ListedColormap
    x_set, y_set = x_train, y_train
    # Grid covering the feature range, used to colour the decision regions
    X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                         nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
    mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
    mtp.xlim(X1.min(), X1.max())
    mtp.ylim(X2.min(), X2.max())
    # Scatter the actual points, one colour per class
    for i, j in enumerate(nm.unique(y_set)):
        mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    c = ListedColormap(('purple', 'green'))(i), label = j)
    mtp.title('Naive Bayes (Training set)')
    mtp.xlabel('Age')
    mtp.ylabel('Estimated Salary')
    mtp.legend()
    mtp.show()
Output:

In the above output we can see that the Naïve Bayes classifier has separated
the data points with a smooth boundary. The boundary is curved because
GaussianNB models each class with Gaussian distributions.
6) Visualizing the Test set result:
    # Visualising the Test set results
    from matplotlib.colors import ListedColormap
    x_set, y_set = x_test, y_test
    # Grid covering the feature range, used to colour the decision regions
    X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                         nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
    mtp.contourf(X1, X2, classifier.predict(nm.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
    mtp.xlim(X1.min(), X1.max())
    mtp.ylim(X2.min(), X2.max())
    # Scatter the actual points, one colour per class
    for i, j in enumerate(nm.unique(y_set)):
        mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    c = ListedColormap(('purple', 'green'))(i), label = j)
    mtp.title('Naive Bayes (test set)')
    mtp.xlabel('Age')
    mtp.ylabel('Estimated Salary')
    mtp.legend()
    mtp.show()
Output:

Random Forest Hyperparameters


1. n_estimators
A Random Forest is a set of decision trees whose predictions are combined. One
question that arises is how many trees to build. n_estimators is the
hyperparameter that defines the number of trees used in the model.
By default: n_estimators=100
2. max_features
To train a machine learning model, the dataset contains multiple
features/variables used to predict the label/target. max_features limits the
number of features considered when looking for the best split.
By default: max_features="sqrt" (available: "sqrt", "log2", None)
3. max_depth
max_depth sets the maximum depth of each tree, that is, the longest allowed
chain of splits from the root to a leaf. If max_depth is too low, the model
learns too little and has high bias, so it underfits. If max_depth is too
high, the model fits the training data too closely and has high variance, so
it overfits.
By default: max_depth=None
4. max_leaf_nodes
Each tree splits into multiple nodes. How many leaf nodes a tree may end up
with is specified by max_leaf_nodes, which restricts the growth of each tree.
By default: max_leaf_nodes=None (an unlimited number of leaf nodes)
5. max_samples
Apart from the features, there is the set of training samples. max_samples
determines how many samples of the dataset are drawn to train each individual
tree.
By default: max_samples=None (this means X.shape[0] samples are drawn)
6. min_samples_split
An ensemble combines many weak learners into a stronger one, and in a Random
Forest the individual learners are decision trees. min_samples_split
determines the minimum number of observations a node must contain in order to
be split.
By default: min_samples_split=2 (a node with fewer than 2 samples is not split)
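The defaults above can be collected in one place, and the effect of max_features can be made concrete with a small helper. This is a sketch: the dictionary uses scikit-learn's parameter spellings (max_samples, min_samples_split), while `n_features_per_split` is a hypothetical helper that follows the documented meaning of each option rather than calling the library.

```python
import math

# Defaults as listed above, with scikit-learn's parameter spellings
DEFAULTS = {
    "n_estimators": 100,
    "max_features": "sqrt",
    "max_depth": None,
    "max_leaf_nodes": None,
    "max_samples": None,
    "min_samples_split": 2,
}

def n_features_per_split(max_features, n_features):
    """Number of features examined at each split for a given max_features setting."""
    if max_features is None:
        return n_features                                # use every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features                              # absolute count
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))    # fraction of the features
    raise ValueError(f"unsupported max_features: {max_features!r}")

print(n_features_per_split(DEFAULTS["max_features"], 100))   # 10
print(n_features_per_split("log2", 100))                     # 6
print(n_features_per_split(None, 100))                       # 100
```

Considering fewer features per split makes the trees less correlated with each other, which is one reason the "sqrt" default is common for classification.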
What are Attribute Selection Measures?
The tree node created for partition D is labeled with the splitting criterion,
a branch is grown for each outcome of the criterion, and the tuples are
partitioned accordingly. There are three popular attribute selection measures:
information gain, gain ratio, and Gini index.
Information gain − Information gain is used to find the features/attributes
that give the most information about a class. It is based on entropy and aims
to reduce entropy from the root node down to the leaf nodes.
Let node N represent or hold the tuples of partition D. The attribute with the
largest information gain is selected as the splitting attribute for node N. This
attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or “impurity” in these
partitions.
Gain ratio − The information gain measure is biased toward tests with many
outcomes, so it prefers attributes that have a large number of values. For
instance, consider an attribute that acts as a unique identifier, such as
product ID.
A split on product ID results in a large number of partitions, each one
containing only one tuple. Because each partition is pure, the information
needed to classify data set D based on this partitioning would be
Info_product_ID(D) = 0. The information gained by this split is therefore
maximal, yet such a partitioning is useless for classification. Gain ratio
corrects for this by normalizing information gain with the split information.
Gini index − The Gini index is used in CART. The Gini index measures
the impurity of D, a data partition or collection of training tuples, as
Gini(D) = 1 - \sum_{i=1}^{m} p_i^2
where p_i is the probability that a tuple in D belongs to class C_i and m is
the number of classes.
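The three measures can be sketched in plain Python. Here entropy is -sum(p_i * log2(p_i)), information gain is the entropy of D minus the weighted entropy of its partitions, gain ratio divides that gain by the entropy of the partition sizes (the split information), and the Gini index is 1 - sum(p_i^2). The function names and the four-row data are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(values, labels):
    """Group the labels by the attribute value of each tuple."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())

def information_gain(values, labels):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    split_info = entropy(values)   # entropy of the partition sizes
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

labels = ["Yes", "Yes", "No", "No"]
product_id = [1, 2, 3, 4]            # unique identifier: one tuple per partition
useful = ["a", "a", "b", "b"]        # hypothetical attribute aligned with the class

print(information_gain(product_id, labels), information_gain(useful, labels))   # 1.0 1.0
print(gain_ratio(product_id, labels), gain_ratio(useful, labels))               # 0.5 1.0
print(gini(labels))                                                             # 0.5
```

Information gain cannot tell the identifier apart from the useful attribute, while gain ratio penalizes the identifier for its many partitions.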

Decision Tree Classification Algorithm

• Decision Tree is a supervised learning technique that can be used for
both classification and regression problems, but it is mostly preferred for
solving classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the
outcome.
• In a decision tree, there are two types of nodes: the Decision
Node and the Leaf Node. Decision nodes are used to make a decision and
have multiple branches, whereas leaf nodes are the output of those
decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the
given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like
structure.
• To build a tree, we can use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer
(Yes/No), further splits the tree into subtrees.
• Below diagram explains the general structure of a decision tree:
Why use Decision Trees?
There are various algorithms in machine learning, so choosing the best
algorithm for the given dataset and problem is the main point to remember
while creating a machine learning model. Below are two reasons for using
the decision tree:

• Decision trees usually mimic human thinking while making a decision, so
they are easy to understand.
• The logic behind the decision tree can be easily understood because it
shows a tree-like structure.

Decision Tree Terminologies

• Root Node: The root node is where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes, and the tree cannot be
split further after reaching a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing unwanted branches from the
tree.
• Parent/Child node: A node that is split into sub-nodes is the parent of
those sub-nodes, and the sub-nodes are its child nodes. The root node is the
parent at the top of the tree.
How does the Decision Tree algorithm Work?
To predict the class of a given record, the algorithm starts from the root
node of the tree. It compares the value of the root attribute with the
corresponding attribute of the record and, based on the comparison,
follows the branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value with the
sub-nodes and moves further. It continues this process until it reaches a
leaf node of the tree. The complete process of building the tree can be
understood using the algorithm below:
• Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best
attribute.
• Step-4: Generate the decision tree node that contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where the nodes cannot be classified further; such a final node is called a
leaf node.
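Steps 1 to 5 can be sketched as a short recursive function. This is an illustrative sketch under stated assumptions, not the CART implementation: it uses information gain as the attribute selection measure, handles only categorical attributes, and the job-offer rows, labels and function names are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 2: pick the attribute with the largest information gain."""
    def gain(attr):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Leaf node: all labels agree, or no attribute is left to split on
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)            # Step 2
    node = {"attribute": attr, "children": {}}                 # Step 4
    remaining = [a for a in attributes if a != attr]
    for value in {row[attr] for row in rows}:                  # Step 3
        subset = [(r, y) for r, y in zip(rows, labels) if r[attr] == value]
        node["children"][value] = build_tree(                  # Step 5
            [r for r, _ in subset], [y for _, y in subset], remaining)
    return node

def classify(tree, row):
    """Follow branches from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

# Hypothetical training data (Step 1: the root holds the complete dataset)
rows = [{"salary": "low", "distance": "near"}, {"salary": "low", "distance": "far"},
        {"salary": "low", "distance": "near"}, {"salary": "high", "distance": "near"},
        {"salary": "high", "distance": "near"}, {"salary": "high", "distance": "far"}]
labels = ["Decline", "Decline", "Decline", "Accept", "Accept", "Decline"]

tree = build_tree(rows, labels, ["salary", "distance"])
print(classify(tree, {"salary": "high", "distance": "near"}))   # Accept
```

The sketch raises a KeyError for attribute values that never occurred during training; a fuller version would fall back to the majority class at that node.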
Example: Suppose there is a candidate who has a job offer and wants to decide
whether to accept the offer or not. To solve this problem, the
decision tree starts with the root node (the Salary attribute, chosen by ASM).
The root node splits into the next decision node (distance from the office) and
one leaf node, based on the corresponding labels. The next decision node splits
into one decision node (Cab facility) and one leaf node. Finally, that
decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the below diagram:
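The job-offer tree described above can be written as nested conditions, one per decision node. Which answer leads to which branch is an assumption made for this sketch (the text names the nodes but not the branch labels), and the function and parameter names are hypothetical.

```python
def decide_offer(salary_acceptable, office_nearby, cab_facility):
    """Walk the job-offer tree from the root to a leaf."""
    if not salary_acceptable:      # root node: Salary -> leaf
        return "Declined"
    if not office_nearby:          # decision node: distance from the office -> leaf
        return "Declined"
    # decision node: Cab facility -> two leaves
    return "Accepted" if cab_facility else "Declined"

print(decide_offer(True, True, True))    # Accepted
print(decide_offer(True, False, True))   # Declined
```

Each `if` corresponds to one decision node, and each `return` corresponds to one leaf node.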

Advantages of the Decision Tree

• It is simple to understand, as it follows the same process that a human
follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

• A decision tree can contain many layers, which makes it complex.
• It may have an overfitting issue, which can be reduced using
the Random Forest algorithm.
• With more class labels, the computational complexity of the decision tree
may increase.
Why use the decision tree algorithm?
The decision tree algorithm is used in machine learning because it is an
effective way to make decisions by:

• Laying out possible outcomes: Decision trees help developers analyze
the potential consequences of a decision.
• Predicting outcomes: As the algorithm is trained on more data, it can
predict outcomes for future data.
• Producing accurate models: Decision trees can produce accurate and
interpretable models with minimal user intervention.
• Handling data: Decision trees can handle both categorical and
numerical data.
• Being fast: The decision tree algorithm is fast at both build time and
apply time.

Python Implementation of Decision Tree

The steps remain the same as before and are given below:

o Data pre-processing step
o Fitting a Decision-Tree algorithm to the Training set
o Predicting the test result
o Test accuracy of the result (creation of Confusion matrix)
o Visualizing the test set result.

1. Data Pre-Processing Step:

Below is the code for the pre-processing step:

    # Importing libraries
    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd

    # Importing the dataset
    data_set = pd.read_csv('user_data.csv')

    # Extracting independent and dependent variables
    x = data_set.iloc[:, [2, 3]].values
    y = data_set.iloc[:, 4].values

    # Splitting the dataset into training and test set
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)

    # Feature scaling (fit on the training set only, then apply to the test set)
    from sklearn.preprocessing import StandardScaler
    st_x = StandardScaler()
    x_train = st_x.fit_transform(x_train)
    x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data after loading the
dataset, which is given as:
2. Fitting a Decision-Tree algorithm to the Training set
Now we will fit the model to the training set. For this, we import the
DecisionTreeClassifier class from the sklearn.tree library. Below is the code for it:
    # Fitting Decision Tree classifier to the training set
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
    classifier.fit(x_train, y_train)
In the above code, we have created a classifier object, to which we have passed
two main parameters:

o criterion='entropy': The criterion measures the quality of a split; here it
is the information gain computed from entropy.
o random_state=0: Seeds the random number generator so that results are
reproducible.

Below is the output for this:

    Out[8]:
    DecisionTreeClassifier(class_weight=None, criterion='entropy',
                           max_depth=None,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, presort=False,
                           random_state=0, splitter='best')

3. Predicting the test result


Now we will predict the test set result. We will create a new prediction vector
y_pred. Below is the code for it:
    # Predicting the test set result
    y_pred = classifier.predict(x_test)
Output:
In the below output image, the predicted output and real test output are given.
We can clearly see that some values in the prediction vector are
different from the real vector values. These are prediction errors.

4. Test accuracy of the result (Creation of Confusion matrix)


In the above output, we have seen that there were some incorrect predictions. To
know the number of correct and incorrect predictions, we need to
use the confusion matrix. Below is the code for it:

    # Creating the Confusion matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)
Output:

In the above output image, we can see the confusion matrix, which has 6+3 = 9
incorrect predictions and 62+29 = 91 correct predictions. With 91 of 100 test
points classified correctly, the Decision Tree classifier made good
predictions on this dataset.
5. Visualizing the training set result:
Here we will visualize the training set result by plotting a graph for the
decision tree classifier. The classifier predicts Yes or No for the users who
have either purchased or not purchased the SUV car, as we did in Logistic
Regression. Below is the code for it:
    # Visualizing the training set result
    from matplotlib.colors import ListedColormap
    x_set, y_set = x_train, y_train
    # Grid covering the feature range, used to colour the decision regions
    x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                         nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
    mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
                 alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
    mtp.xlim(x1.min(), x1.max())
    mtp.ylim(x2.min(), x2.max())
    # Scatter the actual points, one colour per class
    for i, j in enumerate(nm.unique(y_set)):
        mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    c = ListedColormap(('purple', 'green'))(i), label = j)
    mtp.title('Decision Tree Algorithm (Training set)')
    mtp.xlabel('Age')
    mtp.ylabel('Estimated Salary')
    mtp.legend()
    mtp.show()
Output:

The above output is completely different from that of the other classification
models. It has both vertical and horizontal lines that split the dataset
according to the age and estimated salary variables.
As we can see, the tree is trying to capture every individual data point, which
is a case of overfitting.
6. Visualizing the test set result:
Visualization of the test set result is similar to the visualization of the
training set, except that the training set is replaced with the test set.
    # Visualizing the test set result
    from matplotlib.colors import ListedColormap
    x_set, y_set = x_test, y_test
    # Grid covering the feature range, used to colour the decision regions
    x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                         nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
    mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
                 alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
    mtp.xlim(x1.min(), x1.max())
    mtp.ylim(x2.min(), x2.max())
    # Scatter the actual points, one colour per class
    for i, j in enumerate(nm.unique(y_set)):
        mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    c = ListedColormap(('purple', 'green'))(i), label = j)
    mtp.title('Decision Tree Algorithm (Test set)')
    mtp.xlabel('Age')
    mtp.ylabel('Estimated Salary')
    mtp.legend()
    mtp.show()
Output:

Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both classification and
regression problems in ML. It is based on the concept of ensemble learning,
which is the process of combining multiple classifiers to solve a complex problem
and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset." Instead of relying on one
decision tree, the random forest takes the prediction from each tree and
predicts the final output based on the majority vote of those predictions.
A greater number of trees in the forest generally leads to higher accuracy
and reduces the problem of overfitting, although the improvement levels off as
more trees are added.
The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest algorithm, you should have
knowledge of the Decision Tree algorithm.

Assumptions for Random Forest


Since the random forest combines multiple trees to predict the class of a data
point, some decision trees may predict the correct output while others may not.
Taken together, however, the trees are expected to predict the correct output.
Therefore, below are two assumptions for a better Random Forest classifier:

o The feature variables of the dataset should contain some actual
(informative) values, so that the classifier can predict accurate results
rather than guessed results.
o The predictions from each tree must have very low correlations.
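The low-correlation assumption can be motivated with a standard variance identity. As a sketch, assume N trees whose predictions T_1, ..., T_N each have the same variance σ² and the same pairwise correlation ρ (these symbols are introduced here for illustration and are not part of the original notes). The variance of the averaged prediction is then:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} T_i\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2}
```

Adding more trees shrinks only the second term. The first term, ρσ², remains however large N becomes, which is why trees with weakly correlated predictions benefit most from being combined.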
Why use Random Forest?
Below are some points that explain why we should use the Random Forest
algorithm:

o It takes less training time than many other algorithms, since the trees
can be built independently of one another.
o It predicts output with high accuracy and runs efficiently even on large
datasets.
o It can also maintain accuracy when a large proportion of data is missing.
How does Random Forest algorithm work?
Random Forest works in two phases: the first is to create the random forest by
combining N decision trees, and the second is to make predictions with each tree
created in the first phase.
The working process can be explained in the below steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision tree associated with the selected data points
(subset).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree, and
assign the new data point to the category that wins the majority of votes.
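The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the scikit-learn implementation: each "tree" is replaced by a one-split decision stump on a single feature, the sampling is done with replacement, and the names fit_stump, fit_forest, predict_forest and the toy ages/purchased data are all hypothetical.

```python
import random
from collections import Counter

def fit_stump(points, labels):
    """Fit a one-split stand-in for a decision tree on 1-D data.

    Every observed value is tried as a threshold; the one with the
    fewest training errors is kept.
    """
    best = None
    for threshold in sorted(set(points)):
        left = [y for x, y in zip(points, labels) if x <= threshold]
        right = [y for x, y in zip(points, labels) if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If no point falls to the right, reuse the left label
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_forest(points, labels, n_trees, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):  # Step-3 and Step-4: build N trees
        # Step-1: select K random data points (with replacement)
        idx = [rng.randrange(len(points)) for _ in range(k)]
        # Step-2: build a tree on the selected subset
        forest.append(fit_stump([points[i] for i in idx],
                                [labels[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    # Step-5: every tree votes, the majority class wins
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): age versus purchased (0 = no, 1 = yes)
ages = [18, 22, 25, 30, 35, 40, 45, 50, 55, 60]
purchased = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = fit_forest(ages, purchased, n_trees=25, k=10)
print(predict_forest(forest, 20), predict_forest(forest, 58))
```

Individual stumps may disagree because each one sees a different random subset, but the majority vote settles on class 0 for a young user and class 1 for an older one.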
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images, and
this dataset is given to the Random Forest classifier. The dataset is divided
into subsets, and each subset is given to one decision tree. During the training
phase, each decision tree learns from its own subset. When a new data point
arrives, each tree produces a prediction, and the Random Forest classifier takes
the majority of these results as the final decision. Consider the below image:
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification
of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the
disease can be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


o Random Forest is capable of performing both Classification and
Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and helps reduce the overfitting
issue.
Disadvantages of Random Forest
o Although random forest can be used for both classification and regression
tasks, it is less suitable for regression tasks.

Python Implementation of Random Forest Algorithm


Now we will implement the Random Forest algorithm using Python. For this, we
will use the same dataset "user_data.csv", which we have used in previous
classification models. By using the same dataset, we can compare the Random
Forest classifier with other classification models such as Decision Tree
Classifier, KNN, SVM, Logistic Regression, etc.
Implementation steps are given below:

o Data pre-processing step
o Fitting the Random Forest algorithm to the training set
o Predicting the test result
o Testing the accuracy of the result (creation of the confusion matrix)
o Visualizing the test set result

1. Data Pre-Processing Step:


Below is the code for the pre-processing step:

    # Importing libraries
    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd

    # Importing the dataset
    data_set = pd.read_csv('user_data.csv')

    # Extracting independent and dependent variables
    x = data_set.iloc[:, [2, 3]].values
    y = data_set.iloc[:, 4].values

    # Splitting the dataset into training and test sets
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                        random_state=0)

    # Feature scaling: fit on the training set only, then apply to the test set
    from sklearn.preprocessing import StandardScaler
    st_x = StandardScaler()
    x_train = st_x.fit_transform(x_train)
    x_test = st_x.transform(x_test)
In the above code, we have pre-processed the data. The dataset that we loaded
is given as:
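To make the feature scaling step concrete, here is a minimal sketch of what standardization does, written with the standard library only. The helper names fit_scaler and transform and the sample ages are assumptions made for illustration. The population standard deviation (pstdev) is used because StandardScaler also divides by the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(value - mean) / std for value in column]

# Hypothetical ages, split into a training part and a test part
train_age = [20.0, 30.0, 40.0, 50.0]
test_age = [35.0, 60.0]

# The statistics come from the training data only ...
mean, std = fit_scaler(train_age)
scaled_train = transform(train_age, mean, std)
# ... and the same statistics are reused for the test data
scaled_test = transform(test_age, mean, std)

print(scaled_train)
print(scaled_test)
```

Reusing the training mean and standard deviation for the test set mirrors the fit_transform / transform pair in the code above and keeps information from the test set out of the training step.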

2. Fitting the Random Forest algorithm to the training set:


Now we will fit the Random Forest algorithm to the training set. To fit it, we
will import the RandomForestClassifier class from the sklearn.ensemble library.
The code is given below:

    # Fitting the Random Forest classifier to the training set
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=10, criterion="entropy")
    classifier.fit(x_train, y_train)
In the above code, the classifier object takes the below parameters:
o n_estimators= The required number of trees in the Random Forest. The
default value was 10 in older scikit-learn releases and is 100 since
version 0.22. We can choose any number, but more trees take longer to
train and to predict with.
o criterion= The function used to measure the quality of a split. Here we
have taken "entropy" for the information gain.
Output:
RandomForestClassifier(bootstrap=True, class_weight=None,
criterion='entropy',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
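The "entropy" criterion chosen above scores a candidate split by its information gain. The sketch below shows that calculation in plain Python on hypothetical class counts. The function names entropy and information_gain are illustrative and are not scikit-learn functions.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: 5 users who purchased and 5 who did not
parent = [5, 5]
# A candidate split sends [4, 1] to the left child and [1, 4] to the right
split = [[4, 1], [1, 4]]

print(entropy(parent))
print(information_gain(parent, split))
```

A perfectly mixed two-class node has an entropy of 1 bit, and a split that separated the classes completely would have an information gain of 1. The partial split used here gains less, so a tree would prefer any available split with a higher gain.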

3. Predicting the Test Set result


Since our model is fitted to the training set, we can now predict the test
result. For prediction, we will create a new prediction vector y_pred. Below is
the code for it:

    # Predicting the test set result
    y_pred = classifier.predict(x_test)
Output:
The prediction vector is given as:
By checking the above prediction vector and test set real vector, we can
determine the incorrect predictions done by the classifier.
4. Creating the Confusion Matrix
Now we will create the confusion matrix to determine the correct and incorrect
predictions. Below is the code for it:
    # Creating the confusion matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)
Output:
As we can see in the above matrix, there are 4+4= 8 incorrect predictions and
64+28= 92 correct predictions.
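The counts quoted above can be turned into an accuracy figure with a few lines of plain Python. The matrix layout below is an assumption, since the output image is not reproduced here: the correct counts 64 and 28 are placed on the diagonal and the two groups of 4 errors off the diagonal, with rows as actual classes and columns as predicted classes.

```python
# Assumed layout: rows are actual classes, columns are predicted classes
cm = [[64, 4],
      [4, 28]]

# Correct predictions lie on the diagonal
correct = sum(cm[i][i] for i in range(len(cm)))
total = sum(sum(row) for row in cm)
incorrect = total - correct
accuracy = correct / total

print(correct, incorrect, accuracy)  # prints: 92 8 0.92
```

With 92 of the 100 test points classified correctly, the accuracy on the test set is 0.92.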
5. Visualizing the training Set result
Here we will visualize the training set result. To do so, we will plot a graph
for the Random Forest classifier. The classifier will predict Yes or No for the
users who have either purchased or not purchased the SUV car, as we did in
Logistic Regression. Below is the code for it:
    from matplotlib.colors import ListedColormap
    x_set, y_set = x_train, y_train
    # Build a fine grid covering the range of both features
    x1, x2 = nm.meshgrid(
        nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
        nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
    # Colour every grid point by the class the classifier predicts for it
    mtp.contourf(x1, x2,
                 classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
                 alpha=0.75, cmap=ListedColormap(('purple', 'green')))
    mtp.xlim(x1.min(), x1.max())
    mtp.ylim(x2.min(), x2.max())
    # Overlay the actual training points, one colour per class
    for i, j in enumerate(nm.unique(y_set)):
        mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    color=ListedColormap(('purple', 'green'))(i), label=j)
    mtp.title('Random Forest Algorithm (Training set)')
    mtp.xlabel('Age')
    mtp.ylabel('Estimated Salary')
    mtp.legend()
    mtp.show()
Output:

The above image is the visualization result for the Random Forest classifier
working with the training set. It is very similar to the result of the Decision
Tree classifier. Each data point corresponds to one user in user_data, and the
purple and green regions are the prediction regions. The purple region is
classified for the users who did not purchase the SUV car, and the green region
is for the users who purchased the SUV.
So, in the Random Forest classifier, we have taken 10 trees that have each
predicted Yes or No for the Purchased variable. The classifier took the majority
of the predictions and provided the result.
6. Visualizing the test set result
Now we will visualize the test set result. Below is the code for it:
    # Visualizing the test set result
    from matplotlib.colors import ListedColormap
    x_set, y_set = x_test, y_test
    # Build a fine grid covering the range of both features
    x1, x2 = nm.meshgrid(
        nm.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
        nm.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
    # Colour every grid point by the class the classifier predicts for it
    mtp.contourf(x1, x2,
                 classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
                 alpha=0.75, cmap=ListedColormap(('purple', 'green')))
    mtp.xlim(x1.min(), x1.max())
    mtp.ylim(x2.min(), x2.max())
    # Overlay the actual test points, one colour per class
    for i, j in enumerate(nm.unique(y_set)):
        mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    color=ListedColormap(('purple', 'green'))(i), label=j)
    mtp.title('Random Forest Algorithm (Test set)')
    mtp.xlabel('Age')
    mtp.ylabel('Estimated Salary')
    mtp.legend()
    mtp.show()
Output:
