
Logistic Regression and SVM Overview

The document provides an overview of various machine learning algorithms, focusing on Logistic Regression, Support Vector Machines (SVM), and Decision Trees. It explains the differences between regression and classification algorithms, the logistic function, types of logistic regression, and the workings of SVM and decision trees, including key concepts like entropy and information gain. Additionally, it highlights real-world applications of these algorithms in fields such as healthcare, finance, and marketing.

Uploaded by

tejaswi ruttala
Copyright
© All Rights Reserved

UNIT 1

Logistic Regression: The Problem, The Logistic Function, Applying the Model,
Goodness of Fit, Support Vector Machines.
Decision Trees: What Is a Decision Tree? Entropy, The Entropy of a Partition,
Creating a Decision Tree, Random Forests.
Neural Networks: Perceptron, Feed-Forward Neural Networks and Backpropagation
with examples.

The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc., whereas
Classification algorithms are used to predict/classify discrete values such as Male
or Female, True or False, Spam or Not Spam, etc.

Logistic Regression in Machine Learning:


o Logistic regression is one of the most popular Machine Learning algorithms and
comes under the Supervised Learning technique. It is used for predicting a
categorical dependent variable from a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable.
Therefore the outcome must be a categorical or discrete value. It can be either Yes
or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it
gives probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output approaches the two limiting values 0 and 1.
o The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
o Logistic Regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using both continuous and discrete
datasets.
o Logistic Regression can be used to classify observations using different types of
data and can easily determine the most effective variables for the classification.
[Figure: the S-shaped logistic function]

Logistic Function (Sigmoid Function):

This is the Sigmoid function, which produces an S-shaped curve. It converts any real
number into a number between 0 and 1, so in machine learning we use it to translate
predicted values into probabilities.
Mathematically, the sigmoid function can be written as

$f(x) = \frac{1}{1 + e^{-x}}$

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within the range of 0 to 1.
o The output of logistic regression must be between 0 and 1 and cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is called the
Sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value to turn the
probability into a class of either 0 or 1: values above the threshold are assigned to 1, and
values below the threshold are assigned to 0.
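The mapping from a real-valued score to a probability, and then to a class through a threshold, can be sketched in a few lines of Python. The function names and the default threshold of 0.5 are illustrative assumptions, not part of the notes.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z: float, threshold: float = 0.5) -> int:
    """Return class 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# Large negative scores give probabilities near 0, large positive scores near 1.
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

A score of 0 sits exactly on the boundary: the sigmoid returns 0.5 there, so the choice of threshold decides which class it falls into.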

Logistic Regression Equation:

The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:

o We know the equation of the straight line can be written as:

$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$

o In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1-y).
The result is 0 for y = 0 and tends to infinity as y approaches 1:

$\frac{y}{1-y}$

o But we need a range from -infinity to +infinity, so we take the logarithm of this
ratio and set it equal to the linear expression:

$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$

The above equation is the final equation for Logistic Regression.
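As a quick check of the derivation, the sketch below shows that taking the log of the odds y/(1-y) of a sigmoid output recovers the linear expression. The coefficients `b0`, `b` and the input `x` are made-up values chosen only for illustration.

```python
import math

def log_odds(y: float) -> float:
    """Logit: log(y / (1 - y)), defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))

def predict_probability(x, b0, b):
    """Apply the sigmoid to the linear expression b0 + b1*x1 + ... + bn*xn."""
    z = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and input; the linear part is -1 + 0.5*2 + 2*0.25 = 0.5.
b0, b = -1.0, [0.5, 2.0]
x = [2.0, 0.25]

p = predict_probability(x, b0, b)
print(round(p, 4))            # probability between 0 and 1
print(round(log_odds(p), 4))  # recovers the linear part
```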


Types of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, the dependent variable can have only two
possible values, such as 0 or 1, Pass or Fail, True or False.

o Multinomial: In multinomial Logistic regression, the dependent variable can have 3 or more
possible unordered values, such as "cat", "dog", or "sheep".

Example: predicting food preference, i.e. Veg, Non-Veg, Vegan.

o Ordinal: In ordinal Logistic regression, the dependent variable can have 3 or more
possible ordered values, such as "Low", "Medium", or "High".

Example: movie rating from 1 to 5.

How is logistic regression different from linear regression?


In linear regression, the outcome is continuous and can be any possible value. However, in
the case of logistic regression, the predicted outcome is discrete and restricted to a limited
number of values.
For example, say we are trying to apply machine learning to the sale of a house. If we are
trying to predict the sale price based on the size, year built, and number of stories, we would
use linear regression, as linear regression can predict a sale price of any possible value. If we
are using those same factors to predict whether the house sells or not, we would use logistic
regression, as the possible outcomes here are restricted to yes or no.

Hence, linear regression is an example of a regression model and logistic regression is an
example of a classification model.

Where to use logistic regression


Logistic regression is used to solve classification problems, and the most common use case
is binary logistic regression, where the outcome is binary (yes or no). In the real world, you
can see logistic regression applied across multiple areas and fields.
o In health care, logistic regression can be used to predict if a tumor is likely to be
benign or malignant.
o In the financial industry, logistic regression can be used to predict if a transaction is
fraudulent or not.
o In marketing, logistic regression can be used to predict if a targeted audience will
respond or not.

Real-world cases where logistic regression was effectively used:


Credit scoring, Medicine, Hotel Booking, Gaming

Support Vector Machine Algorithm:


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model
that can accurately identify whether it is a cat or a dog. Such a model can be created by
using the SVM algorithm. We first train our model with lots of images of cats and dogs
so that it can learn the different features of cats and dogs, and then we test it with this
strange creature. The SVM creates a decision boundary between the two classes
(cat and dog) and chooses the extreme cases (support vectors) of each. On the basis of
the support vectors, it will classify the creature as a cat.
The SVM algorithm can be used for face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset
can be classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM
classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a
dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features in the dataset: if
there are 2 features (as shown in image), the hyperplane will be a straight line,
and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of each class.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support the
hyperplane, they are called support vectors.
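To make the idea of margin and support vectors concrete, the following sketch measures the distance of a few points from a given hyperplane w·x + b = 0 and picks out the closest ones. It does not train an SVM; the weights, bias and points are hypothetical values chosen so the arithmetic is easy to follow.

```python
import math

def distance_to_hyperplane(point, w, b):
    """Perpendicular distance |w.x + b| / ||w|| from a point to the hyperplane."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separating line x1 + x2 = 5 and four sample points.
w, b = [1.0, 1.0], -5.0
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

distances = [distance_to_hyperplane(p, w, b) for p in points]
margin = min(distances)  # distance from the hyperplane to the nearest point

# The points that sit exactly at the margin play the role of support vectors.
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
print(round(margin, 4), support_vectors)
```

Moving the two far points (1, 1) and (4, 4) would not change the margin here, which is why only the closest points matter for the position of the hyperplane.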

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or
blue. Consider the below image:

Since this is a 2-d space, we can easily separate these two classes by using just a straight line.
But there can be multiple lines that can separate these classes. Consider the below image:

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:

$z = x^2 + y^2$

By adding the third dimension, the sample space will become as in the below image:

So now, SVM will divide the dataset into classes in the following way. Consider the below
image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-y plane. If we
convert it to 2-d space with z=1, then it will become:

Hence we get a circle of radius 1 in the case of non-linear data.
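The effect of adding the third dimension can be sketched as follows. The two small point sets are invented for illustration: one lies inside the unit circle and the other outside it, so no straight line in the x-y plane separates them, but the plane z = 1 does once z = x^2 + y^2 is added.

```python
def lift(x: float, y: float) -> tuple:
    """Add the third dimension z = x^2 + y^2 to a 2-d point."""
    return (x, y, x * x + y * y)

def classify(x: float, y: float, z_cut: float = 1.0) -> str:
    """Separate the lifted points with the plane z = z_cut."""
    return "inner" if lift(x, y)[2] < z_cut else "outer"

# Hypothetical data: one class near the origin, the other surrounding it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

for x, y in inner + outer:
    print((x, y), "z =", round(lift(x, y)[2], 2), "->", classify(x, y))
```

Projected back to two dimensions, the cut z = 1 is the circle x^2 + y^2 = 1.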

Advantages of SVM:
o Effective in high-dimensional cases.
o It is memory efficient, as it uses only a subset of the training points (the
support vectors) in the decision function.
o Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.

Pure: If all the elements of a set belong to a single class, then the set can be called pure.

o Entropy: Entropy is an information theory metric that measures the impurity or
uncertainty in a group of observations. It determines how a decision tree chooses to
split data.
o Entropy is a useful tool in machine learning for understanding concepts such as
feature selection, building decision trees, and fitting classification models. A
machine learning engineer or data scientist is expected to have in-depth
knowledge of entropy.
o For a set S whose observations fall into c classes, where p_i is the proportion of
observations in class i, the entropy E(S) can be mathematically represented as

$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$

o Suppose we have a dataset of 10 observations belonging to two classes, YES and NO. If 6
observations belong to class YES and 4 observations belong to class NO, then the
entropy can be written as below.

$E(S) = -\left(P_{yes} \log_2 P_{yes} + P_{no} \log_2 P_{no}\right)$

o P_yes is the probability of choosing Yes and P_no is the probability of choosing No.
Here P_yes is 6/10 and P_no is 4/10, so E(S) = -(0.6 * log2(0.6) + 0.4 * log2(0.4)), which is
approximately 0.971.

o Information gain: Information gain or IG is a statistical property that measures how
well a given attribute separates the training examples according to their target
classification. Constructing a decision tree is all about finding the attribute that
returns the highest information gain, that is, the smallest entropy after the split.
o Information gain is a decrease in entropy. It computes the difference between the
entropy before the split and the average (weighted) entropy after the split of the dataset
on the given attribute's values.

If S_v is the subset of S in which attribute A takes the value v, we can conclude that:

$IG(S, A) = E(S) - \sum_{v} \frac{|S_v|}{|S|} E(S_v)$

(or)

o Information Gain = Entropy before splitting - Entropy after splitting

o Gini Index: The other way of splitting a decision tree is via the Gini Index. Whereas the
Entropy and Information Gain method focuses on purity and impurity in a node, the
Gini Index (or Gini Impurity) measures the probability of a randomly chosen instance
being misclassified. The lower the Gini Index, the lower
the likelihood of misclassification.
o The formula for the Gini Index is

$Gini\ Index = 1 - \sum_{j} P_j^2$

o where j ranges over the classes in the target variable (for example, Pass and Fail)
and P_j is the proportion of instances that belong to class j.

Decision Tree Classification Algorithm

Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems. It
is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.

o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.

How Does the Decision Tree Work?

Some key terms of a decision tree:

o Root node: The base of the decision tree, which holds the complete dataset.
o Splitting: The process of dividing a node into multiple sub-nodes.
o Decision node: A sub-node that is further split into additional sub-nodes.
o Leaf node: A sub-node that does not split into additional sub-nodes; it represents a
possible outcome.
o Pruning: The process of removing sub-nodes of a decision tree.
o Branch: A subsection of the decision tree consisting of multiple nodes.
o Example: Suppose there is a candidate who has a job offer and wants to decide
whether to accept the offer or not. To solve this problem, the decision
tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits
into the next decision node (distance from the office) and one leaf node, based on
the corresponding labels. The next decision node further splits into one decision
node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offer and Declined offer). Consider the below diagram:


o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where you cannot further
classify the nodes; each such final node is called a leaf node.
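The five steps above can be sketched as a short recursive function. This is a minimal illustration that assumes information gain as the attribute selection measure (an ID3-style choice) and categorical attributes; the toy job-offer rows and the attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the target labels in a list of row dictionaries."""
    counts = Counter(row[target] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Leaf node: all rows agree, or there is no attribute left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step-2: pick the best attribute with the selection measure.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}  # Step-4: the node holds the best attribute
    # Step-3 and Step-5: split on each value and recurse on the subset.
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree

# Hypothetical toy data for the job-offer example.
rows = [
    {"salary": "low", "distance": "near", "accept": "no"},
    {"salary": "low", "distance": "far", "accept": "no"},
    {"salary": "low", "distance": "near", "accept": "no"},
    {"salary": "high", "distance": "near", "accept": "yes"},
    {"salary": "high", "distance": "near", "accept": "yes"},
    {"salary": "high", "distance": "far", "accept": "no"},
]
tree = build_tree(rows, ["salary", "distance"], "accept")
print(tree)
```

On this data the salary attribute gives the larger information gain, so it becomes the root, and the distance attribute is only consulted for high-salary offers.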

In a Decision Tree, the major challenge is the identification of the attribute for the root
node at each level. This process is known as attribute selection. We have two popular attribute
selection measures:
1. Information Gain
2. Gini Index

1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller
subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A =
v, and Values(A) is the set of all possible values of A, then

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$

(or)
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity
of an arbitrary collection of examples. The higher the entropy, the greater the information
content.

The formula for Entropy is shown below, where p_i is the proportion of instances in X that
belong to class i:

$H(X) = -\sum_{i} p_i \log_2 p_i$

Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3

$H(X) = -\left[\frac{3}{8}\log_2\frac{3}{8} + \frac{5}{8}\log_2\frac{5}{8}\right]$
= -[0.375 * (-1.415) + 0.625 * (-0.678)]
= -(-0.53 - 0.424)
= 0.954
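The calculation can be checked with a small Python function. The function name is an illustrative choice; the two inputs are the set X from the example above and the 6 YES / 4 NO dataset used earlier in these notes.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Entropy in bits of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)

print(round(entropy(list("aaabbbbb")), 3))            # 0.954
print(round(entropy(["YES"] * 6 + ["NO"] * 4), 3))    # 0.971
```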
2. Gini Index
o The Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high
Gini index.
o The CART algorithm creates only binary splits, and it uses the Gini index to choose
them.
o The Gini index can be calculated using the below formula, where P_j is the proportion of
instances that belong to class j:

$Gini\ Index = 1 - \sum_{j} P_j^2$
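A minimal sketch of the Gini formula in Python follows. The function name and the sample label lists are assumptions made for illustration.

```python
from collections import Counter

def gini_index(labels) -> float:
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini_index(["yes"] * 4))                        # pure node
print(gini_index(["yes", "no"]))                      # evenly mixed two-class node
print(round(gini_index(["yes"] * 6 + ["no"] * 4), 2)) # 6 vs 4 split
```

A pure node scores 0 and an evenly mixed two-class node scores 0.5, which is the largest value the Gini index can take with two classes.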


Advantages:
Works for numerical or categorical data and variables.
Models problems with multiple outputs.
Tests the reliability of the tree.
Requires less data cleaning than other data modeling techniques.
Easy to explain to those without an analytical background.

Disadvantages
Affected by noise in the data.
Not ideal for large datasets.
Can disproportionately value, or weigh, attributes.
The decisions at nodes are limited to binary outcomes, reducing the complexity that the
tree can handle.
Trees can become very complex when dealing with uncertainty and numerous linked
outcomes.

Constructing a Decision Tree:


Let us take an example of the last 14 days' weather dataset with attributes outlook,
temperature, wind, and humidity. The outcome variable will be playing cricket or not. We
will use the ID3 algorithm to build the decision tree.

Step1: The first step will be to create a root node.

Step2: If all results are yes, then the leaf node "yes" will be returned; if all results are
no, then the leaf node "no" will be returned.
Step3: Find out the entropy of all observations and the entropy with attribute "x", that is E(S)
and E(S, x).
Step4: Find out the information gain and select the attribute with the highest information gain.
Step5: Repeat the above steps until all attributes are covered.

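Calculation of Entropy:
Yes   No
9     5

$E(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$

If entropy is zero, it means that all members belong to the same class, and if entropy is one,
it means that half of the tuples belong to one class and the other half belong to the other
class. A value of 0.94 means the two classes are fairly evenly distributed.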

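Find the attribute which gives the maximum information gain.

For example, "Wind" takes two values, Strong and Weak; therefore, x = {Strong, Weak}.

Find out H(x) and P(x) for x = Weak and x = Strong. H(S) is already calculated above.

Weak = 8

Strong = 6

For "Weak" wind, 6 of the days say "Yes" to play cricket and 2 of them say "No". So the
entropy will be:

$H(S_{weak}) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} \approx 0.811$

For "Strong" wind, the remaining 3 "Yes" and 3 "No" days give:

$H(S_{strong}) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1$

This shows perfect randomness, as half of the items belong to one class and the remaining half
belong to the other.
Calculate the information gain:

$IG(S, Wind) = H(S) - \frac{8}{14} H(S_{weak}) - \frac{6}{14} H(S_{strong}) \approx 0.940 - 0.463 - 0.429 = 0.048$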
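The information gain for "Wind" can be worked through in code. This sketch assumes that strong wind covers the remaining 6 days with 3 "Yes" and 3 "No", which is what the totals of 9 "Yes" and 5 "No" imply once the 8 weak-wind days (6 "Yes", 2 "No") are removed. The function names are illustrative.

```python
import math

def entropy_from_counts(counts) -> float:
    """Entropy in bits from a tuple of class counts, e.g. (yes, no)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts) -> float:
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy_from_counts(child)
                   for child in child_counts)
    return entropy_from_counts(parent_counts) - weighted

parent = (9, 5)           # 9 Yes, 5 No over all days
wind = [(6, 2), (3, 3)]   # Weak: 6 Yes / 2 No; Strong (assumed): 3 Yes / 3 No

print(round(entropy_from_counts(parent), 3))     # 0.94
print(round(entropy_from_counts(wind[0]), 3))    # 0.811
print(round(entropy_from_counts(wind[1]), 3))    # 1.0
print(round(information_gain(parent, wind), 3))  # 0.048
```

The gain of about 0.048 is well below the 0.246 reported for outlook, which is consistent with outlook being chosen as the root.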

The attribute outlook has the highest information gain of 0.246, thus it is chosen as the root.
Outlook has 3 values: Sunny, Overcast and Rain. Overcast with play cricket is always "Yes",
so it ends up with a leaf node, "yes". The other values, "Sunny" and "Rain", must be split further.

Table for Outlook as "Sunny" will be:


The information gain for humidity is highest, therefore it is chosen as the next node.
Similarly, entropy is calculated for Rain, where wind gives the highest information gain.
The decision tree would look like below:

Entropy v/s Gini Impurity: For a two-class problem, entropy ranges from 0 to 1 and Gini
impurity ranges from 0 to 0.5. The narrower range does not by itself make one measure better;
the two usually select similar splits. Gini impurity is often preferred in practice because it
avoids the logarithm and is therefore cheaper to compute.
Random Forest Algorithm:

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and, based on the majority vote of those
predictions, predicts the final output.

A greater number of trees in the forest generally leads to higher accuracy and reduces the
risk of overfitting, although the improvement levels off as more trees are added.

Real Life Analogy:

Let’s dive into a real-life analogy to understand this concept further. A student named X
wants to choose a course after his 10+2, and he is confused about the choice of course
based on his skill set. So he decides to consult various people like his cousins, teachers,
parents, degree students, and working people. He asks them varied questions like why he
should choose, job opportunities with that course, course fee, etc. Finally, after consulting
various people about the course he decides to take the course suggested by most of the
people.

Working of Random Forest Algorithm

Before understanding the working of the random forest algorithm in machine learning, we
must look into the ensemble technique. Ensemble simply means combining multiple
models. Thus a collection of models is used to make predictions rather than an individual
model.

Ensemble uses two types of methods:

1. Bagging: It creates different training subsets from the sample training data with replacement,
and the final output is based on majority voting. For example, Random Forest.

2. Boosting: It combines weak learners into strong learners by creating sequential models
such that the final model has the highest accuracy. For example, AdaBoost, XGBoost.

Bagging: Bagging chooses a random sample from the data set. Each model is
generated from samples (Bootstrap Samples) drawn from the original data with
replacement, which is known as row sampling. This step of row sampling with replacement is
called bootstrap. Each model is then trained independently and generates its own results. The
final output is based on majority voting after combining the results of all models. This step,
which involves combining all the results and generating output based on majority voting, is
known as aggregation.

Here the bootstrap samples (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap
sample 03) are taken from the actual data with replacement, which means each sample is
likely to contain repeated rows. The models (Model 01, Model
02, and Model 03) obtained from these bootstrap samples are trained independently. Each
model generates results as shown. The Happy emoji has a majority when compared to the
Sad emoji, so based on majority voting the final output is the Happy emoji.
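Bootstrap sampling and aggregation can be sketched with the Python standard library. The helper names, the ten-row stand-in dataset, and the three model outputs are hypothetical; a real random forest would train a decision tree on each sample instead of using fixed outputs.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Row sampling with replacement: draw len(rows) rows at random."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Aggregation: return the most common prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)   # fixed seed only so that a run is repeatable
data = list(range(10))   # stand-in for ten training rows

samples = [bootstrap_sample(data, rng) for _ in range(3)]
for number, sample in enumerate(samples, start=1):
    print(f"Bootstrap sample {number:02d}: {sample} "
          f"({len(set(sample))} distinct rows)")

# Hypothetical outputs of three independently trained models for one input.
print(majority_vote(["happy", "sad", "happy"]))  # happy
```

Because rows are drawn with replacement, a sample usually repeats some rows and leaves others out, which is what makes the individual models differ from one another.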
Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

o It often takes less training time than many other algorithms, since the trees can be
built independently of one another.
o It predicts output with high accuracy, and it runs efficiently even on large datasets.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions with each tree created in the first phase.

The working process can be explained in the below steps and diagram:

Step-1: Select K random data points from the training set.

Step-2: Build the decision tree associated with the selected data points (subset).

Step-3: Choose the number N of decision trees that you want to build.

Step-4: Repeat Steps 1 and 2 until N trees have been built.

Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority of votes.
Example: Suppose there is a dataset that contains multiple fruit images, and this
dataset is given to the Random Forest classifier. The dataset is divided into subsets
and given to each decision tree. During the training phase, each decision tree
produces a prediction result, and when a new data point occurs, the Random Forest
classifier predicts the final decision based on the majority of results. Consider
the below image:

Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression tasks.


o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest:

o Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.
Difference Between Decision Tree & Random Forest

Random forest is a collection of decision trees; still, there are a lot of differences in their
behavior.

| Decision trees | Random Forest |
|---|---|
| 1. Decision trees normally suffer from the problem of overfitting if they are allowed to grow without any control. | 1. Random forests are created from subsets of data and the final output is based on average or majority ranking, and hence the problem of overfitting is taken care of. |
| 2. A single decision tree is faster in computation. | 2. It is comparatively slower. |
| 3. When a data set with features is taken as input by a decision tree, it will formulate some set of rules to do prediction. | 3. Random forest randomly selects observations, builds a decision tree and the average result is taken. It doesn't use any set of formulas. |

Biological Neuron: A human brain has billions of neurons. Neurons are interconnected
nerve cells in the human brain that are involved in processing and transmitting chemical and
electrical signals. Dendrites are branches that receive information from other neurons.

The cell nucleus or soma processes the information received from the dendrites. The axon is a
cable that is used by neurons to send information. A synapse is the connection between an axon
and the dendrites of another neuron.

What is an Artificial Neuron: An artificial neuron is a mathematical function based on a model
of biological neurons, where each neuron takes inputs, weighs them separately, sums them
up and passes this sum through a nonlinear function to produce an output.
Perceptron: The Perceptron was introduced by Frank Rosenblatt in 1957. He proposed a
Perceptron learning rule based on the original MCP (McCulloch-Pitts) neuron. A Perceptron
is an algorithm for supervised learning of binary classifiers. This algorithm enables neurons
to learn, and it processes the elements in the training set one at a time.

Input Nodes or Input Layer:


This is the primary component of Perceptron which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.

Weight and Bias:

The weight parameter represents the strength of the connection between units. It is another
important component of the Perceptron. The weight is directly proportional to the
strength of the associated input neuron in deciding the output. Further, the bias can be
considered as the intercept term in a linear equation.

Activation Function:

This is the final important component, which helps to determine whether the neuron
will fire or not. In the basic perceptron, the activation function is a step function.

Types of Activation functions: Sign function, Step function, and Sigmoid function

The data scientist chooses the activation function based on the problem statement and the
desired outputs. The choice (e.g., Sign, Step, or Sigmoid) may also depend on whether the
learning process is slow or has vanishing or exploding gradients.

How Does Perceptron Work?


Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values by their corresponding weight values and then
add them to determine the weighted sum. Mathematically, we can calculate the weighted
sum as follows:

$\sum_i w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

Add a special term called bias 'b' to this weighted sum to improve the model's performance.

$\sum_i w_i x_i + b$

Step-2: In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives us an output either in binary form or as a continuous value:

$Y = f\left(\sum_i w_i x_i + b\right)$
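The two steps can be sketched directly in Python. The weights and bias below are hand-picked, hypothetical values that make the perceptron behave like an AND gate; they are not learned.

```python
def step(z: float) -> int:
    """Step activation: fire (1) when the input is non-negative, else 0."""
    return 1 if z >= 0 else 0

def perceptron_output(inputs, weights, bias, activation=step):
    # Step-1: weighted sum of the inputs plus the bias term.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step-2: apply the activation function.
    return activation(weighted_sum)

# Hypothetical parameters: only the input (1, 1) pushes the sum above zero.
weights, bias = [1.0, 1.0], -1.5

for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(inputs, "->", perceptron_output(inputs, weights, bias))
```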

Types of Perceptron models

Single Layer Perceptron model: This is one of the simplest ANN (Artificial Neural Network)
types. It consists of a feed-forward network and includes a threshold transfer function inside
the model. The main objective of the single-layer perceptron model is to analyze linearly
separable objects with binary outcomes. A single-layer perceptron can learn only linearly
separable patterns.
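A minimal sketch of the perceptron learning rule shows this limitation in practice. The function name, the learning rate of 1.0 and the cap of 20 epochs are illustrative assumptions. Training on the AND truth table reaches an error-free epoch, while training on XOR never does, because no straight line separates its classes.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule; returns weights, bias and last-epoch errors."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    errors = 0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            z = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if z >= 0 else 0
            delta = target - output
            if delta != 0:
                # Nudge the weights and bias toward the correct answer.
                errors += 1
                weights = [w + learning_rate * delta * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * delta
        if errors == 0:   # every sample classified correctly: stop early
            break
    return weights, bias, errors

def predict(inputs, weights, bias):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else 0

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, b_and, and_errors = train_perceptron(and_data)
w_xor, b_xor, xor_errors = train_perceptron(xor_data)
print("AND errors in last epoch:", and_errors)
print("XOR errors in last epoch:", xor_errors)
```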

Multi-Layered Perceptron model: It is similar to a single-layer perceptron model but
has one or more hidden layers.

Forward Stage: From the input layer in the on stage, activation functions begin and
terminate on the output layer.

Backward Stage: In the backward stage, weight and bias values are modified as the model
requires. The error between the actual output and the desired output is propagated
backward, starting from the output layer. A multilayer perceptron model has greater
processing power and can process both linear and non-linear patterns. It can also
implement logic gates such as AND, OR, XOR, XNOR, and NOR.
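
To make the XOR claim concrete, here is a minimal sketch of a two-layer network of threshold units that computes XOR. The weights and thresholds are set by hand for illustration (they are assumptions, not values produced by training), and the names `h_or` and `h_and` are hypothetical.

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: two threshold units with hand-picked weights.
    h_or = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only when both inputs are 1
    # Output layer: "OR but not AND" is exactly XOR.
    return step(h_or - h_and - 0.5)

xor_outputs = [xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```

The hidden layer is what makes this possible: each hidden unit draws one line in the input plane, and the output unit combines the two.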

Advantages:
A multi-layered perceptron model can solve complex non-linear problems.
It works well with both small and large input data.
Helps us to obtain quick predictions after the training.
Helps us obtain the same accuracy ratio with big and small data.
Disadvantages:
In a multi-layered perceptron model, computations are time-consuming and complex.
It is tough to predict how much each independent variable affects the dependent variable.
How well the model functions depends on the quality of training.
Limitations of Perceptron Model
A perceptron model has limitations as follows:
The output of a perceptron can only be a binary number (0 or 1) due to the hard-limit
transfer function.
A perceptron can only be used to classify linearly separable sets of input vectors. If the
input vectors are not linearly separable, it cannot classify them properly.
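
The linear-separability limitation can be illustrated with a small brute-force search. This sketch tries every weight and bias combination on a coarse grid (the grid itself is an arbitrary assumption) and checks whether any single threshold unit reproduces a gate. It is a demonstration rather than a proof, although for XOR no weights of any size would work.

```python
from itertools import product

def classify(w1, w2, b, x):
    return 1 if w1 * x[0] + w2 * x[1] + b >= 0 else 0

def separable_on_grid(samples, grid):
    """Return True if some (w1, w2, b) on the grid classifies every sample correctly."""
    for w1, w2, b in product(grid, repeat=3):
        if all(classify(w1, w2, b, x) == target for x, target in samples):
            return True
    return False

# Hypothetical search grid: values from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]

or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
xor_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

or_found = separable_on_grid(or_gate, grid)
xor_found = separable_on_grid(xor_gate, grid)
print(or_found)   # True: a single line separates the OR outputs
print(xor_found)  # False: no single line separates the XOR outputs
```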
What is a Feed Forward Neural Network?
A Feed Forward Neural Network is an artificial neural network in which the connections
between nodes do not form a cycle. Its opposite is a recurrent neural network, in which
certain pathways form cycles. The feed forward model is the simplest form of neural
network, as information is only processed in one direction. While the data may pass
through multiple hidden nodes, it always moves forward and never backwards.

(or)

How does a Feed Forward Neural Network work?

A Feed Forward Neural Network is commonly seen in its simplest form as a single
layer perceptron. In this model, a series of inputs enter the layer and are multiplied by the
weights. The products are then added together to get a sum of the weighted input values. If
the sum is above a specific threshold, usually set at zero, the output is typically 1, whereas
if the sum falls below the threshold, the output is -1. The single layer perceptron is an
important model of feed forward neural networks and is often used in classification tasks.
Furthermore, single layer perceptrons can learn from data. Using a method known as the
delta rule, the neural network compares the outputs of its nodes with the intended values
and adjusts its weights through training to produce more accurate outputs. This training
process is a form of gradient descent. In multi-layered perceptrons, the process of updating
weights is closely analogous, but it is known more specifically as back-propagation. In such
cases, the weights of each hidden layer are adjusted according to the error measured at
the output of the final layer.
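
A minimal sketch of the delta rule follows. It assumes a single linear unit (no threshold) and a squared-error objective, for which the update is weight += learning_rate * (target - output) * input. The data set, the learning rate, and the number of epochs are made-up values for illustration.

```python
def train_delta_rule(samples, learning_rate=0.1, epochs=200):
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = weight * x + bias           # linear unit, no threshold
            error = target - output              # intended value minus actual output
            weight += learning_rate * error * x  # delta rule update for the weight
            bias += learning_rate * error        # same update with a constant input of 1
    return weight, bias

# Made-up samples that follow target = 2*x + 1 exactly.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
weight, bias = train_delta_rule(samples)
print(round(weight, 3), round(bias, 3))  # should approach 2 and 1
```

Each update moves the weight a small step in the direction that reduces the squared error for that sample, which is why the process can be read as gradient descent.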

Applications of Feed Forward Neural Networks:

Simple classification (where traditional Machine-learning based classification algorithms have limitations)
Face recognition [Simple straight forward image processing]
Computer vision [Where target classes are difficult to classify]
Speech Recognition
Advantages of Feed Forward Neural Networks
Less complex, easy to design & maintain
Fast and speedy [One-way propagation]
Highly responsive to noisy data
Disadvantages of Feed Forward Neural Networks:
In its single-layer form, cannot be used for deep learning [due to absence of hidden layers]

What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method of fine-tuning
the weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the
model reliable by increasing its generalization.

Backpropagation in neural networks is short for “backward propagation of errors.” It is a
standard method of training artificial neural networks. This method helps calculate the
gradient of a loss function with respect to all the weights in the network.

How Backpropagation Algorithm Works

The backpropagation algorithm computes the gradient of the loss function with respect to
each weight using the chain rule. It computes the gradient efficiently, one layer at a time,
unlike a naive direct computation. It computes the gradient, but it does not define how the
gradient is used. It generalizes the computation in the delta rule.
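
The chain-rule computation for a single weight can be sketched as follows. The example assumes one sigmoid neuron and a squared-error loss L = 0.5 * (y - target)^2, neither of which is fixed by the text, and it checks the analytic gradient against a finite-difference estimate. All numeric values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    y = sigmoid(w * x + b)
    return 0.5 * (y - target) ** 2

def grad_w(w, b, x, target):
    """dL/dw = dL/dy * dy/dz * dz/dw, with z = w*x + b and y = sigmoid(z)."""
    y = sigmoid(w * x + b)
    dL_dy = y - target       # derivative of the squared-error loss
    dy_dz = y * (1.0 - y)    # derivative of the sigmoid
    dz_dw = x                # derivative of the weighted sum
    return dL_dy * dy_dz * dz_dw

# Made-up values for illustration.
w, b, x, target = 0.3, -0.1, 1.5, 1.0

# Central finite difference as an independent check of the chain-rule result.
eps = 1e-6
numeric = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
analytic = grad_w(w, b, x, target)
print(analytic, numeric)
```

The two numbers should agree to many decimal places. In a multi-layer network the same product of local derivatives is extended backward one layer at a time.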

Inputs X arrive through the preconnected path.

The input is modeled using real-valued weights W. The weights are usually initialized randomly.

Calculate the output for every neuron from the input layer, to the hidden layers, to the
output layer.

Calculate the error in the outputs

Error = Actual Output - Desired Output

Travel back from the output layer to the hidden layer to adjust the weights such that the
error is decreased.

Keep repeating the process until the desired output is achieved
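
The steps above can be sketched as a small training loop. This is a minimal sketch under several assumptions not stated in the text: a network with 2 inputs, 3 sigmoid hidden units and 1 sigmoid output, a squared-error loss, per-sample gradient descent updates, and the XOR data set. The learning rate, epoch count, and seed are arbitrary.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Compute the hidden activations, then the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def mean_loss(data, params):
    return sum(0.5 * (forward(x, *params)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, n_hidden=3, learning_rate=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    # Weights start at random values.
    w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = 0.0
    initial = mean_loss(data, (w_hidden, b_hidden, w_out, b_out))
    for _ in range(epochs):  # keep repeating the process
        for x, target in data:
            # Forward pass: input layer -> hidden layer -> output layer.
            hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
            # Error in the output (actual output minus desired output).
            error = output - target
            # Travel back: chain rule through the output, then the hidden layer.
            delta_out = error * output * (1.0 - output)
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            # Adjust the weights so that the error decreases.
            for j in range(n_hidden):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                for i in range(2):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
                b_hidden[j] -= learning_rate * delta_hidden[j]
            b_out -= learning_rate * delta_out
    final = mean_loss(data, (w_hidden, b_hidden, w_out, b_out))
    return initial, final

xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
initial_loss, final_loss = train(xor_data)
print(initial_loss, final_loss)
```

The final loss should be lower than the initial loss. How close it gets to zero depends on the random starting weights, since small networks can settle in poor solutions on XOR.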


Why We Need Backpropagation?

Most prominent advantages of Backpropagation are:

Backpropagation is fast, simple and easy to program

It has no parameters to tune apart from the number of inputs

It is a flexible method as it does not require prior knowledge about the network

It is a standard method that generally works well

It does not need any special mention of the features of the function to be learned.

Types of Back Propagation

There are two types of backpropagation networks.

Static backpropagation: Static backpropagation is a network designed to map static inputs
to static outputs. These types of networks are capable of solving static classification
problems such as OCR (Optical Character Recognition).

Recurrent backpropagation: Recurrent backpropagation is another type of network, used for
fixed-point learning. Activations in recurrent backpropagation are fed forward until a fixed
value is reached. Static backpropagation provides an instant mapping, while recurrent
backpropagation does not.

Advantages:

It is simple, fast, and easy to program.

Only the number of inputs is tuned, not any other parameter.

It is Flexible and efficient.

No need for users to learn any special functions.

Disadvantages:

It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.

Performance is highly dependent on input data.

Training can take a long time.

The matrix-based approach is preferred over a mini-batch approach.
