Unit 2 Data Analytics
Data Analytics
BIT-601
Ashish Tripathi
Assistant Professor
Department of Information Technology
Pranveer Singh Institute of Technology
Prerequisite:
Regression Modelling
Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target)
variable and one or more independent (predictor) variables. It is used for forecasting, time series modelling and
estimating cause-and-effect relationships between variables.
For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.
Regression analysis is an important tool for modelling and analyzing data. Here, we fit a curve or line to the data points in such
a manner that the distances of the data points from the curve or line (usually the sum of their squares) are minimized.
In other words, “regression is a statistical technique to determine the linear relationship between two or more
variables.”
In its simplest (bivariate) form, regression analyses two variables (often denoted X and Y) to determine the empirical
relationship between them: it shows the relationship between one independent variable (X) and a dependent variable (Y),
as in the formula below:
Y = β0+β1X+ µ
• The magnitude and direction of that relation are given by the slope parameter (β1)
• The status of the dependent variable when the independent variable is absent is given by the intercept parameter (β 0).
• An error term (µ) captures the amount of variation not predicted by the slope and intercept terms.
• The coefficient of determination (R²) shows how well the values fit the data.
Terminologies Related to the Regression Analysis
• Dependent Variable: The main factor in Regression analysis which we want to predict or understand is called the
dependent variable. It is also called target, Response, Regressand, Predicted and Output variable.
• Independent Variable: The factors which affect the dependent variable, or which are used to predict its values, are
called independent variables. They are also called Predictor, Regressor, Explanatory or Input variables.
• Outliers: An outlier is an observation with a very low or very high value in comparison to the other
observed values. An outlier may distort the result, so it should be detected and handled with care.
• Multi-collinearity: If the independent variables are highly correlated with each other, the condition is called
multi-collinearity. It should not be present in the dataset, because it makes it difficult to rank the variables by
their effect on the dependent variable.
• Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with test dataset, then
such problem is called Overfitting. And if our algorithm does not perform well even with training dataset, then such
problem is called underfitting.
Regression models
Regression models are of two types: simple regression models and multiple regression models. Both are divided into linear and
nonlinear models.
The interpretation of slope (β1) highlights how Y changes by β1 for each 1 unit increase in X.
Similarly, the Y-Intercept (β0) highlights the average value of Y when X = 0, for example if β0 = 4, then average Y is expected to be 4
when X is 0.
Example: Consider the following data about food intake by cows and milk yield collected from a cattle farm:
What is the relationship between cows’ food intake and milk yield?
Food (kg)    Milk Yield (litres)
4            3.0
6            5.5
10           6.5
12           9.0
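As a rough illustration, the least-squares line for the cattle-farm data can be computed directly from the four observations. This is a minimal sketch in plain Python; the function name `fit_line` is an assumption introduced here and is not part of the original notes.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

food = [4, 6, 10, 12]          # food intake in kg
milk = [3.0, 5.5, 6.5, 9.0]    # milk yield in litres

b0, b1 = fit_line(food, milk)
print(f"Milk yield = {b0:.2f} + {b1:.2f} * Food")
```

Under these assumptions the fitted line is roughly Milk yield = 0.80 + 0.65 × Food, so each extra kilogram of food is associated with about 0.65 litres more milk.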
Advantages:
• Can be used to predict the future: Fitted to a relevant data set, a regression model can forecast useful quantities
such as stock prices, medical outcomes and even public sentiment, although the accuracy depends on the data and the model.
• Can be used to back major decisions and policies: Results from regression analysis add scientific backing to a
decision or policy and make it more reliable, as its likelihood of success can be estimated.
• Can correct an error in thinking: Sometimes a discrepancy between the prediction of a regression analysis
and a decision or belief can help expose the fallacy behind that decision.
• Provides a new perspective: Large data sets realize their potential to provide new dimensions to a study through the
application of regression analysis.
Coefficient of Determination
The coefficient of determination (denoted by R²) is a key output of regression
analysis. It is interpreted as the proportion of the variance in the dependent variable
that is predictable from the independent variable.
• The coefficient of determination is the square of the correlation (R) between
predicted y scores and actual y scores; thus, it ranges from 0 to 1.
• An R² of 0 means that the dependent variable cannot be predicted from the
independent variable.
• An R² of 1 means the dependent variable can be predicted without error from the
independent variable.
• An R² between 0 and 1 indicates the extent to which the dependent variable is
predictable. An R² of 0.10 means that 10 percent of the variance in Y is
predictable from X; an R² of 0.20 means that 20 percent is predictable; and so
on.
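The definition above can be checked numerically. The sketch below computes R² as 1 minus the ratio of the residual sum of squares to the total sum of squares, using the food and milk figures from the cattle-farm example in these notes; the function name `r_squared` is an assumption made for illustration.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual (unexplained) and total sums of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

food = [4, 6, 10, 12]          # food intake in kg
milk = [3.0, 5.5, 6.5, 9.0]    # milk yield in litres
r2 = r_squared(food, milk)
print(round(r2, 3))
```

For this data R² comes out at about 0.91, meaning that roughly 91 percent of the variance in milk yield is predictable from food intake under a straight-line model.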
Multivariate Analysis
Multivariate analysis is essentially the statistical process of simultaneously analyzing multiple independent (or
predictor) variables with multiple dependent (outcome or criterion) variables. Multivariate analysis (MVA) can help
summarize the data and can also reduce the chance of obtaining spurious results.
Multiple linear regression (MLR) aims to quantify the degree of linear association between one response variable
and several explanatory variables. It also refers to a set of techniques for studying the straight-line relationships
among two or more variables.
The general MLR equation is y = β0 + β1x1 + β2x2 + ... + βnxn + µ. A response variable (y; known as the regressand) is
predicted by a number of explanatory variables (x1, x2 ... xn; the regressors). The strength of each regressor's effect on
the response variable is determined by the regression coefficients β1 ... βn; β0 is the intercept and µ is the error term.
Multivariate Analysis methods
There are two general types of Multivariate analysis as given below:
a) Analysis of dependence: Used when one or more variables depend on others. It is a category of multivariate
statistical techniques in which dependence methods explain or predict one or more dependent variables on the basis
of two or more independent variables.
E.g. Multiple and Partial Least Squares (PLS) regression, Multiple Discriminant Analysis (MDA)
Variance analysis: Determines the influence of several or individual variables on groups by calculating statistical
averages. Here you can compare variables within a group as well as across different groups, depending on where
deviations are expected. For example: Which groups most often click on the 'Buy Now' button in
your shopping cart?
Discriminant analysis: Used in the context of variance analysis to differentiate between groups that can be
described by similar or identical characteristics. Multiple discriminant analysis (MDA), also known as canonical
variates analysis (CVA) or canonical discriminant analysis (CDA), constructs functions to maximally discriminate
between n groups of objects. For example, by which variables do different groups of buyers differ?
b) Analysis of interdependence: Used when no variable is singled out as dependent on the others. It is a category of
multivariate statistical techniques in which interdependence methods give meaning to a set of variables or seek to
group things together.
E.g. Cluster analysis, factor analysis
Factor analysis: Reduces the data to its relevant structure and to a smaller number of underlying variables. Factor
studies focus on different variables, so they are further subdivided into principal component analysis and
correspondence analysis. For example: Which website elements have the greatest influence on purchasing behavior?
Cluster analysis: Observations are assigned to groups on the basis of their similarity across the variables. The
results are clusters and segments, such as the buyers of a particular product
who are between 35 and 47 years old and have a high income.
Question: Apply regression analysis to predict the salary of an employee with 5 years of experience, based on the salary data
below, and find the equation of the regression line.
Years of Experience Salary
1 30000
2 35000
3 45000
4 50000
5 X
X      Y         XY         X²     Y²
1      30000     30000      1      900,000,000
2      35000     70000      4      1,225,000,000
3      45000     135000     9      2,025,000,000
4      50000     200000     16     2,500,000,000
ΣX = 10   ΣY = 160,000   ΣXY = 435,000   ΣX² = 30   ΣY² = 6,650,000,000
Means: X̄ = ΣX / n = 2.5, Ȳ = ΣY / n = 40,000
Slope (β₁): Intercept (β₀):
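One way to fill in the slope and intercept for this question is the standard least-squares formula written in terms of the column sums, with n = 4 observations. The working below is a sketch based on the sums in the table, not the instructor's own solution:

```latex
\begin{aligned}
\beta_1 &= \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}
         = \frac{4(435000) - (10)(160000)}{4(30) - 10^2}
         = \frac{140000}{20} = 7000 \\
\beta_0 &= \bar{Y} - \beta_1 \bar{X}
         = \frac{160000}{4} - 7000 \cdot \frac{10}{4}
         = 40000 - 17500 = 22500 \\
\hat{Y}(5) &= 22500 + 7000(5) = 57500
\end{aligned}
```

On this working the fitted line is Y = 22500 + 7000X, and the predicted salary at 5 years of experience is about 57,500.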
Question: Apply regression modelling to calculate the linear regression equation of the form Y = β0 + β1X. Also predict the
price of a house of 150 sq m, based on the following data:
House    Size (X) (sq m)    Price (Y) (100000s)
1        30                 20
2        50                 30
3        70                 40
4        100                60
5        120                70
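A possible working for this question, using deviations from the means rather than raw column sums, is sketched below. It is an illustration in plain Python under the usual least-squares assumptions; the variable names are chosen here and do not come from the original notes.

```python
sizes = [30, 50, 70, 100, 120]     # X, house size in sq m
prices = [20, 30, 40, 60, 70]      # Y, price in units of 100000

n = len(sizes)
mean_x = sum(sizes) / n            # 74.0
mean_y = sum(prices) / n           # 44.0

# Slope = S_xy / S_xx, intercept = mean_y - slope * mean_x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
s_xx = sum((x - mean_x) ** 2 for x in sizes)
beta1 = s_xy / s_xx
beta0 = mean_y - beta1 * mean_x

predicted = beta0 + beta1 * 150    # price for a 150 sq m house
print(f"Y = {beta0:.3f} + {beta1:.3f} X; predicted price = {predicted:.2f}")
```

This gives approximately Y = 1.992 + 0.568X, and a predicted price of about 87.14 (in units of 100000) for a 150 sq m house.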
Bayesian Modeling
Bayesian Statistics
BAYESIAN INFERENCE & NETWORK
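Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and
a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the
posterior probability according to Bayes' theorem:
P(H | E) = P(E | H) · P(H) / P(E)
where H is the hypothesis, E is the observed evidence, P(H) is the prior, P(E | H) is the likelihood and P(H | E) is the posterior.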
Question: A clinic uses a Bayesian network to diagnose a rare disease.
If a patient tests positive, what is the probability they actually have the disease?
P(Disease | Positive) = ?
Question: A security system at an airport uses two sensors: Metal Detector and Bag Scanner.
If both sensors alarm (MetalDetector=True and BagScanner=True), what is the probability that the passenger poses
a threat?
Find: P(T | M, B)
T    P(M|T)    P(B|T)
Y    0.95      0.90
N    0.02      0.03
P(T) = 0.001 (Threat)
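One way to work this question is to apply Bayes' theorem with the figures given. The network diagram is not reproduced in these notes, so the sketch below assumes the structure T → M and T → B, meaning the two sensors are conditionally independent given the threat status. That assumption, and the variable names, are introduced here for illustration.

```python
p_t = 0.001                               # prior P(T = yes)
p_m_given = {True: 0.95, False: 0.02}     # P(M = true | T)
p_b_given = {True: 0.90, False: 0.03}     # P(B = true | T)

# Joint probability of both alarms with and without a threat,
# assuming M and B are conditionally independent given T.
joint_threat = p_m_given[True] * p_b_given[True] * p_t
joint_no_threat = p_m_given[False] * p_b_given[False] * (1 - p_t)

# Bayes' theorem: normalise by the total probability of both alarms.
posterior = joint_threat / (joint_threat + joint_no_threat)
print(round(posterior, 4))
```

Under that assumption P(T | M, B) is about 0.59. Even with both sensors alarming, the very low prior keeps the probability of a real threat well short of certainty.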
A Bayesian belief network is a key computer technology for dealing with probabilistic events and
for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph."
Bayesian networks are probabilistic because they are built from a probability
distribution and use probability theory for prediction and anomaly detection.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
•Each node corresponds to a random variable, which can be continuous or discrete.
•Arcs (directed arrows) represent the causal relationships or conditional probabilities between
random variables. These directed links connect pairs of nodes in the graph.
A link means that one node directly influences the other node; if there is no directed link between two nodes,
they have no direct dependence on each other.
• In the above diagram, A, B, C, and D are random variables represented by the nodes
of the network graph.
• If we are considering node B, which is connected with node A by a directed arrow,
then node A is called the parent of Node B.
• Node C is independent of node A.
Analysis of Time Series
Time series analysis comprises methods for analyzing time series data in order to extract meaningful
statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future
values based on previously observed values.
In the following table, we can see the basic structure of time series data. In this case the observations are
recorded every hour.
• Time-domain vs. Frequency-domain
– Time-domain approach: how does what happened today affect what will happen tomorrow? These approaches
view the investigation of lagged relationships as most important, e.g. autocorrelation analysis.
– Frequency-domain approach: what is the economic cycle through periods of expansion and recession? These
approaches view the investigation of cycles as most important, e.g. spectral analysis and wavelet analysis
Autocorrelation
Informally, autocorrelation is the similarity between observations as a function of the time lag between them.
Autocorrelation represents the degree of similarity between a given time series and a lagged version of itself over
successive time intervals. It measures the relationship between a variable's current value and its past
values. An autocorrelation of +1 represents a perfect positive correlation, while an autocorrelation of -1
represents a perfect negative correlation. It is also known as serial correlation. For example, the temperatures on different
days in a month are autocorrelated.
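The idea of comparing a series with a lagged copy of itself can be sketched in a few lines. The function below uses the common sample autocorrelation estimate (lagged products of deviations divided by the total squared deviation); the temperature values are made up for illustration and the name `autocorrelation` is an assumption introduced here.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total squared deviation from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: products of deviations that are `lag` steps apart.
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

# Hypothetical daily temperatures, made up for illustration only.
temps = [21, 22, 24, 23, 25, 27, 26, 28]
lag0 = autocorrelation(temps, 0)
lag1 = autocorrelation(temps, 1)
print(lag0, round(lag1, 3))
```

At lag 0 a series is perfectly correlated with itself, and for this slowly rising series the lag-1 value is clearly positive, which is what serial correlation in daily temperatures would look like.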
Rule Induction
Rule induction is an area of machine learning in which formal rules are extracted from a set of
observations. The rules extracted may represent a full scientific model of the data, or merely represent
local patterns in the data.
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Rule Extraction
To extract a rule from a decision tree −
•One rule is created for each path from the root to the leaf node.
•To form a rule antecedent, each splitting criterion is logically ANDed.
•The leaf node holds the class prediction, forming the rule consequent.
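The three points above can be sketched as a short recursive walk over a tree. The nested-dictionary tree below is a hypothetical toy example chosen so that one of its paths reproduces rule R1; the representation and the name `extract_rules` are assumptions made for illustration.

```python
def extract_rules(tree, conditions=()):
    """Yield one (antecedent, consequent) pair per root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf: the class label is the rule consequent.
        yield list(conditions), tree
        return
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        # Each split on the path is ANDed into the antecedent.
        yield from extract_rules(subtree, conditions + ((attribute, value),))

# Hypothetical toy tree; internal nodes are {attribute: {value: subtree}}.
toy_tree = {
    "age": {
        "youth": {"student": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

rules = list(extract_rules(toy_tree))
for antecedent, label in rules:
    body = " AND ".join(f"{a} = {v}" for a, v in antecedent)
    print(f"IF {body} THEN buy_computer = {label}")
```

The toy tree has five leaves, so five rules are produced, one per path from the root to a leaf.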
The following is the sequential learning algorithm, in which rules are learned for one class at a time. When
learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci only and no tuple from any
other class.
Steps in algorithm:
1. Class Selection:
• Classes are selected one at a time, and all the rules of the selected class are marked.
2. Rule Development:
• Here a single rule is created by considering all the points in the data.
• Then the points or tuples covered by that rule are deleted for the next iteration.
3. Rule Accuracy:
• Next the accuracy of the single rule is measured as
• Rule Accuracy = (Correct records by rule) / (All records covered by the rule)
4. Next rule:
• After going through all the points covered by a particular rule, these data points are deleted and the rule is
appended to the rule set. Then the data points for the next rule are taken and the iteration is repeated.
5. Pruning the rule set:
• After the rule set is developed, it is pruned according to (p − n) / (p + n)
• Here p is the number of positive records covered by the rule and n is the number of negative records
covered by the rule.
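The two measures in steps 3 and 5 are simple ratios over the records a rule covers. The sketch below evaluates rule R1 (age = youth AND student = yes) on a handful of invented records; the data and helper name `r1_covers` are hypothetical and only meant to show how the counts are obtained.

```python
# Hypothetical records, made up for illustration.
records = [
    {"age": "youth", "student": "yes", "buy_computer": "yes"},
    {"age": "youth", "student": "yes", "buy_computer": "yes"},
    {"age": "youth", "student": "yes", "buy_computer": "no"},
    {"age": "youth", "student": "no", "buy_computer": "no"},
    {"age": "senior", "student": "yes", "buy_computer": "yes"},
]

def r1_covers(record):
    """Antecedent of R1: age = youth AND student = yes."""
    return record["age"] == "youth" and record["student"] == "yes"

covered = [r for r in records if r1_covers(r)]
p = sum(r["buy_computer"] == "yes" for r in covered)   # positives covered
n = len(covered) - p                                     # negatives covered

accuracy = p / len(covered)        # correct records / records covered
prune_score = (p - n) / (p + n)    # measure used when pruning the rule set
print(round(accuracy, 3), round(prune_score, 3))
```

Here the rule covers three records, two of them correctly, so its accuracy is 2/3 and its pruning measure is 1/3.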
Support Vector and Kernel Methods
Supervised learning (SL)
“Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.”
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example
input-output pairs. It infers a function from labeled training data consisting of a set of training examples.
This is a situation where you put in some data you already have answers to. For example, to predict if a dog is a particular
breed, we load in millions of dog records with properties like type, height, skin color, body hair length etc.
In ML lingo, these properties are referred to as ‘features’. A single entry of this list of features is a data instance, while the
collection of everything is the Training Data, which forms the basis of our prediction; i.e. if you know the skin color, body hair
length, height and so on of a particular dog, then you can predict the breed it will probably belong to.
Basically, SVM finds a hyper-plane that creates a boundary between the types of data. In 2-dimensional space,
this hyper-plane is nothing but a line.
In SVM, we plot each data item in the dataset in an N-dimensional space, where N is the number of
features/attributes in the data. Next, find the optimal hyperplane to separate the data.
Usually a learning algorithm tries to learn the most common characteristics of a class (what differentiates one class from
another), and classification is based on those representative characteristics (so classification is based
on differences between classes). The SVM works the other way around. It finds the most similar examples between
classes. Those will be the support vectors.
As an example, let's consider two classes, apples and lemons. If we visualize the example above in 2D, we will have
something like this:
Based on these support vectors, the algorithm tries to find the best hyperplane that separates the classes. In 2D the
hyperplane is a line, so it would look like this:
So why Kernels?
The red and blue balls cannot be separated by a straight line, and this, in reality, is how most real-life
problem data look: the classes are not linearly separable.
Kernels or kernel methods (also called kernel functions) are a family of algorithms
used for pattern analysis. They make it possible to solve a non-linear problem with a linear classifier.
Kernel methods are employed in SVMs (Support Vector Machines), which are used in classification and
regression problems. The SVM uses what is called the “Kernel Trick”, where the data is implicitly transformed into a
higher-dimensional space and an optimal boundary is found there.
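To make the kernel trick concrete, the sketch below compares an explicit feature map f with a kernel K that returns the same inner product, K(x, y) = <f(x), f(y)>, without ever building f(x). The quadratic kernel and the particular map used here are illustrative assumptions, and the names `feature_map` and `poly_kernel` are introduced only for this example.

```python
def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def feature_map(x):
    """Explicit map f: all pairwise products x_i * x_j (n -> n*n dimensions)."""
    return [xi * xj for xi in x for xj in x]

def poly_kernel(x, y):
    """K(x, y) = (x . y)^2, computed in the original low-dimensional space."""
    return dot(x, y) ** 2

x = [1, 2, 3]
y = [4, 5, 6]
explicit = dot(feature_map(x), feature_map(y))   # inner product in 9 dimensions
via_kernel = poly_kernel(x, y)                   # same value, no mapping needed
print(explicit, via_kernel)
```

Both routes give 1024, but the kernel needs only a 3-dimensional dot product and one squaring, whereas the explicit route builds two 9-dimensional vectors first. That saving is what makes very high-dimensional transformations practical.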
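In order to get a mathematical understanding of kernels, let us look at Lili Jiang’s equation of a
kernel, which is:
K(x, y) = <f(x), f(y)>, where K is the kernel function, x and y are n-dimensional inputs, f is a map from the
n-dimensional to an m-dimensional space, and <x, y> denotes the dot product.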
The Kernel :
• The kernel is selected based on the type of data and also the type of transformation. By default (for example in
scikit-learn's SVC), the kernel is the Radial Basis Function (RBF) kernel.
Gamma :
• This parameter decides how far the influence of a single training example reaches during
transformation, which in turn affects how tightly the decision boundaries end up surrounding
points in the input space. With a small value of gamma, points farther apart are considered
similar, so more points are grouped together and the decision boundaries are smoother (may be
less accurate). With larger values of gamma, points must be closer together to be considered similar (may cause overfitting).
The ‘C’ parameter :
• This parameter controls the amount of regularization applied to the model. Large values of C mean
low regularization, which in turn causes the model to fit the training data very well (may cause overfitting).
Lower values of C mean higher regularization, which causes the model to be more tolerant of
errors (may lead to lower accuracy).
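The effect of gamma can be seen directly from the RBF kernel formula, K(x, y) = exp(-gamma · ||x − y||²). The sketch below is a plain-Python illustration, not library code; the function name `rbf_kernel` and the two sample points are assumptions made here.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

a, b = (0.0, 0.0), (1.0, 1.0)               # two points at squared distance 2
low_gamma = rbf_kernel(a, b, gamma=0.1)     # about 0.82: still treated as similar
high_gamma = rbf_kernel(a, b, gamma=10.0)   # about 2e-9: effectively unrelated
print(round(low_gamma, 3), high_gamma)
```

With a small gamma the two points keep a high similarity, so each training example influences a wide region. With a large gamma the similarity collapses to almost zero unless the points are very close, which is why large values tend to produce tight, overfitted boundaries.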
Tuning Parameters
Regularization
The regularization parameter (called C in Python's scikit-learn) tells the SVM optimization how much you want to avoid
misclassifying each training example.
If C is high, the optimization will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be
lower.
On the other hand, if C is low, the margin will be large, even if some training examples are
misclassified. This is shown in the following two diagrams:
Gamma
The next important parameter is gamma. The gamma parameter defines how far the influence of a single training
example reaches. A high gamma means that only points close to the plausible hyperplane are considered,
while a low gamma means that points at a greater distance are considered as well.
As you can see, decreasing gamma means that points at greater distances are taken into account when finding
the correct hyperplane, so more and more points will be used.
Margin
The last parameter is the margin. A larger margin generally gives a better model, and so better classification (or
prediction). The margin should therefore be maximized.
Pros of Kernelized SVM:
1. They perform very well on a range of datasets, including those where the number of dimensions is greater than the number of
samples.
2. They are versatile: different kernel functions can be specified, or custom kernels can be defined for
specific data types.
3. They work well for both high- and low-dimensional data.
4. Not very sensitive to overfitting.
5. They can have high accuracy (in some cases even better than neural networks).
Generalisation is the ability of a neural network to perform accurately on new, unseen data after learning from training
data. A model that generalises well has learned underlying patterns, not just memorised the training set.
Key Challenges
Overfitting: The model learns the training data too well, including its noise and outliers, performing poorly on new data.
This is analogous to memorising answers to practice questions without understanding the underlying principles.
Underfitting: The model fails to capture the underlying trend in the data, performing poorly even on the training set.
• Bias is the error from overly simple assumptions in the model. High bias can cause underfitting.
• Variance is the error from sensitivity to fluctuations in the training set. High variance can cause overfitting.
• The goal is to find the optimal model complexity that minimises both.
•The Curse of Dimensionality: As the number of features (dimensions) increases, the data becomes sparse. This sparsity
makes it increasingly difficult for a model to generalise, as the training data becomes a less representative sample of the
overall space.
Regularisation Techniques
•Lasso Regularisation (L1): Adds a penalty equal to the absolute value of the magnitude of coefficients. This can shrink
some coefficients to zero, effectively performing feature selection.
•Dropout: A technique where randomly selected neurons are ignored during training. This forces the network to learn
more robust features that are useful in conjunction with many different random subsets of the other neurons.
•Data Augmentation: Increasing the diversity of the training set by applying transformations (e.g., rotating or cropping
images) to create new, plausible training examples.
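The way an L1 penalty drives coefficients to exactly zero can be illustrated with the soft-thresholding rule. In the simplest setting (an orthonormal design, which is an assumption made here) the Lasso solution for each coefficient is its unpenalised value shrunk towards zero by the penalty strength. The function name `soft_threshold` and the sample weights are hypothetical.

```python
def soft_threshold(w, lam):
    """Shrink w towards zero by lam; values within [-lam, lam] become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

# Hypothetical unpenalised coefficients, made up for illustration.
weights = [2.5, -0.3, 0.1, -1.2]
shrunk = [soft_threshold(w, lam=0.5) for w in weights]
print(shrunk)
```

The two small coefficients are set to zero while the large ones are only reduced, which is the feature-selection effect described for Lasso above.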
Supervised Learning | Unsupervised Learning
Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
A supervised learning model takes direct feedback to check whether it is predicting the correct output. | An unsupervised learning model does not take any feedback.
A supervised learning model predicts the output. | An unsupervised learning model finds the hidden patterns in data.
In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be categorized into Clustering and Association problems.
Supervised learning can be used for those cases where we know the input as well as the corresponding outputs. | Unsupervised learning can be used for those cases where we have only input data and no corresponding output data.
A supervised learning model generally produces a more accurate result. | An unsupervised learning model may give a less accurate result as compared to supervised learning.
Supervised learning is not close to true Artificial Intelligence, as we first train the model on labeled data and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns in the way a child learns daily routine things from experience.
It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision tree, Bayesian Logic, etc. | It includes various algorithms such as clustering (e.g. k-means) and the Apriori algorithm.
Perceptron
A simple neural network architecture allows only unidirectional forward connections among neurons, and because of that it
is called a feed-forward neural network. The simplest type of feed-forward neural network, the perceptron, consists of only one
layer of neural units connected with a set of n input terminals. The number of outputs is the same as the number of neural
units. Each unit is a single artificial neuron that computes the weighted sum of its inputs and applies a threshold activation
function. It is a computer model or computerized machine devised to represent or simulate the ability of the brain to recognize
and discriminate. The most basic form of an activation function is a simple binary function that has only two possible results.
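A single perceptron unit of this kind can be sketched in a few lines: a weighted sum of the inputs followed by a binary threshold. The weights below are hand-picked (not learned) so that the unit behaves like a logical AND gate; they and the function name `perceptron` are assumptions for illustration.

```python
def perceptron(inputs, weights, bias):
    """Single neuron: weighted sum of inputs followed by a binary threshold."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if net >= 0 else 0

# Hypothetical hand-picked weights that make the neuron compute logical AND.
and_weights, and_bias = (1.0, 1.0), -1.5
outputs = [perceptron((a, b), and_weights, and_bias)
           for a in (0, 1) for b in (0, 1)]
print(outputs)
```

The unit fires only for the input (1, 1), since that is the only case where the weighted sum exceeds the threshold set by the bias.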
Multi-Layer Perceptron
Networks with more than one layer of artificial neurons, where only forward connections from the input towards the
output are allowed are called Multi layer perceptron(MLP) or Multilayer feed forward neural networks.
• A multilayer perceptron is a feedforward neural network with one or more hidden layers.
• The network consists of an input layer of source neurons, at least one middle or hidden layer
of computational neurons, and an output layer of computational neurons.
• The input signals are propagated in a forward direction on a layer-by-layer basis.
In a multilayer perceptron, the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1:
$x(n) = [+1, x_1(n), x_2(n), \ldots, x_m(n)]^T$
Correspondingly, we define the weight vector as
$w(n) = [b(n), w_1(n), w_2(n), \ldots, w_m(n)]^T$
Accordingly, the linear combiner output is written in the compact form
$V(n) = \sum_{i=0}^{m} w_i(n)\, x_i(n) = w^T(n)\, x(n)$
Neural Networks: Competitive Learning
Competitive learning
Competitive learning is an unsupervised learning paradigm where neurons in a network compete to respond to a given
input. This is often described as a "winner-takes-all" approach.
The ultimate objective of training a neural network is to obtain a set of weights that classifies almost all the tuples in the
training data correctly. The steps involved are given below:
1. Initialize weights with random values
2. Feed the input tuples into the network one by one
3. For each unit
1. Compute the net input to the unit as a linear combination of all the inputs to the unit
2. Compute the output value using the activation function
3. Compute the error
4. Update the weights and the bias
Core Principle
•Neurons specialise in recognising specific types of input patterns. For a given input, the neuron whose weight vector is
most similar to the input is declared the "winner."
•Only the winning neuron (and perhaps its neighbours) is allowed to update its weights, becoming even more sensitive
to that type of input. This process leads to a natural partitioning of the feature space
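One winner-takes-all update can be sketched as follows. The sketch measures similarity by Euclidean distance and moves only the winner a fraction of the way towards the input; the distance measure, the learning rate of 0.5, the initial weights and the function names are all assumptions made for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def competitive_step(weights, x, lr=0.5):
    """One winner-takes-all update; returns the index of the winning neuron."""
    # Winner: neuron whose weight vector is closest to the input.
    winner = min(range(len(weights)),
                 key=lambda i: squared_distance(weights[i], x))
    # Only the winner moves, a fraction lr of the way towards the input.
    weights[winner] = [w + lr * (xi - w) for w, xi in zip(weights[winner], x)]
    return winner

neurons = [[0.0, 0.0], [1.0, 1.0]]       # hypothetical initial weight vectors
winner = competitive_step(neurons, [0.9, 0.8])
print(winner, neurons)
```

The second neuron wins and its weights move towards the input, while the losing neuron is left untouched. Repeating this over many inputs is what partitions the feature space among the neurons.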
Growing Algorithms
• Train a small network and then add new units to it.
• Examples: Upstart Algorithm, Tiling Algorithm and Cascade Correlation Algorithm
Pruning Algorithms
• Train a large network and then remove the unwanted weights or units from it.
• The large size helps in training the network quickly, and the final reduced size helps in improving generalization.
• Examples: weight decay method, cross-validation method, significance-based method
Back-propagation neural network
Parameters to optimize in the back-propagation algorithm
• Number of hidden nodes: This should be kept small.
• Momentum coefficient: It adds a fraction of the previous weight change to the current one, which helps the
search move past local minima towards the global minimum.
• Sigmoidal gain: It optimizes the scaling factor so that the same weight range can be
applied to a wide variety of functions.
• Local minima: The weight change should be such that the network does not get stuck in a
local minimum; instead it should be able to move out of it and find the global minimum.
• Learning coefficient: It should be between 0 and 1; it controls the size of each weight update.
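The roles of the learning coefficient and the momentum coefficient can be illustrated with a one-weight example. This is a minimal sketch of gradient descent with momentum on a made-up error surface, not a full back-propagation implementation; the function name, the default values and the error surface are assumptions.

```python
def train_with_momentum(grad, w0, lr=0.1, momentum=0.9, steps=500):
    """Gradient descent where each step reuses a fraction of the previous step."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # New weight change = momentum * previous change - lr * gradient.
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w

# Hypothetical error surface E(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_final = train_with_momentum(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_final, 6))
```

The weight settles at the minimum w = 3. In this update rule the learning coefficient scales how far each gradient step goes, and the momentum term carries part of the previous step forward, which smooths the path and can help the search roll through shallow dips.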
Neural Network: Principal Components Analysis
Dimension Reduction-
Example-
Dimension Reduction Techniques-
•Principal Component Analysis reduces the number of variables in a dataset by extracting the important ones from a large
dataset.
•It transforms the variables into a new set of variables called principal components.
•These principal components are linear combinations of the original variables, weighted by the eigenvectors.
•They are orthogonal. They are the eigenvectors of the covariance matrix.
•The first principal component accounts for as much of the variation in the original data as possible.
•The second principal component captures as much of the remaining variance as possible.
•There can be at most two principal components for a two-dimensional data set.
WHAT IS PRINCIPAL COMPONENT ANALYSIS?
Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the
dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the
information in the large set.
PCA Algorithm-
Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigen vectors and eigen values of the covariance matrix.
Step-06: Choosing components and forming a feature vector.
Step-07: Deriving the new data set.
Example:
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }.
Compute the principal component using PCA Algorithm.
Step-01:
Get data.
The given feature vectors are-
•x1 = (2, 1)
•x2 = (3, 5)
•x3 = (4, 3)
•x4 = (5, 6)
•x5 = (6, 7)
•x6 = (7, 8)
Step-02:
Step-03:
Step-04:
Now,
Covariance matrix
= (m1 + m2 + m3 + m4 + m5 + m6) / 6
Step-05:
Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic
equation |M – λI| = 0.
So, we have-
From here,
(2.92 – λ)(5.67 – λ) – (3.67 x 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0
Solving this quadratic gives the two eigen values λ1 ≈ 8.22 and λ2 ≈ 0.38.
Clearly, the second eigen value is very small compared to the first eigen value.
So, the second eigen vector can be left out.
Eigen vector corresponding to the greatest eigen value is the principal component for the given data set.
So, we find the eigen vector corresponding to eigen value λ1.
Solving M X = λ1 X for the eigen vector X = (X1, X2), we get-
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2
On simplification, we get-
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)
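The whole example can be checked numerically. The sketch below follows the same steps for the six given points in plain Python, dividing by n = 6 for the covariance matrix as the example does; the variable names are assumptions introduced here. Because it carries full precision rather than two-decimal values, its eigen values differ slightly from the rounded figures in the working.

```python
import math

points = [(2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8)]
n = len(points)

# Step 2: mean vector.
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n

# Steps 3-4: centre the data and build the covariance matrix (dividing by n).
var_x = sum((p[0] - mean_x) ** 2 for p in points) / n
var_y = sum((p[1] - mean_y) ** 2 for p in points) / n
cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n

# Step 5: eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]] from
# lambda^2 - trace * lambda + det = 0.
trace = var_x + var_y
det = var_x * var_y - cov_xy ** 2
root = math.sqrt(trace ** 2 - 4 * det)
lam1, lam2 = (trace + root) / 2, (trace - root) / 2

# Eigenvector for lam1 from (var_x - lam1) * X1 + cov_xy * X2 = 0, unit length.
vec = (cov_xy, lam1 - var_x)
norm = math.hypot(*vec)
principal_component = (vec[0] / norm, vec[1] / norm)
print(round(lam1, 2), round(lam2, 2), principal_component)
```

This gives eigen values of about 8.21 and 0.38 and a unit-length principal component of roughly (0.57, 0.82), which is the direction implied by equations (1) and (2).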
FUZZY LOGIC
Fuzzy logic is an extension of Boolean logic introduced by Lotfi Zadeh in 1965, based on the mathematical theory of fuzzy sets,
which is a generalization of classical set theory. One advantage of fuzzy logic for formalizing human reasoning is that the rules
are set in natural language.
Fuzzy logic (FL) is defined as a form of knowledge representation suitable for notions that cannot be defined precisely, but
which depend upon their contexts. Fuzzy means “not clear, distinct, or precise; blurred”.
Fuzzy logic is a method of reasoning that resembles human reasoning. The approach of FL imitates the way of decision
making in humans that involves all intermediate possibilities between the digital values YES and NO.
Fuzzy inference systems rely on membership functions to explain to the computer how to calculate the correct value between 0
and 1. The degree to which any fuzzy statement is true is denoted by a value between 0 and 1.
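A membership function of the kind described above can be sketched with a simple triangular shape. The fuzzy set "warm" and its breakpoints (15, 25 and 35 degrees Celsius) are hypothetical values chosen for illustration, as is the function name `triangular`.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at a and c, rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy set "warm" over temperatures in degrees Celsius.
def warm(t):
    return triangular(t, 15, 25, 35)

degrees = [warm(t) for t in (10, 20, 25, 30)]
print(degrees)
```

A temperature of 10 is not warm at all (0.0), 25 is fully warm (1.0), and 20 or 30 are warm to degree 0.5. These are exactly the intermediate truth values that Boolean logic cannot express.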
Classical sets – either an element belongs to the set or it does not. It is a set of distinct objects. For example,
for the set of even integers, an integer is either even (member) or it is not (non-member).
Fuzzy sets – A fuzzy set has a graphical description that expresses how the transition from one to
another takes place. This graphical description is called a membership function.
Fuzzy Logic is a multivalued logic, that allows intermediate values to be defined between conventional evaluations like
true/false, yes/no, high/low, etc.
Crisp set provides-
• Precise location of the set boundaries
• Membership values of the set
Key Techniques and Tools
• Rule Generation: Algorithms like the Wang-Mendel method can generate fuzzy rules from numerical data by
partitioning the input space and determining the relationship between inputs and outputs. This allows for the
creation of interpretable "If-Then" rules.
• Linguistic Summarisation: Fuzzy logic can be used to create concise, human-readable summaries of quantitative
data (e.g., summarising sales data as "Most products in the North region had high sales in the summer months.").
• Fuzzy Clustering for Model Extraction: Techniques like the Evolving Clustering Method (ECM) can be used to find
the natural groupings in data, which can then be used to define fuzzy sets and the initial structure of a fuzzy model.
Basic elements of fuzzy logic system
The architecture of the fuzzy logic controller shown in figure includes four components: Fuzzifier, Rule Base, Fuzzy Inference
Engine, and Defuzzifier.
Fuzzifier: The fuzzifier is the input interface which maps a numeric input to a fuzzy set so that it can be matched with the
premises of the fuzzy rules defined in the application-specific rule base.
Rule Base: The rule base contains a set of fuzzy if-then rules which define the actions of the controller in terms of
linguistic variables and membership functions of linguistic terms.
Fuzzy Inference Engine: The fuzzy inference engine applies the inference mechanism to the set of rules in the fuzzy rule
base to produce a fuzzy output set. This involves matching the input fuzzy set with the premises of the rules, activation of
the rules to deduce the conclusion of each rule that is fired, and combination of all activated conclusions using fuzzy set
union to generate fuzzy set output.
Defuzzifier: The defuzzifier is an output mapping which converts fuzzy set output to a crisp output. Based on the crisp
output, the fuzzy logic controller can drive the system under control.
The fuzzy rule base contains a set of linguistic rules. These linguistic rules are expressed using linguistic values and linguistic
variables. Different linguistic values can be assigned to a linguistic variable. These linguistic values are modeled as fuzzy
sets, and the membership function for each value is chosen according to application requirements.
Membership Functions
Membership functions allow you to quantify a linguistic term and represent a fuzzy set graphically. A membership
function is the core component of Fuzzy Logic. It defines how much a value belongs to a fuzzy set. A membership
function for a fuzzy set A on the universe of discourse X is defined as µA: X → [0, 1].
Here, each element of X is mapped to a value between 0 and 1. It is called membership value or degree of
membership. It quantifies the degree of membership of the element in X to the fuzzy set A.
A membership function maps an input value to a degree between 0 and 1:
µA(x) ∈ [0, 1]
Where:
x = input value
A = fuzzy set (e.g., “Hot”, “Cold”)
µA(x) = degree of membership
In Fuzzy ID3, membership functions:
• Convert crisp data → fuzzy values
• Determine how much each sample contributes to each branch
• Are used in entropy & information gain calculations
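To illustrate how membership degrees can enter the entropy and information gain calculations, here is a minimal sketch. It assumes that each sample is counted with its membership weight instead of a count of 1, and that a sample's weight in a branch is its weight at the node multiplied by its membership in that branch's fuzzy term (some formulations use the minimum instead of the product). The function names and the toy data are illustrative, not taken from a specific Fuzzy ID3 implementation.

```python
import math


def fuzzy_entropy(weights, labels):
    """Entropy where each sample contributes its membership weight."""
    total = sum(weights)
    if total == 0:
        return 0.0
    entropy = 0.0
    for cls in set(labels):
        p = sum(w for w, y in zip(weights, labels) if y == cls) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


def fuzzy_information_gain(node_weights, branch_memberships, labels):
    """Gain of splitting on an attribute whose fuzzy terms have the given memberships."""
    parent = fuzzy_entropy(node_weights, labels)
    # Weight of each sample in each branch (product of memberships: an assumption).
    branch_weights = [
        [w * m for w, m in zip(node_weights, mus)]
        for mus in branch_memberships.values()
    ]
    grand_total = sum(sum(ws) for ws in branch_weights)
    if grand_total == 0:
        return 0.0
    children = sum(
        sum(ws) / grand_total * fuzzy_entropy(ws, labels) for ws in branch_weights
    )
    return parent - children


# Toy data: four samples, one fuzzy attribute with terms "cold" and "hot".
labels = ["yes", "yes", "no", "no"]
node_weights = [1.0, 1.0, 1.0, 1.0]
memberships = {"cold": [0.9, 0.7, 0.2, 0.0], "hot": [0.1, 0.3, 0.8, 1.0]}
gain = fuzzy_information_gain(node_weights, memberships, labels)
print(round(gain, 3))
```

With crisp memberships (only 0 or 1) the same functions reduce to the ordinary ID3 entropy and information gain.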
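Types of Membership Functions
1. Triangular Membership Function
Most commonly used (simple & efficient)
$$
\mu(x) =
\begin{cases}
0, & x \le a \\
\dfrac{x - a}{b - a}, & a < x \le b \\
\dfrac{c - x}{c - b}, & b < x < c \\
0, & x \ge c
\end{cases}
$$
• Defined by 3 points: a, b, c
2. Trapezoidal Membership Function
$$
\mu(x) =
\begin{cases}
0, & x \le a \\
\dfrac{x - a}{b - a}, & a < x \le b \\
1, & b < x \le c \\
\dfrac{d - x}{d - c}, & c < x < d \\
0, & x \ge d
\end{cases}
$$
• Defined by 4 points: a, b, c, d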
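The two piecewise definitions translate directly into code. The sketch below is a minimal version that assumes a < b < c (and c < d for the trapezoid); the function names and the sample temperature values are illustrative.

```python
def triangular(x, a, b, c):
    """Triangular membership: rises on (a, b], falls on (b, c), zero elsewhere."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: rises on (a, b], is 1 on (b, c], falls on (c, d)."""
    if x <= a or x >= d:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Example: degree to which 22 °C belongs to a hypothetical "Warm" set.
print(triangular(22, 15, 25, 35))
print(trapezoidal(22, 10, 20, 30, 40))
```

The triangular function is the special case of the trapezoid in which the flat top shrinks to a single point.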
Core Parameters of a Membership Function
Parameter           Meaning
Support             Range where membership > 0
Core                Region where membership = 1
Boundary            Region where 0 < membership < 1 (the crossover points are where membership = 0.5)
Shape parameters    Define structure (a, b, c, etc.)
Height              Maximum membership value
Width/Spread        Range of influence
Slope               Rate of change
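Several of these parameters can be read off a sampled membership function. The sketch below evaluates a triangular set on a grid and extracts the support, core, height, and the points where membership equals 0.5 (often called crossover points). The grid spacing and the triangle corners (15, 25, 35) are assumptions chosen for illustration.

```python
def tri(x, a=15.0, b=25.0, c=35.0):
    """Triangular membership function with illustrative corner points."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


xs = [i * 0.5 for i in range(101)]          # grid 0.0, 0.5, ..., 50.0
mu = [tri(x) for x in xs]

support = [x for x, m in zip(xs, mu) if m > 0]            # membership > 0
core = [x for x, m in zip(xs, mu) if m == 1.0]            # membership = 1
height = max(mu)                                          # maximum membership
crossover = [x for x, m in zip(xs, mu) if abs(m - 0.5) < 1e-9]

print("support:", support[0], "to", support[-1])
print("core:", core)
print("height:", height)
print("membership 0.5 at:", crossover)
```

Because the support is sampled on a grid, its printed end points are the first and last grid values strictly inside the interval (a, c).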
Fuzzy Logic In Control Systems
Fuzzy Logic provides an efficient and resourceful way to solve control problems. Some examples:
– Temperature Controller
– Anti-Lock Braking System (ABS)
Temperature Controller:
→ The problem: change the speed of a heater fan based on the room temperature and humidity
→ Temperature has four settings: Cold, Cool, Warm, and Hot
→ Humidity has three settings: Low, Medium, and High
→ Using these, we can define the fuzzy sets
Example of a Fuzzy Logic System
Let us consider an air conditioning system with a 5-level fuzzy logic system. This system adjusts the temperature of
the air conditioner by comparing the room temperature and the target temperature value.
Algorithm
1. Define linguistic variables and terms (start)
2. Construct membership functions for them (start)
3. Construct knowledge base of rules (start)
4. Convert crisp data into fuzzy data sets using membership functions (Fuzzification)
5. Evaluate rules in the rule base (inference engine)
6. Combine results from each rule (inference engine)
7. Convert output data into non-fuzzy values (De-Fuzzification)
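The seven steps can be sketched as a small Mamdani-style controller. This is a simplified illustration, not the 5-level system itself: it assumes three linguistic terms per variable, triangular membership functions, min for rule activation, max for combining rules, and centroid defuzzification on a sampled output range. All term names and breakpoints are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership function (assumes a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Steps 1-2: linguistic terms and membership functions (illustrative values).
# Input: error = room temperature - target temperature, in °C.
ERROR_TERMS = {"cold": (-10.0, -5.0, 0.0), "ok": (-5.0, 0.0, 5.0), "hot": (0.0, 5.0, 10.0)}
# Output: cooling power in percent.
POWER_TERMS = {"low": (-50.0, 0.0, 50.0), "medium": (0.0, 50.0, 100.0), "high": (50.0, 100.0, 150.0)}

# Step 3: rule base, "IF error IS <key> THEN power IS <value>".
RULES = {"cold": "low", "ok": "medium", "hot": "high"}


def control(error):
    # Clamp the input so that at least one rule always fires.
    error = max(-5.0, min(5.0, error))
    # Step 4: fuzzification.
    degrees = {term: tri(error, *abc) for term, abc in ERROR_TERMS.items()}
    # Steps 5-6: clip each rule's output set (min) and combine the rules (max).
    ys = list(range(101))
    aggregated = [
        max(min(degrees[e], tri(y, *POWER_TERMS[p])) for e, p in RULES.items())
        for y in ys
    ]
    # Step 7: centroid defuzzification.
    return sum(y * m for y, m in zip(ys, aggregated)) / sum(aggregated)


for err in (-5.0, 0.0, 2.5, 5.0):
    print(err, round(control(err), 1))
```

The crisp output increases as the room gets warmer relative to the target, which is the behaviour the rule base encodes.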
FUZZY DECISION TREES
Decision trees are one of the most popular methods for learning and reasoning from instances. Fuzzy Decision
Trees (FDTs) aim to combine the ability of decision trees (to learn from examples, to present knowledge in
comprehensible form) with fuzzy representation (to deal with inexact and uncertain information).
Fuzzy Decision Trees (FDTs) are an extension of classical decision trees designed to handle uncertainty and
continuous data more gracefully.
Classical trees like ID3 or CART often struggle with continuous data.
• They may require discretization of continuous features into discrete categories, which can lead to information
loss.
• They often use a dichotomous (binary) splitting approach, which can generate overly complex trees with
many rules, leading to overfitting.
Fuzzy Decision Trees Working
FDTs address these issues by using fuzzy logic at the decision nodes. Instead of a hard split (e.g., "Age > 30"), an
instance can follow multiple branches simultaneously with different membership degrees.
• Fuzzy ID3 and IFD are well-established algorithms that use fuzziness for feature selection and tree induction.
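A minimal sketch of such a soft split is shown below. Instead of the hard test "Age > 30", the instance gets a degree of membership in an "old" branch that ramps from 0 to 1 between two assumed ages (25 and 35), and the remainder goes to a "young" branch. The class distributions stored in the two leaves are hypothetical values chosen for illustration.

```python
def old_degree(age, lo=25.0, hi=35.0):
    """Degree to which `age` satisfies the fuzzy test 'Age is above about 30'."""
    if age <= lo:
        return 0.0
    if age >= hi:
        return 1.0
    return (age - lo) / (hi - lo)


# Hypothetical class distributions stored at the two leaves.
LEAVES = {
    "young": {"buys": 0.2, "does_not_buy": 0.8},
    "old": {"buys": 0.9, "does_not_buy": 0.1},
}


def classify(age):
    """Send the instance down both branches and blend the leaf distributions."""
    mu_old = old_degree(age)
    branches = {"young": 1.0 - mu_old, "old": mu_old}
    scores = {}
    for name, degree in branches.items():
        for cls, p in LEAVES[name].items():
            scores[cls] = scores.get(cls, 0.0) + degree * p
    return branches, scores


branches, scores = classify(28)
print(branches)   # membership degree in each branch
print(scores)     # blended class scores
```

An instance far from the boundary (for example age 40) follows a single branch with degree 1, so the result matches a crisp tree there.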
Fuzzy ID3 is a generalization of the ID3 algorithm, a popular and efficient method for inducing decision trees from symbolic
data. The key motivation for Fuzzy ID3 is to handle the imprecision and uncertainty inherent in much human knowledge and
real-world data, which traditional "crisp" decision trees cannot process effectively.
The general procedure for generating fuzzy decision trees using Fuzzy ID3 is outlined as follows:
Prerequisites: A fuzzy partition space, a leaf selection threshold βth, and the best-node selection criterion.
Procedure:
While there exist candidate nodes
DO
    Select one of them using a search strategy.
    Generate its child-nodes according to an expanded attribute obtained by the given heuristic.
    Check the child-nodes against the leaf selection threshold.
    Child-nodes meeting the leaf threshold are terminated as leaf-nodes.
    The remaining child-nodes are regarded as new candidate nodes.
end
Before training, the α-cut is usually applied to the initial data to
reduce the fuzziness. The α-cut of a fuzzy set A is defined as:
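The defining formula is not reproduced in this copy of the slides. As an assumption, the sketch below shows two common forms: the membership-thresholding form often used as a preprocessing step (degrees below α are set to 0, the rest are kept), and the classical set-valued α-cut (the elements whose membership is at least α). The function names are illustrative.

```python
def alpha_cut(memberships, alpha):
    """Thresholding form: keep degrees >= alpha, set smaller degrees to 0."""
    return [m if m >= alpha else 0.0 for m in memberships]


def alpha_cut_set(elements, memberships, alpha):
    """Classical form: the crisp set of elements with membership >= alpha."""
    return [x for x, m in zip(elements, memberships) if m >= alpha]


temps = [10, 18, 24, 30]
mu_hot = [0.1, 0.4, 0.7, 1.0]          # illustrative degrees for "Hot"

print(alpha_cut(mu_hot, 0.5))
print(alpha_cut_set(temps, mu_hot, 0.5))
```

A larger α discards more low-membership values, which removes weak, noisy memberships before the tree is induced.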
Stochastic Search Methods
Stochastic search is the method of choice for solving many hard combinatorial problems.
Combinatorial Decision Problems: For a given problem instance, decide whether a solution (grouping, ordering, or
assignment) exists which satisfies the given constraints.
Stochastic search methods are a broad class of optimisation algorithms that use randomness to find optima in a
search space. They are particularly useful for complex, high-dimensional, or non-differentiable problems where
traditional gradient-based methods fail.
Key Characteristics
• Randomness: They incorporate random elements to explore the search space, which helps in escaping local
optima.
• Adaptive Search: Many modern algorithms are highly adaptive, changing their search strategy based on feedback
from the environment.
• Parallelisation: These methods are often well-suited for implementation on parallel computers, as multiple
independent searches can be run simultaneously.
Common Examples in Machine Learning
• Genetic Algorithms (GAs): Inspired by natural selection, GAs evolve a population of candidate solutions over
generations using operators like mutation and crossover.
• Simulated Annealing (SA): Inspired by annealing in metallurgy, SA probabilistically accepts worse solutions at the
beginning to explore the space and gradually reduces this probability to converge on an optimum.
• Particle Swarm Optimisation (PSO): Models a population (swarm) of candidate solutions (particles) that move around
the search space, influenced by their own best-known position and the swarm's best-known position.
Applications
Stochastic methods are used to solve a wide variety of real-world problems, from engineering design to financial
modelling, especially when the problem is too complex for deterministic algorithms.
General Working Principle
1. Start with an initial solution
2. Generate new candidate solutions randomly
3. Evaluate using objective/fitness function
4. Accept or reject based on probability rules
5. Repeat until stopping condition
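The loop above can be sketched with simulated annealing as one concrete choice of acceptance rule: improvements are always accepted, and worse candidates are accepted with probability exp(-Δ / T), where the temperature T shrinks over time. The toy objective, the step size, and the cooling schedule are assumptions made for illustration; a fixed seed keeps the run repeatable.

```python
import math
import random


def objective(x):
    """Toy one-dimensional function with several local minima (to be minimised)."""
    return x * x + 10 * math.sin(3 * x)


def simulated_annealing(f, start, steps=5000, temp=10.0, cooling=0.999,
                        step_size=0.5, seed=0):
    rng = random.Random(seed)
    current, f_cur = start, f(start)              # 1. start with an initial solution
    best, f_best = current, f_cur
    for _ in range(steps):                        # 5. repeat until the step budget is used
        candidate = current + rng.uniform(-step_size, step_size)  # 2. random candidate
        f_cand = f(candidate)                     # 3. evaluate with the objective
        delta = f_cand - f_cur
        # 4. accept improvements; accept worse moves with probability exp(-delta / temp)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, f_cur = candidate, f_cand
            if f_cur < f_best:
                best, f_best = current, f_cur
        temp *= cooling                           # lower the temperature gradually
    return best, f_best


best_x, best_f = simulated_annealing(objective, start=4.0)
print(round(best_x, 3), round(best_f, 3))
```

Early on, the high temperature lets the search leave local minima; as the temperature drops, the rule behaves more and more like plain hill climbing.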
Types of Stochastic Search Methods
1. Random Search
• Simplest method
• Randomly samples solutions
• No memory, no learning
2. Hill Climbing (Stochastic Variant)
• Moves to a better neighbor randomly
• If multiple better options exist → choose randomly
• May still get stuck in local optimum
4. Genetic Algorithm
• Based on natural evolution:
  • Selection
  • Crossover
  • Mutation
• Works with a population of solutions
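As an illustration of the genetic algorithm's three operators, the sketch below evolves bit strings to maximise the number of 1s (the "OneMax" toy problem). The problem, the binary tournament selection, the single-point crossover, and every parameter value are assumptions chosen to keep the example short.

```python
import random


def fitness(bits):
    """OneMax fitness: the number of 1s in the bit string."""
    return sum(bits)


def genetic_algorithm(n_bits=20, pop_size=30, generations=60,
                      mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initial population of random bit strings.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # Selection: binary tournament, the fitter of two random individuals wins.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: single cut point, head from p1 and tail from p2.
            cut = rng.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)     # remember the best solution seen
    return best


best = genetic_algorithm()
print(fitness(best), best)
```

Unlike random search or hill climbing, the algorithm keeps a whole population, so good partial solutions from different individuals can be recombined.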
Disadvantages
• No guarantee of global optimum
• May require many iterations
• Performance depends on randomness