PE603IT
MACHINE LEARNING
1. INTRODUCTION
1.1 Learning
1.2 Machine learning
1.3 Types of machine learning
1.4 Machine learning process
1.5 Weight space
1.6 Curse of dimensionality
1.7 Overfitting
1.8 Training, testing and validation sets
1.9 The confusion matrix
1.10 Accuracy metrics
1.11 Receiver operator characteristic (ROC) curve
1.12 Unbalanced datasets
1.13 Some basic statistics
1.14 The bias-variance tradeoff
1. INTRODUCTION
1.1 Learning
Learning means becoming aware of something that you did not know before. In the context of
artificial intelligence and machine learning, the aim is to build self-teaching computers by providing
large amounts of data in a specific format. (Note: 'self-teaching computers' was used as a synonym
when the term 'machine learning' was coined by Arthur Samuel at IBM in 1959.)
1.2. Machine learning
Machine learning is a branch of artificial intelligence (AI) which focuses on the use of data and
algorithms to imitate the way that humans learn, gradually improving its accuracy. The focus of the
field is learning, that is, acquiring skills or knowledge from experience. Most commonly, this means
synthesizing useful concepts from historical data.
The term machine learning was coined in 1959 by Arthur Samuel, an IBM employee and pioneer in
the field of computer gaming and artificial intelligence. The synonym self-teaching computers was
also used in this time period. Machine learning algorithms build a model based on sample data,
known as training data, in order to make predictions or decisions without being explicitly
programmed to do so. The discipline of machine learning employs various approaches to teach
computers to accomplish tasks where no fully satisfactory algorithm is available. It finds application
in a variety of tasks, such as natural language processing, recommender systems, information
retrieval, image classification, object detection, speech recognition, handwriting recognition,
healthcare and email filtering.
Figure 1: Machine learning is a sub-field of Artificial intelligence.
1.3. Types of machine learning
There are primarily three types of machine learning: supervised, unsupervised, and reinforcement
learning. A finer division gives the five types described below.
1.3.1 Supervised learning
Supervised learning is a type of machine learning that uses labeled data to train models. In labeled
data, the output is already known; the model just needs to learn to map the inputs to the respective
outputs. Some of the most popular supervised learning algorithms are linear regression, logistic
regression, support vector machines, k-nearest neighbors, decision trees, random forests and naive
Bayes. Supervised learning algorithms are generally used for solving classification and regression
problems. Typical applications include weather prediction, sales forecasting and stock price analysis.
Models are fit on training data comprising inputs and outputs, and are then used to make predictions
on test sets where only the inputs are provided. The outputs from the model are compared to the
withheld target variables to estimate the skill of the model. There are two main types of supervised
learning problems: classification, which involves predicting a class label, and regression, which
involves predicting a numerical value. Algorithms are referred to as “supervised” because they learn
by making predictions given examples of input data, and the models are supervised and corrected via
an algorithm to better predict the expected target outputs in the training dataset. Some algorithms are
specifically designed for classification (such as logistic regression) or regression (such as linear
regression), and some can be used for both types of problems with minor modifications (such as
artificial neural networks).
1.3.2 Semi-supervised learning
In semi-supervised learning, a small amount of labelled data is combined with a large amount of
unlabelled data during training. Semi-supervised learning aims to alleviate the problem of having only
a limited amount of labelled data available for training. It is motivated by problem
settings where unlabeled data is abundant and obtaining labeled data is expensive. Unlabeled data,
when used in conjunction with a small amount of labeled data, can produce considerable improvement
in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled
human agent (e.g. to transcribe an audio segment) or a physical experiment (e.g. determining the 3D
structure of a protein or determining whether there is oil at a particular location). The cost associated
with the labeling process thus may render large, fully labeled training sets infeasible, whereas
acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning
can be of great practical value.
1.3.3 Unsupervised learning
Unsupervised learning algorithms use an unlabelled dataset to train the model. Preparing a training
dataset for unsupervised learning is easier than for supervised learning, as no labels are needed. In
unsupervised learning the model learns patterns from unlabelled data. The goal is that through
mimicry, which is an important mode of learning in people, the machine is forced to build a concise
representation of its world and then generate imaginative content from it. Unsupervised learning
resembles the way a human learns to think from their own experiences, which is why it is sometimes
regarded as closer to real AI. Unsupervised learning problems can be further divided into two types:
clustering and association. Clustering is the grouping of objects into clusters such that objects in one
group are highly similar to each other and have little or no similarity to objects in other groups.
Association rule learning is an unsupervised method used for finding relationships between variables
in a large database.
Some popular unsupervised learning algorithms and techniques are k-means clustering, k-nearest
neighbors, hierarchical clustering, anomaly detection, neural networks, principal component analysis,
the Apriori algorithm and singular value decomposition.
1.3.4 Reinforcement learning
Reinforcement learning is concerned with how intelligent agents ought to take actions in an
environment in order to maximize the notion of cumulative reward. RL focuses on finding a balance
between exploration (of uncharted territory) and exploitation (of current knowledge). Due to its
generality, reinforcement learning is studied in many disciplines, such as game theory, control
theory, operations research, information theory, simulation-based optimization, multi-agent
systems, swarm intelligence, and statistics.
1.3.5 Evolutionary learning
Biological evolution can be seen as a learning process: biological organisms adapt to improve their
survival rates and chance of having offspring in their environment. This can be modelled in a
computer, using an idea of fitness, which corresponds to a score for how good the current solution is.
1.4. Machine learning process
To solve a problem using machine learning, some prerequisite steps have to be carried out.
Data collection and preparation
First and foremost, data has to be collected and prepared in a particular format. In general, a large
amount of data is required to train a machine learning model. A dataset contains many separate pieces
of data, which together are used to train an algorithm with the goal of finding predictable patterns in
the whole dataset. After the raw data has been collected, it is preprocessed. Features are then
extracted from the preprocessed data in the form of numerical scores. In the case of supervised
learning, labels are also required, and obtaining them is often difficult and costly.
Feature selection
Amongst the large number of features extracted from the preprocessed data, feature selection is often
carried out to select an appropriate set of features so as to get the best training results.
Algorithm choice
An appropriate algorithm has to be chosen depending on the data and the type of problem to be solved.
Parameter and model selection
For many of the algorithms there are parameters that have to be set manually, or that require
experimentation to identify appropriate values. Model selection is the process of selecting one final
machine learning model from among a collection of candidate models for a training dataset. There
may be many competing concerns when performing model selection beyond model performance, such
as complexity, maintainability, and available resources.
Training
Given the dataset, algorithm, and parameters, training should be simply the use of computational
resources in order to build a model of the data in order to predict the outputs on new data.
Evaluation
Before a system can be deployed it needs to be tested and evaluated for accuracy on data that it was
not trained on. This can often include a comparison with human experts in the field, and the selection
of appropriate metrics for this comparison.
1.5. Weight space
When working with data, it is often useful to be able to plot it and analyze it. If our data has only two
or three input dimensions, then this is pretty easy: we use the x-axis for feature 1, the y-axis for feature
2, and the z-axis for feature 3. We then plot the positions of the input vectors on these axes. The same
thing can be extended to as many dimensions as we like. Even if we have 200 input dimensions (that
is, 200 elements in each of our input vectors) then we can try to imagine it plotted by using 200 axes
that are all mutually orthogonal (that is, at right angles to each other). Apart from data points, we can
plot some of the parameters of machine learning algorithm. This is particularly useful in the case of
neural networks since the parameters of a neural network are the values of a set of weights that
connect the neurons to the inputs.
Figure 2: The position of two neurons in weight space. The labels on the network refer to the dimension in
which that weight is plotted, not its value.
If we treat the weights that get fed into one of the neurons as a set of coordinates in weight space, then
we can plot them. We think about the weights that connect into a particular neuron, and plot the
strengths of the weights by using one axis for each weight that comes into the neuron. The position of
the neuron is then the location given by using the value of w1 as the position on the 1st axis, the value
of w2 as the position on the 2nd axis, and so on.
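As a rough illustration of treating weights as coordinates, the sketch below places two neurons and one input vector in the same three-dimensional space and compares distances. The weight and input values are made up, and it assumes each neuron has one weight per input dimension so that neurons and inputs share the same axes.

```python
import math

# Hypothetical weights for two neurons, each with three incoming weights
# (w1, w2, w3). All values are made up for illustration.
neuron_a = (0.2, -0.5, 1.0)
neuron_b = (0.3, -0.4, 0.0)
input_vector = (0.25, -0.45, 0.1)


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of axes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Each neuron is a point in weight space, so its distance to an input can be measured.
dist_a = euclidean_distance(neuron_a, input_vector)
dist_b = euclidean_distance(neuron_b, input_vector)
closest = "a" if dist_a < dist_b else "b"
print(f"neuron a: {dist_a:.3f}, neuron b: {dist_b:.3f}, closest: {closest}")
```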
1.6 Curse of dimensionality
The curse of dimensionality describes the explosive growth of data as dimensions increase, and the
resulting exponential increase in the computational effort required to process and analyze it. The term
was first introduced by Richard E. Bellman, in the area of dynamic programming, to explain the
increase in the volume of Euclidean space associated with adding extra dimensions. This phenomenon
is also observed in the field of machine learning. An increase in dimensions can, in theory, add more
information to the data and thereby improve its quality, but in practice it also increases noise and
redundancy during analysis.
In machine learning, a feature of an object can be an attribute or a characteristic that defines it. Each
feature represents a dimension and group of dimensions creates a data point. This represents a feature
vector that defines the data point to be used by a machine learning algorithm(s). When we say increase
in dimensionality it implies an increase in the number of features used to describe the data. For
example, in the field of breast cancer research, age, number of cancerous nodes can be used as features
to define the prognosis of the breast cancer patient. These features constitute the dimensions of a
feature vector. But other factors like past surgeries, patient history, type of tumor and other such
features help a doctor to better determine the prognosis. In this case by adding features, we are
theoretically increasing the dimensions of our data.
As the dimensionality increases, the number of data points required for good performance of any
machine learning algorithm increases exponentially. The reason is that a model can only be valid if it
has seen enough data points for each combination of feature values. For example, let’s say that for a
model to perform well, we need at least 10 data points for each combination of feature values. If we
have one binary feature, then for its 2¹ = 2 unique values (0 and 1) we would need 2¹ × 10 = 20 data
points. For 2 binary features, we would have 2² = 4 unique combinations of values and need
2² × 10 = 40 data points. Thus, for k binary features we would need 2ᵏ × 10 data points.
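The arithmetic above can be checked with a few lines of Python. This is only a sketch of the 2ᵏ × 10 rule from the example; the function name and the default of 10 points per combination are choices made here for illustration.

```python
def required_points(num_binary_features, points_per_combination=10):
    """Data points needed so that every combination of binary feature values
    has at least `points_per_combination` examples."""
    combinations = 2 ** num_binary_features
    return combinations * points_per_combination


for k in (1, 2, 10, 20):
    print(k, required_points(k))
```

Even 20 binary features already call for more than ten million data points under this rule.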
Figure 3: Need for data points with increase in dimensions.
Hughes (1968) concluded in his study that, with a fixed number of training samples, the predictive
power of any classifier first increases as the number of dimensions increases, but beyond a certain
number of dimensions the performance deteriorates. Thus, the curse of dimensionality is also known
as the Hughes phenomenon.
Figure 4: Curse of dimensionality or Hughes phenomenon.
For any point A, let dist_min(A) be the distance between A and its nearest neighbor and dist_max(A)
the distance between A and its farthest neighbor.
In one-dimensional, 2D, or even 3D data space,
$$\frac{\mathrm{dist}_{\max}(A) - \mathrm{dist}_{\min}(A)}{\mathrm{dist}_{\min}(A)} > 0$$
But as the dimension d increases, that is, as d → ∞,
$$\lim_{d \to \infty} \frac{\mathrm{dist}_{\max}(A) - \mathrm{dist}_{\min}(A)}{\mathrm{dist}_{\min}(A)} = 0$$
That is, for n random points in a d-dimensional space, dist_min(A) ≈ dist_max(A), meaning that all
the other points are almost equidistant from A.
Therefore, any machine learning algorithms which are based on the distance measure including
KNN(k-Nearest Neighbor) tend to fail when the number of dimensions in the data is very high. Thus,
dimensionality can be considered as a “curse” in such algorithms.
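The effect can be observed in a small simulation. The sketch below assumes points drawn uniformly at random from the unit hypercube (a choice made here, not taken from the text) and measures the relative gap (farthest distance minus nearest distance, divided by nearest distance) from one reference point A to 200 other points.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(dist_max - dist_min) / dist_min from one reference point A to
    n_points other points, all drawn uniformly from the unit hypercube."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        distances.append(math.dist(a, p))
    return (max(distances) - min(distances)) / min(distances)


for d in (1, 2, 3, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

The printed ratio shrinks as the number of dimensions grows, which is why nearest-neighbor comparisons become less informative in high dimensions.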
Solutions to curse of dimensionality: One way to reduce the impact of high dimensions is to use a
different measure of distance in the vector space. One could explore the use of cosine similarity in
place of Euclidean distance, since cosine similarity can be less affected by high dimensionality.
However, whether such a measure is appropriate depends on the problem being solved. Other
methods involve reducing the number of dimensions. Some of the techniques that can be used are:
1. Forward-feature selection: This method involves picking the most useful subset of features from all
given features.
2. PCA/t-SNE: These methods reduce the number of features, but they replace the original features
with derived ones and do not necessarily preserve the separation between classes, which can make the
results hard to interpret.
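To make the difference between the two distance measures mentioned above concrete, the following sketch computes cosine similarity and Euclidean distance for a few made-up vectors. Cosine similarity depends only on direction, so two vectors pointing the same way are maximally similar even when they are far apart in Euclidean terms.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
w = [3.0, 2.0, 1.0]   # different direction

print(cosine_similarity(u, v))   # direction is identical
print(math.dist(u, v))           # Euclidean distance is still non-zero
print(cosine_similarity(u, w))
```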
1.7 Overfitting
The goal of a machine learning model is to generalize well. Generalization is the ability of an ML
model to produce suitable outputs for inputs it has not seen before. Overfitting degrades this ability.
It can occur when, during training, the model also fits the noise and inaccuracies present in the
dataset. An overfitted model has low bias and high variance. The chance of overfitting increases the
longer we train the model. Overfitting is a major problem in supervised learning.
For example, the graph in Figure 5 shows the output of a linear regression problem for a model that
was overfitted.
Figure 5: Instead of finding the best-fit line, the model tried to cover every data point, resulting in a
haphazard curve.
The model tried to pass through all the data points in the scatter plot. This may look efficient, but in
reality it is not: the goal of the regression model is to find the best-fit line, and since no such fit has
been found here, the model will produce prediction errors on new data.
Figure 6 shows two further graphs. Rather than finding the generating function (left side of
Figure 6), the model matches the inputs perfectly, including the noise in them (right side of
Figure 6). This reduces the generalization capability of the model.
Figure 6: Effect of overfitting.
Methods to avoid overfitting:
[Noise: Unnecessary and irrelevant data.]
[Bias: The prediction error introduced into the model by oversimplifying the machine learning
algorithm, that is, the systematic difference between the predicted values and the actual values.]
[Variance: The sensitivity of the model to the particular training set. High variance shows up when the
model performs well on the training dataset but does not perform well on the test dataset.]
1.8 Training, testing and validation sets
A machine learning process needs three sets of data: the training set, the validation set and the test
set. The training set is used to actually train the algorithm. The validation set is used to keep track of
how well the model is doing as it learns. The test set is used to produce the final results. The exact
proportion of training to testing to validation data is up to the user; typical choices are 50:25:25 or
60:20:20.
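A split of this kind can be sketched as follows. The function name, the fixed random seed and the order of the ratios (training, validation, test) are assumptions made here for illustration; the data is shuffled first so that each set is a random sample.

```python
import random


def train_val_test_split(data, ratios=(60, 20, 20), seed=0):
    """Shuffle a copy of `data` and cut it into three parts by ratio."""
    items = list(data)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # whatever remains goes to the test set
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```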
1.9 The confusion matrix
We need to evaluate the machine learning model we have trained. The confusion matrix is one of the
tools used to evaluate a model trained for a classification problem. It is also known as an
error matrix. It is a specific table layout that allows visualization of the performance of an
algorithm. Each row of the matrix represents the instances in an actual class while each column
represents the instances in a predicted class, or vice versa – both variants are found in the
literature. The name stems from the fact that it makes it easy to see whether the system is confusing
two classes (i.e. commonly mislabeling one as another). It is a special kind of contingency table, with
two dimensions ("actual" and "predicted"), and identical sets of "classes" in both dimensions (each
combination of dimension and class is a variable in the contingency table).
Example:
Given a sample of 12 individuals, 8 that have been diagnosed with cancer and 4 that are cancer-free,
where individuals with cancer belong to class 1 (positive) and non-cancer individuals belong to class
0 (negative), we can display that data as follows:
Assume that we have a classifier that distinguishes between individuals with and without cancer in
some way, we can take the 12 individuals and run them through the classifier. The classifier then
makes 9 accurate predictions and misses 3: 2 individuals with cancer wrongly predicted as being
cancer-free (sample 1 and 2), and 1 person without cancer that is wrongly predicted to have cancer
(sample 9).
Notice that if we compare the actual classification set to the predicted classification set, there are 4
different outcomes that could result in any particular column. First, if the actual classification is
positive and the predicted classification is positive (1,1), this is called a true positive result because
the positive sample was correctly identified by the classifier. Second, if the actual classification is
positive and the predicted classification is negative (1,0), this is called a false negative result because
the positive sample is incorrectly identified by the classifier as being negative. Third, if the actual
classification is negative and the predicted classification is positive (0,1), this is called a false positive
result because the negative sample is incorrectly identified by the classifier as being positive. Fourth,
if the actual classification is negative and the predicted classification is negative (0,0), this is called a
true negative result because the negative sample gets correctly identified by the classifier.
We can then perform the comparison between actual and predicted classifications and add this
information to the table, making correct results appear in green so they are more easily identifiable.
The template for any binary confusion matrix uses the four kinds of results discussed above (true
positives, false negatives, false positives, and true negatives) along with the positive and negative
classifications. The four outcomes can be formulated in a 2×2 confusion matrix, as follows:

                     Predicted positive     Predicted negative
Actual positive      True positive (TP)     False negative (FN)
Actual negative      False positive (FP)    True negative (TN)

The color convention of the three data tables above was picked to match this confusion matrix, in
order to easily differentiate the data.
Now, we can simply total up each type of result, substitute into the template, and create a confusion
matrix that will concisely summarize the results of testing the classifier:

                       Predicted cancer     Predicted cancer-free
Actual cancer                 6                      2
Actual cancer-free            1                      3
In this confusion matrix, of the 8 samples with cancer, the system judged that 2 were cancer-free, and
of the 4 samples without cancer, it predicted that 1 did have cancer. All correct predictions are located
on the diagonal of the table (highlighted in green), so it is easy to visually inspect the table for
prediction errors, as values outside the diagonal will represent them. By summing up the 2 rows of the
confusion matrix, one can also deduce the total number of positive (P) and negative (N) samples in
the original dataset, i.e. P = TP + FN = 6 + 2 = 8 and N = FP + TN = 1 + 3 = 4.
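The counting in this example can be reproduced with a short script. The text fixes samples 1, 2 and 9 as the misclassified ones; listing the 8 cancer cases first and the 4 cancer-free cases last is an assumption made here so that the labels can be written down.

```python
# Actual and predicted labels for the 12 individuals (1 = cancer, 0 = cancer-free).
# The ordering of the samples is an assumption; only samples 1, 2 and 9 are
# stated in the text as errors.
actual    = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]


def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) for binary labels, with 1 as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


tp, fn, fp, tn = confusion_counts(actual, predicted)
print("TP, FN, FP, TN =", tp, fn, fp, tn)
print("P =", tp + fn, "N =", fp + tn)
```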
1.10 Accuracy metrics
With reference to the confusion matrix, accuracy is defined as the sum of the number of true positives
and true negatives divided by the total number of examples:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
The problem with accuracy is that it doesn’t tell us everything about the results, since it turns four
numbers into just one. There are two complementary pairs of measurements that can help us to
interpret the performance of a classifier, namely sensitivity and specificity, and precision and recall.
Their definitions are shown next, followed by some explanation.
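In terms of the confusion-matrix counts, the standard definitions of these four measures can be written as below, where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives.

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
```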
Sensitivity (also known as the true positive rate) is the ratio of the number of correct positive
examples to the number of actual positive examples, while specificity is the same ratio for negative
examples. Precision is the ratio of correct positive examples to the number of examples classified as
positive, while recall is the ratio of the number of correct positive examples to the number of actual
positive examples, which is the same as sensitivity.
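Precision and recall can be combined to give a single measure called the F1 measure:
$$F_1 = 2 \, \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
Or it can also be written as:
$$F_1 = \frac{TP}{TP + \frac{FN + FP}{2}}$$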
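As a worked example, the sketch below evaluates these measures for the cancer classifier of section 1.9 (6 true positives, 2 false negatives, 1 false positive, 3 true negatives), using the standard definitions. The function name is illustrative, and the code assumes none of the denominators is zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, specificity and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)       # correct positives / predicted positives
    recall = tp / (tp + fn)          # correct positives / actual positives (sensitivity)
    specificity = tn / (tn + fp)     # correct negatives / actual negatives
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Counts from the cancer example in section 1.9.
metrics = classification_metrics(tp=6, fn=2, fp=1, tn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The same F1 value is obtained from the second form, TP / (TP + (FN + FP) / 2) = 6 / 7.5 = 0.8.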
1.11 Receiver operator characteristic (ROC) curve
The ROC curve is a plot of the percentage of true positives on the y-axis against false positives on the
x-axis.
Figure 6: An example of an ROC curve. The diagonal line represents exactly chance, so anything above the line
is better than chance, and the further from the line, the better. Of the two curves shown, the one that is further
away from the diagonal line would represent a more accurate method.
A single run of a classifier produces a single point on the ROC plot, and a perfect classifier would be
a point at (0, 1) (100% true positives, 0% false positives), while the anti-classifier that got everything
wrong would be at (1,0); so the closer to the top-left-hand corner the result of a classifier is, the better
the classifier has performed.
Any classifier that sits on the diagonal line from (0,0) to (1,1) behaves exactly at the chance level
(assuming that the positive and negative classes are equally common).
In order to compare classifiers, or choices of parameter settings for the same classifier, you could just
compute the point that is furthest from the ‘chance’ line along the diagonal. However, it is normal to
compute the area under the curve (AUC) instead. If you only have one point for each classifier, the
curve is the trapezoid that runs from (0,0) up to the point and then from there to (1,1). If there are
more points (based on more runs of the classifier, such as trained and/or tested on different datasets),
then they are just included in order along the diagonal line. The key to getting a curve rather than a
point on the ROC curve is to use cross-validation. If you use 10-fold cross-validation, then you have
10 classifiers, with 10 different test sets, and you also have the ‘ground truth’ labels. The true labels
can be used to produce a ranked list of the different cross-validation-trained results, which can be used
to specify a curve through the 10 datapoints on the ROC curve that correspond to the results of this
classifier. By producing an ROC curve for each classifier it is possible to compare their results.
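The area under the curve can be sketched in a few lines. The first part computes the trapezoid area for a single classifier point, as described above, using the false positive rate 1/4 and true positive rate 6/8 of the cancer example. The second part builds a curve by sweeping a decision threshold over predicted scores; this is a commonly used construction and is a different route from the cross-validation procedure described in the text. The labels and scores there are made up.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR) obtained by sweeping a threshold over the scores.
    A sample is predicted positive when its score >= threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule; points ordered by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# A single classifier run gives one point; the curve is (0,0) -> point -> (1,1).
single_run = [(0.0, 0.0), (0.25, 0.75), (1.0, 1.0)]
print(auc(single_run))

# Hypothetical labels and scores, made up for illustration.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```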
1.12 Unbalanced datasets
Imbalanced data refers to those types of datasets where the target class has an uneven distribution of
observations, i.e one class label has a very high number of observations and the other has a very low
number of observations.
Example:
Let’s assume that XYZ is a bank that issues credit cards to its customers. The bank is concerned
that some fraudulent transactions are going on, and when it checks its data it finds that out of every
2000 transactions only 30 are recorded as fraud. So fraud accounts for less than 2% of transactions
(30/2000 = 1.5%), or equivalently more than 98% of transactions are “No Fraud”. Here, the class
“No Fraud” is called the majority class, and the much smaller “Fraud” class is called the minority
class.
Other examples of imbalanced data include disease diagnosis, customer churn prediction, fraud
detection and natural disaster prediction. Some class imbalance is normal in classification problems,
but in some cases the imbalance is acute, with the majority class far more common than the
minority class.
Problems with imbalanced data classification: If we explain it in a very simple manner, the main
problem with imbalanced dataset prediction is how accurately are we actually predicting both
majority and minority class? Let’s explain it with an example of disease diagnosis. Let’s assume we
are going to predict disease from an existing dataset where for every 100 records only 5 patients are
diagnosed with the disease. So, the majority class is 95% with no disease and the minority class is
only 5% with the disease. Now, assume our model predicts that all 100 out of 100 patients have no
disease.
When the records of one class greatly outnumber those of the other class, the classifier may become
biased towards predicting the majority class. The confusion matrix shows how well the model
classifies the target classes, and the accuracy of the model is obtained from it as the total number of
correct predictions divided by the total number of predictions. In the above case it is
(0+95)/(0+95+0+5) = 0.95 or 95%. This means that the model fails to identify the minority class at
all, yet its accuracy score is 95%.
Thus the traditional approach of judging a classifier by its accuracy is not useful in the case of an
imbalanced dataset.
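The disease example can be reproduced directly. The sketch below builds the 100 records from the example (5 with the disease, 95 without) and a model that predicts "no disease" for everyone, then compares accuracy with the fraction of diseased patients actually found.

```python
# 100 patients: 5 with the disease (1) and 95 without (0), as in the example.
actual = [1] * 5 + [0] * 95
predicted = [0] * 100   # the model predicts "no disease" for everyone

correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
recall = true_positives / sum(actual)   # fraction of diseased patients found

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy is 0.95 while the recall for the disease class is 0, which is exactly the failure that accuracy alone hides.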
Techniques to handle unbalanced datasets:
1) Choose proper evaluation metric
The accuracy of a classifier is the total number of correct predictions by the classifier divided by the
total number of predictions. This may be good enough for well-balanced classes but is not ideal for an
imbalanced class problem. Other metrics are more informative: precision measures how accurate the
classifier’s predictions of a specific class are, and recall measures the classifier’s ability to identify
a class.
For an imbalanced dataset the F1 score is a more appropriate metric. It is the harmonic mean of
precision and recall:
$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
So, if the classifier predicts the minority class but the predictions are often wrong, false positives
increase and the precision, and with it the F1 score, will be low. Likewise, if the classifier identifies
the minority class poorly, i.e. more of this class is wrongly predicted as the majority class, then false
negatives increase, so recall and the F1 score will be low. The F1 score increases only if both the
number and the quality of predictions improve. Thus the F1 score balances precision and recall, and
improves only if the classifier identifies more of a class correctly.
2) Resampling (oversampling and undersampling)
This technique upsamples the minority class or downsamples the majority class. With an imbalanced
dataset, we can oversample the minority class by sampling from it with replacement; this is called
oversampling. Similarly, we can randomly delete rows from the majority class until it matches the
minority class in size, which is called undersampling. After resampling we obtain a dataset that is
balanced between the majority and minority classes. When both classes have a similar number of
records in the dataset, we can expect the classifier to give equal importance to both classes.
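Both resampling strategies can be sketched with the standard library alone. The rows below are placeholders that mirror the bank example (1970 "No Fraud" and 30 "Fraud" transactions per 2000); the function names and the fixed seed are illustrative choices.

```python
import random


def oversample_minority(majority, minority, seed=0):
    """Draw minority rows with replacement until both classes are equal in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def undersample_majority(majority, minority, seed=0):
    """Keep a random subset of majority rows, the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Placeholder rows mirroring the bank example: 1970 "No Fraud", 30 "Fraud".
no_fraud = [("txn", i, "No Fraud") for i in range(1970)]
fraud = [("txn", i, "Fraud") for i in range(30)]

over_major, over_minor = oversample_minority(no_fraud, fraud)
under_major, under_minor = undersample_majority(no_fraud, fraud)
print(len(over_major), len(over_minor))
print(len(under_major), len(under_minor))
```

Oversampling keeps all the original data but repeats minority rows, while undersampling discards most of the majority rows.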
3) SMOTE
Synthetic Minority Oversampling Technique, or SMOTE, is another technique to oversample the
minority class. Simply adding duplicate records of the minority class often does not add any new
information to the model. In SMOTE, new instances are synthesized from the existing data. In simple
words, SMOTE takes a minority class instance, uses k nearest neighbors to select one of its nearest
minority class neighbors at random, and creates a synthetic instance at a random point between the
two in feature space.
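The interpolation idea can be sketched as follows. This is a simplified illustration of the SMOTE step for numeric features, not a full implementation; the 2-D points, the value k = 3 and the function name are made up for the example.

```python
import math
import random


def smote_sample(minority, k=3, rng=None):
    """Create one synthetic minority point by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # k nearest neighbours of `base` among the other minority points
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()   # position along the segment from base to neighbour
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))


# Hypothetical 2-D minority-class points, made up for illustration.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (5.0, 5.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Each synthetic point lies on the line segment between two existing minority points, so it is new data rather than a duplicate.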
4) Balanced Bagging Classifier
When we use an ordinary classifier on an imbalanced dataset, the model favors the majority class
because of its larger presence. A Balanced Bagging Classifier (BalancedBaggingClassifier in the
imbalanced-learn library) works like the scikit-learn bagging classifier but with additional balancing:
it includes an extra step that balances the training set at fit time using a given sampler. This classifier
takes two special parameters, “sampling_strategy” and “replacement”. The sampling_strategy decides
the type of resampling required (e.g. ‘majority’ to resample only the majority class, ‘all’ to resample
all classes, etc.) and replacement decides whether the sampling is done with replacement or not.
5) Threshold moving
Many classifiers actually predict the probability of class membership. We assign each predicted
probability to a class based on a threshold, which is usually 0.5: if the probability is below 0.5 the
sample is assigned to one class, and otherwise to the other class. For imbalanced class problems, this
default threshold may not work properly. We need to change the threshold to an optimum value so
that it separates the two classes efficiently. We can use ROC curves and precision-recall curves to
find the optimal threshold for the classifier. We can also use a grid search, that is, a search within a
set of candidate values, to identify the optimal value.
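A grid search over thresholds can be sketched as below. The labels and predicted probabilities are made up, the F1 score is used as the selection criterion (one possible choice among several), and a sample is assigned to class 1 when its probability is at or above the threshold.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 score when a sample is assigned to class 1 if its probability >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0
    return tp / (tp + (fn + fp) / 2)


def best_threshold(labels, probabilities, candidates):
    """Grid search: return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: f1_at_threshold(labels, probabilities, t))


# Hypothetical predicted probabilities for an imbalanced problem (3 positives in 10).
labels        = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
probabilities = [0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.45, 0.30, 0.40, 0.70]

candidates = [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
t = best_threshold(labels, probabilities, candidates)
print("F1 at 0.5:", f1_at_threshold(labels, probabilities, 0.5))
print("best threshold:", t, "F1:", f1_at_threshold(labels, probabilities, t))
```

On this made-up data the default threshold of 0.5 finds only one of the three positives, while a lower threshold recovers all three at the cost of one false positive.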
1.13 Some basic statistics
This section discusses some basic statistical concepts important for machine learning.
1.13.1 Averages
Mean, median and mode are basic types of averages.
Mean = (sum of all the points in the dataset) / (total number of data points)
The median is the middle value of the dataset once it has been sorted. The mode is the most common
value, so it just requires counting how many times each element appears and picking the most
frequent one.
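The three averages can be computed with Python's standard `statistics` module. The data values below are made up for illustration.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))     # sum of the values divided by their count
print(statistics.median(data))   # middle of the sorted data (mean of the two middle values here)
print(statistics.mode(data))     # most frequent value
```

With an even number of data points there is no single middle value, so the median is taken as the mean of the two middle values.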
1.13.2 Variance and covariance
In statistics and probability theory, covariance describes the joint variability of two random variables,
x and y. It measures the extent to which the two variables change together: whether larger values of
one tend to go with larger or with smaller values of the other. The variance of a single variable is the
special case of the covariance of that variable with itself. Covariance is measured in units obtained by
multiplying the units of the two variables.
Types of Covariance:
Covariance can have both positive and negative values. Based on this, it has two types:
1. Positive Covariance
2. Negative Covariance
Positive Covariance:
If the covariance of two variables is positive, both variables tend to move in the same direction.
That is, greater values of one variable correspond to greater values of the other variable, and lesser
values correspond to lesser values.
Negative Covariance:
If the covariance of two variables is negative, the variables tend to move in opposite directions.
It is the opposite case of positive covariance: greater values of one variable correspond to lesser
values of the other variable and vice versa.
Population covariance formula:
$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$
Sample covariance:
$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
Where
$x_i$ is a data value of x.
$y_i$ is a data value of y.
$\bar{x}$ is the mean of x.
$\bar{y}$ is the mean of y.
N is the number of data points.
Figure 7: Covariance of X and Y.
If cov(X, Y) is greater than zero, the covariance is positive and the two variables tend to move in
the same direction.
If cov(X, Y) is less than zero, the covariance is negative and the two variables tend to move in
opposite directions.
If cov(X, Y) is zero, there is no linear relationship between the two variables (they may still be
related in a non-linear way).
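The population and sample covariance formulas can be sketched in a few lines of Python. The function name covariance, its sample flag and all the data below are assumptions made for illustration only.

```python
def covariance(xs, ys, sample=True):
    """Covariance of two equally long sequences.

    sample=True divides by N - 1 (sample covariance),
    sample=False divides by N (population covariance).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two data points are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Hypothetical data.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]    # tends to rise with hours
errors = [5, 4, 5, 4, 2]   # tends to fall with hours

print(covariance(hours, score))                 # positive
print(covariance(hours, score, sample=False))   # positive, divided by N
print(covariance(hours, errors))                # negative

# y is fully determined by x (y = x squared), yet the covariance is zero
# because the relationship is not linear.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
print(covariance(xs, ys))
```

The last case shows why a covariance of zero only rules out a linear relationship between the variables.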
1.13.3 The Gaussian distribution
In probability theory and statistics, the Normal Distribution, also called the Gaussian Distribution,
is the most significant continuous probability distribution. It is sometimes also called the bell
curve. A large number of random variables in the physical sciences and economics are either nearly or
exactly represented by the normal distribution. Furthermore, it can be used to approximate other
probability distributions, which supports the use of the word 'normal' in the sense of the one most
commonly used.
The Normal Distribution is defined by its probability density function for a continuous random
variable. Let f(x) be the probability density function and X the random variable. Integrating f over
an interval gives the probability that X takes a value in that interval; in particular, f(x) dx is the
probability that X lies between x and x + dx. A probability density function satisfies
$$f(x) \geq 0 \quad \forall x \in (-\infty, +\infty)$$
and
$$\int_{-\infty}^{+\infty} f(x)\,dx = 1$$
The probability density function of the normal or Gaussian distribution is given by
$$f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
Where,
x is the variable, μ is the mean and σ is the standard deviation.
The random variables following the normal distribution are those that can take any value in a given
range. For example, consider the heights of the students in a school. Here, the variable can take any
value, but it is bounded within a range, say 0 to 6 ft. The normal distribution itself, however, is
not restricted to a range: it extends from −∞ to +∞ and still gives a smooth curve. Such random
variables are called continuous variables, and the normal distribution gives the probability of the
value lying in a particular range for a given experiment.
The normal distribution can have any positive standard deviation. The mean determines the line of
symmetry of the graph, whereas the standard deviation determines how far the data are spread out. If
the standard deviation is smaller, the data are closer to each other and the graph becomes narrower.
If the standard deviation is larger, the data are more dispersed and the graph becomes wider. Standard
deviations are used to subdivide the area under the normal curve. Each subdivided section defines the
percentage of data that falls into that region of the graph.
For normally distributed data, the Empirical Rule states that:
Approximately 68% of the data falls within one standard deviation of the mean (i.e., between
Mean − one standard deviation and Mean + one standard deviation).
Approximately 95% of the data falls within two standard deviations of the mean (i.e., between
Mean − two standard deviations and Mean + two standard deviations).
Approximately 99.7% of the data falls within three standard deviations of the mean (i.e., between
Mean − three standard deviations and Mean + three standard deviations).
Thus, the empirical rule is also called the 68 – 95 – 99.7 rule.
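The three percentages of the empirical rule can be checked numerically. The sketch below relies on the standard identity that, for a normal variable, the probability of lying within k standard deviations of the mean equals erf(k/√2); the function name prob_within is an assumption made for the example.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(prob_within(k) * 100, 2), "%")
```

The result does not depend on the particular mean or standard deviation, which is why the rule holds for every normal distribution.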
Example: Calculate the probability density function of normal distribution using the following data. x
= 3, μ = 4 and σ = 2.
Solution: Given, variable, x = 3
Mean = 4 and
Standard deviation = 2
By the formula of the probability density of normal distribution, we can write;
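$$f(3, 4, 2) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(3 - 4)^2}{2 \cdot 2^2}} = \frac{1}{2\sqrt{2\pi}}\, e^{-0.125}$$
Hence, f(3,4,2) ≈ 0.176.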
Example: If the value of random variable is 2, mean is 5 and the standard deviation is 4, then find the
probability density function of the gaussian distribution.
Solution: Given,
Variable, x = 2
Mean = 5 and
Standard deviation = 4
By the formula of the probability density of normal distribution, we can write;
$$f(2, 5, 4) = \frac{1}{4\sqrt{2\pi}}\, e^{-\frac{(2 - 5)^2}{2 \cdot 4^2}} = \frac{1}{4\sqrt{2\pi}}\, e^{-9/32}$$
f(2,5,4) ≈ 0.0753
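The probability density of the normal distribution is easy to evaluate directly. The sketch below is a minimal version; the function name normal_pdf is an assumption. It is evaluated at the inputs stated in the two examples above (x = 3, μ = 4, σ = 2 and x = 2, μ = 5, σ = 4), and also at the peak x = μ = 2 with σ = 4, where the exponent is zero.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)


print(round(normal_pdf(3, 4, 2), 4))  # first example
print(round(normal_pdf(2, 5, 4), 4))  # second example
print(round(normal_pdf(2, 2, 4), 4))  # at the mean, where the density peaks
```

Since the density of a normal distribution can never exceed 1/(σ√(2π)), the value at the mean gives a quick sanity check on any hand calculation.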
The normal distribution has two main parameters: the mean and the standard deviation. The mean is the
location parameter, which fixes the centre of the distribution, and the standard deviation is the
scale parameter, which fixes its spread.
Normal Distribution Properties
Some of the important properties of the normal distribution are listed below:
In a normal distribution, the mean, median and mode are equal (i.e., Mean = Median = Mode).
The total area under the curve is equal to 1.
The normal curve is symmetric about the centre.
Exactly half of the values are to the right of the centre and exactly half of the values are to the
left of the centre.
The normal distribution is fully defined by the mean and standard deviation.
The normal distribution curve has only one peak (i.e., it is unimodal).
The curve approaches the x-axis as it extends farther away from the mean, but it never touches the
axis.
Applications
The normal distribution is commonly used to model many quantities, such as:
Marks scored on the test
Heights of different persons
Size of objects produced by the machine
Blood pressure and so on.
1.14 The bias-variance tradeoff
In predictive modeling, there is a tradeoff between a model's ability to minimize bias and its ability
to minimize variance. Bias is the prediction error introduced by oversimplifying the machine learning
model; it is the difference between the model's average prediction and the actual values. Variance is
the sensitivity of the model to the particular training dataset; high variance shows up when the model
performs well on the training dataset but not on the test dataset. In machine learning, these errors
will always be present, as there is always a slight difference between the model predictions and the
actual values. We have to reduce these errors in order to obtain accurate results.
Figure 8: Plot of model complexity versus error.
In machine learning, an error is a measure of how accurately an algorithm can make predictions for a
previously unseen dataset. On the basis of these errors, the model that performs best on the
particular dataset is selected. There are mainly two types of errors in machine learning: reducible
and irreducible errors. Bias and variance are the reducible errors, which can be lowered to improve
the model's accuracy. Irreducible errors, such as noise in the data, are always present and cannot be
removed by changing the model. High bias mainly occurs when the model is too simple. To reduce high
bias when underfitting occurs, we should increase the number of input features, make the model more
complex (for example by introducing polynomial features), and decrease the regularization term.
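The split into reducible and irreducible error can be written as a formula for the special case of squared-error loss. The notation is an assumption introduced here and does not appear elsewhere in these notes: the data are taken to follow y = f(x) + ε with zero-mean noise ε of variance σ_ε² (unrelated to the σ of section 1.13.3), f̂ is the model fitted on a random training set, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma_\varepsilon^2}_{\text{irreducible error}}
```

The first two terms depend on the choice of model and can be reduced; the last term is a property of the data and cannot.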
Figure 9: Underfitting, overfitting and optimal generalization while training the model.
To reduce high variance when overfitting occurs, we should reduce the number of input features,
increase the amount of training data and increase the regularization term.
Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors
and Support Vector Machines.
Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant
Analysis and Logistic Regression.
Examples of low-variance machine learning algorithms include: Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest
Neighbors and Support Vector Machines.
Figure: Four possible combinations of bias and variance.
The bias-variance trade-off is a central issue in supervised learning. Ideally, we need a model that
accurately captures the regularities in the training data and simultaneously generalizes well to
unseen data. Unfortunately, it is typically not possible to do both perfectly at the same time: a
high-variance algorithm may perform well on the training data but overfit to the noise in it, whereas
a high-bias algorithm produces a much simpler model that may not even capture important regularities
in the data. Hence, the bias-variance trade-off is about finding the sweet spot that balances bias and
variance errors to give an optimal model.
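The trade-off can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not a definitive experiment: the true function sin(2πx), the noise level, the fixed training grid and the two deliberately extreme models (a constant predictor as a high-bias model and 1-nearest-neighbour as a high-variance model) are all choices made for the example, as are the function names.

```python
import math
import random
import statistics


def true_f(x):
    """Hypothetical true relationship between input and output."""
    return math.sin(2 * math.pi * x)


def fit_constant(xs, ys):
    """High-bias model: always predicts the mean of the training outputs."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_1nn(xs, ys):
    """High-variance model: predicts the output of the nearest training input."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict


def bias_variance(fit, n_repeats=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance, averaged over a few test points."""
    rng = random.Random(seed)
    train_x = [i / 19 for i in range(20)]
    test_x = [0.05 + 0.1 * i for i in range(10)]
    preds = [[] for _ in test_x]
    for _ in range(n_repeats):
        # Each repeat draws a fresh noisy training set and refits the model.
        train_y = [true_f(x) + rng.gauss(0, noise_sd) for x in train_x]
        model = fit(train_x, train_y)
        for j, x in enumerate(test_x):
            preds[j].append(model(x))
    bias_sq = statistics.fmean(
        [(statistics.fmean(p) - true_f(x)) ** 2 for p, x in zip(preds, test_x)]
    )
    variance = statistics.fmean([statistics.pvariance(p) for p in preds])
    return bias_sq, variance


const_bias_sq, const_var = bias_variance(fit_constant)
nn_bias_sq, nn_var = bias_variance(fit_1nn)
print("constant model: bias^2 =", round(const_bias_sq, 3), "variance =", round(const_var, 3))
print("1-NN model:     bias^2 =", round(nn_bias_sq, 3), "variance =", round(nn_var, 3))
```

The constant model barely changes from one training set to the next but misses the shape of the true function, while the 1-nearest-neighbour model follows the true function closely on average but copies the noise of each training set. A model between these extremes would aim for the sweet spot described above.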