BCS602: Intro to Machine Learning
INTRODUCTION TO MACHINE LEARNING
Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine Learning
The quality of data directly impacts the quality of experience and, ultimately, the
quality of learning systems.
Statistical Learning
In statistical learning, the relationship between input x and output y is modeled as:
y = f(x)
o f is the learning function mapping inputs to outputs.
In machine learning, this is referred to as the mapping of input to output.
o Flexibility: Works well with large, complex datasets; adaptable to different scenarios.
o Goal: Makes predictions based on learned patterns, often without needing detailed
statistical knowledge.
1.2 TYPES OF MACHINE LEARNING
What does the word ‘learn’ mean? Learning, like adaptation, occurs as the result of
interaction of the program with its environment. There are four types of machine learning
as shown in Figure 1.5.
S.No.  Length of Petal  Width of Petal  Length of Sepal  Width of Sepal  Class
A dataset need not always be numbers. It can be images or video frames. Deep neural
networks can handle images with labels. In the following Figure 1.6, the deep neural
network takes images of dogs and cats with labels for classification. In unlabelled data, there
are no labels in the dataset.
Figure 1.6: (a) Labelled Dataset (b) Unlabelled Dataset
Classification
Module 1- Machine Learning (BCS602)
Artificial Neural Networks (ANN) and Deep Learning (e.g., Convolutional Neural
Networks - CNN)
Regression Models
Regression is another type of supervised learning, similar to classification, but
instead of predicting categories (labels), it predicts continuous values, like
numbers.
Key Difference:
Regression: Predicts continuous values (e.g., product sales, house prices).
Classification: Predicts labels or categories (e.g., dog or cat).
How Regression Works:
In a regression model, we are trying to find a relationship between the
independent variable(s) (x) and the dependent variable (y).
For example, in Figure 1.8, the independent variable (x) is the number of weeks,
and the dependent variable (y) is product sales. The regression model fits a line
to the data, which can be used to predict future sales. This line, shown in Figure 1.8,
is written as:
Sales (y) = 0.66 × Week (x) + 0.54
Here, 0.66 and 0.54 are regression coefficients that the model learns from the data.
If you want to predict the sales for the 8th week, you can substitute x = 8 into
the formula and calculate the predicted sales: y = 0.66 × 8 + 0.54 = 5.82.
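The substitution above can be sketched in a few lines of Python. The function name predict_sales is hypothetical and used only for illustration; the coefficients 0.66 and 0.54 are the ones given in the text.

```python
def predict_sales(week, slope=0.66, intercept=0.54):
    """Predicted sales for a given week on the fitted line y = slope * x + intercept."""
    return slope * week + intercept


week_8_sales = predict_sales(8)
print(round(week_8_sales, 2))  # 5.82
```

The same function can be called with any other week number to read a prediction off the fitted line.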
Example:
It groups objects into different clusters, where each cluster contains objects that are
similar to each other.
The objects in one cluster are different from those in other clusters.
For example, if you have a set of images of dogs and cats, a clustering algorithm
will automatically group them into two clusters: one for dogs and one for cats,
without needing any labels to tell it which is which.
Applications of Clustering:
Image Segmentation: Grouping parts of an image, like separating a region of interest
(e.g., identifying a tumor in a medical image).
Gene Analysis: Finding groups of similar genes in a database.
In summary, unsupervised learning helps the algorithm discover patterns in data
without any explicit instructions. Cluster analysis and dimensional reduction are key
types of unsupervised learning.
3. Assigns categories or labels Performs grouping process such that similar objects will
be in one cluster
In this grid game, the gray tile indicates the danger, black is a block, and the tile with
diagonal lines is the goal. The aim is to start, say, from the bottom-left grid, using the actions
left, right, top and bottom to reach the goal state.
Figure 1.10: A Grid Game
To solve this sort of problem, there is no data. The agent interacts with the environment
to get experience. In the above case, the agent tries to create a model by simulating many
paths and finding rewarding paths. This experience helps in constructing a model.
Input (x1, x2)    Output (y)
1, 1              1
2, 1              2
3, 1              3
4, 1              4
5, 1              5
Can a model for this test data be multiplication? That is, y = x1 * x2. Yes, it fits. But it
is equally true that y may be y = x1 / x2 or y = x1 ^ x2. So, there are three functions that fit
the data.
This means that the problem is ill-posed. To solve this problem, one needs more examples to
check the model. Puzzles and games that do not have sufficient specification may become
ill-posed problems, and scientific computation has many ill-posed problems.
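A short Python check makes the ill-posedness concrete: all three candidate functions reproduce the five given rows, and only an additional example separates them. The extra example ((2, 3) -> 6) is a hypothetical row added here for illustration and is not part of the original data.

```python
# The five (x1, x2) -> y rows from the table above.
data = [((1, 1), 1), ((2, 1), 2), ((3, 1), 3), ((4, 1), 4), ((5, 1), 5)]

candidates = {
    "multiplication": lambda x1, x2: x1 * x2,
    "division": lambda x1, x2: x1 / x2,
    "exponentiation": lambda x1, x2: x1 ** x2,
}


def fits(func, samples):
    """True if func reproduces the output of every sample."""
    return all(func(x1, x2) == y for (x1, x2), y in samples)


consistent = [name for name, func in candidates.items() if fits(func, data)]
print(consistent)  # all three candidates fit the original rows

# Hypothetical extra example: with x2 != 1 the candidates disagree.
extended = data + [((2, 3), 6)]
still_consistent = [name for name, func in candidates.items() if fits(func, extended)]
print(still_consistent)  # only multiplication survives
```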
2. Need for Huge, Quality Data:
o Machine learning requires large amounts of high-quality data. The data must be
complete, without missing or incorrect values. Poor-quality data can lead to inaccurate
models.
3. High Computational Power:
o With the growth of Big Data, machine learning tasks require powerful computers with
specialized hardware like GPUs or TPUs to handle the high computational load. The
increasing complexity of tasks has made high-performance computing essential.
4. Complexity of Algorithms:
o Choosing the right machine learning algorithm, explaining how it works, applying it
correctly, and comparing different algorithms are now critical skills for data scientists.
This makes the selection and evaluation of algorithms a significant challenge.
5. Bias-Variance Trade-off:
o Overfitting: When a model performs well on training data but fails on test data, it’s
called overfitting. This means the model has learned the training data too well but lacks
generalization to new data.
o Underfitting: When a model fails to perform well on both training and test data, it’s
called underfitting. The model is too simple to capture the patterns in the data.
o Balancing between overfitting and underfitting is a major challenge for machine
learning algorithms.
1.6 MACHINE LEARNING PROCESS
The emerging process model for the data mining solutions for business organizations is
CRISP-DM. Since machine learning is like data mining, except for the aim, this process can
be used for machine learning. CRISP-DM stands for Cross-Industry Standard Process for Data
Mining. This process involves six steps. The steps are listed below in Figure 1.11.
1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is
enough for giving the solution. This step also involves the formulation of the problem
statement for the data mining process.
2. Understanding the data – It involves the steps like data collection, study of the
characteristics of the data, formulation of hypothesis, and matching of patterns to the
selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the
raw data and preparation of data for the data mining process. The missing values may
cause problems during both training and testing phases. Missing data forces classifiers to
produce inaccurate results. This is a perennial problem for the classification models. Hence,
suitable strategies should be adopted to handle the missing data.
4. Modelling – This step applies a data mining algorithm to the data to obtain a model
or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined by
evaluating the accuracy of the classifier. The process of classification is a fuzzy issue.
For example, classification of emails requires extensive domain knowledge and requires
domain experts. Hence, performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining
algorithm to improve the existing process or for a new situation.
7. Games: Game programs for Chess, Go, and Atari video games
8. Natural Language Translation: Google Translate, text summarization, and sentiment analysis
9. Web Analysis and Services: Identification of access patterns, detection of e-mail spam and
viruses, personalized web services, search engines like Google, detection of promotion of user
websites, and finding loyalty of users after web page layout modification
12. Scientific Domain: Discovery of new galaxies, identification of groups of houses based
on house type/geographical location, identification of earthquake epicenters, and
identification of similar land use
Key Terms:
• Machine Learning – A branch of AI concerned with enabling machines to learn automatically without being
explicitly programmed.
• Data – A raw fact.
• Model – An explicit description of patterns in a data.
• Experience – A collection of knowledge and heuristics in humans and historical training data in case of
machines.
• Predictive Modelling – A technique of developing models and making a prediction of unseen data.
• Deep Learning – A branch of machine learning that deals with constructing models using neural
networks.
• Data Science – A field of study that encompasses capturing of data to its analysis covering all stages of
data management.
• Data Analytics – A field of study that deals with analysis of data.
• Big Data – A study of data that has characteristics of volume, variety, and velocity.
• Statistics – A branch of mathematics that deals with learning from data using statistical methods.
• Hypothesis – An initial assumption of an experiment.
• Learning – Adapting to the environment that happens because of interaction of an agent with the
environment.
• Label – A target attribute.
• Labelled Data – A data that is associated with a label.
• Model Deployment – A method of deploying machine learning algorithms to improve the existing
business processes for a new situation.
Flat Files These are the simplest and most commonly available data
source. It is also the cheapest way of organizing the data. These flat
files are the files where data is stored in plain ASCII or EBCDIC format.
Minor changes of data in flat files affect the results of the data mining
algorithms.
Hence, flat file is suitable only for storing small dataset and not
desirable if the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated values; in these files the
values are separated by commas. These are used by spreadsheet and
database applications. The first row may have attributes and the rest
of the rows represent the data.
• TSV files – TSV stands for Tab separated values files where values
are separated by Tab. Both CSV and TSV files are generic in nature and
can be shared. There are many tools like Google Sheets and Microsoft
Excel to process these files.
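As a minimal sketch of how such files are read programmatically, Python's standard csv module handles both formats; only the delimiter changes. The sample contents (ids, names, marks) are hypothetical and held in memory so the example is self-contained.

```python
import csv
import io

# Hypothetical file contents: the first row holds the attribute names.
csv_text = "id,name,marks\n1,Asha,45\n2,Ravi,60\n"
tsv_text = "id\tname\tmarks\n1\tAsha\t45\n2\tRavi\t60\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

print(csv_rows[0]["name"], tsv_rows[1]["marks"])
print(csv_rows == tsv_rows)  # same data, different separator
```

Note that every value is read as a string; numeric columns such as marks need an explicit conversion before analysis.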
Database System It normally consists of database files and a
database management system (DBMS). Database files contain original
data and metadata. DBMS aims to manage data and improve operator
performance by including various tools like database administrator,
query processing, and transaction manager. A relational database
consists of sets of tables. The tables have rows and columns. The
columns represent the attributes and rows represent tuples. A tuple
corresponds to either an object or a relationship between objects. A
user can access and manipulate the data in the database using SQL.
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics
Descriptive Analytics
Descriptive analytics is about summarizing the main features of the
data you've collected. It tells you what has happened by using
historical data and statistical techniques. The goal is to organize,
describe, and present this data in an understandable way. Think of it
as a report that explains "what is" without drawing any conclusions
about why it happened.
Example: Imagine a store that collects data on monthly sales.
Descriptive analytics would summarize this data by calculating the
average sales, total revenue, or the most popular product in a given
month.
Key Point: It doesn't explain why the sales were high or low—just
tells you what the data shows.
Diagnostic Analytics
Diagnostic analytics answers the question "Why did this happen?"
It's about understanding the root cause of an event. By examining the
data closely, we look for patterns, trends, and relationships that
explain the cause of an outcome.
Example: If the store's sales drop one month, diagnostic analytics
would investigate why the drop happened. Maybe it’s due to bad
weather, a competitor’s sale, or a new product that didn’t perform
well. The analysis focuses on finding and explaining the reasons
behind the drop.
Key Point: It's all about cause and effect—identifying the reasons
behind the data patterns.
Predictive Analytics
Predictive analytics looks into the future and answers the question
"What will happen?" Using historical data and advanced algorithms
(like machine learning), it predicts future trends and outcomes.
Example: The store uses data from previous years to predict what the
sales will be in the upcoming holiday season. Algorithms analyze
patterns like past holiday sales, customer behavior, and current
market trends to make predictions.
Key Point: It focuses on forecasting future events based on current
and past data.
Prescriptive Analytics
Prescriptive analytics goes a step further and asks "What should we
do?" It not only predicts the future but also recommends actions to
take. This type of analytics provides decision-making support by
suggesting the best course of action to achieve desired outcomes.
Example: After predicting that sales will be low in the next quarter,
prescriptive analytics suggests specific actions the store can take, such
as launching a promotion, adjusting prices, or stocking more popular
products. This helps businesses make better decisions and minimize
risks.
Key Point: It’s all about decision-making—helping businesses choose
the best possible actions based on data.
o For patients John, Andre, and Raju, the Date of Birth (DoB) is
missing. This is an example of missing values.
2. Inaccurate Data:
o David's age is recorded as 5, but his DoB is 10/10/1980, which
makes his real age much older than 5. This is inconsistent data.
o Raju's age is recorded as 136, which is not realistic. This might be
a typographical error or an outlier.
3. Outliers:
o Raju’s age of 136 is an outlier, as it is an unrealistic value when
compared to normal human lifespans. Outliers are often caused by
data entry errors.
4. Noisy Data:
o John’s salary is recorded as -1500, which is not possible. Salary
cannot be negative, making this an example of noisy data.
o The entry for David’s salary is simply blank (" "), which is another
instance of missing data.
5. Inconsistent Values:
o In the salary column, Andre and Raju both have ‘Yes’ recorded,
which doesn’t make sense in the context of salary data. A salary should
be a numeric value, not a text response.
How to Address These Issues?
1. Missing Data:
o Ignore the Tuple: If a lot of values are missing in a row, you may
choose to ignore or remove that row from the dataset.
o Fill Values Manually: Domain experts can manually fill missing
values, but this is time-consuming.
o Use Global Constants: Fill missing values with a placeholder like
‘Unknown’ or ‘0’.
o Use Average/Mean Values: Replace missing numeric values (like
salary) with the average value of that column.
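Two of the strategies above, filling with a global constant and filling with the column mean, can be sketched as follows. The salary values are hypothetical, and None stands for a missing entry (an assumption about how missing values are encoded).

```python
from statistics import mean


def fill_with_constant(values, placeholder):
    """Replace every missing entry (None) with a fixed placeholder."""
    return [placeholder if v is None else v for v in values]


def fill_with_mean(values):
    """Replace every missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


salaries = [1200, None, 1500, None, 1800]  # hypothetical salary column

print(fill_with_constant(salaries, 0))
print(fill_with_mean(salaries))
```

Filling with 0 pulls the column average down, whereas mean filling keeps the average unchanged; which is preferable depends on the data.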
o Replace all values in the bin with the mean (average) of the bin
values.
Example:
o Given data: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}
o First, divide into bins of size 3:
Bin 1: {12, 14, 19}
Bin 2: {22, 24, 26}
Bin 3: {28, 31, 34}
o Now apply smoothing by means (replace all values with the bin's
mean):
Bin 1 (mean = 15): {15, 15, 15}
Bin 2 (mean = 24): {24, 24, 24}
Bin 3 (mean = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the mean of the
bin to smooth the data.
2. Smoothing by Medians:
o Replace all values in the bin with the median of the bin values (the
middle value when the data is sorted).
Example:
o Given the same data and bins:
Bin 1 (median = 14): {14, 14, 14}
Bin 2 (median = 24): {24, 24, 24}
Bin 3 (median = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the median,
which reduces the effect of outliers or extreme values.
3. Smoothing by Bin Boundaries:
o Replace each value in the bin with the closest boundary value
(minimum or maximum value in the bin).
Example:
o Given the same data and bins:
Bin 1 (boundary values: 12 and 19): {12, 12, 19}
Bin 2 (boundary values: 22 and 26): {22, 22, 26}
Bin 3 (boundary values: 28 and 34): {28, 34, 34}
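o Explanation: For each bin, values are replaced by the closest
boundary value (either the minimum or maximum of that bin). When a
value is equally far from both boundaries (24 in Bin 2 and 31 in Bin 3),
either boundary may be chosen.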
o Example: In Bin 1, the original data was {12, 14, 19}. The
boundaries are 12 and 19, so the value 14 is closer to 12, and it's
replaced by 12.
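The three smoothing methods can be sketched in Python as below, using the same data S and bins of size 3. The function names are hypothetical. For smoothing by boundaries, this sketch sends a value that is equally far from both boundaries to the lower boundary; that tie rule is an assumption, and a different rule would change only such tied values.

```python
from statistics import mean, median


def make_bins(sorted_values, bin_size):
    """Split already-sorted values into consecutive bins of bin_size items."""
    return [sorted_values[i:i + bin_size] for i in range(0, len(sorted_values), bin_size)]


def smooth_by_means(bins):
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Assumption: ties (equal distance to both boundaries) go to the lower boundary.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = make_bins(S, 3)

print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```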
Here, max − min is the range. Min and max are the minimum and
maximum of the given data; new min and new max are the minimum
and maximum of the target range, say 0 and 1.
Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply Min-Max
procedure and map the marks to a new range 0–1.
Solution: The minimum of the list V is 88 and maximum is 94. The
new min and new max are 0 and 1, respectively. The mapping can be
done using Eq. (2.1) as:
So, it can be observed that the marks {88, 90, 92, 94} are mapped to
the new range {0, 0.33, 0.67, 1}. Thus, the Min-Max normalization
range is between 0 and 1.
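The mapping in Example 2.2 can be checked with a short Python sketch. It assumes the usual Min-Max form, v' = ((v − min) / (max − min)) × (new max − new min) + new min, which is consistent with the description of Eq. (2.1); the function name is hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map values linearly so that min(values) -> new_min and max(values) -> new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are equal, so the range is zero")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


V = [88, 90, 92, 94]
normalized = min_max_normalize(V)
print([round(v, 2) for v in normalized])  # [0.0, 0.33, 0.67, 1.0]
```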
Here, s is the standard deviation of the list V and m is the mean of the
list V.
Example 2.3: Consider the mark list V = {10, 20, 30}, convert the
marks to z-score.
Solution: The mean and Sample Standard deviation (s) values of the
list V are 20 and 10, respectively. So the z-scores of these marks are
Hence, the z-score of the marks 10, 20, 30 are -1, 0 and 1, respectively.
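Example 2.3 can be reproduced with Python's statistics module. This sketch assumes the z-score is (v − m) / s with s the sample standard deviation, as stated in the solution; statistics.stdev computes exactly that (division by n − 1).

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (v - mean) / sample standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]


V = [10, 20, 30]
print(mean(V), stdev(V))
print(z_scores(V))
```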
Data Reduction
Data reduction reduces data size but produces the same results. There
are different ways in which data reduction can be carried out such as
data aggregation, feature selection, and dimensionality reduction.
Bar Chart A Bar chart (or Bar graph) is used to display the frequency
distribution for variables.
Bar charts are used to illustrate discrete data. The charts can also help
to explain the counts of nominal data. It also helps in comparing the
frequency of different groups. The bar chart for students' marks {45,
60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown below in Figure
2.3.
Pie Chart These are equally helpful in illustrating the univariate data.
The percentage frequency distribution of students' marks {22, 22, 40,
40, 70, 70, 70, 85, 90, 90} is shown below in Figure 2.4.
range 76-100 is 2.
Dot Plots These are similar to bar charts. They are less cluttered as
compared to bar charts, as they illustrate the bars only with single
points. The dot plot of English marks for five students with ID as {1, 2,
3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The
advantage is that by visual inspection one can find out who got more marks.
For example, the mean of the three numbers 10, 20, and 30 is 20
Here, n is the number of items and xi are values. For example, if the
values are 6 and 8, the geometric mean is sqrt(6 × 8) = sqrt(48) ≈ 6.93.
In larger cases, computing geometric mean is difficult. Hence, it is
usually calculated as:
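A small Python sketch shows both routes to the geometric mean. The logarithmic form used here, exp of the average of the natural logs, is assumed to be the "usual" calculation the text refers to; it avoids forming a very large product when there are many values.

```python
import math


def geometric_mean_direct(values):
    """n-th root of the product of the values."""
    return math.prod(values) ** (1.0 / len(values))


def geometric_mean_log(values):
    """Same quantity via logarithms: exp(mean(ln x_i)). Values must be positive."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


values = [6, 8]
print(round(geometric_mean_direct(values), 2))  # 6.93
print(round(geometric_mean_log(values), 2))     # 6.93
```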
The median class is the class in which the (N/2)th item is present. Here, i is the
class interval of the median class, L1 is the lower limit of the median
class, f is the frequency of the median class, and cf is the cumulative
frequency of all classes preceding the median class.
3. Mode – Mode is the value that occurs most frequently in the
dataset. In other words, the value that has the highest frequency is
called mode.
2.5.3 Dispersion
The spread of a set of data around the central tendency (mean,
median or mode) is called dispersion. Dispersion is represented in
various ways such as range, variance, standard deviation, and
standard error. These are second order measures. The most common
measures of dispersion are listed below:
Range Range is the difference between the maximum and minimum
of values of the given list of data.
Standard Deviation The mean does not convey much more than a
middle point. For example, the following datasets {10, 20, 30} and {10,
50, 0} both have a mean of 20. The difference between these two sets
is the spread of data. Standard deviation is the average distance from
the mean of the dataset to each point.
The formula for sample standard deviation is given by:
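To see how standard deviation separates the two datasets mentioned above, the following sketch computes the sample standard deviation directly. It assumes the usual sample form, the square root of the sum of squared deviations from the mean divided by (n − 1).

```python
import math
from statistics import mean


def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    m = mean(values)
    squared_deviations = sum((v - m) ** 2 for v in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


print(mean([10, 20, 30]), sample_std([10, 20, 30]))
print(mean([10, 50, 0]), round(sample_std([10, 50, 0]), 2))
```

Both lists have a mean of 20, but the second one is spread much more widely, which its larger standard deviation reflects.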
Example 2.4: For patients’ age list {12, 14, 19, 22, 24, 26, 28, 31, 34},
find the IQR.
Solution: The median is in the fifth position. In this case, 24 is the
median. The first quartile is the median of the scores below the median,
i.e., {12, 14, 19, 22}. In this case, the median is the average of the
second and third values, that is, Q0.25 = (14 + 19)/2 = 16.5. Similarly,
the third quartile is the median of the values above
the median, that is {26, 28, 31, 34}. So, Q0.75 is the average of the
seventh and eighth scores. In this case, it is (28 + 31)/2 = 59/2 = 29.5.
Hence, the IQR using Eq. (2.10) is:
IQR = Q0.75 – Q0.25
= 29.5 – 16.5 = 13
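The steps of Example 2.4 can be followed in Python. This sketch uses the same "median of each half" rule as the worked solution (the middle value is excluded from both halves when the count is odd); other quartile conventions exist and can give slightly different values. The function name is hypothetical.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3), taking Q1 and Q3 as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    lower = data[:mid]
    # For an odd count the middle value belongs to neither half.
    upper = data[mid + 1:] if n % 2 else data[mid:]
    return median(lower), median(data), median(upper)


ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
q1, q2, q3 = quartiles_by_halves(ages)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 16.5 24 29.5 13.0
```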
13}, that is, {minimum, Q1, median, Q3, maximum}. Box plots are
useful for describing 5-point summary. The Box plot for the set is given
in Figure 2.7.
2.5.4 Shape
Skewness and Kurtosis (called moments) indicate the
symmetry/asymmetry and peak location of the dataset.
Skewness
The measures of direction and degree of symmetry are called
measures of third order. Ideally, skewness should be zero as in ideal
normal distribution. More often, the given dataset may not have
perfect symmetry (consider the following Figure 2.8).
Kurtosis
Kurtosis indicates the peakedness of the data. If the data has a high peak, then
it indicates higher kurtosis and vice versa. Kurtosis is measured using
the formula given below:
given as:
It can be seen from Figure 2.9 that the first column is stem and the
second column is leaf. For the given English marks, two students with
60 marks are shown in stem and leaf plot as stem-6 with 2 leaves with
0. The normal Q-Q plot for marks x = [13 11 2 3 4 8 9] is given below
in Figure 2.10.
Module 2- Machine Learning (BCS602)
Module 2
Understanding Data
Bivariate and Multivariate data, Multivariate statistics, Essential mathematics for Multivariate data,
Overview hypothesis, Feature engineering and dimensionality reduction techniques, Basics of Learning
Theory: Introduction to learning and its types, Introduction computation learning theory, Design of
learning system, Introduction concept learning. Similarity-based learning: Introduction to Similarity or
instance based learning, Nearest-neighbour learning, weighted k- Nearest - Neighbour algorithm.
CHAPTER -2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate Data involves two variables. Bivariate data deals with causes of relationships. The aim is
to find relationships among data. Consider the following Table 2.3, with data of the temperature in
a shop and sales of sweaters.
Here, the aim of bivariate analysis is to find relationships among variables. The relationships can then be
used in comparisons, finding causes, and in further explorations. To do that, graphical display of the data is
necessary. One such graph method is called scatter plot.
Scatter plot is used to visualize bivariate data. It is useful to plot two variables with or without nominal
variables, to illustrate the trends, and also to show differences. It is a plot between explanatory and response
variables. It is a 2D graph showing the relationship between two variables. Line graphs are similar to scatter
plots. The Line Chart for sales data is shown in Figure 2.12.
as covariance (X, Y) or COV (X, Y) and is used to measure the variance between two dimensions. The formula
for finding the covariance for specific x and y is:
Here, xi and yi are data values from X and Y. E(X) and E(Y) are the mean values of xi and yi. N is the number
of given data. Also, COV(X, Y) is the same as COV(Y, X).
If the given attributes are X = (x1, x2, … , xN) and Y = (y1, y2, … , yN), then the Pearson correlation coefficient,
that is denoted as r, is given as: (σX, σY are the standard deviations of X and Y.)
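Covariance and the Pearson correlation coefficient can be sketched together in Python. This sketch assumes the population form of covariance, COV(X, Y) = (1/N) × Σ (xi − E(X)) (yi − E(Y)), and r = COV(X, Y) / (σX σY) with population standard deviations; the data values are hypothetical.

```python
from statistics import mean, pstdev


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient r = COV(X, Y) / (sigma_X * sigma_Y)."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))


X = [1, 2, 3, 4, 5]   # hypothetical attribute values
Y = [2, 4, 6, 8, 10]  # exactly 2 * X, so the relationship is perfectly linear

print(covariance(X, Y))
print(covariance(X, Y) == covariance(Y, X))
print(round(pearson_r(X, Y), 6))
```

Because Y is an exact positive multiple of X, r comes out as 1 (up to floating-point rounding); reversing the order of Y would give −1.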
Heatmap A heat map is a graphical representation of data where individual values are represented by
colors. Heat maps are often used in data analysis and visualization to show patterns, density, or intensity of
data points in a two-dimensional grid.
Example: Let's consider a heat map to display the average temperatures (in °C) across different regions in
a country over a week. Each cell in the heat map will represent a temperature for a specific region on a
specific day. This is useful to quickly identify trends, such as higher temperatures in certain regions or
specific days with unusual weather patterns. The color gradient (from blue to red) indicates the
temperature range: cooler colors represent lower temperatures, while warmer colors represent higher
temperatures.
Pairplot
Pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several
pair-wise scatter plots of variables of the multivariate data. A random matrix of three columns is chosen and
the relationships of the columns are plotted as a pairplot (or scatter matrix) as shown in Figure 2.14.
Machine learning involves many mathematical concepts from the domain of Linear algebra, Statistics,
Probability and Information theory. The subsequent sections discuss important aspects of linear algebra
and probability.
set of equations with ‘n’ unknown variables. It means if A= and y=(y1 y2…yn), then the unknown
variable x can be computed as: x = y/A = A^(-1) y
If there is a unique solution, then the system is called consistent independent. If there are multiple
solutions, then the system is called consistent dependent. If there are no solutions and if the equations are
contradictory, then the system is called inconsistent.
For solving large number of system of equations, Gaussian elimination can be used. The
procedure for applying Gaussian elimination is given as follows:
1. Write the given matrix.
2. Append vector y to the matrix A. This matrix is called augmentation matrix.
3. Keep the element a11 as pivot and eliminate a21 in the second row using the row operation
R2 − (a21/a11) R1, where R2 is the 2nd row, R1 is the 1st row, and (a21/a11) is called the multiplier.
The same logic can be used to remove the first-column entries in all other rows.
4. Repeat the same logic and reduce it to reduced echelon form. Then, the unknown variables are obtained as:
To facilitate the application of Gaussian elimination method, the following row operations are
applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it
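The procedure and the three row operations above can be sketched in Python as follows. This is an illustrative sketch, not the notes' own implementation: it reduces the augmented matrix all the way to reduced echelon form and uses exact fractions to avoid rounding. The two systems solved at the end are hypothetical examples.

```python
from fractions import Fraction


def gaussian_elimination(A, y):
    """Solve A x = y for a square system with a unique solution."""
    n = len(A)
    # Augmented matrix [A | y] with exact rational arithmetic.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, y)]
    for col in range(n):
        # Row operation 1: swap in a row whose entry in this column is non-zero.
        pivot_row = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot_row is None:
            raise ValueError("no unique solution: no usable pivot in this column")
        M[col], M[pivot_row] = M[pivot_row], M[col]
        # Row operation 2: divide the pivot row so that the pivot becomes 1.
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        # Row operation 3: subtract a multiple of the pivot row from every other row.
        for r in range(n):
            if r != col:
                multiplier = M[r][col]
                M[r] = [a - multiplier * b for a, b in zip(M[r], M[col])]
    # In reduced echelon form the last column holds the solution.
    return [row[-1] for row in M]


# Hypothetical system: 2x + y = 5 and x + 3y = 10.
solution = gaussian_elimination([[2, 1], [1, 3]], [5, 10])
print([int(v) for v in solution])  # [1, 3]

# Hypothetical 3 x 3 system.
solution3 = gaussian_elimination([[1, 1, 1], [0, 2, 5], [2, 5, -1]], [6, -4, 27])
print([int(v) for v in solution3])  # [5, 3, -2]
```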
where Q is the matrix of eigen vectors, Λ is the diagonal matrix and Q^T is the transpose of matrix Q.
LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A is decomposed into two
matrices: A = LU. Here, L is the lower triangular matrix and U is the upper triangular matrix. The
decomposition can be done using Gaussian elimination method as discussed in the previous section. First,
an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are
applied to reduce the given matrix to get matrices L and U. Example 2.9 illustrates the application of
Gaussian elimination to get LU.
Now, it can be observed that the first matrix is L as it is the lower triangular matrix whose values are the
multipliers used in the reduction of equations above such as 3, 3 and 2/3.
The second matrix is U, the upper triangular matrix whose values are the values of the reduced matrix
because of Gaussian elimination.
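The idea that L collects the multipliers while U is the reduced matrix can be sketched in Python. The matrix below is a hypothetical example, not the one from Example 2.9, and this sketch does no row swapping, so it assumes every pivot is non-zero.

```python
from fractions import Fraction


def lu_decompose(A):
    """Return (L, U) with A = L U, L unit lower triangular, U upper triangular."""
    n = len(A)
    U = [[Fraction(v) for v in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for col in range(n):
        if U[col][col] == 0:
            raise ValueError("zero pivot: this sketch does no row swapping")
        for row in range(col + 1, n):
            multiplier = U[row][col] / U[col][col]
            L[row][col] = multiplier  # the multiplier is stored in L
            U[row] = [a - multiplier * b for a, b in zip(U[row], U[col])]
    return L, U


def matmul(X, Y):
    """Plain matrix product, used only to check that L U reproduces A."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]  # hypothetical matrix
L, U = lu_decompose(A)
print([[int(v) for v in row] for row in L])  # [[1, 0, 0], [2, 1, 0], [4, 3, 1]]
print([[int(v) for v in row] for row in U])  # [[2, 1, 1], [0, 1, 1], [0, 0, 2]]
print(matmul(L, U) == A)
```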
Probability Distributions
Definition: A probability distribution describes the likelihood of various outcomes for a variable X.
Types:
3. Exponential Distribution
1 Binomial Distribution
2 Poisson Distribution
3 Bernoulli Distribution
Density Estimation
1 Parzen Window
Definition: A non-parametric technique that estimates the PDF based on local samples.
Example: Uses a kernel function like Gaussian around each data point.
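A minimal one-dimensional sketch of a Parzen window estimate with a Gaussian kernel is given below. It assumes the common form p(x) = (1 / (n h)) × Σ K((x − xi) / h), where h is the window width; the sample values and the choice h = 0.5 are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the window function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def parzen_density(x, samples, h):
    """Estimate the density at x by averaging kernels centred on each sample."""
    n = len(samples)
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (n * h)


samples = [1.0, 1.2, 1.9, 3.1, 3.3]  # hypothetical 1-D observations
bandwidth = 0.5                      # hypothetical window width h

for x in (1.1, 2.5, 6.0):
    print(x, round(parzen_density(x, samples, bandwidth), 4))
```

The estimate is high near the clusters of samples and falls towards zero far away from all of them; a larger h gives a smoother, flatter curve.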
Features are attributes. Feature engineering is about determining the subset of features that form
an important part of the input that improves the performance of the model, be it classification or any other
model in machine learning.
Feature engineering deals with two problems – Feature Transformation and Feature Selection.
Feature transformation is extraction of features and creating new features that may be helpful in increasing
performance. For example, the height and weight may give a new attribute called Body Mass Index (BMI).
Feature subset selection is another important aspect of feature engineering that focuses on selection of
features to reduce the time but not at the cost of reliability.
Filter-based selection uses statistical measures for assessing features. In this approach, no learning
algorithm is used. Correlation and information gain measures like mutual information and entropy are all
examples of this approach.
Wrapper-based methods use classifiers to identify the best features. These are selected and evaluated by
the learning algorithms. This procedure is computationally intensive but has superior performance.
The operator E refers to the expected value of the population. This is calculated theoretically using the
probability density functions (PDF) of the elements xi and the joint probability density functions between
the elements xi and xj. From this, the covariance matrix can be calculated as:
The mapping of the vectors x to y using the transformation can now be described as:
This transform is also called the Karhunen-Loève or Hotelling transform. The original vector x
can now be reconstructed as follows:
If K largest eigen values are used, the recovered information would be:
The new data is a dimensionally reduced matrix that represents the original data.
Figure 2.15. The scree plot indicates that only 6 out of 246 attributes are important.
From Figure 2.15, one can infer the relevance of the attributes. The scree plot indicates that
the first attribute is more important than all other attributes.
Here, A is the given matrix of dimension m × n, U is the orthogonal matrix whose dimension is m × n, S is the
diagonal matrix of dimension n × n, and V is the orthogonal matrix. The procedure for finding decomposition
matrix is given as follows:
1. For a given matrix, find AA^T
2. Find eigen values of AA^T
3. Sort the eigen values in a descending order. Pack the eigen vectors as a matrix U.
4. Arrange the square root of the eigen values in diagonal. This matrix is diagonal matrix, S.
5. Find eigen values and eigen vectors for A^TA. Find the eigen value and pack the eigen vector as a
matrix called V.
Thus, A = USV^T. Here, U and V are orthogonal matrices. The columns of U and V are left and right
singular vectors, respectively. SVD is useful in compression, as one can decide to retain only a certain
component instead of the original matrix A as:
Concept learning is a learning strategy that involves acquiring abstract knowledge or inferring a general
concept based on the given training samples. It aims to derive a category or classification from the data,
facilitating abstraction and generalization. In machine learning, concept learning is about finding a function
that categorizes or labels instances correctly based on the observed features.
Hypothesis space is the set of all possible hypotheses that approximate the target function
f.
The subset of the hypothesis space that is consistent with all observed training instances is
called the Version Space.
There are two ways of learning a hypothesis that is consistent with all training instances from
the large hypothesis space.
List-Then-Eliminate Algorithm
MODULE 3
CHAPTER 4
SIMILARITY-BASED LEARNING
Similarity or Instance-based Learning
KNN
Variants of KNN
Locally weighted regression
Learning vector quantization
Self-organizing maps
RBF networks
Nearest-Neighbor Learning
A powerful classification algorithm used in pattern recognition.
K nearest neighbors stores all available cases and classifies new cases based on a
similarity measure (e.g., a distance function).
One of the top data mining algorithms used today.
A non-parametric lazy learning algorithm (An Instance based Learning method).
Used for both classification and regression problems.
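The points above can be illustrated with a few lines of code. This is a minimal sketch, assuming Euclidean distance as the similarity measure and majority voting among the k nearest cases; the function name `knn_predict` and the toy training set are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label) pairs; return the majority label of the k nearest."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical stored cases with two features each.
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
         ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B")]

print(knn_predict(train, (2.0, 2.0)))
print(knn_predict(train, (6.0, 6.5)))
```

No model is built in advance: all the work happens at query time, which is why KNN is called a lazy learner.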
Module 3- Machine Learning (BCS602)
Where, τ is called the bandwidth parameter and controls the rate at which wi reduces to zero
with distance from xi.
MODULE 3
CHAPTER 5
REGRESSION ANALYSIS
1.1 Introduction to Regression
Regression analysis is a set of machine learning methods that predict a continuous outcome
variable (y) based on the value of one or more predictor variables (x).
OR
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
Regression is a supervised learning technique which helps in finding the correlation between
variables.
It is mainly used for prediction, forecasting, time series modelling, and determining the
cause-effect relationship between variables.
Regression fits a line or curve to the data points on the target-predictor graph in such a way
that the vertical distance between the data points and the regression line is minimum. The
distance between the data points and the line tells whether a model has captured a strong
relationship or not.
• Function of regression analysis is given by:
y = f(x)
Here, y is called the dependent variable and x is called the independent variable.
Applications of Regression Analysis
Sales of goods or services
Value of bonds in portfolio management
Premiums charged by insurance companies
Yield of crop in agriculture
Prices of real estate
Positive Correlation: Two variables are said to be positively correlated when their values
move in the same direction. For example, as the value of X increases, so does the value of Y
at a constant rate.
Negative Correlation: Variables X and Y are negatively correlated when their values change
in opposite directions, so as the value of X increases, the value of Y decreases at a
constant rate.
Neutral Correlation: There is no relationship between the changes in variables X and Y. In this
case, the values are completely random and do not show any sign of correlation.
Causation
Causation is about the relationship between two variables in which x causes y. This is called x implies y.
Regression is different from causation. Causation indicates that one event is the result of the
occurrence of the other event; i.e., there is a causal relationship between the two events.
Linear and Non-Linear Relationships
The relationship between input features (variables) and the output (target) variable is
fundamental. These concepts have significant implications for the choice of algorithms, model
complexity, and predictive performance.
A linear relationship creates a straight line when plotted on a graph; a non-linear relationship
does not create a straight line but instead creates a curve.
Example:
Linear: the relationship between the hours spent studying and the grades obtained in a class.
Non-Linear-
Linearity:
Linear Relationship: A linear relationship between variables means that a change in one
variable is associated with a proportional change in another variable. Mathematically, it can be
represented as y = a * x + b, where y is the output, x is the input, and a and b are constants.
Linear Models: Goal is to find the best-fitting line (plane in higher dimensions) to the data
points. Linear models are interpretable and work well when the relationship between variables
is close to being linear.
Limitations: Linear models may perform poorly when the relationship between variables is
non-linear. In such cases, they may underfit the data, meaning they are too simple to capture
the underlying patterns.
Non-Linearity:
Non-Linear Relationship: A non-linear relationship implies that the change in one variable is
not proportional to the change in another variable. Non-linear relationships can take various
forms, such as quadratic, exponential, logarithmic, or arbitrary shapes.
Non-Linear Models: Machine learning models like decision trees, random forests, support
vector machines with non-linear kernels, and neural networks can capture non-linear
relationships. These models are more flexible and can fit complex data patterns.
Benefits: Non-linear models can perform well when the underlying relationships in the data
are complex or when interactions between variables are non-linear. They have the capacity to
capture intricate patterns.
Types of Regression
Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is
used when there is a single independent variable (predictor) and one dependent variable
(target).
Equation: The linear regression equation takes the form: Y = β0 + β1X + ε, where Y is the
dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope
(coefficient), and ε is the error term.
Purpose: Linear regression is used to establish a linear relationship between two variables and
make predictions based on this relationship. It's suitable for simple scenarios where there's only
one predictor.
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors: Y = β0
+ β1X1 + β2X2 + ... + βnXn + ε, where Y is the dependent variable, X1, X2, ..., Xn are the
independent variables, β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error
term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make
predictions based on all these factors.
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship
between the independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as quadratic
or cubic terms: Y = β0 + β1X + β2X^2 + ... + βnX^n + ε. This allows the model to fit a curve
rather than a straight line.
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the
probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)), where z is a linear combination of the independent
variables: z = β0 + β1X1 + β2X2 + ... + βnXn. The predicted probability is then converted into a
binary outcome by comparing it with a threshold (commonly 0.5).
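The logistic regression equation can be written out directly in code. This is a small sketch using only the standard library; the coefficient values and the 0.5 threshold are hypothetical choices for illustration, not fitted parameters.

```python
import math

def predict_proba(x, beta0, betas):
    """P(Y=1) = 1 / (1 + e^(-z)) with z = beta0 + beta1*x1 + ... + betan*xn."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, beta0, betas, threshold=0.5):
    """Turn the probability into a 0/1 label using a threshold (assumed 0.5)."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0

beta0, betas = -4.0, [1.0, 0.5]                 # hypothetical coefficients
p = predict_proba([2.0, 4.0], beta0, betas)     # z = -4 + 2 + 2 = 0
print(p, predict_class([2.0, 4.0], beta0, betas), predict_class([0.0, 0.0], beta0, betas))
```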
Lasso Regression (L1 Regularization):
Use: Lasso regression is used for feature selection and regularization. It penalizes the absolute
values of the coefficients, which encourages sparsity in the model.
Objective Function: Lasso regression adds an L1 penalty to the linear regression loss function:
Lasso = RSS + λΣ|βi|, where RSS is the residual sum of squares, λ is the regularization strength,
and |βi| represents the absolute values of the coefficients.
Ridge Regression (L2 Regularization):
Use: Ridge regression is used for regularization to prevent overfitting in multiple regression. It
penalizes the square of the coefficients.
Objective Function: Ridge regression adds an L2 penalty to the linear regression loss function:
Ridge = RSS + λΣ(βi^2), where RSS is the residual sum of squares, λ is the regularization
strength, and (βi^2) represents the square of the coefficients.
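To make the two penalties concrete, the sketch below evaluates the Lasso and Ridge objective functions for a fixed set of coefficients. It only computes the objective values and does not perform the minimization. The data, the coefficients, and the convention of leaving the intercept unpenalized are assumptions for illustration.

```python
def rss(X, y, beta0, betas):
    """Residual sum of squares of the linear model on the data."""
    total = 0.0
    for row, target in zip(X, y):
        pred = beta0 + sum(b * v for b, v in zip(betas, row))
        total += (target - pred) ** 2
    return total

def lasso_objective(X, y, beta0, betas, lam):
    """RSS + lambda * sum of |beta_i| (intercept not penalized, by assumption)."""
    return rss(X, y, beta0, betas) + lam * sum(abs(b) for b in betas)

def ridge_objective(X, y, beta0, betas, lam):
    """RSS + lambda * sum of beta_i^2 (intercept not penalized, by assumption)."""
    return rss(X, y, beta0, betas) + lam * sum(b * b for b in betas)

# Hypothetical data and coefficients.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [5.0, 4.0, 9.0]
betas = [1.0, -2.0]
print(lasso_objective(X, y, 0.0, betas, lam=0.5))
print(ridge_objective(X, y, 0.0, betas, lam=0.5))
```

For the same λ, the L1 term grows linearly with each coefficient while the L2 term grows quadratically, which is why Lasso tends to push small coefficients to exactly zero and Ridge only shrinks them.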
Limitations of Regression
Ordinary Least Square Approach
The ordinary least squares (OLS) algorithm is a method for estimating the parameters of a
linear regression model. Aim: To find the values of the linear regression model's parameters
(i.e., the coefficients) that minimize the sum of the squared residuals.
In mathematical terms, this can be written as: Minimize ∑(yi – ŷi)^2
Linear Regression Example
Linear Regression in Matrix Form
The coefficient of determination r2 is the ratio of the explained variation to the total variation.
2 Consider the following dataset in Table 5.11, where the week and the number of working hours per
week spent by a research scholar in a library are tabulated. Based on the dataset, predict the
number of hours that will be spent by the research scholar in the 7th and 9th weeks. Apply the linear
regression model.
Table 5.11
xi (week)          1    2    3    4    5
yi (Hours Spent)   12   18   22   28   35
Solution
xi    yi    xi*xi    xi*yi
1     12    1        12
2     18    4        36
3     22    9        66
4     28    16       112
5     35    25       175
Sum = 15   Sum = 115   Avg(xi*xi) = 55/5 = 11   Avg(xi*yi) = 401/5 = 80.2
Avg(xi) = 15/5 = 3   Avg(yi) = 115/5 = 23
a1 = (Avg(xi*yi) - Avg(xi) * Avg(yi)) / (Avg(xi*xi) - Avg(xi)^2) = (80.2 - 3(23)) / (11 - 3^2) = 11.2/2 = 5.6
a0 = Avg(yi) - a1 * Avg(xi) = 23 - 5.6(3) = 6.2
Therefore, the regression line is y = 6.2 + 5.6x.
The prediction for the 7th week hours spent by the research scholar will be
y = 6.2 + 5.6(7) = 45.4
The prediction for the 9th week hours spent by the research scholar will be
y = 6.2 + 5.6(9) = 56.6
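The Table 5.11 calculation can be cross-checked with a short script. This sketch uses the same averages-based formulas for the slope a1 and intercept a0 that the solutions in this chapter use; the function name `fit_line` is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares line y = a0 + a1*x from the averages of x, y, x*x and x*y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xx = sum(x * x for x in xs) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    a1 = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x ** 2)
    a0 = mean_y - a1 * mean_x
    return a0, a1

weeks = [1, 2, 3, 4, 5]
hours = [12, 18, 22, 28, 35]
a0, a1 = fit_line(weeks, hours)
for week in (7, 9):
    print(week, round(a0 + a1 * week, 1))
```

The script gives a slope of 5.6 and an intercept of 6.2, so the predictions for weeks 7 and 9 are about 45.4 and 56.6 hours.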
Height of Boys    65   70   75   78
Height of Girls   63   67   70   73
Solution
xi    yi    xi*xi    xi*yi
65    63    4225     4095
70    67    4900     4690
75    70    5625     5250
78    73    6084     5694
Sum = 288   Sum = 273   Avg(xi*xi) = 20834/4 = 5208.5   Avg(xi*yi) = 19729/4 = 4932.25
Mean(xi) = 288/4 = 72   Mean(yi) = 273/4 = 68.25
a1 = (4932.25 - 72(68.25)) / (5208.5 - 72^2) = 18.25/24.5 = 0.7449
a0 = Mean(yi) - a1 * Mean(xi) = 68.25 - 0.7449(72) = 68.25 - 53.6328 = 14.6172
y = 14.6172 + 0.7449x
4 Using multiple regression, fit a line for the following dataset shown in Table 5.13.
Here, Z is the equity, X is the net sales and Y is the asset. Z is the dependent variable
and X and Y are independent variables. All the data is in million dollars.
Z X Y
4 12 8
6 18 12
7 22 16
8 28 36
11 35 42
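Solution
With a column of ones added for the intercept, the coefficients are obtained as a = (X^T X)^-1 X^T Z, where
X^T X =
5     115    114
115   2961   3142
114   3142   3524
X^T =
1    1    1    1    1
12   18   22   28   35
8    12   16   36   42
Z = (4, 6, 7, 8, 11)^T
a = (-0.4135, 0.39625, -0.0658)^T
Therefore, the regression line is given as
Z = -0.4135 + 0.39625X - 0.0658Y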
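The multiple regression fit for Table 5.13 can be reproduced with the normal equations. The sketch assumes NumPy; the variable names `design` and `coeffs` are illustrative, and `numpy.linalg.solve` is used in place of an explicit matrix inverse.

```python
import numpy as np

X = np.array([12, 18, 22, 28, 35], dtype=float)   # net sales
Y = np.array([8, 12, 16, 36, 42], dtype=float)    # asset
Z = np.array([4, 6, 7, 8, 11], dtype=float)       # equity (dependent variable)

# Design matrix with a leading column of ones for the intercept.
design = np.column_stack([np.ones_like(X), X, Y])

# Solve (design^T design) a = design^T Z for a = (intercept, coefficient of X, coefficient of Y).
coeffs = np.linalg.solve(design.T @ design, design.T @ Z)
print(design.T @ design)
print(coeffs)
```

The intercept and the coefficient of Y come out negative, and the coefficient of X positive, with magnitudes close to 0.4135, 0.39625 and 0.0658.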
***
CHAPTER 6
DECISION TREE LEARNING
6.1 Introduction
6.1.1 Structure of a Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label. The topmost node in the tree is the root node.
Bayes' Theorem is a fundamental concept in probability theory and forms the foundation of Bayesian
learning in machine learning. It allows you to update the probability of a hypothesis (or event) based on
new evidence.
P(H | D) = P(D | H) P(H) / P(D)
Where:
P(H | D) is the posterior probability: the probability of the hypothesis H being true given the data
D.
P(D | H) is the likelihood: the probability of observing the data D given that hypothesis H is true.
P(H) is the prior probability: the initial belief about the hypothesis H before any data is observed.
P(D) is the marginal likelihood or evidence: the total probability of the data under all possible
hypotheses. This acts as a normalizing constant to ensure that the posterior is a valid probability
distribution.
Before you collect any data, you have a prior belief about a hypothesis (e.g., the probability of a patient
having a disease).
After seeing new data (e.g., the result of a medical test), you update your belief about the hypothesis to
reflect this new evidence.
Bayes' Theorem lets you do this systematically, ensuring that your updated belief (posterior) is
proportional to the product of the prior belief and the likelihood of observing the new data.
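The medical-test illustration can be turned into a small numerical sketch. All three input numbers below (a 1% prior, a 90% true-positive rate and a 5% false-positive rate) are hypothetical values chosen for the example; the function name `posterior` is likewise an assumption.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | D) for a positive test result D, using Bayes' Theorem."""
    # Evidence P(D): total probability of a positive test under both hypotheses.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical values: P(disease) = 0.01, P(positive | disease) = 0.9,
# P(positive | no disease) = 0.05.
p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 4))
```

Even with a fairly accurate test, the posterior stays well below the likelihood because the prior is small, which shows how the prior and the likelihood jointly determine the updated belief.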
Dendrites are tree-like networks made of nerve fiber connected to the cell body.
An axon is a single, long connection extending from the cell body and carrying signals from the
neuron. The end of the axon splits into fine strands. Each strand terminates in a small
bulb-like organ called a synapse. It is through the synapse that the neuron introduces its signals to
other nearby neurons. The receiving ends of these synapses on the nearby neurons can be found
both on the dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the
human body. Electric impulses are passed between the synapse and the dendrites. It is a chemical process
which results in an increase or decrease in the electric potential inside the body of the receiving cell. If
the electric potential reaches a threshold value, the receiving cell fires, and a pulse (action potential) of
fixed strength and duration is sent through the axon to the synaptic junctions of the cell. After that, the cell
has to wait for a period called the refractory period.
Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), passes
the output through a cable like structure to other connected neurons (axon to synapse to
other neuron’s dendrite).
OR
Working:
The received inputs are combined into a weighted sum, which is given to the activation function;
if the sum exceeds the threshold value, the neuron gets fired. The neuron is the basic
processing unit that receives a set of inputs x1, x2, x3, ..., xn and their associated weights
w1, w2, w3, ..., wn. The summation function computes the weighted sum of the inputs
received by the neuron.
Sum = ∑ xi wi
Activation functions:
• To make work more efficient and to obtain an exact output, some force or activation is applied. In
the same way, an activation function is applied over the net input to calculate the output of an ANN.
Information processing of a processing element has two major parts: input and output. An
integration function (f) is associated with the input of the processing element.
1. Identity function: f(x) = x for all x.
The output is the same as the input, i.e., the weighted sum. The function is useful when we do
not apply any threshold. The output value ranges between –∞ and +∞.
2. Binary step function: This function can be defined as
f(x) = { 1 if x ≥ θ
0 if x < θ
Where θ represents the threshold value. It is used in single layer nets to convert
the net input to an output that is binary (0 or 1).
3. Bipolar step function: This function can be defined as
f(x) = { 1 if x ≥ θ
−1 if x < θ
Where θ represents the threshold value. It is used in single layer nets to convert
the net input to an output that is bipolar (+1 or -1).
4. Sigmoid function: It is used in back propagation nets.
Two types:
a) Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar
sigmoid function. It is defined as
f(x) = 1 / (1 + e^(-x))
7. ReLU Function
ReLU stands for Rectified Linear Unit. It is defined as f(x) = max(0, x).
Although it gives an impression of a linear function, ReLU has a derivative function and
allows for backpropagation while simultaneously making it computationally efficient.
The main catch here is that the ReLU function does not activate all the neurons at the same
time.
A neuron is deactivated only if the output of the linear transformation is less than 0.
8. Softmax function: Softmax is an activation function that scales numbers/logits into
probabilities. The output of a Softmax is a vector (say v) with probabilities of each
possible outcome. The probabilities in vector v sum to one over all possible outcomes or
classes.
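The weighted sum and several of the activation functions listed above can be sketched with the standard library alone. The inputs, weights, and default threshold of 0 are hypothetical values for illustration; subtracting the maximum inside `softmax` is a common numerical-stability choice and is an addition not mentioned in the notes.

```python
import math

def identity(x):
    return x

def binary_step(x, theta=0.0):
    return 1 if x >= theta else 0

def bipolar_step(x, theta=0.0):
    return 1 if x >= theta else -1

def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def softmax(logits):
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical neuron: net input is the weighted sum of the inputs.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, 0.1]
net = sum(x * w for x, w in zip(inputs, weights))

print(net, binary_step(net), bipolar_step(net), binary_sigmoid(net), relu(net))
print(softmax([1.0, 2.0, 3.0]))
```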
• Knowledge is acquired by the network from its environment through a learning process.
OR
• The perceptron can represent the primitive Boolean functions AND, OR, NAND, and NOR.
• Some Boolean functions cannot be represented by a single perceptron.
– E.g., the XOR function.
Solution:
[Figure: network diagram with bias input X0, input X1, units X3 and X4, weights w13 and w34, and biases θ3 and θ4]
0 0 0 1
0 1 0 1
1 0 0 1
1 1 1 0
ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer   Ij   Oj
X1            0    0
X2            1    1
2. Calculate net inputs and outputs in hidden and output layer as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output layer
Unit k   Net Input Ik and Net Output Ok
X3       O3 = 0.450
X4       I4 = O3 W34 + X0 θ4 = (0.450 * 0.3) + 1(-0.3) = -0.165
         O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(-(-0.165))) = 0.458
3. Calculate Error
Error = Odesired - Oestimated
= 1 - 0.458
Error = 0.542
ITERATION 2:
Step 1: FORWARD PROPAGATION
Unit k   Net Input Ik and Net Output Ok
X4       I4 = O3 W34 + X0 θ4 = (0.451 * 0.324) + 1(-0.246) = -0.099
         O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(-(-0.099))) = 0.475
2. Calculate Error
Error = Odesired - Oestimated
= 1 - 0.475
Error = 0.525
Iteration   Error
1           0.542
2           0.525
Reduction in error = 0.542 - 0.525 = 0.017
In iteration 2 the error gets reduced to 0.525. This process will continue until desired output
is achieved.
How does a Multi-Layer Perceptron solve the XOR problem? Design an MLP with back
propagation to implement the XOR Boolean function.
Solution:
X1   X2   Y
0    0    0
0    1    1
1    0    1
1    1    0
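[Figure: MLP with bias input X0, inputs X1 and X2, hidden units X3 and X4, and output unit X5; weights shown in the figure: 0.1, -0.3, -0.2, 0.4, 0.4, 0.2, 0.2, -0.3, -0.3]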
Table 11: Error Calculation for each unit in the Output layer and Hidden layer
For Output Layer
Unit k   Error k
X5       Error 5 = O5 (1 - O5) (Odesired - O5)
         = 0.407 * (1 - 0.407) * (1 - 0.407)
         = 0.143
For Hidden Layer
Unit j   Error j
X4       Error 4 = O4 (1 - O4) ∑k Error k Wjk = O4 (1 - O4) Error 5 W45
         = 0.622 * (1 - 0.622) * (-0.3) * 0.143
         = -0.010
X3       Error 3 = O3 (1 - O3) ∑k Error k Wjk = O3 (1 - O3) Error 5 W35
         = 0.549 * (1 - 0.549) * 0.143 * 0.2
         = 0.007
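The forward pass and the error terms of Table 11 can be recomputed with a short script. The assignment of the figure's weight values to individual connections below is an assumption, chosen because it reproduces the outputs O3 = 0.549, O4 = 0.622 and O5 = 0.407 used in the table; the input (X1 = 1, X2 = 0) and the desired output 1 are taken from the worked calculation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Training sample and desired output used in the worked calculation.
x1, x2, target = 1, 0, 1

# Assumed assignment of the figure's weights to connections (bias input X0 = 1).
w13, w23, theta3 = -0.2, 0.2, 0.4
w14, w24, theta4 = 0.4, -0.3, 0.1
w35, w45, theta5 = 0.2, -0.3, -0.3

# Forward propagation.
o3 = sigmoid(x1 * w13 + x2 * w23 + theta3)
o4 = sigmoid(x1 * w14 + x2 * w24 + theta4)
o5 = sigmoid(o3 * w35 + o4 * w45 + theta5)

# Backward propagation of errors.
err5 = o5 * (1 - o5) * (target - o5)     # output unit
err4 = o4 * (1 - o4) * err5 * w45        # hidden unit X4
err3 = o3 * (1 - o3) * err5 * w35        # hidden unit X3

print(round(o3, 3), round(o4, 3), round(o5, 3))
print(round(err5, 3), round(err4, 3), round(err3, 3))
```

Under these assumptions Error 3 comes out positive (about 0.007), since every factor in its product is positive, while Error 4 is negative because W45 is negative.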
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer
Unit j   Net Input Ij and Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*θ3 = 1*(-0.194) + 0*0.2 + 1*0.405 = 0.211
         O3 = 1 / (1 + e^(-I3)) = 1 / (1 + e^(-0.211)) = 0.552
X4       I4 = X1*W14 + X2*W24 + X0*θ4 = 1*0.392 + 0*(-0.3) + 1*0.092 = 0.484
         O4 = 1 / (1 + e^(-I4)) = 1 / (1 + e^(-0.484)) = 0.618
X5       I5 = O3*W35 + O4*W45 + X0*θ5 = 0.552*0.154 + 0.618*(-0.288) + 1*(-0.185) = -0.282
         O5 = 1 / (1 + e^(-I5)) = 1 / (1 + e^(0.282)) = 0.429
Consider the Network architecture with 4 input units and 2 output units. Consider four training
samples each vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial Weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Identify an algorithm that learns without supervision. How would you cluster the samples as
expected?
Solution:
Use Self Organizing Feature Map (SOFM)
Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.12 0.2 0.76 0.84]
Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix
This process is continued for many epochs until the feature map doesn’t change.
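One epoch of the SOFM example can be traced in code. This is a simplified sketch that assumes squared Euclidean distance for choosing the winning unit and updates only the winner (no neighbourhood function), with the learning rate held at 0.6; the function name `som_step` is hypothetical.

```python
def som_step(weights, sample, rate):
    """Find the closest unit and move its weights towards the sample; return the winner's index."""
    dists = [sum((s - w) ** 2 for s, w in zip(sample, unit)) for unit in weights]
    winner = dists.index(min(dists))
    weights[winner] = [w + rate * (s - w) for s, w in zip(sample, weights[winner])]
    return winner

weights = [[0.2, 0.8, 0.5, 0.1],    # Unit 1
           [0.3, 0.5, 0.4, 0.6]]    # Unit 2
samples = [(1, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 1), (0, 0, 1, 0)]

winners = []
for sample in samples:
    winners.append(som_step(weights, sample, rate=0.6))
    print(sample, "-> Unit", winners[-1] + 1, [round(w, 3) for w in weights[winners[-1]]])
```

Under these assumptions the first and fourth samples are won by Unit 1 and the second and third by Unit 2, so the two units end up representing two groups of samples.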
Learning Rules
Learning in NN is performed by adjusting the network weights in order to minimize the
difference between the desired and estimated output.
TYPES OF ANN
1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network
Feed Forward Neural Network:
The simplest feed-forward neural network is a single-layer perceptron. A sequence of inputs enters the layer and is
multiplied by the weights in this model. The weighted input values are then summed together to form a total.
If the sum of the values is more than a predetermined threshold, which is normally set at zero, the output
value is usually 1, and if the sum is less than the threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification.
The model may or may not contain hidden layer and there is no backpropagation.
Based on the number of hidden layers they are further classified into single-layered and multilayered feed
forward network.
Fully connected Neural Network:
A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the other layer.
The major advantage of fully connected networks is that they are "structure agnostic", i.e., there
are no special assumptions needed to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one
output layer with a single node for each output, and any number of hidden layers, each of
which can have any number of nodes.
Inputs flow forward through the layers, while errors are propagated backward during training.
The weight adjustment training is done via backpropagation.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation
function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid
formula.
Limitations of ANN
Challenges of Artificial Neural Networks
BCS602 | MACHINE LEARNING| VTU Belagavi.
Module-5
This is done using a trial and error approach, as there is no supervisor available as in
classification.
The characteristic of clustering is that the objects within a cluster or group are similar to
each other while differing significantly from the objects in other clusters.
The input for cluster analysis is examples or samples. These are known as objects, data
All these terms are same and used interchangeably in this chapter. All the samples or objects
The output is the set of clusters (or groups) of similar data if it exists in the input.
For example, the following Figure 13.1(a) shows data points or samples with two features
shown in different shaded samples and Figure 13.1(b) shows the manually drawn ellipse to
Visual identification of clusters in this case is easy as the examples have only two features.
But, when examples have more features, say 100, then clustering cannot be done manually and
automatic clustering algorithms are required.
Also, automating the clustering process is desirable, as these tasks are considered difficult and
almost impossible for humans. All clusters are represented by centroids.
Example: If the input examples or data are (3, 3), (2, 6) and (7, 9), then the centroid
is given as ((3 + 2 + 7)/3, (3 + 6 + 9)/3) = (4, 6).
The clusters should not overlap and every cluster should represent only one class. Therefore,
clustering algorithms use trial and error method to form clusters that can be converted to labels.
Applications of Clustering
High-Dimensional Data
Scalability Issue
o Some algorithms perform well for small datasets but fail for large-scale data.
Unit Inconsistency
Proximity Measures
Variables
Binary Attributes
Categorical Variables
Ordinal Variables
Cosine Similarity
Distance Measures
Overview
o Merges clusters based on the smallest distance between two points from different clusters.
o Related to the Minimum Spanning Tree (MST).
No model assumptions
Only one parameter of the window, that is, the bandwidth, is required
Robust to noise
No
Selecting the bandwidth is a challenging task. If it is too large, then many clusters are missed. If
it is too small, then many points are missed and convergence becomes a problem.
The number of clusters cannot be specified and user has no control over this parameter.
1. Initialization
o Select k initial cluster centers (randomly or using prior knowledge).
o Normalize data for better performance.
2. Assignment of Data Points
o Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids
Mathematical Optimization
Advantages
Disadvantages
Computational Complexity
O(nkId), where:
o n = number of samples
o k = number of clusters
o I = number of iterations
o d = number of attributes
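The k-means loop (initialize centers, assign points to the nearest centroid, update centroids) can be sketched as below. This is a minimal version assuming Euclidean distance, caller-supplied initial centers, and termination when the centroids stop changing; the function name `kmeans` and the toy points are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids no longer change."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

Each iteration compares every one of the n points with each of the k centroids across d attributes, and this is repeated for I iterations, which is where the O(nkId) cost comes from.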
Density-based Methods
1. Core Point
o A point with at least m points in its ε-neighborhood.
2. Border Point
o Has fewer than m points in its ε-neighborhood but is adjacent to a core point.
3. Noise Point (Outlier)
o Neither a core point nor a border point.
o X is density reachable from Y if there exists a chain of core points linking them.
3. Density Connected
o X and Y are density connected if they are both density reachable from a
common core point Z.
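The core, border and noise definitions can be applied to a small point set as follows. The sketch assumes Euclidean distance and counts a point as a member of its own ε-neighborhood; it only labels the points and does not form the clusters. The function name `classify_points` and the data are hypothetical.

```python
import math

def classify_points(points, eps, m):
    """Label each point as 'core', 'border' or 'noise' for radius eps and minimum count m."""
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= m}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")      # not dense itself, but within eps of a core point
        else:
            labels.append("noise")
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
labels = classify_points(points, eps=1.0, m=3)
print(list(zip(points, labels)))
```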
Advantages of DBSCAN
Disadvantages of DBSCAN
Grid-based Approach
Grid-based clustering partitions space into a grid structure and fits data into cells for
clustering.
Suitable for high-dimensional data.
Uses subspace clustering, dense cells, and monotonicity property.
Concepts
Subspace Clustering
o Clusters are formed using a subset of features (dimensions) rather than all
attributes.
o Useful for high-dimensional data like gene expression analysis.
Monotonicity Property
Advantages of CLIQUE
Disadvantage of CLIQUE
Chapter :- 2
Reinforcement Learning
RL simulates real-world scenarios for a computer program (agent) to learn by trial and
error.
The agent executes actions, receives positive or negative rewards, and optimizes its
future actions based on these experiences.
Characteristics of RL
o Consider a grid-based game where a robot must navigate from a starting node (E) to a goal
node (G) by choosing the optimal path.
o RL can learn the best route by exploring different paths and receiving feedback on their
efficiency.
o In obstacle-based games, RL can identify safe paths while avoiding dangerous zones.
1. Sequential Decision-Making
o In RL, decisions are made in steps, and each step influences future choices.
o Example: In a maze game, a wrong turn can lead to failure.
2. Delayed Feedback
o Rewards are not always immediate; the agent may need to take multiple steps before
receiving feedback.
3. Interdependence of Actions
o Each action affects the next set of choices, meaning an incorrect move can have
long-term consequences.
4. Time-Related Decisions
o Actions are taken in a specific sequence over time, affecting the final
outcome.
Reward Design
o Setting the right reward values is crucial. Incorrectly designed rewards may lead the agent
to learn undesired behavior.
o Some environments, like chess, have fixed rules, but many real-world problems lack
predefined models.
o Example: Training a self-driving car requires simulations to generate experiences.
Partial Observability
o Some environments, like weather prediction, involve uncertainty because complete state
information is unavailable.
1. Industrial Automation
o Optimizing robot movements for efficiency.
2. Resource Management
o Allocating resources in data centers and cloud computing.
3. Traffic Light Control
o Reducing congestion by adjusting signal timings dynamically.
4. Personalized Recommendation Systems
o Used in news feeds, e-commerce, and streaming services (e.g., Netflix
recommendations).
5. Online Advertisement Bidding
o Optimizing ad placements for maximum engagement.
6. Autonomous Vehicles
o RL helps in training self-driving cars to navigate safely.
7. Game AI (Chess, Go, Dota 2, etc.)
o AI models like AlphaGo use RL to master complex games.
8. DeepMind Applications
Reinforcement Learning (RL) is a distinct branch of machine learning that differs significantly
from supervised learning.
While supervised learning depends on labeled data, reinforcement learning learns through
interaction with the environment, making decisions based on trial and error.
Why RL Is Necessary?
Some tasks cannot be solved using supervised learning due to the absence of a labeled training
dataset. For example:
Chess & Go: There is no dataset with all possible game moves and their
outcomes. RL allows the agent to explore and improve over time.
Autonomous Driving: The car must learn from real-world experiences rather than
relying on a fixed dataset.
Basic Components of RL
Types of RL Problems
Learning Problems
Planning Problems
Known environment – The agent can compute and improve the policy using a model.
Example – Chess AI that plans its moves based on game rules.
The environment contains all elements the agent interacts with, including
obstacles, rewards, and state transitions.
The agent makes decisions and performs actions to maximize rewards.
Example
In self-driving cars,
Example (Navigation)
In a grid-based game, states represent positions (A, B, C, etc.), and actions are movements (UP,
DOWN, LEFT, RIGHT).
Types of States
Types of Episodes
Episodic – Has a definite start and goal state (e.g., solving a maze).
Continuous – No fixed goal state; the task continues indefinitely (e.g., stock
trading).
Policies in RL
A policy (π) is the strategy used by the agent to choose actions. Types of
Policies
The optimal policy is the one that maximizes cumulative expected rewards.
Rewards in RL
RL Algorithm Categories
It consists of a sequence of random variables where the probability of transitioning to the next
state depends only on the current state and not on the past states.
80% of students from University A move to University B for a master's degree, while
20% remain in University A.
60% of students from University B move to University A, while 40% remain in
University B.
Each row represents a probability distribution, meaning the sum of elements in each row equals
1.
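The university example can be written as a transition matrix and used to predict where a student is likely to be after one or two steps. This sketch uses plain Python lists; the state order (A, B) and the function name `step` are assumptions.

```python
# Rows: current state (A, B). Columns: next state (A, B).
P = [[0.2, 0.8],    # from University A: 20% remain in A, 80% move to B
     [0.6, 0.4]]    # from University B: 60% move to A, 40% remain in B

def step(dist, P):
    """Multiply a probability distribution over states by the transition matrix."""
    return [sum(dist[i] * P[i][j] for i in range(len(P))) for j in range(len(P[0]))]

start = [1.0, 0.0]                 # the student is in University A now
after_one = step(start, P)
after_two = step(after_one, P)
print(after_one, after_two)
```

Starting in University A, the distribution after one step is the first row of the matrix, and after two steps the probability of being in A is 0.2 × 0.2 + 0.8 × 0.6 = 0.52.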
Probability Prediction
1. Set of states
2. Set of actions
3. Transition probability function
4. Reward function
5. Policy
6. Value function
Markov Assumption
The Markov property states that the probability of reaching a state and receiving a reward
depends only on the previous state and action:
MDP Process
The probability of moving from state s to state s′ after taking action a is given by:
P(s′ | s, a) = P(St+1 = s′ | St = s, At = a)
This forms a state transition matrix, where each row represents transition probabilities from one
state to another.
Expected Reward
Goal of MDP
The agent's objective is to maximize total accumulated rewards over time by following an
optimal policy.
Learning Overview
Reinforcement learning (RL) uses trial and error to learn a series of actions that maximize the
total reward. RL consists of two fundamental sub-problems:
o The goal is to predict the total reward (return), also known as policy evaluation or value
estimation.
o This requires the formulation of a function called the state-value function.
o The estimation of the state-value function can be performed using Temporal
Difference (TD) Learning.
Policy Improvement:
Consider a hypothetical casino with a robotic arm that activates a 5-armed slot machine. When a
lever is pulled, the machine returns a reward within a specified range (e.g., $1 to
$10).
The challenge is that each arm provides rewards randomly within this range.
Objective:
Given a limited number of attempts, the goal is to maximize the total reward by selecting the best
lever.
A logical approach is to determine which lever has the highest average reward and use it repeatedly.
Formalization:
Given k attempts on an N-arm slot machine, with rewards r1, r2, ..., rk, the expected reward
(action-value function) is:
Q(a) = (r1 + r2 + ... + rk) / k
This indicates the action that returns the highest average reward and is used as an indicator of
action quality.
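The action-value idea can be sketched as an average over observed rewards, followed by a simple selection rule. The reward histories are made-up numbers, and the ε-greedy selection shown here is one common policy, introduced as an assumption for illustration; with ε = 0 it reduces to always picking the lever with the highest average reward.

```python
import random

def action_value(rewards):
    """Average reward observed for one lever (sample-average estimate)."""
    return sum(rewards) / len(rewards)

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon explore a random lever, otherwise exploit the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical reward histories for three levers.
history = {0: [2, 4, 3], 1: [7, 9, 8], 2: [1, 1, 4]}
q = [action_value(history[a]) for a in sorted(history)]
best = epsilon_greedy(q, epsilon=0.0, rng=random.Random(0))
print(q, best)
```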
Example:
If a slot machine is chosen five times and returns rewards , the quality of this action is:
Exploration:
Exploitation:
Selection Policies
Greedy Method
Greedy Method
1. Value-Based Approaches
o Optimize the value function V(s), which represents the maximum expected future reward
from a given state.
o Uses a discount factor to weight immediate rewards against future rewards.
2. Policy-Based Approaches
o Focus on finding the optimal policy π*, a function that maps states to actions.
o Rather than estimating values, it directly learns which action to take.
3. Actor-Critic Methods (Hybrid Approaches)
o Combine value-based and policy-based methods.
o The actor updates the policy, while the critic evaluates it.
4. Model-Based Approaches
o Create a model of the environment (e.g., using Markov Decision Processes (MDPs)).
o Use simulations to plan the best actions.
5. Model-Free Approaches
Availability of models
Nature of updates (incremental vs. batch learning)
Exploration vs. exploitation trade-offs
Computational efficiency
Model-based Learning
Passive Learning refers to a model-based environment, where the environment is known. This
means that for any given state, the next state and action probability distribution are known.
Markov Decision Process (MDP) and Dynamic Programming are powerful tools for solving
reinforcement learning problems in this context.
The mathematical foundation for passive learning is provided by MDP. These model-based
reinforcement learning problems can be solved using dynamic programming after constructing
the model with MDP.
The primary objective in reinforcement learning is to take an action a that transitions the system
from the current state to the end state while maximizing rewards. These rewards can be positive
or negative.
The goal is to maximize expected rewards by choosing the optimal policy π*, the policy whose value v(s) is at least as high as that of any other policy, for all states s.
An agent in reinforcement learning has multiple courses of action for a given state. The way the
agent behaves is determined by its policy.
A policy is a distribution over all possible actions with probabilities assigned to each action.
Different actions yield different rewards. To quantify and compare these rewards, we use value
functions.
A value function summarizes possible future scenarios by averaging expected returns under a
given policy π.
It is a prediction of future rewards and computes the expected sum of future rewards for a given
state s under policy π:
v(s) = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
where v(s) represents the quality of the state based on a long-term strategy, R_{t+1}, R_{t+2}, ... are the
rewards received after time t, and γ is the discount factor.
Example
If we have two states with values 0.2 and 0.9, the state with 0.9 is a better state to be in.
State-Value Function
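Denoted as v(s), the state-value function of an MDP is the expected return from state s under a
policy π:
v(s) = E_π[G_t | S_t = s]
where G_t is the return, i.e. the sum of rewards collected from time t onward.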
This function accumulates all expected rewards, potentially discounted over time, and helps
determine the goodness of a state.
Apart from v(s), another function called the Q-function is used. This function returns a real
value indicating the total expected reward when an agent:
1. Starts in state s
2. Takes action a
3. Follows a policy π afterward
Bellman Equation
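Dynamic programming methods require a recursive formulation of the problem. The recursive
formulation of the state-value function is given by the Bellman equation:
v(s) = E_π[R_{t+1} + γ v(S_{t+1}) | S_t = s]
That is, the value of a state is the expected immediate reward plus the discounted value of the
successor state.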
There are two main algorithms for solving reinforcement learning problems using conventional
methods:
1. Value Iteration
2. Policy Iteration
Value Iteration
Algorithm
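A minimal Python sketch of value iteration is given below, assuming the model is available as a nested dictionary P in which P[s][a] is a list of (probability, next_state, reward) triples. This representation, the function name value_iteration and the small three-state chain are assumptions made for illustration and are not taken from these notes.

```python
def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until values stop changing."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:               # terminal state: no actions, value stays 0
                continue
            # Best expected one-step return over all actions available in s.
            best = max(
                sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v

# Hypothetical deterministic chain: A -> B -> END, reward 1 on reaching END.
P = {
    "A": {"go": [(1.0, "B", 0.0)], "stay": [(1.0, "A", 0.0)]},
    "B": {"go": [(1.0, "END", 1.0)], "stay": [(1.0, "B", 0.0)]},
    "END": {},
}
values = value_iteration(P)
```

In this toy chain the value of B settles at the reward for reaching END, and the value of A is that amount discounted by one step.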
Policy Iteration
1. Policy Evaluation
2. Policy Improvement
Policy Evaluation
Initially, for a given policy π, the algorithm starts with v(s) = 0 (no reward). The Bellman
equation is then applied repeatedly to update v(s), and the process continues until v(s) converges
to the value of policy π.
Policy Improvement
Algorithm
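The two alternating steps of policy iteration can be sketched in Python as follows. As an assumption for this example, the model is a dictionary P where P[s][a] lists (probability, next_state, reward) triples; the helper q_value, the function policy_iteration and the toy chain are hypothetical names and data.

```python
def q_value(P, v, s, a, gamma):
    """Expected one-step return of taking action a in state s, then following v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

def policy_iteration(P, gamma=0.9, tol=1e-8):
    policy = {s: next(iter(P[s])) for s in P if P[s]}   # arbitrary starting policy
    v = {s: 0.0 for s in P}
    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy.
        while True:
            delta = 0.0
            for s in policy:
                new_v = q_value(P, v, s, policy[s], gamma)
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action where one exists.
        stable = True
        for s in policy:
            best = max(P[s], key=lambda a: q_value(P, v, s, a, gamma))
            if q_value(P, v, s, best, gamma) > q_value(P, v, s, policy[s], gamma) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, v

# Hypothetical chain where the starting policy ("stay") is poor.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 0.0)]},
    "B": {"stay": [(1.0, "B", 0.0)], "go": [(1.0, "END", 1.0)]},
    "END": {},
}
best_policy, best_values = policy_iteration(P)
```

The loop stops when an improvement pass changes no action, which is the usual stopping test for this scheme.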
Model-free methods do not require complete knowledge of the environment. Instead, they learn
through experience and interaction with the environment.
The reward determination in model-free methods can be categorized into three formulations:
1. Episodic Formulation: Rewards are assigned based on the outcome of an entire episode. For
example, if a game is won, all actions in the episode receive a positive reward (+1). If lost, all
actions receive a negative reward (-1). However, this approach may unfairly penalize or
reward intermediate actions.
2. Continuous Formulation: Rewards are determined immediately after an action. An
example is the multi-armed bandit problem, where an immediate reward between $1 and $10
can be given after each action.
3. Discounted Returns: Long-term rewards are considered using a discount factor. This method
is often used in reinforcement learning algorithms.
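To make the discounted formulation concrete, the Python sketch below computes a discounted return from a list of rewards by working backwards from the last step. The function name discounted_return and the sample rewards are assumptions for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g      # fold in one earlier reward per step
    return g

# Hypothetical episode with three rewards of 1 and a discount factor of 0.5.
total = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(total)
```

With gamma equal to 1 the same function gives the plain undiscounted sum, and smaller values of gamma shrink the contribution of later rewards.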
Monte-Carlo Methods
Monte Carlo (MC) methods do not assume any predefined model, making them purely
experience-driven. This approach is analogous to how humans and animals learn from interactions
with their environment.
Experience is divided into episodes, where each episode is a sequence of states from a
starting state to a goal state.
Episodes must terminate; regardless of the starting point, an episode must reach an
endpoint.
Action-value functions are computed only after the completion of an episode, making
MC incremental episode by episode rather than step by step.
MC methods compute rewards at the end of an episode to estimate maximum expected
future rewards.
Empirical mean is used instead of expected return; the total return over multiple
episodes is averaged.
Due to the non-stationary nature of environments, value functions are computed for a
fixed policy and revised using dynamic programming.
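The idea of replacing the expected return with an empirical mean over episodes can be sketched in Python as below. This is a first-visit Monte Carlo estimate under the assumption that each episode is recorded as a list of (state, reward) pairs, where the reward is the one received on leaving that state. The function name and the three sample episodes are hypothetical.

```python
from collections import defaultdict

def mc_first_visit(episodes, gamma=1.0):
    """Average the return following the first visit to each state in each episode."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        # Compute the return from every time step by scanning backwards.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:
            if state not in seen:          # count only the first visit
                seen.add(state)
                totals[state] += g
                counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical completed episodes.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", -1.0)],
    [("B", 1.0)],
]
v_estimate = mc_first_visit(episodes)
```

Because returns are only known once an episode has ended, nothing is updated until the whole episode is available.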
Temporal Difference (TD) Learning is an alternative to Monte Carlo methods. It is also a model-
free technique that learns from experience and interaction with the environment.
Characteristics of TD Learning:
Bootstrapping Method: Updates are based on the observed reward and the current estimate of the next state's value.
Incremental Updates: Unlike MC, which waits until the end of an episode, TD
updates values at each step.
More Efficient: TD can learn before an episode ends, making it more sample-
efficient than MC methods.
Used for Non-Stationary Problems: Suitable for environments where conditions
change over time.
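The step-by-step update that distinguishes TD from MC can be sketched as the one-step TD(0) rule below. The function name td0_update, the dictionary-based value table and the two sample transitions are assumptions made for this example.

```python
def td0_update(v, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Move v[state] a fraction alpha toward reward + gamma * v[next_state]."""
    td_target = reward + gamma * v.get(next_state, 0.0)   # bootstrapped target
    td_error = td_target - v.get(state, 0.0)
    v[state] = v.get(state, 0.0) + alpha * td_error
    return td_error

# Hypothetical transitions, applied as soon as each one is observed.
v = {}
td0_update(v, "B", 1.0, "END", alpha=0.5, gamma=0.9)
td0_update(v, "A", 0.0, "B", alpha=0.5, gamma=0.9)
```

Each call needs only a single transition, so the estimate for a state can change before the episode has finished.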
TD Learning can be accelerated using eligibility traces, which allow updates to be spread over
multiple states. This leads to a family of algorithms called TD(λ), where λ is the decay parameter
(0 ≤ λ ≤ 1):
Q-Learning
Q-Learning Algorithm
1. Initialize Q-table:
o Create a table Q(s,a) with states s and actions a.
o Initialize Q-values with random or zero values.
2. Set parameters:
o Learning rate α (typically between 0 and 1).
o Discount factor γ (typically close to 1).
o Exploration-exploitation trade off strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
o Start from an initial state s.
o Repeat until reaching a terminal state:
This iterative process helps the agent learn optimal Q-values, which guide it to take actions that
maximize rewards.
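The outline above can be turned into a small runnable Python sketch. As assumptions for illustration, the environment is a hypothetical corridor of five states with actions 0 (left) and 1 (right), reaching the last state pays +1 and ends the episode, and the update used is the standard Q-learning rule Q(s,a) ← Q(s,a) + α [r + γ max Q(s',·) − Q(s,a)].

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]      # Q-table initialised to zero
    goal = n_states - 1
    for _ in range(episodes):
        s = 0                                      # initial state
        while s != goal:
            if rng.random() < epsilon:
                a = rng.randrange(2)               # explore
            else:
                best = max(q[s])                   # exploit, ties broken at random
                a = rng.choice([i for i in (0, 1) if q[s][i] == best])
            s2 = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s2 == goal else 0.0
            # Off-policy target: best action value in the next state.
            target = r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q

q_table = q_learning()
```

After training on this toy problem, moving right should score higher than moving left in every non-terminal state, which is the behaviour the learned Q-values are meant to guide.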
SARSA Learning
SARSA Algorithm (State-Action-Reward-State-Action)
Initialize Q-table:
Set parameters:
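The remaining SARSA steps are not listed here, so the Python sketch below uses the standard on-policy update, in which the target uses the action actually chosen in the next state (hence State-Action-Reward-State-Action). The corridor environment, the helper epsilon_greedy and all parameter values are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Random action with probability epsilon, otherwise a best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, val in enumerate(q_row) if val == best])

def sarsa(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA on a toy corridor: action 0 moves left, action 1 moves right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(q[s], epsilon, rng)
        while s != goal:
            s2 = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s2 == goal else 0.0
            a2 = epsilon_greedy(q[s2], epsilon, rng)   # next action from the same policy
            # On-policy target: value of the action that will actually be taken.
            q[s][a] += alpha * (r + gamma * q[s2][a2] - q[s][a])
            s, a = s2, a2
    return q

sarsa_table = sarsa()
```

The only difference from Q-learning in this sketch is the target: SARSA uses q[s2][a2] for the chosen next action instead of the maximum over next actions.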
Model-based reinforcement learning algorithms construct a model of the environment, predicting state transitions and outcomes based on this model to plan optimal actions in advance. This approach is beneficial in scenarios where the environment dynamics are well understood, such as in chess AI, allowing precise action simulations before execution. On the other hand, model-free algorithms do not rely on a predefined environmental model; instead, they learn the best actions through direct interaction with the environment via trial and error, adjusting actions based on received rewards. Examples include Temporal Difference (TD) Learning, where value functions are updated from experienced transitions rather than from a model of the environment. Model-free methods are advantageous in complex environments where modeling is challenging or where dynamics continuously change.
Big Data is characterized by six main attributes known as the 6 Vs: Volume, Velocity, Variety, Veracity, Validity, and Value. Volume refers to the vast amounts of data, often measured in petabytes or exabytes. Velocity is the rapid generation and processing of data, often in real-time, attributed to IoT and the Internet. Variety highlights the diversity in data formats, from text and audio to video and graphs. Veracity addresses data accuracy and trustworthiness, which can be compromised by errors or technical glitches. Validity concerns the relevance and applicability of data for decision-making. Value refers to the usefulness of data insights for decision-making.
Within a Markov Decision Process (MDP), the state transition probability function defines the likelihood of moving from one state to another after taking a specific action. This function is central to MDPs as it encapsulates the dynamics of the environment, impacting the decision-making of reinforcement learning agents by outlining expected transitions and potential outcomes. MDPs operate under the Markov property, assuming that the next state and reward depend only on the current state and action, not on the earlier history of states. This probabilistic framework allows agents to evaluate potential actions by considering not just immediate rewards, but also long-term outcomes, thereby optimizing the agent's policy to maximize total expected rewards over time.
Structured data is organized in a predefined format, such as databases or spreadsheets, allowing easy searching, retrieval, and analysis using tools like SQL. It typically consists of record data, where rows represent objects and columns contain measurements for these objects, often organized in a data matrix or graph data. Conversely, unstructured data lacks a predefined organizational format and includes multimedia content such as video, images, and text documents. Storage and processing of structured data are generally straightforward, given its uniform schema, while unstructured data requires complex processing techniques to extract insights due to its varied and unformatted nature.
Data quality for numeric attributes in machine learning is influenced by precision, bias, and accuracy. Precision measures the consistency or closeness of repeated measurements, often evaluated through standard deviation. Bias represents systematic errors due to incorrect assumptions or procedures, affecting the model's ability to generalize. Accuracy signifies how closely measurements approach the true value, typically indicated by significant digits in data storage and manipulation. Low precision implies higher variability in data, while bias leads to skewed results, and insufficient accuracy diminishes the reliability of predictions and decisions derived from the data.
In reinforcement learning, exploration versus exploitation represents a fundamental strategy dilemma where agents must balance between trying new actions (exploration) and using known rewarding actions (exploitation). Exploration allows the agent to discover potentially more rewarding actions that were previously unknown, albeit with the risk of incurring sub-optimal outcomes in the short term. Exploitation focuses on actions that currently return the highest rewards based on past learning, often favoring known short-term benefits. An optimal policy is formed by effectively managing this balance; strategies like the ε-greedy method enable controlled exploration by introducing random action selection with a small probability, ensuring that exploitation persists most of the time, thus gradually converging on the most rewarding actions.
Non-operational data is utilized in strategic decision-making to inform long-term business strategies, often involving analysis of historical data to predict future trends and refine business approaches. This data type supports higher-level decisions such as market expansion or product line diversification. In contrast, operational data is used for day-to-day management and immediate operational processes, such as managing logistics or tracking daily sales figures, designed to optimize current business functioning and operational efficiency. While operational data impacts immediate business operations, non-operational data influences broader, strategic goals and forward-looking initiatives.
In machine learning, a well-posed problem is one with clearly defined specifications that allow for consistent solutions, facilitating effective model training and implementation. Having well-posed problems ensures that the learning algorithms can be optimized for predictable and reliable performance. In contrast, tackling ill-posed problems poses significant challenges, such as the lack of unique solutions or sensitivity to variations in input data, which can lead to difficulties in achieving model stability and generalization. This uncertainty often necessitates additional strategies such as regularization or reformulation of the problem to transform it into a well-posed one, ensuring better adaptability of the learning model to unseen data.
Unsupervised learning operates on unlabeled data and groups the data into clusters based on attribute similarity, without guidance from labeled inputs, using methods like cluster analysis. In contrast, semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data, initially labeling the unlabeled data using the labeled dataset, and then using both for learning purposes. Reinforcement learning, however, relies on an agent interacting with an environment to gather feedback in the form of rewards and penalties, which guide the learning process through trial and error.
Data pre-processing is crucial in machine learning due to the typically 'dirty' nature of raw real-world data, which can include errors, missing data, and inconsistencies that negatively affect model outcomes. Pre-processing involves detecting and correcting these issues to enhance data quality before applying learning algorithms. Common problems addressed in data pre-processing include incomplete data, inaccuracies, outliers, missing values, inconsistent values, and duplicate data. By resolving these issues, pre-processing ensures the dataset is cleaner, more reliable, and thus more suitable for effective model training and accurate predictions.