BCS602: Intro to Machine Learning

Q: What are the key differences in approaches between model-based and model-free reinforcement learning algorithms?

Model-based reinforcement learning algorithms construct a model of the environment, predicting state transitions and outcomes based on this model to plan optimal actions in advance. This approach is beneficial in scenarios where the environment dynamics are well understood, such as in chess AI, allowing precise action simulations before execution. On the other hand, model-free algorithms do not rely on a model of the environment; instead, they learn the best actions through direct interaction with the environment via trial and error, adjusting actions based on received rewards. Examples include Temporal Difference (TD) learning, where value functions are updated from experienced transitions rather than from a model of the environment. Model-free methods are advantageous in complex environments where modeling is challenging or where dynamics continuously change.
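
As a rough illustration of the model-free idea, the sketch below applies a single temporal-difference, TD(0), value update from one experienced transition. The state names, the step size `alpha`, and the discount factor `gamma` are illustrative assumptions, not values taken from these notes.

```python
def td0_update(values, state, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one TD(0) update to the value table and return the new estimate."""
    # TD target: observed reward plus the discounted estimate of the next state.
    target = reward + gamma * values[next_state]
    # Move the current estimate a fraction alpha toward the target.
    values[state] += alpha * (target - values[state])
    return values[state]


# Hypothetical two-state value table and one observed transition A -> B.
values = {"A": 0.0, "B": 1.0}
new_value = td0_update(values, "A", reward=1.0, next_state="B")
print(new_value)
```

No transition model is consulted here: the update uses only the reward and next state that were actually observed.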

Q: What are the primary characteristics and challenges associated with big data that the 6 Vs identify?

Big Data is characterized by six main attributes known as the 6 Vs: Volume, Velocity, Variety, Veracity, Validity, and Value. Volume refers to the vast amounts of data, often measured in petabytes or exabytes. Velocity is the rapid generation and processing of data, often in real-time, attributed to IoT and the Internet. Variety highlights the diversity in data formats, from text and audio to video and graphs. Veracity addresses data accuracy and trustworthiness, which can be compromised by errors or technical glitches. Validity concerns the relevance and applicability of data for decision-making. Value refers to the usefulness of data insights for decision-making.

Q: Explain how state transition probability functions within a Markov Decision Process (MDP) influence the decision-making of reinforcement learning agents.

Within a Markov Decision Process (MDP), the state transition probability function defines the likelihood of moving from one state to another after taking a specific action. This function is central to MDPs as it encapsulates the dynamics of the environment, impacting the decision-making of reinforcement learning agents by outlining expected transitions and potential outcomes. MDPs operate under the Markov property, which assumes that the next state and reward depend only on the current state and action, not on earlier states. This probabilistic framework allows agents to evaluate potential actions by considering not just immediate rewards, but also long-term outcomes, thereby optimizing the agent's policy to maximize total expected rewards over time.

Q: What differentiates structured from unstructured data and how are they typically stored and processed?

Structured data is organized in a predefined format, such as databases or spreadsheets, allowing easy searching, retrieval, and analysis using tools like SQL. It typically consists of record data, where rows represent objects and columns contain measurements for these objects, often organized in a data matrix or graph data. Conversely, unstructured data lacks a predefined organizational format and includes multimedia content such as video, images, and text documents. Storage and processing of structured data are generally straightforward, given its uniform schema, while unstructured data requires complex processing techniques to extract insights due to its varied and unformatted nature.

Q: In what ways do precision, bias, and accuracy affect the data quality of numeric attributes in machine learning?

Data quality for numeric attributes in machine learning is influenced by precision, bias, and accuracy. Precision measures the consistency or closeness of repeated measurements, often evaluated through standard deviation. Bias represents systematic errors due to incorrect assumptions or procedures, affecting the model's ability to generalize. Accuracy signifies how closely measurements approach the true value, typically indicated by significant digits in data storage and manipulation. Low precision implies higher variability in data, while bias leads to skewed results, and insufficient accuracy diminishes the reliability of predictions and decisions derived from the data.
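
The difference between precision and bias can be made concrete with a small sketch. It assumes a set of hypothetical repeated readings and a known true value (both invented for illustration), and uses the sample standard deviation as the precision measure mentioned above.

```python
from statistics import mean, stdev


def precision_and_bias(measurements, true_value):
    """Return (sample standard deviation, bias) for repeated measurements."""
    spread = stdev(measurements)            # precision: smaller spread = more precise
    bias = mean(measurements) - true_value  # systematic offset from the true value
    return spread, bias


# Hypothetical repeated readings of a quantity whose true value is 10.0.
readings = [10.2, 10.3, 10.1, 10.2, 10.2]
spread, bias = precision_and_bias(readings, true_value=10.0)
print(f"spread={spread:.3f}, bias={bias:.3f}")
```

These readings cluster tightly (good precision) yet sit consistently above the true value (non-zero bias), which is why the two notions need to be checked separately.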

Q: Discuss how the concept of exploration versus exploitation is addressed in reinforcement learning and its significance in forming optimal policies.

In reinforcement learning, exploration versus exploitation represents a fundamental strategy dilemma where agents must balance between trying new actions (exploration) and using known rewarding actions (exploitation). Exploration allows the agent to discover potentially more rewarding actions that were previously unknown, albeit with the risk of incurring sub-optimal outcomes in the short term. Exploitation focuses on actions that currently return the highest rewards based on past learning, often favoring known short-term benefits. An optimal policy is formed by effectively managing this balance; strategies like the ε-greedy method enable controlled exploration by introducing random action selection with a small probability, ensuring that exploitation persists most of the time, thus gradually converging on the most rewarding actions.
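
A minimal sketch of ε-greedy action selection follows. The function name, the list of value estimates, and the optional `rng` parameter are assumptions made for this illustration; they do not come from the notes.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploration: choose any action uniformly at random.
        return rng.randrange(len(action_values))
    # Exploitation: choose the action with the highest estimated value.
    return max(range(len(action_values)), key=lambda a: action_values[a])


estimates = [0.1, 0.9, 0.3]  # hypothetical value estimates for three actions
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)   # never explores
random_choice = epsilon_greedy(estimates, epsilon=1.0,   # always explores
                               rng=random.Random(0))
```

With a small `epsilon` such as 0.1, most calls exploit the best-known action while an occasional call tries something else.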

Q: How does the use of non-operational data in strategic decision-making differ from the applications of operational data in a business context?

Non-operational data is utilized in strategic decision-making to inform long-term business strategies, often involving analysis of historical data to predict future trends and refine business approaches. This data type supports higher-level decisions such as market expansion or product line diversification. In contrast, operational data is used for day-to-day management and immediate operational processes, such as managing logistics or tracking daily sales figures, designed to optimize current business functioning and operational efficiency. While operational data impacts immediate business operations, non-operational data influences broader, strategic goals and forward-looking initiatives.

Q: What role does a well-posed problem play in machine learning, and what challenges arise from tackling ill-posed problems?

In machine learning, a well-posed problem is one with clearly defined specifications that allow for consistent solutions, facilitating effective model training and implementation. Having well-posed problems ensures that the learning algorithms can be optimized for predictable and reliable performance. In contrast, tackling ill-posed problems poses significant challenges, such as the lack of unique solutions or sensitivity to variations in input data, which can lead to difficulties in achieving model stability and generalization. This uncertainty often necessitates additional strategies such as regularization or reformulation of the problem to transform it into a well-posed one, ensuring better adaptability of the learning model to unseen data.

Q: How is unsupervised learning fundamentally different in approach when compared to semi-supervised and reinforcement learning in machine learning?

Unsupervised learning operates on unlabeled data and groups the data into clusters based on attribute similarity, without guidance from labeled inputs, using methods like cluster analysis. In contrast, semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data, initially labeling the unlabeled data using the labeled dataset, and then using both for learning purposes. Reinforcement learning, however, relies on an agent interacting with an environment to gather feedback in the form of rewards and penalties, which guide the learning process through trial and error.

Q: Why is data pre-processing critical in preparing datasets for machine learning, and what common issues does it address?

Data pre-processing is crucial in machine learning due to the typically 'dirty' nature of raw real-world data, which can include errors, missing data, and inconsistencies that negatively affect model outcomes. Pre-processing involves detecting and correcting these issues to enhance data quality before applying learning algorithms. Common problems addressed in data pre-processing include incomplete data, inaccuracies, outliers, missing values, inconsistent values, and duplicate data. By resolving these issues, pre-processing ensures the dataset is cleaner, more reliable, and thus more suitable for effective model training and accurate predictions.

Module 1 of the Machine Learning course introduces the fundamentals of machine learning, highlighting its popularity due to high data volume, reduced storage costs, and advanced algorithms. It explains the knowledge pyramid, the evolution of machine learning, and its relationship with artificial intelligence, data science, and statistics. The module also covers types of machine learning, specifically supervised learning, which includes classification and regression techniques, along with their applications and key algorithms.

Uploaded by

nanirajuananya

Module 1- Machine Learning (BCS602)

INTRODUCTION TO MACHINE LEARNING

Why Machine Learning is Popular

1. High Data Volume: Large companies like Facebook, Twitter, and YouTube generate
huge amounts of data, which doubles annually.
2. Reduced Storage Costs: Declining hardware and storage costs make it easier to
capture, store, and process vast amounts of data.
3. Advanced Algorithms: The development of complex algorithms, especially in deep
learning, enables more powerful machine learning applications.
The Knowledge Pyramid
1. Data: Basic facts and raw numbers. Organizations store vast amounts of data from
sources like databases and warehouses.
2. Information: Processed data revealing patterns or associations. For instance,
analyzing sales data to determine the best-selling product.
3. Knowledge: Condensed information, such as historical patterns and future trends.
Extracting knowledge from data is crucial for decision-making.
4. Intelligence: Applied knowledge. It represents actionable insights, such as strategies
derived from knowledge.
5. Wisdom: The highest level, where intelligence evolves into maturity and sound
judgment, typically exhibited by humans.
Objective of Machine Learning
Machine learning processes archival data to:
 Make better decisions.
 Design new products.
 Improve business processes.
 Build decision support systems.
What is Machine Learning?
Machine learning is a sub-field of AI that allows computers to learn without
explicit programming. As Arthur Samuel defined: “Machine learning is the field
of study that gives computers the ability to learn without being explicitly
programmed.”
In conventional programming, we teach computers how to perform tasks step-by-step.
However, real-world problems like image recognition or complex games
require systems that can learn from data directly.
Evolution of Machine Learning
 Early systems like expert systems (e.g., MYCIN for medical diagnosis) relied on human
rules and logic, but they didn't exhibit true intelligence.
 Machine learning evolved with data-driven systems, focusing on learning from data
to automatically predict unknown outcomes.
Learning System
 Human Learning (Fig. 1.2a): Humans make decisions based on experience.
 Machine Learning (Fig. 1.2b): Machines create models from data patterns and
use these models for prediction, akin to human experience.

Figure 1.1: The Knowledge Pyramid

Figure 1.2: (a) A Learning System for Humans (b) A Learning System for Machine Learning

Data Quality and Learning

 The quality of data directly impacts the quality of experience and, ultimately, the
quality of learning systems.
Statistical Learning
 In statistical learning, the relationship between input x and output y is modeled as:
y = f(x)
o f is the learning function mapping inputs to outputs.
 In machine learning, this is referred to as the mapping of input to output.

Model in Machine Learning

 A model is a summary of raw data, structured into a representation for decision-making.
 Models can be in different forms, such as:
1. Mathematical equations: e.g., linear regression formula
2. Relational diagrams: like decision trees or graphs
3. Logical rules: like if/else rules (e.g., rule-based spam filters)
4. Clusters: for grouping data (e.g., k-means clustering)
Patterns vs. Models
 Pattern: Local; applicable to specific parts of the data.
 Model: Global; fits the entire dataset.
o Example: A model trained to detect spam can be used to predict whether an email is
spam or not.
Tom Mitchell’s Definition of Machine Learning
Tom Mitchell’s famous definition:
“A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.”
 Experience (E): Data used to learn (e.g., thousands of images to train an object
detection model).
 Task (T): The job the machine does (e.g., detecting objects in images).
 Performance measure (P): How well the machine performs the task (e.g., precision,
recall).
Example:
 Task (T): Detecting an object in images.
 Experience (E): Training data containing thousands of labeled images.

 Performance measure (P): The system is evaluated by how accurately it detects
objects, using metrics like precision and recall. Improvements can be made if the system
underperforms.
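
The performance measure P in the example above can be computed directly from predictions. The sketch below is a plain-Python illustration of precision and recall; the label lists are hypothetical and the function name is chosen only for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = object present, 0 = object absent.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Precision asks how many of the predicted detections were right; recall asks how many of the real objects were found. Tracking both as more training data (E) is added shows whether the system is actually learning.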
Human vs. Machine Experience
 Human Experience: Gained through learning, observing, imitation, and trial & error.
o Example: Learning how to ride a bike through practice and observation.
 Machine Experience: Gained through data processing and model building:
1. Data Collection: Gathering data from the environment (e.g., images, text, or
numbers).
2. Abstraction: Extracting key features from the data (e.g., identifying basic features of
an elephant: trunk, ears).
3. Generalization: Turning abstraction into an actionable form, like forming rules
(heuristics) from past experiences.
 Example: A self-driving car generalizes rules about stopping at red lights.
4. Heuristics: Actionable “rules of thumb” that guide decisions based on prior
experience.
 Example: A heuristic rule: If you see a red light, stop.
 Heuristics can sometimes fail but are typically effective.
Evaluation & Course Correction
 Heuristics: Often work, but sometimes fail due to limitations (it’s a general rule, not a
certainty).
o Example: If someone runs when sensing danger, it’s an automatic response based on
past experience (heuristics).
 Evaluation: Assesses the effectiveness of the model or heuristic. If the model
underperforms, we use evaluation measures to adjust and improve it.

1.3 MACHINE LEARNING IN RELATION TO OTHER FIELDS

Machine learning uses the concepts of Artificial Intelligence, Data Science, and Statistics.
It is the result of combining ideas from diverse fields.
1.3.1 Machine Learning and Artificial Intelligence
Figure 1.3: Relationship of AI with Machine Learning

Artificial Intelligence (AI) is a broad field focused on creating systems (called
"intelligent agents"), such as robots or other autonomous systems, that can
perform tasks autonomously. The early goal of AI was ambitious: to create intelligent
systems that could think and act like humans, focusing on logic and reasoning.
However, AI faced several challenges and periods of slow progress, called AI
winters, where enthusiasm and funding declined. AI’s resurgence came with the
rise of data-driven systems—models that learn by finding patterns in data. This
led to the development of Machine Learning (ML), a key branch of AI.
Machine Learning aims to extract patterns from data to make predictions.
Instead of explicitly programming systems for every possible scenario, ML
algorithms "learn" from examples (training data) and can handle new, unseen
situations. Machine learning includes various techniques like reinforcement
learning, where agents learn by interacting with their environment.

Relationship Between AI and Machine Learning:

 AI is the broader field aiming to create intelligent agents.
 ML is a subfield of AI that focuses on learning from data.
 Deep Learning, a subset of ML, uses neural networks inspired by the human brain
to build models. These networks consist of layers of interconnected units ("neurons") that
process information in a way that mimics how the brain works, and they are especially
useful for tasks like image and speech recognition.
1.3.2 Machine Learning, Data Science, Data Mining, and Data Analytics
Data Science is an umbrella term that covers various fields related to working
with data. It involves gathering, processing, analyzing, and drawing insights from
data. Machine learning starts with data, which makes it closely linked to data
science. Here’s how machine learning connects to related fields:
Big Data:
Big data is part of data science and refers to massive volumes of data generated by
companies like Facebook, Twitter, and YouTube. It deals with three key
characteristics:
1. Volume: The sheer amount of data being generated.
2. Variety: Data comes in many forms—text, images, videos, etc.
3. Velocity: The speed at which data is generated and processed.
Big data is essential for machine learning because many algorithms rely on large
datasets for training. For example, deep learning (a subfield of ML) uses big data
for tasks like image recognition and language translation.
Data Mining:
Data mining originally came from business applications. It’s like "mining" for
valuable information hidden in large datasets. While data mining and machine
learning overlap significantly, the distinction is:
 Data Mining: Focuses on discovering hidden patterns in data.
 Machine Learning: Uses those patterns to make predictions.
Data Analytics:
Another branch of data science is data analytics, which aims to extract useful
insights from raw data. There are different types of analytics, such as predictive
analytics, which forecasts future events based on past data. Machine learning
plays a major role in predictive analytics since many of its algorithms are used to
make predictions.
Pattern Recognition:
Pattern recognition is an engineering field that uses machine learning algorithms
to detect and classify patterns. While it’s often considered a specific application of
machine learning, it has its own identity as a field, dealing with tasks like facial
recognition or speech analysis.
These relations are summarized in Figure 1.4.
Figure 1.4: Relationship of Machine Learning with Other Major Fields

1.3.3 Machine Learning and Statistics

1. Statistics:
 Definition: A branch of mathematics focused on analyzing and interpreting data to
uncover patterns and relationships.
 Key Features:
o Hypothesis-driven: Starts with a hypothesis and tests it through experiments.
o Assumptions: Requires strict assumptions (e.g., normal distribution, independence of
variables).
o Mathematical Models: Uses complex equations (e.g., regression, ANOVA) to explain
data.
o Knowledge Required: Strong statistical background needed for analysis and
interpretation.
o Goal: Primarily concerned with verifying relationships and patterns in data.
2. Machine Learning (ML):
 Definition: A branch of AI focused on building models that learn from data to make
predictions or decisions without being explicitly programmed.
 Key Features:
o Data-driven: Focuses on learning from data patterns for predictions.
o Fewer Assumptions: Fewer restrictions on data (e.g., can handle non-normal data).
o Automation: Emphasizes using tools and algorithms to automate the learning process.
o Flexibility: Works well with large, complex datasets; adaptable to different scenarios.
o Goal: Makes predictions based on learned patterns, often without needing detailed
statistical knowledge.
1.4 TYPES OF MACHINE LEARNING
What does the word ‘learn’ mean? Learning, like adaptation, occurs as the result of
interaction of the program with its environment. There are four types of machine learning
as shown in Figure 1.5.

Before discussing the types of learning, it is necessary to discuss data.

Labelled and Unlabelled Data: Data is a raw fact. Normally, data is represented in the form
of a table. Data also can be referred to as a data point, sample, or an example. Each row of the table
represents a data point. Features are attributes or characteristics of an object. Normally, the
columns of the table are attributes. Out of all attributes, one attribute is important and is called a
label. Label is the feature that we aim to predict. Thus, there are two types of data – labelled and
unlabelled.
Labelled Data: To illustrate labelled data, let us take one example dataset called the Iris flower dataset
or Fisher’s Iris dataset. The dataset has 150 samples of Iris (50 from each class) with four attributes: the length and width
of sepals and petals. The target variable is called class. There are three classes: Iris setosa, Iris
virginica, and Iris versicolor.
The partial data of Iris dataset is shown in Table 1.1.
Table 1.1: Iris Flower Dataset

S.No.  Length of Sepal  Width of Sepal  Length of Petal  Width of Petal  Class
1.     5.5              4.2             1.4              0.2             Setosa
2.     7.0              3.2             4.7              1.4             Versicolor
3.     7.3              2.9             6.3              1.8             Virginica

A dataset need not always be numbers. It can be images or video frames. Deep neural
networks can handle images with labels. In the following Figure 1.6, the deep neural
network takes images of dogs and cats with labels for classification. In unlabelled data, there
are no labels in the dataset.

Figure 1.6: (a) Labelled Dataset (b) Unlabelled Dataset

1.4.1 Supervised Learning

Supervised algorithms use a labelled dataset. As the name suggests, there is a supervisor or
teacher component in supervised learning. A supervisor provides labelled data so that a
model can be constructed and then checked against test data.
In supervised learning algorithms, learning takes place in two stages. In layman's terms, during
the first stage, the teacher communicates the information to the student that the student is
supposed to master. The student receives the information and understands it. During this
stage, the teacher has no knowledge of whether the information is grasped by the student.
This leads to the second stage of learning. The teacher then asks the student a set of
questions to find out how much information has been grasped by the student. Based on
these questions, the student is tested, and the teacher informs the student about the
assessment. This kind of learning is typically called supervised learning.
Supervised learning has two methods:
1. Classification
2. Regression

Classification
Classification is a type of supervised learning.

 Supervised Learning: The algorithm learns from labeled data, where we know
the correct answers.
 Independent Variables: These are the input features, also called attributes.
 Dependent Variable (Label): This is the target we want to predict, and it’s in the form
of discrete categories or labels (e.g., dog or cat).
How Classification Works:
1. Training Stage:
o The algorithm is given a dataset that includes both the features (input) and their
correct labels (output).
o The algorithm learns from this data and creates a model.
2. Testing Stage:
o The model is tested on new, unseen data (input), and it predicts the label (output).
o For example, if you input an image of a dog or cat that the model hasn’t seen before, the
model will predict a label based on what it has learned.
Example:
In the Iris dataset, if you input data like (6.3, 2.9, 5.6, 1.8, ?), the model will predict the
missing label. This process of assigning a label to new data is called classification.
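
As a small sketch of this prediction step, the code below labels the new sample (6.3, 2.9, 5.6, 1.8, ?) using only the three rows of Table 1.1. It uses a 1-nearest-neighbour rule, chosen here purely for brevity; it is an assumption of this example and not one of the algorithms prescribed by the notes, and three training rows are far too few for real use.

```python
from math import dist

# The three labelled rows of Table 1.1 (four measurements, then the class).
training_rows = [
    ((5.5, 4.2, 1.4, 0.2), "Setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Versicolor"),
    ((7.3, 2.9, 6.3, 1.8), "Virginica"),
]


def predict_1nn(features, rows):
    """Return the label of the training row closest in Euclidean distance."""
    _, label = min(rows, key=lambda row: dist(features, row[0]))
    return label


# The unlabelled sample from the example above.
predicted = predict_1nn((6.3, 2.9, 5.6, 1.8), training_rows)
print(predicted)
```

The training stage here is trivial (the rows are simply stored); the testing stage is the call to `predict_1nn` on an unseen sample.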
Applications of Classification:
Image Recognition: Classifying images of animals, plants, or even medical conditions like
cancer.
Types of Classification Models:
Classification models can be grouped into two categories:
1. Generative Models: Focus on how the data is generated and its distribution (e.g., Naïve
Bayes).
2. Discriminative Models: Focus only on distinguishing between different classes (e.g.,
Support Vector Machines).
Key Classification Algorithms:
 Decision Tree
 Random Forest
 Support Vector Machines (SVM)
 Naïve Bayes
 Artificial Neural Networks (ANN) and Deep Learning (e.g., Convolutional Neural
Networks - CNN)

Regression Models
Regression is another type of supervised learning, similar to classification, but
instead of predicting categories (labels), it predicts continuous values, like
numbers.
Key Difference:
 Regression: Predicts continuous values (e.g., product sales, house prices).
 Classification: Predicts labels or categories (e.g., dog or cat).
How Regression Works:
In a regression model, we are trying to find a relationship between the
independent variable(s) (x) and the dependent variable (y).
For example, in Figure 1.8, the independent variable (x) is the number of weeks,
and the dependent variable (y) is product sales. The regression model fits a line
to the data, which can be used to predict future sales. The fitted line shown in
Figure 1.8 is:
Sales (y) = 0.66 × Week (x) + 0.54
 Here, 0.66 (the slope) and 0.54 (the intercept) are regression coefficients that the model learns from the data.
 To predict the sales for the 8th week, substitute x = 8 into
the formula and calculate the predicted sales (y).
Example:
For the 8th week: Sales = 0.66 × 8 + 0.54 = 5.28 + 0.54 = 5.82

This gives a predicted sales value of 5.82 for week 8.
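
How such coefficients can be learned from data is sketched below using ordinary least squares for a single input. The weekly data points are hypothetical: they are generated from the line quoted above so that the fit recovers the same slope and intercept, since the actual points of Figure 1.8 are not reproduced in these notes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: return (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on Sales = 0.66 * Week + 0.54.
weeks = [1, 2, 3, 4, 5]
sales = [0.66 * w + 0.54 for w in weeks]

slope, intercept = fit_line(weeks, sales)
week8_sales = slope * 8 + intercept  # prediction for the 8th week
print(round(slope, 2), round(intercept, 2), round(week8_sales, 2))
```

With real, noisy sales figures the points would not lie exactly on a line, and the same formulas would return the line that minimizes the squared prediction error.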

Similarities Between Regression and Classification:

 Both are supervised learning methods, meaning they require a labeled training
dataset.
 Both involve a training stage (where the model learns from data) and a testing stage
(where the model is used to make predictions on new data).
Main Difference:
 Regression: Predicts numbers (continuous values).
 Classification: Predicts categories (discrete values, like class labels).
One of the most common regression algorithms is linear regression, which fits a
straight line to the data.
Unsupervised Learning
Unsupervised learning is a type of learning where there is no supervisor or teacher guiding the process.
Instead, the algorithm learns by itself, discovering structure in the data.
How Unsupervised Learning Works:
 In this method, the algorithm is given data without any labels.
 The algorithm looks at the data and tries to find patterns or groupings on its own.
 The goal is to group similar objects together based on their characteristics.
Example of Unsupervised Learning:
Clustering
 Clustering is a common unsupervised learning technique.
 It groups objects into different clusters, where each cluster contains objects that are
similar to each other.
 The objects in one cluster are different from those in other clusters.

For example, if you have a set of images of dogs and cats, a clustering algorithm
can group them into two clusters, ideally one for dogs and one for cats,
without needing any labels to tell it which is which.
Applications of Clustering:
 Image Segmentation: Grouping parts of an image, like separating a region of interest
(e.g., identifying a tumor in a medical image).
 Gene Analysis: Finding groups of similar genes in a database.
In summary, unsupervised learning helps the algorithm discover patterns in data
without any explicit instructions. Cluster analysis and dimensionality reduction are key
types of unsupervised learning.

Some of the key clustering algorithms are:

• k-means algorithm
• Hierarchical algorithms
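
A minimal k-means sketch is given below to show the two alternating steps (assign points to the nearest centroid, then move each centroid to the mean of its points). The 2-D points and the choice of initial centroids are hypothetical; practical implementations usually pick the initial centroids randomly and run several restarts.

```python
from math import dist


def k_means(points, initial_centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical unlabelled points forming two well-separated groups.
points = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
centroids, clusters = k_means(points, initial_centroids=[points[0], points[4]])
print(centroids)
```

No labels are used anywhere: the grouping emerges only from distances between the points.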
Dimensionality Reduction
Dimensionality reduction algorithms are examples of unsupervised algorithms. They take
higher-dimensional data as input and output the data in a lower dimension by taking advantage
of the variance of the data. The task is to reduce the dataset to fewer features without losing
generality. The differences between supervised and unsupervised learning are listed
in the following Table 1.2.
Table 1.2: Differences between Supervised and Unsupervised Learning
S.No.  Supervised Learning               Unsupervised Learning
1.     There is a supervisor component   No supervisor component
2.     Uses labelled data                Uses unlabelled data
3.     Assigns categories or labels      Performs a grouping process such that similar objects will be in one cluster

1.4.2 Semi-supervised Learning

There are circumstances where the dataset has a huge collection of unlabelled data and
some labelled data. Labelling is a costly process and difficult for humans to perform.
Semi-supervised algorithms make use of the unlabelled data by assigning pseudo-labels to it. Then, the
labelled and pseudo-labelled datasets can be combined.
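
The pseudo-labelling step can be sketched as follows. This example assumes a nearest-labelled-point rule for assigning pseudo-labels and uses invented 2-D data; real semi-supervised methods typically use a trained model and keep only confident pseudo-labels.

```python
from math import dist


def pseudo_label(labelled, unlabelled):
    """Give each unlabelled point the label of its nearest labelled point."""
    pseudo = []
    for point in unlabelled:
        _, label = min(labelled, key=lambda item: dist(point, item[0]))
        pseudo.append((point, label))
    return pseudo


# Hypothetical data: two labelled points and three unlabelled ones.
labelled = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]
unlabelled = [(1.5, 2.0), (8.0, 8.5), (2.0, 0.5)]

pseudo_labelled = pseudo_label(labelled, unlabelled)
combined = labelled + pseudo_labelled  # enlarged training set for the next round
print(combined)
```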
1.4.3 Reinforcement Learning
Reinforcement learning is a type of machine learning where an agent learns by
interacting with its environment. The agent performs actions and receives feedback in
the form of rewards or penalties, and its goal is to maximize the total reward over time.
Key Concepts:
 Agent: The learner or decision-maker.
 Environment: The world the agent interacts with.
 Action: What the agent can do.
 Reward: Feedback given to the agent based on its actions (positive for good actions,
negative for bad actions).
The agent learns through trial and error, improving its strategy (called a policy) over
time to make better decisions.
Example of Reinforcement Learning:
Consider a robot learning to walk:
 The robot (agent) takes steps (actions) in its environment.
 If the robot falls, it receives a penalty (negative reward). If it moves forward without
falling, it receives a reward (positive reward).
 Over time, the robot adjusts its movements to maximize its forward motion and
minimize falling, effectively learning how to walk.
In summary, reinforcement learning is about learning from experience to make better
decisions in the future by maximizing rewards.
Consider the following example of a Grid game as shown in Figure 1.10.

Figure 1.10: A Grid Game

In this grid game, the gray tile indicates the danger, black is a block, and the tile with
diagonal lines is the goal. The aim is to start, say, from the bottom-left grid, using the actions
left, right, top and bottom to reach the goal state.
To solve this sort of problem, there is no pre-collected dataset. The agent interacts with the environment
to get experience. In the above case, the agent tries to create a model by simulating many
paths and finding rewarding paths. This experience helps in constructing a model.

1.5 CHALLENGES OF MACHINE LEARNING

Machine learning allows computers to solve certain types of problems much
better than humans, especially tasks involving computation. For instance,
computers can quickly calculate the square root of large numbers or win games
like chess and Go against professional players.
However, humans are still better than machines at tasks like recognition, though
modern machine learning systems, especially deep learning, are improving
rapidly. For example, machines can recognize human faces instantly. But there are
still challenges in machine learning, mainly due to the need for high-quality data.
Key Challenges in Machine Learning:
1. Well-Posed Vs Ill-Posed Problems:
o Machine learning works well with well-posed problems, where the problem is clearly
defined and has enough information to find a solution.
o In ill-posed problems, there may be multiple possible answers, making it hard to find
the correct one. For example, in a simple dataset (as shown in Table 1.3), several models
could fit the data (e.g., multiplication or division). To solve such problems, more data is
needed to narrow down the correct model.
Table 1.3: An Example

| Input (x1, x2) | Output (y) |
|----------------|------------|
| 1, 1           | 1          |
| 2, 1           | 2          |
| 3, 1           | 3          |
| 4, 1           | 4          |
| 5, 1           | 5          |

Can the model for this data be multiplication, that is, y = x1 * x2? Yes, it fits every row. But
it is equally true that the model may be y = x1 / x2 or y = x1 ^ x2. So, there are three functions
that fit the data.
This means that the problem is ill-posed. To solve this problem, one needs more examples to
check the model. Puzzles and games that do not have sufficient specification may become
ill-posed problems, and scientific computation has many ill-posed problems.
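The ambiguity can be checked directly. The sketch below tests the three candidate functions against the rows of Table 1.3, and then against one extra example with x2 not equal to 1. The extra row (2, 3, 6) is hypothetical and is added only to show how more data narrows down the model. Note that x1 ^ x2 in the text means "x1 raised to the power x2", which is written `**` in Python.

```python
# Table 1.3 as (x1, x2, y) rows.
DATA = [(1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 1, 4), (5, 1, 5)]

# Candidate models named in the text; "^" denotes exponentiation there.
CANDIDATES = {
    "y = x1 * x2": lambda x1, x2: x1 * x2,
    "y = x1 / x2": lambda x1, x2: x1 / x2,
    "y = x1 ^ x2": lambda x1, x2: x1 ** x2,
}

def consistent_models(rows):
    """Return the names of the candidate models that reproduce every row."""
    return [name for name, f in CANDIDATES.items()
            if all(f(x1, x2) == y for x1, x2, y in rows)]

print(consistent_models(DATA))                 # all three candidates fit
# Hypothetical extra example with x2 != 1 (not part of the source table):
print(consistent_models(DATA + [(2, 3, 6)]))   # only multiplication remains
```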
2. Need for Huge, Quality Data:
o Machine learning requires large amounts of high-quality data. The data must be
complete, without missing or incorrect values. Poor-quality data can lead to inaccurate
models.
3. High Computational Power:
o With the growth of Big Data, machine learning tasks require powerful computers with
specialized hardware like GPUs or TPUs to handle the high computational load. The
increasing complexity of tasks has made high-performance computing essential.
4. Complexity of Algorithms:
o Choosing the right machine learning algorithm, explaining how it works, applying it
correctly, and comparing different algorithms are now critical skills for data scientists.
This makes the selection and evaluation of algorithms a significant challenge.
5. Bias-Variance Trade-off:
o Overfitting: When a model performs well on training data but fails on test data, it’s
called overfitting. This means the model has learned the training data too well but lacks
generalization to new data.
o Underfitting: When a model fails to perform well on both training and test data, it’s
called underfitting. The model is too simple to capture the patterns in the data.
o Balancing between overfitting and underfitting is a major challenge for machine
learning algorithms.
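A small experiment makes the two failure modes visible. The sketch below uses made-up data (y is roughly 2x plus small fixed noise values) and three illustrative models: a constant predictor that is too simple, a straight-line fit, and a "memorizer" that returns the output of the nearest training point. The data and the model names are assumptions chosen for this illustration only.

```python
# Hypothetical data: y is roughly 2*x plus small fixed "noise" values.
NOISE = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1, 0.3, -0.3]
TRAIN = [(float(x), 2.0 * x + n) for x, n in enumerate(NOISE)]
TEST = [(x + 0.5, 2.0 * (x + 0.5)) for x in range(8)]   # unseen inputs

def mse(model, rows):
    """Mean squared error of a model on (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

def fit_constant(rows):
    """Too simple: always predicts the mean of y (tends to underfit)."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    """Least-squares straight line y = a*x + b."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    a = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x, _ in rows)
    return lambda x: a * x + (my - a * mx)

def fit_memorizer(rows):
    """Too flexible: returns the y of the nearest training point (tends to overfit)."""
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]

for name, fit in [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(TRAIN)
    print(f"{name:10s} train MSE={mse(model, TRAIN):.3f}  test MSE={mse(model, TEST):.3f}")
```

On this data the constant model is poor on both sets (underfitting). The memorizer has zero training error but a clearly larger test error (overfitting). The line is close on both.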
1.6 MACHINE LEARNING PROCESS
The emerging process model for data mining solutions for business organizations is
CRISP-DM. Since machine learning is like data mining, except for the aim, this process can
be used for machine learning. CRISP-DM stands for Cross Industry Standard Process – Data
Mining. This process involves six steps. The steps are listed below in Figure 1.11.

1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is
enough for giving the solution. This step also involves the formulation of the problem
statement for the data mining process.
2. Understanding the data – It involves the steps like data collection, study of the
characteristics of the data, formulation of hypothesis, and matching of patterns to the
selected hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the
raw data and preparation of data for the data mining process. The missing values may
cause problems during both training and testing phases. Missing data forces classifiers to
produce inaccurate results. This is a perennial problem for the classification models. Hence,
suitable strategies should be adopted to handle the missing data.
4. Modelling – This step plays a role in the application of data mining algorithm for the
data to obtain a model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined by
evaluating the accuracy of the classifier. The process of classification is a fuzzy issue.
For example, classification of emails requires extensive domain knowledge and requires
domain experts. Hence, performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining
algorithm to improve the existing process or for a new situation.

1.7 MACHINE LEARNING APPLICATIONS

Machine learning technologies are now used widely in different domains. Machine learning
applications are everywhere! One encounters many machine learning applications in day-to-day life.
Some applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the
words of documents are converted to sentiments like happy, sad, and angry, which are captured
effectively by emoticons. For movie reviews or product reviews, five stars or one star are automatically
attached using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchase recommendations. For
example, Amazon recommends related books or books bought by people who have the
same taste as you, and Netflix suggests shows or related movies of your taste. The
recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform tasks.
These chatbots are the result of machine learning technologies.
4. Technologies like Google Maps and those used by Uber are all examples of machine learning
which offer to locate and navigate shortest paths to reduce time.
The machine learning applications are enormous. The following Table 1.4 summarizes some of the
machine learning applications.
Table 1.4: Applications’ Survey Table

| No. | Problem Domain | Applications |
|-----|----------------|--------------|
| 1 | Business | Predicting the bankruptcy of a business firm |
| 2 | Banking | Prediction of bank loan defaulters and detecting credit card frauds |
| 3 | Image Processing | Image search engines, object identification, image classification, and generating synthetic images |
| 4 | Audio/Voice | Chatbots like Alexa, Microsoft Cortana. Developing chatbots for customer support, speech to text, and text to voice |
| 5 | Telecommunication | Trend analysis and identification of bogus calls, fraudulent calls and its callers, churn analysis |
| 6 | Marketing | Retail sales analysis, market basket analysis, product performance analysis, market segmentation analysis, and study of travel patterns of customers for marketing tours |
| 7 | Games | Game programs for Chess, GO, and Atari video games |
| 8 | Natural Language Translation | Google Translate, text summarization, and sentiment analysis |
| 9 | Web Analysis and Services | Identification of access patterns, detection of e-mail spams, viruses, personalized web services, search engines like Google, detection of promotion of user websites, and finding loyalty of users after web page layout modification |
| 10 | Medicine | Prediction of diseases, given disease symptoms as cancer or diabetes. Prediction of effectiveness of the treatment using patient history and chatbots to interact with patients like IBM Watson uses machine learning technologies. |
| 11 | Multimedia and Security | Face recognition/identification, biometric projects like identification of a person from a large image or video database, and applications involving multimedia retrieval |
| 12 | Scientific Domain | Discovery of new galaxies, identification of groups of houses based on house type/geographical location, identification of earthquake epicenters, and identification of similar land use |

Dr. Sudhamani M J, Professor, Dept. of CSE, RNSIT

Key Terms:
• Machine Learning – A branch of AI concerned with enabling machines to learn automatically without being
explicitly programmed.
• Data – A raw fact.
• Model – An explicit description of patterns in a data.
• Experience – A collection of knowledge and heuristics in humans and historical training data in case of
machines.
• Predictive Modelling – A technique of developing models and making a prediction of unseen data.
• Deep Learning – A branch of machine learning that deals with constructing models using neural
networks.
• Data Science – A field of study that encompasses capturing of data to its analysis covering all stages of
data management.
• Data Analytics – A field of study that deals with analysis of data.
• Big Data – A study of data that has characteristics of volume, variety, and velocity.

• Statistics – A branch of mathematics that deals with learning from data using statistical methods.
• Hypothesis – An initial assumption of an experiment.
• Learning – Adapting to the environment that happens because of interaction of an agent with the
environment.
• Label – A target attribute.
• Labelled Data – A data that is associated with a label.


• Unlabelled Data – A data without labels.

• Supervised Learning – A type of machine learning that uses labelled data and learns with the help of a
supervisor or teacher component.
• Classification Program – A supervised learning method that takes an unknown input and assigns a
label for it. In simple words, it finds the category or class of the input attributes.
• Regression Analysis – A supervised method that predicts continuous variables based on the input
variables.
• Unsupervised Learning – A type of machine learning that uses unlabelled data and groups the attributes
to clusters using a trial and error approach.
• Cluster Analysis – A type of unsupervised approach that groups the objects based on attributes so
that similar objects or data points form a cluster.
• Semi-supervised Learning – A type of machine learning that uses limited labelled and large unlabelled
data. It first labels unlabelled data using labelled data and combines it for learning purposes.
• Reinforcement Learning – A type of machine learning that uses agents and environment interaction for
creating labelled data for learning.
• Well-posed Problem – A problem that has well-defined specifications. Otherwise, the problem is called
ill-posed.
• Bias/Variance – Bias is the error that arises when a model is too simple to capture the patterns in the
data, which leads to underfitting. Variance is the error that arises when a model fits the training data so
closely that it fails to generalize to new data, which leads to overfitting.

• Model Deployment – A method of deploying machine learning algorithms to improve the existing
business processes for a new situation.


2.1 WHAT IS DATA?

 Data refers to raw facts that can be numbers, text, images, audio,
or video.
 In computer systems, these facts are encoded in bits, allowing
machines to process and store them.
 Directly interpretable data: Numbers or text, like "John is 25
years old."
 Diffused data: Data like images or videos that require computers
to interpret, like identifying objects in a photo.
Types of Data Sources
1. Flat files: Simple files like CSV or text files.
2. Databases: Systems that store structured data.
3. Data warehouses: Centralized repositories for large volumes of
data.
Operational vs. Non-operational Data
 Operational data: Data generated during regular business
processes. Example: daily sales figures.
 Non-operational data: Data used for strategic decision-making,
such as past sales data to predict future trends.
Data vs. Information
 Data alone is meaningless until it is processed to create
information.
o Example: A list of numbers is just data, but labeling it as "heights of
students" gives it context, turning it into information.
 Information shows patterns, relationships, and associations.
o Example: Analyzing sales data can reveal which products sell the
most.
Big Data: Elements and Characteristics
What is Big Data?

 Small data can be processed using regular computers.

 Big Data is data that exceeds the capacity of standard computers
and requires specialized tools.
The 6 Vs of Big Data
1. Volume:
o Refers to the size of data.
o Big Data is often measured in petabytes (PB) or exabytes (EB),
much larger than the gigabytes or terabytes of traditional data.
2. Velocity:
o The speed at which data is generated and processed.
o Thanks to IoT devices and the Internet, data arrives rapidly, often
in real-time.
3. Variety:
o The diversity of data formats:
 Form: Data comes as text, audio, video, graphs, etc.
 Function: Data from sources like conversations, transactions, or
archives.
 Source: Data can come from public sources, social media, or
multimodal sources (combining different types).
4. Veracity:
o Refers to the accuracy and trustworthiness of data.
o Errors like technical glitches or human mistakes can affect the
reliability of data, making veracity crucial.
5. Validity:
o The relevance of data for a particular purpose, ensuring it is
accurate and fit for decision-making.
6. Value:
o The usefulness of data based on the insights and information it
provides, helping organizations make better decisions.
The data quality of the numeric attributes is determined by
factors like precision, bias, and accuracy.
 Precision is defined as the closeness of repeated measurements.
Often, standard deviation is used to measure the precision.
 Bias is a systematic error that results from erroneous assumptions of the
algorithms or procedures.
 Accuracy is the degree of measurement of errors that refers to the
closeness of measurements to the true value of the quantity. Normally,
the significant digits used to store and manipulate indicate the
accuracy of the measurement.
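As a small illustration of these three factors, the sketch below takes a set of repeated readings of a quantity whose true value is assumed to be known. The readings and the reference value are hypothetical. Precision is measured by the standard deviation, bias by the difference between the mean reading and the true value, and the largest absolute error is used here as one simple indicator of accuracy.

```python
import statistics

TRUE_VALUE = 50.0                            # assumed known reference value
readings = [50.8, 51.1, 50.9, 51.0, 51.2]    # hypothetical repeated measurements

precision = statistics.stdev(readings)                     # spread of repeated measurements
bias = statistics.mean(readings) - TRUE_VALUE              # systematic offset from the true value
worst_error = max(abs(r - TRUE_VALUE) for r in readings)   # simple accuracy indicator

print(f"precision (std dev) = {precision:.3f}")
print(f"bias                = {bias:.3f}")
print(f"largest error       = {worst_error:.3f}")
```

These readings are precise (they agree closely with each other) but not accurate, because all of them are shifted by about one unit from the true value.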

2.1.1 Categories of Big Data

In Big Data, data can be categorized into three types: structured data,
unstructured data, and semi-structured data. Each type has its own
characteristics, formats, and storage methods.
1. Structured Data
Definition: Structured data is organized and stored in a
predefined format, such as a table in a database. This data is easy
to search, retrieve, and analyze using tools like SQL.
Types of Structured Data:
 Record Data:
o A dataset consists of a collection of measurements.
o Rows (entities, cases, or records) represent objects.
o Columns (attributes, features, or fields) represent measurements
for each object.
o A label refers to individual observations in the dataset.
 Data Matrix:
o A type of record data where all attributes are numeric.

o Data is represented as points in a multidimensional space where
each attribute represents a dimension.
o Matrix operations can be applied to analyze this data.
 Graph Data:
o Represents relationships between objects.
o Example: In a web graph, nodes are web pages, and edges
(hyperlinks) connect them.
 Ordered Data:
o Objects have attributes with an implicit order.
o Examples of ordered data:
 Temporal data: Attributes associated with time, e.g., customer
purchase patterns during festivals.
 Sequence data: Sequence of elements without timestamps, e.g.,
DNA sequences (A, T, G, C).
 Spatial data: Related to locations or positions, e.g., maps where
points relate to geographical locations.
2. Unstructured Data
Unstructured data does not have a predefined organizational
format. This type of data includes multimedia (video, image,
audio) as well as text documents, blogs, and social media data.
 Examples of unstructured data include:
o Videos on platforms like YouTube.
o Images and photos.
o Audio recordings, such as podcasts or songs.
o Text documents, blogs, and posts from social media.
Key Point: It is estimated that around 80% of all data is
unstructured, making it a large and significant portion of Big
Data.
3. Semi-Structured Data
Semi-structured data falls between structured and
unstructured data. While it does not conform to a strict
structure, it contains tags or markers that make it easier to
organize.
 Examples of semi-structured data include:
o XML/JSON files: Contain data with embedded tags or fields.
o RSS feeds: Often follow a hierarchical structure, but not as rigid as
a database.
o Hierarchical data: Data that follows a parent-child relationship,
like in a directory tree.
2.1.2 Data Storage and Representation
Once the dataset is assembled, it must be stored in a structure that is
suitable for data analysis. The goal of data storage management is to
make data available for analysis. There are different approaches to
organize and manage data in storage files and systems, from flat files to
data warehouses. Some of them are listed below:

Flat Files: These are the simplest and most commonly available data
sources. They are also the cheapest way of organizing the data. Flat
files are files where data is stored in plain ASCII or EBCDIC format.
Minor changes of data in flat files affect the results of the data mining
algorithms.
Hence, a flat file is suitable only for storing a small dataset and is not
desirable if the dataset becomes larger.
Some of the popular spreadsheet formats are listed below:
• CSV files – CSV stands for comma-separated value files where the
values are separated by commas. These are used by spreadsheet and
database applications. The first row may have attributes and the rest
of the rows represent the data.
• TSV files – TSV stands for Tab separated values files where values
are separated by Tab. Both CSV and TSV files are generic in nature and
can be shared. There are many tools like Google Sheets and Microsoft
Excel to process these files.
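Both formats can also be read programmatically. The sketch below uses Python's standard csv module on two small in-memory samples. The sample rows and the helper name read_delimited are made up for this illustration. The first row is treated as the attribute names, as described above.

```python
import csv
import io

# Hypothetical sample data: the same two records in CSV and TSV form.
CSV_TEXT = "name,age\nJohn,25\nMary,31\n"
TSV_TEXT = "name\tage\nJohn\t25\nMary\t31\n"

def read_delimited(text, delimiter):
    """Parse delimited text whose first row holds the attribute names."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_delimited(CSV_TEXT, ",")
tsv_rows = read_delimited(TSV_TEXT, "\t")

print(csv_rows[0])            # first record as a dictionary of attribute -> value
print(csv_rows == tsv_rows)   # both files describe the same records
```

Note that every value is read as text. Numeric attributes such as age have to be converted explicitly before analysis.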
Database System: It normally consists of database files and a
database management system (DBMS). Database files contain original
data and metadata. DBMS aims to manage data and improve operator
performance by including various tools like database administrator,
query processing, and transaction manager. A relational database
consists of sets of tables. The tables have rows and columns. The
columns represent the attributes and rows represent tuples. A tuple
corresponds to either an object or a relationship between objects. A
user can access and manipulate the data in the database using SQL.
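A minimal sketch of these ideas using Python's built-in sqlite3 module is given below. The table name, the column names and the three tuples are hypothetical. The table has columns (attributes) and rows (tuples), and SQL is used both to insert and to query the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, height_cm REAL)")

# Each inserted row is one tuple of the relation (hypothetical values).
conn.executemany(
    "INSERT INTO student (name, height_cm) VALUES (?, ?)",
    [("Asha", 158.0), ("Ravi", 172.5), ("Meena", 165.0)],
)

# Query the table with SQL: names of students taller than 160 cm.
tall = conn.execute(
    "SELECT name FROM student WHERE height_cm > ? ORDER BY name", (160,)
).fetchall()
print(tall)
conn.close()
```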

Different types of databases are listed below:

1. A transactional database is a collection of transactional records.
Each record is a transaction. A transaction may have a time stamp,
identifier and a set of items, which may have links to other tables.
Normally, transaction databases are created for performing
association analysis that indicates the correlation among the items.
2. A time-series database stores time-related information like log files
where data is associated with a time stamp. This data represents the
sequences of data, which represent values or events obtained over a
period (for example, hourly, weekly or yearly) or repeated time span.
Observing the sales of a product continuously may yield time-series data.
3. Spatial databases contain spatial information in a raster or vector
format. Raster formats are either bitmaps or pixel maps. For example,
images can be stored as a raster data. On the other hand, the vector
format can be used to store maps as maps use basic geometric
primitives like points, lines, polygons and so forth.
World Wide Web (WWW): It provides a diverse, worldwide online
information source.
The objective of data mining algorithms is to mine interesting patterns
of information present in WWW.
XML (eXtensible Markup Language): It is a data format that is both human
and machine interpretable and can be used to represent data that
needs to be shared across platforms.
Data Stream: It is dynamic data, which flows in and out of the
observing environment. Typical characteristics of a data stream are
huge volume of data, dynamic, fixed order movement, and real-time
constraints.
RSS (Really Simple Syndication): It is a format for sharing instant
feeds across services.
JSON (JavaScript Object Notation): It is another useful data
interchange format that is often used for many machine learning
algorithms.
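As a brief illustration of JSON as an interchange format, the sketch below parses a small record with Python's standard json module and writes it back to text. The record itself is hypothetical.

```python
import json

# Hypothetical record in JSON form: field names act as embedded tags.
record_text = '{"name": "John", "age": 25, "tags": ["patient", "new"]}'

record = json.loads(record_text)                  # JSON text -> Python dictionary
print(record["name"], record["age"] + 1)          # fields keep their types (text, number, list)

round_trip = json.dumps(record, sort_keys=True)   # Python dictionary -> JSON text
print(round_trip)
```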

2.2 BIG DATA ANALYTICS AND TYPES OF ANALYTICS

The primary aim of data analysis is to assist business organizations to
take decisions. For example, a business organization may want to
know which is the fastest selling product, in order to plan its marketing
activities. Data analysis is an activity that takes the data and generates
useful information and insights for assisting the organizations.
Data analysis and data analytics are terms that are often used
interchangeably. However, there is a
subtle difference. Data analytics is a general term and data analysis is
a part of it. Data analytics refers to the process of data collection,
preprocessing and analysis. It deals with the complete cycle of data
management. Data analysis is just analysis and is a part of data
analytics. It takes historical data and does the analysis.
Data analytics, instead, concentrates more on the future and helps in
prediction.
There are four types of data analytics:
1. Descriptive analytics
2. Diagnostic analytics
3. Predictive analytics
4. Prescriptive analytics

Descriptive Analytics
Descriptive analytics is about summarizing the main features of the
data you've collected. It tells you what has happened by using
historical data and statistical techniques. The goal is to organize,
describe, and present this data in an understandable way. Think of it
as a report that explains "what is" without drawing any conclusions
about why it happened.
Example: Imagine a store that collects data on monthly sales.
Descriptive analytics would summarize this data by calculating the
average sales, total revenue, or the most popular product in a given
month.
Key Point: It doesn't explain why the sales were high or low—just
tells you what the data shows.
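The store example can be sketched as follows. The sales records, products and prices are hypothetical. The code only summarizes what happened (total revenue, average units sold per month, most popular product) and makes no attempt to explain why.

```python
import statistics
from collections import Counter

# Hypothetical sales records: (month, product, units sold, unit price)
SALES = [
    ("Jan", "pen", 120, 1.5),
    ("Jan", "notebook", 40, 3.0),
    ("Feb", "pen", 90, 1.5),
    ("Feb", "notebook", 70, 3.0),
    ("Mar", "pen", 150, 1.5),
    ("Mar", "notebook", 50, 3.0),
]

units_by_month = Counter()
units_by_product = Counter()
for month, product, units, price in SALES:
    units_by_month[month] += units
    units_by_product[product] += units

total_revenue = sum(units * price for _, _, units, price in SALES)
average_monthly_units = statistics.mean(list(units_by_month.values()))
top_product, top_units = units_by_product.most_common(1)[0]

print(f"total revenue         = {total_revenue:.2f}")
print(f"average units / month = {average_monthly_units:.1f}")
print(f"most popular product  = {top_product} ({top_units} units)")
```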
Diagnostic Analytics
Diagnostic analytics answers the question "Why did this happen?"
It's about understanding the root cause of an event. By examining the
data closely, we look for patterns, trends, and relationships that
explain the cause of an outcome.
Example: If the store's sales drop one month, diagnostic analytics
would investigate why the drop happened. Maybe it’s due to bad
weather, a competitor’s sale, or a new product that didn’t perform
well. The analysis focuses on finding and explaining the reasons
behind the drop.
Key Point: It's all about cause and effect—identifying the reasons
behind the data patterns.
Predictive Analytics
Predictive analytics looks into the future and answers the question
"What will happen?" Using historical data and advanced algorithms
(like machine learning), it predicts future trends and outcomes.
Example: The store uses data from previous years to predict what the
sales will be in the upcoming holiday season. Algorithms analyze
patterns like past holiday sales, customer behavior, and current
market trends to make predictions.
Key Point: It focuses on forecasting future events based on current
and past data.
Prescriptive Analytics
Prescriptive analytics goes a step further and asks "What should we
do?" It not only predicts the future but also recommends actions to
take. This type of analytics provides decision-making support by
suggesting the best course of action to achieve desired outcomes.
Example: After predicting that sales will be low in the next quarter,
prescriptive analytics suggests specific actions the store can take, such
as launching a promotion, adjusting prices, or stocking more popular
products. This helps businesses make better decisions and minimize
risks.
Key Point: It’s all about decision-making—helping businesses choose
the best possible actions based on data.

2.3.1 Data Collection

The first task in assembling a dataset is the collection of data. It is often
estimated that most of the time is spent on the collection of good quality
data. Good quality data yields a better result. It is often difficult to
characterize ‘good data’. ‘Good data’ is data that has the following
properties:
1. Timeliness – The data should be current and not stale or
obsolete.
2. Relevancy – The data should be relevant and ready for the
machine learning or data mining algorithms. All the necessary
information should be available and there should be no bias in
the data.
3. Knowledge about the data – The data should be understandable
and interpretable, and should be self-sufficient for the required
application as desired by the domain knowledge engineer.

Broadly, data sources can be classified as open/public data, social
media data and multimodal data.
1. Open or public data source – It is a data source that does not have
any stringent copyright rules or restrictions. Its data can be used for
many purposes. Government census data is a good example of open
data. Other examples include:
• Digital libraries that have huge amount of text data as well as
document images
• Scientific domains with a huge collection of experimental data like
genomic data and biological data
• Healthcare systems that use extensive databases like patient
databases, health insurance data, doctors’ information, and
bioinformatics information
2. Social media – It is the data that is generated by various social
media platforms like Twitter, Facebook, YouTube, and Instagram. An
enormous amount of data is generated by these platforms.
3. Multimodal data – It includes data that involves many modes such
as text, video, audio and mixed types. Some of them are listed below:
• Image archives contain large image databases along with numeric
and text data
• The World Wide Web (WWW) has huge amount of data that is
distributed on the Internet.
2.3.2 Data Pre-processing
Data Cleaning is the process of detecting and correcting (or
removing) errors and inconsistencies in data to improve its quality
before applying machine learning or data mining techniques. In the
real world, raw data is often ‘dirty’, meaning it contains errors,
missing information, or inconsistencies that can affect the results of
the analysis.

Common Problems with Dirty Data:

1. Incomplete Data: When certain values are missing from the
dataset.
2. Inaccurate Data: Data that has incorrect values or errors.
3. Outliers: Data points that are significantly different from the rest
of the data, often due to errors or unusual circumstances.
4. Missing Values: Data entries where certain attributes are not
provided.
5. Inconsistent Values: Mismatched or incorrectly formatted data
values.
6. Duplicate Data: When the same data appears multiple times,
unnecessarily.
Example of Dirty Data:
Let’s refer to the table of patient data (from the image) to
explain common data issues:

Identifying the Problems:

1. Incomplete Data:
o For patients John, Andre, and Raju, the Date of Birth (DoB) is
missing. This is an example of missing values.
2. Inaccurate Data:
o David's age is recorded as 5, but his DoB is 10/10/1980, which
makes his real age much older than 5. This is inconsistent data.
o Raju's age is recorded as 136, which is not realistic. This might be
a typographical error or an outlier.
3. Outliers:
o Raju’s age of 136 is an outlier, as it is an unrealistic value when
compared to normal human lifespans. Outliers are often caused by
data entry errors.
4. Noisy Data:
o John’s salary is recorded as -1500, which is not possible. Salary
cannot be negative, making this an example of noisy data.
o The entry for David’s salary is simply blank (" "), which is another
instance of missing data.
5. Inconsistent Values:
o In the salary column, Andre and Raju both have ‘Yes’ recorded,
which doesn’t make sense in the context of salary data. A salary should
be a numeric value, not a text response.
How to Address These Issues?
1. Missing Data:
o Ignore the Tuple: If a lot of values are missing in a row, you may
choose to ignore or remove that row from the dataset.
o Fill Values Manually: Domain experts can manually fill missing
values, but this is time-consuming.
o Use Global Constants: Fill missing values with a placeholder like
‘Unknown’ or ‘0’.
o Use Average/Mean Values: Replace missing numeric values (like
salary) with the average value of that column.
o Prediction Techniques: Machine learning algorithms can predict
missing values based on patterns in other data (e.g., using decision
trees).
2. Inaccurate Data:
o Correct entries by referring to other reliable data sources or
consult domain experts. For example, David's age should be corrected
based on his actual DoB.
3. Handling Outliers:
o Investigate outliers to determine if they are errors or legitimate
data. Raju’s age of 136 may be a typo and can be corrected if the
correct age is known.
4. Noisy Data:
o Noisy data, like John's negative salary (-1500), can be corrected by
setting a minimum limit (e.g., salary cannot be below 0). In this case,
either correct or remove the invalid entry.
5. Inconsistent Values:
o Standardize the format for fields like salary. For Andre and Raju,
change the text entries (‘Yes’) to numeric values or fill in missing data
using estimation techniques.
Methods for Handling Missing Data:
1. Ignoring the Tuple: Discard rows with missing data (not ideal
when a lot of data is missing).
2. Filling Manually: Domain experts analyze and fill the missing
values.
3. Global Constant: Fill missing values with ‘Unknown’ or ‘None’.
4. Attribute Mean: Replace missing numerical values with the
average for that attribute.
5. Class-based Mean: Use the mean value of the same class or group
to fill missing data.
6. Most Probable Value: Predict missing values using machine
learning algorithms like decision trees.
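Four of these methods are simple enough to sketch directly: ignoring the tuple, a global constant, the attribute mean, and the class-based mean. The records, the "dept" grouping attribute and the function names below are hypothetical and are not the patient table discussed above. None marks a missing salary.

```python
import statistics

# Hypothetical records; None marks a missing salary.
ROWS = [
    {"name": "A", "dept": "sales", "salary": 3000},
    {"name": "B", "dept": "sales", "salary": None},
    {"name": "C", "dept": "sales", "salary": 5000},
    {"name": "D", "dept": "it", "salary": 8000},
    {"name": "E", "dept": "it", "salary": None},
]

def drop_missing(rows):
    """Ignore the tuple: discard rows whose salary is missing."""
    return [r for r in rows if r["salary"] is not None]

def fill_constant(rows, value="Unknown"):
    """Global constant: replace missing salaries with a placeholder."""
    return [{**r, "salary": value if r["salary"] is None else r["salary"]} for r in rows]

def fill_attribute_mean(rows):
    """Attribute mean: replace missing salaries with the mean of the known salaries."""
    mean = statistics.mean(r["salary"] for r in rows if r["salary"] is not None)
    return [{**r, "salary": mean if r["salary"] is None else r["salary"]} for r in rows]

def fill_class_mean(rows, key="dept"):
    """Class-based mean: use the mean salary of the same group.

    Assumes every group has at least one known salary.
    """
    groups = {}
    for r in rows:
        if r["salary"] is not None:
            groups.setdefault(r[key], []).append(r["salary"])
    means = {k: statistics.mean(v) for k, v in groups.items()}
    return [{**r, "salary": means[r[key]] if r["salary"] is None else r["salary"]} for r in rows]

print(len(drop_missing(ROWS)))
print(fill_constant(ROWS)[1]["salary"])
print(round(fill_attribute_mean(ROWS)[1]["salary"], 2))
print(fill_class_mean(ROWS)[1]["salary"], fill_class_mean(ROWS)[4]["salary"])
```

Each function returns new rows and leaves the original records unchanged, so the different strategies can be compared on the same data.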

Removal of Noisy or Outlier Data

In data analysis, noise refers to random errors or variations in
the data that can distort the results of analysis. Noise can affect
data accuracy and, if not removed, can lead to misleading
conclusions. Therefore, it's important to clean noisy data before
applying any analysis or machine learning algorithms.
What is Noise?
 Noise is random error or variance in measured values.
 It can appear as outliers, missing values, or inconsistent data.
 Noise reduction is an essential step in data cleaning to improve
the quality of analysis.
Techniques for Removing Noise:
One common method to remove noisy data is binning, which
organizes data into groups (bins) and then applies smoothing
techniques to remove noise. Binning methods can also be used
for data discretization, which reduces the number of values for
easier analysis.
Binning Method:
 Step 1: Sort the data in increasing order.
 Step 2: Divide the sorted data into equal-frequency bins (also
called buckets).
 Step 3: Apply smoothing techniques within each bin to reduce
noise.
Smoothing Techniques for Binning:
1. Smoothing by Means:
o Replace all values in the bin with the mean (average) of the bin
values.
Example:
o Given data: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}
o First, divide into bins of size 3:
 Bin 1: {12, 14, 19}
 Bin 2: {22, 24, 26}
 Bin 3: {28, 31, 34}
o Now apply smoothing by means (replace all values with the bin's
mean):
 Bin 1 (mean = 15): {15, 15, 15}
 Bin 2 (mean = 24): {24, 24, 24}
 Bin 3 (mean = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the mean of the
bin to smooth the data.
2. Smoothing by Medians:
o Replace all values in the bin with the median of the bin values (the
middle value when the data is sorted).
Example:
o Given the same data and bins:
 Bin 1 (median = 14): {14, 14, 14}
 Bin 2 (median = 24): {24, 24, 24}
 Bin 3 (median = 31): {31, 31, 31}
o Explanation: Each value in the bin is replaced by the median,
which reduces the effect of outliers or extreme values.
3. Smoothing by Bin Boundaries:
o Replace each value in the bin with the closest boundary value
(minimum or maximum value in the bin).
Module 1- Machine Learning (BCS602)

Example:
o Given the same data and bins:
 Bin 1 (boundary values: 12 and 19): {12, 12, 19}
 Bin 2 (boundary values: 22 and 26): {22, 22, 26}
 Bin 3 (boundary values: 28 and 34): {28, 34, 34}
o Explanation: For each bin, values are replaced by the closest
boundary value (either the minimum or maximum of that bin).
o Example: In Bin 1, the original data was {12, 14, 19}. The
boundaries are 12 and 19, so the value 14 is closer to 12, and it's
replaced by 12.

Why Use Binning to Remove Noise?

• Smoothing by Means: Reduces random noise by averaging the values within each bin.
• Smoothing by Medians: More robust against outliers than using means since medians are less sensitive to extreme values.
• Smoothing by Bin Boundaries: Reduces noise by forcing all values within the bin to adhere to the boundaries, creating a more consistent dataset.
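
The three smoothing techniques can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the last bin may be smaller if the data does not divide evenly, and a value equidistant from both boundaries is sent to the lower boundary (so 31 in Bin 3 maps to 28 here, whereas the worked example maps it to 34).

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        # Assumption: ties (equal distance to both boundaries) go to the lower one.
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

data = [12, 14, 19, 22, 24, 26, 28, 31, 34]
bins = equal_frequency_bins(data, 3)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```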

Data Integration and Data Transformations

Data integration involves routines that merge data from multiple sources into a single data source. Merging may introduce redundant data, so the main goal of data integration is to detect and remove the redundancies that arise from it. Data transformation routines perform operations like normalization to improve the performance of data mining algorithms. Transforming the data so that it can be processed can be considered a preliminary stage of data conditioning. In normalization, the attribute values are scaled to fit in a range (say 0–1). These techniques are often used with neural networks. Some of the normalization procedures used are:
1. Min-Max
2. z-Score
Min-Max Procedure It is a normalization technique where each value v of a variable V is normalized by subtracting the minimum value, dividing by the range, and rescaling to a new range, say 0–1. Often, neural networks require this kind of normalization. The formula to implement this normalization is given as:

$$v' = \frac{v - \min}{\max - \min} \times (\text{new max} - \text{new min}) + \text{new min} \qquad (2.1)$$

Here max − min is the range. Min and max are the minimum and maximum of the given data; new min and new max are the minimum and maximum of the target range, say 0 and 1.

Example 2.2: Consider the set: V = {88, 90, 92, 94}. Apply Min-Max
procedure and map the marks to a new range 0–1.
Solution: The minimum of the list V is 88 and the maximum is 94. The new min and new max are 0 and 1, respectively. The mapping can be done using Eq. (2.1) as:

For 88: (88 − 88)/(94 − 88) × (1 − 0) + 0 = 0
For 90: (90 − 88)/(94 − 88) × (1 − 0) + 0 = 2/6 ≈ 0.33
For 92: (92 − 88)/(94 − 88) × (1 − 0) + 0 = 4/6 ≈ 0.67
For 94: (94 − 88)/(94 − 88) × (1 − 0) + 0 = 1

So, it can be observed that the marks {88, 90, 92, 94} are mapped to the new values {0, 0.33, 0.67, 1}. Thus, the Min-Max normalized values lie between 0 and 1.
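
A minimal Python sketch of the Min-Max procedure applied to the marks of Example 2.2 is given below. The function name and its default arguments are illustrative assumptions, not part of any library.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, so the range is zero")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

marks = [88, 90, 92, 94]
normalized = min_max_normalize(marks)
print([round(x, 2) for x in normalized])
```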

z-Score Normalization This procedure works by taking the difference between the field value and the mean value, and by scaling this difference by the standard deviation of the attribute.

$$z = \frac{v - m}{s} \qquad (2.2)$$

Here, s is the standard deviation of the list V and m is the mean of the list V.
Example 2.3: Consider the mark list V = {10, 20, 30}, convert the marks to z-score.
Solution: The mean and sample standard deviation (s) values of the list V are 20 and 10, respectively. So the z-scores of these marks are calculated using Eq. (2.2) as:

z-score of 10 = (10 − 20)/10 = −1
z-score of 20 = (20 − 20)/10 = 0
z-score of 30 = (30 − 20)/10 = 1

Hence, the z-scores of the marks 10, 20, 30 are -1, 0 and 1, respectively.
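
The same calculation can be sketched with Python's standard `statistics` module. The helper name `z_scores` is an assumption for illustration; `stdev` computes the sample standard deviation (N − 1 in the denominator), which matches the value 10 used in Example 2.3.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (v - mean) / sample standard deviation for each value."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    if s == 0:
        raise ValueError("standard deviation is zero, so z-scores are undefined")
    return [(v - m) / s for v in values]

print(z_scores([10, 20, 30]))
```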

Data Reduction
Data reduction reduces the size of the data while producing the same (or nearly the same) analytical results. There are different ways in which data reduction can be carried out, such as data aggregation, feature selection, and dimensionality reduction.

2.4 DESCRIPTIVE STATISTICS

Descriptive statistics is the branch of statistics that summarizes and describes a dataset. It is purely descriptive: it does not draw inferences beyond the data at hand and is not concerned with machine learning algorithms or how they function.
Let us discuss descriptive statistics, starting with the fundamental concepts of data types.
Dataset and Data Types
A dataset can be assumed to be a collection of data objects. The data objects may be records, points, vectors, patterns, events, cases, samples or observations. These records contain many attributes. An attribute can be defined as the property or characteristic of an object. For example, consider the sample database shown in Table 2.2.

Every attribute should be associated with a value. This process is called measurement. The type of attribute determines the data types, often referred to as measurement scale types.
The data types are shown in Figure 2.1.

Broadly, data can be classified into two types:

1. Categorical or qualitative data
2. Numerical or quantitative data

Categorical or Qualitative Data The categorical data can be divided into two types. They are nominal type and ordinal type.
• Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are symbols and cannot be processed like numbers. For example, the average of a patient ID does not make any statistical sense. Nominal data provide labels only and have no ordering. Only operations like (=, ≠) are meaningful for these data. For example, the patient ID can be checked for equality and nothing else.
• Ordinal Data – It carries the same information as nominal data and also has a natural order. For example, Fever = {Low, Medium, High} is ordinal data. Certainly, low is less than medium and medium is less than high, irrespective of the actual values. Any order-preserving transformation can be applied to these data to get a new value.
Numeric or Quantitative Data It can be divided into two categories. They are interval type and ratio type.
• Interval Data – Interval data is numeric data for which the differences between values are meaningful. For example, there is a difference between 30 degrees and 40 degrees. The only permissible operations are + and -.
• Ratio Data – For ratio data, both differences and ratios are meaningful. The difference between ratio and interval data is the position of zero on the scale: ratio data have a true zero, whereas the zero of an interval scale is arbitrary. For example, take the Centigrade-Fahrenheit conversion. The zeroes of both scales do not match. Hence, temperatures on these scales are interval data, not ratio data.

Another way of classifying the data is to classify it as:

1. Discrete value data
2. Continuous data
Discrete Data This kind of data is recorded as integers. For example, the responses of a survey can be discrete data. An employee identification number such as 10001 is discrete data.
Continuous Data It can be fitted into a range and includes a decimal point. For example, age is continuous data. Though age appears to be discrete data, one may be 12.5 years old and it makes sense. Patient height and weight are also continuous data.
A third way of classifying the data is based on the number of variables used in the dataset. Based on that, the data can be classified as univariate data, bivariate data, and multivariate data. This is shown in Figure 2.2.

2.5 UNIVARIATE DATA ANALYSIS AND VISUALIZATION

Univariate analysis is the simplest form of statistical analysis. As the name indicates, the dataset has only one variable. A variable is sometimes also called a category. Univariate analysis does not deal with causes or relationships. The aim of univariate analysis is to describe data and find patterns. Univariate data description involves finding the frequency distributions, central tendency measures, dispersion or variation, and shape of the data.

2.5.1 Data Visualization

Let us consider some forms of graphs.

Bar Chart A bar chart (or bar graph) is used to display the frequency distribution of variables. Bar charts are used to illustrate discrete data. They can also show the counts of nominal data and help in comparing the frequency of different groups. The bar chart for students' marks {45, 60, 60, 80, 85} with Student ID = {1, 2, 3, 4, 5} is shown in Figure 2.3.

Pie Chart These are equally helpful in illustrating univariate data. The percentage frequency distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown in Figure 2.4.

It can be observed that the number of students with 22 marks is 2. The total number of students is 10. So, 2/10 × 100 = 20% of the pie is allotted for marks 22 in Figure 2.4.

Histogram It plays an important role in data mining for showing frequency distributions. The histogram for students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75, 76-100 is given in Figure 2.5. One can visually inspect from Figure 2.5 that the number of students in the range 76-100 is 2.

A histogram conveys useful information such as the nature of the data and its mode. The mode indicates the peak of the dataset. In other words, histograms can be used as charts to show frequency, skewness present in the data, and shape.

Dot Plots These are similar to bar charts. They are less cluttered than bar charts, as they illustrate the bars only with single points. The dot plot of English marks for five students with ID as {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage is that by visual inspection one can find out who got more marks.

2.5.2 Central Tendency

It is not practical to remember or inspect every value in a large dataset. Therefore, a condensation or summary of the data is necessary. This makes the data analysis easy and simple. One such summary is called central tendency. Central tendency can explain the characteristics of data, which further helps in comparison. Mass data tend to concentrate at certain values, normally in the central location. A value that represents this location is called a measure of central tendency (or average). Popular measures are mean, median and mode.
1. Mean – Arithmetic average (or mean) is a measure of central tendency that represents the ‘center’ of the dataset. Mathematically, the average of all the values in the sample (population) is denoted as x̄. Let x1, x2, … , xN be a set of ‘N’ values or observations, then the arithmetic mean is given as:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

For example, the mean of the three numbers 10, 20, and 30 is (10 + 20 + 30)/3 = 20.

• Weighted mean – Unlike the arithmetic mean, which gives equal weightage to all items, the weighted mean gives different importance to items, as the item importance varies. Hence, different weightage can be given to items. In the case of a frequency distribution, the mid values of the ranges are taken for computation. In weighted mean, the mean is computed by adding the products of each group's proportion and its group mean. It is mostly used when the sample sizes are unequal.
• Geometric mean – Let x1, x2, … , xN be a set of ‘N’ values or observations. Geometric mean is the Nth root of the product of N items. The formula for computing geometric mean is given as follows:

$$\text{Geometric mean} = \left(\prod_{i=1}^{N} x_i\right)^{1/N} = \sqrt[N]{x_1 \times x_2 \times \cdots \times x_N}$$

Here, N is the number of items and xi are the values. For example, if the values are 6 and 8, the geometric mean is given as $\sqrt{6 \times 8} = \sqrt{48} \approx 6.93$. In larger cases, computing the geometric mean directly is difficult. Hence, it is usually calculated using logarithms as:

$$\text{Geometric mean} = \exp\left(\frac{1}{N}\sum_{i=1}^{N} \ln x_i\right)$$

The problem with the mean is its extreme sensitivity to noise. Even a single extreme value can shift the mean substantially. Hence, for larger datasets, often the top 2% of values is chopped off and then the mean is calculated.


2. Median – The middle value in the distribution is called median. If the total number of items in the distribution is odd, then the middle value is the median. If it is even, the median is the average of the two middle values.

In the continuous case (grouped data), the median is given by the formula:

$$\text{Median} = L_1 + \frac{\frac{N}{2} - cf}{f} \times i$$

The median class is the class where the (N/2)th item is present. Here, i is the class interval of the median class, L1 is the lower limit of the median class, f is the frequency of the median class, and cf is the cumulative frequency of all classes preceding the median class.
3. Mode – Mode is the value that occurs most frequently in the dataset. In other words, the value that has the highest frequency is called mode.
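
The measures of central tendency described above can be sketched in Python as follows. The `statistics` functions are from the standard library; `weighted_mean` and `geometric_mean_via_logs` are hypothetical helper names, and the weights `[1, 1, 2]` are made-up illustration values.

```python
import math
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def geometric_mean_via_logs(values):
    """Geometric mean as exp of the average logarithm (values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

marks = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]
print(mean([10, 20, 30]))                      # arithmetic mean
print(weighted_mean([10, 20, 30], [1, 1, 2]))  # the last item counts twice
print(geometric_mean_via_logs([6, 8]))         # same as sqrt(48)
print(median(marks))
print(multimode(marks))                        # list of most frequent values
```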

2.5.3 Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called dispersion. Dispersion is represented in various ways such as range, variance, standard deviation, and standard error. These are second order measures. The most common measures of dispersion are listed below:
Range Range is the difference between the maximum and minimum values of the given list of data.
Standard Deviation The mean does not convey much more than a middle point. For example, the datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The difference between these two sets is the spread of the data. Standard deviation measures the typical distance from the mean of the dataset to each point (the square root of the average squared distance).
The formula for standard deviation is given by:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - m)^2} \qquad (2.8)$$

Here, N is the size of the population, xi is an observation or value from the population and m is the population mean. For the sample standard deviation, N – 1 is used instead of N in the denominator of Eq. (2.8).
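
As a small illustration, the two datasets mentioned above can be compared with the following sketch. The function name and the `sample` flag are assumptions made for this example; the flag switches between N and N − 1 in the denominator.

```python
import math

def std_dev(values, sample=False):
    """Standard deviation; divides by N - 1 when sample=True, else by N."""
    n = len(values)
    m = sum(values) / n
    squared = sum((x - m) ** 2 for x in values)
    return math.sqrt(squared / (n - 1 if sample else n))

a = [10, 20, 30]
b = [10, 50, 0]
print(std_dev(a), std_dev(b))   # same mean of 20, different spread
print(std_dev(a, sample=True))  # sample version with N - 1
```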

Quartiles and Inter Quartile Range It is sometimes convenient to subdivide the dataset using cut points. Percentiles describe the percentage of the data that lies at or below a given value: the kth percentile is the value Xi such that k% of the data lies at or below Xi. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3). Another measure that is useful for measuring dispersion is the Inter Quartile Range (IQR). The IQR is the difference between Q3 and Q1.
Interquartile range = Q3 – Q1 (2.9)
Outliers are normally the values that lie at least 1.5 × IQR above the third quartile or below the first quartile.
The interquartile range is equivalently defined by Q0.75 – Q0.25. (2.10)

Example 2.4: For patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: The median is in the fifth position. In this case, 24 is the median. The first quartile is the median of the scores below the median, i.e., {12, 14, 19, 22}. In this case, it is the average of the second and third values, that is, Q0.25 = (14 + 19)/2 = 16.5. Similarly, the third quartile is the median of the values above the median, that is {26, 28, 31, 34}. So, Q0.75 is the average of the seventh and eighth scores. In this case, it is (28 + 31)/2 = 59/2 = 29.5.
Hence, the IQR using Eq. (2.10) is:
= Q0.75 – Q0.25
= 29.5 – 16.5 = 13

Five-point Summary and Box Plots The median, quartiles Q1 and Q3, and minimum and maximum written in the order < Minimum, Q1, Median, Q3, Maximum > is known as five-point summary.
Example 2.5: Find the 5-point summary of the list {13, 11, 2, 3, 4, 8, 9}.
Solution: The sorted list is {2, 3, 4, 8, 9, 11, 13}. The minimum is 2 and the maximum is 13. The Q1, Q2 and Q3 are 3, 8 and 11, respectively. Hence, the 5-point summary is {2, 3, 8, 11, 13}, that is, {minimum, Q1, median, Q3, maximum}. Box plots are useful for describing the 5-point summary. The box plot for the set is given in Figure 2.7.
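
The quartile calculations of Examples 2.4 and 2.5 can be sketched as below. The sketch assumes the "median of halves" convention used in those examples (for an odd number of items the median itself is excluded from both halves); other quartile conventions give slightly different values. The function names are hypothetical.

```python
from statistics import median

def five_point_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle item so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def iqr_outlier_limits(q1, q3):
    """Values below the first limit or above the second are treated as outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
_, q1, med, q3, _ = five_point_summary(ages)
print(q1, med, q3, q3 - q1)
print(iqr_outlier_limits(q1, q3))
print(five_point_summary([13, 11, 2, 3, 4, 8, 9]))
```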

2.5.4 Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location of the dataset.
Skewness
The measures of direction and degree of symmetry are called measures of third order. Ideally, skewness should be zero, as in an ideal normal distribution. More often, the given dataset may not have perfect symmetry (consider the following Figure 2.8).

Generally, for a negatively skewed distribution, the median is more than the mean. The relationship between skew and the relative size of the mean and median can be summarized by a convenient numerical skew index known as the Pearson 2 skewness coefficient.

$$\text{Pearson 2 skewness coefficient} = \frac{3 \times (\text{mean} - \text{median})}{s}$$

Also, the following measure is more commonly used to measure skewness. Let X1, X2, …, XN be a set of ‘N’ values or observations, then the skewness can be given as:

$$\text{Skewness} = \frac{1}{N}\sum_{i=1}^{N}\frac{(x_i - m)^3}{s^3}$$

Here, m is the population mean and s is the population standard deviation of the univariate data. Sometimes, for bias correction, N - 1 is used instead of N.

Kurtosis
Kurtosis indicates the peakedness of the data. If the data has a high peak, then it indicates higher kurtosis and vice versa. Kurtosis is measured using the formula given below:

$$\text{Kurtosis} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^4 / (N - 1)}{s^4} \qquad (2.14)$$

It can be observed that N - 1 is used instead of N in the numerator of Eq. (2.14) for bias correction. Here, x̄ and s are the mean and standard deviation of the univariate data, respectively.
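
A minimal sketch of these shape measures is given below. It is an illustration under stated assumptions: the function names are hypothetical, the population forms (division by N) are used throughout, so the N − 1 bias correction mentioned in the text is not applied, and the data must not be all equal (otherwise s is zero).

```python
import math
from statistics import median

def _mean_and_population_std(values):
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((x - m) ** 2 for x in values) / n)
    return m, s

def skewness(values):
    """Average cubed deviation from the mean, scaled by s cubed."""
    m, s = _mean_and_population_std(values)
    return sum((x - m) ** 3 for x in values) / (len(values) * s ** 3)

def pearson_second_skewness(values):
    """3 * (mean - median) / s."""
    m, s = _mean_and_population_std(values)
    return 3 * (m - median(values)) / s

def kurtosis(values):
    """Average fourth-power deviation from the mean, scaled by s to the fourth."""
    m, s = _mean_and_population_std(values)
    return sum((x - m) ** 4 for x in values) / (len(values) * s ** 4)

print(skewness([10, 20, 30]))   # symmetric data
print(skewness([1, 1, 1, 10]))  # long right tail, positive skew
print(kurtosis([10, 20, 30]))
```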
Some of the other useful measures for describing the univariate dataset are mean absolute deviation (MAD) and coefficient of variation (CV).

Mean Absolute Deviation (MAD)

MAD is another dispersion measure and is less sensitive to outliers than the standard deviation. Normally, an outlier point is detected by computing its deviation from the median and dividing it by MAD. Here, the absolute deviation between the data and the mean is taken. Thus, the mean absolute deviation is given as:

$$\text{MAD} = \frac{1}{N}\sum_{i=1}^{N}|x_i - m|$$

Coefficient of Variation (CV)

Coefficient of variation is used to compare datasets with different units. CV is the ratio of standard deviation to mean (CV = s/m), and %CV is the coefficient of variation expressed as a percentage (s/m × 100).
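
Both measures can be sketched in a few lines of Python. The function names are assumptions for illustration, and the sample standard deviation (`statistics.stdev`) is used for CV; the text does not say which form of the standard deviation is intended.

```python
from statistics import mean, stdev

def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (mean must be non-zero)."""
    return stdev(values) / mean(values)

marks = [10, 20, 30]
print(mean_absolute_deviation(marks))
print(coefficient_of_variation(marks) * 100)  # %CV
```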

2.5.5 Special Univariate Plots

A convenient way to check the shape of the dataset is a stem and leaf plot. A stem and leaf plot is a display that helps us to know the shape and distribution of the data. In this method, each value is split into a ‘stem’ and a ‘leaf’. The last digit is usually the leaf and the digits to the left of the leaf mostly form the stem. For example, marks 45 are divided into stem 4 and leaf 5 in Figure 2.9. The stem and leaf plot for the English subject marks, say, {45, 60, 60, 80, 85} is given in Figure 2.9.

It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf. For the given English marks, the two students with 60 marks are shown in the stem and leaf plot as stem 6 with two leaves of 0. The normal Q-Q plot for marks x = [13 11 2 3 4 8 9] is given in Figure 2.10.
Module 2- Machine Learning (BCS602)

Module 2
Understanding Data
Bivariate and Multivariate data, Multivariate statistics, Essential mathematics for Multivariate data,
Overview hypothesis, Feature engineering and dimensionality reduction techniques, Basics of Learning
Theory: Introduction to learning and its types, Introduction computation learning theory, Design of
learning system, Introduction concept learning. Similarity-based learning: Introduction to Similarity or
instance based learning, Nearest-neighbour learning, weighted k- Nearest - Neighbour algorithm.

CHAPTER -2
2.6 BIVARIATE DATA AND MULTIVARIATE DATA
Bivariate Data involves two variables. Bivariate analysis deals with causes and relationships. The aim is to find relationships among data. Consider the following Table 2.3, with data of the temperature in a shop and sales of sweaters.

Here, the aim of bivariate analysis is to find relationships among variables. The relationships can then be
used in comparisons, finding causes, and in further explorations. To do that, graphical display of the data is
necessary. One such graph method is called scatter plot.

Scatter plot is used to visualize bivariate data. It is useful to plot two variables with or without nominal
variables, to illustrate the trends, and also to show differences. It is a plot between explanatory and response
variables. It is a 2D graph showing the relationship between two variables. Line graphs are similar to scatter
plots. The Line Chart for sales data is shown in Figure 2.12.

2.6.1 Bivariate Statistics

Covariance and Correlation are examples of bivariate statistics. Covariance is a measure of the joint variability of random variables, say X and Y. Generally, random variables are represented in capital letters. It is written as covariance (X, Y) or COV (X, Y) and is used to measure how two dimensions vary together. The formula for finding covariance for specific x and y is:

$$COV(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - E(X))(y_i - E(Y))$$

Here, xi and yi are data values from X and Y. E(X) and E(Y) are the mean values of xi and yi. N is the number of given data. Also, COV(X, Y) is the same as COV(Y, X).

If the given attributes are X = (x1, x2, … , xN) and Y = (y1, y2, … , yN), then the Pearson correlation coefficient, that is denoted as r, is given as: (σX, σY are the standard deviations of X and Y.)

$$r = \frac{COV(X, Y)}{\sigma_X \sigma_Y}$$
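
A short Python sketch of covariance and the Pearson correlation coefficient follows. The temperature and sales figures are made-up illustration values (they are not the data of Table 2.3), and the function names are hypothetical. Population forms (division by N) are used for both covariance and standard deviation.

```python
import math

def covariance(xs, ys):
    """Average product of the deviations of x and y from their means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    sigma_x = math.sqrt(covariance(xs, xs))  # COV(X, X) is the variance of X
    sigma_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sigma_x * sigma_y)

temperature = [5, 10, 15, 20, 25]      # hypothetical values
sweater_sales = [50, 40, 30, 20, 10]   # hypothetical values
print(covariance(temperature, sweater_sales))
print(pearson_r(temperature, sweater_sales))
```

In this made-up data, sales fall by a fixed amount for every rise in temperature, so the correlation is perfectly negative.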

2.7 MULTIVARIATE STATISTICS

In machine learning, almost all datasets are multivariate. Multivariate analysis is the analysis of more than two observable variables, and often thousands of measurements need to be taken for one or more subjects. Multivariate data has three or more variables. The aims of multivariate analysis are broader than those of bivariate analysis; they include regression analysis, factor analysis and multivariate analysis of variance.

Heatmap A heat map is a graphical representation of data where individual values are represented by
colors. Heat maps are often used in data analysis and visualization to show patterns, density, or intensity of
data points in a two-dimensional grid.
Example: Let's consider a heat map to display the average temperatures (in °C) across different regions in
a country over a week. Each cell in the heat map will represent a temperature for a specific region on a
specific day. This is useful to quickly identify trends, such as higher temperatures in certain regions or
specific days with unusual weather patterns. The color gradient (from blue to red) indicates the
temperature range: cooler colors represent lower temperatures, while warmer colors represent higher
temperatures.

Pairplot
Pairplot or scatter matrix is a data visualization technique for multivariate data. A scatter matrix consists of several pair-wise scatter plots of the variables of the multivariate data. A random matrix of three columns is chosen and the relationships between the columns are plotted as a pairplot (or scatter matrix) as shown in Figure 2.14.

2.8 ESSENTIAL MATHEMATICS FOR MULTIVARIATE DATA

Machine learning involves many mathematical concepts from the domain of Linear algebra, Statistics,
Probability and Information theory. The subsequent sections discuss important aspects of linear algebra
and probability.

2.8.1 Linear Systems and Gaussian Elimination for Multivariate Data

A linear system of equations is a group of equations with unknown variables. Let Ax = y, then the solution x is given as: x = y/A = A⁻¹y. This is true if A is invertible (for a single equation, if A is not zero). The logic can be extended to a set of n equations with ‘n’ unknown variables. It means if

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}$$

and y = (y1 y2 … yn), then the unknown variable x can be computed as: x = y/A = A⁻¹y

If there is a unique solution, then the system is called consistent independent. If there are infinitely many solutions, then the system is called consistent dependent. If there are no solutions because the equations are contradictory, then the system is called inconsistent.

For solving a large system of equations, Gaussian elimination can be used. The procedure for applying Gaussian elimination is given as follows:
1. Write the given matrix.
2. Append vector y to the matrix A. This matrix is called the augmented matrix.
3. Keep the element a11 as pivot and eliminate a21 in the second row using the row operation R2 − (a21/a11) × R1. Here, R1 and R2 are the 1st and 2nd rows and (a21/a11) is called the multiplier. The same logic can be used to remove ai1 in all other equations.
4. Repeat the same logic and reduce the matrix to echelon (upper triangular) form. Then, the last unknown variable is obtained as:

$$x_n = \frac{y_n}{a_{nn}}$$

5. Then, the remaining unknown variables can be found by back-substitution as:

$$x_i = \frac{1}{a_{ii}}\left(y_i - \sum_{j=i+1}^{n} a_{ij}\,x_j\right), \quad i = n-1, \ldots, 1$$

Here, aij and yi denote the entries of the reduced augmented matrix.

To facilitate the application of the Gaussian elimination method, the following row operations are applied:
1. Swapping the rows
2. Multiplying or dividing a row by a constant
3. Replacing a row by adding or subtracting a multiple of another row to it

These concepts are illustrated in Example 2.8.
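
The elimination and back-substitution steps can be sketched in plain Python as below. This is an illustrative sketch, not the worked example of the text: the 3 × 3 system is made up, the function name is hypothetical, and rows are swapped so that the largest available pivot is used (partial pivoting), which is one common way of applying the row-swap operation.

```python
def gaussian_elimination(a, y):
    """Solve a x = y for a square matrix a given as a list of rows."""
    n = len(a)
    # Build the augmented matrix [a | y] with float entries.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, y)]
    for col in range(n):
        # Row swap: bring the row with the largest pivot candidate to the top.
        pivot_row = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot_row][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot_row] = m[pivot_row], m[col]
        # Eliminate the entries below the pivot: R_r = R_r - multiplier * R_col.
        for r in range(col + 1, n):
            multiplier = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= multiplier * m[col][c]
    # Back-substitution, starting from the last unknown.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

# Hypothetical system: 2x + y - z = 8, -3x - y + 2z = -11, -2x + y + 2z = -3
solution = gaussian_elimination([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])
print(solution)
```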

2.8.2 Matrix Decomposition

It is often necessary to reduce a matrix to its constituent parts so that complex matrix operations can be performed. For example, in eigen decomposition, a symmetric matrix A can be decomposed as: A = Q Λ Q^T

where Q is the matrix of eigen vectors, Λ is the diagonal matrix of eigen values and Q^T is the transpose of matrix Q.

LU Decomposition
One of the simplest matrix decompositions is LU decomposition, where the matrix A is decomposed into two matrices: A = LU. Here, L is the lower triangular matrix and U is the upper triangular matrix. The decomposition can be done using the Gaussian elimination method discussed in the previous section. First, an identity matrix is augmented to the given matrix. Then, row operations and Gaussian elimination are applied to reduce the given matrix to get matrices L and U. Example 2.9 illustrates the application of Gaussian elimination to get LU.

Now, it can be observed that the first matrix is L, as it is the lower triangular matrix whose values are the multipliers used in the reduction of the equations above, such as 3, 3 and 2/3.
The second matrix is U, the upper triangular matrix whose values are the values of the reduced matrix obtained by Gaussian elimination.
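
The way the multipliers end up in L can be seen in the following sketch. It assumes no row swaps are needed (a zero pivot raises an error), uses a made-up 3 × 3 matrix rather than the matrix of Example 2.9, and the function name is hypothetical.

```python
def lu_decompose(a):
    """Return (L, U) with L unit lower triangular and U upper triangular, A = L U."""
    n = len(a)
    u = [[float(v) for v in row] for row in a]
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        if u[col][col] == 0:
            raise ValueError("zero pivot: row swaps (pivoting) would be needed")
        for r in range(col + 1, n):
            multiplier = u[r][col] / u[col][col]
            l[r][col] = multiplier  # the multiplier is stored in L
            for c in range(col, n):
                u[r][c] -= multiplier * u[col][c]
    return l, u

lower, upper = lu_decompose([[2, 1, 1], [4, 3, 3], [8, 7, 9]])
print(lower)  # multipliers 2, 4 and 3 appear below the diagonal
print(upper)
```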

Introduction to Machine Learning and Probability/Statistics

• Importance: Machine learning relies heavily on statistics and probability to make predictions and analyze data.
• Statistics in ML: Key for understanding data patterns, measuring relationships, and quantifying uncertainties.

Probability Distributions

• Definition: A probability distribution describes the likelihood of various outcomes for a variable X.
• Types:
o Discrete Probability Distributions: For countable events (e.g., binomial, Poisson).
o Continuous Probability Distributions: For measurable events on a continuum (e.g., normal, exponential).

Continuous Probability Distributions

1. Normal Distribution (Gaussian Distribution)

• Shape: Bell curve, symmetric around the mean.
• Characteristics: Defined by mean μ and standard deviation σ.
• Probability Density Function (PDF):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

• Applications: Common in natural data (e.g., heights, exam scores).
• Z-score: Standardizes data points. Z = (X − μ)/σ

2. Uniform Distribution (Rectangular Distribution)

• Definition: Equal probability for all outcomes within range [a, b].
• PDF: f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.

3. Exponential Distribution

• Definition: Models time between events in a Poisson process.
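Discrete Probability Distributions

1 Binomial Distribution

• Definition: For trials with two outcomes (success/failure).
• Formula for Probability of k Successes in n Trials (p is the probability of success in one trial):

$$P(X = k) = \binom{n}{k}\, p^k (1 - p)^{n-k}$$

2 Poisson Distribution

• Definition: Models the number of events in a fixed interval of time.
• Probability Mass Function (PMF), where λ is the average number of events in the interval:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

3 Bernoulli Distribution

• Definition: Models a single trial with two outcomes (success/failure).
• Probability Mass Function (PMF):

$$P(X = x) = p^x (1 - p)^{1-x}, \quad x \in \{0, 1\}$$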
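
The standard density and mass functions of the distributions listed above can be evaluated with a few lines of Python. This is a sketch using only the `math` module; the function names and the rate symbol `lam` are illustrative choices.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma  # the z-score of x
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def uniform_pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def exponential_pdf(x, lam):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def bernoulli_pmf(x, p):
    if x not in (0, 1):
        raise ValueError("a Bernoulli outcome must be 0 or 1")
    return p if x == 1 else 1 - p

print(normal_pdf(0.0))          # peak of the standard normal curve
print(binomial_pmf(2, 4, 0.5))  # 2 successes in 4 fair trials
print(poisson_pmf(0, 2.0))      # no events when 2 are expected on average
```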

Density Estimation

• Goal: Estimate the probability density function (PDF) of data.
• Types:
o Parametric Density Estimation: Assumes a known distribution (e.g., Gaussian) and estimates parameters.
o Non-Parametric Density Estimation: Does not assume a fixed distribution (e.g., Parzen window, k-Nearest Neighbors).

Parametric Density Estimation

1 Maximum Likelihood Estimation (MLE)

• Definition: A method for estimating the parameters of a distribution by maximizing the likelihood function.
• Likelihood Function: For independent samples x1, … , xN, L(θ) = p(x1 | θ) × … × p(xN | θ). MLE chooses the parameter θ that maximizes L(θ).
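
As an illustration, for a Gaussian the likelihood is maximized in closed form by the sample mean and the variance computed with N in the denominator. The sketch below uses hypothetical function names and the small sample {10, 20, 30}; it compares the log-likelihood at the MLE with a nearby parameter setting.

```python
import math

def gaussian_mle(samples):
    """Closed-form MLE of a Gaussian: sample mean and variance (divide by N)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

def gaussian_log_likelihood(samples, mu, var):
    """Log of the product of Gaussian densities over all samples."""
    n = len(samples)
    squared = sum((x - mu) ** 2 for x in samples)
    return -0.5 * n * math.log(2 * math.pi * var) - squared / (2 * var)

data = [10, 20, 30]
mu_hat, var_hat = gaussian_mle(data)
print(mu_hat, var_hat)
print(gaussian_log_likelihood(data, mu_hat, var_hat))
print(gaussian_log_likelihood(data, mu_hat + 1, var_hat))  # lower than at the MLE
```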

Gaussian Mixture Model (GMM) and Expectation-Maximization (EM) Algorithm

• GMM: A probabilistic model assuming data is generated from a mixture of Gaussian distributions.
• EM Algorithm:
o E-Step: Using the current parameters, estimate the probability that each data point belongs to each component (the latent variable).
o M-Step: Re-estimate the parameters using MLE, weighting each data point by those probabilities.
• Iteration: Repeat until convergence.

Non-Parametric Density Estimation Methods

1 Parzen Window

• Definition: A non-parametric technique that estimates the PDF based on local samples.
• Example: Uses a kernel function like Gaussian around each data point.
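
A one-dimensional Parzen window estimate with a Gaussian kernel can be sketched as follows. The function names, the sample values and the window width `h = 2.0` are assumptions chosen for illustration; in practice the width has to be tuned to the data.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def parzen_density(x, samples, h):
    """Average of Gaussian kernels of width h centred on each sample."""
    if h <= 0:
        raise ValueError("window width h must be positive")
    return sum(gaussian_kernel((x - s) / h) for s in samples) / (len(samples) * h)

samples = [10.0, 20.0, 30.0]
print(parzen_density(20.0, samples, h=2.0))  # near a sample: higher density
print(parzen_density(25.0, samples, h=2.0))  # between samples: lower density
```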

2 k-Nearest Neighbors (KNN)

• Definition: Estimates density by considering the k closest neighbors.
• Application: Frequently used in classification tasks.

2.10 FEATURE ENGINEERING AND DIMENSIONALITY REDUCTION TECHNIQUES

Features are attributes. Feature engineering is about determining the subset of features that form
an important part of the input that improves the performance of the model, be it classification or any other
model in machine learning.

Feature engineering deals with two problems – Feature Transformation and Feature Selection.
Feature transformation is extraction of features and creating new features that may be helpful in increasing
performance. For example, the height and weight may give a new attribute called Body Mass Index (BMI).

Feature subset selection is another important aspect of feature engineering that focuses on selection of
features to reduce the time but not at the cost of reliability.

The features can be removed based on two aspects:

1. Feature relevancy – Some features contribute more to classification than other features. For example, a mole on the face can help in face detection more than common features like the nose. In simple words, the features should be relevant.
2. Feature redundancy – Some features are redundant. For example, when a database table has a field called Date of birth, then an age field is redundant, as age can be computed easily from date of birth.
So, the procedure is:
1. Generate all possible subsets
2. Evaluate the subsets and model performance
3. Evaluate the results for optimal feature selection

Filter-based selection uses statistical measures for assessing features. In this approach, no learning
algorithm is used. Correlation and information gain measures like mutual information and entropy are all
examples of this approach.

Wrapper-based methods use classifiers to identify the best features. These are selected and evaluated by
the learning algorithms. This procedure is computationally intensive but has superior performance.

2.10.1 Stepwise Forward Selection

This procedure starts with an empty set of attributes. Every time, an attribute is tested for statistical
significance for best quality and is added to the reduced set. This process is continued till a good reduced
set of attributes is obtained.
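
The greedy structure of stepwise forward selection can be sketched as below. This is only a sketch under stated assumptions: instead of a statistical significance test, it takes a user-supplied `score` function (hypothetical) that rates a feature subset, and it stops when no remaining attribute improves the score. The toy relevance values are made up.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Start from an empty set and repeatedly add the attribute that helps most."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Score every candidate subset obtained by adding one more attribute.
        candidates = [(score(selected + [f]), f) for f in remaining]
        new_score, new_feature = max(candidates, key=lambda t: t[0])
        if new_score - best <= min_gain:
            break  # no attribute improves the reduced set any further
        selected.append(new_feature)
        remaining.remove(new_feature)
        best = new_score
    return selected

# Toy scoring rule (an assumption): total relevance minus a small cost per feature.
relevance = {"height": 0.5, "weight": 0.3, "patient_id": 0.0}

def toy_score(subset):
    return sum(relevance[f] for f in subset) - 0.05 * len(subset)

print(forward_selection(relevance, toy_score))
```

Stepwise backward elimination follows the same pattern in reverse: it starts from the full set and removes the attribute whose removal hurts the score least.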

2.10.2 Stepwise Backward Elimination

This procedure starts with a complete set of attributes. At every stage, the procedure removes the worst
attribute from the set, leading to the reduced set.


2.10.3 Principal Component Analysis

The idea of the principal component analysis (PCA) or KL transform is to transform a given set of
measurements to a new set of features so that the features exhibit high information packing properties.
This leads to a reduced and compact set of features. Consider a group of random vectors of the form:

The mean vector of the set of random vectors is defined as:

The operator E refers to the expected value of the population. This is calculated theoretically using the
probability density functions (PDF) of the elements xi and the joint probability density functions between
the elements xi and xj. From this, the covariance matrix can be calculated as:

The mapping of the vectors x to y using the transformation can now be described as:

This transform is also called as Karhunen-Loeve or Hoteling transform. The original vector x
can now be reconstructed as follows:

If K largest eigen values are used, the recovered information would be:

The PCA algorithm is as follows:

1. The target dataset x is obtained
2. The mean is subtracted from the dataset. Let the mean be m. Thus, the adjusted dataset is X – m.
The objective of this process is to transform the dataset with zero mean.
3. The covariance of dataset x is obtained. Let it be C.
4. Eigen values and eigen vectors of the covariance matrix are calculated.
5. The eigen vector of the highest eigen value is the principal component of the dataset. The eigen
values are arranged in a descending order. The feature vector is formed with these eigen vectors in
its columns.
Feature vector = {eigen vector1, eigen vector2, … , eigen vectorn}
6. Obtain the transpose of feature vector. Let it be A.
7. PCA transform is y = A × (x – m), where x is the input dataset, m is the mean, and A is the transpose
of the feature vector.
The original data can be retrieved using the formula given below:

x = A^T × y + m

The new data is a dimensionally reduced matrix that represents the original data.
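
The seven steps above can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation; the function name `pca`, the parameter `k`, and the small dataset are assumptions, and samples are stored as rows, so the transform y = A × (x – m) is applied row by row.

```python
import numpy as np

def pca(X, k):
    """Return projected data, transform matrix A, mean m, and sorted eigen values."""
    m = X.mean(axis=0)                    # step 2: mean of the dataset
    Xc = X - m                            # zero-mean data
    C = np.cov(Xc, rowvar=False)          # step 3: covariance matrix
    vals, vecs = np.linalg.eigh(C)        # step 4: eigh suits symmetric matrices
    order = np.argsort(vals)[::-1]        # step 5: descending eigen values
    A = vecs[:, order[:k]].T              # step 6: transpose of the feature vector
    Y = Xc @ A.T                          # step 7: y = A (x - m) for every row
    return Y, A, m, vals[order]

# Hypothetical 2-attribute dataset, one sample per row.
X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
Y, A, m, vals = pca(X, k=2)
X_back = Y @ A + m                        # x = A^T y + m, row by row
```

With k equal to the number of attributes the reconstruction is exact; choosing a smaller k keeps only the leading components and makes `X_back` an approximation.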
Figure 2.15. The scree plot indicates that only 6 out of 246 attributes are important.

From Figure 2.15, one can infer the relevance of the attributes. The scree plot indicates that
the first attribute is more important than all other attributes.

2.10.4 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is also a feature reduction technique like PCA. The focus of LDA
is to project higher dimension data to a line (lower dimension data). LDA is also used to classify the
data. Let there be two classes, c1 and c2. Let m1 and m2 be the mean of the patterns of two classes.
The mean of the class c1 and c2 can be computed as:


The aim of LDA is to optimize the function:

2.10.5 Singular Value Decomposition

Singular Value Decomposition (SVD) is another useful decomposition technique. Let A be the
matrix; then the matrix A can be decomposed as:

A = USV^T

Here, A is the given matrix of dimension m × n, U is the orthogonal matrix whose dimension is m × n, S is the
diagonal matrix of dimension n × n, and V is the orthogonal matrix. The procedure for finding the decomposition
matrices is given as follows:
1. For a given matrix A, find AA^T.
2. Find the eigen values and eigen vectors of AA^T.
3. Sort the eigen values in descending order. Pack the corresponding eigen vectors as the columns of a matrix U.
4. Arrange the square roots of the eigen values along a diagonal. This diagonal matrix is S.
5. Find the eigen values and eigen vectors of A^TA, and pack the eigen vectors as the columns of a
matrix called V.
Thus, A = USV^T. Here, U and V are orthogonal matrices. The columns of U and V are the left and right
singular vectors, respectively. SVD is useful in compression, as one can decide to retain only the k largest
singular values and their singular vectors instead of the original matrix A as:

A_k = s1 × u1 × v1^T + s2 × u2 × v2^T + … + sk × uk × vk^T

Based on the choice of retention, the compression can be controlled.
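
As a rough illustration of the decomposition and of compression by retention, the sketch below uses NumPy's built-in SVD instead of the manual eigen value procedure. The matrix `A` and the retention level `k` are assumed values chosen for the example.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Thin SVD: U is 3 x 2, s holds the singular values, Vt is V transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_full = U @ np.diag(s) @ Vt              # A = U S V^T

# The singular values are the square roots of the eigen values of A^T A.
eig = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# Compression: keep only the k largest singular values.
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
error = np.linalg.norm(A - A_k)           # Frobenius norm of what was discarded
```

For this 3 × 2 matrix with k = 1, the error equals the single discarded singular value, which shows how the retention choice controls the loss.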

CHAPTER 3 - BASICS OF LEARNING THEORY

3.3 DESIGN OF A LEARNING SYSTEM

3.4 INTRODUCTION TO CONCEPT LEARNING

Concept learning is a learning strategy that involves acquiring abstract knowledge or inferring a general
concept based on the given training samples. It aims to derive a category or classification from the data,
facilitating abstraction and generalization. In machine learning, concept learning is about finding a function
that categorizes or labels instances correctly based on the observed features.


3.4.1 Representation of a Hypothesis

A hypothesis, denoted by h, is an approximation of the target function f. It represents the relationship
between independent attributes (input features) and the dependent attribute (output or label) of the
training instances. The hypothesis acts as the predicted model that maps inputs to outputs effectively.
In concept learning, each hypothesis is represented as a conjunction (AND combination) of attribute
conditions in the antecedent part, defining specific constraints on attributes to classify instances
accurately.

3.4.2 Hypothesis Space

Hypothesis space is the set of all possible hypotheses that approximate the target function
f.

The subset of the hypothesis space that is consistent with all observed training instances is
called the Version Space.

3.4.3 Heuristic Space Search

Heuristic search is a search strategy that finds an optimized hypothesis/solution to a
problem by iteratively improving the hypothesis/solution based on a given heuristic
function or a cost measure.

3.4.4 Generalization and Specialization

Searching the Hypothesis Space

There are two ways of learning a hypothesis that is consistent with all training instances from
the large hypothesis space:

1. Specialization – General to Specific learning

2. Generalization – Specific to General learning

Generalization – Specific to General Learning: This learning methodology searches
through the hypothesis space for an approximate hypothesis by generalizing the most
specific hypothesis.

Specialization – General to Specific Learning: This learning methodology searches
through the hypothesis space for an approximate hypothesis by specializing the most
general hypothesis.
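
The specific-to-general search can be sketched in the style of Find-S, the algorithm named in the next heading. The text does not give its steps, so the version below follows the commonly taught form: begin with the most specific hypothesis and generalize an attribute to "?" whenever a positive example disagrees with it. The function name, the attribute values, and the examples are all hypothetical.

```python
def find_s(examples):
    """examples: list of (attribute_tuple, is_positive) pairs."""
    h = None                                  # most specific: covers nothing yet
    for attrs, positive in examples:
        if not positive:
            continue                          # negative examples are ignored
        if h is None:
            h = list(attrs)                   # first positive example is copied
        else:
            # Generalize every attribute that disagrees with this example.
            h = [hv if hv == av else "?" for hv, av in zip(h, attrs)]
    return h

examples = [
    (("sunny", "warm", "normal"), True),
    (("sunny", "warm", "high"), True),
    (("rainy", "cold", "high"), False),
]
hypothesis = find_s(examples)
```

Here the third attribute differs between the two positive examples, so it is the only one generalized.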

3.4.5 Hypothesis Space Search by Find-S Algorithm

Limitations of Find-S Algorithm

3.4.6 Version Spaces


List-Then-Eliminate Algorithm

Candidate Elimination Algorithm

The diagrammatic representation of deriving the version space is shown below:


Deriving the Version Space

Dept of CSE, RNSIT 15

Module 3- Machine Learning (BCS602)

MODULE 3
CHAPTER 4

SIMILARITY-BASED LEARNING
Similarity or Instance-based Learning

Difference between Instance-and Model-based Learning

Some examples of Instance-based Learning algorithms are:

KNN
Variants of KNN
Locally weighted regression
Learning vector quantization
Self-organizing maps
RBF networks

Nearest-Neighbor Learning
• A powerful classification algorithm used in pattern recognition.
• K nearest neighbors stores all available cases and classifies new cases based on a
similarity measure (e.g., a distance function).
• One of the top data mining algorithms used today.
• A non-parametric lazy learning algorithm (an instance-based learning method).
• Used for both classification and regression problems.


Algorithm 4.2: k-NN

4.3 Weighted k-Nearest-Neighbor Algorithm

The weighted k-NN is an extension of k-NN. It chooses the neighbors by using the weighted
distance. In weighted k-NN, the nearest k points are given a weight using a function called
the kernel function. The intuition behind weighted k-NN is to give more weight to the points
which are nearby and less weight to the points which are farther away.
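
A minimal sketch of this idea is given below. The text does not name a kernel, so inverse distance is assumed here as the weighting function; the function name `weighted_knn` and the three training points are also made up for the example.

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k=3):
    """train: list of (feature_tuple, label). Closer neighbors get larger votes."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for distance, label in nearest:
        votes[label] += 1.0 / (distance + 1e-9)   # assumed kernel: inverse distance
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((3.0, 0.0), "B"), ((3.1, 0.0), "B")]
prediction = weighted_knn(train, (0.5, 0.0), k=3)
```

With k = 3, a plain majority vote would pick "B" (two of three neighbors), whereas the weighted vote favors the single, much closer "A" point.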

4.4 Nearest Centroid Classifier

The Nearest Centroid algorithm assumes that the centroids in the input feature space are
different for each target label. The training data is split into groups by class label, then the
centroid for each group of data is calculated. Each centroid is simply the mean value of each
of the input variables, so it is also called the Mean Difference classifier. If there are two classes,
then two centroids or points are calculated; three classes give three centroids, and so on.
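
The description above can be sketched in a few lines. The helper names `fit_centroids` and `predict`, the use of Euclidean distance, and the sample points are assumptions for illustration.

```python
import math
from collections import defaultdict

def fit_centroids(samples):
    """Group samples by class label and average each input variable."""
    groups = defaultdict(list)
    for x, label in samples:
        groups[label].append(x)
    return {label: tuple(sum(col) / len(col) for col in zip(*points))
            for label, points in groups.items()}

def predict(centroids, x):
    """Assign the label of the closest centroid (Euclidean distance assumed)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

samples = [((1.0, 2.0), "A"), ((3.0, 4.0), "A"),
           ((7.0, 8.0), "B"), ((9.0, 10.0), "B")]
centroids = fit_centroids(samples)
label = predict(centroids, (3.0, 3.0))
```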

4.5 Locally Weighted Regression (LWR)

Where τ is called the bandwidth parameter and controls the rate at which wi reduces to zero
with distance from xi.
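
Since the weight formula itself is not reproduced here, the sketch below assumes the commonly used Gaussian kernel, wi = exp(-(xi - x)^2 / (2τ^2)), and fits a weighted straight line around the query point. The function name `lwr_predict` and the data are hypothetical.

```python
import math

def lwr_predict(xs, ys, x0, tau=1.0):
    """Predict y at x0 from a line fitted with weights centred on x0."""
    # Assumed Gaussian kernel: points far from x0 get weights close to zero.
    w = [math.exp(-((x - x0) ** 2) / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw       # weighted mean of x
    my = sum(wi * y for wi, y in zip(w, ys)) / sw       # weighted mean of y
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    slope = sxy / sxx
    return my + slope * (x0 - mx)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # exactly y = 2x + 1
estimate = lwr_predict(xs, ys, 2.5, tau=0.5)
```

A smaller τ makes the fit more local; on exactly linear data such as this, any τ recovers the same line.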
MODULE 3
CHAPTER 5
REGRESSION ANALYSIS
1.1 Introduction to Regression
Regression analysis is a fundamental concept that consists of a set of machine learning methods
that predict a continuous outcome variable (y) based on the value of one or multiple predictor
variables (x).
OR
Regression analysis is a statistical method to model the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
Regression is a supervised learning technique which helps in finding the relationship between
variables.
It is mainly used for prediction, forecasting, time series modelling, and determining the
cause-effect relationship between variables.
Regression fits a line or curve through the data points on the target-predictor
graph in such a way that the vertical distance between the data points and the regression line
is minimum. The distance between data points and line tells whether a model has captured a
strong relationship or not.
• Function of regression analysis is given by:
y = f(x)
Here, y is called the dependent variable and x is called the independent variable.
Applications of Regression Analysis
• Sales of goods or services
• Value of bonds in portfolio management
• Premium on insurance companies
• Yield of crop in agriculture
• Prices of real estate

1.2 INTRODUCTION TO LINEARITY, CORRELATION AND CAUSATION

A correlation is the statistical summary of the relationship between two sets of variables. It is
a core part of exploratory data analysis, and is a critical aspect of numerous advanced machine
learning techniques.
Correlation between two variables can be found using a scatter plot.
There are different types of correlation:

Positive Correlation: Two variables are said to be positively correlated when their values
move in the same direction. For example, in the image below, as the value for X increases, so
does the value for Y at a constant rate.
Negative Correlation: Variables X and Y are negatively correlated when their
values change in opposite directions, so here as the value for X increases, the value for Y
decreases at a constant rate.
Neutral Correlation: No relationship in the change of variables X and Y. In this case, the
values are completely random and do not show any sign of correlation, as shown in the
following image:

Causation
Causation is about a relationship between two variables in which x causes y. This is called x implies y.
Regression is different from causation. Causation indicates that one event is the result of the
occurrence of the other event; i.e. there is a causal relationship between the two events.
Linear and Non-Linear Relationships
The relationship between input features (variables) and the output (target) variable is
fundamental. These concepts have significant implications for the choice of algorithms, model
complexity, and predictive performance.
Linear relationship creates a straight line when plotted on a graph, a Non-Linear relationship
does not create a straight line but instead creates a curve.
Example:
Linear-the relationship between the hours spent studying and the grades obtained in a class.
Non-Linear-
Linearity:
Linear Relationship: A linear relationship between variables means that a change in one
variable is associated with a proportional change in another variable. Mathematically, it can be
represented as y = a * x + b, where y is the output, x is the input, and a and b are constants.

Linear Models: Goal is to find the best-fitting line (plane in higher dimensions) to the data
points. Linear models are interpretable and work well when the relationship between variables
is close to being linear.
Limitations: Linear models may perform poorly when the relationship between variables is
non-linear. In such cases, they may underfit the data, meaning they are too simple to capture
the underlying patterns.
Non-Linearity:
Non-Linear Relationship: A non-linear relationship implies that the change in one variable is
not proportional to the change in another variable. Non-linear relationships can take various
forms, such as quadratic, exponential, logarithmic, or arbitrary shapes.
Non-Linear Models: Machine learning models like decision trees, random forests, support
vector machines with non-linear kernels, and neural networks can capture non-linear
relationships. These models are more flexible and can fit complex data patterns.
Benefits: Non-linear models can perform well when the underlying relationships in the data
are complex or when interactions between variables are non-linear. They have the capacity to
capture intricate patterns.

Types of Regression

Linear Regression:
Single Independent Variable: Linear regression, also known as simple linear regression, is
used when there is a single independent variable (predictor) and one dependent variable
(target).
Equation: The linear regression equation takes the form: Y = β0 + β1X + ε, where Y is the
dependent variable, X is the independent variable, β0 is the intercept, β1 is the slope
(coefficient), and ε is the error term.
Purpose: Linear regression is used to establish a linear relationship between two variables and
make predictions based on this relationship. It's suitable for simple scenarios where there's only
one predictor.
Multiple Regression:
Multiple Independent Variables: Multiple regression, as the name suggests, is used when there
are two or more independent variables (predictors) and one dependent variable (target).
Equation: The multiple regression equation extends the concept to multiple predictors: Y = β0
+ β1X1 + β2X2 + ... + βnXn + ε, where Y is the dependent variable, X1, X2, ..., Xn are the
independent variables, β0 is the intercept, β1, β2, ..., βn are the coefficients, and ε is the error
term.
Purpose: Multiple regression allows you to model the relationship between the dependent
variable and multiple predictors simultaneously. It's used when there are multiple factors that
may influence the target variable, and you want to understand their combined effect and make
predictions based on all these factors.
Polynomial Regression:
Use: Polynomial regression is an extension of multiple regression used when the relationship
between the independent and dependent variables is non-linear.
Equation: The polynomial regression equation allows for higher-order terms, such as quadratic
or cubic terms: Y = β0 + β1X + β2X^2 + ... + βnX^n + ε. This allows the model to fit a curve
rather than a straight line.
Logistic Regression:
Use: Logistic regression is used when the dependent variable is binary (0 or 1). It models the
probability of the dependent variable belonging to a particular class.
Equation: Logistic regression uses the logistic function (sigmoid function) to model
probabilities: P(Y=1) = 1 / (1 + e^(-z)), where z is a linear combination of the independent
variables: z = β0 + β1X1 + β2X2 + ... + βnXn. The probability is then compared with a threshold
(commonly 0.5) to obtain a binary outcome.
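
The logistic equation can be illustrated with a short sketch. The coefficient values are invented for the example, and the 0.5 threshold is a common convention assumed here rather than something fixed by the model.

```python
import math

def predict_proba(x, beta0, betas):
    """P(Y=1) = 1 / (1 + e^(-z)) with z = beta0 + sum(beta_i * x_i)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, beta0, betas, threshold=0.5):
    """Turn the probability into a binary outcome using an assumed threshold."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0

# Hypothetical coefficients: z = -1 + 1.0 * X1 + 0.5 * X2
p = predict_proba([2.0, 1.0], -1.0, [1.0, 0.5])     # z = 1.5
cls = predict_class([2.0, 1.0], -1.0, [1.0, 0.5])
```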
Lasso Regression (L1 Regularization):
Use: Lasso regression is used for feature selection and regularization. It penalizes the absolute
values of the coefficients, which encourages sparsity in the model.

Objective Function: Lasso regression adds an L1 penalty to the linear regression loss function:
Lasso = RSS + λΣ|βi|, where RSS is the residual sum of squares, λ is the regularization strength,
and |βi| represents the absolute values of the coefficients.
Ridge Regression (L2 Regularization):
Use: Ridge regression is used for regularization to prevent overfitting in multiple regression. It
penalizes the square of the coefficients.
Objective Function: Ridge regression adds an L2 penalty to the linear regression loss function:
Ridge = RSS + λΣ(βi^2), where RSS is the residual sum of squares, λ is the regularization
strength, and (βi^2) represents the square of the coefficients.
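
To make the two penalties concrete, the sketch below evaluates both objective functions for a fixed set of predictions and coefficients. It does not fit a model; all numbers, including λ = 0.1, are assumed for illustration.

```python
def rss(y_true, y_pred):
    """Residual sum of squares."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))

def lasso_objective(y_true, y_pred, betas, lam):
    """RSS + lambda * sum(|beta_i|)  (L1 penalty)."""
    return rss(y_true, y_pred) + lam * sum(abs(b) for b in betas)

def ridge_objective(y_true, y_pred, betas, lam):
    """RSS + lambda * sum(beta_i^2)  (L2 penalty)."""
    return rss(y_true, y_pred) + lam * sum(b ** 2 for b in betas)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.5]          # RSS = 0.25 + 0 + 0.25 = 0.5
betas = [2.0, -1.0]               # sum|b| = 3, sum b^2 = 5
lasso = lasso_objective(y_true, y_pred, betas, lam=0.1)
ridge = ridge_objective(y_true, y_pred, betas, lam=0.1)
```

Because the L2 term squares each coefficient, the large coefficient 2.0 contributes relatively more to the Ridge penalty than to the Lasso penalty.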

Limitations of Regression

1.3 INTRODUCTION TO LINEAR REGRESSION

A linear regression model can be created by fitting a line among the scattered data points. The
line is of the form:

y = a0 + a1 × x + e
Ordinary Least Square Approach
The ordinary least squares (OLS) algorithm is a method for estimating the parameters of a
linear regression model. Aim: To find the values of the linear regression model's parameters
(i.e., the coefficients) that minimize the sum of the squared residuals.
In mathematical terms, this can be written as: Minimize ∑(yi – ŷi)^2

where yi is the actual value, ŷi is the predicted value.

A linear regression model used for determining the value of the response variable, y, can be
represented as the following equation.
y = b0 + b1x1 + b2x2 + … + bnxn + e
• where y is the dependent variable, b0 is the intercept, and e is the error term
• b1, b2, …, bn are the coefficients of the independent variables x1, x2, …, xn
The coefficients b1, b2, …, bn are also called the regression coefficients. The OLS method
estimates the unknown parameters (b0, b1, …, bn) by minimizing the sum of squared
residuals (RSS). The sum of squared residuals is also termed the sum of squared errors (SSE).
This method is also known as the least-squares method for regression or linear regression.
Mathematically, the equations of the line for the points are:
y1 = (a0 + a1x1) + e1
y2 = (a0 + a1x2) + e2, and so on up to
yn = (a0 + a1xn) + en.

In general, ei = yi – (a0 + a1xi)
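
For a single predictor, minimizing the sum of squared residuals has a closed-form solution, a1 = (avg(xy) – avg(x) × avg(y)) / (avg(x^2) – avg(x)^2) and a0 = avg(y) – a1 × avg(x). The sketch below applies it to a small made-up dataset; the function name `fit_line` and the data are assumptions.

```python
def fit_line(xs, ys):
    """Return (a0, a1) minimizing the sum of squared residuals for y = a0 + a1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    a1 = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x ** 2)   # slope
    a0 = mean_y - a1 * mean_x                                    # intercept
    return a0, a1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]                 # exactly y = 1 + 2x
a0, a1 = fit_line(xs, ys)
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]   # e_i = y_i - (a0 + a1*x_i)
```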

Linear Regression Example

Linear Regression in Matrix Form

The coefficient of determination r^2 is the ratio of the explained variation to the total variation.

CHAPTER 5

REGRESSION ANALYSIS

2 Consider the following dataset in Table 5.11 where the week and number of working hours per
week spent by a research scholar in a library are tabulated. Based on the dataset, predict the
number of hours that will be spent by the research scholar in the 7th and 9th week. Apply Linear
regression model.

Table 5.11

xi (week)           1    2    3    4    5
yi (hours spent)    12   18   22   28   35

Solution

The computation table is shown below:

xi     yi     xi × xi     xi × yi
1      12     1           12
2      18     4           36
3      22     9           66
4      28     16          112
5      35     25          175
Sum = 15           Sum = 115           Avg(xi × xi) = 55/5 = 11    Avg(xi × yi) = 401/5 = 80.2
avg(xi) = 15/5 = 3   avg(yi) = 115/5 = 23

The regression equations are

a1 = (avg(xi × yi) – avg(xi) × avg(yi)) / (avg(xi × xi) – (avg(xi))^2)

a0 = avg(yi) – a1 × avg(xi)

a1 = (80.2 – 3 × 23) / (11 – 3^2) = (80.2 – 69) / (11 – 9) = 11.2 / 2 = 5.6

a0 = 23 – 5.6 × 3 = 23 – 16.8 = 6.2

Therefore, the regression equation is given as

y = 6.2 + 5.6 × x

The prediction for the 7th week hours spent by the research scholar will be

y = 6.2 + 5.6 × 7 = 45.4 ≈ 45 hours

The prediction for the 9th week hours spent by the research scholar will be

y = 6.2 + 5.6 × 9 = 56.6 ≈ 57 hours

3 The height of boys and girls is given in the following Table 5.12.

Table 5.12: Sample Data

Height of Boys     65   70   75   78
Height of Girls    63   67   70   73

Fit a suitable line of best fit for the above data.

Solution

The computation table is shown below:

xi     yi     xi × xi     xi × yi
65     63     4225        4095
70     67     4900        4690
75     70     5625        5250
78     73     6084        5694
Sum = 288              Sum = 273                 Avg(xi × xi) = 20834/4 = 5208.5    Avg(xi × yi) = 19729/4 = 4932.25
Mean(xi) = 288/4 = 72    Mean(yi) = 273/4 = 68.25

The regression equations are

a1 = (avg(xi × yi) – avg(xi) × avg(yi)) / (avg(xi × xi) – (avg(xi))^2)

a0 = avg(yi) – a1 × avg(xi)

a1 = (4932.25 – 72 × 68.25) / (5208.5 – 72^2) = 18.25 / 24.5 = 0.7449

a0 = 68.25 – 0.7449 × 72 = 68.25 – 53.6328 = 14.6172

Therefore, the regression line of best fit is given as

y = 14.6172 + 0.7449 × x

4 Using multiple regression, fit a line for the following dataset shown in Table 5.13.
Here, Z is the equity, X is the net sales and Y is the asset. Z is the dependent variable
and X and Y are independent variables. All the data is in million dollars.

Table 5.13: Sample Data

Z      X      Y
4      12     8
6      18     12
7      22     16
8      28     36
11     35     42

Solution

The matrices X and Y are given as follows:

X =
| 1   12   8  |
| 1   18   12 |
| 1   22   16 |
| 1   28   36 |
| 1   35   42 |

Y = [4  6  7  8  11]^T

The regression coefficients can be found as follows:

â = ((X^T X)^-1 X^T) Y

Substituting the values, one gets

X^T =
| 1    1    1    1    1  |
| 12   18   22   28   35 |
| 8    12   16   36   42 |

X^T X =
| 5     115    114  |
| 115   2961   3142 |
| 114   3142   3524 |

so that

â = (X^T X)^-1 × X^T × Y =
| –0.4135  |
|  0.39625 |
| –0.0658  |

Therefore, the regression line is given as

y = 0.39625 × x1 – 0.0658 × x2 – 0.4135
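
The matrix computation for Table 5.13 can be checked with NumPy. The sketch solves the normal equations (X^T X) a = X^T Y instead of forming the inverse explicitly, which is a numerical choice made here and not part of the worked solution; the variable names are illustrative.

```python
import numpy as np

# Design matrix: a column of ones for the intercept, then X (net sales) and Y (assets).
X = np.array([[1, 12, 8],
              [1, 18, 12],
              [1, 22, 16],
              [1, 28, 36],
              [1, 35, 42]], dtype=float)
Z = np.array([4, 6, 7, 8, 11], dtype=float)     # equity, the dependent variable

XtX = X.T @ X
XtZ = X.T @ Z
a = np.linalg.solve(XtX, XtZ)                   # [intercept, coefficient of X, coefficient of Y]
predicted = X @ a
```

The entries of `a` are the intercept and the two slopes in that order, so they can be compared directly with the coefficient vector in the solution.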

***
CHAPTER 6
DECISION TREE LEARNING
6.1 Introduction

• Why is it called a decision tree?

• Because it starts from a root node and branches out to a number of solutions.
• The benefits of having a decision tree are as follows:
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
• Example: Toll-free number

6.1.1 Structure of a Decision Tree

A decision tree is a structure that includes a root
node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each
branch denotes the outcome of a test, and each leaf node holds a class label. The topmost
node in the tree is the root node.

Decision trees apply to both classification and regression models.

The decision tree consists of 2 major procedures:

1) Building a tree and

2) Knowledge inference or classification.

Building the Tree

Knowledge Inference or Classification

Advantages of Decision Trees

Disadvantages of Decision Trees

6.1.2 Fundamentals of Entropy

• How to draw a decision tree?

Entropy
Information gain
Algorithm 6.1: General Algorithm for Decision Trees

6.2 DECISION TREE INDUCTION ALGORITHMS

6.2.1 ID3 Tree Construction (ID3 stands for Iterative Dichotomiser 3)

A decision tree is one of the most powerful tools of supervised learning algorithms
used for both classification and regression tasks.
It builds a flowchart-like tree structure where each internal node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf node (terminal
node) holds a class label. It is constructed by recursively splitting the training data
into subsets based on the values of the attributes until a stopping criterion is met, such
as the maximum depth of the tree or the minimum number of samples required to split
a node.
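
Choosing the attribute to test at each node relies on the entropy and information gain measures listed in Section 6.1.2. Their formulas are not reproduced in these notes, so the sketch below uses the standard definitions, entropy = -Σ p × log2(p) and gain = entropy(parent) minus the weighted entropy of the subsets. The toy dataset and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the parent minus the weighted entropy after splitting on attr."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

rows = [{"outlook": "sunny", "windy": False},
        {"outlook": "sunny", "windy": True},
        {"outlook": "rainy", "windy": False},
        {"outlook": "rainy", "windy": True}]
labels = ["no", "no", "yes", "yes"]

# The attribute with the highest gain becomes the test at this node.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
```

In this toy data "outlook" separates the classes perfectly while "windy" tells nothing, so the split is made on "outlook".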
6.2.2 C4.5 Construction
C4.5 is a widely used algorithm for constructing decision trees from a dataset.
The disadvantages of ID3 are: attributes must be nominal values, the dataset must not include
missing data, and the algorithm tends to overfit.
To overcome these disadvantages, Ross Quinlan, the inventor of ID3, made some
improvements for these bottlenecks and created a new algorithm named C4.5. The new
algorithm can create more generalized models, handles continuous data and
missing data, still works with discrete data, and supports post-pruning.
Dealing with Continuous Attributes in C4.5
6.2.3 Classification and Regression Trees Construction
Classification and Regression Trees (CART) is a widely used algorithm for
constructing decision trees that can be applied to both classification and regression
tasks. CART is similar to C4.5 but has some differences in its construction and splitting
criteria.
The CART classification method constructs a decision tree based on the Gini
impurity index. It serves as an example of how the values of other variables can be used
to predict the values of a target variable. It functions as a fundamental machine-learning
method and provides a wide range of use cases.
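
The Gini impurity index is not defined in these notes; the sketch below assumes the standard definition, Gini = 1 – Σ p_i^2, and scores a binary split by the size-weighted impurity of its two sides. The function names and labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini impurity of a binary split; lower is better."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

mixed = ["a", "a", "b", "b"]
impurity_before = gini(mixed)                        # evenly mixed node
impurity_after = gini_split(["a", "a"], ["b", "b"])  # perfectly separating split
```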
6.2.4 Regression Trees
MODULE-4

Bayes' Theorem is a fundamental concept in probability theory and forms the foundation of Bayesian
learning in machine learning. It allows you to update the probability of a hypothesis (or event) based on
new evidence.

Bayes' Theorem Explained

At its core, Bayes' Theorem relates current knowledge or belief about an event (the prior probability) to
new data or evidence (the likelihood) to produce an updated belief (the posterior probability).

Mathematically, Bayes' Theorem is:

P(H | D) = P(D | H) × P(H) / P(D)

Where:

• P(H | D) is the posterior probability: the probability of the hypothesis H being true given the data
D.
• P(D | H) is the likelihood: the probability of observing the data D given that hypothesis H is true.
• P(H) is the prior probability: the initial belief about the hypothesis H before any data is observed.
• P(D) is the marginal likelihood or evidence: the total probability of the data under all possible
hypotheses. This acts as a normalizing constant to ensure that the posterior is a valid probability
distribution.

Breaking Down the Components of Bayes' Theorem

1. Prior Probability (P(H)):
o This represents what we know or believe about a hypothesis before seeing any new data.
o Example: In a medical test scenario, it could be the prior probability of a person having a
disease before considering the test results (e.g., based on the general population statistics).
2. Likelihood (P(D | H)):
o This is the probability of observing the data, assuming the hypothesis is true. It expresses how
likely it is to see the given data under the assumption of the hypothesis.
o Example: The likelihood would be the probability of getting a positive test result assuming the
person has the disease.
3. Evidence (P(D)):
o This is the total probability of the data across all hypotheses. It serves to normalize the
posterior probability so that it sums to 1.
o Example: The probability of getting a positive test result across all people, whether they have
the disease or not.
4. Posterior Probability (P(H | D)):
o This is the updated belief about the hypothesis after considering the new data (the evidence).
o Example: The posterior would give the probability of a person having the disease after
considering both the prior knowledge and the test results.

Intuition Behind Bayes' Theorem

Bayes' Theorem can be understood in terms of updating beliefs. When you receive new evidence, you
modify your prior belief to form a new belief that incorporates both your prior knowledge and the new
data.

• Before you collect any data, you have a prior belief about a hypothesis (e.g., the probability of a patient
having a disease).
• After seeing new data (e.g., the result of a medical test), you update your belief about the hypothesis to
reflect this new evidence.

Bayes' Theorem lets you do this systematically, ensuring that your updated belief (posterior) is
proportional to the product of the prior belief and the likelihood of observing the new data.

Example: Disease Diagnosis

Consider a simple example of diagnosing a disease using a medical test.

1. Prior Probability (P(H)):
o The prior belief is the probability that a person has the disease. For example, in a population,
1% of people might have the disease, so P(H) = 0.01.
2. Likelihood (P(D | H)):
o This is the probability of getting a positive test result if the person has the disease. Suppose the
test correctly identifies the disease 95% of the time, so P(D | H) = 0.95.
3. Evidence (P(D)):
o This is the total probability of a positive test result in the population. It includes both people
who have the disease and those who do not.
4. Posterior Probability (P(H | D)):
o After receiving a positive test result, we want to calculate the probability that the person
actually has the disease.
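
To carry the example through to a number, one more quantity is needed that the text does not supply: how often the test is positive for a healthy person. The sketch below assumes a false-positive rate of 5% purely for illustration; the prior (1%) and the likelihood (95%) come from the example above.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | D) = P(D | H) * P(H) / P(D), with P(D) summed over both hypotheses."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# 0.01 and 0.95 are from the example; 0.05 is an assumed false-positive rate.
p_disease_given_positive = posterior(prior=0.01, likelihood=0.95,
                                     false_positive_rate=0.05)
```

Under this assumption the posterior is 0.0095 / 0.059, roughly 16%: even after a positive result the disease remains unlikely, because it is rare in the population to begin with.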
Chapter 10
Artificial Neural Networks
The term "artificial neural network" refers to a biologically inspired sub-field of artificial intelligence
modelled after the brain.
An artificial neural network is usually a computational network based on the biological neural networks
that make up the structure of the human brain.
Just as a human brain has neurons interconnected with each other, artificial neural networks also have
neurons that are linked to each other in various layers of the network. These neurons are known as
nodes.

The biological neuron consists of four main parts:

• dendrites: nerve fibres carrying electrical signals to the cell.
• cell body: computes a non-linear function of its inputs.
• axon: a single long fibre that carries the electrical signal from the cell body to other neurons.
• synapse: the point of contact between the axon of one cell and the dendrite of another,
regulating a chemical connection whose strength affects the input to the cell.

Dendrites are tree-like networks made of nerve fibre connected to the cell body.
An axon is a single, long connection extending from the cell body and carrying signals from the
neuron. The end of the axon splits into fine strands. Each strand terminates in a small
bulb-like organ called a synapse. It is through the synapse that the neuron introduces its signals to
other nearby neurons. The receiving ends of these synapses on the nearby neurons can be found
both on the dendrites and on the cell body. There are approximately 10^4 synapses per neuron in the
human body. Electric impulses are passed between synapse and dendrites. It is a chemical process
which results in an increase or decrease in the electric potential inside the body of the receiving cell. If
the electric potential reaches a threshold value, the receiving cell fires, and a pulse (action potential) of
fixed strength and duration is sent through the axon to the synaptic junction of the cell. After that, the cell
has to wait for a period called the refractory period.

Difference between biological and Artificial Neuron

ARTIFICIAL NEURONS:
Artificial neurons are like biological neurons that are linked to each other in various layers of the
network. These neurons are known as nodes.
A node or a neuron can receive one or more inputs and process them. Artificial neurons are
connected by connection links to other neurons. Each connection link is associated with a synaptic
weight. The structure of a single neuron is shown below:
Fig: McCulloch-Pitts Neuron Mathematical model.

Simple Model of an ANN

The first mathematical model of a biological neuron was designed by McCulloch and Pitts in 1943.
It includes 2 steps:
1. It receives weighted inputs from other neurons.
2. It operates with a threshold function or activation function.

Basically, a neuron takes an input signal (dendrite), processes it like the CPU (soma), passes
the output through a cable like structure to other connected neurons (axon to synapse to
other neuron’s dendrite).

OR
Working:
The received inputs are combined as a weighted sum, which is given to the activation function;
if the sum exceeds the threshold value, the neuron gets fired. The neuron is the basic
processing unit that receives a set of inputs x1, x2, x3, …, xn and their associated weights
w1, w2, w3, …, wn. The summation function computes the weighted sum of the inputs
received by the neuron.
Sum = ∑ xi × wi
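
The two steps (weighted sum, then threshold) can be sketched as below. The weights and threshold are hand-picked so that the unit behaves like a logical AND; they are illustrative values, not learned ones, and the comparison uses ≥ to match the binary step function defined later in these notes.

```python
def weighted_sum(inputs, weights):
    """Sum = sum of x_i * w_i over all inputs."""
    return sum(x * w for x, w in zip(inputs, weights))

def neuron_output(inputs, weights, threshold):
    """Fire (output 1) when the weighted sum reaches the threshold."""
    return 1 if weighted_sum(inputs, weights) >= threshold else 0

# Assumed weights and threshold that make the neuron compute logical AND.
weights, threshold = [1, 1], 2
outputs = [neuron_output([a, b], weights, threshold)
           for a in (0, 1) for b in (0, 1)]
```

Lowering the threshold to 1 with the same weights would turn the same unit into a logical OR.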

Activation functions:
• Just as some force or activation is applied to make work more efficient and to get an exact output,
an activation function is applied over the net input to calculate the output of an ANN.
Information processing of a processing element has two major parts: input and output. An
integration function (f) is associated with the input of the processing element.

• Several activation functions are there.

1. Identity function or Linear Function: It is a linear function which is defined as
f(x) = x for all x

The output is the same as the input, i.e., the weighted sum. The function is useful when we do
not apply any threshold. The output value ranges between –∞ and +∞.
2. Binary step function: This function can be defined as
f(x) = 1 if x ≥ θ
f(x) = 0 if x < θ
where θ represents the threshold value. It is used in single layer nets to convert
the net input to an output that is binary (0 or 1).
3. Bipolar step function: This function can be defined as
f(x) = 1 if x ≥ θ
f(x) = –1 if x < θ
where θ represents the threshold value. It is used in single layer nets to convert
the net input to an output that is bipolar (+1 or -1).
4. Sigmoid function: It is used in back-propagation nets.
Two types:
a) Binary sigmoid function: It is also termed the logistic sigmoid function or unipolar
sigmoid function. It is defined as

f(x) = 1 / (1 + e^(-λx))

where λ represents the steepness parameter. The range of the sigmoid function is 0 to 1.
b) Bipolar sigmoid function: This function is defined as

f(x) = (1 – e^(-λx)) / (1 + e^(-λx))

where λ represents the steepness parameter and the sigmoid range is between -1 and +1.
5. Ramp function: The ramp function is defined as:

It is a linear function whose upper and lower limits are fixed.

6. Tanh-Hyperbolic tangent function : Tanh function is very similar to the sigmoid/logistic
activation function, and even has the same S-shape with the difference in output range of -1 to
1. In Tanh, the larger the input (more positive), the closer the output value will be to 1.0,
whereas the smaller the input (more negative), the closer the output will be to -1.0.

7. ReLU Function
ReLU stands for Rectified Linear Unit: f(x) = max(0, x).
Although it gives the impression of a linear function, ReLU is non-linear, has a derivative and so
allows for backpropagation, and is computationally efficient.
The main point is that the ReLU function does not activate all the neurons at the same
time.
A neuron is deactivated only if the output of the linear transformation is less than 0.
8. Softmax function: Softmax is an activation function that scales numbers/logits into
probabilities. The output of a softmax is a vector (say v) with the probabilities of each
possible outcome. The probabilities in vector v sum to one over all possible outcomes or
classes.
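
The activation functions listed above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function names and the default values theta = 0 and lam = 1 are assumptions chosen for the example, and the ramp function is left out.

```python
import math

def identity(x):
    # Output equals the net input.
    return x

def binary_step(x, theta=0.0):
    # 1 at or above the threshold, otherwise 0.
    return 1 if x >= theta else 0

def bipolar_step(x, theta=0.0):
    # +1 at or above the threshold, otherwise -1.
    return 1 if x >= theta else -1

def binary_sigmoid(x, lam=1.0):
    # Logistic sigmoid with steepness lam; range (0, 1).
    return 1.0 / (1.0 + math.exp(-lam * x))

def bipolar_sigmoid(x, lam=1.0):
    # Bipolar sigmoid with steepness lam; range (-1, 1).
    return (1.0 - math.exp(-lam * x)) / (1.0 + math.exp(-lam * x))

def relu(x):
    # Passes positive inputs through and outputs 0 otherwise.
    return max(0.0, x)

def softmax(values):
    # Subtracting the maximum keeps exp() numerically stable.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

for x in (-2.0, 0.0, 2.0):
    print(x, binary_step(x), bipolar_step(x),
          round(binary_sigmoid(x), 3), round(math.tanh(x), 3), relu(x))
print(softmax([1.0, 2.0, 3.0]))
```

The tanh function is taken directly from the standard `math` module.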

Artificial Neural Network Structure

• Artificial Neural Networks are computational models inspired by the human brain:
– A massively parallel, distributed system made up of simple processing units (neurons).
– Synaptic connection strengths among neurons are used to store the acquired knowledge.

• Knowledge is acquired by the network from its environment through a learning process.

• The Neural Network is constructed from 3 types of layers:

• Input layer — initial data for the neural network.
• Hidden layers — intermediate layer between input and output layer and place where all the
computation is done.

• Output layer — produce the result for given inputs.

PERCEPTRON AND LEARNING THEORY

• The perceptron is also a simplified model of a biological neuron.
• The perceptron is an algorithm for supervised learning of binary classifiers. It is a type of
linear classifier, i.e. a classification algorithm that makes all of its predictions based on a
linear predictor function combining a set of weights with the feature vector.
• One type of ANN system is based on a unit called a perceptron.

OR
• The perceptron can represent all boolean primitive functions AND, OR, NAND , NOR.
• Some boolean functions can not be represented .
– E.g. the XOR function.

Major components of a perceptron

• Input
• Weight
• Bias
• Weighted summation
• Step/activation function
• output
WORKING:
• Feed the features of the model to be trained as input to the first layer.
• Each input is multiplied by its weight, and the products are added together.
• The bias value is added to shift the output function.
• This value is presented to the activation function (the type of activation function depends on the need).
• The value received after the last step is the output value.
The activation function here is a binary step function, which outputs 1 if the net input is at or above the
threshold value Θ and 0 if it is below Θ. The output of a neuron is then:
output = 1 if (∑ xi wi + bias) ≥ Θ, otherwise 0
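
The working steps above can be sketched as a single function. The weights and biases used for the AND, OR and NAND gates below are hand-picked, hypothetical values for illustration, and the threshold is assumed to be 0.

```python
def perceptron(inputs, weights, bias, theta=0.0):
    # Weighted summation plus bias, followed by a binary step activation.
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if net >= theta else 0

# Hypothetical hand-picked parameters; many other choices also work.
GATES = {
    "AND": {"weights": (1.0, 1.0), "bias": -1.5},
    "OR": {"weights": (1.0, 1.0), "bias": -0.5},
    "NAND": {"weights": (-1.0, -1.0), "bias": 1.5},
}

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]

for name, params in GATES.items():
    outputs = [perceptron(p, params["weights"], params["bias"]) for p in PATTERNS]
    print(name, outputs)
```

No single choice of two weights and a bias reproduces XOR with this function, which is why XOR needs a multi-layer network.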
PROBLEM:
Design a 2 layer network of perceptron to implement NAND gate. Assume your own weights and
biases in the range of [-0.5 0.5]. Use learning rate as 0.4.

Solution:

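[Figure 1: Two Layer Network for NAND gate. Inputs X1 and X2 feed hidden unit X3 (AND) through weights w13 and w23 with bias Θ3; X3 feeds output unit X4 (NOT) through weight w34 with bias Θ4.]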

Table 1: Weights and Biases

𝑿𝟏 𝑿𝟐 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 𝒘𝟏𝟑 𝒘𝟐𝟑 𝒘𝟑𝟒 𝚹𝟑 𝚹𝟒 𝑿𝟎

0 1 1 0.1 -0.4 0.3 0.2 -0.3 1

Table 2: Truth Table of NAND Gate
𝑿𝟏 𝑿𝟐 𝑿𝟏 𝑨𝑵𝑫 𝑿𝟐 𝑵𝑨𝑵𝑫 = 𝑵𝑶𝑻(𝑿𝟏 𝑨𝑵𝑫 𝑿𝟐)

0 0 0 1
0 1 0 1
1 0 0 1
1 1 1 0

ITERATION 1:
Step 1: FORWARD PROPAGATION
1. Calculate net inputs and outputs in input layer as shown in Table 3.
Table 3: Net Input and Output Calculation
Input Layer 𝑰𝒋 𝑶𝒋

𝑿𝟏 0 0

𝑿𝟐 1 1

2. Calculate net inputs and outputs in hidden and output layer as shown in Table 4.
Table 4: Net Input and Output Calculation in Hidden and Output layer

Unit j   Net Input Ij                        Net Output Oj
X3       I3 = X1·W13 + X2·W23 + X0·Θ3        O3 = 1 / (1 + e^(−I3))
            = 0(0.1) + 1(−0.4) + 1(0.2)         = 1 / (1 + e^(−(−0.2)))
            = −0.2                              = 0.450

Unit k   Net Input Ik                        Net Output Ok
X4       I4 = O3·W34 + X0·Θ4                 O4 = 1 / (1 + e^(−I4))
            = (0.450 ∗ 0.3) + 1(−0.3)           = 1 / (1 + e^(−(−0.165)))
            = −0.165                            = 0.458

3. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.458
𝐸𝑟𝑟𝑜𝑟 = 0.542

Step 2: BACKWARD PROPAGATION

1. For each unit k in the output layer:
Error_k = O_k ∗ (1 − O_k) ∗ (O_desired − O_k)

For each unit j in the hidden layer:
Error_j = O_j ∗ (1 − O_j) ∗ ∑_k (Error_k ∗ W_jk)

Table 5: Error Calculation

Output layer unit k    Error_k
X4                     Error_4 = O_4 ∗ (1 − O_4) ∗ (O_desired − O_4)
                               = 0.458 ∗ (1 − 0.458) ∗ (1 − 0.458)
                               = 0.134

Hidden layer unit j    Error_j
X3                     Error_3 = O_3 ∗ (1 − O_3) ∗ ∑_k (Error_k ∗ W_3k)
                               = 0.450 ∗ (1 − 0.450) ∗ 0.134 ∗ 0.3
                               = 0.0099

2. Update Weights and biases

Table 6: Weight and Bias Calculation

w_ij   w_ij = w_ij + (α ∗ Error_j ∗ O_i)         Net Weight
w13    w13 = w13 + (0.4 ∗ Error_3 ∗ O_1)         0.1
           = 0.1 + (0.4 ∗ 0.0099 ∗ 0)
w23    w23 = w23 + (0.4 ∗ Error_3 ∗ O_2)         -0.396
           = −0.4 + (0.4 ∗ 0.0099 ∗ 1)
w34    w34 = w34 + (0.4 ∗ Error_4 ∗ O_3)         0.324
           = 0.3 + (0.4 ∗ 0.134 ∗ 0.450)

Θ_j    Θ_j = Θ_j + (α ∗ Error_j)                 Net Bias
Θ3     Θ3 = Θ3 + (0.4 ∗ Error_3)                 0.203
          = 0.2 + (0.4 ∗ 0.0099)
Θ4     Θ4 = Θ4 + (0.4 ∗ Error_4)                 -0.246
          = −0.3 + (0.4 ∗ 0.134)

ITERATION 2:
Step 1: FORWARD PROPAGATION

1. Calculate net inputs and outputs in hidden and output layer

Table 7: Inputs and Outputs in Hidden and Output layer

Unit j   Net Input Ij                          Net Output Oj
X3       I3 = X1·W13 + X2·W23 + X0·Θ3          O3 = 1 / (1 + e^(−I3))
            = 0(0.1) + 1(−0.396) + 1(0.203)       = 1 / (1 + e^(−(−0.193)))
            = −0.193                              = 0.451

Unit k   Net Input Ik                          Net Output Ok
X4       I4 = O3·W34 + X0·Θ4                   O4 = 1 / (1 + e^(−I4))
            = (0.451 ∗ 0.324) + 1(−0.246)         = 1 / (1 + e^(−(−0.099)))
            = −0.099                              = 0.475

2. Calculate Error
𝑬𝒓𝒓𝒐𝒓 = 𝑶𝒅𝒆𝒔𝒊𝒓𝒆𝒅 − 𝑶𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆𝒅
= 1 − 0.475
𝐸𝑟𝑟𝑜𝑟 = 0.525

ITERATION   ERROR
1           0.542
2           0.525

Reduction in error = 0.542 − 0.525 = 0.017

In iteration 2 the error is reduced to 0.525. This process continues until the desired output
is achieved.
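
The two iterations above can be re-run with a short script. This is a sketch under the assumptions of the worked example (one hidden unit X3, one output unit X4, sigmoid activations, learning rate 0.4); the dictionary keys and function names are illustrative. Because the script keeps full precision instead of rounding to three decimals at each step, its errors agree with the tables only to about two or three decimal places.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(p, x1, x2, target, alpha=0.4):
    """One forward and backward pass; returns (updated parameters, error before the update)."""
    # Forward propagation
    o3 = sigmoid(x1 * p["w13"] + x2 * p["w23"] + p["t3"])
    o4 = sigmoid(o3 * p["w34"] + p["t4"])
    # Backward propagation of the error terms
    err4 = o4 * (1 - o4) * (target - o4)
    err3 = o3 * (1 - o3) * err4 * p["w34"]
    # Weight and bias updates: w_ij += alpha * Error_j * O_i
    updated = {
        "w13": p["w13"] + alpha * err3 * x1,
        "w23": p["w23"] + alpha * err3 * x2,
        "w34": p["w34"] + alpha * err4 * o3,
        "t3": p["t3"] + alpha * err3,
        "t4": p["t4"] + alpha * err4,
    }
    return updated, target - o4

# Initial values from Table 1; the training sample is X1 = 0, X2 = 1, desired output 1.
params = {"w13": 0.1, "w23": -0.4, "w34": 0.3, "t3": 0.2, "t4": -0.3}
errors = []
for _ in range(2):
    params, error = train_step(params, 0, 1, 1)
    errors.append(error)
print([round(e, 3) for e in errors])
```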
PROBLEM:
How does a Multi-Layer Perceptron solve the XOR problem? Design an MLP with back
propagation to implement the XOR Boolean function.
Solution:

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0

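[Figure 2: Multi Layer Perceptron for XOR. Inputs X1 and X2 feed hidden units X3 and X4, which feed output unit X5; the initial weights and biases shown in the figure are listed in Table 8.]

Learning rate: α = 0.8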

Table 8: Weights and Biases
X1 X2 W13 W14 W23 W24 W35 W45 𝜃3 𝜃4 𝜃5
1 0 -0.2 0.4 0.2 -0.3 0.2 -0.3 0.4 0.1 -0.3

Step 1: Forward Propagation

1. Calculate Input and Output in the Input Layer shown in Table 9.
Table 9: Net Input and Output Calculation
Input Layer Ij Oj
X1 1 1
X2 0 0
2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 10.
Table 10: Unit j at Hidden Layer and Output Layer – Net Input and Output Calculation
Unit j   Net Input Ij                                      Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*θ3                      O3 = 1 / (1 + e^(−I3)) = 1 / (1 + e^(−0.2)) = 0.549
            = 1*(-0.2) + 0*0.2 + 1*0.4 = 0.2
X4       I4 = X1*W14 + X2*W24 + X0*θ4                      O4 = 1 / (1 + e^(−I4)) = 1 / (1 + e^(−0.5)) = 0.622
            = 1*0.4 + 0*(-0.3) + 1*0.1 = 0.5
X5       I5 = O3*W35 + O4*W45 + X0*θ5                      O5 = 1 / (1 + e^(−I5)) = 1 / (1 + e^(0.376)) = 0.407
            = 0.549*0.2 + 0.622*(-0.3) + 1*(-0.3) = -0.376

3. Calculate Error = Odesired – OEstimated

So the error for this network is,
Error = Odesired – O5 = 1 – 0.407 = 0.593

Step 2: Backward Propagation

1. Calculate Error at each node as shown in Table 11.
For each unit k in the output layer, calculate
Error k = Ok (1-Ok) (Odesired – Ok)
For each unit j in the hidden layer, calculate
Error j = Oj (1-Oj) ∑𝑘 𝐸𝑟𝑟𝑜𝑟𝑘 𝑊𝑗𝑘

Table 11: Error Calculation for each unit in the Output layer and Hidden layer
For Output Layer Unit k    Error k
X5    Error 5 = O5 (1-O5) (Odesired – O5)
              = 0.407 * (1-0.407) * (1 - 0.407)
              = 0.143
For Hidden Layer Unit j    Error j
X4    Error 4 = O4 (1-O4) ∑k Error k Wjk = O4 (1-O4) Error 5 W45
              = 0.622 * (1-0.622) * 0.143 * (-0.3)
              = -0.010
X3    Error 3 = O3 (1-O3) ∑k Error k Wjk = O3 (1-O3) Error 5 W35
              = 0.549 * (1-0.549) * 0.143 * 0.2
              = 0.007

2. Update weights using the formula below,

Learning rate α = 0.8
∆Wij = α * Error j * Oi
Wij = Wij + ∆Wij
The updated weights and biases are shown in Table 12 and Table 13. Here Oi is the output of the sending unit, so the updates of W35 and W45 use O3 and O4.
Table 12: Weight Updation
Wij    Wij = Wij + α * Error j * Oi         New Weight
W13    W13 = W13 + 0.8 * Error 3 * O1       -0.194
           = -0.2 + 0.8 * 0.007 * 1
W14    W14 = W14 + 0.8 * Error 4 * O1       0.392
           = 0.4 + 0.8 * (-0.01) * 1
W23    W23 = W23 + 0.8 * Error 3 * O2       0.2
           = 0.2 + 0.8 * 0.007 * 0
W24    W24 = W24 + 0.8 * Error 4 * O2       -0.3
           = -0.3 + 0.8 * (-0.01) * 0
W35    W35 = W35 + 0.8 * Error 5 * O3       0.263
           = 0.2 + 0.8 * 0.143 * 0.549
W45    W45 = W45 + 0.8 * Error 5 * O4       -0.229
           = -0.3 + 0.8 * 0.143 * 0.622

Update biases using the formula below,

∆θj = α * Error j
θj = θj + ∆θj
Table 13: Bias Updation
θj    θj = θj + α * Error j                 New Bias
θ3    θ3 = θ3 + α * Error 3                 0.405
         = 0.4 + 0.8 * 0.007
θ4    θ4 = θ4 + α * Error 4                 0.092
         = 0.1 + 0.8 * (-0.01)
θ5    θ5 = θ5 + α * Error 5                 -0.185
         = -0.3 + 0.8 * 0.143
Iteration 2
Now with the updated weights and biases,
1. Calculate Input and Output in the Input Layer shown in Table 14.
Table 14: Net Input and Output Calculation
Input Layer Ij Oj
X1 1 1
X2 0 0

2. Calculate Net Input and Output in the Hidden Layer and Output Layer shown in Table 15.
Table 15: Net Input and Output Calculation in the Hidden Layer and Output Layer
Unit j   Net Input Ij                                          Output Oj
X3       I3 = X1*W13 + X2*W23 + X0*θ3                          O3 = 1 / (1 + e^(−I3)) = 1 / (1 + e^(−0.211)) = 0.552
            = 1*(-0.194) + 0*0.2 + 1*0.405 = 0.211
X4       I4 = X1*W14 + X2*W24 + X0*θ4                          O4 = 1 / (1 + e^(−I4)) = 1 / (1 + e^(−0.484)) = 0.618
            = 1*0.392 + 0*(-0.3) + 1*0.092 = 0.484
X5       I5 = O3*W35 + O4*W45 + X0*θ5                          O5 = 1 / (1 + e^(−I5)) = 1 / (1 + e^(0.181)) = 0.455
            = 0.552*0.263 + 0.618*(-0.229) + 1*(-0.185) = -0.181

The output of the network at node 5 is now 0.455, compared with 0.407 in iteration 1.

Error = 1 - 0.455 = 0.545
Comparing the error in the previous iteration with the error in the current iteration, the
network has learnt: the error is reduced by 0.593 – 0.545 = 0.048.
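
The same procedure can be checked numerically. The sketch below assumes the 2-2-1 network of Figure 2 with the initial values of Table 8, sigmoid units and α = 0.8, and applies the update rule ∆Wij = α * Error j * Oi with Oi taken as the output of the sending unit. The names used are illustrative, and full floating-point precision is kept, so the printed errors can differ from hand-rounded values in the third decimal place.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(p, x1, x2, target, alpha=0.8):
    """One forward and backward pass; returns (updated parameters, error before the update)."""
    # Forward propagation through hidden units X3, X4 and output unit X5
    o3 = sigmoid(x1 * p["w13"] + x2 * p["w23"] + p["t3"])
    o4 = sigmoid(x1 * p["w14"] + x2 * p["w24"] + p["t4"])
    o5 = sigmoid(o3 * p["w35"] + o4 * p["w45"] + p["t5"])
    # Error terms for the output unit and the two hidden units
    e5 = o5 * (1 - o5) * (target - o5)
    e3 = o3 * (1 - o3) * e5 * p["w35"]
    e4 = o4 * (1 - o4) * e5 * p["w45"]
    updated = {
        "w13": p["w13"] + alpha * e3 * x1,
        "w14": p["w14"] + alpha * e4 * x1,
        "w23": p["w23"] + alpha * e3 * x2,
        "w24": p["w24"] + alpha * e4 * x2,
        "w35": p["w35"] + alpha * e5 * o3,
        "w45": p["w45"] + alpha * e5 * o4,
        "t3": p["t3"] + alpha * e3,
        "t4": p["t4"] + alpha * e4,
        "t5": p["t5"] + alpha * e5,
    }
    return updated, target - o5

# Initial values from Table 8; training sample X1 = 1, X2 = 0, desired output 1.
params = {"w13": -0.2, "w14": 0.4, "w23": 0.2, "w24": -0.3,
          "w35": 0.2, "w45": -0.3, "t3": 0.4, "t4": 0.1, "t5": -0.3}
errors = []
for _ in range(2):
    params, error = train_step(params, 1, 0, 1)
    errors.append(error)
print([round(e, 3) for e in errors])
```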

PROBLEM:
Consider a network architecture with 4 input units and 2 output units, and four training
samples, each a vector of length 4.
Training samples
i1: (1, 1, 1, 0)
i2: (0, 0, 1, 1)
i3: (1, 0, 0, 1)
i4: (0, 0, 1, 0)
Output Units: Unit 1, Unit 2
Learning rate η(t) = 0.6
Initial weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]
Identify an algorithm that learns without supervision. How are the samples clustered?

Solution:
Use Self Organizing Feature Map (SOFM)

Iteration 1:
Training Sample X1: (1, 1, 1, 0)
Weight matrix
Unit 1: [0.2 0.8 0.5 0.1]
Unit 2: [0.3 0.5 0.4 0.6]

Compute the squared Euclidean distance between X1: (1, 1, 1, 0) and Unit 1 weights.

d² = (0.2 − 1)² + (0.8 − 1)² + (0.5 − 1)² + (0.1 − 0)²
= 0.94
Compute the squared Euclidean distance between X1: (1, 1, 1, 0) and Unit 2 weights.

d² = (0.3 − 1)² + (0.5 − 1)² + (0.4 − 1)² + (0.6 − 0)²
= 1.46
Unit 1 wins (smaller distance).
Update the weights of the winning unit
New Unit 1 weights = [0.2 0.8 0.5 0.1] + 0.6 ([1 1 1 0] − [0.2 0.8 0.5 0.1])
= [0.2 0.8 0.5 0.1] + 0.6 [0.8 0.2 0.5 −0.1]
= [0.2 0.8 0.5 0.1] + [0.48 0.12 0.30 −0.06]
= [0.68 0.92 0.80 0.04]

Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.3 0.5 0.4 0.6]

Iteration 2:
Training Sample X2: (0, 0, 1, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.3 0.5 0.4 0.6]

Compute the squared Euclidean distance between X2: (0, 0, 1, 1) and Unit 1 weights.

d² = (0.68 − 0)² + (0.92 − 0)² + (0.80 − 1)² + (0.04 − 1)²
= 2.2704
Compute the squared Euclidean distance between X2: (0, 0, 1, 1) and Unit 2 weights.

d² = (0.3 − 0)² + (0.5 − 0)² + (0.4 − 1)² + (0.6 − 1)²
= 0.86
Unit 2 wins
Update the weights of the winning unit
New Unit 2 weights = [0.3 0.5 0.4 0.6] + 0.6 ([0 0 1 1] − [0.3 0.5 0.4 0.6])
= [0.3 0.5 0.4 0.6] + 0.6 [−0.3 −0.5 0.6 0.4]
= [0.3 0.5 0.4 0.6] + [−0.18 −0.30 0.36 0.24]
= [0.12 0.2 0.76 0.84]

Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.12 0.2 0.76 0.84]

Iteration 3:
Training Sample X3: (1, 0, 0, 1)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.12 0.2 0.76 0.84]

Compute the squared Euclidean distance between X3: (1, 0, 0, 1) and Unit 1 weights.

d² = (0.68 − 1)² + (0.92 − 0)² + (0.80 − 0)² + (0.04 − 1)²
= 2.5104
Compute the squared Euclidean distance between X3: (1, 0, 0, 1) and Unit 2 weights.

d² = (0.12 − 1)² + (0.2 − 0)² + (0.76 − 0)² + (0.84 − 1)²
= 1.4176
Unit 2 wins
Update the weights of the winning unit
New Unit 2 weights = [0.12 0.2 0.76 0.84] + 0.6 ([1 0 0 1] − [0.12 0.2 0.76 0.84])
= [0.12 0.2 0.76 0.84] + 0.6 [0.88 −0.2 −0.76 0.16]
= [0.12 0.2 0.76 0.84] + [0.528 −0.12 −0.456 0.096]
= [0.648 0.08 0.304 0.936] ≈ [0.65 0.08 0.3 0.94]

Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.65 0.08 0.3 0.94]

Iteration 4:
Training Sample X4: (0, 0, 1, 0)
Weight matrix
Unit 1: [0.68 0.92 0.80 0.04]
Unit 2: [0.65 0.08 0.3 0.94]

Compute the squared Euclidean distance between X4: (0, 0, 1, 0) and Unit 1 weights.

d² = (0.68 − 0)² + (0.92 − 0)² + (0.80 − 1)² + (0.04 − 0)²
= 1.3504
Compute the squared Euclidean distance between X4: (0, 0, 1, 0) and Unit 2 weights.

d² = (0.65 − 0)² + (0.08 − 0)² + (0.3 − 1)² + (0.94 − 0)²
= 1.8025
Unit 1 wins
Update the weights of the winning unit
New Unit 1 weights = [0.68 0.92 0.80 0.04] + 0.6 ([0 0 1 0] − [0.68 0.92 0.80 0.04])
= [0.68 0.92 0.80 0.04] + 0.6 [−0.68 −0.92 0.2 −0.04]
= [0.68 0.92 0.80 0.04] + [−0.408 −0.552 0.12 −0.024]
= [0.272 0.368 0.92 0.016]

Unit 1: [0.272 0.368 0.92 0.016]
Unit 2: [0.65 0.08 0.3 0.94]

The best matching unit for each of the samples is:

X1: (1, 1, 1, 0) → Unit 1
X2: (0, 0, 1, 1) → Unit 2
X3: (1, 0, 0, 1) → Unit 2
X4: (0, 0, 1, 0) → Unit 1

This process is continued for many epochs until the feature map no longer changes.
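
One epoch of this competitive learning procedure can be sketched as follows. As in the worked example, only the winning unit is updated; a full SOFM would also update the winner's neighbours and usually decay the learning rate, both of which are omitted here as a simplifying assumption. Function and variable names are illustrative.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_epoch(weights, samples, eta=0.6):
    """Present each sample once; update the winner in place and return the winning unit numbers."""
    winners = []
    for x in samples:
        distances = [squared_distance(w, x) for w in weights]
        win = distances.index(min(distances))
        # Move the winning weight vector towards the sample: w = w + eta * (x - w)
        weights[win] = [w + eta * (xi - w) for w, xi in zip(weights[win], x)]
        winners.append(win + 1)  # report units as 1-based, as in the text
    return winners

weights = [[0.2, 0.8, 0.5, 0.1], [0.3, 0.5, 0.4, 0.6]]
samples = [(1, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 1), (0, 0, 1, 0)]
winners = train_epoch(weights, samples)
print(winners)
print([[round(w, 3) for w in unit] for unit in weights])
```

Calling `train_epoch` repeatedly on the same `weights` list corresponds to running further epochs.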
Learning Rules
Learning in NN is performed by adjusting the network weights in order to minimize the
difference between the desired and estimated output.

Delta Learning Rule and Gradient Descent

Developed by Widrow and Hoff, the delta rule is one of the most common learning rules.
It is a supervised learning rule.
The delta rule is derived from the gradient descent method, and back-propagation generalizes it to multilayer networks.
It is also called the continuous perceptron learning rule, and in its generalized form it can handle problems that are not linearly separable.
It updates the connection weights using the difference between the target and the output
value. It is the least mean square (LMS) learning algorithm.
The delta difference is measured by an error function, also called a cost function.
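
A minimal sketch of the delta (LMS) rule for a single linear unit is given below. The training data, the learning rate of 0.1 and the number of epochs are hypothetical values chosen for illustration; the data follow y = 2x − 1, with a constant input of 1 acting as the bias input.

```python
def delta_rule_step(weights, x, target, alpha=0.1):
    # Linear unit: output is the weighted sum of the inputs.
    output = sum(w * xi for w, xi in zip(weights, x))
    error = target - output
    # Delta rule: w_i = w_i + alpha * (target - output) * x_i
    return [w + alpha * error * xi for w, xi in zip(weights, x)], error

# Hypothetical data for y = 2*x - 1; the first input is the constant bias input 1.
data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]

weights = [0.0, 0.0]
for _ in range(200):
    for x, target in data:
        weights, _ = delta_rule_step(weights, x, target)
print([round(w, 3) for w in weights])
```

Each step moves the weights a small distance down the gradient of the squared error for the current sample, which is why the rule is described as a gradient descent method.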

TYPES OF ANN
1. Feed Forward Neural Network
2. Fully connected Neural Network
3. Multilayer Perceptron
4. Feedback Neural Network
Feed Forward Neural Network:
In its simplest form, a Feed-Forward Neural Network is a single layer perceptron. A sequence of inputs enters the layer and is
multiplied by the weights in this model. The weighted input values are then summed together to form a total.
If the sum of the values is more than a predetermined threshold, which is normally set at zero, the output
value is usually 1, and if the sum is less than the threshold, the output value is usually -1.
The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification.
The model may or may not contain hidden layers, and information flows only in the forward direction, with no feedback loops.
Based on the number of hidden layers, they are further classified into single-layered and multilayered feed
forward networks.

Fully connected Neural Network:

• A fully connected neural network consists of a series of fully connected layers that connect
every neuron in one layer to every neuron in the other layer.

• The major advantage of fully connected networks is that they are "structure agnostic", i.e. there
are no special assumptions needed to be made about the input.
Multilayer Perceptron:
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one
output layer with a single node for each output, and any number of hidden layers, each of which
can have any number of nodes.
Information flows forward from input to output, while errors are propagated backward during training.
The weight adjustment training is done via backpropagation.
Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation
function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid
formula.

Feedback Neural Network:

Feedback networks, also known as recurrent neural networks or interactive neural networks, are
deep learning models in which information can also flow in the backward direction.
They allow feedback loops in the network. Feedback networks are dynamic in nature, powerful, and
can become very complicated at some stages of execution.
Neuronal connections can be made in any way.
RNNs can process input sequences of different lengths by using their internal state, which
represents a form of memory.
They can therefore be used for applications like speech recognition or handwriting recognition.
Advantages and Disadvantages of ANN

Limitations of ANN
Challenges of Artificial Neural Networks
BCS602 | MACHINE LEARNING| VTU Belagavi.

Module-5

Chapter – 01 - Clustering Algorithms

Introduction to Clustering Approaches

• Cluster analysis is the fundamental task of unsupervised learning. Unsupervised learning involves exploring the given dataset.

• Cluster analysis is a technique of partitioning a collection of unlabelled objects that have many attributes into meaningful disjoint groups or clusters.

• This is done using a trial and error approach, as there are no supervisors available as in classification.

• The characteristic of clustering is that the objects within a cluster are similar to each other, while they differ significantly from the objects in other clusters.

• The input for cluster analysis is examples or samples. These are known as objects, data points or data instances.

• All these terms mean the same and are used interchangeably in this chapter. All the samples or objects with no labels associated with them are called unlabelled.

• The output is the set of clusters (or groups) of similar data, if it exists in the input.

• For example, Figure 13.1(a) shows data points or samples with two features shown in different shaded samples, and Figure 13.1(b) shows the manually drawn ellipse to indicate the clusters formed.


Visual identification of clusters in this case is easy as the examples have only two features.

But, when examples have more features, say 100, then clustering cannot be done manually and
automatic clustering algorithms are required.

Also, automating the clustering process is desirable because these tasks are difficult, and often
almost impossible, for humans. All clusters are represented by centroids.

Example: If the input examples or data are (3, 3), (2, 6) and (7, 9), then the centroid
is given as ((3 + 2 + 7)/3, (3 + 6 + 9)/3) = (4, 6).

The clusters should not overlap and every cluster should represent only one class. Therefore,
clustering algorithms use trial and error method to form clusters that can be converted to labels.

Difference between Clustering & Classification


Applications of Clustering

Challenges of Clustering Algorithms

High-Dimensional Data

o As the number of features increases, clustering becomes difficult.

Scalability Issue

o Some algorithms perform well for small datasets but fail for large-scale data.

Unit Inconsistency

o Different measurement units (e.g., kg vs. pounds) can create problems.

Proximity Measure Design

o Choosing an appropriate distance metric is crucial for accurate clustering.


Advantages and Disadvantages of Clustering Algorithms

Proximity Measures

Proximity measures determine similarity or dissimilarity among objects.

Distance measures (dissimilarity) indicate how different objects are.

Similarity measures indicate how alike objects are.

Property: More distance → Less similarity, and vice versa.

Properties of Distance Measures (Metric Conditions)

Types of Distance Measures Based on Data Types

Quantitative Variables


Binary Attributes

Categorical Variables

Distance is 1 if different, 0 if same.

Example: Gender (Male, Female) → Distance = 1


Ordinal Variables

Vector-Based Distance Measures (For Text & Documents)

Cosine Similarity

o Measures angle between two vectors.

o Formula: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the two vectors.
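
Cosine similarity divides the dot product of two vectors by the product of their lengths. The function below is a minimal sketch; the term-count vectors in the usage lines are hypothetical.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical term-count vectors for two short documents.
doc1 = [3, 0, 1, 2]
doc2 = [1, 0, 0, 1]
print(round(cosine_similarity(doc1, doc2), 3))
```

Vectors pointing in the same direction score 1 regardless of their length, which is why the measure suits documents of different sizes.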


Distance Measures

Hierarchical Clustering Algorithms

Overview

• Produces a nested partition of objects with hierarchical relationships.

• Represented using a dendrogram.
• Two main categories: Agglomerative and Divisive methods.

Types of Hierarchical Clustering

1. Agglomerative Methods (Bottom-Up)

o Each sample starts as an individual cluster.
o Clusters are merged iteratively until one cluster remains.
o Once a cluster is formed, it cannot be undone (irreversible).
2. Divisive Methods (Top-Down)
o Starts with a single cluster containing all data points.
o Splits iteratively into smaller clusters.
o Continues until each sample becomes its own cluster.


Agglomerative Clustering Techniques

Single Linkage (MIN Algorithm)

o Merges clusters based on the smallest distance between two points from different clusters.
o Related to the Minimum Spanning Tree (MST).
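
A naive sketch of agglomerative clustering with single linkage is shown below. It stops when a requested number of clusters remains, which is an assumption made for the example (the full method continues until one cluster is left and records the merges as a dendrogram). The sample points are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def single_linkage(points, num_clusters):
    # Start with every sample as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: smallest distance between any two members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters; the merge is never undone.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
result = single_linkage(points, 2)
print(result)
```

Replacing `min` with `max` in the linkage distance would give complete linkage instead.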

Complete Linkage (MAX or Clique Algorithm)


Average Linkage Algorithm

Mean-Shift Clustering Algorithm

• Non-parametric and hierarchical clustering technique.

• Also known as mode-seeking or sliding window algorithm.
• No prior knowledge of cluster count or shape required.
• Moves towards high-density regions in data using a kernel function (e.g., Gaussian
window).


Advantages of Mean-Shift Clustering

• No model assumptions

• Suitable for all non-convex shapes

• Only one parameter of the window, that is, bandwidth, is required

• Robust to noise

• No issues of local minima or premature termination

Disadvantages of Mean-Shift Clustering

• Selecting the bandwidth is a challenging task. If it is too large, many clusters are missed. If
it is too small, many points are missed and convergence becomes a problem.

The number of clusters cannot be specified and user has no control over this parameter.

Partitional Clustering Algorithm

• k-means is a widely used partitional clustering algorithm.

• The user specifies k, the number of clusters.
• Assumes non-overlapping clusters.
• Works well for circular or spherical clusters.

Process of k-means Algorithm

1. Initialization
o Select k initial cluster centers (randomly or using prior knowledge).
o Normalize data for better performance.
2. Assignment of Data Points
o Assign each data point to the nearest centroid based on Euclidean distance.
3. Update Centroids


o Compute the mean vector of assigned points to update cluster centroids.

o Repeat this process until no changes occur in cluster assignments.
4. Termination
o The process stops when cluster assignments remain unchanged.
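
The four steps above can be sketched in plain Python. The sample points and the initial centroids are hypothetical, data normalization is skipped, and `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Termination: stop when the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
initial = [(1.0, 1.0), (8.0, 8.0)]
final_centroids, labels = kmeans(points, initial)
print(final_centroids, labels)
```

Different initial centroids can lead to different final clusters, so in practice the algorithm is often run several times.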

Mathematical Optimization

Advantages

1. Simple and easy to implement.

2. Efficient for small to medium datasets.


Disadvantages

1. Sensitive to initialization – different initial points may lead to different results.

2. Time-consuming for large datasets – requires multiple iterations.

Choosing the Value of k

• No fixed rule for selecting k.

• Use Elbow Method:
o Run k-means with different values of k.
o Plot Within Cluster Sum of Squares (WCSS) vs. k.
o The optimal k is at the "elbow" where the curve flattens.

Computational Complexity

O(nkId), where:

o n = number of samples
o k = number of clusters
o I = number of iterations
o d = number of attributes

Density-based Methods

• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm.
• Clusters are dense regions of data points separated by areas of low density (noise).
• Works well for arbitrary-shaped clusters and datasets with noise.


Uses two parameters:

1. ε (epsilon) – Neighborhood radius.

2. m (minPts) – Minimum number of points within ε to form a cluster.

Types of Points in DBSCAN

1. Core Point
o A point with at least m points in its ε-neighborhood.
2. Border Point
o Has fewer than m points in its ε-neighborhood but is adjacent to a core point.
3. Noise Point (Outlier)
o Neither a core point nor a border point.
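
The three point types can be identified with a short function. This sketch assumes that the ε-neighborhood count includes the point itself, which is a common convention but an assumption here; the sample points and parameter values are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def classify_points(points, eps, min_pts):
    # Indices of all points within eps of each point (the point itself included).
    neighbors = {
        i: [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for i, p in enumerate(points)
    }
    core = {i for i, near in neighbors.items() if len(near) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

A full DBSCAN implementation would then grow clusters outward from the core points; this sketch only labels the points.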

Density Connectivity Measures

1. Direct Density Reachability

o Point X is directly reachable from Y if:
 X is in the ε-neighborhood of Y.
 Y is a core point.
2. Densely Reachable


o X is densely reachable from Y if there exists a chain of core points linking them.
3. Density Connected
o X and Y are density connected if they are both densely reachable from a
common core point Z.

Advantages of DBSCAN

1. Can detect arbitrary-shaped clusters.

2. Robust to noise and outliers.
3. Does not require specifying the number of clusters (k-means does).

Disadvantages of DBSCAN

1. Sensitive to ε and m parameters – Poor parameter choice can affect results.

2. Fails in datasets with varying density – A single ε may not work for all clusters.
3. Computationally expensive for high-dimensional data.

Grid-based Approach

• Grid-based clustering partitions space into a grid structure and fits data into cells for
clustering.
• Suitable for high-dimensional data.
• Uses subspace clustering, dense cells, and monotonicity property.

Concepts

Subspace Clustering

o Clusters are formed using a subset of features (dimensions) rather than all
attributes.
o Useful for high-dimensional data like gene expression analysis.

o CLIQUE (Clustering in Quest) is a widely used grid-based subspace clustering algorithm.

Concept of Dense Cells

o CLIQUE partitions dimensions into intervals (cells).

o A cell is dense if its data point density exceeds a threshold.
o Dense cells are merged to form clusters.

Monotonicity Property


o Uses anti-monotonicity (Apriori property):

• If a k-dimensional cell is dense, then all (k-1) dimensional projections must
also be dense.
• If a lower-dimensional cell is not dense, then higher-dimensional cells
containing it are also not dense.
o Similar to association rule mining in frequent pattern mining.

Advantages of CLIQUE

1. Insensitive to input order of objects.

2. No assumptions about data distribution.
3. Finds high-density clusters in subspaces of high-dimensional data.

Disadvantage of CLIQUE

• Tuning grid parameters (grid size, density threshold) is difficult.

• Finding the optimal threshold to classify a cell as dense is challenging.


Chapter – 02 - Reinforcement Learning

Overview of Reinforcement Learning

What is Reinforcement Learning?

• Reinforcement Learning (RL) is a machine learning paradigm that mimics how humans and animals learn through experience.
• Humans interact with the environment, receive feedback (rewards or penalties), and
adjust their behavior accordingly.
• Example: A child touching fire learns to avoid it after experiencing pain (negative
reinforcement).

How RL Works in Machines

• RL simulates real-world scenarios for a computer program (agent) to learn by trial and
error.
• The agent executes actions, receives positive or negative rewards, and optimizes its
future actions based on these experiences.

Types of Reinforcement Learning

1. Positive Reinforcement Learning

o Rewards encourage good behavior (reinforce correct actions).
o Example: A robot gets +10 points for reaching a goal successfully.
o Effect: Increases the likelihood of repeating the rewarded action.
2. Negative Reinforcement Learning
o Negative rewards discourage unwanted actions.
o Example: A game agent receives -10 points for stepping into a danger zone.
o Effect: Helps the agent learn to avoid negative outcomes.


Characteristics of RL

• Sequential Decision-Making: The agent makes a series of decisions to maximize total rewards.
• Trial and Error Learning: The agent learns by exploring different actions and their
consequences.
• No Supervised Labels: Unlike supervised learning, RL does not require labeled data; it
learns from experience.

Applications of Reinforcement Learning

• Robotics: Teaching robots to walk, grasp objects, or perform complex tasks.

• Gaming: AI agents in chess, Go, and video games (e.g., AlphaGo, OpenAI Five).
• Autonomous Vehicles: Self-driving cars learn optimal driving strategies.
• Finance: AI-based trading strategies for stock markets.
• Healthcare: Personalized treatment plans based on patient responses.

Scope of Reinforcement Learning

Reinforcement Learning (RL) is well-suited for decision-making problems in dynamic and

uncertain environments. It excels in cases where an agent must learn through trial and error
and optimize its actions based on delayed rewards.

Situations Where RL Can Be Used

Pathfinding and Navigation

o Consider a grid-based game where a robot must navigate from a starting node (E) to a goal
node (G) by choosing the optimal path.
o RL can learn the best route by exploring different paths and receiving feedback on their
efficiency.


o In obstacle-based games, RL can identify safe paths while avoiding dangerous zones.

Dynamic Decision-Making with Uncertainty

o RL is useful in environments where not all information is known upfront.

o It is not suitable for tasks like object detection, where a classifier with complete
labeled data performs better.

Characteristics of Reinforcement Learning

1. Sequential Decision-Making
o In RL, decisions are made in steps, and each step influences future choices.
o Example: In a maze game, a wrong turn can lead to failure.
2. Delayed Feedback
o Rewards are not always immediate; the agent may need to take multiple steps before
receiving feedback.
3. Interdependence of Actions
o Each action affects the next set of choices, meaning an incorrect move can have
long-term consequences.
4. Time-Related Decisions
o Actions are taken in a specific sequence over time, affecting the final
outcome.

Challenges in Reinforcement Learning

Reward Design

o Setting the right reward values is crucial. Incorrectly designed rewards may lead the agent
to learn undesired behavior.

Absence of a Fixed Model


o Some environments, like chess, have fixed rules, but many real-world problems lack
predefined models.
o Example: Training a self-driving car requires simulations to generate experiences.

Partial Observability

o Some environments, like weather prediction, involve uncertainty because complete state
information is unavailable.

High Computational Complexity

o Games like Go involve a huge state space, making RL training time-consuming.

o More possible actions → More training time needed.

Applications of Reinforcement Learning

1. Industrial Automation
o Optimizing robot movements for efficiency.
2. Resource Management
o Allocating resources in data centers and cloud computing.
3. Traffic Light Control
o Reducing congestion by adjusting signal timings dynamically.
4. Personalized Recommendation Systems
o Used in news feeds, e-commerce, and streaming services (e.g., Netflix
recommendations).
5. Online Advertisement Bidding
o Optimizing ad placements for maximum engagement.
6. Autonomous Vehicles
o RL helps in training self-driving cars to navigate safely.
7. Game AI (Chess, Go, Dota 2, etc.)
o AI models like AlphaGo use RL to master complex games.
8. DeepMind Applications

o AI systems that generate programs, images, and optimize machine learning models.

Reinforcement Learning as Machine Learning

Reinforcement Learning (RL) is a distinct branch of machine learning that differs significantly
from supervised learning.

While supervised learning depends on labeled data, reinforcement learning learns through
interaction with the environment, making decisions based on trial and error.

Why Is RL Necessary?

Some tasks cannot be solved using supervised learning due to the absence of a labeled training
dataset. For example:

 Chess & Go: There is no dataset with all possible game moves and their
outcomes. RL allows the agent to explore and improve over time.
 Autonomous Driving: The car must learn from real-world experiences rather than
relying on a fixed dataset.

Challenges in Reinforcement Learning Compared to Supervised Learning

 More complex decision-making since every action affects future outcomes.

 Longer training times due to trial-and-error learning.
 Delayed rewards, making it difficult to attribute success or failure to a specific action.


Differences between Supervised Learning and Reinforcement Learning

Components of Reinforcement Learning

Reinforcement Learning (RL) is based on an agent interacting with an environment to learn an
optimal strategy through trial and error.

Basic Components of RL


1. Agent – The decision-maker (e.g., a robot, self-driving car, AI player in a game).

2. Environment – The external world where the agent interacts (e.g., a game board, real-
world traffic).
3. State (S) – A representation of the environment at a specific time.
4. Actions (A) – The possible choices available to the agent.
5. Rewards (R) – The feedback signal received by the agent for taking an action.
6. Policy (π) – The agent’s strategy for selecting actions based on states.
7. Episodes – The sequence of states, actions, and rewards from the start state to the goal
state.

Types of RL Problems
Learning Problems

 Unknown environment – The agent learns by trial and error.

 Goal – Improve the policy through interaction.
 Example – A robot navigating through an unknown maze.

Planning Problems

 Known environment – The agent can compute and improve the policy using a model.
 Example – Chess AI that plans its moves based on game rules.

Environment and Agent

 The environment contains all elements the agent interacts with, including
obstacles, rewards, and state transitions.
 The agent makes decisions and performs actions to maximize rewards.

Example

In self-driving cars,


 The environment includes roads, traffic, and signals.

 The agent is the AI system making driving decisions.

States and Actions

 State (S) – Represents the current situation.

 Action (A) – Causes a transition from one state to another.

Example (Navigation)

In a grid-based game, states represent positions (A, B, C, etc.), and actions are movements (UP,
DOWN, LEFT, RIGHT).

Types of States

1. Start State – Where the agent begins.

2. Goal State – The target state with the highest reward.
3. Non-terminal States – Intermediate steps between start and goal.


Types of Episodes

 Episodic – Has a definite start and goal state (e.g., solving a maze).
 Continuous – No fixed goal state; the task continues indefinitely (e.g., stock
trading).

Policies in RL

A policy (π) is the strategy used by the agent to choose actions.

Types of Policies

Choosing the Best Policy

 The optimal policy is the one that maximizes cumulative expected rewards.

Rewards in RL

 Immediate Reward (r) – The instant feedback for an action.

 Total Reward (G) – The sum of all rewards collected during an episode.
 Long-term Reward – The cumulative future reward.


Discount Factor (γ)
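As a minimal sketch of how a discount factor weights future rewards, the snippet below assumes the standard definition of the discounted return, \( G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \) with \( 0 \le \gamma \le 1 \). The function name and the reward list are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    g = 0.0
    # Work backwards so that each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards collected during one short episode.
rewards = [1.0, 1.0, 1.0]

undiscounted = discounted_return(rewards, 1.0)  # gamma = 1: plain total reward
discounted = discounted_return(rewards, 0.5)    # gamma = 0.5: later rewards count less

print(undiscounted, discounted)
```

With γ close to 1 the agent values distant rewards almost as much as immediate ones; with γ close to 0 it behaves short-sightedly.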

RL Algorithm Categories

 Model-Based RL – Uses a predefined model (e.g., Chess AI).

 Model-Free RL – Learns by trial and error (e.g., a robot navigating an unknown
environment).

Markov Decision Process

A Markov Chain is a stochastic process that satisfies the Markov property.

It consists of a sequence of random variables where the probability of transitioning to the next
state depends only on the current state and not on the past states.


Example: University Transition

Consider two universities:

 80% of students from University A move to University B for a master's degree, while
20% remain in University A.
 60% of students from University B move to University A, while 40% remain in
University B.

This can be represented as a Markov Chain, where:

 States represent the universities.

 Edges denote the probability of transitioning between states.

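The transition matrix P at time t, with rows and columns ordered (A, B), is defined as:

\[ P = \begin{pmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{pmatrix} \]

Each row represents a probability distribution, meaning the sum of elements in each row equals
1.

Probability Prediction

Let the initial distribution be the row vector \( x_0 \).

The state distribution after one time step is:

\[ x_1 = x_0 P \]

After two time steps:

\[ x_2 = x_1 P = x_0 P^2 \]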


The system stabilizes over time, reflecting the equilibrium distribution.
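The university example can be checked numerically. The sketch below uses the transition probabilities stated above (rows and columns ordered A, B). The starting distribution is an assumption made only for this illustration, since any valid starting distribution leads to the same equilibrium for this chain.

```python
# Transition matrix from the university example: rows are the current
# state, columns the next state, in the order (A, B).
P = [
    [0.2, 0.8],  # from A: 20% stay in A, 80% move to B
    [0.6, 0.4],  # from B: 60% move to A, 40% stay in B
]


def step(dist, matrix):
    """Return the distribution after one time step (row vector times matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Assumed initial distribution (not taken from the notes): everyone starts in A.
dist = [1.0, 0.0]
history = [dist]
for _ in range(50):
    dist = step(dist, P)
    history.append(dist)

print("after 1 step: ", history[1])
print("after 2 steps:", history[2])
print("after 50 steps:", history[-1])
```

Solving \( \pi = \pi P \) by hand gives the equilibrium \( \pi_A = 3/7 \), \( \pi_B = 4/7 \), which is the distribution the repeated multiplication approaches.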

Markov Decision Process (MDP)

An MDP extends a Markov Chain by incorporating actions and rewards. It consists of:

1. Set of states
2. Set of actions
3. Transition probability function
4. Reward function
5. Policy
6. Value function

Markov Assumption

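The Markov property states that the probability of reaching a state \( s_{t+1} \) and receiving a
reward \( r_{t+1} \) depends only on the previous state \( s_t \) and action \( a_t \):

\[ P(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1}, r_{t+1} \mid s_t, a_t) \]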

MDP Process

1. Observe the current state \( s_t \).
2. Choose an action \( a_t \).
3. Receive a reward \( r_{t+1} \).
4. Move to the next state \( s_{t+1} \).
5. Repeat to maximize cumulative rewards.

State Transition Probability


The probability of moving from state \( s \) to \( s' \) after taking action \( a \) is given by:

\[ P(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a) \]

This forms a state transition matrix, where each row represents transition probabilities from one
state to another.

Expected Reward

The expected reward for an action \( a \) in state \( s \) is given by:

\[ R(s, a) = \mathbb{E}[\, r_{t+1} \mid s_t = s, a_t = a \,] \]
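To make the transition probability and expected reward concrete, here is a small sketch on a made-up two-state MDP. Every state name, action name, probability and reward below is hypothetical. The only assumption carried over from the text is that the outcomes of each state-action pair form a probability distribution.

```python
# Hypothetical two-state MDP; every number here is illustrative.
# transitions[(state, action)] -> list of (probability, next_state, reward)
transitions = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"): [(0.8, "s1", 5.0), (0.2, "s0", -1.0)],
    ("s1", "stay"): [(1.0, "s1", 1.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}


def transition_prob(s, a, s_next):
    """P(s' | s, a): total probability of landing in s_next."""
    return sum(p for p, nxt, _ in transitions[(s, a)] if nxt == s_next)


def expected_reward(s, a):
    """R(s, a): rewards averaged over the possible outcomes of (s, a)."""
    return sum(p * r for p, _, r in transitions[(s, a)])


# Each (state, action) row must be a probability distribution.
for outcomes in transitions.values():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-12

print(transition_prob("s0", "go", "s1"))
print(expected_reward("s0", "go"))
```

For the pair ("s0", "go") the expected reward works out to 0.8 × 5 + 0.2 × (−1) = 3.8.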

Training and Testing of RL Systems

Once an MDP is modeled, the system undergoes:

1. Training: The agent repeatedly interacts with the environment, adjusting
parameters based on rewards.
2. Inference: A trained model is deployed to make decisions in real-time.
3. Retraining: When the environment changes, the model is retrained to adapt and
improve performance.

Goal of MDP

The agent's objective is to maximize total accumulated rewards over time by following an
optimal policy.


Multi-Arm Bandit Problem and Reinforcement Problem Types

Reinforcement Learning Overview

Reinforcement learning (RL) uses trial and error to learn a series of actions that maximize the
total reward. RL consists of two fundamental sub-problems:

Prediction (Value Estimation):

o The goal is to predict the total reward (return), also known as policy evaluation or value
estimation.
o This requires the formulation of a function called the state-value function.
o The estimation of the state-value function can be performed using Temporal
Difference (TD) Learning.

Policy Improvement:

o The objective is to determine actions that maximize returns.

o This process is known as policy improvement.
o Both prediction and policy improvement can be combined into policy iteration, where these
steps are used alternately to find an optimal policy.

Multi-Arm Bandit Problem

A commonly encountered problem in reinforcement learning is the multi-arm bandit problem
(or N-arm bandit problem).

Consider a hypothetical casino with a robotic arm that activates a 5-armed slot machine. When a
lever is pulled, the machine returns a reward within a specified range (e.g., $1 to
$10).

The challenge is that each arm provides rewards randomly within this range.


Objective:

Given a limited number of attempts, the goal is to maximize the total reward by selecting the best
lever.

A logical approach is to determine which lever has the highest average reward and use it repeatedly.

Formalization:

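Given k attempts on an N-arm slot machine, with rewards \( r_1, r_2, \ldots, r_k \) obtained from
action \( a \), the expected reward (action-value function) is:

\[ Q(a) = \frac{r_1 + r_2 + \cdots + r_k}{k} \]

The best action is defined as:

\[ a^* = \arg\max_a Q(a) \]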

This indicates the action that returns the highest average reward and is used as an indicator of
action quality.

Example:

If a slot machine is chosen five times and returns rewards \( r_1, \ldots, r_5 \), the quality of
this action is:

\[ Q(a) = \frac{r_1 + r_2 + r_3 + r_4 + r_5}{5} \]
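A short sketch of the sample-average estimate follows. The reward lists are invented for illustration (the notes only say that rewards fall between $1 and $10), and the lever names are hypothetical.

```python
def action_value(rewards):
    """Sample-average estimate: mean of the rewards observed for one action."""
    return sum(rewards) / len(rewards)


# Hypothetical rewards (in dollars) from pulling each lever five times.
observed = {
    "lever_1": [2, 4, 3, 5, 1],
    "lever_2": [7, 8, 6, 9, 10],
    "lever_3": [5, 5, 4, 6, 5],
}

# Estimate the quality of every lever, then pick the one with the highest mean.
q = {lever: action_value(r) for lever, r in observed.items()}
best = max(q, key=q.get)

print(q)
print("best action:", best)
```

Here the means are 3.0, 8.0 and 5.0, so the second lever is the best action under these made-up observations.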


Exploration vs Exploitation and Selection Policies

In reinforcement learning, an agent must decide how to select actions:

Exploration:

o Tries all actions, even if they lead to sub-optimal decisions.

o Useful in games where exploring different actions provides better long-term
rewards.
o Risky but informative.

Exploitation:

o Uses the current best-known action repeatedly.

o Focuses on short-term gains.
o Simple but often sub-optimal.

A balance between exploration and exploitation is crucial for optimal decision-making.

Selection Policies

Greedy Method

 Picks the best-known action at any given time.

 Based solely on exploitation.
 Risk: It may miss out on exploring better options.

ε-Greedy Method

 Balances exploration and exploitation.

 With probability ε, the agent explores a random action.
 With probability 1 - ε, it selects the best-known action.
 ε ranges from 0 to 1 (e.g., ε = 0.1 means a 10% chance of exploration).
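The ε-greedy rule can be sketched on a 5-armed bandit as follows. The payout ranges in ARMS are hypothetical (each arm pays uniformly within its own sub-range of $1 to $10), and the incremental sample-average update is one common way to maintain the estimates, not necessarily the one intended by the notes.

```python
import random

# Hypothetical arms: each pays a uniform reward within (low, high) dollars.
ARMS = [(1, 4), (3, 8), (6, 10), (1, 10), (2, 5)]


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: random arm
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def run_bandit(arms, epsilon, pulls, seed=0):
    rng = random.Random(seed)
    q = [0.0] * len(arms)    # estimated average reward per arm
    counts = [0] * len(arms)  # number of pulls per arm
    total = 0.0
    for _ in range(pulls):
        a = epsilon_greedy(q, epsilon, rng)
        low, high = arms[a]
        reward = rng.uniform(low, high)
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]  # incremental sample average
        total += reward
    return q, counts, total


q, counts, total = run_bandit(ARMS, epsilon=0.1, pulls=1000)
print("estimates:", [round(x, 2) for x in q])
print("pull counts:", counts)
```

Setting epsilon to 0 recovers the purely greedy method, while epsilon equal to 1 selects arms uniformly at random.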


Reinforcement Learning Agent Types

An RL agent can be classified into different approaches based on how it learns:

1. Value-Based Approaches
o Optimize the value function V(s), which represents the maximum expected future reward
from a given state.
o Use a discount factor to weight near-term rewards more heavily than distant ones.
2. Policy-Based Approaches
o Focus on finding the optimal policy π*, a function that maps states to actions.
o Rather than estimating values, they directly learn which action to take.
3. Actor-Critic Methods (Hybrid Approaches)
o Combine value-based and policy-based methods.
o The actor updates the policy, while the critic evaluates it.
4. Model-Based Approaches
o Create a model of the environment (e.g., using Markov Decision Processes
(MDPs)).
o Use simulations to plan the best actions.
5. Model-Free Approaches


o No predefined model of the environment.

o Use methods like Temporal Differencing (TD) Learning and Monte Carlo
methods to estimate values from experience.

Reinforcement Algorithm Selection

The choice of a reinforcement learning algorithm depends on factors such as:

 Availability of models
 Nature of updates (incremental vs. batch learning)
 Exploration vs. exploitation trade-offs
 Computational efficiency

Model-based Learning

Passive Learning refers to a model-based environment, where the environment is known. This
means that for any given state, the next state and action probability distribution are known.

Markov Decision Process (MDP) and Dynamic Programming are powerful tools for solving
reinforcement learning problems in this context.


The mathematical foundation for passive learning is provided by MDP. These model-based
reinforcement learning problems can be solved using dynamic programming after constructing
the model with MDP.

The primary objective in reinforcement learning is to take an action a that transitions the system
from the current state to the end state while maximizing rewards. These rewards can be positive
or negative.

The goal is to maximize expected rewards by choosing the optimal policy:

\[ \pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}[\, G_t \mid s_t = s \,] \]

for all possible values of s at time t.

Policy and Value Functions

An agent in reinforcement learning has multiple courses of action for a given state. The way the
agent behaves is determined by its policy.

A policy is a distribution over all possible actions with probabilities assigned to each action.

Different actions yield different rewards. To quantify and compare these rewards, we use value
functions.

Value Function Notation

A value function summarizes possible future scenarios by averaging expected returns under a
given policy π.

It is a prediction of future rewards and computes the expected sum of future rewards for a given
state s under policy π:

\[ v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid s_t = s \,] \]

where v(s) represents the quality of the state based on a long-term strategy.

Example

If we have two states with values 0.2 and 0.9, the state with 0.9 is a better state to be in.

Value functions can be of two types:

 State-Value Function (for a state)

 State-Action Function (for a state-action pair)

State-Value Function

Denoted as v(s), the state-value function of an MDP is the expected return from state s under a
policy π:

\[ v_\pi(s) = \mathbb{E}_\pi\Big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big] \]

This function accumulates all expected rewards, potentially discounted over time, and helps
determine the goodness of a state.

The optimal state-value function is given by:

\[ v^*(s) = \max_{\pi} v_\pi(s) \]

Action-Value Function (Q-Function)

Apart from v(s), another function called the Q-function is used. This function returns a real
value indicating the total expected reward when an agent:

1. Starts in state s
2. Takes action a
3. Follows a policy π afterward

\[ q_\pi(s, a) = \mathbb{E}_\pi[\, G_t \mid s_t = s, a_t = a \,] \]


Bellman Equation

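Dynamic programming methods require a recursive formulation of the problem. The recursive
formulation of the state-value function is given by the Bellman equation:

\[ v_\pi(s) = \sum_{a} \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_\pi(s') \Big] \]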

Solving Reinforcement Problems

There are two main algorithms for solving reinforcement learning problems using conventional
methods:

1. Value Iteration
2. Policy Iteration

Value Iteration

Value iteration estimates v(s) iteratively:

\[ v_{k+1}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_k(s') \Big] \]


Algorithm

1. Initialize v(s) arbitrarily (e.g., all zeros).

2. Iterate until convergence:
o For each state s, update v(s) using the Bellman equation.
o Repeat until changes are negligible.

Policy Iteration

Policy iteration consists of two main steps:

1. Policy Evaluation
2. Policy Improvement

Policy Evaluation

Initially, for a given policy π, the algorithm starts with v(s) = 0 (no reward). The Bellman
equation is applied repeatedly to update v(s), and the process continues until v(s) converges to
the value function of π.

Policy Improvement

The policy improvement process is performed as follows:

1. Evaluate the current policy using policy evaluation.

2. Solve the Bellman equation for the current policy to obtain v(s).
3. Improve the policy by applying the greedy approach to maximize expected
rewards.
4. Repeat the process until the policy converges to the optimal policy.

Algorithm

1. Start with an arbitrary policy π.

2. Perform policy evaluation using Bellman’s equation.


3. Improve the policy greedily.

4. Repeat until convergence.
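A minimal sketch of these four steps on a made-up two-state MDP is shown below. The model, the discount factor and the choice of "always stay" as the arbitrary starting policy are assumptions for illustration only.

```python
GAMMA = 0.9  # assumed discount factor
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Hypothetical model: (state, action) -> list of (probability, next_state, reward)
transitions = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"): [(0.8, "s1", 0.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 1.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}


def q_value(v, s, a):
    return sum(p * (r + GAMMA * v[nxt]) for p, nxt, r in transitions[(s, a)])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: apply the Bellman equation until v stops changing."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            new = q_value(v, s, policy[s])
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def improve(v):
    """Policy improvement: act greedily with respect to the current values."""
    return {s: max(ACTIONS, key=lambda a: q_value(v, s, a)) for s in STATES}


def policy_iteration():
    policy = {s: "stay" for s in STATES}  # step 1: arbitrary starting policy
    while True:
        v = evaluate(policy)              # step 2: policy evaluation
        new_policy = improve(v)           # step 3: greedy improvement
        if new_policy == policy:          # step 4: stop when the policy is stable
            return policy, v
        policy = new_policy


policy, v = policy_iteration()
print(policy)
print(v)
```

On this toy model the loop settles after one improvement: the agent learns to "go" from s0 and to "stay" in s1.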

Model Free Methods

Model-free methods do not require complete knowledge of the environment. Instead, they learn
through experience and interaction with the environment.

The reward determination in model-free methods can be categorized into three formulations:

1. Episodic Formulation: Rewards are assigned based on the outcome of an entire episode. For
example, if a game is won, all actions in the episode receive a positive reward (+1). If lost, all
actions receive a negative reward (-1). However, this approach may unfairly penalize or
reward intermediate actions.
2. Continuous Formulation: Rewards are determined immediately after an action. An
example is the multi-armed bandit problem, where an immediate reward between $1
and $10 can be given after each action.
3. Discounted Returns: Long-term rewards are considered using a discount factor. This method
is often used in reinforcement learning algorithms.

Model-free methods primarily utilize the following techniques:

 Monte Carlo (MC) Methods

 Temporal Difference (TD) Learning

Monte-Carlo Methods

Monte Carlo (MC) methods do not assume any predefined model, making them purely
experience-driven. This approach is analogous to how humans and animals learn from interactions
with their environment.


Characteristics of Monte Carlo Methods:

 Experience is divided into episodes, where each episode is a sequence of states from a
starting state to a goal state.
 Episodes must terminate; regardless of the starting point, an episode must reach an
endpoint.
 Action-value functions are computed only after the completion of an episode, making
MC incremental on an episode-by-episode basis rather than step by step.
 MC methods compute rewards at the end of an episode to estimate maximum expected
future rewards.
 Empirical mean is used instead of expected return; the total return over multiple
episodes is averaged.
 Due to the non-stationary nature of environments, value functions are computed for a
fixed policy and revised using dynamic programming.

Monte Carlo Mean Value Computation:

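The mean value of a state is calculated as:

\[ V(s) = \frac{S(s)}{N(s)} \]

where S(s) is the total return accumulated over all visits to state s and N(s) is the number of
visits.

Incremental Monte Carlo Update:

The value function is updated incrementally using the following formula:

\[ V(s) \leftarrow V(s) + \frac{1}{N(s)} \big( G_t - V(s) \big) \]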
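A brief sketch of the incremental Monte Carlo update is given here, assuming the usual running-mean form in which each new return nudges the estimate by 1/N(s) of the difference. The state label and the three returns are hypothetical.

```python
def mc_update(value, count, state, episode_return):
    """Fold one observed return into the running mean for a state."""
    count[state] = count.get(state, 0) + 1
    old = value.get(state, 0.0)
    # V(s) <- V(s) + (G - V(s)) / N(s)
    value[state] = old + (episode_return - old) / count[state]


value, count = {}, {}
# Hypothetical returns observed for state "A" in three separate episodes.
for g in [4.0, 6.0, 8.0]:
    mc_update(value, count, "A", g)

print(value["A"], count["A"])
```

The running estimate moves 4.0, 5.0, 6.0, ending at the plain average of the three returns without storing them.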


Temporal Difference (TD) Learning

Temporal Difference (TD) Learning is an alternative to Monte Carlo methods. It is also a model-
free technique that learns from experience and interaction with the environment.

Characteristics of TD Learning:

 Bootstrapping Method: Updates are based on the current estimate and future reward.
 Incremental Updates: Unlike MC, which waits until the end of an episode, TD
updates values at each step.
 More Efficient: TD can learn before an episode ends, making it more sample-
efficient than MC methods.
 Used for Non-Stationary Problems: Suitable for environments where conditions
change over time.
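The step-by-step, bootstrapped update described above can be sketched as follows. This assumes the standard TD(0) rule, V(s) ← V(s) + α[r + γV(s') − V(s)], which the notes do not spell out. The states, reward and parameter values are hypothetical.

```python
def td0_update(v, s, r, s_next, alpha, gamma, terminal=False):
    """One TD(0) step: move V(s) toward the bootstrapped target."""
    # The target combines the observed reward with the current estimate of s_next.
    target = r if terminal else r + gamma * v.get(s_next, 0.0)
    td_error = target - v.get(s, 0.0)
    v[s] = v.get(s, 0.0) + alpha * td_error
    return td_error


# Hypothetical value table and a single observed transition A -> B with reward 0.5.
v = {"A": 0.0, "B": 1.0}
err = td0_update(v, "A", 0.5, "B", alpha=0.1, gamma=0.9)

print(err, v["A"])
```

The update happens immediately after the transition, so no complete episode is needed before learning starts.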

Differences between Monte Carlo and TD Learning


Eligibility Traces and TD(λ)

TD Learning can be accelerated using eligibility traces, which allow updates to be spread over
multiple states. This leads to a family of algorithms called TD(λ), where λ is the decay parameter
(0 ≤ λ ≤ 1):

 λ = 0: Only the previous prediction is updated.

 λ = 1: All previous predictions are updated.

By incorporating eligibility traces, TD(λ) provides an alternative short-term memory mechanism
to enhance learning efficiency.
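The effect of λ can be illustrated with a small sketch that uses accumulating eligibility traces, which is one common variant and is assumed here rather than taken from the notes. The two-step episode, the parameter values and the function name are hypothetical.

```python
def td_lambda_episode(v, episode, alpha, gamma, lam):
    """Run TD(lambda) with accumulating traces over one episode.

    episode: list of (state, reward, next_state); next_state is None at the end.
    """
    traces = {}
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else v.get(s_next, 0.0)
        td_error = r + gamma * v_next - v.get(s, 0.0)
        traces[s] = traces.get(s, 0.0) + 1.0  # mark the current state as eligible
        for state in list(traces):
            # Every eligible state shares in the update, then its trace decays.
            v[state] = v.get(state, 0.0) + alpha * td_error * traces[state]
            traces[state] *= gamma * lam
    return v


# Hypothetical episode: A -> B (reward 0), then B -> terminal (reward 1).
episode = [("A", 0.0, "B"), ("B", 1.0, None)]

v_lam0 = td_lambda_episode({}, episode, alpha=0.5, gamma=1.0, lam=0.0)
v_lam1 = td_lambda_episode({}, episode, alpha=0.5, gamma=1.0, lam=1.0)

print("lambda = 0:", v_lam0)
print("lambda = 1:", v_lam1)
```

With λ = 0 only state B, visited just before the reward, is credited. With λ = 1 the earlier state A receives the same credit in a single pass.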


Q-Learning

Q-Learning Algorithm

1. Initialize Q-table:
o Create a table Q(s,a) with states s and actions a.
o Initialize Q-values with random or zero values.
2. Set parameters:
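o Learning rate α (typically between 0 and 1).
o Discount factor γ (typically close to 1).
o Exploration-exploitation trade-off strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
o Start from an initial state s.
o Repeat until reaching a terminal state:
▪ Choose an action a in state s using the ε-greedy policy.
▪ Take action a; observe the reward r and the next state s'.
▪ Update the Q-value:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big] \]

▪ Set s ← s'.

4. End the training once convergence is reached (Q-values become stable).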

This iterative process helps the agent learn optimal Q-values, which guide it to take actions that
maximize rewards.
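A compact sketch of tabular Q-learning follows, using the standard update Q(s,a) ← Q(s,a) + α[r + γ max Q(s',·) − Q(s,a)]. The corridor environment, its reward of 1 at the goal, the random tie-breaking and all parameter values are assumptions made for illustration.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, goal at state 4
GOAL = N_STATES - 1
LEFT, RIGHT = 0, 1


def env_step(state, action):
    """Deterministic toy environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice([LEFT, RIGHT])
    best = max(q[state])
    return rng.choice([a for a in (LEFT, RIGHT) if q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # step 1: zero-initialized Q-table
    for _ in range(episodes):                   # step 3: repeat for each episode
        s, done = 0, False
        while not done:
            a = choose_action(q, s, epsilon, rng)
            s_next, r, done = env_step(s, a)
            # Off-policy target: best action value in the next state.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


q = q_learning()
policy = ["RIGHT" if row[RIGHT] > row[LEFT] else "LEFT" for row in q[:GOAL]]
print(policy)
```

After training, the greedy policy read off the table moves right in every non-terminal state, which is the shortest path to the goal in this corridor.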


SARSA Learning
SARSA Algorithm (State-Action-Reward-State-Action)

1. Initialize Q-table:
o Create a table Q(s,a) for all state-action pairs.
o Initialize Q-values with random or zero values.
2. Set parameters:
o Learning rate α (typically between 0 and 1).
o Discount factor γ (typically close to 1).
o Exploration-exploitation strategy (e.g., ε-greedy policy).
3. Repeat for each episode:
o Start from an initial state s.
o Choose an action a using the ε-greedy policy.
o Repeat until the terminal state is reached:
▪ Take action a; observe the reward r and the next state s'.
▪ Choose the next action a' in state s' using the ε-greedy policy.
▪ Update the Q-value:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma Q(s', a') - Q(s, a) \big] \]

▪ Set s ← s' and a ← a'.

4. End the training when Q-values converge.

Differences between SARSA and Q-Learning
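One way to see how the two methods differ is to compare their update targets. In the standard formulations, SARSA (on-policy) bootstraps from the action a' it will actually take next, while Q-learning (off-policy) bootstraps from the best action in the next state. The sketch below assumes those standard targets and a hypothetical corridor environment with made-up parameters.

```python
import random

N_STATES, GOAL = 5, 4  # hypothetical corridor: states 0..4, goal at state 4
LEFT, RIGHT = 0, 1


def env_step(state, action):
    """Deterministic toy environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice([LEFT, RIGHT])
    best = max(q[state])
    return rng.choice([a for a in (LEFT, RIGHT) if q[state][a] == best])


def sarsa_target(q, r, s_next, a_next, gamma, done):
    """On-policy: uses Q(s', a') for the action actually chosen next."""
    return r if done else r + gamma * q[s_next][a_next]


def q_learning_target(q, r, s_next, gamma, done):
    """Off-policy: uses the maximum of Q(s', .) regardless of the next action."""
    return r if done else r + gamma * max(q[s_next])


def sarsa(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(q, s, epsilon, rng)  # first action chosen up front
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = epsilon_greedy(q, s_next, epsilon, rng)
            target = sarsa_target(q, r, s_next, a_next, gamma, done)
            q[s][a] += alpha * (target - q[s][a])
            s, a = s_next, a_next               # carry the chosen action forward
    return q


q = sarsa()
demo_q = [[0.0, 0.0], [0.2, 0.8]]  # made-up table to compare the two targets
print(sarsa_target(demo_q, 0.0, 1, LEFT, 0.9, False))
print(q_learning_target(demo_q, 0.0, 1, 0.9, False))
```

When the next action happens to be an exploratory one with a low value, the SARSA target is lower than the Q-learning target, so SARSA's estimates reflect the cost of exploration while Q-learning's estimates describe the greedy policy.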
