Reg. No.
: __________ Dec 2024 – Apr 2025
COIMBATORE INSTITUTE OF TECHNOLOGY
(Government Aided Autonomous Institution)
COIMBATORE - 641 014
DEPARTMENT OF COMPUTING – AI & ML
19MAM42 – Machine Learning
II Year [Link]. (AI&ML) – IV Semester
Course Faculty: Ms. P. Hemashree
II Mid Semester Examination – Answer Key
Date: 24.04.2025 (AN) Time: 3.15 PM - 05.15 PM
Hall Number: MS 206 & MS 202 Max. Marks: 60
Q.
PART A - (10 x 2 Marks = 10 Marks) Marks
NO.
Define Maximum likelihood hypothesis.
Under certain assumptions, any learning algorithm triesto find a hypothesis that minimizes the squared
error between the output hypothesis predictions and training data targets. Such a hypothesis that
1 minimizes the sum of squared error over the training data is called maximum likelihood hypothesis. 2
Mention any two significant feature of Bayesian learning methods.
1. Incorporation of Prior Knowledge
● Bayesian methods allow the inclusion of prior beliefs or domain knowledge about the
parameters before observing data.
2 ● These prior probabilities are updated as new data becomes available using Bayes’ Theorem. 2
2. Probabilistic Interpretation
● Bayesian learning provides a probabilistic framework for learning.
● Instead of just predicting a class label or a value, it gives the probability distribution over
possible outcomes, enabling better decision-making under uncertainty.
How does Naïve Bayes classifier overcome zero frequency problem?
The Naïve Bayes classifier overcomes the zero-frequency problem by using a technique called
Laplace Smoothing (also known as Add-one Smoothing).
What is the Zero Frequency Problem?
● It occurs when a categorical feature value is not present in the training data for a given class.
● As a result, the probability becomes zero, which in turn makes the entire product of
probabilities zero, even if other features strongly support the classification.
How Laplace Smoothing Solves It:
● Laplace Smoothing adjusts the formula for conditional probability by adding 1 to the count of
3 2
each feature-class combination.
Formula:
● count(x_i, y) is the number of times feature value x_i appears in class y.
● count(y) is the total number of feature instances in class y.
● n is the number of possible feature values.
State the principle of k-nearest neighbor (KNN) algorithm.
The k-nearest neighbors (KNN) algorithm relies on the principle of similarity for classification and
regression tasks (2 marks). Here's a breakdown:
● Similar data points belong to the same class: This principle forms the foundation of KNN. It
assumes that data points close together in feature space are likely to share the same class label.
4 2
● Prediction based on neighbors: When encountering a new data point, KNN identifies its k
closest neighbors in the training data based on a distance metric (e.g., Euclidean distance).
Majority vote (classification) or average (regression): In classification, the new data point is
assigned the class label that appears most frequently among its k neighbors. In regression, the predicted
value is the average of the values of its k neighbors.
Compare inductive based learning with analytical based learning.
Feature Inductive Learning Analytical Learning
5 2
Using prior knowledge to understand
Focus Identifying patterns from data
data
Data Lower (can learn from fewer
High (needs large datasets)
Reliance examples)
Less susceptible due to prior
Bias More susceptible to bias in data
knowledge
Explanatio Often lacks clear explanation for Can provide explanations based on
n learned patterns prior knowledge
List out the elements of Reinforcement learning.
The Agent and Environment:
● Agent
● Environment
6 2
Core Components for Learning:
● Policy
● Reward Signal
Value Function
State the significance of a partially observable state in Reinforcement learning.
In Reinforcement Learning (RL), a partially observable state refers to situations where the agent
does not have complete information about the environment's current state.
Significance of Partially Observable States:
1. Real-World Applicability: Many real-world problems (e.g., robotics, autonomous driving, games)
are partially observable because the agent cannot access the full environment state at any given
time.
2. Challenges Learning: The agent must infer hidden state information from past observations or
7 2
actions, making decision-making more complex.
3. Use of Memory: Partially observable environments often require the use of memory-based
models, like Recurrent Neural Networks (RNNs), to capture past context and improve
decision-making.
4. Leads to POMDPs: Such scenarios are modeled as Partially Observable Markov Decision
Processes (POMDPs), which extend MDPs to handle uncertainty in state observations.
Example: In a card game, the agent cannot see the opponent's cards (partial observability), so it must
make decisions based on its hand and the game history.
Illustrate the concept of support and confidence in Association Rules with and example.
Association Rule Mining
Association rules are used to find relationships between items in large datasets, typically in market
basket analysis.
1. Support:
o It indicates how frequently the itemset appears in the dataset.
o Formula:
Support(A⇒B) = Number of transactions containing both A and B / Total number of transactions
2. Confidence:
o It measures how often item B is purchased when item A is purchased.
o Formula:
Confidence(A⇒B) = Number of transactions containing both A and B /
Number of transactions containing A
Example:
Consider the following 5 transactions in a supermarket:
Transaction ID Items Bought
T1 Milk, Bread
8 2
T2 Milk, Diaper, Beer, Eggs
T3 Milk, Bread, Diaper, Beer
T4 Bread, Butter
T5 Milk, Bread, Diaper
Let’s analyze the rule:
Milk ⇒ Bread
Step 1: Support
● Transactions containing both Milk and Bread:
o T1, T3, T5 → 3 transactions
● Total transactions = 5
Support = 35 = 0.6 or 60%
Step 2: Confidence
● Transactions containing Milk:
o T1, T2, T3, T5 → 4 transactions
● Transactions containing both Milk and Bread = 3
Confidence = 34 = 0.75 or 75%
Conclusion:
For the rule Milk ⇒ Bread:
● Support = 60%
● Confidence = 75%
This means that:
● 60% of all transactions contain both Milk and Bread.
● When Milk is bought, there is a 75% chance that Bread is also bought.
Differentiate between Agglomerative and Divisive clustering.
Agglomerative Clustering Divisive Clustering
Bottom-up approach Top-down approach
Starts with each data point as its own individual Starts with all data points in a single cluster
cluster
Merges the closest clusters step by step until one Splits the cluster recursively until each data
cluster remains point is in its own cluster
9 Continues merging until a single cluster is Continues splitting until individual points 2
formed or threshold is met form separate clusters
Lower compared to divisive Higher compared to agglomerative
More widely used in practice due to simplicity Less commonly used due to computational
cost
Eg: Merging customers based on similar Eg: Splitting a group of customers into
purchase behavior subgroups based on diversity
Use k-means clustering algorithm for clustering the following data points into two clusters.
Assume the initial centroids are C1 = [5, 3] and C2 = [6. 2]. Find the new cluster centroids. Data
points are X1 = [2, 4], X2 = [5, 3], X3 = [1, 2], X4 = [6, 2].
Data Points: X1 = [2, 4], X2 = [5, 3], X3 = [1, 2], X4 = [6, 2]
Initial Centroids: C1 = [5, 3], C2 = [6, 2]
Distance Calculation:
C1 C2 Cluster assignment
X1 3.16 5.39 C1
X2 0 1.41 C1
X3 5 5 C2
X4 1.41 0 C2
10 2
Cluster Assignment:
● X1 belongs to Cluster 1 (distance 3.16 < distance 5.39)
● X2 belongs to Cluster 1 (distance 0 < distance 1.41)
● X3 belongs to Cluster 2 (distance 5 < distance 5) - Choosing C2 arbitrarily in case of equal
distance
● X4 belongs to Cluster 2 (distance 0 < distance 1.41)
Recalculate Centroids:
● New Centroid for Cluster 1: Mean of [X1, X2] = [(2+5)/2, (4+3)/2] = [3.5, 3.5]
● New Centroid for Cluster 2: Mean of [X3, X4] = [(1+6)/2, (2+2)/2] = [3.5, 2]
New Centroids:
● Cluster 1: [3.5, 3.5]
Cluster 2: [3.5, 2]
PART B - (4 X 10 Marks = 40 Marks) Marks
a) What is Bayes optimal classifier? State the role of Bayes optimal classifier in classifying the
instance correctly.
The Bayes Optimal Classifier is a theoretical classification model that makes predictions based on the
probability of all possible hypotheses in the hypothesis space. It selects the class label with the
highest posterior probability, taking into account all possible models and their respective probabilities
given the data.
Formally, for a given instance xx, the Bayes Optimal Classifier predicts the class cc that maximizes the
following:
P(c ∣ x) = ∑h ∈ H P(c ∣ x, h) ⋅ P(h ∣ D)
11 5
Where:
● H: Set of all hypotheses.
● P(h ∣ D) P(h | D): Posterior probability of hypothesis h given data D.
● P(c ∣ x, h) P(c | x, h): Probability of class c for instance x under hypothesis h.
Role in Classifying the Instance Correctly
● The Bayes Optimal Classifier minimizes the overall classification error by considering
every possible hypothesis weighted by how likely it is.
● Unlike a single-model classifier (like Naive Bayes or Decision Trees), it doesn't rely on just
one hypothesis but aggregates over all possible hypotheses.
● It acts as the gold standard or benchmark for classifiers — no classifier can perform better
in terms of expected accuracy than the Bayes Optimal Classifier (assuming the true
distribution is known).
● In practice, since evaluating all hypotheses is computationally infeasible, approximations like
Naive Bayes or Bayesian model averaging are used.
b) Outline case-based reasoning with a suitable example.
Case-based reasoning (CBR) is a problem-solving approach where new situations are tackled by
referencing similar past experiences (cases) and adapting the solutions from those cases to the current
problem. It's like learning from past experiences and applying those lessons to new situations.
Here's a breakdown of the key steps in CBR:
1. Retrieve: When faced with a new problem, CBR retrieves similar cases from a case base (a library
of past experiences). This retrieval process relies on identifying relevant features or characteristics
that match the new problem.
2. Reuse: Once a similar case (or cases) is retrieved, CBR attempts to reuse the solution associated
with that past case. This might involve directly applying the solution or adapting it based on the
specific details of the new problem.
3. Revise: In some cases, the retrieved solution might need adjustments to fit the new situation
perfectly. CBR allows for revising the solution based on the new problem's specific context.
4. Retain: The CBR system can learn from the current problem-solving experience. The new 5
situation and its solution can be added to the case base, enriching the knowledge base for future
use.
Example: A Medical Diagnosis Scenario
Imagine a doctor encountering a patient with a set of symptoms (fever, fatigue, sore throat). Here's how
CBR might be applied:
● Retrieve: The doctor recalls a previous case (from their memory or a medical case base) where a
patient exhibited similar symptoms.
● Reuse: Based on the past case, the doctor suspects a viral infection and prescribes medication that
was effective in the previous case.
● Revise (if needed): If the patient's condition worsens or doesn't respond well to the initial
medication, the doctor might revise the diagnosis and treatment plan.
Retain: The doctor might document this new experience in the medical record, potentially including
details about the patient's response to the treatment. This enriches the case base for future reference.
a) Explain the features of Bayesian Belief network with a suitable example.
• A graphical representation of different probabilistic relationships among random variables in a
particular set
• A classifier with no dependency on attributes i.e it is condition independent
• Due to its feature of joint probability, the probability is derived, based on a condition
— P(attribute/parent) i.e probability of an attribute, true over parent attribute
Imagine you're a detective investigating a burglary. You have some clues: the alarm went off (A), there
are muddy footprints outside (F), and your neighbor (N) didn't see anything suspicious. A Bayesian
Belief Network (BBN) can help you represent the relationships between these clues and determine the
likelihood of a burglary (B) happening.
Key Features of BBNs:
1. Directed Acyclic Graph (DAG): A BBN is visualized as a DAG. Nodes represent variables
(alarm, burglary, etc.), and arrows depict directed influences between them. There are no cycles
(loops) to ensure clear cause-and-effect relationships.
2. Conditional Probability Tables (CPTs): Each node has a CPT that specifies the probability of its
state (true/false) given the states of its parent nodes. In our example:
12 o P(A | B) - Probability of the alarm going off given a burglary. 5
o P(F | B) - Probability of muddy footprints given a burglary.
o P(N | B) - Probability of the neighbor not seeing anything suspicious given a burglary
(might be because they were away).
3. Probabilistic Inference: BBNs allow you to calculate the probability of any variable given
evidence (observed values) in other variables. This is powerful for updating your beliefs as you
gather more information.
Example:
● Let's say P(B) = 0.1 (base chance of a burglary), P(A | B) = 0.9 (alarm is reliable), P(F | B) =
0.8 (burglars often leave muddy prints), and P(N | B) = 0.7 (neighbor might miss something).
● Now, the alarm goes off (A=true). We want to find the probability of a burglary (B) given this
evidence.
Using Bayes' theorem and the network structure, we can calculate the updated probability:
P(B | A) = [P(A | B) * P(B)] / P(A)
Here, P(A) considers all possibilities that could trigger the alarm (burglary or other reasons). We can
estimate P(A) by summing the probabilities of the alarm given burglary (0.9 * 0.1) and the alarm given
no burglary (let's say 0.1).
By calculating these probabilities and incorporating them into the formula, you can determine the
likelihood of a burglary after the alarm goes off. The BBN allows you to further refine this probability
as you discover muddy footprints or your neighbor's observation.
b) Discuss about the Expectation Maximization algorithm.
The Expectation-Maximization (EM) algorithm is a powerful statistical tool used to estimate the
parameters of models involving latent variables. These are variables that influence the observed data
but are themselves unobserved.
Here's a breakdown of the core concepts and how EM tackles this challenge:
The Challenge of Latent Variables:
Imagine you're analyzing customer purchases. You might observe what items were bought together
(purchased items), but you can't directly observe the customer's shopping intent (latent variable)
behind those purchases.
The EM algorithm helps you estimate the parameters that govern these hidden variables based on the
observed data.
The EM Algorithm in Action:
EM works in an iterative two-step process: 5
1. Expectation (E-Step): In this step, the algorithm estimates the expected values of the latent
variables for each data point. These expected values are based on the current estimates of the
model parameters and the observed data.
2. Maximization (M-Step): Using the expected values from the E-step, the M-step aims to
maximize the likelihood of the observed data by updating the model parameters. These
parameters can represent things like probabilities of certain items being purchased together or the
influence of shopping intent on purchase patterns.
The Iteration:
EM repeats the E-step and M-step iteratively until the model parameters converge (stabilize and no
longer significantly change). With each iteration, the estimates of the latent variables and model
parameters become more refined, leading to a better understanding of the underlying relationships in
the data.
Design a Bayesian Classifier for assigning tweets into categories positive or negative. Consider
the input to the system is set of tweets which include positive and negative.
i) List the steps in modelling tweets.
ii) Explain the steps of training and testing the classifier in detail with mathematical model.
i) Steps in Modelling Tweets
1. Data Collection
o Collect a labeled dataset of tweets with sentiment labels (Positive, Negative).
o Example:
▪ "I love this product!" → Positive
▪ "This is terrible." → Negative
2. Data Preprocessing
o Clean the tweets to remove irrelevant elements:
▪ Lowercasing
▪ Removing punctuation, URLs, hashtags, mentions
▪ Removing stop words
▪ Tokenization (splitting into words)
▪ Stemming/Lemmatization (optional)
13 3. Feature Extraction 5+5
o Convert text into numeric representation:
▪ Bag of Words (BoW)
▪ TF-IDF (Term Frequency-Inverse Document Frequency)
▪ Word frequencies for Naive Bayes
4. Split the Dataset
o Split the dataset into:
▪ Training Set (to train the model)
▪ Test Set (to evaluate performance)
5. Model Building
o Apply the Naïve Bayes Classifier based on word occurrences.
6. Prediction
o Classify new/unseen tweets as Positive or Negative using the trained model.
ii) Training and Testing the Classifier with Mathematical Model
Training Phase
Let:
● C = Set of classes = {Positive, Negative}
● D = Training dataset of tweets
For a new tweet T = {w₁, w₂, ..., w } (words in the tweet), the classifier calculates:
Where:
● P(Ci) is the prior probability of class Ci
● P(wj ∣ Ci) is the likelihood of word wj given class Ci
1. Estimate Prior Probabilities:
2. Estimate Likelihood:
Using Multinomial Naïve Bayes:
Where:
● |V| is the size of vocabulary
● Laplace Smoothing (+1) is applied to avoid zero probabilities
Testing Phase
For a new tweet T = {w₁, w₂, ..., w }:
1. Compute:
2. Assign the tweet to class C^
Example
Let’s say we have the following simplified training data:
Tweet Label
"I love this product" Positive
"This is terrible" Negative
"Absolutely fantastic!" Positive
"Worst experience ever" Negative
After preprocessing and tokenization:
● Positive Class Words: [i, love, this, product, absolutely, fantastic]
● Negative Class Words: [this, is, terrible, worst, experience, ever]
● Vocabulary Size (|V|): 12
● P(Positive) = 2/4 = 0.5
● P(Negative) = 2/4 = 0.5
For a test tweet: "this product is terrible"
● After tokenization: [this, product, is, terrible]
Compute:
Whichever is higher is the predicted class.
Conclusion
A Bayesian classifier uses probabilities of word occurrences across labeled data to make predictions for
new texts. It is especially effective for sentiment analysis on short texts like tweets due to its simplicity,
speed, and performance.
The following table gives dataset about stolen vehicles. Using Naïve Bayes Classifier classify the
new data {Color: Red, Type: SUV, Origin: Domestic}
Color Type Origin Stolen
Red Sports Domestic Yes
14 Red Sports Domestic No 10
Red Sports Domestic Yes
Yellow Sports Domestic No
Yellow Sports Imported Yes
Yellow SUV Imported No
Yellow SUV Imported Yes
Yellow SUV Domestic No
Red SUV Imported No
Red Sports Imported Yes
Let's use the Naïve Bayes Classifier to classify the new data instance:
{Color: Red, Type: SUV, Origin: Domestic}
Step 2: Count Class Labels
● Total records: 10
● Stolen = Yes: 5
● Stolen = No: 5
So,
● P(Yes) = 5/10 = 0.5
● P(No) = 5/10 = 0.5
Step 3: Calculate Likelihoods
For Stolen = Yes
Count records where Stolen = Yes (5 records):
Color Type Origin
Red Sports Domestic
Red Sports Domestic
Yellow Sports Imported
Yellow SUV Imported
Red Sports Imported
● P(Color = Red | Yes) = 3/5
● P(Type = SUV | Yes) = 1/5
● P(Origin = Domestic | Yes) = 2/5
For Stolen = No
Count records where Stolen = No (5 records):
Color Type Origin
Red Sports Domestic
Yellow Sports Domestic
Yellow SUV Imported
Yellow SUV Domestic
Red SUV Imported
● P(Color = Red | No) = 2/5
● P(Type = SUV | No) = 2/5
● P(Origin = Domestic | No) = 2/5
Step 4: Apply Naïve Bayes Formula
P(Class∣Attributes) ∝ P(Class) × P(Color∣Class) × P(Type∣Class) × P(Origin∣Class) P(Class|Attributes)
∝ P(Class) × P(Color|Class) × P(Type|Class) × P(Origin|Class)
Probability for Yes:
P(Yes ∣ Red, SUV, Domestic) ∝ 0.5×35×15×25=0.5×0.6×0.2×0.4 = 0.024
Probability for No:
P(No ∣ Red, SUV, Domestic) ∝ 0.5×25×25×25=0.5×0.4×0.4×0.4 = 0.032
Step 5: Compare & Classify
● P(Yes | Data) = 0.024
● P(No | Data) = 0.032
Since P(No) > P(Yes), the new data instance is classified as: Stolen = No
a) Describe k-nearest neighbour algorithm with an example.
K-Nearest Neighbour is a supervised machine learning algorithm used for classification and
regression tasks. It is a lazy learning algorithm, meaning it does not build a model during training but
stores the entire training dataset. When a new input arrives, it makes predictions based on the majority
class (for classification) or average value (for regression) of the k closest training instances in the
feature space.
Steps in K-NN Algorithm:
1. Choose the number of neighbors k.
15 2. Calculate the distance (e.g., Euclidean) between the new data point and all the training points. 4
3. Sort the distances and select the k nearest neighbors.
4. For classification: Assign the class which has the most occurrences among the k neighbors.
5. For regression: Calculate the average of the k nearest neighbors' values.
Example:
Consider the following training data:
Height (cm) Weight (kg) Class
170 65 Fit
160 59 Fit
180 80 Overweight
175 85 Overweight
Now, we want to predict the class of a new person with:
● Height = 172 cm, Weight = 70 kg
Let’s assume k = 3.
1. Calculate Euclidean distance from the new point to all training points:
Distance=Sqrt((x2−x1)2+(y2−y1)2)
2. Compute distances:
● Distance to (170, 65): √((172-170)² + (70-65)²) = √(4 + 25) = √29 ≈ 5.39
● Distance to (160, 59): √((172-160)² + (70-59)²) = √(144 + 121) = √265 ≈ 16.28
● Distance to (180, 80): √((172-180)² + (70-80)²) = √(64 + 100) = √164 ≈ 12.8
● Distance to (175, 85): √((172-175)² + (70-85)²) = √(9 + 225) = √234 ≈ 15.3
3. Select 3 nearest neighbors:
● (170, 65) – Fit
● (180, 80) – Overweight
● (175, 85) – Overweight
4. Majority class among neighbors = Overweight (2 out of 3)
Prediction:
The new data point is classified as Overweight.
b) Write short notes on the following:
i) Locally weighted regression (LWR)
LWR is a non-parametric regression technique that performs regression analysis locally, focusing on
data points near the query point being predicted.
Key Idea: Unlike traditional regression that fits a single global model to the entire data set, LWR fits a
separate model for each prediction. This separate model is based on the weighted contributions of
nearby data points, where closer points have a higher influence.
Benefits:
● Captures local patterns: LWR can adapt to non-linear relationships between features and target
variables by considering the local trends around the prediction point.
● Flexible and data-driven: It doesn't require pre-defined assumptions about the functional form of
the relationship between features and target variables.
Drawbacks:
● Computationally expensive: Calculating weighted contributions for each prediction can be
computationally intensive for large datasets.
● Tuning the kernel: The choice of the weighting function (kernel) can affect the performance of
LWR.
3+3
Applications:
● Time series forecasting
● Image analysis
● Econometrics
ii) Radial Basis Function
● Function Type: RBFs are a class of mathematical functions that depend only on the distance from
the input to a center point.
● Shape: They typically have a bell-shaped curve, with the output value decreasing as the distance
from the center increases.
● Common Example: The Gaussian function (normal distribution) is a popular RBF.
Applications in Machine Learning:
1. Kernel Methods: RBFs are widely used as kernels in machine learning algorithms like Support
Vector Machines (SVMs). The kernel function transforms the input data into a higher-dimensional
space, enabling SVMs to handle non-linear relationships effectively.
2. Radial Basis Function Networks (RBFNs): These are a type of artificial neural network where the
hidden layer uses RBF activation functions. RBFNs can learn complex non-linear mappings
between input and output data.
a) Explain how Q-learning finds the best course of action, given the current state of the agent.
Q-learning, unlike some other reinforcement learning algorithms, doesn't directly tell the agent the best
action for a given state. Instead, it guides the agent towards the optimal choice through a process of
value estimation and exploration. Here's how it works:
1. Q-Value Table: Q-learning maintains a Q-value table. This table stores the estimated
expected future reward (Q-value) for taking each possible action (a) in a given state (s).
16 2. Bellman Equation: Q-learning utilizes the Bellman Equation to iteratively update the 6
Q-values in the table. This equation considers:
● Immediate reward (R): The reward received by the agent after taking action (a) in state (s).
● Discounted future reward (γ * max(Q(s', a'))): The expected future reward, discounted by a
factor (γ, 0 < γ ≤ 1), achievable from the next state (s') by taking the best possible action (a')
based on the current Q-values.
Here's a simplified version of the Bellman Equation:
Q(s, a) = R(s, a) + γ * max(Q(s', a'))
3. Action Selection: At each step, the agent uses the Q-value table to select an action. This
selection can involve:
● Epsilon-greedy strategy: This strategy balances exploration and exploitation. With a
probability (ε), the agent might choose a random action to explore the environment and
potentially discover better options. With probability (1-ε), it chooses the action with the
highest Q-value in the current state (exploitation).
● Other selection policies: More advanced strategies might consider factors like uncertainty in
Q-value estimates or prioritize actions that could lead to unexplored states.
Learning and Improvement: As the agent interacts with the environment, taking actions and
receiving rewards, the Q-values in the table are updated through the Bellman Equation. Over time, the
Q-values converge towards the optimal expected future reward for each state-action pair. This guides
the agent to consistently choose actions that lead to the most long-term reward.
b) Discuss the temporal differences learning approach of reinforcement learning.
Temporal Difference (TD) learning is a fundamental approach in reinforcement learning where the
agent learns by bootstrapping, meaning it uses its current understanding of the value of a state to
estimate the value of future states. This stands in contrast to some methods that require waiting until the
end of an entire episode (sequence of actions) to learn. Here's a breakdown of the key concepts in TD
learning:
Challenges in Reinforcement Learning:
● Delayed rewards: In many real-world scenarios, rewards are not received immediately after
taking an action. The agent might need to take multiple steps before reaching a state with a
significant reward.
● High dimensionality: The state space (all possible states the agent can be in) can be vast.
Learning the optimal action for every single state can be computationally expensive or even
impossible.
TD Learning to the Rescue:
TD learning addresses these challenges by leveraging one-step updates based on the current state, the
action taken, the immediate reward received, and an estimate of the value of the next state. This allows
the agent to learn incrementally without waiting for the final outcome of an entire episode.
Core Idea of TD Learning:
Imagine the agent is in state (s) and takes action (a), leading to immediate reward (R) and transitioning
to the next state (s'). TD learning uses the Bellman Equation as a foundation, but instead of waiting for
the actual future reward, it uses the agent's current estimate of the value of the next state (Q(s', a')) to
update its estimate of the value for the current state-action pair (Q(s, a)).
Here's a simplified form of the TD update rule:
4
Q(s, a) = Q(s, a) + α [ R(s, a) + γ * Q(s', a') - Q(s, a) ]
● α (alpha): Learning rate, controls the weight given to the new estimate compared to the
previous Q-value.
● γ (gamma): Discount factor, balances the importance of immediate reward (R) vs. future
reward (estimated by Q(s', a')).
Types of TD Learning:
There are various TD learning algorithms, each with its own advantages and trade-offs. Here are two
common types:
● SARSA (State-Action-Reward-State-Action): This approach updates the Q-value for the
state-action pair that led to the transition to the next state.
● Q-learning (State-Action-Reward-State- ):* This approach updates the Q-value for the best
possible action achievable in the next state (based on the current Q-values).
Benefits of TD Learning:
● Sample efficiency: TD learning allows for learning from single transitions, making it efficient for
large or continuous environments.
● Adaptability: The agent can continuously update its estimates as it interacts with the environment,
allowing for adaptation to changes.
Limitations of TD Learning:
● Convergence issues: Depending on the specific algorithm and parameters, TD learning might not
always converge to the optimal Q-values.
Off-policy learning limitations: Some TD methods are designed for on-policy learning (the agent
follows the policy it is learning). Adapting them for off-policy learning (learning from different
policies) can be challenging.
Consider the following database of ten transactions. Assume min_sup=30% and
min_confidence=60%.
TID Items Bought
T1 Pen, Pencil
17 10
T2 Book, eraser, pencil
T3 Book, Chalk, eraser, pen
T4 Chalk, eraser, pen
T5 Book, pen, pencil
T6 Book, eraser, pen, pencil
T7 Ink, pen
T8 Book, pen, pencil
T9 Eraser, pen, pencil
T10 Book, chalk, pencil
i) Find 6 frequent item sets using Apriori algorithm.
ii) Generate strong association rules.
a) Apply k-means clustering algorithm to group the following data points into 2 clusters. Initial
centroids will be the farthest i.e., a pair of data-points with maximum Euclidean distance.
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
Step 1: Dataset
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
18 4 5.0 7.0 6
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
Step 2: Find the Farthest Pair of Points
We calculate the Euclidean distance between every pair of points.
The distance formula is:
d = √ (x2−x1)2+(y2−y1)2
After calculating distances between all pairs, we find the farthest pair is:
● Subject 1 (1.0, 1.0) and Subject 4 (5.0, 7.0)
● Distance = √[(5−1)² + (7−1)²] = √[16 + 36] = √52 ≈ 7.21
So, the initial centroids are:
● C1 = (1.0, 1.0)
● C2 = (5.0, 7.0)
Step 3: Assign Points to Nearest Centroid
Now, calculate distance from each point to both centroids and assign to the nearest:
Subject A B Dist to Dist to Cluster
C1 C2
1 1.0 1.0 0.00 7.21 C1
2 1.5 2.0 1.12 6.10 C1
3 3.0 4.0 3.61 3.61 C1 (tie, prefer C1)
4 5.0 7.0 7.21 0.00 C2
5 3.5 5.0 5.00 1.80 C2
6 4.5 5.0 5.32 2.24 C2
7 3.5 4.5 4.61 2.50 C2
Step 4: Recalculate Centroids
New centroid for C1 (Subjects 1, 2, 3):
C1 = (1.0+1.5+3.03,1.0+2.0+4.03) = (1.83,2.33)
New centroid for C2 (Subjects 4, 5, 6, 7):
C2 = (5.0+3.5+4.5+3.54,7.0+5.0+5.0+4.54) = (4.125,5.375)
Step 5: Reassign Points Based on New Centroids
Subject A B Dist to Dist to new Cluster
new C1 C2
1 1.0 1.0 1.67 5.35 C1
2 1.5 2.0 0.47 4.26 C1
3 3.0 4.0 2.14 1.64 C2
4 5.0 7.0 4.94 1.85 C2
5 3.5 5.0 2.87 0.66 C2
6 4.5 5.0 3.39 0.52 C2
7 3.5 4.5 2.66 0.94 C2
Final Clusters After Iteration 2:
● Cluster C1: Subjects 1, 2
● Cluster C2: Subjects 3, 4, 5, 6, 7
Summary:
● Initial centroids: Subject 1 and Subject 4 (farthest points)
● After two iterations, final clusters formed:
o C1 (1.83, 2.33) → Subject 1, 2
o C2 (4.125, 5.375) → Subject 3, 4, 5, 6, 7
b) Explain Apriori algorithm with a suitable example.
The Apriori algorithm is a popular technique in data mining used to find frequent itemsets and
association rules within a dataset. It helps uncover hidden patterns from purchase transactions,
browsing history, or any scenario where items appear together frequently.
Here's how Apriori works, along with an illustrative example:
Key Steps:
1. Minimum Support Threshold: Define a minimum support threshold. This represents the
minimum number of times an itemset (a group of items) needs to appear in transactions (as a
percentage of total transactions) to be considered frequent.
2. Frequent Itemset Generation (Level-wise Approach):
● Start by finding frequent individual items (itemsets of size 1). Count the occurrences of each
item in all transactions. Items exceeding the minimum support threshold are considered
frequent.
● In subsequent levels, iteratively generate candidate itemsets by combining frequent itemsets
from the previous level. For example, if items A and B are frequent individually, the candidate
itemset at level 2 would be {A, B}. 4
● Prune any candidate containing a sub-itemset that wasn't frequent in the previous level
(Apriori's pruning property for efficiency).
● Scan the transactions again to count the support count for each candidate itemset.
● Items exceeding the minimum support threshold at this level are added to the list of frequent
itemsets.
● Repeat steps for higher levels until no more frequent itemsets can be generated.
3. Association Rule Generation: Once you have frequent itemsets, you can generate association
rules that describe relationships between items. These rules follow the format "X -> Y", where X is
the antecedent (the "if" part) and Y is the consequent (the "then" part) of the rule. The support and
confidence of each rule are calculated:
● Support: The percentage of transactions containing both X and Y.
● Confidence: The probability of finding Y in a transaction given that X is already present (Y
appears in X% of transactions that contain X).
Example:
Imagine a grocery store dataset with transaction records of customer purchases.
● Minimum Support: Set a minimum support of 50% (meaning an itemset needs to appear in at
least half of the transactions).
Frequent Itemset Generation:
● Level 1: Find frequent individual items (bread, milk, eggs, etc.) based on their occurrence in
transactions.
● Level 2: Generate candidate itemsets by combining frequent level 1 items (e.g., {bread, milk},
{bread, eggs}, etc.). Prune any candidate with an infrequent sub-itemset (e.g., {bread, cheese}
wouldn't be a candidate if cheese wasn't frequent on its own).
● Level 3: Repeat the process for level 3 with frequent level 2 itemsets (e.g., {bread, milk, eggs}).
Association Rule Generation:
● From the frequent itemsets, derive association rules like "bread, milk -> eggs" (if someone buys
bread and milk, they are likely to buy eggs as well).
Calculate the support and confidence for each rule to assess its strength and validity.