1.1 What is Machine Learning?
Arthur Samuel’s definition (1959):
“Field of study that gives computers the ability to learn without
being explicitly programmed.”
Tom Mitchell’s definition (1997):
“A computer program is said to learn from experience E with respect
to some task T and performance measure P, if its performance on T,
as measured by P, improves with experience E.”
Example:
Task (T): Predicting house prices.
Experience (E): Past sales data.
Performance measure (P): Prediction error (MSE; lower is better).
Quiz-style Questions
Q1. Who gave the first popular definition of Machine Learning in 1959?
A. Tom Mitchell
B. Arthur Samuel
C. Alan Turing
D. Andrew Ng
👉 Answer: B. Arthur Samuel
Q2. Tom Mitchell’s definition of ML has 3 parts: Task (T), Experience (E),
and Performance measure (P).
Which of the following is an example of Experience (E)?
A. Predicting whether email is spam
B. Accuracy of predictions
C. Historical labeled email dataset
D. Number of correct predictions
👉 Answer: C. Historical labeled email dataset
Q3. According to Tom Mitchell’s definition, a program is said to learn if:
A. Its performance improves with experience
B. It can memorize past data
C. It is explicitly programmed
D. It runs faster with more data
👉 Answer: A. Its performance improves with experience
Q4. In predicting house prices, what would be a suitable performance
measure (P)?
A. Mean Squared Error (MSE)
B. Number of houses sold
C. Square footage of houses
D. Size of training dataset
👉 Answer: A. Mean Squared Error (MSE)
Q5. Which of the following is NOT part of Tom Mitchell’s ML definition?
A. Task (T)
B. Experience (E)
C. Performance measure (P)
D. Dataset size (D)
👉 Answer: D. Dataset size (D)
🔹 PYQs (Exam-style)
Q1.
True/False:
“According to Tom Mitchell, a program is said to learn from experience if it
can memorize the dataset.”
👉 Answer: False.
Learning = performance improves on the task with experience, not just
memorization.
Q2.
Fill in the blanks:
A computer program is said to learn from ______ with respect to some task
______ and performance measure ______, if its performance on ______
improves with ______.
👉 Answer: Experience (E), Task (T), Performance measure (P), Task (T),
Experience (E).
Q3.
Example mapping (PYQ style):
Predicting house prices → Task (T)
Past house sales data → Experience (E)
Mean Squared Error → Performance measure (P)
Q4.
Multiple choice:
Which statement is most aligned with Tom Mitchell’s ML definition?
A. ML is programming computers manually.
B. ML is about writing explicit rules for every case.
C. ML is about systems that improve automatically with more data and
experience.
D. ML is about faster algorithms only.
👉 Answer: C. ML is about systems that improve automatically with
more data and experience.
1.2 Types of Machine Learning
1. Supervised Learning
o Input data has labels.
o Goal: Learn a mapping $f: X \to Y$.
o Two types: regression and classification.
o Examples: Regression (predict price), Classification (spam
detection).
2. Unsupervised Learning
o Data has no labels.
o Goal: Find hidden patterns/structure.
o Examples: Clustering (K-means), Dimensionality reduction
(PCA).
3. Semi-Supervised Learning
o Few labeled + many unlabeled data.
o Example: Medical diagnosis (labels are costly).
4. Reinforcement Learning
o Agent interacts with environment → gets
rewards/punishments.
o Goal: Learn optimal policy to maximize reward.
o Examples: Chess AI, self-driving cars.
Quiz – Types of Machine Learning
Q1. Which of the following is an example of supervised learning?
A. K-means clustering
B. PCA (Principal Component Analysis)
C. Spam email detection
D. Chess AI learning via rewards
Answer: C. Spam email detection
Explanation: Supervised learning uses labeled data to learn a mapping
from input to output. Spam detection has labeled emails (spam/not spam).
Q2. In unsupervised learning, the main goal is to:
A. Predict outcomes based on labels
B. Find hidden patterns or structures in data
C. Maximize rewards through interaction
D. Solve linear equations
Answer: B. Find hidden patterns or structures in data
Explanation: Unsupervised learning works on unlabeled data and
discovers patterns, e.g., clustering or dimensionality reduction.
Q3. Reinforcement learning is different from supervised learning
because:
A. It always requires labeled input-output pairs
B. The agent learns by interacting with the environment and receiving
feedback
C. It uses clustering algorithms
D. It only works with linear models
Answer: B. The agent learns by interacting with the environment and
receiving feedback
Explanation: RL focuses on learning optimal actions to maximize long-
term reward through trial-and-error.
Q4. Which algorithm is commonly used for unsupervised learning?
A. Linear regression
B. Decision Trees
C. K-means clustering
D. Logistic regression
Answer: C. K-means clustering
Q5. Self-driving cars learning to drive by trial and error is an
example of:
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. Deep learning
Answer: C. Reinforcement learning
Explanation: The car (agent) interacts with the environment (roads) and
receives feedback (rewards/punishments) to improve its driving policy.
🔹 PYQs – Exam-style Questions
Q1.
True/False: In supervised learning, input data does not need labels.
Answer: False
Explanation: Supervised learning requires labeled data to learn the
mapping from input to output.
Q2.
Match the following examples with the type of ML:
Example                                        Type of ML
1. Grouping customers by purchasing patterns   ?
2. Predicting house prices                     ?
3. AlphaGo learning to play Go                 ?
Answer:
1 → Unsupervised (clustering)
2 → Supervised (regression)
3 → Reinforcement learning
Q3.
Fill in the blank:
“An agent in ______ learning interacts with the ______ and learns to
maximize the ______.”
Answer: Reinforcement; environment; reward
Q4.
Multiple choice: Which of the following is NOT typically a goal of
unsupervised learning?
A. Clustering data points
B. Dimensionality reduction
C. Predicting a labeled outcome
D. Discovering hidden patterns
Answer: C. Predicting a labeled outcome
Q5. Numerical/Scenario-style (PYQ style)
A company wants to segment its customers into 3 groups based on buying
behavior, but no labels are available. Which ML type and algorithm is
suitable?
Answer:
Type: Unsupervised learning
Algorithm: K-means clustering
1.3 Key Concepts
Generalization: How well a model works on unseen data.
Model: Mathematical representation (e.g., linear regression line).
Training: Process of fitting model to data.
Testing: Evaluating model on unseen data.
Overfitting: Model memorizes training data (high variance).
Underfitting: Model too simple, cannot capture data patterns (high
bias).
Q1.
Which of the following best describes a model in machine learning?
A. Raw data collected from sensors
B. Mathematical representation capturing patterns in data
C. The process of evaluating performance
D. Splitting data into training and test sets
Answer: B. Mathematical representation capturing patterns in data
Explanation: A model is a function or representation that learns the relationship
between input features and output.
Q2.
Training in machine learning refers to:
A. Using a model to predict unseen data
B. Splitting data into subsets
C. Fitting the model to known data to learn parameters
D. Testing the model on new data
Answer: C. Fitting the model to known data to learn parameters
Explanation: Training adjusts model parameters to minimize error on training data.
Q3.
Testing (or evaluation) in machine learning is used to:
A. Learn model parameters
B. Measure model performance on unseen data
C. Generate more training data
D. Reduce the dataset size
Answer: B. Measure model performance on unseen data
Explanation: Testing evaluates how well the model generalizes beyond the training
set.
Q4.
Which scenario indicates overfitting?
A. Model performs poorly on both training and test data
B. Model performs well on training data but poorly on test data
C. Model performs moderately on both training and test data
D. Model ignores training data
Answer: B. Model performs well on training data but poorly on test data
Explanation: Overfitting means the model memorizes training data, capturing noise
rather than general patterns.
Q5.
Which scenario indicates underfitting?
A. Model captures noise in training data
B. Model is too simple and fails to capture patterns in training data
C. Model performs very well on test data only
D. Model has perfect predictions on training data
Answer: B. Model is too simple and fails to capture patterns in training data
Explanation: Underfitting happens when the model is too simple (high bias) to
represent data relationships.
🔹 PYQs – Exam-style Questions
Q1. True/False:
“Testing data is used to adjust model parameters during training.”
Answer: False
Explanation: Testing data is only for evaluation; parameters are adjusted on training
data.
Q2. Fill in the blanks:
“A model that performs extremely well on ______ data but poorly on ______ data is
likely ______.”
Answer: training; test; overfitting
Q3. Scenario:
You fit a linear regression model on a small dataset. Training error is high and test
error is also high. Which problem is this?
Answer: Underfitting (high bias, model too simple).
Q4. Scenario:
You fit a deep neural network with millions of parameters on a small dataset. Training
error is near zero, but test error is high.
Answer: Overfitting (memorization of training data).
Q5. Multiple choice:
Which of the following can reduce overfitting?
A. Reduce dataset size
B. Use regularization (L1/L2)
C. Increase model complexity
D. Ignore validation set
Answer: B. Use regularization (L1/L2)
Explanation: Regularization penalizes large weights, helping the model generalize
better.
Q7. What is Inductive Bias?
Definition:
Inductive bias = The assumptions a learning algorithm makes to generalize from
training data to unseen data.
Why needed?
Without bias, ML can’t predict unseen cases.
Examples:
o Linear regression assumes relation is linear.
o Decision tree assumes data can be split with conditions.
Q9. Confusion Matrix
Definition: A table showing performance of a classification model.
                  Predicted Positive    Predicted Negative
Actual Positive   True Positive (TP)    False Negative (FN)
Actual Negative   False Positive (FP)   True Negative (TN)
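A minimal sketch of how the four cells can be counted from true and predicted labels. The function name `confusion_counts`, the sample labels, and the convention that 1 marks the positive class are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary classifier."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1  # actual positive, predicted negative
        elif predicted == positive:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical labels, used only to exercise the function.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)  # prints: {'TP': 3, 'FN': 1, 'FP': 1, 'TN': 3}
```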
1.4 Bias-Variance Tradeoff
Bias: Error due to wrong assumptions (underfitting).
Variance: Error due to sensitivity to training data (overfitting).
Goal: Find balance → good generalization.
Precision: Measures the correctness of positive predictions.
Recall: Measures the ability to detect all positives.
Accuracy: Measures the overall correctness of the model.
Quiz – Bias-Variance Tradeoff
Q1.
Which of the following best describes bias in machine learning?
A. Error due to sensitivity to small changes in training data
B. Error due to wrong assumptions or overly simple model
C. Random noise in data
D. Error due to large training dataset
Answer: B. Error due to wrong assumptions or overly simple model
Explanation: High bias → underfitting → model too simple to capture
patterns.
Q2.
Which of the following best describes variance in machine learning?
A. Error due to model not being trained enough
B. Error due to noise in test data
C. Error due to sensitivity to small fluctuations in training data
D. Error due to wrong assumptions
Answer: C. Error due to sensitivity to small fluctuations in training data
Explanation: High variance → overfitting → model learns noise as if it
were signal.
Q3.
What is the main goal of the bias-variance tradeoff?
A. Minimize both bias and variance completely
B. Find a balance so the model generalizes well to unseen data
C. Maximize bias to reduce training time
D. Maximize variance to memorize training data
Answer: B. Find a balance so the model generalizes well to unseen data
Q4.
True/False: Increasing model complexity always decreases both bias and
variance.
Answer: False
Explanation: Increasing complexity reduces bias but increases variance.
Need balance.
Q5. Numerical-style:
Suppose a model has:
Bias² = 4
Variance = 6
Irreducible error = 2
What is the expected total error?
Solution:
$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible error} = 4 + 6 + 2 = 12$
Answer: 12
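The sum used in Q5 comes from the standard decomposition of expected squared error. Written out, with $\hat{f}$ the learned model, $f$ the true function and $\sigma^2$ the noise variance (symbols introduced here for illustration, not taken from the notes):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The first two terms depend on the model and can be traded against each other; the last term is a property of the data and cannot be reduced by any model.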
🔹 PYQs – Exam-style Questions
Q1. Fill in the blank:
“Bias refers to ______, while variance refers to ______.”
Answer: Bias → error due to wrong assumptions/underfitting;
Variance → error due to sensitivity to training data/overfitting.
Q2. Scenario:
You train a linear regression model on complex non-linear data. Training
error is high, test error is high.
Which problem is this? Answer: High bias (underfitting).
Q3. Scenario:
You train a deep neural network on a small dataset. Training error is near
zero, but test error is very high.
Which problem is this? Answer: High variance (overfitting).
Q4. Conceptual:
Why can decreasing bias too much increase variance?
Answer:
A more complex model fits training data closely (low bias) → small
fluctuations/noise in training data cause large changes in predictions
→ high variance.
Q5. Practical:
Name two methods to reduce overfitting (high variance).
Answer (any two of the following):
1. Regularization (L1/L2)
2. Using more training data
3. Early stopping
4. Reducing model complexity
Performance Metrics
Regression: MSE, RMSE, MAE.
Classification: Accuracy, Precision, Recall, F1-score.
Clustering: Silhouette score, Davies–Bouldin index.
📘 Performance Metrics in Machine Learning
Performance metrics help us quantify how well a model is doing. They
vary depending on the type of ML task.
2️⃣ Classification Metrics
Used when predicting categorical labels.
Confusion Matrix:
            Pred +   Pred –
Actual +    TP       FN
Actual –    FP       TN
a) Accuracy
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Fraction of correct predictions over total.
b) Precision
$\text{Precision} = \frac{TP}{TP + FP}$
Fraction of correctly predicted positives among all predicted
positives.
c) Recall (Sensitivity)
$\text{Recall} = \frac{TP}{TP + FN}$
Fraction of correctly predicted positives among all actual positives.
Example:
TP=40, TN=50, FP=10, FN=20
Accuracy = (40+50)/120 = 0.75
Precision = 40/(40+10) = 0.8
Recall = 40/(40+20) ≈ 0.667
F1 = 2 × Precision × Recall / (Precision + Recall) = 2(0.8)(0.667)/(0.8 + 0.667) ≈ 0.727
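The worked example above can be reproduced with a short sketch. The function name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something stated in these notes.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Counts from the worked example: TP=40, TN=50, FP=10, FN=20.
acc, prec, rec, f1 = classification_metrics(tp=40, tn=50, fp=10, fn=20)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# prints: 0.75 0.8 0.667 0.727
```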
3️⃣ Clustering Metrics
Used for unsupervised learning.
a) Silhouette Score
$s = \frac{b - a}{\max(a, b)}$
$a$ = avg distance within cluster, $b$ = avg distance to nearest
other cluster.
Range: −1 to 1. Higher = better clustering.
b) Davies–Bouldin Index (DBI)
Measures average similarity between clusters.
Lower DBI → better separation and compactness.
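To make the silhouette formula concrete, here is a small sketch for 1-D points using absolute difference as the distance. The function name, the toy points, and the choice to score singleton clusters as 0 are assumptions for illustration; it also assumes at least two clusters are present.

```python
def silhouette_score(points, labels):
    """Mean silhouette over all points, for 1-D points and absolute distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: average distance to the other members of the same cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest average distance to any other cluster
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately mixed grouping
print(round(good, 3), round(bad, 3))
```

The natural grouping scores close to +1, while the mixed grouping scores below 0, matching the interpretation of the range given above.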
🔹 Quiz – Performance Metrics
Q1. Which regression metric is most sensitive to outliers?
A) MSE
B) MAE
C) RMSE
D) R²
Answer: A. MSE
Explanation: Squared errors amplify outliers.
Q2. If TP=50, FP=10, FN=5, TN=35, compute Precision and Recall.
Precision = 50/(50+10) = 0.833
Recall = 50/(50+5) ≈ 0.909
Answer: Precision ≈ 0.833, Recall ≈ 0.909
Q3. True/False: F1-score is the arithmetic mean of Precision and Recall.
Answer: False
Explanation: F1-score is harmonic mean, not arithmetic mean.
Q4. Silhouette score of −0.2 indicates:
A) Good clustering
B) Poor clustering, some points assigned to wrong clusters
C) Perfect clustering
D) Cannot determine
Answer: B. Poor clustering
Q5. Lower Davies–Bouldin Index indicates:
A) Better clustering
B) Worse clustering
C) Same as silhouette score
D) Higher variance
Answer: A. Better clustering
🔹 PYQs – Exam-style Questions
Q1.
Compute RMSE and MAE for: y=[2,4,6], ŷ=[3,3,5]
Errors: 2−3=−1, 4−3=1, 6−5=1
MSE = (1² +1² +1²)/3 = 1
RMSE = √1 =1
MAE = (1+1+1)/3=1
Answer: RMSE=1, MAE=1
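The same calculation can be checked in a few lines. This is a minimal sketch; the function name `regression_errors` is hypothetical.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MSE, RMSE, MAE) for two equal-length lists."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return mse, math.sqrt(mse), mae


mse, rmse, mae = regression_errors([2, 4, 6], [3, 3, 5])
print(mse, rmse, mae)  # prints: 1.0 1.0 1.0
```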
Q2.
A confusion matrix: TP=25, TN=45, FP=5, FN=15. Compute Accuracy and
F1-score.
Accuracy = (25+45)/90 = 70/90 ≈ 0.778
Precision = 25/(25+5)=0.833
Recall = 25/(25+15)=0.625
F1 = 2*(0.833*0.625)/(0.833+0.625) ≈ 0.714
Answer: Accuracy≈0.778, F1≈0.714
Q3. Conceptual:
Explain why silhouette score ranges from −1 to 1.
Answer:
+1 → points well matched to own cluster, far from others
0 → points on the boundary of clusters
−1 → points may be assigned to wrong cluster
Q4. Scenario:
You cluster customer data using K-means. Silhouette score = 0.72,
DBI=0.3.
Are clusters good? Answer: Yes, high silhouette and low DBI
indicate compact, well-separated clusters.
Real-world Applications
Healthcare: disease prediction.
Finance: fraud detection.
Retail: recommender systems.
NLP: machine translation, sentiment analysis.
Previous Year Questions (Expanded)
Q1. (2019) According to Tom Mitchell’s definition, which of the
following are necessary elements?
A) Training Data
B) Task to perform
C) Performance measure
D) All of the above
✅ Answer: D
Q2. (2020) Which of the following is NOT a supervised learning
task?
A) Predicting rainfall given weather conditions
B) Classifying handwritten digits
C) Customer segmentation
D) Predicting stock market trend as up or down
✅ Answer: C (Segmentation = unsupervised)
Q3. (2021) If a model has high bias and low variance, then it is
most likely:
A) Underfitting
B) Overfitting
C) Generalizing well
D) None of the above
✅ Answer: A
Q4. (2022) Which of the following is NOT true about
reinforcement learning?
A) It requires labeled training data
B) It learns by interaction with environment
C) It uses rewards and penalties
D) It is used in robotics and games
✅ Answer: A
Q5. (2023) Which metric is most suitable for evaluating a
classifier on imbalanced data?
A) Accuracy
B) Precision-Recall
C) Mean Squared Error
D) R² score
✅ Answer: B
Q6. (Assignment) Which of these leads to overfitting?
A) Very simple model
B) Too many parameters relative to data
C) Large training set
D) Using cross-validation
✅ Answer: B
🌟 Quick Review
Supervised ↔ labeled data
Unsupervised ↔ unlabeled data
Reinforcement ↔ rewards
Overfit ↔ memorizes, Underfit ↔ too simple
Bias = systematic error, Variance = sensitivity
Hypothesis Space & Inductive Bias
Hypothesis space: The set of all candidate models (“brains”) the
learner could choose from.
o Example: All straight lines → linear regression
Inductive bias: Rules the model uses to guess unseen data.
o Example: KNN assumes nearby points have same label
Evaluation & Cross-Validation
Training set: For learning
Test set: For checking predictions
Validation set: Optional, helps tune the model
Cross-Validation: Split data multiple ways → train/test → average result
Helps know if model really works
Cross-Validation
K-Fold CV: Split data into K folds, train K times, average score
LOOCV: Each data point tested once
Why: Reduce variance in performance estimate
Mnemonic: K-fold → Rotate Test
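A minimal sketch of the "rotate the test fold" idea, assuming the data are addressed by index and are not shuffled (in practice one would usually shuffle first). The function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("test:", test, "train:", train)
```

Each index lands in exactly one test fold. Setting k equal to the number of samples gives LOOCV. The final score would be the average of the K per-fold scores.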
Week 2
Part A: Linear Regression
1. What is Regression? (continuous)
Regression = predicting a numeric/continuous value.
Example: predict marks from hours studied, predict house price
from area.
Different from classification (which predicts categories).
X = independent variable
Y = dependent variable
Note: Regression output is continuous; classification output is discrete.
2. Simple Linear Regression
Equation:
$Y = \beta_0 + \beta_1 X + \varepsilon$
$Y$: dependent variable (target/output).
$X$: independent variable (input).
$\beta_0$: intercept → value of Y when X=0.
$\beta_1$: slope → how much Y changes when X increases by 1.
$\varepsilon$: error/noise in data.
👉 It assumes the relation between X and Y is a straight line.
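For a single feature, the intercept and slope can be estimated in closed form by ordinary least squares. The sketch below assumes that approach; the function name and the hours/marks data are hypothetical and were generated from marks = 35 + 5 × hours so the result is easy to check.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares estimates (b0, b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: hours studied vs marks.
hours = [1, 2, 3, 4, 5]
marks = [40, 45, 50, 55, 60]
b0, b1 = fit_simple_linear_regression(hours, marks)
print(b0, b1)  # prints: 35.0 5.0
```

Here the intercept (35) is the predicted mark at zero hours, and the slope (5) is the extra marks per additional hour.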
3. Multiple Linear Regression
Equation:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$
Used when output depends on multiple features.
Example: house price depends on area, number of rooms, location.
4. Cost Function (Error Measurement)
We want the line that fits data best.
Error = difference between predicted and actual values.
Quiz + Answers
Q1. Regression is mainly used for?
👉 Predicting continuous values ✅
Q2. Which metric is commonly minimized in regression?
👉 Mean Squared Error ✅
Q3. Which algorithm is used to estimate coefficients in regression?
👉 Gradient Descent ✅
PYQs
Q1. Write assumptions of linear regression.
✔ Linearity, independence, homoscedasticity, normal errors.
Q2. Differentiate between simple and multiple regression.
✔ Simple → 1 feature; Multiple → many features.
Quiz + Answers
Q1. What does learning rate (α) control?
👉 Step size of updates ✅
Q2. Which GD variant is faster but noisier?
👉 Stochastic Gradient Descent ✅
PYQs
Q1. Explain difference between Batch and SGD.
✔ Batch = stable, slow; SGD = fast, noisy.
Q1. What is the goal of gradient descent?
a) Maximize error
b) Minimize error
c) Find local maximum
d) None
👉 Answer: b) Minimize error
Q2. Which Gradient Descent is fastest but noisy?
👉 Answer: Stochastic Gradient Descent (SGD)
Q3. What happens if the learning rate is too high?
👉 Answer: The algorithm may overshoot and fail to converge.
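The quiz answers above can be checked with a small sketch of batch gradient descent on MSE for a one-feature linear model. The data, starting point, learning rates and step counts are all assumptions chosen for illustration.

```python
def mse(b0, b1, xs, ys):
    """Mean squared error of the line y = b0 + b1 * x on the data."""
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, lr, steps):
    """Batch gradient descent on MSE for y = b0 + b1 * x, starting from zeros."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2 / n * sum(residuals)
        grad_b1 = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step against the gradient, scaled by the learning rate.
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # hypothetical data generated from y = 1 + 2x

b0, b1 = gradient_descent(xs, ys, lr=0.05, steps=2000)
print(round(b0, 4), round(b1, 4))  # prints: 1.0 2.0

# A learning rate that is too large for this data overshoots and diverges.
bad_b0, bad_b1 = gradient_descent(xs, ys, lr=0.2, steps=50)
print(mse(bad_b0, bad_b1, xs, ys) > mse(0.0, 0.0, xs, ys))  # prints: True
```

Because every update uses all five points, this is the batch variant. SGD would compute the gradient from one randomly chosen point per step, which is cheaper per update but noisier.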
6. Assumptions of Linear Regression
1. Linear relationship exists between X and Y.
2. Errors (residuals) are independent.
3. Errors have constant variance.
4. Errors follow normal distribution.
7. Problem with Linear Regression in Classification
Output is not bounded (can be –∞ to +∞).
For classification we need probabilities between 0 and 1
QUIZ (Week 2)
Q1. What is inductive bias in ML?
Ans: Assumptions made by algorithm to choose a hypothesis from the
hypothesis space.
Q2. Which method reduces variance by averaging results across multiple
folds?
a) Bootstrapping
b) Hold-out split
c) Cross-validation
Ans: c) Cross-validation
Q3. Formula for Recall?
Ans: Recall = TP / (TP + FN)
Q4. Gradient Descent updates weights in which direction?
Ans: Opposite direction of gradient of cost function.
Q5. In Linear Regression, which function is minimized?
Ans: Mean Squared Error (MSE).
Q6. Which Gradient Descent uses the full dataset for one update?
Ans: Batch Gradient Descent.
Q7. Which metric is less sensitive to outliers: MSE or MAE?
Ans: MAE.
Q8. If R² = 1, what does it mean?
Ans: Perfect fit of regression model.
Week 2 (Part B, C & D) – Decision Trees & Overfitting
Part B: Introduction to Decision Tree
Definition:
A Decision Tree is a classifier in a tree structure.
Nodes:
o Decision Node: attribute test (e.g., “Outlook = Sunny?”).
o Leaf Node: final classification (Yes/No).
Key Idea:
Recursively split dataset based on attributes → until classification is
simple.
Example (PlayTennis):
Outlook = Sunny → check Humidity → Yes/No.
Outlook = Overcast → Always Yes.
Outlook = Rain → check Wind → Yes/No.
Decision Tree = disjunction of conjunctions
Example:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨
(Outlook = Rain ∧ Wind = Weak)
Challenges:
Which tree to pick?
o Prefer smallest consistent tree (Occam’s Razor).
How to split?
o Use attribute selection measures (Info Gain, Gini, etc.).
Part C: Learning Decision Trees
ID3 Algorithm (Top-Down Induction)
1. Select “best” attribute A.
2. Make node with attribute A.
3. For each value of A, create a branch.
4. Sort training examples by branch value.
5. If examples are pure (all same class) → stop. Else repeat.
Stopping Conditions:
No attributes left.
All examples same class.
Very few examples left.
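Step 1 of ID3 ("select the best attribute") is usually done with information gain, the drop in entropy after a split. The sketch below assumes that measure. The 14 rows are the classic PlayTennis table as commonly reproduced from Mitchell's textbook (Temperature column omitted for brevity); they are not listed in these notes, so treat them as an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target="Play"):
    """Entropy reduction obtained by splitting rows on the given attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


COLUMNS = ("Outlook", "Humidity", "Wind", "Play")
RAW = [  # assumed PlayTennis rows, not taken from these notes
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),
    ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"),
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Sunny", "Normal", "Strong", "Yes"),
    ("Overcast", "High", "Strong", "Yes"),
    ("Overcast", "Normal", "Weak", "Yes"),
    ("Rain", "High", "Strong", "No"),
]
rows = [dict(zip(COLUMNS, r)) for r in RAW]

gains = {a: information_gain(rows, a) for a in ("Outlook", "Humidity", "Wind")}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))  # prints: Outlook 0.247
```

ID3 would place Outlook at the root and then repeat the same calculation inside each branch. The Overcast branch is already pure (entropy 0), so it becomes a leaf.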
Part D: Overfitting in Decision Trees
Problem:
Tree fits training set too perfectly → poor test performance.
Definitions:
Overfitting: Training error ↓, but test error ↑.
Underfitting: Both training and test error ↑.
Causes:
Noise in data.
Too few samples.
Too many parameters (deep tree).
Avoiding Overfitting
1. Pre-pruning (Early stopping):
o Stop growing if split not statistically significant.
o Conditions:
All samples same class.
Few samples left.
χ² test says no useful split.
2. Post-pruning:
o Grow full tree → prune unhelpful branches.
o Example: Reduced Error Pruning (using validation set).
3. Other Methods:
o Cross-validation.
o Regularization (limit tree depth, minimum samples).
o Use more data.
Regularization (Connection to Regression)
Overfitting in Linear Regression → very large weights.
Solutions:
o L2 Regularization (Ridge): Penalize large weights.
$J(w) = \text{MSE} + \lambda \sum w^2$
o L1 Regularization (Lasso): Penalize absolute weights.
$J(w) = \text{MSE} + \lambda \sum |w|$
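A minimal sketch of the two penalised cost functions, to show how the penalty grows with the weights. The function names and all numbers are hypothetical; the MSE term is passed in as a plain number rather than computed from data.

```python
def ridge_cost(mse, weights, lam):
    """J(w) = MSE + lambda * sum(w^2)  (L2 penalty)."""
    return mse + lam * sum(w * w for w in weights)


def lasso_cost(mse, weights, lam):
    """J(w) = MSE + lambda * sum(|w|)  (L1 penalty)."""
    return mse + lam * sum(abs(w) for w in weights)


weights = [3.0, -4.0]  # hypothetical weights
mse = 1.0              # hypothetical data-fit term
lam = 0.1              # hypothetical regularization strength

print(ridge_cost(mse, weights, lam))  # 1.0 + 0.1 * (9 + 16)
print(lasso_cost(mse, weights, lam))  # 1.0 + 0.1 * (3 + 4)
```

The L2 penalty grows quadratically, so it punishes a few large weights heavily. The L1 penalty grows linearly, which in practice tends to push some weights to exactly zero.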
📌 QUIZ (Week 2 B, C, D)
Q1. What does a decision node represent?
Ans: A test on an attribute.
Q2. What is entropy when dataset is pure (all positive)?
Ans: 0.
Q3. Which attribute has the highest info gain in PlayTennis?
Ans: Outlook.
Q4. Define Overfitting.
Ans: When model has low training error but high test error.
Q5. What is the difference between Pre-pruning and Post-pruning?
Ans: Pre-pruning stops tree early; Post-pruning prunes after building.
Week-3
Week 3 – Instance-Based Learning & Feature Selection/Extraction
🔹 Part A: Instance-Based Learning (IBL)
1. Key Idea
Store the training examples $(x_n, f(x_n))$.
For a new test example → find the closest matches and predict
from them.
Inductive assumption: similar inputs → similar outputs.
2. k-Nearest Neighbor (k-NN)
Training phase: just memorize data.
Prediction phase:
1. Find the k nearest neighbors of the test point.
2. Classification → majority vote of neighbors.
Regression → average of neighbors.
Decision boundary: forms a Voronoi diagram.
k-Nearest Neighbors (kNN):
A lazy learning algorithm (no training, only stores data).
To classify a new point, it checks the k closest neighbors in the
dataset.
Decision = majority vote of neighbors (classification) OR average
value (regression).
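The prediction steps above can be sketched in a few lines, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy training set are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping `train`; all work happens at prediction time.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training set: (features, label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]
print(knn_predict(train, (2.0, 2.0), k=3))  # prints: A
print(knn_predict(train, (6.0, 7.0), k=3))  # prints: B
```

For regression, the vote would be replaced by the average of the neighbors' target values.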
3. Choosing “k”
Small k: captures fine structure, but sensitive to noise.
Large k: more stable, less sensitive to noise, better probability
estimates.
As the number of samples n → ∞ and k → ∞ with k/n → 0, kNN approaches the Bayes optimal classifier.
Theory
kNN = instance-based, lazy learning algorithm.
Stores all training data → no training phase.
Classification → majority class of neighbors.
Regression → average of neighbors.
Key parameter: value of k.
o Small k = sensitive to noise.
o Large k = smoother, may underfit.
Curse of Dimensionality
If features are too many, distance becomes meaningless.
Example: In 2D, neighbors are near; in 100D, everyone is far away!
👉 Need feature selection/extraction.
Curse of dimensionality:
o In high dimensions, distances between points become similar.
o Nearest neighbor loses meaning.
o Model becomes less accurate.
o Solution → reduce dimensions (Feature Selection/Extraction).
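One way to see distances "becoming similar" is to compare the nearest and farthest neighbor of a random query point as the dimension grows. This is an illustrative sketch under assumed settings (uniform random points in the unit cube, 200 points, a fixed seed); the function name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

The printed ratio should shrink as the dimension increases: the farthest point is barely farther than the nearest one, so "nearest neighbor" carries little information.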
Feature Selection Methods:
1. Filter methods – use statistics (correlation, chi-square, mutual
information).
2. Wrapper methods – test feature subsets with classifier
(expensive).
3. Embedded methods – feature selection during training (e.g.,
Decision Trees, Lasso).
Quiz
1. Curse of dimensionality means → distances lose meaning in high
dimensions.
2. Solution to curse → dimension reduction.
3. Filter method uses → statistical tests.
4. Wrapper method uses → classifier accuracy.
5. Embedded method example? → Decision Trees, Lasso
regression.
Quiz
1. kNN is a lazy/instance-based learner.
2. Formula of Euclidean distance? → $\sqrt{\sum (x_i - y_i)^2}$
3. Which similarity measure is useful in text data? → Cosine
similarity.
4. If k=1, model is sensitive to → noise.
5. Increasing k makes model → smoother / less sensitive.
Quiz Q&A
1. kNN is what type of learning algorithm?
→ Instance-based / Lazy learner.
2. What happens if k = 1?
→ Nearest neighbor decides, may overfit to noise.
3. Formula for Euclidean distance?
→ $\sqrt{\sum (x_i - y_i)^2}$
4. Which distance is suitable for text classification?
→ Cosine similarity.
5. Main drawback of kNN?
→ High computation for large datasets (must calculate distance to
all points).
6. kNN works well when features are…
→ Scaled properly (normalization is important).
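The distance and similarity measures named in these quizzes can be written out directly. This is a minimal sketch with hypothetical function names; `cosine_similarity` assumes neither vector is all zeros.

```python
import math


def euclidean(x, y):
    """Straight-line distance: sqrt(sum((x_i - y_i)^2))."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (ignores vector length)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


print(euclidean([0, 0], [3, 4]))          # prints: 5.0
print(manhattan([0, 0], [3, 4]))          # prints: 7
print(cosine_similarity([1, 0], [0, 1]))  # prints: 0.0
```

Cosine similarity suits text because a long document and a short one with the same word proportions point in the same direction, even though their Euclidean distance is large.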
Part C: Feature Extraction (PCA & LDA)
Theory
Feature Extraction = create new reduced features from original
ones.
PCA (Principal Component Analysis):
o Unsupervised, ignores class labels.
o Maximizes variance.
o Uses eigenvectors of covariance matrix.
LDA (Linear Discriminant Analysis):
o Supervised, uses class labels.
o Maximizes separation between classes.
👉 Difference: PCA = variance, LDA = class separability.
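A rough sketch of the PCA idea: build the covariance matrix and find its leading eigenvector, which is the direction of maximum variance. To stay dependency-free it uses power iteration instead of a full eigendecomposition; that choice, the function names and the toy data are assumptions for illustration.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of the rows in `data` (list of equal-length lists)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(data, iterations=200):
    """Leading eigenvector of the covariance matrix, found by power iteration."""
    cov = covariance_matrix(data)
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical 2-D data lying close to the line y = x.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
pc1 = first_principal_component(data)
print([round(x, 2) for x in pc1])
```

The result points roughly along (1, 1)/√2, the diagonal on which the data vary most. Projecting each centered point onto this vector would reduce the data from two features to one. Note that no class labels were used, which is what makes PCA unsupervised.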
Quiz
1. PCA is unsupervised.
2. LDA uses → class labels.
3. PCA chooses directions that maximize → variance.
4. LDA chooses directions that maximize → class separation.
5. Which is better for classification? → LDA.
PYQs
Explain PCA with steps. (10 marks)
Differentiate PCA and LDA. (5 marks)
Why is PCA used before applying kNN? (5 marks)
🔹 Part D: Collaborative Filtering (Recommender Systems)
Theory
Collaborative Filtering = recommend items based on similar
users/items.
Types:
1. User-based CF
o Find similar users.
o Recommend items they liked.
o Example: “Users like you watched…”
2. Item-based CF
o Find similar items to those the user liked.
o Recommend those items.
o Example: “Because you watched A, we recommend B.”
o More scalable.
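User-based CF can be sketched as: measure how similar other users are to the target user, then score unseen items by similarity-weighted ratings. The version below assumes cosine similarity with unrated items treated as 0; the function names, user names and ratings are hypothetical.

```python
import math


def cosine(u, v, items):
    """Cosine similarity of two rating dicts; unrated items count as 0."""
    dot = sum(u.get(i, 0) * v.get(i, 0) for i in items)
    norm_u = math.sqrt(sum(u.get(i, 0) ** 2 for i in items))
    norm_v = math.sqrt(sum(v.get(i, 0) ** 2 for i in items))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(ratings, user):
    """User-based CF: rank unseen items by similarity-weighted ratings."""
    items = sorted({i for r in ratings.values() for i in r})
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings, items)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob":   {"A": 5, "B": 4, "C": 5},
    "carol": {"A": 1, "B": 1, "D": 5},
}
print(recommend(ratings, "alice"))  # prints: ['C', 'D']
```

Bob's tastes match Alice's closely, so his favourite unseen item (C) ranks first. Item-based CF would instead compare item columns with each other, which can be precomputed and is one reason it tends to scale better.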
Quiz
1. CF stands for → Collaborative Filtering.
2. User-based CF uses → similar users.
3. Item-based CF uses → similar items.
4. Which is more scalable? → Item-based CF.
5. Amazon’s recommender is mostly → item-based CF.
QUICK REVISION CHEAT SHEET (Week 3)
kNN = Lazy learner, depends on distance.
Distances: Euclidean, Manhattan, Cosine.
Curse of dimensionality = distance useless in high dimensions →
reduce features.
Feature Selection: Filter / Wrapper / Embedded.
Feature Extraction: PCA (variance), LDA (class separation).
Recommender Systems: User-based vs Item-based CF
Week -4
Week 4 Notes – Probability & Bayesian Learning
Part A: Probability Basics
1. Probability & Experiments
Probability: study of randomness & uncertainty.
Random Experiment: process with uncertain outcome.
o Example: tossing a coin, rolling a die, drawing cards.
In MAP learning, which factor is considered?
a) Only likelihood
b) Only prior
c) Prior × Likelihood ✅
d) None
In Naïve Bayes, the main assumption is?
Conditional independence of features ✅
QUIZ QUESTIONS (MCQ + Short) – Week 4
MCQs
1. Sample space of rolling a die is?
a) {H,T}
b) {1,2,3,4,5,6} ✅
c) {HH,HT}
d) None
2. Which of these is NOT an axiom of probability?
a) $P(\Omega) = 1$
b) $0 \leq P(A) \leq 1$
c) $P(A \cap B) = P(A) + P(B)$ ✅
d) $P(\varnothing) = 0$
3. In Naïve Bayes, the main assumption is:
a) Features are correlated
b) Features are independent ✅
c) Priors are ignored
d) None
4. Bayes theorem is used to compute?
a) Prior
b) Likelihood
c) Posterior ✅
d) Evidence
5. Maximum likelihood (ML) learning ignores:
a) Prior ✅
b) Likelihood
c) Evidence
d) Data
True/False
6. For any event A, P(A)+P(A’) = 1. ✅
7. In MAP hypothesis, prior is not considered. ❌
8. Naïve Bayes always gives optimal performance. ❌
Short Answer
9. Define Conditional Probability with formula.
👉 $P(A|B) = \frac{P(A \cap B)}{P(B)}$
10. Differentiate ML vs MAP.
👉 ML uses only likelihood, MAP uses prior × likelihood.
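The difference between ML and MAP is easiest to see with two hypotheses where the prior and the likelihood pull in opposite directions. The sketch below uses made-up numbers chosen for that purpose; the function name `ml_and_map` is hypothetical.

```python
def ml_and_map(prior, likelihood):
    """Return (ML hypothesis, MAP hypothesis, posterior) for discrete hypotheses."""
    # ML: maximise P(D|h) only.
    h_ml = max(likelihood, key=likelihood.get)
    # MAP: maximise P(h) * P(D|h), which is proportional to P(h|D).
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    h_map = max(unnormalised, key=unnormalised.get)
    evidence = sum(unnormalised.values())  # P(D)
    posterior = {h: p / evidence for h, p in unnormalised.items()}
    return h_ml, h_map, posterior


# Hypothetical numbers chosen so that ML and MAP disagree.
prior = {"h1": 0.9, "h2": 0.1}       # P(h)
likelihood = {"h1": 0.3, "h2": 0.8}  # P(D|h)
h_ml, h_map, posterior = ml_and_map(prior, likelihood)
print(h_ml, h_map)  # prints: h2 h1
```

With a uniform prior, the prior factor is the same for every hypothesis, so both rules pick the same answer.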
Week 4 – Probability & Bayesian Learning (120 Q&A)
Topic 1 – Probability Basics (40 Q&A)
MCQs (20)
1. Sample space of a coin toss?
a) {H,T} ✅
b) {1,2}
c) {HH,HT}
d) None
2. Probability of an impossible event?
a) 0 ✅
b) 1
c) 0.5
d) None
3. P(A∪B) = ?
a) P(A) + P(B)
b) P(A) + P(B) – P(A∩B) ✅
c) P(A∩B)
d) P(A) × P(B)
4. If P(A) = 0.3, P(B) = 0.4, independent, P(A∩B)?
a) 0.12 ✅
b) 0.7
c) 0.1
d) 0.3
5. Complement of event A formula?
a) 1 – P(A) ✅
b) P(A)
c) P(A∩B)
d) None
… (MCQs 6–20 continue with event, union, intersection, independence,
conditional probability, etc.)
True/False (10)
21. P(A)+P(A’) = 1 ✅
22. P(Ω)=0 ❌
23. Disjoint events cannot occur together ✅
24. Conditional probability formula: P(A|B)=P(A∩B)/P(B) ✅
25. Two independent events always have P(A∩B)=0 ❌
… (TF 26–30)
Fill-in-the-blank (5)
31. The sum of probabilities of all outcomes in a sample space is
1.
32. If A ⊆ B, then P(A) ≤ P(B).
… (FB 33–35)
Short Answer (5)
36. Define Random Experiment, Sample Space, Event.
37. Explain complement of an event with formula.
38. State axioms of probability.
39. Example of independent events.
40. Example of mutually exclusive events.
Topic 2 – Random Variables & Distributions (20 Q&A)
MCQs (10)
41. A discrete random variable takes:
a) Countable values ✅
b) Continuous values
c) Infinite values
d) None
42. Sum of PMF probabilities = ?
a) 1 ✅
b) 0
c) 0.5
d) 2
43. PDF integral over all x = ?
a) 1 ✅
b) 0
…
(MCQs 44–50 cover expectation, variance, PMF/PDF, discrete vs
continuous)
True/False (5)
51. PDF can be negative ❌
52. Expectation is mean of RV ✅
…
Short Answer (5)
56. Difference between discrete & continuous RV.
57. Define PMF & PDF.
58. Compute E[X] for X={1,2,3} with P(X)={0.2,0.5,0.3}.
59. Compute Var[X].
60. Explain why ∫f(x)dx=1 for PDF.
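Questions 58 and 59 can be worked through with a short sketch. It assumes the identity Var[X] = E[X²] − (E[X])², which is standard but not stated in these notes; the function names are hypothetical.

```python
def expectation(values, probs):
    """E[X] = sum of x * P(X = x)."""
    return sum(x * p for x, p in zip(values, probs))


def variance(values, probs):
    """Var[X] = E[X^2] - (E[X])^2."""
    mean = expectation(values, probs)
    return expectation([x * x for x in values], probs) - mean ** 2


values = [1, 2, 3]
probs = [0.2, 0.5, 0.3]
print(round(expectation(values, probs), 4))  # prints: 2.1
print(round(variance(values, probs), 4))     # prints: 0.49
```

By hand: E[X] = 1(0.2) + 2(0.5) + 3(0.3) = 2.1 and E[X²] = 1(0.2) + 4(0.5) + 9(0.3) = 4.9, so Var[X] = 4.9 − 2.1² = 0.49.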
Topic 3 – Conditional Probability & Bayes’ Theorem (20 Q&A)
MCQs (10)
61. P(A|B) = ?
a) P(A∩B)/P(B) ✅
…
(MCQs 62–70 include independence, examples, conditional probs, Bayes
theorem calculations)
True/False (5)
71. Bayes theorem updates prior with evidence ✅
…
Topic 4 – Bayesian Learning (20 Q&A)
MCQs (10)
81. MAP maximizes:
a) P(D|h)
b) P(h|D) ✅
c) Both
d) None
82. ML ignores prior ✅
… (MCQs 83–90)
True/False (5)
91. ML = MAP when prior is uniform ✅
92. MAP considers prior knowledge ✅
…
Topic 5 – Naïve Bayes & Bayesian Networks (20 Q&A)
MCQs (10)
101. Naïve Bayes assumes features are independent ✅
102. Bayesian network represents:
a) Dependencies between variables ✅
… (MCQs 103–110)
True/False (5)
111. Naïve Bayes always gives best prediction ❌
112. Bayesian networks use CPT ✅
…
Week -5
Q10. Linear SVM with noise is solved using:
A) Logistic regression
B) Soft margin SVM
C) Decision tree
D) Random forest
Answer: B
Explanation: Soft margin allows some misclassification, which handles
noisy or not perfectly separable data.
Q11. Which kernel is NOT commonly used in SVM?
A) Linear
B) Polynomial
C) Gaussian (RBF)
D) Cosine
Answer: D
Explanation: Linear, polynomial, Gaussian, and sigmoid kernels are
standard. Cosine is not standard.
Q12. Kernel trick is used to:
A) Reduce computation by avoiding high-dimensional feature mapping
B) Increase model overfitting
C) Replace SVM with decision trees
D) Compute gradient faster
Answer: A
Q13. SMO algorithm is used to:
A) Solve primal SVM problem
B) Solve dual SVM problem efficiently
C) Compute sigmoid function
D) Normalize features
Answer: B
Q14. Multi-class SVM is handled by:
A) Single SVM
B) One-vs-Rest approach
C) Logistic regression
D) Random forest
Answer: B
Explanation: Train N SVMs, each for one class vs all others.
Q15. Which parameter controls the trade-off between margin width and
misclassification in soft-margin SVM?
A) Alpha
B) Beta
C) C
D) Gamma
Answer: C
SVM Questions
Q26. Maximum margin classifier is:
A) Logistic regression
B) Linear SVM
C) Decision tree
D) Naive Bayes
Answer: B
Q27. Hinge loss is used in:
A) Linear regression
B) Logistic regression
C) SVM
D) KNN
Answer: C
Explanation: SVM minimizes hinge loss to maximize margin.
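A minimal sketch of hinge loss and of the usual soft-margin objective built from it, assuming labels in {−1, +1} and a linear score w·x + b. The function names and sample values are hypothetical.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw score w.x + b."""
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses over the training set."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
    return margin_term + C * sum(hinge_loss(yi, s) for yi, s in zip(y, scores))


print(hinge_loss(+1, 2.0))  # prints: 0.0  (correct, outside the margin)
print(hinge_loss(+1, 0.5))  # prints: 0.5  (correct, but inside the margin)
print(hinge_loss(-1, 0.5))  # prints: 1.5  (misclassified)
```

Points beyond the margin contribute nothing, so only margin violations and the size of w affect the objective. A larger C makes violations more expensive relative to a wide margin.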
Q30. Which statement is TRUE for soft-margin SVM?
A) Allows some misclassification
B) Uses C parameter to control penalty
C) Both A and B
D) None
Answer: C
Q31. Which is TRUE about linear vs kernel SVM?
A) Linear SVM cannot classify linearly separable data
B) Kernel SVM can map data to higher dimensions
C) Linear SVM always overfits
D) Kernel SVM is slower but always worse
Answer: B
Q32. If a point is correctly classified with distance > margin, its Lagrange
multiplier $\alpha_i$ is:
A) 0
B) > 0
C) = 1
D) < 0
Answer: A
Explanation: Only support vectors (on margin) have $\alpha_i > 0$.
Q33. What does the parameter gamma control in RBF kernel?
A) Margin width
B) Influence of a single training point
C) Learning rate
D) Number of features
Answer: B
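The effect of gamma can be seen by evaluating the RBF kernel for one pair of points at several gamma values. This sketch assumes the common form K(x, z) = exp(−γ‖x − z‖²); the function name and the points are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x, z = [0.0, 0.0], [1.0, 1.0]
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x, z, gamma))
```

As gamma grows, the kernel value for the same pair of points falls toward 0, so each training point influences only a small neighborhood. Small gamma gives each point a wide reach and a smoother decision boundary.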
Q34. “One-vs-One” approach in multi-class SVM means:
A) Train N SVMs for N classes
B) Train N(N-1)/2 SVMs for N classes
C) Train one SVM for all classes
D) None
Answer: B
Q35. Which SVM type is best for non-linear data?
A) Linear SVM
B) Soft-margin SVM with kernel
C) Logistic regression
D) Naive Bayes
Answer: B