
Understanding Machine Learning Basics

Uploaded by

saniyanadaf300
Copyright
© All Rights Reserved

1.1 What is Machine Learning?

 Arthur Samuel’s definition (1959):


“Field of study that gives computers the ability to learn without
being explicitly programmed.”

 Tom Mitchell’s definition (1997):


“A computer program is said to learn from experience E with respect
to some task T and performance measure P, if its performance on T,
as measured by P, improves with experience E.”

Example:

 Task (T): Predicting house prices.

 Experience (E): Past sales data.

 Performance measure (P): Prediction error, e.g., Mean Squared Error (MSE); lower is better.

Quiz-style Questions

Q1. Who gave the first popular definition of Machine Learning in 1959?

 A. Tom Mitchell

 B. Arthur Samuel

 C. Alan Turing

 D. Andrew Ng

👉 Answer: B. Arthur Samuel

Q2. Tom Mitchell’s definition of ML has 3 parts: Task (T), Experience (E),
and Performance measure (P).
Which of the following is an example of Experience (E)?

 A. Predicting whether email is spam

 B. Accuracy of predictions

 C. Historical labeled email dataset

 D. Number of correct predictions

👉 Answer: C. Historical labeled email dataset

Q3. According to Tom Mitchell’s definition, a program is said to learn if:

 A. Its performance improves with experience


 B. It can memorize past data

 C. It is explicitly programmed

 D. It runs faster with more data

👉 Answer: A. Its performance improves with experience

Q4. In predicting house prices, what would be a suitable performance measure (P)?

 A. Mean Squared Error (MSE)

 B. Number of houses sold

 C. Square footage of houses

 D. Size of training dataset

👉 Answer: A. Mean Squared Error (MSE)

Q5. Which of the following is NOT part of Tom Mitchell’s ML definition?

 A. Task (T)

 B. Experience (E)

 C. Performance measure (P)

 D. Dataset size (D)

👉 Answer: D. Dataset size (D)

🔹 PYQs (Exam-style)

Q1.
True/False:
“According to Tom Mitchell, a program is said to learn from experience if it
can memorize the dataset.”

👉 Answer: False.
Learning = performance improves on the task with experience, not just
memorization.

Q2.
Fill in the blanks:
A computer program is said to learn from ______ with respect to some task
______ and performance measure ______, if its performance on ______
improves with ______.

👉 Answer: Experience (E), Task (T), Performance measure (P), Task (T),
Experience (E).

Q3.
Example mapping (PYQ style):

 Predicting house prices → Task (T)

 Past house sales data → Experience (E)

 Mean Squared Error → Performance measure (P)

Q4.
Multiple choice:
Which statement is most aligned with Tom Mitchell’s ML definition?

A. ML is programming computers manually.


B. ML is about writing explicit rules for every case.
C. ML is about systems that improve automatically with more data and
experience.
D. ML is about faster algorithms only.

👉 Answer: C. ML is about systems that improve automatically with more data and experience.

1.2 Types of Machine Learning


1. Supervised Learning

o Input data has labels (each example comes with its correct output).

o Goal: Learn a mapping $f: X \to Y$ from inputs to labels.

o Supervised learning has 2 types: regression and classification.

o Examples: Regression (predict price), Classification (spam detection).

2. Unsupervised Learning
o Data has no labels.

o Goal: Find hidden patterns/structure.

o Examples: Clustering (K-means), Dimensionality reduction (PCA).

3. Semi-Supervised Learning
 Few labeled + many unlabeled data.

 Example: Medical diagnosis (labels are costly).

4. Reinforcement Learning
o Agent interacts with environment → gets
rewards/punishments.

o Goal: Learn optimal policy to maximize reward.

o Examples: Chess AI, self-driving cars.

Quiz – Types of Machine Learning

Q1. Which of the following is an example of supervised learning?

A. K-means clustering
B. PCA (Principal Component Analysis)
C. Spam email detection
D. Chess AI learning via rewards

Answer: C. Spam email detection


Explanation: Supervised learning uses labeled data to learn a mapping
from input to output. Spam detection has labeled emails (spam/not spam).

Q2. In unsupervised learning, the main goal is to:

A. Predict outcomes based on labels


B. Find hidden patterns or structures in data
C. Maximize rewards through interaction
D. Solve linear equations
Answer: B. Find hidden patterns or structures in data
Explanation: Unsupervised learning works on unlabeled data and
discovers patterns, e.g., clustering or dimensionality reduction.

Q3. Reinforcement learning is different from supervised learning


because:

A. It always requires labeled input-output pairs


B. The agent learns by interacting with the environment and receiving
feedback
C. It uses clustering algorithms
D. It only works with linear models

Answer: B. The agent learns by interacting with the environment and


receiving feedback
Explanation: RL focuses on learning optimal actions to maximize long-
term reward through trial-and-error.

Q4. Which algorithm is commonly used for unsupervised learning?

A. Linear regression
B. Decision Trees
C. K-means clustering
D. Logistic regression

Answer: C. K-means clustering

Q5. Self-driving cars learning to drive by trial and error is an


example of:

A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. Deep learning

Answer: C. Reinforcement learning


Explanation: The car (agent) interacts with the environment (roads) and
receives feedback (rewards/punishments) to improve its driving policy.

🔹 PYQs – Exam-style Questions

Q1.
True/False: In supervised learning, input data does not need labels.

Answer: False
Explanation: Supervised learning requires labeled data to learn the
mapping from input to output.

Q2.

Match the following examples with the type of ML:

Example                                          Type of ML

1. Grouping customers by purchasing patterns     ?

2. Predicting house prices                       ?

3. AlphaGo learning to play Go                   ?

Answer:
1 → Unsupervised (clustering)
2 → Supervised (regression)
3 → Reinforcement learning

Q3.

Fill in the blank:


“An agent in ______ learning interacts with the ______ and learns to
maximize the ______.”

Answer: Reinforcement; environment; reward

Q4.

Multiple choice: Which of the following is NOT typically a goal of


unsupervised learning?

A. Clustering data points


B. Dimensionality reduction
C. Predicting a labeled outcome
D. Discovering hidden patterns

Answer: C. Predicting a labeled outcome


Q5. Numerical/Scenario-style (PYQ style)

A company wants to segment its customers into 3 groups based on buying


behavior, but no labels are available. Which ML type and algorithm is
suitable?

Answer:

 Type: Unsupervised learning

 Algorithm: K-means clustering

1.3 Key Concepts

 Generalization: How well a model works on unseen data.

 Model: Mathematical representation (e.g., linear regression line).

 Training: Process of fitting model to data.

 Testing: Evaluating model on unseen data.

 Overfitting: Model memorizes training data (high variance).

 Underfitting: Model too simple, cannot capture data patterns (high bias).

 Q1.
 Which of the following best describes a model in machine learning?
 A. Raw data collected from sensors
B. Mathematical representation capturing patterns in data
C. The process of evaluating performance
D. Splitting data into training and test sets
 Answer: B. Mathematical representation capturing patterns in data
Explanation: A model is a function or representation that learns the relationship
between input features and output.

 Q2.
 Training in machine learning refers to:
 A. Using a model to predict unseen data
B. Splitting data into subsets
C. Fitting the model to known data to learn parameters
D. Testing the model on new data
 Answer: C. Fitting the model to known data to learn parameters
Explanation: Training adjusts model parameters to minimize error on training data.

 Q3.
 Testing (or evaluation) in machine learning is used to:
 A. Learn model parameters
B. Measure model performance on unseen data
C. Generate more training data
D. Reduce the dataset size
 Answer: B. Measure model performance on unseen data
Explanation: Testing evaluates how well the model generalizes beyond the training
set.

 Q4.
 Which scenario indicates overfitting?
 A. Model performs poorly on both training and test data
B. Model performs well on training data but poorly on test data
C. Model performs moderately on both training and test data
D. Model ignores training data
 Answer: B. Model performs well on training data but poorly on test data
Explanation: Overfitting means the model memorizes training data, capturing noise
rather than general patterns.

 Q5.
 Which scenario indicates underfitting?
 A. Model captures noise in training data
B. Model is too simple and fails to capture patterns in training data
C. Model performs very well on test data only
D. Model has perfect predictions on training data
 Answer: B. Model is too simple and fails to capture patterns in training data
Explanation: Underfitting happens when the model is too simple (high bias) to
represent data relationships.

 🔹 PYQs – Exam-style Questions


 Q1. True/False:
“Testing data is used to adjust model parameters during training.”
 Answer: False
Explanation: Testing data is only for evaluation; parameters are adjusted on training
data.

 Q2. Fill in the blanks:


“A model that performs extremely well on ______ data but poorly on ______ data is
likely ______.”
 Answer: training; test; overfitting

 Q3. Scenario:
You fit a linear regression model on a small dataset. Training error is high and test
error is also high. Which problem is this?
 Answer: Underfitting (high bias, model too simple).

 Q4. Scenario:
You fit a deep neural network with millions of parameters on a small dataset. Training
error is near zero, but test error is high.
 Answer: Overfitting (memorization of training data).

 Q5. Multiple choice:


Which of the following can reduce overfitting?
 A. Reduce dataset size
B. Use regularization (L1/L2)
C. Increase model complexity
D. Ignore validation set
 Answer: B. Use regularization (L1/L2)
Explanation: Regularization penalizes large weights, helping the model generalize
better.

Q7. What is Inductive Bias?


 Definition:
Inductive bias = The assumptions a learning algorithm makes to generalize from
training data to unseen data.
 Why needed?
Without bias, ML can’t predict unseen cases.
 Examples:
o Linear regression assumes relation is linear.
o Decision tree assumes data can be split with conditions.

Q9. Confusion Matrix


 Definition: A table showing performance of a classification model.

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)

1.4 Bias-Variance Tradeoff

 Bias: Error due to wrong assumptions (underfitting).

 Variance: Error due to sensitivity to training data (overfitting).

 Goal: Find balance → good generalization.


Precision: Measures the correctness of positive predictions, TP / (TP + FP).

Recall: Measures the ability to detect all actual positives, TP / (TP + FN).

Accuracy: Measures the overall correctness of the model, (TP + TN) / total.

Quiz – Bias-Variance Tradeoff

Q1.

Which of the following best describes bias in machine learning?

A. Error due to sensitivity to small changes in training data


B. Error due to wrong assumptions or overly simple model
C. Random noise in data
D. Error due to large training dataset
Answer: B. Error due to wrong assumptions or overly simple model
Explanation: High bias → underfitting → model too simple to capture
patterns.

Q2.

Which of the following best describes variance in machine learning?

A. Error due to model not being trained enough


B. Error due to noise in test data
C. Error due to sensitivity to small fluctuations in training data
D. Error due to wrong assumptions

Answer: C. Error due to sensitivity to small fluctuations in training data


Explanation: High variance → overfitting → model learns noise as if it
were signal.

Q3.

What is the main goal of the bias-variance tradeoff?

A. Minimize both bias and variance completely


B. Find a balance so the model generalizes well to unseen data
C. Maximize bias to reduce training time
D. Maximize variance to memorize training data

Answer: B. Find a balance so the model generalizes well to unseen data

Q4.

True/False: Increasing model complexity always decreases both bias and variance.

Answer: False
Explanation: Increasing complexity typically reduces bias but increases variance, so a balance is needed.

Q5. Numerical-style:

Suppose a model has:

 Bias² = 4

 Variance = 6
 Irreducible error = 2

What is the expected total error?

Solution:

$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible error} = 4 + 6 + 2 = 12$

Answer: 12

🔹 PYQs – Exam-style Questions

Q1. Fill in the blank:


“Bias refers to ______, while variance refers to ______.”

Answer: Bias → error due to wrong assumptions/underfitting;


Variance → error due to sensitivity to training data/overfitting.

Q2. Scenario:
You train a linear regression model on complex non-linear data. Training
error is high, test error is high.

 Which problem is this? Answer: High bias (underfitting).

Q3. Scenario:
You train a deep neural network on a small dataset. Training error is near
zero, but test error is very high.

 Which problem is this? Answer: High variance (overfitting).

Q4. Conceptual:
Why can decreasing bias too much increase variance?

Answer:

 A more complex model fits training data closely (low bias) → small
fluctuations/noise in training data cause large changes in predictions
→ high variance.

Q5. Practical:
Name two methods to reduce overfitting (high variance).
Answer (any two of the following):

1. Regularization (L1/L2)

2. Using more training data

3. Early stopping

4. Reducing model complexity

Performance Metrics

 Regression: MSE, RMSE, MAE.

 Classification: Accuracy, Precision, Recall, F1-score.

 Clustering: Silhouette score, Davies–Bouldin index.

📘 Performance Metrics in Machine Learning

Performance metrics help us quantify how well a model is doing. They


vary depending on the type of ML task.
2️⃣ Classification Metrics

Used when predicting categorical labels.

Confusion Matrix:

            Pred +    Pred –
Actual +    TP        FN
Actual –    FP        TN

a) Accuracy

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

 Fraction of correct predictions over total.

b) Precision

$\text{Precision} = \frac{TP}{TP + FP}$

 Fraction of correctly predicted positives among all predicted positives.

c) Recall (Sensitivity)

$\text{Recall} = \frac{TP}{TP + FN}$

 Fraction of correctly predicted positives among all actual positives.


Example:
TP=40, TN=50, FP=10, FN=20

 Accuracy = (40+50)/120 = 0.75

 Precision = 40/(40+10) = 0.8

 Recall = 40/(40+20) ≈ 0.667

 F1 = 2 × Precision × Recall / (Precision + Recall) ≈ 0.727
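
The worked example above can be checked with a few lines of Python. This is only a minimal sketch: the function name `classification_metrics` is an illustrative choice, not something from these notes, and the zero-division guards are an added assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 instead of failing when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts taken from the example: TP=40, TN=50, FP=10, FN=20.
metrics = classification_metrics(tp=40, tn=50, fp=10, fn=20)
print({name: round(value, 3) for name, value in metrics.items()})
```

Rounded to three decimals, the values agree with the hand calculation (0.75, 0.8, 0.667, 0.727).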

3️⃣ Clustering Metrics

Used for unsupervised learning.

a) Silhouette Score

$s = \frac{b - a}{\max(a, b)}$

 $a$ = avg distance within cluster, $b$ = avg distance to nearest other cluster.

 Range: −1 to 1. Higher = better clustering.

b) Davies–Bouldin Index (DBI)

 Measures average similarity between clusters.

 Lower DBI → better separation and compactness.
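
As a rough illustration of the silhouette formula, the sketch below computes the score for a single point. The toy coordinates and the helper name `silhouette_of_point` are hypothetical; it assumes Euclidean distance and that the point's own cluster is passed without the point itself.

```python
import math


def silhouette_of_point(point, own_cluster, other_clusters):
    """Silhouette s = (b - a) / max(a, b) for one point.

    own_cluster: the other members of the point's cluster (point excluded).
    other_clusters: list of clusters, each a list of points.
    """
    # a: average distance to the other members of the same cluster.
    a = sum(math.dist(point, q) for q in own_cluster) / len(own_cluster)
    # b: smallest average distance to any other cluster.
    b = min(sum(math.dist(point, q) for q in cluster) / len(cluster)
            for cluster in other_clusters)
    return (b - a) / max(a, b)


# Hypothetical toy data: a = 1.0 and b = 4.0.
point = (0.0, 0.0)
own = [(0.0, 1.0), (1.0, 0.0)]
others = [[(4.0, 0.0), (0.0, 4.0)]]
s = silhouette_of_point(point, own, others)
print(round(s, 2))
```

Here the point is much closer to its own cluster than to the other one, so the score is well above 0. The overall silhouette score is the average of this value over all points.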


🔹 Quiz – Performance Metrics

Q1. Which regression metric is most sensitive to outliers?

 A) MSE

 B) MAE

 C) RMSE

 D) R²

Answer: A. MSE
Explanation: Squared errors amplify outliers.

Q2. If TP=50, FP=10, FN=5, TN=35, compute Precision and Recall.

 Precision = 50/(50+10) = 0.833

 Recall = 50/(50+5) ≈ 0.909

Answer: Precision ≈ 0.833, Recall ≈ 0.909

Q3. True/False: F1-score is the arithmetic mean of Precision and Recall.

Answer: False
Explanation: F1-score is harmonic mean, not arithmetic mean.

Q4. Silhouette score of −0.2 indicates:

 A) Good clustering

 B) Poor clustering, some points assigned to wrong clusters

 C) Perfect clustering

 D) Cannot determine

Answer: B. Poor clustering

Q5. Lower Davies–Bouldin Index indicates:

 A) Better clustering

 B) Worse clustering

 C) Same as silhouette score

 D) Higher variance
Answer: A. Better clustering

🔹 PYQs – Exam-style Questions

Q1.
Compute RMSE and MAE for: y=[2,4,6], ŷ=[3,3,5]

 Errors: 2−3=−1, 4−3=1, 6−5=1

 MSE = (1² +1² +1²)/3 = 1

 RMSE = √1 =1

 MAE = (1+1+1)/3=1

Answer: RMSE=1, MAE=1
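
The same calculation can be reproduced in code. This is a small sketch with an illustrative helper name (`regression_errors`), not a reference implementation.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MSE, RMSE, MAE) for two equal-length lists."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return mse, math.sqrt(mse), mae


# Values from the question: y = [2, 4, 6], y_hat = [3, 3, 5].
mse, rmse, mae = regression_errors([2, 4, 6], [3, 3, 5])
print(mse, rmse, mae)
```

All three values come out as 1 here because every absolute error is exactly 1; with unequal errors, RMSE would exceed MAE.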

Q2.
A confusion matrix: TP=25, TN=45, FP=5, FN=15. Compute Accuracy and
F1-score.

 Accuracy = (25+45)/90 = 70/90 ≈ 0.778

 Precision = 25/(25+5)=0.833

 Recall = 25/(25+15)=0.625

 F1 = 2*(0.833*0.625)/(0.833+0.625) ≈ 0.714

Answer: Accuracy≈0.778, F1≈0.714

Q3. Conceptual:
Explain why silhouette score ranges from −1 to 1.

Answer:

 +1 → points well matched to own cluster, far from others

 0 → points on the boundary of clusters

 −1 → points may be assigned to wrong cluster

Q4. Scenario:
You cluster customer data using K-means. Silhouette score = 0.72,
DBI=0.3.
 Are clusters good? Answer: Yes, high silhouette and low DBI
indicate compact, well-separated clusters.

Real-world Applications

 Healthcare: disease prediction.

 Finance: fraud detection.

 Retail: recommender systems.

 NLP: machine translation, sentiment analysis.

Previous Year Questions (Expanded)

Q1. (2019) According to Tom Mitchell’s definition, which of the


following are necessary elements?

 A) Training Data

 B) Task to perform

 C) Performance measure

 D) All of the above

✅ Answer: D

Q2. (2020) Which of the following is NOT a supervised learning


task?

 A) Predicting rainfall given weather conditions

 B) Classifying handwritten digits

 C) Customer segmentation

 D) Predicting stock market trend as up or down

✅ Answer: C (Segmentation = unsupervised)

Q3. (2021) If a model has high bias and low variance, then it is
most likely:

 A) Underfitting

 B) Overfitting

 C) Generalizing well
 D) None of the above

✅ Answer: A

Q4. (2022) Which of the following is NOT true about


reinforcement learning?

 A) It requires labeled training data

 B) It learns by interaction with environment

 C) It uses rewards and penalties

 D) It is used in robotics and games

✅ Answer: A

Q5. (2023) Which metric is most suitable for evaluating a


classifier on imbalanced data?

 A) Accuracy

 B) Precision-Recall

 C) Mean Squared Error

 D) R² score

✅ Answer: B

Q6. (Assignment) Which of these leads to overfitting?

 A) Very simple model

 B) Too many parameters relative to data

 C) Large training set

 D) Using cross-validation

✅ Answer: B

🌟 Quick Review

 Supervised ↔ labeled data

 Unsupervised ↔ unlabeled data

 Reinforcement ↔ rewards
 Overfit ↔ memorizes, Underfit ↔ too simple

 Bias = systematic error, Variance = sensitivity


Hypothesis Space & Inductive Bias

 Hypothesis space: The set of all candidate models (hypotheses) the learner can choose from.

o Example: All straight lines → linear regression

 Inductive bias: Assumptions the model uses to generalize to unseen data.

o Example: KNN assumes nearby points have same label

Evaluation & Cross-Validation

 Training set: For learning

 Test set: For checking predictions

 Validation set: Optional, helps tune the model

Cross-Validation: Split data multiple ways → train/test → average result

 Helps know if model really works

Cross-Validation

 K-Fold CV: Split data into K folds, train K times, average score
 LOOCV (Leave-One-Out CV): K equals the number of data points, so each point is the test set exactly once

 Why: Reduce variance in performance estimate

Mnemonic: K-fold → Rotate Test
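
The "rotate test" idea can be sketched as an index splitter. This is a simplified illustration: `k_fold_indices` is a hypothetical helper, and it assumes the data are already shuffled (real pipelines usually shuffle first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=6, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each index lands in the test set exactly once. Setting `k` equal to `n_samples` gives LOOCV. The model would be trained and scored once per fold, and the K scores averaged.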

Week 2

Part A: Linear Regression

1. What is Regression? (continuous)

 Regression = predicting a numeric/continuous value.

 Example: predict marks from hours studied, predict house price


from area.

 Different from classification (which predicts categories).

X = independent variable (input)

Y = dependent variable (output)

Note: Regression output is continuous; classification output is discrete.

2. Simple Linear Regression

Equation:

$Y = \beta_0 + \beta_1 X + \varepsilon$

 $Y$: dependent variable (target/output).

 $X$: independent variable (input).

 $\beta_0$: intercept → value of Y when X=0.

 $\beta_1$: slope → how much Y changes when X increases by 1.

 $\varepsilon$: error/noise in data.


👉 It assumes the relation between X and Y is a straight line.

3. Multiple Linear Regression

Equation:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$

 Used when output depends on multiple features.

 Example: house price depends on area, number of rooms, location.

4. Cost Function (Error Measurement)

 We want the line that fits data best.

 Error = difference between predicted and actual values.
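
To make the cost idea concrete, the sketch below scores candidate lines by mean squared error and finds the best line for a toy dataset. The data (hours studied vs. marks) are hypothetical. The closed-form least-squares formulas are a standard textbook result added for illustration; they are not stated in these notes.

```python
def mse_cost(beta0, beta1, xs, ys):
    """Mean squared error of the line y = beta0 + beta1 * x on the data."""
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)


def fit_simple_linear_regression(xs, ys):
    """Closed-form least-squares estimates of (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical data generated exactly from marks = 50 + 2 * hours.
hours = [1, 2, 3, 4]
marks = [52, 54, 56, 58]
beta0, beta1 = fit_simple_linear_regression(hours, marks)
print(beta0, beta1, mse_cost(beta0, beta1, hours, marks))
print(mse_cost(50, 1, hours, marks))  # a worse line has a higher cost
```

The fitted line recovers the intercept and slope used to generate the data and has zero cost, while the deliberately wrong slope gives a larger MSE.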

Quiz + Answers

Q1. Regression is mainly used for?


👉 Predicting continuous values ✅

Q2. Which metric is commonly minimized in regression?


👉 Mean Squared Error ✅

Q3. Which algorithm is used to estimate coefficients in regression?


👉 Gradient Descent ✅

PYQs

Q1. Write assumptions of linear regression.


✔ Linearity, independence, homoscedasticity, normal errors.

Q2. Differentiate between simple and multiple regression.


✔ Simple → 1 feature; Multiple → many features.
Quiz + Answers

Q1. What does learning rate (α) control?


👉 Step size of updates ✅

Q2. Which GD variant is faster but noisier?


👉 Stochastic Gradient Descent ✅

PYQs

Q1. Explain difference between Batch and SGD.


✔ Batch = stable, slow; SGD = fast, noisy.

Q1. What is the goal of gradient descent?


a) Maximize error
b) Minimize error
c) Find local maximum
d) None

👉 Answer: b) Minimize error

Q2. Which Gradient Descent is fastest but noisy?


👉 Answer: Stochastic Gradient Descent (SGD)

Q3. What happens if the learning rate is too high?


👉 Answer: The algorithm may overshoot and fail to converge.
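
The answers about step size and overshooting can be tried out on a toy problem. This is a minimal sketch of batch gradient descent on MSE for a single-feature line. The data and the two learning rates are hypothetical values chosen so that one run converges and the other diverges.

```python
def gradient_descent(xs, ys, learning_rate, steps):
    """Batch gradient descent on MSE for the line y = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        grad_b0 = 2 / n * sum(errors)
        grad_b1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        # Move opposite to the gradient, scaled by the learning rate.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


def mse(b0, b1, xs, ys):
    return sum(((b0 + b1 * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]  # generated from y = 1 + 2x
good = gradient_descent(xs, ys, learning_rate=0.05, steps=2000)
bad = gradient_descent(xs, ys, learning_rate=0.3, steps=50)
print("lr=0.05:", [round(v, 3) for v in good])
print("lr=0.3 cost:", mse(bad[0], bad[1], xs, ys))
```

With the small learning rate the parameters approach (1, 2). With the large one each update overshoots and the cost grows instead of shrinking.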

6. Assumptions of Linear Regression

1. Linear relationship exists between X and Y.

2. Errors (residuals) are independent.

3. Errors have constant variance.


4. Errors follow normal distribution.

7. Problem with Linear Regression in Classification

 Output is not bounded (can be –∞ to +∞).

 For classification we need probabilities between 0 and 1
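
A quick numerical check of the problem: a linear score can take any real value, so it cannot be read as a probability. One common remedy, assumed here and not described in these notes, is to pass the score through the logistic (sigmoid) function, as logistic regression does. The coefficients below are hypothetical.

```python
import math


def linear_score(x, beta0, beta1):
    """Unbounded output of a linear model."""
    return beta0 + beta1 * x


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


beta0, beta1 = -4.0, 2.0  # hypothetical coefficients
for x in (-10, 0, 2, 10):
    z = linear_score(x, beta0, beta1)
    print(x, z, round(sigmoid(z), 4))
```

The raw scores fall far outside [0, 1], while the squashed values stay strictly between 0 and 1.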

QUIZ (Week 2)

Q1. What is inductive bias in ML?


Ans: Assumptions made by algorithm to choose a hypothesis from the
hypothesis space.

Q2. Which method reduces variance by averaging results across multiple


folds?
a) Bootstrapping
b) Hold-out split
c) Cross-validation
Ans: c) Cross-validation

Q3. Formula for Recall?


Ans: Recall = TP / (TP + FN)

Q4. Gradient Descent updates weights in which direction?


Ans: Opposite direction of gradient of cost function.

Q5. In Linear Regression, which function is minimized?


Ans: Mean Squared Error (MSE).

Q3. Which Gradient Descent uses full dataset for one update?

 Ans: Batch Gradient Descent.

Q4. Which metric is less sensitive to outliers: MSE or MAE?

 Ans: MAE.

Q5. If R² = 1, what does it mean?


 Ans: Perfect fit of regression model.

Week 2 (Part B, C & D) – Decision Trees & Overfitting

Part B: Introduction to Decision Tree

Definition:

 A Decision Tree is a classifier in a tree structure.

 Nodes:

o Decision Node: attribute test (e.g., “Outlook = Sunny?”).

o Leaf Node: final classification (Yes/No).

Key Idea:

 Recursively split dataset based on attributes → until classification is


simple.

Example (PlayTennis):

 Outlook = Sunny → check Humidity → Yes/No.

 Outlook = Overcast → Always Yes.

 Outlook = Rain → check Wind → Yes/No.

Decision Tree = disjunction of conjunctions

 Example:

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Challenges:

 Which tree to pick?

o Prefer smallest consistent tree (Occam’s Razor).

 How to split?

o Use attribute selection measures (Info Gain, Gini, etc.).

Part C: Learning Decision Trees

ID3 Algorithm (Top-Down Induction)

1. Select “best” attribute A.

2. Make node with attribute A.


3. For each value of A, create a branch.

4. Sort training examples by branch value.

5. If examples are pure (all same class) → stop. Else repeat.

Stopping Conditions:

 No attributes left.

 All examples same class.

 Very few examples left.
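
Step 1 of ID3 needs a way to score attributes. The sketch below shows entropy and information gain, one of the attribute selection measures named earlier. The Yes/No counts are assumed from the classic 14-example PlayTennis table, which is not reproduced in these notes, so treat them as illustrative.

```python
import math


def entropy(class_counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)


def information_gain(parent_counts, child_counts_per_branch):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child)
                   for child in child_counts_per_branch)
    return entropy(parent_counts) - weighted


# Assumed [Yes, No] counts: whole dataset, then Outlook = Sunny / Overcast / Rain.
parent = [9, 5]
outlook_branches = [[2, 3], [4, 0], [3, 2]]
gain = information_gain(parent, outlook_branches)
print(round(entropy(parent), 3), round(gain, 3))
```

A pure branch such as Overcast has entropy 0. ID3 would compute the gain for every attribute and split on the largest.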


Part D: Overfitting in Decision Trees

Problem:

 Tree fits training set too perfectly → poor test performance.

Definitions:

 Overfitting: Training error ↓, but test error ↑.

 Underfitting: Both training and test error ↑.

Causes:

 Noise in data.

 Too few samples.

 Too many parameters (deep tree).

Avoiding Overfitting

1. Pre-pruning (Early stopping):

o Stop growing if split not statistically significant.

o Conditions:

 All samples same class.

 Few samples left.

 χ² test says no useful split.

2. Post-pruning:
o Grow full tree → prune unhelpful branches.

o Example: Reduced Error Pruning (using validation set).

3. Other Methods:

o Cross-validation.

o Regularization (limit tree depth, minimum samples).

o Use more data.

Regularization (Connection to Regression)

 Overfitting in Linear Regression → very large weights.

 Solutions:

o L2 Regularization (Ridge): Penalize large weights.

$J(w) = MSE + \lambda \sum w^2$

o L1 Regularization (Lasso): Penalize absolute weights.

$J(w) = MSE + \lambda \sum |w|$
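
The two penalized costs can be written out directly. This is a sketch under stated assumptions: the data are hypothetical, the function names are illustrative, and the bias term is left unpenalized, which is a common convention but not something these notes specify.

```python
def mse(weights, bias, rows, targets):
    """Mean squared error of a linear model on rows of features."""
    predictions = [bias + sum(w * x for w, x in zip(weights, row))
                   for row in rows]
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def ridge_cost(weights, bias, rows, targets, lam):
    """MSE plus lambda times the sum of squared weights (L2)."""
    return mse(weights, bias, rows, targets) + lam * sum(w ** 2 for w in weights)


def lasso_cost(weights, bias, rows, targets, lam):
    """MSE plus lambda times the sum of absolute weights (L1)."""
    return mse(weights, bias, rows, targets) + lam * sum(abs(w) for w in weights)


# Hypothetical data generated by weights (2, 1) and bias 1, so MSE is 0.
rows = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
targets = [5.0, 5.0, 8.0]
weights, bias = [2.0, 1.0], 1.0
print(ridge_cost(weights, bias, rows, targets, lam=0.1))
print(lasso_cost(weights, bias, rows, targets, lam=0.1))
```

Even with a perfect fit, both costs are positive because the penalty charges for weight size. A larger λ pushes the optimizer toward smaller weights.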

📌 QUIZ (Week 2 B, C, D)

Q1. What does a decision node represent?

 Ans: A test on an attribute.

Q2. What is entropy when dataset is pure (all positive)?

 Ans: 0.

Q3. Which attribute has the highest info gain in PlayTennis?

 Ans: Outlook.

Q4. Define Overfitting.

 Ans: When model has low training error but high test error.

Q5. What is the difference between Pre-pruning and Post-pruning?

 Ans: Pre-pruning stops tree early; Post-pruning prunes after building.


Week-3
Week 3 – Instance-Based Learning & Feature Selection/Extraction

🔹 Part A: Instance-Based Learning (IBL)

1. Key Idea

 Store the training examples $(x_n, f(x_n))$.

 For a new test example → find the closest matches and predict
from them.

 Inductive assumption: similar inputs → similar outputs.

2. k-Nearest Neighbor (k-NN)

 Training phase: just memorize data.

 Prediction phase:

1. Find the k nearest neighbors of the test point.

2. Classification → majority vote of neighbors.


Regression → average of neighbors.

Decision boundary: forms a Voronoi diagram.

k-Nearest Neighbors (kNN):

 A lazy learning algorithm (no training, only stores data).

 To classify a new point, it checks the k closest neighbors in the


dataset.

 Decision = majority vote of neighbors (classification) OR average


value (regression).

3. Choosing “k”

 Small k: captures fine structure, but sensitive to noise.

 Large k: more stable, less sensitive to noise, better probability


estimates.

 As data n → ∞ and k → ∞ with k/n → 0, the kNN error rate → Bayes optimal error.


Theory

 kNN = instance-based, lazy learning algorithm.

 Stores all training data → no training phase.

 Classification → majority class of neighbors.

 Regression → average of neighbors.

 Key parameter: value of k.

o Small k = sensitive to noise.

o Large k = smoother, may underfit.
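
A minimal kNN sketch for both classification and regression is shown below. The toy points are hypothetical and include one deliberately mislabeled "noisy" point to show the effect of k. It assumes Euclidean distance and plain (unweighted) voting.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k training pairs (features, target) closest to `query`."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]


def knn_classify(train, query, k):
    """Majority vote over the k nearest labels."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k):
    """Average of the k nearest target values."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)


# Hypothetical data; the last "B" point sits inside the "A" region (noise).
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
         ((6.0, 6.0), "B"), ((7.0, 7.0), "B"), ((2.2, 2.2), "B")]
query = (2.1, 2.1)
print(knn_classify(train, query, k=1), knn_classify(train, query, k=3))

train_reg = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(train_reg, (2.1,), k=3))
```

With k = 1 the noisy neighbor decides the label; with k = 3 the two surrounding "A" points outvote it. There is no training step beyond storing `train`.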

Curse of Dimensionality

 If features are too many, distance becomes meaningless.

 Example: In 2D, neighbors are near; in 100D, everyone is far away!


👉 Need feature selection/extraction.

 curse of dimensionality:

o In high dimensions, distances between points become similar.

o Nearest neighbor loses meaning.

o Model becomes less accurate.

o Solution → reduce dimensions (Feature Selection/Extraction).
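
The claim that distances become similar can be observed in a small simulation. This is an illustrative sketch only: it assumes points drawn uniformly at random from the unit cube, and the "relative contrast" measure and all parameter values are choices made for this demo.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = [math.dist(query, [rng.random() for _ in range(n_dims)])
                 for _ in range(n_points)]
    return (max(distances) - min(distances)) / min(distances)


low = relative_contrast(n_points=200, n_dims=2)
high = relative_contrast(n_points=200, n_dims=500)
print("2 dimensions:  ", round(low, 2))
print("500 dimensions:", round(high, 2))
```

In 2 dimensions the farthest point is many times farther away than the nearest one. In 500 dimensions the nearest and farthest points are almost equally far away, so "nearest neighbor" carries little information.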


Feature Selection Methods:

1. Filter methods – use statistics (correlation, chi-square, mutual


information).

2. Wrapper methods – test feature subsets with classifier


(expensive).

3. Embedded methods – feature selection during training (e.g.,


Decision Trees, Lasso).

Quiz

1. Curse of dimensionality means → distances lose meaning in high


dimensions.

2. Solution to curse → dimension reduction.

3. Filter method uses → statistical tests.

4. Wrapper method uses → classifier accuracy.

5. Embedded method example? → Decision Trees, Lasso


regression.

Quiz

1. kNN is a lazy/instance-based learner.

2. Formula of Euclidean distance? → $\sqrt{\sum (x_i - y_i)^2}$

3. Which similarity measure is useful in text data? → Cosine


similarity.

4. If k=1, model is sensitive to → noise.

5. Increasing k makes model → smoother / less sensitive.

Quiz Q&A

1. kNN is what type of learning algorithm?


→ Instance-based / Lazy learner.

2. What happens if k = 1?
→ Nearest neighbor decides, may overfit to noise.

3. Formula for Euclidean distance?
→ $\sqrt{\sum (x_i - y_i)^2}$

4. Which distance is suitable for text classification?


→ Cosine similarity.
5. Main drawback of kNN?
→ High computation for large datasets (must calculate distance to
all points).

6. kNN works well when features are…


→ Scaled properly (normalization is important).

Part C: Feature Extraction (PCA & LDA)

Theory

 Feature Extraction = create new reduced features from original


ones.

 PCA (Principal Component Analysis):

o Unsupervised, ignores class labels.

o Maximizes variance.

o Uses eigenvectors of covariance matrix.

 LDA (Linear Discriminant Analysis):

o Supervised, uses class labels.

o Maximizes separation between classes.

👉 Difference: PCA = variance, LDA = class separability.
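
For two features, the PCA recipe (covariance matrix, then its top eigenvector) can be written without any libraries. The sketch below is restricted to 2-D data so the eigenvalue has a closed form. The helper name and the toy points are hypothetical, and it assumes the data are not all identical.

```python
import math


def pca_first_component_2d(points):
    """Return (unit direction, explained variance ratio) of the first PC."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    largest = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    # Matching eigenvector; axis-aligned fallback when features are uncorrelated.
    if sxy != 0:
        vx, vy = sxy, largest - sxx
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), largest / (sxx + syy)


# Hypothetical points lying exactly on the line y = 2x.
points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
direction, ratio = pca_first_component_2d(points)
print([round(v, 3) for v in direction], round(ratio, 3))
```

Because the points lie on one line, the first component points along that line and explains all of the variance. Note that no class labels are used anywhere, which is what makes PCA unsupervised.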

Quiz

1. PCA is unsupervised.

2. LDA uses → class labels.

3. PCA chooses directions that maximize → variance.

4. LDA chooses directions that maximize → class separation.

5. Which is better for classification? → Usually LDA, because it uses class labels.

PYQs

 Explain PCA with steps. (10 marks)

 Differentiate PCA and LDA. (5 marks)

 Why is PCA used before applying kNN? (5 marks)


🔹 Part D: Collaborative Filtering (Recommender Systems)

Theory

 Collaborative Filtering = recommend items based on similar


users/items.

Types:

1. User-based CF

o Find similar users.

o Recommend items they liked.

o Example: “Users like you watched…”

2. Item-based CF

o Find similar items to those the user liked.

o Recommend those items.

o Example: “Because you watched A, we recommend B.”

o More scalable.
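
Item-based CF can be sketched with cosine similarity between item rating columns. Everything in this example is hypothetical: the ratings table, the item names, and the simplification of treating "not rated" as 0.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical ratings per user for items A, B, C (0 = not rated).
ratings = {
    "user1": {"A": 5, "B": 4, "C": 0},
    "user2": {"A": 4, "B": 5, "C": 1},
    "user3": {"A": 1, "B": 0, "C": 5},
}
items = ("A", "B", "C")


def item_vector(item):
    """Ratings for one item across all users, in a fixed user order."""
    return [ratings[user][item] for user in sorted(ratings)]


def most_similar_item(liked_item):
    """Item whose rating pattern is closest to the liked item's pattern."""
    candidates = [i for i in items if i != liked_item]
    return max(candidates,
               key=lambda i: cosine_similarity(item_vector(liked_item),
                                               item_vector(i)))


print(most_similar_item("A"))
```

Items A and B are rated alike by the same users, so a user who liked A would be recommended B. User-based CF would instead compare the rows (users) rather than the columns (items).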

Quiz

1. CF stands for → Collaborative Filtering.

2. User-based CF uses → similar users.

3. Item-based CF uses → similar items.

4. Which is more scalable? → Item-based CF.

5. Amazon’s recommender is mostly → item-based CF.

QUICK REVISION CHEAT SHEET (Week 3)

 kNN = Lazy learner, depends on distance.

 Distances: Euclidean, Manhattan, Cosine.

 Curse of dimensionality = distance useless in high dimensions →


reduce features.

 Feature Selection: Filter / Wrapper / Embedded.

 Feature Extraction: PCA (variance), LDA (class separation).

 Recommender Systems: User-based vs Item-based CF


Week -4
Week 4 Notes – Probability & Bayesian Learning

Part A: Probability Basics

1. Probability & Experiments

 Probability: study of randomness & uncertainty.

 Random Experiment: process with uncertain outcome.

o Example: tossing a coin, rolling a die, drawing cards.


 In MAP learning, which factor is considered?

 a) Only likelihood

 b) Only prior

 c) Prior × Likelihood ✅

 d) None

 In Naïve Bayes, the main assumption is?

 Conditional independence of features ✅


QUIZ QUESTIONS (MCQ + Short) – Week 4

MCQs

1. Sample space of rolling a die is?


a) {H,T}
b) {1,2,3,4,5,6} ✅
c) {HH,HT}
d) None

2. Which of these is NOT an axiom of probability?


a) P(Ω) = 1
b) 0 ≤ P(A) ≤ 1
c) P(A∩B) = P(A) + P(B) ✅
d) P(∅) = 0

3. In Naïve Bayes, the main assumption is:


a) Features are correlated
b) Features are conditionally independent given the class ✅
c) Priors are ignored
d) None

4. Bayes theorem is used to compute?


a) Prior
b) Likelihood
c) Posterior ✅
d) Evidence

5. ML learning ignores:
a) Prior ✅
b) Likelihood
c) Evidence
d) Data

True/False

6. For any event A, P(A)+P(A’) = 1. ✅

7. In MAP hypothesis, prior is not considered. ❌

8. Naïve Bayes always gives optimal performance. ❌

Short Answer
9. Define Conditional Probability with formula.
👉 P(A|B) = P(A∩B) / P(B)

10. Differentiate ML vs MAP.


👉 ML uses only likelihood, MAP uses prior × likelihood.
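A small sketch of the ML vs MAP difference, assuming a coin-toss model with a Beta(a, b) prior on the probability of heads. The model, the prior family, and the function names are illustrative assumptions, not part of the notes.

```python
def ml_estimate(heads, tosses):
    """Maximum likelihood: uses only the data (likelihood)."""
    return heads / tosses

def map_estimate(heads, tosses, a=2.0, b=2.0):
    """MAP: mode of the Beta(a + heads, b + tails) posterior.

    Assumes a >= 1, b >= 1 and at least one toss.
    """
    return (heads + a - 1) / (tosses + a + b - 2)

# Three heads in three tosses.
print(ml_estimate(3, 3))         # 1.0: the data alone says "always heads"
print(map_estimate(3, 3))        # 0.8: the Beta(2, 2) prior pulls toward 0.5
print(map_estimate(3, 3, 1, 1))  # 1.0: uniform prior, so MAP equals ML
```

With the uniform prior Beta(1, 1) the two estimates coincide, which matches the rule that ML = MAP when the prior is uniform.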

Week 4 – Probability & Bayesian Learning (120 Q&A)

Topic 1 – Probability Basics (40 Q&A)

MCQs (20)

1. Sample space of a coin toss?


a) {H,T} ✅
b) {1,2}
c) {HH,HT}
d) None

2. Probability of an impossible event?


a) 0 ✅
b) 1
c) 0.5
d) None

3. P(A∪B) = ?
a) P(A) + P(B)
b) P(A) + P(B) – P(A∩B) ✅
c) P(A∩B)
d) P(A) × P(B)

4. If P(A) = 0.3, P(B) = 0.4, independent, P(A∩B)?


a) 0.12 ✅
b) 0.7
c) 0.1
d) 0.3

5. Complement of event A formula?


a) 1 – P(A) ✅
b) P(A)
c) P(A∩B)
d) None

… (MCQs 6–20 continue with event, union, intersection, independence, conditional probability, etc.)

True/False (10)

21. P(A)+P(A’) = 1 ✅

22. P(Ω)=0 ❌

23. Disjoint events cannot occur together ✅

24. Conditional probability formula: P(A|B)=P(A∩B)/P(B) ✅

25. Two independent events always have P(A∩B)=0 ❌


… (TF 26–30)

Fill-in-the-blank (5)

31. The sum of probabilities of all outcomes in a sample space is 1.

32. If A ⊆ B, then P(A) ≤ P(B).


… (FB 33–35)

Short Answer (5)

36. Define Random Experiment, Sample Space, Event.

37. Explain complement of an event with formula.

38. State axioms of probability.

39. Example of independent events.

40. Example of mutually exclusive events.
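For the last two short-answer questions, one possible check uses a fair six-sided die. The events chosen below are illustrative assumptions. Exact fractions avoid rounding issues.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # sample space of one fair die roll

def prob(event):
    """Probability of an event (a subset of omega) under equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

even = {2, 4, 6}
at_most_two = {1, 2}
odd = {1, 3, 5}

# Independent: P(A ∩ B) equals P(A) * P(B).
independent = prob(even & at_most_two) == prob(even) * prob(at_most_two)

# Mutually exclusive: the events share no outcome, so P(A ∩ B) = 0.
mutually_exclusive = prob(even & odd) == 0

# Union rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
union_ok = prob(even | at_most_two) == (
    prob(even) + prob(at_most_two) - prob(even & at_most_two)
)

print(independent, mutually_exclusive, union_ok)
```

Note that "even" and "odd" are mutually exclusive but not independent: knowing one occurred rules out the other.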

Topic 2 – Random Variables & Distributions (20 Q&A)

MCQs (10)

41. A discrete random variable takes:


a) Countable values ✅
b) Continuous values
c) Infinite values
d) None

42. Sum of PMF probabilities = ?


a) 1 ✅
b) 0
c) 0.5
d) 2

43. PDF integral over all x = ?


a) 1 ✅
b) 0

(MCQs 44–50 cover expectation, variance, PMF/PDF, discrete vs continuous)

True/False (5)

51. PDF can be negative ❌

52. Expectation is mean of RV ✅


Short Answer (5)

56. Difference between discrete & continuous RV.

57. Define PMF & PDF.

58. Compute E[X] for X={1,2,3} with P(X)={0.2,0.5,0.3}.

59. Compute Var[X].

60. Explain why ∫f(x)dx=1 for PDF.
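Questions 58 and 59 can be worked through directly from the definitions E[X] = Σ x·P(x) and Var[X] = E[X²] − (E[X])². A short sketch using the values given in question 58:

```python
values = [1, 2, 3]
probs = [0.2, 0.5, 0.3]

# A valid PMF sums to 1 (up to floating-point rounding).
total = sum(probs)

# E[X] = 1*0.2 + 2*0.5 + 3*0.3, which is about 2.1
mean = sum(x * p for x, p in zip(values, probs))

# E[X^2] = 1*0.2 + 4*0.5 + 9*0.3, which is about 4.9
second_moment = sum(x * x * p for x, p in zip(values, probs))

# Var[X] = E[X^2] - (E[X])^2, which is about 4.9 - 4.41 = 0.49
variance = second_moment - mean ** 2

print(round(mean, 2), round(variance, 2))
```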

Topic 3 – Conditional Probability & Bayes’ Theorem (20 Q&A)

MCQs (10)

61. P(A|B) = ?
a) P(A∩B)/P(B) ✅

(MCQs 62–70 include independence, examples, conditional probs, Bayes theorem calculations)

True/False (5)

71. Bayes theorem updates prior with evidence ✅
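A worked example of updating a prior with evidence, assuming a binary hypothesis H and a yes/no test result E. The numbers are hypothetical and chosen only for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) by Bayes' theorem for a binary hypothesis H.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% prior, 90% true-positive rate, 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p, 3))  # about 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior is small. This is the sense in which Bayes' theorem combines prior and evidence.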


Topic 4 – Bayesian Learning (20 Q&A)


MCQs (10)

81. MAP maximizes:


a) P(D|h)
b) P(h|D) ✅
c) Both
d) None

82. ML ignores prior ✅


… (MCQs 83–90)

True/False (5)

91. ML = MAP when prior is uniform ✅

92. MAP considers prior knowledge ✅


Topic 5 – Naïve Bayes & Bayesian Networks (20 Q&A)

MCQs (10)

101. Naïve Bayes assumes features are independent ✅

102. Bayesian network represents:


a) Dependencies between variables ✅
… (MCQs 103–110)

True/False (5)

111. Naïve Bayes always gives best prediction ❌

112. Bayesian networks use CPT ✅
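A minimal sketch of how the Naïve Bayes independence assumption is used in practice: the class score is the prior times a product of per-feature conditional probabilities. The toy dataset, the feature names, and the use of Laplace (add-one) smoothing are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (features, label)
data = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
]

class_counts = Counter(label for _, label in data)
feature_counts = defaultdict(Counter)  # (label, feature) -> counts per value
feature_values = defaultdict(set)      # feature -> set of observed values
for features, label in data:
    for name, value in features.items():
        feature_counts[(label, name)][value] += 1
        feature_values[name].add(value)

def predict(features):
    """Return (best label, unnormalized scores) for one example."""
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(data)  # prior P(class)
        for name, value in features.items():
            count = feature_counts[(label, name)][value]
            # P(value | class) with add-one smoothing; features are treated
            # as conditionally independent given the class, so we multiply.
            score *= (count + 1) / (n_label + len(feature_values[name]))
        scores[label] = score
    return max(scores, key=scores.get), scores

label, scores = predict({"outlook": "sunny", "windy": "yes"})
print(label)
```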


Week -5
Q10. Linear SVM with noise is solved using:
A) Logistic regression
B) Soft margin SVM
C) Decision tree
D) Random forest

Answer: B
Explanation: Soft margin allows some misclassification to handle noisy or not perfectly separable data.

Q11. Which kernel is NOT commonly used in SVM?


A) Linear
B) Polynomial
C) Gaussian (RBF)
D) Cosine

Answer: D
Explanation: Linear, polynomial, Gaussian, and sigmoid kernels are
standard. Cosine is not standard.

Q12. Kernel trick is used to:
A) Reduce computation by avoiding explicit high-dimensional feature mapping
B) Increase model overfitting
C) Replace SVM with decision trees
D) Compute gradient faster

Answer: A
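To see why the kernel trick avoids the explicit mapping, consider the degree-2 polynomial kernel k(x, z) = (x·z)² on 2-D inputs. It equals a dot product in a 3-D feature space φ(x) = (x₁², √2·x₁x₂, x₂²), yet never needs to build φ. The sketch below checks this numerically. The specific kernel and the sample points are illustrative choices.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Same quantity computed in the original 2-D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # dot product after mapping to 3-D
implicit = poly_kernel(x, z)    # kernel evaluated directly: (1*3 + 2*4)^2 = 121
print(explicit, implicit)
```

For higher degrees or the RBF kernel, the explicit feature space grows very large (or becomes infinite-dimensional), while the kernel evaluation stays cheap.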

Q13. SMO algorithm is used to:


A) Solve primal SVM problem
B) Solve dual SVM problem efficiently
C) Compute sigmoid function
D) Normalize features

Answer: B

Q14. Multi-class SVM is handled by:


A) Single SVM
B) One-vs-Rest approach
C) Logistic regression
D) Random forest

Answer: B
Explanation: Train N SVMs, each for one class vs all others.

Q15. Which parameter controls the trade-off between margin width and
misclassification in soft-margin SVM?
A) Alpha
B) Beta
C) C
D) Gamma

Answer: C

SVM Questions

Q26. Maximum margin classifier is:


A) Logistic regression
B) Linear SVM
C) Decision tree
D) Naive Bayes

Answer: B

Q27. Hinge loss is used in:


A) Linear regression
B) Logistic regression
C) SVM
D) KNN

Answer: C
Explanation: Soft-margin SVM minimizes hinge loss plus a regularization term that keeps the margin wide.

Q30. Which statement is TRUE for soft-margin SVM?
A) Allows some misclassification
B) Uses C parameter to control penalty
C) Both A and B
D) None

Answer: C
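Hinge loss and the C parameter from the questions above fit together in the standard soft-margin objective ½‖w‖² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)). A minimal sketch that only evaluates this objective (it does not train anything); the sample points and weights are hypothetical.

```python
def hinge_loss(y, score):
    """Zero when the point is beyond the margin on the correct side, linear otherwise."""
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses over the data)."""
    reg = 0.5 * sum(wi * wi for wi in w)
    total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wi * xij for wi, xij in zip(w, xi)) + b
        total += hinge_loss(yi, score)
    return reg + C * total

# Hypothetical data: the third point lies inside the margin.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, 1]
w, b = (1.0, 0.0), 0.0

print(soft_margin_objective(w, b, X, y, C=1.0))
print(soft_margin_objective(w, b, X, y, C=10.0))
```

A larger C makes margin violations more expensive, so the optimizer tolerates fewer of them at the cost of a narrower margin.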

Q31. Which is TRUE about linear vs kernel SVM?


A) Linear SVM cannot classify linearly separable data
B) Kernel SVM can map data to higher dimensions
C) Linear SVM always overfits
D) Kernel SVM is slower but always worse

Answer: B

Q32. If a point is correctly classified with distance > margin, its Lagrange multiplier α_i is:
A) 0
B) > 0
C) = 1
D) < 0

Answer: A
Explanation: Only support vectors (on margin) have α_i > 0.

Q33. What does the parameter gamma control in RBF kernel?


A) Margin width
B) Influence of a single training point
C) Learning rate
D) Number of features

Answer: B
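The effect of gamma can be seen from the usual RBF form k(x, z) = exp(−γ‖x − z‖²). A short sketch, with arbitrary sample points and gamma values:

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = (0.0, 0.0), (1.0, 1.0)
for gamma in (0.1, 1.0, 10.0):
    # Larger gamma -> similarity drops off faster -> each training point
    # influences only a small neighbourhood around itself.
    print(gamma, rbf_kernel(x, z, gamma))
```

Very large gamma tends to give a wiggly decision boundary that can overfit, while very small gamma makes the model behave almost linearly.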

Q34. “One-vs-One” approach in multi-class SVM means:


A) Train N SVMs for N classes
B) Train N(N-1)/2 SVMs for N classes
C) Train one SVM for all classes
D) None

Answer: B
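The classifier counts for the two multi-class strategies can be checked by listing the binary problems each one creates. The class names below are placeholders.

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """One binary SVM per unordered pair of classes: N(N-1)/2 in total."""
    return list(combinations(classes, 2))

def one_vs_rest_tasks(classes):
    """One binary SVM per class, against all remaining classes: N in total."""
    return [(c, [o for o in classes if o != c]) for c in classes]

classes = ["cat", "dog", "bird", "fish"]
ovo = one_vs_one_pairs(classes)   # 4 * 3 / 2 = 6 pairs
ovr = one_vs_rest_tasks(classes)  # 4 tasks
print(len(ovo), len(ovr))
```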

Q35. Which SVM type is best for non-linear data?


A) Linear SVM
B) Soft-margin SVM with kernel
C) Logistic regression
D) Naive Bayes

Answer: B

Common questions

Inductive bias refers to the assumptions a learning algorithm makes to generalize from the training data to unseen data. It is necessary because, without these assumptions, the model would not be able to predict unseen cases effectively. Examples include assuming linear relationships in linear regression or tree-like splits in decision trees.

The training phase in machine learning involves fitting the model to known data to learn the parameters that minimize the error on this data. The testing phase evaluates the model's performance on unseen data to measure how well it generalizes beyond the training set.

A confusion matrix helps evaluate a classification model's performance by displaying the count of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions. These components allow for the calculation of important metrics such as accuracy, precision, recall, and F1-score, which provide deeper insights into a model's accuracy and error distribution.
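The four metrics follow directly from the TP, FP, TN, and FN counts. A small sketch with hypothetical counts:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for 100 test examples.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

This sketch assumes at least one predicted positive and one actual positive; otherwise precision or recall would divide by zero.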

The expected total error of a model is the sum of squared bias, variance, and irreducible error. The bias-variance trade-off involves finding an optimal balance between the bias and variance to minimize the total error and ensure the model performs well on unseen data.

The main goal of the bias-variance trade-off is to find a balance that minimizes both bias, which is the error due to wrong assumptions or an overly simple model, and variance, which is the error due to sensitivity to small fluctuations in training data. This balance ensures that the model generalizes well to unseen data.

Regularization techniques, such as L1 and L2 regularization, help reduce overfitting by penalizing large coefficients in the model. This discourages the learning of a model that is too complex and thus helps in generalizing better to unseen data.

A model with high training and test errors is likely underfitting because it is too simple to capture the underlying patterns in the data, indicated by high bias. This suggests the model lacks the complexity needed to adequately represent the training data and consequently generalizes poorly.

Low training error combined with high test error indicates overfitting, where the model has memorized the training data rather than learning general patterns. This leads to poor generalization to new, unseen data. Overfitting results from high variance, where small fluctuations in training data significantly affect predictions.

The kernel trick in SVM allows the transformation of data into higher dimensional spaces without the computational cost of explicitly mapping points. This enables the linear separation of data that is not linearly separable in the original space by using different kernels such as polynomial or Gaussian (RBF).

A soft-margin SVM is preferable over a hard-margin SVM in scenarios involving noisy or non-linearly separable data. The soft-margin allows for some misclassification to handle noise and maintain a better balance between fitting the training data and preserving generalization to new data.
