ML Assignment Complete Answers

The document provides comprehensive assignment answers on Machine Learning, covering theory, numerical problems, and example questions across various units. Key topics include definitions, types of machine learning, data preprocessing techniques, model evaluation metrics, and specific algorithms like linear regression and decision trees. It serves as a study guide for understanding the fundamental concepts and applications of machine learning.

Uploaded by

vmistry1616

Machine Learning

Complete Assignment Answers


Assignment 1 (Theory) • Assignment 2 (Numericals) • Example Questions

Covers: Units 1–6 | All Numerical Problems | Concept Explanations


ASSIGNMENT 1 — THEORY QUESTIONS
UNIT 1: Introduction to Machine Learning

Q1. Definition and Scope of Machine Learning


Answer:
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from data and
improve their performance on tasks without being explicitly programmed. Instead of writing rules manually,
the system discovers patterns in data on its own.

Formal Definition:
"A computer program is said to learn from experience E with respect to some task T and performance
measure P, if its performance at tasks in T, measured by P, improves with experience E." — Tom Mitchell
(1997)

Scope of ML:
• Healthcare: Predicting diseases from patient records (e.g., detecting cancer from X-rays)
• Finance: Fraud detection, stock price prediction, credit scoring
• E-commerce: Product recommendation systems (Amazon, Netflix)
• Natural Language Processing: Chatbots, language translation, sentiment analysis
• Computer Vision: Face recognition, object detection in self-driving cars
• Agriculture: Crop disease detection using image classification
• Education: Personalized learning and adaptive testing systems

Example:
Email spam filtering — The system learns from thousands of spam and non-spam emails to classify new
incoming emails automatically.

Q2. Traditional Programming vs Machine Learning


Answer:
Traditional Programming requires humans to write explicit rules, while Machine Learning allows systems to
derive rules from data automatically.

Aspect | Traditional Programming | Machine Learning
Input | Data + Rules → Output | Data + Output → Rules
Rule Creation | Written by programmer | Learned from data
Flexibility | Rigid: rules must be updated manually | Adaptive: learns with new data
Performance | Depends on quality of rules | Improves with more data
Best For | Well-defined, deterministic tasks | Complex, pattern-based tasks
Example | Tax calculation software | Spam detection system

Q3. Types of Machine Learning


Answer:
1. Supervised Learning
The model is trained on labeled data — each input has a corresponding correct output. The model learns to
map inputs to outputs.
• Algorithms: Linear Regression, Logistic Regression, Decision Trees, SVM, KNN, Random Forest
• Example: Predicting house prices (input: size, location → output: price)
• Problem Types: Classification (spam/not spam) and Regression (price prediction)

2. Unsupervised Learning
The model is trained on unlabeled data. It discovers hidden patterns or groupings without any prior
knowledge of the output.
• Algorithms: K-Means Clustering, K-Medoids, Hierarchical Clustering, PCA
• Example: Grouping customers by purchasing behavior for market segmentation
• Types: Clustering and Dimensionality Reduction

3. Reinforcement Learning
An agent learns by interacting with an environment. It receives rewards for good actions and penalties for
bad actions, learning optimal strategies over time.
• Example: Training a game-playing AI (AlphaGo) or robot navigation
• Key Components: Agent, Environment, State, Action, Reward
• Goal: Maximize cumulative reward over time

Q4. Real-World Applications of ML


Answer:
• Healthcare: Disease diagnosis (cancer detection from MRI), drug discovery, patient outcome prediction
• Finance: Credit card fraud detection, algorithmic trading, loan approval prediction
• Retail/E-commerce: Product recommendations (Netflix, Amazon), demand forecasting, dynamic
pricing
• Transportation: Self-driving cars (Tesla Autopilot), route optimization (Google Maps), traffic prediction
• Natural Language: Google Translate, Siri/Alexa voice assistants, ChatGPT, sentiment analysis
• Manufacturing: Predictive maintenance of machines, quality control using computer vision
• Agriculture: Crop disease identification, yield prediction, precision farming
• Security: Face recognition systems, intrusion detection, cybersecurity threat detection

Q5. Concept Learning: Find-S Algorithm


Answer:
Concept learning is the problem of searching through a predefined space of hypotheses to find the one that
best fits the training examples.

Find-S Algorithm:
Find-S starts with the most specific hypothesis (empty set) and generalizes it only when a positive example is
encountered. Negative examples are ignored.

Steps:
• Step 1: Initialize h = <∅, ∅, ∅, ...> (most specific hypothesis)
• Step 2: For each positive training example x:
• - For each attribute a_i: if a_i in h is ∅ → copy the value from x; if a_i in h = a_i in x → no change; otherwise → replace with '?' (generalize)
• Step 3: Ignore all negative examples
• Step 4: Output the final hypothesis h

Key Terms:
• Hypothesis: A rule that defines when output is YES
• Specific: Very strict, exact match needed (e.g., <Sunny, Warm, Normal, Strong, Warm, Same>)
• General: Flexible, many cases match (e.g., <Sunny, Warm, ?, Strong, ?, ?>)
• '?': Accepts any value for that attribute

Candidate Elimination Algorithm:


Unlike Find-S (which only maintains one hypothesis), Candidate Elimination maintains two boundaries:
• S (Specific Boundary): Most specific set of hypotheses consistent with training data
• G (General Boundary): Most general set of hypotheses consistent with training data
The version space is the set of all hypotheses between S and G. On positive examples, S is generalized. On
negative examples, G is specialized.
UNIT 2: Data Preprocessing & Feature Engineering

Q1. Data Preprocessing: Handling Missing Values


Answer:
Missing values occur when no data is stored for a variable in an observation. If not handled, they can distort
analysis and reduce model accuracy.

Methods to Handle Missing Values:

a) Mean Imputation
Replace missing values with the mean of the column. Suitable for normally distributed numerical data.
Missing Value = Mean of Column = (Sum of all values) / (Count of non-missing
values)
Example: Age column: [25, 30, NaN, 40, 35] → Mean = (25+30+40+35)/4 = 32.5 → Replace NaN with 32.5

b) Median Imputation
Replace missing values with the median. Better than mean when data has outliers.
Missing Value = Median = Middle value after sorting
Example: Salary: [20000, 25000, NaN, 1000000] → Median of the non-missing values = 25000 (outlier-resistant; the mean would be about 348,333)

c) Mode Imputation
Replace missing values with the most frequently occurring value. Best for categorical data.
Example: Gender: [Male, Female, NaN, Male, Male] → Mode = Male → Fill NaN with Male

d) KNN Imputation
Find K nearest neighbors of the missing record and use their average to fill the gap. More accurate but
computationally expensive.
Example: If a student has missing marks, find K most similar students and average their marks.
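
The mean, median, and mode strategies above can be sketched with Python's standard `statistics` module. This is a minimal illustration rather than a production routine: the helper name `impute` is hypothetical, and `None` is assumed to stand in for NaN.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Fill None entries with a statistic computed from the non-missing values."""
    present = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [fill if v is None else v for v in values]

ages = impute([25, 30, None, 40, 35], "mean")
salaries = impute([20000, 25000, None, 1000000], "median")
genders = impute(["Male", "Female", None, "Male", "Male"], "mode")

print(ages)      # [25, 30, 32.5, 40, 35]
print(salaries)  # [20000, 25000, 25000, 1000000]
print(genders)   # ['Male', 'Female', 'Male', 'Male', 'Male']
```

Note how the single extreme salary does not move the median fill value, which is the reason median imputation is preferred when outliers are present.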

Q2. Handling Outliers using Z-Score


Answer:
An outlier is a value significantly different from other observations. Outliers distort model training.

Z-Score Method:
Z = (x - µ) / σ
Where: x = data point, µ = mean, σ = standard deviation
Rule: If |Z| > 3, the point is considered an outlier and can be removed or capped.
Example: If mean salary = 50,000 and std = 5,000, then salary = 100,000 → Z = (100000-50000)/5000 = 10
→ Outlier!
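
As a quick check of the rule, the Z-score test can be written in a few lines. The function names `z_score` and `is_outlier` are illustrative, and the threshold of 3 is the one stated in the rule above.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def is_outlier(x, mu, sigma, threshold=3.0):
    """Flag x when its absolute Z-score exceeds the threshold."""
    return abs(z_score(x, mu, sigma)) > threshold

z = z_score(100_000, mu=50_000, sigma=5_000)
print(z)                                    # 10.0
print(is_outlier(100_000, 50_000, 5_000))   # True
print(is_outlier(55_000, 50_000, 5_000))    # False (Z = 1.0)
```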

Q3. Data Transformation: Normalization and Standardization


Answer:

Normalization (Min-Max Scaling):


Scales data to a fixed range [0, 1]. Used when features have different ranges.
X_normalized = (X - X_min) / (X_max - X_min)
Example: Age values [18, 25, 30, 60] → Normalized to [0.0, 0.167, 0.286, 1.0]

Standardization (Z-Score Normalization):


Transforms data so that Mean = 0 and Standard Deviation = 1. Used when data follows normal distribution.
X_standardized = (X - µ) / σ
Example: If ages have mean=30, std=10, then age 40 → (40-30)/10 = 1.0

Aspect | Normalization | Standardization
Range | 0 to 1 | -∞ to +∞ (typically -3 to 3)
Formula | (X - min) / (max - min) | (X - mean) / std
Use Case | Neural Networks, Image data | SVM, Logistic Regression, PCA
Outlier Sensitive | Yes (outliers affect min/max) | Less sensitive
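
Both scaling formulas can be tried on the age example. This sketch uses only the standard library; the function names are illustrative, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say whether the population or the sample formula is intended.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (population std assumed)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [18, 25, 30, 60]
normalized = [round(v, 3) for v in min_max_scale(ages)]
standardized = standardize(ages)

print(normalized)  # [0.0, 0.167, 0.286, 1.0]
```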

Q4. Feature Engineering


Answer:

a) Feature Scaling:
Ensuring all features have similar scales to prevent any one feature from dominating (e.g., salary vs. age).
Methods: Normalization and Standardization (explained above).

b) Encoding Categorical Variables:


• Label Encoding: Assigns integer values to categories. Example: [Red, Blue, Green] → [0, 1, 2]. Risk:
Creates false ordinal relationship.
• One-Hot Encoding: Creates binary columns for each category. Example: Color → Color_Red,
Color_Blue, Color_Green. Best for nominal data.
• Ordinal Encoding: Assigns ordered integers to ordered categories. Example: [Low, Medium, High] →
[1, 2, 3]. Use when order matters.
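
To make the difference between label and one-hot encoding concrete, here is a small dependency-free sketch. The helper names are hypothetical, and categories are numbered in order of first appearance, which is an assumption of this example.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {cat: i for i, cat in enumerate(dict.fromkeys(values))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one binary column per category for every value."""
    categories = list(dict.fromkeys(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

colors = ["Red", "Blue", "Green", "Blue"]
labels, mapping = label_encode(colors)
onehot, categories = one_hot_encode(colors)

print(labels)  # [0, 1, 2, 1]
print(onehot)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

The integer labels imply Red < Blue < Green, which is the false ordering risk mentioned above; the one-hot rows carry no such ordering.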

c) Feature Selection:
Selecting the most relevant features to reduce dimensionality and improve model performance.
• Filter Methods: Use statistical tests (correlation, chi-square) to select features independently of the
model
• Wrapper Methods: Use model performance to evaluate feature subsets (e.g., Recursive Feature
Elimination)
• Embedded Methods: Feature selection during model training (e.g., L1/Lasso Regularization)

d) Feature Extraction:
Creating new features from existing ones to capture more information (e.g., PCA for dimensionality reduction,
extracting 'day of week' from a date column).
UNIT 3: Model Evaluation & Performance Metrics

Q1. Model Evaluation Concepts


Answer:

Overfitting:
When a model learns the training data too well — including noise and irrelevant patterns — resulting in poor
performance on new data.
• Symptoms: Very high training accuracy, very low test accuracy
• Cause: Model too complex, insufficient training data, too many features
• Fix: Regularization, pruning (for Decision Trees), more training data, cross-validation

Underfitting:
When a model is too simple to capture the underlying patterns in training data, resulting in poor performance
on both training and test data.
• Symptoms: Low accuracy on both training and test sets
• Cause: Model too simple (high bias), insufficient training, too few features
• Fix: Increase model complexity, add more features, reduce regularization

Train-Test Split:
The dataset is split into two parts — typically 70-80% for training and 20-30% for testing. The model is trained
on training data and evaluated on unseen test data.
Common Split: 80% Train | 20% Test
Limitation: Results may vary depending on how data is split.

K-Fold Cross Validation:


The data is divided into K equal parts (folds). The model is trained K times, each time using K-1 folds for
training and 1 fold for testing. Final score = average of all K scores.
Final Score = (Score_1 + Score_2 + ... + Score_K) / K
Advantage: More reliable estimate — every data point is used for both training and testing. Common K
values: 5 or 10.
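
The fold bookkeeping can be sketched as follows. The function name `k_fold_indices` is hypothetical, and the sketch splits indices in their original order; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def average_score(scores):
    """Final score = mean of the K per-fold scores."""
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 5))
print(folds[0])  # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Each index appears in exactly one test fold, so every data point is used once for testing and K-1 times for training.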

Q2. Classification Performance Metrics


Answer:

Confusion Matrix:
                | Predicted Positive | Predicted Negative
Actual Positive | TP (True Positive) | FN (False Negative)
Actual Negative | FP (False Positive) | TN (True Negative)

Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) [Of predicted positives, how many are correct?]
Recall = TP / (TP + FN) [Of actual positives, how many detected?]
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
ROC Curve (Receiver Operating Characteristic):
• Plots True Positive Rate (Recall) vs False Positive Rate at various classification thresholds
• AUC (Area Under Curve): 1.0 = perfect model, 0.5 = random model
• Used to compare models — higher AUC means better model

Q3. Regression Performance Metrics


Answer:
MAE = (1/n) * Σ|y_i - ŷ_i| [Mean Absolute Error]
MSE = (1/n) * Σ(y_i - ŷ_i)^2 [Mean Squared Error]
RMSE = sqrt(MSE) [Root Mean Squared Error]
R² = 1 - [Σ(y_i - ŷ_i)^2 / Σ(y_i - ȳ)^2] [Coefficient of Determination]
• R² = 1: Perfect model | R² = 0: Baseline model (predicts mean) | R² < 0: Worse than baseline
UNIT 4: Regression & Classification Algorithms

Q1. Simple Linear Regression


Answer:
Simple Linear Regression models the relationship between one independent variable (X) and one dependent
variable (Y) using a straight line.
Y = a + b*X where:
b (slope) = [n*ΣXY - ΣX*ΣY] / [n*ΣX² - (ΣX)²]
a (intercept) = (ΣY - b*ΣX) / n
Example: Predicting exam score (Y) based on study hours (X).

Multiple Linear Regression:


Models the relationship between multiple independent variables and one dependent variable.
Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn
Example: Predicting house price (Y) based on size (X1), location (X2), and age (X3).

Q2. Logistic Regression


Answer:
Logistic Regression is a supervised classification algorithm that predicts the probability that an input belongs
to a binary class (0 or 1).
Why not Linear Regression? Linear regression can predict values outside [0,1], which are invalid as
probabilities. Logistic regression uses the Sigmoid function.
P(Y=1) = 1 / (1 + e^(-z)) where z = b0 + b1*X1 + b2*X2 + ...
• Output range: (0, 1) — S-shaped curve
• If P >= 0.5 → Predict class 1 (positive)
• If P < 0.5 → Predict class 0 (negative)
Applications: Spam detection, disease diagnosis, credit scoring, customer churn prediction.

Q3. Decision Trees


Answer:
A Decision Tree is a tree-structured model where internal nodes represent feature tests, branches represent
outcomes, and leaf nodes represent final decisions/class labels.

ID3 Algorithm (uses Information Gain):


Entropy(S) = -Σ p_i * log2(p_i)
Information Gain(A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
• Select the attribute with the HIGHEST Information Gain as root/split node

C4.5 Algorithm (uses Gain Ratio):


Gain Ratio = Information Gain / Split Information
Split Info = -Σ (|Sv|/|S|) * log2(|Sv|/|S|)

CART Algorithm (uses Gini Index):


Gini(S) = 1 - Σ p_i²
• Select attribute with LOWEST Gini Index

Overfitting Prevention:
• Max Depth: Limit tree depth (e.g., max_depth=5)
• Min Samples Split: Minimum samples required to split a node
• Min Samples Leaf: Minimum samples at leaf nodes

Q4. Random Forest


Answer:
Random Forest is an ensemble method that builds multiple decision trees on random subsets of data
(Bootstrap Sampling) and combines predictions via majority voting (classification) or averaging (regression).
• Step 1: Create multiple random samples with replacement (Bootstrap)
• Step 2: Build a Decision Tree on each sample — at each split, only a random subset of features is
considered (Feature Randomness)
• Step 3: Combine predictions — majority vote for classification, mean for regression
• Advantages: Reduces overfitting, handles high-dimensional data, robust to noise
• Disadvantages: Slower training, requires more memory, less interpretable

Q5. K-Nearest Neighbors (KNN)


Answer:
KNN is a lazy learning algorithm — it stores all training data and makes predictions based on the K nearest
data points to the query point.

Algorithm Steps:
• 1. Choose K (number of neighbors)
• 2. Calculate Euclidean distance from query point to all training points
• 3. Sort distances in ascending order
• 4. Select K nearest neighbors
• 5. For Classification: majority vote; For Regression: average of K neighbors
Euclidean Distance = sqrt((x2-x1)^2 + (y2-y1)^2)
• Choosing K: Too small K → overfitting, Too large K → underfitting. Rule of thumb: K = sqrt(N)
• Feature Scaling is required for KNN since it is distance-based

Q6. Support Vector Machine (SVM)


Answer:
SVM finds the optimal hyperplane that maximally separates two classes. The hyperplane is positioned to
maximize the margin — the distance between the hyperplane and the nearest data points (support vectors).

Key Concepts:
• Hyperplane: Decision boundary separating classes (line in 2D, plane in 3D)
• Support Vectors: Data points closest to the hyperplane — these determine the hyperplane position
• Margin: Distance between hyperplane and nearest points — SVM maximizes this
• Hard Margin: Perfect separation, no misclassification (sensitive to outliers)
• Soft Margin: Allows some misclassification for better generalization
• Kernel Trick: Maps data to higher dimensions for non-linearly separable data (RBF, Polynomial
kernels)
UNIT 5: Clustering & Dimensionality Reduction

Q1. Clustering Techniques


Answer:

K-Means Clustering:
Partitions data into K clusters by minimizing within-cluster variance. Each data point belongs to the cluster
with the nearest centroid.
• 1. Initialize K centroids randomly
• 2. Assign each point to the nearest centroid using Euclidean distance
• 3. Recalculate centroids as the mean of all points in each cluster
• 4. Repeat steps 2-3 until centroids no longer change
• Disadvantage: Must specify K, sensitive to outliers, may converge to local optima

K-Medoids Clustering:
Similar to K-Means but uses actual data points (medoids) as cluster centers instead of means. More robust to
outliers than K-Means.
• Medoid: The data point within a cluster that minimizes the total distance to all other points in the cluster
• Algorithm: PAM (Partitioning Around Medoids)

Hierarchical Clustering:
Builds a hierarchy of clusters without specifying K in advance. Represented as a dendrogram.
• Agglomerative (Bottom-Up): Start with each point as its own cluster; merge closest clusters iteratively
• Divisive (Top-Down): Start with all points in one cluster; split iteratively

Linkage Methods:
• Single Linkage: Distance = minimum distance between any two points in clusters
• Complete Linkage: Distance = maximum distance between any two points in clusters
• Average Linkage: Distance = average of all pairwise distances

Q2. Dimensionality Reduction: PCA and LDA


Answer:

PCA (Principal Component Analysis):


PCA transforms high-dimensional data into fewer dimensions (Principal Components) while retaining
maximum variance. It finds new orthogonal axes in the direction of maximum variance.
• PC1: Direction of maximum variance in data
• PC2: Direction of second maximum variance (perpendicular to PC1)

Steps:
• 1. Standardize the data
• 2. Compute covariance matrix
• 3. Calculate eigenvectors and eigenvalues
• 4. Sort eigenvectors by eigenvalue (largest first)
• 5. Select top K eigenvectors as principal components
• 6. Transform data to new dimensions
• Use Case: Visualizing high-dimensional data, removing redundant features, image compression
LDA (Linear Discriminant Analysis):
LDA finds the linear combinations of features that best separate two or more classes. Unlike PCA
(unsupervised), LDA is supervised — it uses class labels.
• Goal: Maximize between-class variance and minimize within-class variance
• Produces at most C-1 components where C = number of classes
• Better for classification tasks than PCA

Aspect | PCA | LDA
Type | Unsupervised | Supervised
Uses Labels | No | Yes
Goal | Maximize variance | Maximize class separation
Use Case | Feature reduction, visualization | Classification, dimensionality reduction


UNIT 6: Ensemble Learning, Regularization & Bias-Variance Trade-off

Q1. Ensemble Learning: Bagging, Boosting, Stacking


Answer:

Bagging (Bootstrap Aggregating):


Multiple models are trained independently and in parallel on different random subsets of the training data
(with replacement). Predictions are combined by voting (classification) or averaging (regression).
• Example Algorithm: Random Forest
• Reduces: Variance (overfitting)
• Key Property: Models are trained in parallel and independently

Boosting:
Models are trained sequentially. Each new model focuses on correcting the errors of the previous model.
Data points misclassified by earlier models get higher weights.
• Example Algorithms: AdaBoost, Gradient Boosting, XGBoost
• Reduces: Bias (underfitting) more than variance
• Key Property: Sequential training — each model learns from the previous model's mistakes

Stacking:
Multiple base models (Level-0) are trained, and their predictions are used as features to train a meta-model
(Level-1) that makes the final prediction.
• Base models can be of different types (e.g., KNN + SVM + Decision Tree)
• Meta-model is typically Logistic Regression or another simple model
• Often achieves better accuracy than any individual base model

Q2. Hyperparameter Tuning: Grid Search and Random Search


Answer:

Grid Search:
Exhaustively tries every combination of specified hyperparameter values. Guaranteed to find the best
combination within the specified grid (not necessarily the global optimum), but is computationally expensive.
Example: For KNN, grid search might try K = {1,3,5,7,9} × distance = {euclidean, manhattan} → 10 total
combinations tested.

Random Search:
Randomly samples hyperparameter combinations from the specified distributions. More efficient than grid
search — often finds a good solution faster by not exhaustively checking all combinations.
Example: Instead of testing all 100 combinations, randomly test 20 and pick the best.
• Grid Search: Best when few hyperparameters and small search space
• Random Search: Best when many hyperparameters and large search space
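
The two search strategies differ only in how candidate settings are generated. The sketch below enumerates the KNN grid from the example and draws a few random candidates; the function names and the dictionary layout are assumptions of this illustration, and a model would still need to be trained and scored for each candidate.

```python
import random
from itertools import product

param_grid = {"k": [1, 3, 5, 7, 9], "distance": ["euclidean", "manhattan"]}

def grid_candidates(grid):
    """Every combination of the listed values (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[key] for key in keys))]

def random_candidates(grid, n_iter, seed=0):
    """n_iter independently sampled settings (repeats are possible)."""
    rng = random.Random(seed)
    return [{key: rng.choice(values) for key, values in grid.items()}
            for _ in range(n_iter)]

all_combos = grid_candidates(param_grid)
sampled = random_candidates(param_grid, n_iter=4)

print(len(all_combos))  # 10
print(len(sampled))     # 4
```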

Q3. Regularization: Ridge (L2) and Lasso (L1)


Answer:
Regularization adds a penalty term to the loss function to prevent overfitting by discouraging large model
weights.
Normal Loss: L = Σ(y_i - ŷ_i)^2
Regularized Loss: L = Σ(y_i - ŷ_i)^2 + λ * Penalty

L1 Regularization (Lasso):
Loss = Σ(y - ŷ)^2 + λ * Σ|w_i|
• Penalty = sum of absolute values of weights
• Can make some weights exactly ZERO → Automatic Feature Selection
• Best when many features are irrelevant

L2 Regularization (Ridge):
Loss = Σ(y - ŷ)^2 + λ * Σw_i^2
• Penalty = sum of squared weights
• Shrinks weights towards zero but NEVER exactly zero
• Best when all features are important; handles multicollinearity

Elastic Net:
Combination of L1 and L2 regularization. Best for datasets with many correlated features.
Loss = Σ(y - ŷ)^2 + λ1*Σ|w_i| + λ2*Σw_i^2
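
The three penalised losses can be compared directly on a toy case. All numbers below (targets, predictions, weights, and λ values) are made up purely for illustration, and the function names are hypothetical.

```python
def squared_error(y_true, y_pred):
    """Unregularized loss: sum of squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def lasso_loss(y_true, y_pred, weights, lam):
    """Squared error plus the L1 penalty (sum of absolute weights)."""
    return squared_error(y_true, y_pred) + lam * sum(abs(w) for w in weights)

def ridge_loss(y_true, y_pred, weights, lam):
    """Squared error plus the L2 penalty (sum of squared weights)."""
    return squared_error(y_true, y_pred) + lam * sum(w ** 2 for w in weights)

def elastic_net_loss(y_true, y_pred, weights, lam1, lam2):
    """Squared error plus both penalties."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w ** 2 for w in weights)
    return squared_error(y_true, y_pred) + lam1 * l1 + lam2 * l2

# Illustrative values only
y_true, y_pred, weights = [3.0, 5.0], [2.0, 5.0], [2.0, -1.0]

print(lasso_loss(y_true, y_pred, weights, lam=0.5))          # 2.5
print(ridge_loss(y_true, y_pred, weights, lam=0.5))          # 3.5
print(elastic_net_loss(y_true, y_pred, weights, 0.5, 0.5))   # 5.0
```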

Q4. Bias-Variance Trade-off


Answer:
The Bias-Variance Trade-off describes the balance between model simplicity (high bias) and model
complexity (high variance) to achieve optimal prediction performance.

Bias:
The error from incorrect assumptions in the learning algorithm. High bias means the model is too simple and
fails to capture the underlying pattern (Underfitting).
• High Bias: Low training accuracy AND low test accuracy
• Example: Using a straight line to fit data that is clearly curved

Variance:
The error from sensitivity to small fluctuations in training data. High variance means the model memorizes
training data including noise (Overfitting).
• High Variance: High training accuracy BUT low test accuracy
• Example: Very deep Decision Tree that perfectly fits training data but fails on new data

Trade-off:
Total Error = Bias^2 + Variance + Irreducible Noise. The goal is to find the sweet spot — a model that is
complex enough to learn patterns but not so complex that it memorizes noise.

Scenario | Bias | Variance | Result
Simple Model | High | Low | Underfitting
Complex Model | Low | High | Overfitting
Optimal Model | Low | Low | Best Performance


ASSIGNMENT 2 — NUMERICAL SOLUTIONS
Q1. Find-S Algorithm
Dataset:
Ex | Sky | AirTemp | Humidity | Wind | Water | Forecast | PlaySport
1 | Sunny | Warm | Normal | Strong | Warm | Same | YES
2 | Sunny | Warm | High | Strong | Warm | Same | YES
3 | Rainy | Cold | High | Strong | Warm | Change | NO
4 | Sunny | Warm | High | Strong | Cool | Change | YES

Step-by-Step Solution:
• Initialize: h = <∅, ∅, ∅, ∅, ∅, ∅>

• Example 1 (YES):
→ h = <Sunny, Warm, Normal, Strong, Warm, Same> (First positive example, adopt directly)

• Example 2 (YES):
→ Humidity: Normal ≠ High → Generalize to '?'
→ h = <Sunny, Warm, ?, Strong, Warm, Same>

• Example 3 (NO): → SKIP (negative example)

• Example 4 (YES):
→ Water: Warm ≠ Cool → Generalize to '?'
→ Forecast: Same ≠ Change → Generalize to '?'
→ h = <Sunny, Warm, ?, Strong, ?, ?>

FINAL HYPOTHESIS: h = <Sunny, Warm, ?, Strong, ?, ?>
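
The same trace can be reproduced with a short script. This is a minimal sketch of Find-S for conjunctive hypotheses; the function name `find_s` and the use of `None` for the all-∅ hypothesis are choices made for this illustration.

```python
def find_s(examples):
    """examples: list of (attribute_tuple, label) pairs, label is 'YES' or 'NO'."""
    hypothesis = None  # None stands for the most specific hypothesis <∅, ..., ∅>
    for attributes, label in examples:
        if label != "YES":
            continue  # Find-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # first positive example is adopted as-is
        else:
            # Keep matching attributes, generalize mismatches to '?'
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return hypothesis

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "YES"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "YES"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "NO"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "YES"),
]

h = find_s(data)
print(h)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```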

Q2. Candidate Elimination Algorithm


Dataset: Outlook, Temp, Humidity, Wind, EnjoySport
Ex | Outlook | Temp | Humidity | Wind | EnjoySport
1 | Sunny | Warm | Normal | Strong | YES
2 | Sunny | Warm | High | Strong | YES
3 | Rainy | Cold | High | Strong | NO
4 | Sunny | Warm | High | Weak | YES

Initial State:
S0 = <∅, ∅, ∅, ∅>
G0 = <?, ?, ?, ?>

Example 1 (YES): Sunny, Warm, Normal, Strong

→ S1 = <Sunny, Warm, Normal, Strong> (first positive: S = example)
→ G1 = <?, ?, ?, ?> (G unchanged, does not reject positive)

Example 2 (YES): Sunny, Warm, High, Strong

→ Humidity differs (Normal vs High) → generalize S
→ S2 = <Sunny, Warm, ?, Strong>
→ G2 = <?, ?, ?, ?>

Example 3 (NO): Rainy, Cold, High, Strong

→ G must reject this negative example. Specialize G minimally:
→ G3 = { <Sunny,?,?,?>, <?,Warm,?,?>, <?,?,Normal,?> }
→ Remove any G hypotheses that do not cover positive examples:
→ <?,?,Normal,?> does NOT cover Ex2 (Humidity=High) → Remove it
→ G3 = { <Sunny,?,?,?>, <?,Warm,?,?> }

Example 4 (YES): Sunny, Warm, High, Weak

→ S: Wind Strong ≠ Weak → generalize
→ S4 = <Sunny, Warm, ?, ?>
→ G: Check which G hypotheses cover this example:
→ <Sunny,?,?,?> covers it (Sunny matches) ✓
→ <?,Warm,?,?> covers it (Warm matches) ✓
→ G4 = { <Sunny,?,?,?>, <?,Warm,?,?> } (no change)

FINAL SPECIFIC BOUNDARY (S): <Sunny, Warm, ?, ?>


FINAL GENERAL BOUNDARY (G): { <Sunny,?,?,?>, <?,Warm,?,?> }
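
For readers who want to check the boundaries mechanically, here is a simplified sketch of Candidate Elimination. It makes several assumptions that go beyond the text: S is kept as a single conjunctive hypothesis, attribute domains are inferred from the values seen in the data, a specialization of G is kept only if it still agrees with the current S (which prunes candidates such as the Humidity=Normal one immediately), and redundant members of G are not pruned.

```python
def covers(h, x):
    """True if hypothesis h accepts example x ('?' matches any value)."""
    return all(a == "?" or a == v for a, v in zip(h, x))

def candidate_elimination(examples):
    n = len(examples[0][0])
    domains = [sorted({x[i] for x, _ in examples}) for i in range(n)]
    S = None                            # None stands for <∅, ..., ∅>
    G = [tuple("?" for _ in range(n))]  # most general hypothesis
    for x, label in examples:
        if label == "YES":
            # Generalize S just enough to cover x; drop G members that reject x.
            S = tuple(x) if S is None else tuple(s if s == v else "?" for s, v in zip(S, x))
            G = [g for g in G if covers(g, x)]
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)  # already rejects the negative example
                    continue
                for i in range(n):
                    if g[i] != "?":
                        continue
                    for v in domains[i]:
                        # Keep specializations that reject x and stay consistent with S.
                        if v != x[i] and (S is None or S[i] == v):
                            new_G.append(g[:i] + (v,) + g[i + 1:])
            G = list(dict.fromkeys(new_G))  # remove duplicates, keep order
    return S, G

data = [
    (("Sunny", "Warm", "Normal", "Strong"), "YES"),
    (("Sunny", "Warm", "High", "Strong"), "YES"),
    (("Rainy", "Cold", "High", "Strong"), "NO"),
    (("Sunny", "Warm", "High", "Weak"), "YES"),
]

S, G = candidate_elimination(data)
print(S)  # ('Sunny', 'Warm', '?', '?')
print(G)  # [('Sunny', '?', '?', '?'), ('?', 'Warm', '?', '?')]
```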

Q3. Simple Linear Regression


Given: X = [1,2,3,4,5], Y = [2,4,5,4,5] Find: Y = a + bX and predict Y for X=6

X | Y | X² | XY
1 | 2 | 1 | 2
2 | 4 | 4 | 8
3 | 5 | 9 | 15
4 | 4 | 16 | 16
5 | 5 | 25 | 25
ΣX=15 | ΣY=20 | ΣX²=55 | ΣXY=66

Calculations (n=5):
b = [n*ΣXY - ΣX*ΣY] / [n*ΣX² - (ΣX)²]
b = [5*66 - 15*20] / [5*55 - 15²]
b = [330 - 300] / [275 - 225]
b = 30 / 50 = 0.6

a = (ΣY - b*ΣX) / n
a = (20 - 0.6*15) / 5
a = (20 - 9) / 5 = 11/5 = 2.2

Regression Equation: Y = 2.2 + 0.6X

For X=6: Y = 2.2 + 0.6*6 = 2.2 + 3.6 = 5.8


Predicted Y for X=6: Y = 5.8
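
The slope and intercept formulas translate directly into code. The function name `fit_line` is illustrative; the variable names mirror the sums in the table.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + b*X."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = (sy - b * sx) / n
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
prediction = a + b * 6

print(round(a, 2), round(b, 2), round(prediction, 2))  # 2.2 0.6 5.8
```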

Q4. Multiple Linear Regression


Given: X1=[1,2,3,4], X2=[2,1,4,3], Y=[5,6,10,12] Find: Y = b0 + b1*X1 + b2*X2

X1 | X2 | Y | X1² | X2² | X1X2 | X1Y | X2Y
1 | 2 | 5 | 1 | 4 | 2 | 5 | 10
2 | 1 | 6 | 4 | 1 | 2 | 12 | 6
3 | 4 | 10 | 9 | 16 | 12 | 30 | 40
4 | 3 | 12 | 16 | 9 | 12 | 48 | 36
Σ=10 | Σ=10 | Σ=33 | Σ=30 | Σ=30 | Σ=28 | Σ=95 | Σ=92

Solution using Normal Equations (n=4):


n=4, ΣX1=10, ΣX2=10, ΣY=33
ΣX1²=30, ΣX2²=30, ΣX1X2=28, ΣX1Y=95, ΣX2Y=92

Setting up the 3 normal equations:


Eq1: ΣY = n*b0 + b1*ΣX1 + b2*ΣX2
33 = 4*b0 + 10*b1 + 10*b2 ... (i)
Eq2: ΣX1Y = b0*ΣX1 + b1*ΣX1² + b2*ΣX1X2
95 = 10*b0 + 30*b1 + 28*b2 ... (ii)
Eq3: ΣX2Y = b0*ΣX2 + b1*ΣX1X2 + b2*ΣX2²
92 = 10*b0 + 28*b1 + 30*b2 ... (iii)

Solving simultaneously:
From (ii) - (iii): 3 = 2*b1 - 2*b2 → b1 - b2 = 1.5 ... (iv)
From (i): b0 = (33 - 10*b1 - 10*b2)/4
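From (ii) + (iii): 187 = 20*b0 + 58*(b1 + b2); substituting b0 from (i) gives 187 = 165 + 8*(b1 + b2) → b1 + b2 = 2.75 ... (v)
From (iv) and (v): b1 = 2.125, b2 = 0.625; then b0 = (33 - 27.5)/4 = 1.375
Check in (i): 4*1.375 + 10*2.125 + 10*0.625 = 5.5 + 21.25 + 6.25 = 33 ✓
Y = 1.375 + 2.125*X1 + 0.625*X2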
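
Solving the three normal equations programmatically is a useful way to confirm a hand calculation. The sketch below builds the system from the data and solves it with exact fractions; the helper `solve_linear_system` is a hypothetical Gauss-Jordan routine written for this example and assumes the system has a unique solution.

```python
from fractions import Fraction

def solve_linear_system(A, b):
    """Gauss-Jordan elimination with exact fractions (square, non-singular A assumed)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

X1, X2, Y = [1, 2, 3, 4], [2, 1, 4, 3], [5, 6, 10, 12]
n = len(Y)

# Coefficient matrix and right-hand side of normal equations (i), (ii), (iii)
A = [
    [n,       sum(X1),     sum(X2)],
    [sum(X1), dot(X1, X1), dot(X1, X2)],
    [sum(X2), dot(X1, X2), dot(X2, X2)],
]
rhs = [sum(Y), dot(X1, Y), dot(X2, Y)]

b0, b1, b2 = (float(v) for v in solve_linear_system(A, rhs))
print(b0, b1, b2)  # 1.375 2.125 0.625
```

Substituting these values back into each of the three equations reproduces 33, 95, and 92 exactly.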

Q5. Logistic Regression


Given: P(Y=1) = 1 / (1 + e^-(2 + 0.5X)) Find probability for X=2 and class label.

Solution:
z = 2 + 0.5*X = 2 + 0.5*2 = 2 + 1 = 3
P(Y=1) = 1 / (1 + e^-3)
= 1 / (1 + 0.0498)
= 1 / 1.0498
= 0.9526

P(Y=1) for X=2 = 0.9526 (95.26%)


Since P = 0.9526 ≥ 0.5 → Class Label = 1 (Positive Class)
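
The calculation is easy to reproduce. In this sketch the coefficients 2 and 0.5 and the 0.5 threshold come from the problem statement; the function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, b0=2.0, b1=0.5, threshold=0.5):
    """Return (probability of class 1, predicted class label)."""
    p = sigmoid(b0 + b1 * x)
    return p, int(p >= threshold)

p, label = predict(2)
print(round(p, 4), label)  # 0.9526 1
```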

Q6. Decision Tree — ID3 Algorithm


Dataset: Outlook, Temp, Humidity, Wind, Play (4 examples — 2 Yes, 2 No)

Outlook | Temp | Humidity | Wind | Play
Sunny | Hot | High | Weak | No
Sunny | Hot | High | Strong | No
Overcast | Hot | High | Weak | Yes
Rain | Mild | High | Weak | Yes

Step 1: Entropy of entire dataset


Total: 4 examples → 2 Yes (p=0.5), 2 No (p=0.5)
Entropy(S) = -p(Yes)*log2(p(Yes)) - p(No)*log2(p(No))
= -(0.5)*log2(0.5) - (0.5)*log2(0.5)
= -(0.5)*(-1) - (0.5)*(-1) = 0.5 + 0.5 = 1.0

Step 2: Information Gain for each attribute

Outlook (Sunny=2, Overcast=1, Rain=1):


Sunny: 2 examples → 0 Yes, 2 No → Entropy = 0
Overcast: 1 example → 1 Yes, 0 No → Entropy = 0
Rain: 1 example → 1 Yes, 0 No → Entropy = 0
IG(Outlook) = 1.0 - (2/4)*0 - (1/4)*0 - (1/4)*0 = 1.0

Wind (Weak=3, Strong=1):


Weak: 3 examples → 2 Yes, 1 No → Entropy = -(2/3)log2(2/3) - (1/3)log2(1/3)
≈ 0.918
Strong: 1 example → 0 Yes, 1 No → Entropy = 0
IG(Wind) = 1.0 - (3/4)*0.918 - (1/4)*0 = 1.0 - 0.689 = 0.311

Result: Outlook has IG = 1.0 (highest) → Outlook is the ROOT NODE


Overcast branch → always Yes (pure leaf). Sunny → No. Rain → Yes. Tree is fully determined.
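
Entropy and Information Gain for all four attributes can be computed with a few lines of Python. The function names and the tuple layout of `rows` (label in the last position) are assumptions of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index, label_index=-1):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    labels = [r[label_index] for r in rows]
    gain = entropy(labels)
    for value in {r[attr_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
]
attributes = ["Outlook", "Temp", "Humidity", "Wind"]
gains = {name: information_gain(rows, i) for i, name in enumerate(attributes)}

for name, gain in gains.items():
    print(name, round(gain, 3))
# Outlook 1.0
# Temp 0.311
# Humidity 0.0
# Wind 0.311
```

Humidity has zero gain because every row has the same value, so splitting on it cannot reduce entropy.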

Q7. KNN Classification (K=3)


Classify point P(3,2) using K=3. Training data: A(1,2)=Red, B(2,3)=Red, C(3,3)=Blue, D(6,5)=Blue

Step 1: Calculate Euclidean Distance from P(3,2) to each point:


Point | Coordinates | Distance Formula | Distance | Class
A | (1,2) | √((3-1)² + (2-2)²) = √(4+0) | 2.00 | Red
B | (2,3) | √((3-2)² + (2-3)²) = √(1+1) | 1.41 | Red
C | (3,3) | √((3-3)² + (2-3)²) = √(0+1) | 1.00 | Blue
D | (6,5) | √((3-6)² + (2-5)²) = √(9+9) | 4.24 | Blue

Step 2: Sort by distance and select K=3 nearest neighbors:


• 1st: C (distance=1.00) → Blue
• 2nd: B (distance=1.41) → Red
• 3rd: A (distance=2.00) → Red

Step 3: Majority Voting:


Red = 2 votes | Blue = 1 vote
PREDICTION: Point P(3,2) → Class = RED
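
The three steps map onto a short function. This sketch uses `math.dist` (available in Python 3.8 and later) for the Euclidean distance; the function name `knn_classify` and the data layout are illustrative.

```python
import math
from collections import Counter

def knn_classify(query, training, k):
    """training: list of (point, label). Returns (predicted label, k nearest neighbors)."""
    neighbors = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

training = [((1, 2), "Red"), ((2, 3), "Red"), ((3, 3), "Blue"), ((6, 5), "Blue")]
prediction, neighbors = knn_classify((3, 2), training, k=3)

print(neighbors)   # [((3, 3), 'Blue'), ((2, 3), 'Red'), ((1, 2), 'Red')]
print(prediction)  # Red
```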

Q8. K-Means Clustering — One Iteration


Points: P1(1,1), P2(2,1), P3(4,3), P4(5,4) Initial Centroids: C1=(1,1), C2=(5,4)

Step 1: Assign each point to nearest centroid


Point | Coords | Dist to C1(1,1) | Dist to C2(5,4) | Assigned Cluster
P1 | (1,1) | 0.00 | √(16+9)=5.00 | C1 (closer)
P2 | (2,1) | √(1+0)=1.00 | √(9+9)=4.24 | C1 (closer)
P3 | (4,3) | √(9+4)=3.61 | √(1+1)=1.41 | C2 (closer)
P4 | (5,4) | √(16+9)=5.00 | 0.00 | C2 (closer)

Step 2: Recalculate Centroids


Cluster 1: {P1(1,1), P2(2,1)}
New C1 = ((1+2)/2, (1+1)/2) = (1.5, 1.0)

Cluster 2: {P3(4,3), P4(5,4)}


New C2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)

After 1 Iteration:
Cluster 1: {P1(1,1), P2(2,1)} → New Centroid = (1.5, 1.0)
Cluster 2: {P3(4,3), P4(5,4)} → New Centroid = (4.5, 3.5)
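
One assignment-and-update pass can be sketched as below. The function name `kmeans_step` is hypothetical; an empty cluster simply keeps its old centroid, which is a design choice of this sketch rather than something the text specifies.

```python
import math

def kmeans_step(points, centroids):
    """One K-Means iteration: assign points, then recompute centroids as means."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    new_centroids = [
        tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
        for cluster, c in zip(clusters, centroids)
    ]
    return clusters, new_centroids

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
clusters, new_centroids = kmeans_step(points, [(1, 1), (5, 4)])

print(clusters)       # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
print(new_centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```

Calling `kmeans_step` again with the new centroids gives the same assignment, so for this data the algorithm has already converged.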

Q9. K-Medoids Clustering


Points: A(2,6), B(3,4), C(3,8), D(4,7), E(6,2), F(7,3) Initial Medoids: M1=(2,6)=A, M2=(6,2)=E

Step 1: Calculate distances from all points to each medoid


Distance formula (Euclidean): d = sqrt((x2-x1)^2 + (y2-y1)^2)

Point | Dist to M1=(2,6) | Dist to M2=(6,2) | Assigned Cluster
A(2,6) | 0.00 | √(16+16)=5.66 | M1
B(3,4) | √(1+4)=2.24 | √(9+4)=3.61 | M1
C(3,8) | √(1+4)=2.24 | √(9+36)=6.71 | M1
D(4,7) | √(4+1)=2.24 | √(4+25)=5.39 | M1
E(6,2) | √(16+16)=5.66 | 0.00 | M2
F(7,3) | √(25+9)=5.83 | √(1+1)=1.41 | M2

Cluster 1 (Medoid A): {A, B, C, D}


Cluster 2 (Medoid E): {E, F}
In K-Medoids, medoids remain as actual data points (unlike K-Means centroids). The algorithm would then
check if swapping medoids with other cluster members reduces total cost.
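
The assignment step and the swap check can be illustrated as follows. The function name `assign` is hypothetical, and only one candidate swap (replacing medoid A with D) is tried here as an example; a full PAM implementation would try every medoid and non-medoid pair.

```python
import math

points = {"A": (2, 6), "B": (3, 4), "C": (3, 8), "D": (4, 7), "E": (6, 2), "F": (7, 3)}

def assign(points, medoids):
    """Assign each point to its nearest medoid; return (clusters, total distance cost)."""
    clusters = {m: [] for m in medoids}
    cost = 0.0
    for name, p in points.items():
        nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
        clusters[nearest].append(name)
        cost += math.dist(p, points[nearest])
    return clusters, cost

clusters, cost = assign(points, ["A", "E"])
swap_clusters, swap_cost = assign(points, ["D", "E"])  # candidate swap: A replaced by D

print(clusters)             # {'A': ['A', 'B', 'C', 'D'], 'E': ['E', 'F']}
print(round(cost, 2))       # 8.12
print(round(swap_cost, 2))  # 8.23
```

Because the swap raises the total cost, A would be kept as the medoid in this particular comparison.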

Q10. Hierarchical Clustering — Agglomerative (Single Linkage)


Given Distance Matrix:

  | A | B | C
A | 0 | 2 | 6
B | 2 | 0 | 4
C | 6 | 4 | 0

Step-by-Step Agglomerative Clustering:


• Start: Each point is its own cluster: {A}, {B}, {C}

Iteration 1: Find minimum distance:


d(A,B) = 2 ← MINIMUM
d(A,C) = 6
d(B,C) = 4
→ Merge A and B into cluster {A,B}

Iteration 2: Update distances using Single Linkage (minimum):


d({A,B}, C) = min(d(A,C), d(B,C)) = min(6, 4) = 4
→ Only clusters: {A,B} and {C} — merge them

Iteration 3: Merge into one cluster {A,B,C}

Dendrogram Structure:
Step 1: A ─── B (merge at distance 2)
Step 2: {A,B} ─── C (merge at distance 4)
Final Cluster: {A, B, C} | Merging Order: A+B at d=2, then {A,B}+C at d=4
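
The merge order can be reproduced from the distance matrix with a small single-linkage routine. The function name `single_linkage` and the dictionary representation of the matrix (keyed by unordered pairs) are assumptions of this sketch.

```python
from itertools import combinations

def single_linkage(dist, labels):
    """Agglomerative clustering; returns a list of (cluster_a, cluster_b, distance) merges."""
    def linkage(a, b):
        # Single linkage: minimum distance between any two members
        return min(dist[frozenset((p, q))] for p in a for q in b)

    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((sorted(a), sorted(b), linkage(a, b)))
        clusters = [a | b] + [c for c in clusters if c not in (a, b)]
    return merges

dist = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 4}
merges = single_linkage(dist, "ABC")

print(merges)  # [(['A'], ['B'], 2), (['A', 'B'], ['C'], 4)]
```

Replacing `min` with `max` inside `linkage` would give complete linkage, under which the second merge would happen at distance 6 instead of 4.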

Q11. Confusion Matrix — Performance Metrics


Given Confusion Matrix:
                | Predicted Positive | Predicted Negative
Actual Positive | TP = 70 | FN = 18
Actual Negative | FP = 9 | TN = 45

TP=70, FN=18, FP=9, TN=45 → Total = 142

Accuracy = (TP + TN) / Total


= (70 + 45) / 142 = 115 / 142 = 0.8099 ≈ 80.99%

Precision = TP / (TP + FP)


= 70 / (70 + 9) = 70 / 79 = 0.8861 ≈ 88.61%

Recall = TP / (TP + FN)


= 70 / (70 + 18) = 70 / 88 = 0.7955 ≈ 79.55%

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)


= 2 * (0.8861 * 0.7955) / (0.8861 + 0.7955)
= 2 * 0.7048 / 1.6816
= 1.4096 / 1.6816 = 0.8383 ≈ 83.83%
Check using the counts directly: F1 = 2*TP / (2*TP + FP + FN) = 140 / 167 = 0.8383

Metric Value
Accuracy 80.99%
Precision 88.61%
Recall 79.55%
F1-Score 83.83%
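
As a quick check, the four metrics can be recomputed from the confusion-matrix counts. This is a minimal sketch; the variable names are chosen here for illustration.

```python
# Counts from the given confusion matrix.
tp, fn, fp, tn = 70, 18, 9, 45
total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1-Score", f1)]:
    print(f"{name}: {value:.2%}")
```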

Q12 & Q13. MSE and RMSE


Actual: [0.3, -0.5, 2.3, 7.2]
Predicted: [2.5, 0.0, 2.8, 8.0]

i    Actual (y)    Predicted (ŷ)    Error (y-ŷ)            Error² (y-ŷ)²
1    0.3           2.5              0.3-2.5 = -2.2         (-2.2)² = 4.84
2    -0.5          0.0              -0.5-0.0 = -0.5        (-0.5)² = 0.25
3    2.3           2.8              2.3-2.8 = -0.5         (-0.5)² = 0.25
4    7.2           8.0              7.2-8.0 = -0.8         (-0.8)² = 0.64
Total                                                      Σ = 5.98

Q12. Mean Squared Error (MSE):


MSE = (1/n) * Σ(y_i - ŷ_i)²
= (1/4) * (4.84 + 0.25 + 0.25 + 0.64)
= (1/4) * 5.98
= 1.495
MSE = 1.495
Q13. Root Mean Squared Error (RMSE):
RMSE = sqrt(MSE)
= sqrt(1.495)
= 1.2227
RMSE ≈ 1.2227
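
Both values can be recomputed with a short Python sketch that uses only the standard library; the variable names are chosen here for illustration.

```python
import math

actual = [0.3, -0.5, 2.3, 7.2]
predicted = [2.5, 0.0, 2.8, 8.0]

# Squared error for each (actual, predicted) pair
squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]

mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(round(mse, 4), round(rmse, 4))  # 1.495 1.2227
```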
EXAMPLE QUESTIONS — QUICK REFERENCE ANSWERS
Q1. What is Machine Learning?
Answer:
Machine Learning (ML) is a subset of AI that enables systems to automatically learn and improve from
experience (data) without being explicitly programmed. It focuses on building algorithms that can access data
and use it to learn for themselves. Types: Supervised, Unsupervised, Reinforcement Learning.

Q2. Explain Supervised Learning and its algorithms.


Answer:
Supervised Learning trains models on labeled data (input-output pairs). Algorithms: Linear Regression
(predict continuous values), Logistic Regression (binary classification), Decision Trees (tree-based decisions
using entropy/gini), Random Forest (ensemble of trees), KNN (distance-based), SVM (optimal hyperplane).

Q3. Explain Unsupervised Learning and its algorithms.


Answer:
Unsupervised Learning finds patterns in unlabeled data. Types: Clustering (K-Means: assigns points to the
nearest of K centroids; K-Medoids: uses actual data points as cluster centers; Hierarchical: builds a
dendrogram) and Dimensionality Reduction (PCA: finds principal components of maximum variance). LDA also
reduces dimensionality by maximizing class separation, but it needs class labels and is therefore a
supervised method.

Q4. What is Reinforcement Learning?


Answer:
Reinforcement Learning is a type of ML where an agent learns by interacting with an environment. It receives
positive rewards for good actions and penalties for bad ones. Over time, the agent learns a policy to
maximize cumulative reward. Example: Game-playing AI (AlphaGo), robotic navigation, self-driving cars.

Q5. Explain data preprocessing techniques.


Answer:
Data Preprocessing cleans and transforms raw data: (1) Handling Missing Values — Mean/Median/Mode
imputation or KNN Imputation; (2) Handling Outliers — Z-score (remove if |Z|>3) or IQR method; (3) Noise
Removal — smoothing; (4) Data Transformation — Normalization (0-1) or Standardization (mean=0, std=1);
(5) Removing Duplicates.
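
A few of these steps can be sketched with the Python standard library. The sample values below are hypothetical and chosen only to keep the arithmetic easy to follow; the sketch covers mean imputation, min-max normalization and standardization, not outlier or noise handling.

```python
import statistics

raw = [4.0, None, 6.0, 8.0, 2.0]  # hypothetical feature with one missing value

# (1) Mean imputation: replace missing entries with the mean of observed values.
observed = [v for v in raw if v is not None]
fill = statistics.mean(observed)
imputed = [fill if v is None else v for v in raw]

# (4a) Normalization: rescale to the range 0-1.
lo, hi = min(imputed), max(imputed)
normalized = [(v - lo) / (hi - lo) for v in imputed]

# (4b) Standardization: shift to mean 0 and scale to standard deviation 1.
mean = statistics.mean(imputed)
std = statistics.pstdev(imputed)  # population standard deviation
standardized = [(v - mean) / std for v in imputed]

print(imputed)  # [4.0, 5.0, 6.0, 8.0, 2.0]
print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```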

Q6. Types of categorical data.


Answer:
Categorical data comes in two types: (1) Nominal — categories with no order (e.g., Color: Red/Blue/Green,
Gender: Male/Female) — use One-Hot Encoding; (2) Ordinal — categories with a meaningful order (e.g.,
Rating: Low/Medium/High, Education: School/UG/PG) — use Label or Ordinal Encoding.

Q7. What is Label Encoding, One-Hot Encoding, Ordinal Encoding?


Answer:
Label Encoding: Assigns an integer to each category (Red=0, Blue=1, Green=2). It is simple, but on nominal
data the integers can imply an order that does not exist. One-Hot Encoding: Creates one binary column per
category (Red→[1,0,0], Blue→[0,1,0]). It avoids false ordering but increases dimensionality. Ordinal
Encoding: Assigns integers that follow the category rank (Low=1, Medium=2, High=3). It preserves order and
is suitable for ordinal data.
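
The three encodings can be sketched in plain Python as follows. The sample lists colors and ratings are hypothetical; the mappings follow the examples in the answer above.

```python
colors = ["Red", "Blue", "Green", "Blue"]  # hypothetical nominal feature
ratings = ["Medium", "Low", "High"]        # hypothetical ordinal feature

# Label encoding: one arbitrary integer per category.
label_map = {"Red": 0, "Blue": 1, "Green": 2}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category, in a fixed column order.
categories = ["Red", "Blue", "Green"]
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Ordinal encoding: integers that follow the natural rank of the categories.
rank = {"Low": 1, "Medium": 2, "High": 3}
ordinal_encoded = [rank[r] for r in ratings]

print(label_encoded)    # [0, 1, 2, 1]
print(one_hot)          # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
print(ordinal_encoded)  # [2, 1, 3]
```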

Q8. What is model evaluation?


Answer:
Model Evaluation assesses how well a trained model generalizes to unseen data. Key concepts: Train-Test
Split (e.g., 80% training / 20% testing), K-Fold Cross Validation (each of K folds serves once as the test
set, giving a more robust estimate), Overfitting (high training accuracy but low test accuracy: the model
memorizes the training data), Underfitting (low accuracy on both: the model is too simple).
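
The fold rotation in K-Fold Cross Validation can be sketched as below. This is a simplified illustration that splits indices in order without shuffling; the function name k_fold_indices is hypothetical, and a real workflow would usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample lands in a test fold exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With 10 samples and K=5, every fold trains on 8 samples and tests on 2, which corresponds to an 80/20 split repeated five times.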

Q9. Performance metrics for classification and regression.


Answer:
Classification: Accuracy=(TP+TN)/Total, Precision=TP/(TP+FP), Recall=TP/(TP+FN),
F1-Score=2*P*R/(P+R), Confusion Matrix (shows TP/FP/FN/TN), ROC Curve/AUC. Regression:
MAE=(1/n)*Σ|y-ŷ|, MSE=(1/n)*Σ(y-ŷ)², RMSE=√MSE, R²=1-SSres/SStot (measures explained variance).
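
The regression formulas can be tried out with a short sketch. It reuses the actual and predicted values from the MSE/RMSE numerical as sample data; the variable names are chosen here for illustration.

```python
actual = [0.3, -0.5, 2.3, 7.2]
predicted = [2.5, 0.0, 2.8, 8.0]
n = len(actual)

# MAE: mean of absolute errors
mae = sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / n

# R² = 1 - SSres / SStot
mean_y = sum(actual) / n
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in actual)
r2 = 1 - ss_res / ss_tot

print(round(mae, 4), round(r2, 4))  # 1.0 0.8332
```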

Q10. What is linear regression and its numerical?


Answer:
Linear Regression models the linear relationship between input (X) and output (Y). Simple LR: Y = a + bX
where b=[nΣXY-ΣXΣY]/[nΣX²-(ΣX)²], a=(ΣY-bΣX)/n. Multiple LR: Y=b0+b1X1+b2X2+...+bnXn. Example: For
X=[1,2,3,4,5], Y=[2,4,5,4,5]: b=0.6, a=2.2 → Y=2.2+0.6X → Predict X=6: Y=5.8.
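
The slope and intercept formulas can be checked against the example with a few lines of Python. This is a minimal sketch of simple linear regression by the closed-form least-squares formulas; the function name predict is chosen here for illustration.

```python
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

# b = [nΣXY - ΣXΣY] / [nΣX² - (ΣX)²],  a = (ΣY - bΣX) / n
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

def predict(x):
    """Predict Y for a given X using the fitted line Y = a + bX."""
    return a + b * x

print(round(a, 2), round(b, 2), round(predict(6), 2))  # 2.2 0.6 5.8
```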

Q11. What is classification and regression with algorithms?


Answer:
Classification predicts discrete class labels (e.g., Spam/Not Spam, Yes/No). Algorithms: Logistic Regression,
Decision Tree (ID3/C4.5/CART), Random Forest, KNN, SVM. Regression predicts continuous values (e.g.,
price, temperature). Algorithms: Linear Regression, Polynomial Regression, Ridge/Lasso, Decision Tree
Regressor, Random Forest Regressor.
