Machine Learning
Complete Assignment Answers
Assignment 1 (Theory) • Assignment 2 (Numericals) • Example Questions
Covers: Units 1–6 | All Numerical Problems | Concept Explanations
ASSIGNMENT 1 — THEORY QUESTIONS
UNIT 1: Introduction to Machine Learning
Q1. Definition and Scope of Machine Learning
Answer:
Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables computers to learn from data and
improve their performance on tasks without being explicitly programmed. Instead of writing rules manually,
the system discovers patterns in data on its own.
Formal Definition:
"A computer program is said to learn from experience E with respect to some task T and performance
measure P, if its performance at tasks in T, measured by P, improves with experience E." — Tom Mitchell
(1997)
Scope of ML:
• Healthcare: Predicting diseases from patient records (e.g., detecting cancer from X-rays)
• Finance: Fraud detection, stock price prediction, credit scoring
• E-commerce: Product recommendation systems (Amazon, Netflix)
• Natural Language Processing: Chatbots, language translation, sentiment analysis
• Computer Vision: Face recognition, object detection in self-driving cars
• Agriculture: Crop disease detection using image classification
• Education: Personalized learning and adaptive testing systems
Example:
Email spam filtering — The system learns from thousands of spam and non-spam emails to classify new
incoming emails automatically.
Q2. Traditional Programming vs Machine Learning
Answer:
Traditional Programming requires humans to write explicit rules, while Machine Learning allows systems to
derive rules from data automatically.
Aspect | Traditional Programming | Machine Learning
Input → Output | Data + Rules → Output | Data + Output → Rules
Rule Creation | Written by programmer | Learned from data
Flexibility | Rigid: rules must be updated manually | Adaptive: learns with new data
Performance | Depends on quality of rules | Improves with more data
Best For | Well-defined, deterministic tasks | Complex, pattern-based tasks
Example | Tax calculation software | Spam detection system
Q3. Types of Machine Learning
Answer:
1. Supervised Learning
The model is trained on labeled data — each input has a corresponding correct output. The model learns to
map inputs to outputs.
• Algorithms: Linear Regression, Logistic Regression, Decision Trees, SVM, KNN, Random Forest
• Example: Predicting house prices (input: size, location → output: price)
• Problem Types: Classification (spam/not spam) and Regression (price prediction)
2. Unsupervised Learning
The model is trained on unlabeled data. It discovers hidden patterns or groupings without any prior
knowledge of the output.
• Algorithms: K-Means Clustering, K-Medoids, Hierarchical Clustering, PCA
• Example: Grouping customers by purchasing behavior for market segmentation
• Types: Clustering and Dimensionality Reduction
3. Reinforcement Learning
An agent learns by interacting with an environment. It receives rewards for good actions and penalties for
bad actions, learning optimal strategies over time.
• Example: Training a game-playing AI (AlphaGo) or robot navigation
• Key Components: Agent, Environment, State, Action, Reward
• Goal: Maximize cumulative reward over time
Q4. Real-World Applications of ML
Answer:
• Healthcare: Disease diagnosis (cancer detection from MRI), drug discovery, patient outcome prediction
• Finance: Credit card fraud detection, algorithmic trading, loan approval prediction
• Retail/E-commerce: Product recommendations (Netflix, Amazon), demand forecasting, dynamic
pricing
• Transportation: Self-driving cars (Tesla Autopilot), route optimization (Google Maps), traffic prediction
• Natural Language: Google Translate, Siri/Alexa voice assistants, ChatGPT, sentiment analysis
• Manufacturing: Predictive maintenance of machines, quality control using computer vision
• Agriculture: Crop disease identification, yield prediction, precision farming
• Security: Face recognition systems, intrusion detection, cybersecurity threat detection
Q5. Concept Learning: Find-S Algorithm
Answer:
Concept learning is the problem of searching through a predefined space of hypotheses to find the one that
best fits the training examples.
Find-S Algorithm:
Find-S starts with the most specific hypothesis (empty set) and generalizes it only when a positive example is
encountered. Negative examples are ignored.
Steps:
• Step 1: Initialize h = <∅, ∅, ∅, ...> (most specific hypothesis)
• Step 2: For each positive training example x:
• For each attribute a_i: if a_i in h is ∅ → copy the value from x; if a_i in h = a_i in x → no change; else → replace with '?' (generalize)
• Step 3: Ignore all negative examples
• Step 4: Output the final hypothesis h
Key Terms:
• Hypothesis: A rule that defines when output is YES
• Specific: Very strict, exact match needed (e.g., <Sunny, Warm, Normal, Strong>)
• General: Flexible, many cases match (e.g., <Sunny, ?, ?, ?>)
• '?': Accepts any value for that attribute
Candidate Elimination Algorithm:
Unlike Find-S (which only maintains one hypothesis), Candidate Elimination maintains two boundaries:
• S (Specific Boundary): Most specific set of hypotheses consistent with training data
• G (General Boundary): Most general set of hypotheses consistent with training data
The version space is the set of all hypotheses between S and G. On positive examples, S is generalized. On
negative examples, G is specialized.
UNIT 2: Data Preprocessing & Feature Engineering
Q1. Data Preprocessing: Handling Missing Values
Answer:
Missing values occur when no data is stored for a variable in an observation. If not handled, they can distort
analysis and reduce model accuracy.
Methods to Handle Missing Values:
a) Mean Imputation
Replace missing values with the mean of the column. Suitable for normally distributed numerical data.
Missing Value = Mean of Column = (Sum of non-missing values) / (Count of non-missing values)
Example: Age column: [25, 30, NaN, 40, 35] → Mean = (25+30+40+35)/4 = 32.5 → Replace NaN with 32.5
b) Median Imputation
Replace missing values with the median. Better than mean when data has outliers.
Missing Value = Median = Middle value after sorting
Example: Salary: [20000, 25000, NaN, 1000000] → Median of the observed values = 25000 (outlier-resistant; the mean would be about 348,333)
c) Mode Imputation
Replace missing values with the most frequently occurring value. Best for categorical data.
Example: Gender: [Male, Female, NaN, Male, Male] → Mode = Male → Fill NaN with Male
d) KNN Imputation
Find K nearest neighbors of the missing record and use their average to fill the gap. More accurate but
computationally expensive.
Example: If a student has missing marks, find K most similar students and average their marks.
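The mean, median, and mode examples above can be reproduced with a short sketch. The helper name `impute` and the use of `None` to stand in for NaN are assumptions made for illustration; only the Python standard library is used.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Fill None entries using the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_functions = {"mean": mean, "median": median, "mode": mode}
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]

# None plays the role of NaN in the examples above.
ages = impute([25, 30, None, 40, 35], "mean")
salaries = impute([20000, 25000, None, 1000000], "median")
genders = impute(["Male", "Female", None, "Male", "Male"], "mode")

print(ages)
print(salaries)
print(genders)
```

KNN imputation is not shown here because it needs a distance measure over the other columns of each record.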
Q2. Handling Outliers using Z-Score
Answer:
An outlier is a value significantly different from other observations. Outliers distort model training.
Z-Score Method:
Z = (x - µ) / σ
Where: x = data point, µ = mean, σ = standard deviation
Rule: If |Z| > 3, the point is considered an outlier and can be removed or capped.
Example: If mean salary = 50,000 and std = 5,000, then salary = 100,000 → Z = (100000-50000)/5000 = 10
→ Outlier!
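As a quick check of the rule above, here is a minimal sketch of the Z-score test. The function names and the default threshold argument are illustrative choices, not part of any library.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

def is_outlier(x, mean, std, threshold=3.0):
    """Apply the |Z| > threshold rule."""
    return abs(z_score(x, mean, std)) > threshold

z = z_score(100_000, mean=50_000, std=5_000)
print(z, is_outlier(100_000, 50_000, 5_000))  # 10.0 True
```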
Q3. Data Transformation: Normalization and Standardization
Answer:
Normalization (Min-Max Scaling):
Scales data to a fixed range [0, 1]. Used when features have different ranges.
X_normalized = (X - X_min) / (X_max - X_min)
Example: Age values [18, 25, 30, 60] → Normalized to [0.0, 0.167, 0.286, 1.0]
Standardization (Z-Score Normalization):
Transforms data so that Mean = 0 and Standard Deviation = 1. Used when data follows normal distribution.
X_standardized = (X - µ) / σ
Example: If ages have mean=30, std=10, then age 40 → (40-30)/10 = 1.0
Aspect | Normalization | Standardization
Range | 0 to 1 | -∞ to +∞ (typically -3 to 3)
Formula | (X - min) / (max - min) | (X - mean) / std
Use Case | Neural Networks, Image data | SVM, Logistic Regression, PCA
Outlier Sensitivity | Yes (outliers affect min/max) | Less sensitive
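Both scaling formulas can be sketched in a few lines. This example assumes the population standard deviation (`pstdev`) for standardization; the function names are illustrative.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [18, 25, 30, 60]
normalized = [round(v, 3) for v in min_max_scale(ages)]
standardized = standardize(ages)

print(normalized)  # [0.0, 0.167, 0.286, 1.0]
print([round(v, 3) for v in standardized])
```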
Q4. Feature Engineering
Answer:
a) Feature Scaling:
Ensuring all features have similar scales to prevent any one feature from dominating (e.g., salary vs. age).
Methods: Normalization and Standardization (explained above).
b) Encoding Categorical Variables:
• Label Encoding: Assigns integer values to categories. Example: [Red, Blue, Green] → [0, 1, 2]. Risk:
Creates false ordinal relationship.
• One-Hot Encoding: Creates binary columns for each category. Example: Color → Color_Red,
Color_Blue, Color_Green. Best for nominal data.
• Ordinal Encoding: Assigns ordered integers to ordered categories. Example: [Low, Medium, High] →
[1, 2, 3]. Use when order matters.
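The three encodings can be illustrated with plain Python. This is a sketch under the assumption that label and one-hot codes follow the order in which categories first appear; the function names are hypothetical.

```python
def label_encode(values):
    """Assign integers in order of first appearance."""
    mapping = {c: i for i, c in enumerate(dict.fromkeys(values))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """One binary column per category, in order of first appearance."""
    categories = list(dict.fromkeys(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def ordinal_encode(values, order):
    """Assign ranks 1, 2, 3, ... following the given category order."""
    rank = {c: i for i, c in enumerate(order, start=1)}
    return [rank[v] for v in values]

colors = ["Red", "Blue", "Green", "Blue"]
labels, mapping = label_encode(colors)
one_hot, categories = one_hot_encode(colors)
levels = ordinal_encode(["Low", "High", "Medium"], order=["Low", "Medium", "High"])

print(labels)   # [0, 1, 2, 1]
print(one_hot)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
print(levels)   # [1, 3, 2]
```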
c) Feature Selection:
Selecting the most relevant features to reduce dimensionality and improve model performance.
• Filter Methods: Use statistical tests (correlation, chi-square) to select features independently of the
model
• Wrapper Methods: Use model performance to evaluate feature subsets (e.g., Recursive Feature
Elimination)
• Embedded Methods: Feature selection during model training (e.g., L1/Lasso Regularization)
d) Feature Extraction:
Creating new features from existing ones to capture more information (e.g., PCA for dimensionality reduction,
extracting 'day of week' from a date column).
UNIT 3: Model Evaluation & Performance Metrics
Q1. Model Evaluation Concepts
Answer:
Overfitting:
When a model learns the training data too well — including noise and irrelevant patterns — resulting in poor
performance on new data.
• Symptoms: Very high training accuracy, very low test accuracy
• Cause: Model too complex, insufficient training data, too many features
• Fix: Regularization, pruning (for Decision Trees), more training data, cross-validation
Underfitting:
When a model is too simple to capture the underlying patterns in training data, resulting in poor performance
on both training and test data.
• Symptoms: Low accuracy on both training and test sets
• Cause: Model too simple (high bias), insufficient training, too few features
• Fix: Increase model complexity, add more features, reduce regularization
Train-Test Split:
The dataset is split into two parts — typically 70-80% for training and 20-30% for testing. The model is trained
on training data and evaluated on unseen test data.
Common Split: 80% Train | 20% Test
Limitation: Results may vary depending on how data is split.
K-Fold Cross Validation:
The data is divided into K equal parts (folds). The model is trained K times, each time using K-1 folds for
training and 1 fold for testing. Final score = average of all K scores.
Final Score = (Score_1 + Score_2 + ... + Score_K) / K
Advantage: More reliable estimate — every data point is used for both training and testing. Common K
values: 5 or 10.
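The fold rotation can be sketched as an index generator. The per-fold scores below are hypothetical numbers used only to show the averaging step; no model is trained, and no shuffling is done.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)

# Hypothetical scores, one per fold, averaged as in the formula above.
scores = [0.80, 0.85, 0.90, 0.75, 0.70]
final_score = sum(scores) / len(scores)
print(round(final_score, 2))  # 0.8
```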
Q2. Classification Performance Metrics
Answer:
Confusion Matrix:
Actual \ Predicted | Predicted Positive | Predicted Negative
Actual Positive | TP (True Positive) | FN (False Negative)
Actual Negative | FP (False Positive) | TN (True Negative)
Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) [Of predicted positives, how many are correct?]
Recall = TP / (TP + FN) [Of actual positives, how many detected?]
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
ROC Curve (Receiver Operating Characteristic):
• Plots True Positive Rate (Recall) vs False Positive Rate at various classification thresholds
• AUC (Area Under Curve): 1.0 = perfect model, 0.5 = random model
• Used to compare models — higher AUC means better model
Q3. Regression Performance Metrics
Answer:
MAE = (1/n) * Σ|y_i - ŷ_i| [Mean Absolute Error]
MSE = (1/n) * Σ(y_i - ŷ_i)^2 [Mean Squared Error]
RMSE = sqrt(MSE) [Root Mean Squared Error]
R² = 1 - [Σ(y_i - ŷ_i)^2 / Σ(y_i - ȳ)^2] [Coefficient of Determination]
Where: y_i = actual value, ŷ_i = predicted value, ȳ = mean of the actual values
• R² = 1: Perfect model | R² = 0: Baseline model (predicts mean) | R² < 0: Worse than baseline
UNIT 4: Regression & Classification Algorithms
Q1. Simple Linear Regression
Answer:
Simple Linear Regression models the relationship between one independent variable (X) and one dependent
variable (Y) using a straight line.
Y = a + b*X where:
b (slope) = [n*ΣXY - ΣX*ΣY] / [n*ΣX² - (ΣX)²]
a (intercept) = (ΣY - b*ΣX) / n
Example: Predicting exam score (Y) based on study hours (X).
Multiple Linear Regression:
Models the relationship between multiple independent variables and one dependent variable.
Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn
Example: Predicting house price (Y) based on size (X1), location (X2), and age (X3).
Q2. Logistic Regression
Answer:
Logistic Regression is a supervised classification algorithm that predicts the probability that an input belongs
to a binary class (0 or 1).
Why not Linear Regression? Linear regression can predict values outside [0,1], which are invalid as
probabilities. Logistic regression uses the Sigmoid function.
P(Y=1) = 1 / (1 + e^(-z)) where z = b0 + b1*X1 + b2*X2 + ...
• Output range: (0, 1) — S-shaped curve
• If P >= 0.5 → Predict class 1 (positive)
• If P < 0.5 → Predict class 0 (negative)
Applications: Spam detection, disease diagnosis, credit scoring, customer churn prediction.
Q3. Decision Trees
Answer:
A Decision Tree is a tree-structured model where internal nodes represent feature tests, branches represent
outcomes, and leaf nodes represent final decisions/class labels.
ID3 Algorithm (uses Information Gain):
Entropy(S) = -Σ p_i * log2(p_i)
Information Gain(A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv)
• Select the attribute with the HIGHEST Information Gain as root/split node
C4.5 Algorithm (uses Gain Ratio):
Gain Ratio = Information Gain / Split Information
Split Info = -Σ (|Sv|/|S|) * log2(|Sv|/|S|)
CART Algorithm (uses Gini Index):
Gini(S) = 1 - Σ p_i²
• Select attribute with LOWEST Gini Index
Overfitting Prevention:
• Max Depth: Limit tree depth (e.g., max_depth=5)
• Min Samples Split: Minimum samples required to split a node
• Min Samples Leaf: Minimum samples at leaf nodes
Q4. Random Forest
Answer:
Random Forest is an ensemble method that builds multiple decision trees on random subsets of data
(Bootstrap Sampling) and combines predictions via majority voting (classification) or averaging (regression).
• Step 1: Create multiple random samples with replacement (Bootstrap)
• Step 2: Build a Decision Tree on each sample — at each split, only a random subset of features is
considered (Feature Randomness)
• Step 3: Combine predictions — majority vote for classification, mean for regression
• Advantages: Reduces overfitting, handles high-dimensional data, robust to noise
• Disadvantages: Slower training, requires more memory, less interpretable
Q5. K-Nearest Neighbors (KNN)
Answer:
KNN is a lazy learning algorithm — it stores all training data and makes predictions based on the K nearest
data points to the query point.
Algorithm Steps:
• 1. Choose K (number of neighbors)
• 2. Calculate Euclidean distance from query point to all training points
• 3. Sort distances in ascending order
• 4. Select K nearest neighbors
• 5. For Classification: majority vote; For Regression: average of K neighbors
Euclidean Distance = sqrt((x2-x1)^2 + (y2-y1)^2)
• Choosing K: Too small K → overfitting, Too large K → underfitting. Rule of thumb: K = sqrt(N)
• Feature Scaling is required for KNN since it is distance-based
Q6. Support Vector Machine (SVM)
Answer:
SVM finds the optimal hyperplane that maximally separates two classes. The hyperplane is positioned to
maximize the margin — the distance between the hyperplane and the nearest data points (support vectors).
Key Concepts:
• Hyperplane: Decision boundary separating classes (line in 2D, plane in 3D)
• Support Vectors: Data points closest to the hyperplane — these determine the hyperplane position
• Margin: Distance between hyperplane and nearest points — SVM maximizes this
• Hard Margin: Perfect separation, no misclassification (sensitive to outliers)
• Soft Margin: Allows some misclassification for better generalization
• Kernel Trick: Maps data to higher dimensions for non-linearly separable data (RBF, Polynomial
kernels)
UNIT 5: Clustering & Dimensionality Reduction
Q1. Clustering Techniques
Answer:
K-Means Clustering:
Partitions data into K clusters by minimizing within-cluster variance. Each data point belongs to the cluster
with the nearest centroid.
• 1. Initialize K centroids randomly
• 2. Assign each point to the nearest centroid using Euclidean distance
• 3. Recalculate centroids as the mean of all points in each cluster
• 4. Repeat steps 2-3 until centroids no longer change
• Disadvantages: Must specify K, sensitive to outliers, may converge to local optima
K-Medoids Clustering:
Similar to K-Means but uses actual data points (medoids) as cluster centers instead of means. More robust to
outliers than K-Means.
• Medoid: The data point within a cluster that minimizes the total distance to all other points in the cluster
• Algorithm: PAM (Partitioning Around Medoids)
Hierarchical Clustering:
Builds a hierarchy of clusters without specifying K in advance. Represented as a dendrogram.
• Agglomerative (Bottom-Up): Start with each point as its own cluster; merge closest clusters iteratively
• Divisive (Top-Down): Start with all points in one cluster; split iteratively
Linkage Methods:
• Single Linkage: Distance = minimum distance between any two points in clusters
• Complete Linkage: Distance = maximum distance between any two points in clusters
• Average Linkage: Distance = average of all pairwise distances
Q2. Dimensionality Reduction: PCA and LDA
Answer:
PCA (Principal Component Analysis):
PCA transforms high-dimensional data into fewer dimensions (Principal Components) while retaining
maximum variance. It finds new orthogonal axes in the direction of maximum variance.
• PC1: Direction of maximum variance in data
• PC2: Direction of second maximum variance (perpendicular to PC1)
Steps:
• 1. Standardize the data
• 2. Compute covariance matrix
• 3. Calculate eigenvectors and eigenvalues
• 4. Sort eigenvectors by eigenvalue (largest first)
• 5. Select top K eigenvectors as principal components
• 6. Transform data to new dimensions
• Use Case: Visualizing high-dimensional data, removing redundant features, image compression
LDA (Linear Discriminant Analysis):
LDA finds the linear combinations of features that best separate two or more classes. Unlike PCA
(unsupervised), LDA is supervised — it uses class labels.
• Goal: Maximize between-class variance and minimize within-class variance
• Produces at most C-1 components where C = number of classes
• Better for classification tasks than PCA
Aspect | PCA | LDA
Type | Unsupervised | Supervised
Uses Labels | No | Yes
Goal | Maximize variance | Maximize class separation
Use Case | Feature reduction, visualization | Classification, dimensionality reduction
UNIT 6: Ensemble Learning, Regularization & Bias-Variance Trade-off
Q1. Ensemble Learning: Bagging, Boosting, Stacking
Answer:
Bagging (Bootstrap Aggregating):
Multiple models are trained independently and in parallel on different random subsets of the training data
(with replacement). Predictions are combined by voting (classification) or averaging (regression).
• Example Algorithm: Random Forest
• Reduces: Variance (overfitting)
• Key Property: Models are trained in parallel and independently
Boosting:
Models are trained sequentially. Each new model focuses on correcting the errors of the previous model.
Data points misclassified by earlier models get higher weights.
• Example Algorithms: AdaBoost, Gradient Boosting, XGBoost
• Reduces: Bias (underfitting) more than variance
• Key Property: Sequential training — each model learns from the previous model's mistakes
Stacking:
Multiple base models (Level-0) are trained, and their predictions are used as features to train a meta-model
(Level-1) that makes the final prediction.
• Base models can be of different types (e.g., KNN + SVM + Decision Tree)
• Meta-model is typically Logistic Regression or another simple model
• Often achieves better accuracy than any individual base model
Q2. Hyperparameter Tuning: Grid Search and Random Search
Answer:
Grid Search:
Exhaustively tries every combination of specified hyperparameter values. Guaranteed to find the best
combination among the values listed in the grid, but computationally expensive.
Example: For KNN, grid search might try K = {1,3,5,7,9} × distance = {euclidean, manhattan} → 10 total
combinations tested.
Random Search:
Randomly samples hyperparameter combinations from the specified distributions. More efficient than grid
search — often finds a good solution faster by not exhaustively checking all combinations.
Example: Instead of testing all 100 combinations, randomly test 20 and pick the best.
• Grid Search: Best when few hyperparameters and small search space
• Random Search: Best when many hyperparameters and large search space
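The difference between the two strategies comes down to which combinations are visited. The sketch below only enumerates candidates for the KNN example; it assumes a fixed random seed and a sample size of 4, both chosen for illustration. Scoring each candidate would need a model and a validation procedure, which are left out.

```python
import itertools
import random

grid = {
    "k": [1, 3, 5, 7, 9],
    "distance": ["euclidean", "manhattan"],
}

# Grid search visits every combination: 5 values of k x 2 distances = 10.
all_combinations = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# Random search visits only a random subset of those combinations.
rng = random.Random(0)  # seed fixed only to make the run repeatable
sampled = rng.sample(all_combinations, 4)

print(len(all_combinations))  # 10
print(sampled)
```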
Q3. Regularization: Ridge (L2) and Lasso (L1)
Answer:
Regularization adds a penalty term to the loss function to prevent overfitting by discouraging large model
weights.
Normal Loss: L = Σ(y_i - ŷ_i)^2
Regularized Loss: L = Σ(y_i - ŷ_i)^2 + λ * Penalty
L1 Regularization (Lasso):
Loss = Σ(y - ŷ)^2 + λ * Σ|w_i|
• Penalty = sum of absolute values of weights
• Can make some weights exactly ZERO → Automatic Feature Selection
• Best when many features are irrelevant
L2 Regularization (Ridge):
Loss = Σ(y - ŷ)^2 + λ * Σw_i^2
• Penalty = sum of squared weights
• Shrinks weights towards zero but, unlike Lasso, generally does not make them exactly zero
• Best when all features are important; handles multicollinearity
Elastic Net:
Combination of L1 and L2 regularization. Best for datasets with many correlated features.
Loss = Σ(y - ŷ)^2 + λ1*Σ|w_i| + λ2*Σw_i^2
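The three penalized losses differ only in the penalty term, which a single function can show. The targets, predictions, weights, and λ values below are hypothetical numbers picked to keep the arithmetic easy to follow.

```python
def penalized_loss(y_true, y_pred, weights, l1=0.0, l2=0.0):
    """Squared-error loss plus optional L1 and L2 penalties on the weights."""
    squared_error = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    l1_penalty = l1 * sum(abs(w) for w in weights)
    l2_penalty = l2 * sum(w * w for w in weights)
    return squared_error + l1_penalty + l2_penalty

# Hypothetical values: squared error = (3-2)^2 + (5-6)^2 = 2
y_true, y_pred, weights = [3, 5], [2, 6], [2, -1]

plain = penalized_loss(y_true, y_pred, weights)
lasso = penalized_loss(y_true, y_pred, weights, l1=0.5)           # 2 + 0.5 * 3
ridge = penalized_loss(y_true, y_pred, weights, l2=0.5)           # 2 + 0.5 * 5
elastic = penalized_loss(y_true, y_pred, weights, l1=0.5, l2=0.5)

print(plain, lasso, ridge, elastic)  # 2.0 3.5 4.5 6.0
```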
Q4. Bias-Variance Trade-off
Answer:
The Bias-Variance Trade-off describes the balance between model simplicity (high bias) and model
complexity (high variance) to achieve optimal prediction performance.
Bias:
The error from incorrect assumptions in the learning algorithm. High bias means the model is too simple and
fails to capture the underlying pattern (Underfitting).
• High Bias: Low training accuracy AND low test accuracy
• Example: Using a straight line to fit data that is clearly curved
Variance:
The error from sensitivity to small fluctuations in training data. High variance means the model memorizes
training data including noise (Overfitting).
• High Variance: High training accuracy BUT low test accuracy
• Example: Very deep Decision Tree that perfectly fits training data but fails on new data
Trade-off:
Total Error = Bias^2 + Variance + Irreducible Noise. The goal is to find the sweet spot — a model that is
complex enough to learn patterns but not so complex that it memorizes noise.
Scenario | Bias | Variance | Result
Simple Model | High | Low | Underfitting
Complex Model | Low | High | Overfitting
Optimal Model | Low | Low | Best Performance
ASSIGNMENT 2 — NUMERICAL SOLUTIONS
Q1. Find-S Algorithm
Dataset:
Ex Sky AirTemp Humidity Wind Water Forecast PlaySport
1 Sunny Warm Normal Strong Warm Same YES
2 Sunny Warm High Strong Warm Same YES
3 Rainy Cold High Strong Warm Change NO
4 Sunny Warm High Strong Cool Change YES
Step-by-Step Solution:
• Initialize: h = <∅, ∅, ∅, ∅, ∅, ∅>
• Example 1 (YES):
→ h = <Sunny, Warm, Normal, Strong, Warm, Same> (First positive example, adopt directly)
• Example 2 (YES):
→ Humidity: Normal ≠ High → Generalize to '?'
→ h = <Sunny, Warm, ?, Strong, Warm, Same>
• Example 3 (NO): → SKIP (negative example)
• Example 4 (YES):
→ Water: Warm ≠ Cool → Generalize to '?'
→ Forecast: Same ≠ Change → Generalize to '?'
→ h = <Sunny, Warm, ?, Strong, ?, ?>
FINAL HYPOTHESIS: h = <Sunny, Warm, ?, Strong, ?, ?>
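The trace above can be checked with a small Find-S sketch. The string "0" stands in for ∅ and the function name `find_s` is an illustrative choice; both are assumptions of this example.

```python
EMPTY = "0"  # stand-in for the empty value ∅ (an assumption of this sketch)

# (attribute values, is_positive) for the four PlaySport examples
examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

def find_s(examples):
    """Return the most specific hypothesis covering all positive examples."""
    hypothesis = [EMPTY] * len(examples[0][0])
    for attributes, is_positive in examples:
        if not is_positive:
            continue  # Find-S ignores negative examples
        for i, value in enumerate(attributes):
            if hypothesis[i] == EMPTY:
                hypothesis[i] = value  # first positive example: adopt the value
            elif hypothesis[i] != value:
                hypothesis[i] = "?"    # mismatch: generalize
    return hypothesis

hypothesis = find_s(examples)
print(hypothesis)  # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```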
Q2. Candidate Elimination Algorithm
Dataset: Outlook, Temp, Humidity, Wind, EnjoySport
Ex Outlook Temp Humidity Wind EnjoySport
1 Sunny Warm Normal Strong YES
2 Sunny Warm High Strong YES
3 Rainy Cold High Strong NO
4 Sunny Warm High Weak YES
Initial State:
S0 = <∅, ∅, ∅, ∅>
G0 = <?, ?, ?, ?>
Example 1 (YES): Sunny, Warm, Normal, Strong
→ S1 = <Sunny, Warm, Normal, Strong> (first positive: S = example)
→ G1 = <?, ?, ?, ?> (G unchanged, does not reject positive)
Example 2 (YES): Sunny, Warm, High, Strong
→ Humidity differs (Normal vs High) → generalize S
→ S2 = <Sunny, Warm, ?, Strong>
→ G2 = <?, ?, ?, ?>
Example 3 (NO): Rainy, Cold, High, Strong
→ S3 = S2 = <Sunny, Warm, ?, Strong> (a negative example does not change S)
→ G must reject this negative example. Specialize G minimally:
→ G3 = { <Sunny,?,?,?>, <?,Warm,?,?>, <?,?,Normal,?> }
→ Remove any G hypotheses that do not cover positive examples:
→ <?,?,Normal,?> does NOT cover Ex2 (Humidity=High) → Remove it
→ G3 = { <Sunny,?,?,?>, <?,Warm,?,?> }
Example 4 (YES): Sunny, Warm, High, Weak
→ S: Wind Strong ≠ Weak → generalize
→ S4 = <Sunny, Warm, ?, ?>
→ G: Check which G hypotheses cover this example:
→ <Sunny,?,?,?> covers it (Sunny matches) ✓
→ <?,Warm,?,?> covers it (Warm matches) ✓
→ G4 = { <Sunny,?,?,?>, <?,Warm,?,?> } (no change)
FINAL SPECIFIC BOUNDARY (S): <Sunny, Warm, ?, ?>
FINAL GENERAL BOUNDARY (G): { <Sunny,?,?,?>, <?,Warm,?,?> }
Q3. Simple Linear Regression
Given: X = [1,2,3,4,5], Y = [2,4,5,4,5] Find: Y = a + bX and predict Y for X=6
X Y X² XY
1 2 1 2
2 4 4 8
3 5 9 15
4 4 16 16
5 5 25 25
ΣX=15 ΣY=20 ΣX²=55 ΣXY=66
Calculations (n=5):
b = [n*ΣXY - ΣX*ΣY] / [n*ΣX² - (ΣX)²]
b = [5*66 - 15*20] / [5*55 - 15²]
b = [330 - 300] / [275 - 225]
b = 30 / 50 = 0.6
a = (ΣY - b*ΣX) / n
a = (20 - 0.6*15) / 5
a = (20 - 9) / 5 = 11/5 = 2.2
Regression Equation: Y = 2.2 + 0.6X
For X=6: Y = 2.2 + 0.6*6 = 2.2 + 3.6 = 5.8
Predicted Y for X=6: Y = 5.8
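The same sums and formulas can be computed in code as a cross-check. The function name `fit_line` is illustrative; the sketch follows the slope and intercept formulas used in the calculation above.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
prediction = a + b * 6

print(round(a, 2), round(b, 2), round(prediction, 2))  # 2.2 0.6 5.8
```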
Q4. Multiple Linear Regression
Given: X1=[1,2,3,4], X2=[2,1,4,3], Y=[5,6,10,12] Find: Y = b0 + b1*X1 + b2*X2
X1 X2 Y X1² X2² X1X2 X1Y X2Y
1 2 5 1 4 2 5 10
2 1 6 4 1 2 12 6
3 4 10 9 16 12 30 40
4 3 12 16 9 12 48 36
Σ=10 Σ=10 Σ=33 Σ=30 Σ=30 Σ=28 Σ=95 Σ=92
Solution using Normal Equations (n=4):
n=4, ΣX1=10, ΣX2=10, ΣY=33
ΣX1²=30, ΣX2²=30, ΣX1X2=28, ΣX1Y=95, ΣX2Y=92
Setting up the 3 normal equations:
Eq1: ΣY = n*b0 + b1*ΣX1 + b2*ΣX2
33 = 4*b0 + 10*b1 + 10*b2 ... (i)
Eq2: ΣX1Y = b0*ΣX1 + b1*ΣX1² + b2*ΣX1X2
95 = 10*b0 + 30*b1 + 28*b2 ... (ii)
Eq3: ΣX2Y = b0*ΣX2 + b1*ΣX1X2 + b2*ΣX2²
92 = 10*b0 + 28*b1 + 30*b2 ... (iii)
Solving simultaneously:
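From (ii) - (iii): 3 = 2*b1 - 2*b2 → b1 - b2 = 1.5 ... (iv)
From (i): b0 = (33 - 10*b1 - 10*b2)/4
Substituting in (ii): 95 = 82.5 + 5*b1 + 3*b2 → 5*b1 + 3*b2 = 12.5 ... (v)
From (iv) and (v): 8*b2 = 5 → b2 = 0.625, b1 = 2.125, b0 = (33 - 27.5)/4 = 1.375
Check in (iii): 10*1.375 + 28*2.125 + 30*0.625 = 13.75 + 59.5 + 18.75 = 92 ✓
Y = 1.375 + 2.125*X1 + 0.625*X2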
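Hand elimination on three equations is easy to get wrong, so it can help to solve the normal equations (i) to (iii) mechanically. This sketch builds them from the data and solves them with exact fractions. The helper names `dot` and `solve_linear_system` are illustrative, and the solver assumes the system has a unique solution.

```python
from fractions import Fraction

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with exact fractions (assumes a unique solution)."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [v / pivot_value for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

x1, x2, y = [1, 2, 3, 4], [2, 1, 4, 3], [5, 6, 10, 12]
n = len(y)

# Coefficient matrix and right-hand side of equations (i), (ii), (iii)
normal_matrix = [
    [n, sum(x1), sum(x2)],
    [sum(x1), dot(x1, x1), dot(x1, x2)],
    [sum(x2), dot(x1, x2), dot(x2, x2)],
]
normal_rhs = [sum(y), dot(x1, y), dot(x2, y)]

b0, b1, b2 = solve_linear_system(normal_matrix, normal_rhs)
print(b0, b1, b2)
print(float(b0), float(b1), float(b2))
```

The printed coefficients can be verified by substituting them back into each of the three equations.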
Q5. Logistic Regression
Given: P(Y=1) = 1 / (1 + e^-(2 + 0.5X)) Find probability for X=2 and class label.
Solution:
z = 2 + 0.5*X = 2 + 0.5*2 = 2 + 1 = 3
P(Y=1) = 1 / (1 + e^-3)
= 1 / (1 + 0.0498)
= 1 / 1.0498
= 0.9526
P(Y=1) for X=2 = 0.9526 (95.26%)
Since P = 0.9526 ≥ 0.5 → Class Label = 1 (Positive Class)
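The calculation above maps directly onto a sigmoid function plus a threshold. In this sketch the coefficients 2 and 0.5 come from the question; the function and parameter names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict(x, b0=2.0, b1=0.5, threshold=0.5):
    """Return (probability of class 1, predicted class label)."""
    probability = sigmoid(b0 + b1 * x)
    return probability, int(probability >= threshold)

probability, label = predict(2)
print(round(probability, 4), label)  # 0.9526 1
```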
Q6. Decision Tree — ID3 Algorithm
Dataset: Outlook, Temp, Humidity, Wind, Play (4 examples — 2 Yes, 2 No)
Outlook Temp Humidity Wind Play
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Step 1: Entropy of entire dataset
Total: 4 examples → 2 Yes (p=0.5), 2 No (p=0.5)
Entropy(S) = -p(Yes)*log2(p(Yes)) - p(No)*log2(p(No))
= -(0.5)*log2(0.5) - (0.5)*log2(0.5)
= -(0.5)*(-1) - (0.5)*(-1) = 0.5 + 0.5 = 1.0
Step 2: Information Gain for each attribute
Outlook (Sunny=2, Overcast=1, Rain=1):
Sunny: 2 examples → 0 Yes, 2 No → Entropy = 0
Overcast: 1 example → 1 Yes, 0 No → Entropy = 0
Rain: 1 example → 1 Yes, 0 No → Entropy = 0
IG(Outlook) = 1.0 - (2/4)*0 - (1/4)*0 - (1/4)*0 = 1.0
Wind (Weak=3, Strong=1):
Weak: 3 examples → 2 Yes, 1 No → Entropy = -(2/3)log2(2/3) - (1/3)log2(1/3)
≈ 0.918
Strong: 1 example → 0 Yes, 1 No → Entropy = 0
IG(Wind) = 1.0 - (3/4)*0.918 - (1/4)*0 = 1.0 - 0.689 = 0.311
Result: Outlook has IG = 1.0 (highest) → Outlook is the ROOT NODE
Overcast branch → always Yes (pure leaf). Sunny → No. Rain → Yes. Tree is fully determined.
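Entropy and information gain for all four attributes can be computed with a short sketch, including Temp and Humidity, which the worked solution does not evaluate. The function names are illustrative.

```python
import math
from collections import Counter

rows = [
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Overcast", "Temp": "Hot", "Humidity": "High", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Temp": "Mild", "Humidity": "High", "Wind": "Weak", "Play": "Yes"},
]

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target="Play"):
    """Entropy of the full set minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

gains = {a: information_gain(rows, a) for a in ["Outlook", "Temp", "Humidity", "Wind"]}
root = max(gains, key=gains.get)

for attribute, gain in gains.items():
    print(attribute, round(gain, 3))
print("Root:", root)  # Root: Outlook
```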
Q7. KNN Classification (K=3)
Classify point P(3,2) using K=3. Training data: A(1,2)=Red, B(2,3)=Red, C(3,3)=Blue, D(6,5)=Blue
Step 1: Calculate Euclidean Distance from P(3,2) to each point:
Point | Coordinates | Distance Formula | Distance | Class
A | (1,2) | √((3-1)² + (2-2)²) = √(4+0) | 2.00 | Red
B | (2,3) | √((3-2)² + (2-3)²) = √(1+1) | 1.41 | Red
C | (3,3) | √((3-3)² + (2-3)²) = √(0+1) | 1.00 | Blue
D | (6,5) | √((3-6)² + (2-5)²) = √(9+9) | 4.24 | Blue
Step 2: Sort by distance and select K=3 nearest neighbors:
• 1st: C (distance=1.00) → Blue
• 2nd: B (distance=1.41) → Red
• 3rd: A (distance=2.00) → Red
Step 3: Majority Voting:
Red = 2 votes | Blue = 1 vote
PREDICTION: Point P(3,2) → Class = RED
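The three steps (distance, sort, vote) fit in a few lines. This sketch uses `math.dist` from the standard library (Python 3.8 or later); the function name `knn_classify` is illustrative.

```python
import math
from collections import Counter

training = [((1, 2), "Red"), ((2, 3), "Red"), ((3, 3), "Blue"), ((6, 5), "Blue")]

def knn_classify(query, training, k):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

prediction = knn_classify((3, 2), training, k=3)
print(prediction)  # Red
```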
Q8. K-Means Clustering — One Iteration
Points: P1(1,1), P2(2,1), P3(4,3), P4(5,4) Initial Centroids: C1=(1,1), C2=(5,4)
Step 1: Assign each point to nearest centroid
Point | Coords | Dist to C1(1,1) | Dist to C2(5,4) | Assigned Cluster
P1 | (1,1) | 0.00 | √(16+9)=5.00 | C1 (closer)
P2 | (2,1) | √(1+0)=1.00 | √(9+9)=4.24 | C1 (closer)
P3 | (4,3) | √(9+4)=3.61 | √(1+1)=1.41 | C2 (closer)
P4 | (5,4) | √(16+9)=5.00 | 0.00 | C2 (closer)
Step 2: Recalculate Centroids
Cluster 1: {P1(1,1), P2(2,1)}
New C1 = ((1+2)/2, (1+1)/2) = (1.5, 1.0)
Cluster 2: {P3(4,3), P4(5,4)}
New C2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
After 1 Iteration:
Cluster 1: {P1(1,1), P2(2,1)} → New Centroid = (1.5, 1.0)
Cluster 2: {P3(4,3), P4(5,4)} → New Centroid = (4.5, 3.5)
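One assignment-and-update pass can be sketched as follows. The function name `kmeans_step` is illustrative, and the sketch assumes every centroid receives at least one point (an empty cluster would cause a division by zero).

```python
import math

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids = [(1, 1), (5, 4)]

def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    # New centroid = coordinate-wise mean of the points in the cluster.
    new_centroids = [
        tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
        for cluster in clusters
    ]
    return clusters, new_centroids

clusters, new_centroids = kmeans_step(points, centroids)
print(clusters)       # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
print(new_centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```

Calling `kmeans_step` again with `new_centroids` would perform the next iteration.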
Q9. K-Medoids Clustering
Points: A(2,6), B(3,4), C(3,8), D(4,7), E(6,2), F(7,3) Initial Medoids: M1=(2,6)=A, M2=(6,2)=E
Step 1: Calculate distances from all points to each medoid
Distance formula (Euclidean): d = sqrt((x2-x1)^2 + (y2-y1)^2)
Point | Dist to M1=(2,6) | Dist to M2=(6,2) | Assigned Cluster
A(2,6) | 0.00 | √(16+16)=5.66 | M1
B(3,4) | √(1+4)=2.24 | √(9+4)=3.61 | M1
C(3,8) | √(1+4)=2.24 | √(9+36)=6.71 | M1
D(4,7) | √(4+1)=2.24 | √(4+25)=5.39 | M1
E(6,2) | √(16+16)=5.66 | 0.00 | M2
F(7,3) | √(25+9)=5.83 | √(1+1)=1.41 | M2
Cluster 1 (Medoid A): {A, B, C, D}
Cluster 2 (Medoid E): {E, F}
In K-Medoids, medoids remain as actual data points (unlike K-Means centroids). The algorithm would then
check if swapping medoids with other cluster members reduces total cost.
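The assignment step and the total cost used in the swap check can be sketched together. The function name is illustrative, and the single swap tried at the end (B in place of A) is just one hypothetical candidate, not the full PAM search.

```python
import math

points = {"A": (2, 6), "B": (3, 4), "C": (3, 8), "D": (4, 7), "E": (6, 2), "F": (7, 3)}

def assign_to_medoids(points, medoids):
    """Assign each point to its nearest medoid; return clusters and total cost."""
    clusters = {m: [] for m in medoids}
    total_cost = 0.0
    for name, coords in points.items():
        nearest = min(medoids, key=lambda m: math.dist(coords, points[m]))
        clusters[nearest].append(name)
        total_cost += math.dist(coords, points[nearest])
    return clusters, total_cost

clusters, cost = assign_to_medoids(points, ["A", "E"])
print(clusters)        # {'A': ['A', 'B', 'C', 'D'], 'E': ['E', 'F']}
print(round(cost, 2))  # 8.12

# One candidate swap: replace medoid A with B and compare total costs.
_, swap_cost = assign_to_medoids(points, ["B", "E"])
print(round(swap_cost, 2), "keep A" if swap_cost >= cost else "swap to B")
```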
Q10. Hierarchical Clustering — Agglomerative (Single Linkage)
Given Distance Matrix:
A B C
A 0 2 6
B 2 0 4
C 6 4 0
Step-by-Step Agglomerative Clustering:
• Start: Each point is its own cluster: {A}, {B}, {C}
Iteration 1: Find minimum distance:
d(A,B) = 2 ← MINIMUM
d(A,C) = 6
d(B,C) = 4
→ Merge A and B into cluster {A,B}
Iteration 2: Update distances using Single Linkage (minimum):
d({A,B}, C) = min(d(A,C), d(B,C)) = min(6, 4) = 4
→ Only two clusters remain, {A,B} and {C}: merge them at distance 4 into {A,B,C}
→ All points are now in one cluster, so the algorithm stops after two merges
Dendrogram Structure:
Step 1: A ─── B (merge at distance 2)
Step 2: {A,B} ─── C (merge at distance 4)
Final Cluster: {A, B, C} | Merging Order: A+B at d=2, then {A,B}+C at d=4
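The merge order can be reproduced with a small single-linkage sketch driven by the distance matrix. The function names are illustrative, and the brute-force pair search is only meant for tiny inputs like this one.

```python
# Upper triangle of the distance matrix above
distances = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 4}

def point_distance(p, q):
    return distances[(p, q)] if (p, q) in distances else distances[(q, p)]

def single_linkage(cluster_a, cluster_b):
    """Cluster distance = minimum distance over all cross-cluster point pairs."""
    return min(point_distance(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        pairs = [
            (single_linkage(a, b), a, b)
            for i, a in enumerate(clusters)
            for b in clusters[i + 1:]
        ]
        d, a, b = min(pairs, key=lambda item: item[0])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), d))
    return merges

merges = agglomerate(["A", "B", "C"])
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```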
Q11. Confusion Matrix — Performance Metrics
Given Confusion Matrix:
Actual \ Predicted | Predicted Positive | Predicted Negative
Actual Positive | TP = 70 | FN = 18
Actual Negative | FP = 9 | TN = 45
TP=70, FN=18, FP=9, TN=45 → Total = 142
Accuracy = (TP + TN) / Total
= (70 + 45) / 142 = 115 / 142 = 0.8099 ≈ 80.99%
Precision = TP / (TP + FP)
= 70 / (70 + 9) = 70 / 79 = 0.8861 ≈ 88.61%
Recall = TP / (TP + FN)
= 70 / (70 + 18) = 70 / 88 = 0.7955 ≈ 79.55%
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
= 2 * (0.8861 * 0.7955) / (0.8861 + 0.7955)
= 2 * 0.7048 / 1.6816
= 1.4096 / 1.6816 = 0.8383 ≈ 83.83%
Metric Value
Accuracy 80.99%
Precision 88.61%
Recall 79.55%
F1-Score 83.83%
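The four metrics can be recomputed from the confusion-matrix counts as a check on the arithmetic. The function name and the dictionary keys are illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=70, fn=18, fp=9, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```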
Q12 & Q13. MSE and RMSE
Actual: [0.3, -0.5, 2.3, 7.2] Predicted: [2.5, 0.0, 2.8, 8.0]
i | Actual (y) | Predicted (ŷ) | Error (y - ŷ) | Squared Error (y - ŷ)²
1 | 0.3 | 2.5 | 0.3 - 2.5 = -2.2 | (-2.2)² = 4.84
2 | -0.5 | 0.0 | -0.5 - 0.0 = -0.5 | (-0.5)² = 0.25
3 | 2.3 | 2.8 | 2.3 - 2.8 = -0.5 | (-0.5)² = 0.25
4 | 7.2 | 8.0 | 7.2 - 8.0 = -0.8 | (-0.8)² = 0.64
Total | | | | Σ = 5.98
Q12. Mean Squared Error (MSE):
MSE = (1/n) * Σ(y_i - ŷ_i)²
= (1/4) * (4.84 + 0.25 + 0.25 + 0.64)
= (1/4) * 5.98
= 1.495
MSE = 1.495
Q13. Root Mean Squared Error (RMSE):
RMSE = sqrt(MSE)
= sqrt(1.495)
= 1.2227
RMSE ≈ 1.2227
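The error metrics for this data can be computed directly. The function names are illustrative; MAE is included because it is one of the regression metrics defined in Unit 3.

```python
import math

actual = [0.3, -0.5, 2.3, 7.2]
predicted = [2.5, 0.0, 2.8, 8.0]

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)

mae_value = mae(actual, predicted)
mse_value = mse(actual, predicted)
rmse_value = math.sqrt(mse_value)

print(round(mae_value, 4), round(mse_value, 4), round(rmse_value, 4))
```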
EXAMPLE QUESTIONS: QUICK REFERENCE ANSWERS
Q1. What is Machine Learning?
Answer:
Machine Learning (ML) is a subset of AI that enables systems to automatically learn and improve from
experience (data) without being explicitly programmed. It focuses on building algorithms that can access data
and use it to learn for themselves. Types: Supervised, Unsupervised, Reinforcement Learning.
Q2. Explain Supervised Learning and its algorithms.
Answer:
Supervised Learning trains models on labeled data (input-output pairs). Algorithms: Linear Regression
(predict continuous values), Logistic Regression (binary classification), Decision Trees (tree-based decisions
using entropy/gini), Random Forest (ensemble of trees), KNN (distance-based), SVM (optimal hyperplane).
Q3. Explain Unsupervised Learning and its algorithms.
Answer:
Unsupervised Learning finds patterns in unlabeled data. Types: Clustering (K-Means — assigns points to K
centroids; K-Medoids — uses actual data points; Hierarchical — dendrogram-based tree) and Dimensionality
Reduction (PCA — finds principal components of maximum variance; LDA — maximizes class separation).
Q4. What is Reinforcement Learning?
Answer:
Reinforcement Learning is a type of ML where an agent learns by interacting with an environment. It receives
positive rewards for good actions and penalties for bad ones. Over time, the agent learns a policy to
maximize cumulative reward. Example: Game-playing AI (AlphaGo), robotic navigation, self-driving cars.
Q5. Explain data preprocessing techniques.
Answer:
Data Preprocessing cleans and transforms raw data: (1) Handling Missing Values — Mean/Median/Mode
imputation or KNN Imputation; (2) Handling Outliers — Z-score (remove if |Z|>3) or IQR method; (3) Noise
Removal — smoothing; (4) Data Transformation — Normalization (0-1) or Standardization (mean=0, std=1);
(5) Removing Duplicates.
Q6. Types of categorical data.
Answer:
Categorical data comes in two types: (1) Nominal — categories with no order (e.g., Color: Red/Blue/Green,
Gender: Male/Female) — use One-Hot Encoding; (2) Ordinal — categories with a meaningful order (e.g.,
Rating: Low/Medium/High, Education: School/UG/PG) — use Label or Ordinal Encoding.
Q7. What is Label Encoding, One-Hot Encoding, Ordinal Encoding?
Answer:
Label Encoding: Assigns integers to categories (Red=0, Blue=1, Green=2) — simple but creates false ordinal
relationship. One-Hot Encoding: Creates binary columns per category (Red→[1,0,0], Blue→[0,1,0]) — avoids
false ordering, increases dimensionality. Ordinal Encoding: Assigns ordered integers based on category rank
(Low=1, Medium=2, High=3) — preserves order, suitable for ordinal data.
Q8. What is model evaluation?
Answer:
Model Evaluation assesses how well a trained model generalizes to unseen data. Key concepts: Train-Test
Split (divide data 80/20), K-Fold Cross Validation (rotate K folds for robust estimation), Overfitting (high train,
low test accuracy — model memorizes), Underfitting (low accuracy everywhere — model too simple).
Q9. Performance metrics for classification and regression.
Answer:
Classification: Accuracy=(TP+TN)/Total, Precision=TP/(TP+FP), Recall=TP/(TP+FN),
F1-Score=2*P*R/(P+R), Confusion Matrix (shows TP/FP/FN/TN), ROC Curve/AUC. Regression:
MAE=(1/n)*Σ|y-ŷ|, MSE=(1/n)*Σ(y-ŷ)², RMSE=√MSE, R²=1-SSres/SStot (measures explained variance).
Q10. What is linear regression and its numerical?
Answer:
Linear Regression models the linear relationship between input (X) and output (Y). Simple LR: Y = a + bX
where b=[nΣXY-ΣXΣY]/[nΣX²-(ΣX)²], a=(ΣY-bΣX)/n. Multiple LR: Y=b0+b1X1+b2X2+...+bnXn. Example: For
X=[1,2,3,4,5], Y=[2,4,5,4,5]: b=0.6, a=2.2 → Y=2.2+0.6X → Predict X=6: Y=5.8.
Q11. What is classification and regression with algorithms?
Answer:
Classification predicts discrete class labels (e.g., Spam/Not Spam, Yes/No). Algorithms: Logistic Regression,
Decision Tree (ID3/C4.5/CART), Random Forest, KNN, SVM. Regression predicts continuous values (e.g.,
price, temperature). Algorithms: Linear Regression, Polynomial Regression, Ridge/Lasso, Decision Tree
Regressor, Random Forest Regressor.