ML 5 Marks Q/A
1. Explain Regression line, Scatter plot, Error in prediction and Best fitting line.
— (Q1 b, Dec 2022; Q1 B, May 2023; Q1 A, May 2024) — Repeated 3 times |
Prob 90%
Answer:
1. Regression Line:
A regression line is a statistical tool used in supervised learning (specifically in
regression analysis) to model the relationship between a dependent variable (Y)
and one or more independent variables (X).
• In Simple Linear Regression, the regression line is represented as:
𝑌 = 𝑎 + 𝑏𝑋
where
a = intercept,
b = slope (regression coefficient).
• It shows how the dependent variable changes with respect to the independent
variable.
2. Scatter Plot:
A scatter plot is a graphical representation of data points plotted on a two-
dimensional plane, where:
• The X-axis represents the independent variable, and
• The Y-axis represents the dependent variable.
It helps visualize the relationship or correlation between variables, showing whether
the correlation is positive, negative, or absent.
A linear pattern in the scatter plot indicates that a regression line can fit the
data well.
3. Error in Prediction:
The error (or residual) in prediction is the difference between the actual value (Yi)
and the predicted value (Ŷi) from the regression line.
Error (Ei) = 𝑌𝑖 − 𝑌̂𝑖
These errors show how far the predicted values deviate from the actual data points.
The goal of regression is to minimize the sum of squared errors (SSE) for better
accuracy.
4. Best Fitting Line:
The best fitting line (or line of best fit) is the line that minimizes the total
prediction error across all data points.
• It is found using the Least Squares Method, which minimizes:
$ \sum_{i} (Y_i - \hat{Y}_i)^2 $
• This line represents the trend in the data and, among all straight lines, gives the
smallest total squared error on the observed points, which makes it a reasonable
basis for predicting unknown inputs.
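As a minimal sketch of the Least Squares Method above (plain Python; the data for hours studied vs. marks is hypothetical, and the names fit_line and sse are illustrative, not from any library), the intercept a and slope b can be computed from the standard closed-form formulas:

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y)
    a = mean_y - b * mean_x
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared errors between actual Y and predicted a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: hours studied (X) vs. marks scored (Y).
hours = [1, 2, 3, 4, 5]
marks = [2, 4, 5, 4, 5]
a, b = fit_line(hours, marks)
print(f"Y = {a:.2f} + {b:.2f} X, SSE = {sse(hours, marks, a, b):.2f}")
```

Any other choice of a and b gives a larger SSE on the same points, which is what "best fitting" means here.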
2. Explain the concept of feature selection and extraction. — (Q1 C, May 2023)
— Repeated 1 time | Prob 80%
1. Introduction:
In Machine Learning, features are the measurable properties or characteristics used as
input for training a model.
The performance of a model heavily depends on the quality and relevance of the
features.
To improve accuracy and efficiency and to reduce overfitting, Feature Selection and
Feature Extraction are used.
2. Feature Selection:
Feature Selection is the process of choosing the most relevant and significant
features from the original dataset while removing redundant or irrelevant ones.
It helps in simplifying the model without losing important information.
Techniques used:
• Filter Methods: Use statistical tests like Chi-square, ANOVA, or correlation.
• Wrapper Methods: Use predictive models to evaluate subsets (e.g., Forward
Selection, Backward Elimination).
• Embedded Methods: Perform selection during model training (e.g., LASSO,
Decision Tree importance).
Advantages:
• Reduces overfitting
• Decreases training time
• Improves model interpretability
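A minimal sketch of a filter-style selection step, assuming a simple variance threshold as the statistical criterion (the dataset, the threshold value, and the helper name select_by_variance are hypothetical). It keeps only the original columns whose variance exceeds the threshold:

```python
from statistics import pvariance


def select_by_variance(rows, threshold):
    """Return indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose: rows -> columns
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


# Hypothetical dataset: 4 samples, 3 features; feature 1 is nearly constant.
data = [
    [1.0, 5.0, 10.0],
    [2.0, 5.0, 20.0],
    [3.0, 5.1, 30.0],
    [4.0, 5.0, 40.0],
]
kept = select_by_variance(data, threshold=0.01)
reduced = [[row[j] for j in kept] for row in data]
print("kept feature indices:", kept)
```

The surviving columns are unchanged original features, which is what separates selection from extraction.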
3. Feature Extraction:
Feature Extraction is the process of transforming existing features into a new set of
features that retain most of the important information from the original data.
It reduces the dimensionality of the data.
Techniques used:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Independent Component Analysis (ICA)
Example:
In image recognition, raw pixel data is transformed into features like edges, textures,
or color histograms.
4. Difference between Feature Selection and Extraction:
Aspect | Feature Selection | Feature Extraction
Definition | Chooses a subset of original features | Creates new features from original ones
Dimensionality | Reduced by elimination | Reduced by transformation
Example | Removing low-variance columns | PCA generating principal components
Conclusion:
Both feature selection and extraction are dimensionality reduction techniques that
improve model performance, reduce computation, and help achieve better
generalization.
3. Explain Linear Discriminant Analysis. — (Q1 B, Dec 2023; Q6 B, May 2024)
— Repeated 2 times | Prob 85%
1. Introduction:
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction
and classification technique used in Machine Learning and Statistics.
It projects high-dimensional data onto a lower-dimensional space such that the class
separability is maximized.
It is mainly used when the output variable is categorical (e.g., class labels).
2. Objective of LDA:
The main goal of LDA is to:
• Maximize the distance between different class means (inter-class variance),
and
• Minimize the spread within each class (intra-class variance).
This ensures that after transformation, the classes are as distinct and well-separated
as possible.
3. Working of LDA (Steps):
1. Compute the Mean Vectors:
Calculate the mean vector for each class in the dataset.
2. Compute the Scatter Matrices:
◦ Within-class scatter matrix (Sw): Measures how much the samples in
the same class vary.
◦ Between-class scatter matrix (Sb): Measures how far apart the class
means are.
3. Compute the Eigenvectors and Eigenvalues:
Solve the generalized eigenvalue problem for $S_w^{-1} S_b$.
4. Select Linear Discriminants:
Choose eigenvectors with the largest eigenvalues — these form the new
feature subspace.
5. Project Data:
Project the original data points onto this lower-dimensional space for
classification.
4. Mathematical Representation:
If there are two classes, LDA finds a projection vector w such that:
$ w = S_w^{-1} (m_1 - m_2) $
Where,
• $S_w$ = within-class scatter matrix
• $m_1, m_2$ = mean vectors of class 1 and class 2
The new feature is then obtained as:
$ y = w^T x $
where x is the original feature vector.
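The two-class formula can be sketched in plain Python for 2-D data. This is a minimal illustration under stated assumptions: the two small point sets are hypothetical, the helper names are illustrative, and the 2×2 inverse is written out by hand instead of using a linear algebra library.

```python
def mean_vector(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, m):
    """2x2 scatter matrix: sum of (x - m)(x - m)^T over one class."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def lda_direction(class1, class2):
    """Return w = Sw^{-1} (m1 - m2) for two classes of 2-D points."""
    m1, m2 = mean_vector(class1), mean_vector(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # assumes Sw is invertible
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(w, x):
    """New 1-D feature y = w^T x."""
    return w[0] * x[0] + w[1] * x[1]


# Hypothetical, well-separated classes.
class1 = [(1, 1), (2, 1), (1, 2), (2, 2)]
class2 = [(5, 4), (6, 4), (5, 5), (6, 5)]
w = lda_direction(class1, class2)
proj1 = [project(w, x) for x in class1]
proj2 = [project(w, x) for x in class2]
print("w =", w)
```

On this data the projected values of the two classes do not overlap, so a single threshold on y separates them.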
5. Applications:
• Face recognition
• Medical diagnosis (classifying diseases)
• Text classification and document analysis
• Marketing (classifying customers into predefined segments)
6. Diagram (recommended for full marks):
A labeled diagram showing two class distributions projected onto a line, with
maximum separation between class means and minimal within-class variance.
4. Differentiate between Logistic regression and Support vector machine. — (Q1
D, Dec 2023) — Repeated 1 time | Prob 85%
1. Introduction:
Both Logistic Regression (LR) and Support Vector Machine (SVM) are supervised
learning algorithms used mainly for classification problems.
However, they differ in how they determine the decision boundary and handle data
separation.
2. Comparison Table:
Aspect | Logistic Regression (LR) | Support Vector Machine (SVM)
1. Type of Model | Probabilistic model (estimates probability of class membership). | Non-probabilistic model (focuses on maximum margin classification).
2. Decision Boundary | Linear boundary where the sigmoid (logistic) output equals the threshold (usually 0.5); predicted probability changes smoothly across it. | Uses a hyperplane that maximizes the margin between classes.
3. Objective Function | Minimizes log-loss (cross-entropy). | Maximizes margin and minimizes hinge loss.
4. Output | Produces probabilities (e.g., 0.8 for one class vs 0.2 for the other). | Produces hard class labels (+1 or −1).
5. Handling Non-linearity | Non-linear classification possible only after applying feature transformations (e.g., polynomial terms). | Naturally handles non-linear data using kernel functions (e.g., RBF, polynomial kernel).
6. Regularization | Uses L1/L2 regularization to prevent overfitting. | Uses a regularization parameter (C) to control the trade-off between margin width and misclassification.
7. Computational Cost | Computationally simpler and faster. | Computationally heavier for large datasets.
8. Interpretation | Easy to interpret coefficients; good for feature importance. | Less interpretable; focuses more on boundary optimization.
3. Diagram (recommended for full marks):
Draw two subplots:
• (a) Logistic Regression → linear decision boundary, plus the smooth S-shaped
sigmoid curve that maps distance from the boundary to a probability.
• (b) SVM → clear separating hyperplane with support vectors marked.
4. Example Use-Cases:
• Logistic Regression: Email spam detection, disease prediction.
• SVM: Image classification, handwriting recognition, face detection.
Summary:
• Logistic Regression predicts probabilities using a logistic function.
• SVM finds the maximum margin hyperplane to separate classes efficiently.
Both can perform classification, but SVM with a suitable kernel often handles complex,
non-linear, and high-dimensional data better.
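The difference in objective functions can be made concrete with a short sketch of the two per-sample losses. The function names are illustrative; note the assumed label conventions, which differ: log-loss here takes y in {0, 1} and a probability, while hinge loss takes y in {−1, +1} and a raw score w·x + b.

```python
import math


def log_loss(y, p):
    """Cross-entropy for one sample; y in {0, 1}, p = predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def hinge_loss(y, score):
    """Hinge loss for one sample; y in {-1, +1}, score = w.x + b."""
    return max(0.0, 1.0 - y * score)


# Logistic Regression: loss shrinks as the probability of the true class grows.
print(log_loss(1, 0.6), log_loss(1, 0.99))
# SVM: loss is exactly zero once a point is beyond the margin (y * score >= 1).
print(hinge_loss(1, 0.5), hinge_loss(1, 2.0))
```

Log-loss never reaches zero for any finite score, whereas hinge loss ignores points that are already correctly classified beyond the margin; this is one way to see why only the support vectors shape the SVM boundary.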
5. Explain any five business applications of Machine learning. — (Q1 A, May
2024) — Repeated 1 time | Prob 75%
• Healthcare: Helps doctors diagnose diseases from images (like X-rays),
predict risks, discover new drugs faster, and automate patient support through
chatbots.
• Recommendation Systems: Powers the suggestions you see on Netflix,
Amazon, and Spotify by analysing your behaviour and predicting what you
might like next.
• Self-driving Cars: Allows cars to recognize objects, follow traffic rules, and
make decisions in real time using camera and sensor data.
• Voice Assistants: Siri, Alexa, and Google Assistant use machine learning to
understand questions, recognize speech, and give useful answers.
• Fraud Detection: Banks use machine learning to spot unusual patterns in
transactions, helping prevent credit card fraud.
• Social Media: Suggests friends, filters spam, recommends content, and even
detects harmful posts or fake news.
• Finance: Predicts stock prices, automates trading, manages risk, and provides
personalized financial advice.
• Image Recognition: Used in photo apps, security cameras, and scientific
research to identify faces, objects, or unusual activities.
• Traffic Prediction: Apps like Google Maps use machine learning to predict
traffic and suggest faster routes by learning from millions of users’ travel data.
These applications show how machine learning is already making everyday
technology smarter and more useful all around us.
6. Explain Gini index along with an example. — (Q1 D, May 2024; Q3 A, May
2023; Q4 B, Dec 2024) — Repeated 3 times | Prob 85%
1. Definition (1 mark)
The Gini Index (also called Gini Impurity) is a measure of purity or impurity used
in Decision Tree algorithms (such as CART – Classification and Regression Trees).
It determines how mixed the classes are in a dataset or node.
A pure node (containing only one class) has a Gini Index of 0.
2. Formula (1 mark)
For a dataset with $m$ classes:
$ \text{Gini Index} = 1 - \sum_{i=1}^{m} p_i^2 $
where
• 𝑝𝑖 = proportion (probability) of class i in that node.
Range:
0 ≤ Gini ≤ 0.5 (for binary classification)
• 0 → Pure node (only one class)
• 0.5 → Impure node (mixed classes equally)
3. Interpretation (1 mark)
• The smaller the Gini Index, the purer the node.
• During decision tree construction, the attribute with the lowest weighted Gini
Index (or highest Gini gain) is chosen for splitting.
4. Example (2 marks)
Suppose we have a dataset of 10 samples:
Class Count
Yes 6
No 4
$ p_{Yes} = \frac{6}{10} = 0.6, \quad p_{No} = \frac{4}{10} = 0.4 $
$ \text{Gini Index} = 1 - (0.6^2 + 0.4^2) = 1 - (0.36 + 0.16) = 1 - 0.52 = 0.48 $
Interpretation:
• Gini = 0.48 → indicates high impurity, close to the binary maximum of 0.5 (the
node contains both classes in similar proportions).
• If all 10 were “Yes,” Gini would be 0 (perfectly pure).
5. Additional Example (optional short one-liner for extra marks)
If the node had {Yes = 8, No = 2},
$ G = 1 - (0.8^2 + 0.2^2) = 1 - 0.68 = 0.32 $
→ More pure (lower impurity) than before.
6. Conclusion (optional)
• Gini Index helps measure how good a split is.
• A lower Gini value means the attribute provides a better split and should be
chosen first when building a decision tree.
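The calculations above can be checked with a short sketch. The function names are illustrative, and the split used in the weighted example is a hypothetical one, not taken from the question:

```python
def gini(counts):
    """Gini index of a node given its class counts, e.g. [6, 4]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Weighted Gini of a split; children is a list of class-count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)


print(round(gini([6, 4]), 2))   # node from the main example
print(round(gini([8, 2]), 2))   # node from the additional example
# Hypothetical split of the 10 samples into two child nodes:
print(round(weighted_gini([[5, 1], [1, 3]]), 3))
```

A split is worth making when its weighted Gini is lower than the Gini of the parent node.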
7. Explain performance evaluation metrics for binary classification with suitable
example. — (Q1 C, May 2024; Q1 E, Dec 2024) — Repeated 2 times | Prob
80%
1. Introduction (1 mark)
In binary classification, a model predicts one of two classes — usually Positive or
Negative.
To evaluate how well the model performs, we use performance metrics derived from
the Confusion Matrix, which records correct and incorrect predictions.
2. Confusion Matrix
A Confusion Matrix is a 2×2 summary of a binary classifier’s predictions against the
actual labels.
The confusion matrix consists of four outcomes:
• True Positive (TP): Model correctly predicts the positive class.
• True Negative (TN): Model correctly predicts the negative class.
• False Positive (FP): Model incorrectly predicts positive (a false alarm).
• False Negative (FN): Model incorrectly predicts negative (a missed detection).
3. Performance Evaluation Metrics (3 marks)
a) Accuracy
Accuracy measures the overall correctness of the model.
It is defined as the ratio of correctly predicted instances to the total number of
instances.
$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
A higher accuracy indicates that the model performs well, but it may not be reliable
when data is imbalanced (e.g., one class dominates).
b) Precision
Precision indicates the quality of positive predictions.
It is the proportion of correctly predicted positives (TP) out of all instances predicted
as positive (TP + FP).
$ \text{Precision} = \frac{TP}{TP + FP} $
High precision means that when the model predicts “positive,” it is usually correct.
Precision is especially important in cases where false positives are costly — for
example, predicting spam emails.
c) Recall (Sensitivity or True Positive Rate)
Recall measures the ability of the model to detect actual positives.
It is the ratio of correctly predicted positives (TP) to all actual positives (TP + FN).
$ \text{Recall} = \frac{TP}{TP + FN} $
A high recall value means that most positive cases are correctly identified.
This is crucial in fields like medical diagnosis, where missing a positive case (false
negative) can be dangerous.
d) F1-Score
The F1-Score is the harmonic mean of Precision and Recall.
It provides a single measure that balances both Precision and Recall.
$ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
It is useful when you want to find a balance between avoiding false positives and false
negatives.
e) Specificity (True Negative Rate)
Specificity measures how well the model identifies negative cases.
It is the proportion of correctly predicted negatives (TN) to all actual negatives (TN +
FP).
$ \text{Specificity} = \frac{TN}{TN + FP} $
A high specificity means that most negative cases are correctly identified — important
for reducing false alarms.
4. Example (1 mark)
Consider a medical test predicting whether a person has a disease.
Out of 10 patients, the test results are:
• True Positives (TP): 4
• True Negatives (TN): 4
• False Positives (FP): 1
• False Negatives (FN): 1
Now,
$ \text{Accuracy} = \frac{4 + 4}{10} = 0.8 = 80\% $
$ \text{Precision} = \frac{4}{4 + 1} = 0.8 $
$ \text{Recall} = \frac{4}{4 + 1} = 0.8 $
$ F_1 = 2 \times \frac{0.8 \times 0.8}{0.8 + 0.8} = 0.8 $
The model therefore achieves 80% accuracy, 0.8 precision, and 0.8 recall, showing
balanced performance.
5. Conclusion (optional)
Different metrics provide different perspectives of performance.
Accuracy gives overall correctness, Precision measures trust in positive predictions,
Recall measures completeness, and F1 balances the two.
Depending on the application, one metric may be more important than others.
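The metric formulas can be collected into one small helper. This is a sketch only: the function name binary_metrics is illustrative, and it assumes no denominator is zero (for example, the model predicts at least one positive).

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the standard binary metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity / true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Counts from the medical-test example: TP = 4, TN = 4, FP = 1, FN = 1.
m = binary_metrics(tp=4, tn=4, fp=1, fn=1)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

For the medical-test counts, every metric evaluates to 0.8, matching the hand calculation.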
8. Explain the concept of k fold cross validation. — (Q1 E, May 2024; Q3 B, Dec
2023) — Repeated 2 times | Prob 85%
1. Definition:
Cross Validation is a model evaluation and selection technique used to assess how
well a model performs on independent datasets.
It helps detect overfitting and provides a more reliable estimate of model
performance than a single train-test split.
2. Why it is necessary:
1. Detects Overfitting:
Shows whether the model is merely memorizing the training data, because every
score is computed on data held out from training.
2. Provides Reliable Accuracy Estimates:
Evaluates performance on multiple unseen data subsets for a balanced measure.
3. Efficient Use of Data:
Especially useful when available data is limited — all samples are used for
both training and testing.
4. Model Selection and Hyperparameter Tuning:
Helps choose the best algorithm or parameter set by comparing cross-validation
scores.
3. Traditional Data Splitting Problem (1 mark)
In a simple Train-Test split (e.g., 80%-20%), model accuracy depends heavily on
how the data is split — it may be biased if the test set isn’t representative.
Cross Validation reduces this dependence on a single split by testing on multiple splits systematically.
4. K-Fold Cross Validation (4 marks)
Concept:
K-Fold Cross Validation divides the entire dataset into K subsets of (approximately)
equal size, called folds.
The model is trained K times, each time using a different fold as the test set and the
remaining (K–1) folds as the training set.
Steps:
1. Shuffle the dataset randomly.
2. Split it into K equal parts (folds).
3. For each fold i (from 1 to K):
◦ Use fold i as the test set.
◦ Use remaining K–1 folds as the training set.
◦ Compute accuracy or error.
4. Average all K results to obtain the final performance score.
Mathematically:
$ \text{CV Accuracy} = \frac{1}{K} \sum_{i=1}^{K} \text{Accuracy}_i $
Example:
For K = 5, data is divided into 5 parts → train on 4 parts, test on 1 part, and repeat this
5 times.
Each instance in the dataset is used for testing exactly once.
5. Variants of Cross Validation (1 mark)
• Leave-One-Out (LOOCV): K = n (each data point is a test fold).
• Stratified K-Fold: Ensures each fold has approximately the same class proportions
as the full dataset.
• Repeated K-Fold: Runs K-Fold multiple times for higher reliability.
6. Advantages (1 mark)
• Provides a more accurate estimate of model performance.
• Reduces bias and variance in evaluation.
• Utilizes all data for both training and testing.
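The K-Fold steps can be sketched as an index generator in plain Python. This is a minimal illustration, not a library implementation: the names k_fold_indices and cross_val_score, the fixed seed, and the per-fold accuracies at the end are all hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # Step 1: shuffle
    # Step 2: split into k folds whose sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):                         # Step 3: rotate the test fold
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_score(scores):
    """Step 4: average the K per-fold scores."""
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))

# Hypothetical per-fold accuracies, only to show the averaging step.
fold_scores = [0.8, 0.9, 0.7, 0.8, 0.8]
cv_accuracy = cross_val_score(fold_scores)
```

In a real run, each pair of index lists would be used to train a model on train_idx and score it on test_idx before averaging.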
9. Explain Training error and Generalization error. — (Q1 A, Dec 2024) —
Repeated 1 time | Prob 80%
1. Introduction (1 mark)
In Machine Learning, the goal of a model is not just to perform well on the training
data, but also to make accurate predictions on unseen data.
To evaluate how well a model has learned, we measure two key types of errors —
Training Error and Generalization Error.
2. Training Error (2 marks)
• Definition:
Training error is the average error the model makes on the training dataset,
i.e., the data used to build and fit the model.
• It shows how well the model has learned patterns from the data it has already
seen.
• A low training error means the model has fitted the training data well.
• However, if the training error is too low (close to zero), it may indicate
overfitting — the model has memorized the data instead of learning general
patterns.
Formula:
$ E_{train} = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} L(y_i, \hat{y}_i) $
where $L$ is the loss function (e.g., Mean Squared Error or Cross-Entropy).
Example:
If a regression model predicts almost perfectly on the training dataset, the training
error will be very small, e.g., MSE = 0.02.
3. Generalization Error (2 marks)
• Definition:
Generalization error is the expected error the model makes on unseen data, i.e.,
data not used during training. In practice it is estimated by the average error
on a held-out test set.
• It measures the model’s ability to generalize its learning to new, real-world
data.
• A low generalization error means the model performs consistently well on
both training and unseen data.
• If the generalization error is much higher than the training error, it indicates
overfitting.
Formula:
$ E_{gen} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} L(y_i, \hat{y}_i) $
Example:
If a model gives 98% accuracy on training data but only 70% on test data, it has a high
generalization error.
4. Relation Between the Two (bonus clarity)
• Ideal scenario: Both training and generalization errors are low → good model.
• Overfitting: Low training error, high generalization error.
• Underfitting: Both training and generalization errors are high.
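The overfitting case can be illustrated with a small sketch. It assumes a hypothetical model that simply memorizes the training set (a 1-nearest-neighbour lookup) and hypothetical noisy samples of roughly y = 2x; MSE is used as the loss L.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of L(y_i, y_hat_i) with squared loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def one_nn_predict(train_x, train_y, x):
    """Predict with the target of the closest training point (memorization)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Hypothetical noisy samples of roughly y = 2x.
train_x, train_y = [1, 2, 3, 4], [2.3, 3.6, 6.4, 7.7]
test_x, test_y = [1.4, 2.6, 3.4], [2.8, 5.2, 6.8]

train_pred = [one_nn_predict(train_x, train_y, x) for x in train_x]
test_pred = [one_nn_predict(train_x, train_y, x) for x in test_x]

e_train = mse(train_y, train_pred)   # error on data the model has seen
e_test = mse(test_y, test_pred)      # estimate of the generalization error
print("training error:", e_train, "test error:", e_test)
```

The memorizing model reproduces every training target exactly, so its training error is zero, while its error on the held-out points is larger; that gap is the signature of overfitting.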
10. Differentiate between Supervised and unsupervised Learning. — (Q1 B, Dec
2024) — Repeated 1 time | Prob 80%
Definition:
Machine Learning algorithms are mainly classified into Supervised and
Unsupervised learning based on whether the training data contains labels (known
outputs) or not.
Comparison Table:
Aspect | Supervised Learning | Unsupervised Learning
1. Definition | The model is trained on labeled data: both inputs and correct outputs are provided. | The model is trained on unlabeled data: only input features are provided, without known outputs.
2. Objective | To learn a mapping from input to output so that it can predict outcomes for new data. | To find hidden patterns, structures, or groupings in data.
3. Data Used | Uses input-output pairs $(x, y)$. | Uses only input data $x$.
4. Feedback | Model receives feedback based on prediction error (e.g., through loss function). | No feedback is given since there are no labels to compare against.
5. Common Tasks | Classification and Regression. | Clustering and Dimensionality Reduction.
6. Algorithms | Linear/Logistic Regression, Decision Tree, SVM, k-NN. | K-Means, Hierarchical Clustering, DBSCAN, PCA.
7. Example | Predicting house price or detecting spam emails. | Grouping customers based on buying habits.
8. Evaluation Metrics | Accuracy, Precision, Recall, MSE. | Cluster cohesion, Silhouette score.
Summary:
• Supervised Learning → Works with labeled data to predict outcomes.
• Unsupervised Learning → Works with unlabeled data to uncover hidden
structures.
11. Differentiate between Linear regression and Logistic regression. — (Q1 C, Dec
2024; Q1 D, Dec 2023) — Repeated 2 times | Prob 85%
Introduction:
Regression techniques are supervised learning methods used to establish
relationships between input variables and output variables.
While Linear Regression predicts continuous numerical values, Logistic
Regression predicts categorical (discrete) outcomes such as Yes/No or 0/1.
Comparison Table:
Aspect | Linear Regression | Logistic Regression
1. Objective | Predicts a continuous dependent variable based on independent variables. | Predicts a categorical dependent variable (usually binary: 0 or 1).
2. Type of Problem | Used for Regression problems (e.g., predicting salary, house price). | Used for Classification problems (e.g., spam detection, disease prediction).
3. Output Nature | Output is a real-valued number (e.g., 35.6, 102.4). | Output is a probability between 0 and 1, which is then converted to class labels.
4. Model Equation | $y = \beta_0 + \beta_1 x + \epsilon$ | $P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$
5. Relationship Assumed | Assumes a linear relationship between input and output. | Assumes a logistic (sigmoid) relationship between input and probability of output.
6. Error Function Used | Mean Squared Error (MSE) or RMSE. | Log Loss or Cross-Entropy Loss.
7. Output Curve | Produces a straight line fit to data. | Produces an S-shaped (sigmoid) curve.
8. Example | Predicting marks based on study hours. | Predicting whether a student passes or fails based on study hours.
Summary:
• Linear Regression → For continuous output prediction.
• Logistic Regression → For categorical output prediction using a sigmoid
function.
12. Explain issues in Machine learning. — (Q1 D, Dec 2024; Q1 E, May 2024; Q1
A, Dec 2022) — Repeated 3 times | Prob 80%
• Overfitting: The model memorizes the training data too closely and performs
poorly on new data. It's like learning answers by heart without really
understanding the concepts.
• Underfitting: The model is too simple and fails to capture patterns in the data,
making bad predictions even on training data.
• Bad Data Quality: Machine learning needs good data. Issues like missing,
incorrect, or biased data can make the model inaccurate.
• Imbalanced Data: If some types of data are much more common than others
(for example, more photos of dogs than cats), the model may not learn to
recognize the less common ones.
• Lack of Interpretability: Complex models are hard to understand, making it
tough to explain why they make certain decisions.
• Generalization to New Data: A model might work well on its training data but
fail when given new, unseen data because it hasn’t generalized well.
13. How to choose the right ML algorithm? — (Q1 A, May 2023; Q1 A, Dec 2023)
— Repeated 2 times | Prob 75%
1. Gathering Data
The first step is to collect relevant and sufficient data from various sources such as
databases, sensors, APIs, or web scraping. The quality and quantity of data directly
affect the performance of the model.
2. Preparing the Data
Raw data is cleaned and preprocessed — this includes handling missing values,
removing duplicates, normalizing, encoding categorical variables, and splitting data
into training and testing sets.
3. Choosing a Model
Select a suitable Machine Learning algorithm or model depending on the type of
problem — e.g., regression for continuous prediction, classification for categorical
prediction, clustering for pattern discovery, etc.
4. Training the Model
The chosen model is trained using the training dataset, allowing it to learn
relationships or patterns from the input data. This step involves optimizing model
parameters.
5. Evaluating the Model
After training, the model’s performance is tested using the test dataset. Common
evaluation metrics include accuracy, precision, recall, F1-score, and MSE depending
on the problem type.
6. Hyperparameter Tuning
In this step, model hyperparameters (like learning rate, number of trees, regularization
strength, etc.) are fine-tuned to improve accuracy and reduce errors. Techniques like
Grid Search or Random Search are commonly used.
7. Prediction / Deployment
The final trained and optimized model is used to make predictions on unseen data or
deployed into real-world applications for continuous usage.
14. Explain the concept of Logistic Regression. — (Q1 E, May 2023; Q1 C, Dec
2024) — Repeated 2 times | Prob 85%
Logistic regression is a machine learning method used for classification, meaning it
predicts categories (like yes/no, 0/1, true/false) instead of continuous values.
How It Works
• The model takes one or more input features (like age, income, exam score) and
calculates the probability that an outcome belongs to a particular category.
• For example, predict whether a student passes (1) or fails (0) an exam based on
the number of hours studied and number of practice tests taken.
The ML Equation
Logistic regression uses the following equation:
$ P(y = 1 \mid x) = \text{sigmoid}(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n) $
• $P(y = 1 \mid x)$ is the probability of a positive outcome (like passing).
• $x_1, x_2, \dots, x_n$ are the input features.
• $w_1, \dots, w_n$ are weights learned from data (showing how strongly each feature
influences the outcome), and $w_0$ is the intercept (bias).
• The sigmoid function squashes the output so it falls between 0 and 1, making it
a probability:
$ \text{sigmoid}(z) = \frac{1}{1 + e^{-z}} $
Here, if the probability is above a threshold (like 0.5), the model predicts “yes” (1); if
below, it predicts “no” (0).
Example in Machine Learning Context
Suppose you want to predict if a customer will buy a product (𝑦), based on age (𝑥1 ),
salary (𝑥2 ), and hours spent online (𝑥3 ). Logistic regression uses the data to learn the
best weights and calculates the buying probability. Whenever a new customer’s info
comes in, the model predicts “buy” or “not buy” based on the probability.
Key Points
• Used for spam detection, medical diagnosis, predicting customer churn, and
many other classification problems.
• Easy to implement, works well for simple tasks, and gives probabilities for
decision-making.
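The prediction step can be sketched in a few lines of plain Python. The weights and bias below are hypothetical, hand-picked values for the pass/fail example (hours studied, practice tests taken); they are not learned from data, and training is not shown.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """P(y = 1 | x) = sigmoid(w0 + w1*x1 + ... + wn*xn)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Convert the probability into a class label using the threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical weights for (hours studied, practice tests); bias is w0.
weights, bias = [0.8, 0.5], -4.0

p_pass = predict_proba(weights, bias, [5, 2])   # 5 hours, 2 practice tests
print(f"P(pass) = {p_pass:.3f}, label = {predict(weights, bias, [5, 2])}")
print("label for 1 hour, 1 test:", predict(weights, bias, [1, 1]))
```

Changing the threshold trades false positives against false negatives without retraining the model.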
15. Explain K-means algorithm. — (Q1 D, May 2023) — Repeated 1 time | Prob
75%
1. Introduction (1 mark)
K-Means is an unsupervised clustering algorithm used to partition a dataset into K
distinct, non-overlapping clusters based on feature similarity.
It minimizes the intra-cluster distance (points within the same cluster should be
close); for a fixed dataset this also pushes the cluster centroids apart (clusters
should be far apart).
2. Objective (0.5 mark)
The goal of K-Means is to find K cluster centroids such that the sum of squared
distances (SSD) between each data point and its assigned centroid is minimized.
$ J = \sum_{i=1}^{K} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 $
where $\mu_i$ = centroid of cluster $C_i$.
3. Working of K-Means Algorithm (2.5 marks)
Step 1: Choose the number of clusters (K).
Decide how many clusters you want to divide your data into.
Step 2: Initialize centroids.
Randomly select K data points as initial cluster centers (centroids).
Step 3: Assign points to nearest centroid.
For each data point, calculate the Euclidean distance to each centroid and assign it to
the cluster with the closest centroid.
Step 4: Update centroids.
Recalculate the centroid of each cluster as the mean of all data points belonging to
that cluster.
Step 5: Repeat Steps 3 and 4
Continue reassigning and updating centroids until they no longer change significantly
or the algorithm reaches a set number of iterations.
4. Example (0.5 mark)
Suppose you have 6 points representing customers based on “Annual Income” and
“Spending Score.”
Using K = 2, K-Means divides them into two groups — one representing high income
& high spenders, and another low income & low spenders.
5. Advantages (optional bonus line)
• Simple and fast for large datasets.
• Easy to interpret and implement.
Limitations: Requires predefined K and is sensitive to initial centroid placement.
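Steps 1 to 5 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the six customer points are hypothetical, and the initial centroids are taken as the first K points instead of a random choice so that the result is reproducible.

```python
def kmeans(points, k, max_iter=100):
    """Basic K-Means; returns (centroids, clusters)."""
    # Step 2 (assumption): use the first k points as initial centroids
    # instead of a random selection, to keep the sketch deterministic.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid
        # (squared Euclidean distance gives the same nearest centroid).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 4: recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 5: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (annual income, spending score).
customers = [(15, 20), (80, 85), (16, 25), (85, 90), (18, 22), (82, 88)]
centroids, clusters = kmeans(customers, k=2)
print("centroids:", centroids)
print("clusters:", clusters)
```

With K = 2 the low-income, low-spending customers and the high-income, high-spending customers end up in separate clusters. A different initialization can change the result on less clearly separated data, which is the sensitivity noted under Limitations.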
16. What are the issues in Machine learning? — (Q1 a, Dec 2022; Q1 D, Dec
2024) — Repeated 2 times | Prob 80%
• Overfitting: The model memorizes the training data too closely and performs
poorly on new data. It's like learning answers by heart without really
understanding the concepts.
• Underfitting: The model is too simple and fails to capture patterns in the data,
making bad predictions even on training data.
• Bad Data Quality: Machine learning needs good data. Issues like missing,
incorrect, or biased data can make the model inaccurate.
• Imbalanced Data: If some types of data are much more common than others (for
example, more photos of dogs than cats), the model may not learn to recognize the
less common ones.
• Lack of Interpretability: Complex models are hard to understand, making it
tough to explain why they make certain decisions.
• Generalization to New Data: A model might work well on its training data but
fail when given new, unseen data because it hasn’t generalized well.
17. Explain the concept of margin and support vector. — (Q1 c, Dec 2022) —
Repeated 1 time | Prob 85%
Support Vector Machine (SVM) is a supervised learning algorithm used for
classification and regression tasks. It works by finding an optimal decision boundary,
also known as a hyperplane, that best separates data points belonging to different
classes.
The margin in SVM is the distance between the hyperplane and the closest data
points from either class. The goal of SVM is to maximize this margin, as a larger
margin generally leads to better generalization and improved classification
performance. A wide margin ensures that the model is less sensitive to noise and small
changes in the data. There are two main types of margins — the functional margin,
which measures how confidently the model classifies points, and the geometric
margin, which represents the actual perpendicular distance between data points and
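the hyperplane. With the hyperplane written as $w \cdot x + b = 0$ and scaled so that
the support vectors satisfy $|w \cdot x + b| = 1$, the distance from the hyperplane to
the nearest point is $1 / \|w\|$, and the full width of the margin between the two
margin boundaries is given by
$ \text{Margin} = \frac{2}{\|w\|} $
where $w$ is the weight vector of the hyperplane.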
The support vectors are the data points that lie closest to the decision boundary.
These points are extremely important because they define both the position and
orientation of the hyperplane. If any of these support vectors are removed, the
hyperplane will shift, changing the classifier. On the other hand, removing other non-
support vectors will not affect the model.
The margin boundaries pass through the support vectors, and the optimal hyperplane
lies exactly midway between them.
In summary:
• Margin: Distance between hyperplane and nearest data points.
• Support Vectors: Points that define the margin boundaries and determine the
hyperplane.
• Objective of SVM: Maximize the margin for better classification accuracy.
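The margin formula can be checked numerically with a short sketch. The hyperplane (w = (3, 4), b = −10) and the sample point are hypothetical, and the function names are illustrative; the point is chosen so that w·x + b = 1, which is the condition a support vector on the positive margin boundary satisfies under the usual scaling.

```python
import math


def norm(w):
    """Euclidean length ||w|| of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))


def margin_width(w):
    """Full margin width 2 / ||w|| between the two margin boundaries."""
    return 2.0 / norm(w)


def distance_to_hyperplane(w, b, x):
    """Perpendicular (geometric) distance from x to the hyperplane w.x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)


# Hypothetical hyperplane and a point lying on the positive margin boundary.
w, b = [3.0, 4.0], -10.0
support_vector = [3.0, 0.5]          # 3*3 + 4*0.5 - 10 = 1

print("margin width:", margin_width(w))
print("distance of support vector:", distance_to_hyperplane(w, b, support_vector))
```

The support vector sits at half the margin width from the hyperplane, consistent with the hyperplane lying midway between the two margin boundaries.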
18. Explain the distance metrics used in clustering. — (Q1 d, Dec 2022) —
Repeated 1 time | Prob 75%
In clustering, distance metrics are mathematical measures used to calculate the
similarity or dissimilarity between data points. The goal of clustering algorithms
such as K-Means, Hierarchical Clustering, and DBSCAN is to group similar data
points together based on these distances. Choosing the right distance metric directly
affects the quality and accuracy of the clusters.
1. Euclidean Distance
It is the most common distance metric and represents the shortest straight-line
distance between two points in multidimensional space.
$ D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} $
It is suitable for continuous and normalized data.
Example: Used in K-Means clustering.
2. Manhattan Distance (L1 Distance or City Block Distance)
It is the sum of absolute differences between coordinates of two points.
$ D(x, y) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_n - y_n| $
It measures distance like traveling along a grid (city blocks) and is less sensitive to
outliers than Euclidean distance, because differences are not squared.
3. Minkowski Distance
It is a generalized form of both Euclidean and Manhattan distances.
$ D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} $
When $p = 1$, it becomes Manhattan distance; when $p = 2$, it becomes Euclidean
distance.
4. Cosine Distance
It measures the angle between two vectors rather than the magnitude.
$ D(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|} $
Used when the direction of data matters more than magnitude, e.g., in text or
document clustering.
5. Mahalanobis Distance
It considers correlation between variables. When $y$ is the mean of a distribution, it
measures how many standard deviations the point $x$ lies away from that mean.
$ D(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)} $
where $S$ is the covariance matrix. It is used for multivariate data where variables are
correlated.
Summary
• Distance metrics are used to determine how similar or dissimilar two data
points are.
• Common metrics: Euclidean, Manhattan, Minkowski, Cosine, and
Mahalanobis.
• The choice depends on data type, scaling, and clustering algorithm.
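The first four metrics can be sketched in plain Python. The function names are illustrative; Euclidean and Manhattan are written as special cases of Minkowski to mirror the relationship described above. Mahalanobis distance is omitted because it needs a covariance matrix inverse.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)    # p = 2


def manhattan(x, y):
    return minkowski(x, y, 1)    # p = 1


def cosine_distance(x, y):
    """1 - cos(angle between x and y); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


a, b = (0, 0), (3, 4)
print("Euclidean:", euclidean(a, b))
print("Manhattan:", manhattan(a, b))
print("Minkowski (p = 3):", minkowski(a, b, 3))
# Vectors pointing in the same direction have cosine distance close to 0,
# even though their magnitudes differ.
print("Cosine:", cosine_distance((1, 2), (2, 4)))
```

For the same pair of points, the Minkowski distance shrinks as p grows, which is why Manhattan distance is never smaller than Euclidean distance.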
✅ Top Repeated (High-Priority 5-Mark Questions)
1. Regression line, Scatter plot, Error in prediction and Best fitting line — 3×
(Dec 2022, May 2023, May 2024)
2. Gini index along with an example — 3× (May 2023, May 2024, Dec 2024)
3. Issues in Machine learning — 3× (Dec 2022, May 2024, Dec 2024)
4. Linear Discriminant Analysis — 2× (Dec 2023, May 2024)
5. K Fold Cross Validation — 2× (Dec 2023, May 2024)