Machine Learning
Semester 6 Viva — Easy Language Edition
(Skips all numerical calculations — focus on concepts for oral viva)
Unit 1 — Introduction to ML
DEFINITION
Q1. What is Machine Learning?
ML is when a computer gets better at a task by learning from data, without anyone hand-coding
the rules. Example: a spam filter learns from thousands of emails and gets better at spotting
spam over time.
AI VS ML
Q2. What's the difference between AI, ML, and Deep Learning?
Think of three nested circles. AI is the biggest: the whole field of making machines smart.
ML is inside AI: machines learn from data. Deep Learning is inside ML: it uses neural networks
with many layers, loosely inspired by the human brain.
TYPES
Q3. What are the 4 main types of Machine Learning?
1. Supervised — you give it labeled data (right answers included). Example: spam/not spam.
2. Unsupervised — no labels, find hidden patterns. Example: grouping customers.
3. Semi-supervised — a little labeled + a lot unlabeled.
4. Reinforcement — learns by trial and error using rewards and punishments.
THEORY
Q4. What is PAC Learning?
PAC = Probably Approximately Correct. It's a theory that says: if you give a machine enough
examples, it will learn a 'good enough' answer with high confidence. You don't need infinite data
— just enough. Introduced by Leslie Valiant in 1984.
THEORY
Q5. What is a Version Space?
It's the set of all hypotheses (guesses) that are consistent with your training data. The
Candidate Elimination algorithm tracks two boundaries: S (the most specific consistent
hypotheses) and G (the most general ones). Positive examples make S more general, negative
examples make G more specific, so the boundaries close in until only hypotheses that fit all
the data remain.
CORE CONCEPT
Q6. What is the Bias-Variance Tradeoff?
Bias = your model is too simple and gets the wrong answer (underfitting).
Variance = your model is too complex and memorizes training data but fails on new data
(overfitting).
The goal is to find the sweet spot — a model that generalizes well to new data.
PREPROCESSING
Q7. What are missing values and how do you handle them?
Missing values = blank/empty data in your dataset. You can: fill them with the mean, median, or
mode of that column; use KNN to estimate the value; or just delete rows/columns that are too
incomplete. Never leave them as-is — most algorithms can't handle blanks.
PREPROCESSING
Q8. What is normalization and why do we need it?
Normalization rescales data so all features are on the same scale. Without it, a feature like 'salary
(50000)' would completely overpower 'age (25)'. Min-Max Scaling puts everything between 0 and 1.
Z-score makes mean=0 and std=1. Always normalize before using KNN or SVM.
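A minimal sketch of the two scalings described above, in plain Python. The function names and the salary values are made up for illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale linearly so the smallest value becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

salaries = [30000, 50000, 70000]          # made-up sample values
scaled = min_max_scale(salaries)          # values now lie in [0, 1]
standardized = z_score_scale(salaries)    # mean 0, std 1
```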
Unit 2 — Supervised Learning
ALGORITHM
Q9. What is Linear Regression?
It draws the best-fit straight line through your data to predict a number. Like predicting house
price from its size. It minimizes the error between predicted and actual values (using MSE). The
formula is: y = b0 + b1*x (where b0 = intercept, the value of y when x = 0, and b1 = slope).
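As a rough illustration, the best-fit line for one feature can be computed with the standard least-squares formulas. The helper names and the tiny dataset below are assumptions made for this sketch.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept b0 and slope b1."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    b1 = covariance / variance
    b0 = y_mean - b1 * x_mean
    return b0, b1

def mse(xs, ys, b0, b1):
    """Mean squared error of the line y = b0 + b1*x on the data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]   # made-up data lying exactly on y = 1 + 2x
b0, b1 = fit_line(sizes, prices)
```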
ALGORITHM
Q10. What is Naive Bayes? Why is it 'naive'?
It uses probability (Bayes' theorem) to classify things. It's 'naive' because it assumes all
features are independent of each other once you know the class, which is almost never true in
real life. Despite this wrong assumption, it still works really well for text classification like
spam detection.
ALGORITHM
Q11. What is a Decision Tree? What is entropy?
A Decision Tree asks yes/no questions to classify data — like a game of 20 questions. Entropy
measures how mixed/impure a group is (high entropy = very mixed). The tree picks the question
(feature) that reduces entropy the most — this is called Information Gain (ID3 algorithm).
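A small sketch of entropy and Information Gain, assuming labels are given as plain Python lists. The example splits are invented to show the two extremes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, 1 for a 50/50 binary mix."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children as mixed as the parent
```

ID3 would prefer the feature that produces the first split, because it removes all the impurity.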
ALGORITHM
Q12. What is the difference between ID3 and CART?
ID3: uses Information Gain, only handles categories, makes multi-way splits.
CART: uses Gini Impurity (different purity measure), handles numbers too, always makes binary
(2-way) splits. CART can also predict numbers (regression), not just categories.
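A hedged sketch of how a CART-style binary split on a numeric feature could be chosen: try a threshold between each pair of neighbouring values and keep the one with the lowest weighted Gini impurity. The function names and the ages data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, 0.5 for a 50/50 binary mix."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted Gini) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

ages = [22, 25, 47, 52]                  # made-up numeric feature
bought = ["no", "no", "yes", "yes"]
threshold, score = best_binary_split(ages, bought)
```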
ALGORITHM
Q13. What is KNN? How does it work?
K-Nearest Neighbors: to classify a new point, look at the K closest points in your training data and
take a majority vote. It's a lazy learner — it doesn't actually train, just memorizes everything. Very
sensitive to the value of K and to feature scaling.
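A minimal KNN sketch using Euclidean distance and a majority vote. The points, labels, and function name are illustrative assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
```

Notice there is no training step: all the work happens at prediction time, which is why KNN is called a lazy learner.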
ALGORITHM
Q14. What is Logistic Regression? How is it different from Linear Regression?
Despite the name, Logistic Regression is for classification, not predicting numbers! It outputs a
probability between 0 and 1 using the sigmoid (S-shaped) curve. Linear Regression predicts
numbers. Logistic Regression predicts class probabilities like yes/no.
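To make the sigmoid idea concrete, here is a small sketch for one feature. The weights b0 and b1 are hypothetical values picked for the example, not fitted ones.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(x, b0, b1):
    """Probability of the positive class for a single feature x."""
    return sigmoid(b0 + b1 * x)

def predict_class(x, b0, b1, threshold=0.5):
    """Turn the probability into a yes/no (1/0) decision."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0
```

With b0 = -4 and b1 = 2, an input of x = 3 lands on the "yes" side and x = 1 on the "no" side.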
ALGORITHM
Q15. What is a Perceptron? What is its limitation?
The simplest neural network unit — takes inputs, multiplies by weights, and fires 0 or 1 as output.
Works great for linearly separable problems (where a straight line separates classes). Big
limitation: completely fails on XOR and any non-linearly separable problem.
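A small sketch of the classic perceptron learning rule, assuming two binary inputs and a step activation. It learns AND, which is linearly separable, but can never get all four XOR cases right.

```python
def predict(w, b, x):
    """Step activation: fire 1 if the weighted sum is positive, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

def train_perceptron(samples, targets, epochs=20, lr=1.0):
    """Nudge weights toward the target whenever a prediction is wrong."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            error = target - predict(w, b, x)
            w[0] += lr * error * x[0]
            w[1] += lr * error * x[1]
            b += lr * error
    return w, b

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_targets = [0, 0, 0, 1]
xor_targets = [0, 1, 1, 0]

w_and, b_and = train_perceptron(inputs, and_targets)
and_preds = [predict(w_and, b_and, x) for x in inputs]

w_xor, b_xor = train_perceptron(inputs, xor_targets)
xor_preds = [predict(w_xor, b_xor, x) for x in inputs]
```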
ALGORITHM
Q16. What is SVM? What is the kernel trick?
SVM (Support Vector Machine) finds the widest possible gap (margin) between two classes. The
data points at the edge of the margin are called support vectors. When data can't be separated by
a straight line, the kernel trick maps it to a higher dimension where it can be separated. Common
kernels: RBF, Polynomial.
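One way to see the kernel trick, sketched with a simple degree-2 polynomial kernel (chosen here because its feature map is easy to write out): the kernel computed in the original 2-D space gives the same number as a dot product in an explicit 3-D space, so the mapping never has to be done.

```python
from math import sqrt

def poly2_kernel(a, b):
    """Degree-2 polynomial kernel (a . b)^2, computed in the original 2-D space."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

def explicit_map(x):
    """The 3-D feature map that this kernel corresponds to."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a, b = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(a, b)                       # stays in 2-D
via_mapping = dot(explicit_map(a), explicit_map(b))   # goes through 3-D
```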
Unit 3 — Unsupervised Learning
CONCEPT
Q17. What is clustering? What are its main types?
Clustering groups unlabeled data points based on similarity — no one tells it the right answer.
Types: Partitional (K-Means), Hierarchical (builds a tree of clusters), Density-based (DBSCAN
finds dense regions), Model-based (EM/GMM), and Self-Organizing Maps (SOM).
ALGORITHM
Q18. What is K-Means clustering? Explain step by step.
1. Pick K random center points (centroids).
2. Assign every data point to the nearest centroid.
3. Move each centroid to the average of its assigned points.
4. Repeat steps 2-3 until nothing changes.
The goal is to minimize the within-cluster sum of squared distances (WCSS). K is usually chosen using the Elbow Method.
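The four steps above can be sketched in plain Python. To keep the run repeatable, this version takes the starting centroids as an argument instead of picking them at random; that choice, the function name, and the sample points are assumptions of the sketch.

```python
from math import dist

def k_means(points, centroids, max_iters=100):
    """Basic K-Means loop; `centroids` holds the starting centres (step 1)."""
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the average of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop once nothing changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
```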
ALGORITHM
Q19. What's the difference between K-Means and K-Modes?
K-Means: for numerical data — centroid is the mean (average) of the group.
K-Modes: for categorical data (like colors, cities, yes/no) — centroid is the mode (most common
value).
K-Prototypes: handles a mix of both numerical and categorical data.
ALGORITHM
Q20. What is Hierarchical Clustering? What is a dendrogram?
Hierarchical clustering builds a tree of clusters. Agglomerative (bottom-up) starts with every point
alone, then keeps merging the two closest clusters. A dendrogram is the tree diagram showing
which clusters merged and at what distance. Cut the tree at a height to get K clusters.
ALGORITHM
Q21. What is DBSCAN? What are core, border, and noise points?
DBSCAN finds clusters based on density (how packed points are).
Core point: has at least MinPts neighbors within a radius (epsilon).
Border point: within epsilon of a core point, but has fewer than MinPts neighbors itself.
Noise: isolated points that don't belong to any cluster.
Great for irregular-shaped clusters and handling outliers.
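A sketch of just the point-labelling part of DBSCAN (not the full clustering). It assumes a minimum-neighbour threshold called `min_pts` here, and counts a point as its own neighbour; both are conventions chosen for this illustration.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    neighbours = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbours[p]) >= min_pts}
    kinds = {}
    for p in points:
        if p in core:
            kinds[p] = "core"
        elif any(q in core for q in neighbours[p]):
            kinds[p] = "border"   # close to a core point, but not dense itself
        else:
            kinds[p] = "noise"    # isolated
    return kinds

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
kinds = classify_points(points, eps=1.0, min_pts=3)
```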
ALGORITHM
Q22. What is PCA and why is it used?
PCA (Principal Component Analysis) reduces the number of features while keeping as much
information as possible. It finds the directions where data varies the most (principal components)
and projects data onto fewer dimensions. Used to remove redundant features and for
visualization.
ALGORITHM
Q23. What is the EM Algorithm?
Expectation-Maximization is used when data has hidden (latent) variables. E-step: guess which
cluster each point probably belongs to (soft assignments, not hard). M-step: update the cluster
parameters based on those guesses. Repeat until it stabilizes. Used in Gaussian Mixture Models
(GMM).
ALGORITHM
Q24. What is a Self-Organizing Map (SOM)?
SOM is an unsupervised neural network that maps high-dimensional data onto a 2D grid while
preserving the structure (similar things end up nearby). Neurons compete to represent inputs —
the winner and its neighbors move toward the input. Great for visualization and clustering.
Unit 4 — Ensemble Learning
CONCEPT
Q25. What is Ensemble Learning and why is it used?
Combining multiple models usually gives better predictions than any single model. Like asking 5 doctors
instead of 1: you get a more reliable answer. Depending on the strategy, it reduces overfitting
(variance) or underfitting (bias). Three main strategies: Bagging, Boosting, and Stacking.
CONCEPT
Q26. What is the difference between Bagging and Boosting?
Bagging: trains models in parallel on bootstrap samples (random samples drawn with replacement),
then takes a vote. Reduces variance (overfitting). Example: Random Forest.
Boosting: trains models one after another — each model fixes the mistakes of the previous.
Reduces bias. Example: AdaBoost, XGBoost.
Key difference: parallel vs sequential.
ALGORITHM
Q27. What is Random Forest?
Random Forest = Bagging + random feature selection. It builds many Decision Trees, each on a
random sample of data and random subset of features. For classification: majority vote wins. For
regression: take the average. Much less prone to overfitting than a single tree, and gives you feature importance for free.
ALGORITHM
Q28. What is AdaBoost?
Adaptive Boosting: starts by giving all training samples equal weight. Trains a weak model (tiny
decision tree stump). Misclassified samples get higher weight — so the next model focuses on
the hard cases. Final answer = weighted vote of all weak models. Hard examples get more
attention each round.
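One round of the reweighting idea, sketched with the classic discrete AdaBoost formulas. The four-sample setup is invented, and the sketch assumes the weak model's weighted error is strictly between 0 and 1.

```python
from math import exp, log

def adaboost_round(weights, correct):
    """Return the weak model's vote weight (alpha) and the updated sample weights."""
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1 - error) / error)
    # Shrink weights of correct samples, grow weights of mistakes.
    raised = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raised)
    return alpha, [w / total for w in raised]   # renormalise to sum to 1

weights = [0.25, 0.25, 0.25, 0.25]        # everyone starts equal
correct = [True, True, True, False]       # the stump got the last sample wrong
alpha, new_weights = adaboost_round(weights, correct)
```

After this round the single misclassified sample carries half of the total weight, so the next weak model is pushed to get it right.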
ALGORITHM
Q29. What is XGBoost and why is it popular?
XGBoost (Extreme Gradient Boosting) is a fast, powerful boosting framework. Extra features: L1
and L2 regularization to prevent overfitting, parallelized split finding inside each tree (fast!),
native handling of missing values, and smart tree pruning. It has won many Kaggle competitions,
so it is very practical.
ALGORITHM
Q30. What is Stacking in Ensemble Learning?
Stacking uses multiple different models (Level-0 learners) trained on the original data. Their
predictions become the input to a meta-learner (Level-1) which learns how to best combine them.
The meta-learner figures out how to trust each base model. Different from Bagging (same
algorithm) — uses diverse algorithms.
ALGORITHM
Q31. What is Gradient Boosting? How does it differ from AdaBoost?
Gradient Boosting: each new model is trained on the residual errors (mistakes) of the ensemble
so far, using gradient descent on a loss function.
AdaBoost: reweights the training samples.
Gradient Boosting: fits new trees directly on the errors.
Gradient Boosting works with any differentiable loss function — more general.
CONCEPT
Q32. What is feature importance in Random Forest?
Feature importance tells you which features the model found most useful. Measured as: how
much does using this feature for splitting reduce impurity (Gini/entropy) on average across all
trees? Higher importance = more influential feature. Or use permutation importance: shuffle a
feature and see how much accuracy drops.
Unit 5 — Evaluation Metrics
METRICS
Q33. Define Accuracy, Precision, Recall, and F1-Score.
Accuracy: out of everything, how much did I get right?
Precision: out of all the things I said were positive, how many actually were?
Recall (Sensitivity): out of all actual positives, how many did I catch?
F1-Score: the harmonic mean of Precision and Recall — useful when classes are imbalanced.
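A small sketch that turns the four definitions above into code for a binary problem. The labels are made up, and the functions do not guard against division by zero (for example, when nothing is predicted positive).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def scores(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)     # of predicted positives, how many were right
    recall = tp / (tp + fn)        # of actual positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
accuracy, precision, recall, f1 = scores(y_true, y_pred)
```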
METRICS
Q34. What is an ROC Curve and how is AUC interpreted?
ROC curve plots True Positive Rate (how many positives caught) vs False Positive Rate (false
alarms) at different thresholds. AUC = Area Under the Curve. AUC = 1.0 means perfect. AUC =
0.5 means random guessing (useless). Higher AUC = better model at distinguishing classes.
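One handy way to read AUC: it equals the chance that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes it that way by comparing all pairs (ties count as half); the function name and scores are illustrative.

```python
def auc(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_scores) if t == 1]
    neg = [s for t, s in zip(y_true, y_scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
perfect_auc = auc([0, 1], [0.2, 0.9])
```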
VALIDATION
Q35. What is K-Fold Cross Validation?
Split data into K equal parts (folds). Train on K-1 folds, test on the 1 remaining fold. Repeat K
times so every fold gets a turn as the test set. Final score = average of all K test scores. Why?
Every single data point is used for both training and testing. Most common: K=5 or K=10.
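A sketch of how the fold indices could be generated by hand, without shuffling. The function name is an assumption; in practice a library helper would normally do this.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
```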
CONCEPT
Q36. What is overfitting and underfitting? How to fix them?
Overfitting: model memorizes training data, fails on new data (too complex). Fix: add
regularization, get more data, use cross-validation, prune the tree.
Underfitting: model is too simple, does badly on everything. Fix: use a more complex model, add
more features, reduce regularization.
CONCEPT
Q37. What is Regularization? What is L1 vs L2?
Regularization adds a penalty for large weights to the loss function — prevents overfitting.
L1 (Lasso): penalizes the absolute value of weights. Some weights become exactly 0, so it does
automatic feature selection (sparse model).
L2 (Ridge): penalizes the square of weights. Shrinks all weights but doesn't zero them out.
ElasticNet = L1 + L2 combined.
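To see why L1 zeroes weights and L2 only shrinks them, here is a one-weight sketch. It assumes a toy loss 0.5*(w - w0)**2, where w0 is the weight you would get without any penalty; the exact minimisers under each penalty are then simple formulas.

```python
def lasso_shrink(w0, lam):
    """Minimiser of 0.5*(w - w0)**2 + lam*abs(w): small weights snap to exactly 0."""
    if w0 > lam:
        return w0 - lam
    if w0 < -lam:
        return w0 + lam
    return 0.0

def ridge_shrink(w0, lam):
    """Minimiser of 0.5*(w - w0)**2 + lam*w**2: every weight is scaled down."""
    return w0 / (1 + 2 * lam)

weights = [3.0, 0.4, -0.2]                      # made-up unpenalised weights
lasso = [lasso_shrink(w, 0.5) for w in weights]
ridge = [ridge_shrink(w, 0.5) for w in weights]
```

The two small weights become exactly 0 under L1 (feature selection), while under L2 they just get smaller.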
CONCEPT
Q38. How do you handle class imbalance in classification?
When one class has way more samples than the other: use SMOTE (generates synthetic minority
samples), undersample the majority class, or assign higher loss penalty to the minority class.
Also: don't use Accuracy as your metric. Use F1-Score, AUC-ROC, or precision-recall curves
instead, since they're much less easily fooled by imbalance.
Unit 6 — Reinforcement Learning
CONCEPT
Q39. What is Reinforcement Learning? What are its main components?
RL is learning by trial and error in an environment. Components:
Agent = the learner/decision maker
Environment = the world it acts in
State = the current situation
Action = what the agent can do
Reward = feedback signal (positive or negative)
Policy = the agent's strategy
Value function = expected future reward
ALGORITHM
Q40. What is Q-Learning?
Q-Learning is a model-free RL algorithm. It builds a Q-Table that stores values for every (state,
action) pair. Q(s,a) = how good is it to take action a in state s? Update rule: Q(s,a) = Q(s,a) +
learning_rate * [reward + discount * best_future_value - current_Q]. It converges to the optimal
strategy whatever exploration policy you use, as long as every (state, action) pair keeps
getting tried and the learning rate is decayed properly.
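The update rule can be written as one small function. The state names, the action list, and the dictionary-based Q-Table are assumptions made for this sketch.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, lr=0.1, discount=0.9):
    """Apply one Q-Learning update to the table `q` and return the new value."""
    best_future_value = max(q[(next_state, a)] for a in actions)
    target = reward + discount * best_future_value
    q[(state, action)] += lr * (target - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)             # Q-Table: unseen pairs start at 0
actions = ["left", "right"]
q[("s1", "right")] = 1.0           # pretend we already learned this value
new_value = q_update(q, "s0", "right", reward=0.5, next_state="s1", actions=actions)
```

Here the target is 0.5 + 0.9 * 1.0 = 1.4, and the old value 0 moves 10% of the way toward it.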
THEORY
Q41. What is the Bellman Equation?
The Bellman Equation says: the value of a state = the best reward you can get now + the
discounted value of the state you land in next. V(s) = max_a[R(s,a) + gamma * V(s')], where s' is
the next state after taking action a in s (if the environment is random, average over the possible
s'). It's recursive: the value of where you are depends on where you can go. Foundation of
Q-Learning and dynamic programming.
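A sketch of the equation in action: repeatedly applying it to every state (value iteration) on a tiny deterministic environment. The three-state chain and its rewards are invented for illustration.

```python
def value_iteration(transitions, gamma=0.9, sweeps=100):
    """transitions[state][action] = (reward, next_state); terminal states have no actions."""
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        # Bellman backup: best immediate reward plus discounted next-state value.
        values = {
            s: max((r + gamma * values[s_next] for r, s_next in acts.values()), default=0.0)
            for s, acts in transitions.items()
        }
    return values

transitions = {
    "start": {"go": (0.0, "middle")},
    "middle": {"go": (1.0, "end")},
    "end": {},                      # terminal state
}
values = value_iteration(transitions)
```

The reward of 1 sits one step after "middle", so "start" is worth that reward discounted once.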
CONCEPT
Q42. What is exploration vs exploitation tradeoff?
Exploration: try new, unknown actions — you might find something better.
Exploitation: do the best thing you know — maximize current reward.
You need both! Too much exploitation = stuck in local optima. Too much exploration = never uses
what it learned.
Solution: Epsilon-greedy — usually exploit, but randomly explore with probability epsilon.
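Epsilon-greedy fits in a few lines. The action names and values below are placeholders, and the `rng` parameter is an addition for this sketch so a seeded generator can be passed in.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values maps each action to its current estimated value."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))        # explore: pick any action
    return max(q_values, key=q_values.get)       # exploit: pick the best known

q_values = {"left": 0.2, "right": 0.8}
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)
random_choice = epsilon_greedy(q_values, epsilon=1.0, rng=random.Random(0))
```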
ALGORITHM
Q43. What is Temporal Difference (TD) Learning?
TD Learning updates value estimates after each step without waiting for the episode to end. It
combines two ideas: like Monte Carlo, it learns from sampled experience with no model of the
environment; like Dynamic Programming, it bootstraps, updating an estimate from other current
estimates. Q-Learning and SARSA are both TD methods. The key idea: learn as you go, not just
at the end.
COMPARISON
Q44. What is model-based vs model-free RL?
Model-based: agent builds an internal map/model of the environment (knows what happens when
it takes an action). Can plan ahead. Example: Dyna-Q.
Model-free: agent just learns from experience — no internal map. Simpler but needs more
interaction. Examples: Q-Learning, SARSA, Policy Gradient.
APPLICATIONS
Q45. Give 3 real-world applications of Reinforcement Learning.
1. Game playing: AlphaGo beat world champion Lee Sedol at Go; DeepMind's DQN agent learned to play Atari games.
2. Robotics: robot arms learning to pick up and place objects.
3. Autonomous driving: cars learning navigation and obstacle avoidance policies.
Also used in recommendation systems and algorithmic trading.
Unit 7 — Practical Viva — Python & sklearn
PYTHON
Q46. What is the difference between fit(), transform(), and fit_transform()?
fit(): learns the parameters from training data (e.g., calculates mean and std for scaling).
transform(): applies those learned parameters to data.
fit_transform(): does both in one step.
Critical rule: only fit() on training data. Never fit on test data — just transform it. Otherwise you're
cheating (data leakage).
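To show what each method does internally, here is a toy scaler that mimics the fit/transform convention. It is a hypothetical stand-in written for this note, not scikit-learn's actual code.

```python
from statistics import mean, pstdev

class TinyStandardScaler:
    """Toy stand-in that follows the fit/transform convention."""

    def fit(self, values):
        # Learn the parameters (mean and std) from the data given.
        self.mean_ = mean(values)
        self.scale_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the parameters learned earlier; nothing new is learned here.
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)

train = [1.0, 2.0, 3.0]
test = [4.0]
scaler = TinyStandardScaler()
train_scaled = scaler.fit_transform(train)   # fit on training data only
test_scaled = scaler.transform(test)         # test data reuses the training mean/std
```

The test value is scaled with the training mean of 2.0, so no information from the test set leaks into the preprocessing.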
PYTHON
Q47. How do you split data into train and test sets in sklearn?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
test_size=0.2 means 20% goes to testing, 80% to training. random_state=42 makes the split
reproducible.
PYTHON
Q48. How do you implement K-Means in Python?
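from sklearn.cluster import KMeans
km = KMeans(n_clusters=3, random_state=42)
km.fit(X)  # X: feature matrix of shape (n_samples, n_features)
labels = km.labels_  # cluster index assigned to each sample
centers = km.cluster_centers_  # coordinates of the 3 centroids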
Use km.inertia_ to get the WCSS value for the Elbow Method.
PYTHON
Q49. How do you implement a Decision Tree in sklearn?
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
clf = DecisionTreeClassifier(criterion='gini', max_depth=5)
clf.fit(X_train, y_train)  # learn the tree from the training split
y_pred = clf.predict(X_test)  # predict labels for unseen data
print(accuracy_score(y_test, y_pred))
PYTHON
Q50. How do you print a classification report in sklearn?
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
The report shows precision, recall, F1-score, and support for every class. Very useful for
imbalanced datasets.
PYTHON
Q51. What is a sklearn Pipeline and why is it useful?
A Pipeline chains preprocessing + model into one object so they work together:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])
pipe.fit(X_train, y_train)
Benefits: prevents data leakage, makes cross-validation cleaner, easier to deploy.
PYTHON
Q52. How do you do K-Fold Cross Validation in sklearn?
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
cv=5 means 5-fold cross validation. For a classifier, an integer cv already uses StratifiedKFold;
pass a StratifiedKFold object yourself if you want to control shuffling.
Unit 8 — Short Notes & Theory
THEORY
Q53. What is Feature Engineering? Give examples.
Creating new, more useful features from raw data to help the model learn better. Examples:
extract day/month/year from a date column; create a ratio like income/expenses; apply log
transform to skewed data (like salary); one-hot encode categories; create polynomial features like
x-squared.
THEORY
Q54. What is One-Hot Encoding vs Label Encoding?
Label Encoding: gives each category an integer (Red=0, Green=1, Blue=2). Problem: the model
might think Green is 'between' Red and Blue.
One-Hot Encoding: creates a separate 0/1 column per category (is_Red, is_Green, is_Blue). Use
Label Encoding for ordered categories. Use One-Hot for unordered categories.
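Both encodings sketched by hand. The helper names and category lists are assumptions; the label encoder takes the category order explicitly because the order is the whole point for ordered categories.

```python
def label_encode(values, order):
    """Map each category to its position in `order` (for ordered categories)."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Return the column names and one 0/1 row per value (for unordered categories)."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return categories, rows

sizes = ["small", "large", "medium"]
size_codes = label_encode(sizes, order=["small", "medium", "large"])

colors = ["Red", "Green", "Blue", "Green"]
columns, rows = one_hot_encode(colors)
```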
THEORY
Q55. What's the difference between Batch, Mini-batch, and Stochastic Gradient Descent?
Batch GD: uses all training data for each weight update. Very stable but slow.
Stochastic GD (SGD): updates weights after every single sample — fast but noisy/jumpy.
Mini-batch GD: updates after a small batch (e.g. 32 samples) — best of both worlds, most
commonly used in practice.
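The three variants differ only in how many samples feed each update, so one function with a `batch_size` parameter can cover all of them. This sketch fits a straight line to made-up, noise-free data; the learning rate and epoch count are arbitrary choices.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y = b0 + b1*x by minimising MSE, updating once per batch."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    b0 = b1 = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            g0 = sum(2 * (b0 + b1 * x - y) for x, y in batch) / len(batch)
            g1 = sum(2 * (b0 + b1 * x - y) * x for x, y in batch) / len(batch)
            b0 -= lr * g0
            b1 -= lr * g1
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly y = 1 + 2x
# batch_size=1 is SGD, 2 is mini-batch, 4 (all the data) is batch GD.
results = {size: gradient_descent(xs, ys, batch_size=size) for size in (1, 2, 4)}
```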
THEORY
Q56. What are the ethical concerns in Machine Learning?
1. Bias and fairness: biased training data leads to discriminatory models (e.g. facial recognition
worse for dark skin).
2. Privacy: using personal data without consent.
3. Transparency: black-box models can't explain their decisions.
4. Job displacement.
5. Deepfakes and misinformation.
6. Who is accountable when an algorithm harms someone?
THEORY
Q57. What is Hyperparameter Tuning? Grid Search vs Random Search?
Hyperparameters are settings you choose before training (like K in KNN, or depth of a Decision
Tree).
Grid Search: tries every combination of the values you list. Thorough but very slow.
Random Search: tries random combinations — much faster, often just as good.
Bayesian Optimization: uses past results to smartly decide what to try next.
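A sketch of how the candidate settings differ between the first two strategies. The parameter grid is a made-up example, and the sketch only generates candidates; training and scoring a model for each one is left out.

```python
import random
from itertools import product

param_grid = {"max_depth": [3, 5, 10], "criterion": ["gini", "entropy"]}

def grid_candidates(grid):
    """Every combination of the listed values."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

def random_candidates(grid, n_iter, seed=0):
    """n_iter combinations, each value picked at random."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_iter)]

all_combos = grid_candidates(param_grid)          # 3 * 2 = 6 settings to try
sampled = random_candidates(param_grid, n_iter=3) # only 3 settings to try
```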
THEORY
Q58. What is Semi-Supervised Learning?
Uses a small amount of labeled data + a large amount of unlabeled data. The model learns the
basic rules from labeled data, then uses the structure of unlabeled data to improve. Example: you
have 100 labeled images but 10,000 unlabeled ones. Methods: label propagation, self-training.
THEORY
Q59. What is the difference between Multi-class and Multi-label classification?
Multi-class: each sample belongs to exactly ONE class. Example: digit recognition (0-9) — a digit
can only be one number.
Multi-label: each sample can belong to MULTIPLE classes at once. Example: a news article
tagged as both 'sports' AND 'politics' simultaneously.
THEORY
Q60. What are future trends in Machine Learning?
1. AutoML — automatically selects and tunes models.
2. Federated Learning — train across devices without sharing raw data (privacy-preserving).
3. Explainable AI (XAI) — making black-box models interpretable.
4. Edge AI — running ML on phones and IoT devices.
5. Large Language Models (GPT, BERT) — foundation models.
6. RLHF — training AI from human feedback.
Unit 9 — PU Exam Style Questions
PU EXAM
Q61. What does PAC stand for? (PU Nov 2023)
PAC = Probably Approximately Correct. Introduced by Leslie Valiant in 1984. The idea: with
enough training examples, a machine can learn a hypothesis that's approximately correct most of
the time. 'Probably' refers to the confidence level; 'approximately' refers to acceptable error.
PU EXAM
Q62. Which ML algorithm is based on Bagging? (PU Nov 2023)
Random Forest! It trains multiple Decision Trees on random bootstrap samples (sampling with
replacement) and aggregates predictions by majority vote. Each tree also uses a random subset
of features at each split — this is what makes it powerful and avoids all trees being identical.
PU EXAM
Q63. What is a kernel function in SVM? Why is it required? (PU Nov 2023)
A kernel function computes similarity between two points in a higher-dimensional space without
actually transforming them (the kernel trick — saves huge computation). Required when data is
not linearly separable. RBF kernel is most popular. The kernel lets SVM handle complex, curved
decision boundaries.
PU EXAM
Q64. Explain Bayes Theorem and Bayesian Learning. (PU Apr 2024)
Bayes Theorem: P(H|E) = P(E|H) * P(H) / P(E).
P(H) = prior belief before seeing evidence.
P(H|E) = posterior belief after seeing evidence.
Bayesian Learning: combines your prior knowledge with observed data to update beliefs. Handles
noisy data well. Foundation of the Naive Bayes classifier.
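A worked example of the formula, with H = "email is spam" and E = "email contains the word free". All the probabilities are invented numbers, and P(E) is expanded using the law of total probability.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(H|E) = P(E|H) * P(H) / P(E), with P(E) summed over H and not-H."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Made-up numbers: 20% of mail is spam; "free" appears in 60% of spam
# and in 5% of non-spam.
p_spam_given_free = posterior(prior=0.2, likelihood=0.6, likelihood_if_not=0.05)
```

Seeing the word raises the belief in spam from the prior of 20% to a posterior of 75%.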
PU EXAM
Q65. Explain Q-Learning. (PU Apr 2024)
Q-Learning is a model-free, off-policy RL algorithm. It builds a Q-Table of (state, action) values.
Every time it takes an action, it updates the Q-value using: new Q = old Q + learning_rate *
[reward + discount * best_next_Q - old Q]. Because it is off-policy, it converges to the optimal
values under any exploration policy that keeps trying every (state, action) pair.
PU EXAM
Q66. Differentiate Supervised vs Unsupervised Learning. (PU Nov 2023)
Supervised: data has labels (right answers). Model learns mapping input → output. Algorithms:
Linear Regression, Decision Tree, SVM. Evaluated by accuracy, F1, MSE.
Unsupervised: no labels. Model finds hidden patterns on its own. Algorithms: K-Means, PCA,
DBSCAN. Evaluated by silhouette score, inertia.
PU EXAM
Q67. What is a Genetic Algorithm? (PU Apr 2024)
A nature-inspired optimization algorithm. Starts with a random population of solutions
(chromosomes). Fitness function scores each solution. Selection: keep the best ones. Crossover:
mix two solutions to create offspring. Mutation: randomly change part of a solution to maintain
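diversity. Repeat for many generations until a stopping rule is met (good-enough fitness or a
maximum number of generations). The result is a good solution, not a guaranteed best one.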
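A compact sketch of the loop on a toy problem (make a bit-string with as many 1s as possible). The fitness function, population size, and rates are all assumptions chosen to keep the example small.

```python
import random

def fitness(chromosome):
    return sum(chromosome)    # toy goal: as many 1s as possible

def evolve(pop_size=20, length=12, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(map(fitness, population))
    for _ in range(generations):
        # Selection: keep the fitter half unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, length)            # single-point crossover
            child = mum[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
```

Because the best parents survive each generation, the top fitness can never go down between generations.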
PU EXAM
Q68. Describe applications of ML in today's world. (PU Apr 2024)
Healthcare: disease diagnosis, MRI/X-ray analysis, drug discovery.
Finance: fraud detection, credit scoring, algorithmic trading.
NLP: chatbots, translation, sentiment analysis.
Computer Vision: face recognition, autonomous vehicles.
Recommendation Systems: Netflix, Amazon, Spotify.
Agriculture: crop yield prediction, pest detection.