MACHINE LEARNING NOTES:
MODULE 1 – INTRODUCTION TO MACHINE LEARNING (15-MARK
ANSWERS)
1. Introduction to Machine Learning (15 Marks Answer)
Machine Learning (ML) is a subset of Artificial Intelligence that enables
machines to learn patterns from data and improve performance on tasks
without being explicitly programmed. Traditional programming depends on
hard-coded rules, but ML automatically discovers these rules by analyzing
examples. The core idea is to construct models that generalize from past
observations to future unseen data.
An ML system consists of data, a model, a loss function, and an optimization algorithm.
The learning process involves identifying patterns, detecting structures, and
making predictions such as classification, regression, or clustering. ML learns
from experience (data), improves with more examples, and adapts
automatically. It powers modern applications such as recommendation systems
(Netflix, Amazon), spam detection, medical diagnosis, speech recognition, fraud
detection, and autonomous vehicles.
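The four components named above (data, model, loss function, optimization algorithm) can be sketched for a one-feature linear model. This is a minimal illustration in plain Python; the data values, learning rate, step count, and function names are assumptions chosen for the example, not part of the notes.

```python
# Data: inputs and targets generated from y = 2x + 1 (illustrative values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def predict(w, b, x):
    """Model: a straight line with slope w and intercept b."""
    return w * x + b

def mse(w, b):
    """Loss function: mean squared error over the data."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit(steps=2000, lr=0.05):
    """Optimization algorithm: batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit()
print(round(w, 3), round(b, 3), mse(w, b))
```

The learned slope and intercept approach 2 and 1, the rule that generated the data, although that rule was never written into the program.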
ML is broadly categorized into supervised learning (labeled data), unsupervised
learning (unlabeled data), semi-supervised learning, and reinforcement
learning (reward-based learning). Each category suits different types of
problems. ML contributes significantly to automation, decision-making, and
data-driven insights, becoming essential across industries.
2. Feature Engineering (15 Marks Answer)
Feature engineering refers to transforming raw data into meaningful inputs
that improve model performance. Good features directly influence accuracy,
robustness, and generalizability of ML models. It includes feature extraction,
creation, and transformation.
The process begins with understanding domain knowledge, identifying key
attributes, and converting raw data into numerical representations suitable for
ML algorithms. Techniques include handling missing values, encoding
categorical variables, normalization, scaling, creating interaction features,
dimensionality reduction (for example, PCA), and deriving time-based features.
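Three of these techniques (missing-value handling, scaling, and categorical encoding) can be sketched in plain Python. The helper names and the sample values are hypothetical, and mean imputation is only one of several possible strategies.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Encode each category as a 0/1 vector, one position per distinct value."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

ages = [20, None, 40, 60]
cities = ["Pune", "Delhi", "Pune", "Mumbai"]

ages_filled = impute_mean(ages)            # missing age becomes 40.0
ages_scaled = min_max_scale(ages_filled)   # values now lie in [0, 1]
city_vectors = one_hot(cities)             # columns: Delhi, Mumbai, Pune
print(ages_scaled, city_vectors)
```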
Feature engineering also involves selecting relevant features that reduce noise
and prevent overfitting. Strong features improve model interpretability and
reduce computational complexity. In practice, feature quality often matters at
least as much as the choice of algorithm, because a model can only perform
well if it receives high-quality inputs.
3. Learning Paradigm (15 Marks Answer)
Learning paradigms describe the ways machines learn patterns from data. The
primary paradigms include:
• Supervised Learning: Uses labeled data to perform prediction tasks like
regression and classification.
• Unsupervised Learning: Works on unlabeled data to find structure such
as clusters or associations.
• Semi-Supervised Learning: Combines small labeled and large unlabeled
datasets.
• Reinforcement Learning: Agents learn optimal actions via trial and error,
guided by rewards.
Each paradigm has different goals, methods, and applications. For example,
supervised learning is used in email filtering, unsupervised learning is used in
customer segmentation, and reinforcement learning is used in robotics. The
learning paradigm selection depends on data availability and problem nature.
4. Generalization of Hypothesis (15 Marks Answer)
Generalization refers to the model’s ability to perform well on unseen data. A
hypothesis is a function chosen by the learner from the hypothesis space to
approximate the true function. A hypothesis generalizes well if the model does
not memorize the training data but learns underlying patterns.
Overfitting occurs when a model fits noise in the training data, while
underfitting occurs when a model is too simple to capture the underlying
pattern. Techniques like regularization, cross-validation, and
early stopping help improve generalization. The quality of generalization
determines the practical usefulness of the ML model, making it a core concern
in ML theory.
5. VC Dimension (15 Marks Answer)
The Vapnik–Chervonenkis (VC) dimension measures the capacity of a model class
as the size of the largest set of points that the class can shatter. A hypothesis
class “shatters” a set if, for every possible labeling of that set, some hypothesis
in the class classifies all the points correctly. A higher VC dimension means a
more expressive class that may overfit, while a lower VC dimension indicates
limited flexibility.
The VC dimension yields theoretical bounds on the sample complexity required
for generalization. It plays a key role in statistical learning theory and the
PAC learning framework. Understanding the VC dimension helps balance the
bias-variance tradeoff and select appropriate models.
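Shattering can be checked by brute force for a very simple hypothesis class. The sketch below assumes one-dimensional threshold classifiers h_t(x) = 1 if x >= t, else 0; this class and the function names are chosen for illustration only.

```python
from itertools import product

def threshold_labelings(points):
    """All labelings of `points` produced by h_t(x) = 1 if x >= t else 0."""
    pts = sorted(points)
    # One threshold below every point, one at each point, one above every point.
    thresholds = [pts[0] - 1.0] + pts + [pts[-1] + 1.0]
    return {tuple(1 if x >= t else 0 for x in points) for t in thresholds}

def is_shattered(points):
    """True if every one of the 2^n labelings is achievable by some threshold."""
    achievable = threshold_labelings(points)
    return all(labels in achievable
               for labels in product([0, 1], repeat=len(points)))

print(is_shattered([3.0]))       # True: a single point can get either label
print(is_shattered([1.0, 2.0]))  # False: the labeling (1, 0) is unreachable
```

A single point can be shattered but no pair of points can, so the VC dimension of this threshold class is 1.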
6. Probably Approximately Correct (PAC) Learning (15 Marks Answer)
PAC learning theory defines conditions under which a learner can, with high
probability, find a hypothesis that is approximately correct. Formally, for an
accuracy parameter ε and a confidence parameter δ, the learner must output,
with probability at least 1 − δ, a hypothesis whose error is at most ε.
The PAC framework establishes sample complexity requirements, showing how
many training examples are needed to learn a concept. It assumes that training
and test examples are drawn independently from the same fixed but unknown
distribution, and its guarantees hold for any such distribution. PAC learning
forms part of the theoretical foundation of modern ML and makes precise when
learning is feasible.
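As a concrete instance of a sample complexity requirement, the sketch below evaluates the standard bound for a finite hypothesis class in the realizable case, m >= (1/ε)(ln|H| + ln(1/δ)), which suffices for a learner that outputs a hypothesis consistent with the training data. The function name and the example numbers are assumptions for illustration.

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Number of examples sufficient for a consistent learner over a finite
    hypothesis class: m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    bound = (math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon
    return math.ceil(bound)

# 1000 hypotheses, error at most 10%, confidence at least 95%.
print(pac_sample_size(1000, epsilon=0.1, delta=0.05))   # 100
```

Halving ε roughly doubles the required sample size, whereas the dependence on |H| and δ is only logarithmic.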
7. Applications of Machine Learning (15 Marks Answer)
ML is widely used in various domains:
• Healthcare (disease diagnosis, medical imaging)
• Finance (fraud detection, credit scoring)
• E-commerce (recommendation engines)
• NLP (translation, sentiment analysis)
• Autonomous Driving (object detection)
• Cybersecurity (anomaly detection)
• Manufacturing (predictive maintenance)
• Robotics and automation
ML’s flexibility, accuracy, and predictive power make it important for innovation
across many sectors.
MODULE 2 – Data Handling and Artificial Neural Networks (15-Marks
Answer)
Data handling is a critical step in ML, as the performance of any model depends
heavily on the quality and structure of the input data. Feature selection
mechanisms aim to reduce dimensionality by keeping only the most relevant
features. Techniques include filter methods (correlation, chi-square test),
wrapper methods (forward selection, backward elimination), and embedded
methods (LASSO). Feature selection reduces overfitting and training time and
improves interpretability.
Imbalanced data is a common problem in which one class has significantly more
samples than the others, as in fraud detection or medical diagnosis. Handling
imbalance requires techniques like oversampling (SMOTE), undersampling,
cost-sensitive learning, and using evaluation metrics such as F1-score and ROC-
AUC instead of accuracy.
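The reason accuracy misleads on imbalanced data can be shown with a small sketch. The labels below are made up (95 negatives, 5 positives), and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating label 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no true positives, so F1 is taken as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0] * 95 + [1] * 5                     # 5% positive class
always_negative = [0] * 100                     # never predicts the rare class
catches_some = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2

print(accuracy(y_true, always_negative), f1_score(y_true, always_negative))
print(accuracy(y_true, catches_some), f1_score(y_true, catches_some))
```

The classifier that never predicts the rare class still reaches 95% accuracy, but its F1-score is 0; the second classifier has similar accuracy and a much higher F1-score.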
Outlier detection is another key preprocessing task, identifying data points that
deviate significantly from the rest. Outliers may indicate errors, fraud, or rare
events. Techniques include statistical methods (z-score, IQR), density-based
methods (DBSCAN, LOF), and model-based approaches.
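A minimal sketch of the IQR rule using Python's standard library follows. The 1.5 multiplier is the conventional choice, and the sample readings are invented for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 100]
print(iqr_outliers(readings))   # [100]
```

The IQR rule is less affected by the outlier itself than a z-score is, because quartiles change little when one extreme value is added.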
Artificial Neural Networks (ANNs) are inspired by biological neurons. An ANN
consists of layers of interconnected nodes (neurons) that compute weighted
sums of inputs followed by an activation function (ReLU, sigmoid). Networks
can have input layers, hidden layers, and output layers. ANNs learn through a
process called backpropagation: the error between predicted and actual output
is propagated backward through the network using the chain rule, giving the
partial derivative of the loss function with respect to every weight. Gradient
descent then uses these derivatives to update the weights. Because
backpropagation reuses intermediate results layer by layer, training is efficient.
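The backward pass can be sketched for the smallest possible network: one input, one sigmoid hidden unit, and one linear output, trained with a squared-error loss. The architecture, initial weights, learning rate, and data are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: input -> one sigmoid hidden unit -> linear output."""
    w1, b1, w2, b2 = params
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    return h, y

def gradients(params, x, target):
    """Backward pass: chain rule applied from the output back to the input."""
    w1, b1, w2, b2 = params
    h, y = forward(params, x)
    d_y = y - target              # dL/dy for L = 0.5 * (y - target)^2
    d_w2, d_b2 = d_y * h, d_y     # output-layer weight and bias
    d_h = d_y * w2                # error propagated back to the hidden unit
    d_z = d_h * h * (1.0 - h)     # through the sigmoid derivative
    d_w1, d_b1 = d_z * x, d_z     # hidden-layer weight and bias
    return [d_w1, d_b1, d_w2, d_b2]

def mean_loss(params, data):
    return sum(0.5 * (forward(params, x)[1] - t) ** 2 for x, t in data) / len(data)

def train(data, params, lr=0.1, epochs=500):
    """Stochastic gradient descent using the backpropagated gradients."""
    params = list(params)
    for _ in range(epochs):
        for x, t in data:
            grads = gradients(params, x, t)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params

data = [(-2.0, 0.0), (-1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
initial = [0.5, 0.0, 0.5, 0.0]
trained = train(data, initial)
print(mean_loss(initial, data), mean_loss(trained, data))
```

Each quantity in `gradients` reuses the one computed just before it, which is the layer-by-layer reuse that makes backpropagation cheaper than differentiating each weight separately.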
Applications of ANN include image recognition, speech processing, natural
language processing, autonomous driving, recommendation systems, and
medical diagnosis. Deep neural networks, a special class of ANN, have
dramatically advanced ML performance in many complex tasks.
MODULE 3 – ML Models and Evaluation (15-Marks Answer)
Regression is a supervised learning technique used to predict continuous
values. Multivariable regression extends simple linear regression to multiple
features. Its objective is to minimize the prediction error. Techniques like least
squares regression compute optimal coefficients that minimize the sum of
squared errors. To improve generalization, regularization techniques such as L1
(LASSO) and L2 (Ridge) are applied. LASSO performs feature selection by
shrinking some coefficients to zero.
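The different effects of L2 and L1 penalties can be seen in closed form for the simplest case. The sketch assumes a single feature and no intercept, with objectives sum((y - w*x)^2) + lam*w^2 for Ridge and sum((y - w*x)^2) + lam*|w| for LASSO; the data and penalty values are illustrative.

```python
import math

def ols_slope(xs, ys):
    """Least squares: minimizes sum((y - w*x)^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, lam):
    """Ridge: minimizes sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """LASSO: minimizes sum((y - w*x)^2) + lam * |w| via soft-thresholding."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(ols_slope(xs, ys))          # 2.0
print(ridge_slope(xs, ys, 14.0))  # 1.0: shrunk, but not zero
print(lasso_slope(xs, ys, 28.0))  # 1.0: shrunk by a constant amount
print(lasso_slope(xs, ys, 60.0))  # 0.0: coefficient removed entirely
```

Ridge divides the coefficient by a larger denominator, so it shrinks toward zero without reaching it, while the LASSO threshold sets the coefficient exactly to zero once the penalty is large enough.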
Regression finds applications in predicting housing prices, stock market trends,
sales forecasting, temperature prediction, and demand forecasting.
Classification models categorize data into discrete classes. Popular methods
include:
1. K-Nearest Neighbors (KNN) – A distance-based method that assigns
labels based on nearest neighbours.
2. Naïve Bayes – Uses Bayes’ theorem with the assumption of feature
independence; widely used in spam detection and text classification.
3. Support Vector Machines (SVM) – Finds the optimal hyperplane that
separates classes with maximum margin; works well with high-
dimensional data.
4. Decision Trees – Use a tree-like structure to model decisions; easy to
interpret.
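Of the four methods, KNN is the shortest to sketch. The version below assumes Euclidean distance and simple majority voting; the points, labels, and function name are illustrative.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.5, 1.5)))   # a
print(knn_predict(points, labels, (6.5, 6.5)))   # b
```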
Training and testing classifier models require splitting data into training and
testing sets. To avoid bias or overfitting, cross-validation (especially k-fold CV)
is used. Evaluation metrics include precision, recall, F1-measure, accuracy, and
AUC (area under the ROC curve). AUC summarizes the performance of a classifier
across all decision thresholds.
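The splitting step of k-fold cross-validation can be sketched as an index generator. This version keeps samples in their original order; shuffling before splitting is usually advisable and is omitted here for brevity. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_indices(10, 3):
    print(test)
```

Every sample is used for testing exactly once and for training in the other k - 1 folds; the model's score is then averaged over the folds.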
Statistical decision theory provides a framework for optimal decision-making
under uncertainty. It includes discriminant functions and decision surfaces that
separate classes. These mathematical tools help understand the geometric and
probabilistic foundations of classification algorithms.
MODULE 4 – Model Assessment, Ensemble Learning & Inference (15-
Marks Answer)
Model assessment involves determining how well a model generalizes to
unseen data. It includes cross-validation, error analysis, and performance
metrics. Model selection is about choosing the best model from a set of
candidates based on validation performance.
Ensemble learning improves prediction accuracy by combining multiple
models. Two major ensemble methods are bagging and boosting.
Bagging (Bootstrap Aggregating) reduces variance by training multiple models
on different bootstrap samples of data and averaging their predictions. The
most popular example is the Random Forest algorithm, which constructs
multiple decision trees.
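Bagging can be sketched with any base model. The example below assumes a deliberately simple one (a least-squares line through the origin) instead of a decision tree, and uses made-up data; the function names, model count, and seed are illustrative.

```python
import random

def fit_slope(xs, ys):
    """Base model: least-squares line through the origin, as a predictor."""
    w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: w * x

def bagging_predict(xs, ys, query, n_models=25, seed=0):
    """Train each model on a bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        model = fit_slope([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(model(query))
    return sum(predictions) / n_models

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 5.9, 9.2, 11.8, 15.1]   # roughly y = 3x with noise
print(bagging_predict(xs, ys, query=2.0))
```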
Boosting trains models sequentially, with each new model focusing on the
errors of the previous ones. AdaBoost assigns higher weights to misclassified
samples, while Gradient Boosting fits each new model to the residual errors
(the negative gradients of the loss) of the current ensemble. Boosting often
achieves excellent accuracy but can overfit noisy data.
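The sample reweighting used by AdaBoost can be sketched for a single round. The weak learner itself is left out; the list of which samples it misclassified is supplied by hand, and the function name is hypothetical.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost round: return the model weight alpha and new sample weights."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)   # requires 0 < error < 1
    scaled = [w * math.exp(alpha if wrong else -alpha)
              for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.2] * 5
alpha, new_weights = adaboost_reweight(
    weights, [True, False, False, False, False])
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.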
Model inference and averaging combine the predictions of multiple models to
reduce variance and stabilize performance. Bayesian model averaging weights
each model’s prediction by its posterior probability given the data, which
accounts for uncertainty about which model is correct.
Bayesian theory provides a probabilistic framework for learning. It
updates prior beliefs using observed data to produce posterior probabilities.
Bayesian methods handle uncertainty explicitly, and priors act as a form of
regularization that helps reduce overfitting.
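The prior-to-posterior update has a closed form when a Beta prior is combined with coin-flip (Bernoulli) data. The prior parameters and the counts below are assumptions chosen for illustration.

```python
def beta_posterior(prior_a, prior_b, successes, failures):
    """Beta(a, b) prior plus Bernoulli data gives Beta(a + s, b + f)."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Uniform prior Beta(1, 1), then 7 successes and 3 failures are observed.
a, b = beta_posterior(1, 1, successes=7, failures=3)
print(a, b, beta_mean(a, b))
```

The posterior mean (8/12, about 0.667) sits between the prior mean of 0.5 and the raw frequency of 0.7, which shows how a prior tempers conclusions drawn from a small sample.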
The Expectation-Maximization (EM) algorithm is an iterative method used
when data has missing or latent variables. It alternates between the
Expectation (E) step, which estimates hidden variables, and the Maximization
(M) step, which updates parameters. EM is widely used in clustering (Gaussian
Mixture Models) and probabilistic inference.
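The E and M steps can be sketched for a two-component Gaussian mixture in one dimension. The data, starting means, fixed iteration count, and variance floor are assumptions for the example; a practical implementation would also monitor the log-likelihood for convergence.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, initial_means, iterations=50):
    """EM for a two-component 1D Gaussian mixture."""
    means = list(initial_means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                 for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)   # floor avoids a collapsed component
    return weights, means, variances

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_gmm_1d(data, initial_means=[0.0, 6.0])
print(weights, means, variances)
```

The hidden variable here is which component generated each point; the responsibilities computed in the E-step are soft estimates of it.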
MODULE 5 – Hidden Markov Models (15-Marks Answer)
Hidden Markov Models (HMMs) are statistical models used to analyze
sequential or time-series data where the system has hidden states and
observable outputs. An HMM is defined by states, transition probabilities,
emission probabilities, and initial state distribution. It assumes the Markov
property, meaning the next state depends only on the current state.
Two major algorithms used in HMM are the Forward-Backward algorithm and
the Viterbi algorithm.
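• The Forward-Backward algorithm computes the posterior probability of
each hidden state at each time step given the whole observation
sequence; its forward pass alone gives the probability of the observations
given the model. It is used in Baum-Welch (EM) training of HMM
parameters.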
• The Viterbi algorithm finds the most likely sequence of hidden states for
a given observation sequence.
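The Viterbi algorithm can be sketched with dynamic programming over a small HMM. The two-state health model and all of its probabilities below are illustrative assumptions, not values from the notes.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path

states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

prob, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)   # ['Healthy', 'Healthy', 'Fever'], about 0.01512
```

Replacing `max` with a sum over previous states in the inner loop would turn the recursion into the forward pass, which gives the total probability of the observations instead of the best single path.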
HMMs are widely used for sequence classification, where sequences such as
speech, text, biological signals, or sensor readings must be categorized.
However, HMMs have limitations in capturing long-range dependencies.
Conditional Random Fields (CRFs) are discriminative models that overcome
some limitations of HMMs by modelling the conditional probability of the label
sequence given the observations directly, without the HMM assumption that
each observation depends only on the current hidden state. CRFs are widely
used for structured prediction tasks.
Applications include speech recognition, handwriting recognition, part-of-
speech tagging, gene sequence analysis, activity recognition, and machine
translation.
MODULE 6 – Association Rules (15-Marks Answer)
Association rule mining discovers interesting relationships among variables in
large datasets. It is widely used in market basket analysis to find patterns like
“customers buying bread also buy butter.”
Basic concepts include support, confidence, and lift.
• Support measures how frequently an itemset appears.
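• Confidence measures how often the consequent appears in transactions
that contain the antecedent.
• Lift compares a rule’s confidence with the value expected if the items were
independent; a lift above 1 indicates a positive association.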
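The three measures can be computed directly from a list of transactions, using the definitions support(X) = fraction of transactions containing X, confidence(A → B) = support(A ∪ B) / support(A), and lift(A → B) = confidence(A → B) / support(B). The five baskets below are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(A and B together) / support(A)."""
    return (support(transactions, antecedent | consequent)
            / support(transactions, antecedent))

def lift(transactions, antecedent, consequent):
    """confidence(A -> B) / support(B); 1.0 means A and B look independent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]
bread, butter = {"bread"}, {"butter"}
print(support(baskets, bread | butter))    # 0.4
print(confidence(baskets, bread, butter))  # about 0.667
print(lift(baskets, bread, butter))        # about 1.11
```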
Mining frequent patterns efficiently is essential due to the enormous search
space. Two main algorithms are used:
1. Apriori Algorithm – Uses a bottom-up approach where frequent
itemsets are generated iteratively. It uses the apriori property: if an
itemset is frequent, all its subsets must also be frequent. While simple
and effective, it may require many scans of the database.
2. FP-Growth Algorithm – An improved method that eliminates candidate
generation. It uses a compact structure called the FP-tree to store data
and recursively mines frequent patterns. FP-Growth is faster and more
scalable for large datasets.
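The level-wise search of Apriori can be sketched compactly. This version recomputes support by rescanning the transaction list, which illustrates the repeated database scans mentioned above; the baskets, threshold, and function name are assumptions for the example, and FP-Growth is not shown.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support}, found level by level."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step (apriori property): every k-subset must be frequent.
        current_set = set(current)
        current = [c for c in candidates
                   if all(frozenset(sub) in current_set
                          for sub in combinations(c, k))
                   and support(c) >= min_support]
        k += 1
    return frequent

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]
for itemset, s in apriori(baskets, min_support=0.4).items():
    print(sorted(itemset), s)
```

The candidate {bread, butter, milk} is discarded in the prune step without counting it, because its subset {bread, milk} is not frequent.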
Association rule mining is widely applied in e-commerce recommendation
systems, bioinformatics, social network analysis, fraud detection, and intrusion
detection systems.
MODULE 7 – Clustering (15-Marks Answer)
Clustering is an unsupervised learning technique used to group similar data
points. It reveals patterns in data without labelled examples.
The most common algorithm is K-Means, which partitions data into k clusters
by minimizing within-cluster variance. It iteratively assigns points to the nearest
cluster center and updates centroids. K-Means is efficient but sensitive to initial
seeds and outliers.
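The assign-and-update loop of K-Means can be sketched in plain Python. Initial centroids are passed in explicitly to keep the example deterministic, and a fixed number of iterations replaces a convergence check; the points and function name are illustrative.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for i, cluster in enumerate(clusters):
            if cluster:
                updated.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                updated.append(centroids[i])   # keep the centroid of an empty cluster
        centroids = updated
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

Different starting centroids can lead to different final clusters, which is the sensitivity to initial seeds noted above.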
Hierarchical clustering builds a tree-like structure (dendrogram).
• Single linkage merges clusters based on the minimum distance between
points.
• Complete linkage uses maximum distance.
• Average linkage considers average distances.
Hierarchical clustering is useful when the number of clusters is unknown.
Ward’s algorithm merges, at each step, the pair of clusters that gives the
smallest increase in total within-cluster variance, tending to produce compact,
roughly spherical clusters.
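Agglomerative clustering with single or complete linkage can be sketched for one-dimensional points. Passing `min` as the linkage gives single linkage and `max` gives complete linkage; Ward's criterion is not implemented here. The points and function name are assumptions for the example.

```python
def agglomerative(points, n_clusters, linkage=min):
    """Merge the two closest clusters until `n_clusters` remain.
    linkage=min is single linkage; linkage=max is complete linkage."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [1, 2, 3, 10, 11, 12, 30]
print(agglomerative(points, 3))               # [[1, 2, 3], [10, 11, 12], [30]]
print(agglomerative(points, 3, linkage=max))
```

Recording the distance at each merge, instead of stopping at a fixed cluster count, would give the heights needed to draw the dendrogram.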
Minimum Spanning Tree (MST) clustering constructs an MST and removes long
edges to form clusters.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is
designed for very large datasets. It incrementally builds a clustering feature
tree and is highly scalable.
Applications of clustering include customer segmentation, anomaly detection,
image compression, document clustering, biological taxonomy, and social
network analysis.
MODULE 8 – Recent Trends in ML (15-Marks Answer)
Recent advances in ML have significantly expanded its real-world impact. Deep
learning has transformed computer vision, speech processing, and NLP through
architectures such as CNNs, RNNs, LSTMs, and Transformers. Large language
models (LLMs) such as GPT have enabled human-like text generation, and
encoder models such as BERT have improved natural language understanding.
Automated Machine Learning (AutoML) automates model selection,
hyperparameter tuning, and feature engineering. It reduces the need for
expert intervention.
Edge AI enables ML models to run on low-power devices like smartphones, IoT
sensors, and drones, improving privacy and latency.
Explainable AI (XAI) has gained importance due to ethical and legal
requirements. Tools like SHAP and LIME help interpret model decisions.
Other major trends include federated learning, quantum machine learning,
reinforcement learning in robotics, healthcare AI, and ML fairness and
accountability.
Case studies demonstrate ML’s transformative applications in autonomous
driving, real-time fraud detection, precision agriculture, industrial automation,
climate modelling, healthcare diagnostics, and personalized recommendations.