1. (a) What is a confusion matrix?
A confusion matrix is a simple table used to measure how well a classification model is performing. It compares the
predictions made by the model with the actual results and shows where the model was right or wrong. This helps
you understand where the model is making mistakes so you can improve it. For a binary classifier, it breaks the
predictions down into four categories:
True Positive (TP): The model correctly predicted a positive outcome, i.e., the actual outcome was positive.
True Negative (TN): The model correctly predicted a negative outcome, i.e., the actual outcome was
negative.
False Positive (FP): The model incorrectly predicted a positive outcome, i.e., the actual outcome was
negative. It is also known as a Type I error.
False Negative (FN): The model incorrectly predicted a negative outcome, i.e., the actual outcome was
positive. It is also known as a Type II error.
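The four categories above can be counted directly from paired lists of actual and predicted labels. The following is a minimal sketch; the function name `confusion_counts` and the sample labels are illustrative assumptions, not part of any library.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1  # predicted negative, actually negative
        elif p == positive and a != positive:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels: 1 = positive, 0 = negative
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)
```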
(b) What is Distribution of Error?
The distribution of errors is a key concept in evaluating the performance of a machine learning model. Errors refer
to the difference between the predicted values and the actual values (or ground truth) from a dataset. Analyzing the
distribution of errors helps in understanding the behavior of the model and can provide insights into whether the
model is performing well or if there are issues like bias, variance, or outliers.
Types of Errors
Bias: This refers to the error introduced by the model's assumptions. High bias means the model is too simplistic
and underfits the data. Underfitting typically occurs when the model cannot capture the underlying patterns in the
data.
Variance: This refers to the error introduced by the model's sensitivity to small fluctuations in the training data.
High variance means the model is too complex and overfits the data, capturing noise and fluctuations as if they were
real patterns.
Residuals: The residuals are the differences between the actual values and the predicted values, i.e., the errors made
by the model.
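As a small illustration of the residuals described above, the sketch below computes them for a handful of made-up values and summarizes their distribution. The numbers are hypothetical; a mean residual far from zero would hint at systematic over- or under-prediction, while a large spread would hint at inconsistent predictions.

```python
from statistics import mean, pstdev

# Hypothetical actual and predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.0, 9.0]

# Residual = actual value - predicted value
residuals = [a - p for a, p in zip(actual, predicted)]

mean_error = mean(residuals)  # average signed error
spread = pstdev(residuals)    # how widely the errors are scattered

print(residuals, mean_error, spread)
```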
(c) What is the purpose of Clustering?
Clustering is an unsupervised machine learning technique that groups similar data points together into clusters based
on their characteristics, without using any labeled data. The objective is to ensure that data points within the same
cluster are more similar to each other than to those in different clusters, enabling the discovery of natural groupings
and hidden patterns in complex datasets.
Goal: Discover the natural grouping or structure in unlabeled data without predefined categories.
How: Data points are assigned to clusters based on similarity or distance measures.
Similarity Measures: Can include Euclidean distance, cosine similarity or other metrics depending
on data type and clustering method.
Output: Each group is assigned a cluster ID, representing shared characteristics within the cluster.
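The similarity measures and the cluster-ID output mentioned above can be sketched as follows. This is only an illustration: the cluster centers are fixed by hand rather than learned, and the helper names are assumptions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_cluster(point, centers):
    """Return the ID (index) of the nearest cluster center."""
    distances = [euclidean_distance(point, c) for c in centers]
    return distances.index(min(distances))


# Hypothetical, hand-picked centers and points
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5)]
cluster_ids = [assign_cluster(p, centers) for p in points]
print(cluster_ids)
```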
(d) What is Reinforcement Learning?
Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents can learn to make
decisions through trial and error to maximize cumulative rewards. RL allows machines to learn by interacting with
an environment and receiving feedback based on their actions. This feedback comes in the form of rewards or
penalties.
Reinforcement Learning revolves around the idea that an agent (the learner or decision-maker) interacts with an
environment to achieve a goal. The agent performs actions and receives feedback to optimize its decision-making
over time.
Agent: The decision-maker that performs actions.
Environment: The world or system in which the agent operates.
State: The situation or condition the agent is currently in.
Action: The possible moves or decisions the agent can make.
Reward: The feedback or result from the environment based on the agent’s action.
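The interaction between agent, action and reward can be illustrated with a deliberately simplified sketch. It assumes a single-state environment in which each action always returns the same reward (a hypothetical setup), so it shows only the trial-and-error loop, not a full RL algorithm.

```python
def run_agent(action_rewards, steps=10):
    """Try every action once, then repeat the action with the best estimate."""
    estimates = [0.0] * len(action_rewards)  # agent's reward estimate per action
    counts = [0] * len(action_rewards)
    total_reward = 0.0
    for step in range(steps):
        if step < len(action_rewards):
            action = step  # trial phase: try each action once
        else:
            # exploit phase: pick the action with the highest estimate
            action = max(range(len(action_rewards)), key=lambda a: estimates[a])
        reward = action_rewards[action]  # feedback from the environment
        counts[action] += 1
        # running average of the rewards observed for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, total_reward


# Hypothetical environment: three actions with fixed rewards
estimates, total_reward = run_agent([1.0, 5.0, 2.0], steps=10)
print(estimates, total_reward)
```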
(e) What is Regression in Machine Learning?
Regression in machine learning is a supervised learning technique used to predict continuous numerical values by
learning relationships between input variables (features) and an output variable (target). It helps understand how
changes in one or more factors influence a measurable outcome and is widely used in forecasting, risk analysis,
decision-making and trend estimation.
Works with real-valued output variables.
Helps identify the strength and type of relationships between variables.
Supports both simple and complex predictive models.
Used for tasks like price prediction, trend forecasting and risk scoring.
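As an illustration of learning a relationship between one feature and a continuous target, the sketch below fits a straight line by ordinary least squares. The house sizes and prices are made-up values, and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size (square metres) and price (thousands)
sizes = [50, 60, 70, 80]
prices = [150, 180, 210, 240]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept  # prediction for an unseen size
print(slope, intercept, predicted_price)
```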
(f) What is semi-supervised learning?
Semi-supervised learning is a hybrid machine learning approach which uses both supervised and unsupervised
learning. It uses a small amount of labelled data combined with a large amount of unlabelled data to train models.
The goal is to learn a function that accurately predicts outputs based on inputs, similar to supervised learning, but
with much less labelled data.
Semi-supervised learning is particularly valuable when acquiring labelled data is expensive or time-consuming, yet
unlabelled data is plentiful and easy to collect.
Supervised learning: Like a student who is taught every concept by a teacher, both in class and at home.
Unsupervised learning: Like a student who works out concepts independently, without any instruction, for
example by solving a math problem alone.
Semi-supervised learning: A mix in which the teacher explains some concepts in class and the student
practices with homework assignments based on those concepts.
Unit-I
2. What is Machine Learning? Discuss the need for and features of Machine Learning.
Machine learning is a branch of artificial intelligence that enables algorithms to uncover hidden patterns within
datasets. It allows them to make predictions on new, similar data without explicit programming for each task. Machine learning
finds applications in diverse fields such as image and speech recognition, natural language processing,
recommendation systems, fraud detection, portfolio optimization, and automating tasks.
Handles Massive Data: Machine learning works well with large data and finds patterns that humans might
miss.
Adapts Dynamically: Systems evolve with new data, staying relevant in changing environments.
Drives Smarter Decisions: From predicting customer behavior to detecting fraud, ML enhances decision-
making with data-driven insights.
Personalizes Experiences: Recommendation systems, like those on Netflix or Amazon, tailor suggestions
to individual preferences.
Need for Machine Learning
Machine Learning is needed because traditional programming methods are often insufficient to handle complex,
large-scale, and dynamic problems. Key reasons include:
1. Handling Large Volumes of Data
Modern systems generate massive amounts of data (big data) that cannot be efficiently analyzed using
manual or rule-based approaches.
2. Automation of Decision-Making
ML enables systems to make decisions automatically without explicit programming, reducing human effort
and errors.
3. Learning from Experience
ML systems improve their performance over time by learning from past data, unlike static traditional
programs.
4. Complex Problem Solving
Problems like speech recognition, image processing, fraud detection, and recommendation systems are too
complex for fixed rules.
5. Adaptability to Change
ML models can adapt to changing environments and patterns, such as changing customer behavior or
market trends.
6. Cost and Time Efficiency
Once trained, ML systems can process data and make predictions faster and at lower cost than manual
methods.
Features of Machine Learning
1. Data-Driven Learning
ML algorithms learn patterns directly from data rather than relying on explicitly written rules.
2. Self-Improvement
Performance improves as the system is exposed to more data and experience.
3. Generalization Ability
ML models can make accurate predictions on unseen or new data.
4. Automation
Reduces the need for human intervention in repetitive or complex tasks.
5. Scalability
ML systems can handle increasing amounts of data efficiently.
6. Pattern Recognition
ML excels at identifying hidden patterns and relationships in data.
7. Multiple Learning Types
Supports supervised, unsupervised, semi-supervised, and reinforcement learning methods.
8. Real-Time Processing
ML can be applied in real-time applications such as spam detection and recommendation engines.
3. (a) Explain the life cycle of Machine Learning
The machine learning lifecycle is a structured process that defines how machine learning (ML) models are developed,
deployed and maintained. It consists of a series of steps that ensure the model is accurate, reliable and scalable.
It includes defining the problem, collecting and preparing data, exploring patterns, engineering features, training
and evaluating models, deploying them into production and continuously monitoring performance to handle issues
like data drift and retraining needs. Below are the key steps of the ML lifecycle:
Step 1: Problem Definition
The first step is identifying and clearly defining the business problem. A well-framed problem provides the
foundation for the entire lifecycle. The project objectives, desired outcomes and scope of the task are defined
during this stage.
Collaborate with stakeholders to understand business goals
Define project objectives, scope and success criteria
Ensure clarity in desired outcomes
Step 2: Data Collection
The data collection phase involves systematically gathering datasets that serve as raw data for training the model. The
quality and variety of the data directly affect the model’s performance.
Here are some basic features of Data Collection:
Relevance: Collected data should be relevant to the defined problem and include the necessary features.
Quality: Ensure data quality by considering factors like accuracy and ethical use.
Quantity: Gather sufficient data volume to train a robust model.
Diversity: Include diverse datasets to capture a broad range of scenarios and patterns.
Step 3: Data Cleaning and Preprocessing
Raw data is often messy and unstructured, and training on it directly can lead to poor accuracy.
Data cleaning and preprocessing are therefore needed, which often involves:
Data Cleaning: Address issues such as missing values, outliers and inconsistencies in the data.
Data Preprocessing: Standardize formats, scale values and encode categorical variables for consistency.
Data Quality: Ensure that the data is well-organized and prepared for meaningful analysis.
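The cleaning and preprocessing tasks listed above can be sketched with three small helpers: filling missing values with the mean, scaling values to the range 0 to 1, and one-hot encoding a categorical column. The helper names and sample values are illustrative assumptions; real projects often use a library for these steps.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categories as 0/1 vectors, one column per sorted category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical raw columns
ages = [20, None, 40, 30]
ages_filled = impute_mean(ages)
ages_scaled = min_max_scale(ages_filled)
colors_encoded = one_hot(["red", "blue", "red"])
print(ages_filled, ages_scaled, colors_encoded)
```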
Step 4: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is used to uncover patterns and characteristics hidden in the data and to
understand the dataset's structure. EDA reveals patterns, trends and insights that may not be visible to the
naked eye. These insights can be used to make informed decisions.
Here are the basic features of Exploratory Data Analysis:
Exploration: Use statistical and visual tools to explore patterns in data.
Patterns and Trends: Identify underlying patterns, trends and potential challenges within the dataset.
Insights: Gain valuable insights for informed decision-making in later stages.
Decision Making: Use EDA findings to guide feature engineering and model selection.
Step 5: Feature Engineering and Selection
Feature engineering and selection is the process of creating, transforming and selecting features so that only
relevant ones are kept, which improves model efficiency and prediction while reducing complexity.
Here are the basic features of Feature Engineering and Selection:
Feature Engineering: Create new features or transform existing ones to capture better patterns and
relationships.
Feature Selection: Identify the subset of features that most significantly impacts the model's performance.
Domain Expertise: Use domain knowledge to engineer features that contribute meaningfully to
prediction.
Optimization: Balance the feature set for accuracy while minimizing computational complexity.
Step 6: Model Selection
Model selection is an important step because we need a model that fits the defined problem, the nature of the
data, the complexity of the problem and the desired outcomes.
Here are the basic features of Model Selection:
Complexity: Consider the complexity of the problem and the nature of the data when choosing a model.
Decision Factors: Evaluate factors like performance, interpretability and scalability when selecting a
model.
Experimentation: Experiment with different models to find the best fit for the problem.
Step 7: Model Training
With the model selected, the machine learning lifecycle moves to the model training process. This process involves
exposing the model to historical data, allowing it to learn patterns, relationships and dependencies within the dataset.
Here are the basic features of Model Training:
Iterative Process: Train the model iteratively, adjusting parameters to minimize errors and enhance
accuracy.
Optimization: Fine-tune the model to optimize its predictive capabilities.
Validation: Check the model on held-out validation data during training to ensure it stays accurate on new, unseen data.
Step 8: Model Evaluation and Tuning
Model evaluation involves rigorous testing against validation or test datasets to measure the model's accuracy on new,
unseen data. It provides insights into the model's strengths and weaknesses. If the model fails to achieve the desired
performance levels, we may need to tune it again and adjust its hyperparameters to enhance predictive accuracy.
Here are the basic features of Model Evaluation and Tuning:
Evaluation Metrics: Use metrics like accuracy, precision, recall and F1 score to evaluate model
performance.
Strengths and Weaknesses: Identify the strengths and weaknesses of the model through rigorous testing.
Iterative Improvement: Initiate model tuning to adjust hyperparameters and enhance predictive accuracy.
Model Robustness: Repeat tuning until the model reaches the desired levels of robustness and reliability.
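The evaluation metrics named above (accuracy, precision, recall and F1 score) follow directly from the four confusion-matrix counts. The sketch below uses their standard definitions; the counts passed in are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts from a test set of 100 examples
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```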
Step 9: Model Deployment
The model is now ready for deployment in a real-world application. Deployment involves integrating the predictive
model with existing systems, allowing the business to use it for informed decision-making.
Here are the basic features of Model Deployment:
Integrate with existing systems
Enable decision-making using predictions
Ensure deployment scalability and security
Provide APIs or pipelines for production use
Step 10: Model Monitoring and Maintenance
After deployment, models must be monitored to ensure they perform well over time. Regular tracking helps detect
data drift, accuracy drops or changing patterns, and retraining may be needed to keep the model reliable in
real-world use.
Here are the basic features of Model Monitoring and Maintenance:
Track model performance over time
Detect data drift or concept drift
Update and retrain the model when accuracy drops
Maintain logs and alerts for real-time issues
(b) What are the applications of Machine Learning?
Machine Learning (ML) is one of the most significant advancements in the field of technology. It gives machines
the ability to learn from data and improve over time without being explicitly programmed. ML models identify
patterns from data and use them to make predictions or decisions.
Organizations use machine learning to automate tasks, make smarter decisions and gain valuable insights. ML is
shaping the world around us. Here are a few real-world applications of Machine Learning:
1. Healthcare and Medical Diagnosis
ML algorithms can analyze large volumes of patient data, medical scans and genetic information to aid in diagnosis
and treatment.
Applications:
Disease Detection: ML models are used to identify diseases like cancer, pneumonia and Parkinson’s from
medical images. On some specific tasks, studies have reported accuracy comparable to that of human specialists.
Predictive Analytics: By analyzing patient history and symptoms, models can predict the risk of certain
diseases or potential complications.
Drug Discovery: ML accelerates the drug development process by predicting how different compounds
will interact, reducing the time and cost of research.
2. Smart Assistants and Human-Machine Interaction
Virtual assistant systems rely on natural language processing (NLP) and speech recognition to understand
commands and respond intelligently.
Applications:
Voice Assistants: Tools like Siri, Alexa and Google Assistant convert spoken input into actionable
commands.
Voice Search & Transcription: ML enables users to perform hands-free web searches and get
transcription during meetings or phone calls.
Chatbots: Businesses use AI-powered chatbots for 24/7 customer support, helping resolve queries faster
and more efficiently.
3. Personalized Recommendations and User Experience
Modern digital platforms personalize content using recommender systems. Machine learning
models analyze user behavior to deliver relevant content, improving engagement and satisfaction.
Applications:
Streaming Platforms: Netflix and Spotify suggest shows and songs based on your watching or listening
history.
E-commerce: Sites like Amazon recommend products tailored to your preferences, browsing patterns and
past purchases.
Social Media: Algorithms curate content feeds, prioritize posts and suggest friends or pages.
These systems use techniques like collaborative filtering and content-based filtering to create personalized digital
experiences.
4. Fraud Detection and Financial Forecasting
In finance, vast sums of money move digitally, and machine learning plays an important role in fraud detection and
market analysis.
Applications:
Transaction Monitoring: Banks use ML models to detect unusual spending behavior and flag suspicious
transactions.
Loan Risk Assessment: Credit scoring models analyze customer profiles and predict the likelihood of
default.
Stock Market Prediction: ML is used to analyze historical stock data and forecast price movements.
Because stock markets are complex, algorithmic trading uses these predictions to support decision-making.
5. Autonomous Vehicles and Smart Mobility
Self-driving vehicles use ML to understand their environment, navigate safely and make immediate decisions.
Key Components:
Computer Vision: Recognizing lanes, pedestrians, traffic signals and obstacles.
Sensor Fusion: Combining data from cameras, LiDAR and radar for a 360-degree view.
Behavior Prediction: Anticipating how other drivers or pedestrians may act.
Autonomous vehicles are capable of operating with minimal human input. Beyond cars, ML is also being used in
traffic optimization, smart navigation systems and predictive maintenance in transportation.
Unit-II
4. (a) What is the Naïve Bayes Theorem? Explain.
Naive Bayes is a machine learning classification algorithm that predicts the category of a data point using
probability. It assumes that all features are independent of each other. Naive Bayes performs well in many real-
world applications such as spam filtering, document categorisation and sentiment analysis.
Here:
The original data has two classes: green circles (y = 1) and red squares (y = 2).
Estimate the probability distribution along the first dimension, i.e., $P(x_1 \mid y=1)$ and $P(x_1 \mid y=2)$.
Estimate the probability distribution along the second dimension, i.e., $P(x_2 \mid y=1)$ and $P(x_2 \mid y=2)$.
Combine both dimensions using conditional independence, i.e., $P(x \mid y) = \prod_{\alpha} P(x_\alpha \mid y)$.
Key Features of Naive Bayes Classifiers
The main idea behind the Naive Bayes classifier is to use Bayes' Theorem to classify data based on the
probabilities of different classes given the features of the data. It is used mostly in high-dimensional text
classification.
The Naive Bayes classifier is a simple probabilistic classifier with very few parameters, so it can be
trained and can predict faster than many other classification algorithms.
It is a probabilistic classifier because it predicts a class from estimated probabilities. It is "naive" because
it assumes that each feature is independent of every other feature given the class. In other words, each feature
contributes to the prediction independently of the others.
The Naive Bayes algorithm is used in spam filtering, sentiment analysis, article classification and more.
Why it is Called Naive Bayes?
It is called "Naive" because it assumes the presence of one feature does not affect other features. The "Bayes"
part of the name refers to its basis in Bayes’ Theorem.
Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather
conditions, each tuple classifies the conditions as fit(“Yes”) or unfit(“No”) for playing golf. Here is a tabular
representation of our dataset.
Row  Outlook   Temperature  Humidity  Windy  Play Golf
0    Rainy     Hot          High      False  Yes
1    Rainy     Hot          High      True   No
2    Overcast  Hot          High      False  Yes
3    Sunny     Mild         High      False  No
4    Sunny     Cool         Normal    False  Yes
5    Sunny     Cool         Normal    True   No
6    Overcast  Cool         Normal    True   Yes
7    Rainy     Mild         High      False  No
8    Rainy     Cool         Normal    False  Yes
9    Sunny     Mild         Normal    False  Yes
10   Rainy     Mild         Normal    True   Yes
11   Overcast  Mild         High      True   Yes
12   Overcast  Hot          Normal    False  Yes
13   Sunny     Mild         High      True   No
The dataset is divided into two parts, i.e., the feature matrix and the response vector.
The feature matrix contains all the vectors (rows) of the dataset, in which each vector consists of the values
of the input features. In the above dataset, the features are ‘Outlook’, ‘Temperature’, ‘Humidity’ and ‘Windy’.
The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix.
In the above dataset, the class variable name is ‘Play Golf’.
Assumption of Naive Bayes
Naive Bayes relies on the following assumptions:
Feature independence: This means that when we are trying to classify something, we assume that each
feature (or piece of information) in the data does not affect any other feature.
Continuous features are normally distributed: If a feature is continuous, then it is assumed to be
normally distributed within each class.
Discrete features have multinomial distributions: If a feature is discrete, then it is assumed to have a
multinomial distribution within each class.
Features are equally important: All features are assumed to contribute equally to the prediction of the
class label.
No missing data: The data should not contain any missing values.
Introduction to Bayes' Theorem
Bayes’ Theorem provides a principled way to reverse conditional probabilities. It is defined as:
$$P(y \mid X) = \frac{P(X \mid y) \cdot P(y)}{P(X)}$$
Where:
$P(y \mid X)$: Posterior probability, the probability of class $y$ given features $X$
$P(X \mid y)$: Likelihood, the probability of features $X$ given class $y$
$P(y)$: Prior probability of class $y$
$P(X)$: Marginal likelihood or evidence
Naive Bayes Working
1. Terminology
Consider a classification problem (like predicting if someone plays golf based on weather). Then:
$y$ is the class label (e.g. "Yes" or "No" for playing golf)
$X = (x_1, x_2, \ldots, x_n)$ is the feature vector (e.g. Outlook, Temperature, Humidity, Wind)
A sample row from the dataset (row 1):
$X = (\text{Rainy}, \text{Hot}, \text{High}, \text{True}),\ y = \text{No}$
This represents:
What is the probability that someone will not play golf given that the weather is Rainy and Hot, with High humidity
and wind?
2. The Naive Assumption
The "naive" in Naive Bayes comes from the assumption that all features are independent given the class. That is:
$$P(x_1, x_2, \ldots, x_n \mid y) = P(x_1 \mid y) \cdot P(x_2 \mid y) \cdots P(x_n \mid y)$$
Thus, Bayes' theorem becomes:
$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y)}{P(X)}$$
Since the denominator is constant for a given input, we can write:
$$P(y \mid x_1, \ldots, x_n) \propto P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y)$$
3. Constructing the Naive Bayes Classifier
We compute the posterior for each class $y$ and choose the class with the highest probability:
$$\hat{y} = \arg\max_{y} P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y)$$
This becomes our Naive Bayes classifier.
4. Example: Weather Dataset
Let’s take a dataset used for predicting if golf is played based on:
Outlook: Sunny, Rainy, Overcast
Temperature: Hot, Mild, Cool
Humidity: High, Normal
Wind: True, False
Example Input: $X = (\text{Sunny}, \text{Hot}, \text{Normal}, \text{False})$
Goal: Predict if golf will be played (Yes or No).
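5. Pre-computation from Dataset
Class Probabilities:
From the dataset of 14 rows (9 "Yes" rows and 5 "No" rows):
$$P(\text{Yes}) = \frac{9}{14}$$
$$P(\text{No}) = \frac{5}{14}$$
Conditional Probabilities (Tables 1–4), counted from the 14-row dataset above:
Table 1: Outlook       P(value | Yes)  P(value | No)
Sunny                  2/9             3/5
Overcast               4/9             0/5
Rainy                  3/9             2/5
Table 2: Temperature   P(value | Yes)  P(value | No)
Hot                    3/9             1/5
Mild                   3/9             3/5
Cool                   3/9             1/5
Table 3: Humidity      P(value | Yes)  P(value | No)
High                   3/9             4/5
Normal                 6/9             1/5
Table 4: Windy         P(value | Yes)  P(value | No)
False                  6/9             2/5
True                   3/9             3/5
6. Calculate Posterior Probabilities
The factors below are, in order, the conditional probabilities of Sunny, Hot, Normal and False for the class,
followed by the class prior.
For Class = Yes:
$$P(\text{Yes} \mid \text{today}) \propto \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{6}{9} \cdot \frac{6}{9} \cdot \frac{9}{14}$$
$$P(\text{Yes} \mid \text{today}) \approx 0.02116$$
For Class = No:
$$P(\text{No} \mid \text{today}) \propto \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{1}{5} \cdot \frac{2}{5} \cdot \frac{5}{14}$$
$$P(\text{No} \mid \text{today}) \approx 0.00343$$
7. Normalize Probabilities
To compare, divide each value by their sum so that the two probabilities add up to 1:
$$P(\text{Yes} \mid \text{today}) = \frac{0.02116}{0.02116 + 0.00343} \approx 0.861$$
$$P(\text{No} \mid \text{today}) = \frac{0.00343}{0.02116 + 0.00343} \approx 0.139$$
8. Final Prediction
Since:
$$P(\text{Yes} \mid \text{today}) > P(\text{No} \mid \text{today})$$
The model predicts: Yes (Play Golf)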
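The worked example can also be checked programmatically. The sketch below counts priors and conditional probabilities directly from the 14-row golf table shown earlier and scores the input (Sunny, Hot, Normal, False). It is a minimal illustration without smoothing, and the function name `naive_bayes_scores` is an assumption.

```python
from collections import Counter

# The 14-row golf dataset: (Outlook, Temperature, Humidity, Windy, Play Golf)
rows = [
    ("Rainy", "Hot", "High", False, "Yes"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]


def naive_bayes_scores(rows, query):
    """Return the unnormalized score P(y) * prod P(x_i | y) for each class."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(y)
        for i, value in enumerate(query):
            matches = sum(1 for row in rows if row[-1] == label and row[i] == value)
            score *= matches / count  # conditional P(x_i | y), no smoothing
        scores[label] = score
    return scores


today = ("Sunny", "Hot", "Normal", False)
scores = naive_bayes_scores(rows, today)
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}  # normalize
prediction = max(posteriors, key=posteriors.get)
print(scores, posteriors, prediction)
```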
Naive Bayes for Continuous Features
For continuous features, we assume a Gaussian distribution:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
Where:
$\mu_y$ is the mean of feature $x_i$ for class $y$
$\sigma_y^2$ is the variance of feature $x_i$ for class $y$
This leads to what is called Gaussian Naive Bayes.
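The Gaussian likelihood above translates almost directly into code. In the sketch below, the function name and the example mean and variance are hypothetical; in practice the mean and variance would be estimated per feature and per class from the training data.

```python
import math


def gaussian_likelihood(x, mean, variance):
    """Gaussian density of x for a feature with the given class mean and variance."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((x - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)


# Hypothetical feature (e.g. temperature) with class mean 70 and variance 25
at_mean = gaussian_likelihood(70.0, mean=70.0, variance=25.0)
away_from_mean = gaussian_likelihood(80.0, mean=70.0, variance=25.0)
print(at_mean, away_from_mean)
```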
Types of Naive Bayes Model
There are three types of Naive Bayes model:
1. Gaussian Naive Bayes
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according
to a Gaussian distribution. A Gaussian distribution is also called a Normal distribution. When plotted, it gives a
bell-shaped curve that is symmetric about the mean of the feature values.
2. Multinomial Naive Bayes
Multinomial Naive Bayes is used when features represent the frequency of terms (such as word counts) in a
document. It is commonly applied in text classification, where term frequencies are important.
3. Bernoulli Naive Bayes
Bernoulli Naive Bayes deals with binary features, where each feature indicates whether a word appears or not in a
document. It is suited for scenarios where the presence or absence of terms is more relevant than their frequency.
Both the Multinomial and Bernoulli models are widely used in document classification tasks.
Advantages
Easy to implement and computationally efficient.
Effective in cases with a large number of features.
Performs well even with limited training data.
It performs well in the presence of categorical features.
For numerical features, the data is assumed to come from normal distributions.
Disadvantages
Assumes that features are independent, which may not always hold in real-world data.
Can be influenced by irrelevant attributes.
May assign zero probability to feature values never seen with a class during training, leading to poor
generalization unless smoothing (e.g. Laplace smoothing) is applied.
Applications
Spam Email Filtering: Classifies emails as spam or non-spam based on features.
Text Classification: Used in sentiment analysis, document categorization and topic classification.
Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.
Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
Weather Prediction: Classifies weather conditions based on various factors.
(b) Differentiate between Linear and Logistic Regression.
The following table compares linear regression and logistic regression:
Feature | Linear Regression | Logistic Regression
Type | Supervised regression model | Supervised classification model
Prediction | Continuous numeric values | Categorical values
Activation Function | Not used | Sigmoid function converts linear output to probability
Threshold Value | Not required | Required to classify output into classes
Evaluation Metric | Root Mean Square Error | Precision, Recall, Accuracy, F1-score
Assumption | Dependent variable follows normal/Gaussian distribution | Dependent variable follows binomial distribution
Dependent Variable | Numeric and continuous | Categorical (binary or multinomial)
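The difference in the activation function and threshold rows can be seen in a few lines of code. The sketch below uses a hypothetical weight and bias: linear regression returns the linear output as is, while logistic regression passes it through the sigmoid and applies a threshold.

```python
import math


def linear_output(x, weight, bias):
    """Linear regression prediction: a continuous value."""
    return weight * x + bias


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_predict(x, weight, bias, threshold=0.5):
    """Logistic regression prediction: a probability and a class label."""
    probability = sigmoid(linear_output(x, weight, bias))
    return probability, int(probability >= threshold)


# Hypothetical, already-fitted parameters
weight, bias = 2.0, -4.0
value = linear_output(3.0, weight, bias)
probability, label = logistic_predict(3.0, weight, bias)
print(value, probability, label)
```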
5. What is K-Nearest Neighbours (KNN) and why is it used? Explain.
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally used for classification but can
also be used for regression tasks. It works by finding the "k" closest data points (neighbors) to a given input and
making a prediction based on the majority class (for classification) or the average value (for regression). Since KNN
makes no assumptions about the underlying data distribution, it is a non-parametric, instance-based
learning method.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the entire dataset and performs computations only at the time of prediction.
For example, consider the following table of data points containing two features:
The new point is classified as Category 2 because most of its closest neighbors are blue squares. KNN assigns the
category based on the majority of nearby points. The image shows how KNN predicts the category of a new data
point based on its closest neighbours.
The red diamonds represent Category 1 and the blue squares represent Category 2.
The new data point checks its closest neighbors (circled points).
Since the majority of its closest neighbors are blue squares (Category 2) KNN predicts the new data point
belongs to Category 2.
KNN works by using proximity and majority voting to make predictions.
What is 'K' in K Nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells the algorithm how many nearby points or
neighbors to look at when it makes a decision.
Example: Imagine you're identifying a fruit based on its shape and size. You compare it to fruits you already
know.
If k = 3, the algorithm looks at the 3 closest fruits to the new one.
If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple because most
of its neighbors are apples.
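The fruit example can be sketched as a small majority-vote classifier. The feature values (two made-up size measurements per fruit) and the function name `knn_predict` are illustrative assumptions; distances are Euclidean.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical fruits described by two size measurements
fruits = [
    ((7.0, 7.5), "apple"),
    ((7.5, 8.0), "apple"),
    ((4.0, 4.0), "apple"),
    ((11.0, 6.0), "banana"),
    ((18.0, 3.5), "banana"),
    ((19.0, 4.0), "banana"),
]

new_fruit = (9.0, 6.5)
label_k3 = knn_predict(fruits, new_fruit, k=3)  # 2 apples and 1 banana vote
label_k1 = knn_predict(fruits, new_fruit, k=1)  # only the single nearest point votes
print(label_k3, label_k1)
```

With these values the three nearest neighbors are one banana and two apples, so k = 3 predicts an apple, while k = 1 follows the single nearest point, a banana. This shows how the choice of k can change the prediction.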
How to choose the value of k for KNN Algorithm?
The value of k in KNN decides how many neighbors the algorithm looks at when making a prediction.
Choosing the right k is important for good results.
If the data has lots of noise or outliers, using a larger k can make the predictions more stable.
But if k is too large, the model may become too simple and miss important patterns; this is called
underfitting.
So k should be picked carefully based on the data.
Statistical Methods for Selecting k
Cross-Validation: A good way to find the best value of k is to use cross-validation. The dataset is divided
into several parts (folds); note that the number of folds is a separate setting from the k in KNN. For each
candidate k, the model is trained on some of the folds and tested on the remaining one, and this is repeated
so that each fold is used for testing once. The k value that gives the highest average accuracy during these
tests is usually the best one to use.
Elbow Method: In the Elbow Method we draw a graph showing the error rate or accuracy for different k
values. As k increases, the error usually drops at first. But after a certain point the error stops decreasing
quickly. The point where the curve changes direction and looks like an "elbow" is usually the best choice
for k.
Odd Values for k: It’s a good idea to use an odd number for k especially in classification problems. This
helps avoid ties when deciding which class is the most common among the neighbors.
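One simple way to compare candidate k values is leave-one-out cross-validation, a special case of cross-validation in which every fold contains a single point. The sketch below is illustrative only: the two-class dataset is made up and the helper names are assumptions.

```python
import math
from collections import Counter


def knn_predict(training, query, k):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Hold out each point in turn, predict it from the rest, and average the hits."""
    correct = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(data)


# Hypothetical dataset: two well-separated classes of four points each
data = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"), ((2.0, 2.0), "A"),
    ((8.0, 8.0), "B"), ((8.0, 9.0), "B"), ((9.0, 8.0), "B"), ((9.0, 9.0), "B"),
]

accuracy_by_k = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5, 7)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k, best_k)
```

On this toy dataset the odd values 1, 3 and 5 classify every held-out point correctly, while k = 7 uses all remaining points as neighbors and is always outvoted by the other class, which illustrates how an overly large k can hurt accuracy.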
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbors, which are then used for classification and regression
tasks. The following distance metrics are commonly used:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or space. You can think
of it like the shortest path you would walk if you were to go directly from one point to another.
$$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$$
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and vertical lines like a grid
or city streets. It’s also called "taxicab distance" because a taxi can only drive along the grid-like streets of a city.
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan distances as
special cases.
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
From the formula above, when p=2, it becomes the same as the Euclidean distance formula and when p=1, it
turns into the Manhattan distance formula. Minkowski distance is essentially a flexible formula that can represent
either Euclidean or Manhattan distance depending on the value of p.
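The three distance metrics can be written as short Python functions. This is a sketch under simple assumptions: points are equal-length tuples of numbers, and the function names are illustrative.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences (grid distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    """Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x, y = (1, 2), (4, 6)
print(euclidean(x, y))       # 5.0
print(manhattan(x, y))       # 7
print(minkowski(x, y, 2))    # same value as the Euclidean distance
print(minkowski(x, y, 1))    # same value as the Manhattan distance
```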
Applications of KNN
Recommendation Systems: Suggests items like movies or products by finding users with similar
preferences.
Spam Detection: Identifies spam emails by comparing new emails to known spam and non-spam
examples.
Customer Segmentation: Groups customers by comparing their shopping behavior to others.
Speech Recognition: Matches spoken words to known patterns to convert them into text.
Unit-III
6. (a) How does unsupervised learning work? Explain.
Unsupervised learning is a type of machine learning in which the model is trained on unlabeled data. This means
the system is not given predefined output values or target labels. Instead, it learns by identifying patterns,
structures, and relationships within the data on its own.
Working of Unsupervised Learning
1. Input Data Collection
The algorithm receives a dataset containing only input features without any corresponding output labels.
2. Data Preprocessing
The data is cleaned, normalized, and transformed to improve the learning process and accuracy.
3. Pattern Discovery
The algorithm analyzes the data to find hidden patterns such as similarities, differences, or correlations
among data points.
4. Grouping or Structuring Data
Based on discovered patterns, the algorithm organizes data into:
o Clusters (grouping similar data points together), or
o Associations (discovering relationships between variables), or
o Dimensional structures (reducing data complexity).
5. Model Output
The result is not a predicted label but insights such as:
o Clusters of similar data
o Frequently occurring patterns
o Reduced-dimensional representations
6. Evaluation and Interpretation
Since no labels exist, results are evaluated using measures like cohesion, separation, or domain knowledge
rather than accuracy.
(b) Differentiate between supervised and unsupervised learning.
Difference between Supervised and Unsupervised Learning
Aspect | Supervised Learning | Unsupervised Learning
Input Data | Uses labeled data (input features + corresponding outputs). | Uses unlabeled data (only input features, no outputs).
Goal | Predicts outcomes or classifies data based on known labels. | Discovers hidden patterns, structures, or groupings in data.
Computational Complexity | Less complex, as the model learns from labeled data with clear guidance. | More complex, as the model must find patterns without any guidance.
Types | Two types: classification (for discrete outputs) or regression (for continuous outputs). | Clustering and association.
Testing the Model | Model can be tested and evaluated using labeled test data. | Cannot be tested in the traditional sense, as there are no labels.
7. Explain K-means Clustering Algorithm.
K-Means Clustering Algorithm
K-means clustering is an unsupervised machine learning algorithm used to group data into K distinct clusters
based on similarity. Each cluster is represented by the mean (centroid) of its data points.
I. Working of K-Means Algorithm
1. Choose the Number of Clusters (K)
The value of K is predefined by the user.
2. Initialize Centroids
Randomly select K data points as the initial centroids.
3. Assign Data Points to Clusters
Each data point is assigned to the nearest centroid using a distance measure (commonly Euclidean
distance).
4. Recalculate Centroids
For each cluster, calculate the new centroid by taking the mean of all data points assigned to that cluster.
5. Repeat the Process
Steps 3 and 4 are repeated until:
o Centroids no longer change, or
o Data points remain in the same clusters.
6. Final Clusters Formed
The algorithm converges (possibly to a local optimum that depends on the initial centroids) and produces K
clusters.
Algorithm (Steps)
1. Select value of K
2. Initialize K centroids randomly
3. Assign each data point to the nearest centroid
4. Update centroids
5. Repeat steps 3 and 4 until convergence
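The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the six sample points are assumptions made for the example.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 2: pick K data points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest centroid (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:                 # step 5: stop when assignments no longer change
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                          # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical data: three points near (1, 1) and three near (8.5, 9)
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this small, well-separated data set the first three points end up in one cluster and the last three in the other. Which cluster receives label 0 depends on the random initial centroids.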
Advantages of K-Means
Simple and easy to understand
Efficient for large datasets
Fast convergence
Disadvantages of K-Means
Requires predefined value of K
Sensitive to initial centroid selection
Performs poorly with non-spherical clusters and outliers
Applications of K-Means
Customer segmentation
Image compression
Document clustering
Market analysis
Unit-IV
8. Explain
(i) ROC and AUC
The AUC-ROC curve is a graph used to check how well a binary classification model works. It helps us understand
how well the model separates the positive cases (for example, people with a disease) from the negative cases (people
without the disease) at different threshold levels. It shows how good the model is at telling the difference between
the two classes by plotting:
True Positive Rate (TPR): how often the model correctly predicts the positive cases, also known as
Sensitivity or Recall. TPR = TP / (TP + FN).
False Positive Rate (FPR): how often the model incorrectly predicts a negative case as positive.
FPR = FP / (FP + TN).
Specificity: measures the proportion of actual negatives that the model correctly identifies. It is calculated
as 1 - FPR.
The closer the curve lies to the top-left corner (high TPR at low FPR), the better the model is at making correct
predictions.
These terms are derived from the confusion matrix which provides the following values:
True Positive (TP): Correctly predicted positive instances
True Negative (TN): Correctly predicted negative instances
False Positive (FP): Incorrectly predicted as positive
False Negative (FN): Incorrectly predicted as negative
ROC Curve: It plots TPR vs. FPR at different thresholds. It represents the trade-off between the
sensitivity and specificity of a classifier.
AUC (Area Under the Curve): measures the area under the ROC curve. A higher AUC value indicates
better model performance as it suggests a greater ability to distinguish between classes. An AUC value of
1.0 indicates perfect performance while 0.5 suggests it is random guessing.
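As a rough sketch of how the ROC curve and its area can be computed, the code below sweeps the decision threshold over every distinct score, records one (FPR, TPR) point per threshold, and integrates with the trapezoidal rule. The labels, scores, and function names are hypothetical, chosen only for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over every distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]                      # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        # FPR = FP / (FP + TN), TPR = TP / (TP + FN)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels and predicted probabilities for six samples
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(points)
print(auc)    # 8/9, about 0.889
```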
How AUC-ROC Works
AUC-ROC curve helps us understand how well a classification model distinguishes between the two classes.
Imagine we have 6 data points and out of these:
3 belong to the positive class: Class 1 for people who have a disease.
3 belong to the negative class: Class 0 for people who don’t have disease.
Now the model will give each data point a predicted probability of belonging to Class 1. The AUC measures the
model's ability to assign higher predicted probabilities to the positive class than to the negative class. Here’s how it
works:
1. Randomly choose a pair: Pick one data point from the positive class (Class 1) and one from the negative
class (Class 0).
2. Check if the positive point has a higher predicted probability: If the model assigns a higher probability
to the positive data point than to the negative one, the pair is ranked correctly.
3. Repeat for all pairs: We do this for all possible pairs of positive and negative examples (3 × 3 = 9 pairs
here). The AUC is the fraction of pairs that are ranked correctly.
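The pairwise procedure can be sketched directly in Python. The six predicted probabilities below are hypothetical, and counting a tie as half a correct pair is a common convention assumed here rather than something stated above.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    total = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            total += 1.0
        elif p == n:
            total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities for the six data points
pos_scores = [0.9, 0.8, 0.4]   # Class 1 (has the disease)
neg_scores = [0.6, 0.3, 0.2]   # Class 0 (does not have the disease)
pair_auc = pairwise_auc(pos_scores, neg_scores)
print(pair_auc)                # 8 of the 9 pairs are ranked correctly
```

Only the pair (0.4, 0.6) is ranked incorrectly, so the AUC for these scores is 8/9.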
When to Use AUC-ROC
AUC-ROC is effective when:
The dataset is balanced and the model needs to be evaluated across all thresholds.
False positives and false negatives are of similar importance.
In cases of highly imbalanced datasets, AUC-ROC might give overly optimistic results. In such cases the
Precision-Recall curve, which focuses on the positive class, is more suitable.
Model Performance with AUC-ROC:
High AUC (close to 1): The model effectively distinguishes between positive and negative instances.
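Low AUC (close to 0): The model ranks negative instances above positive ones almost every time, i.e. its
predictions are inverted; swapping the predicted classes would give a high AUC.
AUC around 0.5: The model has not learned any meaningful pattern, i.e. it does no better than random guessing.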
(ii) Median absolute deviation
Median Absolute Deviation (MAD)
Median Absolute Deviation (MAD) is a robust statistical measure of variability that describes how spread out
the data values are around the median. It is especially useful when the data contains outliers, as it is less sensitive
to extreme values than standard deviation.
Definition
The Median Absolute Deviation is defined as:
$$\text{MAD} = \operatorname{median}\left( |x_i - \operatorname{median}(X)| \right)$$
where:
$x_i$ = each data value
$X$ = dataset
Steps to Calculate MAD
1. Find the median of the dataset.
2. Calculate the absolute deviation of each data point from the median.
3. Find the median of these absolute deviations.
4. The result is the MAD.
Example
Given data:
2, 4, 5, 6, 9
1. Median = 5
2. Absolute deviations:
|2−5| = 3, |4−5| = 1, |5−5| = 0, |6−5| = 1, |9−5| = 4
3. Median of deviations (sorted: 0, 1, 1, 3, 4) = 1
MAD = 1
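The same calculation can be sketched with Python's standard `statistics.median`. The function name `median_absolute_deviation` and the extra outlier value 1000 are illustrative assumptions.

```python
from statistics import median


def median_absolute_deviation(values):
    """MAD = median(|x_i - median(X)|)."""
    center = median(values)                              # step 1: median of the data
    deviations = [abs(x - center) for x in values]       # step 2: absolute deviations
    return median(deviations)                            # step 3: median of the deviations


data = [2, 4, 5, 6, 9]
print(median_absolute_deviation(data))              # 1
print(median_absolute_deviation(data + [1000]))     # 2.5
```

Adding the extreme value 1000 moves the MAD only from 1 to 2.5, which illustrates its robustness to outliers.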
Features / Characteristics
Robust to outliers
Based on median, not mean
Simple to compute
Suitable for skewed data
Uses of MAD
Outlier detection
Robust statistics
Data preprocessing in machine learning
Measuring data variability
9. What is performance measurement of a model in machine learning? Explain.
Model performance indicates how well a machine learning (ML) model carries out the task for which it was
designed, based on various metrics. Measuring model performance is essential for optimizing an ML model before
releasing it to production and enhancing it after deployment. Without proper optimization, models might produce
inaccurate or unreliable predictions and suffer from inefficiencies, leading to poor performance.
Assessing model performance happens during the model evaluation and model monitoring stages of a machine
learning pipeline. After artificial intelligence (AI) practitioners work on the initial phases of ML projects, they then
evaluate a model’s performance across multiple datasets, tasks and metrics to gauge its effectiveness. Once the
model is deployed, machine learning operations (MLOps) teams monitor model performance for continuous
improvement.
Factors affecting model performance
An AI model’s performance is generally measured using a test set, comparing the model’s outputs against
predictions on the baseline test set. Insights gained from evaluating performance help determine if a model is ready
for real-world deployment or if it needs tweaking or additional training.
Here are some factors that can impact a machine learning model’s performance:
Data quality
Data leakage
Feature selection
Model fit
Model drift
Bias
Data quality
A model is only as good as the data used to train it. Model performance falls short when its training data is flawed,
containing inaccuracies or inconsistencies like duplicates, missing values and wrong data labels or annotations. A
lack of balance—such as having too many values for one scenario over another or a training dataset that’s not
sufficient or diverse enough to correctly capture correlations—can also lead to skewed results.
Data leakage
Data leakage in machine learning occurs when a model uses information during training that wouldn’t be available
at the time of prediction. This can be caused by data preprocessing errors or contamination due to improper splitting
of data into training, validation and test sets. Data leakage causes a predictive model to struggle when generalizing
on unseen data, yield inaccurate or unreliable results, or inflate or deflate performance metrics.
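One common source of leakage is fitting a preprocessing step on the full dataset before splitting. The sketch below shows the safer order for a simple standardization step; the single synthetic feature, the 80/20 split, and all variable names are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)
values = [rng.gauss(50, 10) for _ in range(100)]    # hypothetical single feature

# Split first, so the test rows play no part in fitting the preprocessing step
train, test = values[:80], values[80:]

# Fit the scaling parameters on the training split only ...
mu, sigma = mean(train), pstdev(train)

# ... then apply the same parameters to both splits
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

# Leaky variant to avoid: statistics computed on all rows, test set included
leaky_mu = mean(values)
print(mu, leaky_mu)
```

The two means differ slightly because `leaky_mu` has already seen the test rows, which is information that would not be available at prediction time.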
Feature selection
Feature selection involves choosing the most relevant features of a dataset to use for model training. Data features
influence how machine learning algorithms configure their weights during training, which in turn drives
performance. Additionally, reducing the feature space to a selected subset can help improve performance while
lowering computational demands. However, picking irrelevant or insignificant features can weaken model
performance.
Model fit
Overfitting happens when an ML model is too complex and fits too closely or even exactly to its training data, so it
doesn’t generalize well on new data. Conversely, underfitting occurs when a model is so simple that it fails to
capture the underlying patterns in both training and testing data.
Model drift
Model drift refers to a model’s performance degrading because of changes in data or in the relationships between
input and output variables. This decay can negatively impact model performance, leading to faulty decision-making
and bad predictions.
Bias
Bias in AI can be introduced at any phase of a machine learning workflow, but it’s particularly prevalent in the data
processing and model development stages. Data bias occurs when the unrepresentative nature of training and fine-
tuning datasets adversely affects model behavior and performance. Meanwhile, algorithmic bias is not caused by the
algorithm itself but by how data science teams collect and code training data and how AI programmers design and
develop machine learning algorithms. AI bias can lead to inaccurate outputs and potentially harmful outcomes.