Machine Learning

The document discusses various machine learning algorithms including linear regression, logistic regression, decision trees, K-Nearest Neighbors (KNN), and K-Means clustering, explaining their workings, applications, and key concepts such as entropy and information gain. It also covers overfitting and underfitting, hyperparameter tuning methods, feature selection techniques, and the importance of cross-validation in model evaluation. Each algorithm is described with its mathematical model, steps, advantages, and disadvantages, providing a comprehensive overview of their functionalities in machine learning.

Uploaded by

omprakashbehura26


Q) Explain the working of linear regression and interpret results.
Linear Regression is a supervised learning technique used to predict a dependent variable (Y) from an independent variable (X) by fitting a straight line.
Mathematical Model:-
$$y = mx + b$$
Where: y = predicted value, x = input variable, m = slope (rate of change), b = intercept (value of y when x = 0)
Working of Linear Regression
1. Data Collection:- Gather input (X) and output (Y) data
2. Plotting Data:- Represent data points on a graph
3. Fit Best-Fit Line:- Draw the straight line that best represents the data
4. Least Squares Method:- Choose m and b so that the sum of squared errors between actual and predicted values is as small as possible
5. Prediction:- Use the equation to predict future values
Interpretation of Results
Slope (m): Shows how much Y changes when X increases by one unit (e.g., +10 marks per hour)
Intercept (b): Value of Y when X = 0 (base value)
Residual/Error: Difference between actual and predicted values
Code:-
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Dataset
    data = {"StudyHours": [1, 2, 3, 4, 5], "Marks": [20, 40, 50, 65, 80]}
    df = pd.DataFrame(data)
    X = df[["StudyHours"]]  # input feature (kept two-dimensional)
    y = df["Marks"]         # target

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Train model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Test model
    y_pred = model.predict(X_test)

    # Print result
    print("Slope (m):", model.coef_[0])
    print("Intercept (b):", model.intercept_)
    print("Predicted:", y_pred)
    print("Actual:", y_test.values)

Q) Discuss overfitting and techniques to reduce it.
Overfitting occurs when a machine learning model learns the training data too closely, including noise and random fluctuations, instead of capturing the general pattern.
As a result:
Training accuracy → Very high
Test/validation accuracy → Low
Model fails to generalize to new data
Causes of Overfitting
Model is too complex (e.g., deep trees, many parameters)
Small or insufficient dataset
Too many features (irrelevant variables)
Training for too long (in some models like neural networks)
Techniques to Reduce Overfitting
1. Increase Training Data:- More data helps the model learn general patterns instead of noise.
2. Cross-Validation:- Use techniques like k-fold cross-validation to check that the model performs well on different data splits.
3. Regularization:- Adds a penalty for large weights to simplify the model: L1 Regularization (Lasso), L2 Regularization (Ridge).
4. Feature Selection:- Remove irrelevant or redundant features to reduce complexity.
5. Pruning (for Decision Trees):- Cut unnecessary branches to avoid learning noise.
6. Early Stopping:- Stop training when validation error starts increasing.

Underfitting:-
Underfitting happens when the model fails to learn important patterns. It performs poorly on both training and testing data.
Causes of Underfitting
Model is too simple (e.g., linear model for complex data)
Not enough features (missing important information)
Too much regularization
Insufficient training time
Techniques to Reduce Underfitting:-
1. Increase Model Complexity
Use more powerful models (e.g., polynomial regression, deep learning)
Add more layers or parameters
2. Add More Features
Include relevant input variables
Helps model understand data better
3. Reduce Regularization
Too much regularization restricts learning
Lowering it allows model to fit data better
4. Train for Longer Time
Increase number of epochs/iterations
Helps model learn patterns fully
5. Feature Engineering
Create meaningful features (e.g., combining variables, transformations)

Q) Describe logistic regression and its applications.
Logistic Regression is a supervised machine learning algorithm used for classification problems, where the output is categorical (e.g., Yes/No, 0/1, True/False).
How It Works
1. Take input data
2. Calculate a value z from the inputs (a weighted sum)
3. Convert it into a probability (0 to 1) using the sigmoid function
$$P(y = 1) = \frac{1}{1 + e^{-z}}$$
If probability > 0.5 → Yes (1)
If probability < 0.5 → No (0)
Applications of Logistic Regression
1. Healthcare:- Disease prediction (e.g., diabetes, heart disease); classifying patients as high risk or low risk
2. Finance:- Credit scoring (loan approval/rejection); fraud detection in transactions
3. Marketing:- Predicting whether a customer will buy a product or not; email spam detection
4. Education:- Predicting student pass/fail outcomes
Code:-
    # Import libraries
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Create dataset
    data = {"StudyHours": [1, 2, 3, 4, 5],
            "Pass": [0, 0, 0, 1, 1]}  # 0 = Fail, 1 = Pass
    df = pd.DataFrame(data)

    # Separate input and output
    X = df[["StudyHours"]]
    y = df["Pass"]

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Create and train model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    # Output results
    print("Predictions:", y_pred)
    print("Actual:", y_test.values)

    # Accuracy
    print("Accuracy:", accuracy_score(y_test, y_pred))

    # Predict new value (a DataFrame keeps the feature name used in training)
    new_student = pd.DataFrame({"StudyHours": [3.5]})
    print("Prediction for 3.5 hours:", model.predict(new_student)[0])

Q) Explain the steps of k-means clustering with example.
K-Means Clustering is an unsupervised learning algorithm used to group similar data points into K clusters.
Steps of K-Means Algorithm
Step 1: Choose K (number of clusters)
Decide how many clusters you want
Example: K = 2 (2 groups)
Step 2: Initialize Centroids
Randomly select K points as centroids (centers of clusters)
Step 3: Assign Points to Nearest Centroid
Calculate distance (usually Euclidean distance)
Assign each data point to the nearest centroid
Step 4: Update Centroids
Compute new centroid = mean of all points in that cluster
Step 5: Repeat
Repeat Steps 3 and 4 until the centroids do not change or the maximum number of iterations is reached
Code:-
    # Import libraries
    import pandas as pd
    from sklearn.cluster import KMeans

    # Create dataset
    data = {"X": [1, 2, 4, 5], "Y": [1, 1, 3, 4]}
    df = pd.DataFrame(data)

    # Apply K-Means (K = 2 clusters); n_init is set explicitly
    model = KMeans(n_clusters=2, n_init=10, random_state=42)

    # Train model
    model.fit(df)

    # Get cluster labels
    labels = model.labels_

    # Output results
    print("Cluster labels:", labels)

    # Get centroids
    print("Centroids:\n", model.cluster_centers_)

Q) Explain bias-variance tradeoff with examples.
The bias-variance tradeoff describes the balance between a model being too simple and too complex. A simple model may miss important patterns (high bias), while a very complex model may learn noise from training data (high variance). The aim is to balance both so the model performs well on new data.
1. Simple models usually have high bias and low variance, which may cause underfitting.
2. Complex models usually have low bias but high variance, which may cause overfitting.
3. A good model achieves an optimal point where both bias and variance are reasonably low.
4. The goal of machine learning is to minimize the total prediction error on unseen data.
The total prediction error can be expressed as:
$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
Here:
Bias²: Error caused by incorrect assumptions in the model.
Variance: Error caused by sensitivity to training data.
Irreducible Error: Random noise in the data that cannot be eliminated.
Example
Case 1: High Bias
Model: Linear regression for complex data
Result: Poor predictions everywhere

Q) Analyze decision tree construction using entropy and information gain.
A Decision Tree is a supervised learning algorithm used for classification and regression, where data is split into branches based on conditions.
Entropy (Measure of Impurity)
Entropy measures how mixed or impure a dataset is.
If all data belongs to one class → Entropy = 0 (pure)
If data is equally mixed → Entropy = high (impure)
$$Entropy(S) = -\sum_i p_i \log_2 p_i$$
Where $p_i$ is the probability of each class.
Information Gain (IG)
Information Gain tells how much uncertainty is reduced after splitting data.
$$IG = Entropy(parent) - \sum_v \frac{|S_v|}{|S|} Entropy(S_v)$$
Higher IG = Better split
Code:-
    # Import libraries
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Create dataset in table format
    data = {"Study": [1, 1, 0, 0],    # 1 = Yes, 0 = No
            "Result": [1, 1, 0, 0]}   # 1 = Pass, 0 = Fail
    df = pd.DataFrame(data)

    # Separate input and output
    X = df[["Study"]]
    y = df["Result"]

    # Create model (use entropy for Information Gain)
    model = DecisionTreeClassifier(criterion="entropy")

    # Train model
    model.fit(X, y)

    # Predictions (on the training data itself, for illustration only)
    y_pred = model.predict(X)

    # Output results
    print("Predictions:", y_pred)

    # Predict new value (a DataFrame keeps the feature name used in training)
    print("Prediction if Study = 1:", model.predict(pd.DataFrame({"Study": [1]}))[0])

Q) Explain KNN algorithm using distance metrics.
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression.
It works by finding the K closest data points (neighbors) and making a decision based on them.
How KNN Works (Steps)
Step 1: Choose K:- Select number of neighbors (e.g., K = 3 or 5)
Step 2: Calculate Distance
Find distance between new point and all training points
Common method: Euclidean distance
Step 3: Find Nearest Neighbors:- Select K closest points
Step 4: Make Prediction:- 1. Classification: Majority voting (most common class) 2. Regression: Average of values
Advantages of KNN:-
1. Simple and easy to understand
2. No training phase (lazy learner)
3. Works well for small datasets
4. Can handle non-linear data
Disadvantages of KNN:-
1. Slow for large datasets (needs distance calculation for all points)
2. Sensitive to noise
3. Choosing the right K is difficult
4. Affected by irrelevant features
5. Requires feature scaling
Common Distance Metrics
1. Euclidean Distance (Most Common):- Straight-line distance between two points
$$d = \sqrt{\sum_i (x_i - y_i)^2}$$
2. Manhattan Distance:- Distance measured along axes (grid-like path)
$$d = \sum_i |x_i - y_i|$$
3. Minkowski Distance:- General form of distance (includes Euclidean and Manhattan)
$$d = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$$

Q) Explain hyperparameter tuning methods.
Hyperparameters are parameters set before training a model that control the learning process (e.g., learning rate, number of trees, regularization strength). Hyperparameter tuning is the process of finding the best combination of hyperparameters to improve model performance.
Common Hyperparameter Tuning Methods
(a) Grid Search
How it works: Try all possible combinations of hyperparameters from a predefined set (grid).
Pros: Exhaustive, so it finds the best combination within the grid.
Cons: Computationally expensive for many hyperparameters.
(b) Random Search
How it works: Randomly selects combinations of hyperparameters to evaluate.
Pros: Faster than grid search, can find good parameters without testing all combinations.
(c) Bayesian Optimization
How it works: Uses past evaluation results to choose the next combination intelligently.
Pros: Efficient for expensive models, often finds better parameters faster.
Cons: More complex to implement.
Steps for Hyperparameter Tuning
1. Choose the model to tune (e.g., Random Forest, SVM)
2. Define the hyperparameters and their ranges
3. Choose a search method (Grid Search, Random Search, Bayesian, etc.)
4. Train the model on training data with different hyperparameter combinations
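To make grid search concrete, here is a minimal sketch in plain Python rather than a library implementation. It tries every combination of K (number of neighbors) and the Minkowski exponent p for a small KNN classifier, and scores each combination with leave-one-out validation. The toy dataset and the helper names (`minkowski`, `knn_predict`, `loo_accuracy`) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data: (study_hours, sleep_hours) -> 0 = Fail, 1 = Pass
X = [(1, 4), (2, 5), (3, 5), (4, 7), (5, 6), (6, 8), (7, 7), (8, 8)]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def minkowski(a, b, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k, p):
    """Classify `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)),
                   key=lambda i: minkowski(train_X[i], query, p))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(k, p):
    """Leave-one-out accuracy for one hyperparameter combination."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += knn_predict(train_X, train_y, X[i], k, p) == y[i]
    return hits / len(X)


# Grid search: evaluate every combination in the predefined grid
grid = {"k": [1, 3, 5], "p": [1, 2]}
results = {(k, p): loo_accuracy(k, p)
           for k, p in product(grid["k"], grid["p"])}
best_params = max(results, key=results.get)
print("Best (k, p):", best_params, "accuracy:", results[best_params])
```

The grid has 3 x 2 = 6 combinations, and each one costs a full validation pass. This is why grid search becomes expensive as more hyperparameters are added.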
Q) Describe feature selection techniques.
Feature selection is the process of choosing only the most useful input features for a machine learning model. It helps improve model performance, reduces noise and makes results easier to understand.
Example:- For predicting marks, a useful feature is study hours; a feature that is not useful is favorite color.
Why Feature Selection is Needed
1. Improves model accuracy
2. Reduces overfitting
3. Makes model faster
4. Removes noise (irrelevant data)
Main Feature Selection Techniques
(a) Filter Methods:-
1. Features are selected based on statistical measures between input and output.
2. Independent of any machine learning algorithm.
Common techniques:
1. Correlation Coefficient: Select features highly correlated with target.
2. Chi-Square Test: Select categorical features related to target.
3. Variance Threshold: Remove features with low variance.
Example: If StudyHours has a high correlation with Pass, it is selected.
(b) Wrapper Methods:- Use a machine learning model to evaluate feature subsets.
Techniques:
Forward Selection: Start with no features, add features one by one that improve model performance.
Backward Elimination: Start with all features, remove the least important one iteratively.
Recursive Feature Elimination (RFE): Recursively remove least important features using a model.
Example: A regression model tests different combinations of features to find the best set for predicting salary.
(c) Embedded Methods:- Feature selection is part of the model training process.
Techniques:
Lasso Regression (L1 Regularization): Shrinks less important feature coefficients to zero.
Decision Tree / Random Forest: Uses feature importance scores to select features.
Elastic Net: Combination of L1 and L2 regularization.
Example: A Random Forest model automatically ranks features by importance in predicting house prices.

Q) Explain gradient descent algorithm.
Gradient Descent is an optimization algorithm used in machine learning and deep learning to minimize the cost/loss function of a model by iteratively updating its parameters (weights).
Purpose:-
1. Find the best model parameters (like slope and intercept in linear regression) that minimize error between predicted and actual values.
2. Used for linear regression, logistic regression, neural networks, etc.
How It Works (Step-by-Step)
1. Initialize parameters:- Start with random values for model parameters (weights).
2. Compute the loss function:- Measure how far the model's predictions are from actual values. Common loss functions: i) Linear regression: Mean Squared Error (MSE) ii) Logistic regression: Cross-Entropy Loss
3. Compute gradients:- Find the partial derivatives of the loss function with respect to each parameter. The gradient tells the direction of steepest increase in loss.
4. Update parameters:- Move parameters in the opposite direction of the gradient to reduce the loss:
$$\theta_{new} = \theta_{old} - \alpha \frac{\partial J(\theta)}{\partial \theta}$$
Where: $\theta$ = model parameter (weight or bias), $\alpha$ = learning rate (step size), $J(\theta)$ = loss/cost function
5. Repeat:- Iterate steps 2–4 until the loss converges or the maximum number of iterations is reached.

Q) Explain entropy and information gain in decision trees.
1. Entropy:- Entropy measures the uncertainty or impurity in a dataset.
I) High entropy → data is mixed / uncertain
II) Low entropy → data is pure / mostly one class
Formula:
$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
Where: $S$ = dataset, $c$ = number of classes, $p_i$ = proportion of samples in class $i$
Example: Dataset with 9 positive and 5 negative examples:
$p_+ = 9/14$, $p_- = 5/14$
$$Entropy(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) \approx 0.94$$
2. Information Gain
Definition: Information Gain (IG) measures the reduction in entropy after splitting a dataset based on a feature. It helps choose the best feature to split on at each decision tree node.
Formula:
$$IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
Where: $A$ = feature, $S_v$ = subset of the dataset where feature $A$ has value $v$, $|S_v|/|S|$ = weight of the subset
Example: Dataset entropy = 0.94
After splitting on feature A, weighted entropy = 0.5
$$IG(S, A) = 0.94 - 0.5 = 0.44$$
Interpretation: Higher IG → feature better separates the data

Q) Discuss cross-validation and its importance.
Cross-validation is a technique to evaluate a model's performance by splitting the dataset into multiple subsets (folds) and training/testing the model on different combinations.
It makes the evaluation more reliable and less biased than using a single train-test split.
Why Cross-Validation is Important
(i) Helps detect overfitting → the model is tested on multiple data subsets.
(ii) Provides a better estimate of model performance on unseen data.
(iii) Useful when the dataset is small, so all data can be used for training and testing at some point.
(iv) Helps in hyperparameter tuning, ensuring that selected parameters work well across different splits.
Common Cross-Validation Methods
(a) K-Fold Cross-Validation
(i) Split data into k equal folds
(ii) Train on k-1 folds and test on the remaining fold
(iii) Repeat k times, each fold used once as the test set
(iv) Final performance: Average of all k test results
(b) Leave-One-Out Cross-Validation (LOOCV)
Special case of k-fold where k = number of samples
Each sample is tested once while the rest are used for training
Advantage: Uses almost all data for training
Disadvantage: Very computationally expensive for large datasets
(c) Stratified K-Fold
Ensures each fold has the same proportion of class labels (for classification tasks)
Useful when classes are imbalanced

Q) Discuss model generalization and its importance.
Model generalization refers to a model's ability to perform well on new, unseen data that was not part of the training set.
A model that generalizes well captures the underlying patterns rather than memorizing the training data.
Importance of Model Generalization
Ensures Real-World Performance:- (i) A model that works only on training data is useless in practical scenarios. (ii) Generalization ensures predictions are accurate on new inputs.
Prevents Overfitting:- Overfitting → model memorizes training data but fails on test data. Good generalization balances accuracy on training and unseen data.
Improves Reliability:- A generalized model provides consistent results across different datasets.
Guides Model Selection and Tuning:- Helps in choosing the right model complexity and hyperparameters to achieve optimal performance.
Example
Task: Predict house prices
Scenario 1: Model fits training data perfectly but fails on new houses → poor generalization
Scenario 2: Model captures overall price trends and performs well on new data → good generalization

Q) Explain cost function and optimization process.
A cost function (also called a loss function) measures how well a machine learning model is performing.
I) It quantifies the difference between predicted values and actual values.
II) The goal is to minimize the cost during training.
1. Common Cost Functions
Model Type            | Cost Function (Example)
Linear Regression     | Mean Squared Error (MSE)
Logistic Regression   | Cross-Entropy / Log Loss
Classification Models | Hinge Loss (for SVM)
Example (Linear Regression):
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$
Where: $y_i$ = actual value, $\hat{y}_i$ = predicted value, $m$ = number of samples. The factor 1/2 only simplifies the derivative; it does not change where the minimum is.
2. Optimization Process
Optimization is the process of finding model parameters (weights and biases) that minimize the cost function.
Steps:
1. Initialize Parameters:- Start with random values for weights and biases
2. Compute Cost:- Use the cost function to calculate error
3. Compute Gradient:- Find how the cost changes with respect to each parameter (derivatives)
4. Update Parameters:- Adjust parameters in the direction that reduces the cost (opposite to gradient)
$$\theta_{new} = \theta_{old} - \alpha \frac{\partial J(\theta)}{\partial \theta}$$
$\alpha$ = learning rate
5. Repeat:- Iterate until cost converges (doesn't decrease further)

Q) Explain training and testing datasets.
In machine learning, a dataset is usually split into two parts, often with a third validation set held out as well:
Training Dataset: used to train the model
Testing Dataset: used to evaluate the model's performance on unseen data
This split checks that the model generalizes well and does not just memorize the training data.
Training Dataset:-
Purpose: 1. Learn patterns from the input data 2. Adjust model parameters (weights, coefficients, etc.)
Characteristics: 1. Usually 70–80% of the total dataset 2. Used in fitting the model
Example: Predicting house prices:- Input: size, location; Output: price
The model uses the training data to learn the relationship between features and price
Testing Dataset:-
Purpose: 1. Evaluate how well the model performs on new, unseen data 2. Detect overfitting or underfitting
Characteristics: 1. Usually 20–30% of the total dataset 2. Model does not see this data during training
Example: After training the house price model, testing data is used to see if predictions are accurate on houses the model has never seen
Validation Dataset:-
I) Purpose: Used to tune hyperparameters and select the best model.
II) Size: Typically 10–20% of the dataset.
III) Role: Helps in preventing overfitting by evaluating the model during training.
Example: While training a Random Forest, validation data is used to choose the optimal number of trees or max depth.

Q) Describe distance metrics used in ML algorithms.
Distance metrics measure the similarity or dissimilarity between data points. They are crucial for algorithms like K-Nearest Neighbors (KNN), K-Means clustering, and Hierarchical Clustering. Smaller distance → more similar data points
1. Common Distance Metrics
(a) Euclidean Distance
(i) Most common metric for continuous variables
(ii) Measures straight-line distance between two points in n-dimensional space
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
(b) Manhattan Distance (L1 Distance):-
i) Measures distance along axes (like a grid in a city)
ii) Sum of absolute differences
$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$
Example: Point A (2,3), Point B (5,7) → distance = |5−2| + |7−3| = 7
(c) Minkowski Distance:-
i) Generalization of Euclidean and Manhattan distances
$$d(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}$$
p = 1 → Manhattan, p = 2 → Euclidean
(d) Cosine Similarity / Cosine Distance:-
i) Measures angle between two vectors rather than magnitude
ii) Used for text data, high-dimensional data
$$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \, \|B\|}$$
iii) Cosine distance = 1 − Cosine similarity

Q) Discuss learning rate and its impact on model training.
The learning rate (α) is a hyperparameter that determines how much the model parameters are updated in each step of the optimization process (e.g., gradient descent).
• It controls the step size when moving toward the minimum of the cost function.
1. Role of Learning Rate
• High learning rate: Large steps in each update → faster convergence but risk of overshooting the minimum.
• Low learning rate: Small steps → more precise convergence but slower training.
2. Impact on Model Training
Learning Rate | Effect on Training
Too High      | May overshoot the minimum; loss may diverge; model may not converge
Too Low       | Training is very slow; takes longer to reach minimum; may get stuck in local minima
Optimal       | Fast enough to converge; stable decrease in loss; reaches global or good local minimum
3. Example (Conceptual)
Suppose we are training a linear regression model using gradient descent:
i) Learning rate = 1.0 → model overshoots and loss fluctuates, never stabilizes.
ii) Learning rate = 0.001 → model takes too long to converge, slow training.
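The effect of the learning rate on gradient descent can be seen on a one-parameter problem. The sketch below is an illustration under stated assumptions, not a training routine: the cost function J(θ) = (θ − 3)², the starting point θ = 0, the 50 steps and the four learning rates are all chosen here for demonstration. It applies the update rule θ_new = θ_old − α · ∂J/∂θ with the gradient 2(θ − 3).

```python
def gradient_descent(alpha, steps=50, theta=0.0):
    """Minimize J(theta) = (theta - 3)**2 using a fixed learning rate alpha."""
    for _ in range(steps):
        grad = 2 * (theta - 3)         # dJ/dtheta
        theta = theta - alpha * grad   # step against the gradient
    return theta


for alpha in (0.001, 0.1, 1.0, 1.1):
    theta = gradient_descent(alpha)
    loss = (theta - 3) ** 2
    print(f"alpha={alpha}: theta={theta:.4f}, loss={loss:.4f}")
```

On this cost function each update multiplies the distance to the minimum by (1 − 2α). With α = 0.1 the distance shrinks quickly, with α = 0.001 it barely moves in 50 steps, with α = 1.0 θ jumps back and forth across the minimum without getting closer, and with α = 1.1 the distance grows on every step.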
Q)Describe k-means clustering algorithm. Q)A dataset contains customer information, and the goal is Q)A company wants to predict employee salaries based on
K-Means is an unsupervised learning algorithm used to group to group similar customers. Explain how k-means clustering experience and skills. Explain how regression models can be
similar data points into k clusters based on their features. can be used and interpret the results. used and evaluate performance.
It minimizes the distance between data points and their cluster # Step 1: Import libraries:-import numpy as np import pandas as pd
centroids. 1. Steps of K-Means Algorithm import pandas as pd, import [Link] as plt from sklearn.model_selection import train_test_split
Choose the number of clusters (k) :-Decide how many groups from [Link] import KMeans from sklearn.linear_model import LinearRegression
you want to divide the data into. from [Link] import StandardScaler from [Link] import mean_absolute_error,
Initialize centroids :-Randomly select k points as the initial # Step 2: Create a sample dataset mean_squared_error, r2_score
cluster centroids. data = {'CustomerID': [1,2,3,4,5,6,7,8,9,10], # Step 1: Sample dataset
Assign points to clusters :-Each data point is assigned to the 'Age': [22, 25, 47, 35, 46, 52, 23, 40, 36, 28], data = { 'Experience': [2, 5, 10, 3, 7, 8], 'Skills': [70, 85, 95, 60,
nearest centroid (based on Euclidean distance). 'AnnualSpending': [250, 300, 2000, 1200, 2200, 4000, 270, 90, 88], 'Salary': [40000, 70000, 120000, 35000, 90000, 95000]}
Update centroids :-Compute the mean of all points in each 1800, 1300, 350]} df = [Link](data)
cluster → new centroid. df = [Link](data) # Step 2: Features and target
Repeat steps 3–4 :-Continue until: i)Centroids do not change, or # Step 3: Select features for clustering X = df[['Experience', 'Skills']]
ii)Maximum number of iterations is reached. X = df[['Age', 'AnnualSpending']] y = df['Salary']
2. Key Features:-Distance metric: Usually Euclidean distance # Step 4: Standardize features (important for distance-based # Step 3: Split data
Unsupervised: No labeled data required methods):-scaler = StandardScaler() X_train, X_test, y_train, y_test = train_test_split(X, y,
Output: Cluster assignments and cluster centroids X_scaled = scaler.fit_transform(X) test_size=0.3, random_state=42)
Example (Conceptual) # Step 5: Apply k-means clustering # Step 4: Train Linear Regression
Dataset: Points representing customers based on annual income k = 3 # number of clusters model = LinearRegression()
and spending score kmeans = KMeans(n_clusters=k, random_state=42) [Link](X_train, y_train)
i)k = 2 → we want 2 clusters ii)Randomly select 2 centroids [Link](X_scaled) # Step 5: Predictions
iii)Assign customers to nearest centroid → Cluster 1 & Cluster 2 # Step 6: Add cluster labels to the dataset y_pred = [Link](X_test)
iv)Update centroids based on cluster points v)Repeat until df['Cluster'] = kmeans.labels_ # Step 6: Evaluate performance
centroids stabilize # Step 7: Print results mae = mean_absolute_error(y_test, y_pred)
print(df) mse = mean_squared_error(y_test, y_pred)
Q)Describe k-nearest neighbors (KNN) algorithm. # Step 8: Visualize clusters rmse = mean_squared_error(y_test, y_pred, squared=False)
KNN is a supervised machine learning algorithm used for [Link](figsize=(8,5)) r2 = r2_score(y_test, y_pred)
classification and [Link] predicts the label or value of a [Link](df['Age'], df['AnnualSpending'], c=df['Cluster'], print(f"MAE: {mae:.2f}")
new data point based on the labels of its k nearest neighbors. cmap='viridis', s=100) print(f"MSE: {mse:.2f}")
1. How KNN Works (Steps) [Link]('Age') print(f"RMSE: {rmse:.2f}")
i)Choose k → the number of nearest neighbors to consider. [Link]('Annual Spending') print(f"R2 Score: {r2:.2f}")
ii)Compute distance → measure the distance (e.g., Euclidean) [Link]('Customer Segmentation using K-Means')
between the new point and all existing points. [Link]() Q)Design a machine learning solution for a real-world
iii)Select nearest neighbors → pick the k points with the problem (e.g., spam detection, recommendation system) and
smallest distance. Q)A classification model is giving high training accuracy but explain the choice of algorithm and steps involved.
iv)Make prediction:i) Classification: Use majority vote of low testing accuracy. Explain the problem and suggest
neighbors’ classes. ii)Regression: Use average of neighbors’ solutions. problem Overview:-Goal: Automatically classify incoming
values. Understanding the Problem:-1)High training accuracy: The emails as “Spam” or “Not Spam”. Type of problem: Binary
2. Key Points:-Distance metrics: Euclidean, Manhattan, model performs very well on the training dataset. classification (two classes). Challenges: Emails vary in content,
Minkowski, etc. 2)Low testing accuracy: The model performs poorly on new, length, and structure. Ii)Must handle large datasets efficiently.
Non-parametric: No prior assumptions about the data unseen data. This happens because the model has memorized the 2. Choice of Algorithm:-For spam detection, suitable
distribution. training data rather than learning general patterns. algorithms include:-Naive Bayes Classifier (commonly used
Lazy learning: No training; computation happens during Symptoms:1)Model captures noise in the training data. for text classification) Based on Bayes’ Theorem and assumes
prediction. 2)It is too complex relative to the amount of training data. feature independence. -Works well for high-dimensional text
Example Dataset: Students’ study hours and pass/fail 3)Poor generalization to new data. data. Alternative Algorithms Logistic Regression –
New student studied 3.5 hours → k = 3 nearest students: 2 pass, 2. Causes of Overfitting:-i)Complex model: Too many interpretable and efficient for binary classification.
1 fail → predict Pass parameters (e.g., deep decision trees, high-degree polynomials).
4. Advantages:-i)Simple and intuitive ,ii)Works for ii)Insufficient data: Not enough examples to learn general from sklearn.model_selection import train_test_split,from
classification and regression iii)Adapts easily to new data patterns. iii)Noisy data: Irrelevant features or errors in the sklearn.feature_extraction.text import TfidfVectorizer,from
5. Limitations:-Sensitive to noise and irrelevant features training data. 3. Solutions to Avoid Overfitting sklearn.naive_bayes import MultinomialNB,from
sklearn.metrics import accuracy_score

# Sample emails
emails = [
    ("Win a free iPhone now!", "Spam"),
    ("Meeting at 10 am tomorrow", "Not Spam"),
    ("Congratulations! You won $1000", "Spam"),
    ("Project deadline is next week", "Not Spam"),
]

# Separate text and labels
texts, labels = zip(*emails)

# Split data
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Convert text to numbers
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Train model
model = MultinomialNB()
model.fit(X_train_vect, y_train)

# Predict
y_pred = model.predict(X_test_vect)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))

• Slow for large datasets (distance computed at prediction time)
• Requires choosing optimal k

Q) Explain decision tree algorithm.
A Decision Tree is a supervised learning algorithm used for classification and regression.
i) It splits the data into branches based on feature values to make predictions.
ii) The tree has nodes (decision points) and leaves (outcomes or classes).
1. How It Works (Steps):-
Select the best feature to split:- Use metrics like entropy and information gain or Gini index for classification.
Split the dataset:- Divide data based on the selected feature's values.
Repeat recursively:- For each branch, repeat the process with remaining features until:
i) All data points in a branch belong to the same class, or
ii) No more features are left, or
iii) A stopping condition (max depth, minimum samples) is reached.
Make predictions:-
i) For classification → follow the branches according to feature values to a leaf node (class).
ii) For regression → leaf node gives predicted value (mean of target values).
Advantages
• Easy to understand and visualize
• Handles both numerical and categorical data
• Non-parametric → no assumptions about data distribution
4. Limitations
• Prone to overfitting
• Sensitive to small changes in data
• Can become complex for large datasets

A. Simplify the Model:- i) Use a less complex model (e.g., shallower decision tree, smaller neural network). ii) Limit parameters like max_depth in trees or number of layers/neurons in neural networks.
B. Regularization:- Penalize large or complex weights: L1 (Lasso) or L2 (Ridge) for linear models, Dropout in neural networks.
C. Increase Training Data:- i) Collect more data to help the model learn general patterns. ii) Use data augmentation if data collection is difficult (common in images).
D. Feature Selection:- Remove irrelevant or noisy features. Focus on features that contribute meaningfully to predictions.

Q) A classification model shows poor performance. Analyze possible reasons and suggest improvements.
Possible Reasons for Poor Performance
A. Data-related Issues:-
Insufficient data:- Not enough training examples to learn patterns.
Noisy or inconsistent data:- Errors, missing values, or incorrect labels confuse the model.
Irrelevant or redundant features:- Features that don't contribute to prediction can reduce accuracy.
Model-related Issues
Underfitting:- The model is too simple to capture patterns in the data. Example: Using linear regression for a non-linear problem.
Overfitting:- Model is too complex and performs well on training data but poorly on testing data.
Poor choice of algorithm:- Some algorithms are better suited for certain data types.
Suggested Improvements:-
Data Improvements:- i) Collect more data if possible. ii) Clean the data: Handle missing values, remove outliers, correct labels. Feature engineering: Remove irrelevant features, create meaningful ones.
Model Improvements:- Choose a more appropriate algorithm based on data type and problem.
Prevent Overfitting:- Regularization: L1/L2 for linear models, pruning for decision trees. Cross-validation: Use k-fold to ensure generalization.
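The k-fold idea can be sketched in plain Python, so it runs without extra libraries. The sketch below reuses the KNN setting from these notes (choosing k is listed as a difficulty) and compares a few values of k by cross-validated accuracy. The toy points, the labels, and the helper names knn_predict and k_fold_accuracy are assumptions made for illustration only; a real project would more likely rely on a library routine.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], point),  # Euclidean distance
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def k_fold_accuracy(X, y, k_neighbors, folds=4):
    """Average accuracy over `folds` train/test splits.

    Sample i goes to fold (i % folds), so every sample is used for
    testing exactly once and for training in all other folds.
    """
    n = len(X)
    correct = 0
    for f in range(folds):
        train_idx = [i for i in range(n) if i % folds != f]
        test_idx = [i for i in range(n) if i % folds == f]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(train_X, train_y, X[i], k_neighbors) == y[i]:
                correct += 1
    return correct / n

# Toy data (hypothetical): two well-separated groups of 2-D points.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (1.0, 2.0),
     (6.0, 6.0), (6.5, 7.0), (7.0, 6.5), (6.0, 7.0)]
y = ["A", "A", "A", "A", "B", "B", "B", "B"]

for k in (1, 3, 5):
    print("k =", k, "cross-validated accuracy:", k_fold_accuracy(X, y, k))
```

Because every sample is tested by a model that never saw it during training, the averaged score is a fairer estimate of generalization than accuracy on the training set.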
Feature and Label: Feature: Input variable used to train a model. Label: Output variable or target that the model predicts based on features.

Supervised Learning: A machine learning method where the model learns from labeled data to predict outcomes for new unseen inputs.

Unsupervised Learning: Learning from unlabeled data to find patterns, clusters, or associations without any predefined outputs or labels.

Reinforcement Learning: An agent learns to make decisions by interacting with an environment, receiving rewards or penalties for actions taken.

Classification: Predicting categorical outcomes or classes for input data, such as email being spam or not spam.

Regression: Predicting continuous numerical values based on input features, like predicting salary, temperature, or house prices.

Training Data: Dataset used to train a machine learning model, helping it learn relationships between features and corresponding labels.

Testing Data: Dataset used to evaluate model performance, checking how accurately the model predicts outcomes on unseen data.

Overfitting: When a model learns training data too well, including noise, causing poor generalization and low accuracy on unseen data.

Underfitting: When a model is too simple to capture patterns in data, resulting in low accuracy on both training and testing datasets.

Model Accuracy: The percentage of correctly predicted outcomes compared to total predictions, measuring how well a model performs on a dataset.

Loss Function: A function that quantifies the difference between predicted values and actual values, guiding model optimization during training.

Gradient Descent: An optimization algorithm that iteratively updates model parameters to minimize the loss function using the gradient.

K-Means Clustering: An unsupervised learning algorithm that partitions data into K clusters based on feature similarity by minimizing intra-cluster variance.

Decision Tree: A supervised learning algorithm that splits data into branches based on feature decisions to predict outcomes at leaf nodes.

Nearest Neighbor Algorithm (KNN): Predicts the class of a data point based on the majority class of its k closest points in the feature space.

Cross-Validation: A technique to evaluate model performance by splitting data into multiple train-test folds to ensure generalization on unseen data.

Hyperparameter: A configuration parameter set before training that controls the learning process, like learning rate, tree depth, or number of clusters.

6. Clustering: An unsupervised technique to group similar data points into clusters based on feature similarity.

7. Prediction in ML: The process of estimating the target variable for new, unseen data using a trained model.

8. Training Error: The error or difference between predicted and actual outputs on the training dataset.

9. Testing Error: The error measured on unseen test data, reflecting the model's generalization performance.

10. Model Generalization: The ability of a model to perform well on unseen data, not just the training set.

11. Cost Function: A function that measures how well a model's predictions match the actual target values.

12. Optimization in ML: The process of adjusting model parameters to minimize the cost or loss function.

13. Centroid in Clustering: The central point of a cluster representing the mean position of all points within that cluster.

14. Entropy in Decision Trees: A measure of impurity or disorder in data; used to decide the best feature to split nodes.

15. Distance Metric: A function that quantifies similarity or dissimilarity between two data points.

16. Euclidean Distance: The straight-line distance between two points in n-dimensional space.

17. Validation Dataset: A dataset used to tune model hyperparameters and prevent overfitting during training.

18. Model Complexity: Refers to the flexibility of a model; more parameters can capture more patterns but may overfit.

19. Regularization: A technique to prevent overfitting by penalizing large model parameters during training.

20. Learning Rate: A hyperparameter controlling the step size at each iteration during optimization to minimize the loss function.
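Gradient descent, the loss function, and the learning rate from the definitions above fit together as in this small sketch. It fits the line y = mx + b by minimizing mean squared error. The data points, the function name fit_line, and the chosen learning_rate and epochs values are illustrative assumptions, not part of the original notes.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = m*x + b with batch gradient descent on mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training point under the current m, b
        errors = [(m * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2)
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient; learning_rate sets the step size
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Hypothetical data generated from y = 2x + 1 with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
print("slope:", round(m, 4), "intercept:", round(b, 4))
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small needs many more iterations to reach the same loss, which is why it is treated as a hyperparameter to tune.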