U1 Int395

Uploaded by

rr8303804

INT395- SUPERVISED ML

Unit 1:
Introduction
Data Preprocessing

Presented By: Blossom Kaler

Assistant Professor
SCSE, LPU
WHAT IS MACHINE LEARNING?
Machine Learning is a branch of Artificial Intelligence.
It enables computers to learn from data and make predictions without being explicitly programmed.
Instead of writing specific rules, we train models to find patterns in data and make decisions on their own.
TYPES OF MACHINE LEARNING
Supervised
Unsupervised
Reinforcement
SUPERVISED ML
Supervised learning is when we train the machine using labelled data.
Labelled data is already tagged with the correct answer.
After training, the machine is provided with a new set of examples (testing data) to make predictions.
Types: Classification and Regression
LABELLED VERSUS UNLABELLED DATA
CLASSIFICATION: SUPERVISED ML

The algorithm processes the training data, learning the relationships between the input features and the output labels.
(Figure: training data and testing data)
REGRESSION: SUPERVISED ML

(Figure: linear regression of a dependent variable on an independent variable)
REAL LIFE APPLICATIONS

Image Classification
House Price Prediction
Spam Detection
COMMON DATA ISSUES
Raw data often contains :
missing values due to human error or system failure
duplicate data
inconsistent formats (e.g., date, currency)
outliers that misrepresent the overall trend
class imbalance causing biased model training
irrelevant features
These issues can negatively impact model performance.
EXAMPLE OF DIRTY DATA
HANDLING MISSING VALUES
Delete the complete row (generally not recommended)
Impute with mean/median for numerical features
Use mode for categorical feature imputation
Other imputation methods: KNN or regression-based filling
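The simple imputation strategies above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `impute`, the use of `None` as the missing-value marker, and the sample lists are assumptions made for this example, not part of the lecture.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    # Pick the statistic: mean/median for numbers, mode for categories.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None]             # numerical feature with gaps
cities = ["Delhi", "Pune", None, "Delhi"]   # categorical feature with a gap

print(impute(ages, "mean"))     # gaps filled with the mean of 25, 31, 40
print(impute(ages, "median"))   # gaps filled with the median, 31
print(impute(cities, "mode"))   # gap filled with the most frequent city
```

The median is usually the safer choice when the feature contains outliers, because a single extreme value can pull the mean far from the typical value.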
SKEWNESS
Skewness measures asymmetry in a data distribution.
Positive skew → tail on right.
Negative skew → tail on left.
Skewed data may harm algorithms that assume normally distributed data or errors (e.g., linear regression).
Correcting skewness helps improve model performance and interpretability.
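One common way to put a number on skewness is the moment-based coefficient, the third central moment divided by the cubed standard deviation. The sketch below assumes that definition (the slides do not name a specific formula), and the sample lists are made up for illustration.

```python
def skewness(xs):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n   # variance
    m3 = sum((x - mu) ** 3 for x in xs) / n   # third central moment
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]      # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]      # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)   # True: positive skew
print(skewness(left_tailed) < 0)    # True: negative skew
print(skewness(symmetric))          # 0.0 for perfectly symmetric data
```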
OUTLIER DETECTION
An outlier is a data point that significantly deviates from the rest of the data.

Using Clustering
Using Z-score
Using Box Plot
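Two of these detection methods are easy to sketch with the standard library. The thresholds used here (|z| > 3 for the Z-score rule and 1.5 × IQR for the box-plot rule) are common conventions assumed for this example, and the function names and data are illustrative.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(xs, threshold=3.0):
    """Flag points whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = mean(xs), pstdev(xs)
    return [x for x in xs if abs(x - mu) / sigma > threshold]

def iqr_outliers(xs, k=1.5):
    """Box-plot rule: flag points beyond k * IQR from the first or third quartile."""
    q1, _, q3 = quantiles(xs, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in xs if x < lower or x > upper]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

print(zscore_outliers(data))   # [95]
print(iqr_outliers(data))      # [95]
```

On small samples the Z-score rule can miss outliers, because the outlier itself inflates the standard deviation; the box-plot rule is less affected since quartiles are robust to extremes.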

OUTLIER HANDLING
Remove the Outliers: In cases where outliers are simply errors or irrelevant
to your analysis, you can remove them.
Transform Data: Apply transformations like logarithmic or square root
transformations to reduce the impact of outliers.
Use Clipping: Clipping involves setting an upper or lower limit on values;
anything beyond a limit is set to that limit.
Use Capping (Winsorizing): Replaces outlier values with a defined
maximum or minimum value taken from the dataset's distribution (for example,
the 5th and 95th percentiles).
Use Robust Algorithms: Certain algorithms like decision trees and
random forests are less sensitive to outliers.
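The difference between clipping at fixed limits and capping at percentiles can be sketched as follows. The 5th/95th percentile defaults, the function names, and the sample data are assumptions chosen for illustration.

```python
from statistics import quantiles

def clip(xs, lower, upper):
    """Clipping: force every value into the fixed interval [lower, upper]."""
    return [min(max(x, lower), upper) for x in xs]

def winsorize(xs, lower_pct=5, upper_pct=95):
    """Capping: limits come from the data's own percentiles."""
    cuts = quantiles(xs, n=100, method="inclusive")   # 99 percentile cut points
    return clip(xs, cuts[lower_pct - 1], cuts[upper_pct - 1])

data = list(range(1, 20)) + [500]    # 1..19 plus one extreme value

clipped = clip(data, 0, 50)          # limit chosen by hand
capped = winsorize(data)             # limit derived from the data

print(clipped[-1])                   # 50
print(max(capped) < 500)             # True: the extreme value was pulled in
```

Both approaches keep every row, unlike removal, which matters when the dataset is small.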
LOG TRANSFORMATION
The logarithmic transformation “compresses” large values more than small ones.
Reduces range and variability in data
Makes skewed data more symmetric
Works well for right-skewed, positive data
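A small sketch of the compression effect. It uses `math.log1p`, i.e. log(1 + x), which is an assumption made here so that a value of zero would not cause an error; the income figures are invented.

```python
import math

incomes = [20, 25, 30, 35, 40, 1000]          # right-skewed: one very large value
logged = [math.log1p(x) for x in incomes]     # log(1 + x) for each value

raw_range = max(incomes) - min(incomes)       # 980
log_range = max(logged) - min(logged)         # under 4 after the transform

print(raw_range, round(log_range, 2))
```

The ordering of the values is unchanged; only the spacing between them shrinks, most strongly at the high end.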

(Figure: right-skewed data vs. left-skewed data)

SQUARE ROOT TRANSFORMATION
Used to reduce mild right-skewness in
data
Applies the transformation y = √x
Compresses larger values, keeps small
values stable
Makes distribution closer to normal
Simpler than log, and safer for zero or very small values
(log is undefined at 0)
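A minimal sketch of the square root transformation on count-like data. The sample values are made up; they include a zero to show that the transform is defined there.

```python
import math

counts = [0, 1, 4, 9, 16, 100]
rooted = [math.sqrt(x) for x in counts]

print(rooted)   # [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
```

The gap between the largest value and the rest shrinks from 84 to 6, while the small values barely move, which is the milder compression described above.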
CLASS IMBALANCE
Class imbalance happens when some classes in a classification problem have
significantly more instances than others.
It often leads to biased model performance.
UNDERSAMPLING
Randomly removes samples from the majority class to
balance the dataset.
Pros:
Reduces dataset size, making training faster.
Helps prevent overfitting to the majority class.
Cons:
Risk of losing important data points, potentially affecting
model performance.
May not work well if the dataset is already small.
Example: If Class 1 has 1000 samples and Class 2 has 100,
random undersampling (RU) randomly removes 900 samples from Class 1 to
match Class 2.
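The example above can be reproduced with a short sketch. The function name `random_undersample`, the fixed seed, and the tuple-based samples are assumptions for illustration.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, the same size as the minority class."""
    rng = random.Random(seed)                        # seeded for reproducibility
    return rng.sample(majority, k=len(minority))     # sampling without replacement

class1 = [("class1", i) for i in range(1000)]        # majority class
class2 = [("class2", i) for i in range(100)]         # minority class

kept = random_undersample(class1, class2)
balanced = kept + class2

print(len(kept), len(balanced))   # 100 200
```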
OVERSAMPLING
Randomly duplicates samples from the minority class to
balance the dataset.
Pros:
Prevents loss of information.
Helps the model learn from more balanced data.
Cons:
Can lead to overfitting as the model sees repeated data
points.
Does not create new informative examples.
Example: If Class 1 has 1000 samples and Class 2 has 100,
random oversampling (RO) adds 900 duplicated samples drawn from Class 2 to
match Class 1.
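A matching sketch for random oversampling. As before, the function name, seed, and sample data are illustrative assumptions.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Grow the minority class to the majority size by duplicating random samples."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = rng.choices(minority, k=n_extra)    # sampling with replacement
    return minority + extra

class1 = [("class1", i) for i in range(1000)]   # majority class
class2 = [("class2", i) for i in range(100)]    # minority class

oversampled = random_oversample(class1, class2)

print(len(oversampled))        # 1000
print(len(set(oversampled)))   # 100: no new distinct examples were created
```

The second printed number makes the main drawback visible: the class is larger, but it contains the same 100 distinct points.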
SYNTHETIC MINORITY OVERSAMPLING
TECHNIQUE (SMOTE)
Instead of duplicating existing samples, SMOTE generates
synthetic samples by interpolating between existing
minority class instances.
Selects a random minority class sample
Finds its k nearest minority neighbors
Randomly selects one of these neighbors
Creates synthetic point between sample and neighbor
Repeats until desired balance is achieved
Pro: Prevents overfitting caused by simple oversampling
Con: May generate noisy or irrelevant samples
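The five steps above can be sketched from scratch as follows. This is a simplified illustration, not a reference implementation: the function name `smote`, the parameters `n_new`, `k` and `seed`, and the 2-D sample points are all assumptions. In practice a maintained library such as imbalanced-learn would normally be used.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))               # 1. pick a random minority sample
        sample = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(sample, p))[:k]  # 2. k nearest
        neighbour = rng.choice(neighbours)             # 3. pick one neighbour
        gap = rng.random()                             # position along the segment, in [0, 1)
        synthetic.append(tuple(s + gap * (n - s)       # 4. point between sample and neighbour
                               for s, n in zip(sample, neighbour)))
    return synthetic                                   # 5. loop ran until n_new points exist

minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8), (1.8, 1.9)]
new_points = smote(minority, n_new=10)

print(len(new_points))   # 10
```

Every synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies.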
BORDERLINE SMOTE
Focuses on samples near decision boundary
Identifies “danger” samples close to majority class
Generates new samples around these critical points
Improves learning of hard-to-classify regions
ADAPTIVE SYNTHETIC SAMPLING (ADASYN)
It is an extension of the SMOTE technique
Minority samples harder to classify get more attention
Regions with higher learning difficulty get more new points
Assign higher weights to hard-to-learn samples
Generate synthetic samples accordingly using nearest neighbors
Cons: Can create noise if difficult samples are actually outliers.
FEATURE SCALING
It is the process of standardizing or normalizing data to bring all features to a
common scale.
Without scaling, features with larger magnitudes may dominate the learning
process, biasing the model towards those features.
Normalization: final data lie between 0 and 1
Standardization: final data have mean = 0, variance = 1
NORMALIZATION
Also called Min–Max Scaling
Rescales data to a fixed range, usually [0, 1]
Preserves relative relationships among data
points
Suitable for methods using Euclidean
distance (kNN, K-Means)
Sensitive to outliers — range may distort if
extremes exist
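Min-max scaling uses the formula x' = (x − min) / (max − min). The sketch below applies it to an invented list of heights and then repeats the exercise with one extreme value added, to show the outlier sensitivity. It assumes the feature is not constant, since max = min would divide by zero.

```python
def min_max_scale(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

heights_cm = [150, 160, 170, 180, 190]
scaled = min_max_scale(heights_cm)
print(scaled)   # [0.0, 0.25, 0.5, 0.75, 1.0]

# One extreme value squeezes all the ordinary values into a narrow band near 0.
with_outlier = min_max_scale(heights_cm + [1000])
print(max(with_outlier[:5]) < 0.05)   # True
```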
STANDARDIZATION
Also called Z-score scaling
Centers data around mean 0, standard
deviation 1
Does not bound values to [0, 1]
Retains outlier influence but balances
feature variance
Preferred when outliers exist but shouldn’t be
clipped
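Standardization uses the formula z = (x − mean) / standard deviation. A minimal sketch, assuming the population standard deviation is used and using an invented list of scores:

```python
from statistics import mean, pstdev

def standardize(xs):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

scores = [2, 4, 4, 4, 5, 5, 7, 9]     # mean 5, standard deviation 2
z = standardize(scores)

print(z)   # values are no longer limited to [0, 1]; some are negative, one is 2.0
```

After scaling, the values have mean 0 and standard deviation 1, but their range depends on the data.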
FEATURE ENCODING
conversion of categorical or text data into
a numerical format
this helps algorithms to easily understand
and process the data
techniques: One-Hot Encoding, Label
Encoding, Ordinal Encoding
ONE-HOT ENCODING
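The three encoding techniques can be compared on a tiny example. The function names and the category lists are illustrative assumptions; alphabetical ordering of categories is a choice made here for determinism.

```python
def one_hot(values):
    """One column per category; exactly one 1 in each row."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def label_encode(values):
    """Arbitrary integer per category (here: alphabetical position)."""
    codes = {c: i for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

def ordinal_encode(values, order):
    """Integer per category that respects a meaningful, user-supplied order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]

colours = ["red", "green", "blue", "green"]
sizes = ["small", "large", "medium"]

print(one_hot(colours))
# (['blue', 'green', 'red'], [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
print(label_encode(colours))                                  # [2, 1, 0, 1]
print(ordinal_encode(sizes, ["small", "medium", "large"]))    # [0, 2, 1]
```

One-hot encoding avoids implying an order among colours, at the cost of one extra column per category; ordinal encoding is appropriate only when the order is real, as with sizes.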
FEATURE ENGINEERING
Feature engineering transforms raw data
into meaningful input features.
It improves model performance and
learning efficiency significantly.
Techniques include encoding, scaling,
binning, and feature creation.
Domain knowledge helps in designing
impactful custom features.
Goal: enhance predictive power and
simplify model complexity.
FEATURE SELECTION
identifying and retaining only the most
important features for model training
Helps remove irrelevant, redundant, or noisy
features
Focuses on the most informative attributes
of the dataset
reduces overfitting, improves accuracy,
speeds up training
Example: In medical diagnosis, only the most
relevant biomarkers are selected from
hundreds of tests to predict disease.
DATA SPLITTING
dividing a dataset into training, validation, and testing
sets to evaluate model performance.
Split data into training and testing sets
Use validation set for model tuning
Ensure proportional representation in splits
Avoid data leakage during splitting process
Test set remains unseen during training
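"Proportional representation" is usually achieved with a stratified split, where each class is split separately. The sketch below is a from-scratch illustration; the function name `stratified_split`, the 20% test fraction, and the spam/ham labels are assumptions for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)                 # group row indices by class label
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # shuffle within the class
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

labels = ["spam"] * 20 + ["ham"] * 80         # imbalanced labels, 20% spam
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))                   # 80 20
print(sum(labels[i] == "spam" for i in test_idx))      # 4, i.e. 20% of the test set
```

To avoid leakage, any statistics used for preprocessing (for example the mean used in scaling) should be computed on the training indices only and then applied to the test rows.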
K-FOLD CROSS VALIDATION
To evaluate model performance more
reliably using all available data.
Process:
Split dataset into k equal folds.
For each iteration (total k):
Train model on (k−1) folds
Validate on the remaining 1 fold
Repeat this k times (each fold acts as
validation once).
Average the validation results → gives final
performance estimation
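The process above can be sketched as an index generator plus an averaging loop. The names `k_fold_indices` and `cross_validate` are illustrative, and `evaluate` is a hypothetical callback standing in for "train a model on these rows and score it on those rows"; here a dummy scorer is used so the example runs on its own.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set exactly once."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the validation score returned by `evaluate` over the k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 5))
print(folds[0])   # ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])

# Dummy scorer (an assumption): the fraction of all rows that fall in the validation fold.
mean_score = cross_validate(10, 5, lambda train, val: len(val) / 10)
```

In practice the rows are usually shuffled (or stratified by class) before the folds are cut, which this sketch leaves out for brevity.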
DIMENSIONALITY REDUCTION
process of reducing the number of input
variables while preserving essential
information.
Helps overcome the curse of dimensionality
in large datasets.
Improves computation time and reduces
model overfitting risk.
PCA extracts new orthogonal features
maximizing variance captured.
Simplifies models, enhances interpretability,
and boosts generalization.
DIMENSIONALITY REDUCTION
There are two components of dimensionality reduction:
Feature selection: We try to find a subset of the original set of variables, or features, that
can be used to model the problem. It usually involves three approaches:
Filter, Wrapper, Embedded
Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional
space, i.e. a space with fewer dimensions.
Methods of Dimensionality Reduction:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
Generalized Discriminant Analysis (GDA)
DIMENSIONALITY REDUCTION: PCA
Imagine you're analyzing a dataset with dozens of features, like a customer survey with age,
income, purchase history, and website behavior.
While rich, this high dimensionality can be a curse: complex models, slower training, and even
irrelevant info hiding the good stuff. That's where Principal Component Analysis (PCA) comes in
as your dimensionality reduction hero!
The Gist: PCA takes your high-dimensional data and squishes it into a lower-dimensional
space, capturing the most important information but ditching the redundancy. Think of it like
summarizing a long lecture into key points: you lose some detail, but the core meaning
remains.
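To make the idea concrete, here is a from-scratch sketch that finds only the first principal component: it centres the data, builds the covariance matrix, and uses power iteration to find the direction of greatest variance. Power iteration is a technique chosen here for brevity (library implementations typically use SVD), and the function name, iteration count, and 2-D sample data are assumptions.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, projections) for the direction of maximum variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d                                  # arbitrary starting direction
    for _ in range(iterations):                    # power iteration on the covariance
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, projections

# Two correlated features: the second is roughly twice the first.
rows = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
direction, projections = first_principal_component(rows)

print([round(x, 2) for x in direction])   # a unit vector pointing roughly along (1, 2)
print(len(projections))                   # 4: each 2-D row is now a single number
```

Each 2-D row is replaced by one coordinate along the principal direction, which is the "squishing" described above: two redundant features become one.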
THANK YOU
