Data Cleaning & Feature Engineering
Data Cleaning
● Real-world data is often incomplete or messy
● Most machine learning models cannot handle missing values directly
● Data cleaning ensures:
    ● Better accuracy
    ● Reliable predictions
    ● Smooth model training
Missing Values Problem
● Missing values occur due to:
    ● Data collection errors
    ● Incomplete records
    ● Corrupted data
● Example: missing values in numerical features
● Must be handled before training the model
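
Before handling missing values, it helps to know where they are. The sketch below counts missing entries per feature. It assumes a toy dataset stored as a list of dicts in which `None` marks a missing value; the data and the `count_missing` helper are hypothetical, not part of the slides.

```python
# Hypothetical toy dataset: each row is a dict, None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 41, "income": None},
    {"age": 33, "income": None},
]

def count_missing(rows):
    """Return {feature name: number of rows where the value is None}."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            # (value is None) is a bool, which adds as 0 or 1.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

missing = count_missing(rows)
print(missing)
```

A feature with many missing entries is a candidate for removal, while one with only a few may be better repaired.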
Methods to Handle Missing Values
● Three common approaches:
    ● Remove rows with missing values
    ● Remove the entire feature
    ● Replace missing values with a statistic
        ● Mean
        ● Median
        ● Zero
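
The three approaches can be sketched in plain Python as follows. This is a minimal illustration, assuming rows are dicts and `None` marks a missing value; the data and the helper names (`drop_rows`, `drop_feature`, `impute`) are hypothetical. In practice a data-frame library would usually do this work.

```python
from statistics import median

# Hypothetical toy data: None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 41, "income": 58000},
    {"age": 33, "income": None},
]

def drop_rows(rows):
    """Approach 1: keep only rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_feature(rows, name):
    """Approach 2: remove one feature (column) from every row."""
    return [{k: v for k, v in r.items() if k != name} for r in rows]

def impute(rows, name, fill):
    """Approach 3: replace missing values of one feature with `fill`."""
    return [{**r, name: fill if r[name] is None else r[name]} for r in rows]

# The statistic is computed from the observed (non-missing) values only.
age_median = median(r["age"] for r in rows if r["age"] is not None)
cleaned = impute(rows, "age", age_median)
print(cleaned)
```

Each helper returns a new list, so the original rows stay available for comparing the approaches.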
When to Use Which Method
● Remove rows
    ● Very few missing values
    ● Large dataset
● Remove feature
    ● Feature is not important
    ● Too many missing values
● Replace values (median)
    ● Feature is important
    ● Want to keep all data
    ● Median is robust to outliers
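
To see why the median is often preferred over the mean as a replacement value, the short sketch below compares both statistics before and after one extreme value is added. The income figures are made up for illustration.

```python
from statistics import mean, median

# Hypothetical incomes in thousands; the last list adds one extreme value.
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [1000]

print("mean:  ", mean(incomes), "->", mean(with_outlier))
print("median:", median(incomes), "->", median(with_outlier))
```

The mean jumps from 35 to roughly 196, while the median only moves from 35 to 36.5, so a median fill value stays close to the typical observation.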
Feature Engineering
● Feature engineering = transforming raw data into usable features
● Most ML models cannot learn directly from raw logs or text
● Dataset consists of:
    ● Features (x) → inputs
    ● Labels (y) → outputs
● Goal: create informative features with high predictive power
● Requires creativity + domain knowledge
Examples of Feature Engineering
● From user interaction logs, we can create:
    ● Subscription price
    ● Login frequency (daily / weekly)
    ● Average session duration
    ● Response time
● Anything measurable can be a feature
● Good features → better predictions, lower model bias
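
As a rough illustration of turning interaction logs into features, the sketch below derives login frequency and average session duration per user. The log format `(user_id, day, session_minutes)`, the sample records, and the `build_features` helper are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, day, session_minutes).
logs = [
    ("u1", 1, 10.0),
    ("u1", 2, 20.0),
    ("u1", 2, 30.0),
    ("u2", 5, 8.0),
]

def build_features(logs, days_observed):
    """Aggregate raw log records into one feature dict per user."""
    sessions = defaultdict(list)
    for user, _day, minutes in logs:
        sessions[user].append(minutes)
    return {
        user: {
            "logins_per_day": len(m) / days_observed,
            "avg_session_minutes": sum(m) / len(m),
        }
        for user, m in sessions.items()
    }

features = build_features(logs, days_observed=7)
print(features)
```

Each user ends up as one row of numeric features, which is the shape most models expect as input.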
One-Hot Encoding (Categorical Features)
● Some models only work with numerical data
● Categorical values (e.g., colors, days) must be converted
● One-Hot Encoding:
    ● Each category → separate binary feature
● Avoid assigning numbers like 1, 2, 3 when order is meaningless
● Prevents a false ordering, which could lead the model to find patterns that do not exist (overfitting)
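
One-hot encoding can be sketched in a few lines. The `one_hot` helper and the color values are hypothetical; sorting the categories is just one way to get a stable column order.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)  # ['blue', 'green', 'red']
for color, row in zip(colors, encoded):
    print(color, row)
```

Every row contains exactly one 1, so no category is numerically "larger" than another.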
Binning (Numerical → Categorical)
● Converts continuous values into ranges (bins)
● Example: age groups instead of exact age
● Helps when:
    ● Exact value is less important than the range
    ● Dataset is small
● Gives the model a useful hint and reduces complexity
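
The age-group example can be sketched with the standard-library `bisect` module. The bin edges and labels below are arbitrary choices for illustration, not values from the slides.

```python
from bisect import bisect_right

# Hypothetical bin edges: each edge is the lower bound of the next bin.
EDGES = [18, 35, 60]
LABELS = ["under 18", "18-34", "35-59", "60+"]

def age_group(age):
    """Map an exact age to the label of the bin that contains it."""
    return LABELS[bisect_right(EDGES, age)]

for age in (17, 18, 34, 35, 72):
    print(age, "->", age_group(age))
```

The resulting labels are categorical, so they would typically be one-hot encoded before being passed to a model.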
Feature Scaling
● Normalization
    ● Scales values into a fixed range (e.g., 0 to 1)
    ● Useful when feature ranges differ greatly
● Standardization
    ● Rescales features to:
        ● Mean = 0
        ● Standard deviation = 1
    ● Preferred when:
        ● Data is approximately normally distributed
        ● Outliers exist (normalization would squeeze the typical values into a narrow range)
        ● Unsupervised learning is used
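
Both scaling methods can be sketched as small functions. This example assumes min-max scaling for normalization and the population standard deviation for standardization. The `heights` data is made up, and the helpers assume the values are not all identical, since that would cause a division by zero.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: rescale to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature values (heights in cm).
heights = [150.0, 160.0, 170.0, 180.0, 190.0]

normalized = normalize(heights)
standardized = standardize(heights)
print(normalized)
print(standardized)
```

When scaling is part of a training pipeline, the minimum, maximum, mean, and standard deviation are normally computed on the training data and then reused for new data.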