Department of Electronics and
Computer Science
Semester TE Sem VI EXCS
Subject ML Laboratory
Laboratory Professor Prof. Uma Jaishankar
Student Name Mayuresh Takalkar
Roll Number 23108A0049
Experiment Number 3: Missing Data Imputation
Problem Statement: A digital-first brand wants to vibe-check how Gen Z interacts with social media so it can design
hyper-targeted marketing campaigns that actually resonate. The goal is to understand daily usage
patterns, platform preferences, and feature engagement among young adults.
You are provided with a CSV dataset collected from a social media survey of 500 Gen Z users, aged
18–25. However, the dataset isn’t perfectly clean — which reflects real-world data collected from
online forms and quick surveys.
The dataset includes the following attributes:
Age of the respondent
Gender identity (male, female, non-binary)
Social media platforms actively used (e.g., Instagram, LinkedIn, Snapchat, Twitter, Facebook)
Average hours spent on social media per day
Favorite social media features (stories, reels, live streaming, etc.)
Data Challenges
Before meaningful insights can be extracted, the following data quality issues must be addressed:
15% of the age values are missing, possibly due to skipped questions or privacy concerns
5% of the records are duplicated, likely caused by multiple form submissions or syncing
issues
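Before fixing these issues, it helps to measure them. The following is a minimal sketch on a small synthetic frame, not the actual survey file. The column names Age and Avg_Hours_Per_Day follow the program in this report, while the Gender column name and all values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the survey data (values are made up for illustration).
sample = pd.DataFrame({
    "Age": [19, np.nan, 22, 24, 19, 21],
    "Gender": ["female", "male", "non-binary", "male", "female", "female"],  # assumed column name
    "Avg_Hours_Per_Day": [3.5, 4.0, 2.0, 5.5, 3.5, 1.0],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isnull().mean() * 100

# Number of rows that exactly repeat an earlier row.
duplicate_count = int(sample.duplicated().sum())

print(missing_pct.round(1))
print("Duplicate rows:", duplicate_count)
```

On the real dataset the same two checks would be expected to report roughly 15% missing ages and about 5% duplicated rows, as stated in the problem description.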
Tasks Handle the missing data in the dataset using appropriate imputation techniques
Remove Duplicates
Perform Feature Scaling
Resources / Apparatus Required: Python Libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
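The libraries listed above can be combined so that imputation and scaling run as one preprocessing step. The sketch below is one possible arrangement, assuming mean imputation for Age and median imputation for Avg_Hours_Per_Day as in the program of this report. The data values and the step names ("impute", "scale", "age", "hours") are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric columns standing in for the survey data (made-up values).
sample = pd.DataFrame({
    "Age": [18.0, np.nan, 22.0, 24.0],
    "Avg_Hours_Per_Day": [2.0, 4.0, np.nan, 6.0],
})

# Each column gets its own impute-then-scale pipeline.
age_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
hours_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column to its pipeline; the output is a NumPy array
# with one column per transformed feature.
preprocessor = ColumnTransformer([
    ("age", age_steps, ["Age"]),
    ("hours", hours_steps, ["Avg_Hours_Per_Day"]),
])

prepared = preprocessor.fit_transform(sample)
print(prepared.shape)
```

Wrapping the steps in a Pipeline keeps the imputation statistics and the scaling parameters tied to the data they were fitted on, which matters once the same preprocessing has to be reapplied to new survey responses.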
Program
import pandas as pd
from sklearn.impute import SimpleImputer
# Load your CSV
df = pd.read_csv("genz_social_media_500.csv")
# Check missing values before
print("Missing values before imputation:")
print(df.isnull().sum())
# Impute Age with mean
age_imputer = SimpleImputer(strategy='mean')
df['Age'] = age_imputer.fit_transform(df[['Age']])
# Impute Avg_Hours_Per_Day with median
hours_imputer = SimpleImputer(strategy='median')
df['Avg_Hours_Per_Day'] = hours_imputer.fit_transform(df[['Avg_Hours_Per_Day']])
# Check missing values after
print("\nMissing values after imputation:")
print(df.isnull().sum())
# Save the cleaned dataset
df.to_csv("genz_social_media_500_clean.csv", index=False)
print("\nCleaned dataset saved as 'genz_social_media_500_clean.csv'")
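The program above covers the imputation task only. The remaining two tasks, removing duplicates and feature scaling, could be handled along the following lines. This is a hedged sketch on a synthetic frame rather than the survey file. The names clean, deduped, scaled and numeric_cols are hypothetical, and standardization with StandardScaler is assumed because that scaler appears in the library list.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the imputed data; the last row repeats the first.
clean = pd.DataFrame({
    "Age": [19.0, 21.0, 23.0, 25.0, 19.0],
    "Avg_Hours_Per_Day": [2.0, 3.0, 4.0, 5.0, 2.0],
})

# Drop exact duplicate rows, keeping the first occurrence of each.
deduped = clean.drop_duplicates().reset_index(drop=True)

# Standardize the numeric columns to zero mean and unit variance.
numeric_cols = ["Age", "Avg_Hours_Per_Day"]
scaled = deduped.copy()
scaled[numeric_cols] = StandardScaler().fit_transform(deduped[numeric_cols])

print(len(clean), "->", len(deduped), "rows")
print(scaled.round(2))
```

Duplicates are dropped before scaling so that repeated submissions do not bias the mean and standard deviation used by the scaler.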
Output Screenshots
Applications
1. Targeted Marketing: Helps brands segment Gen Z users and design personalized campaigns.
2. Behavior Analysis: Reveals platform usage and feature preferences.
3. Predictive Modeling: Supports building models for engagement trends and recommendations.
4. Data Quality: Handles missing values and duplicates for reliable analysis.
5. ML Readiness: Scales numerical features for machine learning algorithms.