
ML Exp 3

The document outlines an experiment for a Machine Learning Laboratory course focused on handling missing data in a dataset from a social media survey of 500 Gen Z users. It details the data quality issues, including missing age values and duplicate records, and provides a Python program for data imputation and cleaning. The applications of the cleaned data include targeted marketing, behavior analysis, predictive modeling, and ensuring data quality for machine learning readiness.

Department of Electronics and Computer Science

Semester: TE Sem VI EXCS
Subject: ML Laboratory
Laboratory Professor: Prof. Uma Jaishankar
Student Name: Mayuresh Takalkar
Roll Number: 23108A0049
Experiment Number: 3, Missing Data Imputation
Problem Statement: A digital-first brand wants to vibe-check how Gen Z interacts with social media so it can design hyper-targeted marketing campaigns that actually resonate. The goal is to understand daily usage patterns, platform preferences, and feature engagement among young adults.
You are provided with a CSV dataset collected from a social media survey of 500 Gen Z users, aged 18–25. However, the dataset isn't perfectly clean, which reflects real-world data collected from online forms and quick surveys.
The dataset includes the following attributes:
• Age of the respondent
• Gender identity (male, female, non-binary)
• Social media platforms actively used (e.g., Instagram, LinkedIn, Snapchat, Twitter, Facebook)
• Average hours spent on social media per day
• Favorite social media features (stories, reels, live streaming, etc.)
Data Challenges
Before meaningful insights can be extracted, the following data quality issues must be addressed:
• 15% of the age values are missing, possibly due to skipped questions or privacy concerns
• 5% of the records are duplicated, likely caused by multiple form submissions or syncing issues
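Before fixing these issues, it can help to measure them. The following is a minimal sketch, assuming the column names `Age` and `Avg_Hours_Per_Day` used later in the program; the `Gender` column name and all values in the toy DataFrame are made up for illustration and are not the survey data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey file; values are made up for illustration only.
df = pd.DataFrame({
    "Age": [19, np.nan, 22, 19, 24, np.nan],
    "Gender": ["female", "male", "non-binary", "female", "male", "female"],
    "Avg_Hours_Per_Day": [3.5, 4.0, 2.0, 3.5, 5.5, 1.0],
})

# Share of missing values per column, as a percentage.
missing_pct = df.isna().mean() * 100

# Rows that exactly repeat an earlier row (the first occurrence is not counted).
duplicate_count = int(df.duplicated().sum())
duplicate_pct = duplicate_count / len(df) * 100

print(missing_pct.round(1))
print(f"Duplicate rows: {duplicate_count} ({duplicate_pct:.1f}%)")
```

On the real file, the same two checks should report roughly 15% missing `Age` values and 5% duplicated rows if the dataset matches the description above.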

Tasks:
1. Handle the missing data in the dataset using appropriate imputation techniques
2. Remove duplicates
3. Perform feature scaling
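The duplicate-removal and feature-scaling tasks can be sketched as follows. This is one possible approach rather than the report's own code: it assumes duplicates are fully identical rows and that the numeric columns are named `Age` and `Avg_Hours_Per_Day`; the toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (made up); row 2 repeats row 0 exactly.
df = pd.DataFrame({
    "Age": [18.0, 21.0, 18.0, 25.0],
    "Avg_Hours_Per_Day": [2.0, 4.0, 2.0, 6.0],
})

# Keep the first occurrence of each fully identical row.
deduped = df.drop_duplicates().reset_index(drop=True)

# Standardize the numeric columns to zero mean and unit variance.
numeric_cols = ["Age", "Avg_Hours_Per_Day"]
scaler = StandardScaler()
scaled = deduped.copy()
scaled[numeric_cols] = scaler.fit_transform(deduped[numeric_cols])

print(scaled)
```

Removing duplicates before imputing or scaling keeps repeated submissions from weighting the computed mean, median, and standard deviation twice.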
Resources / Apparatus Required: Python libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
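`ColumnTransformer` and `Pipeline` are listed among the libraries, so here is a hedged sketch of how they might combine imputation and scaling in one object. The helper `numeric_steps` and the toy values are hypothetical; the column names and the mean/median choices follow the program in this report.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up) with one missing value per numeric column.
df = pd.DataFrame({
    "Age": [18.0, np.nan, 22.0, 24.0],
    "Avg_Hours_Per_Day": [2.0, 4.0, np.nan, 6.0],
})


def numeric_steps(strategy):
    """Hypothetical helper: impute with the given strategy, then standardize."""
    return Pipeline([
        ("impute", SimpleImputer(strategy=strategy)),
        ("scale", StandardScaler()),
    ])


# Apply a different imputation strategy to each column.
preprocess = ColumnTransformer([
    ("age", numeric_steps("mean"), ["Age"]),
    ("hours", numeric_steps("median"), ["Avg_Hours_Per_Day"]),
])

# Result is a NumPy array with one scaled column per input column.
prepared = preprocess.fit_transform(df)
print(prepared)
```

Bundling the steps this way means the same fitted imputation values and scaling parameters can later be applied to new survey responses with `preprocess.transform`.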
Program:
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the survey CSV
df = pd.read_csv("genz_social_media_500.csv")

# Check missing values before imputation
print("Missing values before imputation:")
print(df.isnull().sum())

# Impute Age with mean
age_imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array; ravel() flattens it into a single column
df['Age'] = age_imputer.fit_transform(df[['Age']]).ravel()

# Impute Avg_Hours_Per_Day with median
hours_imputer = SimpleImputer(strategy='median')
df['Avg_Hours_Per_Day'] = hours_imputer.fit_transform(df[['Avg_Hours_Per_Day']]).ravel()

# Check missing values after imputation
print("\nMissing values after imputation:")
print(df.isnull().sum())

# Save the cleaned dataset
df.to_csv("genz_social_media_500_clean.csv", index=False)
print("\nCleaned dataset saved as 'genz_social_media_500_clean.csv'")
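
The program fills Age with the mean and Avg_Hours_Per_Day with the median. The report does not state why; one common reason, assumed here, is that the median is less affected by extreme values. A small sketch with made-up hours illustrates the difference:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up daily-hours values with one extreme entry and one missing entry.
hours = np.array([[2.0], [3.0], [4.0], [20.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(hours)
median_filled = SimpleImputer(strategy="median").fit_transform(hours)

# The last row held the missing value; compare what each strategy put there.
print("Mean fill:  ", mean_filled[-1, 0])    # 7.25
print("Median fill:", median_filled[-1, 0])  # 3.5
```

The single 20-hour entry pulls the mean fill well above the typical respondent, while the median fill stays close to the bulk of the values.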

Output Screenshots: [screenshots not included in this text]

Applications:
1. Targeted Marketing: Helps brands segment Gen Z and design personalized campaigns.
2. Behavior Analysis: Understand platform usage and feature preferences.
3. Predictive Modeling: Build models for engagement trends and recommendations.
4. Data Quality: Handles missing values and duplicates for reliable analysis.
5. ML Readiness: Scales numerical features for machine learning algorithms.
