Data Science
Lab Task: VI
Semester VII
Submitted to: Ma’am Fatima
Submitted by: Rooha Tanveer
LAB 06: Data Preprocessing using Python
Objective: To apply various data preprocessing techniques using the Mall Customers Dataset from Kaggle.
Dataset: This dataset contains customer demographic details and spending behavior at a shopping mall.
Download Link: [Link]
File Name: Mall_Customers.csv
Dataset Description:
Column Name             | Data Type   | Description
CustomerID              | Integer     | Unique ID of the customer
Gender                  | Categorical | Male/Female
Age                     | Numerical   | Age of the customer
Annual Income (k$)      | Numerical   | Annual income in thousands of dollars
Spending Score (1-100)  | Numerical   | Score assigned by the mall based on spending patterns
Task Requirements
Step 1: Load the Dataset
• Import required libraries (pandas, numpy, sklearn.preprocessing).
• Load Mall_Customers.csv and display the first few rows.
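A minimal sketch of Step 1. So that it runs without the Kaggle download, it reads a few sample rows from an in-memory string; the rows in SAMPLE_CSV and the helper name load_customers are illustrative assumptions, not part of the lab material.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Mall_Customers.csv (values are illustrative).
SAMPLE_CSV = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
"""


def load_customers(source):
    """Read the customer data from a file path or a file-like object."""
    return pd.read_csv(source)


# In the lab, pass the real path instead: load_customers("Mall_Customers.csv")
df = load_customers(io.StringIO(SAMPLE_CSV))

print(df.head())   # first few rows
print(df.dtypes)   # quick check of the column types
```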
Step 2: Handle Missing Values
• Check for missing values in all columns.
• If any missing values exist in numerical columns (Age, Annual Income, Spending Score), replace them
with the mean.
• If any missing values exist in categorical columns (Gender), replace them with the most frequent
value.
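One possible way to carry out Step 2 is sketched below. The small DataFrame is made-up data with gaps inserted on purpose so that the imputation is visible; the real dataset may contain no missing values at all.

```python
import numpy as np
import pandas as pd

# Made-up rows with deliberate gaps (not taken from the real dataset).
df = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "Age": [19.0, np.nan, 20.0, 23.0],
    "Annual Income (k$)": [15.0, 15.0, np.nan, 16.0],
    "Spending Score (1-100)": [39.0, 81.0, 6.0, np.nan],
})

# Count missing values per column.
print(df.isnull().sum())

# Numerical columns: replace missing values with the column mean.
numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

# Categorical column: replace missing values with the most frequent value.
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

print(df.isnull().sum())
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a selected column, avoids the chained-assignment warnings raised by recent pandas versions.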
Step 3: Encode Categorical Data
• Convert Gender using Label Encoding (Male=1, Female=0).
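A short sketch of the encoding in Step 3, using an explicit mapping so that the codes required by the task (Male=1, Female=0) are visible in the code. The dictionary name gender_codes is an assumption for illustration. scikit-learn's LabelEncoder would give the same codes here, because it numbers the classes in sorted order.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Explicit mapping required by the task: Female -> 0, Male -> 1.
gender_codes = {"Female": 0, "Male": 1}
df["Gender"] = df["Gender"].map(gender_codes)

print(df)
```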
Step 4: Detect and Handle Outliers
• Use the IQR (Interquartile Range) method to detect and remove outliers in Annual Income and
Spending Score.
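The IQR rule in Step 4 keeps only the rows that lie within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where IQR = Q3 - Q1. The sketch below applies it to both columns in turn; the function name remove_outliers_iqr, the parameter k, and the sample values (including the extreme income of 300) are assumptions made for the example.

```python
import pandas as pd


def remove_outliers_iqr(frame, column, k=1.5):
    """Return a copy of frame without rows whose value in column lies outside the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default.
    return frame[frame[column].between(lower, upper)].copy()


# Made-up values; the income of 300 is a deliberate outlier.
df = pd.DataFrame({
    "Annual Income (k$)": [15, 16, 17, 18, 19, 20, 300],
    "Spending Score (1-100)": [39, 81, 6, 77, 40, 76, 50],
})

for col in ["Annual Income (k$)", "Spending Score (1-100)"]:
    df = remove_outliers_iqr(df, col)

print(df)
```

Filtering the columns one after the other means the second column's quartiles are computed on the rows that survived the first filter, which is one reasonable reading of the task.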
Step 5: Feature Scaling
• Apply Standardization (Z-score normalization) to Age, Annual Income, and Spending Score.
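Standardization in Step 5 replaces each value x with z = (x - mean) / std, so every scaled column has mean 0 and standard deviation 1. The sketch below does this directly in pandas on made-up values. It uses the population standard deviation (ddof=0), which is also what scikit-learn's StandardScaler uses, so the two approaches should agree.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Age": [19.0, 21.0, 20.0, 23.0, 31.0],
    "Annual Income (k$)": [15.0, 15.0, 16.0, 16.0, 17.0],
    "Spending Score (1-100)": [39.0, 81.0, 6.0, 77.0, 40.0],
})

cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# z = (x - mean) / std, computed column by column; ddof=0 is the population std.
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

print(df.round(3))
```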
Step 6: Save the Preprocessed Dataset
• Save the cleaned dataset as Mall_Customers_Processed.csv.
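A brief sketch of Step 6. It writes to a temporary directory so that nothing in the working folder is overwritten and then reads the file back as a check; in the lab itself the file would be saved next to the script. The sample values are made up.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned dataset.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": [1, 0, 0],
    "Age": [-1.25, 0.5, 0.75],
})

# Temporary location for the example; in the lab use "Mall_Customers_Processed.csv" directly.
out_path = os.path.join(tempfile.mkdtemp(), "Mall_Customers_Processed.csv")

# index=False keeps the DataFrame index out of the CSV file.
df.to_csv(out_path, index=False)

# Read the file back to confirm that it was written as expected.
reloaded = pd.read_csv(out_path)
print(reloaded)
```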
1. Create a synthetic dataset (students_performance.csv) using the following script:
import pandas as pd
import numpy as np
# Generate synthetic data
np.random.seed(42)
data = {
    "StudentID": range(1, 51),
    "Gender": np.random.choice(["Male", "Female"], 50),
    "Age": np.random.randint(15, 22, 50).astype(float),
    "Math Score": np.random.randint(30, 100, 50).astype(float),
    "Reading Score": np.random.randint(35, 100, 50).astype(float),
    "Writing Score": np.random.randint(40, 100, 50).astype(float)
}
# Introduce missing values at randomly chosen positions
data["Age"][np.random.choice(50, 5, replace=False)] = np.nan
data["Math Score"][np.random.choice(50, 3, replace=False)] = np.nan
data["Reading Score"][np.random.choice(50, 2, replace=False)] = np.nan
# Create DataFrame
df = pd.DataFrame(data)
# Save CSV
df.to_csv("students_performance.csv", index=False)
print("Synthetic 'students_performance.csv' file created successfully!")
Output:
2. Run the preprocessing script (intro_data_preprocessing.py):
python intro_data_preprocessing.py
Python Implementation (intro_data_preprocessing.py)
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
# Step 1: Load the dataset
file_path = "students_performance.csv"  # Ensure this file is in the same directory
df = pd.read_csv(file_path)
# Display first few rows
print("Original Dataset:")
print(df.head())
# Step 2: Check for missing values
print("\nMissing Values Before Handling:\n", df.isnull().sum())
# Fill missing numerical values with the column mean
# (assigning back avoids chained inplace fillna, which recent pandas versions warn about)
for col in ['Age', 'Math Score', 'Reading Score', 'Writing Score']:
    df[col] = df[col].fillna(df[col].mean())
# Fill missing categorical values with the most frequent value
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
print("\nMissing Values After Handling:\n", df.isnull().sum())
# Step 3: Encode Categorical Data
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])  # Male=1, Female=0
# Step 4: Detect and Handle Outliers using IQR method
def remove_outliers(column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # .copy() so that later column assignments act on an independent DataFrame
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)].copy()
df = remove_outliers('Math Score')
print("\nDataset After Removing Outliers:")
print(df.head())
# Step 5: Feature Scaling using Min-Max Scaling
scaler = MinMaxScaler()
df[['Math Score', 'Reading Score', 'Writing Score']] = scaler.fit_transform(
    df[['Math Score', 'Reading Score', 'Writing Score']]
)
# Step 6: Save the Preprocessed Dataset
df.to_csv("students_performance_processed.csv", index=False)
print("\nPreprocessing complete. Cleaned dataset saved as 'students_performance_processed.csv'.")
Output: