
Data Preprocessing with Python Techniques

The document outlines a lab task focused on data preprocessing using the Mall Customers Dataset, detailing steps such as loading the dataset, handling missing values, encoding categorical data, detecting outliers, and applying feature scaling. It also includes instructions for creating a synthetic dataset of student performance and provides a Python script for preprocessing this dataset. The final output is a cleaned dataset saved as 'students_performance_processed.csv'.

Uploaded by

223563

Data Science

Lab Task: VI
Semester VII
Submitted to: Ma’am Fatima
Submitted by: Rooha Tanveer
LAB 06: Data Preprocessing using Python
Objective: To apply various data preprocessing techniques using the Mall Customers Dataset from Kaggle.

Dataset: This dataset contains customer demographic details and spending behavior at a shopping mall.
Download Link: [Link]

File Name: Mall_Customers.csv

Dataset Description:
Column Name               Data Type     Description
CustomerID                Integer       Unique ID of the customer
Gender                    Categorical   Male/Female
Age                       Numerical     Age of the customer
Annual Income (k$)        Numerical     Annual income in thousands of dollars
Spending Score (1-100)    Numerical     Score assigned by the mall based on spending patterns

Task Requirements
Step 1: Load the Dataset

• Import required libraries (pandas, numpy, [Link]).


• Load Mall_Customers.csv and display the first few rows.
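Step 1 could be sketched as follows. So that the snippet runs without the Kaggle download, it reads a few made-up rows from an in-memory string; the variable sample_csv and its values are assumptions for illustration only, not data from the real file.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for Mall_Customers.csv (values are made up).
sample_csv = io.StringIO(
    "CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)\n"
    "1,Male,21,16,40\n"
    "2,Female,35,52,75\n"
    "3,Female,48,80,12\n"
)

# With the real file in the working directory this would be:
#   df = pd.read_csv("Mall_Customers.csv")
df = pd.read_csv(sample_csv)

print(df.head())    # first few rows
print(df.dtypes)    # column types as inferred by pandas
```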
Step 2: Handle Missing Values
• Check for missing values in all columns.
• If any missing values exist in numerical columns (Age, Annual Income, Spending Score), replace them with the mean.
• If any missing values exist in categorical columns (Gender), replace them with the most frequent value.
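A minimal sketch of Step 2, assuming the column names from the dataset description. The small DataFrame below is made up so that each column has a gap to fill; it is not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column (illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Annual Income (k$)": [15.0, 60.0, np.nan, 45.0],
    "Spending Score (1-100)": [39.0, 81.0, 6.0, np.nan],
})

# Count missing values per column.
print(df.isnull().sum())

# Numerical columns: fill each gap with that column's mean.
num_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical column: fill with the most frequent value (the mode).
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

print(df)
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a selected column, keeps the code working on recent pandas versions.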
Step 3: Encode Categorical Data

• Convert Gender using Label Encoding (Male=1, Female=0).
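One way to get exactly the Male=1, Female=0 coding is an explicit mapping, sketched below on made-up values. The dictionary name gender_codes is an assumption for illustration. scikit-learn's LabelEncoder would give the same codes here because it numbers classes in sorted order, so "Female" comes before "Male".

```python
import pandas as pd

# Made-up Gender values (illustrative only).
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Explicit mapping makes the required coding visible in the code itself.
gender_codes = {"Female": 0, "Male": 1}
df["Gender"] = df["Gender"].map(gender_codes)

print(df)
```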


Step 4: Detect and Handle Outliers

• Use the IQR (Interquartile Range) method to detect and remove outliers in Annual Income and Spending Score.
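The IQR rule keeps values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where IQR = Q3 - Q1. A minimal sketch on made-up numbers follows; the helper name remove_outliers_iqr is hypothetical, and the 1.5 multiplier is the conventional choice rather than something the task fixes.

```python
import pandas as pd


def remove_outliers_iqr(frame, column):
    """Return a copy of frame without rows whose value in column is an IQR outlier."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # between() is inclusive on both ends by default.
    return frame[frame[column].between(lower, upper)].copy()


# Made-up values; the income of 300 is an obvious outlier.
df = pd.DataFrame({
    "Annual Income (k$)": [15, 20, 25, 30, 35, 40, 300],
    "Spending Score (1-100)": [40, 45, 50, 55, 60, 65, 50],
})

# Apply the filter to each column in turn.
for col in ["Annual Income (k$)", "Spending Score (1-100)"]:
    df = remove_outliers_iqr(df, col)

print(df)
```

Filtering the columns one after another means the second column's quartiles are computed on the rows that survived the first filter. Computing all bounds up front on the original data is an equally reasonable design.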

Step 5: Feature Scaling


• Apply Standardization (Z-score normalization) to Age, Annual Income, and Spending Score.
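Standardization replaces each value x with z = (x - mean) / std, so every scaled column has mean 0 and standard deviation 1. The sketch below does this with plain pandas on made-up values. It uses the population standard deviation (ddof=0), which is also what scikit-learn's StandardScaler uses, so the two approaches should agree.

```python
import pandas as pd

# Made-up values (illustrative only).
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Annual Income (k$)": [15.0, 45.0, 60.0, 80.0],
    "Spending Score (1-100)": [10.0, 40.0, 60.0, 90.0],
})

cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# Z-score: subtract the column mean, divide by the column standard deviation.
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

print(df.round(3))
```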

Step 6: Save the Preprocessed Dataset

• Save the cleaned dataset as Mall_Customers_Processed.csv.
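Saving can be sketched as below. The temporary directory is an assumption so that the example does not write into the working directory; in the lab the file would simply be saved next to the script. Passing index=False keeps the pandas row index out of the CSV.

```python
import os
import tempfile

import pandas as pd

# Made-up processed rows (illustrative only).
df = pd.DataFrame({"CustomerID": [1, 2], "Gender": [1, 0]})

# Hypothetical output location; the file name comes from the task description.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "Mall_Customers_Processed.csv")

df.to_csv(out_path, index=False)

# Reload to confirm the file holds the same columns and rows.
reloaded = pd.read_csv(out_path)
print(reloaded)
```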


1. Create a synthetic dataset (students_performance.csv) using the following script:
import pandas as pd
import numpy as np

# Generate synthetic data
np.random.seed(42)
data = {
    "StudentID": range(1, 51),
    "Gender": np.random.choice(["Male", "Female"], 50),
    "Age": np.random.randint(15, 22, 50).astype(float),
    "Math Score": np.random.randint(30, 100, 50).astype(float),
    "Reading Score": np.random.randint(35, 100, 50).astype(float),
    "Writing Score": np.random.randint(40, 100, 50).astype(float)
}

# Introduce missing values at randomly chosen positions
data["Age"][np.random.choice(50, 5, replace=False)] = np.nan
data["Math Score"][np.random.choice(50, 3, replace=False)] = np.nan
data["Reading Score"][np.random.choice(50, 2, replace=False)] = np.nan

# Create DataFrame
df = pd.DataFrame(data)

# Save CSV
df.to_csv("students_performance.csv", index=False)

print("Synthetic 'students_performance.csv' file created successfully!")


Output:

2. Run the preprocessing script (intro_data_preprocessing.py):


python intro_data_preprocessing.py
Python Implementation (intro_data_preprocessing.py)
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Step 1: Load the dataset
file_path = "students_performance.csv"  # Ensure this file is in the same directory
df = pd.read_csv(file_path)

# Display first few rows
print("Original Dataset:")
print(df.head())

# Step 2: Check for missing values
print("\nMissing Values Before Handling:\n", df.isnull().sum())

# Fill missing numerical values with the column mean
# (assign the result back; inplace fillna on a selected column is deprecated in recent pandas)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Math Score'] = df['Math Score'].fillna(df['Math Score'].mean())
df['Reading Score'] = df['Reading Score'].fillna(df['Reading Score'].mean())
df['Writing Score'] = df['Writing Score'].fillna(df['Writing Score'].mean())

# Fill missing categorical values with the most frequent value
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

print("\nMissing Values After Handling:\n", df.isnull().sum())


# Step 3: Encode Categorical Data
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])  # Male=1, Female=0

# Step 4: Detect and Handle Outliers using IQR method
def remove_outliers(column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # .copy() returns an independent DataFrame, so later column assignments do not warn
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)].copy()

df = remove_outliers('Math Score')

print("\nDataset After Removing Outliers:")
print(df.head())

# Step 5: Feature Scaling using Min-Max Scaling
scaler = MinMaxScaler()
df[['Math Score', 'Reading Score', 'Writing Score']] = scaler.fit_transform(
    df[['Math Score', 'Reading Score', 'Writing Score']]
)

# Step 6: Save the Preprocessed Dataset
df.to_csv("students_performance_processed.csv", index=False)

print("\nPreprocessing complete. Cleaned dataset saved as 'students_performance_processed.csv'.")

Output:
