Data Cleaning for Loan Prediction Dataset


Download any un-pre-processed dataset and perform various data cleaning processes to prepare the data. You can use R-Programming/Python for this assignment.

Un-pre-processed dataset link:

[Link]

Dataset Overview

The dataset contains 367 records and 12 columns related to loan applications. It includes
categorical attributes like Gender, Education, and Property Area, along with numerical
attributes such as Applicant Income, Loan Amount, and Credit History. Some columns, such
as Self_Employed, LoanAmount, and Credit_History, have missing values that require
cleaning.

Table 1: Column Descriptions

Column Name        Description
Loan_ID            Unique identifier for each loan application
Gender             Gender of the applicant (Male/Female)
Married            Marital status (Yes/No)
Dependents         Number of dependents (0, 1, 2, 3+)
Education          Education level (Graduate/Not Graduate)
Self_Employed      Whether the applicant is self-employed (Yes/No)
ApplicantIncome    Income of the applicant
CoapplicantIncome  Income of the co-applicant (if any)
LoanAmount         Loan amount applied for
Loan_Amount_Term   Term of the loan in months
Credit_History     Credit history (1 - good, 0 - bad, NaN - missing)
Property_Area      Type of property area (Urban, Semiurban, Rural)

Table 2: Types of Noise & Fixes

Type of Noise                Description                                                    Impact                                Possible Fix
Missing Values               LoanAmount, Self_Employed, Credit_History have missing values  Leads to incomplete analysis          Fill using mean/mode, or remove rows
Inconsistent Data            Dependents column contains '3+', which is non-numeric          May cause issues in calculations      Convert '3+' to 3
Outliers                     ApplicantIncome has extreme values                             Skews analysis and predictions        Use median or remove extreme values
Categorical Encoding Needed  Gender, Married, Education, Property_Area are categorical      Cannot be used directly in ML models  Convert to numerical labels
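
Two of the fixes in Table 2 (converting '3+' to 3 and turning categories into numerical labels) can be sketched as follows. This is a minimal illustration, not the code used in the assignment: the rows are made up, and pd.factorize is one assumed way of producing integer labels among several.

```python
import pandas as pd

# Made-up sample rows that mimic the column layout; not taken from the real file.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Married": ["Yes", "No", "Yes", "No"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Dependents": ["0", "3+", "1", "2"],
})

# Inconsistent data: treat '3+' as 3 so the column can be parsed as numbers.
# pd.to_numeric leaves any missing entries as NaN instead of raising.
df["Dependents"] = pd.to_numeric(df["Dependents"].replace("3+", "3"))

# Categorical encoding: give each category an integer code.
# pd.factorize numbers the categories in order of first appearance.
for col in ["Gender", "Married", "Education", "Property_Area"]:
    codes, categories = pd.factorize(df[col])
    df[col] = codes

print(df)
```

Integer codes imply an ordering that the categories do not really have, so one-hot encoding (for example pd.get_dummies) may be a better fit for some models.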

Implementing Data Cleaning and Visualization:


Platform and Language:

Platform: Google Colab

Language: Python

Libraries: pandas, NumPy, Matplotlib, seaborn, missingno, SciPy

Methodologies:

Data Cleaning: Handled missing values, standardized text formats, converted data types,
removed duplicates, and detected outliers using IQR and Z-score.
Data Visualization: Used Matplotlib and Seaborn to create histograms, box plots, line charts,
scatter plots, heatmaps, and pie charts for analysis.
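
The IQR and Z-score outlier checks mentioned under Data Cleaning can be sketched as below. The income values are invented for illustration, and the 1.5 x IQR fence and the |z| > 3 cutoff are common conventions assumed here, not thresholds stated in the assignment.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented ApplicantIncome values with one obviously extreme entry.
income = pd.Series(
    [2500, 2800, 3000, 3100, 3200, 3400, 3500, 3700, 3800, 4000,
     4100, 4200, 4500, 4600, 4800, 5000, 5200, 5500, 6000, 72000],
    name="ApplicantIncome",
)

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = income[(income < lower) | (income > upper)]

# Z-score rule: flag values more than 3 standard deviations from the mean.
z_scores = np.abs(stats.zscore(income))
z_outliers = income[z_scores > 3]

print("IQR bounds:", lower, upper)
print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

With very small samples the Z-score rule can never fire, because a single value cannot lie more than sqrt(n - 1) population standard deviations from the mean, so the IQR rule is often the safer first check.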

INITIAL DATASET:

CODE:

Importing libraries in python:

Uploading the file:

Mounting google drive:


Load the dataset:

Initial dataset info and description:
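
The loading and inspection steps above might look roughly like this. Because the dataset link is not given here, the sketch reads a few made-up rows from an in-memory string; in Colab, the path of the uploaded or Drive-mounted file would be passed to pd.read_csv instead.

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV; the IDs and values are illustrative only.
csv_text = """Loan_ID,Gender,Self_Employed,ApplicantIncome,LoanAmount,Credit_History
LP001,Male,No,5720,110,1
LP002,Female,,3076,,1
LP003,Male,Yes,5000,208,
"""

# In Colab, replace io.StringIO(csv_text) with the path to the uploaded file.
df = pd.read_csv(io.StringIO(csv_text))

df.info()                     # column dtypes and non-null counts
print(df.describe())          # summary statistics for the numeric columns
missing = df.isnull().sum()   # number of missing values per column
print(missing)
```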


Fill missing values for numerical columns with median values:
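
A minimal sketch of median imputation for the numerical columns, using made-up values. Selecting the columns with select_dtypes is an assumption about how the numerical columns were picked.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap in each numerical column.
df = pd.DataFrame({
    "LoanAmount": [110.0, np.nan, 208.0, 100.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Pick every numeric column, then fill each gap with that column's median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(df)
```

The median is less sensitive to extreme values than the mean, which matters for skewed columns such as LoanAmount.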

Fill missing values in 'Self_Employed' with the mode (most frequent value):
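
Mode imputation for a categorical column can be sketched like this, again on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up Self_Employed values with two gaps.
df = pd.DataFrame({"Self_Employed": ["No", "Yes", np.nan, "No", np.nan]})

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = df["Self_Employed"].mode()[0]
df["Self_Employed"] = df["Self_Employed"].fillna(most_frequent)

print(df["Self_Employed"].value_counts())
```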
Check and remove any duplicate rows:
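
A short sketch of the duplicate check, assuming a duplicate means a row that is identical in every column (the rows are made up):

```python
import pandas as pd

# Made-up rows; the third row repeats the second exactly.
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP002", "LP003"],
    "LoanAmount": [110, 126, 126, 208],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())
print("Duplicate rows found:", n_duplicates)

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```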

Confirm that there are no more missing values:

Save the cleaned dataset to a new CSV file:
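
Saving the cleaned data could be sketched as follows. The file name cleaned_loan_data.csv is hypothetical, and the sketch writes to a temporary directory so it does not overwrite anything. Reading the file back is a quick check that the save worked.

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned rows standing in for the real cleaned DataFrame.
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "Self_Employed": ["No", "No", "Yes"],
    "LoanAmount": [110, 126, 208],
})

# Hypothetical output location; in Colab this could be a path under the mounted Drive.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "cleaned_loan_data.csv")

# index=False keeps the row index out of the file.
df.to_csv(out_path, index=False)

# Read the file back and list the directory to confirm it was written.
reloaded = pd.read_csv(out_path)
print(reloaded.head())
print(os.listdir(out_dir))
```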


Cleaned data info and description:

Output the first few rows of the cleaned data and check for any remaining missing values:

List the files in the current directory and print the list:

DATA CLEANED (processed data):


DATA VISUALIZATION:

Line Chart: Loan Amount Trend

Histogram: Applicant Income
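
A histogram like the one named above could be produced along these lines. The income values, the bin count, and the labels are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ApplicantIncome values for illustration.
df = pd.DataFrame({"ApplicantIncome": [2500, 3000, 3200, 4000, 4500, 5000, 5720, 9000]})

fig, ax = plt.subplots(figsize=(6, 4))

# hist returns the count per bin, the bin edges, and the drawn bars.
counts, bin_edges, _ = ax.hist(df["ApplicantIncome"], bins=5, edgecolor="black")

ax.set_title("Histogram: Applicant Income")
ax.set_xlabel("ApplicantIncome")
ax.set_ylabel("Number of applicants")
fig.tight_layout()
```

In a Colab cell the figure renders when the cell finishes; in a plain script, call plt.show() afterwards.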


KDE Plot: Probability Density of Loan Amount

Pie Chart: Loan Approval based on Credit History


Scatter Plot: Applicant Income vs Loan Amount

Box Plot: Checking Outliers in Loan Amount
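
A box plot for spotting outliers can be sketched as below, on made-up loan amounts. Matplotlib draws points beyond 1.5 x IQR from the box as individual "fliers", which is the same fence as the IQR rule.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up LoanAmount values with one extreme entry.
df = pd.DataFrame({"LoanAmount": [100, 110, 120, 125, 130, 140, 150, 600]})

fig, ax = plt.subplots(figsize=(4, 5))

# boxplot returns a dict of the drawn artists; "fliers" holds the outlier points.
result = ax.boxplot(df["LoanAmount"].to_numpy())
fliers = result["fliers"][0].get_ydata()

ax.set_title("Box Plot: Loan Amount")
ax.set_ylabel("LoanAmount")
print("Points drawn as outliers:", list(fliers))
```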

Rug Plot: Distribution of Coapplicant Income


Bar Chart: Count of Dependents

Joint Plot: Applicant Income vs Loan Amount


Radial Bar Chart (Custom Circular Representation)
3D Scatter Plot: Loan Amount vs Income vs Credit History

Correlation Heatmap of Numerical Features
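
A correlation heatmap over the numerical columns might be built like this. The rows are made up, and the colour map and annotation settings are assumptions, not the assignment's own.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows; Gender is included to show that text columns are excluded.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 4000, 5720, 9000],
    "LoanAmount": [100, 110, 120, 150, 300],
    "Loan_Amount_Term": [360, 360, 180, 360, 240],
    "Gender": ["Male", "Female", "Male", "Male", "Female"],
})

# Pearson correlation between every pair of numeric columns.
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the colour scale to [-1, 1] so colours are comparable across plots.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Heatmap of Numerical Features")
fig.tight_layout()
```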
