Download any raw (not yet preprocessed) dataset and apply data cleaning steps to prepare
the data. You can use R or Python for this assignment.
Raw dataset link:
[Link]
Dataset Overview
The dataset contains 367 records and 12 columns related to loan applications. It includes
categorical attributes such as Gender, Education, and Property_Area, along with numerical
attributes such as ApplicantIncome, LoanAmount, and Credit_History (coded as 1 or 0). Three
columns, Self_Employed, LoanAmount, and Credit_History, have missing values that require
cleaning.
Table 1: Column Descriptions
| Column Name | Description |
|---|---|
| Loan_ID | Unique identifier for each loan application |
| Gender | Gender of the applicant (Male/Female) |
| Married | Marital status (Yes/No) |
| Dependents | Number of dependents (0, 1, 2, 3+) |
| Education | Education level (Graduate/Not Graduate) |
| Self_Employed | Whether the applicant is self-employed (Yes/No) |
| ApplicantIncome | Income of the applicant |
| CoapplicantIncome | Income of the co-applicant (if any) |
| LoanAmount | Loan amount applied for |
| Loan_Amount_Term | Term of the loan in months |
| Credit_History | Credit history (1 = good, 0 = bad, NaN = missing) |
| Property_Area | Type of property area (Urban, Semiurban, Rural) |
Table 2: Types of Noise and Fixes
| Type of Noise | Description | Impact | Possible Fix |
|---|---|---|---|
| Missing Values | LoanAmount, Self_Employed, Credit_History have missing values | Leads to incomplete analysis | Fill using mean/mode, or remove rows |
| Inconsistent Data | Dependents column contains '3+', which is non-numeric | May cause issues in calculations | Convert '3+' to 3 |
| Outliers | ApplicantIncome has extreme values | Skews analysis and predictions | Use median or remove extreme values |
| Categorical Encoding Needed | Gender, Married, Education, Property_Area are categorical | Cannot be used directly in ML models | Convert to numerical labels |
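Two of the fixes in Table 2, converting '3+' in Dependents and encoding the categorical columns, can be sketched as follows. This is a minimal illustration on a few made-up rows, not the code used in the original notebook; the sample values are assumptions, and integer category codes are only one of several possible encodings.

```python
import pandas as pd

# Hypothetical sample rows; the values are illustrative, not taken from the real file.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Dependents": ["0", "3+", "2"],
})

# Inconsistent data: map '3+' to '3', then convert the column to integers.
df["Dependents"] = df["Dependents"].replace("3+", "3").astype(int)

# Categorical encoding: give each category an integer code.
# Codes follow the alphabetical order of the categories (e.g. Female=0, Male=1).
for col in ["Gender", "Married", "Education", "Property_Area"]:
    df[col] = df[col].astype("category").cat.codes

print(df)
```

Integer codes imply an ordering, which is harmless for two-valued columns but may mislead some models for Property_Area; one-hot encoding with `pd.get_dummies` is a common alternative.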
Implementing Data Cleaning and Visualization:
Platform and Language:
Platform: Google Colab
Language: Python
Libraries: pandas, NumPy, Matplotlib, Seaborn, missingno, SciPy
Methodologies:
Data Cleaning: Handled missing values, standardized text formats, converted data types,
removed duplicates, and detected outliers using IQR and Z-score.
Data Visualization: Used Matplotlib and Seaborn to create histograms, box plots, line charts,
scatter plots, heatmaps, and pie charts for analysis.
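The IQR and Z-score outlier checks mentioned under Data Cleaning can be sketched as below. The income values are hypothetical, the 1.5 multiplier and the threshold of 3 are the conventional defaults rather than values confirmed by the report, and the Z-score is computed by hand with pandas instead of SciPy to keep the example small.

```python
import pandas as pd

# Hypothetical ApplicantIncome values with one extreme entry (72000).
income = pd.Series(
    [2500, 2800, 3000, 3200, 3400, 3600, 3800, 4000,
     4200, 4400, 4600, 4800, 5000, 5400, 72000],
    name="ApplicantIncome",
)

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = income[(income < lower) | (income > upper)]

# Z-score rule: flag values more than `threshold` standard deviations from the mean.
# ddof=0 uses the population standard deviation.
threshold = 3
z_scores = (income - income.mean()) / income.std(ddof=0)
z_outliers = income[z_scores.abs() > threshold]

print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

The Z-score rule is less sensitive on small samples, because a single extreme value inflates the standard deviation it is measured against; the IQR rule does not have that problem.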
INITIAL DATASET:
CODE:
Importing libraries in Python:
Uploading the file:
Mounting Google Drive:
Load the dataset:
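A minimal sketch of the loading step, assuming the data is a CSV file read with `pandas.read_csv`. The real file path is not given in the report, so `csv_source` here is an in-memory stand-in with three made-up rows; in Colab it would be replaced by the path of the uploaded file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; IDs and values are made up.
# Empty fields become NaN when parsed.
csv_source = io.StringIO(
    "Loan_ID,Gender,Self_Employed,ApplicantIncome,LoanAmount,Credit_History\n"
    "ID001,Male,No,5720,110,1\n"
    "ID002,Female,,3076,,1\n"
    "ID003,Male,Yes,5000,208,\n"
)

df = pd.read_csv(csv_source)
print(df.head())
print("Shape:", df.shape)
```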
Initial dataset info and description:
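One way to inspect the raw data is with `DataFrame.info`, `DataFrame.describe`, and a per-column count of missing values. The sketch below uses a small hypothetical frame in place of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the loaded dataset.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "ApplicantIncome": [5720, 3076, 5000, 2340],
    "LoanAmount": [110.0, np.nan, 208.0, 100.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Column names, dtypes, and non-null counts.
df.info()

# Summary statistics for the numerical columns.
summary = df.describe()
print(summary)

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```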
Fill missing values for numerical columns with median values:
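A sketch of median imputation for the numerical columns, using hypothetical values. It selects the numeric columns by dtype, which is an assumption about how the original notebook chose them.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in the numerical columns.
df = pd.DataFrame({
    "LoanAmount": [110.0, np.nan, 208.0, 100.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
    "Self_Employed": ["No", None, "Yes", "No"],
})

# Pick the numerical columns and fill each one with its own median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(df)
```

The median is less affected by extreme values than the mean, which matters for skewed columns such as LoanAmount. The text column Self_Employed is left untouched by this step.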
Fill missing values in 'Self_Employed' with the mode (most frequent value):
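Mode imputation for a categorical column might look like the sketch below; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with two missing entries.
df = pd.DataFrame({"Self_Employed": ["No", None, "Yes", "No", None]})

# mode() ignores missing values and returns a Series; take its first entry.
most_frequent = df["Self_Employed"].mode()[0]
df["Self_Employed"] = df["Self_Employed"].fillna(most_frequent)

print(df["Self_Employed"].value_counts())
```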
Check and remove any duplicate rows:
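A sketch of the duplicate check on hypothetical rows, counting fully identical rows first and then dropping them:

```python
import pandas as pd

# Hypothetical sample in which the second and third rows are identical.
df = pd.DataFrame({
    "Loan_ID": ["ID001", "ID002", "ID002", "ID003"],
    "ApplicantIncome": [5720, 3076, 3076, 5000],
})

# duplicated() marks every repeat of an earlier row as True.
duplicate_count = int(df.duplicated().sum())
print(f"Duplicate rows found: {duplicate_count}")

# Keep the first occurrence of each row and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
print(df)
```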
Confirm that there are no more missing values:
Save the cleaned dataset to a new CSV file:
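Saving could be done with `DataFrame.to_csv`, as sketched here. The file name `loan_data_cleaned.csv` and the temporary directory are assumptions for illustration; the report does not state the actual output path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({
    "Loan_ID": ["ID001", "ID002"],
    "LoanAmount": [110.0, 126.0],
    "Self_Employed": ["No", "No"],
})

# Hypothetical output location; a temporary directory keeps the sketch self-contained.
output_path = os.path.join(tempfile.mkdtemp(), "loan_data_cleaned.csv")

# index=False avoids writing the row index as an extra column.
df.to_csv(output_path, index=False)

# Read the file back to confirm it was written as expected.
reloaded = pd.read_csv(output_path)
print(reloaded)
```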
Cleaned data info and description:
Output the first few rows of the cleaned data and check for any remaining missing values:
List the files in the current directory:
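Listing the working directory is a quick way to confirm that the cleaned CSV was written. A minimal sketch using the standard library:

```python
import os

# Collect the names of all entries in the current working directory.
files = sorted(os.listdir("."))

print("Files in", os.getcwd())
for name in files:
    print(" ", name)
```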
DATA CLEANED (processed data):
DATA VISUALIZATION:
Line Chart: Loan Amount Trend
Histogram: Applicant Income
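A sketch of the histogram with Matplotlib, using hypothetical income values. The bin count and labels are assumptions, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in Colab to see the figure inline
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ApplicantIncome values.
income = pd.Series(
    [2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 5200, 9000],
    name="ApplicantIncome",
)

fig, ax = plt.subplots(figsize=(8, 5))

# hist() returns the count in each bin and the bin edges.
counts, bin_edges, _ = ax.hist(income, bins=5, edgecolor="black")

ax.set_title("Histogram: Applicant Income")
ax.set_xlabel("ApplicantIncome")
ax.set_ylabel("Number of applicants")
fig.tight_layout()
```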
KDE Plot: Probability Density of Loan Amount
Pie Chart: Loan Approval based on Credit History
Scatter Plot: Applicant Income vs Loan Amount
Box Plot: Checking Outliers in Loan Amount
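A box plot of this kind could be drawn with Matplotlib as sketched below. The loan amounts are hypothetical, and the `Agg` backend is selected only so the example runs without a display. Points beyond 1.5 times the IQR from the box are drawn as individual markers ("fliers").

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in Colab to see the figure inline
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical LoanAmount values with one extreme entry (550).
loan_amount = pd.Series(
    [100, 110, 120, 125, 130, 140, 150, 160, 550], name="LoanAmount"
)

fig, ax = plt.subplots(figsize=(5, 6))

# boxplot() returns a dict of the drawn artists, including the outlier markers.
box = ax.boxplot(loan_amount.to_numpy())

ax.set_title("Box Plot: Checking Outliers in Loan Amount")
ax.set_ylabel("LoanAmount")
ax.set_xticks([])

# Values plotted as individual outlier points.
outlier_values = [float(v) for v in box["fliers"][0].get_ydata()]
print("Outliers shown on the plot:", outlier_values)
```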
Rug Plot: Distribution of Coapplicant Income
Bar Chart: Count of Dependents
Joint Plot: Applicant Income vs Loan Amount
Radial Bar Chart (Custom Circular Representation)
3D Scatter Plot: Loan Amount vs Income vs Credit History
Correlation Heatmap of Numerical Features
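A sketch of a correlation heatmap on hypothetical data. It uses plain Matplotlib (`imshow`) rather than Seaborn's `heatmap` to keep the example's dependencies small; the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in Colab to see the figure inline
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numerical columns.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 4000, 5000, 6000],
    "LoanAmount": [100, 120, 150, 160, 200],
    "Credit_History": [1, 0, 1, 1, 0],
})

# Pairwise Pearson correlation between the columns.
corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)

# Label both axes with the column names.
ticks = range(len(corr.columns))
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(ticks)
ax.set_yticklabels(corr.columns)

# Write each correlation value inside its cell.
for i in ticks:
    for j in ticks:
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax)
ax.set_title("Correlation Heatmap of Numerical Features")
fig.tight_layout()
```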