CREDIT CARD DATA ANALYSIS DATASET
A CASE STUDY REPORT
Data Science - 21CSS303T
(2021 Regulation)
III Year/ V Semester
Academic Year: 2024-2025
Submitted by
G.R.N.S.S. CHARAN REDDY – RA2211003011289
SAI VECHAN PRASAD – RA2211003011293
SIVA KUMAR G – RA2211003011300
DURGA PAWAN KUMAR – RA2211003011315
Under the Guidance of
Mrs. Sundari K
in partial fulfillment of the requirements for the degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING
DEPARTMENT OF COMPUTING TECHNOLOGIES
SCHOOL OF COMPUTING
COLLEGE OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203
MAY 2025
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203
BONAFIDE CERTIFICATE
Certified that this Data Science case study report titled “CREDIT CARD
DATA ANALYSIS DATASET” is the bonafide work of SAI VECHAN PRASAD
(RA2211003011293), G.R.N.S.S. CHARAN REDDY (RA2211003011289), SIVA KUMAR
G (RA2211003011300), and DURGA PAWAN KUMAR (RA2211003011315), who carried out
the case study under my supervision. Certified further that, to the best of my knowledge, the
work reported herein does not form part of any other work.
Faculty Signature
Mrs. Sundari K
Assistant Professor
Department of Computing Technologies
Date:
ABSTRACT
In today’s data-driven financial environment, accurate credit risk analysis is
essential for preventing losses and improving lending strategies. This case study
focuses on the cleaning and preprocessing of a real-world dataset involving credit
card clients in Taiwan. The dataset includes demographic information, billing
history, payment records, and a binary indicator for default in the upcoming month.
However, like many large datasets, the raw file contained missing values,
inconsistencies, and unprocessed categorical variables that hindered direct analysis.
The goal of this project was to transform the raw data into a clean, structured format
that is suitable for machine learning and statistical modeling.
Key data preprocessing tasks included imputing missing values using median and
mode strategies, encoding categorical features such as gender and marital status,
dropping redundant identifiers, and standardizing column names for readability.
Post-cleaning, the dataset was reshaped using pivot tables to visualize default rates
across demographic segments. This cleaned dataset lays a strong foundation for
exploratory analysis, customer segmentation, and the development of predictive
models to assess credit default risk. It also highlights the critical role of data
preparation in the success of data science workflows in the financial domain.
The increase in digital financial transactions has led to a rise in credit card usage,
resulting in massive financial datasets. This project focuses on cleaning and
preprocessing the Default of Credit Card Clients dataset containing 30,000
records across 25 features. The objective is to enhance the data quality for further
analysis and credit risk modelling.
Key steps included identifying and imputing missing values, encoding categorical
fields, removing unnecessary columns, and reshaping the dataset for machine
learning models. After preprocessing, the dataset was ready for exploratory data
analysis, customer segmentation, and predictive modelling.
TABLE OF CONTENTS
ABSTRACT
ABBREVIATIONS
1 INTRODUCTION
1.1 Problem Statement
1.2 Objectives
1.3 Software Requirements Specification
2 LITERATURE SURVEY
3 METHODOLOGY
3.1 Data Collection
3.2 Data Preprocessing
3.3 Post-Processing
4 RESULTS AND DISCUSSIONS
4.1 Cleaning Outcomes
4.2 Discussion
5 CONCLUSION AND FUTURE ENHANCEMENT
REFERENCES
6 APPENDIX-1
7 APPENDIX-2
ABBREVIATIONS
EDA - Exploratory Data Analysis
IQR - Interquartile Range
SVM - Support Vector Machine
GNB - Gaussian Naive Bayes
CHAPTER 1
INTRODUCTION
Credit card usage has grown with the rise of digital financial transactions, and
lenders depend on accurate credit risk analysis to limit losses. Studying default
behaviour requires high-quality, clean data to ensure accurate insights. Raw
datasets, however, often suffer from missing values, inconsistent formats, and
extreme outliers that can distort findings.
This project aims to clean the Default of Credit Card Clients dataset to enable
exploratory analysis, customer segmentation, and credit default prediction. By
applying systematic preprocessing techniques, the dataset will be transformed into
an analysis-ready resource suitable for further machine learning applications.
1.1 Problem Statement
Financial datasets, especially in credit risk analysis, often contain noise, missing
data, and inconsistent formats that can lead to misleading interpretations. Raw credit
card datasets may include:
• Missing demographic and transactional information.
• Categorical values stored as numeric codes.
• Outliers in spending, billing, or payment data.
• Redundant or irrelevant features such as customer ID.
Proper cleaning is essential to produce a trustworthy dataset for training robust
machine learning models to predict defaults and analyze customer behavior.
1.2 Objectives
• Identify and impute missing data accurately.
• Encode categorical variables for ML compatibility.
• Remove irrelevant columns (e.g., ID).
• Create pivot summaries for visualization.
• Standardize field names and formats.
• Output a ready-to-use dataset for modelling and insights.
1.3 Software Requirements Specification
The successful implementation of this credit card default data preprocessing project
required a set of robust tools and software frameworks. Each tool played a crucial role
in ensuring efficient data manipulation, visualization, and preparation for downstream
machine learning tasks. The following software components were used throughout the
project:
Programming Language: Python
Python was selected as the primary language due to its simplicity, readability, and
rich ecosystem of data science libraries. It supports seamless integration of data
handling, preprocessing, and modelling operations, making it ideal for large-scale
financial data projects.
Key Libraries Used:
• Pandas: Used extensively for loading, cleaning, transforming, and summarizing the
dataset. Its powerful DataFrame structure allowed easy manipulation of tabular data
and quick handling of missing values and encodings.
• NumPy: Essential for numerical operations such as calculating medians and
applying transformations across large datasets. It also facilitated efficient array-
based computations during imputation.
• Matplotlib & Seaborn: These libraries were used for data visualization, helping
generate pivot tables, bar graphs, and heatmaps to understand correlations and
default patterns.
• Scikit-learn (sklearn): Provided preprocessing utilities such as label encoding and
served as a foundation for future modeling stages like classification and clustering.
Development Environment: Google Colab
All the code was executed in Google Colab, a cloud-based Python notebook
environment that supports real-time collaboration and has pre-installed data science
libraries. Colab also allowed GPU acceleration and easy file management, making it
suitable for educational and prototyping use.
These tools collectively enabled the transformation of raw credit card data into a
clean, structured format ready for predictive modelling and risk assessment.
CHAPTER 2
LITERATURE SURVEY
Large-scale financial datasets, particularly those used in credit scoring and
default prediction, present significant challenges such as missing values,
non-standardized categorical fields, outliers, and data redundancy. To ensure reliable
analysis, several researchers have proposed preprocessing techniques including
threshold-based column elimination, statistical imputation, and feature encoding.
Data cleaning is especially important in financial domains where risk
modeling and decision systems rely heavily on high-quality, consistent input.
Unclean datasets can skew default prediction models, leading to high false-positive
or false-negative rates, which can cost institutions millions in credit misjudgements.
Notable methods include:
• Median imputation for numerical fields like AGE or billing amounts to reduce the
impact of outliers.
• Mode imputation for categorical fields such as EDUCATION or MARRIAGE to
retain category distribution.
• Label encoding for gender and education levels to prepare the dataset for machine
learning models.
• Outlier detection using statistical methods (e.g., IQR) to maintain realistic limits
on features like PAY_AMT, BILL_AMT, etc.
Author(s) | Title | Key Contribution
Baesens et al. (2003) | Benchmarking Credit Scoring | Emphasized the need for data quality in credit risk modelling.
Zhang et al. (2016) | Financial Data Cleaning Techniques | Reviewed preprocessing methods for loan and credit datasets.
Kaggle (2020) | Credit Risk Modelling Competition | Demonstrated effective imputation and encoding strategies for default datasets.
Huang et al. (2007) | Credit Scoring with SVM | Highlighted feature engineering and preprocessing in model accuracy.
UCI Repository (2009) | Default of Credit Card Clients | Provided standardized dataset used in global credit scoring research.
Table 1. Literature Survey
CHAPTER 3
METHODOLOGY
3.1 Data Collection
The dataset used in this study, Default of Credit Card Clients, was obtained
from the UCI Machine Learning Repository. It contains records for 30,000 credit
card holders in Taiwan, along with 25 attributes describing their financial status,
demographic information, repayment history, bill statements, and whether they
defaulted on their next month’s payment. This dataset has been widely used in
academic and industry-level credit risk analysis and provides a robust foundation for
building default prediction models.
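A first inspection of the data could look like the sketch below. It is only an illustration: the rows in SAMPLE_CSV are made up, only a subset of the 25 columns is shown, and the helper name load_and_inspect is hypothetical. With the real file, the path or buffer passed to the function would point at the downloaded dataset instead.

```python
import io

import pandas as pd

# Tiny made-up sample in the dataset's column layout (illustration only).
SAMPLE_CSV = """ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,default payment next month
1,20000,2,2,1,24,1
2,120000,2,2,2,26,1
3,90000,2,2,2,,0
4,50000,1,,1,37,0
"""


def load_and_inspect(source):
    """Read the data and return it with the per-column count of missing values."""
    df = pd.read_csv(source)
    missing = df.isna().sum()
    return df, missing


df, missing = load_and_inspect(io.StringIO(SAMPLE_CSV))
print(df.shape)                # rows and columns in the sample
print(missing[missing > 0])    # only the columns that contain gaps
```

Checking the shape and the missing counts first gives a baseline to compare against after cleaning.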
3.2 Data Preprocessing
Handling Missing Data
Although the dataset is relatively clean, fields such as AGE, EDUCATION, and
MARRIAGE were checked for missing values or inconsistencies. Imputation was
applied where necessary:
Median imputation was used for continuous features like AGE, which may
contain outliers.
Mode imputation was applied to categorical features like EDUCATION and
MARRIAGE to maintain class distribution.
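The imputation rules above can be sketched with pandas as follows. This is a minimal example on made-up rows, not the project's exact code; the function name impute_missing and its default column lists are assumptions chosen to match the fields named in this section.

```python
import numpy as np
import pandas as pd


def impute_missing(df, median_cols=("AGE",), mode_cols=("EDUCATION", "MARRIAGE")):
    """Return a copy with gaps filled: median for continuous, mode for categorical."""
    out = df.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        # mode() can return several values on ties; take the first (smallest).
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "AGE": [24, 26, np.nan, 37, 90],
    "EDUCATION": [2, 2, 1, np.nan, 3],
    "MARRIAGE": [1, 2, 2, 2, np.nan],
})
clean = impute_missing(sample)
print(clean)
```

In this sample the extreme AGE value of 90 barely moves the median, which is why the median is preferred over the mean for fields that may contain outliers.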
Categorical Encoding
Categorical variables were encoded to numerical formats for use in machine learning:
SEX was encoded into a binary column called Gender_encoded (0 for male, 1 for
female).
EDUCATION and MARRIAGE were renamed and structured into labeled fields.
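One possible way to express this encoding is sketched below. The code-to-label mappings follow the commonly cited UCI description of the dataset and should be treated as assumptions to verify against the source documentation; the column names Education_label and Marital_status and the "unknown" fallback are hypothetical choices for illustration.

```python
import pandas as pd

# Assumed code-to-label mappings (verify against the dataset description).
EDUCATION_LABELS = {1: "graduate_school", 2: "university", 3: "high_school", 4: "others"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}


def encode_categoricals(df):
    """Return a copy with a binary gender column and labeled category columns."""
    out = df.copy()
    # Raw SEX is assumed to be 1 = male, 2 = female; recode to 0 / 1.
    out["Gender_encoded"] = out["SEX"].map({1: 0, 2: 1})
    # Codes that are not in the mapping fall back to "unknown".
    out["Education_label"] = out["EDUCATION"].map(EDUCATION_LABELS).fillna("unknown")
    out["Marital_status"] = out["MARRIAGE"].map(MARRIAGE_LABELS).fillna("unknown")
    return out


# Made-up rows for illustration only.
sample = pd.DataFrame({"SEX": [1, 2, 2], "EDUCATION": [1, 2, 6], "MARRIAGE": [1, 2, 0]})
encoded = encode_categoricals(sample)
print(encoded)
```

Routing unexpected codes to an explicit "unknown" label keeps them visible in later summaries instead of silently turning them into missing values.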
Feature Renaming & Dropping
The column default payment next month was renamed to default_payment for
simplicity.
The ID column, which holds no predictive value, was removed to reduce noise
and redundancy.
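The renaming and dropping step can be written in a couple of pandas calls. The sketch below uses made-up rows, and the helper name rename_and_drop is hypothetical.

```python
import pandas as pd


def rename_and_drop(df):
    """Shorten the target column name and remove the ID column."""
    out = df.rename(columns={"default payment next month": "default_payment"})
    # errors="ignore" keeps the function safe to re-run once ID is already gone.
    return out.drop(columns=["ID"], errors="ignore")


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "ID": [1, 2],
    "LIMIT_BAL": [20000, 120000],
    "default payment next month": [1, 0],
})
tidy = rename_and_drop(sample)
print(list(tidy.columns))
```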
Outlier Detection (Optional)
Though not required in all cases, outlier detection based on Interquartile Range
(IQR) was considered for columns like LIMIT_BAL, BILL_AMT*, and
PAY_AMT* to ensure extreme values didn’t skew analysis or model training.
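A minimal sketch of the IQR rule is given below, assuming the conventional multiplier of 1.5 (the report does not state which multiplier was considered). The helper names iqr_bounds and flag_outliers are hypothetical, and the sample values are made up.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(df, columns, k=1.5):
    """Boolean DataFrame: True where a value lies outside its column's fences."""
    flags = pd.DataFrame(index=df.index)
    for col in columns:
        low, high = iqr_bounds(df[col], k)
        flags[col] = (df[col] < low) | (df[col] > high)
    return flags


# Made-up credit limits for illustration only.
sample = pd.DataFrame({"LIMIT_BAL": [10000, 20000, 30000, 40000, 50000, 1000000]})
flags = flag_outliers(sample, ["LIMIT_BAL"])
print(flags["LIMIT_BAL"].tolist())
```

Flagging rather than deleting leaves the decision open: flagged values can be reviewed, capped to the fences with Series.clip, or kept if they represent genuine high-limit customers.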
3.3 Post-Processing
After preprocessing, the final dataset had:
• 30,000 rows and 24 columns
• No missing values in critical fields
• Encoded and renamed categorical features
• Ready-to-use format for clustering, correlation analysis, and classification models.
Additionally, pivot tables were generated to observe default rates based on demographic
segments (e.g., default rate by education and gender), providing early insights into risk
segments.
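A pivot summary of this kind can be sketched as follows. The example assumes the cleaned column names used in Section 3.2 (default_payment, Gender_encoded) and runs on made-up rows; the helper name default_rate_pivot is hypothetical.

```python
import pandas as pd


def default_rate_pivot(df):
    """Default rate for each EDUCATION level (rows) and gender code (columns)."""
    # The mean of a 0/1 flag is the share of defaulters in each cell.
    return pd.pivot_table(
        df,
        values="default_payment",
        index="EDUCATION",
        columns="Gender_encoded",
        aggfunc="mean",
    )


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "EDUCATION": [1, 1, 2, 2, 2, 2],
    "Gender_encoded": [0, 1, 0, 0, 1, 1],
    "default_payment": [0, 1, 1, 0, 0, 0],
})
rates = default_rate_pivot(sample)
print(rates)
```

Each cell is a proportion between 0 and 1, so the table can be passed directly to a bar chart or heatmap.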
CHAPTER 4
RESULTS AND DISCUSSIONS
4.1 Cleaning Outcomes:
• Missing values handled in key fields
• Columns renamed for clarity
• ID dropped as non-essential
• Categorical fields encoded
• Summary pivot tables created for default rate by education and gender
Visualizations:
• Bar charts comparing missing data before and after cleaning.
• Plots of default rate across demographic segments, based on the pivot tables.
• Pie chart showing variable type distribution (numerical vs categorical).
4.2 Discussion
The preprocessing pipeline led to a well-structured, high-quality dataset. Encoding
and cleaning made it ML-ready. The dataset can now be used for:
• Predictive modelling (e.g., Logistic Regression, Decision Trees)
• Customer segmentation (e.g., K-Means Clustering)
• Risk analysis and credit scoring.
CHAPTER 5
CONCLUSION AND FUTURE ENHANCEMENT
Conclusion
This project successfully cleaned and pre-processed the credit card dataset,
making it suitable for downstream machine learning applications. Data quality was
significantly improved, with missing data resolved and categorical variables encoded.
Future Enhancements
• Use SMOTE to handle class imbalance (if defaults are rare)
• Try advanced imputers (KNN, Iterative)
• Apply dimensionality reduction (PCA)
• Train models like XGBoost and evaluate using ROC-AUC
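Whether resampling such as SMOTE is worthwhile depends on how rare defaults actually are, which can be checked before any modelling. The sketch below is only an illustration on a made-up target column; the helper name class_balance is hypothetical and the proportions shown are not results from the real dataset.

```python
import pandas as pd


def class_balance(target):
    """Share of each class in the target column, ordered by class label."""
    return target.value_counts(normalize=True).sort_index()


# Made-up target values for illustration only.
sample_target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="default_payment")
balance = class_balance(sample_target)
print(balance)
```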
REFERENCES
• UCI Machine Learning Repository: Default of Credit Card Clients dataset
• Han, J., Kamber, M., & Pei, J. Data Mining: Concepts and Techniques
• Kaggle credit risk projects and tutorials
APPENDIX-1
APPENDIX-2