Credit Card Data Analysis Case Study

This case study report focuses on the cleaning and preprocessing of a credit card dataset containing 30,000 records to enhance data quality for credit risk analysis. Key preprocessing tasks included imputing missing values, encoding categorical features, and removing irrelevant columns, resulting in a structured dataset ready for machine learning applications. The project emphasizes the importance of data preparation in achieving reliable predictive modeling and customer segmentation in the financial domain.

Uploaded by

psychoeditsb2i
CREDIT CARD DATA ANALYSIS DATASET

A CASE STUDY REPORT

Data Science - 21CSS303T


(2021 Regulation)
III Year/ V Semester
Academic Year: 2024 -2025

Submitted by

G.R.N.S.S. CHARAN REDDY–RA2211003011289


SAI VECHAN PRASAD – RA2211003011293
SIVA KUMAR G – RA2211003011300
DURGA PAWAN KUMAR– RA2211003011315

Under the Guidance of

Mrs. Sundari K

in partial fulfillment of the requirements for the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING

DEPARTMENT OF COMPUTING TECHNOLOGIES


SCHOOL OF COMPUTING
COLLEGE OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR- 603 203
MAY 2025
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that this Data Science (21CSS303T) case study report titled “CREDIT CARD
DATA ANALYSIS DATASET” is the bonafide work of SAI VECHAN PRASAD
(RA2211003011293), G.R.N.S.S. CHARAN REDDY (RA2211003011289), SIVA KUMAR
G (RA2211003011300), and DURGA PAWAN KUMAR (RA2211003011315), who carried out
the case study under my supervision. Certified further that, to the best of my knowledge, the
work reported herein does not form part of any other work.

Faculty Signature
Mrs. Sundari K
Assistant Professor
Department of Computing Technologies

Date:
ABSTRACT

In today’s data-driven financial environment, accurate credit risk analysis is
essential for preventing losses and improving lending strategies. This case study
focuses on the cleaning and preprocessing of a real-world dataset involving credit
card clients in Taiwan. The dataset includes demographic information, billing
history, payment records, and a binary indicator for default in the upcoming month.
However, like many large datasets, the raw file contained missing values,
inconsistencies, and unprocessed categorical variables that hindered direct analysis.
The goal of this project was to transform the raw data into a clean, structured format
that is suitable for machine learning and statistical modeling.

Key data preprocessing tasks included imputing missing values using median and
mode strategies, encoding categorical features such as gender and marital status,
dropping redundant identifiers, and standardizing column names for readability.
Post-cleaning, the dataset was reshaped using pivot tables to visualize default rates
across demographic segments. This cleaned dataset lays a strong foundation for
exploratory analysis, customer segmentation, and the development of predictive
models to assess credit default risk. It also highlights the critical role of data
preparation in the success of data science workflows in the financial domain.

The increase in digital financial transactions has led to a rise in credit card usage,
resulting in massive financial datasets. This project focuses on cleaning and
preprocessing the Default of Credit Card Clients dataset containing 30,000
records across 25 features. The objective is to enhance the data quality for further
analysis and credit risk modelling.

Key steps included identifying and imputing missing values, encoding categorical
fields, removing unnecessary columns, and reshaping the dataset for machine
learning models. After preprocessing, the dataset was ready for exploratory data
analysis, customer segmentation, and predictive modelling.
TABLE OF CONTENTS

ABSTRACT

ABBREVIATIONS

1 INTRODUCTION
1.1 Problem Statement
1.2 Objectives
1.3 Software Requirements Specification
2 LITERATURE SURVEY
3 METHODOLOGY
3.1 Data Collection
3.2 Data Preprocessing
3.3 Post-Processing
4 RESULTS AND DISCUSSIONS
4.1 Cleaning Outcomes
4.2 Discussion
5 CONCLUSION AND FUTURE ENHANCEMENT
REFERENCES
6 APPENDIX-1
7 APPENDIX-2
ABBREVIATIONS

EDA - Exploratory Data Analysis

IQR - Interquartile Range

SVM - Support Vector Machine

GNB - Gaussian Naive Bayes


CHAPTER 1

INTRODUCTION

The growth of digital financial transactions has increased credit card usage and
produced very large financial datasets. Analysing credit risk requires high-quality,
clean data to ensure accurate insights. Raw datasets, however, often suffer from
missing values, inconsistent formats, and extreme outliers that can distort findings.

This project aims to clean the Default of Credit Card Clients dataset to enable
exploratory analysis, customer segmentation, and credit default prediction. By
applying systematic preprocessing techniques, the dataset is transformed into an
analysis-ready resource suitable for further machine learning applications.

1.1 Problem Statement

Financial datasets, especially in credit risk analysis, often contain noise, missing
data, and inconsistent formats that can lead to misleading interpretations. Raw credit
card datasets may include:
• Missing demographic and transactional information.
• Categorical values in numerical format.
• Outliers in spending, billing, or payment data.
• Redundant or irrelevant features such as customer ID.
Proper cleaning is essential to produce a trustworthy dataset for training robust
machine learning models to predict defaults and analyze customer behavior.
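
The issues listed above can be surfaced with a quick audit before any cleaning. The following is a minimal sketch, not the code used in the project: the helper name audit_frame and the four-row sample are hypothetical, and only the column names follow the dataset.

```python
import numpy as np
import pandas as pd


def audit_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise each column: dtype, missing count, and distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Tiny hypothetical sample that mimics the issues listed above:
# an identifier column, a numerically coded category, and a missing value.
sample = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "SEX": [1, 2, 2, 1],
        "AGE": [24, np.nan, 37, 51],
        "BILL_AMT1": [3913, 2682, 29239, 46990],
    }
)

report = audit_frame(sample)
print(report)
```

A column whose n_unique equals the row count (here ID) is a candidate identifier, and a numeric column with very few distinct values (here SEX) is likely a coded category.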
1.2 Objectives

• Identify and impute missing data accurately.
• Encode categorical variables for ML compatibility.
• Remove irrelevant columns (e.g., ID).
• Create pivot summaries for visualization.
• Standardize field names and formats.
• Output a ready-to-use dataset for modelling and insights.

1.3 Software Requirements Specification

The successful implementation of this credit card default data preprocessing project
required a set of robust tools and software frameworks. Each tool played a crucial role
in ensuring efficient data manipulation, visualization, and preparation for downstream
machine learning tasks. The following software components were used throughout the
project:
Programming Language: Python
Python was selected as the primary language due to its simplicity, readability, and
rich ecosystem of data science libraries. It supports seamless integration of data
handling, preprocessing, and modelling operations, making it ideal for large-scale
financial data projects.

Key Libraries Used:

• Pandas: Used extensively for loading, cleaning, transforming, and summarizing the
dataset, including the pivot-table summaries. Its DataFrame structure allowed easy
manipulation of tabular data and quick handling of missing values and encodings.
• NumPy: Used for numerical operations such as calculating medians and
applying transformations across large datasets. It also supported efficient
array-based computations during imputation.
• Matplotlib & Seaborn: Used for data visualization, turning the pivot-table
summaries into bar graphs and heatmaps to understand correlations and
default patterns.
• Scikit-learn (sklearn): Provided preprocessing utilities such as label encoding and
served as a foundation for future modeling stages like classification and clustering.
Development Environment: Google Colab
All the code was executed in Google Colab, a cloud-based Python notebook
environment that supports real-time collaboration and has pre-installed data science
libraries. Colab also allowed GPU acceleration and easy file management, making it
suitable for educational and prototyping use.

These tools collectively enabled the transformation of raw credit card data into a
clean, structured format ready for predictive modelling and risk assessment.
CHAPTER 2

LITERATURE SURVEY

Large-scale financial datasets, particularly those used in credit scoring and
default prediction, present significant challenges such as missing values,
non-standardized categorical fields, outliers, and data redundancy. To ensure reliable
analysis, several researchers have proposed preprocessing techniques including
threshold-based column elimination, statistical imputation, and feature encoding.
Data cleaning is especially important in financial domains, where risk
modeling and decision systems rely heavily on high-quality, consistent input.
Unclean datasets can skew default prediction models, leading to high false-positive
or false-negative rates, which can cost institutions millions in credit misjudgements.

Notable methods include:


• Median imputation for numerical fields like AGE or billing amounts to reduce the
impact of outliers.
• Mode imputation for categorical fields such as EDUCATION or MARRIAGE to
retain category distribution.
• Label encoding for gender and education levels to prepare the dataset for machine
learning models.
• Outlier detection using statistical methods (e.g., IQR) to maintain realistic limits
on features like PAY_AMT, BILL_AMT, etc.

Author(s)              | Title                              | Key Contribution
Baesens et al. (2003)  | Benchmarking Credit Scoring        | Emphasized the need for data quality in credit risk modelling.
Zhang et al. (2016)    | Financial Data Cleaning Techniques | Reviewed preprocessing methods for loan and credit datasets.
Kaggle (2020)          | Credit Risk Modelling Competition  | Demonstrated effective imputation and encoding strategies for default datasets.
Huang et al. (2007)    | Credit Scoring with SVM            | Highlighted feature engineering and preprocessing in model accuracy.
UCI Repository (2009)  | Default of Credit Card Clients     | Provided standardized dataset used in global credit scoring research.

Table 1. Literature Survey


CHAPTER 3

METHODOLOGY

3.1 Data Collection


The dataset used in this study, Default of Credit Card Clients, was obtained
from the UCI Machine Learning Repository. It contains records for 30,000 credit
card holders in Taiwan, along with 25 attributes describing their financial status,
demographic information, repayment history, bill statements, and whether they
defaulted on their next month’s payment. This dataset has been widely used in
academic and industry-level credit risk analysis and provides a robust foundation for
building default prediction models.
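
A loading step can confirm that the file matches the 25-attribute layout before cleaning starts. The sketch below assumes the data has been exported to CSV and that the columns carry the names published with the UCI file; the function name load_credit_data is hypothetical, and the demo reads a one-row in-memory CSV rather than the real file.

```python
import io

import pandas as pd

# Column layout as published with the UCI file (an assumption of this sketch).
EXPECTED_COLUMNS = (
    ["ID", "LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + ["PAY_0"]
    + [f"PAY_{i}" for i in range(2, 7)]
    + [f"BILL_AMT{i}" for i in range(1, 7)]
    + [f"PAY_AMT{i}" for i in range(1, 7)]
    + ["default payment next month"]
)


def load_credit_data(source) -> pd.DataFrame:
    """Read a CSV export and confirm every expected column is present."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


# Demo on a one-row in-memory CSV; a real run would pass the file path instead.
header = ",".join(EXPECTED_COLUMNS)
row = ",".join(["1"] * len(EXPECTED_COLUMNS))
demo = load_credit_data(io.StringIO(header + "\n" + row + "\n"))
print(demo.shape)
```

Failing early on a missing column is a design choice: it prevents later steps from silently operating on a partial file.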

3.2 Data Preprocessing

Handling Missing Data


Although the dataset is relatively clean, fields such as AGE, EDUCATION, and
MARRIAGE were checked for missing values or inconsistencies. Imputation was
applied where necessary:

• Median imputation was used for continuous features like AGE, which may
contain outliers.
• Mode imputation was applied to categorical features like EDUCATION and
MARRIAGE to maintain class distribution.
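
The two imputation strategies can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the helper impute_missing and the five-row sample with artificially removed values are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_missing(df, numeric_cols, categorical_cols):
    """Fill numeric columns with the median and categorical ones with the mode."""
    out = df.copy()
    for col in numeric_cols:
        # The median is less sensitive to extreme values than the mean.
        out[col] = out[col].fillna(out[col].median())
    for col in categorical_cols:
        modes = out[col].mode(dropna=True)
        if not modes.empty:
            # mode() can return ties; the first (smallest) value is used here.
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Hypothetical sample with one gap per column.
sample = pd.DataFrame(
    {
        "AGE": [24.0, np.nan, 37.0, 51.0, 90.0],
        "EDUCATION": [2.0, 2.0, np.nan, 1.0, 3.0],
        "MARRIAGE": [1.0, 2.0, 2.0, np.nan, 2.0],
    }
)

clean = impute_missing(sample, ["AGE"], ["EDUCATION", "MARRIAGE"])
print(clean)
```

In the sample, the missing AGE becomes 44.0 (the median of 24, 37, 51, and 90), which the extreme value 90 influences far less than it would a mean.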
Categorical Encoding

Categorical variables were encoded to numerical formats for use in machine learning:

SEX was encoded into a binary column called Gender_encoded (0 for male, 1 for
female).

EDUCATION and MARRIAGE were renamed and structured into labeled fields.
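
A minimal sketch of this encoding step is shown below. It assumes the coding documented for the UCI file (SEX: 1 = male, 2 = female; EDUCATION: 1 to 4; MARRIAGE: 1 to 3). The output column names Education_label and Marital_status are hypothetical, since the report names only Gender_encoded, and codes outside the documented range are mapped to "unknown" as an assumption of this sketch.

```python
import pandas as pd

# Assumed UCI coding: SEX 1 = male, 2 = female.
SEX_TO_ENCODED = {1: 0, 2: 1}
EDUCATION_LABELS = {
    1: "graduate school",
    2: "university",
    3: "high school",
    4: "others",
}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "others"}


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Add a binary gender column and labeled education/marriage columns."""
    out = df.copy()
    out["Gender_encoded"] = out["SEX"].map(SEX_TO_ENCODED)
    # Codes not in the documented mapping become "unknown".
    out["Education_label"] = out["EDUCATION"].map(EDUCATION_LABELS).fillna("unknown")
    out["Marital_status"] = out["MARRIAGE"].map(MARRIAGE_LABELS).fillna("unknown")
    return out


sample = pd.DataFrame(
    {"SEX": [1, 2, 2], "EDUCATION": [1, 2, 5], "MARRIAGE": [1, 2, 0]}
)
encoded = encode_categoricals(sample)
print(encoded)
```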

Feature Renaming & Dropping

The column default payment next month was renamed to default_payment for
simplicity.

The ID column, which holds no predictive value, was removed to reduce noise
and redundancy.
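
The rename and drop can be expressed in two pandas calls. The sketch below uses a hypothetical helper name and a two-row sample; errors="ignore" is a choice made here so the function can be re-run safely on a frame where ID is already gone.

```python
import pandas as pd


def rename_and_drop(df: pd.DataFrame) -> pd.DataFrame:
    """Shorten the target column name and remove the ID column."""
    out = df.rename(columns={"default payment next month": "default_payment"})
    # errors="ignore" keeps the call idempotent if ID was already removed.
    return out.drop(columns=["ID"], errors="ignore")


sample = pd.DataFrame(
    {"ID": [1, 2], "AGE": [24, 37], "default payment next month": [1, 0]}
)
tidy = rename_and_drop(sample)
print(list(tidy.columns))
```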

Outlier Detection (Optional)


Though not required in all cases, outlier detection based on Interquartile Range
(IQR) was considered for columns like LIMIT_BAL, BILL_AMT*, and
PAY_AMT* to ensure extreme values didn’t skew analysis or model training.
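
An IQR check flags values that fall more than k times the interquartile range below the first quartile or above the third. The sketch below uses the conventional k = 1.5, which is an assumption since the report does not state a multiplier, and it only flags values rather than removing them. The helper names and sample values are hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Boolean frame marking values outside the IQR fences, per column."""
    flags = pd.DataFrame(index=df.index)
    for col in columns:
        low, high = iqr_bounds(df[col], k)
        flags[col] = (df[col] < low) | (df[col] > high)
    return flags


sample = pd.DataFrame(
    {"LIMIT_BAL": [20000, 50000, 90000, 120000, 150000, 1000000]}
)
flags = flag_outliers(sample, ["LIMIT_BAL"])
print(flags)
```

Flagging first leaves the decision (clip, transform, or keep) to a later step, which matters for credit data where very high limits or bills may be genuine.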

3.3 Post-Processing
After preprocessing, the final dataset had:

• 30,000 rows and 24 columns

• No missing values in critical fields

• Encoded and renamed categorical features

• Ready-to-use format for clustering, correlation analysis, and classification models.

Additionally, pivot tables were generated to observe default rates based on demographic
segments (e.g., default rate by education and gender), providing early insights into risk
segments.
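
Because the target is coded 0/1, its mean within a group is that group's default rate, so a pivot table with aggfunc="mean" yields the summary described above. The sketch below assumes the cleaned column names default_payment and Gender_encoded from Section 3.2; the six-row sample is hypothetical.

```python
import pandas as pd


def default_rate_pivot(df: pd.DataFrame) -> pd.DataFrame:
    """Default rate by education level (rows) and encoded gender (columns)."""
    return pd.pivot_table(
        df,
        values="default_payment",
        index="EDUCATION",
        columns="Gender_encoded",
        aggfunc="mean",  # mean of a 0/1 flag is the share of defaulters
    )


sample = pd.DataFrame(
    {
        "EDUCATION": [1, 1, 2, 2, 2, 2],
        "Gender_encoded": [0, 1, 0, 0, 1, 1],
        "default_payment": [0, 1, 1, 0, 0, 0],
    }
)
rates = default_rate_pivot(sample)
print(rates)
```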
CHAPTER 4
RESULTS AND DISCUSSIONS

4.1 Cleaning Outcomes:

• Missing values handled in key fields


• Columns renamed for clarity
• ID dropped as non-essential
• Categorical fields encoded
• Summary pivot tables created for default rate by education and gender

Visualizations:
• Bar charts comparing missing data before and after cleaning.
• Line plots showing bill and payment amounts (after outlier treatment).
• Pie chart showing variable type distribution (numerical vs categorical).

4.2 Discussion
The preprocessing pipeline led to a well-structured, high-quality dataset. Encoding
and cleaning made it ML-ready. The dataset can now be used for:

• Predictive modelling (e.g., Logistic Regression, Decision Trees)


• Customer segmentation (e.g., K-Means Clustering)
• Risk analysis and credit scoring.
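
As an illustration of the first use case, the following sketch fits a logistic regression baseline. No model was trained in this project, so everything here is hypothetical: the data is synthetic, generated with a made-up rule in which a later repayment status raises the default probability, and only the column names follow the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the cleaned features (not real data).
X = pd.DataFrame(
    {
        "LIMIT_BAL": rng.integers(10_000, 500_000, size=n),
        "AGE": rng.integers(21, 70, size=n),
        "PAY_0": rng.integers(-1, 4, size=n),
    }
)
# Made-up target rule: higher PAY_0 means a higher chance of default.
prob = 1.0 / (1.0 + np.exp(-(X["PAY_0"].to_numpy() - 1.5)))
y = (rng.random(n) < prob).astype(int)

# Stratify so both splits keep the same share of defaulters.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling first helps the solver converge on features with large ranges.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"ROC-AUC on held-out synthetic data: {auc:.3f}")
```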
CHAPTER 5

CONCLUSION AND FUTURE ENHANCEMENT

Conclusion
This project successfully cleaned and pre-processed the credit card dataset,
making it suitable for downstream machine learning applications. Data quality was
significantly improved, with missing data resolved and categorical variables encoded.
Future Enhancements
• Use SMOTE to handle class imbalance (if defaults are rare)
• Try advanced imputers (KNN, Iterative)
• Apply dimensionality reduction (PCA)
• Train models like XGBoost and evaluate using ROC-AUC
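
Two of these enhancements, KNN imputation and PCA, can be chained in a scikit-learn pipeline. The sketch below runs on random synthetic data with values removed at random; the neighbour count and number of components are arbitrary choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic numeric matrix with roughly 10% of entries removed at random.
X = rng.normal(size=(50, 6))
X[rng.random(X.shape) < 0.1] = np.nan

pipeline = make_pipeline(
    KNNImputer(n_neighbors=5),  # fill each gap from the 5 most similar rows
    StandardScaler(),           # PCA is sensitive to feature scale
    PCA(n_components=3),        # keep the 3 strongest directions of variance
)
reduced = pipeline.fit_transform(X)
print(reduced.shape)
```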

REFERENCES

• UCI Machine Learning Repository


• Han, J., Kamber, M., & Pei, J. Data Mining: Concepts and Techniques
• Kaggle credit risk projects and tutorials
APPENDIX-1
APPENDIX-2
