
LendingClub Loan Default Prediction Model

LendingClub is the world's largest peer-to-peer lending platform, headquartered in San Francisco. It was the first peer-to-peer lender to register its offerings as securities with the SEC. The author built several models using random forest, decision trees, and neural networks to predict whether borrowers will pay back loans using LendingClub data. Metrics like accuracy, precision, recall, and F1 scores were used to evaluate and select the best performing model.

Uploaded by

Sai Pawan


LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

My Goal
Given historical data on loans, including whether or not the borrower defaulted (charge-off), I have built a model that predicts whether or not a borrower will pay back their loan. This way, when the company gets a new potential customer in the future, we can assess whether they are likely to pay back the loan.
Language/Analytics Tools Used
1. Python – Jupyter Notebook
Modules used
2. Pandas
3. NumPy
4. Matplotlib
5. Seaborn
6. Scikit-learn
7. Keras
Model Used
1. Random Forest
2. Decision Tree
3. Neural Network
Data Set Overview: copy from
[Link]
0model%20using%20Decision%20Tress%20and%20Random%[Link]
Present it in the form of a table:
•27 columns
•One column stating whether each field is categorical, numerical, or ordinal
•2 files: the first for the data, the second for the field descriptions

Data set: subset of all LendingClub loan data

[Link]

Number of rows and columns: 303704 and 26

EXPLORATORY DATA ANALYSIS

OVERALL GOAL: Get an understanding for which variables are important, view summary statistics, and
visualize the data

As we can see, this is really an imbalanced problem. We have a lot more entries for people who fully paid off their loans than for those who did not pay back.

Ratio: XX:YY
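
The class ratio mentioned above can be read off with pandas. The following is a minimal sketch on a hypothetical toy frame (the 8:2 split is invented for illustration and is not the ratio of the real data); only the `loan_status` column name and its two values come from the slides.

```python
import pandas as pd

# Hypothetical toy data; the real loan_status column comes from the LendingClub file.
df = pd.DataFrame({
    "loan_status": ["Fully Paid"] * 8 + ["Charged Off"] * 2
})

# Absolute counts and relative shares per class
counts = df["loan_status"].value_counts()
shares = df["loan_status"].value_counts(normalize=True)

# How many fully paid loans there are per charged-off loan
ratio = counts["Fully Paid"] / counts["Charged Off"]

print(counts)
print(shares.round(2))
print(f"Fully Paid : Charged Off = {ratio:.1f} : 1")
```

Running the same three lines on the real frame would give the values for the XX:YY placeholder.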

The peaks at (10,000, 15,000, 20,000, etc.) indicate standard loan amounts.

Checking the correlation between the continuous feature variables

We can see that 'loan_amnt' has almost perfect correlation with the 'installment' feature. Let's explore this feature further.
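
A correlation check of this kind might look like the sketch below. The data is synthetic: `installment` is generated as roughly proportional to `loan_amnt` purely to mimic the near-perfect correlation described here, and `annual_inc` is a hypothetical extra column added for contrast.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
loan_amnt = rng.uniform(1_000, 40_000, size=200)

# Synthetic stand-in for the continuous features (assumption, not the real data)
df = pd.DataFrame({
    "loan_amnt": loan_amnt,
    # installment roughly proportional to the amount, plus a little noise
    "installment": loan_amnt * 0.03 + rng.normal(0, 20, size=200),
    # hypothetical unrelated feature
    "annual_inc": rng.uniform(20_000, 150_000, size=200),
})

# Pairwise Pearson correlation between the numeric columns only
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting matrix can be passed to a heatmap function such as seaborn's `heatmap` to get a visual version.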


Boxplot showing the relationship between the loan_status and the loan amount.

The loan status does not depend much on the loan amount, although 'Charged Off' loans have a somewhat higher loan amount, which makes intuitive sense. We can also see this in the summary statistics for the loan amount, grouped by the loan_status.
# Summary statistics for the loan amount, grouped by the loan_status.
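
The code for this step is not shown on the slide. One way to get such grouped summary statistics, sketched on a small invented frame, is:

```python
import pandas as pd

# Invented rows for illustration; column names follow the slides
df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off", "Charged Off"],
    "loan_amnt": [5000, 10000, 12000, 15000, 20000],
})

# count, mean, std, min, quartiles and max of loan_amnt for each loan_status
summary = df.groupby("loan_status")["loan_amnt"].describe()
print(summary)
```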

Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans.

# Let's display a count plot per subgrade

To get the correlation between the numeric features and loan_status, let's first create a new column 'loan_repaid', which contains 1 if the status is 'Fully Paid' and 0 if it's 'Charged Off'.

# Now let's create a bar plot showing these correlations
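
A sketch of these two steps, assuming a toy frame in which `int_rate` is a hypothetical numeric feature and all values are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid",
                    "Fully Paid", "Charged Off", "Fully Paid"],
    "int_rate": [7.5, 18.0, 9.0, 11.0, 21.5, 8.0],        # hypothetical feature
    "loan_amnt": [8000, 20000, 10000, 12000, 15000, 5000],
})

# 1 for 'Fully Paid', 0 for 'Charged Off'
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# Correlation of every numeric feature with the target, target itself removed
corr_with_target = (
    df.select_dtypes(include="number")
      .corr()["loan_repaid"]
      .drop("loan_repaid")
      .sort_values()
)
print(corr_with_target)

# corr_with_target.plot(kind="bar") would draw the bar plot (needs matplotlib)
```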

Step 2 – Cleaning the data
Make a table for each column evaluated, the final action (dropped/transformed), and the reason.
1. Missing data analysis
Checking the Title and emp_length columns

Charge-off rates are extremely similar across all employment lengths. Let's drop the emp_length column.
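
The comparison behind this decision can be sketched as a charge-off rate per employment length. The rows below are invented, and the value format of `emp_length` ("1 year", "10+ years") is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "emp_length": ["1 year"] * 5 + ["10+ years"] * 5,
    "loan_status": ["Charged Off", "Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid"] * 2,
})

# Boolean flag per loan, then the mean of that flag within each emp_length group
charged_off = df["loan_status"].eq("Charged Off")
charge_off_rate = charged_off.groupby(df["emp_length"]).mean()
print(charge_off_rate)
```

If the rates come out nearly identical across groups, as they do in this toy example by construction, the column carries little signal for the target.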

138448 unique values, which are subgrouped in many categories -> removing.

Looks like the title column is simply a string subcategory/description of the purpose column. So let's drop the title column.
mort_acc – 10% of values missing – strategy?
Let's see the correlation of the mort_acc column with the other features.

Looks like the total_acc feature correlates with mort_acc, and this makes sense! So I'll use this fillna() approach. Let's group the dataframe by total_acc and calculate the mean value of mort_acc per total_acc entry.

Let's fill in the missing mort_acc values based on their total_acc value. If the mort_acc is missing, then we fill in that missing value with the mean value corresponding to its total_acc value from the Series above.
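
The group-mean imputation described above could be sketched as follows. The frame is a toy example, and `total_acc_avg` is a name chosen here for the Series of per-group means.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20, 20],
    "mort_acc": [0.0, 2.0, np.nan, 3.0, np.nan, 5.0],
})

# Mean mort_acc for each total_acc value (NaN entries are skipped by mean)
total_acc_avg = df.groupby("total_acc")["mort_acc"].mean()

# Replace each missing mort_acc with the mean of its total_acc group
df["mort_acc"] = df["mort_acc"].fillna(df["total_acc"].map(total_acc_avg))
print(df)
```

Rows whose `total_acc` group has no observed `mort_acc` at all would stay missing with this approach and need a separate fallback.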

revol_util and pub_rec_bankruptcies have missing data points, but they account for less than 0.5% of the total data. Let's remove the rows that are missing values in those columns.
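
A minimal sketch of this row removal, on invented values (the missing share in this toy frame is far higher than the 0.5% quoted for the real data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "revol_util": [41.8, np.nan, 53.3, 21.5],
    "pub_rec_bankruptcies": [0.0, 0.0, np.nan, 1.0],
    "loan_amnt": [10000, 8000, 15600, 7200],
})

# Share of missing values per column, to confirm the loss is acceptable
missing_share = df[["revol_util", "pub_rec_bankruptcies"]].isna().mean()
print(missing_share)

# Drop only the rows that miss a value in one of these two columns
df_clean = df.dropna(subset=["revol_util", "pub_rec_bankruptcies"])
print(len(df), "->", len(df_clean))
```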
# List of all the columns that are currently non-numeric.
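
The listing itself is not shown on the slide. One way to obtain it, sketched on a tiny hypothetical frame, is to exclude the numeric dtypes:

```python
import pandas as pd

df = pd.DataFrame({
    "loan_amnt": [10000.0, 8000.0],
    "term": [" 36 months", " 60 months"],
    "sub_grade": ["B4", "C1"],
})

# Every column whose dtype is not numeric (strings, categories, dates, ...)
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```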

Column Name | Description (Short) | Features | Operation done | Reason | Example (Initial and Final)
term | The number of payments on the loan. Values are in months and can be either 36 or 60. | 36 months -> 230928; 60 months -> 72147 | Convert the term feature into a 36 or 60 integer numeric data type. | |
'grade' | | | Drop the grade feature. | We already know grade is part of sub_grade. |

Remaining non-numeric columns: 'sub_grade', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'earliest_cr_line', 'initial_list_status', 'application_type', 'address'
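
The two operations in the table could be sketched as below. The raw format of `term` (text such as " 36 months") is an assumption about the data, and the rows are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "grade": ["B", "C", "A"],
    "sub_grade": ["B4", "C1", "A2"],
})

# Pull the digits out of the text and store them as integers (36 or 60)
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# grade is already contained in sub_grade, so the column is redundant
df = df.drop(columns="grade")
print(df)
```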
Decision Tree Classifier

As we saw earlier, the problem is that this dataset is highly skewed, with a lot more class 1 data points than class 0. With this in mind, the accuracy of this model is actually not too bad (83%). But, as I expected, this model misclassifies a lot of class 0 points (loan_status: Charged Off), with the f1-score for class 0 being 0.58. Let's see how the random forest model performs.
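
For reference, a decision tree evaluation of this kind might be set up as in the sketch below. It uses synthetic data from scikit-learn's `make_classification` with an assumed 20/80 class split, not the LendingClub data, so its scores will differ from the 83% and 0.58 reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem: about 20% class 0, 80% class 1 (assumption)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.2, 0.8], random_state=42
)

# Stratify so both splits keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

# Per-class precision, recall and f1; the minority class is the one to watch
print(classification_report(y_test, pred))
f1_per_class = f1_score(y_test, pred, average=None)
```

Swapping `DecisionTreeClassifier` for `RandomForestClassifier` from `sklearn.ensemble` keeps the rest of the sketch unchanged.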
Random Forest
Next we try k-nearest neighbors. We need to reduce the number of variables to 10 or fewer (reference) for kNN to
perform well. We'll use LDA for dimension reduction. The number of component variables to keep is a
hyperparameter.
KERAS ANN – Artificial Neural Network

Link: [Link] or-loan-investments-keras-ann
From code line: 40

Dropout Layers
•Dropout is a technique where randomly selected neurons are ignored during training. They are "dropped out" randomly.
•Simply put, dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase.
•Helps prevent overfitting.
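
To make the idea concrete, here is a minimal NumPy sketch of "inverted" dropout, in which the kept units are rescaled by 1 / (1 - rate) so that the expected activation stays the same. The `dropout` helper is a hypothetical illustration, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random share of units and rescale the rest."""
    keep_prob = 1.0 - rate
    # True for units that are kept, False for units that are dropped
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1, 1000))

dropped = dropout(activations, rate=0.2, rng=rng)
kept_fraction = np.count_nonzero(dropped) / dropped.size
print(f"kept {kept_fraction:.0%} of the units")
```

With a rate of 0.2, roughly 80% of the units survive each training step, and a different random subset is dropped every time. At prediction time no units are dropped.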
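# Code - - for adding additional details
# Imports are assumed; the slide lists Keras but does not show them
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()

# input layer
model.add(Dense(79, activation='relu'))  # 79 features for training
model.add(Dropout(0.2))  # ignore 20% of the units at random during training

# hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))

# output layer
model.add(Dense(1, activation='sigmoid'))  # single probability for the binary target

# compile model
model.compile(optimizer="adam", loss='binary_crossentropy')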
Training loss per epoch
•This plot shows the training loss per epoch.
•This plot helps us see whether the model is overfitting. In this case there is no sign of overfitting because both lines go down at the same time.

Classification Report
•Accuracy is just the percentage of predictions that we got right; in this case it was 89%.
•Note that since the data is imbalanced, a model that approves all the loans would already have about 80% accuracy.
•For example: (317696/len(df)) = 0.80, where 317696 is the count of repaid loans from df['loan_repaid'].value_counts()
•So do not be fooled by a model that returns 80% accuracy, because a model that always reports "repaid" will by default be 80% accurate on this test data set.
•Recall means "how many elements of this class you find, out of the total number of elements of this class".
•Precision means "how many of the elements predicted as this class actually belong to it".
•The f1-score is the harmonic mean of precision and recall.
•The support is the number of occurrences of the given class in your dataset.
•Precision is 0.96, which is really good. On the other hand, recall is not good.
•We should focus on improving the f1-score for class 0, which is currently 0.62.
Confusion Matrix
•A confusion matrix is a technique for summarizing the performance of a classification algorithm.
•Classification accuracy alone can be misleading if you have an unequal number of observations in each class.
•We have 308 Type I errors (False Positive) and 8562 Type II errors (False Negative).
•7096 True Positive and 63078 True Negative.
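
The four counts above are enough to recompute the figures in the classification report. The sketch below assumes that the "positive" class in these counts is class 0 (Charged Off), which is the reading under which the numbers agree with the reported precision and f1-score.

```python
# Counts from the confusion matrix above; "positive" is assumed to be class 0
tp, fn = 7096, 8562    # charged-off loans that were found / missed
fp, tn = 308, 63078    # repaid loans flagged wrongly / classified correctly

total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of the loans flagged as charged off, how many were
recall = tp / (tp + fn)       # of the charged-off loans, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts "repaid"
baseline = (tn + fp) / total

print(f"accuracy  {accuracy:.2f}")
print(f"precision {precision:.2f}")
print(f"recall    {recall:.2f}")
print(f"f1        {f1:.2f}")
print(f"baseline  {baseline:.2f}")
```

This gives roughly 0.89 accuracy, 0.96 precision and 0.62 f1, matching the report, with recall around 0.45. The always-"repaid" baseline comes out at about 0.80, which is why the 89% accuracy is a smaller gain than it first appears.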
RUN Screenshots
