
LendingClub Loan Default Prediction Model

LendingClub is the world's largest peer-to-peer lending platform, headquartered in San Francisco. It was the first peer-to-peer lender to register its offerings as securities with the SEC. The author built several models using random forest, decision trees, and neural networks to predict whether borrowers will pay back loans using LendingClub data. Metrics like accuracy, precision, recall, and F1 scores were used to evaluate and select the best performing model.

Uploaded by

Sai Pawan

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

My Goal
Given historical data on loans, including whether or not the borrower defaulted (charge-off), I have built
a model that predicts whether or not a borrower will pay back their loan. This way, when the company gets a
new potential customer in the future, we can assess whether or not they are likely to pay back the loan.
Language/Analytics Tools Used
1. Python – Jupyter Notebook
Modules Used
1. Pandas
2. NumPy
3. Matplotlib
4. Seaborn
5. Scikit-learn
6. Keras
Models Used
1. Random Forest
2. Decision Tree
3. Neural Network
Data Set Overview: copy from
[Link]
0model%20using%20Decision%20Tress%20and%20Random%[Link]
In the form of a table
27 columns
Mention whether each column is categorical/numerical/ordinal in one column
2 files – the first for the data and the second for the field descriptions

Data set: Subset of all LendingClub loan data
[Link]
Number of rows and columns: 303704 and 26


EXPLORATORY DATA ANALYSIS

OVERALL GOAL: Get an understanding of which variables are important, view summary statistics, and
visualize the data

As we can see, this is a highly imbalanced problem. We have far more entries for people who fully paid off their loans than for those who did not pay back.

Ratio: XX:YY
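
A minimal sketch of how the class balance could be checked with pandas. The tiny DataFrame is made-up illustration data, not the LendingClub file; only the loan_status column name and its two values come from the slides.

```python
import pandas as pd

# Made-up illustration data (assumption): 8 fully paid loans, 2 charged off.
df = pd.DataFrame({"loan_status": ["Fully Paid"] * 8 + ["Charged Off"] * 2})

# Absolute number of loans per class.
counts = df["loan_status"].value_counts()

# Share of each class; a large gap between the shares signals imbalance.
shares = df["loan_status"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the two shares would give the ratio directly.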

The peaks at 10,000, 15,000, 20,000, etc. indicate standard loan amounts.

Checking the correlation between the continuous feature variables

We can see that 'loan_amnt' has an almost perfect correlation with the 'installment' feature. Let's explore this feature further.
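
One plausible explanation, assuming the installment is a standard amortized monthly payment, is that it is computed directly from the loan amount, the interest rate, and the term. The sketch below builds synthetic loans under that assumption (all values are random, none come from the LendingClub file) and checks the correlation matrix with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic loans (assumption): random amounts, annual rates, and terms.
loan_amnt = rng.uniform(1_000, 40_000, n)
int_rate = rng.uniform(6, 25, n)        # annual interest rate in percent
term = rng.choice([36, 60], size=n)     # loan term in months

# Standard amortization formula for a fixed monthly payment.
monthly_rate = int_rate / 100 / 12
installment = loan_amnt * monthly_rate / (1 - (1 + monthly_rate) ** -term)

df = pd.DataFrame(
    {"loan_amnt": loan_amnt, "int_rate": int_rate, "installment": installment}
)

# Pairwise Pearson correlation between the continuous features.
corr = df.corr()
print(corr.round(2))
```

If two features are this closely tied, keeping both adds little information for a model.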


Boxplot showing the relationship between the loan_status and the loan amount.

The loan status does not depend strongly on the loan amount, although 'Charged Off' loans have a relatively higher loan amount, which intuitively makes sense. We can also see this in the summary statistics for the loan amount, grouped by the loan_status.
# Summary statistics for the loan amount, grouped by the loan_status.
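
A minimal sketch of how such grouped summary statistics could be produced with pandas. The five rows are made-up illustration data; only the column names loan_status and loan_amnt come from the slides.

```python
import pandas as pd

# Made-up illustration data (assumption), not the LendingClub file.
df = pd.DataFrame(
    {
        "loan_status": ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off", "Charged Off"],
        "loan_amnt": [5000, 10000, 15000, 12000, 20000],
    }
)

# count, mean, std, min, quartiles, and max of loan_amnt for each loan_status.
summary = df.groupby("loan_status")["loan_amnt"].describe()
print(summary)
```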

Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans.

# Let's display a count plot per subgrade


To get the correlation between the numeric features and loan_status, let's first create a new column 'loan_repaid', which
contains 1 if the status is 'Fully Paid' and 0 if it is 'Charged Off'.

# Now let's create a bar plot showing these correlations
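
A minimal sketch of the 'loan_repaid' step, assuming a pandas DataFrame with a loan_status column. The rows and the int_rate and annual_inc values are made up for illustration.

```python
import pandas as pd

# Made-up illustration data (assumption), not the LendingClub file.
df = pd.DataFrame(
    {
        "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid", "Charged Off", "Fully Paid"],
        "int_rate": [7.5, 21.0, 9.9, 11.2, 18.4, 13.0],
        "annual_inc": [90000, 40000, 75000, 60000, 65000, 52000],
    }
)

# 1 if the loan was fully paid, 0 if it was charged off.
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# Correlation of every numeric feature with loan_repaid, sorted for plotting.
correlations = (
    df.select_dtypes("number").corr()["loan_repaid"].drop("loan_repaid").sort_values()
)
print(correlations)

# correlations.plot(kind="bar") would draw the bar plot (requires matplotlib).
```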


Step 2 – Cleaning the data
1. Missing data analysis
Make a table for each column evaluated, the final action (dropped/transformed), and the reason.
Checking the title and emp_length columns

Charge-off rates are extremely similar across all employment lengths. Let's drop the emp_length column.

138448 unique values, which are subgrouped into many categories -> removing.

It looks like the title column is simply a string subcategory/description of the purpose column. So let's drop the title column.
mort_acc – 10% of values missing – strategy?
Let's see the correlation of the mort_acc column with the other features.

It looks like the total_acc feature correlates with mort_acc, and this makes sense! So I'll use this fillna() approach: let's group the dataframe by total_acc and calculate the mean value of mort_acc per total_acc entry.

Let's fill in the missing mort_acc values based on the total_acc value. If mort_acc is missing, we fill in that missing value with the mean value corresponding to its total_acc value from the Series above.
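
A minimal sketch of this group-mean imputation. The six rows are made up for illustration, and the vectorized map-then-fillna form is one way to write it, not necessarily how the original notebook does it.

```python
import numpy as np
import pandas as pd

# Made-up illustration data (assumption), not the LendingClub file.
df = pd.DataFrame(
    {
        "total_acc": [10, 10, 10, 20, 20, 20],
        "mort_acc": [0.0, 2.0, np.nan, 3.0, np.nan, 5.0],
    }
)

# Mean mort_acc for each total_acc value (missing values are skipped).
total_acc_avg = df.groupby("total_acc")["mort_acc"].mean()

# Replace each missing mort_acc with the mean for that row's total_acc.
df["mort_acc"] = df["mort_acc"].fillna(df["total_acc"].map(total_acc_avg))
print(df)
```

Rows that already have a mort_acc value are left unchanged; only the gaps are filled.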

revol_util and pub_rec_bankruptcies have missing data points, but they account for less than 0.5% of the total data. Let's remove the rows that are missing values in those columns.
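
A minimal sketch of dropping rows with missing values in just those two columns, using pandas. The four rows are made up for illustration, so the missing share here is far larger than in the real data.

```python
import numpy as np
import pandas as pd

# Made-up illustration data (assumption), not the LendingClub file.
df = pd.DataFrame(
    {
        "loan_amnt": [5000, 10000, 15000, 20000],
        "revol_util": [50.1, np.nan, 23.0, 75.5],
        "pub_rec_bankruptcies": [0.0, 0.0, np.nan, 1.0],
    }
)

cols = ["revol_util", "pub_rec_bankruptcies"]

# Fraction of missing values per column, to confirm the loss is acceptable.
missing_share = df[cols].isnull().mean()
print(missing_share)

# Drop only the rows that are missing a value in one of these columns.
df_clean = df.dropna(subset=cols)
print(len(df), "->", len(df_clean))
```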
# List of all the columns that are currently non-numeric.

Column Name | Description (Short) | Features | Operation done | Reason | Example (Initial and Final)
term | The number of payments on the loan. Values are in months and can be either 36 or 60. | 36 months: 230928; 60 months: 72147 | Convert the term feature into a 36 or 60 integer numeric data type. | |
grade | | | Drop the grade feature. | We already know grade is part of sub_grade. |

Remaining non-numeric columns: 'sub_grade', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'earliest_cr_line', 'initial_list_status', 'application_type', 'address'
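
A minimal sketch of the term and grade operations. The three rows are made up, and the raw term strings are assumed to look like " 36 months". The one-hot encoding of sub_grade at the end is an assumption added for illustration; the slides do not state how that column was handled.

```python
import pandas as pd

# Made-up illustration data (assumption); raw term format is assumed.
df = pd.DataFrame(
    {
        "term": [" 36 months", " 60 months", " 36 months"],
        "grade": ["A", "C", "B"],
        "sub_grade": ["A1", "C3", "B2"],
    }
)

# Keep only the digits of the term and store them as integers (36 or 60).
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# grade is already contained in sub_grade, so drop it.
df = df.drop(columns="grade")

# Assumption: one-hot encode sub_grade so that every column is numeric.
df = pd.get_dummies(df, columns=["sub_grade"], drop_first=True)
print(df)
```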
Decision Tree Classifier

As we saw earlier, this dataset is highly skewed, with far more class 1 data points than class 0. With this in mind, the accuracy of this model (83%) is actually not too bad. But, as I expected, this model misclassifies a lot of class 0 points (loan_status: Charged Off), with an f1-score of 0.58 for class 0. Let's see how the random forest model performs.
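
A minimal sketch of fitting and evaluating a decision tree with scikit-learn. It runs on synthetic, imbalanced data from make_classification (roughly 20% class 0), so the scores it prints are not the ones reported above; the split ratio and random seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): about 20% class 0, 80% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.2, 0.8], random_state=42
)

# Stratify so that train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Per-class precision, recall, and f1-score, plus overall accuracy.
report = classification_report(y_test, y_pred, output_dict=True)
print(classification_report(y_test, y_pred))
```

Looking at the class 0 row of the report, not only the accuracy, is what exposes the weakness on the minority class.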
Random Forest
Next we try k-nearest neighbors. We need to reduce the number of variables to 10 or fewer (reference) for kNN to
perform well. We'll use LDA for dimension reduction. The number of component variables to keep is a
hyperparameter.
KERAS ANN – Artificial Neural Network

Link: [Link] or-loan-investments-keras-ann (from code line: 40)

Dropout Layers
•Dropout is a technique where randomly selected neurons are ignored during training. They are "dropped out" randomly.
•Simply put, dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase.
•Helps prevent overfitting.
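
A minimal NumPy sketch of what a dropout layer does during training. The function name and the toy activations are hypothetical; it uses the common "inverted dropout" form, in which the surviving units are scaled up by 1 / (1 - rate) so that the expected activation stays the same.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the rest."""
    keep_prob = 1.0 - rate
    # True for units that are kept, False for units that are dropped.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((4, 5))  # toy layer output (assumption)

dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

At prediction time no units are dropped, so the layer passes its input through unchanged.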
# Code - - for adding additional details
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()

# input layer
model.add(Dense(79, activation='relu'))  # 79 features for training
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))

# output layer
model.add(Dense(1, activation='sigmoid'))

# compile model
model.compile(optimizer="adam", loss='binary_crossentropy')
Training loss per epoch
•This plot shows the training loss per epoch.
•This plot helps us see whether the model is overfitting. In this case there is no overfitting, because both lines go down at the same time.

Classification Report
•Accuracy is just the percentage of predictions that we got right; in this case it was 89%.
•Note that since the data is imbalanced, a model that approves all the loans would already have 80% accuracy.
•For example: (317696/len(df)) = 0.80, where 317696 is the number of repaid loans from df['loan_repaid'].value_counts().
•So do not be fooled by a model that returns 80% accuracy, because a model that always reports "repaid" will by default be 80% accurate on this test data set.
•Recall means "how many elements of this class did you find, out of all the elements of this class".
•Precision means "how many of the elements predicted as this class actually belong to it".
•The f1-score is the harmonic mean of precision and recall.
•The support is the number of occurrences of the given class in your dataset.
•Precision for class 0 is 0.96, which is really good. On the other hand, recall is not good.
•We should focus on improving the f1-score for class 0, which is currently 0.62.
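
A minimal sketch of the "always repaid" baseline described above, in plain Python. The labels are made up (80 repaid, 20 charged off) to mirror the roughly 80% share of repaid loans.

```python
# Made-up labels (assumption): 1 = repaid, 0 = charged off.
y_true = [1] * 80 + [0] * 20

# A "model" that approves every loan, i.e. always predicts repaid.
y_pred = [1] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class 0: share of charged-off loans the model actually caught.
actual_zero = sum(t == 0 for t in y_true)
caught_zero = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
recall_class_0 = caught_zero / actual_zero

print(accuracy, recall_class_0)
```

The baseline reaches high accuracy while never identifying a single charged-off loan, which is why accuracy alone is not enough here.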
Confusion Matrix
•A confusion matrix is a technique for summarizing the performance of a classification algorithm.
•Classification accuracy alone can be misleading if you have an unequal number of observations in each class.
•We have 308 Type I errors (False Positive) and 8562 Type II errors (False Negative).
•7096 True Positive and 63078 True Negative.
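
As a worked check, the four counts above can be turned back into the reported metrics. The sketch treats class 0 (Charged Off) as the positive class, which is an assumption, but it is the reading under which the counts reproduce the 89% accuracy, 0.96 precision, and 0.62 f1-score quoted earlier.

```python
# Counts from the confusion matrix, with Charged Off as the positive class
# (assumption, see the lead-in).
tp, tn, fp, fn = 7096, 63078, 308, 8562

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the loans flagged as charged off, how many were
recall = tp / (tp + fn)      # of the charged-off loans, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# prints: 0.89 0.96 0.45 0.62
```

The low recall of 0.45 is what pulls the f1-score for this class down, despite the high precision.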
RUN Screenshots
